flatten a level one list


Michael Spencer

Peter said:
If you require len(xdata) == len(ydata), there's an easy way to move the loop
into C:

def flatten7():
    n = len(xdata)
    assert len(ydata) == n
    result = [None] * (2*n)
    result[::2] = xdata
    result[1::2] = ydata
    return result

$ python -m timeit 'from flatten import flatten6 as f' 'f()'
1000 loops, best of 3: 847 usec per loop
$ python -m timeit 'from flatten import flatten7 as f' 'f()'
10000 loops, best of 3: 43.9 usec per loop
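The whole trick is Python's extended-slice assignment, which fills every
second slot of the result in one C-level operation. On a toy input:

>>> result = [None] * 6
>>> result[::2] = [1, 3, 5]    # slots 0, 2, 4
>>> result[1::2] = [2, 4, 6]   # slots 1, 3, 5
>>> result
[1, 2, 3, 4, 5, 6]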

Peter
Very nice observation, Peter. Easily the "winner". It reminds me of
str.translate beating python loops in filtering applications:
http://groups.google.com/group/comp.lang.python/msg/e23cdc374144a4bd


What's more, you can generalize your approach to any number of sequences and
unequal lengths, with only a modest loss of speed:

def interleave(*args, **kw):
    """Peter Otten flatten7 (generalized by Michael Spencer)
    Interleave any number of sequences, padding shorter sequences if kw pad
    is supplied"""
    dopad = "pad" in kw
    pad = dopad and kw["pad"]
    count = len(args)
    lengths = map(len, args)
    maxlen = max(lengths)
    result = maxlen*count*[None]
    for ix, input in enumerate(args):
        try:
            result[ix::count] = input
        except ValueError:
            if dopad:
                result[ix::count] = input + [pad]*(maxlen-lengths[ix])
            else:
                raise
    return result


def test_interleave():
    assert interleave([1,3,5], [2,4,6]) == [1,2,3,4,5,6]
    assert interleave([1,2,3]) == [1,2,3]
    assert interleave(*[[1,2,3]]*10) == [1]*10+[2]*10+[3]*10
    assert interleave(range(0,1000,2),range(1,1000,2)) == range(1000)
    def raises(func,*args, **kw):
        exc = kw.get("exc", Exception)
        try:
            func(*args)
        except exc:
            return True
    assert raises(interleave,[1,2],[3,4,5])
    assert interleave([1,3],[2,4,6], pad = None) == [1,2,3,4,None,6]


def timethem():
    for n in [1000, 10000, 100000, 1000000]:
        x = range(n); y = range(n)
        print "List length: ", n
        for func in [flatten7, interleave]:
            print shell.timefunc(func, x, y)

List length: 1000
flatten7(...) 6442 iterations, 77.62usec per call
interleave(...) 5475 iterations, 91.33usec per call
List length: 10000
flatten7(...) 318 iterations, 1.57msec per call
interleave(...) 315 iterations, 1.59msec per call
List length: 100000
flatten7(...) 25 iterations, 20.61msec per call
interleave(...) 24 iterations, 21.06msec per call
List length: 1000000
flatten7(...) 3 iterations, 198.51msec per call
interleave(...) 3 iterations, 198.91msec per call

Michael
 

bonono

David said:
Robin said:
# New attempts:
from itertools import imap, izip
def flatten4(x, y):
    '''D Murman'''
    l = []
    list(imap(l.extend, izip(x, y)))
    return l


from Tkinter import _flatten
def flatten5(x, y):
    '''D Murman'''
    return list(_flatten(zip(x, y)))
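Two details worth noting when reading these: imap is lazy in Python 2, so
the list() call in flatten4 is what actually forces the l.extend calls to
run, and Tkinter._flatten returns a tuple, hence the list() in flatten5.
A quick sketch of the flatten4 mechanism:

>>> from itertools import imap, izip
>>> l = []
>>> list(imap(l.extend, izip([1, 3], [2, 4])))  # extend returns None
[None, None]
>>> l
[1, 2, 3, 4]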

well, i would really like to take credit for these, but they're
not mine ;) (credit goes to Michael Spencer). i especially like
flatten4, even if it's not as fast as the phenomenally faster
flatten7.
Me too. And expanding a bit on flatten4, I got this interesting result.

bonono@moresing:~/bonobo/psp$ python ~/lib/python2.4/timeit.py -s
"import itertools; a=zip(xrange(1000),xrange(1000))" "l=len(a);
li=[None]*l*2;li[::2]=range(1000); li[1::2]=range(1000)"
1000 loops, best of 3: 318 usec per loop

bonono@moresing:~/bonobo/psp$ python ~/lib/python2.4/timeit.py -s
"import itertools,psyco; a=zip(xrange(1000),xrange(1000));li=[]"
"filter(li.extend,a)"
1000 loops, best of 3: 474 usec per loop

Still 50% slower, but it has the advantage that it works on all kinds of
sequences, as they can be efficiently izip()ed together.
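For instance, the pair source can even be an iterator with no len(), which
the slice-assignment version cannot size in advance; a small illustration:

>>> from itertools import imap, izip
>>> l = []
>>> list(imap(l.extend, izip((x*x for x in xrange(3)), "abc")))
[None, None, None]
>>> l
[0, 'a', 1, 'b', 4, 'c']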
 

Alex Martelli

Nick Craig-Wood said:
Sion Arrowsmith said:
sum(...)
sum(sequence, start=0) -> value

If you're using sum() as a 1-level flatten you need to give it
start=[].

Except if you are trying to sum arrays of strings...
sum(["a","b","c"], "")
Traceback (most recent call last):

I've no idea why this limitation is here... perhaps it is because
pre-Python 2.4 calling += on strings was very slow?

No: when I implemented sum, I originally special-cased sum on strings to
map down to a ''.join -- Guido decided it was confusing and had no
advantage wrt calling ''.join directly, so he made me put in that check.
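The direct spelling that won out, for reference:

>>> ''.join(["a", "b", "c"])
'abc'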


Alex
 

Peter Otten

David Murmann wrote:
# New attempts:
from itertools import imap, izip
def flatten4(x, y):
    '''D Murman'''
    l = []
    list(imap(l.extend, izip(x, y)))
    return l
well, i would really like to take credit for these, but they're
not mine ;) (credit goes to Michael Spencer). i especially like
flatten4, even if it's not as fast as the phenomenally faster
flatten7.
Me too. And expanding a bit on flatten4, I got this interesting result.

bonono@moresing:~/bonobo/psp$ python ~/lib/python2.4/timeit.py -s
"import itertools; a=zip(xrange(1000),xrange(1000))" "l=len(a);
li=[None]*l*2;li[::2]=range(1000); li[1::2]=range(1000)"
1000 loops, best of 3: 318 usec per loop

bonono@moresing:~/bonobo/psp$ python ~/lib/python2.4/timeit.py -s
"import itertools,psyco; a=zip(xrange(1000),xrange(1000));li=[]"
"filter(li.extend,a)"
1000 loops, best of 3: 474 usec per loop

For a fair comparison you'd have to time the zip operation.
Still 50% slower, but it has the advantage that it works on all kinds of
sequences, as they can be efficiently izip()ed together.

Creating a list via list/map/filter just for the side effect is not only bad
taste,

~ $ python -m timeit -s'a = zip([range(1000)]*2)' 'lst=[];ext=lst.extend'
'for i in a: ext(i)'
1000000 loops, best of 3: 1.23 usec per loop
~ $ python -m timeit -s'a = zip([range(1000)]*2)' 'lst=[];filter(lst.extend,
a)'
1000000 loops, best of 3: 1.63 usec per loop

it is also slower than an explicit loop. Don't do it.

Peter
 

bonono

Peter said:
David Murmann wrote:
# New attempts:
from itertools import imap, izip
def flatten4(x, y):
    '''D Murman'''
    l = []
    list(imap(l.extend, izip(x, y)))
    return l
well, i would really like to take credit for these, but they're
not mine ;) (credit goes to Michael Spencer). i especially like
flatten4, even if it's not as fast as the phenomenally faster
flatten7.
Me too. And expanding a bit on flatten4, I got this interesting result.

bonono@moresing:~/bonobo/psp$ python ~/lib/python2.4/timeit.py -s
"import itertools; a=zip(xrange(1000),xrange(1000))" "l=len(a);
li=[None]*l*2;li[::2]=range(1000); li[1::2]=range(1000)"
1000 loops, best of 3: 318 usec per loop

bonono@moresing:~/bonobo/psp$ python ~/lib/python2.4/timeit.py -s
"import itertools,psyco; a=zip(xrange(1000),xrange(1000));li=[]"
"filter(li.extend,a)"
1000 loops, best of 3: 474 usec per loop

For a fair comparison you'd have to time the zip operation.
Still 50% slower, but it has the advantage that it works on all kinds of
sequences, as they can be efficiently izip()ed together.

Creating a list via list/map/filter just for the side effect is not only bad
taste,

~ $ python -m timeit -s'a = zip([range(1000)]*2)' 'lst=[];ext=lst.extend'
'for i in a: ext(i)'
1000000 loops, best of 3: 1.23 usec per loop
~ $ python -m timeit -s'a = zip([range(1000)]*2)' 'lst=[];filter(lst.extend,
a)'
1000000 loops, best of 3: 1.63 usec per loop

it is also slower than an explicit loop. Don't do it.
ah, I stand corrected.
 

Peter Otten

Michael said:
Peter said:
If you require len(xdata) == len(ydata) there's an easy way to move the
loop into C:

def flatten7():
    n = len(xdata)
    assert len(ydata) == n
    result = [None] * (2*n)
    result[::2] = xdata
    result[1::2] = ydata
    return result

$ python -m timeit 'from flatten import flatten6 as f' 'f()'
1000 loops, best of 3: 847 usec per loop
$ python -m timeit 'from flatten import flatten7 as f' 'f()'
10000 loops, best of 3: 43.9 usec per loop
Very nice observation, Peter.

Thank you :)
Easily the "winner". It reminds me of
str.translate beating python loops in filtering applications:
http://groups.google.com/group/comp.lang.python/msg/e23cdc374144a4bd
What's more, you can generalize your approach to any number of sequences
and unequal lengths, with only a modest loss of speed:

def interleave(*args, **kw):
    """Peter Otten flatten7 (generalized by Michael Spencer)
    Interleave any number of sequences, padding shorter sequences if kw
    pad is supplied"""
    dopad = "pad" in kw
    pad = dopad and kw["pad"]
    count = len(args)
    lengths = map(len, args)
    maxlen = max(lengths)
    result = maxlen*count*[None]
    for ix, input in enumerate(args):
        try:
            result[ix::count] = input
        except ValueError:
            if dopad:
                result[ix::count] = input + [pad]*(maxlen-lengths[ix])
            else:
                raise
    return result

You don't lose speed because the expensive operation is only executed if
the list lengths differ. Two observations:

- list arithmetic like 'input + [pad]*(maxlen-lengths[ix])' should and can
be avoided

- You are padding twice -- once with None, and then with the real thing.

def interleave2(*args, **kw):
    dopad = "pad" in kw
    pad = kw.get("pad")
    count = len(args)
    lengths = map(len, args)
    maxlen = max(lengths)
    if not dopad and min(lengths) != maxlen:
        raise ValueError
    result = maxlen*count*[pad]
    for ix, input in enumerate(args):
        result[ix:len(input)*count:count] = input
    return result
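A quick behaviour check with the earlier test values (interleave2 pads in
a single pass instead of writing None first and overwriting):

>>> interleave2([1, 3], [2, 4, 6], pad=None)
[1, 2, 3, 4, None, 6]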

$ python -m timeit -s 'from interleave_spencer import xdata, ydata,
interleave as i;xdata=xdata[:-1]' 'i(xdata, ydata, pad=None)'
10000 loops, best of 3: 69.7 usec per loop
$ python -m timeit -s 'from interleave_spencer import xdata, ydata,
interleave2 as i;xdata=xdata[:-1]' 'i(xdata, ydata, pad=None)'
10000 loops, best of 3: 46.4 usec per loop

Not overwhelming, but I expect the difference to grow when the arguments
occupy a significant portion of the available memory.

Peter
 

bonono

Peter said:
David Murmann wrote:
# New attempts:
from itertools import imap, izip
def flatten4(x, y):
    '''D Murman'''
    l = []
    list(imap(l.extend, izip(x, y)))
    return l
well, i would really like to take credit for these, but they're
not mine ;) (credit goes to Michael Spencer). i especially like
flatten4, even if it's not as fast as the phenomenally faster
flatten7.
Me too. And expanding a bit on flatten4, I got this interesting result.

bonono@moresing:~/bonobo/psp$ python ~/lib/python2.4/timeit.py -s
"import itertools; a=zip(xrange(1000),xrange(1000))" "l=len(a);
li=[None]*l*2;li[::2]=range(1000); li[1::2]=range(1000)"
1000 loops, best of 3: 318 usec per loop

bonono@moresing:~/bonobo/psp$ python ~/lib/python2.4/timeit.py -s
"import itertools,psyco; a=zip(xrange(1000),xrange(1000));li=[]"
"filter(li.extend,a)"
1000 loops, best of 3: 474 usec per loop

For a fair comparison you'd have to time the zip operation.
Still 50% slower, but it has the advantage that it works on all kinds of
sequences, as they can be efficiently izip()ed together.

Creating a list via list/map/filter just for the side effect is not only bad
taste,

~ $ python -m timeit -s'a = zip([range(1000)]*2)' 'lst=[];ext=lst.extend'
'for i in a: ext(i)'
1000000 loops, best of 3: 1.23 usec per loop
~ $ python -m timeit -s'a = zip([range(1000)]*2)' 'lst=[];filter(lst.extend,
a)'
1000000 loops, best of 3: 1.63 usec per loop

it is also slower than an explicit loop. Don't do it.
Hi, but I found this result instead:

bonono@moresing:~$ python ~/lib/python2.4/timeit.py -s "a=range(10000);
b=zip(*[a]*2); l=[None]*len(a)*2; e=l.extend" "l[::2]=b;l[1::2]=b"
100 loops, best of 3: 6.22 msec per loop
bonono@moresing:~$ python ~/lib/python2.4/timeit.py -s "a=range(10000);
b=zip(*[a]*2); l=[]; e=l.extend" "filter(e,b)"
100 loops, best of 3: 7.25 msec per loop
bonono@moresing:~$ python ~/lib/python2.4/timeit.py -s "a=range(10000);
b=zip(*[a]*2); l=[]; e=l.extend" "for x in b: e(x)"
100 loops, best of 3: 10.7 msec per loop
bonono@moresing:~$

So it seems to be faster than an explicit loop. By localizing the l.extend
name binding, its speed is only 20% slower than the fastest method.
Maybe I have done something wrong in the test?
 

Peter Otten

Creating a list via list/map/filter just for the side effect is not only
bad taste,

~ $ python -m timeit -s'a = zip([range(1000)]*2)' 'lst=[];ext=lst.extend'
'for i in a: ext(i)'
1000000 loops, best of 3: 1.23 usec per loop
~ $ python -m timeit -s'a = zip([range(1000)]*2)'
'lst=[];filter(lst.extend, a)'
1000000 loops, best of 3: 1.63 usec per loop

it is also slower than an explicit loop. Don't do it.
Hi, but I found this result instead:

bonono@moresing:~$ python ~/lib/python2.4/timeit.py -s "a=range(10000);
b=zip(*[a]*2); l=[None]*len(a)*2; e=l.extend" "l[::2]=b;l[1::2]=b"
100 loops, best of 3: 6.22 msec per loop
bonono@moresing:~$ python ~/lib/python2.4/timeit.py -s "a=range(10000);
b=zip(*[a]*2); l=[]; e=l.extend" "filter(e,b)"
100 loops, best of 3: 7.25 msec per loop
bonono@moresing:~$ python ~/lib/python2.4/timeit.py -s "a=range(10000);
b=zip(*[a]*2); l=[]; e=l.extend" "for x in b: e(x)"
100 loops, best of 3: 10.7 msec per loop
bonono@moresing:~$

So it seems to be faster than an explicit loop. By localizing the l.extend
name binding, its speed is only 20% slower than the fastest method.
Maybe I have done something wrong in the test?

I hate to admit it but /my/ post had a bug:

zip([range(1000)]*2) should have been zip(*[range(1000)]*2).
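The difference matters because zip with a single list argument yields
1-tuples, so the mistyped setup handed filter() only two items:

>>> zip([range(3)]*2)      # one argument: a list of two lists -> 1-tuples
[([0, 1, 2],), ([0, 1, 2],)]
>>> zip(*[range(3)]*2)     # two arguments: element-wise pairs
[(0, 0), (1, 1), (2, 2)]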

filter() may be ugly, but faster it is.

Peter
 

Robin Becker

Peter Otten wrote:
.......
- You are padding twice -- once with None, and then with the real thing.

def interleave2(*args, **kw):
    dopad = "pad" in kw
    pad = kw.get("pad")
    count = len(args)
    lengths = map(len, args)
    maxlen = max(lengths)
    if not dopad and min(lengths) != maxlen:
        raise ValueError
    result = maxlen*count*[pad]
    for ix, input in enumerate(args):
        result[ix:len(input)*count:count] = input
    return result

$ python -m timeit -s 'from interleave_spencer import xdata, ydata,
interleave as i;xdata=xdata[:-1]' 'i(xdata, ydata, pad=None)'
10000 loops, best of 3: 69.7 usec per loop
$ python -m timeit -s 'from interleave_spencer import xdata, ydata,
interleave2 as i;xdata=xdata[:-1]' 'i(xdata, ydata, pad=None)'
10000 loops, best of 3: 46.4 usec per loop

Not overwhelming, but I expect the difference to grow when the arguments
occupy a significant portion of the available memory.

Peter
very nice indeed; another generalization is to allow truncation as well

def interleave(*args,**kw):
    """Peter Otten flatten7 (generalized by Michael Spencer and Robin Becker)
    Interleave any number of sequences, padding shorter sequences if kw pad
    is supplied or truncating if truncate=True is specified
    >>> interleave([1,3,5], [2,4,6]) == [1,2,3,4,5,6]
    True
    >>> interleave([1,2,3]) == [1,2,3]
    True
    >>> interleave(*[[1,2,3]]*10) == [1]*10+[2]*10+[3]*10
    True
    >>> interleave(range(0,1000,2),range(1,1000,2)) == range(1000)
    True
    >>> interleave([1,2],[3,4,5])
    Traceback (most recent call last):
        ...
    ValueError: Minimum length=2 != Max length=3
    >>> interleave([1,3],[2,4,6], pad = None) == [1,2,3,4,None,6]
    True
    >>> interleave([1,3],[2,4,6],truncate=True) == [1,2,3,4]
    True
    >>> interleave([1,2],[3,4,5],pad='aaa',truncate=1)
    Traceback (most recent call last):
        ...
    AssertionError: Cannot specify both truncate=True and pad='aaa'
    """
    dopad = "pad" in kw
    pad = kw.get("pad")
    dotrunc = bool(kw.get('truncate',False))
    assert not (dotrunc and pad), \
        'Cannot specify both truncate=True and pad=%r' % pad
    count = len(args)
    lengths = map(len,args)
    maxlen = max(lengths)
    minlen = min(lengths)
    if dotrunc:
        result = minlen*count*[None]
        for ix, input in enumerate(args):
            result[ix::count] = input[:minlen]
    else:
        if not dopad and minlen!=maxlen:
            raise ValueError('Minimum length=%d != Max length=%d'
                             % (minlen,maxlen))
        result = maxlen*count*[pad]
        for ix, input in enumerate(args):
            result[ix:len(input)*count:count] = input
    return result

def _test():
    import doctest, interleave
    return doctest.testmod(interleave)

if __name__ == "__main__":
    _test()
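Saved as interleave.py (the name the self-import assumes), the doctests
run when the module is executed directly; doctest.testmod() also honours
the standard -v flag:

$ python interleave.py -v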
 

Nick Craig-Wood

Alex Martelli said:
Nick Craig-Wood said:
Except if you are trying to sum arrays of strings...
sum(["a","b","c"], "")
Traceback (most recent call last):

I've no idea why this limitation is here... perhaps it is because
pre-Python 2.4 calling += on strings was very slow?

No: when I implemented sum, I originally special-cased sum on strings to
map down to a ''.join -- Guido decided it was confusing and had no
advantage wrt calling ''.join directly, so he made me put in that
check.

Hmm, interesting...

I agree with Guido about the special case, but I disagree about the
error message. Not being able to use sum(L,"") reduces the
orthogonality of sum for no good reason I could see.

However I only noticed that when trying to do the recent Python
challenge, which probably isn't the best use case ;-)
 

Alex Martelli

Nick Craig-Wood said:
I agree with Guido about the special case, but I disagree about the
error message. Not being able to use sum(L,"") reduces the
orthogonality of sum for no good reason I could see.

Having sum(L,'') work but be O(N squared) would be an "attractive
nuisance" within the meaning of the law; it's bad enough that sum(L,[])
etc are that way, but at least beginners want to concatenate lists (or
tuples, arrays, etc) much less often than they want to concatenate
strings. sum is meant to work on _numbers_ -- the reason it takes that
second optional parameter is essentially to be able to specify the
``type'' of numbers. In retrospect it might have been better to have a
single argument (or interpret multiple ones like min or max do, maybe),
forcing the starting value to be integer 0.
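The quadratic behaviour referred to here: each + in the fold copies
everything accumulated so far, so something like

lists = [[1], [2], [3]]
total = []
for item in lists:
    total = total + item   # builds a brand-new list every iteration

does O(N**2) copying overall, which is roughly what sum(L, []) amounts to.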


Alex
 

Bengt Richter

I agree.

Alex said:
Having sum(L,'') work but be O(N squared) would be an "attractive
nuisance" within the meaning of the law;

Are you saying ''.join is O(N squared)? Or that sum would not be
allowed to use the same/equivalent implementation algorithms if it were
allowed to work on strings?

Alex said:
it's bad enough that sum(L,[])
etc are that way, but at least beginners want to concatenate lists (or
tuples, arrays, etc) much less often than they want to concatenate
strings. sum is meant to work on _numbers_ --

If sum is meant to work on _numbers_, why not call it numsum, to suggest
the restriction? ISTM the usual assumption in Python is non-restrictive
duck typing, and the expectation would be no more restriction than in
reduce(operator.add, L, init_val).

Alex said:
the reason it takes that
second optional parameter is essentially to be able to specify the
``type'' of numbers. In retrospect it might have been better to have a
single argument (or interpret multiple ones like min or max do, maybe),
forcing the starting value to be integer 0.
So it was an accident that user types are permitted?
>>> class C:   # (the >>> lines here are reconstructed; the class name is a guess)
...     def __getattr__(self, attr): print attr; raise AttributeError
...
>>> sum([C(), C()])
__coerce__
__radd__
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: unsupported operand type(s) for +: 'int' and 'instance'
>>> C() + C()
__coerce__
__add__
__coerce__
__radd__
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: unsupported operand type(s) for +: 'instance' and 'instance'

Yet user subclassing of str does not yield an acceptable user type?

>>> class S(str): pass
...
>>> sum(map(S, '01234'), S(''))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: sum() can't sum strings [use ''.join(seq) instead]

Although, notice

>>> class SI(str):
...     def __add__(self, other): return SI(str.__add__(self, str(other)))
...     __radd__ = __add__
...
>>> sum(map(SI, '01234'))
'001234'

But:

>>> sum(map(SI, '01234'), SI('start:'))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: sum() can't sum strings [use ''.join(seq) instead]

vs

>>> import operator
>>> reduce(operator.add, map(SI, '01234'), SI('start:'))
'start:01234'


Which seems to boil down to incomplete restrictions on duck typing.
Maybe Guido will want to complete it, but ISTM your original implementation
delegating string addition to ''.join was reasonable.

Regards,
Bengt Richter
 
