List Flatten

B

bearophile

This is my first Python program (it's an improvement of a recursive
version by Luther Blissett). Given a list like this:
[["a", 2, [], [3, 5, (2,1), [4]]]]

It produces the "flatted" version:
["a", 2, 3, 5, (2, 1), 4]

I think this operation is quite important, and it's similar to the
built-in Mathematica Flatten function:
l = {{"a", 2, {}, {3, 5, {4}}}}
Flatten[l]
=> {"a", 2, 3, 5, 4}

This program works for very deep lists too, without giving Stack
overflow like the recursive version (and it's faster than the
recursive version).

There's an added loop to speed up the already quite flat lists (I
think they are the most common) but this slows down a bit the
flattening of very deep lists. I don't know if this is the best
compromise for average situations.
I've tried a version with the stack as a dictionary, but the timings
are almost the same and it uses a LOT more memory (I've tried it with
lists with a million nested levels).

def listflatten(l):
"""Flattens a list, examples:
listflatten("a") => "a"
listflatten([]) => []
listflatten([[1,[2,[],"a"]]) => [1,2,"a"]
"""
if type(l) != list:
return l
else:
result = []
stack = []
stack.append((l,0))
while len(stack) != 0:
sequence, j = stack.pop(-1)
while j < len(sequence):
if type(sequence[j]) != list:
k, j, lens = j, j+1, len(sequence)
while j < len(sequence) and \
(type(sequence[j]) != list):
j += 1
result.extend(sequence[k:j])
else:
stack.append((sequence, j+1))
sequence, j = sequence[j], 0
return result

I think this program is essentially (almost) O(n) because the sequence
list isn't really copied when it's inserted into the stack, and in
some tests I've seen that the pushes-pops in the lists tail are almost
O(1) (but the Append of an element in the head of a list is ~O(n)).
Bugs: for really big lists this program can crash for memory overflow.
I still don't know how to fix this problem. The exception MemoryError
doesn't come even when the HD trashes a lot... This is probably a
memory management problem of the Operative System... -_-

Comments are appreciated,
bear hugs,
bearophile
 
P

Peter Otten

bearophile said:
def listflatten(l):
"""Flattens a list, examples:
listflatten("a") => "a"
listflatten([]) => []
listflatten([[1,[2,[],"a"]]) => [1,2,"a"]
"""
if type(l) != list:
return l
else:
result = []
stack = []
stack.append((l,0))
while len(stack) != 0:
sequence, j = stack.pop(-1)
while j < len(sequence):
if type(sequence[j]) != list:
k, j, lens = j, j+1, len(sequence)
while j < len(sequence) and \
(type(sequence[j]) != list):
j += 1
result.extend(sequence[k:j])
else:
stack.append((sequence, j+1))
sequence, j = sequence[j], 0
return result

I see my newsreader has a different notion about what flatten is meant to
be :)

Have you considered generators?

def flatten(items):
if not isinstance(items, list):
yield items
stack = [iter(items)]
while stack:
for item in stack[-1]:
if isinstance(item, list):
stack.append(iter(item))
break
yield item
else:
stack.pop()

assert list(flatten([[[[]]], 0, 1, [[[2,[3, [4]]]], [],[5, 6], 7, [8, 9]],
10])) == range(11)

I have not timed it, I just prefer the generator for its clarity (provided
it is correct, I haven't tested it either). This is achieved mostly by
hiding the indices together with the underlying list in iterator objects.
Generators may also save you a lot of memory for very large data structures
as you can iterate over a flat _view_ of your nested lists without ever
having to _create_ the large flat list.

Peter
 
J

Jeffrey Froman

Peter said:
Generators may also save you a lot of memory for very large data
structures as you can iterate over a flat view of your nested lists
without ever having to create the large flat list.

Particularly if you avoid building the interim "stack" list, by making use
of recursion:

def deep_iter(nested):
for x in nested:
if isinstance(x, (list, tuple)):
for y in deep_iter(x):
yield y
else:
yield x


The example above should work to flatten nested lists or tuples.

Jeffrey
 
J

Jeremy Bowers

Particularly if you avoid building the interim "stack" list, by making use
of recursion:

The original poster explicitly mentioned not using recursion due to
wanting to flatten things deeper than the call stack can go.
 
J

Jeffrey Froman

Jeremy said:
The original poster explicitly mentioned not using recursion due to
wanting to flatten things deeper than the call stack can go.

Thanks, I missed the OP.

Jeffrey
 
B

bearophile

Peter Otten:
Have you considered generators?<

I haven't because I am still learning Python, and I still don't know
all the unusual features of it.
I'll try to improve, there are many things to learn.

I have not timed it, I just prefer the generator for its clarity
(provided it is correct, I haven't tested it either).<

It seems correct, and the timings are a bit better than mine for some
cases (very nested lists) and a bit slowler for other cases (quite
flat lists). I think usual list are already quite flat (like 2D
arrays, etc).

Generators may also save you a lot of memory for very large data
structures as you can iterate over a flat _view_ of your nested lists
without ever having to _create_ the large flat list.<

I understand; this is very interesting.


Jeffrey Froman's version is recursive, and I have refused such
versions because the stack is limited by default.

Thank you all for the comments and nice ideas,
a bear hug,
bearophile
 
R

Raymond Hettinger

bearophile said:
This program works for very deep lists too, without giving Stack
overflow like the recursive version (and it's faster than the
recursive version).

Here is a non-recursive version using generators. It runs in O(n) time and is
memory friendly (consuming space proportional to the depth of the input). It
has one additional feature, strings are kept whole instead of iterating them
into characters.

iterstack = [iter(n)]
while iterstack:
for elem in iterstack[-1]:
try:
it = iter(elem)
except TypeError:
pass
else:
if not isinstance(elem, basestring):
iterstack.append(it)
break
yield elem
else:
iterstack.pop() # remove iterator only when it is exhausted
n = [['ab', 2, [], [3, 5, (2,1), [4]]]]
print list(f(n))
['ab', 2, 3, 5, 2, 1, 4]



Raymond Hettinger
 
A

Alex Martelli

Raymond Hettinger said:
iterstack = [iter(n)]
while iterstack:
for elem in iterstack[-1]:
try:
it = iter(elem)
except TypeError:
pass
else:
if not isinstance(elem, basestring):
iterstack.append(it)
break
yield elem
else:
iterstack.pop() # remove iterator only when it is exhausted

Nice! I was wondering about a sligthly different ordering...:

def flatten(sequence, isatomic=None):
if isatomic is None:
def isatomic(item): return isinstance(item, basestring)
stack = [iter(sequence)]
while stack:
for item in stack[-1]:
if not isatomic(item):
try: stack.append(iter(item))
except TypeError: pass
else: break
yield item
else:
stack.pop()

Very similar, of course, but for some reason it looks nicer to me to
enter the try/except/else only after "singling out" items that are meant
to be taken as "atomic" no matter what (by default, all strings). I'm
not sure how to weigh the pro's and con's of the two very similar
variations. Maybe the second one may be preferred on a rather minor
aspect: if I'm flattening a sequence which, I suspect, will contain lots
of numbers at the leaves, I could code and pass in something like:

def isatomic(item):
return isinstance(item, (basestring, int, long, float, complex))

and bypass the try/except for all those numbers, perhaps speeding things
up...? (unmeasured...). BTW, this does point out how nice a
baseinteger and/or basenumber analog to basestring would be, to shorten
that longish tuple of types. Ah well, maybe in 2.5!-)


Alex
 
B

bearophile

Comparing versions for speed, Raymond Hettinger's version is quite
slow (even 10 or more times slower than mine for certain kinds of
lists).

Into the most nested cicle of Alex Martelli's version there is a
function call to isatomic (and the try-except) that slow down the
program a lot. Removing them (we expand lists only) we obtan the short
Peter Otten's version, that is a bit better than mine for very nested
lists and a bit worse for quite flat lists (that are the most common,
I think).

Thank you all,
bear hugs,
bearophile
 
R

Raymond Hettinger

bearophile said:
Comparing versions for speed, Raymond Hettinger's version is quite
slow (even 10 or more times slower than mine for certain kinds of
lists).

Into the most nested cicle of Alex Martelli's version there is a
function call to isatomic (and the try-except) that slow down the
program a lot. Removing them (we expand lists only) we obtan the short
Peter Otten's version, that is a bit better than mine for very nested
lists and a bit worse for quite flat lists (that are the most common,
I think).

That is an apples-to-oranges comparision. The loss of speed is due to the
additional features of not expanding strings and not expanding iterables into
memory all at one. Previous discussions about flatten() in this newsgroup or
the tutor list have indicated that is what is usually wanted. Take out the test
for the atomicity of strings and the speed is much improved. Also, from
original posting, it seemed that memory friendliness was a key measure for merit
given the huge data sizes and nesting depths.

This discussion may have over-abstracted the problem so that your
apples-to-oranges timings are irrelevant. Do you have real world use cases for
wanting to flatten nested containers of non-homogenous object types (a mix of
numbers, strings, dictionaries, etc)? Do you really want to split your strings
into streams of characters and then throw numbers in the mix. I think not.


Raymond Hettinger
 
P

Peter Otten

Raymond said:
That is an apples-to-oranges comparision. The loss of speed is due to the
additional features of not expanding strings and not expanding iterables
into
memory all at one. Previous discussions about flatten() in this newsgroup
or
the tutor list have indicated that is what is usually wanted. Take out
the test
for the atomicity of strings and the speed is much improved. Also, from
original posting, it seemed that memory friendliness was a key measure for
merit given the huge data sizes and nesting depths.

This discussion may have over-abstracted the problem so that your
apples-to-oranges timings are irrelevant. Do you have real world use
cases for wanting to flatten nested containers of non-homogenous object
types (a mix of
numbers, strings, dictionaries, etc)? Do you really want to split your
strings
into streams of characters and then throw numbers in the mix. I think
not.

In may book all three versions (Alex Martelli's yours and mine) use the
_same_ flattening algorithm with _different_ methods to test for atomicity.
While I take the lightweight (and fast) approach of inlining
isinstance(item, list) - only item.__class__ is list could be faster - as
specified in the original post, you provide a function that tries to cover
most common cases and Alex parameterizes his variant making it a candidate
for inclusion in a library. I think you often have to decide between these
three approaches and would not call it an apples-to-oranges comparison
(btw, oranges are also called "Chinese apples" in German and Dutch, so
somebody thought that comparing them was a good idea :), but again, it
should be made clear to the OP that he is _not_ comparing flattening
algorithms.

Another lesson - that exceptions can slow down a program if they are raised
in the common case - is also worth noting.
Various tests for atomicity have been discussed and benchmarked in

http://groups.google.com/[email protected]

which I didn't mention before because the OP explicitly ruled out a
recursive solution.

Peter
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,209
Messages
2,571,088
Members
47,686
Latest member
scamivo

Latest Threads

Top