itertools.flatten()? and copying generators/iterators.


Francis Avila

Below is an implementation of a 'flattening' recursive generator (take a
nested iterator and remove all its nesting). Is this possibly general and
useful enough to be included in itertools? (I know *I* wanted something like
it...)

Very basic examples:
>>> rl = [1, [2, 3, [4, 5], '678', 9]]
>>> list(flatten(rl))
[1, 2, 3, 4, 5, '6', '7', '8', 9]
>>> notstring = lambda obj: not isinstance(obj, type(''))
>>> list(flatten(rl, notstring))
[1, 2, 3, 4, 5, '678', 9]
>>> isstring = lambda obj: not notstring(obj)
>>> list(flatten(rl, isstring))
[1, [2, 3, [4, 5], '678', 9]]
>>> # The string is within a list, so we never descend that far.
>>> car_is_2 = lambda obj: isinstance(obj, type([])) and obj[0] == 2
>>> list(flatten(rl, car_is_2))
[1, 2, 3, [4, 5], '678', 9]
>>> rls = ['Here', 'are', ['some', ['nested'], 'strings']]
>>> list(flatten(rls))
['H', 'e', 'r', 'e', 'a', 'r', 'e', 's', 'o', 'm', 'e', 'n', 'e', 's', 't',
'e', 'd', 's', 't', 'r', 'i', 'n', 'g', 's']
>>> list(flatten(rls, notstring))
['Here', 'are', 'some', 'nested', 'strings']
>>> rli = iter([1, 2, iter(['abc', iter('ABC')]), 4])
>>> list(flatten(rli))
[1, 2, 'a', 'b', 'c', 'A', 'B', 'C', 4]
>>> list(flatten(rli, notstring))
[]
>>> # rli is an iterator, remember!
>>> rli = iter([1, 2, iter(['abc', iter('ABC')]), 4])
>>> list(flatten(rli, notstring))
[1, 2, 'abc', 'A', 'B', 'C', 4]
>>> # The following I'm not sure what to do about...
>>> empty = [1, [], 3]
>>> emptyiter = [1, iter([]), 3]
>>> list(flatten(empty))
[1, [], 3]
>>> list(flatten(emptyiter))
[1, 3]

I tried hard to get it to work with iterator and generator objects, too, and
it mostly does. However, I'm having some problems determining whether a
given object will iterate infinitely, if that object is already a
generator/iterator. Basically, I'm unable to copy an iterator (why?). See
isemptyorself() below for details. Aside from that, I'm just generally
unsure what the proper behavior should be when an iterator/generator is
encountered.
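To make the copying problem concrete, here is a small sketch of the identity
fact involved (any Python version; it is also why the probe in
isemptyorself() below necessarily destroys a value):

>>> it = iter([1, 2, 3])
>>> iter(it) is it        # iter() of an iterator is the iterator itself,
True                      # so there is no cheap way to get a fresh copy
>>> list(it), list(it)    # consuming it once leaves nothing to retry
([1, 2, 3], [])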

Also, why is the iterator type not included in the types module or described
in the language reference (Python 2.2)?

--- Code ---


def isiterable(obj):
    try: iter(obj)
    except TypeError: return False
    else: return True

def isemptyorself(iterable):
    """True if iterable yields nothing or itself."""
    it = iter(iterable)

    # KLUDGE! This test must modify the object in order to test
    # it. This means that a value of iterable will be discarded!
    # Most likely, iterable is itself an iterator or generator,
    # because id(iter(GENR or ITER)) == id(GENR or ITER).
    # Unfortunately, we can't copy generators and iterators using
    # the copy module, so we must just assume that this iterator
    # doesn't yield itself or nothing....

    if it is iterable:
        return False

    try: res = it.next()
    except StopIteration: # Yields nothing
        return True
    else:
        if res == iterable: # Yields itself
            return True
    return False

def flatten(iterable, isnested=None):
    """Iterate items in iterable, descending into nested items.

    isnested is a function that returns true if the element of
    iterable should be descended into. The default is to
    consider iterable anything that iter() thinks is iterable (unless
    doing so would cause an infinite recursion).

    """
    if isnested is None:
        isnested = lambda obj: True # Always descend

    for item in iterable:
        if isiterable(item) and not isemptyorself(item) \
           and isnested(item):
            for subitem in flatten(item, isnested):
                yield subitem
        else:
            yield item
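One detail worth spelling out: the "yields itself" test is what keeps the
default mode from recursing forever on strings, because iterating a
one-character string just produces that same string again (a sketch,
Python 2.x):

>>> s = 'x'
>>> iter(s).next() == s   # '6' -> '6' -> '6' -> ... without the guard
True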
 

Raymond Hettinger

[Francis Avila]
Below is an implementation of a 'flattening' recursive generator (take a nested
iterator and remove all its nesting). Is this possibly general and useful
enough to be included in itertools? (I know *I* wanted something like it...)

Interesting post!

I'll add your idea to the list of itertools suggestions received so far. My
first take is that it is more likely to be added to the "recipes" in the
examples section.

Core itertools should be primitive building blocks that combine with one
another to make other tools. Also, I'm trying to keep the toolset as small
as possible by excluding new tools that can be easily and efficiently coded
in pure python.

I would code it like this:

def flatten(s):
    try:
        iter(s)
    except TypeError:
        yield s
    else:
        for elem in s:
            for subelem in flatten(elem):
                yield subelem

As your examples show, it does have some surprising behavior in that
strings get split apart instead of staying intact. That is easily taken care of
by an AtomicString subclass:

class AtomicString(str):
    def __iter__(self):
        raise TypeError

a = [1, 2, AtomicString('abc'), 4]
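
Presumably the intended effect, sketched as a session (output reasoned from
the code above, not copied from the original post):

>>> list(flatten(a))
[1, 2, 'abc', 4]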


# The following I'm not sure what to do about...
empty = [1, [], 3]
emptyiter = [1, iter([]), 3]

The above definition handles this without a problem.


I'm having some problems determining whether a
given object will iterate infinitely, if that object is already a
generator/iterator.

In general, that is not knowable.


Basically, I'm unable to copy an iterator (why?).

Alex Martelli just wrote a PEP about making many iterators copyable.
See www.python.org/peps/pep-0323.html

Until that is adopted, your best bet is to use tee():

from itertools import count

def tee(iterable):
    "Return two independent iterators from a single iterable"
    def gen(next, data={}, cnt=[0]):
        dpop = data.pop
        for i in count():
            if i == cnt[0]:
                # this branch is ahead: fetch and save for the other
                item = data[i] = next()
                cnt[0] += 1
            else:
                # this branch is behind: replay the saved value
                item = dpop(i)
            yield item
    next = iter(iterable).next
    return (gen(next), gen(next))
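
A quick usage sketch (Python 2.x, since the recipe calls the .next() method
directly):

>>> a, b = tee('abc')
>>> a.next(), a.next(), b.next()
('a', 'b', 'a')
>>> b.next(), b.next()
('b', 'c')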

Also, why is the iterator type not included in the types module or described
in the language reference (Python 2.2)?

There is no iterator type. Iterators can be any object that supports the
iterator protocol: __iter__() and next(). Generators, container iterators,
and each of the itertools define their own type.
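
A quick session makes this concrete (sketched from a Python 2.3-era
interpreter, not copied from the original post):

>>> def g(): yield 1
...
>>> type(g())
<type 'generator'>
>>> import itertools
>>> type(itertools.count())
<type 'itertools.count'>
>>> type(iter([]))
<type 'listiterator'>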


This was a long but highly instructive pair of posts. I hope it becomes
widely read. Your work ventured into previously uncharted
territory and touched on a number of frontier issues like copyability,
determining whether something is iterable, the various iterator types,
and the deep mathematical subject of determining whether a given
process is finite.


Raymond Hettinger
 

Peter Otten

Raymond said:
Core itertools should be primitive building blocks that combine with one
another to make other tools. Also, I'm trying to keep the toolset as
small as possible by excluding new tools that can be easily and
efficiently coded in pure python.

I would code it like this:

def flatten(s):
    try:
        iter(s)
    except TypeError:
        yield s
    else:
        for elem in s:
            for subelem in flatten(elem):
                yield subelem

As your examples show, it does have some surprising behavior in that
strings get split apart instead of staying intact. That is easily taken
care of by an AtomicString subclass:

class AtomicString(str):
    def __iter__(self):
        raise TypeError

a = [1, 2, AtomicString('abc'), 4]


# The following I'm not sure what to do about...
empty = [1, [], 3]
emptyiter = [1, iter([]), 3]

I suggest a minor modification:

def flatten(s, toiter=iter):
    try:
        it = toiter(s)
    except TypeError:
        yield s
    else:
        for elem in it:
            for subelem in flatten(elem, toiter):
                yield subelem

def keepstrings(seq):
    if isinstance(seq, basestring):
        raise TypeError
    return iter(seq)

sample = [1, 2, [3, "abc def".split()], 4]
print sample
print list(flatten(sample, keepstrings))
print list(flatten([1, [], 3]))
print list(flatten([1, iter([]), 3]))

The above handles strings in a way that is nonintrusive on client code.
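
For the record, the expected output of the four print statements (sketched by
tracing the code, not copied from the original post):

[1, 2, [3, ['abc', 'def']], 4]
[1, 2, 3, 'abc', 'def', 4]
[1, 3]
[1, 3]

Note that both the empty list and the empty iterator simply disappear, which
answers the "# The following I'm not sure what to do about..." examples.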

Peter
 

Alex Martelli

Peter Otten wrote:
...
def keepstrings(seq):
    if isinstance(seq, basestring):
        raise TypeError
    return iter(seq)
...
The above handles strings in a way that is nonintrusive on client code.

Yes, very nice. I'd make keepstrings the default (one RARELY wants
to treat strings as nonatomic) AND replace the typetest with an
attempt to do a seq+'' (if it raises, seq ain't a string),
but that's just me:).


Alex
 

Francis Avila

Alex Martelli said:
Peter Otten wrote:
...

Yes, very nice. I'd make keepstrings the default (one RARELY wants
to treat strings as nonatomic) AND replace the typetest with an
attempt to do a seq+'' (if it raises, seq ain't a string),
but that's just me:).

Alex

try: seq+'' seems more expensive. Is it just to handle the case where
someone makes a string-alike that doesn't derive from str? (Actually, I
guess that's about all pre-2.2 code.)

Is it expensive enough to even matter?
 

Francis Avila

Raymond Hettinger said:
[Francis Avila]
Also, why is the iterator type not included in the types module or described
in the language reference (Python 2.2)?

There is no iterator type. Iterators can be any object that supports the
iterator protocol: __iter__() and next(). Generators, container iterators,
and each of the itertools define their own type.

Actually (Python 2.2):

>>> def g(): yield None
...
>>> dir(g())
['__class__', '__delattr__', '__doc__', '__getattribute__', '__hash__',
'__init__', '__iter__', '__new__', '__reduce__', '__repr__', '__setattr__',
'__str__', 'gi_frame', 'gi_running', 'next']
>>> dir(iter([]))
['__class__', '__delattr__', '__doc__', '__getattribute__', '__hash__',
'__init__', '__iter__', '__new__', '__reduce__', '__repr__', '__setattr__',
'__str__', 'next']


The generator seems to be an actual execution frame, but an iterator is only
a type that supports the iterator protocol. I think from this that an
iterator needs to store all its elements prior to yielding them, whereas a
generator does not. For this reason, an iterator will always terminate
eventually. For these two reasons, I think it's useful to distinguish
iterators and generators.
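
A cautionary sketch, though: a plain <type 'iterator'> can be built on the
old sequence protocol, stores nothing up front, and need never terminate
(hypothetical class, Python 2.x):

>>> class Ones:
...     def __getitem__(self, i):
...         return 1    # never raises IndexError
...
>>> it = iter(Ones())
>>> type(it)
<type 'iterator'>
>>> it.next(), it.next(), it.next()
(1, 1, 1)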

I think I've clarified it a bit more to myself. For iterators, it iterates
infinitely if it yields itself, otherwise it doesn't. For generators, it is
unknowable.

Are classes supporting the iterator protocol always generators? Would they
*ever* be an iterator? (According to type(), I mean) Is there any other way
to get a strict iterator other than applying iter() to a sequence? (I
noticed you can't force Python to subclass the iterator type.)

The language reference (or maybe it's the module reference?) states that
generators yield iterator objects. That's not exactly true: they yield
generator objects which support the iterator protocol, but not iterator
objects. And again, iterator types are not mentioned as a separate type in
the language reference, which strikes me as odd.

Perhaps a guru can clarify the relationship between generators and
iterators?
 

Peter Otten

Francis said:
try: seq+'' seems more expensive. Is it just to handle the case where
someone makes a string-alike that doesn't derive from str? (Actually, I
guess that's about all pre-2.2 code.)

Is it expensive enough to even matter?

If you think that the concatenation is costly, as I did when I read your
remark - it's not:
True

Peter
 

Francis Avila

Peter Otten said:
If you think that the concatenation is costly, as I did when I read your
remark - it's not:

True

Peter

Actually, I meant the exception testing. Isn't Alex suggesting something
like this?:

try:
    seq+''
except:
    # seq is not a string
else:
    # seq is a string
 

Alex Martelli

Francis Avila wrote:
...
Actually, I meant the exception testing. Isn't Alex suggesting something
like this?:

try:
    seq+''
except:
    # seq is not a string
else:
    # seq is a string

Right. Yes, if you have a lot of nonstrings this IS going to be
substantially slower than isinstance tests. As usual, setting up
a representative benchmark and measuring the effect is best.

Of course, it's not easy to decide what IS representative, but
let's give it a try:

# Peter Otten's suggestion (simplified/flattened out)
def flatten_po(s):
    try:
        if isinstance(s, basestring):
            raise TypeError
        else:
            it = iter(s)
    except TypeError:
        yield s
    else:
        for elem in it:
            for subelem in flatten_po(elem):
                yield subelem

# my slight preference
def flatten_am(s):
    try:
        s+''
    except TypeError:
        try: it = iter(s)
        except TypeError:
            yield s
        else:
            for elem in s:
                for subelem in flatten_am(elem):
                    yield subelem
    else:
        yield s

# an example tree -- a mix of strings, items, some nesting
subtree = ['foo']*18 + [1,2,3]*6
tree = [ subtree*10, [ subtree * 8 ] ]

# functions to benchmark
def process(flatten, tree=tree):
    return list(flatten(tree))

def pro_po(): return process(flatten_po)
def pro_am(): return process(flatten_am)


Now, with this setup, I measure:

[alex@lancelot bo]$ timeit.py -c -s'import fl' 'fl.pro_po()'
100 loops, best of 3: 9.4e+03 usec per loop
[alex@lancelot bo]$ timeit.py -c -s'import fl' 'fl.pro_am()'
100 loops, best of 3: 8.3e+03 usec per loop

...which is a reminder of WHY it's best to always measure --
I fully expected pro_am to be slower, as it's more general
(handles UserString instances etc), and I was just trying
to check _how_ much slower it was...!

Upon reflection, the problem is with flatten_po's elegant
'raise TypeError', which merges strings with noniterables
BUT only at the cost of an exception. Changing the
preamble of flatten_po to:

try:
    if isinstance(s, basestring):
        yield s
        return
    ...

improves its timeit.py-measured performance to 5.9e+03 usec.
So, on this sample of data, the _po version is either 13%
slower or 29% faster than the _am .

Now, fortunately, it's an EXTREMELY unlikely event that
performance differences of this order of magnitude should
deeply influence our choice of algorithms! Twice as slow,
well, one wants to keep an eye on that; 30% or less, it's
unlikely to matter except at the very heart of some key
bottleneck -- and, honestly, how often will flattening
overhead be THE bottleneck of your application...? Note
that in this case we're basically doing no processing with
the items in the flattened sequence, just stuffing them into
flat lists. Do any processing whatsoever, and that will
probably dwarf the infrastructural overhead of the flattening
procedure itself.

The key thing is watching for difference in big-O behavior --
and, here, we have none... just somewhat different (NOT
by whole orders of magnitude!) multiplicative constants.
THAT kind of performance issue is one we can live with
more often than not! Which, in turn, is what makes the
ordinary use of exceptions generally tolerable from the
point of view of performance, except in the very hottest
bottlenecks...


Alex
 

Alex Martelli

Francis Avila wrote:
...
Perhaps a guru can clarify the relationship between generators and
iterators?

Sure: the use of "iterators" or "iterator objects" in the Python docs
does NOT mean "objects that are instance of <type 'iterator'>", it
means "objects that satisfy the iterator protocol".

The situation may be a bit less confusing in today's Python, just
because the typenames involved have been changed a bit in a few
important special cases:

>>> iter([])
<listiterator object at 0x...>
>>> iter('')
<iterator object at 0x402dee8c>
These types are unrelated -- each only inherits from object, as
you can check by viewing their __bases__ -- and besides the small
performance improvement that results from each of these iterator
objects 'knowing' the exact type of sequence it's iterating on,
the name diversification may help avoid the understandable kind
of confusion you may have been led into by older Python versions.


Alex
 

Alex Martelli

Francis Avila wrote:
...
try: seq+'' seems more expensive. Is it just to handle the case where
someone makes a string-alike that doesn't derive from str?

As per my other post, it sure SEEMS more expensive - whether it IS or
not depends on details:). Yes, it's basically to try and treat as a
string any class that looks like one.

(Actually, I guess that's about all pre-2.2 code.)

Nope, UserString is still alive and well in 2.2 and 2.3 -- and doesn't
subclass basestring (I don't know why). Yes, if you write stringlike
classes in 2.2 and later, subclassing basestring, just as a "flag" since
quite a few spots may specialcase instances of basestring, IS becoming
advisable, I suspect.
Is it expensive enough to even matter?

Nope (in the sample case I measured), again see my other post for
details. But perhaps my old-style insistence against isinstance
abuse is becoming misplaced in this one case -- basestring exists
ONLY for the purpose of allowing easy isinstance checks, after all,
it offers no useful functionality per se. So, I should maybe
resign myself to the inevitable...

...and hope a similar type 'basenumber' to become the "just flagging"
baseclass of int, long, float and complex doesn't emerge...:)


Alex
 

Bengt Richter

Alex Martelli wrote:
...

Right. Yes, if you have a lot of nonstrings this IS going to be
substantially slower than isinstance tests. As usual, setting up
a representative benchmark and measuring the effect is best.

Of course, it's not easy to decide what IS representative, but
let's give it a try:
I am a worrywart, I guess, but my reaction is 'Nicht finger-poken
in der blinken-machinen' etc ;-) IOW, without knowing exactly what kind of
blinken-machine seq is, I would fear triggering an unknown side effect
with seq+'', so I would not like such a test as part of library code without
at least having it documented with shout-caps in the doc strings.
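
A sketch of the sort of object Bengt is worried about (a hypothetical class,
not from the thread): the seq + '' probe both mutates it and misclassifies
it as a string:

class Chatty:
    """A 'blinken-machine': __add__ has a side effect."""
    def __init__(self):
        self.added = []
    def __add__(self, other):
        self.added.append(other)    # the seq + '' probe triggers this
        return self

c = Chatty()
c + ''              # the string test "succeeds" without raising...
print c.added       # ...and has already mutated the object: ['']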

Regards,
Bengt Richter
 

Peter Otten

Alex said:
The key thing is watching for difference in big-O behavior --
and, here, we have none... just somewhat different (NOT
by whole orders of magnitude!) multiplicative constants.
THAT kind of performance issue is one we can live with
more often than not! Which, in turn, is what makes the
ordinary use of exceptions generally tolerable from the
point of view of performance, except in the very hottest
bottlenecks...

You are right, but let me play a little with the code anyway:

def flatten_am(s):
    try:
        s+''
    except TypeError:
        try: it = iter(s)
        except TypeError:
            yield s
        else:
            for elem in s:
                for subelem in flatten_am(elem):
                    yield subelem
    else:
        yield s


def flatten_dict(s, isScalar):
    try:
        scalar = isScalar[s.__class__]
    except KeyError:
        t = s.__class__
        try:
            s = iter(s)
        except TypeError:
            scalar = isScalar[t] = True
        else:
            scalar = isScalar[t] = False

    if scalar:
        yield s
    else:
        for elem in s:
            for subelem in flatten_dict(elem, isScalar):
                yield subelem

# an example tree -- a mix of strings, items, some nesting
subtree = ['foo']*18 + [1,2,3]*6
tree = [ subtree*10, [ subtree * 8 ] ]

def pro_am():
    for f in flatten_am(tree): pass

def pro_dict():
    for f in flatten_dict(tree, {"".__class__: True}): pass

assert list(flatten_am(tree)) == \
       list(flatten_dict(tree, {"".__class__: True}))


This gives for flatten_am()

...> timeit.py -c -s"import fl" "fl.pro_am()"
100 loops, best of 3: 5.7e+03 usec per loop

versus

...> timeit.py -c -s"import fl" "fl.pro_dict()"
1000 loops, best of 3: 1.37e+03 usec per loop

for flatten_dict(). So you get a speedup at the cost of uglier though
still quite readable code, under the assumption that either all or no
instances of a class are iterable.
Of course the idea works with both isinstance() and a + "", but it seems
somewhat more in the spirit of isinstance().
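
For completeness, a sketch of the same cache keyed by the a + '' probe
instead (an untested variant along the lines of that remark;
flatten_dict_plus is a made-up name):

def flatten_dict_plus(s, isScalar):
    # same memoizing idea, but classify string-like classes via the
    # s + '' probe rather than isinstance(s, basestring)
    try:
        scalar = isScalar[s.__class__]
    except KeyError:
        t = s.__class__
        try:
            s + ''                       # no TypeError: string-like
        except TypeError:
            try:
                s = iter(s)
            except TypeError:
                scalar = isScalar[t] = True
            else:
                scalar = isScalar[t] = False
        else:
            scalar = isScalar[t] = True  # strings count as scalars
    if scalar:
        yield s
    else:
        for elem in s:
            for subelem in flatten_dict_plus(elem, isScalar):
                yield subelem

Note that unlike flatten_dict, the cache needs no pre-seeding with the
string class: the probe classifies str (and string-alikes) on first contact.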

Peter
 

Alex Martelli

Peter Otten wrote:
...
def flatten_dict(s, isScalar):
    try:
        scalar = isScalar[s.__class__]
    except KeyError:
        t = s.__class__
        try:
            s = iter(s)
        except TypeError:
            scalar = isScalar[t] = True
        else:
            scalar = isScalar[t] = False

    if scalar:
        yield s
    else:
        for elem in s:
            for subelem in flatten_dict(elem, isScalar):
                yield subelem

Neat: "caching" (aka memoizing) IS a good general technique
for optimization, and this is a cool and un-obvious application.


Alex
 

Francis Avila

Peter Otten said:
def flatten_dict(s, isScalar):
    try:
        scalar = isScalar[s.__class__]
    except KeyError:
        ...
        t = s.__class__

Why doesn't the above fail (AttributeError) for classic classes?

        ...
            scalar = isScalar[t] = True
        else:
            scalar = isScalar[t] = False

This part I never would have thought to do. Clever!

...
for flatten_dict(). So you get a speedup at the cost of uglier though
still quite readable code, under the assumption that either all or no
instances of a class are iterable.

That's not always a good assumption to make, if we want to decide whether to
iterate based upon some property of the object (which is actually how I
originally discovered I needed a "flatten").

Perhaps we can combine both conditions and caching optimizations?


def flatten_fastcond(s, isScalar, itercond=lambda o: None):
    try:
        scalar = isScalar[s.__class__]
    except KeyError:

        # If itercond has something to say about s, use *it*, and don't
        # add the lookup to isScalar.

        iterate = itercond(s)
        if not iterate is None:
            scalar = iterate
            if not scalar:
                s = iter(s)

        else:
            t = s.__class__
            try:
                s = iter(s)
            except TypeError:
                scalar = isScalar[t] = True
            else:
                scalar = isScalar[t] = False

    if scalar:
        yield s
    else:
        for elem in s:
            for subelem in flatten_dict(elem, isScalar):
                yield subelem


It's usually true that all instances of a given class are iterable or not,
and it's less inconvenient to add an entry to a dictionary than to write a
conditional test (it's also faster). If we move the itercond block above the
'try', we'll slow down this more common case, but speed things up if s is a
homogeneous structure where itercond does all the work of deciding. There
are probably other things I could say against it if I thought about it, and
we're fast enough anyway.

As expected, given the same test tree (which doesn't exercise itercond), the
speed is about the same.
python timeit.py -c -s"import flatten" "flatten.pro_fastcond()"
100 loops, best of 3: 6.29e+003 usec per loop
python timeit.py -c -s"import flatten" "flatten.pro_dict()"
100 loops, best of 3: 6.29e+003 usec per loop

I'll make up some fancy tree later to test out itercond.
 

Francis Avila

Francis Avila said:
As expected, given the same test tree (which doesn't exercise itercond), the
speed is about the same.
*Ahem*:

def flatten_fastcond(s, isScalar, itercond=lambda o: None):
    ...
            for subelem in flatten_dict(elem, isScalar):
                yield subelem

I thought it was a little strange that it was *exactly* the same, given the
function call...


Corrected (still not much difference, though):

...>python timeit.py -c -s"import flatten" "flatten.pro_dict()"
100 loops, best of 3: 7.22e+003 usec per loop

...>python timeit.py -c -s"import flatten" "flatten.pro_fastcond()"
100 loops, best of 3: 7.33e+003 usec per loop

I'll make up some fancy tree later to test out itercond.

Well, should have made it sooner. I would have noticed an empty list...
 

Bryan

i've embedded python into our product and i'm not sure how to handle this
situation. the following code is a simplification of a function that calls a
python function. i've removed all error checking to further simplify the
problem. the following code works correctly:

void foo(context_t ctx) {
    PyObject *py_interp = NULL;
    PyObject *py_mod = NULL;
    PyObject *py_call_foo = NULL;
    const char *mod;

    PyEval_AcquireLock();
    py_interp = get_interpreter(ctx);
    PyThreadState_Swap(py_interp);

    mod = get_module(ctx);
    py_mod = PyImport_ImportModule(mod);
    py_call_foo = PyObject_GetAttrString(py_mod, "py_foo");
    PyObject_CallFunction(py_call_foo, NULL);

    Py_XDECREF(py_interp);
    Py_XDECREF(py_mod);
    Py_XDECREF(py_call_foo);

    PyThreadState_Swap(NULL);
    PyEval_ReleaseLock();
}

now i have a situation where foo may be called at the same time in different
c threads. furthermore, the call to py_foo might take forever to execute and
foo is then required to return immediately. so i modified the code to
something like this, where foo starts a threaded function foo_threaded in a
new thread so foo can return immediately.

void foo_threaded(void *args) {
    context_t ctx = (context_t)args;  /* context handed over by foo */
    PyObject *py_interp = NULL;
    PyObject *py_mod = NULL;
    PyObject *py_call_foo = NULL;
    const char *mod;

    PyEval_AcquireLock();
    py_interp = get_interpreter(ctx);
    PyThreadState_Swap(py_interp);

    mod = get_module(ctx);
    py_mod = PyImport_ImportModule(mod);
    py_call_foo = PyObject_GetAttrString(py_mod, "py_foo");
    PyObject_CallFunction(py_call_foo, NULL);

    Py_XDECREF(py_interp);
    Py_XDECREF(py_mod);
    Py_XDECREF(py_call_foo);

    PyThreadState_Swap(NULL);
    PyEval_ReleaseLock();
}


void foo(context_t ctx) {
    Thread thread;
    thread = new_thread(foo_threaded, ctx);
    thread.start();
}

this seems like the wrong approach to me. if foo gets called a second time but the call to py_foo hasn't returned yet,
the interpreter is still locked. so how can another py_foo be called? the 2nd time foo is called the call to
get_interpreter may or may not return the same interpreter. what am i missing?

thanks for any help.

bryan
 

Peter Otten

Francis said:
That's not always a good assumption to make, if we want to decide whether
to iterate based upon some property of the object (which is actually how I
originally discovered I needed a "flatten").

Perhaps we can combine both conditions and caching optimizations?

I didn't work this out, but I suppose the way to go is to allow three states
(ALWAYS, NEVER, NOCACHE) in the dictionary and only invoke itercond() on
cache misses and NOCACHE values.
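
A minimal sketch of that three-state scheme (the state names and the
itercond convention, None for "no opinion" and otherwise a boolean meaning
"descend into this instance", are my guesses, not code from the thread):

ALWAYS, NEVER, NOCACHE = range(3)

def flatten_3state(s, cache, itercond=lambda o: None):
    # cache maps a class to ALWAYS (descend), NEVER (scalar), or
    # NOCACHE (consult itercond every time); a missing class is a miss
    t = s.__class__
    state = cache.get(t, NOCACHE)
    if state == ALWAYS:
        scalar = False
    elif state == NEVER:
        scalar = True
    else:
        iterate = itercond(s)                 # None means "no opinion"
        if iterate is not None:
            scalar = not iterate              # per-instance, never cached
        else:
            try:
                iter(s)
            except TypeError:
                scalar = True
            else:
                scalar = False
            if t not in cache:                # a genuine miss gets cached;
                if scalar:                    # a seeded NOCACHE entry
                    cache[t] = NEVER          # stays NOCACHE
                else:
                    cache[t] = ALWAYS
    if scalar:
        yield s
    else:
        for elem in s:
            for subelem in flatten_3state(elem, cache, itercond):
                yield subelem

Seeding the cache with {str: NEVER} reproduces the keepstrings behavior
without any per-item isinstance() test.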

Peter
 

Francis Avila

Peter Otten said:
I didn't work this out, but I suppose the way to go is to allow three states
(ALWAYS, NEVER, NOCACHE) in the dictionary and only invoke itercond() on
cache misses and NOCACHE values.

Peter

Hah, just finished writing the post the moment you
 
