Interesting list() un-optimization

R

Roy Smith

I stumbled upon an interesting bit of trivia concerning lists and list
comprehensions today.

We use mongoengine as a database model layer. A mongoengine query
returns an iterable object called a QuerySet. The "obvious" way to
create a list of the query results would be:

my_objects = list(my_query_set)

and, indeed, that works. But, then I found this code:

my_objects = [obj for obj in my_query_set]

which seemed a bit silly. I called over the guy who wrote it and asked
him why he didn't just write it using list(). I was astounded when it
turned out there's a good reason!

Apparently, list() has an "optimization" where it calls len() on its
argument to try and discover the number of items it's going to put into
the list. Presumably, list() uses this information to pre-allocate the
right amount of memory the first time, without any resizing. If len()
fails, it falls back to just iterating and resizing as needed.
Normally, this would be a win.

The problem is, QuerySets have a __len__() method. Calling it is a lot
faster than iterating over the whole query set and counting the items,
but it does result in an additional database query, which is a lot
slower than the list resizing! Writing the code as a list comprehension
prevents list() from trying to optimize when it shouldn't!
 
D

Dave Angel

I stumbled upon an interesting bit of trivia concerning lists and list
comprehensions today.

We use mongoengine as a database model layer. A mongoengine query
returns an iterable object called a QuerySet. The "obvious" way to
create a list of the query results would be:

my_objects = list(my_query_set)

and, indeed, that works. But, then I found this code:

my_objects = [obj for obj in my_query_set]

which seemed a bit silly. I called over the guy who wrote it and asked
him why he didn't just write it using list(). I was astounded when it
turned out there's a good reason!

Apparently, list() has an "optimization" where it calls len() on its
argument to try and discover the number of items it's going to put into
the list. Presumably, list() uses this information to pre-allocate the
right amount of memory the first time, without any resizing. If len()
fails, it falls back to just iterating and resizing as needed.
Normally, this would be a win.

The problem is, QuerySets have a __len__() method. Calling it is a lot
faster than iterating over the whole query set and counting the items,
but it does result in an additional database query, which is a lot
slower than the list resizing! Writing the code as a list comprehension
prevents list() from trying to optimize when it shouldn't!

That is very interesting. list() assumes the __len__() method would be
very quick.

Perhaps list() should take an optional second argument that specifies
the initial length to allocate. That way code that either doesn't want
__len__() to be used, or that already knows a reasonable number to use,
can supply the value to preallocate.
 
T

Tim Chase

I stumbled upon an interesting bit of trivia concerning lists and
list comprehensions today.

I agree with Dave Angel that this is interesting. A little testing
shows that this can be rewritten as

my_objects = list(iter(my_query_set))

which seems to then skip the costly __len__ call. Performance geeks
are welcome to time it against the list-comprehension version :)

-tkc


class Foo(object):
def __init__(self):
self.items = range(10)
def __iter__(self):
return iter(self.items)
def __len__(self):
print "Calling costly __len__"
return len(self.items)

print "Ensuring we can iterate over it:"
for x in Foo():
print x

print "\nJust call list():"
lst = list(Foo())
print lst

print "\nCall list(iter())"
lst = list(iter(Foo()))
print lst
 
K

Kev Dwyer

Roy said:
I stumbled upon an interesting bit of trivia concerning lists and list
comprehensions today.

We use mongoengine as a database model layer. A mongoengine query
returns an iterable object called a QuerySet. The "obvious" way to
create a list of the query results would be:

my_objects = list(my_query_set)

and, indeed, that works. But, then I found this code:

my_objects = [obj for obj in my_query_set]

which seemed a bit silly. I called over the guy who wrote it and asked
him why he didn't just write it using list(). I was astounded when it
turned out there's a good reason!

Apparently, list() has an "optimization" where it calls len() on its
argument to try and discover the number of items it's going to put into
the list. Presumably, list() uses this information to pre-allocate the
right amount of memory the first time, without any resizing. If len()
fails, it falls back to just iterating and resizing as needed.
Normally, this would be a win.

The problem is, QuerySets have a __len__() method. Calling it is a lot
faster than iterating over the whole query set and counting the items,
but it does result in an additional database query, which is a lot
slower than the list resizing! Writing the code as a list comprehension
prevents list() from trying to optimize when it shouldn't!


Interesting discovery. Yet isn't this as much an issue with the mongoengine
library as with list()? Queryset.count() can be called if the "length" of a
resultset needs to be retrieved, so the __len__() methd seems redundant.
And given that it's not unheard of to call list() on iterables, perhaps the
library designers should either optimise the __len__() method, or document
the performance implications of calling list on the queryset?

Anyway, thanks for this thought-provoking post.

Cheers,

Kev
 
W

Wolfgang Maier

Tim Chase said:
A little testing
shows that this can be rewritten as

my_objects = list(iter(my_query_set))

which seems to then skip the costly __len__ call. Performance geeks
are welcome to time it against the list-comprehension version

class Foo(object):
def __init__(self):
self.items = range(10)
def __iter__(self):
return iter(self.items)
def __len__(self):
print "Calling costly __len__"
return len(self.items)

Well, it skips the costly len() call because your iter(Foo()) returns
iter(range()) under the hood and list() uses that object's __len__() method. In
most cases, such a workaround will not be feasible. Why should iter(QuerySet())
have a faster __len__() method defined than QuerySet() itself. Most likely,
iter(QuerySet()) just returns self anyway?
Best,
Wolfgang
 
I

Ian Kelly

Well, it skips the costly len() call because your iter(Foo()) returns
iter(range()) under the hood and list() uses that object's __len__() method.

Iterators do not generally have __len__ methods.
Traceback (most recent call last):
In
most cases, such a workaround will not be feasible. Why should iter(QuerySet())
have a faster __len__() method defined than QuerySet() itself.

iter(QuerySet()) should not have any __len__ method defined at all,
which is why the optimization would be skipped.
Most likely,
iter(QuerySet()) just returns self anyway?

But on this point, you are correct. The mongoengine QuerySet.__iter__
method is defined as:

def __iter__(self):
self.rewind()
return self

This is unfortunate design. Not only does it mean that the iterator's
__len__ method cannot be trusted (what should the __len__ of a
partially exhausted iterator return?), but it also means that requesting
an iterator over the QuerySet will also silently invalidate any
existing iterators.
 
I

Ian Kelly

Am 07.03.2013 17:00, schrieb Ian Kelly:

But iterators have a length hint method that are used for some
optimizations and preallocations, too.

10

See http://www.python.org/dev/peps/pep-0424/

Didn't know about that, thanks. Presumably a proper iter(QuerySet())
object could implement __length_hint__ in an efficient manner rather
than by just calling the __len__ of the underlying QuerySet,
 
I

Ian Kelly

And how exactly would it do that, without either doing what __len__ does or
reading the whole result set into memory?

If the underlying cursor provides its own efficient length hint, it
could return that. Or if the query is result-limited, use the limit
as a length hint, provided it's not absurdly large. And if you really
can't efficiently determine anything about the length of the result
set at all, you can always fall back on returning NotImplemented.
 
T

Terry Reedy

But on this point, you are correct. The mongoengine QuerySet.__iter__
method is defined as:

def __iter__(self):
self.rewind()
return self

This is unfortunate design. Not only does it mean that the iterator's
__len__ method cannot be trusted (what should the __len__ of a
partially exhausted iterator return?), but it also means that requesting
an iterator over the QuerySet will also silently invalidate any
existing iterators.

I view that design as a violation of the iterator protocol and hence a
program bug. __iter__ should either *just* return self (if the self is
an iterator) or return a new object (if self is a non-iterator
iterable). File objects are iterators and .__iter__ does not rewind.
" def __bytes__(self): return b'hello'\n"
 
W

Wolfgang Maier

very interesting (hadn't heard of it)! Just checked the PEP,
then tested list()'s behavior, and it is just as described:

class stupid(list):
def __len__(self):
print ('len() called')
return NotImplemented

def __length_hint__(self):
print ('hint requested')
l=iter(self).__length_hint__()
print (l)
return l

a=stupid((1,2,3))
len(d)
======>
len() called

Traceback (most recent call last):
File "<pyshell#79>", line 1, in <module>
len(d)
TypeError: an integer is required

list(d)
======>
len() called
hint requested
3
[1, 2, 3]

so list() first tries to call the iterable's __len__ method. If that raises a
TypeError it falls back to __length_hint__ .
What I still don't know is how the listiterator object's __length_hint__ works.
Why, in this case, does it know that it has a length of 3 ? The PEP does not
provide any hint how a reasonable hint could be calculated.
And how exactly would it do that, without either doing what __len__ does or
reading the whole result set into memory?

Stefan

a very good question.

Best,
Wolfgang
 
T

Terry Reedy

very interesting (hadn't heard of it)! Just checked the PEP,
then tested list()'s behavior, and it is just as described:

class stupid(list):
def __len__(self):
print ('len() called')
return NotImplemented

def __length_hint__(self):
print ('hint requested')
l=iter(self).__length_hint__()
print (l)
return l

a=stupid((1,2,3))
len(d)
======>
len() called

Traceback (most recent call last):
File "<pyshell#79>", line 1, in <module>
len(d)
TypeError: an integer is required

list(d)
======>
len() called
hint requested
3
[1, 2, 3]

so list() first tries to call the iterable's __len__ method. If that raises a
TypeError it falls back to __length_hint__ .
What I still don't know is how the listiterator object's __length_hint__ works.
Why, in this case, does it know that it has a length of 3 ? The PEP does not
provide any hint how a reasonable hint could be calculated.

'How' depends on the iterator, but when it has an underlying concrete
iterable of known length, it should be rather trivial as .__next__ will
explicitly or implicitly use the count remaining in its operation. Part
of the justification of adding __length_hint__ is that in many cases it
is so easy. Iterators based on an iterator with a length_hint can just
pass it along.

The list iterator might work with a C pointer to the next item and a
countdown count initialized to the list length. The __next__ method
might be something like this mixed C and Python pseudocode:

if (countdown--) return *(current--);
else raise StopIteration

(For non-C coders, postfix -- decrements value *after* retrieving it.)
Then __length_hint__ would just return countdown. Tuples would work the
same way. Sets and dicts would need current-- elaborated to skip over
empty hash table slots. Range iterators would add the step to current,
instead of 1.
 
S

Steven D'Aprano

I stumbled upon an interesting bit of trivia concerning lists and list
comprehensions today.

We use mongoengine as a database model layer. A mongoengine query
returns an iterable object called a QuerySet. The "obvious" way to
create a list of the query results would be:

my_objects = list(my_query_set)

and, indeed, that works. But, then I found this code:

my_objects = [obj for obj in my_query_set]

which seemed a bit silly. I called over the guy who wrote it and asked
him why he didn't just write it using list(). I was astounded when it
turned out there's a good reason!

And why was that not documented in the code?

I hope you took this fellow out and gave him a damned good thrashing!
 
R

Roy Smith

Steven D'Aprano said:
I stumbled upon an interesting bit of trivia concerning lists and list
comprehensions today.

We use mongoengine as a database model layer. A mongoengine query
returns an iterable object called a QuerySet. The "obvious" way to
create a list of the query results would be:

my_objects = list(my_query_set)

and, indeed, that works. But, then I found this code:

my_objects = [obj for obj in my_query_set]

which seemed a bit silly. I called over the guy who wrote it and asked
him why he didn't just write it using list(). I was astounded when it
turned out there's a good reason!

And why was that not documented in the code?

I hope you took this fellow out and gave him a damned good thrashing!

Nah. I'm a pacifist. Besides, he's my boss :)
 
R

Roy Smith

Roy Smith said:
The problem is, QuerySets have a __len__() method. Calling it is a lot
faster than iterating over the whole query set and counting the items,
but it does result in an additional database query, which is a lot
slower than the list resizing! Writing the code as a list comprehension
prevents list() from trying to optimize when it shouldn't!

Hmmm, I think I've found a good solution.

It turns out, we don't actually use QuerySet in our models. We've
defined our own QuerySet subclass which adds a few convenience methods.
Adding

def __len__(self):
raise NotImplemented

to our subclass should do the job. It looks like list() respects that,
calls __iter__(), and does the right thing. I can't find any place
where that behavior for list() is documented, but it's logical and
experimentally, it seems to work.

Can anybody see any downside to this?
 
T

Terry Reedy

Hmmm, I think I've found a good solution.

It turns out, we don't actually use QuerySet in our models. We've
defined our own QuerySet subclass which adds a few convenience methods.
Adding

def __len__(self):
raise NotImplemented

to our subclass should do the job. It looks like list() respects that,
calls __iter__(), and does the right thing. I can't find any place
where that behavior for list() is documented,

It is a cpython implementation detail that probably has changed with the
versions.
but it's logical and experimentally, it seems to work.
Can anybody see any downside to this?

No. Give it a try.
 
R

Roy Smith

It turns out, we don't actually use QuerySet in our models. We've
defined our own QuerySet subclass which adds a few convenience methods.
Adding

def __len__(self):
raise NotImplemented

to our subclass should do the job. It looks like list() respects that,
calls __iter__(), and does the right thing. I can't find any place
where that behavior for list() is documented,

It is a cpython implementation detail that probably has changed with the
versions.[/QUOTE]

Yeah, that's what I was afraid of. The "obvious" solution of:

class QuerySet(mongoengine.queryset.QuerySet):
def __init__(self, document, collection):
super(QuerySet, self).__init__(document, collection)
[...]
del self.__len__

results in:

[rest of stack dump elided]
del self.__len__
AttributeError: __len__

which I don't understand.
 
S

Steven D'Aprano

Yeah, that's what I was afraid of. The "obvious" solution of:

class QuerySet(mongoengine.queryset.QuerySet):
def __init__(self, document, collection):
super(QuerySet, self).__init__(document, collection) [...]
del self.__len__

results in:

[rest of stack dump elided]
del self.__len__
AttributeError: __len__

which I don't understand.

You don't define a per-instance custom QuerySet attribute called __len__,
so you can't delete it from the instance.

The existing __len__ attribute is attached to the mongoengine
queryset.QuerySet class, not the instance. You could monkeypatch the
parent class, but that will probably break something else:

del mongoengine.queryset.QuerySet.__len__


or you could try over-riding __len__ with a fake that pretends it doesn't
exist:


def __len__(self):
raise AttributeError


Try that and see it it works. (It may not.)
 
R

Roy Smith

Steven D'Aprano said:
Yeah, that's what I was afraid of. The "obvious" solution of:

class QuerySet(mongoengine.queryset.QuerySet):
def __init__(self, document, collection):
super(QuerySet, self).__init__(document, collection) [...]
del self.__len__

results in:

[rest of stack dump elided]
del self.__len__
AttributeError: __len__

which I don't understand.

You don't define a per-instance custom QuerySet attribute called __len__,
so you can't delete it from the instance.

The existing __len__ attribute is attached to the mongoengine
queryset.QuerySet class, not the instance. You could monkeypatch the
parent class, but that will probably break something else:

del mongoengine.queryset.QuerySet.__len__


or you could try over-riding __len__ with a fake that pretends it doesn't
exist:


def __len__(self):
raise AttributeError


Try that and see it it works. (It may not.)

Yeah, that was one of the things I experimented with. It seems to work,
but like most of these other things, I suspect it's version specific.
If list() does:

try:
l = obj.__len__()
except AttributeErrror:
l = 0

then we're good. On the other hand, if it does:

if obj.getattr('__len__', None):
# etc

then it fails.

At this point, I'm thinking the monkeypatch is the right solution.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,981
Messages
2,570,188
Members
46,731
Latest member
MarcyGipso

Latest Threads

Top