heapq "key" arguments

J

Joshua Bronson

According to http://docs.python.org/library/heapq.html, Python 2.5
added an optional "key" argument to heapq.nsmallest and
heapq.nlargest. I could never understand why they didn't also add a
"key" argument to the other relevant functions (heapify, heappush,
etc). Say I want to maintain a heap of (x, y) pairs sorted only by
first coordinate. Without being able to pass key=itemgetter(0), won't
heapifying a list of such pairs unnecessarily compare both
coordinates? And worse, if the second coordinate is actually an object
with no ordering defined for it, heapifying will cause an error even
though all I care about is sorting by the first coordinate, which does
have an ordering. Am I missing something?
 
J

Jonathan Gardner

Say I want to maintain a heap of (x, y) pairs sorted only by
first coordinate. Without being able to pass key=itemgetter(0), won't
heapifying a list of such pairs unnecessarily compare both
coordinates?

It will compare the second value only if the first values are equal.
 
J

Joshua Bronson

It will compare the second value only if the first values are equal.

I don't see how that helps. That's not at all the same thing as being
able to pass key=itemgetter(0).
 
G

Gabriel Genellina

I don't see how that helps. That's not at all the same thing as being
able to pass key=itemgetter(0).

Ok, it's not strictly the same, but usually it doesn't hurt. The heaqp
module doesn't promise anything about equal elements: it may keep the
original order, rearrange them at will, reverse them, whatever. So the
element-wise comparison of tuples is as good as comparing only the first
element - *except* when comparing the second element isn't cheap or has
side effects or something like that. In that case, use a custom class to
redefine comparison operators:

from heapq import heapify, heappop

class tuplebyfirst(tuple):
"tuple that sorts only on first element"
def __lt__(self, other):
return self[0]<other[0]

heap = [tuplebyfirst(x) for x in [(2, 3, 'A'), (1, 4, 'B'),
(2, 5, 'C'), (2, 5, 'D'), (3, 0, 'E')]]
print heap
heapify(heap)
while heap:
print heappop(heap)

Output:

[(2, 3, 'A'), (1, 4, 'B'), (2, 5, 'C'), (2, 5, 'D'), (3, 0, 'E')]
(1, 4, 'B')
(2, 5, 'C')
(2, 5, 'D')
(2, 3, 'A')
(3, 0, 'E')

The letters are just to recognize the original ordering.
(I've used an undocumented property: all heapq functions compare elements
ONLY by using "<", in 2.6.2 at least. Defining all the other rich
comparison methods doesn't change anything)
 
R

Raymond Hettinger

[Duncan Booth]
The documentation doesn't say anything directly about stability, but the
implementation is actually stable. You can probably assume it must be at
least for nlargest and nsmallest otherwise the stated equivalence wouldn't
hold:

e.g. nsmallest documentation says:

        Equivalent to: sorted(iterable, key=key)[:n]

Yes. The code for nsmallest and nlargest preserves stability
so that the equivalence is maintained.


Raymond
 
R

Raymond Hettinger

[Joshua Bronson]:
According tohttp://docs.python.org/library/heapq.html, Python 2.5
added an optional "key" argument to heapq.nsmallest and
heapq.nlargest. I could never understand why they didn't also add a
"key" argument to the other relevant functions (heapify, heappush,
etc).

The problem is that heapq acts on regular lists, so it does not
have exclusive access to the structure. So, there is no reliable
way for it to maintain a separate list of keys. Since the keys
can't be saved in the structure (without possibly breaking other
code), the fine grained heapq functions (like heappop and heappush)
would need to call key functions every time they are invoked.
This is at odds with the implicit guarantee of the key function
that it will be called no more than once per key.

The overall problem is one of granularity. A key function should
be applied once in an initial pass, not on every call to a push/pop
function. The everyday solution that most people use is to operate
on a list of (key, record) tuples and let tuple comparison do the
work for you. Another solution is to build a Heap class that does
have exclusive access to the structure, but the API sugar often
isn't worth the slightly weaker performance.

One other thought. Heaps are a lazy evaluation structure, so their
fined-grained mutation functions only work well with just a single
ordering function, so there is not need to have (and every reason
to avoid) changing key functions in mid-stream. IOW, the key
function needs to be constant across all accesses. Contrast this
with other uses of key functions where it makes perfect sense
to run minage=min(data, key=attrgetter('age')) and then running
minsal=min(data, key=attrgetter('salary')). The flexibility to
change key functions just doesn't make sense in the context of
the fine-grained heap functions.

Accordingly, this is why I put key functions in nlargest() and
nsmallest() but not in heappush() and friends. The former can
guarantee no more than one key function call per entry and they
evaluate immediately instead of lazily.


Raymond
 
J

Joshua Bronson

[Joshua Bronson]:
According tohttp://docs.python.org/library/heapq.html, Python 2.5
added an optional "key" argument to heapq.nsmallest and
heapq.nlargest. I could never understand why they didn't also add a
"key" argument to the other relevant functions (heapify, heappush,
etc).

The problem is that heapq acts on regular lists, so it does not
have exclusive access to the structure.  So, there is no reliable
way for it to maintain a separate list of keys.  Since the keys
can't be saved in the structure (without possibly breaking other
code), the fine grained heapq functions (like heappop and heappush)
would need to call key functions every time they are invoked.
This is at odds with the implicit guarantee of the key function
that it will be called no more than once per key.

The overall problem is one of granularity.  A key function should
be applied once in an initial pass, not on every call to a push/pop
function.  The everyday solution that most people use is to operate
on a list of (key, record) tuples and let tuple comparison do the
work for you.  Another solution is to build a Heap class that does
have exclusive access to the structure, but the API sugar often
isn't worth the slightly weaker performance.

One other thought.  Heaps are a lazy evaluation structure, so their
fined-grained mutation functions only work well with just a single
ordering function, so there is not need to have (and every reason
to avoid) changing key functions in mid-stream.  IOW, the key
function needs to be constant across all accesses.  Contrast this
with other uses of key functions where it makes perfect sense
to run minage=min(data, key=attrgetter('age')) and then running
minsal=min(data, key=attrgetter('salary')).  The flexibility to
change key functions just doesn't make sense in the context of
the fine-grained heap functions.

Accordingly, this is why I put key functions in nlargest() and
nsmallest() but not in heappush() and friends.  The former can
guarantee no more than one key function call per entry and they
evaluate immediately instead of lazily.

Raymond


I see, that makes sense. Thanks for the great explanation.

Josh
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,995
Messages
2,570,230
Members
46,817
Latest member
DicWeils

Latest Threads

Top