Order of tuples in dict.items()

W

Will McGugan

Hi,

If I have two dictionaries containing identical values, can I be sure
that the items() method will return tuples in the same order?

I tried an experiment with CPython and it does appear to be the case.
>>> a=dict(a=1, b=1, c=2)
>>> b=dict(c=2, a=1, b=1)
>>> a {'a': 1, 'c': 2, 'b': 1}
>>> b {'a': 1, 'c': 2, 'b': 1}
>>> a.items() [('a', 1), ('c', 2), ('b', 1)]
>>> b.items()
[('a', 1), ('c', 2), ('b', 1)]

Can I rely on this behavior?

Regards,

Will McGugan

blog: http://www.willmcgugan.com
 
P

Paul Hankin

If I have two dictionaries containing identical values, can I be sure
that the items() method will return tuples in the same order?
...
Can I rely on this behavior?

No. To quote the python documentation:
Keys and values are listed in an arbitrary order which is
non-random, varies across Python implementations, and
depends on the dictionary's history of insertions and deletions.

In general, looking at documentation is better than experimenting when
you want to know if something is either always true or mostly true.
 
E

Erik Max Francis

Will said:
If I have two dictionaries containing identical values, can I be sure
that the items() method will return tuples in the same order?

I tried an experiment with CPython and it does appear to be the case.
a=dict(a=1, b=1, c=2)
b=dict(c=2, a=1, b=1)
a {'a': 1, 'c': 2, 'b': 1}
b {'a': 1, 'c': 2, 'b': 1}
a.items() [('a', 1), ('c', 2), ('b', 1)]
b.items()
[('a', 1), ('c', 2), ('b', 1)]

Can I rely on this behavior?

Probably not. Dictionaries do not have an ordering that you should
count on. In practice, the order in which you get the items if you
iterate over a dictionary is dependent on the hashing function, which
can potentially change over time.

This is a case where most implementations _probably_ will return
identical dictionaries in the same order when iterated over (of course,
different implementations will have different orderings, but you don't
care about that), but I wouldn't take the chance and rely on such an
implementation detail.

If you want to keep track of the order in which objects were added to a
dictionary, you'll need to keep a separate list of (sorted) keys, which
is easy enough. If you're lazy there are plenty of recipes around for
things like `SortedDictionary` or `sorteddict`.
 
S

Steven D'Aprano

Will said:
If I have two dictionaries containing identical values, can I be sure
that the items() method will return tuples in the same order? [...]
Can I rely on this behavior?

Probably not.

Definitely not. See Paul Hankin's earlier post in this thread.

Dictionaries do not have an ordering that you should
count on. In practice, the order in which you get the items if you
iterate over a dictionary is dependent on the hashing function, which
can potentially change over time.

Well, I suppose it *is* possible for Python to change its hash function
in some future version, but Python hashing is highly optimized, very fast
and not likely to change.

It's especially not likely to change in the middle of a run of your
program, say between calling a.items() and calling b.items().

However, the very nature of hash tables is that they are dependent on the
order that items are added and removed; Python dicts are hash tables, and
sure enough, their order is dependent on the order that items are added
and removed. Here's the simplest example I could come up with:
{-1: 'a', -2: 'b'}


This is a case where most implementations _probably_ will return
identical dictionaries in the same order when iterated over (of course,
different implementations will have different orderings, but you don't
care about that), but I wouldn't take the chance and rely on such an
implementation detail.

I think this is a case of being right for the wrong reasons.

If you want to keep track of the order in which objects were added to a
dictionary, you'll need to keep a separate list of (sorted) keys, which
is easy enough.

Keeping a list of keys *in the order they are added* is not the same as
keeping a list of keys *sorted*. Here's an obvious example:

D = {}
D['b'] = 2
D['a'] = 1
D['c'] = 3

Insertion order: bac
Sorted order: abc

If you're lazy there are plenty of recipes around for
things like `SortedDictionary` or `sorteddict`.

I've never seen the point of a sorted dictionary, it's easy to just say:

for key, value in sorted(D.items())

An OrderedDict, that remembers the order that items are added, might be a
good addition to the standard library. But that would depend on folks
agreeing on its semantics.
 
J

John Machin

I've never seen the point of a sorted dictionary, it's easy to just say:

for key, value in sorted(D.items())

There are several applications that involve finding i such that key
<= query < key[i+1] where the keys are sorted and unique ... perhaps
with a huge sentinel key at the end. Examples include sliding-scale
taxes, and looking up prices for shares or managed funds. The code to
do this in Python using the bisect module is about as clear as the
equivalent SQL :)
 
P

Paul Rubin

Steven D'Aprano said:
I've never seen the point of a sorted dictionary, it's easy to just say:
for key, value in sorted(D.items())

You might want just a subrange of the dictionary (say the 100th
through 150th items in sorted order) without having to sort the entire
dictionary.

Also, with the right data structures you can support fast non-mutating
updates, i.e.

e = d.updated(x=3)

is equivalent to:

e = dict((x,y) for x,y in d.iteritems()) # copy all entries from d
e[x] = 3 # do the update

but the sorted version can work in O(log n) time instead of O(n). The
new dictionary shares most of its contents with the old dictionary,
except now you can no longer mutate one without mutating the other, so
better not do any in-place updates.

Usage case (inspired by Happs State which was inspired by Prevayler):
You are writing a web app that has to remember a lot of request data
and you don't want to use an SQL database (too much overhead). Your
app gets requests containing name/value pairs. It responds with the
old value and then updates its stored value. I.e. your server process
looks like:

d = {}
while True:
name, new_value = get_request()
old_value = d.get(name, None)
d[name] = new_value
send_response(old_value)

This is very efficient (just in-memory dictionary access, no SQL
garbage) but has any obvious problem: what if the computer crashes?
You have to store the stuff on disk, too. So you put:

log_to_disk(name, new_value)

right after the get_request. If the computer crashes, reboot it and
play back the log from the beginning.

But that's no good either, the server could run for years nonstop and
get billions of updates and you don't want to play them back from the
beginning. You want to take a checkpoint now and then, so you can
restart from there:

d = {}
while True:
name, new_value = get_request()
log_to_disk(name, new_value, timestamp())
if checkpoint_now(): # do this once an hour or so
log_to_disk(pickle(d), timestamp())
old_value = d.get(name, None)
d[name] = new_value
send_response(old_value)

Now to restore, you find the newest pickle, read it in, and then apply
any logged updates with a newer timestamp, i.e. just the last hour's worth.

But this is no good either, because d can have millions of items, and
your server freezes while you're pickling and writing it. You want a
separate thread doing that--except d is busily mutating while that
operation is in progress! You can go crazy with locks, rollback
buffers, and all the insanity that SQL servers use, or with the sorted
immutable dictionary, you can simply write:

d = frozendict()
while True:
name, new_value = get_request()
log_to_disk(name, new_value, timestamp())
if checkpoint_now(): # do this once an hour or so
in_new_thread(log_to_disk, (pickle(d), timestamp()))
old_value = d.get(name, None) # look mom, no freeze-up
d = d.updated(name, new_value)
send_response(old_value)

By making d.updated() create a new immutable dict that's separate from
d, you now don't have to worry about locks or synchronization or any
of that crap. When the pickler is done with the old d, its references
are gone and the GC reclaims its storage. Of course you can do more
elaborate checkpointing on subsets of the data etc. with the same
methods.

Really it's the presence of the O(log n) .updated() operation, rather
than sorting, that makes this type of structure interesting, but the
usual implementation involves balanced tree structures so the sorting
comes naturally.
 
E

Erik Jones

Will said:
If I have two dictionaries containing identical values, can I be
sure
that the items() method will return tuples in the same order? [...]
Can I rely on this behavior?

Probably not.

Definitely not. See Paul Hankin's earlier post in this thread.

Dictionaries do not have an ordering that you should
count on. In practice, the order in which you get the items if you
iterate over a dictionary is dependent on the hashing function, which
can potentially change over time.

Well, I suppose it *is* possible for Python to change its hash
function
in some future version, but Python hashing is highly optimized,
very fast
and not likely to change.

It's especially not likely to change in the middle of a run of your
program, say between calling a.items() and calling b.items().

Not between two consecutive reads, no. However, after any resizing
of a dict the result of Python's hash function for any given newly
inserted key is extremely likely to be different than it would have
been before the resizing, i.e. the method may be the same, but the
result is different.

Erik Jones

Software Developer | Emma®
(e-mail address removed)
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
 
J

John Machin

Not between two consecutive reads, no. However, after any resizing
of a dict the result of Python's hash function for any given newly
inserted key is extremely likely to be different than it would have
been before the resizing, i.e. the method may be the same, but the
result is different.

Could you please supply the basis for the above assertion? My reading
of the docs for the built-in hash function, the docs for an object's
__hash__ method, and the source (dictobject.c, intobject.c,
stringobject.c) indicate (as I would have expected) that the hash of
an object is determined solely by the object itself, not by the
history of insertion into a dict (or multiple dicts!?).

Note that position_in dict = some_function(hash(obj),
size_of_dict) ... perhaps you are conflating two different concepts.
 
S

Steven D'Aprano

Could you please supply the basis for the above assertion? My reading of
the docs for the built-in hash function, the docs for an object's
__hash__ method, and the source (dictobject.c, intobject.c,
stringobject.c) indicate (as I would have expected) that the hash of an
object is determined solely by the object itself, not by the history of
insertion into a dict (or multiple dicts!?).

Note that position_in dict = some_function(hash(obj), size_of_dict) ...
perhaps you are conflating two different concepts.


The hash() function doesn't even take a dictionary as an argument -- it
simply can't be dependent on the history of insertions into the
dictionary, because it can't know what dictionary to look at!

But as you say, the position in the dictionary itself depends on the
result of the hash function, and the size of the dictionary, and what's
already in the dict (that is to say, the history of insertions and
deletions). That's how hash tables work.
 
E

Erik Jones

The hash() function doesn't even take a dictionary as an argument
-- it
simply can't be dependent on the history of insertions into the
dictionary, because it can't know what dictionary to look at!

But as you say, the position in the dictionary itself depends on the
result of the hash function, and the size of the dictionary, and
what's
already in the dict (that is to say, the history of insertions and
deletions). That's how hash tables work.

John, sorry, I never saw your reply to my initial posting. What I
was referring to was indeed the some_function in your example, not
the actual hash() function available in the standard library. I was
in no way conflating the two, the confusing is just one over the
terminology as some_function IS a hash function.

Erik Jones

Software Developer | Emma®
(e-mail address removed)
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,968
Messages
2,570,149
Members
46,695
Latest member
StanleyDri

Latest Threads

Top