hashkey/digest for a complex object

K

kj

The short version of this question is: where can I find the algorithm
used by the tuple class's __hash__ method?

Now, for the long version of this question, I'm working with some
complext Python objects that I want to be able to compare for
equality easily.

These objects are non-mutable once they are created, so I would
like to use a two-step comparison for equality, based on the
assumption that I can compute (either at creation time, or as needed
and memoized) a hashkey/digest for each object. The test for
equality of two of these objects would first compare their hashkeys.
If they are different, the two objects are declared different; if
they match, then a more stringent test for equality is performed.

So the problem is to come up with a reasonable hashkey for each of
these objects. The objects have two significant attributes, and
two of these objects should be regarded as equal if these attributes
are "the same" in both. The first attribute is a simple dictionary
whose keys are integers and values are strings. The second attribute
is more complicated. It is a tree structure, represented as a
dictionary of dictionaries of dictionaries... until we get to the
leaf elements, which are frozensets of strings. The keys at every
level of this structure are strings. E.g. a simple example of such
an attribute would look like:

{'A': {'a': set(['1', '2', '3']),
'b': set(['4', '5'])},
'B': set(['6', '7', '8'])}

I'm looking for a good algorithm for computing a hash key for
something like this? (Basically I'm looking for a good way to
combine hashkeys.)

Thanks!

kj
 
D

Diez B. Roggisch

kj said:
The short version of this question is: where can I find the algorithm
used by the tuple class's __hash__ method?

Surprisingly, in the source:

http://google.com/codesearch/p?hl=d...leobject.c&q=python tuplehash&sa=N&cd=1&ct=rc
Now, for the long version of this question, I'm working with some
complext Python objects that I want to be able to compare for
equality easily.

These objects are non-mutable once they are created, so I would
like to use a two-step comparison for equality, based on the
assumption that I can compute (either at creation time, or as needed
and memoized) a hashkey/digest for each object. The test for
equality of two of these objects would first compare their hashkeys.
If they are different, the two objects are declared different; if
they match, then a more stringent test for equality is performed.

So the problem is to come up with a reasonable hashkey for each of
these objects. The objects have two significant attributes, and
two of these objects should be regarded as equal if these attributes
are "the same" in both. The first attribute is a simple dictionary
whose keys are integers and values are strings. The second attribute
is more complicated. It is a tree structure, represented as a
dictionary of dictionaries of dictionaries... until we get to the
leaf elements, which are frozensets of strings. The keys at every
level of this structure are strings. E.g. a simple example of such
an attribute would look like:

{'A': {'a': set(['1', '2', '3']),
'b': set(['4', '5'])},
'B': set(['6', '7', '8'])}

I'm looking for a good algorithm for computing a hash key for
something like this? (Basically I'm looking for a good way to
combine hashkeys.)

Creating tuples from dicts, recursively, and stabilized by using sorted
on items.

Diez
 
G

geremy condra

The short version of this question is: where can I find the algorithm
used by the tuple class's __hash__ method?
From Objects/tuple.c, line 315 in Python3.2:

static long
tuplehash(PyTupleObject *v)
{
register long x, y;
register Py_ssize_t len = Py_SIZE(v);
register PyObject **p;
long mult = 1000003L;
x = 0x345678L;
p = v->ob_item;
while (--len >= 0) {
y = PyObject_Hash(*p++);
if (y == -1)
return -1;
x = (x ^ y) * mult;
/* the cast might truncate len; that doesn't change hash stability */
mult += (long)(82520L + len + len);
}
x += 97531L;
if (x == -1)
x = -2;
return x;
}
Now, for the long version of this question, I'm working with some
complext Python objects that I want to be able to compare for
equality easily.

These objects are non-mutable once they are created, so I would
like to use a two-step comparison for equality, based on the
assumption that I can compute (either at creation time, or as needed
and memoized) a hashkey/digest for each object.  The test for
equality of two of these objects would first compare their hashkeys.
If they are different, the two objects are declared different; if
they match, then a more stringent test for equality is performed.

So the problem is to come up with a reasonable hashkey for each of
these objects.  The objects have two significant attributes, and
two of these objects should be regarded as equal if these attributes
are "the same" in both.  The first attribute is a simple dictionary
whose keys are integers and values are strings.  The second attribute
is more complicated.  It is a tree structure, represented as a
dictionary of dictionaries of dictionaries... until we get to the
leaf elements, which are frozensets of strings.  The keys at every
level of this structure are strings.  E.g. a simple example of such
an attribute would look like:

{'A': {'a': set(['1', '2', '3']),
      'b': set(['4', '5'])},
 'B': set(['6', '7', '8'])}

I'm looking for a good algorithm for computing a hash key for
something like this?  (Basically I'm looking for a good way to
combine hashkeys.)

Sounds like you're trying to hash mutable data structures, which is a
no-no, but assuming you've got that part all figured out then the easy
way out would be to print the whole thing and take the hash of the
resulting string. A (possibly) better way would be to do a Schwartzian
transform[1] and freeze everything. Other approaches may be better,
but those are the first two out of my head.

Geremy Condra

[1] http://en.wikipedia.org/wiki/Schwartzian_transform
 
R

Robert Kern

The short version of this question is: where can I find the algorithm
used by the tuple class's __hash__ method?

The function tuplehash() in Objects/tupleobject.c, predictably enough.
Now, for the long version of this question, I'm working with some
complext Python objects that I want to be able to compare for
equality easily.

These objects are non-mutable once they are created, so I would
like to use a two-step comparison for equality, based on the
assumption that I can compute (either at creation time, or as needed
and memoized) a hashkey/digest for each object. The test for
equality of two of these objects would first compare their hashkeys.
If they are different, the two objects are declared different; if
they match, then a more stringent test for equality is performed.

The most straightforward way to implement __hash__ for a complicated object is
to normalize its relevant data to a tuple of hashable objects and then call
hash() on that tuple. A straightforward way to compare such objects is to
calculate that very same normalized tuple for each one and compare those tuples.
Cache it if necessary. Don't bother hashing to implement __eq__ unless if you
are really optimizing for space and time.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
T

Terry Reedy

These objects are non-mutable once they are created,

See below.
like to use a two-step comparison for equality, based on the
assumption that I can compute (either at creation time, or as needed
and memoized) a hashkey/digest for each object. The test for
equality of two of these objects would first compare their hashkeys.
If they are different, the two objects are declared different; if
they match, then a more stringent test for equality is performed.

I believe Python strings do this (cache the hash). Equality comparison
can check length, hashes, and only then chars.
So the problem is to come up with a reasonable hashkey for each of
these objects. The objects have two significant attributes, and
two of these objects should be regarded as equal if these attributes
are "the same" in both. The first attribute is a simple dictionary
whose keys are integers and values are strings. The second attribute
is more complicated. It is a tree structure, represented as a
dictionary of dictionaries of dictionaries... until we get to the
leaf elements, which are frozensets of strings. The keys at every
level of this structure are strings. E.g. a simple example of such
an attribute would look like:

{'A': {'a': set(['1', '2', '3']),
'b': set(['4', '5'])},
'B': set(['6', '7', '8'])}

If these two attributes, and hence the dicts, are public, then your
instances are mutable.
 
K

kj

In said:
If these two attributes, and hence the dicts, are public, then your
instances are mutable.

I guess I should have written "immutable among consenting adults."

As far as I know, Python does not support private attributes, so
I guess the dicts are public no matter what I do. I suppose that
I can implement "frozendict", but I can't think of any way to
enforce the immutability of these "frozendicts" that would not be
trivial to circumvent (it better be, in fact, otherwise I wouldn't
be able to initialize the damn things).

If you had something else in mind, please let me know.

~kj
 
K

kj

Surprisingly, in the source:

Thanks to you, and to all who pointed me to the place in the source
where this is.

How exactly did you search for this? Taking a hint from the url
above, I went to Google Code Search and searched for "python tuple
hash" (minus the quotes, of course), which produced a ton of
irrelevant stuff (almost 80K hits). Searching for "python tuple
hash lang:c" cut down the number of hits to ~8K, but still too much
to wade through. Clearly I need a smarter search strategy (one
that does not include foreknowledge of the name of the actual
function in the C source, of course).

~kj
 
D

Diez B. Roggisch

kj said:
Thanks to you, and to all who pointed me to the place in the source
where this is.

How exactly did you search for this? Taking a hint from the url
above, I went to Google Code Search and searched for "python tuple
hash" (minus the quotes, of course), which produced a ton of
irrelevant stuff (almost 80K hits). Searching for "python tuple
hash lang:c" cut down the number of hits to ~8K, but still too much
to wade through. Clearly I need a smarter search strategy (one
that does not include foreknowledge of the name of the actual
function in the C source, of course).

I tried codesearch first. Not satisfied after 30 seconds with the
results, I did the most straight forward thing - I downloaded and
un-packed the python source... and took a look.

From that I learned the tuplehash function name.

Diez
 
K

kj

In said:
I tried codesearch first. Not satisfied after 30 seconds with the
results, I did the most straight forward thing - I downloaded and
un-packed the python source... and took a look.
From that I learned the tuplehash function name.

You must be at least somewhat familiar with the Python source.
Everytime I've peeked into it I just feel lost, but it's clearly
something I need to master sooner or later... I can't wait for
the next one of those occasional introductions to the Python
internals at our local Python club.

Thanks,

~kj
 
D

Diez B. Roggisch

kj said:
You must be at least somewhat familiar with the Python source.
Everytime I've peeked into it I just feel lost, but it's clearly
something I need to master sooner or later... I can't wait for
the next one of those occasional introductions to the Python
internals at our local Python club.

No, I'm not the tiniest bit. I just followed my instincts in looking
into the "Objects" folder, because that's where I suspected the
definition of objects to be....

And grep has been proven useful in these cases as well.

Diez
 
M

MRAB

I guess I should have written "immutable among consenting adults."

As far as I know, Python does not support private attributes, so
I guess the dicts are public no matter what I do. I suppose that
I can implement "frozendict", but I can't think of any way to
enforce the immutability of these "frozendicts" that would not be
trivial to circumvent (it better be, in fact, otherwise I wouldn't
be able to initialize the damn things).
You would initialise them by creating them from a list of tuples (or an
iterable which yields tuples), like with a dict:
>>> dict([(1, "foo"), (2, "bar")])
{1: 'foo', 2: 'bar'}
 
A

Arnaud Delobelle

kj said:
The short version of this question is: where can I find the algorithm
used by the tuple class's __hash__ method?

Now, for the long version of this question, I'm working with some
complext Python objects that I want to be able to compare for
equality easily.

These objects are non-mutable once they are created, so I would
like to use a two-step comparison for equality, based on the
assumption that I can compute (either at creation time, or as needed
and memoized) a hashkey/digest for each object. The test for
equality of two of these objects would first compare their hashkeys.
If they are different, the two objects are declared different; if
they match, then a more stringent test for equality is performed.

So the problem is to come up with a reasonable hashkey for each of
these objects. The objects have two significant attributes, and
two of these objects should be regarded as equal if these attributes
are "the same" in both. The first attribute is a simple dictionary
whose keys are integers and values are strings. The second attribute
is more complicated. It is a tree structure, represented as a
dictionary of dictionaries of dictionaries... until we get to the
leaf elements, which are frozensets of strings. The keys at every
level of this structure are strings. E.g. a simple example of such
an attribute would look like:

{'A': {'a': set(['1', '2', '3']),
'b': set(['4', '5'])},
'B': set(['6', '7', '8'])}

I'm looking for a good algorithm for computing a hash key for
something like this? (Basically I'm looking for a good way to
combine hashkeys.)

You could do something like this:


deep_methods = {
list: lambda f, l: tuple(map(f, l)),
dict: lambda f, d: frozenset((k, f(v)) for k, v in d.items()),
set: lambda f, s: frozenset(map(f, s)),
# Add more if needed
}

def apply_method(f, obj):
try:
method = deep_methods[type(obj)]
except KeyError:
return obj
return method(f, obj)

def deepfreeze(obj):
"""Return a 'hashable version' of an object
return apply_method(deepfreeze, obj)

def deephash(obj):
"""Return hash(deepfreeze(obj)) without deepfreezing"""
return hash(apply_method(deephash, obj))

# Example of deepfreezable object:
obj = [1, "foo", {(2, 4): {7, 5, 4}, "bar": "baz"}]

1341422540
 
K

kj

In said:
You could do something like this:
deep_methods = {
list: lambda f, l: tuple(map(f, l)),
dict: lambda f, d: frozenset((k, f(v)) for k, v in d.items()),
set: lambda f, s: frozenset(map(f, s)),
# Add more if needed
}
def apply_method(f, obj):
try:
method = deep_methods[type(obj)]
except KeyError:
return obj
return method(f, obj)
def deepfreeze(obj):
"""Return a 'hashable version' of an object
return apply_method(deepfreeze, obj)
def deephash(obj):
"""Return hash(deepfreeze(obj)) without deepfreezing"""
return hash(apply_method(deephash, obj))
# Example of deepfreezable object:
obj = [1, "foo", {(2, 4): {7, 5, 4}, "bar": "baz"}]
^ ^
| |
`-------`------- what's this?

1341422540


After fixing the missing """ in deepfreeze this code works as
advertised, but I'm mystified by the identity between hash(deepfreeze(...))
and deephash(...). Without some knowledge of the Python internals,
I don't see how this follows.

More specifically, it is not obvious to me that, for example,

hash(frozenset((<whatever>,)))

would be identical to

hash(frozenset((hash(<whatever>),)))

but this identity has held every time I've checked it. Similarly
for other more complicated variations on this theme.

Anyway, thanks for the code. It's very useful.

~kj
 
A

Arnaud Delobelle

kj said:
In said:
You could do something like this:
deep_methods = {
list: lambda f, l: tuple(map(f, l)),
dict: lambda f, d: frozenset((k, f(v)) for k, v in d.items()),
set: lambda f, s: frozenset(map(f, s)),
# Add more if needed
}
def apply_method(f, obj):
try:
method = deep_methods[type(obj)]
except KeyError:
return obj
return method(f, obj)
def deepfreeze(obj):
"""Return a 'hashable version' of an object
return apply_method(deepfreeze, obj)
def deephash(obj):
"""Return hash(deepfreeze(obj)) without deepfreezing"""
return hash(apply_method(deephash, obj))
# Example of deepfreezable object:
obj = [1, "foo", {(2, 4): {7, 5, 4}, "bar": "baz"}]
^ ^
| |
`-------`------- what's this?

This is set literal notation, introduced in Python 3!
In 2.X, you would write set([7, 5, 4])
After fixing the missing """ in deepfreeze this code works as
advertised,

Oops, I must admit I added the docstrings after pasting the code :)
but I'm mystified by the identity between hash(deepfreeze(...)) and
deephash(...). Without some knowledge of the Python internals, I
don't see how this follows.

OK. I haven't actually proved it, but it follows from the following
facts:

1. hash() is an idempotent function, i.e. hash(hash(x)) == hash(x) hold
for any hashable x (this is a simple consequence of the fact that
hash(x) == x for any int x (by 'int' I mean 2.X int)).

2. Container hashable objects compute their hash from the hash of their
elements.

I don't think either of these two facts is documented, but it would be quite
easy to verify them in the Python source (I must admit I have not done
it). And it is difficult to conceive how it could be otherwise.
 
S

Steven D'Aprano

1. hash() is an idempotent function, i.e. hash(hash(x)) == hash(x) hold
for any hashable x (this is a simple consequence of the fact that
hash(x) == x for any int x (by 'int' I mean 2.X int)).

It's a beautiful theory, but, alas, it is not the case.
False

to give only two of an infinite number of counter-examples.

Aside: what do you mean by '2.x int'? Do you mean an int in 2.x versions
before, or after, ints and longs were partially integrated?

[steve@sylar ~]$ python2.1
Python 2.1.3 (#1, Aug 12 2010, 01:53:57)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-27)] on linux2
Type "copyright", "credits" or "license" for more information.Traceback (most recent call last):


People keep forgetting that 2.2 introduced nearly as many far-reaching
changes as 3.0.
 
H

Hrvoje Niksic

Steven D'Aprano said:
It's a beautiful theory, but, alas, it is not the case.

False

This is a counter-example for the (invalid) premise that hash(x) == x,
but not for the invariant of hash(hash(x)) == hash(x).
True

Aside: what do you mean by '2.x int'? Do you mean an int in 2.x versions
before, or after, ints and longs were partially integrated?

I would take it to mean the type 2.x calls 'int', i.e. fixed-width
integer type.
 
A

Arnaud Delobelle

Steven D'Aprano said:
It's a beautiful theory, but, alas, it is not the case.

False

to give only two of an infinite number of counter-examples.

I can only see one counterexample, (-1). 2**64 is of type 'long' in 2.X
on your machine (or, to be pedantic, 2.x where x >= 2). And, in fact,
(-1) is the only int such that hash(x) != x.

In can only guess that (-1) is a value that has a special meaning when
hashing. Try this (Python 2.6):
.... def __hash__(self): return -1
....
-2


Aside: what do you mean by '2.x int'? Do you mean an int in 2.x versions
before, or after, ints and longs were partially integrated?
Either.

[steve@sylar ~]$ python2.1
Python 2.1.3 (#1, Aug 12 2010, 01:53:57)
[GCC 4.1.2 20070925 (Red Hat 4.1.2-27)] on linux2
Type "copyright", "credits" or "license" for more information.Traceback (most recent call last):


People keep forgetting that 2.2 introduced nearly as many far-reaching
changes as 3.0.

I didn't forget, but what bearing does it have on this particular issue?
 
K

kj

In said:
(Infinite???)

And, in fact,
(-1) is the only int such that hash(x) != x.

Arnaud, how did you determine that -1 is the only such int? I
can't imagine any other approach other than a brute-force check of
all ints... When I tried this I ran into unforeseen limitations
in xrange, etc. It's all very doable, but still, it would take at
least about 3 hours on my laptop.
In can only guess that (-1) is a value that has a special meaning when
hashing. Try this (Python 2.6):
... def __hash__(self): return -1
...
-2

Very cool.

BTW, thank you for the explanation in your previous post. It makes
a lot of sense. I find it interesting (as Hrvoje pointed out) that
the hash function is (or appears to be) idempotent on integers
(long or not), even though it is not the identity on the integers.
Thanks to Steven for the counterexamples to show the latter. I've
learned tons from this exchange.

~kj
 
S

Steven D'Aprano

I can only see one counterexample, (-1). 2**64 is of type 'long' in 2.X
on your machine (or, to be pedantic, 2.x where x >= 2). And, in fact,
(-1) is the only int such that hash(x) != x.

Fair point. I was mistaken.

I had the impression that the integration between ints and longs since
Python 2.2 was more extensive than it actually is. I made the above
comments based on the idea that since Python 2.2, longs are a subclass of
ints, e.g. that isinstance(2**64, int) would return True. Turns out that
I'm wrong, in which case I agree with you that -1 is the only int counter-
example.

As an aside, this makes me glad that I have continued writing
isinstance(x, (int, long)) in my code, even though I "knew" it was
unnecessary. Not the first time, and probably not the last, that a "this
can never happen" test saved the day.
 
S

Steven D'Aprano

(Infinite???)

I was mistaken. Given Arnaud's specification that we look only at the
Python 2.x ints, I believe that there is only one counter-example, namely
-1.

However, in Python 3.x I would be correct. The reasoning is a simple
application of the pigeon-hole principle. Since hash(n) is limited to
returning a finite number of distinct results, but there are an infinite
number of Python 3 ints, it follows that there must be an infinite number
of collisions.

Arnaud, how did you determine that -1 is the only such int? I can't
imagine any other approach other than a brute-force check of all ints...

Reading the source code is also a good approach.

I don't read C very well, but even I can work out what this is doing:

static long
int_hash(PyIntObject *v)
{
/* XXX If this is changed, you also need to change the way
Python's long, float and complex types are hashed. */
long x = v -> ob_ival;
if (x == -1)
x = -2;
return x;
}


(from intobject.c in Python 2.6.1). It takes a Python int object,
extracts the value of it as a C long, returns -2 if the value is -1
otherwise returns the value.

When I tried this I ran into unforeseen limitations in xrange, etc.
It's all very doable, but still, it would take at least about 3 hours on
my laptop.


What are you doing? This takes less than 20 seconds to test up to 2**24:

import time
def test(n):
t = time.time()
assert hash(0) == 0
assert hash(1) == 1
assert hash(-1) == -2
for i in xrange(2, n):
assert hash(i) == i, "failed %d" % i
assert hash(-i) == -i, "failed %d" % -i
t = time.time() - t
return t

18.076412916183472

Since 2**31 is 2**7 = 128 times bigger, I estimate that testing
everything up to 2**31 should take less than 45 minutes. And my PC is not
exactly the fastest machine on the block.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,982
Messages
2,570,185
Members
46,738
Latest member
JinaMacvit

Latest Threads

Top