Sets in Python

T

thebjorn

On Sep 19, 7:17 pm, Steven D'Aprano <st...@REMOVE-THIS-
[...]
(2) Allow the hash of mutable objects to change, which means you can use
mutable objects as keys in dicts but if you change them, you can no
longer find them in the dict. They'll still be there, using up memory,
but you can't get to them.

In the new model, at the time of addition, you need to remember the
key at that time. If it's a list, you make a copy of the items.

Eek! Barf! Gag me with a spoon! etc. etc. :)

And you mean a deep-copy, not just a copy, right?

Or perhaps you were thinking of something like this (mdict ::= mutable
dict):

class mdict(dict):
def __setitem__(self, k, val):
super(mdict,self).__setitem__(`k`, val)
def __getitem__(self, k):
return super(mdict,self).__getitem__(`k`)
def __contains__(self, k):
return super(mdict,self).__contains__(`k`)
def keys(self):
return list(eval(k) for k in super(mdict,self).keys())
def __iter__(self):
for k in super(mdict,self).__iter__():
yield eval(k)
def items(self):
return list((eval(k),v) for k,v in super(mdict,self).items())
def __repr__(self):
items = ', '.join('%s: %s' % (k,repr(v)) for k,v in
self.items())
return '{' + items + '}'

I think it does what you want..?:
m = mdict()
a, b = [], [1,2]
print m {}
m[a] = a
m = b
m {[1, 2]: [1, 2], []: []}
m.keys() [[1, 2], []]
for k in m:

.... m[k].append('foo')
....
m {[1, 2]: [1, 2, 'foo'], []: ['foo']}
m.items() [([1, 2], [1, 2, 'foo']), ([], ['foo'])]
m.values() [[1, 2, 'foo'], ['foo']]
a in m False
a ['foo']
b [1, 2, 'foo']
[] in m True
[1,2] in m True
m[{'key':['val']}] = 'this works too'

It'll work for all keys, k, where eval(`k`) == k, and repr(a) ==
repr(b) when a == b (I'm pretty sure the latter isn't always true for
dicts, although I haven't looked at the implementation.)

-- bjorn
 
T

thebjorn

On Sep 20, 9:50 am, thebjorn <[email protected]>
wrote:

it's bad form to reply to myself, I know, but
def __iter__(self):
for k in super(mdict,self).__iter__():
yield eval(k)

should probably be

def __iter__(self):
return (eval(k) for k in super(mdict,self).__iter__())

I'm still getting used to the power of generator expressions :)

-- bjorn
 
S

Steve Holden

Karthik said:
You answered it yourself later. For a mapping service, hash is just
one way to do things. What you need is for each item in the
collection, a unique key.
How you go from the key to the value is not something a programmer
needs to know.
Your mind is set on thinking on hash alone and hence you don't see
beyond it.
There's a reason for that: the developers (and particularly Tim Peters)
have, to use a phrase Tim is fond of "optimized the snot" out of dict,
and the mechanism is fundamental to much of Python's internals. So don't
expect to be able to tamper wiht it without adversely affectinf performance.
Oh yes. If the keys are all integers (or any set of items that can be
ordered), why not an avl. It has guaranteed O(log N) while a hash in
worst case is O(N). Why you want to tie yourself to the drawbacks of
one datastructure? Understand your goal is not to provide a hash; but
to provide a mapping service.
Possibly so, but the goals also include "excellent speed" and "as wide a
set of keys as is (practically) possible".

How would you suggest Python avoids "[tying itself] to the drawbacks of
one data structure"? Implement different strategies according to the key
values used? That'll surely speed things up, not.

Python provides you with plenty of tools to implement optimized data
structures for your own applications. Arguing for making them language
primitives merely reveals your design inexperience.
the dict must be able to find keys it has

Yes, if you keep thinking hash is the only tool you got.


No. I don't expect. I expect the hash to be different. Why do you keep
thinking it's the mappings responsibility to take care of a changing
key.
Because mappings must do precisely that. Do you actually know what a
mapping *is*?
hash(M1) == hash(M2) if and only if M1 and M2 are equal
hash(M1) != hash(M2) if M1 and M2 are unequal

but also:

if M1 mutates to become equal to M2, hash(M1) must remain the same while
still being different from hash(M2).

That means that hash() now is a non-deterministic function. hash([1,2,3])
will vary according to how the list [1,2,3] was constructed!

Obviously something has to give, because not all of these things are
mutually compatible.
If later the given sequence is different, that's
not the dict's problem.
Data structures don't have problems. Programmers do. And language
designers with sense build languages that minimize the programmers
problems, not maximize them.

Yes, here you talk about a different goal altogether. Here comes the
'arbitrary' part I mentioned.
The choices for the language designer are:

(1) Invent some sort of magical non-deterministic hash function which
always does the Right Thing.

Nope, just say if the new sequence is different, you don't find the
item in the dict.
(2) Allow the hash of mutable objects to change, which means you can use
mutable objects as keys in dicts but if you change them, you can no
longer find them in the dict. They'll still be there, using up memory,
but you can't get to them.

In the new model, at the time of addition, you need to remember the
key at that time. If it's a list, you make a copy of the items.
Right. Simple. And completely impractical.
Of course they can be reached with.. for k in dict...

But that hardly provides the required mapping features. What on earth
are you thinking?.
M = [1, 2, 3]
D = {M: 'parrot'} # pretend this works
D
{[1, 2, 3]: 'parrot'}>>> M.append(4)
{[1, 2, 3, 4]: 'parrot'}>>> D[[1, 2, 3, 4]]

No, in the new way, the key still remains [1, 2, 3]
What was changed is M. Not the key given to dict at the time of
addition.
Again I'm not describing today's behavior; it's in the new way.
I doubt this new way, whatever it is, is going to have many disciples.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: [1, 2, 3, 4]

Try explaining that one to programmers: they can SEE the key in the dict
when they print it, but they can't get it or delete it because the hash
has changed.

No they don't. They see the key at the time of addition ([1,2,3])
Alternative 3 is easy to deal with: simply don't use mutable objects as
keys. That's what Python does. Sure, the programmer sometimes needs to
work around the lack (convert the list into a tuple, or a string, or
pickle it, whatever...) which on rare occasions is hard to do, but at
least there are no mysterious, hard to track down bugs.

When I first saw key's must'be be mutable, it did appear to me to be
mysterious. There was unnecessary tighter coupling between
implementation details and the service exposed to the programmer. (As
I see it, the root cause of all this is, the dict does not make a copy
of the key at the time of item addition, it just makes a new reference
to the same object)
Light dawns. Dicts are optimized for PERFORMANCE! And for very good reasons.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden

Sorry, the dog ate my .sigline
 
D

Dustan

Bad Python 3000 has no immutable type for byte-strings.
The new bytes type cannot serve for dict keys or set members.
Many things one would want to hash are unhashable -- for
example, the results of the hash functions in hashlib.

Are you serious????
 
S

Steven D'Aprano

You answered it yourself later. For a mapping service, hash is just one
way to do things. What you need is for each item in the collection, a
unique key.

And does the mapping get access to that unique key (the key's key)? It
can't keep a mapping of object:key, because if it could do that, it
wouldn't need the key, it could just keep object:payload.

Since the mapping can't know the key's key, it has to ask the key, and
that's what the __hash__ method does.

How you go from the key to the value is not something a programmer needs
to know.

You are correct. All you need to know is that in Python, you can't use
lists and sets as keys directly.

You only need to know about the details of the way dicts look up keys if
you are writing your own class, and you aren't happy with Python's
default hash for instance objects. It isn't a common task.


Your mind is set on thinking on hash alone and hence you don't see
beyond it.

No, these issues exist regardless of the implementation of the mapping.
Whether you use a hash table or a binary tree of some sort, or a linear
linked list, or a database, or folders in a filing cabinet.

The behaviour of mutable objects in a mapping is always problematic,
regardless of the mapping implementation.

Oh yes. If the keys are all integers (or any set of items that can be
ordered), why not an avl. It has guaranteed O(log N) while a hash in
worst case is O(N).

But both the best and average cases are O(1), which beats AVL trees by a
lot. Since dicts are used for Python's internals, they are HIGHLY
optimized and VERY fast. Their O(1) will beat the O(log N) of AVL trees
easily.

Hash tables can also use keys even if they can't be ordered: {1+2j: None}

Why you want to tie yourself to the drawbacks of one
datastructure? Understand your goal is not to provide a hash; but to
provide a mapping service.

No, the goal of a good language designer is to provide the fastest, most
lightweight, simplest, most flexible mapping as a built-in. Any old slow,
heavyweight, complicated, inflexible mapping will not do the job. That
goal is best provided by a hash table.

If people want additional mappings as well, they can be added as
libraries. If those libraries are very useful, they can be added to the
standard library. If they are HUGELY useful, they will become built-ins.

AVL trees are not as flexible or fast as hash tables, and even if they
were, you would *still* need to resolve the difficulties that occur if
the keys mutate.

Yes, if you keep thinking hash is the only tool you got.

Fine, let me re-word it in terms of an AVL.

Suppose you have two lists in an AVL:

L1 = [1, 2, 3]
L2 = [4, 5, 6]
M = avl((L1, True), (L2, False))

The tree M (for mapping) has L1 at the top of the tree, and L2 as the
right node.

But now, EVERY time ANY mutable object changes, Python has to check
whether it is a key in EVERY avl, and if so, re-built the tree. Otherwise
the tree can become corrupt because the AVL invariants are no longer true.

(Consider what happens if we say L1[0] = 999. Now L1 > L2. If you don't
reorder the avl, M[L2] cannot be reached except by an exhaustive search
of every node. That means it is no longer an AVL tree, just an inordered
tree. Might as well save memory and use a linear linked list.)

Needless to say, this will slow down Python just a tad.


Look, you can go back to the archives of 1993 when this was first
discussed. Nothing has changed since then. Mutable keys are still
problematic, regardless of the implementation, and the simplest solution
is to simply prohibit them.

http://www.python.org/search/hypermail/python-1993/0044.html


If you want to use a mutable object as a key, by object identity rather
than value, then the easy way is to wrap it in an instance:

class Wrapper: # don't even need new-style classes
pass # or even an __init__

key = Wrapper()
key.payload = [1, 2, 3]
D = {key: "value"}

But of course you can't look up the dict by value, only by identity. But
that's what you wanted.


Another way is to use this class:

class HashableList(list):
def __hash__(self):
return hash(tuple(self))


Have fun, and remember us when you're debugging your code, because you'll
be doing a lot of it.
 
P

Piet van Oostrum

Steven D'Aprano said:
SD> In other words, if you have two mutable objects M1 and M2, then you
SD> expect:
SD> hash(M1) == hash(M2) if and only if M1 and M2 are equal
SD> hash(M1) != hash(M2) if M1 and M2 are unequal

Huh? Unequal things may hash to the same value. That one of the essential
properties of a hash function.
 
N

Neil Cerutti

In the new model, it should be the value at the time of
addition. That is [1,2] (not [1,2,3]). This does mean a copy
of key in maintained internally in the dict.

A copy!? That has to be a deep copy. Which would make
`dict`\s alot slower and use more memory. Plus you can't store
objects that can't be copied anymore. That doesn't sound like
a good trade off to me.

Python's dict implementation is so deeply unorthoganal (is that a
word?) to Python's basic assignment semantics and clever type
hierarchy that it's hard to even sensibly promote anything other
than the current implementation without completely redesigning
Python.
 
O

OKB (not okblacke)

Steven said:
But of course you can't look up the dict by value, only by
identity. But that's what you wanted.

Actually, if I understand the OP's examples right, he wants to look
up only by value, not by identity.

--
--OKB (not okblacke)
Brendan Barnwell
"Do not follow where the path may lead. Go, instead, where there is
no path, and leave a trail."
--author unknown
 
C

Chris Mellon

Are you serious????

It's a little Chicken Little. The current version of py3k has no
immutable bytes type, but there's work being done even as we speak to
implement it.
 
C

Chris Mellon

Actually, if I understand the OP's examples right, he wants to look
up only by value, not by identity.

He's switched at least twice, as I understand his writing. Currently,
he appears to want to look up by value, copying lists on insertion,
and is ignoring what happens if you mutate the key.
 
T

Terry Reedy

| > > Bad Python 3000 has no immutable type for byte-strings.
| > > The new bytes type cannot serve for dict keys or set members.
| > > Many things one would want to hash are unhashable -- for
| > > example, the results of the hash functions in hashlib.
| >
| > Are you serious????
| >
|
| It's a little Chicken Little. The current version of py3k has no
| immutable bytes type, but there's work being done even as we speak to
| implement it.

To reinforce this comment: CPython3.0 will not be released for at least 9
months. The current 3.0.a1 is openly *experimental*. 3.0.a2 should be
released within a few weeks with an invitation for anyone concerned with
the final product to experiment with it.

GvR's development strategy has been to start small and add what is clearly
useful (rather than start large and trim). 3.0 is being trimmed bit.
Where too much is trimmed, something can get added back.
 
G

Gabriel Genellina

En Thu, 20 Sep 2007 08:46:29 -0300, Steven D'Aprano
Another way is to use this class:

class HashableList(list):
def __hash__(self):
return hash(tuple(self))

....and that will stop working as soon as the list is mutated (which is
exactly what you said before)
 
B

Bryan Olson

Gabriel said:
En Thu, 20 Sep 2007 08:46:29 -0300, Steven D'Aprano


...and that will stop working as soon as the list is mutated (which is
exactly what you said before)

Yup. I had suggested that technique in this thread, but it
doesn't really work. It hashes by state and compares by
state, but the actual key that a dict would store is the
object's identity. If the state has changed by the time
of a dict lookup, the dict will look in the hash-bucket
of the old state, but the object's equality test will
compare against the current state.
Bummer. Sorry.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,999
Messages
2,570,246
Members
46,839
Latest member
MartinaBur

Latest Threads

Top