Unicode and dictionaries

gizli

Hi all,

I am using Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41). I
ran into this issue yesterday and wanted to check to see if this is a
python bug. It seems that there is an inconsistency between lists and
dictionaries in the way that unicode objects are handled. Take a look
at the following example:
>>> test_dict = {u'öğe':1}
>>> u'öğe' in test_dict.keys()
True
>>> 'öğe' in test_dict.keys()
True
>>> test_dict[u'öğe']
1
>>> test_dict['öğe']
Traceback (most recent call last):

Is this a bug? has_key functionality of the dictionary works as
expected:
False
 
Steven D'Aprano

Hi all,

I am using Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41). I ran
into this issue yesterday and wanted to check to see if this is a
python bug. It seems that there is an inconsistency between lists and
dictionaries in the way that unicode objects are handled. Take a look at
the following example:

True


I can't reproduce your result, at least not in 2.6.1:
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert
both arguments to Unicode - interpreting them as being unequal
False
 
Carl Banks

I can't reproduce your result, at least not in 2.6.1:


__main__:1: UnicodeWarning: Unicode equal comparison failed to convert
both arguments to Unicode - interpreting them as being unequal
False


The OP changed his default encoding. I was able to confirm the
behavior after setting the default encoding to latin-1.

This is most definitely a bug in Python.


Carl Banks
 
Carl Banks

I would call this a bug. The two objects are different, so the latter
expression should return ‘False’.

Except the two objects are not different if default encoding is utf-8.

(Whether it's a good idea to change the default encoding is another
question, but Python is clearly documented as behaving this way. When
comparing a byte string and a Unicode string, the byte string will be
decoded according to the default encoding.)
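
To make the rule concrete, here is a minimal sketch that emulates the Python 2 comparison on Python 3, where the implicit decoding no longer exists. The helper `py2_style_equal` and its `default_encoding` parameter are hypothetical names for illustration, not part of any library, and treating a failed decode as "unequal" is an assumption that mirrors the UnicodeWarning quoted earlier in the thread.

```python
def py2_style_equal(text, raw, default_encoding):
    """Emulate Python 2's u'...' == '...' comparison (hypothetical helper)."""
    try:
        decoded = raw.decode(default_encoding)
    except UnicodeDecodeError:
        # Python 2 issues a UnicodeWarning here and treats the two as unequal.
        return False
    return decoded == text


text = u'\xf6\u011fe'        # u'öğe', written with escapes
raw = b'\xc3\xb6\xc4\x9fe'   # 'öğe' as typed in a UTF-8 terminal

equal_under_ascii = py2_style_equal(text, raw, 'ascii')  # stock default encoding
equal_under_utf8 = py2_style_equal(text, raw, 'utf-8')   # changed default encoding
```

Under the stock 'ascii' setting the decode fails and the two are treated as unequal; under 'utf-8' they compare equal, which is the situation the OP was in.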

FYI, ‘foo in bar.keys()’ is easier to spell as ‘foo in bar’.

I believe the OP's point was to show that dicts behave differently
than lists here ("in" works for lists, doesn't work for dicts).


Carl Banks
 
Carl Banks

The OP changed his default encoding. I was able to confirm the
behavior after setting the default encoding to latin-1.

This is most definitely a bug in Python.

I've thought it over and I'm not so sure it's a bug now, but it is
highly questionable. Here is a more detailed explanation. The
following session shows why; my terminal is UTF-8.


Python 2.5.4 (r254:67916, Nov 19 2009, 19:46:21)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> reload(sys) # get sys.setdefaultencoding back
>>> sys.setdefaultencoding('utf-8')
>>> u'öğe' == 'öğe'
True
>>> test_dict = {u'öğe':1}
>>> test_dict['öğe']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '\xc3\xb6\xc4\x9fe'


So the source encoding is UTF-8, and you see I've set the default
encoding to UTF-8. You'll notice that u'öğe' and 'öğe' compare equal;
this is entirely correct. Given that UTF-8 is the source encoding,
the string 'öğe' will be read as a byte string holding the UTF-8 encoding
of those Unicode characters. And, given that UTF-8 is also the
default encoding, the byte string will be decoded using UTF-8 for the
comparison, and so will be equal to the Unicode string.
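
The byte values in the KeyError above are simply the UTF-8 encoding of the key. A small sketch, written with escapes so that it does not depend on the source encoding and so that it runs on Python 3; the variable names are illustrative only:

```python
text = u'\xf6\u011fe'        # u'öğe' spelled with escapes
raw = text.encode('utf-8')   # the bytes a UTF-8 terminal produces for 'öğe'

# Five bytes for three characters: ö and ğ each take two bytes in UTF-8.
byte_count = len(raw)

# Decoding with the same encoding gives back the original text.
round_trip_ok = raw.decode('utf-8') == text
```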

Given that the two are equal, the correct behavior for dicts would be
to use the two as the same key. However, it doesn't. In fact the two
objects don't even have the same hash code:
-813744964

This ought to be a bug; objects that compare equal and are hashable
must have the same hash code. However, given that it is crucially
important to be as fast as possible when calculating the hash code of
ASCII strings, I could imagine that this is deliberate. (And if it is,
it should be documented as such; I looked briefly but did not see it.)

I can imagine another buggy possibility as well. test_dict['öğe'] = 2
will add a new key to the above example, but it could overwrite the
key if there's a hash collision, because the objects compare equal.
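
Both outcomes can be reproduced without touching the default encoding. The sketch below uses a hypothetical class, `FixedHashKey`, whose hash code is supplied by hand so that equality and hashing can be made to disagree on purpose. It illustrates the dict behavior only and is not how str and unicode are actually implemented.

```python
class FixedHashKey(object):
    """Hypothetical key type: equality by value, hash chosen by the caller."""

    def __init__(self, value, hash_code):
        self.value = value
        self.hash_code = hash_code

    def __eq__(self, other):
        return isinstance(other, FixedHashKey) and self.value == other.value

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        return self.hash_code


original = FixedHashKey('oge', 1)
different_hash = FixedHashKey('oge', 2)  # equal to original, hashes differently
same_hash = FixedHashKey('oge', 1)       # equal to original, same hash

d = {original: 1}
d[different_hash] = 2   # not recognised as the same key: a second entry appears
size_after_mismatch = len(d)
d[same_hash] = 3        # hash matches and == is true: the first entry is overwritten
size_after_match = len(d)
```

After the last assignment the dict still has two entries, and looking up `original` yields 3.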

All in all, it's a mighty mess. The best advice is to avoid it
altogether and leave the default encoding alone.

Thankfully Python 3 does away with all this nonsense.
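
For comparison, a short sketch of the Python 3 behavior, assuming the interpreter is run without the `-b` flag (which makes such comparisons emit warnings): text and bytes are never implicitly converted, so they neither compare equal nor act as the same key.

```python
text = '\xf6\u011fe'         # 'öğe'
raw = text.encode('utf-8')   # b'\xc3\xb6\xc4\x9fe'

test_dict = {text: 1}

compare_equal = (text == raw)        # no implicit decoding takes place
bytes_key_found = raw in test_dict   # consistent with the comparison above
decoded_key_found = raw.decode('utf-8') in test_dict  # explicit decode works
```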


Carl Banks
 
Carl Banks

They are different, because a Unicode object is *not* encoded in any
character encoding, whereas the byte string object is.

Of course they're different, but that's not relevant to this situation.
What matters is whether they compare equal, which is the only criterion for
whether an object is found in a list. x in s is true if there is some
object m in s for which m == x.
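
The difference between the two containers can be shown with a hypothetical class, `OddKey`, whose instances all compare equal while hashing differently (an assumption made only to mimic the situation in this thread). A list finds the object by `==` alone; a dict consults the hash first and never gets as far as the comparison.

```python
class OddKey(object):
    """Hypothetical type: every instance compares equal, hashes differ."""

    def __init__(self, hash_code):
        self.hash_code = hash_code

    def __eq__(self, other):
        return isinstance(other, OddKey)

    def __hash__(self):
        return self.hash_code


stored = OddKey(1)
probe = OddKey(2)

in_list = probe in [stored]         # linear scan using ==
in_dict = probe in {stored: None}   # hash lookup first, then ==

# In Python 2, .keys() returned a list, so the OP's test was a list scan.
in_keys_list = probe in list({stored: None}.keys())
```

Here `in_list` and `in_keys_list` are true while `in_dict` is false, the same pattern the OP reported.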

If the default encoding and the terminal encoding are both UTF-8 (or
both latin-9), then u'öğe' == 'öğe'. This behavior is documented (PEP
100) and therefore not a bug. Relevant lines:

"Unicode objects should compare equal to other objects after these
other objects have been coerced to Unicode. For strings this means
that they are interpreted as Unicode string using the <default
encoding>."



Carl Banks
 
gizli

Thanks to all of you. This once again proves how deep you can get
yourself into a mess if you mix unicode and string objects in your
code!
 
Martin v. Loewis

This ought to be a bug; objects that compare equal and are hashable
must have the same hash code.

It's not a bug. Changing the default encoding is not really supported,
let alone changing it to anything but latin-1, precisely for the reasons
you discuss.

If you do change the default encoding, Python *will* break. This has
been discussed many times, but some people still think they know better.

Regards,
Martin
 
Martin v. Loewis

Thanks to all of you. This once again proves how deep you can get
yourself into a mess if you mix unicode and string objects in your
code!

The specific issue is that you apparently changed the default encoding.
Don't do that; Python will break if you do.

Regards,
Martin
 
Steven D'Aprano

It's not a bug. Changing the default encoding is not really supported,
let alone changing it to anything but latin-1, precisely for the reasons
you discuss.

If you do change the default encoding, Python *will* break. This has
been discussed many times, but some people still think they know better.


That's specific to CPython though, isn't it? Other implementations may,
or may not, cope with it better?
 
Martin v. Loewis

This ought to be a bug; objects that compare equal and are hashable
That's specific to CPython though, isn't it? Other implementations may,
or may not, cope with it better?

No, that's fairly inherent to the problem. The problems might go away
only if that other implementation doesn't use hashing for dictionaries.
However, this is fairly unlikely, in particular since the language spec
nearly mandates that dictionaries are hash-based (rather than relying
on comparability).
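
A quick illustration of that point, offered as a sketch rather than as a statement about the language specification: a key must be hashable, while being orderable is neither required nor sufficient.

```python
# Complex numbers cannot be ordered, yet they are hashable and work as keys.
by_complex = {1j: 'imaginary unit'}

# Lists can be compared and ordered, yet they are unhashable and are rejected.
lists_are_comparable = [1] < [2]
try:
    by_list = {[1]: 'a list'}
    list_key_accepted = True
except TypeError:
    # Raised because list objects do not provide a hash.
    list_key_accepted = False
```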

Regards,
Martin
 
