S
Steven D'Aprano
I think this is a bug in Python's UTF-8 handling, but I'm not sure.
If I've read the Unicode FAQs correctly, you cannot encode *lone*
surrogate code points into UTF-8:
http://www.unicode.org/faq/utf_bom.html#utf8-5
Sure enough, using Python 3.3:
py> surr = '\udc80'
py> surr.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in
position 0: surrogates not allowed
But reading the previous entry in the FAQs:
http://www.unicode.org/faq/utf_bom.html#utf8-4
I interpret this as meaning that I should be able to encode valid pairs
of surrogates. So if I find a code point that encodes to a surrogate pair
in UTF-16:
py> c = '\N{LINEAR B SYLLABLE B038 E}'
py> surr_pair = c.encode('utf-16be')
py> print(surr_pair)
b'\xd8\x00\xdc\x01'
and then use those same values as the code points, I ought to be able to
encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code
point. But I can't:
py> s = '\ud800\udc01'
py> s.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed
Have I misunderstood? I think that Python is being too strict about
rejecting surrogate code points. It should only reject lone surrogates,
or invalid pairs, not valid pairs. Have I misunderstood the Unicode FAQs,
or is this a bug in Python's handling of UTF-8?
If I've read the Unicode FAQs correctly, you cannot encode *lone*
surrogate code points into UTF-8:
http://www.unicode.org/faq/utf_bom.html#utf8-5
Sure enough, using Python 3.3:
py> surr = '\udc80'
py> surr.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in
position 0: surrogates not allowed
But reading the previous entry in the FAQs:
http://www.unicode.org/faq/utf_bom.html#utf8-4
I interpret this as meaning that I should be able to encode valid pairs
of surrogates. So if I find a code point that encodes to a surrogate pair
in UTF-16:
py> c = '\N{LINEAR B SYLLABLE B038 E}'
py> surr_pair = c.encode('utf-16be')
py> print(surr_pair)
b'\xd8\x00\xdc\x01'
and then use those same values as the code points, I ought to be able to
encode to UTF-8, as if it were the same \N{LINEAR B SYLLABLE B038 E} code
point. But I can't:
py> s = '\ud800\udc01'
py> s.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 0: surrogates not allowed
Have I misunderstood? I think that Python is being too strict about
rejecting surrogate code points. It should only reject lone surrogates,
or invalid pairs, not valid pairs. Have I misunderstood the Unicode FAQs,
or is this a bug in Python's handling of UTF-8?