John Perks and Sarah Mount
(My Python uses UTF-16 natively; can someone with a UTF-32 Python let me
know if it behaves differently?)
>>> u'\ud800'
u'\ud800'
>>> codecs.utf_16_be_encode(_)[0]
'\xd8\x00'
>>> codecs.utf_16_be_decode(_)[0]
Traceback (most recent call last):
  File "<input>", line 1, in ?
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 0-1:
unexpected end of data
If the bytes can't be recognized as UTF-16, then surely the codec
shouldn't have allowed them to be encoded in the first place? I could
understand the failure if the bytes were being decoded into (native) UTF-32.
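(For what it's worth, later Pythons made the behaviour symmetric: by default a lone surrogate now fails on encode as well as decode, and the `surrogatepass` error handler, which I believe gained UTF-16 support in Python 3.4, is needed to round-trip one. A minimal sketch of that behaviour:)

```python
lone = '\ud800'  # a lone high surrogate

# Encoding now fails by default, matching the decode behaviour:
try:
    lone.encode('utf-16-be')
except UnicodeEncodeError as exc:
    print('encode failed:', exc.reason)

# The 'surrogatepass' error handler round-trips it explicitly:
raw = lone.encode('utf-16-be', errors='surrogatepass')
print(raw)  # b'\xd8\x00'
back = raw.decode('utf-16-be', errors='surrogatepass')
print(back == lone)  # True
```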
On a similar note, if you are using UTF-32 natively, are you allowed to
have raw surrogate escape sequences (paired or otherwise) in Unicode
literals?
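(On a modern wide-string Python 3, which is what I'm assuming here, such literals are accepted, but a surrogate pair written as two escapes is not joined into a single code point:)

```python
# Surrogate escapes are legal in literals, but a pair stays as
# two separate code points rather than being combined:
pair = '\ud800\udc00'
print(len(pair))  # 2

# The astral character itself must be written with a \U escape:
astral = '\U00010000'
print(len(astral))  # 1
print(pair == astral)  # False
```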
Thanks
John