unicode surrogates in py2.2/win

Mike Brown · Mar 8, 2005

In mid-October 2004, Jeff Epler helped me here with this string iterator:

def chars(s):
    """
    This generator function helps iterate over the characters in a
    string. When the string is unicode and a surrogate pair is
    encountered, the pair is returned together, regardless of whether
    Python was built with UCS-4 ('wide') or UCS-2 code values for
    its internal representation of unicode. This function will raise a
    ValueError if it detects an illegal surrogate pair.
    """
    if isinstance(s, str):
        for i in s:
            yield i
        return
    s = iter(s)
    for i in s:
        if u'\ud800' <= i < u'\udc00':
            try:
                j = s.next()
            except StopIteration:
                raise ValueError("Bad pair: string ends after %r" % i)
            if u'\udc00' <= j < u'\ue000':
                yield i + j
            else:
                raise ValueError("Bad pair: %r (bad second half)" % (i+j))
        elif u'\udc00' <= i < u'\ue000':
            raise ValueError("Bad pair: %r (no first half)" % i)
        else:
            yield i

I have since discovered that I can't use it on Python 2.2 on Windows because
of some weird module import bug caused by the surrogate code values expressed
in the Python code as u'\ud800' and u'\udc00' -- apparently the string
literals are being coerced to UTF-8 internally, which results in an invalid
byte sequence upon import of the module containing this function.

A simpler test case demonstrates the symptom:

C:\dev\test>echo x = u'\ud800' > testd800.py

C:\dev\test>cat testd800.py
x = u'\ud800'

C:\dev\test>python -c "import testd800"

C:\dev\test>python -c "import testd800"
Traceback (most recent call last):
  File "<string>", line 1, in ?
UnicodeError: UTF-8 decoding error: unexpected code byte

C:\dev\test>python testd800.py

C:\dev\test>python testd800.py

Very strange how it only shows up after the 1st import attempt seems to
succeed, and it doesn't ever show up if I run the code directly or run the
code in the command-line interpreter.

The error does not occur with u'\ud800\udc00' or u'\ue000' or any other valid
sequence.

In my function I can use "if u'\ud7ff' > i ..." to work around the d800 case,
but I can't use the same trick for the dc00 case. I will have to go back to
calling ord(i) and comparing against integers. IIRC the explicit ord() call
slowed things down a bit, though, so I'd like to avoid it if I can.
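
For reference, the ord() fallback mentioned above could look roughly like the sketch below. The helper name surrogate_kind is made up for illustration and is not part of the original function; it only shows that comparing integer code values keeps every surrogate literal out of the source file.

```python
def surrogate_kind(c):
    """Classify one character by its integer code value.

    Hypothetical helper: returns "high" for a first-half surrogate,
    "low" for a second-half surrogate, and None for anything else.
    """
    n = ord(c)
    if 0xD800 <= n < 0xDC00:
        return "high"
    if 0xDC00 <= n < 0xE000:
        return "low"
    return None
```

The cost is one ord() call per character, which is the slowdown described above.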

Can anyone tell me what's causing this, or point me to a reference to show
when it was fixed? I'm using 2.2.1 and I couldn't find mention of it in any
release notes up through 2.3. Any other comments/suggestions (besides "stop
supporting narrow unicode builds of Py 2.2") would be appreciated, too. Thanks

-Mike

Guest · Mar 8, 2005

Mike said:
> Very strange how it only shows up after the 1st import attempt seems to
> succeed, and it doesn't ever show up if I run the code directly or run the
> code in the command-line interpreter.

The reason for that is that the Python byte code stores the Unicode
literal in UTF-8. The first time, the byte code is generated, and an
unpaired surrogate is written to disk. The next time, the compiled byte
code is read back in, and the codec complains about the unpaired
surrogate.
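
The same write-then-read asymmetry can be imitated on a current Python 3. This is only an analogy under stated assumptions, not the 2.2 marshal code: the surrogatepass error handler stands in for the lenient writer, and the default strict UTF-8 decoder stands in for the reader that complains.

```python
# Build the lone high surrogate from its integer value, so this file
# itself contains no surrogate literal.
lone = chr(0xD800)

# Lenient write: "surrogatepass" encodes the surrogate as an ordinary
# three-byte sequence.
raw = lone.encode("utf-8", "surrogatepass")

# Strict read: the default UTF-8 decoder rejects those bytes.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(repr(raw), strict_ok)
```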

> Can anyone tell me what's causing this, or point me to a reference to show
> when it was fixed?

In Misc/NEWS, we have, for 2.3a1:

- The UTF-8 codec will now encode and decode Unicode surrogates
  correctly and without raising exceptions for unpaired ones.

Essentially, Python now allows surrogates to occur in UTF-8 encodings.

> I'm using 2.2.1 and I couldn't find mention of it in any
> release notes up through 2.3. Any other comments/suggestions (besides "stop
> supporting narrow unicode builds of Py 2.2") would be appreciated, too. Thanks

I see two options. One is to compile the code with exec at import time, so
that the Unicode literals are never written to the byte code file. Put

exec """

before the code, and

"""

after it. The other option is to use variables instead of literals:

surr1 = unichr(0xd800)
surr2 = unichr(0xdc00)
surr3 = unichr(0xe000)
def chars(s, surr1=surr1, surr2=surr2, surr3=surr3):
    ....
    if surr1 <= i < surr2:
        ...
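
Filled in, the variables approach might look like the sketch below. The body is adapted from the function in the original post, not something spelled out here. Assumptions: the byte-string branch is left out for brevity, and the built-in next() and the unichr fallback are there only so the sketch also runs on a current Python; on 2.2 the iterator call would be s.next().

```python
try:
    unichr                      # exists on Python 2
except NameError:
    unichr = chr                # Python 3 spells it chr

# Built at import time from integers, so no surrogate literal has to be
# stored in the compiled byte code.
surr1 = unichr(0xD800)          # first high surrogate
surr2 = unichr(0xDC00)          # first low surrogate
surr3 = unichr(0xE000)          # first code value past the surrogates


def chars(s, surr1=surr1, surr2=surr2, surr3=surr3):
    """Yield the characters of a unicode string, keeping pairs together."""
    it = iter(s)
    for i in it:
        if surr1 <= i < surr2:
            try:
                j = next(it)
            except StopIteration:
                raise ValueError("Bad pair: string ends after %r" % i)
            if surr2 <= j < surr3:
                yield i + j
            else:
                raise ValueError("Bad pair: %r (bad second half)" % (i + j))
        elif surr2 <= i < surr3:
            raise ValueError("Bad pair: %r (no first half)" % i)
        else:
            yield i
```

Binding the three values as default arguments turns them into local names inside the generator, which avoids a global lookup on every comparison.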

I would personally go with "stop supporting Py 2.2". Unless you have the
time machine, you can't fix the bugs in old Python releases, and it is
a waste of time (IMO) to uglify the code just to work around limitations
in older interpreter versions.

Regards,
Martin
