unicode surrogates in py2.2/win

Mike Brown

In mid-October 2004, Jeff Epler helped me here with this string iterator:

def chars(s):
    """
    This generator function helps iterate over the characters in a
    string. When the string is unicode and a surrogate pair is
    encountered, the pair is returned together, regardless of whether
    Python was built with UCS-4 ('wide') or UCS-2 code values for
    its internal representation of unicode. This function will raise a
    ValueError if it detects an illegal surrogate pair.
    """
    if isinstance(s, str):
        for i in s:
            yield i
        return
    s = iter(s)
    for i in s:
        if u'\ud800' <= i < u'\udc00':
            try:
                j = s.next()
            except StopIteration:
                raise ValueError("Bad pair: string ends after %r" % i)
            if u'\udc00' <= j < u'\ue000':
                yield i + j
            else:
                raise ValueError("Bad pair: %r (bad second half)" % (i+j))
        elif u'\udc00' <= i < u'\ue000':
            raise ValueError("Bad pair: %r (no first half)" % i)
        else:
            yield i


I have since discovered that I can't use it on Python 2.2 on Windows because
of some weird module import bug caused by the surrogate code values expressed
in the Python code as u'\ud800' and u'\udc00' -- apparently the string
literals are being coerced to UTF-8 internally, which results in an invalid
byte sequence upon import of the module containing this function.

A simpler test case demonstrates the symptom:

C:\dev\test>echo x = u'\ud800' > testd800.py

C:\dev\test>cat testd800.py
x = u'\ud800'

C:\dev\test>python -c "import testd800"

C:\dev\test>python -c "import testd800"
Traceback (most recent call last):
  File "<string>", line 1, in ?
UnicodeError: UTF-8 decoding error: unexpected code byte

C:\dev\test>python testd800.py

C:\dev\test>python testd800.py

Very strange how it only shows up after the 1st import attempt seems to
succeed, and it doesn't ever show up if I run the code directly or run the
code in the command-line interpreter.

The error does not occur with u'\ud800\udc00' or u'\ue000' or any other valid
sequence.

In my function I can use "if u'\ud7ff' > i ..." to work around the d800 case,
but I can't use the same trick for the dc00 case, because the code value just
below it (u'\udbff') is itself a surrogate. I will have to go back to
calling ord(i) and comparing against integers. IIRC the explicit ord() call
slowed things down a bit, though, so I'd like to avoid it if I can.
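
For reference, a minimal sketch of the ord()-based fallback described above. The name chars_ord is hypothetical, and the sketch walks the string by index rather than with an iterator so that the same text runs on current Python as well; on 2.2 the module would also need "from __future__ import generators". No surrogate literal appears anywhere in the source, so nothing problematic should end up in the compiled byte code. The byte-string shortcut from the original is dropped because ord() works on both string types.

```python
def chars_ord(s):
    """Yield the characters of s, keeping surrogate pairs together.

    Compares integer code values instead of unicode literals.
    """
    i, n = 0, len(s)
    while i < n:
        c = s[i]
        code = ord(c)
        if 0xD800 <= code < 0xDC00:
            # First half of a pair: a second half must follow.
            if i + 1 == n:
                raise ValueError("Bad pair: string ends after %r" % c)
            d = s[i + 1]
            if not 0xDC00 <= ord(d) < 0xE000:
                raise ValueError("Bad pair: %r (bad second half)" % (c + d))
            yield c + d
            i += 2
        elif 0xDC00 <= code < 0xE000:
            # Second half with no first half before it.
            raise ValueError("Bad pair: %r (no first half)" % c)
        else:
            yield c
            i += 1
```

Whether this is measurably slower than the literal comparisons would have to be timed on the target interpreter.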

Can anyone tell me what's causing this, or point me to a reference to show
when it was fixed? I'm using 2.2.1 and I couldn't find mention of it in any
release notes up through 2.3. Any other comments/suggestions (besides "stop
supporting narrow unicode builds of Py 2.2") would be appreciated, too. Thanks
:)

-Mike
 
Guest

Mike said:
> Very strange how it only shows up after the 1st import attempt seems to
> succeed, and it doesn't ever show up if I run the code directly or run the
> code in the command-line interpreter.

The reason for that is that the Python byte code stores the Unicode
literal in UTF-8. The first time, the byte code is generated, and an
unpaired surrogate is written to disk. The next time, the compiled byte
code is read back in, and the codec complains about the unpaired
surrogate.
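
The write-then-read asymmetry can be illustrated on a current Python 3 interpreter. This is only an analogy, not a reproduction of the 2.2 byte code machinery: the "surrogatepass" error handler stands in for an encoder that lets a lone surrogate through, and the default strict decoder plays the part of the codec that complains on the second import. The helper names are hypothetical.

```python
def write_lenient(text):
    """Encode to UTF-8 without rejecting lone surrogates (the 'write' half)."""
    return text.encode("utf-8", "surrogatepass")

def read_strict(data):
    """Decode with the strict UTF-8 codec (the 'read' half)."""
    return data.decode("utf-8")

lone = chr(0xD800)             # built at run time, no literal needed
stored = write_lenient(lone)   # three bytes: ED A0 80

try:
    read_strict(stored)
    reloaded = True
except UnicodeDecodeError:
    reloaded = False           # the strict decoder rejects the bytes

# A character outside the BMP is a valid four-byte sequence and
# survives the same round trip.
astral = chr(0x10000)
assert read_strict(write_lenient(astral)) == astral
```

This also matches the observation that a complete pair such as u'\ud800\udc00' causes no error: together the two halves encode one valid character.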

> Can anyone tell me what's causing this, or point me to a reference to show
> when it was fixed?

In Misc/NEWS, we have, for 2.3a1:

- The UTF-8 codec will now encode and decode Unicode surrogates
  correctly and without raising exceptions for unpaired ones.

Essentially, Python now allows surrogates to occur in UTF-8 encodings.

> I'm using 2.2.1 and I couldn't find mention of it in any
> release notes up through 2.3. Any other comments/suggestions (besides "stop
> supporting narrow unicode builds of Py 2.2") would be appreciated, too. Thanks
> :)

I see two options. One is to compile the code with exec, avoiding byte
code generation. Put

exec """

before the code, and

"""

after it. The other option is to use variables instead of literals:

surr1 = unichr(0xd800)
surr2 = unichr(0xdc00)
surr3 = unichr(0xe000)
def chars(s, surr1=surr1, surr2=surr2, surr3=surr3):
    ....
    if surr1 <= i < surr2:
        ...
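
Filled in, the second option might look like the sketch below. It is written so that it also runs on current Python, which means a few departures that are assumptions rather than part of the suggestion: the unichr fallback, next(it) in place of it.next(), and the omission of the original's byte-string shortcut, so a unicode argument is assumed. On 2.2 itself, it.next() and "from __future__ import generators" would be needed instead.

```python
try:
    unichr                # Python 2
except NameError:
    unichr = chr          # Python 3: chr() covers the same range

# Boundaries are built at run time, so no surrogate literal is compiled.
surr1 = unichr(0xd800)
surr2 = unichr(0xdc00)
surr3 = unichr(0xe000)

def chars(s, surr1=surr1, surr2=surr2, surr3=surr3):
    """Yield characters of a unicode string, keeping surrogate pairs together."""
    it = iter(s)
    for i in it:
        if surr1 <= i < surr2:
            try:
                j = next(it)
            except StopIteration:
                raise ValueError("Bad pair: string ends after %r" % i)
            if surr2 <= j < surr3:
                yield i + j
            else:
                raise ValueError("Bad pair: %r (bad second half)" % (i + j))
        elif surr2 <= i < surr3:
            raise ValueError("Bad pair: %r (no first half)" % i)
        else:
            yield i
```

Binding the three boundaries as default arguments turns them into local names inside the function, which avoids a global lookup on every comparison.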

I would personally go with "stop supporting Py 2.2". Unless you have the
time machine, you can't fix the bugs in old Python releases, and it is
a waste of time (IMO) to uglify the code just to work around limitations
in older interpreter versions.

Regards,
Martin
 
