Mike Brown
In mid-October 2004, Jeff Epler helped me here with this string iterator:
def chars(s):
    """
    This generator function helps iterate over the characters in a
    string. When the string is unicode and a surrogate pair is
    encountered, the pair is returned together, regardless of whether
    Python was built with UCS-4 ('wide') or UCS-2 code values for
    its internal representation of unicode. This function will raise a
    ValueError if it detects an illegal surrogate pair.
    """
    if isinstance(s, str):
        for i in s:
            yield i
        return
    s = iter(s)
    for i in s:
        if u'\ud800' <= i < u'\udc00':
            try:
                j = s.next()
            except StopIteration:
                raise ValueError("Bad pair: string ends after %r" % i)
            if u'\udc00' <= j < u'\ue000':
                yield i + j
            else:
                raise ValueError("Bad pair: %r (bad second half)" % (i+j))
        elif u'\udc00' <= i < u'\ue000':
            raise ValueError("Bad pair: %r (no first half)" % i)
        else:
            yield i
I have since discovered that I can't use it on Python 2.2 on Windows because
of some weird module import bug caused by the surrogate code values expressed
in the Python code as u'\ud800' and u'\udc00' -- apparently the string
literals are being coerced to UTF-8 internally, which results in an invalid
byte sequence upon import of the module containing this function.
A simpler test case demonstrates the symptom:
C:\dev\test>echo x = u'\ud800' > testd800.py

C:\dev\test>cat testd800.py
x = u'\ud800'

C:\dev\test>python -c "import testd800"

C:\dev\test>python -c "import testd800"
Traceback (most recent call last):
  File "<string>", line 1, in ?
UnicodeError: UTF-8 decoding error: unexpected code byte

C:\dev\test>python testd800.py

C:\dev\test>python testd800.py
Strangely, the error only shows up on the second import, after the first
attempt appears to succeed, and it never shows up if I run the file directly
or type the code into the command-line interpreter.
The error does not occur with u'\ud800\udc00' or u'\ue000' or any other valid
sequence.
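As a rough illustration of the suspicion above (that a lone surrogate pushed through UTF-8 yields a byte sequence a strict decoder rejects), here is a small sketch. It assumes a current Python 3 interpreter, not 2.2, and the helper names `strict_utf8_roundtrip` and `strict_decode_ok` are made up for this example. It shows only how strict UTF-8 treats lone surrogates in general; it does not confirm what the 2.2 import machinery actually does.

```python
# Illustration only: written for a current Python 3 interpreter, not 2.2.
# The code points are built with chr() so no surrogate literal appears here.
lone = chr(0xD800)        # a lone high surrogate
pair_char = chr(0x10000)  # the character that the pair D800 DC00 stands for


def strict_utf8_roundtrip(text):
    """Return True if text survives a strict UTF-8 encode/decode round trip."""
    try:
        return text.encode("utf-8").decode("utf-8") == text
    except UnicodeError:
        return False


def strict_decode_ok(data):
    """Return True if the bytes are accepted by the strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Encode the lone surrogate leniently, as a permissive encoder might.
lenient_bytes = lone.encode("utf-8", "surrogatepass")

print(strict_utf8_roundtrip(chr(0xE000)))  # valid code point
print(strict_utf8_roundtrip(pair_char))    # valid supplementary character
print(strict_utf8_roundtrip(lone))         # lone surrogate
print(strict_decode_ok(lenient_bytes))     # leniently written bytes, read strictly
```

The last two checks mirror the symptom: bytes written leniently for a lone surrogate are refused when read back strictly, while valid sequences pass in both directions.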
In my function I can use "if u'\ud7ff' > i ..." to work around the d800 case,
but I can't use the same trick for the dc00 case, because the code value just
below it, u'\udbff', is itself a surrogate. I will have to go back to
calling ord(i) and comparing against integers. IIRC the explicit ord() call
slowed things down a bit, though, so I'd like to avoid it if I can.
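For reference, the ord()-based fallback could look roughly like the sketch below. It is an adaptation of the generator above, not the original code: the name `chars_ord` is hypothetical, and it is written so that it runs on a current Python interpreter, where `next(it)` replaces `it.next()` and every string is unicode. The point is only that integer comparisons keep all surrogate literals out of the module source.

```python
def chars_ord(s):
    """
    Iterate over the characters of a unicode string, yielding surrogate
    pairs as a single two-code-unit string. Compares ord() values against
    integers, so the source contains no surrogate string literals.
    Raises ValueError on an illegal surrogate sequence.
    """
    it = iter(s)
    for ch in it:
        code = ord(ch)
        if 0xD800 <= code < 0xDC00:
            # High surrogate: a low surrogate must follow.
            try:
                low = next(it)  # on Python 2.2 this would be it.next()
            except StopIteration:
                raise ValueError("Bad pair: string ends after %r" % ch)
            if 0xDC00 <= ord(low) < 0xE000:
                yield ch + low
            else:
                raise ValueError("Bad pair: %r (bad second half)" % (ch + low))
        elif 0xDC00 <= code < 0xE000:
            # Low surrogate with no preceding high surrogate.
            raise ValueError("Bad pair: %r (no first half)" % ch)
        else:
            yield ch


# Usage: build the test data with chr() so this file stays free of
# surrogate literals as well.
sample = "a" + chr(0xD800) + chr(0xDC00) + "b"
print([len(piece) for piece in chars_ord(sample)])
```

Whether the extra ord() call per character is measurably slower than the string comparisons would need timing on the target interpreter; the sketch makes no claim either way.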
Can anyone tell me what's causing this, or point me to a reference to show
when it was fixed? I'm using 2.2.1 and I couldn't find mention of it in any
release notes up through 2.3. Any other comments/suggestions (besides "stop
supporting narrow unicode builds of Py 2.2") would be appreciated, too. Thanks
-Mike