Michael Butscher
Hi,
if I execute the following two lines in Python 2.5 (to feed in a
*unicode* string):
    import sgmllib
    sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')
I get the exception:
Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    sgmllib.SGMLParser().feed(u'<a title="te&#223;t"></a>')
  File "C:\Programme\Python25\Lib\sgmllib.py", line 99, in feed
    self.goahead(0)
  File "C:\Programme\Python25\Lib\sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "C:\Programme\Python25\Lib\sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xdf in position 0: ordinal not in range(128)
The reason is that the character reference &#223; is converted to the
*byte* string "\xdf" by SGMLParser.convert_codepoint. Concatenating this
byte string with the rest of the unicode attribute value triggers an
implicit ASCII decode, which fails for 0xdf.
Workaround (not thoroughly tested): Override convert_codepoint in a
derived class with:
    def convert_codepoint(self, codepoint):
        return unichr(codepoint)
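To illustrate why returning a unicode character helps, here is a minimal standalone sketch. It is not sgmllib's actual code: the CHARREF pattern and the replace_charrefs helper are hypothetical stand-ins for the reference-replacement step in parse_starttag, and the unichr/chr fallback is only there so the sketch also runs on Python 3.

```python
import re

# Hypothetical stand-in for sgmllib's reference handling: matches
# decimal character references such as &#223;
CHARREF = re.compile(r"&#(\d+);")

def convert_codepoint(codepoint):
    # Return a text (unicode) character instead of a byte string, so the
    # result can be joined with the surrounding unicode text.
    try:
        return unichr(codepoint)  # Python 2
    except NameError:
        return chr(codepoint)     # Python 3: chr() already returns text

def replace_charrefs(value):
    # Replace every decimal character reference with its converted character.
    return CHARREF.sub(lambda m: convert_codepoint(int(m.group(1))), value)

result = replace_charrefs(u"te&#223;t")
# The reference is now the single character U+00DF inside a unicode string.
```

Because the converter returns text, the substitution never mixes byte strings with unicode, so no implicit ASCII decode takes place.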
Is this a bug or is SGMLParser not meant to be used for unicode strings
(it should be documented then)?
Michael