Michael L Torrie said:
Giving you the benefit of the doubt here, despite the fact that Stefan
Behnel has stated this over and over again and you just haven't listened.
Speaking of over and over again ...
xml.sax's use of parseString() is exactly correct. xml.sax should
*never* parse python unicode strings, since by definition XML must be
encoded as a *byte stream*, which is what a python string is.
I don't care about the definition of XML at this point in the
program.
http://docs.python.org/lib/module-xml.sax.html calls
parseString() a convenience function.
This is Python. Python has a class named unicode. Its literals
look like strings. The base class is basestring.
xml.sax belongs to Python. Batteries included. parseString() is
in Python.
It's not parseString() that tells me something is wrong with the
parameter. It's cStringIO, which is used on platforms where it is
available. On other platforms no exception is thrown, because
StringIO is used there, and it behaves the same in Python 2.4 and
Python 2.5 with regard to unicode strings.
Other libraries like LXML (not included) parse unicode strings.
And these are two additional lines in my code now:
    if isinstance(string, unicode):
        string = string.encode("utf-8")
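
A minimal sketch of that workaround wrapped in a helper, in case it is
useful to see it end to end. The names parse_text, collect_text and
TextCollector are hypothetical and not part of xml.sax; text_type is
only a trick to name the text type on both Python 2 and Python 3. It
assumes the document does not carry an XML declaration naming some
encoding other than UTF-8, since the bytes handed to the parser are
always UTF-8 here.

```python
import xml.sax

# unicode on Python 2, str on Python 3; avoids naming either type directly.
text_type = type(u"")


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        # The parser may deliver the text in several pieces.
        self.chunks.append(content)


def parse_text(data, handler):
    """Encode text to UTF-8 bytes, then hand it to xml.sax.parseString()."""
    if isinstance(data, text_type):
        data = data.encode("utf-8")
    xml.sax.parseString(data, handler)


def collect_text(data):
    """Parse data and return all character content as one text string."""
    handler = TextCollector()
    parse_text(data, handler)
    return u"".join(handler.chunks)
```

Byte strings pass through untouched, so callers that already hold
encoded XML are not affected.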
A python /unicode/ string could be held internally in any number of
ways: 2, 3, 4, or even 8 bytes per character if the implementation
demanded it (a bit contrived, I admit). Since the xml parser is only
ever intended to parse *XML*, why should it ever know what to do with
python unicode strings, which could be stored any number of ways, making
byte-parsing impossible?
xml.sax is no external parser. The program doesn't have to
communicate with the outside world at this point of execution.
The Python program calls a Python function of a Python class and
passes a Python unicode string as a parameter.
XML parsers only have to support a few encodings, but nobody
objects when they support more than that.
A Python convenience function isn't broken when it allows Python
unicode strings.
The behavior of cStringIO (the original topic of this thread) is
correct and documented. parseString() uses the old idiom where
cStringIO is imported as StringIO when available, despite the
fact that the two behave differently.
In my personal opinion: If parseString() shouldn't support
unicode strings, then it should check for it and throw a
meaningful exception.
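
Such a check could look roughly like the sketch below. It is only an
illustration of the idea, not what xml.sax does: strict_parse_string
is a hypothetical wrapper, and the choice of TypeError and the wording
of the message are assumptions.

```python
import xml.sax

# unicode on Python 2, str on Python 3; avoids naming either type directly.
text_type = type(u"")


def strict_parse_string(data, handler):
    """Hypothetical wrapper: reject text input up front with a clear error."""
    if isinstance(data, text_type):
        raise TypeError(
            "strict_parse_string() needs an encoded byte string, got %s; "
            "call .encode() first" % type(data).__name__
        )
    xml.sax.parseString(data, handler)
```

With this, the caller sees the complaint at the call site instead of
an error raised from somewhere inside the string buffer.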
At the moment the code just looks as if someone has overlooked
the fact that unicode strings (with non-ASCII characters in them)
cause a problem. Missing test?
So your code is faulty in its assumptions, not xml.sax.
As I said in my conclusion a few messages ago: undocumented,
implementation-dependent behavior.
Or maybe just a bug, considering the following on
http://docs.python.org/lib/module-xml.sax.html
A typical SAX application uses three kinds of objects:
readers, handlers and input sources. "Reader" in this
context is another term for parser, i.e. some piece of
code that reads the bytes or characters from the input
source, and produces a sequence of events.
Bytes _or_ characters.