Mysterious xml.sax Encoding Exception

JKPeck

I have a module that uses xml.sax and feeds it a string of xml as in
xml.sax.parseString(dictfile,handler)

The xml is always encoded in utf-16, and the XML string always starts
with
<?xml version="1.0" encoding="UTF-16" standalone="no"?>

This almost always works fine, but two users of this module get an
exception whatever input they use it on. (The actual xml is generated
by an api in our application that returns an xml version of metadata
associated with the application's data.)

The exception is
xml.sax._exceptions.SAXParseException: <unknown>:1:30: encoding
specified in XML declaration is incorrect.

In both of these cases, there are only plain, 7-bit ascii characters
in the xml, and it really is valid utf-16 as far as I can tell.

Now here is the hard part: This never happens to me, and having gotten
the actual xml content from one of the users and fed it to the parser,
I don't get the exception.

What could be going on? We are all on Python 2.5 (and all on an
English locale).

Any suggestions would be appreciated.
-Jon Peck
 
Martin v. Löwis

> In both of these cases, there are only plain, 7-bit ascii characters
> in the xml, and it really is valid utf-16 as far as I can tell.

What do you mean by "7-bit ascii characters"? If it means what I think
it means (namely, a sequence of bytes whose values are between 1 and
127), then it is *not* valid utf-16.

> Now here is the hard part: This never happens to me, and having gotten
> the actual xml content from one of the users and fed it to the parser,
> I don't get the exception.
>
> What could be going on? We are all on Python 2.5 (and all on an
> English locale).

What operating system do they use, and how do they send you the file
for verification? Can you have them run

print repr(open(filename, "rb").read(10))

and send you its output?

Regards,
Martin
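
As a rough illustration of what those first few bytes can reveal, here is a minimal sketch that classifies the start of a document held as a byte string. The name describe_prefix and the sample document are made up for this example, and the code assumes Python 3 syntax rather than the Python 2.5 used in the thread.

```python
import codecs


def describe_prefix(data):
    """Classify the leading bytes of an XML document given as bytes.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16 little-endian BOM"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16 big-endian BOM"
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8 BOM"
    if data.startswith(b"<?xml"):
        # One byte per character: not UTF-16, whatever the declaration says.
        return "no BOM, single-byte characters"
    return "unrecognized prefix"


sample = '<?xml version="1.0" encoding="UTF-16"?><data/>'
print(describe_prefix(sample.encode("utf-16")))  # BOM in the platform's byte order
print(describe_prefix(sample.encode("ascii")))
```

A document that declares UTF-16 but starts with the plain bytes of "<?xml" is the kind of mismatch this check would flag.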
 
JKPeck

> What do you mean by "7-bit ascii characters"? If it means what I think
> it means (namely, a sequence of bytes whose values are between 1 and
> 127), then it is *not* valid utf-16.
>
> What operating system do they use, and how do they send you the file
> for verification? Can you have them run
>
> print repr(open(filename, "rb").read(10))
>
> and send you its output?
>
> Regards,
> Martin

They sent me the actual file, which was created on Windows, as an
email attachment. They had also sent the actual dataset from which
the XML was generated so that I could generate it myself using the
same version of our app as the user has. I did that but did not get
an exception.
 
Martin v. Löwis

> They sent me the actual file, which was created on Windows, as an
> email attachment. They had also sent the actual dataset from which
> the XML was generated so that I could generate it myself using the
> same version of our app as the user has. I did that but did not get
> an exception.

So are you sure you open the file in binary mode on Windows?

Regards,
Martin
 
JKPeck

> So are you sure you open the file in binary mode on Windows?
>
> Regards,
> Martin

In the real case, the xml never goes through a file but is handed
directly to the parser. The api returns a Python Unicode string
(utf-16). For the file the user sent, if I open it in binary mode, it
still has a BOM; otherwise the BOM is removed. But either version
works on my system.

The basic fact, though, remains, the same code works for me with the
same input but not for two particular users (out of hundreds).

Regards,
Jon
 
Martin v. Löwis

> The basic fact, though, remains, the same code works for me with the
> same input but not for two particular users (out of hundreds).

I see. That's mysterious.

Regards,
Martin
 
Jeroen Ruigrok van der Werven

-On [20080201 19:06] said:
> In both of these cases, there are only plain, 7-bit ascii characters
> in the xml, and it really is valid utf-16 as far as I can tell.

Did you mean to say that the only characters they used in the UTF-16 encoded
file are characters from the Basic Latin Unicode block?
 
John Machin

> In the real case, the xml never goes through a file but is handed
> directly to the parser. The api returns a Python Unicode string
> (utf-16).

A Python unicode object is *NOT* the UTF-16 that the SAX parser is
expecting. It is expecting a str object which is Unicode text encoded
as UTF-16.

At the end of this post is code using a str object (works) then
attempting to use a unicode object (reproduces your error message).

> For the file the user sent, if I open it in binary mode, it
> still has a BOM; otherwise the BOM is removed. But either version
> works on my system.
>
> The basic fact, though, remains, the same code works for me with the
> same input but not for two particular users (out of hundreds).

If the real case doesn't involve a file, I can't imagine what you can
infer from a file that isn't used [strike 1] sent to you by a user
[strike 2].

Consider trapping the exception, writing
repr(the_xml_document_string[:80]) to the log file, and re-raising the
exception. Get the user to run the app. You inspect the log file.
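
A minimal sketch of that suggestion, assuming Python 3 and the standard logging module. The function name parse_with_diagnostics and the logger name "dictparse" are invented for illustration; they are not part of the module discussed in this thread.

```python
import logging
import xml.sax

log = logging.getLogger("dictparse")  # hypothetical logger name


def parse_with_diagnostics(document, handler):
    """Parse document; on failure, log its type and first 80 items, then re-raise."""
    try:
        xml.sax.parseString(document, handler)
    except xml.sax.SAXParseException:
        # repr() shows any BOM and null bytes, and the type name shows
        # whether a text string or a byte string reached the parser.
        log.error("parse failed: type=%s start=%r",
                  type(document).__name__, document[:80])
        raise  # bare raise keeps the original exception and traceback
```

Logging the type alongside the repr matters here, because the question is precisely whether the parser received encoded bytes or a Unicode object.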

Here's the promised code and results.

C:\junk>type utf16sax.py
import xml.sax, xml.sax.saxutils
import cStringIO
asciistr = 'qwertyuiop'
xml_template = """<?xml version="1.0" encoding="%s"?><data>%s</data>"""
unicode_doc = (xml_template % ('UTF-16', asciistr)).decode('ascii')
utf16_doc = unicode_doc.encode('UTF-16')
for doc in (utf16_doc, unicode_doc):
    print
    print 'doc = ', repr(doc)
    print
    f = cStringIO.StringIO()
    handler = xml.sax.saxutils.XMLGenerator(f, encoding='utf8')
    xml.sax.parseString(doc, handler)
    result = f.getvalue()
    f.close()
    start = result.find('<data>') + 6
    end = result.find('</data>')
    mydata = result[start:end]
    print "SAX output (UTF-8): %r" % mydata


C:\junk>utf16sax.py

doc = '\xff\xfe<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00=\x00"\x001\x00.\x000\x00"\x00 \x00e\x00n\x00c\x00o\x00d\x00i\x00n\x00g\x00=\x00"\x00U\x00T\x00F\x00-\x001\x006\x00"\x00?\x00>\x00<\x00d\x00a\x00t\x00a\x00>\x00q\x00w\x00e\x00r\x00t\x00y\x00u\x00i\x00o\x00p\x00<\x00/\x00d\x00a\x00t\x00a\x00>\x00'

SAX output (UTF-8): 'qwertyuiop'

doc = u'<?xml version="1.0" encoding="UTF-16"?><data>qwertyuiop</data>'

Traceback (most recent call last):
  File "C:\junk\utf16sax.py", line 13, in <module>
    xml.sax.parseString(doc, handler)
  File "C:\Python25\lib\xml\sax\__init__.py", line 49, in parseString
    parser.parse(inpsrc)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 107, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "C:\Python25\lib\xml\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "C:\Python25\lib\xml\sax\expatreader.py", line 211, in feed
    self._err_handler.fatalError(exc)
  File "C:\Python25\lib\xml\sax\handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:1:30: encoding specified in XML declaration is incorrect

I guess what is happening is that the unicode is coerced to str using
the default encoding (ascii) then it looks at the result, parses out
the "UTF-16", attempts to decode it using utf-16, fails, complains.

HTH,
John
 
JKPeck

> -On [20080201 19:06] said:
>> In both of these cases, there are only plain, 7-bit ascii characters
>> in the xml, and it really is valid utf-16 as far as I can tell.
>
> Did you mean to say that the only characters they used in the UTF-16 encoded
> file are characters from the Basic Latin Unicode block?

It appears that the root cause of this problem is indeed passing a
Unicode XML string to xml.sax.parseString with an encoding declaration
in the XML of utf-16. This works with the standard distribution on
Windows. It does not work with ActiveState on Windows, even though
both distributions report 64K for sys.maxunicode.

So I don't know why the results are different, but the problem is
solved by encoding the Unicode string into utf-16 before passing it to
the parser.

Thanks to all for helping to track this down.

Regards,
Jon Peck
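
The fix described above can be sketched roughly as follows, assuming Python 3 syntax. The names parse_unicode_xml and TextCollector are hypothetical and exist only for this example; the point is that the bytes handed to the parser are produced with the same encoding the XML declaration names.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers character data for inspection."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_unicode_xml(text, handler, encoding="utf-16"):
    """Encode a Unicode XML string to match its declaration, then parse the bytes."""
    data = text.encode(encoding)  # "utf-16" prepends a BOM in native byte order
    xml.sax.parseString(data, handler)


doc = '<?xml version="1.0" encoding="UTF-16"?><data>qwertyuiop</data>'
collector = TextCollector()
parse_unicode_xml(doc, collector)
print("".join(collector.chunks))
```

The encoding argument has to agree with the declaration inside the document; passing a different one would recreate the mismatch that caused the original exception.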
 
John Machin

>> -On [20080201 19:06], JKPeck wrote:
>>> In both of these cases, there are only plain, 7-bit ascii characters
>>> in the xml, and it really is valid utf-16 as far as I can tell.
>> Did you mean to say that the only characters they used in the UTF-16 encoded
>> file are characters from the Basic Latin Unicode block?
>
> It appears that the root cause of this problem is indeed passing a
> Unicode XML string to xml.sax.parseString with an encoding declaration
> in the XML of utf-16. This works with the standard distribution on
> Windows.

It did NOT work for me with the standard 2.5.1 Windows distribution --
see the code + output that I posted.
 
JKPeck

>>> -On [20080201 19:06], JKPeck wrote:
>>>> In both of these cases, there are only plain, 7-bit ascii characters
>>>> in the xml, and it really is valid utf-16 as far as I can tell.
>>> Did you mean to say that the only characters they used in the UTF-16 encoded
>>> file are characters from the Basic Latin Unicode block?
>> It appears that the root cause of this problem is indeed passing a
>> Unicode XML string to xml.sax.parseString with an encoding declaration
>> in the XML of utf-16. This works with the standard distribution on
>> Windows.
>
> It did NOT work for me with the standard 2.5.1 Windows distribution --
> see the code + output that I posted.
>
>> It does not work with ActiveState on Windows even though
>> both distributions report 64K for sys.maxunicode.
>> So I don't know why the results are different, but the problem is
>> solved by encoding the Unicode string into utf-16 before passing it to
>> the parser.

Interesting. In the course of installing and testing with
ActiveState, I upgraded from the standard distribution 2.5.0 to
2.5.1. The former worked; the latter does not (with the original
code). So that .1 seems to matter here, and that probably accounts
for why ActiveState raised the exception and the standard 2.5.0 did
not.

-Jon
 
