SAXParseException: not well-formed (invalid token)

Pablo Rey · Aug 30, 2007

Dear Colleagues,

I am getting the following error with an XML page:

  File "/home/prey/RAL-CESGA/bin/voms2users/voms2users.py", line 69, in getItems
    d = minidom.parseString(xml.read())
  File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 967, in parseString
    return _doparse(pulldom.parseString, args, kwargs)
  File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 954, in _doparse
    toktype, rootNode = events.getEvent()
  File "/usr/lib/python2.2/site-packages/_xmlplus/dom/pulldom.py", line 265, in getEvent
    self.parser.feed(buf)
  File "/usr/lib/python2.2/site-packages/_xmlplus/sax/expatreader.py", line 208, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.2/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:553:48: not well-formed (invalid token)

def getItems(page):
    opener = urllib.URLopener(key_file=HOSTKEY, cert_file=HOSTCERT)
    try:
        xml = opener.open(page)
    except:
        return []

    d = minidom.parseString(xml.read())
    items = d.getElementsByTagName('item')
    data = []
    for i in items:
        data.append(getText(i.childNodes))

    return data

The page is
https://lcg-voms.cern.ch:8443/voms/cms/services/VOMSCompatibility?method=getGridmapUsers
and the line with the invalid character is the following (the invalid
character is the final é of Université):

<item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de Louvain/CN=Roberfroid</item>

I have tried several options but I have not been able to avoid this problem.
Any ideas?

I am just starting to work with Python, so I am sorry if this problem is trivial.

Thanks for your time.
Pablo Rey

Marc 'BlackJack' Rintsch · Aug 30, 2007

Pablo said:
The page is
https://lcg-voms.cern.ch:8443/voms/cms/services/VOMSCompatibility?method=getGridmapUsers
and the line with the invalid character is (the invalid character is the
final é of Université):

The URL doesn't work for me in a browser. (Could not connect…)

Maybe you can download that XML file and use `xmllint` to check if it is
well-formed XML!?

Ciao,
Marc 'BlackJack' Rintsch

Stefan Behnel · Aug 30, 2007

Pablo said:
I have tried several options but I have not been able to avoid this
problem. Any ideas?

Looks like the page is not well-formed XML (i.e. not XML at all). If it
doesn't specify an encoding (<?xml encoding="..."?>), you can try recoding the
input, possibly decoding it from latin-1 and re-encoding it as UTF-8 before
passing it to the SAX parser.
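
For illustration, a minimal sketch of the recoding step suggested above. It assumes Python 3 (the thread itself uses Python 2.2, where the parser raises SAXParseException instead of ExpatError) and assumes the mislabelled bytes really are Latin-1. The sample document and the helper name `parse_items` are hypothetical, not taken from the original script.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample: declares UTF-8 but stores "é" as the Latin-1 byte 0xE9.
RAW = (b'<?xml version="1.0" encoding="UTF-8" ?>\n'
       b'<items><item>/OU=Univesit\xe9 Catholique de Louvain</item></items>')


def parse_items(raw):
    """Return the text of every <item>, recoding from Latin-1 if parsing fails."""
    try:
        doc = minidom.parseString(raw)
    except ExpatError:
        # Assumption: the bytes are really Latin-1, so reinterpret them
        # and re-encode as UTF-8 to match the declared encoding.
        doc = minidom.parseString(raw.decode("latin-1").encode("utf-8"))
    return [item.firstChild.data for item in doc.getElementsByTagName("item")]


items = parse_items(RAW)
print(items)
```

Retrying only after a parse failure leaves correctly encoded documents untouched. The fallback can still guess wrong if the real encoding is something other than Latin-1.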

Alternatively, tell the page authors to fix their page.

Stefan

Pablo Rey · Aug 30, 2007

Hi Stefan,

The XML does specify an encoding (<?xml version="1.0" encoding="UTF-8"
?>).

Regarding the recoding of the input that you mention, could you let me
know how to do it? I am sorry, I am just starting with Python and I
don't know how.

Thanks for your help.
Pablo

Pablo Rey · Aug 30, 2007

Marc said:
The URL doesn't work for me in a browser. (Could not connect…)

Hi Marc,

To access the page you need an X.509 certificate signed by a CA
recognised by the project.

I have saved the XML file and you can find it attached.

Marc said:
Maybe you can download that XML file and use `xmllint` to check if it is
well-formed XML!?

This is the output of the xmllint command:

[prey@www3 voms2users]$ xmllint cms.xml
cms.xml:553: error: Input is not proper UTF-8, indicate encoding !
<item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de Louvain/CN=Roberfroi
^
cms.xml:553: error: Bytes: 0xE9 0x20 0x43 0x61
<item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de Louvain/CN=Roberfroi

Thanks for your time.
Pablo

Marc 'BlackJack' Rintsch · Aug 30, 2007

Pablo said:
This is the output of the xmllint command:

[prey@www3 voms2users]$ xmllint cms.xml
cms.xml:553: error: Input is not proper UTF-8, indicate encoding !
<item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de Louvain/CN=Roberfroi
^
cms.xml:553: error: Bytes: 0xE9 0x20 0x43 0x61
<item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de Louvain/CN=Roberfroi

[…]

<?xml version="1.0" encoding="UTF-8" ?>

So the XML says it is encoded in UTF-8 but it contains at least one
character that seems to be encoded in ISO-8859-1.
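
A small sketch of how that mismatch can be located without xmllint, assuming Python 3. The function name `first_invalid_utf8` and the sample bytes are made up for illustration; the sample is not the real cms.xml.

```python
def first_invalid_utf8(data):
    """Return (line, column, next_bytes) for the first byte sequence that is
    not valid UTF-8, or None if the input decodes cleanly.

    line is 1-based, column is a 0-based byte offset within that line.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        line = data.count(b"\n", 0, err.start) + 1
        column = err.start - (data.rfind(b"\n", 0, err.start) + 1)
        return line, column, data[err.start:err.start + 4]
    return None


# Hypothetical sample: UTF-8 declaration, but "é" stored as Latin-1 byte 0xE9.
SAMPLE = (b'<?xml version="1.0" encoding="UTF-8" ?>\n'
          b'<item>Univesit\xe9 Catholique</item>\n')

print(first_invalid_utf8(SAMPLE))
```

The four bytes returned correspond to the "Bytes: 0xE9 0x20 0x43 0x61" line in the xmllint output: 0xE9 announces a multi-byte UTF-8 sequence, but the following space (0x20) is not a valid continuation byte.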

Tell the authors/creators of that document their XML is broken.

Ciao,
Marc 'BlackJack' Rintsch

Carsten Haese · Aug 30, 2007

Pablo said:
Hi Stefan,

The XML does specify an encoding (<?xml version="1.0" encoding="UTF-8"
?>).

It's possible that the encoding specification is incorrect. For
comparison, this is what é looks like when it is encoded as UTF-8:
'\xc3\xa9'

If your input string contains the byte 0xe9 where your accented e is,
the file is actually latin-1 encoded. If it contains the byte sequence
0xc3,0xa9 it is UTF-8 encoded.
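
The two byte patterns can be checked directly. This is a small sketch assuming Python 3, where bytes and text are separate types; the variable names are only for illustration.

```python
E_ACUTE = "\u00e9"  # the character é

latin1_bytes = E_ACUTE.encode("latin-1")  # single byte 0xE9
utf8_bytes = E_ACUTE.encode("utf-8")      # two bytes 0xC3 0xA9
print(latin1_bytes, utf8_bytes)

# A lone 0xE9 followed by plain ASCII cannot be decoded as UTF-8.
try:
    b"Univesit\xe9 Catholique".decode("utf-8")
    lone_e9_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_e9_is_valid_utf8 = False
print(lone_e9_is_valid_utf8)
```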

If the string is encoded in latin-1, you can transcode it to utf-8 like
this:

contents = contents.decode("latin-1").encode("utf-8")
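
One caveat with the line above: applied to data that is already valid UTF-8, it double-encodes the text and turns é into "Ã©". A possible guard, sketched under the assumption of Python 3 and with the hypothetical helper name `ensure_utf8`, is to transcode only when UTF-8 decoding fails:

```python
def ensure_utf8(contents):
    """Return UTF-8 bytes, transcoding from Latin-1 only when needed."""
    try:
        contents.decode("utf-8")
        return contents  # already valid UTF-8, leave untouched
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Latin-1.
        return contents.decode("latin-1").encode("utf-8")


already_utf8 = "\u00e9".encode("utf-8")

# Blind transcoding of valid UTF-8 produces mojibake ...
mangled = already_utf8.decode("latin-1")
print(repr(mangled))

# ... while the guarded version leaves it alone and fixes real Latin-1 input.
print(ensure_utf8(already_utf8), ensure_utf8(b"\xe9"))
```

This treats the input as a whole, so a file that mixes both encodings would still be transcoded entirely from Latin-1.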

HTH,

Carsten Haese · Aug 30, 2007

Pablo said:
Regarding the recoding of the input that you mention, could you let me
know how to do it? I am sorry, I am just starting with Python and I
don't know how.

While I answered this question in my previous reply, I wanted to add
that you might find the following How-To helpful in demystifying
Unicode:

http://www.amk.ca/python/howto/unicode

Lawrence D'Oliveiro · Aug 31, 2007

Carsten said:
If your input string contains the byte 0xe9 where your accented e is,
the file is actually latin-1 encoded. If it contains the byte sequence
0xc3,0xa9 it is UTF-8 encoded.

It is dismaying how often I come across Web pages that claim to be
UTF-8-encoded, but are actually Latin-1 or Dimdows-1252.
