SAXParseException: not well-formed (invalid token)

Pablo Rey

Dear Colleagues,

I am getting the following error with an XML page:
  File "/home/prey/RAL-CESGA/bin/voms2users/voms2users.py", line 69, in getItems
    d = minidom.parseString(xml.read())
  File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 967, in parseString
    return _doparse(pulldom.parseString, args, kwargs)
  File "/usr/lib/python2.2/site-packages/_xmlplus/dom/minidom.py", line 954, in _doparse
    toktype, rootNode = events.getEvent()
  File "/usr/lib/python2.2/site-packages/_xmlplus/dom/pulldom.py", line 265, in getEvent
    self.parser.feed(buf)
  File "/usr/lib/python2.2/site-packages/_xmlplus/sax/expatreader.py", line 208, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.2/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: <unknown>:553:48: not well-formed (invalid token)

def getItems(page):
    opener = urllib.URLopener(key_file=HOSTKEY, cert_file=HOSTCERT)
    try:
        xml = opener.open(page)
    except:
        return []

    d = minidom.parseString(xml.read())
    items = d.getElementsByTagName('item')
    data = []
    for i in items:
        data.append(getText(i.childNodes))

    return data

The page is
https://lcg-voms.cern.ch:8443/voms/cms/services/VOMSCompatibility?method=getGridmapUsers
and the line with the invalid character is (the invalid character is the
final é of Université):

<item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de
Louvain/CN=Roberfroid</item>


I have tried several options but I have not been able to get around this
problem. Any ideas?

I am just starting to work with Python, so I am sorry if this problem is trivial.

Thanks for your time.
Pablo Rey
 

Stefan Behnel

Looks like the page is not well-formed XML (i.e. not XML at all). If it
doesn't specify an encoding (<?xml encoding="..."?>), you can try recoding the
input, possibly decoding it from latin-1 and re-encoding it as UTF-8 before
passing it to the SAX parser.

Alternatively, tell the page authors to fix their page.

Stefan
 

Pablo Rey

Hi Stefan,

The XML does specify an encoding (<?xml version="1.0" encoding="UTF-8"
?>).

Regarding your suggestion to recode the input, could you tell me how to
do it? I am sorry, I am just starting with Python and I don't know how.

Thanks for your help.
Pablo



 

Pablo Rey

The URL doesn't work for me in a browser. (Could not connect…)

Hi Marc,

To access the page you need an X.509 certificate signed by a CA
recognised by the project.

I have stored the XML file and you can find it attached.
Maybe you can download that XML file and use `xmllint` to check if it is
well formed XML!?

This is the output of the xmllint command:

[prey@www3 voms2users]$ xmllint cms.xml
cms.xml:553: error: Input is not proper UTF-8, indicate encoding !
<item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de
Louvain/CN=Roberfroi
^
cms.xml:553: error: Bytes: 0xE9 0x20 0x43 0x61
<item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de
Louvain/CN=Roberfroi

Thanks for your time.
Pablo
 

Marc 'BlackJack' Rintsch

Maybe you can download that XML file and use `xmllint` to check if it
is well formed XML!?

This is the output of the xmllint command:

[prey@www3 voms2users]$ xmllint cms.xml
cms.xml:553: error: Input is not proper UTF-8, indicate encoding !
<item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de
Louvain/CN=Roberfroi
^
cms.xml:553: error: Bytes: 0xE9 0x20 0x43 0x61
<item>/C=BE/O=BEGRID/OU=Physique/OU=Univesité Catholique de
Louvain/CN=Roberfroi

[…]

<?xml version="1.0" encoding="UTF-8" ?>

So the XML says it is encoded in UTF-8 but it contains at least one
character that seems to be encoded in ISO-8859-1.
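
The same check can be done from Python. What follows is only a minimal sketch in Python 3 (the thread itself uses Python 2.2); the sample bytes and the helper name `first_invalid_utf8` are made up for illustration and are not taken from the real cms.xml. It reports the offset of the first byte that is not valid UTF-8, plus the next few bytes, much like the "Bytes:" line in the xmllint output.

```python
# Hypothetical sample: a document that declares UTF-8 but contains a
# Latin-1 encoded e-acute (the single byte 0xE9).
data = (b'<?xml version="1.0" encoding="UTF-8" ?>\n'
        b'<items><item>/OU=Univesit\xe9 Catholique de Louvain</item></items>')


def first_invalid_utf8(raw):
    """Return (offset, next four bytes) for the first invalid UTF-8 byte,
    or None if the whole byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, raw[err.start:err.start + 4]
    return None


result = first_invalid_utf8(data)
if result is not None:
    offset, context = result
    # Iterating over bytes yields integers in Python 3.
    hex_bytes = " ".join("0x%02X" % b for b in context)
    print("invalid UTF-8 at byte offset %d: %s" % (offset, hex_bytes))
```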

Tell the authors/creators of that document their XML is broken.

Ciao,
Marc 'BlackJack' Rintsch
 

Carsten Haese

Hi Stefan,

The xml has specified an encoding (<?xml version="1.0" encoding="UTF-8"
?>).

It's possible that the encoding specification is incorrect. In UTF-8, the
accented e is the two-byte string:
'\xc3\xa9'

If your input string contains the byte 0xe9 where your accented e is,
the file is actually latin-1 encoded. If it contains the byte sequence
0xc3,0xa9 it is UTF-8 encoded.
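
To see the two byte patterns side by side, here is a small Python 3 sketch (the variable name `e_acute` is just for illustration). In Python 3 the encoded results are `bytes` objects, where Python 2 would show plain strings.

```python
e_acute = "\u00e9"  # the character é (U+00E9)

# UTF-8 uses two bytes for this character, Latin-1 uses one.
print(e_acute.encode("utf-8"))    # b'\xc3\xa9'
print(e_acute.encode("latin-1"))  # b'\xe9'
```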

If the string is encoded in latin-1, you can transcode it to utf-8 like
this:

contents = contents.decode("latin-1").encode("utf-8")
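
Putting that together with the parsing step, a hedged sketch in Python 3 might look like the following. The function name `parse_mislabelled_xml` and the sample document are assumptions made up for this example; in Python 3 the transcoding line operates on `bytes` rather than on a Python 2 `str`. It only transcodes when the input is not already valid UTF-8, and it assumes that anything else is Latin-1, which is a guess rather than a certainty.

```python
from xml.dom import minidom


def parse_mislabelled_xml(raw):
    """Parse XML bytes that declare UTF-8 but may really be Latin-1."""
    try:
        raw.decode("utf-8")  # valid UTF-8 already: leave the bytes alone
    except UnicodeDecodeError:
        # Assumption: the data is Latin-1, so transcode it to UTF-8,
        # which then matches the encoding named in the XML declaration.
        raw = raw.decode("latin-1").encode("utf-8")
    return minidom.parseString(raw)


# Hypothetical sample with a Latin-1 e-acute (byte 0xE9).
sample = (b'<?xml version="1.0" encoding="UTF-8" ?>'
          b'<items><item>/OU=Univesit\xe9 Catholique de '
          b'Louvain/CN=Roberfroid</item></items>')

doc = parse_mislabelled_xml(sample)
names = [item.firstChild.data for item in doc.getElementsByTagName("item")]
# ascii() keeps the output printable on terminals that are not UTF-8.
print(ascii(names))
```

Without the transcoding step, `minidom.parseString(sample)` fails with the same "not well-formed (invalid token)" message as in the original traceback.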

HTH,
 

Carsten Haese

About the possibility that you mention to recoding the input, could you
let me know how to do it?. I am sorry I am starting with Python and I
don't know how to do it.

While I answered this question in my previous reply, I wanted to add
that you might find the following How-To helpful in demystifying
Unicode:

http://www.amk.ca/python/howto/unicode
 

Lawrence D'Oliveiro

If your input string contains the byte 0xe9 where your accented e is,
the file is actually latin-1 encoded. If it contains the byte sequence
0xc3,0xa9 it is UTF-8 encoded.

It is dismaying how often I come across Web pages that claim to be
UTF-8-encoded, but are actually Latin-1 or Dimdows-1252.
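
The distinction between the last two matters for bytes 0x80 to 0x9F. A short Python 3 sketch with made-up sample bytes: Windows-1252 maps them to printable characters such as curly quotes and the euro sign, while Latin-1 maps them to control characters.

```python
# Hypothetical bytes as a Windows-1252 page might contain them.
raw = b"\x93quoted\x94 \x80"

as_cp1252 = raw.decode("cp1252")    # curly quotes and the euro sign
as_latin1 = raw.decode("latin-1")   # C1 control characters instead

# ascii() shows the code points without needing a UTF-8 terminal.
print(ascii(as_cp1252))
print(ascii(as_latin1))
```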
 
