XML expat error

dirkheld

Hi,

I have written a piece of code that reads all XML files in a directory
in order to retrieve one element from each of these files. All files
have the same XML structure. After file 123 I receive the following
error:

xml.parsers.expat.ExpatError: not well-formed (invalid token): line 554, column 20

I guess that either the element I try to read or the XML itself (which
would be strange, since all files have been created with the same code)
can't be retrieved.

Is there a way to:
1. fix this problem so that I can retrieve the element?
2. skip the invalid file after such an error, so that the program
continues reading the subsequent files? Some sort of error handling?

Here is the code I use :

from xml.dom import minidom
import os
path = "/Documents/programming/data/xml/"


dirList = os.listdir(path)
url_file=open('/Documents/programming/data/xml/test.txt','w')
for file in dirList:
    xmldoc = minidom.parse('/Documents/programming/data/xml/'+file)
    xml_elem = xmldoc.getElementsByTagName('webpage')
    web_elem = xml_elem[0]
    url = web_elem.attributes['uri']
    url_file.write(url.value + '\n')
url_file.close()
 
Richard Brodie

> xml.parsers.expat.ExpatError: not well-formed (invalid token): line 554, column 20
>
> I guess that the element I try to read or the XML(which would be
> strange since they have been created with the same code) can't be
> retrieved.

It's fairly easy to write non-robust XML generating code, and also
quick to test if one file is always bad. Drop it into a text editor or
Firefox, and take a quick look at line 554. Most likely some random
control character has sneaked in; it only takes (for example) one NUL
to make the document ill-formed.
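
One way to check for such a character without eyeballing the file is a small scan like the sketch below. It assumes the culprit is one of the control bytes that XML 1.0 forbids (everything below 0x20 except tab, line feed and carriage return); other causes, such as a bare & or a wrong encoding, would not be found this way. The names find_illegal_bytes and scan_file are made up for this example.

```python
import re

# Control bytes that XML 1.0 does not allow in a document:
# everything below 0x20 except tab (0x09), LF (0x0A) and CR (0x0D).
ILLEGAL_XML_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_illegal_bytes(data):
    """Return a list of (line number, byte offset in line, byte value)."""
    hits = []
    # Lines are numbered from 1 and offsets from 0.
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for match in ILLEGAL_XML_BYTES.finditer(line):
            hits.append((line_no, match.start(), match.group()[0]))
    return hits


def scan_file(filename):
    """Read a file as raw bytes and report forbidden control bytes."""
    with open(filename, "rb") as handle:
        return find_illegal_bytes(handle.read())
```

Calling scan_file on the suspect file returns an empty list if none of these bytes are present; otherwise each entry gives a position to look at in a text editor.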
 
dirkheld

> It's fairly easy to write non-robust XML generating code, and also
> quick to test if one file is always bad. Drop it into a text editor or
> Firefox, and take a quick look at line 554. Most likely some random
> control character has sneaked in; it only takes (for example) one NUL
> to make the document ill-formed.

Something strange here. The XML file causing the problem has only 361
lines. Isn't there a way to catch this error, ignore it and continue
with the rest of the files?
This is the full error report:

Traceback (most recent call last):
  File "xmltest.py", line 10, in <module>
    xmldoc = minidom.parse('/Documents/programming/data/xml/'+file)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/dom/minidom.py", line 1913, in parse
    return expatbuilder.parse(file)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/dom/expatbuilder.py", line 924, in parse
    result = builder.parseFile(fp)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/dom/expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 554, column 20
 
Marc 'BlackJack' Rintsch

> Something strange here. The xml file causing the problem has only 361
> lines. Isn't there a way to catch this error, ignore it and continue
> with the rest of the other files?

Yes of course: handle the exception instead of letting it propagate to the
top level and ending the program.

Ciao,
Marc 'BlackJack' Rintsch
 
dirkheld

> Yes of course: handle the exception instead of letting it propagate to the
> top level and ending the program.
>
> Ciao,
> Marc 'BlackJack' Rintsch

Ehm, maybe a stupid question... how? I'm rather new to Python and I
have never used error handling.
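
For reference, handling the exception could look roughly like the sketch below. It wraps the minidom.parse call from the original script in try/except and records the name of every file that fails instead of stopping. The function name collect_uris and the skipped list are invented for this example, and the "except ... as ..." syntax needs Python 2.6 or later; on Python 2.5 that line would be written "except ExpatError, error:".

```python
import os
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def collect_uris(path):
    """Return (uris, skipped) for the files found directly in path."""
    uris = []
    skipped = []  # (file name, reason) for every file that was not used
    for name in sorted(os.listdir(path)):
        full_name = os.path.join(path, name)
        if not os.path.isfile(full_name):
            continue  # ignore subdirectories
        try:
            xmldoc = minidom.parse(full_name)
        except ExpatError as error:
            # The file is not well-formed XML: remember it and move on.
            skipped.append((name, str(error)))
            continue
        elements = xmldoc.getElementsByTagName('webpage')
        if not elements or not elements[0].hasAttribute('uri'):
            skipped.append((name, "no webpage element with a uri attribute"))
            continue
        uris.append(elements[0].getAttribute('uri'))
    return uris, skipped
```

Printing the skipped list afterwards also shows which file really triggers the error. That may be worth checking here, since the suspected file has only 361 lines while the error points at line 554, so the failing file could be a different one in the directory.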
 
