xml.sax feature question

christof hoeke

hi,
this is my first try with sax (and one of my first utilities in python
too), so the code is not the best. i wrote a small utility which finds
all element names used in a bunch of xml files. the reason is simply to
find out which elements are used, since a DTD is only partly available.

so with an os.path.walk over all xml files in a dir, including subdirs, a
simple sax ContentHandler stores all names in a dictionary (to keep any
given name only once).

the problem i have is that if the xml file has a doctype declaration, the
sax parser tries to load the DTD and fails (IOError of course).
partly this is because the path to the DTD is just a simple name in the
same dir, e.g. <!DOCTYPE contacts SYSTEM "contacts.dtd">, and i guess the
parser does not use the path os.path.walk uses (can i somehow give the
parser this information?). but it could also be a DTD which should be
loaded over a network which is not available at the time.

at the moment these files are not processed at all.

i guess simply setting a feature of the sax parser so that it does not try
to load any external DTDs should work. the question is: which feature do i
have to disable?
p = xml.sax.make_parser()
p.setFeature('http://xml.org/sax/features/validation', False)

i thought turning off the validation would stop the parser from loading
external DTDs, but it still tries to load them.
any other suggestions?


sorry for the rather lengthy explanation and code.
thanks a lot!
chris

the complete code for a better understanding of my problem:

import fnmatch, os.path, sys, xml.sax

class ElementList:
    name = {}

    class Names(xml.sax.ContentHandler):
        def startElement(self, tag, attr):
            if not ElementList.name.has_key(tag):
                ElementList.name[tag] = 1
            else:
                ElementList.name[tag] += 1

    def process(self, file):
        try:
            #xml.sax.parse(file, ElementList.Names())
            p = xml.sax.make_parser()
            p.setContentHandler(ElementList.Names())
            p.setFeature('http://xml.org/sax/features/validation', False)
            p.parse(file)
            print '\t', file
        except (xml.sax.SAXException, IOError), e:
            print '\tNOT PROCESSED', file, e

    def printList(self):
        print
        print '#\t<ELEMENTNAME>'
        print '-\t-------------'
        keys = self.name.keys()
        keys.sort()
        for key in keys:
            print self.name[key], '\t', key

class Lister:
    def __init__(self):
        self.el = ElementList()

    def process(self, dir):
        print
        print 'FILES'
        print '-----'
        def proc(junk, dir, files):
            for file in fnmatch.filter(files, '*.xml'):
                self.el.process(os.path.join(dir, file))
        os.path.walk(dir, proc, None)

    def printList(self):
        self.el.printList()

#MAIN
if __name__ == '__main__':
    try:
        dir = sys.argv[1]
    except:
        print "usage: python lister.py startdir"
        sys.exit(0)
    l = Lister()
    l.process(dir)
    l.printList()
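
The listing above is Python 2 (print statements, has_key, os.path.walk). For readers on Python 3, the same idea might look roughly like the sketch below. It is not the original author's code: the names NameCounter and count_elements are made up for illustration, and it assumes os.walk and collections.Counter as stand-ins for os.path.walk and the hand-rolled dictionary.

```python
import fnmatch
import os
import xml.sax
from collections import Counter


class NameCounter(xml.sax.ContentHandler):
    """Counts how often each element name starts (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def startElement(self, name, attrs):
        self.counts[name] += 1


def count_elements(start_dir):
    """Walk start_dir, parse every *.xml file, return (counts, skipped)."""
    handler = NameCounter()
    skipped = []  # (path, exception) for files that could not be parsed
    for dirpath, _dirnames, filenames in os.walk(start_dir):
        for filename in fnmatch.filter(filenames, "*.xml"):
            path = os.path.join(dirpath, filename)
            try:
                xml.sax.parse(path, handler)
            except (xml.sax.SAXException, OSError) as exc:
                skipped.append((path, exc))
    return handler.counts, skipped


if __name__ == "__main__":
    import sys

    if len(sys.argv) != 2:
        sys.exit("usage: python lister.py startdir")
    counts, skipped = count_elements(sys.argv[1])
    for name in sorted(counts):
        print(counts[name], name, sep="\t")
    for path, exc in skipped:
        print("NOT PROCESSED", path, exc, sep="\t")
```

As in the original, a file that fails partway through still contributes the elements seen before the error.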
 
Martin v. Löwis

christof hoeke said:
the problem i have is that if the xmlfile has a doctype declaration
the sax parser tries to load it and fails (IOError of course).
partly because the path to the DTD is just a simple name in the same
dir e.g. <!DOCTYPE contacts SYSTEM "contacts.dtd"> and i guess the
parser does not use the path os.path.walk uses (can i somehow give the
parser this information?). but it also could be a DTD which should be
loaded over a network which is not available at the time.

In XML, the SYSTEM identifier is a URI reference; in your case, it is
a relative URL. An XML processor must interpret this relative to the
URL of the main document. If you have the main document on a local
disk, the relative URL will be interpreted relative to the file name.
So you should put the DTD along with the document (in the same
directory).

christof hoeke said:
i guess to simply set a feature of the sax parser to not try to load
any external DTDs should work. question is which feature do i have to
disable?
p = xml.sax.make_parser()
p.setFeature('http://xml.org/sax/features/validation', False)

i thought turning off the validation would stop the parser from loading
external DTDs, but it still tries to load them.

This just turns off validation. The parser you are using is not
validating anyway, so this has no effect. The parser still loads the
DTD, in order to expand entity references it may encounter.

christof hoeke said:
any other suggestions?

You need to turn off resolution of general entities:

p.setFeature("http://xml.org/sax/features/external-general-entities",False)
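
A minimal sketch of this suggestion, written for Python 3 and assuming the expat-based parser that xml.sax.make_parser() returns by default. NameCollector, element_names and DOC are illustrative names, not part of the thread. xml.sax.handler.feature_external_ges is the standard-library constant for the feature URI given above.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges


class NameCollector(ContentHandler):
    """Remembers every element name it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def startElement(self, name, attrs):
        self.names.add(name)


def element_names(xml_bytes):
    """Parse xml_bytes without fetching the external DTD."""
    parser = xml.sax.make_parser()
    # Same feature as in the reply, spelled via the stdlib constant.
    parser.setFeature(feature_external_ges, False)
    handler = NameCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.names


# The DOCTYPE points at a DTD that does not exist anywhere.
DOC = (b'<!DOCTYPE contacts SYSTEM "contacts.dtd">\n'
       b'<contacts><contact/></contacts>')

if __name__ == "__main__":
    print(sorted(element_names(DOC)))
```

With the feature switched off, the parser never tries to open contacts.dtd, so the missing file no longer matters.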

Alternatively, you can install an entity handler which then uses a
different mechanism of resolving the DTD (and other external entities).
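
The entity-handler route could be sketched as below, again for Python 3 with the default expat-based parser. SkipExternalEntities and element_names_skipping_dtd are made-up names. The sketch assumes it is acceptable to answer every external entity request, including the one for the DTD, with an empty input source. External general entities are switched on explicitly here, because the resolver is only consulted when that feature is enabled.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource


class SkipExternalEntities(EntityResolver):
    """Answers every external entity request with empty content."""

    def __init__(self):
        self.requested = []  # system IDs the parser asked for

    def resolveEntity(self, publicId, systemId):
        self.requested.append(systemId)
        empty = InputSource(systemId)
        empty.setByteStream(io.BytesIO(b""))  # nothing to read, nothing to open
        return empty


class NameCollector(ContentHandler):
    """Remembers every element name it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def startElement(self, name, attrs):
        self.names.add(name)


def element_names_skipping_dtd(xml_bytes):
    """Return (element names, system IDs that were requested)."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, True)
    resolver = SkipExternalEntities()
    parser.setEntityResolver(resolver)
    handler = NameCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.names, resolver.requested


if __name__ == "__main__":
    doc = (b'<!DOCTYPE contacts SYSTEM "contacts.dtd">\n'
           b'<contacts><contact/></contacts>')
    print(element_names_skipping_dtd(doc))
```

Compared with disabling the feature, this keeps a record of which DTDs the documents refer to, which may be useful when only some of them are available.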

Regards,
Martin
 
christof hoeke

Martin said:
In XML, the SYSTEM identifier is a URI reference; in your case, it is
a relative URL. An XML processor must interpret this relative to the
URL of the main document. If you have the main document on a local
disk, the relative URL will be interpreted relative to the file name.
So you should put the DTD along with the document (in the same
directory).

this is what i did but still i get the exception for example for
xmltest\contacts.xml "[Errno 2] No such file or directory:
'contacts.dtd'" if xmltest contains contacts.xml with the SYSTEM
identifier "contacts.dtd" and contacts.dtd is in the same directory.

Martin said:
You need to turn off resolution of general entities:

p.setFeature("http://xml.org/sax/features/external-general-entities",False)


exactly what i was looking for, thanks a lot. still i wonder why the
above error happens.

Martin said:
Alternatively, you can install an entity handler which then uses a
different mechanism of resolving the DTD (and other external entities).

i think i'll get a copy of the sax2 book to look into that a bit more...

thanks
christof
 
Martin v. Löwis

christof hoeke said:
exactly what i was looking for, thanks a lot. still i wonder why the
above error happens.

It appears that the standard entity resolver is

class EntityResolver:
    def resolveEntity(self, publicId, systemId):
        return systemId

So it just returns the system ID, instead of taking a base URL into
account. I'm uncertain whether this is a limitation of PyXML, or SAX
in general.
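
If the DTD should actually be loaded, one possible workaround is a resolver that takes the document's directory into account itself. The following is a sketch under that assumption, for Python 3 with the default expat-based parser. SiblingResolver and element_names are hypothetical names, and the resolver only handles system IDs that are plain relative file names.

```python
import os
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges


class SiblingResolver(EntityResolver):
    """Looks for relative system IDs in a fixed base directory."""

    def __init__(self, base_dir):
        self.base_dir = os.path.abspath(base_dir)

    def resolveEntity(self, publicId, systemId):
        candidate = os.path.join(self.base_dir, systemId)
        if os.path.isfile(candidate):
            return candidate  # absolute path to the DTD next to the document
        return systemId       # anything else falls back to the default lookup


class NameCollector(ContentHandler):
    """Remembers every element name it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def startElement(self, name, attrs):
        self.names.add(name)


def element_names(xml_path):
    """Parse xml_path, resolving its DTD relative to the file's directory."""
    parser = xml.sax.make_parser()
    # The resolver is only consulted when this feature is enabled.
    parser.setFeature(feature_external_ges, True)
    parser.setEntityResolver(SiblingResolver(os.path.dirname(xml_path)))
    handler = NameCollector()
    parser.setContentHandler(handler)
    parser.parse(xml_path)
    return handler.names
```

Called as element_names(os.path.join("xmltest", "contacts.xml")), this would look for contacts.dtd inside xmltest rather than in the current working directory. A system ID that points to an unreachable network location would still fail.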

Regards,
Martin
 
