xml.sax feature question

christof hoeke · Oct 25, 2003

hi,
this is my first try with sax (and some of the first utils in python
too) so the code is not the best. but i wrote a small utility which
finds all used element names in a bunch of xml files. the reason is simply
to find out which elements are used, since a DTD is only partly available.

so with an os.path.walk over all xml-files in a dir including subdirs a
simple sax ContentHandler simply stores all names in a dictionary (to
keep any given name only once).

the problem i have is that if the xmlfile has a doctype declaration the
sax parser tries to load it and fails (IOError of course).
partly because the path to the DTD is just a simple name in the same dir
e.g. <!DOCTYPE contacts SYSTEM "contacts.dtd"> and i guess the parser
does not use the path os.path.walk uses (can i somehow give the parser
this information?). but it also could be a DTD which should be loaded
over a network which is not available at the time.

at the moment these files are not processed at all.

i guess to simply set a feature of the sax parser to not try to load any
external DTDs should work. question is which feature do i have to disable?
p = xml.sax.make_parser()
p.setFeature('http://xml.org/sax/features/validation', False)

i thought turning off the validation would stop the parser from loading
external DTDs, but it still tries to load them.
any other suggestions?

sorry for the rather lengthy explanation and code.
thanks a lot!
chris

the complete code for a better understanding of my problem:

import fnmatch, os.path, sys, xml.sax

class ElementList:
    name = {}

    class Names(xml.sax.ContentHandler):
        def startElement(self, tag, attr):
            if not ElementList.name.has_key(tag):
                ElementList.name[tag] = 1
            else:
                ElementList.name[tag] += 1

    def process(self, file):
        try:
            #xml.sax.parse(file, ElementList.Names())
            p = xml.sax.make_parser()
            p.setContentHandler(ElementList.Names())
            p.setFeature('http://xml.org/sax/features/validation', False)
            p.parse(file)
            print '\t', file
        except (xml.sax.SAXException, IOError), e:
            print '\tNOT PROCESSED', file, e

    def printList(self):
        print
        print '#\t<ELEMENTNAME>'
        print '-\t-------------'
        keys = self.name.keys()
        keys.sort()
        for key in keys:
            print self.name[key], '\t', key

class Lister:
    def __init__(self):
        self.el = ElementList()

    def process(self, dir):
        print
        print 'FILES'
        print '-----'
        def proc(junk, dir, files):
            for file in fnmatch.filter(files, '*.xml'):
                self.el.process(os.path.join(dir, file))
        os.path.walk(dir, proc, None)

    def printList(self):
        self.el.printList()

#MAIN
if __name__ == '__main__':
    try:
        dir = sys.argv[1]
    except:
        print "usage: python lister.py startdir"
        sys.exit(0)
    l = Lister()
    l.process(dir)
    l.printList()

Martin v. Löwis · Oct 26, 2003

christof hoeke said:
the problem i have is that if the xmlfile has a doctype declaration
the sax parser tries to load it and fails (IOError of course).
partly because the path to the DTD is just a simple name in the same
dir e.g. <!DOCTYPE contacts SYSTEM "contacts.dtd"> and i guess the
parser does not use the path os.path.walk uses (can i somehow give the
parser this information?). but it also could be a DTD which should be
loaded over a network which is not available at the time.

In XML, the SYSTEM identifier is a URI reference; in your case, it is
a relative URL. An XML processor must interpret this relative to the
URL of the main document. If you have the main document on a local
disk, the relative URL will be interpreted relative to the file name.
So you should put the DTD along with the document (in the same
directory).
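
To illustrate the resolution rule described above, here is a small sketch in
present-day Python 3 (not the Python version used in this thread). The
document URL is a made-up example; only the way the relative system
identifier is combined with it matters.

```python
from urllib.parse import urljoin

# Hypothetical location of the main document (an assumption for illustration).
document_url = "file:///data/xmltest/contacts.xml"

# The SYSTEM identifier from <!DOCTYPE contacts SYSTEM "contacts.dtd">.
system_id = "contacts.dtd"

# A relative URI reference is resolved against the URL of the main document,
# so the DTD is expected next to contacts.xml.
dtd_url = urljoin(document_url, system_id)
print(dtd_url)  # -> file:///data/xmltest/contacts.dtd
```

This only shows where a conforming processor should look for the DTD; it does
not say anything about what a particular parser actually does.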

i guess to simply set a feature of the sax parser to not try to load
any external DTDs should work. question is which feature do i have to
disable?
p = xml.sax.make_parser()
p.setFeature('http://xml.org/sax/features/validation', False)

i thought turning off the validation would stop the parser from loading
external DTDs, but it still tries to load them.

This just turns off validation. The parser you are using is not
validating anyway, so this has no effect. The parser still loads the
DTD, in order to expand entity references it may encounter.

any other suggestions?

You need to turn off resolution of general entities:

p.setFeature("http://xml.org/sax/features/external-general-entities",False)
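
A minimal sketch of that setting in context, written for present-day
Python 3 rather than the Python version used in this thread. The names
NameCounter, DOC and count_elements are made up for the example, and the
document deliberately points at a DTD that does not exist. It assumes the
default expat-based parser that xml.sax.make_parser() normally returns.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges


class NameCounter(ContentHandler):
    """Hypothetical handler: counts how often each element name occurs."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


# Sample document whose DOCTYPE refers to a DTD that is not available.
DOC = (b'<?xml version="1.0"?>\n'
       b'<!DOCTYPE contacts SYSTEM "no-such-file.dtd">\n'
       b'<contacts><contact/><contact/></contacts>')


def count_elements(data):
    parser = xml.sax.make_parser()
    # feature_external_ges is the constant for
    # "http://xml.org/sax/features/external-general-entities".
    parser.setFeature(feature_external_ges, False)
    handler = NameCounter()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler.counts


print(count_elements(DOC))
```

With the feature off, the parser should skip the external DTD instead of
trying to open it, so the missing file no longer matters for collecting
element names.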

Alternatively, you can install an entity handler which then uses a
different mechanism of resolving the DTD (and other external entities).
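
One possible shape for such a handler, again as a sketch in present-day
Python 3 and not necessarily what was meant here: a resolver that answers
every request with an empty stream, so nothing is read from disk or the
network. EmptyResolver, NameCounter, DOC and count_with_resolver are
hypothetical names. The sketch switches external-general-entities on
explicitly, on the assumption that the resolver is only consulted when that
feature is enabled.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource


class EmptyResolver(EntityResolver):
    """Hypothetical resolver: hands back an empty stream for every entity."""

    def __init__(self):
        self.requested = []  # system IDs the parser asked for

    def resolveEntity(self, publicId, systemId):
        self.requested.append(systemId)
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(b""))  # nothing to load
        return source


class NameCounter(ContentHandler):
    """Hypothetical handler: counts how often each element name occurs."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


DOC = (b'<?xml version="1.0"?>\n'
       b'<!DOCTYPE contacts SYSTEM "contacts.dtd">\n'
       b'<contacts><contact/></contacts>')


def count_with_resolver(data):
    parser = xml.sax.make_parser()
    # Assumption: the resolver is only called when this feature is on.
    parser.setFeature(feature_external_ges, True)
    resolver = EmptyResolver()
    parser.setEntityResolver(resolver)
    handler = NameCounter()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler.counts, resolver.requested
```

The same hook could instead return a locally cached copy of a DTD, which is
the more useful variant when entity definitions from the DTD are needed.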

Regards,
Martin

christof hoeke · Oct 26, 2003

Martin said:
In XML, the SYSTEM identifier is a URI reference; in your case, it is
a relative URL. An XML processor must interpret this relative to the
URL of the main document. If you have the main document on a local
disk, the relative URL will be interpreted relative to the file name.
So you should put the DTD along with the document (in the same
directory).

this is what i did but still i get the exception for example for
xmltest\contacts.xml "[Errno 2] No such file or directory:
'contacts.dtd'" if xmltest contains contacts.xml with the SYSTEM
identifier "contacts.dtd" and contacts.dtd is in the same directory.

You need to turn off resolution of general entities:

p.setFeature("http://xml.org/sax/features/external-general-entities",False)

exactly what i was looking for, thanks a lot. still i wonder why the
above error happens.

Alternatively, you can install an entity handler which then uses a
different mechanism of resolving the DTD (and other external entities).

i think i'll get a copy of the sax2 book to look into that a bit more...

thanks
christof

Martin v. Löwis · Oct 26, 2003

christof hoeke said:
exactly what i was looking for, thanks a lot. still i wonder why the
above error happens.

It appears that the standard entity resolver is

class EntityResolver:
def resolveEntity(self, publicId, systemId):
return systemId

So it just returns the system ID, instead of taking a base URL into
account. I'm uncertain whether this is a limitation of PyXML, or SAX
in general.
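
If the default resolver really ignores the base, one workaround is a
resolver that is told the document's directory and joins relative system IDs
onto it. The following is only a sketch in present-day Python 3;
BaseDirResolver, NameCounter and count_elements are hypothetical names.
Newer versions of the standard library may already resolve relative IDs
against the main document, which is an assumption worth checking before
relying on this.

```python
import os
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges


class BaseDirResolver(EntityResolver):
    """Hypothetical resolver: resolves relative system IDs against base_dir."""

    def __init__(self, base_dir):
        self.base_dir = base_dir

    def resolveEntity(self, publicId, systemId):
        # Leave absolute paths and anything that looks like a URL untouched.
        if os.path.isabs(systemId) or "://" in systemId:
            return systemId
        return os.path.join(self.base_dir, systemId)


class NameCounter(ContentHandler):
    """Hypothetical handler: counts how often each element name occurs."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


def count_elements(xml_path):
    base_dir = os.path.dirname(os.path.abspath(xml_path))
    parser = xml.sax.make_parser()
    # Assumption: external entities must be enabled for the resolver to run.
    parser.setFeature(feature_external_ges, True)
    parser.setEntityResolver(BaseDirResolver(base_dir))
    handler = NameCounter()
    parser.setContentHandler(handler)
    parser.parse(xml_path)
    return handler.counts
```

Called as count_elements(os.path.join("xmltest", "contacts.xml")), this would
look for contacts.dtd inside xmltest regardless of the current working
directory.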

Regards,
Martin
