christof hoeke
hi,
this is my first try with sax (and some of the first utils in python
too) so the code is not the best. but i wrote a small utility which
finds all used element names in a bunch of xml files. the reason is simply
to find out which elements are used, as a DTD is only partly available.
so with an os.path.walk over all xml files in a dir, including subdirs, a
simple sax ContentHandler stores all names in a dictionary (to
keep any given name only once).
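to make the idea concrete, here is a minimal sketch of just the name-collecting part, without the directory walk. it is written for python 3 and is not the code from this post: NameCounter, count_names and the sample document are made-up names for illustration only.

```python
import xml.sax
from collections import Counter


class NameCounter(xml.sax.ContentHandler):
    """Hypothetical handler: tallies every element name it sees."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def startElement(self, name, attrs):
        # called once per opening tag, so each occurrence is counted
        self.counts[name] += 1


def count_names(xml_bytes):
    """Parse an in-memory document and return the name -> count mapping."""
    handler = NameCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.counts


# illustrative sample document (not taken from any real file)
sample = b"<contacts><contact><name/></contact><contact><name/></contact></contacts>"
counts = count_names(sample)
for name in sorted(counts):
    print(counts[name], "\t", name)
```

the dictionary keys give the unique names, and the values show how often each one was used.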
the problem i have is that if the xml file has a doctype declaration the
sax parser tries to load the DTD and fails (IOError, of course).
partly because the path to the DTD is just a simple name in the same dir
e.g. <!DOCTYPE contacts SYSTEM "contacts.dtd"> and i guess the parser
does not use the path os.path.walk uses (can i somehow give the parser
this information?). but it also could be a DTD which should be loaded
over a network which is not available at the time.
at the moment these files are not processed at all.
i guess simply setting a feature of the sax parser so that it does not try
to load any external DTDs should work. the question is: which feature do i
have to disable?

    p = xml.sax.make_parser()
    p.setFeature('http://xml.org/sax/features/validation', False)

i thought turning off validation would stop the parser from loading
external DTDs, but it still tries to load them.
any other suggestions?
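one candidate worth trying is the standard sax feature 'http://xml.org/sax/features/external-general-entities', available in python as xml.sax.handler.feature_external_ges. the sketch below is written for python 3 and assumes the default expat-based parser, where switching this feature off should also keep the external DTD subset from being fetched. as far as i know, recent python 3 releases already have it off by default, so setting it explicitly mainly matters on older versions. NameSet, element_names and the file names are made up for the example.

```python
import os
import tempfile
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges


class NameSet(ContentHandler):
    """Hypothetical handler: remembers each element name once."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def startElement(self, name, attrs):
        self.names.add(name)


def element_names(path):
    """Return the set of element names used in the xml file at path."""
    parser = xml.sax.make_parser()
    handler = NameSet()
    parser.setContentHandler(handler)
    # ask the parser not to fetch external general entities; with the
    # expat driver (an assumption here) this also skips the external DTD
    parser.setFeature(feature_external_ges, False)
    parser.parse(path)
    return handler.names


# demo: the DOCTYPE points at a DTD that does not exist on disk
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "contacts.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0"?>\n'
                '<!DOCTYPE contacts SYSTEM "missing.dtd">\n'
                '<contacts><contact/></contacts>\n')
    print(sorted(element_names(path)))
```

if the file uses entities that are only declared in the skipped DTD, the parser cannot expand them, so this is only suitable when the element names are all that matter.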
sorry for the rather lengthy explanation and code.
thanks a lot!
chris
the complete code for a better understanding of my problem:

    import fnmatch, os.path, sys, xml.sax

    class ElementList:

        name = {}

        class Names(xml.sax.ContentHandler):
            def startElement(self, tag, attr):
                if not ElementList.name.has_key(tag):
                    ElementList.name[tag] = 1
                else:
                    ElementList.name[tag] += 1

        def process(self, file):
            try:
                #xml.sax.parse(file, ElementList.Names())
                p = xml.sax.make_parser()
                p.setContentHandler(ElementList.Names())
                p.setFeature('http://xml.org/sax/features/validation', False)
                p.parse(file)
                print '\t', file
            except (xml.sax.SAXException, IOError), e:
                print '\tNOT PROCESSED', file, e

        def printList(self):
            print
            print '#\t<ELEMENTNAME>'
            print '-\t-------------'
            keys = self.name.keys()
            keys.sort()
            for key in keys:
                print self.name[key], '\t', key

    class Lister:

        def __init__(self):
            self.el = ElementList()

        def process(self, dir):
            print
            print 'FILES'
            print '-----'
            def proc(junk, dir, files):
                for file in fnmatch.filter(files, '*.xml'):
                    self.el.process(os.path.join(dir, file))
            os.path.walk(dir, proc, None)

        def printList(self):
            self.el.printList()

    #MAIN
    if __name__ == '__main__':
        try:
            dir = sys.argv[1]
        except:
            print "usage: python lister.py startdir"
            sys.exit(0)
        l = Lister()
        l.process(dir)
        l.printList()