XML Parsing

Tyler Eaves · Feb 14, 2004

Hi,

Right now I'm using xml.dom.minidom for parsing some xml files. It
works, certainly, but the speed leaves a bit to be desired. Are there
any other XML modules that offer the same interface minidom does, but
are faster? Things like validation are not a big deal for me, as all
the XML is generated by my own programs, so I'm not worried about
malformed documents.

Andrew Clover · Feb 15, 2004

Tyler Eaves said:
Are there any other XML modules that offer the same interface minidom
does, but are faster?

It's not totally the same interface as minidom, but cDomlette offers a
fast set of XML operations through an incomplete DOM interface. See
http://www.4suite.org/ .

With simple XML and a bit of care avoiding problem areas (see eg.
http://pyxml.sourceforge.net/topics/compliance.html ) it is possible to
write software that will work equally well with minidom, cDomlette and
other DOM implementations.
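
As a rough illustration of that "bit of care", here is a minimal sketch that sticks to core DOM members (childNodes, nodeType, data, getAttribute, documentElement). Only minidom is exercised here; that other implementations accept the same helpers unchanged is an assumption, and the helper names and sample document are made up for the example.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def element_children(node):
    """Return the child elements of node, using only core DOM properties."""
    return [child for child in node.childNodes
            if child.nodeType == Node.ELEMENT_NODE]


def text_of(node):
    """Concatenate the text nodes that sit directly under node."""
    return "".join(child.data for child in node.childNodes
                   if child.nodeType == Node.TEXT_NODE)


# Hypothetical sample document, parsed with minidom only.
doc = parseString(
    "<inventory><item id='1'>bolt</item><item id='2'>nut</item></inventory>"
)
items = [(el.getAttribute("id"), text_of(el))
         for el in element_children(doc.documentElement)]
print(items)
```

Keeping the implementation-specific part down to the single parse call is what would make it cheap to try a different DOM later.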

Chris Herborth · Feb 16, 2004

Tyler said:
Right now I'm using xml.dom.minidom for parsing some xml files. It
works, certainly, but the speed leaves a bit to be desired. Are there
any other XML modules that offer the same interface minidom does, but
are faster? Things like validation are not a big deal for me, as all
the XML is generated by my own programs, so I'm not worried about
malformed documents.

PyXML on Sourceforge (http://pyxml.sourceforge.net/) has faster
DOM-producing routines.

pyRXP (http://www.reportlab.org/pyrxp.html) is probably the fastest XML
parser for Python, but it doesn't produce a DOM or have a SAX API... it
produces tuple-based output that's easy enough to dig through in Python.
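
To give a feel for digging through tuple-based output, here is a small sketch. It does not call pyRXP at all: the tree is written by hand, and the (tag, attributes, children, spare) shape is an assumption about the parser's output rather than something checked against it. The helper names are hypothetical.

```python
# Hand-written stand-in for parser output. Assumed node shape:
# (tag_name, attributes_dict_or_None, children_list_or_None, spare)
tree = ("inventory", None, [
    ("item", {"id": "1"}, ["bolt"], None),
    ("item", {"id": "2"}, ["nut"], None),
], None)


def find_all(node, tag):
    """Yield every element tuple named tag, depth first."""
    name, _attrs, children, _spare = node
    if name == tag:
        yield node
    for child in children or []:
        # Children are either nested element tuples or plain strings.
        if isinstance(child, tuple):
            yield from find_all(child, tag)


def text_of(node):
    """Join the string children that sit directly under node."""
    return "".join(c for c in (node[2] or []) if isinstance(c, str))


items = [((n[1] or {}).get("id"), text_of(n)) for n in find_all(tree, "item")]
print(items)
```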

I'm probably going to be working on a pyRXP -> DOM translator (I've got an
existing DOM app that uses XPath; I don't want to rewrite it to use tuples),
but no idea if/when it'll be in a working state.

Chris Herborth · Feb 16, 2004

Andrew said:
With simple XML and a bit of care avoiding problem areas (see eg.
http://pyxml.sourceforge.net/topics/compliance.html ) it is possible to
write software that will work equally well with minidom, cDomlette and
other DOM implementations.

Ain't standards great? ;-)

James Kew · Feb 16, 2004

Chris Herborth said:
PyXML on Sourceforge (http://pyxml.sourceforge.net/) has faster
DOM-producing routines.

Which are? I like PyXML, but well-documented it ain't. I tend to use PyXML's
minidom, fed by either the validating (== xmlproc) or non-validating (==
expat) parsers -- are there faster PyXML alternatives?

pyRXP (http://www.reportlab.org/pyrxp.html) is probably the fastest XML
parser for Python, but it doesn't produce a DOM or have a SAX API...

And recent threads here suggest it's not fully XML-compliant either, unless
you can work in an ASCII-only XML subset.

For raw speed, libxml2 (and its Python wrapper) seems to get a lot of
glowing reviews. It's not a standard DOM API, though, and again
documentation is a problem (lots of C-API-level documentation, but not much
in terms of how to put it together into a working Python app).

I gave it a whirl and it certainly seemed to fly, but getting to grips with
the API and converting my existing DOM-manipulating code to it felt like too
much of a hurdle given that my app runs fast enough as it is.

James

Paul Boddie · Feb 17, 2004

James Kew said:
For raw speed, libxml2 (and its Python wrapper) seems to get a lot of
glowing reviews. It's not a standard DOM API, though, and again
documentation is a problem (lots of C-API-level documentation, but not much
in terms of how to put it together into a working Python app).

I made a PyXML-style wrapper for libxml2, although it works above the
existing wrapper and therefore isn't very fast. However, if you just
want to access various parts of your documents before getting libxslt
to do the real work, you might find it convenient. Here it is:

http://www.boddie.org.uk/python/downloads/libxml2dom-0.1.tar.gz

I also made a wrapper around qtxml/KHTML which gives the same
PyXML-style conveniences:

http://www.boddie.org.uk/python/downloads/qtxmldom-0.1.tar.gz

Obviously, if you don't mind writing to a specific API, then neither
of these packages is the way to go. However, XML processing is quite
often a tradeoff between compliance, convenience and performance, as
the recent PyRXP debate demonstrates. ;-)

Paul

Andrew Clover · Feb 17, 2004

Chris Herborth said:
Ain't standards great? ;-)

Heh. Quite so, although to be fair cDomlette and FtMiniDom don't actually
claim to be full DOM implementations.

It was frustration with this rather uneven state of affairs that led me
to roll my own. Speaking of which, I'm happy to announce that pxdom
1.0 [final] has been released:

http://www.doxdesk.com/software/py/pxdom.html

This implements the February 2004 Proposed Recommendations for DOM Level
3 Core/XML and Load/Save completely (except for the optional external
entity support, which will be coming in 1.1 [beta], and optional DTD
validation, which is unlikely to happen any time soon I'm afraid.)

Hurrah!

Nicodemus · Feb 18, 2004

Tyler said:
Hi,

Right now I'm using xml.dom.minidom for parsing some xml files. It
works, certainly, but the speed leaves a bit to be desired. Are there
any other XML modules that offer the same interface minidom does, but
are faster? Things like validation are not a big deal for me, as all
the XML is generated by my own programs, so I'm not worried about
malformed documents.

Take a look at Fredrik Lundh's element tree:
http://effbot.org/zone/element-index.htm

It's fast and very pythonic... I use it all the time.
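
A minimal sketch of what that looks like, assuming a Python version that ships the library in the standard library as xml.etree.ElementTree (the standalone package linked above is imported under a different name). The sample document is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
root = ET.fromstring(
    "<inventory><item id='1'>bolt</item><item id='2'>nut</item></inventory>"
)

# Elements behave like lists of children, with attributes available via get().
items = [(item.get("id"), item.text) for item in root.findall("item")]
print(items)
```

Note that this is not the minidom interface, so existing DOM code would need rewriting rather than a change of import.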

Regards,
Nicodemus.

Nick Efford · Feb 18, 2004

James Kew said:
Which are? I like PyXML, but well-documented it ain't.


What about O'Reilly's "Python and XML"?...

N.

Tim Roberts · Feb 19, 2004

Nick Efford said:
What about O'Reilly's "Python and XML"?...

For what it's worth, I thought this was a great book. I had done lots of
Python but no serious XML work before reading it; someone with a deeper
background might have a different view.

Uche Ogbuji · Feb 21, 2004

Chris Herborth said:
Ain't standards great? ;-)

We never claim cDomlette to be a DOM implementation. The main page
for cDomlette info is:

http://uche.ogbuji.net/tech/akara/nodes/2003-01-01/domlettes

It starts with:

"Domlette is 4Suite's lightweight DOM implementation. It is optimized
for XPath operations, speed, and relatively low memory overhead, at
least when compared to 4DOM and minidom. It is not fully DOM
compliant, but it does provide an interface very close to DOM Level 2.
In Domlette, where DOM and XPath disagree, XPath wins."

That last point is the salient one. We wrote cDomlette for a reason:
4XSLT was *way* too slow operating on standard DOM and we needed a
super-fast alternative specialized for XPath processing. The emphasis
was on XPath data model rather than DOM. Both, BTW, are W3C standards
and yet they conflict in a few key ways. Go figure.

Anyway, cDomlette is a useful and very fast general API for XML
processing. You can use it if you don't need full DOM support.

--Uche
http://uche.ogbuji.net

Uche Ogbuji · Feb 21, 2004

James Kew said:
Which are? I like PyXML, but well-documented it ain't. I tend to use PyXML's
minidom, fed by either the validating (== xmlproc) or non-validating (==
expat) parsers -- are there faster PyXML alternatives?

And recent threads here suggest it's not fully XML-compliant either, unless
you can work in an ASCII-only XML subset.

Yes, and this is a very serious problem. Anyone entering into XML
processing with the belief that they'll never need anything but
Unicode characters below U+0100 is fooling himself. Heck, even XML
exports from MS Office will generate high Unicode characters for
"smart" quotes, em and en dashes, ellipses and a lot of other common
punctuation. All of these will blow up with PyRXP.
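
A quick way to see how far outside the 8-bit range this punctuation sits is sketched below. It uses minidom purely as a Unicode-aware reference parser and makes no attempt to reproduce the PyRXP behaviour; the sample text is made up.

```python
from xml.dom.minidom import parseString

# U+201C/U+201D are curly quotes, U+2014 an em dash, U+2026 an ellipsis.
text = "\u201cquoted\u201d \u2014 and so on\u2026"
xml_bytes = ("<p>%s</p>" % text).encode("utf-8")

# With no XML declaration the parser treats the bytes as UTF-8.
doc = parseString(xml_bytes)
parsed = doc.documentElement.firstChild.data

# Collect the code points that do not fit in a single byte.
high = sorted({"U+%04X" % ord(ch) for ch in parsed if ord(ch) > 0xFF})
print(high)
```

Feeding a sample like this to any candidate parser is a cheap check before committing to it.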

You can use PyRXPU, which is compliant but indications are that it
isn't as fast.

For raw speed, libxml2 (and its Python wrapper) seems to get a lot of
glowing reviews. It's not a standard DOM API, though, and again
documentation is a problem (lots of C-API-level documentation, but not much
in terms of how to put it together into a working Python app).

I gave it a whirl and it certainly seemed to fly, but getting to grips with
the API and converting my existing DOM-manipulating code to it felt like too
much of a hurdle given that my app runs fast enough as it is.

This was my biggest problem with libxml2/Python as documented here:

http://www.xml.com/pub/a/2003/05/14/py-xml.html

If documentation for Python users is improved, it will be hard to beat
that package.

But your criteria lead me to suggest that you give cDomlette a try. It
is also implemented in C for performance. It's as much DOM compliant
as libxml2's DOM API (which is to say not fully so), but we do try to
document it from the Python POV. See:

http://uche.ogbuji.net/tech/akara/nodes/2003-01-01/domlettes

--Uche
http://uche.ogbuji.net
