XML Parsing

Tyler Eaves

Hi,

Right now I'm using xml.dom.minidom for parsing some xml files. It
works, certainly, but the speed leaves a bit to be desired. Are there
any other XML modules that offer the same interface minidom does, but
are faster? Things like validation are not a big deal for me, as all
the XML is generated by my own programs, so I'm not worried about
malformed documents.
 
Andrew Clover

Tyler Eaves said:
Are there any other XML modules that offer the same interface minidom
does, but are faster?

It's not totally the same interface as minidom, but cDomlette offers a
fast set of XML operations through an incomplete DOM interface. See
http://www.4suite.org/ .

With simple XML and a bit of care avoiding problem areas (see e.g.
http://pyxml.sourceforge.net/topics/compliance.html ) it is possible to
write software that will work equally well with minidom, cDomlette and
other DOM implementations.
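
As a rough illustration of that approach, here is a minimal sketch that touches the tree only through a small set of core DOM members (childNodes, nodeType, nodeName, getAttribute, data) and names the parser in a single place. It is run against minidom only; that every one of these calls behaves identically on cDomlette or another implementation is an assumption to check against that implementation's documentation. The sample document and the helper names (child_elements, text_of, read_items) are made up for the example.

```python
from xml.dom import Node, minidom

# Hypothetical sample document, standing in for program-generated XML.
SAMPLE = "<items><item id='1'>spam</item><item id='2'>eggs</item></items>"


def child_elements(node, name):
    """Return the direct child elements of node with the given tag name."""
    return [
        child
        for child in node.childNodes
        if child.nodeType == Node.ELEMENT_NODE and child.nodeName == name
    ]


def text_of(element):
    """Join the character data found directly inside element."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    )


def read_items(document):
    """Return (id, text) pairs for each <item> under the root element."""
    root = document.documentElement
    return [
        (item.getAttribute("id"), text_of(item))
        for item in child_elements(root, "item")
    ]


# The only implementation-specific line: swap the parser here.
document = minidom.parseString(SAMPLE)
items = read_items(document)
print(items)
```

Keeping the parser call separate from the tree-walking code means a change of implementation is a one-line edit followed by a re-run of whatever tests cover read_items.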
 
Chris Herborth

Tyler said:
Right now I'm using xml.dom.minidom for parsing some xml files. It
works, certainly, but the speed leaves a bit to be desired. Are there
any other XML modules that offer the same interface minidom does, but
are faster? Things like validation are not a big deal for me, as all
the XML is generated by my own programs, so I'm not worried about
malformed documents.

PyXML on Sourceforge (http://pyxml.sourceforge.net/) has faster
DOM-producing routines.

pyRXP (http://www.reportlab.org/pyrxp.html) is probably the fastest XML
parser for Python, but it doesn't produce a DOM or have a SAX API... it
produces tuple-based output that's easy enough to dig through in Python.

I'm probably going to be working on a pyRXP -> DOM translator (I've got an
existing DOM app that uses XPath; I don't want to rewrite it to use tuples),
but no idea if/when it'll be in a working state.
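
For anyone curious what such a translator might look like, here is a minimal sketch that builds a minidom document from a nested tuple tree. It assumes each element arrives as a 4-tuple of (tag name, attribute dict or None, list of children or None, spare), with text children as plain strings; the sample tree is written by hand rather than produced by pyRXP, and the function names are hypothetical. It ignores comments, processing instructions and namespaces.

```python
from xml.dom import minidom

# Hand-written stand-in for parser output, assuming the
# (tag_name, attributes, children, spare) tuple layout.
SAMPLE_TREE = (
    "order",
    {"id": "42"},
    [
        ("item", {"sku": "a1"}, ["spam"], None),
        ("item", None, ["eggs"], None),
    ],
    None,
)


def append_tuple_node(document, parent, node):
    """Convert one tuple node (recursively) and attach it to parent."""
    tag_name, attributes, children, _spare = node
    element = document.createElement(tag_name)
    for name, value in (attributes or {}).items():
        element.setAttribute(name, value)
    for child in children or []:
        if isinstance(child, tuple):
            append_tuple_node(document, element, child)
        else:
            # Anything that is not a tuple is treated as character data.
            element.appendChild(document.createTextNode(child))
    parent.appendChild(element)
    return element


def tuple_tree_to_dom(tree):
    """Build a new minidom Document whose root element mirrors tree."""
    document = minidom.Document()
    append_tuple_node(document, document, tree)
    return document


dom = tuple_tree_to_dom(SAMPLE_TREE)
print(dom.documentElement.toxml())
```

Note that building DOM nodes one by one in Python gives back some of the speed gained from a fast parser, so whether this pays off would need measuring on real documents.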
 
James Kew

Chris Herborth said:
PyXML on Sourceforge (http://pyxml.sourceforge.net/) has faster
DOM-producing routines.

Which are? I like PyXML, but well-documented it ain't. I tend to use PyXML's
minidom, fed by either the validating (== xmlproc) or non-validating (==
expat) parsers -- are there faster PyXML alternatives?

Chris Herborth said:
pyRXP (http://www.reportlab.org/pyrxp.html) is probably the fastest XML
parser for Python, but it doesn't produce a DOM or have a SAX API...

And recent threads here suggest it's not fully XML-compliant either, unless
you can work in an ASCII-only XML subset.

For raw speed, libxml2 (and its Python wrapper) seems to get a lot of
glowing reviews. It's not a standard DOM API, though, and again
documentation is a problem (lots of C-API-level documentation, but not much
in terms of how to put it together into a working Python app).

I gave it a whirl and it certainly seemed to fly, but getting to grips with
the API and converting my existing DOM-manipulating code to it felt like too
much of a hurdle given that my app runs fast enough as it is.

James
 
Paul Boddie

James Kew said:
For raw speed, libxml2 (and its Python wrapper) seems to get a lot of
glowing reviews. It's not a standard DOM API, though, and again
documentation is a problem (lots of C-API-level documentation, but not much
in terms of how to put it together into a working Python app).

I made a PyXML-style wrapper for libxml2, although it works above the
existing wrapper and therefore isn't very fast. However, if you just
want to access various parts of your documents before getting libxslt
to do the real work, you might find it convenient. Here it is:

http://www.boddie.org.uk/python/downloads/libxml2dom-0.1.tar.gz

I also made a wrapper around qtxml/KHTML which gives the same
PyXML-style conveniences:

http://www.boddie.org.uk/python/downloads/qtxmldom-0.1.tar.gz

Obviously, if you don't mind writing to a specific API, then neither
of these packages is the way to go. However, XML processing is quite
often a tradeoff between compliance, convenience and performance, as
the recent PyRXP debate demonstrates. ;-)

Paul
 
Andrew Clover

Chris Herborth said:
Ain't standards great? ;-)

Heh. Quite so, although to be fair cDomlette and FtMiniDom don't actually
claim to be full DOM implementations.

It was frustration with this rather uneven state of affairs that led me
to roll my own. Speaking of which, I'm happy to announce that pxdom
1.0 [final] has been released:

http://www.doxdesk.com/software/py/pxdom.html

This implements the February 2004 Proposed Recommendations for DOM Level
3 Core/XML and Load/Save completely (except for the optional external
entity support, which will be coming in 1.1 [beta], and optional DTD
validation, which is unlikely to happen any time soon I'm afraid.)

Hurrah!
 
Nicodemus

Tyler said:
Hi,

Right now I'm using xml.dom.minidom for parsing some xml files. It
works, certainly, but the speed leaves a bit to be desired. Are there
any other XML modules that offer the same interface minidom does, but
are faster? Things like validation are not a big deal for me, as all
the XML is generated by my own programs, so I'm not worried about
malformed documents.

Take a look at Fredrik Lundh's element tree:
http://effbot.org/zone/element-index.htm

It's fast and very pythonic... I use it all the time.

Regards,
Nicodemus.
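
To give a flavour of the ElementTree style, here is a minimal sketch. It imports the module as xml.etree.ElementTree, which is where later Python versions ship it in the standard library; with the standalone package the import path differs. The sample document and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, standing in for program-generated XML.
SAMPLE = """
<inventory>
  <part id="p1"><name>widget</name><qty>3</qty></part>
  <part id="p2"><name>gadget</name><qty>5</qty></part>
</inventory>
"""

root = ET.fromstring(SAMPLE)

# Elements behave like lists of children; attributes are read with get(),
# and findtext() returns the text of the first matching child.
parts = {
    part.get("id"): (part.findtext("name"), int(part.findtext("qty")))
    for part in root.findall("part")
}
total_qty = sum(qty for _name, qty in parts.values())

print(parts)
print(total_qty)
```

This is not the DOM interface, so existing minidom code would need rewriting rather than a drop-in swap.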
 
Tim Roberts

Nick Efford said:
What about O'Reilly's "Python and XML"?...

For what it's worth, I thought this was a great book. I had done lots of
Python but no serious XML work before reading it; someone with a deeper
background might have a different view.
 
Uche Ogbuji

Chris Herborth said:
Ain't standards great? ;-)

We never claim cDomlette to be a DOM implementation. The main page
for cDomlette info is:

http://uche.ogbuji.net/tech/akara/nodes/2003-01-01/domlettes

It starts with:

"Domlette is 4Suite's lightweight DOM implementation. It is optimized
for XPath operations, speed, and relatively low memory overhead, at
least when compared to 4DOM and minidom. It is not fully DOM
compliant, but it does provide an interface very close to DOM Level 2.
In Domlette, where DOM and XPath disagree, XPath wins."

That last point is the salient one. We wrote cDomlette for a reason:
4XSLT was *way* too slow operating on standard DOM and we needed a
super-fast alternative specialized for XPath processing. The emphasis
was on XPath data model rather than DOM. Both, BTW, are W3C standards
and yet they conflict in a few key ways. Go figure.
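
One commonly cited difference, offered here as an illustration rather than as the specific conflicts meant above, is character data: DOM can expose text and CDATA sections as separate sibling nodes, while the XPath data model treats the same content as a single text node. The sketch below shows the DOM side with minidom; ElementTree is used only as a convenient stand-in that merges the character data, it is not an XPath implementation. The sample string is made up.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

# Hypothetical sample: text, a CDATA section, then more text.
SAMPLE = "<a>foo<![CDATA[bar]]>baz</a>"

# DOM view: each run of character data is its own child node.
dom_children = minidom.parseString(SAMPLE).documentElement.childNodes
dom_node_names = [child.nodeName for child in dom_children]

# Merged view: the same content seen as one piece of text.
flat_text = ET.fromstring(SAMPLE).text

print(len(dom_children), dom_node_names)
print(flat_text)
```

Code written against one model (for example, indexing into childNodes) can therefore give different answers on a tree built for the other.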

Anyway, cDomlette is a useful and very fast general API for XML
processing. You can use it if you don't need full DOM support.

--Uche
http://uche.ogbuji.net
 
Uche Ogbuji

James Kew said:
Which are? I like PyXML, but well-documented it ain't. I tend to use PyXML's
minidom, fed by either the validating (== xmlproc) or non-validating (==
expat) parsers -- are there faster PyXML alternatives?


James Kew said (about pyRXP):
And recent threads here suggest it's not fully XML-compliant either, unless
you can work in an ASCII-only XML subset.

Yes, and this is a very serious problem. Anyone entering into XML
processing with the belief that they'll never need anything but
Unicode characters below U+0100 (code point 256) is fooling himself.
Heck, even XML exports from MS Office will generate high Unicode
characters for "smart" quotes, em and en dashes, ellipses and a lot of
other common punctuation. All of these will blow up with PyRXP.

You can use PyRXPU, which is compliant but indications are that it
isn't as fast.
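
A quick way to see how far outside the 8-bit range this everyday punctuation sits is to print the code points. The sketch below does that with the standard library and then round-trips the same characters through minidom as UTF-8; it does not exercise PyRXP or PyRXPU, and the sample string is made up.

```python
import unicodedata
from xml.dom import minidom

# Curly single and double quotes, en dash, em dash, horizontal ellipsis.
PUNCTUATION = "\u2018\u2019\u201c\u201d\u2013\u2014\u2026"

for ch in PUNCTUATION:
    print("U+%04X %s" % (ord(ch), unicodedata.name(ch)))

# None of these fit in a single byte, so an 8-bit-only parser cannot
# represent them as characters.
above_8bit = all(ord(ch) > 0xFF for ch in PUNCTUATION)

# A Unicode-aware parser reads them back unchanged from UTF-8 input.
xml_bytes = ("<p>" + PUNCTUATION + "</p>").encode("utf-8")
parsed = minidom.parseString(xml_bytes).documentElement.firstChild.data

print(above_8bit, parsed == PUNCTUATION)
```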

For raw speed, libxml2 (and its Python wrapper) seems to get a lot of
glowing reviews. It's not a standard DOM API, though, and again
documentation is a problem (lots of C-API-level documentation, but not much
in terms of how to put it together into a working Python app).

I gave it a whirl and it certainly seemed to fly, but getting to grips with
the API and converting my existing DOM-manipulating code to it felt like too
much of a hurdle given that my app runs fast enough as it is.

This was my biggest problem with libxml2/Python as documented here:

http://www.xml.com/pub/a/2003/05/14/py-xml.html

If documentation for Python users is improved, it will be hard to beat
that package.

But your criteria lead me to suggest that you give cDomlette a try. It
is also implemented in C for performance. It's as much DOM compliant
as libxml2's DOM API (which is to say not fully so), but we do try to
document it from the Python POV. See:

http://uche.ogbuji.net/tech/akara/nodes/2003-01-01/domlettes

--Uche
http://uche.ogbuji.net
 
