XML Parser


Elmar Brandt

Hello,
we are looking for a fast XML parser.
The XML files are very big (> 2 GB) and we want to convert them into other
formats via XSLT.

Does anyone have an idea?

With best regards

Elmar Brandt
 

Juergen Kahrs

Elmar said:
we are looking for a fast XML parser.
The XML files are very big (> 2 GB) and we want to convert them into other
formats via XSLT.

Does anyone have an idea?

This question is asked here each month.
The usual answer is that XSLT needs a DOM.
A DOM requires at least as much RAM/swap as
the file itself. My understanding is that files
larger than 500 MB are impractical to process
with XSLT. There are lots of other tools for
processing very large files.
 

Martin Honnen

Elmar said:
we are looking for a fast XML parser.
The XML files are very big (> 2 GB) and we want to convert them into other
formats via XSLT.

A fast parser alone does not help then; XSLT usually builds a tree
model of the complete XML input in memory and transforms that input tree
into a result tree, which is then serialized.
So even if you use a fast, low-overhead parsing approach like SAX or
XmlReader, the XSLT processor will still build its tree model of your
XML input in memory and will need additional memory on top of that for
the result tree.
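
As a minimal illustration (my own sketch, not from the original post; "record" is just a placeholder element name), a plain SAX pass in Java runs in roughly constant memory however large the input is, because nothing is retained beyond the current event:

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class CountRecords {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        final long[] count = {0};
        parser.parse(new java.io.File(args[0]), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // Only a counter is kept, so memory use stays flat
                // regardless of the size of the input document.
                if ("record".equals(qName)) {
                    count[0]++;
                }
            }
        });
        System.out.println("records: " + count[0]);
    }
}

It is only once an XSLT processor is attached that the whole input has to be materialized as a tree.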
 

Joe Kesselman

Juergen said:
The usual answer is that XSLT needs a DOM.

Quibble: XSLT, in general, needs an in-memory model of the source
document. ("DOM" stands for Document Object Model, though it usually
refers to the W3C DOM which is in fact an object-based API for documents
and doesn't actually say anything about what the model behind that API
might be.)
My understanding is that files
larger than 500 MB are impractical to process
with XSLT.

Depends on how much memory you have in your machine and how fast your
memory swap system is, as well as how much locality of reference there
is in the stylesheet's execution.

XSLT processors which can automatically recognize opportunities to keep
less of the source document in memory are something of a "holy grail"
project -- we all know it's possible, but as far as I know nobody has
yet made that optimization work sufficiently generally or reliably.
Search the Apache Xalan mailing list's archives for the key words
"streaming", "pruning" and "filtering" to see some past discussion of
that. (In fact, when Xalan is processing from its database adapter it
often does operate in a streaming mode, counting on the user not to
write stylesheets which require wide random-access to the source.)

I know work is continuing on this in several research groups. Meanwhile,
depending on what you're doing, you may find that a hand-coded solution
can be made more efficient. XSLT is a good "high-level language" for XML
manipulation, but sometimes ya just gotta break down and write something
closer to the machine... at least, until the optimizers get smarter.
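
As a rough sketch of such a hand-coded approach (my own illustration, not from the post; it assumes a StAX pull parser is available and that the conversion is a simple per-element mapping): read one event at a time and write the converted output immediately, so neither the input nor the output ever exists as a tree in memory.

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;

public class StreamConvert {
    public static void main(String[] args) throws Exception {
        XMLStreamReader in = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(args[0]));
        XMLStreamWriter out = XMLOutputFactory.newInstance()
                .createXMLStreamWriter(new FileOutputStream(args[1]), "UTF-8");
        out.writeStartDocument();
        while (in.hasNext()) {
            int event = in.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                // Placeholder mapping: prefix every element name; a real
                // converter would apply whatever renaming/restructuring the
                // target format needs (attributes omitted for brevity).
                out.writeStartElement("out-" + in.getLocalName());
            } else if (event == XMLStreamConstants.CHARACTERS) {
                out.writeCharacters(in.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                out.writeEndElement();
            }
        }
        out.writeeEndDocument();
        out.close();
        in.close();
    }
}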
 

Joe Kesselman

Elmar said:
we are looking for a fast XML parser.

IBM has a notably fast XML parser -- a few papers were recently
published on it -- but I'm not sure whether it's shipping or under what
name. Lemme see if I can find out.
 

Andy Dingley

Elmar said:
we are looking for a fast XML parser.
The XML files are very big (> 2 GB) and we want to convert them into other
formats via XSLT.

Redesign it. Monolithic XML is not the solution here.

XML works with "documents", documents that have closure around a single
root element. You can play with this by using SAX, but it's always a
basic underpinning that you can never avoid entirely. When it gets to
2 GB you really are pushing things.

Is this "2 GB" really one huge document, or can you split it up into
separate events?

How big is the expected output transform? Can you run through a
lightweight SAX parser to generate a filtered document, then transform
that with XSLT?
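
One way to sketch that two-stage idea (my own illustration, not from the post; the element names and the stylesheet convert.xsl are placeholders, and it assumes the interesting <item> elements sit directly under the root): a SAX filter drops everything else before the events reach the XSLT processor, so the tree the processor builds covers only the small, filtered document.

import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

public class FilterThenTransform {

    // Forward only the document root and <item> subtrees to the XSLT
    // processor; everything else in the huge input is dropped up front.
    // Assumes <item> elements are direct children of the root element.
    static class ItemFilter extends XMLFilterImpl {
        private int depth = 0;   // nesting depth inside an <item> subtree

        ItemFilter(XMLReader parent) { super(parent); }

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) throws SAXException {
            if (depth > 0 || "item".equals(local) || "root".equals(local)) {
                if (depth > 0 || "item".equals(local)) depth++;
                super.startElement(uri, local, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName)
                throws SAXException {
            if (depth > 0 || "root".equals(local)) {
                if (depth > 0) depth--;
                super.endElement(uri, local, qName);
            }
        }

        @Override
        public void characters(char[] ch, int start, int len)
                throws SAXException {
            if (depth > 0) super.characters(ch, start, len);
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        // The XSLT processor pulls its input through the filter, so it
        // only ever builds a tree of the reduced document.
        SAXSource filtered =
                new SAXSource(new ItemFilter(reader), new InputSource(args[0]));
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("convert.xsl"));
        t.transform(filtered, new StreamResult(System.out));
    }
}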
 

Juergen Kahrs

Joe said:
Quibble: XSLT, in general, needs an in-memory model of the source
document. ("DOM" stands for Document Object Model, though it usually
refers to the W3C DOM which is in fact an object-based API for documents
and doesn't actually say anything about what the model behind that API
might be.)
Agreed.



Depends on how much memory you have in your machine and how fast your
memory swap system is, as well as how much locality of reference there
is in the stylesheet's execution.

Having tons of RAM is a solution if you process
one very large XML file at a time. But typical
server applications have to serve dozens of these
very large XML files at the same time.
 

Juergen Kahrs

Joe said:
IBM has a significantly fast XML parser -- a few papers were recently
published on it -- but I'm not sure whether it's shipping or under what
name. Lemme see if I can find out.

With 2 GB of data, other considerations may be
more important than the XML parser. Imagine the
XML parser ran in "zero time": the hard disk
would still need more than 20 seconds just for
the disk I/O, since even at around 100 MB/s of
sustained throughput, reading 2 GB takes roughly
20 seconds. So the limiting factor (besides RAM
with a DOM) may be the hard disk and not
necessarily the parser.
 

Joseph Kesselman

Juergen said:
Having tons of RAM is a solution if you process
one very large XML file at a time. But typical
server applications have to serve dozens of these
very large XML files at the same time.

Typical server applications aren't processing half-gig documents every
time they're requested -- they're styling them ahead of time and caching
the result.

Or they're reducing the data before styling it, e.g. retrieving only the
information they need from a database or otherwise prefiltering before
styling.
 
