XML Parser


Elmar Brandt

Hello,
we are looking for a fast XML parser.
The XML files are very big (> 2 GB) and we want to convert them into other
formats via XSLT.

Does anyone have an idea?

With best regards

Elmar Brandt
 

Juergen Kahrs

Elmar said:
we are looking for a fast XML parser.
The XML files are very big (> 2 GB) and we want to convert them into other
formats via XSLT.

Does anyone have an idea?

This question is asked here each month.
The usual answer is that XSLT needs a DOM.
A DOM requires at least as much RAM/swap as
the file itself. My understanding is that files
larger than 500 MB are impractical to process
with XSLT. There are lots of other tools for
processing very large files.
 

Martin Honnen

Elmar said:
we are looking for a fast XML parser.
The XML files are very big (> 2 GB) and we want to convert them into other
formats via XSLT.

A fast parser alone does not help then; XSLT usually builds a tree
model of the complete XML input in memory and transforms that input tree
into a result tree, which is then serialized.
So even if you use a fast, low-overhead parsing approach like SAX or
XmlReader, the XSLT processor will still build its tree model of your
XML input in memory and will need additional memory on top of that for
the result tree.
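
As a minimal illustration (my own sketch, not from the original post; "record" is just a placeholder element name), a plain SAX pass in Java runs in roughly constant memory however large the input is, because nothing is retained beyond the current event:

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class CountRecords {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        final long[] count = {0};
        parser.parse(new java.io.File(args[0]), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // Only a counter is kept, so memory use stays flat
                // regardless of the size of the input document.
                if ("record".equals(qName)) {
                    count[0]++;
                }
            }
        });
        System.out.println("records: " + count[0]);
    }
}

It is only once an XSLT processor is attached that the whole input has to be materialized as a tree.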
 

Joe Kesselman

Juergen said:
The usual answer is that XSLT needs a DOM.

Quibble: XSLT, in general, needs an in-memory model of the source
document. ("DOM" stands for Document Object Model, though it usually
refers to the W3C DOM which is in fact an object-based API for documents
and doesn't actually say anything about what the model behind that API
might be.)
My understanding is that files
larger than 500 MB are impractical to process
with XSLT.

Depends on how much memory you have in your machine and how fast your
memory swap system is, as well as how much locality of reference there
is in the stylesheet's execution.

XSLT processors which can automatically recognize opportunities to keep
less of the source document in memory are something of a "holy grail"
project -- we all know it's possible, but as far as I know nobody has
yet made that optimization work sufficiently generally or reliably.
Search the Apache Xalan mailing list's archives for the key words
"streaming", "pruning" and "filtering" to see some past discussion of
that. (In fact, when Xalan is processing from its database adapter it
often does operate in a streaming mode, counting on the user not to
write stylesheets which require wide random-access to the source.)

I know work is continuing on this in several research groups. Meanwhile,
depending on what you're doing, you may find that a hand-coded solution
can be made more efficient. XSLT is a good "high-level language" for XML
manipulation, but sometimes ya just gotta break down and write something
closer to the machine... at least, until the optimizers get smarter.
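
As a rough sketch of such a hand-coded approach (my own illustration, not from the post; it assumes a StAX pull parser is available and that the conversion is a simple per-element mapping): read one event at a time and write the converted output immediately, so neither the input nor the output ever exists as a tree in memory.

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;

public class StreamConvert {
    public static void main(String[] args) throws Exception {
        XMLStreamReader in = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(args[0]));
        XMLStreamWriter out = XMLOutputFactory.newInstance()
                .createXMLStreamWriter(new FileOutputStream(args[1]), "UTF-8");
        out.writeStartDocument();
        while (in.hasNext()) {
            int event = in.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                // Placeholder mapping: prefix every element name; a real
                // converter would apply whatever renaming/restructuring the
                // target format needs (attributes omitted for brevity).
                out.writeStartElement("out-" + in.getLocalName());
            } else if (event == XMLStreamConstants.CHARACTERS) {
                out.writeCharacters(in.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                out.writeEndElement();
            }
        }
        out.writeeEndDocument();
        out.close();
        in.close();
    }
}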
 

Joe Kesselman

Elmar said:
we are looking for a fast XML parser.

IBM has a notably fast XML parser -- a few papers were recently
published on it -- but I'm not sure whether it's shipping or under what
name. Lemme see if I can find out.
 

Andy Dingley

Elmar said:
we are looking for a fast XML parser.
The XML files are very big (> 2 GB) and we want to convert them into other
formats via XSLT.

Redesign it. Monolithic XML is not the solution here.

XML works with "documents", documents that have closure around a single
root element. You can play with this by using SAX, but it's always a
basic underpinning that you can never avoid entirely. When it gets to
2 GB you really are pushing things.

Is this "2 GB" really one huge document, or can you split it up into
separate events?

How big is the expected output transform? Can you run through a
lightweight SAX parser to generate a filtered document, then transform
that with XSLT?
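
One way to sketch that two-stage idea (my own illustration, not from the post; the element names and the stylesheet convert.xsl are placeholders, and it assumes the interesting <item> elements sit directly under the root): a SAX filter drops everything else before the events reach the XSLT processor, so the tree the processor builds covers only the small, filtered document.

import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

public class FilterThenTransform {

    // Forward only the document root and <item> subtrees to the XSLT
    // processor; everything else in the huge input is dropped up front.
    // Assumes <item> elements are direct children of the root element.
    static class ItemFilter extends XMLFilterImpl {
        private int depth = 0;   // nesting depth inside an <item> subtree

        ItemFilter(XMLReader parent) { super(parent); }

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) throws SAXException {
            if (depth > 0 || "item".equals(local) || "root".equals(local)) {
                if (depth > 0 || "item".equals(local)) depth++;
                super.startElement(uri, local, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String local, String qName)
                throws SAXException {
            if (depth > 0 || "root".equals(local)) {
                if (depth > 0) depth--;
                super.endElement(uri, local, qName);
            }
        }

        @Override
        public void characters(char[] ch, int start, int len)
                throws SAXException {
            if (depth > 0) super.characters(ch, start, len);
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        // The XSLT processor pulls its input through the filter, so it
        // only ever builds a tree of the reduced document.
        SAXSource filtered =
                new SAXSource(new ItemFilter(reader), new InputSource(args[0]));
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource("convert.xsl"));
        t.transform(filtered, new StreamResult(System.out));
    }
}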
 

Juergen Kahrs

Joe said:
Quibble: XSLT, in general, needs an in-memory model of the source
document. ("DOM" stands for Document Object Model, though it usually
refers to the W3C DOM which is in fact an object-based API for documents
and doesn't actually say anything about what the model behind that API
might be.)
Agreed.



Depends on how much memory you have in your machine and how fast your
memory swap system is, as well as how much locality of reference there
is in the stylesheet's execution.

Having tons of RAM is a solution if you process
one very large XML file at a time. But typical
server applications have to serve dozens of these
very large XML files at the same time.
 

Juergen Kahrs

Joe said:
IBM has a significantly fast XML parser -- a few papers were recently
published on it -- but I'm not sure whether it's shipping or under what
name. Lemme see if I can find out.

With 2 GB of data, other considerations may be
more important than the XML parser. Imagine the
XML parser ran in "zero time": the hard disk
would still need more than 20 seconds just for
the disk I/O, since even at around 100 MB/s of
sustained throughput, reading 2 GB takes roughly
20 seconds. So the limiting factor (besides RAM
with a DOM) may be the hard disk and not
necessarily the parser.
 

Joseph Kesselman

Juergen said:
Having tons of RAM is a solution if you process
one very large XML file at a time. But typical
server applications have to serve dozens of these
very large XML files at the same time.

Typical server applications aren't processing half-gig documents every
time they're requested -- they're styling them ahead of time and caching
the result.

Or they're reducing the data before styling it, e.g. retrieving only the
information they need from a database or otherwise prefiltering before
styling.
 
