huge XML files, XSLT memory problems, Java & SAX...

Jeff Calico

I have two XML data files that I want to extract data from simultaneously
and transform with XSLT to generate a report. The first file is huge,
and when the XSLT processor builds the DOM tree in memory, it runs out of space.

I only need a few branches of elements from the original XML, so I am
seeking a recommended way of building a DOM for XSLT containing only the
elements that I need. I'm writing a Java application that invokes Xalan, and
I have been reading up on SAX parsers this afternoon. I'm sure this is a
common problem, and as such there is probably a clean and easy way to do it,
but I haven't found it yet.

thanks,
Jeff
 
Tjerk Wolterink

Jeff said:
I have 2 XML data files that I want to extract data from simultaneously
and transform with XSLT to generate a report. The first file is huge
and when XSLT builds the DOM tree in memory, it runs out of space.

I only need a few branches of elements from the original XML, so I am
seeking a recommended way of building a DOM for XSLT of only the
elements that I need. I'm writing a Java application that invokes Xalan, and
reading up on SAX parsers this afternoon... I'm sure this is a common
problem, and as such, there is probably a clean and easy way to do it,
but I haven't found that one yet...

thanks,
Jeff

1. Save your XML files in an XML database.
2. Use XQuery to retrieve only the data you want.
3. Transform that data with your stylesheet.
4. Voila: better performance, because an XML database retrieves the
data you want quickly from a huge set, and you only load the elements
you need into memory.

The only problem is that most XML databases are still in development.
Just google it.
 
Jürgen Kahrs

Jeff said:
and transform with XSLT to generate a report. The first file is huge
and when XSLT builds the DOM tree in memory, it runs out of space.

This has become a FAQ. The usual answer is to not use a DOM.
By the way, what do you consider a huge file?
DOMs should work up to a few hundred MB of XML if
you have enough RAM to hold your DOM.
 
Joe Kesselman

Jeff said:
I only need a few branches of elements from the original XML, so I am
seeking a recommended way of building a DOM for XSLT of only the
elements that I need.

Run the SAX events through a SAX filter that selects the information
you're concerned with, and from there into a SAX-to-DOM builder if you
need an in-memory model.

Note that DOM implementations can vary in their efficiency; I once wrote
a DOM subset that required only six words of memory per node (not
counting text contents), and Xalan-j still uses my DTM data model
internally because it's more efficient than a traditional
Java-object-based DOM implementation (as well as being a better
impedance match to the XPath data model abstraction).

As others have said: What do you consider "huge"? Exceeding physical
memory? Exceeding _virtual_ memory?
 
Jeff Calico

Tjerk said:
1. Save your xml files in an xml-databases.
[snip]

Thanks for your reply, Tjerk. We have considered a database, although
not an XML database per se, but it seems a better option to immediately
discard the elements that we don't need, which will of course also be
much faster.

--Jeff
 
Jeff Calico

Thanks for your reply, Jürgen. I did do some searching in the archives
of this newsgroup before posting, but I didn't find what I thought I
might. As I understand it, XSLT requires a DOM to exist, so if I do not
wish to write my own XSLT functionality, then I must have one.

I do not yet know the exact size of the files I must process, but I would
expect them to be much less than 100 MB (I hope!). However, I have heard
several co-workers talk about DOMs gobbling up all available memory, so I
want to avoid even the possibility of that. I did get an out-of-memory
error with XML Spy when using it to perform a transform on a *small* file
of the type I am working with.

Anyway, the real issue is not to construct a huge DOM for no good
reason. I don't need all that data, just 3 or 4 important nodes and
their children...

--Jeff
 
Jeff Calico

Thanks for the reply, Joe. I expect I will be using Xalan-j and hence
your earlier work :)

As I mentioned in my reply to Jürgen above, I don't know the real sizes
of the XML data files yet, only that I should expect big ones, and I did
crash XMLSpy with a fairly small data file while prototyping. As usual, we
have the issues of speed and memory space; the best solution is to
not process what we don't need, right from the beginning.

Would you happen to remember the names of the classes that do the
SAX filtering ---> filtered DOM building? If they are not on the tip
of your tongue, I can and shall certainly look them up. But it is again the
issue of speed (I'm slow) and memory allocation (brain is running low
on space) :)

--Jeff
 
Joe Kesselman

Jeff said:
As I understand it, XSLT requires a DOM to exist

Uhm... Not exactly. XSLT may use a DOM internally (or may use other data
models). But most XSLT processors can accept input from a file, a text
stream, a SAX stream, or a DOM... and will output to any of those. So
you don't have to explicitly create a DOM in order to use XSLT, and
XSLT's internal representation may (or may not) be more efficient than a
general-purpose DOM.
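
As a minimal sketch of that point (not code from this thread), the standard JAXP API lets you hand the processor a character stream and get a stream back without ever building a DOM yourself. The class name, the inline stylesheet, and the sample element names are made up for illustration; whatever model the processor builds internally is still its own business.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class StreamTransformSketch {
    // Hypothetical stylesheet: emits every <name> value followed by ';'.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:for-each select='//name'>"
      + "<xsl:value-of select='.'/><xsl:text>;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Stream in, stream out: no DOM is created by the calling code.
    static String transform(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(
            "<people><p><name>Ann</name></p><p><name>Bob</name></p></people>"));
    }
}
```

For real files you would pass `new StreamSource(new File(...))` instead of a StringReader; a SAXSource or DOMSource can be substituted in the same position.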

Jeff said:
I do not yet know the exact size of files I must process, but I would
expect them to be much less than 100 MB (I hope!)

We run documents that size through Xalan on a regular basis.

Jeff said:
However, I have heard several co-workers talk about
DOMs gobbling up all available memory

The DOM is just an API. How much memory a DOM needs depends on which DOM
implementation you're using as well as on the exact characteristics of
the document being processed.

Jeff said:
I did get an out-of-memory error with XML Spy when using it to perform
a transform on a *small* file of the type I am working with.

That may be a problem in the transformation, or you may have set the
limits on your environment too low.

Jeff said:
Anyway, the real issue is not to construct a huge DOM for no good
reason. I don't need all that data, just 3 or 4 important nodes and
their children...

If that's the case, a hand-coded SAX solution will probably be more
efficient than an XSLT solution... for now. Recognizing and optimizing
these cases is an ongoing area of research for XSLT developers.
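
A hand-coded SAX solution for this kind of extraction might look roughly like the following. It is only a sketch: the class name TitleCollector and the <title> element are hypothetical stand-ins for the 3 or 4 nodes of interest, and it assumes those elements do not nest inside each other.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TitleCollector extends DefaultHandler {
    private final List<String> titles = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <title>

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if ("title".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so accumulate it.
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            titles.add(current.toString());
            current = null;
        }
    }

    // Parses the XML once, keeping only the <title> text in memory.
    static List<String> collect(String xml) {
        try {
            TitleCollector handler = new TitleCollector();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.titles;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Everything outside the selected elements is seen by the handler and then discarded, so memory use depends on what you keep rather than on the size of the input.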
 
Joe Kesselman

Jeff said:
Would you happen to remember the names of the classes that do the
SAX filtering ---> filtered DOM building?

SAX-driven DOM builders are pretty common; many DOM implementations ship
with one, and if not, generic ones are a standard intro-to-XML class
exercise, so there are lots of them running around.
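
For reference, one way to get a SAX-driven DOM builder from the JDK alone, without picking any particular DOM implementation's builder class, is an identity TransformerHandler writing into a DOMResult. This is a sketch with a hypothetical class name, not necessarily the kind of builder meant above; a filter would sit between the reader and the handler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class SaxToDomSketch {
    static Document build(String xml) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();

            // An identity TransformerHandler is a ContentHandler that
            // copies the SAX events it receives into its Result.
            SAXTransformerFactory stf =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler builder = stf.newTransformerHandler();
            DOMResult result = new DOMResult();
            builder.setResult(result);

            reader.setContentHandler(builder);
            reader.parse(new InputSource(new StringReader(xml)));
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = build("<a><b>hi</b></a>");
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```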

Filtering: That's up to you. You need to implement a class which is a
SAX handler, accepting the SAX event calls and tracking them to decide
what does and doesn't have to be passed along to another handler (in
this case, the DOM builder). Very standard bit of SAX programming, and
in fact a bit too standard for me to actually have kept pointers to
examples. Any good SAX tutorial ought to give you all the info you need
to do this -- modulo the hassle of figuring out what criteria you need
to use to decide what is and isn't worth passing along.

Lemme see if I've got anything on tap that's simple enough to be a good
pedagogical illustration...
 
Juergen Kahrs

Jeff said:
XSLT requires a DOM to exist, so if I wish to not write my own XSLT
functionality, then I must have one.

That's also my understanding of the problem.

Jeff said:
I do not yet know the exact size of files I must process, but I would
expect them to be much less than 100 MB (I hope!). However, I have heard
several co-workers talk about DOMs gobbling up all available memory, so I
want to avoid even the possibility of that.

This sounds like your XSLT implementation has a problem.

Jeff said:
I did get an out-of-memory error with XML Spy when using it to perform
a transform on a *small* file of the type I am working with.

This confirms my guess about a problem with your XSLT implementation.

Jeff said:
Anyway, the real issue is not to construct a huge DOM for no good
reason. I don't need all that data, just 3 or 4 important nodes and
their children...

I have read such postings several times over the
last months. But I can't give you a pointer right now.
 
Henry S. Thompson

Don't write Java, use a pipeline -- Markup pipeline demo site [1]
includes a pipeline to split a large document into small chunks for
validation, similar approach would work for.

ht

[1] http://www.markup.co.uk/showcase/
--
Henry S. Thompson, Markup Technology Ltd.
4 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- +44 (0) 7866 471 388
Fax: (44) 131 650-4587, e-mail: (e-mail address removed)
URL: http://www.markup.co.uk/
[mail really from me _always_ has this .sig -- mail without it is forged spam]
 
Joe Kesselman

Joe said:
Lemme see if I've got anything on tap that's simple enough to be a good
pedagogical illustration...

I don't have anything really good on hand, but look for examples of use
of org.xml.sax.helpers.XMLFilterImpl.

That starts out as a no-op filter which just passes everything through.
What you'd need to do is add enough logic to recognize which portions of
the document you're interested in, and pass those (and only those) along
to the next stage of processing. Plus, probably, the document element
(or a synthesized document element) so it's well-formed XML. Handling
namespaces properly complicates this somewhat but not horribly.
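
A rough sketch of such a filter, under some loud assumptions: the class name SubtreeFilter and the idea of selecting subtrees by local element name are illustrative choices, not anything prescribed here. It keeps the document element so the result stays well-formed, relies on a namespace-aware parser, and simply lets prefix mappings pass through rather than handling namespaces carefully.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

class SubtreeFilter extends XMLFilterImpl {
    private final Set<String> wanted;
    private int depth = 0;      // depth of the element being processed
    private int keepDepth = -1; // depth where a kept subtree began, or -1

    SubtreeFilter(XMLReader parent, String... wantedNames) {
        super(parent);
        this.wanted = new HashSet<>(Arrays.asList(wantedNames));
    }

    private boolean inside() { return keepDepth >= 0; }

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        depth++;
        if (!inside() && depth > 1 && wanted.contains(local)) {
            keepDepth = depth; // entering a subtree we want
        }
        // Always pass the document element; otherwise only kept subtrees.
        if (depth == 1 || inside()) super.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName)
            throws SAXException {
        if (depth == 1 || inside()) super.endElement(uri, local, qName);
        if (depth == keepDepth) keepDepth = -1; // left the kept subtree
        depth--;
    }

    @Override
    public void characters(char[] ch, int start, int len) throws SAXException {
        if (inside()) super.characters(ch, start, len);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int len)
            throws SAXException {
        if (inside()) super.ignorableWhitespace(ch, start, len);
    }

    // Demo helper: serializes the filtered view with an identity transform.
    static String filter(String xml, String... wantedNames) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader filtered =
                new SubtreeFilter(spf.newSAXParser().getXMLReader(), wantedNames);
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(
                new SAXSource(filtered, new InputSource(new StringReader(xml))),
                new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The same SAXSource can be handed to a Transformer built from a real stylesheet instead of the identity one, or the filter's content handler can be set to a DOM builder.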

Note that the next stage has to be aware that it's seeing a filtered
view of the document; if you've passed along only some subtrees, search
patterns that look at the context they appeared in may of course not
work as expected. For example, if you're prefiltering before running a
stylesheet, some of the XPaths in that stylesheet may have to be rewritten.
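
To make the XPath caveat concrete, here is a small sketch using the JDK's javax.xml.xpath API. The element names (report, section, item) and the two documents are hypothetical: the first stands for the full input, the second for what a prefilter that drops the <section> wrapper would hand to the stylesheet.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class FilteredXPathSketch {
    // Counts the nodes that the given XPath selects in the given document.
    static int count(String xpath, String xml) {
        try {
            XPath xp = XPathFactory.newInstance().newXPath();
            Double n = (Double) xp.evaluate(
                "count(" + xpath + ")",
                new InputSource(new StringReader(xml)),
                XPathConstants.NUMBER);
            return n.intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String full = "<report><section><item>x</item></section></report>";
        String filtered = "<report><item>x</item></report>";

        // The original path matches the full document...
        System.out.println(count("/report/section/item", full));
        // ...but finds nothing once <section> has been filtered away,
        System.out.println(count("/report/section/item", filtered));
        // so the path has to be rewritten for the filtered view.
        System.out.println(count("/report/item", filtered));
    }
}
```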

As I say, I haven't had much trouble running recent versions of Xalan on
large documents... but this kind of explicit prefiltering may save you
some cycles and storage, at the cost of requiring more cycles of
developer time to create and maintain it.
 
