XSLT transformation of a large XML file using Java results in OutOfMemory

Lenny Wintfeld

Hi

I'm attempting additions/changes to a Java program that (among other
things) uses XSLT to transform a large (96 MB) XML file. It runs fine on
small XML files but generates OutOfMemory exceptions with large XML
files. I tried a simple punt of -Xmx512m, but that didn't work. In the
future the input XML file may become considerably bigger than 96 MB, so
even if it had worked, it probably would only have put off the
inevitable.

I'm using JavaSE 1.4.2_11 and the XSL/XML libraries that come with it.
The transformation is from one XML file to another. The code I inherited
looks a lot like most of the example code you can find on the net for
doing an XSLT transformation. The relevant part is:

TransformerFactory tf = TransformerFactory.newInstance();
Transformer transformer = tf.newTransformer(xsltSource);
transformer.transform(new StreamSource(new StringReader(x)), xsltDest);

where xsltSource holds the XSLT stylesheet (generated as a string by
code immediately above the snippet shown) and x is the input XML to be
transformed.

Things I tried:

1. I modified the above code to use a file instead of a String as the
XML to be transformed, and a file for the XSLT that specifies the
transformation. It works fine with small XML input files but not with
large ones. I assume this code is using a DOM parser, and there is
simply not enough room in memory to hold the input XML file.

2. Based on some old (years old) newsgroup posts I found, I tried using
a SAX equivalent of the above code, assuming that SAX takes in, parses
and transforms the input XML file piecemeal (maybe element by
element?), or that SAX uses the complete virtual memory of the computer.
But this code also results in successful runs on small input XML files
and OutOfMemory errors on large ones. Here is a snip of the SAX code
(adapted from a chapter of Burke's "XSLT and Java" at the O'Reilly
website):

// Read the XSLT stylesheet
FileInputStream brXSLT = new FileInputStream(
        "C:/Documents and Settings/Lenny/Desktop/OCCxsl.xsl");

// Set up the transformer
TransformerFactory transFact = TransformerFactory.newInstance();
SAXTransformerFactory saxTransFact = (SAXTransformerFactory) transFact;
Source xsltSource = new StreamSource(brXSLT);
TransformerHandler transHand = saxTransFact.newTransformerHandler(xsltSource);

// Set up the input source
InputSource inxml = new InputSource(inXML);

// Set the destination for the XSLT transformation
transHand.setResult(new StreamResult(outXML));

// Attach the XSLT processor to the XMLReader
String parserClass = "org.apache.crimson.parser.XMLReaderImpl";
XMLReader reader = XMLReaderFactory.createXMLReader(parserClass);

// Parse the input file to an output file
reader.setContentHandler(transHand);
reader.parse(inxml);


I'm considering writing a custom parser of the input XML file which
basically identifies the elements of the input file and treats each
element as if it were a complete document, e.g. sends the content handler

ch.startDocument()
ch.startElement(..) // pass through the original element
ch.characters(..)   // "
ch.endElement(..)   // "
ch.endDocument()

for each element in the input XML file.

But being a newbie to XSLT, I don't know whether this is worth pursuing,
or even whether it would work; I'm hoping there are simpler, more
straightforward ways of accomplishing the same thing at a higher level.
It does seem pretty clumsy, even if it would work.
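In case it helps to see what I mean, here's a rough sketch of that idea
as a SAX filter (untested, and the assumption that every depth-1 element
is an independent record is mine):

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Sketch: wrap each depth-1 element in its own startDocument()/
// endDocument() pair, so downstream handlers see many small
// documents instead of one huge one.
public class PerElementFilter extends XMLFilterImpl {
    private int depth = 0;

    public void startDocument() { }   // suppress the real one
    public void endDocument()   { }   // suppress the real one

    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        if (depth == 1) {             // a record element begins
            super.startDocument();
        }
        if (depth >= 1) {             // skip the root element itself
            super.startElement(uri, local, qName, atts);
        }
        depth++;
    }

    public void characters(char[] ch, int start, int len)
            throws SAXException {
        if (depth >= 2) {             // only forward text inside a record
            super.characters(ch, start, len);
        }
    }

    public void endElement(String uri, String local, String qName)
            throws SAXException {
        depth--;
        if (depth >= 1) {
            super.endElement(uri, local, qName);
        }
        if (depth == 1) {             // the record element has ended
            super.endDocument();
        }
    }
}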

I found a reply on the web to someone who had a similar problem, to the
effect that a "SAX pipeline" should be used. But there was no further
elaboration, and so far I haven't figured out what a SAX pipeline is or
how it would help.

Any advice, references to examples, or actual examples would be
greatly appreciated.

Non-procedural programming is taking quite a bit of effort to
understand!

Thanks in advance for your help.

Lenny Wintfeld

ps - I've had this up on comp.lang.java.programmer for most of the day
with no replies. It bridges both specialties, which is why I'm trying
here.
 
Joe Kesselman

In general, XSLT can't operate as a streaming processor, since its use
of XPaths assumes the entire document is available in memory (or at
least can be re-read) at once. Some processors use more compact models
than others and thus may be able to handle larger documents in the same
memory; this is part of why Xalan created its own model, known as DTM,
rather than using an off-the-shelf DOM implementation.

If you're willing to limit the kinds of stylesheets you write to ones
which _only_ process the document in forward order, you can of course
set up a minimal data model which just contains one (or a few) nodes;
Xalan's SQL extension works that way, actually.

Yes, automatically recognizing which stylesheets (or portions thereof)
are streamable would be a Good Thing, but it's still something of a Holy
Grail for XSLT implementers. If you look in the archives of the Xalan
mailing list, you'll see much past discussion of this, and of possible
approaches to dealing with it. Look in particular for the keywords
"streaming", "pruning", and "filtering". Folks are continuing to
research this, but it is not an easy problem.

But until someone does get a handle on this problem... Sometimes, if you
have to process large documents, the only good answer is to drop down
from XSLT to a lower level and code the processing yourself as a direct
SAX application. That lets you take advantage of whatever
streaming/pruning/filtering opportunities exist, as well as letting you
code a special-purpose (and thus more compact) model for any data you do
have to retain. High-level languages are a good thing, but some problems
are still best addressed by low-level bit-twiddling.
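
For what it's worth, the skeleton of such a direct SAX application looks
roughly like this (the handler logic and names are placeholders, not
your actual processing):

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Skeleton of a streaming SAX application: the only state kept is
// whatever you choose to keep, not a full document tree.
public class StreamingProcessor extends DefaultHandler {
    private final StringBuffer text = new StringBuffer();

    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        text.setLength(0);            // reset the per-element buffer
    }

    public void characters(char[] ch, int start, int len) {
        text.append(ch, start, len);
    }

    public void endElement(String uri, String local, String qName) {
        // Process and emit output for this element here, then let it
        // go -- nothing accumulates, so memory use stays flat.
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new java.io.File(args[0]), new StreamingProcessor());
    }
}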
 
Peter Flynn

Joe said:
In general, XSLT can't operate as a streaming processor, since its use
of XPaths assumes the entire document is available in memory (or at
least can be re-read) at once. Some processors use more compact models
than others and thus may be able to handle larger documents in the same
memory; this is part of why Xalan created its own model, known as DTM,
rather than using an off-the-shelf DOM implementation.

Perhaps it's appropriate to mention Omnimark, which uses a technique
sometimes known as "write-behind" (borrowed from the hardware field).
Instead of having an addressing scheme (XPath) for accessing objects
out of document sequence, it provides for placing references to named
anchors wherever you know (or have computed) you will need to access
such objects, and for creating the anchors themselves when you encounter
them in document order. When the last event in document order has
triggered, the "write-behind" reconciliation takes place, and all the
values of the anchors are slotted into the places reserved for them by
the references.

(At least, this is how it used to work: I haven't used it for years.)
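
In Java terms, the general idea is something like this (a toy
illustration only, not actual Omnimark):

import java.util.*;

// Toy "write-behind": emit output in document order, leaving named
// placeholders for values that are only known later, then reconcile
// them all in one pass at the end.
public class WriteBehind {
    private final List parts = new ArrayList();   // text, or a one-element
                                                  // String[] marking a reference
    private final Map anchors = new HashMap();    // anchor name -> value

    void write(String s)        { parts.add(s); }
    void reference(String name) { parts.add(new String[] { name }); }
    void anchor(String name, String value) { anchors.put(name, value); }

    String reconcile() {
        StringBuffer out = new StringBuffer();
        for (Iterator it = parts.iterator(); it.hasNext();) {
            Object p = it.next();
            out.append(p instanceof String[]
                    ? (String) anchors.get(((String[]) p)[0])
                    : (String) p);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        WriteBehind w = new WriteBehind();
        w.write("Total records: ");
        w.reference("count");         // value not known yet
        w.anchor("count", "12345");   // discovered at end of document
        System.out.println(w.reconcile());
    }
}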

///Peter
 
lennyw

Thanks very much for your reply and advice. It's a shame that the XSL
transform engines can't (at least as an option) use virtual memory as
their target environment for XML data file transformations. It looks
like I may have a long row to hoe in doing the equivalent of the
transform using procedural code! The sad part is, the transformations
that are done to these XML files using XSLT seem custom made for XSLT!

Just a couple of quick follow-ups:

1. Note that the transformation being done is XML to XML. Except for a
sort, which could be broken out of the XSLT stylesheet and done
procedurally after the transformation is complete, all other
transformations in the stylesheet are local to small elements in the
XML being transformed, and there are no dependencies between them. With
those restrictions, is there a way to mechanize a sequential
(element-by-element) transformation? If so, could you point me to some
examples?

2. I'm tantalized by the reference that I noted in my original post to
a suggestion that a "SAX Pipeline" be used to process very large XML
files. To me that sounds like a sequential processor of XML with XSLT.
Do you know where I could get additional info on a "SAX Pipeline", or
might this have been some wishful thinking on the part of its author?

Once again, thanks for your feedback.

Lenny Wintfeld
 
Jürgen Kahrs

lennyw wrote:
1. Note that the transformation being done is XML to XML. Except for a
sort, which could be broken out of the XSLT stylesheet and done
procedurally after the transformation is complete, all other
transformations in the stylesheet are local to small elements in the
XML being transformed, and there are no dependencies between them. With
those restrictions, is there a way to mechanize a sequential
(element-by-element) transformation? If so, could you point me to some
examples?

It sounds like your focus is on large files (> 100 MB) and you may be
willing to give up XSL and Java in order to solve the problem. The
following tool is not so specialized in producing XML files, but it can
handle 1 GB of data within 1 or 2 minutes:

http://home.vrweb.de/~juergen.kahrs/gawk/XML/xmlgawk.html#Printing-an-outline-of-an-XML-file

lennyw wrote:
2. I'm tantalized by the reference that I noted in my original post to
a suggestion that a "SAX Pipeline" be used to process very large XML
files. To me that sounds like a sequential processor of XML with XSLT.
Do you know where I could get additional info on a "SAX Pipeline", or
might this have been some wishful thinking on the part of its author?

Maybe this one helps:

Pipestreaming microformats
http://www-128.ibm.com/developerworks/xml/library/x-matters44.html
 
Joe Kesselman

lennyw wrote:
Thanks very much for your reply and advice. It's a shame that the XSL
transform engines can't (at least as an option) use virtual memory as
their target environment for XML data file transformations.

Generally, XSLT transformers *will* use virtual memory if the language
they're running in and the operating system they're running on support
it -- they just don't try to do the memory management themselves; they
trust the system to do it for them. And in fact Java does use virtual
memory... but the JVM you're using won't let you set that limit high
enough for this particular document.
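
For example (the command line here is purely illustrative), you can push
the limit up with something like

java -Xmx1200m -jar yourapp.jar

but on a 32-bit JVM of that vintage the practical ceiling is typically
somewhere around 1.5 GB, depending on the OS.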
lennyw wrote:
It looks like I may have a long row to hoe in doing the equivalent of
the transform using procedural code! The sad part is, the
transformations that are done to these XML files using XSLT seem custom
made for XSLT!

I know how you feel. All I can say is that I know folks who are working
on finding ways to address this, so In The Future Things Should Be
Better. The concepts are relatively straightforward; the hard part is
translating them into rules the machine can apply.
lennyw wrote:
...all other transformations in the stylesheet are local to small
elements in the XML being transformed, and there are no dependencies
between them. With those restrictions, is there a way to mechanize a
sequential (element-by-element) transformation?

I agree that this is exactly the kind of problem that ought to be
streamable... There's no portable way to leverage that, but a specific
XSLT processor may have a way to handle it. To take the example I know
best: Xalan's internal data representation does happen to have the
ability to "prune off" the most recently added nodes, so an explicit
call to an extension function could, theoretically, discard the element
once you're done processing it. In fact, one of Xalan's more obscure and
underdocumented extensions does discard trees, though only in specific
situations; we added that to handle the
foreach-over-a-list-of-document()s situation... but I don't think
there's a generalized version which would address your case. (We'd
started investigating one, actually, then Other Priorities Intervened.)
lennyw wrote:
2. I'm tantalized by the reference that I noted in my original post to
a suggestion that a "SAX Pipeline" be used to process very large XML
files. To me that sounds like a sequential processor of XML with XSLT.

I think that was probably intended to be a reference to hand-coded SAX
processing.

But actually, you *could* do a compromise: hand-code a SAX processor
which essentially breaks the large document up into a series of smaller
ones and runs XSLT transforms on each one via its API (e.g. TrAX, if
you're working in Java), then reassembles the output of those
transformations into a single document again.
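
Very roughly, and assuming every depth-1 element of your input is an
independent record (class and file names here are just placeholders),
that compromise could look like:

import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Replays each depth-1 element into a fresh TransformerHandler, so the
// stylesheet only ever sees one small record at a time; all results
// are appended to one shared output stream. (A real version would also
// suppress the per-record XML declarations and wrap the output in a
// single root element.)
public class SplitTransform extends DefaultHandler {
    private final SAXTransformerFactory stf =
            (SAXTransformerFactory) TransformerFactory.newInstance();
    private final Templates templates;   // compiled stylesheet, reusable
    private final StreamResult out;
    private TransformerHandler th;       // handler for the current record
    private int depth = 0;

    SplitTransform(Templates t, StreamResult r) { templates = t; out = r; }

    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        if (depth == 1) {                // a record begins: new transform
            try {
                th = stf.newTransformerHandler(templates);
            } catch (Exception e) {
                throw new SAXException(e);
            }
            th.setResult(out);
            th.startDocument();
        }
        if (th != null) th.startElement(uri, local, qName, atts);
        depth++;
    }

    public void characters(char[] ch, int start, int len)
            throws SAXException {
        if (th != null) th.characters(ch, start, len);
    }

    public void endElement(String uri, String local, String qName)
            throws SAXException {
        depth--;
        if (th != null) th.endElement(uri, local, qName);
        if (depth == 1) {                // record done: flush its result
            th.endDocument();
            th = null;
        }
    }

    public static void main(String[] args) throws Exception {
        TransformerFactory tf = TransformerFactory.newInstance();
        Templates t = tf.newTemplates(new StreamSource("transform.xsl"));
        SAXParserFactory.newInstance().newSAXParser().parse(
                new java.io.File("big-input.xml"),
                new SplitTransform(t, new StreamResult(System.out)));
    }
}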
 
lennyw

Jurgen, I looked at your reference to xmlgawk in some detail, and it
seems pretty encouraging; not only for the problem I stated, but for
web tie-ins on XML data. I will look at your document in more detail
and at the references (especially XMLBooster, xmllib and Expat). But in
the meantime, could you let me know directly, or provide me with some
info on, the following: How would I tie xmlgawk in to my primary
application(s) in Java? Would I do the equivalent of an exec(..) of the
awk processor and then look for an exit code, or is there a library
that ties it in more directly (similar to the XSLT library for Java)?
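
(If it's the exec route, I'm picturing something roughly like the
following; the gawk command line is pure guesswork on my part:

// Illustrative only: shell out to gawk and check the exit code.
Process p = Runtime.getRuntime().exec(new String[] {
    "gawk", "-f", "transform.awk", "big-input.xml"
});
int exit = p.waitFor();   // a real version would also drain
                          // p.getInputStream() to avoid blocking
if (exit != 0) {
    System.err.println("gawk failed with exit code " + exit);
}

...but a direct library tie-in would obviously be nicer.)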

I'm looking forward to seeing if xmlgawk would be a reasonable half
step between purely procedural code and XSLT; either permanently, or
until XSLT can handle the kinds of XML files I'm called on to process.

Thanks for the reference!

Lenny W.
 
