M
michaela_google
I need to write a tool which will allow me to process a huge document
with chunks of it in memory. The procedure is to echo XML coming from
some input to an output, then when I hit a certain type of element,
read all of its children into a DOM and somehow manipulate it
(specifically, the goal is to hand it to a rules engine which will add
flags). After that is complete, serialize it right back out to the
stream and resume processing.
My approach currently involves a ContentHandler object which simply
serializes every event using a TransformerHandler object. When I see
the element I want, I start creating Node objects in a stack and
assembling them into a DOM as I hit closing elements. The problem is
that there is no direct way of serializing that DOM to the same output
stream without violating encapsulation. At first I thought that I
would be able to pass the DOM to a SAX ContentHandler (which I would
provide with a reference to the TransformerHandler doing the
serialization). However, it turns out that isn't the case. I do not
want to reinvent the wheel and walk the DOM manually nor do I want to
create (potential) side-effects by writing directly to the OutputStream
wrapped by my Result object using XMLSerializer on the DOM. XMLFilters
do not appear to be a solution because I cannot apply my rules in a
streamable fashion (and getting a DOM from the incoming stream is not
the difficult part anyway).
I've looked around and found a variety of mixed parsing techniques, but
it seems that none of them quite meet my needs. Writing a class which
takes a DOM, walks it, then calls the appropriate ContentHandler
methods would not be too painful, but this seems horribly inelegant to
me. Also, I cannot even begin to fathom that I am the only person who
wants to do this.
Any suggestions?
(What follows is some code which may give an idea of where I am at.
Simplified for brevity.)
// Initial invocation
final SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
final SAXParser parser = factory.newSAXParser();
final XMLReader reader = parser.getXMLReader();
final TransformerHandler handler = getTransformerHandler(System.out);
reader.setContentHandler(new EvaluationHandler(handler));
reader.parse("foo.xml");
// EvaluationHandler (derives from a class which echoes to a provided
ContentHandler)
public void startElement(final String uri, final String localName,
final String qName, final Attributes attributes) throws SAXException
{
if (performDOMProcessing())
{
final Element element = this.host.createElementNS(uri, qName);
addAttributes(element, attributes);
this.buffer.push(element);
}
else
{
super.startElement(uri, localName, qName, attributes);
}
}
public void endElement(final String uri, final String localName, final
String qName) throws SAXException
{
if (performDOMProcessing())
{
final Node element = this.buffer.pop();
if (this.buffer.isEmpty())
{
if (this.host.hasChildNodes())
{
final Element root = this.host.getDocumentElement();
this.host.removeChild(root);
}
this.host.appendChild(element);
// Do processing against this.host with rules engine.
// Pass off to serialization.
}
else
{
this.buffer.peek().appendChild(element);
}
}
else
{
super.endElement(uri, localName, qName);
}
}
public void characters(final char[] characters, final int start, final
int length) throws SAXException
{
if (performDOMProcessing())
{
final String text = new String(characters, start, length);
final Text textNode = this.host.createTextNode(text);
this.buffer.peek().appendChild(textNode);
}
else
{
super.characters(characters, start, length);
}
}
with chunks of it in memory. The procedure is to echo XML coming from
some input to an output, then when I hit a certain type of element,
read all of its children into a DOM and somehow manipulate it
(specifically, the goal is to hand it to a rules engine which will add
flags). After that is complete, serialize it right back out to the
stream and resume processing.
My approach currently involves a ContentHandler object which simply
serializes every event using a TransformerHandler object. When I see
the element I want, I start creating Node objects in a stack and
assembling them into a DOM as I hit closing elements. The problem is
that there is no direct way of serializing that DOM to the same output
stream without violating encapsulation. At first I thought that I
would be able to pass the DOM to a SAX ContentHandler (which I would
provide with a reference to the TransformerHandler doing the
serialization). However, it turns out that isn't the case. I do not
want to reinvent the wheel and walk the DOM manually nor do I want to
create (potential) side-effects by writing directly to the OutputStream
wrapped by my Result object using XMLSerializer on the DOM. XMLFilters
do not appear to be a solution because I cannot apply my rules in a
streamable fashion (and getting a DOM from the incoming stream is not
the difficult part anyway).
I've looked around and found a variety of mixed parsing techniques, but
it seems that none of them quite meet my needs. Writing a class which
takes a DOM, walks it, then calls the appropriate ContentHandler
methods would not be too painful, but this seems horribly inelegant to
me. Also, I cannot even begin to fathom that I am the only person who
wants to do this.
Any suggestions?
(What follows is some code which may give an idea of where I am at.
Simplified for brevity.)
// Initial invocation
final SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
final SAXParser parser = factory.newSAXParser();
final XMLReader reader = parser.getXMLReader();
final TransformerHandler handler = getTransformerHandler(System.out);
reader.setContentHandler(new EvaluationHandler(handler));
reader.parse("foo.xml");
// EvaluationHandler (derives from a class which echoes to a provided
ContentHandler)
public void startElement(final String uri, final String localName,
final String qName, final Attributes attributes) throws SAXException
{
if (performDOMProcessing())
{
final Element element = this.host.createElementNS(uri, qName);
addAttributes(element, attributes);
this.buffer.push(element);
}
else
{
super.startElement(uri, localName, qName, attributes);
}
}
public void endElement(final String uri, final String localName, final
String qName) throws SAXException
{
if (performDOMProcessing())
{
final Node element = this.buffer.pop();
if (this.buffer.isEmpty())
{
if (this.host.hasChildNodes())
{
final Element root = this.host.getDocumentElement();
this.host.removeChild(root);
}
this.host.appendChild(element);
// Do processing against this.host with rules engine.
// Pass off to serialization.
}
else
{
this.buffer.peek().appendChild(element);
}
}
else
{
super.endElement(uri, localName, qName);
}
}
public void characters(final char[] characters, final int start, final
int length) throws SAXException
{
if (performDOMProcessing())
{
final String text = new String(characters, start, length);
final Text textNode = this.host.createTextNode(text);
this.buffer.peek().appendChild(textNode);
}
else
{
super.characters(characters, start, length);
}
}