Mixed SAX and DOM processing: echoing with occasional changes.

michaela_google

I need to write a tool which will allow me to process a huge document
while holding only chunks of it in memory. The procedure is to echo XML
coming from some input to an output, then, when I hit a certain type of
element, read it and all of its children into a DOM and manipulate that
DOM (specifically, the goal is to hand it to a rules engine which will
add flags). Once that is complete, serialize it right back out to the
stream and resume echoing.

My approach currently involves a ContentHandler object which simply
serializes every event using a TransformerHandler object. When I see
the element I want, I start creating Node objects on a stack and
assembling them into a DOM as I hit closing elements. The problem is
that there is no direct way of serializing that DOM to the same output
stream without violating encapsulation. At first I thought that I
would be able to pass the DOM to a SAX ContentHandler (which I would
provide with a reference to the TransformerHandler doing the
serialization). However, it turns out that isn't the case. I do not
want to reinvent the wheel and walk the DOM manually, nor do I want to
create (potential) side effects by using XMLSerializer on the DOM to
write directly to the OutputStream wrapped by my Result object.
XMLFilters do not appear to be a solution because I cannot apply my
rules in a streamable fashion (and getting a DOM from the incoming
stream is not the difficult part anyway).

I've looked around and found a variety of mixed parsing techniques, but
none of them quite meets my needs. Writing a class which takes a DOM,
walks it, and calls the appropriate ContentHandler methods would not be
too painful, but it seems horribly inelegant to me. Also, I cannot
imagine that I am the only person who wants to do this.

Any suggestions?

(What follows is some code which may give an idea of where I am at.
Simplified for brevity.)

// Initial invocation
final SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);
final SAXParser parser = factory.newSAXParser();
final XMLReader reader = parser.getXMLReader();
final TransformerHandler handler = getTransformerHandler(System.out);
reader.setContentHandler(new EvaluationHandler(handler));
reader.parse("foo.xml");
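
The getTransformerHandler helper is not shown in the post. The following is a minimal sketch of what such a helper might look like, assuming the standard JAXP SAXTransformerFactory route. The class name EchoSetup, the demo method, and the choice to omit the XML declaration are illustrative assumptions, not the poster's code.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

final class EchoSetup {
    // Hypothetical stand-in for the helper named in the post: an identity
    // TransformerHandler that serializes incoming SAX events to the stream.
    static TransformerHandler getTransformerHandler(OutputStream out)
            throws TransformerConfigurationException {
        TransformerFactory factory = TransformerFactory.newInstance();
        if (!factory.getFeature(SAXTransformerFactory.FEATURE)) {
            throw new TransformerConfigurationException(
                    "TransformerFactory does not support SAX handlers");
        }
        TransformerHandler handler =
                ((SAXTransformerFactory) factory).newTransformerHandler();
        // Assumption for the demo only: keep the output free of the XML declaration.
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        handler.setResult(new StreamResult(out));
        return handler;
    }

    // Fires a few SAX events by hand and returns what was serialized.
    static String demo() throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        TransformerHandler handler = getTransformerHandler(out);
        handler.startDocument();
        handler.startElement("", "a", "a", new AttributesImpl());
        handler.characters("hi".toCharArray(), 0, 2);
        handler.endElement("", "a", "a");
        handler.endDocument();
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

Because the handler is an ordinary ContentHandler, anything that can produce SAX events can write to the same output without touching the underlying stream.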

// EvaluationHandler (derives from a class which echoes to a provided
// ContentHandler)
public void startElement(final String uri, final String localName,
        final String qName, final Attributes attributes) throws SAXException
{
    if (performDOMProcessing())
    {
        final Element element = this.host.createElementNS(uri, qName);
        addAttributes(element, attributes);
        this.buffer.push(element);
    }
    else
    {
        super.startElement(uri, localName, qName, attributes);
    }
}

public void endElement(final String uri, final String localName,
        final String qName) throws SAXException
{
    if (performDOMProcessing())
    {
        final Node element = this.buffer.pop();
        if (this.buffer.isEmpty())
        {
            // The buffered subtree is complete: make it the document root,
            // replacing whatever root was left over from the previous chunk.
            if (this.host.hasChildNodes())
            {
                final Element root = this.host.getDocumentElement();
                this.host.removeChild(root);
            }
            this.host.appendChild(element);
            // Do processing against this.host with rules engine.
            // Pass off to serialization.
        }
        else
        {
            this.buffer.peek().appendChild(element);
        }
    }
    else
    {
        super.endElement(uri, localName, qName);
    }
}

public void characters(final char[] characters, final int start,
        final int length) throws SAXException
{
    if (performDOMProcessing())
    {
        final String text = new String(characters, start, length);
        final Text textNode = this.host.createTextNode(text);
        this.buffer.peek().appendChild(textNode);
    }
    else
    {
        super.characters(characters, start, length);
    }
}
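
The addAttributes helper called from startElement is also not shown. One way it might be written is sketched below, assuming a namespace-aware parser that reports qualified names (as the createElementNS call above already relies on). The class name AttributeCopy and the demo method are illustrative only.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

final class AttributeCopy {
    // Hypothetical body for the helper named in the post: copies every
    // SAX attribute onto the DOM element, preserving namespaces.
    static void addAttributes(Element element, Attributes attributes) {
        for (int i = 0; i < attributes.getLength(); i++) {
            String uri = attributes.getURI(i);
            // SAX reports "no namespace" as an empty string; DOM expects null.
            String namespace = (uri == null || uri.isEmpty()) ? null : uri;
            element.setAttributeNS(namespace, attributes.getQName(i), attributes.getValue(i));
        }
    }

    // Builds an element, copies one attribute onto it, and reads it back.
    static String demo() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document host = factory.newDocumentBuilder().newDocument();
        Element item = host.createElementNS(null, "item");

        AttributesImpl attributes = new AttributesImpl();
        attributes.addAttribute("", "id", "id", "CDATA", "42");
        addAttributes(item, attributes);
        return item.getAttributeNS(null, "id");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```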
 
[Si]dragon

As incredibly tacky as it may be, the following class "fixes" my
problem. It does as described earlier: it walks the DOM and triggers
the appropriate SAX events on a provided ContentHandler object. While
it suffices for my needs, it is terribly incomplete, so bringing it to
an adequate state is left as an exercise for the reader. It is
extremely simple, entirely self-explanatory, and hopefully it will go
unused.

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.ProcessingInstruction;
import org.w3c.dom.Text;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

public class DOMContentHandlerBridge
{
    private final ContentHandler handler;

    public DOMContentHandlerBridge(final ContentHandler handler)
    {
        this.handler = handler;
    }

    public void process(final Node node) throws SAXException
    {
        if (node == null)
        {
            return;
        }

        switch (node.getNodeType())
        {
            case Node.PROCESSING_INSTRUCTION_NODE:
                process((ProcessingInstruction) node);
                break;
            case Node.DOCUMENT_NODE:
                process((Document) node);
                break;
            case Node.ELEMENT_NODE:
                process((Element) node);
                break;
            case Node.TEXT_NODE:
                process((Text) node);
                break;
            default:
                // Other node types (comments, CDATA sections, ...) are skipped.
                break;
        }
    }

    private void process(final ProcessingInstruction processingInstruction)
        throws SAXException
    {
        if (processingInstruction == null)
        {
            return;
        }

        final String target = processingInstruction.getTarget();
        final String data = processingInstruction.getData();
        this.handler.processingInstruction(target, data);
    }

    private void process(final Document document) throws SAXException
    {
        if (document == null)
        {
            return;
        }

        this.handler.startDocument();
        // Only the root element is replayed; any other top-level nodes
        // (processing instructions, comments) are skipped.
        process(document.getDocumentElement());
        this.handler.endDocument();
    }

    private void process(final Element element) throws SAXException
    {
        if (element == null)
        {
            return;
        }

        final String name = element.getNodeName();
        // SAX expects "" rather than null when there is no namespace.
        final String namespace = element.getNamespaceURI() == null
            ? "" : element.getNamespaceURI();
        // getLocalName() is null for nodes created with DOM Level 1 methods.
        final String localName = element.getLocalName() == null
            ? name : element.getLocalName();
        final Attributes attributes =
            DOMContentHandlerBridge.asAttributes(element.getAttributes());
        this.handler.startElement(namespace, localName, name, attributes);
        process(element.getChildNodes());
        this.handler.endElement(namespace, localName, name);
    }

    private void process(final Text text) throws SAXException
    {
        if (text == null)
        {
            return;
        }

        final String value = text.getNodeValue();
        this.handler.characters(value.toCharArray(), 0, value.length());
    }

    private void process(final NodeList nodeList) throws SAXException
    {
        if (nodeList == null)
        {
            return;
        }

        for (int i = 0; i < nodeList.getLength(); i++)
        {
            process(nodeList.item(i));
        }
    }

    private static Attributes asAttributes(final NamedNodeMap nodeMap)
    {
        final AttributesImpl attributes = new AttributesImpl();

        for (int i = 0; i < nodeMap.getLength(); i++)
        {
            final Node attribute = nodeMap.item(i);
            final String name = attribute.getNodeName();
            // SAX expects "" rather than null when there is no namespace.
            final String namespace = attribute.getNamespaceURI() == null
                ? "" : attribute.getNamespaceURI();
            final String localName = attribute.getLocalName() == null
                ? name : attribute.getLocalName();
            final String value = attribute.getNodeValue();

            /*
             * We are not coming from serialized XML, so no declared
             * attribute type is available. SAX specifies "CDATA" as the
             * type to report when none is known.
             */
            attributes.addAttribute(namespace, localName, name, "CDATA", value);
        }

        return attributes;
    }
}
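
For comparison, JAXP can also replay a DOM node as SAX events without a hand-written walker, by running an identity Transformer from a DOMSource into a SAXResult. The sketch below is offered as a possible alternative, not as what either poster used. It assumes the identity transform emits startDocument and endDocument around the fragment, which would be wrong in the middle of an echoed stream, so a small XMLFilterImpl subclass swallows them. The names DomToSax, FragmentFilter, replay, and demo are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

final class DomToSax {
    // Forwards every event to the target except document start and end,
    // so a fragment can be replayed inside an already open document.
    static final class FragmentFilter extends XMLFilterImpl {
        FragmentFilter(ContentHandler target) {
            setContentHandler(target);
        }

        @Override
        public void startDocument() {
            // Swallowed: the outer stream has already started its document.
        }

        @Override
        public void endDocument() {
            // Swallowed: the outer stream ends the document itself.
        }
    }

    // Replays the given DOM node as SAX events on the target handler.
    static void replay(Node node, ContentHandler target) throws TransformerException {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.transform(new DOMSource(node), new SAXResult(new FragmentFilter(target)));
    }

    // Echoes <root>, splices a DOM-built <item> into the stream, closes <root>.
    static String demo() throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        SAXTransformerFactory factory = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler echo = factory.newTransformerHandler();
        echo.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        echo.setResult(new StreamResult(out));

        DocumentBuilderFactory builders = DocumentBuilderFactory.newInstance();
        builders.setNamespaceAware(true);
        Document host = builders.newDocumentBuilder().newDocument();
        Element item = host.createElementNS(null, "item");
        item.setAttributeNS(null, "flag", "true");
        item.appendChild(host.createTextNode("x"));
        host.appendChild(item);

        echo.startDocument();
        echo.startElement("", "root", "root", new AttributesImpl());
        replay(item, echo);
        echo.endElement("", "root", "root");
        echo.endDocument();
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

The trade-off is that the transformer decides how node types such as comments and CDATA sections are reported, whereas the hand-written bridge makes those choices explicit.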
 
