See whitespace in Java DOM

Wired Earp

I've had some luck using the string values "\t", "\n" and "\r" to insert tab,
newline and carriage-return text nodes into a document, but I can't *read*
these nodes, at least not by analyzing the nodeValue. Am I missing
something?


/**
 * NodeFilter supposed to remove ignorable whitespace
 */
private class WhiteSpaceFilter implements NodeFilter {

    public short acceptNode ( Node node ) {

        // HELLO?
        String value = node.getTextContent ();
        boolean ok = value.equals ( "\n" ) || value.equals ( "\t" );
        return ok ? NodeFilter.FILTER_ACCEPT : NodeFilter.FILTER_REJECT;
    }
}

/**
 * Strip whitespace
 * @param element DOMElement
 */
private void strip ( Element element ) {

    List<Node> list = new ArrayList<Node> ();
    NodeFilter filter = new WhiteSpaceFilter ();
    Document document = element.getOwnerDocument();
    DocumentTraversal traversable = (DocumentTraversal) document;
    TreeWalker walker = traversable.createTreeWalker (
        element, NodeFilter.SHOW_TEXT, filter, true );

    while ( walker.nextNode() != null )
        list.add ( walker.getCurrentNode ());
    for ( Node node : list )
        node.getParentNode().removeChild ( node );
}
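
One possible reason the equals() test never matches: when indented XML is parsed, the whitespace between two elements typically arrives as a single text node holding the newline plus the indentation, so the value is rarely exactly "\n" or "\t". The sketch below is only an illustration of that, not the poster's code; the class WhitespaceProbe and its helpers textNodes and escape are made-up names, and it assumes the JDK's default parser, whose Document can be cast to DocumentTraversal.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class WhitespaceProbe {

    /** Returns the data of every text node, with control characters made visible. */
    static List<String> textNodes(String xml) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // Assumption: the default JDK Document implements DocumentTraversal.
        NodeIterator iterator = ((DocumentTraversal) document).createNodeIterator(
                document, NodeFilter.SHOW_TEXT, null, false);
        List<String> result = new ArrayList<>();
        for (Node node = iterator.nextNode(); node != null; node = iterator.nextNode()) {
            result.add(escape(node.getNodeValue()));
        }
        return result;
    }

    /** Replaces tab, newline and carriage return with their escape sequences. */
    static String escape(String s) {
        return s.replace("\n", "\\n").replace("\t", "\\t").replace("\r", "\\r");
    }

    public static void main(String[] args) throws Exception {
        // Prints each text node in brackets so trailing spaces are visible.
        for (String data : textNodes("<root>\n  <item>a b</item>\n</root>")) {
            System.out.println("[" + data + "]");
        }
    }
}
```

If the first node printed shows a newline followed by spaces, an exact comparison against "\n" cannot succeed, which would explain why a pattern or a trim-based check is needed.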
 
Wired Earp

I said:
Am I missing something?

For some reason, even a single "\n" text node can only be identified by a
regular expression. To make things worse, in-text whitespace must be
trimmed out so that it does not fool the filter.

private class WhiteSpaceFilter implements NodeFilter {

    // filter parsed data
    public short acceptNode ( Node node ) {
        node = sanitize ( node );
        String data = node.getTextContent();
        boolean ok = Pattern.matches ( "", data );
        return ok ? NodeFilter.FILTER_ACCEPT : NodeFilter.FILTER_REJECT;
    }

    // parse and modify data
    private Node sanitize ( Node node ) {
        Text text = ( Text ) node;
        String data = text.getData ();
        text.setData ( data.replaceAll ( "[\t\n\r\f]+", "" ));
        return node; //TODO: delete multiple space characters
    }
}
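
For comparison, here is a rough sketch of a filter that only inspects the node and leaves its data untouched, so the filtering step has no side effects. It is a different approach from the sanitize() version above, not a correction of it: it treats a node as blank when trim() leaves nothing, which also covers plain spaces. The names WhitespaceOnlyFilterDemo, WhitespaceOnlyFilter, removeBlankTextNodes and parse are invented for the example, and the cast to DocumentTraversal assumes the JDK's default DOM implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;

// Hypothetical demo class, for illustration only.
class WhitespaceOnlyFilterDemo {

    /** Accepts text nodes made up solely of whitespace; never modifies them. */
    static class WhitespaceOnlyFilter implements NodeFilter {
        @Override
        public short acceptNode(Node node) {
            String data = node.getNodeValue();
            boolean blank = data != null && data.trim().isEmpty();
            return blank ? NodeFilter.FILTER_ACCEPT : NodeFilter.FILTER_REJECT;
        }
    }

    /** Removes whitespace-only text nodes below the element; returns how many were removed. */
    static int removeBlankTextNodes(Element element) {
        Document document = element.getOwnerDocument();
        TreeWalker walker = ((DocumentTraversal) document).createTreeWalker(
                element, NodeFilter.SHOW_TEXT, new WhitespaceOnlyFilter(), true);
        // Collect first, remove afterwards, so the walk is not disturbed.
        List<Node> blanks = new ArrayList<>();
        while (walker.nextNode() != null) {
            blanks.add(walker.getCurrentNode());
        }
        for (Node node : blanks) {
            node.getParentNode().removeChild(node);
        }
        return blanks.size();
    }

    /** Parses an XML string with the default JDK parser. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }
}
```

Text such as "a b" is rejected by the filter and therefore kept, so in-text whitespace no longer needs to be trimmed before the check.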
 
Wired Earp

I said:
For some reason, even a single "\n" text node can only be identified by a
regular expression. To make things worse, in-text whitespace must be
trimmed out so that it does not fool the filter.

In that case, it would probably be simpler to just:

private void strip ( Document document ) {

    DocumentTraversal traversable = ( DocumentTraversal ) document;
    NodeIterator iterator = traversable.createNodeIterator (
        (Node)document, NodeFilter.SHOW_TEXT, null, false );

    Node node;
    while (( node = iterator.nextNode ()) != null ) {
        Text text = ( Text ) node;
        String data = text.getData ();
        text.setData ( data.replaceAll ( "[\t\n\r\f]+", "" ));
        // TODO: delete multiple spaces
    }
    document.normalizeDocument ();
}
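
One way the remaining TODO might be handled is sketched below. Note that it deliberately behaves differently from the snippet above: instead of deleting tabs and newlines inside text, it collapses every run of whitespace (spaces included) into a single space, and it removes text nodes that contain nothing but whitespace. Whether that is acceptable depends on the document, since whitespace can be significant in mixed content. WhitespaceStripper, strip and parse are illustrative names, and the cast to DocumentTraversal again assumes the JDK's default DOM implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.NodeIterator;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class WhitespaceStripper {

    /** Drops whitespace-only text nodes and collapses whitespace runs in the rest. */
    static void strip(Document document) {
        NodeIterator iterator = ((DocumentTraversal) document).createNodeIterator(
                document, NodeFilter.SHOW_TEXT, null, false);
        // Collect the text nodes first so that removals do not affect the iteration.
        List<Text> texts = new ArrayList<>();
        for (Node node = iterator.nextNode(); node != null; node = iterator.nextNode()) {
            texts.add((Text) node);
        }
        iterator.detach();

        for (Text text : texts) {
            String data = text.getData();
            if (data.trim().isEmpty()) {
                // Nothing but whitespace: remove the node entirely.
                text.getParentNode().removeChild(text);
            } else {
                // Collapse tabs, newlines and repeated spaces into one space.
                text.setData(data.replaceAll("\\s+", " "));
            }
        }
    }

    /** Parses an XML string with the default JDK parser. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }
}
```

Leading and trailing whitespace inside a text node is reduced to one space rather than removed; a further trim() could be added if that is not wanted.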
 
