org.w3c.dom.NodeList - empty nodes?

Michael Preminger

Hello!

The question is a bit lengthy (for completeness) but actually quite simple.

I have a very simple XML document I'm experimenting with:

<?xml version="1.0"?>
<metadata xmlns="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://dublincore.org/schemas/xmls/simpledc20021212.xsd"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title>
UKOLN
</dc:title>
<dc:description>
UKOLN is a national focus of expertise in digital information
management. It provides policy, research and awareness services
to the UK library, information and cultural heritage communities.
UKOLN is based at the University of Bath.
</dc:description>
<dc:publisher>
UKOLN, University of Bath
</dc:publisher>
<dc:identifier>
http://www.ukoln.ac.uk/
</dc:identifier>
</metadata>

The root element is tagged <metadata>, and I am looping through its
childNodes:

    Element docElem = document.getDocumentElement();
    System.out.println("Document element: " + docElem.getNodeName());
    NodeList nl = docElem.getChildNodes();

    for (int i = 0; i < nl.getLength(); i++) {
        Node nd = nl.item(i);
        System.out.println(i + " " + nd);
    }


Unexpectedly, I get the following output, where every even node seems
devoid of contents.
------------------------------------------------
Document element:metadata


0


1 <dc:title>
UKOLN
</dc:title>
2

3 <dc:description>
UKOLN is a national focus of expertise in digital information
management. It provides policy, research and awareness services
to the UK library, information and cultural heritage communities.
UKOLN is based at the University of Bath.
</dc:description>
4

5 <dc:publisher>
UKOLN, University of Bath
</dc:publisher>
6

7 <dc:identifier>
http://www.ukoln.ac.uk/
</dc:identifier>
8
---------------------------------------------------------------------------
I thought that the "void" nodes were the text nodes descended from the
<dc:> elements. (they have a NODE_TYPE 1).
When I descend into one of the nodes (dc:publisher):
    if (i == 5) {
        NodeList nl5 = nd.getChildNodes();
        for (int k = 0; k < nl5.getLength(); k++) {
            System.out.println("k:" + k + " " + nl5.item(k));
        }
    }
Then I actually get the text "UKOLN, University of Bath".
To me this means that the void even nodes are not the text nodes. (I get
nothing when I try to type-cast them into Text and run getData())

If so: what are they?
If they are the text nodes: why isn't their content printed to
standard output?

Thanks

Michael
 
Martin Honnen

Michael Preminger wrote:

I thought that the "void" nodes were the text nodes descended from the
<dc:> elements. (they have a NODE_TYPE 1).

No, what you see in the DOM are whitespace text nodes between the
element nodes. For example, if you have
<gods><god>Kibo</god><god>Xibo</god></gods>
then you have only element nodes: there is the document element node
(<gods>), and it has two child nodes, which are again element nodes. But
for easier reading such XML is usually written as
<gods>
<god>Kibo</god>
<god>Xibo</god>
</gods>
and then the document element node (<gods>) has five child nodes: a text
node with whitespace, an element node (<god>), a text node with
whitespace, an element node (<god>), and a text node with whitespace.
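
To check the counts yourself, here is a minimal sketch using the standard JAXP parser. The class name WhitespaceNodesDemo and its helper methods are made up for this illustration; only the javax.xml.parsers and org.w3c.dom calls are real API. It parses both forms of the <gods> document and counts the children of the document element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the name and helpers are illustrative only.
class WhitespaceNodesDemo {

    // Parse the XML string and return the children of the document element.
    static NodeList childrenOfRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getChildNodes();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // All child nodes, whatever their type.
    static int totalChildCount(String xml) {
        return childrenOfRoot(xml).getLength();
    }

    // Only the child nodes that are elements.
    static int elementChildCount(String xml) {
        NodeList children = childrenOfRoot(xml);
        int count = 0;
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String compact = "<gods><god>Kibo</god><god>Xibo</god></gods>";
        String indented = "<gods>\n<god>Kibo</god>\n<god>Xibo</god>\n</gods>";

        System.out.println("compact:  " + totalChildCount(compact)
                + " children, " + elementChildCount(compact) + " elements");
        System.out.println("indented: " + totalChildCount(indented)
                + " children, " + elementChildCount(indented) + " elements");
    }
}
```

If you only care about the elements, testing getNodeType() against Node.ELEMENT_NODE inside the loop, as elementChildCount does, is usually the simplest way to skip the whitespace nodes.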
 
John C. Bollinger

Exactly right. Note, however, that a parser operating in "validating"
mode may be able to avoid creating the text nodes in question, so you
cannot assume that they will always be there. A validating parser
requires a DTD or schema, however, so if there is none you should
expect to see the extra nodes.
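
As a rough illustration of the validating case, the sketch below adds a small internal DTD to the <gods> example and asks JAXP to drop whitespace in element content. The DTD, the class name IgnoreWhitespaceDemo and the childCount helper are assumptions made for this example; setValidating and setIgnoringElementContentWhitespace are the real DocumentBuilderFactory settings.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the DTD below is invented for the example.
class IgnoreWhitespaceDemo {

    // The DTD says <gods> contains only <god> elements, so any whitespace
    // directly inside <gods> is "element content whitespace".
    static final String XML =
            "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE gods [\n"
            + "  <!ELEMENT gods (god*)>\n"
            + "  <!ELEMENT god (#PCDATA)>\n"
            + "]>\n"
            + "<gods>\n<god>Kibo</god>\n<god>Xibo</god>\n</gods>";

    // Count the children of the document element, with or without
    // validation plus whitespace stripping.
    static int childCount(String xml, boolean validateAndStrip) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validateAndStrip);
            factory.setIgnoringElementContentWhitespace(validateAndStrip);

            DocumentBuilder builder = factory.newDocumentBuilder();
            // Turn validation errors into exceptions instead of ignoring them.
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
            });

            return builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getChildNodes()
                    .getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("non-validating: " + childCount(XML, false));
        System.out.println("validating, whitespace ignored: " + childCount(XML, true));
    }
}
```

Without a DTD the parser has no way to know which whitespace is insignificant, so this setting would not help with the document in the original question as it stands.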

Note also that you are not assured that the entire text content of an
element node will be contained in a single text node, even if the
element contains nothing but text. Furthermore, if you need to be
general in your handling of the DOM tree, you also need to allow for
CDATA section nodes wherever you permit text, and even for a mix of
CDATA section nodes and text nodes.
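
One way to cope with this, sketched under the assumption that you only need the combined text, is to let the DOM do the concatenation via Node.getTextContent() (DOM Level 3), or to ask the parser to merge CDATA sections into adjacent text with setCoalescing(true). The class name TextContentDemo, the sample <note> document and the helpers are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; the sample document is invented for the example.
class TextContentDemo {

    // Text, then a CDATA section, then text again, all inside one element.
    static final String XML =
            "<note>plain <![CDATA[<raw> & unescaped]]> tail</note>";

    // Parse the XML and return the document element. With coalescing on,
    // the parser converts CDATA sections to text and merges neighbours.
    static Element parseRoot(String xml, boolean coalescing) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(coalescing);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    static int childCount(String xml, boolean coalescing) {
        return parseRoot(xml, coalescing).getChildNodes().getLength();
    }

    // getTextContent() concatenates the text of all descendant text and
    // CDATA section nodes, however they happen to be split up.
    static String text(String xml, boolean coalescing) {
        return parseRoot(xml, coalescing).getTextContent();
    }

    public static void main(String[] args) {
        System.out.println("children, default parser:    " + childCount(XML, false));
        System.out.println("children, coalescing parser: " + childCount(XML, true));
        System.out.println("text: " + text(XML, false));
    }
}
```

Either way the code that reads the text no longer depends on how the parser chose to split it into nodes.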
 
