Searching XML

Nash Kabbara · Oct 26, 2004

Hi all,

I just finished writing a log reader that reads XML logs (about 1 to 2 MB
in size). At the command line you can specify the file name, the name of the
element and its value like so: logreader log.txt MyElement myvalue

In retrospect, I've noticed that it takes a long time to process. The time
is spent on comparing the value of all tags named MyElement to myvalue.
Namely:

NodeList nodeList = m_document.getElementsByTagName(MyElement);
for (int index = 0, arrIndex = 0; index < nodeList.getLength(); index++) {
    // getTextNode merely returns the text value of the node
    if (getTextNode(nodeList.item(index)).trim().equals(myvalue)) {
        counter++;
        tempIndex[arrIndex++] = index;
    }
}

This takes around 20 seconds to complete processing. So my question is: is
there some way I can extract XML elements based on the element value?
For example, XPath allows you to choose elements based on attribute value, so
I'm wondering, is there a similar mechanism that allows you to grab
elements based on their value?

Thanks.

Jeff Kish · Oct 26, 2004

Here is a query that selects data based on element values...

This XQuery (taken from a tutorial on the internet; I don't recall the exact doc/URL):

for $b in document("books.xml")//book
where some $a in $b/author
satisfies ($a/last="Stevens" and $a/first="W.")
return $b/title

returns these results:

<title>TCP/IP Illustrated</title>,
<title>Advanced Programming in the UNIX Environment</title>

Using this data:

<bib>
<book year="1994">
<title>TCP/IP Illustrated</title>
<author><last>Stevens</last><first>W.</first></author>
<publisher>Addison-Wesley</publisher>
<price>65.95</price>
</book>

<book year="1992">
<title>Advanced Programming in the UNIX Environment</title>
<author><last>Stevens</last><first>W.</first></author>
<publisher>Addison-Wesley</publisher>
<price>65.95</price>
</book>

<book year="2000">
<title>Data on the Web</title>
<author><last>Abiteboul</last><first>Serge</first></author>
<author><last>Buneman</last><first>Peter</first></author>
<author><last>Suciu</last><first>Dan</first></author>
<publisher>Morgan Kaufmann Publishers</publisher>
<price>65.95</price>
</book>

<book year="1999">
<title>The Economics of Technology and Content for Digital TV</title>
<editor><last>Gerbarg</last>
<first>Darcy</first>
<affiliation>CITI</affiliation>
</editor>
<publisher>Kluwer Academic Publishers</publisher>
<price>129.95</price>
</book>

</bib>
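For readers working in Java rather than with an XQuery engine, the same selection can be sketched as a plain XPath 1.0 expression evaluated through the standard javax.xml.xpath API. The class name BookTitles and the helper titles() are made up for this illustration; the expression is a translation of the query above, not something taken from the tutorial.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, named only for this sketch.
class BookTitles {
    // Same condition as the XQuery: a book that has some author whose
    // <last> and <first> children both match.
    static final String QUERY =
        "//book[author[last = 'Stevens' and first = 'W.']]/title";

    /** Returns the text of every matching <title>, in document order. */
    static List<String> titles(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                QUERY, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0, n = nodes.getLength(); i < n; i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }
}
```

Calling BookTitles.titles() with the <bib> document above as a string should give the same two titles as the XQuery.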

HTH

Andy Dingley · Oct 26, 2004

This takes around 20 seconds to complete processing.

I'm not surprised ! getElementsByTagName is always slow, but it's
also inefficient here because it's having to look everywhere in the
structure to find elements to test their names. If you can improve
the search by looking for elements as children or grand-children,
rather than searching everywhere for them, then this can be a good
tweak.

XML is often incredibly powerful, but this excess power can lead to
inefficiencies if it's being used "by default" when you didn't really
need it.

So my question is, is
there some way where I can extract xml elements based on the element value.

Yes, XPath ! Just use "//MyElementName"

Or if MyElementName is supplied by the users, then use a [...]
predicate and the local-name() function to get the name of the
element, then compare it to the value of an element name supplied as a
parameter.

<xsl:param name="elmName">MyElementName</xsl:param>
...
//*[local-name() = string($elmName)]

XQuery (and various other incarnations) will do it too, and with
better performance. However it's sometimes hard to find XQuery
features in an environment, but most will have XSLT and XPath
available.
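Outside XSLT, the same predicate can be used from Java by binding $elmName through an XPathVariableResolver. This is a minimal sketch, assuming a Java version with javax.xml.xpath and lambdas; the class ByName and the method countByName() are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, named only for this sketch.
class ByName {
    /** Counts the elements whose local name equals elmName. */
    static int countByName(String xml, String elmName) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind $elmName as a variable instead of pasting user input
        // into the expression text.
        xpath.setXPathVariableResolver(
            qname -> "elmName".equals(qname.getLocalPart()) ? elmName : null);
        try {
            NodeList hits = (NodeList) xpath.evaluate(
                "//*[local-name() = string($elmName)]",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
            return hits.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }
}
```

Passing the name as a variable means a user-supplied element name never has to be quoted or escaped inside the expression.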

Jeff Kish · Oct 26, 2004

I like Andy's answer better.
Jeff Kish

Nash Kabbara · Oct 26, 2004

Hi Andy,

Thanks for the response. Actually the lag is not in getElementsByTagName,
but in the loop I have that compares the values of the tags with what the
user is looking for (myvalue). So I was wondering if there's a built-in
mechanism that pulls elements based on their value. When I say "value" I
mean their content, not their name, i.e. <Element>value</Element>. Sorry for
not being clear. It seems your XPath examples get elements based on their
name, but not their value.

Nash

Andy Dingley · Oct 26, 2004

Thanks for the response. Actually the lag is not in getElementsByTagName,
but in the loop I have that compares the values of the tags with what the
user is looking for (myvalue).

I don't recognise the coding platform - what is it ?

There's a lot you can do to improve that loop.
- Use an iterator, not an array index
- Be suspicious of that getLength() method, especially in a loop
bound. Is that a per-iteration overhead you've given yourself ?
- Never trim() when you can rtrim()
- Never trim() when you can use a space-ignoring comparison instead.
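If the platform is Java with the W3C DOM, the first two points could look something like the sketch below: one pass over the tree using getFirstChild()/getNextSibling(), so the loop never calls getLength() or item(index) on a node list. Whether that indexing is really what costs the 20 seconds has not been established in this thread. The class TreeWalkMatch and its methods are hypothetical names, and getTextContent() stands in for the poster's getTextNode() helper.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, named only for this sketch.
class TreeWalkMatch {
    /** Collects, in document order, elements named tag whose trimmed text equals wanted. */
    static List<Element> matches(Node root, String tag, String wanted) {
        List<Element> hits = new ArrayList<>();
        collect(root, tag, wanted, hits);
        return hits;
    }

    private static void collect(Node node, String tag, String wanted, List<Element> hits) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equals(tag)
                && node.getTextContent().trim().equals(wanted)) {
            hits.add((Element) node);
        }
        // Follow sibling links instead of asking a node list for item(i).
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, tag, wanted, hits);
        }
    }

    /** Parses an XML string into a DOM document (test and demo convenience). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

A smaller change to the original loop would be to read nodeList.getLength() into a local variable once, before the loop starts.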

The trouble with much XML optimisation is that it becomes sensitive to
the data you feed it. Do you have a lot of matching elements to walk
through, or is finding the set of elements the main problem ?

So I was wondering if there's a built-in
mechanism that pulls elements based on their value. When I say "value" I
mean their content, not their name, i.e. <Element>value</Element>.

Yes, XPath !

Use a similar predicate, "//*[string (.) = $elmContents]"

string() is optional (because in this context it's the default
behaviour) but it's good practice to use it in situations like this,
because it makes reading your code a lot clearer in the future.
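One thing worth knowing before using this predicate: the string value of an element is the concatenation of all its descendant text, so "//*" can also match wrapper elements whose only text is the value being searched for. The Java sketch below shows that effect and a narrower expression that names the element and uses normalize-space() in place of trim(). The class ByValue and the method namesOfMatches() are hypothetical names for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, named only for this sketch.
class ByValue {
    /** Evaluates expr with $elmContents bound and returns the names of the matched nodes. */
    static List<String> namesOfMatches(String xml, String expr, String elmContents) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setXPathVariableResolver(
            qname -> "elmContents".equals(qname.getLocalPart()) ? elmContents : null);
        try {
            NodeList hits = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0, n = hits.getLength(); i < n; i++) {
                names.add(hits.item(i).getNodeName());
            }
            return names;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }
}
```

For the input <log><entry><MyElement>myvalue</MyElement></entry></log> and the value "myvalue", the expression //*[string(.) = $elmContents] matches log, entry and MyElement, while //MyElement[normalize-space(.) = $elmContents] matches only MyElement.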

Tjerk Wolterink · Oct 26, 2004

I think you're coding in Java.

It is better to use SAX (Simple API for XML).
You then don't have to load the entire DOM,
and you can do some optimizations.

SAX is a good choice if what you want to do is not too complex.
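A minimal sketch of what that could look like for the log reader, assuming the goal is only to count elements with a given name whose trimmed text equals a given value, and assuming those elements are never nested inside one another. The class SaxValueCounter is a hypothetical name; the SAX calls are the standard JAXP ones.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, named only for this sketch.
class SaxValueCounter extends DefaultHandler {
    private final String tag;
    private final String wanted;
    private final StringBuilder text = new StringBuilder();
    private boolean inside;   // true while between <tag> and </tag>
    private int count;

    SaxValueCounter(String tag, String wanted) {
        this.tag = tag;
        this.wanted = wanted;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals(tag)) {
            inside = true;
            text.setLength(0);   // start collecting text for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text run in several chunks, so append.
        if (inside) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals(tag)) {
            if (text.toString().trim().equals(wanted)) {
                count++;
            }
            inside = false;
        }
    }

    /** Streams the XML once and returns how many tag elements contain wanted. */
    static int count(String xml, String tag, String wanted) {
        SaxValueCounter handler = new SaxValueCounter(tag, wanted);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.count;
    }
}
```

Because nothing is kept except the current element's text and a counter, memory use stays flat regardless of the size of the log file.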

Greetz
Tjerk

Jeff Kish · Oct 26, 2004

Yes, XPath !

Use a similar predicate, "//*[string (.) = $elmContents]"

string() is optional (because in this context it's the default
behaviour) but it's good practice to use it in situations like this,
because it makes reading your code a lot clearer in the future.

<snip>
lots of good info in this thread!
Yes, SAX if you don't need to load your entire object in memory.

Oh.. regarding xquery..

for $b in document("books.xml")//*[.="TCP/IP Illustrated"]
return
<temp>{string($b/.), name($b/.)}</temp>

{-- results in this output
<temp>TCP/IP Illustrated title</temp>
--}

Jeff Kish
