Searching XML

Nash Kabbara

Hi all,

I just finished writing a log reader that reads XML logs (about 1 to 2 MB
in size). At the command line you specify the file name, the name of the
element and its value, like so: logreader log.txt MyElement myvalue

I've since noticed that it takes a long time to run. Most of the time
is spent comparing the value of every element named MyElement to myvalue,
namely:

// getTextNode merely returns the text value of the node
NodeList nodeList = m_document.getElementsByTagName(MyElement);
for (int index = 0, arrIndex = 0; index < nodeList.getLength(); index++) {
    if (getTextNode(nodeList.item(index)).trim().equals(myvalue)) {
        counter++;
        tempIndex[arrIndex++] = index;
    }
}
 
This takes around 20 seconds to complete. So my question is: is
there some way to extract XML elements based on the element value?
For example, XPath lets you choose elements based on attribute value, so
I'm wondering whether there is a similar mechanism that lets you grab
elements based on their value.


Thanks.
 
Jeff Kish

Here is a query that selects data based on element values...

This XQuery (taken from a tutorial on the internet; I don't recall the exact doc/URL):

for $b in doc("books.xml")//book
where some $a in $b/author
satisfies ($a/last="Stevens" and $a/first="W.")
return $b/title

returns these results:

<title>TCP/IP Illustrated</title>,
<title>Advanced Programming in the UNIX Environment</title>


Using this data:

<bib>
<book year="1994">
<title>TCP/IP Illustrated</title>
<author><last>Stevens</last><first>W.</first></author>
<publisher>Addison-Wesley</publisher>
<price>65.95</price>
</book>

<book year="1992">
<title>Advanced Programming in the UNIX Environment</title>
<author><last>Stevens</last><first>W.</first></author>
<publisher>Addison-Wesley</publisher>
<price>65.95</price>
</book>

<book year="2000">
<title>Data on the Web</title>
<author><last>Abiteboul</last><first>Serge</first></author>
<author><last>Buneman</last><first>Peter</first></author>
<author><last>Suciu</last><first>Dan</first></author>
<publisher>Morgan Kaufmann Publishers</publisher>
<price>65.95</price>
</book>

<book year="1999">
<title>The Economics of Technology and Content for Digital TV</title>
<editor><last>Gerbarg</last>
<first>Darcy</first>
<affiliation>CITI</affiliation>
</editor>
<publisher>Kluwer Academic Publishers</publisher>
<price>129.95</price>
</book>

</bib>
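
If no XQuery processor is at hand, the same selection can probably be written as a plain XPath 1.0 expression and run from Java's built-in javax.xml.xpath API. The following is only a sketch under that assumption: the class and method names are made up for illustration, and the sample document is a cut-down version of the data above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original thread.
class BibXPathDemo {
    // Returns the titles of books that have at least one author
    // with the given last and first name.
    static List<String> titlesBy(String xml, String last, String first) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Pass the names in as XPath variables instead of pasting them into the expression.
        Map<String, String> vars = Map.of("last", last, "first", first);
        xpath.setXPathVariableResolver(q -> vars.get(q.getLocalPart()));
        NodeList titles = (NodeList) xpath.evaluate(
                "//book[author[last = $last and first = $first]]/title",
                doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0, n = titles.getLength(); i < n; i++) {
            result.add(titles.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<bib>"
                + "<book year=\"1994\"><title>TCP/IP Illustrated</title>"
                + "<author><last>Stevens</last><first>W.</first></author></book>"
                + "<book year=\"2000\"><title>Data on the Web</title>"
                + "<author><last>Abiteboul</last><first>Serge</first></author></book>"
                + "</bib>";
        System.out.println(titlesBy(xml, "Stevens", "W.")); // prints [TCP/IP Illustrated]
    }
}
```

The nested predicate author[...] plays the role of "some $a in $b/author satisfies ...": both name tests have to hold for the same author element.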

HTH
 
Andy Dingley

This takes around 20 seconds to complete processing.

I'm not surprised! getElementsByTagName tends to be slow, and it's
also inefficient here because it has to look everywhere in the
structure to find elements and test their names. If you can narrow
the search to children or grandchildren of a known node,
rather than searching everywhere, then this can be a good
tweak.

XML is often incredibly powerful, but this excess power can lead to
inefficiencies if it's being used "by default" when you didn't really
need it.

So my question is, is
there some way where I can extract xml elements based on the element value.

Yes, XPath! Just use "//MyElementName"

Or, if MyElementName is supplied by the user, use a [...]
predicate and the local-name() function to get the name of each
element, then compare it to the element name supplied as a
parameter.

<xsl:param name="elmName" >MyElementName</xsl:param>
...
//*[local-name() = string($elmName)]


XQuery (and various other incarnations) will do it too, and often with
better performance. However, XQuery support can be hard to find in a
given environment, whereas most will have XSLT and XPath
available.
 
Jeff Kish

I like Andy's answer better.
Jeff Kish
 
Nash Kabbara

Hi Andy,

Thanks for the response. Actually the lag is not in getElementsByTagName,
but in the loop I have that compares the values of the tags with what the
user is looking for (myvalue). So I was wondering if there's a built-in
mechanism that pulls elements based on their value. When I say "value" I
mean their content, not their name, i.e. <Element>value</Element>. Sorry for
not being clear. It seems your XPath examples get elements based on their
name, but not their value.


Nash
 
Andy Dingley

Thanks for the response. Actually the lag is not in getElementsByTagName,
but in the loop I have that compares the values of the tags with what the
user is looking for (myvalue).

I don't recognise the coding platform - what is it?

There's a lot you can do to improve that loop.
- Use an iterator, not an array index.
- Be suspicious of that .getLength() call, especially in a loop
bound. Is that a per-iteration overhead you've given yourself?
- Never trim() when you can rtrim().
- Never trim() when you can use a space-ignoring comparison instead.
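
Assuming the platform is Java with the standard org.w3c.dom API (the posted code looks like it, but that is a guess), the first two points might be sketched roughly as follows. The walk goes through getFirstChild/getNextSibling instead of indexing into the NodeList, so there is no getLength() call in the loop bound. The class and method names are made up for illustration, and getTextContent() only stands in for the getTextNode helper, whose implementation isn't shown.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original thread.
class DomWalkDemo {
    // Collects every element called `name` whose trimmed text equals `wanted`,
    // following sibling links rather than indexing into a live NodeList.
    static void collect(Node node, String name, String wanted, List<Element> hits) {
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip text, comments and so on
            }
            if (child.getNodeName().equals(name)
                    && child.getTextContent().trim().equals(wanted)) {
                hits.add((Element) child);
            }
            collect(child, name, wanted, hits); // descend into the subtree
        }
    }

    static int countMatches(String xml, String name, String wanted) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<Element> hits = new ArrayList<>();
        collect(doc, name, wanted, hits);
        return hits.size();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<log><entry><MyElement> myvalue </MyElement></entry>"
                + "<MyElement>other</MyElement><MyElement>myvalue</MyElement></log>";
        System.out.println(countMatches(xml, "MyElement", "myvalue")); // prints 2
    }
}
```

Collecting the matching elements in a list also removes the need for a pre-sized index array. Whether this is actually faster than the original loop depends on the DOM implementation, so it is worth timing on a real log.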

The trouble with much XML optimisation is that it becomes sensitive to
the data you feed it. Do you have a lot of matching elements to walk
through, or is finding the set of elements the main problem ?

So I was wondering if there's a built-in
mechanism that pulls elements based on their value. When I say "value" I
mean their content, not their name, i.e. <Element>value</Element>.

Yes, XPath!

Use a similar predicate, "//*[string(.) = $elmContents]"

string() is optional (because in this context it's the default
behaviour) but it's good practice to use it in situations like this,
because it makes reading your code a lot clearer in the future.
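
If the code is Java (an assumption on my part), the name predicate and the content predicate could be combined and run through javax.xml.xpath along these lines. This is a sketch, not a tuned solution: the class and method names are invented, the $elmName and $elmContents variables are supplied through an XPathVariableResolver, and normalize-space(.) is used in place of string(.) to mimic the trim() in the original loop (it also collapses runs of whitespace inside the text, which trim() does not).

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original thread.
class ValueSearchDemo {
    // Returns all elements named elementName whose whitespace-normalized
    // content equals wanted.
    static NodeList findByValue(Document doc, String elementName, String wanted)
            throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the user-supplied strings to the XPath variables.
        Map<String, String> vars = Map.of("elmName", elementName, "elmContents", wanted);
        xpath.setXPathVariableResolver(q -> vars.get(q.getLocalPart()));
        return (NodeList) xpath.evaluate(
                "//*[local-name() = $elmName][normalize-space(.) = $elmContents]",
                doc, XPathConstants.NODESET);
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<log><MyElement> myvalue </MyElement>"
                + "<MyElement>other</MyElement><MyElement>myvalue</MyElement></log>");
        System.out.println(findByValue(doc, "MyElement", "myvalue").getLength()); // prints 2
    }
}
```

Passing the strings as variables avoids having to escape quotes in user input. The expression still visits every element, so it may not beat a hand-written loop on every XPath engine.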
 
Tjerk Wolterink

I think you're coding in Java.

It is better to use SAX, the Simple API for XML.
You then don't have to load the entire DOM,
and you can do some optimizations.

SAX is a good choice if what you want to do is not too complex.
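
A rough sketch of what that could look like for the counting case, using the SAX parser that ships with the JDK. The class and method names are invented for the example, and it assumes the target elements are never nested inside each other and that no namespaces are involved.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original thread.
class SaxCountDemo {
    // Streams through the input and counts elements called `name`
    // whose trimmed text equals `wanted`. No DOM tree is built.
    static int countMatches(InputSource source, String name, String wanted) throws Exception {
        int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside a target element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equals(name)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one element, so accumulate.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (text != null && qName.equals(name)) {
                    if (text.toString().trim().equals(wanted)) {
                        count[0]++;
                    }
                    text = null;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return count[0];
    }

    static int countInString(String xml, String name, String wanted) throws Exception {
        return countMatches(new InputSource(new StringReader(xml)), name, wanted);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<log><MyElement> myvalue </MyElement>"
                + "<MyElement>other</MyElement><MyElement>myvalue</MyElement></log>";
        System.out.println(countInString(xml, "MyElement", "myvalue")); // prints 2
    }
}
```

For a real log file, an InputSource built from the file's URI or from a FileInputStream can be passed to countMatches instead of the string wrapper.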

Greetz
Tjerk
 
Jeff Kish

Lots of good info in this thread!
Yes, SAX if you don't need to load your entire document into memory.

Oh, regarding XQuery:

for $b in doc("books.xml")//*[.="TCP/IP Illustrated"]
return
<temp>{string($b/.), name($b/.)}</temp>

(: results in this output
<temp>TCP/IP Illustrated title</temp>
:)

Jeff Kish
 
