How to speed up XML reading

Ramon F Herrera · Sep 11, 2012

My application makes a large number of XPath() retrievals, and that is
the code that consumes most of the clock time. The rest of
the tasks take a negligible amount of CPU and disk. In short, all the
app does is read XML variables and write them to a PDF file.

See a previous, very related post below.

-Ramon

=============================================

You can't compare SAX and DOM. SAX works at the parsing level, whereas
DOM is for manipulating an XML document. DOM is mostly built on top of a
SAX system. You can use it, or ignore it and build your own SAX code.
However, creating your own SAX handler is much more complex, and the
final result could be much slower than with pure DOM usage.

Very true. (Though some DOM parsers/loaders bypass SAX for greater
efficiency; I believe Xerces actually uses lower-level events to drive
its DOM construction.)

SAX does require that you manage all the state information, which may
or may not include building something like the DOM for part or all of
the document. How fast or slow that will be depends entirely on the
problem at hand and how good your code is.

If you've got time, doing it all via SAX may be worth trying. But it
isn't always going to be a magic bullet.

As I said in my other post, the first thing to do is to find out
whether this is even a significant part of your application's
processing time.
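
Short of running a real profiler, one low-tech way to check this is to time the suspect calls and compare the result with the total run time. The following is a minimal sketch assuming C++11; `timeIt` and `fractionOfTotal` are hypothetical helper names invented for illustration, not part of Xerces or XQilla.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs `work` once and returns the wall-clock time it took, in seconds.
// (Hypothetical helper, for illustration only.)
double timeIt(const std::function<void()>& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Share of the total run time spent in the measured part.
double fractionOfTotal(double partSeconds, double totalSeconds) {
    assert(totalSeconds > 0.0);
    return partSeconds / totalSeconds;
}
```

Wrapping the XPath retrievals in `timeIt` and the whole run in another `timeIt` gives the two numbers to feed into `fractionOfTotal`. If the share is small, tuning the XML layer will not pay off.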

Ramon F Herrera · Sep 11, 2012

A related thread is: "Why is SAX faster than DOM?"

-RFH

Ramon F Herrera · Sep 11, 2012

Tools used:
C++
Xerces-C
XQilla
Developed under Linux, ported to Windows

A very important lesson that I learned follows. Xerces implements a
reasonably/very fast XPath retrieval BUT it does so at the expense of
flexibility. The only type of XPath retrieval supported by Xerces is
the MINIMAL one:

string neededVariable = XPath("/this/is/the/variable/that/i/need");

If the path contains any character like "[", "@", "=", etc. I must
resort to XQilla, which is wonderful (a LOT easier to code than pure
Xerces), but as slow as molasses in cold weather:

string someOtherVar = XPath("/table/joint/ancestor::table/@titledetail");

After running some benchmarks I have concluded that my best option is
to use a combination of the 2 XPath engines: Xerces for the "easy"
stuff and XQilla for the more complex.
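
The split between the two engines can be sketched as a small dispatcher. This is only an illustration: `Engine`, `isPlainPath` and `chooseEngine` are hypothetical names, and the rule for what counts as an "easy" path (an absolute path made of plain element names only) is an assumption based on the description above, not on the Xerces documentation.

```cpp
#include <cassert>
#include <cctype>
#include <string>

enum class Engine { Fast, Full };  // Fast: Xerces-style, Full: XQilla-style

// Returns true when `path` is a plain absolute child path such as
// "/a/b/c": '/' separators and simple element names, nothing else.
bool isPlainPath(const std::string& path) {
    if (path.empty() || path[0] != '/') return false;
    bool expectName = false;  // true right after a '/'
    for (char c : path) {
        if (c == '/') {
            if (expectName) return false;  // "//" or an empty step
            expectName = true;
        } else if (std::isalnum(static_cast<unsigned char>(c)) ||
                   c == '_' || c == '-') {
            expectName = false;
        } else {
            return false;  // '[', '@', '=', ':', '*', '.', ...
        }
    }
    return !expectName;  // reject a trailing '/'
}

// Sends anything that is not a plain path to the full engine.
Engine chooseEngine(const std::string& path) {
    return isPlainPath(path) ? Engine::Fast : Engine::Full;
}
```

The check is deliberately conservative: any path it does not recognise goes to the slower but complete engine, so a wrong guess costs time rather than correctness.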

-Ramon

Alain Ketterlin · Sep 12, 2012

[...]

A very important lesson that I learned follows. Xerces implements a
reasonably/very fast XPath retrieval BUT it does so at the expense of
flexibility. The only type of XPath retrieval supported by Xerces is
the MINIMAL one:

string neededVariable = XPath("/this/is/the/variable/that/i/need");

If the path contains any character like "[", "@", "=", etc. I must
resort to XQilla, which is wonderful (a LOT easier to code than pure
Xerces), but as slow as molasses in cold weather:

string someOtherVar = XPath("/table/joint/ancestor::table/@titledetail");

... would have the same effect as ancestor::table since the query starts
at document root.

After running some benchmarks I have concluded that my best option is
to use a combination of the 2 XPath engines: Xerces for the "easy"
stuff and XQilla for the more complex.

XPath may require DOM if you use funny axes, e.g., preceding-sibling::*
and, maybe, ancestor.

However, for the request you show above, a hand-coded SAX parser keeping
a simple stack (with @titledetail cached where appropriate) can extract what
you want. XPath, and any generic query language for that matter, is far
more powerful, and will therefore most likely be slower.
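
A rough sketch of such a stack-based handler follows. It is not tied to any parser: `TitleDetailHandler` and its `startElement`/`endElement` methods are hypothetical stand-ins for the callbacks a real SAX parser would invoke, and the attributes are passed as a plain `std::map`. It also simplifies the query: it reports the `titledetail` of every `table` enclosing a `joint`, without checking the absolute `/table/joint` prefix and without removing duplicates.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using Attributes = std::map<std::string, std::string>;

// Stand-in for a SAX content handler: a real parser would call
// startElement/endElement as it walks the document.
class TitleDetailHandler {
public:
    void startElement(const std::string& name, const Attributes& attrs) {
        if (name == "joint") {
            // Walk the open elements, innermost first, and record the
            // cached @titledetail of every enclosing <table>.
            for (auto it = open_.rbegin(); it != open_.rend(); ++it) {
                if (it->name == "table" && it->hasTitle) {
                    results_.push_back(it->title);
                }
            }
        }
        // Push a frame for this element, caching @titledetail on tables.
        Frame frame{name, false, ""};
        const auto found = attrs.find("titledetail");
        if (name == "table" && found != attrs.end()) {
            frame.hasTitle = true;
            frame.title = found->second;
        }
        open_.push_back(frame);
    }

    void endElement(const std::string& /*name*/) {
        if (!open_.empty()) open_.pop_back();
    }

    const std::vector<std::string>& results() const { return results_; }

private:
    struct Frame {
        std::string name;
        bool hasTitle;
        std::string title;
    };
    std::vector<Frame> open_;           // currently open elements
    std::vector<std::string> results_;  // collected attribute values
};
```

The stack never holds more than the current nesting depth, so memory use stays small no matter how large the document is.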

(Generating the SAX handler for any given XPath query is left as an
exercise for the reader.)

-- Alain.

Ramon F Herrera · Sep 12, 2012

(Generating the SAX handler for any given XPath query
is left as an exercise for the reader.)

-- Alain.

Merci, Alain.

Actually, I think that the solution to my performance problem is to
implement (via SAX?) the reading of the whole XML file and insert the
variables in my own data structures. That must speed up the variable
retrieval substantially BUT an XML guru is required, which I am not.

In the meantime, I downloaded libxml and will see how well it
performs. Perhaps that is the solution to my problem. Being written in
C, it should be faster than Xerces-C++

-Ramon

Manuel Collado · Sep 12, 2012

On 12/09/2012 14:52, Ramon F Herrera wrote:

...
Actually, I think that the solution to my performance problem is to
implement (via SAX?) the reading of the whole XML file and insert the
variables in my own data structures. That must speed up the variable
retrieval substantially BUT an XML guru is required, which I am not.

In the meantime, I downloaded libxml and will see how well it
performs. Perhaps that is the solution to my problem. Being written in
C, it should be faster than Xerces-C++

You could try Expat, written in C.

Joe Kesselman · Sep 14, 2012

A related thread is: "Why is SAX faster than DOM?"

(Answer: It isn't always. Depends on the patterns of access to the data.)

--
Joe Kesselman,
http://www.love-song-productions.com/people/keshlam/index.html

{} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
/\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."

Joe Kesselman · Sep 14, 2012

If the path contains any character like "[", "@", "=", etc. I must
resort to XQilla, which is wonderful (a LOT easier to code than pure
Xerces), but as slow as molasses in cold weather

You might want to look at Xalan. There was a fair amount of work put
into Xalan performance; I don't know how XQilla compares to that.

Or, if you're using IBM's Java environment, you might want to look at
the XML support that ships with that JRE, which is another design
iteration past Xalan. Or, in WebSphere, the WebSphere XML feature, which
supports XPath 2.0, XSLT 2.0, and XQuery and is yet another design
iteration.

With all of these, remember that the JAXP/TRAX APIs allow precompiling a
path or query. And remember that the performance can be improved if the
document is cached in memory in the appropriate internal representation.
(The Xerces implementation is single-pass, I believe; if you want to run
more than one path the advantage goes away quickly because you have to
reparse the input document.)
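
The same two ideas (compile the path once, keep the parsed document in memory) can be illustrated outside JAXP. The sketch below is a toy in C++, not the JAXP/TRAX API or any real XPath engine: `Node` and `CompiledPath` are hypothetical names, only plain `/a/b/c` paths are handled, and each step simply follows the first child with a matching name.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Toy in-memory document: one node per element, kept after parsing.
struct Node {
    std::string name;
    std::string text;
    std::vector<std::unique_ptr<Node>> children;
};

// A plain "/a/b/c" path split into steps once, then reused.
class CompiledPath {
public:
    explicit CompiledPath(const std::string& path) {
        std::istringstream in(path);
        std::string step;
        while (std::getline(in, step, '/')) {
            if (!step.empty()) steps_.push_back(step);
        }
    }

    // Returns the node reached by following the first matching child
    // at each step, or nullptr if some step has no match.
    const Node* evaluate(const Node& root) const {
        if (steps_.empty() || root.name != steps_[0]) return nullptr;
        const Node* current = &root;
        for (std::size_t i = 1; i < steps_.size(); ++i) {
            const Node* next = nullptr;
            for (const auto& child : current->children) {
                if (child->name == steps_[i]) { next = child.get(); break; }
            }
            if (!next) return nullptr;
            current = next;
        }
        return current;
    }

private:
    std::vector<std::string> steps_;
};
```

Once the tree and the `CompiledPath` objects exist, each lookup is a short walk over in-memory nodes, with no reparsing of the input and no re-splitting of the path string.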

Joe Kesselman · Sep 14, 2012

Actually, I think that the solution to my performance problem is to
implement (via SAX?) the reading of the whole XML file and insert the
variables in my own data structures. That must speed up the variable
retrieval substantially BUT an XML guru is required, which I am not.

In many cases, yes, XML should be used as your "portability" level, and
custom internal representations should be used within the application.
Of course the downside is that you then have to implement a lot more of
your own logic rather than being able to take advantage of the XML-level
utilities.
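
As a sketch of what such a custom internal representation might look like, the class below builds a path-to-value table in a single pass over parser events. `PathIndex` and its methods are hypothetical; a real SAX parser would supply the events, and this version only records the text of leaf elements, keeping the first value when a path repeats.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Builds a "/a/b/c" -> text table from SAX-like events in one pass.
class PathIndex {
public:
    void startElement(const std::string& name) {
        if (!open_.empty()) open_.back().hasChild = true;
        open_.push_back(Frame{path_.size(), false});
        path_ += '/';
        path_ += name;
        text_.clear();
    }

    // Text may arrive in several chunks, so it is accumulated.
    void characters(const std::string& chunk) { text_ += chunk; }

    void endElement() {
        if (open_.empty()) return;
        // Only leaf elements are stored; emplace keeps the first value.
        if (!open_.back().hasChild) values_.emplace(path_, text_);
        path_.resize(open_.back().prefixLength);
        open_.pop_back();
        text_.clear();
    }

    // Returns the stored value, or nullptr when the path is unknown.
    const std::string* find(const std::string& path) const {
        const auto it = values_.find(path);
        return it == values_.end() ? nullptr : &it->second;
    }

private:
    struct Frame {
        std::size_t prefixLength;  // length of path_ before this element
        bool hasChild;
    };
    std::vector<Frame> open_;
    std::string path_;
    std::string text_;
    std::unordered_map<std::string, std::string> values_;
};
```

After the single parsing pass, every retrieval is a hash lookup, which is where the speedup over repeated XPath evaluation would come from. The price is the one named above: predicates, attributes and axes would all have to be reimplemented by hand.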

In the meantime, I downloaded libxml and will see how well it
performs. Perhaps that is the solution to my problem. Being written in
C, it should be faster than Xerces-C++

C++ isn't necessarily slower than C. That depends on the details of the
code, both in coding style and in algorithms. Remember, an infinite
speedup of something that accounts for only 1% of runtime is only a 1%
real improvement.
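
That last remark is Amdahl's law. A minimal sketch of the arithmetic, with `overallSpeedup` as a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Overall speedup when a fraction `p` of the run time (0 <= p <= 1)
// is made `s` times faster (Amdahl's law). `s` may be infinity.
double overallSpeedup(double p, double s) {
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

With p = 0.01 and an infinite s, the result is 1/0.99, or about 1.01: the program as a whole finishes roughly 1% sooner.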

Ramon F Herrera · Sep 16, 2012

If the path contains any character like "[", "@", "=", etc. I must
resort to XQilla, which is wonderful (a LOT easier to code than pure
Xerces), but as slow as molasses in cold weather

You might want to look at Xalan. There was a fair amount
of work put into Xalan performance; I don't know how XQilla
compares to that.

Following your earlier advice, I looked into it. It seems that
Xalan has reached a dead end. It won't even compile on a regular Linux
box.

What I discovered is that most of the action is in libxml. See my
thread "Dramatic performance gains with Libxml" (I develop under C/C++).

-Ramon

Joe Kesselman · Sep 20, 2012

Following your earlier advice, I looked into it. It seems that
Xalan has reached a dead end. It won't even compile on a regular Linux
box.

The C++ version of Xalan has lost most of its contributors, agreed. The
Java version is still alive and kicking, though not as actively under
development as it was when IBM was donating lots of man-hours to it.

I'm not sure whether there's a C++ version of Saxon; if so that would
also be worth looking at.

shivers.paul · Sep 21, 2012

Tools used:
C++
Xerces-C
XQilla
Developed under Linux, ported to Windows

A very important lesson that I learned follows. Xerces implements a
reasonably/very fast XPath retrieval BUT it does so at the expense of
flexibility. The only type of XPath retrieval supported by Xerces is
the MINIMAL one:

string neededVariable = XPath("/this/is/the/variable/that/i/need");

If the path contains any character like "[", "@", "=", etc. I must
resort to XQilla, which is wonderful (a LOT easier to code than pure
Xerces), but as slow as molasses in cold weather:

string someOtherVar = XPath("/table/joint/ancestor::table/@titledetail");

After running some benchmarks I have concluded that my best option is
to use a combination of the 2 XPath engines: Xerces for the "easy"
stuff and XQilla for the more complex.

-Ramon

Have you looked at the Liquid XML C++ tool? (http://www.liquid-technologies.com/xmldatabinding/xml-schema-to-cpp.aspx)
