How to speed up XML reading

Ramon F Herrera

My application makes a large number of XPath() retrievals, and that
code accounts for most of the running time. The rest of the tasks take
a negligible amount of CPU and disk. In short, all the app does is
read XML variables and write them to a PDF file.

See a previous, very related post below.

-Ramon

=============================================
> You can't compare SAX and DOM directly. SAX works at the parsing level,
> whereas DOM is for manipulating an XML document. DOM is mostly built on
> top of SAX. You can use it, or ignore it and write your own SAX code.
> However, writing your own SAX handler is much more complex, and the
> final result could be much slower than plain DOM usage.

Very true. (Though some DOM parsers/loaders bypass SAX for greater
efficiency; I believe Xerces actually uses lower-level events to drive
its DOM construction.)

SAX does require that you manage all the state information, which may
or may not include building something like the DOM for part or all of
the document. How fast or slow that will be depends entirely on the
problem at hand and how good your code is.

If you've got time, doing it all via SAX may be worth trying. But it
isn't always going to be a magic bullet.

As I said in my other post, the first thing to do is to find out
whether this is even a significant part of your application's
processing time.
 
Ramon F Herrera

Tools used:
C++
Xerces-C
XQilla
Developed under Linux, ported to Windows


A very important lesson that I learned follows. Xerces implements a
reasonably/very fast XPath retrieval BUT it does so at the expense of
flexibility. The only type of XPath retrieval supported by Xerces is
the MINIMAL one:

string neededVariable = XPath("/this/is/the/variable/that/i/need");

If the path contains any character like "[", "@", "=", etc. I must
resort to XQilla, which is wonderful (a LOT easier to code than pure
Xerces), but as slow as molasses in cold weather:

string someOtherVar = XPath("/table/joint/ancestor::table/@titledetail");

After running some benchmarks I have concluded that my best option is
to use a combination of the two XPath engines: Xerces for the "easy"
stuff and XQilla for the more complex.

-Ramon
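
The two-engine split described above can be sketched as a small classifier that decides which engine a path should go to. This is only an illustration under stated assumptions: Engine, isSimplePath and chooseEngine are hypothetical names, not part of Xerces-C or XQilla, and the rule for what counts as "simple" is a conservative guess based on the characters mentioned in the post.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical engine tags; neither name comes from Xerces-C or XQilla.
enum class Engine { Fast, Full };

// Returns true when `path` is an absolute path made only of plain child
// steps, e.g. "/this/is/the/variable". Anything else (predicates,
// attributes, axes, "//", wildcards, namespace prefixes) counts as complex.
bool isSimplePath(const std::string& path) {
    if (path.size() < 2 || path.front() != '/') return false;
    bool stepIsEmpty = true;  // true immediately after a '/'
    for (std::size_t i = 1; i < path.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(path[i]);
        if (c == '/') {
            if (stepIsEmpty) return false;  // "//" is the descendant shortcut
            stepIsEmpty = true;
        } else if (std::isalnum(c) || c == '_' || c == '-') {
            stepIsEmpty = false;
        } else {
            return false;  // '[', '@', '=', ':', '*', '.', whitespace, ...
        }
    }
    return !stepIsEmpty;  // reject a trailing '/'
}

// Routes a non-empty path to the fast engine only when it is clearly simple.
Engine chooseEngine(const std::string& path) {
    assert(!path.empty());
    return isSimplePath(path) ? Engine::Fast : Engine::Full;
}
```

When in doubt the classifier answers "complex", so a misjudged path costs time in the slower engine rather than failing in the faster one.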
 
Alain Ketterlin

[...]
> A very important lesson that I learned follows. Xerces implements a
> reasonably/very fast XPath retrieval BUT it does so at the expense of
> flexibility. The only type of XPath retrieval supported by Xerces is
> the MINIMAL one:
>
> string neededVariable = XPath("/this/is/the/variable/that/i/need");
>
> If the path contains any character like "[", "@", "=", etc. I must
> resort to XQilla, which is wonderful (a LOT easier to code than pure
> Xerces), but as slow as molasses in cold weather:
>
> string someOtherVar = XPath("/table/joint/ancestor::table/@titledetail");

... would have the same effect as ancestor::table since the query starts
at document root.

> After running some benchmarks I have concluded that my best option is
> to use a combination of the 2 XPath engines: Xerces for the "easy"
> stuff and XQilla for the more complex.

XPath may require DOM if you use funny axes, e.g., preceding-sibling::*
and, maybe, ancestor.

However, for the request you show above, a hand-coded SAX parser keeping
a simple stack (with @titledetail cached where appropriate) can extract what
you want. XPath, and any generic query language for that matter, is far
more powerful, and will therefore most likely be slower.

(Generating the SAX handler for any given XPath query is left as an
exercise for the reader. :)

-- Alain.
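
For the single query shown in the thread, the stack-based handler suggested above might look roughly like this. It is a sketch under assumptions, not a real parser binding: TitleDetailHandler and its callback signatures are hypothetical, and events are fed by hand where a real SAX parser (Xerces-C SAX2, Expat, libxml2) would call its own equivalents. It reads the query as "the titledetail attribute of the root table element, provided that element has a joint child".

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Attributes = std::map<std::string, std::string>;

// Hypothetical SAX-style handler for /table/joint/ancestor::table/@titledetail.
class TitleDetailHandler {
public:
    void startElement(const std::string& name, const Attributes& attrs) {
        stack_.push_back(name);
        if (stack_.size() == 1 && name == "table") {
            // Cache the root table's attribute in case a joint child follows.
            const auto it = attrs.find("titledetail");
            if (it != attrs.end()) {
                cached_ = it->second;
                hasAttr_ = true;
            }
        } else if (stack_.size() == 2 && name == "joint" && stack_[0] == "table") {
            sawJoint_ = true;  // the root table has a joint child
        }
    }

    void endElement(const std::string& name) {
        assert(!stack_.empty() && stack_.back() == name);
        stack_.pop_back();
    }

    // True once the query is known to select the cached attribute.
    bool found() const { return hasAttr_ && sawJoint_; }
    const std::string& value() const { return cached_; }

private:
    std::vector<std::string> stack_;  // names of the currently open elements
    std::string cached_;
    bool hasAttr_ = false;
    bool sawJoint_ = false;
};
```

The handler keeps only the element stack and one cached string, so memory use stays flat however large the document is. The price is that every new query needs its own hand-written logic.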
 
Ramon F Herrera

> (Generating the SAX handler for any given XPath query
> is left as an exercise for the reader. :)
>
> -- Alain.

Merci, Alain.

Actually, I think that the solution to my performance problem is to
implement (via SAX?) the reading of the whole XML file and insert the
variables in my own data structures. That must speed up the variable
retrieval substantially BUT an XML guru is required, which I am not.

In the meantime, I downloaded libxml and will see how well it
performs. Perhaps that is the solution to my problem. Being written in
C, it should be faster than Xerces-C++.

-Ramon
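
The idea of reading the whole file once and keeping the variables in your own data structures can be sketched as a path-to-text index filled from SAX-style events. Everything here is an assumption made for illustration: PathIndex and its callbacks are hypothetical names, and a real SAX parser would drive them through its own handler interface.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical index: maps "/a/b/c" to the text found inside that element.
class PathIndex {
public:
    void startElement(const std::string& name) {
        lengths_.push_back(current_.size());  // remember where to cut back to
        current_ += '/';
        current_ += name;
    }

    // Parsers may deliver text in several chunks, so append rather than assign.
    void characters(const std::string& text) {
        if (!current_.empty()) values_[current_] += text;
    }

    void endElement() {
        assert(!lengths_.empty());
        current_.resize(lengths_.back());
        lengths_.pop_back();
    }

    // Returns the stored text for `path`, or nullptr when nothing was seen.
    const std::string* find(const std::string& path) const {
        const auto it = values_.find(path);
        return it == values_.end() ? nullptr : &it->second;
    }

private:
    std::string current_;               // path of the element being read
    std::vector<std::size_t> lengths_;  // length of current_ before each open element
    std::unordered_map<std::string, std::string> values_;
};
```

After one pass over the document, each lookup is a hash-table probe and no XPath engine is involved. The sketch ignores attributes and merges the text of repeated elements that share a path, so it only fits documents where each variable has a unique path.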
 
Manuel Collado

On 12/09/2012 14:52, Ramon F Herrera wrote:
> ...
> Actually, I think that the solution to my performance problem is to
> implement (via SAX?) the reading of the whole XML file and insert the
> variables in my own data structures. That must speed up the variable
> retrieval substantially BUT an XML guru is required, which I am not.
>
> In the meantime, I downloaded libxml and will see how well it
> performs. Perhaps that is the solution to my problem. Being written in
> C, it should be faster than Xerces-C++.

You could try Expat, written in C.
 
Joe Kesselman

> If the path contains any character like "[", "@", "=", etc. I must
> resort to XQilla, which is wonderful (a LOT easier to code than pure
> Xerces), but as slow as molasses in cold weather

You might want to look at Xalan. There was a fair amount of work put
into Xalan performance; I don't know how XQilla compares to that.

Or, if you're using IBM's Java environment, you might want to look at
the XML support that ships with that JRE, which is another design
iteration past Xalan. Or, in WebSphere, the WebSphere XML feature, which
supports XPath 2.0, XSLT 2.0, and XQuery and is yet another design
iteration.

With all of these, remember that the JAXP/TRAX APIs allow precompiling a
path or query. And remember that the performance can be improved if the
document is cached in memory in the appropriate internal representation.
(The Xerces implementation is single-pass, I believe; if you want to run
more than one path the advantage goes away quickly because you have to
reparse the input document.)


--
Joe Kesselman,
http://www.love-song-productions.com/people/keshlam/index.html

{} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
/\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
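
The two points above, precompiling a path and keeping the document cached in memory, can be illustrated with a toy model. This is not JAXP, Xalan or Xerces code: Node, CompiledPath, compile and evaluate are hypothetical names, and the "compilation" is just splitting a plain child-step path into its steps once so that repeated evaluations skip the string parsing.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Toy in-memory document node (hypothetical, not a DOM API).
struct Node {
    std::string name;
    std::string text;
    std::vector<std::unique_ptr<Node>> children;
};

// Result of the one-time parse of a path such as "/doc/title".
struct CompiledPath {
    std::vector<std::string> steps;
};

// Splits an absolute child-step path into its steps; done once per path.
CompiledPath compile(const std::string& path) {
    CompiledPath compiled;
    std::size_t pos = 0;
    while (pos < path.size()) {
        assert(path[pos] == '/');
        const std::size_t next = path.find('/', pos + 1);
        const std::size_t end = (next == std::string::npos) ? path.size() : next;
        compiled.steps.push_back(path.substr(pos + 1, end - pos - 1));
        pos = end;
    }
    return compiled;
}

// Walks the cached tree; returns the first matching node or nullptr.
const Node* evaluate(const CompiledPath& path, const Node& root) {
    if (path.steps.empty() || path.steps[0] != root.name) return nullptr;
    const Node* current = &root;
    for (std::size_t i = 1; i < path.steps.size(); ++i) {
        const Node* match = nullptr;
        for (const auto& child : current->children) {
            if (child->name == path.steps[i]) {
                match = child.get();
                break;
            }
        }
        if (match == nullptr) return nullptr;
        current = match;
    }
    return current;
}
```

With this shape, the document is parsed into the tree once and each path is compiled once, and only evaluate runs per lookup.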
 
Joe Kesselman

> Actually, I think that the solution to my performance problem is to
> implement (via SAX?) the reading of the whole XML file and insert the
> variables in my own data structures. That must speed up the variable
> retrieval substantially BUT an XML guru is required, which I am not.

In many cases, yes, XML should be used as your "portability" level, and
custom internal representations should be used within the application.
Of course the downside is that you then have to implement a lot more of
your own logic rather than being able to take advantage of the XML-level
utilities.

> In the meantime, I downloaded libxml and will see how well it
> performs. Perhaps that is the solution to my problem. Being written in
> C, it should be faster than Xerces-C++.

C++ isn't necessarily slower than C. That depends on the details of the
code, both in coding style and in algorithms. Remember, an infinite
speedup of something that accounts for only 1% of runtime is only a 1%
real improvement.
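
The arithmetic behind that last remark is Amdahl's law. A minimal sketch follows; the function name overallSpeedup is hypothetical, and the fraction has to come from profiling your own application.

```cpp
#include <cassert>
#include <limits>

// Overall speedup when a part taking `fraction` of the runtime (0..1)
// is made `factor` times faster and the rest is left unchanged.
double overallSpeedup(double fraction, double factor) {
    assert(fraction >= 0.0 && fraction <= 1.0);
    assert(factor > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / factor);
}
```

With fraction = 0.01 and an infinite factor the result is 1 / 0.99, about 1.01, which is the 1% case described above. With fraction = 0.9 and factor = 10 it is 1 / 0.19, about 5.3, which is closer to the situation where XPath retrieval dominates the runtime.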

 
Ramon F Herrera

>> If the path contains any character like "[", "@", "=", etc. I must
>> resort to XQilla, which is wonderful (a LOT easier to code than pure
>> Xerces), but as slow as molasses in cold weather
>
> You might want to look at Xalan. There was a fair amount
> of work put into Xalan performance; I don't know how XQilla
> compares to that.

Following your earlier advice, I looked into it. It seems that
Xalan has reached a dead end. It won't even compile on a regular Linux
box.

What I discovered is that most of the action is in libxml. See my
thread "Dramatic performance gains with Libxml" (I develop under C/C++).

-Ramon
 
Joe Kesselman

> Following your earlier advice, I looked into it. It seems that
> Xalan has reached a dead end. It won't even compile on a regular Linux
> box.

The C++ version of Xalan has lost most of its contributors, agreed. The
Java version is still alive and kicking, though not as actively under
development as it was when IBM was donating lots of manhours to it.

I'm not sure whether there's a C++ version of Saxon; if so that would
also be worth looking at.

 
shivers.paul

> Tools used:
> C++
> Xerces-C
> XQilla
> Developed under Linux, ported to Windows
>
> A very important lesson that I learned follows. Xerces implements a
> reasonably/very fast XPath retrieval BUT it does so at the expense of
> flexibility. The only type of XPath retrieval supported by Xerces is
> the MINIMAL one:
>
> string neededVariable = XPath("/this/is/the/variable/that/i/need");
>
> If the path contains any character like "[", "@", "=", etc. I must
> resort to XQilla, which is wonderful (a LOT easier to code than pure
> Xerces), but as slow as molasses in cold weather:
>
> string someOtherVar = XPath("/table/joint/ancestor::table/@titledetail");
>
> After running some benchmarks I have concluded that my best option is
> to use a combination of the 2 XPath engines: Xerces for the "easy"
> stuff and XQilla for the more complex.
>
> -Ramon

Have you looked at the Liquid XML C++ tool? (http://www.liquid-technologies.com/xmldatabinding/xml-schema-to-cpp.aspx)
 
