xerces advanced usage - progress, random access etc

Kza · Sep 4, 2006

Hi, I am currently using the Xerces SAX parser for C++ (I use DOM too, but
I think SAX is more relevant here) for processing and displaying fairly
large XML files. Usually I give Xerces a filename, it parses the file, and
that's all good. But the customer needs more features.

Feature 1: A progress display. I have tried a few times now to find a
way of asking Xerces how far through a file it is in bytes, but no
luck. (I did try a per-element check, but that involves a whole extra
parse at the start just to count the elements.) I have tried using the
LocalFileInputSource, getting its BinInputStream and calling its
curPos, but it's always 0.

Any ideas?

Feature 2: Loading only a "screenful" of the file at a time. I also
would like some sort of random access functionality, so if the user
scrolls down to 75% of the file, the parser skips forward to that
position and starts reading there, and when they scroll back up it goes
up and reads just that little bit of the file.

I am pretty sure feature 1 is possible with normal Xerces SAX, but I
have no idea how. The documentation is very sparse: it names the
functions but doesn't actually say what they do or how they should
be used.

For feature 2 it might be more complicated. A colleague mentioned some
other "object models" like xparse and xalaron (not sure how that's
pronounced or spelt), some Apache project that parses XML in a
random-access fashion.

Anyone got any ideas?

Thanks a lot.

Joe Kesselman · Sep 5, 2006

Kza said:
Feature 1: A progress display.

The SAX APIs can be persuaded to give line/column information, though
unless you know how many lines there were in the file before you started
parsing it, that doesn't do you any good. Look at the Locator API.

The DOM assumes reading the file is a single operation, so the concept
of getting incremental details doesn't make much sense. You *could* plug
in a stream filter between wherever the file is being read from and the
parser, and set up that filter so it counts characters going by --
that's going to give you only a very rough progress indication, and
again it requires that you know the length before you start if you want
to report it as a percentage-complete number.

Feature 2: Loading only a "screenful" of the file at a time.

"Screenful" is not defined in XML. Nor is starting parse from the middle
of a file. You could try to do something with incremental processing,
via throttling of a SAX stream -- I've done that in the past -- but
keeping track of when enough has been read to fill a screen and when
more would have to be read to fill the next screen is very much an
application problem rather than a parser problem.

Random-access to an XML model isn't a problem -- the DOM can do that,
though again it isn't designed to operate on screenfuls -- but
random-order parsing really doesn't make sense. Namespaces are
context-dependent, to take one major point where that idea breaks down.

Boris Kolpackov · Sep 8, 2006

Kza said:
Feature 1: A progress display. I have tried a few times now to find a
way of asking Xerces how far through a file it is in bytes, but no
luck. (I did try a per-element check, but that involves a whole extra
parse at the start just to count the elements.) I have tried using the
LocalFileInputSource, getting its BinInputStream and calling its
curPos, but it's always 0.

Any ideas?

You can implement your own InputStream which will keep track of how
much data Xerces-C++ has consumed so far. Combine this with the total
length of the file and you can calculate the progress.

Feature 2: Loading only a "screenful" of the file at a time. I also
would like some sort of random access functionality, so if the user
scrolls down to 75% of the file, the parser skips forward to that
position and starts reading there, and when they scroll back up it goes
up and reads just that little bit of the file.

This one would definitely be easier with an in-memory model (e.g., DOM).

hth,
-boris

Kza · Sep 8, 2006

Just as an update here, and I hope top posting is de rigueur for this
newsgroup,

I solved feature one with the Xerces getSrcOffset() method, even though I
had to wrap it with an exception catcher: the particular version we
are using at work at the moment throws an exception when parsing is
finished (but before the parse method returns), and there's no other way
to find out when it's finished.

Feature 2 I don't have a solution for at the moment. DOM is not an
option, as the whole point is that a whole file uses up too much memory
and DOM loads the whole thing at once; that's why we wanted to load in a
section at a time.

If it turns out to be really important to analyse large files, I will just
have to write a separate program that uses SAX, and maybe only filters
for certain things, or perhaps reparses when people want to "scroll up",
which trades time for memory. It's up to the
customers really. I suspect the real solution is a non-XML indexed
binary format. But the memory issue isn't actually as big as the
customers think it is. I will work something out.
