Xerces advanced usage - progress, random access, etc.

Kza

Hi, I am currently using the Xerces SAX parser for C++ (I use DOM too, but
I think SAX is more relevant here) to process and display fairly
large XML files. Usually I give Xerces a filename, it parses the file, and
that's all good. But the customer needs more features.

Feature 1: A progress display. I have tried a few times now to find a
way of asking Xerces how far through a file it is in bytes, but no
luck. (I did try a per-element check, but that involves a whole extra
parse at the start just to count the elements.) I have tried using the
LocalFileInputSource, getting its BinInputStream and calling its
curPos(), but it's always 0.

Any ideas?

Feature 2: Loading only a "screenful" of the file at a time. I also
would like some sort of random access functionality, so if the user
scrolls down to 75% of the file, the parser skips forward to that
position and starts reading there, and when they scroll back up it goes
up and reads just that little bit of the file.

I am pretty sure feature 1 is possible with normal Xerces SAX, but I
have no idea how. The documentation is very sparse: it names the
functions but does not actually say what they do or how they should
be used.

Feature 2 might be more complicated. A colleague mentioned some
other "object models" like xparse and xalaron (not sure how that's
pronounced or spelt), some Apache project that parses XML in a random
access fashion.

Anyone got any ideas?

Thanks a lot.
 
Joe Kesselman

Kza said:
Feature 1: A progress display.

The SAX APIs can be persuaded to give line/column information, though
unless you know how many lines there were in the file before you started
parsing it, that doesn't do you any good. Look at the Locator API.

The DOM assumes reading the file is a single operation, so the concept
of getting incremental details doesn't make much sense. You *could* plug
in a stream filter between wherever the file is being read from and the
parser, and set up that filter so it counts characters going by --
that's going to give you only a very rough progress indication, and
again it requires that you know the length before you start if you want
to report it as a percentage-complete number.

Feature 2: Loading only a "screenful" of the file at a time.

"Screenful" is not defined in XML. Nor is starting parse from the middle
of a file. You could try to do something with incremental processing,
via throttling of a SAX stream -- I've done that in the past -- but
keeping track of when enough has been read to fill a screen and when
more would have to be read to fill the next screen is very much an
application problem rather than a parser problem.

Random-access to an XML model isn't a problem -- the DOM can do that,
though again it isn't designed to operate on screenfuls -- but
random-order parsing really doesn't make sense. Namespaces are
context-dependent, to take one major point where that idea breaks down.
 
Boris Kolpackov

Kza said:
Feature 1: A progress display. I have tried a few times now to find a
way of asking xerces how far through a file it is in bytes, but no
luck. (I did try a per element check, but that involves a whole extra
parse at the start just to count the elements). I have tried using the
LocalFileInputSource, getting its BinInputStream and calling its
curPos(), but it's always 0.

Any ideas?

You can implement your own input stream (a BinInputStream, handed to the
parser by your own InputSource) which will keep track of how much data
Xerces-C++ has consumed so far. Combine this with the total length of the
file and you can calculate the progress.
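
A minimal sketch of that counting idea, written without Xerces so it compiles on its own. CountingStream and demoPercentAfter() are hypothetical names; with Xerces-C++ the same logic would presumably live in a class derived from BinInputStream, whose readBytes() and curPos() methods this stand-in mirrors. Treat it as an illustration of the technique rather than a drop-in class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical stand-in for a parser-facing byte stream. It forwards
// reads to an underlying std::istream and counts the bytes handed out.
class CountingStream {
public:
    CountingStream(std::istream& in, std::uint64_t totalBytes)
        : in_(in), total_(totalBytes), consumed_(0) {}

    // Fill toFill with up to maxToRead bytes; return how many were read.
    std::size_t readBytes(char* toFill, std::size_t maxToRead) {
        assert(toFill != nullptr);
        in_.read(toFill, static_cast<std::streamsize>(maxToRead));
        const std::size_t got = static_cast<std::size_t>(in_.gcount());
        consumed_ += got;
        return got;
    }

    // Bytes consumed so far.
    std::uint64_t curPos() const { return consumed_; }

    // Progress in whole percent, clamped to 0..100.
    // An empty input is reported as complete.
    int percentDone() const {
        if (total_ == 0) return 100;
        const std::uint64_t pct = consumed_ * 100 / total_;
        return pct > 100 ? 100 : static_cast<int>(pct);
    }

private:
    std::istream& in_;
    std::uint64_t total_;
    std::uint64_t consumed_;
};

// Demo helper: read `chunks` blocks of 256 bytes from a 1000-byte
// in-memory document and report the resulting percentage.
int demoPercentAfter(int chunks) {
    std::istringstream src(std::string(1000, 'x'));
    CountingStream cs(src, 1000);
    char buf[256];
    for (int i = 0; i < chunks; ++i) cs.readBytes(buf, sizeof buf);
    return cs.percentDone();
}
```

Note that a parser usually reads ahead in blocks, so the figure tracks bytes fetched rather than bytes actually parsed; for a progress bar that is normally close enough.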

Feature 2: Loading only a "screenful" of the file at a time. I also
would like some sort of random access functionality, so if the user
scrolls down to 75% of the file, the parser skips forward to that
position and starts reading there, and when they scroll back up it goes
up and reads just that little bit of the file.

This one would definitely be easier with an in-memory model (e.g., DOM).


hth,
-boris
 
Kza

Just as an update here, and I hope top posting is de rigueur for this
newsgroup,

I solved feature 1 with Xerces' getSrcOffset() method, although I
had to wrap it in an exception handler: the particular version we
are using at work at the moment throws an exception once parsing is
finished (but before the parse method returns), and there's no other way
to find out when it has finished.
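
For anyone finding this later, here is a rough sketch of the exception wrapper, kept free of Xerces so it stands alone. GuardedOffset is a hypothetical name, and the callable passed in is a stand-in for a call to the parser's getSrcOffset(); how and when that call throws is an assumption based on the behaviour described above, not something checked against every Xerces version.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <utility>

// Remembers the last offset that was read successfully, so a failing
// query near the end of the parse does not break the progress display.
class GuardedOffset {
public:
    explicit GuardedOffset(std::function<std::uint64_t()> query)
        : query_(std::move(query)), last_(0) {
        assert(static_cast<bool>(query_));  // an empty callable is a bug
    }

    // Return the current offset, or the last known one if the query throws.
    std::uint64_t get() {
        try {
            last_ = query_();
        } catch (...) {
            // Assumption: an exception here means the parser has already
            // finished with its input, so the last known offset is kept.
        }
        return last_;
    }

private:
    std::function<std::uint64_t()> query_;
    std::uint64_t last_;
};
```

In real use the lambda would capture the parser and return its offset, and get() would be polled from a handler callback or between the steps of a progressive parse.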

For feature 2 I don't have a solution at the moment. DOM is not an
option, as the whole point is that a whole file uses up too much memory
and DOM loads the whole thing at once. That's why we wanted to load
a section at a time.

If it turns out to be really important to analyse large files, I will just
have to write a separate program that uses SAX and maybe only filters
for certain things, or perhaps reparses when people want to "scroll up",
which trades extra time for the memory saved. It's up to the
customers really. I suspect the real solution is a non-XML indexed
binary format. But the memory issue isn't actually as big as the
customers think it is. I will work something out.
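
The reparse option could look roughly like the sketch below. WindowFilter is a hypothetical helper, not part of Xerces: the idea is that a SAX handler calls onRecord() once per record element, only the records inside the visible window are kept, and parsing is abandoned as soon as the window is full. Scrolling up means starting a fresh parse with a smaller `first`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Keeps only records [first, first + count) out of a forward-only stream.
class WindowFilter {
public:
    WindowFilter(std::size_t first, std::size_t count)
        : first_(first), count_(count), seen_(0) {
        assert(count > 0);  // an empty window would never show anything
    }

    // Call once per record, in document order. Returns false once the
    // window is full, so the caller can stop parsing early.
    bool onRecord(const std::string& text) {
        if (seen_ >= first_ && rows_.size() < count_) rows_.push_back(text);
        ++seen_;
        return seen_ < first_ + count_;
    }

    // Records currently held for display.
    const std::vector<std::string>& rows() const { return rows_; }

    // Number of records examined so far.
    std::size_t seen() const { return seen_; }

private:
    std::size_t first_;
    std::size_t count_;
    std::size_t seen_;
    std::vector<std::string> rows_;
};
```

Memory stays bounded by the window size, but every jump costs a parse from the start of the file up to the end of the window, which is the time trade-off mentioned above.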
 
