How to scale and/or "object orient" SAX parsing for big files?

S

scott.david.brown

So I have used DOM for sometime to parse my XML documents. But I have
arrived at a point where my document is just too big to want to use
DOM. So I am experimenting with SAX (technically, SAX2). But I have
run into a conceptual issue that I just can't get around regarding the
writing of a content handler. In my document, I have a lot of
different elements (many element names) and the element tree can get
very deep (many levels of nested elements). Furthermore, I really
have a few conceptual blocks and sub-blocks of XML and I might want to
have the ability to parse a file that has just one of these blocks.

All the SAX examples are very simplistic and show how to make a
handler that deals with (1) a very small number of element types
(small number of element names) and (2) very shallow element trees.
If I extend this approach for my application, I end up trying to
create a single handler for the entire document which becomes
hideous. In short, it would be nice to make a set of objects that
handle various parts of the file. Then I can re-use those objects to
parse these blocks as part of a single file or as a file that only
contains the block. What confuses me is this: I can make a set of
objects for different content blocks, but how to I use them?

One of the ways I could see this happening is to simply change the
handler as I parse. When I see element "A" in startElement, I could
change the handler object to the one specialized for this "A" content,
and when I see it again in endElement I can switch it back to the
previous handler. However, these is where I find the SAX
documentation confusing (technically, the Xerces-C++ implementation,
but I looked at the JavaDocs too). How do I get the "handle" to the
current XML stream being processed? There is an InputSource object
that abstracts the source of the XML content, but I can't figure out
how to get it from within startElement. Furthermore, I have no idea
if that object has all the information about that current parsing
state. For example, does it know where the parser is currently
processing? If I feed it to my the parser to handle this next block,
would it know where to pick up the work? And how to I get my hands on
the parser object to change the handler object from withing a
handler's startElement function? When I make my handler object and
set to be the handler for the parser, do I just need to store a
reference to the InputSource and parser in my handler (as member
variables) so I have access to them later? If I change the handler
while parsing, does it do what I expect?

I have spent quite some time looking for discussions for how to scale
SAX to these types of problems and I haven't had much luck. So I am
hoping to create some discussion here.
 
J

Joseph Kesselman

All the SAX examples are very simplistic and show how to make a
handler that deals with (1) a very small number of element types
(small number of element names) and (2) very shallow element trees.
If I extend this approach for my application, I end up trying to
create a single handler for the entire document which becomes
hideous.

Depth of the tree shouldn't matter to the handler itself, since the
handler is only processing one event at a time -- no recursion issues.

Complexity of the XML language certainly affects complexity of the
handler, and you may indeed want to break it up into sub-handlers,
switching off which one is being dispatched to as you parse through the
document. In my experience this has rarely been done unless the
sub-handler represents another namespace -- another XML-based language
-- which has been plugged into this point of the document and which
wants to be written as a reusable module, but it might indeed yield
faster processing if you only have to dispatch the events actually
expected at this point in the document.

(Of course that requires that you know the exact structure of the
expected document. I believe IBM has a patent or two regarding
automatically high-performance parsers for known schemas, though those
go beyond simply tuning the handlers.)
In short, it would be nice to make a set of objects that
handle various parts of the file. Then I can re-use those objects to
parse these blocks as part of a single file or as a file that only
contains the block. What confuses me is this: I can make a set of
objects for different content blocks, but how to I use them?

Each one has to know how to hand off handling of its contents to the
appropriate sub-handler. Maintain a stack so the sub-handler can return
control to its caller. This is much like handling event streams in other
event-driven models -- or much like handling token streams in a
traditional (lexx/yacc) parser.
> How do I get the "handle" to the
current XML stream being processed?

Actually, you don't need the stream; you need the parser so you can tell
it which handler to call. I don't think you can retrieve that from the
SAX event; you have to track it in your application code.
For example, does it know where the parser is currently
processing?

That's the parser's problem, not yours.

If you're feeling paranoid about how the parser will response to having
the handler changed in mid-parse -- though this should work in a
properly implemented SAX parser -- the workaround would be to implement
your own "stub" handler which dispatches to other handlers, and redirect
it as neeeded. Clearly, changing that delegation pointer is a
well-defined/well-contained operation.
 
U

usenet

So I have used DOM for sometime to parse my XML documents.  But I have
arrived at a point where my document is just too big to want to use
DOM.  So I am experimenting with SAX (technically, SAX2). But I have
run into a conceptual issue that I just can't get around regarding the
writing of a content handler.  In my document, I have a lot of
different elements (many element names) and the element tree can get
very deep (many levels of nested elements).  Furthermore, I really
have a few conceptual blocks and sub-blocks of XML and I might want to
have the ability to parse a file that has just one of these blocks.
...

If you're running out of steam using DOM, but SAX is proving
difficult, one option might be to consider data binding. This
converts XML documents into programming language specific objects,
such as (in our case) C++, Java or some other.

How much smaller the in memory usage is depends on your XML, but could
easily be a quarter or less than the DOM size.

You do need an XML schema for your data, and it may not be appropriate
for huge, Huge, HUGE documents, but I'll let you judge that.

You can find out more from our web site at:

http://www.codalogic.com/lmx/

or contact me directly.

HTH,

Pete Cordell
Codalogic
Visit http://www.codalogic.com/lmx/ for XML C++ data binding
 
S

scott.david.brown

Depth of the tree shouldn't matter to the handler itself, since the
handler is only processing one event at a time -- no recursion issues.
Yeah, I didn't make that very clear. Yes, my handler only addresses a
single event at a time. I'm not trying to peak forward or anything
bad like
that.

What I was trying to say was that when I have an element tree that is
maybe levels 10 deep and has 10 unique element types at each level
than I have 100 possible element types to worry about and my
startElement gets messy. If I have a single handler for my file, then
startElement has 100 possible elements to look through for a match
because there are 100 different element names that could be
encountered.
Does that make sense, or am I really missing something? Because
I really feel like I am. Using a single handler for a whole file
means
(to me) that it needs to handle every possible element type with
little means to take advantage of the fact that I know a given element
will only have a select set of sub-elements. I guess startElement
could call a series of functions based on what element I am within,
but
it just doesn't scale up very well at all.

What I think I want to do is break that up into a set of handlers that
work
on specific parts of the tree. As a result, each handler will be
smaller and
easier to write, test and maintain.

But, please, please, feel free to set me straight. I feel like I must
be
missing something when people start to parse real documents that
have something other that 1-2 levels of elements and only a handful
of possible elements. I've looked at the specs for the XML office
documents. There is no way they are using a single handler for the
1000+ different element types in those files.

I'm working on a prototype tonight that uses multiple handlers, tracks
the
handler stack (actually, each handler just stores a pointer to the
prior
handler) and changes the handler in the parser as it goes along. I
will
keep you posted.

Thanks for your comments.
 
S

scott.david.brown

I think my design experiment is a success.

I made a new interface class that will be used for all my content
handlers. It takes the parser and the previous handler as a
constructor argument. When I want to transition from my current
handler to a new handler (because I found an element that has a
specialized handler), I create the new handler and pass the parser
(which is already stored in the current handler) and the current
handler as the previous handler to this new handler. I then set the
current handler in the parser (again, which is already stored in the
current handler) to my new handler. When that handler sees the
endElement for itself, it simply sets the handler in the parser back
to the previous handler. Simply having the previous handler stored in
the current handler avoids having to explicitly make a complicated
external stack of handlers. The handlers push themselves and pop
themselves simply by changing the current handler systematically.

The nice thing is that I can now build up a set of handlers that can
be easily reused for various sections of a large XML file, and each
handler is specialized for a much smaller number of elements (the set
of elements specific to that object being described). In this case,
my document was a description of a bunch of geometry. So I have
handlers for basic boxes, spheres, etc. These are potentially
embedded within other geometric objects in my XML, etc. But rather
than one handler that addresses every possible element, I have a nice
hierarchy of handlers that work with each other.

I don't think this design is that "dirty" or "shameful", but I would
still take comments from people if they have a better approach.
 
V

Victor Anyakin

I think my design experiment is a success.

I made a new interface class that will be used for all my content
handlers. It takes the parser and the previous handler as a
constructor argument. When I want to transition from my current
handler to a new handler (because I found an element that has a
specialized handler), I create the new handler and pass the parser
(which is already stored in the current handler) and the current
handler as the previous handler to this new handler. I then set the
current handler in the parser (again, which is already stored in the
current handler) to my new handler. When that handler sees the
endElement for itself, it simply sets the handler in the parser back
to the previous handler. Simply having the previous handler stored in
the current handler avoids having to explicitly make a complicated
external stack of handlers. The handlers push themselves and pop
themselves simply by changing the current handler systematically.

This sound pretty much like a state machine (or a finite automata) or
something like this, but very similar.

Why not make a "global" state variable, and in the handler something
like a "switch" or a "cond" block acting appropriately, choosing the
proper handler function?

With best regards,
Victor
 
S

scott.david.brown

This sound pretty much like a state machine (or a finite automata) or something like this, but very similar.

Why not make a "global" state variable, and in the handler something like a "switch" or a "cond" block acting appropriately, choosing the proper handler function?
You are right, it really is a state machine. If I understand your
suggestion correctly, I think there are two possible problems for my
application.

First has to do with the nature of SAX -- it's event driven. So if I
see the start of an element called "string" I need to know what
element I am already in, so I know what to do with that string. I
think this is the #1 item that the SAX examples ignore. In really big
XML documents, chances are you have elements with the same names that
appear in different parts of the tree. So when you see an element
with that name you can't just do something directly with it. You need
to know if that element named "string" is a sub-element of something
called "firstname" or something called "lastname" so you know what to
do with it. Sure, you could make your XML element names more unique,
but in very large and complicated XML documents (where SAX should work
better with than DOM) element name collisions will happen. And to try
and avoid them places an needless burden on your freedom to design the
element structure.

The second is, again, related to scale. I'm working on a large
document with 1,000s of different element types. You can't make a
switch with 1,000 cases. However, some of my lower level handlers do
take your approach (of a "global" status). I have a private
enumerated type in the handler that tracks what I am currently parsing
so we know what to do when we see certain elements. This is the one
thing that boggles my mind. I really don't see how someone could
scale SAX except for the way I am doing it now.

In hindsight, the amount of code I must write to use SAX is very
frustrating. I tried SAX because I was looking forward to the
potential performance over DOM for my large document. However, I
think I might go back to DOM because this seems to be so poor at
scaling.

Isn't there anyone out there that can tell us why SAX is supposed to
be so awesome? I'm not seeing it!
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,989
Messages
2,570,207
Members
46,783
Latest member
RickeyDort

Latest Threads

Top