scott.david.brown
So I have used DOM for some time to parse my XML documents. But I have
arrived at a point where my document is just too big to want to use
DOM. So I am experimenting with SAX (technically, SAX2). But I have
run into a conceptual issue that I just can't get around regarding the
writing of a content handler. In my document, I have a lot of
different elements (many element names) and the element tree can get
very deep (many levels of nested elements). Furthermore, I really
have a few conceptual blocks and sub-blocks of XML and I might want to
have the ability to parse a file that has just one of these blocks.
All the SAX examples are very simplistic and show how to make a
handler that deals with (1) a very small number of element types
(small number of element names) and (2) very shallow element trees.
If I extend this approach for my application, I end up trying to
create a single handler for the entire document which becomes
hideous. In short, it would be nice to make a set of objects that
handle various parts of the file. Then I can re-use those objects to
parse these blocks as part of a single file or as a file that only
contains the block. What confuses me is this: I can make a set of
objects for different content blocks, but how do I use them?
One of the ways I could see this happening is to simply change the
handler as I parse. When I see element "A" in startElement, I could
change the handler object to the one specialized for this "A" content,
and when I see it again in endElement I can switch it back to the
previous handler. However, this is where I find the SAX
documentation confusing (technically, the Xerces-C++ implementation,
but I looked at the JavaDocs too). How do I get the "handle" to the
current XML stream being processed? There is an InputSource object
that abstracts the source of the XML content, but I can't figure out
how to get it from within startElement. Furthermore, I have no idea
if that object has all the information about that current parsing
state. For example, does it know where the parser is currently
processing? If I feed it to the parser to handle this next block,
would it know where to pick up the work? And how do I get my hands on
the parser object to change the handler object from within a
handler's startElement function? When I make my handler object and
set it to be the handler for the parser, do I just need to store a
reference to the InputSource and parser in my handler (as member
variables) so I have access to them later? If I change the handler
while parsing, does it do what I expect?
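
To make the idea concrete, here is a minimal sketch of the handler switching I have in mind. It is written in Java against the JDK's built-in SAX2 classes only because that is easy to run as-is; I believe Xerces-C++ offers the analogous SAX2XMLReader::setContentHandler, but I have not verified that it behaves the same way. The class names (HandlerSwitchDemo, RootHandler, BlockAHandler) and the event log are hypothetical, invented for illustration. The sketch assumes that the reader starts using a newly registered handler immediately, which is what the Javadoc for XMLReader.setContentHandler states. Each handler simply stores a reference to the reader; the InputSource is never touched, on the assumption that the reader itself keeps track of where it is in the stream.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class HandlerSwitchDemo {

    /** Hypothetical handler that only knows about the content of an "A" block. */
    static class BlockAHandler extends DefaultHandler {
        private final XMLReader reader;      // stored so we can switch handlers
        private final ContentHandler parent; // handler to restore when the block ends
        private final List<String> log;
        private int depth = 0;               // open elements below the "A" that started us

        BlockAHandler(XMLReader reader, ContentHandler parent, List<String> log) {
            this.reader = reader;
            this.parent = parent;
            this.log = log;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            depth++;
            log.add("A-handler start " + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depth == 0) {
                // This end tag closes the "A" that activated us: hand control back.
                log.add("A-handler done");
                reader.setContentHandler(parent);
                return;
            }
            depth--;
            log.add("A-handler end " + qName);
        }
    }

    /** Hypothetical top-level handler that delegates "A" blocks. */
    static class RootHandler extends DefaultHandler {
        private final XMLReader reader;
        private final List<String> log;

        RootHandler(XMLReader reader, List<String> log) {
            this.reader = reader;
            this.log = log;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("A".equals(qName)) {
                log.add("root sees A; switching handler");
                // Swap the handler in the middle of the parse.
                reader.setContentHandler(new BlockAHandler(reader, this, log));
            } else {
                log.add("root start " + qName);
            }
        }
    }

    /** Parses the given XML text and returns the log of handler events. */
    public static List<String> parse(String xml) {
        List<String> log = new ArrayList<>();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new RootHandler(reader, log));
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String line : parse("<doc><A><x/><A/></A><B/></doc>")) {
            System.out.println(line);
        }
    }
}
```

The depth counter is there so that a nested "A" inside the block does not hand control back too early. Because BlockAHandler knows nothing about RootHandler beyond the ContentHandler interface, the same object could in principle be registered directly as the top-level handler for a file that contains only the block, though in that case it would need to treat the first start tag as the opening of its own block.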
I have spent quite some time looking for discussions of how to scale
SAX to these types of problems and I haven't had much luck. So I am
hoping to create some discussion here.