Brian Candler
I plan to parse a huge XML document (too big to fit into RAM) using a
stream parser. I can divide the stream into logical chunks which can be
processed individually. If a particular chunk fails, I want to append it
to an output XML file, which will contain all the failed chunks, and can
be patched up and retried.
To do this, I want to be able to regenerate the XML of the failed chunk,
preferably identical to how it was seen.
The options I can think of are:
1. A stream parser which gives me the raw XML alongside each parsed
item; I can concatenate the raw XML into a string.
2. A stream parser which gives me the byte position of the current node,
so I can seek back within the file to fetch the XML again.
3. A stream parser which gives me events to identify the different parts
of XML, together with an inverse process to which I can replay the
events and get the XML back again.
Playing with REXML StreamListener, I can get a series of method calls
like tag_start(...) and tag_end(...), and I can collect these into an
array; is there existing code which would let me replay that array and
recreate the XML? Any other options I should be looking at?
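To make option 3 concrete, here is a rough sketch of what I have in
mind. The names EventRecorder, escape_text, escape_attr and replay are
made up for illustration and are not part of REXML; only the
StreamListener callbacks (tag_start, tag_end, text) and
REXML::Document.parse_stream come from the library. It assumes the
listener receives text and attribute values with entities already
decoded, so they are escaped again on the way out. It ignores comments,
CDATA, processing instructions and the original quoting and whitespace,
so the result should be equivalent XML rather than byte-identical (for
example, <a/> would come back as <a></a>).

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical helper: records stream events as plain arrays for later replay.
class EventRecorder
  include REXML::StreamListener

  attr_reader :events

  def initialize
    @events = []
  end

  def tag_start(name, attrs)
    @events << [:tag_start, name, attrs]
  end

  def tag_end(name)
    @events << [:tag_end, name]
  end

  def text(text)
    @events << [:text, text]
  end
end

# Hypothetical helper: re-escape the characters that are special in text.
def escape_text(str)
  str.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;')
end

# Hypothetical helper: as above, plus the double quote used around values.
def escape_attr(str)
  escape_text(str).gsub('"', '&quot;')
end

# Hypothetical inverse process: turn the recorded events back into XML.
def replay(events)
  events.map do |type, *args|
    case type
    when :tag_start
      name, attrs = args
      attr_str = attrs.map { |k, v| %( #{k}="#{escape_attr(v)}") }.join
      "<#{name}#{attr_str}>"
    when :tag_end
      "</#{args[0]}>"
    when :text
      escape_text(args[0])
    end
  end.join
end

xml = '<chunk id="1"><name>a &amp; b</name></chunk>'
recorder = EventRecorder.new
REXML::Document.parse_stream(xml, recorder)
puts replay(recorder.events)
```

In real use the recorder would be reset at each chunk boundary, and
replay would only be called on the events of a chunk that failed.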
Thanks,
Brian.