Inverse of stream parser

Brian Candler

I plan to parse a huge XML document (too big to fit into RAM) using a
stream parser. I can divide the stream into logical chunks which can be
processed individually. If a particular chunk fails, I want to append it
to an output XML file, which will contain all the failed chunks, and can
be patched up and retried.

To do this, I want to be able to regenerate the XML of the failed chunk,
preferably identical to how it was seen.

The options I can think of are:

1. A stream parser which gives me the raw XML alongside each parsed
item; I can concatenate the raw XML into a string.

2. A stream parser which gives me the byte position of the current node,
so I can seek back within the file to fetch the XML again.

3. A stream parser which gives me events to identify the different parts
of XML, together with an inverse process to which I can replay the
events and get the XML back again.

Playing with REXML StreamListener, I can get a series of method calls
like tag_start(...) and tag_end(...), and I can collect these into an
array; is there existing code which would let me squirt that array and
recreate the XML? Any other options I should be looking at?
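
To make option 3 concrete, here is a minimal sketch of the record-and-replay idea. It assumes REXML's StreamListener callbacks tag_start, tag_end and text; the EventRecorder class and the replay and xml_escape helpers are hypothetical names made up for this illustration, and comments, CDATA and processing instructions are ignored.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: stores each stream event as an array.
class EventRecorder
  include REXML::StreamListener

  attr_reader :events

  def initialize
    @events = []
  end

  def tag_start(name, attrs)
    @events << [:tag_start, name, attrs]
  end

  def tag_end(name)
    @events << [:tag_end, name]
  end

  def text(value)
    @events << [:text, value]
  end
end

XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# The listener receives decoded text, so it is escaped again on output.
def xml_escape(string)
  string.gsub(/[&<>"]/, XML_ESCAPES)
end

# Hypothetical inverse: turns the recorded events back into an XML string.
def replay(events)
  events.map do |type, *args|
    case type
    when :tag_start
      name, attrs = args
      attr_text = attrs.map { |key, value| %( #{key}="#{xml_escape(value)}") }.join
      "<#{name}#{attr_text}>"
    when :tag_end
      "</#{args.first}>"
    when :text
      xml_escape(args.first)
    end
  end.join
end

recorder = EventRecorder.new
REXML::Document.parse_stream('<item id="1">a &amp; b<sub/></item>', recorder)
puts replay(recorder.events)
```

The output is equivalent XML but not byte-identical: an empty element such as <sub/> comes back as a start/end pair, and the original quoting and entity choices are not preserved.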

Thanks,

Brian.
 
Caleb Clausen

From my experience, REXML is far too wimpy to deal with data on this
scale. (Among other things, it was too slow.) I suggest using the
'stream parser' (a misnomer, this is really a lexer) in libxml
instead. I don't know for sure if it can reconstruct the original text
the way you want, but that should be possible.

I think the class you'd want is LibXML::XML::SaxParser. See
http://libxml.rubyforge.org/.
 
John W Higgins

Morning,

Brian said:
I plan to parse a huge XML document (too big to fit into RAM) using a
stream parser. I can divide the stream into logical chunks which can be
processed individually. If a particular chunk fails, I want to append it
to an output XML file, which will contain all the failed chunks, and can
be patched up and retried.

If you aren't completely against Perl, XML-Twig [1] has a tool called
xml_split [2] which works rather well at splitting XML files. You might wish
to split your input into smaller files before you even begin processing;
then, if a file fails to process, you already have that file in hand. When
finished, you could smash the failed files back together using
xml_merge [3] from the same Perl package.

If there is some Ruby variant of this I couldn't locate it, but that never
means much :)

John

[1] - http://search.cpan.org/~mirod/XML-Twig-3.34/
[2] - http://search.cpan.org/~mirod/XML-Twig-3.34/tools/xml_split/xml_split
[3] - http://search.cpan.org/~mirod/XML-Twig-3.34/tools/xml_merge/xml_merge
 
Robert Dober

Would you care to use JRuby?
That would give you access to top XML stream parsers IIRC ;)
Just as an example: org.apache.xerces.parsers.SAXParser seems very
suited for your purpose. Constructing your XML fragments takes a little
bit of work, but it should be rather easy.

HTH
R.
 
Brian Candler

Robert said:
Would you care to use JRuby?

I don't mind which stream parser, but Java is out :)

Since this is a bit of disposable code, I've decided to cheat. I
pretty-print the XML, then I can read it line-at-a-time using gets into
a buffer, identify a range of lines which forms a chunk, then parse the
buffer. On error I write out the buffer again.
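
A rough sketch of that line-at-a-time approach, under some loud assumptions: the input is already pretty-printed so that every chunk opens and closes on its own line, chunks are not nested, and the chunk element is called record. The element name and the each_chunk helper are hypothetical, chosen only for this example.

```ruby
require 'rexml/document'
require 'stringio'

# Buffers one chunk at a time using gets, parses the buffer, and yields the
# chunk's root element. If parsing or the caller's block raises, the raw
# buffer is appended to `failed` exactly as it was read.
def each_chunk(input, failed, tag: 'record')
  open_re  = /\A\s*<#{Regexp.escape(tag)}[\s>]/
  close_re = %r{\A\s*</#{Regexp.escape(tag)}>\s*\z}
  buffer = nil
  while (line = input.gets)
    buffer = +'' if buffer.nil? && line =~ open_re
    next if buffer.nil?            # line is outside any chunk
    buffer << line
    next unless line =~ close_re   # chunk not complete yet
    begin
      yield REXML::Document.new(buffer).root
    rescue StandardError
      failed << buffer
    end
    buffer = nil
  end
end

xml = <<~XML
  <records>
    <record id="1">
      <value>10</value>
    </record>
    <record id="2">
      <value>oops</value>
    </record>
  </records>
XML

failed = StringIO.new
seen = []
each_chunk(StringIO.new(xml), failed) do |record|
  # Integer() raises on "oops", which sends the second chunk to `failed`.
  seen << Integer(record.elements['value'].text)
end
puts failed.string
```

Only one chunk is held in memory at a time, and the failed chunks keep their original bytes because the buffer is written back untouched. The failed output still needs a wrapping root element before it can be fed through again.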

Thanks for all your suggestions.
 
Robert Dober

And to add insult to injury, by interfacing J*** ;) with JRuby you do
not even see Java, you see a Ruby API.
( Just wanted to be clear about this )
R.
 
Florian Gilcher

Robert said:
And to add insult to injury, by interfacing J*** ;) with JRuby you do
not even see Java, you see a Ruby API.
( Just wanted to be clear about this )

Just to be clear, too: By interfacing Java with JRuby, you get a Ruby
API that feels like it's written by a Java consultant struggling on his
first steps to learn Ruby.

While I am impressed how well the integration of JRuby into Java works,
Java libraries without a handwritten layer above them still feel very
alien. So, you do see Java - a lot, actually.

Regards,
Florian
 
Robert Dober

Florian said:
Just to be clear, too: By interfacing Java with JRuby, you get a Ruby API that feels like it's written by a Java consultant struggling on his first steps to learn Ruby.

While I am impressed how well the integration of JRuby into Java works, Java libraries without a handwritten layer above them still feel very alien. So, you do see Java - a lot, actually.

Agreed. I was putting my bold statement to the test: when calling into
Java you need to honor the Java type checks, and there are no block
parameters.
Thus there remains a lot of work to be done to adapt a given API to be
"rubyish". My bad.
R.
 
James Britt

Florian said:
Just to be clear, too: By interfacing Java with JRuby, you get a Ruby API that feels like it's written by a Java consultant struggling on his first steps to learn Ruby.

While I am impressed how well the integration of JRuby into Java works, Java libraries without a handwritten layer above them still feel very alien.

Often true. However, the range of fast, reliable libraries is much
greater for Java than for Ruby.

Don't spite yourself.

--
James Britt

www.jamesbritt.com - Playing with Better Toys
www.ruby-doc.org - Ruby Help & Documentation
www.rubystuff.com - The Ruby Store for Ruby Stuff
www.neurogami.com - Smart application development
 
Charles Oliver Nutter

Florian said:
Just to be clear, too: By interfacing Java with JRuby, you get a Ruby API that feels like it's written by a Java consultant struggling on his first steps to learn Ruby.

I don't know a lot of struggling Java consultants that have released
Java libraries used on a wide scale. In fact, I don't know any
struggling Java consultants that have released libraries, period.
Maybe the APIs would be better if they did.

I think you're overstating the problem. Many Java libraries are
overdesigned, this is true. But JRuby does more than just provide a
means to call them; it provides a lot of other niceties, like passing
a block or arbitrary object as the implementation of an interface and
not having to convert or cast values all over.

I also don't think it's a whole lot better when people write C
extensions that just wrap a raw C API. If anything, C APIs are usually
*underdesigned*, and it becomes a mess just to fit them nicely into an
OO language. The truth is that just providing the ability to call from
Ruby a library written in C or Java isn't always enough; but it's a
hell of a lot easier to start with the Java library in JRuby, since
you don't even have to compile anything.

- Charlie
 
