Stream Parsing with REXML

M

Marc Hoeppner

Hi (again, sort of) :)

I am still on my quest to write a program that parses a large XML file.
After having tried to do it in tree mode, I had to realize that the
performance was simply abysmal. So back to the drawing board. But, and
here is the thing...I could find a good straight-forward tutorial on how
to write a stream parser using REXML. The official tutorial is pretty
much mute on that part and the only other example I found (or rather was
pointed to -
http://www.janvereecken.com/2007/4/11/event-driven-xml-parser-in-ruby)
was way too complex for someone like me who is still pretty much a
beginner in ruby.

So, what I am looking for is either a brief description of how to write
an event driven parser or else a link to a good and simple tutorial.

For the former, this is what the parser should do:

Find the element "Gene-ref", allow me to access its children and then
close and repeat for the next "Gene-ref entry. In xml code, that would
look like

<something here>
<Gene-ref>
<name>...</name>
<start>...</start>
<end>...</end>
</Gen-ref>
<Gen-ref>
...

I understand that I need a Listener class, like
classListener
def tag_start(name, attrs)
end
def text(text)
end
def tag_end(name)
end
end

But I havent really worked with classes all that much and maybe someone
could just put down the basics for the script from where I can start
experimenting? Would be very much appreciated. Let's say for each
element "Gene-ref" I want to puts the name, start and end in one line,
or something along those lines.

Cheers,

Marc
 
B

Bob Hutchison

Hi Marc and Dusty,

I have found that hpricot works very well with large xml files. It
streams it automatically. I use it to parse a 5M xml file and it does
it rather quickly.

Dusty, you sure hpricot is streaming?

At any rate, it is a lot faster than any pure Ruby XML parser. Marc,
if performance is your only problem hpricot is worth looking at. It'll
only take a few seconds to know, it's something like 4 lines of Ruby
to find out:

require 'rubygems'
require 'hpricot'

doc = open(<your file name here>) do |f|
Hpricot.XML(f)
end

If that works well enough you're probably set. Hpricot's search and
query stuff is pretty nice.

If you still have problems, let us know. If you really need streaming
you have a few options.


Cheers,
Bob
 
P

Phrogz

Hi (again, sort of) :)

I am still on my quest to write a program that parses a large XML file.
After having tried to do it in tree mode, I had to realize that the
performance was simply abysmal. So back to the drawing board. But, and
here is the thing...I could find a good straight-forward tutorial on how
to write a stream parser using REXML. The official tutorial is pretty
much mute on that part and the only other example I found (or rather was
pointed to -http://www.janvereecken.com/2007/4/11/event-driven-xml-parser-in-ruby)
was way too complex for someone like me who is still pretty much a
beginner in ruby.

So, what I am looking for is either a brief description of how to write
an event driven parser or else a link to a good and simple tutorial.

xml = <<ENDXML
oh hai
<root>
<kid1 foo="bar" jim="jam" />
<kid2 foo="sausage">
<grandkid1>I ate &lt;raw&gt; sausages&tm;!</grandkid1>
<grandkid2>
<![CDATA[
I ate <raw> sausages&tm;!
]]>
</grandkid2>
</kid2>
</root>
whoa!
ENDXML

class Listener
def initialize
# Use instance variables to mantain information
# between different callbacks.
@nest_level = 0
end

def tag_start( name, attr_hash )
@nest_level += 1
print " " * @nest_level
puts "<#{name} ...> #{attr_hash.inspect}"
end

def text( str )
print " " * @nest_level
p str
end

# Treat CDATA sections just like text
alias_method :cdata, :text

def tag_end( name )
print " " * @nest_level
puts "</#{name}>"
@nest_level -= 1
end
end

require 'rexml/document'
require 'stringio' # for this demo
stream_handler = Listener.new
pseudo_file = StringIO.new( xml )
REXML::Document.parse_stream( pseudo_file, stream_handler )

#=> OUTPUT:
#=> "oh hai\n"
#=> <root ...> {}
#=> "\n "
#=> <kid1 ...> {"jim"=>"jam", "foo"=>"bar"}
#=> </kid1>
#=> "\n "
#=> <kid2 ...> {"foo"=>"sausage"}
#=> "\n "
#=> <grandkid1 ...> {}
#=> "I ate <raw> sausages&tm;!"
#=> </grandkid1>
#=> "\n "
#=> <grandkid2 ...> {}
#=> "\n "
#=> "\n I ate <raw> sausages&tm;!\n "
#=> "\n "
#=> </grandkid2>
#=> "\n "
#=> </kid2>
#=> "\n"
#=> </root>
#=> "\nwhoa!\n"
 
P

Phrogz

Find the element "Gene-ref", allow me to access its children and then
close and repeat for the next "Gene-ref entry. In xml code, that would
look like

<something here>
 <Gene-ref>
  <name>...</name>
  <start>...</start>
  <end>...</end>
 </Gen-ref>
 <Gen-ref>
 ...

Here's a more targeted example, since I'm feeling helpful:

class Gene
attr_accessor :name, :start, :end
def initialize( name=nil, start=nil, endd=nil )
@name = name
self.start = start
self.end = endd
end
def start=( start )
@start = start.to_i
end
def end=( endd )
@end = endd.to_i
end
def duration
@end - @start
end
def to_s
"<Gene '#{@name}' #{@start}-#{@end} (#{duration})> "
end
end

class GeneParser
GENE_TAG_NAME = "Gene-ref"
def tag_start( name, attributes )
case name
when "root"
# ignore
when GENE_TAG_NAME
@current_gene = Gene.new
else
@current_property = name
end
end

def text( str )
if @current_gene && @current_property
@current_gene.send( @current_property.to_s + "=", str )
end
end

def tag_end( name )
if name == GENE_TAG_NAME
puts @current_gene
@current_gene = nil
else
@current_property = nil
end
end
end

require 'rexml/document'
REXML::Document.parse_stream( DATA, GeneParser.new )
#=> OUTPUT:
#=> <Gene 'foo' 17-42 (25)>
#=> <Gene 'bar' 43-50 (7)>

__END__
<root>
<Gene-ref>
<name>foo</name>
<start>17</start>
<end>42</end>
</Gene-ref>
<Gene-ref>
<name>bar</name>
<start>43</start>
<end>50</end>
</Gene-ref>
</root>
 
M

Marc Hoeppner

Thanks for the great responses!

Just for clarification though:

tag_start identifies an element like "gene" ala

def tag_start(name, attrib)
if name=="gene"
do something here
end
end

So tag_end is then used if I want to puts everything that was done with
the element and its children like storing some values in an array or
something?

/Marc
 
R

Robert Klemme

Thanks for the great responses!

Just for clarification though:

tag_start identifies an element like "gene" ala

def tag_start(name, attrib)
if name=="gene"
do something here
end
end

So tag_end is then used if I want to puts everything that was done with
the element and its children like storing some values in an array or
something?

Yeah, you could do that. Basically it's completely up to you. The
parser just hands off events for parsed XML items (like starting tags,
closing tags, text data etc.) in the order found in the document.

I have attached another example of an idiom that I use frequently. This
may not be needed in your case but who knows? Basic idea is that the
event listener creates nested listeners (one per element) and hands
processing off to them while maintaining a stack of listeners. That way
you can do different processing based on element name, attributes or
whatever. Might be overkill for your simple example but OTOH if you do
need to do complex processing steps based on elements this might be
exactly what you need. For example, you can store any information you
need in a nested listener and do all the processing in end tag.

Kind regards

robert

#!/bin/env ruby

# Robert Klemme 2007

require 'rexml/document'
require 'rexml/streamlistener'
require 'delegate'

class StreamListener < Delegator

def initialize
@current = NestedStreamListener.new
end

def tag_start(name, attrs)
sl = listener(name, attrs).new
sl.parent = @current
@current = sl
sl.tag_start(name, attrs)
end

def tag_end(name)
@current.tag_end(name)
@current = @current.parent
end

def __getobj__
@current or raise "Cannot handle beyond root"
end

private
def listener(name, attrs)
# more complex code here for specific
# nested listeners
NestedStreamListener
end
end

class BaseNestedStreamListener
include REXML::StreamListener

# structure
attr_accessor :parent

protected

# parents from the root
def parents
res = []
sl = self

while sl
res.unshift sl
sl = sl.parent
end

res
end
end


class NestedStreamListener < BaseNestedStreamListener

attr_accessor :tag_name

def tag_start(name, attrs)
# demo only
self.tag_name = name

parents.each do |par|
print par.tag_name, " "
end

puts

@text = ""
end

def tag_end(name)
print " ", @text.inspect, "\n"
end

def text(s)
(@text ||= "") << s
end

alias cdata text

end

REXML::Document.parse_stream( DATA, StreamListener.new )

__END__
<root>
<Gene-ref>
<name>foo</name>
<start>17</start>
<end>42</end>
</Gene-ref>
<Gene-ref>
<name>bar</name>
<start>43</start>
<end>50</end>
</Gene-ref>
</root>
 
B

Bob Hutchison

Hi Marc, Robert:

I have attached another example of an idiom that I use frequently.
This may not be needed in your case but who knows? Basic idea is
that the event listener creates nested listeners (one per element)
and hands processing off to them while maintaining a stack of
listeners. That way you can do different processing based on
element name, attributes or whatever. Might be overkill for your
simple example but OTOH if you do need to do complex processing
steps based on elements this might be exactly what you need. For
example, you can store any information you need in a nested listener
and do all the processing in end tag.

That idiom is really frequent in the stuff that I do with XML. It also
happens to be a case where a pull parser can make things clearer. Pull
parsers are event based like SAX parsers are, they do not build up
trees in memory, and they are very fast. The code below is almost
equivalent to yours Robert, as far as I could make it anyway. The
parents method is different, and I habitually add a node level
corresponding to the document since there are things (comments,
processing instructions, etc) that are allowed outside of the root
element of an XML file.

I wrote that xampl-pp parser in May 2002. It is on source forge <http://sourceforge.net/projects/xampl-pp/
but not as a gem. Hmm, maybe I should do something about that, I've
made some fixes to very obscure problems over the years, but never
released anything. Anyway the point isn't xampl-pp so much as what a
pull parser looks like. Many people don't know about them and if they
do they often don't know why they are useful. I think the libxml2
library does pull parsing (and that'll be very fast) and I think REXML
does too (it has a few problems with namespaces last I looked).
Anyway...

All the interesting stuff happens in the work method.

#!/bin/env ruby

require "xampl-pp"

class Thing

attr_accessor :tag_name
attr_accessor :parent
attr_accessor :text

def initialize(parent, name)
@parent = parent
@tag_name = name
@text = ""
end

def Thing.start(xml)
pp = Xampl_PP.new
pp.input = xml

Thing.new(nil, 'doc').work(pp)
end

def parents
p = self
while p
yield p
p = p.parent
end
end

def work(pp)
parents do |par|
print par.tag_name, " "
end
puts

while not pp.endDocument? do
case pp.nextEvent
when Xampl_PP::START_ELEMENT
Thing.new(self, pp.name).work(pp)
when Xampl_PP::END_ELEMENT
print " ", text.inspect, "\n"
return
when Xampl_PP::TEXT || Xampl_PP::CDATA_SECTION ||
Xampl_PP::ENTITY_REF then
text << pp.text
end
end
end
end

Thing.start(DATA)

__END__
<root>
<Gene-ref>
<name>foo</name>
<start>17</start>
<end>42</end>
</Gene-ref>
<Gene-ref>
<name>bar</name>
<start>43</start>
<end>50</end>
</Gene-ref>
</root>
 
R

Robert Klemme

Hi Marc, Robert:



That idiom is really frequent in the stuff that I do with XML. It also
happens to be a case where a pull parser can make things clearer. Pull
parsers are event based like SAX parsers are, they do not build up
trees in memory, and they are very fast. The code below is almost
equivalent to yours Robert, as far as I could make it anyway. The
parents method is different,

I considered that as well but wanted to be able to traverse parent nodes
top down for the example. After all it was just a quick hack. :) For
more robustness and library fitness I would certainly change a few things.
and I habitually add a node level
corresponding to the document since there are things (comments,
processing instructions, etc) that are allowed outside of the root
element of an XML file.

If you look closely that happens in my version as well (see
StreamListener#initialize and #tag_start).
I wrote that xampl-pp parser in May 2002. It is on source forge <http://sourceforge.net/projects/xampl-pp/
made some fixes to very obscure problems over the years, but never
released anything. Anyway the point isn't xampl-pp so much as what a
pull parser looks like. Many people don't know about them and if they
do they often don't know why they are useful. I think the libxml2
library does pull parsing (and that'll be very fast) and I think REXML
does too (it has a few problems with namespaces last I looked).
Anyway...

All the interesting stuff happens in the work method.

#!/bin/env ruby

require "xampl-pp"

class Thing

attr_accessor :tag_name
attr_accessor :parent
attr_accessor :text

def initialize(parent, name)
@parent = parent
@tag_name = name
@text = ""
end

def Thing.start(xml)
pp = Xampl_PP.new
pp.input = xml

Thing.new(nil, 'doc').work(pp)
end

def parents
p = self
while p
yield p
p = p.parent
end
end

def work(pp)
parents do |par|
print par.tag_name, " "
end
puts

while not pp.endDocument? do
case pp.nextEvent
when Xampl_PP::START_ELEMENT
Thing.new(self, pp.name).work(pp)
when Xampl_PP::END_ELEMENT
print " ", text.inspect, "\n"
return
when Xampl_PP::TEXT || Xampl_PP::CDATA_SECTION ||
Xampl_PP::ENTITY_REF then
text << pp.text
end
end
end
end

Thing.start(DATA)

Yep, looks pretty similar. To be honest I never used an XML pull parser
myself. Personally I prefer the push parser a bit because it avoids the
looping and decision logic based on event and element type (in #work).
Are there major advantages of pull parsers over push parsers that I have
overlooked so far?

Kind regards

robert
 
M

Marc Hoeppner

Ok, I am still on this and am starting to feel a bit stupid...I guess
there is something wrong with the following example, but maybe I should
step back for a while. Anyhow - what's missing to make it work:

The xml file:

<Gene-ref>
<name>Gene1</name>
<start>1</start>
<end>10</end>
</Gen-ref>
<Gene-ref>
<name>Gene2</name>
<start>11</start>
<end>20</end>
</Gen-ref>
<Gene-ref>
<name>Gene3</name>
<start>21</start>
<end>30</end>
</Gen-ref>
<Gene-ref>
<name>Gene4</name>
<start>31</start>
<end>40</end>
</Gen-ref>
<Gene-ref>
<name>Gene5</name>
<start>41</start>
<end>50</end>
</Gen-ref>

The code:

data = File.new(ARGV.shift)

class Listener

def tag_start( name, attr_hash )
if name == 'Gene-ref'
@name = name
end
end

def text( str )

end

def tag_end( name )
if name == 'Gene-ref'
puts @name
@name = nil
end
end
end


require 'rexml/document'
REXML::Document.parse_stream( data, Listener.new )
 
R

Robert Klemme

2008/1/14 said:
Ok, I am still on this and am starting to feel a bit stupid...I guess
there is something wrong with the following example, but maybe I should
step back for a while. Anyhow - what's missing to make it work:

The xml file:

<Gene-ref>
<name>Gene1</name>
<start>1</start>
<end>10</end>
</Gen-ref>
<Gene-ref>
<name>Gene2</name>
<start>11</start>
<end>20</end>
</Gen-ref>
<Gene-ref>
<name>Gene3</name>
<start>21</start>
<end>30</end>
</Gen-ref>
<Gene-ref>
<name>Gene4</name>
<start>31</start>
<end>40</end>
</Gen-ref>
<Gene-ref>
<name>Gene5</name>
<start>41</start>
<end>50</end>
</Gen-ref>

The code:

data = File.new(ARGV.shift)

class Listener

def tag_start( name, attr_hash )
if name == 'Gene-ref'
@name = name
end
end

def text( str )

You need to place the @name storing code here. But to make it work
you need to remember the current tag's name because you want to only
store "str" in @name if you are in <name>. This makes it necessary
that you add a stack to your listener class which gets pushed and
popped whenever tag_start and tag_end are invoked.
end

def tag_end( name )
if name == 'Gene-ref'
puts @name
@name = nil
end
end
end


require 'rexml/document'
REXML::Document.parse_stream( data, Listener.new )

Cheers

robert
 
B

Bob Hutchison

Hi Robert,

Yep, looks pretty similar. To be honest I never used an XML pull
parser myself. Personally I prefer the push parser a bit because it
avoids the looping and decision logic based on event and element
type (in #work). Are there major advantages of pull parsers over
push parsers that I have overlooked so far?


I use both and I'm comfortable with both. The combination of sax and
pull gives you a much more complete toolkit for streaming XML.

The biggest advantage (for me) to a pull parser is when you are
parsing an XML file and using those events to drive some kind of
algorithm. Either the parser or the algorithm is going to have to be
able to suspend processing (and I mean save state by that) and resume
when the other needs it to. This can be really hard to do for some
algorithms, in the case of some it isn't possible without
fundamentally changing the algorithm. If you read about pull parsers
they will talk about where the flow of control (the 'outermost' loop)
is in the program: in the parser or in the application. With sax the
parser controls the flow, with pull parsers the application does. A
subset of that situation is that the pull parser is just an object
that you can pass around, and that can be *really* handy. This also
means you can have more than one pull parser operating at a time, and
that's incredibly confusing with sax.

You can fake a pull parser by putting a sax parser in a different
thread and streaming the events over some kind of channel with some
kind of blocking protocol. Some implementations of pull parsers do
exactly that, and that is *slow*, yet if you need it people will put
up with it. If you've ever had the temptation to do that then you
might want to think pull parser.

It is pretty much trivial to write a sax parser based on a pull parser
at very small cost (xampl-pp ships with an example that does that). So
one trick that you can play is switching back and forth between sax
and pull where appropriate.

Junior programmers have a much easier time with pull parsers. It seems
to provide a gentler route to sax based software. XML is a big enough
step. There is some advantage to avoiding the need for a beginner to
learn both XML and event based programming at the same time.

Historically pull parsers tend to be very fast compared to sax
parsers. This is not going to last, there's no technical reason for
such a difference. In the Java world the speed difference is eroding
for the good of all. Anyway, speed isn't the issue it used to be. If
you poke around the web you'll still see a bunch of references to
performance advantages, this doesn't apply anymore if you choose your
parser correctly.

Cheers,
Bob
 
R

Robert Klemme

2008/1/14 said:
I use both and I'm comfortable with both. The combination of sax and
pull gives you a much more complete toolkit for streaming XML.

I read that in the light of your following remarks. Other than that I
do not see a difference with regard to the XML parsed, i.e. both
approaches are equally powerful.
The biggest advantage (for me) to a pull parser is when you are
parsing an XML file and using those events to drive some kind of
algorithm. Either the parser or the algorithm is going to have to be
able to suspend processing (and I mean save state by that) and resume
when the other needs it to. This can be really hard to do for some
algorithms, in the case of some it isn't possible without
fundamentally changing the algorithm. If you read about pull parsers
they will talk about where the flow of control (the 'outermost' loop)
is in the program: in the parser or in the application. With sax the
parser controls the flow, with pull parsers the application does.

I see, that's definitively a situation I was not aware of. I guess I
just haven't been in a situation where I needed to employ such
algorithms. And yes, the instance in control (parser vs. application
code) seems to be the biggest difference. I can see how this makes a
difference for some algorithms.
A
subset of that situation is that the pull parser is just an object
that you can pass around, and that can be *really* handy. This also
means you can have more than one pull parser operating at a time, and
that's incredibly confusing with sax.
:)

You can fake a pull parser by putting a sax parser in a different
thread and streaming the events over some kind of channel with some
kind of blocking protocol. Some implementations of pull parsers do
exactly that, and that is *slow*, yet if you need it people will put
up with it. If you've ever had the temptation to do that then you
might want to think pull parser.

It is pretty much trivial to write a sax parser based on a pull parser
at very small cost (xampl-pp ships with an example that does that). So
one trick that you can play is switching back and forth between sax
and pull where appropriate.

Yep. Basically a push parser can be implemented as thin wrapper
around a pull parser.
Junior programmers have a much easier time with pull parsers. It seems
to provide a gentler route to sax based software. XML is a big enough
step. There is some advantage to avoiding the need for a beginner to
learn both XML and event based programming at the same time.

I haven't thought of that either but sounds like a valid point.
Historically pull parsers tend to be very fast compared to sax
parsers. This is not going to last, there's no technical reason for
such a difference. In the Java world the speed difference is eroding
for the good of all. Anyway, speed isn't the issue it used to be. If
you poke around the web you'll still see a bunch of references to
performance advantages, this doesn't apply anymore if you choose your
parser correctly.

Bob, thanks for taking the time and put together this elaborate explanation!

Kind regards

robert
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,230
Members
46,817
Latest member
DicWeils

Latest Threads

Top