Parsing XML (XMPP) with Ruby

Eric Will

Hello World,

I am writing an XMPP (Jabber) server in Ruby. XMPP uses XML for its
protocol. This means I have to do a good deal of XML parsing, in Ruby.

Right now I am using REXML to parse the individual stanzas as they
come in. However, in order to do this without REXML complaining of
"multiple root elements" (that is, XMPP is streaming XML over a TCP
socket, so I only get the root element once) I have to wrap every
incoming chunk of XMPP with my own <root/> tag, and then ignore that
after REXML parses it. I am currently unhappy with this approach.
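
A minimal sketch of the wrapping approach described above, assuming each chunk read from the socket contains only complete stanzas. The method name `parse_stanzas` and the sample stanzas are made up for illustration, and the wrapper element name is arbitrary.

```ruby
require 'rexml/document'

# Wrap one chunk of stanzas in a throwaway root element so REXML sees a
# single well-formed document, then return only the real stanzas.
def parse_stanzas(chunk)
  doc = REXML::Document.new("<root>#{chunk}</root>")
  doc.root.elements.to_a # child elements of the wrapper; the wrapper is dropped
end

chunk = "<presence/><message to='a@example.com'><body>hi</body></message>"
stanzas = parse_stanzas(chunk)
stanzas.each { |stanza| puts stanza.name }
```

Note that a TCP read can end in the middle of a stanza, in which case the wrapped chunk is not well-formed and REXML will raise a parse error; the sketch does not handle that case.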

Another option is to use REXML's stream parsing. I don't really like
this idea. It seems the only benefit of using SAX(ish) parsing is when
you're dealing with huge documents that you don't want to load into
memory. This isn't the case. I get maybe 5-10 objects per parse. Most
of the people I've talked to in XMPP insist on using SAX (or something
like it, such as REXML's stream parsing). The other reason I don't
like REXML's stream parsing (or libxml's SAX) is because I have to
provide a class instance for it to use for the event-parsing, and this
class has to be a giant state machine, which seems wrong to me. I
don't want to have to write a complicated class to, in effect, parse
the XML myself when the XML parser should be doing this for me.
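
For comparison, a minimal sketch of what a REXML stream listener looks like. It uses `REXML::Document.parse_stream` with a class that includes `REXML::StreamListener`; the class name `StanzaLogger` and the sample stream are hypothetical. The only state kept here is a depth counter, which is enough to tell top-level stanzas apart from their children.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Records the name and attributes of every direct child of the stream root.
class StanzaLogger
  include REXML::StreamListener
  attr_reader :stanzas

  def initialize
    @depth = 0     # how many elements are currently open
    @stanzas = []
  end

  def tag_start(name, attrs)
    # Depth 1 means "directly inside the stream root", i.e. a stanza.
    @stanzas << [name, attrs.to_h] if @depth == 1
    @depth += 1
  end

  def tag_end(_name)
    @depth -= 1
  end
end

xml = "<stream:stream xmlns:stream='http://etherx.jabber.org/streams' " \
      "xmlns='jabber:client'>" \
      "<presence/><message to='a@example.com'><body>hi</body></message>" \
      "</stream:stream>"

listener = StanzaLogger.new
REXML::Document.parse_stream(xml, listener)
listener.stanzas.each { |name, attrs| puts "#{name} #{attrs.inspect}" }
```

The sample closes the stream root so that the parse finishes cleanly; a real server would feed the parser from the socket instead of a string.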

The other options include using hpricot to do the incoming parsing
(since it's C, and way faster than REXML) and continue to use REXML
for generating the outgoing XML (I can't seem to figure out how to do
this in hpricot, if it's even possible). Although, XMPP requires XML
well-formedness, and hpricot does not do validation (to the best of my
knowledge). I also like xml-simple, but it uses REXML underneath it
all, so I'm left with the same issues.
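
As a rough illustration of generating outgoing XML with REXML, the sketch below builds a single message stanza with `REXML::Element`. The addresses and body text are placeholder values.

```ruby
require 'rexml/document'

# Build <message to=... type=...><body>...</body></message> piece by piece.
msg = REXML::Element.new('message')
msg.add_attributes('to' => 'a@example.com', 'type' => 'chat')
msg.add_element('body').text = 'hello'

xml = msg.to_s # serialized stanza, ready to write to the socket
puts xml
```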

My real question is: is there a GOOD REASON to switch from the scheme I
currently use? A number of people seem to think it's the "Wrong Thing"
to do, but I'm not quite sure what the "Right Thing" is. I don't think
it's SAX.

Thanks for any feedback.

-- rakaur
 
Dejan Dimic

Every problem can have multiple solutions.

Personally, I would go for SAX processing of the incoming XML stream.
It cannot be that hard to build an event-driven solution, and the state
machine should not be more complicated than the DOM node processing.
The benefit is that you can start building the response while you are
still processing the XML input.
You can't get much faster than that.

If you think it's not your cup of tea, that's totally OK.

If you have to parse chunks of XML data, then hpricot is my favorite
choice.
While analyzing the DOM for nodes of interest, preferably with XPath,
you should build the response.
You can do that with hpricot too.

In a word, do it as you see fit, and then try to make it better. :)
 
Robert Klemme

> Another option is to use REXML's stream parsing. I don't really like
> this idea. It seems the only benefit of using SAX(ish) parsing is when
> you're dealing with huge documents that you don't want to load into
> memory. This isn't the case. I get maybe 5-10 objects per parse. Most
> of the people I've talked to in XMPP insist on using SAX (or something
> like it, such as REXML's stream parsing). The other reason I don't
> like REXML's stream parsing (or libxml's SAX) is because I have to
> provide a class instance for it to use for the event-parsing, and this
> class has to be a giant state machine, which seems wrong to me. I
> don't want to have to write a complicated class to, in effect, parse
> the XML myself when the XML parser should be doing this for me.

Well, this is not true. You can have multiple classes cooperating in
doing XML stream parsing. You need one instance for receiving the
events but that can delegate to any number of other instances. A scheme
I usually use is to have a class per element type and the front end
instance keeps a stack of those.
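
One possible reading of that scheme, sketched with REXML's stream parser: a front-end listener receives the events and keeps a stack of handler objects, one handler class per element type. All class names here (`Dispatcher`, `MessageHandler`, `BodyHandler`, `IgnoreHandler`) and the `Message` struct are hypothetical; this is an illustration under those assumptions, not necessarily how the scheme is implemented in practice.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical domain object built directly from parse events (no DOM).
Message = Struct.new(:to, :body)

# Default handler: used for any element without a dedicated class.
class IgnoreHandler
  def initialize(_attrs, _parent, _results); end
  def text(_content); end
  def finish; end
end

class MessageHandler < IgnoreHandler
  attr_reader :message

  def initialize(attrs, _parent, results)
    @message = Message.new(attrs['to'], String.new)
    @results = results
  end

  # Called when </message> is seen: hand the finished object over.
  def finish
    @results << @message
  end
end

class BodyHandler < IgnoreHandler
  def initialize(_attrs, parent, _results)
    @parent = parent
  end

  def text(content)
    @parent.message.body << content if @parent.is_a?(MessageHandler)
  end
end

# Front end: the single instance the parser talks to.
class Dispatcher
  include REXML::StreamListener
  HANDLERS = { 'message' => MessageHandler, 'body' => BodyHandler }.freeze

  attr_reader :results

  def initialize
    @stack = []
    @results = []
  end

  def tag_start(name, attrs)
    klass = HANDLERS.fetch(name, IgnoreHandler)
    @stack.push(klass.new(attrs.to_h, @stack.last, @results))
  end

  def text(content)
    @stack.last&.text(content)
  end

  def tag_end(_name)
    @stack.pop.finish
  end
end

xml = "<stream><message to='a@example.com'><body>hi</body></message>" \
      "<presence/></stream>"
dispatcher = Dispatcher.new
REXML::Document.parse_stream(xml, dispatcher)
dispatcher.results.each { |m| puts "#{m.to}: #{m.body}" }
```

Each handler only knows about its own element, so the state that would otherwise live in one large class is spread across small ones, and the stack records where in the document the parser currently is.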

Typically XML is parsed to instantiate classes of a particular object
model that is built to implement the business logic (in your case
message exchange). It is a waste of resources to create an XML DOM and
then traverse it in order to transform it into other objects. Also, not
all input data is needed in every case. That's why stream parsing has
serious advantages over DOM parsing.

OTOH, if you can do all your processing efficiently on the DOM then
maybe that is a better way. In your situation I would still choose the
stream approach because it also better fits the way the data is provided.

My 0.02 EUR

robert
 
Eric Will

> Well, this is not true. You can have multiple classes cooperating in doing
> XML stream parsing. You need one instance for receiving the events but that
> can delegate to any number of other instances. A scheme I usually use is to
> have a class per element type and the front end instance keeps a stack of
> those.

This is how I'd implement it. I just don't wanna.

> Typically XML is parsed to instantiate classes of a particular object model
> that is built to implement the business logic (in your case message
> exchange). It is a waste of resources to create an XML DOM and then
> traverse it in order to transform it into other objects. Also, not all
> input data is needed in every case. That's why stream parsing has serious
> advantages over DOM parsing.

The thing is, I'm only parsing out like 5-10 objects at a time. It's
nothing huge to traverse, but I'm thinking it'll be a hard performance
hit to keep on like that when I try to scale.

-- rakaur
 
