Parsing xml

Robert Klemme · Mar 25, 2009

require 'mechanize'
mech = WWW::Mechanize.new
mech.get 'http://www.shoe-g.com/index.rdf'
doc = Nokogiri(mech.page.body)
titles = (doc / 'title').map(&:text)

Here's a similar solution using REXML:

require 'open-uri'
require 'rexml/document'

body = open('http://www.shoe-g.com/index.rdf').read
doc = REXML::Document.new body
titles = doc.elements.to_enum(:each, '//title').map(&:text)

Five lines as well...

Cheers

robert

Bill Kelly · Mar 25, 2009

Jason Roelofs said:
Regex is not stateful, thus you can't use it to parse XML. Oh there
are ways to hack yourself around some limitations and get some
results, but you are going to spend a TON of time making very
unreadable Regex that will die at the presence of the slightest malformed
XML.

But, regexps work just fine to lex XML. The parser, then,
becomes a bit of ruby that accepts tokens from the regexp
lexer.

Handling the most common syntactic elements of an XML doc
this way (tags, text, cdata) is relatively trivial.

On the other hand, as we can see from the BNF, handling the
full XML spec is complicated: http://pastie.org/pastes/427101
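As a rough illustration of the lexer/parser split for the simple case (tags, text, CDATA), here is a minimal sketch. It is not code from this thread: the pattern names, the `lex` and `parse` helpers and the `Node` struct are all hypothetical, and it deliberately ignores comments, processing instructions, DOCTYPE, entities and attribute values containing ">".

```ruby
require 'strscan'

# Token patterns for a small subset of XML (assumption: no comments,
# processing instructions, DOCTYPE or entities in the input).
CDATA     = /<!\[CDATA\[(.*?)\]\]>/m
CLOSE_TAG = /<\/([\w:.-]+)\s*>/
OPEN_TAG  = /<([\w:.-]+)([^<>]*?)(\/?)>/m
TEXT      = /[^<]+/

# Lexer: regexps only recognise individual tokens, they keep no state.
def lex(xml)
  scanner = StringScanner.new(xml)
  tokens = []
  until scanner.eos?
    if scanner.scan(CDATA)
      tokens << [:text, scanner[1]]
    elsif scanner.scan(CLOSE_TAG)
      tokens << [:close, scanner[1]]
    elsif scanner.scan(OPEN_TAG)
      tokens << [:open, scanner[1]]
      # A self-closing tag such as <b/> is emitted as open + close.
      tokens << [:close, scanner[1]] if scanner[3] == '/'
    elsif scanner.scan(TEXT)
      tokens << [:text, scanner[0]]
    else
      raise ArgumentError, "cannot lex input at offset #{scanner.pos}"
    end
  end
  tokens
end

# Tree node: element name plus child nodes and text strings.
Node = Struct.new(:name, :children)

# Parser: plain Ruby with a stack, which supplies the state that the
# regexps lack and checks that tags are properly nested.
def parse(tokens)
  root = Node.new(:document, [])
  stack = [root]
  tokens.each do |type, value|
    case type
    when :open
      node = Node.new(value, [])
      stack.last.children << node
      stack.push(node)
    when :close
      unless stack.size > 1 && stack.last.name == value
        raise ArgumentError, "unexpected </#{value}>"
      end
      stack.pop
    when :text
      stack.last.children << value
    end
  end
  raise ArgumentError, "unclosed <#{stack.last.name}>" unless stack.size == 1
  root
end
```

With this, `parse(lex("<a>x<b/></a>"))` gives a document node whose single child is the `a` element, holding the text "x" and an empty `b` element. Mismatched or unclosed tags raise an `ArgumentError` instead of yielding a partial result.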

. . .

In any case, I'm fully on board with the "why reinvent the
wheel?" replies in this thread.

I just had to visit this territory recently because REXML
doesn't work properly in a $SAFE = 4 sandbox.

Regards,

Bill

Simon Krahnke · Mar 25, 2009

* Sebastian Hungerecker said:
I don't see what you could add to the regexp to handle nested tags. You can't
really handle nested structures with regular expressions.

I don't see how the regex he wrote above doesn't already do that.

If you modify it to /<some-tag.*?>(.*?)<\/some-tag>/ it will ignore
attributes, too. Put so simply, it can of course cause problems when there are
other elements whose names start with "some-tag".
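To make the prefix problem concrete, here is a small sketch. The sample input and the stricter pattern are assumptions for illustration, not something posted in this thread. The stricter variant only accepts whitespace plus attributes, or nothing, between the tag name and ">".

```ruby
# Hypothetical input: a <some-tag-extra> element followed by a <some-tag> element.
doc = '<some-tag-extra>wrong</some-tag-extra><some-tag id="1">right</some-tag>'

# The pattern from the post: ".*?" also swallows the rest of a longer tag name.
loose = /<some-tag.*?>(.*?)<\/some-tag>/

# Stricter sketch: the name must be followed by ">" or by whitespace and attributes.
strict = /<some-tag(?:\s[^>]*)?>(.*?)<\/some-tag>/

loose_match  = doc[loose, 1]
strict_match = doc[strict, 1]

puts loose_match   # wrong</some-tag-extra><some-tag id="1">right
puts strict_match  # right
```

The stricter pattern still cannot cope with nested "some-tag" elements or with attribute values that contain ">".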

ttfn, simon .... l

Sebastian Hungerecker · Mar 26, 2009

Simon said:
I don't see how the regex he wrote above doesn't already do that.

document = "<some-tag> lala <some-tag> lulu </some-tag> lili </some-tag>"
document.match(/<some-tag>(.+?)<\/some-tag>/)[1]
=> " lala <some-tag> lulu "
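For comparison, getting the full content of the outer element needs some state beyond the regex itself. The following is only a sketch under simplifying assumptions (no attributes, no self-closing tags, no CDATA), and `outer_content` is a hypothetical helper, not an existing library method. The regex finds the individual tags, and a plain Ruby counter tracks the nesting depth.

```ruby
# Return the content of the first complete <tag>...</tag> element,
# including any nested elements of the same name, or nil if none is found.
def outer_content(document, tag)
  token = /<(\/?)#{Regexp.escape(tag)}\s*>/
  depth = 0
  start = nil
  document.scan(token) do
    match = Regexp.last_match
    if match[1].empty?
      # Opening tag: remember where the outermost content starts.
      start = match.end(0) if depth.zero?
      depth += 1
    else
      # Closing tag: ignore a stray one that has no opening tag.
      next if depth.zero?
      depth -= 1
      return document[start...match.begin(0)] if depth.zero?
    end
  end
  nil
end

document = "<some-tag> lala <some-tag> lulu </some-tag> lili </some-tag>"
puts outer_content(document, "some-tag").inspect
# " lala <some-tag> lulu </some-tag> lili "
```

The depth counter is exactly the piece that a single regular expression cannot express for arbitrarily deep nesting.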
