Parsing XML

Bill Kelly

Jason Roelofs said:
Regex is not stateful, thus you can't use it to parse XML. Oh there
are ways to hack yourself around some limitations and get some
results, but you are going to spend a TON of time making very
unreadable Regex that will die at the presence of the slightest malformed
XML.

But, regexps work just fine to lex XML. The parser, then,
becomes a bit of Ruby that accepts tokens from the regexp
lexer.

Handling the most common syntactic elements of an XML doc
this way (tags, text, cdata) is relatively trivial.
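A minimal sketch of that lexer/parser split, assuming only tags, text and CDATA need handling. The names `lex` and `parse` are hypothetical, attribute values are discarded, and comments, processing instructions, DOCTYPE and attribute values containing ">" are deliberately not supported:

```ruby
require 'strscan'

# Lexer: regexps turn the input into a flat list of [type, value] tokens.
def lex(xml)
  s = StringScanner.new(xml)
  tokens = []
  until s.eos?
    if s.scan(/<!\[CDATA\[(.*?)\]\]>/m)
      tokens << [:cdata, s[1]]
    elsif s.scan(/<\/([\w:.-]+)\s*>/)
      tokens << [:close, s[1]]
    elsif s.scan(/<([\w:.-]+)[^>]*?(\/)?>/)
      tokens << [:open, s[1]]
      tokens << [:close, s[1]] if s[2] # self-closing tag: <b/>
    elsif s.scan(/[^<]+/)
      tokens << [:text, s.matched]
    else
      raise ArgumentError, "unexpected input at offset #{s.pos}"
    end
  end
  tokens
end

# Parser: plain Ruby with a stack, which supplies the state a regexp lacks.
def parse(tokens)
  root = { name: :root, children: [] }
  stack = [root]
  tokens.each do |type, value|
    case type
    when :open
      node = { name: value, children: [] }
      stack.last[:children] << node
      stack.push(node)
    when :close
      unless stack.size > 1 && stack.last[:name] == value
        raise ArgumentError, "mismatched </#{value}>"
      end
      stack.pop
    else # :text and :cdata both become string children
      stack.last[:children] << value
    end
  end
  raise ArgumentError, "unclosed <#{stack.last[:name]}>" if stack.size > 1
  root
end
```

Calling `parse(lex("<a>hi<b/></a>"))` returns a nested hash tree under a synthetic `:root` node; malformed input raises instead of being silently accepted.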

On the other hand, as we can see from the BNF, handling the
full XML spec is complicated: http://pastie.org/pastes/427101

. . .

In any case, I'm fully on board with the "why reinvent the
wheel?" replies in this thread.

I just had to visit this territory recently because REXML
doesn't work properly in a $SAFE = 4 sandbox.


Regards,

Bill
 
Simon Krahnke

* Sebastian Hungerecker said:
I don't see what you could add to the regexp to handle nested tags. You can't
really handle nested structures with regular expressions.

I don't see how the regex he wrote above doesn't already do that.

If you modify it to /<some-tag.*?>(.*?)<\/some-tag>/ it will ignore
attributes, too. Put so simply, it can of course cause problems when there
are other elements whose names start with "some-tag".
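To illustrate the prefix problem, here is a small sketch with a made-up document. The `strict` pattern is only one possible tightening (an assumption, not something proposed in the thread): it requires whitespace before any attributes, so "some-tag-extra" no longer matches.

```ruby
doc = "<some-tag-extra>no</some-tag-extra><some-tag id='1'>yes</some-tag>"

# The pattern from the post: ".*?" also swallows the "-extra" suffix.
loose = /<some-tag.*?>(.*?)<\/some-tag>/

# Hypothetical stricter variant: the tag name must be followed by ">"
# or by whitespace and attributes.
strict = /<some-tag(?:\s[^>]*)?>(.*?)<\/some-tag>/

loose_hit  = doc.match(loose)[1]
# => "no</some-tag-extra><some-tag id='1'>yes"
strict_hit = doc.match(strict)[1]
# => "yes"
```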

ttfn, simon .... l
 
Sebastian Hungerecker

Simon said:
I don't see how the regex he wrote above doesn't already do that.

document = "<some-tag> lala <some-tag> lulu </some-tag> lili </some-tag>"
document.match(/<some-tag>(.+?)<\/some-tag>/)[1]
=> " lala <some-tag> lulu "
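Getting the whole content of the outer element needs a depth counter, which is exactly the state a single regexp match lacks. A minimal sketch, assuming tags without attributes; `inner_content` is a hypothetical helper name:

```ruby
# Returns the text between the first <tag> and its matching </tag>,
# or nil if no balanced pair is found.
def inner_content(document, tag)
  depth = 0
  start = nil
  document.scan(/<(\/?)#{Regexp.escape(tag)}>/) do
    m = Regexp.last_match
    if m[1].empty?
      # Opening tag: remember where the outermost content begins.
      start = m.end(0) if depth.zero?
      depth += 1
    else
      next if depth.zero? # stray closing tag, ignore it
      depth -= 1
      return document[start...m.begin(0)] if depth.zero?
    end
  end
  nil
end

document = "<some-tag> lala <some-tag> lulu </some-tag> lili </some-tag>"
inner_content(document, "some-tag")
# => " lala <some-tag> lulu </some-tag> lili "
```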
 
