Parsing XML

Arun Kumar

Hi,
Is there any way in Ruby to parse an XML file without using REXML or any
other libraries?



Regards
Arun Kumar
 
Peter Zotov

Arun Kumar said:
Hi,
Is there any way in Ruby to parse an XML file without using REXML or any
other libraries?

Of course. You can write a finite state machine, read the XML from a file,
and parse it however you want.
 
Arun Kumar

Peter said:
Of course. You can write a finite state machine, read the XML from a file,
and parse it however you want.

Thanks. Can you please give me the details?

Regards
Arun Kumar
 
Peter Zotov

Arun Kumar said:
Thanks. Can you please give me the details?

It is described nicely in Wikipedia:
http://en.wikipedia.org/wiki/Finite_state_machine

As a hint, I recommend defining the following states: "text",
"opening tag", "tag attribute", "tag attribute value", "closing tag".
E.g. when you are in the "text" state and read a "<" symbol from the input
sequence, you change state to "opening tag" or "closing tag"...
I still have one question: why don't you use REXML?
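
The state-machine idea can be sketched in a few lines of Ruby. This is only an illustration under loud assumptions: the method name tokenize, the extra :tag_start state (entered right after "<") and the token layout are invented here, and the "tag attribute" states are left out entirely. It also knows nothing about the XML declaration, comments, CDATA or entities.

```ruby
# Minimal state-machine tokenizer: a sketch, not a conforming XML parser.
# Hypothetical helper; attributes, comments, CDATA and entities are not handled.
def tokenize(xml)
  tokens = []
  state  = :text
  buffer = +""

  xml.each_char do |ch|
    case state
    when :text
      if ch == "<"
        # Leaving the text state: emit collected text unless it is only whitespace.
        tokens << [:text, buffer] unless buffer.strip.empty?
        buffer = +""
        state  = :tag_start
      else
        buffer << ch
      end
    when :tag_start
      # Just saw "<": a "/" means closing tag, anything else starts an opening tag.
      if ch == "/"
        state = :closing_tag
      else
        buffer << ch
        state = :opening_tag
      end
    when :opening_tag, :closing_tag
      if ch == ">"
        tokens << [state == :opening_tag ? :open : :close, buffer]
        buffer = +""
        state  = :text
      else
        buffer << ch
      end
    end
  end
  tokens
end

tokens = tokenize("<root><some-tag>some text</some-tag></root>")
tokens.each { |kind, value| puts "#{kind}: #{value}" }
```

Even this toy version shows why a real parser takes time: every feature of XML that is skipped here would need its own states and transitions.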
 
Arun Kumar

Peter said:
It is described nicely in Wikipedia:
http://en.wikipedia.org/wiki/Finite_state_machine

As a hint, I recommend defining the following states: "text",
"opening tag", "tag attribute", "tag attribute value", "closing tag".
E.g. when you are in the "text" state and read a "<" symbol from the input
sequence, you change state to "opening tag" or "closing tag"...
I still have one question: why don't you use REXML?

Hi,

The problem is that my boss does not want me to use any libraries to parse
XML. He also said to use regular expressions to extract the contents of
an XML tag. Can you please tell me how to do it? I'll be really grateful.

Regards
Arun Kumar
 
Jason Roelofs

It is described nicely in Wikipedia:
http://en.wikipedia.org/wiki/Finite_state_machine

As a hint, I recommend defining the following states: "text",
"opening tag", "tag attribute", "tag attribute value", "closing tag".
E.g. when you are in the "text" state and read a "<" symbol from the input
sequence, you change state to "opening tag" or "closing tag"...
I still have one question: why don't you use REXML?

Better question: Why *wouldn't* you want to use an existing library?
You'd have to spend months on your own before it even starts to make
sense to use such a custom solution over an existing, tested, and
heavily used library like libxml or Nokogiri (and to be fair,
Hpricot::XML, though it's more for HTML parsing than XML).

Jason
 
Arun Kumar

Jason said:
Better question: Why *wouldn't* you want to use an existing library?
You'd have to spend months on your own before it even starts to make
sense to use such a custom solution over an existing, tested, and
heavily used library like libxml or Nokogiri (and to be fair,
Hpricot::XML, though it's more for HTML parsing than XML).

Jason

Hi,
One problem is compatibility. I'm developing an application that
extracts the XML tags from a URL like 'http://www.shoe-g.com/index.rdf'
and displays the contents within them. So compatibility is an issue. My
boss is strict about not using any complex libraries. Can you please help me?
Thanks once again.

Regards
Arun Kumar
 
Peter Zotov

Arun Kumar said:
Hi,

The problem is that my boss does not want me to use any libraries to parse
XML. He also said to use regular expressions to extract the contents of
an XML tag. Can you please tell me how to do it? I'll be really grateful.

If you have, for example, this document:
----8<----
<?xml version="1.0" encoding="utf-8"?>
<root>
<some-tag>some text</some-tag>
</root>
----8<----

you can extract the contents of the "some-tag" tag with this (the code
assumes that the document is in the "document" variable):

document.match(/<some-tag>(.+?)<\/some-tag>/)[1]

But this will fail when a "some-tag" is nested inside another "some-tag", or
when the tag has attributes. Of course, these variants can be anticipated
and added to the regexp too, but that will make it very complicated.
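
Both failure modes are easy to reproduce. In the sketch below the constant TAG and the three sample strings are made up for illustration; only the pattern itself comes from the one-liner above.

```ruby
# The pattern from the one-liner above, stored in a constant for reuse.
TAG = /<some-tag>(.+?)<\/some-tag>/

# Hypothetical sample inputs.
plain     = "<root><some-tag>some text</some-tag></root>"
with_attr = '<root><some-tag id="1">some text</some-tag></root>'
nested    = "<some-tag>outer <some-tag>inner</some-tag> tail</some-tag>"

# String#[] with a regexp and a group index returns the capture, or nil
# when there is no match.
plain_result  = plain[TAG, 1]      # the simple case works
attr_result   = with_attr[TAG, 1]  # nil: the literal "<some-tag>" no longer matches
nested_result = nested[TAG, 1]     # stops at the first closing tag, cutting the content short

p plain_result, attr_result, nested_result
```

Note that when nothing matches, String#match returns nil, so calling [1] on the result as in the one-liner raises NoMethodError instead of returning nil. The "." also does not match newlines unless the /m flag is added, so tag content spanning several lines is missed as well.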

Anyway, REXML is not an _external_ library to Ruby. It's in the stdlib!
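
For comparison, here is a minimal sketch of the same extraction with REXML, using the sample document from above. One caveat that postdates this thread: since Ruby 3.0, REXML ships as a bundled gem rather than a default library, so it still comes with a standard Ruby install, but a Bundler-managed project may need to list it explicitly. The variable names are illustrative.

```ruby
require "rexml/document"

# The sample document from this post.
document = <<~XML
  <?xml version="1.0" encoding="utf-8"?>
  <root>
  <some-tag>some text</some-tag>
  </root>
XML

doc      = REXML::Document.new(document)
some_tag = doc.root.elements["some-tag"]  # first <some-tag> child of <root>
text     = some_tag.text

puts text
```

Attributes and nesting need no special handling here, because the parser tracks the document structure instead of matching text patterns.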
 
Phlip

Arun said:
Thanks. Can you please give me the details?

He told you to write a parser. That's the same as using REXML or any other library.

Why can't you use a library? REXML comes for free with Ruby, and is good enough
in a pinch.

If the XML input is very stable, and it never changes, you can parse some of it
with Regexp. That will break very easily, but it might be good enough for your
needs.
 
Phlip

Arun said:
The problem is that my boss does not want me to use any libraries to parse
XML. He also said to use regular expressions to extract the contents of
an XML tag. Can you please tell me how to do it? I'll be really grateful.

Your boss is micromanaging you, and does not understand the relationship between
Ruby, its libraries, and its programmers. Bosses generally should not prohibit
valid techniques for bogus reasons.

That said, you could use "malicious compliance", and show her or him how fragile
regular expressions are. (Write unit tests that fail for the wrong XML, for
example.)

Or you could explain that REXML is not an _external_ library. It comes with
Ruby, so it's "free" to use. You never need to download and install it...
 
Phlip

One problem is compatibility. I'm developing an application that
extracts the XML tags from a URL like 'http://www.shoe-g.com/index.rdf'
and displays the contents within them. So compatibility is an issue. My
boss is strict about not using any complex libraries. Can you please help me?
Thanks once again.

I had a boss once who wouldn't let us use keyboards, because we might use them
to type bugs in.

Sheesh...
 
Sebastian Hungerecker

Peter said:
But this will fail when a "some-tag" is nested inside another "some-tag", or
when the tag has attributes. Of course, these variants can be anticipated
and added to the regexp too

I don't see what you could add to the regexp to handle nested tags. You can't
really handle nested structures with regular expressions.
 
Peter Zotov

Sebastian Hungerecker said:
I don't see what you could add to the regexp to handle nested tags. You can't
really handle nested structures with regular expressions.

I think you could use backreferences and combine the results afterwards
from their split state, but I'm not sure.
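
For what it's worth, Ruby's regexp engine goes beyond textbook regular expressions here: it supports subexpression calls (\g<name>), which let a pattern re-enter one of its own named groups. The sketch below uses that to match one fixed tag name with same-named tags nested inside it. The constant NESTED, the group name "el" and the sample string are made up for illustration, and the pattern still ignores attributes, comments and CDATA, so treat it as a curiosity rather than a parser.

```ruby
# \g<el> re-enters the group named "el", so the pattern can recurse into
# a nested <some-tag> before looking for the matching closing tag.
NESTED = /(?<el><some-tag>(?:[^<]|\g<el>)*<\/some-tag>)/

# Hypothetical sample input with one level of nesting.
nested = "<some-tag>outer <some-tag>inner</some-tag> tail</some-tag>"

match = nested.match(NESTED)
whole = match[:el]   # the outermost element, including the nested one

puts whole
```

This only answers the narrow question of whether nesting can be matched at all. Turning the match into a usable structure, and coping with everything else XML allows, is still the job of a parser.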
 
Jason Roelofs

Regex is not stateful, thus you can't use it to parse XML. Oh there
are ways to hack yourself around some limitations and get some
results, but you are going to spend a TON of time making very
unreadable Regex that will die at the presence of the slightest malformed
XML. Your boss obviously has no idea what he's talking about. If
anything, use REXML because, as another poster said, it's a part of
Ruby and not an external library.

If your boss makes these kinds of requests often, you should probably
go looking for another job, IMO

Jason
 
Phlip

Or you could explain that REXML is not an _external_ library. It comes
with Ruby, so it's "free" to use. You never need to download and install
it...

Next, REXML will break if you point it at nearly any website.

Elaborate sigh...

Can you demo Nokogiri, Hpricot, and Regexps to this boss??
 
Peter Zotov

Phlip said:
Next, REXML will break if you point it at nearly any website.

Yes, average HTML code will break it, but he needs to parse RDF (or
probably Atom/RSS). These are almost always valid.
 
David Masover

Arun said:
Hi,
One problem is compatibility.

Compatibility with what?

Arun said:
I'm developing an application that
extracts the XML tags from a URL like 'http://www.shoe-g.com/index.rdf'

Yes, Nokogiri can read that. I'll bet Hpricot can, too -- maybe even REXML.

Maybe you can find an example for me of an XML document that Nokogiri
(libxml) can't read?

Arun said:
My boss is strict about not using any complex libraries.

Either this is some sort of test or interview question, to make sure you
understand regular expressions...

...or, your boss doesn't know what he's talking about. The whole reason
to use Ruby is to save yourself work. Suppose you want the contents of
each title tag, just as an example:

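require 'mechanize'   # Mechanize depends on Nokogiri and loads it
mech = Mechanize.new  # this class was called WWW::Mechanize before mechanize 1.0
mech.get 'http://www.shoe-g.com/index.rdf'
doc = Nokogiri(mech.page.body)
titles = (doc / 'title').map(&:text)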

Ask your boss if it's really worth it to spend days or months trying to
get it right, when you could be using five lines to download and parse
it much more simply and accurately than a regular expression would allow.

And if your boss insists, even after seeing this, you might want to
start looking for a new job -- that one won't last long.
 
David Masover

Phlip said:
Next, REXML will break if you point it at nearly any website.

Except the question was about XML, not HTML. And the document indicated:

http://www.shoe-g.com/index.rdf

seems to parse fine with rexml:

irb(main):001:0> require 'open-uri'
=> true
irb(main):002:0> require 'rexml/document'
=> true
irb(main):003:0> open('http://www.shoe-g.com/index.rdf'){|f| REXML::Document.new f}
=> <UNDEFINED> ... </>
irb(main):004:0> _.root
=> <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
xmlns:dc='http://purl.org/dc/elements/1.1/'
xmlns:sy='http://purl.org/rss/1.0/modules/syndication/'
xmlns:admin='http://webns.net/mvcb/'
xmlns:cc='http://web.resource.org/cc/' xmlns='http://purl.org/rss/1.0/'>
... </>

I'll second Nokogiri, but I think we agree on the main point: Use a
library. Any library. The world does not need another hacked-together,
home-grown, broken XML parser.
 
Robert Klemme

Regex is not stateful, thus you can't use it to parse XML. Oh there
are ways to hack yourself around some limitations and get some
results, but you are going to spend a TON of time making very
unreadable Regex that will die at the presence of the slightest malformed
XML. Your boss obviously has no idea what he's talking about. If
anything, use REXML because, as another poster said, it's a part of
Ruby and not an external library.

Funny as it goes, REXML is actually named that way because it uses
regular expressions internally. :)

But of course you are right: there is no easy, straightforward way to use
a regular expression to properly parse an XML document. The best you
can probably do is to match tags and thus split the document into chunks
which are either tags or text, which can then be used to build up a nested
structure etc. - in other words: reinvent the wheel.
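
To make the "split into chunks, then build a nested structure" idea concrete, here is a rough sketch. The Node struct and the build_tree method are invented for this illustration, and the input is assumed to contain nothing but plain opening tags, closing tags and text: no attributes, no self-closing tags, no comments, no CDATA, no entities.

```ruby
# Hypothetical node type: an element name plus child nodes and text strings.
Node = Struct.new(:name, :children)

# Split the input into tag chunks and text chunks with one regexp, then use
# a stack of open elements to rebuild the nesting.
def build_tree(xml)
  root  = Node.new(:document, [])
  stack = [root]

  xml.scan(/<[^>]+>|[^<]+/) do |chunk|
    if chunk.start_with?("</")
      # Closing tag: it must match the element currently on top of the stack.
      closed = stack.pop
      raise ArgumentError, "unexpected #{chunk}" unless closed.name == chunk[2..-2]
    elsif chunk.start_with?("<")
      # Opening tag: attach a new node to the current element and descend into it.
      node = Node.new(chunk[1..-2], [])
      stack.last.children << node
      stack.push(node)
    elsif !chunk.strip.empty?
      # Text chunk: attach it to the current element, skipping pure whitespace.
      stack.last.children << chunk
    end
  end

  raise ArgumentError, "unclosed <#{stack.last.name}>" unless stack.size == 1
  root
end

tree     = build_tree("<root><some-tag>some text</some-tag></root>")
some_tag = tree.children.first.children.first
puts some_tag.name
puts some_tag.children.first
```

Everything this sketch leaves out is exactly what an existing parser already handles, which is the point being made above.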

If your boss makes these kinds of requests often, you should probably
go looking for another job, IMO

+2

Cheers

robert
 
Robert Klemme

Arun said:
One problem is compatibility. I'm developing an application that
extracts the XML tags from a URL like 'http://www.shoe-g.com/index.rdf'
and displays the contents within them. So compatibility is an issue.

Between what and what? REXML is pure Ruby, so it runs on all platforms.

Arun said:
My boss is strict about not using any complex libraries.

This has some implications:

- You can never tackle complex problems, because you always have to
write everything from scratch - and you'll be late on *any* project plan
with this approach.

- Apparently your boss judges without knowing the facts (REXML and
others are _not_ complex to _use_ as has been demonstrated).

- Also it seems your boss's understanding of software engineering
needs some serious improvement. Picking the right tool for a job is a
significant part of it and has already saved tons of working hours all
over the world. You build applications by plugging together self
written and externally obtained components - that's the only
economically viable way.

- You cost him nothing or he does not care about how you spend your time.

This is really one of the most ridiculous things I have read in years.
If he argued from a steep learning curve or expensive commercial
software, I could understand it, but a strict rejection of "complex libraries"?

Arun said:
Can you please help me?

Yes, update your resume and run for a better place to work.

Really, you would also be wasting everybody else's time by trying to
extract all the details on how to do something manually which has been
built already and which you get for free (i.e. with no extra charge
or installation hassle).

Good luck!

robert
 
