Ruby (X)HTML Parser?

Andrei Maxim

Hi guys,

I'm starting to learn Ruby and I was thinking about a little app so I can
get things started as quickly as possible. Since I'm an avid blog reader,
the first thing that went through my mind was a small app that would extract
the RSS or Atom feed from a web page, given the URL.

My first choice was regexps, but I'm thinking that my little app may grow a
little bit more in the not-so-distant future and I might be doing more than
just extracting feeds.

I found:

* ymHTML at http://www.yoshidam.net/Ruby.html
* RAA at http://raa.ruby-lang.org/project/html-parser-2/

but they don't look really standard and RAA doesn't look like it's currently
maintained. I've also heard that there's a Rails HTML parser but I couldn't
find more info (and I'll probably ask on one of the Rails lists).

Is there a more "standard" way to parse HTML pages in Ruby?

Thanks,

Andrei
 

Alex Young

Andrei said:
Is there a more "standard" way to parse HTML pages in Ruby?

The closest you'll find to a standard is REXML, which is an XML parser
that ships in the stdlib. REXML only accepts well-formed XML, though, so
you'll want to run your HTML through Tidy first. That's an easy install.
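
For illustration, here is a minimal sketch of the REXML route, assuming the page has already been cleaned up into well-formed XHTML (by Tidy or otherwise). The `feed_urls` helper, the `FEED_TYPES` list and the sample page are made up for this example and are not part of any library. It looks for feed autodiscovery `<link>` elements and collects their `href` values:

```ruby
require 'rexml/document'

# MIME types treated as feeds here (an assumption for this sketch).
FEED_TYPES = ['application/rss+xml', 'application/atom+xml'].freeze

# Hypothetical helper: returns the href of every feed autodiscovery
# <link> element found in a well-formed XHTML string.
def feed_urls(xhtml)
  doc = REXML::Document.new(xhtml)
  urls = []
  # Walk every element below the root and keep the matching <link> tags.
  doc.root.each_recursive do |el|
    next unless el.name == 'link'
    rel = el.attributes['rel'].to_s.downcase.split
    next unless rel.include?('alternate')
    next unless FEED_TYPES.include?(el.attributes['type'].to_s.downcase)
    href = el.attributes['href']
    urls << href if href
  end
  urls
end

# Made-up sample page standing in for tidied HTML.
SAMPLE_PAGE = <<~XHTML
  <html>
    <head>
      <title>Example blog</title>
      <link rel="stylesheet" type="text/css" href="/style.css" />
      <link rel="alternate" type="application/atom+xml" href="http://example.com/index.atom" />
      <link rel="alternate" type="application/rss+xml" href="/rss.xml" />
    </head>
    <body><p>Hello</p></body>
  </html>
XHTML

puts feed_urls(SAMPLE_PAGE)
```

Relative values such as `/rss.xml` would still need to be resolved against the page URL, for example with `URI.join` from the stdlib.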

There are a couple of alternatives: Hpricot and html-parser spring
instantly to mind.

If you're doing feed parsing, you probably also want to check out feedtools.
 

Bob Aman

Andrei said:
Since I'm an avid blog reader,

Alex said:
If you're doing feed parsing, you probably also want to check out feedtools.

Well... he probably won't learn much from the FeedTools code, but it is
convenient for this sort of thing:

irb(main):001:0> $KCODE = 'u'
=> "u"
irb(main):002:0> require 'feed_tools'
=> true
irb(main):003:0> feed = FeedTools::Feed.open('http://intertwingly.net/')
=> #<FeedTools::Feed:0x135d8fe URL:http://www.intertwingly.net/blog/index.atom>
irb(main):004:0> feed.title
=> "Sam Ruby"
irb(main):005:0> feed.subtitle
=> "It's just data"

Cheers,
Bob Aman
 
