Ruby (X)HTML Parser?

Andrei Maxim · Sep 25, 2006

Hi guys,

I'm starting to learn Ruby and I was thinking about a little app so I can
get things started as quickly as possible. Since I'm an avid blog reader,
the first thing that went through my mind was a small app that would extract
the RSS or Atom feed from a web page, given the URL.

My first choice was regexps, but I'm thinking that my little app may grow a
little bit more in the not-so-distant future and I might be doing more than
just extracting feeds.

I found:

* ymHTML at http://www.yoshidam.net/Ruby.html
* html-parser (on RAA) at http://raa.ruby-lang.org/project/html-parser-2/

but they don't look really standard and html-parser doesn't look like it's
currently maintained. I've also heard that there's a Rails HTML parser but I
couldn't find more info (and I'll probably ask on one of the Rails lists).

Is there a more "standard" way to parse HTML pages in Ruby?

Thanks,

Andrei

Alex Young · Sep 25, 2006

Andrei said:
Is there a more "standard" way to parse HTML pages in Ruby?

The closest you'll find to a standard is REXML, which is an XML parser
that ships in the stdlib. It expects well-formed XML, so you'll want to run
your HTML through Tidy first, but that's an easy install.

There are a couple of alternatives: Hpricot and html-parser spring
instantly to mind.

If you're doing feed parsing, you probably also want to check out feedtools.
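
A rough sketch of the REXML route for the feed-extraction case, assuming the page has already been cleaned up into well-formed XHTML (for example by Tidy). The `feed_links` helper and the sample page are made up for illustration; the only convention relied on is the usual feed autodiscovery markup, `<link rel="alternate">` with an RSS or Atom MIME type.

```ruby
require 'rexml/document'

# MIME types that conventionally mark a feed in <link> elements.
FEED_TYPES = ['application/rss+xml', 'application/atom+xml'].freeze

# Hypothetical helper: returns the href of every <link rel="alternate">
# that advertises an RSS or Atom feed. Input must be well-formed XHTML.
def feed_links(xhtml)
  doc = REXML::Document.new(xhtml)
  hrefs = []
  # Walk every element and compare local names, so a default
  # XHTML namespace on <html> does not get in the way.
  doc.each_recursive do |el|
    next unless el.name == 'link'
    rel  = el.attributes['rel'].to_s.downcase
    type = el.attributes['type'].to_s.downcase
    href = el.attributes['href']
    hrefs << href if href && rel.split.include?('alternate') && FEED_TYPES.include?(type)
  end
  hrefs
end

# Made-up sample page standing in for Tidy output.
page = <<~XHTML
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <title>Example blog</title>
      <link rel="stylesheet" type="text/css" href="/style.css" />
      <link rel="alternate" type="application/atom+xml" href="/index.atom" />
    </head>
    <body><p>Hello</p></body>
  </html>
XHTML

puts feed_links(page)
```

The hrefs come back exactly as written in the page, so relative ones would still need resolving against the page URL. Raw, untidied HTML will usually make `REXML::Document.new` raise a parse error, which is why the Tidy step matters.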

MonkeeSage · Sep 25, 2006

Jordan said:
There's Hpricot. Haven't used it myself though.

http://code.whytheluckystiff.net/hpricot/

Hpricot is *really* nice. Also, there is the standard REXML (built-in
since 1.8). See the tutorial for some ideas on how to use it:
http://www.germane-software.com/software/rexml/docs/tutorial.html

Regards,
Jordan

why the lucky stiff · Sep 25, 2006

There's Hpricot. Haven't used it myself though.

http://code.whytheluckystiff.net/hpricot/

If you decide to use Hpricot, I'd recommend the latest 0.4.52 gems:

gem install hpricot --source code.whytheluckystiff.net

There's been a good deal of patching over the past week and a new release is
very close.

_why

Bob Aman · Sep 25, 2006

Since I'm an avid blog reader,

If you're doing feed parsing, you probably also want to check out feedtools.

Well... he probably won't learn much from the FeedTools code, but it is
convenient for this sort of thing:

irb(main):001:0> $KCODE = 'u'
=> "u"
irb(main):002:0> require 'feed_tools'
=> true
irb(main):003:0> feed = FeedTools::Feed.open('http://intertwingly.net/')
=> #<FeedTools::Feed:0x135d8fe URL:http://www.intertwingly.net/blog/index.atom>
irb(main):004:0> feed.title
=> "Sam Ruby"
irb(main):005:0> feed.subtitle
=> "It's just data"

Cheers,
Bob Aman
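
For comparison, and for anyone who wants to see what is going on underneath, here is a small sketch that reads the same two fields straight from an Atom document with REXML. This is not how FeedTools works internally; it skips fetching, autodiscovery, encodings and RSS entirely. The feed content and the `child_text` helper are invented for the example.

```ruby
require 'rexml/document'

# Made-up Atom document; a real one would be fetched over HTTP.
atom = <<~ATOM
  <feed xmlns="http://www.w3.org/2005/Atom">
    <title>Example feed</title>
    <subtitle>A made-up subtitle</subtitle>
    <entry><title>First post</title></entry>
  </feed>
ATOM

feed = REXML::Document.new(atom).root

# Hypothetical helper: text of the first direct child with the given
# local name, or nil. Only direct children are checked, so the
# <title> inside <entry> is not picked up.
def child_text(parent, name)
  child = parent.elements.find { |el| el.name == name }
  child && child.text
end

title    = child_text(feed, 'title')
subtitle = child_text(feed, 'subtitle')

puts title
puts subtitle
```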
