Take a look at Michael Neumann's WWW::Mechanize. He has made it very
easy to code up an app to fetch a page, find pertinent content, follow
links, and so on.
It's on RubyForge, as part of the Wee project:
http://rubyforge.org/projects/wee
I seem to be running into difficulty with mechanize: calling page.forms
on a fetched page raises a NoMethodError while the HTML is being parsed.
$ ruby --version
ruby 1.9.0 (2005-02-08) [i686-linux]
$ gem --version
0.8.1
$ gem list --local | grep mech
mechanize (0.1.0)
$ cat test.rb
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new {|a| a.log = Logger.new(STDERR) }
page = agent.get('https://www.alcoadirect.com/')
p page.forms
$ ruby test.rb
I, [2005-02-08T09:30:10.092550 #1947] INFO -- : GET: https://www.alcoadirect.com/
warning: peer certificate won't be verified in this SSL session
D, [2005-02-08T09:30:14.305853 #1947] DEBUG -- : request-header: accept => */*
D, [2005-02-08T09:30:14.393482 #1947] DEBUG -- : header: last-modified : Tue, 08 Feb 2005 00:02:27 GMT
D, [2005-02-08T09:30:14.393568 #1947] DEBUG -- : header: content-type : text/html
D, [2005-02-08T09:30:14.393607 #1947] DEBUG -- : header: date : Tue, 08 Feb 2005 15:30:43 GMT
D, [2005-02-08T09:30:14.393646 #1947] DEBUG -- : header: server : JRun Web Server/3.0
D, [2005-02-08T09:30:14.393684 #1947] DEBUG -- : header: transfer-encoding : chunked
I, [2005-02-08T09:30:14.554417 #1947] INFO -- : status: 200
/usr/lib/ruby/gems/1.9/gems/mechanize-0.1.0/lib/mechanize.rb:115:in `parse': undefined method `downcase' for nil:NilClass (NoMethodError)
        from /usr/lib/ruby/gems/1.9/gems/mechanize-0.1.0/lib/mechanize.rb:112:in `call'
        from /usr/lib/ruby/gems/1.9/gems/mechanize-0.1.0/lib/mechanize/parsing.rb:14:in `each_recursive'
        from /usr/lib/ruby/gems/1.9/gems/mechanize-0.1.0/lib/mechanize/parsing.rb:13:in `each'
        from /usr/lib/ruby/1.9/rexml/element.rb:916:in `each'
        from /usr/lib/ruby/1.9/rexml/xpath.rb:49:in `each'
        from /usr/lib/ruby/1.9/rexml/element.rb:916:in `each'
        from /usr/lib/ruby/gems/1.9/gems/mechanize-0.1.0/lib/mechanize/parsing.rb:13:in `each_recursive'
        from /usr/lib/ruby/gems/1.9/gems/mechanize-0.1.0/lib/mechanize/parsing.rb:15:in `each_recursive'
         ... 136 levels...
        from /usr/lib/ruby/gems/1.9/gems/mechanize-0.1.0/lib/mechanize/parsing.rb:13:in `each_recursive'
        from /usr/lib/ruby/gems/1.9/gems/mechanize-0.1.0/lib/mechanize.rb:212:in `parse_html'
        from /usr/lib/ruby/gems/1.9/gems/mechanize-0.1.0/lib/mechanize.rb:164:in `forms'
        from test.rb:6
Does the HTML parser perhaps depend on well-formed HTML? Is there a
more appropriate forum for me to raise this issue? Thanks.
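
To help narrow the question down, here is a minimal sketch using only
REXML, which appears in the backtrace. It shows two separate ways a page
could trip up a REXML-based parser: markup that is not well-formed is
rejected outright, and looking up an attribute that is absent yields nil,
so calling downcase on it raises the same kind of NoMethodError. Whether
either is what happens at mechanize.rb:115 is only an assumption, since
that code is not shown here. The helpers parse_strict and attr_downcase
and the sample markup are hypothetical and not part of mechanize.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of mechanize): returns the parsed
# document, or nil if REXML rejects the markup as not well-formed.
def parse_strict(markup)
  REXML::Document.new(markup)
rescue REXML::ParseException
  nil
end

# Hypothetical helper: lower-cases an attribute value, tolerating
# the case where the attribute is missing from the element.
def attr_downcase(element, name)
  value = element.attributes[name]   # nil when the attribute is absent
  value && value.downcase
end

# Made-up sample markup, for illustration only.
well_formed = '<html><body><form method="POST"><input name="q"/></form></body></html>'
tag_soup    = '<html><body><form><input name="q"></form></body></html>'

doc   = parse_strict(well_formed)
form  = doc.root.elements['body/form']
input = form.elements['input']

p parse_strict(tag_soup).nil?     # the unclosed <input> makes REXML give up
p attr_downcase(form, 'method')   # attribute present
p attr_downcase(input, 'type')    # attribute absent: nil, no exception

# The unguarded call fails the same way as the backtrace above.
begin
  input.attributes['type'].downcase
rescue NoMethodError => e
  puts e.message
end
```

Since the backtrace shows the failure during traversal of an already
built tree, the missing-attribute case looks like the more likely of the
two, but that is a guess until the mechanize source is checked.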
Regards,
Jason