Ruby screen scraping

Chris Gallagher

Hi,

I'm looking at creating a Ruby script that will first access our
CruiseControl page on localhost and examine the values on the page, so
basically telling us whether the build succeeded or failed.

Does anyone have any opinions on the best way to approach this task?
I've been looking at a number of different packages, including HTree.

Thanks
 
Marcelo Alvim

Hi,

Chris said:
I'm looking at creating a Ruby script that will first access our
CruiseControl page on localhost and examine the values on the page, so
basically telling us whether the build succeeded or failed.

If you want screen scraping, I would tell you to look at why's
excellent Hpricot HTML parser. It's really simple to use and very
effective.

http://code.whytheluckystiff.net/hpricot/
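
A minimal taste (an untested sketch; the selector is made up and would
need adjusting to your actual CruiseControl page):

require 'rubygems'
require 'hpricot'
require 'open-uri'

# Hpricot happily takes an IO, so open-uri plugs straight in.
doc = Hpricot(open("http://localhost:8080/"))

# "td.status" is only an example selector.
(doc/"td.status").each { |cell| puts cell.inner_text }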

Cheers,
Alvim.
 
Daniel Lucraft

For HTML scraping I recommend scrAPI.

gem install scrapi

homepage:
http://blog.labnotes.org/category/scrapi/

Example scraper:

Scraper.define do
  attr_accessor :title, :author, :pub_date, :content

  process "div#GuardianArticle > h1", :title => :text
  process "div#GuardianArticle > font > b" do |element|
    @author = element.children[0].content
    @pub_date = element.children[2].content.strip
  end
  process "div#GuardianArticleBody", :content => :text
end
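
If I remember the scrAPI API correctly, Scraper.define returns a class
and you call scrape on it with a URI or an HTML string; roughly like
this (a sketch, the URL is made up):

require 'rubygems'
require 'scrapi'
require 'uri'

# Keep a reference to the generated scraper class so it can be reused.
article_scraper = Scraper.define do
  attr_accessor :title
  process "div#GuardianArticle > h1", :title => :text
end

# scrape should return an object carrying the declared accessors.
article = article_scraper.scrape(URI.parse("http://www.example.com/article"))
puts article.title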
 
Chris Gallagher

Thanks guys, I'll look into both of them.

Another question: how would I then get this scraped info into a MySQL
database called, say, "build", with a table called "results"?

For now, could you base answers on the following HTree code?

require 'open-uri'
require 'htree'
require 'rexml/document'

url = "http://www.google.com/search?q=ruby"
open(url) do |page|
  page_content = page.read
  doc = HTree(page_content).to_rexml
  doc.root.each_element('//a[@class="l"]') do |elem|
    puts elem.attribute('href').value
  end
end

which is returning a result of:

C:\>ruby script2.rb
http://www.ruby-lang.org/
http://www.ruby-lang.org/en/20020101.html
http://www.rubyonrails.org/
http://www.rubycentral.com/
http://www.rubycentral.com/book/
http://en.wikipedia.org/wiki/Ruby_programming_language
http://en.wikipedia.org/wiki/Ruby
http://www.w3.org/TR/ruby/
http://poignantguide.net/
http://www.zenspider.com/Languages/Ruby/QuickRef.html

Cheers.
 
Peter Szinek

Chris said:
Hi,

I'm looking at creating a Ruby script that will first access our
CruiseControl page on localhost and examine the values on the page, so
basically telling us whether the build succeeded or failed.

Once you have the page (open-uri if you know the URL exactly, or
WWW::Mechanize if you need to navigate there, i.e. fill in text fields,
click buttons etc.; see the sketch after this list), I recommend
checking out these possibilities:

1) regular expressions
2) Hpricot
3) scrAPI
4) Rubyful Soup


Regular expressions are the most old-school solution; in some cases
such a wrapper is the most robust (though since you are in control of
the generated page, as I understood it, robustness is probably not an
issue).
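
For a CruiseControl status page, something as dumb as this might
already do (the pattern is only a guess at the page's wording):

require 'open-uri'

html = open("http://localhost:8080/").read
# Adjust the regexp to whatever the page actually prints.
puts(html =~ /failed/i ? "Build failed" : "Build succeeded")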

If you can't do it with regexps, Hpricot will most probably be adequate
(I would need to see the concrete page).

Finally, if neither of the above works, you should try scrAPI; and
though I doubt you will fail even at that point, Rubyful Soup is
another possibility to check out.
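
For completeness, a WWW::Mechanize session for the "navigate there"
case looks roughly like this (a sketch; the form field name 'q' is
made up):

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
page = agent.get('http://localhost:8080/')

# Fill in and submit the first form on the page.
form = page.forms.first
form.q = 'ruby'
page = agent.submit(form)

puts page.body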


Peter
__
http://www.rubyrailways.com
 
Peter Szinek

Chris said:
Thanks guys, I'll look into both of them.

Another question: how would I then get this scraped info into a MySQL
database called, say, "build", with a table called "results"?

For now, could you base answers on the following HTree code?

require 'open-uri'
require 'htree'
require 'rexml/document'

url = "http://www.google.com/search?q=ruby"
open(url) do |page|
  page_content = page.read
  doc = HTree(page_content).to_rexml
  doc.root.each_element('//a[@class="l"]') do |elem|
    puts elem.attribute('href').value
  end
end
Something along the lines of

require "mysql"

dbh = Mysql.real_connect("localhost", "chris", "", "build")
dbh.query("INSERT INTO results VALUES ('whatever')")

Cheers,

Peter
__
http://www.rubyrailways.com
 
Peter Szinek

OK, here is the full code:

require 'open-uri'
require 'htree'
require 'rexml/document'
require 'mysql'

url = "http://www.google.com/search?q=ruby"
results = []

open(url) do |page|
  page_content = page.read
  doc = HTree(page_content).to_rexml
  doc.root.each_element('//a[@class="l"]') do |elem|
    results << elem.attribute('href').value
  end
end

# The table is called 'results', as in your earlier message.
dbh = Mysql.real_connect("localhost", "peter", "****", "build")

results.each do |result|
  dbh.query("INSERT INTO results VALUES ('#{result}')")
end
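
One caveat: interpolating the scraped value straight into the SQL will
break on values containing quotes. If I remember the mysql gem right,
escape_string takes care of that, so the loop is safer as:

results.each do |result|
  dbh.query("INSERT INTO results VALUES ('#{dbh.escape_string(result)}')")
end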

HTH,

Peter
__
http://www.rubyrailways.com
 
Chris Gallagher

Wow, thanks for that code.

One question, though: does the name of the field in the table that the
scraped information will be inserted into need to be specified in the
code? Or is it already, and I'm missing something here?
 
Peter Szinek

Chris said:
Wow, thanks for that code.
Welcome :)
One question, though: does the name of the field in the table that the
scraped information will be inserted into need to be specified in the
code? Or is it already, and I'm missing something here?

My code assumed that the table has a single column (e.g. 'url' in this
case) and that the values were inserted into that column.
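
That is, a table created with something along these lines (the column
name 'url' is just an example):

dbh.query("CREATE TABLE results (url VARCHAR(255))")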

Otherwise if you have more columns, you can do this:

INSERT INTO people
(name, age) VALUES('Peter Szinek', '23' ).

You can do

INSERT INTO people VALUES('Peter Szinek', '23' )

as well, but in this case you have to be sure that the columns in your
DB are in the same order as in your insert query. In the first example
you don't have to care about the column ordering in the DB, as long as
the mapping between the column names (first pair of parens) and the
values (second pair of parens) is OK.
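
Applied to the scraping script from before, with a hypothetical 'url'
column, that would be:

results.each do |result|
  dbh.query("INSERT INTO results (url) VALUES ('#{result}')")
end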

HTH,
Peter

__
http://www.rubyrailways.com
 
Chris Gallagher

OK, that code all works great, but I have one last question :)

This is allowing me to scrape the values of the class attributes on
tags and other attributes like that. My question is: how would I modify
the code to get it to capture a block of text such as:

<p>this is text that i want to scrape</p>

Any ideas?

Thanks.
 
Peter Szinek

Chris said:
OK, that code all works great, but I have one last question :)

This is allowing me to scrape the values of the class attributes on
tags and other attributes like that. My question is: how would I modify
the code to get it to capture a block of text such as:

<p>this is text that i want to scrape</p>

Hmm, this is hard to tell from just this example. If you need ALL the
<p>s, those can be queried with this XPath:

//p

I am not sure what you are using now, but in Hpricot this would be:

doc = Hpricot(open("http://stuff.com/"))
results = doc/"//p"
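
The text of each hit can then be pulled out with inner_text:

results.each { |para| puts para.inner_text }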

If you are still using HTree, query the same XPath there for the same results.

However, I guess you want something more sophisticated than ALL the
<p>s. Well, this is where the trouble begins with screen scraping: you
need to figure out rules which extract *exactly* what you want. Usually
it is not that hard to come up with rules that extract more or less,
but it is much harder to find exactly the right ones...

To solve this problem, you would need to tell us what you want, i.e. an
example page and the set of objects you would like to extract.

Cheers,
Peter

__
http://www.rubyrailways.com
 
James Edward Gray II

Paul said:
Really simple:

array = page_content.scan(%r{<p>(.*?)</p>}m).flatten

Returns an array, each cell of which is a paragraph from the original
page.

This is why it is a bad idea to adopt a package or library to
accomplish something that is easier to accomplish with a few lines of
code, or even one line as in this case.

At first the library seems as though it can do anything, with no need
to understand what is actually going on. Pretty quickly you encounter
something the library cannot do, and you have to ... understand what is
going on. Then you abandon the library and write normal code.

In Ruby, writing normal code is so easy that the traditional cautions
against adopting miraculous libraries should be amplified tenfold.

I hope you're not arguing that HTML should be parsed with simple
regular expressions instead of a real parser. I think most would agree
with me when I say that strategy seldom holds up for long.

James Edward Gray II
 
Peter Szinek

Hola,
James said:
I hope you're not arguing that HTML should be parsed with simple
regular expressions instead of a real parser. I think most would agree
with me when I say that strategy seldom holds up for long.

I could not agree more with James here. HTML scraping is one of the most
tedious tasks these days. Paul, how far would your scraper get with this
'HTML':

<p>This is a para.
<b/>
<p>This is another...

With Hpricot, this code

require 'rubygems'
require 'hpricot'

doc = Hpricot(open("1.html").read)
results = doc/"//p"

works without any problems.

Of course I absolutely understand your viewpoint, but messed up HTML, as
you have seen, can make a real difference...

Peter

__
http://www.rubyrailways.com
 
Peter Szinek

Paul said:
I agree completely (see my other post on this topic), but it appears
the OP was trying to read machine-generated Web content, presumably
with reliable syntax.

Then you are right, of course. I guess the problem is in the definition
of the term 'screen scraping' (or 'web extraction' or 'web mining' or
'html extraction'; people cannot even agree on its name).

For me, 'screen scraping' means the complex thing: navigating to the
document, parsing it into something meaningful, and querying the
objects of the parsed structure. In general, I am assuming that none of
these steps is trivial; maybe because I have been working at a web
extraction company for years now, and I have seen every kind of nice
trick from the other side (a.k.a. the anti-scrape camp).

Of course, if you define screen scraping as the last step only (i.e.
you have a parsed model, e.g. a well-formed page, and you need to query
that), then regular expressions are always the first thing to consider.

Since the OP was referring to a machine-generated page, I think the
latter applies; so yep, as long as he needs all the <p>s only, regular
expressions are probably the easiest thing to pull out.

Peter

__
http://www.rubyrailways.com
 
James Edward Gray II

Paul said:
In this thread, the OP started out by examining the alternatives among
specialized libraries meant to address the general problem, but
apparently never considered writing code to solve the problem directly.

Starting out by looking for a library that does the hard work for you
is a good first step, I would say. Do we really want to be discouraging
that?

Paul said:
As to modern XHTML Web pages that can pass a validator, I know from
direct recent experience that they yield to the simplest parser design,
and can be relied on to produce a tree of organized content, stripped
of tags and XHTML-specific formatting, in a handful of lines of Ruby
code.

I've seen valid XHTML that wouldn't be much fun to parse. You still
need to worry about whitespace, namespaces, the kind of quoting used,
CDATA sections, ...

James Edward Gray II
 
James Edward Gray II

Paul said:
These are all relatively easy to parse. Even the CDATA sections are
clearly and consistently delimited, so can be reliably skipped over and
encapsulated. That was the design goal of XHTML -- to be easy to parse,
to be consistent -- assuming the syntax is followed.

But if you use an already developed parser, you gain all their work on
edge cases, all their testing efforts, all their optimization work,
etc.

I see what you are saying about knowing you can count on the data, but
your messages are filled with a lot of "as long as you are sure"
conditions. Dropping a bunch of those conditions is just one more
advantage of using a library.

You say you are always surprised when people build up all this hefty
library code when a simple regex will do, but I'm always shocked when I
can replace hundreds of lines of code by loading and making use of a
library. If we have to err on one side of that, I would prefer it be on
the library-using side.

That said, I guess we'll just have to agree to disagree. Thanks for the
intelligent and civil debate.

James Edward Gray II
 
Chris Gallagher

Turns out I actually ended up abandoning HTree and the rest. I used
net/http to fetch the page, then took the table of the page that I was
interested in examining and converted it using REXML. I have now been
able to grab the values that I wanted using XPath :)

require 'net/http'
require 'uri'
require 'rexml/document'
include REXML

def fetch(uri_str, limit = 10)
  fail 'http redirect too deep' if limit.zero?
  puts "Trying: #{uri_str}"
  response = Net::HTTP.get_response(URI.parse(uri_str))
  case response
  when Net::HTTPSuccess
    response
  when Net::HTTPRedirection
    fetch(response['location'], limit - 1)
  else
    response.error!
  end
end

response = fetch('http://10.37.150.55:8080')

scraped_data = response.body

table_start_pos = scraped_data.index('<table class="index" width="100%">')
#puts table_start_pos

# index gives the start of '</table>'; add its length to include it
table_end_pos = scraped_data.index('</table>') + '</table>'.length
#puts table_end_pos

height = table_end_pos - table_start_pos

gathered_data = response.body[table_start_pos, height]

converted_data = REXML::Document.new(gathered_data)
#puts converted_data

module_name = XPath.first(converted_data, "//td[@class='data']/a")
puts module_name.text

build_status = XPath.first(converted_data, "//td[2]/em")
puts build_status.text

last_failure = XPath.first(converted_data, "//tbody/tr/td[3]")
puts last_failure.text

last_success = XPath.first(converted_data, "//tbody/tr/td[4]")
puts last_success.text

build_number = XPath.first(converted_data, "//tbody/tr/td[5]")
puts build_number.text
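
From there the values can go into MySQL the same way Peter showed
earlier; the column names here are hypothetical:

require 'mysql'

dbh = Mysql.real_connect("localhost", "chris", "", "build")
dbh.query("INSERT INTO results (module, status)
           VALUES ('#{module_name.text}', '#{build_status.text}')")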
 
