Ruby screen scraping

Chris Gallagher

Hi,

I'm looking at creating a Ruby script that will first access our
CruiseControl page on localhost and examine the values on the page, so
basically telling us whether the build succeeded or failed.

Does anyone have any opinions on the best way to approach this task?
I've been looking at a number of different packages, including HTree.

Thanks
 
Marcelo Alvim

Hi,

Chris said:
I'm looking at creating a Ruby script that will first access our
CruiseControl page on localhost and examine the values on the page, so
basically telling us whether the build succeeded or failed.

If you want screen scraping, I would tell you to look at why's
excellent Hpricot HTML parser. It's really simple to use and very
effective.

http://code.whytheluckystiff.net/hpricot/
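
A minimal taste (an untested sketch; the selector is made up and would
need adjusting to your actual CruiseControl page):

require 'rubygems'
require 'hpricot'
require 'open-uri'

# Hpricot happily takes an IO, so open-uri plugs straight in.
doc = Hpricot(open("http://localhost:8080/"))

# "td.status" is only an example selector.
(doc/"td.status").each { |cell| puts cell.inner_text }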

Cheers,
Alvim.
 
Daniel Lucraft

For HTML scraping I recommend scrAPI.

gem install scrapi

homepage:
http://blog.labnotes.org/category/scrapi/

Example scraper:

Scraper.define do
  attr_accessor :title, :author, :pub_date, :content

  process "div#GuardianArticle > h1", :title => :text
  process "div#GuardianArticle > font > b" do |element|
    @author = element.children[0].content
    @pub_date = element.children[2].content.strip
  end
  process "div#GuardianArticleBody", :content => :text
end
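
If I remember the scrAPI API correctly, Scraper.define returns a class
and you call scrape on it with a URI or an HTML string; roughly like
this (a sketch, the URL is made up):

require 'rubygems'
require 'scrapi'
require 'uri'

# Keep a reference to the generated scraper class so it can be reused.
article_scraper = Scraper.define do
  attr_accessor :title
  process "div#GuardianArticle > h1", :title => :text
end

# scrape should return an object carrying the declared accessors.
article = article_scraper.scrape(URI.parse("http://www.example.com/article"))
puts article.title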
 
Chris Gallagher

Thanks guys, I'll look into both of them.

Another question: how would I then get this scraped info into a MySQL
database called, say, "build", with a table called "results"?

For now, could you base answers on the following HTree code?

require 'open-uri'
require 'htree'
require 'rexml/document'

url = "http://www.google.com/search?q=ruby"
open(url) do |page|
  page_content = page.read
  doc = HTree(page_content).to_rexml
  doc.root.each_element('//a[@class="l"]') do |elem|
    puts elem.attribute('href').value
  end
end

which is returning a result of:

C:\>ruby script2.rb
http://www.ruby-lang.org/
http://www.ruby-lang.org/en/20020101.html
http://www.rubyonrails.org/
http://www.rubycentral.com/
http://www.rubycentral.com/book/
http://en.wikipedia.org/wiki/Ruby_programming_language
http://en.wikipedia.org/wiki/Ruby
http://www.w3.org/TR/ruby/
http://poignantguide.net/
http://www.zenspider.com/Languages/Ruby/QuickRef.html

Cheers.
 
Peter Szinek

Chris said:
Hi,

I'm looking at creating a Ruby script that will first access our
CruiseControl page on localhost and examine the values on the page, so
basically telling us whether the build succeeded or failed.

Once you have the page (open-uri if you know the URL exactly, or
WWW::Mechanize if you need to navigate there, i.e. fill in text fields,
click buttons etc.; see the sketch after this list), I recommend
checking out these possibilities:

1) regular expressions
2) Hpricot
3) scrAPI
4) Rubyful Soup


Regular expressions are the most old-school solution; in some cases
such a wrapper is the most robust (though since you are in control of
the generated page, as I understood it, robustness is probably not an
issue).
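
For a CruiseControl status page, something as dumb as this might
already do (the pattern is only a guess at the page's wording):

require 'open-uri'

html = open("http://localhost:8080/").read
# Adjust the regexp to whatever the page actually prints.
puts(html =~ /failed/i ? "Build failed" : "Build succeeded")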

If you can't do it with regexps, Hpricot will most probably be adequate
(I would need to see the concrete page).

Finally, if neither of the above works, you should try scrAPI; and
though I doubt you will fail even at that point, Rubyful Soup is
another possibility to check out.
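
For completeness, a WWW::Mechanize session for the "navigate there"
case looks roughly like this (a sketch; the form field name 'q' is
made up):

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new
page = agent.get('http://localhost:8080/')

# Fill in and submit the first form on the page.
form = page.forms.first
form.q = 'ruby'
page = agent.submit(form)

puts page.body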


Peter
__
http://www.rubyrailways.com
 
Peter Szinek

Chris said:
Thanks guys, I'll look into both of them.

Another question: how would I then get this scraped info into a MySQL
database called, say, "build", with a table called "results"?

For now, could you base answers on the following HTree code?

require 'open-uri'
require 'htree'
require 'rexml/document'

url = "http://www.google.com/search?q=ruby"
open(url) do |page|
  page_content = page.read
  doc = HTree(page_content).to_rexml
  doc.root.each_element('//a[@class="l"]') do |elem|
    puts elem.attribute('href').value
  end
end
Something along the lines of

require "mysql"

dbh = Mysql.real_connect("localhost", "chris", "", "build")
dbh.query("INSERT INTO results VALUES ('whatever')")

Cheers,

Peter
__
http://www.rubyrailways.com
 
Peter Szinek

OK, here is the full code:

require 'open-uri'
require 'htree'
require 'rexml/document'
require 'mysql'

url = "http://www.google.com/search?q=ruby"
results = []

open(url) do |page|
  page_content = page.read
  doc = HTree(page_content).to_rexml
  doc.root.each_element('//a[@class="l"]') do |elem|
    results << elem.attribute('href').value
  end
end

# The table is called 'results', as in your earlier message.
dbh = Mysql.real_connect("localhost", "peter", "****", "build")

results.each do |result|
  dbh.query("INSERT INTO results VALUES ('#{result}')")
end
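
One caveat: interpolating the scraped value straight into the SQL will
break on values containing quotes. If I remember the mysql gem right,
escape_string takes care of that, so the loop is safer as:

results.each do |result|
  dbh.query("INSERT INTO results VALUES ('#{dbh.escape_string(result)}')")
end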

HTH,

Peter
__
http://www.rubyrailways.com
 
Chris Gallagher

Wow, thanks for that code.

One question, though: does the name of the field in the table that the
scraped information will be inserted into need to be specified in the
code? Or is it already, and I'm missing something here?
 
Peter Szinek

Chris said:
Wow, thanks for that code.
Welcome :)
One question, though: does the name of the field in the table that the
scraped information will be inserted into need to be specified in the
code? Or is it already, and I'm missing something here?

My code assumed that the table has a single column (e.g. 'url' in this
case) and that the values were inserted into that column.
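
That is, a table created with something along these lines (the column
name 'url' is just an example):

dbh.query("CREATE TABLE results (url VARCHAR(255))")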

Otherwise if you have more columns, you can do this:

INSERT INTO people
(name, age) VALUES('Peter Szinek', '23' ).

You can do

INSERT INTO people VALUES('Peter Szinek', '23' )

as well, but in this case you have to be sure that the columns in your
DB are in the same order as in your insert query. In the first example
you don't have to care about the column ordering in the DB, as long as
the mapping between the column names (first pair of parens) and the
values (second pair of parens) is OK.
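
Applied to the scraping script from before, with a hypothetical 'url'
column, that would be:

results.each do |result|
  dbh.query("INSERT INTO results (url) VALUES ('#{result}')")
end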

HTH,
Peter

__
http://www.rubyrailways.com
 
Chris Gallagher

OK, that code all works great, but I have one last question :)

This is allowing me to scrape the values of the class attributes on
tags and other attributes like that. My question is: how would I modify
the code to get it to capture a block of text such as:

<p>this is text that i want to scrape</p>

Any ideas?

Thanks.
 
Peter Szinek

Chris said:
OK, that code all works great, but I have one last question :)

This is allowing me to scrape the values of the class attributes on
tags and other attributes like that. My question is: how would I modify
the code to get it to capture a block of text such as:

<p>this is text that i want to scrape</p>

Hmm, this is hard to tell from just this example. If you need ALL the
<p>s, those can be queried with this XPath:

//p

I am not sure what you are using now, but in Hpricot this would be:

doc = Hpricot(open("http://stuff.com/"))
results = doc/"//p"
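
The text of each hit can then be pulled out with inner_text:

results.each { |para| puts para.inner_text }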

If you are still using HTree, query the same XPath there for the same results.

However, I guess you want something more sophisticated than ALL the
<p>s. Well, this is where the trouble begins with screen scraping: you
need to figure out rules which extract *exactly* what you want. Usually
it is not that hard to come up with rules that extract more or less,
but it is much harder to find exactly the right ones...

To solve this problem, you would need to tell us what you want, i.e. an
example page and the set of objects you would like to extract.

Cheers,
Peter

__
http://www.rubyrailways.com
 
James Edward Gray II

Paul said:
Really simple:

array = page_content.scan(%r{<p>(.*?)</p>}m).flatten

Returns an array, each cell of which is a paragraph from the original
page.

This is why it is a bad idea to adopt a package or library to
accomplish something that is easier to accomplish with a few lines of
code, or even one line as in this case.

At first the library seems as though it can do anything, with no need
to understand what is actually going on. Pretty quickly you encounter
something the library cannot do, and you have to ... understand what is
going on. Then you abandon the library and write normal code.

In Ruby, writing normal code is so easy that the traditional cautions
against adopting miraculous libraries should be amplified tenfold.

I hope you're not arguing that HTML should be parsed with simple
regular expressions instead of a real parser. I think most would agree
with me when I say that strategy seldom holds up for long.

James Edward Gray II
 
Peter Szinek

Hola,
James said:
I hope you're not arguing that HTML should be parsed with simple
regular expressions instead of a real parser. I think most would agree
with me when I say that strategy seldom holds up for long.

I could not agree more with James here. HTML scraping is one of the most
tedious tasks these days. Paul, how far would your scraper get with this
'HTML':

<p>This is a para.
<b/>
<p>This is another...

With Hpricot, this code

require 'rubygems'
require 'hpricot'

doc = Hpricot(open("1.html").read)
results = doc/"//p"

works without any problems.

Of course I absolutely understand your viewpoint, but messed up HTML, as
you have seen, can make a real difference...

Peter

__
http://www.rubyrailways.com
 
Peter Szinek

Paul said:
I agree completely (see my other post on this topic), but it appears
the OP was trying to read machine-generated Web content, presumably
with reliable syntax.

Then you are right, of course. I guess the problem is in the definition
of the term 'screen scraping' (or 'web extraction' or 'web mining' or
'html extraction'; people cannot even agree on its name).

For me, 'screen scraping' means the complex thing: navigating to the
document, parsing it into something meaningful, and querying the
objects of the parsed structure. In general, I am assuming that none of
these steps is trivial; maybe because I have been working at a web
extraction company for years now, and I have seen every kind of nice
trick from the other side (a.k.a. the anti-scrape camp).

Of course, if you define screen scraping as the last step only (i.e.
you have a parsed model, e.g. a well-formed page, and you need to query
that), then regular expressions are always the first thing to consider.

Since the OP was referring to a machine-generated page, I think the
latter applies; so yep, as long as he needs all the <p>s only, regular
expressions are probably the easiest thing to pull out.

Peter

__
http://www.rubyrailways.com
 
James Edward Gray II

Paul said:
In this thread, the OP started out by examining the alternatives among
specialized libraries meant to address the general problem, but
apparently never considered writing code to solve the problem directly.

Starting out by looking for a library that does the hard work for you
is a good first step, I would say. Do we really want to be discouraging
that?

Paul said:
As to modern XHTML Web pages that can pass a validator, I know from
direct recent experience that they yield to the simplest parser design,
and can be relied on to produce a tree of organized content, stripped
of tags and XHTML-specific formatting, in a handful of lines of Ruby
code.

I've seen valid XHTML that wouldn't be much fun to parse. You still
need to worry about whitespace, namespaces, the kind of quoting used,
CDATA sections, ...

James Edward Gray II
 
James Edward Gray II

Paul said:
These are all relatively easy to parse. Even the CDATA sections are
clearly and consistently delimited, so can be reliably skipped over and
encapsulated. That was the design goal of XHTML -- to be easy to parse,
to be consistent -- assuming the syntax is followed.

But if you use an already developed parser, you gain all their work on
edge cases, all their testing efforts, all their optimization work,
etc.

I see what you are saying about knowing you can count on the data, but
your messages are filled with a lot of "as long as you are sure"
conditions. Dropping a bunch of those conditions is just one more
advantage of using a library.

You say you are always surprised when people build up all this hefty
library code when a simple regex will do, but I'm always shocked when I
can replace hundreds of lines of code by loading and making use of a
library. If we have to err on one side of that, I would prefer it be on
the library-using side.

That said, I guess we'll just have to agree to disagree. Thanks for the
intelligent and civil debate.

James Edward Gray II
 
Chris Gallagher

Turns out I actually ended up abandoning HTree and the rest. I used
net/http to fetch the page, then took the table of the page that I was
interested in examining and converted it using REXML. I have now been
able to grab the values that I wanted using XPath :)

require 'net/http'
require 'uri'
require 'rexml/document'
include REXML

def fetch(uri_str, limit = 10)
  fail 'http redirect too deep' if limit.zero?
  puts "Trying: #{uri_str}"
  response = Net::HTTP.get_response(URI.parse(uri_str))
  case response
  when Net::HTTPSuccess
    response
  when Net::HTTPRedirection
    fetch(response['location'], limit - 1)
  else
    response.error!
  end
end

response = fetch('http://10.37.150.55:8080')

scraped_data = response.body

table_start_pos = scraped_data.index('<table class="index" width="100%">')
#puts table_start_pos

# index gives the start of '</table>'; add its length to include it
table_end_pos = scraped_data.index('</table>') + '</table>'.length
#puts table_end_pos

height = table_end_pos - table_start_pos

gathered_data = response.body[table_start_pos, height]

converted_data = REXML::Document.new(gathered_data)
#puts converted_data

module_name = XPath.first(converted_data, "//td[@class='data']/a")
puts module_name.text

build_status = XPath.first(converted_data, "//td[2]/em")
puts build_status.text

last_failure = XPath.first(converted_data, "//tbody/tr/td[3]")
puts last_failure.text

last_success = XPath.first(converted_data, "//tbody/tr/td[4]")
puts last_success.text

build_number = XPath.first(converted_data, "//tbody/tr/td[5]")
puts build_number.text
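
From there the values can go into MySQL the same way Peter showed
earlier; the column names here are hypothetical:

require 'mysql'

dbh = Mysql.real_connect("localhost", "chris", "", "build")
dbh.query("INSERT INTO results (module, status)
           VALUES ('#{module_name.text}', '#{build_status.text}')")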
 
