newbie to Ruby

Charles Pareto

Hi,
I was reading through Learning Ruby and was trying to get the Google-scraping
example on page 119 to work. But when I run it, nothing happens.
Any help would be appreciated. Thanks.

require 'open-uri'

url = "http://www.google.com/search?q=ruby"

open(url) { |page|
  page_content = page.read
  links = page_content.scan(/<a class=1.*?href=\"(.*?)\"/).flatten
  links.each { |link| puts link }
}
 
Mark Gallop

Hi Charles,

Charles said:
links = page_content.scan(/<a class=1.*?href=\"(.*?)\"/).flatten
I don't think that regular expression (regexp) works. Maybe Google has
changed their code since the book was written. I think it now goes "href"
then "class".

If you work out the correct regexp, let us know.

Cheers,
Mark
 
Dan Zwell

Mark said:
I don't think that regular expression (regexp) works. Maybe Google has
changed their code since the book was written. I think it now goes "href"
then "class".

If you work out the correct regexp, let us know.

As Mark said, Google changed their code somewhat. If you work out the
correct regular expression and it still seems to give erratic results,
here is a hint: the naive solution uses ".*?" in a certain place, but
that will still match too many results. Try [^"]*? instead, because you
probably don't want to match quotes. (I just tried this, and that was
the problem I encountered.)
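For example, with a made-up bit of markup (hypothetical, just to show
the difference):

html = '<a href="http://example.com/ad">Ad</a> <a href="http://example.com/result" class=l>Result</a>'

# .*? is non-greedy, but "." still matches quote characters, so the
# match can run across the first closing quote and into the next tag:
p html.scan(/<a href="(.*?)" class=l/).flatten
#=> ["http://example.com/ad\">Ad</a> <a href=\"http://example.com/result"]

# [^"]*? cannot cross a quote, so it stays inside a single href value:
p html.scan(/<a href="([^"]*?)" class=l/).flatten
#=> ["http://example.com/result"]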

By the way, a robust regex to match all HTML links looks kind of nasty,
but perhaps you should try writing one--it's a good exercise. (Of
course, that's not what you want for this--you want to match all links
of class=l.)

Regards,
Dan
 
John Joyce

As those guys said, Google probably changed their code since the book
was written.
That's not to prevent web scraping; it's just that web sites are
pretty transitory. They change all the time and very easily. This
makes sophisticated web scraping a moving target.
 
Jaime Iniesta

John Joyce said:
That's not to prevent web scraping; it's just that web sites are
pretty transitory. They change all the time and very easily. This
makes sophisticated web scraping a moving target.

Yes, web scraping using just open-uri and regular expressions is
pretty low-level.

Try Hpricot or scRUBYt for higher-level, more flexible scraping.
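For example, a minimal Hpricot sketch (assuming Google still marks its
result links with class=l; the selector is a guess and may need
adjusting):

require 'rubygems'
require 'hpricot'
require 'open-uri'

# Parse the result page and print the href of each class=l link.
doc = Hpricot(open("http://www.google.com/search?q=ruby"))
(doc/"a.l").each { |link| puts link['href'] }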

--
Jaime Iniesta
http://jaimeiniesta.com - http://railes.net - http://freelancegirona.com
 
Charles Pareto

Jaime said:
Yes, web scraping using just open-uri and regular expressions is
pretty low-level.

Try Hpricot or scRUBYt for higher-level, more flexible scraping.

So I tried out what everyone suggested and got it to work. Here is
what I did.

page_content.scan(/<a href=\"([^"]*?)\" class=l[^"]*?/).flatten
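For reference, the whole script with that pattern looks something like
this (the trailing [^"]*? in my line above is lazy with nothing after
it, so it only ever matches an empty string and can be dropped; Google
changes its markup often, so this may need updating again):

require 'open-uri'

url = "http://www.google.com/search?q=ruby"

open(url) { |page|
  page_content = page.read
  links = page_content.scan(/<a href=\"([^"]*?)\" class=l/).flatten
  links.each { |link| puts link }
}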
 
