mechanize - extract href

Corey Watts · Oct 16, 2010

Hey there everyone. I'm having a slight problem using Mechanize. I'm
trying to scrape yellowpages.com and extract information about each
business listing. I'm extracting all the information I want, except for
one small portion: the business's website. It is the href of a
link that I am trying to scrape. As far as I know, I'm following the
correct xpath rules, but I can't seem to get the part I want. One
tricky thing that I've had to deal with is that not every listing has a
website. The website link and the "learn more" link are very similar,
xpath-wise, so I have to use an if statement to check the inner text of
both of them to make sure that I'm extracting the right one.

I'm scraping from
http://yellowpages.com/santa-barbara-ca/restaurants?page=1 and my code
is attached.

Thanks so much for your help!

Jethrow .. · Oct 17, 2010

The example file might not help if you're not using Windows, but it might
give you some ideas. Sorry if this isn't what you're looking for at
all...

- jethrow

Corey Watts · Oct 18, 2010

Jethrow, thanks, but that's not quite what I need. I need to extract
this link's href attribute, which is the website of the business. I'm
using xpath, using the ".../a/@href" method, which I believe is the
correct one. But it just doesn't extract anything! Any other ideas?

Mike Dalessio · Oct 18, 2010

Your xpath:

website = website.search("/a/@href")

should be:

website = website.search("./a/@href")

A leading "/" means that you want the xpath search to begin from the
root of the document. "./" means to start from the context node, in
this case `website`.

Corey Watts · Oct 18, 2010

Thanks Mike. I've updated the code. Putting "./" in my code messed up
what the code grabbed, so I eliminated all leading "/" and it worked.
However, the website part still doesn't work. It looks like it is
grabbing what is inside of the a tags, instead of grabbing the href
address. A typical listing output looks like:

McDonald's
1213 State St # B,
Santa Barbara
CA
93101
(805) 962-6976
<span class="raquo">»</span>
?
Website
["Restaurants", "Fast Food Restaurants", "American Restaurants"]

The three lines: <span>, ?, and Website are all grabbed from the website
= website.search(...) line, but it's grabbing the wrong thing! Do you
have any more suggestions?

Attachments:
http://www.ruby-forum.com/attachment/5222/mech.rb

Corey Watts · Oct 27, 2010

I still haven't figured this out. Perhaps I should phrase the question
a different way...

What is the preferred method of extracting the href attribute from a
link? I've tried doing it using .search() and searching for the xml
@href attribute. For some reason that's not working for me.

Is there a different way of extracting this attribute, without using
search and an xml path? I'm sure mechanize has some other method
too...

Robert Klemme · Oct 27, 2010

With this and a local version of the page I was able to get the info you want:

#!/bin/env ruby19

require 'nokogiri'

raw = File.read("restaurants.html", mode: "r:UTF-8")
puts raw.encoding
# raw.force_encoding 'UTF-8'
doc = Nokogiri.parse raw

doc.xpath('//div[@class="listing_content"]').each do |listing|
  puts '----------------------------------------'
  # p listing.to_s[0...10]+"..."
  puts listing
  puts '----------------------------------------'
  # p listing.xpath('.//a//text()').map(&:to_s)
  listing.xpath('.//a[@href and contains(text(),"Website")]/@href').each do |a|
    p a.value
  end
  puts
end

Cheers

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Corey Watts · Oct 28, 2010

Thanks! I eventually came up with something like this:

website = page.search("//div[@class='listing_content']/ul[@class='features']/li[1]")
# initializes the link variable with something
link = "super"
# The segment below grabs the URL of the listing's website, if it has one.
# Otherwise it puts 'nil' into the link variable
if website.inner_text() == "» Website"
  website.search("a").map do |l|
    link = l['href']
  end
else
  link = "nil"
end


Robert Klemme · Oct 29, 2010

> Thanks! I eventually came up with something like this:
>
> website = page.search("//div[@class='listing_content']/ul[@class='features']/li[1]")

Wouldn't it be better to first search for all div with
class=listing_content and then extract data for that single listing
from there?

> # initializes the link variable with something
> link = "super"

"link = nil" is sufficient.

> # The segment below grabs the URL of the listing's website, if it has one.
> # Otherwise it puts 'nil' into the link variable
> if website.inner_text() == "» Website"

You can use my XPath to find those links.

>   website.search("a").map do |l|

#map is nonsense here - you only need #each.

>     link = l['href']
>   end
> else
>   link = "nil"

Didn't you mean "link = nil"?

Kind regards

robert


Corey Watts · Oct 29, 2010

Robert, thank you so much for your help. I'm just starting out in ruby,
so I have a lot to learn! I've attached my code with the changes you
recommended.

Attachments:
http://www.ruby-forum.com/attachment/5263/mech.rb

Robert Klemme · Oct 29, 2010

A few remarks:

for x in 1...2 will execute the loop body exactly once. Are you sure
this is what you want?

You can combine all the puts at the end in a single statement.

You can replace all the ".gsub(/^\s+/, "").gsub(/\s+$/, $/)" with
".strip". You do not need to append a newline because puts does that
already.

Why do you use 'entry.search("a")' and do not include the "a" in the
path expression?
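
Two of these remarks are easy to check in plain Ruby. The sample string below is made up; the rest is standard range and String behaviour.

```ruby
# A three-dot range excludes its end value, so 1...2 contains only 1.
exclusive = (1...2).to_a
# A two-dot range includes it.
inclusive = (1..2).to_a

# Hypothetical scraped value with stray whitespace on both sides.
raw = "   McDonald's   "

# The chained gsub calls remove leading whitespace, then replace trailing
# whitespace with the record separator $/ (a newline by default).
chained = raw.gsub(/^\s+/, "").gsub(/\s+$/, $/)

# strip removes whitespace on both ends and adds nothing.
stripped = raw.strip

p exclusive
p inclusive
p chained
p stripped
```

Since puts only adds a newline when the string does not already end with one, `puts chained` and `puts stripped` print the same line.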

Cheers

robert


Corey Watts · Oct 30, 2010

Robert,

Thanks again for your help. This script above is actually part of a
larger script, but I was just showing you the smaller portion for
clarity. I've attached the whole script, in order to put the for loop
in context. I'm using the for loop to grab multiple pages in
succession. Hopefully this makes it more clear.

I've made all the changes you recommended, except consolidating the puts
statements. Could you explain how I would do that? Thank you!

Attachments:
http://www.ruby-forum.com/attachment/5267/total.rb
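
On consolidating the puts statements, a minimal sketch: puts accepts several arguments, or an array, and writes each item on its own line. The variable names are hypothetical, since the attached script is not shown here; the values are borrowed from the sample output earlier in the thread.

```ruby
name   = "McDonald's"
street = "1213 State St # B"
phone  = "(805) 962-6976"

# Instead of three separate calls (puts name / puts street / puts phone),
# a single call prints each argument on its own line.
puts name, street, phone

# An array works the same way: each element gets its own line.
fields = [name, street, phone]
puts fields

# Or join everything into one string first and print that.
report = fields.join("\n")
puts report
```

All three forms print the same three lines.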
