jzakiya
I'm trying to scrape some data off websites using Nokogiri:
require 'rubygems'
require 'open-uri'
require 'nokogiri' #using the latest 1.4.0
url = 'http://www.whateverwebsitenameis.org'
doc = Nokogiri::HTML(open(url))
This gets me data off the website I want to scrape.
The segment of the site I want looks like this (from Firefox's 'view
source'; the 1) to 11) prefixes are my own line labels):
-------------------------------------------------------------------------
<h2>Association Detail</h2>
<div class="sectionHeaderText" style="padding-bottom: 6pt;">DETAIL
DIRECTORY RESULTS</div>
1) <b>Some Institute name</b><Br><br>
2) some address<Br> city, st zip<br>
3)
4) United States <Br>
5)
6) Phone:
7)
8) (123) 456-7890<Br>
9)
10) <br>
11) Web address: <a href="Http://www.xyz.org"
target="_Blank">www.xyz.org</a><Br>
<br><br>
<A href="javascript:history.back();">Back to Search Results</a><br><br>
<A href="AssociationSearch.cfm">Search Again</a>
</td>
---------------------------------------------------------------------------------
I want to scrape and collect the data between lines 1-11, i.e., name,
address, city, st, zip, United States, and phone number, and from line 11
I want the website URL: 'http://www.xyz.org'
I can find the beginning of this section of code by doing this:
doc.css('h2').each do |elem| puts elem.content end
which displays 'Association Detail'
I am having problems using this as the starting point to parse the
data in lines 1-11, which contain the specific 'Association Detail'
details. I've tried it with 'xpath' and 'search' according to the
example here: http://rdoc.info/projects/tenderlove/nokogiri
but there's something I'm just not getting right when I try to use
other elements to get the info from.
My system is Windows XP, Ruby 1.8.6, Nokogiri 1.4.0
Thanks in advance for any help.