Using Nokogiri


jzakiya

I'm trying to scrape some data off websites using Nokogiri:

require 'rubygems'
require 'open-uri'
require 'nokogiri' #using the latest 1.4.0


url = 'http://www.whateverwebsitenameis.org'

doc = Nokogiri::HTML(open(url))

This gets me data off the website I want to scrape.

The segment of the site I want looks like this (from Firefox's 'view
source'):

-------------------------------------------------------------------------
<h2>Association Detail</h2>

<div class="sectionHeaderText" style="padding-bottom: 6pt;">DETAIL
DIRECTORY RESULTS</div>

1) <b>Some Institute name</b><Br><br>
2) some address<Br> city, st zip<br>
3)
4) United States <Br>
5)
6) Phone:
7)
8) (123) 456-7890<Br>
9)
10) <br>
11) Web address: <a href="Http://www.xyz.org"
target="_Blank">www.xyz.org</a><Br>

<br><br>

<A href="javascript:history.back();">Back to Search Results</
a><br><br>


<A href="AssociationSearch.cfm">Search Again</a>

</td>
---------------------------------------------------------------------------------

I want to scrape and collect the data on lines 1-11, i.e., name,
address, city, st, zip, United States, phone number, and from line 11
I want the website url: 'http://www.xyz.org'

I can find the beginning of this section of code by doing this:

doc.css('h2').each do |elem| puts elem.content end
which displays 'Association Detail'

I am having problems using this as the starting point to parse the
data in lines 1-11 which contain the specific 'Association Detail'
details. I've tried it with 'xpath' and 'search' according to the
example here: http://rdoc.info/projects/tenderlove/nokogiri

but there's something I'm just not getting right when I use other
elements to get the info from.

My system is Windows XP, Ruby 1.8.6, Nokogiri 1.4.0

Thanks in advance for any help.
 

7stud --

A css search is most useful for finding tags based on their 'class'
attribute or 'id' attribute. Because the <h2> tag doesn't have any
attributes, you are simply searching by tag name ('h2' is a plain type
selector), so you could equally do this with xpath:

doc.xpath('//h2').each do |h2|
  puts h2.content
end

That uses xpath notation to find all h2 tags on the page. Then you
might write something like this:

doc = Nokogiri::HTML.parse(html)

doc.xpath('//h2').each do |h2|
  if h2.content == "Association Detail"
    puts "---"
    puts h2.next.content
    puts "---"
  end
end

Knowing you can do that will enable you to write something like this:

results = []

doc.xpath('//h2').each do |h2|
  if h2.content == "Association Detail"
    curr_elmt = h2

    while (curr_elmt = curr_elmt.next)
      curr_content = curr_elmt.content
      results << curr_content
      break if curr_content.include?("Web address:")
    end
  end
end

results.each do |result|
  puts "--start--"
  puts result
  puts "--end--"
  puts
end


output=

--start--
DETAIL
DIRECTORY RESULTS
--end--

--start--
Some Institute name
--end--

--start--

--end--

--start--

--end--

--start--

some address city, st zip

United States

Phone:

(123) 456-7890
) Web address: www.xyz.orgBack to Search Results
a>Search Again
--end--


As you can see, the html is pretty bad, so your results aren't that
great. You will have to figure out how to extract the data you need
from those strings.
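
As a starting point for that cleanup, here is a minimal sketch using only core Ruby. The sample string is a made-up stand-in shaped like the last block of output above, and the phone pattern assumes the "(123) 456-7890" layout shown in this thread.

```ruby
# Hypothetical sample, shaped like the last string in the output above.
raw = "\nsome address city, st zip\n\nUnited States\n\nPhone:\n\n(123) 456-7890\n"

# Split on line breaks, collapse runs of whitespace, drop blank lines.
fields = raw.split(/\r?\n/).map { |line| line.gsub(/\s+/, ' ').strip }
fields = fields.reject { |line| line.empty? }

# Pick out the line that looks like a US phone number (assumed format).
phone = fields.find { |line| line =~ /\(\d{3}\)\s*\d{3}-\d{4}/ }

puts fields.inspect
puts phone
```

Each element of results could be run through the same split/strip/reject steps before any field-specific matching.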
 

7stud --

I chopped off the top of my code, which looks like this:


require 'rubygems'
require 'nokogiri'


html =<<ENDOFHTML
<html>
<body>
<h2>Association Detail</h2>

<div class="sectionHeaderText" style="padding-bottom: 6pt;">DETAIL
DIRECTORY RESULTS</div>

<b>Some Institute name</b><Br><br>
some address<Br> city, st zip<br>

United States <Br>

Phone:

(123) 456-7890<Br>

<br>
) Web address: <a href="Http://www.xyz.org"
target="_Blank">www.xyz.org</a><Br>

<br><br>

<A href="javascript:history.back();">Back to Search Results</
a> said:
doc = Nokogiri::HTML.parse(html)
<snip>
 

7stud --

Argh. Now I've chomped off the bottom of the html. This is what I used:

require 'rubygems'
require 'nokogiri'

html =<<ENDOFHTML
<html>
<body>
<h2>Association Detail</h2>

<div class="sectionHeaderText" style="padding-bottom: 6pt;">DETAIL
DIRECTORY RESULTS</div>

<b>Some Institute name</b><Br><br>
some address<Br> city, st zip<br>

United States <Br>

Phone:

(123) 456-7890<Br>

<br>
) Web address: <a href="Http://www.xyz.org"
target="_Blank">www.xyz.org</a><Br>

<br><br>

<A href="javascript:history.back();">Back to Search Results</
a><br><br>


<A href="AssociationSearch.cfm">Search Again</a>
</body>
</html>
ENDOFHTML


doc = Nokogiri::HTML.parse(html)

...rest of code
 

Mark Thomas

This should get what you want:

prefix = '//div[@class="sectionHeaderText"]/following-sibling::'
xpaths = {
  :name => "#{prefix}b/text()",
  :addr => "#{prefix}text()[2]",
  :citystzip => "#{prefix}text()[3]",
  :country => "#{prefix}text()[4]",
  :phone => "#{prefix}text()[5]",
}
xpaths.each do |data,xpath|
  puts "#{data} = " + doc.search(xpath).to_s.strip
end

-- Mark.
 

7stud --


I was wondering if you could answer some xpath questions? I would think
that in this xpath:

div[@class="sectionHeaderText"]/following-sibling::text()[2]

the part:

div[@class="sectionHeaderText"]/following-sibling

would be the <b> tag. Then:

div[@class="sectionHeaderText"]/following-sibling::text()

would be the <b> tag's text or "Some Institute name". So then the
following [2]:

div[@class="sectionHeaderText"]/following-sibling::text()[2]

doesn't seem applicable. And in fact, when I run your code, it doesn't
work:


addr =
citystzip =
name = Some Institute name
country =
phone =



===========

html =<<ENDOFHTML
<html>
<body>
<h2>Association Detail</h2>

<div class="sectionHeaderText" style="padding-bottom: 6pt;">DETAIL
DIRECTORY RESULTS</div>

<b>Some Institute name</b><Br><br>
some address<Br> city, st zip<br>

United States <Br>

Phone:

(123) 456-7890<Br>

<br>
) Web address: <a href="Http://www.xyz.org"
target="_Blank">www.xyz.org</a><Br>

<br><br>

<A href="javascript:history.back();">Back to Search Results</
a><br><br>


<A href="AssociationSearch.cfm">Search Again</a>
</body>
</html>
ENDOFHTML


doc = Nokogiri::HTML.parse(html)

prefix = '//div[@class="sectionHeaderText"]/following-sibling::'
xpaths = {
:name => "#{prefix}b/text()",
:addr => "#{prefix}text()[2]",
:citystzip => "#{prefix}text()[3]",
:country => "#{prefix}text()[4]",
:phone => "#{prefix}text()[5]",
}
xpaths.each do |key, val|
puts "#{key} = " + doc.search(val).to_s.strip
end
 

jzakiya


7stud's approach works, but Mark's doesn't (currently).
Here's the file I created which will get me all the raw
data I want (still have to process to get to final form).

file: scrape.rb
-------------------------
require 'rubygems'
require 'open-uri'
require 'nokogiri'

def scrape(id)
  id = id.to_s
  url = "http://www.xyz.org/../../..ID=#{id}"
  doc = Nokogiri::HTML.parse(open(url))

  results = []

  doc.xpath('//h2').each do |h2|
    if h2.content == "Association Detail"
      curr_elmt = h2
      # Walk the siblings after the <h2>, keeping only non-blank text
      while (curr_elmt = curr_elmt.next)
        curr_content = curr_elmt.content.gsub(/\n|\t|\r/,'').squeeze(' ').strip
        results << curr_content unless curr_content.empty?
        break if curr_content.include?("Back to Search Results")
      end
    end
  end

  # results holds only non-blank strings at this point
  results.each do |result|
    puts "--start--"
    puts result
    puts "--end--"
  end
  return results
end
---------------------------------------

So I just 'require' this file, and can then do:
info = scrape 1234

where 'info' is the array 'results'. I can then process
that to my heart's delight.
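
As an illustration of that kind of post-processing, here is a rough sketch in core Ruby for splitting a "city, st zip" line. The helper name split_city_state_zip and the "City, ST 12345" layout are assumptions made for this example; the real page data may not fit them.

```ruby
# Hypothetical helper: split a "City, ST 12345" line into its parts.
# Returns nil when the line does not fit that assumed US-style layout.
def split_city_state_zip(line)
  match = line.strip.match(/\A(.+),\s*([A-Za-z]{2})\s+(\d{5}(?:-\d{4})?)\z/)
  return nil unless match
  { :city => match[1].strip, :state => match[2].upcase, :zip => match[3] }
end

p split_city_state_zip("Springfield, IL 62704")
p split_city_state_zip("no zip code here")
```

Returning nil for lines that don't match makes it easy to skip or log records that need manual attention.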

Thanks 7stud for your help.
I would, however, like to know if Mark's way can be made to work too.

Jabari
 

Mark Thomas

7stud's approach works, but Mark's doesn't (currently).

Strange... it works for me.

mark@ubuntu:~$ ruby -v
ruby 1.8.7 (2009-06-12 patchlevel 174) [i486-linux]

Nokogiri 1.4.0
libxslt 1.1.24-2ubuntu2

Here's the entire working program:

require 'rubygems'
require 'nokogiri'

html =<<ENDOFHTML
<html>
<body>
<h2>Association Detail</h2>

<div class="sectionHeaderText" style="padding-bottom: 6pt;">DETAIL
DIRECTORY RESULTS</div>

<b>Some Institute name</b><Br><br>
some address<Br> city, st zip<br>

United States <Br>

Phone:

(123) 456-7890<Br>

<br>
) Web address: <a href="Http://www.xyz.org"
target="_Blank">www.xyz.org</a><Br>

<br><br>

<A href="javascript:history.back();">Back to Search Results</
a><br><br>

<A href="AssociationSearch.cfm">Search Again</a>
</body>
</html>
ENDOFHTML

doc = Nokogiri::HTML.parse(html)
prefix = '//div[@class="sectionHeaderText"]/following-sibling::'
xpaths = {
:name => "#{prefix}b/text()",
:addr => "#{prefix}text()[2]",
:citystzip => "#{prefix}text()[3]",
:country => "#{prefix}text()[4]",
:phone => "#{prefix}text()[5]",
}
xpaths.each do |k,xpath|
puts "#{k} = " + doc.search(xpath).to_s.strip
end

# Output:
addr = some address
citystzip = city, st zip
country = United States
phone = Phone:

(123) 456-7890
name = Some Institute name
 

Mark Thomas

7stud said:

I was wondering if you could answer some xpath questions? I would think
that in this xpath:

div[@class="sectionHeaderText"]/following-sibling::text()[2]

the part:

div[@class="sectionHeaderText"]/following-sibling

would be the <b> tag.

Not quite. following-sibling:: is an axis, and it needs to be followed
by a node test. Therefore following-sibling::text() is the set of all
text nodes that are siblings after the div. After that, it's just a
matter of indexing.

doesn't seem applicable.  And in fact, when I run your code, it doesn't
work:

As I just posted in another message, it works for me. I wonder what's
different about my environment. Are you using Nokogiri 1.4.0?
 

7stud --

Mark said:
As I just posted in another message, it works for me. I wonder what's
different about my environment. Are you using Nokogiri 1.4.0?

Yes, however I get a warning message that informs me that I'm using an
outdated version of libxml2:

$ ruby -v
ruby 1.8.6 (2007-03-13 patchlevel 0) [i686-darwin8.11.1]

$ nokogiri -v
HI. You're using libxml2 version 2.6.16 which is over 4 years old and has
plenty of bugs. We suggest that for maximum HTML/XML parsing pleasure, you
upgrade your version of libxml2 and re-install nokogiri. If you like using
libxml2 version 2.6.16, but don't like this warning, please define the constant
I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 before requring nokogiri.

/usr/local/lib/ruby/gems/1.8/gems/nokogiri-1.4.0/lib/nokogiri/xml/builder.rb:272:
warning: parenthesize argument(s) for future version
---
nokogiri: 1.4.0
warnings: []

libxml:
  compiled: 2.6.16
  loaded: 2.6.16
  binding: extension


So it could be something with that, or maybe it has something to do with
the fact that ruby 1.8.7 backports some stuff from ruby 1.9.
 

jzakiya


OK, when I put Mark's code in a file and ran it (versus entering it in
an irb session) it DOES work. However, it doesn't capture the website
url, which 7stud's approach does. I haven't figured out how to do it
with this approach, and merely adding more items in xpaths doesn't
work.

So Mark, how can your approach be used to capture the url at the end
of the data section?

Here's the file I used with Mark's approach:

file: scrape1.rb
---------------------
require 'rubygems'
require 'open-uri'
require 'nokogiri'

def scrape(id)
  id = id.to_s
  url = "http://www.asaecenter.org/Directories/AssocDetail.cfm?ID=#{id}&type=association"
  doc = Nokogiri::HTML.parse(open(url))

  prefix = '//div[@class="sectionHeaderText"]/following-sibling::'
  xpaths = {
    :name => "#{prefix}b/text()",
    :addr => "#{prefix}text()[2]",
    :citystzip => "#{prefix}text()[3]",
    :country => "#{prefix}text()[4]",
    :phone => "#{prefix}text()[5]",
    :web => "#{prefix}text()[6]",
    :url => "#{prefix}text()[7]"
  }

  results = {}
  xpaths.each do |data,xpath|
    # Strip newlines/tabs and collapse repeated spaces in each match
    results[data] = doc.search(xpath).to_s.gsub(/\n|\t|\r/,'').squeeze(' ').strip
    puts "#{data} = " + results[data]
  end
  return results
end
 
 

Mark Thomas

   :url => "#{prefix}text()[7]"

You'll need to modify that last line. Unlike the other items, the URL
is not in a text node, it is the href attribute of the first <a>
element. So try:

:url => "#{prefix}a[1]/@href"
 

Mark Thomas

$ nokogiri -v

Cool! I didn't know about that.

libxml:
  compiled: 2.6.16
  loaded: 2.6.16
  binding: extension

This is most likely the problem.

Mine reports:
libxml:
loaded: 2.7.5
binding: extension
compiled: 2.7.5

with no warnings.

Can you install a newer version of libxml2? As you can see from
http://xmlsoft.org/news.html, your version dates back to 2004 with
tons of bug fixes (including XPath fixes) since.
 

7stud --


I've looked into installing newer versions of libxml2 and libxslt, but
it looks complicated and fraught with danger for mac os x.

...
 

jzakiya

Mark said:
You'll need to modify that last line. Unlike the other items, the URL
is not in a text node, it is the href attribute of the first <a>
element. So try:

    :url => "#{prefix}a[1]/@href"

Yes, this allows me to capture the url I want (and sometimes ones I
don't want), and I'm able to post-process xpaths to get everything I
need.
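
One possible way to weed out the unwanted ones, sketched in core Ruby: when a record has no web address, a[1] presumably lands on the "Back to Search Results" link instead, so the href can be checked before it is kept. The helper name web_url? is made up for this example.

```ruby
# Hypothetical helper: true only for absolute http/https links.
# The match is case-insensitive because the page writes "Http://".
def web_url?(href)
  !(href.to_s =~ /\Ahttps?:\/\//i).nil?
end

hrefs = ["Http://www.xyz.org", "javascript:history.back();", "AssociationSearch.cfm"]
wanted = hrefs.select { |href| web_url?(href) }
puts wanted.inspect
```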

xpaths = {
:name => "#{prefix}b/text()",
:addr => "#{prefix}text()[2]",
:citystzip => "#{prefix}text()[3]",
:country => "#{prefix}text()[4]",
:phone => "#{prefix}text()[5]",
:url => "#{prefix}a[1]/@href"
}

Now, I just need to understand completely WHY/HOW it works. :)

Jabari
 

Mark Thomas


  xpaths = {
   :name => "#{prefix}b/text()",
   :addr => "#{prefix}text()[2]",
   :citystzip => "#{prefix}text()[3]",
   :country => "#{prefix}text()[4]",
   :phone => "#{prefix}text()[5]",
   :url => "#{prefix}a[1]/@href"
  }

Now, I just need to understand completely WHY/HOW it works. :)

Let's take the first one as an example. I noticed that everything was
after a div with the class "sectionHeaderText", so I started with
that:

//div[@class="sectionHeaderText"]

The double slash is a wildcard that means the div can be anywhere. The
part in brackets is called a predicate, and it constrains the
expression. I like to think of it as a "such that" clause. So you can
read the above as "a div such that the class is
'sectionHeaderText'." (Actually, it's the set of all divs for which it
is true, so if you had multiple divs with the same class, it would
return them all)

Then I noticed that the items you wanted were not children of the div.
The div closes before you get to the text you want. Even <br> tags are
treated as self-closing <br/> elements. Therefore almost everything
you want is at the same nesting depth, or in XPath terminology, they
are siblings. "following-sibling" is an XPath "axis" (see the
W3Schools XPath tutorial for details on these things). The name,
though, was inside a <b> element, so I used an XPath expression to get
the following sibling that happens to be a <b> element:

//div[@class="sectionHeaderText"]/following-sibling::b

Then, to get the text from within a node you use the XPath node test
text(), which selects the text nodes between the tags, including
whitespace.

//div[@class="sectionHeaderText"]/following-sibling::b/text()

And there you have the name.

Now, the other things were text nodes between <br> elements. You could
pull them all by asking for the set of text node siblings of the div:

//div[@class="sectionHeaderText"]/following-sibling::text()

But when you get more stuff than you want like that, you can index
them like an array:

//div[@class="sectionHeaderText"]/following-sibling::text()[2]

and that happens to pull the street address.

So hopefully you see how the XPaths were put together. Usually they
are a bit simpler, but like 7stud said, it was pretty crappy HTML.

-- Mark.
 
