hpricot parsing

Marc Farber · Apr 19, 2009

Ruby newbie here

I have successfully used Hpricot to scrape the correct <div> from the desired page,
http://www.montgomeryadvertiser.com/section/obits, using

doc = Hpricot(uri above)
...
@grab1 = doc.search("//div[@class='article-bodytext']")

The target data has the following logical form:

<div>
<h3>name of funeral home</h3>
<p>deceased1</p>
<div>advertising crap</div>
<h3>funeral home 2</h3>
<p>deceased 2</p>
<p>deceased 3</p>
</div>

I'm struggling to iterate through this div and build an array or hash from
which I can feed a database, with each record being a funeral home and a person.
I was thinking I could go through each of the @grab1 elements, process it
according to its tag type, and establish the "record" logic simply by
knowing that a new record starts with each new h3 tag.

Any help for a newbie with first Ruby script?

Thx

7stud -- · Apr 20, 2009

Marc said:
Ruby newbie here

Have successfully used hpricot to scrape correct <div> from desired page
http://www.montgomeryadvertiser.com/section/obits using

doc = Hpricot(uri above)
...
@grab1 = doc.search("//div[@class='article-bodytext']")

target data is in following logical form

<div>
<h3>name of funeral home</h3>
<p>deceased1</p>
<div>advertising crap</div>
<h3>funeral home 2</h3>
<p>deceased 2</p>
<p>deceased 3</p>
</div>

I'm struggling to iterate thru this div..
I [want to insert a record into a table with each] record being a funeral home and person.
I was thinking I could go thru each of the @grab1 elements and process
according to tag type:

These methods seem like the ones you need:

elm.next_sibling (skips the newlines in the html)
elm.name

How about this:

require "rubygems"
require 'hpricot'

str = <<ENDOFSTRING
<div>
<h3>name of funeral home</h3>
<p>deceased1</p>
<div>advertising crap</div>
<h3>funeral home 2</h3>
<p>deceased 2</p>
<p>deceased 3</p>
</div>
ENDOFSTRING

doc = Hpricot(str)
h3_tags = doc.search("h3")

h3_tags.each do |h3|
  elm = h3

  # Walk the siblings that follow this h3; stop at the first one that is not a <p>.
  while elm = elm.next_sibling
    break if elm.name != 'p'

    puts h3.inner_text
    puts "\t #{elm.inner_text}"
  end
end

--output:--
name of funeral home
	 deceased1
funeral home 2
	 deceased 2
funeral home 2
	 deceased 3

7stud -- · Apr 20, 2009

7stud said:
h3_tags.each do |h3|
  elm = h3

  while elm = elm.next_sibling
    break if elm.name != 'p'

    puts h3.inner_text
    puts "\t #{elm.inner_text}"
  end
end

To avoid having to look up the inner_text of the funeral home for each
deceased person at that funeral home, this would be more efficient:

h3_tags.each do |elm|
  funeral_home = elm.inner_text

  while elm = elm.next_sibling
    break if elm.name != 'p'

    puts funeral_home
    puts "\t #{elm.inner_text}"
  end
end
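
Since the original goal was to feed a database rather than print, the same grouping rule can collect records instead. What follows is only a rough sketch in plain Ruby, not Hpricot code: it assumes the div's child elements have already been reduced to [tag name, text] pairs (for example from elm.name and elm.inner_text), and the names `nodes` and `build_records` are made up for illustration.

```ruby
# Hypothetical input: the div's child elements as [tag name, text] pairs.
nodes = [
  ["h3",  "name of funeral home"],
  ["p",   "deceased1"],
  ["div", "advertising crap"],
  ["h3",  "funeral home 2"],
  ["p",   "deceased 2"],
  ["p",   "deceased 3"]
]

# Returns one hash per (funeral home, person) pair.
def build_records(nodes)
  records = []
  funeral_home = nil # nil means "not currently inside a run of <p> tags after an <h3>"

  nodes.each do |tag, text|
    case tag
    when "h3"
      funeral_home = text # each h3 starts a new group
    when "p"
      records << { :funeral_home => funeral_home, :person => text } if funeral_home
    else
      funeral_home = nil # any other tag ends the group, like the break in the loop above
    end
  end

  records
end

records = build_records(nodes)
records.each { |r| puts "#{r[:funeral_home]}\t#{r[:person]}" }
```

Each hash in `records` then maps onto one database row. Like the loops above, this drops any <p> that appears after a non-<p> element and before the next <h3>; whether that is the right rule depends on the real page.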

Marc Farber · Apr 20, 2009

Thanks so much 7-stud

I had been fixated on next_child thinking that next_sibling would skip
over the "p" tags. I really appreciate your thoughtfulness to provide a
working code snippet.

Marc

Wang Jian · Apr 20, 2009

Makes me wonder if REXML, Hpricot or Nokogiri has a to_hash method... I haven't
found one yet.
I'd also be glad to know.

Phlip · Apr 20, 2009

Wang said:
Makes me wonder if REXML, Hpricot or Nokogiri has a to_hash method... I haven't
found one yet.

Try to write it. I hope I'm wrong, but I suspect that starting will be easy, and
hitting your own target XML will be easy...

...but making it generic enough to publish will be another story!
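
To make that point concrete, here is one possible first attempt, using REXML from Ruby's standard distribution. It is only a sketch: `element_to_hash` is a hypothetical helper, not a method that REXML, Hpricot or Nokogiri provides, and the sample XML is a well-formed fragment of the structure discussed above.

```ruby
require "rexml/document"

# Hypothetical helper, not part of REXML: turn an element into nested hashes.
def element_to_hash(element)
  children = element.elements.to_a

  # A leaf element is represented by its (stripped) text.
  return element.text.to_s.strip if children.empty?

  hash = {}
  children.each do |child|
    # Repeated tag names are collected into arrays so nothing is overwritten.
    (hash[child.name] ||= []) << element_to_hash(child)
  end
  hash
end

xml = "<div><h3>funeral home 2</h3><p>deceased 2</p><p>deceased 3</p></div>"
root = REXML::Document.new(xml).root
result = element_to_hash(root)
p result
```

The easy part works, but the sketch already ignores attributes and text mixed in between child elements, and it loses the order of siblings with different tag names. That order is exactly what ties each <p> to the <h3> before it in the obituary page, so a generic to_hash would not have helped with the original problem.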
