hpricot parsing

Marc Farber

Ruby newbie here.

I have successfully used Hpricot to scrape the correct <div> from the desired page,
http://www.montgomeryadvertiser.com/section/obits, using:

doc = Hpricot(uri above)
...
@grab1 = doc.search("//div[@class='article-bodytext']")

The target data is in the following logical form:

<div>
<h3>name of funeral home</h3>
<p>deceased1</p>
<div>advertising crap</div>
<h3>funeral home 2</h3>
<p>deceased 2</p>
<p>deceased 3</p>
</div>

I'm struggling to iterate through this div and pluck out an array or hash
that I can feed to a database, with each record being a funeral home and a person.
I was thinking I could go through each of the @grab1 elements, process it
according to tag type, and establish the "record" logic simply by
knowing that a new record starts with each new h3 tag.

Any help for a newbie with a first Ruby script?


Thx
 
7stud --

Marc said:
Ruby newbie here

Have successfully used hpricot to scrape correct <div> from desired page
http://www.montgomeryadvertiser.com/section/obits using

doc = Hpricot(uri above)
...
@grab1 = doc.search("//div[@class='article-bodytext']")

target data is in following logical form

<div>
<h3>name of funeral home</h3>
<p>deceased1</p>
<div>advertising crap</div>
<h3>funeral home 2</h3>
<p>deceased 2</p>
<p>deceased 3</p>
</div>

I'm struggling to iterate thru this div..
I [want to insert a record into a table with each] record being a funeral home and person.
I was thinking I could go thru each of the @grab1 elements and process
according to tag type:

These methods seem like the ones you need:

elm.next_sibling (skips the newlines in the html)
elm.name

How about this:

require "rubygems"
require 'hpricot'

str =<<ENDOFSTRING
<div>
<h3>name of funeral home</h3>
<p>deceased1</p>
<div>advertising crap</div>
<h3>funeral home 2</h3>
<p>deceased 2</p>
<p>deceased 3</p>
</div>
ENDOFSTRING

doc = Hpricot(str)
h3_tags = doc.search("h3")

h3_tags.each do |h3|
  elm = h3

  # Walk the following sibling elements; stop at the first one that is not a <p>.
  while elm = elm.next_sibling
    break if elm.name != 'p'

    puts h3.inner_text
    puts "\t #{elm.inner_text}"
  end
end


--output:--
name of funeral home
	 deceased1
funeral home 2
	 deceased 2
funeral home 2
	 deceased 3
 
7stud --

7stud said:
h3_tags.each do |h3|
  elm = h3

  while elm = elm.next_sibling
    break if elm.name != 'p'

    puts h3.inner_text
    puts "\t #{elm.inner_text}"
  end
end

To avoid having to look up the inner_text of the funeral home for each
deceased person at that funeral home, this would be more efficient:

h3_tags.each do |elm|
  funeral_home = elm.inner_text

  while elm = elm.next_sibling
    break if elm.name != 'p'

    puts funeral_home
    puts "\t #{elm.inner_text}"
  end
end
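
If the goal is database rows rather than printed lines, the same "a new record starts at each h3" idea can be sketched without Hpricot at all. This is only a rough sketch: group_records is a made-up helper name, and the [tag, text] pairs are an assumption standing in for whatever elm.name and elm.inner_text give you for each child element. Note one difference from the loop above: it skips non-p tags instead of stopping at them, so a <p> that follows the advertising <div> would still be filed under the previous funeral home.

```ruby
# Hypothetical helper (not part of Hpricot): groups [tag, text] pairs into
# one record per deceased person, using the most recent h3 as the funeral home.
def group_records(nodes)
  records = []
  funeral_home = nil

  nodes.each do |name, text|
    case name
    when 'h3'
      # A new h3 starts a new group of records.
      funeral_home = text
    when 'p'
      # Ignore any <p> that appears before the first h3.
      records << { :funeral_home => funeral_home, :deceased => text } if funeral_home
    end
    # Any other tag (e.g. the advertising div) is skipped.
  end

  records
end

# Assumed input: stand-ins for elm.name / elm.inner_text, in document order.
nodes = [
  ['h3',  'name of funeral home'],
  ['p',   'deceased1'],
  ['div', 'advertising crap'],
  ['h3',  'funeral home 2'],
  ['p',   'deceased 2'],
  ['p',   'deceased 3']
]

records = group_records(nodes)
records.each { |r| puts "#{r[:funeral_home]}\t#{r[:deceased]}" }
```

Each hash in records is then one row to insert, with whatever database library you are using.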
 
Marc Farber

Thanks so much 7-stud

I had been fixated on next_child thinking that next_sibling would skip
over the "p" tags. I really appreciate your thoughtfulness to provide a
working code snippet.

Marc
 
Wang Jian

Makes me wonder if REXML, Hpricot or Nokogiri has a to_hash method... I haven't
found one yet.
I'd also be glad to know.
 
Phlip

Wang said:
Makes me wonder if ReXML, Hpricot or Nokogiri has a to_hash method...not yet
found.

Try to write it. I hope I'm wrong, but I suspect that starting will be easy, and
hitting your own target XML will be easy...

...but making it generic enough to publish will be another story!
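
To make that concrete, here is a rough first attempt using REXML, assuming the rexml library is available in your Ruby install. element_to_hash is a made-up name, not something REXML provides, and the sketch deliberately ignores attributes and mixed text/element content.

```ruby
require 'rexml/document'

# Hypothetical helper, not part of REXML: converts an element into nested
# hashes keyed by child tag name. Repeated tag names are collected in an array.
def element_to_hash(elem)
  children = elem.elements.to_a
  # Leaf element: return its text (empty string if there is none).
  return elem.text.to_s.strip if children.empty?

  hash = {}
  children.each do |child|
    value = element_to_hash(child)
    if hash.key?(child.name)
      # Second or later occurrence of this tag: promote the value to an array.
      hash[child.name] = [hash[child.name]] unless hash[child.name].is_a?(Array)
      hash[child.name] << value
    else
      hash[child.name] = value
    end
  end
  hash
end

xml = "<div><h3>name of funeral home</h3><p>deceased1</p>" \
      "<h3>funeral home 2</h3><p>deceased 2</p><p>deceased 3</p></div>"

result = element_to_hash(REXML::Document.new(xml).root)
p result
```

On the obituary markup this collects every h3 under one key and every p under another, so the hash no longer says which deceased belongs to which funeral home. That is the kind of thing that makes a generic version hard.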
 
