Need help parsing HTML with Hpricot...

  • Thread starter Just Another Victim of the Ambient Morality

Just Another Victim of the Ambient Morality

I'm having trouble understanding Hpricot (thanks to an abominable lack
of documentation). I'm trying to parse HTML of the following nature:


This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />


I know, it looks simple but, frankly, I have no clue how to parse this
with Hpricot. Particularly, I don't know how to single out the lines of
text in between the "br" tags. This is important 'cause I need to know
where the line breaks are in the text, as well as the new paragraphs.
Does anyone know how to do this with Hpricot?
Thank you...
 

Mikel Lindsaar

You can try each_child.

I will use each_child_with_index to show you what I mean:

Put your raw HTML text into @text, then:

@parsed_html = Hpricot(@text)
@parsed_html.each_child_with_index do |c, i|
  puts "Line #{i}: #{c.to_s.strip}"
end

Produces:

Line 0: This is one line of text
Line 1: <br />
Line 2: This is another line of text
Line 3: <br />
Line 4: It keeps going on like this
Line 5: <br />
Line 6:
Line 7: <br />
Line 8: Until a new paragraph is started
Line 9: <br />
Line 10: Otherwise, it's just more of the same
Line 11: <br />
Line 12:

Hope that helps.

Mikel
 

Mikel Lindsaar

Of course... you could also do:

require 'rubygems'
require 'hpricot'

text = <<HERE
This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />
HERE

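# Treat a bare <br /> node as markup to skip.
class String
  def not_needed?
    strip == "<br />"
  end
end

@parsed_html = Hpricot(text)
@paragraphs = Array.new
@parsed_html.each_child_with_index do |c, i|
  line = c.to_s.strip
  if line == ""
    # A whitespace-only text node ends the current paragraph.
    # Join explicitly: Array#to_s no longer concatenates from Ruby 1.9 on.
    puts "<p>#{@paragraphs.join}</p>"
    @paragraphs.clear
  else
    @paragraphs << "#{line} " unless line.not_needed?
  end
end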

Which produces:

<p>This is one line of text This is another line of text It keeps going on like this </p>
<p>Until a new paragraph is started Otherwise, it's just more of the same </p>
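
Note that the version above folds each paragraph onto a single line, so the original line breaks are lost. If those need to survive as well, one option is to collect each paragraph as an array of lines. The sketch below is only an illustration: child_strings and paragraphs_with_lines are made-up helper names, and child_strings uses String#split as a stand-in for the Hpricot child nodes so that it runs without the gem. With Hpricot you would feed in the c.to_s.strip values from the loop above instead.

```ruby
# Stand-in for the stripped child nodes Hpricot yields (assumption: plain
# <br> or <br /> tags only). The capture group keeps the tags in the result.
def child_strings(html)
  html.split(/(<br\s*\/?>)/i).map { |s| s.strip }
end

# Group text lines into paragraphs. A <br /> node is skipped, and an empty
# text node (whitespace between two <br /> tags) closes the current paragraph.
def paragraphs_with_lines(nodes)
  paragraphs = []
  current = []
  nodes.each do |node|
    next if node =~ /\A<br\s*\/?>\z/i
    if node.empty?
      paragraphs << current unless current.empty?
      current = []
    else
      current << node
    end
  end
  paragraphs << current unless current.empty?
  paragraphs
end

html = <<HERE
This is one line of text<br />
This is another line of text<br />
It keeps going on like this<br />
<br />
Until a new paragraph is started<br />
Otherwise, it's just more of the same<br />
HERE

paragraphs_with_lines(child_strings(html)).each do |lines|
  puts "<p>#{lines.join("<br />\n")}</p>"
end
```

Each paragraph comes back as an array of its lines, so you can re-emit the line breaks however you like (here they are put back as "br" tags inside each paragraph).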

Now... don't pick on my favorite HTML parser again! :D Just ask nicely :)

Mikel
 
