Need a regex for searching HTML code

Chirantan

I have some HTML code in a string. I want to retrieve the content
(which can be any HTML with any number of tags) inside the div, from
just after the heading to the end of the div.

Example,

<div class="info">
<h5>Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>

<div class="info">
<h5>Plot Outline:</h5>
John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. <a class="tn15more
inline" href="http://www.imdb.com/title/tt0337978/plotsummary"
onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
link=/title/tt0337978/plotsummary';">more</a>
</div>


In the above example, if "Plot Outline:" is the header I am looking
for, then the regex should give me -

John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. <a class="tn15more
inline" href="http://www.imdb.com/title/tt0337978/plotsummary"
onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
link=/title/tt0337978/plotsummary';">more</a>


And if "Tagline:" is what I am looking for, then the regex should give me -

Yippee Ki Yay Mo - John 6:27

I hope the problem statement is clear.
 
Todd Benson

Scraping html is not the easiest thing in the world. I would
recommend the hpricot library.

Todd
 
William James

Note that this will give spurious results if an html comment happens
to contain what you are looking for.

def find_header header, html
  # Put the contents of every DIV in an array.
  divs = html.scan( %r{<div.*?>(.*?)</div>}im ).flatten
  divs.each{|s|
    # Escape the header so characters such as "?" or "." match literally.
    if s =~ %r{<h(\d)>#{Regexp.escape(header)}</h\1>(.*)}im
      return $2.strip
    end
  }
  return nil
end

html = DATA.read

puts find_header( "Plot Outline:", html )

__END__
<div class="info">
<h5>Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>

<div class="info">
<h5>Plot Outline:</h5>
John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. <a class="tn15more
inline" href="http://www.imdb.com/title/tt0337978/plotsummary"
onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
link=/title/tt0337978/plotsummary';">more</a>
</div>
 
Mark Thomas

A regex will break too easily when parsing HTML. A real parser will do
a much better job, and often be more concise and readable, too.

This does what you want:

#-------
require 'rubygems'
require 'hpricot'
# html is the String holding the page source
@doc = Hpricot(html) # or Hpricot(open("filename"))

def find(term)
  @doc.search("//div[@class='info']").each do |info|
    # Detach the <h5> from the div so only the remaining content is printed.
    header = info.search("h5").remove
    if header.inner_text == term
      puts info.inner_html
    end
  end
end
#-------
find("Plot Outline:")
John McClane takes on an Internet-based terrorist organization who is
systematically shutting down the United States. <a
href="http://www.imdb.com/title/tt0337978/plotsummary" class="tn15more
inline" onclick="(new Image()).src='/rg/title-tease/plotsummary/images/b.gif?
link=/title/tt0337978/plotsummary';">more</a>

Mark
 
William James

More concise:

def find_header header, html
  html.scan( %r{<div.*?>(.*?)</div>}im ).flatten.each{|s|
    return $1.strip if s =~ %r{<h5>#{header}</h5>(.*)}im }
  return nil
end
 
Chirantan

Thank you William and Mark,

The code worked. :) Thanks a lot.
 
Mark Thomas

All the regex solutions provided will break with the following
perfectly valid HTML:

<div class="info">
<h5 >Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>

This is one of many reasons it is a BAD idea to use regexes to parse
HTML. Regular expressions are simply not the right tool for the job.
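
To make the failure concrete, here is a minimal sketch that runs the
concise find_header from earlier in the thread (with the closing ">" of
"</h5>" in place) against both variants of the markup. The sample strings
are shortened, and the names plain and spaced are only for illustration.

```ruby
# The concise regex-based helper from earlier in the thread.
def find_header(header, html)
  html.scan(%r{<div.*?>(.*?)</div>}im).flatten.each do |s|
    return $1.strip if s =~ %r{<h5>#{header}</h5>(.*)}im
  end
  nil
end

# Illustrative one-line versions of the two snippets.
plain  = %q{<div class="info"><h5>Tagline:</h5> Yippee Ki Yay Mo - John 6:27 </div>}
spaced = %q{<div class="info"><h5 >Tagline:</h5> Yippee Ki Yay Mo - John 6:27 </div>}

p find_header("Tagline:", plain)   # the text after the heading
p find_header("Tagline:", spaced)  # nil, because "<h5 >" does not match "<h5>"
```

The only difference between the two inputs is one space inside the
opening tag, which HTML allows.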
 
Florian Gilcher

What's quite interesting is that I am not able to find a nice article
on _why_ this doesn't work. So, in short:

Regexps can only parse languages that are regular (hence the name) or -
in other words - Type-3 languages in the Chomsky hierarchy [1]. This
is a rule of thumb, because many regexp libraries nowadays implement
features that let you do more than formal regular expressions can.
But for typical use, it is true.

Regular languages have no way to "look behind": whatever recognizes
them only moves forward and keeps no unbounded memory of what it has
already read. This is the reason why you cannot define a regular
language (and thus no regular expression) that describes an
arbitrarily deeply nested structure: you have no way to determine
which closing tag matches a given opening tag.

A more abstract example:
There is no (formal) regular expression that matches a word that
consists of n times "a" and then n times "b":

ab
aabb
aaabbb
aaaabbbb
etc.
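
As an aside on the "rule of thumb" caveat: Ruby's own regexp engine
(since 1.9) is one of those libraries that go beyond formal regular
expressions. The sketch below uses a recursive subexpression call,
\g<ab>, to match exactly this language. The constant name AN_BN is just
an illustrative choice, and match? assumes Ruby 2.4 or later.

```ruby
# \g<ab> re-enters the named group "ab", which makes the pattern recursive.
# This is an engine extension, not a formal regular expression, so it does
# not contradict the theory.
AN_BN = /\A(?<ab>a\g<ab>?b)\z/

%w[ab xx aabb aaabbb aaabb aaaabbbb].each do |s|
  puts AN_BN.match?(s) ? s : '-'
end
```

This prints the words with equal runs of "a" and "b" and a "-" for
"xx" and "aaabb".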

What you can do is extract a tag, push it on a stack, extract the
next one, etc., and pop them when encountering matching closing tags.
Tags by themselves can be described with regexps (afaik, this is how
TextMate does its markup).

Greetings
Skade

[1] http://en.wikipedia.org/wiki/Chomsky_hierarchy
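
The "regexp for a single tag plus a stack" idea described above could be
sketched roughly as follows. This is only an illustration under
simplifying assumptions: no comments or scripts, no ">" inside attribute
values, and a short hand-picked list of void elements. The names
balanced_tags? and VOID_TAGS are made up for this example.

```ruby
# Void elements never get a closing tag, so they are not pushed.
VOID_TAGS = %w[br hr img input meta link].freeze

# Returns true when every opening tag is closed by a matching closing tag
# in the right order.
def balanced_tags?(html)
  stack = []
  # The regexp only recognises one tag at a time; the stack tracks nesting.
  html.scan(%r{<(/?)([a-z][a-z0-9]*)[^>]*>}i) do |slash, name|
    name = name.downcase
    next if VOID_TAGS.include?(name)
    if slash.empty?
      stack.push(name)                      # opening tag
    else
      return false unless stack.pop == name # closing tag must match the top
    end
  end
  stack.empty?
end

p balanced_tags?('<div class="info"><h5 >Tagline:</h5>text</div>')
p balanced_tags?('<div><div>text</div>')
```

The first call reports balanced markup; the second leaves one div on the
stack and so reports unbalanced markup.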
 
Todd Benson

Thank you for that great explanation! I was waiting for someone to
bring up formal grammar, but I was afraid to, because I wasn't sure it
applied (not that familiar with how regexps actually work).

Todd
 
William James

Mark said:
All the regex solutions provided will break with the following
perfectly valid HTML:

<div class="info">
<h5 >Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>

Easily fixed.

def find_header header, html
  html.scan( %r{<div.*?>(.*?)</div\s*>}im ).flatten.
  each{|s|
    return $1.strip if s =~ %r{<h5\s*>#{header}</h5\s*>(.*)}im }
  return nil
end

Mark said:
This is one of many reasons it is a BAD idea to use regexes to parse
HTML. Regular expressions are simply not the right tool for the job.

Who told you that they are not? And why did you take his word for it?
Does hpricot use regular expressions?
 
William James

Florian said:
A more abstract example:
There is no (formal) regular expression that matches a word that
consists of n times "a" and then n times "b":

And that doesn't matter much. One can use as many regular expressions
as he wishes.
ab
aabb
aaabbb
aaaabbbb
etc.

"ab
xx
aabb
aaabbb
aaabb
aaaabbbb".split.each{|s|
  if s.match(/^(a+)/) and s.match(/^a+b{#{$1.size}}$/)
    puts s
  else
    puts '-'
  end
}

Or one can use regular expression + code:

"ab
xx
aabb
aaabbb
aaabb
aaaabbbb".split.each{|s|
  if s.match(/^(a+)(b+)$/) and $1.size == $2.size
    puts s
  else
    puts '-'
  end
}

What makes anyone think that a single regular expression
has to do all the work?
 
Jari Williamsson

Mark said:
All the regex solutions provided will break with the following
perfectly valid HTML:

<div class="info">
<h5 >Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>

This is one of many reasons it is a BAD idea to use regexes to parse
HTML. Regular expressions are simply not the right tool for the job.

Sorry if I'm missing the point:
---
the_text = %q{
<div class="info">
<h5 >Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>
}

the_text.each_line do |line|
  puts "Within DIV tags: #{line}" if (line=~/<div/)..(line=~/<\/div/)
  puts "Within H5 tags: #{line}" if (line=~/<h5/)..(line=~/<\/h5/)
end
---

Result:
Within DIV tags: <div class="info">
Within DIV tags: <h5 >Tagline:</h5>
Within H5 tags: <h5 >Tagline:</h5>
Within DIV tags: Yippee Ki Yay Mo - John 6:27
Within DIV tags: </div>



Best regards,

Jari Williamsson
 
Todd Benson

What if you have a div inside a div? Although, the OP said "any"
legitimate html inside a div, there's part of me that begs the
question: which div?

Todd
 
Florian Gilcher

This may work on this short snippet. Consider this:

the_text = %q{
<div class="info">
<div class="nextinfo">
<h5 >Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>
</div>
}

the_text.each_line do |line|
  puts "Within DIV tags: #{line}" if (line=~/<div/)..(line=~/<\/div/)
  puts "Within H5 tags: #{line}" if (line=~/<h5/)..(line=~/<\/h5/)
end

It doesn't see the second </div>, as it considers _both_ divs closed
(which isn't even possible to determine, as we did not save any
state). Second question: which <div> am I in at a certain point? Or,
in other words: what's the #innerText of .info, and what's the
#innerText of .nextinfo? You won't get far without a stack, and that
can be proven [1].
If this is of interest to you, consider reading a book about
theoretical computer science. It may be hard stuff, but it pays off :).[2]

Greetings
Florian Gilcher

[1] Up to the reader ;).
[2] Don't feel bad if you didn't and don't consider this as an
offense. I know many good programmers that never read any theory. But
it certainly isn't bad to know about it.
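
To illustrate the "which <div> am I in?" question, here is a rough
sketch that keeps a stack of the currently open divs and collects the
text for each one. It is not a real HTML parser: it assumes well-formed
markup, double-quoted class attributes, and no comments or scripts. The
method name div_texts and the extra "Other film facts" line in the
sample are invented for this example.

```ruby
# Maps each div's class attribute to the text found inside that div,
# including the text of any divs nested within it.
def div_texts(html)
  open_divs = []  # stack of [class_name, collected_text] for the divs we are in
  result = {}
  # Split into tags and text runs; the capture group keeps the tags.
  html.split(/(<[^>]*>)/).each do |token|
    case token
    when /\A<div\b[^>]*\bclass="([^"]*)"/i
      open_divs.push([$1, String.new])
    when /\A<div\b/i
      open_divs.push([nil, String.new])  # div without a class attribute
    when %r{\A</div\s*>}i
      name, text = open_divs.pop
      result[name] = text.strip.gsub(/\s+/, " ") if name
    when /\A</
      # Any other tag: skip the markup itself.
    else
      # Plain text belongs to every div that is currently open.
      open_divs.each { |_, text| text << token }
    end
  end
  result
end

sample = %q{
<div class="info">
<div class="nextinfo">
<h5 >Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>
Other film facts
</div>
}

div_texts(sample).each { |name, text| puts "#{name}: #{text}" }
```

Because the stack remembers which divs are still open, the inner
</div> closes only .nextinfo, and the trailing text is credited to
.info alone.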
 
Jari Williamsson

Todd said:
What if you have a div inside a div? Although, the OP said "any"
legitimate html inside a div, there's part of me that begs the
question: which div?

Sure, for real-life HTML with nested tags it'll break. I just wanted to
point out that for simple parsing needs (as the example that I replied
to) regexps can find both beginning and end tags.



Best regards,

Jari Williamsson
 
Jari Williamsson

Florian said:
This may work on this short snippet. Consider this:

the_text = %q{
<div class="info">
<div class="nextinfo">
<h5 >Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>
</div>
}

the_text.each_line do |line|
  puts "Within DIV tags: #{line}" if (line=~/<div/)..(line=~/<\/div/)
  puts "Within H5 tags: #{line}" if (line=~/<h5/)..(line=~/<\/h5/)
end

It doesn't see the second </div> as it considers _both_ divs closed.

It considers the first div closed. It never sees the other one.


Best regards,

Jari Williamsson
 
Florian Gilcher

What makes anyone think that a single regular expression
has to do all the work?


I don't know. But many think one fits. That's why I wrote this
explanation: it is something I see almost every day, and I wanted to
give some insight to those who are pondering why this is so.
So: your solution does not fit the problem, but thanks for showing
that another problem (parsing "a^n b^n" with a Turing-complete
language) can indeed be solved.

I also stated this in my last paragraph: you can solve the problem by
using regular expressions. But the language of regular expressions by
itself is not powerful enough to solve it alone.

Greetings
Florian
 
Mark Thomas

William said:
Easily fixed.
def find_header header, html
  html.scan( %r{<div.*?>(.*?)</div\s*>}im ).flatten.
  each{|s|
    return $1.strip if s =~ %r{<h5\s*>#{header}</h5\s*>(.*)}im }
  return nil
end

Easily broken again.

<div class="info">
<h5 class="header">Tagline:</h5>
Yippee Ki Yay Mo - John 6:27
</div>

The point is, regex-based parsing is fragile, and is provably
incomplete for parsing arbitrarily nested structures like HTML. A real
parser (such as a recursive descent parser) is needed. I use regular
expressions often, but when parsing HTML, XML, or other nested data, I
reach for other tools.

William said:
Who told you that they are not? And why did you take his word for it?

Experience, for one. Until I really understood parsers, I tended to
use regular expressions for everything. I've been using regular
expressions for a LONG time, and I am very comfortable with them. But
parsing HTML was always troublesome.

This has been discussed for years e.g. in Perl circles (PerlMonks,
etc) where it is well known that regexes do not fit nested data.
People with questions asking how to parse HTML with a regex will get
chided, especially with so many good parsers available in Perl. There
are good parsers available in Ruby now too, so people should be
encouraged to use them.

William said:
Does hpricot use regular expressions?

Of course not.
 
