Regular Expressions to parse HTML



I need to parse an HTML document of the following format.

I want to extract all the HTML from and including the first <div
class="data"> up to and including "Data updated dd/mm/yyyy" (where dd/mm/yyyy
will change). What kind of regular expression can I use? Note that I want
everything in that span of HTML, including all the tags within the div tags.

<!-- Not interested in parsing data in the header-->
<div class="head">not interested in this</div>
<div class="data">Interested in data from this first data div</div>
<div class="data">There can be <b>other tags</b> within these divs too!</div>
<a name="data3"></a>(There can be some other stuff in between the div tags)
Data updated dd/mm/yyyy
<img src="notInterested.jpg">
some other rubbish
<div class="footer">not interested</div>

Ken Arway

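Treating the input HTML as one string (C# code):

Regex regex = new Regex(@"(<div class=""data"">.*(?=<img))",
    RegexOptions.Singleline); // assumed: the post is cut off after the pattern; Singleline lets '.' span newlines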

Sample input:
<!-- Not interested in parsing data in the header-->
<div class="head">not interested in this</div>
<div class="data">Interested in data from this first data div</div>
<div class="data">There can be <b>other tags</b> within these divs too!</div>
<a name="data3"></a>(There can be some other stuff in between the div tags)
Data updated dd/mm/yyyy
<img src="notInterested.jpg">
some other rubbish
<div class="footer">not interested</div>

Sample output:
1 =»<div class="data">Interested in data from this first data div</div>
<div class="data">There can be <b>other tags</b> within these divs too!</div>
<a name="data3"></a>(There can be some other stuff in between the div tags)
Data updated dd/mm/yyyy
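
The pattern above stops just before the <img> tag, and because .* is greedy it would run on to the last <img in the document. As a rough sketch of an alternative, assuming the date always appears as two digits, two digits, and four digits separated by slashes, the match can end on the date line itself. The class and method names below are made up for illustration, and the date in the sample input is invented.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DataSectionExtractor
{
    // Hypothetical helper (not from the thread): returns the text from the
    // first <div class="data"> through the first "Data updated dd/mm/yyyy",
    // or null when no such span exists.
    public static string Extract(string html)
    {
        // .*? is lazy, so the match ends at the first date line it reaches.
        // Singleline makes '.' match newlines as well.
        var regex = new Regex(
            @"<div class=""data"">.*?Data updated \d{2}/\d{2}/\d{4}",
            RegexOptions.Singleline);

        Match match = regex.Match(html);
        return match.Success ? match.Value : null;
    }

    public static void Main()
    {
        // Sample input modelled on the question, with a made-up date.
        string html =
            "<div class=\"head\">not interested in this</div>\n" +
            "<div class=\"data\">Interested in data from this first data div</div>\n" +
            "<div class=\"data\">There can be <b>other tags</b> within these divs too!</div>\n" +
            "<a name=\"data3\"></a>(There can be some other stuff in between the div tags)\n" +
            "Data updated 24/09/2007\n" +
            "<img src=\"notInterested.jpg\">\n" +
            "<div class=\"footer\">not interested</div>\n";

        Console.WriteLine(Extract(html));
    }
}
```

Regex.Match returns the leftmost match, so the result starts at the first data div even when several are present.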
Sep 24, 2007
Old thread, but I wanted the challenge.
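use LWP::Simple;
use URI::URL;

# $content is assumed to already hold the page HTML; the fetch is not shown in the post.

$content =~ s/<div class=\"head\".*>//g;   # drop the header div line
$content =~ s/<div class=\"data\">//g;     # drop the opening data div tags
$content =~ s/<\/div>//g;                  # drop closing div tags
$content =~ s/<a.*>//g;                    # drop anchors
$content =~ s/<img.*>//g;                  # drop images
$content =~ s/.*footer.*//g;               # drop the footer line
$content =~ s/<!--.*-->//g;                # drop single-line comments
# Strip any remaining tags last, so the footer rule above can still see its div.
# \1 must match the quote or space captured by ([ '"]).
$content =~ s/<(?:[^> '"]*|([ '"]).*?\1)*>//g;

print $content;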
