RegEx - matching previous match

j ellings

Hello.

I have an html file converted from PDF that includes the following
sample lines:

(html has been converted)

<i><b>Z & A Newsstand</b></i><br>
<i>Retail Food: Mobile Food Vendor</i><br>
<i>2 N 10th St</i><br>
<i>Philadelphia, PA 19107</i><br>
<b>Inspection Date</b><br>
<i>4/11/07</i><br>
No Critical Violations<br>
<i>4/11/07</i><br>
No Critical Violations<br>
<i>11/28/06</i><br>
No Critical Violations<br>
<i>4/24/06</i><br>
No Critical Violations<br>
<i><b>Newstand</b></i><br>
<i>Retail Food: Mobile Food Vendor</i><br>
<i>32 N 10th St</i><br>
<i>Philadelphia, PA 19107</i><br>
<b>Inspection Date</b><br>
<i>7/2/07</i><br>
No Critical Violations<br>
<i><b>Pudgies Deli</b></i><br>
<i>Retail Food: Restaurant, Eat-in</i><br>
<i>46 N 10th St</i><br>
<i>Philadelphia, PA 19107</i><br>
<b>Inspection Date</b><br>
<i>1/11/07</i><br>
No Critical Violations<br>
<i>9/25/06</i><br>
No Critical Violations<br>
<i>8/7/06</i><br>
No Critical Violations<br>


I am trying to capture the information between the <i><b>
tags as these are the only unique delimiters between entries.

My regex is as follows:

while ($html =~ m{<i><b>(.*?)<i><b>}gs) {
#do something
}

Unfortunately, the regex will match the first instance (Z & A
Newsstand), but ignore the second (Newstand) and then match on the
third (Pudgies Deli).

I can see that the match is working according to what I wrote; I am
trying to fine-tune it so that I can grab every match. Is there a way
to include the previous <i><b> in the next match such that
it will not skip a potential match?

Any suggestions or advice would be most appreciated.

John

Gunnar Hjalmarsson

j ellings said:
(html has been converted)

Yes, but why on earth did you post the data in that format?

I am trying to capture the information between the <i><b>
tags as these are the only unique delimiters between entries.

My regex is as follows:

while ($html =~ m{<i><b>(.*?)<i><b>}gs) {
#do something
}

Unfortunately, the regex will match the first instance (Z & A
Newsstand), but ignore the second (Newstand) and then match on the
third (Pudgies Deli).

I can see that the match is working according to what I wrote; I am
trying to fine-tune it so that I can grab every match. Is there a way
to include the previous <i><b> in the next match such that
it will not skip a potential match?

A zero-width positive look-ahead assertion may be what you are after;
see "perldoc perlre".

while ($html =~ m{<i><b>(.*?)(?=<i><b>)}gs) {
-----------------------------^^^------^
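
As a rough illustration of the difference, the sketch below runs the original pattern and the look-ahead version against a shortened sample. The helper names_for, the variable names, and the fourth sample entry are made up for the example. It also tries a variant with |\z inside the look-ahead, on the assumption that the last entry in the file, which has no following <i><b>, should be captured as well.

```perl
use strict;
use warnings;

# Shortened sample in the shape of the data in the question. The fourth entry
# ("Example Cafe") is invented so that the skipping behaviour is visible.
my $html = join '',
    "<i><b>Z & A Newsstand</b></i><br>\n<i>2 N 10th St</i><br>\n",
    "<i><b>Newstand</b></i><br>\n<i>32 N 10th St</i><br>\n",
    "<i><b>Pudgies Deli</b></i><br>\n<i>46 N 10th St</i><br>\n",
    "<i><b>Example Cafe</b></i><br>\n<i>1 Example St</i><br>\n";

# Hypothetical helper: run a pattern with /g and return the business name
# (the text before </b>) of every entry it captures.
sub names_for {
    my ($re) = @_;
    my @names;
    while ( $html =~ /$re/g ) {
        my $entry = $1;
        push @names, ( split m{</b>}, $entry )[0];
    }
    return @names;
}

# Original pattern: the trailing <i><b> is consumed, so the next search
# starts after it and the entry it introduced is never matched.
my @consumed = names_for( qr{<i><b>(.*?)<i><b>}s );

# Look-ahead: the trailing <i><b> is checked but not consumed.
my @lookahead = names_for( qr{<i><b>(.*?)(?=<i><b>)}s );

# Look-ahead that also accepts the end of the string as a terminator.
my @all = names_for( qr{<i><b>(.*?)(?=<i><b>|\z)}s );

print join( ', ', @consumed ),  "\n";
print join( ', ', @lookahead ), "\n";
print join( ', ', @all ),       "\n";
```

On this sample the original pattern yields the first and third entries, the look-ahead yields the first three, and the variant with |\z yields all four. Whether the final entry matters depends on how the real file ends.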

Another approach that doesn't slurp the whole file into a scalar variable:

local $/ = '<i><b>';
while ( my $html = <> ) {
#do something
}
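
A minimal sketch of the record-separator approach, under a few assumptions: it reads from an in-memory filehandle over a made-up two-entry sample so that it runs on its own (a real script would read the converted file through <> as above), and the names $data, $fh, and @entries are hypothetical. It adds two details that the loop body would probably need: chomp to strip the trailing delimiter, and a check that skips the empty record before the first <i><b>.

```perl
use strict;
use warnings;

# Hypothetical sample; a real script would read the converted file via <>.
my $data = "<i><b>Z & A Newsstand</b></i><br>\n<i>2 N 10th St</i><br>\n"
         . "<i><b>Newstand</b></i><br>\n<i>32 N 10th St</i><br>\n";

open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my @entries;
{
    local $/ = '<i><b>';    # each read now ends at the next delimiter
    while ( my $record = <$fh> ) {
        chomp $record;      # removes a trailing $/ string, if present
        next unless length $record;    # nothing precedes the first delimiter
        push @entries, $record;
    }
}
close $fh;

printf "%d entries\n", scalar @entries;
print "First entry: ", ( split m{</b>}, $entries[0] )[0], "\n";
```

Because each record ends where the next delimiter begins, the last entry in the file is read as well, with no special case for the end of input.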
 
Tad J McClellan

j ellings said:
Hello.

I have an html file converted from PDF that includes the following
sample lines:

(html has been converted)


Why has HTML been converted?

This is a plain-text medium...

&lt;i&gt;&lt;b&gt;Z &amp; A Newsstand&lt;/b&gt;&lt;/i&gt;&lt;br&gt;
^^ ^^
^^ ^^

My regex is as follows:

while ($html =~ m{<i><b>(.*?)<i><b>}gs) {


End tags have slash characters in them that your pattern will not match.

Your data closes the bold before the italic, but your regex looks
for the italic close before the bold close.

I can see that the match is working according to what I wrote;


You have a strange definition of "working" then...

trying to fine-tune it so that I can grab every match. Is there a way
to include the previous <i><b> in the next match such that
it will not skip a potential match?


Any suggestions or advice would be most appreciated.


while ($html =~ m{<i><b>(.*?)</b></i>}gs) {
 
j ellings

A zero-width positive look-ahead assertion may be what you are after;
see "perldoc perlre".

while ($html =~ m{<i><b>(.*?)(?=<i><b>)}gs) {
-----------------------------^^^------^

Another approach that doesn't slurp the whole file into a scalar variable:

local $/ = '<i><b>';
while ( my $html = <> ) {
#do something
}

Thanks Gunnar, this worked perfectly; apologies for the formatting.
 
j ellings

while ($html =~ m{<i><b>(.*?)</b></i>}gs) {

Tad

Thanks for the suggestion. Your regex matches from each opening
<i><b> to its closing </b></i>, which captures only the name; what I
needed was everything between one opening <i><b> and the next. My
original regex did capture between two opening instances, but only
after skipping one.
 
