j ellings
Hello.
I have an HTML file, converted from a PDF, that includes the following
sample lines:
(the HTML has been converted)
<i><b>Z & A Newsstand</b></i><br>
<i>Retail Food: Mobile Food Vendor</i><br>
<i>2 N 10th St</i><br>
<i>Philadelphia, PA 19107</i><br>
<b>Inspection Date</b><br>
<i>4/11/07</i><br>
No Critical Violations<br>
<i>4/11/07</i><br>
No Critical Violations<br>
<i>11/28/06</i><br>
No Critical Violations<br>
<i>4/24/06</i><br>
No Critical Violations<br>
<i><b>Newstand</b></i><br>
<i>Retail Food: Mobile Food Vendor</i><br>
<i>32 N 10th St</i><br>
<i>Philadelphia, PA 19107</i><br>
<b>Inspection Date</b><br>
<i>7/2/07</i><br>
No Critical Violations<br>
<i><b>Pudgies Deli</b></i><br>
<i>Retail Food: Restaurant, Eat-in</i><br>
<i>46 N 10th St</i><br>
<i>Philadelphia, PA 19107</i><br>
<b>Inspection Date</b><br>
<i>1/11/07</i><br>
No Critical Violations<br>
<i>9/25/06</i><br>
No Critical Violations<br>
<i>8/7/06</i><br>
No Critical Violations<br>
I am trying to capture the information between successive <i><b>
tag pairs, as these are the only unique delimiters between entries.
My regex is as follows:
while ($html =~ m{<i><b>(.*?)<i><b>}gs) {
    # do something
}
Unfortunately, the regex matches the first entry (Z & A
Newsstand), skips the second (Newstand), and then matches the
third (Pudgies Deli).
I can see that the match is working according to what I wrote: the
closing <i><b> is consumed as part of one match, so it is no longer
available to open the next one. I am trying to fine-tune the regex so
that I can grab every entry. Is there a way to reuse that <i><b> as
the start of the next match so that no entry is skipped?
Any suggestions or advice would be most appreciated.
John
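
One possible approach, offered as a sketch rather than a definitive answer, is to turn the closing delimiter into a zero-width lookahead, so that it is checked but not consumed and the next iteration can start from it. The sample data below is an abbreviated, hypothetical stand-in for the real file, and the `|\z` alternative is an assumption added so that the last entry in the file is also captured.

```perl
use strict;
use warnings;

# Hypothetical sample input, abbreviated from the lines in the post.
my $html = join '', map { "$_\n" } (
    '<i><b>Z & A Newsstand</b></i><br>',
    '<i>2 N 10th St</i><br>',
    '<i><b>Newstand</b></i><br>',
    '<i>32 N 10th St</i><br>',
    '<i><b>Pudgies Deli</b></i><br>',
    '<i>46 N 10th St</i><br>',
);

my @entries;
# (?=...) is a zero-width lookahead: it requires that the next <i><b>
# (or the end of the string, \z) follows, but does not consume it, so
# the next pass of the loop can begin at that same <i><b>.
while ($html =~ m{<i><b>(.*?)(?=<i><b>|\z)}gs) {
    push @entries, $1;
}

# Pull the name (the text before the first </b>) out of each entry.
for my $entry (@entries) {
    my ($name) = $entry =~ m{\A(.*?)</b>}s;
    print "$name\n";
}
```

With this input the loop should collect three entries rather than skipping the middle one. If the delimiter never appears inside an entry, splitting the string on `<i><b>` with `split` would be another option.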