Problem with reg expression

Peter Jamieson · Sep 5, 2007

#I want my script to parse HTML tables such as the one included below:

#!/usr/bin/perl -w
use strict;
use warnings;

my $moggy = '<TABLE WIDTH=100% BORDER=0 CELLSPACING=0 CELLPADDING=0>
<TR>
<TD WIDTH=12% ALIGN=LEFT class=Tipster> RADIO TAB</TD>
<TD WIDTH=14% class=Tips> 3-2 </TD>
<TD WIDTH=16% ALIGN=LEFT class=Tipster></TD> <TD WIDTH=14% class=Tips></TD>
<TD WIDTH=14% ALIGN=CENTER></TD> <TD WIDTH=10% class=TrackCond> 520M</TD> <TD WIDTH=10% class="TrackCond">FINE</TD> <TD WIDTH=10% class="TrackCondR">GOOD</TD> </TR>

</TABLE>';

# I tried this

$_ = $moggy;
my ($d,$e,$f);
$d=''; $e=''; $f='';

($d,$e,$f) = /TrackCond(.*)<\/TD>/g;

print "d ",$d," e ",$e," f ",$f,"\n";

This produces for $d

520M</TD> <TD WIDTH=10% class="TrackCond">FINE</TD> <TD WIDTH=10% class="TrackCondR">GOOD
and no value for $e or $f

I would have expected
$d to be: > 520M, $e to be: ">FINE and $f: to be R">GOOD

Can anyone suggest why I don't get this and where I am going wrong here?
Any comments appreciated!

Lars Eighner · Sep 5, 2007

Peter said:
#I want my script to parse HTML tables such as the one included below:

#!/usr/bin/perl -w
use strict;
use warnings;

my $moggy = '<TABLE WIDTH=100% BORDER=0 CELLSPACING=0 CELLPADDING=0>
<TR>
<TD WIDTH=12% ALIGN=LEFT class=Tipster> RADIO TAB</TD>
<TD WIDTH=14% class=Tips> 3-2 </TD>
<TD WIDTH=16% ALIGN=LEFT class=Tipster></TD> <TD WIDTH=14% class=Tips></TD>
<TD WIDTH=14% ALIGN=CENTER></TD> <TD WIDTH=10% class=TrackCond> 520M</TD> <TD WIDTH=10% class="TrackCond">FINE</TD> <TD WIDTH=10% class="TrackCondR">GOOD</TD> </TR>

# I tried this

$_ = $moggy;
my ($d,$e,$f);
$d=''; $e=''; $f='';

($d,$e,$f) = /TrackCond(.*)<\/TD>/g;

print "d ",$d," e ",$e," f ",$f,"\n";

This produces for $d
class="TrackCondR">GOOD
and no value for $e or $f

I would have expected
$d to be: > 520M, $e to be: ">FINE and $f: to be R">GOOD

Can anyone suggest why I don't get this and where I am going wrong here?
Any comments appreciated!

First, regexes are extremely difficult to use to parse HTML. Use
the HTML::Parser module. (Yes, if you are a regex expert and know the
files you are working with, sometimes you can use quick and dirty
expressions for a particular ad hoc task, but if the nature of the files
changes, your quick and dirty solution from last week is likely to be broken
this week.)

Second, regexes are naturally greedy. Left unmodified they will make the
largest match possible, which is to say .*</TD> will not stop at the first
occurrence of </TD> but will consume everything up to the last </TD>. You
can consult the manual for ways of modifying this behavior, but it is still
not the way to parse HTML.

Third, what exactly did you think the values of $e and $f would be?
The assignment ($d,$e,$f) = /TrackCond(.*)<\/TD>/g; is nonsense whether
you are parsing html or a grocery list.
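
For readers who want to try the parser route suggested above, here is a minimal sketch. It assumes the CPAN distribution HTML-Parser is installed and uses its HTML::TokeParser interface; the shortened table and the variable names are illustrative only, not taken from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;    # part of the HTML-Parser distribution on CPAN

# Shortened, hypothetical version of the table from the original post.
my $html = '<TABLE><TR>'
         . '<TD WIDTH=10% class=TrackCond> 520M</TD> '
         . '<TD WIDTH=10% class="TrackCond">FINE</TD> '
         . '<TD WIDTH=10% class="TrackCondR">GOOD</TD>'
         . '</TR></TABLE>';

my $parser = HTML::TokeParser->new(\$html) or die "cannot set up parser";

my @cells;
while (my $tag = $parser->get_tag('td')) {
    # $tag->[1] is a hash ref of attributes; attribute names are lowercased.
    my $class = $tag->[1]{class} // '';
    next unless $class =~ /^TrackCond/;
    # Collect the text up to the closing </td>, with whitespace trimmed.
    push @cells, $parser->get_trimmed_text('/td');
}

print join(', ', @cells), "\n";
```

Because the parser tokenizes the tags itself, quoted versus unquoted attribute values and line breaks inside a tag do not need special handling in the pattern.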

Gunnar Hjalmarsson · Sep 5, 2007

Peter said:
#I want my script to parse HTML tables such as the one included below:

#!/usr/bin/perl -w
use strict;
use warnings;

my $moggy = '<TABLE WIDTH=100% BORDER=0 CELLSPACING=0 CELLPADDING=0>
<TR>
<TD WIDTH=12% ALIGN=LEFT class=Tipster> RADIO TAB</TD>
<TD WIDTH=14% class=Tips> 3-2 </TD>
<TD WIDTH=16% ALIGN=LEFT class=Tipster></TD> <TD WIDTH=14% class=Tips></TD>
<TD WIDTH=14% ALIGN=CENTER></TD> <TD WIDTH=10% class=TrackCond> 520M</TD> <TD WIDTH=10% class="TrackCond">FINE</TD> <TD WIDTH=10% class="TrackCondR">GOOD</TD> </TR>

</TABLE>';

# I tried this

$_ = $moggy;
my ($d,$e,$f);
$d=''; $e=''; $f='';

($d,$e,$f) = /TrackCond(.*)<\/TD>/g;

print "d ",$d," e ",$e," f ",$f,"\n";

This produces for $d
class="TrackCondR">GOOD
and no value for $e or $f

I would have expected
$d to be: > 520M, $e to be: ">FINE and $f: to be R">GOOD

Can anyone suggest why I don't get this

Because regexes are greedy by default.

($d,$e,$f) = /TrackCond(.*?)<\/TD>/g;
--------------------------^
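
To see the difference side by side, the following sketch runs both patterns over one row of cells kept on a single line, as in the failing table. The shortened $row string is an assumption made for illustration.

```perl
use strict;
use warnings;

# Three TrackCond cells on one line (shortened from the posted table).
my $row = '<TD class=TrackCond> 520M</TD> <TD class="TrackCond">FINE</TD> '
        . '<TD class="TrackCondR">GOOD</TD>';

my @greedy = $row =~ /TrackCond(.*)<\/TD>/g;     # .* runs on to the last </TD>
my @lazy   = $row =~ /TrackCond(.*?)<\/TD>/g;    # .*? stops at the nearest </TD>

print scalar(@greedy), " greedy match(es)\n";
print scalar(@lazy), " lazy match(es): ", join(' | ', @lazy), "\n";
```

The greedy pattern yields a single capture that swallows the intermediate cells, so only $d is filled; the non-greedy one yields the three values the OP expected.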

Gunnar Hjalmarsson · Sep 5, 2007

Lars said:
Third, what exactly did you think the values of $e and $f would be?

The OP already let us know that, didn't he?

The assignment ($d,$e,$f) = /TrackCond(.*)<\/TD>/g; is nonsense whether
you are parsing html or a grocery list.

Why?

Lars Eighner · Sep 5, 2007

In our last episode,
the lovely and talented Gunnar Hjalmarsson
broadcast on comp.lang.perl.misc:

The OP already let us know that, didn't he?

Why?

Because .* will eat everything that matches (if anything does), so
$e and $f will always be empty (and $d will be empty if there is no match).

Peter Jamieson · Sep 5, 2007

First, regexes are extremely difficult to use to parse HTML. Use
the HTML::Parser module. (Yes, if you are a regex expert and know the
files you are working with, sometimes you can use quick and dirty
expressions for a particular ad hoc task, but if the nature of the files
changes, your quick and dirty solution from last week is likely to be
broken this week.)

Thanks for the suggestion Lars, I will have a look at the HTML::Parser module.
I have used my script for over 2 years and 62000 tables, and this is one of
very few failures, so I am not too unhappy with it. If HTML::Parser beats this
then I'll be very pleased.

Second, regexes are naturally greedy. Left unmodified they will make the
largest match possible, which is to say .*</TD> will not stop at the first
occurrence of </TD> but will consume everything up to the last </TD>. You
can consult the manual for ways of modifying this behavior, but it is
still not the way to parse HTML.

Yes I hear what you claim but my script has performed very well so far,
perhaps I was lucky.

Third, what exactly did you think the values of $e and $f would be?

Perhaps you failed to read that part of the post? I stated quite
explicitly what I thought the values should be, as a guide to any
would-be helper.

The assignment ($d,$e,$f) = /TrackCond(.*)<\/TD>/g; is nonsense whether
you are parsing html or a grocery list.

With "use strict" and "use warnings" enabled I have been getting no warning
messages, and the output sent to my db is exactly as expected except for the
one table above amongst many thousands.
Cheers and thanks for the advice to use HTML::Parser. I will have a look at
it.

Mirco Wahab · Sep 5, 2007

Peter said:
($d,$e,$f) = /TrackCond(.*)<\/TD>/g;

print "d ",$d," e ",$e," f ",$f,"\n";
I would have expected
$d to be: > 520M, $e to be: ">FINE and $f: to be R">GOOD
Can anyone suggest why I don't get this and where I am going wrong here?

All has been said so far (all mysteries solved),
but I'd straighten up the whole thing a little bit:

...
my $moggy = '
<TABLE WIDTH=100% BORDER=0 CELLSPACING=0 CELLPADDING=0>
<TR>
...
...
</TR>
</TABLE>';

my ($d, $e, $f) = ('','',''); # why is this necessary at all?

($d, $e, $f) = $moggy =~ /TrackCon[^>]+>\s*(.+?)<\/TD>/g;

print "d=>'$d', e=>'$e', f=>'$f'\n"; # expand scalars in quotes
...

You don't need to put things into $_ in order
to get regular expressions applied, a $var =~ /regex/
will do fine. Furthermore, you can use [^>]+> if
you want to jump to the end of the <Tag name> of
any "TrackCond" variation.

Regards

M.

Gunnar Hjalmarsson · Sep 5, 2007

Lars said:
In our last episode,
the lovely and talented Gunnar Hjalmarsson
broadcast on comp.lang.perl.misc:

because .* will eat everything that matches (if anything does) so
$e and $f will always be empty (and $d will be empty if there is no match).

I thought you had covered the greediness thing in your "Second"
comment... The failure to make .* non-greedy doesn't make the whole
statement "nonsense" IMO.

Lars Eighner · Sep 5, 2007

Sorry, my bad - I didn't notice the 'g' modifier. That will allow multiple
matches of the subexpression to be captured, and returned as a list.

Well, no, you were right the first time, if for the wrong reasons. Because
of .* being unmodified, this kind of expression can never produce more than
one match, no matter how many g's you stick on the end. That is why it is
nonsense: putting a g on the end of something that can match at most once is
nonsense.

Something(.*)anotherthing can produce at most one match. The usual culprit
is the . because it matches just anything. Many times it does not have to
be . and replacing . with a bracketed range will help. In this case, for
example [^<]* has a chance of producing several matches. They would not
necessarily be right because in HTML a different tag could be nested in the
TD element, but you would be right to think there could be more than one
match, so /g would make sense.
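
A minimal sketch of the bracketed-range idea, using a shortened, hypothetical row and a second made-up cell with a nested tag to show the caveat mentioned above:

```perl
use strict;
use warnings;

# Shortened row from the posted table, all on one line.
my $row = '<TD class=TrackCond> 520M</TD> <TD class="TrackCond">FINE</TD> '
        . '<TD class="TrackCondR">GOOD</TD>';

# [^>]*> skips to the end of the opening tag; [^<]* cannot cross into the next tag.
my @cells = $row =~ /TrackCond[^>]*>\s*([^<]*)<\/TD>/g;
print join(', ', @cells), "\n";

# Hypothetical cell with a nested tag: [^<]* stops at <B>, so nothing matches.
my $nested_row = '<TD class=TrackCond><B>FAST</B></TD>';
my @nested = $nested_row =~ /TrackCond[^>]*>\s*([^<]*)<\/TD>/g;
print scalar(@nested), " match(es) for the nested cell\n";
```

The negated class gives one capture per cell without relying on greediness settings, but it silently skips any cell that contains further markup.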

Gunnar Hjalmarsson · Sep 5, 2007

Lars said:
Because of .* being unmodified, this kind of expression can never
produce more than one match, no matter how many g's you stick on
the end.

Something(.*)anotherthing can produce at most one match. The usual
culprit is the . because it matches just anything.

Those statements are not true. Without the /s modifier, the . matches
any character but a newline.

C:\home>type test.pl
my $list = <<EOL;
1. Milk
2. Sugar
3. Apples
EOL

my @items = $list =~ /\d+\.\s+(.*)/g;

print join(', ', @items), "\n";

C:\home>test.pl
Milk, Sugar, Apples

C:\home>

Many times it does not have to
be . and replacing . with a bracketed range will help.

That, OTOH, is true.
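
The role of the newline can be checked directly. This sketch reuses the grocery list above and only adds the /s modifier for comparison; the variable names are illustrative.

```perl
use strict;
use warnings;

my $list = "1. Milk\n2. Sugar\n3. Apples\n";

# Without /s, . stops at each newline, so /g finds one item per line.
my @per_line = $list =~ /\d+\.\s+(.*)/g;

# With /s, . also matches newlines, so the first .* swallows the rest of the list.
my @with_s = $list =~ /\d+\.\s+(.*)/gs;

print scalar(@per_line), " match(es) without /s\n";
print scalar(@with_s), " match(es) with /s\n";
```

So a greedy .* can match several times under /g as long as something it cannot cross, here the newline, separates the items.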

Lars Eighner · Sep 5, 2007

In our last episode,
the lovely and talented Gunnar Hjalmarsson
broadcast on comp.lang.perl.misc:

Those statements are not true. Without the /s modifier, the . matches
any character but a newline.

C:\home>type test.pl
my $list = <<EOL;
1. Milk
2. Sugar
3. Apples
EOL

The OP would not have been in trouble if he had convenient line breaks,
but

#!/usr/local/bin/perl

my $list = <<EOL;
1. Milk 2. Sugar 3. Apples
EOL

my @items = $list =~ /\d+\.\s+(.*)/g;

foreach my $thing (@items) {
    print "$thing |";
}
print "\n";

yields:

Milk 2. Sugar 3. Apples |

or in other words, only one match.
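
For the single-line case, one possible fix is a non-greedy quantifier combined with a lookahead that says where an item ends. This is a sketch under the assumption that items are separated by whitespace followed by the next number and a period.

```perl
use strict;
use warnings;

my $list = "1. Milk 2. Sugar 3. Apples\n";

# .*? grows only until the lookahead sees the next "N." or the end of the string.
my @items = $list =~ /\d+\.\s+(.*?)(?=\s+\d+\.|\s*$)/g;

print join(' | ', @items), "\n";
```

The lookahead does not consume the next item's number, so the following match can start from it.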


Peter Jamieson · Sep 6, 2007

Gunnar Hjalmarsson said:
Because regexes are greedy by default.

($d,$e,$f) = /TrackCond(.*?)<\/TD>/g;
--------------------------^

Thanks Gunnar! Fixed the errant table immediately... brilliant!
Case of cyber-beer on its way! I should have seen this... alas,
too much Merlot last night.
Cheers and thanks again.

Peter Jamieson · Sep 6, 2007

Mirco Wahab said:
Peter said:

($d,$e,$f) = /TrackCond(.*)<\/TD>/g;

print "d ",$d," e ",$e," f ",$f,"\n";
I would have expected
$d to be: > 520M, $e to be: ">FINE and $f: to be R">GOOD
Can anyone suggest why I don't get this and where I am going wrong here?


All has been said so far (all mysteries solved),
but I'd straighten up the whole thing a little bit:

...
my $moggy = '
<TABLE WIDTH=100% BORDER=0 CELLSPACING=0 CELLPADDING=0>
<TR>
...
...
</TR>
</TABLE>';

my ($d, $e, $f) = ('','',''); # why is this necessary at all?

($d, $e, $f) = $moggy =~ /TrackCon[^>]+>\s*(.+?)<\/TD>/g;

print "d=>'$d', e=>'$e', f=>'$f'\n"; # expand scalars in quotes
...

You don't need to put things into $_ in order
to get regular expressions applied, a $var =~ /regex/
will do fine. Furthermore, you can use [^>]+> if
you want to jump to the end of the <Tag name> of
any "TrackCond" variation.

Regards

M.

Thanks Mirco!
Your comments and code suggestions have been most helpful
and I will incorporate your ideas.
Despite what has been said by others, my script has collected
approx 50K pages of data with only one or two failures
and no warnings.
I'm only a Perl newbie. Your suggestions are instructive.
Thanks again!
