Problem with reg expression

Peter Jamieson

#I want my script to parse HTML tables such as the one included below:

#!/usr/bin/perl -w
use strict;
use warnings;

my $moggy = '<TABLE WIDTH=100% BORDER=0 CELLSPACING=0 CELLPADDING=0>
<TR>
<TD WIDTH=12% ALIGN=LEFT class=Tipster> RADIO TAB</TD>
<TD WIDTH=14% class=Tips> 3-2 </TD>
<TD WIDTH=16% ALIGN=LEFT class=Tipster></TD> <TD WIDTH=14% class=Tips></TD>
<TD WIDTH=14% ALIGN=CENTER></TD> <TD WIDTH=10% class=TrackCond> 520M</TD>
<TD WIDTH=10%

class="TrackCond">FINE</TD> <TD WIDTH=10% class="TrackCondR">GOOD</TD> </TR>

</TABLE>';

# I tried this

$_ = $moggy;
my ($d,$e,$f);
$d=''; $e=''; $f='';

($d,$e,$f) = /TrackCond(.*)<\/TD>/g;

print "d ",$d," e ",$e," f ",$f,"\n";


This produces for $d
520M</TD> <TD WIDTH=10% class="TrackCond">FINE</TD> <TD WIDTH=10%
class="TrackCondR">GOOD
and no value for $e or $f

I would have expected
$d to be: > 520M, $e to be: ">FINE and $f: to be R">GOOD

Can anyone suggest why I don't get this and where I am going wrong here?
Any comments appreciated!
 
Lars Eighner

First, regexes are extremely difficult to use to parse HTML. Use
the HTML::Parser module. (Yes, if you are a regex expert and know the
files you are working with, sometimes you can use quick and dirty
expressions for a particular ad hoc task, but if the nature of the files
changes, your quick and dirty solution from last week is likely to be broken
this week.)
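
For what it's worth, here is a minimal sketch of what the parser-based approach might look like. It assumes the CPAN module HTML::Parser is installed (it is not part of the core Perl distribution), and it uses a shortened copy of the table from the question. The names @cells and $in_cell are made up for illustration; treat this as a starting point under those assumptions, not as the one right way to do it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, assumed to be installed

# Shortened sample based on the table in the original post.
my $html = <<'HTML';
<TABLE><TR>
<TD WIDTH=14% class=Tips> 3-2 </TD>
<TD WIDTH=10% class=TrackCond> 520M</TD>
<TD WIDTH=10%
class="TrackCond">FINE</TD> <TD WIDTH=10% class="TrackCondR">GOOD</TD>
</TR></TABLE>
HTML

my @cells;          # text of every TD whose class starts with "TrackCond"
my $in_cell = 0;    # true while the parser is inside such a TD

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $attr) = @_;
        return unless $tag eq 'td';    # tag names arrive lower-cased
        my $class = defined $attr->{class} ? $attr->{class} : '';
        if ($class =~ /^TrackCond/) {
            $in_cell = 1;
            push @cells, '';
        }
    }, 'tagname, attr' ],
    text_h => [ sub {
        my ($text) = @_;
        # Text may arrive in several pieces, so append rather than assign.
        $cells[-1] .= $text if $in_cell;
    }, 'dtext' ],
    end_h => [ sub {
        my ($tag) = @_;
        $in_cell = 0 if $tag eq 'td';
    }, 'tagname' ],
);

$parser->parse($html);
$parser->eof;

# Trim surrounding whitespace from each captured cell.
s/^\s+|\s+$//g for @cells;

print join(', ', @cells), "\n";
```

The handlers only care about tag and attribute names, so unquoted attributes and a line break in the middle of a tag should not matter here.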

Second, regex quantifiers are greedy by default. Left unmodified they will make the
largest match possible, which is to say .*</TD> will not stop at the first
occurrence of </TD> but will consume everything up to the last </TD> it can
reach. You can consult the manual (perlre) for ways of modifying this behavior,
but it is still not the way to parse HTML.

Third, what exactly did you think the values of $e and $f would be?
The assignment ($d,$e,$f) = /TrackCond(.*)<\/TD>/g; is nonsense whether
you are parsing html or a grocery list.
 
Gunnar Hjalmarsson

Peter said:
Can anyone suggest why I don't get this

Because regexes are greedy by default.

($d,$e,$f) = /TrackCond(.*?)<\/TD>/g;
------------------------------^
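
To make the difference visible, here is a small sketch that runs both patterns against the three TrackCond cells. It assumes the cells sit on one line, which is what the output quoted in the question suggests; the variable names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The three TrackCond cells on a single line (assumed layout).
my $row = '<TD class=TrackCond> 520M</TD> <TD class="TrackCond">FINE</TD> '
        . '<TD class="TrackCondR">GOOD</TD>';

# Greedy: .* runs to the last </TD> on the line, so /g finds one match.
my @greedy = $row =~ /TrackCond(.*)<\/TD>/g;

# Non-greedy: .*? stops at the first </TD> it can, so /g finds three.
my @lazy = $row =~ /TrackCond(.*?)<\/TD>/g;

print scalar(@greedy), " greedy match(es)\n";
print scalar(@lazy),   " non-greedy match(es):\n";
print "  [$_]\n" for @lazy;
```

The non-greedy captures come out as the three values the OP expected, leading > and quote characters included.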
 
Gunnar Hjalmarsson

Lars said:
Third, what exactly did you think the values of $e and $f would be?

The OP already let us know that, didn't he?

Lars said:
The assignment ($d,$e,$f) = /TrackCond(.*)<\/TD>/g; is nonsense whether
you are parsing html or a grocery list.

Why?
 
Lars Eighner

In our last episode,
the lovely and talented Gunnar Hjalmarsson
broadcast on comp.lang.perl.misc:
The OP already let us know that, didn't he?

because .* will eat everything that matches (if anything does) so
$e and $f will always be empty (and $d will be empty if there is no match).
 
Peter Jamieson

Lars said:
First, regexes are extremely difficult to use to parse HTML. Use
the HTML::Parser module. (Yes, if you are a regex expert and know the
files you are working with, sometimes you can use quick and dirty
expressions for a particular ad hoc task, but if the nature of the files
changes, your quick and dirty solution from last week is likely to be
broken this week.)

Thanks for the suggestion Lars, I will have a look at the HTML::Parser module.
I have used my script for over 2 years and 62000 tables, and this is one of
very few failures, so I am not too unhappy with it. If HTML::Parser beats
this then I'll be very pleased.

Lars said:
Second, regexes are naturally greedy. Left unmodified they will make the
largest match possible, which is to say .*</TD> will not stop at the first
occurrence of </TD> but will do everything up to the last </TD>. You
can consult the manual for ways of modifying this behavior, but it is
still not the way to parse HTML.

Yes, I hear what you claim, but my script has performed very well so far;
perhaps I was lucky.

Lars said:
Third, what exactly did you think the values of $e and $f would be?

Perhaps you failed to read that part of the post? I stated quite
explicitly what I thought the values should be, as a guide to any
would-be helper.

Lars said:
The assignment ($d,$e,$f) = /TrackCond(.*)<\/TD>/g; is nonsense whether
you are parsing html or a grocery list.

With "use strict" and "use warnings" enabled I have been getting no warning
messages, and the output sent to my db is exactly as expected except for the
one table above amongst many thousands.
Cheers and thanks for the advice to use HTML::Parser. I will have a look at
it.
 
Mirco Wahab

Peter said:
($d,$e,$f) = /TrackCond(.*)<\/TD>/g;

print "d ",$d," e ",$e," f ",$f,"\n";
I would have expected
$d to be: > 520M, $e to be: ">FINE and $f: to be R">GOOD
Can anyone suggest why I don't get this and where I am going wrong here?

All has been said so far (all mysteries solved),
but I'd straighten up the whole thing a little bit:

...
my $moggy = '
<TABLE WIDTH=100% BORDER=0 CELLSPACING=0 CELLPADDING=0>
<TR>
...
...
</TR>
</TABLE>';

my ($d, $e, $f) = ('','',''); # why is this necessary at all?

($d, $e, $f) = $moggy =~ /TrackCon[^>]+>\s*(.+?)<\/TD>/g;

print "d=>'$d', e=>'$e', f=>'$f'\n"; # expand scalars in quotes
...

You don't need to put things into $_ in order
to get regular expressions applied, a $var =~ /regex/
will do fine. Furthermore, you can use [^>]+> if
you want to jump to the end of the <Tag name> of
any "TrackCond" variation.
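
A runnable sketch of the same idea, with the last few cells of the table from the question filled in (line breaks as posted). Nothing here goes beyond the pattern shown above; only the sample data is spelled out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The last few cells of the table from the question, line breaks included.
my $moggy = '<TD WIDTH=14% ALIGN=CENTER></TD> <TD WIDTH=10% class=TrackCond> 520M</TD>
<TD WIDTH=10%

class="TrackCond">FINE</TD> <TD WIDTH=10% class="TrackCondR">GOOD</TD> </TR>';

# [^>]+> skips the rest of the opening tag, \s* drops leading blanks,
# and (.+?) captures the cell text up to the nearest </TD>.
my ($d, $e, $f) = $moggy =~ /TrackCon[^>]+>\s*(.+?)<\/TD>/g;

print "d=>'$d', e=>'$e', f=>'$f'\n";
```

One thing to watch: (.+?) needs at least one character, so an empty TrackCond cell followed by another cell on the same line would make the capture run on into that next cell.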


Regards

M.
 
Gunnar Hjalmarsson

Lars said:
In our last episode,
the lovely and talented Gunnar Hjalmarsson
broadcast on comp.lang.perl.misc:

because .* will eat everything that matches (if anything does) so
$e and $f will always be empty (and $d will be empty if there is no match).

I thought you had covered the greediness thing in your "Second"
comment... The failure to make .* non-greedy doesn't make the whole
statement "nonsense" IMO.
 
Lars Eighner

Sorry, my bad - I didn't notice the 'g' modifier. That will allow multiple
matches of the subexpression to be captured, and returned as a list.

Well, no, you were right the first time, if for the wrong reasons. Because
of .* being unmodified, this kind of expression can never produce more than
one match, no matter how many g's you stick on the end. That is why it is
nonsense: putting a g on the end of something that can match at most once is
nonsense.

Something(.*)anotherthing can produce at most one match. The usual culprit
is the . because it matches just anything. Many times it does not have to
be . and replacing . with a bracketed range will help. In this case, for
example [^<]* has a chance of producing several matches. They would not
necessarily be right because in HTML a different tag could be nested in the
TD element, but you would be right to think there could be more than one
match, so /g would make sense.
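
A sketch of the bracketed-range idea, using a made-up one-line sample (the layout is an assumption). It also shows the nested-tag caveat mentioned above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One-line sample (assumed layout of the three TrackCond cells).
my $row = '<TD class=TrackCond> 520M</TD> <TD class="TrackCond">FINE</TD> '
        . '<TD class="TrackCondR">GOOD</TD>';

# [^<]* can never cross a "<", so it has to stop where </TD> begins.
my @cells = $row =~ /TrackCond([^<]*)<\/TD>/g;
print "[$_]\n" for @cells;

# The caveat: a tag nested inside the cell defeats the pattern entirely.
my $nested = '<TD class=TrackCond><B>FINE</B></TD>';
my @none = $nested =~ /TrackCond([^<]*)<\/TD>/g;
print scalar(@none), " match(es) when a tag is nested in the cell\n";
```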
 
Gunnar Hjalmarsson

Lars said:
Because of .* being unmodified, this kind of expression can never
produce more than one match, no matter how many g's you stick on
the end.

Something(.*)anotherthing can produce at most one match. The usual
culprit is the . because it matches just anything.

Those statements are not true. Without the /s modifier, the . matches
any character but a newline.

C:\home>type test.pl
my $list = <<EOL;
1. Milk
2. Sugar
3. Apples
EOL

my @items = $list =~ /\d+\.\s+(.*)/g;

print join(', ', @items), "\n";

C:\home>test.pl
Milk, Sugar, Apples

C:\home>

Lars said:
Many times it does not have to
be . and replacing . with a bracketed range will help.

That, OTOH, is true.
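
To illustrate the point about /s, here is a small sketch that runs the grocery-list pattern with and without the modifier. The variable names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $list = "1. Milk\n2. Sugar\n3. Apples\n";

# Without /s the dot stops at each newline, so /g finds one item per line.
my @per_line = $list =~ /\d+\.\s+(.*)/g;

# With /s the dot also matches newlines, so the first .* swallows the rest.
my @swallowed = $list =~ /\d+\.\s+(.*)/gs;

print scalar(@per_line),  " match(es) without /s\n";
print scalar(@swallowed), " match(es) with /s\n";
```

So whether an unmodified .* leaves room for further /g matches depends on where the newlines fall in the data.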
 
Lars Eighner

In our last episode,
the lovely and talented Gunnar Hjalmarsson
broadcast on comp.lang.perl.misc:
Those statements are not true. Without the /s modifier, the . matches
any character but a newline.
C:\home>type test.pl
my $list = <<EOL;
1. Milk
2. Sugar
3. Apples
EOL

The OP would not have been in trouble if he had convenient line breaks,
but

#!/usr/local/bin/perl

my $list = <<EOL;
1. Milk 2. Sugar 3. Apples
EOL

my @items = $list =~ /\d+\.\s+(.*)/g;

foreach my $thing (@items) {
    print "$thing |";
}
print "\n";

yields:

Milk 2. Sugar 3. Apples |

or in other words, only one match.

 
Peter Jamieson

Gunnar Hjalmarsson said:
Because regexes are greedy by default.

($d,$e,$f) = /TrackCond(.*?)<\/TD>/g;
------------------------------^

Thanks Gunnar! Fixed the errant table immediately....brilliant!....
case of cyber-beer on its way!....I should have seen this.....alas,
too much Merlot last night.
Cheers and thanks again.
 
Peter Jamieson

Thanks Mirco!
Your comments and code suggestions have been most helpful
and I will incorporate your ideas.
Despite what has been said by others, my script has collected
approx 50K pages of data with only one or two failures
and no warnings.
I'm only a Perl newbie. Your suggestions are instructive.
Thanks again!
 
