2 problems parsing output from HTML::TableExtract

Ted Byers

I have to automate parsing email that comes in with its data in an
HTML file (so I have no control over the content or how it is
formatted).

HTML::TableExtract has proved priceless in getting this done.
However, there are two issues that are giving me grief.

The first is probably simple, at least for regex experts. There are
characters in the string that, while not a problem in browsers
displaying the HTML, are a problem for my attempts to refine this data
down to text I can work with in my other code. There are a plethora
of instances of a character that Emacs displays as '\240'. But the following
statement doesn't remove them.

$payload_tmp =~ s/\240//g;

Neither does:
$payload_tmp =~ s/\\240//g;

I suspect it is a printer/display control character that results in
the following text being underlined when displayed using a browser
like MS IE or Firefox. What I don't know is what value I ought to use
in my regex to get rid of it.

I think I know what I can do, to work around this, but I would like to
know how to construct a regular expression to get rid of it.

The more important question gets down to how to deal with a warning I
get on some output produced by HTML::TableExtract.

In the html I get, there is one table, but without proper table
headers, and there are two logical tables in this one HTML table
separated by rows that have no visible values in their cells. Those
cells without useful data cause problems in the output that manifests
with a warning message:

Use of uninitialized value $row in join or string at c:/test_path/
Email_test_7.pl line 188, <GEN0> line 27252.

Here is the code block the warning relates to:
my $te = HTML::TableExtract->new();
$payload =~ s/\r//g;
my $payload_tmp = $payload;
$payload_tmp =~ s/\n//g;
$payload_tmp =~ s/\240//g;
$te->parse($payload_tmp);
my ($ts,$tn);
$tn = 0;
foreach $ts ($te->tables) {
    my $row;
    my $rown = 0;
    foreach $row ($ts->rows) {
        next unless defined $row;
        # not sure about this one, but I tried it because the join
        # mentioned in the warning uses @$row
        next unless defined @$row;
        my $fount = @$row;
        next unless defined $fount;
        next if ($fount == 0);
        my $trow = join(',',@$row);
        print "\tRow: $rown\t",$trow,"\n";
        $rown++;
    }
    $tn++;
}

Since I know the HTML that is producing this output, I just want to
skip over and ignore the rows having cells that have no data. Since
the warning says I have an 'uninitialized value $row in join or
string', I tried to skip if $row is undefined, and if the row has no
data, but these tests are not having the desired effect. It is as if
they weren't there. I don't know why I'd get a message that $row is
undefined and yet a statement "next unless defined $row;" has no
effect.

What did I miss here?
 
Peter J. Holzer

I have to automate parsing email that comes in with its data in an
HTML file (so I have no control over the content or how it is
formatted).

HTML::TableExtract has proved priceless in getting this done.
However, there are two issues that are giving me grief.

The first is probably simple, at least for regex experts. There are
characters in the string that, while not a problem in browsers
displaying the HTML, are a problem for my attempts to refine this data
down to text I can work with in my other code. There are a plethora
of instances of a character that Emacs displays as '\240'.

\240 (\x{A0} in hex) is the non-breaking space.
But the following
statement doesn't remove them.

$payload_tmp =~ s/\240//g;

This should work, provided the "\240" is there when you do the
substitution. In HTML, the non-breaking space is often written as
"&nbsp;". Are you sure that you are looking at the text you are feeding
to your script and not some processed version?
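
To make the distinction concrete, here is a minimal sketch. The sample string is hypothetical (it is not taken from the original email) and contains one literal non-breaking space character and one "&nbsp;" entity, so each substitution can be seen to affect only one of them. Lengths are printed instead of the strings themselves to avoid terminal encoding surprises.

```perl
use strict;
use warnings;

# Hypothetical sample: one literal NBSP character and one "&nbsp;" entity.
my $raw = "Total:\xA0100&nbsp;USD";

# \240 in a pattern with no capture groups is an octal escape for chr(0xA0).
(my $no_char = $raw) =~ s/\240//g;

# This removes only the entity text, not the literal character.
(my $no_entity = $raw) =~ s/&nbsp;//g;

# \\240 looks for a literal backslash followed by "240"; nothing matches here.
(my $neither = $raw) =~ s/\\240//g;

printf "%-22s %2d chars\n", 'original:',            length $raw;
printf "%-22s %2d chars\n", 'after s/\240//g:',     length $no_char;
printf "%-22s %2d chars\n", 'after s/&nbsp;//g:',   length $no_entity;
printf "%-22s %2d chars\n", 'after s/\\\\240//g:',  length $neither;
```

If the payload is still UTF-8 encoded bytes that have not been decoded, the non-breaking space is the two bytes C2 A0, and the single-character pattern would not remove it cleanly; whether that applies here depends on how the email was read.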

Neither does:
$payload_tmp =~ s/\\240//g;

This shouldn't.

I suspect it is a printer/display control character that results in
the following text being underlined when displayed using a browser
like MS IE or Firefox.

Please read http://www.w3.org/TR/html401/



[...]
Use of uninitialized value $row in join or string at c:/test_path/
Email_test_7.pl line 188, <GEN0> line 27252.

Here is the code block the warning relates to: [...]
my $row;
my $rown = 0;
foreach $row ($ts->rows) {
next unless defined $row;
next unless defined @$row; # not sure about this one, but I tried it
# because the join mentioned in the warning uses @$row
my $fount = @$row;
next unless defined $fount;
next if ($fount == 0);
my $trow = join(',',@$row);

I assume this is line 188 because it's the only line with a join in it.
However I don't see how this line can be reached if $row is undefined.
Are you sure that this is the code you are running?

Please post a short, complete script that we can run. If you post a
short snippet from a longer script it is always possible that the error
is somewhere else. Also, you will probably find the error while trying
to make the script as short as possible and won't have to ask at all.

hp
 
sln

However, there are two issues that are giving me grief.

The first is probably simple, at least for regex experts. There are
characters in the string that, while not a problem in browsers
statement doesn't remove them.

$payload_tmp =~ s/\240//g;
^
would be a rx variable for $240
Neither does:
$payload_tmp =~ s/\\240//g;
s/\\240//g;
works for me

The more important question gets down to how to deal with a warning I
get on some output produced by HTML::TableExtract.

In the html I get, there is one table, but without proper table
headers, and there are two logical tables in this one HTML table
separated by rows that have no visible values in their cells. Those
cells without useful data cause problems in the output that manifests
with a warning message:

Use of uninitialized value $row in join or string at c:/test_path/
Email_test_7.pl line 188, <GEN0> line 27252.

Here is the code block the warning relates to:
my $te = HTML::TableExtract->new();
$payload =~ s/\r//g;
my $payload_tmp = $payload;
$payload_tmp =~ s/\n//g;
$payload_tmp =~ s/\240//g;
$te->parse($payload_tmp);
my ($ts,$tn);
$tn = 0;
foreach $ts ($te->tables) {
my $row;
my $rown = 0;
foreach $row ($ts->rows) {
next unless defined $row;
next unless defined @$row; # not sure about this one, but I tried it
# because the join mentioned in the warning uses @$row
my $fount = @$row;
next unless defined $fount;
next if ($fount == 0);
my $trow = join(',',@$row);
print "\tRow: $rown\t",$trow,"\n";
$rown++;
}
$tn++;
}
What did I miss here?

See below.
-sln
=====================
use strict;
use warnings;

my $string = '
start\\
240\\24
0\\240\\240\\2
40\\240-end
';
print $string,"\n";
$string =~ s/\n//g;
$string =~ s/\\240//g;
# ^^
# works for me

print $string,"\n";


my $row = [qw{this is a row of data},undef,undef,'end'];
#                                    ^^^^^ ^^^^^
# oh no, undefined elements
# join will give warning
#
my $trow = join(',',@$row);
print "$trow\n";

# to fix, rip out the undefined elements in a new copy of row.
# can either strip the undef's:
my @row_copy = map {defined $_ ? $_ : ()} @$row;
# or can blank them out:
# my @row_copy = map {defined $_ ? $_ : ''} @$row;
#
$trow = join ',', @row_copy;
print "$trow\n";

__END__

# Let's fix this up (untested)

foreach $ts ($te->tables) {
    my $row;
    my $rown = 0;
    foreach $row ($ts->rows) {
        next if !defined($row);
        # drop undefined cells; skip the row if nothing is left
        my @row_copy = map {defined $_ ? $_ : ()} @$row;
        next if !scalar(@row_copy);
        my $trow = join ',', @row_copy;
        print "\tRow: $rown\t",$trow,"\n";
        $rown++;
    }
    $tn++;
}
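
For the separator rows described in the question, a row may also contain cells that are defined but hold only whitespace or a non-breaking space. The following is a rough sketch of one way to skip such rows; row_is_blank is a hypothetical helper, not part of HTML::TableExtract, and the sample rows are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: true when no cell holds anything other than
# whitespace or a non-breaking space (chr 0xA0).
sub row_is_blank {
    my ($row) = @_;
    return 1 unless defined $row;
    return !grep { defined($_) && /[^\s\xA0]/ } @$row;
}

my @rows = (
    [ 'name', 'qty' ],
    [ undef, "\xA0" ],      # separator row: nothing visible
    [ 'widget', undef ],    # partly filled: kept
);

my @kept = grep { !row_is_blank($_) } @rows;
print scalar(@kept), " of ", scalar(@rows), " rows kept\n";
```

Rows that are only partly empty are kept, so their undefined cells still need to be blanked or stripped before the join.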
 
Ted Byers

                                             ^^
    # my @row_copy = map {defined $_ ? $_ : ''} @$row;

-sln

Thanks everyone. Problem solved; and I learned a bunch too. ;-)

Cheers

Ted
 
sln

^^
# my @row_copy = map {defined $_ ? $_ : ''} @$row;

-sln

Should you decide to just define blanks instead of deleting
the elements, you won't have to create a temporary array,
just do it in place with this:

defined $_ or $_ = '' for (@$row);

So, then the code would look something like this:

foreach $ts ($te->tables) {
    my $row;
    my $rown = 0;
    foreach $row ($ts->rows) {
        next if (!defined($row) or !@$row);
        defined $_ or $_ = '' for (@$row); # just blank undef's
        my $trow = join ',', @$row;
        print "\tRow: $rown\t",$trow,"\n";
        $rown++;
    }
    $tn++;
}

-sln
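
The choice between stripping and blanking matters when the output is meant to keep its columns. A small sketch with a hypothetical three-cell row shows the difference, and also that the in-place form changes the original array because the for modifier aliases $_ to each element.

```perl
use strict;
use warnings;

my $row = [ 'id', undef, 'amount' ];    # hypothetical row, middle cell empty

# Stripping: the empty cell disappears, so later columns shift left.
my $stripped = join ',', grep { defined($_) } @$row;

# Blanking in place: $_ is an alias, so @$row itself is modified.
defined $_ or $_ = '' for @$row;
my $blanked = join ',', @$row;

print "stripped: $stripped\n";   # stripped: id,amount
print "blanked:  $blanked\n";    # blanked:  id,,amount
```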
 
Uri Guttman

s> Should you decide to just define blanks instead of deleting
s> the elements, you won't have to create a temporary array,
s> just do it in place with this:

s> defined $_ or $_ = '' for (@$row);

s> defined $_ or $_ = '' for (@$row); # just blank undef's
s> $trow = join ',', @$row;

you can merge those with a map:

$trow = join ',', map { defined($_) ? $_ : '' } @$row;

and if you are using 5.10 with the defined or op // that is even
simpler:

$trow = join ',', map { $_ // '' } @$row;

and with 5.10 the for modifier line could also become:

$_ //= '' for @{$row} ;
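
As a quick check that these forms agree, here is a small sketch using a hypothetical row. It assumes Perl 5.10 or later for the // and //= operators. The two map forms leave the row untouched, while the //= form rewrites it in place.

```perl
use strict;
use warnings;
use 5.010;    # needed for // and //=

my $row = [ 'a', undef, 'c', undef ];    # hypothetical row

my $with_ternary = join ',', map { defined($_) ? $_ : '' } @$row;
my $with_defor   = join ',', map { $_ // '' } @$row;

# map built new lists, so the undefs are still in @$row at this point.
my $undef_before = grep { !defined($_) } @$row;

# //= assigns only where the element is undefined, in place.
$_ //= '' for @$row;
my $undef_after = grep { !defined($_) } @$row;

print "ternary:    $with_ternary\n";
print "defined-or: $with_defor\n";
print "undef cells before/after: $undef_before/$undef_after\n";
```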

uri
 
Uri Guttman

BM> Meh. Those are all ugly. I much prefer

BM> {
BM> no warnings "uninitialized";
BM> $trow = join ",", @$row;
BM> }

that needs a block, and is longer. and i don't like to use the warnings
pragma unless absolutely necessary. just my style vs yours.
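
For comparison, a minimal sketch of the scoped-pragma approach. The warning handler is only there to count warnings for the demonstration, and the row is hypothetical. The pragma is lexical, so the same join outside the block warns again.

```perl
use strict;
use warnings;

my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };   # collect instead of printing

my $row = [ 'a', undef, 'c' ];    # hypothetical row

my $quiet;
{
    no warnings 'uninitialized';  # applies only inside this block
    $quiet = join ',', @$row;
}
my $count_inside = @warnings;

my $noisy = join ',', @$row;      # same join, warnings active again
my $count_outside = @warnings;

print "result: $quiet\n";
print "warnings inside block: $count_inside, after block: $count_outside\n";
```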

uri
 
Peter J. Holzer

Peter J. Holzer said:
Use of uninitialized value $row in join or string at c:/test_path/
Email_test_7.pl line 188, <GEN0> line 27252.

Here is the code block the warning relates to: [...]
foreach $row ($ts->rows) {
next unless defined $row; [...]
my $trow = join(',',@$row);

I assume this is line 188 because it's the only line with a join in it.
However I don't see how this line can be reached if $row is undefined.
Are you sure that this is the code you are running?


I was confused too. The error message is misleading, it is not $row
that is undefined, it is one of the elements in @$row that is undef.

Ah, yes. That makes sense.

hp
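
The situation can be reproduced in a few lines. This is a sketch with a hypothetical row, and the exact wording of the warning may differ between Perl versions, so it only checks that an "uninitialized" warning is raised even though $row itself is defined.

```perl
use strict;
use warnings;

my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };   # capture warnings

my $row = [ 'a', undef, 'c' ];    # hypothetical row with one empty cell

my $row_is_defined = defined $row;    # true: the reference itself is fine
my $joined = join ',', @$row;         # warns: one element is undef

print "row defined: ", ($row_is_defined ? 'yes' : 'no'), "\n";
print "joined: $joined\n";
print "captured: $_" for @warnings;
```

This is why "next unless defined $row;" had no effect: the test passes for every row, and the warning comes from the undefined cell inside it.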
 
