2 problems parsing output from HTML::TableExtract

Ted Byers · Sep 1, 2009

I have to automate parsing email that comes in with its data in an
HTML file (so I have no control over the content or how it is
formatted).

HTML::TableExtract has proved priceless in getting this done.
However, there are two issues that are giving me grief.

The first is probably simple, at least for regex experts. There are
characters in the string that, while not a problem in browsers
displaying the HTML, are a problem for my attempts to refine this data
down to text I can work with in my other code. There are a plethora
of instances of a character that Emacs displays as '\240'. But the following
statement doesn't remove them.

$payload_tmp =~ s/\240//g;

Neither does:
$payload_tmp =~ s/\\240//g;

I suspect it is a printer/display control character that results in
the following text being underlined when displayed using a browser
like MS IE or Firefox. What I don't know is what value I ought to use
in my regex to get rid of it.

I think I know what I can do, to work around this, but I would like to
know how to construct a regular expression to get rid of it.

The more important question gets down to how to deal with a warning I
get on some output produced by HTML::TableExtract.

In the html I get, there is one table, but without proper table
headers, and there are two logical tables in this one HTML table
separated by rows that have no visible values in their cells. Those
cells without useful data cause problems in the output that manifests
with a warning message:

Use of uninitialized value $row in join or string at c:/test_path/
Email_test_7.pl line 188, <GEN0> line 27252.

Here is the code block the warning relates to:
my $te = HTML::TableExtract->new();
$payload =~ s/\r//g;
my $payload_tmp = $payload;
$payload_tmp =~ s/\n//g;
$payload_tmp =~ s/\240//g;
$te->parse($payload_tmp);
my ($ts,$tn);
$tn = 0;
foreach $ts ($te->tables) {
my $row;
my $rown = 0;
foreach $row ($ts->rows) {
next unless defined $row;
next unless defined @$row; # not sure about this one, but I tried it
# because the join mentioned in the warning uses @$row
my $fount = @$row;
next unless defined $fount;
next if ($fount == 0);
my $trow = join(',',@$row);
print "\tRow: $rown\t",$trow,"\n";
$rown++;
}
$tn++;
}

Since I know the HTML that is producing this output, I just want to
skip over and ignore the rows having cells that have no data. Since
the warning says I have an 'uninitialized value $row in join or
string', I tried to skip if $row is undefined, and if the row has no
data, but these tests are not having the desired effect. It is as if
they weren't there. I don't know why I'd get a message that $row is
undefined and yet a statement "next unless defined $row;" has no
effect.

What did I miss here?

Peter J. Holzer · Sep 1, 2009

I have to automate parsing email that comes in with its data in an
HTML file (so I have no control over the content or how it is
formatted).

HTML::TableExtract has proved priceless in getting this done.
However, there are two issues that are giving me grief.

The first is probably simple, at least for regex experts. There are
characters in the string that, while not a problem in browsers
displaying the HTML, are a problem for my attempts to refine this data
down to text I can work with in my other code. There are a plethora
of instances of a character that Emacs displays as '\240'.

\240 (\x{A0} in hex) is the non-breaking space.

But the following
statement doesn't remove them.

$payload_tmp =~ s/\240//g;

This should work, provided the "\240" is there when you do the
substitution. In HTML, the non-breaking space is often written as
"&nbsp;". Are you sure that you are looking at the text you are feeding
to your script and not some processed version?
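A minimal sketch of that check, assuming the character can show up either as the raw byte \xA0 (octal \240) or as a still-encoded entity. The helper names strip_nbsp and show_codes are hypothetical, not part of any module. Also note that, if I read its documentation correctly, HTML::TableExtract decodes entities in extracted text by default, so an "&nbsp;" in the input would reappear as \xA0 in the cells; in that case cleaning the cells after extraction may be more reliable than cleaning the payload before parsing.

```perl
use strict;
use warnings;

# Hypothetical helper: remove non-breaking spaces in the forms they
# commonly take in HTML input.
sub strip_nbsp {
    my ($text) = @_;
    $text =~ s/&nbsp;|&#160;|&#[xX][aA]0;//g;   # still-encoded entities
    $text =~ s/\xA0//g;                          # the raw character (octal \240)
    return $text;
}

# Hypothetical helper: list the character codes in hex, to confirm
# what the script is really being fed.
sub show_codes {
    my ($text) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $text;
}

my $sample = "a\xA0b&nbsp;c";
print show_codes($sample), "\n";
print strip_nbsp($sample), "\n";
```

Running show_codes on the real payload just before the substitution shows whether an A0 is present at that point at all.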

Neither does:
$payload_tmp =~ s/\\240//g;

This shouldn't.

I suspect it is a printer/display control character that results in
the following text being underlined when displayed using a browser
like MS IE or Firefox.

Please read http://www.w3.org/TR/html401/

[...]

Use of uninitialized value $row in join or string at c:/test_path/
Email_test_7.pl line 188, <GEN0> line 27252.

Here is the code block the warning relates to: [...]
my $row;
my $rown = 0;
foreach $row ($ts->rows) {
next unless defined $row;
next unless defined @$row; # not sure about this one, but I tried it
# because the join mentioned in the warning uses @$row
my $fount = @$row;
next unless defined $fount;
next if ($fount == 0);
my $trow = join(',',@$row);

I assume this is line 188 because it's the only line with a join in it.
However I don't see how this line can be reached if $row is undefined.
Are you sure that this is the code you are running?

Please post a short, complete script that we can run. If you post a
short snippet from a longer script it is always possible that the error
is somewhere else. Also, you will probably find the error while trying
to make the script as short as possible and won't have to ask at all.

hp

sln · Sep 1, 2009

However, there are two issues that are giving me grief.

The first is probably simple, at least for regex experts. There are
characters in the string that, while not a problem in browsers

statement doesn't remove them.

$payload_tmp =~ s/\240//g;

^
would be a rx variable for $240

Neither does:
$payload_tmp =~ s/\\240//g;

s/\\240//g;
works for me

The more important question gets down to how to deal with a warning I
get on some output produced by HTML::TableExtract.

In the html I get, there is one table, but without proper table
headers, and there are two logical tables in this one HTML table
separated by rows that have no visible values in their cells. Those
cells without useful data cause problems in the output that manifests
with a warning message:

Use of uninitialized value $row in join or string at c:/test_path/
Email_test_7.pl line 188, <GEN0> line 27252.

Here is the code block the warning relates to:
my $te = HTML::TableExtract->new();
$payload =~ s/\r//g;
my $payload_tmp = $payload;
$payload_tmp =~ s/\n//g;
$payload_tmp =~ s/\240//g;
$te->parse($payload_tmp);
my ($ts,$tn);
$tn = 0;
foreach $ts ($te->tables) {
my $row;
my $rown = 0;
foreach $row ($ts->rows) {
next unless defined $row;
next unless defined @$row; # not sure about this one, but I tried it
# because the join mentioned in the warning uses @$row
my $fount = @$row;
next unless defined $fount;
next if ($fount == 0);
my $trow = join(',',@$row);
print "\tRow: $rown\t",$trow,"\n";
$rown++;
}
$tn++;
}

What did I miss here?

See below.
-sln
=====================
use strict;
use warnings;

my $string = '
start\\
240\\24
0\\240\\240\\2
40\\240-end
';
print $string,"\n";
$string =~ s/\n//g;
$string =~ s/\\240//g;
# ^^
# works for me

print $string,"\n";

my $row = [qw{this is a row of data},undef,undef,'end'];
# ^^^^^ ^^^^^
# oh no, undefined elements
# join will give warning
#
my $trow = join(',',@$row);
print "$trow\n";

# to fix, rip out the undefined elements in a new copy of row.
# can either strip the undef's:
my @row_copy = map {defined $_ ? $_ : ()} @$row;
# or can blank them out:
# my @row_copy = map {defined $_ ? $_ : ()} @$row;
#
$trow = join ',', @row_copy;
print "$trow\n";

__END__

# Let's fix this up (untested)

foreach $ts ($te->tables) {
my $row;
my $rown = 0;
foreach $row ($ts->rows) {
next if !defined($row);
my @row_copy = map {defined $_ ? $_ : ()} @$row;
next if !scalar(@row_copy);
my $trow = join ',', @row_copy;
print "\tRow: $rown\t",$trow,"\n";
$rown++;
}
$tn++;
}

sln · Sep 1, 2009

# or can blank them out:
# my @row_copy = map {defined $_ ? $_ : ()} @$row;

^^
# my @row_copy = map {defined $_ ? $_ : ''} @$row;

-sln

Ted Byers · Sep 1, 2009

^^
# my @row_copy = map {defined $_ ? $_ : ''} @$row;

-sln

Thanks everyone. Problem solved; and I learned a bunch too. ;-)

Cheers

Ted

sln · Sep 1, 2009

^^
# my @row_copy = map {defined $_ ? $_ : ''} @$row;

-sln

Should you decide to just define blanks instead of deleting
the elements, you won't have to create a temporary array,
just do it in place with this:

defined $_ or $_ = '' for (@$row);

So, then the code would look something like this:

foreach $ts ($te->tables) {
my $row;
my $rown = 0;
foreach $row ($ts->rows) {
next if (!defined($row) or !@$row);
defined $_ or $_ = '' for (@$row); # just blank undef's
my $trow = join ',', @$row;
print "\tRow: $rown\t",$trow,"\n";
$rown++;
}
$tn++;
}

-sln

Uri Guttman · Sep 1, 2009

s> Should you decide to just define blanks instead of deleting
s> the elements, you won't have to create a temporary array,
s> just do it in place with this:

s> defined $_ or $_ = '' for (@$row);

s> defined $_ or $_ = '' for (@$row); # just blank undef's
s> $trow = join ',', @$row;

you can merge those with a map:

$trow = join ',', map { defined $_ ? $_ : '' } @$row;

and if you are using 5.10 with the defined or op // that is even
simpler:

$trow = join ',', map { $_ // '' } @$row;

and with 5.10 the for modifier line could also become:

$_ //= '' for @{$row} ;

uri

Uri Guttman · Sep 1, 2009

BM> Meh. Those are all ugly. I much prefer

BM> {
BM> no warnings "uninitialized";
BM> $trow = join ",", @$row;
BM> }

that needs a block, and is longer. and i don't like to use the warnings
pragma unless absolutely necessary. just my style vs yours.

uri

Peter J. Holzer · Sep 1, 2009

Peter J. Holzer said:

Use of uninitialized value $row in join or string at c:/test_path/
Email_test_7.pl line 188, <GEN0> line 27252.

Here is the code block the warning relates to: [...]
foreach $row ($ts->rows) {
next unless defined $row; [...]
my $trow = join(',',@$row);

I assume this is line 188 because it's the only line with a join in it.
However I don't see how this line can be reached if $row is undefined.
Are you sure that this is the code you are running?

I was confused too. The error message is misleading, it is not $row
that is undefined, it is one of the elements in @$row that is undef.

Ah, yes. That makes sense.

hp
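Since the warning is about undefined cells inside @$row rather than $row itself, the original goal (skip the separator rows that carry no data) can be sketched as below. This is only an illustration under stated assumptions: row_to_line is a hypothetical helper, @rows stands in for what $ts->rows would return, and a cell counts as "no data" when it is undef, empty, or only whitespace or \xA0.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one extracted row (an array ref whose cells
# may be undef) into a comma-separated line. Returns undef in scalar
# context when no cell in the row contains visible text.
sub row_to_line {
    my ($row) = @_;
    return unless defined $row;
    # Treat undef cells as empty strings so join does not warn.
    my @cells = map { defined $_ ? $_ : '' } @$row;
    # Keep the row only if some cell has a character that is neither
    # whitespace nor a non-breaking space.
    return unless grep { /[^\s\xA0]/ } @cells;
    return join ',', @cells;
}

# Stand-in data for the rows that $ts->rows would return.
my @rows = (
    [ 'a', undef, 'c' ],
    [ undef, '', "\xA0" ],   # separator row: nothing visible
    [ 'd', 'e', undef ],
);

my $rown = 0;
for my $row (@rows) {
    my $line = row_to_line($row);
    next unless defined $line;
    print "\tRow: $rown\t$line\n";
    $rown++;
}
```

With the stand-in data this prints two rows, "a,,c" and "d,e,", and skips the separator row without any uninitialized-value warning.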
