Ted Byers
I have to automate parsing email that comes in with its data in an
HTML file (so I have no control over the content or how it is
formatted).
HTML::TableExtract has proved priceless in getting this done.
However, there are two issues that are giving me grief.
The first is probably simple, at least for regex experts. There are
characters in the string that, while not a problem in browsers
displaying the HTML, are a problem for my attempts to refine this data
down to text I can work with in my other code. There are a plethora
of instances of a character that Emacs displays as '\240'. But the
following statement doesn't remove them.
$payload_tmp =~ s/\240//g;
Neither does:
$payload_tmp =~ s/\\240//g;
I suspect it is a printer/display control character that results in
the following text being underlined when displayed using a browser
like MS IE or Firefox. What I don't know is what value I ought to use
in my regex to get rid of it.
I think I know what I can do, to work around this, but I would like to
know how to construct a regular expression to get rid of it.
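For reference, here is a minimal sketch of the kind of substitution I am after. It assumes that the '\240' Emacs shows is the octal escape for byte 0xA0 (a non-breaking space), which is only my guess. It also assumes the character may show up as an undecoded entity or as a UTF-8 byte pair, depending on where in the pipeline the substitution runs. The helper name strip_nbsp is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of my script): remove non-breaking
# spaces in the forms they might take. Assumes '\240' means 0xA0.
sub strip_nbsp {
    my ($text) = @_;
    $text =~ s/&nbsp;|&#160;|&#xA0;//gi;  # entities not yet decoded by a parser
    $text =~ s/\xC2\xA0//g;               # UTF-8 encoded byte pair for U+00A0
    $text =~ s/\xA0//g;                   # the single byte/character 0xA0 (octal 240)
    return $text;
}

my $cell = "Amount:\xA0\xA0100&nbsp;USD";
print strip_nbsp($cell), "\n";            # prints "Amount:100USD"
```

If the HTML carries the character as an entity, the parser would only decode it after my substitution has run, so a cleanup like this may need to be applied to the extracted cell values rather than to the raw payload.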
The more important question gets down to how to deal with a warning I
get on some output produced by HTML::TableExtract.
In the html I get, there is one table, but without proper table
headers, and there are two logical tables in this one HTML table
separated by rows that have no visible values in their cells. Those
cells without useful data cause problems in the output that manifests
with a warning message:
Use of uninitialized value $row in join or string at c:/test_path/Email_test_7.pl line 188, <GEN0> line 27252.
Here is the code block the warning relates to:
my $te = HTML::TableExtract->new();
$payload =~ s/\r//g;
my $payload_tmp = $payload;
$payload_tmp =~ s/\n//g;
$payload_tmp =~ s/\240//g;
$te->parse($payload_tmp);
my ($ts,$tn);
$tn = 0;
foreach $ts ($te->tables) {
  my $row;
  my $rown = 0;
  foreach $row ($ts->rows) {
    next unless defined $row;
    # not sure about this one, but I tried it because the join
    # mentioned in the warning uses @$row
    next unless defined @$row;
    my $fount = @$row;
    next unless defined $fount;
    next if ($fount == 0);
    my $trow = join(',',@$row);
    print "\tRow: $rown\t",$trow,"\n";
    $rown++;
  }
  $tn++;
}
Since I know the HTML that is producing this output, I just want to
skip over and ignore the rows having cells that have no data. Since
the warning says I have an 'uninitialized value $row in join or
string', I tried to skip if $row is undefined, and if the row has no
data, but these tests are not having the desired effect. It is as if
they weren't there. I don't know why I'd get a message that $row is
undefined and yet a statement "next unless defined $row;" has no
effect.
What did I miss here?
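One possibility I have not ruled out is that the warning refers to undefined elements inside @$row (the empty cells) rather than to $row itself. In that case none of the tests above would ever fire. Below is a minimal sketch of filtering at the cell level under that assumption. The rows are plain array references standing in for what $ts->rows returns, and the helper name format_row is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: join one row's cells with commas, or return
# undef when no cell holds visible data.
sub format_row {
    my ($row) = @_;
    return unless ref $row eq 'ARRAY';
    # Replace undefined cells with '' so join never sees undef.
    my @cells = map { defined $_ ? $_ : '' } @$row;
    # Skip rows whose cells are all empty, whitespace, or 0xA0.
    return unless grep { /[^\s\xA0]/ } @cells;
    return join(',', @cells);
}

# Stand-in data (an assumption, not real output from the module).
my @rows = (
    [ 'Name',   'Qty' ],
    [ undef,    undef ],   # separator row with empty cells
    [ 'Widget', undef ],   # partly filled row
    [ "\xA0",   ''    ],   # row holding only a non-breaking space
);

my $rown = 0;
for my $row (@rows) {
    my $trow = format_row($row);
    next unless defined $trow;
    print "\tRow: $rown\t$trow\n";
    $rown++;
}
```

With this stand-in data only the first and third rows are printed. I also gather that defined @$row is deprecated, and a fatal error from Perl 5.22 on, so that particular test probably should not stay in the loop either way.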