HTML::TokeParser & TableExtract

Abram · Apr 25, 2006

I'm fairly new to Perl, so bare with me.

I am trying to extract a table from an HTML file and parse through each
row, then dump the extracted cell data into a csv file. This was
pretty easy to accomplish with HTML::TokeParser, however I have one
problem. Each HTML file I need to parse has three tables with the same
structure. I need to separate these three tables into three csv files.

I can use TableExtract to get the exact tables using the depth and
count matching (depth is always 2 and count is 5-7), but I am not sure
how to then parse only that table and extract the data. I'm sure this
is pretty simple stuff, and I'll kick myself when I see the answer.

Thanks in advance.

--Abram

David Squire · Apr 25, 2006

Abram said:
I'm fairly new to Perl, so bare with me.

What an image!

I guess you mean "bear with me"

(Sorry, but it seems to be spelling/idiom correction day here).

DS

Dr.Ruud · Apr 25, 2006

David Squire schreef:

it seems to be spelling/idiom correction day here

How perfect, on my birthday!

Abram · Apr 25, 2006

Ha! My brain has become a bit mushy with my hours of "learning" perl,
so I didn't even notice... I better put something on!

At least it got some attention. Any suggestions (not on my apparel, but
on the HTML data extraction)?

Tad McClellan · Apr 26, 2006

Abram said:
I can use TableExtract to get the exact tables using the depth and
count matching (depth is always 2 and count is 5-7), but I am not sure
how to then parse only that table and extract the data. I'm sure this
is pretty simple stuff, and I'll kick myself when I see the answer.

From "perldoc HTML::TableExtract":

$te = new HTML::TableExtract( depth => 2, count => 2 );
$te->parse($html_string);
foreach $ts ($te->table_states) {
print "Table found at ", join(',', $ts->coords), ":\n";
foreach $row ($ts->rows) {
print " ", join(',', @$row), "\n";
}
}

That seems to do it.

Are you having trouble modifying that to produce CSV?
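A minimal sketch of one way to send each matched table to its own CSV file. The helper name `write_tables_to_csv`, the file-name prefix, and the simple quoting are illustrative assumptions, not part of HTML::TableExtract; the only thing assumed of each table object is a `rows` method like the one used in the perldoc excerpt above.

```perl
use strict;
use warnings;

# Hypothetical helper: write each table object to "<prefix>_<n>.csv".
# Each element of @tables only needs a rows() method returning array refs.
sub write_tables_to_csv {
    my ($prefix, @tables) = @_;
    my @files;
    my $n = 0;
    for my $table (@tables) {
        $n++;
        my $file = "${prefix}_$n.csv";
        open my $out, '>', $file or die "Couldn't open $file for writing: $!\n";
        for my $row ($table->rows) {
            my @fields = map {
                my $cell = defined $_ ? $_ : '';   # guard against undef cells
                $cell =~ s/^\s+|\s+$//g;           # trim surrounding whitespace
                $cell =~ s/"/""/g;                 # double embedded quotes
                qq{"$cell"};
            } @$row;
            print {$out} join(',', @fields), "\n";
        }
        close $out or die "Couldn't close $file: $!\n";
        push @files, $file;
    }
    return @files;   # the names of the files that were written
}
```

With a parsed extractor, the call would look something like write_tables_to_csv('table', $te->tables), giving table_1.csv, table_2.csv and table_3.csv when three tables match.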

Abram · Apr 26, 2006

Thanks Tad,

Tad McClellan wrote:
Are you having trouble modifying that to produce CSV?

Actually yes. I have been using the code from perldoc (slightly
modified), but cannot seem to get the proper structure for csv. That
is why I was looking into TokeParser as I could easily parse through
each TD and conditionally extract the data.

Could you provide some help on how to get this done with TableExtract?
My HTML looks something like this:
....
<table>
<tr>
<td> Header 1 </td>
<td> Header 2 </td>
<td> Header 3 </td>
</tr>

<tr id="Data_Row_1">
<td> data 1_1 </td>
<td> data 1_2 </td>
<td> data 1_3 </td>
</tr>
<tr id="Data_Row_1_1">
<td colspan=3> More data for 1 </td>
</tr>
<tr id="Data_Row_2">
<td> data 2_1 </td>
<td> data 2_2 </td>
<td> data 2_3 </td>
</tr>
<tr id="Data_Row_2_1">
<td colspan=3> More data for 2 </td>
</tr>
</table>
(NOTE: The actual HTML doesn't have tr ids; they are used here just to
illustrate associated rows.)

To make things even more interesting I need to extract the "More data
for NN" row and append it to the data row.

Any suggestions?

A. Sinan Unur · Apr 26, 2006

Thanks Tad,

Actually yes. I have been using the code from perldoc (slightly
modified), but cannot seem to get the proper structure for csv. That
is why I was looking into TokeParser as I could easily parse through
each TD and conditionally extract the data.
....

<tr id="Data_Row_1">
<td> data 1_1 </td>
<td> data 1_2 </td>
<td> data 1_3 </td>
</tr>
<tr id="Data_Row_1_1">
<td colspan=3> More data for 1 </td>
</tr>
....

(NOTE: The actual HTML doesn't have tr ids; they are used here just to
illustrate associated rows.)

To make things even more interesting I need to extract the "More data
for NN" row and append it to the data row.

In which column are you supposed to put the "More data for NN" data?

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

Abram · Apr 26, 2006

Sinan,

In which column are you supposed to put the "More data for NN" data?

The last column of the row. So it would look like this in the csv:
data 1_1,data 1_2,data 1_3,More data for 1
data 2_1,data 2_2,data 2_3,More data for 2
data 3_1,data 3_2,data 3_3,More data for 3
data 4_1,data 4_2,data 4_3,More data for 4
....etc...

--Abram

Tad McClellan · Apr 26, 2006

Abram said:
Actually yes. I have been using the code from perldoc (slightly
modified), but cannot seem to get the proper structure for csv.

It is _already_ CSV, with extra spaces at the beginning and
no quotes around fields.

Modify the boilerplate code to eliminate the extra spaces, and
to put quotes around fields.
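One way to do both steps, offered as a sketch rather than the exact change suggested here: the helper name `csv_line` is hypothetical, and it assumes the common CSV convention of quoting every field and doubling any embedded double quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one row (an array ref of cells) into a CSV line.
sub csv_line {
    my ($row) = @_;
    my @fields;
    for my $cell (@$row) {
        my $text = defined $cell ? $cell : '';  # treat missing cells as empty
        $text =~ s/^\s+//;                      # drop leading whitespace
        $text =~ s/\s+$//;                      # drop trailing whitespace
        $text =~ s/"/""/g;                      # escape embedded double quotes
        push @fields, qq{"$text"};
    }
    return join(',', @fields) . "\n";
}

print csv_line([' data 1_1 ', ' data 1_2 ', ' data 1_3 ']);
# prints "data 1_1","data 1_2","data 1_3"
```

For anything beyond simple cells, a dedicated module such as Text::CSV from CPAN is probably the safer choice.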

Could you provide some help on how to get this done with TableExtract?

Sure.

Post your broken code, and someone will help you fix it.

To make things even more interesting I need to extract the "More data
for NN" row and append it to the data row.

How do you identify what is to be joined?

Does it always have the "More data" text in it? (I doubt it)

Are there times when there is NOT a "continuation" row?

Can there be more than one "continuation row"?

etc...

Any suggestions?

If you need debugging help, you pretty much have to post the
code that you want debugged...

A. Sinan Unur · Apr 27, 2006

Sinan,

The last column of the row. So it would look like this in the csv:
data 1_1,data 1_2,data 1_3,More data for 1
data 2_1,data 2_2,data 2_3,More data for 2
data 3_1,data 3_2,data 3_3,More data for 3
data 4_1,data 4_2,data 4_3,More data for 4

Each regular row will contain 3 elements. The continuation row will have
only one element. Join that element with the third column of the previous
row.

For more help, post your best attempt to implement the algorithm above. If
it does not work and I don't get a chance to look, someone else will
definitely help you fix it.
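A sketch of that algorithm under stated assumptions: the helper name `merge_continuation_rows` is hypothetical, a continuation row is taken to be any row with exactly one non-empty cell, and it is assumed to follow the data row it belongs to. The extra cell is appended as a fourth column, matching the sample output quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: fold each one-cell continuation row into the row before it.
sub merge_continuation_rows {
    my @rows = @_;
    my @merged;
    for my $row (@rows) {
        # Ignore undef or blank cells when counting what the row holds.
        my @cells = grep { defined($_) && /\S/ } @$row;
        if (@cells == 1 && @merged) {
            push @{ $merged[-1] }, $cells[0];   # append as an extra column
        }
        else {
            push @merged, [ @$row ];            # copy, so the input is untouched
        }
    }
    return @merged;
}

my @rows = (
    [ 'data 1_1', 'data 1_2', 'data 1_3' ],
    [ 'More data for 1' ],
    [ 'data 2_1', 'data 2_2', 'data 2_3' ],
    [ 'More data for 2' ],
);
print join(',', @$_), "\n" for merge_continuation_rows(@rows);
# prints:
# data 1_1,data 1_2,data 1_3,More data for 1
# data 2_1,data 2_2,data 2_3,More data for 2
```

A data row that happens to contain only one non-empty cell would be misread as a continuation row, so the test would need tightening for real data.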

Sinan
--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

Abram · Apr 27, 2006

Thanks for all your help guys! I think I got it.

Here's what ended up working for me, please advise as to any better
approaches.

#!/usr/bin/perl
use HTML::TableExtract;

# Declare the subroutines
sub trim($);

my $html_file = "C:/webfiles/test.htm";
$te = HTML::TableExtract->new( depth => 1, count => 6 );
$te->parse_file($html_file);

my $log = "c://perl//pl_projects//web_parser.log";
open(my $LF,">> $log") or die "Couldn't open $log for writing: $!\n";
my $we_need_to_truncate = 0;
foreach $ts ($te->tables) {
    foreach $row ($ts->rows) {
        $counter ++;
        if ($counter > 4 ){
            for ($i=1; $i<6; $i++) {
                #If the table has no top keywords we need to truncate the file
                if (@$row[$i] =~ m/No keywords rank*/){
                    $we_need_to_truncate = 1;
                }
                # $bit is used to determine if we need to join row to the previous row
                if(!$bit){
                    $str = $str.trim(@$row[$i]).",";
                }else{
                    $str = $str.trim(@$row[$i]);
                }
            }
            if ($bit){
                $bit=0;
                $str = trim($str)."\n";
            }else{
                $bit=1;
            }
        }
    }
}
#Write the file
my $old_fh = select($LF);
print $str;
select ($old_fh);
close($LF) or die "Couldn't close $log: $!\n";

#remove the last three rows if need be
if($we_need_to_truncate){
    for ($i=1; $i<4; $i++){
        truncatefile($log);
    }
}

# Perl trim function to remove whitespace from the start and end of the string
sub trim($)
{
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;

    return $string;
}

sub truncatefile()
{
    open (FH, "+< $log") or die "can't update $log: $!";
    while (<FH>) {
        $addr = tell(FH) unless eof(FH);
    }
    truncate(FH, $addr) or die "can't truncate $log: $!";
}

--Abram

Dr.Ruud · Apr 27, 2006

Abram schreef:

Thanks for all your help guys! I think I got it.

Here's what ended up working for me, please advise as to any better
approaches.

#!/usr/bin/perl

Missing:

use strict;
use warnings;

use HTML::TableExtract;

# Declare the subroutines
sub trim($);

Not necessary.

my $html_file = "C:/webfiles/test.htm";

Use single quotes when double quotes are not needed.

$te = HTML::TableExtract->new( depth => 1, count => 6 );

my $te = ...

$te->parse_file($html_file);

my $log = "c://perl//pl_projects//web_parser.log";

Replace the double forward slashes with single ones.

open(my $LF,">> $log") or die "Couldn't open $log for writing: $!\n";

Is there a special reason to use uppercase for the lexical filehandle?
See also the 3-arguments form: perldoc -f open.

my $we_need_to_truncate = 0;

I would use a shorter name, like $must_truncate or even just $truncate.

foreach $ts ($te->tables) {

for my $ts ($te->tables) {
(further my's not mentioned)

foreach $row ($ts->rows) {
$counter ++;

How high may that counter go?

if ($counter > 4 ){
for ($i=1; $i<6; $i++) {
#If the table has no top keywords we need to truncate the file
if (@$row[$i] =~ m/No keywords rank*/){

Zero, one or more k's? Just remove that asterisk.
Is that text at the start of a line? Add an anchor.

$we_need_to_truncate = 1;
}
# $bit is used to determine if we need to join row to the previous row

if(!$bit){
$str = $str.trim(@$row[$i]).",";
}else{
$str = $str.trim(@$row[$i]);
}

Some variants:
$str = $str.trim(@$row[$i]) . ($bit ? '' : ',');
or
$str = $str.trim(@$row[$i]);
$str .= ',' if $bit == 0;
or
$str = $str.trim(@$row[$i]);
$bit or $str .= ',';

}
if ($bit){
$bit=0;

if ($bit) {
$bit = 0;

Whitespace is quite cheap.

$str = trim($str)."\n";
}else{
$bit=1;
}
}
}
}
#Write the file
my $old_fh = select($LF);
print $str;
select ($old_fh);

Brackets are not needed with select.

close($LF) or die "Couldn't close $log: $!\n";

Brackets are not needed with close.

#remove the last three rows if need be
if($we_need_to_truncate){
for ($i=1; $i<4; $i++){
truncatefile($log);
}

$truncate and truncatefile($log) for 1 .. 3;

Ben Morrow · Apr 27, 2006

Quoth Abram:
Thanks for all your help guys! I think I got it.

Here's what ended up working for me, please advise as to any better
approaches.

#!/usr/bin/perl

You definitely want

use strict;
use warnings;

here. Get Perl to help you get things right.

use HTML::TableExtract;

# Declare the subroutines
sub trim($);

my $html_file = "C:/webfiles/test.htm";
$te = HTML::TableExtract->new( depth => 1, count => 6 );
$te->parse_file($html_file);

my $log = "c://perl//pl_projects//web_parser.log";

Why have you doubled these slashes? Are you confusing them with
backslashes (which do need doubling in a "" string)?

open(my $LF,">> $log") or die "Couldn't open $log for writing: $!\n";

It's better to use three-arg open when you can (all the time,
basically), and you don't need those parens since you're using 'or'
instead of '||'.

open my $LF, '>>', $log or die "...";
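To illustrate why the choice between 'or' and '||' matters once the parens are dropped, here is a small sketch. The file name is a made-up one that is assumed not to exist, and the two eval blocks are only there so the demonstration can report which form actually dies.

```perl
use strict;
use warnings;

my $missing = 'no-such-file-for-this-demo.txt';   # assumed not to exist

# With '||' the die binds to $missing, which is true, so die is never
# reached and the failed open goes unnoticed.
my $reached = eval {
    open my $fh, '<', $missing || die "Couldn't open $missing: $!\n";
    1;
};

# With 'or' the die applies to the result of open, as intended.
my $caught = !eval {
    open my $fh, '<', $missing or die "Couldn't open $missing: $!\n";
    1;
};

print "'||' version died: ", ($reached ? 'no' : 'yes'), "\n";
print "'or' version died: ", ($caught  ? 'yes' : 'no'), "\n";
```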

You get lots of points for 1. using lexical FHs 2. checking the return
value and 3. including both the file and $! in the message, though.

BTW, do you realise that putting "\n" on the end of a 'die' message
suppresses the file/line-number information? This is probably a
situation (a message directed at the user rather than a developer) where
that is appropriate, but in case you didn't know...
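A small sketch of the difference, using eval to capture what each form of die produces; the message text is made up for the demonstration.

```perl
use strict;
use warnings;

# Message ending in a newline: Perl leaves it exactly as written.
eval { die "can't continue\n" };
my $with_newline = $@;

# No trailing newline: Perl appends " at <file> line <n>." and a newline.
eval { die "can't continue" };
my $without_newline = $@;

print "with newline:    $with_newline";
print "without newline: $without_newline";
```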

my $we_need_to_truncate = 0;

I wouldn't have the '= 0' here: undef is a perfectly good false value.
But that's probably a matter of taste...

foreach $ts ($te->tables) {
foreach $row ($ts->rows) {
$counter ++;
if ($counter > 4 ){
for ($i=1; $i<6; $i++) {

A much more Perlish way to write that is

for my $i (1..5) {

which also makes the upper bound clearer.

#If the table has no top keywords we need to truncate the file
if (@$row[$i] =~ m/No keywords rank*/){
$we_need_to_truncate = 1;
}
# $bit is used to determine if we need to join row to the previous row
if(!$bit){
$str = $str.trim(@$row[$i]).",";
}else{
$str = $str.trim(@$row[$i]);
}
}
if ($bit){
$bit=0;
$str = trim($str)."\n";
}else{
$bit=1;
}
}
}
}
#Write the file
my $old_fh = select($LF);
print $str;
select ($old_fh);

You can tell print which filehandle to print to without selecting it:

print $LF $str;

Note the lack of comma after '$LF'.

close($LF) or die "Couldn't close $log: $!\n";

#remove the last three rows if need be
if($we_need_to_truncate){
for ($i=1; $i<4; $i++){
truncatefile($log);
}
}

# Perl trim function to remove whitespace from the start and end of the string
sub trim($)

You don't need to prototype (the '($)') Perl subs. This one does no
harm...

{
my $string = shift;
$string =~ s/^\s+//;
$string =~ s/\s+$//;

return $string;
}

sub truncatefile()

....but this one is wrong, as you call it with a parameter above. It
happens to work, because the call is compiled before the prototype is
seen and $log is visible inside the sub, but that's not good practice;
so you want something more like

sub truncatefile {
my ($log) = @_; # get the parameters

{
open (FH, "+< $log") or die "can't update $log: $!";
while (<FH>) {
$addr = tell(FH) unless eof(FH);
}
truncate(FH, $addr) or die "can't truncate $log: $!";
}

This is a really inefficient way of removing the last line from the
file. As you accumulate the whole file before you print it, you can just
use something like (untested)

$str =~ s/(?: [^\n]* \n ){0,3} $//x;

before you print it; or, better, push the lines onto an array as you go
rather than joining them with "\n" and then chop off the last three
elements.
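The array approach could look something like this sketch; the helper name `drop_last_lines` and the sample rows are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines with the last $count removed.
sub drop_last_lines {
    my ($lines, $count) = @_;
    my @kept = @$lines;
    $count = @kept if $count > @kept;     # never remove more than we have
    splice @kept, -$count if $count > 0;  # remove the trailing elements
    return @kept;
}

my @lines = map { "row $_" } 1 .. 5;
print "$_\n" for drop_last_lines(\@lines, 3);
# prints:
# row 1
# row 2
```

The lines can then be written out in one go once the loop has finished, so the file never needs to be reopened and truncated.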

Ben

Abram · Apr 27, 2006

Thanks for the tips! I knew it was a bit sloppy and long-form.

Tad McClellan · Apr 28, 2006

Abram said:
please advise as to any better
approaches.

#!/usr/bin/perl
use HTML::TableExtract;

You are missing:

use warnings;
use strict;

sub trim($);

Prototypes almost surely don't do what you think they do; consider
not using them.
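One concrete surprise, as a sketch with made-up sub names: a ($) prototype does not check that a scalar was passed; it forces scalar context on the argument, so an array turns into its element count.

```perl
use strict;
use warnings;

# With a ($) prototype the argument is evaluated in scalar context,
# so an array argument arrives as its element count.
sub first_arg_proto($) { return $_[0] }

# Without a prototype the array is flattened into the argument list.
sub first_arg_plain    { return $_[0] }

my @cells = ('data 1_1', 'data 1_2', 'data 1_3');

print first_arg_proto(@cells), "\n";   # prints 3
print first_arg_plain(@cells), "\n";   # prints data 1_1
```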

for ($i=1; $i<6; $i++) {

foreach my $i ( 1 .. 5 ) {

my $old_fh = select($LF);
print $str;
select ($old_fh);

What is the point of those 3 lines?

What is wrong with this 1 line instead?

print $LF $str;

I have never used select() for this purpose in over 10 years
of Perl programming.

Where did you learn about using select() like that?

for ($i=1; $i<4; $i++){

foreach my $i ( 1 .. 3 ){

Ben Morrow · Apr 28, 2006

Quoth Dr.Ruud:
Abram schreef:

Not necessary.

But useful if you want to call it without parens later.

Use single quotes when double quotes are not needed.

I believe this is considered a matter of style (I agree with you, but
others do not).

Is there a special reason to use uppercase for the lexical filehandle?

It is traditional, from when it was usual to use an uppercase bareword.

I frequently use a convention like $IN is a file/$in is a line read from
that file.

if (@$row[$i] =~ m/No keywords rank*/){

Zero, one or more k's? Just remove that asterisk.

I suspect he was thinking of /...rank.*/... but still, not necessary.

Brackets are not needed with close.

Again, a matter of style. Some people are more comfortable with function
calls having parens.

Ben

David Combs · May 22, 2006

before you print it; or, better, push the lines onto an array as you go
rather than joining them with "\n" and then chop off the last three
elements.

General question about pushing onto an array (and GC):

Suppose you're reading some large file, and for each
(or certain) lines in it,

you want to modify it somehow
and then push it onto an array.

Now, about GC and thrashing (eg GC'ing way too often
for comfort):

If instead of the above, suppose you first pushed
the (certain) lines onto the array, and then
later, in a 2nd pass (through the array) you
do the per-line modifications.

Under what conditions might that be a big win,
in that (with luck) you'd end up pushing each
line at the end of the "free space" (gotten by
the most recent GC)?

That is, if you pushed something onto the array,
and the array wasn't already at the *end* of
the free-space, the perl-os would have to *copy*
the entire array, and then append the line.

(OOPS: arrays are surely just an array of pointers
*to* the lines. So, translate the problem to instead
appending lines onto a single ever-growing *STRING*.)

Anyway, you can see what I'm getting at: how to
**sometimes** program so as to minimize the
GC's.

What features does perl6 have towards this end?

(I believe some languages let one allocate *multiple*
garbage-collectable spaces, so when you *really* need
to, you can set it up so that you do your appending
to one continuous structure in one space, and the
other things on another space, thus keeping them
from interfering with each other.)

This having to copy at each append can rapidly overwhelm
all other cpu-usage, what with it being an n-squared space-usage
process.

Comments?

Thanks,

David
