Parsing HTML - using HTML::TreeBuilder

olson_ord · Oct 5, 2006

Hi,
I am trying to use Perl to parse a webpage, and I cannot get
started. I hope someone can help me.
I searched online and found that I am supposed to use
HTML::TreeBuilder. In the example below I am trying to get the text in
the "H2" tags. From the documentation there seem to be two
ways to do this (I might be wrong, so please correct me): using
look_down() or find_by_tag_name(). The latter is rather old.
I have used the former to look for images (just as a test) and the
latter to look for the "H2" tags. In both cases the number of
H2s or images comes out as 0.
What am I doing wrong here - or is there an easier way to get the
text in a HTML tag. I would be grateful for any help.

Regards,
Rio

--------------- Code -------------------------
use strict;
use LWP::UserAgent;
use LWP::Simple;
use URI::Escape;
use HTTP::Request::Common;
use HTML::TreeBuilder;
my $url = "http://wordlist.gredic.com/kaleidoscope";
my $html = get($url);
# print $html;

my $tree = HTML::TreeBuilder->new();
$tree->parse_file($html);

## --- Trial 1 ----------------
my @imgs = $tree->look_down( _tag => 'img');

## --- Trial 2 ----------------
my $elements = $tree->elementify();

my @word = $elements->find_by_tag_name('h2');

## --- Results ----------------
print "H2 Words = " . @word . "\n";
print "Imgs = " . @imgs . "\n";

# At the end need to free up the memory
$tree->delete;
print "completed script\n";
--------- End of Code ---------------------

P.S. The above is not my actual code - but a working example to
demonstrate my question

Paul Lalli · Oct 5, 2006

> What am I doing wrong here - or is there an easier way to get the
> text in a HTML tag.

I personally prefer HTML::TokeParser for parsing HTML, but TIMTOWTDI.

> use strict;

You forgot:
use warnings;

> use LWP::UserAgent;
> use LWP::Simple;

You generally don't use both of these...

> use URI::Escape;
> use HTTP::Request::Common;
> use HTML::TreeBuilder;
> my $url = "http://wordlist.gredic.com/kaleidoscope";
> my $html = get($url);

This function returns the actual HTML content of the URL.

> # print $html;
>
> my $tree = HTML::TreeBuilder->new();
> $tree->parse_file($html);

This attempts to find a file named by the string in $html and parse
that file. Obviously, no such file exists.

You want
$tree->parse($html);

Paul Lalli
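
For readers following along, here is a minimal sketch of the corrected approach. It assumes the page content is already in a string; the markup below is made up so that the example runs without network access. One detail beyond the reply above: the HTML::TreeBuilder documentation says to call eof() once you have finished feeding content to parse(), before using the tree.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Stand-in for what get($url) would return (made-up markup, not the real page).
my $html = <<'HTML';
<html><body>
<h2>kaleidoscope</h2>
<p>Some text <img src="a.png"> and <img src="b.png"></p>
<h2>second heading</h2>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);    # parse a string of HTML, not a file name
$tree->eof;             # signal end of input so the tree is finalized

# In list context, look_down returns every element matching the criteria.
my @imgs = $tree->look_down( _tag => 'img' );
my @h2   = $tree->look_down( _tag => 'h2' );

# as_text returns the text content of an element.
my @h2_text = map { $_->as_text } @h2;

print "Imgs = ", scalar(@imgs), "\n";
print "H2 = $_\n" for @h2_text;

$tree->delete;          # free the tree when done
```

With the sample markup this should report two images and the two heading texts. The elementify() call from the original script is left out here, since look_down can be called on the tree directly.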

olson_ord · Oct 6, 2006

Dear Paul,
Thanks a lot for taking the time to answer. I am not new to
programming (I use C++ for my work), but I am new to Perl. Yes, now
at least I have this initial part working. I expect I will have more
questions in the future.
If you prefer to use HTML::TokeParser I would love to look at it
myself. So if you have some handy tutorials on using the TokeParser
then it would be helpful for me. (Right now I could only locate
something at http://www.perlmonks.org/index.pl?node_id=99254, which I
will look at later.)
Thanks again,
O.O.

Paul Lalli · Oct 6, 2006

> If you prefer to use HTML::TokeParser I would love to look at it
> myself. So if you have some handy tutorials on using the TokeParser
> then it would be helpful for me.

I don't know about tutorials, but the documentation for the module is
pretty decent:
http://search.cpan.org/~gaas/HTML-Parser-3.55/lib/HTML/TokeParser.pm

Paul Lalli

olson_ord · Oct 6, 2006

Thanks a lot Paul.
I looked at the documentation HTML::TokeParser and it does not tell me
if there is an easy way to find a certain token (e.g. "h2") i.e. It
seems that I would have to start from the beginning and then scan all
the tokens until I reach the required token. (I am basically looking
for a find() function - or something similar.)
Thanks a lot for your help.
Regards,
O.O.

DJ Stunks · Oct 6, 2006

> Thanks a lot Paul.
> I looked at the documentation HTML::TokeParser and it does not tell me
> if there is an easy way to find a certain token (e.g. "h2") i.e. It
> seems that I would have to start from the beginning and then scan all
> the tokens until I reach the required token. (I am basically looking
> for a find() function - or something similar.)

Look a little harder, dude. It's (basically) 2 lines of code:

#!/usr/bin/perl

use strict;
use warnings;

use LWP::Simple;
use HTML::TokeParser;

my $url = 'http://wordlist.gredic.com/kaleidoscope';
my $html = get( $url );

my $p = HTML::TokeParser->new( \$html );

while ( my $tag_ref = $p->get_tag( 'h2' ) ) {
    printf "%s: %s\n", $tag_ref->[0], $p->get_trimmed_text;
}

__END__
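
For anyone who wants something closer to the find() function asked about, the same loop can be wrapped in a small helper. This is only a sketch: find_text is a hypothetical name, not part of HTML::TokeParser, and the sample markup is made up so that the example runs offline.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not part of the module): return the trimmed text
# of every <$tag> element found in the HTML string $html.
sub find_text {
    my ( $html, $tag ) = @_;
    my $p = HTML::TokeParser->new( \$html );
    my @found;
    while ( $p->get_tag($tag) ) {
        # Read the text up to the matching end tag, e.g. "/h2".
        push @found, $p->get_trimmed_text("/$tag");
    }
    return @found;
}

# Made-up markup standing in for the downloaded page.
my $html  = '<h1>Word list</h1><h2> kaleidoscope </h2><p>x</p><h2>prism</h2>';
my @words = find_text( $html, 'h2' );
print "$_\n" for @words;
```

The parser still scans the document from the start, as suspected above, but the scanning is hidden inside get_tag, so the calling code stays short.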

olson_ord · Oct 6, 2006

Thanks DJ.
I had thought of using a while statement (from looking at the tutorial
I mentioned above), but this would make my code look like a series of
while statements. I think I will stick with HTML::TreeBuilder, simply
because I have almost got my code working with it.
Thanks to you and Paul for your help.
O.O.

P.S. To other readers who, like me, are unfamiliar with Perl:
consider using a last statement in the while loop, i.e.

while ( my $tag_ref = $p->get_tag( 'h2' ) ) {
    printf "%s: %s\n", $tag_ref->[0], $p->get_trimmed_text;
    last;
}

so that the loop stops at the first match and you can process the
file further. (Perl's equivalent of the 'break' statement is 'last'.)
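
A small sketch of what "process the file further" can look like, assuming made-up markup in place of the real page. The parser object keeps its position in the document, so after leaving the loop the next get_tag call continues from where the previous one stopped.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up markup standing in for the downloaded page.
my $html = '<h2>kaleidoscope</h2><p>An optical toy.</p><h2>prism</h2>';
my $p    = HTML::TokeParser->new( \$html );

my $first_heading;
while ( $p->get_tag('h2') ) {
    $first_heading = $p->get_trimmed_text;
    last;    # leave the loop after the first match
}

# The parser remembers where it stopped, so this search starts
# after the first <h2>, not at the top of the document.
my $next_para;
if ( $p->get_tag('p') ) {
    $next_para = $p->get_trimmed_text;
}

print "$first_heading: $next_para\n";
```

A while loop that always ends in last on its first pass behaves like an if, so the first block could equally be written as if ( $p->get_tag('h2') ) { ... }.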
