Parsing HTML - using HTML::TreeBuilder

olson_ord

Hi,
I am trying to use Perl to parse a webpage, and I cannot get
started. I hope someone can help me.
I searched online and found that I am supposed to use
HTML::TreeBuilder. In the example below I am trying to get the text in
the tag named "H2". From the documentation there seem to be two
ways to do this (please correct me if I am wrong), i.e.
look_down() and find_by_tag_name(). The latter is rather old.
I have used the former to look for images (just as a test) and the
latter to look for the "H2" tags. In both cases the number of
H2s or images comes out as 0.
What am I doing wrong here - or is there an easier way to get the
text in an HTML tag? I would be grateful for any help.

Regards,
Rio

--------------- Code -------------------------
use strict;
use LWP::UserAgent;
use LWP::Simple;
use URI::Escape;
use HTTP::Request::Common;
use HTML::TreeBuilder;
my $url = "http://wordlist.gredic.com/kaleidoscope";
my $html = get($url);
# print $html;

my $tree = HTML::TreeBuilder->new();
$tree->parse_file($html);

## --- Trial 1 ----------------
my @imgs = $tree->look_down( _tag => 'img');

## --- Trial 2 ----------------
my $elements = $tree->elementify();

my @word = $elements->find_by_tag_name('h2');

## --- Results ----------------
print "H2 Words = " . @word . "\n";
print "Imgs = " . @imgs . "\n";

# At the end need to free up the memory
$tree->delete;
print "completed script\n";
--------- End of Code ---------------------

P.S. The above is not my actual code, but a cut-down example to
demonstrate my question.
 
Paul Lalli

What am I doing wrong here - or is there an easier way to get the
text in an HTML tag?

I personally prefer HTML::TokeParser for parsing HTML, but TIMTOWTDI.
use strict;

You forgot:
use warnings;
use LWP::UserAgent;
use LWP::Simple;

You generally don't use both of these...
use URI::Escape;
use HTTP::Request::Common;
use HTML::TreeBuilder;
my $url = "http://wordlist.gredic.com/kaleidoscope";
my $html = get($url);

This function returns the actual HTML content of the URL.
# print $html;

my $tree = HTML::TreeBuilder->new();
$tree->parse_file($html);

This attempts to find a file named by the string in $html and parse
that file. Obviously, no such file exists.

You want
$tree->parse($html);

Paul Lalli
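
For readers following along, here is a minimal sketch of the corrected TreeBuilder approach. It parses an inline string instead of fetching the URL, so the sample HTML and the variable names are assumptions made for illustration, not content from the real page. When feeding a string through parse(), the HTML::TreeBuilder documentation also asks for eof() afterwards so the tree is finalized.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical inline page standing in for the string returned by get($url).
my $html = <<'END_HTML';
<html><body>
<h2>kaleidoscope</h2>
<img src="a.png"><img src="b.png">
<h2>second heading</h2>
</body></html>
END_HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);   # parse the string itself, not a file named by it
$tree->eof;            # signal end of input so the tree is finalized

# look_down() in list context returns every matching element.
my @imgs = $tree->look_down( _tag => 'img' );
my @h2   = $tree->look_down( _tag => 'h2' );

# find_by_tag_name() is the older interface; in list context it also
# returns all matches.
my @h2_old = $tree->find_by_tag_name('h2');

# as_text() returns the text content of an element.
my @words = map { $_->as_text } @h2;

print "H2 Words = " . scalar(@words) . "\n";
print "Imgs = " . scalar(@imgs) . "\n";
print "$_\n" for @words;

# Free the tree when done.
$tree->delete;
```

With real input, replace the heredoc with the result of get($url) and check that it is defined before parsing.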
 

olson_ord

Dear Paul,
Thanks a lot for taking the time to answer. I am not new to
programming (I use C++ for my work), but I am new to Perl. Yes, now
at least I got this initial part to work. I think I will have more
questions in the future.
If you prefer to use HTML::TokeParser I would love to look at it
myself. So if you have some handy tutorials on using TokeParser,
they would be helpful for me. (Right now I could only locate
something at http://www.perlmonks.org/index.pl?node_id=99254 which I
will look at later.)
Thanks again,
O.O.
 
olson_ord

Thanks a lot Paul.
I looked at the documentation for HTML::TokeParser and it does not tell me
if there is an easy way to find a certain token (e.g. "h2"). It
seems that I would have to start from the beginning and then scan all
the tokens until I reach the required token. (I am basically looking
for a find() function or something similar.)
Thanks a lot for your help.
Regards,
O.O.
 
DJ Stunks

Thanks a lot Paul.
I looked at the documentation for HTML::TokeParser and it does not tell me
if there is an easy way to find a certain token (e.g. "h2"). It
seems that I would have to start from the beginning and then scan all
the tokens until I reach the required token. (I am basically looking
for a find() function or something similar.)

Look a little harder, dude. It's (basically) 2 lines of code:

#!/usr/bin/perl

use strict;
use warnings;

use LWP::Simple;
use HTML::TokeParser;

my $url = 'http://wordlist.gredic.com/kaleidoscope';
my $html = get( $url );

my $p = HTML::TokeParser->new( \$html );

while ( my $tag_ref = $p->get_tag( 'h2' ) ) {
    printf "%s: %s\n", $tag_ref->[0], $p->get_trimmed_text;
}

__END__
 
olson_ord

Thanks DJ.
I had thought of using a while statement (from looking at the tutorial
I mentioned above). This would make my code look like a series of while
statements. I think I will stick with HTML::TreeBuilder, simply
because I have almost got my code working using that.
Thanks to you and Paul for your help.
O.O.

P.S. To other readers (who are unfamiliar with Perl, like myself):
consider using a last statement in the while loop, i.e.

while ( my $tag_ref = $tp->get_tag( 'h2' ) ) {
    printf "%s: %s\n", $tag_ref->[0], $tp->get_trimmed_text;
    last;
}

so that the loop stops after the first match and you can process the
rest of the file further. (Perl calls the 'break' statement 'last'.)
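
A while loop that always exits on its first pass can also be written as a plain if, which may read more naturally to someone coming from C++. The sketch below assumes a small inline document (hypothetical, not the real page) and shows that the parser keeps its position between calls, so later get_tag() calls pick up where the previous one stopped.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical inline document standing in for the fetched page.
my $html = '<h2>first</h2><p>intro</p><h2>second</h2><p>more</p>';

# Passing a scalar reference tells TokeParser to parse the string itself.
my $tp = HTML::TokeParser->new( \$html );

# A single get_tag() call advances to the next <h2>; no loop is needed
# when only the first match matters.
my $first;
if ( $tp->get_tag('h2') ) {
    $first = $tp->get_trimmed_text;
    print "first h2: $first\n";
}

# The parser remembers where it stopped, so this finds the <p> that
# follows the first <h2>.
my $next_para;
if ( $tp->get_tag('p') ) {
    $next_para = $tp->get_trimmed_text;
    print "following paragraph: $next_para\n";
}
```

Because the token stream only moves forward, the order of the calls matters: a tag that appears before the current position cannot be reached again without creating a new parser.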


