olson_ord
Hi,
I am trying to use Perl to parse a webpage, and I cannot get
started. I hope someone can help me.
I searched online and found that I am supposed to use
HTML::TreeBuilder. In the example below I am trying to get the text in
the "h2" tags. From the documentation there seem to be two
ways to do this (I might be wrong, in which case please correct me):
look_down() and find_by_tag_name(). The latter is rather old.
I have used the former to look for images (just as a test) and the
latter to look for the "h2" tags. In both cases the count of
h2 elements or images comes out as 0.
What am I doing wrong here? Or is there an easier way to get the
text in an HTML tag? I would be grateful for any help.
Regards,
Rio
--------------- Code -------------------------
use strict;
use LWP::UserAgent;
use LWP::Simple;
use URI::Escape;
use HTTP::Request::Common;
use HTML::TreeBuilder;
my $url = "http://wordlist.gredic.com/kaleidoscope";
my $html = get($url);
# print $html;
my $tree = HTML::TreeBuilder->new();
$tree->parse_file($html);
## --- Trial 1 ----------------
my @imgs = $tree->look_down( _tag => 'img');
## --- Trial 2 ----------------
my $elements = $tree->elementify();
my @word = $elements->find_by_tag_name('h2');
## --- Results ----------------
print "H2 Words = " . @word . "\n";
print "Imgs = " . @imgs . "\n";
# At the end need to free up the memory
$tree->delete;
print "completed script\n";
--------- End of Code ---------------------
P.S. The above is not my actual code - but a working example to
demonstrate my question
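A possible cause, offered as a sketch rather than a confirmed diagnosis: parse_file() expects a file name or filehandle, whereas $html above holds the page content returned by get(), so the tree may end up empty. The version below feeds the string to parse_content() instead. The inline HTML and the helper name h2_texts_and_img_count are assumptions made for illustration; they stand in for the real page and are not part of the original script. elementify() is left out because an HTML::TreeBuilder object is already an HTML::Element, so look_down() and find_by_tag_name() can be called on it directly.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for the content that get($url) would return.
my $html = <<'HTML';
<html><body>
<h2>kaleidoscope</h2>
<img src="a.png">
<h2>second heading</h2>
</body></html>
HTML

# Illustrative helper (not from the original post): returns a reference
# to the list of <h2> texts and the number of <img> elements.
sub h2_texts_and_img_count {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_content($content);    # parse a string, not a file name
    my @h2   = map { $_->as_trimmed_text } $tree->look_down( _tag => 'h2' );
    my @imgs = $tree->look_down( _tag => 'img' );
    my $img_count = scalar @imgs;
    $tree->delete;                     # free the tree once the text is extracted
    return ( \@h2, $img_count );
}

my ( $h2, $img_count ) = h2_texts_and_img_count($html);
print "H2 Words = " . @$h2 . "\n";
print "Imgs = $img_count\n";
print "  $_\n" for @$h2;
```

With a live page, the same helper could be called on the result of get($url), after checking that get() did not return undef.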