fetching webpage and extracting contents

  • Thread starter alfonsobaldaserra

alfonsobaldaserra

hello

i am trying to write a script which will go to bbc's top 40 pages and
show only intended contents.

i have written a script

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;

my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");

if ($res->is_success) {
open my $bbc, ">", "bbc.txt" or die "$!\n";
print $bbc $res->decoded_content;
close $bbc;
} else {
die "could not fetch bbc.co.uk\n";
}

open my $bbc, "<", "bbc.txt" or die "$!\n";
while (<$bbc>) {
print if m!<span class="artist">(.*)</span>!;
print if m!<span class="track">(.*)</span>!;
#next unless $_ =~ m[(<span class="artist">)|(<span class="track">)];
#my ($foo) =~ m!<span class="artist">(.*)</span>!;
#my ($bar) =~ m!<span class="track">(.*)</span>!;
# print "$foo -> $bar\n";
}

__RESULT__
<span class="artist">Tinie Tempah</span>
<span class="track">Written In The Stars</span>
<span class="artist">Bruno Mars</span>
<span class="track">Just The Way You Are (Amazing)</span>
<span class="artist">Labrinth</span>
<span class="track">Let The Sun Shine</span>
<span class="artist">Adele</span>
<span class="track">Make You Feel My Love</span>
<span class="artist">Taio Cruz</span>
<span class="track">Dynamite</span>



but i can't figure out

#1 how to parse $res->decoded_content without writing it to a file
because apparently the whole page is a single string

#2 how to show data in artist - track format, like
Tinie Tempah - Written In The Stars

#3 how to make this work
#next unless $_ =~ m[(<span class="artist">)|(<span class="track">)];
#my ($foo) =~ m!<span class="artist">(.*)</span>!;
#my ($bar) =~ m!<span class="track">(.*)</span>!;
# print "$foo -> $bar\n"

appreciate your time gents.

salute :)
 

alfonsobaldaserra

#1 how to parse $res->decoded_content without writing it to a file
because apparently the whole page is a single string

got it fixed by opening a fh to $res->decoded_content

#2 how to show data in artist - track format, like
Tinie Tempah - Written In The Stars


so the new code is

#!/usr/bin/perl

use strict;
#use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;

my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");

if ($res->is_success) {
open my $bbc, "<", \$res->decoded_content or die "$!\n";
while (defined (my $con = <$bbc>)) {
chomp $con;
next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
my ($artist) = $con =~ m!<span class="artist">(.*?)</span>!;
my ($track) = $con =~ m!<span class="track">(.*?)</span>!;
print "$artist - $track\n";
}

} else {
die "could not fetch bbc.co.uk\n";
}


but the output is coming as

Tinie Tempah -
- Written In The Stars
Bruno Mars -
- Just The Way You Are (Amazing)
Labrinth -
- Let The Sun Shine
Adele -
- Make You Feel My Love

while it should have been

Tinie Tempah - Written In The Stars
Bruno Mars - Just The Way You Are (Amazing)
Labrinth - Let The Sun Shine
Adele - Make You Feel My Love

i cant figure out why this is happening.

any help guys?

thanku :)
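
A likely explanation, sketched on two sample lines rather than the live page: each line of the markup holds only one of the two spans, so on every pass through the loop one of the two matches fails and its variable stays undef. With warnings commented out, the undef prints as an empty string. `match_both` is a hypothetical helper name used only for this illustration.

```perl
use strict;
use warnings;

# Two sample lines shaped like the chart markup: one span per line.
my @lines = (
    '<span class="artist">Adele</span>',
    '<span class="track">Make You Feel My Love</span>',
);

# Hypothetical helper: run both matches against a single line, as the loop does.
sub match_both {
    my ($line) = @_;
    my ($artist) = $line =~ m!<span class="artist">(.*?)</span>!;
    my ($track)  = $line =~ m!<span class="track">(.*?)</span>!;
    return ($artist, $track);    # one of the two is undef on every line
}

for my $line (@lines) {
    my ($artist, $track) = match_both($line);
    printf "artist=%s track=%s\n", $artist // '(undef)', $track // '(undef)';
}
```

So the artist has to be remembered from one line and printed when the matching track line arrives.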
 

alfonsobaldaserra

i got a real bad code working :)

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;

my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");

if ($res->is_success) {
open my $bbc, "<", \$res->decoded_content or die "$!\n";
while (defined (my $con = <$bbc>)) {
chomp $con;
next if $con =~ /^\s*$/;
next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
$con =~ s/^\s*|\s*$//g;
if ($con =~ m!<span class="artist">(.*)</span>!) {
print $1, " - ";
} elsif ($con =~ m!<span class="track">(.*)</span>!) {
print $1, "\n";
}
}
}


thank you gents for giving me a chance to do it myself.

though i am still looking for any improvements that you could
suggest :)
 

Peter Makholm

alfonsobaldaserra said:
i got a real bad code working :)

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;

my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");

if ($res->is_success) {
open my $bbc, "<", \$res->decoded_content or die "$!\n";

Don't do this. While possible, it is kind of obscure and should in my
opinion only be used when an existing interface requires a perl file
handle.

Just split the content on newlines if you want to iterate over the
lines.

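
A minimal sketch of the split approach, assuming each span still sits on its own line with the artist first, as in the output posted earlier. The sample `$content` string stands in for `$res->decoded_content`, and `chart_lines` is a hypothetical helper name, not part of LWP.

```perl
use strict;
use warnings;

# Hypothetical helper: walk the page line by line without a file handle.
# Assumes one span per line, with the artist line before its track line.
sub chart_lines {
    my ($content) = @_;
    my (@out, $artist);
    for my $line (split /\n/, $content) {
        if ($line =~ m!<span class="artist">(.*?)</span>!) {
            $artist = $1;    # remember until the track shows up
        }
        elsif ($line =~ m!<span class="track">(.*?)</span>! and defined $artist) {
            push @out, "$artist - $1";
            undef $artist;
        }
    }
    return @out;
}

# Stand-in for $res->decoded_content
my $content = join "\n",
    '<span class="artist">Tinie Tempah</span>',
    '<span class="track">Written In The Stars</span>',
    '<span class="artist">Bruno Mars</span>',
    '<span class="track">Just The Way You Are (Amazing)</span>';

print "$_\n" for chart_lines($content);
```

This still leans on a regex, so it shares the fragility discussed further down; it only removes the file handle.
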
while (defined (my $con = <$bbc>)) {
chomp $con;
next if $con =~ /^\s*$/;
next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
$con =~ s/^\s*|\s*$//g;
if ($con =~ m!<span class="artist">(.*)</span>!) {
print $1, " - ";
} elsif ($con =~ m!<span class="track">(.*)</span>!) {
print $1, "\n";
}

Don't parse HTML by throwing naive regexps at the problem. This would
fail horribly if the BBC decided to remove unneeded newlines from their
content.

I would rather use one of the existing HTML parsing modules. One
option could be HTML::TreeBuilder. Based on a quick read of the
documentation it would look something like this:

use HTML::TreeBuilder;

my $html = HTML::TreeBuilder->new_from_content( $res->decoded_content );
for my $tag ( $html->find('span') ) {
    # spans without a class attribute return undef, so default to ''
    my $class = $tag->attr('class') // '';

    if ( $class eq 'artist' ) {
        ...;
    } elsif ( $class eq 'track' ) {
        ...;
    }
}

This would be a much more robust solution. (But I don't parse HTML in
my day-to-day work, so I might not be up to date on the current set of
HTML parsers.)

//Makholm
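
One way the two `...;` branches above could be filled in, as a sketch only. It assumes every track span is preceded by its artist span and runs on a small sample string instead of the live page. `artist_tracks` is a hypothetical helper name; the HTML::TreeBuilder and HTML::Element calls (`new_from_content`, `find`, `attr`, `as_trimmed_text`, `delete`) are the documented ones.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: collect "artist - track" strings from an HTML string.
# Assumes every track span is preceded by its artist span, as in the sample.
sub artist_tracks {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my (@out, $artist);
    for my $span ($tree->find('span')) {
        my $class = $span->attr('class') // '';    # spans without a class are skipped
        if ($class eq 'artist') {
            $artist = $span->as_trimmed_text;
        }
        elsif ($class eq 'track' and defined $artist) {
            push @out, $artist . ' - ' . $span->as_trimmed_text;
            undef $artist;
        }
    }
    $tree->delete;    # free the parse tree
    return @out;
}

# Sample markup with the newlines removed, the case a line-based regex handles badly
my $sample = '<p><span class="artist">Adele</span><span class="track">Make You Feel My Love</span>'
           . '<span class="artist">Taio Cruz</span><span class="track">Dynamite</span></p>';

print "$_\n" for artist_tracks($sample);
```

Because the parser works on elements rather than lines, the result does not depend on where the newlines fall in the markup.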
 

sln

alfonsobaldaserra said:
i got a real bad code working :)

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;

my $res = $ua->get("http://www.bbc.co.uk/radio1/chart/singles");

if ($res->is_success) {
open my $bbc, "<", \$res->decoded_content or die "$!\n";
while (defined (my $con = <$bbc>)) {
chomp $con;
next if $con =~ /^\s*$/;
next unless $con =~ m!(<span class="artist">)|(<span class="track">)!;
$con =~ s/^\s*|\s*$//g;
if ($con =~ m!<span class="artist">(.*)</span>!) {
print $1, " - ";
} elsif ($con =~ m!<span class="track">(.*)</span>!) {
print $1, "\n";
}
}
}


thank you gents for giving me a chance to do it myself.

though i am still looking for any improvements that you could
suggest :)

Along the lines of what you are doing, something like below.
-sln
-----------
use strict;
use warnings;

my $string =<<EOHTML;
<html>
<span class="artist">
Tinie Tempah
</span>
<span class="track">
Written In The Stars
</span>
<span class="artist"> Bruno Mars </span>
<span class="track">Just The Way You Are (Amazing)</span>
<span class="artist">
Labrinth</span>
<span class="track">Let The Sun Shine
</span>
<span class="track">A song by Labrinth</span>
<span class="artist">Adele </span>
<span class="track">Make You Feel My Love</span>
<span class="artist">Taio Cruz</span>
<span class="track">Dynamite</span>
<html/>
EOHTML
my $artist;

while ( $string =~
/ <span \s+ class \s* = \s* ['"]\s* (artist|track) \s*['"] \s* >
\s* (.*?) \s*
<\/span\s*>
/xsig )
{
if ($1 eq 'artist') {
$artist = $2;
}
else {
if (length $artist) {
print "$artist - $2\n";
}
$artist = '';
}
}
print "\n";

## Alternate -
##

$artist = '';
my %tracks;

while ( $string =~
/ <span \s+ class \s* = \s* ['"]\s* (artist|track) \s*['"] \s* >
\s* (.*?) \s*
<\/span\s*>
/xsig )
{
if ($1 eq 'artist') {
$artist = $2;
}
else {
push @{ $tracks{$artist} }, $2;
}
}

for $artist (sort keys %tracks) {
print "\n$artist\n";
for my $track ( sort @{ $tracks{$artist} } ) {
print " - $track\n"
}
}
 

alfonsobaldaserra

thank you for such beautiful code, sln.

though i am inclined towards peter's advice to use an html parser.
unfortunately, i couldn't get peter's code to work due to the lack of
usage examples of HTML::TreeBuilder online.

does anybody happen to know a good html parser with some good examples
online?
 

alfonsobaldaserra

Huh?

thank you guys :)

i finally utilised the perlmonks link, read a little at cpan, and here i am

#!/usr/bin/perl

use strict;
use warnings;
use HTML::Tree;
use LWP::Simple;

my $uri = "http://www.bbc.co.uk/radio1/chart/singles";

my $html = get($uri) or die "could not fetch $uri\n";
my $tree = HTML::Tree->new();
$tree->parse($html);
$tree->eof;    # tell the parser the document is complete

my @artist = $tree->look_down('_tag' , 'span', 'class', 'artist');
my @track = $tree->look_down('_tag' , 'span', 'class', 'track');

foreach my $i (0..$#artist) {
print $artist[$i]->as_text, " - ", $track[$i]->as_text, "\n";
}


again i am wondering if there is a better way to group these two
arrays together instead of the way i did

foreach my $i (0..$#artist) {
print $artist[$i]->as_text, " - ", $track[$i]->as_text, "\n";
}

thank you
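
One possible way to group the two lists with core Perl only, as a sketch: build a list of [artist, track] records up front and check that both lists have the same length, which plain index looping would hide. `pair_up` is a hypothetical helper name, and plain strings stand in here for the `->as_text` values.

```perl
use strict;
use warnings;

# Hypothetical helper: pair two parallel lists into [artist, track] records.
# Dies if the lists differ in length instead of silently mispairing.
sub pair_up {
    my ($artists, $tracks) = @_;
    die "list lengths differ\n" unless @$artists == @$tracks;
    return map { [ $artists->[$_], $tracks->[$_] ] } 0 .. $#$artists;
}

# Plain strings stand in for the ->as_text values of the elements
my @artist = ('Tinie Tempah', 'Bruno Mars');
my @track  = ('Written In The Stars', 'Just The Way You Are (Amazing)');

for my $pair (pair_up(\@artist, \@track)) {
    print join(' - ', @$pair), "\n";
}
```

This only tidies the pairing. It still trusts that the n-th artist belongs to the n-th track, which depends on the page structure.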
 

Peter Makholm

alfonsobaldaserra said:
my @artist = $tree->look_down('_tag' , 'span', 'class', 'artist');
my @track = $tree->look_down('_tag' , 'span', 'class', 'track');

foreach my $i (0..$#artist) {
print $artist[$i]->as_text, " - ", $track[$i]->as_text, "\n";
}

again i am wondering if there is a better way to group these two
arrays together instead of the way i did

It all depends on the HTML. But looking at the URL you posted, it looks
like you're looking for a structure like this:

<a class="artist-link" href="/music/artists/ba7d2626-38ce-4859-8495-bdb5732715c4" id="link-13">
<span class="artist">Taio Cruz</span>
<span class="track">Dynamite</span>
</a>

What you could do is iterate over all the <a class="artist-link">
nodes and then look for the artist and track below each
node. Untested, but something like this:

for my $link ( $tree->look_down(_tag => 'a', class => 'artist-link') ) {
my $artist = $link->look_down(class => 'artist')->as_text;
my $track = $link->look_down(class => 'track' )->as_text;

print "$artist - $track\n";
}

//Makholm
 

alfonsobaldaserra

for my $link ( $tree->look_down(_tag => 'a', class => 'artist-link') ) {
    my $artist = $link->look_down(class => 'artist')->as_text;
    my $track  = $link->look_down(class => 'track' )->as_text;

    print "$artist - $track\n";

}

//Makholm

thank you again makholm, your code worked sexily without any
modification :)
 
