Syntax for getting web page links

dysgraphia

Hi, I'm using win xp and have the ActivePerl download.

This is my first attempt at a perl script. It tries to go to the
chessbase site and find the links to chess tournaments in 2006.

What I hope to do is have my script collect the links on the page
listed under Events 2006 and put this collection of links into
an Excel workbook ("C:\ChessEvents.xls"), spreadsheet ("Year2006").

If I progress I will make a script to follow some of the links and
retrieve the information about each chess tournament.

I plan to put a button on the spreadsheet to fire the perl script
via some VBA.

So far I have only got this. Any help to push me along
would be appreciated!

#!/usr/bin/perl -w
use LWP::UserAgent;
use HTTP::Cookies;
use LWP;
use HTTP::Request::Common qw(POST GET);
use strict;
use DBI;
use IO::Dir;
use LWP::Debug qw(-);
# the chessbase page with list of events is at
# http://www.chessbase.com/events/index.asp and find under Events 2006
my $url = 'http://www.chessbase.com/events/index.asp';
my $ua = LWP::UserAgent->new();
$ua->agent("Mozilla/8.0");
$ua->cookie_jar(HTTP::Cookies->new);
my $res = $ua->request(new HTTP::Request GET => $url);
 
Ben Morrow

Quoth dysgraphia said:
> Hi, I'm using win xp and have the ActivePerl download.
>
> This is my first attempt at a perl script. It tries to go to the
> chessbase site and find the links to chess tournaments in 2006.
>
> What I hope to do is have my script collect the links on the page
> listed under Events 2006 and put this collection of links into
> an Excel workbook ("C:\ChessEvents.xls"), spreadsheet ("Year2006").
>
> If I progress I will make a script to follow some of the links and
> retrieve the information about each chess tournament.
>
> I plan to put a button on the spreadsheet to fire the perl script
> via some VBA.
>
> So far I have only got this. Any help to push me along
> would be appreciated!
>
> #!/usr/bin/perl -w

    use warnings;

is better than -w.

> use LWP::UserAgent;
> use HTTP::Cookies;
> use LWP;
> use HTTP::Request::Common qw(POST GET);

You would probably be better off using LWP::Simple for this, or perhaps
WWW::Mechanize.

> use strict;
> use DBI;
> use IO::Dir;
> use LWP::Debug qw(-);
> # the chessbase page with list of events is at
> # http://www.chessbase.com/events/index.asp and find under Events 2006
> my $url = 'http://www.chessbase.com/events/index.asp';
> my $ua = LWP::UserAgent->new();
> $ua->agent("Mozilla/8.0");

Why? AFAIK, no current browser uses this User-Agent string. What makes
you think you need one at all?

> $ua->cookie_jar(HTTP::Cookies->new);
> my $res = $ua->request(new HTTP::Request GET => $url);

This looks OK, as far as it goes. What is your problem with the next
step?

Ben
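
For what it's worth, the "next step" after the request is usually to check whether it succeeded before touching the body. Below is a minimal sketch of that, staying with LWP::UserAgent rather than switching modules. The helper name body_or_die and the convention of passing the URL on the command line are my own assumptions for illustration, not something from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: return the decoded page body,
# or die with the HTTP status line if the request failed.
sub body_or_die {
    my ($res) = @_;
    die "Request failed: ", $res->status_line, "\n"
        unless $res->is_success;
    return $res->decoded_content;
}

# Only touch the network when a URL is given on the command line,
# e.g.  perl fetch.pl http://www.chessbase.com/events/index.asp
if (@ARGV) {
    my $ua   = LWP::UserAgent->new;
    my $res  = $ua->get( $ARGV[0] );
    my $html = body_or_die($res);
    print length($html), " characters fetched\n";
}
```

No agent string and no cookie jar are set here, since nothing in the thread suggests the site needs either.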
 
Tad McClellan

dysgraphia said:
> This is my first attempt at a perl script.


If you intend to learn Perl programming, then you should not
put code into your programs unless you understand why you need
to put that code in your program.

It is also a good idea to check the Perl FAQ for questions related
to what you are trying to accomplish. For instance, if you had seen
this FAQ answer:

perldoc -q HTML

How do I fetch an HTML file?

Then you could replace 5 of your "use" statements with a single one.

> What I hope to do is have my script collect the links on the page
> listed under Events 2006
>
> So far I have only got this.


Thank you for including your code!

But you have included a bunch of stuff that you do not use.

If it is not used, then it should not be included.

The site you want to scrape does not require cookies, so don't
use cookies.

Your program does not use the DBI nor IO::Dir modules, so don't
include those modules.

> Any help to push me along
> would be appreciated!


Scraping a web page requires an intimate knowledge of the page's
structure and format.

The best and most robust way to process HTML data is with one of
the many HTML::* modules on the CPAN.

But for a dirty hack that prints URLs for the 2006 Events,
this should get you started:

---------------------------------------
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use URI::Escape;

my $html = get 'http://www.chessbase.com/events/index.asp';
defined $html or die "Could not fetch the events page\n";

$html =~ s/Events\s+/Events /g;   # fix silly-formatted data

$html =~ s/.*Events 2006//s;      # delete unwanted prefix
$html =~ s/Events 2005.*//s;      # delete unwanted suffix

foreach my $line ( split /\n/, $html ) {
    if ( $line =~ /eventname=([^"]+)/ ) {
        my $eventname = uri_escape( $1 );
        print "http://www.chessbase.com/eventlist.asp?eventname=$eventname\n";
    }
}
---------------------------------------
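
For comparison, here is a rough sketch of the more robust HTML::* route mentioned above, using HTML::LinkExtor from the CPAN. It assumes the event links are ordinary <a href> tags whose URL contains "eventname=", as the regex in the quick hack implies. The helper name event_links is made up for illustration. Note that it returns every matching link on the page and makes no attempt to isolate the Events 2006 section.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper: return the absolute URLs of all <a href> links
# in $html whose URL contains "eventname=". $base is the page's own URL
# and is used to resolve relative links.
sub event_links {
    my ( $html, $base ) = @_;
    my @urls;
    my $parser = HTML::LinkExtor->new(
        sub {
            my ( $tag, %attr ) = @_;
            return unless $tag eq 'a' && defined $attr{href};
            # With a base URL supplied, the values are URI objects;
            # quote them to store plain strings.
            push @urls, "$attr{href}" if $attr{href} =~ /eventname=/;
        },
        $base,
    );
    $parser->parse($html);
    $parser->eof;
    return @urls;
}

# Only fetch when a URL is given on the command line.
if (@ARGV) {
    require LWP::Simple;
    my $html = LWP::Simple::get( $ARGV[0] );
    defined $html or die "Could not fetch $ARGV[0]\n";
    print "$_\n" for event_links( $html, $ARGV[0] );
}
```

Because a real parser does the work, this does not depend on each link sitting on its own line, which the line-by-line regex does.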
 
