Syntax for getting web page links

dysgraphia · Oct 9, 2006

Hi, I'm using win xp and have the ActivePerl download.

This is my first attempt at a perl script. It tries to go to the
chessbase site and find the links to chess tournaments in 2006.

What I hope to do is have my script collect the links on the page
listed under Events 2006 and put this collection of links into
an Excel workbook ("C:\ChessEvents.xls"), spreadsheet ("Year2006")

If I progress I will make the script follow some of the links and
retrieve the information about that chess tournament.

I plan to put a button on the spreadsheet to fire the perl script
via some VBA.

So far I have only got this. Any help to push me along
would be appreciated!

#!/usr/bin/perl -w
use LWP::UserAgent;
use HTTP::Cookies;
use LWP;
use HTTP::Request::Common qw(POST GET);
use strict;
use DBI;
use IO::Dir;
use LWP::Debug qw(-);
# the chessbase page with list of events is at
# http://www.chessbase.com/events/index.asp and find under Events 2006
my $url = 'http://www.chessbase.com/events/index.asp';
my $ua = LWP::UserAgent->new();
$ua->agent("Mozilla/8.0");
$ua->cookie_jar(HTTP::Cookies->new);
my $res = $ua->request(new HTTP::Request GET => $url);

Ben Morrow · Oct 9, 2006

dysgraphia said:
Hi, I'm using win xp and have the ActivePerl download.

This is my first attempt at a perl script. It tries to go to the
chessbase site and find the links to chess tournaments in 2006.

What I hope to do is have my script collect the links on the page
listed under Events 2006 and put this collection of links into
an Excel workbook ("C:\ChessEvents.xls"), spreadsheet ("Year2006")

If I progress I will make the script follow some of the links and
retrieve the information about that chess tournament.

I plan to put a button on the spreadsheet to fire the perl script
via some VBA.

So far I have only got this. Any help to push me along
would be appreciated!

#!/usr/bin/perl -w

use warnings;

is better than -w.

use LWP::UserAgent;
use HTTP::Cookies;
use LWP;
use HTTP::Request::Common qw(POST GET);

You would probably be better off using LWP::Simple for this, or perhaps
WWW::Mechanize.

use strict;
use DBI;
use IO::Dir;
use LWP::Debug qw(-);
# the chessbase page with list of events is at
# http://www.chessbase.com/events/index.asp and find under Events 2006
my $url = 'http://www.chessbase.com/events/index.asp';
my $ua = LWP::UserAgent->new();
$ua->agent("Mozilla/8.0");

Why? AFAIK, no current browser uses this User-Agent string. What makes
you think you need one at all?

$ua->cookie_jar(HTTP::Cookies->new);
my $res = $ua->request(new HTTP::Request GET => $url);

This looks OK, as far as it goes. What is your problem with the next
step?

Ben

Paul Lalli · Oct 9, 2006

dysgraphia said:
What I hope to do is have my script collect the links on the page
listed under Events 2006 and put this collection of links into
an Excel workbook ("C:\ChessEvents.xls"), spreadsheet ("Year2006")

http://search.cpan.org/~gaas/HTML-Parser-3.55/lib/HTML/TokeParser.pm
http://search.cpan.org/~jmcnamara/Spreadsheet-WriteExcel-2.17/lib/Spreadsheet/WriteExcel.pm

Paul Lalli
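
As a rough illustration of the first module linked above, here is a minimal sketch of pulling links out of HTML with HTML::TokeParser. The sample markup, the event names in it, and the helper name extract_event_links are made up for the example; the "eventname=" filter is an assumption about how the event links on the real page look, so check the page source before relying on it.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use HTML::TokeParser;

# Hypothetical helper: return [href, link text] pairs for every <a> tag
# whose href contains "eventname=". The filter is an assumption about the
# page's link format, not something the site documents.
sub extract_event_links {
    my ($html) = @_;

    # Passing a scalar reference makes the parser read from the string.
    my $parser = HTML::TokeParser->new( \$html )
        or die "Cannot parse HTML: $!";

    my @links;
    while ( my $tag = $parser->get_tag('a') ) {
        my $href = $tag->[1]{href};    # attribute hash is element 1
        next unless defined $href && $href =~ /eventname=/;

        # Read the visible link text up to the closing </a>.
        my $text = $parser->get_trimmed_text('/a');
        push @links, [ $href, $text ];
    }
    return @links;
}

# Made-up sample markup standing in for the real events page.
my $sample = <<'HTML';
<h2>Events 2006</h2>
<a href="/eventlist.asp?eventname=Sample%20Open%202006">Sample Open 2006</a>
<a href="/about.asp">About</a>
<a href="/eventlist.asp?eventname=Example%20Cup%202006">Example Cup 2006</a>
HTML

for my $link ( extract_event_links($sample) ) {
    print "$link->[1]\t$link->[0]\n";
}
```

With a real page, the string returned by LWP::Simple's get could be passed to the helper in place of $sample. The hrefs here are relative, so they would still need to be resolved against the site's base URL.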

dysgraphia · Oct 9, 2006

Paul said:
http://search.cpan.org/~gaas/HTML-Parser-3.55/lib/HTML/TokeParser.pm
http://search.cpan.org/~jmcnamara/Spreadsheet-WriteExcel-2.17/lib/Spreadsheet/WriteExcel.pm

Paul Lalli

Hi Paul, Thanks for those links! I will follow them up now.....cheers!

Tad McClellan · Oct 9, 2006

dysgraphia said:
This is my first attempt at a perl script.

If you intend to learn Perl programming, then you should not
put code into your programs unless you understand why you need
to put that code in your program.

It is also a good idea to check the Perl FAQ for questions related
to what you are trying to accomplish. For instance, if you had seen
this FAQ answer:

perldoc -q HTML

How do I fetch an HTML file?

Then you could replace 5 of your "use" statements with a single one.

What I hope to do is have my script collect the links on the page
listed under Events 2006

So far I have only got this.

Thank you for including your code!

But you have included a bunch of stuff that you do not use.

If it is not used, then it should not be included.

The site you want to scrape does not require cookies, so don't
use cookies.

Your program does not use the DBI or IO::Dir modules, so don't
include those modules.

Any help to push me along
would be appreciated!

Scraping a web page requires an intimate knowledge of the page's
structure and format.

The best and most robust way to process HTML data is with one of
the many HTML::* modules on the CPAN.

But for a dirty hack that prints URLs for the 2006 Events,
this should get you started:

---------------------------------------
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use URI::Escape;

my $html = get 'http://www.chessbase.com/events/index.asp';
defined $html or die "Could not fetch the events page\n";

$html =~ s/Events\s+/Events /g;  # fix silly-formatted data

$html =~ s/.*Events 2006//s;     # delete unwanted prefix
$html =~ s/Events 2005.*//s;     # delete unwanted suffix

foreach my $line ( split /\n/, $html ) {
    if ( $line =~ /eventname=([^"]+)/ ) {
        my $eventname = uri_escape( $1 );
        print "http://www.chessbase.com/eventlist.asp?eventname=$eventname\n";
    }
}
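
To get from printed URLs to the spreadsheet described in the original post, something along these lines might work with Spreadsheet::WriteExcel. This is only a sketch: the helper name write_links and the two URLs are placeholders, and the file is written to the current directory rather than to C:\ChessEvents.xls.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use Spreadsheet::WriteExcel;

# Hypothetical helper: write one URL per row into column A of a new
# worksheet and return the number of rows written.
sub write_links {
    my ( $file, $sheet_name, @urls ) = @_;

    my $workbook = Spreadsheet::WriteExcel->new($file)
        or die "Cannot create $file: $!";
    my $worksheet = $workbook->add_worksheet($sheet_name);

    my $row = 0;
    for my $url (@urls) {
        # write_string stores the URL as plain text, not as a hyperlink.
        $worksheet->write_string( $row++, 0, $url );
    }

    $workbook->close();
    return $row;
}

# Placeholder URLs; in practice these would come from the scraping loop.
my @urls = (
    'http://www.chessbase.com/eventlist.asp?eventname=Sample%20Open%202006',
    'http://www.chessbase.com/eventlist.asp?eventname=Example%20Cup%202006',
);

my $count = write_links( 'ChessEvents.xls', 'Year2006', @urls );
print "Wrote $count links\n";
```

Note that Spreadsheet::WriteExcel only creates new files and cannot modify an existing workbook, so a button or VBA macro already saved in ChessEvents.xls would be lost on each run. Writing to a separate file and having the VBA side import it is one way around that.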
