P.R.Brady
I'm using LWP::UserAgent (Active Perl v5.6.1.638) in a web site
crawler, but there's a page I just can't read:
http://www.psychology.bangor.ac.uk/ gives '404 Not Found'. It is
similarly inaccessible to many of the web checkers out there (such as
http://validator.w3.org/), but is fine with 'real' browsers like Internet
Explorer and Netscape.
There's a redirection happening behind the scenes to index.php (which
can be read directly), but the same is true of our main web page
http://www.bangor.ac.uk/, and that one redirects fine.
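One experiment that might show what the server actually sends (a sketch,
assuming requests_redirectable is available in this LWP; it takes an
arrayref of request methods LWP may redirect, so an empty list stops it
following anything): fetch the pages without following redirects and look
for a 3xx status with a Location header. A server-internal rewrite to
index.php would instead come straight back as a 200 (or, here, perhaps
the mysterious 404).

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->requests_redirectable([]);   # follow no redirects at all

# compare the working site with the failing one
for my $url ('http://www.bangor.ac.uk/',
             'http://www.psychology.bangor.ac.uk/') {
    my $r = $ua->get($url);
    print $url, ' => ', $r->status_line, "\n";
    print '  Location: ', $r->header('Location') || '(none)', "\n";
}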
I suppose the problem is that I don't understand how the redirection
takes place. Is it a server issue? Do regular browsers 'guess' at
filenames if none are given? Is there some browser/server negotiation
that my code isn't implementing?
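On the negotiation point, one check worth making (pure guesswork on my
part, and untested) is whether the server varies its answer on request
headers such as User-Agent or Accept. Extra headers can be passed to
get() exactly the way the Referer is passed in the extract below; the
browser-like header values here are only illustrative:

use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $url = 'http://www.psychology.bangor.ac.uk/';

# once with LWP's default headers...
print 'Default headers:      ', $ua->get($url)->status_line, "\n";

# ...and once pretending to be a browser (illustrative values only)
my $r = $ua->get($url,
    'User-Agent' => 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)',
    'Accept'     => 'text/html, image/gif, image/jpeg, */*',
);
print 'Browser-like headers: ', $r->status_line, "\n";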
An extract from the code which exhibits the symptoms is below.
I'd appreciate any help you can give - I've drawn blanks elsewhere!
Regards
Phil
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;   # loaded by LWP::UserAgent anyway; kept for clarity
use HTML::TokeParser; # not used in this extract (needed elsewhere in the crawler)
#the page which refers to the culprit:
my $referer = 'http://www.bangor.ac.uk/corporate/informationabout/depts.php';
#the inaccessible page
my $url='http://www.psychology.bangor.ac.uk/';
#but these are okay
# $url='http://www.informatics.bangor.ac.uk/';
# $url='http://www.psychology.bangor.ac.uk/index.php';
# $url='http://www.bangor.ac.uk/';
#open the browser
my $browser = LWP::UserAgent->new;
$browser->timeout(30);
#try to get the page
my $response = $browser->get($url, Referer => $referer);
print "Response $response\n";
my $status = $response->code;   # numeric part of the status line, e.g. 404
print "Status $status\n";
exit;