LWP::UserAgent and 404 page not found

P.R.Brady · Jun 22, 2005

I'm using LWP::UserAgent (ActivePerl v5.6.1.638) in a web site
crawler, but there's a page I just can't read:
http://www.psychology.bangor.ac.uk/ gives '404 not found'. It is
similarly inaccessible to many of the web checkers out there (like
http://validator.w3.org/) but is okay with 'real' browsers like Internet
Explorer and Netscape.
There's a redirection somewhere behind the scenes to index.php
(which can be read), but the same is true of our main web page
http://www.bangor.ac.uk/ and that redirects okay.

I suppose my problem is that I don't understand how redirection takes place.
Is it a server issue? Do the regular browsers 'guess' at filenames if
none are given? Is there some browser/server negotiation which is not
being implemented?

An extract from the code which exhibits the symptoms is below.

I'd appreciate any help you can give - I've drawn blanks elsewhere!

Regards
Phil

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;
use HTML::TokeParser;

#the page which refers to the culprit:
my $referer = 'http://www.bangor.ac.uk/corporate/informationabout/depts.php';

#the inaccessible page
my $url='http://www.psychology.bangor.ac.uk/';

#but these are okay
# $url='http://www.informatics.bangor.ac.uk/';
# $url='http://www.psychology.bangor.ac.uk/index.php';
# $url='http://www.bangor.ac.uk/';

#open the browser

my $browser = LWP::UserAgent->new;
$browser->timeout(30);

#try to get the page

my $response = $browser->get($url, Referer => $referer);
print "Response $response\n";

my $status= $response->status_line;
($status) = split(' ',$status.' ');
print "Status_line $status\n";

exit;

Brian Wakem · Jun 22, 2005

P.R.Brady said:

my $response = $browser->get($url, Referer => $referer);

They seem to be doing a redirect based upon the language that your browser
declares itself to accept. As you aren't doing this you get an error page.

Try:-

my $response = $browser->get($url, Referer => $referer, ACCEPT_LANGUAGE =>
'en');
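
To see what that call puts into the request, here is a minimal sketch. It assumes HTTP::Request::Common is installed (it ships with libwww-perl and, as far as I know, is what get() uses to build its request). It only builds the request and prints it; nothing is sent over the network. Underscores in header names are translated to dashes, so ACCEPT_LANGUAGE goes out as Accept-Language.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(GET);

my $referer = 'http://www.bangor.ac.uk/corporate/informationabout/depts.php';
my $url     = 'http://www.psychology.bangor.ac.uk/';

# Build (but do not send) the same kind of request that
# $browser->get($url, ...) would make.
my $request = GET($url, Referer => $referer, ACCEPT_LANGUAGE => 'en');

# Print the request line and headers as they would be sent.
print $request->as_string;

# The underscore spelling and the dashed spelling name the same field.
print "Language header: ", $request->header('Accept-Language'), "\n";
```

If the printed request shows an Accept-Language line, the server has what it needs to pick a language version.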

P.R.Brady · Jun 23, 2005

Brian said:
P.R.Brady wrote:

[ ... snipped ...]

They seem to be doing a redirect based upon the language that your browser
declares itself to accept. As you aren't doing this you get an error page.

Try:-

my $response = $browser->get($url, Referer => $referer, ACCEPT_LANGUAGE =>
'en');

Thanks Brian, that certainly works. Much appreciated.

Now do I have to alter my crawler to scan pages twice I wonder, once for
English, once for Welsh?

Phil
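
One way to approach the two-language question is to build one request per language and compare what comes back. The sketch below is only a rough illustration: it assumes the server chooses its content from Accept-Language alone, which has not been confirmed for this site. 'cy' is the ISO 639 code for Welsh. The helper name requests_for_languages is made up for this example, and the actual fetch is left as a comment.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(GET);

# Hypothetical helper: build one GET request per language code for a URL.
sub requests_for_languages {
    my ($url, @languages) = @_;
    my @requests;
    for my $language (@languages) {
        push @requests, GET($url, 'Accept-Language' => $language);
    }
    return @requests;
}

my @requests = requests_for_languages(
    'http://www.psychology.bangor.ac.uk/', 'en', 'cy');

for my $request (@requests) {
    printf "%s (Accept-Language: %s)\n",
        $request->uri, $request->header('Accept-Language');
    # To fetch for real, hand each request to an LWP::UserAgent object:
    #   my $response = $browser->request($request);
}
```

Whether both passes are needed depends on whether the two versions link to different pages; if they end up at separate URLs, the crawler may reach them anyway by following links.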

Sherm Pendley · Jun 24, 2005

P.R.Brady said:
[ ... snip ...]

Try:-
my $response = $browser->get($url, Referer => $referer,
ACCEPT_LANGUAGE =>
'en');

Those parameters like Referer and ACCEPT_LANGUAGE are clearly reserved
words, but to what? The UserAgent? HTTP protocol?

HTTP. Here's a reference:

<http://www.w3.org/Protocols/rfc2616/rfc2616.html>

sherm--
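
Because these are ordinary HTTP request header fields (RFC 2616 covers Accept-Language in section 14.4 and Referer in section 14.36), a crawler can also set them once on the agent instead of on every get() call. A small sketch, assuming an LWP version that provides default_header; older installations may not have it.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;
$browser->timeout(30);

# Attach the header to every request this agent makes from now on.
$browser->default_header('Accept-Language' => 'en');

# Read it back; the underscore spelling names the same header field.
print "Default language: ",
    $browser->default_headers->header('ACCEPT_LANGUAGE'), "\n";
```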

P.R.Brady · Jun 24, 2005

Brian said:
P.R.Brady wrote:

[ ... snip ...]

Try:-

my $response = $browser->get($url, Referer => $referer, ACCEPT_LANGUAGE =>
'en');

Those parameters like Referer and ACCEPT_LANGUAGE are clearly reserved
words, but to what? The UserAgent? HTTP protocol?
Where are they listed and defined, or what are they called generically
so I can google them?

Phil
