Hal Vaughan
I'm trying to write a scraper for a website that uses cookies. The short of
it is that I keep getting their "You have to set your browser to allow
cookies" message. The code for the full scraper is a bit much, so here are
the relevant sections:
use File::Spec::Functions;
use File::Basename;
use File::Copy;
use LWP::UserAgent;
use HTTP::Cookies;
use URI::WithBase;
use DBI;
use strict;
Here's where I set up the variables (not all "my" and "our" statements are
included):
print "Cookie file: $cfile\n";
$ua = LWP::UserAgent->new;
$ua->timeout(5);
$ua->agent("Netscape/7.1");
$cjar = HTTP::Cookies->new(file => $cfile, autosave => 1, ignore_discard => 1);
$ua->cookie_jar($cjar);
Here's where I get the login page (which I always retrieve to make sure the
fields or info hasn't changed):
$page = $ua->get($url);
$page = $page->as_string;
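(If it matters, I could also verify that this fetch actually succeeds before
parsing it; roughly, with a temporary for the response object:)

$res = $ua->get($url);
die "Login page fetch failed: " . $res->status_line unless $res->is_success;
$page = $res->as_string;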
And after that, I go through the page, make sure the form input fields
haven't changed (which are "login" and "key" for the username and
password). Then I post the data for the next page, including the form
data:
$parm = "";
foreach (keys %form) {
print "\tAdding parm. Key: $_, Value: $form{$_}\n";
$parm = "$parm$_=$form{$_}&";
}
$parm =~ s/&$//;
$req = HTTP::Request->new(POST => $url);
$req->content_type("application/x-www-form-urlencoded");
$req->header('Accept' => 'text/html');
$req->content_type("form-data");
$req->content($parm);
$page = $ua->request($req);
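(Side note: I know the loop above doesn't URI-escape anything. My values are
plain alphanumerics, so I don't think that's the problem, but a safer version
would be something like this, with uri_escape from URI::Escape, which I'm not
currently loading:)

use URI::Escape;
# Escape each key and value, then join the pairs with "&"
$parm = join '&', map { uri_escape($_) . '=' . uri_escape($form{$_}) } keys %form;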
When I'm building up $parm, I'm taking the values from %form. I TRIED to
use the hash to post the values, using "$page = $ua->post($url, \%form);",
but even though it worked on a test web server on my LAN, it wouldn't work
on the system I'm scraping (don't know why -- if you can help here as well,
feel free to chip in).
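For reference, the hash attempt was no more than this; as I understand it,
LWP's post() hands the hash to HTTP::Request::Common to do the urlencoding
(either way, it's the manual version above that I'm actually running now):

use HTTP::Request::Common qw(POST);
$req  = POST $url, \%form;    # should be equivalent to $ua->post($url, \%form)
$page = $ua->request($req);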
The problem comes up when I use the code above to post the form data and get
the next page. The next page is a frameset with two frames. I get the
frame URLs from the page and load them:
$req = HTTP::Request->new(GET => $url);
$res = $ua->request($req);
$page = $res->as_string;
And this is when I always get the "You don't have cookies" message.
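To see whether the jar ever gets anything at all, I can dump it right after
the login POST (as_string is an HTTP::Cookies method):

print "Jar after login POST:\n", $cjar->as_string, "\n";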
I thought LWP automatically pulled cookies out of the response and stored them
in the cookie jar. I also thought cookies lived in the headers, but the one
here is set with document.cookie="doc cookie" inside the document itself, by
JavaScript. Either way, the jar doesn't seem to be getting it. I've been
reading the perldocs, but I can't find anything in the response object that
lets me check the page for cookies so I can handle them myself.
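The closest I've come up with is checking the raw headers myself and, if the
cookie really is only set by that document.cookie line, poking it into the jar
by hand with set_cookie(). Something like this (the regex is just a guess at
their JavaScript, and the host lookup needs a plain "use URI;"):

# Any real cookie headers from the server?
foreach my $sc ($res->header('Set-Cookie')) {
    print "Set-Cookie: $sc\n";
}
# Cookie set in the page body via document.cookie (guessing at name=value)
if ($res->content =~ /document\.cookie\s*=\s*"([^=]+)=([^";]+)/) {
    $cjar->set_cookie(0, $1, $2, '/', URI->new($url)->host);
}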
So why aren't the cookies being kept, and why don't the pages I retrieve AFTER
the cookie is set see them? Is part of the problem that they're in frames?
Any help on this is appreciated.
Thanks!
Hal