LWP Doesn't Seem To Save Cookies:

Hal Vaughan

I'm trying to write a scraper for a website that uses cookies. The short of
it is that I keep getting their "You have to set your browser to allow
cookies" message. The code for the full scraper is a bit much, so here are
the relevant sections:

use File::Spec::Functions;
use File::Basename;
use File::Copy;
use LWP::UserAgent;
use HTTP::Cookies;
use URI::WithBase;
use DBI;
use strict;

Here's where I set up the variables (not all "my" and "our" statements are
included):

print "Cookie file: $cfile\n";
$ua = LWP::UserAgent->new;
$ua->timeout(5);
$ua->agent("Netscape/7.1");
$cjar = HTTP::Cookies->new(file => $cfile, autosave => 1, ignore_discard => 1);
$ua->cookie_jar($cjar);

Here's where I get the login page (which I always retrieve to make sure the
fields or info hasn't changed):


$page = $ua->get($url);
$page = $page->as_string;

And after that, I go through the page, make sure the form input fields
haven't changed (which are "login" and "key" for the username and
password). Then I post the data for the next page, including the form
data:


$parm = "";
foreach (keys %form) {
    print "\tAdding parm. Key: $_, Value: $form{$_}\n";
    $parm = "$parm$_=$form{$_}&";
}
$parm =~ s/&$//;
$req = HTTP::Request->new(POST => $url);
$req->content_type("application/x-www-form-urlencoded");
$req->header('Accept' => 'text/html');
$req->content_type("form-data");
$req->content($parm);
$page = $ua->request($req);

When I'm building up $parm, I'm taking the values from %form. I TRIED to
use the hash to post the values, using "$page = $ua->post($url, \%form);",
but even though it worked on a test web server on my LAN, it wouldn't work
on the system I'm scraping (don't know why -- if you can help here as well,
feel free to chip in).
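
For comparison, here is a minimal sketch of the same POST built with HTTP::Request::Common, which URL-encodes every value and sets the content type once. The URL and the form values are placeholders, not the real site's. One thing to note about the hand-built request above: a second content_type() call replaces the first, so the "form-data" line leaves the request without the application/x-www-form-urlencoded type.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Placeholder URL and credentials (assumptions, not from the real site).
my $url  = 'http://www.example.com/login';
my %form = (login => 'hal', key => 'p&ss word');

# POST() URL-encodes each name and value and sets
# Content-Type: application/x-www-form-urlencoded.
my $req = POST($url, [ map { $_ => $form{$_} } sort keys %form ]);
$req->header(Accept => 'text/html');

print $req->as_string;

# With a configured user agent this would be sent as:
#   my $page = $ua->request($req);
```

Characters such as "&" or a space in a value would otherwise break the hand-concatenated parameter string.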

The problem comes up when I use the code above to post the form data and get
the next page. The next page is a frameset with two frames. I get the
frame urls from the page and load them:

$req = HTTP::Request->new(GET => $url);
$req->content_type("application/x-www-form-urlencoded");
$req = $ua->request($req);
$page = $req->as_string;

And this is when I always get the "You don't have cookies" message.

I thought that LWP automatically took the cookies out of the page (I also
thought cookies were in the header, the one here is set with
document.cookie="doc cookie" within the document), and stored them in the
cookie jar automatically. That doesn't seem to be happening. I've been
reading the perldocs, but I can't see anything in the response object that
allows me to check the page for cookies, so I can do it myself.
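
On checking a response for cookies: LWP only stores cookies that arrive in Set-Cookie headers, and it does not run Javascript, so a cookie assigned through document.cookie never reaches the jar. The sketch below uses a hand-built response to show the difference; the URL and cookie names are made up for illustration.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;
use HTTP::Response;

# Hand-built response standing in for a real page (hypothetical values).
my $req = HTTP::Request->new(GET => 'http://www.example.com/login');
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html',
      'Set-Cookie'   => 'hdr_cookie=from_header; path=/' ],
    '<script>document.cookie="js_cookie=from_script; path=/";</script>',
);
$res->request($req);    # extract_cookies() needs the originating request

my $jar = HTTP::Cookies->new;
$jar->extract_cookies($res);    # the step LWP performs for each response

# Cookies the server sent as headers can be read from the response:
print "Header: $_\n" for $res->header('Set-Cookie');

# The jar holds hdr_cookie only; js_cookie stays buried in the page body.
print $jar->as_string;
```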

So why aren't the cookies being kept, and why can't I retrieve the pages
AFTER the cookie is set? Is part of the problem that they are in
frames?

Any help on this is appreciated.

Thanks!

Hal
 
Todd W

Hal Vaughan said:
I'm trying to write a scraper for a website that uses cookies. The short of
it is that I keep getting their "You have to set your browser to allow
cookies" message. The code for the full scraper is a bit much, so here are
the relevant sections:
<snip />

I've had a lot of success using LWP to scrape web pages; for instance, I have
a neat program that shows me all my bank account balances on my web-enabled
cell phone. But I've also had some trouble getting LWP to scrape some pages
that required cookies.

Here's my code:

[trwww[at]waveright temp]$ perl -MWWW::Mechanize::Shell -e 'shell'
Retrieving https://www.setsivr.odjfs.state.oh.us/welcome.asp(200)
https://www.setsivr.odjfs.state.oh.us/cookieerror.htm>

If the client and the server were doing everything according to
specification, this would work.

I get the same problem with lynx, and another poster on perl.libwww verified
my issue, and also got the same error using a python http library.

Here's the archive of my thread:

http://groups-beta.google.com/group/perl.libwww/browse_thread/thread/38d09ffd6ff2f4fd

I guess that since it doesn't work with lynx I can say that the server is
doing something that isn't standard, but it sucks because it works fine on any
of the major graphical browsers I've tried.

I suppose that someone who knew HTTP well enough could say why it doesn't
work, but I know it pretty well and I can't figure it out, and I've tried
pretty hard.

Todd W
 
Hal Vaughan

Todd said:
Hal Vaughan said:
I'm trying to write a scraper for a website that uses cookies. The short of
it is that I keep getting their "You have to set your browser to allow
cookies" message. The code for the full scraper is a bit much, so here are
the relevant sections:
<snip />

I've had a lot of success using LWP to scrape web pages; for instance, I
have a neat program that shows me all my bank account balances on my
web-enabled cell phone. But I've also had some trouble getting LWP to
scrape some pages that required cookies.

Here's my code:

[trwww[at]waveright temp]$ perl -MWWW::Mechanize::Shell -e 'shell'
Retrieving https://www.setsivr.odjfs.state.oh.us/welcome.asp(200)
https://www.setsivr.odjfs.state.oh.us/cookieerror.htm>

If the client and the server were doing everything according to
specification, this would work.

I get the same problem with lynx, and another poster on perl.libwww
verified my issue, and also got the same error using a python http
library.

Here's the archive of my thread:
http://groups-beta.google.com/group/perl.libwww/browse_thread/thread/38d09ffd6ff2f4fd

I checked the thread, and I've gone back over the pages I downloaded. I
wasn't clear (I think I mentioned it in my first post) about how cookies
are normally handled, and had not looked closely at the files (since I
figured that was not likely the problem). It turns out that the cookie IS
being set in Javascript, which I suspected, but I didn't realize that was a
problem. I wrote a routine that scanned the page, grabbed the cookie,
and set it manually with $cookie_jar->set_cookie(), and it looks like it is
set properly (it includes the domain and path setting, as well). However,
even after setting the cookie manually, I either get "no cookie" messages,
or trying to load any page after the login gives me the login page again
(which I noticed happens in Firefox if I try to paste in a link to a page
after the login page when I'm not logged in). (I also looked at the
cookies in Firefox to see if it looked like the same ones I was getting in
Perl, and they seem the same except for the session ID number.)

So I've found a way to set the cookie by hand, but the server I'm trying to
read from doesn't seem to see the cookie is set. Is there something I need
to do, other than setting a cookie, to make sure the server I'm connecting
to knows the cookie is set?
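
A minimal sketch of that manual approach, assuming the script assigns a double-quoted literal of the form name=value with optional attributes. The page source, URL and cookie name are invented, and the regex is deliberately naive.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use URI;

# Hypothetical page source and URL; the real site's script will differ.
my $url  = URI->new('http://www.example.com/frames/main.html');
my $html = '<script>document.cookie="sid=12345; path=/";</script>';

my $jar = HTTP::Cookies->new;

# Naive match: a double-quoted literal shaped like name=value[; attributes].
while ($html =~ /document\.cookie\s*=\s*"([^=;"]+)=([^;"]*)([^"]*)"/g) {
    my ($name, $value, $attrs) = ($1, $2, $3);
    my ($path) = $attrs =~ /path=([^;]+)/i;
    $path = '/' unless defined $path;

    # set_cookie(version, key, val, path, domain, port,
    #            path_spec, secure, maxage, discard)
    $jar->set_cookie(0, $name, $value, $path, $url->host,
                     undef, 1, 0, undef, 1);
}

print $jar->as_string;
```

The jar must be the same object that was passed to $ua->cookie_jar(), otherwise later requests will not carry the cookie.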

This is not an area I'm an expert in, and it's frustrating because I need to
get this done, so I'm low on sleep, and trying to put together a lot more
pieces than I expected in this. I didn't know, when I sent a page request
to a server, that the server could actually read the cookie with the
request. I thought cookies were only used by client-side Javascript, but the fact
that the server won't send me the right pages without the cookie seems to
say the server can read the cookie. Is that right? If so, how do I make
sure the server gets the cookie?

Thanks for any help on this!

Hal
 
Gunnar Hjalmarsson

Hal said:
I thought that LWP automatically took the cookies out of the page (I also
thought cookies were in the header, the one here is set with
document.cookie="doc cookie" within the document), and stored them in the
cookie jar automatically. That doesn't seem to be happening. I've been
reading the perldocs, but I can't see anything in the response object that
allows me to check the page for cookies, so I can do it myself.

This thread with a similar topic might contain something useful:

http://groups-beta.google.com/group/comp.lang.perl.misc/browse_frm/thread/f8f4b9ef0d73a11d
 
Hal Vaughan

Gunnar said:

Thanks. I read through it. I already have the ignore_discard set, so that
isn't it.

At this point, I think it's a bigger problem and I could use some
clarification from anyone (I'm trying to find info on Google, but am not
doing too well). It turns out the cookie is set by Javascript, with
"document.cookie=". Since Perl doesn't catch this, I'm pulling the cookie
out with a regex and setting it manually. That doesn't seem to help
though, so I've got some more questions:

1) If I have an HTTP::Response object, and I pull out the Javascript cookie
string, is there a way to add it to the header in the Response object and
re-parse the Response to get the cookie into the jar, or will that make a
difference over me setting the cookie manually?

2) How does the server know what my cookies are? I had no idea that the
server was able to read cookies, but since I get different pages without
the cookie than what I should get, I think the server has a way of
detecting the cookies on my system.

3) If I'm right, and the server can read my cookies (other than reading them
with client-side Javascript, which was what I used to think happened), is
it worth sending the cookie as POST data instead?
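
On questions 2 and 3: the server never looks at the client's cookie store. The client sends each matching cookie back in a Cookie request header, and that header is what the server reads, so cookies travel separately from POST data. The sketch below shows the header LWP would add, using a made-up cookie and host names.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;

my $jar = HTTP::Cookies->new;

# Hypothetical cookie, as if set by hand for www.example.com.
$jar->set_cookie(0, 'sid', '12345', '/', 'www.example.com',
                 undef, 1, 0, undef, 1);

# A request for a page on the same host...
my $same = HTTP::Request->new(GET => 'http://www.example.com/frame1.html');
$jar->add_cookie_header($same);    # the user agent does this before sending

# ...and one for a different host, which does not receive the cookie.
my $other = HTTP::Request->new(GET => 'http://other.example.org/');
$jar->add_cookie_header($other);

print $same->as_string, "\n", $other->as_string;
```

If the domain or path stored with a hand-set cookie does not match the URL being requested, the header is left off, which would produce the same "no cookie" symptom.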

If anyone can help me with these, it'll be a huge help.

Thanks!

Hal
 
Gunnar Hjalmarsson

Hal said:
Thanks. I read through it. I already have the ignore_discard set, so that
isn't it.

I knew that you have ignore_discard set; my thought was that other
details in Richard's code might serve as clues.

I have no experience of my own with HTTP::Cookies, but when helping
Richard, I noticed that the module provides quite a few methods, some
of which appear to be relevant to you.
 
Ilmari Karonen

Hal Vaughan said:
even after setting the cookie manually, I either get "no cookie" messages,
or trying to load any page after the login gives me the login page again
(which I noticed happens in Firefox if I try to paste in a link to a page
after the login page when I'm not logged in).

It looks like the server might be checking the Referer header. You
may want to try to include one in every request you make, like this:

my $res = $ua->get($url, Referer => $ref);

where $ref is the URL of the page you got $url from. (It might be
enough just to give any URL from the same site, but then again, it
might not.)

A server paranoid enough to do things like that may also be checking
User-Agent headers, so if you're not doing that already, I'd suggest
setting yours to imitate some common browser, like this:

$ua->agent('Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)');
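
Putting the two suggestions together, a rough sketch that remembers the previous URL and sends it as the Referer of the next request. The helper name build_get and the URLs are hypothetical.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;
$ua->agent('Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)');

my $last_url;    # URL of the previous page, used as the next Referer

# Hypothetical helper: builds a GET request that names the previous page.
sub build_get {
    my ($url) = @_;
    my $req = HTTP::Request->new(GET => $url);
    $req->header(Referer => $last_url) if defined $last_url;
    $last_url = $url;
    return $req;
}

# Placeholder URLs.
my $login = build_get('http://www.example.com/login');
my $frame = build_get('http://www.example.com/frame1.html');
print $frame->as_string;

# Each request would then be sent with $ua->request($req).
```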
 
Joe Smith

Hal said:
At this point, I think it's a bigger problem and I could use some
clarification from anyone

Last time I had a problem like this, I told my browser to use an
http proxy, and had the proxy log what was actually being sent to
the server. I used http://www.inwap.com/mybin/miscunix/?tcp-proxy
to do the logging when my proxy did not log everything I needed.
-Joe

P.S. I noticed that cookies are mentioned in
http://search.cpan.org/~petdance/WWW-Mechanize-1.12/lib/WWW/Mechanize.pm
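
A similar kind of log can be produced inside the script itself, without a proxy, by attaching handlers to the user agent. This is only a sketch: the log_message helper and the @log array are made-up names, and it records headers only.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my @log;    # collected header dumps, in the order they occurred

# Hypothetical helper: record the headers of a request or response.
sub log_message {
    my ($label, $message) = @_;
    push @log, "--- $label ---\n" . $message->headers->as_string;
    return;
}

my $ua = LWP::UserAgent->new;

# Handlers return nothing so that the request proceeds normally.
$ua->add_handler(request_send  => sub { log_message(request  => $_[0]); return });
$ua->add_handler(response_done => sub { log_message(response => $_[0]); return });

# After some $ua->get(...) or $ua->request(...) calls:
#   print for @log;
```

Comparing the logged Cookie and Set-Cookie lines with what a browser sends to the same site should show which header the script is missing.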
 
