Dan Kohn
I'm encountering a Heisenberg problem where the observation is
interfering with the outcome. It's probably because I'm not overriding
the http-access2 library correctly.
The GetPage class is designed as a scraper that provides very detailed
information about incoming and outgoing headers for each redirect. The
United Airlines website, for example, requires five redirects on a
successful login, plus one more to actually access your mileage
summary. But when I turn on the debugging code, I get only one
redirect, and then it sends me back to the login page (except sometimes
I get four, still without logging in). I suspect the error has to do
with this comment on do_get_block: "Method 'do_get*' runs under MT
conditon. Be careful to change." But I don't understand why this would
matter, since I'm not even running the code multi-threaded. Turning on
client.debug_dev = STDERR also causes the lookup to return the wrong
page after one redirect.
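One possibility worth checking, sketched in plain Ruby with a hypothetical FakeRequest class rather than http-access2's real request objects (I have not verified this against the library's internals): Object#dup is a shallow copy, so the duplicate can share a mutable body object with the original, and inspecting the "copy" may drain the real request before it is sent.

```ruby
require 'stringio'

# Object#dup is shallow: req2 references the SAME body object as req.
# If debug() reads an IO-like body through the duplicate, the original
# request's body is already consumed by the time it goes on the wire.
class FakeRequest
  attr_reader :body
  def initialize(io)
    @body = io
  end
end

req  = FakeRequest.new(StringIO.new("userId=1234&password=4567"))
req2 = req.dup            # shallow copy: same StringIO underneath
req2.body.read            # "debugging" the duplicate...
req.body.read             # ...leaves nothing ("") for the actual send
```

If this is what is happening, dumping the headers and body to a string before the request is handed to do_get_block, rather than reading through a dup, would avoid disturbing the shared state.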
require 'http-access2'
require 'time'
$stdout.sync = true
class HTTPAccess2::Client
def conn_request(conn, method, uri, query, body, extheader, &block)
unless uri.is_a?(URI)
uri = URI.parse(uri)
end
proxy = no_proxy?(uri) ? nil : @proxy
begin
req = create_request(method, uri, query, body, extheader,
!proxy.nil?)
      req2 = req.dup # These two lines are the only change
      debug(req2)    # to the conn_request method
do_get_block(req, proxy, conn, &block)
rescue Session::KeepAliveDisconnected
req = create_request(method, uri, query, body, extheader,
!proxy.nil?)
do_get_block(req, proxy, conn, &block)
end
end
def debug(req)
    # The lookup still responds incorrectly even when I comment out
    # the following two lines
print req.header.dump.gsub(/\r/,"")
print "Body: #{req.body.dump}\n"
end
end
# DESCRIPTION
# GetPage is designed for a series of gets or posts, while following
# redirects and handling cookies.
class GetPage
  attr_accessor :limit, :page_index
def initialize
@client=HTTPAccess2::Client.new
@client.set_cookie_store("cookies.dat")
    @client.ssl_config.verify_mode = nil # disable SSL certificate verification
@limit = 20
@page_index = 0
end
def get_or_post(uri_str, form, method)
headers = [
      ['User-Agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)'],
['Accept',
'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5'],
['Accept-Language', 'en-us,en;q=0.5'],
['Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.7'],
['Keep-Alive', '300'],
['Connection', 'keep-alive'],
['Referer', 'http://www.united.com/']]
puts "-------------------------------------"
response = case method
when :get
puts "Fetching: #{uri_str}"
@client.get(uri_str, form, headers)
    when :post
puts "Posting to: #{uri_str}"
@client.post(uri_str, form, headers)
end
puts "======================================"
print response.header.dump.gsub(/\r/,"")
    # Block form closes the file handle and flushes the page to disk
    File.open("#{@page_index}.html", "w") { |f| f.puts response.body.dump }
@page_index += 1
check_for_redirect(response)
@client.save_cookie_store
response
end
def check_for_redirect(response)
case response.status
when 200
response
    when 302
      # Enforce the limit so a redirect loop can't recurse forever
      fail "Redirect limit exceeded" if (@limit -= 1) < 0
      new_uri = response.header['location'].to_s.gsub(/ /,"%20")
      get_or_post(new_uri, [], :get)
else fail "Don't know response status #{response.status}"
end
end
def get(uri_str, form = [])
get_or_post(uri_str, form, :get)
end
def post(uri_str, form)
    get_or_post(uri_str, form, :post)
end
end # Class GetPage
username = "1234"
password = "4567"
uri_str1 =
"https://www.ua2go.com/ci/DoLogin.js...=NEWREC,itn/air/united&return_to=ff_acct_hist"
form1 = [['userId' , username],
['password', password],
['sel_return_to', '&return_to=ff_acct_hist'],
['submit2', 'Login' ]]
g = GetPage.new
sum = g.post(uri_str1, form1)
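As an aside, check_for_redirect assumes the Location header is an absolute URI, but some servers send relative redirects. A small sketch of resolving them with the standard uri library (resolve_redirect is a hypothetical helper, not part of the scraper above):

```ruby
require 'uri'

# Resolve a Location header against the URI of the request that produced
# it; spaces still need escaping, as in the scraper above.
def resolve_redirect(request_uri, location)
  URI.join(request_uri, location.gsub(' ', '%20')).to_s
end

resolve_redirect("https://www.ua2go.com/ci/DoLogin", "/ci/MileageSummary")
# => "https://www.ua2go.com/ci/MileageSummary"
```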
I originally tried using Mechanize, a higher level tool, but the United
site is absurdly brittle in terms of needing all parameters and cookies
to be perfect, so much of my time was spent outfitting http-access2
with enough debugging to know exactly what it's sending after each of
the 5(!) redirects. Imagine my surprise, then, to find that my scraper
only worked when I disabled the debugging code.
My ruby is "ruby 1.8.2 (2004-12-25) [i386-mswin32]" and I'm using
http-access2 version 2.0.6.
Thanks in advance for any suggestions you can provide.
- dan