Why does adding debug statements in an override class of http-access2 change its outcome?

Dan Kohn

I'm encountering a Heisenberg problem where the observation is
interfering with the outcome. It's probably because I'm not overriding
the http-access2 library correctly.

The GetPage class is designed as a scraper that provides very detailed
information about incoming and outgoing headers for each redirect. The
United Airlines website, for example, requires 5 redirects on a
successful login, plus one more to actually access your mileage
summary. But when I turn on the debugging code, I get only one
redirect, and then it sends me back to the login page (except that
sometimes I get 4, still without logging in). I suspect the error has
to do with this comment about do_get_block: "Method 'do_get*' runs
under MT conditon. Be careful to change." But I don't understand why
this would matter, since I'm not even running the code multi-threaded.
Turning on client.debug_dev = STDERR also causes the lookup to return
the wrong page after 1 redirect.


require 'http-access2'
require 'time'
$stdout.sync = true

class HTTPAccess2::Client
  def conn_request(conn, method, uri, query, body, extheader, &block)
    unless uri.is_a?(URI)
      uri = URI.parse(uri)
    end
    proxy = no_proxy?(uri) ? nil : @proxy
    begin
      req = create_request(method, uri, query, body, extheader,
                           !proxy.nil?)
      req2 = req.dup # These two lines are the only change
      debug(req2)    # to the conn_request method
      do_get_block(req, proxy, conn, &block)
    rescue Session::KeepAliveDisconnected
      req = create_request(method, uri, query, body, extheader,
                           !proxy.nil?)
      do_get_block(req, proxy, conn, &block)
    end
  end

  def debug(req)
    # The lookup still responds incorrectly even when I comment out
    # the following two lines
    print req.header.dump.gsub(/\r/,"")
    print "Body: #{req.body.dump}\n"
  end
end



# DESCRIPTION
# GetPage is designed for a series of gets or posts, while following
# redirects and handling cookies.
class GetPage

  attr_accessor :limit, :page_index

  def initialize
    @client = HTTPAccess2::Client.new
    @client.set_cookie_store("cookies.dat")
    @client.ssl_config.verify_mode = nil
    @limit = 20
    @page_index = 0
  end

  def get_or_post(uri_str, form, method)
    headers = [
      # The User-Agent value must be a single line; it is split here
      # only to keep the listing readable.
      ['User-Agent', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT ' +
        '5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)'],
      ['Accept',
        'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5'],
      ['Accept-Language', 'en-us,en;q=0.5'],
      ['Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.7'],
      ['Keep-Alive', '300'],
      ['Connection', 'keep-alive'],
      ['Referer', 'http://www.united.com/']]
    puts "-------------------------------------"
    response = case method
    when :get
      puts "Fetching: #{uri_str}"
      @client.get(uri_str, form, headers)
    when :post
      puts "Posting to: #{uri_str}"
      @client.post(uri_str, form, headers)
    end

    puts "======================================"
    print response.header.dump.gsub(/\r/,"")
    # Use the block form so the file is closed (and flushed) right away
    File.open(@page_index.to_s + ".html", "w") { |f| f.puts response.body.dump }
    @page_index += 1

    # Return the last response in the redirect chain, not the first
    final = check_for_redirect(response)
    @client.save_cookie_store
    final
  end

  def check_for_redirect(response)
    case response.status
    when 200
      response
    when 302
      @limit -= 1
      # Stop instead of looping forever once the limit is used up
      fail "Too many redirects" if @limit < 0
      new_uri = response.header['location'].to_s.gsub(/ /,"%20")
      get_or_post(new_uri, [], :get)
    else
      fail "Don't know response status #{response.status}"
    end
  end

  def get(uri_str, form = [])
    get_or_post(uri_str, form, :get)
  end

  def post(uri_str, form)
    get_or_post(uri_str, form, :post)
  end
end # Class GetPage

username = "1234"
password = "4567"
uri_str1 =
  "https://www.ua2go.com/ci/DoLogin.js...=NEWREC,itn/air/united&return_to=ff_acct_hist"
form1 = [['userId' , username],
         ['password', password],
         ['sel_return_to', '&return_to=ff_acct_hist'],
         ['submit2', 'Login' ]]

g = GetPage.new
sum = g.post(uri_str1, form1)
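
The redirect-following logic in GetPage can be exercised without
touching the network, which may help separate redirect bugs from
library behaviour. The following is a minimal sketch only:
StubResponse, follow_redirects, and the canned page table are
hypothetical names made up for illustration and are not part of
http-access2.

```ruby
# Hypothetical stand-in for an HTTP response; not an http-access2 class.
StubResponse = Struct.new(:status, :location)

# Follow 302 responses until a 200 arrives or the limit is used up.
# `fetch` is any callable that takes a URI string and returns a response.
def follow_redirects(start_uri, limit, fetch)
  uri = start_uri
  hops = []
  loop do
    response = fetch.call(uri)
    hops << uri
    case response.status
    when 200
      return [response, hops]
    when 302
      limit -= 1
      raise "Too many redirects" if limit < 0
      # Same space-escaping as check_for_redirect in GetPage
      uri = response.location.to_s.gsub(/ /, "%20")
    else
      raise "Don't know response status #{response.status}"
    end
  end
end

# Canned redirect chain (assumed data): /login -> "/step 1" -> /summary
pages = {
  "/login"    => StubResponse.new(302, "/step 1"),
  "/step%201" => StubResponse.new(302, "/summary"),
  "/summary"  => StubResponse.new(200, nil)
}
fetch = lambda { |uri| pages.fetch(uri) }

final, hops = follow_redirects("/login", 20, fetch)
puts "final status: #{final.status}"
puts "hops: #{hops.inspect}"
```

Because the fetcher is passed in, the same loop could be pointed at a
real client later; the stub only checks that the hop count and the
limit behave as intended.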


I originally tried using Mechanize, a higher-level tool, but the United
site is absurdly brittle in terms of needing all parameters and cookies
to be perfect, so much of my time was spent outfitting http-access2
with enough debugging to know exactly what it's sending after each of
the 5(!) redirects. Imagine my surprise, then, to find that my scraper
only worked when I disabled the debugging code.

My ruby is "ruby 1.8.2 (2004-12-25) [i386-mswin32]" and I'm using
http-access2 version 2.0.6.

Thanks in advance for any suggestions you can provide.

- dan
 
