Building browser-like GET request

Gilles Ganault

Hello

I'd like to download pages from a site, but it checks whether
the requests are coming from a live user or a script; if the latter,
the server returns a blank page.

Using a proxy (Paros), I can see what information my script and
Firefox send, and there is a lot of information that Python is
missing:

======== PYTHON ===============
http://www.acme.com/cgi-bin/read?code=123 HTTP/1.1
Accept-Encoding: identity
Host: www.acme.com
Connection: close
User-Agent: Python-urllib/2.4 Paros/3.2.12
======== FIREFOX ===============
http://www.acme.com/cgi-bin/read?code=123 HTTP/1.1
Host: www.acme.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3 Paros/3.2.12
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: fr-fr,en-us;q=0.7,en;q=0.3
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Proxy-Connection: keep-alive
=============================

How can Python be told to send the same information?

Thank you.
 
Björn Keil

Well, I am brand new to Python, so this takes a lot of guessing on my part, but
since it seems you're using urllib2:

The documentation at http://docs.python.org/lib/module-urllib2.html says that
you can add custom headers to your HTTP requests, either by calling
add_header() on the Request object or by passing a dictionary of headers to
the Request constructor.
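
Both approaches can be sketched roughly as follows. This is only a minimal
sketch and it assumes Python 3, where the urllib2 functionality lives in
urllib.request; under Python 2 the same calls are made on urllib2 instead.
The header values are copied from the Firefox capture in the question, and
nothing is sent over the network here: the request is only built and
inspected.

```python
import urllib.request

url = "http://www.acme.com/cgi-bin/read?code=123"

# Option 1: pass a dictionary of headers to the Request constructor.
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; "
                  "rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3",
    "Accept-Language": "fr-fr,en-us;q=0.7,en;q=0.3",
}
req = urllib.request.Request(url, headers=browser_headers)

# Option 2: add a header to an existing Request object.
req.add_header("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7")

# Request stores header names in capitalized form ("Accept-language"),
# so look them up that way when inspecting the object.
for name, value in sorted(req.header_items()):
    print("%s: %s" % (name, value))
```

No data argument is given, so the request stays a GET.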

I hope that helped and that I wasn't telling you things you already knew.
As a side note: for the task you describe I'd rather use an actual
sniffer, such as Wireshark (http://en.wikipedia.org/wiki/Wireshark),
than the logs of a proxy. I'm not sure whether Wireshark works under Windows,
though.

Good luck!
 
Gilles Ganault

Thanks. Indeed, it looks like urllib2 is the way to go when going
through a proxy.

For those interested, here's how to download a page through a proxy:

----------------------------
import urllib2

# Set up the proxy (here, one listening on localhost:8080)
proxy_info = {'host': 'localhost', 'port': 8080}
proxy_support = urllib2.ProxyHandler(
    {"http": "http://%(host)s:%(port)d" % proxy_info})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)

# Call the page with browser-like headers
url = 'http://www.acme.com/cgi-bin/read?code=123'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Accept': 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
    'Accept-Language': 'fr-fr,en-us;q=0.7,en;q=0.3',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
}
# data=None means GET; pass encoded form data instead to send a POST
req = urllib2.Request(url, None, headers)

# Fetch the page and save it to disk
response = urllib2.urlopen(req).read()
log = open('output.html', 'w')
log.write(response)
log.close()
----------------------------
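
To check which headers a script like this really sends, without Paros or a
sniffer, one option is to point it at a throwaway local server that echoes
the request headers back. The following is only a sketch under some
assumptions: it is written for Python 3 (urllib.request and http.server
rather than urllib2), and the EchoHeaders class and fetch_seen_headers()
function are hypothetical names made up for this illustration.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHeaders(BaseHTTPRequestHandler):
    """Hypothetical test handler: replies with the request headers as JSON."""

    def do_GET(self):
        body = json.dumps(dict(self.headers.items())).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def fetch_seen_headers(headers):
    """Send a GET with the given headers to a throwaway local server
    and return the headers that server actually received."""
    server = HTTPServer(("127.0.0.1", 0), EchoHeaders)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    try:
        url = "http://127.0.0.1:%d/cgi-bin/read?code=123" % server.server_port
        req = urllib.request.Request(url, None, headers)
        # An empty ProxyHandler bypasses any system proxy for this local test.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(req, timeout=5) as response:
            return json.loads(response.read().decode("utf-8"))
    finally:
        server.shutdown()
        server.server_close()


seen = fetch_seen_headers({
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)",
    "Accept-Language": "fr-fr,en-us;q=0.7,en;q=0.3",
})
for name in sorted(seen):
    print("%s: %s" % (name, seen[name]))
```

Besides the custom headers, the output should also list the ones the
library adds on its own, such as Host and Connection, which makes it easy to
compare against a browser capture.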
 
Steve Holden

Björn Keil wrote:
[...]
I hope that helped and that I wasn't telling you things you already knew.
As a side note: for the task you describe I'd rather use an actual
sniffer, such as Wireshark (http://en.wikipedia.org/wiki/Wireshark),
than the logs of a proxy. I'm not sure whether Wireshark works under Windows,
though.

On a point of information, Wireshark works very effectively under
Windows. The only thing you shouldn't expect to be able to do is tap
into the loopback network, and that's down to the Windows driver structure.

regards
Steve
 
Gilles Ganault

Thanks for the tip. Someone mentioned a lighter alternative for
displaying what goes on between the browser and the web server:

PocketSoap's TCPTrace
http://www.pocketsoap.com/tcptrace/
 
