How to use urllib2.BaseHandler class

Doug Farrell

Hi all,

I'm trying to build a web page crawler to help us build our websites,
which are driven by static pages after they are called the first time.
Anyway, I can use urllib2.urlopen() no problem, but I'd like to have
more control over the process. In particular I'd like to get back the
HTTP status code from the request, even if it's a 200. It looks like I
can do that by deriving my own class from HTTPHandler, but I'm not
sure how to go about it. Can anyone direct me to some useful example
code for this kind of thing?

Thanks in advance,
Doug Farrell
 
John J. Lee

In 2.3, urllib2 only ever *returns* a response if the code is 200. In
other cases, HTTPError exceptions are *raised*. HTTPError instances
satisfy the normal response interface, so you can catch them and use
them just as you would the return value of urlopen(). As you've
noticed, they also have .code and .msg attributes (unlike normal
response objects, in 2.3 -- since it's always 200, they weren't really
necessary!).
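
For what it's worth, here is a minimal sketch of that catch-and-reuse approach. The helper name fetch_status and its opener parameter are made up for illustration, and the import fallback assumes you might be running a later Python where urllib2 lives at urllib.request. The "except ... as" spelling needs Python 2.6 or later; on 2.3 you would write "except urllib2.HTTPError, err" instead.

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2    # assumption: same API under its Python 3 name


def fetch_status(url, opener=None):
    """Return (status_code, response) for url, error or not.

    fetch_status and the opener parameter are illustrative names,
    not part of urllib2.
    """
    open_ = opener.open if opener is not None else urllib2.urlopen
    try:
        response = open_(url)
    except urllib2.HTTPError as err:
        # HTTPError doubles as a response object, so hand it back as one.
        return err.code, err
    # 2.3 responses carry no .code; anything returned there was a 200.
    return getattr(response, "code", 200), response
```

Calling fetch_status("http://www.python.org/") should give you the code plus an object you can read() from in either case.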

Now for 2.4, where things have changed a bit.

I *think* the 2.4 CVS urllib2.py will work fine with Python 2.3 (the
annoying Python test suite runner makes it a mild pain to check).

As I mentioned in another thread, don't use the urllib2 from 2.4a1 --
it's broken.

In 2.4, some successful responses other than 200 are also returned (at
present, only 200 and 206). Also, all response objects have .code and
.msg attributes -- not only HTTPError, but those that get returned,
too (i.e. 200 and 206 ATM). If you want all responses returned rather
than raised as exceptions, or vice-versa, it's much easier to achieve
that in 2.4 than in 2.3. It's easier because the interface of handler
objects has been extended to allow pre- and post-processing of requests
and responses respectively, and that feature is now used by urllib2 to
implement HTTP error handling separately from the rest of HTTP
fetching. Snip from CVS urllib2.py:

class HTTPErrorProcessor(BaseHandler):
    """Process HTTP error responses."""
    handler_order = 1000  # after all other processing

    def http_response(self, request, response):
        code, msg, hdrs = response.code, response.msg, response.info()

        if code not in (200, 206):
            response = self.parent.error(
                'http', request, response, code, msg, hdrs)

        return response

    https_response = http_response
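
The same pre-/post-processing hooks can be used just to observe the status code without changing error handling at all. A rough sketch, assuming the 2.4-style handler interface: the class name StatusRecorder, the seen list and the X-Example-Crawler header are all invented for illustration.

```python
try:
    import urllib2                      # Python 2.4+
except ImportError:
    import urllib.request as urllib2    # assumption: same API under its Python 3 name


class StatusRecorder(urllib2.BaseHandler):
    """Record the status code of every HTTP response that passes through.

    StatusRecorder is a hypothetical name, not something urllib2 provides.
    """
    # Lower than HTTPErrorProcessor's 1000, so this sees error responses
    # before they are turned into HTTPError exceptions.
    handler_order = 900

    def __init__(self):
        self.seen = []  # list of (url, code) pairs

    def http_request(self, request):
        # Pre-processing hook: tag outgoing requests (made-up header name).
        request.add_header("X-Example-Crawler", "1")
        return request

    def http_response(self, request, response):
        # Post-processing hook: note the code, pass the response on untouched.
        self.seen.append((request.get_full_url(), response.code))
        return response

    https_request = http_request
    https_response = http_response


# Usage (needs network access):
#   recorder = StatusRecorder()
#   opener = urllib2.build_opener(recorder)
#   opener.open("http://www.python.org/")
#   recorder.seen  # list of (url, code) pairs, including any 200s
```

Because the recorder returns the response unchanged, redirects, authentication and HTTPError raising should all behave as they normally do.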


So, to get all responses returned without error handling, regardless
of error code (this will disable things like authentication and
redirection, of course, so you might want to be a bit more
restrictive, by still passing on selected error codes to
self.parent.error()):

import urllib2

class NullHTTPErrorProcessor(urllib2.HTTPErrorProcessor):
    def http_response(self, request, response):
        return response

    https_response = http_response

opener = urllib2.build_opener(NullHTTPErrorProcessor())
opener.open("http://www.python.org/")  # never raises HTTPError
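
A more restrictive variant, along the lines mentioned above, might look roughly like this. It is only a sketch: the name SelectiveHTTPErrorProcessor and the particular set of codes in HANDLED_CODES are my assumptions, so adjust them to taste.

```python
try:
    import urllib2                      # Python 2.4+
except ImportError:
    import urllib.request as urllib2    # assumption: same API under its Python 3 name

# Codes still handed to urllib2's error machinery so that redirection and
# authentication handlers get a chance to act. This set is an assumption.
HANDLED_CODES = (301, 302, 303, 307, 401, 407)


class SelectiveHTTPErrorProcessor(urllib2.HTTPErrorProcessor):
    """Return most responses as-is; pass selected codes to parent.error()."""

    def http_response(self, request, response):
        code, msg, hdrs = response.code, response.msg, response.info()
        if code in HANDLED_CODES:
            # With no handler installed for the code (e.g. no auth handler
            # for 401), the default error handler raises HTTPError.
            return self.parent.error(
                'http', request, response, code, msg, hdrs)
        return response

    https_response = http_response


# Usage (needs network access):
#   opener = urllib2.build_opener(SelectiveHTTPErrorProcessor())
#   response = opener.open("http://www.python.org/")
#   response.code
```

With this in place a 404 or 500 comes back as an ordinary response, while redirects are still followed.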


You should probably only do this if you have good reason, because you
may confuse people reading your code.

Use urllib2.install_opener() if you want to use urllib2.urlopen().
Usually there's no real point, though.
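
If you do want the module-level urlopen() to pick up the custom opener, something like the following sketch should do it. The function name install_null_error_processor is made up, and the class is repeated here only so the snippet stands on its own.

```python
try:
    import urllib2                      # Python 2.4+
except ImportError:
    import urllib.request as urllib2    # assumption: same API under its Python 3 name


class NullHTTPErrorProcessor(urllib2.HTTPErrorProcessor):
    """Return every HTTP response instead of raising HTTPError."""

    def http_response(self, request, response):
        return response

    https_response = http_response


def install_null_error_processor():
    """Make urllib2.urlopen() use the non-raising opener; return that opener.

    The function name is illustrative only.
    """
    opener = urllib2.build_opener(NullHTTPErrorProcessor())
    # install_opener() changes process-wide state: every later call to
    # urllib2.urlopen() goes through this opener.
    urllib2.install_opener(opener)
    return opener


# Usage (needs network access):
#   install_null_error_processor()
#   response = urllib2.urlopen("http://www.python.org/")
#   response.code
```

Since this affects every urlopen() caller in the process, keeping the opener in a local variable and calling opener.open() is usually the less surprising choice.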

If you want to stick with pre-2.4 code, look at ClientCookie for
example code. That code is full of cruft though, since it's supposed
to work back to 1.5.2, and has to cut and paste a fair amount as a
result ;-)

HTH


John
 
