urllib2 and threading

robean

I am writing a program that involves visiting several hundred webpages
and extracting specific information from the contents. I've written a
modest 'test' example here that uses a multi-threaded approach to
reach the urls with urllib2. The actual program will involve fairly
elaborate scraping and parsing (I'm using Beautiful Soup for that) but
the example shown here is simplified and just confirms the url of the
site visited.

Here's the problem: the script simply crashes after getting a couple
of urls, and it takes a long time to run (slower than a non-threaded
version that I wrote and ran). Can anyone figure out what I am doing
wrong? I am new to both threading and urllib2, so it's possible that
the SNAFU is quite obvious.

The urls are stored in a text file that I read from. The urls are all
valid, so there's no problem there.

Here's the code:

#!/usr/bin/python

import urllib2
import threading

class MyThread(threading.Thread):
    """subclass threading.Thread to create Thread instances"""
    def __init__(self, func, args):
        threading.Thread.__init__(self)
        self.func = func
        self.args = args

    def run(self):
        apply(self.func, self.args)


def get_info_from_url(url):
    """ A dummy version of the function simply visits urls and prints
    the url of the page. """
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError, e:
        print "**** error ****", e.reason
    except urllib2.HTTPError, e:
        print "**** error ****", e.code
    else:
        ulock.acquire()
        print page.geturl()  # obviously, do something more useful here, eventually
        page.close()
        ulock.release()

ulock = threading.Lock()
num_links = 10
threads = []  # store threads here
urls = []     # store urls here

fh = open("links.txt", "r")
for line in fh:
    urls.append(line.strip())
fh.close()

# collect threads
for i in range(num_links):
    t = MyThread(get_info_from_url, (urls[i],))
    threads.append(t)

# start the threads
for i in range(num_links):
    threads[i].start()

for i in range(num_links):
    threads[i].join()

print "all done"
 
Paul Rubin

robean said:
reach the urls with urllib2. The actual program will involve fairly
elaborate scraping and parsing (I'm using Beautiful Soup for that) but
the example shown here is simplified and just confirms the url of the
site visited.

Keep in mind Beautiful Soup is pretty slow, so if you're doing a lot
of pages and have multiple CPUs, you probably want parallel processes
rather than threads.
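
The process-based version might look roughly like this (just a sketch,
assuming Python 2's multiprocessing module; the fetch_and_parse function
and the links.txt filename are made up for illustration):

from multiprocessing import Pool
import urllib2

def fetch_and_parse(url):
    # hypothetical worker: fetch the page and return something small
    # (the extracted data), not a whole parse tree
    try:
        return url, len(urllib2.urlopen(url).read())
    except IOError:                      # URLError/HTTPError are IOError subclasses
        return url, None

if __name__ == '__main__':
    urls = [line.strip() for line in open("links.txt")]
    pool = Pool(processes=4)             # roughly one worker per CPU
    for url, size in pool.map(fetch_and_parse, urls):
        print url, size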

robean said:
Can anyone figure out what I am doing
wrong? I am new to both threading and urllib2, so it's possible that
the SNAFU is quite obvious.
...
ulock = threading.Lock()

Without looking at the code for more than a few seconds, using an
explicit lock like that is generally not a good sign. The usual
Python style is to send all inter-thread communications through
Queues. You'd dump all your urls into a queue and have a bunch of
worker threads getting items off the queue and processing them. This
really avoids a lot of lock-related headache. The price is that you
sometimes use more threads than strictly necessary. Unless it's a LOT
of extra threads, it's usually not worth the hassle of messing with
locks.
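
A bare-bones sketch of that pattern (Python 2, hence the Queue module and
urllib2; the worker count and the links.txt name are just assumptions
carried over from your script):

import Queue
import threading
import urllib2

NUM_WORKERS = 5   # assumption: a handful of threads is plenty here

def worker(q):
    while True:
        url = q.get()
        try:
            page = urllib2.urlopen(url)
            print page.geturl()          # do something more useful here eventually
            page.close()
        except urllib2.HTTPError, e:     # subclass of URLError, so test it first
            print "**** error ****", url, e.code
        except urllib2.URLError, e:
            print "**** error ****", url, e.reason
        q.task_done()                    # tell the queue this URL is finished

q = Queue.Queue()
for line in open("links.txt"):
    q.put(line.strip())

for _ in range(NUM_WORKERS):
    t = threading.Thread(target=worker, args=(q,))
    t.setDaemon(True)                    # workers die when the main thread exits
    t.start()

q.join()                                 # block until every URL has been processed
print "all done"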
 
robean

Thanks for your reply. Obviously you make several good points about
Beautiful Soup and Queue. But here's the problem: even if I do nothing
whatsoever with the threads beyond just visiting the urls with
urllib2, the program chokes. If I replace

else:
    ulock.acquire()
    print page.geturl()  # obviously, do something more useful here, eventually
    page.close()
    ulock.release()

with

else:
    pass

urllib2 starts raising URLErrors after the first 3 - 5 urls have
been visited. Do you have any sense of what in the threads is corrupting
urllib2's behavior? Many thanks,

Robean
 
Stefan Behnel

robean said:
I am writing a program that involves visiting several hundred webpages
and extracting specific information from the contents. I've written a
modest 'test' example here that uses a multi-threaded approach to
reach the urls with urllib2. The actual program will involve fairly
elaborate scraping and parsing (I'm using Beautiful Soup for that)

Try lxml.html instead. It often parses HTML pages better than BS, can parse
directly from HTTP/FTP URLs, frees the GIL doing so, and is generally a lot
faster and more memory friendly than the combination of urllib2 and BS,
especially when threading is involved. It also supports CSS selectors for
finding page content, so your "elaborate scraping" might actually turn out
to be a lot simpler than you think.

http://codespeak.net/lxml/
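
Something along these lines, for example (a sketch assuming lxml is
installed; the URL and the CSS selector are placeholders):

import lxml.html

# lxml.html can fetch and parse directly from a URL
doc = lxml.html.parse("http://www.example.com/").getroot()

# CSS selectors instead of hand-written tree walking
for link in doc.cssselect("div.content a"):
    print link.get("href"), link.text_content()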

These might be worth reading:

http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Stefan
 
shailen.tuli

For performance, lxml easily outperforms Beautiful Soup.

For what it's worth, the code runs fine if you switch from urllib2 to
urllib (different exceptions are raised, obviously). I have no
experience using urllib2 in a threaded environment, so I'm not sure
why it breaks; urllib does OK, though.
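
Roughly this kind of swap, that is (a sketch; note that urllib.urlopen
signals network failures with IOError and, unlike urllib2, doesn't raise
HTTPError for things like 404s, it just returns the error page):

import urllib

def get_info_from_url(url):
    try:
        page = urllib.urlopen(url)       # urllib instead of urllib2
    except IOError, e:                   # urllib reports network errors as IOError
        print "**** error ****", url, e
    else:
        print page.geturl()
        page.close()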

- Shailen
 
Piet van Oostrum

robean said:
R> def get_info_from_url(url):
R>     """ A dummy version of the function simply visits urls and prints
R>     the url of the page. """
R>     try:
R>         page = urllib2.urlopen(url)
R>     except urllib2.URLError, e:
R>         print "**** error ****", e.reason
R>     except urllib2.HTTPError, e:
R>         print "**** error ****", e.code

There's a problem here. HTTPError is a subclass of URLError so it should
be first. Otherwise when you have an HTTPError (like a 404 File not
found) it will be caught by the "except URLError", but it will not have
a reason attribute, and then you get an exception in the except clause
and the thread will crash.
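
So the fix is just to swap the two except clauses, something like:

def get_info_from_url(url):
    try:
        page = urllib2.urlopen(url)
    except urllib2.HTTPError, e:    # the subclass first: 404s, 500s, ...
        print "**** error ****", e.code
    except urllib2.URLError, e:     # then the more general URLError
        print "**** error ****", e.reason
    else:
        ulock.acquire()
        print page.geturl()
        page.close()
        ulock.release()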
 
