robean
I am writing a program that involves visiting several hundred webpages
and extracting specific information from the contents. I've written a
modest 'test' example here that uses a multi-threaded approach to
reach the urls with urllib2. The actual program will involve fairly
elaborate scraping and parsing (I'm using Beautiful Soup for that) but
the example shown here is simplified and just confirms the url of the
site visited.
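For context, the real version of the extraction function will do
something along these lines with Beautiful Soup (just a rough sketch to
show the kind of parsing I mean; the tags and logic here are
placeholders, not the final code):

#!/usr/bin/python
# Rough sketch of the eventual extraction function; the tags and
# logic shown here are placeholders, not the final parsing code.
import urllib2
from BeautifulSoup import BeautifulSoup   # Beautiful Soup 3.x

def get_info_from_url(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    page.close()
    # e.g. pull out the page title and count the outgoing links
    # (assumes the page actually has a <title> tag)
    title = soup.find('title')
    links = [a['href'] for a in soup.findAll('a', href=True)]
    print url, title.string, len(links)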
Here's the problem: the script simply crashes after getting a couple
of urls and takes a long time to run (slower, in fact, than a
non-threaded version that I wrote and ran; I've sketched that version
below). Can anyone figure out what I am doing wrong? I am new to both
threading and urllib2, so it's possible that the SNAFU is quite
obvious.
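For reference, the non-threaded version I'm comparing against is
essentially just a plain loop over the same file (simplified here; the
real one has the same error handling as the threaded code):

#!/usr/bin/python
# Non-threaded version, for comparison: fetch each url in turn.
import urllib2

fh = open("links.txt", "r")
urls = [line.strip() for line in fh]
fh.close()

for url in urls:
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError, e:
        print "**** error ****", e
    else:
        print page.geturl()
        page.close()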
The urls are stored in a text file that I read from. The urls are all
valid, so there's no problem there.
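In case the format matters, links.txt just has one url per line, along
these lines (made-up addresses for illustration):

http://www.example.com/
http://www.example.org/page1
http://www.example.net/page2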
Here's the code:
#!/usr/bin/python

import urllib2
import threading

class MyThread(threading.Thread):
    """subclass threading.Thread to create Thread instances"""
    def __init__(self, func, args):
        threading.Thread.__init__(self)
        self.func = func
        self.args = args

    def run(self):
        apply(self.func, self.args)

def get_info_from_url(url):
    """ A dummy version of the function simply visits urls and prints
    the url of the page. """
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError, e:
        print "**** error ****", e.reason
    except urllib2.HTTPError, e:
        print "**** error ****", e.code
    else:
        ulock.acquire()
        print page.geturl()  # obviously, do something more useful here, eventually
        page.close()
        ulock.release()
ulock = threading.Lock()
num_links = 10
threads = []  # store threads here
urls = []     # store urls here

fh = open("links.txt", "r")
for line in fh:
    urls.append(line.strip())
fh.close()

# collect threads
for i in range(num_links):
    t = MyThread(get_info_from_url, (urls[i],))
    threads.append(t)

# start the threads
for i in range(num_links):
    threads[i].start()
for i in range(num_links):
    threads[i].join()

print "all done"