Thomas Lindgaard
Hello
I'm a newcomer to the world of Python trying to write a web spider. I
downloaded the skeleton from
http://starship.python.net/crew/aahz/OSCON2001/ThreadPoolSpider.py
Some of the source is shown below.

A few questions:
1) Why use the if __name__ == '__main__': construct?
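My current guess is that the test distinguishes running the file as a script (python ThreadPoolSpider.py) from importing it as a module, so the spider only starts in the first case. A tiny sketch of what I mean (mymod.py is my own made-up example, not from the spider):

# mymod.py -- made-up module to illustrate my guess
def double(x):
    return 2 * x

if __name__ == '__main__':
    # Reached via "python mymod.py" but not via "import mymod",
    # because on import __name__ is 'mymod', not '__main__'.
    print double(21)

Is that the whole story, or is there more to it?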
2) In RetrievePool.__init__, Retriever.__init__ is called with self.inputQueue and self.outputQueue as arguments. Does this mean that each Retriever thread holds a reference to RetrievePool.inputQueue and RetrievePool.outputQueue, i.e. that there is only one input queue and one output queue, shared by all the threads, which push and pop whenever they want (safe because Queue is synchronized)?
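As a sanity check on my reading I put together this little test (all the names here are my own invention, not from the spider): two threads doing get() on one shared Queue, with put() from the main thread, relying only on Queue's internal locking:

import threading, Queue

q = Queue.Queue()

def worker(name):
    while 1:
        item = q.get()          # blocks until something is available
        if item is None:        # sentinel -- my own shutdown convention
            break
        print '%s got %s' % (name, item)

threads = [threading.Thread(target=worker, args=('worker-%d' % i,))
           for i in range(2)]
for t in threads:
    t.start()
for i in range(10):
    q.put(i)                    # both workers pop from this one queue
q.put(None); q.put(None)        # one sentinel per worker
for t in threads:
    t.join()

It seems to work without any explicit locking on my part, which is what I would expect if the Queue instance really is shared and synchronized.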
3) How many threads will be running? Spider.__init__ creates the RetrievePool, which consists of MAX_THREADS threads, so once the crawler is running there will be the main thread (caught in the while loop in Spider.run) plus MAX_THREADS Retriever threads, right?
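In other words, with MAX_THREADS = 3 I would expect threading.enumerate() to report four threads while the crawl is in progress. Something like this, dropped into Spider.run purely as a check of my own (not part of the skeleton):

import threading
# With MAX_THREADS = 3 I expect 4 here: the main thread plus three Retrievers.
print len(threading.enumerate())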
Hmm... I think that's about it for now.
---------------------------------------------------------------------
# (imports are elided in the original post; the code shown needs at least these)
import sys
import threading
import Queue

MAX_THREADS = 3

....

class Retriever(threading.Thread):

    def __init__(self, inputQueue, outputQueue):
        threading.Thread.__init__(self)
        # Each thread keeps references to the two queues shared by the pool.
        self.inputQueue = inputQueue
        self.outputQueue = outputQueue

    def run(self):
        # Forever: pull a URL off the shared input queue (get() blocks until
        # one is available), fetch the page, push the links found back out.
        while 1:
            self.URL = self.inputQueue.get()
            self.getPage()
            self.outputQueue.put(self.getLinks())

    # ... (getPage and getLinks elided)

class RetrievePool:

    def __init__(self, numThreads):
        self.retrievePool = []
        self.inputQueue = Queue.Queue()
        self.outputQueue = Queue.Queue()
        # Every Retriever gets the same two Queue objects.
        for i in range(numThreads):
            retriever = Retriever(self.inputQueue, self.outputQueue)
            retriever.start()
            self.retrievePool.append(retriever)

    # ... (shutdown elided)

class Spider:

    def __init__(self, startURL, maxThreads):
        self.URLs = []
        self.queue = [startURL]
        self.URLdict = {startURL: 1}
        self.include = startURL
        self.numPagesQueued = 0
        self.retriever = RetrievePool(maxThreads)

    def run(self):
        self.startPages()
        while self.numPagesQueued > 0:
            self.queueLinks()
            self.startPages()
        self.retriever.shutdown()
        self.URLs = self.URLdict.keys()
        self.URLs.sort()

    # ... (startPages and queueLinks elided)

if __name__ == '__main__':
    startURL = sys.argv[1]
    spider = Spider(startURL, MAX_THREADS)
    spider.run()
    print
    for URL in spider.URLs:
        print URL