rzimerman
I'm hoping to write a program that will read any number of URLs from
stdin (one per line), download them, and process them. So far my script
(below) works well for small numbers of URLs. However, it does not
scale to more than 200 URLs or so, because it issues HTTP requests for
all of the URLs simultaneously and terminates after 25 seconds.
Ideally, I'd like this script to download at most 50 pages in parallel,
and to time out only when an individual HTTP request goes unanswered
for 3 seconds. What changes do I need to make?
Is Twisted the best library for me to be using? I do like Twisted, but
it seems more suited to batch-mode operations. Is there some way that I
could continue registering URL requests while the reactor is running?
Is there a way to specify a timeout per page request, rather than for
a whole batch of page requests?
Thanks!
#-------------------------------------------------
from twisted.internet import reactor
from twisted.web import client
import re, urllib, sys, time

def extract(html):
    pass  # do some processing on html, writing to stdout

def printError(failure):
    print >> sys.stderr, "Error:", failure.getErrorMessage()

def stopReactor():
    print "Now stopping reactor..."
    reactor.stop()

for url in sys.stdin:
    url = url.rstrip()
    client.getPage(url).addCallback(extract).addErrback(printError)

reactor.callLater(25, stopReactor)
reactor.run()
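For reference, here is a minimal sketch of one way to get both behaviours, assuming Twisted's defer.DeferredSemaphore to cap the number of getPage calls in flight at 50, and getPage's timeout keyword (passed through to its HTTPClientFactory) for the 3-second per-request limit. It is one possible approach, not necessarily the best one:
#-------------------------------------------------
from twisted.internet import defer, reactor
from twisted.web import client
import sys

def extract(html):
    pass  # processing elided, as in the original script

def printError(failure):
    print >> sys.stderr, "Error:", failure.getErrorMessage()

# at most 50 requests may run at once; further getPage calls queue
# inside the semaphore until a slot frees up
semaphore = defer.DeferredSemaphore(50)

def fetch(url):
    # getPage forwards extra keyword arguments to HTTPClientFactory,
    # so the timeout applies per request rather than to the whole batch
    return semaphore.run(client.getPage, url, timeout=3)

deferreds = []
for url in sys.stdin:
    d = fetch(url.rstrip())
    d.addCallback(extract)
    d.addErrback(printError)
    deferreds.append(d)

# stop the reactor once every request has either succeeded or failed,
# instead of after a fixed 25 seconds
defer.DeferredList(deferreds).addCallback(lambda ign: reactor.stop())
reactor.run()

Because DeferredSemaphore.run only invokes getPage when one of its 50 tokens is free, all of stdin can be read up front without opening every connection at once, and DeferredList ties shutdown to the work actually finishing rather than to a wall-clock guess.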