Making HTTP requests using Twisted


rzimerman

I'm hoping to write a program that will read any number of urls from
stdin (1 per line), download them, and process them. So far my script
(below) works well for small numbers of urls. However, it does not
scale to more than 200 urls or so, because it issues HTTP requests for
all of the urls simultaneously, and terminates after 25 seconds.
Ideally, I'd like this script to download at most 50 pages in parallel,
and to have each HTTP request time out only if it is not answered within
3 seconds. What changes do I need to make?

Is Twisted the best library for me to be using? I do like Twisted, but
it seems more suited to batch mode operations. Is there some way that I
could continue registering url requests while the reactor is running?
Is there a way to specify a timeout per page request, rather than for
a whole batch of page requests?

Thanks!



#-------------------------------------------------

from twisted.internet import reactor
from twisted.web import client
import re, urllib, sys, time

def extract(html):
    # do some processing on html, writing to stdout
    pass

def printError(failure):
    print >> sys.stderr, "Error:", failure.getErrorMessage()

def stopReactor():
    print "Now stopping reactor..."
    reactor.stop()

for url in sys.stdin:
    url = url.rstrip()
    client.getPage(url).addCallback(extract).addErrback(printError)

reactor.callLater(25, stopReactor)
reactor.run()
 

K.S.Sreeram

rzimerman said:
I'm hoping to write a program that will read any number of urls from
stdin (1 per line), download them, and process them. So far my script
(below) works well for small numbers of urls. However, it does not
scale to more than 200 urls or so, because it issues HTTP requests for
all of the urls simultaneously, and terminates after 25 seconds.
Ideally, I'd like this script to download at most 50 pages in parallel,
and to have each HTTP request time out only if it is not answered within
3 seconds. What changes do I need to make?

Is Twisted the best library for me to be using? I do like Twisted, but
it seems more suited to batch mode operations. Is there some way that I
could continue registering url requests while the reactor is running?
Is there a way to specify a timeout per page request, rather than for
a whole batch of page requests?

Have a look at pyCurl. (http://pycurl.sourceforge.net)
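
The multi interface lets you cap how many transfers run at once and give
each one its own timeout. A rough, untested sketch of that approach
(MAX_PARALLEL, TIMEOUT and the call to your extract() function are just
placeholders carried over from your script):

import sys
import pycurl
from cStringIO import StringIO

MAX_PARALLEL = 50
TIMEOUT = 3

urls = [line.rstrip() for line in sys.stdin if line.strip()]
multi = pycurl.CurlMulti()
queue = list(urls)
active = []

def start(url):
    c = pycurl.Curl()
    c.buf = StringIO()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, c.buf.write)
    c.setopt(pycurl.TIMEOUT, TIMEOUT)          # per-transfer timeout
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    multi.add_handle(c)
    active.append(c)

while queue or active:
    # top up to at most MAX_PARALLEL transfers in flight
    while queue and len(active) < MAX_PARALLEL:
        start(queue.pop(0))
    while multi.perform()[0] == pycurl.E_CALL_MULTI_PERFORM:
        pass
    while 1:
        num_msgs, ok, failed = multi.info_read()
        for c in ok:
            extract(c.buf.getvalue())          # your processing function
            multi.remove_handle(c)
            active.remove(c)
        for c, errno, errmsg in failed:
            print >> sys.stderr, "Error:", errmsg
            multi.remove_handle(c)
            active.remove(c)
        if num_msgs == 0:
            break
    multi.select(1.0)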

Regards
Sreeram



 

Fredrik Lundh

rzimerman said:
Is Twisted the best library for me to be using? I do like Twisted, but
it seems more suited to batch mode operations. Is there some way that I
could continue registering url requests while the reactor is running?
Is there a way to specify a timeout per page request, rather than for
a whole batch of page requests?

there are probably ways to solve this with Twisted, but in case you want a
simpler alternative, you could use Python's standard asyncore module and
the stuff described here:

http://effbot.org/zone/effnews.htm

especially

http://effbot.org/zone/effnews-1.htm#storing-the-rss-data
http://effbot.org/zone/effnews-3.htm#managing-downloads
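
the dispatcher pattern those pages build on boils down to something like
this (a stripped-down, untested sketch; the effnews code adds proper header
parsing and a download manager that limits how many fetches run at once):

import asyncore, socket, sys, urlparse

class async_http(asyncore.dispatcher):
    # minimal HTTP GET dispatcher; calls consumer(data) when the
    # server closes the connection (assumes plain http on port 80)
    def __init__(self, url, consumer):
        asyncore.dispatcher.__init__(self)
        self.consumer = consumer
        scheme, host, path = urlparse.urlparse(url)[:3]
        self.request = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path or "/", host)
        self.data = ""
        self.create_socket(socket.AF_INET, socket.SOCK_STREAM)
        self.connect((host, 80))

    def handle_connect(self):
        pass

    def writable(self):
        return len(self.request) > 0

    def handle_write(self):
        sent = self.send(self.request)
        self.request = self.request[sent:]

    def handle_read(self):
        self.data = self.data + self.recv(8192)

    def handle_close(self):
        self.close()
        self.consumer(self.data)   # raw response: headers + body

for url in sys.stdin:
    async_http(url.rstrip(), extract)   # extract() from the original script
asyncore.loop(timeout=1)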

</F>
 

Manlio Perillo

rzimerman wrote:
I'm hoping to write a program that will read any number of urls from
stdin (1 per line), download them, and process them. So far my script
(below) works well for small numbers of urls. However, it does not
scale to more than 200 urls or so, because it issues HTTP requests for
all of the urls simultaneously, and terminates after 25 seconds.
Ideally, I'd like this script to download at most 50 pages in parallel,
and to have each HTTP request time out only if it is not answered within
3 seconds. What changes do I need to make?

Take a look at
http://svn.twistedmatrix.com/cvs/trunk/doc/core/examples/stdiodemo.py?view=markup&rev=15456
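
The idea there, boiled down (untested, and reusing the extract/printError
callbacks from your script), is to read stdin from inside the reactor, so
new requests can be registered while it is running:

from twisted.internet import reactor, stdio
from twisted.protocols import basic
from twisted.web import client

class UrlReader(basic.LineReceiver):
    # urls can keep arriving on stdin while the reactor is running
    delimiter = '\n'

    def lineReceived(self, line):
        url = line.strip()
        if url:
            client.getPage(url, timeout=3
                ).addCallback(extract).addErrback(printError)

stdio.StandardIO(UrlReader())
reactor.run()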

And read
http://twistedmatrix.com/documents/current/api/twisted.web.client.HTTPClientFactory.html

You can pass a timeout to the constructor.
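
For example, to give every request its own 3 second timeout (getPage
forwards its keyword arguments to HTTPClientFactory):

from twisted.web.client import getPage

deferred = getPage(url, timeout=3)
deferred.addCallback(extract).addErrback(printError)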

To download at most 50 pages in parallel you can use a download queue.

Here is a quick example, ABSOLUTELY NOT TESTED:

from twisted.internet.defer import Deferred, DeferredList
from twisted.web.client import getPage

class DownloadQueue(object):
    SIZE = 50

    def __init__(self):
        self.requests = []   # queued requests
        self.deferreds = []  # waiting requests

    def addRequest(self, url, timeout):
        if len(self.deferreds) >= self.SIZE:
            # wait for completion of all previous requests
            DeferredList(self.deferreds
                ).addCallback(self._callback)
            self.deferreds = []

            # queue the request
            deferred = Deferred()
            self.requests.append((url, timeout, deferred))

            return deferred
        else:
            # execute the request now
            deferred = getPage(url, timeout=timeout)
            self.deferreds.append(deferred)

            return deferred

    def _callback(self, result):
        if len(self.requests) > self.SIZE:
            queue = self.requests[:self.SIZE]
            self.requests = self.requests[self.SIZE:]
        else:
            queue = self.requests[:]
            self.requests = []

        # execute the requests
        for (url, timeout, deferredHelper) in queue:
            deferred = getPage(url, timeout=timeout)
            self.deferreds.append(deferred)

            deferred.chainDeferred(deferredHelper)
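
Using it from the original script would then look roughly like this (also
untested; deciding when to stop the reactor is still up to you):

queue = DownloadQueue()

for url in sys.stdin:
    url = url.rstrip()
    queue.addRequest(url, timeout=3
        ).addCallback(extract).addErrback(printError)

reactor.run()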




Regards Manlio Perillo
 

Manlio Perillo

Manlio Perillo wrote:
Here is a quick example, ABSOLUTELY NOT TESTED:

class DownloadQueue(object):
    SIZE = 50

    def __init__(self):
        self.requests = []   # queued requests
        self.deferreds = []  # waiting requests

    def addRequest(self, url, timeout):
        if len(self.deferreds) >= self.SIZE:
            # wait for completion of all previous requests
            DeferredList(self.deferreds
                ).addCallback(self._callback)
            self.deferreds = []

The deferreds list should be cleared in the _callback method, not here.
Please note that there are probably other bugs.
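
That is, something like (still untested):

    def _callback(self, result):
        # the batch we were waiting on has finished; clear it here
        self.deferreds = []

        queue = self.requests[:self.SIZE]
        self.requests = self.requests[self.SIZE:]

        for (url, timeout, deferredHelper) in queue:
            deferred = getPage(url, timeout=timeout)
            self.deferreds.append(deferred)
            deferred.chainDeferred(deferredHelper)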


Regards Manlio Perillo
 
