Improving the web page download code.


mukesh tiwari

Hello All,
I am doing web stuff for the first time in Python, so I am looking for suggestions. I wrote this code to download the titles of web pages using as few resources (server time, data downloaded) as possible, and it should be reasonably fast. Initially I used BeautifulSoup for parsing, but the person who is going to use this code asked me not to use it and to use regular expressions instead (the reason being that BeautifulSoup is not fast enough?). Also, initially I was downloading the whole page, but in the end I restricted it to the first 30000 characters, which is enough to get the title of almost all pages. Right now I can see only two shortcomings of this code: one, when I kill the code with SIGINT (ctrl-c) it dies instantly; I can modify the code to process all the elements in the queue and then let it die. The second is one IO call per iteration in the downloadurl function (maybe I can use an async IO call, but I am not sure). I don't have much web programming experience, so I am looking for suggestions to make it more robust. top-1m.csv is a file downloaded from alexa[1]. Also, any suggestions for writing more idiomatic Python code are welcome.

-Mukesh Tiwari

[1]http://www.alexa.com/topsites.


import urllib2, os, socket, Queue, thread, signal, sys, re


class Downloader():

    def __init__( self ):
        self.q = Queue.Queue( 200 )
        self.count = 0

    def downloadurl( self ):
        #open a file in append mode and write the results ( improvement: think of writing in chunks )
        with open('titleoutput.dat', 'a+') as file:
            while True:
                try:
                    url = self.q.get()
                    data = urllib2.urlopen( url, data = None, timeout = 10 ).read( 30000 )
                    regex = re.compile('<title.*>(.*?)</title>', re.IGNORECASE)
                    #Alternative: read the data line by line and break out of the loop as soon as the title is found.
                    #title = None
                    #for r in data:
                    #    if not r:
                    #        raise StopIteration
                    #    else:
                    #        title = regex.search( r )
                    #        if title is not None: break

                    title = regex.search( data )
                    result = ', '.join( [ url, title.group(1) ] )
                    file.write( ''.join( [ result, '\n' ] ) )
                except urllib2.HTTPError as e:
                    print ''.join( [ url, ' ', str( e ) ] )
                except urllib2.URLError as e:
                    print ''.join( [ url, ' ', str( e ) ] )
                except Exception as e:
                    print ''.join( [ url, ' ', str( e ) ] )
        #The with block closes the file automatically.

    def createurl( self ):
        #check if the checkpoint file exists. If not, create one with a default value of 0 chunks read.
        if os.path.exists('bytesread.dat'):
            f = open( 'bytesread.dat', 'r' )
            self.count = int( f.readline() )
            f.close()
        else:
            f = open( 'bytesread.dat', 'w' )
            f.write('0\n')
            f.close()

        #Reading the data in chunks is fast, but we may miss some sites because of the chunking ( it's worth it because reading is very fast ).
        with open('top-1m.csv', 'r') as file:
            prefix = ''
            file.seek( self.count * 1024 )
            #we may land in the middle of a line, so discard everything up to the next newline
            if self.count: file.readline()
            for lines in iter( lambda: file.read( 1024 ), '' ):
                l = lines.split('\n')
                n = len( l )
                l[0] = ''.join( [ prefix, l[0] ] )
                #each complete line is "rank,domain"; the last element may be a partial line, so carry it over into the next chunk
                for i in xrange( n - 1 ):
                    self.q.put( ''.join( [ 'http://www.', l[i].split(',')[1] ] ) )
                prefix = l[n - 1]
                self.count += 1

    #do a graceful exit from here.
    def handleexception( self, signal, frame ):
        with open('bytesread.dat', 'w') as file:
            print ''.join( [ 'Number of chunks read ( probably unfinished ) ', str( self.count ) ] )
            file.write( ''.join( [ str( self.count ), '\n' ] ) )
        sys.exit(0)


if __name__ == '__main__':
    u = Downloader()
    signal.signal( signal.SIGINT, u.handleexception )
    thread.start_new_thread( u.createurl, () )
    for i in xrange( 5 ):
        thread.start_new_thread( u.downloadurl, () )
    while True: pass
 

MRAB

[snip]

if __name__ == '__main__':
    u = Downloader()
    signal.signal( signal.SIGINT, u.handleexception )
    thread.start_new_thread( u.createurl, () )
    for i in xrange( 5 ):
        thread.start_new_thread( u.downloadurl, () )
    while True: pass

My preferred method when working with background threads is to put a
sentinel such as None at the end and then when a worker gets an item
from the queue and sees that it's the sentinel, it puts it back in the
queue for the other workers to see, and then returns (terminates). The
main thread can then call each worker thread's .join method to wait for
it to finish. You currently have the main thread running in a 'busy
loop', consuming processing time doing nothing!
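
For example, a minimal sketch of that sentinel approach (the urls and the worker count are just placeholders; the producer puts None into the queue once it has queued every url):

import threading, Queue

SENTINEL = None

def worker(q):
    while True:
        url = q.get()
        if url is SENTINEL:
            q.put(SENTINEL)    # put it back so the other workers also see it
            return             # this worker terminates
        print 'would fetch', url    # fetching and title extraction would go here

q = Queue.Queue(200)
workers = [threading.Thread(target=worker, args=(q,)) for _ in xrange(5)]
for w in workers:
    w.start()

for url in ['http://www.example.com', 'http://www.python.org']:
    q.put(url)
q.put(SENTINEL)    # no more work

for w in workers:
    w.join()    # main thread blocks here instead of busy-looping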
 

mukesh tiwari

[snip]

You currently have the main thread running in a 'busy loop', consuming
processing time doing nothing!


Hi MRAB,
Thank you for the reply. I wrote this while loop only because there is no thread.join in the thread[1] library, but I got your point: I am simply running a while loop that does nothing. So if I can somehow block the main thread without too much computation, that would be great.

-Mukesh Tiwari

[1] http://docs.python.org/2/library/thread.html#module-thread
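
(One low-cost way to block the main thread while sticking with the thread module would be signal.pause(), which is Unix-only and simply sleeps until a signal such as SIGINT arrives; just a sketch, with a generic handler standing in for handleexception:)

import signal, sys

def handler(signum, frame):
    print 'got SIGINT, exiting'
    sys.exit(0)

signal.signal(signal.SIGINT, handler)
# ... start the createurl/downloadurl threads here ...
signal.pause()    # sleeps until a signal is delivered, instead of `while True: pass`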
 

MRAB

On 27/08/2013 20:41, mukesh tiwari wrote:
[snip]

Hi MRAB,
Thank you for the reply. I wrote this while loop only because there is
no thread.join in the thread[1] library, but I got your point: I am
simply running a while loop that does nothing. So if I can somehow block
the main thread without too much computation, that would be great.
Why don't you use the 'threading' module instead?


creator = threading.Thread(target=u.createurl)

workers = []
for i in xrange(5):
    workers.append(threading.Thread(target=u.downloadurl))

creator.start()

for w in workers:
    w.start()

creator.join()

for w in workers:
    w.join()
 

mukesh tiwari

[snip]

Why don't you use the 'threading' module instead?

Hi MRAB,
Initially I blocked the main thread using raw_input('') and it was working fine.

u = Downloader()
signal.signal( signal.SIGINT , u.handleexception)
thread.start_new_thread ( u.createurl , () )
for i in xrange ( 5 ) :
    thread.start_new_thread ( u.downloadurl , () )
#This is for blocking main
raw_input('')
When I pressed ctrl-c it responded fine, but now, after switching to the threading module, I am not able to kill my program using SIGINT (ctrl-c). Any idea how to signal SIGINT to the threads?

Here is the changed code, in which I still have to catch SIGINT.
u = Downloader()
signal.signal( signal.SIGINT , u.handleexception)
urlcreator = threading.Thread ( target = u.createurl )

workers = []
for i in xrange ( 5 ):
    workers.append ( threading.Thread( target = u.downloadurl ) )

urlcreator.start()
for w in workers:
    w.start()

urlcreator.join()
for w in workers:
    w.join()

-Mukesh Tiwari
 

MRAB

On 28/08/2013 07:23, mukesh tiwari wrote:
[snip]
When I pressed ctrl-c it responded fine, but now, after switching to the threading module, I am not able to kill my program using SIGINT (ctrl-c). Any idea how to signal SIGINT to the threads?
Try making them daemon threads. A daemon thread is one that will be
killed when the main thread terminates.
Here is the changed code, in which I still have to catch SIGINT.

u = Downloader()
signal.signal( signal.SIGINT , u.handleexception)
urlcreator = threading.Thread ( target = u.createurl )
urlcreator.daemon = True

workers = []
for i in xrange ( 5 ):
    workers.append ( threading.Thread( target = u.downloadurl ) )

urlcreator.start()
for w in workers:
    w.daemon = True
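
A fuller sketch of that daemon-thread idea (just a sketch, reusing the Downloader class from earlier in the thread; the sleep loop keeps the main thread alive so it can receive SIGINT and run handleexception, and the daemon threads die with it):

import signal, threading, time

u = Downloader()
signal.signal( signal.SIGINT, u.handleexception )

urlcreator = threading.Thread( target = u.createurl )
urlcreator.daemon = True           # killed automatically when the main thread exits

workers = []
for i in xrange( 5 ):
    w = threading.Thread( target = u.downloadurl )
    w.daemon = True
    workers.append( w )

urlcreator.start()
for w in workers:
    w.start()

# Keep the (non-daemon) main thread alive without busy-waiting; ctrl-c is
# delivered to the main thread, which runs u.handleexception and exits.
while True:
    time.sleep( 1 )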
 
