G
golu
The following function retrieves pages from the web and saves them in
a specified directory. I want to derive each filename from its URL —
e.g. the page code.google.com should be saved as code-google.htm or
something similar. Could you suggest a way to do this?
def retrieve_url(self,url):
""" The main method of the robot class and is called
run method to retrieve the given urls from the web."""
if url is not None:
try:
if visited.has_key(url): return
pieces=urlparse.urlparse(url)
filepath=pieces[2]
if filepath != '':
filepath=filepath[1:]
filename=filepath.split("/")[-1]
else:
filename='home.htm'
path=os.path.join(PAGE_DIR,filename)
url=urlparse.urlunparse(pieces)
p=url.rfind('#') #temporary
if p!=-1:
url=url[]
visited=1
m=urllib2.urlopen(url)
fopen=open(path,'wb')
fopen.seek(0)
fopen.write(url+'|')
fopen.write(m.read())
fopen.close()
print url ,'retrieved'
except IOError:
print url
print "ERROR:OOPS! THE URL CAN'T BE RETRIEVED"
return
a specified directory. I want to derive each filename from its URL —
e.g. the page code.google.com should be saved as code-google.htm or
something similar. Could you suggest a way to do this?
def retrieve_url(self,url):
""" The main method of the robot class and is called
run method to retrieve the given urls from the web."""
if url is not None:
try:
if visited.has_key(url): return
pieces=urlparse.urlparse(url)
filepath=pieces[2]
if filepath != '':
filepath=filepath[1:]
filename=filepath.split("/")[-1]
else:
filename='home.htm'
path=os.path.join(PAGE_DIR,filename)
url=urlparse.urlunparse(pieces)
p=url.rfind('#') #temporary
if p!=-1:
url=url[]
visited=1
m=urllib2.urlopen(url)
fopen=open(path,'wb')
fopen.seek(0)
fopen.write(url+'|')
fopen.write(m.read())
fopen.close()
print url ,'retrieved'
except IOError:
print url
print "ERROR:OOPS! THE URL CAN'T BE RETRIEVED"
return