Johnny Lee
Hi,
I am using urllib2 to grab urls from the web. Here is the workflow of
my program:
1. Get the base url and the maximum number of urls from the user
2. Call the filter to validate the base url
3. Read the source of the base url and grab all the urls from the "href"
attribute of the "a" tags
4. Call the filter to validate every grabbed url
5. Repeat steps 3-4 until the number of grabbed urls reaches the limit
(a rough sketch of this loop follows)
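Roughly, that loop looks like this (only a simplified sketch, not my exact
code; extractHrefs and crawl here are just stand-ins for the real parsing
and driver code):
--------------------------------------------------
import re
import urllib2

def extractHrefs(html):
    # stand-in for the real parsing of href attributes in <a> tags
    return re.findall(r'<a[^>]+href="([^"]+)"', html, re.IGNORECASE)

def crawl(baseUrl, maxUrls, urlFilter):
    grabbed = []
    queue = [baseUrl]
    while queue and len(grabbed) < maxUrls:
        url = queue.pop(0)
        # steps 2 and 4: validate the url with the filter
        if not urlFilter.filteredByConnection(url):
            continue
        grabbed.append(url)
        # step 3: read the page source and collect the hrefs
        page = urllib2.urlopen(url)
        try:
            queue.extend(extractHrefs(page.read()))
        finally:
            page.close()
    return grabbed
--------------------------------------------------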
In the filter there is a method like this:
--------------------------------------------------
# check whether the url can be connected to
def filteredByConnection(self, url):
    assert url
    try:
        webPage = urllib2.urlopen(url)
    except urllib2.URLError:
        self.logGenerator.log("Error: " + url + " <urlopen error timed out>")
        return False
    except urllib2.HTTPError:
        self.logGenerator.log("Error: " + url + " not found")
        return False
    self.logGenerator.log("Connecting " + url + " succeeded")
    webPage.close()
    return True
----------------------------------------------------
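Each grabbed url goes through this check before it is kept, roughly like
this (again only a simplified illustration; pageSource and grabbedUrls are
made-up names):
--------------------------------------------------
# simplified illustration of step 4: keep only the urls that pass the filter
for url in extractHrefs(pageSource):
    if self.filteredByConnection(url):
        grabbedUrls.append(url)
--------------------------------------------------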
But every time, after the program has gone through about 70 to 75 urls
(that is, 70-75 urls have been tested this way), it breaks down and every
remaining url raises urllib2.URLError until the program exits. I have
tried many ways to work around it, such as switching to urllib and putting
a sleep(1) in the filter (I thought the sheer number of urls was crashing
the program), but none of them works. By the way, if I restart with the
url at which the program crashed as the base url, it still crashes at
around the 70th-75th url. How can I solve this problem? Thanks for your
help.
Regards,
Johnny