hapaboy2059
Hello,
I need help using sgmllib's SGMLParser to parse an HTML file and keep track of
the number of times each tag is used.
At the end of the program I need to print the number of times each
tag was seen (presumably any type of tag can appear) and the linked
text.
I need help getting past the first steps. I already have this basic
program that returns hyperlinks, but I can't work out how to handle
any tag and keep a count of it to print out later.
I'm very frustrated, and any help is appreciated!
--------------------------------------------------------------------------
import sgmllib, urllib

class HtmParser(sgmllib.SGMLParser):
    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."
        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []
        self.descriptions = []
        self.inside_a_element = 0

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."
        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)

    def get_hyperlinks(self):
        "Return the list of hyperlinks."
        return self.hyperlinks

parser = HtmParser()
inptAdrs = raw_input('Please input the absolute path to the url\n')
print 'you entered: ', inptAdrs
content = urllib.urlopen(inptAdrs)
bufff = content.read()
print 'Statistics for ', inptAdrs
print 'There are', len(bufff), 'characters in the web page'
parser.feed(bufff)
# Close before reading results so any buffered input is processed.
parser.close()
print parser.get_hyperlinks()
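
For reference, here is a rough sketch of the tag counting and linked-text tracking described above. It is not the program from this post: it assumes Python 3, where sgmllib no longer exists, and swaps in html.parser.HTMLParser instead. The class name TagCounter, its attributes, and the sample string are all made up for illustration. With sgmllib itself, the analogous hooks should be unknown_starttag(tag, attributes) for tags that have no start_xxx method, and handle_data(data) for text.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count every start tag and collect the text inside <a> elements.

    Hypothetical helper for illustration; not part of the original program.
    """

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()  # tag name -> number of times seen
        self.links = []              # (href, linked text) pairs
        self._in_a = False           # True while inside an <a> element
        self._href = None            # href of the <a> currently open
        self._text = []              # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag, whatever its name.
        self.tag_counts[tag] += 1
        if tag == "a":
            self._in_a = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            self.links.append((self._href, "".join(self._text).strip()))
            self._in_a = False

    def handle_data(self, data):
        # Text between tags; only keep it while inside a link.
        if self._in_a:
            self._text.append(data)


# Made-up sample input, standing in for the downloaded page.
sample = ('<html><body><p>Hi <a href="http://example.com">Example '
          '<b>site</b></a></p><p>Bye</p><br></body></html>')

parser = TagCounter()
parser.feed(sample)
parser.close()

for tag, count in sorted(parser.tag_counts.items()):
    print(tag, count)
for href, text in parser.links:
    print(href, text)
```

The idea is to have one handler that fires for every tag and increments a dictionary entry, rather than writing a separate start_xxx method per tag. The linked text is gathered by switching a flag on at the opening a tag and off again at the closing one.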