SGML parsing tags and keeping track

hapaboy2059

Hello,

I need help using sgmllib's SGMLParser to parse an HTML file and keep
track of the number of times each tag is used.

At the end of the program I need to print the number of times each
tag was seen (presumably any type of tag can occur) and the linked
text.

I need help getting past the first steps. I already have this basic
program that returns hyperlinks, but I can't work out how to handle
any tag and keep count of it so I can print it out later.

I'm very frustrated, and any help is appreciated!



--------------------------------------------------------------------------
import sgmllib, urllib

class HtmParser(sgmllib.SGMLParser):
    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."

        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []
        self.descriptions = []
        self.inside_a_element = 0

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)

    def get_hyperlinks(self):
        "Return the list of hyperlinks."

        return self.hyperlinks


parser = HtmParser()

inptAdrs = raw_input('Please input the absolute path to the url\n')
print 'you entered: ', inptAdrs

content = urllib.urlopen(inptAdrs)

bufff = content.read()
print 'Statistics for ', inptAdrs

print 'There are', len(bufff), 'characters in the web page'

parser.feed(bufff)
parser.close()  # flush any buffered data before reading the results

print parser.get_hyperlinks()
 
hapaboy2059

Could I make a global variable to keep track of each tag count?

Also, how would I make a list or dictionary of the tags that are found?
How can I handle any tag that is given?
 
Heiko Wundram

On Tuesday, 2 May 2006 at 20:38, (e-mail address removed) wrote:
> Could I make a global variable to keep track of each tag count?
>
> Also, how would I make a list or dictionary of the tags that are found?
> How can I handle any tag that is given?

The following snippet does what you want:
from sgmllib import SGMLParser

class MyParser(SGMLParser):

    def __init__(self):
        SGMLParser.__init__(self)
        self.tagcount = {}
        self.links = set()

    # Tag count handling
    # ------------------

    # Called for tags that have a start_<tag> or do_<tag> method (here: <a>).
    def handle_starttag(self,tag,method,args):
        self.tagcount[tag] = self.tagcount.get(tag,0) + 1
        method(args)

    # Called for every start tag without a dedicated handler method.
    def unknown_starttag(self,tag,args):
        self.tagcount[tag] = self.tagcount.get(tag,0) + 1

    # Argument handling
    # -----------------

    def start_a(self,args):
        self.links.update([value for name, value in args if name == "href"])

parser = MyParser()
parser.feed(file("test.html").read()) # Insert your data source here...
parser.close()

print parser.tagcount
print parser.links

See the documentation for sgmllib for more info on handle_starttag (whose
logic might just as well have been implemented in start_a, but if you want
argument handling for more tags, it's best to keep it at this one central
place) and unknown_starttag.

--- Heiko.
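
The snippet above counts tags and collects link targets, but the original question also asked for the linked text. The sketch below is one possible way to get both. Note that it is not the sgmllib approach from this thread: sgmllib was removed in Python 3, so this swaps in the standard library's html.parser.HTMLParser instead. The names TagCounter, count_tags and the sample markup are made up for illustration, and the handling of anchors is deliberately simple (nested or unclosed <a> elements are not treated specially).

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count every start tag and collect (href, link text) pairs."""

    def __init__(self):
        super().__init__()
        self.tagcount = Counter()   # tag name -> number of start tags seen
        self.links = []             # list of (href, text) tuples
        self._href = None           # href of the <a> currently open, if any
        self._text = []             # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # HTMLParser calls this for every start tag, known or not.
        self.tagcount[tag] += 1
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while an <a href=...> element is open.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def count_tags(html):
    """Parse an HTML string and return (tag counts, links)."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.tagcount, parser.links


# Hypothetical sample input; replace with the page you fetched.
sample = ('<html><body><p>See <a href="http://example.com/">the '
          '<b>example</b> site</a>.</p><p>Bye<br></p></body></html>')
counts, links = count_tags(sample)
for tag, n in sorted(counts.items()):
    print(tag, n)
print(links)
```

A Counter is used here instead of the dict.get(tag, 0) + 1 idiom only for brevity; both keep the counts on the parser instance, so no global variable is needed.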
 
