SGML parsing tags and keeping track

hapaboy2059 · May 2, 2006

Hello,

I need help using sgmllib's SGMLParser to parse an HTML file and keep
track of the number of times each tag is used.

At the end of the program I need to print the number of times each
tag was seen (presumably any type of tag can appear) and the linked
text.

I need help getting past the first steps. I already have this basic
program that returns hyperlinks, but I can't work out how to handle
any tag and keep a count of it so I can print it out later.

I'm very frustrated, and any help is appreciated!

--------------------------------------------------------------------------
import sgmllib, urllib

class HtmParser(sgmllib.SGMLParser):
    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."

        sgmllib.SGMLParser.__init__(self, verbose)
        self.hyperlinks = []
        self.descriptions = []
        self.inside_a_element = 0

    def start_a(self, attributes):
        "Process a hyperlink and its 'attributes'."

        for name, value in attributes:
            if name == "href":
                self.hyperlinks.append(value)

    def get_hyperlinks(self):
        "Return the list of hyperlinks."

        return self.hyperlinks

parser = HtmParser()

inptAdrs = raw_input('Please input the absolute path to the url\n')
print 'you entered: ', inptAdrs

content = urllib.urlopen(inptAdrs)

bufff = content.read()
print 'Statistics for ', inptAdrs

print 'There are', len(bufff), 'characters in the web page'

parser.feed(bufff)

print parser.get_hyperlinks()
parser.close()

hapaboy2059 · May 2, 2006

Could I make a global variable and keep track of each tag count?

Also, how would I make a list or dictionary of the tags that are found?
How can I handle any tag that is given?

Heiko Wundram · May 2, 2006

On Tuesday, 2 May 2006 at 20:38, (e-mail address removed) wrote:

> Could I make a global variable and keep track of each tag count?
>
> Also, how would I make a list or dictionary of the tags that are found?
> How can I handle any tag that is given?

The following snippet does what you want:
from sgmllib import SGMLParser

class MyParser(SGMLParser):

    def __init__(self):
        SGMLParser.__init__(self)
        self.tagcount = {}
        self.links = set()

    # Tag count handling
    # ------------------

    def handle_starttag(self,tag,method,args):
        self.tagcount[tag] = self.tagcount.get(tag,0) + 1
        method(args)

    def unknown_starttag(self,tag,args):
        self.tagcount[tag] = self.tagcount.get(tag,0) + 1

    # Argument handling
    # -----------------

    def start_a(self,args):
        self.links.update([value for name, value in args if name == "href"])

parser = MyParser()
parser.feed(file("test.html").read()) # Insert your data source here...
parser.close()

print parser.tagcount
print parser.links

See the documentation for sgmllib for more info on handle_starttag (whose
logic might just as well have been implemented in start_a, but if you want
argument handling for more tags, it's best to keep it at this one central
place) and unknown_starttag.
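
For readers on Python 3, where sgmllib is no longer in the standard library, the same counting idea can be sketched with html.parser.HTMLParser instead. This is a swapped-in parser, not the sgmllib API used above: HTMLParser calls handle_starttag(tag, attrs) for every start tag, so a single method covers both the known and unknown cases. The names TagCounter and count_tags are made up for this illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count every start tag and collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.tagcount = Counter()   # tag name -> number of times seen
        self.links = set()          # unique href values

    def handle_starttag(self, tag, attrs):
        # HTMLParser routes every start tag here, whatever its name,
        # and passes the tag name already lower-cased.
        self.tagcount[tag] += 1
        if tag == "a":
            # attrs is a list of (name, value) pairs; value may be None.
            self.links.update(
                value for name, value in attrs if name == "href" and value
            )


def count_tags(html_text):
    """Hypothetical helper: parse a string, return (tag counts, links)."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.tagcount, parser.links


if __name__ == "__main__":
    sample = '<p>One</p><p><a href="x.html">Two</a><br></p>'
    counts, links = count_tags(sample)
    for tag, seen in sorted(counts.items()):
        print(tag, seen)
    print(sorted(links))
```

Keeping the counts on the parser instance rather than in a global variable means each parser object carries its own tally, so several pages can be parsed independently.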

--- Heiko.
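
The original question also asks for the linked text, which the snippet above does not collect. A minimal sketch of one way to do that follows, again assuming Python 3's html.parser rather than sgmllib: a flag plays the role of the inside_a_element attribute from the first post, and text seen while the flag is set is gathered until the closing tag. LinkTextParser and extract_links are hypothetical names, and nested or unclosed <a> elements are not handled.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect (href, text) pairs for each <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []         # finished (href, text) pairs
        self._inside_a = False  # True while between <a> and </a>
        self._href = None       # href of the <a> currently open
        self._text = []         # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside_a = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Text nested in other tags (e.g. <b>) inside the link also lands here.
        if self._inside_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside_a:
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._inside_a = False


def extract_links(html_text):
    """Hypothetical helper: return a list of (href, link text) pairs."""
    parser = LinkTextParser()
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<p><a href="a.html">First <b>link</b></a> and <a href="b.html">second</a></p>'
    for href, text in extract_links(page):
        print(href, "->", text)
```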

