Parsing HTML--looking for info/comparison of HTMLParser vs. htmllib modules.

Kenneth McDonald · Jul 7, 2006

I'm writing a program that will parse HTML and (mostly) convert it to
MediaWiki format. The two Python modules I'm aware of to do this are
HTMLParser and htmllib. However, I'm currently experiencing either real
or conceptual difficulty with both, and was wondering if I could get
some advice.

The problem I'm having with HTMLParser is simple; I don't seem to be
getting the actual text in the HTML document. I've implemented the
do_data method of HTMLParser.HTMLParser in my HTMLParser subclass, but
it never seems to receive any data. Is there another way to access the
text chunks as they come along?

HTMLParser would probably be the way to go if I can figure this out. It
seems much simpler than htmllib, and satisfies my requirements.

htmllib will write out the text data (using the AbstractFormatter and
AbstractWriter), but my problem here is conceptual. I simply don't
understand why all of these different "levels" of abstractness are
necessary, nor how to use them. As an example, the HTML <i>text</i>
should be converted to ''text'' (double single-quotes at each end) in my
MediaWiki markup output. This would obviously be easy to achieve if I
simply had an HTML parser that called a method for each start tag, text
chunk, and end tag. But htmllib calls the tag functions in HTMLParser,
and then does more things with both a formatter and a writer. To me,
both seem unnecessarily complex (though I suppose I can see the benefits
of a writer before generators gave the opportunity to simply yield
chunks of output to be processed by external code.) In any case, I don't
really have a good idea of what I should do with htmllib to get my
converted tags, and then content, and then closing converted tags,
written out.

Please feel free to point to examples, code, etc. Probably the simplest
solution would be a way to process text content in HTMLParser.HTMLParser.

Thanks,
Ken

wes weston · Jul 7, 2006

import urllib
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.TokenList = []

    def handle_data(self, data):
        # The parser calls this for every chunk of text between tags.
        data = data.strip()
        if data:
            self.TokenList.append(data)
            #print data

    def GetTokenList(self):
        return self.TokenList

try:
    url = "http://....your url here.............."
    f = urllib.urlopen(url)
    res = f.read()
    f.close()
except IOError:
    print "bad read"
    raise SystemExit(1)

h = MyHTMLParser()
h.feed(res)
tokensList = h.GetTokenList()
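
Note that HTMLParser hands text to handle_data; it does not call a do_data method, which would explain why the subclass in the question never sees any text. For the <i>text</i> to ''text'' conversion, the same approach can be extended with handle_starttag and handle_endtag. Below is a rough sketch only, written for Python 3, where the module is named html.parser rather than HTMLParser. The WikiConverter class, its MARKUP table and the html_to_wiki helper are names made up for illustration, and the tag mapping is an assumption about the MediaWiki output wanted, not a complete converter.

```python
# Python 3 sketch: the module is html.parser here (HTMLParser in Python 2).
from html.parser import HTMLParser


class WikiConverter(HTMLParser):
    """Hypothetical converter mapping a few HTML tags to MediaWiki markup."""

    # Illustrative mapping only; tags not listed are dropped, their text kept.
    MARKUP = {"i": "''", "em": "''", "b": "'''", "strong": "'''"}

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; tag names arrive lowercased.
        self.chunks.append(self.MARKUP.get(tag, ""))

    def handle_endtag(self, tag):
        # MediaWiki uses the same quote run to open and close emphasis.
        self.chunks.append(self.MARKUP.get(tag, ""))

    def handle_data(self, data):
        # Called for each chunk of text between tags.
        self.chunks.append(data)

    def result(self):
        return "".join(self.chunks)


def html_to_wiki(html):
    parser = WikiConverter()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.result()


print(html_to_wiki("<p>Some <i>text</i> and <b>more</b>.</p>"))
# prints: Some ''text'' and '''more'''.
```

Collecting the pieces in a list and joining them at the end avoids the formatter and writer layers entirely, which seems to be what the question is after.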

