Trying to understand html.parser.HTMLParser

Andrew Berg · May 15, 2011

I'm trying to understand why HTMLParser.feed() isn't processing the whole
page. My test script is this:

import urllib.request
import html.parser

class MyHTMLParser(html.parser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a' and attrs:
            print(tag, '-', attrs)

url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
page = urllib.request.urlopen(url).read()
parser = MyHTMLParser()
parser.feed(str(page))

I can do print(page) and get the entire HTML source, but
parser.feed(str(page)) only spits out the information for the top links
and none of the "revisionxxxx" links. Ultimately, I just want to find
the name of the first "revisionxxxx" link (right now it's
"revision1995", when a new build is uploaded it will be "revision2000"
or whatever). I figure this is a relatively simple page; once I
understand all of this, I can move on to more complicated pages.
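
One likely culprit in the script above is str(page): urlopen(...).read() returns bytes, and calling str() on bytes gives the repr text "b'...'" with escaped quotes and newlines rather than the page itself, which can confuse the parser partway through. A minimal sketch of the goal described above, decoding the bytes first and stopping at the first "revisionxxxx" link, might look like the following. The class name FirstRevisionParser, the helper first_revision, the UTF-8 encoding, the sample HTML, and the guess that the revision name appears in the link's href are all assumptions for illustration and are not taken from the real page.

```python
import re
from html.parser import HTMLParser


class FirstRevisionParser(HTMLParser):
    """Remember the first <a> whose href contains 'revision<digits>'."""

    def __init__(self):
        super().__init__()
        self.first_revision = None

    def handle_starttag(self, tag, attrs):
        # Ignore non-links, and stop looking once a match has been found.
        if tag != 'a' or self.first_revision is not None:
            return
        href = dict(attrs).get('href') or ''
        match = re.search(r'revision\d+', href)
        if match:
            self.first_revision = match.group(0)


def first_revision(html_bytes, encoding='utf-8'):
    # Decode the bytes instead of calling str() on them: str(b'...') yields
    # the repr "b'...'" with escaped characters, not the page text.
    # The encoding is an assumption; the real page may declare another one.
    text = html_bytes.decode(encoding, errors='replace')
    parser = FirstRevisionParser()
    parser.feed(text)
    parser.close()
    return parser.first_revision


# Made-up sample standing in for the downloaded page.
sample = (b'<a href="?dir=./64bit">up</a>\n'
          b'<a href="./revision1995">revision1995</a>')
print(first_revision(sample))
```

With the real page, the bytes from urllib.request.urlopen(url).read() would be passed to first_revision() in place of the sample.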

I've searched Google, but everything I find is either outdated, a
recommendation for some external module (I don't need to do anything too
fancy and most modules don't completely support Python 3 anyway) or is
just a code snippet with no real explanation. I had a book that
explained this, but I had to return it to the library (and I'll have to
get back in line to check it out again).
