Which HTMLParser?

Tuang · Dec 18, 2003

The library docs show that there is an HTMLParser module and an
htmllib module, both of which apparently contain classes named
"HTMLParser". There is a bit of description of differences, but it
still doesn't seem clear to me what the intent is.

Which one is the best choice for parsing arbitrary real-life Web
pages? I get the feeling that maybe the HTMLParser module is the more
recent, more practical utility, while the htmllib version is the older
one, retained for backward compatibility, but I'm not sure. The docs
don't exactly say that.

Any recommendations or clarifications of what's going on would be
helpful.

Thanks.

Jarek Zgoda · Dec 19, 2003

Tuang said:
Which one is the best choice for parsing arbitrary real-life Web
pages? I get the feeling that maybe the HTMLParser module is the more
recent, more practical utility, while the htmllib version is the older
one, retained for backward compatibility, but I'm not sure. The docs
don't exactly say that.

Any recommendations or clarifications of what's going on would be
helpful.

If you are not sure that your source is valid HTML, use an SGML parser
instead. Personally I recommend F. Lundh's sgmlop -- a fast, robust and
well-written piece of software, a real Meisterstueck. It works perfectly
on Unix, Windows and IBM iSeries (formerly AS/400).

Rene Pijlman · Dec 20, 2003

Tuang:

The library docs show that there is an HTMLParser module and an
htmllib module, both of which apparently contain classes named
"HTMLParser". There is a bit of description of differences, but it
still doesn't seem clear to me what the intent is.

I think the intent is to use HTMLParser. It's newer, and its documentation
doesn't scare you off with phrases like "HTML 2.0" and "SGML".
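
For anyone reading this later, here is a minimal sketch of what using
HTMLParser looks like. It assumes Python 3, where the module is named
html.parser (in Python 2 the import is "from HTMLParser import
HTMLParser"). The LinkCollector class and collect_links function are
made-up names for illustration, not part of the library.

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased;
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(html_text):
    """Feed the whole document to the parser and return the hrefs found."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<p>See <A HREF="http://example.com/">this<p>and <a href="/docs">that'
    print(collect_links(page))
```

The parser is event-driven: you subclass it and override the handle_*
methods you care about, then feed it text.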

Which one is the best choice for parsing arbitrary real-life Web pages?

Neither! Real-life web pages are typically not HTML-parseable. Try tidying
them up a bit first. See http://groups.google.nl/groups?th=58cd394d2e71137f

John J. Lee · Dec 22, 2003

Jarek Zgoda said:
If you are not sure that your source is valid HTML, use an SGML parser
instead.

Note that htmllib is a simple subclass of sgmllib, so the results you
get from sgmllib will be the same as for htmllib as far as this
concern goes.

HTMLParser.HTMLParser copes better with XHTML, and treats optional
or missing start/end tags more simply (i.e. better) than sgmllib /
htmllib.
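
A rough sketch of what "more simply" means in practice, assuming
Python 3's html.parser (the renamed HTMLParser module): the parser
reports only the tags that actually appear in the source and does not
invent end tags for unclosed elements, while an XHTML-style empty
element goes to handle_startendtag. EventLogger and log_events are
hypothetical names used only for this example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records tag events in the order they are seen."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style empty elements such as <br />.
        self.events.append(("startend", tag))


def log_events(text):
    """Parse text and return the list of recorded tag events."""
    parser = EventLogger()
    parser.feed(text)
    parser.close()
    return parser.events


if __name__ == "__main__":
    # The first <p> is never closed; no ("end", "p") is synthesized for it.
    for event in log_events("<p>one<br /><p>two</p>"):
        print(event)
```

Keeping track of which elements are still open is left to your own
code.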

Personally I recommend F. Lundh's sgmlop -- a fast, robust and
well-written piece of software, a real Meisterstueck. It works perfectly
on Unix, Windows and IBM iSeries (formerly AS/400).

I don't think it's any more lenient, though, and it's harder to modify.

Use mxTidy or uTidylib to clean bad HTML.

John
