Which HTMLParser?

Tuang

The library docs show that there is an HTMLParser module and an
htmllib module, both of which apparently contain classes named
"HTMLParser". There is a bit of description of the differences, but it
still doesn't seem clear to me what the intent is.

Which one is the best choice for parsing arbitrary real-life Web
pages? I get the feeling that maybe the HTMLParser module is the more
recent, more practical utility, while the htmllib version is the older
one, retained for backward compatibility, but I'm not sure. The docs
don't exactly say that.

Any recommendations or clarifications of what's going on would be
helpful.

Thanks.
 
Jarek Zgoda

Tuang said:
Which one is the best choice for parsing arbitrary real-life Web
pages? I get the feeling that maybe the HTMLParser module is the more
recent, more practical utility, while the htmllib version is the older
one, retained for backward compatibility, but I'm not sure. The docs
don't exactly say that.

Any recommendations or clarifications of what's going on would be
helpful.

If you are not sure that your source is valid HTML, use an SGML parser
instead. Personally I recommend F. Lundh's sgmlop -- a fast, robust and
well-written piece of software, a real Meisterstueck. Works perfectly on
Unix, Windows and IBM iSeries (formerly AS/400).
 
Rene Pijlman

Tuang:
The library docs show that there is an HTMLParser module and an
htmllib module, both of which apparently contain classes named
"HTMLParser". There is a bit of description of the differences, but it
still doesn't seem clear to me what the intent is.

I think the intent is to use HTMLParser. It's newer, and its documentation
doesn't scare you off with phrases like "HTML 2.0" and "SGML" :)

Which one is the best choice for parsing arbitrary real-life Web pages?

Neither! Real-life web pages are typically not HTML-parseable. Try tidying
them up a bit first. See http://groups.google.nl/groups?th=58cd394d2e71137f
 
John J. Lee

Jarek Zgoda said:
If you are not sure that your source is valid HTML, use an SGML parser
instead.

Note that htmllib is a simple subclass of sgmllib, so the results you
get from sgmllib will be the same as for htmllib as far as this
concern goes.

HTMLParser.HTMLParser can cope better with XHTML, and treats optional
or missing start/end tags more simply (i.e. better) than sgmllib /
htmllib.
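
To make the "more simply" point concrete, here is a minimal sketch, not
anything from the library docs. It assumes a current Python, where the
HTMLParser module is importable as html.parser (with a fallback to the old
module name used in this thread). The names EventLogger and log_events are
made up for illustration. The parser just reports the tags it sees, in order,
and does not try to infer the end tags that are missing.

```python
try:
    from html.parser import HTMLParser  # Python 3 module name
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 module name, as in this thread


class EventLogger(HTMLParser):
    """Hypothetical helper: record tag events in order, without repairing markup."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def log_events(markup):
    """Feed a string of markup to the parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


if __name__ == "__main__":
    # Two <p> tags that are never closed, plus an XHTML-style empty tag.
    for event in log_events('<p>one<p>two <a href="x.html">link</a><br />'):
        print(event)
```

Neither unclosed <p> produces an "end" event, so any decision about where a
paragraph stops is left to your own handler code. The XHTML-style <br /> goes
through handle_startendtag, which by default calls the start and end handlers
in turn.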

Personally I recommend F. Lundh's sgmlop -- a fast, robust and
well-written piece of software, a real Meisterstueck. Works perfectly on
Unix, Windows and IBM iSeries (formerly AS/400).

I don't think it's any more lenient, though. And harder to modify.

Use mxTidy or uTidylib to clean bad HTML.


John
 
