HTML parsing bug?

g_no_mail_please · Jan 30, 2006

Python 2.3.5 seems to choke when trying to parse html files, because it
doesn't realize that what's inside <!-- --> is a comment in HTML,
even if this comment is inside <script> </script>, especially if it's a
comment inside that script code too.

The html file:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title>Choke on this</title>
<script language="JavaScript">

</script>
</head>
<body>
Hey there
</body>
</html>

The Python program:

from urllib2 import urlopen
from HTMLParser import HTMLParser
f = urlopen("file:///PATH_TO_THE_ABOVE/index.html")
p = HTMLParser()
p.feed(f.read())

G. · Jan 30, 2006

g_no_mail_please said:
// this is a comment in JavaScript, which is itself inside
an HTML comment

This is supposed to be one line. Got wrapped during posting.

Richard Brodie · Jan 30, 2006

g_no_mail_please said:
Python 2.3.5 seems to choke when trying to parse html files, because it
doesn't realize that what's inside <!-- --> is a comment in HTML,
even if this comment is inside <script> </script>, especially if it's a
comment inside that script code too.

Actually, you are technically incorrect; try validating the code you posted.
Google found this explanation: http://lachy.id.au/log/2005/05/script-comments
Feeding even slightly invalid HTML to the standard library parser will often
choke it. If you can't guarantee clean sources, best use Tidy first or another
parser entirely.

Istvan Albert · Jan 30, 2006

g_no_mail_please said:
this is a comment in JavaScript, which is itself inside an HTML comment

Don't nest HTML comments. Occasionally it may break the browsers as
well.

(I remember this from one of the weirdest of bughunts: whenever the
number of characters between nested HTML comments was divisible by four
the page would render incorrectly ... or something of that sort)

i.

Tim Roberts · Feb 1, 2006

Istvan Albert said:
Don't nest HTML comments. Occasionally it may break the browsers as
well.

Did you read the post? He didn't nest HTML comments. He put a Javascript
comment inside an HTML comment, inside a <script></script> pair. Virtually
every page with Javascript does exactly the same thing.

Fredrik Lundh · Feb 1, 2006

g_no_mail_please said:
Python 2.3.5 seems to choke when trying to parse html files, because it
doesn't realize that what's inside <!-- --> is a comment in HTML,
even if this comment is inside <script> </script>, especially if it's a
comment inside that script code too.

nope. what's inside <!-- --> is not a comment if it's inside a <script>
or <style> tag. read the spec:

http://www.w3.org/TR/REC-html40/types.html#type-cdata

"Although the STYLE and SCRIPT elements use CDATA for their data
model, for these elements, CDATA must be handled differently by
user agents. Markup and entities must be treated as raw text and
passed to the application as is. The first occurrence of the
character sequence "</" (end-tag open delimiter) is treated as
terminating the end of the element's content. In valid documents,
this would be the end tag for the element."

in your case, the first occurrence of "</" is not the end tag.
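
To make the quoted rule concrete, here is a minimal sketch of it in Python. The function name html4_script_content and the sample script body are made up for illustration (the script body from the original post is not reproduced here), and this is not how HTMLParser is implemented; it only shows where HTML 4 says the element content stops.

```python
def html4_script_content(text):
    """Return the part of `text` that counts as element content under
    the HTML 4 rule quoted above: everything before the first "</"."""
    end = text.find("</")
    return text if end == -1 else text[:end]


# Hypothetical script body; the "</b" inside the string literal is the
# first "</", so a strict HTML 4 parser ends the content right there.
body = 'document.write("<b>hi</b>");'
content = html4_script_content(body)
print(repr(content))
```

Under this rule the content is cut off in the middle of the JavaScript string, which is why a "</" anywhere in a script, even inside a comment, can confuse a strict parser.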

you can disable proper parsing by setting the CDATA_CONTENT_ELEMENTS
attribute on the parser instance, before you start parsing. by default, it is
set to

CDATA_CONTENT_ELEMENTS = ("script", "style")

setting it to an empty tuple disables HTML-compliant handling for these
elements:

p = HTMLParser()
p.CDATA_CONTENT_ELEMENTS = ()
p.feed(f.read())

</F>
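
For readers on Python 3, the same attribute exists on html.parser.HTMLParser, the successor of the HTMLParser module used in this thread. The sketch below is an illustration under that assumption, not a statement about Python 2.3.5 behaviour; the CommentCollector class, the collect helper, and the sample markup are hypothetical names invented for the example.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Record HTML comments and the raw text found inside <script>."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.script_chunks = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_data(self, data):
        # Script text may arrive in several pieces, so collect them all.
        if self._in_script:
            self.script_chunks.append(data)


def collect(html, cdata_elements=None):
    """Parse `html`; optionally override CDATA_CONTENT_ELEMENTS first."""
    parser = CommentCollector()
    if cdata_elements is not None:
        parser.CDATA_CONTENT_ELEMENTS = cdata_elements
    parser.feed(html)
    parser.close()
    return parser.comments, "".join(parser.script_chunks)


# Hypothetical sample: one "comment" inside <script>, one real comment after it.
SAMPLE = "<script><!-- // hidden --></script><!-- real -->"

default_comments, default_script = collect(SAMPLE)
disabled_comments, disabled_script = collect(SAMPLE, cdata_elements=())

print(default_comments, repr(default_script))
print(disabled_comments, repr(disabled_script))
```

With the default setting the text inside <script> is handed to handle_data as raw text and only the comment after the script is reported as a comment. With the empty tuple the parser treats the script body as ordinary markup, so the <!-- --> inside it is reported as a comment too.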

Istvan Albert · Feb 2, 2006

g_no_mail_please said:
this is a comment in JavaScript, which is itself inside an HTML comment

Tim Roberts said:
Did you read the post?

misread it rather ...
