HTML parsing bug?

g_no_mail_please

Python 2.3.5 seems to choke when trying to parse html files, because it
doesn't realize that what's inside <!-- --> is a comment in HTML,
even if this comment is inside <script> </script>, especially if it's a
comment inside that script code too.

The html file:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title>Choke on this</title>
<script language="JavaScript">
<!--
// </ht ml> - this is a comment in JavaScript, which is itself inside an HTML comment
-->
</script>
</head>
<body>
Hey there
</body>
</html>


The Python program:

from urllib2 import urlopen
from HTMLParser import HTMLParser
f = urlopen("file:///PATH_TO_THE_ABOVE/index.html")
p = HTMLParser()
p.feed(f.read())
 
Richard Brodie

g_no_mail_please said:
Python 2.3.5 seems to choke when trying to parse html files, because it
doesn't realize that what's inside <!-- --> is a comment in HTML,
even if this comment is inside <script> </script>, especially if it's a
comment inside that script code too.

Actually, you are technically incorrect; try validating the code you posted.
Google found this explanation: http://lachy.id.au/log/2005/05/script-comments
Feeding even slightly invalid HTML to the standard library parser will often
choke it. If you can't guarantee clean sources, best use Tidy first or another
parser entirely.
 
Istvan Albert

g_no_mail_please said:
this is a comment in JavaScript, which is itself inside an HTML comment

Don't nest HTML comments. Occasionally it may break the browsers as
well.

(I remember this from one of the weirdest of bughunts: whenever the
number of characters between nested HTML comments was divisible by four
the page would render incorrectly ... or something of that sort)

i.
 
Tim Roberts

Istvan Albert said:
Don't nest HTML comments. Occasionally it may break the browsers as
well.

Did you read the post? He didn't nest HTML comments. He put a JavaScript
comment inside an HTML comment, inside a <script></script> pair. Virtually
every page with JavaScript does exactly the same thing.
 
Fredrik Lundh

g_no_mail_please said:
Python 2.3.5 seems to choke when trying to parse html files, because it
doesn't realize that what's inside <!-- --> is a comment in HTML,
even if this comment is inside <script> </script>, especially if it's a
comment inside that script code too.

nope. what's inside <!-- --> is not a comment if it's inside a <script>
or <style> tag. read the spec:

http://www.w3.org/TR/REC-html40/types.html#type-cdata

"Although the STYLE and SCRIPT elements use CDATA for their data
model, for these elements, CDATA must be handled differently by
user agents. Markup and entities must be treated as raw text and
passed to the application as is. The first occurrence of the
character sequence "</" (end-tag open delimiter) is treated as
terminating the end of the element's content. In valid documents,
this would be the end tag for the element."

in your case, the first occurrence of "</" is not the end tag.
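
To make the quoted rule concrete, here is a minimal sketch that splits the script content of the posted page at the first "</". It is plain string handling, not the parser's actual code, and the helper name split_at_first_etago is made up for illustration.

```python
# Illustration of the HTML 4 rule quoted above: inside SCRIPT or STYLE,
# the content ends at the first "</", whatever follows it.
# SCRIPT_BODY is the text after the <script ...> start tag in the posted page.
SCRIPT_BODY = (
    "\n<!--\n"
    "// </ht ml> - this is a comment in JavaScript, "
    "which is itself inside an HTML comment\n"
    "-->\n"
    "</script>\n"
)


def split_at_first_etago(content):
    """Hypothetical helper: split content at the first "</" sequence."""
    idx = content.find("</")
    if idx == -1:
        return content, ""
    return content[:idx], content[idx:]


raw_text, rest = split_at_first_etago(SCRIPT_BODY)
print(repr(raw_text))   # what counts as script content under the rule
print(repr(rest[:8]))   # what the parser then expects to be the end tag
```

Under that rule the script content stops right after "// ", and the text sitting where the end tag should be is "</ht ml>", not "</script>".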

you can disable proper parsing by setting the CDATA_CONTENT_ELEMENTS
attribute on the parser instance, before you start parsing. by default, it is
set to

CDATA_CONTENT_ELEMENTS = ("script", "style")

setting it to an empty tuple disables HTML-compliant handling for these
elements:

p = HTMLParser()
p.CDATA_CONTENT_ELEMENTS = ()
p.feed(f.read())

</F>
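
For readers on Python 3, the following is a rough sketch of the same switch. It assumes the Python 3 module name html.parser (the thread itself uses the Python 2 HTMLParser module), and the EventCollector class is a hypothetical helper written only to show which callbacks fire. It compares the default setting with an empty CDATA_CONTENT_ELEMENTS tuple on the page from the first post.

```python
from html.parser import HTMLParser  # Python 3 home of HTMLParser (assumption: Python 3)

PAGE = """<html><head><title>Choke on this</title>
<script language="JavaScript">
<!--
// </ht ml> - this is a comment in JavaScript, which is itself inside an HTML comment
-->
</script>
</head>
<body>
Hey there
</body>
</html>
"""


class EventCollector(HTMLParser):
    """Hypothetical helper: records comments and the text seen inside <script>."""

    def __init__(self, cdata_elements=None):
        HTMLParser.__init__(self)
        if cdata_elements is not None:
            # Same override as in the post above, applied to this instance only.
            self.CDATA_CONTENT_ELEMENTS = cdata_elements
        self.in_script = False
        self.script_text = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.script_text.append(data)

    def handle_comment(self, data):
        self.comments.append(data)


default = EventCollector()
default.feed(PAGE)
default.close()

relaxed = EventCollector(cdata_elements=())
relaxed.feed(PAGE)
relaxed.close()

print("default comments:", default.comments)
print("relaxed comments:", relaxed.comments)
```

With the default tuple, the "<!-- ... -->" inside the script element should arrive as plain data through handle_data. With the empty tuple, it should be reported through handle_comment instead, which is the behaviour the original poster expected.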
 
