G
g_no_mail_please
The HTMLParser module in Python 2.3.5 seems to choke when parsing HTML
files, because it doesn't realize that what's inside <!-- --> is an HTML
comment, even when that comment sits inside <script> </script>, and
especially when its content is also a comment in the script code itself.
The HTML file:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title>Choke on this</title>
<script language="JavaScript">
<!--
// </ht ml> - this is a comment in JavaScript, which is itself inside an HTML comment
-->
</script>
</head>
<body>
Hey there
</body>
</html>
The Python program:
from urllib2 import urlopen
from HTMLParser import HTMLParser
f = urlopen("file:///PATH_TO_THE_ABOVE/index.html")
p = HTMLParser()
p.feed(f.read())
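To see what the parser actually makes of the page, it may help to record the events it reports instead of feeding a bare HTMLParser. The following is only a rough sketch, not a fix: it assumes a current Python 3, where the module is named html.parser, and it inlines the page as a string instead of reading it from a file URL. The names PAGE, EventRecorder and record_events are made up for this example.

```python
from html.parser import HTMLParser

# The test page from above, inlined so the example needs no file on disk.
PAGE = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html><head><title>Choke on this</title>
<script language="JavaScript">
<!--
// </ht ml> - this is a comment in JavaScript, which is itself inside an HTML comment
-->
</script>
</head>
<body>
Hey there
</body>
</html>
"""


class EventRecorder(HTMLParser):
    """Hypothetical helper: collects the tag and comment events the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_comment(self, data):
        self.events.append(("comment", data))


def record_events(html_text):
    """Feed html_text to the parser; a parse failure is recorded as an ("error", message) event."""
    parser = EventRecorder()
    try:
        parser.feed(html_text)
        parser.close()
    except Exception as exc:  # broad on purpose: we only want to see whether it chokes
        parser.events.append(("error", str(exc)))
    return parser.events


if __name__ == "__main__":
    for event in record_events(PAGE):
        print(event)
```

If the parser treats the content of <script> as raw text, the listing should show the script start and end tags with nothing reported in between. An ("error", ...) entry, or an end tag reported from inside the script, would point at the behaviour described above.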