Closing tags within <script></script>

Jeffrey

Hello,

I've found an oddity with HTML/Javascript that I'm hoping someone on
this list could shed some light on for me. This arose when I was using
the libxml parser to parse some HTML web pages.

The observation is that the following page does something odd:
http://www.cs.washington.edu/homes/jbigham/test/js-test.html

The source of the page is:
<html>
<head>
<script>
function alert_me() {
alert("<script>function foo() { alert("Hello!"); }</script>");
}
</script>
</head>
<body>
The body of the page.
<input type="button" onclick="alert_me();" value="Click me!">
</body>
</html>


It produces:
"); }
The body of the page. [[Button]]


According to this page, I should expect this behavior because ending
HTML tags are not allowed to appear within <script> tags:
http://www.htmlhelp.com/tools/validator/problems.html#script

But my problem is that some very popular websites seem to violate this
rule, and that apparently messes up the libxml SAX parser
(http://xmlsoft.org/). For example, Yahoo does this, as this excerpt
from their page shows:

<script language=javascript>
if(typeof(YAHOO)!='undefined') {
document.write('<map name="yodel"><area shape="rect"
coords="209,30,216,39" href="http://www.yahoo.com"
onclick="callYodel();return false;"><area shape="poly"
coords="211,0,222,1,215,26,211,25" href="http://www.yahoo.com"
onclick="callYodel();return false;"></map><div id=l_fl
style="position:absolute"></div>');
var lr0='http://us.ard.yahoo.com/SIG=12ldjm8...cSkA/Y=YAHOO/EXP=1160765162/A=3912593/R=0/*';
var lcap=0,lncap=0,ad_jsl=0,lnfv=6,ylmap=0;
var ldir="http://us.i1.yimg.com/us.yimg.com/i/mntl/ww/06q3/";
var swfl1=ldir+"yodel.swf";
var swflw=1,swflh=1;
}
....
</script>

The libxml parser treats those ending tags as errors, which causes
problems for me when I try to use it to traverse the DOM. Is Yahoo
incorrect? Is libxml incorrectly interpreting the standard? Are they
both somehow correct?

Thanks!
Jeff
 
Michael Winter

Jeffrey wrote:

[snip]
alert("<script>function foo() { alert("Hello!"); }</script>");

You cannot nest quotation marks like that. You either need to escape the
inner pair, or change one of them to single quotes (').
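
As a minimal sketch of the two fixes just mentioned (the variable names are invented for this example), both literals below denote the same string:

```javascript
// Two ways to put double quotes inside a string literal.
const escaped = "function foo() { alert(\"Hello!\"); }";     // inner pair escaped
const singleOuter = 'function foo() { alert("Hello!"); }';   // outer pair changed to '

// Both spellings produce exactly the same string value.
console.log(escaped === singleOuter); // true
```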

[snip]
According to this page, I should expect this behavior because ending
HTML tags are not allowed to appear within <script> tags:
http://www.htmlhelp.com/tools/validator/problems.html#script

Indeed, though more precisely, there can be no ETAGO (</) tokens
followed by a name start character, that is, a letter (in other words,
anything that looks like an end-tag). This is because an HTML parser will
be looking for these when trying to find the end of the element; it
doesn't need to look for an end-tag that matches the start-tag because
some end-tags are optional whilst others are forbidden.
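
The rule can be sketched roughly as follows. This is only an illustration of the HTML 4 rule as described here, not how libxml or any browser is actually implemented, and the function name is made up for the example. It shows why the test page leaves `"); }` behind as visible text:

```javascript
// Sketch: script content ends at the first "</" that is immediately
// followed by a letter. Hypothetical helper, for illustration only.
function splitAtFirstEndTag(afterScriptStartTag) {
  const match = /<\/[A-Za-z]/.exec(afterScriptStartTag);
  const end = match === null ? afterScriptStartTag.length : match.index;
  return {
    script: afterScriptStartTag.slice(0, end), // what the parser treats as script
    rest: afterScriptStartTag.slice(end),      // what goes back to the HTML parser
  };
}

// Everything that follows the first <script> start-tag in the test page's head.
const page =
  'function alert_me() {\n' +
  'alert("<script>function foo() { alert("Hello!"); }</script>");\n' +
  '}\n' +
  '</script>';

const parts = splitAtFirstEndTag(page);
// The script is cut off inside the string literal; once the end-tag at the
// start of "rest" is consumed, the remaining `");` and `}` are ordinary text.
console.log(parts.rest);
```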

The document you cite tells you how to avoid the problem: escape the
slash with a backslash to break apart the ETAGO (<\/).
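
A short sketch of why the escape is harmless to the script itself (variable names are invented for this example): inside a JavaScript string literal, `\/` is simply `/`, so the string the script sees is unchanged while the HTML source no longer contains `</`.

```javascript
// "\/" in a string literal is the same character as "/".
const escapedLiteral = "<\/script>";
const builtAtRuntime = "<" + "/script>";

console.log(escapedLiteral === builtAtRuntime); // true
console.log(escapedLiteral.length);             // 9
```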

But my problem is that some very popular websites seem to violate
this ...

Popular websites often do stupid things, but the sheer weight of their
popularity often means that they can get away with it. It's not
something to emulate though, if anyone was thinking along those lines.

[snip]

Mike
 
Benjamin Niemann

Hello,
I've found an oddity with HTML/Javascript that I'm hoping someone on
this list could shed some light on for me. This arose when I was using
the libxml parser to parse some HTML web pages.

libxml is correct (too correct for such a usage); these and other websites
are not.

As you can obviously not fix documents that are not your own and far too
many documents on the web are malformed, invalid or simply a heap of s**t,
it is not a wise decision to use a strict parser like libxml.
There are special parsers built to deal with such 'tag-soup' documents,
e.g. 'Beautiful Soup' for Python
<http://www.crummy.com/software/BeautifulSoup/>.
There may be similar packages for the language of your choice (if it does
not happen to be Python).

HTH
 
cwdjrxyz

<script language=javascript>

The above script tag will produce a validation error at the W3C
validator, at least for HTML 4.01 and above. The correct tag is <script
type="text/javascript">. The language attribute is no longer required, and
may give an error in stricter versions of XHTML if used in addition to
type. The type attribute is now required. However, most browsers will
still work with the script tag as you wrote it.

if(typeof(YAHOO)!='undefined') {
document.write('<map name="yodel"><area shape="rect"
coords="209,30,216,39" href="http://www.yahoo.com"
onclick="callYodel();return false;"><area shape="poly"
coords="211,0,222,1,215,26,211,25" href="http://www.yahoo.com"
onclick="callYodel();return false;"></map><div id=l_fl
style="position:absolute"></div>');

The closing div tag in the line above is in a document.write within a
script and thus must be backslashed as <\/div>. This applies to all
types of closing tags in a document.write. A page will often work if
this is not done, but the W3C validator correctly reports a missing
backslash as a validation error. The reasons are rather complicated.
Check the FAQ/help tab at the W3C validator and go to the section on
javascript for links that will tell you more. The closing map tag in
the above script fragment also needs to be backslashed, because the map
tag is also in the document.write.
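
As a rough sketch of that rule, a blind text replacement like the one below rewrites every `</` in a piece of script source as `<\/`. The helper name is hypothetical, and a plain replace like this has not been checked against every construct that can appear in real scripts, so treat it as an illustration only:

```javascript
// Hypothetical helper: backslash every "</" in inline script source.
function escapeEndTags(scriptSource) {
  return scriptSource.replace(/<\//g, "<\\/");
}

// Shortened version of the document.write call quoted above.
const original =
  "document.write('<div id=l_fl style=\"position:absolute\"></div>');";
const safe = escapeEndTags(original);

console.log(safe);
```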

var lr0='http://us.ard.yahoo.com/SIG=12ldjm8...cSkA/Y=YAHOO/EXP=1160765162/A=3912593/R=0/*';
var lcap=0,lncap=0,ad_jsl=0,lnfv=6,ylmap=0;
var ldir="http://us.i1.yimg.com/us.yimg.com/i/mntl/ww/06q3/";
var swfl1=ldir+"yodel.swf";
var swflw=1,swflh=1;
}
...
</script>

I did not examine your code in great detail and could have missed some
other problem. If you still have problems, please post details. If you
do not get what you need from this thread after a reasonable time, you
might consider posting in the Usenet group comp.lanf.javascript.
 
cwdjrxyz

cwdjrxyz said:
If you do not get what you need from this thread after a reasonable time, you
might consider posting in the Usenet group comp.lanf.javascript.

Typo: should be comp.lang.javascript .
 
Jeffrey

libxml is correct (too correct for such a usage); these and other websites
are not.

As you can obviously not fix documents that are not your own and far too
many documents on the web are malformed, invalid or simply a heap of s**t,
it is not a wise decision to use a strict parser like libxml.
There are special parsers built to deal with such 'tag-soup' documents,
e.g. 'Beautiful Soup' for Python
<http://www.crummy.com/software/BeautifulSoup/>.
There may be similar packages for the language of your choice (if it does
not happen to be Python).

What you describe is exactly what I want. Do you (or does anyone) know
of such a parser that will work in plain old C? A search doesn't bring
up more than a few comments like, "hey, there should be a C Tag-Soup
library" and my application requires C. Is "tag-soup" the name that I
should look under for this?

Thanks!
Jeff
 
Benjamin Niemann

Jeffrey said:
What you describe is exactly what I want. Do you (or does anyone) know
of such a parser that will work in plain old C? A search doesn't bring
up more than a few comments like, "hey, there should be a C Tag-Soup
library" and my application requires C. Is "tag-soup" the name that I
should look under for this?

HTML Tidy <http://tidy.sourceforge.net/> (better known as a stand-alone
program which reads 'tag-soup' and outputs a cleaned up version) seems to
be written in C and the functionality might be available through TidyLib
('seems' and 'might', because this is just the result of a few seconds on
its website).
You'll probably have to pass the documents through TidyLib to transform
them to (at least) well-formed XML, which you can then parse with libxml.
 
mbstevens

Benjamin said:
Jeffrey wrote:
HTML Tidy <http://tidy.sourceforge.net/> (better known as a stand-alone
program which reads 'tag-soup' and outputs a cleaned up version) seems to
be written in C

It is in Perl.

and the functionality might be available through TidyLib

That has a public interface in C.
Here is the source forge page:
http://tidy.sourceforge.net/libintro.html


('seems' and 'might', because this is just the result of a few seconds on
its website).
You'll probably have to pass the documents through TidyLib to transform
them to (at least) well-formed XML, which you can then parse with libxml.

...or just call HTML Tidy from a shell
script, which then processes things further.
 
mbstevens

mbstevens said:
It is in Perl.

...I just checked myself. I remembered
(hopefully correctly) that it used to be in Perl,
but it appears to have been ported to C.
 
Benjamin Niemann

mbstevens said:
...I just checked myself. I remembered
(hopefully correctly) that it used to be in Perl,
but it appears to have been ported to C.

I also thought it was Perl. But on a quick glance at the CVS repository I
only found a bunch of .c files. Whether porting some software from X to C
is a wise decision nowadays is rather questionable, though... We're not
talking about real-time raytracing here...
 
mbstevens

Benjamin said:
I also thought it was Perl. But on a quick glance at the CVS repository I
only found a bunch of .c files.

I may have had it confused with the w3c
validator, which is in Perl.
Whether porting some software from X to C is a wise decision nowadays
is rather questionable, though... We're not talking
about real-time raytracing here...


The mind boggles at the extra effort it must
have taken to do it in C instead of using
Perl, Python, or Ruby. Still, C made my
living for many years before those were
around, and all three were written in it.
 
