Xah Lee
Recently I wrote a blog essay about HTML correctness and HTML
validators, in relation to the programming language communities. I
hope programming language fans will give more consideration to the
correctness of the docs they produce.
HTML Correctness and Validators
• http://xahlee.org/js/html_correctness.html
plain text version follows.
---------------------------
HTML Correctness and Validators
Xah Lee, 2008-12-28
Some notes about HTML correctness and HTML validators.
Condition Of Website Correctness
My website “xahlee.org” has close to 4000 HTML files. All are valid
HTML files. “Valid” here means passing the W3C's validator at
http://validator.w3.org/. Being a programming and correctness nerd,
correct HTML is important to me. (Correct markup has important,
practical benefits, such as machine parsing and transformation, as
picked up by the XML movement. Ultimately, it is a foundation of the
semantic web↗.)
In programming language communities, the programmer tech geekers are
fanatical about their favorite language's superiority, and in the case
of functional languages, they are often proud of their correctness
features. However, look at their official docs or websites: they are
ALL invalid HTML, with errors in just about every 10 lines of HTML
source code. It is fucking ridiculous.
In the web development geeker communities, you can see how they are
tight-assed about the correct use of HTML/CSS, etc., with frequent and
heated debates about the propriety of semantic markup, and they don't
hesitate to ridicule the Microsoft Internet Explorer browser or the
average HTML content producer. However, look at the HTML they produce:
almost none of it is valid either.
Bad HTML also appears in the vast majority of docs produced by
standards organizations, such as the Unicode Consortium↗ and the
IETF↗. For example, if you run the W3C's validator on the IETF's home
page, there are 32 errors, including “no doctype found”, and if you
validate Unicode's http://www.unicode.org/faq/utf_bom.html, there are
2 errors. (A few years ago, they were much worse. I don't think even
“w3.org”'s pages were valid back then.)
Around 2006, I spent a few hours researching which major websites
produce valid HTML. To this date, I know of only one major site that
produces valid HTML, and that is Wikipedia. This is fantastic.
Wikipedia is produced by the MediaWiki↗ engine, written in PHP. Many
other wiki sites also run MediaWiki, so they undoubtedly are valid as
well. As far as I know, a few other wiki or forum software packages
also produce valid HTML, though they are the exception rather than the
norm. (I did check 7 random pages from “w3.org”; it looks like they
are all valid today.)
Personal Need For Validator
My personal need is typically to validate hundreds of files on my
local drive. Every month or so, I do a systematic regex find-replace
operation on a dir. This often results in over a hundred changed
files. Every now and then, I improve my CSS or HTML markup semantics
site-wide, so the find-replace is on all 4000 files. Usually the
find-replace is carefully crafted with attention to correctness, or
done in emacs interactively, so the chance of a regex screwup is
minimal, but still I wish to validate in batch after the operation.
Batch validation is useful because, if I screw up my regex, the result
is usually badly formed HTML, so HTML validation can catch the
mistake.
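For example, a site-wide find-replace job of mine looks roughly like
the following sketch (the dir path and the sample pattern here are
made up for illustration; the real jobs vary):

# perl
# sketch: apply a regex find-replace to every html file under a dir,
# printing each changed file so it can be batch validated afterwards.
use strict;
use File::Find;

my $dirPath = q(/Users/xah/web/emacs);   # hypothetical dir

sub wanted {
    return unless $_ =~ m{\.html$} && not -d $File::Find::name;
    open my $in, '<', $_ or die qq(open: $!);
    my $text = do { local $/; <$in> };   # slurp whole file
    close $in;
    # sample pattern: change "b" tags to "strong". not a real job.
    my $changed = ($text =~ s{<b>(.*?)</b>}{<strong>$1</strong>}gs);
    if ($changed) {
        open my $out, '>', $_ or die qq(write: $!);
        print $out $text;
        close $out;
        print qq(Changed: $File::Find::name\n);
    }
}

find(\&wanted, $dirPath);

After a run like this, I batch validate the changed files.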
In 2008, I converted most of my sites from HTML 4 Transitional to
HTML 4 Strict. The process is quite a manual pain, even though the
files I start with are valid.
Here are some examples. In html4strict:
* “‹br›” must be inside block-level tags. An image tag “‹img ...›”
needs to be enclosed in a block-level tag such as “‹div›”. Content
inside blockquote must be wrapped in a block-level tag. For example,
“‹blockquote›Time Flies‹/blockquote›” would be invalid in html4strict;
you must have “‹blockquote›‹p›Time Flies‹/p›‹/blockquote›”.
Let's look at the image tag example. You might think it is trivial to
transform, because you can simply use a regex to wrap a “‹div›” around
the image tags. However, it's not that simple, because, for example, I
often have this form:
‹img src="pretty.jpg" alt="pretty girl" width="565" height="809"›
‹p›above: A pretty girl.‹/p›
The “p” tag immediately below an “img” tag functions as the image's
caption. I have CSS set up so that this caption has no gap to the
image above it, like this:
img + p {margin-top:0px;width:100%} /* img caption */
I have the “width:100%” because normally “p” has “width:80ex”, for a
normal paragraph.
Now, if I simply wrap a “div” tag around all my “img” tags, I will end
up with this form:
‹div›‹img src="pretty.jpg" alt="pretty girl" width="565" height="809"›‹/div›
‹p›above: A pretty girl.‹/p›
Now this screws up my caption CSS, and there is no CSS selector that
can match a “p” that comes after a “div › img”.
Also, sometimes I have a sequence of images. Wrapping each with a
“div” would introduce gaps between them.
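To make this concrete, the naive transform would be a blind in-place
wrap, something like this one-liner sketch (shown with real angle
brackets, as in the actual files on disk); exactly this blind wrapping
is what breaks the caption CSS above:

# perl one-liner sketch; -i edits the files in place
perl -i -pe 's{(<img [^>]*>)}{<div>$1</div>}g' *.html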
This is just a simplified example. In short, converting from
html4transitional to html4strict while hoping to retain appearance or
markup semantics in practical ways is pretty much a manual pain. (The
ultimate reason is that html4transitional is far from being a good
semantic markup; html4strict is a bit better.)
Validators
In my work I need a batch validator. What I want is a command line
utility that can batch validate all files in a dir. Here are some
solutions related to HTML validation.
* The standard validator service by the W3C: http://validator.w3.org/
(see also: W3C Markup Validation Service↗). The problem with this is
that it can't validate local files, and can't do batch jobs. Using it
to validate 4000 files through the network (with the help of a perl
script, as sketched after this list) would not be acceptable, since
each job means massive web traffic. (My site is near 754 Mebibytes↗.)
* Firefox has an “HTML Validator” add-on by Marc Gueury, at
https://addons.mozilla.org/en-US/firefox/addon/249. This is based on
the same code as the W3C validator, works on local files, and is
extremely fast. When browsing any page, it shows a green check mark in
the window corner when the file is valid.
* Firefox has a “Web Developer” add-on by Chris Pederick, at
https://addons.mozilla.org/en-US/firefox/addon/60. Since Firefox 3, it
has an icon that indicates whether a page's CSS and JavaScript are
invalid (have errors), and also indicates whether the file is using
Quirks mode↗.
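For completeness, here is how one might script the W3C service over
the network, using its X-W3C-Validator-Status response header (a
sketch; the file path is hypothetical, and as noted above, the traffic
makes this impractical for thousands of files):

# perl
# sketch: upload one local html file to validator.w3.org and read the
# X-W3C-Validator-Status header (Valid, Invalid, or Abort).
use strict;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);

my $ua = LWP::UserAgent->new;
my $response = $ua->request(
    POST 'http://validator.w3.org/check',
    Content_Type => 'form-data',
    Content => [uploaded_file => [q(/Users/xah/web/emacs/emacs.html)]],
);
print $response->header('X-W3C-Validator-Status'), qq(\n);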
I heavily rely on the above 2 Firefox tools. However, they do not let
me do batch validation. Over the years I've searched for batch
validation tools. Here's my list:
* HTML Tidy↗. A batch tool primarily for cleaning up HTML markup. I
didn't find it useful for batch validation purposes, nor for HTML
conversion jobs. It doesn't work well for my HTML conversion needs
because the tool is incapable of retaining your HTML formatting (that
is, retaining where your newlines are). I do a lot of regex-based text
processing on my HTML files, so I need to be able to make assumptions
about where the newlines in my HTML files are. If I used Tidy on my
site, I would have to abandon regex-based text processing and instead
treat my files with HTML and DOM parsers, which makes most practical
text processing needs quite a bit more complex and cumbersome.
* A perl module, “HTML::Lint”, at
http://search.cpan.org/~petdance/HTML-Lint-2.06/lib/HTML/Lint.pm.
Seems pretty much like HTML Tidy.
* http://htmlhelp.com/tools/validator/offline/index.html.en is
another validation tool. I haven't looked into it yet. Their doc about
differences from other validators, at
http://htmlhelp.com/tools/validator/differences.html.en, is quite
interesting, and seems an advantage for my needs.
* OpenJade and OpenSP, at http://openjade.sourceforge.net/. Seems a
good tool. I haven't looked into it. (If I understand correctly,
OpenSP's parser is what the W3C validator itself uses underneath; see
the sketch near the end of this post.)
* Emacs's nxml mode, at http://www.thaiopensource.com/nxml-mode/, by
the XML expert James Clark↗. This is written in elisp, with over 10
thousand lines of code. It indicates whether your XML file is valid as
you type. This package is very well received, reputed to make emacs
the best XML editor. This is fantastic, but since my files are
currently HTML, not XHTML, I haven't used it much. There is an emacs
HTML mode based on this package, called nxhtml mode, but the code is
still pretty alpha, and I find it has a lot of problems.
One semi-solution for batch validation I found is “Validator S.A.C.”,
at http://habilis.net/validator-sac/. It is basically the W3C's
validator compiled for OS X with a GUI interface. However, it is not
designed for batch operation. If you want to do batch, you can run it
like this:
“/Applications/Validator-SAC.app/Contents/Resources/weblet ‹html file path›”.
However, it outputs a whole report in HTML on the validation result
(the same as the page you see from the W3C validation service). This
is not what I want. What I want is simply for it to tell me whether a
file is valid or not. For any error detail, I can simply load the page
in Firefox myself, since if I need to edit a file I need to view it in
Firefox anyway. So, to fix this problem, you can wrap it in a perl
script, which takes a dir and simply prints the path of any file
that's invalid.
Here's the perl script:
# perl
# 2008-06-20. Validates a given dir's html files recursively.
# Requires the Mac OS X app Validator-SAC.app at
# http://habilis.net/validator-sac/ (as of 2008-06).
use strict;
use File::Find;

my $dirPath = q(/Users/xah/web/emacs);
my $validator = q(/Applications/Validator-SAC.app/Contents/Resources/weblet);

sub wanted {
    if ($_ =~ m{\.html$} && not -d $File::Find::name) {
        # the weblet prints a CGI-style response; keep only the status header
        my $output = qx{$validator "$File::Find::name" | head -n 11 | grep 'X-W3C-Validator-Status:'};
        if ($output ne qq(X-W3C-Validator-Status: Valid\n)) {
            print q(Problem: ), $File::Find::name, "\n";
        } else {
            print qq(Good: $_), "\n";
        }
    }
}

find(\&wanted, $dirPath);
print qq(Done.\n);
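To use it, adjust $dirPath to the dir you want checked, then run the
script with perl (for example “perl validate.pl”; file name
hypothetical). It prints one Good or Problem line per file.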
However, for some reason, “Validator S.A.C.” takes nearly 2 seconds to
check each file; in contrast, the Firefox HTML Validator add-on takes
a fraction of a second while also rendering the whole page completely.
For example, suppose I have 20 files in a dir that I need to validate.
It's faster to just open all of them in Firefox and eyeball the
validity indicator than to run “Validator S.A.C.” on them.
I wrote to its author, Chuck Houpt, about this. It seems that the
validator uses Perl and loads about 20 heavy-duty web-related perl
modules to do its job, and overall is wrapped up as a Common Gateway
Interface↗ script. Perhaps there is a way to avoid these wrappers and
call the parser or validator directly.
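One candidate for such a direct call is OpenSP's onsgmls command
(OpenSP is mentioned in the list above, and its parser is, as far as I
know, what the W3C validator uses underneath). Assuming OpenSP is
installed and an SGML catalog for the HTML 4 DTDs is available (the
catalog path below is hypothetical), a batch check could look like
this sketch:

# perl
# sketch: batch-validate html files by calling OpenSP's onsgmls directly.
# -s suppresses normal output, so only errors (on stderr) and the exit
# status remain; a nonzero exit status means the file is invalid.
use strict;
use File::Find;

my $dirPath = q(/Users/xah/web/emacs);
my $catalog = q(/usr/local/share/sgml/html4/catalog);  # hypothetical path

sub wanted {
    return unless $_ =~ m{\.html$} && not -d $File::Find::name;
    my $status = system(qq{onsgmls -s -c "$catalog" "$File::Find::name" 2>/dev/null});
    print(($status == 0 ? q(Good: ) : q(Problem: )), $File::Find::name, qq(\n));
}

find(\&wanted, $dirPath);

This would skip the CGI and module-loading overhead, which is
presumably where the 2 seconds per file goes.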
I'm still looking for a fast batch HTML validation tool.
-----------------
Xah
∑ http://xahlee.org/
☄