I wrote my dissertation on the subject of automated neatening of HTML. [...]
with divs and CSS. It worked surprisingly well, but I only had to test
it on ISO-8859-1 documents. I worked out the invalid characters just
by feeding them into the W3C Validator,
I think I'm going to have to stand firm and say that you really need
to make the effort to cross the threshold of understanding the HTML
character model in order to grasp what's behind this; otherwise you'd
risk blundering on in a heuristic fashion without a robust mental
picture of what's involved.
This note makes no attempt to be a full tutorial on that, but just
races through some key headings to see whether you can be persuaded to
read the background and get up to speed.
All of the characters from 0 to 31 decimal, and all of the characters
from 127(sic) to 159 decimal, in the Document Character Set, are
defined to be control characters, and almost all of them are excluded
from use in HTML. These are the characters which are declared to be
"invalid" by the specification (and by the validator).
What's the "Document Character Set"? Well, in HTML2 it was
iso-8859-1, and in HTML4 it was defined to be iso-10646 as amended.
Loosely, you can read "iso-10646 as amended" as being the character
model of Unicode. As far as the values from 0 to 255 are concerned,
iso-8859-1 and iso-10646 are identical.
How is this related to the external character encoding? Well, the
character model that was introduced in RFC2070 and embodied in HTML4
is based on the concept that the external encoding is converted into
iso-10646/unicode prior to any other processing being done. It
doesn't require implementations to work in that way internally, but it
_does_ mandate that they give that impression externally (black box
model).
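In code terms, the black-box model means: decode the external octets
into Unicode first, and only then work on the characters. A minimal
Java sketch of the idea, using the standard java.nio charset support:

    import java.nio.charset.Charset;

    public class DecodeFirst {
        public static void main(String[] args) {
            // The same four octets, decoded per the document's declared coding:
            byte[] external = { (byte) 0x93, 'H', 'i', (byte) 0x94 };
            String decoded = new String(external, Charset.forName("windows-1252"));
            // Everything downstream sees Unicode code points, not octets:
            decoded.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
            // Prints U+201C U+0048 U+0069 U+201D (curly quotes around "Hi")
        }
    }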
So from HTML's point of view, if you have a document which is coded in
say Windows-1252, including those pretty quotes, then (as long as the
recipient consents - see the HTTP Accept-charset) it's perfectly
legal. All you need to do is apply the appropriate code mapping that
you find at the Unicode site, and get the resulting Unicode character.
Resources at
http://www.unicode.org/Public/MAPPINGS/ , in this case
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
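For the interesting range, that file boils down to a small lookup
table; here's my own transcription as a sketch (values taken from
CP1252.TXT, with -1 marking the five octets that map to no character):

    public class Cp1252Map {
        // Octets 0x80-0x9F of windows-1252, per CP1252.TXT; -1 = unassigned.
        static final int[] HIGH = {
            0x20AC,     -1, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
            0x02C6, 0x2030, 0x0160, 0x2039, 0x0152,     -1, 0x017D,     -1,
                -1, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
            0x02DC, 0x2122, 0x0161, 0x203A, 0x0153,     -1, 0x017E, 0x0178
        };

        static int toUnicode(int octet) {
            if (octet < 0x80 || octet > 0x9F) return octet; // same as iso-8859-1
            return HIGH[octet - 0x80];
        }

        public static void main(String[] args) {
            // 0x97 is windows-1252's em dash:
            System.out.printf("0x97 -> U+%04X%n", toUnicode(0x97)); // U+2014
        }
    }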
and for the ones that were invalid but rendered under Windows (like
smartquotes) I replaced those with valid equivalents.
What you're talking about here is probably a document which in reality
is coded in Windows-1252 but erroneously claims to be - or is
mistakenly presumed to be - iso-8859-1 (or its equivalent in other
locales).
There's nothing inherently wrong with these particular octet values
(128-159 decimal) *in those codings which assign them to printable
characters* (that's not only all of the Windows-125x codings, but also
koi8-r and some other less-usual codings).
What's wrong is when those octet values occur in codings which define
them to be control characters which are not used in HTML.
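One common repair, then (my own heuristic sketch, not anything the
specs mandate): if a document claiming iso-8859-1 contains octets in
128-159, re-decode it as windows-1252 instead.

    import java.nio.charset.Charset;

    public class Relabel {
        static String decodeWithRepair(byte[] octets) {
            for (byte b : octets) {
                int v = b & 0xFF;
                if (v >= 128 && v <= 159) {
                    // iso-8859-1 assigns these to controls; assume the
                    // document is really windows-1252.
                    return new String(octets, Charset.forName("windows-1252"));
                }
            }
            return new String(octets, Charset.forName("ISO-8859-1"));
        }

        public static void main(String[] args) {
            byte[] claimed = { (byte) 0x93, 'o', 'k', (byte) 0x94 };
            System.out.println(decodeWithRepair(claimed)); // "ok" in curly quotes
        }
    }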
Once I've worked the program into a more presentable state, I'd like
to release it (GPL'd, of course). The problem is, I've got no idea
what would happen if, say, a Japanese person ran it on some Japanese
HTML source on their hard disk - I've never used a foreign character
encoding, so I don't even know how their text editors figure out the
encoding.
Sadly, quite a number of language locales simply *assume* that their
local coding applies. Try looking at such a file on a system that's
set for a different locale, and you'll get rubbish. It's sometimes
possible to guess, though (look at the automatic charset selection in,
say, Mozilla for examples of what can be done heuristically).
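One crude guess along those lines (a sketch of my own, far simpler
than what Mozilla actually does): if the octets decode strictly as
UTF-8, it very probably *is* UTF-8; otherwise fall back to a locale
default.

    import java.nio.ByteBuffer;
    import java.nio.charset.*;

    public class Sniff {
        static String guessCharset(byte[] octets, String localeDefault) {
            CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                // Multi-octet UTF-8 sequences rarely occur by accident
                // in legacy codings, so a clean decode is a strong hint.
                utf8.decode(ByteBuffer.wrap(octets));
                return "UTF-8";
            } catch (CharacterCodingException e) {
                return localeDefault;
            }
        }

        public static void main(String[] args) {
            byte[] legacy = { (byte) 0xE9 }; // lone 0xE9 is malformed UTF-8
            System.out.println(guessCharset(legacy, "windows-1252"));
        }
    }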
OK, I've done the HTML part of this. I'm not a regular Java user so
I'm leaving that to others.
1. That's the one I'm asking about.
Thanks - I did want to be sure about that first.
[Don't make the mistake of confusing an 8-bit character of value 151
decimal (in some specified 8-bit encoding), on the one hand, with the
undefined(HTML)/illegal(XML) notation &#151; on the other hand.]
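To make that distinction concrete, a small Java illustration of my
own:

    public class OctetVsReference {
        public static void main(String[] args) throws Exception {
            // The *octet* 151 in a document coded as windows-1252 is a
            // perfectly good printable character once decoded:
            byte[] octet = { (byte) 151 };
            String decoded = new String(octet, "windows-1252");
            System.out.printf("octet 151 -> U+%04X%n",
                decoded.codePointAt(0)); // U+2014, EM DASH
            // The *notation* &#151; in the markup, by contrast, always
            // denotes code position 151 of the Document Character Set -
            // a C1 control, regardless of the external encoding.
        }
    }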
2. If I understand the specification correctly, these refer to UCS
code positions,
basically yes, modulo some possible nit-picking about high/low
surrogates and such, which I don't want to go into here.
so I just have to check whether the position is defined
in Unicode.
Er, not quite. Those control characters are certainly *defined*, but
they are excluded from use in HTML by the "SGML declaration for HTML",
and from XHTML by the rules of XML.
And on the other hand I don't think an as-yet-unassigned Unicode code
point is actually invalid for use in (X)HTML. Try it and see what the
validator says?
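Java happens to make the distinction easy to demonstrate, since
Character.isDefined reports Unicode assignment, which is a separate
question from HTML validity:

    public class DefinedVsValid {
        public static void main(String[] args) {
            // U+0096 is *defined* in Unicode (a C1 control), yet excluded
            // from HTML; U+0378 is (at the time of writing) unassigned in
            // Unicode, yet falls in none of the excluded ranges.
            System.out.println(Character.isDefined(0x0096)); // true
            System.out.println(Character.isDefined(0x0378)); // false
        }
    }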
Hope this helps a bit. The write-up of the HTML character model in the
relevant part of the HTML4 spec and/or RFC2070 is not bad, I'd suggest
giving it a try. There's also some material at
http://ppewww.ph.gla.ac.uk/~flavell/charset/ which some folks have
found helpful.