Character encodings and invalid characters

Safalra

[Crossposted as the questions to each group might sound a little
strange without context; trim groups if necessary]

The idea here is relatively simple: a Java program (I'm using JDK 1.4
if that makes a difference) that loads an HTML file, removes invalid
characters (or replaces them in the case of common ones like
Microsoft's 'smartquotes'), and outputs the file.

The problem is these files will be on disk, so the program won't have
the character encoding information from the server.

Questions:

1) I presume Java will correctly identify UTF-16BE and UTF-16LE from
the byte order markers. How does it identify other encodings? Will it
just assume the system default encoding until it finds bytes that
imply UTF-8? The program will mainly deal with UTF-16, UTF-8,
ISO-8859-1 and US-ASCII, but others may occur.

2) I'm slightly confused by the HTML specification - are the valid
characters precisely those that are defined in Unicode? (Java
internally works with 16-bit characters.) (I'm ignoring at this point
characters that in HTML need escaping.)

3) If it fails on esoteric character encodings, how badly is it likely
to fail? Will it totally trash the HTML?
 
Alan J. Flavell

> Questions:
>
> 1) I presume Java will correctly identify UTF-16BE and UTF-16LE from
> the byte order markers. How does it identify other encodings?

[I can't answer that, but the use of a BOM is permissible in utf-8
although it's not required. Actually, if I may be pedantic for a
moment, utf-16BE and utf-16LE don't use a BOM - the endianness is
specified by the name of the encoding; utf-16 uses a BOM and by
looking at the BOM you work out for yourself whether it's LE or BE.]

Coming back to utf-8: unless it's entirely us-ascii in which case you
can't tell the difference, there are validity criteria, and the more
of it you get which meet the criteria, the more confident you can be
that it really is utf-8. Just one single violation of the criteria is
enough to rule that possibility out, and the Unicode rules *mandate*
refusing to process the document further, for security reasons.
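
For what it's worth, a minimal sketch of such a structural check (the
method name is mine, not anything Java provides; it deliberately skips
the finer rules about overlong forms and surrogate code points):

    // Hypothetical helper: true only if the bytes are structurally
    // well-formed UTF-8. A single violation rules UTF-8 out.
    public static boolean looksLikeUtf8(byte[] data) {
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            int extra; // number of continuation bytes expected
            if (b <= 0x7F) extra = 0;                   // plain US-ASCII
            else if (b >= 0xC2 && b <= 0xDF) extra = 1;
            else if (b >= 0xE0 && b <= 0xEF) extra = 2;
            else if (b >= 0xF0 && b <= 0xF4) extra = 3;
            else return false;        // can never start a UTF-8 sequence
            if (i + extra >= data.length) return false; // truncated
            for (int j = 1; j <= extra; j++) {
                int c = data[i + j] & 0xFF;
                if (c < 0x80 || c > 0xBF) return false; // not a continuation
            }
            i += extra + 1;
        }
        return true;
    }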

> Will it just assume the system default encoding until it finds bytes
> that imply UTF-8? The program will mainly deal with UTF-16, UTF-8,
> ISO-8859-1 and US-ASCII, but others may occur.

Right, but define "others". Are you going to deal with any character
encodings which define characters that don't exist in Unicode - e.g.
Klingon?

You certainly aren't going to be able to guess 8-bit character
encodings just by looking at them - you absolutely do, in general,
need some external source of wisdom on what character coding you are
dealing with. *Some* character encodings can be guessed, at least on
plausibility grounds.

> 2) I'm slightly confused by the HTML specification - are the valid
> characters precisely those that are defined in Unicode?

With the greatest of respect, you seem to be putting the cart before
the horse. First you say you intend to remove invalid characters, and
then it becomes clear that you're not sure how to define what they
are. :-}

I'm assuming that there's some substantive issue behind your problem,
but I'm afraid you're not expressing it in terms that I can be
confident that I understand what you're trying to achieve. Recall
that there are in general three ways of representing characters in
HTML:

1. coded characters in the appropriate character encoding
2. numerical character references or 
3. character entity references &name; for those characters which have
them.

Can you address what you propose to do with each of these when you
find them?

> (I'm ignoring at this point characters that in HTML need escaping.)

Hmmm? Are you referring to the use of &-notations here, or something
else?

> 3) If it fails on esoteric character encodings, how badly is it likely
> to fail? Will it totally trash the HTML?

Best answer I can give to that is that the HTML markup itself uses
nothing more than plain us-ascii repertoire. If you can't recognise
at least that repertoire in the original encoding, then you're going
to do worse than trash only the HTML, no?

good luck
 
Roedy Green

> 1) I presume Java will correctly identify UTF-16BE and UTF-16LE from
> the byte order markers. How does it identify other encodings?

You have to ask the user. You can find out the default encoding on his
machine, but that's as good as it gets. People never thought to mark
documents with the encoding or record it in a resource fork.

You can take the same document and interpret it many ways. It would
require almost AI to figure out which was the most likely encoding.

You could do it by comparing letter frequencies to averages of
samples.
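
For illustration, finding that default on a 1.4 JVM looks something
like this (Charset.defaultCharset() only arrived in Java 5; the class
name is just an example):

    import java.io.ByteArrayOutputStream;
    import java.io.OutputStreamWriter;

    public class DefaultEncoding {
        public static void main(String[] args) {
            // A writer built without an explicit charset uses the
            // platform default and reports its name via getEncoding().
            String byWriter =
                new OutputStreamWriter(new ByteArrayOutputStream()).getEncoding();
            System.out.println("file.encoding:  "
                + System.getProperty("file.encoding"));
            System.out.println("writer default: " + byWriter);
        }
    }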
 
Thomas Weidenfeller

Safalra said:
> [Crossposted as the questions to each group might sound a little
> strange without context; trim groups if necessary]
>
> The idea here is relatively simple: a Java program (I'm using JDK 1.4
> if that makes a difference) that loads an HTML file, removes invalid
> characters (or replaces them in the case of common ones like
> Microsoft's 'smartquotes'), and outputs the file.

Sounds like you want to re-invent JTidy or HTML Tidy (google is your
friend).

> The problem is these files will be on disk, so the program won't have
> the character encoding information from the server.

If you are lucky, there is a charset entry at the beginning of the HTML.
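
A rough sketch of looking for such an entry, assuming an
ASCII-compatible encoding (the class name and the 2 KB prologue size
are arbitrary choices; this finds nothing in a UTF-16 file):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class MetaCharsetSniffer {
        private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

        // Returns the declared charset name, or null if none is found.
        public static String sniff(String path) throws IOException {
            byte[] buf = new byte[2048];
            FileInputStream in = new FileInputStream(path);
            int n;
            try {
                n = in.read(buf);
            } finally {
                in.close();
            }
            if (n <= 0) return null;
            // ISO-8859-1 maps every byte to a character, so decoding
            // the prologue this way never throws for 8-bit data.
            String prologue = new String(buf, 0, n, "ISO-8859-1");
            Matcher m = CHARSET.matcher(prologue);
            return m.find() ? m.group(1) : null;
        }
    }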

> 1) I presume Java will correctly identify UTF-16BE and UTF-16LE from
> the byte order markers.

No, it doesn't look at them. While I have presented pseudo-code in this
group to at least do this (and someone later posted an implementation;
search an archive of this group), this will not help you. You are
assuming that the file has been correctly saved by the browser or some
other tool, which in turn would assume the browser had that information.
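
In that spirit, a minimal BOM check might look like the sketch below
(purely illustrative, not the pseudo-code referred to above; a caller
would still have to skip the BOM bytes before decoding):

    import java.io.FileInputStream;
    import java.io.IOException;

    public class BomSniffer {
        // Returns the encoding implied by a byte order mark, or null.
        public static String sniffBom(String path) throws IOException {
            byte[] b = new byte[3];
            FileInputStream in = new FileInputStream(path);
            int n;
            try {
                n = in.read(b);
            } finally {
                in.close();
            }
            if (n >= 3 && (b[0] & 0xFF) == 0xEF
                       && (b[1] & 0xFF) == 0xBB
                       && (b[2] & 0xFF) == 0xBF) return "UTF-8";
            if (n >= 2 && (b[0] & 0xFF) == 0xFE
                       && (b[1] & 0xFF) == 0xFF) return "UTF-16BE";
            if (n >= 2 && (b[0] & 0xFF) == 0xFF
                       && (b[1] & 0xFF) == 0xFE) return "UTF-16LE";
            return null; // no BOM: fall back to other heuristics or ask
        }
    }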

> How does it identify other encodings?

It doesn't. It can't.

> 2) I'm slightly confused by the HTML specification - are the valid
> characters precisely those that are defined in Unicode? (Java
> internally works with 16-bit characters.) (I'm ignoring at this point
> characters that in HTML need escaping.)

There are the specs, and there is what people really put into web pages.
And they put everything in it, really everything.

> 3) If it fails on esoteric character encodings, how badly is it likely
> to fail? Will it totally trash the HTML?

It can. It depends on the contents of the page. E.g. UTF-8 is
indistinguishable from US-ASCII if only ASCII characters are used in the
HTML (the UTF-8 encoding of these characters happens to be the same). So
if you pick the wrong encoding in this case, you won't see a difference
at all. But if there are non-ASCII characters in the UTF-8 data, and if
you decode it as US-ASCII, you get strange additional characters. The
amount of these characters entirely depends on the contents. And if you
e.g. misinterpret Shift-JIS data as US-ASCII, you will most likely see
only strange things.
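
One way to at least make a wrong guess fail loudly instead of
producing silent garbage is to decode strictly with java.nio.charset
(available since 1.4); a sketch, with names of my own choosing:

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CodingErrorAction;

    public class StrictDecode {
        // Throws CharacterCodingException on malformed or unmappable
        // input instead of quietly substituting replacement characters.
        public static String decodeOrFail(byte[] data, String charsetName)
                throws CharacterCodingException {
            return Charset.forName(charsetName)
                          .newDecoder()
                          .onMalformedInput(CodingErrorAction.REPORT)
                          .onUnmappableCharacter(CodingErrorAction.REPORT)
                          .decode(ByteBuffer.wrap(data))
                          .toString();
        }
    }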

/Thomas
 
Safalra

Alan J. Flavell said:
> With the greatest of respect, you seem to be putting the cart before
> the horse. First you say you intend to remove invalid characters, and
> then it becomes clear that you're not sure how to define what they
> are. :-}
>
> I'm assuming that there's some substantive issue behind your problem,
> but I'm afraid you're not expressing it in terms that I can be
> confident that I understand what you're trying to achieve.

Okay, I guess I should have given more detail:

I wrote my dissertation on the subject of automated neatening of HTML.
As part of this I wrote a Java program to demonstrate what could be
done. It removed or replaced invalid characters, attributes and
elements, turned presentation elements and attributes to CSS, and
replaced many tables used for layout purposes (and some framesets)
with divs and CSS. It worked surprisingly well, but I only had to test
it on ISO-8859-1 documents. I worked out the invalid characters just
by feeding them into the W3C Validator, and for the ones that were
invalid but rendered under Windows (like smartquotes) I replaced those
with valid equivalents.

Once I've worked the program into a more presentable state, I'd like
to release it (GPL'd, of course). The problem is, I've got no idea
what would happen if, say, a Japanese person runs it on some Japanese
HTML source on their hard disk - I've never used a foreign character
encoding, so I don't even know how their text editors figure out the
encoding. I was wondering if Java assumes it's the system default
(unless it encounters Unicode), and hence the program would still
work. (I assume that people would usually use the same character
encoding for their system and their HTML?)

> Recall that there are in general three ways of representing
> characters in HTML:
>
> 1. coded characters in the appropriate character encoding
> 2. numerical character references or
> 3. character entity references &name; for those characters which have
> them.
>
> Can you address what you propose to do with each of these when you
> find them?

1. That's the one I'm asking about. :)

Assuming I can get around character encoding problems:

2. If I understand the specification correctly, these refer to UCS
code positions, so I just need to check whether the position is defined
in Unicode.
3. I just need to check whether these are defined in the
specification.

If occurrences of (2) and (3) are valid, they'll just be output by
the program in the same form.

> Hmmm? Are you referring to the use of &-notations here,

Yes, but now we've discussed them above...
 
Michael Borgwardt

Safalra said:
> to release it (GPL'd, of course). The problem is, I've got no idea
> what would happen if, say, a Japanese person runs it on some Japanese
> HTML source on their hard disk - I've never used a foreign character
> encoding, so I don't even know how their text editors figure out the
> encoding.

They assume it by convention, usually. This can (and does) go wrong.

> I was wondering if Java assumes it's the system default
> (unless it encounters Unicode)

Java *always* assumes text is in the system default encoding unless
given an explicit encoding. Unicode does not play into it.
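
For illustration, the difference in code is simply which reader you
construct (plain java.io, nothing exotic):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class ReadWithEncoding {
        // Uses whatever the platform default happens to be.
        public static BufferedReader openDefault(String path)
                throws IOException {
            return new BufferedReader(new FileReader(path));
        }

        // Decodes with the encoding the caller names, e.g. "UTF-8".
        public static BufferedReader open(String path, String enc)
                throws IOException {
            return new BufferedReader(
                new InputStreamReader(new FileInputStream(path), enc));
        }
    }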

Also, do remember that in theory, all HTML documents should declare
their encoding explicitly, or have it supplied by the server in
the header. In XHTML, the explicit declaration is in fact mandatory.

But overall, text encoding is a horribly complex, muddled mess of
legacy conventions, incompatibilities, hacks and workarounds. Most
of the time, it breaks down horribly as soon as you cross a language
barrier.
 
Alan J. Flavell

> I wrote my dissertation on the subject of automated neatening of HTML. [...]
> with divs and CSS. It worked surprisingly well, but I only had to test
> it on ISO-8859-1 documents. I worked out the invalid characters just
> by feeding them into the W3C Validator,

I think I'm going to have to stand firm, and say that you really need
to make the effort and cross the threshold of understanding the HTML
character model in order to grasp what's behind this, otherwise you'd
risk blundering on in a heuristic fashion without a robust mental
picture of what's involved.

This note makes no attempt to be a full tutorial on that, but just
races through some key headings to see whether you can be persuaded to
read the background and get up to speed.

All of the characters from 0 to 31 decimal, and all of the characters
from 127(sic) to 159 decimal, in the Document Character Set, are
defined to be control characters, and almost all of them are excluded
from use in HTML. These are the characters which are declared to be
"invalid" by the specification (and by the validator).

What's the "Document Character Set"? Well, in HTML2 it was
iso-8859-1, and in HTML4 it was defined to be iso-10646 as amended.
Loosely, you can read "iso-10646 as amended" as being the character
model of Unicode. As far as the values from 0 to 255 are concerned,
iso-8859-1 and iso-10646 are identical.

How is this related to the external character encoding? Well, the
character model that was introduced in RFC2070 and embodied in HTML4
is based on the concept that the external encoding is converted into
iso-10646/unicode prior to any other processing being done. It
doesn't require implementations to work in that way internally, but it
_does_ mandate that they give that impression externally (black box
model).

So from HTML's point of view, if you have a document which is coded in
say Windows-1252, including those pretty quotes, then (as long as the
recipient consents - see the HTTP Accept-charset) it's perfectly
legal. All you need to do is apply the appropriate code mapping that
you find at the Unicode site, and get the resulting Unicode character.

Resources at http://www.unicode.org/Public/MAPPINGS/ , in this case
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
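
Java ships that mapping already, so, purely as an illustration,
decoding with the right charset name is enough to turn the 'smart
quote' bytes into real Unicode characters:

    public class Cp1252Demo {
        public static void main(String[] args) throws Exception {
            // "Hi" wrapped in curly quotes, as windows-1252 bytes
            byte[] bytes = {
                (byte) 0x93, (byte) 'H', (byte) 'i', (byte) 0x94 };
            String asLatin1 = new String(bytes, "ISO-8859-1");
            String asCp1252 = new String(bytes, "windows-1252");
            System.out.println((int) asLatin1.charAt(0)); // 147 (control)
            System.out.println((int) asCp1252.charAt(0)); // 8220 (U+201C)
        }
    }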

> and for the ones that were invalid but rendered under Windows (like
> smartquotes) I replaced those with valid equivalents.

What you're talking about here is probably a document which in reality
is coded in Windows-1252 but erroneously claims to be - or is
mistakenly presumed to be - iso-8859-1 (or its equivalent in other
locales).

There's nothing inherently wrong with these particular octet values
(128-159 decimal) *in those codings which assign them to printable
characters* (that's not only all of the Windows-125x codings, but also
koi-8r and some other less-usual codings).

What's wrong is when those octet values occur in codings which define
them to be control characters which are not used in HTML.

> Once I've worked the program into a more presentable state, I'd like
> to release it (GPL'd, of course). The problem is, I've got no idea
> what would happen if, say, a Japanese person runs it on some Japanese
> HTML source on their hard disk - I've never used a foreign character
> encoding, so I don't even know how their text editors figure out the
> encoding.

Sadly, quite a number of language locales simply *assume* that their
local coding applies. Try looking at such a file on a system that's
set for a different locale, and you'll get rubbish. Although it's
sometimes possible to guess (look at the automatic charset selection
in, say, Mozilla for examples of what can be done heuristically).

OK, I've done the HTML part of this. I'm not a regular Java user so
I'm leaving that to others.

> 1. That's the one I'm asking about. :)

Thanks - I did want to be sure about that first.

[Don't make the mistake of confusing an 8-bit character of value 151
decimal (in some specified 8-bit encoding), on the one hand, with the
undefined(HTML)/illegal(XML) notation &#151; on the other hand.]

> 2. If I understand the specification correctly, these refer to UCS
> code positions,

basically yes, modulo some possible nit picking about high/low
surrogates and stuff, that I don't want to go into here.

> so I just need to check whether the position is defined
> in Unicode.

Er, not quite. Those control characters are certainly *defined*, but
they are excluded from use in HTML by the "SGML declaration for HTML",
and from XHTML by the rules of XML.

And on the other hand I don't think an as-yet-unassigned Unicode code
point is actually invalid for use in (X)HTML. Try it and see what the
validator says?

hope this helps a bit. The writeup of the HTML character model in the
relevant part of the HTML4 spec and/or RFC2070 is not bad, I'd suggest
giving it a try. There's also some material at
http://ppewww.ph.gla.ac.uk/~flavell/charset/ which some folks have
found helpful.
 
