[...]
I wish people would give simple answers to simple questions.
I don't think you've understood the problem. If the questioner was in
a position to understand the "simple answer" which you say you want, I
can't imagine how they would have asked the question in that form in
the first place.
This is not a silly question, and the original questioner should not
feel offended or dispirited by what I'm going to say; but, in the form
in which it was asked, the question is incoherent.
This is not unusual: many people are confused both by the theory and
by the terminology of character representation, especially if they
gained an initial understanding in a simpler situation (typically,
character repertoires of 256 characters or fewer, represented by an
8-bit character encoding such as iso-8859-anything; and fonts that
were laid out accordingly).
How very strange. This claims to be XHTML, but, as far as I can see,
it has no character encoding specified on its HTTP Content-type header
*nor* on its <?xml...> thingy (indeed it doesn't have a <?xml...>
thingy).
In the absence of a BOM, XML is entitled to deduce that it's utf-8:
but since it's invalid utf-8, it *ought* to refuse to process it.
Unless someone can show me what I'm missing.
By looking at it, it is evidently encoded in iso-8859-1.
It purports to declare that via a "meta http-equiv", but for XML this
is meaningless - and anyway comes far too late.
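To make that concrete, here is a minimal sketch in Python using the
standard library's expat-based parser. The sample bytes are invented
for illustration and are not taken from the page under discussion; I
am only assuming that it behaves like any other iso-8859-1 document
served without an encoding declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the letters ae-ligature, o-slash, a-ring encoded
# as iso-8859-1, with no XML declaration and no BOM.
undeclared = b"<p>\xe6\xf8\xe5</p>"

try:
    ET.fromstring(undeclared)
    rejected = False
except ET.ParseError:
    # With nothing declared, the parser assumes utf-8; these bytes are
    # not valid utf-8, so it refuses to process the document.
    rejected = True

# The same bytes, this time with the encoding named in the XML declaration.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?>' + undeclared
text = ET.fromstring(declared).text

print(rejected, text)
```

A "meta http-equiv" inside the document plays no part in either case:
the parser decides on the encoding before it ever gets that far.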
I don't know why the W3C validator doesn't reject it out of hand.
(Of course the popular browsers will be slurping it as slightly
xhtml-flavoured tag soup, so we can't expect to deduce very much from
the fact that they calmly display what the author intended.)
Slightly edited, this says:
XML documents can contain foreign characters like Norwegian æøå, or French
êèé.
And those characters are presented encoded in iso-8859-1 ...
To let your XML parser understand these characters, you should save
your XML documents as Unicode.
Two things wrong here. What do they suppose they mean by "save ... as
Unicode"? The XML Document Character Set is *by definition* Unicode,
there's nothing that an author can do to change that (unlike SGML).
Characters can be represented in at least two different ways in XML:
by /numerical character references/ (&#nnn; or &#xhhh;), or as /encoded
characters/ using some /character encoding scheme/. (In some contexts
there may also be named character entities, but they introduce no new
principles for the present purpose so we won't need to discuss them
here).
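The two representations can be compared side by side. The following is
only a sketch, with a made-up <p> element as the sample document: it
feeds the same three characters to a parser once as numerical character
references and twice as encoded characters, and in every case the
parser hands back the same Unicode characters.

```python
import xml.etree.ElementTree as ET

# 1. Numerical character references: the file itself is plain ASCII.
as_refs = b"<p>&#230;&#xF8;&#229;</p>"

# 2. Encoded characters, in two different character encoding schemes.
body = "<p>\u00e6\u00f8\u00e5</p>"
as_utf8 = body.encode("utf-8")  # utf-8 needs no declaration
as_latin1 = (b'<?xml version="1.0" encoding="iso-8859-1"?>'
             + body.encode("iso-8859-1"))

texts = [ET.fromstring(doc).text for doc in (as_refs, as_utf8, as_latin1)]
code_points = [hex(ord(c)) for c in texts[0]]

print(texts, code_points)
```

The bytes on disk differ in all three cases; the document that the
parser sees does not.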
The only coherent interpretation I can put on their "should save as
Unicode" statement is "should save in one of the character encoding
schemes of Unicode". But /should/ we? Do they? No, they don't: they
are using iso-8859-1 (they *could* even do it correctly); and they
also discuss the use of windows-1252, although without giving much
detail about the implications of deploying a proprietary character
encoding on the WWW.
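One such implication, sketched in Python with an invented byte string:
windows-1252 assigns printable characters to the byte range 0x80-0x9F,
whereas iso-8859-1 maps those bytes to control characters, so data
labelled with the wrong one of the two does not mean what the author
intended.

```python
# Hypothetical sample: a word wrapped in the bytes 0x93 and 0x94.
raw = b"\x93quoted\x94"

as_cp1252 = raw.decode("windows-1252")  # curly quotes U+201C and U+201D
as_latin1 = raw.decode("iso-8859-1")    # control characters U+0093, U+0094

print([hex(ord(c)) for c in (as_cp1252[0], as_latin1[0])])
```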
The /conclusions/ are fine, in their way:
* Use an editor that supports encoding.
* Make sure you know what encoding it uses.
* Use the same encoding attribute in your XML documents.
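Those three points amount to keeping the bytes and the declaration in
step. A minimal sketch, assuming Python 3.8 or later (for the
xml_declaration argument) and an invented <p> element; an editor's
"save as" dialogue does the equivalent job by hand:

```python
import xml.etree.ElementTree as ET

root = ET.Element("p")
root.text = "\u00e6\u00f8\u00e5"

# Serialise in a chosen encoding and emit an XML declaration that names
# that same encoding.
data = ET.tostring(root, encoding="iso-8859-1", xml_declaration=True)

# A parser reading the result recovers the original characters.
round_trip = ET.fromstring(data).text

print(data, round_trip)
```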
But the reader still hasn't really learned anything about the
underlying principles yet. And the page hasn't told them anything
useful about *which* encoding to choose for deploying their documents
on the WWW.
Windows 95/98 Notepad cannot save files in Unicode format.
Then it's unfit for composing the kind of document that we are
discussing here. No matter - there are plenty of competent editors
which can work on that platform.
My own tutorial pages weren't really aimed at XML, so I won't suggest
them as an appropriate answer here. Actually, the relevant chapter of
the Unicode specification is not unreasonable as an introduction to
the principles of character representation and encoding, even if it
might be a bit indigestible at a first reading.