Naresh Agarwal said:
XML uses UTF-8 by default. Is that correct?
People say this quite often. You'd think it were true. Usually when they say
it, they are thinking "if I don't put an encoding declaration in the prolog,
the XML parser is going to assume the document is utf-8 encoded, right?" And
that may seem to be true most of the time, but some better understanding is
in order.
First, understand that XML, being on one level just a string of abstract
Unicode characters, may be represented in any encoding. That is, the
"physical" bytes of the document (or rather, the bytes of each 'entity'
[file]) can represent Unicode characters according to any character map you
wish to use -- e.g., iso-8859-1, utf-8, us-ascii, shift-jis, whatever.
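To make that concrete, here is a small Python sketch that serializes one
string of characters under three different encodings. The sample text and
the variable names are my own, chosen only for illustration.

```python
# The same abstract characters, serialized under different encodings.
text = "<p>caf\u00e9</p>"   # 11 characters, one of them non-ASCII (U+00E9)

sizes = {}
for name in ("utf-8", "iso-8859-1", "utf-16-le"):
    data = text.encode(name)
    # Decoding with the matching encoding recovers the identical string.
    assert data.decode(name) == text
    sizes[name] = len(data)
    print(f"{name:>10}: {len(data):2d} bytes  {data.hex()}")
```

The bytes differ from one encoding to the next, but each byte sequence
decodes back to the same characters.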
However, an XML parser is only *required* to support two encodings: utf-8
and utf-16. Each of these can represent every one of Unicode's roughly 1.1
million code points: utf-8 uses sequences of 1 to 4 bytes per character,
and utf-16 uses 2 or 4. Many other encodings (us-ascii and iso-8859-1, for
example) use just one byte per character, and even multi-byte ones such as
shift-jis cover only a small subset of Unicode's repertoire. You will find
that most parsers do at least support us-ascii and iso-8859-1 in addition
to the required utf-8 and utf-16, since these are fairly common encodings.
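As a rough illustration of the difference in coverage, the following Python
sketch reports how many bytes a few sample characters occupy in each
encoding. The `width` helper and the sample characters are hypothetical,
picked by me for the example.

```python
# A, e-acute, the euro sign, and an emoji outside the Basic Multilingual Plane
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def width(ch, encoding):
    """Number of bytes one character occupies, or None if it cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None

for ch in samples:
    row = {enc: width(ch, enc)
           for enc in ("utf-8", "utf-16-le", "iso-8859-1", "us-ascii")}
    print(f"U+{ord(ch):04X}", row)
```

A result of None means the encoding has no byte sequence for that character
at all, which is the "small subset" problem in practice.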
The XML spec, which you should have handy and should read when you want to
find answers like this, requires that an XML parser determine the encoding
of a document by checking for declarations and hints in a number of places
which I will not list here, since it's not an easily summarizable list. One
of the things it will look for, though, in the absence of
externally-supplied encoding info, is the presence of a UTF-16 byte order
mark (BOM) at the start of the file. The spec requires a BOM on UTF-16
entities (a UTF-8 entity may carry one too, but does not have to): the byte
stream is prefaced by a pair of bytes that are the encoded form of the
"zero-width no-break space" character, Unicode code point U+FEFF. These
bytes arrive in the order 0xFF 0xFE if the document was written
little-endian (typical, but not guaranteed, on Intel platforms) or 0xFE 0xFF
if it was written big-endian, so they signal to the parser both that the
document is UTF-16 encoded and which byte order it uses.
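The BOM check by itself can be sketched in a few lines of Python. This is
only that one step, not the spec's full detection procedure, and the
`sniff_bom` function is a hypothetical helper of my own. It also looks for
the three-byte UTF-8 signature, and as a simplification it ignores UTF-32,
whose little-endian BOM also begins with 0xFF 0xFE.

```python
import codecs

def sniff_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    return None

little = codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "<a/>".encode("utf-16-be")
print(sniff_bom(little), sniff_bom(big), sniff_bom(b"<a/>"))
```

When no BOM is found, a real parser still has to consider external
information and any encoding declaration before settling on an encoding.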
So, contrary to popular belief, it is quite possible to save a document with
no encoding declaration in its prolog, using UTF-16 encoding (such as in
Windows Notepad, if you choose "Unicode" from the "Save As" dialog), and the
parser will not in this case "default to UTF-8", but will instead recognize
the BOM as a UTF-16 declaration, of sorts, and it will decode the document
properly.
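You can try this without Notepad. The sketch below builds such a document
in memory and hands the raw bytes to Python's xml.etree.ElementTree, which
sits on top of the expat parser. The element name and text are made up for
the example, and I am assuming an expat build that honors the BOM, as the
ones bundled with Python do.

```python
import codecs
import xml.etree.ElementTree as ET

# No encoding declaration in the prolog; the only hint is the BOM.
source = '<?xml version="1.0"?><greeting>caf\u00e9</greeting>'
data = codecs.BOM_UTF16_LE + source.encode("utf-16-le")

# Pass bytes, not str, so that the parser does the decoding itself.
root = ET.fromstring(data)
print(root.tag, root.text)
```

If the parser had assumed UTF-8 here, the very first byte (0xFF) would
already be invalid and parsing would fail.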