Using Unicode in XML

Naresh Agarwal

Hi

XML uses UTF-8 by default. Is that correct?

Also, can we use Unicode in XML?

thanks,
Naresh
 
Edwin Dankert

Naresh Agarwal said:
XML uses UTF-8 by default. Is that correct?

That is correct (without BOM). [1]

Naresh Agarwal said:
Also, can we use Unicode in XML?

Yes. UTF-8, UTF-16 and UTF-32 are three encodings defined by the Unicode
Consortium [2]. Each of them can represent the whole Unicode character set.

UTF-8 takes 1-4 bytes for a character (the original definition allowed up to 6).
UTF-16 takes 2 or 4 bytes for a character.
UTF-32 takes 4 bytes for a character.
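
To make those byte counts concrete, here is a minimal Python sketch (not part of the original reply). The helper name `encoded_lengths` and the sample characters are my own choices for illustration; the byte-order-specific codecs are used so that no BOM is counted.

```python
# Sample characters: U+0041, U+00E9, U+20AC, U+1F600 (outside the BMP).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def encoded_lengths(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # explicit byte order, so no BOM is added
        len(ch.encode("utf-32-le")),
    )

for ch in samples:
    utf8, utf16, utf32 = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8={utf8} UTF-16={utf16} UTF-32={utf32}")
```

The last sample needs 4 bytes in every encoding, because UTF-16 has to use a surrogate pair for it.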

[1] http://www.w3.org/TR/2000/REC-xml-20001006#charencoding
[2] http://www.unicode.org/

Regards,
Edwin Dankert
Cladonia Ltd.
http://www.cladonia.com/
 
Mike Brown

Naresh Agarwal said:
XML uses UTF-8 by default. Is that correct?

People say this quite often. You'd think it were true. Usually when they say
it, they are thinking "if I don't put an encoding declaration in the prolog,
the XML parser is going to assume the document is utf-8 encoded, right?" And
that may seem to be true most of the time, but some better understanding is
in order.

First, understand that XML, being on one level just a string of abstract
Unicode characters, may be represented in any encoding. That is, the
"physical" bytes of the document (or rather, the bytes of each 'entity'
[file]) can represent Unicode characters according to any character map you
wish to use -- e.g., iso-8859-1, utf-8, us-ascii, shift-jis, whatever.
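
A small Python sketch of that idea, assuming nothing beyond the standard codecs: the same four abstract characters come out as different bytes under each encoding, and decode back to the identical string. The sample text and variable names are only illustrative.

```python
text = "caf\u00e9"  # four abstract Unicode characters, ending in U+00E9

encodings = ["iso-8859-1", "utf-8", "utf-16-le"]
encoded = {name: text.encode(name) for name in encodings}

for name, raw in encoded.items():
    # Different bytes on disk, identical characters after decoding.
    print(name, raw.hex(), raw.decode(name) == text)
```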

However, an XML parser is only *required* to support two encodings: utf-8
and utf-16, each of which provides a way to map all 1.1 million Unicode
code points to specific sequences of 1 to 4 bytes each... whereas many other
encodings use just one byte per character and are thus only
good for representing a very small subset of Unicode's repertoire. You will
find that most parsers do at least support us-ascii and iso-8859-1 in
addition to the required utf-8 and utf-16, since these are fairly common
encodings.
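
To illustrate the "small subset" point, here is a hedged Python sketch that checks whether a string can be written in a given encoding at all. The helper `can_encode` and the sample string are hypothetical names introduced for this example.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

sample = "price: 5\u20ac"  # contains the euro sign, U+20AC

for enc in ("us-ascii", "iso-8859-1", "utf-8", "utf-16"):
    print(enc, can_encode(sample, enc))
```

The euro sign is missing from both us-ascii and iso-8859-1, so only the two required encodings can hold this string directly.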

The XML spec, which you should have handy and should read when you want to
find answers like this, requires that an XML parser determine the encoding
of a document by checking for declarations and hints in a number of places
which I will not list here, since it's not an easily summarizable list. One
of the things it will look for, though, in the absence of
externally-supplied encoding info, is the presence of a UTF-16 byte order
mark (BOM) at the start of the file. XML requires a UTF-16 encoded entity to
begin with one -- the byte stream is prefaced by a pair of bytes that are the
encoded form of the "zero-width no-break space" character, Unicode code
point U+FEFF. These bytes appear as 0xFF 0xFE for little-endian data (the
usual, but not the only possible, choice on Intel platforms) or as 0xFE 0xFF
for big-endian data, so they signal to the parser both that the document is
UTF-16 encoded and which byte order it uses.
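
The BOM check can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular parser implements it; `sniff_utf16_bom` is a hypothetical helper, and it deliberately ignores the other detection steps in the spec (including UTF-32, whose little-endian BOM also begins with FF FE).

```python
import codecs

def sniff_utf16_bom(data):
    """Return 'utf-16-be' or 'utf-16-le' if data starts with a UTF-16 BOM, else None."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    return None

# A little-endian UTF-16 document with a BOM and no encoding declaration.
doc = codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")
print(sniff_utf16_bom(doc))
print(sniff_utf16_bom(b"<a/>"))
```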

So, contrary to popular belief, it is quite possible to save a document with
no encoding declaration in its prolog, using UTF-16 encoding (such as in
Windows Notepad, if you choose "Unicode" from the "Save As" dialog), and the
parser will not in this case "default to UTF-8", but will instead recognize
the BOM as a UTF-16 declaration, of sorts, and it will decode the document
properly.
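
This is easy to try with Python's standard `xml.etree.ElementTree`, which sits on top of the expat parser. The sketch below builds a UTF-16 document by hand, with a BOM and no XML declaration; the element name and text are made up for the example.

```python
import codecs
import xml.etree.ElementTree as ET

# No XML declaration at all: the BOM is the only encoding hint.
xml_text = "<greeting>caf\u00e9</greeting>"
raw = codecs.BOM_UTF16_LE + xml_text.encode("utf-16-le")

# Pass bytes, not str, so the parser has to work out the encoding itself.
root = ET.fromstring(raw)
print(root.tag, root.text)
```

If the parser had assumed UTF-8 here, the very first bytes would already be invalid; instead the text comes back intact.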
 
