using unicode in XML

Naresh Agarwal · Aug 14, 2003

Hi

XML uses UTF-8 by default. Is that correct?

Also, can we use Unicode in XML?

thanks,
Naresh

Bjorn Brox · Aug 14, 2003

Naresh said:
Hi

XML uses UTF-8 by default. Is that correct?

Also, can we use Unicode in XML?

UTF-8 is the most common way to encode Unicode.

Edwin Dankert · Aug 14, 2003

XML uses UTF-8 by default. Is that correct?

That is correct (without BOM). [1]

Also, can we use Unicode in XML?

UTF-8, UTF-16 and UTF-32 are three encodings defined by the [2] Unicode
people. Each of these encodings can represent the whole Unicode character set.

UTF-8 takes 1-4 bytes for a character (the original definition allowed up to 6).
UTF-16 takes 2/4 bytes for a character.
UTF-32 takes 4 bytes for a character.
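
To see those sizes in practice, here is a minimal Python sketch. It is not part of the original post: the helper name encoded_lengths and the sample characters are illustrative choices. The "-le" codec variants are used so that no byte order mark is counted in the length.

```python
def encoded_lengths(ch):
    """Return how many bytes a single character takes in each encoding."""
    # The "-le" variants avoid the BOM that plain "utf-16"/"utf-32" prepend.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}


# Sample characters: ASCII letter, Latin-1 letter, euro sign, an emoji
# outside the Basic Multilingual Plane.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

Characters beyond U+FFFF are the ones that need 4 bytes in both UTF-8 and UTF-16.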

[1] http://www.w3.org/TR/2000/REC-xml-20001006#charencoding
[2] http://www.unicode.org/

Regards,
Edwin Dankert
Cladonia Ltd.
http://www.cladonia.com/

Mike Brown · Aug 14, 2003

Naresh Agarwal said:
XML uses UTF-8 by default. Is that correct?

People say this quite often. You'd think it were true. Usually when they say
it, they are thinking "if I don't put an encoding declaration in the prolog,
the XML parser is going to assume the document is utf-8 encoded, right?" And
that may seem to be true most of the time, but some better understanding is
in order.

First, understand that XML, being on one level just a string of abstract
Unicode characters, may be represented in any encoding. That is, the
"physical" bytes of the document (or rather, the bytes of each 'entity'
[file]) can represent Unicode characters according to any character map you
wish to use -- e.g., iso-8859-1, utf-8, us-ascii, shift-jis, whatever.

However, an XML parser is only *required* to support two encodings: utf-8
and utf-16, each of which provides a way to map all 1.1 million Unicode
code points to specific sequences of 1 to 4 bytes each, whereas many other
encodings use just one byte per character and are thus only good for
representing a very small subset of Unicode's repertoire. You will find
that most parsers do at least support us-ascii and iso-8859-1 in addition
to the required utf-8 and utf-16, since these are fairly common encodings.
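
As a rough illustration of why the declaration matters for the non-required encodings, here is a small Python sketch. It assumes the standard library's expat-based xml.etree.ElementTree, which is one of the parsers that accepts iso-8859-1; the helper name parse_text and the sample documents are made up for this example.

```python
import xml.etree.ElementTree as ET

# The same bytes for the element content (0xE9 is "e acute" in iso-8859-1),
# once with an encoding declaration and once without.
declared = b'<?xml version="1.0" encoding="iso-8859-1"?><a>\xe9</a>'
undeclared = b'<a>\xe9</a>'


def parse_text(data):
    """Return the root element's text, or None if the bytes do not parse."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        # Without a declaration the lone 0xE9 byte is not valid UTF-8.
        return None


print(repr(parse_text(declared)))
print(repr(parse_text(undeclared)))
```

The second document fails because, with no declaration and no BOM, the parser treats the bytes as UTF-8.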

The XML spec, which you should have handy and should read when you want to
find answers like this, requires that an XML parser determine the encoding
of a document by checking for declarations and hints in a number of places
which I will not list here, since it's not an easily summarizable list. One
of the things it will look for, though, in the absence of
externally-supplied encoding info, is the presence of a UTF-16 byte order
mark (BOM) at the start of the file. In UTF-16, the byte stream is prefaced
by a pair of bytes that are the encoded form of the "zero-width no-break
space" character, Unicode code point U+FEFF. (A UTF-8 file can carry the
same character as the three bytes 0xEF 0xBB 0xBF, but there it says nothing
about byte order.) The pair of bytes signals to the parser that the
document is UTF-16 encoded and which byte order it uses: 0xFF 0xFE means
little-endian, which is what you will typically (but not necessarily) see
on Intel platforms, and 0xFE 0xFF means big-endian.
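
The BOM check described above can be sketched in a few lines of Python. This is only an illustration, not the full detection procedure from the XML spec: the function name sniff_bom is hypothetical, and the UTF-8 signature check is an addition that goes slightly beyond the UTF-16 case discussed here.

```python
import codecs


def sniff_bom(data):
    """Guess an encoding from a leading byte order mark, or return None."""
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF8):       # bytes EF BB BF (assumption: also checked)
        return "utf-8"
    return None  # no BOM: other declarations and hints have to decide


print(sniff_bom(b"\xff\xfe<\x00a\x00/\x00>\x00"))
print(sniff_bom(b"<a/>"))
```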

So, contrary to popular belief, it is quite possible to save a document with
no encoding declaration in its prolog, using UTF-16 encoding (such as in
Windows Notepad, if you choose "Unicode" from the "Save As" dialog), and the
parser will not in this case "default to UTF-8", but will instead recognize
the BOM as a UTF-16 declaration, of sorts, and it will decode the document
properly.
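
A quick way to try this yourself, assuming Python's expat-based xml.etree.ElementTree as the parser (other parsers should behave the same way, but that is not tested here). The function name and the sample document are made up for the example.

```python
import codecs
import xml.etree.ElementTree as ET


def parse_utf16_without_declaration(text):
    """Encode text as little-endian UTF-16 with a BOM and parse the bytes."""
    # Note: there is no <?xml ... encoding="..."?> declaration in the document.
    data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    return ET.fromstring(data)


root = parse_utf16_without_declaration("<greeting>h\u00e9llo</greeting>")
print(root.tag, root.text)
```

The parser is handed raw bytes and no external encoding information, so the BOM is the only thing telling it that the document is UTF-16.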
