Roedy Green said:
real unicode uses some 3 or 4 byte format that more or less directly
encodes the character, where Java uses two 16 bit encoded characters of
a magic range, where each char piggybacks some of the bits of the real
character??
There's a lot of confusion about this issue, and it stems from the
fact that Unicode itself is only a mapping from numbers (code points) to
characters. It does not, for example, say how to turn bytes or
bitstreams into characters. So in addition to Unicode, you need an
encoding, such as UTF-8 or UTF-16. UTF-8 takes a stream of bytes and
maps it to a sequence of numbers; Unicode then maps that sequence of
numbers to a sequence of characters.
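To make that two-step idea concrete, here is a minimal Java sketch (the
class name and sample bytes are mine) that decodes two UTF-8 bytes into
a Unicode number, and then into a character:

import java.nio.charset.StandardCharsets;

public class DecodingDemo {
    public static void main(String[] args) {
        // UTF-8 maps these two bytes to the number 0xE9 ...
        byte[] bytes = { (byte) 0xC3, (byte) 0xA9 };
        String s = new String(bytes, StandardCharsets.UTF_8);
        // ... and Unicode maps the number 0xE9 to the character 'é'.
        int codePoint = s.codePointAt(0);
        System.out.printf("U+%04X -> %s%n", codePoint, s);  // prints: U+00E9 -> é
    }
}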
I think part of the misunderstanding happens because in ASCII, the
mapping from bits to numbers and the mapping from numbers to characters
are glossed over into a single direct mapping from bits to characters.
The biggest number that has a defined mapping onto a character in
Unicode, i.e. the largest legal code point, is 0x10FFFF. One encoding
system might be to always use 3 bytes for every character, so that the
largest number you could represent with this encoding is 0xFFFFFF,
which is enough to represent all the defined Unicode characters and
then some. But of course, this encoding system would not be "backwards
compatible" with ASCII.
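Such a fixed-width encoding would be trivial to implement; here is a
sketch (purely hypothetical, not any real standard):

// Hypothetical fixed-width 3-byte encoding: just the code point,
// big-endian. Not a real standard, only an illustration.
static byte[] encodeFixed3(int codePoint) {
    if (codePoint < 0 || codePoint > 0xFFFFFF)
        throw new IllegalArgumentException("does not fit in 3 bytes");
    return new byte[] {
        (byte) (codePoint >> 16),   // most significant byte
        (byte) (codePoint >> 8),
        (byte) codePoint            // least significant byte
    };
}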
UTF-8 is a variable-length encoding which is backwards compatible with
ASCII. This is one of the reasons why it's a very popular encoding; all
valid ASCII documents are also valid UTF-8 documents. Characters whose
Unicode numbers are in the range 0x000000 to 0x00007F need only 1 byte.
The tradeoff is that larger numbers need more bytes: 2 bytes for
0x000080 to 0x0007FF, 3 bytes for 0x000800 to 0x00FFFF, and 4 bytes for
0x010000 to 0x10FFFF.
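You can check those byte counts directly in Java (the class name is
mine; java.nio.charset.StandardCharsets is Java 7+):

import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // U+0041 LATIN CAPITAL LETTER A -> 1 byte
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);
        // U+00E9 LATIN SMALL LETTER E WITH ACUTE -> 2 bytes
        System.out.println("\u00E9".getBytes(StandardCharsets.UTF_8).length);
        // U+20AC EURO SIGN -> 3 bytes
        System.out.println("\u20AC".getBytes(StandardCharsets.UTF_8).length);
        // U+10400 DESERET CAPITAL LETTER LONG I -> 4 bytes
        String supplementary = new String(Character.toChars(0x10400));
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_8).length);
    }
}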
UTF-16 is variable-length as well, and NOT backwards compatible with
ASCII. It uses 2 bytes for every character whose Unicode number falls
between 0x000000 and 0x00FFFF, and 4 bytes for the rest. The trick that
makes this possible is that Unicode reserves the range 0xD800 to 0xDFFF
(the "surrogates") and never assigns characters to it; a 4-byte UTF-16
sequence is a pair of 16-bit units taken from that reserved range,
which together carry the bits of the larger number.
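Again, this is easy to see from Java (the class name is mine; UTF-16BE
is used so no byte-order mark gets counted):

import java.nio.charset.StandardCharsets;

public class Utf16Lengths {
    public static void main(String[] args) {
        // U+20AC EURO SIGN fits in one 16-bit unit -> 2 bytes
        System.out.println("\u20AC".getBytes(StandardCharsets.UTF_16BE).length);
        // U+10400 needs a surrogate pair -> 4 bytes
        String s = new String(Character.toChars(0x10400));
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);
        // The two 16-bit units come from the reserved surrogate range:
        System.out.printf("%04X %04X%n", (int) s.charAt(0), (int) s.charAt(1));
        // prints: D801 DC00
    }
}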
Java uses UTF-16 internally. A Java "char" is 16 bits, so it handles
the first case (characters requiring 2 bytes) just fine. The problem is
when you try to represent a character that requires 4 bytes under
UTF-16. What happened is that the Unicode Standard "changed": the BMP
(Basic Multilingual Plane) used to be all of Unicode, but later
versions added characters beyond it.
From the JavaDoc for java.lang.Character:
<quote>
The char data type (and therefore the value that a Character object
encapsulates) are based on the original Unicode specification, which defined
characters as fixed-width 16-bit entities. The Unicode standard has since
been changed to allow for characters whose representation requires more than
16 bits. The range of legal code points is now U+0000 to U+10FFFF
</quote>
So Sun, based on the original Unicode specification, had assumed that 16
bits was enough, and then got screwed over.
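The practical consequence shows up in Java 5's code point API (the
example string is mine):

public class CodePointDemo {
    public static void main(String[] args) {
        // One character beyond the BMP ...
        String s = new String(Character.toChars(0x10400));
        // ... takes two Java chars, but is still one Unicode character.
        System.out.println(s.length());                      // 2
        System.out.println(s.codePointCount(0, s.length())); // 1
        System.out.println(Character.charCount(0x10400));    // 2
    }
}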
For what it's worth, the UTF-8 scheme has plenty of room to grow: the
leading byte encodes the length of the sequence in unary (as a run of 1
bits), and the remaining bits carry the character's number in binary.
The original design allowed sequences of up to 6 bytes (numbers up to
0x7FFFFFFF); the current standard restricts it to 4 bytes, i.e. to
0x10FFFF, but the same scheme could be extended again if Unicode ever
needed more characters than that.
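Here is a minimal sketch of that leading-byte scheme, limited to the
4-byte forms of the current standard (the method name is mine, and it
skips details like rejecting surrogate code points):

// The count of leading 1 bits in the first byte gives the sequence
// length in unary; the remaining x bits hold the number in binary.
static byte[] utf8Encode(int cp) {
    if (cp < 0x80)                          // 0xxxxxxx
        return new byte[] { (byte) cp };
    if (cp < 0x800)                         // 110xxxxx 10xxxxxx
        return new byte[] { (byte) (0xC0 | (cp >> 6)),
                            (byte) (0x80 | (cp & 0x3F)) };
    if (cp < 0x10000)                       // 1110xxxx 10xxxxxx 10xxxxxx
        return new byte[] { (byte) (0xE0 | (cp >> 12)),
                            (byte) (0x80 | ((cp >> 6) & 0x3F)),
                            (byte) (0x80 | (cp & 0x3F)) };
    if (cp <= 0x10FFFF)                     // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] { (byte) (0xF0 | (cp >> 18)),
                            (byte) (0x80 | ((cp >> 12) & 0x3F)),
                            (byte) (0x80 | ((cp >> 6) & 0x3F)),
                            (byte) (0x80 | (cp & 0x3F)) };
    throw new IllegalArgumentException("not a Unicode code point");
}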
- Oliver