So my question is, when you see a 4-byte sequence, how can you know if
it is a pair of characters in the 0x000000 to 0x00FFFF range, or if it is a
single character in the 0x010000 to 0x10FFFF range?
The answer is that you can always tell, because they RESERVE two banks
of characters in the 16-bit range, 0xd800-0xdbff and 0xdc00-0xdfff, for
encoding high characters. No ordinary 16-bit character ever has a value
in those banks.
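On the decoding side, that means a reader only has to look at the first
16-bit unit: if it falls in 0xd800-0xdbff it is a high surrogate and a
low surrogate must follow. A minimal sketch of that check, assuming a
hypothetical getChar16 helper that reads the next big-endian 16-bit
unit:

// decode one code point from big-endian UTF-16
int getwchar()
{
    int unit = getChar16();
    if ( unit >= 0xd800 && unit <= 0xdbff )
    {
        // high surrogate: combine with the low surrogate that follows
        int low = getChar16();
        return ( ( unit - 0xd800 ) << 10 | ( low - 0xdc00 ) ) + 0x10000;
    }
    // ordinary 16-bit character; a lone unit in 0xdc00-0xdfff
    // would be malformed UTF-16.
    return unit;
}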
Here is my code for how UTF-16 encodes code points, including the high
ones:

// encode a 32-bit Unicode code point into big-endian UTF-16
void putwchar( int c )
{
    if ( c < 0x10000 )
    {
        // for 16-bit Unicode, use 2-byte format: the code point as-is
        int high = c >>> 8;
        int low = c & 0xff;
        putByte( high );
        putByte( low );
    }
    else
    {
        // for Unicode above 0xffff use 4-byte surrogate-pair format
        int d = c - 0x10000;    // reduce 21 bits to 20 bits
        int high10 = d >>> 10;  // high-order 10 bits
        int low10 = d & 0x3ff;  // low-order 10 bits
        int highSurrogate = 0xd800 | high10;
        int lowSurrogate = 0xdc00 | low10;
        int high = highSurrogate >>> 8;
        int low = highSurrogate & 0xff;
        putByte( high );
        putByte( low );
        high = lowSurrogate >>> 8;
        low = lowSurrogate & 0xff;
        putByte( high );
        putByte( low );
    }
}
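For example, with a throwaway putByte that just prints each byte in
hex (a hypothetical test harness, not the real output routine):

// hypothetical harness: print each byte as two hex digits
void putByte( int b )
{
    System.out.printf( "%02X ", b );
}

putwchar( 0x0041 );   // prints: 00 41
putwchar( 0x1d11e );  // prints: D8 34 DD 1E  (MUSICAL SYMBOL G CLEF)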
I posted this at
http://mindprod.com/jgloss/utf.html in a table.
It will be a bit scrambled here:
unicode                     bytes                                notes
00000000 yyyyyyyy xxxxxxxx  yyyyyyyy xxxxxxxx                    2
000zzzzh yyyyyyyy xxxxxxxx  110110zz zzyyyyyy 110111yy xxxxxxxx  4

For numbers in the range 0x0000 to 0xffff, just encode them as they
are in 16 bits.

For numbers in the range 0x10000 to 0x10FFFF, you have 21 bits to
encode. This is reduced to 20 bits by subtracting 0x10000, so the
zzzz bits in the surrogate pattern are the top bits after the
subtraction, not the original zzzzh bits. The high-order 10 bits are
encoded as a 16-bit value, 0xd800 + the high-order 10 bits, and the
low-order 10 bits are encoded as a 16-bit value, 0xdc00 + the
low-order 10 bits.
The resulting pair of 16-bit characters are in the so-called
high-half zone or high surrogate area (0xd800-0xdbff) and low-half
zone or low surrogate area (0xdc00-0xdfff). Characters with values
greater than 0x10FFFF cannot be encoded in UTF-16. Values in the
ranges 0xd800-0xdbff and 0xdc00-0xdfff are specifically reserved for
use with UTF-16 for encoding high characters, and don't have any
characters assigned to them.
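A worked example, using U+10400 (DESERET CAPITAL LETTER LONG I) as a
sample high character: 0x10400 - 0x10000 = 0x00400. The high-order 10
bits are 0x00400 >>> 10 = 0x001, so the high surrogate is 0xd800 +
0x001 = 0xd801. The low-order 10 bits are 0x00400 & 0x3ff = 0x000, so
the low surrogate is 0xdc00 + 0x000 = 0xdc00. U+10400 therefore
encodes as the pair 0xd801 0xdc00, i.e. the bytes D8 01 DC 00 in
big-endian order.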
16-bit Unicode encoding comes in big-endian and little-endian
flavours, with the endianness either marked or implied. UTF-16 is
marked with a byte order mark: FE FF at the start means big-endian,
FF FE means little-endian, and unmarked UTF-16 is interpreted as
big-endian. UTF-16BE is big-endian unmarked. UTF-16LE is
little-endian unmarked. UTF-16 is officially defined in Annex Q of
ISO/IEC 10646-1. (Copies of ISO standards are quite expensive.) It is
also described in the Unicode Consortium's Unicode Standard, as well
as in the IETF's RFC 2781. The putwchar code above shows how you
would encode 32-bit Unicode to UTF-16.
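To round it off, here is a minimal sketch of writing marked
big-endian UTF-16, reusing the putByte and putwchar helpers above
(writeMarked is my own name for it, not anything standard):

// write marked big-endian UTF-16: byte order mark FE FF, then the text
void writeMarked( int[] codePoints )
{
    putByte( 0xfe );  // byte order mark; FF FE here would
    putByte( 0xff );  // announce little-endian instead
    for ( int i = 0; i < codePoints.length; i++ )
    {
        putwchar( codePoints[ i ] );
    }
}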