Keith Thompson said:
I'm thinking of UTF-8 rather than wide characters, mainly so that in the [...]

Siri Cruz said:
I would suggest sticking to Unicode and letting callers use iconv to
handle anything else. If wchar_t is Unicode, there's little problem
supporting both. Conversion between UTF8, UTF16, and Unicode is
straightforward. You could designate one (the most frequently used?)
as a base implementation and then do alternate versions that convert
to it, call the base, and convert back.
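A minimal sketch of that pattern, assuming a hypothetical base routine
frob_utf32() and using POSIX iconv for the conversions (error handling
is elided; "UTF-32LE" assumes a little-endian host, and encoding names
vary by platform):

    #include <iconv.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical base implementation: works on UTF-32 code units. */
    void frob_utf32(uint32_t *s, size_t n)
    {
        (void)s; (void)n;   /* ... the real work would happen here ... */
    }

    /* UTF-8 wrapper: convert in, call the base, convert back out. */
    int frob_utf8(char *in8, size_t inlen, char *out8, size_t outlen)
    {
        size_t cap = inlen * sizeof(uint32_t);  /* worst-case UTF-32 size */
        char *buf32 = malloc(cap);
        if (buf32 == NULL)
            return -1;

        iconv_t cd = iconv_open("UTF-32LE", "UTF-8");
        char *src = in8, *dst = buf32;
        size_t srcleft = inlen, dstleft = cap;
        iconv(cd, &src, &srcleft, &dst, &dstleft);
        iconv_close(cd);

        size_t n = (cap - dstleft) / sizeof(uint32_t);
        frob_utf32((uint32_t *)buf32, n);

        cd = iconv_open("UTF-8", "UTF-32LE");
        src = buf32; dst = out8;
        srcleft = cap - dstleft; dstleft = outlen;
        iconv(cd, &src, &srcleft, &dst, &dstleft);
        iconv_close(cd);

        free(buf32);
        return 0;
    }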
Keith Thompson said:
It's not clear that you understand what "Unicode" means.
Unicode is a mapping between characters and "code points", which are
integer values. It does not by itself specify how those code points are
represented.
Siri Cruz said:
And there are something like 2^n (n=24?) code points, each assigned an
integer value from 0 to 2^n-1. So I use 'Unicode' to refer to a set and a
C representation that is isomorphic to the set of code points.
Keith Thompson said:
Unicode consists of 17 planes of 65536 code points each, for a total of
1,114,112 code points from 0x0 to 0x10FFFF. (I recall seeing a
statement on unicode.org that it will never exceed that upper bound.)
So 21 bits are more than enough to represent all code points.
The term Unicode by itself refers to the mapping between characters and
code points, *not* to any particular representation of that mapping.
It's an important distinction.
What exactly do you mean by "Conversion between UTF8, UTF16, and
Unicode"? If you're talking about a representation that uses a full 32
bits to represent each code point, that's called UTF-32 or UCS-4.
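In UTF-32, each code point occupies one 32-bit unit. A minimal sketch of
the corresponding validity check, adding one detail not mentioned above:
the surrogate range 0xD800..0xDFFF is reserved for UTF-16 and is excluded
from well-formed UTF-32 as well:

    #include <stdbool.h>
    #include <stdint.h>

    /* A 32-bit unit is a valid UTF-32 code unit iff it is at most
       0x10FFFF and is not a surrogate (0xD800..0xDFFF), a range
       Unicode reserves for the UTF-16 encoding form. */
    bool valid_utf32(uint32_t cp)
    {
        return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
    }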
Siri Cruz said:
UTF-8 and UTF-16 are maps from the Unicode set to strings of 8-bit or
16-bit naturals. They are not isomorphisms, because some 8-bit strings do
not map into Unicode.
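Concrete examples of that (going slightly beyond the post, but standard):
a lone UTF-8 continuation byte and an unpaired UTF-16 surrogate are both
strings of the right unit type that no Unicode string maps to:

    #include <stdio.h>

    int main(void)
    {
        /* 0x80 is a continuation byte (10xxxxxx); with no lead byte in
           front of it, the one-byte string { 0x80 } is not the UTF-8
           image of any Unicode string.  0xC0 0x80, an overlong encoding
           of U+0000, is likewise rejected by a conforming decoder. */
        const unsigned char not_utf8[] = { 0x80 };
        const unsigned char overlong[] = { 0xC0, 0x80 };

        /* The ill-formed UTF-16 analogue: an unpaired high surrogate. */
        const unsigned short not_utf16[] = { 0xD800 };

        printf("%zu %zu %zu\n",
               sizeof not_utf8, sizeof overlong, sizeof not_utf16);
        return 0;
    }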
Keith Thompson said:
A small quibble: the term "natural number" traditionally refers only to
*positive* integers; 0 is a valid Unicode code point. (The term
"natural number" is also sometimes used to refer to the non-negative
integers.)
Siri Cruz said:
But it could be whatever the C implementor regards as the natural wide
character set.
Keith Thompson said:
Sure.
I'm not sure that Microsoft's use of 16 bits for wchar_t is even
conforming. The C standard says that wchar_t "is an integer type whose
range of values can represent distinct codes for all members of the
largest extended character set specified among the supported locales".
16 bits covers UCS-2 (which can only represent code points from 0 to
65535), but UTF-16 encodes code points above 0xFFFF as pairs of 16-bit
units, so no single 16-bit value is a distinct code for those characters;
using wchar_t for UTF-16 arguably violates the C standard's requirements.
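To make that concrete, here is a minimal sketch of the surrogate-pair
encoding (U+1F600 is just an arbitrary supplementary-plane example):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* UTF-16 surrogate-pair encoding: subtract 0x10000, then split
           the remaining 20 bits across a high surrogate (0xD800 + top
           10 bits) and a low surrogate (0xDC00 + bottom 10 bits). */
        uint32_t cp = 0x1F600;
        uint32_t v  = cp - 0x10000;
        uint16_t hi = (uint16_t)(0xD800 + (v >> 10));
        uint16_t lo = (uint16_t)(0xDC00 + (v & 0x3FF));

        /* Prints "U+1F600 -> 0xD83D 0xDE00": two units, so a 16-bit
           wchar_t cannot hold a distinct code for this character. */
        printf("U+%04X -> 0x%04X 0x%04X\n",
               (unsigned)cp, (unsigned)hi, (unsigned)lo);
        return 0;
    }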