damjan
This may look like a silly question to some, but the more I try to
understand Unicode the more lost I feel. I should say that I am not a
beginner C++ programmer; I simply never had to delve into the
intricacies of character encoding before.
In C/C++, Unicode characters are introduced by means of the wchar_t
type. Based on the presence of the _UNICODE definition, the C functions
are macro'd to either the normal version or the one prefixed with w.
Because this is all standard C, it should be platform independent, and
as far as I understand, all Unicode characters in C (and Windows) are
16-bit (because wchar_t is usually typedef'd as unsigned short).
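To show what I mean, here is roughly how I picture the _UNICODE switch
working (just a sketch of my understanding, assuming MSVC and the
Microsoft <tchar.h> header; please correct me if I have this wrong):

    /* With _UNICODE defined, _TCHAR should be wchar_t and _tcslen should
       map to wcslen; without it, _TCHAR is char and _tcslen is strlen. */
    #include <tchar.h>
    #include <stdio.h>

    int main(void)
    {
        _TCHAR text[] = _T("hello");
        printf("element size: %u bytes, length: %u\n",
               (unsigned)sizeof(_TCHAR), (unsigned)_tcslen(text));
        return 0;
    }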
Now, the various UTF encodings (UTF-8, UTF-16 and UTF-32) define varying
character sizes (e.g. UTF-8 uses 1 to 4 bytes per character). How is
that compatible with the C wchar_t? I have even been assured by others
that the compiler calculates the necessary storage size of each Unicode
character, which may thus be variable, so a pointer increment on a
string would sometimes advance by 1, 2 or 4 bytes. I think this is
absurd and have not been able to produce such behaviour on Windows.
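My own little test seems to confirm that the pointer always moves by
exactly one byte and that any multi-byte stepping has to be done by hand
from the lead byte (the utf8_len helper is only my guess at the UTF-8
rules, so take it with a grain of salt):

    #include <stdio.h>

    /* Length of a UTF-8 sequence, read from its lead byte. */
    static int utf8_len(unsigned char lead)
    {
        if (lead < 0x80)           return 1; /* 0xxxxxxx: plain ASCII     */
        if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx: 2-byte sequence */
        if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx: 3-byte sequence */
        if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx: 4-byte sequence */
        return 1;                            /* invalid byte: just move on */
    }

    int main(void)
    {
        /* "a", "Ä" and "€" written out as UTF-8 bytes:
           a 1-byte, a 2-byte and a 3-byte sequence. */
        const unsigned char s[] = { 'a', 0xC3, 0x84, 0xE2, 0x82, 0xAC, 0 };
        for (const unsigned char *p = s; *p; p += utf8_len(*p))
            printf("code point at byte offset %d occupies %d byte(s)\n",
                   (int)(p - s), utf8_len(*p));
        return 0;
    }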
I would rest with that conviction if I hadn't found web pages like
http://evanjones.ca/unicode-in-c.html, where the obviously competent
author asserts that wstring characters are 32-bit. The C++ STL book by
Josuttis explains virtually nothing on the matter.
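The only concrete difference I can actually measure is the raw element
size. On my machine (Windows/MSVC) the following prints 2; if I read the
linked article correctly, it would print 4 on Linux/gcc, which would
explain the 32-bit claim:

    #include <stdio.h>
    #include <string>

    int main()
    {
        std::wstring w = L"test";
        /* 2 on my Windows/MSVC build; apparently 4 elsewhere. */
        printf("sizeof(wchar_t) = %u, elements in wstring = %u\n",
               (unsigned)sizeof(wchar_t), (unsigned)w.size());
        return 0;
    }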
So, my question is whether anyone can explain this character-size mess
in as understandable a way as possible, or at least post a link to the
right place. I have read Petzold and believe that C simply uses fixed
16-bit Unicode, but how does that combine with the Unicode encodings?
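For instance, I do not understand what is supposed to happen with a
character above U+FFFF if wchar_t is only 16 bits. Does something like
the following really end up as two elements (a UTF-16 surrogate pair),
as I would have to assume? (This also assumes the compiler accepts the
\U escape at all.)

    #include <stdio.h>
    #include <string>

    int main()
    {
        /* U+1D11E (musical G clef) does not fit in 16 bits. With a 16-bit
           wchar_t I assume it is stored as the surrogate pair 0xD834 0xDD1E,
           i.e. as two wstring elements; with a 32-bit wchar_t, as one. */
        std::wstring w = L"\U0001D11E";
        printf("elements used: %u\n", (unsigned)w.size());
        return 0;
    }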
dj