Unicode mess in C++

damjan

This may look like a silly question to someone, but the more I try to
understand Unicode the more lost I feel. Let me say that I am not a
beginner C++ programmer; I simply had no need to delve into character
encoding intricacies before.

In C/C++, Unicode characters are introduced by means of the wchar_t
type. Based on the presence of the _UNICODE definition, C functions are
macro'd to either the normal version or the one prefixed with w. Because
this is all standard C, it should be platform independent, and as far as
I understand, all Unicode characters in C (and Windows) are 16-bit
(because wchar_t is usually typedef'd as unsigned short).

Now, the various UTF encodings (UTF-8, -16, or -32) define variable
character sizes (e.g. UTF-8 from 1 to 4 bytes). How is that compatible
with the C version of wchar_t? I've even been convinced by others that
the compiler calculates the necessary storage size of the Unicode
character, which may thus be variable. So a pointer increment on a
string would sometimes progress by 1, 2 or 4 bytes. I think this is
absurd and have not been able to produce such behaviour on Windows.

I would rest with that conviction if I didn't find web pages like
http://evanjones.ca/unicode-in-c.html, where the obviously competent
author asserts that wstring characters are 32-bit. The C++ STL book by
Josuttis explains virtually nothing on the matter.

So, my question is whether anyone can explain this character size mess
in as understandable a way as possible, or at least post a link to the
right place. I've read Petzold and believe that C simply uses fixed
16-bit Unicode, but how does that combine with the Unicode encodings?

dj
 
Alf P. Steinbach

* damjan:
This may look like a silly question to someone, but the more I try to
understand Unicode the more lost I feel. Let me say that I am not a
beginner C++ programmer; I simply had no need to delve into character
encoding intricacies before.

In C/C++, Unicode characters are introduced by means of the wchar_t
type. Based on the presence of the _UNICODE definition, C functions are
macro'd to either the normal version or the one prefixed with w.

No, that's not standard C.

Because this is all standard C

It isn't.

it should be platform independent, and as far as I understand, all
Unicode characters in C (and Windows) are 16-bit (because wchar_t is
usually typedef'd as unsigned short).

They're not.

Now, the various UTF encodings (UTF-8, -16, or -32) define variable
character sizes (e.g. UTF-8 from 1 to 4 bytes). How is that compatible
with the C version of wchar_t?

It isn't. UTF-8 is however compatible with C and C++ 'char'. One or
more 'char' per character.
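A minimal sketch of what that means in practice (nothing beyond standard
C++ is assumed; the byte values below spell "héllo" in UTF-8):

#include <iostream>
#include <string>

int main()
{
    // 'é' (U+00E9) takes two bytes in UTF-8; the bytes are written out
    // explicitly so the example does not depend on the source encoding.
    std::string s = "h\xc3\xa9llo";

    // size() counts char elements (bytes), not Unicode characters:
    // prints 6 although the text has only 5 characters.
    std::cout << s.size() << '\n';
}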

I've even been convinced by others that the compiler calculates the
necessary storage size of the Unicode character, which may thus be
variable.
No.


So a pointer increment on a string would sometimes progress by 1, 2 or
4 bytes. I think this is absurd

It is.

and have not been able to produce such behaviour on Windows.

Not surprising.

I would rest with that conviction if I didn't find web pages like
http://evanjones.ca/unicode-in-c.html, where the obviously competent
author
Uh.

asserts that wstring characters are 32-bit.

They're not (necessarily).

The C++ STL book by Josuttis explains virtually nothing on the matter.

So, my question is whether anyone can explain this character size mess
in as understandable a way as possible, or at least post a link to the
right place. I've read Petzold and believe that C simply uses fixed
16-bit Unicode, but how does that combine with the Unicode encodings?

The simple explanation is that C and C++ don't support Unicode any more
than these languages support, say, graphics. The basic operations needed
to implement Unicode support are present. What you do is either
implement a library yourself, or use one implemented by others.
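To make that concrete, here is a rough sketch of the kind of routine
such a library would provide for UTF-8 held in a std::string; it is only
an illustration and does no validation of malformed input:

#include <iostream>
#include <string>

// Decode the UTF-8 sequence starting at position i and advance i past it.
unsigned long next_code_point(const std::string& s, std::size_t& i)
{
    unsigned char b = s[i++];
    if (b < 0x80) return b;                     // single-byte (ASCII) sequence
    int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
    unsigned long cp = b & (0x3F >> extra);     // payload bits of the lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (s[i++] & 0x3F);       // 6 payload bits per trailing byte
    return cp;
}

int main()
{
    std::string s = "h\xc3\xa9llo";             // "héllo" in UTF-8
    for (std::size_t i = 0; i < s.size(); )
        std::cout << std::hex << next_code_point(s, i) << ' ';
    std::cout << '\n';                          // prints: 68 e9 6c 6c 6f
}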
 
loufoque

damjan wrote:
In C/C++, Unicode characters are introduced by means of the wchar_t
type.

Wrong.
wchar_t has nothing to do with Unicode.

Based on the presence of the _UNICODE definition, C functions are
macro'd to either the normal version or the one prefixed with w.

That's the MS Windows way of doing things.
And it's not a very good way, IMHO.
Because this is all standard C, it should be platform independent, and
as far as I understand, all Unicode characters in C (and Windows) are
16-bit (because wchar_t is usually typedef'd as unsigned short).

wchar_t can be any size (bigger than a byte, obviously).
Still, it's usually 16 or 32 bits.
On GNU/Linux, for example, it's 32 bits.
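If you want to see what your own toolchain uses, a one-liner is enough:

#include <iostream>

int main()
{
    // Typically prints 2 with MSVC on Windows and 4 with GCC on GNU/Linux,
    // but the standard does not fix the exact size.
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
}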

Now, the various UTF encodings (UTF-8, -16, or -32) define variable
character sizes (e.g. UTF-8 from 1 to 4 bytes). How is that compatible
with the C version of wchar_t?

Well, you can convert from UTF-8, UTF-16 or UTF-32 to UCS-2 or UCS-4
depending on the size of wchar_t if you wish.
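One locale-dependent way to do such a conversion with only the standard
library is mbstowcs; this sketch assumes a UTF-8 locale is installed
under the name "en_US.UTF-8", which varies between systems:

#include <clocale>
#include <cstdlib>
#include <iostream>

int main()
{
    // The locale name here is an assumption; on other systems the name
    // (and the narrow encoding it implies) will differ.
    std::setlocale(LC_ALL, "en_US.UTF-8");

    const char* utf8 = "h\xc3\xa9llo";      // "héllo" as UTF-8 bytes
    wchar_t wide[16];

    // mbstowcs converts from the locale's narrow encoding to wchar_t;
    // with a UTF-8 locale and 32-bit wchar_t that is UTF-8 -> UCS-4.
    std::size_t n = std::mbstowcs(wide, utf8, 16);
    if (n != static_cast<std::size_t>(-1))
        std::cout << n << " wide characters\n";   // prints 5 on success
}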

I've even been convinced by others that the compiler calculates the
necessary storage size of the Unicode character, which may thus be
variable. So a pointer increment on a string would sometimes progress by
1, 2 or 4 bytes. I think this is absurd and have not been able to
produce such behaviour on Windows.

Indeed, this is absurd, unless you use a clever Unicode string type,
which is what I advise if you want to build C++ applications that are
Unicode-aware.

I would rest with that conviction if I didn't find web pages like
http://evanjones.ca/unicode-in-c.html, where the obviously competent
author asserts that wstring characters are 32-bit. The C++ STL book by
Josuttis explains virtually nothing on the matter.

If you want to code in C++, you shouldn't even try to search for
solutions in C. I mean, string handling in C is tedious and annoying;
why bother with that when you have nicer alternatives in C++?
So, my question is whether anyone can explain this character size mess
in as understandable a way as possible, or at least post a link to the
right place. I've read Petzold and believe that C simply uses fixed
16-bit Unicode, but how does that combine with the Unicode encodings?

In C you could usually use char* with UTF-8, or wchar_t with UCS-2 or UCS-4.
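Roughly, those two options look like this; the wide literal's
representation depends on the compiler's wide execution character set,
so the element counts in the comments are the usual case, not a
guarantee:

#include <cstring>
#include <cwchar>
#include <iostream>

int main()
{
    const char*    narrow = "h\xc3\xa9llo"; // UTF-8: 6 bytes for 5 characters
    const wchar_t* wide   = L"h\u00e9llo";  // usually one wchar_t per character

    std::cout << std::strlen(narrow) << '\n';  // 6
    std::cout << std::wcslen(wide)   << '\n';  // 5
}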

The solution I find the nicest for Unicode is Glib::ustring.
Unfortunately it's part of glibmm, which is rather big, and some people
just don't want to have such a dependency.
 
dj

loufoque said:
damjan wrote:


Wrong.
wchar_t has nothing to do with Unicode.

Well, perhaps not philosophically, but it is the way 16-bit chars step
into C. Perhaps I am too much into Windows, but in the Microsoft
documentation wchar_t is almost a synonym for Unicode. How else would
you understand this (from Petzold's book):

quote:
If the _UNICODE identifier is defined, TCHAR is wchar_t:
typedef wchar_t TCHAR;
end of quote:
That's the MS Windows way of doing things.
And it's not a very good way, IMHO.

True, perhaps I am just too used to the Microsoft way of "adapting" things.
wchar_t can be any size (bigger than a byte, obviously).
Still, it's usually 16 or 32 bits.
On GNU/Linux, for example, it's 32 bits.

Now, this is something that should probably bother me if I intend to
program multiplatform. I hope Java is more consistent than that. So how
do I declare a platform-independent wide character?
Well, you can convert from UTF-8, UTF-16 or UTF-32 to UCS-2 or UCS-4
depending on the size of wchar_t if you wish.



Indeed, this is absurd, unless you use a clever Unicode string type,
which is what I advise if you want to build C++ applications that are
Unicode-aware.

P.S. Though MBCS works exactly that way, I think.
If you want to code in C++, you shouldn't even try to search for
solutions in C. I mean, string handling in C is tedious and annoying;
why bother with that when you have nicer alternatives in C++?

Don't be misled by the C in the title. wstring is an STL template class.
In C you could usually use char* with UTF-8, or wchar_t with UCS-2 or UCS-4.

Now how could I use a 1-byte char for a Unicode character, even if it is
encoded as UTF-8? According to Wikipedia, UCS-2 is a fixed 16-bit
Unicode encoding. Well, this sounds to me like a perfect match for the C
representation, but again, if wchar_t is not necessarily 16-bit ...
The solution I find the nicest for Unicode is Glib::ustring.
Unfortunately it's part of glibmm, which is rather big, and some people
just don't want to have such a dependency.

Thanks for the advice; however, I am very reluctant to keep adopting new
libraries. After all, there exist a zillion implementations of the
string class, adding to the overall chaos.
 
Julián Albo

dj said:
Well, perhaps not philosophically, but it is the way 16-bit chars step
into C. Perhaps I am too much into Windows, but in the Microsoft
documentation wchar_t is almost a synonym for Unicode. How else would
you understand this (from Petzold's book):

quote:
If the _UNICODE identifier is defined, TCHAR is wchar_t:
typedef wchar_t TCHAR;
end of quote:

I understand that Petzold assumes his readers know the context. After
all, "...Programming Windows..." in the book's title looks clear enough.

If you want to talk about Windows or Windows compiler peculiarities,
better do it in some Windows programming group.
 
dj

Alf said:
* damjan:

No, that's not standard C.



It isn't.

I agree, the macro expansion is a Microsoft idea. But the functions for
handling wide (Unicode) chars are prefixed by w; that is standard,
right?
They're not.



It isn't. UTF-8 is however compatible with C and C++ 'char'. One or
more 'char' per character.

So which encoding does wchar_t "use" (i.e. which is it compatible with)?
It is.



Not surprising.



They're not (necessarily).



The simple explanation is that C and C++ don't support Unicode any more
than these languages support, say, graphics. The basic operations needed
to implement Unicode support are present. What you do is either
implement a library yourself, or use one implemented by others.

OK, so my conclusion is that C's wchar_t and Unicode really have no
logical connection. wchar_t is only a way to allow for strings of 2- or
more-byte characters, and there is no other way to handle 4-byte Unicode
chars and the various Unicode encodings in Visual C++ but to implement
my own library or search for one.
 
dj

Julián Albo said:
I understand that Petzold assumes his readers know the context. After
all, "...Programming Windows..." in the book's title looks clear enough.

If you want to talk about Windows or Windows compiler peculiarities,
better do it in some Windows programming group.

The title of the section I quoted from is "Wide Characters and C";
"Wide Characters and Windows" is the next section. I am not a Windows
freak, just most used to it. My original question was about Unicode and
C (it just turned out to be a Microsoft version of C). If that bothers
you, skip this thread next time.
 
Julián Albo

dj said:
I am not a Windows freak, just most used to it. My original question was
about Unicode and C (it just turned out to be a Microsoft version of C).
If that bothers you, skip this thread next time.

Start by learning that C and C++ are different languages.
 
Jonathan Mcdougall

dj said:
Well, perhaps not philosophically, but it is the way 16-bit chars step
into C.

wchar_t is a means, not a solution. It is a standard type of a size
"large enough to hold the largest character set supported by the
implementation's locale" [1]. However, how you use it, or whether you
want to use it at all, is left to you.
Perhaps I am too much into Windows, but in the Microsoft documentation
wchar_t is almost a synonym for Unicode.

That's how Microsoft Windows decided to work; other systems may do
otherwise. However, standard C++ doesn't mandate a specific encoding for
wchar_t.
True, perhaps I am just too used to the Microsoft way of "adapting" things.

This is a dangerous thing: to expect platform-specific behavior to be
standard. This often happens when one learns about platform-specific
libraries and features before learning the language itself. You must be
aware of what's standard and what's not.
Now, this is something that should probably bother me if I intend to
program multiplatform. I hope Java is more consistent than that. So how
do I declare a platform-independent wide character?

That's impossible in C++. You must look in your compiler's documentation
and find an appropriate type (and check again when a new version comes
out). If you expect to port your program, use the preprocessor to
typedef these types.
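An illustrative sketch of that preprocessor approach; the name my_char32
and the platform test are assumptions for the example, not an
established convention:

// Goal: a type that holds a full code point on both platforms named below.
#if defined(_WIN32)
    typedef unsigned int my_char32;   // MSVC's wchar_t is only 16 bits
#else
    typedef wchar_t      my_char32;   // typically 32 bits on GNU/Linux
#endif

#include <iostream>

int main()
{
    // One element per code point either way, but none of the standard
    // narrow/wide string functions apply to my_char32 strings directly.
    std::cout << "sizeof(my_char32) = " << sizeof(my_char32) << '\n';
}
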
P.S. Though MBCS works exactly that way, I think.

Actually, all the string types I know of work that way. Advancing an
iterator goes to the next conceptual character, not necessarily
sizeof(char) bytes forward.
Now how could I use a 1-byte char for a Unicode character, even if it is
encoded as UTF-8?

You could use multiple chars to represent a character code.
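For example, U+00E9 ('é') is carried by two chars in UTF-8, and the code
point can be reassembled from them by hand:

#include <iostream>

int main()
{
    // U+00E9 as UTF-8: two char values, 0xC3 0xA9.
    const unsigned char e_acute[] = { 0xC3, 0xA9 };

    // A 2-byte UTF-8 sequence carries 5 payload bits in the lead byte
    // and 6 in the continuation byte; put them back together:
    unsigned int cp = ((e_acute[0] & 0x1F) << 6) | (e_acute[1] & 0x3F);
    std::cout << std::hex << cp << '\n';    // prints e9
}
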
Thanks for the advice; however, I am very reluctant to keep adopting new
libraries. After all, there exist a zillion implementations of the
string class, adding to the overall chaos.

Character encoding is a tricky discussion in C++. However, there are
some good libraries that do the job well. Glib's ustring type is
excellent. Use it.

If you don't want it, you can either search for another portable string
library, use platform-specific features (yuck) or roll your own (shame
on you).


Jonathan
 
dj

Julián Albo said:
Start by learning that C and C++ are different languages.

I know that, but the topic applies to both C and C++ (e.g. WCHAR and
wstring), so I mixed them in the text. Actually, I consider C++ an
evolution of C, but hey, that's just my naive interpretation. You got me
on this one, though.
 
Jack Klein

I agree, the macro expansion is a Microsoft idea. But the functions for
handling wide (Unicode) chars are prefixed by w; that is standard,
right?

[snip]

The functions for handling wide characters are indeed prefixed by 'w'.
But they do not necessarily have anything at all to do with UNICODE.
There is no guarantee that wchar_t is wide enough to hold even the old
UNICODE (16 bits), let alone the newest version (still 18 bits, or did
they increase it again?).
 
Gianni Mariani

Jack said:
The functions for handling wide characters are indeed prefixed by 'w'.
But they do not necessarily have anything at all to do with UNICODE.
There is no guarantee that wchar_t is wide enough to hold even the old
UNICODE (16 bits), let alone the newest version (still 18 bits, or did
they increase it again?).

You need at least 21 bits to store a Unicode code point: a UTF-16
surrogate pair carries 20 bits, which sit on top of the 64K code points
of plane 0, so code points run up to U+10FFFF.
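A worked example, decoding U+1D11E (a code point outside plane 0) from
its UTF-16 surrogate pair:

#include <iostream>

int main()
{
    // U+1D11E (MUSICAL SYMBOL G CLEF) as a UTF-16 surrogate pair.
    unsigned int high = 0xD834, low = 0xDD1E;

    // Each surrogate contributes 10 bits; the 0x10000 offset accounts
    // for plane 0, so results go up to 0x10FFFF and need 21 bits.
    unsigned long cp = 0x10000UL + ((high - 0xD800) << 10) + (low - 0xDC00);
    std::cout << std::hex << cp << '\n';    // prints 1d11e
}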
 
AnonMail2005

damjan said:
[snip]

So, my question is whether anyone can explain this character size mess
in as understandable a way as possible, or at least post a link to the
right place. I've read Petzold and believe that C simply uses fixed
16-bit Unicode, but how does that combine with the Unicode encodings?

dj
You don't specify what your application is, so it is hard to give
recommendations. As others have mentioned, this is not really a C++
question but more of an application question.

Plenty of applications deal _internally_ with standard C++ characters
and std::string and then deal with the outside world via encoded
characters.

We do this, for instance, when our applications need to write XML for
external consumption. For these applications, we use the iconv library
to do character conversions. These calls are limited to the interface
we use to our XML document creator. It all comes down, eventually, to
converting standard C++ types (ints, doubles, std::strings) into their
equivalent character encoding. Since the number of types is limited, the
number of calls to the above-mentioned library is limited.
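For reference, a bare-bones sketch of an iconv call on a POSIX system;
the encoding names, buffer size and minimal error handling are
simplifying assumptions:

#include <iconv.h>
#include <cstring>
#include <iostream>

int main()
{
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");   // to-encoding, from-encoding
    if (cd == (iconv_t)-1)
        return 1;

    char in[]  = "h\xc3\xa9llo";                    // UTF-8 input ("héllo")
    char out[64];
    char* inp  = in;
    char* outp = out;
    std::size_t inleft  = std::strlen(in);
    std::size_t outleft = sizeof out;

    // iconv advances the pointers and decrements the byte counts as it goes.
    if (iconv(cd, &inp, &inleft, &outp, &outleft) != (std::size_t)-1)
        std::cout << (sizeof out - outleft) << " bytes of UTF-16\n";  // 10

    iconv_close(cd);
}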

Depending on your application, this may or may not work for you.
 
