Problem using wchar_t and wprintf

R

Rui Maciel

I've just started learning how to use the wchar_t data type as the basis for
Unicode strings and unfortunately I'm having quite a bit of problems, both
in the C front and the Unicode front.

In this case,it seems that the wprintf function isn't able to print a string
beyond the first character. I don't have a clue why this is happening. Here
is the test code:

<code>
#include <stdlib.h>
#include <wchar.h>


int main(int argc, char *argv[])
{
wchar_t *snafu = L"notação";
wprintf(L"%s\n",snafu);
return EXIT_SUCCESS;
}
</code>


On a side note, I was amazed at the amount of information available
regarding the whole Unicode in C issue. It's practically nonexistent. As a
sign, according to Google Groups since at far as 2000 this newsgroup only
saw about 18 threads where the the word wprintf was mentioned. Is everyone
purposely ignoring Unicode or is there a better, standard way to handle it
besides using wchar_t and all those w* functions?


Rui Maciel
 
G

Guest

Rui said:
I've just started learning how to use the wchar_t data type as the basis for
Unicode strings and unfortunately I'm having quite a bit of problems, both
in the C front and the Unicode front.

In this case,it seems that the wprintf function isn't able to print a string
beyond the first character. I don't have a clue why this is happening. Here
is the test code:

<code>
#include <stdlib.h>
#include <wchar.h>


int main(int argc, char *argv[])
{
wchar_t *snafu = L"notação";
wprintf(L"%s\n",snafu);
return EXIT_SUCCESS;
}
</code>

%s is the format specifier for an ordinary character string (char *),
not a wide character string. Use %ls for that. You can print multibyte
character strings and wide character strings both, with both printf()
and wprintf().
On a side note, I was amazed at the amount of information available
regarding the whole Unicode in C issue. It's practically nonexistent. As a
sign, according to Google Groups since at far as 2000 this newsgroup only
saw about 18 threads where the the word wprintf was mentioned. Is everyone
purposely ignoring Unicode or is there a better, standard way to handle it
besides using wchar_t and all those w* functions?

You cannot use printf() and wprintf() on the same output stream, and
the standard does not guarantee (to the best of my knowledge) that the
file format used by the wchar_t-based I/O functions is the same as
that of the char-based I/O functions. It may be better to read in and
write out data as multibyte strings, and only treat them as wide
strings internally.

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>

int main(int argc, char *argv[])
{
wchar_t *snafu = L"notação";

setlocale(LC_CTYPE, "");
printf("%ls\n",snafu);
return EXIT_SUCCESS;
}
 
R

Rui Maciel

Harald said:
%s is the format specifier for an ordinary character string (char *),
not a wide character string. Use %ls for that. You can print multibyte
character strings and wide character strings both, with both printf()
and wprintf().

Thanks! That did the trick.

You cannot use printf() and wprintf() on the same output stream, and
the standard does not guarantee (to the best of my knowledge) that the
file format used by the wchar_t-based I/O functions is the same as
that of the char-based I/O functions. It may be better to read in and
write out data as multibyte strings, and only treat them as wide
strings internally.

So it seems that the use of wchar_t and related functions is a nice source
of headaches and a hefty dose of PitA. If those problems weren't enough it
seems that there is a sever drought of information relating to that theme.
According to my experience, there isn't a single C tutorial that delves
into it. All those tutorials that mix C and C++ together were bad enough
but noticing that none of the decent ones even mentions wchar_t anywhere...
That's bad.

So, where can I get my hands on a nice document which explains the whole
Unicode through wchar_t thing?


Thanks for the help
Rui Maciel
 
Y

Yevgen Muntyan

Rui Maciel wrote:
[snip]
So it seems that the use of wchar_t and related functions is a nice source
of headaches and a hefty dose of PitA.

People say it's not bad if you don't demand too much from it.
E.g. if you have some string processing program which pretends
everything is latin, you may be able to replace char functions with
their wide characters equivalents and get a program which works with
Chinese, for free.
If those problems weren't enough it
seems that there is a sever drought of information relating to that theme.
According to my experience, there isn't a single C tutorial that delves
into it. All those tutorials that mix C and C++ together were bad enough
but noticing that none of the decent ones even mentions wchar_t anywhere...
That's bad.

So, where can I get my hands on a nice document which explains the whole
Unicode through wchar_t thing?

There is no "Unicode through wchar_t" thing. Wide character business in
C is *not* "Unicode in C". It depends on what you need. If you want to
get list of words in console from user and count them, you use wchar_t.
If you want to save a file and read it later, you better use
non-standard stuff, convert your data to whatever encoding you like and
back, etc. If you are on windows NT you can use wchar_t without worries
(it's fixed UTF16), but then there are problems with the standard. So
you don't want to know how to handle unicode in standard C, you want
to know what you have available on your platform(s) and pick
what's easier/better for you. (E.g. wchar_t if you only care
about windows; glib or icu if you want more portability; you may
use a library which handles everything internally and you may not
care about unicode and such at all, etc.)

Yevgen
 
B

Ben Pfaff

Yevgen Muntyan said:
There is no "Unicode through wchar_t" thing. Wide character business in
C is *not* "Unicode in C".

Well, not in general. If the compiler defines
__STDC_ISO_10646__, however, then wchar_t encodes ISO 10646 code
points, which is essentially Unicode.
 
Y

Yevgen Muntyan

Ben said:
Well, not in general. If the compiler defines
__STDC_ISO_10646__, however, then wchar_t encodes ISO 10646 code
points, which is essentially Unicode.

But even if wchar_t sequences handled by the C library can represent
whole unicode, you don't know *how* it does it. So if you actually need
to be able to transfer data somehow somewhere (like save and load, even
on the same machine), you get a problem. Which is why I said that, it
surely depends on "Unicode in C thing" interpretation :)
And one could say that conforming implementation is allowed to
ignore unicode and use ascii and one-byte wchar_t (you know,
we are in comp.lang.c).

Yevgen
 
Y

Yevgen Muntyan

Yevgen said:
But even if wchar_t sequences handled by the C library can represent
whole unicode, you don't know *how* it does it. So if you actually need
to be able to transfer data somehow somewhere (like save and load, even
on the same machine), you get a problem. Which is why I said that, it
surely depends on "Unicode in C thing" interpretation :)
And one could say that conforming implementation is allowed to
ignore unicode and use ascii and one-byte wchar_t (you know,
we are in comp.lang.c).

And there is always that nice MS implementation. So if "standard C" is
C99, then we ignore MS, which may be impractical; if "standard C" is
C90, then there is no standard way to do anything with unicode and
alike. So in the end what you do with wchar_t is really what you can do
and what you do in particular implementation(s).

Yevgen
 
G

Guest

Yevgen said:
But even if wchar_t sequences handled by the C library can represent
whole unicode, you don't know *how* it does it.

Yes, you do. If __STDC_ISO_10646__ is defined (there might also be a
minimum value), then

wchar_t wc = 0x20AC;

is guaranteed to set wc to the Euro sign, and vice versa.

You don't know how this will be encoded when you convert it to a
multibyte character using the standard routines (or whether it can be
at all), but that's not a wchar_t sequence issue. (And of course, you
can portably write this to a file as UTF-8 using your own conversion
routines (or UTF-32 if you like simplicity, or anything else), and
read it back the same way.)

[...]
And one could say that conforming implementation is allowed to
ignore unicode and use ascii and one-byte wchar_t (you know,
we are in comp.lang.c).

You seem to be saying that a one-byte eight-bit wchar_t is allowed but
useless. It's not. It's useful for making multibyte-aware programs
work without modifications even on systems that do not support
multibyte characters.
 
Y

Yevgen Muntyan

Harald said:
Yes, you do. If __STDC_ISO_10646__ is defined (there might also be a
minimum value), then

wchar_t wc = 0x20AC;

is guaranteed to set wc to the Euro sign, and vice versa.
You don't know how this will be encoded when you convert it to a
multibyte character using the standard routines (or whether it can be
at all), but that's not a wchar_t sequence issue. (And of course, you
can portably write this to a file as UTF-8 using your own conversion
routines (or UTF-32 if you like simplicity, or anything else), and
read it back the same way.)

I take "you don't know *how* it does it" part back, it was my
ignorance. If __STDC_ISO_10646__ is defined, you can actually
do all you need (after you write encoding/decoding routines,
with fancy UTF-8 character layout or UTF-16/32 byte order
marks, I need to finally learn these two!).
Still, there are implementations with working (meaning you can
do Chinese and Russian) wchar_t business but with no C99 (or is it
C95?) compliance. For instance, MS doesn't care about C99
at all; FreeBSD library doesn't have this macro defined (I don't
know if it actually can do Chinese in Russian locale, I guess
it should); glibc does have the macro defined. So if the macro is
defined, you're good; but if it's not defined, you're back to either
writing portable code which doesn't use wchar_t at all (or using
third-party libs for that purpose), or studying what exactly you have on
your target platforms, without any standard support.
I wonder if __STDC_ISO_10646__ is considered nice by (majority
of) implementors, or they tend to have "efficient" code.
[...]
And one could say that conforming implementation is allowed to
ignore unicode and use ascii and one-byte wchar_t (you know,
we are in comp.lang.c).

You seem to be saying that a one-byte eight-bit wchar_t is allowed but
useless. It's not. It's useful for making multibyte-aware programs
work without modifications even on systems that do not support
multibyte characters.

Sure, wchar_t can certainly be useful, and it's certainly good that
wchar_t code won't break if the system doesn't support unicode. But if
the system doesn't support unicode, and you got a file from Chinese
friend, it's useless. It's like a program which parses some text
and pretends everything is ASCII - it may be very useful, and in
in many setups it's all you need.
I'd rather say wchar_t facilities (as in C standard) are useless,
but it'd be too strong a statement, I presume it is used in lot of
software.

Best regards,
Yevgen
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,169
Messages
2,570,919
Members
47,458
Latest member
Chris#

Latest Threads

Top