Wide character input/output


Ioannis Vranos

[The current message encoding is set to Unicode (UTF-8) because it
contains Greek]


The following code does not work as expected:


#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main()
{
    char *p= setlocale( LC_ALL, "Greek" );

    wchar_t input[50];

    if (!p)
        printf("NULL returned!\n");

    fgetws(input, 50, stdin);

    wprintf(L"%s\n", input);

    return 0;
}


Under Linux:


[john@localhost src]$ ./foobar-cpp
Test
T
[john@localhost src]$


[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό
�
[john@localhost src]$




Under MS Visual C++ 2008 Express:

Test
Test

Press any key to continue . . .


Δοκιμαστικό
??????ε????

Press any key to continue . . .


Am I missing something?
 

Ben Bacarisse

Ioannis Vranos said:
[The current message encoding is set to Unicode (UTF-8) because it
contains Greek]


The following code does not work as expected:


#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main()
{
char *p= setlocale( LC_ALL, "Greek" );

wchar_t input[50];

if (!p)
printf("NULL returned!\n");

fgetws(input, 50, stdin);

wprintf(L"%s\n", input);

You need "%ls". This is very important with wprintf since, without the
'l', %s denotes a multi-byte character sequence (a char *).
printf("%ls\n", input) should also work. You need the w version only if
you want the multi-byte conversion of %s or if the format itself has to
be a wchar_t pointer.
return 0;
}


Under Linux:


[john@localhost src]$ ./foobar-cpp
Test
T
[john@localhost src]$


[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό
�
[john@localhost src]$

The above may not be the only problem. In cases like this, you need to
say what encoding your terminal is using.
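
A minimal sketch of the distinction, assuming a C library whose wide printf family follows the C standard (glibc does): in a wide format string, "%ls" takes a wchar_t * while "%s" takes a char * and converts it from multi-byte to wide. The name format_both is made up for illustration. Microsoft's runtime has historically treated "%s" in wprintf as a wide string, which would likely explain why "Test" printed correctly under Visual C++; treat that as a probable explanation, not a verified one.

```c
#include <stddef.h>
#include <wchar.h>

/* Writes "<wide> / <narrow>" into dst, which holds cap wide characters.
   "%ls" consumes a wchar_t *; "%s" consumes a char * and converts it
   from the locale's multi-byte encoding to wide characters.
   Returns the number of wide characters written, or a negative value
   on failure (for example when dst is too small). */
int format_both(wchar_t *dst, size_t cap,
                const wchar_t *wide, const char *narrow)
{
    return swprintf(dst, cap, L"%ls / %s", wide, narrow);
}
```

Passing a wchar_t * where "%s" expects a char * is most likely what produced the lone "T" under Linux: with a 4-byte little-endian wchar_t, the bytes of L"Test" are 'T' followed by zero bytes, so the char * view of it ends after one character.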

<snip>
 

Ioannis Vranos

Ben said:
You need "%ls". This is very important with wprintf since, without the
'l', %s denotes a multi-byte character sequence (a char *).
printf("%ls\n", input) should also work. You need the w version only if
you want the multi-byte conversion of %s or if the format itself has to
be a wchar_t pointer.


Perhaps you may help me understand better. We have the usual char
encoding which is implementation defined (usually ASCII).

wchar_t is wide character encoding, which is the "largest character set
supported by the system", so I suppose Unicode under Linux and Windows.

What exactly is a multi-byte character?

I have to say that I am talking about C95 here, not C99.

return 0;
}


Under Linux:


[john@localhost src]$ ./foobar-cpp
Test
T
[john@localhost src]$


[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό
�
[john@localhost src]$

The above may not be the only problem. In cases like this, you need to
say what encoding your terminal is using.


You are somehow correct on this. My terminal encoding was UTF-8 and I
added Greek (ISO-8859-7). Under the latter, the following code works OK:


#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main()
{
    char *p= setlocale( LC_ALL, "Greek" );

    wprintf(L"Δοκιμαστικό\n");

    return 0;
}

[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό
[john@localhost src]$


Also the original, fixed according to your suggestion:


#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main()
{
    char *p= setlocale( LC_ALL, "Greek" );

    wchar_t input[50];

    if (!p)
        printf("NULL returned!\n");

    fgetws(input, 50, stdin);

    wprintf(L"%ls", input);

    return 0;
}

works OK too:

[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό
Δοκιμαστικό
[john@localhost src]$


It works OK under Terminal UTF-8 default encoding too. So "%ls" is what
was really needed.


BTW, how can we define UTF-8 as the locale?


Thanks a lot.
 

Ioannis Vranos

Ioannis said:
It works OK under Terminal UTF-8 default encoding too. So "%ls" is what
was really needed.


Actually the code:

#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main()
{
    char *p= setlocale( LC_ALL, "Greek" );

    wprintf(L"Δοκιμαστικό\n");

    return 0;
}

works only when I set the Terminal encoding to Greek (ISO-8859-7).
 

Ben Bacarisse

Ioannis Vranos said:
Perhaps you may help me understand better. We have the usual char
encoding which is implementation defined (usually ASCII).

wchar_t is wide character encoding, which is the "largest character
set supported by the system", so I suppose Unicode under Linux and
Windows.

What exactly is a multi-byte character?

It is a confusing term. It means an encoding that uses sequences of
ordinary bytes (in the C sense -- chars) to encode a large character
set. The most common example is UTF-8.
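
To make "sequences of ordinary bytes" concrete, here is a small sketch that counts characters in a UTF-8 string by hand. It assumes the input is valid UTF-8, and the names utf8_char_count and delta_omicron are invented for this example. In a real program the conversion is normally left to the library (mbstowcs, mbrtowc), which follows the LC_CTYPE locale instead of hard-coding UTF-8.

```c
#include <stddef.h>

/* Counts the characters in a UTF-8 string by skipping continuation
   bytes, which always have the bit pattern 10xxxxxx.
   Assumes s is valid UTF-8; no validation is attempted. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

/* The first two letters of the test word, written as explicit UTF-8
   bytes so the result does not depend on this file's encoding. */
static const char delta_omicron[] = "\xCE\x94\xCE\xBF";
```

The array holds four chars (plus the terminator) but only two characters, which is the gap between "bytes" and "characters" that the term multi-byte refers to.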
I have to say that I am talking about C95 here, not C99.

return 0;
}


Under Linux:


[john@localhost src]$ ./foobar-cpp
Test
T
[john@localhost src]$


[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό
�
[john@localhost src]$

The above my not be the only problem. In cases like this, you need to
say way encoding your terminal is using.


You are somehow correct on this.

Strange, I know!
My terminal encoding was UTF-8 and I
added Greek (ISO-8859-7). Under the latter, the following code works OK:


#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main()
{
char *p= setlocale( LC_ALL, "Greek" );

wprintf(L"Δοκιμαστικό\n");

return 0;
}

[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό
[john@localhost src]$


Also the original, fixed according to your suggestion:


#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main()
{
char *p= setlocale( LC_ALL, "Greek" );

wchar_t input[50];

if (!p)
printf("NULL returned!\n");

fgetws(input, 50, stdin);

wprintf(L"%ls", input);

return 0;
}

works OK too:

[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό
Δοκιμαστικό
[john@localhost src]$


It works OK under Terminal UTF-8 default encoding too. So "%ls" is
what was really needed.


BTW, how can we define UTF-8 as the locale?

I *think* this is now off-topic. I don't think C says anything about
what the locale string means...
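
For what it is worth, the standard does define two locale strings: "C" (the minimal locale every program starts in) and "" (the implementation's native environment). A rough sketch of using them follows; the function names are invented here, and whether the native environment ends up being UTF-8 depends entirely on how the system is configured (for example the LANG variable on POSIX-like systems).

```c
#include <locale.h>
#include <string.h>

/* Adopts whatever locale the environment requests. Returns the
   resulting locale name, or NULL if the request cannot be honoured
   (for example when the environment names a locale that is not
   installed). */
const char *use_environment_locale(void)
{
    return setlocale(LC_ALL, "");
}

/* Reports whether the current LC_CTYPE locale has the given name.
   Passing NULL to setlocale only queries; it changes nothing. */
int ctype_locale_is(const char *name)
{
    const char *current = setlocale(LC_CTYPE, NULL);

    return current != NULL && strcmp(current, name) == 0;
}
```

Calling use_environment_locale() at the top of main and checking the result for NULL lets the user's settings choose the encoding, so the program itself names no locale at all.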

The character encoding is usually specified after a '.'. I use, for
example, "en-GB.UTF-8". I suspect that if you only specify a part of
the locale (or one that does not make sense) your C library picks up
what to do from the execution environment. To me "Greek" looks like
an odd locale string. I would expect "el-GR.UTF-8" or
"el-GR.ISO8859-7".
 

Ben Bacarisse

Ioannis Vranos said:
Actually the code:

#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main()
{
char *p= setlocale( LC_ALL, "Greek" );

wprintf(L"Δοκιμαστικό\n");

return 0;
}

works only when I set the Terminal encoding to Greek (ISO-8859-7).

This sort of thing is almost impossible to investigate over Usenet.
Your news software will take your code and may or may not encode the
characters of the L"..." string in the encoding of your post (UTF-8).
It makes it very hard to know what the program text actually is.

Another complication is that the locale setting affects the run-time
behaviour, but your program also depends on what character encoding is
expected by the compiler that builds the string.
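
One way to take the source encoding out of the picture, sketched under the assumption that wchar_t values are Unicode code points (true for glibc, not guaranteed by C), is to spell the test word numerically. The names greek_word and greek_test_word are made up for this example.

```c
#include <wchar.h>

/* The word from the thread, spelled as numeric code points so that the
   source file's encoding cannot affect it. Assumes wchar_t values are
   Unicode code points. */
static const wchar_t greek_word[] = {
    0x0394, 0x03BF, 0x03BA, 0x03B9, 0x03BC, 0x03B1,
    0x03C3, 0x03C4, 0x03B9, 0x03BA, 0x03CC, 0
};

/* Returns the word. Printing it, e.g. with
   wprintf(L"%ls\n", greek_test_word()), still needs a locale whose
   encoding can represent Greek. */
const wchar_t *greek_test_word(void)
{
    return greek_word;
}
```

If this version prints correctly where the L"..." literal does not, the compiler's idea of the source encoding is the likely culprit and not the run-time locale.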
 

Ioannis Vranos

Ben said:
I *think* this is now off-topic. I don't think C says anything about
what the locale string means...

The character encoding is usually specified after a '.'. I use, for
example, "en-GB.UTF-8". I suspect that if you only specify a part of
the locale (or one that does not make sense) your C library picks up
what to do from the execution environment. To me "Greek" looks like
an odd locale string. I would expect "el-GR.UTF-8" or
"el-GR.ISO8859-7".


I got the idea from:

http://msdn2.microsoft.com/en-us/library/x99tb11d(VS.80).aspx

http://msdn2.microsoft.com/en-us/library/39cwe7zf(VS.80).aspx
 

Ioannis Vranos

Ben said:
I *think* this is now off-topic. I don't think C says anything about
what the locale string means...

The character encoding is usually specified after a '.'. I use, for
example, "en-GB.UTF-8". I suspect that if you only specify a part of
the locale (or one that does not make sense) your C library picks up
what to do from the execution environment. To me "Greek" looks like
an odd locale string. I would expect "el-GR.UTF-8" or
"el-GR.ISO8859-7".


This code works with gcc:

#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main()
{
    char *p= setlocale( LC_ALL, "greek" );

    wchar_t input[50];

    if (!p)
        printf("NULL returned!\n");

    fgetws(input, 50, stdin);

    wprintf(L"%ls", input);

    return 0;
}


[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό
Δοκιμαστικό
[john@localhost src]$


When I use "el-GR.UTF-8" or "el-GR.ISO8859-7" instead, I get:


[john@localhost src]$ ./foobar-cpp
NULL returned!

[john@localhost src]$
 

Ioannis Vranos

Ben said:
Ah, OK. Anyway, we are off-topic now. I think you'd have to post in
a Windows group to find out what locale strings mean there.


I am a Linux user. The "el-GR.UTF-8" and "el-GR.ISO8859-7" you suggested
make setlocale() return NULL. The "greek" and "Greek" suggested by
MSDN work. So I supposed there was a portable way to do this. Aren't
there any portable locale encoding strings?
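
As far as standard C goes, only "C" and "" are guaranteed to be accepted; every other name is implementation-defined. A common workaround, sketched here with invented names, is to try several spellings in turn. The candidate list is an assumption: it uses the underscore form that "locale -a" prints on Linux (the hyphenated "el-GR" form is probably why setlocale returned NULL), followed by the alias that worked in this thread.

```c
#include <locale.h>
#include <stddef.h>

/* Tries each candidate in order and returns the first name that
   setlocale accepts, or NULL if none is available. The accepted
   locale stays in effect. */
const char *first_available_locale(const char *const names[], size_t n)
{
    size_t i;

    for (i = 0; i < n; ++i) {
        if (setlocale(LC_ALL, names[i]) != NULL)
            return names[i];
    }
    return NULL;
}

/* Hypothetical candidate list: POSIX-style names with an underscore,
   then the alias used in the thread. */
static const char *const greek_candidates[] = {
    "el_GR.UTF-8", "el_GR.utf8", "el_GR.ISO8859-7", "greek", "Greek"
};
```

A caller would pass greek_candidates with its element count and stop with an error message if the result is NULL.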
 

CBFalconer

Ioannis said:
[The current message encoding is set to Unicode (UTF-8) because
it contains Greek]

The following code does not work as expected:

#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main() {
char *p= setlocale( LC_ALL, "Greek" );
wchar_t input[50];

if (!p)
printf("NULL returned!\n");
fgetws(input, 50, stdin);
wprintf(L"%s\n", input);
return 0;
}
.... snip ...

Am I missing something?

Yes. If setlocale fails, it returns NULL, which you detect, but do
not immediately exit the program. You also forgot to check for
errors in executing fgetws or wprintf.
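
A sketch of what those checks might look like, with the read-and-echo step pulled into a helper so that each call is tested. The name echo_wide_line is invented for this example, and fwprintf is used so the helper works on any stream, not only stdout.

```c
#include <stdio.h>
#include <wchar.h>

/* Reads one line of wide characters from in and writes it to out.
   Returns 0 on success, -1 if reading or writing fails. */
int echo_wide_line(FILE *in, FILE *out)
{
    wchar_t line[50];

    if (fgetws(line, (int)(sizeof line / sizeof line[0]), in) == NULL)
        return -1;   /* end of file, read error, or encoding error */
    if (fwprintf(out, L"%ls", line) < 0)
        return -1;   /* write or conversion error */
    return 0;
}
```

In main, one would call setlocale first and return EXIT_FAILURE at once if it yields NULL, then call echo_wide_line(stdin, stdout). Reporting the failure with fputs on stderr, instead of printf on stdout, also avoids mixing byte and wide output on the same stream, since a stream keeps the orientation set by its first I/O call.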
 

Ioannis Vranos

Ioannis said:
Clarified:



==> under Linux.


==> under Linux.


Also, based on
http://gcc.gnu.org/onlinedocs/libstdc++/22_locale/locale.html, which
mentions "locale -a" for listing locales: on my system it outputs, among
other things:


galego
galician
gd_GB
gd_GB.iso885915
gd_GB.utf8
german
gez_ER
gez_ER@abegede
gez_ER.utf8
gez_ER.utf8@abegede
gez_ET
gez_ET@abegede
gez_ET.utf8
gez_ET.utf8@abegede
gl_ES
gl_ES@euro
gl_ES.iso88591
gl_ES.iso885915@euro
gl_ES.utf8
==> greek
gu_IN
gu_IN.utf8
gv_GB
gv_GB.iso88591
gv_GB.utf8
hebrew
he_IL
he_IL.iso88598
he_IL.utf8
hi_IN
hi_IN.utf8
hr_HR
hr_HR.iso88592
hr_HR.utf8
hrvatski
hsb_DE
hsb_DE.iso88592
hsb_DE.utf8
hu_HU
hu_HU.iso88592
hu_HU.utf8
hungarian


So "greek" is a valid locale for Linux too.
 

Ben Bacarisse

Ioannis Vranos said:
Also, based on
http://gcc.gnu.org/onlinedocs/libstdc++/22_locale/locale.html, which
mentions "locale -a" for listing locales: on my system it outputs, among
other things:

galego
galician
gd_GB ....
gl_ES.iso885915@euro
gl_ES.utf8
==> greek

Post in comp.unix.programmer. I think you can define anything you
like under Linux, but what is and is not valid is not specified by C.
Other standards (like POSIX) probably specify much more.
 
