wcout, wprintf() only print English

Ioannis Vranos

James said:
You're still not telling us a lot of important information.
What is the actual encoding used in the source file, and what
are the bytes actually output. (FWIW: I think g++, and most
other compilers, just pass the bytes through transparently in a
narrow character string. Which means that your second code will
output whatever your editor put in the source file. If you're
using the same encoding everywhere, it will seem to work.)

Note that there isn't really any portable solution, because so
much depends on things the C++ compiler has no control over.
Run the same code in two different xterm, and it can output two
different things, completely; just specify a different font
(option -fn) with a different encoding for one of the xterm.
(And of course, it's pretty much par for the course to see one
thing when you cat to the screen, and something else when you
output the same file to the printer.)


I posted a C95 question in c.l.c., about this (which is a subset of
C++03) and I got a C95 working code. My last message there:

> Ben Bacarisse wrote:
>
> You need "%ls". This is very important with wprintf since without it
> %s denotes a multi-byte character sequence. printf("%ls\n", input)
> should also work. You need the w version if you want the multi-byte
> conversion of %s or if the format has to be a wchar_t pointer.
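
A minimal sketch of the distinction Ben describes, assuming a standard-conforming C library such as glibc (Microsoft's runtime has historically treated %s in the wide functions differently). The helper names format_wide and format_multibyte are made up for illustration, and swprintf stands in for wprintf only so that the result can be inspected instead of printed.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// %ls in a wide format string consumes a const wchar_t* argument.
std::wstring format_wide(const wchar_t* text)
{
    wchar_t buf[128];
    int n = std::swprintf(buf, 128, L"%ls", text);
    // A negative result means the conversion failed or did not fit.
    return n < 0 ? std::wstring() : std::wstring(buf, static_cast<std::size_t>(n));
}

// %s in a wide format string consumes a const char* holding a multibyte
// sequence, which the library converts to wide characters using the
// current C locale.
std::wstring format_multibyte(const char* text)
{
    wchar_t buf[128];
    int n = std::swprintf(buf, 128, L"%s", text);
    return n < 0 ? std::wstring() : std::wstring(buf, static_cast<std::size_t>(n));
}
```

Passing a wchar_t array to %s, as in the original failing program, makes the library read the wide data as if it were a char string, which would explain output that stops after the first letter.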


Perhaps you may help me understand better. We have the usual char
encoding which is implementation defined (usually ASCII).

wchar_t is wide character encoding, which is the "largest character set
supported by the system", so I suppose Unicode under Linux and Windows.

What exactly is a multi-byte character?

I have to say that I am talking about C95 here, not C99.

>
>> return 0;
>> }
>>
>>
>> Under Linux:
>>
>>
>> [john@localhost src]$ ./foobar-cpp
>> Test
>> T
>> [john@localhost src]$
>>
>>
>> [john@localhost src]$ ./foobar-cpp
>> Δοκιμαστικό
>> �
>> [john@localhost src]$
>
> The above may not be the only problem. In cases like this, you need to
> say what encoding your terminal is using.


You are somehow correct on this. My terminal encoding was UTF-8 and I
added Greek(ISO-8859-7). Under the last, the following code works OK:


#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main()
{
char *p= setlocale( LC_ALL, "Greek" );

wprintf(L"Δοκιμαστικό\n");

return 0;
}

[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό
[john@localhost src]$


Also the original, fixed according to your suggestion:


#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main()
{
char *p= setlocale( LC_ALL, "Greek" );

wchar_t input[50];

if (!p)
printf("NULL returned!\n");

fgetws(input, 50, stdin);

wprintf(L"%ls", input);

return 0;
}

works OK too:

[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό
Δοκιμαστικό
[john@localhost src]$


It works OK under Terminal UTF-8 default encoding too. So "%ls" is what
was really needed.


BTW, how can we define UTF-8 as the locale?


Thanks a lot.
 
James Kanze

Jeff Schwab wrote:

[...]
"Strangely" these also happen to my Linux box with "gcc
version 4.1.2 20070626".
cout prints Greek without the L notation to the string
literal.
The same with wcout prints an empty line.

I don't think the problem is so much wcout, as the wide
character literal. The compiler is obliged to interpret the
contents of the literal in some way, and I would guess that it's
not doing this in a way that conforms with the input you've given it.

What does the compiler documentation say about how it processes
characters outside of the basic character set? What happens if
you replace your characters with their UCN, e.g.:

std::wcout << L"\u0394\u03BF..." ;

?
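
A small sketch of what the UCN form buys you, assuming a compiler whose wide execution character set is Unicode (UTF-32 with g++ on Linux, UTF-16 with VC++). The function name greek_prefix is hypothetical. The source file stays pure ASCII, so the editor's encoding no longer matters for this literal.

```cpp
#include <string>

// The two UCNs name GREEK CAPITAL LETTER DELTA and GREEK SMALL LETTER
// OMICRON by code point, independently of the source file encoding.
std::wstring greek_prefix()
{
    return L"\u0394\u03BF";
}
```

Whether the terminal then shows the right glyphs still depends on the locale used for output; this only takes the source encoding out of the picture.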
The same with wcout and L notation prints question marks.
This made me think to use plain cout, and it also works:
#include <iostream>
int main()
{
std::cout << "Δοκιμαστικό μήνυμα\n";
}
also prints the Greek message.
Seeing this I am assuming char is implemented as unsigned char
and this is working because Greek is provided in the extended
ASCII character set (values 128-255) supported by my system (I
have set the regional settings under GNOME etc). However why
does this also work for you?

Most likely, the compiler is just generating code which copies
the characters bit patterns, without ever looking at their
numeric values. So the signedness of char is irrelevant
(here---in other places, it can cause problems).
#include <iostream>
#include <limits>
int main()
{
using namespace std;
cout<< static_cast<int>( numeric_limits<char>::max() )<< endl;
}
produces in my system:
[john@localhost src]$ ./foobar-cpp
127

In other words, plain char is signed. (It usually is, for some
reason.)
[john@localhost src]$
so I am wrong, char is implemented as signed char, and no
extended ASCII takes place.

There's no such thing as "extended ASCII":). Still, I
regularly used ISO 8859-15 in plain char's, on machines which
are signed. If I look at the numeric value of the char, it's
wrong, but the bits are right, and they get copied through
correctly.

I just have to be careful when I use functions which expect an
int in the range [0...UCHAR_MAX]. (Those in the <cctype>
header, for example.)
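
To illustrate that last caution, here is a minimal sketch (count_alpha is a made-up helper). Where plain char is signed, a byte such as 0xCE becomes a negative int when passed directly, which is outside what the <cctype> functions accept; converting through unsigned char first keeps the value in [0...UCHAR_MAX].

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Count alphabetic bytes according to the current C locale.
std::size_t count_alpha(const std::string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // The cast makes the argument non-negative even when char is signed.
        if (std::isalpha(static_cast<unsigned char>(s[i])))
            ++n;
    }
    return n;
}
```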
 
James Kanze

You and me both. I would be very surprised if this were a GCC
bug (I'm using 4.2.4 pre-release), but I'm guessing somebody
here knows a lot more about this than we do, and is willing to
enlighten us. :)

It wouldn't surprise me if g++ (or any other compiler) had some
bugs in this. It's far from trivial. But for the moment,
nothing you've shown seems particularly surprising to me. (In
fact, I'm sure that there is one bug in g++. Most of what is
involved here is implementation defined, and the standard says
that a conforming implementation must document its choices. I
haven't found any such documentation for g++.)
 
Jeff Schwab

James said:
You're still not telling us a lot of important information.
What is the actual encoding used in the source file,
UTF-8

and what
are the bytes actually output.

0x3f eleven times (UTF-8 question mark '?'), followed by one 0x20
(literal space ' '), followed by six more 0x3f.

??????????? ??????
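
For anyone who wants to answer the "what bytes are actually output" question from inside a program, a minimal sketch follows. The helper hex_dump is hypothetical; piping the program's output through a tool such as od or hexdump gives the same information.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Render each byte of the string as two lowercase hex digits,
// separated by single spaces.
std::string hex_dump(const std::string& bytes)
{
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        char buf[4];
        unsigned value = static_cast<unsigned char>(bytes[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i != 0)
            out += ' ';
        out += buf;
    }
    return out;
}
```

A run of 3f values, as reported here, means the conversion replaced every Greek letter with '?' before the bytes ever reached the terminal.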
(FWIW: I think g++, and most
other compilers, just pass the bytes through transparently in a
narrow character string. Which means that your second code will
output whatever your editor put in the source file. If you're
using the same encoding everywhere, it will seem to work.)

That is probably what is happening.

Note that there isn't really any portable solution, because so
much depends on things the C++ compiler has no control over.
Run the same code in two different xterm, and it can output two
different things, completely; just specify a different font
(option -fn) with a different encoding for one of the xterm.
(And of course, it's pretty much par for the course to see one
thing when you cat to the screen, and something else when you
output the same file to the printer.)

Thanks. Well, that's not very satisfying. :-/
 
Ioannis Vranos

Reply I posted in c.l.c.:


Ioannis said:
>
> It works OK under Terminal UTF-8 default encoding too. So "%ls" is what
> was really needed.


Actually the code:

#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main()
{
char *p= setlocale( LC_ALL, "Greek" );

wprintf(L"Δοκιμαστικό\n");

return 0;
}

works only when I set the Terminal encoding to Greek (ISO-8859-7).
 
James Kanze

* Jeff Schwab:

[...]
Assuming no issue with translation from source code character
set to execution character set, if you use only the narrow
character streams you avoid most translation.

In practice, at least with g++. In theory, you *should*
encounter problems in a quality implementation, because the
compiler is supposed to define what it does for input outside
the basic character set. Which may or may not include handling
Greek characters correctly.
There's still translation of newlines and possibly other
characters (e.g. Ctrl Z in Windows). Thus, using UTF-8 source
code and UTF-8 execution environment character set, and
(mostly) non-translating narrow character streams, everything
should work swimmingly.
Another reason to avoid the wide character streams is that
they're not supported by the MingW Windows port of g++.
At least, not in the version I have.

I'm not sure what the current status is, but for a very long
time, g++ couldn't handle any locales except "C" and "POSIX".
And as I understand it UTF-8 is the usual in the *nix world.

Not at all. Most Unix programmers think it should be, however,
so maybe in a couple of decades... (Actually, things are moving
fairly quickly in this direction.)
 
Ioannis Vranos

How can we convert the C subset C++ code:


#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main()
{
char *p= setlocale( LC_ALL, "Greek" );

wchar_t input[50];

if (!p)
printf("NULL returned!\n");

fgetws(input, 50, stdin);

wprintf(L"%ls", input);

return 0;
}

that works, to use the newest and greatest C++ facilities? :)
 
Jeff Schwab

James said:
It wouldn't surprise me if g++ (or any other compiler) had some
bugs in this. It's far from trivial. But for the moment,

I wouldn't, either. I'd just be surprised if a "hello unicode" example
uncovered any that I would recognize.

nothing you've shown seems particularly surprising to me. (In
fact, I'm sure that there is one bug in g++. Most of what is
involved here is implementation defined, and the standard says
that a conforming implementation must document its choices. I
haven't found any such documentation for g++.)

http://gcc.gnu.org/onlinedocs/gcc-4.2.3/gcc/Characters-implementation.html

http://gcc.gnu.org/onlinedocs/gcc-4.2.3/cpp/Implementation_002ddefined-behavior.html

http://gcc.gnu.org/onlinedocs/libstdc++/22_locale/howto.html#1

"Currently, CPP requires its input to be ASCII or UTF-8. The execution
character set may be controlled by the user, with the -fexec-charset and
-fwide-exec-charset options."

What specifically needs to be documented?
 
James Kanze

[...]
There isn't such issue here, cout prints Greek literal
correctly and wcout not.

That's just because cout and narrow string literals are passing
your bytes through literally. Neither is doing anything with
them.
Also cin and string read and store Greek text correctly while
wcin and wstring look like they do not work for Greek text
input.

Using which locale? For input in what encoding?
I am not sure I understand this.
Isn't L"some text" a wide character string literal?

According to the language. But the characters between the "..."
are still encoded in some narrow character encoding, which the
compiler has to translate into some wide character encoding.

Which narrow character encoding, and which wide character
encoding, is anybody's guess. The standard says that it's
"implementation defined", which means that the implementation
has to document its choices. Good luck finding such
documentation (for just about any compiler).
Don't wcout, wcin and wstring provide operator<< and
operator>> overloads for wide characters and wide character
strings?

Yes, but all I/O is actually byte oriented. So they do code
translations on the fly, according to the embedded locale.
(The last time I checked, in g++, you could embed any locale
installed on the system, and it would still act as if it were in
locale "C". But that was a very, very long time ago.)
What do you mean by "narrow character" streams? char streams
right?

Yes.

He should have added that to be sure there's no code
translation, you have to embed the "C" locale.
This is irrelevant. MINGW's problems are MINGW problems, I am
using GCC under Linux (Scientific Linux 5.1 which is
essentially Red Hat Enterprise Linux 5.1 source code
recompiled, like CentOS - give them a try).
Also I have MS Visual C++ 2008 Express installed.

Under Linux ! :)
Can you pinpoint where our code is wrong? Essentially the following:
#include <iostream>
#include <string>
int main()
{
using namespace std;
wcout<< "Give wide character input: ";
wstring ws;
wcin>> ws;
wcout<< "You gave: "<< ws << endl;
}
It produces:
[john@localhost src]$ ./foobar-cpp
Give wide character input: Δοκιμαστικό
You gave:
[john@localhost src]$

To start with, you didn't embed a locale which supports
characters outside of the basic character set.
while the code:

#include <iostream>
#include <string>

int main()
{
using namespace std;
cout<< "Give wide character input: ";
string s;
cin>> s;
cout<< "You gave: "<< s << endl;
}
produces:

[john@localhost src]$ ./foobar-cpp
Give wide character input: Δοκιμαστικό
You gave: Δοκιμαστικό
[john@localhost src]$

Formally, the code has undefined behavior:). Practically,
you're just shuffling bytes, so it "seems" to work.
 
Ioannis Vranos

Ioannis said:
How can we convert the C subset C++ code:


#include <wchar.h>
#include <locale.h>
#include <stdio.h>
#include <stddef.h>

int main()
{
char *p= setlocale( LC_ALL, "greek" );

wchar_t input[50];

if (!p)
printf("NULL returned!\n");

fgetws(input, 50, stdin);

wprintf(L"%ls", input);

return 0;
}

that works, to use the newest and greatest C++ facilities? :)


The next best thing after this, is to use the C-subset setlocale with
wcin, wcout, wstring and stuff, and it works indeed:


#include <iostream>
#include <clocale>
#include <string>

int main()
{
using namespace std;

char *p= setlocale( LC_ALL, "greek" );

if (!p)
cerr<< "NULL returned!\n";

wstring ws;

wcin>> ws;

wcout<< ws<< endl;
}


[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό
Δοκιμαστικό
[john@localhost src]$



The following works out of the box too:

#include <iostream>
#include <clocale>

int main()
{
using namespace std;

char *p= setlocale( LC_ALL, "greek" );

wcout<< L"Δοκιμαστικό\n";
}


[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό
[john@localhost src]$



Now how can we move from the setlocale() to the newer C++ facilities?
 
Rolf Magnus

James said:
Ioannis said:
Ioannis Vranos wrote:
Has anyone actually managed to print non-English text by
using wcout or wprintf and the rest of standard, wide
character functions?
For example:
[john@localhost src]$ cat main.cc
#include <iostream>
int main()
{
using namespace std;
wcout<< L"Δοκιμαστικό μήνυμα\n";
Are you sure that you stored your source file in the same
encoding the compiler expects as source character set?

Are you sure the compiler even allows anything but US ASCII as
input?

I don't know, but if it doesn't, the file was not stored in the encoding
that the compiler expected ;-)
The OP could use the \u notation to specify his wide characters.
Before going any further, we have to know 1) how the Greek
characters are encoded. (Probably UTF-8, since that's what my
editor is configured for, and I'm seeing them correctly.)

Content-Type: text/plain; charset=ISO-8859-7; format=flowed

But the encoding used in the posting need not be the same as the encoding in
the original source file.
And which compiler he's using, which options, and what the compiler
documentation says about input file encodings. Most likely,
he'll have to ask in a group for his compiler what it accepts,
and how to make it accept what he's got.

Indeed.
 
James Kanze

James said:
Ioannis Vranos wrote:
Ioannis Vranos wrote:
Has anyone actually managed to print non-English text by
using wcout or wprintf and the rest of standard, wide
character functions?
For example:
[john@localhost src]$ cat main.cc
#include <iostream>
int main()
{
using namespace std;
wcout<< L"Δοκιμαστικό μήνυμα\n";
Are you sure that you stored your source file in the same
encoding the compiler expects as source character set?
Are you sure the compiler even allows anything but US ASCII as
input?
I don't know, but if it doesn't, the file was not stored in
the encoding that the compiler expected ;-)

Yep.

I guess my real point is that you've got to read the compiler
documentation, to find out what it supports. Supposing you can
find it.
The OP could use the \u notation to specify his wide characters.

In theory. Can you really imagine maintaining code in which
strings like his are all written using UCN's?

It is, of course, the only halfway portable approach. But IMHO,
it means that he'll need some sort of pre-processor which
converts his characters to UCN's. (It shouldn't be that hard to
write---something like ten or twenty lines of C++. But of
course, in order to write it, you have to know the encoding
you're using.)
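
A rough sketch of such a converter, under the loud assumption that the input is well-formed UTF-8 (it does no validation, and a different source encoding would need a different decoder). The name to_ucn is invented for this example. ASCII bytes pass through unchanged; everything else becomes a \uXXXX or \UXXXXXXXX escape.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Replace every non-ASCII character in a UTF-8 string with its UCN.
std::string to_ucn(const std::string& utf8)
{
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) {              // plain ASCII: copy as is
            out += utf8[i++];
            continue;
        }
        // Number of trailing bytes implied by the lead byte.
        int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
        // Payload bits of the lead byte: 5, 4 or 3 bits respectively.
        unsigned long cp = lead & (0x3F >> extra);
        ++i;
        for (int k = 0; k < extra && i < utf8.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);
        char buf[16];
        std::snprintf(buf, sizeof buf,
                      cp > 0xFFFF ? "\\U%08lX" : "\\u%04lX", cp);
        out += buf;
    }
    return out;
}
```

Run over each source line before compilation, this would turn a literal containing Greek text into the pure-ASCII form shown earlier in the thread.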
Content-Type: text/plain; charset=ISO-8859-7; format=flowed
But the encoding used in the posting need not be the same as
the encoding in the original source file.

And the encoding used in the posting need not be the encoding
which I get when I copy/paste it in my environment:). I'd
completely forgotten about that aspect. Especially, as I'm
using Google to read news, and have configured my browser to
tell the server that UTF-8 is the preferred encoding. I
wouldn't be surprised if Google were translating it (since it
sends many postings in the same HTML page, and so has to ensure
that they are all in the same encoding), but even if it weren't,
the fonts I'm using here are UTF-8, so the browser will convert
to UTF-8 to display, and probably for copy/paste as well.

Yes. No matter how you look at it, the problem is NOT trivial.
 
James Kanze

James Kanze wrote:
Linux::VMWare::Windows::VC++2008 Express.

Thanks. I'll give it a try myself. (Of course, the executables
it generates will also require VMWare to run, but it will allow
at least verifying that my code compiles with VC++ before trying
to port it to Windows.)
 
Boris

Alf P. Steinbach wrote:
[...]
And as has also been remarked else-thread, by Boris, one issue,
relevant for i/o, is that the wide character streams convert to and
from narrow characters. wcout converts to narrow characters, and wcin
converts from narrow characters. They're not wide character streams,
they're wide character converters.

I am not sure I understand this.

Isn't L"some text" a wide character string literal? Don't wcout, wcin
and wstring provide operator<< and operator>> overloads for wide
characters and wide character strings?

wcout and wcin represent external devices. When you read from or write to
external devices the facet codecvt is used. The C++ standard says there
are only two: codecvt<char, char, mbstate_t> which doesn't do anything and
codecvt<wchar_t, char, mbstate_t> which converts from wchar_t to char. As
you see there is an implicit conversion to char even if you actually use
wchar_t in your program. You don't know either how the conversion of
codecvt<wchar_t, char, mbstate_t> works (there is no guarantee that it's
UTF-16 to UTF-8 for example). Either you convert to UTF-8 explicitly and
write to cout or you define or use a codecvt from a third-party library
(like http://www.boost.org/libs/serialization/doc/codecvt.html).
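
To make the implicit conversion visible, here is a sketch that calls the codecvt<wchar_t, char, mbstate_t> facet of a given locale directly, which is roughly what a wide file buffer does on output. The helper name narrow_with_locale and its ok flag are made up for illustration; which characters convert successfully depends entirely on the locale passed in.

```cpp
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

// Convert a wide string to the narrow (external) encoding of loc.
// Sets ok to false if the facet reports an error or partial result;
// the returned string then holds whatever was converted before that.
std::string narrow_with_locale(const std::wstring& in,
                               const std::locale& loc, bool& ok)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    const Facet& cvt = std::use_facet<Facet>(loc);
    ok = true;
    if (in.empty())
        return std::string();

    // max_length() bounds the number of chars one wchar_t can produce.
    std::vector<char> buf(in.size() * cvt.max_length());
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* from_next = 0;
    char* to_next = 0;
    Facet::result r = cvt.out(state,
                              in.data(), in.data() + in.size(), from_next,
                              &buf[0], &buf[0] + buf.size(), to_next);
    ok = (r == std::codecvt_base::ok);
    return std::string(&buf[0], to_next);
}
```

With the classic "C" locale only the basic characters are expected to survive, which matches the question marks and empty lines seen earlier in the thread.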

Boris
 
Ioannis Vranos

Boris said:
Alf P. Steinbach wrote:
[...]
And as has also been remarked else-thread, by Boris, one issue,
relevant for i/o, is that the wide character streams convert to and
from narrow characters. wcout converts to narrow characters, and
wcin converts from narrow characters. They're not wide character
streams, they're wide character converters.

I am not sure I understand this.

Isn't L"some text" a wide character string literal? Don't wcout, wcin
and wstring provide operator<< and operator>> overloads for wide
characters and wide character strings?

wcout and wcin represent external devices. When you read from or write
to external devices the facet codecvt is used. The C++ standard says
there are only two: codecvt<char, char, mbstate_t> which doesn't do
anything and codecvt<wchar_t, char, mbstate_t> which converts from
wchar_t to char. As you see there is an implicit conversion to char even
if you actually use wchar_t in your program. You don't know either how
the conversion of codecvt<wchar_t, char, mbstate_t> works (there is no
guarantee that it's UTF-16 to UTF-8 for example). Either you convert to
UTF-8 explicitly and write to cout or you define or use a codecvt from a
third-party library (like
http://www.boost.org/libs/serialization/doc/codecvt.html).


Instead of messing with these details, perhaps we should accept that the
C subset setlocale() function defined in <clocale> is simpler (and thus
better)?


The following code works:


#include <iostream>
#include <clocale>
#include <string>
#include <cstdlib>


int main()
{
using namespace std;

if (!setlocale( LC_ALL, "greek" ))
{
cerr<< "NULL returned!\n";

return EXIT_FAILURE;
}


wstring ws;

wcin>> ws;

wcout<< ws<< endl;
}
 
Ioannis Vranos

I filed a bug in GCC Bugzilla:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35353


Could anyone explain to me how I should apply the suggested solution
"sync_with_stdio (false)", and the suggested fstream solution, to the
following failing code?



#include <iostream>
#include <locale>
#include <string>

int main()
{
using namespace std;

wcout.imbue(locale("greek"));

wstring ws;

wcin>> ws;

wcout<< ws<< endl;
}
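
One possible reading of the "sync_with_stdio (false)" suggestion, offered as a sketch rather than a verified fix. It assumes that libstdc++ only applies the imbued locale's conversion to the standard wide streams once they are no longer synchronized with C stdio, and that the call is made before any I/O takes place. The helper name use_locale_for_wide_io is hypothetical. Note that it imbues wcin as well as wcout, since input needs the conversion too.

```cpp
#include <iostream>
#include <locale>
#include <stdexcept>

// Try to make wcin/wcout convert according to the named locale.
// Returns false if the locale is not available on this system.
bool use_locale_for_wide_io(const char* name)
{
    try {
        std::locale loc(name);   // throws std::runtime_error if unknown
        // Must happen before the first read or write on the
        // standard streams to have a well-defined effect.
        std::ios_base::sync_with_stdio(false);
        std::wcin.imbue(loc);
        std::wcout.imbue(loc);
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

In the program above this would be called at the top of main with "greek" (or whatever locale name the system provides) in place of the single wcout.imbue line.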
 
James Kanze

James Kanze wrote:
I posted a C95 question in c.l.c., about this (which is a subset of
C++03) and I got a C95 working code. My last message there:

I'd forgotten about that aspect. It's been many, many years
since I last used printf et al. But yes, you'll definitely need
a modifier in any printf specifier.
Perhaps you may help me understand better.

Well, the main thing you have to understand is that there are
many different players in the game, and that each is doing more
or less what it wants, without considering what the others are
doing.
We have the usual char encoding which is implementation
defined (usually ASCII).

The "usual char encoding" for what? One of the problems is
that different tools have different ideas as to what the "usual
char encoding" should be.

Unless you have to deal with mainframes (where EBCDIC still
rules), you can probably count on whatever encoding is being
used for narrow characters to understand ASCII as a subset
(although I'm not at all sure that this is true for the Asian
languages).
wchar_t is wide character encoding, which is the "largest
character set supported by the system", so I suppose Unicode
under Linux and Windows.

wchar_t is implementation defined, and can be just about
anything. On the systems I know, it's UTF-16 for Windows and
AIX, UTF-32 (I think) under Linux, and some pre-Unicode 32 bit
encoding under Solaris. Except that all it really is is a 16 or
32 bit integral type. (On the usual systems. The standard
doesn't make any requirements, and an implementation which
typedef's it to char is conformant.) How the implementation
interprets it (the encoding) may depend on the locale (and I
think recent versions of Solaris have locales which interpret it
as UTF-32, rather than the pre-Unicode encoding).
What exactly is a multi-byte character?

A character which requires several bytes for its encoding.
Very, very succinctly (Haralambous takes about 60 pages to cover
the issues, so I've obviously got to leave something out):

A character is represented by one or more code points.
Probably, all of the characters we're concerned with here can be
represented by a single code point in Unicode, but that's not
always true. And even characters that can be represented by a
single code point (e.g. an o with a circumflex accent) may be
represented by more than one code point (e.g. latin small letter
O, followed by combining accent circumflex), and will be
represented thusly in some canonical representations. A code
point is a numeric value, e.g. 0x0065 (Latin small letter E, in
Unicode) or 0x0394 (Greek capital letter Delta, in Unicode).
Which leaves open how the numeric value is represented. Unicode
code points require at least 21 bits in order to be represented
numerically, but in fact, Unicode defines a certain number of
"transformation formats", specifying how the code points are to
be formatted. The most frequent are UTF-32 (with 32 bits per
element, and one element per code point, always), UTF-16 (BE or
LE), with 16 bits per element, and one or two elements per code
point (but if all you're concerned with is the Latin and the
Greek alphabets, you can consider that it is always one element
per code point as well), and UTF-8, with 8 bit elements, and one
to four elements per code point.
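
As a concrete illustration of "one to four elements per code point", here is a minimal UTF-8 encoder sketch. The name encode_utf8 is made up, and the function assumes its argument is a valid Unicode code point (at most 0x10FFFF, not a surrogate); it does not check.

```cpp
#include <string>

// Encode one Unicode code point as a UTF-8 byte sequence.
std::string encode_utf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {                    // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {            // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {          // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                            // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Greek capital Delta (0x0394) comes out as the two bytes 0xCE 0x94, while the same code point is a single element in UTF-16 or UTF-32.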

In all cases of Unicode where there can be more than one element
per code point, the encoding format is defined in such a way
that you can always tell from a single element whether it is a
complete code point, the first element of a multiple element
code point, or a following element of a multiple element code
point. Thus, in UTF-8, byte values 0-0x7F are single element
code points (corresponding in fact to US ASCII), byte values
0x80-0xBF can only be a trailing byte in a multibyte code point,
0xC2-0xF4 can only be the first byte of a multibyte code point,
and values 0xC0, 0xC1, 0xF5-0xFF never occur in valid Unicode
text. (The UTF-8 encoding format as originally designed is
capable of handling numeric values up to 0x7FFFFFFF; such values
would use first bytes up to 0xFD.)

The important point, of course, being that a single code point
may require more than one byte.
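
The self-synchronizing property can be sketched in a few lines. The names below are invented for the example, and the ranges follow the current UTF-8 definition (RFC 3629), in which lead bytes stop at 0xF4.

```cpp
#include <cstddef>
#include <string>

enum Utf8ByteKind { Utf8Single, Utf8Lead, Utf8Trail, Utf8Invalid };

// Classify a byte by value alone, without looking at its neighbours.
Utf8ByteKind classify_utf8_byte(unsigned char b)
{
    if (b <= 0x7F) return Utf8Single;               // US ASCII
    if (b <= 0xBF) return Utf8Trail;                // 0x80-0xBF
    if (b >= 0xC2 && b <= 0xF4) return Utf8Lead;    // starts 2-4 byte sequence
    return Utf8Invalid;                             // 0xC0, 0xC1, 0xF5-0xFF
}

// Count code points in well-formed UTF-8 by counting the bytes that
// start a sequence, i.e. everything that is not a trailing byte.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        if (classify_utf8_byte(static_cast<unsigned char>(utf8[i])) != Utf8Trail)
            ++n;
    }
    return n;
}
```

Starting from an arbitrary byte index, skipping forward over trailing bytes is enough to land on the next character boundary, which is exactly what the older encodings described below could not offer.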

Historically, earlier encodings didn't make such a rigorous
distinction between characters and code points, and tended to
define code points directly in terms of the encoding format,
rather than as a numeric value. Also, most of them didn't have
the characteristic that you could tell immediately from the
value of a byte whether it was a first byte or not; in general,
if you just indexed into a string at any arbitrary byte index,
you had no way of "resynchronizing", i.e. finding the nearest
character boundary. Some of the earlier encodings also depended
on stream state, using codes for shift in and shift out to
specify that the numeric values which followed (until the next
shift in or shift out code) were e.g. in the Greek alphabet,
rather than in the Latin one. (Some early data transmission
codes were only five bits, using shift in and shift out to
change from letters to digits/punctuation and vice versa---and
only supporting one case of letters.)
I have to say that I am talking about C95 here, not C99.
return 0;
}
Under Linux:
[john@localhost src]$ ./foobar-cpp
Test
T
[john@localhost src]$
[john@localhost src]$ ./foobar-cpp
Δοκιμαστικό
�
[john@localhost src]$
The above may not be the only problem. In cases like this,
you need to say what encoding your terminal is using.
You are somehow correct on this. My terminal encoding was
UTF-8 and I added Greek(ISO-8859-7).

In general: all a program written in C++ can do is output bytes,
which have some numeric value. We suppose a particular
encoding, etc. in the program, but there's no guarantee that
whoever later reads those bytes supposes the same thing, and
there's not much C++ can do about it.

(Since you're under Linux, try starting an xterm with a font
using UTF-8, set the locales correctly for it, and create a file
with Greek characters in the name. Then start a different xterm
with a font using ISO-8859-7, set the locales for that, and do
an ls on the directory where you created the file. As you can
see, even without any C++, there can be problems. And there's
nothing C++ can do about it.)

[...]
BTW, how can we define UTF-8 as the locale?

It depends on the implementation, but the Unix conventions
prescribe something along the lines of
<language>[_<country>][.<encoding>], where the language is the 2
letter language code, as per ISO 639-1 (in lower case), the
country is the 2 letter country code, as per ISO 3166, in upper
case, and the encoding is something or other. With the optional
parts defaulting to some system defined value if they're not
specified. For historical reasons, most implementations also
support additional names, like "Greek". And of course,
depending on the machine, any given locale may or may not be
installed---typically, if you do an ls of either
/usr/share/locale or /usr/lib/locale, you'll get a list of
supported locales for the machine in question. (On the version
of Linux I'm running here, UTF-8 is the default, and I can't see
it in the locale names. IIRC from the Solaris machine at work,
however, the UTF-8 locales end in .utf8. Also note that there
may be some additional files in this directory.)
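
Since the exact spelling varies between systems, one pragmatic approach is to try several candidate names in turn. The sketch below does that with setlocale; first_available_locale is a hypothetical helper, and names such as "el_GR.UTF-8" or "el_GR.utf8" are only examples of the convention above that may or may not be installed on a given machine.

```cpp
#include <clocale>
#include <cstddef>

// Try each name with setlocale(LC_ALL, ...) and return the first one
// the system accepts, or a null pointer if none is available.
// On success the global C locale is left set to that name.
const char* first_available_locale(const char* const* names, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i) {
        if (std::setlocale(LC_ALL, names[i]) != 0)
            return names[i];
    }
    return 0;
}
```

A caller might pass { "el_GR.UTF-8", "el_GR.utf8", "greek" } and report an error if the result is null, instead of silently continuing in the "C" locale.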
 
Boris

[...]Instead of messing with these details, perhaps we should accept
that the C subset setlocale() function defined in <clocale> is simpler
(and thus better)?


The following code works:


#include <iostream>
#include <clocale>
#include <string>
#include <cstdlib>


int main()
{
using namespace std;

if (!setlocale( LC_ALL, "greek" ))
{
cerr<< "NULL returned!\n";

return EXIT_FAILURE;
}


wstring ws;

wcin>> ws;

wcout<< ws<< endl;
}

If the locale name "greek" means an eight-bit character set is used you
don't need to use wstring, wcin and wcout at all? What character set do
you actually plan to use in your program?

Boris
 
