Wide characters and streams

Guest

From thread
http://groups.google.com/group/comp.lang.c++/browse_thread/thread/79d767efa42df516

P.J. Plauger said:
In practice they're not broken and you can write Unicode characters.
As with any other Standard C++ library, you need an appropriate
codecvt facet for the code conversion you favor. See our add-on
library, which includes a broad assortment.

I'll take this at face value and I'll have to suppose that I don't
understand what the streams should do.

I guess then the root of my problem is my expectation that if I use a
std::ofstream it will write a char sequence to disk and if I use a
std::wofstream it will write a wchar_t sequence to disk. I presume then
that this is wrong?

I also have to assume that if I write a UTF-16 sequence to std::wcout
then I should not expect it to display correctly on a platform that
uses UTF-16?

The code below summarises my expectation of what I would be able to do,
so I guess my understanding is off. What should the code below do?

#include "stdafx.h" // This header is empty
#include <iostream>
#include <conio.h>
#include <fstream>

int wmain(int /*argc*/, wchar_t* /*argv*/[])
{
    std::wcout << L"Hello world!" << std::endl;
    // Surname with AE ligature
    std::wcout << L"Hello Kirit S\x00e6lensminde" << std::endl;
    // Kirit transliterated (probably badly) into Greek
    std::wcout << L"Hello \x039a\x03b9\x03c1\x03b9\x03c4" << std::endl;
    // Kirit transliterated into Thai
    std::wcout << L"Hello \x0e04\x0e35\x0e23\x0e34\x0e17" << std::endl;

    //if ( std::wcout )
    //    std::cout << "\nstd::wcout still good" << std::endl;
    //else
    //    std::cout << "\nstd::wcout gone bad" << std::endl;

    _cputws( L"\n\n\n" );
    _cputws( L"Hello Kirit S\x00e6lensminde\n" ); // AE ligature
    _cputws( L"Hello \x039a\x03b9\x03c1\x03b9\x03c4\n" ); // Greek
    _cputws( L"Hello \x0e04\x0e35\x0e23\x0e34\x0e17\n" ); // Thai

    std::wofstream wout1( "test1.txt" );
    wout1 << L"12345" << std::endl;

    //if ( wout1 )
    //    std::cout << "\nwout1 still good" << std::endl;
    //else
    //    std::cout << "\nwout1 gone bad" << std::endl;

    std::wofstream wout2( "test2.txt" );
    wout2 << L"Hello world!" << std::endl;
    wout2 << L"Hello Kirit S\x00e6lensminde" << std::endl;
    wout2 << L"Hello \x039a\x03b9\x03c1\x03b9\x03c4" << std::endl;
    wout2 << L"Hello \x0e04\x0e35\x0e23\x0e34\x0e17" << std::endl;

    //if ( wout2 )
    //    std::cout << "\nwout2 still good" << std::endl;
    //else
    //    std::cout << "\nwout2 gone bad" << std::endl;

    return 0;
}


I've compiled this on MSVC Studio 2003 and it reports the following
command line switches on a debug build (i.e. Unicode defined as the
character set and wchar_t as a built-in type):

/Od /D "WIN32" /D "_DEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm
/EHsc /RTC1 /MLd /Zc:wchar_t /Zc:forScope /Yu"stdafx.h"
/Fp"Debug/wcout.pch" /Fo"Debug/" /Fd"Debug/vc70.pdb" /W3 /nologo /c
/Wp64 /ZI /TP

If I run this directly from the IDE then it clearly does some odd
narrowing of the output as the Greek cputws() line displays:

Hello ????t

Which to me looks like a failure in the character substitution from
Unicode to what I presume is some OEM encoding. Now don't get me wrong, I
think this is a poor default situation for running something on a
Unicode platform (this is on Windows 2003 Server), but it does seem to
be beside the point for this discussion.

If I run it from a command prompt with Unicode I/O turned on (cmd.exe
/u) then the output is somewhat more encouraging, but not a lot:

Hello world!
Hello Kirit Sµlensminde
Hello


Hello Kirit Sælensminde
Hello ΚιÏιτ
Hello คีริท

The _cputws calls all work as I would expect, but std::wcout doesn't
work at all. Worse, uncommenting the stream tests shows that there is an
error on std::wcout, rendering it unusable from then on. Note also that
it has translated the AE ligature into what looks to me like a Greek
lower case mu. The Greek capital kappa has wedged the stream.

The two txt files are interesting. test1.txt is seven bytes long,
exactly half the size I would naively expect, and test2.txt is 45
bytes long. Exactly the length I'd expect from a char stream that only
went up to, but didn't include, the Greek capital kappa.

Now, if this is all by design then I presume that there is something
fairly simple that I can do to have all of this work in the way that I
naively expect, or does the C++ standard in some way mandate that it is
going to be really hard? Maybe it's a quality of implementation issue
and we just have to buy the library upgrade or write our own codecvt
implementations?

What we've done is to use our own implementation of a UTF-16 to UTF-8
converter (that we know works properly as it drives our web interfaces)
and just send that sequence to a std::ofstream. We've had to more or
less give up on meaningful and pipeable console output.


P.J. Plauger

From thread
http://groups.google.com/group/comp.lang.c++/browse_thread/thread/79d767efa42df516

P.J. Plauger said:
In practice they're not broken and you can write Unicode characters.
As with any other Standard C++ library, you need an appropriate
codecvt facet for the code conversion you favor. See our add-on
library, which includes a broad assortment.

I'll take this at face value and I'll have to suppose that I don't
understand what the streams should do.

I guess then the root of my problem is my expectation that if I use a
std::ofstream it will write a char sequence to disk and if I use a
std::wofstream it will write a wchar_t sequence to disk. I presume then
that this is wrong?

[pjp] It's not exactly right. When you write to a wofstream, the wchar_t
sequence you write gets converted to a byte sequence written to the file.
How that conversion occurs depends on the codecvt facet you choose. Choose
none and you get some default. In the case of VC++ the default is pretty
stupid -- the first 256 codes get written as single bytes and all other
wide-character codes fail to write.

I also have to assume that if I write a UTF-16 sequence to std::wcout
then I should not expect it to display correctly on a platform that
uses UTF-16?

[pjp] Again, that depends on the codecvt facet you use. With our add-on
library (available at our web site) we offer a host of codecvt facets.
One of them converts UTF-16 wide characters to UTF-8 files. Another
writes UTF-16 to UTF-16 files, with choice of endianness and an
optional BOM that tells what kind of file it is.

The code below summarises my expectation of what I would be able to do,
so I guess my understanding is off. What should the code below do?

[pjp] <Lengthy code omitted, which reaffirms the above.>

Now, if this is all by design then I presume that there is something
fairly simple that I can do to have all of this work in the way that I
naively expect, or does the C++ standard in some way mandate that it is
going to be really hard? Myabe it's a quality of implementation issue
and we just have to buy the library upgrade or write our own codecvt
implementations?

[pjp] If the default codecvt facet were UTF-16 to UTF-8 you'd find the
behavior sensible -- for your needs. I suspect that you're in the majority
these days, which is why we've made this the default for our Standard
C library. But the Standard C++ library was designed to be way more
flexible. Hence, it is in effect mandated to be hard, and it is indeed a
QOI issue what to provide. But writing your own codecvt facets is way
harder than it appears, so be careful.

What we've done is to use our own implementation of a UTF-16 to UTF-8
converter (that we know works properly as it drives our web interfaces)
and just send that sequence to a std::ofstream. We've had to more or
less give up on meaningful and pipeable console output.

[pjp] That's one way out, yes.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
Kirit Sælensminde

P.J. Plauger said:
From thread
http://groups.google.com/group/comp.lang.c++/browse_thread/thread/79d767efa42df516

P.J. Plauger said:
In practice they're not broken and you can write Unicode characters.
As with any other Standard C++ library, you need an appropriate
codecvt facet for the code conversion you favor. See our add-on
library, which includes a broad assortment.

I'll take this at face value and I'll have to suppose that I don't
understand what the streams should do.

I guess then the root of my problem is my expectation that if I use a
std::ofstream it will write a char sequence to disk and if I use a
std::wofstream it will write a wchar_t sequence to disk. I presume then
that this is wrong?

[pjp] It's not exactly right. When you write to a wofstream, the wchar_t
sequence you write gets converted to a byte sequence written to the file.
How that conversion occurs depends on the codecvt facet you choose. Choose
none and you get some default. In the case of VC++ the default is pretty
stupid -- the first 256 codes get written as single bytes and all other
wide-character codes fail to write.

Indeed that is pretty stupid. I don't mind stupid defaults so long as
they are described in the documentation, but the documentation of
std::wofstream or std::wcout makes no mention of this. I notice though
that std::wstringstream doesn't seem to suffer this problem.

As far as std::wcout goes though there must be something else going on
as well or the AE ligature would not have been mangled to a Greek mu.
This would seem to imply that using a codecvt that passed UTF-16
through unchanged would not work. Or is it the existing codecvt that is
performing the mis-transliteration?

I can't help but think that a lot of the frustration could be very
simply resolved by just properly documenting what the libraries do and
putting that documentation where people will see it.

I also have to assume that if I write a UTF-16 sequence to std::wcout
then I should not expect it to display correctly on a platform that
uses UTF-16?

[pjp] Again, that depends on the codecvt facet you use. With our add-on
library (available at our web site) we offer a host of codecvt facets.
One of them converts UTF-16 wide characters to UTF-8 files. Another
writes UTF-16 to UTF-16 files, with choice of endianness and an
optional BOM that tells what kind of file it is.


As a practical matter I don't understand how wchar_t streams can be
seen as anything but broken (in the 'not working' sense) on this
platform if I have to write my own codecvt implementation or buy one in
so that I can write UTF-16 files.

It seems bizarre that an assertion that the streams aren't broken is
compatible with the fact that they cannot be used in what must be a
very common (if not the most common) use case. An inability to write
UTF-16 to the console sure seems broken to me and an implementation
that writes UTF-16 streams as you describe surely can't be described as
'working' for any practical purpose.

The code below summarises my expectation of what I would be able to do,
so I guess my understanding is off. What should the code below do?

[pjp] <Lengthy code omitted, which reaffirms the above.>

Now, if this is all by design then I presume that there is something
fairly simple that I can do to have all of this work in the way that I
naively expect, or does the C++ standard in some way mandate that it is
going to be really hard? Maybe it's a quality of implementation issue
and we just have to buy the library upgrade or write our own codecvt
implementations?

[pjp] If the default codecvt facet were UTF-16 to UTF-8 you'd find the
behavior sensible -- for your needs. I suspect that you're in the majority
these days, which is why we've made this the default for our Standard
C library. But the Standard C++ library was designed to be way more
flexible. Hence, it is in effect mandated to be hard, and it is indeed a
QOI issue what to provide. But writing your own codecvt facets is way
harder than it appears, so be careful.

Actually if the default codecvt were simply a null, do-nothing UTF-16 to
UTF-16 conversion that would be fine too.

We did notice that writing a codecvt implementation is no trivial task.
We tried to write a UTF-16 to UTF-8 codecvt, but haven't managed to get
it to work.

Looking at the comments in our source it seems that there was some
confusion about what do_length should return. I think the standard says
it should return the number of bytes, but the documentation we were
using at the time seemed to imply that it should return the number of
wchar_t. The documentation we're now using looks to have been changed,
but I'm not sure I can work out from the wording what it is saying
should be returned.

This is something that we may revisit.


On your web site, is "compleat" some joke that I'm not getting?

And thanks for taking the time to answer. It's certainly cleared up a
lot about what is going on.


P.J. Plauger

P.J. Plauger said:
From thread
http://groups.google.com/group/comp.lang.c++/browse_thread/thread/79d767efa42df516

P.J. Plauger said:
In practice they're not broken and you can write Unicode characters.
As with any other Standard C++ library, you need an appropriate
codecvt facet for the code conversion you favor. See our add-on
library, which includes a broad assortment.

I'll take this at face value and I'll have to suppose that I don't
understand what the streams should do.

I guess then the root of my problem is my expectation that if I use a
std::ofstream it will write a char sequence to disk and if I use a
std::wofstream it will write a wchar_t sequence to disk. I presume then
that this is wrong?

[pjp] It's not exactly right. When you write to a wofstream, the wchar_t
sequence you write gets converted to a byte sequence written to the file.
How that conversion occurs depends on the codecvt facet you choose. Choose
none and you get some default. In the case of VC++ the default is pretty
stupid -- the first 256 codes get written as single bytes and all other
wide-character codes fail to write.

Indeed that is pretty stupid. I don't mind stupid defaults so long as
they are described in the documentation, but the documentation of
std::wofstream or std::wcout makes no mention of this. I notice though
that std::wstringstream doesn't seem to suffer this problem.

As far as std::wcout goes though there must be something else going on
as well or the AE ligature would not have been mangled to a Greek mu.
This would seem to imply that using a codecvt that passed UTF-16
through unchanged would not work. Or is it the existing codecvt that is
performing the mis-transliteration?

[pjp] The whole problem is the stupid default conversion. Our C++
library has always used the fgetwc/fputwc machinery from the C
library for the default wchar_t codecvt facet. Thus, we more or less
inherit whatever decision a compiler vendor has chosen for C.
(Unless, of course, that vendor has also licensed our C library,
in which case you get UTF-16/UTF-8 by default.)

But remember that what you see is also determined by the display
software, which is outside the purview of C and C++. Sometimes
that's not what you expect, so extended character sets get curdled
in surprising ways on their way to your eyeballs.
---

I can't help but think that a lot of the frustration could be very
simply resolved by just properly documenting what the libraries do and
putting that documentation where people will see it.

[pjp] I agree that these decisions could be better highlighted.
---
I also have to assume that if I write a UTF-16 sequence to std::wcout
then I should not expect it to display correctly on a platform that
uses UTF-16?

[pjp] Again, that depends on the codecvt facet you use. With our add-on
library (available at our web site) we offer a host of codecvt facets.
One of them converts UTF-16 wide characters to UTF-8 files. Another
writes UTF-16 to UTF-16 files, with choice of endianness and an
optional BOM that tells what kind of file it is.

As a practical matter I don't understand how wchar_t streams can be
seen as anything but broken (in the 'not working' sense) on this
platform if I have to write my own codecvt implementation or buy one in
so that I can write UTF-16 files.

[pjp] If they don't do what you want, then they are broken to you.
---

It seems bizarre that an assertion that the streams aren't broken is
compatible with the fact that they cannot be used in what must be a
very common (if not the most common) use case. An inability to write
UTF-16 to the console sure seems broken to me and an implementation
that writes UTF-16 streams as you describe surely can't be described as
'working' for any practical purpose.

[pjp] The common use case of today is not the one that was common
a decade or more ago, when some of these decisions were made. The
default conversion is doubtless overdue for revision.
---
The code below summarises my expectation of what I would be able to do,
so I guess my understanding is off. What should the code below do?

[pjp] <Lengthy code omitted, which reaffirms the above.>

Now, if this is all by design then I presume that there is something
fairly simple that I can do to have all of this work in the way that I
naively expect, or does the C++ standard in some way mandate that it is
going to be really hard? Maybe it's a quality of implementation issue
and we just have to buy the library upgrade or write our own codecvt
implementations?

[pjp] If the default codecvt facet were UTF-16 to UTF-8 you'd find the
behavior sensible -- for your needs. I suspect that you're in the majority
these days, which is why we've made this the default for our Standard
C library. But the Standard C++ library was designed to be way more
flexible. Hence, it is in effect mandated to be hard, and it is indeed a
QOI issue what to provide. But writing your own codecvt facets is way
harder than it appears, so be careful.

Actually if the default codecvt were simply a null, do-nothing UTF-16 to
UTF-16 conversion that would be fine too.

[pjp] For some people.
---

We did notice that writing a codecvt implementation is no trivial task.
We tried to write a UTF-16 to UTF-8 codecvt, but haven't managed to get
it to work.

[pjp] It's the hardest codecvt facet of all to write. In fact, it's
officially impossible, since codecvt was "designed" to do 1-N code
conversions, and UTF-16/UTF-8 is M-N. No Standard C++ library except
ours will even give you a fighting chance, and it's a fiendishly
difficult coding problem even then.
---

Looking at the comments in our source it seems that there was some
confusion about what do_length should return. I think the standard says
it should return the number of bytes, but the documentation we were
using at the time seemed to imply that it should return the number of
wchar_t. The documentation we're now using looks to have been changed,
but I'm not sure I can work out from the wording what it is saying
should be returned.

[pjp] The description of codecvt in the C++ Standard is murky, to
put it politely.
---

This is something that we may revisit.

On your web site, is "compleat" some joke that I'm not getting?

[pjp] "Compleat" is an older spelling of "complete". See, for
example, the noted 17th century book, "The Compleat Angler or
the Contemplative man's Recreation."
---

And thanks for taking the time to answer. It's certainly cleared up a
lot about what is going on.

[pjp] Welcome.
 
