imbue(locale) and file encoding

Ralf Goertz

Hi,

since my previous post
<[email protected]> is still
unanswered, I'd like to rephrase my question. In order to read/write a
wstring in UTF-8 encoding it is *not* sufficient to imbue the stream
with a locale like "de_DE.UTF-8". Doing so only takes care of the facets
for decimal numbers and the like. Rather, one has to call
locale::global(locale("de_DE.UTF-8")). Does this behaviour conform to the
standard? And if so, why? I mean, why wouldn't
wcin.imbue(locale("de_DE.UTF-8")) make wcin accept UTF-8 multibyte
characters while still allowing 5,7 to be parsed as 5.7?

file wcintest.cc:
-------------
#include <iostream>
#include <string>
#include <locale>
using namespace std;

float f;
wstring euro;

int main(){
    locale l("de_DE.UTF-8");
    wcin.imbue(l);
    locale::global(l); // (*)
    wcin>>f>>euro;
    wcout.imbue(locale("en_US.UTF-8"));
    wcout<<f<<L" "<<euro<<endl;
}
-------------

Calling

$ echo "5,70 €" |./wcintest

in a UTF-8 environment gives

5.70 €

but only if the line marked (*) is present. Otherwise you only get

5.70

It seems as if the encoding part of the locale is ignored by the imbue
calls but I don't see why this should be the case.

I use g++ (GCC) 4.1.0 under linux (i386).
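
One possible explanation, offered as an assumption about libstdc++ rather than a statement about the standard: by default the standard wide streams are synchronized with C stdio, so the conversion from bytes to wchar_t on wcin is done by the C library under the global C locale, and the codecvt facet of an imbued locale is not consulted. A minimal sketch of a workaround under that assumption (use_stream_locales is a made-up helper name) turns synchronization off before any I/O and then imbues each stream:

```cpp
#include <iostream>
#include <locale>

// use_stream_locales is a hypothetical helper, not a library function.
// Assumption: on libstdc++ the standard wide streams stay tied to C stdio
// (and so to the global C locale) until sync_with_stdio(false) is called,
// and that call has to happen before the first I/O operation.
void use_stream_locales(const std::locale& in_loc, const std::locale& out_loc) {
    std::ios_base::sync_with_stdio(false);
    std::wcin.imbue(in_loc);    // parsing rules (and, hopefully, decoding) for input
    std::wcout.imbue(out_loc);  // formatting rules (and, hopefully, encoding) for output
}
```

Called at the top of main as use_stream_locales(locale("de_DE.UTF-8"), locale("en_US.UTF-8")), this might make the line marked (*) unnecessary, provided both locales are installed. Other standard libraries may behave differently.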

Ralf
 
ondra.holub

Currently I do not have linux here (at work) so I am only guessing. Did
you try to change locale of output to German locale?

wcout.imbue(l);

Maybe the euro sign is not accepted by US locale.
 
Ralf Goertz

ondra.holub said:
Currently I do not have linux here (at work) so I am only guessing.
Did you try to change locale of output to German locale?

wcout.imbue(l);

Maybe the euro sign is not accepted by US locale.

The problem occurs earlier. The euro sign cannot be read from wcin
without the locale::global(l). Like I said wcin.imbue(l) does not seem
to honour the encoding part of the locale string. Probably, the encoding
can only be changed globally whereas the facets are specific to the
stream. But that's what puzzles me because I see no reason for this kind
of behaviour.
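
For what it is worth, the encoding is itself carried by a facet of the locale, namely codecvt<wchar_t, char, mbstate_t>. The sketch below calls that facet directly on a byte string, which is one way to check what a given locale object would do with some bytes without involving any stream. decode_with is a hypothetical helper, and what the classic "C" locale does with non-ASCII bytes is implementation-defined.

```cpp
#include <cwchar>     // std::mbstate_t
#include <locale>
#include <stdexcept>
#include <string>

// Decode `bytes` into a wide string using the codecvt facet of `loc`.
// decode_with is a hypothetical helper, not part of the standard library.
std::wstring decode_with(const std::locale& loc, const std::string& bytes) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_t;
    const facet_t& cvt = std::use_facet<facet_t>(loc);

    // At most one wide character is produced per input byte.
    std::wstring out(bytes.size(), L'\0');
    if (bytes.empty()) return out;

    std::mbstate_t state = std::mbstate_t();
    const char* from_next = 0;
    wchar_t* to_next = 0;
    std::codecvt_base::result r = cvt.in(
        state,
        bytes.data(), bytes.data() + bytes.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (r == std::codecvt_base::noconv)  // facet reports nothing to convert
        return std::wstring(bytes.begin(), bytes.end());
    if (r != std::codecvt_base::ok)      // error or incomplete sequence
        throw std::runtime_error("bytes are not valid in this locale's encoding");

    out.resize(to_next - &out[0]);
    return out;
}
```

With a locale such as locale("de_DE.UTF-8") (assuming it is installed), the UTF-8 bytes of "5,70 €" would be expected to come back as six wide characters ending in the euro sign.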

Ralf
 
ondra.holub

Hi. I tried it on Open SUSE 10.1 and the behaviour is exactly the same
as you described. There is no problem when using cin, cout and string,
but it does not work with wide-character versions :-(

With wide strings it also works when you set the global locale to
locale("") - the current user's system locale. Maybe the standard library
expects Latin-1 encoding by default, which is not correct for UTF-8
systems. But I am only guessing. Anyway, I think it is no problem to
start the main function with locale::global(locale("")); and it should
work everywhere (hopefully).
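
A minimal sketch of that startup call. One caveat: locale("") throws std::runtime_error when the environment names a locale that is not installed, so the sketch falls back to the classic "C" locale in that case. adopt_user_locale is a hypothetical helper name.

```cpp
#include <locale>
#include <stdexcept>

// Make the user's environment locale the global one.
// Returns false (and keeps the classic "C" locale) if the environment
// names a locale that cannot be constructed on this system.
// adopt_user_locale is a hypothetical helper, not a library function.
bool adopt_user_locale() {
    try {
        std::locale::global(std::locale(""));
        return true;
    } catch (const std::runtime_error&) {
        std::locale::global(std::locale::classic());
        return false;
    }
}
```

Calling adopt_user_locale() as the first statement of main corresponds to the suggestion above.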
 
Ralf Goertz

ondra.holub said:
Hi. I tried it on Open SUSE 10.1 and the behaviour is exactly the same
as you described. There is no problem when using cin, cout and string,
but it does not work with wide-character versions :-(

I would use cin, cout and string, but then there is the problem that
string.size() and string.substr() count bytes rather than characters,
so they do not work as expected on UTF-8 text.
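
To illustrate the size() problem: std::string counts bytes, so a UTF-8 euro sign contributes 3 to size(), and substr() can cut a character in half. A small sketch that counts code points instead, assuming the string holds valid UTF-8 (utf8_code_points is a hypothetical helper, not a library function):

```cpp
#include <cstddef>
#include <string>

// Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
// Assumes `s` is valid UTF-8; no validation is attempted.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) ++n;  // lead byte or ASCII byte
    }
    return n;
}
```

For "5,70 €" encoded as UTF-8, size() is 8 while utf8_code_points() is 6.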

ondra.holub said:
With wide strings it also works when you set the global locale to
locale("") - the current user's system locale. Maybe the standard library
expects Latin-1 encoding by default, which is not correct for UTF-8
systems. But I am only guessing. Anyway, I think it is no problem to
start the main function with locale::global(locale("")); and it should
work everywhere (hopefully).

Yeah it works, but I don't see the logic. Suppose you want to convert a
German UTF-8 encoded text file with floats and euro signs into a Latin-1
encoded file with en_US locale. Then you always have to change the
global locale before switching from reading from wcin to writing to
wcout or vice versa. If source and destination had the same encoding
then one imbue call for each stream would be sufficient. As I have found
nothing on the net that says "imbue calls do not care about encoding" I
suspect it might be a bug in my libstdc++ implementation of the
standard. It would be nice to know how other compilers/libraries deal
with that situation.
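
The per-stream part of this can be tried without touching the global locale and without any installed system locale. The sketch below parses a number with a decimal comma on one stream and writes it with a decimal point on another. comma_decimal and reformat_number are hypothetical names, and string streams stand in for wcin/wcout, so no encoding conversion is involved:

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: German-style punctuation without needing de_DE installed.
struct comma_decimal : std::numpunct<wchar_t> {
    wchar_t do_decimal_point() const { return L','; }
    wchar_t do_thousands_sep() const { return L'.'; }
};

// Read a number written with a decimal comma, write it with a decimal point.
// Each stream gets its own locale; the global locale is never changed.
std::wstring reformat_number(const std::wstring& text) {
    std::wistringstream in(text);
    in.imbue(std::locale(std::locale::classic(), new comma_decimal));
    double value = 0.0;
    in >> value;

    std::wostringstream out;
    out.imbue(std::locale::classic());
    out << value;
    return out.str();
}
```

reformat_number(L"5,70") gives L"5.7".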

Another problem I encountered is that tolower() does not work on wchar_t
Umlauts although I use the correct global locale.

Ralf
 
Ralf Goertz

I said:
Yeah it works, but I don't see the logic. Suppose you want to convert
a German UTF-8 encoded text file with floats and euro signs into a
Latin-1 encoded file with en_US locale. Then you always have to change
the global locale before switching from reading from wcin to writing
to wcout or vice versa.

I just found the following in Stroustrup (retranslated from German)

"Setting the global locale does not affect existing input/output
streams. The streams continue to use those locales that were assigned to
them using imbue() during their creation."

Ralf
 
ondra.holub

I think the tolower function is not designed for C++. Facets should be
used instead, but the code looks a bit complicated:

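#include <iostream>
#include <locale>

int main()
{
    // The locale name is platform-specific; "german" may not exist everywhere.
    std::locale loc("german");
    // ctype<char> converts byte by byte, so this literal has to be stored
    // in a single-byte encoding such as Latin-1, not UTF-8.
    char s[] = "äÖü";

    // sizeof(s) - 1 leaves the terminating '\0' out of the range.
    std::use_facet< std::ctype<char> >(loc).tolower(s, s + sizeof(s) - 1);
    std::cout << s << std::endl;

    std::use_facet< std::ctype<char> >(loc).toupper(s, s + sizeof(s) - 1);
    std::cout << s << std::endl;

    return 0;
}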
 
Ralf Goertz

ondra.holub said:
I think the tolower function is not designed for C++. Facets should be
used instead, but the code looks a bit complicated:

Okay this works (after modification), also with wchar_t. I also found a
solution in the C++ Cookbook, templated functions

to[Upper|Lower](basic_string<C>,const locale & loc=locale())

which use use_facet. Interestingly they also have the problem that the
encoding part of the locale is not used unless the global locale
explicitly states that we use UTF-8. I'd really like to know whether this
conforms to the standard or is a bug.
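
A sketch of what such a templated function might look like (this is not the cookbook's actual code, and to_lower is a hypothetical name). It maps characters with the ctype facet of the locale passed in, so which non-ASCII letters get converted depends entirely on that locale:

```cpp
#include <locale>
#include <string>

// Lower-case a string of any character type via the ctype facet of `loc`.
// Defaults to a copy of the current global locale, like the signature above.
template <typename C>
std::basic_string<C> to_lower(std::basic_string<C> s,
                              const std::locale& loc = std::locale()) {
    const std::ctype<C>& ct = std::use_facet< std::ctype<C> >(loc);
    if (!s.empty())
        ct.tolower(&s[0], &s[0] + s.size());  // converts in place
    return s;
}
```

For example, to_lower(std::wstring(L"ABC Def"), std::locale::classic()) returns L"abc def". For umlauts one would pass a suitable named locale instead, assuming it is installed.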

Ralf
 
