stdin charset


Antimon

Hi,

I'm new to C/C++ and working on string stuff with Visual Studio 2005.
I'm trying to understand something. For example, when I do this:

wstring st;
wcin >> st;

if the input is pure ASCII, then everything is OK, but if there are
Unicode characters like "ş" (U+015F), what is the encoding of st now?
Everything works when I use this st string, do stuff, write to cout,
etc., but if I want to convert this string to UTF-8, what encoding am I
converting from?

Btw, when I do something like this:

wstring a = L"ş";
wstring b;
wcin >> b;

and write "ş" into the console,

(a == b) is false. I checked a and it's Unicode (UTF-16), but b is not
Unicode; I could not manage to find out what it is.

Thanks.
 

Old Wolf

I'm new to C/C++ and working on string stuff with Visual Studio 2005.

NB. I'm no expert on this, but am posting because nobody else
has yet, so perhaps I can help you a little, at least.
I'm trying to understand something. For example, when I do this:

wstring st;
wcin >> st;

if the input is pure ASCII, then everything is OK, but if there are
Unicode characters like "ş" (U+015F), what is the encoding of st now?

It depends on your compiler. From what I know of Microsoft, it's
likely to be UTF-16.
Everything works when I use this st string, do stuff, write to cout,
etc., but if I want to convert this string to UTF-8, what encoding am I
converting from?

C++ includes the C functions for converting between "wide
character" and "multi-byte character sequence". It doesn't
specify that MBCS has to be UTF-8, but if you're lucky then
it will turn out to be that on your compiler. Try using the
function wcstombs() on your wstring and it might spit out
UTF-8 if you're lucky.
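
Something along these lines might work (just a sketch, not tested on
your setup; the names are mine, and the output encoding is whatever the
current C locale uses, which on Windows is usually an ANSI code page
rather than UTF-8):

#include <clocale>
#include <cstdlib>
#include <string>
#include <vector>

// Convert a wstring to a narrow string using the current C locale.
std::string narrow( std::wstring const &ws )
{
    std::setlocale(LC_CTYPE, "");                    // use the environment's locale
    std::size_t needed = std::wcstombs(NULL, ws.c_str(), 0);
    if (needed == (std::size_t)-1)
        return std::string();                        // contains an unconvertible character
    std::vector<char> buf(needed + 1);
    std::wcstombs(&buf[0], ws.c_str(), buf.size());
    return std::string(&buf[0], needed);
}
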
Btw, when I do something like this:

wstring a = L"ş";
wstring b;
wcin >> b;

and write "ş" into the console,

(a == b) is false. I checked a and it's Unicode (UTF-16), but b is not
Unicode; I could not manage to find out what it is.

You can check what you have got by printing it out as a series
of unsigned chars, e.g. :

#include <cstddef>
#include <cstdio>

void hex_dump( void const *ptr, size_t nbytes )
{
    // In C++ a void* doesn't convert implicitly, so cast it explicitly.
    unsigned char const *p = static_cast<unsigned char const *>(ptr);
    while (nbytes--)
        printf("%02X", *p++);
    putchar('\n');
}

and then call it like this:
hex_dump( a.c_str(), a.size() * sizeof(wchar_t) );
hex_dump( b.c_str(), b.size() * sizeof(wchar_t) );
 

jalina

Antimon wrote:
Hi,

I'm new to C/C++ and working on string stuff with Visual Studio 2005.
I'm trying to understand something. For example, when I do this:

wstring st;
wcin >> st;

if the input is pure ASCII, then everything is OK, but if there are
Unicode characters like "ş" (U+015F), what is the encoding of st now?
Everything works when I use this st string, do stuff, write to cout,
etc., but if I want to convert this string to UTF-8, what encoding am I
converting from?

Btw, when I do something like this:

wstring a = L"ş";
wstring b;
wcin >> b;

and write "ş" into the console,

(a == b) is false. I checked a and it's Unicode (UTF-16), but b is not
Unicode; I could not manage to find out what it is.

Thanks.
C++ does not know anything about encodings (UTF-8, UTF-16 or whatever).
In C++, a wide char is just meant to be a placeholder for a wider
(often 2-byte) character value. You can put whatever you want in it.

If you want to handle encodings, you should use a library that does so.

J.
 

James Kanze

I'm new to C/C++ and working on string stuff with Visual Studio 2005.
I'm trying to understand something. For example, when I do this:
wstring st;
wcin >> st;
if the input is pure ASCII, then everything is OK, but if there are
Unicode characters like "?" (U+015F), what is the encoding of st now?

It depends on the system. Windows uses (I think) UTF-16, and
Linux UTF-32. Older systems have different conventions, which
may vary according to the compiler. (G++ and Sun CC behave
differently under Solaris, for example.)
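
A quick (if unofficial) check of which convention your implementation
follows is to look at the size of wchar_t. This is only a sketch and
proves nothing about the encoding itself, but on modern systems 2 bytes
usually goes with UTF-16 and 4 bytes with UTF-32:

#include <iostream>

int main()
{
    // VC++ on Windows typically gives 2 (UTF-16 code units); g++ on
    // Linux typically gives 4 (UTF-32). The standard guarantees neither.
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
    return 0;
}
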
Everything works when I use this st string, do stuff, write to cout,
etc., but if I want to convert this string to UTF-8, what encoding am I
converting from?

It depends on the system, the compiler, and possibly even some
options of the compiler.
Btw, when I do something like this:
wstring a = L"?";
wstring b;
wcin >> b;
and write "?" into the console,
(a == b) is false. I checked a and it's Unicode (UTF-16), but b is not
Unicode; I could not manage to find out what it is.

When reading from wcin (or any wide character input), how the input
is decoded depends on the locale embedded in the stream. By
default, this should be the "C" locale (although if you change
the global locale in the constructor of a static object, there may
be some issues concerning the order of initialization), and I
can't imagine any problems here with regard to the "C" locale.
(At least with "?", which is pure ASCII. For historical reasons,
Windows does not use the same default code page in console
windows as it uses elsewhere, so you often do get surprises.)
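
If you want the stream to decode with something other than the "C"
locale, you normally have to imbue it before reading. A minimal sketch
(the empty locale name means "whatever the environment says", and
whether the change still takes effect once characters have already
been read is implementation specific):

#include <iostream>
#include <locale>
#include <string>

int main()
{
    std::wcin.imbue( std::locale("") );    // decode input with the native locale
    std::wcout.imbue( std::locale("") );   // and encode output the same way

    std::wstring s;
    std::wcin >> s;
    std::wcout << s << L'\n';
    return 0;
}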

FWIW: I'm unable to duplicate what you describe on my Windows
machine (with VC++ 2005). Both a and b, above, contained a
single character with the value 0x003F (which corresponds to the
UTF-16 code for '?').
 

Antimon

When reading from wcin (or any wide character input), how the input
is decoded depends on the locale embedded in the stream. By
default, this should be the "C" locale (although if you change
the global locale in the constructor of a static object, there may
be some issues concerning the order of initialization), and I
can't imagine any problems here with regard to the "C" locale.
(At least with "?", which is pure ASCII. For historical reasons,
Windows does not use the same default code page in console
windows as it uses elsewhere, so you often do get surprises.)

FWIW: I'm unable to duplicate what you describe on my Windows
machine (with VC++ 2005). Both a and b, above, contained a
single character with the value 0x003F (which corresponds to the
UTF-16 code for '?').

I think that's because your newsreader displays that character as "?".
It was an "s with cedilla", Unicode character U+015F. I tried something
else here:

wstring a = L"ş";
wstring b;
wcin >> b;

wcout << (unsigned int)a[0] << "\n";
wcout << (unsigned int)b[0] << "\n";

(a is the Unicode character U+015F that I mentioned before.) When I
run this and again type the same character that a holds, I get the
output:

351
159

The first one (a) is right; U+015F is 351. But what the hell is 159? :)
If I add "locale::global(locale(""));" at the top, I get:

351
376

Still, it doesn't read UTF-16 from the console. I've been reading
through MSDN about VS2005 and Unicode stuff, but no luck yet.

Thanks a lot for helping.
 

James Kanze

I think that's because your newsreader displays that character as "?"

My newsreader displays '?' with a '?', yes:). But you're
right. On the machine on which I read your message, the only
fonts I have installed are ISO 8859-1, and anything which is not
representable in that codeset is displayed as a '?'. I see the
s-cedilla here (although the way I've configured my editor
doesn't allow inputting it; my printer wouldn't understand it,
so there's no point).

And yes, my experiment was with a '?'. (And I did the
experiment because I simply couldn't believe that a normal ASCII
character like '?' could cause problems.)
It was an "s with cedilla", Unicode character U+015F. I tried something
else here:
wstring a = L"?";
wstring b;
wcin >> b;
wcout << (unsigned int)a[0] << "\n";
wcout << (unsigned int)b[0] << "\n";
(a is the Unicode character U+015F that I mentioned before.) When I
run this and again type the same character that a holds, I get the
output:

Weird. At first, I thought that perhaps something was trimming
the upper bits somewhere, but 159 is 0x009F, and just trimming
the bits would give 0x005F.
The first one (a) is right; U+015F is 351. But what the hell is 159? :)

Application Program Command:). Whatever that means (but it is
a control character).
If I add "locale::global(locale(""));" at the top, I get:

Which is 0x178: LATIN CAPITAL LETTER Y WITH DIAERESIS.

This is curious because normally, the locale for wcin should be
set when the object is constructed, and this is before main(),
so you should always get locale "C" (I don't know if this is
intentional, but that's effectively what the standard says).
Quite obviously, changing the global locale is changing
something, but I don't know what. (I suspect that this is
occurring because, IIRC, the Microsoft implementation of wcin
goes through the FILE*, and FILE* will reflect all changes to
the global locale.)

At any rate, the fact that changing the locale does have an
effect is good news, in a way, since it probably means that all
you have to do is find the correct locale. And regretfully, I
can't help much there, since all of my experience has been on
Unix platforms (where the available locales are all represented
by sub-directories of a directory locale, usually in /usr/lib).

BTW: when outputting codes, as above, it's usually easier if you
set the hex flag, so that the values are in hex. And there is
an enormous amount of information, including the full code
charts, available online at the Unicode site
(www.unicode.org); nothing that will help you with this
particular problem, of course, but probably useful in the long
run.
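
For example, a variant of your earlier snippet (assuming the same
using-directives as your code; hex and showbase are standard
manipulators pulled in by <iostream>):

wcout << hex << showbase << (unsigned int)a[0] << "\n";   // prints 0x15f
wcout << hex << showbase << (unsigned int)b[0] << "\n";
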
Still, it doesn't read UTF-16 from the console. I've been reading
through MSDN about VS2005 and Unicode stuff, but no luck yet.

You might try the Dinkumware site. I don't know if it has
anything useful, but Dinkumware did provide Microsoft with the
libraries, and the head of the company, Plauger, is probably the
best expert in the world concerning the subtleties of handling
different code sets.

As a general rule, however, expect problems anytime you go
beyond basic ASCII.
 
