How should I handle multibyte character set strings in C++?

Dancefire

Hi, everyone,

I'm writing a program that uses wstring (wchar_t) as its internal
string type.

The problem arises when I convert multibyte strings in various
encodings to wstring (which is Unicode: UCS-2LE (BMP only) on Win32,
and UCS-4 on Linux?).

I see two ways to do the job:

1) Use std::locale: set std::locale::global() and call mbstowcs() and
wcstombs() to do the conversion.

2) Use platform-dependent functions, such as libiconv on Linux, or
MultiByteToWideChar() and WideCharToMultiByte() on Win32.

At first glance, solution 1) looks like the obvious choice. It is the
C++ way, and as far as I understand, the codecvt facet just wraps the
platform functions anyway: libiconv on Linux, and
MultiByteToWideChar() or WideCharToMultiByte() on Win32, depending on
the STL implementation.

However, I have two problems.

First, I have to set the global locale before I do the conversion.

This has two side effects. The first is that in a multithreaded
program, changing the global setting affects other threads that are
converting with a different encoding. Yes, I could put a lock around
the conversion, but that makes little sense and would hurt performance
badly.

The second is that every call to std::locale::global() takes time:
creating a locale object and installing it as the global locale is not
cheap, so performance suffers.

The second problem is that the system-dependent conversion functions
seem to support many more encodings than std::locale does in each STL
implementation. For example, libiconv supports the UCS-2LE encoding,
but g++'s locale does not. MultiByteToWideChar() supports UTF-8
conversion, but the std::locale in MSVC 8.0's STL does not accept
".65001" for code page 65001, which is UTF-8.

A possible third problem is that locale names differ between
platforms, but I can easily work around that with #ifdef/#endif.

So, back to the original question: how should I handle MBCS strings in
C++?

Thanks.
 
James Kanze

> I'm writing a program that uses wstring (wchar_t) as its internal
> string type. The problem arises when I convert multibyte strings in
> various encodings to wstring (which is Unicode: UCS-2LE (BMP only) on
> Win32, and UCS-4 on Linux?).
> I see two ways to do the job:
> 1) Use std::locale: set std::locale::global() and call mbstowcs() and
> wcstombs() to do the conversion.

Why not std::codecvt? It's a facet which you can obtain from a
locale.

> 2) Use platform-dependent functions, such as libiconv on Linux, or
> MultiByteToWideChar() and WideCharToMultiByte() on Win32.
> At first glance, solution 1) looks like the obvious choice. It is the
> C++ way, and as far as I understand, the codecvt facet just wraps the
> platform functions anyway: libiconv on Linux, and
> MultiByteToWideChar() or WideCharToMultiByte() on Win32, depending on
> the STL implementation.
> However, I have two problems.
> First, I have to set the global locale before I do the conversion.

Why? You can get a facet from any locale. That's the one
advantage C++ locales have over the C stuff.
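
A minimal sketch of that point, assuming only the standard <locale> header: the facet is fetched from a locale object passed in by the caller, so the global locale is never touched and each thread can work with its own locale object. The helper name widen_with is made up for illustration, and treating anything other than "ok" as a failure is a simplification.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: convert a narrow string to a wide string using the
// codecvt facet of the locale passed in. std::locale::global() is not called.
std::wstring widen_with(const std::locale& loc, const std::string& src)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> CodeCvt;
    const CodeCvt& cvt = std::use_facet<CodeCvt>(loc);

    if (src.empty())
        return std::wstring();

    // Assumption: one wide character per input byte is enough room,
    // which holds for the usual multibyte encodings.
    std::vector<wchar_t> buf(src.size());
    std::mbstate_t state = std::mbstate_t();   // initial shift state

    const char* from      = src.data();
    const char* from_end  = from + src.size();
    const char* from_next = from;
    wchar_t*    to_next   = &buf[0];

    const std::codecvt_base::result res =
        cvt.in(state, from, from_end, from_next,
               &buf[0], &buf[0] + buf.size(), to_next);

    // Simplification: partial input and errors are reported the same way.
    if (res != std::codecvt_base::ok)
        throw std::runtime_error("conversion failed or input incomplete");

    return std::wstring(&buf[0], to_next);
}
```

Calling widen_with(std::locale::classic(), "abc") works everywhere because the classic locale always exists; a locale constructed from a platform-specific name would be passed in the same way.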

[...]
> The second problem is that the system-dependent conversion functions
> seem to support many more encodings than std::locale does in each STL
> implementation.

That's a problem with the C++ library implementation. A quality
implementation will support all of the code sets that are
installed on the system.

> For example, libiconv supports the UCS-2LE encoding, but g++'s
> locale does not. MultiByteToWideChar() supports UTF-8 conversion,
> but the std::locale in MSVC 8.0's STL does not accept ".65001" for
> code page 65001, which is UTF-8.

Finding what locales are available and work can be a bit of a
game:). And how they are named, if you're not under Unix.

> A possible third problem is that locale names differ between
> platforms, but I can easily work around that with #ifdef/#endif.
> So, back to the original question: how should I handle MBCS strings
> in C++?

The official answer is std::codecvt. In practice, I roll my
own:).
 
Dancefire

> Why not std::codecvt? A facet which you can obtain from a locale.

Oops, I missed std::codecvt. Thank you.

After trying std::codecvt, I have two more questions.

1) Should we initialize the mbstate_t variable? And how do we
initialize it portably, in the C++ way?

Much of the sample code I have seen on the net does not initialize the
mbstate_t variable. For example:

http://incubator.apache.org/stdcxx/doc/stdlibref/codecvt.html#sec12

std::mbstate_t state;

And the sample in the MSDN shipped with Visual Studio 2005:

mbstate_t state;

They just declare it and use it, never assigning an initial value. And
I did hit a problem in VC80 when I did not zero the state: in debug
mode the first character always gets messed up, and the following ones
are OK.

But the online version of MSDN does initialize the mbstate_t variable:
http://msdn2.microsoft.com/en-us/library/xse90h58(VS.80).aspx

mbstate_t state = {0};

I also found code that uses memset() to zero the whole object, but I
don't think that is the C++ way.
How should I initialize it portably?

2) I can get the wchar_t buffer length for codecvt.in() from
codecvt.length(), but how do I find the char buffer length for
codecvt.out()?

I can pass a null pointer to mbstowcs() or wcstombs() to get the length
of the output buffer I need, but I don't know how to do the same thing
using codecvt.

> Finding what locales are available and work can be a bit of a
> game:). And how they are named, if you're not under Unix.

I use "locale -a" to list all the locale names supported on Linux, and
the following page to find the locale names on Windows:

http://msdn2.microsoft.com/en-us/library/hzz3tw78(vs.80).aspx

However, I still cannot handle "UCS-2"/"UTF-16" on Linux or
"UTF-8"/"UTF-16" on Windows through std::locale. Do you know how I can
do this?

> The official answer is std::codecvt. In practice, I roll my
> own:).

Thanks again, you have been a great help.
 
sebor

> 1) Should we initialize the mbstate_t variable? And how do we
> initialize it portably, in the C++ way?
>
> Much of the sample code I have seen on the net does not initialize
> the mbstate_t variable. For example:
>
> http://incubator.apache.org/stdcxx/doc/stdlibref/codecvt.html#sec12
>
> std::mbstate_t state;

Strictly speaking you should zero-initialize the state. It doesn't
matter in the trivial example shown in the Apache stdcxx documentation,
but in general the state must either be zeroed out (i.e., represent the
initial shift state) or be the result of a prior conversion.

I have corrected the example program to initialize the state variable,
see: http://svn.apache.org/viewvc?view=rev&revision=533806. I'll fix
the docs next.

[...]
> mbstate_t state = {0};
>
> I also found code that uses memset() to zero the whole object, but I
> don't think that is the C++ way.
> How should I initialize it portably?

Like so:

mbstate_t state = mbstate_t ();
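
A small sketch of that initialization, assuming only <cwchar>: value-initialization zeroes the object, and std::mbsinit() can be used to confirm that a state describes the initial conversion state. The names fresh_state and is_initial are made up for illustration.

```cpp
#include <cwchar>

// Illustrative helper: returns a state that describes the initial
// conversion (shift) state.
std::mbstate_t fresh_state()
{
    // Value-initialization zeroes the object; no memset() is needed.
    std::mbstate_t state = std::mbstate_t();
    return state;
}

// Illustrative helper: true if the state is the initial conversion state.
bool is_initial(const std::mbstate_t& state)
{
    return std::mbsinit(&state) != 0;
}
```

A state that has just been value-initialized should always pass the is_initial() check; after a conversion that stopped in the middle of a multibyte character it may not.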

> 2) I can get the wchar_t buffer length for codecvt.in() from
> codecvt.length(), but how do I find the char buffer length for
> codecvt.out()?

codecvt::length() returns the number of extern_type characters (i.e.,
narrow chars for codecvt<wchar_t, char>).

[...]
> However, I still cannot handle "UCS-2"/"UTF-16" on Linux or
> "UTF-8"/"UTF-16" on Windows through std::locale. Do you know how I
> can do this?

In the Apache C++ Standard Library you can do it using
a codecvt_byname facet constructed with the name "UTF-8@UCS"
as an argument, although it's not mentioned on the documentation page:
http://incubator.apache.org/stdcxx/doc/stdlibref/codecvt-byname.html
Let me look into adding it.
 
Dancefire

> > 1) Should we initialize the mbstate_t variable?
>
> Strictly speaking you should zero-initialize the state. It doesn't
> matter in the trivial example shown in the Apache stdcxx
> documentation, but in general the state must either be zeroed out
> (i.e., represent the initial shift state) or be the result of a prior
> conversion.
>
> I have corrected the example program to initialize the state
> variable, see: http://svn.apache.org/viewvc?view=rev&revision=533806.
> I'll fix the docs next.

Yes, the example in the Apache stdcxx documentation works because it
doesn't try to handle MBCS text in a CJK encoding. If the state is not
zero, the code has trouble with MBCS strings: if the first one or two
bytes are greater than 0x80 they are converted to a wrong result,
although the following bytes may be converted correctly; if the first
one or two chars are below 0x80, the call may simply return an error.

Thank you very much for correcting the code and the docs. It will make
things much clearer for others and spare them the problem I ran into.

[...]
> > How should I initialize it portably?
>
> Like so:
>
> mbstate_t state = mbstate_t ();

I get it, thank you very much.

> codecvt::length() returns the number of extern_type characters (i.e.,
> narrow chars for codecvt<wchar_t, char>).

I'm still a little confused here, even after reading the
documentation. Could you give me a piece of code that shows how to do
the same thing as the code below?

===================================
string str("\xba\xba\xd6\xd7");
size_t len = mbstowcs(0, str.c_str(), str.length());
wchar_t* wstr = new wchar_t[len+1];
mbstowcs(wstr, str.c_str(), len);
===================================
And the reverse version:

===================================
wstring wstr(L"\xbaba\xd6d7");
size_t len = wcstombs(0, wstr.c_str(), wstr.length());
char* str = new char[len+1];
wcstombs(str, wstr.c_str(), len);
===================================

The point is that I need the length of the output buffer so I can
allocate it safely. How can I get the buffer length for both
codecvt::in() and codecvt::out()?

BTW, is the code above correct? I mean the second call to wcstombs() or
mbstowcs(), which passes "len" as the length, whereas the first call
passes "wstr.length()" or "str.length()".

[...]
> > However, I still cannot handle "UCS-2"/"UTF-16" on Linux or
> > "UTF-8"/"UTF-16" on Windows through std::locale. Do you know how I
> > can do this?
>
> In the Apache C++ Standard Library you can do it using
> a codecvt_byname facet constructed with the name "UTF-8@UCS"
> as an argument, although it's not mentioned on the documentation page:
> http://incubator.apache.org/stdcxx/doc/stdlibref/codecvt-byname.html
> Let me look into adding it.

Thank you, now I know how to handle this in the Apache C++ Standard
Library. I will try that.
Do you know how I can do this with g++'s STL? I mean conversion between
a wchar_t* that holds a UCS-4 string and a char* that holds a UCS-2 or
UTF-16 string.

The problem comes up because I am trying to make a project portable
between Windows and Linux, and I want to write Unicode strings to a
file.

If I choose UTF-8 for the file, I get two problems:

1) VC80's STL doesn't support a UTF-8 locale (the Win32 API does, but
using it would make part of the code non-portable).
2) All of the strings are CJK characters, so UTF-8 needs at least 3
bytes per character. That is 50% more storage than UCS-2, which is
unnecessary: I'm sure all the characters are in the BMP of ISO 10646,
so I'd rather store just 16 bits per character in the file.

However, if I choose UCS-2LE, which is what wchar_t holds in VC, I have
trouble reading the file on Linux: g++'s STL doesn't seem to support a
UCS-2LE locale, and wchar_t on Linux is UCS-4 rather than UCS-2, so I
cannot read the content directly. (It's the same story again: libiconv
supports UCS-2LE, but using it would make part of the code non-portable
and make my code depend on libiconv.)

So, what should I do in this case?
 
P.J. Plauger

Everything you need is included in our Compleat Libraries, for both
VC++ and gcc. But they cost $.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
Dancefire

Yes, the Compleat Libraries look cool, but before I pay for them I need
to make sure there is no easy way to do this without them.
I'm developing an open source project, so for portability I'd rather
depend only on the existing STL in VC80 Express on Windows and
libstdc++ on Linux (or other platforms).
I'm trying to find a common Unicode encoding supported by both the
VC80 Express STL and libstdc++.
 
P.J. Plauger

Well, you can encode Unicode as:

-- UTF-8 in an array of char

-- UTF-16 in an array of short (or wchar_t under VC++)

-- UCS-2 in an array of short (if you're willing to settle for the common
65K Unicode subset)

-- UTF-32 or UCS-4 in an array of long (or wchar_t under gcc)
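
To make the size trade-off between these forms concrete, here is a hand-rolled sketch (not any library's API; the name encode_utf8 is made up) that encodes one Unicode code point as UTF-8. Code points from U+0800 to U+FFFF, which includes the CJK part of the BMP, take three bytes in UTF-8 versus two in UCS-2 or UTF-16.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: encode a single Unicode code point as UTF-8.
std::string encode_utf8(unsigned long cp)
{
    // Surrogates and values above U+10FFFF are not valid code points to encode.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");

    std::string out;
    if (cp < 0x80) {                 // 1 byte: ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: rest of the BMP, incl. CJK
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: beyond the BMP
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, encode_utf8(0x6C49) yields the three bytes E6 B1 89, while the same character needs a single 16-bit unit in UCS-2.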

We supply a whole slew of interconversions between these forms, and
the appropriate endian versions in files, in our Code Conversions
library (part of the Compleat Libraries). See:

file:///C:/htm_cplt/temp/index_cvt.html

for an essay on code conversions and the list of facets we supply.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
Dancefire

Thanks, but I can't open the link; it's a local file...

And one more question about codecvt. I'm not familiar with it, so I
need some help here.

> codecvt::length() returns the number of extern_type characters (i.e.,
> narrow chars for codecvt<wchar_t, char>).

I'm still a little confused here, even after reading the
documentation. Could you give me a piece of code that shows how to do
the same thing as the code below?

===================================
string str("\xba\xba\xd6\xd7");
size_t len = mbstowcs(0, str.c_str(), str.length());
wchar_t* wstr = new wchar_t[len+1];
mbstowcs(wstr, str.c_str(), len);
===================================
And the reverse version:

===================================
wstring wstr(L"\xbaba\xd6d7");
size_t len = wcstombs(0, wstr.c_str(), wstr.length());
char* str = new char[len+1];
wcstombs(str, wstr.c_str(), len);
===================================

The point is that I need the length of the output buffer so I can
allocate it safely. How can I get the buffer length for both
codecvt::in() and codecvt::out()?

BTW, is the code above correct? I mean the second call to wcstombs() or
mbstowcs(), which passes "len" as the length, whereas the first call
passes "wstr.length()" or "str.length()".

Thanks
 
sebor

> I'm still a little confused here, even after reading the
> documentation. Could you give me a piece of code that shows how to do
> the same thing as the code below?

I don't blame you for being confused. You can't use length() for this
(or for much else, I'm afraid). It's really not a very useful
function.

> ===================================
> string str("\xba\xba\xd6\xd7");
> size_t len = mbstowcs(0, str.c_str(), str.length());
> wchar_t* wstr = new wchar_t[len+1];
> mbstowcs(wstr, str.c_str(), len);
> ===================================

Here's an implementation of mbstowcs() using codecvt. I'll probably
put it up on the Apache stdcxx site or include it in the documentation
but I'm pasting it here for reference (let me know if you run into any
problems with it). The reverse (i.e., wcstombs()) is analogous and
I'll leave its implementation as an exercise for interested
readers ;-)

#include <cstddef>   // std::size_t
#include <cstring>   // std::strlen
#include <cwchar>    // std::mbstate_t
#include <locale>    // std::locale, std::codecvt, std::use_facet

std::size_t
my_mbstowcs (std::mbstate_t *pstate,
             wchar_t        *dst,
             const char     *src,
             std::size_t     size)
{
    // a default-constructed locale is a copy of the global locale
    const std::locale global;

    typedef std::codecvt<wchar_t, char, std::mbstate_t> CodeCvt;

    // retrieve the codecvt facet from the global locale
    const CodeCvt &cvt = std::use_facet<CodeCvt>(global);

    // use local shift state when pstate is null
    std::mbstate_t state = std::mbstate_t ();
    if (0 == pstate)
        pstate = &state;

    // use a small local buffer when dst is null and ignore size
    wchar_t buf [32];
    if (0 == dst) {
        dst  = buf;
        size = sizeof buf / sizeof *buf;
    }

    const char *from      = src;
    const char *from_end  = from + std::strlen (from);
    const char *from_next = from;

    wchar_t *to      = dst;
    wchar_t *to_end  = to + size;
    // initialized because the loop condition reads it before in() is called
    wchar_t *to_next = to;

    // number of non-NUL wide characters stored in destination buffer
    std::size_t nconv = 0;

    for ( ; from_next != from_end && to_next != to_end; ) {

        const std::codecvt_base::result res =
            cvt.in (*pstate,
                    from, from_end, from_next,
                    to, to_end, to_next);

        switch (res) {

        case std::codecvt_base::error:
            return std::size_t (-1);

        case std::codecvt_base::noconv:
            // should not happen (bad codecvt facet)
            return std::size_t (-1);

        case std::codecvt_base::ok:
        case std::codecvt_base::partial:

            nconv += to_next - to;

            // done if no progress was made or the caller supplied dst
            if (from_next == from || dst != buf)
                return nconv;

            // counting mode: reuse the local buffer for the next chunk
            from    = from_next;
            to      = dst;
            to_end  = dst + size;
            to_next = to;   // reset so a full buffer does not end the loop

            break;
        }
    }

    return nconv;
}
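
For the reverse direction, one possible sketch (one way to approach the exercise, not necessarily what ends up on the Apache stdcxx site) uses codecvt::out(). It sizes the output buffer as the number of wide characters times max_length(), which is an upper bound on the narrow chars a single wide character can produce, so no counting pass is needed. The name my_narrow is hypothetical, the global locale is used as in my_mbstowcs() above, and a stateful encoding would additionally need a call to unshift() at the end, which is omitted here.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: convert a wide string to a narrow (multibyte)
// string using the codecvt facet of the current global locale.
std::string my_narrow(const std::wstring& src)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> CodeCvt;

    const std::locale global;   // copy of the global locale
    const CodeCvt& cvt = std::use_facet<CodeCvt>(global);

    if (src.empty())
        return std::string();

    // Upper bound on the output size: every wide character produces at
    // most max_length() narrow chars.
    const int max_len = cvt.max_length();
    const std::size_t per_char =
        static_cast<std::size_t>(max_len > 0 ? max_len : 1);
    std::vector<char> buf(src.size() * per_char);

    std::mbstate_t state = std::mbstate_t();   // initial shift state
    const wchar_t* from_next = src.data();
    char* to_next = &buf[0];

    const std::codecvt_base::result res =
        cvt.out(state,
                src.data(), src.data() + src.size(), from_next,
                &buf[0], &buf[0] + buf.size(), to_next);

    // Simplification: anything other than ok is reported as a failure.
    if (res != std::codecvt_base::ok)
        throw std::runtime_error("wide character not representable");

    return std::string(&buf[0], to_next);
}
```

The trade-off is memory for simplicity: the buffer may be several times larger than needed, but the conversion is done in a single call.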

[...]
> BTW, is the code above correct? I mean the second call to wcstombs()
> or mbstowcs(), which passes "len" as the length, whereas the first
> call passes "wstr.length()" or "str.length()".

I don't think that's correct. The last argument specifies the size of
the destination buffer.
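
A sketch of the usual two-pass pattern, assuming a POSIX-style mbstowcs() that returns the required length when the destination is a null pointer (the behavior the question relies on): the first call only counts, and the second call is given the capacity of the destination buffer, here len + 1 so that the terminating null is written too. The helper name widen_c is made up, and the conversion uses whatever C locale is current.

```cpp
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: two-pass conversion with mbstowcs().
std::wstring widen_c(const std::string& src)
{
    // Pass 1: count only. The size argument is ignored when dst is null.
    const std::size_t len = std::mbstowcs(0, src.c_str(), 0);
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");

    // Pass 2: convert. The last argument is the capacity of the
    // destination in wide characters, including room for the terminator.
    std::vector<wchar_t> buf(len + 1);
    std::mbstowcs(&buf[0], src.c_str(), buf.size());

    return std::wstring(&buf[0], len);
}
```

Using a std::vector instead of new[] also removes the need to delete the buffer by hand.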
[...]
> Thank you, now I know how to handle this in the Apache C++ Standard
> Library. I will try that.
> Do you know how I can do this with g++'s STL? I mean conversion
> between a wchar_t* that holds a UCS-4 string and a char* that holds a
> UCS-2 or UTF-16 string.

You should be able to use the same code to convert between UCS and
UTF-8 across all implementations. The only thing that may be different
is the name of the locale. I don't know of a portable way to do UTF-16
(not to be confused with UCS-2), or UCS-2 on platforms where wchar_t
isn't 2 bytes wide (or, conversely, UCS-4 where wchar_t is 2 bytes).
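
Since the file format in question is BMP-only, one implementation-independent route is to skip locales for the file I/O and do the byte packing by hand, in the spirit of the "roll my own" remark earlier in the thread. The sketch below assumes every wchar_t holds a single BMP code point (true for UCS-2 text under VC++ and for BMP-only UCS-4 text under gcc); the names to_ucs2le and from_ucs2le are made up for illustration.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: pack BMP-only wide text as UCS-2 little-endian bytes.
std::string to_ucs2le(const std::wstring& src)
{
    std::string out;
    out.reserve(src.size() * 2);
    for (std::size_t i = 0; i < src.size(); ++i) {
        const unsigned long cp = static_cast<unsigned long>(src[i]);
        // Reject anything that is not a single BMP code point.
        if (cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::range_error("not a single BMP code point");
        out += static_cast<char>(cp & 0xFF);          // low byte first
        out += static_cast<char>((cp >> 8) & 0xFF);   // then high byte
    }
    return out;
}

// Hypothetical helper: unpack UCS-2 little-endian bytes into wide text.
std::wstring from_ucs2le(const std::string& bytes)
{
    if (bytes.size() % 2 != 0)
        throw std::range_error("odd number of bytes");
    std::wstring out;
    out.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        out += static_cast<wchar_t>(lo | (hi << 8));
    }
    return out;
}
```

The byte strings would be written and read with the file opened in binary mode, so the same file can be exchanged between Windows and Linux regardless of the size of wchar_t.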
 
