Wide character to multi-byte

PEK

I need some code that converts a multi-byte string to a Unicode string,
and Unicode to multi-byte. I work mostly in Windows and know how to
solve it there, but I would like to have some platform-independent
code too.

I have tried mbstowcs/wcstombs, but I'm not satisfied with the
result. If wcstombs finds a character that can't be converted, it returns
-1 and stops. I would like to replace such characters with some
special character and convert as much as possible.
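
As an illustration of that behaviour, here is a minimal sketch. It assumes the default "C" locale is still active (no setlocale call has been made), and the helper name wcstombs_fails is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: reports whether std::wcstombs rejects the string.
// On an unconvertible character it returns (size_t)-1 instead of a length,
// which is the all-or-nothing behaviour described above.
bool wcstombs_fails(const wchar_t* text)
{
    assert(text != nullptr);
    char buffer[64];  // large enough for the short strings used here
    const std::size_t written = std::wcstombs(buffer, text, sizeof buffer);
    return written == static_cast<std::size_t>(-1);
}
```

In the "C" locale a plain ASCII string converts, while a string containing the euro sign (U+20AC) should be rejected as a whole; there is no hook for substituting a replacement character.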

So I have written my own functions, based on mbtowc and wctomb. I have
successfully converted text to and from different code pages (I have
tried 437, 1252 and 949 [Korean, with some characters that take two
bytes]). So I think the code is OK, but I would appreciate it if someone
else looked at it (so I have someone to blame ;-).

The code:

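#include <cstdlib>   // mbtowc, wctomb, MB_CUR_MAX
#include <string>
#include <vector>
using namespace std;

// Both functions use the current C locale (see setlocale) as the code page.

void ConvertCharToWstring(const char* from, wstring &to)
{
    to.clear();
    mbtowc(NULL, NULL, 0);   // reset the internal shift state

    size_t pos = 0;
    wchar_t temp;

    while(true)
    {
        // mbtowc returns an int: 0 at the terminator, -1 on an invalid sequence
        int len = mbtowc(&temp, from+pos, MB_CUR_MAX);

        //Found end
        if(len == 0)
            return;
        else if(len < 0)
        {
            //Unknown character: skip one byte and reset the shift state
            mbtowc(NULL, NULL, 0);
            pos++;
        }
        else
        {
            to += temp;
            pos += len;
        }
    }
}

// *datalost is only ever set to true; the caller initializes it to false.
void ConvertWcharToString
    (const wchar_t* from, string &to,
    bool* datalost, char unknownchar)
{
    to.clear();

    vector<char> temp(MB_CUR_MAX);   // room for one multi-byte character
    wctomb(NULL, 0);                 // reset the internal shift state

    while(*from != L'\0')
    {
        // wctomb returns an int: the byte count, or -1 if *from has no mapping
        int len = wctomb(&temp[0], *from);

        if(len < 0)
        {
            //Replace with unknown character
            to += unknownchar;

            if(datalost != NULL)
                *datalost = true;
        }
        else
        {
            //Copy all bytes of this character (the original appended temp
            //itself on every pass instead of the converted bytes)
            to.append(&temp[0], len);
        }

        from++;
    }
}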

/PEK
 
Unforgiven

PEK said:
I need some code that converts a multi-byte string to a Unicode string,
and Unicode to multi-byte. I work mostly in Windows and know how to
solve it there, but I would like to have some platform-independent
code too.
/PEK

// wide-char to multibyte (needs <locale> and <string>; the function name
// is only for illustration):
string narrow_copy(const wstring& source)
{
    typedef ctype<wchar_t> CT;
    string dest(source.length(), '\0');
    CT const& ct = use_facet<CT>(locale());
    // characters without a single-char equivalent become 'X'
    if (!source.empty())
        ct.narrow(source.data(), source.data() + source.size(), 'X', &dest[0]);
    return dest;
}

For the reverse, use ct.widen instead (and make source a string and dest a
wstring of course).
This uses the global C++ locale, which at program startup is the classic "C"
locale, *not* the system locale. To set a specific locale, use:
locale::global(locale("Dutch_Netherlands"));
At least on Windows with VC, this sets the global locale to the system
locale:
locale::global(locale(""));

Note that this won't handle actual multi-byte character sets, i.e. character
sets with more than 256 characters (e.g. JIS); those characters will not get
converted properly. I know of no standard way to handle those, just the
Windows WideCharToMultiByte function.
 
Jonathan Turkanis

PEK said:
I need some code that converts a multi-byte string to a Unicode string,
and Unicode to multi-byte. I work mostly in Windows and know how to
solve it there, but I would like to have some platform-independent
code too.

The standard C++ solution is to use codecvt facets. Currently these are a bit
hard to use, but there is a proposal to add several components which would make
it easier. See

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2004/n1683.html.

In the meantime, both the Boost Serialization library and the soon-to-be-released
Boost Iostreams

http://home.comcast.net/~jturkanis/iostreams/libs/iostreams/doc/?path=5.6

library contain code conversion components. (The documentation for the iostreams
code conversion component is temporarily out-of-sync with the source.)

You can also use the Dinkumware CoreX library, which is reasonably priced and is
the basis for n1683.

Jonathan
 
Jonathan Turkanis

Note that this won't handle actual multi-byte character sets, i.e.
character sets with more than 256 characters (e.g. JIS); those characters
will not get converted properly. I know of no standard way to handle
those, just the Windows WideCharToMultiByte function.

Using mbstowcs/wcstombs *is* a standard way to handle multibyte characters. The
preferred C++ solution is to use a codecvt facet instead of a ctype facet.

Jonathan
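
A minimal sketch of what the codecvt route could look like, converting one wide character at a time so that a failure can be replaced instead of aborting the whole string. The function name narrow_with_codecvt and its parameters are assumptions made for this example, not part of any library. It does not handle surrogate pairs (relevant where wchar_t is 16 bits) or a final unshift for stateful encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: converts `from` with the codecvt facet of `loc`.
// Characters the facet rejects are replaced by `unknown`; `*datalost`
// (if non-null) is set to true when that happens.
std::string narrow_with_codecvt(const std::wstring& from,
                                const std::locale& loc,
                                char unknown = '?',
                                bool* datalost = nullptr)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Codecvt;
    const Codecvt& cvt = std::use_facet<Codecvt>(loc);

    // max_length() bounds the bytes produced for a single wide character.
    assert(cvt.max_length() > 0);
    std::vector<char> buf(static_cast<std::size_t>(cvt.max_length()));

    std::string to;
    std::mbstate_t state = std::mbstate_t();

    for (std::size_t i = 0; i < from.size(); ++i) {
        const wchar_t* in = &from[i];
        const wchar_t* in_next = in;
        char* out_next = buf.data();
        const std::mbstate_t saved = state;  // restored if conversion fails

        const Codecvt::result r =
            cvt.out(state, in, in + 1, in_next,
                    buf.data(), buf.data() + buf.size(), out_next);

        if (r == Codecvt::ok) {
            to.append(buf.data(),
                      static_cast<std::size_t>(out_next - buf.data()));
        } else {
            // error (or anything unexpected): substitute and carry on
            state = saved;
            to += unknown;
            if (datalost) *datalost = true;
        }
    }
    return to;
}
```

Passing std::locale::classic() gives plain "C" locale behaviour, where a character such as U+20AC should come out as the replacement; passing std::locale("") would use the user's environment instead.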
 
Unforgiven

Jonathan Turkanis said:
Using mbstowcs/wcstombs *is* a standard way to handle multibyte characters.

That I knew, but it has the drawback of bailing out on unrecognized characters
instead of replacing them with some predetermined character (like '?'), as
the OP mentioned.
The preferred C++ solution is to use a codecvt facet instead of a ctype facet.

That I didn't know.
 
PEK

That I knew, but it has the drawback of bailing out on unrecognized characters
instead of replacing them with some predetermined character (like '?'), as
the OP mentioned.

A workaround for this is to use mbtowc/wctomb instead and convert the
characters in a loop. This was my solution and it seems to work, or
are there problems with it?


That I didn't know.

The code Unforgiven posted is a bit obscure, but I think I understand most
of it. But I also want to detect whether an unrecognized character was
replaced (I guess I didn't mention that in my earlier post). Another
problem with the code is that I suppose it's hard to calculate the
length of the result when multi-byte characters are used.


/PEK
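
One possible way to get the "was anything replaced?" flag out of the ctype approach is to narrow each character twice with two different defaults: if the results differ, the facet fell back to the default. This is a sketch under that assumption, and narrow_and_report is a made-up name. On the length question, ctype::narrow always yields exactly one char per wchar_t, so the result has the same length as the source; that is also why it cannot produce real multi-byte output. For a wctomb-style conversion, source length times MB_CUR_MAX is an upper bound, and appending to a std::string avoids computing the length up front.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical helper: narrows `from` with the ctype facet of `loc` and
// reports through `datalost` whether any character fell back to `unknown`.
std::string narrow_and_report(const std::wstring& from,
                              const std::locale& loc,
                              char unknown,
                              bool& datalost)
{
    assert(unknown != '\0');  // '\0' serves as the second probe value below
    const std::ctype<wchar_t>& ct =
        std::use_facet<std::ctype<wchar_t> >(loc);

    datalost = false;
    std::string to;
    to.reserve(from.size());  // one char per wide character

    for (std::wstring::size_type i = 0; i < from.size(); ++i) {
        const char a = ct.narrow(from[i], unknown);
        const char b = ct.narrow(from[i], '\0');
        if (a != b)           // result depends on the default: no mapping
            datalost = true;
        to += a;
    }
    return to;
}
```

A literal '?' in the source narrows to '?' with either default, so it is not mistaken for a replacement.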
 
