Problems with UTF-8 on Windows

amandeep.bhatia1 · Jan 11, 2007

Hello Friends,

I am working on adding internationalization support to an
existing project.

While adding UTF-8 support I ran into a problem during a proof of concept.

I have a C string, which I have declared as
const char* utf8buf = "Bienvenue à l'anglais ";

I want to support UTF-8 for I/O and wchar_t strings for internal
manipulations. So I am setting the locale with setlocale(LC_CTYPE,"UTF8");
before I start with the main code for string handling.

Then I am using MultiByteToWideChar (with CP_UTF8 as the code page) to
convert it to a std::wstring.

Then, before output, I am converting the string back to UTF-8
using WideCharToMultiByte.

The problem is that when I print the UTF-8 string I get back from the
above conversion, I get "Bienvenue l'anglais" as output, which is not
the same as the input utf8buf.

Does the C++ string class support UTF-8?

In the real environment, we are planning to get the UTF-8 strings from
a MySQL database.

How can I correct this?

Is there any other way in C/C++ to represent UTF-8 strings?

Thanks,
Aman

peter koch · Jan 11, 2007

(e-mail address removed) wrote:

> Hello Friends,
>
> I am working on adding internationalization support to an
> existing project.
>
> While adding UTF-8 support I ran into a problem during a proof of concept.
>
> I have a C string, which I have declared as
> const char* utf8buf = "Bienvenue à l'anglais ";

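The above is not guaranteed to be valid UTF-8: the bytes the compiler
stores for "à" depend on the encoding of the source file. In a
Windows-1252 file it is the single byte 0xE0, which is not valid UTF-8
on its own.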
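
To make that point concrete, here is a minimal sketch of a UTF-8 well-formedness check. The function name is_valid_utf8 is made up for this illustration, and the Windows-1252 source file is an assumption about the original poster's setup, not something stated in the thread. Spelling the bytes with hex escapes removes the dependence on the source file's encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is a well-formed UTF-8 byte
// sequence. Rejects stray or missing continuation bytes, truncated
// sequences, overlong forms, surrogates, and values above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs 1, 2, 3, or 4 bytes.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;                     // invalid lead byte
        if (i + len > s.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[len]) return false;               // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        if (cp > 0x10FFFF) return false;                  // out of range
        i += len;
    }
    return true;
}
```

With this, is_valid_utf8("Bienvenue \xE0 l'anglais ") returns false (0xE0 is a three-byte lead byte followed by a space), while is_valid_utf8("Bienvenue \xC3\xA0 l'anglais ") returns true.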

> I want to support UTF-8 for I/O and wchar_t strings for internal
> manipulations. So I am setting the locale with setlocale(LC_CTYPE,"UTF8");
> before I start with the main code for string handling.

Now we enter implementation-defined territory.

> Then I am using MultiByteToWideChar (with CP_UTF8 as the code page) to
> convert it to a std::wstring.

And this is not C++ but Windows and thus off-topic.

> Then, before output, I am converting the string back to UTF-8
> using WideCharToMultiByte.

Once again off-topic.

> The problem is that when I print the UTF-8 string I get back from the
> above conversion, I get "Bienvenue l'anglais" as output, which is not
> the same as the input utf8buf.
>
> Does the C++ string class support UTF-8?

Well... the short answer is no. You will have no problem storing a
UTF-8 buffer in a std::string, but access to individual characters is
off: string[n] might be a whole character, but it could also be one
byte of a multi-byte sequence.
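
A small sketch of what that means in practice: size() counts bytes, not characters. The helper name utf8_length is invented for this example, and it assumes the input is already well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by counting
// every byte that is NOT a continuation byte (continuation bytes have the
// bit pattern 10xxxxxx). Assumes `s` is well-formed UTF-8.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For std::string s = "\xC3\xA0 la" (the text "à la"), s.size() is 5 but utf8_length(s) is 4, and s[0] is only the lead byte 0xC3 of "à", not the character itself.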

> In the real environment, we are planning to get the UTF-8 strings from
> a MySQL database.

There is no problem getting utf-8 from a MySQL database, but I doubt
that there is any reason to store it in a std::string (but it will not
lead to an incorrect program).

> How can I correct this?

Correct what? The problem with the missing à above could very well be
related to the fact that the string above is not valid UTF-8, but you
should go to a platform-specific group (perhaps something like
microsoft.public.internationalization?) for that part.

> Is there any other way in C/C++ to represent UTF-8 strings?

You can store it in a variety of ways. The most natural way for many
applications would be to convert at the API boundaries, for instance at
the point where you get the data from your database. If you expect to
keep a large number of strings in memory and you expect UTF-8 to be a
smart internal format, you should look for a UTF-8 string class. Most
probably there are already some nice classes out there, and I vaguely
remember having read something about UTF-8 strings in boost (which is
always the first place I look).
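
As a rough illustration of converting at the boundary, here is a portable sketch that decodes UTF-8 into one code point per element on input and encodes back on output. The names utf8_to_utf32 and utf32_to_utf8 are invented for this example, and it uses std::u32string rather than std::wstring because wchar_t is only 16 bits wide on Windows. It is simplified: the decoder does not reject overlong forms or surrogates, and the encoder assumes valid code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Throws std::invalid_argument on a bad lead byte, a truncated sequence,
// or a missing continuation byte.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t len;
        char32_t cp;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("missing continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical helper: encode UTF-32 code points back into UTF-8 bytes.
std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Decoding "Bienvenue \xC3\xA0 l'anglais " this way yields 22 code points from 23 bytes, with U+00E0 at index 10, and encoding the result gives back the original bytes. Internally each element is then a full character, which is what indexing code usually wants.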

/Peter
