Problems with UTF-8 on Windows


amandeep.bhatia1

Hello Friends,

I am working on a project to support internationalization for an
existing project.

While supporting UTF-8 I am facing a problem while doing a proof of concept (POC).

I have a C string
which I have declared as
const char* utf8buf = "Bienvenue à l'anglais ";

I want to support UTF-8 for I/O and wchar_t strings for internal
manipulations. So I am setting the locale with setlocale(LC_CTYPE,"UTF8");
before I start with the main code for string handling.

Then I am using MultiByteToWideChar (using codepage as CP_UTF8) to
convert it to wstring.

Then again before output I am converting the string back to UTF8 format
using WideCharToMultiByte.

The problem is that when I print the UTF-8 string I get back from the
above conversion, I get "Bienvenue
l'anglais" as output, which is not the same as the input utf8buf.

Does the C++ string class support UTF-8?

In the real environment, we are planning to get the UTF-8 strings from
a MySQL database.

How can I correct this?

Is there any other way in C/C++ to represent UTF8 strings?

Thanks,
Aman
 

peter koch

(e-mail address removed) wrote:
> Hello Friends,
>
> I am working on a project to support internationalization for an
> existing project.
>
> While supporting UTF-8 I am facing a problem while doing a POC.
>
> I have a C string
> which I have declared as
> const char* utf8buf = "Bienvenue à l'anglais ";

The above is not guaranteed to be valid UTF-8: the bytes stored for the
à depend on the encoding of your source file and on your compiler.
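
To make that point concrete, here is a minimal sketch, not taken from the original post. The helper is_valid_utf8 and the buffer latin1buf are hypothetical names introduced only for illustration. It spells the à with explicit byte escapes so the literal no longer depends on the source file encoding, and it checks whether a buffer is well-formed UTF-8. The second buffer assumes a compiler that stores à as the single byte 0xE0, as a Windows-1252 or Latin-1 setup typically would.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8. Rejects truncated sequences,
// stray continuation bytes, overlong forms, surrogates, and values above
// U+10FFFF.
bool is_valid_utf8(const std::string& s) {
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;             // continuation byte or invalid lead byte
        if (i + len > n) return false; // sequence runs past the end
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp > 0x10FFFF) return false;                 // outside Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += len;
    }
    return true;
}

// U+00E0 (à) written as its two UTF-8 bytes, independent of file encoding.
const char* const utf8buf = "Bienvenue \xC3\xA0 l'anglais ";

// Assumption: what a Latin-1 / Windows-1252 build would store for the
// original literal. The lone 0xE0 byte is not valid UTF-8.
const char* const latin1buf = "Bienvenue \xE0 l'anglais ";
```

With these definitions, is_valid_utf8(utf8buf) is true and is_valid_utf8(latin1buf) is false, which is one plausible reason a UTF-8 conversion would lose the à.
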

> I want to support UTF-8 for I/O and wchar_t strings for internal
> manipulations. So I am setting the locale with setlocale(LC_CTYPE,"UTF8");
> before I start with the main code for string handling.

Now we enter implementation-defined territory.

> Then I am using MultiByteToWideChar (using codepage as CP_UTF8) to
> convert it to wstring.

And this is not C++ but Windows, and thus off-topic.

> Then again before output I am converting the string back to UTF8 format
> using WideCharToMultiByte.

Once again off-topic.

> The problem is after getting back the UTF8 string after above
> conversion, when I am printing the string, I am getting "Bienvenue
> l'anglais" as output, which is not same as the input utf8buf.
>
> Does C++ string class support UTF-8?

Well... the short answer is no. You will have no problem storing a
UTF-8 buffer in a std::string, but access to individual characters is
off: string[n] might be a character, but it could also be part of a
multi-byte sequence.
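
A small sketch of what that means in practice, assuming the buffer already holds well-formed UTF-8. The function name utf8_length is hypothetical. It counts characters (code points) by skipping continuation bytes, which always have the bit pattern 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a std::string holding UTF-8.
// Assumption: the input is well-formed; no validation is done here.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        // Every byte that is not a continuation byte starts a new code point.
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "Bienvenue \xC3\xA0 l'anglais", size() reports 22 bytes while utf8_length reports 21 characters, and s[10] is only the first byte (0xC3) of the à, not the character itself.
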

> In real environment, we are planning to get the UTF8 strings from
> MySQL database.

There is no problem getting UTF-8 from a MySQL database, but I doubt
that there is any reason to store it in a std::string (but it will not
lead to an incorrect program).

> How can I correct this?

Correct what? The problem with the missing à above could very well be
related to the string above not being valid UTF-8, but you
should go to the platform-specific group (perhaps something like
microsoft.public.internationalization?) for that part.

> Is there any other way in C/C++ to represent UTF8 strings?

You can store it in a variety of ways. The most natural way for many
applications would be to convert at APIs - for instance at the point
you get the data from your database. If you expect to keep large
amounts of strings in memory and if you expect UTF-8 would be a smart
internal format, you should look for a UTF-8 string class. Most probably
there will already be some nice classes out there, and I vaguely
remember having read something about UTF-8 strings in Boost (and that is
always the first place I look).
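
As a portable illustration of converting at the API boundary, here is a minimal sketch under stated assumptions: the input is already well-formed UTF-8, and the program works on std::u32string internally (one element per code point). The names utf8_to_utf32 and utf32_to_utf8 are hypothetical. On Windows the equivalent boundary conversion would go through MultiByteToWideChar and WideCharToMultiByte, with wchar_t holding UTF-16 rather than UTF-32.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 into UTF-32 at the input boundary.
// Assumption: `in` is well-formed UTF-8; this sketch does not validate.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len = 1;
        char32_t cp = b0;
        if (b0 >= 0xF0)      { len = 4; cp = b0 & 0x07; }
        else if (b0 >= 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if (b0 >= 0xC0) { len = 2; cp = b0 & 0x1F; }
        // Fold in the low six bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Encodes UTF-32 back to UTF-8 at the output boundary.
std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

The idea is to decode once when data arrives (for example from the database), do all indexing and manipulation on the decoded form, and encode once on the way out, so that a round trip of "Bienvenue \xC3\xA0 l'anglais" gives back the same bytes.
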

/Peter
 
