UnicodeString to std::string or char*

Tristan Wibberley

You mean ICU UnicodeString? You'll need to convert it to UTF-8. See
the ICU converters: http://icu-project.org/userguide/codepageConverters.html
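For illustration, here is a minimal sketch of what such a conversion does. ICU's UnicodeString stores UTF-16 internally, so the job is UTF-16 to UTF-8. The helper `utf16_to_utf8` is a hypothetical name, not an ICU function, and it takes a std::u16string as a stand-in for UnicodeString so that it compiles without ICU. With ICU itself you would use the converters linked above, or `UnicodeString::toUTF8String()` in versions that have it.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of ICU): encode UTF-16 code units as UTF-8.
// Unpaired surrogates are replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate if present.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate with no preceding high surrogate
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The resulting std::string holds UTF-8 bytes; c_str() on it gives the char* the subject line asks for.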

Rather than simply UTF-8, you'd be better off converting it to the
character encoding of std::string::value_type with respect to the
std::codecvt facet of the current global locale (if you expect to do any
string processing or stream it to a std::ostream), or to the std::codecvt
facet of the locale you're going to output the string to if you're going
to use some binary I/O API.

BTW that page says:

"On Windows there are three encodings in use at the same time. Unicode
(UTF-16) is always used inside of Windows ..."

That is true for Windows 2000 and later, but I read somewhere online that
Windows 95 and 98 use UCS-2.

I.e., programs written for Windows 95 and 98 can and probably will assume
that each element of a BSTR is a whole character, while on later
versions of Windows a character may span more than one element (a UTF-16
surrogate pair). Unfortunately, programs for Windows 2000 and later very
often make the same assumption. They are very broken.
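To make the difference concrete, here is a small sketch that counts code points rather than elements. It uses std::u16string as a stand-in for a BSTR, and `count_code_points` is a hypothetical helper that assumes well-formed UTF-16.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-16 sequence.
// Assumes well-formed input, so every low surrogate is the second half of a
// pair and is not counted on its own.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        if (u < 0xDC00 || u > 0xDFFF) {
            ++n;
        }
    }
    return n;
}
```

For U+1D11E (musical G clef), size() reports two elements but count_code_points() reports one character. Code that treats size() as the character count is making the UCS-2 assumption.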

Another thing to remember when dealing with most implementations is that
comparison with even a German Unicode std::collate facet will not find
U+00F6 (o-umlaut) and U+006F (o), U+0308 (umlaut) to be equal. Which just
seems wrong to me - I would have thought they'd be equal in *all* Unicode
collations - Unicode documents them to be equivalent representations.

On a related note, if anybody knows how to use
std::lexicographical_compare with a specific locale without having to
change the global locale (which is really not safe in a nontrivial
program), please answer - I'm becoming upset. I'd especially like to be
able to specify a locale for each sequence of encoding atoms so if the
implementation has a mapping between them, it will be able to tell me if
they are equal even when they're in different encodings.

--
Tristan Wibberley

Any opinion expressed is mine (or else I'm playing devil's advocate for
the sake of a good argument). My employer had nothing to do with this
communication.
 
Nemanja Trifunovic

Rather than simply UTF-8, you'd be better off converting it to the
character encoding of std::string::value_type with respect to the
std::codecvt facet of the current global locale (if you expect to do any
string processing or stream it to a std::ostream), or to the std::codecvt
facet of the locale you're going to output the string to if you're going
to use some binary I/O API.

But how can you be sure that all the characters from the UnicodeString
fit into the char encoding of the current global locale?
 
Tristan Wibberley

But how can you be sure that all the characters from the UnicodeString
fit into the char encoding of the current global locale?

You can set the global locale to have a UTF-8 char codecvt facet. Otherwise,
decline to use any string processing functions unless you can pass them a
locale/codecvt facet that says the string is UTF-8, and decline to stream
the string through a std::ostream (although you can .imbue() the stream
with an appropriate locale).

Although, if you are just going to throw the data out to a file via binary I/O
or if you will have custom UTF-8 processing functions then you can do it safely.

I'm kind of just trying to warn against throwing character sets and encodings
around just because it seems like they fit better in a particular data type
(UTF-16 can be put into a std::string too).

 
James Kanze

Another thing to remember when dealing with most
implementations is that comparison with even a German Unicode
std::collate facet will not find U+00F6 (o-umlaut) and
U+006F (o), U+0308 (umlaut) to be equal. Which just seems wrong to
me - I would have thought they'd be equal in *all* Unicode
collations - Unicode documents them to be equivalent
representations.

It depends on what the implementation claims to support. It may
be supposing a normalized form, in which case, one of the two
representations is not allowed.
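As a rough illustration of what "supposing a normalized form" means in practice: bring both strings to the same form first, then compare. The sketch below is a toy, not real NFC. `compose_diaeresis` and `equal_after_composition` are hypothetical names, and the composition table covers only a, o and u followed by U+0308. A real program would use a proper normalization library (ICU has Normalizer2 for this).

```cpp
#include <cstddef>
#include <string>

// Toy stand-in for real normalization (NOT full NFC): composes a base letter
// followed by U+0308 COMBINING DIAERESIS, for a, o and u only.
std::u32string compose_diaeresis(const std::u32string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t c = in[i];
        if (i + 1 < in.size() && in[i + 1] == 0x0308) {
            char32_t composed = 0;
            switch (c) {
                case U'a': composed = 0x00E4; break;
                case U'o': composed = 0x00F6; break;
                case U'u': composed = 0x00FC; break;
                default: break;
            }
            if (composed != 0) {
                out += composed;
                ++i;  // skip the combining mark that was just consumed
                continue;
            }
        }
        out += c;
    }
    return out;
}

// Compare two code point sequences after bringing both to the composed form.
bool equal_after_composition(const std::u32string& a, const std::u32string& b) {
    return compose_diaeresis(a) == compose_diaeresis(b);
}
```

Compared element by element, U+00F6 and U+006F U+0308 differ. After composition they compare equal, which is what an implementation that assumes normalized input leaves up to the caller.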

And a nit: it's not an Umlaut, it's a diaeresis. Umlaut is a
German word for its function (sound change) in German, and some
other languages. In other languages, such as French or English,
it has a completely different function.

On a related note, if anybody knows how to use
std::lexicographical_compare with a specific locale without having to
change the global locale (which is really not safe in a nontrivial
program), please answer - I'm becoming upset.

I think you're supposed to use the collate facet directly.
Which is sort of a drag, because it requires char const*, rather
than std::string::const_iterator. (Since the functions in the
facets are virtual, they can't be templates, but support for the
string iterators would have seemed a minimum, IMHO.) You should
be able to make do with s.data(), s.data()+s.size().
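A minimal sketch of that, assuming a wrapper of my own naming (`compare_in_locale` is not a standard function). It takes the locale as a parameter, so the global locale is never touched.

```cpp
#include <locale>
#include <string>

// Hypothetical wrapper: compare two strings with the collate facet of the
// given locale. Returns a negative value, zero, or a positive value, like
// std::collate<char>::compare itself.
int compare_in_locale(const std::string& a, const std::string& b,
                      const std::locale& loc) {
    const std::collate<char>& coll = std::use_facet<std::collate<char>>(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}
```

Called as compare_in_locale(s1, s2, some_locale), with whatever locale object you have constructed. Note also that std::locale has an operator() taking two strings, which does a less-than comparison through the same collate facet, so a locale object can be passed directly as the comparator to std::sort.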

At any rate, I don't think that lexicographical_compare can be
used for text, since it does a byte by byte (or rather an
Iterator::value_type by Iterator::value_type) comparison: in
fact, what std::equal should do. For any lexical comparisons,
you need to use the collate facet of locale, and deal with char
const* pointers, instead of iterators.

I'd especially like to be able to specify a locale for each
sequence of encoding atoms so if the implementation has a
mapping between them, it will be able to tell me if they are
equal even when they're in different encodings.

There's currently nothing that does that. You'll have to write
your own function. Off hand, I don't really see anything better
than using two codecvt facets to force to a common encoding
(some UTF, probably), and compare that.
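Something along these lines, as a sketch only. To keep it self-contained it uses two hand-written decoders (ISO 8859-1 and UTF-8) in place of real codecvt facets, and UTF-32 as the common encoding. All three function names are hypothetical, and the UTF-8 decoder assumes well-formed input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical decoder standing in for a codecvt facet: ISO 8859-1 to UTF-32.
// Each Latin-1 byte maps directly to the code point with the same value.
std::u32string decode_latin1(const std::string& s) {
    std::u32string out;
    for (unsigned char c : s) {
        out += static_cast<char32_t>(c);
    }
    return out;
}

// Hypothetical decoder standing in for a codecvt facet: UTF-8 to UTF-32.
// Assumes well-formed UTF-8; no validation is performed.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        for (; extra > 0 && i < s.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        }
        out += cp;
    }
    return out;
}

// Equality across encodings: force both to the common encoding, then compare.
bool same_text(const std::string& latin1, const std::string& utf8) {
    return decode_latin1(latin1) == decode_utf8(utf8);
}
```

This only answers "are they equal"; ordering would still have to go through a collate facet on the common encoding.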

In fact, my own approach is to convert on input or output, and
only use a single encoding (usually UTF-8) internally.
 
