MikeP
Jorgen said:Of course, if the trend is "let the normal representation of strings
be UTF-8" then the answer doesn't matter.
What is this/the enamourment with UTF-8 anyway?
Jorgen said:Of course, if the trend is "let the normal representation of strings
be UTF-8" then the answer doesn't matter.
Miles said:Exactly. utf-16 offers no simplicity advantage over utf-8, and
suffers from some significant disadvantages.
In practice, I suppose that many Windows apps probably just ignore
anything outside the BMP, pretend "it's all 16-bit!",
and as a result
suffer from mysterious and bizarre bugs when such characters crop
up...
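To make Miles's point concrete, here is a toy example (mine, not from any real app; any C++11 compiler): one character outside the BMP takes two UTF-16 code units, so code that treats each char16_t as a character miscounts and can split a string between the surrogates.

#include <iostream>
#include <string>

int main()
{
    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
    std::u16string s = u"\U0001D11E";

    // One character, but two UTF-16 code units (a surrogate pair).
    std::cout << s.size() << '\n';   // prints 2

    // Anything that assumes "one char16_t == one character" reports two
    // characters here, and truncating at s[1] would cut the pair in half.
}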
Unless one craves the simplicity of constant-width characters.
I may soon be working on a code base that is undergoing
"internationalization", and what is already done on low levels of
operating system interaction is using UTF-8 in char-based strings.
Clearly, new code needs to use Unicode, or _some_ way of representing
a lot more than 256 characters. UTF-8 and Unicode have merit as an
encoding and transmission format, and I'm certainly not against it.
But std::string models a sequence of *characters*, and that doesn't
fit with multi-byte variable-length character encodings of any kind.
Furthermore, one character won't fit in a char.
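To illustrate the mismatch (a toy example, not from the code base in question): with UTF-8 in a std::string, size() and operator[] work on bytes, not characters.

#include <iostream>
#include <string>

int main()
{
    // "héllo" in UTF-8: 'é' (U+00E9) is encoded as the two bytes 0xC3 0xA9.
    std::string s = "h\xC3\xA9llo";

    std::cout << s.size() << '\n';   // 6 -- bytes, not the 5 characters a reader sees
    std::cout << int(static_cast<unsigned char>(s[1])) << '\n';   // 195 -- the lead byte of 'é', not a character
}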
Before I get in too deep with my thoughts, I want to see what others
are doing. Is there any existing information on this topic? Is there
a better place to discuss it?
My initial thought is that most of the time the code simply hangs
onto the string data and doesn't manipulate it, so "funny" contents
won't matter much. Furthermore, UTF-8 avoids (by design) many of the
problems found with multi-byte encodings, so a naive handling might
work "well enough" for common tasks. However, I should catalog in
detail just what works and what the problems are. For more general
manipulation of the string, we need functions like the C library's
mblen etc. and replacements for standard library routines whose
implementation doesn't handle UTF-8 as multi-byte character strings.
Or, is there some paradigm shift I should be aware of?
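For concreteness, the kind of mblen-style replacement mentioned above might look roughly like this. A sketch only, with a made-up name (utf8_length), assuming well-formed input and doing no validation.

#include <cstddef>
#include <iostream>
#include <string>

// Count code points in a UTF-8 string by skipping continuation bytes
// (those of the form 10xxxxxx).  Illustrative only: it does none of the
// error handling that mblen()/mbrlen() would do.
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)   // not a continuation byte
            ++count;
    return count;
}

int main()
{
    std::string s = "gr\xC3\xBC\xC3\x9F";   // "grüß" in UTF-8
    std::cout << s.size()       << '\n';    // 6 bytes
    std::cout << utf8_length(s) << '\n';    // 4 code points
}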
none said:Unless one craves the simplicity of constant-width characters.
You need to get your terminology right and understand clearly the
differences between Unicode, UTF-8, UTF-16, UTF-32 and UCS-2. You
don't use "Unicode" in an application. You might use what Windows
refers to as "Unicode" but it will likely be UTF-16 or UCS-2 in
reality.
What will you do with your data?
What will you mostly be interfacing with when handling your
internationalised data?
What's your OS?
If you are interfacing
with an OS often, you would probably be better off sticking with the
common Unicode representation used by that OS. If you are often
interfacing with a GUI library, you may as well keep the data in the
same format.
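For what it's worth, the "stick with the OS representation at the boundary" idea might look like this on Windows. A Windows-only sketch with a made-up helper (widen) and no error handling, using the documented MultiByteToWideChar call.

#include <string>
#include <windows.h>

// Illustrative helper: widen a UTF-8 std::string to a UTF-16 std::wstring
// just before handing it to one of the Windows "W" APIs.
std::wstring widen(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();

    int len = MultiByteToWideChar(CP_UTF8, 0,
                                  utf8.data(), static_cast<int>(utf8.size()),
                                  nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0,
                        utf8.data(), static_cast<int>(utf8.size()),
                        &wide[0], len);
    return wide;
}

int main()
{
    std::string dir = "C:\\donn\xC3\xA9es";    // UTF-8 kept inside the program
    SetCurrentDirectoryW(widen(dir).c_str());  // converted only at the OS boundary
}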
Hi John
I am by no means a C++ expert; I use C++Builder, which is fully
Unicode now, but I can see that there is a wstring in the STL, so
my suggestion would be to translate all incoming and outgoing
data and then internally use wchar_t and wstring.
You didn't mention whether you are on a specific operating system,
but Windows works with wchar_t internally, and I suppose Unix, Linux
and Mac do the same, though I don't know for sure.
Best regards
Asger-P
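A rough sketch of Asger-P's "translate at the edges" suggestion using only the standard library: C++11's <codecvt>, which still ships with the major compilers but is deprecated since C++17. On Windows, where wchar_t is 16 bits, std::codecvt_utf8_utf16 would be the facet to pick instead.

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    // Assumes wchar_t is wide enough for any code point (true on Linux/Mac);
    // deprecated in C++17 but still widely available.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;

    std::string  incoming = "K\xC3\xB8benhavn";          // UTF-8 arriving from outside
    std::wstring internal = conv.from_bytes(incoming);   // what the program works on

    std::cout << internal.size() << '\n';                // 9 characters, not 10 bytes

    std::string outgoing = conv.to_bytes(internal);      // back to UTF-8 on the way out
    std::cout << (outgoing == incoming) << '\n';         // 1
}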
check out std::wstring, Glib::ustring for discussion. I'd bet boost also
must have i18n related stuff.
Jorgen Grahn wrote:
32 bits are enough that any Unicode character fits in a single
wchar_t, so you can work on those almost (ok, that's a big "almost")
as easily as with plain old ASCII. 16 bits force you to use some
variable-length encoding like UTF-16, so this is just as complicated
as UTF-8.
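A small illustration of that (toy example): with 32-bit code units, one element really is one code point, even outside the BMP.

#include <iostream>
#include <string>

int main()
{
    // One char32_t per code point, including the non-BMP G clef at the end.
    std::u32string s = U"caf\u00E9 \U0001D11E";

    std::cout << s.size() << '\n';                     // 6: c a f é <space> and the clef
    std::cout << (s.back() == U'\U0001D11E') << '\n';  // 1: plain indexing finds it
}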
The question isn't "are there better ways to handle things than using
UTF-8 internally."
My question is, "*given* that this project uses UTF-8 internally, and
will do more of that as development continues, and the C++ standard
library doesn't have a class that models that concept cleanly, what
are good ways to deal with it?"
I'd love to solve the problem differently, and not care about memory
consumption either (just use 32-bit characters), but that's not where
I am on this.
I don't see ICU listed, which, to my limited knowledge, is the default
solution for serious Unicode string processing and manipulation.
Or use UTF-16 as UCS-2, which sounds pretty reasonable still on Windows
(?).
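For reference, the ICU route looks roughly like this, if I have the API right (icu::UnicodeString stores UTF-16 internally; link against the icuuc library):

#include <iostream>
#include <string>
#include <unicode/unistr.h>       // icu::UnicodeString
#include <unicode/stringpiece.h>  // icu::StringPiece, accepted by fromUTF8()

int main()
{
    std::string utf8 = "A\xF0\x9D\x84\x9E";   // 'A' plus U+1D11E (the G clef) in UTF-8

    // Convert in from UTF-8 and ask for real code point counts.
    icu::UnicodeString us = icu::UnicodeString::fromUTF8(utf8);
    std::cout << us.length()      << '\n';   // 3 UTF-16 code units (the clef is a surrogate pair)
    std::cout << us.countChar32() << '\n';   // 2 code points

    // Append the UTF-8 form back into a std::string on the way out.
    std::string back;
    us.toUTF8String(back);
    std::cout << (back == utf8) << '\n';     // 1
}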
I'm quite aware of the precise terminology. To summarize, Unicode is
a mapping of cataloged characters to ordinal values, and unlike older
catalogs does not imply any specific means of representing a list of
integers in files or memory. UTF-8, OTOH, is a specific encoding of
such a list as a sequence of bytes.
What in my message makes it seem that I don't know the difference? I
don't see any sloppy use of the terminology, and I certainly used my
terms rigorously and precisely in the parts you quoted.
Anything and everything that applications do with strings. It's not a
publishing system or word processor, so strings are mainly incidental:
labels, file names, and values used in the GUI.
Just the OS primitives and other like-minded modules.
It's portable code that needs to work on a variety of OSs.
The problem is that "common code" needs to be applicable to all
operating systems. Anything facing the OS will be abstracted, using
standard library classes or modules made for the purpose. The code
base already uses UTF-8 internally and I'm exploring the ramifications
of that, and how to do it correctly.
What I would want from a good Unicode string class is: …
Sorry then. I read the statements that "already ... is using UTF-8",
"...new code needs to use Unicode" and "UTF-8 and Unicode have merit
..." as if you meant that they were mutually exclusive
alternatives: "Unicode" or "UTF-8". Apologies.
Check the boost mailing list. There's been some discussion recently, as
well as a presentation at boostcon last month.
What is this/the enamourment with UTF-8 anyway?
But isn't Windows (XP)'s "UTF-16" really UCS-2 and hence not a risk at
all then?
John said:There are a few:
(1) Old code is written using byte-size character strings of *some*
kind. For languages with larger character sets, the existing practice
is to use a "multi-byte" code page, which is actually variable-length:
ASCII is still a single byte, and certain ranges of characters
indicate prefixes or paired use.
So lots of code uses byte strings with "internationalization" meaning
to allow the target system to specify the mapping of which character
is what value, and allowing multi-byte characters mixed with single
byte characters.
So, using UTF-8 fits the existing data types and code. It's just
another "code page" to such code.
(2) It is efficient for non-ideographic languages, and for
data-processing content that consists mostly of ASCII identifiers.
(3) Being defined as a series of bytes, it is byte-order neutral.
(4) Contrast with UTF-16, which _still_ requires awareness of
surrogate pairs to allow more than 64K code values (see the sketch
after this list).
So, it's attractive for data storage and transmission.
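A small dump makes points (3) and (4) visible. This is a toy example compiled as C++11/14/17 (in C++20 the u8 prefix yields char8_t rather than char):

#include <cstdio>
#include <string>

int main()
{
    // The same two characters, U+0041 'A' and the non-BMP U+10348, in both encodings.
    std::string    bytes = u8"A\U00010348";   // UTF-8: a byte sequence, no byte order to worry about
    std::u16string units = u"A\U00010348";    // UTF-16: code units in the host's byte order

    std::printf("UTF-8, %zu bytes:", bytes.size());
    for (unsigned char b : bytes)
        std::printf(" %02X", unsigned(b));     // 41 F0 90 8D 88 -- 'A' stays a single byte
    std::printf("\n");

    std::printf("UTF-16, %zu code units:", units.size());
    for (char16_t u : units)
        std::printf(" %04X", unsigned(u));     // 0041 D800 DF48 -- a surrogate pair for U+10348
    std::printf("\n");
}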
John M. Dlugosz said:Originally, yes. At some point it went to real UTF-16 awareness, at
least for rendering strings using fonts into windows, and converting
between code pages correctly. This was around Windows 2000 or XP,
certainly well before Windows 7.