C++0x: two Unicode proposals, a correction and an alternative

Ioannis Vranos

Based on a discussion about Unicode in clc++, in a thread with the subject
"next ISO C++ standard", and on the information provided at
http://en.wikipedia.org/wiki/C++0x , and with the design ideals:

1. To provide Unicode support in C++0x always and explicitly.
2. To provide support for all the Unicode encoding forms out there.


I think the implementation of these as:

a) char, char16_t and char32_t types.
b) built-in Unicode literals.

should become:

I) Library-provided, implementation-defined types like utf8_char,
utf16_char, and utf32_char, leaving the existing built-in types like char
alone and unpolluted, now and in the future.

II) Leave b) as it is.


In this way, the built-in types are not polluted with an ever-growing
list of UTFs, and in the future obsolete ones can easily be deprecated in
the library. The pollution from an ever-growing list of UTF character
types and literals will be minimal.

I also think this change to the UTF implementation would require minimal
changes to the existing C++0x draft.
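For concreteness, such library types might be sketched as plain wrapper
structs; the names and namespace below are illustrative only, not part of
any actual proposal:

```cpp
#include <cstdint>

// Illustrative sketch only: one wrapper struct per encoding form,
// kept entirely out of the built-in type system.
namespace proposed {
    struct utf8_char  { std::uint8_t  value; };  // one UTF-8 code unit
    struct utf16_char { std::uint16_t value; };  // one UTF-16 code unit
    struct utf32_char { std::uint32_t value; };  // one UTF-32 code unit
}
```

Because these are ordinary library types, a future encoding form could be
added, or an old one deprecated, without touching the core language.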

---------------------------------------------------------------------------


My second thought on this is that Unicode support should also become
optional. That would further reduce the pollution of built-in types and
string literals: an implementation should be able to choose whether it
supports Unicode, and which encoding forms.
 
Phil Endecott

Ioannis said:
Based on a discussion about Unicode in clc++, in a thread with the subject
"next ISO C++ standard", and on the information provided at
http://en.wikipedia.org/wiki/C++0x , and with the design ideals:

1. To provide Unicode support in C++0x always and explicitly.
2. To provide support for all the Unicode encoding forms out there.


I think the implementation of these as:

a) char, char16_t and char32_t types.
b) built-in Unicode literals.

should become:

I) Library-provided, implementation-defined types like utf8_char,
utf16_char, and utf32_char, leaving the existing built-in types like char
alone and unpolluted, now and in the future.

The problem is that if the library does something like this:

typedef uint32_t char32_t;

then when I write

char32_t c = L'a';
cout << c;

It will output c as the number 97, not the character 'a', because the
overloading of operator<< can't see through the typedef.

The library could implement a char32_t like

class char32_t {
    uint32_t impl;
    // ...
};

but that has its own problems. It all works OK if these are built-in types.
II) Leave b) as it is.

So if I write a UTF-16 literal using the built-in literal syntax, what
is its type? It has to be a built-in type, not a library type.


Phil.
 
Ioannis Vranos

Phil said:
The problem is that if the library does something like this:

typedef uint32_t char32_t;

then when I write

char32_t c = L'a';
cout << c;

It will output c as the number 97, not the character 'a', because the
overloading of operator<< can't see through the typedef.


Well, then the library should not use that typedef, and the operator<<
of cout should be implemented to work with the provided character type.

The library could implement a char32_t like

class char32_t {
    uint32_t impl;
    // ...
};

but that has its own problems. It all works OK if these are built-in
types.


If your class-based suggestion above is not possible to implement, why
not focus on providing language tools that make it possible instead?


So if I write a UTF-16 literal using the built-in literal syntax, what
is its type? It has to be a built-in type, not a library type.


It can be a library type. AFAIK a built-in type can also be made to look
like a library type, if it is hidden unless the corresponding header is
#included.

In any case, the main point of my "correction" proposal is that the C++
built-in types should not be tied to a specific character encoding system.

Consider the possibility that, some years from now, a character system
that does not exist yet becomes the dominant one, while the C++ built-in
types are tied to Unicode.

With any specific character system provided as a library extension (an
implementation-defined type), C++ will have the flexibility to adapt to
character systems that emerge in the future without touching its built-in
types.

Just as math-specific types should not become built-in in C++ but rather
library extensions, I think the same should apply to character systems,
regular expressions, etc.

So as another example, although probably not needed in standard C++,
let's consider adding EBCDIC support explicitly as a library extension.

Something like:

#include <whatever>

// ...
std::ebcdic_char *p = EB"This is a text";
std::ebcdic_char c = EB'c';


This style can work for any character system: UTF-8, UTF-16, UTF-32,
whatever.
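Something close to this EB-prefix idea can already be approximated with
C++11 user-defined literals and a pure library type. The names here
(`ebcdic_char`, the `_eb` suffix) are hypothetical, and the conversion
table is elided to keep the sketch short:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical library character type, for illustration only.
struct ebcdic_char { unsigned char value; };

// A user-defined literal suffix approximates the proposed EB"..." prefix
// without adding any built-in type. A real implementation would map each
// character through an ASCII-to-EBCDIC table; here the value is stored
// unchanged to keep the sketch short.
std::vector<ebcdic_char> operator""_eb(const char* s, std::size_t n) {
    std::vector<ebcdic_char> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        out.push_back(ebcdic_char{static_cast<unsigned char>(s[i])});
    return out;
}
```

With this in place, `auto text = "This is a text"_eb;` yields a sequence
of library-typed characters with no core-language change required, which
is the shape of the flexibility argued for above.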

I think tying any specific character system to the built-in types is a
Java-style approach (like C#/.NET etc.), where the language is a whole
framework rather than a programming language alone and can be changed at
will.


Apart from this, I also think that wchar_t should be the largest
character type a specific compiler provides; so, for example, if a
compiler provides UTF-32 as its largest character type, that compiler's
wchar_t should be equivalent to its UTF-32 character type.
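The suggested rule can be stated as a compile-time check. Note that this
is a sketch of the proposal's invariant, not current reality: it holds on
platforms with a 32-bit wchar_t (e.g. Linux/glibc) but fails on Windows,
where wchar_t is 16 bits.

```cpp
#include <climits>  // CHAR_BIT

// Sketch of the suggested invariant: wchar_t is at least as wide as the
// widest character type the implementation provides (char32_t in C++0x).
// Platform-dependent today, hence only recorded, not asserted.
constexpr bool wchar_is_widest = sizeof(wchar_t) >= sizeof(char32_t);

// char32_t itself is guaranteed to hold at least 32 bits.
static_assert(sizeof(char32_t) * CHAR_BIT >= 32,
              "char32_t must hold at least 32 bits");
```

Under this rule, `wchar_is_widest` would be required to be true on every
conforming implementation.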
 
