next ISO C++ standard


Rui Maciel

Ioannis said:
Essentially I am talking about restricting the introduction of new
features in the new standard, only to the most essential ones. I have
the feeling that all these Unicodes will be messy.

How, exactly?

Why are all these
Unicode types needed?

Please take a look at:
http://en.wikipedia.org/wiki/Unicode


After a new version of Unicode, will we have it
introduced as a new built-in type in the C++ standard? What will be the use
of the old ones? What I am saying is that we will be having a
continuous accumulation of older built-in character types.

We are repeating C's mistakes here, adding built-in types instead of
providing them as libraries.

What do you mean by "built in type"? The wchar_t type is implemented through
a typedef. For example, GCC defines the wchar_t type in stddef.h as

typedef wchar_t int;


Rui Maciel
 

Victor Bazarov

Rui said:
How, exactly?



Please take a look at:
http://en.wikipedia.org/wiki/Unicode


After a new version of Unicode, will we have it

What do you mean by "built in type"? The wchar_t type is implemented
through a typedef. For example, GCC defines the wchar_t type in
stddef.h as

typedef wchar_t int;

That would be a violation of the C++ Standard. 'wchar_t' is a keyword
in C++. Yes, it is a built-in type.
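
A minimal sketch of why the distinction matters (the probe() overloads below
are made up for illustration only): because wchar_t is a distinct built-in
type in C++, rather than a typedef for an integer type as in C's stddef.h,
it participates in overload resolution on its own:

#include <iostream>

void probe(int)     { std::cout << "int\n"; }
void probe(wchar_t) { std::cout << "wchar_t\n"; } // would be a redefinition if wchar_t were a typedef for int

int main()
{
    probe(42);   // prints "int"
    probe(L'x'); // prints "wchar_t"
}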

V
 

Ioannis Vranos

Bo said:
OSes that do provide Unicode?

But they might be different variants of Unicode, like UTF-16 and
UTF-32.

Yes, no problem with that. In a system supporting UTF-32, why should we
need UTF-16 or UTF-8?

1) We could use only the largest Unicode type.

2) I think that could be provided only via wchar_t.

Yes. Perhaps we will not use them much there?


So do we need all those Unicode types built in?

No one has come up with a good way to introduce string literals as a
library-only solution. The compiler has to know the types to do the
proper encoding.
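
For reference, a minimal sketch of the string literal forms proposed for
C++0x; the compiler chooses the encoding at translation time, which is
exactly the part a library cannot do:

const char     narrow[] = "text";   // execution character set
const char     utf8[]   = u8"text"; // UTF-8
const char16_t utf16[]  = u"text";  // UTF-16
const char32_t utf32[]  = U"text";  // UTF-32
const wchar_t  wide[]   = L"text";  // implementation-defined wide encoding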


1) Wouldn't it be better if only the largest Unicode type supported by the
system were provided?

2) Can't wchar_t alone fulfill this aim?
 

Rui Maciel

Ioannis said:
Yes, no problem with that. In a system supporting UTF-32, why should we
need UTF-16 or UTF-8?

You've got to be kidding.

1) We could use only the largest Unicode type.

2) I think that could be provided only via wchar_t.

....which means, at least for those whose information is stored using mainly
the BMP code points, occupying 4x the space that is needed to store that
very same information.

So do we need all those Unicode types built in?

Obviously yes, unless you are OK with the idea of spending 4x what you
need.

1) Wouldn't it be better if only the largest Unicode type supported by the
system were provided?

There isn't a "one size fits all" solution to this problem. Currently, UTF-8
comes close to be that solution but it has it's drawbacks.

2) Can't wchar_t alone fulfill this aim?

"Hard" (fixed-width UCS-4) Unicode support may be by far the easiest way to
implement Unicode support, but, particularly if you only make use of the BMP
code points, it needlessly wastes memory. A lot of memory.

To give you an idea, I have a small pet project which consists of writing a
small JSON parser. That library supported Unicode the hard way and suffered
quite a bit from bloat, needing about 100MB to store the document tree of a
5MB test document. As soon as I switched the Unicode support from UCS4 to
UTF-8, the memory usage of that particular test document went from 100MB to
a bit over 40MB. That's a lot of memory.
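
As a rough illustration of that ratio (a hypothetical snippet using the
C++0x types under discussion, not the parser's actual code):

#include <iostream>
#include <string>

int main()
{
    std::string    utf8 = "caf\xC3\xA9"; // "café": three ASCII chars plus a 2-byte sequence = 5 bytes
    std::u32string ucs4 = U"caf\u00E9";  // the same four code points at 4 bytes each = 16 bytes

    std::cout << utf8.size()                    << " bytes as UTF-8\n";
    std::cout << ucs4.size() * sizeof(char32_t) << " bytes as UCS-4/UTF-32\n";
}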


Rui Maciel
 

Ioannis Vranos

Rui said:
You've got to be kidding.


1. To be more precise what I meant is, the compiler should support the
largest character type supported by the system (OS), and if it supports
Unicode, the largest Unicode type supported by the OS.

Otherwise are we going to accumulate Unicodes for the next 10 years as
built-in types and supported literals?


And on an OS that does not support Unicode, would the compiler writer
have to provide all these Unicode types?

I am not much into Unicode types and text manipulation, but a built-in
approach of this magnitude feels wrong to me.


...which means, at least for those whose information is stored using mainly
the BMP code points, occupying 4x the space that is needed to store that
very same information.


Does any OS really exist that supports more than one Unicode type in
its APIs?
 

Ioannis Vranos

Ioannis said:
Interesting link on C++0x indeed:

http://en.wikipedia.org/wiki/C++0x


There it is mentioned:

"For the purpose of enhancing support for Unicode in C++ compilers, the
definition of the type char has been modified to be both at least the
size necessary to store an eight-bit coding of UTF-8 and large enough to
contain any member of the compiler's basic execution character set. It
was previously defined as only the latter".


Regarding UTF8, Wikipedia mentions:

"UTF-8 encodes each character in one to four octets (8-bit bytes):

1. One byte is needed to encode the 128 US-ASCII characters (Unicode
range U+0000 to U+007F).
2. Two bytes are needed for Latin letters with diacritics and for
characters from Greek, Cyrillic, Armenian, Arabic, Syriac and Thaana
alphabets (Unicode range U+0080 to U+07FF).
3. Three bytes are needed for the rest of the Basic Multilingual
Plane (which contains virtually all characters in common use).
4. Four bytes are needed for characters in the other planes of
Unicode, which are rarely used in practice".


So I am confused.


Q1: Will "char" support only one of these 4 8-bit bytes or all of them?
If only one, which one?

Q2: Do we also get the restriction of an 8-bit char? What will happen on
machines where a byte is different from 8 bits?

In such a system how many bytes will sizeof(char) be? Will char become
unsigned char only? Will CHAR_BIT of <climits> and <limits.h>, and
numeric_limits<char>::digits of <limits>, always become 8?

Currently there are systems where CHAR_BIT and
numeric_limits<char>::digits are more than 8.
 

James Kanze

I mean long long was merely introduced because the C committee decided to
introduce it in C99, for no other real reason. What will happen if they
decide in the future to add another such built-in type?

The following version of C++ will almost certainly add it as
well. I'm pretty sure that there is a strong consensus to keep
C++ compatible with C with regards to the integer types.

Note that the C committee wasn't particularly happy with long
long itself. After all, what happens if the next generation of
machines also supports 128 bit types: we all "long long long"?
They accepted it as "wide-spread existing practice", but at the
same time, developped a more general framework for an unlimited
number of integral types. C++ has also adopted this framework:
there is no guarantee that long long is the longest integral
type in a given implementation. (That would be intmax_t, which
is a typedef.)
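
A minimal sketch of that framework in practice, assuming an implementation
that provides C99's <stdint.h> and long long:

#include <climits>  // CHAR_BIT
#include <stdint.h> // intmax_t (from C99; C++0x adds <cstdint>)
#include <iostream>

int main()
{
    // intmax_t names the widest integer type the implementation provides;
    // it is not guaranteed to be the same type as long long.
    std::cout << "long long : " << sizeof(long long) * CHAR_BIT << " bits\n";
    std::cout << "intmax_t  : " << sizeof(intmax_t)  * CHAR_BIT << " bits\n";
}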
Those implementations you are mentioning are compiling
programs for OSes that do provide Unicode?

What does "provide Unicode" mean? I use Unicode under Solaris.
Sun CC generates some other encoding for wide string literals,
and g++ only allows basic ASCII in them to begin with (otherwise I get
"converting to execution character set: Illegal byte sequence").
For that matter, I get the same error with g++ under Linux.
Under Windows I suppose current VC++ implements wchar_t as
Unicode, and in my OS (Linux) I suppose wchar_t is Unicode
(haven't verified the last though).

Under Windows, wchar_t nominally is UTF-16. But of course, it's
really whatever the code interpreting it interprets it to be.
Under Linux, as far as I can tell, there is no nominal
encoding---it's whatever the program wants it to be. (The
difference, of course, is that Linux doesn't support any wchar_t
IO, so any wchar_t is purely internal to the program.)

A quick test on my machines showed that g++ doesn't support
UTF-32 (which would be the normal Unicode format for the 4 byte
wchar_t), at least in wide string literals, so I don't see how
you can say that it supports Unicode. I haven't tried things
like "toupper( L'\u00E9', std::locale() )", so I don't know
about those, but they're locale dependent anyway.
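
For what it's worth, that locale-dependent call would look roughly like this
(a sketch only; whether the conversion happens depends on the environment
locale, and std::locale("") will throw if that locale is invalid):

#include <iostream>
#include <locale>

int main()
{
    std::locale loc("");                       // the user's environment locale
    wchar_t up = std::toupper(L'\u00E9', loc); // may yield L'\u00C9', or not
    std::wcout << up << L'\n';
}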
So with these new character types will we get Unicodes under
OSes that do not support Unicode?

Presumably. The problem isn't really OS support---most OS's are
encoding neutral for most functions. (With a few
exceptions---Posix/Linux pretty much requires that the native
narrow character encoding be a superset of ASCII. But in fact,
about the only place I think that this will be an issue is for
'/', and maybe a few other special separators.)
With the introduction of these new types, what will be the use
of wchar_t?

Support for legacy code. Support for the native 32 bit
encoding, which isn't Unicode under Solaris (nor, I think, most
other Posix systems). Support for whatever the implementation
wants---that's what it currently is.
Essentially I am talking about restricting the introduction of
new features in the new standard, only to the most essential
ones. I have the feeling that all these Unicodes will be
messy.

Well, I won't argue against you there. Having to deal with so
many different encodings and encoding formats is messy. The
problem is that the mess is there, outside of C++, and we have
to deal with it in one way or another.
Why are all these Unicode types needed?

To support all of the formats Unicode standardizes.
After a new version of Unicode, will we have it introduced as
a new built-in type in the C++ standard?

If they introduce still more encoding formats, I suppose yes.
Somehow, I don't see that happening.
What will be the use of the old ones? What I am saying is that
we will be having a continuous accumulation of older built-in
character types.
We are repeating C's mistakes here, adding built-in types
instead of providing them as libraries.

There's only so much you can do in a library. You can't make a
new type that behaves like a built-in integral type.

Of course, I'm not sure that that's really what is needed for
the Unicode types. Do you really want to be able to increment a
character (as opposed to a small integral value)? But again,
character types are integral types in C, C++ wants to be
compatible with C with regards to the integral types, and C
won't use a library for the basic type here. (I'm not sure, but
I believe that char32_t and char16_t also originate in a TR for
C.)

For the other integral types: the language wants to support what
the hardware supports.
 

James Kanze

1. To be more precise what I meant is, the compiler should
support the largest character type supported by the system
(OS), and if it supports Unicode, the largest Unicode type
supported by the OS.

You've mentionned "Unicode type supported by the OS" several
times now. What do you mean by this? As far as I can tell, the
Unix systems I work on 1) only support 8 bit characters (wide
characters), and 2) require that any encoding used be a superset
of ASCII. I have filenames written in ISO 8859-1 on the file
system (shared by Linux and Solaris), filenames written in
UTF-8, and it doesn't bother the system one way or the other.
Does this mean that the systems in question don't support
Unicode, or that they support all encodings which are a superset
of ASCII?

I don't know what the situation is under Windows, but under
Unix, about the only place encoding for characters other than a
few meta-characters like '/' makes a difference is in input and
output. And there, it typically depends on something external:
the device you're writing to, or the encoding of the font used
for rendering, or the current xmodmap for the keyboard. I can
easily have one window working in ISO 8859-1, and another in
UTF-8, for example. (Creating a file in one window, and doing
an ls in another, does result in some strange looking filenames,
of course.)
Otherwise are we going to accumulate Unicodes for the next 10
years as built-in types and supported literals?
And on an OS that does not support Unicode, would the compiler
writer have to provide all these Unicode types?
I am not much in Unicode types and text manipulation, but the
built-in approach of this magnitude feels wrong to me.
Does any OS really exist that supports more than one Unicode
type in its APIs?

For the most part, Unix is neutral. From what I understand,
however, Windows supports UTF-16 and an 8 bit API---and it would
surprise me somewhat if the latest versions (Vista) didn't allow
UTF-8 in the 8 bit API.
 

red floyd

James said:
Note that the C committee wasn't particularly happy with long
long itself. After all, what happens if the next generation of
machines also supports 128 bit types: we all "long long long"?

I think we discussed this once before:

128-bit: very long long;
256-bit: really long long;
512-bit: extra long long;

etc....
 

Erik Wikström

There it is mentioned:

"For the purpose of enhancing support for Unicode in C++ compilers, the
definition of the type char has been modified to be both at least the
size necessary to store an eight-bit coding of UTF-8 and large enough to
contain any member of the compiler's basic execution character set. It
was previously defined as only the latter".


Regarding UTF8, Wikipedia mentions:

"UTF-8 encodes each character in one to four octets (8-bit bytes):

1. One byte is needed to encode the 128 US-ASCII characters (Unicode
range U+0000 to U+007F).
2. Two bytes are needed for Latin letters with diacritics and for
characters from Greek, Cyrillic, Armenian, Arabic, Syriac and Thaana
alphabets (Unicode range U+0080 to U+07FF).
3. Three bytes are needed for the rest of the Basic Multilingual
Plane (which contains virtually all characters in common use).
4. Four bytes are needed for characters in the other planes of
Unicode, which are rarely used in practice".


So I am confused.


Q1: Will "char" support only one of these 4 8-bit bytes or all of them?
If only one, which one?

There is only one thing to support, and if char is 8 bits that is
supported. What is meant by the above is that if you want to use a
character which is not among the 128 US-ASCII characters, you will
need a sequence of two or more chars, i.e. with a char array of size 4 you
can encode any Unicode character.
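
A minimal sketch of such a sequence; the two bytes below are the standard
UTF-8 encoding of U+00E9 ("é"):

#include <iostream>

int main()
{
    const char e_acute[] = "\xC3\xA9";              // one character, two chars of UTF-8
    std::cout << sizeof(e_acute) - 1 << " chars\n"; // prints "2" (excluding the terminating '\0')
}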
Q2: Do we also get the restriction of an 8-bit char? What will happen on
machines where a byte is different from 8 bits?

What restrictions are you talking about? A C++ byte is as large as it
needs to be, but at least 8 bits.
In such a system how many bytes will sizeof(char) be? Will char become
unsigned char only? Will CHAR_BIT of <climits> and <limits.h>, and
numeric_limits<char>::digits of <limits>, always become 8?

sizeof(char), sizeof(unsigned char), and sizeof(signed char) are defined
to be 1. Or put another way, in C++ a byte is as large as a char. The
number of bits in a char is implementation-defined.
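
A minimal sketch for inspecting these quantities on a given implementation:

#include <climits>
#include <iostream>
#include <limits>

int main()
{
    std::cout << "sizeof(char) = " << sizeof(char) << '\n';                      // always 1
    std::cout << "CHAR_BIT     = " << CHAR_BIT << '\n';                          // at least 8
    std::cout << "char digits  = " << std::numeric_limits<char>::digits << '\n'; // CHAR_BIT or CHAR_BIT - 1, depending on whether char is signed
}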
 

Ron Natalie

red said:
I think we discussed this once before:

128-bit: very long long;
256-bit: really long long;
512-bit: extra long long;

We were working on a machine that was natively 64 bits over
20 years ago. For practical reasons we needed "int" to
also be 64 bits, so the question was what to call the 32-bit type.

Crazy ideas were:
short long int
and
medium int

We finally decided on int32.
 

Victor Bazarov

Ron said:
We were working on a machine that was natively 64 bits over
20 years ago. For practical reasons we needed "int" to
also be 64 bits, so the question was what to call the 32-bit type.

Crazy ideas were:
short long int
and
medium int

We finally decided on int32.

What's wrong with just 'short'? Nowhere does it say that 'short'
has to be less than 32 bits; it just shouldn't be less than 16...

Now what do you call a 16-bit integral type after that?
A 'really short' or a 'long char'?

V
 

Matthias Buelow

Victor said:
Now what do you call a 16-bit integral type after that?
A 'really short' or a 'long char'?

Why has there to be one? I presume "short" was introduced because of
backwards-compatibility issues when moving from the PDP-7 to 32-bit
environments like the VAX (and probably used in the same way a decade
later when going 32-bit on the PC), both uses are rather rare today on
32- and 64-bit machines. I see no reason why a 16-bit integer type has
to be provided on these machines.
 

Victor Bazarov

Matthias said:
Why has there to be one? I presume "short" was introduced because of
backwards-compatibility issues when moving from the PDP-7 to 32-bit
environments like the VAX (and probably used in the same way a decade
later when going 32-bit on the PC), both uses are rather rare today on
32- and 64-bit machines. I see no reason why a 16-bit integer type has
to be provided on these machines.

That is not under discussion. What's under discussion is the name of
that type _if_ it is provided. Of course, there are always the 'intXX_t'
names one could resort to.

V
 

Ioannis Vranos

Victor said:
That is not under discussion. What's under discussion is the name of
that type _if_ it is provided. Of course, there are always the 'intXX_t'
names one could resort to.


Is there any intXX_t type in C++03? Will there be any in C++0x?
 

James Kanze

Why has there to be one? I presume "short" was introduced because of
backwards-compatibility issues when moving from the PDP-7 to 32-bit
environments like the VAX (and probably used in the same way a decade
later when going 32-bit on the PC), both uses are rather rare today on
32- and 64-bit machines. I see no reason why a 16-bit integer type has
to be provided on these machines.

Short was present in K&R-I; same size as int on a PDP-11 and a
Honeywell 6000, smaller on IBM 370 and Interdata 8/32.
 
