character sets

jraul · Jun 24, 2007

1) Am I correct that C++ does not have a defined character set? In
particular, a platform might not use the ASCII character set?

2) C++ supports wchar_t types. But again, this has no defined
character set? For instance, it might not be a unicode character set?

Mike Wahler · Jun 25, 2007

jraul said:
1) Am I correct that C++ does not have a defined character set?

It imposes a requirement that certain characters
exist in the source and execution character sets,
but no, it does not mandate any particular set.

In
particular, a platform might not use the ASCII character set?

Correct. E.g. many/most IBM systems use EBCDIC

2) C++ supports wchar_t types.

Correct; this allows a larger number of characters than
possible with the minimum eight-bit sized type 'char'.

But again, this has no defined
character set?
No.

For instance, it might not be a unicode character set?

Correct.

-Mike

BobR · Jun 25, 2007

jraul said:
1) Am I correct that C++ does not have a defined character set? In
particular, a platform might not use the ASCII character set?

Yup and yup. (that's yes and yes in yupese. <G>)
2) C++ supports wchar_t types. But again, this has no defined
character set? For instance, it might not be a unicode character set?

If you compile a C++ program, run it, the system runs your program in a
'console' (assumed). It's the 'console' that has the char set, AFAIK.

Think about it. What would a char set be used for? Output to a screen (CRT)?
C++ does not know anything about a screen, keyboard, mouse, filesystem,
etc.. Those are supplied by libraries. Try writing anything[1] in C++
*without* any '#include' in it, and get IO. You can 'crunch' numbers, but
you won't be able to *see* the results (....unless it's a toaster. <G> ).

[1] - ..except an IO module. <G>

Robert Bauck Hamar · Jun 25, 2007

BobR said:
If you compile a C++ program, run it, the system runs your program in a
'console' (assumed). It's the 'console' that has the char set, AFAIK.

Think about it. What would a char set be used for?

Knowing the difference between printable and unprintable characters?

Output to a screen
(CRT)? C++ does not know anything about a screen, keyboard, mouse,
filesystem, etc.. Those are supplied by libraries.

Correct, but the standard in, standard out and standard error file streams
must be open on a hosted implementation, nevertheless. And when using them,
one should know what the output will be.

Try writing anything[1]
in C++ *without* any '#include' in it, and get IO. You can 'crunch'
numbers, but you won't be able to *see* the results (....unless it's a
toaster. <G> ).

extern "C" int printf(const char*,...);
int main()
{
    printf("hello, world\n");
}

It only requires a library function named printf. This is not portable, but
it works on my g++. But why would you? The standard headers _are_ part of
C++, and the standard library is there for a reason.

--
rbh

BobR · Jun 25, 2007

Robert Bauck Hamar said:
Knowing the difference between printable and unprintable characters?
Letters? #include <cctype>? <locale>? C++ has standard libraries that can
tell you a lot about supported char sets on your platform.

Ahem, let's try that again <G>:
Think about it. What would a char set be used for, output to a screen
(CRT)? (ya' know, like theoretical question.)

And, on this device with no CRT, no keyboard/pad, how is the built in
character set used?

Try writing anything[1]
in C++ *without* any '#include' in it, and get IO. You can 'crunch'
numbers, but you won't be able to *see* the results (....unless it's a
toaster. <G> ).


extern "C" int printf(const char*,...);
int main(){
printf("hello, world\n");
}

It only requires a library function named printf. This is not portable, but
it works on my g++. But why would you? The standard headers _is_ part of
C++, and the standard library is there for reason.

Oh. So, you're saying that the standard library provides a 'character set'.
I see. I get it now. Silly me, I've been including <iostream> for nothing! I
could just use 'printf' in my GUI apps. Right?
Then how do you tell 'printf' what char set to use?

But, that's a 'library written in the language', not 'the language'. Or is
it the other way around?

Wow, ya' learn something new every day.

Old Wolf · Jun 25, 2007

extern "C" int printf(const char*,...);
int main()
{
printf("hello, world\n");

}

It only requires a library function named printf. This is not portable,

What is not portable about it? The standard specifies
that printf has just that signature.

Old Wolf · Jun 25, 2007

Ahem, let's try that again <G>:
Think about it. What would a char set be used for, output to a screen
(CRT)? (ya' know, like theoretical question.)

And, on this device with no CRT, no keyboard/pad, how is the built in
character set used?

Could be any number of uses. Storing character
strings read in from files or other storage,
for example.

BTW, many devices have non-CRT displays (e.g. LCD panels).

Oh. So, you're saying that the standard library provides an 'character set'.
I see. I get it now. Silly me,

The C++ language provides a character set. This is
formally known as 'the execution character set'.

I'v been includeing <iostream> for nothing! I
could just use 'printf' in my GUI apps.

You certainly could, although this has nothing to
do with character sets.

Right? Then how do you tell 'printf' what char set to use?

I guess you mean: how do you tell printf which
locale to use? If so, then the answer is: call
the setlocale() function.
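
For what it's worth, here is a minimal sketch of that setlocale() call. The helper names current_locale_name and adopt_environment_locale are made up for illustration; only std::setlocale itself comes from the standard library. A program starts in the "C" locale, and passing "" asks for the user's preferred locale, which the implementation may or may not be able to provide.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: returns the name of the current C locale
// without changing it. A null second argument makes std::setlocale
// a pure query.
std::string current_locale_name()
{
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? name : "";
}

// Hypothetical helper: tries to switch to the user's preferred locale
// (the "" argument). Returns false if the request cannot be honored,
// in which case the locale is left unchanged.
bool adopt_environment_locale()
{
    return std::setlocale(LC_ALL, "") != nullptr;
}
```

After a successful adopt_environment_locale(), functions such as printf and isupper follow the new locale's conventions. What the terminal then displays is still outside the program's control.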

Wow, ya' learn something new every day.

Indeed ya' do.

James Kanze · Jun 25, 2007

Yup and yup. (that's yes and yes in yupese. <G>)

In fact, some platforms don't. Windows, for example, or Linux,
or most Unices. Some platforms don't even use an encoding which
is a superset of ASCII: IBM mainframes use EBCDIC, for example.

All you're guaranteed is that:

-- a certain number of characters (known as the basic character
set) are present,

-- the digits (but not necessarily the upper or lower case
letters) are successive, and in ascending order, and

-- no character in the basic character set will be negative
when stored in a char (but this doesn't hold for characters
in the extended characters set).
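
A small sketch of what the digit guarantee buys you in practice, assuming nothing beyond the points listed above. The function names digit_value and lowercase_index are invented for the example. Subtracting '0' is portable; subtracting 'a' is not, because EBCDIC has gaps between 'i' and 'j' and between 'r' and 's'.

```cpp
#include <cstring>

// Portable: the standard guarantees '0'..'9' are consecutive and
// ascending. Returns -1 if c is not a decimal digit.
int digit_value(char c)
{
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

// Letters carry no such guarantee, so look the letter up in a table
// instead of computing c - 'a'. Returns the 0-based position in the
// alphabet, or -1 if c is not a lowercase letter.
int lowercase_index(char c)
{
    static const char letters[] = "abcdefghijklmnopqrstuvwxyz";
    if (c == '\0') return -1;  // strchr would otherwise match the terminator
    const char* p = std::strchr(letters, c);
    return p ? static_cast<int>(p - letters) : -1;
}
```

The table lookup is slower than arithmetic, but it gives the same answer on ASCII and EBCDIC systems alike.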

In addition, the actual run-time character set can change
depending on the locale. Which can play havoc with e.g. string
literals (which don't appear like they do in the code).

And often isn't, for historical reasons. Even when it is
Unicode, sometimes it's UTF-16, other times UTF-32.
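
If you want to see what your own implementation does with wchar_t, something like the following sketch may help. The function names are illustrative. __STDC_ISO_10646__ is an optional macro inherited from C99, so its absence only means the implementation makes no promise.

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t in bits on this implementation (commonly 16 or 32).
std::size_t wchar_bits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// True if the implementation promises that wchar_t values are
// ISO/IEC 10646 (Unicode) code points. The macro is optional, so
// "false" only means no promise was made.
bool wchar_is_iso10646()
{
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}
```

Code that needs a specific encoding should not rely on either result; it should convert explicitly at its boundaries.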

If you compile a C++ program, run it, the system runs your
program in a 'console' (assumed). It's the 'console' that has
the char set, AFAIK.

It's significantly more complicated than that: as you correctly
observe, the "characters" are interpreted by many different
components, some of which are completely independent of your
code: in a string literal, the apparent character will probably
depend on the encoding of the font you use in the editor, when
you write the code. Unless the editor is compensating in some
way---most editors will allow using one encoding for display,
and another when writing to the file. After that, the compiler
might remap some of the characters, according to its ideas as to
what the "default" execution code set is, compared to the code
set it's reading. (At present, I don't think many compilers
actually do this. But it could make a lot of sense for
cross-compilers.) Until this point, of course, we're only
concerned with string literals and character constants. At
runtime, how the program interprets characters internally (i.e.
things like isupper) depends on the current locale; in C++, this
means that it can be different for different files. Once you've
output the character (say 0xE9---a "Latin small letter e with
acute accent" in ISO 8859-1), of course, you have no more
control over how it is interpreted; if you output to the
console, it will depend on the codeset the current console font
is using, which is pretty much out of the control of your
program. (I think Windows calls this a codepage.) If you
output it to a file, and copy the file into a console window
sometime later (the "cat" command under Unix and most advanced
Windows command interpreters, "type" in the default Windows
command interpreter), it will depend on the font being used in
the console window at the time you execute the command. (At
least under X, it's possible to have two different console
windows using different fonts, with different encodings, running
at the same time. So the "character" you see will depend on
which window you look at the file in.) Copy the file to the
printer, of course, and the character will depend on the font
used by the printer.
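
One practical consequence for things like isupper, sketched under the assumption that plain char may be signed: a byte such as 0xE9 can then be negative, and the <cctype> functions only accept values representable as unsigned char (or EOF). The wrapper names below are made up for the example.

```cpp
#include <cctype>

// Convert to unsigned char before calling the <cctype> functions;
// passing a negative plain char is undefined behavior.
bool is_upper_byte(char c)
{
    return std::isupper(static_cast<unsigned char>(c)) != 0;
}

bool is_printable_byte(char c)
{
    return std::isprint(static_cast<unsigned char>(c)) != 0;
}
```

Whether is_upper_byte reports true for a byte outside the basic character set depends on the current locale, which is exactly the point made above.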

Think about it. What would a char set be used for? Output to a screen (CRT)?
C++ does not know anything about a screen, keyboard, mouse, filesystem,
etc.. Those are supplied by libraries. Try writing anything[1] in C++
*without* any '#include' in it, and get IO.

The language does define a certain number of includes, including
<iostream> and <fstream>. So there is support for IO in the
language. The semantics, on the other hand, are very, very
loosely defined; more a suggestion of an intent than an actual
definition. Probably because C++ can't affect much of this.

James Kanze · Jun 25, 2007

But it can't guarantee that the supported character set you're
asking about is the one which will be used for display.

Ahem, let's try that again <G>:
Think about it. What would a char set be used for, output to a screen
(CRT)? (ya' know, like theoretical question.)

And, on this device with no CRT, no keyboard/pad, how is the
built in character set used?

The standard does define the concept of an "interactive device".
It also distinguishes between hosted and free-standing
implementations; a free-standing implementation isn't required
to support standard IO, but a hosted one is.
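
A rough way to ask the compiler which kind of implementation it is, assuming it defines __STDC_HOSTED__. That macro comes from C99 and was only adopted into C++ later, so on older compilers it is an extension; the function name hosted_status is invented here.

```cpp
// __STDC_HOSTED__ expands to 1 on a hosted implementation and 0 on a
// free-standing one. Returns -1 if the compiler does not define it.
int hosted_status()
{
#ifdef __STDC_HOSTED__
    return __STDC_HOSTED__;
#else
    return -1;
#endif
}
```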

Try writing anything[1]
in C++ *without* any '#include' in it, and get IO. You can 'crunch'
numbers, but you won't be able to *see* the results (....unless it's a
toaster. <G> ).


extern "C" int printf(const char*,...);
int main(){
printf("hello, world\n");
}
It only requires a library function named printf. This is not portable,


Only in that it isn't defined whether printf is `extern "C"' or
not. It would be a 100% portable C program.

Oh. So, you're saying that the standard library provides an
'character set'.

The language requires a "character set".

I see. I get it now. Silly me, I'v been includeing <iostream>
for nothing!

What's your point? The standard says you have to include
<iostream> in order to use the symbol std::cout, and a number of
other things.

I could just use 'printf' in my GUI apps.

Who knows. C++ doesn't speak of GUI's. But that doesn't mean
that it doesn't consider the issue of character sets, both
compile time and run-time. Otherwise, things like string
literals and character constants wouldn't make sense.

Right?
Then how do you tell 'printf' what char set to use?

Are you trying to be intentionally stupid, or do you just not
know the language or understand these issues?

But, that's a 'library written in the language', not 'the
language'. Or is it the other way around?

The standard library is part of the language. As are string
literals and character constants. And concepts like the "basic
execution character set" and the "extended execution character
set".

And the issues surrounding character encoding are extremely
complex, because elements outside the language, over which C++
has no control, do come into play. (I can start a new console
Window using Zapf dingbats on my system. The C++
implementations I have access to don't have a locale which
supports it, probably because the character set doesn't even
support the basic characters required by the C++ standard.)

Robert Bauck Hamar · Jun 25, 2007

Old Wolf said:
What is not portable about it? The standard specifies
that printf has just that signature.

AFAIK, the standard specifies printf to be part of namespace std, and that
it is implementation-defined whether the linkage of printf is C or C++
(§17.4.2.2). The standard recommends C++ linkage, but printf is reserved
to the implementation in the global namespace as an external symbol with C
linkage (§17.4.3.1.3).

I believe this is not portable, but it would often work, as C++ compilers
often link with the platform's C library. But there is no guarantee there
exists a library function named printf with C linkage in the global
namespace.
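
For comparison, a sketch of the portable spelling, which sidesteps the linkage question by letting the header supply the declaration. The wrapper function say_hello is only there for illustration.

```cpp
#include <cstdio>

// Including <cstdio> declares std::printf with whatever linkage the
// implementation chose, so no hand-written extern "C" declaration is
// needed. Returns printf's result: the number of characters written,
// or a negative value on error.
int say_hello()
{
    return std::printf("hello, world\n");
}
```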

BobR · Jun 25, 2007

James Kanze wrote in message...
/* """ quote

I see. I get it now. Silly me, I'v been includeing <iostream>
for nothing!

What's you're point. The standard says you have to include
<iostream>....
""" */ unquote

Then it should be built-in like int, double, char, etc.? If you have to
include it, it's provided for the language, not actually *in* the language.
(see 'stupid' below)
[ notice how I split your sentence, much like earlier posts split my
thought. <G>]
Onward....

/* """ quote
..... in order to use the symbol std::cout, and a number of
other things.

Then how do you tell 'printf' what char set to use?

Are you trying to be intentionally stupid, or do you just not
know the language or understand this issues?
""" */ unquote

The former. Sometimes you catch more flies with honey than with vinegar. And
sometimes if you show 'ignorance with authority', people can't resist
correcting you.

What I was looking for is a definition of "character set", distinction
between a 'char set' and 'char encoding' and 'font' (which is called by some
'char set' (s/b 'type face')). Everybody keeps sidestepping to 'library' and
'encoding'.
Life was just so much simpler in the ASCII world, but, of course, that can't
support all the languages of the world. <G> [ I often wonder what the
Chinese keyboard looks like. ( 'qwerty' or 'dvorak'? <G>) ]
Let's keep moving.....

/* """ quote

But, that's a 'library written in the language', not 'the
language'. Or is it the other way around?

The standard library is part of the language. As are string
literals and character constants. And concepts like the "basic
execution character set" and the "extended execution character
set".

And the issues surrounding character encoding are extremely
complex, because elements outside the language, over which C++
has no control, do come into play. (I can start a new console
Window using Zapf dingbats on my system. The C++
implementations I have access to don't have a locale which
supports it, probably because the character set doesn't even
support the basic characters required by the C++ standard.)

""" */ unquote

There's some of that 'honey' now! :-}

Can you supply a paragraph or two from the C++ standard which describe
exactly what a "character set" is? ( not the <cctype>, <locale> parts, or
'encoding'. (Keep in mind a 'char' is an integer type. So is a 'character
set' just "an array of numbers" which is interpreted in a specific
manner?) ).
[ ....or do I have to take another 'vinegar bath'. ;-} ]

Thanks James, Robert, and Old Wolf.

[ sorry, I don't have the 'standard doc', can't afford it. My 'need'
download list is long, and my 'want' list is huge. (dial-up ;-{ )]

Robert Bauck Hamar · Jun 25, 2007

BobR said:
James Kanze wrote in message...
/* """ quote

What's you're point. The standard says you have to include
<iostream>....
""" */ unquote

Then it should be built-in like int, double, char, etc.? If you have to
include it, it's provided for the language, not actually *in* the
language. (see 'stupid' below)

On my system, IO is provided by system calls to the operating system. Such
calls are made by the use of a machine code instruction not normally used
by any of the core language features. To achieve output, you can use
libraries or you must use extensions to the language. On gcc, you can try
this:

int main()
{
    __builtin_printf("Hello, world!\n");
}

The compiler _could_ also make the whole standard library builtin. There
exists no place stating that the standard headers must be actual files.

What I was looking for is a definition of "character set", distinction
between a 'char set' and 'char encoding' and 'font' (which is called by
some 'char set' (s/b 'type face')). Everyvody keeps sidestepping to
'library' and 'encoding'.

Because the standard says very little about the concepts.

Fonts are also different. Some font encodings are just an array of glyphs,
while other also contain meta information about the actual character, so
that they can be used with different character sets.

Life was just so much simpler in the ASCII world, but, of course, that
can't support all the languages of the world.

The ASCII world was before my time.

Let's keep moving.....

/* """ quote

The standard library is part of the language. As are string
literals and character constants. And concepts like the "basic
execution character set" and the "extended execution character
set".

And the issues surrounding character encoding are extremely
complex, because elements outside the language, over which C++
has no control, do come into play. (I can start a new console
Window using Zapf dingbats on my system. The C++
implementations I have access to don't have a locale which
supports it, probably because the character set doesn't even
support the basic characters required by the C++ standard.)

""" */ unquote

There's some of that 'honey' now! :-}

Can you supply a paragraph or two from the C++ standard which describe
exactly what a "character set" is?

§1.2 Normative references
The following referenced documents are indispensable for the application of
this document. [...]
ISO/IEC 2382 (all parts), /Information technology -- Vocabulary/
ISO/IEC 9899:1999, /Programming languages -- C/
ISO/IEC 10646-1:2000, /Information technology -- Universal Multiple-Octet
Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual
Plane/

§2.2 Defines the basic source character set, the universal-character-name
construct, the basic execution character set, the basic execution wide
character set, the execution character set and the execution wide character
set. They all contain A-Z, a-z, 0-9, space, \t, \n, \v, \f, and
_{}[]#()<>%:;.?*+-/^&|~!=,\"'. The actual values are implementation-defined
(for the mentioned 96 characters, and for \a, \b, and \r, which are part of
the execution character sets) or locale dependent, with the exception of
the null (wide) character (also a member of the execution character sets),
which has an all-zero-bits value.

(Keep in mind a 'char' is an integer type. So is a 'character
set' just "an array of numbers" which is interpreted in a specific
manner?) ).

Yes, pretty much.
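
To make the "array of numbers" view concrete, here is a small sketch that returns the numeric value of each character in a string. The function name character_codes is invented for the example, and the numbers it returns are implementation-defined: "Hi" gives 72 and 105 on an ASCII-based system, and different values on an EBCDIC one.

```cpp
#include <string>
#include <vector>

// Returns the implementation-defined numeric value of each character
// in text. Going through unsigned char keeps the values non-negative
// even for extended characters.
std::vector<int> character_codes(const std::string& text)
{
    std::vector<int> codes;
    codes.reserve(text.size());
    for (char c : text)
        codes.push_back(static_cast<unsigned char>(c));
    return codes;
}
```

The mapping from those numbers to glyphs on a screen or printer happens elsewhere, which is why the same numbers can look different in different windows.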

James Kanze · Jun 26, 2007

James Kanze wrote in message...

/* """ quote
On Jun 25, 4:46 am, "BobR" wrote:

What's you're point. The standard says you have to include
<iostream>....
""" */ unquote

Then it should be built-in like int, double, char, etc.?

Why? The standard says that for some things, you don't have to
include a header, and for others, you do.

If you have to include it, it's provided for the language, not
actually *in* the language.

That's not what the standard says. The standard says very
explicitly that iostream is part of the language.

/* """ quote
.... in order to use the symbol std::cout, and a number of
other things.

Are you trying to be intentionally stupid, or do you just not
know the language or understand this issues?
""" */ unquote

The former. Sometime you catch more flies with honey, than
with vinegar. And sometimes if you show 'ignorance with
authority', people can't resist correcting you.

What I was looking for is a definition of "character set",
distinction between a 'char set' and 'char encoding' and
'font' (which is called by some 'char set' (s/b 'type face')).
Everyvody keeps sidestepping to 'library' and 'encoding'.

Life was just so much simpler in the ASCII world, but, of course, that can't
support all the languages of the world. <G> [ I often wonder what the
Chinese keyboard looks like. ( 'qwerty' or 'devorak'? <G>) ]

[Just to keep it off topic: I think they use a US ASCII
keyboard. They enter the word more or less phonetically, and
the "keyboard driver" then presents a choice of symbols on the
screen, from which they pick. A bit weird for someone not used
to it.]

Let's keep moving.....

/* """ quote

The standard library is part of the language. As are string
literals and character constants. And concepts like the "basic
execution character set" and the "extended execution character
set".

And the issues surrounding character encoding are extremely
complex, because elements outside the language, over which C++
has no control, do come into play. (I can start a new console
Window using Zapf dingbats on my system. The C++
implementations I have access to don't have a locale which
supports it, probably because the character set doesn't even
support the basic characters required by the C++ standard.)

""" */ unquote

There's some of that 'honey' now! :-}

Yes. You hit on a real problem, but you really only hit at a
part of it, and you didn't seem to recognize the real effort
provided by the authors of the standard to try to define the
parts of it they could influence. Your posting made it sound
simply not part of the standard, which isn't the case either.

Can you supply a paragraph or two from the C++ standard which describe
exactly what a "character set" is?

Take a look at §2.2. Entitled, not inappropriately, "Character
sets". In particular, the third paragraph:

The basic execution character set and the basic
execution wide-character set shall each contain all the
members of the basic source character set, plus control
characters representing alert, backspace, and carriage
return, plus a null character (respectively, null wide
character), whose representation has all zero bits. For
each basic execution character set, the values of the
members shall be non-negative and distinct from one
another. In both the source and execution basic
character sets, the value of each character after 0 in
the above list of decimal digits shall be one greater
than the value of the previous. The execution character
set and the execution wide-character set are supersets
of the basic execution character set and the basic
execution wide-character set, respectively. The values
of the members of the execution character sets are
implementation-defined, and any additional members are
locale-specific.
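
Some of the guarantees in that paragraph can be checked on a given implementation with a few lines of code. The following is only a sketch, with an invented function name; it tests the zero null character, the non-negative values, and the consecutive digits, and does not attempt the distinctness check.

```cpp
#include <cstddef>

// Checks, for the running implementation, three guarantees from the
// quoted paragraph: the null character is zero, every basic character
// is non-negative as a plain char, and the decimal digits are consecutive.
bool basic_charset_guarantees_hold()
{
    // Letters, digits and punctuation of the basic source character
    // set, followed by the white-space and control characters.
    static const char basic[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'"
        " \t\v\f\n\a\b\r";

    if ('\0' != 0) return false;

    // sizeof includes the terminating null, so stop one short of it.
    for (std::size_t i = 0; i + 1 < sizeof basic; ++i)
        if (basic[i] < 0) return false;

    for (int d = 1; d <= 9; ++d)
        if ("0123456789"[d] != '0' + d) return false;

    return true;
}
```

On a conforming implementation this should return true whether the execution character set is ASCII-based, EBCDIC, or something else.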

( not the <cctype>, <locale> parts, or
'encoding'. (Keep in mind a 'char' is an integer type. So is a 'character
set' just "an array of numbers" which is interpreted in a specific
manner?) ).
[ ....or do I have to take another 'vinegar bath'. ;-} ]

Not this time.

The problem is, of course, that while "char" itself is an
integral type, and that in a very real sense, C++ doesn't have a
"character" type, the standard (the language, if you prefer)
does consider input and output of text, and it does consider
"characters" when talking about such things as string literals
and character constants. Since C++ doesn't want to impose a
particular encoding (implementations using EBCDIC are legal),
it's very difficult to specify much exactly.

The other point is that C++ can't control everything; I'm pretty
sure that this was the point you were trying to make. And if
you use different fonts in the editor window, where you write
your string literals, and in the console window, where you run
the program, there's not much the C++ compiler can do to
guarantee the results.

BobR · Jun 26, 2007

James Kanze wrote in message...
On Jun 25, 9:59 pm, "BobR" wrote:

/* """ quote

[ I often wonder what the
Chinese keyboard looks like. ( 'qwerty' or 'devorak'? <G>) ]

[Just to keep it off topic: I think they use a US ASCII
keyboard. They enter the word more or less phonetically, and
the "keyboard driver" then presents a choice of symbols on the
screen, from which they pick. A bit weird for someone not used
to it.]
""" */ unquote

Gads, no wonder they teach "How to send a virus to the U.S.A. 101" in their
colleges!! <G>

/* """ quote

Can you supply a paragraph or two from the C++ standard which describe
exactly what a "character set" is?

Take a look at §2.2. Entitled, not inappropriately, "Character
sets". In particular, the third paragraph:
[snip paragraph - what I was looking for ]
""" */ unquote

Thank you *very much*.

I hate to be such a jerk about it, but I don't remember seeing any posts
discussing the points you (and Robert) have answered here.
I appreciate your (and the standards committee's) efforts and time.
