Ioannis Vranos
I am asking so as to be sure:
AFAIK, non-Latin characters (those of other languages) produce undefined
behaviour when used with standard library facilities that expect char
strings, such as printf(), and when used in string literals.
Is this correct?
The C99 standard mentions:
"5.2.1 Character sets
1 Two sets of characters and their associated collating sequences shall be
defined: the set in which source files are written (the source character
set), and the set interpreted in the execution environment (the execution
character set). Each set is further divided into a basic character set,
whose contents are given by this subclause, and a set of zero or more
locale-specific members (which are not members of the basic character set)
called extended characters. The combined set is also called the extended
character set. The values of the members of the execution character set are
implementation-defined.
2 In a character constant or string literal, members of the execution
character set shall be represented by corresponding members of the source
character set or by escape sequences consisting of the backslash \ followed
by one or more characters. A byte with all bits set to 0, called the null
character, shall exist in the basic execution character set; it is used to
terminate a character string.
3 Both the basic source and basic execution character sets shall have the
following members: the 26 uppercase letters of the Latin alphabet
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
the 26 lowercase letters of the Latin alphabet
a b c d e f g h i j k l m
n o p q r s t u v w x y z
the 10 decimal digits
0 1 2 3 4 5 6 7 8 9
the following 29 graphic characters
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~
the space character, and control characters representing horizontal tab,
vertical tab, and form feed. The representation of each member of the source
and execution basic character sets shall fit in a byte. In both the source
and execution basic character sets, the value of each character after 0 in
the above list of decimal digits shall be one greater than the value of the
previous. In source files, there shall be some way of indicating the end of
each line of text; this International Standard treats such an end-of-line
indicator as if it were a single new-line character. In the basic execution
character set, there shall be control characters representing alert,
backspace, carriage return, and new line. If any other characters are
encountered in a source file (except in an identifier, a character constant,
a string literal, a header name, a comment, or a preprocessing token that is
never converted to a token), the behavior is undefined.
4 A letter is an uppercase letter or a lowercase letter as defined above; in
this International Standard the term does not include other characters that
are letters in other alphabets".
Thanks a lot,
--
Ioannis Vranos
C95 / C++03 Software Developer
http://www.cpp-software.net