arnuld said:
I thought ASCII values were the same on all platforms.
The ASCII values are the same, but not all platforms use ASCII values
for characters.
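As a concrete illustration (a minimal sketch; 193 is the EBCDIC code
for 'A', as used natively on platforms such as z/OS):

    #include <stdio.h>

    int main(void)
    {
        /* 'A' has whatever value the execution character set
           assigns it: 65 under ASCII, 193 under EBCDIC. */
        printf("'A' == %d\n", 'A');
        return 0;
    }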
I don't get it. Why does ASCII exist?
To standardize the method used in the United States to represent
English text. There's also a completely different system called
EBCDIC which is still in use on many systems. ASCII had to be extended
significantly to be of any use in most of Europe, where most languages
have various special characters that don't occur in English. ISO
created 15 extended versions of ASCII called ISO 8859-1 through ISO
8859-16 (ISO 8859-12 was abandoned). However, extending ASCII is a
totally inadequate strategy for most East Asian languages, where the
number of different characters can run into the thousands. There are
hundreds of different character encodings in use somewhere in the
world, and dozens that are pretty common.
Unicode is a pretty popular standard that was created with the goal of
cleaning up this mess. It assigns unique code points to each of more
than 100,000 characters from a very large and diverse set of
languages, and it has lots of unused code points that have been
reserved for additional characters, if needed. UTF-8 is a popular
variable-length encoding for Unicode code points: some characters
require only a single byte, others require multiple bytes; all of the
characters that can be represented in ASCII are represented by a
single byte with the same value it has in ASCII. There's also a
16-bit variable-length encoding (UTF-16), a 32-bit fixed-width
encoding (UTF-32), and a few other alternatives as well.
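A quick sketch of that, assuming a compiler whose execution character
set is UTF-8 (the default for gcc and clang); each string below holds
one character, but a different number of bytes (the \u notation is
explained just below):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* One character each, but 1, 2, and 3 bytes under UTF-8: */
        printf("%zu\n", strlen("A"));      /* U+0041 LATIN CAPITAL A */
        printf("%zu\n", strlen("\u00E9")); /* U+00E9 e with acute    */
        printf("%zu\n", strlen("\u20AC")); /* U+20AC euro sign       */
        return 0;
    }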
C has a certain limited amount of support for Unicode. Section 6.4.3
describes Universal Character Names (UCNs), such as \u0B4D (the \u
form takes four hex digits; \U takes eight). According to Annex D,
\u0B4D represents a character from the Oriya language. UCNs can
appear in identifiers, character constants, and string literals. The
intent was that editors would be created that could display some or
all of the Unicode characters that are not in the basic C character
set: any UCN corresponding to a character the editor knew how to
display would be shown as that character, and any character it could
not display would be left visible as a UCN. I have no idea whether
such editors were ever actually written; I've never had a need to use
a UCN.
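A minimal sketch of a UCN in use (this assumes a compiler whose
execution character set is UTF-8; under that assumption the string
ends up holding the three-byte UTF-8 encoding of the character):

    #include <stdio.h>

    int main(void)
    {
        /* \u0B4D names the character by its Unicode code point;
           the compiler translates it into the execution charset. */
        const char *s = "\u0B4D";
        printf("%s\n", s);
        return 0;
    }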
....
I always thought that everything inside a C program is saved as
characters and those characters are converted to their ASCII
equivalents during compilation.
The encoding used for a C source code file is not required to be
ASCII. In general, characters from the input file do not get copied as
such to the output file, except when they occur inside a character
constant, a string literal, or an identifier with external linkage.
Even then, many such characters are transformed in various ways during
translation of the program. For instance, inside the string literal
"Ding\07!", the three characters '\', '0', and '7' will (generally)
cause a single byte with a value of 7 to be stored in the executable.
Escape sequences like '\n' and UCNs cause similar transformations to
occur. What finally gets written to the output file will use the
encoding for the execution character set, which is also not required
to be ASCII, and which might not be the same as the encoding used in
the source file.
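To make the "Ding\07!" example concrete: eight characters appear
between the quotes in the source file, but they produce only six
array elements (plus the terminating null), because '\', '0', and '7'
collapse into a single byte with the value 7:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *s = "Ding\07!";
        printf("length: %zu\n", strlen(s)); /* prints 6 */
        printf("s[4] == %d\n", (int)s[4]);  /* prints 7 */
        return 0;
    }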