Arthur J. O'Dwyer said:
On Thu, 1 Jul 2004, Keith Thompson wrote: [...]
I was wrong about implementations being allowed not to support UCNs
(all conforming implementations must support them, I think). But the passage to
which you're referring does seem to support the general conclusion that
UCNs were added grudgingly: there are a lot of other places where
dubious use of UCNs leads to UB rather than a constraint violation
(a couple of places in the preprocessing stages, for example). I
suspect this is because the Committee realized that nobody was
going to build in full "Unicode"[1] support just for the benefit of
anal-retentive users.
(Non-USAnians may have a better idea, but I'm under the impression that
\u4E00 looks like "backslash, letter u, 4, E, 0, 0" in all major IDEs, so
there's no good reason to use UCNs in C code except inside string literals
anyway. It doesn't let you "write code in your own language" or
anything.)
Presumably the intent is to allow programmers to use native characters
in identifiers; nobody is expected to write "\u4E00".
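To make the string-literal case concrete, here is a minimal sketch of my own (not from the thread). The names ucn_text and ucn_text_length are made up for illustration, and the three-byte result assumes the execution character set is UTF-8, which is a common default but not something the standard guarantees.

```c
#include <stddef.h>
#include <string.h>

/* "\u4E00" in a string literal names one character (U+4E00), not the
   six source characters used to spell it.  How many bytes it occupies
   depends on the execution character set; with UTF-8 (an assumption
   here) it is the three bytes E4 B8 80. */
static const char ucn_text[] = "\u4E00";

/* Hypothetical helper: number of bytes before the terminating null. */
size_t ucn_text_length(void)
{
    return strlen(ucn_text);
}
```

On an implementation with a different execution character set the length may differ, or the character may not be representable at all.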
In translation phase 1:

    Physical source file multibyte characters are mapped, in an
    implementation-defined manner, to the source character set ...

I think the sequence "\u4E00" is normally expected to occur only after
translation phase 1; in the actual source file, it should look like
the corresponding Asian ideograph. As the rationale says:

    Given the current state of multibyte encodings, this mapping is
    specified to be implementation-defined; but an implementation can
    provide the users with utility programs that do the conversion
    from UCNs to "native" multibytes or vice versa, thus providing a
    way to exchange source files between implementations using the UCN
    notation.

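A rough sketch of what one direction of such a conversion utility might look like, assuming the "native" multibyte encoding is UTF-8. The function utf8_to_ucn is hypothetical, not part of any standard library or real tool, and it does not reject every malformed input (overlong encodings, for instance).

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: copy UTF-8 text from `in` to `out`, replacing
   each non-ASCII character with a UCN (\uXXXX or \UXXXXXXXX).
   Returns 0 on success, -1 if the input is malformed, names a
   character that C99 does not allow as a UCN, or does not fit. */
int utf8_to_ucn(const char *in, char *out, size_t outsize)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t used = 0;

    while (*p != '\0') {
        unsigned long cp;
        int extra;   /* number of continuation bytes expected */
        int n;

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return -1;
        p++;

        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80) return -1;  /* also catches a truncated tail */
            cp = (cp << 6) | (*p & 0x3Fu);
            p++;
        }

        /* C99 forbids UCNs for most values below 00A0 and for the
           surrogate range D800 through DFFF. */
        if ((cp >= 0x80 && cp < 0xA0) || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;

        if (used >= outsize) return -1;
        if (cp < 0x80)
            n = snprintf(out + used, outsize - used, "%c", (int)cp);
        else if (cp <= 0xFFFF)
            n = snprintf(out + used, outsize - used, "\\u%04lX", cp);
        else
            n = snprintf(out + used, outsize - used, "\\U%08lX", cp);
        if (n < 0 || (size_t)n >= outsize - used) return -1;
        used += (size_t)n;
    }

    if (used >= outsize) return -1;
    out[used] = '\0';
    return 0;
}
```

Given the UTF-8 bytes E4 B8 80 between an "a" and a "b", this writes the text a\u4E00b into the output buffer. A real tool would also need the reverse direction and would have to leave UCN-like sequences inside comments alone or not, depending on what it is for.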
UCNs are similar to trigraphs, but they seem to work in the opposite
direction. Phase 1 maps trigraphs to their legible single-character
equivalents, but it (optionally?) maps legible native characters to
their illegible UCN equivalents. Trigraphs are intended to be used in
human-readable source code (believe it or not); UCNs are not.
Of course UCNs can be used in source code if the programmer is
sufficiently masochistic; in that case, phase 1 presumably will pass
them through unchanged.
It's quite possible that I've misunderstood this. None of the
characters that require UCNs to represent them appear on my keyboard,
so I don't have much experience with this kind of thing. Corrections
are welcome.
I thought one of the sections in Annex D was labeled "Extended Digits"
or something like that?
You're right. Annex D is two pages long; the last two sections at the
bottom of the second page are "Digits" and "Special characters".
(There's no other mention of "special characters", so I suppose they
can be used in identifiers as if they were letters.)