'hello world' OS

Jeremy Yallop · Jul 1, 2004

Kenneth said:
What happens if the user enters a lowercase accented letter, such as 'á'
(which may or may not show up on your system properly, but is an accented
'a' here)?

In the "C" locale, islower() only returns true for the 26 lowercase
letters of the Latin alphabet, so is_x() will return false for 'á'.

Jeremy.

Arthur J. O'Dwyer · Jul 1, 2004

Now this question is perhaps off-topic for comp.lang.c, but I don't
understand *why* you can't use UCNs for members of the basic character
set. What is the rationale behind this constraint?

I'm not an authority, but I assume the reason is so that implementations
that don't support extended character sets don't have to implement
anything special to parse UCNs (which I think are new in C99?).

Alternatively, it could be a B&D approach to clarity and portability:
if the only way to write 'x' is to actually use the letter 'x', and not
to use arbitrarily complicated arithmetic, then the maintainer has one
less problem to worry about when porting to an EBCDIC system. ;-)

-Arthur

Dan Pop · Jul 1, 2004

I'm not an authority, but I assume the reason is so that implementations
that don't support extended character sets don't have to implement
anything special to parse UCNs (which I think are new in C99?).

Wrong. UCN support is mandatory:

6.4.2 Identifiers

6.4.2.1 General

Syntax

1 identifier:
identifier-nondigit
identifier identifier-nondigit
identifier digit

identifier-nondigit:
nondigit
universal-character-name
other implementation-defined characters

nondigit: one of
_ a b c d e f g h i j k l m
n o p q r s t u v w x y z
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z

digit: one of
0 1 2 3 4 5 6 7 8 9

An interesting consequence is that f$ is not a valid identifier, but
it becomes a valid identifier if the $ sign is replaced by its UCN:
f\u0024 !

Alternatively, it could be a B&D approach to clarity and portability:
if the only way to write 'x' is to actually use the letter 'x', and not
to use arbitrarily complicated arithmetic, then the maintainer has one
less problem to worry about when porting to an EBCDIC system. ;-)

Wrong again: UCNs have nothing to do with ASCII vs EBCDIC issues.

Dan

Keith Thompson · Jul 1, 2004

An interesting consequence is that f$ is not a valid identifier, but
it becomes a valid identifier if the $ sign is replaced by its UCN:
f\u0024 !

I don't think so.

C99 6.4.2.1p3 says:

Each universal character name in an identifier shall designate a
character whose encoding in ISO/IEC 10646 falls into one of the
ranges specified in annex D.

The encoding of '$', 0024, is not within one of the ranges specified
in annex D.

Interestingly, the "shall" in 6.4.2.1p3 is not in a constraint, so
using f\u0024 as an identifier invokes undefined behavior (it doesn't
violate a syntax rule either). I wonder if that was the intent. It
seems to me that it would make more sense for it to be a constraint
violation, requiring a diagnostic. If I'm not mistaken, a conforming
implementation could simply ignore annex D and allow any arbitrary
UCNs in identifiers. (That doesn't make f\u0024 a valid identifier,
it just means the implementation isn't required to diagnose it.)

Another possible oversight: the same paragraph also says

The initial character shall not be a universal character name
designating a digit.

but there's no specification in annex D of which UCNs specify digits.
Presumably ISO/IEC 10646 covers that, but it would be useful to spell
it out in the C standard, perhaps in a footnote.

Arthur J. O'Dwyer · Jul 2, 2004

An interesting consequence is that f$ is not a valid identifier, but
it becomes a valid identifier if the $ sign is replaced by its UCN:
f\u0024 !

I don't think so. [...]
The encoding of '$', 0024, is not within one of the ranges specified
in annex D.

Interestingly, the "shall" in 6.4.2.1p3 is not in a constraint, so
using f\u0024 as an identifier invokes undefined behavior (it doesn't
violate a syntax rule either). I wonder if that was the intent. It
seems to me that it would make more sense for it to be a constraint
violation, requiring a diagnostic. If I'm not mistaken, a conforming
implementation could simply ignore annex D and allow any arbitrary
UCNs in identifiers. (That doesn't make f\u0024 a valid identifier,
it just means the implementation isn't required to diagnose it.)

I was wrong about implementations' being allowed to not-support UCNs
(all conforming implementations must, I think). But the passage to
which you're referring does seem to support the general conclusion that
UCNs were added grudgingly: there are a lot of other places where
dubious use of UCNs leads to UB rather than a constraint violation
(a couple of places in the preprocessing stages, for example). I
think this is because maybe the Committee realized that nobody was
going to build in full "Unicode"[1] support just for the benefit of
anal-retentive users.
(Non-USAnians may have a better idea, but I'm under the impression that
\u4E00 looks like "backslash, letter u, 4, E, 0, 0" in all major IDEs, so
there's no good reason to use UCNs in C code except inside string literals
anyway. It doesn't let you "write code in your own language" or
anything.)

Another possible oversight: the same paragraph also says

The initial character shall not be a universal character name
designating a digit.

but there's no specification in annex D of which UCNs specify digits.
Presumably ISO/IEC 10646 covers that, but it would be useful to spell
it out in the C standard, perhaps in a footnote.

I thought one of the sections in Annex D was labeled "Extended Digits"
or something like that?

-Arthur

Keith Thompson · Jul 2, 2004

Arthur J. O'Dwyer said:
On Thu, 1 Jul 2004, Keith Thompson wrote: [...]
I was wrong about implementations' being allowed to not-support UCNs
(all conforming implementations must, I think). But the passage to
which you're referring does seem to support the general conclusion that
UCNs were added grudgingly: there are a lot of other places where
dubious use of UCNs leads to UB rather than a constraint violation
(a couple of places in the preprocessing stages, for example). I
think this is because maybe the Committee realized that nobody was
going to build in full "Unicode"[1] support just for the benefit of
anal-retentive users.
(Non-USAnians may have a better idea, but I'm under the impression that
\u4E00 looks like "backslash, letter u, 4, E, 0, 0" in all major IDEs, so
there's no good reason to use UCNs in C code except inside string literals
anyway. It doesn't let you "write code in your own language" or
anything.)

Presumably the intent is to allow programmers to use native characters
in identifiers; nobody is expected to write "\u4E00".

In translation phase 1:

Physical source file multibyte characters are mapped, in an
implementation-defined manner, to the source character set ...

I think the sequence "\u4E00" is normally expected to occur only after
translation phase 1; in the actual source file, it should look like
the corresponding Asian ideograph. As the rationale says:

Given the current state of multibyte encodings, this mapping is
specified to be implementation-defined; but an implementation can
provide the users with utility programs that do the conversion
from UCNs to "native" multibytes or vice versa, thus providing a
way to exchange source files between implementations using the UCN
notation.

UCNs are similar to trigraphs, but they seem to work in the opposite
direction. Phase 1 maps trigraphs to their legible single-character
equivalents, but it (optionally?) maps legible native characters to
their illegible UCN equivalents. Trigraphs are intended to be used in
human-readable source code (believe it or not); UCNs are not.

Of course UCNs can be used in source code if the programmer is
sufficiently masochistic; in that case, phase 1 presumably will pass
them through unchanged.

It's quite possible that I've misunderstood this. None of the
characters that require UCNs to represent them appear on my keyboard,
so I don't have much experience with this kind of thing. Corrections
are welcome.

I thought one of the sections in Annex D was labeled "Extended Digits"
or something like that?

You're right. Annex D is two pages long; the last two sections at the
bottom of the second page are "Digits" and "Special characters".
(There's no other mention of "special characters", so I suppose they
can be used in identifiers as if they were letters.)

Dan Pop · Jul 2, 2004

Keith Thompson said:
Dan Pop wrote: [...]

An interesting consequence is that f$ is not a valid identifier, but
it becomes a valid identifier if the $ sign is replaced by its UCN:
f\u0024 !

I don't think so.

C99 6.4.2.1p3 says:

Each universal character name in an identifier shall designate a
character whose encoding in ISO/IEC 10646 falls into one of the
ranges specified in annex D.

The encoding of '$', 0024, is not within one of the ranges specified
in annex D.

Good point! So, \u0024 can appear only in character constants and
string literals, as expected.

Dan
