VLA question


Stephen Sprunk

I mentioned my argument for that conclusion earlier in this thread -
both you and Keith seem to have skipped over it without either
accepting it or explaining why you had rejected it. Here it is
again.

I'll admit that I didn't quite understand the relevance the first time;
you added some clarification this time (plus some of the other points
discussed have started to sink in), so now I think I get it.
... While, in general, conversion to signed type of a value that is
too big to be represented by that type produces an
implementation-defined result or raises an implementation-defined
signal, for this particular conversion, I think that 7.21.2p3
implicitly prohibits the signal, and requires that if 'c' is an
unsigned char, then

(unsigned char)(int)c == c

If CHAR_MAX > INT_MAX, then 'char' must behave the same as 'unsigned
char'. Also, on such an implementation, there cannot be more valid
'int' values than there are 'char' values, and the inversion
requirement implies that there cannot be more char values than there
are valid 'int' values. This means that we must also have, if 'i' is
an int object containing a valid representation, that

(int)(char)i == i

This is indeed an interesting property of such systems, and one with
unexpectedly far-reaching implications.
In particular, this applies when i==EOF, which is why comparing
fgetc() values with EOF is not sufficient to determine whether or not
the call was successful.

I'd wondered about that, since the usual excuse for fgetc() returning an
int is to allow for EOF, which is presented by most introductory texts
as being impossible to mistake for a valid character.
Negative zero and positive zero have to
convert to the same unsigned char, which would make it impossible to
meet both inversion requirements, so it also follows that 'int' must
have a 2's complement representation on such a platform.

That only holds if plain char is unsigned, right?

It seems these seemingly-unrelated restrictions would not apply if plain
char were signed, which would be the (IMHO only) logical choice if
character literals were signed.
You've already said that. What you haven't done so far is explain
why. I agree that there's a bit of conflict there, but 'insane' seems
extreme.

Perhaps "insane" was a bit strong, but I see no rational excuse for the
signedness of plain chars and character literals to differ; the two are
logically linked, and only C's definition of the latter as "int" even
allows such a bizarre case to exist in theory.

IMHO, that C++ implicitly requires the signedness of the two to match,
apparently without problems, is an argument in favor of adopting the
same rule in C. As long as the signedness matches, none of the problems
mentioned in this thread would come up--or potentially break code that
was not written to account for this unlikely corner case.
I'd forgotten that C++ had a different rule for the value of a
character literal than C does. The C rule is defined in terms of
conversion of a char object's value to type 'int', which obviously
would be inappropriate given that C++ gives character literals a type
of 'char'. Somehow I managed to miss that "obvious" conclusion, and I
didn't bother to check. Sorry.

I'm in no position to complain about that.
Every time I've brought up the odd behavior of implementations which
have UCHAR_MAX > INT_MAX, it's been argued that they either don't
exist or are so rare that we don't need to bother worrying about
them. Implementations where CHAR_MAX>INT_MAX must be even rarer
(since they are a subset of implementations where UCHAR_MAX >
INT_MAX), so I'm surprised (and a bit relieved) to see someone
actually arguing for the probable existence of such implementations.
I'd feel happier about it if someone could actually cite one, but I
don't remember anyone ever doing so.

I'm not arguing for the _probable_ existence of such systems as much as
admitting that I don't have enough experience with atypical systems to
have much idea what's really out there on the fringes, other than
various examples given here since I've been reading. The world had
pretty much standardized on twos-complement systems with flat, 32-bit
address spaces by the time I started using C; 64-bit systems were my
first real-world experience with having to think about variations in the
sizes of base types--and even then usually only pointers.

S
 

James Kuyper

On 01-Jul-13 13:07, James Kuyper wrote: ....

This is indeed an interesting property of such systems, and one with
unexpectedly far-reaching implications.


I'd wondered about that, since the usual excuse for fgetc() returning an
int is to allow for EOF, which is presented by most introductory texts
as being impossible to mistake for a valid character.

On most systems, including the ones where C was first developed, that's
perfectly true. But the C standard allows an implementation where that's
not true to still be fully conforming. This does not "break" fgetc(), as
some have claimed, since you can still use feof() and ferror() to
determine whether an EOF value indicates success, failure, or
end-of-file; but in principle it does make use of fgetc() less convenient.
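
A minimal sketch of the loop this implies on a hosted implementation;
process_byte() is a hypothetical placeholder for whatever the caller
does with each byte:

#include <stdio.h>

void process_byte(unsigned char b);  /* hypothetical consumer */

void read_all(FILE *fp)
{
    int c;
    for (;;) {
        c = fgetc(fp);
        if (c == EOF) {
            /* EOF alone is not conclusive: consult feof()/ferror(). */
            if (feof(fp) || ferror(fp))
                break;  /* genuine end-of-file or read error */
            /* otherwise c is a real byte whose value equals EOF,
               possible only where UCHAR_MAX > INT_MAX */
        }
        process_byte((unsigned char)c);
    }
}

On typical systems the feof()/ferror() test never fires for a valid
byte, but it costs little and stays correct on the exotic
implementations discussed here.
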
That only holds if plain char is unsigned, right?

It seems these seemingly-unrelated restrictions would not apply if plain
char were signed, which would be the (IMHO only) logical choice if
character literals were signed.

Correct - most of what I've been saying has been explicitly about
platforms where CHAR_MAX > INT_MAX, which would not be permitted if char
were signed. "For any two integer types with the same signedness and
different integer conversion rank (see 6.3.1.1), the range of values of
the type with smaller integer conversion rank is a subrange of the
values of the other type." (6.2.5p8)

....
Perhaps "insane" was a bit strong, but I see no rational excuse for the
signedness of plain chars and character literals to differ; the two are
logically linked, and only C's definition of the latter as "int" even
allows such a bizarre case to exist in theory.

IMHO, that C++ implicitly requires the signedness of the two to match,
apparently without problems, is an argument in favor of adopting the
same rule in C. As long as the signedness matches, none of the problems
mentioned in this thread would come up--or potentially break code that
was not written to account for this unlikely corner case.

I agree that the C++ approach makes more sense - I'm taking issue only
with your characterization of C code which relies upon the C approach as
"broken". I also think it's unlikely that the C committee would decide
to change this, even though I've argued that the breakage that could
occur would be fairly minor.

You've seen how many complicated ideas and words I've had to put
together to construct my arguments for the breakage being minor. The
committee would have to be even more rigorous in considering the same
issues. The fact that there could be any breakage at all (and there can
be) means that there would have to be some pretty significant
compensating advantages for the committee to decide to make such a
change. Despite agreeing with the C++ approach, I don't think the
advantages are large enough to justify such a change.
 

James Kuyper

You're right - I reached that conclusion so many years ago that I forgot
the assumptions I relied upon to reach it. I was thinking of the minimal
case where CHAR_MAX is as small as possible while still being greater
than INT_MAX, in which case there's no room for padding bits. If you
move away from the minimal case, there is room for padding bits, and
then the argument breaks down. Of course, such implementations are even
less commonplace than the minimal case.

I'll have to review my earlier comments more carefully with that
correction in mind.

I was tired and in a hurry to go home, and didn't put enough thought
into my response. Such an implementation would violate 6.2.5p8:

"For any two integer types with the same signedness and different
integer conversion rank (see 6.3.1.1), the range of values of the type
with smaller integer conversion rank is a subrange of the values of the
other type."
 

James Kuyper

On 06/29/2013 02:05 PM, Keith Thompson wrote:
....
Right -- but that's only an issue when CHAR_BIT >= 16, which is the
context I missed in my previous response. As I also noted elsethread,
the conversion from char to int, where char is an unsigned type and the
value doesn't fit, is implementation-defined; the result is *probably*
negative, but it's not guaranteed.

I've just posted an argument on a different branch of this thread that
7.21.2p3 indirectly implies that on systems where UCHAR_MAX > INT_MAX,
given an unsigned character c and a valid int i, we must have

(unsigned char)(int)c == c

and

(int)(unsigned char)i == i

Comment?
 

glen herrmannsfeldt

James Kuyper said:
On 07/02/2013 01:10 AM, Stephen Sprunk wrote:
(snip)
On most systems, including the ones where C was first developed,
that's perfectly true. But the C standard allows an
implementation where that's not true to still be fully conforming.
This does not "break" fgetc(), as some have claimed, since you can
still use feof() and ferror() to determine whether an EOF value
indicates success, failure, or end-of-file; but in principle it
does make use of fgetc() less convenient.

Depending on your definition of valid character. My understanding
is that an ASCII-7 system can use a signed 8-bit char, but EBCDIC
eight-bit systems should use unsigned char. (No systems ever
used the ASCII-8 code that IBM designed into S/360.)

A Unicode-based system could use a 16-bit unsigned char, like
Java does.

Valid character doesn't mean anything that you can put the
bit pattern out for, but an actual character in the input
character set.

-- glen
 

James Kuyper

Depending on your definition of valid character.

For this purpose, a valid character is anything that can be returned by
a successful call to fgetc(). Since I can fill a buffer with unsigned
char values from 0 to UCHAR_MAX, and write that buffer to a binary
stream, with a guarantee of being able to read the same values back, I must
respectfully disagree with the following assertion:

....
Valid character doesn't mean anything that you can put the
bit pattern out for, but an actual character in the input
character set.

Do you think that the only purpose for fgetc() is to read text files?
All C input, whether from text streams or binary, has behavior defined
by the standard in terms of calls to fgetc(), whether or not actual
calls to that function occur.
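
One way to see what that guarantee entails (a sketch, assuming a hosted
implementation that permits binary streams; "bytes.bin" and
round_trip() are made-up names, and UCHAR_MAX + 1 is assumed not to
overflow int):

#include <limits.h>
#include <stdio.h>

int round_trip(void)
{
    unsigned char buf[UCHAR_MAX + 1];
    unsigned char back[UCHAR_MAX + 1];
    size_t i, n = sizeof buf;
    FILE *fp;

    for (i = 0; i < n; i++)
        buf[i] = (unsigned char)i;   /* every unsigned char value */

    fp = fopen("bytes.bin", "wb");
    if (!fp || fwrite(buf, 1, n, fp) != n || fclose(fp) != 0)
        return -1;                   /* binary streams may be refused */

    fp = fopen("bytes.bin", "rb");
    if (!fp || fread(back, 1, n, fp) != n)
        return -1;
    fclose(fp);

    for (i = 0; i < n; i++)
        if (back[i] != buf[i])
            return -1;  /* 7.21.2p3 says this must not happen */
    return 0;
}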
 

Keith Thompson

Stephen Sprunk said:
Granted, one can create arbitrary character literals, but doing so
ventures into "contrived" territory. I only mean to include real
characters, which I think means ones in the source or execution
character sets.
[...]

I wouldn't call '\xff' (or '\xffff' for CHAR_BIT==16) contrived.
 

Keith Thompson

James Kuyper said:
On 06/29/2013 02:05 PM, Keith Thompson wrote:
...

I've just posted an argument on a different branch of this thread that
7.21.2p3 indirectly implies that on systems where UCHAR_MAX > INT_MAX,
given an unsigned character c and a valid int i, we must have

(unsigned char)(int)c == c

and

(int)(unsigned char)i == i

Comment?

I agree.

I sometimes wonder how much thought the committee put into making
everything consistent for "exotic" systems, particularly those with
char and int having the same size (which implies CHAR_BIT >= 16).
I'm fairly sure that most C programmers don't put much thought
into it.

For most systems, having fgetc() return EOF reliably indicates that
there were no more characters to read, and that exactly one of feof()
or ferror() will then return true, and I think most C programmers
rely on that assumption. That assumption can be violated only if
CHAR_BIT >= 16.

Even with CHAR_BIT == 8, storing the (non-EOF) result of fgetc() into a
char object depends on the conversion to char (which is
implementation-defined if plain char is signed) being particularly well
behaved.
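
The idiom in question, sketched below; copy_stream() is a made-up name,
and the comment marks the implementation-defined step:

#include <stdio.h>

void copy_stream(FILE *in, char *dst, size_t cap)
{
    size_t i = 0;
    int c;
    while (i < cap && (c = fgetc(in)) != EOF)
        dst[i++] = (char)c;  /* implementation-defined if c > CHAR_MAX
                                and plain char is signed */
}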

Are there *any* systems with sizeof (int) == 1 (implying CHAR_BIT >= 16)
that support stdio? I know that some implementations for DSPs have
CHAR_BIT > 8, but are they all freestanding?

I wonder if we (well, the committee) should consider adding some
restrictions for hosted implementations, such as requiring INT_MAX >
CHAR_MAX or specifying the results of out-of-range conversions to plain
or signed char.
 

James Kuyper

On 07/02/2013 03:24 PM, Keith Thompson wrote:
....
I wonder if we (well, the committee) should consider adding some
restrictions for hosted implementations, such as requiring INT_MAX >
CHAR_MAX or specifying the results of out-of-range conversions to plain
or signed char.

That sounds like a good idea to me. However, if there are any existing
implementations that would become non-conforming as a result of such a
change, it could be difficult (and properly so) to get it approved.
 

Stephen Sprunk

I wouldn't call '\xff' (or '\xffff' for CHAR_BIT==16) contrived.

Why would anyone use that syntax for a character literal, rather than
the shorter 0xff (or 0xffff)? That strikes me as contrived.

There are certain cases where using the escape syntax is reasonable,
such as '\n', but even '\0' is more simply written as just 0. String
literals are another matter entirely, but those already have type
(pointer to) char--another argument in favor of character literals
having type char.

S
 

James Kuyper

Why would anyone use that syntax for a character literal, rather than
the shorter 0xff (or 0xffff)? That strikes me as contrived.

For the same reasons I use false, '\0', L'\0', u'\0', U'\0', 0L, 0LL,
0U, 0UL, 0ULL, 0.0F, 0.0, or 0.0L, depending upon the intended use of
the value, even though all of those constants have the same value. The
form of the constant makes its intended use clearer. As a side benefit,
in some of those cases, it shuts up a warning message from the compiler,
though that doesn't apply to '\0'.
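
For example, every initializer below is zero, yet each spelling
announces a different intended use:

#include <stddef.h>

unsigned long mask = 0UL;   /* unsigned long at a glance */
double scale = 0.0;         /* clearly floating-point */
char *end = NULL;           /* clearly a pointer */
char terminator = '\0';     /* clearly a character, though its value
                               is identical to plain 0 */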
 

glen herrmannsfeldt

(snip)
Are there *any* systems with sizeof (int) == 1 (implying CHAR_BIT >= 16)
that support stdio? I know that some implementations for DSPs have
CHAR_BIT > 8, but are they all freestanding?

I never used one, but I thought I remembered some Cray machines
with word-addressed 64-bit words that did. Maybe only in museums
by now.

-- glen
 

glen herrmannsfeldt

(snip on EOF and valid characters)
For this purpose, a valid character is anything that can be returned by
a successful call to fgetc(). Since I can fill a buffer with unsigned
char values from 0 to UCHAR_MAX, and write that buffer to a binary
stream, with a guarantee of being able read the same values back, I must
respectfully disagree with the following assertion:
Do you think that the only purpose for fgetc() is to read text files?
All C input, whether from text streams or binary, has behavior defined
by the standard in terms of calls to fgetc(), whether or not actual
calls to that function occur.

Doesn't really matter what I think, but it does matter what writers
of compilers think.

Are there compilers with 8 bit signed char using ASCII
and EOF of -1?

-- glen
 

Eric Sosman

[...]
Are there compilers with 8 bit signed char using ASCII
and EOF of -1?

You've just described a system called the PDP-11 -- which
dmr said was *not* the birthplace of C, but how would he know?
 

James Kuyper

(snip on EOF and valid characters)

Doesn't really matter what I think, but it does matter what writers
of compilers think.

Not really - I'm talking about conforming implementations of C. The only
thing that matters for the truth of my statements is what the writers of
the standard intended. If writers of compilers disagree, and act on that
disagreement, they will produce compilers that don't conform. That would
be a problem - but it wouldn't have any effect on whether or not my
statements are correct.

I'm confused, however, as to what form you think that disagreement might
take. Do you know of any implementations that implement fgetc() in ways
that will cause it to return EOF when processing a byte from a stream in
a file, if that byte has a value that you would not consider a valid
character? Such an implementation would be non-conforming, but I can't
imagine any reason for creating such an implementation. Most programs
that use C I/O functions to write and read data consisting of any
non-character data type would malfunction if that were true.
Are there compilers with 8 bit signed char using ASCII
and EOF of -1?

EOF == -1 is quite common. ASCII has yielded to extended ASCII, UTF-8,
or other more exotic options on most modern implementations I'm familiar
with, which means that the extended execution character set includes
characters that would be negative if char is signed. The one part I'm
not sure of is how common it is for char to be signed - but if there are
such implementations, it's not a problem. That's because the behavior of
fputc() and fgetc() is defined in terms of unsigned char, not plain
char. As a result, setting EOF to -1 cannot cause a conflict with any
hypothetical character that happens to have a value of -1. You have to
write such a char to file using fputc((unsigned char)-1), and fgetc()
should return (int)(unsigned char)(-1) upon reading such a character.
Having a successful call to fgetc() return EOF is only possible if
UCHAR_MAX > INT_MAX, which can't happen on systems with 8-bit signed char.
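
A sketch of that round trip on an ordinary CHAR_BIT==8,
signed-plain-char implementation; "c.bin" and demo() are made-up names:

#include <stdio.h>

int demo(void)
{
    FILE *fp = fopen("c.bin", "wb");
    if (!fp) return -1;
    fputc((char)-1, fp);    /* stored as (unsigned char)-1 == 0xFF */
    fclose(fp);

    fp = fopen("c.bin", "rb");
    if (!fp) return -1;
    int c = fgetc(fp);      /* yields 255, not EOF */
    fclose(fp);
    return c;               /* 255, distinguishable from EOF (-1) */
}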
 

Tim Rentsch

James Kuyper said:
Why? I thought that, while converting a negative value to
unsigned was well-defined, converting an out-of-range unsigned
value to signed was not.

I mentioned my argument for that conclusion earlier in this
thread - both you and Keith seem to have skipped over it without
either accepting it or explaining why you had rejected it. Here
it is again.

The standard defines the behavior of fputc() in terms of the
conversion of int to unsigned char (7.21.7.3p2). It defines the
behavior of fgetc() in terms of the conversion from unsigned char
to int (7.21.7.1p2). All other I/O is defined in terms of the
behavior of those two functions - the other I/O functions don't
have to actually call those functions, but they are required to
behave as if they did. It also requires that "Data read in from
a binary stream shall compare equal to the data that were earlier
written out to that stream, under the same implementation."
(7.21.2p3). While, in general, conversion to signed type of a
value that is too big to be represented by that type produces an
implementation-defined result or raises an implementation-defined
signal, for this particular conversion, I think that 7.21.2p3
implicitly prohibits the signal, and requires that if 'c' is an
unsigned char, then

(unsigned char)(int)c == c

If CHAR_MAX > INT_MAX, then 'char' must behave the same as
'unsigned char'. Also, on such an implementation, there cannot
be more valid 'int' values than there are 'char' values, and the
inversion requirement implies that there cannot be more char
values than there are valid 'int' values. This means that we
must also have, if 'i' is an int object containing a valid
representation, that

(int)(char)i == i

In particular, this applies when i==EOF, which is why comparing
fgetc() values with EOF is not sufficient to determine whether or
not the call was successful. Negative zero and positive zero
have to convert to the same unsigned char, which would make it
impossible to meet both inversion requirements, so it also
follows that 'int' must have a 2's complement representation on
such a platform. [snip unrelated]

This is a clever line of reasoning. But the conclusions are
wrong, for lots of different reasons.

First, an implementation might simply fail any attempt to open a
binary file. Or the open might succeed, but any attempt to write
a negative argument might fail and indicate a write error. Such
an implementation might be seen as abysmal but it still could be
conforming. And it clearly satisfies 7.21.2p3, without imposing
any further limitations on how conversions might work (or not).

Second, the Standard doesn't guarantee that all unsigned char
values can be transmitted: it is only the unsigned char values
corresponding to int arguments that can be written, and hence
only these that need compare equal when subsequently read. The
word "transparently" might be taken as implying all unsigned char
values will work, or it might be taken to mean what the rest of
the paragraph spells out, ie that any values written will survive
the round trip unchanged. The idea that the 'comparing equal'
condition is meant to apply universally to all unsigned char
values is an assumption not supported by any explicit wording.

Third, even if an implementation allows reading and writing of
binary files, and fputc works faithfully for all unsigned char
values, conversion from unsigned char to int could still raise
an implementation-defined signal (for values above INT_MAX).
This could work if the default signal handler checked and did
the right thing when converting arguments to fputc etc., and did
something else otherwise. (And the signal in question could
be one not subject to change using signal().)

Fourth, alternatively, ints could have a trap representation,
where the implementation takes advantage of the freedom given by
undefined behavior in such cases, to do the right thing when
converting arguments to fputc (etc), or something else for other
such conversions. Such implementations might be seen as rather
perverse, but that doesn't make them non-conforming.

Finally, and perhaps most obviously, there is no reason 7.21.2p3
even necessarily applies, because freestanding implementations
aren't required to implement <stdio.h>.
 

Keith Thompson

Robert Wessel said:
I'm pretty sure, but not 100% certain, that all the word addressed
Crays that had C compilers normally used a different pointer format
for char (and void) pointers. It's certainly possible that there were
some prototype/development compilers that did not.

All the Crays I used had CHAR_BIT==8. The T90 in particular was a
word-addressed vector machine with 64-bit words. char* and void*
pointers were 64-bit word pointers with a byte offset stored in the
high-order 3 bits. String operations were surprisingly slow -- but of
course the hardware was heavily optimized for massive floating-point
operations.
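
A purely illustrative guess at what that layout implies for the
compiler; the helper names are made up and the real internals may have
differed:

#include <stdint.h>

/* T90-style char*: byte offset in the top 3 bits, word address below. */
static uint64_t word_address(uint64_t cp)
{
    return cp & ((UINT64_C(1) << 61) - 1);
}

static unsigned byte_offset(uint64_t cp)
{
    return (unsigned)(cp >> 61);
}

Every char access then needs a shift and a mask on top of the word
load, which is one plausible reason the string operations were slow.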

But that was for Unicos, Cray's Unix system, so it had to conform (more
or less) both to C and to POSIX. I never used the earlier non-Unix COS
system, and I don't know what it was like.
 

Keith Thompson

glen herrmannsfeldt said:
Are there compilers with 8 bit signed char using ASCII
and EOF of -1?

Yes, a lot of them, in fact most of the C compilers I've used fit that
description.

8-bit char (i.e., CHAR_BIT==8): I've never used a system with CHAR_BIT
!= 8.

signed char: Plain char is signed on *most* compilers I've used.
Exceptions are Cray systems, SGI MIPS systems running Irix, and IBM
PowerPC systems running AIX.

Most systems these days support various supersets of ASCII, ranging from
Windows-1252 or Latin-1 to full Unicode. But the 7-bit ASCII subset is
nearly universal. EBCDIC-based systems are an obvious exception (and
they'd probably have to make plain char unsigned).

I don't believe I've ever seen a value for EOF other than -1.
 

Tim Rentsch

Stephen Sprunk said:
A thorough and thoughtful analysis; I'm not sure I agree with
you on every point, but most of my disagreement would probably
fall into how each is weighted anyway.

Thank you, I appreciate the positive comment. Of course I would
expect (and hope!) that most people would agree on the objective
portion, and differ only in assignment of the subjective weights.
It's nice to hear that's how it worked out in this instance.
I come at this from a rather different angle. Looking at the
C-like subset of C++ (no classes, templates, namespaces, and
overloading), I find it to be a better version of C than C
itself. That gap has been narrowing over time, e.g. adopting //
comments, so what I propose is to merely complete that task all
at once. The individual changes could probably not be
justified on their own, but as a group, IMHO they are.

My reaction is just the opposite. For starters, I think the gap
has gotten wider rather than narrower. Moreover it is likely to
continue growing in the future, because of the different design
philosophies of the respective groups - the C group is generally
conservative, the C++ group more open to accommodating new
features.

As to the "whole is greater than the sum of the parts" idea, I
believe if individual changes don't stand on their own merits,
then it's even worse to include them as a group. Let's take the
'const'-defined array bound as an example. This language feature
adds no significant expressive power to the language; it's
simply another way of doing something that can already be done
with about the same amount of code writing. There may be some
things about it that are better, and some things that are worse,
but certainly it isn't clearly better -- it's just different. So
now what happens if rather than one of those we add 25 of them?
There's no appreciable difference in how easy or hard programs
are to write; but reading them gets harder, because there are
more ways to write the same thing, and translating between them
takes some effort. Meanwhile the language specification would
get noticeably larger, and require more effort to read and digest
(even not counting the effort needed to write). No real gain,
and a bunch of cost.
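
To make that example concrete (the names are made up): C already has
two ways to spell a constant array bound, and the C++-style 'const'
spelling buys nothing in C, where it produces a VLA instead of a
fixed-size array:

#define N_MACRO 10          /* the traditional C spellings */
enum { N_ENUM = 10 };

static int a[N_MACRO];      /* constant expressions: fine at file scope */
static int b[N_ENUM];

void f(void)
{
    const int n = 10;       /* the C++-style spelling... */
    int c[n];               /* ...which in C declares a VLA, not a
                               fixed-size array */
    c[0] = n;
}

In C++ the same 'const' declaration would give a fixed-size array,
usable at file scope as well, which is the gap being discussed.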

An uncharitable view of your opinion is that it is simply
disguised chauvinism for the C++ way of doing things. Do you
have any arguments to offer for the merits of some of these
proposed new features that don't reference C++ but are able to
stand on their own?
 

Ian Collins

Tim said:
My reaction is just the opposite. For starters, I think the gap
has gotten wider rather than narrower. Moreover it is likely to
continue growing in the future, because of the different design
philosophies of the respective groups - the C group is generally
conservative, the C++ group more open to accommodating new
features.

It looks like this will continue to be the case, given the active
discussion of new features for the next C++ revision.

An uncharitable view of your opinion is that it is simply
disguised chauvinism for the C++ way of doing things. Do you
have any arguments to offer for the merits of some of these
proposed new features that don't reference C++ but are able to
stand on their own?

There are a number of C++11 additions that would stand alone in C and
improve it in similar ways to C++; some examples:

Static (compile-time) assertions. Yes, you can do much the same with the
preprocessor, but there are limits, and I believe static_assert makes the
conditions clearer as program or function preconditions. They are also
more concise (see the sketch after this list).

Initialisations preventing narrowing. Removes another source of
unexpected behaviour.

General compile-time constants with constexpr. This would be
particularly useful in the embedded world where you want to minimise RAM use.

nullptr. Removes another source of unexpected behaviour.

Raw string literals. Removes another potential source of unexpected
behaviour and hard-to-read code (how many backslashes do I need?).

alignas to standardise alignment.

And for the bold, "auto" variable declarations.

None of these are particularly radical, but they would make a
programmer's life just that little bit easier.
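
As it happens, C11 did adopt the first item as _Static_assert, with a
static_assert convenience macro in <assert.h>. A sketch contrasting it
with the preprocessor workaround:

#include <assert.h>
#include <limits.h>

/* Preprocessor version: limited to what #if can evaluate
   (no sizeof, no casts). */
#if CHAR_BIT != 8
#error "this code assumes 8-bit bytes"
#endif

/* Language version: any integer constant expression works. */
static_assert(sizeof(int) >= 4, "this code assumes at least 32-bit int");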
 
