Reading from files and range of char and friends


Spiros Bousbouras

Spiros Bousbouras said:
On 3/10/2011 11:40 AM, Spiros Bousbouras wrote:
If you are reading from a file by successively calling fgetc() is there
any point in storing what you read in anything other than unsigned
char ?

Sure. To see one reason in action, try

unsigned char uchar_password[SIZE];
...
if (strcmp(uchar_password, "SuperSecret") == 0) ...

Just to be clear , the only thing that can go wrong with this example
is that strcmp() may try to convert the elements of uchar_password to
char thereby causing the implementation defined behavior. The same
issue could arise with any other str* function. Or is there something
specific about your example that I'm missing ?

The call to strcmp() violates a constraint. strcmp() expects const
char* (a non-const char* is also ok), but uchar_password, after
the implicit conversion is of type unsigned char*. Types char*
and unsigned char* are not compatible, and there is no implicit
conversion from one to the other.

I see. I assumed that the implicit conversion would be ok because
paragraph 27 of 6.2.5 says "A pointer to void shall have the same
representation and alignment requirements as a pointer to a character
type.39)" and footnote 39 says "The same representation and alignment
requirements are meant to imply interchangeability as arguments to
functions, return values from functions, and members of unions." I
assumed that the relation "same representation and alignment
requirements" is transitive.

On the other hand footnote 35 of paragraph 15 says that char is not
compatible with signed or unsigned char and in 6.7.5.1 we read that
pointers to types are compatible only if the types are compatible. We
must conclude then that the relation "same representation and alignment
requirements" is not transitive. That's a damn poor choice of
terminology then.
If you use an explicit cast, it will *probably* work as expected,
but without the cast the compiler is permitted to reject it.

What would be so strange about it? If a file contains a sequence of
ints, stored as binary, and the implementation has a distinct
representation for negative zero, then the file could certainly contain
negative zeros.

Ok , I guess it could happen. But then I have a different objection. Eric said

(The situation is particularly bad for systems with
signed-magnitude or ones' complement notations, where the
sign of zero is obliterated on conversion to unsigned char
and thus cannot be recovered again after getc().)

It seems to me that an implementation can easily ensure that the sign
of zero does not get obliterated. If by using fgetc() an unsigned char
gets the bit pattern which corresponds to negative zero then the
implementation can assign the negative zero when converting to int .
The standard allows this.
 

lawrence.jones

Tim Rentsch said:
A call to getc() cannot return negative zero. The reason is,
getc() is defined in terms of fgetc(), which returns an
'unsigned char' converted to an 'int', and such conversions
cannot produce negative zeros.

They can if char and int are the same size.
--
Larry Jones

I always send Grandma a thank-you note right away. ...Ever since she
sent me that empty box with the sarcastic note saying she was just
checking to see if the Postal Service was still working. -- Calvin
 

Spiros Bousbouras

A call to getc() cannot return negative zero. The reason is,
getc() is defined in terms of fgetc(), which returns an
'unsigned char' converted to an 'int', and such conversions
cannot produce negative zeros.

When I said "getc() read int's from files" I meant that also fgetc()
reads int's from files i.e. we're talking about an alternative C where
we don't have the intermediate unsigned char step.

Apart from that , in post

<[email protected]>
http://groups.google.com/group/comp.lang.c/msg/1909c5fe30c02e81?dmode=source

you say

Do you mean to say that if a file has a byte with a bit
pattern corresponding to a 'char' negative-zero, and
that byte is read (in binary mode) with getc(), the
result of getc() will be zero? If that's what you're
saying I believe that is wrong.

Assuming actual C (i.e. not the alternative C from above) is it not
possible in the scenario you're describing that int will get negative
zero ?
 

Spiros Bousbouras

The standard could say that if an implementation offers stdio.h then
the following function

int foo(unsigned char a) {
    char b = a;
    unsigned char c = b;
    return a == c;
}

always returns 1. This I think would be sufficient to be able to assign
the return value of fgetc() to char (after checking for EOF) without
worries. But does it leave any existing implementations out ? And while
I'm at it , how do existing implementations handle conversion to a
signed integer type if the value doesn't fit ? Anyone has any unusual
examples ?

Another approach would be to have a macro __WBUC2CA (well behaved
unsigned char to char assignment) which will have the value 1 or 0 and
if it has the value 1 then foo() above will be guaranteed to return 1.

A better name would be __WBUC2CC for well behaved unsigned char to char
conversion.
 

Spiros Bousbouras

assigning but I guess it wasn't clear. What I had in mind was something
like:

unsigned char arr[some_size] ;
int a ;

while ( (a = fgetc(f)) != EOF) arr[position++] = a ;

Would there be any reason for arr to be something other than
unsigned char ?

No, but you should use a cast there or your compiler might balk because
unsigned char is likely to have fewer bits than int.

A cast wouldn't buy you anything in this case because according to
paragraph 2 of 6.5.16.1 a conversion will happen anyway.
 

Spiros Bousbouras

Pardon me for jumping in so late. I got interested when someone earlier
thought to store the EOF character. Of course the EOF is a status and need
not be stored.

I don't recall anyone in the thread saying that.
The return type of fgetc() is int so as to allow the full 0..255 value of a
byte AND a value EOF.

A byte in C can have values greater than 255 depending on the
implementation.
When you assign int to char, the char takes the lower
eight bits of the int without change.

Where do you get this from ? In the OP I mentioned paragraph 3 of
6.3.1.3. Here's what it says:

Otherwise, the new type is signed and the value cannot be
represented in it; either the result is implementation-defined
or an implementation-defined signal is raised.

And you do realise that a char is permitted to have more than 8 bits ,
yes ?
Try this:

#include <stdio.h>
int main(void) {
    char c;
    unsigned char u;
    int i = 240;
    c = i;
    u = c;
    printf("%d, %d, %d\n", i, c, u);
    return 0;
}

I get: 240, -16, 240 as I expected.

That is one data point among the hundreds or thousands of C
implementations. Even if a char always had 8 bits and even if the
assignment int to char was guaranteed to copy the lower 8 bits , the
middle number could still be -112 if the implementation uses "sign and
magnitude" to represent negative numbers.
The value of fgetc() being int and being assigned to char is not a problem
and not a 'defect' of the language.

If only.
 

Keith Thompson

Spiros Bousbouras said:
When I said "getc() read int's from files" I meant that also fgetc()
reads int's from files i.e. we're talking about an alternative C where
we don't have the intermediate unsigned char step.

I'm afraid I'm not following you here.

I initially assumed you meant getc and fgetc would be reading
int-sized chunks from the file, rather than (as C currently
specifies) reading bytes, interpreting them as unsigned char,
and converting that to int.

Without the intermediate step, how is the int value determined?

Perhaps you mean getc and fgetc read a byte from the file, interpret
it as *plain* char, and then convert the result to int.

If so, and if plain char is signed and has a distinct representation
for negative zero (this excludes 2's-complement systems), then
could getc() return a negative zero?

I'd say no. Converting a negative zero from char to int does not
yield a negative zero int; 6.2.6.2p3 specifies the operations that
might generate a negative zero, and conversions aren't in the list.

Which means that getc() and fgetc() would be unable to distinguish
between a positive and negative zero in a byte read from a file.
Which is probably part of the reason why the standard specifies
that the value is treated as an unsigned char.

Or the standard could have said specifically that getc and fgetc do
return a negative zero in these cases, but dealing with that in code
would be nasty (and, since most current systems don't have negative
zeros, most programmers wouldn't bother).

(As I've said before, requiring plain char to be unsigned would
avoid a lot of this confusion, but might have other bad effects.)
 

Keith Thompson

Joe Wright said:
Angel said:
[snip]

UTF-8, as the name implies, is 8 bits wide and will fit in an unsigned
char (it will fit in a signed char too,

It will on most implementations but the Standard does not
require that.
but values>127 will be converted to negative values),

Again true on most implementations but not Standard-guaranteed.

I must be missing your point. What does UTF-8 have to do with the Standard?

Somebody upthread suggested that the plain char vs. unsigned char
mismatch isn't a problem, because ASCII characters are all in the
range 0-127. UTF-8 is one example of a character encoding where
bytes in a text file can have values exceeding 127. (Latin-1 and
EBCDIC are other examples.)
 

Eric Sosman

If you are reading from a file by successively calling fgetc() is there
any point in storing what you read in anything other than unsigned
char ?

Sure. To see one reason in action, try

unsigned char uchar_password[SIZE];
...
if (strcmp(uchar_password, "SuperSecret") == 0) ...

Just to be clear , the only thing that can go wrong with this example
is that strcmp() may try to convert the elements of uchar_password to
char thereby causing the implementation defined behavior.

True: After issuing the required diagnostic, the implementation
may accept the faulty translation unit anyhow, and may assign it any
meaning it's inclined to, and that meaning may be implementation-
defined.

Alternatively, the implementation may issue the diagnostic and
spit the sorry source back in your face.
The same
issue could arise with any other str* function. Or is there something
specific about your example that I'm missing ?

The required diagnostic, I think. 6.5.2.2p2, plus 6.3.2.3's
omission of any description of the necessary conversion.
If getc() read int's from files instead of unsigned char's would it be
realistically possible that reading from a file would return a negative
zero ? That would be one strange file.

One strange text file, yes. But not so strange for a binary
file, where any bit pattern at all might appear. If a char that looks
like minus zero appears somewhere in the middle of a double, and you
fwrite() that double to a binary stream, the underlying fputc() calls
(a direct requirement; not even an "as if") convert each byte in turn
from unsigned char to int. I think the conversion allows the bits to
be diddled irreversibly -- although on reconsideration it may happen
only when sizeof(int)==1 as well.
I don't see how this can happen with getc().

When sizeof(int)==1, there will exist a perfectly valid unsigned
char value whose conversion to int yields EOF. (Or else there will
exist two or more distinct unsigned char values that convert to the
same int value, which is even worse and violates 7.19.2p3.) So
checking the value of getc() against EOF isn't quite enough: Having
found EOF, you also need to call feof() and ferror() before concluding
that it's "condition" rather than "data." More information is being
forced through the return-value channel than the unaided channel
can accommodate.
 

Eric Sosman

[...]
Ok , I guess it could happen. But then I have a different objection. Eric said

(The situation is particularly bad for systems with
signed-magnitude or ones' complement notations, where the
sign of zero is obliterated on conversion to unsigned char
and thus cannot be recovered again after getc().)

It seems to me that an implementation can easily ensure that the sign
of zero does not get obliterated. If by using fgetc() an unsigned char
gets the bit pattern which corresponds to negative zero then the
implementation can assign the negative zero when converting to int .
The standard allows this.

Could you indicate where? I'm looking at 6.2.6.2p3, which lists
the operations that can generate a minus zero, and does not list
"conversion" among them.
 

J. J. Farrell

Spiros said:
assigning but I guess it wasn't clear. What I had in mind was something
like:

unsigned char arr[some_size] ;
int a ;

while ( (a = fgetc(f)) != EOF) arr[position++] = a ;

Would there be any reason for arr to be something other than
unsigned char ?
No, but you should use a cast there or your compiler might balk because
unsigned char is likely to have fewer bits than int.

A cast wouldn't buy you anything in this case because according to
paragraph 2 of 6.5.16.1 a conversion will happen anyway.

No, a cast would buy you freedom from a warning with some compilers.
 

Eric Sosman

They can if char and int are the same size.

Despite 6.2.6.2p3? In ISO/IEC 9899:TC3 (perhaps the wording
has changed in more recent versions), "conversion" is not listed
among the operations that can generate a negative zero. Even if
a negative zero arises, this paragraph says it's unspecified whether
storing it in an object stores a negative or a "normal" zero.
 

J. J. Farrell

Joe said:
...

The return type of fgetc() is int so as to allow the full 0..255 value
of a byte AND a value EOF.

... assuming the value range of a byte is limited to 0..255 which it
need not be. In particular, a byte can be the same size as an int.
When you assign int to char, the char takes
the lower eight bits of the int without change.

No, no, no, no, no. ISO/IEC9899:1999 6.3.1.3 Signed and unsigned integers:

"When a value with integer type is converted to another integer type
other than _Bool, if the value can be represented by the new type, it is
unchanged.

Otherwise, if the new type is unsigned, the value is converted by
repeatedly adding or subtracting one more than the maximum value that
can be represented in the new type until the value is in the range of
the new type.

Otherwise, the new type is signed and the value cannot be represented in
it; either the result is implementation-defined or an
implementation-defined signal is raised."
Try this:

#include <stdio.h>
int main(void) {
    char c;
    unsigned char u;
    int i = 240;
    c = i;
    u = c;
    printf("%d, %d, %d\n", i, c, u);
    return 0;
}

I get: 240, -16, 240 as I expected.

You had no right to expect it. It's a common implementation, but far
from guaranteed.
 

Keith Thompson

Eric Sosman said:
Despite 6.2.6.2p3? In ISO/IEC 9899:TC3 (perhaps the wording
has changed in more recent versions), "conversion" is not listed
among the operations that can generate a negative zero. Even if
a negative zero arises, this paragraph says it's unspecified whether
storing it in an object stores a negative or a "normal" zero.

6.3p2:

Conversion of an operand value to a compatible type causes no
change to the value or the representation.

Looks like a mild inconsistency.
 

Tim Rentsch

They can if char and int are the same size.

Yes, implementations that have sizeof(int) == 1 and that
use signed magnitude or one's complement are an exception
to what I said, and I should have mentioned that.

Do any such implementations actually exist? Certainly I'm
not aware of any.
 

Tim Rentsch

Joe Wright said:
Angel said:
[snip]

UTF-8, as the name implies, is 8 bits wide and will fit in an unsigned
char (it will fit in a signed char too,

It will on most implementations but the Standard does not
require that.
but values>127 will be converted to negative values),

Again true on most implementations but not Standard-guaranteed.

I must be missing your point. What does UTF-8 have to do with the Standard?

My comment was not about UTF-8 but about 8-bit values (ie
256 distinct non-negative values); these don't necessarily
fit in a 'signed char', etc.
 

Tim Rentsch

Eric Sosman said:
Despite 6.2.6.2p3?
Yes.

In ISO/IEC 9899:TC3 (perhaps the wording
has changed in more recent versions),

Still the same as of N1547.
"conversion" is not listed
among the operations that can generate a negative zero.

Conversion of an in-range value cannot. Conversion of an
out-of-range value gives an implementation-defined result,
which may be defined to be (not to generate) negative zero
for certain values. Subtle distinction, I admit, but I
believe that is how the committee expects these statements
will be read.
Even if
a negative zero arises, this paragraph says it's unspecified whether
storing it in an object stores a negative or a "normal" zero.

Because the behavior is unspecified, an implementation may
define it to store the negative zero faithfully. 'Unspecified'
doesn't mean outside of the implementation's control, it
just means the implementation isn't obliged to document
what choices it makes.
 

Tim Rentsch

Spiros Bousbouras said:
When I said "getc() read int's from files" I meant that also fgetc()
reads int's from files i.e. we're talking about an alternative C where
we don't have the intermediate unsigned char step.

Ahh, I didn't understand that. I don't know what would
happen in alternative C; I don't have any kind of reference
manual or standards document for that language.
Apart from that , in post

<[email protected]>
http://groups.google.com/group/comp.lang.c/msg/1909c5fe30c02e81?dmode=source

you say

Do you mean to say that if a file has a byte with a bit
pattern corresponding to a 'char' negative-zero, and
that byte is read (in binary mode) with getc(), the
result of getc() will be zero? If that's what you're
saying I believe that is wrong.

Assuming actual C (i.e. not the alternative C from above) is it not
possible in the scenario you're describing that int will get negative
zero ?

As Larry Jones reminded me, it is possible for this to
happen in implementations that have sizeof(int) == 1 (and
that use representations with negative zeros in them). I'm
not aware of any such implementations but the Standard does
allow them. Other than that, it isn't.
 

Keith Thompson

pete said:
I don't see the relevance of that quote,
because it is about compatible type conversion,
and I don't see anything about compatible types
in the above quoted post.

Ah, you're right.

I think it's still a mild inconsistency, but it doesn't apply to
the situation we're discussing. For example, suppose ptrdiff_t is
compatible with long. Then converting (say, via an explicit cast)
a negative zero of type long to ptrdiff_t would yield a negative
zero of type ptrdiff_t -- but a cast is not one of the operations
that can yield a negative zero.

This is probably one of the most obscure corner cases I've ever
run across.
 

lawrence.jones

Keith Thompson said:
For example, suppose ptrdiff_t is
compatible with long. Then converting (say, via an explicit cast)
a negative zero of type long to ptrdiff_t would yield a negative
zero of type ptrdiff_t -- but a cast is not one of the operations
that can yield a negative zero.

The standard lists the operations that can *generate* a negative zero.
One could argue that operations like cast and assignment simply preserve
an existing negative zero rather than generating a new one.
 
