A very simple parser with scanf & C

Keith Thompson

Phil Carmody said:
I suspect those architectures may well live only in Bernsteinian la-la
land. Quite why the other 100% of the world should care one flying
ferret about expensiveness of operations on such architectures, I'm
not entirely sure. (And I say that as someone who has not used a
mainstream architecture as his primary machine for over a decade. (7
happy years on alpha, 4 happy years on POWER. And most development in
that time done for ARM.))

Ok. If anyone else has solid information on this, I'd be interested in
hearing about it.
I'm pretty sure you disagreed with me when I made that suggestion
a while back.

Hmm. I may have disagreed that requiring plain char to be unsigned
would be a good idea for the performance reasons stated above.
I don't recall arguing that it wouldn't be *conceptually* simpler.
 
Ben Bacarisse

Keith Thompson said:
Ok. If anyone else has solid information on this, I'd be interested in
hearing about it.

K&R say this about the signed/unsigned char issue: (sorry, there will
probably be typos since this is not cut and paste)

"There is one subtle point about the conversion of characters to
integers. The language does not specify whether variables of type
char are signed or unsigned quantities. When a char is converted to
int, can it ever produce a /negative/ integer? Unfortunately, this
varies from machine to machine, reflecting differences in
architecture. On some machines (the PDP-11, for instance), a char
whose leftmost bit is 1 will be converted to a negative integer ("sign
extension"). On others, a char is promoted to an int by adding zeros
at the left end, and is thus always positive."
K&R 1st ed.
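
The behaviour K&R describe can be observed with a small sketch like the one below. The function names are hypothetical, chosen for illustration, and the comments assume an implementation where unsigned char fits in int (true on ordinary 8-bit-char systems).

```c
#include <assert.h>
#include <limits.h>

/* Nonzero if plain char behaves as a signed type on this implementation. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* The implicit char-to-int conversion. If plain char is signed and the
   top bit of c is set, the result is negative ("sign extension");
   if plain char is unsigned, the result is never negative. */
int char_to_int(char c)
{
    return c;
}

/* Going through unsigned char first always yields 0..UCHAR_MAX. */
int char_to_nonnegative_int(char c)
{
    int v = (unsigned char)c;
    assert(v >= 0 && v <= UCHAR_MAX);
    return v;
}
```

For a byte such as 0xE9, char_to_int() gives a negative number where plain char is signed and 233 where it is unsigned, while char_to_nonnegative_int() gives 233 on both (assuming 8-bit char).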

They don't explicitly say that not sign extending a char on the PDP-11
was considered too expensive but it seems to be implied by the "that's
what machines do so that's what C does" attitude. The PDP-11 had only
the one MOVB instruction, and that did sign extension.

I wonder how many later designs had this restriction. Not many, I'd
wager, but I look forward to hearing about them.

<snip>
 
BartC

Nick Keighley said:
it hardly ever seems to matter to me

People keep saying that. But the problems keep appearing too.

And if it *really* doesn't matter, then why not make char unsigned by
default? After all, nothing will break, right?

And the problem in this case was one part of a program with char codes in
the range -128 to +127, and another part of the program expecting char
codes in the range 0 to +255, a conflict that requires someone (1) be aware
of it and (2) do something about it, both being unnecessary if char codes
were always positive. It's just a source of bugs that is completely
unnecessary.
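
A typical instance of the conflict described above is a lookup table indexed by character code. The following is only a sketch; the table and function names are made up for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical classification table with one entry per byte value. */
static unsigned char is_vowel_table[UCHAR_MAX + 1];

void init_vowel_table(void)
{
    const char *vowels = "aeiouAEIOU";
    for (const char *p = vowels; *p != '\0'; p++)
        is_vowel_table[(unsigned char)*p] = 1;
}

/* Safe lookup: convert to unsigned char first so the index is in
   0..UCHAR_MAX. Writing is_vowel_table[c] directly would index out of
   bounds whenever plain char is signed and c is negative. */
int is_vowel(char c)
{
    int index = (unsigned char)c;
    assert(index >= 0 && index <= UCHAR_MAX);
    return is_vowel_table[index];
}
```

One part of the program (the table) works in 0 to UCHAR_MAX, the other (the char variable) possibly in CHAR_MIN to CHAR_MAX; the conversion in is_vowel() is the "do something about it" step.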
 
Tim Rentsch

Keith Thompson said:
BartC said:
I still don't get it.

Are you saying that while an implementation's char type can be signed or
unsigned, its isalpha() function assumes unsigned?

Yes. C99 7.4p1:

In all cases the argument is an int, the value of which shall
be representable as an unsigned char or shall equal the value
of the macro EOF. If the argument has any other value, the
behavior is undefined.
Apart from being ludicrous, how are you supposed to use it then? With an
(unsigned char) cast? Suppose the char value really is EOF?

Yes, an (unsigned char) cast is the correct way to use it.
[snip incidental]

My understanding is that a cast to (unsigned char) may not
give exactly the right behavior. In particular, if we have
a variable 'char c;' then to call a function like 'isalpha()'
we need to do

isalpha( * (unsigned char*) &c )

rather than

isalpha( (unsigned char ) c )

to get exactly the same behavior as happens for getchar(), etc.

Of course, on most current machines these two methods give
the same results, but as a matter of correctness the first
form is the better one. Right?
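
For reference, the two forms can be put side by side as below. This is a sketch with hypothetical wrapper names; the third function simply checks whether the two forms ever disagree on the implementation it is compiled for, which they are not expected to do on a two's complement machine.

```c
#include <ctype.h>
#include <limits.h>

/* Form 1: reinterpret the stored byte as unsigned char (type punning).
   Accessing an object through an unsigned char lvalue is permitted. */
int isalpha_punned(char c)
{
    return isalpha(*(unsigned char *)&c) != 0;
}

/* Form 2: convert the value to unsigned char (reduced modulo
   UCHAR_MAX + 1). */
int isalpha_converted(char c)
{
    return isalpha((unsigned char)c) != 0;
}

/* Returns 1 if both forms classify every char value identically here. */
int forms_agree_everywhere(void)
{
    for (int v = CHAR_MIN; v <= CHAR_MAX; v++) {
        char c = (char)v;
        if (isalpha_punned(c) != isalpha_converted(c))
            return 0;
    }
    return 1;
}
```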
 
Tim Rentsch

pete said:
I don't think so.
If you were to output the value of (c) using fputc,
then the return value of the function call would be
((int)(unsigned char)c).

The value of (c) "converted to unsigned char" is ((unsigned char)c).

That's right but the parameter c is of type 'int', not type
char. Where did this 'int' value come from? Presumably it
came from (among other places) an input function like 'fgetc()'
(see below).
N869
4.9.7.3 The fputc function
Synopsis
#include <stdio.h>
int fputc(int c, FILE *stream);
Description
The fputc function writes the character specified by c
(converted to an unsigned char ) ...

I no longer use N869, but looking at C99 (the wording is basically
unchanged from C90) for fgetc(), the description says fgetc() reads
the next character "as an 'unsigned char' converted to an 'int'"
(the single quotes designate C program text). Note especially
the first word there, "as".

I take this phrase to mean the character object is read _as_ an
'unsigned char' (eg, through an 'unsigned char *' pointer), rather
than being _converted_ to 'unsigned char' (eg, through a 'char *'
pointer and then casting that value to 'unsigned char').

Otherwise, this sentence would have said it reads the next
character "as a 'char' converted to an 'unsigned char' converted
to an 'int'". But it doesn't say that.

Presumably the "ctype.h" functions are meant to work correctly
on values returned by fgetc(), etc.

Also, isalpha also has to be able to work with arguments
which may be constant expressions that don't have addresses.

Given

#define NEG_A ('A' - 1 - (unsigned char)-1)

then isalpha ((unsigned char)NEG_A) should return nonzero

and putchar((unsigned char)NEG_A) should return ('A')

That's true but irrelevant to the point under discussion -- the
expression here doesn't involve characters with negative values.
The example I gave (with a variable 'char c') was just an example
for when there is a variable present. More generally, if you want a
function like 'isalpha()' but which takes a 'char' valued argument,
this can (and, I would argue, should) be done as follows

#define char_isalpha(ch) \
    isalpha( (int) (union {char c; unsigned char u;}){ (ch) }.u )

to match the description of how fgetc() works.
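
The same idea can also be written as a function instead of a macro, which avoids any macro-argument pitfalls. The sketch below is one possible shape; char_isalpha_fn and count_alpha are hypothetical names, not anything from the standard library.

```c
#include <ctype.h>
#include <stddef.h>

/* Read the char's byte as an unsigned char through a union, then
   classify it. This mirrors the char_isalpha macro in function form. */
static int char_isalpha_fn(char ch)
{
    union { char c; unsigned char u; } pun = { ch };
    return isalpha(pun.u) != 0;
}

/* Count the alphabetic characters in a NUL-terminated string. */
size_t count_alpha(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (char_isalpha_fn(*s))
            n++;
    return n;
}
```

Called as count_alpha("ab1 c!"), this counts the three letters and never hands isalpha() a negative argument, whatever the signedness of plain char.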
 
Tim Rentsch

pete said:
Tim said:
pete said:
Tim Rentsch wrote:

[snip]

If you're going to say for {char c;}

isalpha( * (unsigned char*) &c )

rather than

isalpha( (unsigned char ) c )

then I think we should consider a case where
( * (unsigned char*) &c )
is different from
( (unsigned char ) c )
and the only case I can think of,
is where (c) is signed, negative and not two's complement.

Right (ignoring padding bits in char, which no sane
implementation has), although actually there is one other
case, namely a two's complement trap representation. Not
very likely perhaps but it is allowed, so for the sake of
completeness I think it should be mentioned. (And now it
has been...)
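
Since machines that are not two's complement are hard to come by, the difference can only be illustrated arithmetically. The sketch below simulates, for an assumed 8-bit char and values in -127..127, the bit pattern each representation would store, next to the result of the conversion (unsigned char)v. All function names are hypothetical.

```c
/* Bit pattern (as 0..255) that an 8-bit signed char holding v would
   have under each representation. Valid for v in -127..127. */
unsigned twos_complement_bits(int v)
{
    return (unsigned)v & 0xFFu;
}

unsigned ones_complement_bits(int v)
{
    return v >= 0 ? (unsigned)v : (~(unsigned)(-v)) & 0xFFu;
}

unsigned sign_magnitude_bits(int v)
{
    return v >= 0 ? (unsigned)v : (0x80u | (unsigned)(-v));
}

/* Result of the conversion (unsigned char)v for an 8-bit char:
   v reduced modulo 256, whatever the representation. */
unsigned converted_value(int v)
{
    return (unsigned)v & 0xFFu;
}
```

For v = -1 the conversion yields 255 in every case, while the stored byte, which is what the pointer-cast form reads, would be 255, 254 and 129 respectively. For non-negative v all four agree.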
 
Tim Rentsch

pete said:
Tim said:
pete said:
Tim Rentsch wrote:

[snip]

Considering that file operations are at the root of this thread:

If char is signed, and char c is assigned a value of (-1), as in
char c;
c = -1;

and if (c) is written to a file using fputc,
then the value written into the file will be ((unsigned char)-1).

If that same byte is read back from the same file using fgetc,
then the value returned by fgetc will be
((int)(unsigned char)-1)
which is equal to ((unsigned char)c)
regardless of the representation of negative integers.

The return value of fgetc will be unequal to
( * (unsigned char*) &c )
when negative integers are either sign and magnitude
or one's complement.

That's right, of course. Which form should be used depends
on where the bytes came from in the first place -- if they
had been written with fwrite(), for example, then the type
punning form, rather than the ordinary converting form, is
probably a better choice. And conversely.

I suspect this is all a moot point anyway, since the values
that might be affected most likely will all have the same
characteristics vis-a-vis the <ctype.h> functions. So I
will just say thank you for an interesting discussion.
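
The fputc()/fgetc() round trip described above can be tried with a sketch like the following. The function name round_trip is made up for illustration, and the temporary file comes from the standard tmpfile().

```c
#include <stdio.h>

/* Write c with fputc() to a temporary file, read it back with fgetc(),
   and return what fgetc() returned (or EOF on any I/O failure). */
int round_trip(char c)
{
    FILE *fp = tmpfile();
    if (fp == NULL)
        return EOF;

    int result = EOF;
    if (fputc(c, fp) != EOF) {
        rewind(fp);             /* required between writing and reading */
        result = fgetc(fp);
    }
    fclose(fp);
    return result;
}
```

With char c = -1 on a signed-char implementation, round_trip(c) is expected to return (int)(unsigned char)-1, which is 255 for 8-bit char, and not EOF, provided the temporary file could be created.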
 
Tim Rentsch

pete said:
Tim said:
pete said:
Tim Rentsch wrote:

[snip]

As far as the ctype functions go,
if you want to use isalpha to give you some idea of
whether or not putchar(c) is going to output an alpha,
then isalpha((unsigned char)c) is the way to do it,
because the value returned by putchar(c)
compares equal to ((unsigned char)c).

I don't find this argument persuasive, because in essence it's
no more than the tautology that 'isalpha( (unsigned char) c )'
is equal to 'isalpha( (unsigned char) c )'. It is just as true
that 'isalpha( *(unsigned char*) &c )' will be equal to
'isalpha( *(unsigned char*) &c )' (and similarly for putting
the value through the fputc()/fgetc() step). That doesn't prove
anything.

However, if we look at character/string constants, there is
some support for the type-punning view. For example, a
character constant '\xFF' is one whose unsigned char
representation has all bits set (under CHAR_BIT==8); that is,
it has different values (when 'char' is signed) depending
on whether two's complement, ones' complement, or sign-and-
magnitude is used to represent signed integers. The string
comparison functions in <string.h> likewise use the unsigned
char representation rather than converting to unsigned char.
So it seems more consistent, considering those aspects, that
the <ctype.h> functions are expecting an unsigned char
representation, not just a conversion to unsigned char.

And again, practically speaking, it's highly unlikely that
there will be any difference, not counting the unusual case
of char having a trap representation.
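
The point about the <string.h> comparison functions can be seen in a small sketch. The helper names below are hypothetical; strcmp() itself is the standard function, which orders characters by their values interpreted as unsigned char.

```c
#include <string.h>

/* Compare two single characters the way strcmp() orders them:
   by their values interpreted as unsigned char. */
int compare_as_strcmp(char a, char b)
{
    char sa[2] = { a, '\0' };
    char sb[2] = { b, '\0' };
    return strcmp(sa, sb);
}

/* Compare the same two characters as plain char values.
   Returns -1, 0 or 1. */
int compare_as_plain_char(char a, char b)
{
    return (a > b) - (a < b);
}
```

For '\xFF' against 'A', compare_as_strcmp() reports '\xFF' as the larger, while compare_as_plain_char() reports it as the smaller wherever plain char is signed.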
 
Tim Rentsch

pete said:
Tim said:
pete said:
Tim Rentsch wrote:

[snip]

isalpha((unsigned char)c) will tell you
if the output of putchar(c) is an alpha
because putchar(c) is equal to putchar((unsigned char)c).

Sorry, I misunderstood your point before. However, my basic
point still stands - depending on circumstances, the type punning
form might be a better choice than conversion, or vice versa. We
don't necessarily want to use 'putchar(c)' if c is a char, since
(for example) if char uses a ones' complement representation then
plus zero and minus zero will both end up just being zero; using
'putchar( *(unsigned char*)&c )' might be a better choice in such
cases.
 
