What's the deal with the "toupper" family?

  • Thread starter Frederick Gotham

Frederick Gotham

The "toupper" function takes an int as an argument. That's not too
irrational given that a character literal is of type "int" in C.
(Although why it isn't of type "char" escapes me... )

The "toupper" function imposes a further constraint in that the value
passed to it must be representable as an unsigned char. (If C does not
require all character values to be positive, then again, this constraint
too escapes me... )

Let's say we have the following hypothetical system:

char is signed.

UCHAR_MAX == 255
SCHAR_MAX == 127
CHAR_MAX == 127

INT_MAX == 65535


We are able to represent all the characters of ASCII using positive
numbers, but anything beyond that would require negative numbers on this
system.

So what's the deal with using toupper on these extraneous characters
whose numeric value is negative?

Let's say we have a German sharp S, or a Spanish N with a curly thing on
top of it, and that its numeric value is negative. How do we go about
passing their value to toupper? Should we do the following?

toupper( (unsigned char)c );



(One more thing. If you have a signed integer value, and you cast it to
its corresponding unsigned integer type, and then back to the signed
type, are you guaranteed to have the same value? i.e.:

signed char s = -5;

unsigned char us = s;

s = us;

assert( -5 == s ); /* Is this guaranteed? */
 

Eric Sosman

Frederick said:
The "toupper" function takes an int as an argument. That's not too
irrational given that a character literal is of type "int" in C.
(Although why it isn't of type "char" escapes me... )

The "toupper" function imposes a further constraint in that the value
passed to it must be representable as an unsigned char. (If C does not
require all character values to be positive, then again, this constraint
too escapes me... )

Back in the Dawn of C (well, the Early Morning), the
<ctype.h> functions were defined to operate on all the values
returned by getchar(), getc(), and fgetc(). These functions
need to be able to return any legitimate character code plus
a code unlike all characters to indicate an input failure.
The scheme adopted for the input functions was that they would
return a non-negative int to represent an actual character code
or a negative int to represent input failure. The <ctype.h>
functions thus inherited their oddities from the I/O functions'
practice of returning "special values" in place of "real data."

If one were designing the C library today, I doubt these
decisions would be made in the same way. getchar() et al. are
already in trouble on systems where sizeof(int)==1, because there
is no "space" for a distinguished non-character EOF value. If
getchar() returns EOF, it could actually be "real data:" you
cannot tell from the returned value alone, but must consult the
feof() and ferror() functions.
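
A minimal sketch of that check, assuming a hypothetical helper named read_char (not part of the standard library). When fgetc() returns EOF, it consults feof() and ferror() before treating the value as end of input. On mainstream hosted implementations, where int is wider than char, the second half of the test never changes the outcome.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: reads one character from fp into *out.
   Returns 1 if a character was read, 0 on end-of-file or read error. */
int read_char(FILE *fp, int *out)
{
    int c;

    assert(fp != NULL && out != NULL);

    c = fgetc(fp);
    if (c == EOF && (feof(fp) || ferror(fp)))
        return 0;   /* genuine end-of-file or error */

    *out = c;       /* real data, even if it compares equal to EOF */
    return 1;
}
```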

Even if the "in-band" signalling by the I/O functions were
retained, I doubt that newly-designed <ctype.h> functions would
be defined on the entire range of values getchar() can return.
Rather, they would be defined for all possible char values and
would make no special provision for EOF. Then we'd need none
of this silly casting when applying the <ctype.h> functions to
characters taken from a string.

However, that particular horse left the barn long ago.
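
The two situations look like this in practice. This is only a sketch, and the function names upcase_stream and count_upper are made up for illustration: a value that comes straight from getchar() is already in the domain of the <ctype.h> functions, while a plain char taken from a string is converted to unsigned char first.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Copies stdin to stdout in upper case. getchar() returns either EOF or an
   unsigned char value converted to int, so c can be passed on unchanged. */
void upcase_stream(void)
{
    int c;   /* int rather than char, so EOF stays distinguishable */

    while ((c = getchar()) != EOF)
        putchar(toupper(c));
}

/* Counts the upper-case letters in a string. Plain char may be signed, so
   each element is converted to unsigned char before calling isupper(). */
size_t count_upper(const char *s)
{
    size_t n = 0;

    assert(s != NULL);

    for (; *s != '\0'; s++)
        if (isupper((unsigned char)*s))
            n++;

    return n;
}
```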
Let's say we have the following hypothetical system:

char is signed.

UCHAR_MAX == 255
SCHAR_MAX == 127
CHAR_MAX == 127

INT_MAX == 65535

We are able to represent all the characters of ASCII using positive
numbers, but anything beyond that would require negative numbers on this
system.

Character codes 128 through 255 would not be representable
as char, but they would be representable as unsigned char or as
int.
So what's the deal with using toupper on these extraneous characters
whose numeric value is negative?

As above: The argument to a <ctype.h> function must be either
the negative value EOF or else a character code represented as
an unsigned char value.
Let's say we have a German sharp S, or a Spanish N with a curly thing on
top of it, and that its numeric value is negative. How do we go about
passing their value to toupper? Should we do the following?

toupper( (unsigned char)c );

Yes.
(One more thing. If you have a signed integer value, and you cast it to
its corresponding unsigned integer type, and then back to the signed
type, are you guaranteed to have the same value? i.e.:
>
> signed char s = -5;
> unsigned char us = s;

No problem yet: us has the value UCHAR_MAX-4 (251, for
an eight-bit character).
> s = us;

Trouble in River City. The value of us is out of range
for a signed char, so you get either (1) an implementation-
defined result stored in s, or (2) an implementation-defined
signal is raised. (This is not undefined behavior, technically
speaking, but it might as well be. If a signal is raised, there
is no way to handle that signal and continue without invoking
undefined behavior. The distinction is somewhat like observing
that you will not be harmed by a fall from a hundred-story tower
but only by the sudden stop at the end.)

On most implementations nowadays, alternative (1) is taken
and the implementation-defined result happens to be equal to the
value s had before conversion to unsigned char. This is not an
outcome guaranteed by the language itself, though.
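
A small sketch of how one might probe this on a given implementation. The function name survives_round_trip is hypothetical. Only the conversion to unsigned char and the non-negative case are guaranteed by the language. For negative inputs the function merely reports what this particular implementation does, assuming it takes alternative (1) and does not raise a signal.

```c
#include <limits.h>

/* Returns 1 if s has the same value after signed char -> unsigned char ->
   signed char on this implementation, 0 otherwise. */
int survives_round_trip(signed char s)
{
    unsigned char us = (unsigned char)s;   /* always well defined */
    signed char back;

    if (us <= SCHAR_MAX)
        return 1;   /* value fits in signed char: conversion leaves it unchanged */

    back = (signed char)us;   /* implementation-defined result, or a signal */
    return back == s;
}
```

Calling survives_round_trip(-5) returns 1 on the usual two's complement implementations, but a program that depends on that result is relying on implementation-defined behaviour.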
 

SM Ryan

# We are able to represent all the characters of ASCII using positive
# numbers, but anything beyond that would require negative numbers on this
# system.

Beyond ASCII, there are many different encodings.

# Let's say we have a German sharp S, or a Spanish N with a curly thing on
# top of it, and that its numeric value is negative. How do we go about
# passing their value to toupper? Should we do the following?

Don't depend on the encoding of non-ASCII characters. Instead you can
use wide characters (wchar_t) and functions like towupper.
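
A sketch of the wide-character route, with a hypothetical function name upcase_wide. The mapping that towupper() applies depends on the current locale, so a program would typically call setlocale(LC_ALL, "") first. Which non-ASCII letters are actually converted is up to the implementation and locale, and a character with no single-character upper-case form is returned unchanged.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/* Upper-cases a wide string in place using the current locale's rules. */
void upcase_wide(wchar_t *ws)
{
    assert(ws != NULL);

    for (; *ws != L'\0'; ws++)
        *ws = (wchar_t)towupper((wint_t)*ws);
}
```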
 

Ben Pfaff

Frederick Gotham said:
Let's say we have a German sharp S, or a Spanish N with a curly thing on
top of it, and that its numeric value is negative. How do we go about
passing their value to toupper? Should we do the following?

toupper( (unsigned char)c );

Yes. That's the usual thing to do.
(One more thing. If you have a signed integer value, and you cast it to
its corresponding unsigned integer type, and then back to the signed
type, are you guaranteed to have the same value?

No. The behavior is essentially undefined:

6.3.1.3 Signed and unsigned integers

1 When a value with integer type is converted to another integer
type other than _Bool, if the value can be represented by
the new type, it is unchanged.

2 Otherwise, if the new type is unsigned, the value is converted
by repeatedly adding or subtracting one more than the
maximum value that can be represented in the new type until
the value is in the range of the new type.

3 Otherwise, the new type is signed and the value cannot be
represented in it; either the result is
implementation-defined or an implementation-defined signal
is raised.
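
Paragraph 2 can be followed literally in code. The sketch below uses a hypothetical function wrap_to_uchar and assumes UCHAR_MAX is smaller than LONG_MAX, which holds on ordinary hosted systems. It adds or subtracts UCHAR_MAX + 1 until the value is in range, which is what a cast to unsigned char is required to produce.

```c
#include <assert.h>
#include <limits.h>

/* Applies the rule of 6.3.1.3 paragraph 2 by hand for unsigned char:
   add or subtract (UCHAR_MAX + 1) until the value is in range. */
unsigned char wrap_to_uchar(long value)
{
    const long modulus = (long)UCHAR_MAX + 1;

    while (value < 0)
        value += modulus;
    while (value > (long)UCHAR_MAX)
        value -= modulus;

    assert(value >= 0 && value <= (long)UCHAR_MAX);
    return (unsigned char)value;   /* in range now, so the value is unchanged */
}
```

With an eight-bit char, wrap_to_uchar(-5) is -5 + 256, which is 251. Paragraph 3 has no such portable recipe, since the result is left to the implementation.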
 

Jack Klein

The "toupper" function takes an int as an argument. That's not too
irrational given that a character literal is of type "int" in C.
(Although why it isn't of type "char" escapes me... )

Obviously you lack an understanding of K&R C, not to mention BCPL and
B.
The "toupper" function imposes a further constraint in that the value
passed to it must be representable as an unsigned char. (If C does not
require all character values to be positive, then again, this constraint
too escapes me... )

What does not escape you? All of the to... and is... functions
defined in <ctype.h> work perfectly with the int value returned by
getchar(), which returns valid characters in the range of
0...UCHAR_MAX, plus EOF which is guaranteed not to be in that range.
Let's say we have the following hypothetical system:

char is signed.

UCHAR_MAX == 255
SCHAR_MAX == 127
CHAR_MAX == 127

INT_MAX == 65535


We are able to represent all the characters of ASCII using positive
numbers, but anything beyond that would require negative numbers on this
system.

So what's the deal with using toupper on these extraneous characters
whose numeric value is negative?

"The deal" is undefined behavior.
Let's say we have a German sharp S, or a Spanish N with a curly thing on
top of it, and that its numeric value is negative. How do we go about
passing their value to toupper? Should we do the following?

toupper( (unsigned char)c );



(One more thing. If you have a signed integer value, and you cast it to
its corresponding unsigned integer type, and then back to the signed
type, are you guaranteed to have the same value? i.e.:
No.

signed char s = -5;

unsigned char us = s;

s = us;

assert( -5 == s ); /* Is this guaranteed? */

Again, not. Given your assumption that the implementation has
UCHAR_MAX 255 and CHAR_MAX 127, assigning a value of -5 to an unsigned
char is well defined, and results in an unsigned char with the
value 251. Assigning the value 251 to a signed char, a value outside
its range, results in either an implementation-defined result or an
implementation-defined signal being raised.
 

Walter Roberson

On Wed, 05 Jul 2006 14:24:44 GMT, Frederick Gotham
What does not escape you? All of the to... and is... functions
defined in <ctype.h> work perfectly with the int value returned by
getchar(), which returns valid characters in the range of
0...UCHAR_MAX, plus EOF which is guaranteed not to be in that range.

Not according to C89. According to C89, getchar() is equivalent
[but possibly a macro] to fgetc(stdin), and fgetc() is defined as
returning "an unsigned char converted to an int". In implementations
in which UCHAR_MAX exceeds INT_MAX [e.g., sizeof(char) == sizeof(int),
in which case UCHAR_MAX may be UINT_MAX > INT_MAX]
then the conversion of values in the range INT_MAX+1 to UCHAR_MAX
has implementation defined results that are -not- guaranteed
to be in the range of 0..UCHAR_MAX.

C89 does NOT define fgetc() [and transitively, getchar()] such that
returning a negative value indicates EOF or an error. C89 defines
fgetc() as returning the specific value EOF upon EOF or error,
and defines EOF only as "a negative integral constant". As long as
the value EOF is not one of the values that can be returned for valid
characters, getchar() is free to return negative values.


For example, an implementation might choose to include keycode
modifiers such as LEFT_ALT LEFT_CONTROL RIGHT_ALT RIGHT_CONTROL
CAPS_LOCK NUM_LOCK KEY_DOWN KEY_UP for characters from some sources.
In this example, on a system with 16 bit ints, all 8 of these
flag bits might be set, and keys such as F12 could generate basic
values in the 128..255 range. The composite result could be
something greater than INT_MAX, and the implementation behaviour
in converting the value to an int might be to just copy the bits
and let the value be reinterpreted as 2's complement, leading to
negative values. The implementation could know, however, that
there is no key whose basic value is 255, and so could set EOF as
LEFT_ALT|LEFT_CONTROL|RIGHT_ALT|RIGHT_CONTROL|CAPS_LOCK|NUM_LOCK|
KEY_DOWN|KEY_UP|255
which in this hypothetical arrangement would happen to come out,
after interpretation as a signed 2s complement integer, as -1 .
EOF would be negative, would not represent any possible character
in the hypothetical system, but there would be valid negative values.
 

Ben Pfaff

On Wed, 05 Jul 2006 14:24:44 GMT, Frederick Gotham
What does not escape you? All of the to... and is... functions
defined in <ctype.h> work perfectly with the int value returned by
getchar(), which returns valid characters in the range of
0...UCHAR_MAX, plus EOF which is guaranteed not to be in that range.

Not according to C89. According to C89, getchar() is equivalent
[but possibly a macro] to fgetc(stdin), and fgetc() is defined as
returning "an unsigned char converted to an int". In implementations
in which UCHAR_MAX exceeds INT_MAX [e.g., sizeof(char) == sizeof(int),
in which case UCHAR_MAX may be UINT_MAX > INT_MAX]
then the conversion of values in the range INT_MAX+1 to UCHAR_MAX
has implementation defined results that are -not- guaranteed
to be in the range of 0..UCHAR_MAX.

Jack and many of the other posters here are well aware of this.
However, in previous discussions, we've been unable to locate a
hosted implementation that meets these criteria. Some
freestanding ones are known to exist, if I recall correctly, but
freestanding implementations do not include the standard I/O
library.
 

Keith Thompson

Frederick Gotham said:
The "toupper" function takes an int as an argument. That's not too
irrational given that a character literal is of type "int" in C.
(Although why it isn't of type "char" escapes me... )

In K&R C, it wasn't possible for a function to have an argument of
type char. Even in modern C, expressions of type char and short are
promoted to int.
 

Peter Nilsson

Frederick said:
The "toupper" function takes an int as an argument. That's not too
irrational given that a character literal is of type "int" in C.

Not necessarily. Even if é is a member of the execution character set,
the character constant 'é' needn't be a positive value (in the range of
unsigned char.)
(Although why it isn't of type "char" escapes me... )

Covered elsethread by others.
The "toupper" function imposes a further constraint in that the value
passed to it must be representable as an unsigned char. (If C does not
require all character values to be positive,

It requires the execution character set character codings have
non-negative values. Whether those codings are represented as
non-negative values in (plain) char is another matter.
then again, this constraint too escapes me... )

Technically, it's not a constraint. It's a prerequisite for the
standard implementation of toupper.
Let's say we have the following hypothetical system:

char is signed.

UCHAR_MAX == 255
SCHAR_MAX == 127
CHAR_MAX == 127

INT_MAX == 65535

We are able to represent all the characters of ASCII using positive
numbers, but anything beyond that would require negative numbers on this
system.

As a plain char value yes, however most programs receive input as
though fgetc is storing an unsigned char into char storage.
So what's the deal with using toupper on these extraneous characters
whose numeric value is negative?

It's up to the programmer to supply the correct character code value.
Let's say we have a German sharp S, or a Spanish N with a curly thing
on top of it,
[Tilde.]

and that its numeric value is negative. How do we go about
passing their value to toupper? Should we do the following?

toupper( (unsigned char)c );

That's the clc regular's method. To me, it generally makes more
sense to do...

toupper( * (unsigned char) &c )

...when c is a plain char.

Even on a two's complement system, there is no guarantee that
the cast conversion of a plain char value will yield the original
unsigned char value of the character code.

The following is unlikely (due to QoI), but nonetheless allowed...

UCHAR_MAX: 65535
SCHAR_MAX: 127
SCHAR_MIN: -128
CHAR_MAX: 127
 

Jack Klein

On Wed, 05 Jul 2006 14:24:44 GMT, Frederick Gotham
What does not escape you? All of the to... and is... functions
defined in <ctype.h> work perfectly with the int value returned by
getchar(), which returns valid characters in the range of
0...UCHAR_MAX, plus EOF which is guaranteed not to be in that range.

Not according to C89. According to C89, getchar() is equivalent
[but possibly a macro] to fgetc(stdin), and fgetc() is defined as
returning "an unsigned char converted to an int". In implementations
in which UCHAR_MAX exceeds INT_MAX [e.g., sizeof(char) == sizeof(int),
in which case UCHAR_MAX may be UINT_MAX > INT_MAX]
then the conversion of values in the range INT_MAX+1 to UCHAR_MAX
has implementation defined results that are -not- guaranteed
to be in the range of 0..UCHAR_MAX.

C89 does NOT define fgetc() [and transitively, getchar()] such that
returning a negative value indicates EOF or an error. C89 defines
fgetc() as returning the specific value EOF upon EOF or error,
and defines EOF only as "a negative integral constant". As long as
the value EOF is not one of the values that can be returned for valid
characters, getchar() is free to return negative values.


For example, an implementation might choose to include keycode
modifiers such as LEFT_ALT LEFT_CONTROL RIGHT_ALT RIGHT_CONTROL
CAPS_LOCK NUM_LOCK KEY_DOWN KEY_UP for characters from some sources.
In this example, on a system with 16 bit ints, all 8 of these
flag bits might be set, and keys such as F12 could generate basic
values in the 128..255 range. The composite result could be
something greater than INT_MAX, and the implementation behaviour
in converting the value to an int might be to just copy the bits
and let the value be reinterpreted as 2's complement, leading to
negative values. The implementation could know, however, that
there is no key whose basic value is 255, and so could set EOF as
LEFT_ALT|LEFT_CONTROL|RIGHT_ALT|RIGHT_CONTROL|CAPS_LOCK|NUM_LOCK|
KEY_DOWN|KEY_UP|255
which in this hypothetical arrangement would happen to come out,
after interpretation as a signed 2s complement integer, as -1 .
EOF would be negative, would not represent any possible character
in the hypothetical system, but there would be valid negative values.

As Ben mentioned, it is literally impossible to have a conforming
hosted implementation where INT_MAX < UCHAR_MAX. There can be, and
are, more-or-less conforming implementations where UINT_MAX ==
UCHAR_MAX and therefore UCHAR_MAX > INT_MAX, and I have worked on some
of them.

In fact it is impossible for a conforming getchar() (and related
functions) to exist on a platform where INT_MAX is not at least equal
to UCHAR_MAX. getchar() and its ilk must be able to return UCHAR_MAX
+ 1 distinct values, since each and every value in the range
0...UCHAR_MAX can be read from a stream, and EOF must be
distinguishable from all.

You may ask why I say EOF must be distinguishable from all values in the
range 0 to UCHAR_MAX, and therefore cannot have the same
representation in an int as any of these values.

C99: paragraph 9 of 7.19.1 requires that the macro EOF "expands to an
integer constant expression, with type int and a negative value, that
is returned by several functions to indicate end-of-file, that is, no
more input from a stream".

C90: no paragraph numbers, but the corresponding section of 7.9.1
has identical wording.

No function defined to return EOF on end-of-file (or error) may return
this value unless it detects end-of-file or an error.

Any implementation where UCHAR_MAX > INT_MAX must be a free-standing
implementation. Free-standing implementations are not required to
provide either <stdio.h> or <ctype.h>, so there is no point in arguing
on how such features interact on such a platform.
 

Walter Roberson

Jack Klein said:
In fact it is impossible for a conforming getchar() (and related
functions) to exist on a platform where INT_MAX is not at least equal
to UCHAR_MAX. getchar() and its ilk must be able to return UCHAR_MAX
+ 1 distinct values, since each and every value in the range
0...UCHAR_MAX can be read from a stream, and EOF must be
distinguishable from all.

Why must every value in the range 0...UCHAR_MAX be readable from
a stream?
You may ask why I say EOF must be distinguishable from all values in the
range 0 to UCHAR_MAX,

No, I don't ask that: in my posting I specifically proposed an EOF
distinct from any value that could be reached in the hypothetical system.

What is a stream, that every value 0...UCHAR_MAX must be readable
from it? For example, an implementation could be such that data
read from a file or pipe or socket is returned 8 bits at a time, but that
data read from a console might be augmented with keycode modifiers.

I don't have my standard at home with me: does the standard promise
that all possible values 0 to UCHAR_MAX must be writable to a binary
stream? (If it does so guarantee, then the standard does indicate
that it must be possible to read them back unchanged, except perhaps
trailing nulls.) Does the standard promise that all values
0 to UCHAR_MAX must be ungetc()-able?
 

Frederick Gotham

Peter Nilsson posted:

The following is unlikely (due to QoI), but nonetheless allowed...

UCHAR_MAX: 65535


This suggests that an unsigned char has 16 value representation bits, and an
unknown quantity of padding bits.

SCHAR_MAX: 127
SCHAR_MIN: -128
CHAR_MAX: 127


This suggests that a signed char has 8 value representation bits (inclusive
of the sign bit), and at least 8 padding bits, in order to satisfy:


assert( sizeof(signed char) == sizeof(unsigned char) );
 

Frederick Gotham

Peter Nilsson posted:

toupper( *(unsigned char const *)&c )


Does anyone else agree with this?

It's safe because an unsigned char cannot have any trap representations,
but nonetheless, does it do what we want it to do, and is it preferable
over the following?

toupper( (unsigned char)c );
 

Ben Pfaff

Why must every value in the range 0...UCHAR_MAX be readable from
a stream?

For binary streams there is a guarantee (C99 7.19.2):

3 A binary stream is an ordered sequence of characters that can
transparently record internal data. Data read in from a
binary stream shall compare equal to the data that were
earlier written out to that stream, under the same
implementation. Such a stream may, however, have an
implementation-defined number of null characters appended to
the end of the stream.

For text streams there is no such guarantee.
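
The binary-stream guarantee can be exercised with a short test. This is a sketch only: binary_round_trip is a hypothetical name, and the loop assumes UCHAR_MAX is smaller than UINT_MAX, as on ordinary hosted systems. It writes every value from 0 to UCHAR_MAX to a temporary file, which tmpfile() opens in binary update mode, and reads them back with fgetc().

```c
#include <limits.h>
#include <stdio.h>

/* Writes every value 0..UCHAR_MAX to a temporary binary stream and reads
   them back. Returns 1 if all values match, 0 on a mismatch, and -1 if the
   temporary file could not be created or written. */
int binary_round_trip(void)
{
    FILE *fp = tmpfile();   /* opened as if with mode "wb+" */
    unsigned int v;
    int ok = 1;

    if (fp == NULL)
        return -1;

    for (v = 0; v <= UCHAR_MAX; v++) {
        if (fputc((int)v, fp) == EOF) {
            fclose(fp);
            return -1;
        }
    }

    rewind(fp);

    for (v = 0; v <= UCHAR_MAX && ok; v++) {
        int c = fgetc(fp);
        if (c == EOF || (unsigned int)c != v)
            ok = 0;
    }

    fclose(fp);
    return ok;
}
```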
 

Andrew Poelstra

Peter Nilsson posted:




Does anyone else agree with this?

It looks overly complicated to me.
It's safe because an unsigned char cannot have any trap representations,
but nonetheless, does it do what we want it to do, and is it preferable
over the following?

toupper( (unsigned char)c );

No; the latter is much clearer and just as functional, IMHO.
 

Mike S

Peter said:
Frederick said:
Let's say we have a German sharp S, or a Spanish N with a curly thing
on top of it,
[Tilde.]

and that its numeric value is negative. How do we go about
passing their value to toupper? Should we do the following?

toupper( (unsigned char)c );

That's the clc regular's method. To me, it generally makes more
sense to do...

toupper( * (unsigned char) &c )

...when c is a plain char.

ITYM:

toupper( *(unsigned char *) &c)

OK, it's late and I might be missing something here, but aren't the
expressions

(unsigned char) c

and

*(unsigned char*) &c

semantically equivalent? Or is there a chance that they might evaluate
to a different result or produce different side effects along the way
to the result which somehow makes the second expression even more
reliable as a parameter to the to*() and is*() functions than the first
(seemingly more popular) expression? At the moment, the two seem
perfectly interchangeable to me, so I don't see much reason for
choosing the second over the first, especially since the first is
clearer.
 

Richard Heathfield

Mike S said:

OK, it's late and I might be missing something here, but aren't the
expressions

(unsigned char) c

and

*(unsigned char*) &c

semantically equivalent?
No.

Or is there a chance that they might evaluate to a different result

Very much so.

int c = getchar(); /* let's say we get an 'A' from getchar(), and let's
assume we're using some completely arbitrary and whacko character set such
as, say, ASCII. */

c now has the value 65, right? (Remember, we're assuming ASCII for the sake
of this exercise.) Okay, so (unsigned char)c gets you 65, which is fine.

But let's take a closer look at this int. If ints are 16 bits, we have two
choices for in-memory representation: 0x0041, or 0x4100. If ints are 32
bits, we have rather more choices, but the two most likely are 0x00000041
and 0x41000000. If ints are 64 bits, we are probably going to have either
0x0000000000000041 or 0x4100000000000000. Other endianisms are possible,
but we don't need to go there to demonstrate that *(unsigned char *)&c is
wrong. I hope you can see the problem straight away. On any big-endian
system where sizeof(int) > 1, this code is going to produce the wrong
result. Specifically, it will normally produce 0 instead of the required
result.

So Peter's idea is fatally flawed. And yet it probably works fine for him,
because he's probably using it on a little-endian system. So it's just
sitting there waiting to bite him (or his maintainers) at porting time.
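
The difference for an int object can be seen with two tiny functions. The names by_cast and by_first_byte are made up for this sketch. The cast converts the value, while the pointer form reads whichever byte of the int happens to be stored first.

```c
/* Value conversion: reduces c modulo UCHAR_MAX + 1, whatever the layout. */
unsigned char by_cast(int c)
{
    return (unsigned char)c;
}

/* Reinterpretation: returns the byte at the lowest address of the int. */
unsigned char by_first_byte(int c)
{
    return *(unsigned char *)&c;
}
```

For c equal to 65, by_cast() yields 65 everywhere. by_first_byte() yields 65 on a little-endian machine and, where sizeof(int) > 1, normally 0 on a big-endian one.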
 

Peter Nilsson

Frederick said:
Peter Nilsson posted:

This suggests that an unsigned char has 16 value representation bits, and an
unknown quantity of padding bits.


This suggests that a signed char has 8 value representation bits (inclusive
of the sign bit), and at least 8 padding bits,
Yes.

in order to satisfy:

assert( sizeof(signed char) == sizeof(unsigned char) );

That is always satisfied on a conforming implementation, but yes.
 

Peter Nilsson

Richard said:
Mike S said:

Very much so.

int c = getchar(); ...
... Peter's idea is fatally flawed.

<sigh>

Consider...

char line[256];
size_t i;
if (fgets(line, sizeof line, stdin))
{
    for (i = 0; line[i] != 0; i++)
    {
        line[i] = toupper((unsigned char) line[i]); /* v1 */
        line[i] = toupper(* (unsigned char *) &line[i]); /* v2 */
    }
    ...
}

On an implementation satisfying...

UCHAR_MAX: 65535
SCHAR_MAX: 127
SCHAR_MIN: -128
CHAR_MAX: 127

...v1 can fail, v2 succeeds.
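
For comparison, here are the two variants written as separate functions. The names upcase_v1 and upcase_v2 are hypothetical. On common implementations with eight-bit, two's complement char the two produce identical results. They can only differ on unusual implementations such as the one described above.

```c
#include <assert.h>
#include <ctype.h>

/* v1: convert each plain char value to unsigned char before toupper(). */
void upcase_v1(char *s)
{
    assert(s != 0);

    for (; *s != '\0'; s++)
        *s = (char)toupper((unsigned char)*s);
}

/* v2: read and write the same bytes through an unsigned char pointer. */
void upcase_v2(char *s)
{
    unsigned char *us = (unsigned char *)s;

    assert(s != 0);

    for (; *us != '\0'; us++)
        *us = (unsigned char)toupper(*us);
}
```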
 

Peter Nilsson

Andrew said:
It looks overly complicated to me.

In normal form, I use things like...

unsigned char *us = (unsigned char *) s;
for (; *us; us++) *us = toupper(*us);

If that's too complicated for some people, so be it.

As I said, it's up to the programmer to pass the right value.
Different circumstances may well require different forms.
Where and how you source and store the character is a
factor in deciding which method you use.
No; the latter is much clearer and just as functional, IMHO.

But fails for potentially conforming implementations. To many people,
that's acceptable.
 
