What's the deal with the "toupper" family?

  • Thread starter Frederick Gotham

Richard Heathfield

Peter Nilsson said:
Richard Heathfield wrote:
... Peter's idea is fatally flawed.

<sigh>

Consider...

char line[256];
size_t i;
if (fgets(line, sizeof line, stdin))
{
    for (i = 0; line[i] != 0; i++)
    {
        line[i] = toupper((unsigned char) line[i]);      /* v1 */
        line[i] = toupper(* (unsigned char *) &line[i]); /* v2 */
    }
    ...
}

On an implementation satisfying...

UCHAR_MAX: 65535
SCHAR_MAX: 127
SCHAR_MIN: -128
CHAR_MAX: 127

...v1 can fail, v2 succeeds.



Even if such an implementation is conforming (about which I have serious
doubts, but I'm not going to press the point right now), it would be
extraordinarily rare. I have already posted code which shows how your
technique fails on big-endian systems with perfectly ordinary char ranges,
and such systems are far more common (e.g. IBM 370, 68000, most RISCs) than
an architecture that has 8 padding bits in every char!

Therefore, your technique is not safe for general use, and I cannot
recommend it.
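For reference, the conventional idiom being defended here (v1) can be sketched as a small helper. The function name upcase_in_place is hypothetical and not from the thread; the sketch assumes an ordinary implementation where converting a char to unsigned char by value recovers the character code that was read.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not from the thread): upper-case a string in place.
   Each plain char is converted by value to unsigned char before the call,
   because toupper() requires an argument that is representable as
   unsigned char (or is EOF). */
void upcase_in_place(char *s)
{
    size_t i;
    for (i = 0; s[i] != '\0'; i++)
    {
        s[i] = (char) toupper((unsigned char) s[i]);
    }
}
```

Calling upcase_in_place on a modifiable buffer such as one filled by fgets() rewrites it in place. It must not be called on a string literal.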
 

Andrew Poelstra

In normal form, I use things like...

const unsigned char *us = (const unsigned char *) s;
for (; *us; us++) *us = toupper(*us);

No matter what you think `const' means in this context, it's wrong. You
change both `us' /and/ `*us' in the second line.
If that's too complicated for some people, so be it.

Most simple-minded people believe that the const keyword will create a
constant. It's true that we find it `too complicated' to violate that.
As I said, it's up to the programmer to pass the right value.
Different circumstances may well require different forms.
Where and how you source and store the character is a
factor in deciding which method you use.

The point of the cast is to work correctly, even if the programmer passes
the wrong value. Perhaps the programmer is passing input from a file
stream or something, and doesn't want to validate the string for such
a simple function. (And perhaps the string being uppercase is required
for future validations.)
But fails for potentially conforming implementations. To many people,
that's acceptable.

Under what circumstances will casting to unsigned char fail, and how
will it fail?
 

Peter Nilsson

Andrew said:
> No matter what you think `const' means in this context, it's wrong. ...

Yup, braino. I was thinking about reading from a source and writing to
a different string. Please remove the const and reparse.

> You change both `us' /and/ `*us' in the second line.

That wasn't a typo, just saving whitespace.

> Under what circumstances will casting to unsigned char fail, and how
> will it fail?

On hypothetical but conforming implementations where char is signed
and the count of integers in the range of char is smaller than the
count of integers in the range of unsigned char. Pigeon hole principles
come into play.
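The counting argument can be checked against <limits.h>. The sketch below is only an illustration: the function names are hypothetical, and it assumes the counts fit in an unsigned long, which holds on ordinary platforms and on the hypothetical one described in this thread.

```c
#include <limits.h>

/* Number of distinct values plain char can hold. */
unsigned long char_value_count(void)
{
    return (unsigned long) ((long) CHAR_MAX - (long) CHAR_MIN) + 1UL;
}

/* Number of distinct values unsigned char can hold. */
unsigned long uchar_value_count(void)
{
    return (unsigned long) UCHAR_MAX + 1UL;
}

/* Nonzero when plain char has at least as many values as unsigned char.
   With CHAR_MIN = -128, CHAR_MAX = 127 and UCHAR_MAX = 65535 this is
   zero: 256 char values cannot distinguish 65536 character codes. */
int char_covers_uchar(void)
{
    return char_value_count() >= uchar_value_count();
}
```

On a typical 8-bit implementation both counts are 256 and char_covers_uchar() returns nonzero.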
 

Peter Nilsson

Richard said:
Peter Nilsson said:
Richard Heathfield wrote:
... Peter's idea is fatally flawed.

<sigh>

Consider...

char line[256];
size_t i;
if (fgets(line, sizeof line, stdin))
{
    for (i = 0; line[i] != 0; i++)
    {
        line[i] = toupper((unsigned char) line[i]);      /* v1 */
        line[i] = toupper(* (unsigned char *) &line[i]); /* v2 */
    }
    ...
}

On an implementation satisfying...

UCHAR_MAX: 65535
SCHAR_MAX: 127
SCHAR_MIN: -128
CHAR_MAX: 127

...v1 can fail, v2 succeeds.


Even if


[Nothing semantically wrong with the v2 version of the above code
then?]
such an implementation is conforming (about which I have
serious doubts, but I'm not going to press the point right now),

Since you clearly don't have serious c&v, I won't either.
 

Richard Heathfield

Peter Nilsson said:
Richard said:
Peter Nilsson said:
Richard Heathfield wrote:
... Peter's idea is fatally flawed.

<sigh>

Consider...

char line[256];
size_t i;
if (fgets(line, sizeof line, stdin))
{
    for (i = 0; line[i] != 0; i++)
    {
        line[i] = toupper((unsigned char) line[i]);      /* v1 */
        line[i] = toupper(* (unsigned char *) &line[i]); /* v2 */
    }
    ...
}

On an implementation satisfying...

UCHAR_MAX: 65535
SCHAR_MAX: 127
SCHAR_MIN: -128
CHAR_MAX: 127

...v1 can fail, v2 succeeds.


Even if


[Nothing semantically wrong with the v2 version of the above code
then?]


I didn't look that closely, since you're only describing a theoretical
problem, which you are trying to solve by replacing it with a technique
that is flawed not just in theory but also in practice.
Since you clearly don't have serious c&v, I won't either.

I think you've completely and utterly missed my point.

I will accept for the purposes of this discussion that the implementation
you describe is conforming, and might conceivably exist. Nevertheless, you
would presumably agree that no such implementation is in widespread use. So
your "fix" doesn't actually fix anything in real life. (If you disagree,
let's hear it. Which widely-used platform has the characteristics you
describe?)

On the other hand, conforming implementations for big-endian platforms
certainly exist, and are in widespread use, and your technique breaks on
such platforms, in a manner I have described upthread.

So we have two choices: a technique that can only be shown to break on a
hypothetical platform, and a technique that can be shown to break on very
real and widely-used platforms.

If those are the only choices, then, for me at least, it's no contest.
 

Dik T. Winter

Richard, I think you are missing something:
> Peter Nilsson said: ....
> >> > char line[256]; ....
> >> > line = toupper(* (unsigned char *) &line); /* v2 */
....
> On the other hand, conforming implementations for big-endian platforms
> certainly exist, and are in widespread use, and your technique breaks on
> such platforms, in a manner I have described upthread.


Care to explain why the above would break on such a platform? The only
thing is that a pointer to char is cast to a pointer to unsigned char,
and the latter is dereferenced.
 

Richard Heathfield

Dik T. Winter said:
Richard, I think you are missing something:

...and I think Peter is. :)
Peter Nilsson said: ...
char line[256]; ...
line = toupper(* (unsigned char *) &line); /* v2 */
...
On the other hand, conforming implementations for big-endian platforms
certainly exist, and are in widespread use, and your technique breaks
on such platforms, in a manner I have described upthread.


Care to explain why the above would break on such a platform?


I'm not saying it will. Peter introduced that code to illustrate how a
simple cast to unsigned char could conceivably break on a hypothetical
platform with UCHAR_MAX = 65535 and SCHAR_MAX = 127. Let us ascribe the
generic name "PeterPlatform" to such platforms, and let us give big-endian
platforms with sizeof(int) > 1 the generic name of "PracticalPlatform".

The problem I have with his suggested technique:

object = toupper(*(unsigned char *)&object);

is not in relation to the above code, but in contexts where a character
value is stored in an int, and it is not known whether the character is
representable as an unsigned char. This is far from rare. Consider, for
example, the following function:

#include <ctype.h>

int toggle_case(int ch)
{
#ifdef PETER
    /* Peter's technique: read the first byte of the int object */
    if(islower(*(unsigned char *)&ch))
    {
        ch = toupper(*(unsigned char *)&ch);
    }
    else
    {
        ch = tolower(*(unsigned char *)&ch);
    }
#else
    /* Conventional technique: convert the value to unsigned char */
    if(islower((unsigned char)ch))
    {
        ch = toupper((unsigned char)ch);
    }
    else
    {
        ch = tolower((unsigned char)ch);
    }
#endif
    return ch;
}

If PETER is defined, the code breaks on PracticalPlatform, but works on
PeterPlatform.

If PETER is not defined, the code breaks on PeterPlatform, but works on
PracticalPlatform. This more conventional technique also works for the code
Peter wrote, except on PeterPlatform.

So we have two techniques, one of which works just about everywhere in the
real world, and one which breaks on a very important subset of the real
world, in certain reasonably common situations. Given the choice between
the two, I favour the technique that fails on fewest real world platforms.
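To see why the PETER branch depends on byte order when the character lives in an int, the two expressions can be isolated as below. This is a sketch with hypothetical function names, not code from the thread. It assumes sizeof(int) > 1 and a plain big-endian or little-endian layout.

```c
/* Conventional technique: convert the int's value to unsigned char.
   The result is the value reduced modulo UCHAR_MAX + 1, regardless of
   how the int is laid out in memory. */
unsigned char by_value(int ch)
{
    return (unsigned char) ch;
}

/* Pointer technique applied to an int: read the lowest-addressed byte
   of the object. Which part of the value that byte holds depends on
   the platform's byte order. */
unsigned char first_byte(int ch)
{
    return *(unsigned char *) &ch;
}

/* Nonzero if the lowest-addressed byte of an int holds its low-order
   bits (little-endian), zero otherwise. */
int low_byte_first(void)
{
    int one = 1;
    return *(unsigned char *) &one == 1;
}
```

On a little-endian machine first_byte('A') and by_value('A') agree. On a big-endian machine with sizeof(int) > 1, first_byte('A') reads a high-order byte, which is 0 for small character codes.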
 

Frederick Gotham

Richard Heathfield posted:

So we have two techniques, one of which works just about everywhere in
the real world, and one which breaks on a very important subset of the
real world, in certain reasonably common situations. Given the choice
between the two, I favour the technique that fails on fewest real
world platforms.


Just as a hypothetical:
If there were a guarantee in C that a signed char had no padding (and
thus the exact same quantity of value representation bits as an unsigned
char), then would you consider using:

toupper( *(unsigned char const *)c );


It would seem preferable to me over:

toupper( (unsigned char)c );
 

Richard Heathfield

Frederick Gotham said:
Richard Heathfield posted:




Just as a hypothetical:
If there were a guarantee in C that a signed char had no padding (and
thus the exact same quantity of value representation bits as an unsigned
char), then would you consider using:

toupper( *(unsigned char const *)c );

(Presumably you mean c to be a pointer.)

No, I wouldn't, because this is broken in exactly the same way as it was
before - i.e. it gives wrong results in some circumstances, on a very
important bunch of platforms.
It would seem preferable to me over:

toupper( (unsigned char)c );

Not to me.
 

Hallvard B Furuseth

Frederick said:
This suggests that an unsigned char has 16 value representation
bits, and an unknown quantity of padding bits.

unsigned char has no padding bits.
 

Frederick Gotham

Hallvard B Furuseth posted:
unsigned char has no padding bits.


Wups, slipped my mind.


So in the given example:

unsigned char: 16 value bits, no padding.
char: 8 value bits, 8 padding bits.
 

Mike S

Richard said:
Mike S said:

OK, it's late and I might be missing something here, but aren't the
expressions

(unsigned char) c

and

*(unsigned char*) &c

semantically equivalent?
No.

Or is there a chance that they might evaluate to a different result

Very much so.

int c = getchar(); /* let's say we get an 'A' from getchar(), and let's
assume we're using some completely arbitrary and whacko character set such
as, say, ASCII. */
[...]

On any big-endian
system where sizeof(int) > 1, this code is going to produce the wrong
result. Specifically, it will normally produce 0 instead of the required
result.

Peter had mentioned in a previous post that c was a plain char, so I
assumed that in my "semantically equivalent" statement. Even if it were
an int, I probably would have forgotten to consider "other-endian"
machines anyway -- I'm a bit *too* comfortable with x86 and I doubt I
would have thought twice about it ;-)
 

Hallvard B Furuseth

Frederick said:
Hallvard B Furuseth posted:

Wups, slipped my mind.

So in the given example:

unsigned char: 16 value bits, no padding.
char: 8 value bits, 8 padding bits.

Yup.

OTOH, getting back to (unsigned char)c vs. *(unsigned char *)&c where c
is a char: These expressions produce different values if c has the sign
bit set and is represented as one's complement or sign/magnitude. Just
like with the different-width example above I have no idea if that is
possible in a conforming implementation, but I doubt it.

However if both are possible the pointer cast hack is still just
replacing one possible bug with another one. It'll give you a value,
but not necessarily the _right_ value. Or the other way around: The one
with pointers gives the right value and the other gives the wrong value.
Depends on how the character value was stored. One thing I feel certain
about is that even if I by some miracle managed to keep that straight,
some other component of the program would be getting it wrong. So I
just don't worry about it, and use both expressions interchangeably.
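The difference between the two expressions for a negative char can be worked through by simulating the bit patterns. The sketch below models 8-bit encodings with unsigned arithmetic. The function names are hypothetical, and nothing here depends on how the host machine actually represents signed char.

```c
/* Simulated 8-bit encodings of a value v in the range -127..127. */

unsigned twos_complement_bits(int v)
{
    return (unsigned) v & 0xFFu;
}

unsigned ones_complement_bits(int v)
{
    return v >= 0 ? (unsigned) v : (~(unsigned) (-v) & 0xFFu);
}

unsigned sign_magnitude_bits(int v)
{
    return v >= 0 ? (unsigned) v : (0x80u | (unsigned) (-v));
}

/* What (unsigned char) c yields when unsigned char has 8 bits: the value
   reduced modulo 256, whatever the representation of c. */
unsigned converted_value(int v)
{
    return (unsigned) v & 0xFFu;
}
```

For v = -1 the value conversion gives 255 under all three schemes, while reading the stored bits through an unsigned char pointer would give 255 for two's complement, 254 for one's complement and 129 for sign/magnitude. The two expressions therefore agree only under two's complement.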
 

Richard Heathfield

Peter Nilsson said:
Frederick Gotham wrote:

That's the clc regular's method. To me, it generally makes more
sense to do...

toupper( * (unsigned char) &c )

[(unsigned char *) was intended]
...when c is a plain char.

Peter, I owe you an apology. I missed this caveat when I first read your
article. My "big-endian" objection does not apply in such a case.

<snip>
 

Andrew Poelstra

Yup, braino. I was thinking about reading from a source and writing to
a different string. Please remove the const and reparse.


That wasn't a typo, just saving whitespace.

It was an error when you had the `const' in there. If you remove them,
the code works. (Although some people like to put `us' in the first
part of the for statement instead of leaving it empty).
On hypothetical but conforming implementations where char is signed
and the count of integers in the range of char is smaller than the
count of integers in the range of unsigned char. Pigeon hole principles
come into play.

I believe all of these are guaranteed:

sizeof (char) == sizeof (unsigned char)
char has no padding bits
char has no trap representations
Therefore all chars must have 2^CHAR_BIT values.

If you have a problem because you are on some mysterious platform
without these attributes, you'll have other problems elsewhere in the
code. That, and any code that relies on your platform will almost
certainly be nonportable.
 

Eric Sosman

Andrew said:
[...]

I believe all of these are guaranteed:

sizeof (char) == sizeof (unsigned char)

Yes, because both are guaranteed to equal 1.
char has no padding bits
char has no trap representations

Would you mind revealing where you find these guarantees?
If they are in the Standard, I have overlooked them.
Therefore all chars must have 2^CHAR_BIT values.

The Standard's language about "negative zero" casts some
doubt on this. If there are two different forms of the value
zero, there must be strictly fewer than 2^CHAR_BIT possible
values -- even without padding bits.
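The counting claim can be illustrated by enumerating every pattern of a simulated 8-bit sign-and-magnitude char. This is only a sketch with hypothetical function names. It models the encoding in ordinary int arithmetic and says nothing about any particular real implementation.

```c
/* Decode an 8-bit sign-and-magnitude pattern into the value it denotes. */
int decode_sign_magnitude(unsigned bits)
{
    int magnitude = (int) (bits & 0x7Fu);
    return (bits & 0x80u) ? -magnitude : magnitude;
}

/* Count the distinct values denoted by the 256 possible patterns. */
int distinct_sign_magnitude_values(void)
{
    int seen[255] = {0};   /* index = value + 127, for values -127..127 */
    int count = 0;
    unsigned bits;

    for (bits = 0; bits < 256u; bits++)
    {
        int idx = decode_sign_magnitude(bits) + 127;
        if (!seen[idx])
        {
            seen[idx] = 1;
            count++;
        }
    }
    return count;
}
```

Patterns 0x00 and 0x80 both decode to zero, so the 256 patterns yield only 255 distinct values.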
If you have a problem because you are on some mysterious platform
without these attributes, you'll have other problems elsewhere in the
code. That, and any code that relies on your platform will almost
certainly be nonportable.

It seems to me that this is a backwards definition of
"portability." The point isn't about relying on peculiarities
of exotic platforms, but about writing code that works whether
those peculiarities are present or not. A program that works
correctly with all conforming representations of char is more
portable, not less, than a program that insists on trapless
eight-bit two's complement.
 

Andrew Poelstra

Andrew said:
[...]

I believe all of these are guaranteed:
char has no padding bits
char has no trap representations

Would you mind revealing where you find these guarantees?
If they are in the Standard, I have overlooked them.

The first has been mentioned in this group many times (although it
may pertain only to unsigned char), and the second seemed to me a
logical extension.
The Standard's language about "negative zero" casts some
doubt on this. If there are two different forms of the value
zero, there must be strictly fewer than 2^CHAR_BIT possible
values -- even without padding bits.

I consider 0 and -0 separate values for the purposes of my post.
It seems to me that this is a backwards definition of
"portability." The point isn't about relying on peculiarities
of exotic platforms, but about writing code that works whether
those peculiarities are present or not. A program that works
correctly with all conforming representations of char is more
portable, not less, than a program that insists on trapless
eight-bit two's complement.

All I insisted on was trapless. Please don't misinterpret me.
 

Eric Sosman

Andrew said:
Andrew said:
[...]

I believe all of these are guaranteed:
char has no padding bits
char has no trap representations

Would you mind revealing where you find these guarantees?
If they are in the Standard, I have overlooked them.

The first has been mentioned in this group many times (although it
may pertain only to unsigned char), and the second seemed to me a
logical extension.

There are special guarantees for unsigned char, so that
it is possible to treat the representation of any object as
an array of unsigned char. This would not work if unsigned
char had trap representations or contained indeterminately-valued
padding bits.

However, I am unaware of any similar guarantees for char,
either signed or plain. On an implementation where plain char
is unsigned one can deduce that it has no padding bits or traps
(argument: On such an implementation, plain char can represent
all the values unsigned char can, and since the latter "fills
the code space" the former must, too). But the argument doesn't
hold for signed char, or for plain char on an implementation
where CHAR_MIN<0.
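The "array of unsigned char" view of an object can be sketched as follows. The helper name same_bytes is hypothetical. It compares object representations, so two objects holding equal values may still compare unequal if their type has padding bits or multiple representations of one value.

```c
#include <stddef.h>

/* Hypothetical helper: compare the object representations of two
   objects byte by byte through unsigned char. Returns 1 if all n bytes
   match, 0 otherwise. */
int same_bytes(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a;
    const unsigned char *pb = b;
    size_t i;

    for (i = 0; i < n; i++)
    {
        if (pa[i] != pb[i])
        {
            return 0;
        }
    }
    return 1;
}
```

This is essentially what memcmp() does. The point of spelling it out is that the byte-wise reads go through unsigned char, not through plain or signed char.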
 

Keith Thompson

Andrew Poelstra said:
I believe all of these are guaranteed:

sizeof (char) == sizeof (unsigned char)
char has no padding bits
char has no trap representations
Therefore all chars must have 2^CHAR_BIT values.

I believe the last three are guaranteed only for unsigned char, not
for plain or signed char.
 

Keith Thompson

Andrew Poelstra said:
Andrew said:
[...]

I believe all of these are guaranteed:
char has no padding bits
char has no trap representations

Would you mind revealing where you find these guarantees?
If they are in the Standard, I have overlooked them.

The first has been mentioned in this group many times (although it
may pertain only to unsigned char), and the second seemed to me a
logical extension.
The Standard's language about "negative zero" casts some
doubt on this. If there are two different forms of the value
zero, there must be strictly fewer than 2^CHAR_BIT possible
values -- even without padding bits.

I consider 0 and -0 separate values for the purposes of my post.

But they're not separate values in any reasonable sense. In
particular (0 == -0) is guaranteed to be true. They may be different
*representations* of the same value.

[...]
All I insisted on was trapless. Please don't misinterpret me.

Ok, but I don't see a guarantee in the standard that signed or plain
char has no trap representations.

If you want a byte-sized type with no padding bits or trap
representations, use unsigned char; that's what it's for.
 
