Manipulation of strings: upper/lower case


Peter Nilsson

infobahn said:
Caution is necessary here. The behaviours of islower and toupper
are undefined if they are passed a value that is neither EOF nor
representable as an unsigned char. It is good practice, therefore,
to cast *string to unsigned char.

I believe the cast (conversion) of individual characters is
incorrect. Instead, the byte characters should be interpreted as
unsigned char...

#include <ctype.h>

char *make_upper(char *s)
{
    /* Interpret the bytes as unsigned char, then modify in place. */
    unsigned char *us = (unsigned char *) s;
    for (; *us; us++) *us = toupper(*us);
    return s;
}

The reason being that reinterpretation is more likely to be
correct.
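
For contrast, a minimal sketch of the conversion-based approach
(the name make_upper_conv is just for illustration); on two's
complement systems it behaves the same as the version above:

#include <ctype.h>

/* Conversion approach: convert each char *value* to unsigned char
   before handing it to toupper. */
char *make_upper_conv(char *s)
{
    char *p;
    for (p = s; *p; p++)
        *p = (char) toupper((unsigned char) *p);
    return s;
}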

I did once post a query about this...
http://groups.google.com/[email protected]
 

Old Wolf

Peter said:
I believe the cast (conversion) of individual characters is
incorrect. Instead, the byte characters should be interpreted as
unsigned char...

unsigned char *us = (unsigned char *) s;

The reason being that reinterpretation is more likely to be
correct.

Casting a signed char to unsigned is always correct.
So everything else is equally or less likely to be
correct :)
AFAIK the standard does not explicitly say that you
can cast a (char *) to an (unsigned char *); for example,
many compilers warn about parameter type mismatches if you
pass one to a function expecting the other.
However, it does say that they must have the same size,
alignment, etc., so I don't see how an implementation
could conform but not allow the cast. (Unless it was the DS9k.)
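
A small sketch of the diagnostic point, assuming a hypothetical
helper that takes unsigned char *: passing a char * without a cast
draws a warning on many compilers, while the explicit cast is
accepted everywhere:

#include <stddef.h>

/* Hypothetical helper expecting unsigned char *. */
static void zero_bytes(unsigned char *p, size_t n)
{
    while (n--)
        *p++ = 0;
}

int main(void)
{
    char buf[8];

    /* zero_bytes(buf, sizeof buf); -- pointer type mismatch;
       most compilers diagnose it. */
    zero_bytes((unsigned char *) buf, sizeof buf);  /* fine */
    return 0;
}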
 

Peter Nilsson

Old said:
Casting a signed char to unsigned is always correct.
So everything else is equally or less likely to be
correct :)

Chapter and verse, please.

Consider that I/O functions write to buffers (and strings)
using unsigned char, not char. The string and mem functions
use unsigned char, not char.

My main point is that a cast from char to unsigned char may
NOT yield the original value that was written to the char.
Old said:
AFAIK the standard does not explicitly say that you
can cast a (char *) to an (unsigned char *)...

6.3.2.3p7: "... When a pointer to an object is converted to a
pointer to a character type, the result points to the lowest
addressed byte of the object. ..."

Old said:
for example, many compilers warn about parameter type mismatches
if you pass one to a function expecting the other.

Because many implicit conversions _require_ a diagnostic.
 

infobahn

Old said:
Casting a signed char to unsigned is always correct.

Yes. His complaint is most strange, since there's nothing at all
wrong with the cast I suggested.
Old said:
So everything else is equally or less likely to be
correct :)
AFAIK the standard does not explicitly say that you
can cast a (char *) to an (unsigned char *)...

You can point an unsigned char * anywhere you can point (within
reason - for example, you wouldn't want to point it at a function).

The closest the Standard comes to formalising this, as far as I can
tell, is:

"Values stored in non-bit-field objects of any other object type
consist of n x CHAR_BIT bits, where n is the size of an object of
that type, in bytes. The value may be copied into an object of type
unsigned char [n] (e.g., by memcpy); the resulting set of bytes is
called the object representation of the value."

This doesn't actually say anything about casting, but it does say
we can represent any object using an array of unsigned char.
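
A minimal sketch of that passage: the bytes of any object can be
copied into an array of unsigned char and examined (the output is
implementation-specific):

#include <stdio.h>
#include <string.h>

int main(void)
{
    double d = 1.0;
    unsigned char bytes[sizeof d];
    size_t i;

    /* Copy the object representation into unsigned char[n],
       as the quoted passage describes. */
    memcpy(bytes, &d, sizeof d);
    for (i = 0; i < sizeof d; i++)
        printf("%02x ", (unsigned) bytes[i]);
    putchar('\n');
    return 0;
}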
Old said:
for example, many compilers warn about parameter type mismatches
if you pass one to a function expecting the other.

And rightly so, but not because objects can't be pointed to by
unsigned char *.
Old said:
However, it does say that they must have the same size,
alignment, etc., so I don't see how an implementation
could conform but not allow the cast. (Unless it was the DS9k.)

I do not believe the DS9K could refuse the cast either.
 

Peter Nilsson

infobahn said:
Yes. His complaint is most strange, since there's nothing at all
wrong with the cast I suggested.

6.2.5p3 says:

" An object declared as type char is large enough to store any
" member of the basic execution character set. If a member of the
" basic execution character set is stored in a char object, its
" value is guaranteed to be positive. If any other character is
" stored in a char object, the resulting value is implementation-
" defined but shall be within the range of values that can be
" represented in that type.

This makes it quite clear that plain char may not be sufficient
to represent the values of all (extended) characters in the
execution character set. This is the first clue that a conversion
of a plain char value might not be appropriate.

But let's look at an example...

Suppose we have an implementation with an extended character set
that includes an accented e. For the sake of argument, let's
suppose the coding for that character is 233 (0xE9). This is
representable within a byte on any system, and is therefore a
valid single-byte character.

Let's go on to suppose we read input into a character array, and
that input includes one accented e. Note that ordinary input is
made through "byte input/output functions", so the value stored
in the corresponding byte is 233. Assuming an 8-bit byte, this
has the representation...

11101001

Consider the possible signed plain char value of this
representation on various allowed 8-bit implementations...

two's complement:  -23
ones' complement:  -22
sign-magnitude:   -105

Using your cast to convert char to unsigned char, we get...

two's complement:  233
ones' complement:  234
sign-magnitude:    151

...only _one_ of which is correct.

If instead we interpret the byte through an unsigned char
pointer, then we get 233, irrespective of the signed plain
char value. Had I considered the character coding of 128,
then the last sentence of 6.2.5p3 says you have _NO_ guarantee
that your cast to unsigned char will produce 128.

That is why the 'interpreted' way is better than 'conversion'.
Note that the string/memory functions interpret, rather than
cast, for similar reasons.
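
A minimal sketch of the two techniques side by side; on the two's
complement systems most of us use, the printed values agree, which
is exactly why the distinction is easy to miss:

#include <stdio.h>

int main(void)
{
    char c = (char) 0xE9;  /* stand-in for a byte read from input;
                              this assignment is itself
                              implementation-defined if char is
                              signed */

    /* Interpretation: read the stored bit pattern through an
       unsigned char lvalue. */
    unsigned char interpreted = *(unsigned char *) &c;

    /* Conversion: convert the char value; negative values reduce
       modulo UCHAR_MAX+1, which is where ones' complement and
       sign-magnitude systems can disagree. */
    unsigned char converted = (unsigned char) c;

    printf("interpreted=%u converted=%u\n",
           (unsigned) interpreted, (unsigned) converted);
    return 0;
}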
 

Lawrence Kirby

On Tue, 18 Jan 2005 00:49:54 -0800, Peter Nilsson wrote:

...
If instead we interpret the byte through an unsigned char
pointer, then we get 233, irrespective of the signed plain
char value. Had I considered the character coding of 128,
then the last sentence of 6.2.5p3 says you have _NO_ guarantee
that your cast to unsigned char will produce 128.

That is why the 'interpreted' way is better than 'conversion'.
Note that the string/memory functions interpret, rather than
cast, for similar reasons.

The real issue is that neither approach is correct until we know how the
value in the char has been derived in the first place. Maybe the character
value was obtained by converting the return value of getc() to char,
maybe it was written directly by fgets() or fread().

In practice, implementations that create inconsistent results for the
various approaches discussed are going to cause problems. In such
environments it would probably be wise for the implementation to define
char as an unsigned type. It is one of those things where the best thing
to do is ignore it until you come across it. You would have to be
AMAZINGLY unlucky for that to happen. IMO you are more likely to encounter
problems due to compiler bugs than this, and you might as well treat this
as such.

Lawrence
 

Peter Nilsson

Lawrence said:
On Tue, 18 Jan 2005 00:49:54 -0800, Peter Nilsson wrote:

...


The real issue is that neither approach is correct until we know
how the value in the char has been derived in the first place.
Maybe the character value was obtained by converting the return
value of getc() to char, maybe it was written directly by fgets()
or fread().

This is generally within the control of the programmer. Reading
input into char arrays by assigning values returned by fgetc is
wrong... in the theoretical sense. That a lot of programs do it
(K&R2 does it) doesn't make it any less 'wrong'.
Lawrence said:
In practice, implementations that create inconsistent results for
the various approaches discussed are going to cause problems. In
such environments it would probably be wise for the implementation
to define char as an unsigned type.

It would be even better if the standard actually _required_ this
for qualified implementations.

Personally, I think the standard is defective, not merely because
of the above issues, but also in the way it treats character
constants.

Consider an 8-bit implementation where plain char is signed and
does not use two's complement, but which supports a subset of
ISO 646. C99, by my reading, _requires_ that such implementations
generate a value _other than_ 233 for the character constants
'\xe9' and '\u00e9'!
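
On a typical two's complement implementation with signed 8-bit
char, a quick sketch shows the familiar behaviour (the exotic
representations above are where the round trip breaks down):

#include <stdio.h>

int main(void)
{
    /* On common implementations '\xe9' has the value -23 here,
       and converting to unsigned char recovers 233; the point of
       this thread is that this is not guaranteed elsewhere. */
    unsigned char uc = (unsigned char) '\xe9';

    printf("'\\xe9' as int:    %d\n", '\xe9');
    printf("as unsigned char: %u\n", (unsigned) uc);
    return 0;
}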

That said, I don't honestly claim to be able to rectify the standard
in a way that a significant majority of C diehards would approve of.
Lawrence said:
It is one of those things where the best thing to do is ignore it
until you come across it.

You would have to be AMAZINGLY unlucky for that to happen. IMO you
are more likely to encounter problems due to compiler bugs than
this, and you might as well treat this as such.

I agree, but I note that a modern C programmer would have to be
'amazingly unlucky' to ever program a hosted implementation that
didn't use two's complement, had 9-bit chars, used different
sized pointers for different (object or incomplete) pointer types,
had integer padding bits, ... and all the other things which are
regularly cited in clc as being supposedly relevant considerations.

Such things are so esoteric as to be worth ignoring. Nonetheless, I
still believe clc would be doing a disservice to its readers if it
did not mention them.
 

Old Wolf

infobahn said:
You can point an unsigned char * anywhere you can point (within
reason - for example, you wouldn't want to point it at a function).

Right. I meant to also say "...and get the expected result".

After reading Peter Nilsson's last post, I think his point
was that if you want to access the representation of a byte,
then you must point to it with (unsigned char *) and then read
it. This is of course different to reading the C value of a
signed char, and then converting to unsigned (because of
non-two's-complement systems).
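
A minimal sketch of that reading: point an unsigned char * at the
object and read its bytes directly, with no value conversion:

#include <stdio.h>

int main(void)
{
    int x = 42;
    const unsigned char *p = (const unsigned char *) &x;
    size_t i;

    /* Each p[i] is one byte of x's object representation. */
    for (i = 0; i < sizeof x; i++)
        printf("byte %u: %u\n", (unsigned) i, p[i]);
    return 0;
}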
 

Keith Thompson

Peter Nilsson said:
Personally, I think the standard is defective, not merely because
of the above issues, but also in the way it treats character
constants.

Consider an 8-bit implementation where plain char is signed, uses
non two's complement, but supports a subset of iso646. C99, by
my reading, _requires_ that such implementations generate a value
_other than_ 233 for the character constants '\xe9' and '\u00e9'!

That said, I don't honestly claim to be able to rectify the standard
in a way that a significant majority of C diehards would approve of.

Is there any real advantage (other than not breaking existing
implementations) in allowing plain char to be signed? I know there
are historical reasons, but what would break if the standard required
char to have the same characteristics as unsigned char?
 

Eric Sosman

Richard said:
I'm afraid they must.

A counterexample comes to mind. Consider a signed `char'
on a system that uses either ones' complement or signed
magnitude to represent negative integers. On such a system
there are two distinct `char' representations that have the
value zero (unless "minus zero" is a trap value), and both
of them produce the same value (zero) upon conversion to
`unsigned char'. Conversion obliterates the distinction.

Whether all this makes much difference is open to question,
though. A conforming C implementation can use signed magnitude,
can choose signed `char', can even choose CHAR_MAX==ULLONG_MAX,
but if it is a hosted implementation it must still make the I/O
functions work "properly." A successful getc() delivers an `int'
in the range 0..UCHAR_MAX, and if CHAR_MAX<UCHAR_MAX we might
think it unsafe to assign such a value to a plain `char' -- the
attempted conversion, according to the Standard, produces an
implementation-defined result or raises an implementation-defined
signal, and thus cannot be performed in a strictly-conforming
program. However, an implementation capable of reading a valid
character from an input stream but incapable of storing it into
a `char' would be laughed out of the marketplace. It might be
too ambitious to claim that such an implementation violated the
Standard, but "quality of implementation" concerns would, I think,
rule it out. As a practical matter, any system with signed `char'
must do "something reasonable" when it converts an out-of-range
`unsigned char' to plain (signed) `char'; the implementation-
defined aspect will turn out to be "what you wanted."
 

Richard Bos

Eric Sosman said:
A counterexample comes to mind. Consider a signed `char'
on a system that uses either ones' complement or signed
magnitude to represent negative integers. On such a system
there are two distinct `char' representations that have the
value zero (unless "minus zero" is a trap value), and both
of them produce the same value (zero) upon conversion to
`unsigned char'. Conversion obliterates the distinction.

It's dubious whether this can be called a difference in _value_, though.
They're both zero.
Eric Sosman said:
Whether all this makes much difference is open to question,
though. A conforming C implementation can use signed magnitude,
can choose signed `char', can even choose CHAR_MAX==ULLONG_MAX,
but if it is a hosted implementation it must still make the I/O
functions work "properly."

True. Which means that it's probably only possible to input one of the
two zeroes anyway.

Richard
 

Lawrence Kirby

Keith Thompson said:
Is there any real advantage (other than not breaking existing
implementations) in allowing plain char to be signed? I know there
are historical reasons, but what would break if the standard required
char to have the same characteristics as unsigned char?

There is of course a huge body of platform-specific code that assumes the
existing conventions for that platform such as the signedness of char.
Implementations themselves should be able to make the transition
fairly easily, although implementation code can quite legitimately
assume properties of the implementation, so if those properties are
changed some fixing and testing work would be needed. There is also the
issue of whether the change could produce a performance hit on some
implementations.

Lawrence
 

Lawrence Kirby

Eric Sosman said:
A counterexample comes to mind. Consider a signed `char'
on a system that uses either ones' complement or signed
magnitude to represent negative integers. On such a system
there are two distinct `char' representations that have the
value zero (unless "minus zero" is a trap value), and both
of them produce the same value (zero) upon conversion to
`unsigned char'. Conversion obliterates the distinction.

Since they both represent the same value there wasn't a distinction to
start with. Characters are represented by value; you cannot have two
different characters represented by the same value. It isn't the
conversion to unsigned char that causes the problem; that exists
whatever you do while the character is being represented and manipulated
as a char. Having multiple representations for a value will cause problems
for I/O handling so, as you say...
Eric Sosman said:
Whether all this makes much difference is open to question,
though. A conforming C implementation can use signed magnitude, can
choose signed `char', can even choose CHAR_MAX==ULLONG_MAX, but if it is
a hosted implementation it must still make the I/O functions work
"properly." A successful getc() delivers an `int' in the range
0..UCHAR_MAX, and if CHAR_MAX<UCHAR_MAX we might think it unsafe to
assign such a value to a plain `char' -- the attempted conversion,
according to the Standard, produces an implementation-defined result or
raises an implementation-defined signal, and thus cannot be performed in
a strictly-conforming program. However, an implementation capable of
reading a valid character from an input stream but incapable of storing
it into a `char' would be laughed out of the marketplace. It might be
too ambitious to claim that such an implementation violated the
Standard, but "quality of implementation" concerns would, I think, rule
it out. As a practical matter, any system with signed `char' must do
"something reasonable" when it converts an out-of-range `unsigned char'
to plain (signed) `char'; the implementation- defined aspect will turn
out to be "what you wanted."

... a realistic implementation will avoid the possibility.

Lawrence
 
