A character with a negative value

Martin Wells

Plain char may be signed or unsigned. Typical ranges could be:

CHAR_MIN == -128, CHAR_MAX == 127

CHAR_MIN == 0, CHAR_MAX == 255

The Standard says that the behaviour is undefined if we pass an
argument to the "to*" functions whose value is outside the range of 0
through UCHAR_MAX. This most certainly should have been CHAR_MIN
through CHAR_MAX.

If there were a particular implementation where a valid character had
a negative value, wouldn't it make perfect sense that you can pass
this value to "to*"? I think it would be ridiculously stupid if you
couldn't.

As an example, let's say that there's an uppercase alphabetical
character whose numeric value is 17, and that the lowercase form of
this character's value is -8. If we pass the former to "tolower", we
should get -8, and if we pass the latter to "toupper", we should get
17. Now of course, the Standard itself doesn't guarantee this... but
if the implementation has negative values for valid characters then it
would be quite stupid if you couldn't do normal operations on these
valid characters. How many people here use an "unsigned char" cast
when using the "to*" functions? Because I don't.

Martin
 
Chris Dollin

Martin said:
As an example, let's say that there's an uppercase alphabetical
character whose numeric value is 17, and that the lowercase form of
this character's value is -8.

Can't happen in a conforming implementation.

How many people here use an "unsigned char" cast when using the "to*"
functions?

Me.

Because I don't.

Oops.
 
vipvipvipvipvip.ru

Plain char may be signed or unsigned. Typical ranges could be:

CHAR_MIN == -128, CHAR_MAX == 127

CHAR_MIN == 0, CHAR_MAX == 255
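
CHAR_MIN and CHAR_MAX are for plain char. Use SCHAR_MIN/SCHAR_MAX and
UCHAR_MAX for the explicitly signed and unsigned types:
SCHAR_MIN <= -127, SCHAR_MAX >= 127
unsigned char minimum is 0 (there is no UCHAR_MIN macro), UCHAR_MAX >= 255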

Cast the argument in all your to*() calls to (unsigned char).
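
A minimal sketch of what that advice looks like in practice. The helper name upcase_in_place is hypothetical and not from this thread; the only point it illustrates is where the (unsigned char) cast goes.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: upper-case a NUL-terminated string in place.
   The cast to unsigned char keeps the argument inside 0..UCHAR_MAX
   even when plain char is signed and s[i] holds a negative value. */
void upcase_in_place(char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++) {
        /* toupper returns an int; convert back to char for storage. */
        s[i] = (char)toupper((unsigned char)s[i]);
    }
}
```

Without the cast, a negative s[i] other than EOF would be outside the range the Standard allows for toupper, which is the undefined behaviour discussed above.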
 
Chris Dollin

Martin said:
Chris:

Specifically what can't happen in a conforming implementation?
Specifically:

Generally: that any letter (as defined by the C standard) have a negative
value when represented as a plain `char`.
 
Martin Wells

Chris:
Generally: that any letter (as defined by the C standard) have a negative
value when represented as a plain `char`.


Any _letter_ or any _character_? Can you point me to the page in the
Standard?

Martin
 
Ben Pfaff

Martin Wells said:
Chris:

Specifically what can't happen in a conforming implementation?

All uppercase and lowercase letters in the English alphabet must
have positive values in the range of char in a C implementation.
See C99 6.2.5 "Types", paragraph 3:

If a member of the basic execution character set is
stored in a char object, its value is guaranteed to be
nonnegative.
 
CBFalconer

Martin said:
Plain char may be signed or unsigned. Typical ranges could be:

CHAR_MIN == -128, CHAR_MAX == 127

CHAR_MIN == 0, CHAR_MAX == 255

The Standard says that the behaviour is undefined if we pass an
argument to the "to*" functions whose value is outside the range of 0
through UCHAR_MAX. This most certainly should have been CHAR_MIN
through CHAR_MAX.

If there were a particular implementation where a valid character had
a negative value, wouldn't it make perfect sense that you can pass
this value to "to*"? I think it would be ridiculously stupid if you
couldn't.

However, in most cases this cannot arise. If you closely examine
the specifications of the various input functions, such as getc, fgetc,
and getchar, you will notice that they all return the input value
as an int formed from the _unsigned char_ value of the input. This
also means that checking for EOF is simple: check the sign of the
returned value, since EOF is the only negative value allowed.
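
As a rough illustration of that point, here is a sketch that feeds getc results straight into toupper with no cast. The function name read_upcased and its buffer-based interface are assumptions made for the example, not anything from the thread.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: read up to cap-1 characters from a stream,
   upper-casing each one, and NUL-terminate the result.
   Returns the number of characters stored. */
size_t read_upcased(FILE *in, char *buf, size_t cap)
{
    size_t n = 0;
    int c; /* int, not char, so EOF stays distinguishable */

    if (cap == 0)
        return 0;
    while (n + 1 < cap && (c = getc(in)) != EOF) {
        /* c came from getc, so it is already in 0..UCHAR_MAX:
           no cast is needed before calling toupper. */
        buf[n++] = (char)toupper(c);
    }
    buf[n] = '\0';
    return n;
}
```

The cast only becomes necessary once the value has been stored in a plain char and read back, because that is where a negative value can appear.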
 
Richard

Martin Wells said:
Chris:



Any _letter_ or any _character_? Can you point me to the page in the
Standard?

Martin

Any English *letter* (upper/lowercase) as defined by the C standard I
would think.
 
Chris Dollin

Martin said:
Chris:


Any _letter_ or any _character_?

Any letter, and in fact any character in C's basic execution character
set; so e.g. ()[]*+! must all be positive but @ need not be.

Can you point me to the page in the Standard?

No, I'll have to give you a reference to n1124, in which 6.2.5 para 3
has the required text.
 
Eric Sosman

Martin Wells wrote on 11/01/07 09:18:
Plain char may be signed or unsigned. Typical ranges could be:

CHAR_MIN == -128, CHAR_MAX == 127

CHAR_MIN == 0, CHAR_MAX == 255

The Standard says that the behaviour is undefined if we pass an
argument to the "to*" functions whose value is outside the range of 0
through UCHAR_MAX. This most certainly should have been CHAR_MIN
through CHAR_MAX.

Taking "should have been" as a criticism of the
original design, I'd agree. Unfortunately, that horse
had left the barn long before the first Standard was
assembled. C89 codified existing practice to the extent
possible, rather than using hindsight to overturn it.
That's why we have gets(), for instance.

If there were a particular implementation where a valid character had
a negative value, wouldn't it make perfect sense that you can pass
this value to "to*"? I think it would be ridiculously stupid if you
couldn't.

It would make perfect sense, yes. Alas, we live and
code in an imperfect world.

[...] How many people here use an "unsigned char" cast
when using the "to*" functions?

I do.

Because I don't.

Sorry to hear that. Are the people who use your code
aware that some of your bugs are not accidental, but
deliberate?
 
Eric Sosman

Ben Pfaff wrote on 11/01/07 11:29:
All uppercase and lowercase letters in the English alphabet must
have positive values in the range of char in a C implementation.
See C99 6.2.5 "Types", paragraph 3:

If a member of the basic execution character set is
stored in a char object, its value is guaranteed to be
nonnegative.

Right, but that's only for the *basic* set, the
characters that the Standard itself requires. The
implementation may define additional characters -- most
do, nowadays -- some or all of which may be negative.
(Anyone with a ¥ to argue this point and lacking the ¢
to keep quiet can go £ sand.)

In the "C" locale, only the fifty-two letters A-Z
and a-z are "alphabetic" as determined by isalpha().
But other locales can extend "alphabetic" to characters
outside the basic set, and some of these could well be
negative.

Martin's scenario seems a bit fanciful, but as far
as I can tell it is permitted by the Standard. I see
no requirement that toupper((unsigned char)ch) and
tolower((unsigned char)ch) must have the same sign.
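
To make the arithmetic concrete, here is a small sketch of what the cast does to a negative plain char such as the -8 in Martin's example. Both function names are hypothetical, introduced only for illustration.

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: convert a plain char to a value that is safe
   to pass to the is* and to* functions. Conversion to unsigned char
   is modulo UCHAR_MAX + 1, so -8 becomes UCHAR_MAX + 1 - 8. */
int to_ctype_arg(char ch)
{
    return (unsigned char)ch;
}

/* Hypothetical helper: the argument range the Standard permits for
   the <ctype.h> functions, i.e. EOF or a value representable as
   unsigned char. */
int is_valid_ctype_arg(int v)
{
    return v == EOF || (v >= 0 && v <= UCHAR_MAX);
}
```

With an 8-bit char, to_ctype_arg((char)-8) yields 248, which is in range, whereas passing the raw -8 would not be (unless EOF happened to be -8). Converting a toupper or tolower result back to plain char can then produce a negative value again on a signed-char implementation, which is why the two results need not share a sign.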
 
Ben Pfaff

Eric Sosman said:
Ben Pfaff wrote On 11/01/07 11:29,:

Right, but that's only for the *basic* set, the
characters that the Standard itself requires. The
implementation may define additional characters -- most
do, nowadays -- some or all of which may be negative.

And that's why I said "in the English alphabet". (Additionally,
the Standard defines "letters" to be only English letters.)
 
santosh

Martin said:
Chris:



Any _letter_ or any _character_? Can you point me to the page in the
Standard?

As per the Standard all printable characters of the execution and source
character set must be positive values.
 
Ben Pfaff

santosh said:
As per the Standard all printable characters of the execution and source
character set must be positive values.

I believe that this guarantee is restricted to the basic
execution character set.
 
Keith Thompson

santosh said:
As per the Standard all printable characters of the execution and source
character set must be positive values.

Yes, but in most cases it would be unwise to take advantage of that
fact. Code that deals only with the required character set today
might have to deal with arbitrary characters tomorrow.
 
jameskuyper

Ben said:
All uppercase and lowercase letters in the English alphabet must

He just said "alphabetical". He didn't say English, ...

have positive values in the range of char in a C implementation.
See C99 6.2.5 "Types", paragraph 3:

If a member of the basic execution character set is

and he didn't say "basic execution character set", or any variation
thereof.
 
Ben Pfaff

He just said "alphabetical". He didn't say English, ...


and he didn't say "basic execution character set", or any variation
thereof.

Well, that's why my answer included those words, to make
everything perfectly clear.
 
SM Ryan

#
# Plain char may be signed or unsigned. Typical ranges could be:
#
# CHAR_MIN == -128, CHAR_MAX == 127
#
# CHAR_MIN == 0, CHAR_MAX == 255
#
# The Standard says that the behaviour is undefined if we pass an
# argument to the "to*" functions whose value is outside the range of 0
# through UCHAR_MAX. This most certainly should have been CHAR_MIN
# through CHAR_MAX.

No, it's 0. Sucks. For everything other than isascii, you have to take
care with the 128-255 (or -128 to -1) range. It dates back to when everyone
knew we would never need more than seven bits, so the eighth bit was
like for free and you could use it for other things.
 
jameskuyper

Ben said:
Well, that's why my answer included those words, to make
everything perfectly clear.

Sorry, I thought you were explaining Chris Dollin's claim "Can't happen
in a conforming implementation." It wasn't clear from context that you
were pointing out the limitations that made his claim inaccurate.
 
