* Vladimir Jovic:
Where can I read about problems of the unsigned types?
Books and discussions on the net. I'd hoped the FAQ would be helpful but it's
not, just repeating the old idea of saving on one comparision by using unsigned
types. Which some may think is a good idea (and at the time that FAQ item was
written many did think it was a good idea) but fails to consider any problems.
However, it's not more than can be easily summarized in a Usenet posting, so
I'll do that.
First, unsigned types are not problematic of themselves: they're eminently
suitable for doing bit-level things, and that's what they're intended for
(according to e.g. Bjarne who created the language). They can also be good for
some other things. E.g., in some contexts an unsigned type can be preferable for
representing character codes, although that can also be problematic; I give a
concrete example of one case where it's absolutely necessary, below.
Unsigned types are problemetic in arithmetic and comparisions, because
* unsigned arithmetic is modulo 2^n, where n is the number of value
representation bits (i.e. you have wrap-around behavior, guaranteed), and
* with like size types U and I in the same expression you get promotion
to U (you also get promotion to U if U is of greater size than I).
These problems, discussed in detail below, are so common that e.g. James
Gosling, creator of Java, have used them as an argument in favor of Java:
"Quiz any C developer about unsigned, and pretty soon you discover that almost
no C developers actually understand what goes on with unsigned, what unsigned
arithmetic is. Things like that made C complex."
Of course I don't buy that argument, but what's interesting about it in the
context of answering your question is that the problems of unsigned arithmetic
is an uncontroversial notion. Gosling takes it for granted that the reader is
familiar with this. And Bjarne, present in the same interview, made no comment.
No expert would deny that the problems exist. It's only in a group where novices
seek enlightenment, or in an introductory programming book, that it's even
meaningful to discuss it; it's basic. It should have been in the FAQ. It isn't.
-- Modulo arithmetic.
With an unsigned type with M values and maximum value max = M-1, the expression
max+1 yields 0, and 0-1 yields max. The value range wraps around. If you'd
otherwise get a result outside the value range, then it's reduced to to a number
in the value range as if a sufficient number of M's was added or subtracted.
This behavior is guaranteed by the standard, and it results from the most simple
way to implement arithmetic with binary numbers. It's called modular arithmetic
or clock arithmetic (since clock times also behave that way); see Wikipedia.
With most C++ implementations, and in particular on the PC, you also get wrap
around behavior -- modulo arithmetic -- with signed types, but the wrap
around for a signed type is from most negative value to most positive value, and
vice versa, and is therefore not a problem for usual "small" values. At the bit
level the wraparound for signed types is due to a very simple way of
representing negative integers called two's complement form, where a negative
value -x is represented by the same bit pattern as the unsigned -x+M = M-x. The
name "two's complement" stems from M-x = ((M-1)-x) + 1, where, since M is a
power of 2, (M-1) is an all-1's bitpattern, and (M-1)-x therefore corresponds to
inverting every bit in x, called the "one's complement" (in the binary system).
Let's say you want to call the C standard libary's 'islower' function to check
whether Norwegian 'æ' is a lowercase letter -- it is, but, uh ...
<code>
#include <iostream>
#include <locale.h> // setlocale
#include <ctype.h> // islower
int main()
{
using namespace std;
setlocale( LC_ALL, "" );
cout << boolalpha << !!islower( 'æ' ) << endl;
}
</code>
<result>
false
</result>
(Note: since it's UB the result may be 'true', as one could naïvely expect, but
what matters is the /possibility/ of getting 'false'). What went wrong?
islower takes an 'int' argument that needs to be EOF or a non-negative character
code, while in practice 'char' is a signed type, and Norwegian 'æ', a character
outside the positive 'char' range, is therefore necessarily a negative value,
which is not supported by 'islower'.
Assuming two's complement form you preserve the bitpattern by casting to
unsigned type. This is the same as adding (the type-dependent) M, which in the
case above is the same as adding UCHAR_MAX+1. But the cast is more idiomatic:
<code>
#include <iostream>
#include <locale.h> // setlocale
#include <ctype.h> // islower
bool isLower( char const c )
{
typedef unsigned char UChar;
return !!islower( UChar( c ) );
}
int main()
{
using namespace std;
setlocale( LC_ALL, "" );
cout << boolalpha << !!isLower( 'æ' ) << endl;
}
</code>
<result>
true
</result>
The first version of the program is a very common novice error, failing to
understand the issues of unsigned representation and unsigned arithmetic, that
for the default signed 'char' type the value of 'æ' has been wrapped and is
therefore outside the required range for 'is_lower'.
Even professional programmers tend to make such mistakes, for as Gosling
observed "almost no C developers actually understand what goes on with
unsigned", and that applies also to C++ -- it's the same.
This doesn't mean that using 'unsigned char' is a good solution. It means that
mixing unsigned and signed, as the 'is_lower' standard lib function does, is a
recipe for disaster. It's just too darned easy to get the /usage/ wrong.
Here's an expression example:
a < b + n
With signed integer types and common not-overly-large values for a and b, this
can also be expressed as
a - n < b
However, if the type involved is an unsigned one, then instead of a - n possibly
producing a negative value it will in that case wrap around, with the result
that when the former expression is 'true', the latter expression is 'false'...
I.e., with unsigned arithmetic the usual math rules don't apply for ordinary
not-overly-large values -- which are the values most often occurring.
Most programmers are aware of this when writing e.g. a loop that counts down,
but keeping it in mind for expressions like the above is much much harder. IIRC
one almost grotesque example in recent years was when someone noticed that
Microsoft's code for rendering a picture had such a bug, which with a suitable
crafted picture made it possible to place arbitrary bytes in memory. This then
in turn allowed malware infection via Internet Explorer by simply presenting a
picture on your web site; uh oh, infected by a JPEG, oh dear.
-- Implicit promotion.
As a concrete example of implicit promotion, consider
<code>
#include <iostream>
#include <string>
std::string rightAdjusted( int n, std::string const& s )
{
if( s.length() >= n )
{
return s;
}
else
{
return std::string( n - s.length(), ' ' ) + s;
}
}
int main()
{
using namespace std;
for( int x = -5; x <= 5; ++x )
{
cout << rightAdjusted( x*x - 4, "o" ) << endl;
}
}
</code>
<result>
o
o
o
o
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
</result>
Evidently there's something wrong. And e.g. the g++ compiler warns about that,
"warning: comparison between signed and unsigned integer expressions".
In the comparision
s.length() >= n
the result of the 'length' method is unsigned, while 'n' is signed. The unsigned
type is at least as large as the signed type. And so the signed value, 'n', is
promoted to unsigned, which when it's negative effectively means adding M.
Which is very large...
And so the comparision produces 'false', and the second return path is taken,
evaluating
std::string( n - s.length(), ' ' )
where again there is a promotion to unsigned type (in the first argument),
yielding a very very large value when n is negative. Hence the crash.
I.e. this happens not only with comparisions but also in simple arithmetic
expressions, where the compiler will usually /not/ warn you, as g++ didn't for
the above expression. It is real pitfall, a very common error. But it's very
simple to avoid, like just putting on a condom: simply don't introduce unsigned
values in the first place, e.g., define & use a signed type 'size' function.
Alternatively it can be avoided by always keeping in mind whether some value
might be unsigned and dealing with such values by special cases, by remembering
that expressions cannot be simply rewritten using the usual math rules, and by
whatnot. This is like the method of pulling out at the last instance, hoping
that that will not only prevent a possible pregnancy but also any veneral
disease. It's much harder to do, it's more work, and it's brittle.
Cheers & hth.,
- Alf