double precise enough for unsigned short?

A

Andy

Hello,

I will be storing a series of unsigned shorts (C language)
as doubles. I used gcc 3.2 to print out a bunch of
sizeof statements, and unsigned shorts are 4 bytes.
Doubles are 8 bytes, with the fractional part being 52 bits.
Everything seems to indicate that double has more than
enough precision to represent all possible values of
unsigned short. I'm coding with this assumption in mind.
Just to be sure, can anyone confirm this? My simplistic
reason for assuming this is that 32 bits of the fractional
part of the double can be made nonfractional by shifting
by 32. That's enough nonfractional bits to represent all
32 bits of an unsigned short. To indicate this 32-bit
shift, the exponent part of the double must be 6 bits. This
easily fits into the 11 bits for the exponent of a double.

Despite this justification, I just want to be sure that I'm
not missing something conceptually about the IEEE
number representation. I recall vaguely that the numbers
represented by double are not uniformly distributed, but
the above reasoning seems to indicate that they are good
enough for unsigned shorts.

Thanks.
 
S

SG

I will be storing a series of unsigned shorts (C language)
as doubles.  I used gcc 3.2 to print out a bunch of
sizeof statements, and unsigned shorts are 4 bytes.

Interesting. What platform is this?
Doubles are 8 bytes, with the fractional part being 52 bits.
Everything seems to indicate that double has more than
enough precision to represent all possible values of
unsigned short.  I'm coding with this assumption in mind.
Just to be sure, can anyone confirm this?

There is (as far as I know) no upper bound on what values can be
represented with unsigned shorts. Typically shorts are 16 bits wide
but could be wider.

If I remember correctly, the C standard requires the macro DBL_DIG
(see float.h) to evaluate to an int of at least 10. So you have at
least 10 significant digits. For IEEE 754 doubles (52 bit mantissas)
you have even more precision. But 10 digits is enough to store 16 bit
integers losslessly. And it might be just enough to store 32 bit
integers losslessly (not 100% about that). But IEEE 754 doubles
certainly will do so.
Despite this justification, I just want to be sure that I'm
not missing something conceptually about the IEEE
number representation.

No, I don't think that you missed something -- except maybe that short
is not restricted in size. If cases where short is 64 bits (or more)
wide an IEEE-754 double won't be able to losslessly hold all short
values.

Cheers!
SG
 
B

bart.c

Andy said:
Hello,

I will be storing a series of unsigned shorts (C language)
as doubles. I used gcc 3.2 to print out a bunch of
sizeof statements, and unsigned shorts are 4 bytes.
Doubles are 8 bytes, with the fractional part being 52 bits.
Everything seems to indicate that double has more than
enough precision to represent all possible values of
unsigned short. I'm coding with this assumption in mind.
Just to be sure, can anyone confirm this? My simplistic
reason for assuming this is that 32 bits of the fractional
part of the double can be made nonfractional by shifting
by 32. That's enough nonfractional bits to represent all
32 bits of an unsigned short. To indicate this 32-bit
shift, the exponent part of the double must be 6 bits. This
easily fits into the 11 bits for the exponent of a double.

Despite this justification, I just want to be sure that I'm
not missing something conceptually about the IEEE
number representation. I recall vaguely that the numbers
represented by double are not uniformly distributed, but
the above reasoning seems to indicate that they are good
enough for unsigned shorts.

You might just try testing every possible value:

#include <stdio.h>
#include <limits.h>

int main(void){
unsigned short i,j;
double x;

i=0;
do {
x=i;
j=x;
if (i!=j)
printf("Can't represent %u as double\n",i);
++i;
} while (i!=0);
}

(Turn off optimisation)
 
B

bart.c

Richard Heathfield said:
If there is genuine cause for concern, unsigned short is sufficiently
large to make your program rather tedious to run.

16-bit shorts are instant. 32-bit ones took a few minutes on my computer,
but the test is only done once.

With 64-bits shorts, you already know they can't be represented.
It would make more
sense to binary-search. Start at USHRT_MAX, and play the max/min
game.

You mean, to find the point in the range where it stops working? More
sensible then to just test 0 and USHRT_MAX then.

You need to test every possible value if there are any doubts being able to
represent each one (for example if the conversion to and from double is more
elaborate than simple assignment).
By all means turn off machine optimisation if you think it'll make a
difference,

It's just to stop the compiler possibly using a register floating point
value (which on my machine has more precision than double).
 
S

SG

You need to test every possible value if there are any doubts being able to
represent each one (for example if the conversion to and from double is more
elaborate than simple assignment).

I think it's safe to assume that the differences between consecutive
representable numbers is a monotonically increasing function. In that
case you only need to check USHRT_MAX and USHRT_MAX-1.

But something of the form

DBL_EPSILON * USHRT_MAX < some_threshold

for some value of some_threshold (between 0.4 and 1.0) is probably a
better way to test this.
It's just to stop the compiler possibly using a register floating point
value (which on my machine has more precision than double).

Aren't there rules that prohibit a compiler from using such
intermetiate high precision representations beyond the evaluation of a
full expression? I'm not sure. But I think I read something like that
somewhere ... If so, it should suffice to store the double value in a
variable in one expression and read/use the variable's value in
another expression a la

double x = ...;
...x... // new expression

(?)


Cheers!
SG
 
E

Eric Sosman

If there is genuine cause for concern, unsigned short is sufficiently
large to make your program rather tedious to run. It would make more
sense to binary-search. Start at USHRT_MAX, and play the max/min
game.

Not sure how you'd apply binary search to this problem, since
there's no obvious "order:" each test tells you that a given value
does or does not garble when converted to double and back, and gives
no obvious clue about what happens to larger and smaller values.
Did you have some other order relation in mind?

Anyhow, I think the only value one need test is USHRT_MAX itself,
unless FLT_RADIX is something really, really weird. For the common
FLT_RADIX values of 2 and 16 (and even for the uncommon value 10),
USHRT_MAX will need at least as many non-zero fraction digits as
any other unsigned short value.
 
B

bart.c

Richard Heathfield said:
Why? If an implementation has 64-bit short ints, maybe it also has 256-bit
doubles.

I think a 64-bit short/64-bit double implementation would be far more
common. But given 64-bit short/256-bit double, then probably a different
approach is needed, or just more confidence that all 64-bit values are in
fact representable.
 
K

Keith Thompson

Richard Heathfield said:
Eric said:
<snip>

You might just try testing every possible value:
[snip]

If there is genuine cause for concern, unsigned short is sufficiently
large to make your program rather tedious to run. It would make more
sense to binary-search. Start at USHRT_MAX, and play the max/min
game.

Not sure how you'd apply binary search to this problem, since
there's no obvious "order:" each test tells you that a given value
does or does not garble when converted to double and back, and gives
no obvious clue about what happens to larger and smaller values.
Did you have some other order relation in mind?

Oh, it was just a faulty (albeit probably fairly accurate!) assumption -
that double will be able to represent integers precisely, up to a
certain point, but not (reliably) beyond that point. If there is such a
point for unsigned short ints, it would be nice to know what it is.

But since the non-representable values aren't consecutive, a simple
binary search might not find the smallest one.

It's probably safe to assume that if N-1 and N are both exactly
representable, then all values from 0 to N are exactly representable.
Given that assumption, you could do a binary search to find the
smallest non-representable value.
(Having said that, I suspect that the range of contiguous integer values
starting from 0 and exactly representable by double is likely to exceed
USHRT_MAX on any "normal" implementation, so there is probably no such
point.)

Well, I've used a system (Cray T90) where both short and double
were 64 bits. But short may have had padding bits; I never had a
chance to check.
 
M

Malcolm McLean

I will be storing a series of unsigned shorts (C language)
as doubles.  
Double is usually capable of representing every possible short, long
or int. However some very small compilers have very low-precision
floating point representation (this is not compliant), and some very
large systems have shorts with 64 bits and doubles with 64 bits (this
is compliant, but normally you would only use the lower two bytes of a
short).
 
R

robertwessel2

Double is usually capable of representing every possible short, long
or int. However some very small compilers have very low-precision
floating point representation (this is not compliant), and some very
large systems have shorts with 64 bits and doubles with 64 bits (this
is compliant, but normally you would only use the lower two bytes of a
short).


Certainly not true for any of the 64 bit *nix platforms I'm aware of,
which have 64 bit doubles and 64 bit longs. And that would be the
case for any other LP64 or ILP64 system (and I'm sure there's a 64-bit
*nix implementation out there somewhat that *isn't* LP64, but I don't
know it). Windows, OTOH, is LLP64 (32 bit longs, 64 bit long longs).
 
K

Keith Thompson

Certainly not true for any of the 64 bit *nix platforms I'm aware of,
which have 64 bit doubles and 64 bit longs. And that would be the
case for any other LP64 or ILP64 system (and I'm sure there's a 64-bit
*nix implementation out there somewhat that *isn't* LP64, but I don't
know it). Windows, OTOH, is LLP64 (32 bit longs, 64 bit long longs).

As I mentioned elsethread, the Cray T90 is (was?) a Unix system with
64-bit shorts and 64-bit doubles. The more recent SV1 has the same
characteristics, and other Cray vector systems are probably similar.
But short could well have a significant number of padding bits.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,954
Messages
2,570,116
Members
46,704
Latest member
BernadineF

Latest Threads

Top