double cast to int reliable?

sandeep

In the following code:

int i,j;
double d;

i = 1; // or any integer
d = (double)i;
j = (int)d;

Is there any chance that "i" will not equal "j" due to the double
being stored inexactly?
 
Sjouke Burry

sandeep said:
In the following code:

int i,j;
double d;

i = 1; // or any integer
d = (double)i;
j = (int)d;

Is there any chance that "i" will not equal "j" due to the double
being stored inexactly?
Yep. Rounding while converting to double will for most integers
mean that the double is slightly smaller than the int.
Converting back to int will then not give you the original.
 
Dann Corbit

In the following code:

int i,j;
double d;

i = 1; // or any integer
d = (double)i;
j = (int)d;

Is there any chance that "i" will not equal "j" due to the double
being stored inexactly?

It is possible for int to be 64 bits, and represent values as large as
(for instance) 9223372036854775807.

It is possible for double to have as little as 6-7 significant digits,
though for the most part you will see 15-16 significant digits.

It is possible (though unlikely) to see the problem you describe. I
would be very surprised to see a system with 64 bit ints and 32 bit
doubles. But even with 64 bit ints and 64 bit doubles, the int values
will have greater precision because they do not store an exponent.

Far more likely is this:

double d = <some value>;
int i = d; /* undefined behavior if the truncated value is out of range of int */

At this point, i may not be equal to floor(d) if floor(d) is not
representable as an integer.

There will always be some situations where information can be lost
because it is very unlikely that the precisions are identical.

Hence, if you want the integral value that is stored in a double, far
better is:

double integral_part = floor(some_double);
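For illustration, a checked conversion built on that idea might look like the
sketch below. The helper name double_to_int_checked is made up here, and the
range test assumes INT_MIN and INT_MAX are themselves exactly representable in
double (true for the common 32-bit int with IEEE 754 doubles):

#include <limits.h>
#include <math.h>
#include <stdio.h>

/* Hypothetical helper: convert only when the integral part of d is known
   to fit in an int; otherwise report failure instead of invoking the
   undefined behavior mentioned above. */
static int double_to_int_checked(double d, int *out)
{
    double ip = floor(d);                     /* integral part of d */
    if (ip < (double)INT_MIN || ip > (double)INT_MAX)
        return 0;                             /* does not fit; do not convert */
    *out = (int)ip;
    return 1;
}

int main(void)
{
    int i;
    if (double_to_int_checked(1e20, &i))
        printf("converted: %d\n", i);
    else
        puts("value does not fit in int");
    return 0;
}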
 
Dann Corbit

Yep. Rounding while converting to double will for most integers
mean that the double is slightly smaller than the int.
Converting back to int will then not give you the original.

Since he specified an integer assignment, the difficulties are not due to rounding, as I see it.
 
Tim Streater

Sjouke Burry said:
Yep. Rounding while converting to double will for most integers
mean that the double is slightly smaller than the int.
Converting back to int will then not give you the original.

Won't this be exact if the integer in question occupies fewer bits than
the mantissa size in bits? On the CDC 6600 (60-bit word), all integer
arithmetic was in fact done by the floating point unit (apart from
integer addition), so integers were limited to 48 bits (mantissa length).
 
Ben Pfaff

Dann Corbit said:
It is possible for double to have as little as 6-7 significant digits,
though for the most part you will see 15-16 significant digits.

The 'float' type must have at least 6 significant digits.
The 'double' and 'long double' types must have at least 10
significant digits.
 
Ben Bacarisse

Tim Streater said:
Won't this be exact if the integer in question occupies fewer bits than
the mantissa size in bits?

Yes. 6.3.1.4 p2 (part of the section on conversions) starts:

When a value of integer type is converted to a real floating type, if
the value being converted can be represented exactly in the new type,
it is unchanged.

The standard does not use the term mantissa but section 5.2.4.2.2
("Characteristics of floating types") defines C's model of floating
types in such a way that the expected range of integers will be exactly
representable.

<snip>
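For anyone curious what that model guarantees on a particular implementation,
a purely illustrative way to see the relevant parameters is to print the
<float.h> macros; with FLT_RADIX == 2, every integer of magnitude up to
2^DBL_MANT_DIG converts to double exactly:

#include <float.h>
#include <stdio.h>

int main(void)
{
    /* Parameters of the 5.2.4.2.2 floating-point model on this implementation. */
    printf("FLT_RADIX     = %d\n", FLT_RADIX);
    printf("FLT_MANT_DIG  = %d\n", FLT_MANT_DIG);
    printf("DBL_MANT_DIG  = %d\n", DBL_MANT_DIG);
    printf("LDBL_MANT_DIG = %d\n", LDBL_MANT_DIG);
    return 0;
}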
 
kathir

In the following code:

int i,j;
double d;

i = 1; // or any integer
d = (double)i;
j = (int)d;

Is there any chance that "i" will not equal "j" due to the double
being stored inexactly?

Floating point numbers are stored internally in a different way, using a
mantissa and an exponent portion. If you do any floating point
calculation (multiplication or division) and convert back to integer,
you may see a minor difference between the int and the double value.
To understand the bit pattern of floating point numbers, visit
http://softwareandfinance.com/Research_Floating_Point_Ind.html

Thanks and Regards,
Kathir
http://programming.softwareandfinance.com
 
Nick Keighley

Yep. Rounding while converting to double will for most integers
mean that the double is slightly smaller than the int.
Converting back to int will then not give you the original.

really? Can you name an implementation where this is so? Is it a valid
implementation of C?
 
Nick Keighley

Floating point numbers are stored internally in a different way, using a
mantissa and an exponent portion. If you do any floating point
calculation (multiplication or division) and convert back to integer,
you may see a minor difference between the int and the double value.

Depending on what operations you do, you might see huge differences.
 
Seebs

really? Can you name an implementation where this is so? Is it a valid
implementation of C?

The obvious case would be a machine where both int and double are 64-bit,
at which point, it's pretty obvious that for the vast majority of positive
integers, the conversion to double will at the very least change the
value, and I think I've seen it round down, so...

-s
 
Keith Thompson

Seebs said:
The obvious case would be a machine where both int and double are 64-bit,
at which point, it's pretty obvious that for the vast majority of positive
integers, the conversion to double will at the very least change the
value, and I think I've seen it round down, so...

Round down or round to zero? If the latter, then it's not the case
that "most" integers yield a slightly smaller double when converted
(unless "smaller" means closer to zero). But yes, this is just
nitpicking.

The point is that the standard requires the conversion of an integer
to a floating-point type to yield an exact result when that result
can be represented (C99 6.3.1.4), and the floating-point model
imposed by C99 5.2.4.2.2 implies that a fairly wide range of integer
values must be exactly representable. That range might not cover
the full range of any integer type (even long double might not be
able to represent CHAR_MAX if CHAR_BIT is big enough).

In particular, converting the value 1 from int to double and back
to int is guaranteed to yield 1; if it doesn't, your implementation
is non-conforming.

There's a common idea that floating-point values can never be
anything more than approximations, and that no floating-point
operation is guaranteed to yield an exact result, but the reality
of it isn't that simple. It might be safer to *assume* that all
such operations are approximate but there are things you can get
away with if you know what you're doing. The trouble is that, even
if you know what you're doing, it can be very easy to accidentally
get outside the range in which the guarantees apply; you can use
double to represent exact integers, but there's no warning when you
exceed the range where that works.
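A small demonstration of both the guarantee and its limit (the second half
assumes an IEEE 754 double with a 53-bit significand, which the standard does
not require):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* 1 survives int -> double -> int on any conforming implementation. */
    int i = 1;
    int j = (int)(double)i;
    printf("1 round-trips: %s\n", i == j ? "yes" : "no");

    /* Beyond the significand width there is no such guarantee.  With an
       IEEE 754 double (53-bit significand), 2^53 + 1 silently collapses
       to 2^53 on conversion. */
    uint64_t big = ((uint64_t)1 << 53) + 1;
    uint64_t back = (uint64_t)(double)big;
    printf("2^53+1 round-trips: %s\n", big == back ? "yes" : "no");
    return 0;
}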
 
Ersek, Laszlo

The trouble is that, even if you know what you're doing, it can be very
easy to accidentally get outside the range in which the guarantees
apply; you can use double to represent exact integers, but there's no
warning when you exceed the range where that works.

For any unsigned type that has no more bits than 612,787,565,149,966; that
is, any conceivable unsigned type, the following is a sufficient condition
to store any value of said type in a "long double":

((long long unsigned)sizeof(utype) * CHAR_BIT * 30103 + 99999) / 100000
<= LDBL_DIG

For uint32_t, the left side evaluates to 10, and both DBL_DIG and LDBL_DIG
must be at least 10 on any conformant platform.

After the conversion to the chosen floating point type, eg. long double,
one must track the possible ranges in every floating point expression
involved, and make sure that any evaluation can't exceed "limit", which
can be initialized like this:

char lim_str[LDBL_DIG + 1] = "";  /* zero-initialized, so the final byte stays '\0' */
long double limit;

/* Fill LDBL_DIG '9' characters and parse them: limit becomes 10^LDBL_DIG - 1. */
(void)sscanf(memset(lim_str, '9', LDBL_DIG), "%Lf", &limit);

(Of course not exceeding this bound may not be sufficient for converting
back to "utype", but since "(utype)-1" itself was convertible, this final
condition is only a simple comparison away.)

--o--

The number of full decimal digits needed to represent the C value
"(utype)-1" is given by the math expression

ceil(log_10(2 ** numbits - 1))

"numbits" being the number of value bits in "utype". It is safe to assume
(or rather, we have to assume) that all bits are value bits. Continuing
with math formulas, and exploiting log_10 being strictly monotonic and
ceil being monotonic,

ceil(log_10(2 ** numbits - 1))
<= ceil(log_10(2 ** numbits ))
== ceil(numbits * log_10(2))
<= ceil(numbits * (30103 / 100000))
== ceil(numbits * 30103 / 100000)

which equals the value of the math expression

floor( (numbits * 30103 + (100000 - 1)) / 100000 )

Therefore, this integer value is not less than the number of full decimal
digits needed. As "numbits" increases, this value becomes greater than the
exact number of decimal places required. The speed of divergence is
determined by the accuracy of 30103 / 100000 approximating log_10(2), but
I'm too lazy to try to calculate that relationship.

BTW, 30103 and 100000 are coprimes (30103 is a prime in its own right),
thus the smallest positive "numbits" where "numbits * 30103" is an
integral multiple of 100000 is 100000, which would still make for quite a
big integer type. Hence we can assume that the remainder of the modular
division "numbits * 30103 / 100000" is always nonzero, and the last
ceiling math expression could be rewritten as

floor(numbits * 30103 / 100000) + 1

This simplifies the initial C expression to

(long long unsigned)sizeof(utype) * CHAR_BIT * 30103 / 100000 < LDBL_DIG

Unfortunately, the entire approach falls on its face with uint64_t and an
extended precision (1 + 15 + 64 = 80 bits) "long double", even though the
significand has the required number of bits available. (As said above, the
condition is only sufficient, not necessary.)

The problem is that the method above works with entire base 10 digits. The
decimal representation of UINT64_MAX needs 20 places (19 full places and a
"fractional place", rounded up to 20), but the 64 bit significand only
provides for 19 whole decimal places, and the comparison is done in whole
decimal places. What's worse, an extended precision "long double" can only
allow for an LDBL_DIG of 18 (as my platform defines it), presumably
because (and I'm peeking at C99 5.2.4.2.2 p8) "long double" must
"accomodate" not only integers with LDBL_DIG decimal places, but also any
decimal fraction with LDBL_DIG digits. The exponent of the "long double"
stores the position of the *binary* point, not that of the *decimal*
point, and this probably sacrifices a further decimal digit.

(I gave you some material to shred, please be gentle while shredding.)

Cheers,
lacos
 
Keith Thompson

Ersek said:
For any unsigned type that has no more bits than 612,787,565,149,966; that
is, any conceivable unsigned type, the following is a sufficient condition
to store any value of said type in a "long double":

((long long unsigned)sizeof(utype) * CHAR_BIT * 30103 + 99999) / 100000
<= LDBL_DIG

612,787,565,149,966 can be represented in 50 bits.
unsigned long long is at least 64 bits.

Inconceivable? "I do not think that word means what you think it means."

[snip]
 
Ersek, Laszlo

612,787,565,149,966 can be represented in 50 bits.
unsigned long long is at least 64 bits.

Inconceivable? "I do not think that word means what you think it means."

I believe I wasn't formulating my point carefully enough. Verbatim quote,
with emphasis added:


The range of such an unsigned type would be

[0 .. 2 ** 612,787,565,149,966 - 1].

The limit is not arbitrary, it is (for the smallest allowed ULLONG_MAX):

(ULLONG_MAX - 99999) / 30103

expressed in C. "unsigned long long" doesn't need to cover the range of
the type in question, it must be able to represent the *number of bits* in
it.

Cheers,
lacos
 
Ben Bacarisse

Keith Thompson said:
612,787,565,149,966 can be represented in 50 bits.
unsigned long long is at least 64 bits.

Inconceivable? "I do not think that word means what you think it
means."

I'm pretty sure it's a word order confusion. I think he intended "any
unsigned type that has no more than 612,787,565,149,966 bits". That's
the maximum number of bits that won't cause the quoted expression to
fail. I.e. for more than that number of bits, long long unsigned is not
guaranteed to be able to represent the result.

Some people might still conceive of such types, but the term is not
nearly so outlandish in that context.
 
Ersek, Laszlo

I'm pretty sure it's a word order confusion. I think he intended "any
unsigned type that has no more than 612,787,565,149,966 bits".

Yes, thank you. I guess 18 hours of sleep accumulated over four nights is
not too much. :(

(I don't need decaf, it's my DSPS [0] that doesn't cooperate with the
"strictly scheduled" training of this week. It's 01:04 AM in local time,
again.)

Cheers,
lacos

[0] http://en.wikipedia.org/wiki/Delayed_sleep_phase_syndrome
 
Seebs

Round down or round to zero? If the latter, then it's not the case
that "most" integers yield a slightly smaller double when converted
(unless "smaller" means closer to zero). But yes, this is just
nitpicking.

Which is why I put "positive" in there.
The point is that the standard requires the conversion of an integer
to a floating-point type to yield an exact result when that result
can be represented (C99 6.3.1.4), and the floating-point model
imposed by C99 5.2.4.2.2 implies that a fairly wide range of integer
values must be exactly representable. That range might not cover
the full range of any integer type (even long double might not be
able to represent CHAR_MAX if CHAR_BIT is big enough).

Right.

But the obvious case would be 64-bit int and 64-bit double. Look at it
this way. Assume a typical mantissa/exponent system, and assume that there
are 58 bits of mantissa. Every integer that fits in 58 bits can be represented
exactly, half of the numbers in the 59-bit range can be, a quarter of
the numbers in the 60-bit range... And it turns out that this means that,
of the 63-bit range of int, only a small fraction of values (rough order of
1/16?) can be represented exactly in a double.

Now, as it happens, 99% of the numbers I've ever used in a C program are
in that range.
The trouble is that, even
if you know what you're doing, it can be very easy to accidentally
get outside the range in which the guarantees apply; you can use
double to represent exact integers, but there's no warning when you
exceed the range where that works.

Yes.

For plain float, on the systems I've tried, the boundary seems to be about
2^24; 2^24+1 cannot be represented exactly in a 32-bit float. I wouldn't
be surprised to find that double came out somewhere near 2^48+1 as the first
positive integer value that couldn't be represented.
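A quick check of the float case, assuming the usual 24-bit significand:

#include <stdio.h>

int main(void)
{
    /* With an IEEE 754 single-precision float (24-bit significand),
       2^24 + 1 is the first positive integer a float cannot hold exactly. */
    long n = (1L << 24) + 1;                       /* 16777217 */
    float f = (float)n;
    printf("%ld -> float -> %ld\n", n, (long)f);   /* 16777217 -> float -> 16777216 */
    return 0;
}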

-s
 
Keith Thompson

Ersek said:
I believe I wasn't formulating my point carefully enough. Verbatim
quote, with emphasis added:

Ok, I see what you mean. ("no more than ... bits" would have been
clearer.)
The range of such an unsigned type would be

[0 .. 2 ** 612,787,565,149,966 - 1].

The limit is not arbitrary, it is (for the smallest allowed ULLONG_MAX):

(ULLONG_MAX - 99999) / 30103

expressed in C. "unsigned long long" doesn't need to cover the range
of the type in question, it must be able to represent the *number of
bits* in it.

And the formula doesn't say "yes" for smaller types and "no" for
bigger ones; it breaks down for really huge types, right?

When I have time, I'll have to go back and re-read what you wrote.
 
Keith Thompson

Seebs said:
Which is why I put "positive" in there.

Which I very cleverly failed to notice. *sigh*
Right.

But the obvious case would be 64-bit int and 64-bit double. Look at it
this way. Assume a typical mantissa/exponent system, and assume that there
are 58 bits of mantissa. Every integer that fits in 58 bits can be represented
exactly, half of the numbers in the 59-bit range can be, a quarter of
the numbers in the 60-bit range... And it turns out that this means that,
of the 63-bit range of int, only a small fraction of values (rough order of
1/16?) can be represented exactly in a double.

Yes, that's the obvious case. My point, which I didn't express very
clearly, is that it's possible that *every* integer type has values
that can't be exactly represented in *any* floating-point type.
I know of no such systems in real life, but a system where everything
from char to long long and from float to long double is exactly 64
bits is certainly plausible (the Cray T90 I keep bringing up made
char 8 bits only for compatibility with other Unix-like systems; all
other arithmetic types were 64 bits).

An implementation could have integer values that don't just lose precision
but *overflow* when converted to a floating-point type. On my
system, FLT_MAX is slightly less than 2**128, so (float)UINT128_MAX
would overflow if uint128_t existed.
Now, as it happens, 99% of the numbers I've ever used in a C program are
in that range.

You counted? :-)
Yes.

For plain float, on the systems I've tried, the boundary seems to be about
2^24; 2^24+1 cannot be represented exactly in a 32-bit float. I wouldn't
be surprised to find that double came out somewhere near 2^48+1 as the first
positive integer value that couldn't be represented.

It's more likely to be 2^53+1, assuming IEEE floating-point; look at the
values of FLT_MANT_DIG and DBL_MANT_DIG.
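A sketch that derives that boundary from DBL_MANT_DIG directly (assuming
FLT_RADIX == 2 and DBL_MANT_DIG < 64 so the shift below is defined):

#include <float.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* With FLT_RADIX == 2, the first positive integer that double cannot
       represent exactly is 2^DBL_MANT_DIG + 1 (2^53 + 1 for IEEE 754). */
    uint64_t first_gap = ((uint64_t)1 << DBL_MANT_DIG) + 1;

    printf("DBL_MANT_DIG = %d\n", DBL_MANT_DIG);
    printf("first integer double cannot hold exactly: %llu\n",
           (unsigned long long)first_gap);
    printf("survives a double round trip: %s\n",
           (uint64_t)(double)first_gap == first_gap ? "yes" : "no");
    return 0;
}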
 
