Golden California Girls said:
I do see that C doesn't require much of anything for floating math.
It does seem to permit useless floating point.
It permits a wide variety of floating-point implementations, because
there are a wide variety of floating-point implementations in the real
world, and declaring most such implementations non-conforming likely
would have killed the C language.
Now C *could* have defined a set of conditions looser than what's in
IEC 60559 while still guaranteeing that, for example, 1.0+1.0==2.0 and
being consistent with all the real-world implementations. (The Ada
standard does something like this.) But the authors chose instead
just to rely on the implementations to do the right thing -- which, in
most cases, they do.
What is the value of FLT_EVAL_METHOD?
2, which means to "evaluate all operations and constants to the range
and precision of the long double type", but I don't think that's
relevant.
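Anyone can check these values on their own system; here's a minimal
sketch (just the standard <float.h> macros, nothing specific to the
program under discussion):

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* FLT_EVAL_METHOD (C99) describes how the implementation
       evaluates floating expressions; DBL_EPSILON is the gap
       between 1.0 and the next representable double above it. */
    printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);
    printf("DBL_EPSILON     = %.17g\n", DBL_EPSILON);
    return 0;
}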
Or, put another way, have you been lied to about the value of
DBL_EPSILON for computations?
No.
I just modified the above program to use long double rather than
double and got similar results, so the results don't depend on using
a wider type for calculations.
I computed the value of 2.0 - DBL_EPSILON. Both operands are exactly
representable values of type double. I stored the result, whose
mathematical value is also exactly representable in type double, in an
object of type double. Assuming a sane floating-point implementation,
or more specifically an IEC 60559 floating-point representation, I can
expect the exact mathematical result to be stored in the object.
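For concreteness, a sketch of that test, in both double and long
double (not the exact program from upthread, but the same idea):

#include <stdio.h>
#include <float.h>

int main(void)
{
    double d = 2.0 - DBL_EPSILON;          /* both operands exact */
    long double ld = 2.0L - LDBL_EPSILON;
    printf("2.0 - DBL_EPSILON   < 2.0 : %s\n", d < 2.0 ? "yes" : "no");
    printf("2.0L - LDBL_EPSILON < 2.0L: %s\n", ld < 2.0L ? "yes" : "no");
    return 0;
}

On an IEC 60559 implementation both lines print "yes".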
[...]
Let's take a 2-bit mantissa and count up. Our value of epsilon is then 1/2.
decimal   mantissa   exponent
0.5       10         -1
0.75      11         -1
1.0       10          0
1.5       11          0
2.0       10          1
3.0       11          1
4.0       10         10
6.0       11         10
Now let's take 2.0 and subtract our epsilon of 0.5:
2.0       10          1
0.5       10         -1
but we can't subtract the mantissas until the exponents match.
We shift and get:
2.0       10          1
0.5       00          1
Now we perform the subtraction and get:
2.0       10          1
2.0, 0.5, and 1.5 are all exactly representable in this floating-point
representation, so computing 2.0 - 0.5 should give me 1.5 in any sane
implementation. If you make assumptions about how the subtraction is
implemented (shifting mantissas and tweaking exponents, keeping
everything within your representation at all times), then you might
conclude that the calculation yields 2.0. Conclusion: your
assumptions are incorrect.
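That claim is easy to test; a quick sanity check (the volatile
qualifiers are only there to keep the compiler from folding the
subtraction at compile time):

#include <stdio.h>

int main(void)
{
    volatile double y = 2.0, z = 0.5;
    double x = y - z;  /* 2.0, 0.5, and 1.5 are all exactly
                          representable in type double */
    printf("2.0 - 0.5 == 1.5: %s\n", x == 1.5 ? "yes" : "no");
    return 0;
}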
FLT_EVAL_METHOD refers to results of C floating-point operations such
as ordinary subtraction. What you're talking about is intermediate
results that occur during the execution of that subtraction operation,
probably within the machine's floating-point unit. As I understand
it, FPUs do have to use extra bits internally (the usual guard,
round, and sticky bits) to get correct results.
The only thing that's visible to a C program is that if you compute
x = y - z;
where y, z, and the mathematical result of the subtraction are all
exactly representable in the given type, you'll get the right result
stored in x. (Again, the C standard doesn't require this, but IEC
60559 does.) What happens *within* the subtraction operation,
whatever magic the FPU performs to get the right answer, is outside
the scope of the language and, I suggest, of this discussion; all
that's relevant is that subtraction *works*.
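On an IEC 60559 implementation you can confirm that the stored
result is exactly the largest double below 2.0; nextafter() from
<math.h> gives that value directly (you may need to link with -lm):

#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void)
{
    double x = 2.0 - DBL_EPSILON;
    /* nextafter(2.0, 0.0) is the largest double less than 2.0 */
    printf("exact result: %s\n",
           x == nextafter(2.0, 0.0) ? "yes" : "no");
    return 0;
}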
I thought you said you could take 2.0 and subtract epsilon and get
the next lower number? The C standard does not guarantee that.
Right, I should have stated more clearly that I was assuming an IEC
60559 implementation.
[...]
Ah, the fun of the C un-standard!
I don't know what that's supposed to mean. The C standard leaves a
lot of floating-point issues unspecified because it has to, because
there are so many different floating-point implementations in the real
world. But C allows an implementation to claim IEC 60559 conformance,
and for such an implementation the results are reasonably well
defined.
BTW, to get the next lower representable value below 2.0 on an
implementation with FLT_EVAL_METHOD 0, subtract 2.0 * epsilon from
2.0 and then add epsilon, assuming a binary mantissa and exponent.
Or just subtract epsilon from 2.0 and let the FPU (or whatever the
system uses for floating-point computations) worry about using however
many extra bits it needs to get the right answer. I challenge you to
show me a system where that doesn't work, for any C floating-point
type you choose. (C doesn't forbid such a system, but I doubt that
you can find any in existence.)
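For what it's worth, the two recipes agree; here's a sketch comparing
the direct subtraction against the two-step version suggested above:

#include <stdio.h>
#include <float.h>

int main(void)
{
    double direct  = 2.0 - DBL_EPSILON;
    double twostep = 2.0 - 2.0 * DBL_EPSILON + DBL_EPSILON;
    printf("direct == twostep: %s\n",
           direct == twostep ? "yes" : "no");
    printf("both below 2.0   : %s\n",
           (direct < 2.0 && twostep < 2.0) ? "yes" : "no");
    return 0;
}

On an IEC 60559 implementation both lines print "yes"; every
intermediate value here is exactly representable, so no rounding
occurs at any step.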