James Kanze
James said:
[...]
Andy Champ wrote:
I don't think it does.
The results of floating point arithmetic are by and large
implementation defined (e.g., there is no guarantee that
addition of two floats yields a best possible approximation
to the sum of the represented numbers), but one could argue
that the standard requires addition to yield the actual sum
if the sum is representable.
The standard says in [5.7/3]:
The result of the binary + operator is the sum of the operands.
...
So, _if_ the involved floats are rational numbers (i.e. not
NAN or something like that) _and_ their sum is a float, then I
don't see where the standard grants an exception. Surely,
[5/5] doesn't apply:
If during the evaluation of an expression, the result is not
mathematically defined or not in the range of representable
values for its type, the behavior is undefined, ...
since the hypotheses are designed to avoid that case.
I'm not sure. As far as I can tell, the definition of 'sum' and
'product' for a float is implementation defined. It's certainly
not the definition which applies for real numbers, since this
would make the first statement you quote unimplementable.
The C++ standard actually says very little about the required
representation: "The value representation of floating-point
types is implementation defined." And nothing, as far as I can
see, about the meaning of addition over this value
representation. The requirements on <limits> and <cfloat>
introduce additional requirements concerning the representation,
but not (as far as I can see) concerning what happens when you
add two values.
The C standard is a lot more specific, imposing a classical
floating point representation (more or less---the as if rule
still applies); it specifically states (§5.2.4.2.2/4 in
C99---although I don't have access to a copy of C90 here, I'm
pretty sure that this text is unchanged):
The accuracy of the floating point operations (+, -, *,
/) and of the library functions in <math.h> and
<complex.h> that return floating-point results is
implementation defined. The implementation may state
that the accuracy is unknown.
Note that this section of the C standard is explicitly included
by reference in C++, in §18.2.2/4. Not where you'd particularly
expect to find it, but it is there.
Personally, I'd consider any differences between C and C++ with
regard to the behavior of the basic types a defect in the C++
standard.
I also think that we can generally count on more than what the
standard guarantees. Most people today are working on machines
with IEEE floats, and IEEE does define and require a lot more.
In particular, it defines exactly what the results should be for
all of the four basic operations, for all possible values
(at least as long as they do not involve overflow or
underflow---I think an implementation is allowed to make the
behavior there programmable).
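To make the IEEE point concrete, here is a minimal sketch, assuming a C++11 compiler. It asks the implementation whether float claims IEEE 754 (IEC 60559) conformance and, if so, checks one case where both operands and their sum are exactly representable. The helper names (float_is_ieee, add_as_float, exact_sum_holds, check_exact_sum) are invented for this illustration; the volatile store is only a blunt way to make sure the result really is rounded to float on platforms that keep wider intermediates.

```cpp
#include <cassert>
#include <limits>

// True when the implementation claims IEEE 754 (IEC 60559) conformance
// for float. Without that claim, nothing below is guaranteed.
constexpr bool float_is_ieee() {
    return std::numeric_limits<float>::is_iec559;
}

// Adds two floats and stores the result in a float object, so that any
// extended-precision intermediate is rounded to float before comparing.
float add_as_float(float a, float b) {
    volatile float sum = a + b;
    return sum;
}

// 0.5, 0.25 and 0.75 are all exactly representable in a binary
// floating point format, so an exact sum exists as a float.
bool exact_sum_holds() {
    return add_as_float(0.5f, 0.25f) == 0.75f;
}

// Only asserts when the implementation claims IEEE conformance.
void check_exact_sum() {
    if (float_is_ieee()) {
        assert(exact_sum_holds());
    }
}
```

Calling check_exact_sum() does nothing on an implementation that does not claim IEEE conformance, which matches the point that the language standard by itself promises very little here.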
There's also the fact that other languages (e.g. Fortran) have
stricter requirements. (I remember the first Unix machine I
worked on. The people writing the Fortran compiler implemented
the Fortran intrinsics using the C math library...and found that
they didn't pass the Fortran validation suites we had. Some of
the library functions only had about four decimal digits
precision.)
True, but that only applies when the sum is _not_
representable as a float. In that case, you usually want a
best approximation and the standard does not go there.
See above. C quite explicitly allows a large degree of leeway.
As far as compatibility with the standard is concerned, I
think that a floating point format where each value is
irregular (i.e., not a real number but NAN or the like) would
be compatible with the standard. However, in that case the
if-part of my statement is never satisfied.
I think you could make a pretty strong argument (based on the
minimum requirements of §18.2 and §5.2.4.2.2 in the C standard)
that a float must be capable of representing at least 150
million different numeric values: FLT_DIG must be at least 6
(one million different values for a given exponent), and the
difference between FLT_MAX_EXP and FLT_MIN_EXP implies at least
150 different values for the exponent.
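The counting argument can be checked against a given implementation with a short sketch like the following (C++11 assumed). The static_asserts use the minimum magnitudes that C99 §5.2.4.2.2 gives for FLT_DIG, FLT_MAX_10_EXP and FLT_MIN_10_EXP; note that the standard states its exponent minimums for the decimal macros, so the count of radix exponents computed from FLT_MAX_EXP and FLT_MIN_EXP is whatever the platform happens to provide. The function names are invented for this illustration.

```cpp
#include <cfloat>

// Minimum requirements listed in C99 5.2.4.2.2.
static_assert(FLT_DIG >= 6, "float must carry at least 6 decimal digits");
static_assert(FLT_MAX_10_EXP >= 37, "float must reach at least 1e37");
static_assert(FLT_MIN_10_EXP <= -37, "float must reach down to at least 1e-37");

// 10 to the power n, for small non-negative n.
constexpr long long ten_to_the(int n) {
    return n == 0 ? 1 : 10 * ten_to_the(n - 1);
}

// Number of distinct radix exponents on this implementation.
constexpr int float_exponent_count() {
    return FLT_MAX_EXP - FLT_MIN_EXP + 1;
}

// Rough lower bound on the number of distinct float values, following
// the reasoning above: 10^FLT_DIG values per exponent, times the
// number of exponents.
constexpr long long float_value_lower_bound() {
    return ten_to_the(FLT_DIG) * float_exponent_count();
}
```

On an implementation with IEEE single precision floats, FLT_MAX_EXP is 128 and FLT_MIN_EXP is -125, so the exponent count is 254 and the bound comes out well above 150 million.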