float limits

ziller

Why is it that FLT_DIG (from <float.h>) is 6 while DBL_DIG is 15?

Doing the math, the mantissa for floats is 24 bits = 2^24-1 max value
= 16,777,215.0f. Anything 8-digit odd # greater than that will be
rounded off.
For doubles, the mantissa is 53 bits = 2^53-1 max value =
9,007,199,254,740,991.0l (that's an L). So 16 digit odd numbers
greater than that will be rounded off. To get the actual precision we
take log(base 10) of those numbers and get 7.22 and 15.95
respectively.

....floats have greater than 7 digits precision and doubles only
greater than 15 digits. So how does MS guarantee no rounding errors
for 15 digit doubles yet 6 digit floats (if I understand correctly,
the last digit of precision must be used to round off the number...the
numbers are not just truncated at 7 & 15 digits...)

Anything I'm missing for the doubles case? It looks like they should
be guaranteeing 14 digits.
 
Gordon Burditt

ziller said:
> Why is it that FLT_DIG (from <float.h>) is 6 while DBL_DIG is 15?
>
> Doing the math, the mantissa for floats is 24 bits = 2^24-1 max value
> = 16,777,215.0f. Anything 8-digit odd # greater than that will be
> rounded off.

I don't think you get to count the "hidden 1" bit that actually
is not stored in the number. The maximum mantissa is 2**24-1.
The minimum mantissa, without changing the exponent, is 2**23.
That's 2**23 combinations.

> For doubles, the mantissa is 53 bits = 2^53-1 max value =
> 9,007,199,254,740,991.0l (that's an L). So 16 digit odd numbers
> greater than that will be rounded off. To get the actual precision we
> take log(base 10) of those numbers and get 7.22 and 15.95
> respectively.

I think you should subtract .30 (log base 10 of 2, one bit) from
each of those, giving 6.92 and 15.65, respectively.

> ...floats have greater than 7 digits precision and doubles only
> greater than 15 digits. So how does MS guarantee no rounding errors
> for 15 digit doubles yet 6 digit floats (if I understand correctly,
> the last digit of precision must be used to round off the number...the
> numbers are not just truncated at 7 & 15 digits...)

It's not just Microsoft: FreeBSD has the same values for the i386
platform. And I believe both are correct.

> Anything I'm missing for the doubles case? It looks like they should
> be guaranteeing 14 digits.

ANSI C gives formulas for the constants in <float.h>.
if b = FLT_RADIX (the base) and p = FLT_MANT_DIG (digits in that base), then

FLT_DIG = floor((p-1)*log10(b) ) + (1 if b is a power of 10, 0 otherwise).

Gordon L. Burditt
 
Jack Klein

> Why is it that FLT_DIG (from <float.h>) is 6 while DBL_DIG is 15?

Because that is what the implementation documents that it provides, as
required by the C standard. FLT_DIG and DBL_DIG are required to be at
least 6 and 10 respectively.

> Doing the math, the mantissa for floats is 24 bits = 2^24-1 max value
> = 16,777,215.0f. Anything 8-digit odd # greater than that will be
> rounded off.
> For doubles, the mantissa is 53 bits = 2^53-1 max value =
> 9,007,199,254,740,991.0l (that's an L). So 16 digit odd numbers
> greater than that will be rounded off. To get the actual precision we
> take log(base 10) of those numbers and get 7.22 and 15.95
> respectively.
>
> ...floats have greater than 7 digits precision and doubles only
> greater than 15 digits. So how does MS guarantee no rounding errors
> for 15 digit doubles yet 6 digit floats (if I understand correctly,
> the last digit of precision must be used to round off the number...the
> numbers are not just truncated at 7 & 15 digits...)
>
> Anything I'm missing for the doubles case? It looks like they should
> be guaranteeing 14 digits.

What you are missing is that the C standard imposes no requirements
for "no rounding errors". In fact rounding errors are guaranteed in
almost all floating point operations.

The definition of those terms is spelled out clearly in the C standard,
and it says nothing at all about rounding errors. Basically, these
values represent the largest number of decimal digits that can be
fully represented in the floating point type.

If FLT_DIG is 6, that means that any integral value in the range of
-999,999 to +999,999 can be placed into a float and then into a large
enough integer type and the result will be exactly the same as the
original number.

If DBL_DIG is 15, that means any integral value in the range
-999,999,999,999,999 to 999,999,999,999,999 can be placed into a
double and then into a large enough integer type (if one exists) and
the result will be exactly the same as the original value.

Nowhere is there any mention of rounding at all.

If I assume that you mean Microsoft's 32-bit x86 implementations, you
have some errors in your calculations. Not the calculations
themselves, but your assumptions about the number of mantissa bits in
the Intel FPU single and double precision types, which are 23 and 52
respectively, not 24 and 53.

Which results in ranges of 8,388,608 and 4,503,599,627,370,496
respectively. There are 7-digit decimal numbers beyond that range of
magnitude for the former, and 16-digit numbers for the latter.

<off-topic>

If you want to understand the actual format of Intel floating point
representations, you can download the documentation for free from
http://developer.intel.com. If you do, don't bother looking at the 80
bit extended precision format. Microsoft has decided that you aren't
qualified to use that format at the expense of "compatibility" among
Windows versions on various processors.

Here's a quote from Microsoft:

With the 16-bit Microsoft C/C++ compilers, long doubles are stored as
80-bit (10-byte) data types. Under Windows NT, in order to be
compatible with other non-Intel floating point implementations, the
80-bit long double format is aliased to the 64-bit (8-byte) double
format.

The complete web page may be found at:

http://support.microsoft.com/default.aspx?scid=kb;en-us;129209

</off-topic>
 
lawrence.jones

ziller said:
Why is it that FLT_DIG (from <float.h>) is 6 while DBL_DIG is 15?

Because of roundoff error. The definition of FLT_DIG requires that
*any* representable number with that many decimal digits can be rounded
into a float and back again without changing the value. Unless floats
are stored in base 10 (or a power of 10), there are roundoff errors on
both conversions that compound in the worst case. Thus, the C Standard
says the correct formula to use in the non-decimal case is:

floor((p-1)*log10(b))

where p is the precision and b is the base. For base 2 with 24 and 53
bits of precision, that yields 6 and 15 respectively.

-Larry Jones

There's a connection here, I just know it. -- Calvin
 
grv575

OK but I still don't understand why p-1. This hidden bit seems to
have been used in the latest visual studios (haven't looked closely
but I believe the limits.h and float.h for vs changed from version
5.0-6.0).

But here's the thing. Try it yourself in code:

printf("%f\n", 16777215.0f);
printf("%f\n", 16777216.0f); // ... even
printf("%f\n", 16777217.0f); // ... odd

16777215 = 2^24-1 (not 23)...that's definitely 7.22 digits of precision
we're getting in VS (tested with 7.0).
 
Gordon Burditt

> OK but I still don't understand why p-1. This hidden bit seems to
> have been used in the latest visual studios (haven't looked closely
> but I believe the limits.h and float.h for vs changed from version
> 5.0-6.0).

Unless VS uses software emulation of floating point, the hidden
bit is a feature of the floating-point hardware.
This does not rule out bugs in the header files but I see no obvious way
to compromise the security of Windows with FLT_DIG, so I don't think
even Microsoft would make this mistake.

> But here's the thing. Try it yourself in code:
>
> printf("%f\n", 16777215.0f);
> printf("%f\n", 16777216.0f); // ... even
> printf("%f\n", 16777217.0f); // ... odd
>
> 16777215 = 2^24-1 (not 23)...that's definitely 7.22 digits of precision
> we're getting in VS (tested with 7.0).

Try BOTH ENDS of that range. You don't get to claim the maximum
precision, you claim the minimum that holds over the whole range.

printf("%f\n", 8388608.0f); // ... even
printf("%f\n", 8388609.0f); // ... odd

You're getting counts of 1 in a number of magnitude 2**23, thus
23 bits of precision. That you get more at the other end of the
range is not relevant: you have to use a value guaranteed over
the entire range.

Gordon L. Burditt
 
kal

> Doing the math, the mantissa for floats is 24 bits =
> 2^24-1 max value = 16,777,215.

For single precision floating points, the fractional part
of the mantissa is stored in 23 bits.

The mantissa is said to have 24 bits of precision only under
the assumption of the leading bit of '1'. But this leading
bit business is true only for NORMALIZED forms.

Now, from the C99 thingy.

5.2.4.2.2 Characteristics of floating types <float.h>

3 In addition to normalized floating-point numbers ...
floating types may be able to contain other kinds of
floating-point numbers, such as subnormal floating-point
numbers ... and unnormalized floating-point numbers ...
 
Tim Prince

kal said:
> For single precision floating points, the fractional part
> of the mantissa is stored in 23 bits.
>
> The mantissa is said to have 24 bits of precision only under
> the assumption of the leading bit of '1'. But this leading
> bit business is true only for NORMALIZED forms.
>
> Now, from the C99 thingy.
>
> 5.2.4.2.2 Characteristics of floating types <float.h>
>
> 3 In addition to normalized floating-point numbers ...
> floating types may be able to contain other kinds of
> floating-point numbers, such as subnormal floating-point
> numbers ... and unnormalized floating-point numbers ...

Whether it says so or not, the stuff in <float.h> doesn't apply to
subnormals, nor does <float.h> tell you whether subnormal operations are
enabled or not. If you care to be pedantic or historical, prior to the IEEE
standard, not all machines suppressed the most significant bit of
normalized numbers. <float.h> doesn't care about that distinction either.
If you saw a <float.h> set up for a 23 or 52 bit mantissa, you could be
fairly certain it was one of those machines which didn't suppress the MSB.
C, of course, didn't have the near-universal availability back then that it
has now.
 
CBFalconer

kal said:
> For single precision floating points, the fractional part
> of the mantissa is stored in 23 bits.
>
> The mantissa is said to have 24 bits of precision only under
> the assumption of the leading bit of '1'. But this leading
> bit business is true only for NORMALIZED forms.

Denormalized forms can always have as little as 1 significant bit
in the significand. They don't count.
 
Dik T. Winter

> OK but I still don't understand why p-1.

Note the *any*. (Larry could have made it to "that many decimal digits of
precision".) If you consider only integer numbers, then, indeed, it could
have been 7. But there are ranges where two f-p numbers with 7 decimal
digits of precision are different but nevertheless round to the same
number in 24 bits of precision.
 
Dik T. Winter

>
> Unless VS uses software emulation of floating point, the hidden
> bit is a hardware feature of floating point implemented in hardware.

Yes. And so what? Also the hidden bit counts as a bit of the mantissa.
>
> Try BOTH ENDS of that range. You don't get to claim the maximum
> precision, you claim the minimum that holds over the whole range.
>
>
> You're getting counts of 1 in a number of magnitude 2**23, thus
> 23 bits of precision. That you get more at the other end of the
> range is not relevant: you have to use a value guaranteed over
> the entire range.

I have no idea what you intend to say here. 8388609 is represented
exactly, and as that is 2^23 + 1, I see 24 bits of precision.

Consider however the following two numbers:
9903521000000000000000000000 and 9903522000000000000000000000,
they have 7 decimal digits of precision. The first one is in binary
approximately (28 bits of precision rounded down):
2^70 * 100000000000000000000000.1001
and the second is:
2^70 * 100000000000000000000001.0110
Rounded to the nearest number representable in 24 bits, both become the
same number: 9903521494874662916604297216 (2^93 + 2^70). You can verify
that both are closer to this number than to the next lower representable
number (9903520314283042199192993792) or the next higher representable
number (9903522675466283634015600640).

More examples can be found. This occurs when the density of decimal
numbers is high (just below a power of 10) and at the same time the
density of binary numbers is low (just above a power of 2).
 
Dik T. Winter

>
> For single precision floating points, the fractional part
> of the mantissa is stored in 23 bits.
>
> The mantissa is said to have 24 bits of precision only under
> the assumption of the leading bit of '1'. But this leading
> bit business is true only for NORMALIZED forms.

Non-sequitur. Denormals can have as few as 1 bit of precision. The
rules are for normalised numbers.
 
lawrence.jones

grv575 said:
> 16777215 = 2^24-1 (not 23)...that's definitely 7.22 digits of precision
> we're getting in VS (tested with 7.0).

That's because you're testing integers, which generally do not suffer
from rounding errors. Try putting decimal points in front of those
numbers, converting them to binary and then back to decimal and see what
happens. Remember, the requirement is not just to convert each decimal
number to a unique binary number but rather to convert each decimal
number to a binary number that will convert back to the original decimal
number.

-Larry Jones

I've got to start listening to those quiet, nagging doubts. -- Calvin
 
Dik T. Winter

> Remember, the requirement is not just to convert each decimal
> number to a unique binary number but rather to convert each decimal
> number to a binary number that will convert back to the original decimal
> number.

The "-1" in the rules is there to be sure that each decimal number is
converted to a unique binary number (see my example in a previous
article of why the "-1" is needed). This forward unique conversion yields
automatically a valid back-conversion to the same original number. I think
that is even pretty easy to prove. (Just consider the spacing of numbers
in both bases.)
 
Joe Wright

Jack Klein wrote:

Much major snipping..
> If I assume that you mean Microsoft's 32-bit x86 implementations, you
> have some errors in your calculations. Not the calculations
> themselves, but your assumptions about the number of mantissa bits in
> the Intel FPU single and double precision types, which are 23 and 52
> respectively, not 24 and 53.

Where do you get that? The mantissa of float is surely 24 bits of
value, even if bit 23 is not actually there. All floats (except
sub-normal ones) are 'normalized' which means shifted left until the
msb of the mantissa (bit 23) is 1. Because the b23 value is always 1
in terms of the mantissa, we don't need to reserve actual space for
it. Instead, we use the space for the lsb of the exponent.

16777214
01001011 01111111 11111111 11111110
Exp = 150 (24)
00011000
Man = .11111111 11111111 11111110
1.67772140e+07

16777215
01001011 01111111 11111111 11111111
Exp = 150 (24)
00011000
Man = .11111111 11111111 11111111
1.67772150e+07

If I can represent 16777214 and 16777215 exactly, and I can (the
nonsense about 8388607 notwithstanding), the mantissa is effectively
24 bits wide, not 23.
 
