J
Javier Goizueta
This is about a problem in ruby/ext/bigdecimal.c. that makes
BigDecimal#to_f use only 13 significant digits sometimes and
That may also affect the precision of BigDecimal#sqrt.
The problem can only be noticed in mswin32 Ruby.
To see the problem in action take a look at the result of:
format "%.16f",BigDecimal('1.234567890123456').to_f
The function VpVtoD() that converts BigDecimal values to double
has one problem which is hidden by the internal high precision
of most compilers' floating point, but that reveals itself
when compiled with Microsoft Visual C/C++.
Note that mswin32, which uses that compiler, is the most popular Ruby
implementation on Windows, because the "One-Click Installer" uses it,
so this is an important issue.
In big_decimal.c, DBLE_FIG is computed as:
v = 1.0;
DBLE_FIG = 0;
while(v + 1.0 > 1.0) {
++DBLE_FIG;
v /= 10;
}
In mswin32, DBLE_FIG becomes 16, which corresponds correctly
to (one more than) the real decimal precision of a Float
in IEEE double precision.
Other compilers on the same hardware (such as GCC, Borland or Digital
Mars),
use the internal precision of IEEE extended double numbers (80 bits)
as much as possible, and DBLE_FIG has a value of 20.
Note: you can avoid the internal high precision of most compilers
storing in variables the intermediate values of an expression.
The problem with VpVtoD is that the number of "figures" (internal
base-BASE digits, with BASE=4) to be converted is computed as:
fig =(DBLE_FIG + BASE_FIG - 1) / BASE_FIG;
Which does not take into account that the first "figure", m->frac[0],
may contain less than BASE_FIG decimal digits.
For example, BigDecimal('1.234567890123456') produces an object
with { 1, 2345, 6789, 123, 4560 } in m->frac. In this case,
the first "figure" holds just one significative decimal digit.
Then, as only 4 figures are used, only 13 decimal significative
digits are converted.
Note that 4 "figures" can hold up to 16=DBL_FIG decimal digits,
but in this case they hold only 13.
With other compilers we have DBLE_FIG=20, so fig=5, and at least
17 decimal digits are converted, which is ok.
The result of this is, for example, that
format "%.16f",BigDecimal('1.234567890123456').to_f
produces "1.2345678901229999" in mswin32
(would produce "1.2345678901230000" if it wasn't for the accumulated
errors when converting decimal fractions to binary floating point).
In general, this problem causes, under mswin32 versions of Ruby,
incorrect values for BigDecimal#to_f and BigDecimal#sqrt.
VpSqrt() uses VpVtoD() so it is affected by the problem, but
it also uses DBLE_FIG directly, and may have similar problems,
but I haven't look at it.
What I propose as a solution is:
1. Instead of the rather ambiguous DBLE_FIG, define a DOUBLE_DIGITS
value, as the number of decimal digits that a Float can hold
(exactly for integral values), which is one less than BASE_FIG when
correctly computed without in extra internal precision;
BASE_FIG would then be the number of decimal digits necessary
for the round-trip conversion double->BigDecimal->double when
proper rounding is used in the conversions.
For round-trip conversion with truncation, DOUBLE_DIGITS+1 (=DBLE_FIG+2)
digits are needed.
DOUBLE_DIGITS is 15 for IEEE double, and corresponds
to DBL_DIG in <float.h> so it doesn't need to be computed.
Anyway, it can be computed (in Ruby, assuming Float uses double) as:
(BITS*Math.log(2)/Math.log(10)).floor
BITS corresponds to DBL_MANT_DIG in <float.h> and can be computed as:
x = 1.0
bits = 0
begin
bits += 1
x /= 2
end while 1!=x+1
bits
Or also as: 2-Math.frexp(Float::EPSILON)[1]
In any case, care should be taken so that DOUBLE_DIGITS depends only on
the underlying double type properties, and not on compiler handling of
floating point operations.
2. In VpVtoD(), compute fig to assure that FLOAT_DIGITS+2 are
always converted, which preserves round-trip conversions,
for example as:
fig = 1 + (DOUBLE_DIGITS + BASE_FIG) / BASE_FIG;
3. To avoid breaking existing code, BigDecimal#double_fig could be
defined
to return DOUBLE_DIGITS+2, assuming that double_fig was interpreted as
enough decimal digits to represent unambiguosly a Float value.
Here is a patch for bigdecimal.c Rev.1.43 that does these things
I also have replaced (DBLE_FIG + BASE_FIG - 1) / BASE_FIG in VpSqrt,
but have I'm not really sure what I'm doing there. I haven't
done extensive testing on this, but seems to work.
--------------------------------------------------------------BEGIN OF
PATCH
--- Rev.1.43/bigdecimal.c 2004-10-19 12:24:40.000000000 +0200
+++ bigdecimal.c 2004-10-21 17:10:33.000000000 +0200
@@ -1370,7 +1370,7 @@
/* The value of BASE**2 + BASE must be represented */
/* within one U_LONG. */
static U_LONG HALF_BASE = 5000L;/* =BASE/2 */
-static S_LONG DBLE_FIG = 8; /* figure of double */
+static S_LONG DOUBLE_DIGITS = DBL_DIG; /* figure of double */
static U_LONG BASE1 = 1000L; /* =BASE/10 */
static Real *VpConstOne; /* constant 1.0 */
@@ -1515,7 +1515,7 @@
VP_EXPORT U_LONG
VpDblFig(void)
{
- return DBLE_FIG;
+ return DOUBLE_DIGITS+2;
}
VP_EXPORT U_LONG
@@ -1760,7 +1760,7 @@
* by one U_LONG word(LONG) in the computer used.
*
* [Returns]
- * DBLE_FIG ... OK
+ * DOUBLE_DIGITS ... OK
*/
VP_EXPORT U_LONG
VpInit(U_LONG BaseVal)
@@ -1799,15 +1799,6 @@
gnAlloc = 0;
#endif /* _DEBUG */
- /* Determine # of digits available in one 'double'. */
-
- v = 1.0;
- DBLE_FIG = 0;
- while(v + 1.0 > 1.0) {
- ++DBLE_FIG;
- v /= 10;
- }
-
#ifdef _DEBUG
if(gfDebug) {
printf("VpInit: BaseVal = %lu\n", BaseVal);
@@ -1819,7 +1810,7 @@
}
#endif /* _DEBUG */
- return DBLE_FIG;
+ return DOUBLE_DIGITS;
}
VP_EXPORT Real *
@@ -3401,7 +3392,7 @@
* [Output]
* *d ... fraction part of m(d = 0.xxxxxxx). where # of 'x's is fig.
* *e ... U_LONG,exponent of m.
- * DBLE_FIG ... Number of digits in a double variable.
+ * DOUBLE_DIGITS ... Number of digits in a double variable.
*
* m -> d*10**e, 0<d<BASE
* [Returns]
@@ -3448,7 +3439,7 @@
goto Exit;
}
/* Normal number */
- fig =(DBLE_FIG + BASE_FIG - 1) / BASE_FIG;
+ fig = 1 + (DOUBLE_DIGITS + BASE_FIG) / BASE_FIG;
ind_m = 0;
mm = Min(fig,(m->Prec));
*d = 0.0;
@@ -3465,7 +3456,7 @@
if(gfDebug) {
VPrint(stdout, " VpVtoD: m=%\n", m);
printf(" d=%e * 10 **%ld\n", *d, *e);
- printf(" DBLE_FIG = %ld\n", DBLE_FIG);
+ printf(" DOUBLE_DIGITS = %ld\n", DOUBLE_DIGITS);
}
#endif /*_DEBUG */
return f;
@@ -3663,7 +3654,7 @@
}
VpDtoV(y, sqrt(val)); /* y <- sqrt(val) */
y->exponent += n;
- n = (DBLE_FIG + BASE_FIG - 1) / BASE_FIG;
+ n = (DOUBLE_DIGITS + BASE_FIG) / BASE_FIG;
y->MaxPrec = (U_LONG)Min(n , y_prec);
f->MaxPrec = y->MaxPrec + 1;
n = y_prec*((S_LONG)BASE_FIG);
--------------------------------------------------------------END OF
PATCH
With these changes VpVtoD() works better, but it's still not perfect;
in particular, x.to_s.to_f is better than x.to_f, even
if we truncate x.to_s at DOUBLE_DIGITS+2 significative digits, which
in theory is what VpVtoD does.
For example:
(BigDecimal('1')/3).to_f != 1.0/3
but
(BigDecimal('1')/3).to_s.to_f == 1.0/3
or, using no more digits than VpVtoD() does:
s,d,b,e = (BigDecimal('1')/3).split
n = [d.size, BigDecimal.double_fig].min
(d[0...n]+"E#{e-n}").to_f == 1.0/3 # this is true!
Which proves that VpVtoD could do better. (note that these examples
are meant for the patched version of bigdecimal)
- - - - - -
Last minute note:
I've realized that in the unpatched version of bigdecimal.c:
(BigDecimal('1')/3).to_f == 1.0/3
That means that adding more than necessary "figures" degrades
the result. This could be solved by being more careful when
computing fig:
fig =(DOUBLE_DIGITS+K + 4-n + BASE_FIG - 1) / BASE_FIG;
where n is the number of significant digits in m->frac[0], i.e.
one more than the (floor of the) base-BASE logarithm of m->frac[0].
For BASE=4 n would be (d0<10 ? 1 : (d0<100 ? 2 : (d0<1000 ? 3 : 4)))
with d0=m->frac[0].
With K=1 (convert 16 digits for IEEE) this satifies:
(BigDecimal('1')/3).to_f == 1.0/3
But not with K=2, which is theoretically what we should take
(17 digits for IEEE) to have enough precision in all cases.
Anyway, I think this should be look at carefully by someone else,
including
the VpSqrt code. It would be nice if someone writes a more precise
VpDtoV.
- - - - - -
Finally I'd like to propose the inclusion of some Float-related
constants in the Ruby core.
I sometimes have used this:
class Float
x,bits = 1.0,0
begin
bits += 1
x /= 2
end while 1!=x+1
# Number of binary digits of precision in a Float
BITS = bits
# Number of decimal digits of precision in a Float
DIGITS = (BITS*Math.log(2)/Math.log(10)).floor
# Decimal precision required to represent a Float
DIGITS2 = DIGITS+2
EPSILON = Math.ldexp(0.5,2-BITS) unless const_defined?EPSILON)
end
These constants (EPSILON has already been defined) could
be defined in the core using <float.h> constants DBL_MANT_DIG and
DBL_DIG.
The values for IEEE double precision are:
Float::BITS = 53
Float:IGITS = 15
Float:IGITS2 = 17
Float::EPSILON = 2.2204460492503131E-16
- - - - - - - -
Regards,
--Javier
BigDecimal#to_f use only 13 significant digits sometimes and
That may also affect the precision of BigDecimal#sqrt.
The problem can only be noticed in mswin32 Ruby.
To see the problem in action take a look at the result of:
format "%.16f",BigDecimal('1.234567890123456').to_f
The function VpVtoD() that converts BigDecimal values to double
has one problem which is hidden by the internal high precision
of most compilers' floating point, but that reveals itself
when compiled with Microsoft Visual C/C++.
Note that mswin32, which uses that compiler, is the most popular Ruby
implementation on Windows, because the "One-Click Installer" uses it,
so this is an important issue.
In big_decimal.c, DBLE_FIG is computed as:
v = 1.0;
DBLE_FIG = 0;
while(v + 1.0 > 1.0) {
++DBLE_FIG;
v /= 10;
}
In mswin32, DBLE_FIG becomes 16, which corresponds correctly
to (one more than) the real decimal precision of a Float
in IEEE double precision.
Other compilers on the same hardware (such as GCC, Borland or Digital
Mars),
use the internal precision of IEEE extended double numbers (80 bits)
as much as possible, and DBLE_FIG has a value of 20.
Note: you can avoid the internal high precision of most compilers
storing in variables the intermediate values of an expression.
The problem with VpVtoD is that the number of "figures" (internal
base-BASE digits, with BASE=4) to be converted is computed as:
fig =(DBLE_FIG + BASE_FIG - 1) / BASE_FIG;
Which does not take into account that the first "figure", m->frac[0],
may contain less than BASE_FIG decimal digits.
For example, BigDecimal('1.234567890123456') produces an object
with { 1, 2345, 6789, 123, 4560 } in m->frac. In this case,
the first "figure" holds just one significative decimal digit.
Then, as only 4 figures are used, only 13 decimal significative
digits are converted.
Note that 4 "figures" can hold up to 16=DBL_FIG decimal digits,
but in this case they hold only 13.
With other compilers we have DBLE_FIG=20, so fig=5, and at least
17 decimal digits are converted, which is ok.
The result of this is, for example, that
format "%.16f",BigDecimal('1.234567890123456').to_f
produces "1.2345678901229999" in mswin32
(would produce "1.2345678901230000" if it wasn't for the accumulated
errors when converting decimal fractions to binary floating point).
In general, this problem causes, under mswin32 versions of Ruby,
incorrect values for BigDecimal#to_f and BigDecimal#sqrt.
VpSqrt() uses VpVtoD() so it is affected by the problem, but
it also uses DBLE_FIG directly, and may have similar problems,
but I haven't look at it.
What I propose as a solution is:
1. Instead of the rather ambiguous DBLE_FIG, define a DOUBLE_DIGITS
value, as the number of decimal digits that a Float can hold
(exactly for integral values), which is one less than BASE_FIG when
correctly computed without in extra internal precision;
BASE_FIG would then be the number of decimal digits necessary
for the round-trip conversion double->BigDecimal->double when
proper rounding is used in the conversions.
For round-trip conversion with truncation, DOUBLE_DIGITS+1 (=DBLE_FIG+2)
digits are needed.
DOUBLE_DIGITS is 15 for IEEE double, and corresponds
to DBL_DIG in <float.h> so it doesn't need to be computed.
Anyway, it can be computed (in Ruby, assuming Float uses double) as:
(BITS*Math.log(2)/Math.log(10)).floor
BITS corresponds to DBL_MANT_DIG in <float.h> and can be computed as:
x = 1.0
bits = 0
begin
bits += 1
x /= 2
end while 1!=x+1
bits
Or also as: 2-Math.frexp(Float::EPSILON)[1]
In any case, care should be taken so that DOUBLE_DIGITS depends only on
the underlying double type properties, and not on compiler handling of
floating point operations.
2. In VpVtoD(), compute fig to assure that FLOAT_DIGITS+2 are
always converted, which preserves round-trip conversions,
for example as:
fig = 1 + (DOUBLE_DIGITS + BASE_FIG) / BASE_FIG;
3. To avoid breaking existing code, BigDecimal#double_fig could be
defined
to return DOUBLE_DIGITS+2, assuming that double_fig was interpreted as
enough decimal digits to represent unambiguosly a Float value.
Here is a patch for bigdecimal.c Rev.1.43 that does these things
I also have replaced (DBLE_FIG + BASE_FIG - 1) / BASE_FIG in VpSqrt,
but have I'm not really sure what I'm doing there. I haven't
done extensive testing on this, but seems to work.
--------------------------------------------------------------BEGIN OF
PATCH
--- Rev.1.43/bigdecimal.c 2004-10-19 12:24:40.000000000 +0200
+++ bigdecimal.c 2004-10-21 17:10:33.000000000 +0200
@@ -1370,7 +1370,7 @@
/* The value of BASE**2 + BASE must be represented */
/* within one U_LONG. */
static U_LONG HALF_BASE = 5000L;/* =BASE/2 */
-static S_LONG DBLE_FIG = 8; /* figure of double */
+static S_LONG DOUBLE_DIGITS = DBL_DIG; /* figure of double */
static U_LONG BASE1 = 1000L; /* =BASE/10 */
static Real *VpConstOne; /* constant 1.0 */
@@ -1515,7 +1515,7 @@
VP_EXPORT U_LONG
VpDblFig(void)
{
- return DBLE_FIG;
+ return DOUBLE_DIGITS+2;
}
VP_EXPORT U_LONG
@@ -1760,7 +1760,7 @@
* by one U_LONG word(LONG) in the computer used.
*
* [Returns]
- * DBLE_FIG ... OK
+ * DOUBLE_DIGITS ... OK
*/
VP_EXPORT U_LONG
VpInit(U_LONG BaseVal)
@@ -1799,15 +1799,6 @@
gnAlloc = 0;
#endif /* _DEBUG */
- /* Determine # of digits available in one 'double'. */
-
- v = 1.0;
- DBLE_FIG = 0;
- while(v + 1.0 > 1.0) {
- ++DBLE_FIG;
- v /= 10;
- }
-
#ifdef _DEBUG
if(gfDebug) {
printf("VpInit: BaseVal = %lu\n", BaseVal);
@@ -1819,7 +1810,7 @@
}
#endif /* _DEBUG */
- return DBLE_FIG;
+ return DOUBLE_DIGITS;
}
VP_EXPORT Real *
@@ -3401,7 +3392,7 @@
* [Output]
* *d ... fraction part of m(d = 0.xxxxxxx). where # of 'x's is fig.
* *e ... U_LONG,exponent of m.
- * DBLE_FIG ... Number of digits in a double variable.
+ * DOUBLE_DIGITS ... Number of digits in a double variable.
*
* m -> d*10**e, 0<d<BASE
* [Returns]
@@ -3448,7 +3439,7 @@
goto Exit;
}
/* Normal number */
- fig =(DBLE_FIG + BASE_FIG - 1) / BASE_FIG;
+ fig = 1 + (DOUBLE_DIGITS + BASE_FIG) / BASE_FIG;
ind_m = 0;
mm = Min(fig,(m->Prec));
*d = 0.0;
@@ -3465,7 +3456,7 @@
if(gfDebug) {
VPrint(stdout, " VpVtoD: m=%\n", m);
printf(" d=%e * 10 **%ld\n", *d, *e);
- printf(" DBLE_FIG = %ld\n", DBLE_FIG);
+ printf(" DOUBLE_DIGITS = %ld\n", DOUBLE_DIGITS);
}
#endif /*_DEBUG */
return f;
@@ -3663,7 +3654,7 @@
}
VpDtoV(y, sqrt(val)); /* y <- sqrt(val) */
y->exponent += n;
- n = (DBLE_FIG + BASE_FIG - 1) / BASE_FIG;
+ n = (DOUBLE_DIGITS + BASE_FIG) / BASE_FIG;
y->MaxPrec = (U_LONG)Min(n , y_prec);
f->MaxPrec = y->MaxPrec + 1;
n = y_prec*((S_LONG)BASE_FIG);
--------------------------------------------------------------END OF
PATCH
With these changes VpVtoD() works better, but it's still not perfect;
in particular, x.to_s.to_f is better than x.to_f, even
if we truncate x.to_s at DOUBLE_DIGITS+2 significative digits, which
in theory is what VpVtoD does.
For example:
(BigDecimal('1')/3).to_f != 1.0/3
but
(BigDecimal('1')/3).to_s.to_f == 1.0/3
or, using no more digits than VpVtoD() does:
s,d,b,e = (BigDecimal('1')/3).split
n = [d.size, BigDecimal.double_fig].min
(d[0...n]+"E#{e-n}").to_f == 1.0/3 # this is true!
Which proves that VpVtoD could do better. (note that these examples
are meant for the patched version of bigdecimal)
- - - - - -
Last minute note:
I've realized that in the unpatched version of bigdecimal.c:
(BigDecimal('1')/3).to_f == 1.0/3
That means that adding more than necessary "figures" degrades
the result. This could be solved by being more careful when
computing fig:
fig =(DOUBLE_DIGITS+K + 4-n + BASE_FIG - 1) / BASE_FIG;
where n is the number of significant digits in m->frac[0], i.e.
one more than the (floor of the) base-BASE logarithm of m->frac[0].
For BASE=4 n would be (d0<10 ? 1 : (d0<100 ? 2 : (d0<1000 ? 3 : 4)))
with d0=m->frac[0].
With K=1 (convert 16 digits for IEEE) this satifies:
(BigDecimal('1')/3).to_f == 1.0/3
But not with K=2, which is theoretically what we should take
(17 digits for IEEE) to have enough precision in all cases.
Anyway, I think this should be look at carefully by someone else,
including
the VpSqrt code. It would be nice if someone writes a more precise
VpDtoV.
- - - - - -
Finally I'd like to propose the inclusion of some Float-related
constants in the Ruby core.
I sometimes have used this:
class Float
x,bits = 1.0,0
begin
bits += 1
x /= 2
end while 1!=x+1
# Number of binary digits of precision in a Float
BITS = bits
# Number of decimal digits of precision in a Float
DIGITS = (BITS*Math.log(2)/Math.log(10)).floor
# Decimal precision required to represent a Float
DIGITS2 = DIGITS+2
EPSILON = Math.ldexp(0.5,2-BITS) unless const_defined?EPSILON)
end
These constants (EPSILON has already been defined) could
be defined in the core using <float.h> constants DBL_MANT_DIG and
DBL_DIG.
The values for IEEE double precision are:
Float::BITS = 53
Float:IGITS = 15
Float:IGITS2 = 17
Float::EPSILON = 2.2204460492503131E-16
- - - - - - - -
Regards,
--Javier