J
jacob navia
The cephes math library is a legend in the internet. It was
written by Stephen L. Moshier in 1984, but there was apparently
an even earlier version in assembler, before he rewrote that
in C. It is used in many systems, for instance in the python
language.
I started working with this library around 1997 or whereabouts.
I rewrote the core of it in assembler, fixed some (minor)
bugs, and added several functions like the Lambert's W
function and others. I rewrote all the functions needed for
replicating all math.h
This library is at the center of the qfloat data type in lcc-win
that was one of the first concrete applications of operator
overloading and convinced me that I was in the right track.
Basically, the library works around
#define _NQ_ 12
typedef struct qfloatstruct {
int sign;
int exponent;
unsigned int mantissa[_NQ_];
}qfloat;
For a 32 bit system, this structure can be seen as an array
of 14 32 bit integers, with array[0] the sign, array[1] the
exponent, and array[2..13] the mantissa.
Within the mantissa, the first array position is always zero
and is used to hold bits that overflow from the higher
positions during the calculations.
When doing the four operations, the software expands this
structure by adding a zero to the end, to hold overflows in the
other direction.
I maintained this structure within the assembler core, and
essentially the 32 bit version uses the above description as is.
When preparing the 64 bit version however, I decided to
optimize the data layout.
1) The extra zero will no longer be stored in the number
but only stored in the temporary forms when actually
doing the calculations. Numbers will be expanded before
any of the four operations, then packed afterwards.
2) The size of the numbers must be a multiple of 2 to better
use the cache lines. A size of 64 bytes is a sensible choice.
All this leads naturally to the following structure
#define MANTISSA_LENGTH 7
typedef struct tagQfloat {
int sign;
unsigned int exponent;
unsigned long long mantissa[MANTISSA_LENGTH];
} Qfloat;
This allows for a precision of 448 bits, with 132 decimal
digits.
Obviously it would be too expensive to pass a 64 byte
structure by value for each function call, so I decided
to use references everywhere. The trick I used is that
instead of declaring variables like this:
Qfloat a,b,c;
I declare them like this:
Qfloat a[1],b[1],c[1];
This has the highly beneficial side effect that when
you call some function like
qadd(a,b,c);
to add a+b and put the result in c, the numbers are
automatically passed by reference without having to use the
ugly
qadd(&a,&b,&c);
notation and maintaining the bulk of the library
code intact.
This 64 bit version has more or less the same speed as the 32 bit
version, even if it has around 25% more precision (132 digits
instead of 105). Obviously reducing the number of loops by
half (mantissa is not 14 but 7 elements wide) offset the
extra precision work.
This new library will be shipped with the 64 bit version of lcc-win.
The purpose of this message is to discuss the data structures and the
choices I did. Obviously this is a serious implementation
of floating point, not just an academic exercise where the numbers
are represented just by "unsigned char" or similar solutions.
written by Stephen L. Moshier in 1984, but there was apparently
an even earlier version in assembler, before he rewrote that
in C. It is used in many systems, for instance in the python
language.
I started working with this library around 1997 or whereabouts.
I rewrote the core of it in assembler, fixed some (minor)
bugs, and added several functions like the Lambert's W
function and others. I rewrote all the functions needed for
replicating all math.h
This library is at the center of the qfloat data type in lcc-win
that was one of the first concrete applications of operator
overloading and convinced me that I was in the right track.
Basically, the library works around
#define _NQ_ 12
typedef struct qfloatstruct {
int sign;
int exponent;
unsigned int mantissa[_NQ_];
}qfloat;
For a 32 bit system, this structure can be seen as an array
of 14 32 bit integers, with array[0] the sign, array[1] the
exponent, and array[2..13] the mantissa.
Within the mantissa, the first array position is always zero
and is used to hold bits that overflow from the higher
positions during the calculations.
When doing the four operations, the software expands this
structure by adding a zero to the end, to hold overflows in the
other direction.
I maintained this structure within the assembler core, and
essentially the 32 bit version uses the above description as is.
When preparing the 64 bit version however, I decided to
optimize the data layout.
1) The extra zero will no longer be stored in the number
but only stored in the temporary forms when actually
doing the calculations. Numbers will be expanded before
any of the four operations, then packed afterwards.
2) The size of the numbers must be a multiple of 2 to better
use the cache lines. A size of 64 bytes is a sensible choice.
All this leads naturally to the following structure
#define MANTISSA_LENGTH 7
typedef struct tagQfloat {
int sign;
unsigned int exponent;
unsigned long long mantissa[MANTISSA_LENGTH];
} Qfloat;
This allows for a precision of 448 bits, with 132 decimal
digits.
Obviously it would be too expensive to pass a 64 byte
structure by value for each function call, so I decided
to use references everywhere. The trick I used is that
instead of declaring variables like this:
Qfloat a,b,c;
I declare them like this:
Qfloat a[1],b[1],c[1];
This has the highly beneficial side effect that when
you call some function like
qadd(a,b,c);
to add a+b and put the result in c, the numbers are
automatically passed by reference without having to use the
ugly
qadd(&a,&b,&c);
notation and maintaining the bulk of the library
code intact.
This 64 bit version has more or less the same speed as the 32 bit
version, even if it has around 25% more precision (132 digits
instead of 105). Obviously reducing the number of loops by
half (mantissa is not 14 but 7 elements wide) offset the
extra precision work.
This new library will be shipped with the 64 bit version of lcc-win.
The purpose of this message is to discuss the data structures and the
choices I did. Obviously this is a serious implementation
of floating point, not just an academic exercise where the numbers
are represented just by "unsigned char" or similar solutions.