Pallav
I'm trying to convert some floating-point source code to fixed-point
arithmetic.
I am having some trouble understanding signed fixed-point multiply. I
have an 18.14 format, with 18 bits for the integer part and 14 bits
for the fraction. As I understand it, if I multiply two 18.14 values,
I will get
18.14 * 18.14 = 36.28 for the result.
But the result needs to fit in 32 bits. So in the fraction part we
must drop the lower 14 bits, and in the integer part we drop the upper
18 bits.
Correct?
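To make the bit-dropping above concrete, here is a minimal sketch of it in C. It assumes a 64-bit integer type is available to hold the full 36.28 product, which may not be true on every target. The names fp14_mul, FP14_FRAC_BITS and FP14_ONE are made up for illustration, and the sketch does no overflow checking or rounding.

```c
#include <stdint.h>

#define FP14_FRAC_BITS 14                    /* hypothetical name: fraction bits in 18.14 */
#define FP14_ONE       (1 << FP14_FRAC_BITS) /* hypothetical name: 1.0 in 18.14 */

typedef int32_t fp14;                        /* 18.14 value stored in 32 bits */

/* Multiply two 18.14 values through a 64-bit (36.28) intermediate.
 * Assumes the true result fits in 18.14; no saturation is done. */
static fp14 fp14_mul(fp14 a, fp14 b)
{
    /* Widen first so the full 36.28 product is kept. */
    int64_t wide = (int64_t)a * (int64_t)b;

    /* Dividing by 2^14 drops the low 14 fraction bits (28 -> 14).
     * Division truncates toward zero and sidesteps the
     * implementation-defined right shift of negative values. */
    wide /= (int64_t)1 << FP14_FRAC_BITS;

    /* Narrowing back to 32 bits discards the upper integer bits. */
    return (fp14)wide;
}
```

For example, 1.5 is stored as 24576 and 2.0 as 32768, and fp14_mul(24576, 32768) gives 49152, which is 3.0 in 18.14. The sign needs no special handling here because the widened multiply is already signed.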
If so, then here is my pseudocode.
typedef long fp14;
fp14 FPMUL(fp14 a1, fp14 a2)
{
    fp14 frac = 0, result = 0;
    char sign = 0;
    if (a1 < 0) { sign = 1; a1 = -a1; }
    if (a2 < 0) { sign ^= 1; a2 = -a2; }
    frac = (a1 * a2) >> 14;
    // Is this correct? Or do I need to say ((a1 & 0x3FF) * (a2 & 0x3FF))?
    result = (a1 >> 14) * (a2 >> 18); // 18 bits * 14 bits = 32 bits
    // 18 bits * 4 lower bits of a2
    result = result + (a1 >> 14) * ((a2 >> 14) & 0xF);
    // concatenate lower 18 bits of result with lower 14 bits of frac
    // to get 32-bit result
    result = (result << 18) | (frac & 0x3FF);
    return result * -sign;
}
Does this code look correct or am I missing something? Also is there a
more efficient way to implement this? Any help is appreciated.
Thanks