M
Markus Grunwald
Hello,
while implementing a simple algorithm for digital filters (IIR filters), I
got very confused about the very bad performance. This is the code,
discussion follows:
/**
<!-------------------------------------------------------------------------->
* The working horse: Filters a signal
*
* @param afX (i) input signal
* @param afY (o) output signal
* @param nLength (i) signal length
<!------------------------------------------------------------------------->*/
void CPTChebyshev::Filter( const float* afX, float* afY, int nLength ) {
float tY=0;
int nN=0;
int nI=0;
for( nI=m_nPoles; nI<nLength; ++nI ) {
tY=0;
for( nN=0; nN<=m_nPoles; ++nN )
{
tY += m_afA[ nN ] * afX[ nI - nN ];
}
for( nN=1; nN<=m_nPoles; ++nN )
{
tY += m_afB[ nN ] * afY[ nI - nN ];
}
afY[ nI ] = tY;
}
} // END Filter
Ok, I admit it's C++. But the problem should be independent.
I run this piece of code with m_nPoles==8 and Arrays afX, afY of size
262144. Calculating the number of flops gives me:
(9*2+8*2)*(262144-8)=8912624
I got a benchmark from http://ftp.br.freebsd.org/distfiles/flops.c,
compiled it with "gcc-3.3 -O3 -DUNIX flops.c -o flops -lm" and got values
ranging from about 400 to 800 MFLOPS for my old Pentium 4. Using the 400
MFLOPS, this takes me to:
8912624/(426*10^6)=0.021 seconds runtime
This _would_ be fine. But the function takes 1.66s ! (No, I didn't forger
-O3 This is about 80 times more ...
I tried all the obvious optimisations by hand: Use pointers instead of
array indices, combine the two inner loops into one. In the end, the gcc
result was always better the my hand "optimized" code.
The arrays afX and afY were allocated on the stack, m_afA and m_afB on the
heap. I didn't care for alignment, since malloc is supposed to return
8-byte aligned pointers. Now I suspected cache misses to be the root of
evil and tried the cachegrind tool. If I interpret the results correctly,
the code has nearly no L1 misses, but 20% L2 misses. This doesn't sound
too bad, doesn't it ?
Can somebody explain to me what's going wrong ? Or even better, tell me
how to make this faster ?
Many thanks,
Markus Grunwald
while implementing a simple algorithm for digital filters (IIR filters), I
got very confused about the very bad performance. This is the code,
discussion follows:
/**
<!-------------------------------------------------------------------------->
* The working horse: Filters a signal
*
* @param afX (i) input signal
* @param afY (o) output signal
* @param nLength (i) signal length
<!------------------------------------------------------------------------->*/
void CPTChebyshev::Filter( const float* afX, float* afY, int nLength ) {
float tY=0;
int nN=0;
int nI=0;
for( nI=m_nPoles; nI<nLength; ++nI ) {
tY=0;
for( nN=0; nN<=m_nPoles; ++nN )
{
tY += m_afA[ nN ] * afX[ nI - nN ];
}
for( nN=1; nN<=m_nPoles; ++nN )
{
tY += m_afB[ nN ] * afY[ nI - nN ];
}
afY[ nI ] = tY;
}
} // END Filter
Ok, I admit it's C++. But the problem should be independent.
I run this piece of code with m_nPoles==8 and Arrays afX, afY of size
262144. Calculating the number of flops gives me:
(9*2+8*2)*(262144-8)=8912624
I got a benchmark from http://ftp.br.freebsd.org/distfiles/flops.c,
compiled it with "gcc-3.3 -O3 -DUNIX flops.c -o flops -lm" and got values
ranging from about 400 to 800 MFLOPS for my old Pentium 4. Using the 400
MFLOPS, this takes me to:
8912624/(426*10^6)=0.021 seconds runtime
This _would_ be fine. But the function takes 1.66s ! (No, I didn't forger
-O3 This is about 80 times more ...
I tried all the obvious optimisations by hand: Use pointers instead of
array indices, combine the two inner loops into one. In the end, the gcc
result was always better the my hand "optimized" code.
The arrays afX and afY were allocated on the stack, m_afA and m_afB on the
heap. I didn't care for alignment, since malloc is supposed to return
8-byte aligned pointers. Now I suspected cache misses to be the root of
evil and tried the cachegrind tool. If I interpret the results correctly,
the code has nearly no L1 misses, but 20% L2 misses. This doesn't sound
too bad, doesn't it ?
Can somebody explain to me what's going wrong ? Or even better, tell me
how to make this faster ?
Many thanks,
Markus Grunwald