Optimization question

Markus Grunwald

Hello,

while implementing a simple algorithm for digital filters (IIR filters), I
was very confused by its poor performance. This is the code; discussion
follows:

/**
<!-------------------------------------------------------------------------->
 * The workhorse: filters a signal
 *
 * @param afX     (i) input signal
 * @param afY     (o) output signal
 * @param nLength (i) signal length
<!------------------------------------------------------------------------->*/
void CPTChebyshev::Filter( const float* afX, float* afY, int nLength ) {
    float tY=0;
    int nN=0;
    int nI=0;

    for( nI=m_nPoles; nI<nLength; ++nI ) {
        tY=0;
        /* feed-forward part: input taps */
        for( nN=0; nN<=m_nPoles; ++nN )
        {
            tY += m_afA[ nN ] * afX[ nI - nN ];
        }
        /* feedback part: previous outputs */
        for( nN=1; nN<=m_nPoles; ++nN )
        {
            tY += m_afB[ nN ] * afY[ nI - nN ];
        }
        afY[ nI ] = tY;
    }
} // END Filter


OK, I admit it's C++, but the problem should be language-independent.

I ran this piece of code with m_nPoles==8 and arrays afX, afY of size
262144. Counting the floating-point operations gives me:

(9*2+8*2)*(262144-8)=8912624

I got a benchmark from http://ftp.br.freebsd.org/distfiles/flops.c,
compiled it with "gcc-3.3 -O3 -DUNIX flops.c -o flops -lm" and got values
ranging from about 400 to 800 MFLOPS for my old Pentium 4. Using a figure
near the lower end (426 MFLOPS), this takes me to:

8912624/(426*10^6)=0.021 seconds runtime

This _would_ be fine. But the function takes 1.66 s! (No, I didn't forget
-O3 ;) This is about 80 times more ...

I tried all the obvious optimisations by hand: use pointers instead of
array indices, combine the two inner loops into one. In the end, the gcc
result was always better than my hand "optimized" code.

The arrays afX and afY were allocated on the stack, m_afA and m_afB on the
heap. I didn't care about alignment, since malloc is supposed to return
8-byte aligned pointers. Then I suspected cache misses to be the root of
the evil and tried the cachegrind tool. If I interpret the results
correctly, the code has nearly no L1 misses, but a 20% L2 miss rate. That
doesn't sound too bad, does it?

Can somebody explain to me what's going wrong? Or even better, tell me
how to make this faster?

Many thanks,

Markus Grunwald
 

Malcolm McLean

Markus Grunwald said:
> [original post and code snipped]
>
> Can somebody explain to me what's going wrong? Or even better, tell me
> how to make this faster?

It might not work, but try copying the eight poles to an array on the stack.
If you have an aliasing problem you might be generating too many data reads.
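
For illustration, a minimal C sketch of that idea, written in the standalone-function form other posters in this thread used for their C conversions (the coefficient arrays and pole count are passed in rather than held as class members, and MAX_POLES is a made-up bound just for the sketch):

#define MAX_POLES 32   /* made-up upper bound, only for this sketch */

void filter_local_coeffs(const float *afX, float *afY, int nLength,
                         const float *afA, const float *afB, int nPoles)
{
    float a[MAX_POLES + 1];   /* stack copies of the pole coefficients */
    float b[MAX_POLES + 1];
    int nI, nN;

    for (nN = 0; nN <= nPoles; ++nN)
        a[nN] = afA[nN];
    for (nN = 1; nN <= nPoles; ++nN)
        b[nN] = afB[nN];

    for (nI = nPoles; nI < nLength; ++nI) {
        float tY = 0.0f;
        for (nN = 0; nN <= nPoles; ++nN)      /* feed-forward taps */
            tY += a[nN] * afX[nI - nN];
        for (nN = 1; nN <= nPoles; ++nN)      /* feedback taps */
            tY += b[nN] * afY[nI - nN];
        afY[nI] = tY;
    }
}

In C99 one could also declare the pointer parameters restrict to make the same no-aliasing promise explicit to the compiler.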
 

Jens Thoms Toerring

Markus Grunwald said:
> [original post and code snipped]

I've got to tell you that this is rather off-topic here in clc.
It's not about the C language (you're not even using it), and
even if this program were in C, there are no guarantees about
the speed of execution of a program, since that depends, beside
the algorithm used, on too many implementation and system
details. Perhaps comp.programming would be a good alternative
for this question, since the scope of topicality is much wider
there.

But to give you another data point: I converted your program
to C, allocating all the arrays used with calloc() (on my
machine this initializes all elements to 0.0), and ran it on
my Mobile Pentium M (2 GHz). I first ran the 'flops.c' program
and it gave values ranging between 200 and 600. Then I ran
your program and it finished within about 30 ms (give or take
a bit in different runs). If I instead used automatic arrays
(which probably will end up on the stack) it took about 60 ms
(but this also includes the time it takes to initialize the
arrays to 0.0).

I hope this demonstrates that this isn't a problem related to
C (and probably also not to C++, but I didn't try to write a
complete C++ version of the program; if you send me one via
private email I will run it and report the results). There
might be ways to micro-optimize your function, but you will
first have to figure out what the main problem is, which
probably is related to the details of the system you're running
this on - and I fear clc is not the right place to look for
answers to such problems.

Regards, Jens
 

Flash Gordon

Markus Grunwald wrote, On 18/08/07 11:14:
> Hello,
>
> while implementing a simple algorithm for digital filters (IIR filters), I
> was very confused by its poor performance. This is the code; discussion
> follows:
>
> [doc comment snipped]
> void CPTChebyshev::Filter( const float* afX, float* afY, int nLength ) {

Rather than pointing out further down that your problem is not specific
to C++, why did you not change your code to C by replacing the above with

void CPTChebyshevFilter( const float* afX, float* afY, int nLength ) {

> float tY=0;

On some systems using a float is slower than using a double. In fact, I
would say you should use a double *except* where you have specific
evidence that you need a float instead.

> int nN=0;
> int nI=0;
>
> for( nI=m_nPoles; nI<nLength; ++nI ) {
>     tY=0;
>     for( nN=0; nN<=m_nPoles; ++nN )

Are you sure you meant to use <= above? It would be rather unusual since
you start with nN=0.

>     {
>         tY += m_afA[ nN ] * afX[ nI - nN ];
>     }
>     for( nN=1; nN<=m_nPoles; ++nN )

If the above loop is correct then this is almost certainly wrong, or
maybe both are wrong.

>     {
>         tY += m_afB[ nN ] * afY[ nI - nN ];
>     }
>     afY[ nI ] = tY;
> }
> } // END Filter
>
> OK, I admit it's C++, but the problem should be language-independent.

So why not change the code to C, or post in comp.lang.c++?

> I ran this piece of code with m_nPoles==8 and arrays afX, afY of size
> 262144. Counting the floating-point operations gives me:
>
> (9*2+8*2)*(262144-8)=8912624
>
> I got a benchmark from http://ftp.br.freebsd.org/distfiles/flops.c,
> compiled it with "gcc-3.3 -O3 -DUNIX flops.c -o flops -lm" and got values
> ranging from about 400 to 800 MFLOPS for my old Pentium 4. Using a figure
> near the lower end (426 MFLOPS), this takes me to:
>
> 8912624/(426*10^6)=0.021 seconds runtime
>
> This _would_ be fine. But the function takes 1.66 s! (No, I didn't forget
> -O3 ;) This is about 80 times more ...

We don't really deal with the specifics of getting the best speed out of
any given implementation here, since it is highly specific to the
implementation.

> I tried all the obvious optimisations by hand: use pointers instead of
> array indices,

That rarely works these days, since compilers will tend to do that
optimisation when appropriate anyway.

> combine the two inner loops into one. In the end, the gcc
> result was always better than my hand "optimized" code.

This is why it is rarely worth trying to beat the compiler ;-)

> The arrays afX and afY were allocated on the stack, m_afA and m_afB on the
> heap. I didn't care about alignment, since malloc is supposed to return
> 8-byte aligned pointers.

No, malloc is not meant to return 8-byte aligned pointers; it is meant
to return pointers suitably aligned for any type, and that is *not* the
same thing by a long way.

> Then I suspected cache misses to be the root of the
> evil and tried the cachegrind tool. If I interpret the results correctly,
> the code has nearly no L1 misses, but a 20% L2 miss rate. That doesn't
> sound too bad, does it?
>
> Can somebody explain to me what's going wrong? Or even better, tell me
> how to make this faster?

If switching to double instead of float does not help, then I suspect
your problem is specific to your particular implementation and so better
handled on an implementation-specific group. Personally, if I knew this
was the critical function and knew it was a problem, I would be tempted
to hand-code it in assembler, making full use of all the tricks the
processor provides.
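
One way to try the double suggestion without converting every array is to widen only the accumulator. This is my own variation on the advice above, not code from the thread, again written in the standalone C form used for the conversions elsewhere:

void filter_double_acc(const float *afX, float *afY, int nLength,
                       const float *afA, const float *afB, int nPoles)
{
    int nI, nN;

    for (nI = nPoles; nI < nLength; ++nI) {
        double tY = 0.0;                 /* accumulate in double precision */
        for (nN = 0; nN <= nPoles; ++nN)
            tY += afA[nN] * (double)afX[nI - nN];
        for (nN = 1; nN <= nPoles; ++nN)
            tY += afB[nN] * (double)afY[nI - nN];
        afY[nI] = (float)tY;             /* narrow back to float on store */
    }
}

On x87 hardware the wider accumulator usually costs little, since the FPU computes in extended precision internally anyway; whether it changes the measured time is exactly the implementation-specific question raised above.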
 

Ben Bacarisse

Markus Grunwald said:
> [original post and code snipped]
>
> This _would_ be fine. But the function takes 1.66 s! (No, I didn't forget
> -O3 ;) This is about 80 times more ...

Curious. I translated to C (s/::/_/ and made the m_ variables have file
scope) and I get 0.023s per function call (timed over 100 of them).
The flops program on my laptop reports numbers from 434 up to 1228, so it
is not that much faster than your Pentium 4, but on my hardware the
numbers tie up.

model name : Intel(R) Pentium(R) M processor 2.00GHz
cache size : 2048 KB
gcc-4.1

I know this does not help, but it is another data point. (The reason
I tried was to see if making the floats into doubles would help, but it
did not make any difference on my hardware.)
 

websnarf

> [original post and code snipped]

You are bottlenecked on a dependency on the tY variable. Unroll it
into 4 parallel operations on tY0, tY1, tY2, tY3, then add the 4 of
them together into tY (note: this introduces numerical differences).

For an example of what I am talking about, see:

http://www.pobox.com/~qed/optimize.html#alulatency
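
A sketch of that transformation applied to the filter loop, in the standalone C form several posters used for their conversions (coefficients and pole count passed as parameters; this is an adaptation, not code from the thread). The remainder loops cover pole counts that are not a multiple of four; as noted above, the grouped partial sums change the rounding, and the feedback taps still depend serially on earlier afY values across output samples:

void filter_unrolled(const float *afX, float *afY, int nLength,
                     const float *afA, const float *afB, int nPoles)
{
    int nI, nN;

    for (nI = nPoles; nI < nLength; ++nI) {
        float tY0 = 0.0f, tY1 = 0.0f, tY2 = 0.0f, tY3 = 0.0f;

        /* feed-forward taps 0 .. nPoles, four partial sums in flight */
        for (nN = 0; nN + 3 <= nPoles; nN += 4) {
            tY0 += afA[nN]     * afX[nI - nN];
            tY1 += afA[nN + 1] * afX[nI - nN - 1];
            tY2 += afA[nN + 2] * afX[nI - nN - 2];
            tY3 += afA[nN + 3] * afX[nI - nN - 3];
        }
        for (; nN <= nPoles; ++nN)          /* leftover taps */
            tY0 += afA[nN] * afX[nI - nN];

        /* feedback taps 1 .. nPoles; these read earlier outputs, so the
         * dependency between successive nI iterations remains */
        for (nN = 1; nN + 3 <= nPoles; nN += 4) {
            tY0 += afB[nN]     * afY[nI - nN];
            tY1 += afB[nN + 1] * afY[nI - nN - 1];
            tY2 += afB[nN + 2] * afY[nI - nN - 2];
            tY3 += afB[nN + 3] * afY[nI - nN - 3];
        }
        for (; nN <= nPoles; ++nN)
            tY0 += afB[nN] * afY[nI - nN];

        afY[nI] = (tY0 + tY1) + (tY2 + tY3);
    }
}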
 

Malcolm McLean

Markus Grunwald said:
> Nope, it didn't work :(
Time this:

float mulft(float *x, int N)
{
    float answer = 0;
    while(N--)
        answer += x[N] * 0.1f;
    return answer;
}

You can't get a loop much tighter than that, except maybe by using pointer
notation. It will tell you how fast you can index a number and multiply it
by a constant, which gives you a practical lower bound for your function.
Run it once in isolation, then two or three times on the same array, to see
the cache effects.
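
A possible harness for that measurement, as a sketch only: the array size below matches the one from the original post, and timing 100 calls per pass (as Ben Bacarisse did for the filter) keeps the numbers above clock()'s granularity; none of the constants are prescribed by the thread.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Malcolm's reference loop, unchanged. */
float mulft(float *x, int N)
{
    float answer = 0;
    while(N--)
        answer += x[N] * 0.1f;
    return answer;
}

int main(void)
{
    enum { N = 262144, RUNS = 3, REPS = 100 };
    float *x = malloc(N * sizeof *x);
    int r, i;

    if (!x)
        return EXIT_FAILURE;
    for (i = 0; i < N; ++i)
        x[i] = (float)i / N;

    /* Each pass averages REPS calls; repeating RUNS passes over the same
     * array shows how much the warmed cache changes the per-call time. */
    for (r = 0; r < RUNS; ++r) {
        double total = 0.0;
        clock_t t0, t1;

        t0 = clock();
        for (i = 0; i < REPS; ++i)
            total += mulft(x, N);
        t1 = clock();
        printf("pass %d: %.6f s per call (checksum %g)\n",
               r, (double)(t1 - t0) / CLOCKS_PER_SEC / REPS, total);
    }
    free(x);
    return 0;
}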
 

Tim Prince

Markus said:
> [original post and code snipped]
>
> Then I suspected cache misses to be the root of the
> evil and tried the cachegrind tool. If I interpret the results correctly,
> the code has nearly no L1 misses, but a 20% L2 miss rate. That doesn't
> sound too bad, does it?

As you are using (an old version of) gcc, your question would be more
topical on (e-mail address removed). Your ideas of "obvious optimizations"
seem rather limited; adding -funroll-loops -mfpmath=sse -march=pentium4
should help.
You would have to investigate whether serial addition is limiting your
performance. A 20% L2 miss rate would be rather serious on a P4, and would
not be solved within your realm of obvious optimizations.
The combination of forward and backward strides is a big problem for SSE
optimization, so upgrading to a more powerful compiler and setting
vectorization options would not by itself solve your problem.
 

SM Ryan

# Hello,
#
# while implementing a simple algorithm for digital filters (IIR filters), I
# was very confused by its poor performance. This is the code; discussion
# follows:

# I tried all the obvious optimisations by hand: use pointers instead of
# array indices, combine the two inner loops into one. In the end, the gcc
# result was always better than my hand "optimized" code.

Could you please state that in the form of a question.

If you're saying the compiler optimised better than you did and you're
asking why, the short answer is magic! The longer answer could involve
loop skewing, promotions, jamming, unrolling, cache blocking, common
subexpressions, loop invariants, register residency, etc. You can look
at the assembly code.
 

Pierre Asselin

Markus Grunwald said:
> while implementing a simple algorithm for digital filters (IIR filters), I
> was very confused by its poor performance. This is the code; discussion
> follows:
> [ ... ]
>     tY=0;
>     for( nN=0; nN<=m_nPoles; ++nN )
>     {
>         tY += m_afA[ nN ] * afX[ nI - nN ];
>     }
> [ ... ]

Negative stride on the afX[] array? Try the -funroll-loops
option to gcc. If that doesn't help, try doing your
convolution in the other direction,

    for( nN=0; nN<=m_nPoles; ++nN )
    {
        tY += m_afA[ m_nPoles - nN ] * afX[ nI-m_nPoles + nN ];
    }

so that the negative stride is now on the filter array,
then redefine your filter layout:

    m_afA_new[k] == m_afA[m_nPoles-k]   for all k

so that your loop becomes

    for( nN=0; nN<=m_nPoles; ++nN )
    {
        tY += m_afA_new[ nN ] * afX[ nI-m_nPoles + nN ];
    }

with positive unit strides everywhere. Do the same with the other loop.
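
The one-time setup for that layout change might look like the sketch below (names adapted from the post above, without the m_ member prefix; the caller is assumed to supply a destination array of nPoles+1 elements):

/* Copy the feed-forward coefficients in reverse order, so the per-sample
 * loop can walk both the coefficient array and afX with a positive unit
 * stride, as in the rewritten inner loop above. */
void reverse_coeffs(const float *afA, float *afA_new, int nPoles)
{
    int k;

    for (k = 0; k <= nPoles; ++k)
        afA_new[k] = afA[nPoles - k];
}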
 
