Optimisation needed

spasmous

Hi,

I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations?

Thanks for any help!


for(m=-range;m<=range;m++) // range is either 1 or 2
{
for(n=-range;n<=range;n++)
{
/* indices */
int x0 = (int)px - m;
int y0 = (int)py - n;
index = x0*kpar->fftrows + y0;

/* zero out of bounds */
index *= (int)(index>=0 && index<(kpar->fftrows*kpar->fftrows));

/* calculate distance */
float dx = px - (float)x0;
float dy = py - (float)y0;
deltad = dx*dx + dy*dy;

/* calculate coefficient */
fscale = expf(deltad * norm);

/* store values */
str->array_1[index] += val_1 * fscale;
str->array_2[index] += val_2 * fscale;
str->array_3[index] += fscale;
}
}
 
Ron Natalie

spasmous said:
Hi,

I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations?

It would be nice if you told us the types of some of these things like px, py
and array_1,2,3.

Also, if you're on a Pentium chip, you may wish to ask in an Intel-specific
newsgroup. Most compilers generate absolutely horrific code for float/double
to int conversion.
 
Andre Kostur

spasmous wrote:
Hi,

I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations?

Thanks for any help!


for(m=-range;m<=range;m++) // range is either 1 or 2
{
for(n=-range;n<=range;n++)
{
/* indices */
int x0 = (int)px - m;
int y0 = (int)py - n;
index = x0*kpar->fftrows + y0;

/* zero out of bounds */
index *= (int)(index>=0 && index<(kpar->fftrows*kpar->fftrows));

/* calculate distance */
float dx = px - (float)x0;
float dy = py - (float)y0;
deltad = dx*dx + dy*dy;

/* calculate coefficient */
fscale = expf(deltad * norm);

/* store values */
str->array_1[index] += val_1 * fscale;
str->array_2[index] += val_2 * fscale;
str->array_3[index] += fscale;
}
}

Umm... some simple stuff that _might_ help:

Move the declaration and initialization of x0 outside the inner loop. It
doesn't need to be recalculated every time.

Move the declaration and initialization of dx outside the inner loop
(same reason). I'd probably square it outside the loop too.
 
Thomas Matthews

spasmous said:
Hi,

I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations?

Thanks for any help!


for(m=-range;m<=range;m++) // range is either 1 or 2
{
for(n=-range;n<=range;n++)
{
/* indices */
int x0 = (int)px - m;
int y0 = (int)py - n;
index = x0*kpar->fftrows + y0;

/* zero out of bounds */
index *= (int)(index>=0 && index<(kpar->fftrows*kpar->fftrows));

/* calculate distance */
float dx = px - (float)x0;
float dy = py - (float)y0;

Let us perform some substitution:
Given x0 == px - m; from above,
dx = px - (px - m);
dx = px - px + m;
dx = m;
Similarly with dy:
dy = n;
deltad = dx*dx + dy*dy;

This is simplified to:
deltad = m * m + n * n;
This simplification removes the variables dx and dy.

/* calculate coefficient */
fscale = expf(deltad * norm);

/* store values */
str->array_1[index] += val_1 * fscale;
str->array_2[index] += val_2 * fscale;
str->array_3[index] += fscale;
}
}



--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book
http://www.sgi.com/tech/stl -- Standard Template Library
 
Andre Kostur

Let us perform some substitution:
Given x0 == px - m; from above,
dx = px - (px - m);
dx = px - px + m;
dx = m;

Error. Your replacements aren't identical to the original:

Note that x0 is assigned to the integral part of px, minus m. Not just
px. So later on, when dx is calculated... the px mentioned in there is
the float px, not just the integral part. Thus you can't drop the px term
from the equations.
Similarly with dy:
dy = n;

Same error.
deltad = dx*dx + dy*dy;

This is simplified to:
deltad = m * m + n * n;
This simplification removes the variables dx and dy.

/* calculate coefficient */
fscale = expf(deltad * norm);

/* store values */
str->array_1[index] += val_1 * fscale;
str->array_2[index] += val_2 * fscale;
str->array_3[index] += fscale;
}
}
 
spasmous

Andre Kostur said:
Umm... some simple stuff that _might_ help:

Move the declaration and initialization of x0 outside the inner loop. It
doesn't need to be recalculated every time.

Move the declaration and initialization of dx outside the inner loop
(same reason). I'd probably square it outside the loop too.


Mm, I think the compiler must be doing that already - it makes no
difference to the runtime. Thanks for the tip tho' :)

A quick experiment I did commenting out the expf() call reduced the
timing by about 75%, so I suspect that's doing the damage. And that's
also probably already as fast as it can go, being a math library
function, right?
 
Patrik Stellmann

for(m=-range;m<=range;m++) // range is either 1 or 2
{
for(n=-range;n<=range;n++)
{
The compiler probably optimizes it anyway, and for built-in numeric types
there may be no real difference, but in general ++m is faster than m++,
since the pseudo-code for the first looks like:

increment m;
return m;

and for the second:

temp = m;
increment m;
return temp;

so in for loops where you are not interested in the result you might
prefer the prefix ++ operator before the postfix...
 
Kevin Goodsell

spasmous said:
Hi,

I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations?

I haven't really looked at your code, but here's a few tips.

The first thing you should do (and this applies to ALL attempts at
optimization) is profile the code. This serves 2 purposes: 1) It can
help you determine if optimization is actually necessary - there's no
reason to optimize code that's already fast enough. 2) It allows you to
properly target optimization effort. Never try to guess which part of
the code is slow, and never try to guess what tweaks will make it
faster. These things have to be determined empirically.

The second thing you should do is look for a better algorithm.
Algorithmic improvements almost always outperform fine-tuning, often
dramatically.

The third thing you should probably do is double check your compiler's
optimization settings. See if there's a more appropriate set of options
for your purposes.

Only after doing those should you look at fine-tuning the code. In
general, hand-optimized code is uglier, more error-prone, more difficult
to debug, and more difficult to maintain, so it should be avoided until
there are no other options. Many of the things you are likely to try are
probably already done by decent optimizing compilers, so make sure you
profile to see if your changes make a difference - there's no reason to
use uglier code that's just as slow.

-Kevin
 
NKOBAYE027

If the range is only 1 or 2 then you're probably best off not bothering with
loops - if speed (not size) is of the essence, just write the code out
straight.

Definitely get rid of all those declarations and casts inside the loops -
there has to be a better way of representing the data than that
hodgepodge... Use register counters in the loops (though most compilers
probably optimise to that anyway).

Depending upon the accuracy of your exp requirements, use your own
approximation algorithm - just be sure to calculate the factorial
denominators beforehand and save them in an array for later use...
 
Chris Theis

spasmous said:
Andre Kostur wrote:
A quick experiment I did commenting out the expf() call reduced the
timing by about 75%, so I suspect that's doing the damage. And that's
also probably already as fast as it can go, being a math library
function, right?

Well, the exponential function is certainly a bottleneck. However, are you
sure that your optimizations are necessary and that this is the right place
to apply them? You should profile first and see where the time is lost. If
it's the code you gave, then think about whether the algorithm is the best
one. If you can answer both of these questions with yes, you might consider
"hands-on" optimization. If you're dealing with constants you might consider
template metaprogramming for loop unrolling and the calculation of the
exponential. Still, premature optimization might be an evil thing. Thus,
fire up your profiler first!

Regards
Chris
 
Andre Kostur

spasmous wrote:
Mm, I think the compiler must be doing that already - it makes no
difference to the runtime. Thanks for the tip tho' :)

Yep... that's why I said "might" :) Those are simply moving loop
invariants outside of the loop. A decent optimizer should be able to spot
those and move them outside on its own. If you change the actual source
code, it makes it more obvious to other people reading your code.
 
Nick Hounsome

spasmous said:
Andre Kostur wrote:


Mm, I think the compiler must be doing that already - it makes no
difference to the runtime. Thanks for the tip tho' :)

A quick experiment I did commenting out the expf() call reduced the
timing by about 75%, so I suspect that's doing the damage. And that's
also probably already as fast as it can go, being a math library
function, right?

It is quite possible that operations on double are faster than on float,
as the implementation will be optimized for double and the floating point
hardware may only actually work on doubles. You may actually be paying
for a lot of double/float conversions.

I'm not up on how these things are implemented but
expf(deltad*norm) == pow( deltad, expf(norm) )
so precalculating expf(norm) may help.
 
David Fisher

spasmous said:
I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations?
for(m=-range;m<=range;m++) // range is either 1 or 2
{
for(n=-range;n<=range;n++)
{
/* indices */
int x0 = (int)px - m;
int y0 = (int)py - n;
index = x0*kpar->fftrows + y0;

/* zero out of bounds */
index *= (int)(index>=0 && index<(kpar->fftrows*kpar->fftrows));

Instead of multiplying index by 0 or 1, you could say:

if (index < 0 || index >= fftrowsSquared)
{
index = 0;
}

and define fftrowsSquared as (kpar->fftrows*kpar->fftrows) before the
beginning of the first loop.

Values of argument to expf():

(square(px - (float) ((int)px - m)) + square(py - (float) ((int) py - n))) *
norm

float pxVal = px - (float) ((int) px)
float pyVal = py - (float) ((int) py)

square(pxVal - (-range to range)) + square(pyVal - (-range to range))

square(pxVal - 1) + square(pyVal - 1)
etc.

/* calculate distance */
float dx = px - (float) ((int) px - m);
float dy = py - (float)y0;
deltad = dx*dx + dy*dy;

/* calculate coefficient */
fscale = expf(deltad * norm);

/* store values */
str->array_1[index] += val_1 * fscale;
str->array_2[index] += val_2 * fscale;
str->array_3[index] += fscale;
}
}
 
David Fisher

Oops, previous mail got sent before it was finished ...

I was trying to say, the arguments to expf() (without the multiplication by
"norm") look something like this:

square(pxVal -1) + square(pyVal + 1) == (square(pxVal) - 2 * pxVal + 1) +
(square(pyVal) + 2 * pyVal + 1)
square(pxVal - 2) + square(pyVal) == (square(pxVal) - 4 * pxVal + 4) +
square(pyVal)
etc.

where

float pxVal = px - (float) ((int) px);
float pyVal = py - (float) ((int) py);

So the differences in the arguments to expf() are a multiple of pxVal plus a
multiple of pyVal plus a constant ...

exp(a + b) = exp(a) * exp(b)

so you can calculate the following values and use them to find the required
values of expf():

exp(square(pxVal))
exp(square(pyVal))
exp(norm) => constant ?
exp(1) => constant, once only
exp(4) => constant, once only
exp(pxVal * 2) => same as square(exp(pxVal)) => just calculate exp(pxVal)
once
exp(pxVal * -2) => same as sqrt(exp(pxVal))
exp(pxVal * 4) => same as square(square(exp(pxVal)))
exp(pxVal * -4) => same as sqrt(sqrt(exp(pxVal)))
etc.

You should end up with a set of expressions that are faster to calculate
than expf(): square, sqrt, *, etc

Hope this is helpful,

David F
 
David Fisher

David Fisher said:
values of expf():

exp(square(pxVal))
exp(square(pyVal))
exp(norm) => constant ?
exp(1) => constant, once only
exp(4) => constant, once only
exp(pxVal * 2) => same as square(exp(pxVal)) => just calculate exp(pxVal)
once
exp(pxVal * -2) => same as sqrt(exp(pxVal))
exp(pxVal * 4) => same as square(square(exp(pxVal)))
exp(pxVal * -4) => same as sqrt(sqrt(exp(pxVal)))
etc.

You should end up with a set of expressions that are faster to calculate
than expf(): square, sqrt, *, etc

Hmm ... still leaves the problem of multiplying by "norm":

exp(a * b) == pow(exp(a), b)

You wouldn't want to say pow(whatever, norm) for each expression - that
would cost as much as calling expf() in the first place ...

I guess the solution is to multiply all of the above-listed values by
"norm", ie. calculate the values of:

exp(square(pxVal) * norm)
exp(4 * norm)
etc.

(I hope the maths is right :)

David F
 
Buster

I'm not up on how these things are implemented but
expf(deltad*norm) == pow( deltad, expf(norm) )
so precalculating expf(norm) may help.

You mean expf (deltad * norm) == pow (expf (norm), deltad);
But in any case,
pow (exp (norm), deltad) == exp (deltad * log (exp (norm)));
exp and pow are interdefinable and as such it's hard to say
a priori whether either is faster.

Regards,
Buster.
 
Buster

exp(pxVal * -2) => same as sqrt(exp(pxVal))

No; same as inverse (square (exp (pxVal))).
You should end up with a set of expressions that are faster to calculate
than expf(): square, sqrt, *, etc

I'm not sure that sqrt will be much faster than exp,
but luckily it's not needed.
 
David Fisher

Buster said:
No; same as inverse (square (exp (pxVal))).

What was I thinking ? :p
I'm not sure that sqrt will be much faster than exp,
but luckily need it's not needed.

There is a fast way of calculating square roots (Newton's method), so I was
assuming it would be more efficient than exp() or pow() ...

David F
 
spasmous

<snip good suggestions...>

Thanks guys, I played around on paper with the equations and I think I
can greatly reduce the expf() calls the way you suggest. However it's
a bit messy and the calls to pow() are *very* expensive. Slower than
expf() in my tests.

So... another way (similar) is to precompute the dx and dy expf().
I've implemented it like this and expf() only takes up 50% of the
runtime rather than 75% before, with an overall time reduction of
about 25%. Not spectacular but I'll take it :)


/* Storage */
float * EXP_X = (float*)malloc(...);
float * EXP_Y = (float*)malloc(...);


/* Indices */
int x0 = (int)px;
int y0 = (int)py;
float dx = px - x0;
float dy = py - y0;


/* Precompute expf */
for(m=-range;m<=range;m++)
{
EXP_X[m+range] = expf(norm*(dx+m)*(dx+m));
EXP_Y[m+range] = expf(norm*(dy+m)*(dy+m));
}


for(m=-range;m<=range;m++)
{
for(n=-range;n<=range;n++)
{
index = (x0-m)*kpar->fftrows + (y0-n);

/* zero out of bounds. NB. This is quicker than using 'if' */
index *= (int)(index>=0 && index<(kpar->fftrows*kpar->fftrows));

/* calculate coefficient */
fscale = EXP_X[m+range] * EXP_Y[n+range];

/* Store values */
....
}
}
 
