Optimisation needed

spasmous

Hi,

I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations?

Thanks for any help!


for(m=-range;m<=range;m++) // range is either 1 or 2
{
for(n=-range;n<=range;n++)
{
/* indices */
int x0 = (int)px - m;
int y0 = (int)py - n;
index = x0*kpar->fftrows + y0;

/* zero out of bounds */
index *= (int)(index>=0 && index<(kpar->fftrows*kpar->fftrows));

/* calculate distance */
float dx = px - (float)x0;
float dy = py - (float)y0;
deltad = dx*dx + dy*dy;

/* calculate coefficient */
fscale = expf(deltad * norm);

/* store values */
str->array_1[index] += val_1 * fscale;
str->array_2[index] += val_2 * fscale;
str->array_3[index] += fscale;
}
}
 
Ron Natalie

spasmous said:
Hi,

I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations?

It would be nice if you told us the types of some of these things like px, py
and array_1,2,3.

Also, if you're on a Pentium chip, you may wish to ask in an Intel-specific
newsgroup. Most compilers generate absolutely horrific code for float/double
to int conversion.
 
Andre Kostur

spasmous wrote:
Hi,

I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations?

Thanks for any help!


for(m=-range;m<=range;m++) // range is either 1 or 2
{
for(n=-range;n<=range;n++)
{
/* indices */
int x0 = (int)px - m;
int y0 = (int)py - n;
index = x0*kpar->fftrows + y0;

/* zero out of bounds */
index *= (int)(index>=0 && index<(kpar->fftrows*kpar->fftrows));

/* calculate distance */
float dx = px - (float)x0;
float dy = py - (float)y0;
deltad = dx*dx + dy*dy;

/* calculate coefficient */
fscale = expf(deltad * norm);

/* store values */
str->array_1[index] += val_1 * fscale;
str->array_2[index] += val_2 * fscale;
str->array_3[index] += fscale;
}
}

Umm... some simple stuff that _might_ help:

Move the declaration and initialization of x0 outside the inner loop. It
doesn't need to be recalculated every time.

Move the declaration and initialization of dx outside the inner loop
(same reason). I'd probably square it outside the loop too.
 
Thomas Matthews

spasmous said:
Hi,

I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations?

Thanks for any help!


for(m=-range;m<=range;m++) // range is either 1 or 2
{
for(n=-range;n<=range;n++)
{
/* indices */
int x0 = (int)px - m;
int y0 = (int)py - n;
index = x0*kpar->fftrows + y0;

/* zero out of bounds */
index *= (int)(index>=0 && index<(kpar->fftrows*kpar->fftrows));

/* calculate distance */
float dx = px - (float)x0;
float dy = py - (float)y0;

Let us perform some substitution:
Given x0 == px - m; from above,
dx = px - (px - m);
dx = px - px + m;
dx = m;
Similarly with dy:
dy = n;
deltad = dx*dx + dy*dy;

This is simplified to:
deltad = m * m + n * n;
This simplification removes the variables dx and dy.

/* calculate coefficient */
fscale = expf(deltad * norm);

/* store values */
str->array_1[index] += val_1 * fscale;
str->array_2[index] += val_2 * fscale;
str->array_3[index] += fscale;
}
}



--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.raos.demon.uk/acllc-c++/faq.html
Other sites:
http://www.josuttis.com -- C++ STL Library book
http://www.sgi.com/tech/stl -- Standard Template Library
 
Andre Kostur

Let us perform some substitution:
Given x0 == px - m; from above,
dx = px - (px - m);
dx = px - px + m;
dx = m;

Error. Your replacements aren't identical to the original:

Note that x0 is assigned to the integral part of px, minus m. Not just
px. So later on, when dx is calculated... the px mentioned in there is
the float px, not just the integral part. Thus you can't drop the px term
from the equations.
Similarly with dy:
dy = n;

Same error.
deltad = dx*dx + dy*dy;

This is simplified to:
deltad = m * m + n * n;
This simplification removes the variables dx and dy.

/* calculate coefficient */
fscale = expf(deltad * norm);

/* store values */
str->array_1[index] += val_1 * fscale;
str->array_2[index] += val_2 * fscale;
str->array_3[index] += fscale;
}
}
 
spasmous

Andre Kostur said:
Umm... some simple stuff that _might_ help:

Move the declaration and initialization of x0 outside the inner loop. It
doesn't need to be recalculated every time.

Move the declaration and initialization of dx outside the inner loop
(same reason). I'd probably square it outside the loop too.


Mm, I think the compiler must be doing that already - it makes no
difference to the runtime. Thanks for the tip tho' :)

A quick experiment I did commenting out the expf() call reduced the
timing by about 75%, so I suspect that's doing the damage. And that's
also probably already as fast as it can go, being a math library
function, right?
 
Patrik Stellmann

for(m=-range;m<=range;m++) // range is either 1 or 2
{
for(n=-range;n<=range;n++)
{
The compiler probably optimizes it anyway, and for built-in numeric types
there may be no real difference, but in general ++m is faster than m++,
since the pseudo-code for the first looks like:

increment m;
return m;

and for the second:

temp = m;
increment m;
return temp;

so in for loops where you are not interested in the result you might
prefer the prefix ++ operator before the postfix...
 
Kevin Goodsell

spasmous said:
Hi,

I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations?

I haven't really looked at your code, but here's a few tips.

The first thing you should do (and this applies to ALL attempts at
optimization) is profile the code. This serves 2 purposes: 1) It can
help you determine if optimization is actually necessary - there's no
reason to optimize code that's already fast enough. 2) It allows you to
properly target optimization effort. Never try to guess which part of
the code is slow, and never try to guess what tweaks will make it
faster. These things have to be determined empirically.

The second thing you should do is look for a better algorithm.
Algorithmic improvements almost always outperform fine-tuning, often
dramatically.

The third thing you should probably do is double check your compiler's
optimization settings. See if there's a more appropriate set of options
for your purposes.

Only after doing those should you look at fine-tuning the code. In
general, hand-optimized code is uglier, more error-prone, more difficult
to debug, and more difficult to maintain, so it should be avoided until
there are no other options. Many of the things you are likely to try are
probably already done by decent optimizing compilers, so make sure you
profile to see if your changes make a difference - there's no reason to
use uglier code that's just as slow.

-Kevin
 
NKOBAYE027

If the range is only 1 or 2 then you're probably best off not bothering with
loops - if speed (not size) is of the essence, just write the code out
straight.

Definitely get rid of all those declarations and casts inside the loops -
there has to be a better way of representing the data than that
hodgepodge... Use register counters in the loops (though most compilers
probably optimise to that anyway).

Depending upon the accuracy of your exp requirements, use your own
approximation algorithm - just be sure to calculate the factorial
denominators beforehand and save them in an array for later use...
 
Chris Theis

spasmous said:
Andre Kostur wrote:
A quick experiment I did commenting out the expf() call reduced the
timing by about 75%, so I suspect that's doing the damage. And that's
also probably already as fast as it can go, being a math library
function, right?

Well, the exponential function is certainly a bottleneck. However, are you
sure that your optimizations are necessary and that this is the right place
to apply them? You should profile first and see where the time is lost. If
it's the code you gave, then think about whether the algorithm is the best
one. If you can answer both of these questions with yes, you might consider
"hands-on" optimization. If you're dealing with constants you might consider
template metaprogramming for loop unrolling and the calculation of the
exponential. Still, premature optimization might be an evil thing. Thus,
fire up your profiler first!

Regards
Chris
 
Andre Kostur

spasmous wrote:
Mm, I think the compiler must be doing that already - it makes no
difference to the runtime. Thanks for the tip tho' :)

Yep... that's why I said "might" :) Those are simply moving loop
invariants outside of the loop. A decent optimizer should be able to spot
those and move them outside on its own. If you change the actual source
code, it makes it more obvious to other people reading your code.
 
Nick Hounsome

spasmous said:
Andre Kostur wrote:


Mm, I think the compiler must be doing that already - it makes no
difference to the runtime. Thanks for the tip tho' :)

A quick experiment I did commenting out the expf() call reduced the
timing by about 75%, so I suspect that's doing the damage. And that's
also probably already as fast as it can go, being a math library
function, right?

It is quite possible that operations on double are faster than on float,
as the implementation will be optimized for double and the floating point
hardware may only actually work on doubles. You may actually be paying
for a lot of double/float conversions.

I'm not up on how these things are implemented but
expf(deltad*norm) == pow( deltad, expf(norm) )
so precalculating expf(norm) may help.
 
David Fisher

spasmous said:
I have an inner loop that is running too slowly. I've done my best at
optimising it but am not really experienced at what works best. Can
anyone suggest any optimisations?
for(m=-range;m<=range;m++) // range is either 1 or 2
{
for(n=-range;n<=range;n++)
{
/* indices */
int x0 = (int)px - m;
int y0 = (int)py - n;
index = x0*kpar->fftrows + y0;

/* zero out of bounds */
index *= (int)(index>=0 && index<(kpar->fftrows*kpar->fftrows));

Instead of multiplying index by 0 or 1, you could say:

if (index < 0 || index >= fftrowsSquared)
{
index = 0;
}

and define fftrowsSquared as (kpar->fftrows*kpar->fftrows) before the
beginning of the first loop.

Values of argument to expf():

(square(px - (float) ((int)px - m)) + square(py - (float) ((int) py - n))) *
norm

float pxVal = px - (float) ((int) px)
float pyVal = py - (float) ((int) py)

square(pxVal - (-range to range)) + square(pyVal - (-range to range))

square(pxVal - 1) + square(pyVal - 1)
etc.

/* calculate distance */
float dx = px - (float) ((int) px - m);
float dy = py - (float)y0;
deltad = dx*dx + dy*dy;

/* calculate coefficient */
fscale = expf(deltad * norm);

/* store values */
str->array_1[index] += val_1 * fscale;
str->array_2[index] += val_2 * fscale;
str->array_3[index] += fscale;
}
}
 
David Fisher

Oops, previous mail got sent before it was finished ...

I was trying to say, the arguments to expf() (without the multiplication by
"norm") look something like this:

square(pxVal -1) + square(pyVal + 1) == (square(pxVal) - 2 * pxVal + 1) +
(square(pyVal) + 2 * pyVal + 1)
square(pxVal - 2) + square(pyVal) == (square(pxVal) - 4 * pxVal + 4) +
square(pyVal)
etc.

where

float pxVal = px - (float) ((int) px);
float pyVal = py - (float) ((int) py);

So the differences in the arguments to expf() are a multiple of pxVal plus a
multiple of pyVal plus a constant ...

exp(a + b) = exp(a) * exp(b)

so you can calculate the following values and use them to find the required
values of expf():

exp(square(pxVal))
exp(square(pyVal))
exp(norm) => constant ?
exp(1) => constant, once only
exp(4) => constant, once only
exp(pxVal * 2) => same as square(exp(pxVal)) => just calculate exp(pxVal)
once
exp(pxVal * -2) => same as sqrt(exp(pxVal))
exp(pxVal * 4) => same as square(square(exp(pxVal)))
exp(pxVal * -4) => same as sqrt(sqrt(exp(pxVal)))
etc.

You should end up with a set of expressions that are faster to calculate
than expf(): square, sqrt, *, etc

Hope this is helpful,

David F
 
David Fisher

David Fisher said:
values of expf():

exp(square(pxVal))
exp(square(pyVal))
exp(norm) => constant ?
exp(1) => constant, once only
exp(4) => constant, once only
exp(pxVal * 2) => same as square(exp(pxVal)) => just calculate exp(pxVal)
once
exp(pxVal * -2) => same as sqrt(exp(pxVal))
exp(pxVal * 4) => same as square(square(exp(pxVal)))
exp(pxVal * -4) => same as sqrt(sqrt(exp(pxVal)))
etc.

You should end up with a set of expressions that are faster to calculate
than expf(): square, sqrt, *, etc

Hmm ... still leaves the problem of multiplying by "norm":

exp(a * b) == pow(exp(a), b)

You wouldn't want to say pow(whatever, norm) for each expression - that
would cost as much as calling expf() in the first place ...

I guess the solution is to multiply all of the above-listed values by
"norm", ie. calculate the values of:

exp(square(pxVal) * norm)
exp(4 * norm)
etc.

(I hope the maths is right :)

David F
 
Buster

I'm not up on how these things are implemented but
expf(deltad*norm) == pow( deltad, expf(norm) )
so precalculating expf(norm) may help.

You mean expf (deltad * norm) == pow (expf (norm), deltad);
But in any case,
pow (exp (norm), deltad) == exp (deltad * log (exp (norm)));
exp and pow are interdefinable and as such it's hard to say
a priori whether either is faster.

Regards,
Buster.
 
Buster

exp(pxVal * -2) => same as sqrt(exp(pxVal))

No; same as inverse (square (exp (pxVal))).
You should end up with a set of expressions that are faster to calculate
than expf(): square, sqrt, *, etc

I'm not sure that sqrt will be much faster than exp,
but luckily it's not needed.
 
David Fisher

Buster said:
No; same as inverse (square (exp (pxVal))).

What was I thinking ? :p
I'm not sure that sqrt will be much faster than exp,
but luckily need it's not needed.

There is a fast way of calculating square roots (Newton's method), so I was
assuming it would be more efficient than exp() or pow() ...

David F
 
spasmous

<snip good suggestions...>

Thanks guys, I played around on paper with the equations and I think I
can greatly reduce the expf() calls the way you suggest. However it's
a bit messy and the calls to pow() are *very* expensive. Slower than
expf() in my tests.

So... another way (similar) is to precompute the dx and dy expf().
I've implemented it like this and expf() only takes up 50% of the
runtime rather than 75% before, with an overall time reduction of
about 25%. Not spectacular but I'll take it :)


/* Storage */
float * EXP_X = (float*)malloc(...);
float * EXP_Y = (float*)malloc(...);


/* Indices */
int x0 = (int)px;
int y0 = (int)py;
float dx = px - x0;
float dy = py - y0;


/* Precompute expf */
for(m=-range;m<=range;m++)
{
EXP_X[m+range] = expf(norm*(dx+m)*(dx+m));
EXP_Y[m+range] = expf(norm*(dy+m)*(dy+m));
}


for(m=-range;m<=range;m++)
{
for(n=-range;n<=range;n++)
{
index = (x0-m)*kpar->fftrows + (y0-n);

/* zero out of bounds. NB. This is quicker than using 'if' */
index *= (int)(index>=0 && index<(kpar->fftrows*kpar->fftrows));

/* calculate coefficient */
fscale = EXP_X[m+range] * EXP_Y[n+range];

/* Store values */
....
}
}
 
