Writing efficient normal distribution function (for learning purposes).

Luca Cerone · Feb 17, 2012

Dear all, I'm a beginner in C++ so I'm sorry if my question might seem silly.
I'm also new to the group so I just say hello to all the people in it.

I've written a function that computes the Normal Distribution, but when comparing the time to create 1000 vectors
each with 1000 points I've the same performance than using the analogous Matlab built-in function.
I'd like to improve my code for speed.
I'm doing this mainly to learn how to write efficient C++ code, so please just don't tell me
that library X and Y do this faster than my code, but help me understanding why and how to improve
my code.

I need both a single value and a vector version of this function,
so this is how I have defined them.

// this is the single point version of the function:
double normdist(const double& px,const double& mu,const double& csigma){
// evaluating the formula taken from http://en.wikipedia.org/wiki/Normal_distribution
double val = sqrt(2*M_PI)*csigma;
val = 1/val;
val = val*exp(-pow(px-mu,2)/(2*pow(csigma,2)));
return val ;
}

//this is the vector version. It takes as input a vector of double and gives the vector of corresponding values as output:

vector<double> normdist(const vector<double>& vx,const double& mu,const double& csigma){
vector<double> gv(vx.size()) ; //allocates the space for the vector gv.
for (int i=0 ; i<vx.size() ; i++){
gv=normdist(vx,mu,csigma);
}
return gv ;
}

As I told I'm a beginner so I guess there is plenty of room for improving the speed and the quality of the code
(just writing to you I have realized that the initialization of gv might not be optimal, since I create values I don't need).

Thank you very much to all of you!
Luca

Asger Joergensen · Feb 17, 2012

Hi Luca

If You can live with fixed precision You code will run much faster if You use
int's or __int64's and then multiply /devide as much as You need to get Your the
precision You need.

Best regards
Asger-P

lucacerone · Feb 17, 2012

Hi Asger-P, thanks for your reply.
Let me understand if I've understood.. by fixed precision you mean having
number with a fixed number of digits? If so, yes I can live with it
as far as I have 8 exact decimal points.

If You can live with fixed precision You code will run much faster if You use
int's or __int64's and then multiply /devide as much as You need to get Your the
precision You need.

I'm sorry but I don't understand you. How would you suggest to use ints?
What type is it __int64?

Thanks a lot for you help in any case,
Luca

Asger Joergensen · Feb 17, 2012

Hi lucacerone

Hi Asger-P, thanks for your reply.
Let me understand if I've understood.. by fixed precision you mean having
number with a fixed number of digits? If so, yes I can live with it
as far as I have 8 exact decimal points.

int64_t val = DoubleVal * 100000000;

do the rest of the work with the int and then You divide by the same number
when You need the result.

I'm sorry but I don't understand you. How would you suggest to use ints?
What type is it __int64?

64 bit integer my borland compiler like __int64, but I think int64_t is the
cross compiler way to write it.

I'm not into math so I don't know if You can use this method for Your project.

I do a lot of gradient graphics, so when I calculate color steps i use this
method and it is a lot faster then using floating point values like float and
double.

Best regards
Asger-P

Juha Nieminen · Feb 17, 2012

Asger Joergensen said:
If You can live with fixed precision You code will run much faster if You use
int's or __int64's and then multiply /devide as much as You need to get Your the
precision You need.

This isn't the early 90's. Processors have got a bit faster at
calculating with floating point values in the meantime.

Juha Nieminen · Feb 17, 2012

Luca Cerone said:
I'd like to improve my code for speed.

First of all, are you sure you are compiling with optimizations / in
release mode? A surprisingly common mistake to not to do that.

double val = sqrt(2*M_PI)*csigma;

A compiler might or might not be able to calculate that "sqrt(2*M_PI)"
at compile time (because compilers oftentimes do not perform floating point
calculations at compile time due to a myriad of reasons; of course it
depends on the compiler and its version). If you want a few clock cycles
off, might be a safer bet to write the value directly.

val = val*exp(-pow(px-mu,2)/(2*pow(csigma,2)));

Likewise a compiler might or might not be able to optimize a
"pow(x,2)" into "x*x". It may be a safer bet to write it explicitly.
(std:

ow is significantly slower than multiplication.)

vector<double> normdist(const vector<double>& vx,const double& mu,const double& csigma){
vector<double> gv(vx.size()) ; //allocates the space for the vector gv.

How many times is this function called in your program? Are you
measuring its speed by calling this function thousands/millions of
times and seeing how long it takes overall?

If yes, then there's your major problem. Memory allocation is
(unfortunately) a slow operation in C++ (not really C++'s fault,
but that's another story). You might want to pass the return value
as a reference parameter to the function and have it write there
instead of allocating a new vector each time. Then you can reuse the
same vector object for each call, avoiding the memory allocation
(except for the very first time it's called, of course).

Victor Bazarov · Feb 17, 2012

Dear all, I'm a beginner in C++ so I'm sorry if my question might seem silly.
I'm also new to the group so I just say hello to all the people in it.

I've written a function that computes the Normal Distribution, but when comparing the time to create 1000 vectors
each with 1000 points I've the same performance than using the analogous Matlab built-in function.
I'd like to improve my code for speed.
I'm doing this mainly to learn how to write efficient C++ code, so please just don't tell me
that library X and Y do this faster than my code, but help me understanding why and how to improve
my code.

I need both a single value and a vector version of this function,
so this is how I have defined them.

// this is the single point version of the function:
double normdist(const double& px,const double& mu,const double& csigma){
// evaluating the formula taken from http://en.wikipedia.org/wiki/Normal_distribution
double val = sqrt(2*M_PI)*csigma;
val = 1/val;
val = val*exp(-pow(px-mu,2)/(2*pow(csigma,2)));
return val ;

Some compilers are actually good at parsing expressions and converting
them into efficient machine code. You could try simply writing

return exp( -pow(px-mu,2)/2/pow(csigma,2) ) / sqrt(2*M_PI) / csigma;

And now you can see that there are two subexpressions that don't have to
be calculated on every call. pow(csigma,2) could be calculated once per
the entire loop and passed into 'normdist' as a const double&. The same
goes for sqrt(2*M_PI), which is just a constant you could define in the
file scope and just use here.

}

//this is the vector version. It takes as input a vector of double and gives the vector of corresponding values as output:

vector<double> normdist(const vector<double>& vx,const double& mu,const double& csigma){
vector<double> gv(vx.size()) ; //allocates the space for the vector gv.
for (int i=0 ; i<vx.size() ; i++){

While it might not matter in your particular case, get into the habit of
not calling any function in your 'for' condition if you know that it's
not going to change. I'm yet to see a compiler that knows to eliminate
it. Try this:

for (int i=0, s=vx.size(); i<s; ++i)

gv=normdist(vx,mu,csigma);
}

Most modern compilers are good with indexing, but it's a function call
for the vector, so you might want to try to eliminate it as well by
using pointers:

double* pgv = &gv[0];
const double* pvx = &vx[0];

for (...
{
*pgv++ = normdist(*pvx++, mu, csigma);
}

return gv ;
}

As I told I'm a beginner so I guess there is plenty of room for improving the speed and the quality of the code
(just writing to you I have realized that the initialization of gv might not be optimal, since I create values I don't need).

Click to expand...

Good luck!

V

Asger Joergensen · Feb 17, 2012

Hi Juha

Juha said:
This isn't the early 90's. Processors have got a bit faster at
calculating with floating point values in the meantime.

Not nearly as fast as they can do it with int's though, in my experience.

Best regards
Asger-P

Joshua Maurice · Feb 17, 2012

Dear all, I'm a beginner in C++ so I'm sorry if my question might seem silly.
I'm also new to the group so I just say hello to all the people in it.

I've written a function that computes the Normal Distribution, but when comparing the time to create 1000 vectors
each with 1000 points I've the same performance than using the analogous Matlab built-in function.
I'd like to improve my code for speed.
I'm doing this mainly to learn how to write efficient C++ code, so pleasejust don't tell me
that library X and Y do this faster than my code, but help me understanding why and how to improve
my code.

I need both a single value and a vector version of this function,
so this is how I have defined them.

// this is the single point version of the function:
double normdist(const double& px,const double& mu,const double& csigma){
// evaluating the formula taken fromhttp://en.wikipedia.org/wiki/Normal_distribution
double val = sqrt(2*M_PI)*csigma;
val = 1/val;
val = val*exp(-pow(px-mu,2)/(2*pow(csigma,2)));
return val ;

}

//this is the vector version. It takes as input a vector of double and gives the vector of corresponding values as output:

vector<double> normdist(const vector<double>& vx,const double& mu,const double& csigma){
vector<double> gv(vx.size()) ; //allocates the space for the vector gv.
for (int i=0 ; i<vx.size() ; i++){
gv=normdist(vx,mu,csigma);
}
return gv ;

}

As I told I'm a beginner so I guess there is plenty of room for improvingthe speed and the quality of the code
(just writing to you I have realized that the initialization of gv might not be optimal, since I create values I don't need).

Thank you very much to all of you!
Luca

Are compiling with all of the proper optimization flags? Are you using
MSVC, and have you disabled checked iterators?

Victor Bazarov · Feb 18, 2012

[..]
I've written a function that computes the Normal Distribution, but when comparing the time to create 1000 vectors
each with 1000 points I've the same performance than using the analogous Matlab built-in function.
[.. code using vector and indexing snipped ..]

Click to expand...

Are compiling with all of the proper optimization flags? Are you using
MSVC, and have you disabled checked iterators?

There seem to be no iterators in that code.

If the code speed is the same as Matlab's, there is probably not much
hope to significantly improve it. I doubt that Matlab people release
their library/application with sub-par performance.

V

glen herrmannsfeldt · Feb 18, 2012

(snip)

pow(x,2) should be replaced by x*x as mentioned by others, it is not
given that the optimizer is smart enough to do that by itself.

I have complained about this before. Fortran compilers have been
known to optimize x**2 for many many years. Being an operator,
instead of a function, makes it more obvious.

In addition, there is no pow(double, int) which would help in
optimizing many other cases.

-- glen

Stefan Ram · Feb 18, 2012

glen herrmannsfeldt said:
I have complained about this before. Fortran compilers have been
known to optimize x**2 for many many years. Being an operator,
instead of a function, makes it more obvious.

That reminds me of

http://shootout.alioth.debian.org/u32q/which-programming-languages-are-fastest.php

, where FORTRAN is the only language faster than languages
of the C/C++-family.

Miles Bader · Feb 19, 2012

glen herrmannsfeldt said:
I have complained about this before. Fortran compilers have been
known to optimize x**2 for many many years. Being an operator,
instead of a function, makes it more obvious.

In addition, there is no pow(double, int) which would help in
optimizing many other cases.

All the C++ compilers I've used _do_ optimize "pow(x,2)" to "x*x"...
Not very hard as optimizations go...

-miles

lucacerone · Feb 19, 2012

Hi Asger,
this solution is not feasible for me unfortunately because the function I want to approximate is nonlinear. Thanks a lot in any case for the advice.

lucacerone · Feb 19, 2012

First of all, thank you all for your advices.
Many issues have been arisen, so I try to answer to them in this post.

For the performance I have evaluated them evaluating vectors made of 1000 points, for one hundred thousand times. So I think it is a reliable measurement.

As for the compiler optimization, I'm using g++ (GNU compiler) on a Ubuntu machine. The way I compile the code is simply:

g++ -c -o mylib.o mylib.cpp
g++ -c -o main.o main.cpp
g++ -o executableneme main.o mylib.o

I have absolutely no idea on how to optimize compilation...

Amongst the other suggestions I think I've read to substitute pow(x,2) with x*x. I'll certainly do that.
Also, it is true that in the vector version of the function certain calculations can be made only once. I'll change that as well.

Victor suggested to use pointers rather than indexing
I have to be honest, while studying C++ I haven't arrived to the
pointers chapter yet, that's why I haven't used them... and that's why I haven't fully understood your answer.
Also I don't understand why accessing with pointers might be faster than direct indexing.

I'm sorry if I'm asking easy question, I really am a novice and need to learn a lot

Cheers, Luca

Asger Joergensen · Feb 19, 2012

Hi lucacerone

Victor suggested to use pointers rather than indexing
I have to be honest, while studying C++ I haven't arrived to the
pointers chapter yet, that's why I haven't used them... and that's why I haven't
fully understood your answer.

Have a look in this thread in this NG:
Re: loop optimization 01.19.01

Also I don't understand why accessing with pointers might be faster than direct
indexing.

When You say direct indexing I don't think there is much, if any, difference,
when it is a single array, but when You have [][] or [][][] there start to be
a difference, but it all depends on the compiler and various little things.
If You use vector and its indexing there is a 30% difference, at least when
compiled with my CodeGear 2009 compiler, I don't know about gcc, but You can
easily test it Your self, You can use the example from the thread i mentioned
above.
P.s. Do remember that most of the time the indexing is a very little part of
the time used, in very many cases You will not be able to see the difference
in the total picture.

I'm sorry if I'm asking easy question, I really am a novice and need to learn a lot

No problem in, we all learn more by communicating, reading is good,
but communicating is better, thats my opinion.

Best regards
Asger-P

Miles Bader · Feb 19, 2012

lucacerone said:
As for the compiler optimization, I'm using g++ (GNU compiler) on a
Ubuntu machine. The way I compile the code is simply:

g++ -c -o mylib.o mylib.cpp
g++ -c -o main.o main.cpp
g++ -o executableneme main.o mylib.o

I have absolutely no idea on how to optimize compilation...

As a start, you should turn on optimization -- the compiler does no
optimization unless you tell it to, and in many cases non-optimized code
is much, _much_ slower.

To turn on optimization, use the "-O" option, optionally followed by a
digit giving the level of optimization; with gcc, the most common choice
is "-O2" which optimizes somewhat more.

You also want to tell it what processor you're using, because by default
it will produce "generic" code that can run on almost any processor in
the same family, but can be significantly slower than code which is
precisely targetted at the processor you're using. To do this, use the
"-march=..." option; typically you can use "-march=native", which says
"produce code targetted at the processor in this machine."

So i'd suggest:

g++ -c -o mylib.o -O2 -march=native mylib.cpp
g++ -c -o main.o -O2 -march=native main.cpp
g++ -o executableneme main.o mylib.o

-Miles

Joshua Maurice · Feb 19, 2012

So i'd suggest:

g++ -c -o mylib.o -O2 -march=native mylib.cpp
g++ -c -o main.o -O2 -march=native main.cpp
g++ -o executableneme main.o mylib.o

Why no -O3?

http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options

-O3
Optimize yet more. -O3 turns on all optimizations specified by -O2
and also turns on the -finline-functions, -funswitch-loops, -
fpredictive-commoning, -fgcse-after-reload, -ftree-vectorize and -fipa-
cp-clone options.

Function inlining is pretty important for C++.

Marc · Feb 19, 2012

Joshua said:
Why no -O3?

http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options

-O3
Optimize yet more. -O3 turns on all optimizations specified by -O2
and also turns on the -finline-functions, -funswitch-loops, -
fpredictive-commoning, -fgcse-after-reload, -ftree-vectorize and -fipa-
cp-clone options.

It is fairly safe that -O2 will generate better code than -O1 (which
generates better code than -O0). With -O3, things are much less clear.
Sometimes you get better results, but it is still quite common that
-O3 code runs slower than -O2. The generated code is often much larger
(and the compilation takes longer).

Function inlining is pretty important for C++.

There is already some inlining at -O2 (exactly how much depends on the
version). And -O3 is not always enough to get the useful inlining,
sometimes you need -fprofile-generate/use or -finline-limit=... to
convince gcc.

This is not to say that -O3 should not be tried, of course. And
possibly -ffast-math if the OP is happy with imprecise (aka wrong)
results.

Miles Bader · Feb 20, 2012

Joshua Maurice said:
Why no -O3?

http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options

-O3
Optimize yet more. -O3 turns on all optimizations specified by -O2
and also turns on the -finline-functions, -funswitch-loops, -
fpredictive-commoning, -fgcse-after-reload, -ftree-vectorize and -fipa-
cp-clone options.

Function inlining is pretty important for C++.

-O2 does comprehensive inlining as well -- including functions not
marked "inline" -- using gcc's inlining heuristics, which are quite
good these days. [What "-finline-functions" does is act as _every
function_ were marked "inline" meaning gcc will be more likely to
inline stuff, even when it doesn't think it's a good idea.]

-O2 is a good default choice if want good optimization, but don't want
to spend a lot of time benchmarking. -O3 includes optimizations that
may be very expensive at compile-time or may result in excessive code
bloat (and thus can have a negative effect on the icache etc). -O3
can _sometimes_ help, but it can also sometimes hurt, so it's a good
idea to do some benchmarking to determine which is better.

-Miles

Learning Regex looking for criticism	3	Jan 13, 2025
Chatbot	0	Oct 8, 2024
Using `transform_reduce` correctly translating Javascript code to C++	0	Jun 11, 2024
Javascript set language function issue	2	Nov 24, 2024
I HAVE MADE AN IMPROVED INPUT FOR INTEGERS ONLY	3	Oct 28, 2024
How many times will a template function be instantiated for the same parameter?	4	May 28, 2013
Function is not worked in C	2	Jun 27, 2023
Chatbox for website	0	Oct 16, 2024

Writing efficient normal distribution function (for learning purposes).

Luca Cerone

Asger Joergensen

lucacerone

Asger Joergensen

Juha Nieminen

Juha Nieminen

Victor Bazarov

Asger Joergensen

Joshua Maurice

Victor Bazarov

glen herrmannsfeldt

Stefan Ram

Miles Bader

lucacerone

lucacerone

Asger Joergensen

Miles Bader

Joshua Maurice

Marc

Miles Bader

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads