Operator overloading, C++ performance crappiness


Karl Heinz Buchegger

Karl said:
You must have a compiler which is really bad at optimization.
In the following code, the timing is as follows:

operator+: 180 clock ticks
function add(): 240 clock ticks

So (thanks to the optimizer, which optimizes away the total overhead
of the temporary) the code using operator+ is actually *faster* than
your specialized function.

Oh. I forgot. This was measured with VC++ 6.0. A fairly old compiler.
If VC++ can do it, any other compiler younger than 8 years can do it
also.
 

Kai-Uwe Bux

Jojo said:
I didn't say anything to the contrary. The "add(const Object& obj,
Object* output)" method in my example is still faster. MatrixOp adds
overhead no matter how you look at it. Like I said before, it's a
smaller overhead than copying a large object but it _does_ add overhead.

Consider if we're working with millions of "Vector" objects that only
have two or three float members. MatrixOp would have about the same
complexity as the Vector object itself. There would be no benefit to
using it and it would be slower than the ever so ugly:

void add(const Vector& obj, Vector* output)
{
    *output = obj;
}

Jo

Here is an experiment:

#include <iostream>
#include <ctime>

unsigned long const dim = 1024;
unsigned long const runs = 1234567;

struct Vector;

struct VectorSum {

    Vector const & a;
    Vector const & b;

    VectorSum ( Vector const & a_, Vector const & b_ )
        : a ( a_ )
        , b ( b_ )
    {}

}; // struct VectorSum

struct Vector {

    double entry [dim];

    Vector ( void ) {
        for ( unsigned long i = 0; i < dim; ++i ) {
            entry[i] = 2.718281828459;
        }
    }

    Vector & operator= ( Vector const & other ) {
        for ( unsigned long i = 0; i < dim; ++i ) {
            this->entry[i] = other.entry[i];
        }
        return ( *this );
    }

    Vector & operator= ( VectorSum const & sum ) {
        sum.a.add( sum.b, *this );
        return ( *this );
    }

    void add ( Vector const & other, Vector & result ) const {
        for ( unsigned long i = 0; i < dim; ++i ) {
            result.entry[i] = this->entry[i] + other.entry[i];
        }
    }

    VectorSum operator+ ( Vector const & b ) const {
        return ( VectorSum( *this, b ) );
    }

}; // struct Vector

int main ( void ) {
    {
        Vector a;
        Vector b;
        std::clock_t ticks = std::clock();
        for ( unsigned long i = 0; i < runs; ++i ) {
            a = a + b;
        }
        std::cout << std::clock() - ticks << '\n';
    }
    {
        Vector a;
        Vector b;
        std::clock_t ticks = std::clock();
        for ( unsigned long i = 0; i < runs; ++i ) {
            a.add( b, a );
        }
        std::cout << std::clock() - ticks << '\n';
    }
}


news_group> a.out
90000
120000


This was with all optimizations of g++ turned on.


Best

Kai-Uwe Bux
 

Victor Bazarov

Jojo said:
Karl said:
inline Matrix operator+( const Matrix& lhs, const Matrix& rhs )
{
    Matrix temp; // "unnecessary" creation of temp variable

    for (unsigned i = 0; i < sizeof(lhs.data)/sizeof(int); i++)
        temp.data[i] = lhs.data[i] + rhs.data[i];

    return temp;
}



OK, _this_ is the answer I think I have been looking for. This method
is not the method I posted and this is indeed faster in GCC.

My method was:
Matrix operator+(const Matrix&)

Your method is:
operator+(const Matrix&, const Matrix&)

I did not know there was a two parameter version of operator+.


Not as a member there isn't. A single-argument variation of operator+
actually does have two operands. The other (left) operand is the hidden
argument of any non-static member function -- the object for which the
function is called. What book are you reading that doesn't explain the
difference between member-based and non-member-based operator overloading?

V
 

Karl Heinz Buchegger

Jojo said:
inline Matrix operator+( const Matrix& lhs, const Matrix& rhs )
{
    Matrix temp; // "unnecessary" creation of temp variable

    for (unsigned i = 0; i < sizeof(lhs.data)/sizeof(int); i++)
        temp.data[i] = lhs.data[i] + rhs.data[i];

    return temp;
}


OK, _this_ is the answer I think I have been looking for. This method
is not the method I posted and this is indeed faster in GCC.

My method was:
Matrix operator+(const Matrix&)


That is the member function 'operator+'.
Your method is:
operator+(const Matrix&, const Matrix&)

That is the free-standing function 'operator+'.
I did not know there was a two parameter version of operator+.

The free-standing version is usually the preferred one for operator+, operator-,
operator*, ... (all operators that return a temporary object and not a reference
to *this).

Even if I change the free standing function back into a member function,
the operator+ version is still faster (by roughly the same amount) than
the add() version.

So either your measurement code is incorrect or you did not turn on
the optimizer for your measurements, since I can't believe that VC++ 6.0
outperforms gcc in terms of code optimization.
 

Victor Bazarov

Karl said:
Oh. I forgot. This was measured with VC++ 6.0. A fairly old compiler.
If VC++ can do it, any other compiler younger than 8 years can do it
also.

Don't bet on it. VC++ v6 has a very decent optimizer. VC++ v7.1 has
a better one, of course, but even v6 could outperform many others, even
more recent ones.

V
 

Jojo

Karl said:
The free-standing version is usually the preferred one for operator+, operator-,
operator*, ... (all operators that return a temporary object and not a reference
to *this).

Even if I change the free standing function back into a member function,
the operator+ version is still faster (by roughly the same amount) than
the add() version.

So either your measurement code is incorrect or you did not turn on
the optimizer for your measurements, since I can't believe that VC++ 6.0
outperforms gcc in terms of code optimization.

Thanks. I don't think there is any difference in speed between the
add() method and the overloaded operator though. I think the difference
you are seeing is from already having the output loaded on the second
run. The CPU will optimize away some of the value changes because they
are exactly the same on the second run. If you separate out the two
versions and run two separate instances the timing will be exactly the same.

Jo
 

Kai-Uwe Bux

Jojo said:
Yes, a good compiler could do that. I have not found one that can.

So the big question is what compiler are you using?

I'm using GCC (I tried versions 3.3 and 4.0).

Jo

Hi,

I compiled Karl Heinz Buchegger's code with gcc-4.0.0, all optimizations
turned on:

news_group> a.out
60000
90000


So, I would venture the conjecture that Mr Buchegger might have used
gcc, too.


Best

Kai-Uwe Bux
 

Kai-Uwe Bux

Karl Heinz Buchegger wrote:
[snip]
int main()
{
    Matrix a(2), b(3), c;
    time_t start, end;

    start = clock();
    for( int j = 0; j < 100000; ++j )
        c = a + b;
    end = clock();

Are you sure the compiler did not optimize away all of the loop?
After all, neither a nor b nor c is changing.

    cout << end - start << endl;

    start = clock();
    for( int j = 0; j < 100000; ++j )
        a.add( b, c );
    end = clock();

Same here.

    cout << end - start << endl;
}


Best

Kai-Uwe Bux
 

Jojo

Kai-Uwe Bux said:
news_group> a.out
90000
120000


This was with all optimizations of g++ turned on.


Best

Kai-Uwe Bux

Another good find! I knew the compiler could optimize out the temporary
object when creating a new object; I didn't think of using it that way
though! I'll have to try this out.

Again though, you can't get accurate timing by running nearly the same
code twice in the same execution. You have to run it two separate times,
once with each different set of code. Otherwise the CPU can optimize out
some of the work because the values don't change at all.

Jo
 

Kai-Uwe Bux

Jojo said:
Another good find! I knew the compiler could optimize out the temporary
object when creating a new object; I didn't think of using it that way
though! I'll have to try this out.

Again though, you can't get accurate timing by running nearly the same
code twice in the same execution. You have to run it two separate times,
once with each different set of code. Otherwise the CPU can optimize out
some of the work because the values don't change at all.

Jo

Hi,

I think in my code the loops are a = a + b; and a.add( b, a ).
Thus the value of a changes. Also note that a is reset before the
second loop starts. I do not see how a CPU could be cleverly reusing
any information gained in the first run to speed up the second.

I am more puzzled that there is a difference at all: the compiler is
allowed (and hoped) to inline the operator calls and the calls to add.
In that case, the two loops should have identical assembler code.


Best

Kai-Uwe Bux
 

Jojo

Victor said:
Not as a member there isn't. A single-argument variation of operator+
actually does have two operands. The other (left) operand is the hidden
argument of any non-static member function -- the object for which the
function is called. What book are you reading that doesn't explain the
difference between member-based and non-member-based operator overloading?

V

Embarrassed to say but I've been programming in C++ for over 12 years.
I just never had to use such functionality. You learn something new
every day I guess (a testament to the annoying complexity yet extreme
power of C++).

Jo
 

Jojo

Karl said:
You must have a compiler which is really bad at optimization.
In the following code, the timing is as follows:

operator+: 180 clock ticks
function add(): 240 clock ticks

So (thanks to the optimizer, which optimizes away the total overhead
of the temporary) the code using operator+ is actually *faster* than
your specialized function.

After more testing I found this not to be the case. You have to use the
data, otherwise the compiler will do extra optimization. Here is a full
code example based on your code:

#include <iostream>
#include <ctime>

using namespace std;

class Matrix
{
public:
    int data[1024];

    Matrix() {}

    Matrix(int value)
    {
        for (unsigned i = 0; i < sizeof(data)/sizeof(int); i++)
            data[i] = value;
    }

    void add(const Matrix& obj, Matrix& output)
    {
        for (unsigned i = 0; i < sizeof(data)/sizeof(int); i++)
            output.data[i] = data[i] + obj.data[i];
    }

    friend Matrix operator+(const Matrix& lhs, const Matrix& rhs);
};

inline Matrix operator+(const Matrix& lhs, const Matrix& rhs)
{
    Matrix temp; // "unnecessary" creation of temp variable

    for (unsigned i = 0; i < sizeof(lhs.data)/sizeof(int); i++)
        temp.data[i] = lhs.data[i] + rhs.data[i];

    return temp;
}

int main()
{
    Matrix a(2), b(3), c;
    time_t start, end;

    start = clock();
    for( int j = 0; j < 5000000; ++j )
        c = a + b;
    end = clock();

    cout << a.data[0] << " " << b.data[0] << " " << c.data[0]
         << " " << end-start << endl;

    start = clock();
    for( int j = 0; j < 5000000; ++j )
        a.add( b, c );
    end = clock();

    cout << a.data[0] << " " << b.data[0] << " " << c.data[0]
         << " " << end-start << endl;

    return 0;
}

----------------------------------

$ g++-4.0 -Wall -O2 test2.cpp
$ ./a.out
2 3 5 6620000
2 3 5 8920000

So we are back to where I started: the add() method is still 50% faster.

If you take out the printing of the array values then the timing becomes
similar.

Jo
 

Jojo

$ g++-4.0 -Wall -O2 test2.cpp
$ ./a.out
2 3 5 6620000
2 3 5 8920000

Oops, those timings are inverted from the code I posted. Please note
that the 6620000 time is for "a.add()" and the 8920000 is for "c = a + b"

Jo
 

Jojo

Maxim said:
Jojo wrote:

[]

That would be fast and clean to use. Is there any way to accomplish
this? Otherwise the situation is just ugly and there is no point in
using operator overloading for these types of situations (which really
defeats the purpose of operator overloading in the first place).


Use expression templates. Refer to
http://osl.iu.edu/~tveldhui/papers/Expression-Templates/exprtmpl.html

Even that is not as fast as my add() method. Nothing beats that. If
only we could access the left-hand side of the expression then the
syntax would be clean.

Jo
 

Victor Bazarov

Jojo said:
Karl said:
You must have a compiler which is really bad at optimization.
In the following code, the timing is as follows:

operator+: 180 clock ticks
function add(): 240 clock ticks

So (thanks to the optimizer, which optimizes away the total overhead
of the temporary) the code using operator+ is actually *faster* than
your specialized function.


After more testing I found this not to be the case. You have to use the
data, otherwise the compiler will do extra optimization. Here is a full
code example based on your code:

#include <iostream>
#include <ctime>

using namespace std;

class Matrix
{
public:
    int data[1024];

    Matrix() {}

    Matrix(int value)
    {
        for (unsigned i = 0; i < sizeof(data)/sizeof(int); i++)
            data[i] = value;
    }

    void add(const Matrix& obj, Matrix& output)
    {
        for (unsigned i = 0; i < sizeof(data)/sizeof(int); i++)
            output.data[i] = data[i] + obj.data[i];
    }

    friend Matrix operator+(const Matrix& lhs, const Matrix& rhs);

There is no need for this 'friend' declaration.

};

inline Matrix operator+(const Matrix& lhs, const Matrix& rhs)
{
    Matrix temp; // "unnecessary" creation of temp variable

    for (unsigned i = 0; i < sizeof(lhs.data)/sizeof(int); i++)
        temp.data[i] = lhs.data[i] + rhs.data[i];

    return temp;
}

int main()
{
    Matrix a(2), b(3), c;
    time_t start, end;

    start = clock();
    for( int j = 0; j < 5000000; ++j )
        c = a + b;
    end = clock();

    cout << a.data[0] << " " << b.data[0] << " " << c.data[0]
         << " " << end-start << endl;

    start = clock();
    for( int j = 0; j < 5000000; ++j )
        a.add( b, c );
    end = clock();

    cout << a.data[0] << " " << b.data[0] << " " << c.data[0]
         << " " << end-start << endl;

    return 0;
}

----------------------------------

$ g++-4.0 -Wall -O2 test2.cpp
$ ./a.out
2 3 5 6620000
2 3 5 8920000

So we are back to where I started: the add() method is still 50% faster.


Huh? It seems that the first one (using the operator+) is actually faster
on your system. In my book six is smaller than eight.
If you take out the printing of the array values then the timing becomes
similar.

What do you mean by that?

Built with Visual C++ v8 Beta 2, I get the output

2 3 5 22069
2 3 5 6741

which suggests that 'add' function works better. However, since the body
of either loop doesn't use the argument, it's quite possible that the
optimizer does something drastic, like calling each function only once...
Not the best test case, IOW.

V
 

Victor Bazarov

Jojo said:
Oops, those timings are inverted from the code I posted. Please note
that the 6620000 time is for "a.add()" and the 8920000 is for "c = a + b"

That's why copy-and-paste should always be used instead of typing.
 

Jojo

Victor said:
Huh? It seems that the first one (using the operator+) is actually faster
on your system. In my book six is smaller than eight.

No, I had the output inverted from the code I posted.
What do you mean by that?

If you do not print the values a.data[0], b.data[0], etc. then the timing
becomes similar. Accessing the array prevents the compiler from
performing extra optimization (at least with gcc).
Built with Visual C++ v8 Beta 2, I get the output

2 3 5 22069
2 3 5 6741

which suggests that 'add' function works better. However, since the body
of either loop doesn't use the argument, it's quite possible that the
optimizer does something drastic, like calling each function only once...
Not the best test case, IOW.

V

Something isn't right with your code. Did you change the values in one
for-loop and forget to change the other?

Using VS 7.1

C:\>cl /Ox test2.cpp
C:\>test2
2 3 5 6703
2 3 5 11484

6703 is for add() method
11484 is for "c = a + b"

Jo
 

Jojo

Victor said:
That's why copy-and-paste should always be used instead of typing.

I did cut and paste. I had just changed the code because I was making
sure that reversing the order of the operations did not affect the timing.

Jo
 

Victor Bazarov

Jojo said:
Victor said:
Huh? It seems that the first one (using the operator+) is actually faster
on your system. In my book six is smaller than eight.


No, I had the output inverted from the code I posted.
What do you mean by that?


If you do not print the values a.data[0], b.data[0], etc. then the timing
becomes similar. Accessing the array prevents the compiler from
performing extra optimization (at least with gcc).
Built with Visual C++ v8 Beta 2, I get the output

2 3 5 22069
2 3 5 6741

which suggests that 'add' function works better. However, since the body
of either loop doesn't use the argument, it's quite possible that the
optimizer does something drastic, like calling each function only once...
Not the best test case, IOW.

V


Something isn't right with your code.

That's *your* code. And I gave the output as it appears from the code
_as_posted_.
Did you change the values in one
for-loop and forget to change the other?

No. I copied the code straight out of your post.
Using VS 7.1

C:\>cl /Ox test2.cpp
C:\>test2
2 3 5 6703
2 3 5 11484

6703 is for add() method
11484 is for "c = a + b"

Yes, I believe that. I've tested on several compilers/systems and all
pretty much give the same result: 'add' is two to three times better. The
difference undoubtedly comes from the fact that the matrix needs to be
copied to and fro.

V
 

Serge Skorokhodov (216716244)

Hi,

So we are back to where I started: the add() method is still
50% faster.

If you take out the printing of the array values then the
timing becomes similar.


Well, a simple BLAS-style plain C function add_matrix(double*
pdest, double* plh, double* prh) may be even faster... ;)

BTW, have you seen http://www.oonumerics.org/blitz/?
 
