Matrix optimization

P

Pat

Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[j]=B[j]+C[j] for each i,j

The easiest way is to use for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?
Thanks.
Pat
 
G

Gianni Mariani

Pat said:
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[j]=B[j]+C[j] for each i,j

The easiest way is to use for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?
Thanks.
Pat


This depends so much on implementation. Some compilers know how to
"vectorize" for loops, some machines have serious cache considerations,
some machines have vector instructions.

Your question is not really about C++ per-se, I suggest you ask
discussion groups that are related directly to the platform you're
asking about to get the right answer.

However, given a dumb compiler and a dumb architecture, the fastest this
is probably somthing like this.

template <typename T, int Rows, int Cols>
void Add(
T (&A)[Rows][Cols],
const T (&B)[Rows][Cols],
const T (&C)[Rows][Cols]
) {
T * const Ap = & A[0][0];
const T * const Bp = & B[0][0];
const T * const Cp = & C[0][0];

const int count = Rows * Cols;

for ( int i = 0; i < count; ++ i )
{
Ap[ i ] = Bp[ i ] + Cp[ i ];
}

}

// if the compiler is able to do loop unrolling, this can be quite
// zippy;

*** Note - I did not syntax check it.
 
V

Victor Bazarov

Pat said:
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[j]=B[j]+C[j] for each i,j

The easiest way is to use for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?


(a) Why do you "think the performance is not good"?

(b) You could avoid indexing if you define pointers and increment
those in the loop:

for (int i = 0; i < M; ++i)
for (int j = 0; j < N; ++j)
A[j] = B[j] + C[j];

becomes

for (int i = 0; i < M; ++i) {
TYPE *pa = A, *pb = B, *pc = C;
for (int j = 0; j < N; ++j)
*pa++ = *pb++ + *pc++;
}

You can still eliminate the indexing by , but I'll leave it
to you to figure out.

(c) Consider that "premature optimization is the root of all evil".
The [j] form while may not be extremely efficient, at least
conveys the message clearer than *pa++ = ...

Victor
 
V

Victor Bazarov

Gianni said:
Pat said:
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[j]=B[j]+C[j] for each i,j

The easiest way is to use for-loop, but I think the performance is not
good.

Is it possible to find out some faster way to do that?
Thanks.
Pat


This depends so much on implementation. Some compilers know how to
"vectorize" for loops, some machines have serious cache considerations,
some machines have vector instructions.

Your question is not really about C++ per-se, I suggest you ask
discussion groups that are related directly to the platform you're
asking about to get the right answer.

However, given a dumb compiler and a dumb architecture, the fastest this
is probably somthing like this.

template <typename T, int Rows, int Cols>
void Add(
T (&A)[Rows][Cols],
const T (&B)[Rows][Cols],
const T (&C)[Rows][Cols]
) {
T * const Ap = & A[0][0];
const T * const Bp = & B[0][0];
const T * const Cp = & C[0][0];

const int count = Rows * Cols;

for ( int i = 0; i < count; ++ i )
{
Ap[ i ] = Bp[ i ] + Cp[ i ];
}

}

// if the compiler is able to do loop unrolling, this can be quite
// zippy;


Nice. What would you say if I did

T * Ap = &A[0][0];
const T * Bp ...
...
for (int i = 0; i < count; ++i)
*Ap++ = *Bp++ + *Cp++;

? It might be just a tad faster...

V
 
P

Peter van Merkerk

Victor said:
Pat said:
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[j]=B[j]+C[j] for each i,j

The easiest way is to use for-loop, but I think the performance is not
good.

Is it possible to find out some faster way to do that?



(a) Why do you "think the performance is not good"?

(b) You could avoid indexing if you define pointers and increment
those in the loop:

for (int i = 0; i < M; ++i)
for (int j = 0; j < N; ++j)
A[j] = B[j] + C[j];

becomes

for (int i = 0; i < M; ++i) {
TYPE *pa = A, *pb = B, *pc = C;
for (int j = 0; j < N; ++j)
*pa++ = *pb++ + *pc++;
}

You can still eliminate the indexing by , but I'll leave it
to you to figure out.


With a good optimizing compiler it is possible that it makes no
difference at all...
(c) Consider that "premature optimization is the root of all evil".
The [j] form while may not be extremely efficient, at least
conveys the message clearer than *pa++ = ...



Also you may want to consider using a matrix library rather than
inventing your own.
 
V

Victor Bazarov

Peter said:
[..]

Also you may want to consider using a matrix library rather than
inventing your own.

That's reasonable, unless the OP's goal is to create a matrix library
for others to consider.
 
G

Gianni Mariani

Victor said:
Gianni Mariani wrote:
....


Nice. What would you say if I did

T * Ap = &A[0][0];
const T * Bp ...
...
for (int i = 0; i < count; ++i)
*Ap++ = *Bp++ + *Cp++;

? It might be just a tad faster...

It might, it might not. (And it is not faster in the test below in the
default case but it is faster when unrolling on my athlon 2Ghz.)

Some architechtures will be faster with indexing and some with moving
pointers.

If the processor has a load (reg + offs) then the index might be faster
because there is a single increment (and shift) instead of 4 increments.

On a Athlon64 with gcc 3.3.3 machine w/o unrolling, indexing is 25%
faster. With unrolling, it seems like the gcc compiler likes the
pointer code better and the pointer code is 15% faster.

Timings for the code below:
g++ (GCC) 3.4.0
g++ -O3 -march=athlon-mp -o matrix_add matrix_add.cpp

time ./matrix_add
3.070u 0.010s 0:03.08 100.0% 0+0k 0+0io 164pf+0w

time ./matrix_add pointers
4.080u 0.000s 0:04.08 100.0% 0+0k 0+0io 164pf+0w

- index code is faster

now with unrolling

g++ -O3 -funroll-loops -march=athlon-mp -o matrix_add matrix_add.cpp

time ./matrix_add
2.330u 0.000s 0:02.31 100.8% 0+0k 0+0io 164pf+0w

time ./matrix_add pointers
2.080u 0.010s 0:02.08 100.4% 0+0k 0+0io 164pf+0w

The compiler does a pretty bad job of unrolling index loops !

If you do a disassembly, you'll see that the four increments cost more -
sure you could eliminate one inrement and it might result in faster code
in the non-unrolling case but slower code in the unrolling case because
knowing when to terminate is a little harder for the compiler to figure out.

Anyhow, I think I made the point that "it depends". In general, in
numeric code when you're indexing into multiple arrays, indexing might
be faster than manipulating multiple pointers.

Having has a look at the assembler code, the compiler could do a much
better job of unrolling the index code. I suspect other compilers might
actually do a much better job.

---- the code ----

template <typename T, int Rows, int Cols>
inline void Add(
T (&A)[Rows][Cols],
const T (&B)[Rows][Cols],
const T (&C)[Rows][Cols]
) {
T * const Ap = & A[0][0];
const T * const Bp = & B[0][0];
const T * const Cp = & C[0][0];

const int count = Rows * Cols;

for ( int i = 0; i < count; ++ i )
{
Ap[ i ] = Bp[ i ] + Cp[ i ];
}

}


template <typename T, int Rows, int Cols>
inline void Addp(
T (&A)[Rows][Cols],
const T (&B)[Rows][Cols],
const T (&C)[Rows][Cols]
) {
T * Ap = & A[0][0];
const T * Bp = & B[0][0];
const T * Cp = & C[0][0];

const int count = Rows * Cols;

for ( int i = 0; i < count; ++ i )
{
* Ap ++ = * Bp ++ + * Cp ++;
}

}

enum {
rows = 20,
cols = 20
};

typedef int matrix_t[ rows ][ cols ];


void foo(matrix_t & A, matrix_t & B, matrix_t & C)
{
Add( A, B, C );
}

void foop(matrix_t & A, matrix_t & B, matrix_t & C)
{
Addp( A, B, C );
}


matrix_t A, B, C;

int main( int argc, char ** argv )
{

void ( * foof )(matrix_t & A, matrix_t & B, matrix_t & C);

foof = argc == 2 ? & foop : & foo;

for ( int i = 0; i < 5000000; i ++ )
{
( *foof )( A, B, C );
}
}
 
P

Peter van Merkerk

Gianni said:
Victor said:
Nice. What would you say if I did

T * Ap = &A[0][0];
const T * Bp ...
...
for (int i = 0; i < count; ++i)
*Ap++ = *Bp++ + *Cp++;

? It might be just a tad faster...


It might, it might not. (And it is not faster in the test below in the
default case but it is faster when unrolling on my athlon 2Ghz.)

Some architechtures will be faster with indexing and some with moving
pointers.

If the processor has a load (reg + offs) then the index might be faster
because there is a single increment (and shift) instead of 4 increments.

It is not just a matter of processor architecture. The compiler can do
some optimization tricks as well, like replacing indexing operations
with pointer arithmetric.
 
T

tom_usenet

Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[j]=B[j]+C[j] for each i,j

The easiest way is to use for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?


Generally the best way to optimize matrix operations is to merge
different loops into one loop, which is usually done with the
assistance of expression templates. With a decent compiler,
micro-optimization of an individual loop is not going to give you
order of magnitude increases in performance, and that kind of
optimization should come only once you've done the algorithmic
improvements and found it's still not fast enough.

There are a lot of good matrix maths libraries available. A good
starting point is www.oonumerics.org, or you could go the boost route:
http://www.boost.org/libs/numeric/ublas/doc/index.htm

Tom
 
R

rossum

Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[j]=B[j]+C[j] for each i,j

The easiest way is to use for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?
Thanks.
Pat

One thing to be aware of is whether your system stores matrices in row
order or in column order (yes I know there are othr ways to store a
matrix). If it uses one of these two methods then you should time the
two different ways of setting up the i and j loops

// i outside, j inside
for (int i = 0; i < IMAX; ++i) {
for (int j = 0; j < JMAX; ++j) {
A[j]=B[j]+C[j]
}
}

and

// j outside, i inside
for (int j = 0; j < JMAX; ++j) {
for (int i = 0; i < IMAX; ++i) {
A[j]=B[j]+C[j]
}
}

There may be a difference, and if there is then it is worth knowing
since the code change is so simple. This is also worth trying with
some of the other methods suggested.

rossum
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,172
Messages
2,570,934
Members
47,479
Latest member
JaysonK723

Latest Threads

Top