Matrix optimization

Pat · Jul 28, 2004

Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[j]=B[j]+C[j] for each i,j

The easiest way is to use for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?
Thanks.
Pat

Gianni Mariani · Jul 28, 2004

Pat said:
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[j]=B[j]+C[j] for each i,j

The easiest way is to use for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?
Thanks.
Pat

This depends so much on implementation. Some compilers know how to
"vectorize" for loops, some machines have serious cache considerations,
some machines have vector instructions.

Your question is not really about C++ per-se, I suggest you ask
discussion groups that are related directly to the platform you're
asking about to get the right answer.

However, given a dumb compiler and a dumb architecture, the fastest this
is probably somthing like this.

template <typename T, int Rows, int Cols>
void Add(
T (&A)[Rows][Cols],
const T (&B)[Rows][Cols],
const T (&C)[Rows][Cols]
) {
T * const Ap = & A[0][0];
const T * const Bp = & B[0][0];
const T * const Cp = & C[0][0];

const int count = Rows * Cols;

for ( int i = 0; i < count; ++ i )
{
Ap[ i ] = Bp[ i ] + Cp[ i ];
}

}

// if the compiler is able to do loop unrolling, this can be quite
// zippy;

*** Note - I did not syntax check it.

Victor Bazarov · Jul 28, 2004

Pat said:
Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[j]=B[j]+C[j] for each i,j

The easiest way is to use for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?

(a) Why do you "think the performance is not good"?

(b) You could avoid indexing if you define pointers and increment
those in the loop:

for (int i = 0; i < M; ++i)
for (int j = 0; j < N; ++j)
A[j] = B[j] + C[j];

becomes

for (int i = 0; i < M; ++i) {
TYPE *pa = A, *pb = B, *pc = C;
for (int j = 0; j < N; ++j)
*pa++ = *pb++ + *pc++;
}

You can still eliminate the indexing by , but I'll leave it
to you to figure out.

(c) Consider that "premature optimization is the root of all evil".
The [j] form while may not be extremely efficient, at least
conveys the message clearer than *pa++ = ...

Victor

Victor Bazarov · Jul 28, 2004

Gianni said:
Pat said:

Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[j]=B[j]+C[j] for each i,j

The easiest way is to use for-loop, but I think the performance is not
good.

Is it possible to find out some faster way to do that?
Thanks.
Pat

Click to expand...

This depends so much on implementation. Some compilers know how to
"vectorize" for loops, some machines have serious cache considerations,
some machines have vector instructions.

Your question is not really about C++ per-se, I suggest you ask
discussion groups that are related directly to the platform you're
asking about to get the right answer.

However, given a dumb compiler and a dumb architecture, the fastest this
is probably somthing like this.

template <typename T, int Rows, int Cols>
void Add(
T (&A)[Rows][Cols],
const T (&B)[Rows][Cols],
const T (&C)[Rows][Cols]
) {
T * const Ap = & A[0][0];
const T * const Bp = & B[0][0];
const T * const Cp = & C[0][0];

const int count = Rows * Cols;

for ( int i = 0; i < count; ++ i )
{
Ap[ i ] = Bp[ i ] + Cp[ i ];
}

}

// if the compiler is able to do loop unrolling, this can be quite
// zippy;

Nice. What would you say if I did

T * Ap = &A[0][0];
const T * Bp ...
...
for (int i = 0; i < count; ++i)
*Ap++ = *Bp++ + *Cp++;

? It might be just a tad faster...

V

Peter van Merkerk · Jul 28, 2004

Victor said:
Pat said:

Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[j]=B[j]+C[j] for each i,j

The easiest way is to use for-loop, but I think the performance is not
good.

Is it possible to find out some faster way to do that?

Click to expand...

(a) Why do you "think the performance is not good"?

(b) You could avoid indexing if you define pointers and increment
those in the loop:

for (int i = 0; i < M; ++i)
for (int j = 0; j < N; ++j)
A[j] = B[j] + C[j];

becomes

for (int i = 0; i < M; ++i) {
TYPE *pa = A, *pb = B, *pc = C;
for (int j = 0; j < N; ++j)
*pa++ = *pb++ + *pc++;
}

You can still eliminate the indexing by , but I'll leave it
to you to figure out.

With a good optimizing compiler it is possible that it makes no
difference at all...

(c) Consider that "premature optimization is the root of all evil".
The [j] form while may not be extremely efficient, at least
conveys the message clearer than *pa++ = ...

Click to expand...

Also you may want to consider using a matrix library rather than
inventing your own.

Victor Bazarov · Jul 28, 2004

Peter said:
[..]

Also you may want to consider using a matrix library rather than
inventing your own.

That's reasonable, unless the OP's goal is to create a matrix library
for others to consider.

Gianni Mariani · Jul 29, 2004

Victor said:
Gianni Mariani wrote:
....

Nice. What would you say if I did

T * Ap = &A[0][0];
const T * Bp ...
...
for (int i = 0; i < count; ++i)
*Ap++ = *Bp++ + *Cp++;

? It might be just a tad faster...

It might, it might not. (And it is not faster in the test below in the
default case but it is faster when unrolling on my athlon 2Ghz.)

Some architechtures will be faster with indexing and some with moving
pointers.

If the processor has a load (reg + offs) then the index might be faster
because there is a single increment (and shift) instead of 4 increments.

On a Athlon64 with gcc 3.3.3 machine w/o unrolling, indexing is 25%
faster. With unrolling, it seems like the gcc compiler likes the
pointer code better and the pointer code is 15% faster.

Timings for the code below:
g++ (GCC) 3.4.0
g++ -O3 -march=athlon-mp -o matrix_add matrix_add.cpp

time ./matrix_add
3.070u 0.010s 0:03.08 100.0% 0+0k 0+0io 164pf+0w

time ./matrix_add pointers
4.080u 0.000s 0:04.08 100.0% 0+0k 0+0io 164pf+0w

- index code is faster

now with unrolling

g++ -O3 -funroll-loops -march=athlon-mp -o matrix_add matrix_add.cpp

time ./matrix_add
2.330u 0.000s 0:02.31 100.8% 0+0k 0+0io 164pf+0w

time ./matrix_add pointers
2.080u 0.010s 0:02.08 100.4% 0+0k 0+0io 164pf+0w

The compiler does a pretty bad job of unrolling index loops !

If you do a disassembly, you'll see that the four increments cost more -
sure you could eliminate one inrement and it might result in faster code
in the non-unrolling case but slower code in the unrolling case because
knowing when to terminate is a little harder for the compiler to figure out.

Anyhow, I think I made the point that "it depends". In general, in
numeric code when you're indexing into multiple arrays, indexing might
be faster than manipulating multiple pointers.

Having has a look at the assembler code, the compiler could do a much
better job of unrolling the index code. I suspect other compilers might
actually do a much better job.

---- the code ----

template <typename T, int Rows, int Cols>
inline void Add(
T (&A)[Rows][Cols],
const T (&B)[Rows][Cols],
const T (&C)[Rows][Cols]
) {
T * const Ap = & A[0][0];
const T * const Bp = & B[0][0];
const T * const Cp = & C[0][0];

const int count = Rows * Cols;

for ( int i = 0; i < count; ++ i )
{
Ap[ i ] = Bp[ i ] + Cp[ i ];
}

}

template <typename T, int Rows, int Cols>
inline void Addp(
T (&A)[Rows][Cols],
const T (&B)[Rows][Cols],
const T (&C)[Rows][Cols]
) {
T * Ap = & A[0][0];
const T * Bp = & B[0][0];
const T * Cp = & C[0][0];

const int count = Rows * Cols;

for ( int i = 0; i < count; ++ i )
{
* Ap ++ = * Bp ++ + * Cp ++;
}

}

enum {
rows = 20,
cols = 20
};

typedef int matrix_t[ rows ][ cols ];

void foo(matrix_t & A, matrix_t & B, matrix_t & C)
{
Add( A, B, C );
}

void foop(matrix_t & A, matrix_t & B, matrix_t & C)
{
Addp( A, B, C );
}

matrix_t A, B, C;

int main( int argc, char ** argv )
{

void ( * foof )(matrix_t & A, matrix_t & B, matrix_t & C);

foof = argc == 2 ? & foop : & foo;

for ( int i = 0; i < 5000000; i ++ )
{
( *foof )( A, B, C );
}
}

Peter van Merkerk · Jul 29, 2004

Gianni said:
Victor said:

Nice. What would you say if I did

T * Ap = &A[0][0];
const T * Bp ...
...
for (int i = 0; i < count; ++i)
*Ap++ = *Bp++ + *Cp++;

? It might be just a tad faster...

Click to expand...

It might, it might not. (And it is not faster in the test below in the
default case but it is faster when unrolling on my athlon 2Ghz.)

Some architechtures will be faster with indexing and some with moving
pointers.

If the processor has a load (reg + offs) then the index might be faster
because there is a single increment (and shift) instead of 4 increments.

It is not just a matter of processor architecture. The compiler can do
some optimization tricks as well, like replacing indexing operations
with pointer arithmetric.

tom_usenet · Jul 29, 2004

Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[j]=B[j]+C[j] for each i,j

The easiest way is to use for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?

Generally the best way to optimize matrix operations is to merge
different loops into one loop, which is usually done with the
assistance of expression templates. With a decent compiler,
micro-optimization of an individual loop is not going to give you
order of magnitude increases in performance, and that kind of
optimization should come only once you've done the algorithmic
improvements and found it's still not fast enough.

There are a lot of good matrix maths libraries available. A good
starting point is www.oonumerics.org, or you could go the boost route:
http://www.boost.org/libs/numeric/ublas/doc/index.htm

Tom

rossum · Jul 29, 2004

Given A[n][m], B[n][m] and C[n][m]

I would like to calculate the sum of each entry:

A[j]=B[j]+C[j] for each i,j

The easiest way is to use for-loop, but I think the performance is not good.

Is it possible to find out some faster way to do that?
Thanks.
Pat

One thing to be aware of is whether your system stores matrices in row
order or in column order (yes I know there are othr ways to store a
matrix). If it uses one of these two methods then you should time the
two different ways of setting up the i and j loops

// i outside, j inside
for (int i = 0; i < IMAX; ++i) {
for (int j = 0; j < JMAX; ++j) {
A[j]=B[j]+C[j]
}
}

and

// j outside, i inside
for (int j = 0; j < JMAX; ++j) {
for (int i = 0; i < IMAX; ++i) {
A[j]=B[j]+C[j]
}
}

There may be a difference, and if there is then it is worth knowing
since the code change is so simple. This is also worth trying with
some of the other methods suggested.

rossum

Java matrix problem	3	Sep 10, 2023
How to multiply two matrices of size in using inline assembly in C++	3	Mar 3, 2024
C program: memory leak/ segmentation fault/ memory limit exceeded	0	Nov 12, 2022
How to accept text and put each letter into a 2d matrix?	0	Jun 3, 2022
Competitive coding	1	Nov 12, 2022
Blue J Ciphertext Program	2	Nov 22, 2023
Getting incorrect output in finding the maximum pair sum in the given array.	7	Apr 6, 2023
How to automatically append x,y values to a matrix	2	Jan 30, 2023

Matrix optimization

Pat

Gianni Mariani

Victor Bazarov

Victor Bazarov

Peter van Merkerk

Victor Bazarov

Gianni Mariani

Peter van Merkerk

tom_usenet

rossum

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads