Victor said:
Gianni Mariani wrote:
....
Nice. What would you say if I did
T * Ap = &A[0][0];
const T * Bp ...
...
for (int i = 0; i < count; ++i)
*Ap++ = *Bp++ + *Cp++;
? It might be just a tad faster...
It might, it might not. (And it is not faster in the test below in the
default case but it is faster when unrolling on my athlon 2Ghz.)
Some architechtures will be faster with indexing and some with moving
pointers.
If the processor has a load (reg + offs) then the index might be faster
because there is a single increment (and shift) instead of 4 increments.
On a Athlon64 with gcc 3.3.3 machine w/o unrolling, indexing is 25%
faster. With unrolling, it seems like the gcc compiler likes the
pointer code better and the pointer code is 15% faster.
Timings for the code below:
g++ (GCC) 3.4.0
g++ -O3 -march=athlon-mp -o matrix_add matrix_add.cpp
time ./matrix_add
3.070u 0.010s 0:03.08 100.0% 0+0k 0+0io 164pf+0w
time ./matrix_add pointers
4.080u 0.000s 0:04.08 100.0% 0+0k 0+0io 164pf+0w
- index code is faster
now with unrolling
g++ -O3 -funroll-loops -march=athlon-mp -o matrix_add matrix_add.cpp
time ./matrix_add
2.330u 0.000s 0:02.31 100.8% 0+0k 0+0io 164pf+0w
time ./matrix_add pointers
2.080u 0.010s 0:02.08 100.4% 0+0k 0+0io 164pf+0w
The compiler does a pretty bad job of unrolling index loops !
If you do a disassembly, you'll see that the four increments cost more -
sure you could eliminate one inrement and it might result in faster code
in the non-unrolling case but slower code in the unrolling case because
knowing when to terminate is a little harder for the compiler to figure out.
Anyhow, I think I made the point that "it depends". In general, in
numeric code when you're indexing into multiple arrays, indexing might
be faster than manipulating multiple pointers.
Having has a look at the assembler code, the compiler could do a much
better job of unrolling the index code. I suspect other compilers might
actually do a much better job.
---- the code ----
template <typename T, int Rows, int Cols>
inline void Add(
T (&A)[Rows][Cols],
const T (&B)[Rows][Cols],
const T (&C)[Rows][Cols]
) {
T * const Ap = & A[0][0];
const T * const Bp = & B[0][0];
const T * const Cp = & C[0][0];
const int count = Rows * Cols;
for ( int i = 0; i < count; ++ i )
{
Ap[ i ] = Bp[ i ] + Cp[ i ];
}
}
template <typename T, int Rows, int Cols>
inline void Addp(
T (&A)[Rows][Cols],
const T (&B)[Rows][Cols],
const T (&C)[Rows][Cols]
) {
T * Ap = & A[0][0];
const T * Bp = & B[0][0];
const T * Cp = & C[0][0];
const int count = Rows * Cols;
for ( int i = 0; i < count; ++ i )
{
* Ap ++ = * Bp ++ + * Cp ++;
}
}
enum {
rows = 20,
cols = 20
};
typedef int matrix_t[ rows ][ cols ];
void foo(matrix_t & A, matrix_t & B, matrix_t & C)
{
Add( A, B, C );
}
void foop(matrix_t & A, matrix_t & B, matrix_t & C)
{
Addp( A, B, C );
}
matrix_t A, B, C;
int main( int argc, char ** argv )
{
void ( * foof )(matrix_t & A, matrix_t & B, matrix_t & C);
foof = argc == 2 ? & foop : & foo;
for ( int i = 0; i < 5000000; i ++ )
{
( *foof )( A, B, C );
}
}