loop optimization

J

Jayden Shui

Dear All,

I have a high dimensional array and want to assign one block of data
in the array to another block in the array. The two blocks are not
overlapped. The loops are very slow. Code is attached. Any ways to
optimize it? Thanks a lot.

#include <time.h>
#include <iostream>
using namespace std;

int main()
{
float a[100][100][8];

int ib = 1, ie = 99;
int jb = 1, je = 99;
int kb0 = 0, ke0 = 1;
int kb1 = 3, ke1 = 2;

clock_t t0 = clock();

// Loop to count CPU time.
for (int m = 0; m < 100000; ++m) {

// a [ib:ie] [jb:je] [kb0:ke1] = a [ib:ie] [jb:je] [kb1:ke1:-1]
slow
for (int i = ib; i <= ie; ++i)
{
for (int j = jb; j <= je; ++j)
{
for (int k0 = kb0, k1 = kb1; k0 <= ke0; ++k0, --k1)
{
a[j][k0] = a[j][k1];
}
}
}

}

clock_t t1 = clock();
cout << "CPU Time = " << double(t1 - t0) / CLOCKS_PER_SEC << "s
\n";

return 0;
}
 
J

Jayden Shui

Dear All,
I have a high dimensional array and want to assign one block of data
in the array to another block in the array. The two blocks are not
overlapped. The loops are very slow.

Compared to what? It appears that you had to repeat this 100000 times
just in order to get some meaningful timings - does not look slow to me.

Other than that, try to increase your compiler optimization level. The
inner loop is 2 iterations and should be rolled out, some compilers may
require special flags for doing that (-funroll-loops for gcc, for
example).

hth
Paavo






Code is attached. Any ways to
optimize it? Thanks a lot.
#include <time.h>
#include <iostream>
using namespace std;
int main()
{
    float a[100][100][8];
    int ib = 1, ie = 99;
    int jb = 1, je = 99;
    int kb0 = 0, ke0 = 1;
    int kb1 = 3, ke1 = 2;
    clock_t t0 = clock();
    // Loop to count CPU time.
    for (int m = 0; m < 100000; ++m) {
    // a [ib:ie] [jb:je] [kb0:ke1] = a [ib:ie] [jb:je] [kb1:ke1:-1]
slow
    for (int i = ib; i <= ie; ++i)
    {
        for (int j = jb; j <= je; ++j)
        {
            for (int k0 = kb0, k1 = kb1; k0 <= ke0; ++k0, --k1)
            {
                a[j][k0] = a[j][k1];
            }
        }
    }

    clock_t t1 = clock();
    cout << "CPU Time = " << double(t1 - t0) / CLOCKS_PER_SEC << "s
\n";
    return 0;
}


This is testing code for CPU time of the operations. My code has lots
of such similar operations. I hope using some loop optimization
techniques in http://en.wikipedia.org/wiki/Loop_optimization to speed
it up. I think the testing code should be run in less than 0.5 s in i7
after loop optimization (-O2 or O3 is really not enough) according to
my experience.
 
V

Victor Bazarov

I have a high dimensional array and want to assign one block of data
in the array to another block in the array. The two blocks are not
overlapped. The loops are very slow. Code is attached. Any ways to
optimize it? Thanks a lot.

#include<time.h>
#include<iostream>
using namespace std;

int main()
{
float a[100][100][8];

int ib = 1, ie = 99;
int jb = 1, je = 99;
int kb0 = 0, ke0 = 1;
int kb1 = 3, ke1 = 2;

clock_t t0 = clock();

// Loop to count CPU time.
for (int m = 0; m< 100000; ++m) {

// a [ib:ie] [jb:je] [kb0:ke1] = a [ib:ie] [jb:je] [kb1:ke1:-1]
slow
for (int i = ib; i<= ie; ++i)
{
for (int j = jb; j<= je; ++j)
{
for (int k0 = kb0, k1 = kb1; k0<= ke0; ++k0, --k1)
{
a[j][k0] = a[j][k1];
}
}
}

}

clock_t t1 = clock();
cout<< "CPU Time = "<< double(t1 - t0) / CLOCKS_PER_SEC<< "s
\n";

return 0;
}


Some compilers I've seen (don't recall which, though) could generate
better code if you helped them reduce the indexing by extracting the
"common subexpression" outside of the inner loop. Something like

for (int i = ...
{
float (&ai)[100][8] = a;
for (int j = ...
{
float (&aij)[8] = ai[j];
for (int k0 ...
{
aij[k0] = aij[k1];
}
}
}

Better compilers do that on their own. Try it and see if you get any
improvement. If not, you could (a) parallelize the outer loop (since
there is no overlap) but keep in mind it's only worth if the overhead of
starting worker threads is much smaller than the actual work each has to
perform, or (b) change your algorithm and the data structure.

Barring those choices, you're pretty much stuck with what you have.

V
 
C

copx

Dear All,

I have a high dimensional array and want to assign one block of data
in the array to another block in the array. The two blocks are not
overlapped. The loops are very slow. Code is attached. Any ways to
optimize it? Thanks a lot.
[snip code]

I compiled this and ran it on my desktop computer which is not very
fast:

Unoptimized build ("g++ loop.cpp"): 15.762s
Optimized build ("g++ -O2 loop.cpp"): 0s

Yes, that's a zero. It runs so fast you cannot even measure it using
your method. Did you forget to enable optimization?
 
J

Jayden Shui

On 02.02.2012 15:55, Jayden Shui wrote:> Dear All,
I have a high dimensional array and want to assign one block of data
in the array to another block in the array. The two blocks are not
overlapped. The loops are very slow. Code is attached. Any ways to
optimize it? Thanks a lot.

[snip code]

I compiled this and ran it on my desktop computer which is not very
fast:

Unoptimized build ("g++ loop.cpp"): 15.762s
Optimized build ("g++ -O2 loop.cpp"): 0s

Yes, that's a zero. It runs so fast you cannot even measure it using
your method. Did you forget to enable optimization?

This is due to there is no output and the code in fact is not run in -
O2. Please add

int i, j, k;
cin >> i >> j >> k;
cin << a[j][k];

before return 0;
 
V

Victor Bazarov

Am 02.02.12 15:55, schrieb Jayden Shui:
Dear All,


// a [ib:ie] [jb:je] [kb0:ke1] = a [ib:ie] [jb:je] [kb1:ke1:-1]

I don't really understand what kind of copy this performs. But if you
can reformulate this operation to copy (large) contiguous blocks of
memory, you could use memcpy, which is probably optimized better. As
others have suggested, this problem is probably memory bandwidth
limited. This means you can gain something by vectorizing as large
blocks as possible at once.

The problem is that the order of elements in the destination block is
inverted (see the "-1"?) That prevents memcpy (or memmove) from being used.

V
 
J

Juha Nieminen

Christian Gollwitzer said:
I don't really understand what kind of copy this performs. But if you
can reformulate this operation to copy (large) contiguous blocks of
memory, you could use memcpy, which is probably optimized better. As
others have suggested, this problem is probably memory bandwidth
limited. This means you can gain something by vectorizing as large
blocks as possible at once.

When dealing with multidimensional arrays, if you need to handle
(read, copy, modify...) large portions of it at adjacent indices,
it's always much more efficient to perform the operations on elements
that are in contiguous memory locations rather than jumping around.
This makes it important which index of a multidimensional array is
modified most often.

The reason for this is, obviously, caching. When you access an element
in the array, the CPU will load adjacent memory data to the cache. If
you then access an adjacent element of the array, it will most probably
be already loaded into the cache (closest to the CPU), which will make
it really fast. If, however, you jumped around the RAM by incrementing
the wrong index, you will get lots of cache misses.

For that reason eg. this:

for(unsigned i1 = 0; i1 < size1; ++i1)
for(unsigned i2 = 0; i2 < size2; ++i2)
largeArray[i1][i2] = anotherLargeArray[i1][i2];

will usually be much faster than:

for(unsigned i2 = 0; i2 < size2; ++i2)
for(unsigned i1 = 0; i1 < size1; ++i1)
largeArray[i1][i2] = anotherLargeArray[i1][i2];

memcpy() is often very fast not only because it has been optimized to
death during the last 30 years, but also because it accesses consecutive
memory locations, which is cache-friendly.
 
A

Asger Joergensen

Hi Jayden

Jayden said:
int ib = 1, ie = 99;
int jb = 1, je = 99;
int kb0 = 0, ke0 = 1;
int kb1 = 3, ke1 = 2; .....
for (int j = jb; j <= je; ++j)
{
for (int k0 = kb0, k1 = kb1; k0 <= ke0; ++k0, --k1)

putting in the numbers:
for (int k0 = 0, k1 = 3; k0 <= ke0; ++k0, --k1)

so You are essentially only swapping the first four floats, right ?

well to my solution:

no need for all these [][][] they are just to make things slower
and in some cases simpler to understand.

// first set some sizes
const int lineSize = 8;
const int squareSize = 800;
const int cubeSize = 80000;

// get one chuck of memory of the right size
float cube[ cubeSize ];

float *square = cube;
float* squaresEnd = cube + cubeSize;

while( square < squaresEnd )
{
float* line = square;
float* linesEnd = square + squareSize;

while( line < linesEnd )
{
float* lineBegin = line;
float* lineEnd = line + 3; // lineSize ****

for(; lineBegin < lineEnd; ++lineBegin, --lineEnd )//**
*lineBegin = *lineEnd;

line += lineSize;
}

square += squareSize;
}

this is close to double speed on my machine, but I only have an old
Borland compiler, so it might not be the same for You.

****
I don't know if the 3 was what You intended, if not replace with
lineSize.
**
I removed the = from <= if they are pointing to the same float there
is really no need to swap. ;-)


Best regards
Asger-P
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads


Members online

Forum statistics

Threads
473,995
Messages
2,570,236
Members
46,822
Latest member
israfaceZa

Latest Threads

Top