loop performance question

P

pertheli

Hello,

I have a large array of pointer to some object. I have to run test
such that every possible pair in the array is tested.
eg. if A,B,C,D are items of the array,
possible pairs are AB, AC, AD, BC and CD.

I'm using two nested for loop as below, but it is running very slow
whenever I access any data in the second loop.

I suspect, time taken to fetch the obj2 from memory is the bottleneck
Is there a way to improve the performance, or accessing data from
memory any faster in case of obj2?

Thanks in advance
Pertheli

---------------
int i, j, k;
double val=0;
int iMax = 20000;

struct MyObj{
double a;
double b;
};

MyObj *o, *obj1, *obj2;

// array of *MyObj // iMax = large > 20000
MyObj** objList = (MyObj**)malloc(sizeof(MyObj*) * iMax);

// create MyObj fill pointers into objList

// for each possible pair in the array, do some testing
for (i=0; i<iMax; i++){
obj1 = (MyObj*)objList;

for (j=i; j<iMax; j++){
obj2 = (MyObj*)objList[j];

// do some tests
//val += obj1->a; //-> very fast(same obj1)
//val += obj1->b;

// same number of instruction but very slow
// accessing from different part in memory??
val += obj1->a; //<- obj1 and obj2
val += obj2->b; //<- Accessing any data of obj2 is very slow!
// 1/10th time slower than the above
}
}
 
C

Christian Bau

I have a large array of pointer to some object. I have to run test
such that every possible pair in the array is tested.
eg. if A,B,C,D are items of the array,
possible pairs are AB, AC, AD, BC and CD.

1. Check whether the data is aligned properly. There are processor where
accessing a double number is ten times slower if the alignment isn't ok.

2. Learn about caches: Try to rearrange the operations so that the same
data will be used repeatedly. Instead of

for (i = 0; i < 20000; ++i)
for (j = 0; j < 20000; ++j)
...

you write

for (i1 = 0; i1 < 20000; i1 += 100)
for (j1 = 0; j1 < 20000; j1 += 100)
for (i = i1; i < i1 + 100; ++i)
for (j = j1; j < j1 + 100; ++j)
...

3. Try to find a better algorithm. For that it would be important to
know what exactly you are doing. You have (iMax * iMax) operations;
usually these things can be done faster.
 
S

Sean Kenwrick

pertheli said:
Hello,

I have a large array of pointer to some object. I have to run test
such that every possible pair in the array is tested.
eg. if A,B,C,D are items of the array,
possible pairs are AB, AC, AD, BC and CD.

I'm using two nested for loop as below, but it is running very slow
whenever I access any data in the second loop.

I suspect, time taken to fetch the obj2 from memory is the bottleneck
Is there a way to improve the performance, or accessing data from
memory any faster in case of obj2?

Thanks in advance
Pertheli

---------------
int i, j, k;
double val=0;
> int iMax = 20000;

struct MyObj{
double a;
double b;
};

MyObj *o, *obj1, *obj2;

// array of *MyObj // iMax = large > 20000
MyObj** objList = (MyObj**)malloc(sizeof(MyObj*) * iMax);

// create MyObj fill pointers into objList

// for each possible pair in the array, do some testing
for (i=0; i<iMax; i++){
obj1 = (MyObj*)objList;

for (j=i; j<iMax; j++){
obj2 = (MyObj*)objList[j];

// do some tests
file://val += obj1->a; file://-> very fast(same obj1)
file://val += obj1->b;

// same number of instruction but very slow
// accessing from different part in memory??
val += obj1->a; file://<- obj1 and obj2
val += obj2->b; file://<- Accessing any data of obj2 is very slow!
// 1/10th time slower than the above
}
}


This is all down to memory optimisation. On most CPUs there is a data
cache which is VERY fast and takes only one CPU cycle to retreive data. To
retrieve data from main memory is much slower and can take as much as 36 CPU
cycles. Thus you should always try to make your algorithm so that it
will access consecutive items in memory as much as possible rather than
jumping about from one place to another.

In order to fully optimise you algorithm you should also know about the
organisation of the cache on your machine. When a 'cache miss' is
encountered and the item it fetched from main memory, the CPU will fetch an
entire 'cache line' into the cache. The size of this cache line is CPU
dependant but on the Pentium for example it is 32 bytes. Therefore
you should try and allign you data on 32 byte boundaries (you compiler might
provide options for this) and you should try and read consecutive bytes from
this cache line as much as possible.

For example you should place all commonly accessed elements of a structure
(or object) near the top of the structure so that they all fall into the
same 32 byte cache line (thus accessing each of these elements of your
structure will not result in a cache miss).

In the above code, you get the pointer to Obj1 then you access element a.
This will put the entire contents of Obj1 into cache so your second access
to element b will take only 1 CPU cycle. But is you try and access an
element from Obj2 it will likely result in a cache miss.

To optimise you need to break your loop into smaller sections so that you
get Obj1 into cache (by accessing an element) which will also cause the next
3 elements after Obj 1 into cache also.

So you have 4 Objects in cache so you should compare all combinations of
these objects first:

AB AC AD BC BD (total 6 comparisons)

Then keeping this first object in cache, read the object 4 away and compare
A with each of these and each of these with each other...

I'll leave the actual algorithm to you since my mind is starting to boggle
:)

Sean



So you should just compare amon
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,138
Messages
2,570,804
Members
47,349
Latest member
jojonoy597

Latest Threads

Top