removing a loop cause it to go at half the speed?

T

tom fredriksen

Hi

I was doing a simple test of the speed of a "maths" operation and when I
tested it I found that removing the loop that initialises the data array
for the operation caused the whole program to spend twice the time to
complete. If the loop is included it takes about 7.48 seconds to
complete, but when removed it takes about 11.48 seconds.

Does anybody have a suggestion as to why this is so and whether I can
trust the results of the code as it is below?

/tom


The code was compiled on linux 2.6.3 with gcc 3.3.2 and glibc 2.2.3


#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main(int argc, char *argv[])
{
int total = 0;
int count = 65500;
signed int data[count];

/* Initialising array loop */
for(int c=0; c<count; c++) {
data[c]=1+(int) (2000000000.0*rand()/(RAND_MAX+1.0));
}

struct timeval start_time;
struct timeval end_time;

gettimeofday(&start_time, NULL);

for(int d=0; d<50000; d++) {
for(int c=0; c<count; c++) {
total += data[c];
}
}
gettimeofday(&end_time, NULL);

double t1=(start_time.tv_sec*1000)+(start_time.tv_usec/1000.0);
double t2=(end_time.tv_sec*1000)+(end_time.tv_usec/1000.0);

printf("Elapsed time (ms): %.6lf\n", t2-t1);
printf("Total: %u\n", total);

return(0);
}
 
T

tom fredriksen

Rod said:
Okay, I compiled your code for DJGPP and I'm not seeing this issue. There
is a slight variation in execution times, but nothing like what you
describe. Nor are they different enough to tell which is faster or slower.
The only way I got a huge difference was compiling one as gcc -O and the
other as gcc -O2.

Interesting. I compiled without -O2 and with -O, which gave me time
respectively 22 seconds and 11 seconds. Removing the loop made no
difference. Both with the loop, which with -O2 gave me 7 seconds
previously.

Out of cuirosity, what times did you get and what system are you using?
>If not an optimization problem, I did notice that GCC
> shifted more work to the floating point instructions when the loop was
> present. Perhaps, due to instruction pairing/pipelineing, caching, etc.,
> the floating point init just runs faster for you...

Since that happens before the measured loop, thats not an issue. The
time print statement confirms that the operation takes double the time.

/tom
 
T

tom fredriksen

Jan-Hinnerk Dumjahn said:
The time you measure is the real time. It also contains time spend by other
applications or the system. Perhaps the array needs to be swapped in on the
first access.

Try running both versions of the programm with "time myProg".

it gave me, for the 7 seconds run:
6.35user 0.00system 0:07.14elapsed 88%CPU

and for the 11 seconds run:
9.98user 0.00system 0:11.07elapsed 90%CPU
Try a loop that reads through the array once instead of the initialisation.

I changed the loop to the following:

for(int c=0; c<count; c++) {
data2[c] = data[c];
}

and then printed some of the elements at the end of the program so as
not to have it optimised away. That changed the execution time to the
same for both cases. So it seems that because of using floating point in
the rand statement, it works faster. Maybe gcc optimised the program to
use floating point instead of integer operations, which is then faster.

Since it must be an integer operation, that leads to the following
question, how do I modify the statement to use integers instead.

data[c]=1.0+(unsigned int) (2000000000.0*rand()/(RAND_MAX+1.0))

just removing ".0" does not make it slower as one would expect.

/tom
 
T

tom fredriksen

Vladimir said:
I'd suggest brushing up on your algebra. Since when is 11.48 "twice the"
7.48? In my book it's closer to 50% larger ("twice larger" being 100%).

I did not say twice the larger, I said twice the time:/

/tom
 
R

Rod Pemberton

tom fredriksen said:
Interesting. I compiled without -O2 and with -O, which gave me time
respectively 22 seconds and 11 seconds. Removing the loop made no
difference. Both with the loop, which with -O2 gave me 7 seconds
previously.

Out of cuirosity, what times did you get and what system are you using?

I didn't record the times, but the -O2 were in the 45 second range, -O was
in the 90 second range.
I ran them under Win98SE, 500Mhz K6-2, DJGPP v2.03 w/GCC v3.41.
Since that happens before the measured loop, thats not an issue. The
time print statement confirms that the operation takes double the time.

Do you have a dual-core CPU? I've seen recent complaints in the assembly
language NG's that the RDTSC returns incorrect information for dual-core
CPUs. It might be that timing routine is using RDTSC.

I had to restructure the code slightly since you were using some C99
features. But, I didn't think it was critical. I moved the declarations
and init of c,c,d out of the for loops to the top of main. Also, I moved
the declarations of start_time and end_time to the top of main. For the
version I ran, I looked at the assembly from gcc -O2 -S for three versions
of your program: no loop, loop set all to zero, loop set rand() stuff. The
assembly code was very different. The no loop version used mostly integer
assembly intructions. The loop zero version used about half integer and
half floating. And the one with the rand() used maybe 70% floating point
intructions.


Rod Pemberton
 
T

tom fredriksen

Rod said:
Do you have a dual-core CPU? I've seen recent complaints in the assembly
language NG's that the RDTSC returns incorrect information for dual-core
CPUs. It might be that timing routine is using RDTSC.

No, its a single core AMD Athlon 2GHz from a couple of years ago.
I had to restructure the code slightly since you were using some C99
features. But, I didn't think it was critical. I moved the declarations
and init of c,c,d out of the for loops to the top of main. Also, I moved
the declarations of start_time and end_time to the top of main. For the
version I ran, I looked at the assembly from gcc -O2 -S for three versions
of your program: no loop, loop set all to zero, loop set rand() stuff. The
assembly code was very different. The no loop version used mostly integer
assembly intructions. The loop zero version used about half integer and
half floating. And the one with the rand() used maybe 70% floating point
intructions.

I tested that but it gave no difference either.
>
> I didn't record the times, but the -O2 were in the 45 second range, -O was
> in the 90 second range.
> I ran them under Win98SE, 500Mhz K6-2, DJGPP v2.03 w/GCC v3.41.

this is highly suspicious... a k6 is slower than an Athlon 2GHz is it
not? so how come you are getting such low times?
Could you post the code you used, and check your times.

BTW did you use gcc? I used these args with gcc

-O2 -Wall -D_LARGEFILE64_SOURCE -std=gnu99

/tom
 
R

Rod Pemberton

tom fredriksen said:
this is highly suspicious... a k6 is slower than an Athlon 2GHz is it
not? so how come you are getting such low times?
Could you post the code you used, and check your times.

Times are below. I deleted the code, but just copied it again. The only
differences are that the declaration of c and d are no longer in the for()
loops. i.e.,
int c,d;
for(c=0; c<count; c++) {
for(d=0; d<50000; d++) {
for(c=0; c<count; c++) {
BTW did you use gcc? I used these args with gcc

-O2 -Wall -D_LARGEFILE64_SOURCE -std=gnu99

Gcc v3.41 yes, but DJGPP uses it's own libc. I keep forgetting that DJGPP
has it's own libc which is different from GNU libc. That could _easily_ be
a factor. Also, CPU speeds increase by roughly a factor of 2 per generation
(2), so 1/4 of 44 ~ 11.

(2 runs, w/loop)
gcc -O2
Elapsed time (ms): 42790.000000
Elapsed time (ms): 43450.000000

(2 runs, no loop)
gcc -O2
Elapsed time (ms): 44160.000000
Elapsed time (ms): 44050.000000

(2 runs, w/loop)
gcc -Wall -O2 -D_LARGEFILE64_SOURCE -std=gnu99
Elapsed time (ms): 44600.000000
Elapsed time (ms): 43340.000000

(3 runs, no loop)
gcc -Wall -O2 -D_LARGEFILE64_SOURCE -std=gnu99
Elapsed time (ms): 39600.000000
Elapsed time (ms): 42840.000000
Elapsed time (ms): 40970.000000


It might be time to try a linux NG to see if someone with a similar system
to yours is getting the same issue...


Rod Pemberton
 
L

laura fairhead

Hi

I was doing a simple test of the speed of a "maths" operation and when I
tested it I found that removing the loop that initialises the data array
for the operation caused the whole program to spend twice the time to
complete. If the loop is included it takes about 7.48 seconds to
complete, but when removed it takes about 11.48 seconds.

Hello Tom,

This is most likely because your initialisation is outside of the
timed code. When you are initialising the data that is causing the
memory to be moved into the cache (main memory accesses cost more
time than cached access) and so the next time (in the loop) access
is much faster. So, when you don't initialise your data, all the time
it takes to move the 64k main memory to the cache is added into the
loop instead and you get a corresponding slow-down.

byefornow
laura
 
M

Mark McIntyre

I did not say twice the larger, I said twice the time:/

Assuming you're using base 10, twice 7.48 isn't 11.48.

And to answer your question, who knows? In the "loopless" case you're
invoking undefined behaviour by accessing uninitialised memory, so the
extra time could be spent phoning the kremlin or counting your
diskdrive sectors, or pretty much anything.

Mark McIntyre
 
D

David Holland

> I was doing a simple test of the speed of a "maths" operation and when I
> tested it I found that removing the loop that initialises the data array
> for the operation caused the whole program to spend twice the time to
> complete. If the loop is included it takes about 7.48 seconds to
> complete, but when removed it takes about 11.48 seconds.
>
> Does anybody have a suggestion as to why this is so and whether I can
> trust the results of the code as it is below?
>
> [...]
> int total = 0;
> int count = 65500;
> signed int data[count];

<OT purpose=stop_wild_speculation>

It is because of the virtual memory system. When you initialize the
array beforehand, it forces the virtual memory system to materialize
memory for the pages appearing in the array. When you do not, this
materialization happens in the timed loop, and then of course it's
slower.

You should be able to confirm this by touching some but not all of the
pages beforehand and observing the resulting timings. Hint: page size
for i386 family processors is 4096 bytes. And you won't see it at all
using DJGPP, which is DOS-based and doesn't have a virtual memory
system.

</OT>

I feel compelled to observe that your program exhibits undefined
behavior... and so strictly speaking, reasoning about its performance
properties is meaningless. And you absolutely cannot trust the
results.

Perhaps you meant "total" to be unsigned?
 
L

laura fairhead

Hello Tom,

This is most likely because your initialisation is outside of the
timed code. When you are initialising the data that is causing the
memory to be moved into the cache (main memory accesses cost more
time than cached access) and so the next time (in the loop) access
is much faster. So, when you don't initialise your data, all the time
it takes to move the 64k main memory to the cache is added into the
loop instead and you get a corresponding slow-down.

byefornow
laura

Hi Again,

Sorry, that's an absolutely stupid answer. I hadn't read the code well
enough before posting. The caching overhead couldn't add that sort of
time to the loop, obvioulsy. Because of the very complex interaction
between that high-level code, the compiler & machine code it generates,
the CPU & memory subsystems, this could be oh so many things....

One possibility (more realistic than my previous idiotic post;) is
that the code in the non-loop version ends up being not-well-aligned
for a CPU that is affected by such and a compiler that doesn't optimise
the alignment of loops by default. That's one out of a great many to
be sure, you'd have to analyse it; but I would start off looking at
the assembly listings.

byefornow
laura
 
O

Old Wolf

tom said:
I was doing a simple test of the speed of a "maths" operation and when I
tested it I found that removing the loop that initialises the data array
for the operation caused the whole program to spend twice the time to
complete.

Your program causes undefined behaviour if you remove the
initialization, because it reads uninitialized variables.
 
D

Dik T. Winter

> Does anybody have a suggestion as to why this is so and whether I can
> trust the results of the code as it is below?

Cache effects.

....
> /* Initialising array loop */
> for(int c=0; c<count; c++) {
> data[c]=1+(int) (2000000000.0*rand()/(RAND_MAX+1.0));
> }

When initialising your array data comes into the cache, so all further
operations work on the cache. When you do not initialise, your data
has to be swapped into the cache during the operations, that takes time.

Try the same with an initialisation of data[c] to 1. Also try timing
with the initialisation included or not. In that case you will see
three timings:
a) No initialisation.
b) Initialisation to 1, not timed.
c) Initialisation to 1, included in timing.
You will see that (a) is approximately equal to (c) and that (b) is
shorter.
 
D

Dik T. Winter

> On Wed, 15 Mar 2006 23:01:06 +0000 (UTC), (e-mail address removed) (laura fairhead) wrote: .... ....
> Sorry, that's an absolutely stupid answer.

No, it is not absolutely stupid at all. But now I think I know the
answer. The timing of the addition operation depends on the values
involved.
 
T

tom fredriksen

Old said:
Your program causes undefined behaviour if you remove the
initialization, because it reads uninitialized variables.

:) sorry but using an uninitialised array does not cause undefined
behaviour. It only means the values of the array are not preset. An
array is just a sequence of memory positions, which have been used
previously by some other part of the program or the system, hence the
value it has is what was there before I allocated it. There is no
undefined behaviour in that. If I had used the value in some expression,
then I would get undefined behaviour because the expression is based on
values I don't know and which might be illegal for the case.

/tom
 
T

tom fredriksen

Dik said:
Does anybody have a suggestion as to why this is so and whether I can
trust the results of the code as it is below?

Cache effects.

...
/* Initialising array loop */
for(int c=0; c<count; c++) {
data[c]=1+(int) (2000000000.0*rand()/(RAND_MAX+1.0));
}

When initialising your array data comes into the cache, so all further
operations work on the cache. When you do not initialise, your data
has to be swapped into the cache during the operations, that takes time.

Try the same with an initialisation of data[c] to 1. Also try timing
with the initialisation included or not. In that case you will see
three timings:
a) No initialisation.
b) Initialisation to 1, not timed.
c) Initialisation to 1, included in timing.
You will see that (a) is approximately equal to (c) and that (b) is
shorter.

Please read my other post.
 
R

Richard Bos

tom fredriksen said:
:) sorry but using an uninitialised array does not cause undefined
behaviour.

Yes, it does.
It only means the values of the array are not preset. An array is just
a sequence of memory positions,

Yes, but...
which have been used previously by some other part of the program or the
system,

....no. An uninitialised array is allowed to contain trap values.
hence the value it has is what was there before I allocated it. There is no
undefined behaviour in that.

There is nothing in the Standard which guarantees anything whatsoever
about the memory pattern in memory you just allocated, and reading
uninitialised memory _does_ cause undefined behaviour.

Richard
 
R

Richard Heathfield

tom fredriksen said:
:) sorry but using an uninitialised array does not cause undefined
behaviour.

It does if you read those "values" before writing them. And, if you remove
your loop, that's precisely what you do.

It only means the values of the array are not preset. An
array is just a sequence of memory positions, which have been used
previously by some other part of the program or the system, hence the
value it has is what was there before I allocated it. There is no
undefined behaviour in that.

At that point, I agree, but...
If I had used the value in some expression,

signed int data[count];

(loop removed, because that's the case we're discussing)

struct timeval start_time;
struct timeval end_time;

gettimeofday(&start_time, NULL);

for(int d=0; d<50000; d++) {
for(int c=0; c<count; c++) {
total += data[c];

ping - you use the value in some expression.
then I would get undefined behaviour because the expression is based on
values I don't know and which might be illegal for the case.

s/would//
 
T

tom fredriksen

Richard said:
signed int data[count];

(loop removed, because that's the case we're discussing)

struct timeval start_time;
struct timeval end_time;

gettimeofday(&start_time, NULL);

for(int d=0; d<50000; d++) {
for(int c=0; c<count; c++) {
total += data[c];

ping - you use the value in some expression.
>
then I would get undefined behaviour because the expression is based on
values I don't know and which might be illegal for the case.

s/would//

Wrong, it does not cause undefined behaviour because the values used
does not require a specific set of values or a value range. its just a
set of ints with a 32 bit value, nothing special about that. Its the
interpretation that matters not the value itself. And since my program
does not need to interpret the value there is no undefined behaviour.

If I had required the program to have values in f.ex. the ASCII range
and used a function that required ASCII values then it would have
undefined behaviour. Or if the array was required to have either 0 or 1
as a value and I used it in a boolean expression it had negative values,
that would cause undefined behaviour.

Undefined behaviour only exists if the program or any of the functions
in the program has a definition of what are legal values and the array
does not contain those legal values.

The reason it is stated that it causes undefined behaviour is because
the programmer has to decide what behaviour it should have in those
cases. Since the inventors of the language could not possibly foresee
each programs behaviour, it was decided that it is more convenient to
say it can cause undefined behaviour.

/tom
 
R

Richard Heathfield

tom fredriksen said:
Richard said:
signed int data[count];

(loop removed, because that's the case we're discussing)

struct timeval start_time;
struct timeval end_time;

gettimeofday(&start_time, NULL);

for(int d=0; d<50000; d++) {
for(int c=0; c<count; c++) {
total += data[c];

ping - you use the value in some expression.
then I would get undefined behaviour because the expression is based on
values I don't know and which might be illegal for the case.

s/would//

Wrong, it does not cause undefined behaviour because the values used
does not require a specific set of values or a value range.

Wrong.

C89 draft:

* Undefined behavior --- behavior, upon use of a nonportable or
erroneous program construct, of erroneous data, or of
indeterminately-valued objects, for which the Standard imposes no
requirements.

C99 final:

"Certain object representations need not represent a value of the object
type. If the stored value of an object has such a representation and is
read by an lvalue expression that does not have character type, the
behavior is undefined."
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,995
Messages
2,570,230
Members
46,819
Latest member
masterdaster

Latest Threads

Top