Loop Un Rolling

R

root

Hi I have a loop in C/C++ that I want to be un rolled for efficiency
purposes.

But the code has to target a variety of plat forms.

So I can't use compiler specific directives to force an un rolling.

My question is what is a good way to HINT to a compiler by the very
structure of the code that a loop should be un rolled in such a way that
many compilers will understand this hint.

Thanks.

/root
 
B

Ben Pfaff

root said:
Hi I have a loop in C/C++ that I want to be un rolled for efficiency
purposes.

But the code has to target a variety of plat forms.

So I can't use compiler specific directives to force an un rolling.

My question is what is a good way to HINT to a compiler by the very
structure of the code that a loop should be un rolled in such a way that
many compilers will understand this hint.

Why don't you just unroll the loop yourself, by e.g. repeating
code or using a macro?
 
R

root

Ben

This would not be acceptable to me as it would make the code too messy to
be readable.

The loop may have thousands of iterations.

/root
 
B

Ben Pfaff

root said:
This would not be acceptable to me as it would make the code too messy to
be readable.

The loop may have thousands of iterations.

I really doubt that unrolling a loop thousands of times will have
performance benefits.
 
D

Daniel Pitts

root said:
Hi I have a loop in C/C++ that I want to be un rolled for efficiency
purposes.

But the code has to target a variety of plat forms.

So I can't use compiler specific directives to force an un rolling.

My question is what is a good way to HINT to a compiler by the very
structure of the code that a loop should be un rolled in such a way that
many compilers will understand this hint.

Forcing unrolling (or any "optimization") is likely to cause inefficient
binaries. Smaller caches, and other situations, can negate any
performance gained from unrolled loops. It is better to let the
compiler decide whether to unroll or not.
 
A

Alf P. Steinbach

* root:
Hi I have a loop in C/C++

There is no such hybrid language (wrong).

So from the outset, you're saying that you don't know what you're talking about.

This is also indicated by cross-posting to [comp.lang.c] and [comp.lang.c++]
(wrong), and by not posting any of the relevant code (wrong).

that I want to be un rolled for efficiency
purposes.

But the code has to target a variety of plat forms.

So I can't use compiler specific directives to force an un rolling.

My question is what is a good way to HINT to a compiler

Telepathy may be even better. Just think very very hard what you want compiler
to do. Sometimes appears to work! :)

However, in C++ you can use the Boost preprocessor macros to help repeat code.
Check out Duff's device for the code you'd want the macros to generate. Since
the preprocessor is largely the same in C and C++ there is a possibility but no
guarantee that you can also use those macros in C.

However, unrolling a loop at the source code level may have the opposite effect
of what you want.

by the very
structure of the code that a loop should be un rolled in such a way that
many compilers will understand this hint.

Most inefficiency is of the algorithmic kind.

That's where you get most return on your optimization effort.



Cheers & hth.,

- Alf
 
M

Moi

Ben

This would not be acceptable to me as it would make the code too messy
to be readable.

The loop may have thousands of iterations.

Yes. if it had only one iteration, it would not be a loop.
For examples of macroization, you could look at GNU's libc
string function implementation.
You can easily conclude from that that it will take a lot
of time (probably 10-100 hours per LOC)

IMHO the rules of thumb for (loop-)optimization:
1) avoid "fat" loops. Don't do too much in one loop.
2) avoid (conditional) jumps
3) try to deal with register- or cache- line aligned chunks
of data as much as possible.

BTW: show us the loop.

HTH,
AvK
 
B

bartc

root said:
Ben

This would not be acceptable to me as it would make the code too messy to
be readable.

The loop may have thousands of iterations.

You don't need to unroll the loop completely. If the loop executes 1000
times, then unrolling one iteration means the loop only needs to execute 500
times:

for (i=0; i<1000; ++i) {
<yourcode>
}

for (i=0; i<500; ++i) {
<yourcode>
<yourcode>
}

Or are you saying the body of the loop is complex? In that case loop
overheads may not be significant.
 
W

White Wolf

Paavo said:
Duff's device?

Maybe. Most probably not. He did not say *anything* about what the
loop does. Duff's device is not loop unrolling, and Duff's devices is a
very specific idiom applicable in a very specific case: copying a lot of
bytes to hardware that can only efficiently copy words, while using a
compiler does doesn't do Duff's device on its own as an optimization
(gcc does, AFAIK).

To the OP:

1.) Follow Michael Jackson's rules of optimization

2.) Chances are high that the most stupid compiler optimizer is still
better at loop unrolling than you will ever be. Not because you are
stupid, but because those guys writing the compilers have 25 years of
head start on you.

3.) So to reword rule #1 and #2 from the Jackson rules: Don't sweat it.

BR, WW
 
H

Hamiral

root said:
But the code has to target a variety of plat forms.

It's nearly impossible to unroll a loop efficiently for a variety of
platforms, because they can have a lot of differences regarding cache etc.
The best optimization IMHO would be to trust your compiler and to have
an efficient algorithm.

Ham
 
R

root

Ben

Perhaps you are confused about what loop un rolling does.

The efficiency is because of avoided jumps. It is the size of the loop
body not the number of iter ations that determine if loop un rolling is
beneficial.

In this case the loop body is very short.

Also if the number of iterations is a compile time constant but defined
in a complicated way then your method will not work.

/root
 
K

Keith Thompson

White Wolf said:
2.) Chances are high that the most stupid compiler optimizer is still
better at loop unrolling than you will ever be. Not because you are
stupid, but because those guys writing the compilers have 25 years of
head start on you.
[...]

And because the compiler can do a tremendous amount of analysis
*every time* you compile your code, and can redo all the same
analysis from scratch when you change a single line and recompile.
 
W

Willem

root wrote:
) Perhaps you are confused about what loop un rolling does.

There is no reason to try to be patronizing.

) The efficiency is because of avoided jumps. It is the size of the loop
) body not the number of iter ations that determine if loop un rolling is
) beneficial.

The *in*efficiency is because of caching issues. Jumps are almost free on
any reasonably modern CPU, as opposed to having to read new instructions
from memory all the time.

What platform are you targeting where you believe loop unrolling will be
beneficial ?


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
B

Ben Pfaff

root said:
The efficiency is because of avoided jumps. It is the size of the loop
body not the number of iter ations that determine if loop un rolling is
beneficial.

In this case the loop body is very short.

Unrolling the loop thousands of times, regardless of its length,
is unlikely to be beneficial. Even if it is only a single
instruction, thousands of copies of that instruction will
probably not fit in your CPU cache or, if they do, then they will
evict other instructions from it, which will cause a high cost
for other code and possibly cause it to be a net loss.
Also if the number of iterations is a compile time constant but defined
in a complicated way then your method will not work.

Then you need a switch statement. See Duff's Device (in an
obsolete form of C, from Wikipedia):

send(to, from, count)
register short *to, *from;
register count;
{
register n=(count+7)/8;
switch(count%8){
case 0: do{ *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
}while(--n>0);
}
}
 
W

White Wolf

Keith said:
White Wolf said:
2.) Chances are high that the most stupid compiler optimizer is still
better at loop unrolling than you will ever be. Not because you are
stupid, but because those guys writing the compilers have 25 years of
head start on you.
[...]

And because the compiler can do a tremendous amount of analysis
*every time* you compile your code, and can redo all the same
analysis from scratch when you change a single line and recompile.

Yep. :) The good old assembler versus C debate. :)

BR, WW
 
W

Walter Banks

Paavo said:
Duff's device?

There might be a general case for Duff's device that
could eliminate a lot of conditional checking.

The power of Duff's device is to do a a single
check in such a way that periodic checks are needed
less frequently.

The core of Duff's algorithm might even be able
to be implemented in macro's with some reasonable
assumptions and produce significant execution
improvements.

Regards,


Walter Banks
 
K

Keith Thompson

root said:
I always try to give the compiler hints for optimization ie by using
register variables. It can make all the difference in a tight inner loop.
[...]

Please don't top-post. See:
http://www.caliburn.nl/topposting.html
http://www.cpax.org.uk/prg/writings/topposting.php

The common wisdom these days is that the compiler can usually do
a better job of register allocation than you can, and that the
"register" keyword isn't likely to help much; in fact, it may
hinder optimization.

If you're finding that the "register" keyword actually improves
performance measurably, perhaps the common wisdom is sometimes wrong.

(You're telling your compiler to use its maximum optimization level,
right?)
 
N

Noob

Richard said:
Remember that the only thing you save by unrolling the loop is the
time spent on the loop logic.

This statement is not universally true.

Interleaving several iterations may "hide" some latency.

Consider e.g a naive memcpy:

for (i=0; i < N; ++i)
{
temp = src;
dst = temp; /* STALL WAITING FOR src */
}

Unroll the loop 4 times.

for (i=0; i < N; i += 4)
{
temp0 = src[i+0];
temp1 = src[i+1];
temp2 = src[i+2];
temp3 = src[i+3];
dst[i+0] = temp0;
dst[i+1] = temp1;
dst[i+2] = temp2;
dst[i+3] = temp3;
}

Depending on the size of the Icache, the latency of memory loads, etc,
the unrolled version may perform much better.

At this point, comp.arch would be more appropriate.

To the original poster, Wikipedia is a good reference.
http://en.wikipedia.org/wiki/Loop_unwinding
http://en.wikipedia.org/wiki/Software_pipelining
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads


Members online

Forum statistics

Threads
473,994
Messages
2,570,223
Members
46,812
Latest member
GracielaWa

Latest Threads

Top