Tricky Code

M

Mohanasundaram

Hi All

Can you please explain this code. I took it from C Faq.

register n = (count + 7) / 8; /* count > 0 assumed */
switch (count % 8)
{
case 0: do { *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
} while (--n > 0);
}

Thanks,
Mohan.
 
G

Grumble

Mohanasundaram said:
Can you please explain this code. I took it from C Faq.

register n = (count + 7) / 8; /* count > 0 assumed */
switch (count % 8)
{
case 0: do { *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
} while (--n > 0);
}

Google for Duff's device.

http://www.lysator.liu.se/c/duffs-device.html
http://www.catb.org/~esr/jargon/html/D/Duffs-device.html
 
R

Richard Bos

Can you please explain this code. I took it from C Faq.

register n = (count + 7) / 8; /* count > 0 assumed */
switch (count % 8)
{
case 0: do { *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
} while (--n > 0);
}

Well, you took it from the C FAQ, which does about as good a job at
explaining it as I could, but frankly this kind of code should not be
explained at all. It should only be used by the kind of people who
understand it fully and can tell what it does at a glance, and mere
mortals like you and I should stay well away *shudder*.

Richard
 
M

Mark A. Odell

(e-mail address removed) (Mohanasundaram) wrote in

Can you please explain this code. I took it from C Faq.

register n = (count + 7) / 8; /* count > 0 assumed */
switch (count % 8)
{
case 0: do { *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
} while (--n > 0);
}

As you know, it's Duff's device. It performs a memcpy() with a self
adjusting loop unroll such that one does not need to worry if the loop
count is an even multiple of some machine friendly transfer size (e.g.
cache line). Unrolling a loop mitigates the cost of the branch back to the
beginning of the loop but it is hard to know how much to unroll since
'count' may not be known at design time. Duff's device deals with this
problem nicely.
 
T

Thomas Stegen CES2000

Mark said:
As you know, it's Duff's device. It performs a memcpy() with a self
adjusting loop unroll such that one does not need to worry if the loop
count is an even multiple of some machine friendly transfer size (e.g.
cache line). Unrolling a loop mitigates the cost of the branch back to the
beginning of the loop but it is hard to know how much to unroll since
'count' may not be known at design time. Duff's device deals with this
problem nicely.

Maybe it is just me but I think this is nicer:

void foo(int arr[], size_t len)
{
int roll = len % 8;
int i;
int *arrptr = arr;

/*make sure the number of elements left after this is
a multiple of 8*/
for(i = 0;i < roll;i++)
operate(*arrptr++);

for(i = roll;i < len;i += 8)
{
operate(*arrptr++);
operate(*arrptr++);
operate(*arrptr++);
operate(*arrptr++);
operate(*arrptr++);
operate(*arrptr++);
operate(*arrptr++);
operate(*arrptr++);
}
}

:) (I hope I got that right)

Slightly slower maybe, the first loop can be replaced with a switch
without an embedded loop though.
 
C

Christian Bau

"Mark A. Odell said:
(e-mail address removed) (Mohanasundaram) wrote in



As you know, it's Duff's device. It performs a memcpy() with a self
adjusting loop unroll such that one does not need to worry if the loop
count is an even multiple of some machine friendly transfer size (e.g.
cache line). Unrolling a loop mitigates the cost of the branch back to the
beginning of the loop but it is hard to know how much to unroll since
'count' may not be known at design time. Duff's device deals with this
problem nicely.

I'd say it is a good example how to write code that looks clever but
isn't - it is not a readable, it is not a fast, but it demonstrates that
the syntax of the switch statement is simpler than it should have been.

PS. Any decent compiler will unroll a loop like this itself, and
probably produce more efficient code without the programmer having to
wreck his or her brain.
 
P

Peter Nilsson

Mark A. Odell said:
(e-mail address removed) (Mohanasundaram) wrote in



As you know, it's Duff's device. It performs a memcpy()...

It pumps data through a specific (single) address, so it is not memcpy
per se. Although, it can easily be modified to do so.
 
M

Mark A. Odell

PS. Any decent compiler will unroll a loop like this itself, and
probably produce more efficient code without the programmer having to
wreck his or her brain.

I'm can't seem to see how a compiler can know how to unroll a loop like by
itself without knowing how many things need to be copied (assuming this is
a memcpy() function). That is, if I can copy any number of bytes, how
could one, or a compiler, generically unroll such a loop without using
Duff's device?
 
G

Grumble

Mark said:
True but a minute change and it is. Also, the unroll question still
stands even if copying many "things" to one address: How could a
compiler unroll a loop without any restriction on how many "things"
are to copied.

for (i=0; i < n; ++i)
{
do_stuff(i);
}

Assume the compiler has decided that the optimal unroll factor for
this particular code on this particular architecture is 6.

It would then produce code for:

for (i=0; i <= n-6; i+=6)
{
do_stuff(i);
do_stuff(i+1);
do_stuff(i+2);
do_stuff(i+3);
do_stuff(i+4);
do_stuff(i+5);
}

for ( ; i < n; ++i)
{
do_stuff(i);
}

I don't quite understand what the problem is.

Regards,

Grumble
 
M

Mark A. Odell

for (i=0; i < n; ++i)
{
do_stuff(i);
}

Assume the compiler has decided that the optimal unroll factor for
this particular code on this particular architecture is 6.

It would then produce code for:

for (i=0; i <= n-6; i+=6)
{
do_stuff(i);
do_stuff(i+1);
do_stuff(i+2);
do_stuff(i+3);
do_stuff(i+4);
do_stuff(i+5);
}

for ( ; i < n; ++i)
{
do_stuff(i);
}

I don't quite understand what the problem is.

What if n is not an integral value of 6 and do_stuff() is pushing values
into an array of size n?
 
P

pete

Mark said:
What if n is not an integral value of 6
and do_stuff() is pushing values
into an array of size n?

Also, Duff was counting clock ticks.
This form of loop:
do {
} while (--n);
is The Loop, to consider when counting cpu cycles.
It transforms into the simplest assembly, of any loop style,
in assembly languages that I'm familiar with.

while (i <= n-6), is computation intensive by comparison.
 
N

Nudge

Mark said:
What if n is not an integral value of 6 and do_stuff()
is pushing values into an array of size n?

???

The two code snippets are semantically identical.

The first will perform:
do_stuff(0), ..., do_stuff(n-1)

The second will perform:
do_stuff(0), ..., do_stuff(n-1)

For any value of n.
 
N

Nudge

pete said:
Also, Duff was counting clock ticks.
This form of loop:
do {
} while (--n);
is The Loop, to consider when counting cpu cycles.
It transforms into the simplest assembly, of any loop style,
in assembly languages that I'm familiar with.

while (i <= n-6), is computation intensive by comparison.

/me rolls eyes

The example was given in C only as an algorithmic description, to
illustrate the point that a compiler may choose to unroll a loop,
even if the number of loop iterations is known only at run-time.
 
C

Chris Torek

[I have done some editing for size below]

There is one important restriction, namely, that the count can be
determined "up front" as it were -- i.e., the loop does not have
"if (some unpredictable condition) break" in it, and in particular
it does not read "for (...; i < rand(); ...)". (Some such loops
can still be unrolled, but one tends to hit diminishing returns
very quickly in this case.)

What if n is not an integral value of 6 and do_stuff() is pushing values
into an array of size n?

This is not a problem. Remember the assumption that the number of
trips through the loop can be computed "up front". Suppose the
loop will run N times (where N is a nonnegative integer variable),
and the unroll factor is F (an integer constant greater than 1).
The code inside the loop must then be duplicated F times and the
"primary" loop run N/F times. This leaves N % F "left over" -- in
the above example, where F is 6, if N is 42, the "primary" loop
runs 7 times and there are 0 left over, but if N is 41, the "primary"
loop runs 6 times and there are 5 left over. As shown above, the
compiler needs only to generate a secondary loop that runs N % F
times (possibly zero).

When N is a constant, the compiler can even determine in advance
whether N % F is nonzero, and avoid generating the secondary loop
at all in some cases. In any case, if the value of i is not used
inside the loop, the "secondary loop" can be replaced with an
initial computed goto -- the same one that appears in Duff's Device.
 
M

Mark A. Odell

[heavy snippage]
When N is a constant, the compiler can even determine in advance
whether N % F is nonzero, and avoid generating the secondary loop
at all in some cases. In any case, if the value of i is not used
inside the loop, the "secondary loop" can be replaced with an
initial computed goto -- the same one that appears in Duff's Device.

I get it now. Thanks, Chris.
 
C

Christian Bau

"Mark A. Odell said:
I'm can't seem to see how a compiler can know how to unroll a loop like by
itself without knowing how many things need to be copied (assuming this is
a memcpy() function). That is, if I can copy any number of bytes, how
could one, or a compiler, generically unroll such a loop without using
Duff's device?

I tried the following two functions with Codewarrior 7, highest
optimisation level, and "optimise for speed" selected. I compiled the
following two functions (no guarantee that the first one is correct, but
that is what I compiled):

void copy1 (char* dst, char* src, int n) {
int k = (n + 7) / 8;
switch (n % 8) do {
case 0: *dst++ = *src++;
case 7: *dst++ = *src++;
case 6: *dst++ = *src++;
case 5: *dst++ = *src++;
case 4: *dst++ = *src++;
case 3: *dst++ = *src++;
case 2: *dst++ = *src++;
case 1: *dst++ = *src++;
} while (--n >= 0);
}

void copy2 (char* dst, char* src, int n) {
int i;
for (i = 0; i < n; ++i) *dst++ = *src++;
}

In function copy1, the loop body is translated to:

case 0: *dst++ = *src++;
00000030: 88040000 lbz r0,0(r4)
00000034: 38840001 addi r4,r4,1
00000038: 98030000 stb r0,0(r3)
0000003C: 38630001 addi r3,r3,1
case 7: *dst++ = *src++;
00000040: 88040000 lbz r0,0(r4)
00000044: 38840001 addi r4,r4,1
00000048: 98030000 stb r0,0(r3)
0000004C: 38630001 addi r3,r3,1
case 6: *dst++ = *src++;
00000050: 88040000 lbz r0,0(r4)
00000054: 38840001 addi r4,r4,1
00000058: 98030000 stb r0,0(r3)
0000005C: 38630001 addi r3,r3,1
case 5: *dst++ = *src++;
00000060: 88040000 lbz r0,0(r4)
00000064: 38840001 addi r4,r4,1
00000068: 98030000 stb r0,0(r3)
0000006C: 38630001 addi r3,r3,1
case 4: *dst++ = *src++;
00000070: 88040000 lbz r0,0(r4)
00000074: 38840001 addi r4,r4,1
00000078: 98030000 stb r0,0(r3)
0000007C: 38630001 addi r3,r3,1
case 3: *dst++ = *src++;
00000080: 88040000 lbz r0,0(r4)
00000084: 38840001 addi r4,r4,1
00000088: 98030000 stb r0,0(r3)
0000008C: 38630001 addi r3,r3,1
case 2: *dst++ = *src++;
00000090: 88040000 lbz r0,0(r4)
00000094: 38840001 addi r4,r4,1
00000098: 98030000 stb r0,0(r3)
0000009C: 38630001 addi r3,r3,1
case 1: *dst++ = *src++;
000000A0: 88040000 lbz r0,0(r4)
} while (--n >= 0);
000000A4: 34A5FFFF subic. r5,r5,1
000000A8: 38840001 addi r4,r4,1
000000AC: 98030000 stb r0,0(r3)
000000B0: 38630001 addi r3,r3,1
000000B4: 4080FF7C bge *-132 ; $00000030

One load, one store, two adds for dst++ and src++ for each byte. For
eight bytes that is eight load, eight store, sixteen adds, one subtract
and conditional branch. In function copy2, the compiler creates one loop
that does for iterations at a time, followed by a loop that does the
remaining up to 7 bytes. The code for the main loop is

0000002C: 88040000 lbz r0,0(r4)
00000030: 38E70008 addi r7,r7,8
00000034: 98030000 stb r0,0(r3)
00000038: 88040001 lbz r0,1(r4)
0000003C: 98030001 stb r0,1(r3)
00000040: 88040002 lbz r0,2(r4)
00000044: 98030002 stb r0,2(r3)
00000048: 88040003 lbz r0,3(r4)
0000004C: 98030003 stb r0,3(r3)
00000050: 88040004 lbz r0,4(r4)
00000054: 98030004 stb r0,4(r3)
00000058: 88040005 lbz r0,5(r4)
0000005C: 98030005 stb r0,5(r3)
00000060: 88040006 lbz r0,6(r4)
00000064: 98030006 stb r0,6(r3)
00000068: 88040007 lbz r0,7(r4)
0000006C: 38840008 addi r4,r4,8
00000070: 98030007 stb r0,7(r3)
00000074: 38630008 addi r3,r3,8
00000078: 4200FFB4 bdnz *-76 ; $0000002C

For eight bytes, it uses eight load, eight stores, three adds and a loop
branch. Fourteen instructions less. The total size is 188 byte for copy1
plus a 32 byte table that is needed for the case jump, vs. 164 byte for
function copy2. So copy2 is faster, produces less code, and is much
easier to write.

So the compiler basically replaced copy2 with

void copy2 (char* dst, char* src, int n) {
int i, j;
for (i = 0, j = n/8; j > 0; --j, i += 8, dst += 8, src += 8) {
dst [0] = src [0];
dst [1] = src [1];
dst [2] = src [2];
dst [3] = src [3];
dst [4] = src [4];
dst [5] = src [5];
dst [6] = src [6];
dst [7] = src [7];
}
for (; i < n; ++i) *dst++ = *src++;
}

As you can see, the case labels before every single line in the loop
body force the compiler to produce substantially inferior code.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,139
Messages
2,570,805
Members
47,355
Latest member
MavoraTech

Latest Threads

Top