"Mark A. Odell said:
I'm can't seem to see how a compiler can know how to unroll a loop like by
itself without knowing how many things need to be copied (assuming this is
a memcpy() function). That is, if I can copy any number of bytes, how
could one, or a compiler, generically unroll such a loop without using
Duff's device?
I tried the following two functions with Codewarrior 7, highest
optimisation level, and "optimise for speed" selected. I compiled the
following two functions (no guarantee that the first one is correct, but
that is what I compiled):
void copy1 (char* dst, char* src, int n) {
int k = (n + 7) / 8;
switch (n % 8) do {
case 0: *dst++ = *src++;
case 7: *dst++ = *src++;
case 6: *dst++ = *src++;
case 5: *dst++ = *src++;
case 4: *dst++ = *src++;
case 3: *dst++ = *src++;
case 2: *dst++ = *src++;
case 1: *dst++ = *src++;
} while (--n >= 0);
}
void copy2 (char* dst, char* src, int n) {
int i;
for (i = 0; i < n; ++i) *dst++ = *src++;
}
In function copy1, the loop body is translated to:
case 0: *dst++ = *src++;
00000030: 88040000 lbz r0,0(r4)
00000034: 38840001 addi r4,r4,1
00000038: 98030000 stb r0,0(r3)
0000003C: 38630001 addi r3,r3,1
case 7: *dst++ = *src++;
00000040: 88040000 lbz r0,0(r4)
00000044: 38840001 addi r4,r4,1
00000048: 98030000 stb r0,0(r3)
0000004C: 38630001 addi r3,r3,1
case 6: *dst++ = *src++;
00000050: 88040000 lbz r0,0(r4)
00000054: 38840001 addi r4,r4,1
00000058: 98030000 stb r0,0(r3)
0000005C: 38630001 addi r3,r3,1
case 5: *dst++ = *src++;
00000060: 88040000 lbz r0,0(r4)
00000064: 38840001 addi r4,r4,1
00000068: 98030000 stb r0,0(r3)
0000006C: 38630001 addi r3,r3,1
case 4: *dst++ = *src++;
00000070: 88040000 lbz r0,0(r4)
00000074: 38840001 addi r4,r4,1
00000078: 98030000 stb r0,0(r3)
0000007C: 38630001 addi r3,r3,1
case 3: *dst++ = *src++;
00000080: 88040000 lbz r0,0(r4)
00000084: 38840001 addi r4,r4,1
00000088: 98030000 stb r0,0(r3)
0000008C: 38630001 addi r3,r3,1
case 2: *dst++ = *src++;
00000090: 88040000 lbz r0,0(r4)
00000094: 38840001 addi r4,r4,1
00000098: 98030000 stb r0,0(r3)
0000009C: 38630001 addi r3,r3,1
case 1: *dst++ = *src++;
000000A0: 88040000 lbz r0,0(r4)
} while (--n >= 0);
000000A4: 34A5FFFF subic. r5,r5,1
000000A8: 38840001 addi r4,r4,1
000000AC: 98030000 stb r0,0(r3)
000000B0: 38630001 addi r3,r3,1
000000B4: 4080FF7C bge *-132 ; $00000030
One load, one store, two adds for dst++ and src++ for each byte. For
eight bytes that is eight load, eight store, sixteen adds, one subtract
and conditional branch. In function copy2, the compiler creates one loop
that does for iterations at a time, followed by a loop that does the
remaining up to 7 bytes. The code for the main loop is
0000002C: 88040000 lbz r0,0(r4)
00000030: 38E70008 addi r7,r7,8
00000034: 98030000 stb r0,0(r3)
00000038: 88040001 lbz r0,1(r4)
0000003C: 98030001 stb r0,1(r3)
00000040: 88040002 lbz r0,2(r4)
00000044: 98030002 stb r0,2(r3)
00000048: 88040003 lbz r0,3(r4)
0000004C: 98030003 stb r0,3(r3)
00000050: 88040004 lbz r0,4(r4)
00000054: 98030004 stb r0,4(r3)
00000058: 88040005 lbz r0,5(r4)
0000005C: 98030005 stb r0,5(r3)
00000060: 88040006 lbz r0,6(r4)
00000064: 98030006 stb r0,6(r3)
00000068: 88040007 lbz r0,7(r4)
0000006C: 38840008 addi r4,r4,8
00000070: 98030007 stb r0,7(r3)
00000074: 38630008 addi r3,r3,8
00000078: 4200FFB4 bdnz *-76 ; $0000002C
For eight bytes, it uses eight load, eight stores, three adds and a loop
branch. Fourteen instructions less. The total size is 188 byte for copy1
plus a 32 byte table that is needed for the case jump, vs. 164 byte for
function copy2. So copy2 is faster, produces less code, and is much
easier to write.
So the compiler basically replaced copy2 with
void copy2 (char* dst, char* src, int n) {
int i, j;
for (i = 0, j = n/8; j > 0; --j, i += 8, dst += 8, src += 8) {
dst [0] = src [0];
dst [1] = src [1];
dst [2] = src [2];
dst [3] = src [3];
dst [4] = src [4];
dst [5] = src [5];
dst [6] = src [6];
dst [7] = src [7];
}
for (; i < n; ++i) *dst++ = *src++;
}
As you can see, the case labels before every single line in the loop
body force the compiler to produce substantially inferior code.