Ian Collins said:
Why?
I thought the convoluted option would just make it harder for the
compiler to recognise opportunities for optimisations, particularly loop
unrolling. So I tried compiling both and sure enough, Sun's cc
generated the same code as gcc for the switch case, but it went on to
generate unrolled versions for the simple loop.
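
The code being compared is not quoted in this post, so the following is only a rough sketch of the kind of pair I mean. It assumes the "switch case" is a Duff's-device style copy. The names copy_simple and copy_switch, and the unroll factor of 4, are made up for illustration.

```c
#include <stddef.h>

/* Simple loop: easy for a compiler to recognise and unroll or vectorise. */
void copy_simple(int *to, const int *from, size_t count)
{
    for (size_t i = 0; i < count; i++)
        to[i] = from[i];
}

/* Hand-unrolled copy that uses a switch to jump into the loop body
   (Duff's-device style). Assumed here for illustration only. */
void copy_switch(int *to, const int *from, size_t count)
{
    if (count == 0)
        return;                      /* the do-while below needs count > 0 */

    size_t passes = (count + 3) / 4; /* number of trips round the loop */

    switch (count % 4) {
    case 0: do { *to++ = *from++;    /* fall through */
    case 3:      *to++ = *from++;    /* fall through */
    case 2:      *to++ = *from++;    /* fall through */
    case 1:      *to++ = *from++;
            } while (--passes > 0);
    }
}
```

Both functions copy the same elements. The difference is only in how much of the structure the compiler can see.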
The most likely way to optimise the above array move is to call
memcpy(to, from, count*sizeof(*to))
or, if the two regions might overlap,
memmove(to, from, count*sizeof(*to))
and let the compiler choose the best implementation.
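
As a minimal sketch of those two calls wrapped in functions (the wrapper names and the int element type are assumptions, not from the original code):

```c
#include <stddef.h>
#include <string.h>

/* Copy count ints. The source and destination must not overlap. */
void copy_disjoint(int *to, const int *from, size_t count)
{
    memcpy(to, from, count * sizeof(*to));
}

/* Copy count ints. The source and destination are allowed to overlap. */
void copy_overlapping(int *to, const int *from, size_t count)
{
    memmove(to, from, count * sizeof(*to));
}
```

memcpy has undefined behaviour if the regions overlap, so memmove is the safe choice when that cannot be ruled out.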
The more usual loop would be something like
for (k=0; k<n; k++) a[k] = f(b[k], c[k]);
Depending on the underlying hardware, that might be optimised to
k := 0;
r0 := b[0]; r2 := b[1];
r4 := b[2]; r6 := b[3];
r1 := c[0]; r3 := c[1];
r5 := c[2]; r7 := c[3];
while (k+:=4)<=n
    r8 := f(r0,r1); r9 := f(r2,r3);
    r10 := f(r4,r5); r11 := f(r6,r7);
    r0 := b[k+0]; r2 := b[k+1];
    r4 := b[k+2]; r6 := b[k+3];
    r1 := c[k+0]; r3 := c[k+1];
    r5 := c[k+2]; r7 := c[k+3];
    a[k-4] := r8; a[k-3] := r9;
    a[k-2] := r10; a[k-1] := r11;
if k-4<n, a[k-4] := f(r0, r1);
if k-3<n, a[k-3] := f(r2, r3);
if k-2<n, a[k-2] := f(r4, r5);
k := n;
This helps because loads and stores can be very expensive but can overlap with
other operations, and it is faster to use an entire cache line at once. These
kinds of considerations are very machine-specific and not obvious to people
unfamiliar with assembly-level code.
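
For anyone who would rather see the idea in C, here is a rough sketch of a four-way unrolled loop. It groups the loads, the computations and the stores, but it does not preload the next block the way the register version does, and it never reads past the end of the arrays. The function name apply_unrolled and the body of f are placeholders I made up. In practice a compiler may well produce something like this from the simple loop on its own.

```c
#include <stddef.h>

/* Hypothetical element-wise operation standing in for f. */
static inline int f(int x, int y)
{
    return x + y;
}

/* a[k] = f(b[k], c[k]) for k in [0, n), unrolled four ways by hand. */
void apply_unrolled(int *a, const int *b, const int *c, size_t n)
{
    size_t k = 0;

    /* Main loop: four independent load/compute/store chains per pass. */
    for (; k + 4 <= n; k += 4) {
        /* Issue all the loads first so they can overlap. */
        int b0 = b[k],     c0 = c[k];
        int b1 = b[k + 1], c1 = c[k + 1];
        int b2 = b[k + 2], c2 = c[k + 2];
        int b3 = b[k + 3], c3 = c[k + 3];

        /* The four computations do not depend on each other. */
        int r0 = f(b0, c0);
        int r1 = f(b1, c1);
        int r2 = f(b2, c2);
        int r3 = f(b3, c3);

        /* Store the four results together. */
        a[k]     = r0;
        a[k + 1] = r1;
        a[k + 2] = r2;
        a[k + 3] = r3;
    }

    /* Tail: the 0 to 3 elements left over when n is not a multiple of 4. */
    for (; k < n; k++)
        a[k] = f(b[k], c[k]);
}
```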