while(n--)
[as compared to a loop using p1[n] and p2[n], instead of postfix-++]
Purely as an observation, I find that the gcc x86 code generator
tends to partially unroll while loops.
Indeed, most optimizing compilers will do so. If the number of
trips through the loop is predictable (or easily computed), most
optimizing compilers will unroll to a greater extent; and if the
semantics are similar enough (e.g., if both loops in the original
samples had done a forward copy, which was not the case here), a
good compiler will often produce the same machine code for either
one.
The resulting machine code is not all that illustrative, however.
The point of the example was to show that indexing is sometimes
actually *faster* than pointer arithmetic, as there is only one
variable to update on each trip (n) instead of several (n, p1,
and p2 in the pointer version). A great deal depends on both the
target machine architecture and the compiler.
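
For reference, here is a sketch of the two loop styles being compared.
The original samples are not reproduced in this post, so the function
names (copy_ptr, copy_idx), the element type, and the choice of p1 as
destination are my assumptions. Note that the pointer version copies
forward while the indexed version copies backward, which is the
difference in semantics mentioned earlier.

```c
#include <stddef.h>

/* Pointer version: three variables change on every trip (n, p1, p2).
   Copies forward, from element 0 up to element n-1.
   Assumption: p1 is the destination, p2 the source. */
void copy_ptr(int *p1, const int *p2, size_t n)
{
    while (n--)
        *p1++ = *p2++;
}

/* Indexed version: only n changes on every trip.
   Copies backward, from element n-1 down to element 0. */
void copy_idx(int *p1, const int *p2, size_t n)
{
    while (n--)
        p1[n] = p2[n];
}
```

For arrays that do not overlap, both leave the same result in p1;
which one runs faster is a question for the compiler and the target,
not something to be settled by reading the source.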
The moral, as it were, is to avoid "knowing too much that is not
actually true" about the target machine code, and thus doing
low-level source code "optimization" that is actually more of a
pessimization.
Instead, write the code as clearly as you can
manage. Then, after it works, use a tool like a runtime profiler
to find out where it really spends critical amounts of time -- and
optimize there, preferably algorithmically (though if it really
matters, go ahead and twist up the source with micro-optimizations).
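
As a crude stand-in for a real profiler (gprof or the like), a
timing harness along these lines can at least tell you whether a
loop matters at all. This is only a sketch: the names time_copy and
copy_loop are hypothetical, clock() measures CPU time with coarse
resolution, and an aggressive optimizer may still rearrange the
loop being timed, so treat the numbers as rough.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical routine under test: the indexed copy loop. */
static void copy_loop(int *p1, const int *p2, size_t n)
{
    while (n--)
        p1[n] = p2[n];
}

/* Returns the CPU seconds spent in `reps` calls of copy_loop
   on arrays of n elements. */
double time_copy(int *p1, const int *p2, size_t n, unsigned long reps)
{
    volatile int sink = 0;      /* discourages removing the copies */
    clock_t start = clock();

    while (reps--) {
        copy_loop(p1, p2, n);
        if (n > 0)
            sink = p1[0];       /* touch the result on each trip */
    }

    clock_t stop = clock();
    (void)sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Time the whole program first; only if the loop accounts for a
meaningful share of the run is it worth comparing variants this way.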