Anything's possible.
Incidentally, if you still have the test code around, could you also try
while (n) {
    *dst = *src;  /* copy one byte per iteration */
    ++src;
    ++dst;
    --n;
}
I retested with the modification you suggested. It is always faster than
the original while(n--) loop, and across all optimisation levels it
performs roughly the same as the modification I made (at worst it is 17%
slower than my mod at -O1 - an aberration, since in every other case the
separation is a few percent - and at best it is 5% faster at -O3
-march=pentium4).
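(For reference, the while(n--) loop in question would have been of this
general shape - a reconstruction, since its exact body isn't quoted above:)

while (n--) {     /* test and decrement combined in the condition */
    *dst = *src;
    ++src;
    ++dst;
}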
(And is there a difference between n--; and --n; on your system?)
I'm not sure about the general case - but I tested your modification above
with n-- and --n. There is a small variation that differs between the
optimisation levels - neither is consistently faster. The biggest
separation I found was post-decrement being about 3% faster at -O3
-march=pentium4. I repeated that test a few times to check it wasn't a
one-off error due to system loading; the result was consistently between
0.05% and 3%. The initial 3% figure is probably not accurate, but there's
no doubt that in this case the compiler generates slightly faster code for
post-decrement.
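(Which matches the semantics: as standalone statements the two forms do
exactly the same thing - the expression's value is discarded - so any
measured gap can only come from code generation:)

n--;   /* post-decrement: the expression yields the old value of n */
--n;   /* pre-decrement:  the expression yields the new value of n */
/* As full statements both just subtract one. They differ only when the
   value is used: while (n--) runs the body n times, while (--n) runs it
   n-1 times (for n >= 1). */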
Just to get the results from the same system as used for your original
results. (Incidentally, how did they compare with the system-supplied
memcpy? I believe gcc inlines that to assembler at some optimisation
levels...)
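(On the inlining point: gcc does treat memcpy as a builtin, and with a
constant size it may expand the call inline rather than calling the
library. A rough way to check what's actually being measured - the file
name here is just for illustration:)

#include <string.h>

char a[64], b[64];

void f(void)
{
    memcpy(a, b, sizeof a);   /* constant size: gcc may expand this inline */
}

/* gcc -O2 -S f.c                      -> inspect f.s; often no memcpy call
   gcc -O2 -fno-builtin-memcpy -S f.c  -> forces a genuine library call   */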
Its execution time doesn't vary between the sizes I originally tested as
much as the other functions' times do, nor is its performance affected by
optimisation level. With or without optimisations, it is always the
slowest function for sizes of 0..8 bytes. Without optimisations, from
about 16 bytes onwards it consistently performs far better - e.g. at 40
bytes it is 150% faster than any other function. With optimisations it's
"in the mix" - not much better or worse than the others up to roughly 40
bytes - and from then on it consistently beats them.
I tested for larger sizes at all optimisation levels:
At 80 bytes the library function was a minimum of 34% faster than any
other function (340% faster when optimisation switches were not used).
At 1024 bytes it was at least 270% faster (1400% faster without
optimisations).
At 10 kilobytes it was at least 400% faster.
At 100 kilobytes it was at least 65% faster. Here optimisation level also
changed its performance - it was fastest without optimisations, and at -O1
it took roughly twice as long as without them.
At 1 megabyte things had evened out: it was roughly the same as the
others, and in some cases slightly slower. It performed the same at all
optimisation levels.
At 10 and 100 megabytes I only tested with -O3 -march=pentium4, and again
it was roughly the same as the other functions.
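(For anyone wanting to reproduce this kind of measurement, a minimal
harness along these lines will do - the names and repetition count here
are illustrative, not the original test code:)

#include <stdio.h>
#include <string.h>
#include <time.h>

#define REPS 1000000L

static unsigned char src[1024], dst[1024];

int main(void)
{
    clock_t t0 = clock();
    for (long i = 0; i < REPS; i++)
        memcpy(dst, src, sizeof dst);  /* swap in any hand-rolled loop here */
    clock_t t1 = clock();

    /* printing a byte of dst stops the compiler discarding the copies */
    printf("%.3fs for %ld copies of %u bytes (dst[0]=%d)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC, REPS,
           (unsigned)sizeof dst, dst[0]);
    return 0;
}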
Indeed. And bear in mind that it may change completely with the next
version of the compiler, or switching to another compiler on the same
platform. I've found that trusting the compiler and library writers to
have picked the best optimisations is right most of the time...
Agreed - and if you _really_ need specific hard-core optimisations, don't
rely on the compiler except perhaps to use its output as a base - go with
assembly. That way the results don't depend on things beyond your control,
like the compiler's code generation.
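For example, compiling with

    gcc -S -O3 -march=pentium4 copy.c

leaves the generated assembly in copy.s as a starting point (copy.c
standing in for whichever file holds the routine).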