Just another few data points: I see the plain loop version as being 14
times *slower* than the memcmp version for 512MB, provided I don't
optimise! At gcc -O1 and above the memcmp takes twice the time the
plain loop does. The reason seems to be that gcc replaces memcmp with
a repz cmpsb instruction (plus housekeeping gubbins) and these string
instructions have a reputation for being slow.
The memcmp method is still four times faster than the loop (on this
data, on this machine, with this library, with... etc) provided gcc can
be told not to inline it to a cmpsb instruction.
To summarise:

                    for loop    memcmp
  gcc unoptimised   13.333s     0.932s
  gcc optimising     4.160s     8.250s

I'd post the CPU, gcc version and so on but I think the real message is
*measure*. You can still be surprised after <mumble> years in this
field.
i was surprised by your results even more than by my own. i had
already removed the test code i'd written before posting my last message,
so i had to rewrite it once more... and bang! it looks like i had a bug
(probably #ifndef instead of #ifdef for switching implementations with
-D), because now i'm getting computation times similar to the ones i got
before, but memcmp() is the faster version in both builds. namely:

         for()    memcmp()
  -g3    ~1.2     ~0.11
  -O3    ~0.4     ~0.11