Ok, so here are some remaining questions:
1) How should one implement a standard strcmp? -- as I recall, this is
like 3 lines of code.
I doubt it can be done in three lines of code. For a start, the parameters to
strcmp are const-qualified, so you need at least two lines of variable declarations.
There are a few different implementations - not using const parameters -
scattered through this thread.
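For what it's worth, a minimal const-correct sketch (not one of the versions
already posted, and renamed so it doesn't clash with the library function)
might look like this:
int my_strcmp(const char *s1, const char *s2) {
    /* The standard requires the comparison to be done on unsigned char values. */
    const unsigned char *a = (const unsigned char *)s1;
    const unsigned char *b = (const unsigned char *)s2;

    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    return (*a > *b) - (*a < *b);
}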
2) How might one implement a standard memcmp? (On the assumption that
lengths are always available to you, since it basically costs nothing
for this to be the case.)
No need for assumptions, length is one of the parameters to memcmp.
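For reference, the standard declares it in <string.h> as:
int memcmp(const void *s1, const void *s2, size_t n);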
Doing this naively is easy, but trying to take advantage of low-level hardware
characteristics, such as alignment, can be hard if you are really intent on
squeezing out the maximum performance.
If it were 'standard', surely it wouldn't have knowledge of low-level
hardware characteristics. Perhaps I misunderstand you. Anyhow, on my
platform (Pentium 4 running Linux with gcc 3.3.5) you can squeeze out quite a
large performance gain by knowing that the machine's word size is equal to
sizeof(int).
The naive (as you put it), but portable, attempt:
#include <stddef.h>   /* for size_t */

int memcmp(const void *a, const void *b, size_t n) {
    const unsigned char *ua = (const unsigned char *)a;
    const unsigned char *ub = (const unsigned char *)b;
    const unsigned char *ua_max = ua + n;

    /* Walk both regions one byte at a time until a mismatch or the end. */
    while (ua < ua_max) {
        if (*ua > *ub)
            return 1;
        else if (*ua < *ub)
            return -1;
        else {
            ua++;
            ub++;
        }
    }
    return 0;
}
I compared the speed of this implementation to the glibc memcmp by running
memcmp over 50 million (less 100) pairs of different memory locations, each
holding 100 bytes of random data.
It turns out that the glibc version is roughly 10-15% faster - no big deal.
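A rough sketch of that sort of harness (the timing and setup details here are
illustrative, not the exact program used; link in whichever memcmp variant is
being timed):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_LEN 50000000UL   /* 50 million bytes of random data      */
#define CMP_LEN 100          /* compare 100 bytes at each position   */

int main(void) {
    unsigned char *a = malloc(BUF_LEN);
    unsigned char *b = malloc(BUF_LEN);
    size_t i;
    long hits = 0;
    clock_t start, stop;

    if (a == NULL || b == NULL) {
        fprintf(stderr, "out of memory\n");
        return EXIT_FAILURE;
    }

    srand((unsigned)time(NULL));
    for (i = 0; i < BUF_LEN; i++) {
        a[i] = (unsigned char)(rand() & 0xff);
        b[i] = (unsigned char)(rand() & 0xff);
    }

    /* 50 million less 100 comparisons of 100-byte regions; the result
       is used so the calls cannot be optimised away. */
    start = clock();
    for (i = 0; i < BUF_LEN - CMP_LEN; i++)
        if (memcmp(a + i, b + i, CMP_LEN) > 0)
            hits++;
    stop = clock();

    printf("%ld positive results, %.2f CPU seconds\n",
           hits, (double)(stop - start) / CLOCKS_PER_SEC);
    free(a);
    free(b);
    return 0;
}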
So I tested your idea of using alignment, knowing that my processor likes
to operate on 4 bytes at a time, which is sizeof(int) on this machine.
int memcmp_using_int(const void *a, const void *b, size_t n) {
    size_t num_ints = n / sizeof(int);
    size_t extra_bytes = n % sizeof(int);
    const unsigned int *ua = (const unsigned int *)a;
    const unsigned int *ub = (const unsigned int *)b;
    const unsigned int *ua_max = ua + num_ints;
    const unsigned char *uac;
    const unsigned char *ubc;
    const unsigned char *uac_max;

    /* Compare sizeof(int)-sized chunks first. */
    while (ua < ua_max) {
        if (*ua > *ub)
            return 1;
        else if (*ua < *ub)
            return -1;
        else {
            ua++;
            ub++;
        }
    }

    /* Then compare whatever bytes are left over, one at a time. */
    uac = (const unsigned char *)ua;
    ubc = (const unsigned char *)ub;
    uac_max = uac + extra_bytes;
    while (uac < uac_max) {
        if (*uac > *ubc)
            return 1;
        else if (*uac < *ubc)
            return -1;
        else {
            uac++;
            ubc++;
        }
    }
    return 0;
}
This version is approximately three times faster than both the glibc version
and my original one, which surprised me. Is there any reason _not_ to make the
assumption that I made - that we should operate on units of sizeof(int)
rather than sizeof(char)? It won't necessarily improve performance - it
certainly improves it on my setup - but could it degrade it?
Are there any other ways to work out - without knowing in advance - an
appropriate unit for your system? I happened to know that my machine's
word size is sizeof(int), but can you work out or approximate what it is
from standard includes/definitions?
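One rough hint, if the compiler supports C99, is to look at the sizes the
implementation itself advertises - for instance the "fast" types in
<stdint.h> - bearing in mind that these only tell you what the implementation
considers efficient, not the actual hardware word size:
#include <stdio.h>
#include <limits.h>
#include <stdint.h>   /* C99 */

int main(void) {
    /* These only report the implementation's choices, which is at best
       an approximation of the machine's natural word size. */
    printf("CHAR_BIT              = %d\n", CHAR_BIT);
    printf("sizeof(int)           = %lu\n", (unsigned long)sizeof(int));
    printf("sizeof(long)          = %lu\n", (unsigned long)sizeof(long));
    printf("sizeof(size_t)        = %lu\n", (unsigned long)sizeof(size_t));
    printf("sizeof(uint_fast32_t) = %lu\n", (unsigned long)sizeof(uint_fast32_t));
    return 0;
}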