The arena of "practical non-cryptographic hash functions" is
clearly a relatively new field.
Yes and no. The problem has certainly been around for a while,
but you're right that it doesn't seem to have attracted much
interest. The published works I've seen mostly just present a
function and claim that it is good, with no analysis, and most
of the time with no real comparative benchmarks either.
Outside of crypto-hashes, Bob Jenkins's function and my
function, the quality of everything else out there is truly
pathetic.
As I said, I'm interested, because I've been collecting
benchmarks of various hashing functions. Post a link to the
algorithm, and I'll add it in.
OK, I just saw a link at the end of your posting. So I've got a
model for your implementation. I'll add it to my tests at the
first opportunity.
Bob Jenkins, for a long time, set the standard with his
lookup2 function (I think he wrote and publicized it in Dr.
Dobb's Journal in 1996 or so).
According to the link on your page, this is one I've already
tested, and found to be slower than FNV or my own hash
functions. It is, IMHO, a typical example of how over-tuning
to a particular machine doesn't pay: the distribution isn't
significantly better than that of FNV or my own function, and
in fact it takes more time to calculate (on the machines I
have access to).
However, in a kind of "first attempt", I was able to design a
hash function myself that was dramatically faster and of
similar quality. (My function passes Bob Jenkins' avalanche
test as well as my own far more stringent bit distribution and
correlation test; Bob's function performs slightly worse on my
test.) So it's clear that there is plenty of room for research
here if anyone cares to put in the time and effort.
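For anyone who wants to try this at home, the avalanche criterion is
mechanical to check: flip each input bit of a key and see whether each
output bit flips about half the time. Below is a minimal sketch of such
a check (my own illustration, with 32-bit FNV-1a standing in for the
function under test; it is not either of the actual test harnesses):

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <random>

    // Function under test: 32-bit FNV-1a, used here only as a stand-in.
    std::uint32_t
    hashUnderTest(unsigned char const* key, std::size_t len)
    {
        std::uint32_t h = 2166136261u;
        for (std::size_t i = 0; i < len; ++i) {
            h ^= key[i];
            h *= 16777619u;
        }
        return h;
    }

    int
    main()
    {
        std::mt19937 rng(12345);
        int const trials = 10000;
        std::size_t const len = 8;      // 64 input bits
        static long flips[64][32] = {}; // flips[i][j]: times output bit j
                                        // flipped when input bit i flipped
        for (int t = 0; t < trials; ++t) {
            unsigned char key[8];
            for (unsigned char& b : key) {
                b = static_cast<unsigned char>(rng());
            }
            std::uint32_t base = hashUnderTest(key, len);
            for (std::size_t i = 0; i < len * 8; ++i) {
                key[i / 8] ^= 1u << (i % 8);    // flip one input bit
                std::uint32_t diff = base ^ hashUnderTest(key, len);
                key[i / 8] ^= 1u << (i % 8);    // and flip it back
                for (int j = 0; j < 32; ++j) {
                    flips[i][j] += (diff >> j) & 1;
                }
            }
        }
        // Each output bit should flip with probability close to 0.5
        // for every input bit; report the pairs that don't.
        for (std::size_t i = 0; i < len * 8; ++i) {
            for (int j = 0; j < 32; ++j) {
                double p = double(flips[i][j]) / trials;
                if (p < 0.45 || p > 0.55) {
                    std::printf("weak: input bit %u -> output bit %d (p %.3f)\n",
                                unsigned(i), j, p);
                }
            }
        }
        return 0;
    }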
Bob rewrote a version of his function, which apparently comes
much closer to the performance of my function, but I have not
gone back to check it.
What is certain is that there is a lot of folklore floating
around, and very little real research. I'm not really much into
mathematical analysis of functions myself, so I can't do much on
that end. My basic idea was based on linear congruential random
number generators; intuitively, it seemed to me that whatever
made a good random number generator would also make a good
hashing algorithm. I also took into account that the execution
time of the hashing algorithm must be balanced against its
distribution characteristics; multiplying the execution time by
10 to gain a 1% better distribution will still result in slower
accesses on average. In my case, I was (at the time)
working on a machine (an 8086) with very, very slow
multiplication, so I became intrigued with the idea of using a
Mersenne prime as the multiplier (so that the multiplication
could be converted into a shift and a subtraction). And my
algorithm did beat out the few other examples I had at hand at
the time.
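In its simplest form, the trick looks something like this (a sketch of
the general shape; the exact initialization in my benchmarked version
may differ):

    #include <cstdint>
    #include <string>

    // Multiplying by the Mersenne prime 2^7 - 1 = 127 needs no hardware
    // multiply: 127 * h == (h << 7) - h.
    std::uint32_t
    mersenneHash(std::string const& key)
    {
        std::uint32_t h = 0;
        for (unsigned char c : key) {
            h = (h << 7) - h + c;   // h = 127 * h + c, without a multiply
        }
        return h;
    }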
That was all a long time ago, but I've never really lost
interest in the subject. When Peter K. Pearson published his
article "Fast Hashing of Variable-Length Text Strings" in the
CACM, I created a small test harness and compared it with the
others; my own turned out to be several times faster. Some time
later, someone in fr.comp.lang.c++ mentioned FNV hashing; by
that time, I had my "standard" benchmark harness designed, so I
whipped up a benchmark with that and all of the others I could
find, and wrote up the results in
http://kanze.james.neuf.fr/code/Docs/html/Hashcode.html. Since
then, I've modified my harness to use my own AssocArray class,
rather than the g++ hash_map (so that I could also test with Sun
CC, VC++, etc.), and have added quite a few additional
algorithms. The fundamental results haven't changed, however;
either FNV or my Mersenne prime function is always amongst
the fastest (depending on multiplication speed and probably a
number of other factors I'm not aware of). The Pearson and
Jenkins functions (at least the variants I'm using) are,
depending on the data, between 20% and 35% slower.
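For reference, here are the inner loops of the two classic functions in
question, in their commonly published forms (these are textbook
sketches, not necessarily the exact variants in my harness):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <numeric>
    #include <random>

    // FNV-1, 32 bits, with the usual published constants.
    std::uint32_t
    fnv1(unsigned char const* p, std::size_t n)
    {
        std::uint32_t h = 2166136261u;      // FNV offset basis
        while (n-- != 0) {
            h *= 16777619u;                 // FNV prime
            h ^= *p++;
        }
        return h;
    }

    // Pearson's table-driven hash: one table lookup per byte through a
    // fixed permutation of 0..255. The original paper supplies a
    // specific permutation; any fixed one will do, so this sketch just
    // generates one from an arbitrary seed.
    unsigned char
    pearson(unsigned char const* p, std::size_t n)
    {
        static unsigned char table[256];
        static bool const initialized = [] {
            std::iota(table, table + 256, 0);
            std::shuffle(table, table + 256, std::mt19937(42));
            return true;
        }();
        (void)initialized;
        unsigned char h = 0;
        while (n-- != 0) {
            h = table[h ^ *p++];
        }
        return h;   // 8-bit result; wider variants run several copies
                    // of the loop with different starting values
    }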
I've since extended the benchmark program to support
instrumentation. A quick check showed very little difference in
the actual distributions, so the difference comes down to the
fact that the Mersenne prime function and FNV are faster to
calculate.
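For what it's worth, a simple way to instrument the distribution (this
is an illustration of the kind of check involved, not my actual
harness) is a chi-squared statistic over the bucket counts:

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Hash every key in a corpus into bucketCount buckets and compute
    // the chi-squared statistic against a uniform spread. A result
    // close to the number of degrees of freedom (bucketCount - 1)
    // means the function fills the buckets about as evenly as a
    // random assignment would.
    double
    chiSquared(std::vector<std::string> const& keys,
               std::uint32_t (*hash)(unsigned char const*, std::size_t),
               std::size_t bucketCount)
    {
        std::vector<std::size_t> counts(bucketCount, 0);
        for (std::string const& k : keys) {
            std::uint32_t h = hash(
                reinterpret_cast<unsigned char const*>(k.data()), k.size());
            ++counts[h % bucketCount];
        }
        double const expected = double(keys.size()) / double(bucketCount);
        double chi2 = 0.0;
        for (std::size_t c : counts) {
            double const d = double(c) - expected;
            chi2 += d * d / expected;
        }
        return chi2;
    }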
Bob and I took different approaches. He started with
something cryptographic, and knocked it down to a point where
it was faster, though losing the cryptographic properties that
were no longer required. I instead took some ideas of his,
and built a function from the ground up that has no pedigree
from any cryptographic function. My design took modern CPU
architecture into account, while trying to get at "the heart
of the matter" of hash function quality requirements.
So I started with a framework which I knew would deliver a
high performance result and injected quality into it.
The key point behind both our functions is that they are not
only good quality, they are objectively faster than all the
other typically used hash functions (CRC, Dan Bernstein's
ASCII hash, FNV hash, etc). The only things that even compete
with our functions are things people already know are
worthless (like just summing up the bytes).
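For reference, Bernstein's ASCII hash is the well-known "times 33"
one-liner; a sketch of its commonly quoted form:

    #include <cstdint>

    // Bernstein's ASCII hash as usually quoted: multiply by 33,
    // starting from 5381, over a NUL-terminated string.
    std::uint32_t
    djb2(unsigned char const* p)
    {
        std::uint32_t h = 5381;
        while (*p != '\0') {
            h = h * 33 + *p++;  // often written ((h << 5) + h) + c
        }
        return h;
    }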
It's interesting to note that on a modern machine, FNV or my own
functions do not take any more time than just summing the bytes,
but result in a very good distribution.
Is that a commentary on what you think of James Kanze's
abilities, or are you just indirectly praising people like Bob
Jenkins and myself as being some kind of untouchable
uber-programmers? If the latter, I would say it's likely
unwarranted.
I think we both agree that the final word hasn't been written.
I'm curious to see how your code does, however, because on your
web page, you say you find that Bob's function is faster than
FNV on an Athlon, whereas I find just the opposite on a wide
variety of machines (Sun Sparc, some older Intel, and now both
Intel- and AMD-based Linux boxes).
FWIW, I just reran one set of tests on my machine here, a
64-bit AMD machine. Jenkins is almost exactly the same as FNV, and
slightly slower than my Mersenne prime hash using 2^7-1 as
multiplier. This was for a set of 8554 symbols extracted from
my code. A second trial with 1259 URLs (extracted from
somewhere, I forget where) did show Jenkins as slightly better.
So maybe it depends on typical length; the URLs are longer, on
average, than my program symbols.
At any rate, my recommendation still stands: a Mersenne prime
multiplier, specifically 2^7-1. That seems to give the best results
over a wide variety of data and hardware. But if your algorithm
is significantly faster than Jenkins, it's definitely worth
looking at. I'll add it to my tests.
This problem is wide open for anyone with good math and
programming skills and a rough idea of modern CPU performance.
Specifically: the mantle is up for grabs for anyone to come up
with a good *64 bit* hash function. Ironically, my gut feeling
is that it should be possible to come up with a function nearly
twice as fast as mine if you can make the assumption that your
platform natively supports 64 bit scalars (which modern systems
now do.)
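To make the direction concrete, here is a toy sketch of a word-at-a-time
hash (purely illustrative and certainly not a vetted design; the
multiplier is an arbitrary odd constant, and getting the mixing quality
right is exactly the open problem):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Toy sketch only: consume the input 8 bytes per step, with one
    // 64-bit multiply per word instead of one per byte. The result
    // depends on byte order, which is fine for in-process tables.
    std::uint64_t
    hash64(unsigned char const* p, std::size_t n)
    {
        std::uint64_t const m = 0x880355f21e6d1965ull; // arbitrary odd constant
        std::uint64_t h = n;                           // seed with the length
        while (n >= 8) {
            std::uint64_t w;
            std::memcpy(&w, p, 8);      // portable unaligned read
            h = (h ^ w) * m;
            p += 8;
            n -= 8;
        }
        std::uint64_t tail = 0;
        std::memcpy(&tail, p, n);       // the remaining 0..7 bytes
        h = (h ^ tail) * m;
        return h ^ (h >> 32);           // final mix before use
    }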
Note that on a modern CPU, I would expect the byte accesses in
FNV or my own algorithms to have very little impact, since the
entire inner loop will be in cache, all intermediate variables
in registers, so the only memory access will be the characters
in the string, and the CPU will already have the data on hand
for all of the reads except the first in each cache line.
So I think that the word length factor is really a red herring.