The arena of "practical non-cryptographic hash functions" is
clearly a relatively new field.
Yes and no. The problem has certainly been around for a while,
but you're right that it doesn't seem to have attracted much
interest. The published works I've seen mostly just present a
function and claim that it is good, with no analysis, and most
of the time with no real comparative benchmarks either.
Outside of crypto-hashes, Bob Jenkins's function and my
function, the quality of everything else out there is truly
pathetic.
As I said, I'm interested, because I've been collecting
benchmarks of various hashing functions. Post a link to the
algorithm, and I'll add it in.
OK, I just saw a link at the end of your posting. So I've got a
model for your implementation. I'll add it to my tests at the
first opportunity.
Bob Jenkins, for a long time, set the standard with his
lookup2 function (I think he wrote and publicized it in Dr.
Dobb's Journal in 1996 or so).
According to the link on your page, this is one I've already
tested, and found to be slower than FNV or my own hash
functions. It is, IMHO, a typical example of how over-tuning
to a particular machine doesn't pay: the distribution isn't
significantly better than that of FNV or my own function, and
in fact it takes more time to calculate (on the machines I
have access to).
However, in a kind of "first attempt", I was able to design a
hash function myself that was dramatically faster and of
similar quality. (My function passes Bob Jenkins' avalanche
test as well as my own far more stringent bit distribution and
correlation test; Bob's function performs slightly worse on my
test.) So it's clear that there is plenty of room for research
here if anyone cares to put in the time and effort.
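For anyone who wants to try this at home, the avalanche criterion is
mechanical to check: flip each input bit of a key and see whether each
output bit flips about half the time. Below is a minimal sketch of such
a check (my own illustration, with 32-bit FNV-1a standing in for the
function under test; it is not either of the actual test harnesses):

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <random>

    // Function under test: 32-bit FNV-1a, used here only as a stand-in.
    std::uint32_t
    hashUnderTest(unsigned char const* key, std::size_t len)
    {
        std::uint32_t h = 2166136261u;
        for (std::size_t i = 0; i < len; ++i) {
            h ^= key[i];
            h *= 16777619u;
        }
        return h;
    }

    int
    main()
    {
        std::mt19937 rng(12345);
        int const trials = 10000;
        std::size_t const len = 8;      // 64 input bits
        static long flips[64][32] = {}; // flips[i][j]: times output bit j
                                        // flipped when input bit i flipped
        for (int t = 0; t < trials; ++t) {
            unsigned char key[8];
            for (unsigned char& b : key) {
                b = static_cast<unsigned char>(rng());
            }
            std::uint32_t base = hashUnderTest(key, len);
            for (std::size_t i = 0; i < len * 8; ++i) {
                key[i / 8] ^= 1u << (i % 8);    // flip one input bit
                std::uint32_t diff = base ^ hashUnderTest(key, len);
                key[i / 8] ^= 1u << (i % 8);    // and flip it back
                for (int j = 0; j < 32; ++j) {
                    flips[i][j] += (diff >> j) & 1;
                }
            }
        }
        // Each output bit should flip with probability close to 0.5
        // for every input bit; report the pairs that don't.
        for (std::size_t i = 0; i < len * 8; ++i) {
            for (int j = 0; j < 32; ++j) {
                double p = double(flips[i][j]) / trials;
                if (p < 0.45 || p > 0.55) {
                    std::printf("weak: input bit %u -> output bit %d (p %.3f)\n",
                                unsigned(i), j, p);
                }
            }
        }
        return 0;
    }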
Bob rewrote a version of his function, which apparently comes
much closer to the performance of my function, but I have not
gone back to check it.
What is certain is that there is a lot of folklore floating
around, and very little real research. I'm not really much into
mathematical analysis of functions myself, so I can't do much on
that end. My basic idea was based on linear congruential random
number generators; intuitively, it seemed to me that whatever
made a good random number generator would also make a good
hashing algorithm. I also took into account that the execution
time of the hashing algorithm must be balanced against its
distribution characteristics; multiplying the execution time by
10 to gain a 1% better distribution will still result in slower
accesses on average. In my case, I was (at the time)
working on a machine (an 8086) with very, very slow
multiplication, so I became intrigued with the idea of using a
Mersenne prime as the multiplier (so that the multiplication
could be converted into a shift and a subtraction). And my
algorithm did beat out the few other examples I had at hand at
the time.
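In its simplest form, the trick looks something like this (a sketch of
the general shape; the exact initialization in my benchmarked version
may differ):

    #include <cstdint>
    #include <string>

    // Multiplying by the Mersenne prime 2^7 - 1 = 127 needs no hardware
    // multiply: 127 * h == (h << 7) - h.
    std::uint32_t
    mersenneHash(std::string const& key)
    {
        std::uint32_t h = 0;
        for (unsigned char c : key) {
            h = (h << 7) - h + c;   // h = 127 * h + c, without a multiply
        }
        return h;
    }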
That was all a long time ago, but I've never really lost
interest in the subject. When Peter K. Pearson published his
article "Fast Hashing of Variable-Length Text Strings" in the
CACM, I created a small test harness and compared it with the
others; my own turned out to be several times faster. Some time
later, someone in fr.comp.lang.c++ mentioned FNV hashing; by
that time, I had my "standard" benchmark harness designed, so I
whipped up a benchmark with that and all of the others I could
find, and wrote up the results in
http://kanze.james.neuf.fr/code/Docs/html/Hashcode.html. Since
then, I've modified my harness to use my own AssocArray class,
rather than the g++ hash_map (so that I could also test with Sun
CC, VC++, etc.), and have added quite a few additional
algorithms. The fundamental results haven't changed, however;
either FNV or my Mersenne prime function is always amongst
the fastest (depending on multiplication speed and probably a
number of other factors I'm not aware of). The Pearson and
Jenkins functions (at least the variants I'm using) are,
depending on the data, between 20% and 35% slower.
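For reference, here are the inner loops of the two classic functions in
question, in their commonly published forms (these are textbook
sketches, not necessarily the exact variants in my harness):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <numeric>
    #include <random>

    // FNV-1, 32 bits, with the usual published constants.
    std::uint32_t
    fnv1(unsigned char const* p, std::size_t n)
    {
        std::uint32_t h = 2166136261u;      // FNV offset basis
        while (n-- != 0) {
            h *= 16777619u;                 // FNV prime
            h ^= *p++;
        }
        return h;
    }

    // Pearson's table-driven hash: one table lookup per byte through a
    // fixed permutation of 0..255. The original paper supplies a
    // specific permutation; any fixed one will do, so this sketch just
    // generates one from an arbitrary seed.
    unsigned char
    pearson(unsigned char const* p, std::size_t n)
    {
        static unsigned char table[256];
        static bool const initialized = [] {
            std::iota(table, table + 256, 0);
            std::shuffle(table, table + 256, std::mt19937(42));
            return true;
        }();
        (void)initialized;
        unsigned char h = 0;
        while (n-- != 0) {
            h = table[h ^ *p++];
        }
        return h;   // 8-bit result; wider variants run several copies
                    // of the loop with different starting values
    }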
I've since extended the benchmark program to support
instrumentation. A quick check showed very little difference in
the actual distributions, so the difference comes down to the
fact that the Mersenne prime function and FNV are faster to
calculate.
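For what it's worth, a simple way to instrument the distribution (this
is an illustration of the kind of check involved, not my actual
harness) is a chi-squared statistic over the bucket counts:

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Hash every key in a corpus into bucketCount buckets and compute
    // the chi-squared statistic against a uniform spread. A result
    // close to the number of degrees of freedom (bucketCount - 1)
    // means the function fills the buckets about as evenly as a
    // random assignment would.
    double
    chiSquared(std::vector<std::string> const& keys,
               std::uint32_t (*hash)(unsigned char const*, std::size_t),
               std::size_t bucketCount)
    {
        std::vector<std::size_t> counts(bucketCount, 0);
        for (std::string const& k : keys) {
            std::uint32_t h = hash(
                reinterpret_cast<unsigned char const*>(k.data()), k.size());
            ++counts[h % bucketCount];
        }
        double const expected = double(keys.size()) / double(bucketCount);
        double chi2 = 0.0;
        for (std::size_t c : counts) {
            double const d = double(c) - expected;
            chi2 += d * d / expected;
        }
        return chi2;
    }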
Bob and I took different approaches. He started with
something cryptographic, and knocked it down to a point where
it was faster, though losing the cryptographic properties that
were no longer required. I instead took some ideas of his,
and built a function from the ground up that has no pedigree
from any cryptographic function. My design took modern CPU
architecture into account, while trying to get at "the heart
of the matter" of hash function quality requirements.
So I started with a framework which I knew would deliver a
high performance result and injected quality into it.
The key point behind both our functions is that they are not
only good quality, they are objectively faster than all the
other typically used hash functions (CRC, Dan Bernstein's
ASCII hash, FNV hash, etc). The only things that even compete
with our functions are things people already know are
worthless (like just summing up the bytes).
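For reference, Bernstein's ASCII hash is the well-known "times 33"
one-liner; a sketch of its commonly quoted form:

    #include <cstdint>

    // Bernstein's ASCII hash as usually quoted: multiply by 33,
    // starting from 5381, over a NUL-terminated string.
    std::uint32_t
    djb2(unsigned char const* p)
    {
        std::uint32_t h = 5381;
        while (*p != '\0') {
            h = h * 33 + *p++;  // often written ((h << 5) + h) + c
        }
        return h;
    }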
It's interesting to note that on a modern machine, FNV or my own
functions do not take any more time than just summing the bytes,
but result in a very good distribution.
Is that a commentary on what you think of James Kanze's
abilities, or are you just indirectly praising people like Bob
Jenkins and myself as being some kind of untouchable
uber-programmers? If the latter, I would say it's likely
unwarranted.
I think we both agree that the final word hasn't been written.
I'm curious to see how your code does, however, because on your
web page, you say you find that Bob's function is faster than
FNV on an Athlon, whereas I find just the opposite on a wide
variety of machines (Sun Sparc, some older Intel, and now both
Intel- and AMD-based Linux boxes).
FWIW, I just reran one set of tests on my machine here, a
64-bit AMD machine. Jenkins is almost exactly the same as FNV, and
slightly slower than my Mersenne prime hash using 2^7-1 as
multiplier. This was for a set of 8554 symbols extracted from
my code. A second trial with 1259 URLs (extracted from
somewhere, I forget where) did show Jenkins as slightly better.
So maybe it depends on typical length; the URLs are longer, on
average, than my program symbols.
At any rate, my recommendation still stands: a Mersenne prime
multiplier, specifically 2^7-1. That seems to give the best results
over a wide variety of data and hardware. But if your algorithm
is significantly faster than Jenkins, it's definitely worth
looking at. I'll add it to my tests.
This problem is wide open for anyone with good math and
programming skills and a rough idea of modern CPU performance.
Specifically: the mantle is up for grabs for anyone to come up
with a good *64 bit* hash function. Ironically, my gut feeling
is that it should be possible to come up with a function nearly
twice as fast as mine if you can make the assumption that your
platform natively supports 64 bit scalars (which modern systems
now do.)
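To make the direction concrete, here is a toy sketch of a word-at-a-time
hash (purely illustrative and certainly not a vetted design; the
multiplier is an arbitrary odd constant, and getting the mixing quality
right is exactly the open problem):

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Toy sketch only: consume the input 8 bytes per step, with one
    // 64-bit multiply per word instead of one per byte. The result
    // depends on byte order, which is fine for in-process tables.
    std::uint64_t
    hash64(unsigned char const* p, std::size_t n)
    {
        std::uint64_t const m = 0x880355f21e6d1965ull; // arbitrary odd constant
        std::uint64_t h = n;                           // seed with the length
        while (n >= 8) {
            std::uint64_t w;
            std::memcpy(&w, p, 8);      // portable unaligned read
            h = (h ^ w) * m;
            p += 8;
            n -= 8;
        }
        std::uint64_t tail = 0;
        std::memcpy(&tail, p, n);       // the remaining 0..7 bytes
        h = (h ^ tail) * m;
        return h ^ (h >> 32);           // final mix before use
    }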
Note that on a modern CPU, I would expect the byte accesses in
FNV or my own algorithms to have very little impact, since the
entire inner loop will be in cache, all intermediate variables
in registers, so the only memory access will be the characters
in the string, and the CPU will already have the data on hand
for all of the reads except the first in each cache line.
So I think that the word length factor is really a red herring.