Minti wrote:
.... snip ...
[...] There are several algorithms. I am not sure if CBFalconer
is offering source code along. But if he is good news.
I would be highly suspicious of any code CBFalconer offers (his idea of
a "general hash table" is one that can hold at most 8.5 million
entries). Certainly you should heavily audit the code before use.
This comes from someone who can't recognize that a hash function
that requires a length parameter has to measure the length of an
input string, or who can't recognize that a 32 bit quantity won't
fit into an int and be portable. He is also a proponent of shift
operations on signed quantities. Similarly he seems unable to
recognize that the 8.5 million limit he mentioned is on entries,
each of which requires data storage, and thus represents something
between 5 and 100 up times that in storage bytes. Besides which,
if the user wishes, he simply adds an entry to a table to increase
the maximum. Meanwhile, all smaller quantities are handled with
minimum wastage (about 50% max).
<Aside to Hsieh - hash tables are intended to be in-memory storage,
and tend to be less efficient when they cause thrashing in a
virtual memory system. This puts an inherent maximum on the item
count usable on a given class of systems. If this is not clear to
you I suggest you use e-mail to get an explanation in smaller words
- this is not the appropriate forum>
<For everyone: The following is a quote from the hashlib source,
and is the area that creates this horrendous 8 mega-entries
limitation. I think how to adjust the upper limit should be fairly
obvious to any reasonably skilled C programmer.>
/* table of k where 2**n - k is prime, for n=8 up. 0 ends */
/* These numbers are chosen so that memory allocation will */
/* usually allow space for system overhead in a 2**n block */
/*
http://www.utm.edu/research/primes/lists/2small/0bit.html */
#define FIRSTN 8
static int primetbl[] = {45, 45, 41, 45, 45, 45, 45, 49,
57, 49, 41, 45, 59, 55, 57, 61, 0};
/* So the prime of interest, vs index i into above table, */
/* is ( 2**(FIRSTN + i) ) - primetbl
*/
/* The above table suffices for about 8,000,000 entries. */
Hsieh also seems unable to recognize that the operation of the
malloc package is independant of the compiler, nor that
cryptographic qualities of a hash function have nothing to do with
its suitability for a hash table. There seems to be a lesson here,
that mathematical ability and knowledge have very little to do with
the ability to create clean and trouble free code. However he is
making some progress, in that he has recognized that the action of
sbrk (or an equivalent system function) can affect the timing of a
malloc call. Further hint: This is the one of the reasons malloc
is part of the system and is generally not portable. Alignment
also raises its head here.
At any rate, you (the public) may judge for yourself. My code is
generally available at:
<http://cbfalconer.home.att.net/download/>
and I recommend the hashlib and nmalloc packages for this
controversy, together with id2id-20 and hshftst. The latter arose
out of Hsiehs earlier erroneous claims.
I concede that I am basing my low opinion of his code on his
published hash function. It is possible that this is an inadequate
sample. At any rate, I will not get excited and irrational if told
that there is a problem with my code, and I welcome any such
reports.