Eric said:
Johan said:
I would be grateful if someone had a minute or two to review my hash
table implementation. It's not yet commented but hopefully it's short
and idiomatic enough to be readable.
It looks ok, so my review will be based on performance. Keep in mind
that the only reason to use a hash table in the first place is because
of a concern for performance. So while my comments may seem unusual,
they really go to the heart of the matter, which is to try to reduce
the hash table overhead as much as possible.
[...] Some of the code (i.e. the
get_hash function) is borrowed from various snippets I found on the
net. The free function could probably use some love. I have been
thinking about having a second linked list of all entries so that the
cost of freeing is proportional to the number of entries.
#include <stdlib.h>
#include <string.h>
#define TABLE_SIZE 193
struct entry {
    char *key;
    char *value;
    struct entry *next;
};
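A minimal sketch of that second-list idea, assuming the table owns
strdup()'d copies of key and value; the struct name, the all_next
member, and free_all_entries() are invented for illustration and are
not part of the posted code:

struct entry_sketch {
    char *key;
    char *value;
    struct entry_sketch *next;      /* per-bucket chain, as above */
    struct entry_sketch *all_next;  /* threads every entry in the table */
};

static struct entry_sketch *all_head;  /* head of the list of all entries */

/* Frees the whole table by walking only the n entries actually stored,
   instead of scanning all TABLE_SIZE buckets. */
static void free_all_entries(void)
{
    struct entry_sketch *e = all_head;
    while (e != NULL) {
        struct entry_sketch *next = e->all_next;
        free(e->key);
        free(e->value);
        free(e);
        e = next;
    }
    all_head = NULL;
}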
So the posted code is a linked entry (or "Open hash table") design.
The nomenclature seems at variance with common practice;
can you tell us where you've encountered this usage? FWIW,
the code as given uses what Knuth calls "collision resolution
by separate chaining." The "open addressing" methods he
describes are quite different -- no links, for starters.
I learned this nomenclature (open hash, not open address) at
university and I have seen it cited elsewhere since. I did not study
it via Knuth.
The choice of TABLE_SIZE interacts with the collision
resolution method, and also influences the sensitivity to
the "goodness" of the hash function. Saving a division may
or may not be a win; saving a division at the cost of worse
bucket distribution and X% additional strcmp() calls is quite
likely a poor trade. Measure.
I *have* measured this. I wonder if you have. The only really bad
hash functions affected by power-of-2 table sizes are things like
CRCs. But there are other hash functions available; I gave links to
some. The multiply-then-add class of hash functions is going to be
poor regardless, but it is not especially adversely affected by
whether the table size is a power of 2 or not.
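As an illustrative sketch, 32-bit FNV-1a is one hash function of this
better-mixing sort, though not necessarily one of those linked
earlier:

#include <stdint.h>

static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;            /* FNV offset basis */
    while (*s) {
        h ^= (unsigned char)*s++;        /* xor in the next byte */
        h *= 16777619u;                  /* multiply by the FNV prime */
    }
    return h;
}

/* With a power-of-2 table size the bucket index is then a mask rather
   than a modulo, e.g.  bucket = fnv1a(key) & (SIZE - 1);  */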
Classically the choice has been either power-of-2 table sizes with odd
reprobe deltas, or prime table sizes with arbitrary reprobe deltas.
You can attack the reprobe problem using quadratic probing; however,
it turns out that power-of-2 table sizes with *prime number* reprobe
deltas (primes greater than the expected maximum reprobe sequence
length) completely solve the problem and remove the % overhead.
I.e., the alternatives can't match that immediate performance while
still reprobing correctly. But in this case he's chaining, so
reprobing doesn't matter; a power-of-2 size is an easy choice that has
no complications whatsoever.
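A sketch of that reprobe step, with invented names (SLOTS, DELTA) and
an illustrative power-of-2 table size; because the delta is odd it is
coprime to the table size, so the probe sequence eventually visits
every slot:

#define SLOTS 1024u     /* table size: a power of 2 (illustrative) */
#define DELTA 101u      /* odd prime reprobe delta (illustrative) */

static unsigned next_slot(unsigned slot)
{
    return (slot + DELTA) & (SLOTS - 1);   /* reprobe with a mask, no % */
}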
Again, this can be a win or a loss. It spends ~33% more
memory (on typical machines), increasing the "pressure" on
cache lines, TLBs, and the like. (How much? Measure.) In
some schemes (open addressing among them) you might be better
off using that extra memory to get more entries into the table
and decrease the load factor, thus decreasing the number of
probes per search. Is it better to have faster probes or
to have fewer probes? Measure.
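As a rough illustration of that trade (using the classic textbook
approximation for linear probing, not a measurement of this code): a
successful search costs about (1/2)(1 + 1/(1-a)) probes at load factor
a, so dropping the load factor from 0.9 to 0.6 cuts the average from
roughly 5.5 probes to roughly 1.75.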
I *have* measured it. You didn't measure well enough to even realize
that that 33% figure is completely bogus. Notice that he is
maintaining a <char *key> value in each struct entry that gets filled
in. Do you still think this additional field is costing him that
much?
This is why I made this recommendation completely unqualified. He
isn't asking for a general hash table solution. He's clearly looking
specifically for a string -> string hash map. This narrows things
considerably, and it means the hash value stored for pre-screening
isn't going to cost anywhere close to 33% extra overhead. So it's
almost certainly just going to be worth it to do pre-screening.
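A hedged sketch of the pre-screening being argued about here, with an
invented struct and lookup function (not the OP's code, and assuming
<string.h> as included above): the full hash is stored in each entry
at insert time and compared before strcmp(), so the string compare
runs only when the hashes already agree.

struct entry_ps {
    unsigned hash;              /* full hash of key, stored at insert */
    char *key;
    char *value;
    struct entry_ps *next;
};

static char *lookup_ps(struct entry_ps *bucket, const char *key,
                       unsigned hash)
{
    struct entry_ps *e;
    for (e = bucket; e != NULL; e = e->next) {
        if (e->hash == hash && strcmp(e->key, key) == 0)
            return e->value;    /* integer pre-screen first, then strcmp */
    }
    return NULL;
}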
One of the fascinating things about hash tables is that
the phrase "hash table" covers such a wide variety! The
techniques and trade-offs are seemingly endless: separate
chaining, coalesced chaining, open addressing, double hashing,
bucket arrangements, dynamic hashing, combination strategies
that segregate the data in different ways, ... Given the
richness of possibility, it is difficult or perhaps even
impossible to make blanket recommendations.
Except that it isn't. I've tried a number of those variations in
different scenarios, including string -> <something> hashing (and
remember that the strings I use are faster than the strings he or you
are using, thus increasing the emphasis on hash table overhead), and
the results, interestingly (and intuition-defyingly) enough, are that
the kind of hash table basically doesn't matter. (The one obvious
exception being that you only go with chaining when you need hard
real-time behavior and can't use an auto-resizing, in-place hash
table.)
Once you remove all primary bottlenecks (via pre-screening the hash
before the raw compare, a mask instead of a modulo, and any reasonably
sound design) the hash table overhead itself is just nothing. It all
comes down to the speed of the hash function and the speed of the
comparison function (and the copy/update functions, if yours happen to
be slow for some reason.) For the average match you pay for 1 hash
and 1 compare, with probability 1-epsilon. For the average miss you
pay for just the 1 hash, with probability 1-epsilon (I don't remember
the average number of probes, but it was less than 1.0, of course).
These observations are only true if you use pre-screening. With
in-place
hash tables (no chaining, but reprobing) you see an average of about
1.4 reprobes that basically cost nothing since they are just integer
compares inside a loop. With chaining, obviously it depends on how
well your index table size was chosen and how well your hash function
works -- little else matters. Once you realize this, you quickly see
that all this variation just isn't buying you anything real. The
*purpose* of a hash table is to be an improvement in *speed*. In the
face of these observations, this means our considerations are
dominated by micro-optimizations.
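For comparison, a hedged sketch of the in-place (open addressing)
lookup described above, combining the mask, the prime reprobe delta,
and the stored-hash pre-screen; all names are invented, strcmp() again
comes from <string.h>, and a stored hash of 0 is taken to mean "empty
slot" purely for brevity:

#define NSLOTS 1024u            /* table size: a power of 2 */
#define STEP   101u             /* odd prime reprobe delta */

struct slot {
    unsigned hash;              /* 0 means the slot is empty (sketch only) */
    char *key;
    char *value;
};

static char *oa_lookup(struct slot *table, const char *key, unsigned hash)
{
    unsigned i = hash & (NSLOTS - 1);
    unsigned n;
    for (n = 0; n < NSLOTS; n++) {
        if (table[i].hash == 0)
            return NULL;                     /* empty slot: a miss */
        if (table[i].hash == hash && strcmp(table[i].key, key) == 0)
            return table[i].value;           /* pre-screen, then strcmp */
        i = (i + STEP) & (NSLOTS - 1);       /* cheap integer reprobe */
    }
    return NULL;                             /* table full and not found */
}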
Remember that the theoretical problems Knuth ran into were from an age
when good hash functions just did not exist. It's not bad for 1969,
when he originally wrote that stuff, but what do you want? The world
(including Knuth himself) has moved on since then.
[...] Everything Paul
wrote is true of some problem domain or other; my only issue
is that he seems to be saying his suggestions are the cat's
meow in all situations, while I'm uttering the cautionary
"Bless you, it all depends!"
Yes, because these positions are of course compatible with the fact
that you are ignorant and I am not. Because, guess what, I *HAVE*
measured these things. The "scenarios" you are suggesting include the
OP's scenario (string->string hashing.)
And as is usual around here, of course, you are reading into things
that I have not said -- the OP's hashing scenario is very particular,
which you might have understood if you had read his code. So the
design-affecting micro-optimization techniques I suggest, besides
being applicable to many scenarios, are easily covered by the OP's
very narrow single scenario. So it's a very specific answer to the
very specific question he had. If you still think I seem to suggest
otherwise, I can't help you; go see a psychiatrist, or an English
teacher or something.