Eric said:
Johan said:
I would be grateful if someone had a minute or two to review my hash
table implementation. It's not yet commented but hopefully it's short
and idiomatic enough to be readable.
It looks ok, so my review will be based on performance. Keep in mind
that the only reason to use a hash table in the first place is because
of a concern for performance. So while my comments may seem unusual,
they really go to the heart of the matter, which is to try to reduce
the hash table overhead as much as possible.
[...] Some of the code (i.e. the
get_hash function) is borrowed from various snippets I found on the
net. The free function could probably use some love. I have been
thinking about having a second linked list of all entries so that the
cost of freeing is proportional to the number of entries.
#include <stdlib.h>
#include <string.h>
#define TABLE_SIZE 193
struct entry {
    char *key;
    char *value;
    struct entry *next;
};
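A minimal sketch of that second-list idea, assuming the table owns
strdup()'d copies of key and value; the struct name, the all_next
member, and free_all_entries() are invented for illustration and are
not part of the posted code:

struct entry_sketch {
    char *key;
    char *value;
    struct entry_sketch *next;      /* per-bucket chain, as above */
    struct entry_sketch *all_next;  /* threads every entry in the table */
};

static struct entry_sketch *all_head;  /* head of the list of all entries */

/* Frees the whole table by walking only the n entries actually stored,
   instead of scanning all TABLE_SIZE buckets. */
static void free_all_entries(void)
{
    struct entry_sketch *e = all_head;
    while (e != NULL) {
        struct entry_sketch *next = e->all_next;
        free(e->key);
        free(e->value);
        free(e);
        e = next;
    }
    all_head = NULL;
}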
So the posted code is a linked entry (or "Open hash table") design.
The nomenclature seems at variance with common practice;
can you tell us where you've encountered this usage? FWIW,
the code as given uses what Knuth calls "collision resolution
by separate chaining." The "open addressing" methods he
describes are quite different -- no links, for starters.
I learned this nomenclature (open hash, not open address) at
university and I have seen it cited elsewhere since. I did not study
it via Knuth.
The choice of TABLE_SIZE interacts with the collision
resolution method, and also influences the sensitivity to
the "goodness" of the hash function. Saving a division may
or may not be a win; saving a division at the cost of worse
bucket distribution and X% additional strcmp() calls is quite
likely a poor trade. Measure.
I *have* measured this. I wonder if you have. The only really bad
hash functions affected by power-of-2 table sizes are things like
CRCs. But there are other hash functions available; I gave links to
some. The multiply-then-add class of hash functions is going to be
poor regardless, but it is not especially adversely affected by
whether the table size is a power of 2 or not.
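As an illustrative sketch, 32-bit FNV-1a is one hash function of this
better-mixing sort, though not necessarily one of those linked
earlier:

#include <stdint.h>

static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;            /* FNV offset basis */
    while (*s) {
        h ^= (unsigned char)*s++;        /* xor in the next byte */
        h *= 16777619u;                  /* multiply by the FNV prime */
    }
    return h;
}

/* With a power-of-2 table size the bucket index is then a mask rather
   than a modulo, e.g.  bucket = fnv1a(key) & (SIZE - 1);  */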
Classically the choice has been either power-of-2 table sizes with odd
reprobe deltas, or prime table sizes with arbitrary reprobe deltas.
You can attack the reprobe problem using quadratic probing; however,
it turns out that power-of-2 table sizes with *prime number* reprobe
deltas (primes greater than the expected maximum reprobe sequence
length) completely solve the problem and remove the % overhead.
I.e., the alternatives can't match that immediate performance while
still reprobing correctly. But in this case he's chaining, so
reprobing doesn't matter; a power-of-2 size is an easy choice that has
no complications whatsoever.
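A sketch of that reprobe step, with invented names (SLOTS, DELTA) and
an illustrative power-of-2 table size; because the delta is odd it is
coprime to the table size, so the probe sequence eventually visits
every slot:

#define SLOTS 1024u     /* table size: a power of 2 (illustrative) */
#define DELTA 101u      /* odd prime reprobe delta (illustrative) */

static unsigned next_slot(unsigned slot)
{
    return (slot + DELTA) & (SLOTS - 1);   /* reprobe with a mask, no % */
}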
Again, this can be a win or a loss. It spends ~33% more
memory (on typical machines), increasing the "pressure" on
cache lines, TLBs, and the like. (How much? Measure.) In
some schemes (open addressing among them) you might be better
off using that extra memory to get more entries into the table
and decrease the load factor, thus decreasing the number of
probes per search. Is it better to have faster probes or
to have fewer probes? Measure.
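As a rough illustration of that trade (using the classic textbook
approximation for linear probing, not a measurement of this code): a
successful search costs about (1/2)(1 + 1/(1-a)) probes at load factor
a, so dropping the load factor from 0.9 to 0.6 cuts the average from
roughly 5.5 probes to roughly 1.75.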
I *have* measured it. You didn't measure well enough to even realize
that that 33% figure is completely bogus. Notice that he is
maintaining a <char *key> value in each struct entry that gets filled
in. Do you still think this additional field is costing him that
much?
This is why I made this recommendation completely unqualified. He
isn't asking for a general hash table solution. He's clearly looking
specifically for a string -> string hash map. This narrows things
considerably, and it means the hash value stored for pre-screening
isn't going to cost anywhere close to 33% extra overhead. So it's
almost certainly just going to be worth it to do pre-screening.
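A hedged sketch of the pre-screening being argued about here, with an
invented struct and lookup function (not the OP's code, and assuming
<string.h> as included above): the full hash is stored in each entry
at insert time and compared before strcmp(), so the string compare
runs only when the hashes already agree.

struct entry_ps {
    unsigned hash;              /* full hash of key, stored at insert */
    char *key;
    char *value;
    struct entry_ps *next;
};

static char *lookup_ps(struct entry_ps *bucket, const char *key,
                       unsigned hash)
{
    struct entry_ps *e;
    for (e = bucket; e != NULL; e = e->next) {
        if (e->hash == hash && strcmp(e->key, key) == 0)
            return e->value;    /* integer pre-screen first, then strcmp */
    }
    return NULL;
}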
One of the fascinating things about hash tables is that
the phrase "hash table" covers such a wide variety! The
techniques and trade-offs are seemingly endless: separate
chaining, coalesced chaining, open addressing, double hashing,
bucket arrangements, dynamic hashing, combination strategies
that segregate the data in different ways, ... Given the
richness of possibility, it is difficult or perhaps even
impossible to make blanket recommendations.
Except that it isn't. I've tried a number of those variations in
different scenarios, including string -> <something> hashing (and
remember that the strings I use are faster than the strings he or you
are using, thus increasing the emphasis on hash table overhead), and
the results, interestingly (and intuition-defyingly) enough, are that
the kind of hash table basically doesn't matter. (The one obvious
exception being that you only go with chaining when you need hard
real-time behavior and can't use an auto-resizing, in-place hash
table.)
Once you remove all primary bottlenecks (via pre-screening the hash
before the raw compare, a mask instead of a modulo, and any reasonably
sound design) the hash table overhead itself is just nothing. It all
comes down to the speed of the hash function and the speed of the
comparison function (and the copy/update functions, if yours happen to
be slow for some reason.) For the average match you pay for 1 hash
and 1 compare, with probability 1-epsilon. For the average miss you
pay for just the 1 hash, with probability 1-epsilon (I don't remember
the average number of probes, but it was less than 1.0, of course).
These observations are only true if you use pre-screening. With
in-place
hash tables (no chaining, but reprobing) you see an average of about
1.4 reprobes that basically cost nothing since they are just integer
compares inside a loop. With chaining, obviously it depends on how
well your index table size was chosen and how well your hash function
works -- little else matters. Once you realize this, you quickly see
that all this variation just isn't buying you anything real. The
*purpose* of a hash table is to be an improvement in *speed*. In the
face of these observations, this means our considerations are
dominated by micro-optimizations.
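For comparison, a hedged sketch of the in-place (open addressing)
lookup described above, combining the mask, the prime reprobe delta,
and the stored-hash pre-screen; all names are invented, strcmp() again
comes from <string.h>, and a stored hash of 0 is taken to mean "empty
slot" purely for brevity:

#define NSLOTS 1024u            /* table size: a power of 2 */
#define STEP   101u             /* odd prime reprobe delta */

struct slot {
    unsigned hash;              /* 0 means the slot is empty (sketch only) */
    char *key;
    char *value;
};

static char *oa_lookup(struct slot *table, const char *key, unsigned hash)
{
    unsigned i = hash & (NSLOTS - 1);
    unsigned n;
    for (n = 0; n < NSLOTS; n++) {
        if (table[i].hash == 0)
            return NULL;                     /* empty slot: a miss */
        if (table[i].hash == hash && strcmp(table[i].key, key) == 0)
            return table[i].value;           /* pre-screen, then strcmp */
        i = (i + STEP) & (NSLOTS - 1);       /* cheap integer reprobe */
    }
    return NULL;                             /* table full and not found */
}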
Remember that the theoretical problems Knuth ran into were from an age
when good hash functions just did not exist. It's not bad for 1969,
when he originally wrote that stuff, but what do you want? The world
(including Knuth himself) has moved on since then.
[...] Everything Paul
wrote is true of some problem domain or other; my only issue
is that he seems to be saying his suggestions are the cat's
meow in all situations, while I'm uttering the cautionary
"Bless you, it all depends!"
Yes, because these positions are of course compatible with the fact
that you are ignorant and I am not. Because, guess what, I *HAVE*
measured these things. The "scenarios" you are suggesting include the
OP's scenario (string->string hashing.)
And as is usual around here, of course, you are reading into things
that I have not said -- the OP's hashing scenario is very particular,
which you might have understood if you had read his code. So the
design-affecting micro-optimization techniques I suggest, besides
being applicable to many scenarios, are easily covered by the OP's
very narrow single scenario. So it's a very specific answer to the
very specific question he had. If you still think I seem to suggest
otherwise, I can't help you; go see a psychiatrist, or an English
teacher or something.