Hash Map API Preference

M

Michael B Allen

I would like to know your preferences regarding a C API for a hash
map. Specifically, when inserting a data item into a hash map there are
two special cases that require consideration.

1) If two keys generate the same hash value should a comparison function
be used to compare keys or should data items be compared? If data items
are compared it is not necessary to store the keys with each entry
which has an impact on question 2.

2) If two keys generate the same hash value and the comparison function
finds the entries to be equal, which of the following resolutions is
preferred?

A) Return an error and otherwise do nothing. The user will be required
to first remove the old item.

B) Provide void **replaced parameters to accept the displaced data item
(and key if they are being stored) and then insert the new data item
in place of the displaced data item.

int hashmap_put(struct hashmap *h, void *key, void *data, void **replaced);

OR

int hashmap_put(struct hashmap *h,
void *key, void *data,
void **replaced_key, void **replaced_data);

C) Delete the data item (and key if they are being stored) using
function pointers provided when the hash map was created and then
insert the new data item in place of the deleted data item.

typedef int (*del_fn)(void *context, void *object);

int hashmap_create(struct hashmap *h,
unsigned int size,
hash_fn hash,
compare_fn cmp,
del_fn key_del,
del_fn data_del,
void *context);

This code will be part of libmba.

Mike
 
E

Eric Sosman

Michael said:
I would like to know your preferences regarding a C API for a hash
map. Specifically, when inserting a data item into a hash map there are
two special cases that require consideration.

1) If two keys generate the same hash value should a comparison function
be used to compare keys or should data items be compared? If data items
are compared it is not necessary to store the keys with each entry
which has an impact on question 2.

I'm not entirely sure I understand what you're asking -- the
distinction between comparing keys and comparing "data items" is
not clear to me. If "key" is understood in the usual sense of "a
unique identifier," then the keys and only the keys should be
compared -- furthermore, the hash value should be computed from
the key alone, without reference to the associated data. If "data
item" means "the key and all its associated data," comparison seems
either pointless or absurd depending on the viewpoint.
2) If two keys generate the same hash value and the comparison function
finds the entries to be equal, which of the following resolutions is
preferred?

A) Return an error and otherwise do nothing. The user will be required
to first remove the old item.

B) Provide void **replaced parameters to accept the displaced data item
(and key if they are being stored) and then insert the new data item
in place of the displaced data item.
[...]
C) Delete the data item (and key if they are being stored) using
function pointers provided when the hash map was created and then
insert the new data item in place of the deleted data item.

All these strategies (and others, too) are reasonable, and
one could dream up applications in which each would be the
"obvious" preferred behavior. For a general-purpose package
I'd choose (A): the interface is "small," and the user can
build the other behaviors atop it (and other operations like
deletion) if desired.
This code will be part of libmba.

Business schools teach programming?
 
C

CBFalconer

Michael said:
I would like to know your preferences regarding a C API for a
hash map. Specifically, when inserting a data item into a hash
map there are two special cases that require consideration.
.... snip ...

int hashmap_create(struct hashmap *h,
unsigned int size,
hash_fn hash,
compare_fn cmp,
del_fn key_del,
del_fn data_del,
void *context);

You might take a look at the interface for hashlib, which was
partially the result of a similar discussion here some time ago.
hashlib is available at:

<http://cbfalconer.home.att.net/download/>
 
M

Michael B Allen

You might take a look at the interface for hashlib, which was partially
the result of a similar discussion here some time ago. hashlib is
available at:

<http://cbfalconer.home.att.net/download/>

What's interesting about this is it uses an entirely different method
from the one I was thinking of. Conceptually a map maps a key to a value
so I was thinking a separate key should be used with get and put like:

hashmap_insert(h, key, data)
hashmap_find(h, key)

But you don't use a key at all. Instead it looks like you just use an
opaque pointer and extract all information from it using a variety of
functions provided by the user. Presumably to lookup a data item it
would be necessary to use an additional structure to (e.g. if you were
just mapping names to objects you could not use the character string
directly with find).

And if two objects are equal it looks like you return a pointer to the
offending element without removing it? I suppose that is two votes for
case A.

Mike
 
M

Michael B Allen

I'm not entirely sure I understand what you're asking -- the
distinction between comparing keys and comparing "data items" is not
clear to me. If "key" is understood in the usual sense of "a unique
identifier," then the keys and only the keys should be compared --
furthermore, the hash value should be computed from the key alone,
without reference to the associated data. If "data item" means "the key
and all its associated data," comparison seems either pointless or
absurd depending on the viewpoint.

Consider the following where "foo" is the key and obj is the data item:

hashmap_put("foo", obj)
...
obj = hashmap_get("foo")

If the same hash value is generated for the key "foo" and "bar" (unlikely
but possible) then it is necessary to use a comparison function to
definitively determine wheather or not the keys are equal. My original
question considered comparing the data object instead. That was a foolish
mistake as it does not make any sense. Please disregard the question.
All these strategies (and others, too) are reasonable, and
one could dream up applications in which each would be the "obvious"
preferred behavior. For a general-purpose package I'd choose (A): the
interface is "small," and the user can build the other behaviors atop it
(and other operations like deletion) if desired.

Mmm, how does one implement deletion atop of the case A behavior?
Business schools teach programming?

Right. It's for creating useless spreadsheets and those stupid questions
asked in "business requirements" meetings.

Mike
 
E

Eric Sosman

Michael said:
Michael said:
[...]
2) If two keys generate the same hash value and the comparison function
finds the entries to be equal, which of the following resolutions is
preferred?

A) Return an error and otherwise do nothing. The user will be
required to first remove the old item.
All these strategies (and others, too) are reasonable, and
one could dream up applications in which each would be the "obvious"
preferred behavior. For a general-purpose package I'd choose (A): the
interface is "small," and the user can build the other behaviors atop it
(and other operations like deletion) if desired.

Mmm, how does one implement deletion atop of the case A behavior?

Now it's my turn to apologize for unclear wording.
I meant that behaviors B and C (snipped; they are variations
on "replace the existing stuff with the new stuff") can be
implemented atop A plus a deletion operation.

In my own hash package (I guess everybody writes one
eventually), the insertion function returns a pointer to
the item inserted or a pointer to the pre-existing item
with the same key (or NULL for other kinds of failure):

ptr = HTInsert(table_handle, &key_and_data);
if (ptr == &key_and_data)
// insertion succeeded
else if (ptr != NULL)
// collision; ptr points to existing item
else
// other failure, probably realloc()

This is essentially your case A.
 
C

CBFalconer

Michael said:
What's interesting about this is it uses an entirely different
method from the one I was thinking of. Conceptually a map maps
a key to a value so I was thinking a separate key should be used
with get and put like:

hashmap_insert(h, key, data)
hashmap_find(h, key)

But you don't use a key at all. Instead it looks like you just
use an opaque pointer and extract all information from it using
a variety of functions provided by the user. Presumably to
lookup a data item it would be necessary to use an additional
structure to (e.g. if you were just mapping names to objects
you could not use the character string directly with find).

And if two objects are equal it looks like you return a pointer
to the offending element without removing it? I suppose that is
two votes for case A.

I don't know what your case A etc. is any more, but the
fundamental thing the system does is answer the question "is this
item in the data base". If so, it returns a pointer to the stored
data (which is not the same as the queried data). The user may
now do things with an auxiliary field, such as a count. However
the user does NOT own the data pointed at, that is under the
control of the hashlib data base. He MUST NOT alter the key
value.

Another basic operation is "install this item". The installation
process makes a copy, via provided routines, and this is the point
at which a count field would be initialized.

The only way to remove the storage from the control of the data
base (prior to closing it) is by the delete operation. This again
returns a pointer to data, but that data is no longer under the
control of the data base, and storage disposition etc. is the
users responsibility. It cannot be re-stored, since install makes
a copy.

The markov and wdfreq usage examples show some fairly complex
manipulations.

At any rate, the complete usage is intended to be understandable
via the hashlib.h header file alone.
 
E

Eric Sosman

CBFalconer said:
I don't know what your case A etc. is any more, but the
fundamental thing the system does is answer the question "is this
item in the data base". If so, it returns a pointer to the stored
data (which is not the same as the queried data). The user may
now do things with an auxiliary field, such as a count. However
the user does NOT own the data pointed at, that is under the
control of the hashlib data base. He MUST NOT alter the key
value.

It's worth pointing out that the question of who owns
the stored data is a design choice, not an immutable tenet
of hash tables. ("Don't change the key" *is* such a tenet,
though.) CBF chose to have the table own the stored data,
cloning the user-supplied datum (via user-supplied functions)
at need. In my own package I chose oppositely: the caller
always owns the data, and the table merely keeps track of
which items are currently "in."

CBF's design is perhaps more reliable than mine, since
the user is less likely to commit the accident of deallocating
an object while it's still in the table. On the other hand,
my design allows a single object to exist in an arbitrary
number of hash tables simultaneously: I can search the same
set of data on both telephone number and postal code. (CBF
can do this, too, but only by using an intermediate "reference"
object to point to the real data, and allowing the references
to be copied into multiple tables.)

There is no One Single Best Way to design an interface.
CBF's design beats mine in some respects; I feel mine beats
his in others; trade-offs are always present. In the end,
an interface design says as much about the designer's personal
preferences as it does about the problem to be solved.
 
C

CBFalconer

Eric said:
It's worth pointing out that the question of who owns
the stored data is a design choice, not an immutable tenet
of hash tables. ("Don't change the key" *is* such a tenet,
though.) CBF chose to have the table own the stored data,
cloning the user-supplied datum (via user-supplied functions)
at need. In my own package I chose oppositely: the caller
always owns the data, and the table merely keeps track of
which items are currently "in."

CBF's design is perhaps more reliable than mine, since
the user is less likely to commit the accident of deallocating
an object while it's still in the table. On the other hand,
my design allows a single object to exist in an arbitrary
number of hash tables simultaneously: I can search the same
set of data on both telephone number and postal code. (CBF
can do this, too, but only by using an intermediate "reference"
object to point to the real data, and allowing the references
to be copied into multiple tables.)

There is no One Single Best Way to design an interface.
CBF's design beats mine in some respects; I feel mine beats
his in others; trade-offs are always present. In the end,
an interface design says as much about the designer's personal
preferences as it does about the problem to be solved.

A nice exposition of why choices are made :) Mine were made
largely on the basis of 'this way will minimize my usage errors'.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,995
Messages
2,570,233
Members
46,820
Latest member
GilbertoA5

Latest Threads

Top