Hi, here is my in-memory hash.
#include <stdio.h>
#include <string.h>
#include <sys/fcntl.h>
#define HASHTSIZE 20011 // A prime number
#define MAXWORDLEN 30 //maximum word length from text file
struct hash {
int count;
char word[MAXWORDLEN];
} hash; // the single reusable hash struct
That's no so bad, you aren't storing pointers.
int NUMCOLLISIONS; //# of collisions
int first_hash(char *); //first hash function
int second_hash(char *); //second hash function
int main(int, char **);
int get_word(char *);
int main(int argc, char *argv[]) {
int h, inc, fd, n;
unlink("hash_file"); // remove the old file if present
fd = open("hash_file", O_CREAT|O_RDWR|O_EXCL, 0666);
For later user, you will need to add in a call such as:
init_hashtable(fd);
while(get_word(word_buff) > 0) {
inc = second_hash(word_buff); // second hash
h = first_hash(word_buff); // first hash
#ifdef PREVIOUS CODE FROM IN-MEMORY HASH / COLLISION RESOLUTION
STRATEGY
while(strcmp(word_buff, hashtable[h].word)) {
I am going to postulate here that the difficulty is that
hashtable[] might not fit into memory, and you want to pull the
appropriate entry from disk.
if(hashtable[h].count == 0) {
strcpy(hashtable[h].word, word_buff);
break;
}
NUMCOLLISIONS++; // wrong word is a collision
}
hashtable[h].count++; // counts both old & new words
#endif
Okay, what you can do is something like this:
..... h = first_hash(word_buff); // no change, just establishing context
while ( strcmp(word_buff, (hashentry = get_hashtable_entry(h))->word ) ) {
if (hashentry->count == 0) {
strncpy( hashentry->word, word_buff, MAXWORDLEN );
mark_hashtable_entry_dirty(h);
break;
}
NUMMCOLLISSIONS++;
release_hashtable_entry(h);
/* ... something here that updates h, because the code you show
infinite loops if the count it finds is non-zero */
}
/* your previous comment was that the increment "counts both
old & new words", but as you do not have multiple entries
for the same hash bucket, your hashing loop is going to have
to keep running until you find an entry with count of 0;
you cannot drop out of the while loop as long as the count is
non-zero. That being the case, it would make more sense to
increment the count where you do the strcpy... except increment
is the wrong term, as it is always (by induction) only going
to be changing a count of 0 to a count of 1, never (say)
a count of 19 to a count of 20. And if that's the case, that
it isn't really a count but just an in-use flag, then you could
instead just test the first character of the stored word: if
it is '\0' then the entry is unused.
*/
hashentry->count++;
mark_hashtable_entry_dirty(h);
release_hashtable_entry(h);
/* ... any other trailing stuff for the get_word() while loop goes here */
}
flush_hashtable();
close(fd);
This introduces the functions init_hashtable(), flush_hashtable(),
get_hashtable_entry(), mark_hashtable_entry_dirty(), and
release_hashtable_entry()
init_hashtable() would record the fd (or FILE*) for future use,
and would initialize an array of entries, each large enough to
hold a disk block, together with an array of disk block numbers and
another of markers as to whether the blocks are "clean" or "dirty".
When get_hashtable_entry() is called, the passed entry number is
converted (by knowing the structure size) into a disk block number,
with an integral number of entries packed per block. The cache
table is examined, and if that block is already present, then
mark the block as being "in use" and return a pointer to that particular
entry within the hash block.
If the block is not already present, then see if there is an
available slot in the cache array. If so, then read that disk block
into the slot, perform the appropriate overhead management, mark the
slot in use, return the pointer to the entry.
If no slots are available, find a slot that is not marked as being
in-use and which is not marked as dirty; clear the slot (without
any disk I/O), and do the read-the-block thingy. If there were
no slots which were not marked as dirty, pick one of the slots
that is not in use but is dirty, and write the entry block to disk,
then clear the slot and do the read-the-block thingy.
When release_hashtable_entry() is called, it needs to undo the
"in-use" marker. In the simple loop logic above, there can only be
a single in-use entry, so only one in-use marker per cache slot would
be needed, but in a slightly more complex case, you would want something
like a bit vector of entry numbers within the disk block, in case
the same disk block was being used by two different active entries.
mark_hashtable_entry_dirty() is a signal that you have made a
change to the contents of an entry and that the new version will
need to be written to disk eventually. It turns out that you only
need one "dirty" flag per disk block, not one per entry within the
block like you do for "in-use". (Actually there are some more
sophisticated disk storage structures in which it would make sense
to use one dirty flag per table entry, but those have tradeoffs
and make the reading-in of entries more complex. But if you have
a "transaction processing" setup...)
release_hashtable_entry() -could- trigger a write to disk if it was the
last active user of that cache slot and the disk slot is marked dirty,
but -usually- you want to postpone disk writes as long as practical,
hoping that other entries in the same disk block will get written to
later, so that when you finally need to release the cache slot for
use with an incoming block, that you might have effectively written
several changes with a single I/O. The immediate-write strategy will
often end up with one I/O per changed entry.
flush_hashtable() runs through and writes out all the dirty disk blocks,
and clears the dirty flags. It would not, though, clear the in-use
flags: that way it is safe to flush anytime (e.g., at periodic
intervals.)