Hashing to Disk

F

Foodbank

Hi,

Has anyone ever hashed a text file to disk before? I'm trying to
convert an in-memory hash to disk hash and I can't find any relevant
information to help me. Also, I'd like to use lseek, read, write, and
fd if anyone is familiar with them. Any info would be greatly helpful.

Thanks,
James
 
K

Keith Thompson

Foodbank said:
Has anyone ever hashed a text file to disk before? I'm trying to
convert an in-memory hash to disk hash and I can't find any relevant
information to help me. Also, I'd like to use lseek, read, write, and
fd if anyone is familiar with them. Any info would be greatly helpful.

It's not at all clear what you mean by this. Are you talking about
some kind of hash table? If so, please be more specific.

Note that C has no built-in support for hash tables. I suspect you
have more of an algorithm question than a C question. If you can show
us some code, we might be able to help you.

Note also that none of lseek, read, and write exist in standard C
(though they're similar to fseek, fread, and fwrite, which do). "fd"
is probably an abbreviation of "file descriptor", which also does not
exist in standard C (the closest equivalent is the type FILE*). You
might be looking for a Unix-specific newsgroup like
comp.unix.programmer.
 
R

Randy Howard

Keith Thompson wrote
(in article said:
It's not at all clear what you mean by this. Are you talking about
some kind of hash table? If so, please be more specific.

It sounds like he might be trying to save a hash table in memory
out to disk for a subsequent execution, but just guessing here.
If so, insert standard warnings about byte order and portability
for objects larger than bytes.
 
F

Foodbank

Yes, this is an algorithm based question. Basically, I'm trying to
save the hash table to file instead of dynamically in memory. So, for
example, I could access the hash table residing in the file at a later
date. I hope that helps.

Thanks,
James
 
K

Keith Thompson

Foodbank said:
Yes, this is an algorithm based question. Basically, I'm trying to
save the hash table to file instead of dynamically in memory. So, for
example, I could access the hash table residing in the file at a later
date. I hope that helps.

It doesn't help much.

First, read this: <http://cfaj.freeshell.org/google/>.

There are a number of ways to represent a hash table. If the
in-memory representation uses pointers, it's going to be difficult to
represent them meaningfully in an external file; probably you can
translate them to some sort of numeric index. If you use binary
files, they won't be readable on different platforms (varying byte
order, varying sizes of fundamental data types, etc.) unless you go to
some extra effort to canonicalize everything. You might consider a
text-based representation.

If you can show us the C code that implements the hash table,
particularly any type declarations, we can probably help you come up
with a disk-based representation.

If your question isn't specific to C, you might try comp.programming.
 
F

Foodbank

Hi, here is my in-memory hash. Also, I've added the sections to create
and close the file.
As a note, I've removed some of the sections that are irrelevent to the
sections I need help with to save space and for ease of reading.

Thanks,
James

Code:
#include <stdio.h>
#include <string.h>
#include <sys/fcntl.h>
#define HASHTSIZE  20011    // A prime number
#define MAXWORDLEN 30     //maximum word length from text file

struct hash {
     int     count;
     char    word[MAXWORDLEN];
} hash;              // the single reusable hash struct

int NUMCOLLISIONS;        //# of collisions
int first_hash(char *);          //first hash function
int second_hash(char *);     //second hash function
int main(int, char **);
int get_word(char *);

char word_buff[100];

int main(int argc, char *argv[]) {
     int h, inc, fd, n;
     unlink("hash_file");          // remove the old file if present
     fd = open("hash_file", O_CREAT|O_RDWR|O_EXCL, 0666);
     while(get_word(word_buff) > 0) {
          inc = second_hash(word_buff);    // second hash
          h = first_hash(word_buff);     // first hash

#ifdef     PREVIOUS CODE FROM IN-MEMORY HASH / COLLISION RESOLUTION
STRATEGY
          while(strcmp(word_buff, hashtable[h].word)) {
               if(hashtable[h].count == 0) {
                    strcpy(hashtable[h].word, word_buff);
                    break;
               }
               NUMCOLLISIONS++; // wrong word is a collision
                   }
          hashtable[h].count++;      // counts both old & new words
#endif
     
               //CODE HELP HERE FOR DISK 

}
 
W

Walter Roberson

Hi, here is my in-memory hash.
#include <stdio.h>
#include <string.h>
#include <sys/fcntl.h>
#define HASHTSIZE 20011 // A prime number
#define MAXWORDLEN 30 //maximum word length from text file
struct hash {
int count;
char word[MAXWORDLEN];
} hash; // the single reusable hash struct

That's no so bad, you aren't storing pointers.
int NUMCOLLISIONS; //# of collisions
int first_hash(char *); //first hash function
int second_hash(char *); //second hash function
int main(int, char **);
int get_word(char *);
char word_buff[100];
int main(int argc, char *argv[]) {
int h, inc, fd, n;
unlink("hash_file"); // remove the old file if present
fd = open("hash_file", O_CREAT|O_RDWR|O_EXCL, 0666);

For later user, you will need to add in a call such as:

init_hashtable(fd);
while(get_word(word_buff) > 0) {
inc = second_hash(word_buff); // second hash
h = first_hash(word_buff); // first hash

#ifdef PREVIOUS CODE FROM IN-MEMORY HASH / COLLISION RESOLUTION
STRATEGY
while(strcmp(word_buff, hashtable[h].word)) {

I am going to postulate here that the difficulty is that
hashtable[] might not fit into memory, and you want to pull the
appropriate entry from disk.
if(hashtable[h].count == 0) {
strcpy(hashtable[h].word, word_buff);
break;
}
NUMCOLLISIONS++; // wrong word is a collision
}
hashtable[h].count++; // counts both old & new words
#endif

Okay, what you can do is something like this:

..... h = first_hash(word_buff); // no change, just establishing context

while ( strcmp(word_buff, (hashentry = get_hashtable_entry(h))->word ) ) {
if (hashentry->count == 0) {
strncpy( hashentry->word, word_buff, MAXWORDLEN );
mark_hashtable_entry_dirty(h);
break;
}
NUMMCOLLISSIONS++;

release_hashtable_entry(h);

/* ... something here that updates h, because the code you show
infinite loops if the count it finds is non-zero */
}

/* your previous comment was that the increment "counts both
old & new words", but as you do not have multiple entries
for the same hash bucket, your hashing loop is going to have
to keep running until you find an entry with count of 0;
you cannot drop out of the while loop as long as the count is
non-zero. That being the case, it would make more sense to
increment the count where you do the strcpy... except increment
is the wrong term, as it is always (by induction) only going
to be changing a count of 0 to a count of 1, never (say)
a count of 19 to a count of 20. And if that's the case, that
it isn't really a count but just an in-use flag, then you could
instead just test the first character of the stored word: if
it is '\0' then the entry is unused.
*/

hashentry->count++;
mark_hashtable_entry_dirty(h);
release_hashtable_entry(h);

/* ... any other trailing stuff for the get_word() while loop goes here */
}

flush_hashtable();

close(fd);



This introduces the functions init_hashtable(), flush_hashtable(),
get_hashtable_entry(), mark_hashtable_entry_dirty(), and
release_hashtable_entry()

init_hashtable() would record the fd (or FILE*) for future use,
and would initialize an array of entries, each large enough to
hold a disk block, together with an array of disk block numbers and
another of markers as to whether the blocks are "clean" or "dirty".

When get_hashtable_entry() is called, the passed entry number is
converted (by knowing the structure size) into a disk block number,
with an integral number of entries packed per block. The cache
table is examined, and if that block is already present, then
mark the block as being "in use" and return a pointer to that particular
entry within the hash block.

If the block is not already present, then see if there is an
available slot in the cache array. If so, then read that disk block
into the slot, perform the appropriate overhead management, mark the
slot in use, return the pointer to the entry.

If no slots are available, find a slot that is not marked as being
in-use and which is not marked as dirty; clear the slot (without
any disk I/O), and do the read-the-block thingy. If there were
no slots which were not marked as dirty, pick one of the slots
that is not in use but is dirty, and write the entry block to disk,
then clear the slot and do the read-the-block thingy.

When release_hashtable_entry() is called, it needs to undo the
"in-use" marker. In the simple loop logic above, there can only be
a single in-use entry, so only one in-use marker per cache slot would
be needed, but in a slightly more complex case, you would want something
like a bit vector of entry numbers within the disk block, in case
the same disk block was being used by two different active entries.

mark_hashtable_entry_dirty() is a signal that you have made a
change to the contents of an entry and that the new version will
need to be written to disk eventually. It turns out that you only
need one "dirty" flag per disk block, not one per entry within the
block like you do for "in-use". (Actually there are some more
sophisticated disk storage structures in which it would make sense
to use one dirty flag per table entry, but those have tradeoffs
and make the reading-in of entries more complex. But if you have
a "transaction processing" setup...)

release_hashtable_entry() -could- trigger a write to disk if it was the
last active user of that cache slot and the disk slot is marked dirty,
but -usually- you want to postpone disk writes as long as practical,
hoping that other entries in the same disk block will get written to
later, so that when you finally need to release the cache slot for
use with an incoming block, that you might have effectively written
several changes with a single I/O. The immediate-write strategy will
often end up with one I/O per changed entry.

flush_hashtable() runs through and writes out all the dirty disk blocks,
and clears the dirty flags. It would not, though, clear the in-use
flags: that way it is safe to flush anytime (e.g., at periodic
intervals.)
 
F

Foodbank

Hi and thank you for your help. However, I was wondering if there was
a way to do this using lseek, read, and write? I'm trying to get a
grasp on these commands if anyone knows anything about them.

Thanks,
James
 
W

Walter Roberson

Hi and thank you for your help. However, I was wondering if there was
a way to do this using lseek, read, and write? I'm trying to get a
grasp on these commands if anyone knows anything about them.

"lseek", "read" and "write" cannot be used to "do" an unspecified
person's help for an unspecified task.

Please retry your question, but this time with appropriate quoting
and citations, so that we know what you are talking about. Roughly
every fourth posting to this newsgroup contains instructions on
how to use the googlegroups interface to do quoting.

I can predict part of the answer already, though:

lseek(), read() and write() are not part of standard C; if you
want to use them, you are best off heading over to comp.unix.programming .
The standard C equivilents are fseek(), fread() and fwrite(), and
we can help you with those here.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,175
Messages
2,570,942
Members
47,476
Latest member
blackwatermelon

Latest Threads

Top