Michael Wojcik
My apologies to Eric; I have snipped absolutely everything.
Now Rainmaker. Let's see... 2 terabytes is 2.0e12 bytes, right? 100 bytes
is 1.0e2. Dividing file size by line length gives 2.0e10, I think.
That's 20 giga-lines, right?
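
The arithmetic above can be checked with a short C sketch. The function
name estimate_lines is hypothetical, and the fixed 100-byte line length
is only the figure assumed in this thread, not something taken from the
OP's actual data.

```c
#include <assert.h>

/* Hypothetical helper: how many whole fixed-length lines fit in a
   file of file_bytes bytes. unsigned long long (C99) is guaranteed
   to hold at least 64 bits, which is plenty for 2.0e12. */
unsigned long long estimate_lines(unsigned long long file_bytes,
                                  unsigned long long line_bytes)
{
    assert(line_bytes != 0);         /* a zero-length line makes no sense here */
    return file_bytes / line_bytes;  /* integer division truncates */
}
```

Calling estimate_lines(2000000000000ULL, 100ULL) gives 20000000000,
i.e. 20 giga-lines. If "2 terabytes" is instead read as 2 * 2^40 bytes,
estimate_lines(2199023255552ULL, 100ULL) gives 21990232555, so the
order of magnitude does not change.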
We're trying to get a grip on what you have and what you are trying to
achieve. Your data set seems overly large. There aren't 20 giga-lines in
all the books in the Library of Congress.
This problem description looks a great deal like a "rainbow table" -
an offline dictionary of the hash values for various strings, used
for cracking passwords. The OP's use of "Rainmaker" as a nickname
also suggests that.
There's a bunch of literature on constructing and using rainbow
tables, and I'd suggest that someone at the "what sort should I use?"
stage is not going to beat the published approaches. In other words,
some research seems to be the appropriate next step, and I don't mean
asking OT questions on comp.lang.c.