Dan Jones
I'm working with some large (several hundred megs) flat database files. I
need to examine the records for duplicates. Obviously, I don't want to
store several hundred megs of data in a hash. What I'd like to do is to
read each record, generate a hash value for the record, store that hash
value and an index key rather than storing the entire record, and look for
collisions in the hash value.
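
To make the scheme concrete, here is a rough sketch of what I have in mind. It is only an illustration under assumptions: it uses MD5 from the Digest::MD5 module as a stand-in for the hashing function, treats each line as one record, and uses the line number as the index key. The name find_duplicates is made up for this example.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);    # assumption: MD5 stands in for the hashing function

# Hypothetical helper: reads one record per line from a filehandle and
# returns a list of [first_line_no, duplicate_line_no] pairs.
sub find_duplicates {
    my ($fh) = @_;
    my %seen;    # 16-byte digest => line number of the first occurrence
    my @dups;
    while (my $record = <$fh>) {
        chomp $record;
        my $digest = md5($record);    # store the digest, not the record
        if (exists $seen{$digest}) {
            # Same digest seen before: a candidate duplicate.
            push @dups, [ $seen{$digest}, $. ];
        }
        else {
            $seen{$digest} = $.;
        }
    }
    return @dups;
}
```

Called on an open filehandle, this would return pairs of line numbers to inspect. A matching digest only marks a candidate, so the two records would still need to be reread and compared to rule out a genuine collision.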
Perl obviously uses an internal hashing function to create its hash
variables. Is it possible to access this function or to get the actual
hash value it produces? If not, any pointers to a module or information on
writing a hashing function in Perl would be appreciated. Hashing functions
usually involve low-level bit twiddling. While it's probably possible to
do this directly in Perl (what isn't?), I don't know enough Perl to do it.
Right now, I'm looking at using a C function, then having to integrate that
with Perl. I'd really prefer to keep this a pure Perl script if I can.
I've been through the Camel, the Panther, and the Ram without finding
anything relevant. The Cookbook does mention that using hashes to search
for dupes is memory intensive if you have large records but doesn't provide
any alternatives that I could find. Searching for information on hashing
functions and Perl on the web has proven to be an exercise in futility due
to the naming collision with the variable type.
Thanks in advance for any assistance.