hashing function

Dan Jones · Nov 28, 2004

I'm working with some large (several hundred megs) flat database files. I
need to examine the records for duplicates. Obviously, I don't want to
store several hundred megs of data in a hash. What I'd like to do is to
read each record, generate a hash value for the record, store that hash
value and an index key rather than storing the entire record, and look for
collisions in the hash value.

Perl obviously uses an internal hashing function to create it's hash
variables. Is it possible to access this function or to get the actual
hash value it produces? If not, any pointers to a module or information on
writing a hashing function in Perl would be appreciated. Hashing functions
usually involve low level bit twiddling. While it's probably possible to
do this directly in Perl (what isn't?), I don't know enough Perl to do it.
Right now, I'm looking at using a C function, then having to integrate that
with Perl. I'd really prefer to keep this a pure Perl script if I can.

I've been through the Camel, the Panther, and the Ram without finding
anything relevant. The Cookbook does mention that using hashes to search
for dupes is memory intensive if you have large records but doesn't provide
any alternatives that I could find. Searching for information on hashing
functions and Perl on the web has proven to be an exercise in futility due
to the naming collision with the variable type.

Thanks in advance for any assistance.

Juha Laiho · Nov 28, 2004

Dan Jones said:
I'm working with some large (several hundred megs) flat database files. I
need to examine the records for duplicates. Obviously, I don't want to
store several hundred megs of data in a hash. ....
If not, any pointers to a module or information on writing a hashing
function in Perl would be appreciated.

MD5 ans SHA1 are two commonly used hashing functions, and implementations
for both are available as perl modules.

Xavier Noria · Nov 28, 2004

Storing the hash code and an index is what the $hash{$record}++ idiom
does, the post seem to assume that the record itself is being stored in
the hash. Just to avoid a misconception, that's not the case.

Did you already try that simple solution and it wasn't suitable?

-- fxn

Xavier Noria · Nov 28, 2004

I don't know what was I thinking about, that's completely wrong. I am
very sorry.

-- fxn

xhoster · Nov 28, 2004

Dan Jones said:
I'm working with some large (several hundred megs) flat database files.
I need to examine the records for duplicates. Obviously, I don't want to
store several hundred megs of data in a hash.

I don't consider that to be at all obvious. I often want to and often do
store several hundred meg of data in a hash.

Is there no subset of the record which would be suitable for identifying
duplicates?

What I'd like to do is to
read each record, generate a hash value for the record, store that hash
value and an index key

You would potentially need a list of index keys, not just one index key.
Otherwise what would you do when different records collide on the
same hash values?

rather than storing the entire record, and look
for collisions in the hash value.

And then use the index values to go back and look up the whole records
and compare them properly? Wouldn't it be easier to use system sort
routines or a proper database server?

Perl obviously uses an internal hashing function to create it's hash
variables. Is it possible to access this function or to get the actual
hash value it produces?

This seems to work. No gaurantees:

#!/usr/bin/perl -wl

use strict;
use Inline 'C' ;

foreach (1..1_000_000) {
print calc($_);
};

__DATA__
__C__

size_t calc(SV* sv) {
char * c;
size_t size;
size_t hash;

c=SvPV(sv,size);
PERL_HASH(hash,c,size);
return hash;
};

If not, any pointers to a module or information
on writing a hashing function in Perl would be appreciated. Hashing
functions usually involve low level bit twiddling. While it's probably
possible to do this directly in Perl (what isn't?), I don't know enough
Perl to do it. Right now, I'm looking at using a C function, then having
to integrate that with Perl. I'd really prefer to keep this a pure Perl
script if I can.

The C code behind PERL_HASH is given in PERLDOC PERLGUTS. You could easily
translate that into Perl.

Others have mentioned modules for hashing functions which are intended
more for security than efficiency.

Xho

Mike Deskevich · Nov 28, 2004

rather than trying to store the data in the hash (which you said you
didn't want to do), why not just use the hash to keep track of how many
records there are of each key. something like this

%hash={} #or however you make an empty hash, i don't use them often so
i don't know off the top of my head how to do it.
while (<DATAFILE>)
{
$key=get_key_from_record ($_); #or however you get the key from the
record just read in
$num=(%hash{$key}++);
if ($num>1)
{
#record is a dupe, do what you want with it
}
}

John W. Krahn · Nov 29, 2004

Dan said:
I'm working with some large (several hundred megs) flat database files. I
need to examine the records for duplicates. Obviously, I don't want to
store several hundred megs of data in a hash. What I'd like to do is to
read each record, generate a hash value for the record, store that hash
value and an index key rather than storing the entire record, and look for
collisions in the hash value.

Perl obviously uses an internal hashing function to create it's hash
variables. Is it possible to access this function or to get the actual
hash value it produces? If not, any pointers to a module or information on
writing a hashing function in Perl would be appreciated. Hashing functions
usually involve low level bit twiddling. While it's probably possible to
do this directly in Perl (what isn't?), I don't know enough Perl to do it.
Right now, I'm looking at using a C function, then having to integrate that
with Perl. I'd really prefer to keep this a pure Perl script if I can.

I've been through the Camel, the Panther, and the Ram without finding
anything relevant. The Cookbook does mention that using hashes to search
for dupes is memory intensive if you have large records but doesn't provide
any alternatives that I could find. Searching for information on hashing
functions and Perl on the web has proven to be an exercise in futility due
to the naming collision with the variable type.

Here is an example using Digest::MD5 that may suit your needs:

#!/usr/bin/perl
use warnings;
use strict;

use Digest::MD5 'md5';

my $file = 'database.file';

my ( $loc, %data ) = 0;

@ARGV = $file;

while ( <> ) {
push @{ $data{ md5 $_ } }, $loc;
$loc = tell ARGV;
}

close ARGV;

# remove all duplicates except the first one
my @dups = sort { $a <=> $b } map splice( @$_, 1 ), values %data;

# or alternately remove all except the last one
# my @dups = sort { $a <=> $b } map splice( @$_, 0, $#$_ ), values %data;

( $loc, $^I, @ARGV ) = ( 0, '.bck', $file );

while ( <> ) {
if ( @dups and $loc == $dups[ 0 ] ) {
shift @dups;
next;
}
print;
$loc = tell ARGV;
}

__END__

John

hashing strings to integers for sqlite3 keys	22	May 22, 2014
Registration Form	7	Aug 30, 2023
Hashing long strings...	8	Aug 18, 2010
Mailbox-style directory hashing	7	Oct 31, 2006
Universal hashing	5	Apr 7, 2008
hashing and collision detection	0	Apr 7, 2009
Which hashing function to use?	11	Nov 16, 2007
hashing an array - howto	10	Sep 5, 2008

hashing function

Dan Jones

Juha Laiho

Xavier Noria

Xavier Noria

xhoster

Mike Deskevich

John W. Krahn

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads