My work requires a lot of index lookups across a large number of files every day. Recently we have been receiving a file with one document per line, along with all of that document's attributes; this file has around 400,000 entries. I then receive another file containing just file names, and I am asked to look up each of those names in the 400,000-entry list. There are about 5,000 names in that file.

I wrote a quick script to meet my needs: I read both files into memory and run grep(search_item, content_file) for each search term. It works well enough, except that it takes about 25 minutes for the 5,000 entries. I can't use a hash implementation here, so is there a way I can make this search faster? Below is a sample search term, the matching index line, and the entire script. I know it's rough, but I wrote it in a hurry and I'd like to refine it now and make it faster.
Search File Term:
885_Addm Un Lse 0867.pdf
Large File to be searched, its matching index:
"885_Addm Un Lse 0867.pdf","885","ELM 111 N BOBBY AVE","Addm Un Lse 0867","Addm Un Lse 0867.pdf","Elmhurst","651","885","885_Addm Un Lse 0867"
Script:
#!/usr/local/bin/perl
use strict;
use warnings;

open(my $rep_fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
my @text_rep = <$rep_fh>;            # file to search
close $rep_fh;

open(my $search_fh, '<', $ARGV[1]) or die "Cannot open $ARGV[1]: $!";
my @text_search = <$search_fh>;      # file with entries to use in the search
close $search_fh;

print "Size of content: " . scalar(@text_rep) . "\n";
print "Searching for " . scalar(@text_search) . " instances\n";

my @final;
open(my $out, '>', 'did_not_find.txt') or die "Cannot open did_not_find.txt: $!";  # log queries grep can't find
foreach my $query (@text_search) {
    chomp($query);
    # match the file name literally; \Q...\E escapes ( ) $ and other metacharacters
    my @qu = grep { /\Q$query\E/ } @text_rep;
    if (@qu) {
        push @final, $qu[0];         # found, so keep the matching line
    }
    else {
        print $out "$query\n";       # error logging
    }
}
close $out;

open($out, '>', 'lines_that_pulled.txt') or die "Cannot open lines_that_pulled.txt: $!";
print $out @final;
close $out;
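For what it's worth, here is the kind of single-pass rewrite I have been sketching out but have not benchmarked yet: instead of running grep once per query, it escapes all 5,000 terms with quotemeta, joins them into one alternation regex, and scans the 400,000-line file a single time. It does keep a small hash, but only to remember which terms matched (not as an index of the big file), and the argument order and output file names are the same as in my script above.

#!/usr/local/bin/perl
use strict;
use warnings;

# Read the ~5,000 search terms (second argument, same as before).
open(my $search_fh, '<', $ARGV[1]) or die "Cannot open $ARGV[1]: $!";
chomp(my @text_search = <$search_fh>);
close $search_fh;

# Escape each literal term and join them into one alternation.
# Recent perls typically compile a long alternation of plain strings
# into a trie, so matching stays fast even with thousands of terms.
my $pattern = join '|', map { quotemeta } @text_search;
my $re = qr/($pattern)/;

my %seen;   # only records which terms matched; not an index of the big file
open(my $rep_fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
open(my $hits, '>', 'lines_that_pulled.txt') or die "Cannot open lines_that_pulled.txt: $!";
while (my $line = <$rep_fh>) {      # single pass over the 400,000 lines
    if ($line =~ $re) {
        print $hits $line;
        $seen{$1} = 1;
    }
}
close $rep_fh;
close $hits;

# Anything never seen goes into the miss log, as in the original script.
open(my $miss, '>', 'did_not_find.txt') or die "Cannot open did_not_find.txt: $!";
print $miss "$_\n" for grep { !$seen{$_} } @text_search;
close $miss;

One behaviour difference: if a term appears on more than one index line, this version keeps every matching line rather than just the first one.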