optimizing text file searches

bivity

My work requires a lot of index lookups across large numbers of files
daily. Recently, we have been receiving files with one document per
line, along with all of its attributes. Such a file has around 400,000
entries. I then receive another file containing just file names, and I
am told to look for each of those files in the 400,000-entry list.
There are about 5,000 names in that file.

I just wrote a quick script to meet my needs, where I read in both
files and grep(search_item, content_file). It works pretty well,
except that it takes about 25 minutes for 5,000 entries. I can't use a
hash implementation here; is there a way I can make this search faster?

Below is a sample search term, what the matching index line looks
like, and the entire script. I know it's rough, but I wrote it in a
hurry and I would like to refine it now and make it faster.

Search File Term:
885_Addm Un Lse 0867.pdf

Large File to be searched, its matching index:
"885_Addm Un Lse 0867.pdf","885","ELM 111 N BOBBY AVE","Addm Un Lse
0867","Addm Un Lse 0867.pdf","Elmhurst","651","885","885_Addm Un Lse
0867"

script:

#!/usr/local/bin/perl
open(FILE, $ARGV[0]);
my @text_rep = <FILE>;          # file to search
close FILE;
open(FILE, $ARGV[1]);
my @text_search = <FILE>;       # file with entries to use in search
close FILE;
print "Size of content:" . @text_rep . "\n";
print "Searching for" . @text_search . "instances\n";
open(OUT, "+>did_not_find.txt");    # in case grep can't find the value, log it
foreach my $query (@text_search)
{
    chomp($query);
    $query =~ s/\(/\./;
    $query =~ s/\)/\./;
    $query =~ s/\$/\./;
    my @qu = grep(/$query/, @text_rep);
    if ($qu[0] eq '') {
        print OUT $query . "\n";    # error logging
    }
    else {
        push(@final, $qu[0]);       # found, so place in array
    }
}
close OUT;
open(OUT, "+>lines_that_pulled.txt");
print OUT @final;
close OUT;
 
Mirco Wahab

bivity said:
Search File Term:
885_Addm Un Lse 0867.pdf

That means you run each of the 4,000-5,000 searches
(lines exactly like the one above) against:
Large File to be searched, its matching index:
"885_Addm Un Lse 0867.pdf","885","ELM 111 N BOBBY AVE","Addm Un Lse
0867","Addm Un Lse 0867.pdf","Elmhurst","651","885","885_Addm Un Lse
0867"

Is this (above) *one line*, e.g.:

"885_Addm Un Lse 0867.pdf","885","ELM 111 N BOBBY AVE","Addm Un Lse 0867","Addm Un Lse 0867.pdf","Elmhurst","651","885","885_Addm Un Lse 0867"

delimited by '\n' from the following line?

Regards

M.
 
bivity

Mirco Wahab said:
That means you run each of the 4,000-5,000 searches
(lines exactly like the one above) against:

Is this (above) *one line*, e.g.:

"885_Addm Un Lse 0867.pdf","885","ELM 111 N BOBBY AVE","Addm Un Lse 0867","Addm Un Lse 0867.pdf","Elmhurst","651","885","885_Addm Un Lse 0867"

delimited by '\n' from the following line?

Regards

M.

Yes, I know, I am doing a linear search: O(N*n),
with N being the number of lines in the input file
and n being the number of lines in the file to search.
That isn't efficient when we are talking about 400K and 5K items. It
was just the quick and dirty way I knew, and I would like to see if
there is a better way to do it.

Each line in both files contains only one entry, and yes, each line is
delimited by \n. The environment is Solaris (SunOS 5.8) with Perl 5.8,
and I can't use any additional libraries either.

Thanks for the help in advance.
 
Brian Helterline

bivity said:
My work requires a lot of index lookups across large numbers of files
daily. Recently, we have been receiving files with one document per
line, along with all of its attributes. Such a file has around 400,000
entries. I then receive another file containing just file names, and I
am told to look for each of those files in the 400,000-entry list.
There are about 5,000 names in that file.

I just wrote a quick script to meet my needs, where I read in both
files and grep(search_item, content_file). It works pretty well,
except that it takes about 25 minutes for 5,000 entries. I can't use a
hash implementation here; is there a way I can make this search faster?

Why not a hash? It is the right tool for the job.
Below is a sample search term, what the matching index line looks
like, and the entire script. I know it's rough, but I wrote it in a
hurry and I would like to refine it now and make it faster.

With grep, you search through the entire array of entries and build an
array of every match, but then only keep the first one found. That is
very wasteful, especially if the first match is on line 1 (a short
illustration of stopping at the first match follows at the end of this post).

Search File Term:
885_Addm Un Lse 0867.pdf

Large File to be searched, its matching index:
"885_Addm Un Lse 0867.pdf","885","ELM 111 N BOBBY AVE","Addm Un Lse
0867","Addm Un Lse 0867.pdf","Elmhurst","651","885","885_Addm Un Lse
0867"

script:

#!/usr/local/bin/perl
open(FILE, $ARGV[0]);
my @text_rep = <FILE>;          # file to search
close FILE;
open(FILE, $ARGV[1]);
my @text_search = <FILE>;       # file with entries to use in search
close FILE;
print "Size of content:" . @text_rep . "\n";
print "Searching for" . @text_search . "instances\n";
open(OUT, "+>did_not_find.txt");    # in case grep can't find the value, log it
foreach my $query (@text_search)
{
    chomp($query);
    $query =~ s/\(/\./;
    $query =~ s/\)/\./;
    $query =~ s/\$/\./;

You are substituting one metacharacter for another in your regular
expression. Read up on the \Q and \E escapes in perlre so you won't
need the three lines above. You can also speed up the search by
anchoring it to the beginning of the line, since you are just
looking for filenames preceded by a quotation mark.
    my @qu = grep(/$query/, @text_rep);

    my @qu = grep( /^"\Q$query/, @text_rep );
    if ($qu[0] eq '') {
        print OUT $query . "\n";    # error logging
    }
    else {
        push(@final, $qu[0]);       # found, so place in array
    }
}
close OUT;
open(OUT, "+>lines_that_pulled.txt");
print OUT @final;
close OUT;
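
For illustration only, here is roughly what stopping at the first match
looks like as a drop-in replacement for the grep loop above. This is a
sketch, not code from the thread; it reuses the same @text_rep,
@text_search, @final and OUT, and it still scans linearly, so it helps
least when the match sits near the end of the 400K lines:

    # Sketch only: stop scanning at the first matching line instead of
    # letting grep() walk all 400,000 lines for every query.
    foreach my $query (@text_search) {
        chomp($query);
        my $found;
        foreach my $line (@text_rep) {
            if ($line =~ /^"\Q$query\E"/) {   # \Q..\E escapes any metacharacters
                $found = $line;
                last;                          # first hit is enough
            }
        }
        if (defined $found) { push(@final, $found); }
        else                { print OUT "$query\n"; }
    }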
 
Mirco Wahab

bivity said:
Each line in both files contains only one entry, and yes, each line is
delimited by \n. The environment is Solaris (SunOS 5.8) with Perl 5.8,
and I can't use any additional libraries either.

OK, I created a 500_000-line "full input file" according to your "long line"
template (with two random numbers [0-9999] in each record) and made a 5_000-line
search file according to the same rule. This ran on my old Athlon 2500+ (non-X64)
in 1 min 52 sec of user time (and, surprisingly, led to ~30 random hits).

I used the good old long regex built from or-ed alternatives (the "search terms")
and didn't consider printing non-matches:

---

use strict;
use warnings;

my $full_fn = shift;
my $srch_fn = shift;
my $pull_fn = 'lines_that_pulled.txt';
my $err_fn  = 'did_not_find.txt';

open my $fh, '<', $srch_fn or die $!;   # file with entries to use in search
my $ts = '^"(';                         # start big regex here
while ( <$fh> ) {
    if (length) {
        tr/)($\n/..../;
        $ts .= "(?:$_)|";
    }
}
close $fh;
chop $ts;

my $text_q = qr/$ts)/;                  # build regex

print "Size of content: (unknown) \n";
print "Searching for: (unknown) instances\n";

my @final;
open $fh, '<', $full_fn or die $!;      # the 400+K lines
while ( <$fh> ) {
    push @final, (split /,/, $_)[0]
        if /$text_q/;
}
close $fh;

open $fh, '>', $pull_fn or die $!;
print $fh join "\n", @final;
close $fh;

---
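
(If the non-matches were wanted with this approach too, one rough way,
reusing the variables from the script above and assuming each search
term equals the first quoted field exactly and contains no regex
metacharacters, as in the sample data, would be:)

my %pulled;
for my $field (@final) {
    (my $key = $field) =~ tr/"//d;     # the pulled first field still carries its quotes
    $pulled{$key} = 1;
}

open $fh, '<', $srch_fn or die $!;
open my $fh_err, '>', $err_fn or die $!;
while ( <$fh> ) {
    chomp;
    print $fh_err "$_\n" unless $pulled{$_};   # term never produced a hit
}
close $fh;
close $fh_err;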


(FWIW)

I don't really have an idea to make this faster,
maybe someone has *the big flash* ;-)

Regards

M.
 
Mirco Wahab

Mirco said:
OK, I created a 500_000 line "full input file" according to your "long line"
template (with 2 random numbers [0-9999] in each record) and made a
5_000 line search file according to the same rule.

FWIW, this is the script used to generate those files:

use strict;
use warnings;

open my $fh, '>', 'bigfile.txt' or die $!;
for (1..500_000) {
    my $num1 = sprintf "%04d", int(rand(10000));
    my $num2 = sprintf "%d",   int(rand(10000));
    print $fh
        qq{"${num2}_Addm Un Lse ${num1}.pdf","${num2}","ELM 111 N BOBBY AVE","Addm Un Lse ${num1}","Addm Un Lse ${num1}.pdf","Elmhurst","651","${num2}","${num2}_Addm Un Lse ${num1}"}, "\n";
}
close $fh;

open $fh, '>', 'searchfile.txt' or die $!;
for (1..5_000) {
    my $num1 = sprintf "%04d", int(rand(10000));
    my $num2 = sprintf "%d",   int(rand(10000));
    print $fh
        qq{${num2}_Addm Un Lse ${num1}.pdf}, "\n";
}
close $fh;


Regards

M.
 
bivity

Why not a hash? It is the right tool for the job.

You're right. I took another quick look at it, and I was able to
create a hash out of the content file. Thanks for reminding me of \Q
for quoting literal strings; it always escapes my memory.

Well, here is what I got in the end: night and day results.

#!/usr/local/bin/perl
my %vals;
open(FILE, $ARGV[0]);
my @text_rep = <FILE>;
close FILE;
foreach my $t (@text_rep) {
    $t =~ /^\"(.*?)\",/;
    $vals{$1} = $t;                 # key each line by its first quoted field
}
open(FILE, $ARGV[1]);
my @text_search = <FILE>;
close FILE;
open(OUT, "+>did_not_find_beta.txt");
foreach my $query (@text_search) {
    chomp($query);
    if (defined(my $tmp = $vals{$query})) {
        push(@final, $tmp);         # found, so keep the matching line
    }
    else {
        print OUT $query . "\n";    # not found, log it
    }
}
close OUT;
open(OUT, "+>lines_that_pulled_beta.txt");
print OUT @final;
close OUT;

Thanks for all the help.
 
bivity

Jim Gibson said:
bivity said:
My work requires a lot of index lookups across large numbers of files
daily. Recently, we have been receiving files with one document per
line, along with all of its attributes. Such a file has around 400,000
entries. I then receive another file containing just file names, and I
am told to look for each of those files in the 400,000-entry list.
There are about 5,000 names in that file.

I just wrote a quick script to meet my needs, where I read in both
files and grep(search_item, content_file). It works pretty well,
except that it takes about 25 minutes for 5,000 entries. I can't use a
hash implementation here; is there a way I can make this search faster?

A hash implementation would be faster. Why can you not use a hash?


Below is a sample search term, what the matching index line looks
like, and the entire script. I know it's rough, but I wrote it in a
hurry and I would like to refine it now and make it faster.
Search File Term:
885_Addm Un Lse 0867.pdf
Large File to be searched, its matching index:
"885_Addm Un Lse 0867.pdf","885","ELM 111 N BOBBY AVE","Addm Un Lse
0867","Addm Un Lse 0867.pdf","Elmhurst","651","885","885_Addm Un Lse
0867"

[snipped]

Suggestions:

1. If all of your data looks like this example, then you are looking
for a fixed string in the first field of each record. Extract only the
first field into your @text_rep array and use the eq operator to
compare with your search strings. Check out the Text::CSV module.

2. Do not use grep. Instead, loop over your text_rep array and stop
when you get a match. Consider using the first() function from the
List::Util module (perldoc -q first).

3. For faster searching, sort your text_rep array and do a binary
search. See, for example, Search::Binary (I have not used it).

4. For even faster searching, use a hash.

--
Jim Gibson



2. I tried the first() function last week, but my company doesn't have
many modules installed.

4. I went with this; my 25-minute process went down to a mere 7
seconds. Got to love it.
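
Suggestion 3 never got tried in the thread; for reference, a
sorted-array binary search in plain core Perl might look roughly like
the sketch below. It keys on the first quoted field, and content.txt /
queries.txt are placeholder file names, not ones from the thread:

#!/usr/local/bin/perl
# Sketch of Jim's suggestion 3: sort the big file's first fields once,
# then binary-search each query name (O(n log N) instead of O(n*N)).
use strict;
use warnings;

# Build [first_field, full_line] pairs from the big file, then sort by key.
open my $fh, '<', 'content.txt' or die "content.txt: $!";
my @records;
while (<$fh>) {
    push @records, [ $1, $_ ] if /^"([^"]+)"/;
}
close $fh;
@records = sort { $a->[0] cmp $b->[0] } @records;

# Classic binary search on the first field; returns the full line or undef.
sub bsearch {
    my ($key) = @_;
    my ($lo, $hi) = (0, $#records);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $records[$mid][0] cmp $key;
        if    ($cmp < 0) { $lo = $mid + 1 }
        elsif ($cmp > 0) { $hi = $mid - 1 }
        else             { return $records[$mid][1] }
    }
    return undef;
}

open $fh, '<', 'queries.txt' or die "queries.txt: $!";
while (my $q = <$fh>) {
    chomp $q;
    my $line = bsearch($q);
    if (defined $line) { print $line }
    else               { print "NOT FOUND: $q\n" }
}
close $fh;

Speed-wise this sits between the original linear scan and the hash
lookup, which is why the hash ended up winning here.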
 
Mirco Wahab

bivity said:
You're right. I took another quick look at it, and I was able to
create a hash out of the content file. Thanks for reminding me of \Q
for quoting literal strings; it always escapes my memory.

From what I understood, you deliberately tried to convert ')', '(' and
'$' to a dot '.', meaning "any char", in the regular expression.

bivity said:
Well, here is what I got in the end: night and day results.

Nice!

#!/usr/local/bin/perl
my %vals;
open(FILE, $ARGV[0]);
my @text_rep = <FILE>;
close FILE;
foreach my $t (@text_rep) {
    $t =~ /^\"(.*?)\",/;
    $vals{$1} = $t;                 # key each line by its first quoted field
}
open(FILE, $ARGV[1]);
my @text_search = <FILE>;
close FILE;
open(OUT, "+>did_not_find_beta.txt");
foreach my $query (@text_search) {
    chomp($query);
    if (defined(my $tmp = $vals{$query})) {
        push(@final, $tmp);         # found, so keep the matching line
    }
    else {
        print OUT $query . "\n";    # not found, log it
    }
}
close OUT;
open(OUT, "+>lines_that_pulled_beta.txt");
print OUT @final;
close OUT;

You don't need these arrays at all. By cutting that overhead, you
might gain another 50% in run time. The hash lookup in particular
can be optimized:

use strict;
use warnings;

my $full_fn = shift;
my $srch_fn = shift;
my $pull_fn = 'lines_that_pulled_beta.txt';
my $err_fn  = 'did_not_find.txt';
my (%vals, @final);

open my $fh, '<', $full_fn or die $!;
while ( <$fh> ) {
    chomp;                          # no need for a temporary array
    $vals{$1} = $_ if /"([^"]+)/;   # key each line by its first quoted field
}
close $fh;

open my $fh_err, '>', $err_fn or die $!;
open $fh, '<', $srch_fn or die $!;
while ( <$fh> ) {
    chomp;
    if (exists $vals{$_}) {         # hash lookup for existence of the key term
        push @final, $vals{$_};
    }
    else {
        print $fh_err "$_\n";
    }
}
close $fh;
close $fh_err;

open $fh, '>', $pull_fn or die $!;
print $fh join "\n", @final;
close $fh;



But why did you, in your original post (or a later one),
stress that you *can't* use a hash? Did you assume it
would be too large?

Regards

M.
 
bivity

Mirco Wahab said:
[snipped]

But why did you, in your original post (or a later one),
stress that you *can't* use a hash? Did you assume it
would be too large?

That's great stuff, thanks!

Originally, the index file was not going to match the file name: the
"Title Index" was being provided, e.g. "Document.pdf". However, the
title index was not found in the large file, and we had to do the
lookup based on the storage location instead, e.g.
"21392_2394993Document.pdf".

It got hairy quickly. Now that I think about it, a hash could still
have been applied; I just needed to clean up the storage location with
a regex. But that is why, at first, I just wanted to make sure it
worked before getting into a hash implementation. (It can be tricky to
explain to non-coders what the script is doing on the fly when your
superiors want to see the code and a demo, but that is a different
story.)

I can't imagine life without Perl :p
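
(For what it's worth, the storage-location cleanup mentioned above
might be as small as stripping the leading digits and underscores
before using the name as a hash key. A rough sketch with a made-up
example value; it assumes the prefix is always numeric:)

#!/usr/local/bin/perl
use strict;
use warnings;

my $storage_name = '21392_2394993Document.pdf';   # hypothetical example value
(my $title = $storage_name) =~ s/^[\d_]+//;        # strip the leading digits/underscores
print "$title\n";                                   # prints "Document.pdf"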
 
