rab

Is this the best way to grep a HUMONGOUS file (>3000000 lines) for a
list of matching words within a Perl script?

(the file is too big to load into an array)

------------------------------------------------------
my @greplist=qw( var full error over repeats no_space );
my @all;

foreach (@greplist){
    my @finds = `/usr/bin/grep $_ /var/adm/syslog`;
    push(@all,@finds);
}

print @all;
------------------------------------------------------
 
Jürgen Exner

rab said:
is this the best way to grep a HUMONGUS file (>3000000 lines) for a
list of matching words within a perlscript?

(the file is too big to load into an array)

------------------------------------------------------
my @greplist=qw( var full error over repeats no_space );
my @all;

foreach (@greplist){
my @finds = `/usr/bin/grep $_ /var/adm/syslog`;

You are forking an external process for each word you are looking for,
which is neither needed nor efficient, and you scan the file once for every
single word. That means that for your example data you are loading and
processing the file 6 times!

Obviously, for a file that size you want to load and process it only once.

Untested, only a sketch. Adding error checking is left as an exercise:

my @greplist=qw( var full error over repeats no_space );
my @all;

open F, '/var/adm/syslog';
while (<F>) {                  # we go through the file only once, line by line
    for ($word = @greplist){   # in each line we check for every word
        if (/$word/) {         # if a word is found in this line then
            push @all, $_;     # push the line to the result list and
            last;              # go to the next line
        }
    }
}
print @all;                    # print the collected matches once, after the loop
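For anyone who wants to try the one-pass idea, here is a minimal runnable sketch with the loop written as `foreach my $word (...)` and basic error checking added. The sub name `scan_lines` and the in-memory sample data are made up for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Return every line read from $fh that matches at least one of @$words.
# Each word is used as a regex, as in the sketch above.
# (scan_lines is a hypothetical helper name.)
sub scan_lines {
    my ($fh, $words) = @_;
    my @hits;
    while (my $line = <$fh>) {          # one pass, one line in memory at a time
        foreach my $word (@$words) {
            if ($line =~ /$word/) {
                push @hits, $line;      # keep the line once ...
                last;                   # ... and move on to the next line
            }
        }
    }
    return @hits;
}

# Demo on an in-memory filehandle (made-up sample data). For the real log,
# something like this would replace the open below:
#   open my $fh, '<', '/var/adm/syslog' or die "Cannot open syslog: $!";
my $sample = "disk full on /var\nall quiet\nerror: no_space left\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @all = scan_lines($fh, [qw( var full error over repeats no_space )]);
close $fh;

print @all;
```

If the number of matching lines is itself large, printing each hit inside the loop instead of pushing it onto a list avoids holding the results in memory.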
 
Peter Ensch

Abigail said:
rab (e-mail address removed) wrote on MMMDCLXXVIII September MCMXCIII in
<URL:
.. is this the best way to grep a HUMONGUS file (>3000000 lines) for a
.. list of matching words within a perlscript?
..
.. (the file is too big to load into an array)
..
.. ------------------------------------------------------
.. my @greplist=qw( var full error over repeats no_space );
.. my @all;
..
.. foreach (@greplist){
.. my @finds = `/usr/bin/grep $_ /var/adm/syslog`;
.. push(@all,@finds);
.. }
..
.. print @all;


All you want is printing out the matches?


foreach (@greplist) {
open my $fh => "grep $_ /var/adm/syslog |" or die;
print while <$fh>;
close $fh or die;
}

Or:

system grep => $_ => '/var/adm/syslog' for @greplist;


Abigail

Or do it in one pass:

local $" = '|';

open my $fh => "egrep '@greplist' /var/adm/syslog |" or die;
print while <$fh>;
close $fh or die;

or

system "egrep '@greplist' /var/adm/syslog";
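A variation on the single egrep call, as a rough sketch: the list form of a piped open skips the shell, so the pattern needs no quoting. It also treats grep's exit status 1 (no match) as a normal result, whereas a bare `close $fh or die` would die when nothing matches. The sub name `grep_once` and the temporary test file are assumptions for the demo, and it assumes a `grep` that accepts `-E` is on the PATH.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Run `grep -E` once over $path and return the matching lines.
# (grep_once is a hypothetical helper name.)
sub grep_once {
    my ($path, @words) = @_;
    my $pattern = join '|', @words;

    # List form: no shell involved, arguments are passed to grep as-is.
    open my $fh, '-|', 'grep', '-E', $pattern, $path
        or die "Cannot run grep: $!";
    my @lines = <$fh>;
    close $fh;                          # returns false if grep exited non-zero

    # grep exits 0 on a match, 1 on no match, and >1 on a real error.
    my $status = $? >> 8;
    die "grep failed with status $status" if $status > 1;
    return @lines;
}

# Demo against a small temporary file (made-up data) instead of the real log.
my ($tmp, $tmpname) = tempfile(UNLINK => 1);
print {$tmp} "disk full\nall quiet\nerror here\n";
close $tmp or die "Cannot close temp file: $!";

my @found = grep_once($tmpname, qw( full error ));
print @found;
```

Only the matching lines are read into Perl here; the big file itself is scanned once, by grep.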

Peter
--

^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^
Peter Ensch,
(e-mail address removed) A-1140 (214) 480 2333
^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^~^
 
Eric J. Roode

Untested, only a sketch. Adding error checking is left as an
exercise:

my @greplist=qw( var full error over repeats no_space );
my @all;

open F, '/var/adm/syslog';
while (<F>) { #we go through the file only once, line by line
for ($word = @greplist){ #in each line we check for every word

You mean

foreach my $word (@greplist) {

of course.


I'm not sure, but I suspect it'd be faster to do:

my $pat = join '|', @greplist; # at top of program
push @all, $_ if /$pat/i; # within file loop
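Put together, the joined-pattern idea could look like the sketch below. Two things in it are additions, not part of the suggestion above: the pattern is precompiled with `qr//`, and each word is passed through `quotemeta`, which assumes the words are literal strings and not regexes. Note also that `/i` makes the match case-insensitive, which plain grep is not. The sample data is made up.

```perl
use strict;
use warnings;

my @greplist = qw( var full error over repeats no_space );

# Build the alternation once, at the top of the program.
# quotemeta escapes regex metacharacters (assumption: words are literals).
my $alternatives = join '|', map { quotemeta } @greplist;
my $pat = qr/$alternatives/i;

my @all;

# In-memory stand-in for the log file (made-up data).
my $sample = "Disk FULL\nall quiet\nrepeats 3 times\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
while (my $line = <$fh>) {
    push @all, $line if $line =~ $pat;  # one regex match per line
}
close $fh;

print @all;
```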

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

 
Jürgen Exner

Eric said:
You mean
foreach my $word (@greplist) {
of course.

Ooops, sorry.
As I said, it was only a sketch ;-(.
I'm not sure, but I suspect it'd be faster to do:

my $pat = join '|', @greplist; # at top of program
push @all, $_ if /$pat/i; # within file loop

Not sure, maybe. You are trading the loop for a more complex RE.
Now the RE engine itself must iterate over the alternatives. I don't know if
this will be significantly faster.
Would be interesting to run some benchmarks.

However, while you may gain some time in the match, you still need to read
the OP's giant file, and reading it will surely be the limiting factor when
it comes to performance.

jue
 
Darren Dunham

Not sure, maybe. You are trading the loop for a more complex RE.
Now the RE engine itself must iterate over the alternatives. I don't know if
this will be significantly faster.

But the RE runs in C, and the foreach loop runs as perl ops. For larger
numbers of words, I expect running the op tree to take more time.
Would be interesting to run some benchmarks.

Yes indeedy!
 
