rab · Sep 26, 2003

Is this the best way to grep a humongous file (>3000000 lines) for a
list of matching words within a Perl script?

(the file is too big to load into an array)

------------------------------------------------------
my @greplist=qw( var full error over repeats no_space );
my @all;

foreach (@greplist) {
    my @finds = `/usr/bin/grep $_ /var/adm/syslog`;
    push(@all, @finds);
}

print @all;
------------------------------------------------------

Jürgen Exner · Sep 26, 2003

rab said:
Is this the best way to grep a humongous file (>3000000 lines) for a
list of matching words within a Perl script?

(the file is too big to load into an array)

------------------------------------------------------
my @greplist=qw( var full error over repeats no_space );
my @all;

foreach (@greplist){
my @finds = `/usr/bin/grep $_ /var/adm/syslog`;

You are forking an external process (not needed) for each word you are
looking for (not efficient) and scan the file once for every single word.
That means for your example data you are loading and processing the file 6
times!

Obviously for a file that size you want to load and process it only once.

push(@all,@finds);
}

print @all;

Untested, only a sketch. Adding error checking is left as an exercise:

my @greplist=qw( var full error over repeats no_space );
my @all;

open F, '/var/adm/syslog';
while (<F>) {                     # we go through the file only once, line by line
    for ($word = @greplist){      # in each line we check for every word
        if (/$word/) {            # if a word is found in this line then
            push @all, $_;        # push the line to the result list and
            last;                 # go to the next line
        }
    }
}
print @all;
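
One way the "exercise" might be filled in, as a minimal sketch rather than a tested replacement: the file is still read once, line by line, but with a lexical filehandle, `use strict`, and a checked `open`. The function name `grep_lines` and the in-memory sample data are assumptions made for illustration; in real use the handle would be opened on /var/adm/syslog.

```perl
use strict;
use warnings;

# Return every line read from $fh that contains at least one of @$words.
# The handle is read once, line by line, so the whole file is never in memory.
sub grep_lines {
    my ($fh, $words) = @_;
    my @all;
    while (my $line = <$fh>) {
        foreach my $word (@$words) {
            if ($line =~ /$word/) {   # each word is used as a regex
                push @all, $line;
                last;                 # one hit is enough, go to the next line
            }
        }
    }
    return @all;
}

my @greplist = qw( var full error over repeats no_space );

# Demo input: an in-memory filehandle standing in for the real log file.
my $sample = "disk full on /var\nlogin ok\nerror: no_space\n";
open my $fh, '<', \$sample or die "Cannot open sample data: $!";
my @all = grep_lines($fh, \@greplist);
close $fh or die "Cannot close sample data: $!";

print @all;
```

Only the matching lines are kept in @all, so memory use depends on the number of hits, not on the size of the file.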

Anno Siegel · Sep 26, 2003

[...]

Or:

system grep => $_ => '/var/adm/syslog' for @greplist;

I'd use egrep, if the sequence isn't important. But I'm off topic.

Abigail

Very pretty, very prickly. Perl warns about it a lot.

Anno

Peter Ensch · Sep 26, 2003

Abigail said:
rab (e-mail address removed) wrote on MMMDCLXXVIII September MCMXCIII in
<URL:
.. Is this the best way to grep a humongous file (>3000000 lines) for a
.. list of matching words within a Perl script?
..
.. (the file is too big to load into an array)
..
.. ------------------------------------------------------
.. my @greplist=qw( var full error over repeats no_space );
.. my @all;
..
.. foreach (@greplist){
.. my @finds = `/usr/bin/grep $_ /var/adm/syslog`;
.. push(@all,@finds);
.. }
..
.. print @all;

All you want is printing out the matches?

foreach (@greplist) {
    open my $fh => "grep $_ /var/adm/syslog |" or die;
    print while <$fh>;
    close $fh or die;
}

Or:

system grep => $_ => '/var/adm/syslog' for @greplist;

Abigail

Or do it in one pass:

local $" = '|';

open my $fh => "egrep '@greplist' /var/adm/syslog |" or die;
print while <$fh>;
close $fh or die;

or

system "egrep '@greplist' /var/adm/syslog";
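
A variation on the one-pass idea, offered as a sketch under assumptions: instead of joining the words with `|`, pass each word to grep with its own `-e` option, and use the list form of a pipe open (Perl 5.8 or later, Unix-like systems) so that no shell is involved and no quoting is needed. The function name `grep_once` and the temporary sample file are made up for the demo, and the code assumes a `grep` on the PATH.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Run grep a single time with one -e option per word. The list form of a
# pipe open hands the arguments to grep directly, bypassing the shell.
sub grep_once {
    my ($file, @words) = @_;
    my @cmd = ('grep', (map { ('-e', $_) } @words), $file);
    open my $fh, '-|', @cmd or die "Cannot run grep: $!";
    my @lines = <$fh>;
    close $fh;
    # grep exits 0 on a match, 1 on no match, and >1 on a real error.
    die "grep failed with status " . ($? >> 8) if ($? >> 8) > 1;
    return @lines;
}

# Demo input: a temporary file standing in for /var/adm/syslog.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "disk full on /var\nlogin ok\nerror: no_space\n";
close $tmp or die "Cannot write sample data: $!";

my @greplist = qw( var full error over repeats no_space );
print grep_once($path, @greplist);
```

Note that a plain `close $fh or die` would also die when grep simply finds nothing, because grep exits with status 1 in that case; the sketch checks the exit status explicitly for that reason.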

Peter

Eric J. Roode · Sep 27, 2003

Untested, only a sketch. Adding error checking is left as an
exercise:

my @greplist=qw( var full error over repeats no_space );
my @all;

open F, '/var/adm/syslog';
while (<F>) { #we go through the file only once, line by line
for ($word = @greplist){ #in each line we check for every word

You mean

foreach my $word (@greplist) {

of course.

I'm not sure, but I suspect it'd be faster to do:

my $pat = join '|', @greplist; # at top of program
push @all, $_ if /$pat/i; # within file loop
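
Put together, the two lines above might look like the following minimal sketch. Two things are additions, not part of the suggestion itself: `quotemeta` keeps any regex metacharacters in the words literal, and `qr//` compiles the alternation once up front. The function name `matching_lines` and the in-memory sample data are assumptions for the demo.

```perl
use strict;
use warnings;

my @greplist = qw( var full error over repeats no_space );

# Build one alternation out of all the words and compile it once.
my $alternation = join '|', map { quotemeta($_) } @greplist;
my $pat = qr/$alternation/i;   # /i as in the snippet above; drop it for a case-sensitive match

# Read the handle once and keep every line that matches any of the words.
sub matching_lines {
    my ($fh) = @_;
    my @all;
    while (my $line = <$fh>) {
        push @all, $line if $line =~ $pat;
    }
    return @all;
}

# Demo input: an in-memory filehandle standing in for /var/adm/syslog.
my $sample = "Disk FULL on /VAR\nlogin ok\nerror: no_space\n";
open my $fh, '<', \$sample or die "Cannot open sample data: $!";
my @all = matching_lines($fh);
close $fh or die "Cannot close sample data: $!";

print @all;
```

With `/i` the pattern also matches "FULL" and "VAR", which the original backtick `grep` calls would not, so the two approaches are only equivalent once the flag is removed.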

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print


Jürgen Exner · Sep 27, 2003

Eric said:
You mean
foreach my $word (@greplist) {
of course.

Ooops, sorry.
As I said, it was only a sketch ;-(.

I'm not sure, but I suspect it'd be faster to do:

my $pat = join '|', @greplist; # at top of program
push @all, $_ if /$pat/i; # within file loop

Not sure, maybe. You are trading the loop for a more complex RE.
Now the RE engine itself must iterate over the alternatives. I don't know if
this will be significantly faster.
Would be interesting to run some benchmarks.

However, while you may gain some time in the match you still need to read
the OP's giant file and for sure that will be the limiting factor when it
comes to performance.

jue

Darren Dunham · Sep 29, 2003

Not sure, maybe. You are trading the loop for a more complex RE.
Now the RE engine itself must iterate over the alternatives. I don't know if
this will be significantly faster.

But the RE runs in C, and the foreach loop runs in perlops. For larger
numbers I expect running an op tree to take more time.

Would be interesting to run some benchmarks.

Yes indeedy!
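
A rough sketch of how such a benchmark could be set up with the core Benchmark module. The synthetic log lines are invented for the test, as are the names `count_loop` and `count_alternation`; results on a real syslog, where disk I/O is involved, may look quite different.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @greplist = qw( var full error over repeats no_space );
my $pat = join '|', @greplist;
my $re  = qr/$pat/;

# Synthetic stand-in for syslog lines (made up for this sketch);
# every fifth line contains one of the words.
my @lines = map {
    $_ % 5 ? "line $_: nothing to see\n" : "line $_: disk full\n"
} 1 .. 20_000;

# Variant 1: a Perl-level loop over the words for every line.
sub count_loop {
    my $hits = 0;
    LINE: for my $line (@lines) {
        for my $word (@greplist) {
            if ($line =~ /$word/) { $hits++; next LINE }
        }
    }
    return $hits;
}

# Variant 2: a single precompiled alternation per line.
sub count_alternation {
    my $hits = 0;
    for my $line (@lines) {
        $hits++ if $line =~ $re;
    }
    return $hits;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    loop        => \&count_loop,
    alternation => \&count_alternation,
});
```

Because the lines are held in memory, this measures only the matching cost and leaves out the file reading that was expected to dominate on the real file.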
