I have been working on a strange problem. I am reading a series of large
files (50 MB or so), one at a time, with:

@lines = <FILE>;

or (same behavior with each):

while (<FILE>) {
    push(@lines, $_);
}
The first time I read a file, it is read into the array in about 2
seconds. The second time I read a file (of the same size), it takes
about 20 seconds. Everything is declared locally inside the loop, so
everything goes out of scope. I am not sure why it takes so much longer
the second time.
I have narrowed the problem down to a few different areas:
1. It seems that if I read the file into a large scalar by setting
$/ = undef, the file gets read faster. So I assume the slowdown is
taking place in the splitting of the lines.
Seems so; on my system I get similar results. If you can narrow your
problem down to a few lines of code, feel free to post that small
program; it makes the problem much easier to reproduce. Just for
testing, I've written such a small script for you:
#!/usr/bin/perl -w
use strict;
use warnings;
use Benchmark;

my $file = '50mb.txt';

for ( 1 .. 4 ) {
    print timestr( timeit( 1, sub {
        # local $/ = undef;          # enable for slurp mode
        open my $fh, '<', $file or die $!;
        # my @lines = <$fh>;         # list-assignment variant (see below)
        my @lines;
        push @lines, $_ while <$fh>;
    } ) ), "\n";
}
__END__
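The test file '50mb.txt' is just something I made up; I don't know what
your real data looks like. To reproduce numbers like the ones below, a
throwaway generator along these lines will do (file name, line count,
and line length are arbitrary):

#!/usr/bin/perl
use strict;
use warnings;

# Write roughly 50 MB: 1.5 million lines of about 34 bytes each.
# Adjust $lines and the padding to vary line count and line length.
my $lines = 1_500_000;
open my $out, '>', '50mb.txt' or die $!;
for my $i ( 1 .. $lines ) {
    print {$out} sprintf( "line %07d %s\n", $i, 'x' x 20 );
}
close $out or die $!;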
The file I'm reading here consists of 1.5 million lines (50 MB
altogether). I get:
4 wallclock secs ( 3.98 usr + 0.14 sys = 4.13 CPU) @ 0.24/s (n=1)
34 wallclock secs (33.89 usr + 0.16 sys = 34.05 CPU) @ 0.03/s (n=1)
27 wallclock secs (26.17 usr + 0.13 sys = 26.30 CPU) @ 0.04/s (n=1)
28 wallclock secs (27.77 usr + 0.20 sys = 27.97 CPU) @ 0.04/s (n=1)
With the "local $/ = undef" line enabled (slurp mode), I get:
1 wallclock secs ( 0.77 usr + 0.09 sys = 0.86 CPU) @ 1.16/s (n=1)
0 wallclock secs ( 0.72 usr + 0.17 sys = 0.89 CPU) @ 1.12/s (n=1)
0 wallclock secs ( 0.72 usr + 0.23 sys = 0.95 CPU) @ 1.05/s (n=1)
1 wallclock secs ( 0.70 usr + 0.23 sys = 0.94 CPU) @ 1.07/s (n=1)
With "my @lines = <$fh>" instead of the while loop, I get:
22 wallclock secs (16.13 usr + 5.22 sys = 21.34 CPU) @ 0.05/s (n=1)
36 wallclock secs (35.38 usr + 0.22 sys = 35.59 CPU) @ 0.03/s (n=1)
6 wallclock secs ( 5.58 usr + 0.14 sys = 5.72 CPU) @ 0.17/s (n=1)
37 wallclock secs (36.88 usr + 0.17 sys = 37.05 CPU) @ 0.03/s (n=1)
Curious; I don't know why the third attempt breaks ranks.
I have run my script with another input file, too; one with considerably
fewer newlines (also 50 MB, approx. 200,000 lines). I get the following
results for the while loop:
1 wallclock secs ( 1.34 usr + 0.14 sys = 1.48 CPU) @ 0.67/s (n=1)
12 wallclock secs (11.45 usr + 0.19 sys = 11.64 CPU) @ 0.09/s (n=1)
15 wallclock secs (14.48 usr + 0.19 sys = 14.67 CPU) @ 0.07/s (n=1)
10 wallclock secs (10.45 usr + 0.22 sys = 10.67 CPU) @ 0.09/s (n=1)
And for the version with "my @lines = <$fh>":
3 wallclock secs ( 3.06 usr + 0.33 sys = 3.39 CPU) @ 0.29/s (n=1)
57 wallclock secs (55.86 usr + 0.31 sys = 56.17 CPU) @ 0.02/s (n=1)
60 wallclock secs (59.20 usr + 0.23 sys = 59.44 CPU) @ 0.02/s (n=1)
58 wallclock secs (57.39 usr + 0.22 sys = 57.61 CPU) @ 0.02/s (n=1)
It seems that Perl needs more time the longer the lines are. To test
this, I ran the script with a 50 MB file containing only a single
newline in the middle; there, all attempts need (nearly) the same time:
269 wallclock secs (185.00 usr + 81.86 sys = 266.86 CPU) @ 0.00/s (n=1)
277 wallclock secs (184.42 usr + 87.11 sys = 271.53 CPU) @ 0.00/s (n=1)
276 wallclock secs (183.98 usr + 86.03 sys = 270.02 CPU) @ 0.00/s (n=1)
272 wallclock secs (184.74 usr + 85.03 sys = 269.77 CPU) @ 0.00/s (n=1)
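Since slurp mode stays fast in every run, one variant that might be
worth timing (I have not benchmarked it here, so this is only a sketch)
is to slurp the file into a scalar and split the lines yourself:

use strict;
use warnings;

my $file = '50mb.txt';
open my $fh, '<', $file or die $!;
my $data = do { local $/; <$fh> };   # slurp the whole file at once
close $fh;
# Note: unlike readline, split removes the trailing "\n" from each line.
my @lines = split /\n/, $data;

Whether the split suffers from the same slowdown on repeated runs would
have to be checked with the same Benchmark harness as above.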
2. If I append to one large array, rather than rewriting a different
array, the slowdown does not occur. So it seems Perl has a hard time
with memory it already has, but is fine with memory it just took from
the system?
Right. In my example: if I move the declaration "my @lines" in front of
the for loop, I get the following for the first file with 1.5 million
lines (only the for loop matters here):
4 wallclock secs ( 3.02 usr + 0.25 sys = 3.27 CPU) @ 0.31/s (n=1)
3 wallclock secs ( 2.95 usr + 0.31 sys = 3.27 CPU) @ 0.31/s (n=1)
7 wallclock secs ( 2.86 usr + 0.27 sys = 3.13 CPU) @ 0.32/s (n=1)
9 wallclock secs ( 3.11 usr + 0.34 sys = 3.45 CPU) @ 0.29/s (n=1)
Actually, this creates an array with 6 million elements in the end
(1.5 million lines times 4 iterations). The performance penalty in the
second half is just because my machine has only 512 MB of RAM and needs
to swap. Hence the results for the file with only 200,000 lines look
much better (no swapping is needed):
1 wallclock secs ( 1.11 usr + 0.17 sys = 1.28 CPU) @ 0.78/s (n=1)
2 wallclock secs ( 1.11 usr + 0.17 sys = 1.28 CPU) @ 0.78/s (n=1)
1 wallclock secs ( 1.03 usr + 0.28 sys = 1.31 CPU) @ 0.76/s (n=1)
1 wallclock secs ( 1.09 usr + 0.23 sys = 1.33 CPU) @ 0.75/s (n=1)
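For reference, the only change behind these numbers is the placement of
the declaration; a sketch of the modified loop (the rest of the script
is unchanged):

my @lines;                            # declared once, outside the loop
for ( 1 .. 4 ) {
    print timestr( timeit( 1, sub {
        open my $fh, '<', $file or die $!;
        push @lines, $_ while <$fh>;  # keeps appending to the same array
    } ) ), "\n";
}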
3. The problem does not seem to happen on Linux, but I'm working on
Windows.
I have run this on Windows XP SP2 with ActiveState's Perl 5.8.6.
Any suggestions for a workaround? Has anyone else seen this? Thanks in
advance.
I have no suggestions for a workaround ;-(
Yes, I have seen it now ;-)
But: is it really necessary to read in the whole file? Do you, in the
worst case, really have to compare the first line with the last one?
Perhaps you could give your algorithm a second thought.
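For example, if each line can be handled on its own (a big "if"; I don't
know your algorithm), processing while reading keeps memory flat and
avoids building the huge array in the first place:

use strict;
use warnings;

my $file = '50mb.txt';                # same test file as above
open my $fh, '<', $file or die $!;
while ( my $line = <$fh> ) {
    chomp $line;
    # ... do the per-line work here instead of pushing onto @lines ...
}
close $fh;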
regards,
fabian