I have been working on a strange problem. I am reading a series of large
files (50 MB or so), one at a time, with:

@lines = <FILE>;

or (same behavior with each):

while (<FILE>) {
    push(@lines, $_);
}
The first time I read a file, it is read into the array in about 2
seconds. The second time I read a file (of the same size), it takes
about 20 seconds. Everything is declared locally inside the loop, so
everything goes out of scope. I am not sure why it takes so much longer
the second time.
I have narrowed the problem down to a few different areas:
1. It seems that if I read the file into a large scalar by setting
$/ = undef, the file gets read faster. So I assume the slowdown is
taking place in the splitting of the lines.
Seems so; on my system I get similar results. If you can narrow your
problem down to a few lines of code, feel free to post that small
program; it makes the problem much easier to reproduce. Just for
testing, I've written such a small script for you:
#!/usr/bin/perl -w
use strict;
use warnings;
use Benchmark;

my $file = '50mb.txt';

for ( 1 .. 4 ) {
    print timestr( timeit( 1, sub {
        # local $/ = undef;          # enable for slurp mode
        open my $fh, '<', $file or die $!;
        # my @lines = <$fh>;         # list-assignment variant (see below)
        my @lines;
        push @lines, $_ while <$fh>;
    } ) ), "\n";
}
__END__
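The test file '50mb.txt' is just something I made up; I don't know what
your real data looks like. To reproduce numbers like the ones below, a
throwaway generator along these lines will do (file name, line count,
and line length are arbitrary):

#!/usr/bin/perl
use strict;
use warnings;

# Write roughly 50 MB: 1.5 million lines of about 34 bytes each.
# Adjust $lines and the padding to vary line count and line length.
my $lines = 1_500_000;
open my $out, '>', '50mb.txt' or die $!;
for my $i ( 1 .. $lines ) {
    print {$out} sprintf( "line %07d %s\n", $i, 'x' x 20 );
}
close $out or die $!;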
The file I'm reading here consists of 1.5 million lines (50 MB
altogether). I get:
4 wallclock secs ( 3.98 usr + 0.14 sys = 4.13 CPU) @ 0.24/s (n=1)
34 wallclock secs (33.89 usr + 0.16 sys = 34.05 CPU) @ 0.03/s (n=1)
27 wallclock secs (26.17 usr + 0.13 sys = 26.30 CPU) @ 0.04/s (n=1)
28 wallclock secs (27.77 usr + 0.20 sys = 27.97 CPU) @ 0.04/s (n=1)
With the "local $/ = undef" line enabled (slurp mode), I get:
1 wallclock secs ( 0.77 usr + 0.09 sys = 0.86 CPU) @ 1.16/s (n=1)
0 wallclock secs ( 0.72 usr + 0.17 sys = 0.89 CPU) @ 1.12/s (n=1)
0 wallclock secs ( 0.72 usr + 0.23 sys = 0.95 CPU) @ 1.05/s (n=1)
1 wallclock secs ( 0.70 usr + 0.23 sys = 0.94 CPU) @ 1.07/s (n=1)
With "my @lines = <$fh>" instead of the while loop, I get:
22 wallclock secs (16.13 usr + 5.22 sys = 21.34 CPU) @ 0.05/s (n=1)
36 wallclock secs (35.38 usr + 0.22 sys = 35.59 CPU) @ 0.03/s (n=1)
6 wallclock secs ( 5.58 usr + 0.14 sys = 5.72 CPU) @ 0.17/s (n=1)
37 wallclock secs (36.88 usr + 0.17 sys = 37.05 CPU) @ 0.03/s (n=1)
Curious; I don't know why the third attempt breaks ranks.
I have run my script with another input file, too; one with considerably
fewer newlines (also 50 MB, approx. 200,000 lines). I get the following
results for the while loop:
1 wallclock secs ( 1.34 usr + 0.14 sys = 1.48 CPU) @ 0.67/s (n=1)
12 wallclock secs (11.45 usr + 0.19 sys = 11.64 CPU) @ 0.09/s (n=1)
15 wallclock secs (14.48 usr + 0.19 sys = 14.67 CPU) @ 0.07/s (n=1)
10 wallclock secs (10.45 usr + 0.22 sys = 10.67 CPU) @ 0.09/s (n=1)
And for the version with "my @lines = <$fh>":
3 wallclock secs ( 3.06 usr + 0.33 sys = 3.39 CPU) @ 0.29/s (n=1)
57 wallclock secs (55.86 usr + 0.31 sys = 56.17 CPU) @ 0.02/s (n=1)
60 wallclock secs (59.20 usr + 0.23 sys = 59.44 CPU) @ 0.02/s (n=1)
58 wallclock secs (57.39 usr + 0.22 sys = 57.61 CPU) @ 0.02/s (n=1)
It seems that Perl needs more time the longer the lines are. To test
this, I ran the script with a 50 MB file containing only a single
newline in the middle; there, all attempts need (nearly) the same time:
269 wallclock secs (185.00 usr + 81.86 sys = 266.86 CPU) @ 0.00/s (n=1)
277 wallclock secs (184.42 usr + 87.11 sys = 271.53 CPU) @ 0.00/s (n=1)
276 wallclock secs (183.98 usr + 86.03 sys = 270.02 CPU) @ 0.00/s (n=1)
272 wallclock secs (184.74 usr + 85.03 sys = 269.77 CPU) @ 0.00/s (n=1)
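Since slurp mode stays fast in every run, one variant that might be
worth timing (I have not benchmarked it here, so this is only a sketch)
is to slurp the file into a scalar and split the lines yourself:

use strict;
use warnings;

my $file = '50mb.txt';
open my $fh, '<', $file or die $!;
my $data = do { local $/; <$fh> };   # slurp the whole file at once
close $fh;
# Note: unlike readline, split removes the trailing "\n" from each line.
my @lines = split /\n/, $data;

Whether the split suffers from the same slowdown on repeated runs would
have to be checked with the same Benchmark harness as above.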
2. If I append to one large array, rather than rewriting a different
array, the slowdown does not occur. So it seems Perl has a hard time
with memory it already has, but is fine with memory it just took from
the system?
Right. In my example: if I move the declaration "my @lines" in front of
the for loop, I get the following for the first file with 1.5 million
lines (only the for loop matters here):
4 wallclock secs ( 3.02 usr + 0.25 sys = 3.27 CPU) @ 0.31/s (n=1)
3 wallclock secs ( 2.95 usr + 0.31 sys = 3.27 CPU) @ 0.31/s (n=1)
7 wallclock secs ( 2.86 usr + 0.27 sys = 3.13 CPU) @ 0.32/s (n=1)
9 wallclock secs ( 3.11 usr + 0.34 sys = 3.45 CPU) @ 0.29/s (n=1)
Actually, this creates an array with 6 million elements in the end
(1.5 million lines times 4 iterations). The performance penalty in the
second half is just because my machine has only 512 MB of RAM and needs
to swap. Hence the results for the file with only 200,000 lines look
much better (no swapping is needed):
1 wallclock secs ( 1.11 usr + 0.17 sys = 1.28 CPU) @ 0.78/s (n=1)
2 wallclock secs ( 1.11 usr + 0.17 sys = 1.28 CPU) @ 0.78/s (n=1)
1 wallclock secs ( 1.03 usr + 0.28 sys = 1.31 CPU) @ 0.76/s (n=1)
1 wallclock secs ( 1.09 usr + 0.23 sys = 1.33 CPU) @ 0.75/s (n=1)
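For reference, the only change behind these numbers is the placement of
the declaration; a sketch of the modified loop (the rest of the script
is unchanged):

my @lines;                            # declared once, outside the loop
for ( 1 .. 4 ) {
    print timestr( timeit( 1, sub {
        open my $fh, '<', $file or die $!;
        push @lines, $_ while <$fh>;  # keeps appending to the same array
    } ) ), "\n";
}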
3. The problem does not seem to happen on Linux, but I'm working on
Windows.
I have run this on Windows XP SP2 with ActiveState's Perl 5.8.6.
Any suggestions for a workaround? Has anyone else seen this? Thanks in
advance.
I have no suggestions for a workaround ;-(
Yes, I have seen it now ;-)
But: is it really necessary to read in the whole file? Do you, in the
worst case, really have to compare the first line with the last one?
Perhaps you could give your algorithm a second thought.
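For example, if each line can be handled on its own (a big "if"; I don't
know your algorithm), processing while reading keeps memory flat and
avoids building the huge array in the first place:

use strict;
use warnings;

my $file = '50mb.txt';                # same test file as above
open my $fh, '<', $file or die $!;
while ( my $line = <$fh> ) {
    chomp $line;
    # ... do the per-line work here instead of pushing onto @lines ...
}
close $fh;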
regards,
fabian