Frequency in large datasets


Cosmic Cruizer

I've been able to reduce my dataset by 75%, but it still leaves me with a
file of 47 gigs. I'm trying to find the frequency of each line using:

open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
foreach (<TEMP>) {
    $seen{$_}++;
}
close(TEMP) || die "cannot close file $tempfile: $!";

My program keeps aborting after a few minutes because the computer runs out
of memory. I have four gigs of RAM and the total paging file is 10 megs,
but Perl does not appear to be using it.

How can I find the frequency of each line using such a large dataset? I
tried to have two output files where I kept moving the data back and forth
each time I grabbed the next line from TEMP instead of using $seen{$_}++,
but I did not have much success.
 

Gunnar Hjalmarsson

Cosmic said:
I've been able to reduce my dataset by 75%, but it still leaves me with a
file of 47 gigs. I'm trying to find the frequency of each line using:

open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
foreach (<TEMP>) {
    $seen{$_}++;
}
close(TEMP) || die "cannot close file $tempfile: $!";

My program keeps aborting after a few minutes because the computer runs out
of memory.

This line:
foreach (<TEMP>) {

reads the whole file into memory. You should read the file line by line
instead by replacing it with:

while (<TEMP>) {
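
(With that one change, the whole snippet becomes:

open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
while (<TEMP>) {
    $seen{$_}++;
}
close(TEMP) || die "cannot close file $tempfile: $!";

The while form reads and discards one line per iteration instead of
building a 47 GB list first.)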
 

A. Sinan Unur

I've been able to reduce my dataset by 75%, but it still leaves me
with a file of 47 gigs. I'm trying to find the frequency of each line
using:

open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
foreach (<TEMP>) {

Well, that is simply silly. You have a huge file yet you try to read all
of it into memory. Ain't gonna work.

How long is each line and how many unique lines do you expect?

If the number of unique lines is small relative to the number of total
lines, I do not see any difficulty if you get rid of the boneheaded for
loop.
    $seen{$_}++;
}
close(TEMP) || die "cannot close file $tempfile: $!";


my %seen;

open my $TEMP, '<', $tempfile
or die "Cannot open '$tempfile': $!";

++ $seen{ $_ } while <$TEMP>;

close $TEMP
or die "Cannot close '$tempfile': $!";
My program keeps aborting after a few minutes because the computer
runs out of memory. I have four gigs of RAM and the total paging file
is 10 megs, but Perl does not appear to be using it.

I don't see much point to having a 10 MB swap file. To make the best use
of 4 GB physical memory, AFAIK, you need to be running a 64 bit OS.
How can I find the frequency of each line using such a large dataset?
I tried to have two output files where I kept moving the data back and
forth each time I grabbed the next line from TEMP instead of using
$seen{$_}++, but I did not have much success.

If the number of unique lines is large, I would periodically store the
current counts, clear the hash, and keep processing the original file. Then,
when you reach the end of the original data file, go back to the stored
counts (which will have multiple entries for each unique line) and
aggregate the information there.
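
A rough sketch of that spill-and-aggregate idea (the file names, the flush
threshold and the tab-separated spill format are just illustrative choices;
it assumes the data lines contain no tab characters and that a Unix-like
sort(1) is available):

#!/usr/bin/perl
# Spill partial counts to disk whenever the hash gets big, then
# aggregate the spill file with an external sort.
use strict;
use warnings;

my $tempfile  = 'bigdata.txt';        # the big input file (illustrative)
my $spillfile = 'partial_counts.txt'; # partial counts land here
my $LIMIT     = 1_000_000;            # flush %seen at this many distinct keys

my %seen;
open my $IN,    '<', $tempfile  or die "Cannot open '$tempfile': $!";
open my $SPILL, '>', $spillfile or die "Cannot open '$spillfile': $!";

while (my $line = <$IN>) {
    chomp $line;
    $seen{$line}++;
    flush_counts(\%seen, $SPILL) if keys %seen >= $LIMIT;
}
flush_counts(\%seen, $SPILL);
close $IN;
close $SPILL or die "Cannot close '$spillfile': $!";

# The spill file holds "line<TAB>count" records, possibly several per
# unique line.  A byte-order sort groups them so the partial counts can
# be summed in one streaming pass.
$ENV{LC_ALL} = 'C';
open my $SORTED, '-|', 'sort', $spillfile   # uses the system sort
    or die "Cannot run sort on '$spillfile': $!";

my ($prev, $total) = (undef, 0);
while (<$SORTED>) {
    chomp;
    my ($line, $count) = /^(.*)\t(\d+)$/ or next;
    if (defined $prev and $line ne $prev) {
        print "$total\t$prev\n";
        $total = 0;
    }
    ($prev, $total) = ($line, $total + $count);
}
print "$total\t$prev\n" if defined $prev;
close $SORTED;

sub flush_counts {
    my ($counts, $fh) = @_;
    print {$fh} "$_\t$counts->{$_}\n" for keys %$counts;
    %$counts = ();    # start the next batch with an empty hash
}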

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 

xhoster

Cosmic Cruizer said:
I've been able to reduce my dataset by 75%, but it still leaves me with a
file of 47 gigs. I'm trying to find the frequency of each line using:

open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
foreach (<TEMP>) {
    $seen{$_}++;
}
close(TEMP) || die "cannot close file $tempfile: $!";

If each line shows up a million times on average, that shouldn't
be a problem. If each line shows up twice on average, then it won't
work so well with 4G of RAM. We don't know which of those is closer to your
case.
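
(To put very rough numbers on it: at, say, 50 bytes per line, 47 GB is
getting on for a billion lines. If each line appears only twice, that is
several hundred million distinct keys, and a Perl hash entry costs on the
order of a hundred bytes or more on a 64-bit build, so the hash alone would
want tens of gigabytes. If each line repeats millions of times, the hash
stays tiny.)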
My program keeps aborting after a few minutes because the computer runs
out of memory. I have four gigs of RAM and the total paging file is 10
megs, but Perl does not appear to be using it.

If the program is killed due to running out of memory, then I would
say that the program *does* appear to be using the available memory. What
makes you think it isn't using it?

How can I find the frequency of each line using such a large dataset?

I probably wouldn't use Perl, but rather the OS's utilities. For example
on linux:

sort big_file | uniq -c
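
(That copes with arbitrarily large files because sort(1) does an external
merge sort, spilling runs to temporary files on disk rather than holding
everything in memory; with GNU sort you can point those temporary files at
a roomy disk with -T.)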

I tried to have two output files where I kept moving the data back and
forth each time I grabbed the next line from TEMP instead of using
$seen{$_}++, but I did not have much success.

But in line 42.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 

xhoster

Gunnar Hjalmarsson said:
This line:

foreach (<TEMP>) {

reads the whole file into memory. You should read the file line by line
instead by replacing it with:

while (<TEMP>) {

Duh, I completely overlooked that.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 

Cosmic Cruizer

This line:

foreach (<TEMP>) {

reads the whole file into memory. You should read the file line by
line instead by replacing it with:

while (<TEMP>) {

<sigh> As both you and Sinan pointed out... I'm using foreach. Everywhere
else I used the while statement to get me to this point. This solves the
problem.

Thank you.
 

Jürgen Exner

Cosmic Cruizer said:
I've been able to reduce my dataset by 75%, but it still leaves me with a
file of 47 gigs. I'm trying to find the frequency of each line using:

open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
foreach (<TEMP>) {

This slurps the whole file (yes, all 47GB) into a list and then iterates
over that list. Read the file line-by-line instead:

while (<TEMP>) {

This should work unless you have a lot of different data points.

jue
 

Ben Bullock

A. Sinan Unur said:
Well, that is simply silly. You have a huge file yet you try to read all
of it into memory. Ain't gonna work.

I'm not sure why it's silly as such - perhaps he didn't know that
"foreach" would read all the file into memory.

If the number of unique lines is small relative to the number of total
lines, I do not see any difficulty if you get rid of the boneheaded for
loop.

Again, why is it "boneheaded"? The fact that foreach reads the entire
file into memory isn't something I'd expect people to know
automatically.
 

A. Sinan Unur

Ben Bullock wrote:
....


I'm not sure why it's silly as such - perhaps he didn't know that
"foreach" would read all the file into memory.

Well, I assumed he didn't. But this is one of those things where, had I
found myself doing it after spending hours and hours trying to work out a
way of processing the file, I would have slapped my forehead and said, "now
that was just a silly thing to do". Coupled with the "ain't", I assumed
my meaning was clear. I wasn't calling the OP names, but trying to get a
message across very strongly.
Again, why is it "boneheaded"?

Because there is no hope of anything working so long as that for loop is
there.
The fact that foreach reads the entire file into memory isn't
something I'd expect people to know automatically.

Maybe this helps:

From perlfaq3.pod:

<blockquote>
* How can I make my Perl program take less memory?

....

Of course, the best way to save memory is to not do anything to waste it
in the first place. Good programming practices can go a long way toward
this:

* Don't slurp!

Don't read an entire file into memory if you can process it line by
line. Or more concretely, use a loop like this:
</blockquote>
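
(The loop the FAQ shows at that point is the standard line-by-line pattern,
along these lines rather than verbatim:

while (<INPUT>) {
    # work on $_ here, one line at a time
}
)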

Maybe you would like to read the rest.

So, calling the for loop boneheaded is a little stronger than "Bad
Idea", but then what is simply a bad idea with a 200 MB file (things
will still work but less efficiently) is boneheaded with a 47 GB file
(there is no chance of the program working).

There is a reason "Don't slurp!" appears with an exclamation mark and as
the first recommendation in the FAQ list answer.

Hope this helps you become more comfortable with the notion that slurping
a 47 GB file into memory is a boneheaded move. It is boneheaded if I do it,
if Larry Wall does it, if Superman does it ... you get the picture I hope.

Sinan

--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/
 

nolo contendere

A. Sinan Unur said:

Hope this helps you become more comfortable with the notion that reading
a 47 GB file is a boneheaded move. It is boneheaded if I do it, if Larry
Wall does it, if Superman does it ... you get the picture I hope.

I don't think it would be boneheaded if Superman did it...I mean, he's
SUPERMAN.
 

Uri Guttman

ASU> But attempting to slurp a 47 GB file is the equivalent of having a
ASU> cryptonite slurpee in the morning.

ASU> Not good.

ASU> ;-)

and i wouldn't even recommend file::slurp for that job!! :)

uri
 

John W. Krahn

A. Sinan Unur said:
But attempting to slurp a 47 GB file is the equivalent of having a
cryptonite slurpee in the morning.

s/cryptonite/kryptonite/;


John
 

Cosmic Cruizer

<sigh> As both you and Sinan pointed out... I'm using foreach. Everywhere
else I used the while statement to get me to this point. This solves the
problem.

Thank you.

Well... that did not make any difference at all. I still get up to about
90% of the physical ram and the job aborts within about the same
timeframe. From what I can tell, using while did not make any difference
compared to using foreach. I tried the two swapfiles idea, but that is not
a viable solution. I guess the only thing to do is to break the files
down into smaller chunks of about 5 gigs each. That will give me about 3
to 4 days worth of data at a time. After that, I can look at what I have
and decide how I can optimize the data for the next run.
 

comp.lang.perl.moderated

Well... that did not make any difference at all. I still get up to about
90% of the physical ram and the job aborts within about the same
timeframe. From what I can tell, using while did not make any difference
than using foreach. I tried using the two swapfiles idea, but that is not
a viable solution. I guess the only thing to do is to break the files
down into smaller chunks of about 5 gigs each. That will give me about 3
to 4 days worth of data at a time. After that, I can look at what I have
and decide how I can optimize the data for the next run.

While slower, you could use a DBM if %seen is
overgrowing memory, eg,

use Fcntl;        # for the O_* flags
use DB_File;
tie %seen, 'DB_File', $filename, O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "Cannot tie '$filename': $!";
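
A fuller sketch along those lines (the file names are illustrative, and it
assumes the DB_File module and its Berkeley DB library are available):

use strict;
use warnings;
use Fcntl;                        # O_RDWR, O_CREAT
use DB_File;

my $tempfile = 'bigdata.txt';     # the big input file (illustrative name)
my $dbfile   = 'linecounts.db';   # on-disk hash, so RAM is not the limit

my %seen;
tie %seen, 'DB_File', $dbfile, O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "Cannot tie '$dbfile': $!";

open my $TEMP, '<', $tempfile or die "Cannot open '$tempfile': $!";
while (my $line = <$TEMP>) {
    chomp $line;
    $seen{$line}++;               # each increment is a fetch/store on disk
}
close $TEMP or die "Cannot close '$tempfile': $!";

while (my ($line, $count) = each %seen) {
    print "$count\t$line\n";
}
untie %seen;

As the poster says, expect this to be much slower than an in-memory hash,
since every increment goes through the tied DB file.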
 
