My Perl script is "Killed" - Ran out of memory


Marcus Brody

I am doing some data extraction on a massive scale with Perl (e.g. 10
gigs of *zipped* files). For this I am using the PerlIO::gzip
module, as I don't even have enough disk space to unzip them all
(which would be around 35 GB).
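
For reference, the read loop looks roughly like this (the file name is
invented for the example):

use PerlIO::gzip;

open my $fh, '<:gzip', 'experiment_001.xls.gz' or die "open: $!";  # made-up name
while (my $line = <$fh>) {
    chomp $line;
    my @fields = split /\t/, $line;
    # ... pull out the columns I need ...
}
close $fh;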

I parse the "xls.gz" files one by one (actually tab delimeted files),
extracting the data I need, which would then be printed to STDOUT
before the program completes.

However, the program is automatically killed halfway through. By
checking with "free" I can see I am running out of RAM/swap
space, which is not surprising ;-)

I guess my data structure in Perl (fairly simple, a hash containing an
array and some associated variables etc.) is just swelling until the
process gets killed by the OS. However, I need a way around this.

It is difficult to print out the information as I go along, as I am
aggregating info from all the files - I do not know what to print out
until the last file is read.

Therefore, is there a cleaner way of dealing with huge data structures
in Perl? I guesstimate that my data structure contains 250 million
variables (all very simple - either decimal numbers (to about 6 places)
or short names).

Any ideas how I could stop running out of memory? Declaring data
types à la C?

Don't suggest I go buy some more RAM... I'm skint ;-)


Thanks in advance

MB
 

ctcgag

I am doing some data extraction on a massive scale with Perl (e.g. 10
gigs of *zipped* files). For this I am using the PerlIO::gzip
module, as I don't even have enough disk space to unzip them all
(which would be around 35 GB).

You could also use gzcat to stream the data into your Perl script
without unzipping them all at once:

gzcat file.gz | script.pl

I parse the "xls.gz" files one by one (actually tab delimeted files),
extracting the data I need, which would then be printed to STDOUT
before the program completes.

However, the program is automatically killed halfway through. By
checking with "free" I can see I am running out of RAM/swap
space, which is not surprising ;-)

I guess my data structure in Perl (fairly simple, a hash containing an
array and some associated variables etc.) is just swelling until the
process gets killed by the OS. However, I need a way around this.

Don't store as much! (If you told us what you were doing, I might be
able to tell you how not to store as much.)

It is difficult to print out the information as I go along, as I am
aggregating info from all the files

Aggregating how? Sum, count, min, max by group? By how many groups?

- I do not know what to print out
until the last file is read.

Then read the last file first :)
Therefore, is there a cleaner way of dealing with huge data structures
in Perl?

We really don't know how you are dealing with it now. "A hash containing
an array and some associated variables" is not much of a description.

I guesstimate that my data structure contains 250 million
variables (all very simple - either decimal numbers (to about 6 places)
or short names).

Is this what the input data structure is, or what the working
memory structure is? If stored in standard Perl variables, that's going to
be at least 5 gig, even if they are all floats and never used in a
stringish way.

Any ideas how I could stop running out of memory? Declaring data
types à la C?

You could use "pack" to pack the data into C-type variables that are
held (in bulk) in strings. But that would still take at least 1 gig,
and I assume you actually want to do something with these numbers, which
would be difficult if they are all packed.
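
For example, a rough sketch of the idea (using 4-byte single-precision
floats; 'd*' would give you doubles at 8 bytes each):

my @values = (0.123456, 1.5, -2.25);     # made-up numbers
my $packed = pack 'f*', @values;         # ~4 bytes per value in one string
my @back   = unpack 'f*', $packed;       # back to ordinary Perl numbers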

Does your output consist of all 250 million items? If not, then
perhaps you don't need to store them all after all.



Xho
 

Peter Hickman

I take it that this is on Linux or some such OS.

I have a similar problem: when parsing various XML files, the process
memory grows to the size required by the largest file, and it will not give it
back until the process finishes.

This occasionally gets killed due to lack of swap. What I do is process each
file from smallest to largest to reduce the amount of time that the process is
holding onto a lot of swap.
 

Marcus Brody

Don't store as much! (If you told us what you were doing, I might be
able to tell you how not to store as much.)


It's (somebody else's) biological data. This isn't even the raw data,
which would be too huuuuge to even imagine.... Therefore it isn't my
fault I am storing so much. Biological data is notorious ;-)

I am parsing it to extract/aggregate the small percentage which is
useful to me.

Aggregating how? Sum, count, min, max by group? By how many groups?

Essentially, there are a couple of useful things in the "excel"
(tab-delimited) files I want to grab hold of. Each file is a table with
columns denoting ID number, gene name, lots of associated info,
various results, normalisations and statistical calculations etc.
Each row is the results of a single "experiment", giving values for
the above fields.

The same gene appears in many experiments. I want to aggregate the
data for genes together for one of the results. E.g. so for each
gene, I have every result for one type of experiment ordered together.
To give an example of my data structure, this is what I spit out at
the end:



foreach my $key (keys %genes) {
    # This is the gene name, followed by how many experiments were for that gene
    print "$key $genes{$key}{occurences}\t";

    # These are all the values for the experimental results for that gene.
    # The number of results varies from gene to gene.
    foreach my $value ( @{$genes{$key}{result_value}} ) {
        print "$value\t";
    }
    print "\n";
}

# Please excuse the poor code, I am a newbie to *all of this*.

Furthermore, there is some other associated info in that hash, which I
am not printing out at this stage, but may wish to when the thing is
working.

Is this what the input data structure is, or what the working
memory structure is? If stored in standard Perl variables, that's going to
be at least 5 gig, even if they are all floats and never used in a
stringish way.

The input data is lots more than that. I figure I would have around
50,000 genes * 2500 results stored in my hash at the end.
You could use "pack" to pack the data into C-type variables that are
held (in bulk) in strings. But that would still take at least 1 gig,
and I assume you actually want to do something with these numbers, which
would be difficult if they are all packed.

Yes I do want to do something else with the numbers, but printing them
out would be fine for now. Once I have everything sorted in a logical
way, I can easily do what I want to.

Does your output consist of all 250 million items? If not, then
perhaps you don't need to store them all after all.

Yes my output does, and I do need them all. Is there a cleverer way I
could print them out as I go along? E.g. I could have each
gene name that I come across inserted at the beginning of the next
free row, and then put the associated result values in the appropriate
place in the file.

Thanks

MB



Xho
 

Marcus Brody

Hey Xho

Well, am I being really patronising re: biological data?
Are you another biologist by any chance (e-mail address removed)???

If not, it's a very weird coincidence....

Anyways, I have a reasonable idea concerning public microarray data,
if you're interested. It's just a side project at the mo, but I reckon
there's a paper in it...

MB
 

ctcgag

It's (somebody else's) biological data. This isn't even the raw data,
which would be too huuuuge to even imagine.... Therefore it isn't my
fault I am storing so much. Biological data is notorious ;-)

You think that's notorious, chemical data is at least as bad. Then
try to do the Cartesian join between the two! :)

I am parsing it to extract/aggregate the small percentage which is
useful to me.

OK, I just want to verify that you are not accidentally also storing
the data that isn't useful to you.

Essentially, there are a couple of useful things in the "excel"
(tab-delimited) files I want to grab hold of. Each file is a table with
columns denoting ID number, gene name, lots of associated info,
various results, normalisations and statistical calculations etc.
Each row is the results of a single "experiment", giving values for
the above fields.

I think you are already doing this, but just in case...
Don't store the gene name, just the (much smaller) accession number.
The name can then be looked up later.

The same gene appears in many experiments. I want to aggregate the
data for genes together for one of the results. E.g. so for each
gene, I have every result for one type of experiment ordered together.
To give an example of my data structure, this is what I spit out at
the end:

So essentially you have data of this form:

exp_id, gene_id, results

where the data is effectively sorted (i.e. grouped) by exp_id (because they
are all in the same file), and you want to instead have it grouped by
gene_id.

My first effort would be to parse each file, and for every line in the
file do:

print STDOUT "$gene_id\t$exp_id\t$relevant_result\n";

and invoke this script as

./my_script | gzip > out.txt.gz

And see if this will fit in available disk space.
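
A sketch of such a first-pass script (the column positions and where
$exp_id comes from are guesses; adjust them to your files):

#!/usr/bin/perl
use strict;
use warnings;
use PerlIO::gzip;

foreach my $file (@ARGV) {
    open my $fh, '<:gzip', $file or die "can't open $file: $!";
    my $exp_id = $file;    # assumption: one experiment per file
    while (my $line = <$fh>) {
        chomp $line;
        my ($id, $gene_id, @rest) = split /\t/, $line;
        my $relevant_result = $rest[0];    # assumption: first result column
        print "$gene_id\t$exp_id\t$relevant_result\n";
    }
    close $fh;
}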

Then I would use system utilities (are you using a real operating system,
or Windows? If Windows, good luck!) to sort out.txt so it would be
grouped by gene_id rather than exp_id, and then have another Perl script
parse that already grouped data.
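
With GNU sort that could be something like:

zcat out.txt.gz | sort -k1,1 | gzip > by_gene.txt.gz

(sort's -T option lets you point its temporary files at whichever
partition has the most free space.)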

If you don't have enough disk scratch space to do the sort as one
command, there are various work-arounds.

foreach my $key (keys %genes) {
    # This is the gene name, followed by how many experiments were for that gene
    print "$key $genes{$key}{occurences}\t";

    # These are all the values for the experimental results for that gene.
    # The number of results varies from gene to gene.
    foreach my $value ( @{$genes{$key}{result_value}} ) {
        print "$value\t";
    }
    print "\n";
}

I don't see any key by experiment here! Surely you can't just
load all the different experiment results into one amorphous array,
with no way to distinguish them.

# Please excuse the poor code, I am a newbie to *all of this*.

Furthermore, there is some other associated info in that hash, which I
am not printing out at this stage, but may wish to when the thing is
working.

I would start by not even storing that associated info. Once you get
the bare bones working, then go back and see if you can add it.
The input data is lots more than that. I figure I would have around
50,000 genes * 2500 results stored in my hash at the end.

Even if each result is a single float value and nothing more, that
is going to take at least 2.5 gig of memory. If you pack the
floats it would take ~500 meg of memory. This doesn't include
whatever labels would be necessary to identify which experiment
a given value is for.

I think this means you are going to have to abandon in-memory hashes, and
go to hashes tied to disk, or parsing the data into an RDBMS (MySQL,
Oracle), both of which will likely take more disk space than you seem to
have, or using system tools to manipulate flat files, or resorting to
multi-pass methods.
Yes I do want to do something else with the numbers, but printing them
out would be fine for now. Once I have everything sorted in a logical
way, I can easily do what I want to.

Yes, so you can pack the numbers as you aggregate them, then unpack
one-by-one at the end when you want to do something with them. But even
this will probably use too much memory.
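
Roughly like this (the pack format and names are just for illustration;
$gene_id and $result come from your parse loop):

my %packed;    # gene_id => string of packed floats

# while aggregating, append ~4 bytes per result instead of a whole scalar:
$packed{$gene_id} .= pack 'f', $result;

# at the end, unpack one gene at a time:
for my $gene (keys %packed) {
    my @results = unpack 'f*', $packed{$gene};
    print join("\t", $gene, scalar @results, @results), "\n";
}
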
Yes my output does, and I do need them all.

Damn! Then there is no easy solution. I just wanted to make sure
that was the case before delving into the harder ones.
Is there a cleverer way I
could print them out as I go along? E.g. I could have each
gene name that I come across inserted at the beginning of the next
free row, and then put the associated result values in the appropriate
place in the file.

I don't think it's feasible with ordinary files in Perl with a single-pass
approach. I'd suggest the flat files and system sort utility. If not,
then maybe something like this:

Write a func() which, given a gene_id, returns a number from 0 to 49,
with each number having about the same number of genes falling under it.
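
For instance, a minimal sketch using unpack's checksum feature (any
stable mapping that spreads the genes evenly would do):

sub func {
    my ($gene_id) = @_;
    # byte checksum of the id string, folded into 0..49
    return unpack('%32C*', $gene_id) % 50;
}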

Open 50 file handles to compressing pipes into files:

my @handle;
foreach ('00'..'49') {
    open my $fh, "| gzip > $_.gz" or die $!;
    push @handle, $fh;
}

Now, parse the "excel" files, and for every row, print to the
appropriate file:
print {$handle[func($gene_id)]} $data;

Now each file
1) contains all the information for any gene for which it contains any
   information, and
2) is a manageable size.
So these can be re-processed to aggregate the by-gene info.
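
The re-processing pass is then the same hash-of-arrays idea you already
have, just on a file a fiftieth of the size. A sketch (bucket file name
on the command line, field order assumed from the print above):

#!/usr/bin/perl
use strict;
use warnings;
use PerlIO::gzip;

my %genes;
open my $in, '<:gzip', $ARGV[0] or die "can't open $ARGV[0]: $!";
while (<$in>) {
    chomp;
    my ($gene_id, $exp_id, $result) = split /\t/;
    push @{ $genes{$gene_id} }, $result;
}
close $in;

for my $gene (keys %genes) {
    my @results = @{ $genes{$gene} };
    print join("\t", $gene, scalar @results, @results), "\n";
}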

If you don't have enough disk space to hold all 50 output files
simultaneously, then you could simply have the program output
to a single file, but only if func($gene_id) == $ARGV[0].

Then, you'd have to run the parser program fifty times with each
argument from 00 to 49, processing the intermediate file
(and then removing it) between each parse run.



In summary, buy more disk space!

Xho
 

Marcus Brody

Thanks Xho

I had already come to some of the conclusions you had suggested. I
now have a first script which I invoke with ./1st.pl | gzip >
out.txt.gz, producing a 0.5 GB file with everything I need. This is
lines with gene_symbol (e.g. HMG1, not the long name, just the HGNC
one hopefully), id_number, result.

The second script (not written yet) will do a slow sorting routine (I
can think of quicker ways of doing it, but I fear they will eat
memory), which will basically cluster all the results with the same
symbol or id together. E.g. looping over the file again and again,
where loop 1 takes the first id/name and spits out results to
STDOUT with the same id/name. If I pipe this to gzip, the resulting
file should still be 0.5 GB, and it shouldn't run out of memory, as I
expect only 5000 experiments max for each gene.
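
Something like this per pass is what I have in mind (field order assumed
from my first script; the target gene symbol comes in as an argument):

#!/usr/bin/perl
use strict;
use warnings;
use PerlIO::gzip;

my ($target) = @ARGV;    # the gene symbol/id for this pass
my @results;

open my $in, '<:gzip', 'out.txt.gz' or die "can't open out.txt.gz: $!";
while (<$in>) {
    chomp;
    my ($symbol, $id, $result) = split /\t/;
    push @results, $result if $symbol eq $target;
}
close $in;

print join("\t", $target, scalar @results, @results), "\n";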

I think this could work....

Thanks for all the help Xho - I guess I was just trying to be greedy
and do everything at once :) I posted here hoping there was a quick
fix ;-)

To address a couple of your points:
I don't see any key by experiment here! Surely you can't just
load all the different experiment results into one amorphous array,
with no way to distinguish them.

You're right - there is no key by experiment. All experiments for each
gene are in one amorphous array. And you know what - I don't give a
damn. Trust me, this is correct for what I intend to do!!!

I think this means you are going to have to abandon in-memory hashes, and
go to hashes tied to disk, or parsing the data into an RDBMS (MySQL,
Oracle), both of which will likely take more disk space than you seem to
have, or using system tools to manipulate flat files, or resorting to
multi-pass methods. [...]
In summary, buy more disk space!


I think the above two points would offer the cleanest solution.
However: 1. I don't know any RDBMS (intend to learn, could be
useful...) 2. I'm skint ;-) 3. I just wanna get this done - don't
care if it is inelegant, just correct.


Thanks again

MB
 
