It's (somebody else's) biological data. This isn't even the raw data,
which would be too huge to even imagine.... Therefore it isn't my
fault I am storing so much. Biological data is notorious ;-)
You think that's notorious? Chemical data is at least as bad. Then
try to do the Cartesian join between the two!
I am parsing it to extract/aggregate the small percentage which is
useful to me.
OK, I just want to verify that you are not accidentally also storing
the data that isn't useful to you.
Essentially, there are a couple of useful things in the "excel" (tab
delimited) files I want to grab hold of. Each file is a table with
columns denoting ID number, gene name, lots of associated info,
various results, normalisations, statistical calculations etc.
Each row holds the results of a single "experiment", giving values for
the above fields.
I think you are already doing this, but just in case...
Don't store the gene name, just the (much smaller) accession number.
The name can then be looked up later.
The same gene appears in many experiments. I want to aggregate the
data for genes together for one of the results, e.g. so that for each
gene I have every result for one type of experiment collected together.
To give an example of my data structure, this is what I spit out at
the end:
So essentially you have data of this form:
exp_id, gene_id, results
where the data is effectively sorted (i.e. grouped) by exp_id (because they
are all in the same file), and you want to instead have it grouped by
gene_id.
My first effort would be to parse each file, and for every line in the
file print STDOUT "$gene_id\t$exp_id\t$relevant_result\n";
and invoke this script as
../my_script | gzip > out.txt.gz
and see if this will fit in available disk space.
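A minimal sketch of that first pass (the column positions here are made up; point them at wherever the gene ID, experiment ID and the result you care about actually live, or derive the experiment ID from the file name if each file is one experiment):

#!/usr/bin/perl
use strict;
use warnings;

# First pass: flatten every input file into "gene_id\texp_id\tresult" lines.
foreach my $file (@ARGV) {
    open my $in, '<', $file or die "can't open $file: $!";
    while (my $line = <$in>) {
        chomp $line;
        my @field = split /\t/, $line;
        my ($gene_id, $exp_id, $result) = @field[1, 0, 7];   # made-up column numbers
        print "$gene_id\t$exp_id\t$result\n";
    }
    close $in;
}

invoked along the lines of  ./flatten.pl *.txt | gzip > out.txt.gz  (script and file names invented, of course).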
Then I would use system utilities (are you using a real operating system,
or Windows? If Windows, good luck!) to sort out.txt so it would be
grouped by gene_id rather than exp_id, and then have another Perl script
parse that already-grouped data.
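The sort itself can be as simple as  gunzip -c out.txt.gz | sort | gzip > sorted.txt.gz,  since gene_id is the leading field, so plain lexical sorting groups it. The second script then never needs more than one gene's worth of data in memory at once; a sketch:

#!/usr/bin/perl
use strict;
use warnings;

# Second pass: read the gene_id-sorted stream and print one line per gene.
my ($current_gene, @results);
while (my $line = <STDIN>) {
    chomp $line;
    my ($gene_id, $exp_id, $result) = split /\t/, $line;
    if (defined $current_gene and $gene_id ne $current_gene) {
        print join("\t", $current_gene, scalar @results, @results), "\n";
        @results = ();
    }
    $current_gene = $gene_id;
    push @results, $result;    # or "$exp_id:$result", to keep the experiment label
}
print join("\t", $current_gene, scalar @results, @results), "\n" if defined $current_gene;

fed as  gunzip -c sorted.txt.gz | ./by_gene.pl > by_gene.txt  (names invented again).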
If you don't have enough disk scratch space to do the sort as one
command, there are various work-arounds.
foreach my $key (keys %genes) {
    # This is the gene name, followed by how many experiments were for that gene
    print "$key $genes{$key}{occurences}\t";
    # These are all the values for the experimental results for that gene.
    # The number of results varies from gene to gene.
    foreach my $value ( @{$genes{$key}{result_value}} ) {
        print "$value\t";
    }
    print "\n";
}
I don't see any key by experiment here! Surely you can't just
load all the different experiment results into one amorphous array,
with no way to distinguish them.
#please excuse poor code, I am a newbie to *all of this*
Furthermore, there is some other associated info in that hash, which I
am not printing out at this stage, but may wish to when the thing is
working.
I would start by not even storing that associated info. Once you get
the bare bones working, then go back and see if you can add it.
The input data is lots more than that. I figure I would have around
50,000 genes * 2500 results stored in my hash at the end.
Even if each result is a single float value and nothing more, that is
125 million values, which is going to take at least 2.5 gig of memory
once Perl's per-scalar overhead is counted. If you pack the floats into
strings of raw 4-byte floats it would still take ~500 meg of memory.
This doesn't include whatever labels would be necessary to identify
which experiment a given value is for.
I think this means you are going to have to abandon in-memory hashes, and
either go to hashes tied to disk or parse the data into an RDBMS (MySQL,
Oracle), both of which will likely take more disk space than you seem to
have, or use system tools to manipulate flat files, or resort to
multi-pass methods.
Yes I do want to do something else with the numbers, but printing them
out would be fine for now. Once I have everything sorted in a logical
way, I can easily do what I want to.
Yes, so you can pack the numbers as you aggregate them, then unpack
one-by-one at the end when you want to do something with them. But even
this will probably use too much memory.
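For what it's worth, the packing itself is straightforward; a sketch, with the hash simplified to one packed string of raw floats per gene rather than your full structure:

use strict;
use warnings;

my %packed;    # gene_id => string of packed 4-byte floats

# While aggregating: append each result as a packed float instead of
# keeping one Perl scalar per value.
sub add_result {
    my ($gene_id, $result) = @_;
    $packed{$gene_id} .= pack 'f', $result;
}

# At the end: turn a gene's packed string back into ordinary numbers.
sub results_for {
    my ($gene_id) = @_;
    return unpack 'f*', $packed{$gene_id};
}

(add_result and results_for are just names I made up; the point is the pack/unpack pair.)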
Yes my output does, and I do need them all.
Damn! Then there is no easy solution. I just wanted to make sure
that was the case before delving into the harder ones.
Is there a cleverer way I
could print them out as I was going along? E.g. I could have each
gene name that I come across inserted at the beginning of the next
free row, and then put the associated result values in the appropriate
place in the file.
I don't think it's feasible with ordinary files in Perl with a single-pass
approach. I'd suggest the flat files and the system sort utility. If not,
then maybe something like this:
Write a func() which, given a gene_id, returns a number from 0 to 49,
with about the same number of genes falling under each number.
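One cheap way to get a reasonably even spread, if the gene IDs don't already split nicely on, say, their last two digits, is to hash them; a sketch using the core Digest::MD5 module:

use Digest::MD5 qw(md5);

# Map a gene_id to a bucket number 0..49, roughly uniformly.
sub func {
    my ($gene_id) = @_;
    return unpack('N', md5($gene_id)) % 50;
}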
Open 50 file handles to compressing pipes into files:
my @handle;
foreach ('00'..'49') {
    open my $fh, "|gzip > $_.gz" or die "can't start gzip for $_.gz: $!";
    push @handle, $fh;
}
Now, parse the "excel" files, and for every row, print to the
appropriate file:
print {$handle[func($gene_id)]} $data;
Now each file
1) contains all the information for any gene for which it contains any information, and
2) is a manageable size.
So these can be re-processed to aggregate the by-gene info.
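Assuming what you printed into each bucket were the gene_id/exp_id/result lines from earlier, the re-processing is essentially your original hash, one bucket at a time; a sketch:

#!/usr/bin/perl
use strict;
use warnings;

# Aggregate by gene, one bucket file at a time.
foreach my $bucket ('00'..'49') {
    my %gene;
    open my $in, "gunzip -c $bucket.gz |" or die "can't read $bucket.gz: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($gene_id, $exp_id, $result) = split /\t/, $line;
        push @{ $gene{$gene_id} }, $result;
    }
    close $in;
    foreach my $gene_id (keys %gene) {
        print join("\t", $gene_id, scalar @{ $gene{$gene_id} }, @{ $gene{$gene_id} }), "\n";
    }
}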
If you don't have enough disk space to hold all 50 output files
simultaneously, then you could simply have the program output to a
single file, but only for rows where func($gene_id) == $ARGV[0].
Then you'd have to run the parser program fifty times, once for each
argument from 00 to 49, processing the intermediate file (and then
removing it) between parse runs.
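In terms of the flattening sketch above, that variant is just a bucket number on the command line and a filter in the row loop; a sketch:

#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

sub func { return unpack('N', md5($_[0])) % 50 }

# Keep only the rows whose gene falls in the bucket named on the command line.
my $bucket = shift @ARGV;                               # 0 .. 49
foreach my $file (@ARGV) {
    open my $in, '<', $file or die "can't open $file: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($gene_id, $exp_id, $result) = (split /\t/, $line)[1, 0, 7];   # made-up columns again
        next unless func($gene_id) == $bucket;
        print "$gene_id\t$exp_id\t$result\n";
    }
    close $in;
}

Run it as  ./flatten_one.pl 17 *.txt | gzip > 17.gz,  process and delete 17.gz, then move on to 18, and so on.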
In summary, buy more disk space!
Xho