Philip Rhoades
Ellie,
Eleanor said:
A few things:
- you left a line in the loop:
File.open( output_filename, 'w' ) do |fout|
which should be deleted
Paste in haste, repent at leisure. I've corrected it to read the
way it appeared in my head when I was looking at it:
http://pastie.org/222765
- I originally used:
stats = []
lines = File.readlines(input_filename, 'r')
but found that reading the whole file (8871 lines) and then
processing the array was inefficient so I got rid of the array
- using:
stats << stats06
If you buffer it as a single read and then work through the file in
memory, it guarantees that you minimise the IO costs of reading. I am
of course assuming that even at 8871 lines your file is much smaller
than your available RAM.
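For anyone comparing the two reading strategies discussed here, a minimal sketch follows. The temp file, its contents, and the variable names are made up for illustration and are not taken from Phil's script. One thing worth knowing: the second positional argument of File.readlines is a line separator rather than an open mode, so the sketch leaves it out.

```ruby
require 'tempfile'

# Hypothetical input: a tiny temp file standing in for the real data file.
input = Tempfile.new('stats_input')
input.puts((1..5).to_a)   # puts with an array writes one element per line
input.close

# Strategy 1: one buffered read, then work through the lines in memory.
buffered_total = File.readlines(input.path).map { |line| line.to_f }.inject(:+)

# Strategy 2: stream the file line by line, with no intermediate array of lines.
streamed_total = 0.0
File.foreach(input.path) { |line| streamed_total += line.to_f }
```

Both strategies compute the same total; which one is faster for a given file is something only a profile run can settle.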
When I did the profile, the array processing was the biggest hit - when
I got rid of the array, I almost halved the time! Ruby arrays are
pretty cool but I think you pay for the convenience...
Doing the file write this way offloads making it efficient to the
Ruby runtime. The file.fsync call will cost you in terms of runtime
performance, but it ensures that the data is flushed to disk before
moving on to the next file which for a large data processing job is
often desirable.
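A rough sketch of the write-then-fsync pattern described above, assuming made-up values and a throwaway temp directory (neither comes from the original script):

```ruby
require 'tmpdir'

stats = [1.5, 2.25, 3.0]   # hypothetical results for one output file
written = nil

Dir.mktmpdir do |dir|
  output_filename = File.join(dir, 'stats.txt')   # hypothetical name

  File.open(output_filename, 'a') do |file|
    file.puts(*stats)   # the splat passes each element as its own argument
    file.fsync          # flush Ruby's buffer and ask the OS to commit to disk
  end

  written = File.read(output_filename)
end
```

Dropping the fsync call leaves the flushing to the runtime and the OS, which is usually faster but gives no guarantee that one file is on disk before the next is started.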
See my other note but it didn't make much difference.
Personally I wouldn't store the results in separate
files but combine them into a single file (possibly even a database),
however I don't know how that would fit with your use case.
There is more post processing using R and for casual inspection it is
convenient to be able to find data according to its file name. It
might still be possible to have fewer, larger files - I might ask
another question about that (basically I have to paste the single column
output of this stuff into 32 column arrays). I have tried DBs for
storing output from the main simulation program when it was all in C/C++
and it was quite slow so I went back to text files...
As to the file.puts *stats, there's no guarantee this approach will
be efficient but compared to doing something like:
File.open(output_filename, "a") do |file|
  stats.each { |stat| file.puts stat }
end
it feels more natural to the problem domain.
Yes, it was good to find out about this alternative.
Another alternative would be:
File.open(output_filename, "a") do |file|
  file.puts stats.join("\n")
end
but that's likely to use more memory as first an in-memory string
will be created, then this will be passed to Ruby's IO code. For the
size of file you're working with that's not likely to be a problem.
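To check that the three ways of writing mentioned in this thread produce the same bytes, here is a small sketch using StringIO in place of real files. The sample values are invented for illustration.

```ruby
require 'stringio'

stats = %w[0.12 0.34 0.56]   # hypothetical values

# Style 1: splat the array into a single puts call.
splat = StringIO.new
splat.puts(*stats)

# Style 2: one puts call per element.
each_line = StringIO.new
stats.each { |stat| each_line.puts stat }

# Style 3: build one string first, then write it in a single call.
joined = StringIO.new
joined.puts stats.join("\n")
```

For a non-empty array all three give identical output. With an empty array the each form writes nothing, while the other two write a single blank line.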
I've a suspicion that your overall algorithm can also be greatly
improved.
I'm sure you are right about that!
In particular the fact that you're forming a cubic array and then
manipulating it raises warning bells and suggests you'll have data
sparsity issues which could be handled in a different way, but that
would require a deeper understanding of your data.
The cubic array was just a direct translation of the C pointer setup I
had - basically it is a rectangular grid of sub-populations each with an
array of allele lengths.
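As a sketch of what "handled in a different way" might look like, assuming most grid cells are empty (an assumption, since the real data isn't shown here), the dense nested array can be compared with a Hash keyed by [row, col]. Grid size and allele lengths below are invented.

```ruby
rows, cols = 3, 4   # hypothetical grid dimensions

# Dense layout: a direct translation of the C setup, every cell gets an array.
dense = Array.new(rows) { Array.new(cols) { [] } }
dense[1][2] << 152 << 154   # hypothetical allele lengths for one sub-population

# Sparse layout: only occupied cells are stored, keyed by [row, col].
# The default block creates an entry on first access, so use key? to
# probe a cell without creating it.
sparse = Hash.new { |hash, key| hash[key] = [] }
sparse[[1, 2]] << 152 << 154

occupied_dense = dense.flatten(1).count { |cell| !cell.empty? }
occupied_sparse = sparse.size
```

If nearly every cell is populated, the nested array is probably the simpler choice; the hash only pays off when the grid is mostly empty.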
Thanks again,
Regards,
Phil.
--
Philip Rhoades
Pricom Pty Limited (ACN 003 252 275 ABN 91 003 252 275)
GPO Box 3411
Sydney NSW 2001
Australia
E-mail: (e-mail address removed)