jandot
Hi all,
There is some interest in the bioinformatics community in using Rake
as a workflow tool (see e.g. http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/).
Rake could be ideal for this type of work: a typical workflow will
take data and perform a first set of conversions on it (i.e. a task),
followed by a second set of conversions (that is dependent on the
first task), and so on.
However, bioinformaticians try to keep their data in databases rather
than files. And we found we need some workarounds to get dependencies
working. Does anyone know if it would be very difficult to add
functionality to rake to check a meta table in a database for
timestamps of tasks rather than looking at timestamps of files? I was
thinking of a table looking like the one below:
table: meta

task                            modified_on
==============================================
001_load_data                   20080602_0831
002_calculate_averages          20080602_0845
003_make_histogram_of_averages  20080602_0851
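To make the table concrete, here is a minimal Ruby sketch of reading and writing stamps in that format. The hash `meta` is only an in-memory stand-in for the real database table, and `STAMP_FORMAT`, `parse_stamp` and `touch` are hypothetical names made up for this illustration.

```ruby
require "time"

# Format used in the modified_on column above, e.g. "20080602_0831".
STAMP_FORMAT = "%Y%m%d_%H%M"

# Hypothetical in-memory stand-in for the meta table: task name => stamp.
meta = {
  "001_load_data"                  => "20080602_0831",
  "002_calculate_averages"         => "20080602_0845",
  "003_make_histogram_of_averages" => "20080602_0851"
}

# Parse a stored stamp into a Time so that two tasks can be compared.
def parse_stamp(stamp)
  Time.strptime(stamp, STAMP_FORMAT)
end

# Record that a task has just finished (the last step of each task body).
def touch(meta, task_name, now = Time.now)
  meta[task_name] = now.strftime(STAMP_FORMAT)
end

loaded   = parse_stamp(meta["001_load_data"])
averaged = parse_stamp(meta["002_calculate_averages"])
puts "averages are stale" if loaded > averaged
```

With a real database, the two helpers would wrap a SELECT and an UPDATE on the meta table instead of a hash lookup.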
The rakefile would then contain:
    # Task names are strings: a Ruby symbol literal cannot start with a digit.
    task "001_load_data" do
      # do stuff
      # automatically update record in meta table
    end

    task "002_calculate_averages" => ["001_load_data"] do
      # do stuff
      # automatically update record in meta table
    end

    task "003_make_histogram_of_averages" => ["002_calculate_averages"] do
      # do stuff
      # automatically update record in meta table
    end
So if we had reloaded the data (001), then the timestamp for that task
in the meta table would be later than the one for task 002. As a
result, task 002 would automatically have to be rerun if we were to
run task 003.
I'd very much like to know if anyone has an idea how rake can be
extended this way. Basically, the dependency checker has to be
extended to look into a fixed table in a database...
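One possible direction, as a rough sketch rather than a tested solution: Rake only executes a task when its `needed?` method returns true, and `Rake::FileTask` decides this from file timestamps. A subclass of `Rake::Task` could make the same decision from the meta table. In the sketch below, `DbTask` and `META` are hypothetical names, and `META` is an in-memory hash standing in for the database table.

```ruby
require "rake"

# Hypothetical stand-in for the meta table: task name => Time of last run.
META = {}

class DbTask < Rake::Task
  # Time of the last recorded run; Rake::EARLY if the task has never run.
  def timestamp
    META.fetch(name, Rake::EARLY)
  end

  # Run if there is no record yet, or if any prerequisite has a later stamp.
  def needed?
    return true unless META.key?(name)
    prerequisites.any? { |pre| Rake::Task[pre].timestamp > timestamp }
  end

  # Run the task body, then update the record in the meta table.
  def execute(args = nil)
    super
    META[name] = Time.now
  end
end

# define_task is the class-level method behind Rake's `task` helper.
DbTask.define_task("001_load_data") do
  puts "loading data"
end

DbTask.define_task("002_calculate_averages" => ["001_load_data"]) do
  puts "calculating averages"
end

DbTask.define_task("003_make_histogram_of_averages" => ["002_calculate_averages"]) do
  puts "making histogram"
end
```

Because Rake invokes prerequisites before it asks a task whether it is needed, a rerun of 002 updates its stamp first, so 003 then sees a newer prerequisite and reruns as well. Swapping the hash for real queries should only touch `timestamp`, `needed?` and the last line of `execute`.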
Many thanks,
Jan Aerts
-
=================================
Dr Jan Aerts
Senior Bioinformatician
Genome Dynamics and Evolution Group
Wellcome Trust Sanger Institute
Hinxton
Cambridge CB10 1SA
UK
phone: +44 (0)1223 - 494732
web: http://www.sanger.ac.uk/Teams/Team29/