I have multiple data files that I will retrieve from a database query.
These will be on the order of 150K rows, and an indeterminate number
of columns. The columns will include both dates and status codes, and
I will need to build a data structure containing the cumulative count
of status codes over several months, day by day. Then, I need to build
graphical files with line charts.
This is currently done by hand in Excel, and I have been tasked with
automating the process.
Munging the data and getting the cumulative count per status code per
day is a snap in Perl. While I've generated charts in Perl using
GD::Graph, doing it in R is certainly a lot easier, and besides, I am
motivated to learn R.
The raw data needs to be processed. The data I will use will be
contained in a hash of hashes: the keys will be status codes, the
sub-keys will be dates, and the values will be integers, sort of like
this:
$hash{S}{20110601} => 10
$hash{S}{20110602} => 13
$hash{S}{20110603} => 21
$hash{S}{20110604} => 19
$hash{S}{20110605} => 25
$hash{S}{20110606} => 29
$hash{S}{20110607} => 28
So, I can print the hash out as an R-compatible data frame and use it
directly to generate a PDF.
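To make that concrete, here is a rough sketch of the R side, assuming
the Perl script writes a whitespace-delimited file with status, date,
and count columns (the file name and column names below are only
placeholders):

# read the table Perl wrote out; columns assumed to be status, date, count
counts <- read.table("status_counts.txt", header = TRUE,
                     colClasses = c("character", "character", "integer"))
counts$date <- as.Date(counts$date, format = "%Y%m%d")

# one line per status code, written to a PDF
pdf("status_counts.pdf", width = 10, height = 7)
plot(range(counts$date), range(counts$count), type = "n",
     xlab = "Date", ylab = "Cumulative count")
codes <- unique(counts$status)
for (i in seq_along(codes)) {
  d <- counts[counts$status == codes[i], ]
  lines(d$date, d$count, col = i)
}
legend("topleft", legend = codes, col = seq_along(codes), lty = 1)
dev.off()

The only thing Perl then has to do is write that file and kick off the
R script.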
I will use Perl to munge the data and produce an input file for R. I
want to be able to push a button and have the computer do all the
work.
Thanks for your reply, CC.
Actually, while the other responses are correct, there is a simpler
way still. Well, actually two; but it may be blasphemy to say so in
this forum. ;-) Understand, as long as your DB is one of the common
ones (e.g. MS SQL Server, MySQL, PostgreSQL, &c.) there are drivers
that let your R script connect directly to the DB (equivalent to
Perl's DBI). There is therefore no need to waste time on making CSV
files. And, given that, you can either do any data manipulation using
SQL or you can load the raw data into R and use one of its packages to
do the sort of manipulations you'd otherwise do using SQL. Either of
these options will be faster than getting Perl involved in some of the
data manipulation. Trust me, I have tried it in all variations (having
Perl get and manipulate the data, having the DB do the manipulation up
to the point where my models can do their various analyses, and
importing raw data directly from the DB into R and having R do it
all). In my experience, the last of these turned out to be the
fastest. Using SQL's data manipulation capability is faster if the R
script and the DB are on different machines communicating over a slow
network.
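To give you a concrete idea, here is a rough sketch using the DBI
package with RPostgres as the back end; the connection details, table,
and column names are all invented, and you'd swap in whichever driver
matches your actual DB (RMySQL, odbc, etc.):

library(DBI)

# connect straight to the database -- no intermediate CSV needed
# (driver, connection details, table and column names are all made up)
con <- dbConnect(RPostgres::Postgres(), dbname = "mydb",
                 host = "dbhost", user = "me", password = "secret")

raw <- dbGetQuery(con, "
  SELECT   status, event_date, COUNT(*) AS n
  FROM     events
  GROUP BY status, event_date
  ORDER BY status, event_date")
dbDisconnect(con)

raw$n <- as.numeric(raw$n)  # make sure the count is plain numeric

# cumulative count per status code, day by day, done in R
raw$cum <- ave(raw$n, raw$status, FUN = cumsum)

From there you feed raw into whatever plotting code you like; the
charts come out the same whether the numbers arrived via a file or
straight from the DB.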
HTH
Ted
This reduces Perl's role to simply invoking the R script (e.g., the
only way I could run my R programs as scheduled tasks was to write a
simple Perl script that starts them).