Eeek. Opened a can of worms!
It's 5:30 am where I am (I'm "still up" and not "up early"), so I'll hit
the highlights, and detail more later if for some reason there's an
interest.
-- application: data aggregation
-- I get data from county services and school districts. They have no
way to correlate and aggregate their data. That's what I am doing. I
take kids who are flagged in county services for whatever reason and
match them up with school records, health records, judicial records
where applicable, etc.
-- I get dozens of CSV files on various subjects. Every school
district's CSV file for any given subject (attendance, grades, etc.) has
similar, but not identical, content. Ditto for demographics from various
agencies.
-- before I can even attempt aggregation, I need to normalize all this
data so it can be indexed and analyzed, and ensure the data is
cleaned and standardized for display.
-- there's a top-layer DSL where I describe data sources and how to
transform any one particular raw data file into a normalized data file.
There's another set of descriptions to allow any one field to come from
a number of possible sources. So, something like birthcity might come
from data source X, but if that's not available, check data source Y, etc.
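-- the precedence part boils down to roughly this kind of lookup (a
minimal sketch; the field names and the per-source hash are
illustrative, not the actual DSL):

    # Illustrative only: first source with a usable value wins.
    FIELD_PRECEDENCE = {
      birthcity: [:source_x, :source_y],
      birthdate: [:source_y, :source_x]
    }

    def resolve_field(field, records_by_source)
      FIELD_PRECEDENCE[field].each do |source|
        value = records_by_source.dig(source, field)
        return value unless value.nil? || value.to_s.strip.empty?
      end
      nil
    end

    # resolve_field(:birthcity, { source_x: { birthcity: '' },
    #                             source_y: { birthcity: 'Omaha' } }) # => "Omaha"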
-- this process has been abstracted into a data aggregation core to do
the normalization, the indexing, and other tasks, with an
application-specific layer to handle the definition of transformations,
indexes that are needed, order of precedence, and the stitching of data
from various sources into records.
-- So, this particular step I've been talking about is where a raw CSV
file undergoes normalization: the fields of each record are reorganized
into a common structure for that given data topic, and each field
undergoes some scrubbing (character case, packing phones, normalizing
date formats, translating codes into strings, etc).
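-- the scrubbing itself is just small per-field rules; a minimal sketch
of the idea (these particular fields and rules are only examples, not
the real rule set):

    require 'date'

    SCRUBBERS = {
      last_name: ->(v) { v.to_s.strip.capitalize },
      phone:     ->(v) { v.to_s.gsub(/\D/, '') },                    # pack to digits only
      birthdate: ->(v) { (Date.parse(v.to_s).iso8601 rescue nil) },  # normalize date format
      gender:    ->(v) { { 'M' => 'Male', 'F' => 'Female' }[v.to_s.upcase] }  # code -> string
    }

    def scrub_record(raw)
      SCRUBBERS.each_with_object({}) do |(field, rule), clean|
        clean[field] = rule.call(raw[field])
      end
    end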
-- raw data files range from a handful of columns to a couple dozen
columns. From a few hundred rows to a couple million rows.
-- data is an array of hashes
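-- i.e. in memory it's literally just rows shaped like this (the keys
here are only examples):

    data = [
      { student_id: '000123', last_name: 'Smith', birthdate: '2010-04-01' },
      { student_id: '000124', last_name: 'Jones', birthdate: '2009-11-17' }
    ]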
-- by the time we get done normalizing a particular raw source, it can
hit the 4GB memory limit any one ruby fork has available to play with
(many forks run in parallel on multiple cores)
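-- the parallelism is plain process forking, one raw source per child,
roughly like this (process_source below is just a stand-in for the real
per-source pipeline):

    CORES   = 4
    sources = %w[district_a_attendance district_b_grades county_health]

    def process_source(source)       # stand-in for the real work
      puts "normalizing #{source} in pid #{Process.pid}"
    end

    sources.each_slice(CORES) do |batch|
      batch.each { |source| fork { process_source(source) } }
      Process.waitall                # finish this batch before starting the next
    end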
-- while most CSV files work out just fine being read into RAM,
transformed 100% in RAM, and then written out in a single simple step,
many files do not.
-- so we have updated the process to load X records from the raw file,
transform X records in memory, then write X records to disk, and loop
until all records are done.
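-- in outline, that loop looks something like this (stdlib CSV; the
chunk size is the real number, but transform here is a trivial stand-in
for the actual normalization):

    require 'csv'

    CHUNK = 200_000

    def transform(rec)               # stand-in for the real normalization
      rec.transform_values { |v| v.to_s.strip }
    end

    CSV.open('normalized.csv', 'w') do |out|
      chunk = []
      CSV.foreach('raw.csv', headers: true) do |row|
        chunk << row.to_h
        next if chunk.size < CHUNK
        chunk.each { |rec| out << transform(rec).values }
        chunk.clear                  # release this batch before loading the next
      end
      chunk.each { |rec| out << transform(rec).values }   # final partial chunk
    end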
-- and we have to deal with indexes, duplicates, and other issues in
there as well
-- imagine 2,000,000 raw records from one file, processed in
200,000-record chunks but output back to another single file.
-- as I step through each chunk of 200,000 records, I can get the
longest record length in that 200,000, and I can store that, but I
can't know what the longest length is for the next 200,000 that I
haven't loaded yet.
-- having processed and written the first 200,000 results to disk, and
then determined the length of the second 200,000, I'm not going to go
back and change the first 200,000 just to make them all the same.
There's no value in that at all.
-- So, when I get done with each 200,000 chunk, I now have a series of
integers which tells me the length of the records in each chunk of
200,000 rows.
-- the file has already been written, so again, I'm not going to go back
and move everything to insert this data at the top (which is where I
would put it if indeed every record were the same length)
-- so, I put this data at the end.
-- BTW, the row lengths are similar enough that the disk efficiency is
not an issue.
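-- writing it out looks roughly like this sketch (the padding and the
trailer layout here are just an illustration of the idea, not the exact
file format):

    # Pad each chunk's records to that chunk's longest length, then
    # append one trailer line listing the per-chunk record lengths.
    def write_chunks(path, chunks)
      chunk_lengths = []
      File.open(path, 'w') do |f|
        chunks.each do |lines|                   # each element: array of record strings
          width = lines.map(&:bytesize).max + 1  # +1 for the newline
          chunk_lengths << width
          lines.each { |l| f.write(l.ljust(width - 1) + "\n") }
        end
        f.write(chunk_lengths.join(',') + "\n")  # the trailer line
      end
      chunk_lengths
    end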
-- I read this last line using tail, and I strip off the leading empty
bytes (if any) as I described earlier
-- it's a couple of very simple calculations to convert any "index"
position into the exact byte seek position to find a specific record.
-- from this point on, the records are read as random access from disk
because otherwise I would need oodles of GB of data in RAM all at once
during the aggregation process.
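-- put together, the trailer read and the seek math look roughly like
this (same assumed layout as the writing sketch above):

    CHUNK = 200_000                              # records per chunk, as above

    def read_trailer(path)
      trailer = `tail -1 #{path}`                # last line of the file
      trailer.sub(/\A[\s\0]+/, '').split(',').map(&:to_i)  # strip leading empty bytes
    end

    def byte_offset(index, chunk_lengths)
      chunk = index / CHUNK
      chunk_lengths.first(chunk).sum * CHUNK + (index % CHUNK) * chunk_lengths[chunk]
    end

    def read_record(path, index, chunk_lengths)
      File.open(path) do |f|
        f.seek(byte_offset(index, chunk_lengths))
        f.read(chunk_lengths[index / CHUNK]).rstrip   # drop the padding
      end
    end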
-- is doing 200,000 or even 500,000 at a time in chunks really any
faster than doing them one at a time? That I actually don't know yet;
I am just now finishing all the code that this "chunking" touches and
verifying I get the same results I used to. The size of the chunks isn't
as important for speed as it is for memory management: making sure I
stay within 4GB.
-- as for the speed issues, we've done a lot of profiling, and even
wrote a mini compiler to read our file transformation DSL and output a
stream of inline variable declarations and commands which gets included
as a module on the fly for each data source. That trick saved us from
parsing the DSL for each data row and literally shaved hours off the
total processing time. We attacked many other levels of optimization
while working to keep the code as readable as possible, because it's a
complicated layering of abstractions and processes.
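-- a toy version of that trick, just to show the shape of it (the spec
format and the generated code here are invented; the real DSL is a lot
more involved):

    # Compile a transform spec into straight-line Ruby once per data
    # source, so the spec isn't re-interpreted for every row.
    SPEC = {
      'last_name' => "row[2].to_s.strip.capitalize",
      'phone'     => "row[5].to_s.gsub(/\\D/, '')"
    }

    def compile_transformer(spec)
      lines = ['def transform_row(row)', '  out = {}']
      spec.each { |field, expr| lines << "  out['#{field}'] = #{expr}" }
      lines << '  out' << 'end'
      Module.new { module_eval(lines.join("\n")) }   # generated code becomes a mixin
    end

    RowTransformer = compile_transformer(SPEC)

    class Worker
      include RowTransformer
    end

    Worker.new.transform_row(['', '', ' smith ', '', '', '(402) 555-1212'])
    # => {"last_name"=>"Smith", "phone"=>"4025551212"}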
-- I will look into cdb
-- gw