Roedy Green
I was thinking how you might go about writing a sort that could handle
more data than could fit in RAM. I handled the problem in Abundance
by checkpointing the app to disk to free up maximum RAM, then spawning
a copy of Opt-Tech Sort. My records were roughly what DataOutputStream
would produce, so I could automatically generate the command script to
sort the fields in any way I wanted.
I thought you might pull it off in Java this way.
1. You write a comparator as if you were going to sort Objects in an
ArrayList.
2. The external sort has an add method that also takes collections.
It accepts a "chunk" of records and sorts them using Sun's sort.
Then it writes them out as SERIALISED objects to a heavily buffered
stream. There may be some way to do a partial reset after each object
to speed it up.
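A minimal sketch of steps 1 and 2, with names of my own invention
(ChunkWriter is hypothetical, the buffer size is a guess, and the
"partial reset" is approximated with ObjectOutputStream.reset(), which
discards the stream's back-reference table):

    import java.io.*;
    import java.util.*;

    // Hypothetical sketch: sort one chunk in RAM with Sun's sort
    // (Collections.sort), then serialise it to a heavily buffered stream.
    class ChunkWriter<T extends Serializable> {
        private final Comparator<T> comparator;

        ChunkWriter(Comparator<T> comparator) {
            this.comparator = comparator;
        }

        // Sorts one chunk in memory, then writes it out as serialised
        // objects, one sorted run per call.
        void writeChunk(List<T> chunk, File file) throws IOException {
            Collections.sort(chunk, comparator);
            try (ObjectOutputStream out = new ObjectOutputStream(
                    new BufferedOutputStream(new FileOutputStream(file),
                                             64 * 1024))) {
                for (T record : chunk) {
                    out.writeObject(record);
                    // Drop the back-reference table after each object so
                    // it does not grow without bound -- the "reset"
                    // mentioned above.
                    out.reset();
                }
            }
        }
    }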
Then you repeat collecting, sorting and writing another batch to
another file.
When you have created N files, you recycle, appending. (Optimal N to
be determined by experiment). Ideally each file would be on a
different physical drive.
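The recycling step is just a round-robin choice of target file. A tiny
sketch, all names hypothetical:

    import java.io.File;

    final class SpillFiles {
        // Chunk k is appended to file k mod N, so sorted runs are spread
        // round-robin and, ideally, land on different physical drives.
        static File targetFor(int chunkIndex, File[] spillFiles) {
            return spillFiles[chunkIndex % spillFiles.length];
        }
    }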
Then when all the records have been added, you start merging chunks
into longer chunks, and writing out the longer chunks. Each N-way
merge cuts the number of chunks to 1/N of what it was and makes the
chunks N times longer.
The final merge pass does not happen until the user invokes the
Iterator to hand over the resulting records.
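The merge can be lazy: hand back an Iterator that does the N-way merge
on demand, so the final pass runs only as the caller pulls records. A
sketch, assuming each run is itself available as an Iterator (the
class name is my invention):

    import java.util.*;

    // Hypothetical lazy N-way merge: a priority queue holds the current
    // head of each run, so next() always hands back the globally
    // smallest record and refills from whichever run it came from.
    class MergingIterator<T> implements Iterator<T> {
        private final PriorityQueue<Map.Entry<T, Iterator<T>>> heap;

        MergingIterator(List<Iterator<T>> runs, Comparator<T> comparator) {
            heap = new PriorityQueue<>(Math.max(1, runs.size()),
                    (a, b) -> comparator.compare(a.getKey(), b.getKey()));
            for (Iterator<T> run : runs) {
                if (run.hasNext()) {
                    heap.add(new AbstractMap.SimpleEntry<>(run.next(), run));
                }
            }
        }

        public boolean hasNext() {
            return !heap.isEmpty();
        }

        public T next() {
            Map.Entry<T, Iterator<T>> smallest = heap.poll();
            if (smallest == null) throw new NoSuchElementException();
            Iterator<T> run = smallest.getValue();
            if (run.hasNext()) {
                // Refill from the run that just surrendered its head.
                heap.add(new AbstractMap.SimpleEntry<>(run.next(), run));
            }
            return smallest.getKey();
        }
    }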
Another way it might be done: the records to be sorted must be byte
arrays, effectively what DataOutputStream would produce. You specify
offset, length and key type, e.g. int, byte, short, float, double,
String.
This would require detailed knowledge of the bit structure of the
records, the way you needed in the olden days of assembler and C.
This would be clumsier to use, but would avoid the overhead of
pickling and reconstituting records on every pass.
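For that flavour, the comparator reads keys straight out of the raw
bytes. A sketch for a single big-endian int key at a fixed offset (the
field layout here is made up for illustration):

    import java.util.Comparator;

    // Hypothetical comparator over raw records: reads a big-endian int
    // key directly from the bytes, with no deserialisation at all.
    final class IntFieldComparator implements Comparator<byte[]> {
        private final int offset; // byte offset of the key in the record

        IntFieldComparator(int offset) {
            this.offset = offset;
        }

        // Reassemble a big-endian int from four bytes, the way
        // DataOutputStream.writeInt would have written it.
        private static int readInt(byte[] record, int off) {
            return ((record[off] & 0xff) << 24)
                 | ((record[off + 1] & 0xff) << 16)
                 | ((record[off + 2] & 0xff) << 8)
                 |  (record[off + 3] & 0xff);
        }

        public int compare(byte[] a, byte[] b) {
            return Integer.compare(readInt(a, offset), readInt(b, offset));
        }
    }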
Then of course, there is the possibility someone has already solved
this and done it well.
The universe has a sneaky habit. Problems start out small, and it
looks like a purely in-RAM solution is perfectly adequate. Then, bit
by bit, they grow and grow and start pushing the limits of the RAM
solution. Suddenly you are faced with a major redesign.