Roedy Green
I was thinking how you might go about writing a sort that could handle
more data than could fit in RAM. I handled the problem in Abundance
by checkpointing the app to disk to free up maximum RAM, then spawning
a copy of Opt-Tech Sort. My records were roughly what DataOutputStream
would produce, so I could automatically generate the command script to
sort the fields in any way I wanted.
I thought you might pull it off in Java this way.
1. You write a comparator as if you were going to sort Objects in an
ArrayList.
2. The external sort has an add method that also takes collections.
It accepts a "chunk" of records and sorts them using Sun's sort.
Then it writes them out as SERIALISED objects to a heavily buffered
stream. There may be some way to do a partial reset after each object
to speed it up.
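A minimal sketch of steps 1 and 2, with names of my own invention
(ChunkWriter is hypothetical, the buffer size is a guess, and the
"partial reset" is approximated with ObjectOutputStream.reset(), which
discards the stream's back-reference table):

    import java.io.*;
    import java.util.*;

    // Hypothetical sketch: sort one chunk in RAM with Sun's sort
    // (Collections.sort), then serialise it to a heavily buffered stream.
    class ChunkWriter<T extends Serializable> {
        private final Comparator<T> comparator;

        ChunkWriter(Comparator<T> comparator) {
            this.comparator = comparator;
        }

        // Sorts one chunk in memory, then writes it out as serialised
        // objects, one sorted run per call.
        void writeChunk(List<T> chunk, File file) throws IOException {
            Collections.sort(chunk, comparator);
            try (ObjectOutputStream out = new ObjectOutputStream(
                    new BufferedOutputStream(new FileOutputStream(file),
                                             64 * 1024))) {
                for (T record : chunk) {
                    out.writeObject(record);
                    // Drop the back-reference table after each object so
                    // it does not grow without bound -- the "reset"
                    // mentioned above.
                    out.reset();
                }
            }
        }
    }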
Then you repeat collecting, sorting and writing another batch to
another file.
When you have created N files, you recycle, appending. (Optimal N to
be determined by experiment). Ideally each file would be on a
different physical drive.
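The recycling step is just a round-robin choice of target file. A tiny
sketch, all names hypothetical:

    import java.io.File;

    final class SpillFiles {
        // Chunk k is appended to file k mod N, so sorted runs are spread
        // round-robin and, ideally, land on different physical drives.
        static File targetFor(int chunkIndex, File[] spillFiles) {
            return spillFiles[chunkIndex % spillFiles.length];
        }
    }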
Then when all the records have been added, you start merging chunks
into longer chunks, and writing out the longer chunks. Each N-way
merge cuts the number of chunks to 1/N of what it was and makes the
chunks N times longer.
The final merge pass does not happen until the user invokes the
Iterator to hand over the resulting records.
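The merge can be lazy: hand back an Iterator that does the N-way merge
on demand, so the final pass runs only as the caller pulls records. A
sketch, assuming each run is itself available as an Iterator (the
class name is my invention):

    import java.util.*;

    // Hypothetical lazy N-way merge: a priority queue holds the current
    // head of each run, so next() always hands back the globally
    // smallest record and refills from whichever run it came from.
    class MergingIterator<T> implements Iterator<T> {
        private final PriorityQueue<Map.Entry<T, Iterator<T>>> heap;

        MergingIterator(List<Iterator<T>> runs, Comparator<T> comparator) {
            heap = new PriorityQueue<>(Math.max(1, runs.size()),
                    (a, b) -> comparator.compare(a.getKey(), b.getKey()));
            for (Iterator<T> run : runs) {
                if (run.hasNext()) {
                    heap.add(new AbstractMap.SimpleEntry<>(run.next(), run));
                }
            }
        }

        public boolean hasNext() {
            return !heap.isEmpty();
        }

        public T next() {
            Map.Entry<T, Iterator<T>> smallest = heap.poll();
            if (smallest == null) throw new NoSuchElementException();
            Iterator<T> run = smallest.getValue();
            if (run.hasNext()) {
                // Refill from the run that just surrendered its head.
                heap.add(new AbstractMap.SimpleEntry<>(run.next(), run));
            }
            return smallest.getKey();
        }
    }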
Another way it might be done: the records to be sorted must be byte
arrays, effectively what DataOutputStream would produce. You specify
offset, length and key type, e.g. int, byte, short, float, double,
String.
This would require detailed knowledge of the bit structure of the
records, the way you needed in the olden days of assembler and C.
This would be clumsier to use, but would avoid the overhead of
pickling and reconstituting records on every pass.
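For that flavour, the comparator reads keys straight out of the raw
bytes. A sketch for a single big-endian int key at a fixed offset (the
field layout here is made up for illustration):

    import java.util.Comparator;

    // Hypothetical comparator over raw records: reads a big-endian int
    // key directly from the bytes, with no deserialisation at all.
    final class IntFieldComparator implements Comparator<byte[]> {
        private final int offset; // byte offset of the key in the record

        IntFieldComparator(int offset) {
            this.offset = offset;
        }

        // Reassemble a big-endian int from four bytes, the way
        // DataOutputStream.writeInt would have written it.
        private static int readInt(byte[] record, int off) {
            return ((record[off] & 0xff) << 24)
                 | ((record[off + 1] & 0xff) << 16)
                 | ((record[off + 2] & 0xff) << 8)
                 |  (record[off + 3] & 0xff);
        }

        public int compare(byte[] a, byte[] b) {
            return Integer.compare(readInt(a, offset), readInt(b, offset));
        }
    }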
Then of course, there is the possibility someone has already solved
this and done it well.
The universe has a sneaky habit. Problems start out small, and it
looks like a purely in-RAM solution is perfectly adequate. Then, bit
by bit, they grow and grow and start pushing the limits of the RAM
solution. Suddenly you are faced with a major redesign.