Rainmaker said:
To be precise,
Take 2 terabytes of data
each string not longer than 100bytes.
Okay: at least 20,000,000,000 strings.
I dont understand why you need to know what these strings signify?
For example, suppose they are all United States postal
("ZIP") codes consisting of five digits. You could "sort"
them by creating a table of 100,000 counters, reading the
data and counting how many times each code appears, and
then traversing the table in order, outputting the proper
number of copies of the input string. Knowing something
about the data doesn't always open up other approaches, but
it sometimes does.
More generally, it's often helpful to describe the
overall problem you're trying to solve, not just seek advice
on one of the steps in a solution you've come up with. It
may happen that there's a better way to attack the larger
problem, one that avoids the step that's giving trouble.
Despite repeated pleas, though, you persist in dribbling
out tiny bits of information -- always, always less than was
asked for. All right, I wash my hands of you; I've got better
things to do than coddle the uncooperative for free. Based
on the little you've told us about your problem:
- You've got more data than is likely to fit in high-
speed memory, so you need what is known as an "external
sort." Plenty of such programs have been written already
and I suggest you don't set out to write yet another.
Unix systems come with a utility imaginatively named
"sort," and there are commercial products that are faster,
fancier, friendlier, and so on. Choose one and use it.
- The amount of time required to sort your data is hard
to estimate. The I/O speeds of your disks will play
a part, as will the size of your computer memory. If
we suppose you can use ten gigabytes of memory, the
technique of replacement selection applied to your two
terabytes of input will give about one hundred initial
sub-files to sort, and you might be able to finish up
with a hundred-way merge. If you've got only two gig
of buffer space, though, you'll get something like five
hundred sub-files and will probably want a different
merge pattern. In any event, the choice of a good merge
pattern will be heavily dependent on the characteristics
of your computer and disks.
- There's at least a fifty percent chance that you really
don't need to sort the data in the classical sense at
all. But lacking information about what you're trying
to do, I'm unable to suggest any likely shortcuts. Go
ahead and sort, if that's your desire.
... and for further advice, seek another newsgroup. You
do not have a C problem -- not unless you decide to write a
C program, and I suspect you're not yet knowledgable enough
about sorting to write an effective one -- so you should seek
a forum frequented by sorting experts, not experts in C (or
Fortran, or Java, or ...). Good bye, and good luck.