sort an array of strings


Rainmaker

Hi,

Can anyone tell me an efficient algorithm to sort an array of strings?
Keep in mind that this array is HUGE and so the algorithm should be
efficient enough to deal with it.

Thanks
 

Richard Heathfield

Rainmaker said:
Hi,

Can anyone tell me an efficient algorithm to sort an array of strings?
Keep in mind that this array is HUGE and so the algorithm should be
efficient enough to deal with it.

Quicksort or mergesort should do fine. Which you pick depends on just how
huge the array is.

Don't forget you probably don't need to sort the array of strings at all.
It's quite likely you can get away with merely sorting an array of
pointers, where each pointer points to one of the strings.
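A minimal sketch of that pointer-sorting idea, using the standard
qsort() with a strcmp() wrapper (the sample strings are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* qsort() passes the comparator pointers to the array elements;
   the elements here are char *, so we receive char ** in disguise. */
static int cmpstr(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

int main(void)
{
    const char *s[] = { "pear", "apple", "orange", "banana" };
    size_t i, n = sizeof s / sizeof s[0];

    qsort(s, n, sizeof s[0], cmpstr);   /* only the pointers move */

    for (i = 0; i < n; i++)
        puts(s[i]);
    return 0;
}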
 

Keith Thompson

Rainmaker said:
Can anyone tell me an efficient algorithm to sort an array of strings?
Keep in mind that this array is HUGE and so the algorithm should be
efficient enough to deal with it.

qsort() should be reasonably efficient. You might get better
performance by rolling your own sorting function (perhaps based on a
qsort() implementation -- which isn't necessarily Quicksort), which
could avoid the overhead caused by the indirect calls to the
comparison function.

Obviously, sorting pointers rather than shuffling the strings around
is going to save you a lot of time. (I'm not even sure how you'd go
about shuffling the strings themselves if they're of variable
lengths.)
 

Joe Wright

Keith said:
qsort() should be reasonably efficient. You might get better
performance by rolling your own sorting function (perhaps based on a
qsort() implementation -- which isn't necessarily Quicksort), which
could avoid the overhead caused by the indirect calls to the
comparison function.

Obviously, sorting pointers rather than shuffling the strings around
is going to save you a lot of time. (I'm not even sure how you'd go
about shuffling the strings themselves if they're of variable
lengths.)
You can shuffle variable-length strings themselves if they're adjacent,
suggesting an insertion sort. It is very slow. Sorting pointers is a much
better idea if you can.
 

Keith Thompson

Joe Wright said:
You can shuffle variable-length strings themselves if they're
adjacent, suggesting an insertion sort. It is very slow. Sorting
pointers is a much better idea if you can.

If you must shuffle the strings themselves, you can first sort
pointers to the strings, then use the sorted array of pointers to
traverse the strings and copy them to a new location. This requires
enough space to hold two copies of all your strings.
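A sketch of that two-copy approach, assuming the pointer array is
already sorted (the function name and parameters are illustrative):

#include <stdlib.h>
#include <string.h>

/* Copy the strings into a fresh buffer in sorted order, using the
   already-sorted pointer array.  `total` is the combined size of all
   the strings including their terminating '\0's -- this is the second
   copy mentioned above.  The caller frees the result. */
char *compact_sorted(char **ptrs, size_t n, size_t total)
{
    char *buf = malloc(total);
    char *out = buf;
    size_t i;

    if (buf == NULL)
        return NULL;
    for (i = 0; i < n; i++) {
        size_t len = strlen(ptrs[i]) + 1;   /* include the '\0' */
        memcpy(out, ptrs[i], len);
        out += len;
    }
    return buf;
}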

Shuffling the strings in place probably requires shifting many of the
strings multiple times; if you really need to do that, the best
solution is probably to use something like insertion sort that only
swaps adjacent elements.

But it's unlikely that you really need to shuffle the strings.
 

Eric Sosman

Keith said:
Shuffling the strings in place probably requires shifting many of the
strings multiple times; if you really need to do that, the best
solution is probably to use something like insertion sort that only
swaps adjacent elements.

Personally, I wouldn't recommend insertion sort for
a HUGE number of anything. O(HUGE^2) is simply too awful
to contemplate ...

The real difficulty with Rainmaker's question is that
he gives no clue about what HUGE means. The choice of a
strategy depends a lot on whether HUGE fits in physical
memory, is too big for physical memory but fits in virtual
memory, or is so big it won't even fit in virtual memory.
Given that lack of information, there's no way to offer a
reliable suggestion. Even my objection to O(HUGE^2) may
be faulty, if HUGE means only "More than I can count on
the fingers of Captain Hook's other hand."
 

Rainmaker

Hi,

Sorry for the delay...
By HUGE I mean gigabytes of data... it may be more...

Sorting the pointers seems to me a good solution. Can you please give
me a small example (in C) of how to achieve this?

Rainmaker
 

Peter Ammon

Rainmaker said:
Hi,

Sorry for the delay...
By HUGE I mean gigabytes of data... it may be more...

Sorting the pointers seems to me a good solution. Can you please give
me a small example (in C) of how to achieve this?

Rainmaker

If you are working with gigabytes of data, then you likely do not want
to store it all in main memory. But see the CLC FAQ, question 13.11, at
http://www.eskimo.com/~scs/C-faq/q13.11.html

-Peter
 

Richard Heathfield

Rainmaker said:
Hi,

Sorry for the delay...
By HUGE I mean gigabytes of data... it may be more...

Then presumably this data is on file. I suggest you keep it there! :)
Sorting the pointers seems to me a good solution. Can you please give
me a small example (in C) of how to achieve this?

Well, not a small example, no! :) But if you're talking Gigabytes, I think
mergesort is going to be your best bet.

If you don't have a copy of Knuth ("The Art of Computer Programming", Volume
3, "Sorting and Searching"), then http://en.wikipedia.org/wiki/Mergesort
would be a good starting place.
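For reference, here is a minimal in-memory mergesort of an array of
string pointers -- a sketch of the textbook algorithm Richard names,
not a tuned implementation:

#include <stdlib.h>
#include <string.h>

/* Stable top-down mergesort of an array of string pointers, using a
   scratch array of the same size. */
static void merge(char **a, char **tmp, size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;

    while (i < mid && j < hi)
        tmp[k++] = (strcmp(a[i], a[j]) <= 0) ? a[i++] : a[j++];
    while (i < mid)
        tmp[k++] = a[i++];
    while (j < hi)
        tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof *a);
}

static void msort(char **a, char **tmp, size_t lo, size_t hi)
{
    size_t mid;

    if (hi - lo < 2)
        return;
    mid = lo + (hi - lo) / 2;
    msort(a, tmp, lo, mid);
    msort(a, tmp, mid, hi);
    merge(a, tmp, lo, mid, hi);
}

void sort_strings(char **a, size_t n)
{
    char **tmp = malloc(n * sizeof *tmp);

    if (tmp != NULL) {
        msort(a, tmp, 0, n);
        free(tmp);
    }
}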
 

Eric Sosman

Rainmaker said:
Hi,

Sorry for the delay...
By HUGE I mean gigabytes of data... it may be more...

Still no quantitative information. How many gigabytes?
How much more? Terabytes? Zettabytes?
Sorting the pointers seems to me a good solution. Can you please give
me a small example (in C) of how to achieve this?

If the data won't fit in memory, sorting pointers to it
doesn't help much: The eventual rearrangement of the pointed-
to data (whether explicit or implicit) will involve an awful
lot of bouncing around on the disk. Let's see: one gigabyte
of 100-byte strings is ten million strings. If you must do
a 10-ms seek before reading each of them that's 100,000
seconds or about twenty-eight hours -- and that's *after*
sorting the pointers, which involved a few disk accesses
in its own right ...

Rainmaker, if you want usable advice you're going to have
to describe your problem in considerably more detail. How
many strings are there, how long are they, what do they signify,
is there any pre-existing order you can exploit, why do you
want them sorted anyhow, ...?
 

Gordon Burditt

Eric Sosman said:
If the data won't fit in memory, sorting pointers to it
doesn't help much: The eventual rearrangement of the pointed-
to data (whether explicit or implicit) will involve an awful
lot of bouncing around on the disk.

Sorting an array in memory of structures consisting of (a) the
sort key and (b) the file offset of the record is often a
useful technique, especially if the array of structures fits
in memory even though the records don't. I remember using that
to advantage on a system with 32k of memory in the early 70's.
Copying the file to put it in sorted order did involve a lot
of seeking, and sometimes would overheat the floppy drive.

Gordon L. Burditt
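Gordon's key-plus-offset technique as a minimal sketch; the key length
and the line-oriented record format are assumptions for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define KEYLEN 16      /* assumption: the leading bytes serve as the key */
#define MAXLINE 128    /* the strings are under 100 bytes */

struct rec {
    char key[KEYLEN + 1];
    long offset;                 /* file position of the record */
};

static int cmprec(const void *a, const void *b)
{
    return strcmp(((const struct rec *)a)->key,
                  ((const struct rec *)b)->key);
}

/* Copy the lines of `in` to `out` in key order: one sequential pass
   to build the (key, offset) table, a qsort(), then one seek per
   record -- that last pass is where all the seeking happens. */
int sort_by_offset(FILE *in, FILE *out, size_t maxrecs)
{
    struct rec *tab = malloc(maxrecs * sizeof *tab);
    char line[MAXLINE];
    size_t i, n = 0;

    if (tab == NULL)
        return -1;
    while (n < maxrecs) {
        long pos = ftell(in);

        if (fgets(line, sizeof line, in) == NULL)
            break;
        strncpy(tab[n].key, line, KEYLEN);
        tab[n].key[KEYLEN] = '\0';
        tab[n].offset = pos;
        n++;
    }
    qsort(tab, n, sizeof *tab, cmprec);
    for (i = 0; i < n; i++) {
        fseek(in, tab[i].offset, SEEK_SET);
        if (fgets(line, sizeof line, in) != NULL)
            fputs(line, out);
    }
    free(tab);
    return 0;
}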
 

Eric Sosman

Gordon said:
Sorting an array in memory of structures consisting of (a) the
sort key and (b) the file offset of the record is often a
useful technique, especially if the array of structures fits
in memory even though the records don't. I remember using that
to advantage on a system with 32k of memory in the early 70's.
Copying the file to put it in sorted order did involve a lot
of seeking, and sometimes would overheat the floppy drive.

Yah. Knuth cites a theorem by Floyd showing that the
task of rearranging the unsorted input (after generating a
sorted sequence of key/location pairs) has a lower bound
that makes it at least as hard as sorting the original
records in full, give or take a few constant factors.

Of course, those easily ignored constant factors can
have an awful lot to do with whether a method is or is not
practical. That's why Rainmaker needs to disclose quite a
lot more about his problem before advice can be advanced
with any confidence.
 

Rainmaker

To be precise: take 2 terabytes of data, each string no longer than
100 bytes.

I don't understand why you need to know what these strings signify.
These are normal strings, and you can store anything in them that is
< 100 bytes.

There is no pre-existing order.

Thanks,
Rainmaker
 

Mark McIntyre

Rainmaker said:
I don't understand why you need to know what these strings signify.

The more we understand about your (rather peculiar) requirement, the
more likely it is that someone can come up with an answer. The
solution for 10,000,000 short but different strings could be radically
different from the solution for 10,000 long very similar strings.

For instance, my immediate thought is - 2 TB of totally unsorted data
simply should not have been allowed to come into existence. You have a
major problem with whatever is creating these files, and you need to
take a step back and redesign that end of it.

Another thought - if the strings can be grouped into similar sets,
e.g. everything that starts "foobarpotato" in one group, everything
that starts "wibblewobble" in another, you could perhaps rapidly sort
into groups, then sort each group, perhaps writing them out to
separate files and using filesystem tools such as the Unix sort
utility.

Hence it's useful to know more.
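A sketch of that grouping idea: a single pass that distributes lines
into one of 256 bucket files by first byte, opening buckets lazily so
at most 256 files are ever open at once; the file-naming scheme is
made up for illustration.

#include <stdio.h>

/* Distribute lines into one of 256 bucket files keyed on the first
   byte.  Each bucket can then be sorted on its own, e.g. with the
   Unix sort utility. */
int distribute(FILE *in)
{
    FILE *bucket[256] = { NULL };
    char line[128], name[32];
    unsigned b;

    while (fgets(line, sizeof line, in) != NULL) {
        b = (unsigned char)line[0];
        if (bucket[b] == NULL) {
            sprintf(name, "bucket.%03u", b);
            if ((bucket[b] = fopen(name, "w")) == NULL)
                return -1;
        }
        fputs(line, bucket[b]);
    }
    for (b = 0; b < 256; b++)
        if (bucket[b] != NULL)
            fclose(bucket[b]);
    return 0;
}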
 

Joe Wright

Rainmaker said:
To be precise: take 2 terabytes of data, each string no longer than
100 bytes.

I don't understand why you need to know what these strings signify.
These are normal strings, and you can store anything in them that is
< 100 bytes.

There is no pre-existing order.

Thanks,
Rainmaker

Eric Sosman wrote:

My apologies to Eric; I have snipped everything he wrote.
Now, Rainmaker. Let's see... 2 terabytes is 2.0e12 bytes, right? 100
bytes is 1.0e2. Dividing file size by line length gives 2.0e10.
That's 20 giga-lines, right?

We're trying to get a grip on what you have and what you are trying to
achieve. Your data set seems overly large. There aren't 20 giga-lines in
all the books in the Library of Congress.
 

Peter Nilsson

[Please don't top-post.]
Rainmaker said:
To be precise: take 2 terabytes of data, each string no longer than
100 bytes.

I don't understand why you need to know what these strings signify.

Perhaps you don't, but then you don't understand how to sort such
strings either, so I don't understand why you're withholding
information that could help you get an efficient solution to your
problem. ;-)
These are normal strings, and you can store anything in them that is
< 100 bytes.

What is a 'normal' string?
There is no pre-existing order.

If they can store _anything_, why exactly are you sorting them? How
will that benefit your future processing?

It sounds like you don't actually have a C question per se; rather,
you have a programming issue that could probably be better answered in
another, more general, forum, e.g. comp.programming.

But you're unlikely to get different answers there either if you're
not going to offer clues that might help to make the sort more
streamlined.
 

Walter Roberson

Rainmaker said:
Can anyone tell me an efficient algorithm to sort an array of strings?
Keep in mind that this array is HUGE and so the algorithm should be
efficient enough to deal with it.

Taking into account the other information you posted, about the
amount of data and about the size of each string, and the lack of
information about what the strings signify (information that might
potentially allow for a better algorithm based upon the internal
structure of the strings, or based upon string probabilities):

I wonder whether you really need to -sort- the strings, or if you
only need to be able to -locate- a string efficiently?

Often, the most efficient way for locating particular strings is -not-
through sorting them. Instead, the most efficient way might be
through some kind of hash table. The efficiency of the hash table
would depend upon matters such as whether you need to be able to
update the table without rebuilding it; upon the distribution
of the strings; and upon restrictions you may need to impose about
paging data in from disk.
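A minimal chained hash table illustrating Walter's locate-rather-than-
sort alternative; the fixed table size and the multiply-by-33 hash are
illustrative choices only, not recommendations:

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1048576u

struct node {
    char *s;
    struct node *next;
};

static struct node *table[NBUCKETS];

static unsigned long hash(const char *s)
{
    unsigned long h = 5381;

    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Insert keeps a pointer to the caller's string; no copy is made. */
int insert(char *s)
{
    struct node *n = malloc(sizeof *n);
    unsigned long h = hash(s);

    if (n == NULL)
        return -1;
    n->s = s;
    n->next = table[h];
    table[h] = n;
    return 0;
}

const char *lookup(const char *s)
{
    const struct node *n;

    for (n = table[hash(s)]; n != NULL; n = n->next)
        if (strcmp(n->s, s) == 0)
            return n->s;
    return NULL;
}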
 

Eric Sosman

Rainmaker said:
To be precise: take 2 terabytes of data, each string no longer than
100 bytes.

Okay: at least 20,000,000,000 strings.
I don't understand why you need to know what these strings signify.

For example, suppose they are all United States postal
("ZIP") codes consisting of five digits. You could "sort"
them by creating a table of 100,000 counters, reading the
data and counting how many times each code appears, and
then traversing the table in order, outputting the proper
number of copies of the input string. Knowing something
about the data doesn't always open up other approaches, but
it sometimes does.
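Eric's ZIP-code example as a sketch: a counting sort that never
compares one string with another (I/O details are simplified):

#include <stdio.h>

#define NCODES 100000UL          /* 00000 .. 99999 */

static unsigned long count[NCODES];

/* Count how often each five-digit code appears, then write the codes
   back out in order -- two linear passes, no comparisons at all. */
void zip_counting_sort(FILE *in, FILE *out)
{
    char line[16];
    unsigned long code, c;

    while (fgets(line, sizeof line, in) != NULL)
        if (sscanf(line, "%5lu", &code) == 1 && code < NCODES)
            count[code]++;
    for (code = 0; code < NCODES; code++)
        for (c = 0; c < count[code]; c++)
            fprintf(out, "%05lu\n", code);
}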

More generally, it's often helpful to describe the
overall problem you're trying to solve, not just seek advice
on one of the steps in a solution you've come up with. It
may happen that there's a better way to attack the larger
problem, one that avoids the step that's giving trouble.

Despite repeated pleas, though, you persist in dribbling
out tiny bits of information -- always, always less than was
asked for. All right, I wash my hands of you; I've got better
things to do than coddle the uncooperative for free. Based
on the little you've told us about your problem:

- You've got more data than is likely to fit in high-
speed memory, so you need what is known as an "external
sort." Plenty of such programs have been written already
and I suggest you don't set out to write yet another.
Unix systems come with a utility imaginatively named
"sort," and there are commercial products that are faster,
fancier, friendlier, and so on. Choose one and use it.

- The amount of time required to sort your data is hard
to estimate. The I/O speeds of your disks will play
a part, as will the size of your computer memory. If
we suppose you can use ten gigabytes of memory, the
technique of replacement selection applied to your two
terabytes of input will give about one hundred initial
sub-files to sort, and you might be able to finish up
with a hundred-way merge. If you've got only two gig
of buffer space, though, you'll get something like five
hundred sub-files and will probably want a different
merge pattern. In any event, the choice of a good merge
pattern will be heavily dependent on the characteristics
of your computer and disks. (A sketch of such a merge
appears at the end of this message.)

- There's at least a fifty percent chance that you really
don't need to sort the data in the classical sense at
all. But lacking information about what you're trying
to do, I'm unable to suggest any likely shortcuts. Go
ahead and sort, if that's your desire.

... and for further advice, seek another newsgroup. You
do not have a C problem -- not unless you decide to write a
C program, and I suspect you're not yet knowledgeable enough
about sorting to write an effective one -- so you should seek
a forum frequented by sorting experts, not experts in C (or
Fortran, or Java, or ...). Good bye, and good luck.
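The merge pass Eric describes, as a minimal sketch: an N-way merge of
already-sorted run files. A linear minimum search per output line is
adequate for around a hundred runs (a heap pays off with more), and
error handling is trimmed for brevity.

#include <stdio.h>
#include <string.h>

#define MAXRUNS 512
#define MAXLINE 128    /* the strings are under 100 bytes */

/* Merge `nruns` already-sorted run files into `out`.  One line per
   run is held in memory; each step emits the smallest head line and
   refills it from the same run. */
void merge_runs(FILE *run[], int nruns, FILE *out)
{
    static char line[MAXRUNS][MAXLINE];
    int live[MAXRUNS];
    int i, best;

    for (i = 0; i < nruns; i++)          /* prime one line per run */
        live[i] = (fgets(line[i], MAXLINE, run[i]) != NULL);

    for (;;) {
        best = -1;
        for (i = 0; i < nruns; i++)      /* pick the smallest head */
            if (live[i] && (best < 0 || strcmp(line[i], line[best]) < 0))
                best = i;
        if (best < 0)
            break;                       /* every run is exhausted */
        fputs(line[best], out);
        live[best] = (fgets(line[best], MAXLINE, run[best]) != NULL);
    }
}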
 

osmium

Rainmaker said:
To be precise: take 2 terabytes of data, each string no longer than
100 bytes.

I don't understand why you need to know what these strings signify.
These are normal strings, and you can store anything in them that is
< 100 bytes.

There is no pre-existing order.

This link looks kind of reasonable to me. I post it for background and
as a starting point; I didn't follow the links to the code. But I don't
think you want code anyway. I think the number of files your OS will
allow you to have open at one time will become interesting as you
pursue this. Depending on the real numbers in your problem, you may
want to open hundreds of files at a time, and your OS may be the
constraint on what you can do.

I found this with the search < "external sort" huge OR giga >

http://cis.stvincent.edu/swd/extsort/extsort.html
 

websnarf

Rainmaker said:
To be precise: take 2 terabytes of data, each string no longer than
100 bytes.

I don't understand why you need to know what these strings signify.
These are normal strings, and you can store anything in them that is
< 100 bytes.

So, reference URLs from the weblog of a very popular website, or
something like that?

OK, like Peter posted, your top level should be mergesort. This will
allow you to retain locality of reference, which is very important once
you get to sizes that large. But once your recursion makes it down
to in-memory sizes, you should use introsort, which is implemented as
quicksort on top of heapsort (not on top of insertion sort, which some
introsort implementations use). As an additional condition, once you
reach "in-cache" sizes (L2 size is OK) for the partition, switch to
heapsort. And of course, generate pointers first, and sort the
pointers rather than the strings.

Here's why:

1. Ordinary quicksort on *average* performs fewer compares and data
moves than either heap-sort or merge-sort. (But note that the
guaranteed O(n*log(n)) versions of Quicksort, based on the Select
algorithm, do *NOT* have this property -- i.e., it's only worth it if
you use the naive-style algorithm.) So when you are dominated by
comparison and access times, quick-sort is the best bet.

2. On modern processors, heapsort is faster once you reach the cache.
This is not very well known, however; I had to discover it for
myself a couple of years ago: http://www.pobox.com/~qed/sort.html . This
does require that you use "branch removal" techniques, or use a modern
compiler which implements these techniques (such as the Intel
compiler). The reason is that once everything is in the cache, branch
misses start to dominate the actual bottom-line performance. Properly
coded, heapsort almost completely removes all branch mispredictions
(even though it performs three times as many comparisons and twice as
many data moves as quicksort).

3. Random disk access is just tremendously slow, and can only be
mitigated by streaming the data in the order it is stored in the first
place. I.e., the first random access is always far more expensive than
the reads that follow it, if they proceed sequentially from the first
read. Mergesort is the only sorting algorithm here that performs
streaming.

There are some new parallel sorting algorithms which are very close to
O(n*log(n)/#cpus); however, I don't know very much about them. It seems
to me that once you recurse down to "in-memory" size you can dispatch
each segment to a processor, which will basically be close enough. But
I am really just talking off the top of my head on that.
 
