How much time does it need to sort 1 million random 64-bit/32-bit integers?

D

dmackay

Weng said:
<snip>

If a table of coefficient can be available based on frequency and cache
size, it is convenient for everyone to get a taste on how fast a
sorting algorithm is.

Would you add another axis to the table for Issue width? Retirement
width? The number of ALUs? What about ALUs of a certain type?
Perhaps an entry for the number of architected vs physical registers?
What about the latency of a cache hit? The latency of a miss? What
about the size of L1 vs L2 vs L3? Inclusive or exclusive caches?
Interconnect network? Memory interface? Microcode issues? (This list
could go on for pages))

I believe I understand what you're attempting to do but the number of
performance altering variables you would be omitting is astounding.
That's part of the reason such constants are ignored in complexity
theory.

If you would like to make up some magic number for your particular
hardware, algorithm, coding style, compiler, etc feel free. Will that
constant mean anything on another system? Generally not.
 
D

Dann Corbit

Kolja Sulimma said:
Forget about comaprisons on PC hardware. You are interested in the
number of load/stores because these are a lot slower than comparisons.
A typical comparison sort will have two loads and two store per
comparison.

Radix Sort with 11-bit digits requires three O(N) passes for a total of
about 12M loads and 6M stores.
If your cache is big enough you could also do only two passes using
16-bit digits.

In the case of LSD radix sort, you can do both 16-bit counting passes in a
single pass (count high 16 bits bucket and low 16 bits bucket when you
access each element), hence only two passes over the data are needed (one
pass to count and one pass to sort). Of course, you need 65536*2 buckets to
hold your counts. If there are one million elements, you would need 32 bit
buckets (actually, 20 bit buckets would do, but a 20 bit integer type would
be pretty unusual).
 
P

Paul Floyd

Also, it's not clear whether lg() is a natural logarithm (the usual
symbol for that is ln()) or a base-2 logarithm or some other base.
So that adds in a second unknown constant factor.

My Knuth books are at work, but I'm pretty certain that he makes it
clear that lg() is log base 2, ln() is log base e and log10 is log base
10. I've never seen any of those notations used to mean anything else.
If you had read any of the books, you would not have much doubt about
Don being imprecise.

A bientot
Paul
 
W

Weng Tianxiang

Hi,
I received a google warning message on key word "fastest sorting
algorithm" this afternoon.

The following is one of its message addresses:
http://groups.google.com/group/rec....est+sorting+algorithm&rnum=8#78ac42b8a1c00578


"Why? It only takes that run to disprove your assertion.
You claimed that the radix sort is the fastest algorithm, this example
disproves that.

10 elements, qsort: ~2500 clocks, bucket: ~37500 clocks.

FYI, with 1000 values, the times were: ~275000 for bucket, ~435000 for
qsort."

Unit is in clocks, it is easier for everyone to compare different
sorting algorithms and I have a good taste on how fast a sorting
algorithm runs.

The message was in 1998, but it is still valid in the order.

Weng
 
D

Dann Corbit

Weng Tianxiang said:
Hi,
I received a google warning message on key word "fastest sorting
algorithm" this afternoon.

The following is one of its message addresses:
http://groups.google.com/group/rec....est+sorting+algorithm&rnum=8#78ac42b8a1c00578


"Why? It only takes that run to disprove your assertion.
You claimed that the radix sort is the fastest algorithm, this example
disproves that.

10 elements, qsort: ~2500 clocks, bucket: ~37500 clocks.

FYI, with 1000 values, the times were: ~275000 for bucket, ~435000 for
qsort."

Unit is in clocks, it is easier for everyone to compare different
sorting algorithms and I have a good taste on how fast a sorting
algorithm runs.

The message was in 1998, but it is still valid in the order.

Fastest is usually important in big data sets. In other words, if we have
10 elements, it does not matter much which sorting method is used. The sort
will be done much faster than we can sneeze.

And typically, sort algorithms fall back to routines that are O(f(n))
inferior but faster in practice for smaller sets.

For example, sorting a huge volume of data:

Start with MSD Radix sort at the top level (as a function of keysize and
rowcount).
Switch to Introspective sort when the set size is under some threshold
(keysize * set_rowcount * system_specific_constant > set_rowcount *
log2(set_rowcount)).
Switch to Insertion sort when the set size is very small (usually 20-30
elements or so).

Typically, if a dataset sorts in 10 milliseconds or 20 microseconds, we
won't care very much. If it is one day verses 1 hour, then it matters a
lot.

P.S.
Here's the world's fastest sort (it only works on sets that have 0 or 1
element):

void worlds_fastest_sort()
{
return;
}
 
W

Weng Tianxiang

Hi Dann,
I have 3 good website to deal with sorting speed:
1. http://www.azillionmonkeys.com/qed/sort.html
2. http://www.eternallyconfuzzled.com/tuts/sorting.html#heap
3.
http://groups.google.com/group/comp...nk=st&q=Bitonic+sort&rnum=10#6d2a65ca85ac8bfc

None of them mentioned radix sorting as the candidate as the fastest
sorting algorithm.

In the last email I referenced a measurement:
with 1000 values, the times were: ~275000 in clocks for bucket, ~435000
in clocks for qsort.

The measurement was made in 1999.

I like to have more similar data in clocks with best algorithm and best
CPU available now.

I really don't understand that for one data, above measurement needs
1000 clocks .

Why? Any comments?

Weng
 
D

Dann Corbit

Weng Tianxiang said:
Hi Dann,
I have 3 good website to deal with sorting speed:
1. http://www.azillionmonkeys.com/qed/sort.html
2. http://www.eternallyconfuzzled.com/tuts/sorting.html#heap
3.
http://groups.google.com/group/comp...nk=st&q=Bitonic+sort&rnum=10#6d2a65ca85ac8bfc

None of them mentioned radix sorting as the candidate as the fastest
sorting algorithm.

If a tree falls in a forest with no ear to hear it, does it make a sound?

Radix sort is the fastest way to sort a long list of integers. Period.
In the last email I referenced a measurement:
with 1000 values, the times were: ~275000 in clocks for bucket, ~435000
in clocks for qsort.

1000 values is a trivial set. There are many optimizations that can be done
for Radix sort. For instance, you can count all the buckets in a single
pass for LSD radix sort. That means you have one counting pass, and (with
16 bit buckets) 2 distribution passes. There is simply no way for a
comparison sort to do better.

You were talking about a million objects in the list. With a list like
that, Radix or bucket sort will dominate. If you have a billion items, then
it is no contest. If (on the other hand) you have very long keys, then the
comparison sorts become competitive. But for keys that are only a few bytes
long, distribution rather than comparison sorts are the obvious choice.

With special data sets and with small data sets, other sorts will be faster.
But you can special case the entry point into your sort routine.
The measurement was made in 1999.

I like to have more similar data in clocks with best algorithm and best
CPU available now.

It is not difficult to do it yourself.
I really don't understand that for one data, above measurement needs
1000 clocks .

Probably a function of CLOCKS_PER_SECOND.
 
B

Brannon

Here's my FPGA solution. I have a bitonic sort core for an FPGA. Fill
up a large FPGA and it will do 64-bit by 64 rows every clock cycle,
pipelined of course. If you had a high-speed connection to the FPGA,
you could stream your data through the FPGA and get out sorted sets of
64. You would then run a merge on those on the host processor. It
should chop lg(6) off your quick sort or merge sort.
 
H

Hal Murray

Here's my FPGA solution. I have a bitonic sort core for an FPGA. Fill
up a large FPGA and it will do 64-bit by 64 rows every clock cycle,

How many I/O pins does that take?
 
W

Weng Tianxiang

Hi Brannon,
As NIST website quoted, bitonic sorting takes O((log n)2/2) stages (or
steps) with n/2 comparators at each stage.

Can you release your design running frequency for 64*64 bitonic sorting
and what type of Xilinx chip or larger is to be used?

Thank you.

Weng
 
B

Brannon

How many I/O pins does that take?

I'm not sure that question makes a whole lot of sense. On the SGI box I
was getting 128bits per clock cycle at 200MHz on a 2v6000. The number
of pins is irrelevant. The sort algorithm was only running at 100MHz.
And it doesn't need to run that fast because it requires at least a few
clock cycles to build up enough data to make the sort worth it.
 
B

Brannon

On the SGI box I
was getting 128bits per clock cycle at 200MHz on a 2v6000.

Actually, I was just thinking, on my Starbridge HC box, I can get
128bits x 4 memory channels x 66MHz. 512@66MHz = 34Gb/s. 128@200MHz =
25Gb/s. Can anybody outrun 34Gb/s input to their FPGA these days? I
know that using all the rocket i/o / LVDS / whatever that current chips
should support input rates significantly faster than that -- but who
actually has a board that implements it?

What rate is SRC getting with their direct FSB connection? I still
dream about a board using an HTX slot, or even a cheap board with some
large Spartans and 4 PCIe in and 4PCIe out (10Gb/s each direction for a
few hundred $). These sort algorithms are all worthless if you cannot
get your data in and out of the chip fast enough.
 
G

Guest

Brannon said:
I'm not sure that question makes a whole lot of sense.

When you claim your FPGA solution will process 4096 bits of data per clock
cycle questioning how your solution gets 4096 bits of data per clock cycle
in and out of the FPGA makes a lot of sense.

--
 
B

Brannon

When you claim your FPGA solution will process 4096 bits of data per clock
cycle questioning how your solution gets 4096 bits of data per clock cycle
in and out of the FPGA makes a lot of sense.

Yes, I see what you're saying now. I didn't realize I had claimed to be
running that fast currently. I was talking about parallelizing the data
once it was inside the chip -- some kind of serial to parallel
operation. That's what I do currently. Still, I think 4096 bits per
clock cycle is possible on a v5 using 533MHz LVDS input with 512 pin
pairs should allow you to run the search algorithm at 66MHz. If you
layed out the board well, you should be able to run some DCI DDR stuff
at 266MHz using the newer chips. That's not out of spec, is it? I
thought the DDR2 memory ran that fast. Currently my fastest input is
512bits at 66MHz.
 
B

Brannon

Can you release your design running frequency for 64*64 bitonic sorting
and what type of Xilinx chip or larger is to be used?

So here are my results. First, I've never ran 64x64. I didn't mean to
imply that I could -- just that it was possible. I can run at 100MHz on
a 2v6000 with 16x64bit. It should run twice that fast on a V4. I
compiled a 32x64bit and got this result: 50837 LUTs and 72720 FFs.
That's larger than my chip. I compiled the 64x64bit version and got
140461 LUTs and 199888 FFs. That's pretty darn big. It includes
registers for parallelizing the input which is coming in at 64bits per
clock for those examples. I'm sure my implementation is not the most
efficient possible, but I still think you'll need a pretty large chip
to do any kind of serious parallel sorting.
 
W

Weng Tianxiang

Hi Brannon,
Thank you for sharing your information with us. You are experienced
expert in this regard. I appreciate your work and I have never seen a
project with such huge resource demand. I was said a v2-6000 costs
nearly $8K.

By the way, I cannot find 'bitonic' name in Microsoft Encarta computer
dictionary. What does it mean?

Weng
 
B

Brannon

Weng said:
Hi Brannon,
Thank you for sharing your information with us. You are experienced
expert in this regard. I appreciate your work and I have never seen a
project with such huge resource demand. I was said a v2-6000 costs
nearly $8K.

2v-6000-4 is $2k. Don't pay any more than that for one.
By the way, I cannot find 'bitonic' name in Microsoft Encarta computer
dictionary. What does it mean?

I have no idea what 'tonic' is refering to. I got all my information
including the exact hardware implementation from this page:

http://www.inf.fh-flensburg.de/lang/algorithmen/sortieren/bitonic/bitonicen.htm
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,170
Messages
2,570,925
Members
47,468
Latest member
Fannie44U3

Latest Threads

Top