Top 10 list algorithm

J

John Bokma

Bubble sort does insertion by swapping elements. Which is a pretty
aweful way to do insertion. Especially when insertion is supported
with very low (even hardware-level) optimizations.

swapping doesn't mean insertion, you can reuse the allocated memory.
 
A

Anno Siegel

Abigail said:
Anno Siegel ([email protected]) wrote on MMMMLVI September
MCMXCIII in <URL:\\ > Fatted ([email protected]) wrote on MMMMLV September MCMXCIII in
\\ > <URL:\\ > !! Is there a better (faster) way of implementing my top_sort function?
\\ > !! (Creating a top 10 list of the highest numbers from a large set).
\\ > !! Or did I get lucky? :)
\\ >
\\ >
\\ > #!/usr/bin/perl
\\ >
\\ > use strict;
\\ > use warnings;
\\ > no warnings qw /syntax/;
\\ >
\\ > my @heap = map {int rand 1_000_000} 1 .. 100_000;
\\ >
\\ > sub heapify;
\\ > sub heapify {
\\ > my $index = shift;
\\ >
\\ > my ($c1, $c2) = (2 * $index + 1, 2 * $index + 2);
\\ > my $max = $index;
\\ > $max = $c1 if $c1 < @heap && $heap [$c1] > $heap [$max];
\\ > $max = $c2 if $c2 < @heap && $heap [$c2] > $heap [$max];
\\ >
\\ > return if $max == $index;
\\ >
\\ > @heap [$index, $max] = @heap [$max, $index];
\\ > heapify $max;
\\ > }
\\ >
\\ >
\\ > for (my $i = int (@heap / 2); $i >= 0; $i --) {heapify $i}
\\ >
\\ > if (@heap) {
\\ > for (1 .. 10) {
\\ > print $heap [0], "\n";
\\ > my $tmp = pop @heap;
\\ > last unless @heap;
\\ > $heap [0] = $tmp;
\\ > heapify 0;
\\ > }
\\ > }
\\
\\ Well, you're heapifying the entire list, which is still an n log n
\\ operation. Might as well sort :)

No! It's not. Look carefully, and do the math. It's an O (n) operation.

Well, I'll be damned, it is. It doesn't look like it ought to be :)
I didn't do the math, I looked it up in _Introduction to Algorithms_
(Cormen, Leiserson, Rivest) which, I suppose, you have also consulted.
So your simple solution, beautifully presented as always, is also
optimal, at least asymptotically.

I still think that in practice it would pay to keep the heap size down
to k (the number of elements to be extracted). You'd still have to
*inspect* all the input, but the number of "heap operations" (heap
insertions and original (non-recursive) calls to heapify()) would be
much smaller, and operate on a smaller heap.

Anno
 
S

Shawn Corey

Anno said:
I still think that in practice it would pay to keep the heap size down
to k (the number of elements to be extracted). You'd still have to
*inspect* all the input, but the number of "heap operations" (heap
insertions and original (non-recursive) calls to heapify()) would be
much smaller, and operate on a smaller heap.

Anno

Unfortunately, no. I wrote some benchmarks just to see what happens in
practice. The subroutine heap_each() is commented out since it takes so
long to complete.

--- Shawn

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark;

# Un-comment for testing; do forget to un-comment
# the print statements in each suborutine.
# heap_once();
# heap_occasional();
# heap_each();
# sort_once();
# sort_occasional();
# sort_each();
# simple();
# exit;

timethese( 100,
{
simple => \&simple,
heap_once => \&heap_once,
heap_occasional => \&heap_occasional,
# heap_each => \&heap_each,
sort_once => \&sort_once,
sort_occasional => \&sort_occasional,
sort_each => \&sort_each,
},
);

sub heap_once {
my @top = ();

srand( 1 );
for my $i ( 1 .. 10_000 ){
my $num = rand();
push @top, $num;
my $i = $#top;
my $j = 0;
for(;;){
$j = int( $i / 2 );
if( $top[$j] < $top[$i] ){
( $top[$j] , $top[$i] ) = ( $top[$i] , $top[$j] );
}else{
last;
}
last if $j <= 0;
$i = $j;
}
}

my @list = ();
for my $i ( 0 .. 9 ){
push @list, $top[0];
my $j = 0;
while( $j*2+1 <= $#top ){
if( $j*2+2 <= $#top && $top[$j*2+1] < $top[$j*2+2] ){
$top[$j] = $top[$j*2+2];
$j = $j*2+2;
}else{
$top[$j] = $top[$j*2+1];
$j = $j*2+1;
}
}
$top[$j] = 0;
}

# print "@list\n";
}

sub heap_each {
my @top = ();

srand( 1 );
for my $i ( 1 .. 10_000 ){
my $num = rand();
push @top, $num;
my $i = $#top;
my $j = 0;
for(;;){
$j = int( $i / 2 );
if( $top[$j] < $top[$i] ){
( $top[$j] , $top[$i] ) = ( $top[$i] , $top[$j] );
}else{
last;
}
last if $j <= 0;
$i = $j;
}

my @list = ();
for my $i ( 0 .. 9 ){
push @list, $top[0];
my $j = 0;
while( $j*2+1 <= $#top ){
if( $j*2+2 <= $#top && $top[$j*2+1] < $top[$j*2+2] ){
$top[$j] = $top[$j*2+2];
$j = $j*2+2;
}else{
$top[$j] = $top[$j*2+1];
$j = $j*2+1;
}
}
$top[$j] = 0;
}
@top = @list;
}

# print "@top\n";
}

sub heap_occasional {
my @top = ();

srand( 1 );
for my $i ( 1 .. 10_000 ){
my $num = rand();
push @top, $num;
my $i = $#top;
my $j = 0;
for(;;){
$j = int( $i / 2 );
if( $top[$j] < $top[$i] ){
( $top[$j] , $top[$i] ) = ( $top[$i] , $top[$j] );
}else{
last;
}
last if $j <= 0;
$i = $j;
}

if( @top > 1_000 ){ # Change 1_000 to be as large as your memory
can handle
my @list = ();
for my $i ( 0 .. 9 ){
push @list, $top[0];
my $j = 0;
while( $j*2+1 <= $#top ){
if( $j*2+2 <= $#top && $top[$j*2+1] < $top[$j*2+2] ){
$top[$j] = $top[$j*2+2];
$j = $j*2+2;
}else{
$top[$j] = $top[$j*2+1];
$j = $j*2+1;
}
}
$top[$j] = 0;
}
@top = @list;
}
}

my @list = ();
for my $i ( 0 .. 9 ){
push @list, $top[0];
my $j = 0;
while( $j*2+1 <= $#top ){
if( $j*2+2 <= $#top && $top[$j*2+1] < $top[$j*2+2] ){
$top[$j] = $top[$j*2+2];
$j = $j*2+2;
}else{
$top[$j] = $top[$j*2+1];
$j = $j*2+1;
}
}
$top[$j] = 0;
}

# print "@list\n";
}

sub simple {
my @top = ( 0 ) x 10;

srand( 1 );
for my $i ( 1 .. 10_000 ){
my $num = rand();
my $i = -1;
for( $i = $#top; $i >= 0; $i -- ){
if( $top[$i] > $num ){
@top = ( @top[ 0 .. $i ], $num, @top[ $i + 1 .. $#top ] );
pop @top;
last;
}
}
if( $i < 0 ){
unshift @top, $num;
pop @top;
}
}

# print "@top\n";
}

sub sort_each {
my @top = ( 0 ) x 10;

srand( 1 );
for my $i ( 1 .. 10_000 ){
my $num = rand();
push @top, $num;
@top = sort { $b <=> $a } @top;
$#top = 9;
}

# print "@top\n";
}

sub sort_occasional {
my @top = ( 0 ) x 10;

srand( 1 );
for my $i ( 1 .. 10_000 ){
my $num = rand();
push @top, $num;
if( @top > 1_000 ){ # Change 1_000 to be as large as your memory
can handle
@top = sort { $b <=> $a } @top;
$#top = 9;
}
}
@top = sort { $b <=> $a } @top;
$#top = 9;

# print "@top\n";
}

sub sort_once {
my @top = ( 0 ) x 10;

srand( 1 );
for my $i ( 1 .. 10_000 ){
my $num = rand();
push @top, $num;
}
@top = sort { $b <=> $a } @top;
$#top = 9;

# print "@top\n";
}

__END__
 
A

Anno Siegel

Shawn Corey said:
Unfortunately, no. I wrote some benchmarks just to see what happens in
practice. The subroutine heap_each() is commented out since it takes so
long to complete.

Then there must be something wrong with the implementation. My
benchmark (see below) shows the "small-heap" solution I suggested
about 8 times faster than Abigail's "big-heap" solution.

Anno

The benchmark compares Abigail's top-ten extraction with a bottom-ten
extraction using a small heap. Otherwise, implementations of both
a maximum-heap and a minimum heap would have been needed. To make the
results comparable, I'm feeding Abigail's solution with the negatives
of the same numbers I'm using for mine.


#!/usr/local/bin/perl
use strict; use warnings; $| = 1;
use Vi::QuickFix;
use Benchmark;

use constant N => 100_000;
use constant SIZE => 10;
sub _heapify;
my @heap;
my $count;

our @raw = map int rand 1_000_000, 1 .. N;

bench: {
@heap = map -$_, @raw;
timethis 1, \ &abigail;
print "Heap operations: $count\n";
timethis 1, \ &anno;
print "Heap operations: $count\n";
}
exit;

sub anno {
@heap = ();
$count = 0;
for ( @raw ) {
if ( @heap < SIZE or $_ < $heap[ 0] ) {
insert( $_);
extract_max() if @heap > SIZE;
}
}
print extract_max(), "\n" while @heap;
}

sub abigail {
$count = 0;
for (my $i = int (@heap / 2); $i >= 0; $i --) {_heapify $i; $count ++}
print extract_max(), "\n" for 1 .. 10;
}

###############################################################

sub count { print "count: $count\n" }

sub extract_max {
return unless @heap;
my $max = shift @heap;
if ( @heap > 1 ) {
unshift @heap, pop @heap;
$count ++;
_heapify( 0);
}
$max;
}

sub insert {
my $x = shift;
my $i = @heap;
$count ++;
while ( $i > 0 ) {
my $parent = int( ($i - 1)/2);
last if $heap[ $parent] >= $x;
$heap[ $i] = $heap[ $parent];
$i = $parent;
}
$heap[ $i] = $x;
}

sub _heapify {
my $i = shift;
my ( $l, $r) = ( 2*$i + 1, 2*$i + 2);
my $max = $i;
$max = $l if $l < @heap and $heap[ $l] > $heap[ $max];
$max = $r if $r < @heap and $heap[ $r] > $heap[ $max];
if ( $max != $i ) {
@heap[ $i, $max] = @heap[ $max, $i];
_heapify( $max);
}
}
 
S

Shawn Corey

Anno said:
Then there must be something wrong with the implementation. My
benchmark (see below) shows the "small-heap" solution I suggested
about 8 times faster than Abigail's "big-heap" solution.

Anno

The benchmark compares Abigail's top-ten extraction with a bottom-ten
extraction using a small heap. Otherwise, implementations of both
a maximum-heap and a minimum heap would have been needed. To make the
results comparable, I'm feeding Abigail's solution with the negatives
of the same numbers I'm using for mine.

The reason why your routine is so much faster is that it does not do
heap operations for each item. heap_each() does (that's what each
means), and is therefore very slow. However your routine has inspired a
change in my simple routine that blows the socks off everything else.

Also I corrected a bug in it that is not a bug. At least it did what I
wanted and didn't complain about it. I leave it as an exercise to the
reader to find the change.

--- Shawn

use constant SIZE => 10;
my $seed = time;

sub simple {
my @top = ( 0 ) x SIZE;

srand( $seed );
for my $i ( 1 .. 10_000 ){
my $num = rand();
next if $num < $top[-1];
my $j = $#top;
for( $j = $#top; $j >= 0; $j -- ){
if( $top[$j] > $num ){
@top = ( @top[ 0 .. $j ], $num, @top[ $j + 1 .. $#top ] );
pop @top;
last;
}
}
if( $j < 0 ){
unshift @top, $num;
pop @top;
}
}

# print "@top\n";
}
 
J

John W. Krahn

Shawn said:
The reason why your routine is so much faster is that it does not do
heap operations for each item. heap_each() does (that's what each
means), and is therefore very slow. However your routine has inspired a
change in my simple routine that blows the socks off everything else.

Also I corrected a bug in it that is not a bug. At least it did what I
wanted and didn't complain about it. I leave it as an exercise to the
reader to find the change.


use constant SIZE => 10;
my $seed = time;

sub simple {
my @top = ( 0 ) x SIZE;

srand( $seed );
for my $i ( 1 .. 10_000 ){
my $num = rand();
next if $num < $top[-1];
my $j = $#top;
for( $j = $#top; $j >= 0; $j -- ){
if( $top[$j] > $num ){
@top = ( @top[ 0 .. $j ], $num, @top[ $j + 1 .. $#top ] );

Why not use splice? :)

splice @top, $j + 1, 0, $num;

pop @top;
last;
}
}
if( $j < 0 ){
unshift @top, $num;
pop @top;
}
}

# print "@top\n";
}


John
 
S

Shawn Corey

John said:
Why not use splice? :)

splice @top, $j + 1, 0, $num;

Not enough coffee? Too early in the morning? Failed to read the
documentation?

--- Shawn
(That's my story and I sticking to it!)
 
E

Eugene Mikheyev

What do you think I didn't read correctly?
I think he confessed about him having read incorrectly :)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
474,161
Messages
2,570,892
Members
47,430
Latest member
7dog123

Latest Threads

Top