Paul said:
Paul said:
[snip]
no sort() is needed to calculate the median of a list.
you just need one temp var.
Ok, I'll bite. How do you compute the median of a list using just a
single
temp var?
-- Paul
hi Paul; well when this was a stats-class assignment (back when pascal
was popular
i just stepped through the vector and compared it
(pseudo-code)
ptr p = [with values].
fun median {
var x = 0.
while( *p++) {
if( (*p) > x) x = *p.
}
return x.
}
of course, pascal is more verbose but that's median()
No, that's the maximum. The median value is the value that is in the middle
of the list when the list is sorted. Many analyses prefer median to mean
(also known as "average") because the median is less sensitive to wild
outlier points.
My original question was in response to your post, that sort() wasn't
required but only a temp variable. I am very interested in seeing your
solution that does not require the data to be sorted. (This is not just an
academic exercise - given a large historical data set, sorting the data is
one of the costliest parts of computing the median, and I would greatly
appreciate seeing an alternative algorithm.)
Here's a K&R C function I wrote almost 20 years ago. It's a general
purpose quantile. The essential idea is to choose an element at random
(thus mostly avoiding perverse behavior with already sorted data) and
divide the set into two pieces around it. Then you figure out which
piece contains the quantile you want, and what quantile it is within
that piece, and recurse. When you see enough identical elements in the
piece you're processing, it's done. In the extreme case you'll get down
to one element.
ixrand(n) generates a random integer in the range 0..n-1. I think that's
the only nonstandard function used.
The style is torqued by what Unisoft C could and couldn't optimize: I
wouldn't write it quite like that today. One of my students pointed out
that all of the recursion is tail recursion so it should be easy to
flatten. Left as an exercise to the reader...
Somewhere, in Knuth I think, I saw a somewhat similar algorithm that
promised a little better performance by estimating the median from a
sample of the data, breaking the data up there, and then looking for a
quantile (statistically guaranteed to be) close to the min or max of one
of the subsets.
It shouldn't be hard to recode in Python, Forth, or whatever. That
wasn't my purpose in the exercise that started the thread though: I
wanted to see if I could import modules good enough to do the job from
public sources. In Python I could, and the entire image processing
program is 15 lines. In Forth I couldn't.
Anyway, here it is:
/* Find the nth from the minimum value in an array */
/* Monte Carlo method intended for finding medians */
/* 2/13/85 jpd */
/* For random data, this routine takes about */
/* 2.6*numdata + O( log( numdata )) comparisons */
/* If the data is tightly clustered about the mean, */
/* there is a speedup; it may take as few as
/* 0.5*numdata comparisons. */
/* There is a slight penalty if the array is completely */
/* or partially sorted; it is at most 25%. */
/* NTH will be nthi, nths, etc., depending on DATATYPE */
NTH( data, numdata, n )
DATATYPE data[]; /* Data array (will be scrambled on return) */
int numdata; /* lemgth of data array */
int n; /* index if item to find:
1 <= n <= numdata */
{
register DATATYPE boundary, thisdata;
register DATATYPE *lowp, *highp;
DATATYPE v1, v2;
int nlowbin;
lowp = data; /* Init data pointer */
v1 = data[ ixrand( numdata )];
{
register DATATYPE v1r = v1;
int nc = 1 + numdata - n; /* "Complement" of n */
if( nc > n )
highp = lowp + nc;
else
highp = lowp + n; /* Limit to test for done */
/* Scan for the first point which
doesn't match the boundary point.
If we encounter enough
matching points,
the boundary is the answer */
while( *lowp++ == v1r ) {
if( lowp >= highp ) return( v1r );
}
v2 = *--lowp; /* Back up to get point */
}
boundary = ( v1 >> 1 ) + ( v2 >> 1 ); /* Beware overflows */
highp = data + numdata; /* Now process the whole thing */
thisdata = *lowp; /* Prime the pump */
if( v2 < v1 ) { /* Bin 2 is low bin */
for( ; lowp < highp; thisdata = *lowp ) {
if( thisdata <= boundary ) { /* Bin 2 */
*lowp = *--highp; /* Exchange */
*highp = thisdata;
}
else ++lowp; /* Data point in right place */
}
nlowbin = numdata - ( lowp - data );
if( nlowbin >= n ) return( NTH( highp, nlowbin, n ));
else return( NTH( data, lowp - data, n - nlowbin ));
}
else { /* Primary bin is low bin */
for( ; lowp < highp; thisdata = *lowp ) {
if( thisdata > boundary ) { /* Bin 2 */
*lowp = *--highp; /* Exchange */
*highp = thisdata;
}
else ++lowp; /* Don't move point */
}
nlowbin = ( lowp - data );
if( nlowbin >= n ) return( NTH( data, nlowbin, n ));
else return( NTH( highp, numdata - nlowbin, n - nlowbin ));
}
}