Standard Forth versus Python: a case study


Peter Decker

Paul said:
Also, that finds the maximum, not the median. I had stopped examining
it after seeing it used more than one variable.

Um... isn't 'p' the list in question? There is only one variable used
for the calculation of the return value (yes, it's a maximum), and
that's 'x'.
 

bearophileHUGS

Ian said:
sort() sorts all of the data, but you're only after one or two numbers, so
the MODFIND method may be faster for the median:

The modified quicksort I have shown in the cookbook (466330) is O(n)
too, and it modifies the list in place, so you can apply it twice for
lists of even length.

Bye,
bearophile
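bearophile's point about even-length lists — run the selection twice and
combine the two middle values — can be sketched as follows. This is a hedged
sketch, not his cookbook recipe: a sorted copy stands in for the selection
routine, where his version partitions the list in place.

```python
def median_even(seq, select):
    """Median of an even-length sequence: apply a k-th-smallest
    routine select(seq, k) twice and average the two middles."""
    n = len(seq)
    return (select(seq, n // 2 - 1) + select(seq, n // 2)) / 2

def select(seq, k):
    # stand-in selection routine; an in-place quickselect would
    # avoid copying and make the second call cheaper
    return sorted(seq)[k]

print(median_even([4, 1, 3, 2], select))   # -> 2.5
```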
 

Paul McGuire

Paul said:
[snip]

no sort() is needed to calculate the median of a list.

you just need one temp var.

Ok, I'll bite. How do you compute the median of a list using just a
single temp var?

-- Paul

hi Paul; well when this was a stats-class assignment (back when pascal
was popular :) i just stepped through the vector and compared it

(pseudo-code)

ptr p = [with values].

fun median {
    var x = 0.
    while( *p++ ) {
        if( (*p) > x ) x = *p.
    }
    return x.
}

of course, pascal is more verbose but that's median()

No, that's the maximum. The median value is the value that is in the middle
of the list when the list is sorted. Many analyses prefer median to mean
(also known as "average") because the median is less sensitive to wild
outlier points.

My original question was in response to your post, that sort() wasn't
required but only a temp variable. I am very interested in seeing your
solution that does not require the data to be sorted. (This is not just an
academic exercise - given a large historical data set, sorting the data is
one of the costliest parts of computing the median, and I would greatly
appreciate seeing an alternative algorithm.)

-- Paul
 

Fredrik Lundh

Paul said:
My original question was in response to your post, that sort() wasn't required but only a temp
variable. I am very interested in seeing your solution that does not require the data to be
sorted. (This is not just an academic exercise - given a large historical data set, sorting the
data is one of the costliest parts of computing the median, and I would greatly appreciate seeing
an alternative algorithm.)

if you absolutely definitely cannot afford to modify or copy the input data set, but can
read the data sequentially multiple times reasonably fast, you can do what's basically a
binary search for the median, by counting how many values you have that are above or
below the current guess, and repeating until you find the right value. see e.g.

http://ndevilla.free.fr/median/median/src/torben.c

</F>
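The counting scheme Fredrik describes (essentially Torben Mogensen's
algorithm from the linked C file) can be sketched in Python. This is a hedged
port, not the thread's code: it assumes numeric data and odd length, and uses
only read-only passes over the sequence.

```python
def torben_median(seq):
    """Median by bisection on the value range: guess a value, count
    elements below/above it, and narrow the bracket until the guess
    lands on the middle element.  Never modifies seq."""
    lo, hi = min(seq), max(seq)
    n = len(seq)
    half = (n + 1) // 2
    while True:
        guess = (lo + hi) / 2
        less = greater = equal = 0
        maxltguess, mingtguess = lo, hi
        for v in seq:                    # one sequential pass
            if v < guess:
                less += 1
                if v > maxltguess:
                    maxltguess = v       # largest value below the guess
            elif v > guess:
                greater += 1
                if v < mingtguess:
                    mingtguess = v       # smallest value above the guess
            else:
                equal += 1
        if less <= half and greater <= half:
            break                        # the median brackets the guess
        elif less > greater:
            hi = maxltguess              # median is below: tighten from above
        else:
            lo = mingtguess              # median is above: tighten from below
    # the median is one of: the largest value below the guess, the
    # guess itself (if it occurs in the data), or the smallest above
    if less >= half:
        return maxltguess
    elif less + equal >= half:
        return guess
    return mingtguess
```

Each iteration is one full pass over the data, so this trades extra reads for
zero writes — exactly the case Fredrik names, where the data cannot be
modified or copied.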
 

Paul Rubin

Paul McGuire said:
My original question was in response to your post, that sort() wasn't
required but only a temp variable. I am very interested in seeing your
solution that does not require the data to be sorted. (This is not just an
academic exercise - given a large historical data set, sorting the data is
one of the costliest parts of computing the median, and I would greatly
appreciate seeing an alternative algorithm.)

There are well known algorithms for finding the median in expected
O(n) time, the most straightforward of which is a modified quicksort.
You do the traditional quicksort pivot step, figure out which
partition the median is in, and recurse on just that partition instead
of on both of them. Expected running time is n + n/2 + n/4 + ...
which is approx 2n.
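That pivot-and-recurse step (commonly called quickselect) can be sketched in
Python. This is an illustrative sketch, not code from the thread: the
list-comprehension partition copies data each round, where a true in-place
version would not.

```python
import random

def quickselect(seq, k):
    """k-th smallest element (0-based) in expected O(n) time:
    partition around a random pivot, then recurse only into the
    side that contains index k."""
    seq = list(seq)                      # work on a copy
    while True:
        pivot = random.choice(seq)
        lows = [v for v in seq if v < pivot]
        pivots = [v for v in seq if v == pivot]
        if k < len(lows):
            seq = lows                   # the answer is in the low side
        elif k < len(lows) + len(pivots):
            return pivot                 # the pivot itself is the answer
        else:
            k -= len(lows) + len(pivots)
            seq = [v for v in seq if v > pivot]

def median(seq):
    return quickselect(seq, len(seq) // 2)   # middle index for odd length
```

The geometric shrinkage Paul describes shows up directly: each round keeps
only one partition, so the expected work is n + n/2 + n/4 + ... ≈ 2n.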

Tarjan discovered a guaranteed O(n) algorithm in the 1970's(?) whose
operation is much different and quite complex. But all of these need
more than one temp var. See an algorithms book like CLRS or Knuth
for more info.
 

Paul Rubin

Paul Rubin said:
Tarjan discovered a guaranteed O(n) algorithm in the 1970's(?) whose
operation is much different and quite complex. But all of these need
more than one temp var. See an algorithms book like CLRS or Knuth
for more info.

Ehh, make that Blum, Floyd, Pratt, Rivest, and Tarjan, and the main
difference is in selecting the pivot, plus the complexity analysis.

http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15451-f06/www/lectures/lect0907.pdf

http://en.wikipedia.org/wiki/Selection_algorithm (see "median of
medians algorithm")
 

Paul McGuire

Fredrik Lundh said:
if you absolutely definitely cannot afford to modify or copy the input
data set, but can read the data sequentially multiple times reasonably
fast, you can do what's basically a binary search for the median, by
counting how many values you have that are above or below the current
guess, and repeating until you find the right value. see e.g.

http://ndevilla.free.fr/median/median/src/torben.c

</F>

Thanks!

-- Paul
 

Neil Cerutti

Paul Rubin said:
Tarjan discovered a guaranteed O(n) algorithm in the 1970's(?)

Huhn! I thought Tarjan was just the big bad evil guy in Bard's
Tale 2 who was creating eternal winter. I'm glad he also
contributed to our stock of *useful* algorithms.
 

John Doty

Paul said:
Paul said:
[snip]

no sort() is needed to calculate the median of a list.

you just need one temp var.

Ok, I'll bite. How do you compute the median of a list using just a
single temp var?

-- Paul
hi Paul; well when this was a stats-class assignment (back when pascal
was popular :) i just stepped through the vector and compared it

(pseudo-code)

ptr p = [with values].

fun median {
    var x = 0.
    while( *p++ ) {
        if( (*p) > x ) x = *p.
    }
    return x.
}

of course, pascal is more verbose but that's median()

No, that's the maximum. The median value is the value that is in the middle
of the list when the list is sorted. Many analyses prefer median to mean
(also known as "average") because the median is less sensitive to wild
outlier points.

My original question was in response to your post, that sort() wasn't
required but only a temp variable. I am very interested in seeing your
solution that does not require the data to be sorted. (This is not just an
academic exercise - given a large historical data set, sorting the data is
one of the costliest parts of computing the median, and I would greatly
appreciate seeing an alternative algorithm.)

Here's a K&R C function I wrote almost 20 years ago. It's a general
purpose quantile. The essential idea is to choose an element at random
(thus mostly avoiding perverse behavior with already sorted data) and
divide the set into two pieces around it. Then you figure out which
piece contains the quantile you want, and what quantile it is within
that piece, and recurse. When you see enough identical elements in the
piece you're processing, it's done. In the extreme case you'll get down
to one element.

ixrand(n) generates a random integer in the range 0..n-1. I think that's
the only nonstandard function used.

The style is torqued by what Unisoft C could and couldn't optimize: I
wouldn't write it quite like that today. One of my students pointed out
that all of the recursion is tail recursion so it should be easy to
flatten. Left as an exercise to the reader...

Somewhere, in Knuth I think, I saw a somewhat similar algorithm that
promised a little better performance by estimating the median from a
sample of the data, breaking the data up there, and then looking for a
quantile (statistically guaranteed to be) close to the min or max of one
of the subsets.

It shouldn't be hard to recode in Python, Forth, or whatever. That
wasn't my purpose in the exercise that started the thread though: I
wanted to see if I could import modules good enough to do the job from
public sources. In Python I could, and the entire image processing
program is 15 lines. In Forth I couldn't.

Anyway, here it is:

/* Find the nth from the minimum value in an array */
/* Monte Carlo method intended for finding medians */
/* 2/13/85 jpd */

/* For random data, this routine takes about */
/* 2.6*numdata + O( log( numdata )) comparisons */
/* If the data is tightly clustered about the mean, */
/* there is a speedup; it may take as few as */
/* 0.5*numdata comparisons. */
/* There is a slight penalty if the array is completely */
/* or partially sorted; it is at most 25%. */

/* NTH will be nthi, nths, etc., depending on DATATYPE */

NTH( data, numdata, n )
DATATYPE data[];   /* Data array (will be scrambled on return) */
int numdata;       /* length of data array */
int n;             /* index of item to find: 1 <= n <= numdata */
{
    register DATATYPE boundary, thisdata;
    register DATATYPE *lowp, *highp;
    DATATYPE v1, v2;
    int nlowbin;

    lowp = data;                /* Init data pointer */

    v1 = data[ ixrand( numdata )];
    {
        register DATATYPE v1r = v1;
        int nc = 1 + numdata - n;   /* "Complement" of n */

        if( nc > n )
            highp = lowp + nc;
        else
            highp = lowp + n;       /* Limit to test for done */

        /* Scan for the first point which doesn't match the
           boundary point.  If we encounter enough matching
           points, the boundary is the answer */
        while( *lowp++ == v1r ) {
            if( lowp >= highp ) return( v1r );
        }
        v2 = *--lowp;               /* Back up to get point */
    }

    boundary = ( v1 >> 1 ) + ( v2 >> 1 );   /* Beware overflows */

    highp = data + numdata;     /* Now process the whole thing */
    thisdata = *lowp;           /* Prime the pump */

    if( v2 < v1 ) {             /* Bin 2 is low bin */
        for( ; lowp < highp; thisdata = *lowp ) {
            if( thisdata <= boundary ) {    /* Bin 2 */
                *lowp = *--highp;           /* Exchange */
                *highp = thisdata;
            }
            else ++lowp;        /* Data point in right place */
        }

        nlowbin = numdata - ( lowp - data );
        if( nlowbin >= n ) return( NTH( highp, nlowbin, n ));
        else return( NTH( data, lowp - data, n - nlowbin ));

    }
    else {                      /* Primary bin is low bin */
        for( ; lowp < highp; thisdata = *lowp ) {
            if( thisdata > boundary ) {     /* Bin 2 */
                *lowp = *--highp;           /* Exchange */
                *highp = thisdata;
            }
            else ++lowp;        /* Don't move point */
        }

        nlowbin = ( lowp - data );
        if( nlowbin >= n ) return( NTH( data, nlowbin, n ));
        else return( NTH( highp, numdata - nlowbin, n - nlowbin ));
    }
}
 

Juho Schultz

John said:
The problem:

I have a bunch of image files in FITS format. For each raster row in
each file, I need to determine the median pixel value and subtract it
from all of the pixels in that row, and then write out the results as
new FITS files.

This may be personal bias... I have spent a few years with FITS files
so every time I see 'FITS' I think 'astronomy' - maybe you are doing
something else. (Someone wrote he does not know what FITS is - see
fits.gsfc.nasa.gov. In a nutshell: a FITS file is a list of header-data
units. Each header is text containing optional keyword-value pairs and
reading instructions for the data unit. Usually the data is astronomical
image data, but it can be anything.)

Sounds like subtracting sky background from the frames, though for some
reason (speed?) column-wise. You could have a look at either PyRAF
(www.stsci.edu/resources/software_hardware/pyraf) or PyMidas
(www.eso.org/sampo/pymidas/). Running some semi-official
sky-subtraction algorithm would at least give you a good case to test
against.

You could also check if ftools already does this. I have not used it
much, but sometimes it saves huge amounts of coding time.
heasarc.gsfc.nasa.gov/docs/software/ftools/ftools_menu.html

John also said:
This is a real problem I need to solve, not a made-up toy problem. I was
originally thinking of solving it in C (I know where to get the pieces
in that language)

Could you tell a bit more of the data? Are you aiming for accuracy,
speed or getting as complete time series as possible? (for me, speed
has never been an issue) Photometry/astrometry/something else? Is there
some trade-off like "get best possible result in X seconds"?
 

John Doty

Juho said:
This may be personal bias... I have spent a few years with FITS files
so every time I see 'FITS' I think 'astronomy' - maybe you are doing
something else. (Someone wrote he does not know what FITS is - see
fits.gsfc.nasa.gov. In a nutshell: a FITS file is a list of header-data
units. Each header is text containing optional keyword-value pairs and
reading instructions for the data unit. Usually the data is astronomical
image data, but it can be anything.)

Sounds like subtracting sky background from the frames, though for some
reason (speed?) column-wise. You could have a look at either PyRAF
(www.stsci.edu/resources/software_hardware/pyraf) or PyMidas
(www.eso.org/sampo/pymidas/). Running some semi-official
sky-subtraction algorithm would at least give you a good case to test
against.

You could also check if ftools already does this. I have not used it
much, but sometimes it saves huge amounts of coding time.
heasarc.gsfc.nasa.gov/docs/software/ftools/ftools_menu.html


Could you tell a bit more of the data? Are you aiming for accuracy,
speed or getting as complete time series as possible? (for me, speed
has never been an issue) Photometry/astrometry/something else? Is there
some trade-off like "get best possible result in X seconds"?

Yeah, I'm fairly familiar with that stuff, I was/am an investigator on
ASCA, Chandra, HETE-2, Suzaku. Didn't know of anything quite right for
this job. With the right imports it was only 15 lines of Python, hard to
beat that.

It isn't my data to release, it's MKI's. Development project. And they
have plenty of folks who know IRAF, ftools, etc. They didn't know how to
do quite what they wanted here.
 

idknow

Paul said:
Also, that finds the maximum, not the median. I had stopped examining
it after seeing it used more than one variable.

oops, i apologise, the > should be <
that's just a typo.

As for your quoting, you cut off the global var line and miscounted; p
is a global or from the caller of median()

also, i plead `too early in the morning to code properly' :)

clearly, i've forgotten the definition of the median of a list.
to that i plead faulty memory.

my apologies for coding at a bad hour.
 

Gabriel Genellina

At said:
clearly, i've forgotten the definition of the median of a list.
to that i plead faulty memory.

That explains all. Finding the median in an efficient way (that is,
without sorting the data first) isn't trivial, so your claim of "I
can do that using only one temp variable" was a bit surprising...
BTW, the median is the value which sits just in the middle of the
list when ordered: median(3,5,12,1,2)=median(1,2,3,5,12) = 3


--
Gabriel Genellina
Softlab SRL
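Gabriel's definition translates directly into a short Python sketch when
sorting is acceptable (a sketch, not from the thread; it assumes odd-length
input, so there is a single middle element):

```python
def median(seq):
    """Middle element of the ordered data (odd-length input)."""
    return sorted(seq)[len(seq) // 2]

print(median([3, 5, 12, 1, 2]))   # -> 3
```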





 

Paul Rubin

Gabriel Genellina said:
That explains all. Finding the median in an efficient way (that is,
without sorting the data first) isn't trivial, so your claim of "I can
do that using only one temp variable" was a bit surprising...
BTW, the median is the value which sits just in the middle of the list
when ordered: median(3,5,12,1,2)=median(1,2,3,5,12) = 3

How is this? Note that m and n are treated as constants, so
expressions like n-1, n/2 etc. are also constants. Also assume
m[i+1] is evaluated as *(m+i+1), and of course m+1 is constant.

================================================================

#include <stdio.h>

int
median (int m[], int n)
{
    int i;
    while (1) {
        for (i = 0; i < n-1; i++) {
            if (m[i] > m[i+1]) {
                /* swap m[i] and m[i+1] with no temp var */
                m[i] ^= m[i+1];
                m[i+1] ^= m[i];
                m[i] ^= m[i+1];
                goto yes;
            }
        }
        break;
    yes:
        for (i = 0; i < n-1; i++) {
            if (m[i] > m[i+1]) {
                m[i] ^= m[i+1];
                m[i+1] ^= m[i];
                m[i] ^= m[i+1];
            }
        }
    }
    return m[n / 2];
}

int a[] = {9,6,1,5,4,2,8,3,7};

int main(void)
{
    int m;
    m = median(a, 9);
    printf ("%d\n", m);
    return 0;
}
 

Gabriel Genellina

Gabriel Genellina said:
That explains all. Finding the median in an efficient way (that is,
without sorting the data first) isn't trivial, so your claim of "I can
do that using only one temp variable" was a bit surprising...
BTW, the median is the value which sits just in the middle of the list
when ordered: median(3,5,12,1,2)=median(1,2,3,5,12) = 3

How is this? Note that m and n are treated as constants, so
expressions like n-1, n/2 etc. are also constants. Also assume
m[i+1] is evaluated as (m+1) and of course m+1 is constant.

median (int m[], int n)
{
while (1) {
for (i = 0; i < n-1; i++) {
}
return m[n / 2];


Oh, yes, almost no auxiliary storage needed! You are sorting the data
first, then getting the middle element.
So if you are really really really concerned about variable storage
(maybe an embedded application?) this may work, at the expense of
speed: this sort appears to be O(n^2), quicksort would do it in
O(n log n), and a customised version -at each stage, continue sorting
only the branch containing the median- can work in *expected* linear
time.
Of course another set of constraints would dictate another approach,
maybe minimizing the number of array accesses, for example.


--
Gabriel Genellina
Softlab SRL





 

Fredrik Lundh

Gabriel said:
That explains all. Finding the median in an efficient way (that is,
without sorting the data first) isn't trivial, so your claim of "I can
do that using only one temp variable" was a bit surprising...
BTW, the median is the value which sits just in the middle of the list
when ordered: median(3,5,12,1,2)=median(1,2,3,5,12) = 3

note that both his proposals do indeed find the median if given a list
of zeros, though. maybe they just used lousy sample data in that stats
class he took?

</F>
 

Paddy

werty said:
Apples/oranges ? programmers are making very little $$ today .
Thats software ! No one is makin money on obsolete Forth ,
so why a comparisom ?

Ultimately the best OpSys will be free and millions of lines of code
obsoleted . Because no one can protect intellectual property , its
simply too costly for a Government to do this .

Notice the crypto fiasco , the USA govt forbad export of SSL and
in short order Australia gave the world a free SSLeasy !

This is software . No one will , for long , own software .
Microsoft and Linux will die in 24 months . No hardware will
produce money/profits using them .

Because we have a new hardware competitor , that obsoletes the BIOS
chip , North bridge/South bridge USB helpers , NICs .
No mo PNP (Plug_and_Pray) .
In 2 chips ( CPU /SoC and LCD controller) it will be faster than
a PC . 100's have allready written HDD drivers and haven't yet
free'd them . But when others give free , what good do it to
hold yours ! You look stupid selling what others improve and
free . Trying to sell what others find easy to create !
Intel made hardware too hard to program , ARM is easy .
Python wont last on ARM , for WCE and Linux will be tossed .
A simpler structured method will be used to customise
Browsers . There will be 2 columns , one on left will be main/orig
but on the Right will have hyperlink result . This way ya dont have
to go back to compare ! side by side .
Text editors will also work this way . You will read orig in left
columns
and 'edit'/annotate your stuff to the Right columns .

Sun microsystems et all ( the ones we seldom think about ) will
all be obsoleted , bankrupted , their hardware surplused .
No more h/w servers .
There will be no high speed 64 bit boxes in the future .
The Pocket PC will do work you could not imagine !
All will have 100 GB HDD , VGA LCD , USBH/USBD , WIFI N
and no WERTY keyboard ! full GUI ...
ethernet and firewire and rs232 will die

Why not "see" the future ?
No Linux ,no WXP , no C+ , no PC .....

Hah! This is the output from a program and I claim my prize.
:)
 
