Simple addition to random module - Student's t

T

Thomas Philips

While the random module allows one to generate randome numbers with a
variety of distributions, some useful distributions are omitted - the
Student's t being among them. This distribution is easily derived from
the normal distribution and the chi-squared distribution (which in
turn is a special case of the gamma distribution). I edited and tested
a routine to generate random variables with a Student's t distribution
that I found on http://www.johndcook.com/python_student_t_rng.html,
which has one bug - there is an extra factor of two in y. The
corrected and tested code follows - how does one go about getting this
incorporated into random so that the entire community can beneffit
from it?

Sincerely

Thomas Philips

def student_t(df): # df is the number of degrees of freedom
if df < 2 or int(df) != df:
raise ValueError, 'student_tvariate: df must be a integer > 1'

x = random.gauss(0, 1)
y = random.gammavariate(df/2.0, 2)

return x / (math.sqrt(y/df))


References:

1. Student's t distribution, including relationship to normal and chi-
squared distributions: http://en.wikipedia.org/wiki/Student's_t-distribution
2. Chi-squared distribution, including relationship to Gamma
distribution: http://en.wikipedia.org/wiki/Chi-square_distribution
3. John Cook's original version (with the extra factor of 2):
http://www.johndcook.com/python_student_t_rng.html
 
M

Mark Dickinson

While the random module allows one to generate randome numbers with a
variety of distributions, some useful distributions are omitted - the
Student's t being among them. This distribution is easily derived from
the normal distribution and the chi-squared distribution (which in
turn is a special case of the gamma distribution). I edited and tested
a routine to generate random variables with a Student's t distribution
that I found onhttp://www.johndcook.com/python_student_t_rng.html,
which has  one bug - there is an extra factor of two in y. The
corrected and tested code follows - how does one go about getting this
incorporated into random so that the entire community can beneffit
from it?

To get this into core Python, you'd usually submit a feature request
at http://bugs.python.org. To maximize the chances of the feature
being accepted, provide unit tests and documentation along with the
code. There's a lot of good information about how the Python
development process works at http://www.python.org/dev; see
especially http://www.python.org/dev/contributing/.

Alternatively, you might also consider submitting something to the
Python package index, http://pypi.python.org/pypi, or posting this as
a recipe at http://code.activestate.com/recipes/langs/python/
 
M

Mark Dickinson

def student_t(df):         # df is the number of degrees of freedom
    if df < 2  or int(df) != df:
       raise ValueError, 'student_tvariate: df must be a integer > 1'

By the way, why do you exclude the possibility df=1 here?
 
T

Thomas Philips

By the way, why do you exclude the possibility df=1 here?

I exclude df=1 hereBecause the variance is then infinite (in fact, the
distribution is then Cauchy). That said, your point is well taken;
allowing df=1 makes the Cauchy distribution available to users of
random, in much the same way as the Gamma makes the Chi-squared
available. Here's the revised code:

def student_tvariate(df): # df is the number of degrees of
freedom
if df < 1 or int(df) != df:
raise ValueError, 'student_tvariate: df must be a positive
integer'

x = random.gauss(0, 1)
y = random.gammavariate(df/2.0, 2)

return x / (math.sqrt(y/df))


I'll follow your suggestion, add in documentation and submit it to
bugs.python.com. Thanks for your guidance.

Thomas Philips
 
T

Thomas Philips

I exclude df=1 hereBecause the variance is then infinite (in fact, the
distribution is then Cauchy). That said, your point is well taken;
allowing df=1 makes the Cauchy distribution available to users of
random, in much the same way as the Gamma makes the Chi-squared
available. Here's the revised code:

def student_tvariate(df):         # df is the number of degrees of
freedom
    if df < 1  or int(df) != df:
        raise ValueError, 'student_tvariate: df must be a positive
integer'

    x = random.gauss(0, 1)
    y = random.gammavariate(df/2.0, 2)

    return x / (math.sqrt(y/df))

I'll follow your suggestion, add in documentation and submit it to
bugs.python.com. Thanks for your guidance.

Thomas Philips

Mark,

I mis-spoke - the variance is infinite when df=2 (the variance is df/
(df-2), and you get the Cauchy when df=2. So I'm going to eat humble
pie and go back to

def student_tvariate(df): # df is the number of degrees of
freedom
if df < 2 or int(df) != df:
raise ValueError, 'student_tvariate: df must be a integer > 1'

x = random.gauss(0, 1)
y = random.gammavariate(df/2.0, 2)

return x / (math.sqrt(y/df))



I made the mistake because the denominator is equivalent to the
square root of the sample variance of df normal observations, which in
turn has df-1 degrees of freedom...............Oh well, I think it's
much easier to apologize than to rationalize my error!!!!


Sincerely

Tom
 
R

Robert Kern

By the way, why do you exclude the possibility df=1 here?

Similarly, requiring df to be an integer is extraneous.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
M

Mark Dickinson

I mis-spoke - the variance is infinite when df=2 (the variance is df/
(df-2),

Yes: the variance is infinite both for df=2 and df=1, and Student's t
with df=1 doesn't even have an expectation. I don't see why this
would stop you from generating meaningful samples, though.
and you get the Cauchy when df=2.

Are you sure about this? All my statistics books are currently hiding
in my mother-in-law's attic, several hundred miles away, but wikipedia
and mathworld seem to say that df=1 gives you the Cauchy distribution.
I made the mistake because the denominator is  equivalent to the
square root of the sample variance of df normal observations,

As I'm reading it, the denominator is the square root of the sample
variance of *df+1* independent standard normal observations. I agree
that the wikipedia description is a bit confusing.

It seems that there are uses for Student's t distribution with
non-integral degrees of freedom. The Boost library, and the R
programming language both allow non-integral degrees of freedom.
So (as Robert Kern already suggested), you could drop the test
for integrality of df. In fact, you could just drop the tests
on df entirely: df <= 0.0 will be picked up in the gammavariate
call.
 
R

Raymond Hettinger

While the random module allows one to generate randome numbers with a
variety of distributions, some useful distributions are omitted - the
Student's t being among them.

I'm curious to hear what your use cases are.

My understanding is that t-distribution is an estimation tool
used with small samples of a population where the variance or
standard deviation is unknown.

So, when do you ever need to generate random variables with
this distribution? ISTM that this is akin to wanting
a generator for a Kolmogorov distribution -- usually the
distribution is used to test empirical data, not to generate it.

I think most of the existing generators were chosen because they
are useful in simulation programs. AFAICT, the Student's t-
distribution
doesn't fall into that category (usually, you know the population
standard deviation when you're the one generating data).

ISTM, there ought to be a statistics module that can calculate
cumulative distribution functions for a variety of distributions.
This would be far more helpful than creating more generators.


Raymond
 
R

Raymond Hettinger

To get this into core Python, you'd usually submit a feature request
athttp://bugs.python.org.

If you do submit a patch, please assign it to me.
I've been the primary maintainer for that module
for several years.


Raymond Hettinger
 
T

Thomas Philips

Yes:  the variance is infinite both for df=2 and df=1, and Student's t
with df=1 doesn't even have an expectation.  I don't see why this
would stop you from generating meaningful samples, though.


Are you sure about this?  All my statistics books are currently hiding
in my mother-in-law's attic, several hundred miles away, but wikipedia
and mathworld seem to say that df=1 gives you the Cauchy distribution.


As I'm reading it, the denominator is the square root of the sample
variance of *df+1* independent standard normal observations.  I agree
that the wikipedia description is a bit confusing.

It seems that there are uses for Student's t distribution with
non-integral degrees of freedom.  The Boost library, and the R
programming language both allow non-integral degrees of freedom.
So (as Robert Kern already suggested), you could drop the test
for integrality of df.  In fact, you could just drop the tests
on df entirely:  df <= 0.0 will be picked up in the gammavariate
call.

To tell you the truth, I have never used it with a non-integer number
of degrees of freedom, but that's not the same as saying that df
should be an integer. When df is an integer, one can interpret the t-
distribution as the ratio of a unit normal (i.e. N(0,1)) to the sample
standard deviation of a set of df+1 unit normals divided by sqrt(df
+1). However, as Robert Kern correctly observes, the distribution is
defined for all positive non-integer df, though we then lose the above
interpretation, and must think of it in abstract terms. The
distribution has infinite variance when df=2 and an undefined mean
when df<=1, but the code can still be used to generate samples.
Whether or not these samples make sense is altogether another
question, but it's easy enough to remmove the restrictions.
 
R

Robert Kern

I'm curious to hear what your use cases are.

My understanding is that t-distribution is an estimation tool
used with small samples of a population where the variance or
standard deviation is unknown.

So, when do you ever need to generate random variables with
this distribution? ISTM that this is akin to wanting
a generator for a Kolmogorov distribution -- usually the
distribution is used to test empirical data, not to generate it.

In more complicated models, estimates of one parameter need to be propagated
through the model, particularly if you are looking at sensitivity to parameters.
Student's t describes the variation of an estimate of a mean of a sample from a
Gaussian distribution. If I were modeling a processing wherein someone makes an
estimate of a mean and then acts on that estimate, I would want to generate
random t variates to feed that model.
I think most of the existing generators were chosen because they
are useful in simulation programs. AFAICT, the Student's t-
distribution
doesn't fall into that category (usually, you know the population
standard deviation when you're the one generating data).

Student's t distribution is also used as a sort of generic fat-tailed
distribution in some models and is not tied to the "estimate of a mean" description.
ISTM, there ought to be a statistics module that can calculate
cumulative distribution functions for a variety of distributions.
This would be far more helpful than creating more generators.

Yes, scipy.stats.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
T

Terry Reedy

Robert said:
In more complicated models, estimates of one parameter need to be
propagated through the model, particularly if you are looking at
sensitivity to parameters. Student's t describes the variation of an
estimate of a mean of a sample from a Gaussian distribution. If I were
modeling a processing wherein someone makes an estimate of a mean and
then acts on that estimate, I would want to generate random t variates
to feed that model.


Student's t distribution is also used as a sort of generic fat-tailed
distribution in some models and is not tied to the "estimate of a mean"
description.

Yes. One can run the simulation with various df's to see the effect of
such data and possibly how and when a model breaks.
Yes, scipy.stats.

Is that stable enough so that all or part could be added to stdlib?

tjr
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,995
Messages
2,570,230
Members
46,819
Latest member
masterdaster

Latest Threads

Top