PEP 450: Adding a statistics module to Python

Steven D'Aprano

The trick here is that numpy really is the "right" way to do this stuff.

Numpy does not have a monopoly on the correct algorithms for statistics
functions, and a big, heavyweight library like numpy is overkill for many
lightweight statistics tasks. One shouldn't need to turn on a nuclear
reactor just to put the light on in your fridge.

I like to say:
"crunching numbers in python without numpy is like doing text processing
without using the string object"

Your analogy is backwards. String objects actually aren't optimal for
heavy duty text processing, because they're immutable. If you're serious
about crunching vast amounts of numbers, you'll use numpy. If you're
serious about crunch vast amounts of text, say for a text editor or word
processor, you *won't* use strings, you'll use some sort of mutable
buffer, or ropes, or some other data type. But very unlikely to use
strings.
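
To make the point concrete, here is a toy sketch (with io.StringIO standing in for a real editor buffer; the function names are just for illustration):

import io

def build_with_str(pieces):
    # Each += may copy the whole accumulated string
    # (quadratic in the worst case).
    text = ""
    for p in pieces:
        text += p
    return text

def build_with_buffer(pieces):
    # A mutable buffer appends in place: linear overall.
    buf = io.StringIO()
    for p in pieces:
        buf.write(p)
    return buf.getvalue()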

What this is really an argument for is a numpy-lite in the standard
library, which could be used to build these sorts of things on. But
that's been rejected before...

"Numpy-lite". Which parts of numpy? Who maintains it? The numpy release
schedule is nothing like the standard library's release schedule, so
which one has to change? Or does somebody fork numpy, giving two
independent code bases?

What about Jython, IronPython, and other Python implementations? Even
PyPy doesn't support numpy yet, and Jython and IronPython probably never
will, since they're not C-based.

A few other comments:

1) the numpy folks have been VERY good at providing binaries for Windows
and OS-X -- easy point and click installing.

2) I hope we're almost there with standardizing pip and binary wheels,
at which point pip install will be painless.

Yeah, right, sure it will be. I've been waiting a decade for package
management on Linux to become painless, and it still isn't. There's no
reason to expect pip will be more painless than aptitude or yum.

But even if it is, installation of software is not just a software
problem to be solved by better technology. There is also the social
problem that not everyone is permitted to arbitrarily install software.
I'm not just talking about security policies on the machine, but security
policies in real life. People can be sacked for installing software they
don't have permission to install.

Machines may be locked down, users may have to submit a request before
software will be installed. That may involve a security audit, legal
review of licensing, strategy for full roll-back, potentially even a
complete code audit. (Imagine auditing all of numpy.) Or policy may
simply say, *no software from unapproved vendors* full stop.

Not everyone is privileged to be permitted to install whatever software
they like, when they like. Here are two organisations that make software
installation requests *easy*:

http://www.uhd.edu/computing/acl/SoftwareInstallationRequest.html

http://www.calstatela.edu/its/services/software/instructsoftwarerequest.php/form2.php


Pip install isn't going to fix that.

There are many, many people in a situation where the Python std lib is
approved, usually because it comes from a vendor with a support contract
(say, RedHat, Ubuntu, or Suse), but getting third-party packages like
numpy approved is next to impossible. "Just install numpy" is a solution
for a privileged few.

even before (2) -- pip install works fine anywhere the system is set up
to build python extensions (granted, not a given on Windows and Mac, but
pretty likely on Linux)

Oh, well that's okay then -- that's three, maybe four percent of the
computing world taken care of! Problem solved!

Not.

-- the idea that pip install writing out a lot of text (but working!) is
somehow a barrier to entry is absurd -- anyone building their own stuff
on Linux is used to that.

Do you realise that not all Python programmers are used to, or able to,
"build their own stuff on Linux"?


[...]
All that being said -- if you do decide to do this, please use a PEP
3118 (enhanced buffer) supporting data type (probably array.array) --
compatibility with numpy and other packages for crunching numbers is
very nice.


py> import array
py> data = array.array('f', range(1000))
py> import statistics
py> statistics.mean(data)
499.5
py> statistics.stdev(data)
288.8194360957494


If the data type supports the sequence protocol, it should work with my
module. If it fails to work, submit a bug report, and I will fix it.
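
For instance, a minimal home-grown sequence type (the Squares class below is purely illustrative) should work unchanged:

import statistics

class Squares:
    # Supports the sequence protocol: __len__ plus integer __getitem__.
    def __init__(self, n):
        self._n = n
    def __len__(self):
        return self._n
    def __getitem__(self, i):
        if not 0 <= i < self._n:
            raise IndexError(i)
        return i * i

print(statistics.mean(Squares(10)))   # 28.5 (mean of 0, 1, 4, ..., 81)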


If someone decides to build a stand-alone stats package -- building it
on a ndarray-lite (PEP 3118 compatible) object would be a nice way to
go.


One other point -- for performance reasons, it would be nice to have some
compiled code in there -- this adds incentive to put it in the stdlib --
external packages that need compiling are what make numpy unacceptable
to some folks.

Like the decimal module, it will probably remain pure-Python for a few
releases, but I hope that in the future the statistics module will gain a
C-accelerated version. (Or Java-accelerated for Jython, etc.) I expect
that PyPy won't need one. But because it's not really aimed at number-
crunching megabytes of data, speed is not the priority.
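
For what it's worth, the decimal-style arrangement alluded to here usually looks something like the following (the _statistics accelerator module named below is hypothetical):

# statistics.py -- pure-Python implementation first
def mean(data):
    data = list(data)
    if not data:
        raise ValueError("mean requires at least one data point")
    return sum(data) / len(data)

# ...then, at the bottom of the module, swap in the C version if present
# (this mirrors how decimal imports _decimal when it is available).
try:
    from _statistics import mean   # hypothetical C accelerator
except ImportError:
    pass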
 

chris.barker

Although it doesn't mention this in the PEP, a significant point that is
worth bearing in mind is that numpy is only for CPython, not PyPy,
IronPython, Jython etc. See here for a recent update on the status of
It does mention it, though I think not the additional implementations by name. And yes, the lack of numpy on the other implementations is a major limitation.
It depends what kind of number crunching you're doing.

As it depends on what kind of text processing you're doing... you could go a long way with a pure-Python sequence-of-abstract-characters library, but it would be painfully slow -- no one would even try.

I guess there are more people working with, say, hundreds of numbers, than people trying to process an equally tiny amount of text...but this is a digression.

My point about that is that you can only reasonably do string processing with Python because Python has the concept of a string, not just an arbitrary sequence of characters, and not just for speed's sake, but for the nice semantics.

Anyone that has used an array-oriented language or library is likely to get addicted to the idea that an array of numbers as a first-class concept is really, really helpful, for both performance and semantics.
Numpy gives efficient C-style number crunching

which is by far the most common case. Also, a properly designed algorithm may well need to know something about the internal storage/processing of the data type -- i.e. the best way to compute a given statistic for floating point may not be the same as for integers (or decimal, or...). Maybe you can get a good one that works for most, but....

You can use dtype=object to use all these things with numpy arrays but in my
experience this is typically not faster than working with Python lists

That's quite true. In fact, often slower.
and is only really useful when you want numpy's multi-dimensional,
view-type slicing.

which is very useful indeed!
Here's an example where Steven's statistics module is more accurate:
py> numpy.mean([-1e60, 100, 100, 1e60])
0.0
py> statistics.mean([-1e60, 100, 100, 1e60])
50.0

the wonders of floating point arithmetic! -- but this looks like more of an argument for a better algorithm in numpy than a reason to have something in the stdlib -- in fact, that's been discussed lately; there is talk of using compensated summation in the numpy sum() method -- not sure of the status.
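
For readers unfamiliar with the technique, compensated summation carries a correction term for the low-order bits that plain addition drops. A rough pure-Python sketch, alongside math.fsum, which the stdlib already provides for exactly this kind of cancellation problem:

import math

def kahan_sum(values):
    # Compensated (Kahan) summation: c accumulates the low-order
    # bits that each naive addition would otherwise discard.
    total = 0.0
    c = 0.0
    for x in values:
        y = x - c
        t = total + y
        c = (t - total) - y
        total = t
    return total

data = [-1e60, 100.0, 100.0, 1e60]
print(sum(data))        # 0.0 -- catastrophic cancellation
print(math.fsum(data))  # 200.0 -- fsum tracks exact partial sums
# Note: this example is extreme enough to defeat plain Kahan summation
# too; fsum's exact partials are what save it here.
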
Okay so that's a toy example but it illustrates that Steven is aiming
for ultra-high accuracy where numpy is primarily aimed at speed.

well, yes, for the most part, numpy does trade accuracy for speed when it has to -- but that's not the case here; I think this is a case of "no one took the time to write a better algorithm"

He's also tried to ensure that it works properly with e.g. fractions:

That is pretty cool, yes.
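
For example, exact arithmetic is preserved (a quick interactive check against the reference implementation):

py> from fractions import Fraction
py> import statistics
py> statistics.mean([Fraction(1, 3), Fraction(2, 3)])
Fraction(1, 2)
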
If it's a numpy-lite then it's a numpy-ultra-lite. It really doesn't
provide much of what numpy provides.

I wasn't clear -- my point was that things like this should be built on a numpy-like array object (numpy-lite) -- so first adding such an object to the stdlib, then building this off it would be nice. But a key problem with that is where do you draw the line that defines numpy-lite? I'd say just the core storage object, but then someone wants to add statistics, and someone else wants to add polynomials, and then random numbers, then ... and pretty soon you've got numpy again!
Why? Yes I'd also like an ndarray-lite or rather an ultra-lite
1-dimensional version but why would it be useful for the statistics
module over using standard Python containers? Note that numpy arrays
do work with the reference implementation of the statistics module
(they're just treated as iterables):

One of the really great things about numpy is that when you work with a LOT of numbers (which is not rare in this era of Big Data) it stores them efficiently, and you can push them around between different arrays and other libraries without unpacking and copying data. That's what PEP 3118 is all about.

It looks like there is some real care being put into these algorithms, so it would be nice if they could be efficiently used for large data sets and with numpy.
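
That buffer sharing is easy to demonstrate with just the stdlib: a memoryview over an array.array is a zero-copy window onto the same memory.

import array

data = array.array('d', range(10))
view = memoryview(data)    # PEP 3118: shares the buffer, no copy
half = view[:5]            # slicing the view still doesn't copy

data[0] = 99.0
print(half[0])             # 99.0 -- the view sees the change
print(half.tolist())       # [99.0, 1.0, 2.0, 3.0, 4.0]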

import numpy
import statistics
statistics.mean(numpy.array([1, 2, 3]))

you'll probably find that this is slower than a python list -- numpy has some overhead when used as a generic sequence.
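
That claim is easy to test on your own machine (the numbers will vary; this is only a rough sketch). statistics.mean iterates element by element, and pulling items out of a numpy array one at a time boxes each value into a Python object:

import timeit

setup = """
import numpy, statistics
lst = [float(i) for i in range(10000)]
arr = numpy.array(lst)
"""
print(timeit.timeit("statistics.mean(lst)", setup=setup, number=100))
print(timeit.timeit("statistics.mean(arr)", setup=setup, number=100))
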
It might be good to have a C accelerator one day but actually I think
the pure-Python-ness of it is a strong reason to have it since it
provides accurate statistics functions to all Python implementations
(unlike numpy) at no additional cost.

Well, I'd rather not have a package that is great for education and toy problems, but not-so-good for the real ones...

I guess my point is this:

This is a way to make the standard Python distribution better for some common computational tasks. But rather than think of it as "we need some stats functions in the Python stdlib", perhaps we should be thinking: "out of the box, Python should be better for computation" -- in which case, I'd start with a decent array object.

-Chris
 

Prasad, Ramit

CM said:
I think it's a very good idea. Good PEP points, too. I hope it happens.

+1 especially for non-CPython versions of Python.


~Ramit



 

Oscar Benjamin

Well, I'd rather not have a package that is great for education and toy problems, but not-so-good for the real ones...

Again it depends what you mean by "real". From the other lists where
we meet I'd guess that your problems are in the "needs a nuclear
reactor" camp. I doubt that the stdlib will ever be sufficiently
mathematically/computationally oriented to fully service either of our
needs (and I don't mean that as a criticism). I persuaded the IT guys
at my work that we needed the whole Enthought Python Distribution on
all machines just because I didn't want to have to argue about
individual packages.

However in my real work, where I compute means and variances etc. I
very often do work with very small datasets and I know a lot of others
who work almost exclusively with them (think e.g. clinical data where
N is often less than 100).
I guess my point is this:

This is a way to make the standard Python distribution better for some common computational tasks. But rather than think of it as "we need some stats functions in the Python stdlib", perhaps we should be thinking: "out of the box, Python should be better for computation" -- in which case, I'd start with a decent array object.

I think that, whether or not the statistics module gains a C
accelerator, if a fast numerical array type comes along then I'd
expect that the statistics module would use its methods as a fast
path. And if it provides a speed boost without compromising
boundedness or accuracy I'm sure that the array type would be used
internally where appropriate (just as numpy converts collections to
arrays before computation).
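
Such a fast path might look roughly like this -- purely a sketch, not how the module is actually written:

import statistics

def mean(data):
    # Prefer a native .mean() method when the input type provides one
    # (e.g. a fast array type); otherwise fall back to the careful
    # pure-Python implementation.
    native = getattr(type(data), 'mean', None)
    if native is not None:
        return native(data)
    return statistics.mean(data)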


Oscar
 

chris.barker

Numpy does not have a monopoly on the correct algorithms for statistics
functions,

indeed not -- in fact, a number of them are quite lame, either because of chosen speed vs. accuracy trade-offs, or just plain no-one-got-around-to-writing-the-code.

I kind of mis-spoke: what I meant was "a numpy ndarray-similar object is the "right" way to do this", not numpy itself.
and a big, heavyweight library like numpy is overkill for many
lightweight statistics tasks. One shouldn't need to turn on a nuclear
reactor just to put the light on in your fridge.

sure -- but you are talking stdlib here -- where do we draw the line? A hard choice every time.
Your analogy is backwards. String objects actually aren't optimal for
heavy duty text processing, because they're immutable. If you're serious
about crunching vast amounts of numbers, you'll use numpy. If you're
serious about crunch vast amounts of text, say for a text editor or word
processor, you *won't* use strings, you'll use some sort of mutable
buffer, or ropes, or some other data type. But very unlikely to use
strings.

but you sure as heck won't use arbitrary Python sequences of characters, which is what you are doing with this module.
"Numpy-lite". Which parts of numpy? Who maintains it? The numpy release
schedule is nothing like the standard library's release schedule, so
which one has to change? Or does somebody fork numpy, giving two
independent code bases?

yup -- that's why it's been rejected before -- but we did get PEP 3118 as a compromise, so one could build an nd-array-lite that was PEP 3118 compatible, and avoid many of the problems above.

However, as much of a problem as it is to install a third-party compiled package, it's a hell of a lot less work than writing a bunch of new code, so it'll probably never get done.

I myself am trying to write my new stuff to take PEP 3118 buffers, so I can get full high-performing numpy support but not require users to have numpy -- it is a bit tricky, but can be done. If/when you get to the C-accelerated version, I suggest you consider it.
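
A minimal sketch of that approach, assuming one-dimensional data arriving as any PEP 3118 exporter (a numpy array, an array.array, etc.) -- note that memoryview() accepts them without importing numpy:

def as_view_or_seq(data):
    # Zero-copy view for buffer exporters; plain sequences pass through.
    try:
        return memoryview(data)
    except TypeError:
        return data

def total(data):
    return sum(as_view_or_seq(data))
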
What about Jython, IronPython, and other Python implementations? Even
PyPy doesn't support numpy yet, and Jython and IronPython probably never
will, since they're not C-based.

There is a numpy for IronPython, though I don't think it got beyond the alpha stage. But your point is well taken -- and also a reason for an ndarray in the stdlib: then maybe other implementations would support it.
Yeah, right, sure it will be. I've been waiting a decade for package
management on Linux to become painless, and it still isn't. There's no
reason to expect pip will be more painless than aptitude or yum.

Probably not, true -- but you needed to get Python from somewhere, didn't you? You can't say it's easy to compile that on Windows!
There is also the social problem that not everyone is permitted to arbitrarily install software.

I work for the Federal Government -- believe me, I know.

There's Google App Engine, and things like that too, to support your point.....
complete code audit. (Imagine auditing all of numpy.)

well, the more we add to Python's stdlib, the bigger an issue that will be for all Python users -- another reason to be cautious.

But in the end, I don't think there is a lot you can do with Python without installing some third-party package. How many people do all their code development in IDLE? All their GUIs with Tk? No image processing, writing their own web framework from scratch? The list goes on and on. I may have a few simple text-processing scripts that don't use any third-party packages, but nothing major.

I teach Intro to Python, and while I could probably get away with only the stdlib for the intro class (but sure as heck not the web development class), I don't -- because there is a lot folks should know about to do anything real in Python.

So as much of a pain as it can be to use third-party packages, we can't put everything in the stdlib for that reason.
There are many, many people in a situation where the Python std lib is
approved, usually because it comes from a vendor with a support contract
(say, RedHat, Ubuntu, or Suse), but getting third-party packages like
numpy approved is next to impossible.

don't all three of those ship numpy? I haven't used them in ages.
Oh, well that's okay then -- that's three, maybe four percent of the
computing world taken care of! Problem solved!

hence the binaries...

really -- the "I can't install an unapproved package" is a show-stopper. "I can't build it" isn't.
Do you realise that not all Python programmers are used to, or able to,
"build their own stuff on Linux"?

then why not "yum install numpy"? or whatever?
py> import array
py> data = array.array('f', range(1000))
py> import statistics
py> statistics.mean(data)
499.5

I realized this after posting -- that is a nice feature, and could help a lot -- hurray for the buffer protocol! This makes room for compiled optimization down the road, and then you might be able to use your code with numpy arrays efficiently.
If the data type supports the sequence protocol, it should work with my
module. If it fails to work, submit a bug report, and I will fix it.

fair enough.
Like the decimal module, it will probably remain pure-Python for a few
releases, but I hope that in the future the statistics module will gain a
C-accelerated version. (Or Java-accelerated for Jython, etc.)

a perfectly reasonable development path.

I expect
that PyPy won't need one. But because it's not really aimed at number-
crunching megabytes of data, speed is not the priority.

I thought one of the key points of PyPy was performance? But anyway, maybe RPython and the JIT will take care of that.


Anyway, this looks like a great project -- not so sure about putting it in the stdlib, and do hope you'll keep the number crunchers in mind, but great stuff nonetheless.

-Chris
 

Josef Pktd

I think the install issues in the PEP are exaggerated, and are in my opinion not a sufficient reason to get something into the standard lib.

google appengine includes numpy
https://developers.google.com/appengine/docs/python/tools/libraries27

I'm on Windows, and numpy and scipy are just binary installers that install without problems.
There are free binary distributions (for Windows and Ubuntu) that include all the main scientific applications. One-click installer on Windows
http://code.google.com/p/pythonxy/wiki/Welcome
http://code.google.com/p/winpython/

How many Linux distributions don't include numpy? (I have no idea.)

For commercial support Enthought's and Continuum's distributions include all the main packages.

I think having basic descriptive statistics is still useful in a basic python installation. Similarly, almost all the descriptive statistics moved from scipy.stats to numpy.

However, what is the long-term scope of this supposed to be?

I think working with pure python is interesting for educational purposes
http://www.greenteapress.com/thinkstats/
but I don't think it will get very far for more extensive uses. Soon you will need some linear algebra (numpy.linalg and scipy.linalg) and special functions (scipy.special).

You can reimplement them, but what's the point of duplicating them in the standard lib?

For example:

t-test: which versions? One-sample, two-sample, paired and unpaired, with and without homogeneous variances, with three alternative hypotheses (see the sketch below).

If we have t-tests, shouldn't we also have ANOVA for when we want to compare more than two samples?

....
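
For reference, the variants listed above all exist today in scipy.stats; a quick sketch (the sample data is invented):

from scipy import stats

a = [5.1, 4.9, 6.2, 5.6, 5.8]
b = [4.2, 4.8, 5.1, 4.4, 4.9]

stats.ttest_1samp(a, popmean=5.0)        # one-sample
stats.ttest_ind(a, b)                    # two-sample, pooled variance
stats.ttest_ind(a, b, equal_var=False)   # Welch's, unequal variances
stats.ttest_rel(a, b)                    # paired
stats.f_oneway(a, b)                     # one-way ANOVA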

If the Python versions that are not using a C backend need a statistics package and partial numpy replacement, then I don't think it needs to be in the CPython lib.


I think the "nuclear reactor" analogy is misplaced.

A Python implementation of statistics is a bicycle, numpy is a car, and if you need some heavier lifting in statistics or machine learning, then the trucks are scipy, scikit-learn and statsmodels (and pandas for the data handling).
And rpy for things that are not directly available in Python.


I'm one of the maintainers for scipy.stats and for statsmodels.

We have a similar problem of deciding on the boundaries and scope of numpy, scipy.stats, pandas, patsy, statsmodels and scikit-learn. There is some overlap of functionality where the purpose or use cases are different, but in general we try to avoid too much duplication.


https://pypi.python.org/pypi/statsmodels
https://pypi.python.org/pypi/pandas
https://pypi.python.org/pypi/patsy (R-like formulas)
https://pypi.python.org/pypi/scikit-learn


Josef
 

CM

I am seeking comments on PEP 450, Adding a statistics module to Python's
standard library:

I just saw today that this will be included in Python 3.4. Congratulations, Steven, this is a nice addition.
 
