Steven D'Aprano
The trick here is that numpy really is the "right" way to do this stuff.
Numpy does not have a monopoly on the correct algorithms for statistics
functions, and a big, heavyweight library like numpy is overkill for many
lightweight statistics tasks. You shouldn't need to turn on a nuclear
reactor just to put the light on in your fridge.
I like to say:
"crunching numbers in python without numpy is like doing text processing
without using the string object"
Your analogy is backwards. String objects actually aren't optimal for
heavy duty text processing, because they're immutable. If you're serious
about crunching vast amounts of numbers, you'll use numpy. If you're
serious about crunching vast amounts of text, say for a text editor or word
processor, you *won't* use strings, you'll use some sort of mutable
buffer, or ropes, or some other data type. You are very unlikely to use
strings.
What this is really an argument for is a numpy-lite in the standard
library, which could be used to build these sorts of things on. But
that's been rejected before...
"Numpy-lite". Which parts of numpy? Who maintains it? The numpy release
schedule is nothing like the standard library's release schedule, so
which one has to change? Or does somebody fork numpy, giving two
independent code bases?
What about Jython, IronPython, and other Python implementations? Even
PyPy doesn't support numpy yet, and Jython and IronPython probably never
will, since they're not C-based.
A few other comments:
1) the numpy folks have been VERY good at providing binaries for Windows
and OS-X -- easy point and click installing.
2) I hope we're almost there with standardizing pip and binary wheels,
at which point pip install will be painless.
Yeah, right, sure it will be. I've been waiting a decade for package
management on Linux to become painless, and it still isn't. There's no
reason to expect pip will be more painless than aptitude or yum.
But even if it is, installation of software is not just a software
problem to be solved by better technology. There is also the social
problem that not everyone is permitted to arbitrarily install software.
I'm not just talking about security policies on the machine, but security
policies in real life. People can be sacked for installing software they
don't have permission to install.
Machines may be locked down, users may have to submit a request before
software will be installed. That may involve a security audit, legal
review of licencing, strategy for full roll-back, potentially even a
complete code audit. (Imagine auditing all of numpy.) Or policy may
simply say, *no software from unapproved vendors* full stop.
Not everyone is privileged to be permitted to install whatever software
they like, when they like. Here are two organisations that make software
installation requests *easy*:
http://www.uhd.edu/computing/acl/SoftwareInstallationRequest.html
http://www.calstatela.edu/its/services/software/instructsoftwarerequest.php/form2.php
Pip install isn't going to fix that.
There are many, many people in a situation where the Python std lib is
approved, usually because it comes from a vendor with a support contract
(say, RedHat, Ubuntu, or Suse), but getting third-party packages like
numpy approved is next to impossible. "Just install numpy" is a solution
for a privileged few.
even before (2) -- pip install works fine anywhere the system is set up
to build python extensions (granted, not a given on Windows and Mac, but
pretty likely on Linux)
Oh, well that's okay then -- that's three, maybe four percent of the
computing world taken care of! Problem solved!
Not.
-- the idea that running pip install wrote out a
lot of text (but worked!) is somehow a barrier to entry is absurd --
anyone building their own stuff on Linux is used to that.
Do you realise that not all Python programmers are used to, or able to,
"build their own stuff on Linux"?
[...]
All that being said -- if you do decide to do this, please use a PEP
3118 (enhanced buffer) supporting data type (probably array.array) --
compatibility with numpy and other packages for crunching numbers is
very nice.
py> import array
py> data = array.array('f', range(1000))
py> import statistics
py> statistics.mean(data)
499.5
py> statistics.stdev(data)
288.8194360957494
If the data type supports the sequence protocol, it should work with my
module. If it fails to work, submit a bug report, and I will fix it.
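
To make the "sequence protocol" point concrete, here is a minimal sketch.
The Readings class is a hypothetical container invented for illustration
(it is not part of any library); it defines only __len__ and __getitem__.
The second half passes a memoryview, the PEP 3118 view onto an array's
buffer, on the assumption that a simple native format such as 'd' is used.

```python
import array
import statistics


class Readings:
    """Hypothetical container exposing only the sequence protocol."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __getitem__(self, index):
        # Raises IndexError past the end, which is what lets iter() stop.
        return self._values[index]


readings = Readings([1.0, 2.0, 3.0, 4.0])
custom_mean = statistics.mean(readings)
custom_median = statistics.median(readings)

# A memoryview exposes the array's buffer (PEP 3118) and is itself iterable.
view = memoryview(array.array('d', [1.0, 2.0, 3.0, 4.0]))
view_mean = statistics.mean(view)

print(custom_mean, custom_median, view_mean)
```

Nothing here is numpy-specific: any object that can be iterated and
measured with len() should behave the same way.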
If someone decides to build a stand-alone stats package -- building it
on a ndarray-lite (PEP 3118 compatible) object would be a nice way to
go.
One other point -- for performance reasons, it would be nice to have some
compiled code in there -- this adds incentive to put it in the stdlib --
external packages that need compiling are what make numpy unacceptable
to some folks.
Like the decimal module, it will probably remain pure-Python for a few
releases, but I hope that in the future the statistics module will gain a
C-accelerated version. (Or Java-accelerated for Jython, etc.) I expect
that PyPy won't need one. But because it's not really aimed at number-
crunching megabytes of data, speed is not the priority.