Request for feedback on API design

  • Thread starter Steven D'Aprano

Steven D'Aprano

I am soliciting feedback regarding the API of my statistics module:

http://code.google.com/p/pycalcstats/


Specifically the following couple of issues:

(1) Multivariate statistics such as covariance have two obvious APIs:

A pass the X and Y values as two separate iterable arguments, e.g.:
cov([1, 2, 3], [4, 5, 6])

B pass the X and Y values as a single iterable of tuples, e.g.:
cov([(1, 4), (2, 5), (3, 6)])

I currently support both APIs. Do people prefer one, or the other, or
both? If there is a clear preference for one over the other, I may drop
support for the other.
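
For illustration only, here is a minimal sketch of how a single function could accept both call forms. It is not the module's actual code: the dispatch on an optional second argument, the length check, and the choice of population covariance (dividing by n) are all assumptions made for this example.

```python
def cov(xs, ys=None):
    """Population covariance (divides by n; an assumption for this sketch).

    Accepts either cov(xs, ys) (form A) or cov(pairs) (form B).
    """
    if ys is None:
        # Form B: a single iterable of (x, y) tuples.
        pairs = list(xs)
    else:
        # Form A: two separate iterables, which must be the same length.
        xs, ys = list(xs), list(ys)
        if len(xs) != len(ys):
            raise ValueError("x and y must have the same length")
        pairs = list(zip(xs, ys))
    n = len(pairs)
    if n == 0:
        raise ValueError("no data")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n


# Both call forms give the same result for the same data.
a = cov([1, 2, 3], [4, 5, 6])
b = cov([(1, 4), (2, 5), (3, 6)])
```

Note that form A needs an explicit length check, while form B gets the pairing for free from the shape of its input.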


(2) Statistics text books often give formulae in terms of sums and
differences such as

Sxx = n*Σ(x**2) - (Σx)**2

There are quite a few of these: I count at least six common ones, all
closely related and confusingly named:

Sxx, Syy, Sxy, SSx, SSy, SPxy

(the x and y should all be subscript).

Are they useful, or would they just add unnecessary complexity? Would
people like to see these included in the package?



Thank you for your feedback.
 

Tim Chase

(1) Multivariate statistics such as covariance have two obvious APIs:

A pass the X and Y values as two separate iterable arguments, e.g.:
cov([1, 2, 3], [4, 5, 6])

B pass the X and Y values as a single iterable of tuples, e.g.:
cov([(1, 4), (2, 5), (3, 6)])

I currently support both APIs. Do people prefer one, or the other, or
both? If there is a clear preference for one over the other, I may drop
support for the other.

I'm partial to the "B" form (iterable of 2-tuples) -- it
indicates that the two data-sets (x_n and y_n) should be of the
same length and paired. The "A" form makes it less obvious
that len(param1) should equal len(param2).

I haven't poked at your code sufficiently to determine whether
all the functions within can handle streamed data, or whether
they keep the entire dataset internally, but handing off an
iterable-of-pairs tends to be a little more straightforward:

cov(humongous_dataset_iter)

or

cov(izip(humongous_dataset_iter1, humongous_dataset_iter2))

The "A" form makes doing this a little less obvious than the "B"
form.
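
As a rough illustration of the streaming point above, here is a sketch of a single-pass covariance that consumes an iterable of pairs without storing it. The name cov_stream is hypothetical and this is not taken from pycalcstats; it uses the usual running-mean update and divides by n (population covariance), which is an assumption. On Python 3 the built-in zip is already lazy, so it plays the role of izip.

```python
def cov_stream(pairs):
    """Single-pass population covariance over an iterable of (x, y) pairs.

    Hypothetical helper for illustration; keeps only running totals,
    never the whole dataset.
    """
    n = 0
    mean_x = mean_y = 0.0
    co_moment = 0.0  # running sum of (x - mean_x) * (y - mean_y)
    for x, y in pairs:
        n += 1
        dx = x - mean_x                  # deviation from the OLD mean of x
        mean_x += dx / n
        mean_y += (y - mean_y) / n
        co_moment += dx * (y - mean_y)   # uses the UPDATED mean of y
    if n == 0:
        raise ValueError("no data")
    return co_moment / n


# Two generators stand in for large data sources; zip pairs them lazily.
xs = (float(i) for i in range(1, 4))   # 1.0, 2.0, 3.0
ys = (float(i) for i in range(4, 7))   # 4.0, 5.0, 6.0
result = cov_stream(zip(xs, ys))
```

With two separate arguments, the function itself would have to do the pairing, and a length mismatch in streamed inputs could only be detected once one of them runs out.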

(2) Statistics text books often give formulae in terms of sums and
differences such as

Sxx = n*Σ(x**2) - (Σx)**2

There are quite a few of these: I count at least six common ones,

When you take this count, is it across multiple text-books, or
are they common in just a small sampling of texts? (I confess
it's been a decade and a half since I last suffered a stats class)

all closely related and confusingly named:

Sxx, Syy, Sxy, SSx, SSy, SPxy

(the x and y should all be subscript).

Are they useful, or would they just add unnecessary complexity?

I think it depends on your audience: amateur statisticians or
pros? I suspect that pros wouldn't blink at the distinctions
while weekenders like myself would get a little bleary-eyed
without at least a module docstring to clearly spell out the
distinctions and the formulae used for determining them.

Just my from-the-hip thoughts for whatever little they may be worth.

-tkc
 

Steven D'Aprano

(1) Multivariate statistics such as covariance have two obvious APIs:

A pass the X and Y values as two separate iterable arguments,
e.g.:
cov([1, 2, 3], [4, 5, 6])

B pass the X and Y values as a single iterable of tuples, e.g.:
cov([(1, 4), (2, 5), (3, 6)])

I currently support both APIs. Do people prefer one, or the other, or
both? If there is a clear preference for one over the other, I may drop
support for the other.

I'm partial to the "B" form (iterable of 2-tuples) -- it indicates that
the two data-sets (x_n and y_n) should be of the same length and paired.
The "A" form makes it less obvious that len(param1) should equal
len(param2).


Thanks for the comments, Tim. To answer your questions:

I haven't poked at your code sufficiently to determine whether all the
functions within can handle streamed data, or whether they keep the
entire dataset internally,

Where possible, the functions don't keep the entire dataset internally.
Some functions have to (e.g. order statistics need to see the entire data
sequence at once), but the rest are capable of dealing with streamed data.

Also, there are a few functions such as standard deviation that have a
single-pass algorithm, and a more accurate multiple-pass algorithm.
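
To make the single-pass versus multiple-pass distinction concrete, here is a sketch of each for the population standard deviation. These are not the module's implementations: the single-pass version uses Welford's running update, the two-pass version computes the mean first and then the squared deviations, and both function names are made up for this example.

```python
import math


def pstdev_one_pass(data):
    """Population standard deviation in one pass (Welford's update).

    Works on any iterable, including a stream that can be read only once.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n == 0:
        raise ValueError("no data")
    return math.sqrt(m2 / n)


def pstdev_two_pass(data):
    """Population standard deviation in two passes.

    Needs the whole dataset in memory: one pass for the mean,
    a second for the squared deviations.
    """
    data = list(data)
    n = len(data)
    if n == 0:
        raise ValueError("no data")
    mean = math.fsum(data) / n
    return math.sqrt(math.fsum((x - mean) ** 2 for x in data) / n)


sample = [2, 4, 4, 4, 5, 5, 7, 9]
one = pstdev_one_pass(iter(sample))   # a one-shot iterator is enough
two = pstdev_two_pass(sample)         # needs the full list
```

The trade-off is the one described above: the single-pass form can handle streamed data, while the two-pass form has to hold the data but can use an accurate summation such as math.fsum.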

When you take this count, is it across multiple text-books, or are they
common in just a small sampling of texts? (I confess it's been a decade
and a half since I last suffered a stats class)

I admit that I haven't done an exhaustive search of the literature, but
it does seem quite common to extract common expressions from various
stats formulae and give them names. The only use-case I can imagine for
them is checking hand-calculations or doing schoolwork.
 

Arnaud Delobelle

Steven D'Aprano said:
I am soliciting feedback regarding the API of my statistics module:

http://code.google.com/p/pycalcstats/


Specifically the following couple of issues:

(1) Multivariate statistics such as covariance have two obvious APIs:

A pass the X and Y values as two separate iterable arguments, e.g.:
cov([1, 2, 3], [4, 5, 6])

B pass the X and Y values as a single iterable of tuples, e.g.:
cov([(1, 4), (2, 5), (3, 6)])

I currently support both APIs. Do people prefer one, or the other, or
both? If there is a clear preference for one over the other, I may drop
support for the other.

I don't have an informed opinion on this.

(2) Statistics text books often give formulae in terms of sums and
differences such as

Sxx = n*Σ(x**2) - (Σx)**2

Interestingly, your Sxx is closely related to the variance:

if x is a list of n numbers then

Sxx == (n**2)*var(x)

And more generally if x and y have the same length n, then Sxy (*) is
related to the covariance

Sxy == (n**2)*cov(x, y)

So if you have a variance and covariance function, it would be redundant
to include Sxx and Sxy. Another argument against including Sxx & co is
that their definition is not universally agreed upon. For example, I
have seen

Sxx = Σ(x**2) - (Σx)**2/n
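
The relations above can be checked with exact arithmetic. The sketch below assumes var and cov mean the population forms (dividing by n); with the sample forms (dividing by n - 1) the factor would be n*(n - 1) rather than n**2. The helpers pvar and pcov and the data values are made up for this check.

```python
from fractions import Fraction


def pvar(x):
    """Population variance (divide by n), written out for this check."""
    n = len(x)
    mean = sum(x) / n
    return sum((v - mean) ** 2 for v in x) / n


def pcov(x, y):
    """Population covariance (divide by n), written out for this check."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


# Fractions keep every step exact, so == comparisons are safe.
x = [Fraction(v) for v in (1, 2, 4, 7)]
y = [Fraction(v) for v in (3, 1, 4, 1)]
n = len(x)

Sxx = n * sum(v * v for v in x) - sum(x) ** 2
Sxy = n * sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y)
Sxx_alt = sum(v * v for v in x) - sum(x) ** 2 / n   # the other definition

assert Sxx == n ** 2 * pvar(x)
assert Sxy == n ** 2 * pcov(x, y)
assert Sxx_alt == n * pvar(x)        # differs from Sxx by a factor of n
```

The two definitions of Sxx differ only by a factor of n, which is one way to see why the name alone does not tell a reader which formula is meant.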

HTH
 

Ethan Furman

Steven said:
I am soliciting feedback regarding the API of my statistics module:

http://code.google.com/p/pycalcstats/


Specifically the following couple of issues:

(1) Multivariate statistics such as covariance have two obvious APIs:

A pass the X and Y values as two separate iterable arguments, e.g.:
cov([1, 2, 3], [4, 5, 6])

B pass the X and Y values as a single iterable of tuples, e.g.:
cov([(1, 4), (2, 5), (3, 6)])

I currently support both APIs. Do people prefer one, or the other, or
both? If there is a clear preference for one over the other, I may drop
support for the other.

Don't currently need/use stats, but B seems clearer to me.

~Ethan~
 
