Can a straight-line least-squares fit with an error measure be made in a single pass?

Alf P. Steinbach

It would be nice if one could make do with a non-copyable forward
iterator for the (x,y) data points, without storing the points.

However, my current code requires the ability to iterate a second time,
from a copy of the start iterator.

Any suggestions for how to avoid the second iteration, or should I
perhaps just forget it? (I only need this to check algorithm complexity
in a unit test, so it's not a current problem, just "not ideal" code...)


Code:
#pragma once
// Copyright (c) 2013 Alf P. Steinbach

// <url: http://terpconnect.umd.edu/~toh/spectrum/CurveFitting.html#MathDetails>
//  n = number of x,y data points
//  sumx = Σx
//  sumy = Σy
//  sumxy = Σx*y
//  sumx2 = Σx*x
//  meanx = sumx / n
//  meany = sumy / n
//  slope = (n*sumxy - sumx*sumy) / (n*sumx2 - sumx*sumx)
//  intercept = meany - slope*meanx
//  ssy = Σ(y - meany)^2
//  ssr = Σ(y - intercept - slope*x)^2
//  R2 = 1 - ssr/ssy
//  Standard deviation of the slope = SQRT(ssr/(n-2))*SQRT(n/(n*sumx2 - sumx*sumx))
//  Standard deviation of the intercept = SQRT(ssr/(n-2))*SQRT(sumx2/(n*sumx2 - sumx*sumx))
//
// Also, <url: http://mathworld.wolfram.com/LeastSquaresFitting.html>

#include <rfc/cppx/calc/Statistics.h>       // cppx::calc::Statistics
#include <cmath>                            // std::sqrt

namespace cppx{ namespace calc{

    class Best_line_fit
        : private Statistics
    {
    private:
        double      intercept_;
        double      slope_;
        double      ssy_;       // Sum of squares of y-expressions
        double      ssr_;       // Sum of squares of residue expressions

    public:
        auto stats() const -> Statistics const&    { return *this; }

        auto slope() const -> double        { return slope_; }
        auto intercept() const -> double    { return intercept_; }

        auto ssy() const -> double          { return ssy_; }
        auto ssr() const -> double          { return ssr_; }
        auto r_squared() const -> double
            { return (ssr_ == 0 || ssy_ == 0? 1.0 : 1.0 - ssr_/ssy_); }
        auto r() const -> double            { return std::sqrt( r_squared() ); }

        template< class It >
        Best_line_fit( It const first, It const end )
            : Statistics( first, end )
            , ssy_( 0.0 )
            , ssr_( 0.0 )
        {
            slope_ = (n()*sum_xy() - sum_x()*sum_y())
                   / (n()*sum_xx() - square( sum_x() ));
            intercept_ = mean_y() - slope_*mean_x();

            // Second pass over the data; this is why the start iterator
            // must be copyable:
            for( It it = first; it != end; ++it )
            {
                double const x = x_of( *it );
                double const y = y_of( *it );

                //ssy_ += square( y - mean_y() );
                ssr_ += square( y - (slope_*x + intercept_) );
            }
            // ssy from the accumulated sums, no extra pass needed:
            ssy_ = sum_yy() - 2*sum_y()*mean_y() + n()*square( mean_y() );
        }
    };

} }  // namespace cppx::calc
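
For completeness, usage is meant to be along these lines. The header name and the assumption that x_of/y_of have overloads for std::pair are guesses on my part, not verified:

Code:
#include <rfc/cppx/calc/Best_line_fit.h>    // Assumed name of the file above.
#include <iostream>
#include <utility>
#include <vector>

auto main() -> int
{
    using cppx::calc::Best_line_fit;

    // Points near the line y = 2x + 1.
    std::vector< std::pair<double, double> > const points =
        { {0.0, 1.0}, {1.0, 3.1}, {2.0, 4.9}, {3.0, 7.0} };

    Best_line_fit const fit( points.begin(), points.end() );
    std::cout
        << "slope = " << fit.slope()
        << ", intercept = " << fit.intercept()
        << ", R^2 = " << fit.r_squared()
        << "\n";
}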

Disclaimer 1: I do not KNOW that the code is correct, in particular the
r_squared() implementation.

Disclaimer 2: In the mathworld reference I do not understand eq. 17.


Cheers,

- Alf
 

Victor Bazarov

> It would be nice if one could make do with a non-copyable forward
> iterator for the (x,y) data points, without storing the points.

Looking at the code below, I cannot really answer this question without
knowing what 'Statistics' is. Presuming that it does iterate over the
given range to collect the "sum of" and "mean" and the other values,
your other choice is to fix the 'Statistics' class to either calculate
your 'ssr' to boot, or to make it store the x and y values so you can
extract them later at your leisure.
> However, my current code requires the ability to iterate a second time,
> from a copy of the start iterator.
>
> Any suggestions for how to avoid the second iteration, or should I
> perhaps just forget it? (I only need this to check algorithm complexity
> in a unit test, so it's not a current problem, just "not ideal" code...)


> Code: [snip]
>
> Disclaimer 1: I do not KNOW that the code is correct, in particular the
> r_squared() implementation.
>
> Disclaimer 2: In the mathworld reference I do not understand eq. 17.

It's the simplification of eq 16. X' (x with line over it) is the mean
x value, isn't it?

SUMi(Xi - X')^2 = SUMi(XiXi - 2XiX' + X'X') =

= SUMi(Xi^2) - SUMi(2XiX') + SUMi(X'X') =

= SUMi(Xi^2) - 2X'*SUMi(Xi) + N*X'X' = ( assuming X'=SUM(Xi)/N )

= SUMi(Xi^2) - 2X'*N*X' + N*X'X' = SUMi(Xi^2) - N*X'X'

Same with eq's 18 and 20.

V
 

Alf P. Steinbach

> Looking at the code below, I cannot really answer this question without
> knowing what 'Statistics' is. Presuming that it does iterate over the
> given range to collect the "sum of" and "mean" and the other values,

Yes.

> your other choice is to fix the 'Statistics' class to either calculate
> your 'ssr' to boot,

This would be preferable.

However, the 'ssr' value depends on values that themselves depend on the
whole collection of numbers, plus those original numbers again:

// slope = (n*sumxy - sumx*sumy) / (n*sumx2 - sumx*sumx)
// intercept = meany - slope*meanx
// ssr = Σ(y - intercept - slope*x)^2

I know there is a neat trick for updating a mean value incrementally, so
I thought maybe someone knew about something similar here?
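
The trick I mean, for reference (a minimal sketch; the names are mine):

Code:
// Incremental mean update: mean_k = mean_(k-1) + (x_k - mean_(k-1))/k.
// One pass, O(1) state, no stored points.
template< class It >
auto running_mean( It first, It const end )
    -> double
{
    double mean = 0.0;
    for( int n = 1; first != end; ++first, ++n )
    {
        mean += (*first - mean)/n;
    }
    return mean;
}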

It seems like a math/C++ tradeoff: more of one to get less of the other.

> or to make it store the x and y values so you can
> extract them later at your leisure.

Yes.


[snip]
>> Disclaimer 1: I do not KNOW that the code is correct, in particular the
>> r_squared() implementation.
>>
>> Disclaimer 2: In the mathworld reference I do not understand eq. 17.
>
> It's the simplification of eq 16. X' (x with line over it) is the mean
> x value, isn't it?
>
> SUMi(Xi - X')^2 = SUMi(XiXi - 2XiX' + X'X') =
>
> = SUMi(Xi^2) - SUMi(2XiX') + SUMi(X'X') =
>
> = SUMi(Xi^2) - 2X'*SUMi(Xi) + N*X'X' = ( assuming X'=SUM(Xi)/N )
>
> = SUMi(Xi^2) - 2X'*N*X' + N*X'X' = SUMi(Xi^2) - N*X'X'
>
> Same with eq's 18 and 20.

Thanks! :)

I got some help with that already this morning, on Facebook, and yes I
felt pretty stupid... :-( Not to mention whether that feeling now
reflects reality. Which it may.

Anyway, for my unit testing this now works: using 40 000 time
measurement points I get an R goodness between 0.99 and 1 (perfect) for
linear complexity, and when I force O(n^2) behavior[1] I get an R around
0.98. Apparently values tight up against 0.99 or thereabouts are
normal, i.e. this appears to be a very finely tuned goodness-of-fit
criterion. But again I do not know, and the texts I've found neglect to
say which numerical values decide the matter one way or the other.


Cheers,

- Alf (still in uncharted territory)

Notes:
[1] By the simple expedient of allocating just the required size for the
string's internal buffer, instead of using the doubling-buffer-size idea.
 

Alf P. Steinbach

> Sounds like you just need to precompute any needed terms to compute ssy
> and ssr.
>
> For example, ssy = sum( (y - my)^2 ) = sum( y^2 - 2*my*y + my^2 ),
>
> which, since my is a constant over the sum, is
>
> sum(y^2) - 2*my*sum(y) + my^2*sum(1),
>
> so you need to have computed sum(y^2) and sum(y).
>
> Do the same math for ssr, and Bob's your uncle, you have your solution.
>
> At first glance, ssr will need the sums of y^2, y, xy, x, x^2, 1.

Thanks, but the current (presented) code already does this for ssy;
it is ssr that is the problem, and it is not so easily updated.

Cheers,

- Alf
 

Alf P. Steinbach

> Depends on what you mean by ideal? Code simplicity, performance, accuracy,
> reducing the number of passes? These goals might be in conflict with each
> other. For example, std. deviation and variance can easily be calculated by
> a one-pass algorithm. However, a naive and simple implementation produces
> big round-off errors; there is a numerically stable one-pass algorithm from
> Knuth, but it involves a division operation inside the loop, so it might be
> slow. There is a stable two-pass algorithm that can actually be faster,
> but well, it has two passes.

Well, unfortunately my Mr. Knuth (the first three volumes; I never got
around to obtaining the TeX works or the current fourth-volume stuff)
resides at the bottom of some unidentified, anonymous-looking box, among
numerous other unidentified, anonymous-looking boxes.

But thanks anyway, it really helps to have some background.
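
For reference, the one-pass algorithm alluded to is, as far as I recall,
Welford's recurrence; a sketch from the general idea, not from any
particular text:

Code:
// Welford's one-pass variance: keep a running mean and a running sum of
// squared deviations from it. Numerically stable, but note the division
// inside the loop that was mentioned above.
template< class It >
auto sample_variance( It first, It const end )
    -> double
{
    double  mean    = 0.0;
    double  m2      = 0.0;      // Sum of squared deviations so far.
    int     n       = 0;

    for( ; first != end; ++first )
    {
        ++n;
        double const delta = *first - mean;
        mean += delta/n;                    // The in-loop division.
        m2 += delta*(*first - mean);
    }
    return (n > 1? m2/(n - 1) : 0.0);
}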

This was about variance; for linear regression it seems to be yet more
complicated. The first Google search turns up

http://stats.stackexchange.com/ques...g-running-linear-or-logistic-regression-param

Hm, among the answers there, the last one (with 0 votes) seemed most
promising.

Still, the upshot seems to be "can't be done".


Cheers,

- Alf

PS: Donald Knuth did most of the first volume's work while he was (doing
postdoc work?) at Norsk Regnesentral in Oslo. I can't help but connect
that somehow to the Simula OO work being done at the University of Oslo
at that time, even though I have never read about any connection. I
once interviewed there, and mentioned this, just to show I'd done my bit
of background research. But they had not even heard of him. :(
 

Alf P. Steinbach

> But if you do the math, you can derive similar equations for ssr.
>
> ssr = sum( (y - b - m*x)^2 )
>     = sum( y^2 + b^2 + m^2*x^2 - 2*y*b - 2*m*x*y + 2*b*m*x )
>
>     = sum(y^2) + b^2*sum(1) + m^2*sum(x^2) - 2*b*sum(y) - 2*m*sum(x*y)
>       + 2*b*m*sum(x)
>
> Remember, for THIS summation, intercept (called b above) and slope (called
> m above) are constants and can be factored out of the loop.
>
> No promise I haven't made a dumb math error above, but the principle is
> solid.

Thanks a bunch! :)


Cheers,

- Alf

Oh, why did that seem impossible to me, why didn't I think of it...
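
For the record, with that expansion the second loop can simply go. A
sketch of the reworked constructor (untested), using only the sums that
the Statistics base shown earlier already accumulates:

Code:
template< class It >
Best_line_fit( It const first, It const end )
    : Statistics( first, end )
{
    slope_ = (n()*sum_xy() - sum_x()*sum_y())
           / (n()*sum_xx() - square( sum_x() ));
    intercept_ = mean_y() - slope_*mean_x();

    // ssy = Σy^2 - n*meany^2.
    ssy_ = sum_yy() - n()*square( mean_y() );

    // ssr = Σ(y - b - m*x)^2 expanded in the accumulated sums, with
    // b = intercept_ and m = slope_ constant over the summation:
    ssr_ = sum_yy() + n()*square( intercept_ ) + square( slope_ )*sum_xx()
         - 2*intercept_*sum_y() - 2*slope_*sum_xy()
         + 2*intercept_*slope_*sum_x();

    // Rounding can push the expanded expression slightly negative; clamp it.
    if( ssr_ < 0 ) { ssr_ = 0; }
}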
 

SG

> It would be nice if one could make do with a non-copyable forward
> iterator for the (x,y) data points, without storing the points.

Let me just mention that this extends to any linear least squares
problem. The amount of memory you really need is just determined by
the number of unknowns, not the number of "data points". Suppose

A x = b

is an overdetermined linear equation system with very few unknowns
(and hence very few columns in A) but many equations (rows in A).
Then, you can, for example, compute A^T A and A^T b of

A^T A x = A^T b // Gauss' normal equation system

using a single loop over all rows of A and b. A^T A will be a small
square matrix with n^2 entries where n is the number of unknowns.
This can then be solved with a Cholesky factorization, for example.

Other approaches are possible, too. If you care about numerical
accuracy you could "accumulate" a couple of rows and periodically
reduce the whole system to O(n^2) memory via Householder
transformations.

Cheers!
SG
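
To make that concrete for the straight-line case (two unknowns), here is
a sketch that accumulates A^T A and A^T b in one pass and then solves
the 2x2 normal equation system directly via Cramer's rule, with no
Cholesky needed at this size. The .first/.second point access is an
assumption about the data:

Code:
struct Line{ double slope; double intercept; };

// For rows (x_i, 1) of A and entries y_i of b, a single pass accumulates
//     A^T A = [[sxx, sx], [sx, n]]   and   A^T b = [sxy, sy].
template< class It >
auto fit_line( It first, It const end )
    -> Line
{
    double sxx = 0, sx = 0, n = 0;
    double sxy = 0, sy = 0;
    for( ; first != end; ++first )
    {
        double const x = first->first;      // Assumes pair-like points.
        double const y = first->second;
        sxx += x*x;  sx += x;  n += 1;
        sxy += x*y;  sy += y;
    }
    // Cramer's rule on the 2x2 system.
    double const det = n*sxx - sx*sx;
    return Line{ (n*sxy - sx*sy)/det, (sxx*sy - sx*sxy)/det };
}

Note that this yields the same slope formula as the code at the top of
the thread; the point is that only five running sums survive the loop,
independent of the number of data points.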
 
