linear least squares fit

One of the most common uses of least squares fitting is fitting a straight line to data. Whilst, in general, it is difficult to determine the curve which best fits the data, in this case there is a relatively simple formula which can be used.

Theorem 1.

Suppose we have a data set $(x_{1},y_{1}),\ldots,(x_{n},y_{n})$ in which the $x_{k}$ are not all equal. Then the straight line which best fits this set in the least squares sense is given by

$$y={ns-pq\over nr-p^{2}}x+{qr-ps\over nr-p^{2}}$$

where

\begin{align}
p &= \sum_{k=1}^{n}x_{k} \tag{1}\\
q &= \sum_{k=1}^{n}y_{k} \tag{2}\\
r &= \sum_{k=1}^{n}x_{k}^{2} \tag{3}\\
s &= \sum_{k=1}^{n}x_{k}y_{k} \tag{4}
\end{align}
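
As a quick illustration of the formula (the three data points are chosen here purely for the sake of the example and do not come from any real data set), take $(0,0),(1,1),(2,1)$, so that $n=3$:

$$p=3,\quad q=2,\quad r=5,\quad s=3,\qquad nr-p^{2}=6,$$

$$y={3\cdot 3-3\cdot 2\over 6}x+{2\cdot 5-3\cdot 3\over 6}={1\over 2}x+{1\over 6}.$$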

Proof.

Being the best fitting line means minimizing the merit function $M$, given by

$$M(a,b)=\sum_{k=1}^{n}(ax_{k}+b-y_{k})^{2}$$

with respect to the parameters $a$ and $b$. Expanding the square, this can be written as

$$M(a,b)=ra^{2}+2pab+nb^{2}-2sa-2qb+t$$

where $p, q, r, s$ are as above and

$$t=\sum_{k=1}^{n}y_{k}^{2}.$$
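
For readers who want the intermediate step, the expansion can be checked term by term; the following is only a spelled-out version of the computation above:

$$(ax_{k}+b-y_{k})^{2}=a^{2}x_{k}^{2}+2ab\,x_{k}+b^{2}-2a\,x_{k}y_{k}-2b\,y_{k}+y_{k}^{2},$$

and summing over $k=1,\ldots,n$ turns the six terms into $ra^{2}$, $2pab$, $nb^{2}$, $-2sa$, $-2qb$ and $t$ respectively.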

This function $M$ is a quadratic polynomial in $a$ and $b$. Its highest order terms, $ra^{2}+2pab+nb^{2}=\sum_{k=1}^{n}(ax_{k}+b)^{2}$, are a sum of squares and hence non-negative; they are positive definite provided the $x_{k}$ are not all equal, since then $nr-p^{2}>0$ by the Cauchy-Schwarz inequality. Hence $M$ has a unique minimum, and all that remains is to find it. To do this, we set the derivatives equal to zero to obtain the following equations:

\begin{align}
0 &= {\partial M(a,b)\over\partial a} = 2ar+2pb-2s \tag{5}\\
0 &= {\partial M(a,b)\over\partial b} = 2pa+2nb-2q \tag{6}
\end{align}
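
It may help to view these two conditions as a single linear system. Dividing each by $2$ and rearranging gives

$$\begin{pmatrix}r&p\\ p&n\end{pmatrix}\begin{pmatrix}a\\ b\end{pmatrix}=\begin{pmatrix}s\\ q\end{pmatrix},$$

whose determinant is $nr-p^{2}$, so Cramer's rule applies whenever $nr-p^{2}\neq 0$.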

These equations are easily solved to give

\begin{align}
a &= {ns-pq\over nr-p^{2}} \tag{7}\\
b &= {qr-ps\over nr-p^{2}}; \tag{8}
\end{align}

substituting into the equation $y=ax+b$ of a straight line, we obtain the line stated in the theorem. ∎
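
The closed form of Theorem 1 translates almost directly into code. The following is a minimal sketch in plain Python; the function name `fit_line` and the sample points are assumptions made for illustration, and the variables `p`, `q`, `r`, `s` mirror the sums defined in the theorem.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing sum((a*x + b - y)**2), using the closed form."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    p = sum(xs)                              # sum of x_k
    q = sum(ys)                              # sum of y_k
    r = sum(x * x for x in xs)               # sum of x_k^2
    s = sum(x * y for x, y in zip(xs, ys))   # sum of x_k * y_k
    denom = n * r - p * p
    if denom == 0:
        # Happens when all x values coincide (or there is no data):
        # the best line is then not uniquely determined.
        raise ValueError("x values must not all be equal")
    a = (n * s - p * q) / denom              # slope
    b = (q * r - p * s) / denom              # intercept
    return a, b


# Illustrative data only: three points that are not collinear.
a, b = fit_line([0, 1, 2], [0, 1, 1])
print("slope:", a, "intercept:", b)
```

For these three points the slope is $1/2$ and the intercept is $1/6$. When the $x_{k}$ are large and closely spaced, the subtraction in `denom` can lose precision in floating point, so a library routine may be preferable for serious work.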

Because of the ease with which one can make a least squares fit of a line, this technique is often adapted to fitting other sorts of curves by making a change of variables. Two common cases of this practice are power laws and exponentials.

Suppose that one wants to fit some data to a curve of the form $y=ce^{kx}$ with $c>0$. Making the change of variable $y=e^{u}$ and defining $b=\log c$, the equation of the curve becomes $u=kx+b$. One can therefore fit the data set $(x_{1},\log y_{1}),\ldots,(x_{n},\log y_{n})$ to a straight line, provided every $y_{k}$ is positive.
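
The change of variables for the exponential case might be sketched as follows. The helper names `fit_line` and `fit_exponential` are hypothetical, and the data are synthetic and noise-free, generated from $c=2$, $k=0.5$ only to show that the transformation recovers the parameters in that ideal situation.

```python
import math


def fit_line(xs, ys):
    """Closed-form least squares line; returns (slope, intercept)."""
    n = len(xs)
    p, q = sum(xs), sum(ys)
    r = sum(x * x for x in xs)
    s = sum(x * y for x, y in zip(xs, ys))
    denom = n * r - p * p
    if denom == 0:
        raise ValueError("x values must not all be equal")
    return (n * s - p * q) / denom, (q * r - p * s) / denom


def fit_exponential(xs, ys):
    """Fit y = c*exp(k*x) by fitting a line to (x, log y); returns (c, k)."""
    if any(y <= 0 for y in ys):
        raise ValueError("all y values must be positive to take logarithms")
    k, b = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(b), k   # undo b = log c


# Synthetic data lying exactly on y = 2 * exp(0.5 * x).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
c, k = fit_exponential(xs, ys)
print("c:", c, "k:", k)
```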

Suppose that one wants to fit some data to a curve of the form $y=cx^{p}$ with $c>0$ (here $p$ denotes the exponent, not the sum defined in Theorem 1). Making the change of variables $x=e^{v}$, $y=e^{u}$ and defining $b=\log c$, the equation of the curve becomes $u=pv+b$. One can therefore fit the data set $(\log x_{1},\log y_{1}),\ldots,(\log x_{n},\log y_{n})$ to a straight line, provided every $x_{k}$ and $y_{k}$ is positive.
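
The power law case can be sketched in the same way, now taking logarithms of both coordinates. As before, `fit_line` and `fit_power_law` are hypothetical helper names, and the data are synthetic and noise-free, generated from $c=3$ and exponent $1.5$.

```python
import math


def fit_line(xs, ys):
    """Closed-form least squares line; returns (slope, intercept)."""
    n = len(xs)
    p, q = sum(xs), sum(ys)
    r = sum(x * x for x in xs)
    s = sum(x * y for x, y in zip(xs, ys))
    denom = n * r - p * p
    if denom == 0:
        raise ValueError("x values must not all be equal")
    return (n * s - p * q) / denom, (q * r - p * s) / denom


def fit_power_law(xs, ys):
    """Fit y = c * x**exponent via a line through (log x, log y)."""
    if any(x <= 0 for x in xs) or any(y <= 0 for y in ys):
        raise ValueError("all x and y values must be positive")
    vs = [math.log(x) for x in xs]
    us = [math.log(y) for y in ys]
    exponent, b = fit_line(vs, us)
    return math.exp(b), exponent   # undo b = log c


# Synthetic data lying exactly on y = 3 * x**1.5.
xs = [1.0, 2.0, 4.0, 8.0]
ys = [3.0 * x ** 1.5 for x in xs]
c, exponent = fit_power_law(xs, ys)
print("c:", c, "exponent:", exponent)
```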

Although convenient and common, this procedure can be something of a cheat, because changing variables and making a least squares fit of a line is not the same as making a least squares fit to the original curve. The reason is that the merit function in the transformed variables differs from the one in the original variables, and the two will not, in general, have their minima in the same place. However, if the data happen to lie approximately on a power curve or an exponential, then the answer obtained by changing variables and fitting will be an approximation to the correct answer. Depending on what one is doing, this approximation may be good enough, or one may use it as a starting point for an iterative algorithm that computes the correct minimum.
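
To make the difference between the two merit functions concrete, the sketch below fits $y=ce^{kx}$ first by the change of variables and then refines that answer against the original merit function. The refinement uses a damped Gauss-Newton iteration, which is one possible choice of algorithm and is not prescribed by the text above; the function names and the data values are invented for illustration.

```python
import math


def fit_line(xs, ys):
    """Closed-form least squares line; returns (slope, intercept)."""
    n = len(xs)
    p, q = sum(xs), sum(ys)
    r = sum(x * x for x in xs)
    s = sum(x * y for x, y in zip(xs, ys))
    denom = n * r - p * p
    return (n * s - p * q) / denom, (q * r - p * s) / denom


def sse(c, k, xs, ys):
    """Merit function in the original variables: sum of (c*e^(k*x) - y)^2."""
    return sum((c * math.exp(k * x) - y) ** 2 for x, y in zip(xs, ys))


def refine(c, k, xs, ys, iterations=50):
    """Damped Gauss-Newton steps that never increase sse(c, k)."""
    for _ in range(iterations):
        jcc = jck = jkk = gc = gk = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(k * x)
            res = c * e - y          # residual at this point
            dc = e                   # d(residual)/dc
            dk = c * x * e           # d(residual)/dk
            jcc += dc * dc
            jck += dc * dk
            jkk += dk * dk
            gc += dc * res
            gk += dk * res
        det = jcc * jkk - jck * jck
        if det == 0:
            break
        # Solve the 2x2 system (J^T J) step = -(J^T res).
        step_c = -(jkk * gc - jck * gk) / det
        step_k = -(jcc * gk - jck * gc) / det
        # Halve the step until the merit function does not increase.
        current = sse(c, k, xs, ys)
        t = 1.0
        while t > 1e-10 and sse(c + t * step_c, k + t * step_k, xs, ys) > current:
            t /= 2
        if t <= 1e-10:
            break                    # no acceptable step: treat as converged
        c, k = c + t * step_c, k + t * step_k
    return c, k


# Invented, roughly exponential data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.6, 2.9, 4.3, 7.6]

k0, b0 = fit_line(xs, [math.log(y) for y in ys])
c0 = math.exp(b0)                    # answer from the change of variables
c1, k1 = refine(c0, k0, xs, ys)      # refined against the original merit function

print("linearized:", c0, k0, "sse =", sse(c0, k0, xs, ys))
print("refined:   ", c1, k1, "sse =", sse(c1, k1, xs, ys))
```

Because each accepted step is required not to increase the merit function, the refined parameters are never worse than the linearized ones in the original least squares sense, though the iteration is only guaranteed to find a local minimum.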

Title	linear least squares fit
Canonical name	LinearLeastSquaresFit
Date of creation	2013-03-22 17:24:16
Last modified on	2013-03-22 17:24:16
Owner	rspuzio (6075)
Last modified by	rspuzio (6075)
Numerical id	13
Author	rspuzio (6075)
Entry type	Definition
Classification	msc 15-00
Related topic	RegressionModel
Related topic	GaussMarkovTheorem