| Version 4 |
Version 3 |
| \PMlinkescapeword{derivation} |
\PMlinkescapeword{derivation} |
| \PMlinkescapeword{terms} |
\PMlinkescapeword{terms} |
| \PMlinkescapeword{formula} |
\PMlinkescapeword{formula} |
|
|
| In many situations it is desirable to calculate, |
In many situations it is desirable to calculate, |
| in one iteration, the sample variance of many numbers, and without |
in one iteration, the sample variance of many numbers, and without having to store every number |
| having to have stored every number |
in computer memory. |
| in computer memory before beginning processing. |
|
|
|
| Let $x_1, x_2, \dotsc$ denote the data. |
Let $x_1, x_2, \dotsc$ denote the data. |
| The na\"ive formula for calculating the sample |
The na\"ive formula for calculating the sample |
| variance in one pass, |
variance in one pass, |
| \[ |
\[ |
| s_n = \frac{1}{n-1} \sum_{i=1}^n (x_i - \xb)^2 |
s_n = \frac{1}{n-1} \sum_{i=1}^n (x_i - \xb)^2 |
| = \frac{\left( \sum_{i=1}^n x_i^2 \right) - n\xb^2}{n-1} |
= \frac{\left( \sum_{i=1}^n x_i^2 \right) - n\xb^2}{n-1} |
| \,, \quad \xb = \frac{1}{n} \sum_{i=1}^n x_i\,, |
\,, \quad \xb = \frac{1}{n} \sum_{i=1}^n x_i\,, |
| \] |
\] |
| suffers from numerical round-off error. |
suffers from numerical round-off error. |
| If the mean $\xb$ is large in absolute value, |
If the mean $\xb$ is large in absolute value, |
| and $\sum_{i=1}^n x_i^2$ is close to $\xb^2$, |
and $\sum_{i=1}^n x_i^2$ is close to $\xb^2$, |
| then the subtraction at the end will tend to lose |
then the subtraction at the end will tend to lose |
|
significant digits on the result.
|
significant digits on the result $s_n$.
|
|
Also, in rare cases, the sum of squares $\sum_{i=1}^n x_i^2$
|
Also, on a computer, the sum of squares $\sum_{i=1}^n x_i^2$
|
|
can be liable to overflow on a computer.
|
can be liable to overflow.
|
|
|
| A better alternative, though requiring more work, |
A better alternative is to calculate the running sample mean |
| is to calculate the running sample mean |
|
| and variance instead, |
and variance instead, |
| and update these as each datum is processed. |
and update these as each datum is processed. |
| Here we give the derivation of the one-pass algorithm, which |
Here we give the derivation of the one-pass algorithm, which |
| involves nothing more than simple algebraic manipulations. |
involves nothing more than simple algebraic manipulations. |
|
|
| Define the running arithmetic mean |
Define the running arithmetic mean |
| and the sum of squares residuals: |
and the sum of squares residuals: |
| \[ |
\[ |
| a_n = \frac{1}{n} \sum_{i=1}^n x_i\,, |
a_n = \frac{1}{n} \sum_{i=1}^n x_i\,, |
| \quad |
\quad |
| s_n = \sum_{i=1}^n (x_i - a_n)^2\,. |
s_n = \sum_{i=1}^n (x_i - a_n)^2\,. |
| \] |
\] |
| We want to express $a_{n+1}$ and $s_{n+1}$ |
We want to express $a_{n+1}$ and $s_{n+1}$ |
| in terms of the old values $a_n$ and $s_n$. |
in terms of the old values $a_n$ and $s_n$. |
|
|
| For convenience, let $\delta = x_{n+1} - a_n$ |
For convenience, let $\delta = x_{n+1} - a_n$ |
| and $\gamma = a_{n+1} - a_n$. |
and $\gamma = a_{n+1} - a_n$. |
| Then we have |
Then we have |
| \[ |
\[ |
| a_{n+1} = \frac{n a_n + x_{n+1}}{n+1} = |
a_{n+1} = \frac{n a_n + x_{n+1}}{n+1} = |
| \frac{(n+1)a_n + x_{n+1} - a_n}{n+1} |
\frac{(n+1)a_n + x_{n+1} - a_n}{n+1} |
| = a_n + \frac{\delta}{n+1}\,. |
= a_n + \frac{\delta}{n+1}\,. |
| \] |
\] |
|
|
| For the variance calculation, |
For the variance calculation, |
| we have |
we have |
| \begin{align*} |
\begin{align*} |
| s_{n+1} &= \sum_{i=1}^n \bigl( (x_i - a_n) - \gamma \bigr)^2 |
s_{n+1} &= \sum_{i=1}^n \bigl( (x_i - a_n) - \gamma \bigr)^2 |
| + (x_{n+1} - a_{n+1})^2 |
+ (x_{n+1} - a_{n+1})^2 |
| \\ |
\\ |
|
&= \sum_{i=1}^n (x_i - a_n)^2 - 2\gamma \sum_{i=1}^n (x_i - a_n)
|
&= \sum_{i=1}^n (x_i - a_n)^2 - \gamma \sum_{i=1}^n (x_i - a_n)
|
| + \sum_{i=1}^n \gamma^2 + (x_{n+1} - a_{n+1})^2 \\ |
+ \sum_{i=1}^n \gamma^2 + (x_{n+1} - a_{n+1})^2 \\ |
| &= s_n + 0 + n \gamma^2 + (x_{n+1} - a_{n+1})^2\,. |
&= s_n + 0 + n \gamma^2 + (x_{n+1} - a_{n+1})^2\,. |
| \end{align*} |
\end{align*} |
| Now observe: |
Now observe: |
| \begin{align*} |
\begin{align*} |
| \gamma = \frac{\delta}{n+1}\,, |
\gamma = \frac{\delta}{n+1}\,, |
| \quad |
\quad |
| x_{n+1} - a_{n+1} = \delta - \gamma = (n+1) \gamma - \gamma = n\gamma\,, |
x_{n+1} - a_{n+1} = \delta - \gamma = (n+1) \gamma - \gamma = n\gamma\,, |
| \end{align*} |
\end{align*} |
| while: |
while: |
| \begin{align*} |
\begin{align*} |
| n \gamma &= n ( a_{n+1} - x_{n+1}) + nx_{n+1} - na_n \\ |
n \gamma &= n ( a_{n+1} - x_{n+1}) + nx_{n+1} - na_n \\ |
| &= n(a_{n+1} - x_{n+1}) + n x_{n+1} + x_{n+1} - \sum_{i=1}^{n+1} x_{n+1} \\ |
&= n(a_{n+1} - x_{n+1}) + n x_{n+1} + x_{n+1} - \sum_{i=1}^{n+1} x_{n+1} \\ |
| &= n(a_{n+1} - x_{n+1}) + (n+1)x_{n+1} - (n+1) a_{n+1} \\ |
&= n(a_{n+1} - x_{n+1}) + (n+1)x_{n+1} - (n+1) a_{n+1} \\ |
| &= x_{n+1} - a_{n+1}\,. |
&= x_{n+1} - a_{n+1}\,. |
| \end{align*} |
\end{align*} |
|
|
| Hence we obtain: |
Hence we obtain: |
| \begin{align*} |
\begin{align*} |
| s_{n+1} = s_n + n\gamma^2 + n^2 \gamma^2 |
s_{n+1} = s_n + n\gamma^2 + n^2 \gamma^2 |
| = s_n + n(n+1) \gamma^2 |
= s_n + n(n+1) \gamma^2 |
| &= s_n + n \gamma \delta \\ |
&= s_n + n \gamma \delta \\ |
| &= |
&= |
| s_n + |
s_n + |
| ( x_{n+1} - a_{n+1}) \delta\,. |
( x_{n+1} - a_{n+1}) \delta\,. |
| \end{align*} |
\end{align*} |
|
|
| Note that the number to be added to $s_n$ is never negative, |
|
| so no cancellation error will occur from this procedure. |
|
| (However, there can still be round-off error if $s_{n+1} - s_n$ |
|
| happens to be very small compared to $s_n$.) |
|
|
|
| \medskip |
|
|
|
| The recurrence relation for the |
The recurrence relation for the |
| sample covariance of two lists of numbers $x_1, x_2, \dotsc$ |
sample covariance of two lists of numbers $x_1, x_2, \dotsc$ |
| and $y_1, y_2, \dotsc$ |
and $y_1, y_2, \dotsc$ |
| can be derived similarly. |
can be derived similarly. |
| If $a_n$ and $b_n$ denote the arithmetic means |
If $a_n$ and $b_n$ denote the arithmetic means |
| of first $n$ numbers of each of the two lists respectively, |
of first $n$ numbers of each of the two lists respectively, |
| then the quantity |
then the quantity |
| \[ |
\[ |
| c_n = \sum_{i=1}^n (x_i - a_n)(y_i - b_n) |
c_n = \sum_{i=1}^n (x_i - a_n)(y_i - b_n) |
| \] |
\] |
| can be incrementally updated by |
can be incrementally updated by |
| \[ |
\[ |
| c_{n+1} = c_n + (y_{n+1} - b_{n+1}) \, \frac{x_{n+1} - a_n}{n+1} |
c_{n+1} = c_n + (y_{n+1} - b_{n+1}) \, \frac{x_{n+1} - a_n}{n+1} |
| = c_n + (x_{n+1} - a_{n+1}) \, \frac{y_{n+1} - b_n}{n+1}\,. |
= c_n + (x_{n+1} - a_{n+1}) \, \frac{y_{n+1} - b_n}{n+1}\,. |
| \] |
\] |
|
|
| \begin{thebibliography}{3} |
\begin{thebibliography}{3} |
| \bibitem{Welford} |
\bibitem{Welford} |
| B. P. Welford. ``Note on a Method for Calculating Corrected Sums of Squares and Products''. {\it Technometrics}, Vol. 4, No. 3 (Aug., 1962), p. 419-420. |
B. P. Welford. ``Note on a Method for Calculating Corrected Sums of Squares and Products''. {\it Technometrics}, Vol. 4, No. 3 (Aug., 1962), p. 419-420. |
|
|
| \bibitem{Wiki} |
|
| ``\PMlinkexternal{Algorithms for calculating variance}{http://en.wikipedia.org/wiki/Algorithms\_for\_calculating\_variance}''. {\it Wikipedia, The Free Encyclopedia}. Accessed 25 February 2007. |
|
| \end{thebibliography} |
\end{thebibliography} |
|
|