PlanetMath (more info)
 Math for the people, by the people. Sponsor PlanetMath
Encyclopedia | Requests | Forums | Docs | Wiki | Random | RSS  
Login
create new user
name:
pass:
forget your password?
Main Menu
Revision difference : one-pass algorithm to compute sample variance
Version 4 Version 3
\PMlinkescapeword{derivation} \PMlinkescapeword{derivation}
\PMlinkescapeword{terms} \PMlinkescapeword{terms}
\PMlinkescapeword{formula} \PMlinkescapeword{formula}
In many situations it is desirable to calculate, In many situations it is desirable to calculate,
in one iteration, the sample variance of many numbers, and without in one iteration, the sample variance of many numbers, and without having to store every number
having to have stored every number in computer memory.
in computer memory before beginning processing.
Let $x_1, x_2, \dotsc$ denote the data. Let $x_1, x_2, \dotsc$ denote the data.
The na\"ive formula for calculating the sample The na\"ive formula for calculating the sample
variance in one pass, variance in one pass,
\[ \[
s_n = \frac{1}{n-1} \sum_{i=1}^n (x_i - \xb)^2 s_n = \frac{1}{n-1} \sum_{i=1}^n (x_i - \xb)^2
= \frac{\left( \sum_{i=1}^n x_i^2 \right) - n\xb^2}{n-1} = \frac{\left( \sum_{i=1}^n x_i^2 \right) - n\xb^2}{n-1}
\,, \quad \xb = \frac{1}{n} \sum_{i=1}^n x_i\,, \,, \quad \xb = \frac{1}{n} \sum_{i=1}^n x_i\,,
\] \]
suffers from numerical round-off error. suffers from numerical round-off error.
If the mean $\xb$ is large in absolute value, If the mean $\xb$ is large in absolute value,
and $\sum_{i=1}^n x_i^2$ is close to $\xb^2$, and $\sum_{i=1}^n x_i^2$ is close to $\xb^2$,
then the subtraction at the end will tend to lose then the subtraction at the end will tend to lose
significant digits on the result. significant digits on the result $s_n$.
Also, in rare cases, the sum of squares $\sum_{i=1}^n x_i^2$ Also, on a computer, the sum of squares $\sum_{i=1}^n x_i^2$
can be liable to overflow on a computer. can be liable to overflow.
A better alternative, though requiring more work, A better alternative is to calculate the running sample mean
is to calculate the running sample mean
and variance instead, and variance instead,
and update these as each datum is processed. and update these as each datum is processed.
Here we give the derivation of the one-pass algorithm, which Here we give the derivation of the one-pass algorithm, which
involves nothing more than simple algebraic manipulations. involves nothing more than simple algebraic manipulations.
Define the running arithmetic mean Define the running arithmetic mean
and the sum of squares residuals: and the sum of squares residuals:
\[ \[
a_n = \frac{1}{n} \sum_{i=1}^n x_i\,, a_n = \frac{1}{n} \sum_{i=1}^n x_i\,,
\quad \quad
s_n = \sum_{i=1}^n (x_i - a_n)^2\,. s_n = \sum_{i=1}^n (x_i - a_n)^2\,.
\] \]
We want to express $a_{n+1}$ and $s_{n+1}$ We want to express $a_{n+1}$ and $s_{n+1}$
in terms of the old values $a_n$ and $s_n$. in terms of the old values $a_n$ and $s_n$.
For convenience, let $\delta = x_{n+1} - a_n$ For convenience, let $\delta = x_{n+1} - a_n$
and $\gamma = a_{n+1} - a_n$. and $\gamma = a_{n+1} - a_n$.
Then we have Then we have
\[ \[
a_{n+1} = \frac{n a_n + x_{n+1}}{n+1} = a_{n+1} = \frac{n a_n + x_{n+1}}{n+1} =
\frac{(n+1)a_n + x_{n+1} - a_n}{n+1} \frac{(n+1)a_n + x_{n+1} - a_n}{n+1}
= a_n + \frac{\delta}{n+1}\,. = a_n + \frac{\delta}{n+1}\,.
\] \]
For the variance calculation, For the variance calculation,
we have we have
\begin{align*} \begin{align*}
s_{n+1} &= \sum_{i=1}^n \bigl( (x_i - a_n) - \gamma \bigr)^2 s_{n+1} &= \sum_{i=1}^n \bigl( (x_i - a_n) - \gamma \bigr)^2
+ (x_{n+1} - a_{n+1})^2 + (x_{n+1} - a_{n+1})^2
\\ \\
&= \sum_{i=1}^n (x_i - a_n)^2 - 2\gamma \sum_{i=1}^n (x_i - a_n) &= \sum_{i=1}^n (x_i - a_n)^2 - \gamma \sum_{i=1}^n (x_i - a_n)
+ \sum_{i=1}^n \gamma^2 + (x_{n+1} - a_{n+1})^2 \\ + \sum_{i=1}^n \gamma^2 + (x_{n+1} - a_{n+1})^2 \\
&= s_n + 0 + n \gamma^2 + (x_{n+1} - a_{n+1})^2\,. &= s_n + 0 + n \gamma^2 + (x_{n+1} - a_{n+1})^2\,.
\end{align*} \end{align*}
Now observe: Now observe:
\begin{align*} \begin{align*}
\gamma = \frac{\delta}{n+1}\,, \gamma = \frac{\delta}{n+1}\,,
\quad \quad
x_{n+1} - a_{n+1} = \delta - \gamma = (n+1) \gamma - \gamma = n\gamma\,, x_{n+1} - a_{n+1} = \delta - \gamma = (n+1) \gamma - \gamma = n\gamma\,,
\end{align*} \end{align*}
while: while:
\begin{align*} \begin{align*}
n \gamma &= n ( a_{n+1} - x_{n+1}) + nx_{n+1} - na_n \\ n \gamma &= n ( a_{n+1} - x_{n+1}) + nx_{n+1} - na_n \\
&= n(a_{n+1} - x_{n+1}) + n x_{n+1} + x_{n+1} - \sum_{i=1}^{n+1} x_{n+1} \\ &= n(a_{n+1} - x_{n+1}) + n x_{n+1} + x_{n+1} - \sum_{i=1}^{n+1} x_{n+1} \\
&= n(a_{n+1} - x_{n+1}) + (n+1)x_{n+1} - (n+1) a_{n+1} \\ &= n(a_{n+1} - x_{n+1}) + (n+1)x_{n+1} - (n+1) a_{n+1} \\
&= x_{n+1} - a_{n+1}\,. &= x_{n+1} - a_{n+1}\,.
\end{align*} \end{align*}
Hence we obtain: Hence we obtain:
\begin{align*} \begin{align*}
s_{n+1} = s_n + n\gamma^2 + n^2 \gamma^2 s_{n+1} = s_n + n\gamma^2 + n^2 \gamma^2
= s_n + n(n+1) \gamma^2 = s_n + n(n+1) \gamma^2
&= s_n + n \gamma \delta \\ &= s_n + n \gamma \delta \\
&= &=
s_n + s_n +
( x_{n+1} - a_{n+1}) \delta\,. ( x_{n+1} - a_{n+1}) \delta\,.
\end{align*} \end{align*}
Note that the number to be added to $s_n$ is never negative,
so no cancellation error will occur from this procedure.
(However, there can still be round-off error if $s_{n+1} - s_n$
happens to be very small compared to $s_n$.)
\medskip
The recurrence relation for the The recurrence relation for the
sample covariance of two lists of numbers $x_1, x_2, \dotsc$ sample covariance of two lists of numbers $x_1, x_2, \dotsc$
and $y_1, y_2, \dotsc$ and $y_1, y_2, \dotsc$
can be derived similarly. can be derived similarly.
If $a_n$ and $b_n$ denote the arithmetic means If $a_n$ and $b_n$ denote the arithmetic means
of first $n$ numbers of each of the two lists respectively, of first $n$ numbers of each of the two lists respectively,
then the quantity then the quantity
\[ \[
c_n = \sum_{i=1}^n (x_i - a_n)(y_i - b_n) c_n = \sum_{i=1}^n (x_i - a_n)(y_i - b_n)
\] \]
can be incrementally updated by can be incrementally updated by
\[ \[
c_{n+1} = c_n + (y_{n+1} - b_{n+1}) \, \frac{x_{n+1} - a_n}{n+1} c_{n+1} = c_n + (y_{n+1} - b_{n+1}) \, \frac{x_{n+1} - a_n}{n+1}
= c_n + (x_{n+1} - a_{n+1}) \, \frac{y_{n+1} - b_n}{n+1}\,. = c_n + (x_{n+1} - a_{n+1}) \, \frac{y_{n+1} - b_n}{n+1}\,.
\] \]
\begin{thebibliography}{3} \begin{thebibliography}{3}
\bibitem{Welford} \bibitem{Welford}
B. P. Welford. ``Note on a Method for Calculating Corrected Sums of Squares and Products''. {\it Technometrics}, Vol. 4, No. 3 (Aug., 1962), p. 419-420. B. P. Welford. ``Note on a Method for Calculating Corrected Sums of Squares and Products''. {\it Technometrics}, Vol. 4, No. 3 (Aug., 1962), p. 419-420.
\bibitem{Wiki}
``\PMlinkexternal{Algorithms for calculating variance}{http://en.wikipedia.org/wiki/Algorithms\_for\_calculating\_variance}''. {\it Wikipedia, The Free Encyclopedia}. Accessed 25 February 2007.
\end{thebibliography} \end{thebibliography}