extended discussion of the conjugate gradient method

Suppose we wish to solve the system

A\mathbf{x}=\mathbf{b}

(1)

where $A$ is a symmetric positive definite matrix. If we define the function

\Phi(\mathbf{x})=\frac{1}{2}\mathbf{x}^{T}A\mathbf{x}-\mathbf{x}^{T}\mathbf{b}

(2)

we see that solving (1) is equivalent to minimizing $\Phi$. Indeed, if $\mathbf{x}$ is a minimum of $\Phi$ then we must have

\nabla\Phi(\mathbf{x})=A\mathbf{x}-\mathbf{b}=\mathbf{0}

(3)
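The equivalence can be checked numerically on a small example. The following is only a sketch: the $2\times 2$ matrix, the right-hand side and the helper names `matvec`, `dot`, `phi` and `grad_phi` are illustrative assumptions, not part of the original discussion. It evaluates the gradient (3) at the exact solution and compares $\Phi$ there with its value at a few nearby points.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def phi(A, b, x):
    """Phi(x) = 1/2 x^T A x - x^T b, as in (2)."""
    return 0.5 * dot(x, matvec(A, x)) - dot(x, b)

def grad_phi(A, b, x):
    """Gradient of Phi, namely A x - b, as in (3)."""
    return [ax - bi for ax, bi in zip(matvec(A, x), b)]

# Hypothetical example data (not from the text); A is symmetric positive definite.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_star = [1.0 / 11.0, 7.0 / 11.0]  # solves A x = b

gradient_at_solution = grad_phi(A, b, x_star)
phi_at_solution = phi(A, b, x_star)

# Phi at a few perturbed points; each value should exceed phi_at_solution.
perturbations = [[0.1, 0.0], [0.0, -0.1], [-0.05, 0.05]]
phi_nearby = [phi(A, b, [xs + d for xs, d in zip(x_star, dx)])
              for dx in perturbations]

print(gradient_at_solution, phi_at_solution, phi_nearby)
```

The gradient vanishes (up to rounding) at the solution, and every perturbed point gives a strictly larger value of $\Phi$, as expected for a positive definite $A$.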

Conversely, since $A$ is symmetric positive definite, $\Phi$ is strictly convex, so the unique solution of (3) is its global minimum. These considerations give rise to the steepest descent algorithm. Given an approximation $\mathbf{\tilde{x}}$ to the solution $\mathbf{x}$, the idea is to improve the approximation by moving in the direction in which $\Phi$ decreases most rapidly. This direction is the negative gradient of $\Phi$ at $\mathbf{\tilde{x}}$. We therefore formulate the algorithm as follows.

Given an initial approximation $\mathbf{x}^{(0)}$, for $k\in 1...N$,

$\mathbf{x}^{(k)}=\mathbf{x}^{(k-1)}+\alpha_{k}\mathbf{r}^{(k)}$

where $\mathbf{r}^{(k)}=\mathbf{b}-A\mathbf{x}^{(k-1)}=-\nabla\Phi(\mathbf{x}^{(k-1)})$ and $\alpha_{k}$ is a scalar to be determined.

$\mathbf{r}^{(k)}$ is traditionally called the residual vector. We wish to choose $\alpha_{k}$ so as to reduce $\Phi$ as much as possible in each iteration; in other words, we wish to minimize the function $\phi(\alpha_{k})=\Phi(\mathbf{x}^{(k-1)}+\alpha_{k}\mathbf{r}^{(k)})$ with respect to $\alpha_{k}$:

\begin{array}[]{rcl}\phi^{\prime}(\alpha_{k})&=&\nabla\Phi(\mathbf{x}^{(k-1)}+\alpha_{k}\mathbf{r}^{(k)})^{T}\mathbf{r}^{(k)}\\ &=&[A\mathbf{x}^{(k-1)}-\mathbf{b}+\alpha_{k}A\mathbf{r}^{(k)}]^{T}\mathbf{r}^{(k)}\\ &=&[-\mathbf{r}^{(k)}+\alpha_{k}A\mathbf{r}^{(k)}]^{T}\mathbf{r}^{(k)}\\ &=&0\end{array}\Longleftrightarrow\alpha_{k}=\frac{\mathbf{r}^{(k)}{}^{T}\mathbf{r}^{(k)}}{\mathbf{r}^{(k)}{}^{T}A\mathbf{r}^{(k)}}
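The steepest descent iteration with this step length can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `steepest_descent`, the stopping tolerance `tol`, the iteration cap `max_iter` and the example system are all hypothetical choices made here.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def steepest_descent(A, b, x0, tol=1e-10, max_iter=1000):
    """Iterate x^(k) = x^(k-1) + alpha_k r^(k) with the exact line search
    alpha_k = (r^T r) / (r^T A r). Returns the final iterate and the
    number of updates performed. `tol` and `max_iter` are assumptions."""
    x = list(x0)
    for k in range(1, max_iter + 1):
        # r^(k) = b - A x^(k-1), the negative gradient of Phi at x^(k-1)
        r = [bi - ax for bi, ax in zip(b, matvec(A, x))]
        rr = dot(r, r)
        if rr ** 0.5 < tol:
            return x, k - 1
        alpha = rr / dot(r, matvec(A, r))
        x = [xi + alpha * ri for xi, ri in zip(x, r)]
    return x, max_iter

# Hypothetical 2x2 symmetric positive definite system; solution (1/11, 7/11).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_sd, iterations = steepest_descent(A, b, [0.0, 0.0])
print(x_sd, iterations)
```

On this example the iterates approach the solution, but the number of updates needed to reach the tolerance exceeds the dimension of the system.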

It can be shown that the steepest descent algorithm described above converges to the solution $\mathbf{x}$, but in general only in the limit of infinitely many iterations. The conjugate gradient method improves on this by finding the exact solution (in exact arithmetic) after at most $n$ iterations, where $n$ is the dimension of the system. Let’s see how we can achieve this.

We say that $\mathbf{\tilde{x}}$ is optimal with respect to a direction $\mathbf{d}$ if $\lambda=0$ is a local minimum for the function $\Phi(\mathbf{\tilde{x}}+\lambda\mathbf{d})$.

In the steepest descent algorithm, $\mathbf{x}^{(k)}$ is optimal with respect to $\mathbf{r}^{(k)}$, but in general it is not optimal with respect to $\mathbf{r}^{(1)},...,\mathbf{r}^{(k-1)}$. If we could modify the algorithm so that optimality with respect to all previous search directions is preserved, we might hope that $\mathbf{x}^{(n)}$ is optimal with respect to $n$ linearly independent directions. At that point $\nabla\Phi(\mathbf{x}^{(n)})$ would be orthogonal to a basis of $\mathbb{R}^{n}$, hence zero, and we would have found the exact solution.
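The loss of optimality is easy to observe. Since $\nabla\Phi(\mathbf{x}^{(k)})=-\mathbf{r}^{(k+1)}$, the iterate $\mathbf{x}^{(k)}$ is optimal with respect to a direction $\mathbf{d}$ exactly when $\mathbf{r}^{(k+1)}{}^{T}\mathbf{d}=0$. The sketch below, whose helper `sd_residuals` and example data are assumptions introduced only for illustration, runs two steepest descent steps and tests this condition.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def sd_residuals(A, b, x0, steps):
    """Run `steps` steepest descent updates and return the residuals
    [r^(1), ..., r^(steps+1)], where r^(k) = b - A x^(k-1)."""
    x = list(x0)
    residuals = []
    for _ in range(steps):
        r = [bi - ax for bi, ax in zip(b, matvec(A, x))]
        residuals.append(r)
        alpha = dot(r, r) / dot(r, matvec(A, r))
        x = [xi + alpha * ri for xi, ri in zip(x, r)]
    residuals.append([bi - ax for bi, ax in zip(b, matvec(A, x))])
    return residuals

# Hypothetical 2x2 symmetric positive definite system.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
r1, r2, r3 = sd_residuals(A, b, [0.0, 0.0], steps=2)

# x^(1) is optimal with respect to r^(1): r^(2) is orthogonal to r^(1).
x1_optimality = dot(r2, r1)
# x^(2) is optimal with respect to r^(2) ...
x2_optimality_r2 = dot(r3, r2)
# ... but not with respect to r^(1): this inner product is far from zero.
x2_optimality_r1 = dot(r3, r1)
print(x1_optimality, x2_optimality_r2, x2_optimality_r1)
```

In two dimensions the effect is especially visible: consecutive residuals are orthogonal, so $\mathbf{r}^{(3)}$ is parallel to $\mathbf{r}^{(1)}$ and the second step undoes the optimality gained in the first.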

Let’s make the following modification:

\mathbf{p}^{(k)}=\begin{cases}\mathbf{r}^{(1)}&\text{if }k=1\\ \mathbf{r}^{(k)}-\beta_{k}\mathbf{p}^{(k-1)}&\text{if }k>1\end{cases}

(4)

\mathbf{x}^{(k)}=\mathbf{x}^{(k-1)}+\alpha_{k}\mathbf{p}^{(k)}

(5)

where $\mathbf{r}^{(k)}=\mathbf{b}-A\mathbf{x}^{(k-1)}$ as before, and $\alpha_{k}$ and $\beta_{k}$ are scalar multipliers to be determined.

We choose $\alpha_{k}$ in the same way as before, i.e. such that $\mathbf{x}^{(k)}$ is optimal with respect to $\mathbf{p}^{(k)}$:

\alpha_{k}=\frac{\mathbf{r}^{(k)}{}^{T}\mathbf{p}^{(k)}}{\mathbf{p}^{(k)}{}^{T}A\mathbf{p}^{(k)}}

(6)

Now we wish to choose $\beta_{k}$ such that $\mathbf{x}^{(k)}$ is also optimal with respect to $\mathbf{p}^{(k-1)}$. We require that

\left.\frac{\partial\Phi(\mathbf{x}^{(k)}+\lambda\mathbf{p}^{(k-1)})}{\partial\lambda}\right|_{\lambda=0}=[A\mathbf{x}^{(k)}-\mathbf{b}]^{T}\mathbf{p}^{(k-1)}=0

(7)

Since $\mathbf{x}^{(k)}=\mathbf{x}^{(k-1)}+\alpha_{k}\mathbf{p}^{(k)}$, and assuming that $\mathbf{x}^{(k-1)}$ is optimal with respect to $\mathbf{p}^{(k-1)}$ (i.e. that $[A\mathbf{x}^{(k-1)}-\mathbf{b}]^{T}\mathbf{p}^{(k-1)}=0$), we can use the symmetry of $A$ and the definition (4) of $\mathbf{p}^{(k)}$ to rewrite this condition as

[A(\mathbf{x}^{(k-1)}+\alpha_{k}\mathbf{p}^{(k)})-\mathbf{b}]^{T}\mathbf{p}^{(k-1)}=\alpha_{k}[\mathbf{r}^{(k)}-\beta_{k}\mathbf{p}^{(k-1)}]^{T}A\mathbf{p}^{(k-1)}=0

(8)

and therefore we obtain the required value for $\beta_{k}$,

\beta_{k}=\frac{\mathbf{r}^{(k)}{}^{T}A\mathbf{p}^{(k-1)}}{\mathbf{p}^{(k-1)}{}^{T}A\mathbf{p}^{(k-1)}}

(9)
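Formulas (4), (5), (6) and (9) already determine a complete iteration, which can be sketched as follows. This is a direct transcription under assumptions, not an optimized routine: the function name `conjugate_gradient`, the tolerance `tol` and the $3\times 3$ test system are hypothetical, and the residual is recomputed from its definition at every step, which costs more matrix-vector products than the recurrences usually used in practice.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_gradient(A, b, x0, tol=1e-12):
    """Perform at most n = len(b) steps of the iteration (4)-(5), with
    alpha_k from (6) and beta_k from (9). `tol` is an assumed threshold
    for stopping early once the residual is negligible."""
    x = list(x0)
    p = None
    for k in range(1, len(b) + 1):
        # r^(k) = b - A x^(k-1)
        r = [bi - ax for bi, ax in zip(b, matvec(A, x))]
        if dot(r, r) ** 0.5 < tol:
            break
        if p is None:
            p = r                                          # (4), case k = 1
        else:
            Ap_prev = matvec(A, p)
            beta = dot(r, Ap_prev) / dot(p, Ap_prev)       # (9)
            p = [ri - beta * pi for ri, pi in zip(r, p)]   # (4), case k > 1
        alpha = dot(r, p) / dot(p, matvec(A, p))           # (6)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]      # (5)
    return x

# Hypothetical 3x3 symmetric positive definite system with solution (1, 2, 3).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [6.0, 10.0, 8.0]
x_cg = conjugate_gradient(A, b, [0.0, 0.0, 0.0])
print(x_cg)
```

Note the sign convention: because (4) subtracts $\beta_{k}\mathbf{p}^{(k-1)}$, the $\beta_{k}$ here has the opposite sign to the one found in presentations that add the previous direction.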

Now we want to show that $\mathbf{x}^{(k)}$ is also optimal with respect to $\mathbf{p}^{(1)},...,\mathbf{p}^{(k-2)}$, i.e. that

[A\mathbf{x}^{(k)}-\mathbf{b}]^{T}\mathbf{p}^{(j)}=-\mathbf{r}^{(k+1)}{}^{T}\mathbf{p}^{(j)}=0\qquad\forall\;j\in 1...k-2

(10)

We do this by strong induction on $k$, assuming that for all $\ell\in 1...k-1$ and $j\in 1...\ell$, $\mathbf{x}^{(\ell)}$ is optimal with respect to $\mathbf{p}^{(j)}$, or equivalently that

[A\mathbf{x}^{(\ell)}-\mathbf{b}]^{T}\mathbf{p}^{(j)}=-\mathbf{r}^{(\ell+1)}{}^{T}\mathbf{p}^{(j)}=0

(11)

Noticing that $\mathbf{r}^{(k+1)}=\mathbf{b}-A\mathbf{x}^{(k)}=\mathbf{r}^{(k)}-\alpha_{k}A\mathbf{p}^{(k)}$, and using the symmetry of $A$, we want to show

\mathbf{r}^{(k)}{}^{T}\mathbf{p}^{(j)}-\alpha_{k}\mathbf{p}^{(k)}{}^{T}A\mathbf{p}^{(j)}=0

(12)

and therefore, since $\mathbf{r}^{(k)}{}^{T}\mathbf{p}^{(j)}=0$ by the inductive hypothesis, it suffices to prove that

\mathbf{p}^{(k)}{}^{T}A\mathbf{p}^{(j)}=0\qquad\forall\;j\in 1...k-2

(13)

But, by the definition (4) of $\mathbf{p}^{(k)}$, this is equivalent to proving

\mathbf{r}^{(k)}{}^{T}A\mathbf{p}^{(j)}-\beta_{k}\mathbf{p}^{(k-1)}{}^{T}A\mathbf{p}^{(j)}=0

(14)

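and since $\mathbf{p}^{(k-1)}{}^{T}A\mathbf{p}^{(j)}=0$ by the inductive hypothesis (subtracting (11) for $\ell=k-1$ from (11) for $\ell=k-2$ gives $\alpha_{k-1}\mathbf{p}^{(k-1)}{}^{T}A\mathbf{p}^{(j)}=0$, and $\alpha_{k-1}\neq 0$ unless the solution has already been found), all we need to prove is that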

\mathbf{r}^{(k)}{}^{T}A\mathbf{p}^{(j)}=0

(15)

To proceed we require the following lemma.

Let $V_{\ell}=\mathrm{span}\{\;A^{\ell-1}\mathbf{r}^{(1)},...,A\mathbf{r}^{(1)},\mathbf{r}^{(1)}\;\}$. Then:

• $\mathbf{r}^{(\ell)},\mathbf{p}^{(\ell)}\in V_{\ell}$.

• If $\mathbf{y}\in V_{\ell}$ then $\mathbf{r}^{(k)}\perp\mathbf{y}$ for all $k>\ell$.

Since $\mathbf{p}^{(j)}\in V_{j}$, it follows that $A\mathbf{p}^{(j)}\in V_{j+1}$, and since $j+1<k$, the second point of the lemma gives

\mathbf{r}^{(k)}{}^{T}(A\mathbf{p}^{(j)})=0

(16)

To finish let’s prove the lemma. The first point is by induction. The base case $\ell=1$ holds since $\mathbf{r}^{(1)}=\mathbf{p}^{(1)}\in V_{1}$. Assuming that $\mathbf{r}^{(\ell-1)},\mathbf{p}^{(\ell-1)}\in V_{\ell-1}$, we have that

\mathbf{r}^{(\ell)}=\mathbf{r}^{(\ell-1)}-\alpha_{\ell-1}A\mathbf{p}^{(\ell-1)}\in V_{\ell}\qquad\mathbf{p}^{(\ell)}=\mathbf{r}^{(\ell)}-\beta_{\ell}\mathbf{p}^{(\ell-1)}\in V_{\ell}

(17)

For the second point we need an alternative characterization of $V_{\ell}$. Since $V_{1}\subseteq\cdots\subseteq V_{\ell}$, the first point gives $\mathbf{p}^{(1)},...,\mathbf{p}^{(\ell)}\in V_{\ell}$, and hence

\mathrm{span}\{\;\mathbf{p}^{(1)},...,\mathbf{p}^{(\ell)}\;\}\subseteq V_{\ell}

(18)

By (11), if for some $s\in 1...\ell$ we had

\mathbf{p}^{(s)}=\lambda_{s-1}\mathbf{p}^{(s-1)}+\cdots+\lambda_{1}\mathbf{p}^{(1)}

(19)

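then we would have $\mathbf{p}^{(s)}{}^{T}\mathbf{r}^{(s)}=0$, since $\mathbf{r}^{(s)}$ is orthogonal to each of $\mathbf{p}^{(1)},...,\mathbf{p}^{(s-1)}$. On the other hand, we know that

\mathbf{p}^{(s)}{}^{T}\mathbf{r}^{(s)}=[\mathbf{r}^{(s)}-\beta_{s}\mathbf{p}^{(s-1)}]^{T}\mathbf{r}^{(s)}=\mathbf{r}^{(s)}{}^{T}\mathbf{r}^{(s)}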

(20)

and were this zero, we’d have $\mathbf{r}^{(s)}=\mathbf{0}$ and we would already have solved the original problem. Thus we conclude that $\mathbf{p}^{(s)}\not\in\mathrm{span}\{\;\mathbf{p}^{(1)},...,\mathbf{p}^{(s-1)}\;\}$ for every $s\in 1...\ell$, so the vectors $\mathbf{p}^{(1)},...,\mathbf{p}^{(\ell)}$ are linearly independent. It follows that $\mathrm{dim}\;\mathrm{span}\{\;\mathbf{p}^{(1)},...,\mathbf{p}^{(\ell)}\;\}=\ell$, which is also the maximum possible dimension of $V_{\ell}$, and thus by (18) we must have

V_{\ell}=\mathrm{span}\{\;\mathbf{p}^{(1)},...,\mathbf{p}^{(\ell)}\;\}

(21)

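which is the alternative characterization we were looking for. Now if $\mathbf{y}\in V_{\ell}$ and $\ell<k$, then $\mathbf{y}$ is a linear combination of $\mathbf{p}^{(1)},...,\mathbf{p}^{(\ell)}$, each of which is orthogonal to $\mathbf{r}^{(k)}$ by (11), so that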

\mathbf{r}^{(k)}{}^{T}\mathbf{y}=0

(22)

thus the second point is proven.
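The orthogonality relations established above can also be observed numerically. The sketch below is an illustration under assumptions: the helper `cg_history` and the $3\times 3$ test system are hypothetical. It records the residuals and search directions produced by (4), (5), (6) and (9), then measures how far $\mathbf{r}^{(k)}{}^{T}\mathbf{p}^{(j)}$ and $\mathbf{p}^{(k)}{}^{T}A\mathbf{p}^{(j)}$ are from zero for $j<k$.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    """Euclidean inner product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def cg_history(A, b, x0):
    """Run the iteration (4)-(5) with alpha_k from (6) and beta_k from (9)
    for at most n = len(b) steps. Returns the residuals [r^(1), ...] and
    the search directions [p^(1), ...] that were actually generated."""
    x = list(x0)
    residuals, directions = [], []
    for k in range(1, len(b) + 1):
        r = [bi - ax for bi, ax in zip(b, matvec(A, x))]   # r^(k)
        if dot(r, r) == 0.0:
            break  # exact solution reached early
        if not directions:
            p = r                                          # (4), case k = 1
        else:
            p_prev = directions[-1]
            Ap_prev = matvec(A, p_prev)
            beta = dot(r, Ap_prev) / dot(p_prev, Ap_prev)  # (9)
            p = [ri - beta * pi for ri, pi in zip(r, p_prev)]
        alpha = dot(r, p) / dot(p, matvec(A, p))           # (6)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]      # (5)
        residuals.append(r)
        directions.append(p)
    return residuals, directions

# Hypothetical 3x3 symmetric positive definite system.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [6.0, 10.0, 8.0]
rs, ps = cg_history(A, b, [0.0, 0.0, 0.0])

# Largest deviations from zero over all pairs j < k (0-based list indices).
max_r_dot_p = max(abs(dot(rs[k], ps[j]))
                  for k in range(len(rs)) for j in range(k))
max_p_A_p = max(abs(dot(ps[k], matvec(A, ps[j])))
                for k in range(len(ps)) for j in range(k))
print(len(rs), max_r_dot_p, max_p_A_p)
```

Both quantities should be zero up to rounding error. In floating-point arithmetic these relations hold only approximately, so on larger or ill-conditioned systems the deviations can grow.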

Date of creation	2013-03-22 17:28:24
Last modified on	2013-03-22 17:28:24
Owner	ehremo (15714)
Last modified by	ehremo (15714)
Numerical id	13
Author	ehremo (15714)
Entry type	Topic
Classification	msc 15A06
Classification	msc 90C20