extended discussion of the conjugate gradient method

Suppose we wish to solve the system

A⁒𝐱=𝐛 (1)

where A is a symmetricPlanetmathPlanetmathPlanetmathPlanetmath positive definite matrix. If we define the function

Φ⁒(𝐱)=12⁒𝐱T⁒A⁒𝐱-𝐱T⁒𝐛 (2)

we realise that solving (1) is equivalentMathworldPlanetmathPlanetmathPlanetmathPlanetmath to minimizing Φ. This is because if 𝐱 is a minimum of Φ then we must have

βˆ‡β‘Ξ¦β’(𝐱)=A⁒𝐱-𝐛=𝟎 (3)

These considerations give rise to the steepest descent algorithm. Given an approximation 𝐱~ to the solution 𝐱, the idea is to improve the approximation by moving in the direction in which Φ decreases most rapidly. This direction is given by the gradientMathworldPlanetmath of Φ in 𝐱~. Therefore we formulate our algorithm as follows

Given an initial approximation 𝐱(0), for k∈1⁒…⁒N,


where 𝐫(k)=𝐛-A⁒𝐱(k-1)=-βˆ‡β‘Ξ¦β’(𝐱(k-1)) and Ξ±k is a scalar to be determined.

𝐫(k) is traditionally called the residual vector. We wish to choose Ξ±k in such a way that we reduce Ξ¦ as much as possibile in each iteration, in other words, we wish to minimize the function ϕ⁒(Ξ±k)=Φ⁒(𝐱(k-1)+Ξ±k⁒𝐫(k)) with respect to Ξ±k


It’s possible to demonstrate that the steepest descent algorithm described above converges to the solution 𝐱, in an infiniteMathworldPlanetmath time. The conjugate gradient method improves on this by finding the exact solution after only n iterations. Let’s see how we can achieve this.

We say that 𝐱~ is optimal with respect to a direction 𝐝 if λ=0 is a local minimumMathworldPlanetmath for the function Φ⁒(𝐱~+λ⁒𝐝).

In the steepest descent algorithm, 𝐱(k) is optimal with respect to 𝐫(k), but in general it is not optimal with respect to 𝐫(0),…,𝐫(k-1). If we could modify the algorithm such that the optimality with respect to the search directions is preserved we might hope that that 𝐱(n) is optimal with respect to n linearly independentMathworldPlanetmath directions, at which point we would have found the exact solution.

Let’s make the following modification

𝐩(k) = {𝐫(1)if ⁒k=1𝐫(k)-βk⁒𝐩(k-1)if ⁒k>1 (4)
𝐱(k) = 𝐱(k-1)+αk⁒𝐩(k) (5)

where Ξ±k and Ξ²k are scalar multipliers to be determined

We choose αk in the same way as before, i.e. such that 𝐱(k) is optimal with respect to 𝐩(k)

αk=𝐫(k)⁒𝐩(k)T𝐩(k)⁒AT⁒𝐩(k) (6)

Now we wish to choose βk such that 𝐱(k), is also optimal with respect to 𝐩(k-1). We require that

βˆ‚β‘Ξ¦β’(𝐱(k)+λ⁒𝐩(k-1))βˆ‚β‘Ξ»|Ξ»=0=[A⁒𝐱(k)-𝐛]T⁒𝐩(k-1)=0 (7)

Since 𝐱(k)=𝐱(k-1)+Ξ±k⁒𝐩(k), and assuming that 𝐱(k-1) is optimal with respect to 𝐩(k-1) (i.e. that [A⁒𝐱(k-1)-𝐛]T⁒𝐩(k-1)=0) we can rewrite this condition as

[A⁒(𝐱(k-1)+Ξ±k⁒𝐩(k))-𝐛]T⁒𝐩(k-1)=Ξ±k⁒[𝐫(k)-Ξ²k⁒𝐩(k-1)]T⁒A⁒𝐩(k-1)=0 (8)

and therefore we obtain the required value for Ξ²k,

βk=𝐫(k)⁒AT⁒𝐩(k-1)𝐩(k-1)⁒AT⁒𝐩(k-1) (9)

Now we want to show that 𝐱(k) is also optimal with respect to 𝐩(1),…,𝐩(k-2), i.e. that

[A⁒𝐱(k)-𝐛]T⁒𝐩(j)=𝐫(k+1)⁒𝐩(j)T=0β€ƒβ€ƒβˆ€j∈1⁒…⁒k-2 (10)

We do this by strong inductionMathworldPlanetmath on k, assuming that for all β„“βˆˆ1⁒…⁒k-1,j∈1⁒…⁒ℓ, 𝐱(β„“) is optimal with respect to 𝐩(j) or equivalently that

[A⁒𝐱(β„“)-𝐛]T⁒𝐩(j)=𝐫(β„“+1)⁒𝐩(j)T=0 (11)

Noticing that 𝐫(k+1)=𝐫(k)+αk⁒A⁒𝐩(k), we want to show

𝐫(k)⁒𝐩(j)T+αk⁒𝐩(k)⁒AT⁒𝐩(j)=0 (12)

and therefore since 𝐫(k)⁒𝐩(j)T=0 by inductive hypothesis, it suffices to prove that

𝐩(k)⁒AT⁒𝐩(j)=0β€ƒβ€ƒβˆ€j∈1⁒…⁒k-2 (13)

But, again by the definition of 𝐩(k), this is equivalent to proving

𝐫(k)⁒AT⁒𝐩(j)-βk⁒𝐩(k-1)⁒AT⁒𝐩(j)=0 (14)

and since 𝐩(k-1)⁒AT⁒𝐩(j)=0 by inductive hypothesis, all we need to prove is that

𝐫(k)⁒AT⁒𝐩(j)=0 (15)

To proceed we require the following lemma.

Let Vβ„“=span⁒{Aβ„“-1⁒𝐫(1),…,A⁒𝐫(1),𝐫(1)}.

  • β€’


  • β€’

    If 𝐲∈Vβ„“ then 𝐫(k)βŸ‚π² for all k>β„“.

Since 𝐩(j)∈Vj, it follows that A⁒𝐩(j)∈Vj+1, and since j+1<k,

𝐫(k)(A𝐩(j))T=0 (16)

To finish let’s prove the lemma. The first point is by induction. The base case β„“=1 holds. Assuming that 𝐫(β„“-1),𝐩(β„“-1)∈Vβ„“-1, we have that

𝐫(β„“)=𝐫(β„“-1)+Ξ±β„“-1⁒A⁒𝐩(β„“-1)∈Vℓ  𝐩(β„“)=𝐫(β„“)-βℓ⁒𝐩(β„“-1)∈Vβ„“ (17)

For the second point we need an alternative characterizationMathworldPlanetmath of Vβ„“. Since 𝐩(1),…,𝐩(β„“)∈Vβ„“,

span⁒{𝐩(1),…,𝐩(β„“)}βŠ†Vβ„“ (18)

By (11), we have that if for some s∈1⁒…⁒ℓ

𝐩(s)=Ξ»s-1⁒𝐩(s-1)+β‹―+Ξ»1⁒𝐩(1) (19)

then 𝐩(s)⁒𝐫(s)T=0, but we know that

𝐩(s)⁒𝐫(s)T=[𝐫(s)-βs⁒𝐩(s-1)]T⁒𝐫(s)=𝐫(s)⁒𝐫(s)T (20)

but were this zero, we’d have 𝐫(s)=𝟎 and we would have solved the original problem. Thus we conclude that 𝐩(s)βˆ‰span⁒{𝐩(1),…,𝐩(s-1)}, so the vectors 𝐩(1),…,𝐩(s) are linearly independent. It follows that dim⁒span⁒{𝐩(1),…,𝐩(β„“)}=β„“, which on the other hand is the maximum possible dimensionPlanetmathPlanetmath of Vβ„“ and thus we must have

Vβ„“=span⁒{𝐩(1),…,𝐩(β„“)} (21)

which is the alternative characterization we were looking for. Now we have, again by (11), that if 𝐲∈Vβ„“, and β„“<k, then

𝐫(k)⁒𝐲T=0 (22)

thus the second point is proven.

Title extended discussion of the conjugate gradient method
Canonical name ExtendedDiscussionOfTheConjugateGradientMethod
Date of creation 2013-03-22 17:28:24
Last modified on 2013-03-22 17:28:24
Owner ehremo (15714)
Last modified by ehremo (15714)
Numerical id 13
Author ehremo (15714)
Entry type Topic
Classification msc 15A06
Classification msc 90C20