MINRES and conjugate gradients
import Pkg; Pkg.activate("/Users/driscoll/Documents/GitHub/fnc")
using FNCFunctions
using Plots
default(
titlefont=(11,"Helvetica"),
guidefont=(11,"Helvetica"),
linewidth = 2,
markersize = 3,
msa = 0,
size=(500,320),
label="",
html_output_format = "svg"
)
using PrettyTables, LaTeXStrings, Printf
using LinearAlgebra Activating project at `~/Documents/GitHub/fnc`
8.6. MINRES and conjugate gradients ¶ We have seen before that certain matrix properties enhance solutions to linear algebra problems. One of the most important of these is when A ∗ = A \mathbf{A}^*=\mathbf{A} A ∗ = A ; i.e., A \mathbf{A} A is hermitian. The Arnoldi iteration has a particularly useful specialization to this case. While in this section we describe the resulting algorithms, we do not present them in detail or show implementations.
8.6.1 Lanczos iteration ¶ Starting from (8.4.10) , we left-multiply by Q m ∗ \mathbf{Q}_m^* Q m ∗ to get
Q m ∗ A Q m = Q m ∗ Q m + 1 H m = H ~ m , \mathbf{Q}_m^* \mathbf{A} \mathbf{Q}_m = \mathbf{Q}_m^* \mathbf{Q}_{m+1} \mathbf{H}_m = \tilde{\mathbf{H}}_m, Q m ∗ A Q m = Q m ∗ Q m + 1 H m = H ~ m , where H ~ m \tilde{\mathbf{H}}_m H ~ m is rows 1 through m m m of H m \mathbf{H}_m H m . If A \mathbf{A} A is hermitian, then so is the left side of this equation, hence H ~ m \tilde{\mathbf{H}}_m H ~ m is hermitian too. But it is also upper Hessenberg, meaning that the ( i , j ) (i,j) ( i , j ) element is zero if i > j + 1 i > j+1 i > j + 1 . By symmetry, this means that elements are zero when j > i + 1 j > i+1 j > i + 1 as well.
Equation (8.4.6) of the Arnoldi iteration now simplifies to a much shorter expression:
A q m = H m − 1 , m q m − 1 + H m m q m + H m + 1 , m q m + 1 . \mathbf{A} \mathbf{q}_m = H_{m-1,m} \,\mathbf{q}_{m-1} + H_{mm} \,\mathbf{q}_m + H_{m+1,m}\,\mathbf{q}_{m+1}. A q m = H m − 1 , m q m − 1 + H mm q m + H m + 1 , m q m + 1 . As before in deriving the Arnoldi iteration, when given the first m m m vectors we can solve for the entries in column m m m of H \mathbf{H} H and then for q m + 1 \mathbf{q}_{m+1} q m + 1 . The resulting process is known as the Lanczos iteration . Its most important practical advantage is that while Arnoldi needs O ( m ) O(m) O ( m ) steps to get q m + 1 \mathbf{q}_{m+1} q m + 1 from the previous vectors, Lanczos needs only O ( 1 ) O(1) O ( 1 ) steps, so restarting isn’t required for symmetric matrices.[1]
8.6.2 MINRES ¶ When A \mathbf{A} A is hermitian and the Arnoldi iteration is reduced to Lanczos, the analog of GMRES is known as MINRES . Like GMRES, MINRES minimizes the residual ∥ b − A x ∥ \|\mathbf{b}-\mathbf{A}\mathbf{x}\| ∥ b − Ax ∥ over increasingly larger Krylov spaces.
MINRES is also more theoretically tractable than GMRES. The following result relies on some advanced approximation theory. Recall that the eigenvalues of a hermitian matrix are real.
Suppose A \mathbf{A} A is hermitian, invertible, and indefinite. Divide its eigenvalues into positive and negative sets Λ + \Lambda_+ Λ + and Λ − \Lambda_- Λ − , and define
κ + = max λ ∈ Λ + ∣ λ ∣ min λ ∈ Λ + ∣ λ ∣ , κ − = max λ ∈ Λ − ∣ λ ∣ min λ ∈ Λ − ∣ λ ∣ . \kappa_+ = \frac{ \max_{\lambda \in \Lambda_+} |\lambda| }{ \min_{\lambda \in \Lambda_+} |\lambda| }, \qquad
\kappa_- = \frac{ \max_{\lambda \in \Lambda_-} |\lambda| }{ \min_{\lambda \in \Lambda_-} |\lambda| }. κ + = min λ ∈ Λ + ∣ λ ∣ max λ ∈ Λ + ∣ λ ∣ , κ − = min λ ∈ Λ − ∣ λ ∣ max λ ∈ Λ − ∣ λ ∣ . Then x m \mathbf{x}_m x m , the m m m th solution estimate of MINRES, satisfies
∥ r m ∥ 2 ∥ b ∥ 2 ≤ ( κ + κ − − 1 κ + κ − + 1 ) ⌊ m / 2 ⌋ , \frac{\|\mathbf{r}_m\|_2}{\|\mathbf{b}\|_2} \le \left( \frac{\sqrt{\kappa_+\kappa_-} - 1}{\sqrt{\kappa_+\kappa_-} + 1} \right)^{\lfloor m/2\rfloor}, ∥ b ∥ 2 ∥ r m ∥ 2 ≤ ( κ + κ − + 1 κ + κ − − 1 ) ⌊ m /2 ⌋ , where ⌊ m / 2 ⌋ \lfloor m/2\rfloor ⌊ m /2 ⌋ means to round m / 2 m/2 m /2 down to the nearest integer.
The bound for a definite matrix is better, as the next theorem shows. The upper bound (8.6.4) on the residual obeys a linear convergence rate. As the product κ + κ − \kappa_+\kappa_- κ + κ − grows, the rate of this convergence approaches 1. Hence, the presence of eigenvalues close to the origin (relative to the max eigenvalues) is expected to force a slower convergence.
Suppose A \mathbf{A} A has κ + = 60 \kappa_+=60 κ + = 60 and κ − = 15 \kappa_-=15 κ − = 15 . Then to achieve a guaranteed reduction in the relative residual of 10-3 , we require
( 900 − 1 900 + 1 ) ⌊ m / 2 ⌋ ≤ 1 0 − 3 , \left( \frac{\sqrt{900} - 1}{\sqrt{900} + 1} \right)^{\lfloor m/2\rfloor} \le 10^{-3}, ( 900 + 1 900 − 1 ) ⌊ m /2 ⌋ ≤ 1 0 − 3 , ⌊ m / 2 ⌋ log 10 ( 29 31 ) ≤ − 3 , {\lfloor m/2\rfloor} \log_{10} \left( \frac{29}{31} \right) \le -3, ⌊ m /2 ⌋ log 10 ( 31 29 ) ≤ − 3 , m ≥ 2 ⌈ 3 log 10 ( 31 / 29 ) ⌉ = 208. m \ge 2 \lceil \frac{3}{\log_{10}(31/29)} \rceil = 208. m ≥ 2 ⌈ log 10 ( 31/29 ) 3 ⌉ = 208. Because the theorem gives an upper bound, MINRES may converge faster. All we can say is that 208 is certain to be enough iterations.
The following matrix is indefinite.
A = FNC.poisson(10) - 20I
λ = eigvals(Matrix(A))
isneg = @. λ < 0
@show sum(isneg), sum(.!isneg);(sum(isneg), sum(.!(isneg))) = (13, 87) We can compute the relevant quantities from Theorem 8.6.1 .
mn, mx = extrema(-λ[isneg])
κ₋ = mx / mn
mn, mx = extrema(λ[.!isneg])
κ₊ = mx / mn
ρ = (sqrt(κ₋ * κ₊) - 1) / (sqrt(κ₋ * κ₊) + 1)Because the iteration number m m m is halved in (8.6.4) , the rate constant of linear convergence is the square root of this number, which makes it even closer to 1.
Now we apply MINRES to a linear system with this matrix, and compare the observed convergence to the upper bound from the theorem.
using IterativeSolvers, Plots
b = rand(100)
x, hist = minres(A, b, reltol=1e-10, maxiter=51, log=true);
relres = hist[:resnorm] / norm(b)
m = 0:length(relres)-1
plot(m, relres;
label="observed", legend=:left,
xaxis=L"m", yaxis=(:log10, "relative residual"),
title=("Convergence of MINRES"))
plot!(m, ρ .^ (m / 2), l=:dash, label="upper bound")The upper bound turns out to be pessimistic here, especially in the later iterations. However, you can see a slow linear phase in the convergence that tracks the bound closely.
8.6.3 Conjugate gradients ¶ Given positive definiteness in addition to symmetry, we arrive at perhaps the most famous Krylov subspace method for A x = b \mathbf{A}\mathbf{x}=\mathbf{b} Ax = b , called conjugate gradients .
Suppose now that A \mathbf{A} A is hermitian and positive definite (HPD ). Then A \mathbf{A} A has a Cholesky factorization, which in the complex case is A = R ∗ R \mathbf{A}=\mathbf{R}^*\mathbf{R} A = R ∗ R . Therefore, for any vector u \mathbf{u} u ,
u ∗ A u = ( R u ) ∗ ( R u ) = ∥ R u ∥ 2 , \mathbf{u}^*\mathbf{A}\mathbf{u} = (\mathbf{R}\mathbf{u})^*(\mathbf{R}\mathbf{u})=\|\mathbf{R} \mathbf{u}\|^2, u ∗ Au = ( Ru ) ∗ ( Ru ) = ∥ Ru ∥ 2 , which is nonnegative and zero only when u = 0 \mathbf{u}=\boldsymbol{0} u = 0 , provided A \mathbf{A} A (and therefore R \mathbf{R} R ) is nonsingular. Hence, we can define a special vector norm relative to A \mathbf{A} A :
∥ u ∥ A = ( u ∗ A u ) 1 / 2 . \| \mathbf{u} \|_{\mathbf{A}} = \left( \mathbf{u}^*\mathbf{A}\mathbf{u} \right)^{1/2}. ∥ u ∥ A = ( u ∗ Au ) 1/2 . For each m = 1 , 2 , 3 , … m=1,2,3,\ldots m = 1 , 2 , 3 , … , minimize ∥ u − x ∥ A \|\mathbf{u}-\mathbf{x}\|_{\mathbf{A}} ∥ u − x ∥ A for u \mathbf{u} u in the Krylov subspace K m \mathcal{K}_m K m to get x m \mathbf{x}_m x m .
8.6.4 Convergence ¶ The convergence of CG and MINRES is dependent on the eigenvalues of A \mathbf{A} A . In the HPD case the eigenvalues are real and positive, and they equal the singular values. Hence, the condition number κ \kappa κ is equal to the ratio of the largest eigenvalue to the smallest one. The following theorem suggests that MINRES and CG are not so different in convergence.
Let A \mathbf{A} A be real and SPD with 2-norm condition number κ \kappa κ . For MINRES define R ( m ) = ∥ r m ∥ 2 / ∥ b ∥ 2 R(m)=\|\mathbf{r}_m\|_2/\|\mathbf{b}\|_2 R ( m ) = ∥ r m ∥ 2 /∥ b ∥ 2 , and for CG define R ( m ) = ∥ x m − x ∥ A / ∥ x ∥ A R(m)=\|\mathbf{x}_m-\mathbf{x}\|_{\mathbf{A}}/\|\mathbf{x}\|_{\mathbf{A}} R ( m ) = ∥ x m − x ∥ A /∥ x ∥ A ,
where r m \mathbf{r}_m r m and x m \mathbf{x}_m x m are the residual and solution approximation associated with the space K m \mathcal{K}_m K m . Then
R ( m ) ≤ 2 ( κ − 1 κ + 1 ) m . R(m) \le 2\, \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^m. R ( m ) ≤ 2 ( κ + 1 κ − 1 ) m . Theorem 8.6.2 characterizes the convergence of MINRES and CG similarly, differing only in whether the measurement is of the residual or the A \mathbf{A} A -norm of the error, respectively. While these are different quantities, in practice one may not find a consistent advantage for one method over the other.
As in the indefinite case with MINRES, a larger condition number is associated with slower convergence in the positive definite case. Specifically, to make the bound in (8.6.10) less than a number ϵ \epsilon ϵ requires
2 ( κ − 1 κ + 1 ) m ≈ ϵ , m log ( κ − 1 κ + 1 ) ≈ log ( ϵ 2 ) . \begin{gather*}
2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^m \approx \epsilon, \\
m \log \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)
\approx \log\Bigl( \frac{\epsilon}{2} \Bigr).
\end{gather*} 2 ( κ + 1 κ − 1 ) m ≈ ϵ , m log ( κ + 1 κ − 1 ) ≈ log ( 2 ϵ ) . We estimate
κ − 1 κ + 1 = ( 1 − κ − 1 / 2 ) ( 1 + κ − 1 / 2 ) − 1 = ( 1 − κ − 1 / 2 ) ( 1 − κ − 1 / 2 + κ − 1 + ⋯ ) = 1 − 2 κ − 1 / 2 + O ( κ − 1 ) as κ → ∞ . \begin{align*}
\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}
&= (1 - \kappa^{-1/2}\,) (1 + \kappa^{-1/2}\,)^{-1}\\
&= (1 - \kappa^{-1/2}\,) (1 - \kappa^{-1/2} + \kappa^{-1} + \cdots)\\
&= 1 - 2\kappa^{-1/2} + O(\kappa^{-1}) \quad \text{ as $\kappa
\rightarrow \infty$.}
\end{align*} κ + 1 κ − 1 = ( 1 − κ − 1/2 ) ( 1 + κ − 1/2 ) − 1 = ( 1 − κ − 1/2 ) ( 1 − κ − 1/2 + κ − 1 + ⋯ ) = 1 − 2 κ − 1/2 + O ( κ − 1 ) as κ → ∞. With the Taylor expansion log ( 1 + x ) = x − ( x 2 / 2 ) + ⋯ \log(1+x) = x - (x^2/2) + \cdots log ( 1 + x ) = x − ( x 2 /2 ) + ⋯ , we finally conclude
2 m κ − 1 / 2 ≈ − log ( ϵ 2 ) , or m = O ( κ ) , \begin{gather*}
2 m \kappa^{-1/2} \approx -\log\Bigl( \frac{\epsilon}{2} \Bigr),
\text{ or }
m = O(\sqrt{\kappa}),
\end{gather*} 2 m κ − 1/2 ≈ − log ( 2 ϵ ) , or m = O ( κ ) , as an estimate of the number of iterations needed to achieve a fixed accuracy.
This estimate fails for very large κ \kappa κ , however.
We will compare MINRES and CG on some quasi-random SPD problems. The first matrix has a condition number of 100.
n = 5000
density = 0.001
A = FNC.sprandsym(n, density, 1 / 100)
x = (1:n) / n
b = A * x;
Now we apply both methods and compare the convergence of the system residuals, using implementations imported from IterativeSolvers.
plt = plot(title="Convergence of MINRES and CG",
xaxis=("Krylov dimension"), yaxis=(:log10, "relative residual norm"))
for method in [minres, cg]
x̃, history = method(A, b, reltol=1e-6, maxiter=1000, log=true)
relres = history[:resnorm] / norm(b)
plot!(0:length(relres)-1, relres, label="$method")
err = round(norm(x̃ - x) / norm(x), sigdigits=4)
println("$method error: $err")
end
pltThere is little difference between the two methods here. Next, we increase the condition number of the matrix by a factor of 25. The rule of thumb predicts that the number of iterations required should increase by a factor of about 5.
A = FNC.sprandsym(n, density, 1 / 2500)
b = A * x;
plt = plot(title="Convergence of MINRES and CG",
xaxis=("Krylov dimension"), yaxis=(:log10, "relative residual norm"))
for method in [minres, cg]
x̃, history = method(A, b, reltol=1e-6, maxiter=1000, log=true)
relres = history[:resnorm] / norm(b)
plot!(0:length(relres)-1, relres, label="$method")
err = round(norm(x̃ - x) / norm(x), sigdigits=4)
println("$method error: $err")
end
pltBoth methods have an early superlinear phase that allow them to finish slightly sooner than the factor of 5 predicted: Theorem 8.6.2 is an upper bound, not necessarily an approximation. Both methods ultimately achieve the same reduction in the residual; MINRES stops earlier, but with a slightly larger error.
8.6.5 Exercises ¶ ✍ For each part, the eigenvalues of A \mathbf{A} A are given. Suppose MINRES is applied to solve A x = b \mathbf{A}\mathbf{x}=\mathbf{b} Ax = b . Use (8.6.4) or (8.6.10) , whichever is most appropriate, to determine a lower bound on m m m to guarantee reduction of the residual norm by a factor 10-4 .
(a) − 100 , − 99 , … , − 1 , 1 , 2 , … , 100 -100,-99,\ldots,-1,1,2,\ldots,100 − 100 , − 99 , … , − 1 , 1 , 2 , … , 100
(b) − 100 , 1 , 2 , … , 100 -100,1,2,\ldots,100 − 100 , 1 , 2 , … , 100
(c) 1 , 2 , … , 100 1,2,\ldots,100 1 , 2 , … , 100
⌨ Let b \mathbf{b} b be a random unit vector of length 202. Define a diagonal matrix with diagonal entries d k d_k d k given by
d k = − 200 + 1.95 k , k = 0 , 1 , … , 100 , d k + 101 = 10 + 0.9 k , k = 0 , 1 , … , 100. \begin{align*}
d_k &= -200 + 1.95k, \quad k=0,1,\ldots,100, \\
d_{k+101} &= 10 + 0.9k, \quad k=0,1,\ldots,100.
\end{align*} d k d k + 101 = − 200 + 1.95 k , k = 0 , 1 , … , 100 , = 10 + 0.9 k , k = 0 , 1 , … , 100. (a) Apply 120 iterations of MINRES to solve A x = b \mathbf{A}\mathbf{x}=\mathbf{b} Ax = b . Compute the relative error of the answer, and plot the norm of the residual as a function of m m m on a log-linear scale.
(b) Add to your graph the line representing the upper bound (8.6.4) . (Ignore the rounding in the exponent.) This line should stay strictly on or above the error curve.
⌨ Let b \mathbf{b} b be a random unit vector of length 501. Define a sparse diagonal matrix A \mathbf{A} A with diagonal entries d k d_k d k given by
d k = 4 + k ⋅ 9996 500 , k = 0 , 1 , … , 500. \begin{align*}
d_k &= 4 + k\cdot\frac{9996}{500}, \quad k=0,1,\ldots,500.
\end{align*} d k = 4 + k ⋅ 500 9996 , k = 0 , 1 , … , 500. (a) Apply 100 iterations of MINRES to solve A x = b \mathbf{A}\mathbf{x}=\mathbf{b} Ax = b . Compute the relative norm of the answer. Plot the norm of the residual as a function of m m m .
(b) Add to your graph the line representing the upper bound (8.6.10) . This line should stay strictly on or above the convergence curve.
(c) Add a convergence curve for 100 iterations of cg.
✍ Suppose a family of SPD matrices A \mathbf{A} A is parameterized by t t t , and that the condition numbers of the matrices scale like O ( t 2 ) O(t^2) O ( t 2 ) as t → ∞ t\to\infty t → ∞ . Given that CG takes 60 iterations to reach a certain reduction in the error of a linear system when t = 200 t=200 t = 200 , estimate the number of iterations CG will need to reach the same accuracy at t = 300 t=300 t = 300 .
✍ Given real n × n n\times n n × n symmetric A \mathbf{A} A and vector b = A x \mathbf{b}=\mathbf{A}\mathbf{x} b = Ax , we can define the scalar-valued function
φ ( u ) = u T A u − 2 u T b , u ∈ R n . \varphi(\mathbf{u}) = \mathbf{u}^T \mathbf{A} \mathbf{u} - 2 \mathbf{u}^T \mathbf{b}, \qquad \mathbf{u}\in\real^n. φ ( u ) = u T Au − 2 u T b , u ∈ R n . (a) Expand and simplify the expression φ ( x + v ) − φ ( x ) \varphi(\mathbf{x}+\mathbf{v})-\varphi(\mathbf{x}) φ ( x + v ) − φ ( x ) , keeping in mind that A x = b \mathbf{A}\mathbf{x}=\mathbf{b} Ax = b .
(b) Using the result of part (a), prove that if A \mathbf{A} A is an SPD matrix, φ \varphi φ has a global minimum at x \mathbf{x} x .
(c) Show that for any vector u \mathbf{u} u , ∥ u − x ∥ A 2 − φ ( u ) \|\mathbf{u}-\mathbf{x}\|_{\mathbf{A}}^2-\varphi(\mathbf{u}) ∥ u − x ∥ A 2 − φ ( u ) is constant.
(d) Using the result of part (c), prove that CG minimizes φ ( u ) \varphi(\mathbf{u}) φ ( u ) over Krylov subspaces.
⌨ Let n = 50 n=50 n = 50 . Define the matrix A k \mathbf{A}_k A k as FNC.poisson(n) minus k 2 k^2 k 2 times the n 2 × n 2 n^2\times n^2 n 2 × n 2 identity matrix, and define vector b \mathbf{b} b as n 2 n^2 n 2 copies of -1. The linear system A k x = b \mathbf{A}_k\mathbf{x}=\mathbf{b} A k x = b arises from the Helmholtz equation for wave propagation at a single frequency k k k .
(a) Apply both MINRES and CG to the Helmholtz system for k = 1.3 k=1.3 k = 1.3 , solving to a relative residual tolerance of 10-5 . Plotting their convergence curves together.
(b) Repeat part (a) for k = 8 k=8 k = 8 .
(c) Use eigs on the matrix from part (b) to show that it is indefinite. (Hint: Use additional arguments to get the eigenvalues with smallest and largest real parts.) This helps explain why the CG convergence curve for this matrix looks rather strange.