3.2 The normal equations
Tobin A. Driscoll, Richard J. Braun
We now solve the general linear least-squares problem in Definition 3.1.1: given $\mathbf{A}\in\mathbb{R}^{m \times n}$ and $\mathbf{b}\in\mathbb{R}^m$ with $m>n$, find the $\mathbf{x}\in\mathbb{R}^n$ that minimizes $\| \mathbf{b} - \mathbf{A}\mathbf{x} \|_2$.
There is a concise explicit solution. In the following proof we make use of the elementary algebraic fact that for any two vectors $\mathbf{u}$ and $\mathbf{v}$,
$$(\mathbf{u}+\mathbf{v})^T(\mathbf{u}+\mathbf{v}) = \mathbf{u}^T\mathbf{u} + \mathbf{u}^T\mathbf{v} + \mathbf{v}^T\mathbf{u} + \mathbf{v}^T\mathbf{v} = \mathbf{u}^T\mathbf{u} + 2\mathbf{v}^T\mathbf{u} + \mathbf{v}^T\mathbf{v}.$$

The key result is the following: if $\mathbf{x}$ satisfies $\mathbf{A}^T(\mathbf{A}\mathbf{x}-\mathbf{b})=\boldsymbol{0}$, then $\mathbf{x}$ solves the linear least-squares problem, i.e., $\mathbf{x}$ is a minimizer of $\| \mathbf{b}-\mathbf{A}\mathbf{x} \|_2$.
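A brief sketch of the argument, filled in here along the lines the identity above is set up for: let $\mathbf{y}\in\mathbb{R}^n$ be arbitrary and write $\mathbf{b}-\mathbf{A}\mathbf{y}=(\mathbf{b}-\mathbf{A}\mathbf{x})+\mathbf{A}(\mathbf{x}-\mathbf{y})$. Taking $\mathbf{u}=\mathbf{b}-\mathbf{A}\mathbf{x}$ and $\mathbf{v}=\mathbf{A}(\mathbf{x}-\mathbf{y})$ gives

$$\|\mathbf{b}-\mathbf{A}\mathbf{y}\|_2^2 = \|\mathbf{b}-\mathbf{A}\mathbf{x}\|_2^2 + 2(\mathbf{x}-\mathbf{y})^T\mathbf{A}^T(\mathbf{b}-\mathbf{A}\mathbf{x}) + \|\mathbf{A}(\mathbf{x}-\mathbf{y})\|_2^2 \ge \|\mathbf{b}-\mathbf{A}\mathbf{x}\|_2^2,$$

because the middle term vanishes by assumption. Hence no choice of $\mathbf{y}$ produces a smaller residual norm than $\mathbf{x}$ does.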
Definition 3.2.1  (Normal equations)
Given $\mathbf{A}\in \mathbb{R}^{m\times n}$ and $\mathbf{b}\in \mathbb{R}^{m}$, the normal equations for the linear least-squares problem $\operatorname{argmin}\| \mathbf{b}- \mathbf{A} \mathbf{x}\|$ are $\mathbf{A}^T(\mathbf{A}\mathbf{x}-\mathbf{b})=\boldsymbol{0}$, or equivalently,

$$\mathbf{A}^T\mathbf{A}\mathbf{x}=\mathbf{A}^T\mathbf{b}.$$

The normal equations have a geometric interpretation, as shown in Figure 3.2.1. The residual $\mathbf{A}\mathbf{x}-\mathbf{b}$ of the least-squares solution must be orthogonal to the range of $\mathbf{A}$; that is, for every $\mathbf{z}$ we need $(\mathbf{A} \mathbf{z})^T(\mathbf{A}\mathbf{x}-\mathbf{b})=0$, which is equivalent to $\mathbf{A}^T(\mathbf{A}\mathbf{x}-\mathbf{b})=\boldsymbol{0}$.
Figure 3.2.1: Geometry of the normal equations. The smallest residual is orthogonal to the range of the matrix $\mathbf{A}$.
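As a quick numerical illustration of this orthogonality (the data below are arbitrary, not from the text), the residual of the backslash solution is orthogonal to the columns of $\mathbf{A}$ up to roundoff:

using LinearAlgebra
A = rand(5, 2);  b = rand(5)
x = A \ b            # least-squares solution via backslash
r = A * x - b        # residual vector
@show A' * r;        # essentially the zero vector: r is orthogonal to range(A)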
3.2.1 Pseudoinverse and definiteness

If we group the left-hand side of the normal equations as $(\mathbf{A}^T\mathbf{A})\,\mathbf{x}$, then (3.2.3) is a square $n\times n$ linear system to solve for $\mathbf{x}$. When $\mathbf{A}^T\mathbf{A}$ is invertible, the solution can be written as $\mathbf{x}=(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{b}$, and the matrix $\mathbf{A}^+=(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T$ is known as the pseudoinverse of $\mathbf{A}$. Mathematically, the overdetermined least-squares problem $\mathbf{A}\mathbf{x}\approx \mathbf{b}$ has the solution $\mathbf{x}=\mathbf{A}^+\mathbf{b}$.
Computationally, we can generalize the observation about Julia from Chapter 2: backslash is mathematically equivalent to left-multiplication by the inverse (in the square case) or the pseudoinverse (in the rectangular case) of a matrix. One can also compute the pseudoinverse directly using pinv, but as with matrix inverses, this is rarely necessary in practice.
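As a small check (the 3×2 matrix below is an arbitrary example, not from the text), backslash, pinv, and the normal equations all produce the same minimizer up to roundoff:

using LinearAlgebra
A = [1.0 2; 3 4; 5 6]        # arbitrary tall matrix with independent columns
b = [1.0, 0, 1]
x_bs = A \ b                 # backslash (least squares)
x_pi = pinv(A) * b           # pseudoinverse applied to b
x_ne = (A' * A) \ (A' * b)   # normal equations
@show norm(x_bs - x_pi), norm(x_bs - x_ne);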
The matrix $\mathbf{A}^T\mathbf{A}$ appearing in the pseudoinverse has some important properties.

Theorem 3.2.2: For any real $m\times n$ matrix $\mathbf{A}$ with $m>n$: (1) $\mathbf{A}^T\mathbf{A}$ is symmetric; (2) $\mathbf{A}^T\mathbf{A}$ is singular if and only if the columns of $\mathbf{A}$ are linearly dependent; (3) if the columns of $\mathbf{A}$ are linearly independent, then $\mathbf{A}^T\mathbf{A}$ is positive definite.

The first part is left as Exercise 3.2.3. For the second part, note that $\mathbf{A}^T\mathbf{A}$ is singular if and only if there is a nonzero $\mathbf{z}$ with $\mathbf{A}^T\mathbf{A}\mathbf{z}=\boldsymbol{0}$. Left-multiplying that equation by $\mathbf{z}^T$ gives

$$0 = \mathbf{z}^T\mathbf{A}^T\mathbf{A}\mathbf{z}=(\mathbf{A}\mathbf{z})^T(\mathbf{A}\mathbf{z}) = \| \mathbf{A}\mathbf{z} \|_2^2,$$

which is equivalent to $\mathbf{A}\mathbf{z}=\boldsymbol{0}$. A nonzero $\mathbf{z}$ can satisfy this only when the columns of $\mathbf{A}$ are linearly dependent. Finally, we can repeat the manipulations above to show that for any nonzero $n$-vector $\mathbf{v}$, $\mathbf{v}^T(\mathbf{A}^T\mathbf{A})\mathbf{v}=\| \mathbf{A}\mathbf{v} \|_2^2\ge 0$, with equality possible only when the columns of $\mathbf{A}$ are linearly dependent; this proves the third part.
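These properties are easy to check numerically; in the sketch below the random matrix is just an illustrative stand-in:

using LinearAlgebra
A = randn(6, 3)          # columns are linearly independent with probability 1
N = A' * A
@show issymmetric(N)     # symmetry
@show isposdef(N);       # positive definiteness (hence nonsingularity)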
3.2.2 Implementation

The definition of the pseudoinverse involves taking the inverse of a matrix, so it is not advisable to use the pseudoinverse computationally. Instead, we use the definition of the normal equations to set up a linear system, which we already know how to solve. In summary, the steps for solving the linear least-squares problem $\mathbf{A}\mathbf{x}\approx\mathbf{b}$ are as follows.
Algorithm 3.2.1  (Solution of linear least squares by the normal equations)
1. Compute $\mathbf{N}=\mathbf{A}^T\mathbf{A}$.
2. Compute $\mathbf{z} = \mathbf{A}^T\mathbf{b}$.
3. Solve the $n\times n$ linear system $\mathbf{N}\mathbf{x} = \mathbf{z}$ for $\mathbf{x}$.
Steps 1 and 3 of Algorithm 3.2.1 dominate the flop count: forming $\mathbf{N}$ costs $O(mn^2)$ operations and solving the square system costs $O(n^3)$, while step 2 requires only $O(mn)$.
In the last step we can exploit the fact, proved in Theorem 3.2.2, that $\mathbf{N}$ is symmetric and positive definite, and use Cholesky factorization as in Exploiting matrix structure. This detail is included in Function 3.2.2.
 """
    lsnormal(A, b)
Solve a linear least-squares problem by the normal equations.
Returns the minimizer of ||b-Ax||.
"""
function lsnormal(A, b)
    N = A' * A
    z = A' * b
    R = cholesky(N).U
    w = forwardsub(R', z)                   # solve R'w = z by forward substitution
    x = backsub(R, w)                       # solve Rx = w by backward substitution
    return x
end

About the code
The syntax cholesky(N).U is a field reference to extract the matrix we want from the structure returned by cholesky.
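For example (the 2×2 matrix here is made up purely to show the field access):

using LinearAlgebra
N = [4.0 2; 2 3]         # a small symmetric positive definite matrix
F = cholesky(N)          # factorization object with fields L and U
R = F.U                  # field reference: the upper-triangular factor
@show norm(R' * R - N);  # R'R reproduces N up to roundoff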
 function x = lsnormal(A, b)
% LSNORMAL   Solve linear least squares by normal equations.
% Input: 
%   A     coefficient matrix (m by n, m > n)
%   b     right-hand side (m by 1)
% Output:
%   x     minimizer of || b - Ax ||
    N = A' * A;
    z = A' * b;
    R = chol(N);
    w = forwardsub(R', z);    % solve R'w = z by forward substitution
    x = backsub(R, w);        % solve Rx = w by backward substitution
end
 def lsnormal(A, b):
    """
    lsnormal(A, b)
    
    Solve a linear least squares problem by the normal equations. Returns the
    minimizer of ||b-Ax||.
    """
    N = A.T @ A
    z = A.T @ b
    R = scipy.linalg.cholesky(N)
    w = forwardsub(R.T, z)                   # solve R'w = z by forward substitution
    x = backsub(R, w)                        # solve Rx = w by backward substitution
    return x

About the code
The cholesky function here comes from scipy.linalg, so that module must be imported beforehand.
3.2.3 Conditioning and stability

We have already used A\b as the native way to solve the linear least-squares problem $\mathbf{A}\mathbf{x}\approx\mathbf{b}$. The algorithm employed by backslash does not proceed through the normal equations, because of instability.
The conditioning of the linear least-squares problem relates changes in the solution $\mathbf{x}$ to perturbations in the data $\mathbf{A}$ and $\mathbf{b}$.
Definition 3.2.3  (Matrix condition number (rectangular case))
If $\mathbf{A}$ is $m\times n$ with $m > n$, then its condition number is

$$\kappa(\mathbf{A}) = \|\mathbf{A}\|_2 \cdot \|\mathbf{A}^{+}\|_2.$$

If the rank of $\mathbf{A}$ is less than $n$, then $\kappa(\mathbf{A})=\infty$.
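A quick comparison of this definition with the built-in cond (the random matrix is only for illustration):

using LinearAlgebra
A = randn(100, 4)
@show cond(A)                        # built-in 2-norm condition number
@show opnorm(A) * opnorm(pinv(A));   # matches the definition above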
Provided that the minimum residual norm $\|\mathbf{b}-\mathbf{A}\mathbf{x}\|$ is relatively small, the conditioning of the linear least-squares problem is close to $\kappa(\mathbf{A})$.
As an algorithm, the normal equations begin by computing the data for the $n\times n$ problem $(\mathbf{A}^T\mathbf{A})\mathbf{x} = \mathbf{A}^T \mathbf{b}$. When this system is solved, perturbations to its data can be amplified by a factor of $\kappa(\mathbf{A}^T\mathbf{A})$.
The following can be proved using results in Chapter 7.
Theorem 3.2.4  (Condition number in the normal equations)
If $\mathbf{A}$ is $m\times n$ with $m > n$, then

$$\kappa(\mathbf{A}^T\mathbf{A}) = \kappa(\mathbf{A})^2.$$

This squaring of the condition number in the normal equations is the cause of instability. If $\kappa(\mathbf{A})$ is large, solving the least-squares problem through the normal equations can be far more sensitive to perturbations and roundoff than the problem itself warrants.
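The squaring is easy to observe numerically; the column scaling below is just one way to manufacture an ill-conditioned example:

using LinearAlgebra
A = randn(60, 5) * Diagonal([1, 1e-1, 1e-2, 1e-3, 1e-4])
@show cond(A)^2
@show cond(A' * A);    # agrees with cond(A)^2 up to roundoff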
Example 3.2.1  (Instability in the normal equations)
Because the functions $\sin^2(t)$, $\cos^2(t)$, and $1$ are linearly dependent, we should find that the following matrix is somewhat ill-conditioned.
Tip
The local variable scoping rule for loops applies to comprehensions as well.
t = range(0, 3, 400)
f = [ x -> sin(x)^2, x -> cos((1 + 1e-7) * x)^2, x -> 1. ]
A = [ f(t) for t in t, f in f ]
@show κ = cond(A);
κ = cond(A) = 1.8253225426741675e7

Now we set up an artificial linear least-squares problem with a known exact solution that actually makes the residual zero.
x = [1., 2, 1]
b = A * x;
Using backslash to find the least-squares solution, we get a relative error that is well below $\kappa$ times machine epsilon.
x_BS = A \ b
@show observed_error = norm(x_BS - x) / norm(x);
@show error_bound = κ * eps();
observed_error = norm(x_BS - x) / norm(x) = 1.0163949045357309e-10
error_bound = κ * eps() = 4.053030228488391e-9
If we formulate and solve via the normal equations, we get a much larger relative error. With $\kappa^2\approx 10^{14}$, we may get as few as 2 accurate digits.
N = A' * A
x_NE = N \ (A'*b)
@show observed_err = norm(x_NE - x) / norm(x);
@show digits = -log10(observed_err);
observed_err = norm(x_NE - x) / norm(x) = 0.021745909192780664
digits = -(log10(observed_err)) = 1.6626224298403076
The same experiment in MATLAB: because the functions $\sin^2(t)$, $\cos^2(t)$, and $1$ are linearly dependent, we should find that the following matrix is somewhat ill-conditioned.
t = linspace(0, 3, 400)';
A = [ sin(t).^2, cos((1+1e-7)*t).^2, t.^0 ];
kappa = cond(A)

Now we set up an artificial linear least-squares problem with a known exact solution that actually makes the residual zero.
x = [1; 2; 1];
b = A * x;
Using backslash to find the least-squares solution, we get a relative error that is well below $\kappa$ times machine epsilon.
x_BS = A \ b;
observed_err = norm(x_BS - x) / norm(x)
max_err = kappa * eps

If we formulate and solve via the normal equations, we get a much larger relative error. With $\kappa^2\approx 10^{14}$, we may get as few as 2 accurate digits.
N = A'*A;
x_NE = N\(A'*b);
observed_err = norm(x_NE - x) / norm(x)
digits = -log10(observed_err)

The same experiment in Python: because the functions $\sin^2(t)$, $\cos^2(t)$, and $1$ are linearly dependent, we should find that the following matrix is somewhat ill-conditioned.
from numpy.linalg import cond
t = linspace(0, 3, 400)
A = array([ [sin(t)**2, cos((1+1e-7)*t)**2, 1] for t in t ])
kappa = cond(A)
print(f"cond(A) is {kappa:.3e}")Now we set up an artificial linear least-squares problem with a known exact solution that actually makes the residual zero.
x = array([1, 2, 1])
b = A @ x
Using lstsq to find the least-squares solution, we get a relative error that is well below $\kappa$ times machine epsilon.
from scipy.linalg import lstsq
x_BS = lstsq(A, b)[0]
print(f"observed error: {norm(x_BS - x) / norm(x):.3e}")
print(f"conditioning bound: {kappa * finfo(float).eps:.3e}")observed error: 2.019e-10
conditioning bound: 4.053e-09
If we formulate and solve via the normal equations, we get a much larger relative error. With $\kappa^2\approx 10^{14}$, we may get as few as 2 accurate digits.
from scipy import linalg
N = A.T @ A
x_NE = linalg.solve(N, A.T @ b)
relative_err = norm(x_NE - x) / norm(x)
print(f"observed error: {relative_err:.3e}")
print(f"accurate digits: {-log10(relative_err):.2f}")observed error: 5.991e-02
accurate digits: 1.22
3.2.4 Exercises

(a) ✍ Show that for any $m\times n$ matrix $\mathbf{A}$ with $m>n$ for which $\mathbf{A}^T\mathbf{A}$ is nonsingular, $\mathbf{A}^+\mathbf{A}$ equals the $n\times n$ identity matrix.
(b) ⌨ Show using a computer example that $\mathbf{A}\mathbf{A}^+$ has rank $n$ and therefore is not equal to the $m\times m$ identity matrix.
⌨ Let $t_1,\ldots,t_m$ be $m$ equally spaced points in $[0,2\pi]$; use $m=500$ in this exercise.
(a) Let $\mathbf{A}_\epsilon$ be the matrix in (3.1.2) corresponding to fitting the data with the function $c_1 + c_2 \sin(t) + c_3 \cos\bigl((1+\epsilon)t\bigr)$. Using (3.2.7), compute the condition number of $\mathbf{A}_\epsilon$ for $\epsilon = 10^0, 10^{-1}, \ldots, 10^{-8}$.
(b) Repeat part (a) using the fitting function $c_1 + c_2 \sin^2(t) + c_3 \cos^2\bigl( (1+\epsilon) t\bigr)$.
(c) Why does it make sense that $\kappa\bigl(\mathbf{A}_\epsilon\bigr)\to \infty$ as $\epsilon\to 0$ in part (b) but not in part (a)?
✍ ⌨ When $\mathbf{A}$ is $m\times n$ with rank less than $n$, the pseudoinverse is still defined and can be computed using pinv (which is in LinearAlgebra for Julia and numpy.linalg for Python). However, the behavior in this case is not always intuitive. Let

$$\mathbf{A}_s = \begin{bmatrix} 1 & 1 \\ 0 & 0 \\ 0 & s \end{bmatrix}.$$

Then $\mathbf{A}_0$ has rank 1. Demonstrate computationally that $\mathbf{A}_0^+\neq \lim_{s\to 0} \mathbf{A}_s^+$.