9 Linear least squares
Finally, we consider a slightly different kind of problem, whose solution involves a different kind of factorization.

The question we are interested in is how we can “solve”
\[ Ax = b, \]
where $A \in \mathbb{R}^{m \times n}$, with $m > n$, and $b \in \mathbb{R}^m$ and $x \in \mathbb{R}^n$.
Of course, this is an over-determined system. In general, this has no solution. So we want to find the “best” solution to this system of equations. More precisely, we want to find an $x^* \in \mathbb{R}^n$ that minimizes $\|Ax - b\|^2$. Note that we use the Euclidean norm. This is the least squares problem.
Why would one be interested in this? Often in, say, statistics, we have some linear model that says the outcome $B$ is related to the input variables $A_1, \cdots, A_n$ by
\[ B = A_1 X_1 + A_2 X_2 + \cdots + A_n X_n + \varepsilon, \]
where the $X_i$ are some unknown parameters, and $\varepsilon$ is some random fluctuation. The idea is to do lots of experiments, say $m$ of them, collect the data in $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$, and then find the $X_i$'s that predict these results the most accurately, i.e. find the $x$ such that $Ax$ is as close to $b$ as possible.
So how can we find this $x^*$? First of all, we find a necessary and sufficient condition for $x^*$ to be a solution to this.
Theorem. A vector $x^* \in \mathbb{R}^n$ minimizes $\|Ax - b\|^2$ if and only if
\[ A^T(Ax^* - b) = 0. \]
Proof. A solution, by definition, minimizes
\[ f(x) = \langle Ax - b, Ax - b \rangle = x^T A^T A x - 2x^T A^T b + b^T b. \]
Then as a function of $x$, the partial derivatives of this must vanish. We have
\[ \nabla f(x) = 2A^T(Ax - b). \]
So a necessary condition is $A^T(Ax - b) = 0$.
Now suppose our $x^*$ satisfies $A^T(Ax^* - b) = 0$. Then for all $x \in \mathbb{R}^n$, we write $x = x^* + y$, and then we have
\begin{align*}
\|Ax - b\|^2 &= \|A(x^* + y) - b\|^2 \\
&= \|Ax^* - b\|^2 + 2y^T A^T (Ax^* - b) + \|Ay\|^2 \\
&= \|Ax^* - b\|^2 + \|Ay\|^2 \\
&\geq \|Ax^* - b\|^2.
\end{align*}
So $x^*$ must minimize the Euclidean norm.
Corollary. If $A \in \mathbb{R}^{m \times n}$ is a full-rank matrix, then there is a unique solution to the least squares problem.
Proof. We know all minimizers are solutions to
\[ (A^T A)x = A^T b. \]
The matrix $A$ being full rank means $y \neq 0 \in \mathbb{R}^n$ implies $Ay \neq 0 \in \mathbb{R}^m$. Hence $A^T A \in \mathbb{R}^{n \times n}$ is positive definite (and in particular non-singular), since
\[ x^T A^T A x = (Ax)^T (Ax) = \|Ax\|^2 > 0 \]
for $x \neq 0$. So we can invert $A^T A$ and find a unique solution $x$.
Now to find the minimizing $x^*$, we need to solve the normal equations
\[ A^T A x = A^T b. \]
If $A$ has full rank, then $A^T A$ is non-singular, and there is a unique solution. If not, then the general theory of linear equations tells us there are either infinitely many solutions or no solutions. But for this particular form of equations, it turns out we can show this is always consistent. So there will always be a solution.

However, solving these equations is not the recommended algorithm for finding $x^*$, for accuracy reasons. Instead the solution of these is based on orthogonal matrices.
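To make the normal-equations route concrete, here is a minimal NumPy sketch (my own illustration, not part of the notes): it forms $A^T A$ explicitly, which is exactly why it is discouraged, since doing so squares the condition number of $A$.

```python
import numpy as np

def least_squares_normal(A, b):
    """Solve the normal equations (A^T A) x = A^T b directly.

    Illustrative only: forming A^T A squares the condition number,
    which is why QR-based methods are preferred in inexact arithmetic.
    """
    return np.linalg.solve(A.T @ A, A.T @ b)

# Hypothetical example: fit b ~ x1 + x2 * t over m = 20 noisy observations.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20)
A = np.column_stack([np.ones_like(t), t])          # A is 20x2, full rank
b = 2.0 + 3.0 * t + 0.01 * rng.standard_normal(20)
print(least_squares_normal(A, b))                  # close to [2, 3]
print(np.linalg.lstsq(A, b, rcond=None)[0])        # NumPy's reference answer
```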
Definition (Orthogonal matrix). A square matrix $Q \in \mathbb{R}^{n \times n}$ is orthogonal if
\[ Q^T Q = I. \]
A key property of orthogonal matrices is that we have
\[ \|Qx\| = \|x\| \]
for all $x \in \mathbb{R}^n$. This is helpful since what we are trying to do in the least-squares problem involves minimizing the Euclidean norm of something.

The idea is that for any orthogonal matrix $Q$, minimizing $\|Ax - b\|$ is the same as minimizing $\|QAx - Qb\|$. So which $Q$ should we pick? Recall that a simple kind of matrix is a triangular matrix. So we might want to get something, say, upper triangular.
Definition (QR factorization). A QR factorization of an $m \times n$ matrix $A$ is a factorization of the form
\[ A = QR, \]
where $Q \in \mathbb{R}^{m \times m}$ is an orthogonal matrix, and $R \in \mathbb{R}^{m \times n}$ is an “upper triangular” matrix, where “upper triangular” means
\[ R = \begin{pmatrix}
R_{11} & R_{12} & \cdots & R_{1n} \\
0 & R_{22} & \cdots & R_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & R_{nn} \\
0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 0
\end{pmatrix}. \]
Note that every $A \in \mathbb{R}^{m \times n}$ has a QR factorization, as we will soon show, but this is not unique (e.g. we can multiply both $Q$ and $R$ by $-1$).
Once we have the QR factorization, we can multiply $\|Ax - b\|$ by $Q^T = Q^{-1}$, and get an equivalent problem of minimizing $\|Rx - Q^T b\|$. We will not go into details, but it should be clear that this is not too hard to solve.
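To spell out one way this can go (a sketch under the assumption that $A$ has full rank; the helper names are mine): since only the first $n$ rows of $R$ are nonzero, the first $n$ components of $Q^T b$ can be matched exactly by back substitution, and the remaining components form the unavoidable residual.

```python
import numpy as np

def back_substitute(R, c):
    """Solve Rx = c for an upper triangular, non-singular n x n matrix R."""
    n = len(c)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (c[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

def least_squares_via_qr(A, b):
    """Minimize ||Ax - b||: orthogonality gives ||Ax - b|| = ||Rx - Q^T b||."""
    m, n = A.shape
    Q, R = np.linalg.qr(A, mode="complete")  # full QR, Q is m x m
    c = Q.T @ b
    x = back_substitute(R[:n, :n], c[:n])    # match the first n components
    return x, np.linalg.norm(c[n:])          # solution, minimal residual norm
```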
There is an alternative “skinny” QR factorization, with $A = \tilde{Q}\tilde{R}$, where we remove redundant rows and columns such that $\tilde{Q} \in \mathbb{R}^{m \times n}$ and $\tilde{R} \in \mathbb{R}^{n \times n}$. This is good enough to solve the least squares problem.

We will see that if $A$ has full rank, then the skinny QR is unique up to a sign, i.e. it is unique if we require $\tilde{R}_{kk} > 0$ for $k = 1, \cdots, n$.
We shall demonstrate three standard algorithms for QR factorization:
(i) Gram-Schmidt factorization
(ii) Givens rotations
(iii) Householder reflections
The latter two are based on some geometric ideas, while the first is based on the
familiar Gram-Schmidt process.
Gram-Schmidt factorization
This targets the skinny version. So we stop writing the tildes above, and just write $Q \in \mathbb{R}^{m \times n}$ and $R \in \mathbb{R}^{n \times n}$.
As we all know, the Gram-Schmidt process orthogonalizes vectors. What we are going to orthogonalize are the columns of $A$. We write
\[ A = \begin{pmatrix} a_1 & \cdots & a_n \end{pmatrix}, \quad Q = \begin{pmatrix} q_1 & \cdots & q_n \end{pmatrix}. \]
By definition of the QR factorization, we need
\[ a_j = \sum_{i=1}^j R_{ij} q_i. \]
This is just done in the usual Gram-Schmidt way.
(i) To construct column 1, if $a_1 \neq 0$, then we set
\[ q_1 = \frac{a_1}{\|a_1\|}, \quad R_{11} = \|a_1\|. \]
Note that the only non-unique possibility here is the sign — we can let $R_{11} = -\|a_1\|$ and $q_1 = -\frac{a_1}{\|a_1\|}$ instead. But if we require $R_{11} > 0$, then this is fixed.

In the degenerate case $a_1 = 0$, we can just set $R_{11} = 0$, and pick any $q_1 \in \mathbb{R}^m$ with $\|q_1\| = 1$.
(ii) For columns $1 < k \leq n$, for $i = 1, \cdots, k - 1$, we set
\[ R_{ik} = \langle q_i, a_k \rangle, \]
and set
\[ d_k = a_k - \sum_{i=1}^{k-1} q_i \langle q_i, a_k \rangle. \]
If $d_k \neq 0$, then we set
\[ q_k = \frac{d_k}{\|d_k\|}, \]
and $R_{kk} = \|d_k\|$.

In the case where $d_k = 0$, we again set $R_{kk} = 0$, and pick $q_k$ to be any unit vector orthogonal to $q_1, \cdots, q_{k-1}$.
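Translated directly into NumPy, these two steps might look as follows (a sketch assuming the non-degenerate case $d_k \neq 0$ throughout, i.e. full column rank):

```python
import numpy as np

def gram_schmidt_qr(A):
    """Skinny QR by classical Gram-Schmidt; assumes full column rank."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        R[:k, k] = Q[:, :k].T @ A[:, k]    # R_ik = <q_i, a_k> for i < k
        d = A[:, k] - Q[:, :k] @ R[:k, k]  # remove the components along q_i
        R[k, k] = np.linalg.norm(d)        # R_kk = ||d_k||
        Q[:, k] = d / R[k, k]              # q_k = d_k / ||d_k||
    return Q, R
```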
In practice, a slightly different algorithm (the modified Gram-Schmidt process) is used, which is (much) superior with inexact arithmetic. The modified Gram-Schmidt process is in fact the same algorithm, but performed in a different order in order to minimize errors.

However, this is often not an ideal algorithm for large matrices, since there are many divisions and normalizations involved in computing the $q_i$, and the accumulation of errors will cause the resulting matrix $Q$ to lose orthogonality.
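For contrast, here is a sketch of the modified version: the arithmetic is the same, but each projection is subtracted as soon as it is computed, so every inner product is taken against the partially orthogonalized vector, which loses far less orthogonality in floating point.

```python
import numpy as np

def modified_gram_schmidt_qr(A):
    """Skinny QR by modified Gram-Schmidt; assumes full column rank."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        d = A[:, k].astype(float)
        for i in range(k):
            R[i, k] = Q[:, i] @ d      # inner product with the updated d
            d = d - R[i, k] * Q[:, i]  # subtract the projection immediately
        R[k, k] = np.linalg.norm(d)
        Q[:, k] = d / R[k, k]
    return Q, R
```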
Givens rotations

This works with the full QR factorization.

Recall that in $\mathbb{R}^2$, a clockwise rotation of $\theta \in [-\pi, \pi]$ is performed by
\[ \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \begin{pmatrix} \alpha\cos\theta + \beta\sin\theta \\ -\alpha\sin\theta + \beta\cos\theta \end{pmatrix}. \]
By choosing $\theta$ such that
\[ \cos\theta = \frac{\alpha}{\sqrt{\alpha^2 + \beta^2}}, \quad \sin\theta = \frac{\beta}{\sqrt{\alpha^2 + \beta^2}}, \]
this then becomes
\[ \begin{pmatrix} \sqrt{\alpha^2 + \beta^2} \\ 0 \end{pmatrix}. \]
Of course, by choosing a slightly different $\theta$, we can make the result zero in the first component and $\sqrt{\alpha^2 + \beta^2}$ in the second.
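In code one never computes $\theta$ itself, only $\cos\theta$ and $\sin\theta$; a small sketch (using `math.hypot` to evaluate $\sqrt{\alpha^2 + \beta^2}$ without overflow):

```python
import math

def givens_cs(alpha, beta):
    """Return (c, s, r) with c = cos(theta), s = sin(theta), so that the
    rotation maps (alpha, beta) to (r, 0), where r = sqrt(alpha^2 + beta^2)."""
    r = math.hypot(alpha, beta)
    if r == 0.0:
        return 1.0, 0.0, 0.0  # nothing to rotate; use the identity
    return alpha / r, beta / r, r
```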
Definition (Givens rotation). In $\mathbb{R}^m$, where $m > 2$, we define the Givens rotation on 3 parameters $1 \leq p < q \leq m$, $\theta \in [-\pi, \pi]$ by
\[ \Omega^{[p,q]}_\theta = \begin{pmatrix}
1 \\
& \ddots \\
& & \cos\theta & & \sin\theta \\
& & & \ddots \\
& & -\sin\theta & & \cos\theta \\
& & & & & \ddots \\
& & & & & & 1
\end{pmatrix} \in \mathbb{R}^{m \times m}, \]
where the $\sin\theta$ and $\cos\theta$ entries appear in the $p$th and $q$th rows and columns.
Note that for $y \in \mathbb{R}^m$, the effect of $\Omega^{[p,q]}_\theta$ is to alter only the $p$ and $q$ components of $y$. In general, for $B \in \mathbb{R}^{m \times n}$, $\Omega^{[p,q]}_\theta B$ only alters the $p$ and $q$ rows of $B$. Moreover, just like the $\mathbb{R}^2$ case, given a particular $z \in \mathbb{R}^m$, we can choose $\theta$ such that the $q$th component $(\Omega^{[p,q]}_\theta z)_q = 0$.
Hence, $A \in \mathbb{R}^{m \times n}$ can be transformed into “upper triangular” form by applying $s = mn - \frac{n(n+1)}{2}$ Givens rotations, since we need to introduce $s$ many zeros. Then
\[ Q_s \cdots Q_1 A = R. \]
How exactly do we do this? Instead of writing down a general algorithm, we illustrate it with a matrix $A \in \mathbb{R}^{4 \times 3}$. The resultant $R$ would look like
\[ R = \begin{pmatrix}
R_{11} & R_{12} & R_{13} \\
0 & R_{22} & R_{23} \\
0 & 0 & R_{33} \\
0 & 0 & 0
\end{pmatrix}. \]
We will apply the Givens rotations in the following order:
\[ \Omega^{[3,4]}_{\theta_6} \Omega^{[2,4]}_{\theta_5} \Omega^{[2,3]}_{\theta_4} \Omega^{[1,4]}_{\theta_3} \Omega^{[1,3]}_{\theta_2} \Omega^{[1,2]}_{\theta_1} A = R.
\]
The matrix $A$ transforms as follows:
\[ \begin{pmatrix}
\times & \times & \times \\ \times & \times & \times \\ \times & \times & \times \\ \times & \times & \times
\end{pmatrix}
\xrightarrow{\Omega^{[1,2]}_{\theta_1}}
\begin{pmatrix}
\times & \times & \times \\ 0 & \times & \times \\ \times & \times & \times \\ \times & \times & \times
\end{pmatrix}
\xrightarrow{\Omega^{[1,3]}_{\theta_2}}
\begin{pmatrix}
\times & \times & \times \\ 0 & \times & \times \\ 0 & \times & \times \\ \times & \times & \times
\end{pmatrix}
\xrightarrow{\Omega^{[1,4]}_{\theta_3}}
\begin{pmatrix}
\times & \times & \times \\ 0 & \times & \times \\ 0 & \times & \times \\ 0 & \times & \times
\end{pmatrix} \]
\[ \xrightarrow{\Omega^{[2,3]}_{\theta_4}}
\begin{pmatrix}
\times & \times & \times \\ 0 & \times & \times \\ 0 & 0 & \times \\ 0 & \times & \times
\end{pmatrix}
\xrightarrow{\Omega^{[2,4]}_{\theta_5}}
\begin{pmatrix}
\times & \times & \times \\ 0 & \times & \times \\ 0 & 0 & \times \\ 0 & 0 & \times
\end{pmatrix}
\xrightarrow{\Omega^{[3,4]}_{\theta_6}}
\begin{pmatrix}
\times & \times & \times \\ 0 & \times & \times \\ 0 & 0 & \times \\ 0 & 0 & 0
\end{pmatrix} \]
Note that when applying, say, $\Omega^{[2,3]}_{\theta_4}$, the zeros of the first column get preserved, since $\Omega^{[2,3]}_{\theta_4}$ only mixes together things in rows 2 and 3, both of which are zero in the first column. So we are safe.
Note that this gives us something of the form
\[ Q_s \cdots Q_1 A = R. \]
We can obtain a QR factorization by inverting to get
\[ A = Q_1^T \cdots Q_s^T R. \]
However, we don't really need to do this if we just want to use this to solve the least squares problem, since to do so, we need to multiply by $Q^T$, not $Q$, which is exactly $Q_s \cdots Q_1$.
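Putting the pieces together, here is a sketch of the whole procedure (my own arrangement, not pseudocode from the notes): sweep through the subdiagonal entries column by column as in the $4 \times 3$ example, apply each rotation only to the two affected rows, accumulate the rotations' effect on $b$ as we go, and finish with back substitution.

```python
import math
import numpy as np

def givens_qr_solve(A, b):
    """Least squares by Givens rotations; assumes A has full column rank."""
    R = A.astype(float).copy()
    c = b.astype(float).copy()
    m, n = R.shape
    for p in range(n):                 # column by column, as in the example
        for q in range(p + 1, m):      # zero out the (q, p) entry
            r = math.hypot(R[p, p], R[q, p])
            if r == 0.0:
                continue
            co, si = R[p, p] / r, R[q, p] / r
            G = np.array([[co, si], [-si, co]])
            R[[p, q], p:] = G @ R[[p, q], p:]  # only rows p and q change
            c[[p, q]] = G @ c[[p, q]]          # same rotation applied to b
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):     # back substitution on the top block
        x[i] = (c[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x
```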
Householder reflections

Definition (Householder reflection). For $u \neq 0 \in \mathbb{R}^m$, we define the Householder reflection by
\[ H_u = I - 2\frac{uu^T}{u^T u} \in \mathbb{R}^{m \times m}. \]
Note that this is symmetric, and we can see that $H_u^2 = I$. So this is indeed orthogonal.
To show this is a reflection, suppose we resolve $x$ into its parallel and perpendicular parts as $x = \alpha u + w \in \mathbb{R}^m$, where
\[ \alpha = \frac{u^T x}{u^T u}, \quad u^T w = 0. \]
Then we have
\[ H_u x = -\alpha u + w. \]
So this is a reflection in the $(m - 1)$-dimensional hyperplane $u^T y = 0$.
What is the cost of computing $H_u z$? This is evaluated as
\[ H_u z = z - 2\frac{u^T z}{u^T u} u. \]
This only requires $O(m)$ operations, which is nice.
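In code, this means one never forms the $m \times m$ matrix $H_u$ explicitly; a one-line sketch:

```python
import numpy as np

def apply_householder(u, z):
    """Compute H_u z = z - 2 (u^T z / u^T u) u in O(m) operations,
    without ever forming the m x m matrix H_u."""
    return z - 2.0 * (u @ z) / (u @ u) * u
```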
Proposition. A matrix $A \in \mathbb{R}^{m \times n}$ can be transformed into upper-triangular form by applying $n$ Householder reflections, namely
\[ H_n \cdots H_1 A = R, \]
where each $H_k$ introduces zeros into column $k$ and leaves the other zeroes alone.

In general, this needs far fewer multiplications than Givens rotations.
To show this, we first prove some lemmas.

Lemma. Let $a, b \in \mathbb{R}^m$, with $a \neq b$, but $\|a\| = \|b\|$. Then if we pick $u = a - b$, then
\[ H_u a = b. \]
This is obvious if we draw some pictures in low dimensions.
Proof. We just do it:
\[ H_u a = a - \frac{2(\|a\|^2 - a^T b)}{\|a\|^2 - 2a^T b + \|b\|^2}(a - b) = a - (a - b) = b, \]
where we used the fact that $\|a\| = \|b\|$.
We will keep applying the lemma. Since we want to get an upper triangular matrix, it makes sense to pick $b$ to have many zeroes, and the best option would be to pick $b$ to be a multiple of a standard basis vector.
So we begin our algorithm: we want to clear the first column of $A$. We let $a$ be the first column of $A$, and assume $a \in \mathbb{R}^m$ is not already in the correct form, i.e. $a$ is not a multiple of $e_1$. Then we define
\[ u = a \mp \|a\| e_1, \]
where either choice of the sign is pure-mathematically valid. However, we will later see that there is one choice that is better when we have to deal with inexact arithmetic. Then we end up with
\[ H_1 a = H_u a = \pm\|a\| e_1. \]
Now we have
\[ H_1 A = \begin{pmatrix}
\times & \times & \times \\ 0 & \times & \times \\ 0 & \times & \times \\ 0 & \times & \times
\end{pmatrix}. \]
To do the next step, we need to be careful, since we don't want to destroy the previously created zeroes.
Lemma. If the first $k - 1$ components of $u$ are zero, then
(i) For every $x \in \mathbb{R}^m$, $H_u x$ does not alter the first $k - 1$ components of $x$.
(ii) If the last $m - k + 1$ components of $y \in \mathbb{R}^m$ are zero, then $H_u y = y$.
These are all obvious from the definition. All these say is that reflections don't affect components perpendicular to $u$, and in particular fix all vectors perpendicular to $u$.
Lemma. Let $a, b \in \mathbb{R}^m$, with
\[ \begin{pmatrix} a_k \\ \vdots \\ a_m \end{pmatrix} \neq \begin{pmatrix} b_k \\ \vdots \\ b_m \end{pmatrix}, \quad\text{but}\quad \sum_{j=k}^m a_j^2 = \sum_{j=k}^m b_j^2. \]
Suppose we pick
\[ u = (0, 0, \cdots, 0, a_k - b_k, \cdots, a_m - b_m)^T. \]
Then we have
\[ H_u a = (a_1, \cdots, a_{k-1}, b_k, \cdots, b_m)^T. \]
This is a generalization of what we've had before for $k = 1$. Again, the proof is just straightforward verification.
Now we can mess with the second column of $H_1 A$. We let $a$ be the second column of $H_1 A$, and assume $a_3, \cdots, a_m$ are not all zero, i.e. $(0, a_2, \cdots, a_m)^T$ is not a multiple of $e_2$. We choose
\[ u = (0, a_2 \mp \gamma, a_3, \cdots, a_m)^T, \]
where
\[ \gamma = \sqrt{\sum_{j=2}^m a_j^2}. \]
Then
\[ H_u a = (a_1, \pm\gamma, 0, \cdots, 0)^T. \]
Also, by the previous lemma, this does not affect anything in the first column and the first row. Now we have
\[ H_2 H_1 A = \begin{pmatrix}
\times & \times & \times \\ 0 & \times & \times \\ 0 & 0 & \times \\ 0 & 0 & \times
\end{pmatrix}. \]
Suppose we have reached $H_{k-1} \cdots H_1 A$, where the first $k - 1$ columns are of the correct form, i.e. all entries below the diagonal in the first $k - 1$ columns are zero.
We consider the $k$th column of $H_{k-1} \cdots H_1 A$, say $a$. We assume $(0, \cdots, 0, a_k, \cdots, a_m)^T$ is not a multiple of $e_k$. Choosing
\[ u = (0, \cdots, 0, a_k \mp \gamma, a_{k+1}, \cdots, a_m)^T, \quad \gamma = \sqrt{\sum_{j=k}^m a_j^2}, \]
we find that
\[ H_u a = (a_1, \cdots, a_{k-1}, \pm\gamma, 0, \cdots, 0)^T. \]
Now the first $k$ columns of $H_k \cdots H_1 A$ are of the correct form, and $H_k$ did not alter the first $k - 1$ rows and columns of $H_{k-1} \cdots H_1 A$.
There is one thing we have to decide on — which sign to pick. As mentioned, these do not matter in pure mathematics, but with inexact arithmetic, we should pick the sign in $a_k \mp \gamma$ such that $a_k \mp \gamma$ has maximum magnitude, i.e. $a_k \mp \gamma = a_k + \operatorname{sgn}(a_k)\gamma$. It takes some analysis to justify why this is the right choice, but it is not too surprising that there is some choice that is better.
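Assembling the steps above with this sign choice gives the following sketch (again my own illustration, assuming the non-degenerate case at each step; the back substitution at the end solves the least squares problem as before):

```python
import numpy as np

def householder_qr_solve(A, b):
    """Least squares via Householder reflections; assumes full column rank."""
    R = A.astype(float).copy()
    c = b.astype(float).copy()
    m, n = R.shape
    for k in range(n):
        gamma = np.linalg.norm(R[k:, k])
        if gamma == 0.0:
            continue                             # column is already clear
        u = R[k:, k].copy()
        u[0] += gamma if u[0] >= 0 else -gamma   # a_k + sgn(a_k) * gamma
        u2 = u @ u
        # Apply H_u to the remaining block of R and to c, in O(m) per column.
        R[k:, k:] -= np.outer(2.0 * u / u2, u @ R[k:, k:])
        c[k:] -= 2.0 * (u @ c[k:]) / u2 * u
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):               # back substitution
        x[i] = (c[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x
```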
So how do Householder and Givens compare? We notice that the Givens method generates zeros one entry at a time, while Householder does it column by column. So in general, the Householder method is superior.
However, for certain matrices with special structures, we might need the extra delicacy of introducing a zero one at a time. For example, if $A$ is a band matrix, then it might be beneficial to just remove the few non-zero entries one at a time.