IB Numerical Analysis

9 Linear least squares

Finally, we consider a slightly different kind of problem, whose solution involves

a different kind of factorization.

The question we are interested in is how we can "solve"
$$Ax = b,$$
where $A \in \mathbb{R}^{m \times n}$ with $m > n$, and $b \in \mathbb{R}^m$ and $x \in \mathbb{R}^n$.

Of course, this is an over-determined system. In general, this has no solution.

So we want to find the “best” solution to this system of equations. More precisely,

we want to find an $x^* \in \mathbb{R}^n$ that minimizes $\|Ax - b\|^2$. Note that we use the Euclidean norm. This is the least squares problem.

Why would one be interested in this? Often in, say, statistics, we have some linear model that says the outcome $B$ is related to the input variables $A_1, \cdots, A_n$ by
$$B = A_1 X_1 + A_2 X_2 + \cdots + A_n X_n + \varepsilon,$$
where the $X_i$ are some unknown parameters, and $\varepsilon$ is some random fluctuation.

The idea is to do lots of experiments, say $m$ of them, collect the data in $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$, and then find the $X_i$'s that predict these results the most accurately, i.e. find the $x$ such that $Ax$ is as close to $b$ as possible.

So how can we find this $x^*$? First of all, we find a necessary and sufficient condition for $x^*$ to be a solution to this.

Theorem. A vector $x^* \in \mathbb{R}^n$ minimizes $\|Ax - b\|^2$ if and only if
$$A^T(Ax^* - b) = 0.$$

Proof. A solution, by definition, minimizes
$$f(x) = \langle Ax - b, Ax - b\rangle = x^T A^T Ax - 2x^T A^T b + b^T b.$$
Then as a function of $x$, the partial derivatives of this must vanish. We have
$$\nabla f(x) = 2A^T(Ax - b).$$
So a necessary condition is $A^T(Ax - b) = 0$.

Now suppose our $x^*$ satisfies $A^T(Ax^* - b) = 0$. Then for all $x \in \mathbb{R}^n$, we write $x = x^* + y$, and then we have
$$\begin{aligned}
\|Ax - b\|^2 &= \|A(x^* + y) - b\|^2\\
&= \|Ax^* - b\|^2 + 2y^T A^T(Ax^* - b) + \|Ay\|^2\\
&= \|Ax^* - b\|^2 + \|Ay\|^2\\
&\geq \|Ax^* - b\|^2.
\end{aligned}$$
So $x^*$ must minimize the Euclidean norm.

Corollary. If $A \in \mathbb{R}^{m \times n}$ is a full-rank matrix, then there is a unique solution to the least squares problem.

Proof. We know all minimizers are solutions to
$$(A^T A)x = A^T b.$$
The matrix $A$ being full rank means $y \neq 0 \in \mathbb{R}^n$ implies $Ay \neq 0 \in \mathbb{R}^m$. Hence $A^T A \in \mathbb{R}^{n \times n}$ is positive definite (and in particular non-singular), since
$$x^T A^T Ax = (Ax)^T(Ax) = \|Ax\|^2 > 0$$
for $x \neq 0$. So we can invert $A^T A$ and find a unique solution $x$.

Now to find the minimizing $x^*$, we need to solve the normal equations
$$A^T Ax = A^T b.$$
If $A$ has full rank, then $A^T A$ is non-singular, and there is a unique solution. If not, then the general theory of linear equations tells us there are either infinitely many solutions or no solutions. But for this particular form of equations, it turns out we can show this is always consistent. So there will always be a solution.
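As a concrete sanity check on the normal equations, here is a minimal NumPy sketch (the data is made up for illustration) that solves $(A^T A)x = A^T b$ directly and compares the answer with a library least-squares routine:

```python
import numpy as np

# Hypothetical data: fit y = x1 + x2*t through m = 5 noisy samples,
# so A has n = 2 columns [1, t] and the system Ax = b is over-determined.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
A = np.column_stack([np.ones_like(t), t])
b = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Solve the normal equations (A^T A) x = A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Compare with a library least-squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(x_normal, x_lstsq)
# The theorem above: the residual Ax* - b is orthogonal to the columns of A.
assert np.allclose(A.T @ (A @ x_normal - b), 0)
```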

However, solving these equations is not the recommended algorithm for finding $x^*$, for accuracy reasons. Instead, the recommended methods are based on orthogonal matrices.

Definition (Orthogonal matrix). A square matrix $Q \in \mathbb{R}^{n \times n}$ is orthogonal if $Q^T Q = I$.

A key property of orthogonal matrices is that we have
$$\|Qx\| = \|x\|$$
for all $x \in \mathbb{R}^n$. This is helpful since what we are trying to do in the least-squares problem involves minimizing the Euclidean norm of something.

The idea is that for any orthogonal matrix $Q$, minimizing $\|Ax - b\|$ is the same as minimizing $\|QAx - Qb\|$. So which $Q$ should we pick? Recall that a simple kind of matrix is a triangular matrix. So we might want to get something, say, upper triangular.

Definition (QR factorization). A QR factorization of an $m \times n$ matrix $A$ is a factorization of the form
$$A = QR,$$
where $Q \in \mathbb{R}^{m \times m}$ is an orthogonal matrix, and $R \in \mathbb{R}^{m \times n}$ is an "upper triangular" matrix, where "upper triangular" means
$$R = \begin{pmatrix}
R_{11} & R_{12} & \cdots & R_{1n}\\
0 & R_{22} & \cdots & R_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & R_{nn}\\
0 & 0 & \cdots & 0\\
\vdots & \vdots & & \vdots\\
0 & 0 & \cdots & 0
\end{pmatrix}.$$

Note that every $A \in \mathbb{R}^{m \times n}$ has a QR factorization, as we will soon show, but this is not unique (e.g. we can multiply both $Q$ and $R$ by $-1$).

Once we have the QR factorization, we can multiply $Ax - b$ by $Q^T = Q^{-1}$ inside the norm, and get the equivalent problem of minimizing $\|Rx - Q^T b\|$. We will not go into details, but it should be clear that this is not too hard to solve.
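To make the reduction concrete, here is a short NumPy sketch (with made-up data): after a full QR factorization, only the first $n$ rows of $Rx - Q^T b$ depend on $x$, so the problem collapses to a small triangular solve.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([0.9, 2.1, 2.9, 4.2])

# Full QR: Q is 4x4 orthogonal, R is 4x2 "upper triangular".
Q, R = np.linalg.qr(A, mode='complete')
c = Q.T @ b

# ||Rx - c||^2 = ||R[:2] x - c[:2]||^2 + ||c[2:]||^2, and only the first
# term depends on x, so we solve the 2x2 triangular system R[:2] x = c[:2].
x = np.linalg.solve(R[:2], c[:2])

# x satisfies the normal equations, so it is the least squares solution.
assert np.allclose(A.T @ (A @ x - b), 0)
```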

There is an alternative "skinny" QR factorization, with $A = \tilde{Q}\tilde{R}$, where we remove the redundant columns of $Q$ and rows of $R$ such that $\tilde{Q} \in \mathbb{R}^{m \times n}$ and $\tilde{R} \in \mathbb{R}^{n \times n}$. This is good enough to solve the least squares problem.

We will see that if $A$ has full rank, then the skinny QR is unique up to a sign, i.e. it becomes unique if we require $\tilde{R}_{kk} > 0$ for $k = 1, \cdots, n$.

We shall demonstrate three standard algorithms for QR factorization:

(i) Gram-Schmidt factorization

(ii) Givens rotations

(iii) Householder reflections

The latter two are based on some geometric ideas, while the first is based on the

familiar Gram-Schmidt process.

Gram-Schmidt factorization

This targets the skinny version. So we stop writing the tildes above, and just write $Q \in \mathbb{R}^{m \times n}$ and $R \in \mathbb{R}^{n \times n}$.

As we all know, the Gram-Schmidt process orthogonalizes vectors. What we are going to orthogonalize are the columns of $A$. We write
$$A = \begin{pmatrix} a_1 & \cdots & a_n \end{pmatrix}, \quad Q = \begin{pmatrix} q_1 & \cdots & q_n \end{pmatrix}.$$

By definition of the QR factorization, we need
$$a_j = \sum_{i=1}^j R_{ij} q_i.$$
This is just done in the usual Gram-Schmidt way.

(i) To construct column 1, if $a_1 \neq 0$, then we set
$$q_1 = \frac{a_1}{\|a_1\|}, \quad R_{11} = \|a_1\|.$$
Note that the only non-unique possibility here is the sign: we can let $R_{11} = -\|a_1\|$ and $q_1 = -\frac{a_1}{\|a_1\|}$ instead. But if we require $R_{11} > 0$, then this is fixed.
In the degenerate case $a_1 = 0$, we can just set $R_{11} = 0$, and then pick any $q_1 \in \mathbb{R}^m$ with $\|q_1\| = 1$.

(ii) For columns $1 < k \leq n$, we set, for $i = 1, \cdots, k - 1$,
$$R_{ik} = \langle q_i, a_k\rangle,$$
and set
$$d_k = a_k - \sum_{i=1}^{k-1} q_i \langle q_i, a_k\rangle.$$
If $d_k \neq 0$, then we set
$$q_k = \frac{d_k}{\|d_k\|},$$
and
$$R_{kk} = \|d_k\|.$$
In the case where $d_k = 0$, we again set $R_{kk} = 0$, and pick $q_k$ to be any unit vector orthogonal to $q_1, \cdots, q_{k-1}$.

In practice, a slightly different algorithm (the modified Gram-Schmidt process) is used, which is (much) superior with inexact arithmetic. The modified Gram-Schmidt process is in fact the same algorithm, but performed in a different order in order to minimize errors.
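A possible implementation of the modified process is sketched below (using NumPy; `mgs_qr` is our own name, not a library routine). Instead of computing all the inner products against the original column $a_k$, each projection is subtracted from a working copy as soon as it is computed:

```python
import numpy as np

def mgs_qr(A):
    """Skinny QR by modified Gram-Schmidt (assumes A has full rank)."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    Q = A.copy()
    R = np.zeros((n, n))
    for k in range(n):
        R[k, k] = np.linalg.norm(Q[:, k])
        Q[:, k] /= R[k, k]              # full rank, so R_kk != 0
        for j in range(k + 1, n):
            # Project against the already-updated column j, not the original.
            R[k, j] = Q[:, k] @ Q[:, j]
            Q[:, j] -= R[k, j] * Q[:, k]
    return Q, R

# Two nearly parallel columns, where orthogonality is easy to lose.
A = np.array([[1.0, 1.0],
              [1e-4, 0.0],
              [0.0, 1e-4]])
Q, R = mgs_qr(A)
assert np.allclose(Q @ R, A)            # it is a factorization of A
assert np.allclose(Q.T @ Q, np.eye(2))  # and Q has orthonormal columns
```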

However, this is often not an ideal algorithm for large matrices, since there are many divisions and normalizations involved in computing the $q_i$, and the accumulation of errors will cause the resulting matrix $Q$ to lose orthogonality.

Givens rotations

This works with the full QR factorization.

Recall that in $\mathbb{R}^2$, a clockwise rotation by $\theta \in [-\pi, \pi]$ is performed by
$$\begin{pmatrix} \cos\theta & \sin\theta\\ -\sin\theta & \cos\theta \end{pmatrix}\begin{pmatrix} \alpha\\ \beta \end{pmatrix} = \begin{pmatrix} \alpha\cos\theta + \beta\sin\theta\\ -\alpha\sin\theta + \beta\cos\theta \end{pmatrix}.$$

By choosing $\theta$ such that
$$\cos\theta = \frac{\alpha}{\sqrt{\alpha^2 + \beta^2}}, \quad \sin\theta = \frac{\beta}{\sqrt{\alpha^2 + \beta^2}},$$
this then becomes
$$\begin{pmatrix} \sqrt{\alpha^2 + \beta^2}\\ 0 \end{pmatrix}.$$
Of course, by choosing a slightly different $\theta$, we can make the result zero in the first component and $\sqrt{\alpha^2 + \beta^2}$ in the second.

Definition (Givens rotation). In $\mathbb{R}^m$, where $m > 2$, we define the Givens rotation on 3 parameters $1 \leq p < q \leq m$, $\theta \in [-\pi, \pi]$ by
$$\Omega_\theta^{[p,q]} = \begin{pmatrix}
1 & & & & & & \\
& \ddots & & & & & \\
& & \cos\theta & & \sin\theta & & \\
& & & \ddots & & & \\
& & -\sin\theta & & \cos\theta & & \\
& & & & & \ddots & \\
& & & & & & 1
\end{pmatrix} \in \mathbb{R}^{m \times m},$$
where the $\sin\theta$ and $\cos\theta$ entries appear at the $p$th and $q$th rows and columns.

Note that for $y \in \mathbb{R}^m$, applying $\Omega_\theta^{[p,q]}$ only alters the $p$ and $q$ components of $y$. In general, for $B \in \mathbb{R}^{m \times n}$, $\Omega_\theta^{[p,q]} B$ only alters the $p$ and $q$ rows of $B$.

Moreover, just like in the $\mathbb{R}^2$ case, given a particular $z \in \mathbb{R}^m$, we can choose $\theta$ such that the $q$th component $(\Omega_\theta^{[p,q]} z)_q = 0$.

Hence, $A \in \mathbb{R}^{m \times n}$ can be transformed into "upper triangular" form by applying
$$s = mn - \frac{n(n+1)}{2}$$
Givens rotations, since we need to introduce $s$ many zeros. Then
$$Q_s \cdots Q_1 A = R.$$

How exactly do we do this? Instead of writing down a general algorithm, we illustrate it with a matrix $A \in \mathbb{R}^{4 \times 3}$. The resultant $R$ would look like
$$R = \begin{pmatrix}
R_{11} & R_{12} & R_{13}\\
0 & R_{22} & R_{23}\\
0 & 0 & R_{33}\\
0 & 0 & 0
\end{pmatrix}.$$

We will apply the Givens rotations in the following order:
$$\Omega_{\theta_6}^{[3,4]} \Omega_{\theta_5}^{[2,4]} \Omega_{\theta_4}^{[2,3]} \Omega_{\theta_3}^{[1,4]} \Omega_{\theta_2}^{[1,3]} \Omega_{\theta_1}^{[1,2]} A = R.$$

The matrix $A$ transforms as follows:
$$\begin{pmatrix}
\times & \times & \times\\
\times & \times & \times\\
\times & \times & \times\\
\times & \times & \times
\end{pmatrix}
\xrightarrow{\Omega_{\theta_1}^{[1,2]}}
\begin{pmatrix}
\times & \times & \times\\
0 & \times & \times\\
\times & \times & \times\\
\times & \times & \times
\end{pmatrix}
\xrightarrow{\Omega_{\theta_2}^{[1,3]}}
\begin{pmatrix}
\times & \times & \times\\
0 & \times & \times\\
0 & \times & \times\\
\times & \times & \times
\end{pmatrix}
\xrightarrow{\Omega_{\theta_3}^{[1,4]}}
\begin{pmatrix}
\times & \times & \times\\
0 & \times & \times\\
0 & \times & \times\\
0 & \times & \times
\end{pmatrix}
\xrightarrow{\Omega_{\theta_4}^{[2,3]}}
\begin{pmatrix}
\times & \times & \times\\
0 & \times & \times\\
0 & 0 & \times\\
0 & \times & \times
\end{pmatrix}
\xrightarrow{\Omega_{\theta_5}^{[2,4]}}
\begin{pmatrix}
\times & \times & \times\\
0 & \times & \times\\
0 & 0 & \times\\
0 & 0 & \times
\end{pmatrix}
\xrightarrow{\Omega_{\theta_6}^{[3,4]}}
\begin{pmatrix}
\times & \times & \times\\
0 & \times & \times\\
0 & 0 & \times\\
0 & 0 & 0
\end{pmatrix}.$$

Note that when applying, say, $\Omega_{\theta_4}^{[2,3]}$, the zeros of the first column get preserved, since $\Omega_{\theta_4}^{[2,3]}$ only mixes together things in rows 2 and 3, both of which are zero in the first column. So we are safe.

Note that this gives us something of the form
$$Q_s \cdots Q_1 A = R.$$

We can obtain a QR factorization by inverting to get
$$A = Q_1^T \cdots Q_s^T R.$$

However, we don't really need to do this if we just want to use this to solve the least squares problem, since to do so, we need to multiply by $Q^T$, not $Q$, and $Q^T$ is exactly $Q_s \cdots Q_1$.
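The whole procedure can be sketched as follows (NumPy; `givens_qr` is our own name, not a library routine). We accumulate $Q^T = Q_s \cdots Q_1$ directly by applying each rotation to both $A$ and the identity, which is exactly what the least squares solver needs:

```python
import numpy as np

def givens_qr(A):
    """Reduce A to upper triangular R with Givens rotations, returning
    (Q^T, R) with Q^T A = R. Each rotation zeroes one subdiagonal entry."""
    R = np.array(A, dtype=float)
    m, n = R.shape
    Qt = np.eye(m)
    for p in range(n):                # column to clear
        for q in range(p + 1, m):     # entry (q, p) to zero
            a, b = R[p, p], R[q, p]
            r = np.hypot(a, b)
            if r == 0.0:
                continue
            c, s = a / r, b / r
            # Apply Omega^{[p,q]} to rows p and q of both R and Q^T.
            for M in (R, Qt):
                rp, rq = M[p].copy(), M[q].copy()
                M[p] = c * rp + s * rq
                M[q] = -s * rp + c * rq
    return Qt, R

A = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [2.0, 1.0],
              [0.0, 1.0]])
Qt, R = givens_qr(A)
assert np.allclose(Qt @ A, R)
assert np.allclose(np.tril(R, -1), 0)     # R is upper triangular
assert np.allclose(Qt @ Qt.T, np.eye(4))  # Q^T is orthogonal
```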

Householder reflections

Definition (Householder reflection). For $u \neq 0 \in \mathbb{R}^m$, we define the Householder reflection by
$$H_u = I - 2\frac{uu^T}{u^T u} \in \mathbb{R}^{m \times m}.$$

Note that this is symmetric, and we can see that $H_u^2 = I$. So this is indeed orthogonal.

To show this is a reflection, suppose we resolve $x$ into its parallel and perpendicular parts as $x = \alpha u + w \in \mathbb{R}^m$, where
$$\alpha = \frac{u^T x}{u^T u}, \quad u^T w = 0.$$
Then we have
$$H_u x = -\alpha u + w.$$
So this is a reflection in the $(m - 1)$-dimensional hyperplane $u^T y = 0$.

What is the cost of computing $H_u z$? This is evaluated as
$$H_u z = z - 2\frac{u^T z}{u^T u} u.$$
This only requires $O(m)$ operations, which is nice.

Proposition. A matrix $A \in \mathbb{R}^{m \times n}$ can be transformed into upper-triangular form by applying $n$ Householder reflections, namely
$$H_n \cdots H_1 A = R,$$
where each $H_k$ introduces zeros into column $k$ and leaves the other zeros alone.
In general, this requires far fewer multiplications than Givens rotations.

To show this, we first prove some lemmas.

Lemma. Let $a, b \in \mathbb{R}^m$, with $a \neq b$, but $\|a\| = \|b\|$. Then if we pick $u = a - b$, then
$$H_u a = b.$$

This is obvious if we draw some pictures in low dimensions.

Proof. We just do it:
$$H_u a = a - \frac{2(\|a\|^2 - a^T b)}{\|a\|^2 - 2a^T b + \|b\|^2}(a - b) = a - (a - b) = b,$$
where we used the fact that $\|a\| = \|b\|$.
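The lemma is also easy to check numerically; a small sketch (with made-up vectors of equal norm, and `householder` as our own helper name):

```python
import numpy as np

def householder(u):
    """H_u = I - 2 u u^T / (u^T u)."""
    u = np.asarray(u, dtype=float)
    return np.eye(len(u)) - 2.0 * np.outer(u, u) / (u @ u)

# ||a|| = ||b|| = 5, so reflecting with u = a - b should map a to b.
a = np.array([3.0, 4.0, 0.0])
b = np.array([0.0, 0.0, 5.0])
H = householder(a - b)
assert np.allclose(H @ a, b)
assert np.allclose(H @ H, np.eye(3))  # H_u^2 = I, so H_u is orthogonal
```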

We will keep applying the lemma. Since we want to get an upper triangular matrix, it makes sense to pick $b$ to have many zeros, and the best option would be to pick $b$ to be a multiple of a standard basis vector.

So we begin our algorithm: we want to clear the first column of $A$. We let $a \in \mathbb{R}^m$ be the first column of $A$, and assume it is not already in the correct form, i.e. $a$ is not a multiple of $e_1$. Then we define
$$u = a \mp \|a\| e_1,$$
where either choice of the sign is valid pure-mathematically. However, we will later see that one choice is better when we have to deal with inexact arithmetic. Then we end up with
$$H_1 a = H_u a = \pm\|a\| e_1.$$

Now we have
$$H_1 A = \begin{pmatrix}
\times & \times & \times\\
0 & \times & \times\\
0 & \times & \times\\
0 & \times & \times
\end{pmatrix}.$$

To do the next step, we need to be careful, since we don’t want to destroy the

previously created zeroes.

Lemma. If the first $k - 1$ components of $u$ are zero, then
(i) For every $x \in \mathbb{R}^m$, $H_u x$ does not alter the first $k - 1$ components of $x$.
(ii) If the last $(m - k + 1)$ components of $y \in \mathbb{R}^m$ are zero, then $H_u y = y$.

These are all obvious from the definition. All these say is that reflections don't affect components perpendicular to $u$, and in particular fix all vectors perpendicular to $u$.

Lemma. Let $a, b \in \mathbb{R}^m$, with
$$\begin{pmatrix} a_k\\ \vdots\\ a_m \end{pmatrix} \neq \begin{pmatrix} b_k\\ \vdots\\ b_m \end{pmatrix}, \quad\text{but}\quad \sum_{j=k}^m a_j^2 = \sum_{j=k}^m b_j^2.$$
Suppose we pick
$$u = (0, 0, \cdots, 0, a_k - b_k, \cdots, a_m - b_m)^T.$$
Then we have
$$H_u a = (a_1, \cdots, a_{k-1}, b_k, \cdots, b_m)^T.$$

This is a generalization of what we had before, for $k = 1$. Again, the proof is just straightforward verification.

Now we can mess with the second column of $H_1 A$. We let $a$ be the second column of $H_1 A$, and assume $a_3, \cdots, a_m$ are not all zero, i.e. $(0, a_2, \cdots, a_m)^T$ is not a multiple of $e_2$. We choose
$$u = (0, a_2 \mp \gamma, a_3, \cdots, a_m)^T,$$
where
$$\gamma = \sqrt{\sum_{j=2}^m a_j^2}.$$

Then
$$H_u a = (a_1, \pm\gamma, 0, \cdots, 0)^T.$$
Also, by the previous lemma, this does not affect anything in the first column and the first row. Now we have
$$H_2 H_1 A = \begin{pmatrix}
\times & \times & \times\\
0 & \times & \times\\
0 & 0 & \times\\
0 & 0 & \times
\end{pmatrix}.$$

Suppose we have reached $H_{k-1} \cdots H_1 A$, where the first $k - 1$ columns are in the correct form, i.e. they vanish below the diagonal.

We consider the $k$th column of $H_{k-1} \cdots H_1 A$, say $a$. We assume $(0, \cdots, 0, a_k, \cdots, a_m)^T$ is not a multiple of $e_k$. Choosing
$$u = (0, \cdots, 0, a_k \mp \gamma, a_{k+1}, \cdots, a_m)^T, \quad \gamma = \sqrt{\sum_{j=k}^m a_j^2},$$
we find that
$$H_u a = (a_1, \cdots, a_{k-1}, \pm\gamma, 0, \cdots, 0)^T.$$

Now the first $k$ columns of $H_k \cdots H_1 A$ are in the correct form, and $H_k$ did not alter the first $k - 1$ rows and columns of $H_{k-1} \cdots H_1 A$.

There is one thing we have to decide on: which sign to pick. As mentioned, the choice does not matter in pure mathematics, but with inexact arithmetic, we should pick the sign in $a_k \mp \gamma$ such that $a_k \mp \gamma$ has maximum magnitude, i.e. $a_k \mp \gamma = a_k + \operatorname{sgn}(a_k)\gamma$. It takes some analysis to justify why this is the right choice, but it is not too surprising that some choice is better than the other.
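Putting the whole procedure together, a sketch of Householder QR in NumPy might look as follows (`householder_qr` is our own name, not a library routine; the sign of $\gamma$ is chosen to avoid cancellation, as just discussed):

```python
import numpy as np

def householder_qr(A):
    """Reduce A to upper triangular R by Householder reflections,
    returning (Q^T, R) with Q^T A = R."""
    R = np.array(A, dtype=float)
    m, n = R.shape
    Qt = np.eye(m)
    for k in range(n):
        gamma = np.linalg.norm(R[k:, k])
        if gamma == 0.0:
            continue  # column already in the correct form
        u = np.zeros(m)
        u[k] = R[k, k] + np.copysign(gamma, R[k, k])  # a_k + sgn(a_k)*gamma
        u[k + 1:] = R[k + 1:, k]
        # Apply H_u = I - 2 u u^T / (u^T u) to both R and Q^T.
        for M in (R, Qt):
            M -= 2.0 * np.outer(u, u @ M) / (u @ u)
    return Qt, R

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])
Qt, R = householder_qr(A)
assert np.allclose(np.tril(R, -1), 0)     # R is upper triangular
assert np.allclose(Qt.T @ R, A)           # A = QR
assert np.allclose(Qt @ Qt.T, np.eye(4))  # Q^T is orthogonal
```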

So how do Householder and Givens compare? We notice that the Givens method generates zeros one entry at a time, while Householder does it column by column. So in general, the Householder method is superior.
However, for certain matrices with special structure, we might want the extra delicacy of introducing zeros one at a time. For example, if $A$ is a band matrix, then it might be beneficial to remove the few non-zero entries one at a time.