3 Linear models

IB Statistics

3.3 Linear models with normal assumptions

So far, we have not assumed anything about our variables. In particular, we

have not assumed that they are normal. So what further information can we

obtain by assuming normality?

Example. Suppose we want to measure the resistivity of silicon wafers. We have five instruments, and five wafers were measured by each instrument (so we have 25 wafers in total). We assume that the silicon wafers are all the same, and want to see whether the instruments are consistent with each other. The results are as follows:

Instrument  Wafer 1  Wafer 2  Wafer 3  Wafer 4  Wafer 5
    1        130.5    112.4    118.9    125.7    134.0
    2        130.4    138.2    116.7    132.6    104.2
    3        113.0    120.5    128.9    103.4    118.1
    4        128.0    117.5    114.9    114.9     98.8
    5        121.2    110.5    118.5    100.5    120.9

Let $Y_{i,j}$ be the resistivity of the $j$th wafer measured by instrument $i$, where $i, j = 1, \cdots, 5$. A possible model is
\[
  Y_{i,j} = \mu_i + \varepsilon_{i,j},
\]
where the $\varepsilon_{i,j}$ are independent random variables such that $\mathbb{E}[\varepsilon_{i,j}] = 0$ and $\operatorname{var}(\varepsilon_{i,j}) = \sigma^2$, and the $\mu_i$'s are unknown constants.

This can be written in matrix form, with
\[
  Y = \begin{pmatrix} Y_{1,1} \\ \vdots \\ Y_{1,5} \\ Y_{2,1} \\ \vdots \\ Y_{2,5} \\ \vdots \\ Y_{5,1} \\ \vdots \\ Y_{5,5} \end{pmatrix},\quad
  X = \begin{pmatrix}
    1 & 0 & \cdots & 0 \\
    \vdots & \vdots & & \vdots \\
    1 & 0 & \cdots & 0 \\
    0 & 1 & \cdots & 0 \\
    \vdots & \vdots & & \vdots \\
    0 & 1 & \cdots & 0 \\
    \vdots & \vdots & & \vdots \\
    0 & 0 & \cdots & 1 \\
    \vdots & \vdots & & \vdots \\
    0 & 0 & \cdots & 1
  \end{pmatrix},\quad
  \beta = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \\ \mu_5 \end{pmatrix},\quad
  \varepsilon = \begin{pmatrix} \varepsilon_{1,1} \\ \vdots \\ \varepsilon_{1,5} \\ \varepsilon_{2,1} \\ \vdots \\ \varepsilon_{2,5} \\ \vdots \\ \varepsilon_{5,1} \\ \vdots \\ \varepsilon_{5,5} \end{pmatrix}.
\]
Then $Y = X\beta + \varepsilon$.

We have
\[
  X^TX = \begin{pmatrix}
    5 & 0 & \cdots & 0 \\
    0 & 5 & \cdots & 0 \\
    \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & \cdots & 5
  \end{pmatrix}.
\]
Hence
\[
  (X^TX)^{-1} = \begin{pmatrix}
    \frac{1}{5} & 0 & \cdots & 0 \\
    0 & \frac{1}{5} & \cdots & 0 \\
    \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & \cdots & \frac{1}{5}
  \end{pmatrix}.
\]

So we have
\[
  \hat\mu = (X^TX)^{-1}X^TY = \begin{pmatrix} \bar Y_1 \\ \vdots \\ \bar Y_5 \end{pmatrix}.
\]

The residual sum of squares is
\[
  \mathrm{RSS} = \sum_{i=1}^5 \sum_{j=1}^5 (Y_{i,j} - \hat\mu_i)^2 = \sum_{i=1}^5 \sum_{j=1}^5 (Y_{i,j} - \bar Y_i)^2 = 2170.
\]
This has $n - p = 25 - 5 = 20$ degrees of freedom. We will later see that $\tilde\sigma = \sqrt{\mathrm{RSS}/(n - p)} = 10.4$.

Note that we still haven’t used normality!
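The estimates in the wafer example can be reproduced numerically. The following is a sketch, not part of the notes; it assumes NumPy is available, and the variable names are ours:

```python
import numpy as np

# Resistivity measurements: rows are instruments 1-5, columns are wafers 1-5.
Y_table = np.array([
    [130.5, 112.4, 118.9, 125.7, 134.0],
    [130.4, 138.2, 116.7, 132.6, 104.2],
    [113.0, 120.5, 128.9, 103.4, 118.1],
    [128.0, 117.5, 114.9, 114.9,  98.8],
    [121.2, 110.5, 118.5, 100.5, 120.9],
])

n, p = 25, 5
Y = Y_table.reshape(n)                   # stack rows: Y_{1,1}, ..., Y_{1,5}, Y_{2,1}, ...
X = np.kron(np.eye(p), np.ones((5, 1)))  # one indicator column per instrument

# Least squares estimate (X^T X)^{-1} X^T Y; here X^T X = 5 I.
mu_hat = np.linalg.solve(X.T @ X, X.T @ Y)
rss = np.sum((Y - X @ mu_hat) ** 2)
sigma_tilde = np.sqrt(rss / (n - p))

print(mu_hat)       # the five instrument means Ybar_i
print(rss)          # approx 2173.3, i.e. 2170 to 3 s.f. as quoted in the text
print(sigma_tilde)  # approx 10.4
```

As expected, the fitted value for each instrument is just that instrument's sample mean, since the design matrix merely groups observations by instrument.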

We now make a normal assumption:

\[
  Y = X\beta + \varepsilon,\quad \varepsilon \sim N_n(0, \sigma^2 I),\quad \operatorname{rank}(X) = p < n.
\]

This is a special case of the linear model we just had, so all previous results hold.

Since $Y \sim N_n(X\beta, \sigma^2 I)$, the log-likelihood is
\[
  l(\beta, \sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log \sigma^2 - \frac{1}{2\sigma^2}S(\beta),
\]
where
\[
  S(\beta) = (Y - X\beta)^T(Y - X\beta).
\]

If we want to maximize $l$ with respect to $\beta$, we have to minimize $S(\beta)$, the only term containing $\beta$. So

Proposition. Under normal assumptions the maximum likelihood estimator for a linear model is
\[
  \hat\beta = (X^TX)^{-1}X^TY,
\]
which is the same as the least squares estimator.

This isn't a coincidence! Historically, when Gauss devised the normal distribution, he designed it so that the least squares estimator coincides with the maximum likelihood estimator.

To obtain the MLE for $\sigma^2$, we require
\[
  \left.\frac{\partial l}{\partial \sigma^2}\right|_{\hat\beta, \hat\sigma^2} = 0,
\]
i.e.
\[
  -\frac{n}{2\hat\sigma^2} + \frac{S(\hat\beta)}{2\hat\sigma^4} = 0.
\]
So
\[
  \hat\sigma^2 = \frac{1}{n}S(\hat\beta) = \frac{1}{n}(Y - X\hat\beta)^T(Y - X\hat\beta) = \frac{1}{n}\mathrm{RSS}.
\]
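As a quick numeric sketch (not part of the notes), we can compare the MLE $\hat\sigma^2 = \mathrm{RSS}/n$, which divides by $n$, with the degrees-of-freedom-corrected estimate $\mathrm{RSS}/(n - p)$, using the figures from the wafer example:

```python
# Figures from the wafer example: RSS is approx 2173.3 with n = 25
# observations and p = 5 parameters.
rss, n, p = 2173.3, 25, 5

sigma2_mle = rss / n            # maximises the likelihood, but is biased low
sigma2_corrected = rss / (n - p)  # divides by the degrees of freedom instead

print(sigma2_mle)        # approx 86.9
print(sigma2_corrected)  # approx 108.7, whose square root is the 10.4 above
```

The gap between the two is noticeable here because $p$ is a fifth of $n$; it shrinks as $n$ grows relative to $p$.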

Our ultimate goal now is to show that $\hat\beta$ and $\hat\sigma^2$ are independent. Then we can apply our other standard results such as the $t$-distribution.

First recall that the matrix $P = X(X^TX)^{-1}X^T$ that projects $Y$ to $\hat Y$ is idempotent and symmetric. We will prove the following properties of it:

Lemma.
(i) If $Z \sim N_n(0, \sigma^2 I)$ and $A$ is $n \times n$, symmetric and idempotent with rank $r$, then $Z^TAZ \sim \sigma^2\chi^2_r$.
(ii) For a symmetric idempotent matrix $A$, $\operatorname{rank}(A) = \operatorname{tr}(A)$.

Proof.
(i) Since $A$ is idempotent, $A^2 = A$ by definition. So the eigenvalues of $A$ are either 0 or 1 (since $\lambda x = Ax = A^2x = \lambda^2 x$).
Since $A$ is also symmetric, it is diagonalizable. So there exists an orthogonal $Q$ such that
\[
  \Lambda = Q^TAQ = \operatorname{diag}(\lambda_1, \cdots, \lambda_n) = \operatorname{diag}(1, \cdots, 1, 0, \cdots, 0)
\]
with $r$ copies of 1 and $n - r$ copies of 0.
Let $W = Q^TZ$, so $Z = QW$. Then $W \sim N_n(0, \sigma^2 I)$, since $\operatorname{cov}(W) = Q^T\sigma^2IQ = \sigma^2 I$. Then
\[
  Z^TAZ = W^TQ^TAQW = W^T\Lambda W = \sum_{i=1}^r W_i^2 \sim \sigma^2\chi^2_r,
\]
since each $W_i \sim N(0, \sigma^2)$.

(ii)
\[
  \operatorname{rank}(A) = \operatorname{rank}(\Lambda) = \operatorname{tr}(\Lambda) = \operatorname{tr}(Q^TAQ) = \operatorname{tr}(AQQ^T) = \operatorname{tr} A.
\]
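The lemma's claims about symmetric idempotent matrices can be sanity-checked numerically on the projection matrix $P = X(X^TX)^{-1}X^T$ from the wafer example. A sketch assuming NumPy:

```python
import numpy as np

# The wafer design matrix: 25 rows, one indicator column per instrument.
n, p = 25, 5
X = np.kron(np.eye(p), np.ones((5, 1)))
P = X @ np.linalg.inv(X.T @ X) @ X.T  # projection onto the column space of X

assert np.allclose(P, P.T)             # symmetric
assert np.allclose(P @ P, P)           # idempotent
assert np.isclose(np.trace(P), p)      # rank(P) = tr(P) = p
assert np.linalg.matrix_rank(P) == p

# I_n - P is also symmetric idempotent, with rank n - p = tr(I_n - P).
Q = np.eye(n) - P
assert np.allclose(Q @ Q, Q)
assert np.isclose(np.trace(Q), n - p)
assert np.linalg.matrix_rank(Q) == n - p

# Eigenvalues of a symmetric idempotent matrix are all 0 or 1.
eigs = np.linalg.eigvalsh(P)
assert np.allclose(eigs, np.round(eigs))
```

This mirrors the proof: the eigenvalues are 0 or 1, so the trace simply counts the eigenvalues equal to 1, which is the rank.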

Theorem. For the normal linear model $Y \sim N_n(X\beta, \sigma^2 I)$,
(i) $\hat\beta \sim N_p(\beta, \sigma^2(X^TX)^{-1})$,
(ii) $\mathrm{RSS} \sim \sigma^2\chi^2_{n-p}$, and so $\hat\sigma^2 \sim \frac{\sigma^2}{n}\chi^2_{n-p}$,
(iii) $\hat\beta$ and $\hat\sigma^2$ are independent.

The proof is not particularly elegant — it is just a whole lot of linear algebra!

Proof.
– We have $\hat\beta = (X^TX)^{-1}X^TY$. Call this $CY$ for later use. Then $\hat\beta$ has a normal distribution with mean
\[
  (X^TX)^{-1}X^T(X\beta) = \beta
\]
and covariance
\[
  (X^TX)^{-1}X^T(\sigma^2 I)[(X^TX)^{-1}X^T]^T = \sigma^2(X^TX)^{-1}.
\]
So $\hat\beta \sim N_p(\beta, \sigma^2(X^TX)^{-1})$.

– Our previous lemma says that $Z^TAZ \sim \sigma^2\chi^2_r$. So we pick our $Z$ and $A$ so that $Z^TAZ = \mathrm{RSS}$, and $r$, the rank of $A$, is $n - p$.

Let $Z = Y - X\beta$ and $A = I_n - P$, where $P = X(X^TX)^{-1}X^T$. We first check that the conditions of the lemma hold:
Since $Y \sim N_n(X\beta, \sigma^2 I)$, we have $Z = Y - X\beta \sim N_n(0, \sigma^2 I)$.
Since $P$ is idempotent, $I_n - P$ also is (check!). We also have
\[
  \operatorname{rank}(I_n - P) = \operatorname{tr}(I_n - P) = n - p.
\]
Therefore the conditions of the lemma hold.

To get the final useful result, we want to show that the RSS is indeed $Z^TAZ$. We simplify the expressions of RSS and $Z^TAZ$ and show that they are equal:
\[
  Z^TAZ = (Y - X\beta)^T(I_n - P)(Y - X\beta) = Y^T(I_n - P)Y,
\]
using the fact that $(I_n - P)X = 0$.

Writing $R = Y - \hat Y = (I_n - P)Y$, we have
\[
  \mathrm{RSS} = R^TR = Y^T(I_n - P)Y,
\]
using the symmetry and idempotence of $I_n - P$.

Hence $\mathrm{RSS} = Z^TAZ \sim \sigma^2\chi^2_{n-p}$. Then
\[
  \hat\sigma^2 = \frac{\mathrm{RSS}}{n} \sim \frac{\sigma^2}{n}\chi^2_{n-p}.
\]

– Let
\[
  V = \begin{pmatrix} \hat\beta \\ R \end{pmatrix} = DY, \quad\text{where}\quad D = \begin{pmatrix} C \\ I_n - P \end{pmatrix}
\]
is a $(p + n) \times n$ matrix. Since $Y$ is multivariate normal, $V$ is multivariate normal with
\[
  \operatorname{cov}(V) = D\sigma^2ID^T
  = \sigma^2 \begin{pmatrix} CC^T & C(I_n - P)^T \\ (I_n - P)C^T & (I_n - P)(I_n - P)^T \end{pmatrix}
  = \sigma^2 \begin{pmatrix} CC^T & C(I_n - P) \\ (I_n - P)C^T & I_n - P \end{pmatrix}
  = \sigma^2 \begin{pmatrix} CC^T & 0 \\ 0 & I_n - P \end{pmatrix},
\]
using $C(I_n - P) = 0$ (since $(X^TX)^{-1}X^T(I_n - P) = 0$, because $(I_n - P)X = 0$; check!).
Hence $\hat\beta$ and $R$ are independent, since the off-diagonal covariance blocks are 0. So $\hat\beta$ and $\mathrm{RSS} = R^TR$ are independent, and so $\hat\beta$ and $\hat\sigma^2$ are independent.

From (ii), $\mathbb{E}(\mathrm{RSS}) = \sigma^2(n - p)$. So
\[
  \tilde\sigma^2 = \frac{\mathrm{RSS}}{n - p}
\]
is an unbiased estimator of $\sigma^2$. $\tilde\sigma$ is often known as the residual standard error on $n - p$ degrees of freedom.
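The theorem's conclusions can be illustrated by simulation: under repeated sampling from the normal linear model, the average of RSS should be close to $\sigma^2(n - p)$, and $\hat\beta$ should be empirically uncorrelated with RSS. A Monte Carlo sketch, assuming NumPy; the true parameter values below are illustrative, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(1)

# Wafer-style design: 5 instruments, 5 replicates each, true sd sigma = 10.
n, p, sigma = 25, 5, 10.0
X = np.kron(np.eye(p), np.ones((5, 1)))
beta = np.array([124.0, 124.0, 117.0, 115.0, 114.0])  # illustrative true means

reps = 10000
rss_samples = np.empty(reps)
beta1_samples = np.empty(reps)
for k in range(reps):
    Y = X @ beta + sigma * rng.standard_normal(n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    rss_samples[k] = np.sum((Y - X @ beta_hat) ** 2)
    beta1_samples[k] = beta_hat[0]

# E(RSS) = sigma^2 (n - p), so the mean should be near 100 * 20 = 2000,
# confirming RSS/(n - p) is unbiased for sigma^2 while RSS/n is biased low.
print(rss_samples.mean())

# Independence of beta_hat and sigma_hat^2 shows up as near-zero correlation.
print(np.corrcoef(beta1_samples, rss_samples)[0, 1])
```

Plotting a histogram of `rss_samples / sigma**2` against the $\chi^2_{20}$ density would likewise illustrate part (ii) of the theorem.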