3.3 Linear models with normal assumptions
So far, we have not assumed anything about our variables. In particular, we
have not assumed that they are normal. So what further information can we
obtain by assuming normality?
Example. Suppose we want to measure the resistivity of silicon wafers. We
have five instruments, and five wafers were measured by each instrument (so we
have 25 wafers in total). We assume that the silicon wafers are all the same, and
want to see whether the instruments are consistent with each other. The results
are as follows:
                         Wafer
                  1      2      3      4      5
   Instrument 1   130.5  112.4  118.9  125.7  134.0
              2   130.4  138.2  116.7  132.6  104.2
              3   113.0  120.5  128.9  103.4  118.1
              4   128.0  117.5  114.9  114.9   98.8
              5   121.2  110.5  118.5  100.5  120.9
Let $Y_{i,j}$ be the resistivity of the $j$th wafer measured by instrument $i$, where $i, j = 1, \dots, 5$. A possible model is
\[
Y_{i,j} = \mu_i + \varepsilon_{i,j},
\]
where the $\varepsilon_{i,j}$ are independent random variables such that $\mathbb{E}[\varepsilon_{i,j}] = 0$ and $\operatorname{var}(\varepsilon_{i,j}) = \sigma^2$, and the $\mu_i$'s are unknown constants.
This can be written in matrix form, with
\[
Y = \begin{pmatrix} Y_{1,1} \\ \vdots \\ Y_{1,5} \\ Y_{2,1} \\ \vdots \\ Y_{2,5} \\ \vdots \\ Y_{5,1} \\ \vdots \\ Y_{5,5} \end{pmatrix}, \quad
X = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 1 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 1
\end{pmatrix}, \quad
\beta = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \\ \mu_5 \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_{1,1} \\ \vdots \\ \varepsilon_{1,5} \\ \varepsilon_{2,1} \\ \vdots \\ \varepsilon_{2,5} \\ \vdots \\ \varepsilon_{5,1} \\ \vdots \\ \varepsilon_{5,5} \end{pmatrix}.
\]
Then
\[
Y = X\beta + \varepsilon.
\]
We have
\[
X^TX = \begin{pmatrix}
5 & 0 & \cdots & 0 \\
0 & 5 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 5
\end{pmatrix}.
\]
Hence
\[
(X^TX)^{-1} = \begin{pmatrix}
\frac{1}{5} & 0 & \cdots & 0 \\
0 & \frac{1}{5} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{1}{5}
\end{pmatrix}.
\]
So we have
\[
\hat\mu = (X^TX)^{-1}X^TY = \begin{pmatrix} \bar Y_1 \\ \vdots \\ \bar Y_5 \end{pmatrix}.
\]
The residual sum of squares is
\[
\mathrm{RSS} = \sum_{i=1}^{5}\sum_{j=1}^{5}(Y_{i,j} - \hat\mu_i)^2 = \sum_{i=1}^{5}\sum_{j=1}^{5}(Y_{i,j} - \bar Y_i)^2 = 2170.
\]
This has $n - p = 25 - 5 = 20$ degrees of freedom. We will later see that $\tilde\sigma = \sqrt{\mathrm{RSS}/(n - p)} = 10.4$.
Note that we still haven’t used normality!
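
As a quick numerical check of this example (my own sketch, not part of the notes), the fitted means, the RSS and the residual standard error can be computed directly from the table above:

import numpy as np

# rows = instruments 1..5, columns = wafers 1..5
Y = np.array([
    [130.5, 112.4, 118.9, 125.7, 134.0],
    [130.4, 138.2, 116.7, 132.6, 104.2],
    [113.0, 120.5, 128.9, 103.4, 118.1],
    [128.0, 117.5, 114.9, 114.9,  98.8],
    [121.2, 110.5, 118.5, 100.5, 120.9],
])

mu_hat = Y.mean(axis=1)                    # least squares estimates: the row means Y_bar_i
rss = np.sum((Y - mu_hat[:, None]) ** 2)   # residual sum of squares
n, p = Y.size, len(mu_hat)                 # n = 25, p = 5
sigma_tilde = np.sqrt(rss / (n - p))       # residual standard error on n - p = 20 d.o.f.

print(mu_hat)        # approx [124.3, 124.4, 116.8, 114.8, 114.3]
print(rss)           # approx 2173, i.e. 2170 to three significant figures as quoted above
print(sigma_tilde)   # approx 10.4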
We now make a normal assumption:
\[
Y = X\beta + \varepsilon, \quad \varepsilon \sim N_n(0, \sigma^2 I), \quad \operatorname{rank}(X) = p < n.
\]
This is a special case of the linear model we just had, so all previous results hold.
Since $Y \sim N_n(X\beta, \sigma^2 I)$, the log-likelihood is
\[
l(\beta, \sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}S(\beta),
\]
where $S(\beta) = (Y - X\beta)^T(Y - X\beta)$.
To maximize $l$ with respect to $\beta$, we only need to consider the term containing $\beta$, namely $-\frac{1}{2\sigma^2}S(\beta)$, and maximizing this is the same as minimizing $S(\beta)$, which is exactly the least squares problem. So
Proposition. Under the normal assumptions, the maximum likelihood estimator for a linear model is
\[
\hat\beta = (X^TX)^{-1}X^TY,
\]
which is the same as the least squares estimator.
This isn't a coincidence! Historically, when Gauss devised the normal distribution, he designed it so that the least squares estimator coincides with the maximum likelihood estimator.
To obtain the MLE for $\sigma^2$, we require
\[
\left.\frac{\partial l}{\partial \sigma^2}\right|_{\hat\beta, \hat\sigma^2} = 0,
\]
i.e.
\[
-\frac{n}{2\hat\sigma^2} + \frac{S(\hat\beta)}{2\hat\sigma^4} = 0.
\]
So
\[
\hat\sigma^2 = \frac{1}{n}S(\hat\beta) = \frac{1}{n}(Y - X\hat\beta)^T(Y - X\hat\beta) = \frac{1}{n}\mathrm{RSS}.
\]
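
As a sanity check (my own sketch, not part of the notes), we can verify numerically that maximizing the log-likelihood reproduces these closed-form estimators. The design matrix, true parameter values and noise level below are arbitrary illustrative choices echoing the wafer example.

import numpy as np
from scipy.optimize import minimize

n, p = 25, 5
X = np.kron(np.eye(p), np.ones((n // p, 1)))       # one-way layout, as in the wafer example
rng = np.random.default_rng(0)
Y = X @ np.array([124.0, 124.0, 117.0, 115.0, 114.0]) + rng.normal(scale=10.0, size=n)

def neg_log_lik(theta):
    beta, log_s2 = theta[:p], theta[p]              # parametrise by log(sigma^2) to keep it positive
    resid = Y - X @ beta
    return 0.5 * (n * np.log(2 * np.pi) + n * log_s2 + resid @ resid / np.exp(log_s2))

start = np.append(np.full(p, Y.mean()), np.log(Y.var()))
res = minimize(neg_log_lik, start, method="BFGS")

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)        # closed-form least squares estimator
rss = np.sum((Y - X @ beta_hat) ** 2)
print(res.x[:p], beta_hat)                          # numerical MLE agrees with beta_hat
print(np.exp(res.x[p]), rss / n)                    # numerical MLE of sigma^2 agrees with RSS/n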
Our ultimate goal now is to show that $\hat\beta$ and $\hat\sigma^2$ are independent. Then we can apply our other standard results such as the $t$-distribution.
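
As a preview of why this matters (a standard construction, relying on the distributional results proved below): once we know $\hat\beta \sim N_p(\beta, \sigma^2(X^TX)^{-1})$, $\mathrm{RSS} \sim \sigma^2\chi^2_{n-p}$, and that the two are independent, we can form, for each coefficient,
\[
\frac{\hat\beta_j - \beta_j}{\tilde\sigma\sqrt{\big((X^TX)^{-1}\big)_{jj}}} \sim t_{n-p}, \qquad \tilde\sigma^2 = \frac{\mathrm{RSS}}{n-p},
\]
since the numerator divided by $\sigma\sqrt{((X^TX)^{-1})_{jj}}$ is standard normal, $\mathrm{RSS}/\sigma^2$ is an independent $\chi^2_{n-p}$ variable, and the unknown $\sigma^2$ cancels in the ratio.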
First recall that the matrix $P = X(X^TX)^{-1}X^T$ that projects $Y$ onto $\hat Y$ is idempotent and symmetric. We will prove the following properties of it:
Lemma.
(i) If $Z \sim N_n(0, \sigma^2 I)$ and $A$ is an $n \times n$ symmetric, idempotent matrix with rank $r$, then $Z^TAZ \sim \sigma^2\chi^2_r$.
(ii) For a symmetric idempotent matrix $A$, $\operatorname{rank}(A) = \operatorname{tr}(A)$.
Proof.
(i) Since $A$ is idempotent, $A^2 = A$ by definition. So the eigenvalues of $A$ are either 0 or 1 (since if $Ax = \lambda x$, then $\lambda x = Ax = A^2x = \lambda^2 x$).
Since $A$ is also symmetric, it is diagonalizable. So there exists an orthogonal $Q$ such that
\[
\Lambda = Q^TAQ = \operatorname{diag}(\lambda_1, \dots, \lambda_n) = \operatorname{diag}(1, \dots, 1, 0, \dots, 0)
\]
with $r$ copies of 1 and $n - r$ copies of 0.
Let $W = Q^TZ$, so that $Z = QW$. Then $W \sim N_n(0, \sigma^2 I)$, since $\operatorname{cov}(W) = Q^T\sigma^2 IQ = \sigma^2 I$. Then
\[
Z^TAZ = W^TQ^TAQW = W^T\Lambda W = \sum_{i=1}^{r} W_i^2 \sim \sigma^2\chi^2_r.
\]
(ii) We have
\[
\operatorname{rank}(A) = \operatorname{rank}(\Lambda) = \operatorname{tr}(\Lambda) = \operatorname{tr}(Q^TAQ) = \operatorname{tr}(AQQ^T) = \operatorname{tr}(A).
\]
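
For a concrete illustration (my own numerical check, not from the notes), both parts of the lemma can be verified for the projection matrix of the wafer design, together with a Monte Carlo check of the $\chi^2$ claim. All numerical values below are arbitrary.

import numpy as np

n, p = 25, 5
X = np.kron(np.eye(p), np.ones((n // p, 1)))          # one-way layout design matrix
P = X @ np.linalg.solve(X.T @ X, X.T)                 # symmetric, idempotent projection

print(np.allclose(P, P @ P))                          # idempotent: P^2 = P
print(np.linalg.matrix_rank(P), np.trace(P))          # lemma (ii): rank(P) = tr(P) = p = 5

# Lemma (i), Monte Carlo: Z ~ N_n(0, sigma^2 I)  =>  Z^T (I - P) Z ~ sigma^2 chi^2_{n-p},
# so its sample mean should be close to sigma^2 * (n - p).
sigma2 = 4.0
rng = np.random.default_rng(1)
Z = rng.normal(scale=np.sqrt(sigma2), size=(10_000, n))
quad = np.einsum("ki,ij,kj->k", Z, np.eye(n) - P, Z)  # quadratic forms Z^T (I - P) Z
print(quad.mean(), sigma2 * (n - p))                  # both approximately 80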
Theorem. For the normal linear model $Y \sim N_n(X\beta, \sigma^2 I)$,
(i) $\hat\beta \sim N_p(\beta, \sigma^2(X^TX)^{-1})$;
(ii) $\mathrm{RSS} \sim \sigma^2\chi^2_{n-p}$, and so $\hat\sigma^2 \sim \frac{\sigma^2}{n}\chi^2_{n-p}$;
(iii) $\hat\beta$ and $\hat\sigma^2$ are independent.
The proof is not particularly elegant — it is just a whole lot of linear algebra!
Proof.
– We have $\hat\beta = (X^TX)^{-1}X^TY$. Call this $CY$ for later use. Then $\hat\beta$ has a normal distribution with mean
\[
(X^TX)^{-1}X^T(X\beta) = \beta
\]
and covariance
\[
(X^TX)^{-1}X^T(\sigma^2 I)\big[(X^TX)^{-1}X^T\big]^T = \sigma^2(X^TX)^{-1}.
\]
So $\hat\beta \sim N_p(\beta, \sigma^2(X^TX)^{-1})$.
– Our previous lemma says that $Z^TAZ \sim \sigma^2\chi^2_r$. So we pick $Z$ and $A$ so that $Z^TAZ = \mathrm{RSS}$ and $r = \operatorname{rank}(A) = n - p$.
Let $Z = Y - X\beta$ and $A = I_n - P$, where $P = X(X^TX)^{-1}X^T$. We first check that the conditions of the lemma hold:
Since $Y \sim N_n(X\beta, \sigma^2 I)$, we have $Z = Y - X\beta \sim N_n(0, \sigma^2 I)$.
Since $P$ is idempotent, $I_n - P$ is also idempotent (check!). We also have
\[
\operatorname{rank}(I_n - P) = \operatorname{tr}(I_n - P) = n - \operatorname{tr}(P) = n - p,
\]
since $\operatorname{tr}(P) = \operatorname{tr}\big(X(X^TX)^{-1}X^T\big) = \operatorname{tr}\big((X^TX)^{-1}X^TX\big) = \operatorname{tr}(I_p) = p$.
Therefore the conditions of the lemma hold.
To get the final useful result, we want to show that the RSS is indeed $Z^TAZ$. We simplify the expressions of RSS and $Z^TAZ$ and show that they are equal:
\[
Z^TAZ = (Y - X\beta)^T(I_n - P)(Y - X\beta) = Y^T(I_n - P)Y,
\]
using the fact that $(I_n - P)X = 0$.
Writing $R = Y - \hat Y = (I_n - P)Y$, we have
\[
\mathrm{RSS} = R^TR = Y^T(I_n - P)Y,
\]
using the symmetry and idempotence of $I_n - P$.
Hence $\mathrm{RSS} = Z^TAZ \sim \sigma^2\chi^2_{n-p}$. Then
\[
\hat\sigma^2 = \frac{\mathrm{RSS}}{n} \sim \frac{\sigma^2}{n}\chi^2_{n-p}.
\]
– Let
\[
V = \begin{pmatrix} \hat\beta \\ R \end{pmatrix} = DY, \quad \text{where } D = \begin{pmatrix} C \\ I_n - P \end{pmatrix}
\]
is a $(p + n) \times n$ matrix. Since $Y$ is multivariate normal, so is $V$, with
\[
\operatorname{cov}(V) = D\sigma^2ID^T
= \sigma^2 \begin{pmatrix} CC^T & C(I_n - P)^T \\ (I_n - P)C^T & (I_n - P)(I_n - P)^T \end{pmatrix}
= \sigma^2 \begin{pmatrix} CC^T & C(I_n - P) \\ (I_n - P)C^T & I_n - P \end{pmatrix}
= \sigma^2 \begin{pmatrix} CC^T & 0 \\ 0 & I_n - P \end{pmatrix},
\]
using $C(I_n - P) = 0$: since $(I_n - P)X = 0$ and $I_n - P$ is symmetric, $X^T(I_n - P) = 0$, and hence $C(I_n - P) = (X^TX)^{-1}X^T(I_n - P) = 0$ (check!).
Hence $\hat\beta$ and $R$ are independent, since the off-diagonal covariance blocks are 0. So $\hat\beta$ and $\mathrm{RSS} = R^TR$ are independent, and therefore $\hat\beta$ and $\hat\sigma^2$ are independent.
From (ii), $\mathbb{E}(\mathrm{RSS}) = \sigma^2(n - p)$. So
\[
\tilde\sigma^2 = \frac{\mathrm{RSS}}{n - p}
\]
is an unbiased estimator of $\sigma^2$. Here $\tilde\sigma$ is often known as the residual standard error on $n - p$ degrees of freedom.
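
As a final illustration (my own simulation sketch, not part of the notes), the theorem and the unbiasedness claim can be checked by simulating many data sets from a normal linear model. The design matrix, $\beta$ and $\sigma$ below are arbitrary choices echoing the wafer example.

import numpy as np

n, p, sigma = 25, 5, 10.0
X = np.kron(np.eye(p), np.ones((n // p, 1)))        # one-way layout design
beta = np.array([124.0, 124.0, 117.0, 115.0, 114.0])
C = np.linalg.solve(X.T @ X, X.T)                   # C = (X^T X)^{-1} X^T

rng = np.random.default_rng(2)
reps = 20_000
Y = X @ beta + rng.normal(scale=sigma, size=(reps, n))
beta_hat = Y @ C.T                                  # (reps, p) array of estimates
rss = np.sum((Y - beta_hat @ X.T) ** 2, axis=1)     # one RSS per simulated data set

print(np.mean(rss / (n - p)), sigma**2)             # unbiasedness: both approx 100
print(np.mean(rss / sigma**2), n - p)               # E[chi^2_{n-p}] = n - p = 20
print(np.corrcoef(beta_hat[:, 0], rss / n)[0, 1])   # approx 0, consistent with independence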