3 Linear models

IB Statistics

3.2 Simple linear regression
What we did above was rather complicated. If we have a simple linear regression model
\[
  Y_i = a + b x_i + \varepsilon_i,
\]
then we can reparameterise it to
\[
  Y_i = a' + b(x_i - \bar{x}) + \varepsilon_i, \tag{6}
\]
where $\bar{x} = \sum x_i / n$ and $a' = a + b\bar{x}$. Since $\sum (x_i - \bar{x}) = 0$, this leads to simplified calculations.
In matrix form,
\[
  X = \begin{pmatrix}
    1 & (x_1 - \bar{x}) \\
    \vdots & \vdots \\
    1 & (x_{24} - \bar{x})
  \end{pmatrix}.
\]
Since $\sum (x_i - \bar{x}) = 0$, the off-diagonal entries of $X^T X$ are all 0, and we have
\[
  X^T X = \begin{pmatrix} n & 0 \\ 0 & S_{xx} \end{pmatrix},
\]
where $S_{xx} = \sum (x_i - \bar{x})^2$.
Hence
\[
  (X^T X)^{-1} = \begin{pmatrix} \frac{1}{n} & 0 \\ 0 & \frac{1}{S_{xx}} \end{pmatrix}.
\]
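As a quick numerical illustration of why centring helps, the sketch below builds the centred design matrix for a small set of covariate values and forms $X^T X$ entry by entry. The `x` values are made up for illustration and are not the data from these notes.

```python
# Hypothetical covariate values, chosen only for illustration.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
n = len(x)
x_bar = sum(x) / n

# Design matrix of the reparameterised model:
# a column of ones and a column of centred covariates.
X = [[1.0, xi - x_bar] for xi in x]

# X^T X computed entry by entry (a 2 x 2 matrix).
XtX = [[sum(row[j] * row[k] for row in X) for k in range(2)]
       for j in range(2)]

S_xx = sum((xi - x_bar) ** 2 for xi in x)

# Because X^T X is diagonal, inverting it only requires
# inverting the two diagonal entries.
XtX_inv = [[1.0 / n, 0.0], [0.0, 1.0 / S_xx]]

print(XtX)
```

The off-diagonal entries come out as zero because the centred column sums to zero, which is exactly what makes the inverse trivial to write down.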
So
\[
  \hat{\beta} = (X^T X)^{-1} X^T Y = \begin{pmatrix} \bar{Y} \\ \dfrac{S_{xY}}{S_{xx}} \end{pmatrix},
\]
where $S_{xY} = \sum Y_i (x_i - \bar{x})$.
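Hence the estimated intercept is $\hat{a}' = \bar{y}$, and the estimated gradient is
\[
\begin{aligned}
  \hat{b} = \frac{S_{xy}}{S_{xx}}
  &= \frac{\sum_i y_i (x_i - \bar{x})}{\sum_i (x_i - \bar{x})^2} \\
  &= \frac{\sum_i (y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}} \times \sqrt{\frac{S_{yy}}{S_{xx}}} \qquad (*) \\
  &= r \times \sqrt{\frac{S_{yy}}{S_{xx}}},
\end{aligned}
\]
where $S_{yy} = \sum_i (y_i - \bar{y})^2$.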
We have $(*)$ since $\sum \bar{y}(x_i - \bar{x}) = 0$, so we can subtract it from the numerator without changing its value. The square-root factors then just multiply and divide by the same quantity, $\sqrt{S_{xx} S_{yy}}$, using $\sum_i (x_i - \bar{x})^2 = S_{xx}$.
So the gradient is the Pearson product-moment correlation coefficient $r$ times the ratio of the empirical standard deviations of the $y$'s and $x$'s (note that the gradient is the same whether or not the $x$'s are centred to have mean 0).
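The identities above are easy to check numerically. The following is a minimal sketch on made-up data (the `x` and `y` values are hypothetical): it computes the gradient as $S_{xy}/S_{xx}$, as $r\sqrt{S_{yy}/S_{xx}}$, and from the usual uncentred normal equations, and recovers the original intercept from $a' = a + b\bar{x}$.

```python
import math

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.1, 2.9, 5.2, 7.8, 12.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

S_xx = sum((xi - x_bar) ** 2 for xi in x)
S_yy = sum((yi - y_bar) ** 2 for yi in y)

# Two equivalent forms of S_xy: the y's need not be centred,
# because sum(y_bar * (x_i - x_bar)) = 0.
S_xy = sum(yi * (xi - x_bar) for xi, yi in zip(x, y))
S_xy_centred = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y))

# Estimates in the reparameterised model.
a_prime_hat = y_bar
b_hat = S_xy / S_xx

# Gradient written as correlation times ratio of standard deviations.
r = S_xy_centred / math.sqrt(S_xx * S_yy)
b_via_r = r * math.sqrt(S_yy / S_xx)

# Gradient from the uncentred normal equations, to show centring
# does not change it.
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)
b_uncentred = (n * sum_xy - sum(x) * sum(y)) / (n * sum_x2 - sum(x) ** 2)

# Intercept of the original model, using a' = a + b * x_bar.
a_hat = a_prime_hat - b_hat * x_bar

print(b_hat, b_via_r, b_uncentred, a_hat)
```

All three gradient values agree up to floating-point rounding; only the intercept depends on the parameterisation.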
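Hence we get $\operatorname{cov}(\hat{\beta}) = (X^T X)^{-1} \sigma^2$, and so from our expression for $(X^T X)^{-1}$,
\[
  \operatorname{var}(\hat{a}') = \operatorname{var}(\bar{Y}) = \frac{\sigma^2}{n}, \quad
  \operatorname{var}(\hat{b}) = \frac{\sigma^2}{S_{xx}}.
\]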
Note that these estimators are uncorrelated, since the off-diagonal entries of $(X^T X)^{-1}$ are zero.
Note also that these are obtained without any explicit distributional assumptions.
Example. Continuing our previous oxygen/time example, we have $\bar{y} = 826.5$, $S_{xx} = 783.5 = 28.0^2$, $S_{xy} = 10077$, $S_{yy} = 444^2$, $r = 0.81$, $\hat{b} = 12.9$.
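As a sanity check on the arithmetic, the quoted $r$ and $\hat{b}$ can be recomputed from the quoted sums of squares. The sketch below takes the summary statistics exactly as printed above and uses only $\hat{b} = S_{xy}/S_{xx}$ and $r = S_{xy}/\sqrt{S_{xx} S_{yy}}$.

```python
import math

# Summary statistics as printed in the example above.
S_xx = 783.5
S_xy = 10077.0
S_yy = 444.0 ** 2

# Gradient and correlation coefficient from the formulas in this section.
b_hat = S_xy / S_xx
r = S_xy / math.sqrt(S_xx * S_yy)

# Rounded to the precision used in the notes.
print(round(b_hat, 1), round(r, 2))
```

To the stated precision these reproduce the quoted values 12.9 and 0.81, and $\sqrt{783.5}$ rounds to 28.0 as claimed.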
Theorem (Gauss-Markov theorem). In a full rank linear model, let $\hat{\beta}$ be the least squares estimator of $\beta$ and let $\beta^*$ be any other unbiased estimator for $\beta$ which is linear in the $Y_i$'s. Then
\[
  \operatorname{var}(t^T \hat{\beta}) \leq \operatorname{var}(t^T \beta^*)
\]
for all $t \in \mathbb{R}^p$. We say that $\hat{\beta}$ is the best linear unbiased estimator of $\beta$ (BLUE).
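Proof. Since $\beta^*$ is linear in the $Y_i$'s, $\beta^* = AY$ for some $p \times n$ matrix $A$.
Since $\beta^*$ is an unbiased estimator, we must have $E[\beta^*] = \beta$. However, since $\beta^* = AY$, $E[\beta^*] = A E[Y] = AX\beta$. So we must have $\beta = AX\beta$. Since this holds for any $\beta$, we must have $AX = I_p$. Now
\[
\begin{aligned}
  \operatorname{cov}(\beta^*) &= E[(\beta^* - \beta)(\beta^* - \beta)^T] \\
  &= E[(AY - \beta)(AY - \beta)^T] \\
  &= E[(AX\beta + A\varepsilon - \beta)(AX\beta + A\varepsilon - \beta)^T].
\end{aligned}
\]
Since $AX\beta = \beta$, this is equal to
\[
\begin{aligned}
  &= E[A\varepsilon (A\varepsilon)^T] \\
  &= A(\sigma^2 I)A^T \\
  &= \sigma^2 AA^T.
\end{aligned}
\]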
Now let $\beta^* - \hat{\beta} = (A - (X^T X)^{-1} X^T)Y = BY$, for some $B$. Then
\[
  BX = AX - (X^T X)^{-1} X^T X = I_p - I_p = 0.
\]
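By definition, we have $AY = BY + (X^T X)^{-1} X^T Y$, and this is true for all $Y$. So $A = B + (X^T X)^{-1} X^T$. Hence
\[
\begin{aligned}
  \operatorname{cov}(\beta^*) &= \sigma^2 AA^T \\
  &= \sigma^2 (B + (X^T X)^{-1} X^T)(B + (X^T X)^{-1} X^T)^T \\
  &= \sigma^2 (BB^T + (X^T X)^{-1}) \\
  &= \sigma^2 BB^T + \operatorname{cov}(\hat{\beta}).
\end{aligned}
\]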
Note that in the second line, the cross-terms disappear since BX = 0.
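So for any $t \in \mathbb{R}^p$, we have
\[
\begin{aligned}
  \operatorname{var}(t^T \beta^*) &= t^T \operatorname{cov}(\beta^*) t \\
  &= t^T \operatorname{cov}(\hat{\beta}) t + t^T BB^T t \, \sigma^2 \\
  &= \operatorname{var}(t^T \hat{\beta}) + \sigma^2 \|B^T t\|^2 \\
  &\geq \operatorname{var}(t^T \hat{\beta}).
\end{aligned}
\]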
Taking $t = (0, \cdots, 1, 0, \cdots, 0)^T$ with a 1 in the $i$th position, we have
\[
  \operatorname{var}(\hat{\beta}_i) \leq \operatorname{var}(\beta^*_i).
\]
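To make the theorem concrete, here is a minimal sketch comparing the least squares gradient with one particular competing estimator in simple linear regression. The competitor, the "endpoint" estimator $(Y_n - Y_1)/(x_n - x_1)$, and the `x` values are choices made here for illustration; they do not come from the notes. Any linear estimator $c^T Y$ of the gradient has variance $\sigma^2 c^T c$, and it is unbiased exactly when $\sum c_i = 0$ and $\sum c_i x_i = 1$, so the comparison needs no simulation.

```python
# Hypothetical covariate values, for illustration only.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
n = len(x)
x_bar = sum(x) / n
S_xx = sum((xi - x_bar) ** 2 for xi in x)

# Least squares gradient: b_hat = sum(c_i * Y_i) with these weights.
c_ols = [(xi - x_bar) / S_xx for xi in x]

# Competing linear estimator: slope through the first and last points.
span = x[-1] - x[0]
c_end = [0.0] * n
c_end[0] = -1.0 / span
c_end[-1] = 1.0 / span


def is_unbiased_for_gradient(c, tol=1e-12):
    """E[c^T Y] = a * sum(c) + b * sum(c_i x_i); this equals b for all a, b
    exactly when sum(c) = 0 and sum(c_i x_i) = 1."""
    return (abs(sum(c)) < tol
            and abs(sum(ci * xi for ci, xi in zip(c, x)) - 1.0) < tol)


def variance_factor(c):
    """var(c^T Y) / sigma^2 when the errors are uncorrelated with
    common variance sigma^2."""
    return sum(ci * ci for ci in c)


print(is_unbiased_for_gradient(c_ols), is_unbiased_for_gradient(c_end))
print(variance_factor(c_ols), variance_factor(c_end))
```

Both estimators are unbiased, but with these `x` values the variance factors are $1/S_{xx} = 1/66$ for least squares and $2/(x_n - x_1)^2 = 2/100$ for the endpoint estimator, in line with the inequality above.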
Definition (Fitted values and residuals). $\hat{Y} = X\hat{\beta}$ is the vector of fitted values. These are what our model says $Y$ should be.
$R = Y - \hat{Y}$ is the vector of residuals. These are the deviations of our model from reality.
The residual sum of squares is
\[
  \mathrm{RSS} = \|R\|^2 = R^T R = (Y - X\hat{\beta})^T (Y - X\hat{\beta}).
\]
We can give these a geometric interpretation. Note that $X^T R = X^T(Y - \hat{Y}) = X^T Y - X^T X \hat{\beta} = 0$ by our formula for $\hat{\beta}$. So $R$ is orthogonal to the column space of $X$.
Write $\hat{Y} = X\hat{\beta} = X(X^T X)^{-1} X^T Y = PY$, where $P = X(X^T X)^{-1} X^T$. Then $P$ represents an orthogonal projection of $\mathbb{R}^n$ onto the space spanned by the columns of $X$, i.e. it projects the actual data $Y$ to the fitted values $\hat{Y}$. Then $R$ is the part of $Y$ orthogonal to the column space of $X$.
The projection matrix $P$ is idempotent and symmetric, i.e. $P^2 = P$ and $P^T = P$.
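These properties can be checked directly in the simple linear regression case. The sketch below uses made-up data (`x` and `y` are hypothetical) and the centred design matrix from Section 3.2, for which $(X^T X)^{-1} = \operatorname{diag}(1/n, 1/S_{xx})$, so the entries of $P$ are $P_{ij} = 1/n + (x_i - \bar{x})(x_j - \bar{x})/S_{xx}$.

```python
# Hypothetical data, for illustration only.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.1, 2.9, 5.2, 7.8, 12.1]
n = len(x)

x_bar = sum(x) / n
d = [xi - x_bar for xi in x]          # centred covariates
S_xx = sum(di * di for di in d)

# P = X (X^T X)^{-1} X^T for the design with rows (1, x_i - x_bar).
P = [[1.0 / n + d[i] * d[j] / S_xx for j in range(n)] for i in range(n)]


def matmul(A, B):
    """Plain-Python product of two square matrices of the same size."""
    m = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]


P2 = matmul(P, P)

# Fitted values P Y, residuals R = Y - P Y, and residual sum of squares.
fitted = [sum(P[i][j] * y[j] for j in range(n)) for i in range(n)]
resid = [yi - fi for yi, fi in zip(y, fitted)]
rss = sum(ri * ri for ri in resid)

# X^T R has two components, one per column of X.
XtR = [sum(resid), sum(di * ri for di, ri in zip(d, resid))]

print(XtR, rss)
```

Up to rounding, `P2` equals `P`, `P` equals its transpose, and both components of `XtR` vanish, so the residual vector is orthogonal to each column of $X$.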