3 Linear models

3.1 Linear models
Linear models can be used to explain or model the relationship between a
response (or dependent) variable, and one or more explanatory variables (or
covariates or predictors). As the name suggests, we assume the relationship is
linear.
Example. How do motor insurance claim rates (response) depend on the age
and sex of the driver, and where they live (explanatory variables)?
It is important to note that, unless otherwise specified, we do not assume normality in our calculations here.
Suppose we have p covariates x_j, and we have n observations Y_i. We assume n > p, or else we could pick the parameters to fit our data exactly. Then each observation can be written as

    Y_i = β_1 x_{i1} + ··· + β_p x_{ip} + ε_i    (*)

for i = 1, ··· , n. Here

- β_1, ··· , β_p are unknown, fixed parameters we wish to work out (with n > p);
- x_{i1}, ··· , x_{ip} are the values of the p covariates for the ith response (which are all known);
- ε_1, ··· , ε_n are independent (or possibly just uncorrelated) random variables with mean 0 and variance σ².
We think of the β_j x_{ij} terms as the causal effects of the covariates x_{ij}, and ε_i as a random fluctuation (error term).
Then we clearly have

- E(Y_i) = β_1 x_{i1} + ··· + β_p x_{ip};
- var(Y_i) = var(ε_i) = σ²;
- Y_1, ··· , Y_n are independent.
Note that (*) is linear in the parameters β_1, ··· , β_p. Obviously the real world can be much more complicated, but this is much easier to work with.
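As a concrete illustration of the setup, here is a minimal Python sketch that simulates observations from (*); all numerical values are arbitrary choices, and normal errors are used only for convenience (the model itself requires only mean 0 and variance σ²):

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3                        # n observations, p covariates (arbitrary)
beta = np.array([2.0, -1.0, 0.5])    # "true" parameters (arbitrary)
sigma = 0.3                          # error standard deviation (arbitrary)

X = rng.uniform(size=(n, p))         # known covariate values x_ij
eps = rng.normal(0.0, sigma, size=n) # errors: mean 0, variance sigma^2
Y = X @ beta + eps                   # (*): Y_i = beta_1 x_i1 + ... + beta_p x_ip + eps_i
```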
Example. For each of 24 males, the maximum volume of oxygen uptake in the blood and the time taken to run 2 miles (in seconds) were measured. We want to know how the time taken depends on oxygen uptake.
We might get the results
    Oxygen  42.3  53.1  42.1  50.1  42.5  42.5  47.8  49.9
    Time     918   805   892   962   968   907   770   743
    Oxygen  36.2  49.7  41.5  46.2  48.2  43.2  51.8  53.3
    Time    1045   810   927   813   858   860   760   747
    Oxygen  53.3  47.2  56.9  47.8  48.7  53.7  60.6  56.7
    Time     743   803   683   844   755   700   748   775
For each individual i, we let Y_i be the time to run 2 miles, and x_i be the maximum volume of oxygen uptake, for i = 1, ··· , 24. We might want to fit a straight line to it. So a possible model is

    Y_i = a + b x_i + ε_i,

where ε_i are independent random variables with variance σ², and a and b are constants.
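Jumping ahead to the least squares fit derived below, here is an illustrative Python sketch that fits this straight-line model to the data in the table:

```python
import numpy as np

# maximum oxygen uptake x_i and time Y_i for the 24 individuals (from the table)
x = np.array([42.3, 53.1, 42.1, 50.1, 42.5, 42.5, 47.8, 49.9,
              36.2, 49.7, 41.5, 46.2, 48.2, 43.2, 51.8, 53.3,
              53.3, 47.2, 56.9, 47.8, 48.7, 53.7, 60.6, 56.7])
Y = np.array([918, 805, 892, 962, 968, 907, 770, 743,
              1045, 810, 927, 813, 858, 860, 760, 747,
              743, 803, 683, 844, 755, 700, 748, 775])

# least squares fit of Y = a + b*x; polyfit returns coefficients
# from the highest degree down, so the slope b comes first
b, a = np.polyfit(x, Y, 1)
```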
The subscripts in the equation make it tempting to write everything in matrix form. Let

    Y = (Y_1, ··· , Y_n)^T,  β = (β_1, ··· , β_p)^T,  ε = (ε_1, ··· , ε_n)^T,

and let X be the n × p matrix with (i, j) entry x_{ij}. Then the equation becomes

    Y = Xβ + ε.    (2)

We also have

- E(ε) = 0;
- cov(Y) = σ²I.
We assume throughout that X has full rank p, i.e. the columns are linearly independent, and that the error variance is the same for each observation. We say this is the homoscedastic case, as opposed to heteroscedastic.
Example. Continuing our example, we have, in matrix form,

    Y = (Y_1, ··· , Y_24)^T,  β = (a, b)^T,  ε = (ε_1, ··· , ε_24)^T,

and X is the 24 × 2 matrix whose ith row is (1, x_i). Then

    Y = Xβ + ε.
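In Python, the corresponding design matrix can be built explicitly (a sketch reusing the array x from the earlier snippet):

```python
import numpy as np

# x: maximum oxygen uptake, as defined in the earlier snippet
X = np.column_stack([np.ones_like(x), x])  # 24 x 2 matrix; ith row is (1, x_i)

# sanity check that X has full rank p = 2
assert np.linalg.matrix_rank(X) == 2
```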
Definition (Least squares estimator). In a linear model Y = Xβ + ε, the least squares estimator β̂ of β minimizes

    S(β) = ‖Y − Xβ‖² = (Y − Xβ)^T (Y − Xβ) = Σ_{i=1}^n (Y_i − x_{ij} β_j)²

with implicit summation over j.
If we plot the points on a graph, then the least squares estimator minimizes the sum of the squares of the vertical distances between the points and the line.
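Before deriving the minimizer analytically, note that S(β) can also be minimized numerically; a sketch using scipy, with X and Y as in the running example:

```python
import numpy as np
from scipy.optimize import minimize

# X, Y: design matrix and responses from the running example
def S(beta):
    r = Y - X @ beta   # residual vector
    return r @ r       # ||Y - X beta||^2

beta_hat = minimize(S, x0=np.zeros(X.shape[1])).x
```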
To minimize S(β), we want

    ∂S/∂β_k |_{β = β̂} = 0

for all k. So

    −2 x_{ik} (Y_i − x_{ij} β̂_j) = 0

for each k (with implicit summation over i and j), i.e.

    x_{ik} x_{ij} β̂_j = x_{ik} Y_i
for all k. Putting this back in matrix form, we have
Proposition. The least squares estimator satisfies

    X^T X β̂ = X^T Y.    (3)
We could also have derived this by completing the square of (Y − Xβ)^T (Y − Xβ), but that would be more complicated.
In order to find β̂, our life would be much easier if X^T X had an inverse. Fortunately, it always does. We assumed that X has full rank p. Then for any t ≠ 0 in R^p,

    t^T X^T X t = (Xt)^T (Xt) = ‖Xt‖² > 0,

since if there were a t ≠ 0 with Xt = 0, we would have produced a linear combination of the columns of X that gives 0, contradicting full rank. So X^T X is positive definite, and hence has an inverse. So

    β̂ = (X^T X)^{−1} X^T Y,    (4)

which is linear in Y.
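Numerically, (3) and (4) translate directly into code; a sketch with X and Y as before (in practice np.linalg.lstsq is preferred for numerical stability, but all three computations below agree here):

```python
import numpy as np

# X, Y as in the running example
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)       # solve the normal equations (3)
beta_hat_alt = np.linalg.inv(X.T @ X) @ X.T @ Y    # explicit formula (4)
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None) # library least squares routine
```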
We have

    E(β̂) = (X^T X)^{−1} X^T E[Y] = (X^T X)^{−1} X^T Xβ = β.
So β̂ is an unbiased estimator for β. Also,

    cov(β̂) = (X^T X)^{−1} X^T cov(Y) X (X^T X)^{−1} = σ² (X^T X)^{−1},    (5)

since cov(Y) = σ²I.
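Both facts, E(β̂) = β and cov(β̂) = σ²(X^T X)^{−1}, can be checked by simulation; a self-contained sketch with arbitrary parameter values (normal errors are again just a convenient choice):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=n)])  # fixed design
beta = np.array([3.0, -0.7])                                   # arbitrary true beta
sigma = 2.0

# repeatedly simulate Y and compute the least squares estimator
estimates = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ beta + rng.normal(0, sigma, n)))
    for _ in range(10_000)
])

print(estimates.mean(axis=0))             # approx. beta (unbiasedness)
print(np.cov(estimates.T))                # approx. sigma^2 (X^T X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))  # theoretical covariance (5)
```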