3 Linear models


3.1 Linear models

Linear models can be used to explain or model the relationship between a response (or dependent) variable and one or more explanatory variables (or covariates or predictors). As the name suggests, we assume the relationship is linear.

Example. How do motor insurance claim rates (response) depend on the age and sex of the driver, and where they live (explanatory variables)?

It is important to note that, unless otherwise specified, we do not assume normality in our calculations here.

Suppose we have $p$ covariates $x_j$, and we have $n$ observations $Y_i$. We assume $n > p$, or else we could pick the parameters to fit our data exactly. Then each observation can be written as
\[
  Y_i = \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i \tag{$*$}
\]

for $i = 1, \dots, n$. Here

– $\beta_1, \dots, \beta_p$ are unknown, fixed parameters we wish to work out (with $n > p$),
– $x_{i1}, \dots, x_{ip}$ are the values of the $p$ covariates for the $i$th response (which are all known),
– $\varepsilon_1, \dots, \varepsilon_n$ are independent (or possibly just uncorrelated) random variables with mean 0 and variance $\sigma^2$.

We think of the $\beta_j x_{ij}$ terms as the causal effects of the $x_{ij}$, and of $\varepsilon_i$ as a random fluctuation (error term).

Then we clearly have

– $E(Y_i) = \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$,
– $\mathrm{var}(Y_i) = \mathrm{var}(\varepsilon_i) = \sigma^2$,
– $Y_1, \dots, Y_n$ are independent.

Note that $(*)$ is linear in the parameters $\beta_1, \dots, \beta_p$. Obviously the real world can be much more complicated. But this is much easier to work with.
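
Model $(*)$ is easy to simulate, which gives a useful sanity check of the mean and variance properties above. Below is a minimal sketch in NumPy; the values of $n$, $p$, $\beta$ and $\sigma$ are made up for illustration, and the errors are drawn from a (non-normal) uniform distribution to emphasize that normality is not assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 1000, 3                      # n observations, p covariates, with n > p
beta = np.array([2.0, -1.0, 0.5])   # hypothetical "true" parameters
sigma = 1.5                         # error standard deviation

X = rng.uniform(0, 10, size=(n, p))            # known covariate values x_ij
# Mean-0, variance-sigma^2 errors; any such distribution works, normality is not assumed.
eps = rng.uniform(-sigma * np.sqrt(3), sigma * np.sqrt(3), size=n)
Y = X @ beta + eps                             # Y_i = beta_1 x_i1 + ... + beta_p x_ip + eps_i

print((Y - X @ beta).mean())   # near 0, matching E(Y_i) = beta_1 x_i1 + ... + beta_p x_ip
print((Y - X @ beta).var())    # near sigma^2 = 2.25, matching var(Y_i) = sigma^2
```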

Example. For each of 24 males, the maximum volume of oxygen uptake in the blood and the time taken to run 2 miles (in seconds) were measured. We want to know how the time taken depends on oxygen uptake. We might get the results

Oxygen  42.3  53.1  42.1  50.1  42.5  42.5  47.8  49.9
Time     918   805   892   962   968   907   770   743

Oxygen  36.2  49.7  41.5  46.2  48.2  43.2  51.8  53.3
Time    1045   810   927   813   858   860   760   747

Oxygen  53.3  47.2  56.9  47.8  48.7  53.7  60.6  56.7
Time     743   803   683   844   755   700   748   775

For each individual $i$, we let $Y_i$ be the time to run 2 miles, and $x_i$ be the maximum volume of oxygen uptake, for $i = 1, \dots, 24$. We might want to fit a straight line to it. So a possible model is
\[
  Y_i = a + b x_i + \varepsilon_i,
\]
where the $\varepsilon_i$ are independent random variables with variance $\sigma^2$, and $a$ and $b$ are constants.

The subscripts in the equation make it tempting to write them as matrices:
\[
  Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad
  X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}, \quad
  \beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad
  \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}.
\]
Then the equation becomes
\[
  Y = X\beta + \varepsilon. \tag{2}
\]

We also have

– $E(\varepsilon) = 0$,
– $\mathrm{cov}(Y) = \sigma^2 I$.

We assume throughout that $X$ has full rank $p$, i.e. the columns are linearly independent, and that the error variance is the same for each observation. We say this is the homoscedastic case, as opposed to heteroscedastic.
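
As a concrete illustration of the matrix form, here is a short sketch (assuming NumPy) that builds the design matrix for the straight-line model above from a few of the oxygen values and checks that it has full rank:

```python
import numpy as np

# First few oxygen values from the example above, as covariates.
x = np.array([42.3, 53.1, 42.1, 50.1])

# Design matrix for Y_i = a + b x_i + eps_i:
# a column of ones (intercept) next to the covariate column.
X = np.column_stack([np.ones_like(x), x])
print(X.shape)                       # (4, 2)
print(np.linalg.matrix_rank(X))      # 2, i.e. full rank p: columns linearly independent
```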

Example. Continuing our example, we have, in matrix form,
\[
  Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_{24} \end{pmatrix}, \quad
  X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_{24} \end{pmatrix}, \quad
  \beta = \begin{pmatrix} a \\ b \end{pmatrix}, \quad
  \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_{24} \end{pmatrix}.
\]
Then $Y = X\beta + \varepsilon$.

Definition (Least squares estimator). In a linear model $Y = X\beta + \varepsilon$, the least squares estimator $\hat\beta$ of $\beta$ minimizes
\[
  S(\beta) = \|Y - X\beta\|^2 = (Y - X\beta)^T (Y - X\beta) = \sum_{i=1}^n (Y_i - x_{ij}\beta_j)^2,
\]
with implicit summation over $j$.
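
For concreteness, $S(\beta)$ is just the squared Euclidean norm of the residual vector, which is one line of code. A minimal sketch (assuming NumPy, with made-up toy data):

```python
import numpy as np

def S(beta, X, Y):
    """Residual sum of squares S(beta) = ||Y - X beta||^2."""
    r = Y - X @ beta
    return r @ r

# Toy data (made up) for the straight-line model.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
Y = np.array([1.1, 1.9, 3.2])

print(S(np.array([0.0, 1.0]), X, Y))  # S at one candidate beta
print(S(np.array([0.1, 1.0]), X, Y))  # a nearby candidate; the LSE minimizes S
```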

If we plot the points on a graph, then the least squares estimator minimizes the sum of the squared vertical distances between the points and the line.

To minimize it, we want
\[
  \left.\frac{\partial S}{\partial \beta_k}\right|_{\beta = \hat\beta} = 0
\]
for all $k$. So
\[
  -2 x_{ik}(Y_i - x_{ij}\hat\beta_j) = 0
\]
for each $k$ (with implicit summation over $i$ and $j$), i.e.
\[
  x_{ik} x_{ij} \hat\beta_j = x_{ik} Y_i
\]
for all $k$. Putting this back in matrix form, we have

Proposition. The least squares estimator satisfies
\[
  X^T X \hat\beta = X^T Y. \tag{3}
\]

We could also have derived this by completing the square of $(Y - X\beta)^T (Y - X\beta)$, but that would be more complicated.
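
Numerically, (3) can be solved as a $p \times p$ linear system. A sketch (assuming NumPy, with random toy data) comparing a direct solve of the normal equations against a library least squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))   # random full-rank design (toy data)
Y = rng.normal(size=n)

# Solve the normal equations X^T X beta_hat = X^T Y directly...
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# ...and compare with a library least squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))   # True
```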

In order to find $\hat\beta$, our life would be much easier if $X^T X$ had an inverse. Fortunately, it always does, since we assumed that $X$ is of full rank $p$. Then
\[
  t^T X^T X t = (Xt)^T (Xt) = \|Xt\|^2 > 0
\]
for all $t \neq 0$ in $\mathbb{R}^p$ (the inequality is strict since if there were a $t \neq 0$ with $\|Xt\| = 0$, then we would have produced a linear combination of the columns of $X$ that gives $0$, contradicting full rank). So $X^T X$ is positive definite, and hence has an inverse. So
\[
  \hat\beta = (X^T X)^{-1} X^T Y, \tag{4}
\]
which is linear in $Y$.
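
Applying (4) to the running example: the sketch below (assuming NumPy) fits the straight-line model to the 24 data points. Forming the inverse explicitly mirrors the formula; in practice one would solve the normal equations instead.

```python
import numpy as np

# The oxygen uptake / running time data from the example above.
oxygen = np.array([42.3, 53.1, 42.1, 50.1, 42.5, 42.5, 47.8, 49.9,
                   36.2, 49.7, 41.5, 46.2, 48.2, 43.2, 51.8, 53.3,
                   53.3, 47.2, 56.9, 47.8, 48.7, 53.7, 60.6, 56.7])
time = np.array([918, 805, 892, 962, 968, 907, 770, 743,
                 1045, 810, 927, 813, 858, 860, 760, 747,
                 743, 803, 683, 844, 755, 700, 748, 775])

X = np.column_stack([np.ones_like(oxygen), oxygen])   # columns: 1 and x_i
beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ time)      # equation (4): (X^T X)^{-1} X^T Y
a_hat, b_hat = beta_hat
print(a_hat, b_hat)   # fitted intercept and slope (the slope should be negative:
                      # higher oxygen uptake goes with faster times)
```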

We have
\[
  E(\hat\beta) = (X^T X)^{-1} X^T E[Y] = (X^T X)^{-1} X^T X \beta = \beta.
\]

So $\hat\beta$ is an unbiased estimator for $\beta$. Also
\[
  \mathrm{cov}(\hat\beta) = (X^T X)^{-1} X^T \mathrm{cov}(Y)\, X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}, \tag{5}
\]
since $\mathrm{cov}(Y) = \sigma^2 I$.
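
Both unbiasedness and (5) are easy to check by simulation. A Monte Carlo sketch (assuming NumPy; the values of $\beta$, $\sigma$ and the design are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 30, 2, 5.0
X = np.column_stack([np.ones(n), np.linspace(0, 10, n)])  # fixed design matrix
beta = np.array([1.0, 2.0])                               # hypothetical "true" beta

# Simulate many data sets from Y = X beta + eps and recompute beta_hat each time.
reps = 20_000
est = np.empty((reps, p))
for r in range(reps):
    Y = X @ beta + rng.normal(0, sigma, size=n)
    est[r] = np.linalg.solve(X.T @ X, X.T @ Y)

print(est.mean(axis=0))                   # ~ beta: beta_hat is unbiased
print(np.cov(est.T))                      # ~ sigma^2 (X^T X)^{-1}, equation (5)
print(sigma**2 * np.linalg.inv(X.T @ X))
```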