3 Linear models

IB Statistics

3.2 Simple linear regression

What we did above was so complicated. If we have a simple linear regression model
\[
  Y_i = a + bx_i + \varepsilon_i,
\]
then we can reparameterise it to
\[
  Y_i = a' + b(x_i - \bar{x}) + \varepsilon_i, \tag{6}
\]

where $\bar{x} = \sum x_i/n$ and $a' = a + b\bar{x}$. Since $\sum (x_i - \bar{x}) = 0$, this leads to simplified calculations.

In matrix form,
\[
  X =
  \begin{pmatrix}
    1 & x_1 - \bar{x}\\
    \vdots & \vdots\\
    1 & x_{24} - \bar{x}
  \end{pmatrix}.
\]

Since $\sum (x_i - \bar{x}) = 0$, the off-diagonal entries of $X^T X$ are all 0, and we have
\[
  X^T X =
  \begin{pmatrix}
    n & 0\\
    0 & S_{xx}
  \end{pmatrix},
\]
where $S_{xx} = \sum (x_i - \bar{x})^2$.

Hence
\[
  (X^T X)^{-1} =
  \begin{pmatrix}
    \frac{1}{n} & 0\\
    0 & \frac{1}{S_{xx}}
  \end{pmatrix}.
\]

So
\[
  \hat{\beta} = (X^T X)^{-1} X^T \mathbf{Y} =
  \begin{pmatrix}
    \bar{Y}\\
    \frac{S_{xY}}{S_{xx}}
  \end{pmatrix},
\]
where $S_{xY} = \sum Y_i (x_i - \bar{x})$.
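As a quick numerical sanity check, these estimates can be computed directly from the centred formulae. A minimal Python sketch (the data in `x` and `y` are made up purely for illustration):

```python
# Computing a'_hat and b_hat via the centred parameterisation
# Y_i = a' + b(x_i - xbar) + eps_i, with illustrative made-up data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# S_xx = sum (x_i - xbar)^2,  S_xy = sum y_i (x_i - xbar)
S_xx = sum((xi - xbar) ** 2 for xi in x)
S_xy = sum(yi * (xi - xbar) for xi, yi in zip(x, y))

a_hat_prime = ybar       # estimated intercept in the centred model
b_hat = S_xy / S_xx      # estimated gradient

print(a_hat_prime, b_hat)
```

Because the off-diagonals of $X^TX$ vanish, no matrix inversion is needed: each component of $\hat{\beta}$ comes from a one-dimensional calculation.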

Hence the estimated intercept is $\hat{a}' = \bar{y}$, and the estimated gradient is
\begin{align*}
  \hat{b} = \frac{S_{xy}}{S_{xx}}
  &= \frac{\sum_i y_i (x_i - \bar{x})}{\sum_i (x_i - \bar{x})^2}\\
  &= \frac{\sum_i (y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}} \times \sqrt{\frac{S_{yy}}{S_{xx}}} \tag{$*$}\\
  &= r \times \sqrt{\frac{S_{yy}}{S_{xx}}}.
\end{align*}

We have $(*)$ since $\sum \bar{y}(x_i - \bar{x}) = 0$, so we can add it to the numerator. Then the other square root factors just multiply and divide by the same quantity. So the gradient is the Pearson product-moment correlation coefficient $r$ times the ratio of the empirical standard deviations of the $y$'s and $x$'s (note that the gradient is the same whether or not the $x$'s are standardised to have mean 0).
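The identity $\hat{b} = r\sqrt{S_{yy}/S_{xx}}$ can be confirmed numerically; a small Python sketch with arbitrary illustrative data:

```python
# Checking numerically that b_hat = r * sqrt(S_yy / S_xx), where r is
# the Pearson product-moment correlation coefficient (illustrative data).
import math

x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 2.5, 2.0, 4.5]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
S_xx = sum((xi - xbar) ** 2 for xi in x)
S_yy = sum((yi - ybar) ** 2 for yi in y)
S_xy = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y))

b_hat = S_xy / S_xx
r = S_xy / math.sqrt(S_xx * S_yy)

print(b_hat, r * math.sqrt(S_yy / S_xx))  # the two agree
```

The agreement is exact up to floating-point rounding, since both sides reduce algebraically to $S_{xy}/S_{xx}$.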

Hence we get $\operatorname{cov}(\hat{\beta}) = (X^T X)^{-1} \sigma^2$, and so from our expression for $(X^T X)^{-1}$,
\[
  \operatorname{var}(\hat{a}') = \operatorname{var}(\bar{Y}) = \frac{\sigma^2}{n}, \quad
  \operatorname{var}(\hat{b}) = \frac{\sigma^2}{S_{xx}}.
\]

Note that these estimators are uncorrelated.

Note also that these are obtained without any explicit distributional assumptions.
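A quick Monte Carlo check of these variance formulae (an illustrative simulation, not part of the notes; the design points and parameters are made up):

```python
# Simulating the centred model Y_i = a' + b(x_i - xbar) + eps_i to check
# var(a'_hat) ~ sigma^2/n, var(b_hat) ~ sigma^2/S_xx, and cov(a'_hat, b_hat) ~ 0.
import random
import statistics

random.seed(1)
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
n = len(x)
xbar = sum(x) / n
S_xx = sum((xi - xbar) ** 2 for xi in x)
sigma = 2.0

a_hats, b_hats = [], []
for _ in range(20000):
    y = [1.0 + 3.0 * (xi - xbar) + random.gauss(0.0, sigma) for xi in x]
    a_hats.append(sum(y) / n)  # a'_hat = ybar
    b_hats.append(sum(yi * (xi - xbar) for xi, yi in zip(x, y)) / S_xx)

mean_a = sum(a_hats) / len(a_hats)
mean_b = sum(b_hats) / len(b_hats)
cov_ab = sum((a - mean_a) * (b - mean_b)
             for a, b in zip(a_hats, b_hats)) / (len(a_hats) - 1)

print(statistics.variance(a_hats), sigma ** 2 / n)     # close to each other
print(statistics.variance(b_hats), sigma ** 2 / S_xx)  # close to each other
print(cov_ab)                                          # close to 0
```

The empirical variances match the formulae, and the sample covariance of the two estimators hovers near zero, as the diagonal form of $(X^TX)^{-1}$ predicts.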

Example. Continuing our previous oxygen/time example, we have $\bar{y} = 826.5$, $S_{xx} = 783.5 \approx 28.0^2$, $S_{xy} = -10077$, $S_{yy} = 444^2$, $r = -0.81$ and $\hat{b} = -12.9$.
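These figures can be reproduced directly from the summary statistics quoted above (a small arithmetic check):

```python
# Recomputing r and b_hat from the summary statistics of the
# oxygen/time example: r = S_xy / sqrt(S_xx * S_yy), b_hat = S_xy / S_xx.
import math

S_xx = 783.5
S_xy = -10077.0
S_yy = 444.0 ** 2

b_hat = S_xy / S_xx
r = S_xy / math.sqrt(S_xx * S_yy)

print(round(b_hat, 1))  # -12.9
print(round(r, 2))      # -0.81
```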

Theorem (Gauss–Markov theorem). In a full rank linear model, let $\hat{\beta}$ be the least squares estimator of $\beta$, and let $\beta^*$ be any other unbiased estimator of $\beta$ which is linear in the $Y_i$'s. Then
\[
  \operatorname{var}(t^T \hat{\beta}) \leq \operatorname{var}(t^T \beta^*)
\]
for all $t \in \mathbb{R}^p$. We say that $\hat{\beta}$ is the best linear unbiased estimator of $\beta$ (BLUE).

Proof. Since $\beta^*$ is linear in the $Y_i$'s, $\beta^* = A\mathbf{Y}$ for some $p \times n$ matrix $A$. Since $\beta^*$ is an unbiased estimator, we must have $\mathbb{E}[\beta^*] = \beta$. However, since $\beta^* = A\mathbf{Y}$, we get $\mathbb{E}[\beta^*] = A\mathbb{E}[\mathbf{Y}] = AX\beta$. So we must have $\beta = AX\beta$. Since this holds for any $\beta$, we must have $AX = I_p$. Now

\begin{align*}
  \operatorname{cov}(\beta^*) &= \mathbb{E}[(\beta^* - \beta)(\beta^* - \beta)^T]\\
  &= \mathbb{E}[(A\mathbf{Y} - \beta)(A\mathbf{Y} - \beta)^T]\\
  &= \mathbb{E}[(AX\beta + A\varepsilon - \beta)(AX\beta + A\varepsilon - \beta)^T].
\end{align*}
Since $AX\beta = \beta$, this is equal to
\begin{align*}
  \operatorname{cov}(\beta^*) &= \mathbb{E}[A\varepsilon(A\varepsilon)^T]\\
  &= A(\sigma^2 I)A^T\\
  &= \sigma^2 AA^T.
\end{align*}

Now let $\beta^* - \hat{\beta} = (A - (X^T X)^{-1} X^T)\mathbf{Y} = B\mathbf{Y}$, for some $B$. Then
\[
  BX = AX - (X^T X)^{-1} X^T X = I_p - I_p = 0.
\]

By definition, we have $A\mathbf{Y} = B\mathbf{Y} + (X^T X)^{-1} X^T \mathbf{Y}$, and this is true for all $\mathbf{Y}$. So $A = B + (X^T X)^{-1} X^T$. Hence

\begin{align*}
  \operatorname{cov}(\beta^*) &= \sigma^2 AA^T\\
  &= \sigma^2 (B + (X^T X)^{-1} X^T)(B + (X^T X)^{-1} X^T)^T\\
  &= \sigma^2 (BB^T + (X^T X)^{-1})\\
  &= \sigma^2 BB^T + \operatorname{cov}(\hat{\beta}).
\end{align*}
Note that in expanding the product, the cross-terms vanish since $BX = 0$.

So for any $t \in \mathbb{R}^p$, we have
\begin{align*}
  \operatorname{var}(t^T \beta^*) &= t^T \operatorname{cov}(\beta^*) t\\
  &= t^T \operatorname{cov}(\hat{\beta}) t + t^T BB^T t \sigma^2\\
  &= \operatorname{var}(t^T \hat{\beta}) + \sigma^2 \|B^T t\|^2\\
  &\geq \operatorname{var}(t^T \hat{\beta}).
\end{align*}

Taking $t = (0, \cdots, 1, 0, \cdots, 0)^T$ with a 1 in the $i$th position, we have $\operatorname{var}(\hat{\beta}_i) \leq \operatorname{var}(\beta^*_i)$.
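The theorem can be illustrated by simulation: compare the least squares slope with another linear unbiased slope estimator, for instance the two-point estimator $(Y_n - Y_1)/(x_n - x_1)$. This is a sketch with made-up parameters, not part of the notes:

```python
# Comparing the least squares slope with the two-point slope estimator
# (Y_n - Y_1)/(x_n - x_1). Both are linear in the Y_i's and unbiased,
# but Gauss-Markov says least squares has the smaller variance.
import random
import statistics

random.seed(0)
x = [float(i) for i in range(1, 11)]
n = len(x)
xbar = sum(x) / n
S_xx = sum((xi - xbar) ** 2 for xi in x)
a_true, b_true, sigma = 5.0, 2.0, 1.0

ls, two_point = [], []
for _ in range(20000):
    y = [a_true + b_true * xi + random.gauss(0.0, sigma) for xi in x]
    ls.append(sum(yi * (xi - xbar) for xi, yi in zip(x, y)) / S_xx)
    two_point.append((y[-1] - y[0]) / (x[-1] - x[0]))

# Theoretical variances: sigma^2/S_xx for least squares,
# 2*sigma^2/(x_n - x_1)^2 for the two-point estimator.
print(statistics.variance(ls), statistics.variance(two_point))
```

Here the two-point estimator is unbiased for the slope, but its empirical variance is roughly twice that of the least squares estimator, consistent with the theorem.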

Definition (Fitted values and residuals). $\hat{\mathbf{Y}} = X\hat{\beta}$ is the vector of fitted values. These are what our model says $\mathbf{Y}$ should be.

$R = \mathbf{Y} - \hat{\mathbf{Y}}$ is the vector of residuals. These are the deviations of our model from reality.

The residual sum of squares is
\[
  \mathrm{RSS} = \|R\|^2 = R^T R = (\mathbf{Y} - X\hat{\beta})^T (\mathbf{Y} - X\hat{\beta}).
\]

We can give these a geometric interpretation. Note that $X^T R = X^T(\mathbf{Y} - \hat{\mathbf{Y}}) = X^T \mathbf{Y} - X^T X \hat{\beta} = 0$ by our formula for $\hat{\beta}$. So $R$ is orthogonal to the column space of $X$.

Write $\hat{\mathbf{Y}} = X\hat{\beta} = X(X^T X)^{-1} X^T \mathbf{Y} = P\mathbf{Y}$, where
\[
  P = X(X^T X)^{-1} X^T.
\]
Then $P$ represents an orthogonal projection of $\mathbb{R}^n$ onto the space spanned by the columns of $X$, i.e. it projects the actual data $\mathbf{Y}$ to the fitted values $\hat{\mathbf{Y}}$. Then $R$ is the part of $\mathbf{Y}$ orthogonal to the column space of $X$.

The projection matrix $P$ is idempotent and symmetric, i.e. $P^2 = P$ and $P^T = P$.
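These properties are easy to verify numerically. A minimal Python sketch (illustrative data; no linear algebra library is assumed, so the $2 \times 2$ inverse is written out by hand):

```python
# Verifying that P = X (X^T X)^{-1} X^T is symmetric and idempotent, and
# that the residuals R = Y - PY satisfy X^T R = 0 (illustrative data).
x = [1.0, 2.0, 4.0, 7.0]
y = [1.2, 1.9, 4.1, 7.3]
n = len(x)

X = [[1.0, xi] for xi in x]  # n x 2 design matrix: intercept and slope columns

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

XtX = matmul(transpose(X), X)

# Inverse of the 2x2 matrix [[a, b], [c, d]]
(a, b), (c, d) = XtX
det = a * d - b * c
XtX_inv = [[d / det, -b / det], [-c / det, a / det]]

# Projection matrix P = X (X^T X)^{-1} X^T, then P^2
P = matmul(matmul(X, XtX_inv), transpose(X))
P2 = matmul(P, P)

max_idem = max(abs(P[i][j] - P2[i][j]) for i in range(n) for j in range(n))
max_sym = max(abs(P[i][j] - P[j][i]) for i in range(n) for j in range(n))

# Residuals are orthogonal to the columns of X: X^T R = 0
Yhat = [sum(P[i][j] * y[j] for j in range(n)) for i in range(n)]
R = [y[i] - Yhat[i] for i in range(n)]
XtR = [sum(X[i][k] * R[i] for i in range(n)) for k in range(2)]

print(max_idem, max_sym, XtR)  # all entries close to 0
```

Both discrepancies are at the level of floating-point rounding, and $X^TR \approx 0$ confirms that the residuals lie in the orthogonal complement of the column space of $X$.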