IB Statistics (Full)

Part IB — Statistics

Based on lectures by D. Spiegelhalter

Notes taken by Dexter Chua

Lent 2015

These notes are not endorsed by the lecturers, and I have modified them (often

significantly) after lectures. They are nowhere near accurate representations of what

was actually lectured, and in particular, all errors are almost surely mine.

Estimation

Review of distribution and density functions, parametric families. Examples: binomial, Poisson, gamma. Sufficiency, minimal sufficiency, the Rao-Blackwell theorem. Maximum likelihood estimation. Confidence intervals. Use of prior distributions and Bayesian inference. [5]

Hypothesis testing

Simple examples of hypothesis testing, null and alternative hypothesis, critical region, size, power, type I and type II errors, Neyman-Pearson lemma. Significance level of outcome. Uniformly most powerful tests. Likelihood ratio, and use of generalised likelihood ratio to construct test statistics for composite hypotheses. Examples, including t-tests and F-tests. Relationship with confidence intervals. Goodness-of-fit tests and contingency tables. [4]

Linear models

Derivation and joint distribution of maximum likelihood estimators, least squares,

Gauss-Markov theorem. Testing hypotheses, geometric interpretation. Examples,

including simple linear regression and one-way analysis of variance. Use of software. [7]

Contents

0 Introduction

1 Estimation

1.1 Estimators

1.2 Mean squared error

1.3 Sufficiency

1.4 Likelihood

1.5 Confidence intervals

1.6 Bayesian estimation

2 Hypothesis testing

2.1 Simple hypotheses

2.2 Composite hypotheses

2.3 Tests of goodness-of-fit and independence

2.3.1 Goodness-of-fit of a fully-specified null distribution

2.3.2 Pearson’s chi-squared test

2.3.3 Testing independence in contingency tables

2.4 Tests of homogeneity, and connections to confidence intervals

2.4.1 Tests of homogeneity

2.4.2 Confidence intervals and hypothesis tests

2.5 Multivariate normal theory

2.5.1 Multivariate normal distribution

2.5.2 Normal random samples

2.6 Student’s t-distribution

3 Linear models

3.1 Linear models

3.2 Simple linear regression

3.3 Linear models with normal assumptions

3.4 The F distribution

3.5 Inference for β

3.6 Simple linear regression

3.7 Expected response at x*

3.8 Hypothesis testing

3.8.1 Hypothesis testing

3.8.2 Simple linear regression

3.8.3 One way analysis of variance with equal numbers in each group

0 Introduction

Statistics is a set of principles and procedures for gaining and processing quantitative evidence in order to help us make judgements and decisions. In this course, we focus on formal statistical inference. In the process, we assume that we have some data generated from some unknown probability model, and we aim to use the data to learn about certain properties of the underlying probability model.

In particular, we perform parametric inference. We assume that we have a random variable $X$ that follows a particular known family of distributions (e.g. Poisson distribution). However, we do not know the parameters of the distribution. We then attempt to estimate the parameter from the data given.

For example, we might know that $X \sim \mathrm{Poisson}(\mu)$ for some $\mu$, and we want to figure out what $\mu$ is.

Usually we repeat the experiment (or observation) many times. Hence we will have $X_1, X_2, \cdots, X_n$ being iid with the same distribution as $X$. We call the set $\mathbf{X} = (X_1, X_2, \cdots, X_n)$ a simple random sample. This is the data we have.

We will use the observed $\mathbf{X} = \mathbf{x}$ to make inferences about the parameter $\theta$, such as

– giving an estimate $\hat{\theta}(\mathbf{x})$ of the true value of $\theta$.

– giving an interval estimate $(\hat{\theta}_1(\mathbf{x}), \hat{\theta}_2(\mathbf{x}))$ for $\theta$.

– testing a hypothesis about $\theta$, e.g. whether $\theta = 0$.

1 Estimation

1.1 Estimators

The goal of estimation is as follows: we are given iid $X_1, \cdots, X_n$, and we know that their probability density/mass function is $f_X(x; \theta)$ for some unknown $\theta$. We know $f_X$ but not $\theta$. For example, we might know that they follow a Poisson distribution, but we do not know what the mean is. The objective is to estimate the value of $\theta$.

Definition (Statistic). A statistic is an estimate of $\theta$. It is a function $T$ of the data. If we write the data as $\mathbf{x} = (x_1, \cdots, x_n)$, then our estimate is written as $\hat{\theta} = T(\mathbf{x})$. $T(\mathbf{X})$ is an estimator of $\theta$.

The distribution of $T = T(\mathbf{X})$ is the sampling distribution of the statistic.

Note that we adopt the convention where capital $\mathbf{X}$ denotes a random variable and $\mathbf{x}$ is an observed value. So $T(\mathbf{X})$ is a random variable and $T(\mathbf{x})$ is a particular value we obtain after experiments.

Example. Let $X_1, \cdots, X_n$ be iid $N(\mu, 1)$. A possible estimator for $\mu$ is
\[ T(\mathbf{X}) = \frac{1}{n} \sum X_i. \]
Then for any particular observed sample $\mathbf{x}$, our estimate is
\[ T(\mathbf{x}) = \frac{1}{n} \sum x_i. \]
What is the sampling distribution of $T$? Recall from IA Probability that in general, if the $X_i \sim N(\mu_i, \sigma_i^2)$ are independent, then $\sum X_i \sim N(\sum \mu_i, \sum \sigma_i^2)$, which is something we can prove by considering moment-generating functions.

So we have $T(\mathbf{X}) \sim N(\mu, 1/n)$. Note that by the Central Limit Theorem, even if the $X_i$ were not normal, we still have approximately $T(\mathbf{X}) \sim N(\mu, 1/n)$ for large values of $n$, but here we get exactly the normal distribution even for small values of $n$.

The estimator $\frac{1}{n} \sum X_i$ we had above is a rather sensible estimator. Of course, we can also have silly estimators such as $T(\mathbf{X}) = X_1$, or even $T(\mathbf{X}) = 0$ always.

One way to decide if an estimator is silly is to look at its bias.

Definition (Bias). Let $\hat{\theta} = T(\mathbf{X})$ be an estimator of $\theta$. The bias of $\hat{\theta}$ is the difference between its expected value and true value.
\[ \mathrm{bias}(\hat{\theta}) = E_\theta(\hat{\theta}) - \theta. \]
Note that the subscript $\theta$ does not represent the random variable, but the thing we want to estimate. This is inconsistent with the use for, say, the probability mass function.

An estimator is unbiased if it has no bias, i.e. $E_\theta(\hat{\theta}) = \theta$.

To find out $E_\theta(T)$, we can either find the distribution of $T$ and find its expected value, or evaluate $T$ as a function of $\mathbf{X}$ directly, and find its expected value.

Example. In the above example, $E_\mu(T) = \mu$. So $T$ is unbiased for $\mu$.

1.2 Mean squared error

Given an estimator, we want to know how good the estimator is. We have just

come up with the concept of the bias above. However, this is generally not a

good measure of how good the estimator is.

For example, if we do 1000 random trials $X_1, \cdots, X_{1000}$, we can pick our estimator as $T(\mathbf{X}) = X_1$. This is an unbiased estimator, but is really bad because we have just wasted the data from the other 999 trials. On the other hand, $T'(\mathbf{X}) = 0.01 + \frac{1}{1000} \sum X_i$ is biased (with a bias of 0.01), but is in general much more trustworthy than $T$. In fact, at the end of the section, we will construct cases where the only possible unbiased estimator is a completely silly estimator to use.

Instead, a commonly used measure is the mean squared error.

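Definition (Mean squared error). The mean squared error of an estimator $\hat{\theta}$ is $E_\theta[(\hat{\theta} - \theta)^2]$.

Sometimes, we use the root mean squared error, that is the square root of the above.

We can express the mean squared error in terms of the variance and bias:
\begin{align*}
E_\theta[(\hat{\theta} - \theta)^2] &= E_\theta[(\hat{\theta} - E_\theta(\hat{\theta}) + E_\theta(\hat{\theta}) - \theta)^2] \\
&= E_\theta[(\hat{\theta} - E_\theta(\hat{\theta}))^2] + [E_\theta(\hat{\theta}) - \theta]^2 + 2[E_\theta(\hat{\theta}) - \theta] \underbrace{E_\theta[\hat{\theta} - E_\theta(\hat{\theta})]}_{0} \\
&= \mathrm{var}(\hat{\theta}) + \mathrm{bias}^2(\hat{\theta}).
\end{align*}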

If we are aiming for a low mean squared error, sometimes it could be preferable to

have a biased estimator with a lower variance. This is known as the “bias-variance

trade-off”.

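For example, suppose $X \sim \mathrm{binomial}(n, \theta)$, where $n$ is given and $\theta$ is to be determined. The standard estimator is $T_U = X/n$, which is unbiased. $T_U$ has variance
\[ \mathrm{var}_\theta(T_U) = \frac{\mathrm{var}_\theta(X)}{n^2} = \frac{\theta(1 - \theta)}{n}. \]
Hence the mean squared error of the usual estimator is given by
\[ \mathrm{mse}(T_U) = \mathrm{var}_\theta(T_U) + \mathrm{bias}^2(T_U) = \theta(1 - \theta)/n. \]
Consider an alternative estimator
\[ T_B = \frac{X + 1}{n + 2} = w \frac{X}{n} + (1 - w) \frac{1}{2}, \]
where $w = n/(n + 2)$. This can be interpreted to be a weighted average (by the sample size) of the sample mean and $1/2$. We have
\[ E_\theta(T_B) - \theta = \frac{n\theta + 1}{n + 2} - \theta = (1 - w)\left(\frac{1}{2} - \theta\right), \]
so $T_B$ is biased. The variance is given by
\[ \mathrm{var}_\theta(T_B) = \frac{\mathrm{var}_\theta(X)}{(n + 2)^2} = w^2 \frac{\theta(1 - \theta)}{n}. \]
Hence the mean squared error is
\[ \mathrm{mse}(T_B) = \mathrm{var}_\theta(T_B) + \mathrm{bias}^2(T_B) = w^2 \frac{\theta(1 - \theta)}{n} + (1 - w)^2 \left(\frac{1}{2} - \theta\right)^2. \]
We can plot the mean squared error of each estimator for possible values of $\theta$. Here we plot the case where $n = 10$.

[Figure: mse against $\theta$ (from 0 to 1) for the unbiased estimator and the biased estimator, $n = 10$.]

This biased estimator has smaller MSE unless $\theta$ has extreme values.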
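As a quick numerical check of this comparison, the following Python sketch evaluates the two mean squared error formulas for $n = 10$. The function names `mse_unbiased` and `mse_biased` are illustrative choices and do not come from the lectures.

```python
def mse_unbiased(theta, n):
    """mse of T_U = X/n, which equals its variance theta(1 - theta)/n."""
    return theta * (1 - theta) / n


def mse_biased(theta, n):
    """mse of T_B = (X + 1)/(n + 2), computed as variance plus squared bias."""
    w = n / (n + 2)
    variance = w ** 2 * theta * (1 - theta) / n
    bias = (1 - w) * (0.5 - theta)
    return variance + bias ** 2


n = 10
for theta in [0.0, 0.1, 0.2, 0.5, 0.8, 0.9, 1.0]:
    print(f"theta={theta:.1f}  unbiased={mse_unbiased(theta, n):.4f}  "
          f"biased={mse_biased(theta, n):.4f}")
```

For this value of $n$, the biased estimator should come out ahead in the middle of the range and behind near $\theta = 0$ and $\theta = 1$.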

We see that sometimes biased estimators could give better mean squared errors. In some cases, not only could unbiased estimators be worse, they could be complete nonsense.

Suppose $X \sim \mathrm{Poisson}(\lambda)$, and for whatever reason, we want to estimate $\theta = [P(X = 0)]^2 = e^{-2\lambda}$. Then any unbiased estimator $T(X)$ must satisfy $E_\theta(T(X)) = \theta$, or equivalently,
\[ E_\lambda(T(X)) = e^{-\lambda} \sum_{x = 0}^\infty T(x) \frac{\lambda^x}{x!} = e^{-2\lambda}. \]
The only function $T$ that can satisfy this equation is $T(X) = (-1)^X$.

Thus the unbiased estimator would estimate $e^{-2\lambda}$ to be 1 if $X$ is even, $-1$ if $X$ is odd. This is clearly nonsense.
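The unbiasedness claim can be checked numerically. The sketch below sums a truncated version of the series for $E_\lambda[(-1)^X]$ and prints it next to $e^{-2\lambda}$; the function name and the truncation at 100 terms are assumptions made for illustration.

```python
import math


def expected_alternating_sign(lam, terms=100):
    """Approximate E[(-1)^X] for X ~ Poisson(lam) by truncating the series."""
    total = 0.0
    pmf = math.exp(-lam)  # P(X = 0)
    for x in range(terms):
        total += (-1) ** x * pmf
        pmf *= lam / (x + 1)  # P(X = x + 1) obtained from P(X = x)
    return total


for lam in [0.5, 1.0, 2.0]:
    print(lam, expected_alternating_sign(lam), math.exp(-2 * lam))
```

The two printed columns should agree up to floating point error, even though every individual value of the estimator is either $1$ or $-1$.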

1.3 Sufficiency

Often, we do experiments just to find out the value of $\theta$. For example, we might want to estimate what proportion of the population supports some political candidate. We are seldom interested in the data points themselves, and just want to learn about the big picture. This leads us to the concept of a sufficient statistic. This is a statistic $T(\mathbf{X})$ that contains all information we have about $\theta$ in the sample.

Example. Let $X_1, \cdots, X_n$ be iid $\mathrm{Bernoulli}(\theta)$, so that $P(X_i = 1) = 1 - P(X_i = 0) = \theta$ for some $0 < \theta < 1$. So
\[ f_{\mathbf{X}}(\mathbf{x} \mid \theta) = \prod_{i = 1}^n \theta^{x_i}(1 - \theta)^{1 - x_i} = \theta^{\sum x_i}(1 - \theta)^{n - \sum x_i}. \]
This depends on the data only through $T(\mathbf{X}) = \sum x_i$, the total number of ones.

Suppose we are now given that $T(\mathbf{X}) = t$. Then what is the distribution of $\mathbf{X}$? We have
\[ f_{\mathbf{X} \mid T = t}(\mathbf{x}) = \frac{P_\theta(\mathbf{X} = \mathbf{x}, T = t)}{P_\theta(T = t)} = \frac{P_\theta(\mathbf{X} = \mathbf{x})}{P_\theta(T = t)}, \]
where the last equality holds because if $\mathbf{X} = \mathbf{x}$, then $T$ must be equal to $t$. This is equal to
\[ \frac{\theta^{\sum x_i}(1 - \theta)^{n - \sum x_i}}{\binom{n}{t} \theta^t (1 - \theta)^{n - t}} = \binom{n}{t}^{-1}. \]
So the conditional distribution of $\mathbf{X}$ given $T = t$ does not depend on $\theta$. So if we know $T$, then additional knowledge of $\mathbf{x}$ does not give more information about $\theta$.

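Definition (Sufficient statistic). A statistic $T$ is sufficient for $\theta$ if the conditional distribution of $\mathbf{X}$ given $T$ does not depend on $\theta$.

There is a convenient theorem that allows us to find sufficient statistics.

Theorem (The factorization criterion). $T$ is sufficient for $\theta$ if and only if
\[ f_{\mathbf{X}}(\mathbf{x} \mid \theta) = g(T(\mathbf{x}), \theta) h(\mathbf{x}) \]
for some functions $g$ and $h$.

Proof. We first prove the discrete case.

Suppose $f_{\mathbf{X}}(\mathbf{x} \mid \theta) = g(T(\mathbf{x}), \theta) h(\mathbf{x})$. If $T(\mathbf{x}) = t$, then
\begin{align*}
f_{\mathbf{X} \mid T = t}(\mathbf{x}) &= \frac{P_\theta(\mathbf{X} = \mathbf{x}, T(\mathbf{X}) = t)}{P_\theta(T = t)} \\
&= \frac{g(T(\mathbf{x}), \theta) h(\mathbf{x})}{\sum_{\{\mathbf{y} : T(\mathbf{y}) = t\}} g(T(\mathbf{y}), \theta) h(\mathbf{y})} \\
&= \frac{g(t, \theta) h(\mathbf{x})}{g(t, \theta) \sum h(\mathbf{y})} \\
&= \frac{h(\mathbf{x})}{\sum h(\mathbf{y})},
\end{align*}
which doesn’t depend on $\theta$. So $T$ is sufficient.

The continuous case is similar. If $f_{\mathbf{X}}(\mathbf{x} \mid \theta) = g(T(\mathbf{x}), \theta) h(\mathbf{x})$, and $T(\mathbf{x}) = t$, then
\begin{align*}
f_{\mathbf{X} \mid T = t}(\mathbf{x}) &= \frac{g(T(\mathbf{x}), \theta) h(\mathbf{x})}{\int_{\mathbf{y} : T(\mathbf{y}) = t} g(T(\mathbf{y}), \theta) h(\mathbf{y}) \, \mathrm{d}\mathbf{y}} \\
&= \frac{g(t, \theta) h(\mathbf{x})}{g(t, \theta) \int h(\mathbf{y}) \, \mathrm{d}\mathbf{y}} \\
&= \frac{h(\mathbf{x})}{\int h(\mathbf{y}) \, \mathrm{d}\mathbf{y}},
\end{align*}
which does not depend on $\theta$.

Now suppose $T$ is sufficient so that the conditional distribution of $\mathbf{X} \mid T = t$ does not depend on $\theta$. Then
\[ P_\theta(\mathbf{X} = \mathbf{x}) = P_\theta(\mathbf{X} = \mathbf{x}, T = T(\mathbf{x})) = P_\theta(\mathbf{X} = \mathbf{x} \mid T = T(\mathbf{x})) P_\theta(T = T(\mathbf{x})). \]
The first factor does not depend on $\theta$ by assumption; call it $h(\mathbf{x})$. Let the second factor be $g(t, \theta)$, and so we have the required factorisation.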

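Example. Continuing the above example,
\[ f_{\mathbf{X}}(\mathbf{x} \mid \theta) = \theta^{\sum x_i}(1 - \theta)^{n - \sum x_i}. \]
Take $g(t, \theta) = \theta^t (1 - \theta)^{n - t}$ and $h(\mathbf{x}) = 1$ to see that $T(\mathbf{X}) = \sum X_i$ is sufficient for $\theta$.

Example. Let $X_1, \cdots, X_n$ be iid $U[0, \theta]$. Write $1_{[A]}$ for the indicator function of an arbitrary set $A$. We have
\[ f_{\mathbf{X}}(\mathbf{x} \mid \theta) = \prod_{i = 1}^n \frac{1}{\theta} 1_{[0 \leq x_i \leq \theta]} = \frac{1}{\theta^n} 1_{[\max_i x_i \leq \theta]} 1_{[\min_i x_i \geq 0]}. \]
If we let $T = \max_i x_i$, then we have
\[ f_{\mathbf{X}}(\mathbf{x} \mid \theta) = \underbrace{\frac{1}{\theta^n} 1_{[t \leq \theta]}}_{g(t, \theta)} \underbrace{1_{[\min_i x_i \geq 0]}}_{h(\mathbf{x})}. \]
So $T = \max_i x_i$ is sufficient.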

Note that sufficient statistics are not unique. If $T$ is sufficient for $\theta$, then so is any 1-1 function of $T$. $\mathbf{X}$ is always sufficient for $\theta$ as well, but it is not of much use. How can we decide if a sufficient statistic is “good”?

Given any statistic $T$, we can partition the sample space $\mathcal{X}^n$ into sets $\{\mathbf{x} \in \mathcal{X}^n : T(\mathbf{x}) = t\}$. Then after an experiment, instead of recording the actual value of $\mathbf{x}$, we can simply record which set of the partition $\mathbf{x}$ falls into. If there are fewer sets in the partition than possible values of $\mathbf{x}$, then effectively there is less information we have to store.

If $T$ is sufficient, then this data reduction does not lose any information about $\theta$. The “best” sufficient statistic would be one in which we achieve the maximum possible reduction. This is known as the minimal sufficient statistic. The formal definition we take is the following:

Definition (Minimal sufficiency). A sufficient statistic $T(\mathbf{X})$ is minimal if it is a function of every other sufficient statistic, i.e. if $T'(\mathbf{X})$ is also sufficient, then $T'(\mathbf{X}) = T'(\mathbf{Y}) \Rightarrow T(\mathbf{X}) = T(\mathbf{Y})$.

Again, we have a handy theorem to find minimal statistics:

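Theorem. Suppose $T = T(\mathbf{X})$ is a statistic that satisfies
\[ \frac{f_{\mathbf{X}}(\mathbf{x}; \theta)}{f_{\mathbf{X}}(\mathbf{y}; \theta)} \text{ does not depend on } \theta \text{ if and only if } T(\mathbf{x}) = T(\mathbf{y}). \]
Then $T$ is minimal sufficient for $\theta$.

Proof. First we have to show sufficiency. We will use the factorization criterion to do so.

Firstly, for each possible $t$, pick a favorite $\mathbf{x}_t$ such that $T(\mathbf{x}_t) = t$.

Now let $\mathbf{x} \in \mathcal{X}^n$ and let $T(\mathbf{x}) = t$. So $T(\mathbf{x}) = T(\mathbf{x}_t)$. By the hypothesis, $\frac{f_{\mathbf{X}}(\mathbf{x}; \theta)}{f_{\mathbf{X}}(\mathbf{x}_t; \theta)}$ does not depend on $\theta$. Let this be $h(\mathbf{x})$. Let $g(t, \theta) = f_{\mathbf{X}}(\mathbf{x}_t; \theta)$. Then
\[ f_{\mathbf{X}}(\mathbf{x}; \theta) = f_{\mathbf{X}}(\mathbf{x}_t; \theta) \frac{f_{\mathbf{X}}(\mathbf{x}; \theta)}{f_{\mathbf{X}}(\mathbf{x}_t; \theta)} = g(t, \theta) h(\mathbf{x}). \]
So $T$ is sufficient for $\theta$.

To show that this is minimal, suppose that $S(\mathbf{X})$ is also sufficient. By the factorization criterion, there exist functions $g_S$ and $h_S$ such that
\[ f_{\mathbf{X}}(\mathbf{x}; \theta) = g_S(S(\mathbf{x}), \theta) h_S(\mathbf{x}). \]
Now suppose that $S(\mathbf{x}) = S(\mathbf{y})$. Then
\[ \frac{f_{\mathbf{X}}(\mathbf{x}; \theta)}{f_{\mathbf{X}}(\mathbf{y}; \theta)} = \frac{g_S(S(\mathbf{x}), \theta) h_S(\mathbf{x})}{g_S(S(\mathbf{y}), \theta) h_S(\mathbf{y})} = \frac{h_S(\mathbf{x})}{h_S(\mathbf{y})}. \]
This means that the ratio $\frac{f_{\mathbf{X}}(\mathbf{x}; \theta)}{f_{\mathbf{X}}(\mathbf{y}; \theta)}$ does not depend on $\theta$. By the hypothesis, this implies that $T(\mathbf{x}) = T(\mathbf{y})$. So we know that $S(\mathbf{x}) = S(\mathbf{y})$ implies $T(\mathbf{x}) = T(\mathbf{y})$. So $T$ is a function of $S$. So $T$ is minimal sufficient.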

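Example. Suppose $X_1, \cdots, X_n$ are iid $N(\mu, \sigma^2)$. Then
\begin{align*}
\frac{f_{\mathbf{X}}(\mathbf{x} \mid \mu, \sigma^2)}{f_{\mathbf{X}}(\mathbf{y} \mid \mu, \sigma^2)} &= \frac{(2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2} \sum_i (x_i - \mu)^2\right\}}{(2\pi\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2} \sum_i (y_i - \mu)^2\right\}} \\
&= \exp\left\{-\frac{1}{2\sigma^2} \left(\sum_i x_i^2 - \sum_i y_i^2\right) + \frac{\mu}{\sigma^2} \left(\sum_i x_i - \sum_i y_i\right)\right\}.
\end{align*}
This is a constant function of $(\mu, \sigma^2)$ iff $\sum_i x_i^2 = \sum_i y_i^2$ and $\sum_i x_i = \sum_i y_i$. So $T(\mathbf{X}) = (\sum_i X_i^2, \sum_i X_i)$ is minimal sufficient for $(\mu, \sigma^2)$.

As mentioned, sufficient statistics allow us to store the results of our experiments in the most efficient way. It turns out if we have a minimal sufficient statistic, then we can use it to improve any estimator.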

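Theorem (Rao-Blackwell Theorem). Let $T$ be a sufficient statistic for $\theta$ and let $\tilde{\theta}$ be an estimator for $\theta$ with $E(\tilde{\theta}^2) < \infty$ for all $\theta$. Let $\hat{\theta}(\mathbf{x}) = E[\tilde{\theta}(\mathbf{X}) \mid T(\mathbf{X}) = T(\mathbf{x})]$. Then for all $\theta$,
\[ E[(\hat{\theta} - \theta)^2] \leq E[(\tilde{\theta} - \theta)^2]. \]
The inequality is strict unless $\tilde{\theta}$ is a function of $T$.

Here we have to be careful with our definition of $\hat{\theta}$. It is defined as the expected value of $\tilde{\theta}(\mathbf{X})$, and this could potentially depend on the actual value of $\theta$. Fortunately, since $T$ is sufficient for $\theta$, the conditional distribution of $\mathbf{X}$ given $T = t$ does not depend on $\theta$. Hence $\hat{\theta} = E[\tilde{\theta}(\mathbf{X}) \mid T]$ does not depend on $\theta$, and so is a genuine estimator. In fact, the proof does not use that $T$ is sufficient; we only require it to be sufficient so that we can compute $\hat{\theta}$.

Using this theorem, given any estimator, we can find one that is a function of a sufficient statistic and is at least as good in terms of mean squared error of estimation. Moreover, if the original estimator $\tilde{\theta}$ is unbiased, so is the new $\hat{\theta}$. Also, if $\tilde{\theta}$ is already a function of $T$, then $\hat{\theta} = \tilde{\theta}$.

Proof. By the conditional expectation formula, we have $E(\hat{\theta}) = E[E(\tilde{\theta} \mid T)] = E(\tilde{\theta})$. So they have the same bias.

By the conditional variance formula,
\[ \mathrm{var}(\tilde{\theta}) = E[\mathrm{var}(\tilde{\theta} \mid T)] + \mathrm{var}[E(\tilde{\theta} \mid T)] = E[\mathrm{var}(\tilde{\theta} \mid T)] + \mathrm{var}(\hat{\theta}). \]
Hence $\mathrm{var}(\tilde{\theta}) \geq \mathrm{var}(\hat{\theta})$. So $\mathrm{mse}(\tilde{\theta}) \geq \mathrm{mse}(\hat{\theta})$, with equality only if $\mathrm{var}(\tilde{\theta} \mid T) = 0$.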

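Example. Suppose $X_1, \cdots, X_n$ are iid $\mathrm{Poisson}(\lambda)$, and let $\theta = e^{-\lambda}$, which is the probability that $X_1 = 0$. Then
\[ p_{\mathbf{X}}(\mathbf{x} \mid \lambda) = \frac{e^{-n\lambda} \lambda^{\sum x_i}}{\prod x_i!}, \quad \text{so} \quad p_{\mathbf{X}}(\mathbf{x} \mid \theta) = \frac{\theta^n (-\log \theta)^{\sum x_i}}{\prod x_i!}. \]
We see that $T = \sum X_i$ is sufficient for $\theta$, and $\sum X_i \sim \mathrm{Poisson}(n\lambda)$.

We start with an easy estimator $\tilde{\theta} = 1_{X_1 = 0}$, which is unbiased (i.e. if we observe nothing in the first observation period, we assume the event is impossible). Then
\begin{align*}
E[\tilde{\theta} \mid T = t] &= P\left(X_1 = 0 \,\middle|\, \sum_{1}^n X_i = t\right) \\
&= \frac{P(X_1 = 0) P\left(\sum_{2}^n X_i = t\right)}{P\left(\sum_{1}^n X_i = t\right)} \\
&= \left(\frac{n - 1}{n}\right)^t.
\end{align*}
So $\hat{\theta} = (1 - 1/n)^{\sum x_i}$. This is approximately $(1 - 1/n)^{n\bar{X}} \approx e^{-\bar{X}} = e^{-\hat{\lambda}}$, which makes sense.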
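The conditional probability in this example can be verified directly from Poisson mass functions. The sketch below is a minimal check under the assumptions of the example; the helper names (`poisson_pmf`, `conditional_prob_first_zero`, `rao_blackwell_estimate`) and the sample values are made up for illustration.

```python
import math


def poisson_pmf(k, mean):
    """P(Y = k) for Y ~ Poisson(mean)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def conditional_prob_first_zero(t, n, lam):
    """P(X_1 = 0 | X_1 + ... + X_n = t) for iid Poisson(lam) observations."""
    numerator = poisson_pmf(0, lam) * poisson_pmf(t, (n - 1) * lam)
    return numerator / poisson_pmf(t, n * lam)


def rao_blackwell_estimate(xs):
    """The improved estimator (1 - 1/n)^(sum of observations) of exp(-lam)."""
    n = len(xs)
    return (1 - 1 / n) ** sum(xs)


n, lam = 5, 1.3
for t in range(4):
    # The two columns should agree, and neither depends on lam.
    print(t, conditional_prob_first_zero(t, n, lam), ((n - 1) / n) ** t)
print(rao_blackwell_estimate([0, 2, 1, 1, 3]))
```

Changing `lam` should leave the conditional probabilities unchanged, which is exactly what sufficiency of $\sum X_i$ says.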

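Example. Let $X_1, \cdots, X_n$ be iid $U[0, \theta]$, and suppose that we want to estimate $\theta$. We have shown above that $T = \max X_i$ is sufficient for $\theta$. Let $\tilde{\theta} = 2X_1$, an unbiased estimator. Then
\begin{align*}
E[\tilde{\theta} \mid T = t] &= 2E[X_1 \mid \max X_i = t] \\
&= 2E[X_1 \mid \max X_i = t, X_1 = \max X_i] P(X_1 = \max X_i) \\
&\quad + 2E[X_1 \mid \max X_i = t, X_1 \neq \max X_i] P(X_1 \neq \max X_i) \\
&= 2\left(t \times \frac{1}{n} + \frac{t}{2} \times \frac{n - 1}{n}\right) \\
&= \frac{n + 1}{n} t.
\end{align*}
So $\hat{\theta} = \frac{n + 1}{n} \max X_i$ is our new estimator.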

1.4 Likelihood

There are many different estimators we can pick, and we have just come up with

some criteria to determine whether an estimator is “good”. However, these do

not give us a systematic way of coming up with an estimator to actually use. In

practice, we often use the maximum likelihood estimator.

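Let $X_1, \cdots, X_n$ be random variables with joint pdf/pmf $f_{\mathbf{X}}(\mathbf{x} \mid \theta)$. We observe $\mathbf{X} = \mathbf{x}$.

Definition (Likelihood). For any given $\mathbf{x}$, the likelihood of $\theta$ is $\mathrm{like}(\theta) = f_{\mathbf{X}}(\mathbf{x} \mid \theta)$, regarded as a function of $\theta$. The maximum likelihood estimator (mle) of $\theta$ is an estimator that picks the value of $\theta$ that maximizes $\mathrm{like}(\theta)$.

Often there is no closed form for the mle, and we have to find $\hat{\theta}$ numerically.

When we can find the mle explicitly, in practice, we often maximize the log-likelihood instead of the likelihood. In particular, if $X_1, \cdots, X_n$ are iid, each with pdf/pmf $f_X(x \mid \theta)$, then
\begin{align*}
\mathrm{like}(\theta) &= \prod_{i = 1}^n f_X(x_i \mid \theta), \\
\log \mathrm{like}(\theta) &= \sum_{i = 1}^n \log f_X(x_i \mid \theta).
\end{align*}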

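Example. Let $X_1, \cdots, X_n$ be iid $\mathrm{Bernoulli}(p)$. Then
\[ l(p) = \log \mathrm{like}(p) = \left(\sum x_i\right) \log p + \left(n - \sum x_i\right) \log(1 - p). \]
Thus
\[ \frac{\mathrm{d}l}{\mathrm{d}p} = \frac{\sum x_i}{p} - \frac{n - \sum x_i}{1 - p}. \]
This is zero when $p = \sum x_i/n$. So this is the maximum likelihood estimator (and is unbiased).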

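Example. Let $X_1, \cdots, X_n$ be iid $N(\mu, \sigma^2)$, and we want to estimate $\theta = (\mu, \sigma^2)$. Then
\[ l(\mu, \sigma^2) = \log \mathrm{like}(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum (x_i - \mu)^2. \]
This is maximized when
\[ \frac{\partial l}{\partial \mu} = \frac{\partial l}{\partial \sigma^2} = 0. \]
We have
\[ \frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2} \sum (x_i - \mu), \quad \frac{\partial l}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum (x_i - \mu)^2. \]
So the solution, hence maximum likelihood estimator, is $(\hat{\mu}, \hat{\sigma}^2) = (\bar{x}, S_{xx}/n)$, where $\bar{x} = \frac{1}{n} \sum x_i$ and $S_{xx} = \sum (x_i - \bar{x})^2$.

We shall see later that $S_{XX}/\sigma^2 = n\hat{\sigma}^2/\sigma^2 \sim \chi^2_{n - 1}$, and so $E(\hat{\sigma}^2) = \frac{(n - 1)\sigma^2}{n}$, i.e. $\hat{\sigma}^2$ is biased.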
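To make the normal mle concrete, here is a small Python sketch that computes $(\bar{x}, S_{xx}/n)$ for a sample, together with the rescaled variance estimate $S_{xx}/(n - 1)$ that removes the bias. The function names and the sample values are invented for illustration.

```python
def normal_mle(xs):
    """Return (mu_hat, sigma2_hat) = (x_bar, S_xx / n) for an iid normal sample."""
    n = len(xs)
    mu_hat = sum(xs) / n
    s_xx = sum((x - mu_hat) ** 2 for x in xs)
    return mu_hat, s_xx / n


def unbiased_variance(xs):
    """S_xx / (n - 1), i.e. the mle of the variance rescaled by n / (n - 1)."""
    n = len(xs)
    _, sigma2_hat = normal_mle(xs)
    return sigma2_hat * n / (n - 1)


sample = [4.1, 5.3, 3.8, 6.0, 4.8]  # made-up data for illustration
print(normal_mle(sample))
print(unbiased_variance(sample))
```

The second printed value is always larger than the mle of the variance, by the factor $n/(n - 1)$.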

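Example (German tank problem). Suppose the American army discovers some German tanks that are sequentially numbered, i.e. the first tank is numbered 1, the second is numbered 2, etc. Then if $\theta$ tanks are produced, the probability distribution of the tank number is $U(0, \theta)$. Suppose we have discovered $n$ tanks whose numbers are $x_1, x_2, \cdots, x_n$, and we want to estimate $\theta$, the total number of tanks produced. We want to find the maximum likelihood estimator.

Then
\[ \mathrm{like}(\theta) = \frac{1}{\theta^n} 1_{[\max x_i \leq \theta]} 1_{[\min x_i \geq 0]}. \]
So for $\theta \geq \max x_i$, $\mathrm{like}(\theta) = 1/\theta^n$ and is decreasing as $\theta$ increases, while for $\theta < \max x_i$, $\mathrm{like}(\theta) = 0$. Hence the value $\hat{\theta} = \max x_i$ maximizes the likelihood.

Is $\hat{\theta}$ unbiased? First we need to find the distribution of $\hat{\theta}$. For $0 \leq t \leq \theta$, the cumulative distribution function of $\hat{\theta}$ is
\[ F_{\hat{\theta}}(t) = P(\hat{\theta} \leq t) = P(X_i \leq t \text{ for all } i) = (P(X_i \leq t))^n = \left(\frac{t}{\theta}\right)^n. \]
Differentiating with respect to $t$, we find the pdf $f_{\hat{\theta}}(t) = \frac{n t^{n - 1}}{\theta^n}$. Hence
\[ E(\hat{\theta}) = \int_0^\theta t \frac{n t^{n - 1}}{\theta^n} \, \mathrm{d}t = \frac{n\theta}{n + 1}. \]
So $\hat{\theta}$ is biased, but asymptotically unbiased.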

Example. Smarties come in $k$ equally frequent colours, but suppose we do not know $k$. Assume that we sample with replacement (although this is unhygienic).

Our first four Smarties are Red, Purple, Red, Yellow. Then
\begin{align*}
\mathrm{like}(k) &= P_k(\text{1st is a new colour}) P_k(\text{2nd is a new colour}) P_k(\text{3rd matches 1st}) P_k(\text{4th is a new colour}) \\
&= 1 \times \frac{k - 1}{k} \times \frac{1}{k} \times \frac{k - 2}{k} \\
&= \frac{(k - 1)(k - 2)}{k^3}.
\end{align*}
The maximum likelihood estimate is $k = 5$ (by trial and error), even though it is not much likelier than the other values.
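The “trial and error” in the Smarties example is easy to automate. The sketch below evaluates $\mathrm{like}(k) = (k - 1)(k - 2)/k^3$ exactly over a range of candidate values; the search range of 3 to 50 is an assumption made for illustration, and the function name is not from the lectures.

```python
from fractions import Fraction


def smarties_likelihood(k):
    """like(k) = (k - 1)(k - 2) / k^3 for the sequence Red, Purple, Red, Yellow."""
    return Fraction((k - 1) * (k - 2), k ** 3)


# Three distinct colours were observed, so k must be at least 3.
candidates = range(3, 51)
k_hat = max(candidates, key=smarties_likelihood)

for k in range(3, 9):
    print(k, float(smarties_likelihood(k)))
print("mle:", k_hat)
```

The printed likelihoods are close to each other near the maximum, which is the sense in which $k = 5$ is not much likelier than its neighbours.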

How does the mle relate to sufficient statistics? Suppose that $T$ is sufficient for $\theta$. Then the likelihood is $g(T(\mathbf{x}), \theta) h(\mathbf{x})$, which depends on $\theta$ through $T(\mathbf{x})$. To maximise this as a function of $\theta$, we only need to maximize $g$. So the mle $\hat{\theta}$ is a function of the sufficient statistic.

Note that if $\phi = h(\theta)$ with $h$ injective, then the mle of $\phi$ is given by $h(\hat{\theta})$. For example, if the mle of the standard deviation $\sigma$ is $\hat{\sigma}$, then the mle of the variance $\sigma^2$ is $\hat{\sigma}^2$. This is rather useful in practice, since we can use this to simplify a lot of computations.

1.5 Confidence intervals

Definition. A $100\gamma\%$ ($0 < \gamma < 1$) confidence interval for $\theta$ is a random interval $(A(\mathbf{X}), B(\mathbf{X}))$ such that $P(A(\mathbf{X}) < \theta < B(\mathbf{X})) = \gamma$, no matter what the true value of $\theta$ may be.

It is also possible to have confidence intervals for vector parameters.

Notice that it is the endpoints of the interval that are random quantities, while $\theta$ is a fixed constant we want to find out.

We can interpret this in terms of repeat sampling. If we calculate $(A(\mathbf{x}), B(\mathbf{x}))$ for a large number of samples $\mathbf{x}$, then approximately $100\gamma\%$ of them will cover the true value of $\theta$.

It is important to know that having observed some data $\mathbf{x}$ and calculated a 95% confidence interval, we cannot say that $\theta$ has a 95% chance of being within the interval. Apart from the standard objection that $\theta$ is a fixed value and either is or is not in the interval, and hence we cannot assign probabilities to this event, we will later construct an example where even though we have got a 50% confidence interval, we are 100% sure that $\theta$ lies in that interval.

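Example. Suppose $X_1, \cdots, X_n$ are iid $N(\theta, 1)$. Find a 95% confidence interval for $\theta$.

We know $\bar{X} \sim N(\theta, \frac{1}{n})$, so that $\sqrt{n}(\bar{X} - \theta) \sim N(0, 1)$.

Let $z_1, z_2$ be such that $\Phi(z_2) - \Phi(z_1) = 0.95$, where $\Phi$ is the standard normal distribution function.

We have $P[z_1 < \sqrt{n}(\bar{X} - \theta) < z_2] = 0.95$, which can be rearranged to give
\[ P\left[\bar{X} - \frac{z_2}{\sqrt{n}} < \theta < \bar{X} - \frac{z_1}{\sqrt{n}}\right] = 0.95, \]
so we obtain the following 95% confidence interval:
\[ \left(\bar{X} - \frac{z_2}{\sqrt{n}}, \bar{X} - \frac{z_1}{\sqrt{n}}\right). \]
There are many possible choices for $z_1$ and $z_2$. Since the $N(0, 1)$ density is symmetric, the shortest such interval is obtained by $z_2 = z_{0.025} = -z_1$. We can also choose other values such as $z_1 = -\infty$, $z_2 = 1.64$, but we usually choose symmetric end points.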

The above example illustrates a common procedure for finding confidence intervals:

– Find a quantity $R(\mathbf{X}, \theta)$ such that the $P_\theta$-distribution of $R(\mathbf{X}, \theta)$ does not depend on $\theta$. This is called a pivot. In our example, $R(\mathbf{X}, \theta) = \sqrt{n}(\bar{X} - \theta)$.

– Write down a probability statement of the form $P_\theta(c_1 < R(\mathbf{X}, \theta) < c_2) = \gamma$.

– Rearrange the inequalities inside $P(\ldots)$ to find the interval.

Usually $c_1, c_2$ are percentage points from a known standardised distribution, often equitailed. For example, we pick the 2.5% and 97.5% points for a 95% confidence interval. We could also use, say, 0% and 95%, but this generally results in a wider interval.

Note that if $(A(\mathbf{X}), B(\mathbf{X}))$ is a $100\gamma\%$ confidence interval for $\theta$, and $T(\theta)$ is a monotone increasing function of $\theta$, then $(T(A(\mathbf{X})), T(B(\mathbf{X})))$ is a $100\gamma\%$ confidence interval for $T(\theta)$.

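Example. Suppose $X_1, \cdots, X_{50}$ are iid $N(0, \sigma^2)$. Find a 99% confidence interval for $\sigma^2$.

We know that $X_i/\sigma \sim N(0, 1)$. So $\frac{1}{\sigma^2} \sum_{i = 1}^{50} X_i^2 \sim \chi^2_{50}$.

So $R(\mathbf{X}, \sigma^2) = \sum_{i = 1}^{50} X_i^2/\sigma^2$ is a pivot.

Recall that $\chi^2_n(\alpha)$ is the upper $100\alpha\%$ point of $\chi^2_n$, i.e.
\[ P(\chi^2_n \leq \chi^2_n(\alpha)) = 1 - \alpha. \]
So we have $c_1 = \chi^2_{50}(0.995) = 27.99$ and $c_2 = \chi^2_{50}(0.005) = 79.49$. So
\[ P\left(c_1 < \frac{\sum X_i^2}{\sigma^2} < c_2\right) = 0.99, \]
and hence
\[ P\left(\frac{\sum X_i^2}{c_2} < \sigma^2 < \frac{\sum X_i^2}{c_1}\right) = 0.99. \]
Using the remark above, we know that a 99% confidence interval for $\sigma$ is
\[ \left(\sqrt{\frac{\sum X_i^2}{c_2}}, \sqrt{\frac{\sum X_i^2}{c_1}}\right). \]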

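Example. Suppose $X_1, \cdots, X_n$ are iid $\mathrm{Bernoulli}(p)$. Find an approximate confidence interval for $p$.

The mle of $p$ is $\hat{p} = \sum X_i/n$.

By the Central Limit theorem, $\hat{p}$ is approximately $N(p, p(1 - p)/n)$ for large $n$. So $\frac{\sqrt{n}(\hat{p} - p)}{\sqrt{p(1 - p)}}$ is approximately $N(0, 1)$ for large $n$. So letting $z_{(1 - \gamma)/2}$ be the solution to $\Phi(z_{(1 - \gamma)/2}) - \Phi(-z_{(1 - \gamma)/2}) = \gamma$, we have
\[ P\left(\hat{p} - z_{(1 - \gamma)/2} \sqrt{\frac{p(1 - p)}{n}} < p < \hat{p} + z_{(1 - \gamma)/2} \sqrt{\frac{p(1 - p)}{n}}\right) \approx \gamma. \]
But $p$ is unknown! So we approximate it by $\hat{p}$ to get a confidence interval for $p$ when $n$ is large:
\[ P\left(\hat{p} - z_{(1 - \gamma)/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} < p < \hat{p} + z_{(1 - \gamma)/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right) \approx \gamma. \]
Note that we have made a lot of approximations here, but it would be difficult to do better than this.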

Example. Suppose an opinion poll says 20% of the people are going to vote UKIP, based on a random sample of 1,000 people. What might the true proportion be?

We assume we have an observation of $x = 200$ from a $\mathrm{binomial}(n, p)$ distribution with $n = 1{,}000$. Then $\hat{p} = x/n = 0.2$ is an unbiased estimate, and also the mle.

Now $\mathrm{var}(X/n) = \frac{p(1 - p)}{n} \approx \frac{\hat{p}(1 - \hat{p})}{n} = 0.00016$. So a 95% confidence interval is
\[ \left(\hat{p} - 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}, \hat{p} + 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right) = 0.20 \pm 1.96 \times 0.013 = (0.175, 0.225). \]
If we don’t want to make that many approximations, we can note that $p(1 - p) \leq 1/4$ for all $0 \leq p \leq 1$. So a conservative 95% interval is $\hat{p} \pm 1.96 \sqrt{1/4n} \approx \hat{p} \pm \sqrt{1/n}$. So whatever proportion is reported, it will be ‘accurate’ to $\pm 1/\sqrt{n}$.
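The arithmetic in the opinion poll example can be reproduced with a few lines of Python. This is a minimal sketch of the two approximate intervals discussed here; the function names `approx_interval` and `conservative_interval` are illustrative and not part of the lectures.

```python
import math


def approx_interval(x, n, z=1.96):
    """Approximate interval p_hat +/- z * sqrt(p_hat (1 - p_hat) / n)."""
    p_hat = x / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def conservative_interval(x, n, z=1.96):
    """Interval using the bound p(1 - p) <= 1/4 in place of p_hat (1 - p_hat)."""
    p_hat = x / n
    half_width = z * math.sqrt(1 / (4 * n))
    return p_hat - half_width, p_hat + half_width


print(approx_interval(200, 1000))        # roughly (0.175, 0.225)
print(conservative_interval(200, 1000))  # slightly wider
```

The conservative interval is wider because it replaces $\hat{p}(1 - \hat{p}) = 0.16$ by its largest possible value $0.25$.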

Example. Suppose $X_1, X_2$ are iid from $U(\theta - 1/2, \theta + 1/2)$. What is a sensible 50% confidence interval for $\theta$?

We know that each $X_i$ is equally likely to be less than $\theta$ or greater than $\theta$. So there is a 50% chance that we get one observation on each side, i.e.
\[ P_\theta(\min(X_1, X_2) \leq \theta \leq \max(X_1, X_2)) = \frac{1}{2}. \]
So $(\min(X_1, X_2), \max(X_1, X_2))$ is a 50% confidence interval for $\theta$.

But suppose after the experiment, we obtain $|x_1 - x_2| \geq \frac{1}{2}$. For example, we might get $x_1 = 0.2$, $x_2 = 0.9$. Then we know that, in this particular case, $\theta$ must lie in $(\min(x_1, x_2), \max(x_1, x_2))$, and we don’t have just 50% “confidence”!

This is why after we calculate a confidence interval, we should not say “there is a $100(1 - \alpha)\%$ chance that $\theta$ lies in here”. The confidence interval just says that “if we keep making these intervals, $100(1 - \alpha)\%$ of them will contain $\theta$”. But if we have calculated a particular confidence interval, the probability that that particular interval contains $\theta$ is not $100(1 - \alpha)\%$.
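The repeat-sampling interpretation in the uniform example can be illustrated by simulation. The following sketch draws many pairs from $U(\theta - 1/2, \theta + 1/2)$ and records how often the interval between them covers $\theta$, first over all samples and then only over samples with $|x_1 - x_2| \geq 1/2$. The function name, the seed, the number of trials and the value $\theta = 3$ are arbitrary assumptions.

```python
import random


def coverage(theta, trials, min_gap=0.0, seed=0):
    """Fraction of samples with |x1 - x2| >= min_gap whose interval
    (min(x1, x2), max(x1, x2)) contains theta."""
    rng = random.Random(seed)
    covered = kept = 0
    for _ in range(trials):
        x1 = rng.uniform(theta - 0.5, theta + 0.5)
        x2 = rng.uniform(theta - 0.5, theta + 0.5)
        if abs(x1 - x2) < min_gap:
            continue  # skip samples that are not spread widely enough
        kept += 1
        if min(x1, x2) <= theta <= max(x1, x2):
            covered += 1
    return covered / kept


print(coverage(theta=3.0, trials=20000))               # all samples
print(coverage(theta=3.0, trials=20000, min_gap=0.5))  # widely spread samples only
```

Over all samples the coverage should be close to one half, while among the widely spread samples every interval contains $\theta$.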

1.6 Bayesian estimation

So far we have seen the frequentist approach to statistical inference, i.e. inferential statements about $\theta$ are interpreted in terms of repeat sampling. For example, the percentage confidence in a confidence interval is the probability that the interval will contain $\theta$, not the probability that $\theta$ lies in the interval.

In contrast, the Bayesian approach treats $\theta$ as a random variable taking values in $\Theta$. The investigator’s information and beliefs about the possible values of $\theta$ before any observation of data are summarised by a prior distribution $\pi(\theta)$. When $\mathbf{X} = \mathbf{x}$ are observed, the extra information about $\theta$ is combined with the prior to obtain the posterior distribution $\pi(\theta \mid \mathbf{x})$ for $\theta$ given $\mathbf{X} = \mathbf{x}$.

There has been a long-running argument between the two approaches. Recently, things have settled down, and Bayesian methods are seen to be appropriate in huge numbers of applications where one seeks to assess a probability about a “state of the world”. For example, spam filters will assess the probability that a specific email is spam, even though from a frequentist’s point of view, this is nonsense, because the email either is or is not spam, and it makes no sense to assign a probability to the email’s being spam.

In Bayesian inference, we usually have some prior knowledge about the distribution of $\theta$ (e.g. between 0 and 1). After collecting some data, we will find a posterior distribution of $\theta$ given $\mathbf{X} = \mathbf{x}$.

Definition (Prior and posterior distribution). The prior distribution of $\theta$ is the probability distribution of the value of $\theta$ before conducting the experiment. We usually write it as $\pi(\theta)$.

The posterior distribution of $\theta$ is the probability distribution of the value of $\theta$ given an outcome of the experiment $\mathbf{x}$. We write it as $\pi(\theta \mid \mathbf{x})$.

By Bayes’ theorem, the distributions are related by
\[ \pi(\theta \mid \mathbf{x}) = \frac{f_{\mathbf{X}}(\mathbf{x} \mid \theta) \pi(\theta)}{f_{\mathbf{X}}(\mathbf{x})}. \]
Thus
\begin{align*}
\pi(\theta \mid \mathbf{x}) &\propto f_{\mathbf{X}}(\mathbf{x} \mid \theta) \pi(\theta), \\
\text{posterior} &\propto \text{likelihood} \times \text{prior},
\end{align*}
where the constant of proportionality is chosen to make the total mass of the posterior distribution equal to one. Usually, we use this form, instead of attempting to calculate $f_{\mathbf{X}}(\mathbf{x})$.

It should be clear that the data enters through the likelihood, so the inference is automatically based on any sufficient statistic.

Example. Suppose I have 3 coins in my pocket. One is 3 : 1 in favour of tails, one is a fair coin, and one is 3 : 1 in favour of heads.

I randomly select one coin and flip it once, observing a head. What is the probability that I have chosen coin 3?

Let $X = 1$ denote the event that I observe a head, $X = 0$ if a tail. Let $\theta$ denote the probability of a head. So $\theta$ is either 0.25, 0.5 or 0.75.

Our prior distribution is $\pi(\theta = 0.25) = \pi(\theta = 0.5) = \pi(\theta = 0.75) = 1/3$.

The probability mass function is $f_X(x \mid \theta) = \theta^x (1 - \theta)^{1 - x}$. So we have the following results:

  $\theta$   $\pi(\theta)$   $f_X(x = 1 \mid \theta)$   $f_X(x = 1 \mid \theta)\pi(\theta)$   $\pi(\theta \mid x)$
  0.25       0.33            0.25                       0.0825                                0.167
  0.50       0.33            0.50                       0.1650                                0.333
  0.75       0.33            0.75                       0.2475                                0.500
  Sum        1.00            1.50                       0.4950                                1.000

So if we observe a head, then there is now a 50% chance that we have picked the third coin.
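The table in the three-coin example is just “posterior ∝ likelihood × prior” followed by normalisation. A minimal Python sketch of that calculation for a finite set of parameter values is given below; the function name `posterior` and the dictionary layout are assumptions made for illustration.

```python
def posterior(prior, likelihood):
    """Normalise prior * likelihood over a finite set of parameter values."""
    unnormalised = {theta: prior[theta] * likelihood[theta] for theta in prior}
    total = sum(unnormalised.values())
    return {theta: value / total for theta, value in unnormalised.items()}


thetas = [0.25, 0.5, 0.75]
prior = {theta: 1 / 3 for theta in thetas}
likelihood_head = {theta: theta for theta in thetas}  # f(x = 1 | theta) = theta

for theta, prob in posterior(prior, likelihood_head).items():
    print(theta, round(prob, 3))
```

Using exact thirds for the prior rather than the rounded 0.33 gives the same posterior, since the common factor cancels in the normalisation.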

Example. Suppose we are interested in the true mortality risk $\theta$ in a hospital $H$ which is about to try a new operation. On average in the country, around 10% of the people die, but mortality rates in different hospitals vary from around 3% to around 20%. Hospital $H$ has no deaths in their first 10 operations. What should we believe about $\theta$?

Let $X_i = 1$ if the $i$th patient in $H$ dies. Then
\[ f_{\mathbf{X}}(\mathbf{x} \mid \theta) = \theta^{\sum x_i}(1 - \theta)^{n - \sum x_i}. \]
Suppose a priori that $\theta \sim \mathrm{beta}(a, b)$ for some $a > 0$, $b > 0$, so that
\[ \pi(\theta) \propto \theta^{a - 1}(1 - \theta)^{b - 1}. \]
Then the posterior is
\[ \pi(\theta \mid \mathbf{x}) \propto f_{\mathbf{X}}(\mathbf{x} \mid \theta) \pi(\theta) \propto \theta^{\sum x_i + a - 1}(1 - \theta)^{n - \sum x_i + b - 1}. \]
We recognize this as $\mathrm{beta}(\sum x_i + a, n - \sum x_i + b)$. So
\[ \pi(\theta \mid \mathbf{x}) = \frac{\theta^{\sum x_i + a - 1}(1 - \theta)^{n - \sum x_i + b - 1}}{B(\sum x_i + a, n - \sum x_i + b)}. \]
In practice, we need to find a beta prior distribution that matches our information from other hospitals. It turns out that the $\mathrm{beta}(a = 3, b = 27)$ prior distribution has mean 0.1 and $P(0.03 < \theta < 0.20) = 0.9$.

Then we observe data $\sum x_i = 0$, $n = 10$. So the posterior is $\mathrm{beta}(\sum x_i + a, n - \sum x_i + b) = \mathrm{beta}(3, 37)$. This has a mean of $3/40 = 0.075$.

This leads to a different conclusion than a frequentist analysis. Since nobody has died so far, the mle is 0, which does not seem plausible. Using a Bayesian approach, we have a higher mean than 0 because we take into account the data from other hospitals.
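The beta update in the hospital example amounts to adding the number of deaths to the first parameter and the number of survivals to the second. The sketch below reproduces the numbers quoted in the example; the function names are illustrative assumptions.

```python
def beta_posterior(a, b, successes, n):
    """Parameters of the beta posterior after `successes` ones in n Bernoulli trials."""
    return a + successes, b + n - successes


def beta_mean(a, b):
    """Mean a / (a + b) of a beta(a, b) distribution."""
    return a / (a + b)


a_post, b_post = beta_posterior(a=3, b=27, successes=0, n=10)
print(a_post, b_post)             # posterior parameters
print(beta_mean(3, 27))           # prior mean
print(beta_mean(a_post, b_post))  # posterior mean
```

The posterior mean lies between the mle (here 0) and the prior mean, and moves towards the mle as more operations are observed.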

For this problem, a beta prior leads to a beta posterior. We say that the beta family is a conjugate family of prior distributions for Bernoulli samples.

Suppose that $a = b = 1$ so that $\pi(\theta) = 1$ for $0 < \theta < 1$, i.e. the uniform distribution. Then the posterior is $\mathrm{beta}(\sum x_i + 1, n - \sum x_i + 1)$, with properties

              mean                          mode              variance
  prior       $1/2$                         non-unique        $1/12$
  posterior   $\frac{\sum x_i + 1}{n + 2}$   $\frac{\sum x_i}{n}$   $\frac{(\sum x_i + 1)(n - \sum x_i + 1)}{(n + 2)^2 (n + 3)}$

Note that the mode of the posterior is the mle.

The posterior mean estimator, $\frac{\sum x_i + 1}{n + 2}$, was discussed in Section 1.2, where we showed that this estimator had smaller mse than the mle for non-extreme values of $\theta$. This is known as Laplace’s estimator.

The posterior variance is bounded above by $1/(4(n + 3))$, and this is smaller than the prior variance, and is smaller for larger $n$.

Again, note that the posterior automatically depends on the data through the sufficient statistic.

After we come up with the posterior distribution, we have to decide what

estimator to use. In the case above, we used the posterior mean, but this might

not be the best estimator.

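To determine what is the “best” estimator, we first need a loss function. Let $L(\theta, a)$ be the loss incurred in estimating the value of a parameter to be $a$ when the true value is $\theta$.

Common loss functions are quadratic loss $L(\theta, a) = (\theta - a)^2$ and absolute error loss $L(\theta, a) = |\theta - a|$, but we can have others.

When our estimate is $a$, the expected posterior loss is
\[ h(a) = \int L(\theta, a) \pi(\theta \mid \mathbf{x}) \, \mathrm{d}\theta. \]
Definition (Bayes estimator). The Bayes estimator $\hat{\theta}$ is the estimator that minimises the expected posterior loss.

For quadratic loss,
\[ h(a) = \int (a - \theta)^2 \pi(\theta \mid \mathbf{x}) \, \mathrm{d}\theta. \]
$h'(a) = 0$ if
\[ \int (a - \theta) \pi(\theta \mid \mathbf{x}) \, \mathrm{d}\theta = 0, \]
or
\[ a \int \pi(\theta \mid \mathbf{x}) \, \mathrm{d}\theta = \int \theta \pi(\theta \mid \mathbf{x}) \, \mathrm{d}\theta. \]
Since $\int \pi(\theta \mid \mathbf{x}) \, \mathrm{d}\theta = 1$, the Bayes estimator is $\hat{\theta} = \int \theta \pi(\theta \mid \mathbf{x}) \, \mathrm{d}\theta$, the posterior mean.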

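For absolute error loss,
\begin{align*}
h(a) &= \int |\theta - a| \pi(\theta \mid \mathbf{x}) \, \mathrm{d}\theta \\
&= \int_{-\infty}^a (a - \theta) \pi(\theta \mid \mathbf{x}) \, \mathrm{d}\theta + \int_a^\infty (\theta - a) \pi(\theta \mid \mathbf{x}) \, \mathrm{d}\theta \\
&= a \int_{-\infty}^a \pi(\theta \mid \mathbf{x}) \, \mathrm{d}\theta - \int_{-\infty}^a \theta \pi(\theta \mid \mathbf{x}) \, \mathrm{d}\theta \\
&\quad + \int_a^\infty \theta \pi(\theta \mid \mathbf{x}) \, \mathrm{d}\theta - a \int_a^\infty \pi(\theta \mid \mathbf{x}) \, \mathrm{d}\theta.
\end{align*}
Now $h'(a) = 0$ if
\[ \int_{-\infty}^a \pi(\theta \mid \mathbf{x}) \, \mathrm{d}\theta = \int_a^\infty \pi(\theta \mid \mathbf{x}) \, \mathrm{d}\theta. \]
This occurs when each side is $1/2$. So $\hat{\theta}$ is the posterior median.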

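Example. Suppose that $X_1, \cdots, X_n$ are iid $N(\mu, 1)$, and that a priori $\mu \sim N(0, \tau^{-2})$ for some $\tau^{-2}$. So $\tau^2$ measures the certainty (precision) of our prior knowledge.

The posterior is given by
$$\pi(\mu \mid \mathbf{x}) \propto f_{\mathbf{X}}(\mathbf{x} \mid \mu) \pi(\mu)$$
$$\propto \exp\left[-\frac{1}{2} \sum (x_i - \mu)^2\right] \exp\left[-\frac{\mu^2 \tau^2}{2}\right]$$
$$\propto \exp\left[-\frac{1}{2}(n + \tau^2)\left\{\mu - \frac{\sum x_i}{n + \tau^2}\right\}^2\right],$$
since we can regard $n$, $\tau$ and all the $x_i$ as constants in the normalisation term, and then complete the square with respect to $\mu$. So the posterior distribution of $\mu$ given $\mathbf{x}$ is a normal distribution with mean $\sum x_i/(n + \tau^2)$ and variance $1/(n + \tau^2)$.

The normal density is symmetric, and so the posterior mean and the posterior median have the same value $\sum x_i/(n + \tau^2)$.

This is the optimal estimator for both quadratic and absolute loss.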

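Example. Suppose that $X_1, \cdots, X_n$ are iid $\mathrm{Poisson}(\lambda)$ random variables, and $\lambda$ has an exponential distribution with mean 1. So $\pi(\lambda) = e^{-\lambda}$.

The posterior distribution is given by
$$\pi(\lambda \mid \mathbf{x}) \propto e^{-n\lambda} \lambda^{\sum x_i} e^{-\lambda} = \lambda^{\sum x_i} e^{-(n+1)\lambda}, \quad \lambda > 0,$$
which is $\mathrm{gamma}(\sum x_i + 1, n + 1)$. Hence under quadratic loss, our estimator is
$$\hat{\lambda} = \frac{\sum x_i + 1}{n + 1},$$
the posterior mean.

Under absolute error loss, $\hat{\lambda}$ solves
$$\int_0^{\hat{\lambda}} \frac{(n+1)^{\sum x_i + 1} \lambda^{\sum x_i} e^{-(n+1)\lambda}}{(\sum x_i)!} \, d\lambda = \frac{1}{2}.$$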
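The median equation for the Poisson example has no closed-form solution, but it can be solved numerically. The sketch below is one possible approach, not part of the lectures: it uses the fact that a gamma distribution with integer shape has an explicit cdf, and finds the median by bisection. The function names and the data are hypothetical.

```python
import math

def gamma_cdf_integer_shape(x, shape, rate):
    """cdf of gamma(shape, rate) when shape is a positive integer."""
    if x <= 0:
        return 0.0
    partial = sum((rate * x) ** j / math.factorial(j) for j in range(shape))
    return 1.0 - math.exp(-rate * x) * partial

def poisson_bayes_estimates(xs):
    """Posterior mean and median of lambda under an exponential(1) prior."""
    n, s = len(xs), sum(xs)
    shape, rate = s + 1, n + 1                 # posterior is gamma(s + 1, n + 1)
    post_mean = shape / rate                   # Bayes estimator under quadratic loss
    lo, hi = 0.0, post_mean + 10.0 * math.sqrt(shape) / rate
    for _ in range(200):                       # bisection on the posterior cdf
        mid = (lo + hi) / 2
        if gamma_cdf_integer_shape(mid, shape, rate) < 0.5:
            lo = mid
        else:
            hi = mid
    return post_mean, (lo + hi) / 2            # median: Bayes estimator under absolute loss

xs = [2, 0, 3, 1]                              # hypothetical Poisson counts
post_mean, post_median = poisson_bayes_estimates(xs)
```

Because the gamma posterior is skewed to the right, the two loss functions give different estimates here, unlike the symmetric normal example.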

2 Hypothesis testing

Often in statistics, we have some hypothesis to test. For example, we want to

test whether a drug can lower the chance of a heart attack. Often, we will have

two hypotheses to compare: the null hypothesis states that the drug is useless,

while the alternative hypothesis states that the drug is useful. Quantitatively, suppose that the chance of heart attack without the drug is $\theta_0$ and the chance with the drug is $\theta$. Then the null hypothesis is $H_0: \theta = \theta_0$, while the alternative hypothesis is $H_1: \theta \neq \theta_0$.

It is important to note that the null hypothesis and alternative hypothesis

are not on equal footing. By default, we assume the null hypothesis is true.

For us to reject the null hypothesis, we need a lot of evidence to prove that.

This is since we consider incorrectly rejecting the null hypothesis to be a much

more serious problem than accepting it when we should not. For example, it is

relatively okay to reject a drug when it is actually useful, but it is terrible to

distribute drugs to patients when the drugs are actually useless. Alternatively,

it is more serious to deem an innocent person guilty than to say a guilty person

is innocent.

In general, let $X_1, \cdots, X_n$ be iid, each taking values in $\mathcal{X}$, each with unknown pdf/pmf $f$. We have two hypotheses, $H_0$ and $H_1$, about $f$. On the basis of data $\mathbf{X} = \mathbf{x}$, we make a choice between the two hypotheses.

Example.

– A coin has $P(\text{Heads}) = \theta$, and is thrown independently $n$ times. We could have $H_0: \theta = \frac{1}{2}$ versus $H_1: \theta = \frac{3}{4}$.

– Suppose $X_1, \cdots, X_n$ are iid discrete random variables. We could have $H_0$: the distribution is Poisson with unknown mean, and $H_1$: the distribution is not Poisson.

– General parametric cases: Let $X_1, \cdots, X_n$ be iid with density $f(x \mid \theta)$, where $f$ is known while $\theta$ is unknown. Then our hypotheses are $H_0: \theta \in \Theta_0$ and $H_1: \theta \in \Theta_1$, with $\Theta_0 \cap \Theta_1 = \emptyset$.

– We could have $H_0: f = f_0$ and $H_1: f = f_1$, where $f_0$ and $f_1$ are densities that are completely specified but do not come from the same parametric family.

Definition (Simple and composite hypotheses). A simple hypothesis $H$ specifies $f$ completely (e.g. $H_0: \theta = \frac{1}{2}$). Otherwise, $H$ is a composite hypothesis.

2.1 Simple hypotheses

Definition (Critical region). For testing $H_0$ against an alternative hypothesis $H_1$, a test procedure has to partition $\mathcal{X}^n$ into two disjoint exhaustive regions $C$ and $\bar{C}$, such that if $\mathbf{x} \in C$, then $H_0$ is rejected, and if $\mathbf{x} \in \bar{C}$, then $H_0$ is not rejected. $C$ is the critical region.

When performing a test, we may either arrive at a correct conclusion, or

make one of the two types of error:

Definition (Type I and II error).

(i) Type I error: reject $H_0$ when $H_0$ is true.

(ii) Type II error: not rejecting $H_0$ when $H_0$ is false.

Definition (Size and power). When $H_0$ and $H_1$ are both simple, let
$$\alpha = P(\text{Type I error}) = P(\mathbf{X} \in C \mid H_0 \text{ is true}),$$
$$\beta = P(\text{Type II error}) = P(\mathbf{X} \notin C \mid H_1 \text{ is true}).$$
The size of the test is $\alpha$, and $1 - \beta$ is the power of the test to detect $H_1$.

If we have two simple hypotheses, a relatively straightforward test is the

likelihood ratio test.

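Definition (Likelihood). The likelihood of a simple hypothesis $H: \theta = \theta^*$ given data $\mathbf{x}$ is
$$L_{\mathbf{x}}(H) = f_{\mathbf{X}}(\mathbf{x} \mid \theta = \theta^*).$$
The likelihood ratio of two simple hypotheses $H_0$, $H_1$ given data $\mathbf{x}$ is
$$\Lambda_{\mathbf{x}}(H_0; H_1) = \frac{L_{\mathbf{x}}(H_1)}{L_{\mathbf{x}}(H_0)}.$$
A likelihood ratio test (LR test) is one where the critical region $C$ is of the form
$$C = \{\mathbf{x} : \Lambda_{\mathbf{x}}(H_0; H_1) > k\}$$
for some $k$.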

It turns out this rather simple test is “the best” in the following sense:

Lemma (Neyman-Pearson lemma). Suppose $H_0: f = f_0$, $H_1: f = f_1$, where $f_0$ and $f_1$ are continuous densities that are nonzero on the same regions. Then among all tests of size less than or equal to $\alpha$, the test with the largest power is the likelihood ratio test of size $\alpha$.

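Proof. Under the likelihood ratio test, our critical region is
$$C = \left\{\mathbf{x} : \frac{f_1(\mathbf{x})}{f_0(\mathbf{x})} > k\right\},$$
where $k$ is chosen such that $\alpha = P(\text{reject } H_0 \mid H_0) = P(\mathbf{X} \in C \mid H_0) = \int_C f_0(\mathbf{x}) \, d\mathbf{x}$. The probability of Type II error is given by
$$\beta = P(\mathbf{X} \notin C \mid f_1) = \int_{\bar{C}} f_1(\mathbf{x}) \, d\mathbf{x}.$$
Let $C^*$ be the critical region of any other test with size less than or equal to $\alpha$. Let $\alpha^* = P(\mathbf{X} \in C^* \mid H_0)$ and $\beta^* = P(\mathbf{X} \notin C^* \mid H_1)$. We want to show $\beta \leq \beta^*$.

We know $\alpha^* \leq \alpha$, i.e.
$$\int_{C^*} f_0(\mathbf{x}) \, d\mathbf{x} \leq \int_C f_0(\mathbf{x}) \, d\mathbf{x}.$$
Also, on $C$, we have $f_1(\mathbf{x}) > k f_0(\mathbf{x})$, while on $\bar{C}$ we have $f_1(\mathbf{x}) \leq k f_0(\mathbf{x})$. So
$$\int_{\bar{C}^* \cap C} f_1(\mathbf{x}) \, d\mathbf{x} \geq k \int_{\bar{C}^* \cap C} f_0(\mathbf{x}) \, d\mathbf{x},$$
$$\int_{\bar{C} \cap C^*} f_1(\mathbf{x}) \, d\mathbf{x} \leq k \int_{\bar{C} \cap C^*} f_0(\mathbf{x}) \, d\mathbf{x}.$$
Hence
$$\beta - \beta^* = \int_{\bar{C}} f_1(\mathbf{x}) \, d\mathbf{x} - \int_{\bar{C}^*} f_1(\mathbf{x}) \, d\mathbf{x}$$
$$= \int_{\bar{C} \cap C^*} f_1(\mathbf{x}) \, d\mathbf{x} + \int_{\bar{C} \cap \bar{C}^*} f_1(\mathbf{x}) \, d\mathbf{x} - \int_{\bar{C}^* \cap C} f_1(\mathbf{x}) \, d\mathbf{x} - \int_{\bar{C} \cap \bar{C}^*} f_1(\mathbf{x}) \, d\mathbf{x}$$
$$= \int_{\bar{C} \cap C^*} f_1(\mathbf{x}) \, d\mathbf{x} - \int_{\bar{C}^* \cap C} f_1(\mathbf{x}) \, d\mathbf{x}$$
$$\leq k \int_{\bar{C} \cap C^*} f_0(\mathbf{x}) \, d\mathbf{x} - k \int_{\bar{C}^* \cap C} f_0(\mathbf{x}) \, d\mathbf{x}$$
$$= k \left\{\int_{\bar{C} \cap C^*} f_0(\mathbf{x}) \, d\mathbf{x} + \int_{C \cap C^*} f_0(\mathbf{x}) \, d\mathbf{x}\right\} - k \left\{\int_{\bar{C}^* \cap C} f_0(\mathbf{x}) \, d\mathbf{x} + \int_{C \cap C^*} f_0(\mathbf{x}) \, d\mathbf{x}\right\}$$
$$= k(\alpha^* - \alpha) \leq 0.$$

[Figure: diagram of the critical regions $C$ and $C^*$. On $\bar{C}^* \cap C$ we have $f_1 \geq k f_0$, and on $\bar{C} \cap C^*$ we have $f_1 \leq k f_0$.]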

Here we assumed that $f_0$ and $f_1$ are continuous densities. However, this assumption is only needed to ensure that the likelihood ratio test of exactly size $\alpha$ exists. Even with non-continuous distributions, the likelihood ratio test is still a good idea. In fact, you will show in the example sheets that for a discrete distribution, as long as a likelihood ratio test of exactly size $\alpha$ exists, the same result holds.

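Example. Suppose $X_1, \cdots, X_n$ are iid $N(\mu, \sigma_0^2)$, where $\sigma_0^2$ is known. We want to find the best size $\alpha$ test of $H_0: \mu = \mu_0$ against $H_1: \mu = \mu_1$, where $\mu_0$ and $\mu_1$ are known fixed values with $\mu_1 > \mu_0$. Then
$$\Lambda_{\mathbf{x}}(H_0; H_1) = \frac{(2\pi\sigma_0^2)^{-n/2} \exp\left(-\frac{1}{2\sigma_0^2} \sum (x_i - \mu_1)^2\right)}{(2\pi\sigma_0^2)^{-n/2} \exp\left(-\frac{1}{2\sigma_0^2} \sum (x_i - \mu_0)^2\right)} = \exp\left(\frac{\mu_1 - \mu_0}{\sigma_0^2} n\bar{x} + \frac{n(\mu_0^2 - \mu_1^2)}{2\sigma_0^2}\right).$$
This is an increasing function of $\bar{x}$, so for any $k$, $\Lambda_{\mathbf{x}} > k \Leftrightarrow \bar{x} > c$ for some $c$. Hence we reject $H_0$ if $\bar{x} > c$, where $c$ is chosen such that $P(\bar{X} > c \mid H_0) = \alpha$.

Under $H_0$, $\bar{X} \sim N(\mu_0, \sigma_0^2/n)$, so $Z = \sqrt{n}(\bar{X} - \mu_0)/\sigma_0 \sim N(0, 1)$.

Since $\bar{x} > c \Leftrightarrow z > c'$ for some $c'$, the size $\alpha$ test rejects $H_0$ if
$$z = \frac{\sqrt{n}(\bar{x} - \mu_0)}{\sigma_0} > z_\alpha.$$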

For example, suppose $\mu_0 = 5$, $\mu_1 = 6$, $\sigma_0 = 1$, $\alpha = 0.05$, $n = 4$ and $\mathbf{x} = (5.1, 5.5, 4.9, 5.3)$. So $\bar{x} = 5.2$.

From tables, $z_{0.05} = 1.645$. We have $z = 0.4$ and this is less than 1.645. So $\mathbf{x}$ is not in the rejection region.

We do not reject $H_0$ at the 5% level and say that the data are consistent with $H_0$.

Note that this does not mean that we accept $H_0$. While we don’t have sufficient reason to believe it is false, we also don’t have sufficient reason to believe it is true.

This is called a z-test.
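The numbers in this example can be reproduced with the Python standard library. This is a minimal sketch; the function name `one_sided_z_test` is made up for illustration, and `statistics.NormalDist` stands in for the printed tables.

```python
from math import sqrt
from statistics import NormalDist, fmean

def one_sided_z_test(xs, mu0, sigma0, alpha):
    """Size-alpha test of mu = mu0 against a larger mean, with known sigma0."""
    n = len(xs)
    z = sqrt(n) * (fmean(xs) - mu0) / sigma0    # test statistic
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # upper 100*alpha% point of N(0, 1)
    return z, z_alpha, z > z_alpha              # reject H0 iff z > z_alpha

z, z_alpha, reject = one_sided_z_test([5.1, 5.5, 4.9, 5.3], mu0=5, sigma0=1, alpha=0.05)
```

Here `z` is about 0.4 and `z_alpha` about 1.645, so `reject` is `False`, in agreement with the conclusion above.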

In this example, LR tests reject $H_0$ if $z > k$ for some constant $k$. The size of such a test is $\alpha = P(Z > k \mid H_0) = 1 - \Phi(k)$, and is decreasing as $k$ increases.

Our observed value $z$ will be in the rejection region iff $z > k \Leftrightarrow \alpha > p^* = P(Z > z \mid H_0)$.

Definition ($p$-value). The quantity $p^*$ is called the $p$-value of our observed data $\mathbf{x}$. For the example above, $z = 0.4$ and so $p^* = 1 - \Phi(0.4) = 0.3446$.

In general, the $p$-value is sometimes called the “observed significance level” of $\mathbf{x}$. This is the probability under $H_0$ of seeing data that is “more extreme” than our observed data $\mathbf{x}$. Extreme observations are viewed as providing evidence against $H_0$.

2.2 Composite hypotheses

For composite hypotheses like $H: \theta \geq 0$, the error probabilities do not have a single value. We define

Definition (Power function). The power function is
$$W(\theta) = P(\mathbf{X} \in C \mid \theta) = P(\text{reject } H_0 \mid \theta).$$
We want $W(\theta)$ to be small on $H_0$ and large on $H_1$.

Definition (Size). The size of the test is
$$\alpha = \sup_{\theta \in \Theta_0} W(\theta).$$
This is the worst possible size we can get.

For $\theta \in \Theta_1$, $1 - W(\theta) = P(\text{Type II error} \mid \theta)$.

Sometimes the Neyman-Pearson theory can be extended to one-sided alternatives.

For example, in the previous example, we have shown that the most powerful size $\alpha$ test of $H_0: \mu = \mu_0$ versus $H_1: \mu = \mu_1$ (where $\mu_1 > \mu_0$) is given by
$$C = \left\{\mathbf{x} : \frac{\sqrt{n}(\bar{x} - \mu_0)}{\sigma_0} > z_\alpha\right\}.$$
The critical region depends on $\mu_0$, $n$, $\sigma_0$, $\alpha$, and the fact that $\mu_1 > \mu_0$. It does not depend on the particular value of $\mu_1$. This test is then uniformly the most powerful size $\alpha$ test for testing $H_0: \mu = \mu_0$ against $H_1: \mu > \mu_0$.

Definition (Uniformly most powerful test). A test specified by a critical region $C$ is a uniformly most powerful (UMP) size $\alpha$ test for testing $H_0: \theta \in \Theta_0$ against $H_1: \theta \in \Theta_1$ if

(i) $\sup_{\theta \in \Theta_0} W(\theta) = \alpha$.

(ii) For any other test $C^*$ with size $\leq \alpha$ and with power function $W^*$, we have $W(\theta) \geq W^*(\theta)$ for all $\theta \in \Theta_1$.

Note that these may not exist. However, the likelihood ratio test often works.

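Example. Suppose $X_1, \cdots, X_n$ are iid $N(\mu, \sigma_0^2)$ where $\sigma_0$ is known, and we wish to test $H_0: \mu \leq \mu_0$ against $H_1: \mu > \mu_0$.

First consider testing $H_0': \mu = \mu_0$ against $H_1': \mu = \mu_1$, where $\mu_1 > \mu_0$. The Neyman-Pearson test of size $\alpha$ of $H_0'$ against $H_1'$ has
$$C = \left\{\mathbf{x} : \frac{\sqrt{n}(\bar{x} - \mu_0)}{\sigma_0} > z_\alpha\right\}.$$
We show that $C$ is in fact UMP for the composite hypotheses $H_0$ against $H_1$. For $\mu \in \mathbb{R}$, the power function is
$$W(\mu) = P_\mu(\text{reject } H_0) = P_\mu\left(\frac{\sqrt{n}(\bar{X} - \mu_0)}{\sigma_0} > z_\alpha\right)$$
$$= P_\mu\left(\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma_0} > z_\alpha + \frac{\sqrt{n}(\mu_0 - \mu)}{\sigma_0}\right) = 1 - \Phi\left(z_\alpha + \frac{\sqrt{n}(\mu_0 - \mu)}{\sigma_0}\right).$$
To show this is UMP, we know that $W(\mu_0) = \alpha$ (by plugging in). $W(\mu)$ is an increasing function of $\mu$. So
$$\sup_{\mu \leq \mu_0} W(\mu) = \alpha.$$
So the first condition is satisfied.

For the second condition, observe that for any $\mu_1 > \mu_0$, the Neyman-Pearson size $\alpha$ test of $H_0'$ vs $H_1'$ has critical region $C$. Let $C^*$ and $W^*$ belong to any other test of $H_0$ vs $H_1$ of size $\leq \alpha$. Then $C^*$ can be regarded as a test of $H_0'$ vs $H_1'$ of size $\leq \alpha$, and the Neyman-Pearson lemma says that $W^*(\mu_1) \leq W(\mu_1)$. This holds for all $\mu_1 > \mu_0$. So the condition is satisfied and it is UMP.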
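The power function of this one-sided test is easy to tabulate. The sketch below assumes the same setting as the example (known standard deviation, one-sided critical region); the function name `power` and the particular grid of means are illustrative choices only.

```python
from math import sqrt
from statistics import NormalDist

def power(mu, mu0, sigma0, n, alpha):
    """W(mu) = 1 - Phi(z_alpha + sqrt(n) (mu0 - mu) / sigma0) for the one-sided z-test."""
    std = NormalDist()
    z_alpha = std.inv_cdf(1 - alpha)
    return 1 - std.cdf(z_alpha + sqrt(n) * (mu0 - mu) / sigma0)

grid = [4.0, 4.5, 5.0, 5.5, 6.0]   # hypothetical values of the true mean
values = [power(mu, mu0=5, sigma0=1, n=4, alpha=0.05) for mu in grid]
```

The values increase with the true mean and equal the size exactly at the boundary of the null hypothesis, which is the first condition checked in the example.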

We now consider likelihood ratio tests for more general situations.

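Definition (Likelihood of a composite hypothesis). The likelihood of a composite hypothesis $H: \theta \in \Theta$ given data $\mathbf{x}$ is defined to be
$$L_{\mathbf{x}}(H) = \sup_{\theta \in \Theta} f(\mathbf{x} \mid \theta).$$
So far we have considered disjoint hypotheses $\Theta_0$, $\Theta_1$, but we are not interested in any specific alternative. So it is easier to take $\Theta_1 = \Theta$ rather than $\Theta \setminus \Theta_0$. Then
$$\Lambda_{\mathbf{x}}(H_0; H_1) = \frac{L_{\mathbf{x}}(H_1)}{L_{\mathbf{x}}(H_0)} = \frac{\sup_{\theta \in \Theta_1} f(\mathbf{x} \mid \theta)}{\sup_{\theta \in \Theta_0} f(\mathbf{x} \mid \theta)} \geq 1,$$
with large values of $\Lambda$ indicating departure from $H_0$.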

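Example. Suppose that $X_1, \cdots, X_n$ are iid $N(\mu, \sigma_0^2)$, with $\sigma_0^2$ known, and we wish to test $H_0: \mu = \mu_0$ against $H_1: \mu \neq \mu_0$ (for given constant $\mu_0$). Here $\Theta_0 = \{\mu_0\}$ and $\Theta = \mathbb{R}$.

For the numerator, we have $\sup_\Theta f(\mathbf{x} \mid \mu) = f(\mathbf{x} \mid \hat{\mu})$, where $\hat{\mu}$ is the mle. We know that $\hat{\mu} = \bar{x}$. Hence
$$\Lambda_{\mathbf{x}}(H_0; H_1) = \frac{(2\pi\sigma_0^2)^{-n/2} \exp\left(-\frac{1}{2\sigma_0^2} \sum (x_i - \bar{x})^2\right)}{(2\pi\sigma_0^2)^{-n/2} \exp\left(-\frac{1}{2\sigma_0^2} \sum (x_i - \mu_0)^2\right)}.$$
Then $H_0$ is rejected if $\Lambda_{\mathbf{x}}$ is large.

To make our lives easier, we can use the logarithm instead:
$$2 \log \Lambda(H_0; H_1) = \frac{1}{\sigma_0^2}\left[\sum (x_i - \mu_0)^2 - \sum (x_i - \bar{x})^2\right] = \frac{n}{\sigma_0^2}(\bar{x} - \mu_0)^2.$$
So we can reject $H_0$ if we have
$$\left|\frac{\sqrt{n}(\bar{x} - \mu_0)}{\sigma_0}\right| > c$$
for some $c$.

We know that under $H_0$, $Z = \frac{\sqrt{n}(\bar{X} - \mu_0)}{\sigma_0} \sim N(0, 1)$. So the size $\alpha$ generalised likelihood test rejects $H_0$ if
$$\left|\frac{\sqrt{n}(\bar{x} - \mu_0)}{\sigma_0}\right| > z_{\alpha/2}.$$
Alternatively, since $\frac{n(\bar{X} - \mu_0)^2}{\sigma_0^2} \sim \chi_1^2$, we reject $H_0$ if
$$\frac{n(\bar{x} - \mu_0)^2}{\sigma_0^2} > \chi_1^2(\alpha)$$
(check that $z_{\alpha/2}^2 = \chi_1^2(\alpha)$).

Note that this is a two-tailed test, i.e. we reject $H_0$ both for high and low values of $\bar{x}$.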
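The check suggested in the example, that the squared upper $\alpha/2$ point of the standard normal equals the upper $\alpha$ point of the chi-squared distribution with one degree of freedom, can be written out as follows. This is a short worked derivation using only the fact that the square of a standard normal variable $Z$ is chi-squared with one degree of freedom.

```latex
\begin{align*}
P\bigl(Z^2 > z_{\alpha/2}^2\bigr)
  &= P\bigl(|Z| > z_{\alpha/2}\bigr) \\
  &= P(Z > z_{\alpha/2}) + P(Z < -z_{\alpha/2}) \\
  &= \tfrac{\alpha}{2} + \tfrac{\alpha}{2} = \alpha ,
\end{align*}
% so z_{\alpha/2}^2 is the upper 100\alpha\% point of \chi^2_1,
% i.e. z_{\alpha/2}^2 = \chi^2_1(\alpha).
```

For instance, with $\alpha = 0.05$ the tables give $z_{0.025} = 1.96$, and $1.96^2 \approx 3.84$ is the usual tabulated value of $\chi_1^2(0.05)$.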

The next theorem allows us to use likelihood ratio tests even when we cannot

find the exact relevant null distribution.

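First consider the “size” or “dimension” of our hypotheses: suppose that $H_0$ imposes $p$ independent restrictions on $\Theta$. So for example, if $\Theta = \{\theta : \theta = (\theta_1, \cdots, \theta_k)\}$, and we have

– $H_0: \theta_{i_1} = a_1, \theta_{i_2} = a_2, \cdots, \theta_{i_p} = a_p$; or

– $H_0: A\theta = \mathbf{b}$ (with $A$ $p \times k$, $\mathbf{b}$ $p \times 1$ given); or

– $H_0: \theta_i = f_i(\varphi)$, $i = 1, \cdots, k$ for some $\varphi = (\varphi_1, \cdots, \varphi_{k-p})$.

We say $\Theta$ has $k$ free parameters and $\Theta_0$ has $k - p$ free parameters. We write $|\Theta_0| = k - p$ and $|\Theta| = k$.

Theorem (Generalized likelihood ratio theorem). Suppose $\Theta_0 \subseteq \Theta_1$ and $|\Theta_1| - |\Theta_0| = p$. Let $\mathbf{X} = (X_1, \cdots, X_n)$ with all $X_i$ iid. If $H_0$ is true, then as $n \to \infty$,
$$2 \log \Lambda_{\mathbf{X}}(H_0; H_1) \sim \chi_p^2.$$
If $H_0$ is not true, then $2 \log \Lambda$ tends to be larger. We reject $H_0$ if $2 \log \Lambda > c$, where $c = \chi_p^2(\alpha)$ for a test of approximately size $\alpha$.

We will not prove this result here. In our example above, $|\Theta_1| - |\Theta_0| = 1$, and in this case, we saw that under $H_0$, $2 \log \Lambda \sim \chi_1^2$ exactly for all $n$ in that particular case, rather than just approximately.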

2.3 Tests of goodness-of-fit and independence

2.3.1 Goodness-of-fit of a fully-specified null distribution

So far, we have considered relatively simple cases where we are attempting to

figure out, say, the mean. However, in reality, more complicated scenarios arise.

For example, we might want to know if a die is fair, i.e. if the probability of getting each number is exactly $\frac{1}{6}$. Our null hypothesis would be that $p_1 = p_2 = \cdots = p_6 = \frac{1}{6}$, while the alternative hypothesis allows any possible values of $p_i$.

In general, suppose the observation space $\mathcal{X}$ is partitioned into $k$ sets, and let $p_i$ be the probability that an observation is in set $i$ for $i = 1, \cdots, k$. We want to test “$H_0$: the $p_i$’s arise from a fully specified model” against “$H_1$: the $p_i$’s are unrestricted (apart from the obvious $p_i \geq 0$, $\sum p_i = 1$)”.

Example. The following table lists the birth months of admissions to Oxford

and Cambridge in 2012.

Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug

470 515 470 457 473 381 466 457 437 396 384 394

Is this compatible with a uniform distribution over the year?

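Out of $n$ independent observations, let $N_i$ be the number of observations in the $i$th set. So $(N_1, \cdots, N_k) \sim \mathrm{multinomial}(n; p_1, \cdots, p_k)$.

For a generalized likelihood ratio test of $H_0$, we need to find the maximised likelihood under $H_0$ and $H_1$.

Under $H_1$, $\mathrm{like}(p_1, \cdots, p_k) \propto p_1^{n_1} \cdots p_k^{n_k}$. So the log likelihood is $l = \text{constant} + \sum n_i \log p_i$. We want to maximise this subject to $\sum p_i = 1$. Using the Lagrange multiplier, we will find that the mle is $\hat{p}_i = n_i/n$. Also $|\Theta_1| = k - 1$ (not $k$, since they must sum up to 1).

Under $H_0$, the values of $p_i$ are specified completely, say $p_i = \tilde{p}_i$. So $|\Theta_0| = 0$. Using our formula for $\hat{p}_i$, we find that
$$2 \log \Lambda = 2 \log \left(\frac{\hat{p}_1^{n_1} \cdots \hat{p}_k^{n_k}}{\tilde{p}_1^{n_1} \cdots \tilde{p}_k^{n_k}}\right) = 2 \sum n_i \log \left(\frac{n_i}{n \tilde{p}_i}\right). \quad (1)$$
Here $|\Theta_1| - |\Theta_0| = k - 1$. So we reject $H_0$ if $2 \log \Lambda > \chi_{k-1}^2(\alpha)$ for an approximate size $\alpha$ test.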

Under $H_0$ (no effect of month of birth), $\tilde{p}_i$ is the proportion of births in month $i$ in 1993/1994 in the whole population. This is not simply proportional to the number of days in each month (or even worse, $\frac{1}{12}$), as there is for example an excess of September births (the “Christmas effect”). Then
$$2 \log \Lambda = 2 \sum n_i \log \left(\frac{n_i}{n \tilde{p}_i}\right) = 44.86.$$
$P(\chi_{11}^2 > 44.86) = 3 \times 10^{-9}$, which is our $p$-value. Since this is certainly less than 0.001, we can reject $H_0$ at the 0.1% level, or can say the result is “significant at the 0.1% level”.

The traditional levels for comparison are $\alpha = 0.05, 0.01, 0.001$, roughly corresponding to “evidence”, “strong evidence” and “very strong evidence”.

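A similar common situation has $H_0: p_i = p_i(\theta)$ for some parameter $\theta$ and $H_1$ as before. Now $|\Theta_0|$ is the number of independent parameters to be estimated under $H_0$.

Under $H_0$, we find the mle $\hat{\theta}$ by maximizing $\sum n_i \log p_i(\theta)$, and then
$$2 \log \Lambda = 2 \log \left(\frac{\hat{p}_1^{n_1} \cdots \hat{p}_k^{n_k}}{p_1(\hat{\theta})^{n_1} \cdots p_k(\hat{\theta})^{n_k}}\right) = 2 \sum n_i \log \left(\frac{n_i}{n p_i(\hat{\theta})}\right). \quad (2)$$
The degrees of freedom are $k - 1 - |\Theta_0|$.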

2.3.2 Pearson’s chi-squared test

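Notice that the two log likelihoods are of the same form. In general, let $o_i = n_i$ (observed number) and let $e_i = n\tilde{p}_i$ or $n p_i(\hat{\theta})$ (expected number). Let $\delta_i = o_i - e_i$. Then
$$2 \log \Lambda = 2 \sum o_i \log \left(\frac{o_i}{e_i}\right)$$
$$= 2 \sum (e_i + \delta_i) \log \left(1 + \frac{\delta_i}{e_i}\right)$$
$$= 2 \sum (e_i + \delta_i) \left(\frac{\delta_i}{e_i} - \frac{\delta_i^2}{2e_i^2} + O(\delta_i^3)\right)$$
$$= 2 \sum \left(\delta_i + \frac{\delta_i^2}{e_i} - \frac{\delta_i^2}{2e_i} + O(\delta_i^3)\right).$$
We know that $\sum \delta_i = 0$ since $\sum e_i = \sum o_i$. So
$$2 \log \Lambda \approx \sum \frac{\delta_i^2}{e_i} = \sum \frac{(o_i - e_i)^2}{e_i}.$$
This is known as Pearson’s chi-squared test.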

Example. Mendel crossed 556 smooth yellow male peas with wrinkled green peas. From the progeny, let

(i) $N_1$ be the number of smooth yellow peas,

(ii) $N_2$ be the number of smooth green peas,

(iii) $N_3$ be the number of wrinkled yellow peas,

(iv) $N_4$ be the number of wrinkled green peas.

We wish to test the goodness of fit of the model
$$H_0: (p_1, p_2, p_3, p_4) = \left(\frac{9}{16}, \frac{3}{16}, \frac{3}{16}, \frac{1}{16}\right).$$
Suppose we observe $(n_1, n_2, n_3, n_4) = (315, 108, 102, 31)$.

We find $(e_1, e_2, e_3, e_4) = (312.75, 104.25, 104.25, 34.75)$. The actual $2 \log \Lambda = 0.618$ and the approximation we had is $\sum \frac{(o_i - e_i)^2}{e_i} = 0.604$.

Here $|\Theta_0| = 0$ and $|\Theta_1| = 4 - 1 = 3$. So we refer the test statistics to $\chi_3^2(\alpha)$.

Since $\chi_3^2(0.05) = 7.815$, we see that neither value is significant at 5%. So there is no evidence against Mendel’s theory. In fact, the $p$-value is approximately $P(\chi_3^2 > 0.6) \approx 0.90$. This is a really good fit, so good that people suspect the numbers were not genuine.
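Both statistics in the Mendel example can be recomputed directly from the observed counts. This is a small sketch in plain Python; the variable names are arbitrary and the 9:3:3:1 null probabilities are the ones implied by the expected counts quoted in the example.

```python
from math import log

observed = [315, 108, 102, 31]            # counts from the example
null_probs = [9 / 16, 3 / 16, 3 / 16, 1 / 16]

n = sum(observed)
expected = [n * p for p in null_probs]    # e_i = n * p_i under H0

# Pearson's statistic: sum of (o - e)^2 / e
pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Likelihood ratio statistic: 2 * sum of o * log(o / e)
glr = 2 * sum(o * log(o / e) for o, e in zip(observed, expected))
```

The two values agree with the quoted 0.604 and 0.618 to three decimal places, which illustrates how close the Pearson approximation is when the deviations are small relative to the expected counts.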

Example. In a genetics problem, each individual has one of the three possible genotypes, with probabilities $p_1, p_2, p_3$. Suppose we wish to test $H_0: p_i = p_i(\theta)$, where
$$p_1(\theta) = \theta^2, \quad p_2(\theta) = 2\theta(1 - \theta), \quad p_3(\theta) = (1 - \theta)^2$$
for some $\theta \in (0, 1)$.

We observe $N_i = n_i$. Under $H_0$, the mle $\hat{\theta}$ is found by maximising
$$\sum n_i \log p_i(\theta) = 2n_1 \log \theta + n_2 \log(2\theta(1 - \theta)) + 2n_3 \log(1 - \theta).$$
We find that $\hat{\theta} = \frac{2n_1 + n_2}{2n}$. Also, $|\Theta_0| = 1$ and $|\Theta_1| = 2$.

After conducting an experiment, we can substitute $p_i(\hat{\theta})$ into (2), or find the corresponding Pearson’s chi-squared statistic, and refer to $\chi_1^2$.
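The maximisation step in the genotype example is stated without working. A short derivation, using only the log likelihood given in the example and writing $n = n_1 + n_2 + n_3$, runs as follows.

```latex
\begin{align*}
\ell(\theta) &= 2n_1 \log\theta + n_2 \log\bigl(2\theta(1-\theta)\bigr) + 2n_3 \log(1-\theta) \\
             &= (2n_1 + n_2)\log\theta + (n_2 + 2n_3)\log(1-\theta) + n_2 \log 2, \\
\ell'(\theta) &= \frac{2n_1 + n_2}{\theta} - \frac{n_2 + 2n_3}{1-\theta} = 0 \\
\Longrightarrow\quad
\hat{\theta} &= \frac{2n_1 + n_2}{(2n_1 + n_2) + (n_2 + 2n_3)} = \frac{2n_1 + n_2}{2n}.
\end{align*}
```

The second derivative of the log likelihood is negative on $(0, 1)$, so this stationary point is indeed the maximum.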

2.3.3 Testing independence in contingency tables

Definition (Contingency table). A contingency table is a table in which observations or individuals are classified according to one or more criteria.

Example. 500 people with recent car changes were asked about their previous and new cars. The results are as follows:

                            New car
                    Large   Medium   Small
Previous   Large      56      52       42
car        Medium     50      83       67
           Small      18      51       81

This is a two-way contingency table: each person is classified according to the previous car size and new car size.

Consider a two-way contingency table with $r$ rows and $c$ columns. For $i = 1, \cdots, r$ and $j = 1, \cdots, c$, let $p_{ij}$ be the probability that an individual selected from the population under consideration is classified in row $i$ and column $j$ (i.e. in the $(i, j)$ cell of the table).

Let $p_{i+} = P(\text{in row } i)$ and $p_{+j} = P(\text{in column } j)$. Then we must have $p_{++} = \sum_i \sum_j p_{ij} = 1$.

Suppose a random sample of $n$ individuals is taken, and let $n_{ij}$ be the number of these classified in the $(i, j)$ cell of the table.

Let $n_{i+} = \sum_j n_{ij}$ and $n_{+j} = \sum_i n_{ij}$. So $n_{++} = n$.

We have
$$(N_{11}, \cdots, N_{1c}, N_{21}, \cdots, N_{rc}) \sim \mathrm{multinomial}(n; p_{11}, \cdots, p_{1c}, p_{21}, \cdots, p_{rc}).$$

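We may be interested in testing the null hypothesis that the two classifications are independent. So we test

– $H_0: p_{ij} = p_{i+} p_{+j}$ for all $i, j$, i.e. independence of columns and rows.

– $H_1: p_{ij}$ are unrestricted.

Of course we have the usual restrictions like $p_{++} = 1$, $p_{ij} \geq 0$.

Under $H_1$, the mles are $\hat{p}_{ij} = \frac{n_{ij}}{n}$.

Under $H_0$, the mles are $\hat{p}_{i+} = \frac{n_{i+}}{n}$ and $\hat{p}_{+j} = \frac{n_{+j}}{n}$.

Write $o_{ij} = n_{ij}$ and $e_{ij} = n \hat{p}_{i+} \hat{p}_{+j} = n_{i+} n_{+j}/n$.

Then
$$2 \log \Lambda = 2 \sum_{i=1}^r \sum_{j=1}^c o_{ij} \log \left(\frac{o_{ij}}{e_{ij}}\right) \approx \sum_{i=1}^r \sum_{j=1}^c \frac{(o_{ij} - e_{ij})^2}{e_{ij}},$$
using the same approximating steps as for Pearson’s chi-squared test.

We have $|\Theta_1| = rc - 1$, because under $H_1$ the $p_{ij}$’s sum to one. Also, $|\Theta_0| = (r - 1) + (c - 1)$ because $p_{1+}, \cdots, p_{r+}$ must satisfy $\sum_i p_{i+} = 1$ and $p_{+1}, \cdots, p_{+c}$ must satisfy $\sum_j p_{+j} = 1$. So
$$|\Theta_1| - |\Theta_0| = rc - 1 - (r - 1) - (c - 1) = (r - 1)(c - 1).$$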

Example. In our previous example, we wish to test $H_0$: the new and previous car sizes are independent. The actual data is:

                            New car
                    Large   Medium   Small   Total
Previous   Large      56      52       42     150
car        Medium     50      83       67     200
           Small      18      51       81     150
           Total     124     186      190     500

while the expected values given by $H_0$ are

                            New car
                    Large   Medium   Small   Total
Previous   Large     37.2    55.8     57.0    150
car        Medium    49.6    74.4     76.0    200
           Small     37.2    55.8     57.0    150
           Total     124     186      190     500

Note the margins are the same. It is quite clear that they do not match well, but we can find the $p$-value to be sure.

$\sum \sum \frac{(o_{ij} - e_{ij})^2}{e_{ij}} = 36.20$, and the degrees of freedom is $(3 - 1)(3 - 1) = 4$.

From the tables, $\chi_4^2(0.05) = 9.488$ and $\chi_4^2(0.01) = 13.28$.

So our observed value of 36.20 is significant at the 1% level, i.e. there is strong evidence against $H_0$. So we conclude that the new and previous car sizes are not independent.
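The expected counts and the Pearson statistic for the car data can be recomputed from the observed table alone. This is a minimal sketch in plain Python; the helper names `expected_counts` and `pearson_statistic` are hypothetical.

```python
def expected_counts(table):
    """Expected counts under independence: (row total) * (column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_statistic(table):
    """Sum over all cells of (o - e)^2 / e."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

cars = [[56, 52, 42], [50, 83, 67], [18, 51, 81]]    # rows: previous car, columns: new car
stat = pearson_statistic(cars)
dof = (len(cars) - 1) * (len(cars[0]) - 1)           # (r - 1)(c - 1)
```

The statistic comes out at about 36.20 on 4 degrees of freedom, matching the value used in the example.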

2.4

Tests of homogeneity, and connections to confidence

intervals

2.4.1 Tests of homogeneity

Example. 150 patients were randomly allocated to three groups of 50 patients each. Two groups were given a new drug at different dosage levels, and the third group received a placebo. The responses were as shown in the table below.

             Improved   No difference   Worse   Total
Placebo         18           17           15      50
Half dose       20           10           20      50
Full dose       25           13           12      50
Total           63           40           47     150

Here the row totals are fixed in advance, in contrast to our last section, where the row totals are random variables.

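For the above, we may be interested in testing $H_0$: the probability of “improved” is the same for each of the three treatment groups, and so are the probabilities of “no difference” and “worse”, i.e. $H_0$ says that we have homogeneity down the rows.

In general, we have independent observations from $r$ multinomial distributions, each of which has $c$ categories, i.e. we observe an $r \times c$ table $(n_{ij})$, for $i = 1, \cdots, r$ and $j = 1, \cdots, c$, where
$$(N_{i1}, \cdots, N_{ic}) \sim \mathrm{multinomial}(n_{i+}, p_{i1}, \cdots, p_{ic})$$
independently for each $i = 1, \cdots, r$. We want to test
$$H_0: p_{1j} = p_{2j} = \cdots = p_{rj} = p_j$$
for $j = 1, \cdots, c$, and
$$H_1: p_{ij} \text{ are unrestricted}.$$
Using $H_1$, for any matrix of probabilities $(p_{ij})$,
$$\mathrm{like}((p_{ij})) = \prod_{i=1}^r \frac{n_{i+}!}{n_{i1}! \cdots n_{ic}!} p_{i1}^{n_{i1}} \cdots p_{ic}^{n_{ic}},$$
and
$$\log \mathrm{like} = \text{constant} + \sum_{i=1}^r \sum_{j=1}^c n_{ij} \log p_{ij}.$$
Using Lagrangian methods, we find that $\hat{p}_{ij} = \frac{n_{ij}}{n_{i+}}$.

Under $H_0$,
$$\log \mathrm{like} = \text{constant} + \sum_{j=1}^c n_{+j} \log p_j.$$
By Lagrangian methods, we have $\hat{p}_j = \frac{n_{+j}}{n_{++}}$.

Hence
$$2 \log \Lambda = 2 \sum_{i=1}^r \sum_{j=1}^c n_{ij} \log \left(\frac{\hat{p}_{ij}}{\hat{p}_j}\right) = 2 \sum_{i=1}^r \sum_{j=1}^c n_{ij} \log \left(\frac{n_{ij}}{n_{i+} n_{+j}/n_{++}}\right),$$
which is the same as what we had last time, when the row totals are unrestricted!

We have $|\Theta_1| = r(c - 1)$ and $|\Theta_0| = c - 1$. So the degrees of freedom is $r(c - 1) - (c - 1) = (r - 1)(c - 1)$, and under $H_0$, $2 \log \Lambda$ is approximately $\chi_{(r-1)(c-1)}^2$. Again, it is exactly the same as what we had last time!

We reject $H_0$ if $2 \log \Lambda > \chi_{(r-1)(c-1)}^2(\alpha)$ for an approximate size $\alpha$ test.

If we let $o_{ij} = n_{ij}$, $e_{ij} = \frac{n_{i+} n_{+j}}{n_{++}}$, and $\delta_{ij} = o_{ij} - e_{ij}$, using the same approximating steps as for Pearson’s chi-squared, we obtain
$$2 \log \Lambda \approx \sum \frac{(o_{ij} - e_{ij})^2}{e_{ij}}.$$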

Example. Continuing our previous example, our data is

             Improved   No difference   Worse   Total
Placebo         18           17           15      50
Half dose       20           10           20      50
Full dose       25           13           12      50
Total           63           40           47     150

The expected values under $H_0$ are

             Improved   No difference   Worse   Total
Placebo         21          13.3         15.7     50
Half dose       21          13.3         15.7     50
Full dose       21          13.3         15.7     50
Total           63           40           47     150

We find $2 \log \Lambda = 5.129$, and we refer this to $\chi_4^2$. Clearly this is not significant, as the mean of $\chi_4^2$ is 4, and is something we would expect to happen solely by chance.

We can check this against the tables: $\chi_4^2(0.05) = 9.488$, so our observed value is not significant at 5%, and the data are consistent with $H_0$.

We conclude that there is no evidence for a difference between the drug at the given doses and the placebo.

For interest,
$$\sum \frac{(o_{ij} - e_{ij})^2}{e_{ij}} = 5.173,$$
giving the same conclusion.
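The likelihood ratio statistic and its Pearson approximation for the drug trial can be compared side by side. The sketch below is plain Python with hypothetical helper names, and it assumes every observed count is positive so that the logarithms are defined.

```python
from math import log

def expected_counts(table):
    """e_ij = (row total) * (column total) / (grand total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def cells(table):
    """Flatten a table into a list of cell values, row by row."""
    return [value for row in table for value in row]

drug = [[18, 17, 15], [20, 10, 20], [25, 13, 12]]    # rows: placebo, half dose, full dose
obs, exp = cells(drug), cells(expected_counts(drug))

glr = 2 * sum(o * log(o / e) for o, e in zip(obs, exp))   # 2 log Lambda
pearson = sum((o - e) ** 2 / e for o, e in zip(obs, exp)) # Pearson approximation
```

Both values are well below the 5% critical value 9.488 quoted above, so either statistic leads to the same decision.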

2.4.2 Confidence intervals and hypothesis tests

Confidence intervals or sets can be obtained by inverting hypothesis tests, and vice versa.

Definition (Acceptance region). The acceptance region $A$ of a test is the complement of the critical region $C$.

Note that when we say “acceptance”, we really mean “non-rejection”! The name is purely for historical reasons.

Theorem (Duality of hypothesis tests and confidence intervals). Suppose $X_1, \cdots, X_n$ have joint pdf $f_{\mathbf{X}}(\mathbf{x} \mid \theta)$ for $\theta \in \Theta$.

(i) Suppose that for every $\theta_0 \in \Theta$ there is a size $\alpha$ test of $H_0: \theta = \theta_0$. Denote the acceptance region by $A(\theta_0)$. Then the set $I(\mathbf{X}) = \{\theta : \mathbf{X} \in A(\theta)\}$ is a $100(1 - \alpha)\%$ confidence set for $\theta$.

(ii) Suppose $I(\mathbf{X})$ is a $100(1 - \alpha)\%$ confidence set for $\theta$. Then $A(\theta_0) = \{\mathbf{X} : \theta_0 \in I(\mathbf{X})\}$ is an acceptance region for a size $\alpha$ test of $H_0: \theta = \theta_0$.

Intuitively, this says that “confidence intervals” and “hypothesis acceptance/rejection” are the same thing. After gathering some data $\mathbf{X}$, we can produce a, say, 95% confidence interval $(a, b)$. Then if we want to test the hypothesis $H_0: \theta = \theta_0$, we simply have to check whether $\theta_0 \in (a, b)$.

On the other hand, if we have a test for $H_0: \theta = \theta_0$, then the confidence interval is the set of all $\theta_0$ for which we would accept $H_0: \theta = \theta_0$.

Proof. First note that $\theta_0 \in I(\mathbf{X})$ iff $\mathbf{X} \in A(\theta_0)$.

For (i), since the test is size $\alpha$, we have
$$P(\text{accept } H_0 \mid H_0 \text{ is true}) = P(\mathbf{X} \in A(\theta_0) \mid \theta = \theta_0) = 1 - \alpha.$$
And so
$$P(\theta_0 \in I(\mathbf{X}) \mid \theta = \theta_0) = P(\mathbf{X} \in A(\theta_0) \mid \theta = \theta_0) = 1 - \alpha.$$
For (ii), since $I(\mathbf{X})$ is a $100(1 - \alpha)\%$ confidence set, we have
$$P(\theta_0 \in I(\mathbf{X}) \mid \theta = \theta_0) = 1 - \alpha.$$
So
$$P(\mathbf{X} \in A(\theta_0) \mid \theta = \theta_0) = P(\theta_0 \in I(\mathbf{X}) \mid \theta = \theta_0) = 1 - \alpha.$$

Example. Suppose $X_1, \cdots, X_n$ are iid $N(\mu, 1)$ random variables and we want a 95% confidence set for $\mu$.

One way is to use the theorem and find the confidence set that belongs to the hypothesis test that we found in the previous example. We find a test of size 0.05 of $H_0: \mu = \mu_0$ against $H_1: \mu \neq \mu_0$ that rejects $H_0$ when $|\sqrt{n}(\bar{x} - \mu_0)| > 1.96$ (where 1.96 is the upper 2.5% point of $N(0, 1)$).

Then $I(\mathbf{X}) = \{\mu : \mathbf{X} \in A(\mu)\} = \{\mu : |\sqrt{n}(\bar{X} - \mu)| < 1.96\}$. So a 95% confidence set for $\mu$ is $(\bar{X} - 1.96/\sqrt{n}, \bar{X} + 1.96/\sqrt{n})$.
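The duality can be seen concretely by coding the test and the interval separately and checking that they agree. This is an illustrative sketch for unit-variance normal data; the function names and the data vector are assumptions made for the example, not part of the notes.

```python
from math import sqrt
from statistics import NormalDist, fmean

def accepts(xs, mu0, alpha=0.05):
    """Two-sided size-alpha z-test of mu = mu0 for N(mu, 1) data; True if H0 is not rejected."""
    z_half = NormalDist().inv_cdf(1 - alpha / 2)     # about 1.96 when alpha = 0.05
    return abs(sqrt(len(xs)) * (fmean(xs) - mu0)) < z_half

def confidence_interval(xs, alpha=0.05):
    """The interval xbar +/- z_{alpha/2} / sqrt(n) obtained by inverting the test."""
    z_half = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z_half / sqrt(len(xs))
    centre = fmean(xs)
    return centre - half_width, centre + half_width

xs = [5.1, 5.5, 4.9, 5.3]                            # hypothetical data
lo, hi = confidence_interval(xs)
candidates = [4.0, 4.3, 5.0, 6.1, 6.5]
agreement = [accepts(xs, mu0) == (lo < mu0 < hi) for mu0 in candidates]
```

Every entry of `agreement` is `True`: a candidate mean lies in the interval exactly when the corresponding test does not reject it.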

2.5 Multivariate normal theory

2.5.1 Multivariate normal distribution

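So far, we have only worked with scalar random variables or a vector of iid random variables. In general, we can have a random (column) vector $\mathbf{X} = (X_1, \cdots, X_n)^T$, where the $X_i$ are correlated.

The mean of this vector is given by
$$\mu = E[\mathbf{X}] = (E(X_1), \cdots, E(X_n))^T = (\mu_1, \cdots, \mu_n)^T.$$
Instead of just the variance, we have the covariance matrix
$$\mathrm{cov}(\mathbf{X}) = E[(\mathbf{X} - \mu)(\mathbf{X} - \mu)^T] = (\mathrm{cov}(X_i, X_j))_{ij},$$
provided they exist, of course.

We can multiply the vector $\mathbf{X}$ by an $m \times n$ matrix $A$. Then we have
$$E[A\mathbf{X}] = A\mu,$$
and
$$\mathrm{cov}(A\mathbf{X}) = A \, \mathrm{cov}(\mathbf{X}) A^T. \quad (*)$$
The last one comes from
$$\mathrm{cov}(A\mathbf{X}) = E[(A\mathbf{X} - E[A\mathbf{X}])(A\mathbf{X} - E[A\mathbf{X}])^T]$$
$$= E[A(\mathbf{X} - E\mathbf{X})(\mathbf{X} - E\mathbf{X})^T A^T]$$
$$= A E[(\mathbf{X} - E\mathbf{X})(\mathbf{X} - E\mathbf{X})^T] A^T.$$
If we have two random vectors $\mathbf{V}, \mathbf{W}$, we can define the covariance $\mathrm{cov}(\mathbf{V}, \mathbf{W})$ to be a matrix with $(i, j)$th element $\mathrm{cov}(V_i, W_j)$. Then $\mathrm{cov}(A\mathbf{X}, B\mathbf{X}) = A \, \mathrm{cov}(\mathbf{X}) B^T$.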

An important distribution is a multivariate normal distribution.

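Definition (Multivariate normal distribution). $\mathbf{X}$ has a multivariate normal distribution if, for every $\mathbf{t} \in \mathbb{R}^n$, the random variable $\mathbf{t}^T \mathbf{X}$ (i.e. $\mathbf{t} \cdot \mathbf{X}$) has a normal distribution. If $E[\mathbf{X}] = \mu$ and $\mathrm{cov}(\mathbf{X}) = \Sigma$, we write $\mathbf{X} \sim N_n(\mu, \Sigma)$.

Note that $\Sigma$ is symmetric and is positive semi-definite because by $(*)$,
$$\mathbf{t}^T \Sigma \mathbf{t} = \mathrm{var}(\mathbf{t}^T \mathbf{X}) \geq 0.$$
So what is the pdf of a multivariate normal? And what is the moment generating function? Recall that a (univariate) normal $X \sim N(\mu, \sigma^2)$ has density
$$f_X(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2}\right),$$
with moment generating function
$$M_X(s) = E[e^{sX}] = \exp\left(\mu s + \frac{1}{2}\sigma^2 s^2\right).$$
Hence for any $\mathbf{t}$, the moment generating function of $\mathbf{t}^T \mathbf{X}$ is given by
$$M_{\mathbf{t}^T \mathbf{X}}(s) = E[e^{s \mathbf{t}^T \mathbf{X}}] = \exp\left(\mathbf{t}^T \mu s + \frac{1}{2} \mathbf{t}^T \Sigma \mathbf{t} s^2\right).$$
Hence $\mathbf{X}$ has mgf
$$M_{\mathbf{X}}(\mathbf{t}) = E[e^{\mathbf{t}^T \mathbf{X}}] = M_{\mathbf{t}^T \mathbf{X}}(1) = \exp\left(\mathbf{t}^T \mu + \frac{1}{2} \mathbf{t}^T \Sigma \mathbf{t}\right). \quad (\dagger)$$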

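Proposition.

(i) If $\mathbf{X} \sim N_n(\mu, \Sigma)$, and $A$ is an $m \times n$ matrix, then $A\mathbf{X} \sim N_m(A\mu, A\Sigma A^T)$.

(ii) If $\mathbf{X} \sim N_n(0, \sigma^2 I)$, then
$$\frac{|\mathbf{X}|^2}{\sigma^2} = \frac{\mathbf{X}^T \mathbf{X}}{\sigma^2} = \sum \frac{X_i^2}{\sigma^2} \sim \chi_n^2.$$
Instead of writing $|\mathbf{X}|^2/\sigma^2 \sim \chi_n^2$, we often just say $|\mathbf{X}|^2 \sim \sigma^2 \chi_n^2$.

Proof.

(i) See example sheet 3.

(ii) Immediate from definition of $\chi_n^2$.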

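Proposition. Let $\mathbf{X} \sim N_n(\mu, \Sigma)$. We split $\mathbf{X}$ up into two parts: $\mathbf{X} = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix}$, where $\mathbf{X}_i$ is an $n_i \times 1$ column vector and $n_1 + n_2 = n$.

Similarly write
$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},$$
where $\Sigma_{ij}$ is an $n_i \times n_j$ matrix.

Then

(i) $\mathbf{X}_i \sim N_{n_i}(\mu_i, \Sigma_{ii})$

(ii) $\mathbf{X}_1$ and $\mathbf{X}_2$ are independent iff $\Sigma_{12} = 0$.

Proof.

(i) See example sheet 3.

(ii) Note that by symmetry of $\Sigma$, $\Sigma_{12} = 0$ if and only if $\Sigma_{21} = 0$.

From $(\dagger)$, $M_{\mathbf{X}}(\mathbf{t}) = \exp(\mathbf{t}^T \mu + \frac{1}{2} \mathbf{t}^T \Sigma \mathbf{t})$ for each $\mathbf{t} \in \mathbb{R}^n$. We write $\mathbf{t} = \begin{pmatrix} \mathbf{t}_1 \\ \mathbf{t}_2 \end{pmatrix}$. Then the mgf is equal to
$$M_{\mathbf{X}}(\mathbf{t}) = \exp\left(\mathbf{t}_1^T \mu_1 + \mathbf{t}_2^T \mu_2 + \frac{1}{2} \mathbf{t}_1^T \Sigma_{11} \mathbf{t}_1 + \frac{1}{2} \mathbf{t}_2^T \Sigma_{22} \mathbf{t}_2 + \frac{1}{2} \mathbf{t}_1^T \Sigma_{12} \mathbf{t}_2 + \frac{1}{2} \mathbf{t}_2^T \Sigma_{21} \mathbf{t}_1\right).$$
From (i), we know that $M_{\mathbf{X}_i}(\mathbf{t}_i) = \exp(\mathbf{t}_i^T \mu_i + \frac{1}{2} \mathbf{t}_i^T \Sigma_{ii} \mathbf{t}_i)$. So $M_{\mathbf{X}}(\mathbf{t}) = M_{\mathbf{X}_1}(\mathbf{t}_1) M_{\mathbf{X}_2}(\mathbf{t}_2)$ for all $\mathbf{t}$ if and only if $\Sigma_{12} = 0$.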

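Proposition. When $\Sigma$ is positive definite, $\mathbf{X}$ has pdf
$$f_{\mathbf{X}}(\mathbf{x}; \mu, \Sigma) = \frac{1}{\sqrt{|\Sigma|}} \left(\frac{1}{\sqrt{2\pi}}\right)^n \exp\left[-\frac{1}{2}(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)\right].$$
Note that $\Sigma$ is always positive semi-definite. The condition just forbids the case $|\Sigma| = 0$, since this would lead to dividing by zero.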

2.5.2 Normal random samples

We wish to use our knowledge about multivariate normals to study univariate

normal data. In particular, we want to prove the following:

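Theorem (Joint distribution of $\bar{X}$ and $S_{XX}$). Suppose $X_1, \cdots, X_n$ are iid $N(\mu, \sigma^2)$ and $\bar{X} = \frac{1}{n} \sum X_i$, and $S_{XX} = \sum (X_i - \bar{X})^2$. Then

(i) $\bar{X} \sim N(\mu, \sigma^2/n)$

(ii) $S_{XX}/\sigma^2 \sim \chi_{n-1}^2$

(iii) $\bar{X}$ and $S_{XX}$ are independent.

Proof. We can write the joint density as $\mathbf{X} \sim N_n(\boldsymbol{\mu}, \sigma^2 I)$, where $\boldsymbol{\mu} = (\mu, \mu, \cdots, \mu)$.

Let $A$ be an $n \times n$ orthogonal matrix with the first row all $1/\sqrt{n}$ (the other rows are not important). One possible such matrix is
$$A = \begin{pmatrix}
\frac{1}{\sqrt{n}} & \frac{1}{\sqrt{n}} & \frac{1}{\sqrt{n}} & \frac{1}{\sqrt{n}} & \cdots & \frac{1}{\sqrt{n}} \\
\frac{1}{\sqrt{2 \times 1}} & \frac{-1}{\sqrt{2 \times 1}} & 0 & 0 & \cdots & 0 \\
\frac{1}{\sqrt{3 \times 2}} & \frac{1}{\sqrt{3 \times 2}} & \frac{-2}{\sqrt{3 \times 2}} & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\frac{1}{\sqrt{n(n-1)}} & \frac{1}{\sqrt{n(n-1)}} & \frac{1}{\sqrt{n(n-1)}} & \frac{1}{\sqrt{n(n-1)}} & \cdots & \frac{-(n-1)}{\sqrt{n(n-1)}}
\end{pmatrix}.$$
Now define $\mathbf{Y} = A\mathbf{X}$. Then
$$\mathbf{Y} \sim N_n(A\boldsymbol{\mu}, A\sigma^2 I A^T) = N_n(A\boldsymbol{\mu}, \sigma^2 I).$$
We have
$$A\boldsymbol{\mu} = (\sqrt{n}\mu, 0, \cdots, 0)^T.$$
So $Y_1 \sim N(\sqrt{n}\mu, \sigma^2)$ and $Y_i \sim N(0, \sigma^2)$ for $i = 2, \cdots, n$. Also, $Y_1, \cdots, Y_n$ are independent, since every off-diagonal term of the covariance matrix is 0.

But from the definition of $A$, we have
$$Y_1 = \frac{1}{\sqrt{n}} \sum_{i=1}^n X_i = \sqrt{n}\bar{X}.$$
So $\sqrt{n}\bar{X} \sim N(\sqrt{n}\mu, \sigma^2)$, or $\bar{X} \sim N(\mu, \sigma^2/n)$. Also
$$Y_2^2 + \cdots + Y_n^2 = \mathbf{Y}^T \mathbf{Y} - Y_1^2 = \mathbf{X}^T A^T A \mathbf{X} - Y_1^2 = \mathbf{X}^T \mathbf{X} - n\bar{X}^2$$
$$= \sum_{i=1}^n X_i^2 - n\bar{X}^2 = \sum_{i=1}^n (X_i - \bar{X})^2 = S_{XX}.$$
So $S_{XX} = Y_2^2 + \cdots + Y_n^2 \sim \sigma^2 \chi_{n-1}^2$.

Finally, since $Y_1$ and $Y_2, \cdots, Y_n$ are independent, so are $\bar{X}$ and $S_{XX}$.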
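The algebraic facts used in the proof (the matrix is orthogonal, its first coordinate picks out the scaled sample mean, and the remaining coordinates carry the sum of squares) can be checked numerically. This is a sketch in plain Python; the helper names `helmert` and `matvec` and the data vector are hypothetical.

```python
from math import sqrt, isclose

def helmert(n):
    """Orthogonal n x n matrix whose first row is constant 1/sqrt(n)."""
    rows = [[1 / sqrt(n)] * n]
    for k in range(1, n):
        scale = sqrt(k * (k + 1))
        # k entries equal to 1/scale, then -k/scale, then zeros
        rows.append([1 / scale] * k + [-k / scale] + [0.0] * (n - k - 1))
    return rows

def matvec(a, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

xs = [5.1, 5.5, 4.9, 5.3]                      # hypothetical sample
n = len(xs)
A = helmert(n)
ys = matvec(A, xs)                             # Y = A X

xbar = sum(xs) / n
s_xx = sum((x - xbar) ** 2 for x in xs)        # S_XX
tail_sum_of_squares = sum(y ** 2 for y in ys[1:])
```

Here `ys[0]` equals the square root of the sample size times the sample mean, and `tail_sum_of_squares` equals `s_xx` up to rounding, as the proof asserts.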

2.6 Student’s t-distribution

Definition ($t$-distribution). Suppose that $Z$ and $Y$ are independent, $Z \sim N(0, 1)$ and $Y \sim \chi_k^2$. Then
$$T = \frac{Z}{\sqrt{Y/k}}$$
is said to have a $t$-distribution on $k$ degrees of freedom, and we write $T \sim t_k$.

The density of $t_k$ turns out to be
$$f_T(t) = \frac{\Gamma((k + 1)/2)}{\Gamma(k/2)} \frac{1}{\sqrt{\pi k}} \left(1 + \frac{t^2}{k}\right)^{-(k+1)/2}.$$
This density is symmetric, bell-shaped, and has a maximum at $t = 0$, which is rather like the standard normal density. However, it can be shown that $P(T > t) > P(Z > t)$ for $t > 0$, i.e. the $t$-distribution has a “fatter” tail. Also, as $k \to \infty$, $t_k$ approaches a normal distribution.

Proposition. If $k > 1$, then $E_k(T) = 0$.

If $k > 2$, then $\mathrm{var}_k(T) = \frac{k}{k - 2}$.

If $k = 2$, then $\mathrm{var}_k(T) = \infty$.

In all other cases, the values are undefined. In particular, the $k = 1$ case has undefined mean and variance. This is known as the Cauchy distribution.

Notation. We write $t_k(\alpha)$ for the upper $100\alpha\%$ point of the $t_k$ distribution, so that $P(T > t_k(\alpha)) = \alpha$.

Why would we define such a weird distribution? The typical application is

to study random samples with unknown mean and unknown variance.

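Let $X_1, \cdots, X_n$ be iid $N(\mu, \sigma^2)$. Then $\bar{X} \sim N(\mu, \sigma^2/n)$. So $Z = \frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \sim N(0, 1)$.

Also, $S_{XX}/\sigma^2 \sim \chi_{n-1}^2$ and is independent of $\bar{X}$, and hence $Z$. So
$$\frac{\sqrt{n}(\bar{X} - \mu)/\sigma}{\sqrt{S_{XX}/((n - 1)\sigma^2)}} \sim t_{n-1},$$
or
$$\frac{\sqrt{n}(\bar{X} - \mu)}{\sqrt{S_{XX}/(n - 1)}} \sim t_{n-1}.$$
We write $\tilde{\sigma}^2 = \frac{S_{XX}}{n - 1}$ (note that this is the unbiased estimator). Then a $100(1 - \alpha)\%$ confidence interval for $\mu$ is found from
$$1 - \alpha = P\left(-t_{n-1}\left(\frac{\alpha}{2}\right) \leq \frac{\sqrt{n}(\bar{X} - \mu)}{\tilde{\sigma}} \leq t_{n-1}\left(\frac{\alpha}{2}\right)\right).$$
This has endpoints
$$\bar{X} \pm \frac{\tilde{\sigma}}{\sqrt{n}} t_{n-1}\left(\frac{\alpha}{2}\right).$$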

3 Linear models

3.1 Linear models

Linear models can be used to explain or model the relationship between a

response (or dependent) variable, and one or more explanatory variables (or

covariates or predictors). As the name suggests, we assume the relationship is

linear.

Example. How do motor insurance claim rates (response) depend on the age

and sex of the driver, and where they live (explanatory variables)?

It is important to note that (unless otherwise specified), we do not assume

normality in our calculations here.

Suppose we have $p$ covariates $x_j$, and we have $n$ observations $Y_i$. We assume $n > p$, or else we can pick the parameters to fit our data exactly. Then each observation can be written as
$$Y_i = \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i \quad (*)$$
for $i = 1, \cdots, n$. Here

– $\beta_1, \cdots, \beta_p$ are unknown, fixed parameters we wish to work out (with $n > p$)

– $x_{i1}, \cdots, x_{ip}$ are the values of the $p$ covariates for the $i$th response (which are all known).

– $\varepsilon_1, \cdots, \varepsilon_n$ are independent (or possibly just uncorrelated) random variables with mean 0 and variance $\sigma^2$.

We think of the $\beta_j x_{ij}$ terms as the causal effects of $x_{ij}$ and of $\varepsilon_i$ as a random fluctuation (error term).

Then we clearly have

– $E(Y_i) = \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$

– $\mathrm{var}(Y_i) = \mathrm{var}(\varepsilon_i) = \sigma^2$

– $Y_1, \cdots, Y_n$ are independent.

Note that $(*)$ is linear in the parameters $\beta_1, \cdots, \beta_p$. Obviously the real world can be much more complicated. But this is much easier to work with.

Example. For each of 24 males, the maximum volume of oxygen uptake in the

blood and the time taken to run 2 miles (in minutes) were measured. We want

to know how the time taken depends on oxygen uptake.

We might get the results

Oxygen 42.3 53.1 42.1 50.1 42.5 42.5 47.8 49.9

Time 918 805 892 962 968 907 770 743

Oxygen 36.2 49.7 41.5 46.2 48.2 43.2 51.8 53.3

Time 1045 810 927 813 858 860 760 747

Oxygen 53.3 47.2 56.9 47.8 48.7 53.7 60.6 56.7

Time 743 803 683 844 755 700 748 775

For each individual $i$, we let $Y_i$ be the time to run 2 miles, and $x_i$ be the maximum volume of oxygen uptake, $i = 1, \cdots, 24$. We might want to fit a straight line to it. So a possible model is

\[
  Y_i = a + b x_i + \varepsilon_i,
\]

where $\varepsilon_i$ are independent random variables with variance $\sigma^2$, and $a$ and $b$ are constants.

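The subscripts in the equation make it tempting to write them as matrices:

\[
  \mathbf{Y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix},\quad
  X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix},\quad
  \boldsymbol{\beta} = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix},\quad
  \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}.
\]

Then the equation becomes

\[
  \mathbf{Y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}. \qquad (2)
\]

We also have

– $\mathbb{E}(\boldsymbol{\varepsilon}) = \mathbf{0}$.

– $\operatorname{cov}(\mathbf{Y}) = \sigma^2 I$.

We assume throughout that $X$ has full rank $p$, i.e. the columns are independent, and that the error variance is the same for each observation. We say this is the homoscedastic case, as opposed to heteroscedastic.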

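Example. Continuing our example, we have, in matrix form

\[
  \mathbf{Y} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_{24} \end{pmatrix},\quad
  X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_{24} \end{pmatrix},\quad
  \boldsymbol{\beta} = \begin{pmatrix} a \\ b \end{pmatrix},\quad
  \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_{24} \end{pmatrix}.
\]

Then

\[
  \mathbf{Y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}.
\]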

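Definition (Least squares estimator). In a linear model $\mathbf{Y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, the least squares estimator $\hat{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}$ minimizes

\[
  S(\boldsymbol{\beta}) = \|\mathbf{Y} - X\boldsymbol{\beta}\|^2 = (\mathbf{Y} - X\boldsymbol{\beta})^T(\mathbf{Y} - X\boldsymbol{\beta}) = \sum_{i=1}^n (Y_i - x_{ij}\beta_j)^2
\]

with implicit summation over $j$.

If we plot the points on a graph, then the least squares estimator minimizes the (square of the) vertical distance between the points and the line.

To minimize it, we want

\[
  \left.\frac{\partial S}{\partial \beta_k}\right|_{\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}} = 0
\]

for all $k$. So

\[
  -2 x_{ik}(Y_i - x_{ij}\hat{\beta}_j) = 0
\]

for each $k$ (with implicit summation over $i$ and $j$), i.e.

\[
  x_{ik} x_{ij} \hat{\beta}_j = x_{ik} Y_i
\]

for all $k$. Putting this back in matrix form, we have

Proposition. The least squares estimator satisfies

\[
  X^T X \hat{\boldsymbol{\beta}} = X^T \mathbf{Y}. \qquad (3)
\]

We could also have derived this by completing the square of $(\mathbf{Y} - X\boldsymbol{\beta})^T(\mathbf{Y} - X\boldsymbol{\beta})$, but that would be more complicated.

In order to find $\hat{\boldsymbol{\beta}}$, our life would be much easier if $X^T X$ has an inverse. Fortunately, it always does. We assumed that $X$ is of full rank $p$. Then

\[
  \mathbf{t}^T X^T X \mathbf{t} = (X\mathbf{t})^T (X\mathbf{t}) = \|X\mathbf{t}\|^2 > 0
\]

for $\mathbf{t} \neq \mathbf{0}$ in $\mathbb{R}^p$ (the last inequality holds since if there were a $\mathbf{t} \neq \mathbf{0}$ such that $\|X\mathbf{t}\| = 0$, then we would have produced a non-trivial linear combination of the columns of $X$ that gives $\mathbf{0}$). So $X^T X$ is positive definite, and hence has an inverse. So

\[
  \hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{Y}, \qquad (4)
\]

which is linear in $\mathbf{Y}$.

We have

\[
  \mathbb{E}(\hat{\boldsymbol{\beta}}) = (X^T X)^{-1} X^T \mathbb{E}[\mathbf{Y}] = (X^T X)^{-1} X^T X \boldsymbol{\beta} = \boldsymbol{\beta}.
\]

So $\hat{\boldsymbol{\beta}}$ is an unbiased estimator for $\boldsymbol{\beta}$. Also

\[
  \operatorname{cov}(\hat{\boldsymbol{\beta}}) = (X^T X)^{-1} X^T \operatorname{cov}(\mathbf{Y}) X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}, \qquad (5)
\]

since $\operatorname{cov} \mathbf{Y} = \sigma^2 I$.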
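As a rough illustration of the formula $\hat{\boldsymbol{\beta}} = (X^TX)^{-1}X^T\mathbf{Y}$, the following sketch solves the normal equations $X^TX\hat{\boldsymbol{\beta}} = X^T\mathbf{Y}$ in plain Python. The helper names (`transpose`, `matmul`, `solve`, `least_squares`) and the toy data are made up for this illustration and are not from the lectures; the data lie exactly on the line $y = 1 + 2x$, so the answer can be checked by hand.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve a x = b for a square invertible a by Gauss-Jordan elimination."""
    n = len(a)
    aug = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        # partial pivoting: bring the largest entry of this column to the diagonal
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def least_squares(X, Y):
    """Return beta_hat solving the normal equations X^T X beta = X^T Y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    XtY = [sum(x * y for x, y in zip(row, Y)) for row in Xt]
    return solve(XtX, XtY)


# Hypothetical toy data: intercept column plus one covariate, y = 1 + 2x exactly.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
Y = [1, 3, 5, 7]
beta_hat = least_squares(X, Y)
print(beta_hat)  # approximately [1.0, 2.0]
```

Solving the linear system directly, rather than forming $(X^TX)^{-1}$ explicitly, is a common design choice; in practice one would normally use a numerical linear algebra library.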

3.2 Simple linear regression

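What we did above was rather complicated. If we have a simple linear regression model

\[
  Y_i = a + b x_i + \varepsilon_i,
\]

then we can reparameterise it to

\[
  Y_i = a' + b(x_i - \bar{x}) + \varepsilon_i, \qquad (6)
\]

where $\bar{x} = \sum x_i / n$ and $a' = a + b\bar{x}$. Since $\sum (x_i - \bar{x}) = 0$, this leads to simplified calculations.

In matrix form,

\[
  X = \begin{pmatrix} 1 & (x_1 - \bar{x}) \\ \vdots & \vdots \\ 1 & (x_n - \bar{x}) \end{pmatrix}.
\]

Since $\sum (x_i - \bar{x}) = 0$, in $X^T X$, the off-diagonals are all 0, and we have

\[
  X^T X = \begin{pmatrix} n & 0 \\ 0 & S_{xx} \end{pmatrix},
\]

where $S_{xx} = \sum (x_i - \bar{x})^2$.

Hence

\[
  (X^T X)^{-1} = \begin{pmatrix} \frac{1}{n} & 0 \\ 0 & \frac{1}{S_{xx}} \end{pmatrix}.
\]

So

\[
  \hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{Y} = \begin{pmatrix} \bar{Y} \\ \frac{S_{xY}}{S_{xx}} \end{pmatrix},
\]

where $S_{xY} = \sum Y_i (x_i - \bar{x})$.

Hence the estimated intercept is $\hat{a}' = \bar{y}$, and the estimated gradient is

\[
\begin{aligned}
  \hat{b} = \frac{S_{xy}}{S_{xx}}
  &= \frac{\sum_i y_i (x_i - \bar{x})}{\sum_i (x_i - \bar{x})^2} \\
  &= \frac{\sum_i (y_i - \bar{y})(x_i - \bar{x})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}} \times \sqrt{\frac{S_{yy}}{S_{xx}}} \qquad (*) \\
  &= r \times \sqrt{\frac{S_{yy}}{S_{xx}}}.
\end{aligned}
\]

We have $(*)$ since $\sum \bar{y}(x_i - \bar{x}) = 0$, so we can subtract it from the numerator without changing its value. The square root factors then just multiply and divide by the same quantities.

So the gradient is the Pearson product-moment correlation coefficient $r$ times the ratio of the empirical standard deviations of the $y$'s and $x$'s (note that the gradient is the same whether the $x$'s are standardised to have mean 0 or not).

Hence we get $\operatorname{cov}(\hat{\boldsymbol{\beta}}) = \sigma^2 (X^T X)^{-1}$, and so from our expression for $(X^T X)^{-1}$,

\[
  \operatorname{var}(\hat{a}') = \operatorname{var}(\bar{Y}) = \frac{\sigma^2}{n}, \quad \operatorname{var}(\hat{b}) = \frac{\sigma^2}{S_{xx}}.
\]

Note that these estimators are uncorrelated.

Note also that these are obtained without any explicit distributional assumptions.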

Example. Continuing our previous oxygen/time example, we have $\bar{y} = 826.5$, $S_{xx} = 783.5 = 28.0^2$, $S_{xy} = -10077$, $S_{yy} = 444^2$, $r = -0.81$, $\hat{b} = -12.9$.
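The summary statistics in this example can be recomputed from the data table given earlier. The sketch below is one way to do so with the Python standard library; the variable names are chosen for this illustration only, and small differences from the quoted figures are to be expected from rounding.

```python
import math

# Oxygen uptake and 2-mile times for the 24 runners, transcribed from the data table.
oxygen = [42.3, 53.1, 42.1, 50.1, 42.5, 42.5, 47.8, 49.9,
          36.2, 49.7, 41.5, 46.2, 48.2, 43.2, 51.8, 53.3,
          53.3, 47.2, 56.9, 47.8, 48.7, 53.7, 60.6, 56.7]
time = [918, 805, 892, 962, 968, 907, 770, 743,
        1045, 810, 927, 813, 858, 860, 760, 747,
        743, 803, 683, 844, 755, 700, 748, 775]

n = len(oxygen)
x_bar = sum(oxygen) / n
y_bar = sum(time) / n

# Centred sums of squares and cross-products.
S_xx = sum((x - x_bar) ** 2 for x in oxygen)
S_yy = sum((y - y_bar) ** 2 for y in time)
S_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(oxygen, time))

a_prime_hat = y_bar                  # estimated intercept in the centred model
b_hat = S_xy / S_xx                  # estimated gradient
r = S_xy / math.sqrt(S_xx * S_yy)    # Pearson correlation coefficient

print(f"y_bar={y_bar:.1f}  S_xx={S_xx:.1f}  S_xy={S_xy:.0f}  "
      f"sqrt(S_yy)={math.sqrt(S_yy):.0f}  r={r:.2f}  b_hat={b_hat:.1f}")
```

The gradient could equally be computed as `r * math.sqrt(S_yy / S_xx)`, which is the second form of $\hat{b}$ derived above.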

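Theorem (Gauss Markov theorem). In a full rank linear model, let $\hat{\boldsymbol{\beta}}$ be the least squares estimator of $\boldsymbol{\beta}$ and let $\boldsymbol{\beta}^*$ be any other unbiased estimator for $\boldsymbol{\beta}$ which is linear in the $Y_i$'s. Then

\[
  \operatorname{var}(\mathbf{t}^T \hat{\boldsymbol{\beta}}) \leq \operatorname{var}(\mathbf{t}^T \boldsymbol{\beta}^*)
\]

for all $\mathbf{t} \in \mathbb{R}^p$. We say that $\hat{\boldsymbol{\beta}}$ is the best linear unbiased estimator of $\boldsymbol{\beta}$ (BLUE).

Proof. Since $\boldsymbol{\beta}^*$ is linear in the $Y_i$'s, $\boldsymbol{\beta}^* = A\mathbf{Y}$ for some $p \times n$ matrix $A$.

Since $\boldsymbol{\beta}^*$ is an unbiased estimator, we must have $\mathbb{E}[\boldsymbol{\beta}^*] = \boldsymbol{\beta}$. However, since $\boldsymbol{\beta}^* = A\mathbf{Y}$, $\mathbb{E}[\boldsymbol{\beta}^*] = A\mathbb{E}[\mathbf{Y}] = AX\boldsymbol{\beta}$. So we must have $\boldsymbol{\beta} = AX\boldsymbol{\beta}$. Since this holds for any $\boldsymbol{\beta}$, we must have $AX = I_p$. Now

\[
\begin{aligned}
  \operatorname{cov}(\boldsymbol{\beta}^*) &= \mathbb{E}[(\boldsymbol{\beta}^* - \boldsymbol{\beta})(\boldsymbol{\beta}^* - \boldsymbol{\beta})^T] \\
  &= \mathbb{E}[(A\mathbf{Y} - \boldsymbol{\beta})(A\mathbf{Y} - \boldsymbol{\beta})^T] \\
  &= \mathbb{E}[(AX\boldsymbol{\beta} + A\boldsymbol{\varepsilon} - \boldsymbol{\beta})(AX\boldsymbol{\beta} + A\boldsymbol{\varepsilon} - \boldsymbol{\beta})^T].
\end{aligned}
\]

Since $AX\boldsymbol{\beta} = \boldsymbol{\beta}$, this is equal to

\[
  \mathbb{E}[A\boldsymbol{\varepsilon}(A\boldsymbol{\varepsilon})^T] = A(\sigma^2 I)A^T = \sigma^2 AA^T.
\]

Now let $\boldsymbol{\beta}^* - \hat{\boldsymbol{\beta}} = (A - (X^T X)^{-1} X^T)\mathbf{Y} = B\mathbf{Y}$, for some $B$. Then

\[
  BX = AX - (X^T X)^{-1} X^T X = I_p - I_p = 0.
\]

By definition, we have $A\mathbf{Y} = B\mathbf{Y} + (X^T X)^{-1} X^T \mathbf{Y}$, and this is true for all $\mathbf{Y}$. So $A = B + (X^T X)^{-1} X^T$. Hence

\[
\begin{aligned}
  \operatorname{cov}(\boldsymbol{\beta}^*) &= \sigma^2 AA^T \\
  &= \sigma^2 (B + (X^T X)^{-1} X^T)(B + (X^T X)^{-1} X^T)^T \\
  &= \sigma^2 (BB^T + (X^T X)^{-1}) \\
  &= \sigma^2 BB^T + \operatorname{cov}(\hat{\boldsymbol{\beta}}).
\end{aligned}
\]

Note that when the product is expanded, the cross-terms disappear since $BX = 0$.

So for any $\mathbf{t} \in \mathbb{R}^p$, we have

\[
\begin{aligned}
  \operatorname{var}(\mathbf{t}^T \boldsymbol{\beta}^*) &= \mathbf{t}^T \operatorname{cov}(\boldsymbol{\beta}^*) \mathbf{t} \\
  &= \mathbf{t}^T \operatorname{cov}(\hat{\boldsymbol{\beta}}) \mathbf{t} + \mathbf{t}^T BB^T \mathbf{t}\, \sigma^2 \\
  &= \operatorname{var}(\mathbf{t}^T \hat{\boldsymbol{\beta}}) + \sigma^2 \|B^T \mathbf{t}\|^2 \\
  &\geq \operatorname{var}(\mathbf{t}^T \hat{\boldsymbol{\beta}}).
\end{aligned}
\]

Taking $\mathbf{t} = (0, \cdots, 1, 0, \cdots, 0)^T$ with a 1 in the $i$th position, we have

\[
  \operatorname{var}(\hat{\beta}_i) \leq \operatorname{var}(\beta_i^*).
\]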

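Definition (Fitted values and residuals). $\hat{\mathbf{Y}} = X\hat{\boldsymbol{\beta}}$ is the vector of fitted values. These are what our model says $\mathbf{Y}$ should be.

$\mathbf{R} = \mathbf{Y} - \hat{\mathbf{Y}}$ is the vector of residuals. These are the deviations of our model from reality.

The residual sum of squares is

\[
  \mathrm{RSS} = \|\mathbf{R}\|^2 = \mathbf{R}^T \mathbf{R} = (\mathbf{Y} - X\hat{\boldsymbol{\beta}})^T (\mathbf{Y} - X\hat{\boldsymbol{\beta}}).
\]

We can give these a geometric interpretation. Note that $X^T \mathbf{R} = X^T(\mathbf{Y} - \hat{\mathbf{Y}}) = X^T \mathbf{Y} - X^T X \hat{\boldsymbol{\beta}} = \mathbf{0}$ by our formula for $\hat{\boldsymbol{\beta}}$. So $\mathbf{R}$ is orthogonal to the column space of $X$.

Write $\hat{\mathbf{Y}} = X\hat{\boldsymbol{\beta}} = X(X^T X)^{-1} X^T \mathbf{Y} = P\mathbf{Y}$, where $P = X(X^T X)^{-1} X^T$. Then $P$ represents an orthogonal projection of $\mathbb{R}^n$ onto the space spanned by the columns of $X$, i.e. it projects the actual data $\mathbf{Y}$ to the fitted values $\hat{\mathbf{Y}}$. Then $\mathbf{R}$ is the part of $\mathbf{Y}$ orthogonal to the column space of $X$.

The projection matrix $P$ is idempotent and symmetric, i.e. $P^2 = P$ and $P^T = P$.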
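The properties of the projection matrix can be checked numerically on a small example. The sketch below uses hypothetical toy data and the centred simple-regression design, for which $X^TX = \operatorname{diag}(n, S_{xx})$, so that the entries of $P$ can be written down directly as $1/n + c_i c_j / S_{xx}$ with $c_i = x_i - \bar{x}$. None of the names or numbers come from the lectures.

```python
# Hypothetical toy data for a simple linear regression.
x = [0.0, 1.0, 2.0, 3.0]
Y = [1.0, 2.5, 2.0, 4.5]
n = len(x)
x_bar = sum(x) / n
c = [xi - x_bar for xi in x]            # centred covariate (second column of X)
S_xx = sum(ci ** 2 for ci in c)

# For X = [1, c] we have X^T X = diag(n, S_xx), so
# P = X (X^T X)^{-1} X^T has entries 1/n + c_i c_j / S_xx.
P = [[1.0 / n + c[i] * c[j] / S_xx for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]


P2 = matmul(P, P)
idempotence_gap = max(abs(P2[i][j] - P[i][j]) for i in range(n) for j in range(n))
symmetry_gap = max(abs(P[i][j] - P[j][i]) for i in range(n) for j in range(n))
trace_P = sum(P[i][i] for i in range(n))   # equals rank(P) = p = 2

fitted = [sum(P[i][j] * Y[j] for j in range(n)) for i in range(n)]   # P Y
R = [yi - fi for yi, fi in zip(Y, fitted)]                           # residuals
# X^T R: inner product of the residuals with each column of X.
XtR = [sum(R), sum(ci * ri for ci, ri in zip(c, R))]

print(idempotence_gap, symmetry_gap, trace_P, XtR)
```

Up to floating-point error, the two gaps and both entries of `XtR` are zero, and the trace equals the number of columns of $X$.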

3.3 Linear models with normal assumptions

So far, we have not assumed anything about our variables. In particular, we

have not assumed that they are normal. So what further information can we

obtain by assuming normality?

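Example. Suppose we want to measure the resistivity of silicon wafers. We have five instruments, and five wafers were measured by each instrument (so we have 25 wafers in total). We assume that the silicon wafers are all the same, and want to see whether the instruments are consistent with each other. The results are as follows:

Instrument   Wafer 1   Wafer 2   Wafer 3   Wafer 4   Wafer 5
1            130.5     112.4     118.9     125.7     134.0
2            130.4     138.2     116.7     132.6     104.2
3            113.0     120.5     128.9     103.4     118.1
4            128.0     117.5     114.9     114.9      98.8
5            121.2     110.5     118.5     100.5     120.9

Let $Y_{i,j}$ be the resistivity of the $j$th wafer measured by instrument $i$, where $i, j = 1, \cdots, 5$. A possible model is

\[
  Y_{i,j} = \mu_i + \varepsilon_{i,j},
\]

where $\varepsilon_{i,j}$ are independent random variables such that $\mathbb{E}[\varepsilon_{i,j}] = 0$ and $\operatorname{var}(\varepsilon_{i,j}) = \sigma^2$, and the $\mu_i$'s are unknown constants.

This can be written in matrix form, with

\[
  \mathbf{Y} = \begin{pmatrix} Y_{1,1} \\ \vdots \\ Y_{1,5} \\ Y_{2,1} \\ \vdots \\ Y_{2,5} \\ \vdots \\ Y_{5,1} \\ \vdots \\ Y_{5,5} \end{pmatrix},\quad
  X = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix},\quad
  \boldsymbol{\beta} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \\ \mu_5 \end{pmatrix},\quad
  \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_{1,1} \\ \vdots \\ \varepsilon_{1,5} \\ \varepsilon_{2,1} \\ \vdots \\ \varepsilon_{2,5} \\ \vdots \\ \varepsilon_{5,1} \\ \vdots \\ \varepsilon_{5,5} \end{pmatrix}.
\]

Then

\[
  \mathbf{Y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}.
\]

We have

\[
  X^T X = \begin{pmatrix} 5 & 0 & \cdots & 0 \\ 0 & 5 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 5 \end{pmatrix}.
\]

Hence

\[
  (X^T X)^{-1} = \begin{pmatrix} \frac{1}{5} & 0 & \cdots & 0 \\ 0 & \frac{1}{5} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{5} \end{pmatrix}.
\]

So we have

\[
  \hat{\boldsymbol{\mu}} = (X^T X)^{-1} X^T \mathbf{Y} = \begin{pmatrix} \bar{Y}_{1\cdot} \\ \vdots \\ \bar{Y}_{5\cdot} \end{pmatrix},
\]

where $\bar{Y}_{i\cdot}$ is the mean of the five measurements from instrument $i$.

The residual sum of squares is

\[
  \mathrm{RSS} = \sum_{i=1}^5 \sum_{j=1}^5 (Y_{i,j} - \hat{\mu}_i)^2 = \sum_{i=1}^5 \sum_{j=1}^5 (Y_{i,j} - \bar{Y}_{i\cdot})^2 = 2170.
\]

This has $n - p = 25 - 5 = 20$ degrees of freedom. We will later see that $\tilde{\sigma} = \sqrt{\mathrm{RSS}/(n - p)} = 10.4$.

Note that we still haven’t used normality!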
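The group means, the residual sum of squares and $\sqrt{\mathrm{RSS}/(n-p)}$ in this example can be reproduced from the table of measurements. The following is a minimal sketch using only the standard library; the variable names are chosen for this illustration.

```python
import math

# Resistivity measurements: one row per instrument, one column per wafer.
wafers = [
    [130.5, 112.4, 118.9, 125.7, 134.0],
    [130.4, 138.2, 116.7, 132.6, 104.2],
    [113.0, 120.5, 128.9, 103.4, 118.1],
    [128.0, 117.5, 114.9, 114.9, 98.8],
    [121.2, 110.5, 118.5, 100.5, 120.9],
]

# Least squares estimate of each mu_i is the mean of that instrument's row.
mu_hat = [sum(row) / len(row) for row in wafers]

# Residual sum of squares: squared deviations from the row means.
rss = sum((y - m) ** 2 for row, m in zip(wafers, mu_hat) for y in row)

n = sum(len(row) for row in wafers)   # 25 observations
p = len(wafers)                       # 5 parameters
sigma_tilde = math.sqrt(rss / (n - p))

print([round(m, 2) for m in mu_hat], round(rss, 1), round(sigma_tilde, 2))
```

The unrounded residual sum of squares comes out slightly above the quoted 2170, which appears to be a rounded figure.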

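We now make a normal assumption:

\[
  \mathbf{Y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon},\quad \boldsymbol{\varepsilon} \sim N_n(\mathbf{0}, \sigma^2 I),\quad \operatorname{rank}(X) = p < n.
\]

This is a special case of the linear model we just had, so all previous results hold.

Since $\mathbf{Y} \sim N_n(X\boldsymbol{\beta}, \sigma^2 I)$, the log-likelihood is

\[
  l(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2} \log 2\pi - \frac{n}{2} \log \sigma^2 - \frac{1}{2\sigma^2} S(\boldsymbol{\beta}),
\]

where

\[
  S(\boldsymbol{\beta}) = (\mathbf{Y} - X\boldsymbol{\beta})^T (\mathbf{Y} - X\boldsymbol{\beta}).
\]

If we want to maximize $l$ with respect to $\boldsymbol{\beta}$, we have to maximize the only term containing $\boldsymbol{\beta}$, namely $-S(\boldsymbol{\beta})/(2\sigma^2)$, i.e. we have to minimize $S(\boldsymbol{\beta})$. So

Proposition. Under normal assumptions the maximum likelihood estimator for a linear model is

\[
  \hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{Y},
\]

which is the same as the least squares estimator.

This isn’t coincidence! Historically, when Gauss devised the normal distribution, he designed it so that the least squares estimator is the same as the maximum likelihood estimator.

To obtain the MLE for $\sigma^2$, we require

\[
  \left.\frac{\partial l}{\partial \sigma^2}\right|_{\hat{\boldsymbol{\beta}}, \hat{\sigma}^2} = 0,
\]

i.e.

\[
  -\frac{n}{2\hat{\sigma}^2} + \frac{S(\hat{\boldsymbol{\beta}})}{2\hat{\sigma}^4} = 0.
\]

So

\[
  \hat{\sigma}^2 = \frac{1}{n} S(\hat{\boldsymbol{\beta}}) = \frac{1}{n} (\mathbf{Y} - X\hat{\boldsymbol{\beta}})^T (\mathbf{Y} - X\hat{\boldsymbol{\beta}}) = \frac{1}{n} \mathrm{RSS}.
\]

Our ultimate goal now is to show that $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ are independent. Then we can apply our other standard results such as the $t$-distribution.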

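First recall that the matrix $P = X(X^T X)^{-1} X^T$ that projects $\mathbf{Y}$ to $\hat{\mathbf{Y}}$ is idempotent and symmetric. We will prove the following properties of such matrices:

Lemma.

(i) If $\mathbf{Z} \sim N_n(\mathbf{0}, \sigma^2 I)$ and $A$ is $n \times n$, symmetric, idempotent with rank $r$, then $\mathbf{Z}^T A \mathbf{Z} \sim \sigma^2 \chi_r^2$.

(ii) For a symmetric idempotent matrix $A$, $\operatorname{rank}(A) = \operatorname{tr}(A)$.

Proof.

(i) Since $A$ is idempotent, $A^2 = A$ by definition. So eigenvalues of $A$ are either 0 or 1 (since $\lambda \mathbf{x} = A\mathbf{x} = A^2 \mathbf{x} = \lambda^2 \mathbf{x}$).

Since $A$ is also symmetric, it is diagonalizable. So there exists an orthogonal $Q$ such that

\[
  \Lambda = Q^T A Q = \operatorname{diag}(\lambda_1, \cdots, \lambda_n) = \operatorname{diag}(1, \cdots, 1, 0, \cdots, 0)
\]

with $r$ copies of 1 and $n - r$ copies of 0.

Let $\mathbf{W} = Q^T \mathbf{Z}$. So $\mathbf{Z} = Q\mathbf{W}$. Then $\mathbf{W} \sim N_n(\mathbf{0}, \sigma^2 I)$, since $\operatorname{cov}(\mathbf{W}) = Q^T \sigma^2 I Q = \sigma^2 I$. Then

\[
  \mathbf{Z}^T A \mathbf{Z} = \mathbf{W}^T Q^T A Q \mathbf{W} = \mathbf{W}^T \Lambda \mathbf{W} = \sum_{i=1}^r w_i^2 \sim \sigma^2 \chi_r^2.
\]

(ii)

\[
\begin{aligned}
  \operatorname{rank}(A) &= \operatorname{rank}(\Lambda) \\
  &= \operatorname{tr}(\Lambda) \\
  &= \operatorname{tr}(Q^T A Q) \\
  &= \operatorname{tr}(A Q Q^T) \\
  &= \operatorname{tr} A.
\end{aligned}
\]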

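Theorem. For the normal linear model $\mathbf{Y} \sim N_n(X\boldsymbol{\beta}, \sigma^2 I)$,

(i) $\hat{\boldsymbol{\beta}} \sim N_p(\boldsymbol{\beta}, \sigma^2 (X^T X)^{-1})$

(ii) $\mathrm{RSS} \sim \sigma^2 \chi_{n-p}^2$, and so $\hat{\sigma}^2 \sim \frac{\sigma^2}{n} \chi_{n-p}^2$.

(iii) $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ are independent.

The proof is not particularly elegant: it is just a whole lot of linear algebra!

Proof.

– We have $\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{Y}$. Call this $C\mathbf{Y}$ for later use. Then $\hat{\boldsymbol{\beta}}$ has a normal distribution with mean

\[
  (X^T X)^{-1} X^T (X\boldsymbol{\beta}) = \boldsymbol{\beta}
\]

and covariance

\[
  (X^T X)^{-1} X^T (\sigma^2 I) [(X^T X)^{-1} X^T]^T = \sigma^2 (X^T X)^{-1}.
\]

So

\[
  \hat{\boldsymbol{\beta}} \sim N_p(\boldsymbol{\beta}, \sigma^2 (X^T X)^{-1}).
\]

– Our previous lemma says that $\mathbf{Z}^T A \mathbf{Z} \sim \sigma^2 \chi_r^2$. So we pick our $\mathbf{Z}$ and $A$ so that $\mathbf{Z}^T A \mathbf{Z} = \mathrm{RSS}$, and $r$, the degrees of freedom of $A$, is $n - p$.

Let $\mathbf{Z} = \mathbf{Y} - X\boldsymbol{\beta}$ and $A = (I_n - P)$, where $P = X(X^T X)^{-1} X^T$. We first check that the conditions of the lemma hold:

Since $\mathbf{Y} \sim N_n(X\boldsymbol{\beta}, \sigma^2 I)$, $\mathbf{Z} = \mathbf{Y} - X\boldsymbol{\beta} \sim N_n(\mathbf{0}, \sigma^2 I)$.

Since $P$ is symmetric and idempotent, $I_n - P$ also is (check!). We also have

\[
  \operatorname{rank}(I_n - P) = \operatorname{tr}(I_n - P) = n - \operatorname{tr}(P) = n - p.
\]

Therefore the conditions of the lemma hold.

To get the final useful result, we want to show that the RSS is indeed $\mathbf{Z}^T A \mathbf{Z}$. We simplify the expressions of RSS and $\mathbf{Z}^T A \mathbf{Z}$ and show that they are equal:

\[
  \mathbf{Z}^T A \mathbf{Z} = (\mathbf{Y} - X\boldsymbol{\beta})^T (I_n - P)(\mathbf{Y} - X\boldsymbol{\beta}) = \mathbf{Y}^T (I_n - P)\mathbf{Y},
\]

noting the fact that $(I_n - P)X = 0$.

Writing $\mathbf{R} = \mathbf{Y} - \hat{\mathbf{Y}} = (I_n - P)\mathbf{Y}$, we have

\[
  \mathrm{RSS} = \mathbf{R}^T \mathbf{R} = \mathbf{Y}^T (I_n - P)\mathbf{Y},
\]

using the symmetry and idempotence of $I_n - P$.

Hence $\mathrm{RSS} = \mathbf{Z}^T A \mathbf{Z} \sim \sigma^2 \chi_{n-p}^2$. Then

\[
  \hat{\sigma}^2 = \frac{\mathrm{RSS}}{n} \sim \frac{\sigma^2}{n} \chi_{n-p}^2.
\]

– Let $V = \begin{pmatrix} \hat{\boldsymbol{\beta}} \\ \mathbf{R} \end{pmatrix} = D\mathbf{Y}$, where $D = \begin{pmatrix} C \\ I_n - P \end{pmatrix}$ is a $(p + n) \times n$ matrix.

Since $\mathbf{Y}$ is multivariate normal, $V$ is multivariate normal with

\[
\begin{aligned}
  \operatorname{cov}(V) &= D \sigma^2 I D^T \\
  &= \sigma^2 \begin{pmatrix} CC^T & C(I_n - P)^T \\ (I_n - P)C^T & (I_n - P)(I_n - P)^T \end{pmatrix} \\
  &= \sigma^2 \begin{pmatrix} CC^T & C(I_n - P) \\ (I_n - P)C^T & (I_n - P) \end{pmatrix} \\
  &= \sigma^2 \begin{pmatrix} CC^T & 0 \\ 0 & I_n - P \end{pmatrix},
\end{aligned}
\]

using $C(I_n - P) = 0$ (since $(X^T X)^{-1} X^T (I_n - P) = 0$, which follows from $(I_n - P)X = 0$; check!).

Hence $\hat{\boldsymbol{\beta}}$ and $\mathbf{R}$ are independent since the off-diagonal covariance terms are 0. So $\hat{\boldsymbol{\beta}}$ and $\mathrm{RSS} = \mathbf{R}^T \mathbf{R}$ are independent. So $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ are independent.

From (ii), $\mathbb{E}(\mathrm{RSS}) = \sigma^2 (n - p)$. So $\tilde{\sigma}^2 = \frac{\mathrm{RSS}}{n - p}$ is an unbiased estimator of $\sigma^2$. $\tilde{\sigma}$ is often known as the residual standard error on $n - p$ degrees of freedom.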

3.4 The F distribution

Definition (F distribution). Suppose $U$ and $V$ are independent with $U \sim \chi_m^2$ and $V \sim \chi_n^2$. Then $X = \frac{U/m}{V/n}$ is said to have an $F$-distribution on $m$ and $n$ degrees of freedom. We write $X \sim F_{m,n}$.

Since $U$ and $V$ have mean $m$ and $n$ respectively, $U/m$ and $V/n$ are approximately 1. So $F$ is often approximately 1.

It should be very clear from the definition that

Proposition. If $X \sim F_{m,n}$, then $1/X \sim F_{n,m}$.

We write $F_{m,n}(\alpha)$ for the upper $100\alpha\%$ point of the $F_{m,n}$-distribution, so that if $X \sim F_{m,n}$, then $\mathbb{P}(X > F_{m,n}(\alpha)) = \alpha$.

Suppose that we have the upper 5% point for all $F_{n,m}$. Using this information, it is easy to find the lower 5% point for $F_{m,n}$ since we know that $\mathbb{P}(F_{m,n} < 1/x) = \mathbb{P}(F_{n,m} > x)$, which is where the above proposition becomes useful.

Note that it is immediate from the definitions of $t_n$ and $F_{1,n}$ that if $Y \sim t_n$, then $Y^2 \sim F_{1,n}$, i.e. it is a ratio of independent $\chi_1^2$ and $\chi_n^2$ variables, each divided by its degrees of freedom.

3.5 Inference for β

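We know that $\hat{\boldsymbol{\beta}} \sim N_p(\boldsymbol{\beta}, \sigma^2 (X^T X)^{-1})$. So

\[
  \hat{\beta}_j \sim N(\beta_j, \sigma^2 (X^T X)^{-1}_{jj}).
\]

The standard error of $\hat{\beta}_j$ is defined to be

\[
  \mathrm{SE}(\hat{\beta}_j) = \sqrt{\tilde{\sigma}^2 (X^T X)^{-1}_{jj}},
\]

where $\tilde{\sigma}^2 = \mathrm{RSS}/(n - p)$. Unlike the actual variance $\sigma^2 (X^T X)^{-1}_{jj}$, the standard error is calculable from our data.

Then

\[
  \frac{\hat{\beta}_j - \beta_j}{\mathrm{SE}(\hat{\beta}_j)} = \frac{\hat{\beta}_j - \beta_j}{\sqrt{\tilde{\sigma}^2 (X^T X)^{-1}_{jj}}} = \frac{(\hat{\beta}_j - \beta_j)/\sqrt{\sigma^2 (X^T X)^{-1}_{jj}}}{\sqrt{\mathrm{RSS}/((n - p)\sigma^2)}}.
\]

By writing it in this somewhat weird form, we now recognize both the numerator and denominator. The numerator is a standard normal $N(0, 1)$, and the denominator is an independent $\sqrt{\chi_{n-p}^2/(n - p)}$, as we have previously shown. But a standard normal divided by such an independent scaled $\chi$ variable has, by definition, the $t$ distribution. So

\[
  \frac{\hat{\beta}_j - \beta_j}{\mathrm{SE}(\hat{\beta}_j)} \sim t_{n-p}.
\]

So a $100(1 - \alpha)\%$ confidence interval for $\beta_j$ has end points $\hat{\beta}_j \pm \mathrm{SE}(\hat{\beta}_j)\, t_{n-p}\left(\frac{\alpha}{2}\right)$.

In particular, if we want to test $H_0: \beta_j = 0$, we use the fact that under $H_0$, $\frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)} \sim t_{n-p}$.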

3.6 Simple linear regression

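We can apply our results to the case of simple linear regression. We have

\[
  Y_i = a' + b(x_i - \bar{x}) + \varepsilon_i,
\]

where $\bar{x} = \sum x_i / n$ and $\varepsilon_i$ are iid $N(0, \sigma^2)$ for $i = 1, \cdots, n$.

Then we have

\[
\begin{aligned}
  \hat{a}' &= \bar{Y} \sim N\left(a', \frac{\sigma^2}{n}\right) \\
  \hat{b} &= \frac{S_{xY}}{S_{xx}} \sim N\left(b, \frac{\sigma^2}{S_{xx}}\right) \\
  \hat{Y}_i &= \hat{a}' + \hat{b}(x_i - \bar{x}) \\
  \mathrm{RSS} &= \sum_i (Y_i - \hat{Y}_i)^2 \sim \sigma^2 \chi_{n-2}^2,
\end{aligned}
\]

and $(\hat{a}', \hat{b})$ and $\hat{\sigma}^2 = \mathrm{RSS}/n$ are independent, as we have previously shown.

Note that $\hat{\sigma}^2$ is obtained by dividing RSS by $n$, and is the maximum likelihood estimator. On the other hand, $\tilde{\sigma}^2$ is obtained by dividing RSS by $n - p$, and is an unbiased estimator.

Example. Using the oxygen/time example, we have

\[
  \tilde{\sigma}^2 = \frac{\mathrm{RSS}}{n - p} = \frac{67968}{24 - 2} = 3089 = 55.6^2.
\]

So the standard error of $\hat{b}$ is

\[
  \mathrm{SE}(\hat{b}) = \sqrt{\tilde{\sigma}^2 (X^T X)^{-1}_{22}} = \sqrt{\frac{3089}{S_{xx}}} = \frac{55.6}{28.0} = 1.99.
\]

So a 95% interval for $b$ has end points

\[
  \hat{b} \pm \mathrm{SE}(\hat{b}) \times t_{n-p}(0.025) = -12.9 \pm 1.99 \times t_{22}(0.025) = (-17.0, -8.8),
\]

using the fact that $t_{22}(0.025) = 2.07$.

Note that this interval does not contain 0. So if we want to carry out a size 0.05 test of $H_0: b = 0$ (they are uncorrelated) vs $H_1: b \neq 0$ (they are correlated), the test statistic would be $\frac{\hat{b}}{\mathrm{SE}(\hat{b})} = \frac{-12.9}{1.99} = -6.48$. Then we reject $H_0$ because this is less than $-t_{22}(0.025) = -2.07$.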
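The arithmetic in this example can be retraced in a few lines. The sketch below starts from the summary statistics quoted in the text (RSS, $S_{xx}$, $\hat{b}$ and the tabulated value $t_{22}(0.025) = 2.07$) rather than from the raw data, so its output agrees with the text only up to rounding; the variable names are chosen for this illustration.

```python
import math

# Summary statistics as quoted in the text for the oxygen/time example.
n, p = 24, 2
rss = 67968.0
S_xx = 783.5
b_hat = -12.9
t_crit = 2.07          # t_22(0.025), value quoted in the text

sigma_tilde_sq = rss / (n - p)               # unbiased estimate of sigma^2
se_b = math.sqrt(sigma_tilde_sq / S_xx)      # standard error of b_hat

# 95% confidence interval for b.
ci = (b_hat - se_b * t_crit, b_hat + se_b * t_crit)

# Test of H0: b = 0 against H1: b != 0 at size 0.05.
t_stat = b_hat / se_b
reject_h0 = abs(t_stat) > t_crit

print(f"sigma_tilde^2={sigma_tilde_sq:.0f}  SE={se_b:.2f}  "
      f"CI=({ci[0]:.1f}, {ci[1]:.1f})  t={t_stat:.2f}  reject H0: {reject_h0}")
```

Rejecting $H_0$ when $|t|$ exceeds $t_{22}(0.025)$ is equivalent to checking that the 95% interval excludes 0, which is why both routes give the same conclusion.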

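3.7 Expected response at x∗

After performing the linear regression, we can now make predictions from it. Suppose that $\mathbf{x}^*$ is a new vector of values for the explanatory variables.

The expected response at $\mathbf{x}^*$ is $\mathbb{E}[Y \mid \mathbf{x}^*] = \mathbf{x}^{*T} \boldsymbol{\beta}$. We estimate this by $\mathbf{x}^{*T} \hat{\boldsymbol{\beta}}$. Then we have

\[
  \mathbf{x}^{*T} (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \sim N(0, \mathbf{x}^{*T} \operatorname{cov}(\hat{\boldsymbol{\beta}}) \mathbf{x}^*) = N(0, \sigma^2 \mathbf{x}^{*T} (X^T X)^{-1} \mathbf{x}^*).
\]

Let $\tau^2 = \mathbf{x}^{*T} (X^T X)^{-1} \mathbf{x}^*$. Then

\[
  \frac{\mathbf{x}^{*T} (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})}{\tilde{\sigma} \tau} \sim t_{n-p}.
\]

Then a $100(1 - \alpha)\%$ confidence interval for the expected response $\mathbf{x}^{*T} \boldsymbol{\beta}$ has end points

\[
  \mathbf{x}^{*T} \hat{\boldsymbol{\beta}} \pm \tilde{\sigma} \tau\, t_{n-p}\left(\frac{\alpha}{2}\right).
\]

Example. Previous example continued:

Suppose we wish to estimate the time to run 2 miles for a man with an oxygen take-up measurement of 50. Here $\mathbf{x}^{*T} = (1, 50 - \bar{x})$, where $\bar{x} = 48.6$.

The estimated expected response at $\mathbf{x}^*$ is

\[
  \mathbf{x}^{*T} \hat{\boldsymbol{\beta}} = \hat{a}' + (50 - 48.6) \times \hat{b} = 826.5 - 1.4 \times 12.9 = 808.5,
\]

which is obtained by plugging $\mathbf{x}^*$ into our fitted line.

We find

\[
  \tau^2 = \mathbf{x}^{*T} (X^T X)^{-1} \mathbf{x}^* = \frac{1}{n} + \frac{(50 - \bar{x})^2}{S_{xx}} = \frac{1}{24} + \frac{1.4^2}{783.5} = 0.044 = 0.21^2.
\]

So a 95% confidence interval for $\mathbb{E}[Y \mid x^* = 50 - \bar{x}]$ is

\[
  \mathbf{x}^{*T} \hat{\boldsymbol{\beta}} \pm \tilde{\sigma} \tau\, t_{n-p}\left(\frac{\alpha}{2}\right) = 808.5 \pm 55.6 \times 0.21 \times 2.07 = (783.6, 832.2).
\]

Note that this is the confidence interval for the predicted expected value, NOT the confidence interval for the actual obtained value.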

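The predicted response at $\mathbf{x}^*$ is $Y^* = \mathbf{x}^{*T} \boldsymbol{\beta} + \varepsilon^*$, where $\varepsilon^* \sim N(0, \sigma^2)$, and $Y^*$ is independent of $Y_1, \cdots, Y_n$. Here we have more uncertainties in our prediction: $\hat{\boldsymbol{\beta}}$ and $\varepsilon^*$.

A $100(1 - \alpha)\%$ prediction interval for $Y^*$ is an interval $I(\mathbf{Y})$ such that $\mathbb{P}(Y^* \in I(\mathbf{Y})) = 1 - \alpha$, where the probability is over the joint distribution of $Y^*, Y_1, \cdots, Y_n$. So $I$ is a random function of the past data $\mathbf{Y}$ that outputs an interval.

First of all, as above, the predicted expected response is $\hat{Y}^* = \mathbf{x}^{*T} \hat{\boldsymbol{\beta}}$. This is an unbiased estimator since $\hat{Y}^* - Y^* = \mathbf{x}^{*T} (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) - \varepsilon^*$, and hence

\[
  \mathbb{E}[\hat{Y}^* - Y^*] = \mathbf{x}^{*T} (\boldsymbol{\beta} - \boldsymbol{\beta}) = 0.
\]

To find the variance, we use the fact that $\mathbf{x}^{*T} (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ and $\varepsilon^*$ are independent, and the variance of the sum of independent variables is the sum of the variances.

\[
\begin{aligned}
  \operatorname{var}(\hat{Y}^* - Y^*) &= \operatorname{var}(\mathbf{x}^{*T} \hat{\boldsymbol{\beta}}) + \operatorname{var}(\varepsilon^*) \\
  &= \sigma^2 \mathbf{x}^{*T} (X^T X)^{-1} \mathbf{x}^* + \sigma^2 \\
  &= \sigma^2 (\tau^2 + 1).
\end{aligned}
\]

We can see this as the uncertainty in the regression line $\sigma^2 \tau^2$, plus the wobble about the regression line $\sigma^2$. So

\[
  \hat{Y}^* - Y^* \sim N(0, \sigma^2 (\tau^2 + 1)).
\]

We therefore find that

\[
  \frac{\hat{Y}^* - Y^*}{\tilde{\sigma} \sqrt{\tau^2 + 1}} \sim t_{n-p}.
\]

So the interval with endpoints

\[
  \mathbf{x}^{*T} \hat{\boldsymbol{\beta}} \pm \tilde{\sigma} \sqrt{\tau^2 + 1}\; t_{n-p}\left(\frac{\alpha}{2}\right)
\]

is a $100(1 - \alpha)\%$ prediction interval for $Y^*$. We don’t call this a confidence interval: confidence intervals are about finding parameters of the distribution, while the prediction interval is about our predictions.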

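Example. A 95% prediction interval for $Y^*$ at $\mathbf{x}^{*T} = (1, (50 - \bar{x}))$ is

\[
  \mathbf{x}^{*T} \hat{\boldsymbol{\beta}} \pm \tilde{\sigma} \sqrt{\tau^2 + 1}\; t_{n-p}\left(\frac{\alpha}{2}\right) = 808.5 \pm 55.6 \times 1.02 \times 2.07 = (691.1, 925.8).
\]

Note that this is much wider than the interval for our expected response! This is since there are three sources of uncertainty: we don’t know what $\sigma$ is, what $b$ is, and there is the random $\varepsilon$ fluctuation!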
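To compare the two intervals side by side, the sketch below recomputes both the confidence interval for the expected response and the prediction interval at an oxygen uptake of 50, starting from the raw data table. It takes $t_{22}(0.025) = 2.07$ from the text; all variable names are chosen for this illustration. Because it uses unrounded values of $\bar{x}$ and $\hat{b}$, the centre of the intervals comes out near 807.9 rather than the rounded 808.5, and the end points agree with the quoted ones only to within about one unit.

```python
import math

# Oxygen uptake and 2-mile times for the 24 runners, transcribed from the data table.
oxygen = [42.3, 53.1, 42.1, 50.1, 42.5, 42.5, 47.8, 49.9,
          36.2, 49.7, 41.5, 46.2, 48.2, 43.2, 51.8, 53.3,
          53.3, 47.2, 56.9, 47.8, 48.7, 53.7, 60.6, 56.7]
time = [918, 805, 892, 962, 968, 907, 770, 743,
        1045, 810, 927, 813, 858, 860, 760, 747,
        743, 803, 683, 844, 755, 700, 748, 775]
t_crit = 2.07   # t_22(0.025), value quoted in the text

n, p = len(oxygen), 2
x_bar = sum(oxygen) / n
y_bar = sum(time) / n
S_xx = sum((x - x_bar) ** 2 for x in oxygen)
S_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(oxygen, time))
b_hat = S_xy / S_xx

# Residual sum of squares of the fitted line and residual standard error.
rss = sum((y - y_bar - b_hat * (x - x_bar)) ** 2 for x, y in zip(oxygen, time))
sigma_tilde = math.sqrt(rss / (n - p))

# New point x* = (1, 50 - x_bar) in the centred parameterisation.
x_dev = 50 - x_bar
estimate = y_bar + b_hat * x_dev
tau_sq = 1 / n + x_dev ** 2 / S_xx     # x*^T (X^T X)^{-1} x*

half_ci = sigma_tilde * math.sqrt(tau_sq) * t_crit
half_pi = sigma_tilde * math.sqrt(tau_sq + 1) * t_crit
ci = (estimate - half_ci, estimate + half_ci)   # for the expected response
pi = (estimate - half_pi, estimate + half_pi)   # for a new observation Y*

print(f"estimate={estimate:.1f}  CI=({ci[0]:.1f}, {ci[1]:.1f})  "
      f"PI=({pi[0]:.1f}, {pi[1]:.1f})")
```

The only difference between the two half-widths is the extra $+1$ under the square root, which accounts for the fluctuation $\varepsilon^*$ of the new observation.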

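Example. Wafer example continued: Suppose we wish to estimate the expected resistivity of a new wafer in the first instrument. Here $\mathbf{x}^{*T} = (1, 0, \cdots, 0)$ (recall that $\mathbf{x}$ is an indicator vector to indicate which instrument is used).

The estimated response at $\mathbf{x}^*$ is

\[
  \mathbf{x}^{*T} \hat{\boldsymbol{\mu}} = \hat{\mu}_1 = \bar{y}_{1\cdot} = 124.3.
\]

We find

\[
  \tau^2 = \mathbf{x}^{*T} (X^T X)^{-1} \mathbf{x}^* = \frac{1}{5}.
\]

So a 95% confidence interval for $\mathbb{E}[Y_1^*]$ is

\[
  \mathbf{x}^{*T} \hat{\boldsymbol{\mu}} \pm \tilde{\sigma} \tau\, t_{n-p}\left(\frac{\alpha}{2}\right) = 124.3 \pm \frac{10.4}{\sqrt{5}} \times 2.09 = (114.6, 134.0).
\]

Note that we are using an estimate of $\sigma$ obtained from all five instruments. If we had only used the data from the first instrument, $\sigma$ would be estimated as

\[
  \tilde{\sigma}_1 = \sqrt{\frac{\sum_{j=1}^5 (y_{1,j} - \bar{y}_{1\cdot})^2}{5 - 1}} = 8.74.
\]

The observed 95% confidence interval for $\mu_1$ would have been

\[
  \bar{y}_{1\cdot} \pm \frac{\tilde{\sigma}_1}{\sqrt{5}}\, t_4\left(\frac{\alpha}{2}\right) = 124.3 \pm 3.91 \times 2.78 = (113.5, 135.1),
\]

which is slightly wider. Usually it is much wider, but in this special case, we only get a small difference since the data from the first instrument are less spread out than those from the others.

A 95% prediction interval for $Y_1^*$ at $\mathbf{x}^{*T} = (1, 0, \cdots, 0)$ is

\[
  \mathbf{x}^{*T} \hat{\boldsymbol{\mu}} \pm \tilde{\sigma} \sqrt{\tau^2 + 1}\; t_{n-p}\left(\frac{\alpha}{2}\right) = 124.3 \pm 10.42 \times 1.1 \times 2.09 = (100.5, 148.1).
\]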
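The three intervals for the first instrument can be recomputed from the wafer measurements as follows. This is a sketch using the standard library only; the tabulated values $t_{20}(0.025) \approx 2.09$ and $t_4(0.025) \approx 2.78$ are hard-coded, and the variable names are chosen for this illustration. Results agree with the quoted intervals up to rounding.

```python
import math

# Resistivity measurements: one row per instrument, one column per wafer.
wafers = [
    [130.5, 112.4, 118.9, 125.7, 134.0],
    [130.4, 138.2, 116.7, 132.6, 104.2],
    [113.0, 120.5, 128.9, 103.4, 118.1],
    [128.0, 117.5, 114.9, 114.9, 98.8],
    [121.2, 110.5, 118.5, 100.5, 120.9],
]
t_20, t_4 = 2.09, 2.78   # upper 2.5% points of t_20 and t_4 (tabulated values)

I, J = len(wafers), len(wafers[0])
n = I * J
means = [sum(row) / J for row in wafers]
rss = sum((y - m) ** 2 for row, m in zip(wafers, means) for y in row)
sigma_tilde = math.sqrt(rss / (n - I))       # pooled over all five instruments

centre = means[0]                            # estimate of mu_1
tau = math.sqrt(1 / J)                       # tau^2 = x*^T (X^T X)^{-1} x* = 1/5

# Confidence interval for mu_1 using the pooled estimate of sigma.
half = sigma_tilde * tau * t_20
ci_pooled = (centre - half, centre + half)

# Confidence interval for mu_1 using instrument 1 alone (4 degrees of freedom).
sigma_1 = math.sqrt(sum((y - centre) ** 2 for y in wafers[0]) / (J - 1))
half = sigma_1 / math.sqrt(J) * t_4
ci_single = (centre - half, centre + half)

# Prediction interval for a new measurement on instrument 1.
half = sigma_tilde * math.sqrt(tau ** 2 + 1) * t_20
pi = (centre - half, centre + half)

for name, (lo, hi) in [("pooled CI", ci_pooled), ("single CI", ci_single), ("PI", pi)]:
    print(f"{name}: ({lo:.1f}, {hi:.1f})")
```

The single-instrument interval has a smaller estimate of $\sigma$ but a larger $t$ quantile, since it has only 4 degrees of freedom instead of 20.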

3.8 Hypothesis testing

3.8.1 Hypothesis testing

In hypothesis testing, we want to know whether certain variables influence the result. If, say, the variable $x_1$ does not influence $Y$, then we must have $\beta_1 = 0$. So the goal is to test the hypothesis $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \neq 0$. We will tackle a more general case, where $\boldsymbol{\beta}$ can be split into two vectors $\boldsymbol{\beta}_0$ and $\boldsymbol{\beta}_1$, and we test if $\boldsymbol{\beta}_1$ is zero.

We start with an obscure lemma, which might seem pointless at first, but

will prove itself useful very soon.

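Lemma. Suppose $\mathbf{Z} \sim N_n(\mathbf{0}, \sigma^2 I_n)$, and $A_1$ and $A_2$ are symmetric, idempotent $n \times n$ matrices with $A_1 A_2 = 0$ (i.e. they are orthogonal). Then $\mathbf{Z}^T A_1 \mathbf{Z}$ and $\mathbf{Z}^T A_2 \mathbf{Z}$ are independent.

This is geometrically intuitive, because $A_1$ and $A_2$ being orthogonal means they are concerned about different parts of the vector $\mathbf{Z}$.

Proof. Let $\mathbf{W}_i = A_i \mathbf{Z}$, $i = 1, 2$, and

\[
  \mathbf{W} = \begin{pmatrix} \mathbf{W}_1 \\ \mathbf{W}_2 \end{pmatrix} = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix} \mathbf{Z}.
\]

Then

\[
  \mathbf{W} \sim N_{2n}\left(\begin{pmatrix} \mathbf{0} \\ \mathbf{0} \end{pmatrix}, \sigma^2 \begin{pmatrix} A_1 & 0 \\ 0 & A_2 \end{pmatrix}\right),
\]

since the off-diagonal matrices are $\sigma^2 A_1^T A_2 = \sigma^2 A_1 A_2 = 0$.

So $\mathbf{W}_1$ and $\mathbf{W}_2$ are independent, which implies that

\[
  \mathbf{W}_1^T \mathbf{W}_1 = \mathbf{Z}^T A_1^T A_1 \mathbf{Z} = \mathbf{Z}^T A_1 A_1 \mathbf{Z} = \mathbf{Z}^T A_1 \mathbf{Z}
\]

and

\[
  \mathbf{W}_2^T \mathbf{W}_2 = \mathbf{Z}^T A_2^T A_2 \mathbf{Z} = \mathbf{Z}^T A_2 A_2 \mathbf{Z} = \mathbf{Z}^T A_2 \mathbf{Z}
\]

are independent.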

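Now we go to hypothesis testing in general linear models:

Suppose $X = \begin{pmatrix} X_0 & X_1 \end{pmatrix}$, where $X_0$ is $n \times p_0$ and $X_1$ is $n \times (p - p_0)$, and $\boldsymbol{\beta} = \begin{pmatrix} \boldsymbol{\beta}_0 \\ \boldsymbol{\beta}_1 \end{pmatrix}$, where $\operatorname{rank}(X) = p$, $\operatorname{rank}(X_0) = p_0$.

We want to test $H_0: \boldsymbol{\beta}_1 = \mathbf{0}$ against $H_1: \boldsymbol{\beta}_1 \neq \mathbf{0}$. Under $H_0$, $X_1 \boldsymbol{\beta}_1$ vanishes and

\[
  \mathbf{Y} = X_0 \boldsymbol{\beta}_0 + \boldsymbol{\varepsilon}.
\]

Under $H_0$, the mle of $\boldsymbol{\beta}_0$ and $\sigma^2$ are

\[
\begin{aligned}
  \hat{\hat{\boldsymbol{\beta}}}_0 &= (X_0^T X_0)^{-1} X_0^T \mathbf{Y} \\
  \hat{\hat{\sigma}}^2 &= \frac{\mathrm{RSS}_0}{n} = \frac{1}{n} (\mathbf{Y} - X_0 \hat{\hat{\boldsymbol{\beta}}}_0)^T (\mathbf{Y} - X_0 \hat{\hat{\boldsymbol{\beta}}}_0)
\end{aligned}
\]

and we have previously shown these are independent.

Note that our poor estimators wear two hats instead of one. We adopt the convention that the estimators of the null hypothesis have two hats, while those of the alternative hypothesis have one.

So the fitted values under $H_0$ are

\[
  \hat{\hat{\mathbf{Y}}} = X_0 (X_0^T X_0)^{-1} X_0^T \mathbf{Y} = P_0 \mathbf{Y},
\]

where $P_0 = X_0 (X_0^T X_0)^{-1} X_0^T$.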

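The generalized likelihood ratio test of $H_0$ against $H_1$ is

\[
\begin{aligned}
  \Lambda_{\mathbf{Y}}(H_0, H_1)
  &= \frac{\left(\frac{1}{\sqrt{2\pi \hat{\sigma}^2}}\right)^n \exp\left(-\frac{1}{2\hat{\sigma}^2} (\mathbf{Y} - X\hat{\boldsymbol{\beta}})^T (\mathbf{Y} - X\hat{\boldsymbol{\beta}})\right)}{\left(\frac{1}{\sqrt{2\pi \hat{\hat{\sigma}}^2}}\right)^n \exp\left(-\frac{1}{2\hat{\hat{\sigma}}^2} (\mathbf{Y} - X_0 \hat{\hat{\boldsymbol{\beta}}}_0)^T (\mathbf{Y} - X_0 \hat{\hat{\boldsymbol{\beta}}}_0)\right)} \\
  &= \left(\frac{\hat{\hat{\sigma}}^2}{\hat{\sigma}^2}\right)^{n/2} \\
  &= \left(\frac{\mathrm{RSS}_0}{\mathrm{RSS}}\right)^{n/2} \\
  &= \left(1 + \frac{\mathrm{RSS}_0 - \mathrm{RSS}}{\mathrm{RSS}}\right)^{n/2}.
\end{aligned}
\]

We reject $H_0$ when $2 \log \Lambda$ is large, equivalently when $\frac{\mathrm{RSS}_0 - \mathrm{RSS}}{\mathrm{RSS}}$ is large.

Using the results in Lecture 8, under $H_0$, we have

\[
  2 \log \Lambda = n \log\left(1 + \frac{\mathrm{RSS}_0 - \mathrm{RSS}}{\mathrm{RSS}}\right),
\]

which is approximately a $\chi_{p - p_0}^2$ random variable.

This is a good approximation. But we can get an exact null distribution, and get an exact test.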

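We have previously shown that $\mathrm{RSS} = \mathbf{Y}^T (I_n - P)\mathbf{Y}$, and so

\[
  \mathrm{RSS}_0 - \mathrm{RSS} = \mathbf{Y}^T (I_n - P_0)\mathbf{Y} - \mathbf{Y}^T (I_n - P)\mathbf{Y} = \mathbf{Y}^T (P - P_0)\mathbf{Y}.
\]

Now both $I_n - P$ and $P - P_0$ are symmetric and idempotent, and therefore $\operatorname{rank}(I_n - P) = n - p$ and

\[
  \operatorname{rank}(P - P_0) = \operatorname{tr}(P - P_0) = \operatorname{tr}(P) - \operatorname{tr}(P_0) = \operatorname{rank}(P) - \operatorname{rank}(P_0) = p - p_0.
\]

Also,

\[
  (I_n - P)(P - P_0) = (I_n - P)P - (I_n - P)P_0 = (P - P^2) - (P_0 - PP_0) = 0.
\]

(we have $P^2 = P$ by idempotence, and $PP_0 = P_0$ since after projecting with $P_0$, we are already in the space of $P$, and applying $P$ has no effect)

Finally,

\[
\begin{aligned}
  \mathbf{Y}^T (I_n - P)\mathbf{Y} &= (\mathbf{Y} - X_0 \boldsymbol{\beta}_0)^T (I_n - P)(\mathbf{Y} - X_0 \boldsymbol{\beta}_0) \\
  \mathbf{Y}^T (P - P_0)\mathbf{Y} &= (\mathbf{Y} - X_0 \boldsymbol{\beta}_0)^T (P - P_0)(\mathbf{Y} - X_0 \boldsymbol{\beta}_0)
\end{aligned}
\]

since $(I_n - P)X_0 = (P - P_0)X_0 = 0$.

If we let $\mathbf{Z} = \mathbf{Y} - X_0 \boldsymbol{\beta}_0$, $A_1 = I_n - P$, $A_2 = P - P_0$, and apply our previous lemma, and the fact that $\mathbf{Z}^T A_i \mathbf{Z} \sim \sigma^2 \chi_r^2$ with $r = \operatorname{rank}(A_i)$, then under $H_0$

\[
\begin{aligned}
  \mathrm{RSS} = \mathbf{Y}^T (I_n - P)\mathbf{Y} &\sim \sigma^2 \chi_{n-p}^2 \\
  \mathrm{RSS}_0 - \mathrm{RSS} = \mathbf{Y}^T (P - P_0)\mathbf{Y} &\sim \sigma^2 \chi_{p-p_0}^2
\end{aligned}
\]

and these random variables are independent.

So under $H_0$,

\[
  F = \frac{\mathbf{Y}^T (P - P_0)\mathbf{Y}/(p - p_0)}{\mathbf{Y}^T (I_n - P)\mathbf{Y}/(n - p)} = \frac{(\mathrm{RSS}_0 - \mathrm{RSS})/(p - p_0)}{\mathrm{RSS}/(n - p)} \sim F_{p - p_0, n - p}.
\]

Hence we reject $H_0$ if $F > F_{p - p_0, n - p}(\alpha)$.

$\mathrm{RSS}_0 - \mathrm{RSS}$ is the reduction in the sum of squares due to fitting $\boldsymbol{\beta}_1$ in addition to $\boldsymbol{\beta}_0$.

\[
\begin{array}{lllll}
  \text{Source of var.} & \text{d.f.} & \text{sum of squares} & \text{mean squares} & F \text{ statistic} \\ \hline
  \text{Fitted model} & p - p_0 & \mathrm{RSS}_0 - \mathrm{RSS} & \dfrac{\mathrm{RSS}_0 - \mathrm{RSS}}{p - p_0} & \dfrac{(\mathrm{RSS}_0 - \mathrm{RSS})/(p - p_0)}{\mathrm{RSS}/(n - p)} \\
  \text{Residual} & n - p & \mathrm{RSS} & \dfrac{\mathrm{RSS}}{n - p} & \\
  \text{Total} & n - p_0 & \mathrm{RSS}_0 & &
\end{array}
\]

The ratio $\frac{\mathrm{RSS}_0 - \mathrm{RSS}}{\mathrm{RSS}_0}$ is sometimes known as the proportion of variance explained by $\boldsymbol{\beta}_1$, and denoted $R^2$.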

3.8.2 Simple linear regression

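We assume that

\[
  Y_i = a' + b(x_i - \bar{x}) + \varepsilon_i,
\]

where $\bar{x} = \sum x_i / n$ and $\varepsilon_i$ are iid $N(0, \sigma^2)$.

Suppose we want to test the hypothesis $H_0: b = 0$, i.e. no linear relationship. We have previously seen how to construct a confidence interval, and so we could simply see if it included 0.

Alternatively, under $H_0$, the model is $Y_i \sim N(a', \sigma^2)$, and so $\hat{a}' = \bar{Y}$, and the fitted values are $\hat{Y}_i = \bar{Y}$.

The observed $\mathrm{RSS}_0$ is therefore

\[
  \mathrm{RSS}_0 = \sum_i (y_i - \bar{y})^2 = S_{yy}.
\]

The fitted sum of squares is therefore

\[
  \mathrm{RSS}_0 - \mathrm{RSS} = \sum_i \left[(y_i - \bar{y})^2 - (y_i - \bar{y} - \hat{b}(x_i - \bar{x}))^2\right] = \hat{b}^2 \sum_i (x_i - \bar{x})^2 = \hat{b}^2 S_{xx}.
\]

\[
\begin{array}{lllll}
  \text{Source of var.} & \text{d.f.} & \text{sum of squares} & \text{mean squares} & F \text{ statistic} \\ \hline
  \text{Fitted model} & 1 & \mathrm{RSS}_0 - \mathrm{RSS} = \hat{b}^2 S_{xx} & \hat{b}^2 S_{xx} & F = \dfrac{\hat{b}^2 S_{xx}}{\tilde{\sigma}^2} \\
  \text{Residual} & n - 2 & \mathrm{RSS} = \sum_i (y_i - \hat{y}_i)^2 & \tilde{\sigma}^2 & \\
  \text{Total} & n - 1 & \mathrm{RSS}_0 = \sum_i (y_i - \bar{y})^2 & &
\end{array}
\]

Note that the proportion of variance explained is $\hat{b}^2 S_{xx}/S_{yy} = \frac{S_{xy}^2}{S_{xx} S_{yy}} = r^2$, where $r$ is Pearson’s product-moment correlation coefficient

\[
  r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}.
\]

We have previously seen that under $H_0$, $\frac{\hat{b}}{\mathrm{SE}(\hat{b})} \sim t_{n-2}$, where $\mathrm{SE}(\hat{b}) = \tilde{\sigma}/\sqrt{S_{xx}}$. So we let

\[
  t = \frac{\hat{b}}{\mathrm{SE}(\hat{b})} = \frac{\hat{b} \sqrt{S_{xx}}}{\tilde{\sigma}}.
\]

Checking whether $|t| > t_{n-2}\left(\frac{\alpha}{2}\right)$ is precisely the same as checking whether $t^2 = F > F_{1, n-2}(\alpha)$, since an $F_{1, n-2}$ variable is the square of a $t_{n-2}$ variable.

Hence the same conclusion is reached, regardless of whether we use the $t$-distribution or the $F$ statistic derived from an analysis of variance table.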
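The identity $t^2 = F$ can be illustrated with the oxygen/time example. The sketch below again starts from the summary statistics quoted earlier in the text (RSS, $S_{xx}$ and $\hat{b}$), so the numbers are subject to rounding; the variable names are chosen for this illustration.

```python
import math

# Summary statistics as quoted in the text for the oxygen/time example.
n = 24
rss = 67968.0
S_xx = 783.5
b_hat = -12.9

sigma_tilde_sq = rss / (n - 2)          # residual mean square

# Analysis of variance route: fitted sum of squares over residual mean square.
fitted_ss = b_hat ** 2 * S_xx           # RSS_0 - RSS, on 1 degree of freedom
F = fitted_ss / sigma_tilde_sq

# t route: estimate divided by its standard error.
t_stat = b_hat * math.sqrt(S_xx) / math.sqrt(sigma_tilde_sq)

print(f"F = {F:.2f}, t = {t_stat:.2f}, t^2 = {t_stat ** 2:.2f}")
```

Both statistics are functions of the same quantity $\hat{b}^2 S_{xx} / \tilde{\sigma}^2$, so the two tests necessarily reach the same conclusion.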

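3.8.3 One way analysis of variance with equal numbers in each group

Recall that in our wafer example, we made measurements in groups, and want to know if there is a difference between groups. In general, suppose $J$ measurements are taken in each of $I$ groups (so that $n = IJ$), and that

\[
  Y_{i,j} = \mu_i + \varepsilon_{i,j},
\]

where $\varepsilon_{i,j}$ are independent $N(0, \sigma^2)$ random variables, and the $\mu_i$ are unknown constants.

Fitting this model gives

\[
  \mathrm{RSS} = \sum_{i=1}^I \sum_{j=1}^J (Y_{i,j} - \hat{\mu}_i)^2 = \sum_{i=1}^I \sum_{j=1}^J (Y_{i,j} - \bar{Y}_{i\cdot})^2
\]

on $n - I$ degrees of freedom.

Suppose we want to test the hypothesis $H_0: \mu_i = \mu$, i.e. no difference between groups.

Under $H_0$, the model is $Y_{i,j} \sim N(\mu, \sigma^2)$, and so $\hat{\mu} = \bar{Y}_{\cdot\cdot}$, and the fitted values are $\hat{Y}_{i,j} = \bar{Y}_{\cdot\cdot}$.

The observed $\mathrm{RSS}_0$ is therefore

\[
  \mathrm{RSS}_0 = \sum_{i,j} (y_{i,j} - \bar{y}_{\cdot\cdot})^2.
\]

The fitted sum of squares is therefore

\[
  \mathrm{RSS}_0 - \mathrm{RSS} = \sum_i \sum_j \left[(y_{i,j} - \bar{y}_{\cdot\cdot})^2 - (y_{i,j} - \bar{y}_{i\cdot})^2\right] = J \sum_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2.
\]

\[
\begin{array}{lllll}
  \text{Source of var.} & \text{d.f.} & \text{sum of squares} & \text{mean squares} & F \text{ statistic} \\ \hline
  \text{Fitted model} & I - 1 & J \sum_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 & \dfrac{J \sum_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2}{I - 1} & \dfrac{J \sum_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2}{(I - 1)\tilde{\sigma}^2} \\
  \text{Residual} & n - I & \sum_i \sum_j (y_{i,j} - \bar{y}_{i\cdot})^2 & \tilde{\sigma}^2 & \\
  \text{Total} & n - 1 & \sum_i \sum_j (y_{i,j} - \bar{y}_{\cdot\cdot})^2 & &
\end{array}
\]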
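As an illustration of the table, the entries can be computed for the wafer data from the earlier example. The following is a sketch using the standard library only, with variable names chosen for this illustration; it computes the $F$ statistic but does not look up a critical value, which would come from tables of the $F_{I-1, n-I}$ distribution.

```python
# Resistivity measurements: one row per instrument (group), one column per wafer.
wafers = [
    [130.5, 112.4, 118.9, 125.7, 134.0],
    [130.4, 138.2, 116.7, 132.6, 104.2],
    [113.0, 120.5, 128.9, 103.4, 118.1],
    [128.0, 117.5, 114.9, 114.9, 98.8],
    [121.2, 110.5, 118.5, 100.5, 120.9],
]

I, J = len(wafers), len(wafers[0])
n = I * J
group_means = [sum(row) / J for row in wafers]
grand_mean = sum(sum(row) for row in wafers) / n

# Residual sum of squares under the full model (separate mean per group).
rss = sum((y - m) ** 2 for row, m in zip(wafers, group_means) for y in row)
# Residual sum of squares under H0 (one common mean).
rss0 = sum((y - grand_mean) ** 2 for row in wafers for y in row)
# Fitted (between-group) sum of squares, J * sum_i (ybar_i. - ybar_..)^2.
ss_between = J * sum((m - grand_mean) ** 2 for m in group_means)

ms_between = ss_between / (I - 1)
sigma_tilde_sq = rss / (n - I)
F = ms_between / sigma_tilde_sq

print(f"fitted SS = {ss_between:.1f} on {I - 1} d.f.")
print(f"residual SS = {rss:.1f} on {n - I} d.f.")
print(f"total SS = {rss0:.1f} on {n - 1} d.f.")
print(f"F = {F:.2f}")
```

Computing `rss0 - rss` and `ss_between` separately gives a numerical check of the decomposition of the total sum of squares.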