Part IB — Statistics
Based on lectures by D. Spiegelhalter
Notes taken by Dexter Chua
Lent 2015
These notes are not endorsed by the lecturers, and I have modified them (often
significantly) after lectures. They are nowhere near accurate representations of what
was actually lectured, and in particular, all errors are almost surely mine.
Estimation
Review of distribution and density functions, parametric families. Examples: bino
mial, Poisson, gamma. Sufficiency, minimal sufficiency, the RaoBlackwell theorem.
Maximum likelihood estimation. Confidence intervals. Use of prior distributions and
Bayesian inference. [5]
Hypothesis testing
Simple examples of hypothesis testing, null and alternative hypothesis, critical region,
size, power, type I and type II errors, NeymanPearson lemma. Significance level of
outcome. Uniformly most powerful tests. Likelihood ratio, and use of generalised likeli
ho od ratio to construct test statistics for composite hypotheses. Examples, including
t
tests and
F
tests. Relationship with confidence intervals. Goodnessoffit tests and
contingency tables. [4]
Linear models
Derivation and joint distribution of maximum likelihood estimators, least squares,
GaussMarkov theorem. Testing hypotheses, geometric interpretation. Examples,
including simple linear regression and oneway analysis of variance. Use of software. [7]
Contents
0 Introduction
1 Estimation
1.1 Estimators
1.2 Mean squared error
1.3 Sufficiency
1.4 Likelihood
1.5 Confidence intervals
1.6 Bayesian estimation
2 Hypothesis testing
2.1 Simple hypotheses
2.2 Composite hypotheses
2.3 Tests of goodnessoffit and independence
2.3.1 Goodnessoffit of a fullyspecified null distribution
2.3.2 Pearson’s chisquared test
2.3.3 Testing independence in contingency tables
2.4 Tests of homogeneity, and connections to confidence intervals
2.4.1 Tests of homogeneity
2.4.2 Confidence intervals and hypothesis tests
2.5 Multivariate normal theory
2.5.1 Multivariate normal distribution
2.5.2 Normal random samples
2.6 Student’s tdistribution
3 Linear models
3.1 Linear models
3.2 Simple linear regression
3.3 Linear models with normal assumptions
3.4 The F distribution
3.5 Inference for β
3.6 Simple linear regression
3.7 Expected response at x
∗
3.8 Hypothesis testing
3.8.1 Hypothesis testing
3.8.2 Simple linear regression
3.8.3
One way analysis of variance with equal numbers in each
group
0 Introduction
Statistics is a set of principles and procedures for gaining and processing quan
titative evidence in order to help us make judgements and decisions. In this
course, we focus on formal statistical inference. In the process, we assume that
we have some data generated from some unknown probability model, and we aim
to use the data to learn about certain properties of the underlying probability
model.
In particular, we perform parametric inference. We assume that we have
a random variable
X
that follows a particular known family of distribution
(e.g. Poisson distribution). However, we do not know the parameters of the
distribution. We then attempt to estimate the parameter from the data given.
For example, we might know that
X ∼ Poisson
(
µ
) for some
µ
, and we want
to figure out what µ is.
Usually we repeat the experiment (or observation) many times. Hence we
will have
X
1
, X
2
, ··· , X
n
being iid with the same distribution as
X
. We call the
set X = (X
1
, X
2
, ··· , X
n
) a simple random sample. This is the data we have.
We will use the observed X = x to make inferences about the parameter
θ
,
such as
– giving an estimate
ˆ
θ(x) of the true value of θ.
– Giving an interval estimate (
ˆ
θ
1
(x),
ˆ
θ
2
(x)) for θ
– testing a hypothesis about θ, e.g. whether θ = 0.
1 Estimation
1.1 Estimators
The goal of estimation is as follows: we are given iid
X
1
, ··· , X
n
, and we know
that their probability density/mass function is
f
X
(
x
;
θ
) for some unknown
θ
.
We know
f
X
but not
θ
. For example, we might know that they follow a Poisson
distribution, but we do not know what the mean is. The objective is to estimate
the value of θ.
Definition (Statistic). A statistic is an estimate of
θ
. It is a function
T
of the
data. If we write the data as x = (
x
1
, ··· , x
n
), then our estimate is written as
ˆ
θ = T (x). T (X) is an estimator of θ.
The distribution of T = T (X) is the sampling distribution of the statistic.
Note that we adopt the convention where capital X denotes a random variable
and x is an observed value. So
T
(X) is a random variable and
T
(x) is a particular
value we obtain after experiments.
Example. Let X
1
, ··· , X
n
be iid N(µ, 1). A possible estimator for µ is
T (X) =
1
n
X
X
i
.
Then for any particular observed sample x, our estimate is
T (x) =
1
n
X
x
i
.
What is the sampling distribution of
T
? Recall from IA Probability that in
general, if
X
i
∼ N
(
µ
i
, σ
2
i
), then
P
X
i
∼ N
(
P
µ
i
,
P
σ
2
i
), which is something we
can prove by considering momentgenerating functions.
So we have
T
(X)
∼ N
(
µ,
1
/n
). Note that by the Central Limit Theorem,
even if
X
i
were not normal, we still have approximately
T
(X)
∼ N
(
µ,
1
/n
) for
large values of
n
, but here we get exactly the normal distribution even for small
values of n.
The estimator
1
n
P
X
i
we had above is a rather sensible estimator. Of course,
we can also have silly estimators such as
T
(X) =
X
1
, or even
T
(X) = 0
.
32
always.
One way to decide if an estimator is silly is to look at its bias.
Definition (Bias). Let
ˆ
θ
=
T
(X) be an estimator of
θ
. The bias of
ˆ
θ
is the
difference between its expected value and true value.
bias(
ˆ
θ) = E
θ
(
ˆ
θ) − θ.
Note that the subscript
θ
does not represent the random variable, but the thing
we want to estimate. This is inconsistent with the use for, say, the probability
mass function.
An estimator is unbiased if it has no bias, i.e. E
θ
(
ˆ
θ) = θ.
To find out
E
θ
(
T
), we can either find the distribution of
T
and find its
expected value, or evaluate
T
as a function of X directly, and find its expected
value.
Example. In the above example, E
µ
(T ) = µ. So T is unbiased for µ.
1.2 Mean squared error
Given an estimator, we want to know how good the estimator is. We have just
come up with the concept of the bias above. However, this is generally not a
good measure of how good the estimator is.
For example, if we do 1000 random trials
X
1
, ··· , X
1000
, we can pick our
estimator as
T
(X) =
X
1
. This is an unbiased estimator, but is really bad because
we have just wasted the data from the other 999 trials. On the other hand,
T
′
(X) = 0
.
01 +
1
1000
P
X
i
is biased (with a bias of 0
.
01), but is in general much
more trustworthy than
T
. In fact, at the end of the section, we will construct
cases where the only possible unbiased estimator is a completely silly estimator
to use.
Instead, a commonly used measure is the mean squared error.
Definition (Mean squared error). The mean squared error of an estimator
ˆ
θ
is
E
θ
[(
ˆ
θ − θ)
2
].
Sometimes, we use the root mean squared error, that is the square root of
the above.
We can express the mean squared error in terms of the variance and bias:
E
θ
[(
ˆ
θ − θ)
2
] = E
θ
[(
ˆ
θ − E
θ
(
ˆ
θ) + E
θ
(
ˆ
θ) − θ)
2
]
= E
θ
[(
ˆ
θ − E
θ
(
ˆ
θ))
2
] + [E
θ
(
ˆ
θ) − θ]
2
+ 2E
θ
[E
θ
(
ˆ
θ) − θ] E
θ
[
ˆ
θ − E
θ
(
ˆ
θ)]
 {z }
0
= var(
ˆ
θ) + bias
2
(
ˆ
θ).
If we are aiming for a low mean squared error, sometimes it could be preferable to
have a biased estimator with a lower variance. This is known as the “biasvariance
tradeoff”.
For example, suppose
X ∼ binomial
(
n, θ
), where
n
is given and
θ
is to be
determined. The standard estimator is
T
U
=
X/n
, which is unbiased.
T
U
has
variance
var
θ
(T
U
) =
var
θ
(X)
n
2
=
θ(1 − θ)
n
.
Hence the mean squared error of the usual estimator is given by
mse(T
U
) = var
θ
(T
U
) + bias
2
(T
U
) = θ(1 − θ)/n.
Consider an alternative estimator
T
B
=
X + 1
n + 2
= w
X
n
+ (1 −w)
1
2
,
where
w
=
n/
(
n
+ 2). This can be interpreted to be a weighted average (by the
sample size) of the sample mean and 1/2. We have
E
θ
(T
B
) − θ =
nθ + 1
n + 2
− θ = (1 −w)
1
2
− θ
,
and is biased. The variance is given by
var
θ
(T
B
) =
var
θ
(X)
(n + 2)
2
= w
2
θ(1 − θ)
n
.
Hence the mean squared error is
mse(T
B
) = var
θ
(T
B
) + bias
2
(T
B
) = w
2
θ(1 − θ)
n
+ (1 −w)
2
1
2
− θ
2
.
We can plot the mean squared error of each estimator for possible values of
θ
.
Here we plot the case where n = 10.
unbiased estimator
biased estimator
θ
mse
0 0.2 0.4 0.6 0.8 1.0
0
0.01
0.02
0.03
This biased estimator has smaller MSE unless θ has extreme values.
We see that sometimes biased estimators could give better mean squared
errors. In some cases, not only could unbiased estimators be worse — they could
be completely nonsense.
Suppose
X ∼ Poisson
(
λ
), and for whatever reason, we want to estimate
θ
= [
P
(
X
= 0)]
2
=
e
−2λ
. Then any unbiased estimator
T
(
X
) must satisfy
E
θ
(T (X)) = θ, or equivalently,
E
λ
(T (X)) = e
−λ
∞
X
x=0
T (x)
λ
x
x!
= e
−2λ
.
The only function T that can satisfy this equation is T (X) = (−1)
X
.
Thus the unbiased estimator would estimate e
−2λ
to be 1 if X is even, 1 if
X is odd. This is clearly nonsense.
1.3 Sufficiency
Often, we do experiments just to find out the value of
θ
. For example, we might
want to estimate what proportion of the population supports some political
candidate. We are seldom interested in the data points themselves, and just
want to learn about the big picture. This leads us to the concept of a sufficient
statistic. This is a statistic
T
(X) that contains all information we have about
θ
in the sample.
Example. Let
X
1
, ···X
n
be iid
Bernoulli
(
θ
), so that
P
(
X
i
= 1) = 1
− P
(
X
i
=
0) = θ for some 0 < θ < 1. So
f
X
(x  θ) =
n
Y
i=1
θ
x
i
(1 − θ)
1−x
i
= θ
P
x
i
(1 − θ)
n−
P
x
i
.
This depends on the data only through
T
(X) =
P
x
i
, the total number of ones.
Suppose we are now given that
T
(X) =
t
. Then what is the distribution of
X? We have
f
XT =t
(x) =
P
θ
(X = x, T = t)
P
θ
(T = t)
=
P
θ
(X = x)
P
θ
(T = t)
.
Where the last equality comes because since if X = x, then
T
must be equal to
t. This is equal to
θ
P
x
i
(1 − θ)
n−
P
x
i
n
t
θ
t
(1 − θ)
n−t
=
n
t
−1
.
So the conditional distribution of X given
T
=
t
does not depend on
θ
. So if we
know
T
, then additional knowledge of x does not give more information about
θ
.
Definition (Sufficient statistic). A statistic
T
is sufficient for
θ
if the conditional
distribution of X given T does not depend on θ.
There is a convenient theorem that allows us to find sufficient statistics.
Theorem (The factorization criterion). T is sufficient for θ if and only if
f
X
(x  θ) = g(T (x), θ)h(x)
for some functions g and h.
Proof. We first prove the discrete case.
Suppose f
X
(x  θ) = g(T (x), θ)h(x). If T (x) = t, then
f
XT =t
(x) =
P
θ
(X = x, T (X) = t)
P
θ
(T = t)
=
g(T (x), θ)h(x)
P
{y:T (y)=t}
g(T (y), θ)h(y)
=
g(t, θ)h(x)
g(t, θ)
P
h(y)
=
h(x)
P
h(y)
which doesn’t depend on θ. So T is sufficient.
The continuous case is similar. If
f
X
(x
 θ
) =
g
(
T
(x)
, θ
)
h
(x), and
T
(x) =
t
,
then
f
XT =t
(x) =
g(T (x), θ)h(x)
R
y:T (y)=t
g(T (y), θ)h(y) dy
=
g(t, θ)h(x)
g(t, θ)
R
h(y) dy
=
h(x)
R
h(y) dy
,
which does not depend on θ.
Now suppose
T
is sufficient so that the conditional distribution of X
 T
=
t
does not depend on θ. Then
P
θ
(X = x) = P
θ
(X = x, T = T (x)) = P
θ
(X = x  T = T (x))P
θ
(T = T(x)).
The first factor does not depend on
θ
by assumption; call it
h
(x). Let the second
factor be g(t, θ), and so we have the required factorisation.
Example. Continuing the above example,
f
X
(x  θ) = θ
P
x
i
(1 − θ)
n−
P
x
i
.
Take
g
(
t, θ
) =
θ
t
(1
− θ
)
n−t
and
h
(x) = 1 to see that
T
(X) =
P
X
i
is sufficient
for θ.
Example. Let
X
1
, ··· , X
n
be iid
U
[0
, θ
]. Write 1
[A]
for the indicator function
of an arbitrary set A. We have
f
X
(x  θ) =
n
Y
i=1
1
θ
1
[0≤x
i
≤θ]
=
1
θ
n
1
[max
i
x
i
≤θ]
1
[min
i
x
i
≥0]
.
If we let T = max
i
x
i
, then we have
f
X
(x  θ) =
1
θ
n
1
[t≤θ]
 {z }
g(t,θ)
1
[min
i
x
i
≥0]
 {z }
h(x)
.
So T = max
i
x
i
is sufficient.
Note that sufficient statistics are not unique. If
T
is sufficient for
θ
, then
so is any 11 function of
T
. X is always sufficient for
θ
as well, but it is not of
much use. How can we decide if a sufficient statistic is “good”?
Given any statistic
T
, we can partition the sample space
X
n
into sets
{
x
∈ X
:
T
(x) =
t}
. Then after an experiment, instead of recording the actual
value of x, we can simply record the partition x falls into. If there are less
partitions than possible values of x, then effectively there is less information we
have to store.
If
T
is sufficient, then this data reduction does not lose any information
about
θ
. The “best” sufficient statistic would be one in which we achieve the
maximum possible reduction. This is known as the minimal sufficient statistic.
The formal definition we take is the following:
Definition (Minimal sufficiency). A sufficient statistic
T
(X) is minimal if it is
a function of every other sufficient statistic, i.e. if
T
′
(X) is also sufficient, then
T
′
(X) = T
′
(Y) ⇒ T (X) = T (Y).
Again, we have a handy theorem to find minimal statistics:
Theorem. Suppose T = T(X) is a statistic that satisfies
f
X
(x; θ)
f
X
(y; θ)
does not depend on θ if and only if T (x) = T (y).
Then T is minimal sufficient for θ.
Proof.
First we have to show sufficiency. We will use the factorization criterion
to do so.
Firstly, for each possible t, pick a favorite x
t
such that T (x
t
) = t.
Now let x
∈ X
N
and let
T
(x) =
t
. So
T
(x) =
T
(x
t
). By the hypothesis,
f
X
(x;θ)
f
X
(x
t
:θ)
does not depend on θ. Let this be h(x). Let g(t, θ) = f
X
(x
t
, θ). Then
f
X
(x; θ) = f
X
(x
t
; θ)
f
X
(x; θ)
f
X
(x
t
; θ)
= g(t, θ)h(x).
So T is sufficient for θ.
To show that this is minimal, suppose that
S
(X) is also sufficient. By the
factorization criterion, there exist functions g
S
and h
S
such that
f
X
(x; θ) = g
S
(S(x), θ)h
S
(x).
Now suppose that S(x) = S(y). Then
f
X
(x; θ)
f
X
(y; θ)
=
g
S
(S(x), θ)h
S
(x)
g
S
(S(y), θ)h
S
(y)
=
h
S
(x)
h
S
(y)
.
This means that the ratio
f
X
(x;θ)
f
X
(y;θ)
does not depend on
θ
. By the hypothesis, this
implies that
T
(x) =
T
(y). So we know that
S
(x) =
S
(y) implies
T
(x) =
T
(y).
So T is a function of S. So T is minimal sufficient.
Example. Suppose X
1
, ··· , X
n
are iid N (µ, σ
2
). Then
f
X
(x  µ, σ
2
)
f
X
(y  µ, σ
2
)
=
(2πσ
2
)
−n/2
exp
−
1
2σ
2
P
i
(x
i
− µ)
2
(2πσ
2
)
−n/2
exp
−
1
2σ
2
P
i
(y
i
− µ)
2
= exp
(
−
1
2σ
2
X
i
x
2
i
−
X
i
y
2
i
!
+
µ
σ
2
X
i
x
i
−
X
i
y
i
!)
.
This is a constant function of (
µ, σ
2
) iff
P
i
x
2
i
=
P
i
y
2
i
and
P
i
x
i
=
P
i
y
i
. So
T (X) = (
P
i
X
2
i
,
P
i
X
i
) is minimal sufficient for (µ, σ
2
).
As mentioned, sufficient statistics allow us to store the results of our exper
iments in the most efficient way. It turns out if we have a minimal sufficient
statistic, then we can use it to improve any estimator.
Theorem (RaoBlackwell Theorem). Let
T
be a sufficient statistic for
θ
and let
˜
θ
be an estimator for
θ
with
E
(
˜
θ
2
)
< ∞
for all
θ
. Let
ˆ
θ
(x) =
E
[
˜
θ
(X)
 T
(X) =
T (x)]. Then for all θ,
E[(
ˆ
θ − θ)
2
] ≤ E[(
˜
θ − θ)
2
].
The inequality is strict unless
˜
θ is a function of T .
Here we have to be careful with our definition of
ˆ
θ
. It is defined as the
expected value of
˜
θ
(X), and this could potentially depend on the actual value of
θ
. Fortunately, since
T
is sufficient for
θ
, the conditional distribution of X given
T
=
t
does not depend on
θ
. Hence
ˆ
θ
=
E
[
˜
θ
(X)
 T
] does not depend on
θ
, and
so is a genuine estimator. In fact, the proof does not use that
T
is sufficient; we
only require it to be sufficient so that we can compute
ˆ
θ.
Using this theorem, given any estimator, we can find one that is a function
of a sufficient statistic and is at least as good in terms of mean squared error of
estimation. Moreover, if the original estimator
˜
θ
is unbiased, so is the new
ˆ
θ
.
Also, if
˜
θ is already a function of T , then
ˆ
θ =
˜
θ.
Proof.
By the conditional expectation formula, we have
E
(
ˆ
θ
) =
E
[
E
(
˜
θ  T
)] =
E(
˜
θ). So they have the same bias.
By the conditional variance formula,
var(
˜
θ) = E[var(
˜
θ  T )] + var[E(
˜
θ  T )] = E[var(
˜
θ  T )] + var(
ˆ
θ).
Hence
var
(
˜
θ
)
≥ var
(
ˆ
θ
). So
mse
(
˜
θ
)
≥ mse
(
ˆ
θ
), with equality only if
var
(
˜
θ  T
) =
0.
Example. Suppose
X
1
, ··· , X
n
are iid
Poisson
(
λ
), and let
θ
=
e
−λ
, which is
the probability that X
1
= 0. Then
p
X
(x  λ) =
e
−nλ
λ
P
x
i
Q
x
i
!
.
So
p
X
(x  θ) =
θ
n
(−log θ)
P
x
i
Q
x
i
!
.
We see that T =
P
X
i
is sufficient for θ, and
P
X
i
∼ Poisson(nλ).
We start with an easy estimator
θ
is
˜
θ
= 1
X
1
=0
, which is unbiased (i.e.
if we observe nothing in the first observation period, we assume the event is
impossible). Then
E[
˜
θ  T = t] = P
X
1
= 0 
n
X
1
X
i
= t
!
=
P(X
1
= 0)P(
P
n
2
X
i
= t)
P(
P
n
1
X
i
= t)
=
n − 1
n
t
.
So
ˆ
θ
= (1
−
1
/n
)
P
x
i
. This approximately (1
−
1
/n
)
n
¯
X
≈ e
−
¯
X
=
e
−
ˆ
λ
, which
makes sense.
Example. Let
X
1
, ··· , X
n
be iid
U
[0
, θ
], and suppose that we want to estimate
θ
. We have shown above that
T
=
max X
i
is sufficient for
θ
. Let
ˆ
θ
= 2
X
1
, an
unbiased estimator. Then
E[
˜
θ  T = t] = 2E[X
1
 max X
i
= t]
= 2E[X
1
 max X
i
= t, X
1
= max X
i
]P(X
1
= max X
i
)
+ 2E[X
1
 max X
i
= t, X
1
= max X
i
]P(X
1
= max X
i
)
= 2
t ×
1
n
+
t
2
n − 1
n
=
n + 1
n
t.
So
ˆ
θ =
n+1
n
max X
i
is our new estimator.
1.4 Likelihood
There are many different estimators we can pick, and we have just come up with
some criteria to determine whether an estimator is “good”. However, these do
not give us a systematic way of coming up with an estimator to actually use. In
practice, we often use the maximum likelihood estimator.
Let
X
1
, ··· , X
n
be random variables with joint pdf/pmf
f
X
(x
 θ
). We
observe X = x.
Definition (Likelihood). For any given x, the likelihood of
θ
is
like
(
θ
) =
f
X
(x

θ
), regarded as a function of
θ
. The maximum likelihood estimator (mle) of
θ
is
an estimator that picks the value of θ that maximizes like(θ).
Often there is no closed form for the mle, and we have to find
ˆ
θ
numerically.
When we can find the mle explicitly, in practice, we often maximize the
loglikelihood instead of the likelihood. In particular, if
X
1
, ··· , X
n
are iid, each
with pdf/pmf f
X
(x  θ), then
like(θ) =
n
Y
i=1
f
X
(x
i
 θ),
log like(θ) =
n
X
i=1
log f
X
(x
i
 θ).
Example. Let X
1
, ··· , X
n
be iid Bernoulli(p). Then
l(p) = log like(p) =
X
x
i
log p +
n −
X
x
i
log(1 −p).
Thus
dl
dp
=
P
x
i
p
−
n −
P
x
i
1 − p
.
This is zero when
p
=
P
x
i
/n
. So this is the maximum likelihood estimator
(and is unbiased).
Example. Let
X
1
, ··· , X
n
be iid
N
(
µ, σ
2
), and we want to estimate
θ
= (
µ, σ
2
).
Then
l(µ, σ
2
) = log like(µ, σ
2
) = −
n
2
log(2π) −
n
2
log(σ
2
) −
1
2σ
2
X
(x
i
− µ)
2
.
This is maximized when
∂l
∂µ
=
∂l
∂σ
2
= 0.
We have
∂l
∂µ
= −
1
σ
2
X
(x
i
− µ),
∂l
∂σ
2
= −
n
2σ
2
+
1
2σ
4
X
(x
i
− µ)
2
.
So the solution, hence maximum likelihood estimator is (
ˆµ, ˆσ
2
) = (
¯x, S
xx
/n
),
where ¯x =
1
n
P
x
i
and S
xx
=
P
(x
i
− ¯x)
2
.
We shall see later that
S
XX
/σ
2
=
nˆσ
2
σ
2
∼ χ
2
n−1
, and so
E
(
ˆσ
2
) =
(n−1)σ
2
n
, i.e.
ˆσ
2
is biased.
Example (German tank problem). Suppose the American army discovers some
German tanks that are sequentially numbered, i.e. the first tank is numbered 1,
the second is numbered 2, etc. Then if
θ
tanks are produced, then the probability
distribution of the tank number is
U
(0
, θ
). Suppose we have discovered
n
tanks
whose numbers are
x
1
, x
2
, ··· , x
n
, and we want to estimate
θ
, the total number
of tanks produced. We want to find the maximum likelihood estimator.
Then
like(θ) =
1
θ
n
1
[max x
i
≤θ]
1
[min x
i
≥0]
.
So for
θ ≥ max x
i
,
like
(
θ
) = 1
/θ
n
and is decreasing as
θ
increases, while for
θ < max x
i
, like(θ) = 0. Hence the value
ˆ
θ = max x
i
maximizes the likelihood.
Is
ˆ
θ
unbiased? First we need to find the distribution of
ˆ
θ
. For 0
≤ t ≤ θ
, the
cumulative distribution function of
ˆ
θ is
F
ˆ
θ
(t) = P (
ˆ
θ ≤ t) = P(X
i
≤ t for all i) = (P(X
i
≤ t))
n
=
t
θ
n
,
Differentiating with respect to T , we find the pdf f
ˆ
θ
=
nt
n−1
θ
n
. Hence
E(
ˆ
θ) =
Z
θ
0
t
nt
n−1
θ
n
dt =
nθ
n + 1
.
So
ˆ
θ is biased, but asymptotically unbiased.
Example. Smarties come in
k
equally frequent colours, but suppose we do not
know
k
. Assume that we sample with replacement (although this is unhygienic).
Our first Smarties are Red, Purple, Red, Yellow. Then
like(k) = P
k
(1st is a new colour)P
k
(2nd is a new colour)
P
k
(3rd matches 1st)P
k
(4th is a new colour)
= 1 ×
k − 1
k
×
1
k
×
k − 2
k
=
(k − 1)(k −2)
k
3
.
The maximum likelihood is 5 (by trial and error), even though it is not much
likelier than the others.
How does the mle relate to sufficient statistics? Suppose that
T
is sufficient
for
θ
. Then the likelihood is
g
(
T
(x)
, θ
)
h
(x), which depends on
θ
through
T
(x).
To maximise this as a function of
θ
, we only need to maximize
g
. So the mle
ˆ
θ
is a function of the sufficient statistic.
Note that if
ϕ
=
h
(
θ
) with
h
injective, then the mle of
ϕ
is given by
h
(
ˆ
θ
). For
example, if the mle of the standard deviation
σ
is
ˆσ
, then the mle of the variance
σ
2
is
ˆσ
2
. This is rather useful in practice, since we can use this to simplify a lot
of computations.
1.5 Confidence intervals
Definition. A 100
γ
% (0
< γ <
1) confidence interval for
θ
is a random interval
(
A
(X)
, B
(X)) such that
P
(
A
(X)
< θ < B
(X)) =
γ
, no matter what the true
value of θ may be.
It is also possible to have confidence intervals for vector parameters.
Notice that it is the endpoints of the interval that are random quantities,
while θ is a fixed constant we want to find out.
We can interpret this in terms of repeat sampling. If we calculate (
A
(x)
, B
(x))
for a large number of samples x, then approximately 100
γ
% of them will cover
the true value of θ.
It is important to know that having observed some data x and calculated
95% confidence interval, we cannot say that
θ
has 95% chance of being within
the interval. Apart from the standard objection that
θ
is a fixed value and
either is or is not in the interval, and hence we cannot assign probabilities to
this event, we will later construct an example where even though we have got a
50% confidence interval, we are 100% sure that θ lies in that interval.
Example. Suppose
X
1
, ··· , X
n
are iid
N
(
θ,
1). Find a 95% confidence interval
for θ.
We know
¯
X ∼ N(θ,
1
n
), so that
√
n(
¯
X − θ) ∼ N(0, 1).
Let
z
1
, z
2
be such that Φ(
z
2
)
−
Φ(
z
1
) = 0
.
95, where Φ is the standard normal
distribution function.
We have P[z
1
<
√
n(
¯
X − θ) < z
2
] = 0.95, which can be rearranged to give
P
¯
X −
z
2
√
n
< θ <
¯
X −
z
1
√
n
= 0.95.
so we obtain the following 95% confidence interval:
¯
X −
z
2
√
n
,
¯
X −
z
1
√
n
.
There are many possible choices for
z
1
and
z
2
. Since
N
(0
,
1) density is symmetric,
the shortest such interval is obtained by
z
2
=
z
0.025
=
−z
1
. We can also choose
other values such as
z
1
=
−∞
,
z
2
= 1
.
64, but we usually choose symmetric end
points.
The above example illustrates a common procedure for finding confidence
intervals:
–
Find a quantity
R
(X
, θ
) such that the
P
θ
distribution of
R
(X
, θ
) does not
depend on
θ
. This is called a pivot. In our example,
R
(X
, θ
) =
√
n
(
¯
X −θ
).
–
Write down a probability statement of the form
P
θ
(
c
1
< R
(X
, θ
)
< c
2
) =
γ
.
– Rearrange the inequalities inside P(. . .) to find the interval.
Usually
c
1
, c
2
are percentage points from a known standardised distribution, often
equitailed. For example, we pick 2
.
5% and 97
.
5% points for a 95% confidence
interval. We could also use, say 0% and 95%, but this generally results in a
wider interval.
Note that if (
A
(X)
, B
(X)) is a 100
γ
% confidence interval for
θ
, and
T
(
θ
)
is a monotone increasing function of
θ
, then (
T
(
A
(X))
, T
(
B
(X))) is a 100
γ
%
confidence interval for T (θ).
Example. Suppose
X
1
, ··· , X
50
are iid
N
(0
, σ
2
). Find a 99% confidence interval
for σ
2
.
We know that X
i
/σ ∼ N(0, 1). So
1
σ
2
50
X
i=1
X
2
i
∼ χ
2
50
.
So R(X, σ
2
) =
P
50
i=1
X
2
i
/σ
2
is a pivot.
Recall that χ
2
n
(α) is the upper 100α% point of χ
2
n
, i.e.
P(χ
2
n
≤ χ
2
n
(α)) = 1 − α.
So we have c
1
= χ
2
50
(0.995) = 27.99 and c
2
= χ
2
50
(0.005) = 79.49.
So
P
c
1
<
P
X
2
i
σ
2
< c
2
= 0.99,
and hence
P
P
X
2
i
c
2
< σ
2
<
P
X
2
i
c
1
= 0.99.
Using the remark above, we know that a 99% confidence interval for
σ
is
q
P
X
2
i
c
2
,
q
P
X
2
i
c
1
.
Example. Suppose
X
1
, ··· , X
n
are iid
Bernoulli
(
p
). Find an approximate
confidence interval for p.
The mle of p is ˆp =
P
X
i
/n.
By the Central Limit theorem,
ˆp
is approximately
N
(
p, p
(1
− p
)
/n
) for large
n.
So
√
n(ˆp − p)
p
p(1 − p)
is approximately
N
(0
,
1) for large
n
. So letting
z
(1−γ)/2
be
the solution to Φ(z
(1−γ)/2
) − Φ(−z
(1−γ)/2
) = 1 − γ, we have
P
ˆp − z
(1−γ)/2
r
p(1 − p)
n
< p < ˆp + z
(1−γ)/2
r
p(1 − p)
n
!
≈ γ.
But
p
is unknown! So we approximate it by
ˆp
to get a confidence interval for
p
when n is large:
P
ˆp − z
(1−γ)/2
r
ˆp(1 − ˆp)
n
< p < ˆp + z
(1−γ)/2
r
ˆp(1 − ˆp)
n
!
≈ γ.
Note that we have made a lot of approximations here, but it would be difficult
to do better than this.
Example. Suppose an opinion poll says 20% of the people are going to vote
UKIP, based on a random sample of 1
,
000 people. What might the true
proportion be?
We assume we have an observation of
x
= 200 from a
binomial
(
n, p
) distri
bution with
n
= 1
,
000. Then
ˆp
=
x/n
= 0
.
2 is an unbiased estimate, and also
the mle.
Now
var
(
X/n
) =
p(1−p)
n
≈
ˆp(1−ˆp)
n
= 0
.
00016. So a 95% confidence interval is
ˆp − 1.96
r
ˆp(1 − ˆp)
n
, ˆp + 1.96
r
ˆp(1 − ˆp)
n
!
= 0.20±1.96×0.013 = (0.175, 0.225),
If we don’t want to make that many approximations, we can note that
p
(1
−p
)
≤
1
/
4 for all 0
≤ p ≤
1. So a conservative 95% interval is
ˆp±
1
.
96
p
1/4n ≈ ˆp±
p
1/n
.
So whatever proportion is reported, it will be ’accurate’ to ±1/
√
n.
Example. Suppose
X
1
, X
2
are iid from
U
(
θ −
1
/
2
, θ
+ 1
/
2). What is a sensible
50% confidence interval for θ?
We know that each
X
i
is equally likely to be less than
θ
or greater than
θ
.
So there is 50% chance that we get one observation on each side, i.e.
P
θ
(min(X
1
, X
2
) ≤ θ ≤ max(X
1
, X
2
)) =
1
2
.
So (min(X
1
, X
2
), max(X
1
, X
2
)) is a 50% confidence interval for θ.
But suppose after the experiment, we obtain
x
1
−x
2
 ≥
1
2
. For example, we
might get
x
1
= 0
.
2
, x
2
= 0
.
9, then we know that, in this particular case,
θ
must
lie in (min(X
1
, X
2
), max(X
1
, X
2
)), and we don’t have just 50% “confidence”!
This is why after we calculate a confidence interval, we should not say “there
is 100(1
− α
)% chance that
θ
lies in here”. The confidence interval just says
that “if we keep making these intervals, 100(1
− α
)% of them will contain
θ
”.
But if have calculated a particular confidence interval, the probability that that
particular interval contains θ is not 100(1 −α)%.
1.6 Bayesian estimation
So far we have seen the frequentist approach to a statistical inference, i.e.
inferential statements about θ are interpreted in terms of repeat sampling. For
example, the percentage confidence in a confidence interval is the probability
that the interval will contain θ, not the probability that θ lies in the interval.
In contrast, the Bayesian approach treats
θ
as a random variable taking
values in Θ. The investigator’s information and beliefs about the possible values
of
θ
before any observation of data are summarised by a prior distribution
π
(
θ
).
When X = x are observed, the extra information about
θ
is combined with the
prior to obtain the posterior distribution π(θ  x) for θ given X = x.
There has been a longrunning argument between the two approaches. Re
cently, things have settled down, and Bayesian methods are seen to be appropriate
in huge numbers of application where one seeks to assess a probability about a
“state of the world”. For example, spam filters will assess the probability that a
specific email is a spam, even though from a frequentist’s point of view, this is
nonsense, because the email either is or is not a spam, and it makes no sense to
assign a probability to the email’s being a spam.
In Bayesian inference, we usually have some prior knowledge about the
distribution of
θ
(e.g. between 0 and 1). After collecting some data, we will find
a posterior distribution of θ given X = x.
Definition (Prior and posterior distribution). The prior distribution of
θ
is the
probability distribution of the value of
θ
before conducting the experiment. We
usually write as π(θ).
The posterior distribution of
θ
is the probability distribution of the value of
θ given an outcome of the experiment x. We write as π(θ  x).
By Bayes’ theorem, the distributions are related by
π(θ  x) =
f
X
(x  θ)π(θ)
f
X
(x)
.
Thus
π(θ  x) ∝ f
X
(x  θ)π(θ).
posterior ∝ likelihood × prior.
where the constant of proportionality is chosen to make the total mass of
the posterior distribution equal to one. Usually, we use this form, instead of
attempting to calculate f
X
(x).
It should be clear that the data enters through the likelihood, so the inference
is automatically based on any sufficient statistic.
Example. Suppose I have 3 coins in my pocket. One is 3 : 1 in favour of tails,
one is a fair coin, and one is 3 : 1 in favour of heads.
I randomly select one coin and flip it once, observing a head. What is the
probability that I have chosen coin 3?
Let
X
= 1 denote the event that I observe a head,
X
= 0 if a tail. Let
θ
denote the probability of a head. So θ is either 0.25, 0.5 or 0.75.
Our prior distribution is π(θ = 0.25) = π(θ = 0.5) = π(θ = 0.75) = 1/3.
The probability mass function
f
X
(
x  θ
) =
θ
x
(1
− θ
)
1−x
. So we have the
following results:
θ π(θ) f
X
(x = 1  θ) f
X
(x = 1  θ)π(θ) π(θ  x)
0.25 0.33 0.25 0.0825 0.167
0.50 0.33 0.50 0.1650 0.333
0.75 0.33 0.75 0.2475 0.500
Sum 1.00 1.50 0.4950 1.000
So if we observe a head, then there is now a 50% chance that we have picked
the third coin.
Example. Suppose we are interested in the true mortality risk
θ
in a hospital
H
which is about to try a new operation. On average in the country, around
10% of the people die, but mortality rates in different hospitals vary from around
3% to around 20%. Hospital
H
has no deaths in their first 10 operations. What
should we believe about θ?
Let X
i
= 1 if the ith patient in H dies. The
f
x
(x  θ) = θ
P
x
i
(1 − θ)
n−
P
x
i
.
Suppose a priori that θ ∼ beta(a, b) for some unknown a > 0, b > 0 so that
π(θ) ∝ θ
a−1
(1 − θ)
b−1
.
Then the posteriori is
π(θ  x) ∝ f
x
(x  θ)π(θ) ∝ θ
P
x
i
+a−1
(1 − θ)
n−
P
x
i
+b−1
.
We recognize this as beta(
P
x
i
+ a, n −
P
x
i
+ b). So
π(θ  x) =
θ
P
x
i
+a−1
(1 − θ)
n−
P
x
i
+b−1
B(
P
x
i
+ a, n −
P
x
i
+ b)
.
In practice, we need to find a Beta prior distribution that matches our information
from other hospitals. It turns out that
beta
(
a
= 3
, b
= 27) prior distribution has
mean 0.1 and P(0.03 < θ < .20) = 0.9.
Then we observe data
P
x
i
= 0,
n
= 0. So the posterior is
beta
(
P
x
i
+
a, n −
P
x
i
+ b) = beta(3, 37). This has a mean of 3/40 = 0.075.
This leads to a different conclusion than a frequentist analysis. Since nobody
has died so far, the mle is 0, which does not seem plausible. Using a Bayesian
approach, we have a higher mean than 0 because we take into account the data
from other hospitals.
For this problem, a beta prior leads to a beta posterior. We say that the
beta family is a conjugate family of prior distributions for Bernoulli samples.
Suppose that
a
=
b
= 1 so that
π
(
θ
) = 1 for 0
< θ <
1 — the uniform
distribution. Then the posterior is
beta
(
P
x
i
+ 1
, n −
P
x
i
+ 1), with properties
mean mode variance
prior 1/2 nonunique 1/12
posterior
P
x
i
+ 1
n + 2
P
x
i
n
(
P
x
i
+ 1)(n −
P
x
i
+ 1)
(n + 2)
2
(n + 3)
Note that the mode of the posterior is the mle.
The posterior mean estimator,
P
x
i
+1
n+2
is discussed in Lecture 2, where we
showed that this estimator had smaller mse than the mle for nonextreme value
of θ. This is known as the Laplace’s estimator.
The posterior variance is bounded above by 1
/
(4(
n
+ 3)), and this is smaller
than the prior variance, and is smaller for larger n.
Again, note that the posterior automatically depends on the data through
the sufficient statistic.
After we come up with the posterior distribution, we have to decide what
estimator to use. In the case above, we used the posterior mean, but this might
not be the best estimator.
To determine what is the “best” estimator, we first need a loss function. Let
L
(
θ, a
) be the loss incurred in estimating the value of a parameter to be
a
when
the true value is θ.
Common loss functions are quadratic loss
L
(
θ, a
) = (
θ − a
)
2
, absolute error
loss L(θ, a) = θ − a, but we can have others.
When our estimate is a, the expected posterior loss is
h(a) =
Z
L(θ, a)π(θ  x) dθ.
Definition (Bayes estimator). The Bayes estimator
ˆ
θ
is the estimator that
minimises the expected posterior loss.
For quadratic loss,
h(a) =
Z
(a − θ)
2
π(θ  x) dθ.
h
′
(a) = 0 if
Z
(a − θ)π(θ  x) dθ = 0,
or
a
Z
π(θ  x) dθ =
Z
θπ(θ  x) dθ,
Since
R
π
(
θ 
x) d
θ
= 1, the Bayes estimator is
ˆ
θ
=
R
θπ
(
θ 
x) d
θ
, the posterior
mean.
For absolute error loss,
h(a) =
Z
θ − aπ(θ  x) dθ
=
Z
a
−∞
(a − θ)π(θ  x) dθ +
Z
∞
a
(θ − a)π(θ  x) dθ
= a
Z
a
−∞
π(θ  x) dθ −
Z
a
−∞
θπ(θ  x) dθ
+
Z
∞
a
θπ(θ  x) dθ − a
Z
∞
a
π(θ  x) dθ.
Now h
′
(a) = 0 if
Z
a
−∞
π(θ  x) dθ =
Z
∞
a
π(θ  x) dθ.
This occurs when each side is 1/2. So
ˆ
θ is the posterior median.
Example. Suppose that
X
1
, ··· , X
n
are iid
N
(
µ,
1), and that a priori
µ ∼
N(0, τ
−2
) for some τ
−2
. So τ is the certainty of our prior knowledge.
The posterior is given by
π(µ  x) ∝ f
x
(x  µ)π(µ)
∝ exp
−
1
2
X
(x
i
− µ)
2
exp
−
µ
2
τ
2
2
∝ exp
"
−
1
2
(n + τ
2
)
µ −
P
x
i
n + τ
2
2
#
since we can regard
n
,
τ
and all the
x
i
as constants in the normalisation term,
and then complete the square with respect to
µ
. So the posterior distribution
of
µ
given x is a normal distribution with mean
P
x
i
/
(
n
+
τ
2
) and variance
1/(n + τ
2
).
The normal density is symmetric, and so the posterior mean and the posterior
median have the same value
P
x
i
/(n + τ
2
).
This is the optimal estimator for both quadratic and absolute loss.
Example. Suppose that
X
1
, ··· , X
n
are iid
Poisson
(
λ
) random variables, and
λ has an exponential distribution with mean 1. So π(λ) = e
−λ
.
The posterior distribution is given by
π(λ  x) ∝ e
nλ
λ
P
x
i
e
−λ
= λ
P
x
i
e
−(n+1)λ
, λ > 0,
which is
gamma (
P
x
i
+ 1, n + 1)
. Hence under quadratic loss, our estimator is
ˆ
λ =
P
x
i
+ 1
n + 1
,
the posterior mean.
Under absolute error loss,
ˆ
λ solves
Z
ˆ
λ
0
(n + 1)
P
x
i
+1
λ
P
x
i
e
−(n+1)λ
(
P
x
i
)!
dλ =
1
2
.
2 Hypothesis testing
Often in statistics, we have some hypothesis to test. For example, we want to
test whether a drug can lower the chance of a heart attack. Often, we will have
two hypotheses to compare: the null hypothesis states that the drug is useless,
while the alternative hypothesis states that the drug is useful. Quantitatively,
suppose that the chance of heart attack without the drug is
θ
0
and the chance
with the drug is
θ
. Then the null hypothesis is
H
0
:
θ
=
θ
0
, while the alternative
hypothesis is H
1
: θ = θ
0
.
It is important to note that the null hypothesis and alternative hypothesis
are not on equal footing. By default, we assume the null hypothesis is true.
For us to reject the null hypothesis, we need a lot of evidence to prove that.
This is since we consider incorrectly rejecting the null hypothesis to be a much
more serious problem than accepting it when we should not. For example, it is
relatively okay to reject a drug when it is actually useful, but it is terrible to
distribute drugs to patients when the drugs are actually useless. Alternatively,
it is more serious to deem an innocent person guilty than to say a guilty person
is innocent.
In general, let
X
1
, ··· , X
n
be iid, each taking values in
X
, each with unknown
pdf/pmf
f
. We have two hypotheses,
H
0
and
H
1
, about
f
. On the basis of data
X = x, we make a choice between the two hypotheses.
Example.
–
A coin has
P
(
Heads
) =
θ
, and is thrown independently
n
times. We could
have H
0
: θ =
1
2
versus H
1
: θ =
3
4
.
–
Suppose
X
1
, ··· , X
n
are iid discrete random variables. We could have
H
0
:
the distribution is Poisson with unknown mean, and
H
1
: the distribution
is not Poisson.
–
General parametric cases: Let
X
1
, ··· , X
n
be iid with density
f
(
x  θ
).
f
is known while
θ
is unknown. Then our hypotheses are
H
0
:
θ ∈
Θ
0
and
H
1
: θ ∈ Θ
1
, with Θ
0
∩ Θ
1
= ∅.
–
We could have
H
0
:
f
=
f
0
and
H
1
:
f
=
f
1
, where
f
0
and
f
1
are densities
that are completely specified but do not come form the same parametric
family.
Definition (Simple and composite hypotheses). A simple hypothesis
H
specifies
f completely (e.g. H
0
: θ =
1
2
). Otherwise, H is a composite hypothesis.
2.1 Simple hypotheses
Definition (Critical region). For testing
H
0
against an alternative hypothesis
H
1
, a test procedure has to partition
X
n
into two disjoint exhaustive regions
C
and
¯
C
, such that if x
∈ C
, then
H
0
is rejected, and if x
∈
¯
C
, then
H
0
is not
rejected. C is the critical region.
When performing a test, we may either arrive at a correct conclusion, or
make one of the two types of error:
Definition (Type I and II error).
(i) Type I error: reject H
0
when H
0
is true.
(ii) Type II error: not rejecting H
0
when H
0
is false.
Definition (Size and power). When H
0
and H
1
are both simple, let
α = P(Type I error) = P(X ∈ C  H
0
is true).
β = P(Type II error) = P(X ∈ C  H
1
is true).
The size of the test is α, and 1 − β is the power of the test to detect H
1
.
If we have two simple hypotheses, a relatively straightforward test is the
likelihood ratio test.
Definition (Likelihood). The likelihood of a simple hypothesis
H
:
θ
=
θ
∗
given
data x is
L
x
(H) = f
X
(x  θ = θ
∗
).
The likelihood ratio of two simple hypotheses H
0
, H
1
given data x is
Λ
x
(H
0
; H
1
) =
L
x
(H
1
)
L
x
(H
0
)
.
A likelihood ratio test (LR test) is one where the critical region
C
is of the form
C = {x : Λ
x
(H
0
; H
1
) > k}
for some k.
It turns out this rather simple test is “the best” in the following sense:
Lemma (NeymanPearson lemma). Suppose
H
0
:
f
=
f
0
,
H
1
:
f
=
f
1
, where
f
0
and
f
1
are continuous densities that are nonzero on the same regions. Then
among all tests of size less than or equal to
α
, the test with the largest power is
the likelihood ratio test of size α.
Proof. Under the likelihood ratio test, our critical region is
C =
x :
f
1
(x)
f
0
(x)
> k
,
where
k
is chosen such that
α
=
P
(
reject H
0
 H
0
) =
P
(X
∈ C  H
0
) =
R
C
f
0
(x) dx. The probability of Type II error is given