1.3 Sufficiency
Often, we do experiments just to find out the value of θ. For example, we might want to estimate what proportion of the population supports some political candidate. We are seldom interested in the data points themselves, and just want to learn about the big picture. This leads us to the concept of a sufficient statistic. This is a statistic T(X) that contains all the information we have about θ in the sample.
Example. Let X_1, ··· , X_n be iid Bernoulli(θ), so that P(X_i = 1) = 1 − P(X_i = 0) = θ for some 0 < θ < 1. So

    f_X(x | θ) = ∏_{i=1}^n θ^{x_i} (1 − θ)^{1−x_i} = θ^{∑ x_i} (1 − θ)^{n−∑ x_i}.

This depends on the data only through T(X) = ∑ x_i, the total number of ones.
Suppose we are now given that T(X) = t. Then what is the distribution of X? We have

    f_{X|T=t}(x) = P_θ(X = x, T = t) / P_θ(T = t) = P_θ(X = x) / P_θ(T = t),

where the last equality holds because if X = x, then T must be equal to t. This is equal to

    θ^{∑ x_i} (1 − θ)^{n−∑ x_i} / ( \binom{n}{t} θ^t (1 − θ)^{n−t} ) = \binom{n}{t}^{−1}.

So the conditional distribution of X given T = t does not depend on θ. So if we know T, then additional knowledge of x does not give more information about θ.
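As a quick sanity check (my own illustration, not part of the notes), the following Python sketch simulates Bernoulli samples at two different values of θ and tabulates the empirical distribution of the sample pattern conditional on ∑ x_i = t. The helper name conditional_freqs and the chosen values of n and t are assumptions of the sketch.

    import numpy as np

    # Empirically, the conditional distribution of X given T = sum(X) = t
    # should not depend on theta.
    rng = np.random.default_rng(0)

    def conditional_freqs(theta, n=3, t=2, reps=200_000):
        """Empirical distribution of the sample pattern, given sum == t."""
        samples = rng.binomial(1, theta, size=(reps, n))
        cond = samples[samples.sum(axis=1) == t]
        patterns, counts = np.unique(cond, axis=0, return_counts=True)
        return patterns, counts / counts.sum()

    for theta in (0.3, 0.7):
        patterns, freqs = conditional_freqs(theta)
        print(theta, dict(zip(map(tuple, patterns), np.round(freqs, 3))))

Both runs put probability roughly 1/3 on each of the \binom{3}{2} = 3 patterns with two ones, matching the calculation above.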
Definition (Sufficient statistic). A statistic T is sufficient for θ if the conditional distribution of X given T does not depend on θ.
There is a convenient theorem that allows us to find sufficient statistics.
Theorem (The factorization criterion). T is sufficient for θ if and only if

    f_X(x | θ) = g(T(x), θ) h(x)

for some functions g and h.
Proof. We first prove the discrete case.
Suppose f_X(x | θ) = g(T(x), θ)h(x). If T(x) = t, then

    f_{X|T=t}(x) = P_θ(X = x, T(X) = t) / P_θ(T = t)
                 = g(T(x), θ)h(x) / ( ∑_{y:T(y)=t} g(T(y), θ)h(y) )
                 = g(t, θ)h(x) / ( g(t, θ) ∑ h(y) )
                 = h(x) / ∑ h(y),

which doesn't depend on θ. So T is sufficient.
The continuous case is similar. If f_X(x | θ) = g(T(x), θ)h(x), and T(x) = t, then

    f_{X|T=t}(x) = g(T(x), θ)h(x) / ( ∫_{y:T(y)=t} g(T(y), θ)h(y) dy )
                 = g(t, θ)h(x) / ( g(t, θ) ∫ h(y) dy )
                 = h(x) / ∫ h(y) dy,

which does not depend on θ.
Now suppose T is sufficient, so that the conditional distribution of X | T = t does not depend on θ. Then

    P_θ(X = x) = P_θ(X = x, T = T(x)) = P_θ(X = x | T = T(x)) P_θ(T = T(x)).

The first factor does not depend on θ by assumption; call it h(x). Let the second factor be g(t, θ), and so we have the required factorisation.
Example. Continuing the above example,

    f_X(x | θ) = θ^{∑ x_i} (1 − θ)^{n−∑ x_i}.

Take g(t, θ) = θ^t (1 − θ)^{n−t} and h(x) = 1 to see that T(X) = ∑ X_i is sufficient for θ.
Example. Let X_1, ··· , X_n be iid U[0, θ]. Write 1_[A] for the indicator function of an arbitrary set A. We have

    f_X(x | θ) = ∏_{i=1}^n (1/θ) 1_[0 ≤ x_i ≤ θ] = (1/θ^n) 1_[max_i x_i ≤ θ] 1_[min_i x_i ≥ 0].

If we let T(x) = max_i x_i, then writing t = T(x) we have

    f_X(x | θ) = (1/θ^n) 1_[t ≤ θ] · 1_[min_i x_i ≥ 0],

with g(t, θ) = (1/θ^n) 1_[t ≤ θ] and h(x) = 1_[min_i x_i ≥ 0]. So T = max_i x_i is sufficient.
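The sufficiency of the maximum can also be seen in a small simulation (again my own sketch, not from the notes): conditional on T = max X_i, the remaining observations of a U[0, θ] sample behave like draws from U[0, T], whatever θ is. The helper rescaled_rest and its parameters are illustrative assumptions.

    import numpy as np

    # After conditioning on the maximum, the rescaled remaining points should
    # look uniform on [0, 1] regardless of theta.
    rng = np.random.default_rng(1)

    def rescaled_rest(theta, n=5, reps=50_000):
        x = rng.uniform(0, theta, size=(reps, n))
        t = x.max(axis=1, keepdims=True)
        rest = np.sort(x, axis=1)[:, :-1]   # drop the maximum
        return (rest / t).ravel()           # rescale by T

    for theta in (1.0, 10.0):
        u = rescaled_rest(theta)
        print(theta, np.round(np.quantile(u, [0.25, 0.5, 0.75]), 3))

Both values of θ give empirical quartiles close to 0.25, 0.5, 0.75.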
Note that sufficient statistics are not unique. If T is sufficient for θ, then so is any 1-1 function of T. X is always sufficient for θ as well, but it is not of much use. How can we decide if a sufficient statistic is “good”?
Given any statistic T, we can partition the sample space X^n into sets {x ∈ X^n : T(x) = t}. Then after an experiment, instead of recording the actual value of x, we can simply record the partition x falls into. If there are fewer partitions than possible values of x, then effectively there is less information we have to store.
If T is sufficient, then this data reduction does not lose any information about θ. The “best” sufficient statistic would be one in which we achieve the maximum possible reduction. This is known as the minimal sufficient statistic. The formal definition we take is the following:
Definition (Minimal sufficiency). A sufficient statistic T(X) is minimal if it is a function of every other sufficient statistic, i.e. if T′(X) is also sufficient, then

    T′(X) = T′(Y) ⇒ T(X) = T(Y).
Again, we have a handy theorem to find minimal sufficient statistics:
Theorem. Suppose T = T(X) is a statistic that satisfies

    f_X(x; θ) / f_X(y; θ) does not depend on θ if and only if T(x) = T(y).

Then T is minimal sufficient for θ.
Proof. First we have to show sufficiency. We will use the factorization criterion to do so.
Firstly, for each possible t, pick a favorite x_t such that T(x_t) = t.
Now let x ∈ X^n and let T(x) = t. So T(x) = T(x_t). By the hypothesis, f_X(x; θ)/f_X(x_t; θ) does not depend on θ. Let this be h(x). Let g(t, θ) = f_X(x_t; θ). Then

    f_X(x; θ) = f_X(x_t; θ) · ( f_X(x; θ) / f_X(x_t; θ) ) = g(t, θ)h(x).

So T is sufficient for θ.
To show that this is minimal, suppose that S(X) is also sufficient. By the factorization criterion, there exist functions g_S and h_S such that

    f_X(x; θ) = g_S(S(x), θ) h_S(x).
Now suppose that S(x) = S(y). Then

    f_X(x; θ) / f_X(y; θ) = g_S(S(x), θ)h_S(x) / ( g_S(S(y), θ)h_S(y) ) = h_S(x) / h_S(y).

This means that the ratio f_X(x; θ)/f_X(y; θ) does not depend on θ. By the hypothesis, this implies that T(x) = T(y). So we know that S(x) = S(y) implies T(x) = T(y). So T is a function of S. So T is minimal sufficient.
Example. Suppose X_1, ··· , X_n are iid N(µ, σ^2). Then

    f_X(x | µ, σ^2) / f_X(y | µ, σ^2)
      = [ (2πσ^2)^{−n/2} exp( −(1/2σ^2) ∑_i (x_i − µ)^2 ) ] / [ (2πσ^2)^{−n/2} exp( −(1/2σ^2) ∑_i (y_i − µ)^2 ) ]
      = exp( −(1/2σ^2) ( ∑_i x_i^2 − ∑_i y_i^2 ) + (µ/σ^2) ( ∑_i x_i − ∑_i y_i ) ).

This is a constant function of (µ, σ^2) iff ∑_i x_i^2 = ∑_i y_i^2 and ∑_i x_i = ∑_i y_i. So T(X) = (∑_i X_i^2, ∑_i X_i) is minimal sufficient for (µ, σ^2).
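As a numerical check (my own sketch, not from the notes), the snippet below evaluates the likelihood ratio for two samples that are not permutations of each other but share the same ∑ x_i and ∑ x_i^2; the ratio is 1 for every (µ, σ^2) tried. The function log_lik and the test points are assumptions of the sketch.

    import numpy as np

    def log_lik(x, mu, sigma2):
        """Log-likelihood of an iid N(mu, sigma2) sample."""
        return (-0.5 * len(x) * np.log(2 * np.pi * sigma2)
                - np.sum((x - mu) ** 2) / (2 * sigma2))

    x = np.array([0.0, 3.0, 3.0])
    y = np.array([2.0, 2.0 + np.sqrt(3), 2.0 - np.sqrt(3)])  # same sum and sum of squares
    for mu, sigma2 in [(0.0, 1.0), (5.0, 0.5), (-2.0, 4.0)]:
        print(mu, sigma2, np.exp(log_lik(x, mu, sigma2) - log_lik(y, mu, sigma2)))

Each line prints 1.0 (up to floating-point error), as the theory predicts.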
As mentioned, sufficient statistics allow us to store the results of our experiments in the most efficient way. It turns out that if we have a minimal sufficient statistic, then we can use it to improve any estimator.
Theorem (Rao-Blackwell Theorem). Let T be a sufficient statistic for θ and let θ̃ be an estimator for θ with E(θ̃^2) < ∞ for all θ. Let θ̂(x) = E[θ̃(X) | T(X) = T(x)]. Then for all θ,

    E[(θ̂ − θ)^2] ≤ E[(θ̃ − θ)^2].

The inequality is strict unless θ̃ is a function of T.
Here we have to be careful with our definition of θ̂. It is defined as the expected value of θ̃(X), and this could potentially depend on the actual value of θ. Fortunately, since T is sufficient for θ, the conditional distribution of X given T = t does not depend on θ. Hence θ̂ = E[θ̃(X) | T] does not depend on θ, and so is a genuine estimator. In fact, the proof does not use that T is sufficient; we only require it to be sufficient so that we can compute θ̂.

Using this theorem, given any estimator, we can find one that is a function of a sufficient statistic and is at least as good in terms of mean squared error of estimation. Moreover, if the original estimator θ̃ is unbiased, so is the new θ̂. Also, if θ̃ is already a function of T, then θ̂ = θ̃.
Proof. By the conditional expectation formula, we have

    E(θ̂) = E[E(θ̃ | T)] = E(θ̃).

So they have the same bias.
By the conditional variance formula,

    var(θ̃) = E[var(θ̃ | T)] + var[E(θ̃ | T)] = E[var(θ̃ | T)] + var(θ̂).

Hence var(θ̃) ≥ var(θ̂). So mse(θ̃) ≥ mse(θ̂), with equality only if var(θ̃ | T) = 0.
Example. Suppose X_1, ··· , X_n are iid Poisson(λ), and let θ = e^{−λ}, which is the probability that X_1 = 0. Then

    p_X(x | λ) = e^{−nλ} λ^{∑ x_i} / ∏ x_i!.

So

    p_X(x | θ) = θ^n (−log θ)^{∑ x_i} / ∏ x_i!.

We see that T = ∑ X_i is sufficient for θ, and ∑ X_i ∼ Poisson(nλ).
We start with an easy estimator of θ, namely θ̃ = 1_{X_1 = 0}, which is unbiased (i.e. if we observe nothing in the first observation period, we assume the event is impossible). Then

    E[θ̃ | T = t] = P( X_1 = 0 | ∑_{i=1}^n X_i = t )
                  = P(X_1 = 0) P(∑_{i=2}^n X_i = t) / P(∑_{i=1}^n X_i = t)
                  = ( (n − 1)/n )^t.

So θ̂ = (1 − 1/n)^{∑ x_i}. This is approximately (1 − 1/n)^{n X̄} ≈ e^{−X̄} = e^{−λ̂}, which makes sense.
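A short simulation (my own sketch, not from the notes) shows the improvement promised by the Rao-Blackwell theorem in this example; the sample size, λ and replication count are arbitrary choices.

    import numpy as np

    # Compare the crude estimator 1{X_1 = 0} with its Rao-Blackwellised
    # version (1 - 1/n)^{sum X_i} for Poisson(lambda) data, theta = exp(-lambda).
    rng = np.random.default_rng(2)
    n, lam, reps = 10, 1.5, 100_000
    theta = np.exp(-lam)

    x = rng.poisson(lam, size=(reps, n))
    crude = (x[:, 0] == 0).astype(float)      # theta-tilde
    rb = (1 - 1 / n) ** x.sum(axis=1)         # theta-hat = E[theta-tilde | T]

    print("mse crude:", np.mean((crude - theta) ** 2))
    print("mse RB:   ", np.mean((rb - theta) ** 2))

The Rao-Blackwellised estimator has a much smaller mean squared error, while both remain (approximately) unbiased.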
Example. Let X_1, ··· , X_n be iid U[0, θ], and suppose that we want to estimate θ. We have shown above that T = max X_i is sufficient for θ. Let θ̃ = 2X_1, an unbiased estimator. Then

    E[θ̃ | T = t] = 2E[X_1 | max X_i = t]
                  = 2E[X_1 | max X_i = t, X_1 = max X_i] P(X_1 = max X_i)
                    + 2E[X_1 | max X_i = t, X_1 ≠ max X_i] P(X_1 ≠ max X_i)
                  = 2 ( t · (1/n) + (t/2) · (n − 1)/n )
                  = ((n + 1)/n) t.

So θ̂ = ((n + 1)/n) max X_i is our new estimator.
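The same comparison can be run for this example (again an illustrative sketch of my own, with arbitrary choices of n, θ and replication count).

    import numpy as np

    # Compare 2*X_1 with the Rao-Blackwellised estimator (n+1)/n * max X_i
    # for U[0, theta] samples.
    rng = np.random.default_rng(3)
    n, theta, reps = 10, 5.0, 100_000

    x = rng.uniform(0, theta, size=(reps, n))
    crude = 2 * x[:, 0]                        # theta-tilde = 2 X_1
    rb = (n + 1) / n * x.max(axis=1)           # theta-hat

    print("mse crude:", np.mean((crude - theta) ** 2))
    print("mse RB:   ", np.mean((rb - theta) ** 2))

Both estimators are unbiased, but the one based on the sufficient statistic has a far smaller mean squared error.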