1 Estimation

IB Statistics

1.3 Sufficiency

Often, we do experiments just to find out the value of θ. For example, we might want to estimate what proportion of the population supports some political candidate. We are seldom interested in the data points themselves, and just want to learn about the big picture. This leads us to the concept of a sufficient statistic. This is a statistic T(X) that contains all the information we have about θ in the sample.

Example. Let X_1, ⋯, X_n be iid Bernoulli(θ), so that P(X_i = 1) = 1 − P(X_i = 0) = θ for some 0 < θ < 1. So

    f_X(x | θ) = ∏_{i=1}^n θ^{x_i} (1 − θ)^{1 − x_i} = θ^{Σ x_i} (1 − θ)^{n − Σ x_i}.

This depends on the data only through T(X) = Σ x_i, the total number of ones.

Suppose we are now given that T(X) = t. Then what is the distribution of X? We have

    f_{X | T = t}(x) = P_θ(X = x, T = t) / P_θ(T = t) = P_θ(X = x) / P_θ(T = t),

where the last equality holds because if X = x, then T must be equal to t. This is equal to

    θ^{Σ x_i} (1 − θ)^{n − Σ x_i} / [ (n choose t) θ^t (1 − θ)^{n − t} ] = (n choose t)^{−1}.

So the conditional distribution of X given T = t does not depend on θ. So if we know T, then additional knowledge of x does not give more information about θ.
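This calculation can be checked empirically. The sketch below (illustrative only; `conditional_dist` and the parameter values are made up for this example) simulates Bernoulli samples and conditions on T(x) = t. For any value of θ, every sequence with t ones should appear with conditional probability 1/(n choose t).

```python
import random
from collections import Counter

def conditional_dist(theta, n=5, t=2, trials=200_000, seed=0):
    """Empirical distribution of a Bernoulli(theta) sample x of size n,
    conditioned on the sufficient statistic T(x) = sum(x) equalling t."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        x = tuple(1 if rng.random() < theta else 0 for _ in range(n))
        if sum(x) == t:
            counts[x] += 1
    kept = sum(counts.values())
    return {seq: c / kept for seq, c in counts.items()}

# For any theta, each of the (5 choose 2) = 10 sequences with two ones
# should appear with conditional probability roughly 1/10.
d1 = conditional_dist(0.3)
d2 = conditional_dist(0.7)
```

Both conditional distributions come out (approximately) uniform over the ten admissible sequences, regardless of θ.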

Definition (Sufficient statistic). A statistic T is sufficient for θ if the conditional distribution of X given T does not depend on θ.

There is a convenient theorem that allows us to find sufficient statistics.

Theorem (The factorization criterion). T is sufficient for θ if and only if

    f_X(x | θ) = g(T(x), θ) h(x)

for some functions g and h.

Proof. We first prove the discrete case.

Suppose f_X(x | θ) = g(T(x), θ) h(x). If T(x) = t, then

    f_{X | T = t}(x) = P_θ(X = x, T(X) = t) / P_θ(T = t)
                     = g(T(x), θ) h(x) / Σ_{y : T(y) = t} g(T(y), θ) h(y)
                     = g(t, θ) h(x) / [ g(t, θ) Σ h(y) ]
                     = h(x) / Σ h(y),

which doesn't depend on θ. So T is sufficient.

The continuous case is similar. If f_X(x | θ) = g(T(x), θ) h(x), and T(x) = t, then

    f_{X | T = t}(x) = g(T(x), θ) h(x) / ∫_{y : T(y) = t} g(T(y), θ) h(y) dy
                     = g(t, θ) h(x) / [ g(t, θ) ∫ h(y) dy ]
                     = h(x) / ∫ h(y) dy,

which does not depend on θ.

Now suppose T is sufficient, so that the conditional distribution of X | T = t does not depend on θ. Then

    P_θ(X = x) = P_θ(X = x, T = T(x)) = P_θ(X = x | T = T(x)) P_θ(T = T(x)).

The first factor does not depend on θ by assumption; call it h(x). Let the second factor be g(t, θ), and so we have the required factorisation.

Example. Continuing the above example,

    f_X(x | θ) = θ^{Σ x_i} (1 − θ)^{n − Σ x_i}.

Take g(t, θ) = θ^t (1 − θ)^{n − t} and h(x) = 1 to see that T(X) = Σ X_i is sufficient for θ.

Example. Let X_1, ⋯, X_n be iid U[0, θ]. Write 1[A] for the indicator function of an arbitrary set A. We have

    f_X(x | θ) = ∏_{i=1}^n (1/θ) 1[0 ≤ x_i ≤ θ] = (1/θ^n) 1[max_i x_i ≤ θ] 1[min_i x_i ≥ 0].

If we let t = max_i x_i, then we have

    f_X(x | θ) = (1/θ^n) 1[t ≤ θ] · 1[min_i x_i ≥ 0],

where the first factor is g(t, θ) and the second is h(x). So T = max_i x_i is sufficient.
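Sufficiency here means that once we know the maximum, the rest of the sample carries no information about θ: conditional on max_i X_i = t, the remaining points behave as iid U[0, t] whatever θ is. A small simulation sketch (illustrative only; the helper name and parameter values are made up, and the continuous conditioning is approximated by a narrow window around t) checks this:

```python
import random

def cond_given_max(theta, n=3, t=0.5, eps=0.01, trials=400_000, seed=3):
    """Empirical mean of the non-maximal coordinates of a U[0, theta]
    sample, conditioned on the maximum falling within eps of t."""
    rng = random.Random(seed)
    acc, kept = 0.0, 0
    for _ in range(trials):
        x = [rng.random() * theta for _ in range(n)]
        if abs(max(x) - t) < eps:
            rest = sorted(x)[:-1]  # drop the maximum (ties have probability 0)
            acc += sum(rest) / len(rest)
            kept += 1
    return acc / kept

# Given max = 0.5, the other points should average t/2 = 0.25,
# for any theta >= 0.5.
m1 = cond_given_max(0.6)
m2 = cond_given_max(1.0)
```

Both runs give roughly 0.25, even though the two samples come from quite different values of θ.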

Note that sufficient statistics are not unique. If T is sufficient for θ, then so is any 1-1 function of T. X is always sufficient for θ as well, but it is not of much use. How can we decide if a sufficient statistic is “good”?

Given any statistic T, we can partition the sample space X^n into sets {x ∈ X^n : T(x) = t}. Then after an experiment, instead of recording the actual value of x, we can simply record the partition x falls into. If there are fewer partitions than possible values of x, then effectively there is less information we have to store.

If T is sufficient, then this data reduction does not lose any information about θ. The “best” sufficient statistic would be one in which we achieve the maximum possible reduction. This is known as the minimal sufficient statistic. The formal definition we take is the following:

Definition (Minimal sufficiency). A sufficient statistic T(X) is minimal if it is a function of every other sufficient statistic, i.e. if T′(X) is also sufficient, then

    T′(X) = T′(Y) ⇒ T(X) = T(Y).

Again, we have a handy theorem to find minimal sufficient statistics:

Theorem. Suppose T = T(X) is a statistic that satisfies

    f_X(x; θ) / f_X(y; θ) does not depend on θ if and only if T(x) = T(y).

Then T is minimal sufficient for θ.

Proof. First we have to show sufficiency. We will use the factorization criterion to do so.

Firstly, for each possible t, pick a favorite x_t such that T(x_t) = t.

Now let x ∈ X^n and let T(x) = t. So T(x) = T(x_t). By the hypothesis, f_X(x; θ)/f_X(x_t; θ) does not depend on θ. Let this be h(x). Let g(t, θ) = f_X(x_t; θ). Then

    f_X(x; θ) = f_X(x_t; θ) · [ f_X(x; θ) / f_X(x_t; θ) ] = g(t, θ) h(x).

So T is sufficient for θ.

To show that this is minimal, suppose that S(X) is also sufficient. By the factorization criterion, there exist functions g_S and h_S such that

    f_X(x; θ) = g_S(S(x), θ) h_S(x).

Now suppose that S(x) = S(y). Then

    f_X(x; θ) / f_X(y; θ) = g_S(S(x), θ) h_S(x) / [ g_S(S(y), θ) h_S(y) ] = h_S(x) / h_S(y).

This means that the ratio f_X(x; θ)/f_X(y; θ) does not depend on θ. By the hypothesis, this implies that T(x) = T(y). So we know that S(x) = S(y) implies T(x) = T(y). So T is a function of S. So T is minimal sufficient.

Example. Suppose X_1, ⋯, X_n are iid N(μ, σ²). Then

    f_X(x | μ, σ²) / f_X(y | μ, σ²)
        = [ (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_i (x_i − μ)² ) ] / [ (2πσ²)^{−n/2} exp( −(1/(2σ²)) Σ_i (y_i − μ)² ) ]
        = exp( −(1/(2σ²)) ( Σ_i x_i² − Σ_i y_i² ) + (μ/σ²) ( Σ_i x_i − Σ_i y_i ) ).

This is a constant function of (μ, σ²) iff Σ_i x_i² = Σ_i y_i² and Σ_i x_i = Σ_i y_i. So T(X) = (Σ_i X_i², Σ_i X_i) is minimal sufficient for (μ, σ²).
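The "constant ratio" criterion can be illustrated numerically. In the sketch below (an illustration, not part of the notes; the sample y is constructed by hand so that it shares Σ x_i and Σ x_i² with x without being a permutation of it), the log likelihood ratio is evaluated over a grid of (μ, σ²) values and comes out identically zero.

```python
import math

def log_lik(x, mu, s2):
    """Log likelihood of an iid N(mu, s2) sample x."""
    n = len(x)
    return (-n / 2) * math.log(2 * math.pi * s2) \
        - sum((xi - mu) ** 2 for xi in x) / (2 * s2)

x = [1.0, 2.0, 3.0]  # sum = 6, sum of squares = 14

# Build y with the same sum and sum of squares: rotate the centred
# sample within the sphere {sum(y) = 6, sum((y_i - 2)^2) = 2}.
r = math.sqrt(2.0)
u = [1 / math.sqrt(2), -1 / math.sqrt(2), 0.0]
v = [1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6)]
phi = 0.7  # any generic angle gives a non-permutation
y = [2.0 + r * (math.cos(phi) * ui + math.sin(phi) * vi)
     for ui, vi in zip(u, v)]

# The log likelihood ratio should not depend on (mu, sigma^2):
ratios = [log_lik(x, mu, s2) - log_lik(y, mu, s2)
          for mu in (-1.0, 0.0, 2.5) for s2 in (0.5, 1.0, 4.0)]
```

Since the two samples agree on the minimal sufficient statistic, the ratio is constant (here exactly 1, so the log ratio is 0) for every choice of the parameters.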

As mentioned, sufficient statistics allow us to store the results of our experiments in the most efficient way. It turns out if we have a minimal sufficient statistic, then we can use it to improve any estimator.

Theorem (Rao-Blackwell Theorem). Let T be a sufficient statistic for θ and let θ̃ be an estimator for θ with E(θ̃²) < ∞ for all θ. Let θ̂(x) = E[θ̃(X) | T(X) = T(x)]. Then for all θ,

    E[(θ̂ − θ)²] ≤ E[(θ̃ − θ)²].

The inequality is strict unless θ̃ is a function of T.

Here we have to be careful with our definition of θ̂. It is defined as the conditional expected value of θ̃(X), and this could potentially depend on the actual value of θ. Fortunately, since T is sufficient for θ, the conditional distribution of X given T = t does not depend on θ. Hence θ̂ = E[θ̃(X) | T] does not depend on θ, and so is a genuine estimator. In fact, this is the only place the proof uses that T is sufficient; we only require sufficiency so that we can compute θ̂.

Using this theorem, given any estimator, we can find one that is a function of a sufficient statistic and is at least as good in terms of mean squared error of estimation. Moreover, if the original estimator θ̃ is unbiased, so is the new θ̂. Also, if θ̃ is already a function of T, then θ̂ = θ̃.

Proof. By the conditional expectation formula, we have

    E(θ̂) = E[E(θ̃ | T)] = E(θ̃).

So they have the same bias.

By the conditional variance formula,

    var(θ̃) = E[var(θ̃ | T)] + var[E(θ̃ | T)] = E[var(θ̃ | T)] + var(θ̂).

Hence var(θ̃) ≥ var(θ̂). So mse(θ̃) ≥ mse(θ̂), with equality only if var(θ̃ | T) = 0.

Example. Suppose X_1, ⋯, X_n are iid Poisson(λ), and let θ = e^{−λ}, which is the probability that X_1 = 0. Then

    p_X(x | λ) = e^{−nλ} λ^{Σ x_i} / ∏ x_i!.

So

    p_X(x | θ) = θ^n (−log θ)^{Σ x_i} / ∏ x_i!.

We see that T = Σ X_i is sufficient for θ, and Σ X_i ∼ Poisson(nλ).

We start with an easy estimator of θ, namely θ̃ = 1[X_1 = 0], which is unbiased (i.e. if we observe nothing in the first observation period, we assume the event is impossible). Then

    E[θ̃ | T = t] = P( X_1 = 0 | Σ_{i=1}^n X_i = t )
                 = P(X_1 = 0) P( Σ_{i=2}^n X_i = t ) / P( Σ_{i=1}^n X_i = t )
                 = ((n − 1)/n)^t.

So θ̂ = (1 − 1/n)^{Σ x_i}. This is approximately (1 − 1/n)^{n X̄} ≈ e^{−X̄} = e^{−λ̂}, which makes sense.
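The Rao-Blackwell improvement here is substantial and easy to see by simulation. The sketch below (illustrative; the helper names and parameter values are made up, and a simple Knuth-style Poisson sampler stands in for a library routine) compares the Monte Carlo mean squared errors of θ̃ = 1[X_1 = 0] and θ̂ = (1 − 1/n)^T.

```python
import math
import random

def poisson(rng, lam):
    """Knuth's Poisson sampler: count uniform draws until the running
    product falls below e^{-lam}."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def mse_compare(lam=1.0, n=10, trials=100_000, seed=1):
    """Monte Carlo MSE of the crude estimator 1[X_1 = 0] versus its
    Rao-Blackwellised version (1 - 1/n)^T, for theta = e^{-lam}."""
    rng = random.Random(seed)
    theta = math.exp(-lam)
    se_crude = se_rb = 0.0
    for _ in range(trials):
        x = [poisson(rng, lam) for _ in range(n)]
        t = sum(x)
        se_crude += (float(x[0] == 0) - theta) ** 2
        se_rb += ((1 - 1 / n) ** t - theta) ** 2
    return se_crude / trials, se_rb / trials

mse_crude, mse_rb = mse_compare()
```

Both estimators are unbiased, but conditioning on the sufficient statistic cuts the mean squared error by an order of magnitude in this setting.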

Example. Let X_1, ⋯, X_n be iid U[0, θ], and suppose that we want to estimate θ. We have shown above that T = max X_i is sufficient for θ. Let θ̃ = 2X_1, an unbiased estimator. Then

    E[θ̃ | T = t] = 2E[X_1 | max X_i = t]
                 = 2E[X_1 | max X_i = t, X_1 = max X_i] P(X_1 = max X_i)
                   + 2E[X_1 | max X_i = t, X_1 ≠ max X_i] P(X_1 ≠ max X_i)
                 = 2( t × (1/n) + (t/2) × ((n − 1)/n) )
                 = ((n + 1)/n) t.

So θ̂ = ((n + 1)/n) max X_i is our new estimator.
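Again the improvement can be checked by simulation. The sketch below (illustrative; the helper name and parameter values are made up) estimates the MSE of both estimators; theory gives mse(θ̃) = θ²/3 while mse(θ̂) = θ²/(n(n + 2)).

```python
import random

def mse(est, theta=1.0, n=5, trials=200_000, seed=2):
    """Monte Carlo mean squared error of an estimator est(x)
    applied to samples x of size n from U[0, theta]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [rng.random() * theta for _ in range(n)]
        total += (est(x) - theta) ** 2
    return total / trials

mse_crude = mse(lambda x: 2 * x[0])                     # theta_tilde = 2 X_1
mse_rb = mse(lambda x: (len(x) + 1) / len(x) * max(x))  # theta_hat
```

With θ = 1 and n = 5, the crude estimator's MSE is about 1/3 while the Rao-Blackwellised version achieves about 1/35.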