1 Estimation

IB Statistics

1.6 Bayesian estimation

So far we have seen the frequentist approach to statistical inference, i.e. inferential statements about θ are interpreted in terms of repeat sampling. For example, the percentage confidence in a confidence interval is the probability that the interval will contain θ, not the probability that θ lies in the interval.

In contrast, the Bayesian approach treats θ as a random variable taking values in Θ. The investigator’s information and beliefs about the possible values of θ before any observation of data are summarised by a prior distribution π(θ). When X = x are observed, the extra information about θ is combined with the prior to obtain the posterior distribution π(θ | x) for θ given X = x.

There has been a long-running argument between the two approaches. Recently, things have settled down, and Bayesian methods are seen to be appropriate in a huge number of applications where one seeks to assess a probability about a “state of the world”. For example, spam filters will assess the probability that a specific email is spam, even though from a frequentist’s point of view this is nonsense: the email either is or is not spam, and it makes no sense to assign a probability to the email’s being spam.

In Bayesian inference, we usually have some prior knowledge about the distribution of θ (e.g. that it lies between 0 and 1). After collecting some data, we will find a posterior distribution of θ given X = x.

Definition (Prior and posterior distribution). The prior distribution of θ is the probability distribution of the value of θ before conducting the experiment. We usually write it as π(θ).

The posterior distribution of θ is the probability distribution of the value of θ given an outcome x of the experiment. We write it as π(θ | x).

By Bayes’ theorem, the distributions are related by

    π(θ | x) = f_X(x | θ)π(θ) / f_X(x).

Thus

    π(θ | x) ∝ f_X(x | θ)π(θ),

i.e.

    posterior ∝ likelihood × prior,

where the constant of proportionality is chosen to make the total mass of the posterior distribution equal to one. Usually, we use this form instead of attempting to calculate f_X(x).

It should be clear that the data enters through the likelihood, so the inference

is automatically based on any sufficient statistic.

Example. Suppose I have 3 coins in my pocket. One is 3 : 1 in favour of tails,

one is a fair coin, and one is 3 : 1 in favour of heads.

I randomly select one coin and flip it once, observing a head. What is the

probability that I have chosen coin 3?

Let X = 1 denote the event that I observe a head, and X = 0 a tail. Let θ denote the probability of a head, so θ is either 0.25, 0.5 or 0.75.

Our prior distribution is π(θ = 0.25) = π(θ = 0.5) = π(θ = 0.75) = 1/3.

The probability mass function is f_X(x | θ) = θ^x (1 − θ)^(1−x). So we have the following results:

    θ      π(θ)   f_X(x = 1 | θ)   f_X(x = 1 | θ)π(θ)   π(θ | x)
    0.25   0.33   0.25             0.0825               0.167
    0.50   0.33   0.50             0.1650               0.333
    0.75   0.33   0.75             0.2475               0.500
    Sum    1.00   1.50             0.4950               1.000

So if we observe a head, then there is now a 50% chance that we have picked

the third coin.
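The arithmetic in the table is easy to mechanise. A minimal sketch (in Python, using the example’s numbers) normalises likelihood × prior over the three coins:

```python
# Three-coin example: uniform prior over theta in {0.25, 0.5, 0.75};
# the likelihood of observing a single head is theta itself.
prior = {0.25: 1/3, 0.50: 1/3, 0.75: 1/3}

# posterior ∝ likelihood × prior, then normalise to total mass one
unnormalised = {t: t * p for t, p in prior.items()}
total = sum(unnormalised.values())
posterior = {t: u / total for t, u in unnormalised.items()}

# posterior[0.75] == 0.5: a 50% chance we picked the heads-biased coin
print(posterior)
```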

Example. Suppose we are interested in the true mortality risk θ in a hospital H which is about to try a new operation. On average in the country, around 10% of the people die, but mortality rates in different hospitals vary from around 3% to around 20%. Hospital H has no deaths in their first 10 operations. What should we believe about θ?

Let X_i = 1 if the ith patient in H dies, and 0 otherwise. Then

    f_X(x | θ) = θ^(Σx_i) (1 − θ)^(n − Σx_i).

Suppose a priori that θ ∼ beta(a, b) for some unknown a > 0, b > 0, so that

    π(θ) ∝ θ^(a−1) (1 − θ)^(b−1).

Then the posterior is

    π(θ | x) ∝ f_X(x | θ)π(θ) ∝ θ^(Σx_i + a − 1) (1 − θ)^(n − Σx_i + b − 1).

We recognize this as beta(Σx_i + a, n − Σx_i + b). So

    π(θ | x) = θ^(Σx_i + a − 1) (1 − θ)^(n − Σx_i + b − 1) / B(Σx_i + a, n − Σx_i + b).

In practice, we need to find a beta prior distribution that matches our information from other hospitals. It turns out that the beta(a = 3, b = 27) prior distribution has mean 0.1 and P(0.03 < θ < 0.20) = 0.9.

Then we observe data Σx_i = 0, n = 10. So the posterior is beta(Σx_i + a, n − Σx_i + b) = beta(3, 37). This has a mean of 3/40 = 0.075.

This leads to a different conclusion than a frequentist analysis. Since nobody

has died so far, the mle is 0, which does not seem plausible. Using a Bayesian

approach, we have a higher mean than 0 because we take into account the data

from other hospitals.
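The conjugate update is one line of arithmetic. As a sketch (assuming only the numbers quoted above):

```python
# Beta-Bernoulli conjugate update for the hospital example:
# prior beta(a, b), data with s deaths in n operations,
# posterior beta(a + s, b + n - s).
a, b = 3, 27      # prior chosen to match the national mortality figures
n, s = 10, 0      # ten operations, no deaths

post_a, post_b = a + s, b + n - s
posterior_mean = post_a / (post_a + post_b)   # 3/40 = 0.075
mle = s / n                                   # 0.0, the frequentist estimate

print(post_a, post_b, posterior_mean)  # 3 37 0.075
```

The posterior mean sits between the mle (0) and the prior mean (0.1), weighted by how much data we have.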

For this problem, a beta prior leads to a beta posterior. We say that the

beta family is a conjugate family of prior distributions for Bernoulli samples.

Suppose that a = b = 1, so that π(θ) = 1 for 0 < θ < 1 — the uniform distribution. Then the posterior is beta(Σx_i + 1, n − Σx_i + 1), with properties

               mean                 mode         variance
    prior      1/2                  non-unique   1/12
    posterior  (Σx_i + 1)/(n + 2)   Σx_i/n       (Σx_i + 1)(n − Σx_i + 1)/((n + 2)²(n + 3))

Note that the mode of the posterior is the mle.

The posterior mean estimator, (Σx_i + 1)/(n + 2), was discussed in Lecture 2, where we showed that it has smaller mse than the mle for non-extreme values of θ. This is known as Laplace’s estimator.

The posterior variance is bounded above by 1/(4(n + 3)). This is smaller than the prior variance, and is smaller for larger n.
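Both claims can be checked exactly with rational arithmetic; a small sketch over modest sample sizes:

```python
from fractions import Fraction

# Check that the beta(s + 1, n - s + 1) posterior variance
# (s + 1)(n - s + 1) / ((n + 2)^2 (n + 3)) never exceeds 1/(4(n + 3)),
# and that this bound is below the prior variance 1/12 for n >= 1.
ok = True
for n in range(1, 50):
    bound = Fraction(1, 4 * (n + 3))
    for s in range(n + 1):
        var = Fraction((s + 1) * (n - s + 1), (n + 2) ** 2 * (n + 3))
        ok = ok and var <= bound <= Fraction(1, 12)
print(ok)  # True
```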

Again, note that the posterior automatically depends on the data through

the sufficient statistic.

After we come up with the posterior distribution, we have to decide what

estimator to use. In the case above, we used the posterior mean, but this might

not be the best estimator.

To determine what is the “best” estimator, we first need a loss function. Let L(θ, a) be the loss incurred in estimating the value of a parameter to be a when the true value is θ.

Common loss functions are quadratic loss L(θ, a) = (θ − a)² and absolute error loss L(θ, a) = |θ − a|, but we can have others.

When our estimate is a, the expected posterior loss is

    h(a) = ∫ L(θ, a)π(θ | x) dθ.

Definition (Bayes estimator). The Bayes estimator θ̂ is the estimator that minimises the expected posterior loss.

For quadratic loss,

    h(a) = ∫ (a − θ)² π(θ | x) dθ.

Then h′(a) = 0 if

    ∫ (a − θ)π(θ | x) dθ = 0,

or

    a ∫ π(θ | x) dθ = ∫ θπ(θ | x) dθ.

Since ∫ π(θ | x) dθ = 1, the Bayes estimator is θ̂ = ∫ θπ(θ | x) dθ, the posterior mean.

For absolute error loss,

    h(a) = ∫ |θ − a| π(θ | x) dθ
         = ∫_{−∞}^{a} (a − θ)π(θ | x) dθ + ∫_{a}^{∞} (θ − a)π(θ | x) dθ
         = a ∫_{−∞}^{a} π(θ | x) dθ − ∫_{−∞}^{a} θπ(θ | x) dθ + ∫_{a}^{∞} θπ(θ | x) dθ − a ∫_{a}^{∞} π(θ | x) dθ.

Now h′(a) = 0 if

    ∫_{−∞}^{a} π(θ | x) dθ = ∫_{a}^{∞} π(θ | x) dθ.

This occurs when each side is 1/2. So θ̂ is the posterior median.
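Both facts can be verified numerically: discretise a posterior on a grid and minimise each expected loss by brute force; the minimisers land on the posterior mean and median. The beta(3, 5) shape below is an arbitrary illustrative stand-in, not from the text.

```python
# Brute-force Bayes estimators on a grid: expected quadratic loss is
# minimised at the posterior mean, expected absolute loss at the
# posterior median.  Posterior here: a beta(3, 5) kernel (illustrative).
grid = [i / 1000 for i in range(1, 1000)]
kernel = [t ** 2 * (1 - t) ** 4 for t in grid]
z = sum(kernel)
post = [k / z for k in kernel]                       # normalised weights

def expected_loss(a, loss):
    return sum(loss(t, a) * p for t, p in zip(grid, post))

quad_hat = min(grid, key=lambda a: expected_loss(a, lambda t, b: (t - b) ** 2))
abs_hat = min(grid, key=lambda a: expected_loss(a, lambda t, b: abs(t - b)))

post_mean = sum(t * p for t, p in zip(grid, post))   # beta(3, 5) has mean 3/8
```

Up to grid resolution, `quad_hat` matches the posterior mean 3/8 = 0.375 and `abs_hat` matches the beta(3, 5) median (just below the mean, as the density is right-skewed).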

Example. Suppose that X_1, ···, X_n are iid N(µ, 1), and that a priori µ ∼ N(0, τ^(−2)) for some τ. So τ reflects the certainty of our prior knowledge.

The posterior is given by

    π(µ | x) ∝ f_X(x | µ)π(µ)
             ∝ exp[−(1/2) Σ(x_i − µ)²] exp[−µ²τ²/2]
             ∝ exp[−(1/2)(n + τ²)(µ − Σx_i/(n + τ²))²],

since we can regard n, τ and all the x_i as constants in the normalisation term, and then complete the square with respect to µ. So the posterior distribution of µ given x is a normal distribution with mean Σx_i/(n + τ²) and variance 1/(n + τ²).

The normal density is symmetric, and so the posterior mean and the posterior median have the same value Σx_i/(n + τ²).

This is the optimal estimator for both quadratic and absolute loss.
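A minimal sketch of this update (the data and the value of τ below are made-up illustrative values, not from the text):

```python
# Normal-normal conjugate update: X_i ~ N(mu, 1) with prior
# mu ~ N(0, 1/tau^2) gives posterior N(sum(x)/(n + tau^2), 1/(n + tau^2)).
tau = 2.0                        # hypothetical prior precision parameter
x = [1.2, 0.8, 1.5, 0.9, 1.1]    # hypothetical observations, sample mean 1.1
n = len(x)

post_mean = sum(x) / (n + tau ** 2)   # shrunk from 1.1 towards the prior mean 0
post_var = 1 / (n + tau ** 2)

print(round(post_mean, 4), round(post_var, 4))  # 0.6111 0.1111
```

Note the shrinkage: the posterior mean sits between the sample mean and the prior mean 0, and moves towards the sample mean as n grows relative to τ².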

Example. Suppose that X_1, ···, X_n are iid Poisson(λ) random variables, and λ has an exponential distribution with mean 1. So π(λ) = e^(−λ), λ > 0.

The posterior distribution is given by

    π(λ | x) ∝ e^(−nλ) λ^(Σx_i) e^(−λ) = λ^(Σx_i) e^(−(n+1)λ),   λ > 0,

which is gamma(Σx_i + 1, n + 1). Hence under quadratic loss, our estimator is

    λ̂ = (Σx_i + 1)/(n + 1),

the posterior mean.

Under absolute error loss, λ̂ solves

    ∫₀^λ̂ (n + 1)^(Σx_i + 1) λ^(Σx_i) e^(−(n+1)λ) / (Σx_i)! dλ = 1/2.
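This median equation has no closed form, but it is easy to solve numerically. A sketch with illustrative values of Σx_i and n (not from the text):

```python
import math

# Gamma(s + 1, n + 1) posterior from the Poisson example.  Under
# quadratic loss the Bayes estimate is the mean (s + 1)/(n + 1); under
# absolute error loss it is the median, found by scanning the CDF.
s, n = 7, 4       # illustrative: sum of observations and sample size
rate = n + 1

def density(lam):
    # gamma density with shape s + 1 and rate n + 1
    return rate ** (s + 1) * lam ** s * math.exp(-rate * lam) / math.factorial(s)

step = 1e-4
mass, lam = 0.0, 0.0
while mass < 0.5:                 # trapezoidal accumulation of the CDF
    lam += step
    mass += 0.5 * (density(lam - step) + density(lam)) * step

posterior_mean = (s + 1) / rate   # = 1.6
posterior_median = lam            # below the mean: the gamma is right-skewed
```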