5 High-dimensional inference

5.1 Multiple testing
Finally, we talk about high-dimensional inference. Suppose we have come up
with a large number of potential drugs, and want to see if they are effective in
killing bacteria. Naively, we might try to run a hypothesis test on each of them,
using a $p < 0.05$ threshold. But this is a terrible idea: each test has a $0.05$ chance of giving a false positive, so even if all the drugs are actually useless, we would expect to incorrectly conclude that many of them are useful.
In general, suppose we have null hypotheses $H_1, \ldots, H_m$. By definition, a $p$-value $p_i$ for $H_i$ is a random variable such that
\[
  \mathbb{P}_{H_i}(p_i \leq \alpha) \leq \alpha \quad \text{for all } \alpha \in [0, 1].
\]
Let $I_0 \subseteq \{1, \ldots, m\}$ be the set of true null hypotheses, and let $m_0 = |I_0|$ be the number of true null hypotheses. Given a procedure for rejecting hypotheses (a multiple testing procedure), we let $N$ be the number of false rejections (false positives), and $R$ the total number of rejections. One can also think about the number of false negatives, but we shall not do that here.
Traditionally, multiple-testing procedures sought to control the family-wise error rate (FWER), defined by $\mathbb{P}(N \geq 1)$. The simplest way to control this is the Bonferroni correction, which rejects $H_i$ if $p_i \leq \frac{\alpha}{m}$. Usually we might have $\alpha = 0.05$, so this threshold is very small if we have lots of hypotheses (e.g. a million). Unsurprisingly, we have
Theorem. When using the Bonferroni correction, we have
\[
  \mathrm{FWER} \leq \mathbb{E}(N) \leq \frac{m_0 \alpha}{m} \leq \alpha.
\]
Proof. The first inequality is Markov's inequality, and the last is obvious. The second follows from
\[
  \mathbb{E}(N) = \mathbb{E}\left( \sum_{i \in I_0} \mathbf{1}_{p_i \leq \alpha/m} \right) = \sum_{i \in I_0} \mathbb{P}\left( p_i \leq \frac{\alpha}{m} \right) \leq \frac{m_0 \alpha}{m}.
\]
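To make this concrete, here is a minimal sketch of the Bonferroni correction in Python; the function name and the use of NumPy are choices of this sketch, not part of the notes.

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject H_i whenever p_i <= alpha / m, controlling FWER at level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    return p <= alpha / m

# With m = 4 hypotheses, only p-values below 0.05 / 4 = 0.0125 are rejected.
print(bonferroni([0.001, 0.02, 0.03, 0.6]))  # [ True False False False]
```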
The Bonferroni correction is a rather conservative procedure, since all these inequalities can be quite loose. When we have a large number of hypotheses, the criterion for rejection is very strict. Can we do better?
A more sophisticated approach is the closed testing procedure. For each non-empty subset $I \subseteq \{1, \ldots, m\}$, we let $H_I$ be the null hypothesis that $H_i$ is true for all $i \in I$. This is known as an intersection hypothesis. Suppose for each non-empty $I \subseteq \{1, \ldots, m\}$ we have an $\alpha$-level test $\phi_I$ of $H_I$ (a local test), so that
\[
  \mathbb{P}_{H_I}(\phi_I = 1) \leq \alpha.
\]
Here $\phi_I$ takes values in $\{0, 1\}$, and $\phi_I = 1$ means rejection. The closed testing procedure then rejects $H_I$ iff for all $J \supseteq I$, we have $\phi_J = 1$.
Example. Consider the following collection of local tests for $m = 4$ hypotheses, where the hypotheses rejected by the local tests are marked in red:

[Diagram: the lattice of intersection hypotheses $H_{1234}$; $H_{123}$, $H_{124}$, $H_{134}$, $H_{234}$; $H_{12}$, $H_{13}$, $H_{14}$, $H_{23}$, $H_{24}$, $H_{34}$; $H_1$, $H_2$, $H_3$, $H_4$, with the locally rejected hypotheses marked in red.]

In this case, we reject $H_1$ but not $H_2$ by closed testing. While $H_{23}$ is rejected, we cannot tell if it is $H_2$ or $H_3$ that should be rejected.
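As an illustration, here is a sketch of the closed testing procedure in Python. The local tests are supplied as a user-defined function `local_test(J, p, alpha)` returning whether $\phi_J = 1$; the names and structure here are illustrative, not from the notes.

```python
from itertools import combinations

def closed_testing(p, local_test, alpha=0.05):
    """Reject H_i iff every intersection hypothesis H_J with i in J is
    rejected by its local test phi_J."""
    m = len(p)
    all_subsets = [J for size in range(1, m + 1)
                   for J in combinations(range(m), size)]
    # Which intersection hypotheses do the local tests reject?
    locally_rejected = {J for J in all_subsets if local_test(J, p, alpha)}
    # H_i is rejected by closed testing iff all J containing i are locally rejected.
    return [all(J in locally_rejected for J in all_subsets if i in J)
            for i in range(m)]

# With the Bonferroni local test (phi_J = 1 iff p_i <= alpha/|J| for some i in J),
# this recovers Holm's procedure, discussed below.
bonferroni_local = lambda J, p, alpha: any(p[i] <= alpha / len(J) for i in J)
print(closed_testing([0.001, 0.04, 0.03, 0.6], bonferroni_local))
# [True, False, False, False]
```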
This might seem like a very difficult procedure to analyze, but it turns out it
is extremely simple.
Theorem. Closed testing makes no false rejections with probability at least $1 - \alpha$. In particular, $\mathrm{FWER} \leq \alpha$.
Proof. If there is a false rejection, say of $H_i$ for some $i \in I_0$, then in particular $\phi_{I_0} = 1$, i.e. we have falsely rejected the (true) intersection hypothesis $H_{I_0}$ with its local test, and this happens with probability at most $\alpha$.
But of course this doesn’t immediately give us an algorithm we can apply to
data. Different choices for the local test give rise to different multiple testing
procedures. One famous example is Holm's procedure, which takes $\phi_I$ to be the Bonferroni test, where $\phi_I = 1$ iff $p_i \leq \frac{\alpha}{|I|}$ for some $i \in I$.
When $m$ is large, we don't want to compute all the $\phi_I$, since there are $2^m - 1$ of them. So we might want to find a shortcut. With a moment of thought, we see that Holm's procedure amounts to the following:
Let $(i)$ be the index of the $i$th smallest $p$-value, so that $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$.

Step 1: If $p_{(1)} \leq \frac{\alpha}{m}$, then reject $H_{(1)}$ and go to step 2. Otherwise, accept all null hypotheses.

Step $i$: If $p_{(i)} \leq \frac{\alpha}{m - i + 1}$, then reject $H_{(i)}$ and go to step $i + 1$. Otherwise, accept $H_{(i)}, H_{(i+1)}, \ldots, H_{(m)}$.

Step $m$: If $p_{(m)} \leq \alpha$, then reject $H_{(m)}$. Otherwise, accept $H_{(m)}$.
The interesting thing about this is that it has the same bound on FWER as the Bonferroni correction, but the conditions for rejection here are less strict, so Holm's procedure rejects at least as many hypotheses.
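A minimal Python sketch of this step-down shortcut (again, names are illustrative):

```python
import numpy as np

def holm(p_values, alpha=0.05):
    """Holm's step-down procedure: returns a boolean rejection array."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                # indices of the sorted p-values
    reject = np.zeros(m, dtype=bool)
    for step, i in enumerate(order):     # step = 0 corresponds to p_(1)
        if p[i] <= alpha / (m - step):   # threshold alpha / (m - i + 1) in 1-based indexing
            reject[i] = True             # reject H_(i) and continue
        else:
            break                        # accept H_(i), ..., H_(m) and stop
    return reject

# Holm rejects the two smallest p-values here; Bonferroni would reject only the first.
print(holm([0.001, 0.015, 0.03, 0.6]))  # [ True  True False False]
```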
But if $m$ is very large, then the criterion for rejecting $H_{(1)}$ is still quite strict, since the first threshold is still $\frac{\alpha}{m}$.
The problem is that controlling FWER is a very strong requirement. Instead of controlling the probability of making at least one false rejection, when $m$ is large it might be more reasonable to control the proportion of false discoveries. Many modern multiple testing procedures aim to control the false discovery rate
\[
  \mathrm{FDR} = \mathbb{E}\left[ \frac{N}{\max\{R, 1\}} \right].
\]
The funny maximum in the denominator is just to avoid division by zero: when $R = 0$, we must have $N = 0$ as well, so what is put in the denominator doesn't really matter.
The Benjamini–Hochberg procedure attempts to control the FDR at level $\alpha$ and works as follows. Let
\[
  \hat{k} = \max\left\{ i : p_{(i)} \leq \frac{\alpha i}{m} \right\}.
\]
Then reject $H_{(1)}, \ldots, H_{(\hat{k})}$, or accept all hypotheses if $\hat{k}$ is not defined.
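A minimal Python sketch of the Benjamini–Hochberg procedure, under the same illustrative conventions as before:

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject H_(1), ..., H_(k_hat), where k_hat = max{ i : p_(i) <= alpha * i / m }."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = np.nonzero(p[order] <= alpha * np.arange(1, m + 1) / m)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size > 0:            # otherwise k_hat is undefined: accept everything
        k_hat = below[-1] + 1     # number of hypotheses rejected
        reject[order[:k_hat]] = True
    return reject

# All four are rejected, since the largest p-value satisfies p_(4) = 0.04 <= 0.05 * 4/4.
print(benjamini_hochberg([0.001, 0.015, 0.03, 0.04]))  # [ True  True  True  True]
```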
Under certain conditions, this does control the false discovery rate.
Theorem. Suppose that for each $i \in I_0$, $p_i$ is independent of $\{p_j : j \neq i\}$. Then using the Benjamini–Hochberg procedure, the false discovery rate satisfies
\[
  \mathrm{FDR} = \mathbb{E}\left[\frac{N}{\max(R, 1)}\right] \leq \frac{\alpha m_0}{m} \leq \alpha.
\]
Curiously, while the proof requires each $p_i$ to be independent of the others, in simulations, even when there is no hope that the $p_i$ are independent, the Benjamini–Hochberg procedure still appears to work very well, and people are still trying to understand what it is that makes it work so well in general.
Proof. The false discovery rate is
\begin{align*}
  \mathbb{E}\left[\frac{N}{\max(R, 1)}\right]
  &= \sum_{r=1}^{m} \mathbb{E}\left[ \frac{N}{r}\, \mathbf{1}_{R = r} \right]
   = \sum_{r=1}^{m} \frac{1}{r}\, \mathbb{E}\left[ \sum_{i \in I_0} \mathbf{1}_{p_i \leq \alpha r/m}\, \mathbf{1}_{R = r} \right]\\
  &= \sum_{i \in I_0} \sum_{r=1}^{m} \frac{1}{r}\, \mathbb{P}\left( p_i \leq \frac{\alpha r}{m},\ R = r \right),
\end{align*}
using that on the event $\{R = r\}$, the hypotheses rejected are exactly those with $p_i \leq \frac{\alpha r}{m}$.
The brilliant idea is, for each $i \in I_0$, to let $R_i$ be the number of rejections when applying a modified Benjamini–Hochberg procedure to
\[
  p^{\setminus i} = \{p_1, \ldots, p_m\} \setminus \{p_i\}
\]
with cutoff
\[
  \hat{k}_i = \max\left\{ j : p^{\setminus i}_{(j)} \leq \frac{\alpha(j + 1)}{m} \right\}.
\]
We observe that for $i \in I_0$ and $r \geq 1$, we have
\begin{align*}
  \left\{ p_i \leq \frac{\alpha r}{m},\ R = r \right\}
  &= \left\{ p_i \leq \frac{\alpha r}{m},\ p_{(r)} \leq \frac{\alpha r}{m},\ p_{(s)} > \frac{\alpha s}{m} \text{ for all } s > r \right\}\\
  &= \left\{ p_i \leq \frac{\alpha r}{m},\ p^{\setminus i}_{(r-1)} \leq \frac{\alpha r}{m},\ p^{\setminus i}_{(s-1)} > \frac{\alpha s}{m} \text{ for all } s > r \right\}\\
  &= \left\{ p_i \leq \frac{\alpha r}{m},\ R_i = r - 1 \right\}.
\end{align*}
The key point is that the event $\{R_i = r - 1\}$ depends only on the other $p$-values, which are independent of $p_i$ by assumption. So the FDR is equal to
\begin{align*}
  \mathrm{FDR}
  &= \sum_{i \in I_0} \sum_{r=1}^{m} \frac{1}{r}\, \mathbb{P}\left( p_i \leq \frac{\alpha r}{m},\ R_i = r - 1 \right)\\
  &= \sum_{i \in I_0} \sum_{r=1}^{m} \frac{1}{r}\, \mathbb{P}\left( p_i \leq \frac{\alpha r}{m} \right) \mathbb{P}(R_i = r - 1).
\end{align*}
Using that $\mathbb{P}\left(p_i \leq \frac{\alpha r}{m}\right) \leq \frac{\alpha r}{m}$ by the definition of a $p$-value, this is
\[
  \leq \frac{\alpha}{m} \sum_{i \in I_0} \sum_{r=1}^{m} \mathbb{P}(R_i = r - 1)
   = \frac{\alpha}{m} \sum_{i \in I_0} \mathbb{P}(R_i \in \{0, \ldots, m - 1\})
   = \frac{\alpha m_0}{m}.
\]
This is one of the most used procedures in modern statistics, especially in
the biological sciences.