5 High-dimensional inference

5.1 Multiple testing
Finally, we talk about high-dimensional inference. Suppose we have come up
with a large number of potential drugs, and want to see if they are effective in
killing bacteria. Naively, we might try to run a hypothesis test on each of them,
using a $p < 0.05$ threshold. But this is a terrible idea: each test has a $0.05$ chance of giving a false positive, so even if all the drugs are actually useless, we would expect to incorrectly conclude that many of them are useful.
In general, suppose we have null hypotheses $H_1, \ldots, H_m$. By definition, a $p$-value $p_i$ for $H_i$ is a random variable such that
\[
  \mathbb{P}_{H_i}(p_i \leq \alpha) \leq \alpha \quad \text{for all } \alpha \in [0, 1].
\]
Let $I_0 \subseteq \{1, \ldots, m\}$ be the set of true null hypotheses, and let $m_0 = |I_0|$ be the number of true null hypotheses. Given a procedure for rejecting hypotheses (a multiple testing procedure), we let $N$ be the number of false rejections (false positives), and $R$ the total number of rejections. One can also think about the number of false negatives, but we shall not do that here.
Traditionally, multiple-testing procedures sought to control the family-wise error rate (FWER), defined by $\mathbb{P}(N \geq 1)$. The simplest way to control this is the Bonferroni correction, which rejects $H_i$ if $p_i \leq \frac{\alpha}{m}$. Usually we might have $\alpha = 0.05$, so this threshold is very small if we have lots of hypotheses (e.g. a million). Unsurprisingly, we have
Theorem. When using the Bonferroni correction, we have
\[
  \mathrm{FWER} \leq \mathbb{E}(N) \leq \frac{m_0 \alpha}{m} \leq \alpha.
\]
Proof. The first inequality is Markov's inequality, and the last is obvious. The second follows from
\[
  \mathbb{E}(N) = \mathbb{E}\left( \sum_{i \in I_0} \mathbf{1}_{p_i \leq \alpha/m} \right) = \sum_{i \in I_0} \mathbb{P}\left( p_i \leq \frac{\alpha}{m} \right) \leq \frac{m_0 \alpha}{m}.
\]
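To make this concrete, here is a minimal sketch of the Bonferroni correction in Python; the function name and the use of NumPy are choices of this sketch, not part of the notes.

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject H_i whenever p_i <= alpha / m, controlling FWER at level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    return p <= alpha / m

# With m = 4 hypotheses, only p-values below 0.05 / 4 = 0.0125 are rejected.
print(bonferroni([0.001, 0.02, 0.03, 0.6]))  # [ True False False False]
```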
The Bonferroni correction is a rather conservative procedure, since all these inequalities can be quite loose. When we have a large number of hypotheses, the criterion for rejection is very strict. Can we do better?
A more sophisticated approach is the closed testing procedure. For each non-empty subset $I \subseteq \{1, \ldots, m\}$, we let $H_I$ be the null hypothesis that $H_i$ is true for all $i \in I$. This is known as an intersection hypothesis. Suppose for each non-empty $I \subseteq \{1, \ldots, m\}$ we have an $\alpha$-level test $\phi_I$ of $H_I$ (a local test), so that
\[
  \mathbb{P}_{H_I}(\phi_I = 1) \leq \alpha.
\]
Here $\phi_I$ takes values in $\{0, 1\}$, and $\phi_I = 1$ means rejection. The closed testing procedure then rejects $H_I$ iff for all $J \supseteq I$, we have $\phi_J = 1$.
Example. Consider the following collection of local tests for $m = 4$ hypotheses, where the hypotheses rejected by the local tests are marked in red:

[Diagram: the lattice of intersection hypotheses $H_{1234}$; $H_{123}$, $H_{124}$, $H_{134}$, $H_{234}$; $H_{12}$, $H_{13}$, $H_{14}$, $H_{23}$, $H_{24}$, $H_{34}$; $H_1$, $H_2$, $H_3$, $H_4$, with the locally rejected hypotheses marked in red.]

In this case, we reject $H_1$ but not $H_2$ by closed testing. While $H_{23}$ is rejected, we cannot tell if it is $H_2$ or $H_3$ that should be rejected.
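As an illustration, here is a sketch of the closed testing procedure in Python. The local tests are supplied as a user-defined function `local_test(J, p, alpha)` returning whether $\phi_J = 1$; the names and structure here are illustrative, not from the notes.

```python
from itertools import combinations

def closed_testing(p, local_test, alpha=0.05):
    """Reject H_i iff every intersection hypothesis H_J with i in J is
    rejected by its local test phi_J."""
    m = len(p)
    all_subsets = [J for size in range(1, m + 1)
                   for J in combinations(range(m), size)]
    # Which intersection hypotheses do the local tests reject?
    locally_rejected = {J for J in all_subsets if local_test(J, p, alpha)}
    # H_i is rejected by closed testing iff all J containing i are locally rejected.
    return [all(J in locally_rejected for J in all_subsets if i in J)
            for i in range(m)]

# With the Bonferroni local test (phi_J = 1 iff p_i <= alpha/|J| for some i in J),
# this recovers Holm's procedure, discussed below.
bonferroni_local = lambda J, p, alpha: any(p[i] <= alpha / len(J) for i in J)
print(closed_testing([0.001, 0.04, 0.03, 0.6], bonferroni_local))
# [True, False, False, False]
```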
This might seem like a very difficult procedure to analyze, but it turns out it
is extremely simple.
Theorem. Closed testing makes no false rejections with probability at least $1 - \alpha$. In particular, $\mathrm{FWER} \leq \alpha$.
Proof. If there is a false rejection, say of $H_i$ for some $i \in I_0$, then in particular $\phi_{I_0} = 1$, i.e. we have falsely rejected the (true) intersection hypothesis $H_{I_0}$ with its local test, and this happens with probability at most $\alpha$.
But of course this doesn’t immediately give us an algorithm we can apply to
data. Different choices for the local test give rise to different multiple testing
procedures. One famous example is Holm's procedure, which takes $\phi_I$ to be the Bonferroni test, where $\phi_I = 1$ iff $p_i \leq \frac{\alpha}{|I|}$ for some $i \in I$.
When $m$ is large, we don't want to compute all the $\phi_I$, since there are $2^m - 1$ of them. So we might want to find a shortcut. With a moment of thought, we see that Holm's procedure amounts to the following:
Let $(i)$ be the index of the $i$th smallest $p$-value, so that $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}$.

Step 1: If $p_{(1)} \leq \frac{\alpha}{m}$, then reject $H_{(1)}$ and go to step 2. Otherwise, accept all null hypotheses.

Step $i$: If $p_{(i)} \leq \frac{\alpha}{m - i + 1}$, then reject $H_{(i)}$ and go to step $i + 1$. Otherwise, accept $H_{(i)}, H_{(i+1)}, \ldots, H_{(m)}$.

Step $m$: If $p_{(m)} \leq \alpha$, then reject $H_{(m)}$. Otherwise, accept $H_{(m)}$.
The interesting thing about this is that it has the same bound on FWER as the Bonferroni correction, but the conditions for rejection here are less strict, so Holm's procedure rejects at least as many hypotheses.
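A minimal Python sketch of this step-down shortcut (again, names are illustrative):

```python
import numpy as np

def holm(p_values, alpha=0.05):
    """Holm's step-down procedure: returns a boolean rejection array."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                # indices of the sorted p-values
    reject = np.zeros(m, dtype=bool)
    for step, i in enumerate(order):     # step = 0 corresponds to p_(1)
        if p[i] <= alpha / (m - step):   # threshold alpha / (m - i + 1) in 1-based indexing
            reject[i] = True             # reject H_(i) and continue
        else:
            break                        # accept H_(i), ..., H_(m) and stop
    return reject

# Holm rejects the two smallest p-values here; Bonferroni would reject only the first.
print(holm([0.001, 0.015, 0.03, 0.6]))  # [ True  True False False]
```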
But if $m$ is very large, then the criterion for rejecting $H_{(1)}$ is still quite strict, since the first threshold is still $\frac{\alpha}{m}$.
The problem is that controlling FWER is a very strong requirement. Instead of controlling the probability of making at least one false rejection, when $m$ is large it might be more reasonable to control the proportion of false discoveries. Many modern multiple testing procedures aim to control the false discovery rate
\[
  \mathrm{FDR} = \mathbb{E}\left[ \frac{N}{\max\{R, 1\}} \right].
\]
The funny maximum in the denominator is just to avoid division by zero: when $R = 0$, we must have $N = 0$ as well, so what is put in the denominator doesn't really matter.
The Benjamini–Hochberg procedure attempts to control the FDR at level $\alpha$ and works as follows. Let
\[
  \hat{k} = \max\left\{ i : p_{(i)} \leq \frac{\alpha i}{m} \right\}.
\]
Then reject $H_{(1)}, \ldots, H_{(\hat{k})}$, or accept all hypotheses if $\hat{k}$ is not defined.
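A minimal Python sketch of the Benjamini–Hochberg procedure, under the same illustrative conventions as before:

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject H_(1), ..., H_(k_hat), where k_hat = max{ i : p_(i) <= alpha * i / m }."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = np.nonzero(p[order] <= alpha * np.arange(1, m + 1) / m)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size > 0:            # otherwise k_hat is undefined: accept everything
        k_hat = below[-1] + 1     # number of hypotheses rejected
        reject[order[:k_hat]] = True
    return reject

# All four are rejected, since the largest p-value satisfies p_(4) = 0.04 <= 0.05 * 4/4.
print(benjamini_hochberg([0.001, 0.015, 0.03, 0.04]))  # [ True  True  True  True]
```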
Under certain conditions, this does control the false discovery rate.
Theorem. Suppose that for each $i \in I_0$, $p_i$ is independent of $\{p_j : j \neq i\}$. Then using the Benjamini–Hochberg procedure, the false discovery rate satisfies
\[
  \mathrm{FDR} = \mathbb{E}\left[\frac{N}{\max(R, 1)}\right] \leq \frac{\alpha m_0}{m} \leq \alpha.
\]
Curiously, while the proof requires each $p_i$ to be independent of the others, in simulations, even when there is no hope that the $p_i$ are independent, the Benjamini–Hochberg procedure still appears to work very well, and people are still trying to understand what it is that makes it work so well in general.
Proof. The false discovery rate is
\begin{align*}
  \mathbb{E}\left[\frac{N}{\max(R, 1)}\right]
  &= \sum_{r=1}^{m} \mathbb{E}\left[ \frac{N}{r}\, \mathbf{1}_{R = r} \right]
   = \sum_{r=1}^{m} \frac{1}{r}\, \mathbb{E}\left[ \sum_{i \in I_0} \mathbf{1}_{p_i \leq \alpha r/m}\, \mathbf{1}_{R = r} \right]\\
  &= \sum_{i \in I_0} \sum_{r=1}^{m} \frac{1}{r}\, \mathbb{P}\left( p_i \leq \frac{\alpha r}{m},\ R = r \right),
\end{align*}
using that on the event $\{R = r\}$, the hypotheses rejected are exactly those with $p_i \leq \frac{\alpha r}{m}$.
The brilliant idea is, for each $i \in I_0$, to let $R_i$ be the number of rejections when applying a modified Benjamini–Hochberg procedure to
\[
  p^{\setminus i} = \{p_1, \ldots, p_m\} \setminus \{p_i\}
\]
with cutoff
\[
  \hat{k}_i = \max\left\{ j : p^{\setminus i}_{(j)} \leq \frac{\alpha(j + 1)}{m} \right\}.
\]
We observe that for $i \in I_0$ and $r \geq 1$, we have
\begin{align*}
  \left\{ p_i \leq \frac{\alpha r}{m},\ R = r \right\}
  &= \left\{ p_i \leq \frac{\alpha r}{m},\ p_{(r)} \leq \frac{\alpha r}{m},\ p_{(s)} > \frac{\alpha s}{m} \text{ for all } s > r \right\}\\
  &= \left\{ p_i \leq \frac{\alpha r}{m},\ p^{\setminus i}_{(r-1)} \leq \frac{\alpha r}{m},\ p^{\setminus i}_{(s-1)} > \frac{\alpha s}{m} \text{ for all } s > r \right\}\\
  &= \left\{ p_i \leq \frac{\alpha r}{m},\ R_i = r - 1 \right\}.
\end{align*}
The key point is that the event $\{R_i = r - 1\}$ depends only on the other $p$-values, which are independent of $p_i$ by assumption. So the FDR is equal to
\begin{align*}
  \mathrm{FDR}
  &= \sum_{i \in I_0} \sum_{r=1}^{m} \frac{1}{r}\, \mathbb{P}\left( p_i \leq \frac{\alpha r}{m},\ R_i = r - 1 \right)\\
  &= \sum_{i \in I_0} \sum_{r=1}^{m} \frac{1}{r}\, \mathbb{P}\left( p_i \leq \frac{\alpha r}{m} \right) \mathbb{P}(R_i = r - 1).
\end{align*}
Using that $\mathbb{P}\left(p_i \leq \frac{\alpha r}{m}\right) \leq \frac{\alpha r}{m}$ by the definition of a $p$-value, this is
\[
  \leq \frac{\alpha}{m} \sum_{i \in I_0} \sum_{r=1}^{m} \mathbb{P}(R_i = r - 1)
   = \frac{\alpha}{m} \sum_{i \in I_0} \mathbb{P}(R_i \in \{0, \ldots, m - 1\})
   = \frac{\alpha m_0}{m}.
\]
This is one of the most used procedures in modern statistics, especially in
the biological sciences.