5.2 Inference in high-dimensional regression
We now more-or-less have an answer as to how to do hypothesis testing, given that we know how to obtain the relevant $p$-values. But how do we obtain these $p$-values in the first place?
For example, we might be trying to do regression, and want to figure out which coefficients are non-zero. Consider the normal linear model $Y = X\beta^0 + \varepsilon$, where $\varepsilon \sim N_n(0, \sigma^2 I)$. In the low-dimensional setting, we have
\[
  \sqrt{n}\bigl(\hat{\beta}^{\mathrm{OLS}} - \beta^0\bigr) \sim N_p\left(0, \sigma^2 \left(\tfrac{1}{n} X^T X\right)^{-1}\right).
\]
Since this does not depend on $\beta^0$, we can use it to form confidence intervals and hypothesis tests.
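For concreteness, here is a minimal numpy sketch of the resulting confidence interval for a single coordinate in the low-dimensional setting (the inputs and names are illustrative only, and $\sigma$ is treated as known, whereas in practice it would be estimated from the residuals):
\begin{verbatim}
import numpy as np
from scipy import stats

def ols_confidence_interval(X, y, j, sigma, alpha=0.05):
    """(1 - alpha)-level CI for beta^0_j under the normal linear model."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # OLS estimate
    # var(beta_hat) = sigma^2 (X^T X)^{-1}; we only need its (j, j) entry
    var_j = sigma ** 2 * np.linalg.inv(X.T @ X)[j, j]
    z = stats.norm.ppf(1 - alpha / 2)                  # upper alpha/2 point of N(0, 1)
    return beta_hat[j] - z * np.sqrt(var_j), beta_hat[j] + z * np.sqrt(var_j)
\end{verbatim}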
However, if we have more coefficients than there are data points, then we
can’t do ordinary least squares. So we need to look for something else. For
example, we might want to replace the OLS estimate with the Lasso estimate.
However, $\sqrt{n}\bigl(\hat{\beta}^L_\lambda - \beta^0\bigr)$ has an intractable distribution. In particular, since $\hat{\beta}^L_\lambda$ has a bias, the distribution will depend on $\beta^0$ in a complicated way.
The recently introduced debiased Lasso tries to overcome these issues; see van de Geer, Bühlmann, Ritov and Dezeure (2014) for more details. Let $\hat{\beta}$ be the Lasso solution at $\lambda > 0$. Recall the KKT conditions, which say that $\hat{\nu}$ defined by
\[
  \frac{1}{n} X^T (Y - X\hat{\beta}) = \lambda \hat{\nu}
\]
satisfies $\|\hat{\nu}\|_\infty \le 1$ and $\hat{\nu}_{\hat{S}} = \mathrm{sgn}(\hat{\beta}_{\hat{S}})$, where $\hat{S} = \{k : \hat{\beta}_k \ne 0\}$.
We set $\hat{\Sigma} = \frac{1}{n} X^T X$. Then we can rewrite the KKT conditions as
\[
  \hat{\Sigma}(\hat{\beta} - \beta^0) + \lambda \hat{\nu} = \frac{1}{n} X^T \varepsilon.
\]
What we are trying to understand is $\hat{\beta} - \beta^0$. So it would be nice if we could find some sort of inverse to $\hat{\Sigma}$. If so, then $\hat{\beta} - \beta^0$ plus a correction term involving $\hat{\nu}$ would be equal to a Gaussian.
Of course, the problem is that in the high-dimensional setting, $\hat{\Sigma}$ has no hope of being invertible. So we want to find some approximate inverse $\hat{\Theta}$ so that the error we make is not too large. If we are equipped with such a $\hat{\Theta}$, then we have
\[
  \sqrt{n}\bigl(\hat{\beta} + \lambda \hat{\Theta}\hat{\nu} - \beta^0\bigr) = \frac{1}{\sqrt{n}} \hat{\Theta} X^T \varepsilon + \Delta,
\]
where
\[
  \Delta = \sqrt{n}\bigl(\hat{\Theta}\hat{\Sigma} - I\bigr)\bigl(\beta^0 - \hat{\beta}\bigr).
\]
We hope we can choose $\hat{\Theta}$ so that $\Delta$ is small. We can then use the quantity
\[
  \hat{b} = \hat{\beta} + \lambda \hat{\Theta}\hat{\nu} = \hat{\beta} + \frac{1}{n} \hat{\Theta} X^T (Y - X\hat{\beta})
\]
as our modified estimator, called the debiased Lasso.
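As a sketch of how one might compute this in practice (using scikit-learn's Lasso for $\hat{\beta}$; the approximate inverse Theta_hat is taken as given here, and one construction of it is sketched below):
\begin{verbatim}
import numpy as np
from sklearn.linear_model import Lasso

def debiased_lasso(X, y, Theta_hat, lam):
    """Debiased Lasso: b_hat = beta_hat + Theta_hat X^T (y - X beta_hat) / n."""
    n = X.shape[0]
    # sklearn's Lasso minimises ||y - X beta||_2^2 / (2n) + alpha * ||beta||_1,
    # matching the notes' Lasso objective with alpha = lam
    beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    correction = Theta_hat @ X.T @ (y - X @ beta_hat) / n
    return beta_hat + correction
\end{verbatim}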
How do we bound $\Delta$? We already know that (under compatibility and sparsity conditions) we can make $\|\beta^0 - \hat{\beta}\|_1$ small with high probability. So if the $\ell_\infty$ norm of each of the rows of $\hat{\Theta}\hat{\Sigma} - I$ is small, then Hölder allows us to bound $\Delta$.
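Explicitly, writing $(\hat{\Theta}\hat{\Sigma} - I)_j$ for the $j$th row, Hölder's inequality gives, coordinate-wise,
\[
  |\Delta_j| = \sqrt{n}\,\bigl|(\hat{\Theta}\hat{\Sigma} - I)_j (\beta^0 - \hat{\beta})\bigr| \le \sqrt{n}\, \|(\hat{\Theta}\hat{\Sigma} - I)_j\|_\infty \|\beta^0 - \hat{\beta}\|_1.
\]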
Write $\hat{\theta}_j$ for the $j$th row of $\hat{\Theta}$. Then $\|(\hat{\Sigma}\hat{\Theta}^T)_j - I_j\|_\infty \le \eta$ is equivalent to
\[
  |(\hat{\Sigma}\hat{\Theta}^T)_{kj}| \le \eta \text{ for } k \ne j \quad\text{and}\quad |(\hat{\Sigma}\hat{\Theta}^T)_{jj} - 1| \le \eta.
\]
Using the definition of $\hat{\Sigma}$, these are equivalent to
\[
  \frac{1}{n} |X_k^T X \hat{\theta}_j| \le \eta \text{ for } k \ne j, \qquad \left|\frac{1}{n} X_j^T X \hat{\theta}_j - 1\right| \le \eta.
\]
The first is the same as saying
\[
  \frac{1}{n} \|X_{-j}^T X \hat{\theta}_j\|_\infty \le \eta.
\]
This is quite reminiscent of the KKT conditions for the Lasso. So let us define
\[
  \hat{\gamma}^{(j)} = \operatorname*{argmin}_{\gamma \in \mathbb{R}^{p-1}} \left\{ \frac{1}{2n} \|X_j - X_{-j}\gamma\|_2^2 + \lambda_j \|\gamma\|_1 \right\},
\]
\[
  \hat{\tau}_j^2 = \frac{1}{n} X_j^T \bigl(X_j - X_{-j}\hat{\gamma}^{(j)}\bigr) = \frac{1}{n}\|X_j - X_{-j}\hat{\gamma}^{(j)}\|_2^2 + \lambda_j \|\hat{\gamma}^{(j)}\|_1.
\]
The second equality is an exercise on the example sheet.
We can then set
\[
  \hat{\theta}_j = -\frac{1}{\hat{\tau}_j^2}\bigl(\hat{\gamma}^{(j)}_1, \ldots, \hat{\gamma}^{(j)}_{j-1}, -1, \hat{\gamma}^{(j)}_j, \ldots, \hat{\gamma}^{(j)}_{p-1}\bigr)^T.
\]
The factor $\hat{\tau}_j^{-2}$ is there so that the second inequality holds.
Then by construction, we have
\[
  X\hat{\theta}_j = \frac{X_j - X_{-j}\hat{\gamma}^{(j)}}{X_j^T\bigl(X_j - X_{-j}\hat{\gamma}^{(j)}\bigr)/n}.
\]
Since the denominator is exactly $\hat{\tau}_j^2$, we have $\frac{1}{n} X_j^T X \hat{\theta}_j = 1$, and by the KKT conditions for the Lasso regression of $X_j$ on $X_{-j}$, we have
\[
  \frac{\hat{\tau}_j^2}{n} \|X_{-j}^T X \hat{\theta}_j\|_\infty \le \lambda_j.
\]
Thus, with the choice of $\hat{\Theta}$ above, we have
\[
  \|\Delta\|_\infty \le \sqrt{n}\, \|\hat{\beta} - \beta^0\|_1 \max_j \frac{\lambda_j}{\hat{\tau}_j^2}.
\]
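Before moving on, here is a minimal sketch of the nodewise construction of $\hat{\Theta}$ above, again via scikit-learn's Lasso; the per-coordinate penalties lams are placeholder inputs:
\begin{verbatim}
import numpy as np
from sklearn.linear_model import Lasso

def nodewise_theta(X, lams):
    """Build the approximate inverse Theta_hat row by row via the nodewise Lasso."""
    n, p = X.shape
    Theta_hat = np.zeros((p, p))
    for j in range(p):
        X_minus_j = np.delete(X, j, axis=1)
        # regress X_j on X_{-j} with penalty lams[j] to get gamma^(j)
        gamma_j = Lasso(alpha=lams[j], fit_intercept=False).fit(X_minus_j, X[:, j]).coef_
        residual = X[:, j] - X_minus_j @ gamma_j
        tau_sq = X[:, j] @ residual / n      # tau_j^2 = X_j^T (X_j - X_{-j} gamma^(j)) / n
        row = np.insert(-gamma_j, j, 1.0)    # -gamma^(j), with +1 inserted in position j
        Theta_hat[j] = row / tau_sq          # theta_j, so that X theta_j = residual / tau_j^2
    return Theta_hat
\end{verbatim}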
Now this is good as long as we can ensure that $\lambda_j/\hat{\tau}_j^2$ is small. When is this true? We can consider a random design setting, where each row of $X$ is iid $N_p(0, \Sigma)$ for some positive-definite $\Sigma$. Write $\Omega = \Sigma^{-1}$.
Then by our study of the neighbourhood selection procedure, we know that for each $j$, we can write
\[
  X_j = X_{-j}\gamma^{(j)} + \varepsilon^{(j)},
\]
where the $\varepsilon^{(j)}_i \mid X_{-j} \sim N(0, \Omega_{jj}^{-1})$ are iid and $\gamma^{(j)} = -\Omega_{jj}^{-1}\Omega_{-j,j}$. To apply our results, we need to ensure that the $\gamma^{(j)}$ are sparse. Let us therefore define
\[
  s_j = \sum_{k \ne j} \mathbf{1}_{\{\Omega_{jk} \ne 0\}},
\]
and set $s_{\max} = \max(\max_j s_j, s)$.
Theorem. Suppose the minimum eigenvalue of $\Sigma$ is always at least $c_{\min} > 0$, and $\max_j \Sigma_{jj} \le 1$. Suppose further that $s_{\max}\sqrt{\log(p)/n} \to 0$. Then there exist constants $A_1, A_2$ such that, setting $\lambda = \lambda_j = A_1 \sqrt{\log(p)/n}$, we have
\[
  \sqrt{n}(\hat{b} - \beta^0) = W + \Delta, \qquad W \mid X \sim N_p\bigl(0, \sigma^2 \hat{\Theta}\hat{\Sigma}\hat{\Theta}^T\bigr),
\]
and as $n, p \to \infty$,
\[
  \mathbb{P}\left( \|\Delta\|_\infty > A_2 s \frac{\log(p)}{\sqrt{n}} \right) \to 0.
\]
Note that here X is not centered and scaled.
We see that in particular, $\sqrt{n}(\hat{b}_j - \beta^0_j)$ is approximately distributed as $N\bigl(0, \sigma^2 (\hat{\Theta}\hat{\Sigma}\hat{\Theta}^T)_{jj}\bigr)$. In fact, one can show that
\[
  d_j := (\hat{\Theta}\hat{\Sigma}\hat{\Theta}^T)_{jj} = \frac{\frac{1}{n}\|X_j - X_{-j}\hat{\gamma}^{(j)}\|_2^2}{\hat{\tau}_j^4}.
\]
This suggests an approximate $(1 - \alpha)$-level confidence interval for $\beta^0_j$,
\[
  \mathrm{CI} = \left( \hat{b}_j - Z_{\alpha/2}\,\sigma\sqrt{d_j/n},\; \hat{b}_j + Z_{\alpha/2}\,\sigma\sqrt{d_j/n} \right),
\]
where $Z_\alpha$ is the upper $\alpha$ point of $N(0,1)$. Note that here we are getting confidence intervals of width of order $\sqrt{1/n}$. In particular, there is no $\log p$ dependence if we are only trying to estimate $\beta^0_j$.
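Putting the pieces together for a single coordinate $j$, a rough sketch of this interval (with $\sigma$ treated as known, as in the statement below; in practice it would have to be estimated):
\begin{verbatim}
import numpy as np
from scipy import stats

def debiased_lasso_ci(b_hat_j, X, j, gamma_j, tau_sq_j, sigma, alpha=0.05):
    """Approximate (1 - alpha)-level CI for beta^0_j from the debiased Lasso.

    gamma_j and tau_sq_j are the nodewise quantities gamma^(j) and tau_j^2."""
    n = X.shape[0]
    residual = X[:, j] - np.delete(X, j, axis=1) @ gamma_j
    d_j = (residual @ residual / n) / tau_sq_j ** 2   # (||X_j - X_{-j}gamma^(j)||^2 / n) / tau_j^4
    z = stats.norm.ppf(1 - alpha / 2)                 # upper alpha/2 point of N(0, 1)
    half_width = z * sigma * np.sqrt(d_j / n)
    return b_hat_j - half_width, b_hat_j + half_width
\end{verbatim}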
Proof. Consider the sequence of events $\Lambda_n$ defined by the following properties:
– $\phi^2_{\hat{\Sigma}, s} \ge c_{\min}/2$ and $\phi^2_{\hat{\Sigma}_{-j,-j}, s_j} \ge c_{\min}/2$ for all $j$;
– $\frac{2}{n}\|X^T \varepsilon\|_\infty \le \lambda$ and $\frac{2}{n}\|X_{-j}^T \varepsilon^{(j)}\|_\infty \le \lambda$ for all $j$;
– $\frac{1}{n}\|\varepsilon^{(j)}\|_2^2 \ge \Omega_{jj}^{-1}\bigl(1 - 4\sqrt{(\log p)/n}\bigr)$ for all $j$.
Question 13 on example sheet 4 shows that $\mathbb{P}(\Lambda_n) \to 1$ for $A_1$ sufficiently large. So we will work on the event $\Lambda_n$.
By our results on the Lasso, we know
\[
  \|\beta^0 - \hat{\beta}\|_1 \le c_1 s \sqrt{\log p/n}
\]
for some constant $c_1$.
We now seek a lower bound for $\hat{\tau}_j^2$. Consider the linear models
\[
  X_j = X_{-j}\gamma^{(j)} + \varepsilon^{(j)},
\]
where the sparsity of $\gamma^{(j)}$ is $s_j$, and $\varepsilon^{(j)}_i \mid X_{-j} \sim N(0, \Omega_{jj}^{-1})$. Note that
\[
  \Omega_{jj}^{-1} = \operatorname{var}(X_{ij} \mid X_{i,-j}) \le \operatorname{var}(X_{ij}) = \Sigma_{jj} \le 1.
\]
Also, the maximum eigenvalue of $\Omega$ is at most $c_{\min}^{-1}$, so $\Omega_{jj} \le c_{\min}^{-1}$, and hence $\Omega_{jj}^{-1} \ge c_{\min}$. So by Lasso theory, we know
\[
  \|\gamma^{(j)} - \hat{\gamma}^{(j)}\|_1 \le c_2 s_j \sqrt{\frac{\log p}{n}}
\]
for some constant $c_2$. Then we have
\begin{align*}
  \hat{\tau}_j^2 &= \frac{1}{n}\|X_j - X_{-j}\hat{\gamma}^{(j)}\|_2^2 + \lambda\|\hat{\gamma}^{(j)}\|_1 \\
  &\ge \frac{1}{n}\|\varepsilon^{(j)} + X_{-j}(\gamma^{(j)} - \hat{\gamma}^{(j)})\|_2^2 \\
  &\ge \frac{1}{n}\|\varepsilon^{(j)}\|_2^2 - \frac{2}{n}\|X_{-j}^T\varepsilon^{(j)}\|_\infty\,\|\gamma^{(j)} - \hat{\gamma}^{(j)}\|_1 \\
  &\ge \Omega_{jj}^{-1}\left(1 - 4\sqrt{\frac{\log p}{n}}\right) - c_2 s_j \sqrt{\frac{\log p}{n}} \cdot A_1 \sqrt{\frac{\log p}{n}}.
\end{align*}
In the limit, this tends to $\Omega_{jj}^{-1}$, since $s_j\sqrt{(\log p)/n} \to 0$. So for large $n$, this is at least $\frac{1}{2}\Omega_{jj}^{-1} \ge \frac{1}{2}c_{\min}$.
Thus, we have
\[
  \|\Delta\|_\infty \le 2\lambda\sqrt{n}\, c_1 s \sqrt{\frac{\log p}{n}}\, c_{\min}^{-1} = A_2 s \frac{\log p}{\sqrt{n}}.
\]