5.2 Inference in high-dimensional regression
We now more-or-less have an answer as to how to do hypothesis testing, given that we know how to obtain the relevant $p$-values. But how do we obtain these $p$-values in the first place?
For example, we might be trying to do regression, and want to figure out which coefficients are non-zero. Consider the normal linear model $Y = X\beta^0 + \varepsilon$, where $\varepsilon \sim N_n(0, \sigma^2 I)$. In the low-dimensional setting, we have
\[
  \sqrt{n}\bigl(\hat{\beta}^{\mathrm{OLS}} - \beta^0\bigr) \sim N_p\left(0, \sigma^2 \left(\tfrac{1}{n} X^T X\right)^{-1}\right).
\]
Since this does not depend on $\beta^0$, we can use it to form confidence intervals and hypothesis tests.
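For concreteness, here is a minimal numpy sketch of the resulting confidence interval for a single coordinate in the low-dimensional setting (the inputs and names are illustrative only, and $\sigma$ is treated as known, whereas in practice it would be estimated from the residuals):
\begin{verbatim}
import numpy as np
from scipy import stats

def ols_confidence_interval(X, y, j, sigma, alpha=0.05):
    """(1 - alpha)-level CI for beta^0_j under the normal linear model."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # OLS estimate
    # var(beta_hat) = sigma^2 (X^T X)^{-1}; we only need its (j, j) entry
    var_j = sigma ** 2 * np.linalg.inv(X.T @ X)[j, j]
    z = stats.norm.ppf(1 - alpha / 2)                  # upper alpha/2 point of N(0, 1)
    return beta_hat[j] - z * np.sqrt(var_j), beta_hat[j] + z * np.sqrt(var_j)
\end{verbatim}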
However, if we have more coefficients than there are data points, then we
can’t do ordinary least squares. So we need to look for something else. For
example, we might want to replace the OLS estimate with the Lasso estimate.
However, $\sqrt{n}\bigl(\hat{\beta}^L_\lambda - \beta^0\bigr)$ has an intractable distribution. In particular, since $\hat{\beta}^L_\lambda$ has a bias, the distribution will depend on $\beta^0$ in a complicated way.
The recently introduced debiased Lasso tries to overcome these issues; see van de Geer, Bühlmann, Ritov and Dezeure (2014) for more details. Let $\hat{\beta}$ be the Lasso solution at $\lambda > 0$. Recall the KKT conditions, which say that $\hat{\nu}$ defined by
\[
  \frac{1}{n} X^T (Y - X\hat{\beta}) = \lambda \hat{\nu}
\]
satisfies $\|\hat{\nu}\|_\infty \le 1$ and $\hat{\nu}_{\hat{S}} = \mathrm{sgn}(\hat{\beta}_{\hat{S}})$, where $\hat{S} = \{k : \hat{\beta}_k \ne 0\}$.
We set $\hat{\Sigma} = \frac{1}{n} X^T X$. Then we can rewrite the KKT conditions as
\[
  \hat{\Sigma}(\hat{\beta} - \beta^0) + \lambda \hat{\nu} = \frac{1}{n} X^T \varepsilon.
\]
What we are trying to understand is $\hat{\beta} - \beta^0$. So it would be nice if we could find some sort of inverse to $\hat{\Sigma}$. If so, then $\hat{\beta} - \beta^0$ plus a correction term involving $\hat{\nu}$ would be equal to a Gaussian.
Of course, the problem is that in the high-dimensional setting, $\hat{\Sigma}$ has no hope of being invertible. So we want to find some approximate inverse $\hat{\Theta}$ so that the error we make is not too large. If we are equipped with such a $\hat{\Theta}$, then we have
\[
  \sqrt{n}\bigl(\hat{\beta} + \lambda \hat{\Theta}\hat{\nu} - \beta^0\bigr) = \frac{1}{\sqrt{n}} \hat{\Theta} X^T \varepsilon + \Delta,
\]
where
\[
  \Delta = \sqrt{n}\bigl(\hat{\Theta}\hat{\Sigma} - I\bigr)\bigl(\beta^0 - \hat{\beta}\bigr).
\]
We hope we can choose $\hat{\Theta}$ so that $\Delta$ is small. We can then use the quantity
\[
  \hat{b} = \hat{\beta} + \lambda \hat{\Theta}\hat{\nu} = \hat{\beta} + \frac{1}{n} \hat{\Theta} X^T (Y - X\hat{\beta})
\]
as our modified estimator, called the debiased Lasso.
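As a sketch of how one might compute this in practice (using scikit-learn's Lasso for $\hat{\beta}$; the approximate inverse Theta_hat is taken as given here, and one construction of it is sketched below):
\begin{verbatim}
import numpy as np
from sklearn.linear_model import Lasso

def debiased_lasso(X, y, Theta_hat, lam):
    """Debiased Lasso: b_hat = beta_hat + Theta_hat X^T (y - X beta_hat) / n."""
    n = X.shape[0]
    # sklearn's Lasso minimises ||y - X beta||_2^2 / (2n) + alpha * ||beta||_1,
    # matching the notes' Lasso objective with alpha = lam
    beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    correction = Theta_hat @ X.T @ (y - X @ beta_hat) / n
    return beta_hat + correction
\end{verbatim}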
How do we bound $\Delta$? We already know that (under compatibility and sparsity conditions) we can make $\|\beta^0 - \hat{\beta}\|_1$ small with high probability. So if the $\ell_\infty$ norm of each of the rows of $\hat{\Theta}\hat{\Sigma} - I$ is small, then Hölder allows us to bound $\Delta$.
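Explicitly, writing $(\hat{\Theta}\hat{\Sigma} - I)_j$ for the $j$th row, Hölder's inequality gives, coordinate-wise,
\[
  |\Delta_j| = \sqrt{n}\,\bigl|(\hat{\Theta}\hat{\Sigma} - I)_j (\beta^0 - \hat{\beta})\bigr| \le \sqrt{n}\, \|(\hat{\Theta}\hat{\Sigma} - I)_j\|_\infty \|\beta^0 - \hat{\beta}\|_1.
\]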
Write $\hat{\theta}_j$ for the $j$th row of $\hat{\Theta}$. Then $\|(\hat{\Sigma}\hat{\Theta}^T)_j - I_j\|_\infty \le \eta$ is equivalent to
\[
  |(\hat{\Sigma}\hat{\Theta}^T)_{kj}| \le \eta \text{ for } k \ne j \quad\text{and}\quad |(\hat{\Sigma}\hat{\Theta}^T)_{jj} - 1| \le \eta.
\]
Using the definition of $\hat{\Sigma}$, these are equivalent to
\[
  \frac{1}{n} |X_k^T X \hat{\theta}_j| \le \eta \text{ for } k \ne j, \qquad \left|\frac{1}{n} X_j^T X \hat{\theta}_j - 1\right| \le \eta.
\]
The first is the same as saying
\[
  \frac{1}{n} \|X_{-j}^T X \hat{\theta}_j\|_\infty \le \eta.
\]
This is quite reminiscent of the KKT conditions for the Lasso. So let us define
\[
  \hat{\gamma}^{(j)} = \operatorname*{argmin}_{\gamma \in \mathbb{R}^{p-1}} \left\{ \frac{1}{2n} \|X_j - X_{-j}\gamma\|_2^2 + \lambda_j \|\gamma\|_1 \right\},
\]
\[
  \hat{\tau}_j^2 = \frac{1}{n} X_j^T \bigl(X_j - X_{-j}\hat{\gamma}^{(j)}\bigr) = \frac{1}{n}\|X_j - X_{-j}\hat{\gamma}^{(j)}\|_2^2 + \lambda_j \|\hat{\gamma}^{(j)}\|_1.
\]
The second equality is an exercise on the example sheet.
We can then set
\[
  \hat{\theta}_j = -\frac{1}{\hat{\tau}_j^2}\bigl(\hat{\gamma}^{(j)}_1, \ldots, \hat{\gamma}^{(j)}_{j-1}, -1, \hat{\gamma}^{(j)}_j, \ldots, \hat{\gamma}^{(j)}_{p-1}\bigr)^T.
\]
The factor $\hat{\tau}_j^{-2}$ is there so that the second inequality holds.
Then by construction, we have
\[
  X\hat{\theta}_j = \frac{X_j - X_{-j}\hat{\gamma}^{(j)}}{X_j^T\bigl(X_j - X_{-j}\hat{\gamma}^{(j)}\bigr)/n}.
\]
Since the denominator is exactly $\hat{\tau}_j^2$, we have $\frac{1}{n} X_j^T X \hat{\theta}_j = 1$, and by the KKT conditions for the Lasso regression of $X_j$ on $X_{-j}$, we have
\[
  \frac{\hat{\tau}_j^2}{n} \|X_{-j}^T X \hat{\theta}_j\|_\infty \le \lambda_j.
\]
Thus, with the choice of $\hat{\Theta}$ above, we have
\[
  \|\Delta\|_\infty \le \sqrt{n}\, \|\hat{\beta} - \beta^0\|_1 \max_j \frac{\lambda_j}{\hat{\tau}_j^2}.
\]
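Before moving on, here is a minimal sketch of the nodewise construction of $\hat{\Theta}$ above, again via scikit-learn's Lasso; the per-coordinate penalties lams are placeholder inputs:
\begin{verbatim}
import numpy as np
from sklearn.linear_model import Lasso

def nodewise_theta(X, lams):
    """Build the approximate inverse Theta_hat row by row via the nodewise Lasso."""
    n, p = X.shape
    Theta_hat = np.zeros((p, p))
    for j in range(p):
        X_minus_j = np.delete(X, j, axis=1)
        # regress X_j on X_{-j} with penalty lams[j] to get gamma^(j)
        gamma_j = Lasso(alpha=lams[j], fit_intercept=False).fit(X_minus_j, X[:, j]).coef_
        residual = X[:, j] - X_minus_j @ gamma_j
        tau_sq = X[:, j] @ residual / n      # tau_j^2 = X_j^T (X_j - X_{-j} gamma^(j)) / n
        row = np.insert(-gamma_j, j, 1.0)    # -gamma^(j), with +1 inserted in position j
        Theta_hat[j] = row / tau_sq          # theta_j, so that X theta_j = residual / tau_j^2
    return Theta_hat
\end{verbatim}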
Now this is good as long as we can ensure that $\lambda_j/\hat{\tau}_j^2$ is small. When is this true? We can consider a random design setting, where each row of $X$ is iid $N_p(0, \Sigma)$ for some positive-definite $\Sigma$. Write $\Omega = \Sigma^{-1}$.
Then by our study of the neighbourhood selection procedure, we know that for each $j$, we can write
\[
  X_j = X_{-j}\gamma^{(j)} + \varepsilon^{(j)},
\]
where the $\varepsilon^{(j)}_i \mid X_{-j} \sim N(0, \Omega_{jj}^{-1})$ are iid and $\gamma^{(j)} = -\Omega_{jj}^{-1}\Omega_{-j,j}$. To apply our results, we need to ensure that the $\gamma^{(j)}$ are sparse. Let us therefore define
\[
  s_j = \sum_{k \ne j} \mathbf{1}_{\{\Omega_{jk} \ne 0\}},
\]
and set $s_{\max} = \max(\max_j s_j, s)$.
Theorem. Suppose the minimum eigenvalue of $\Sigma$ is always at least $c_{\min} > 0$, and $\max_j \Sigma_{jj} \le 1$. Suppose further that $s_{\max}\sqrt{\log(p)/n} \to 0$. Then there exist constants $A_1, A_2$ such that, setting $\lambda = \lambda_j = A_1 \sqrt{\log(p)/n}$, we have
\[
  \sqrt{n}(\hat{b} - \beta^0) = W + \Delta, \qquad W \mid X \sim N_p\bigl(0, \sigma^2 \hat{\Theta}\hat{\Sigma}\hat{\Theta}^T\bigr),
\]
and as $n, p \to \infty$,
\[
  \mathbb{P}\left( \|\Delta\|_\infty > A_2 s \frac{\log(p)}{\sqrt{n}} \right) \to 0.
\]
Note that here X is not centered and scaled.
We see that in particular, $\sqrt{n}(\hat{b}_j - \beta^0_j)$ is approximately distributed as $N\bigl(0, \sigma^2 (\hat{\Theta}\hat{\Sigma}\hat{\Theta}^T)_{jj}\bigr)$. In fact, one can show that
\[
  d_j := (\hat{\Theta}\hat{\Sigma}\hat{\Theta}^T)_{jj} = \frac{\frac{1}{n}\|X_j - X_{-j}\hat{\gamma}^{(j)}\|_2^2}{\hat{\tau}_j^4}.
\]
This suggests an approximate $(1 - \alpha)$-level confidence interval for $\beta^0_j$,
\[
  \mathrm{CI} = \left( \hat{b}_j - Z_{\alpha/2}\,\sigma\sqrt{d_j/n},\; \hat{b}_j + Z_{\alpha/2}\,\sigma\sqrt{d_j/n} \right),
\]
where $Z_\alpha$ is the upper $\alpha$ point of $N(0,1)$. Note that here we are getting confidence intervals of width of order $\sqrt{1/n}$. In particular, there is no $\log p$ dependence if we are only trying to estimate $\beta^0_j$.
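Putting the pieces together for a single coordinate $j$, a rough sketch of this interval (with $\sigma$ treated as known, as in the statement below; in practice it would have to be estimated):
\begin{verbatim}
import numpy as np
from scipy import stats

def debiased_lasso_ci(b_hat_j, X, j, gamma_j, tau_sq_j, sigma, alpha=0.05):
    """Approximate (1 - alpha)-level CI for beta^0_j from the debiased Lasso.

    gamma_j and tau_sq_j are the nodewise quantities gamma^(j) and tau_j^2."""
    n = X.shape[0]
    residual = X[:, j] - np.delete(X, j, axis=1) @ gamma_j
    d_j = (residual @ residual / n) / tau_sq_j ** 2   # (||X_j - X_{-j}gamma^(j)||^2 / n) / tau_j^4
    z = stats.norm.ppf(1 - alpha / 2)                 # upper alpha/2 point of N(0, 1)
    half_width = z * sigma * np.sqrt(d_j / n)
    return b_hat_j - half_width, b_hat_j + half_width
\end{verbatim}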
Proof. Consider the sequence of events $\Lambda_n$ defined by the following properties:
– $\phi^2_{\hat{\Sigma}, s} \ge c_{\min}/2$ and $\phi^2_{\hat{\Sigma}_{-j,-j}, s_j} \ge c_{\min}/2$ for all $j$;
– $\frac{2}{n}\|X^T \varepsilon\|_\infty \le \lambda$ and $\frac{2}{n}\|X_{-j}^T \varepsilon^{(j)}\|_\infty \le \lambda$ for all $j$;
– $\frac{1}{n}\|\varepsilon^{(j)}\|_2^2 \ge \Omega_{jj}^{-1}\bigl(1 - 4\sqrt{(\log p)/n}\bigr)$ for all $j$.
Question 13 on example sheet 4 shows that $\mathbb{P}(\Lambda_n) \to 1$ for $A_1$ sufficiently large. So we will work on the event $\Lambda_n$.
By our results on the Lasso, we know
\[
  \|\beta^0 - \hat{\beta}\|_1 \le c_1 s \sqrt{\log p/n}
\]
for some constant $c_1$.
We now seek a lower bound for $\hat{\tau}_j^2$. Consider the linear models
\[
  X_j = X_{-j}\gamma^{(j)} + \varepsilon^{(j)},
\]
where the sparsity of $\gamma^{(j)}$ is $s_j$, and $\varepsilon^{(j)}_i \mid X_{-j} \sim N(0, \Omega_{jj}^{-1})$. Note that
\[
  \Omega_{jj}^{-1} = \operatorname{var}(X_{ij} \mid X_{i,-j}) \le \operatorname{var}(X_{ij}) = \Sigma_{jj} \le 1.
\]
Also, the maximum eigenvalue of $\Omega$ is at most $c_{\min}^{-1}$, so $\Omega_{jj} \le c_{\min}^{-1}$, and hence $\Omega_{jj}^{-1} \ge c_{\min}$. So by Lasso theory, we know
\[
  \|\gamma^{(j)} - \hat{\gamma}^{(j)}\|_1 \le c_2 s_j \sqrt{\frac{\log p}{n}}
\]
for some constant $c_2$. Then we have
\begin{align*}
  \hat{\tau}_j^2 &= \frac{1}{n}\|X_j - X_{-j}\hat{\gamma}^{(j)}\|_2^2 + \lambda\|\hat{\gamma}^{(j)}\|_1 \\
  &\ge \frac{1}{n}\|\varepsilon^{(j)} + X_{-j}(\gamma^{(j)} - \hat{\gamma}^{(j)})\|_2^2 \\
  &\ge \frac{1}{n}\|\varepsilon^{(j)}\|_2^2 - \frac{2}{n}\|X_{-j}^T\varepsilon^{(j)}\|_\infty\,\|\gamma^{(j)} - \hat{\gamma}^{(j)}\|_1 \\
  &\ge \Omega_{jj}^{-1}\left(1 - 4\sqrt{\frac{\log p}{n}}\right) - c_2 s_j \sqrt{\frac{\log p}{n}} \cdot A_1 \sqrt{\frac{\log p}{n}}.
\end{align*}
In the limit, this tends to $\Omega_{jj}^{-1}$, since $s_j\sqrt{(\log p)/n} \to 0$. So for large $n$, this is at least $\frac{1}{2}\Omega_{jj}^{-1} \ge \frac{1}{2}c_{\min}$.
Thus, we have
\[
  \|\Delta\|_\infty \le 2\lambda\sqrt{n}\, c_1 s \sqrt{\frac{\log p}{n}}\, c_{\min}^{-1} = A_2 s \frac{\log p}{\sqrt{n}}.
\]