3.5 Variable selection
So we have seen that the Lasso picks out some “important” variables and discards
the rest. How well does it do the job?
For simplicity, consider a noiseless linear model
\[ Y = X\beta^0. \]
Our objective is to find the set
\[ S = \{k : \beta^0_k \neq 0\}. \]
We may wlog assume $S = \{1, \ldots, s\}$ and $N = \{1, \ldots, p\} \setminus S$ (as usual $X$ is $n \times p$). We further assume that $\operatorname{rk}(X_S) = s$.
In general, it is difficult to find $S$ even if we know $|S|$. Under certain conditions, the Lasso can do the job correctly. This condition is dictated by the $\ell_\infty$ norm of the quantity
\[ \Delta = X_N^T X_S (X_S^T X_S)^{-1} \operatorname{sgn}(\beta^0_S). \]
We can understand this a bit as follows: the $k$th entry of $\Delta$ is the dot product of $\operatorname{sgn}(\beta^0_S)$ with $(X_S^T X_S)^{-1} X_S^T X_k$, which is the coefficient vector we would obtain if we tried to regress $X_k$ on $X_S$. If this is large, then we expect $X_k$ to look correlated with $X_S$, and so it would be difficult to determine whether $k$ is part of $S$ or not.
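To make $\Delta$ concrete, here is a small numerical sketch (our own illustration, not part of the notes; the function name `irrepresentable_delta` and the all-ones sign pattern are assumptions for the example) that computes $\Delta$ for a random design and checks whether $\|\Delta\|_\infty \leq 1$:

```python
import numpy as np

def irrepresentable_delta(X, S, sgn_S):
    """Compute Delta = X_N^T X_S (X_S^T X_S)^{-1} sgn(beta^0_S)."""
    N = [k for k in range(X.shape[1]) if k not in set(S)]
    XS, XN = X[:, S], X[:, N]
    # Columns of coef are the regression coefficients of each X_k (k in N) on X_S.
    coef = np.linalg.solve(XS.T @ XS, XS.T @ XN)
    return coef.T @ sgn_S  # entry k is sgn(beta^0_S)^T (X_S^T X_S)^{-1} X_S^T X_k

rng = np.random.default_rng(0)
n, p, s = 200, 50, 5
X = rng.standard_normal((n, p))
delta = irrepresentable_delta(X, list(range(s)), np.ones(s))
print("||Delta||_inf =", np.max(np.abs(delta)))  # <= 1 means the condition holds
```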
Theorem.
(i) If $\|\Delta\|_\infty \leq 1$, or equivalently
\[ \max_{k \in N} |\operatorname{sgn}(\beta^0_S)^T (X_S^T X_S)^{-1} X_S^T X_k| \leq 1, \]
and moreover
\[ |\beta^0_k| > \lambda \left| \left[ \left( \tfrac{1}{n} X_S^T X_S \right)^{-1} \operatorname{sgn}(\beta^0_S) \right]_k \right| \]
for all $k \in S$, then there exists a Lasso solution $\hat\beta^L_\lambda$ with $\operatorname{sgn}(\hat\beta^L_\lambda) = \operatorname{sgn}(\beta^0)$.
(ii) If there exists a Lasso solution with $\operatorname{sgn}(\hat\beta^L_\lambda) = \operatorname{sgn}(\beta^0)$, then $\|\Delta\|_\infty \leq 1$.
We are going to make heavy use of the KKT conditions.
Proof. Write $\hat\beta = \hat\beta^L_\lambda$, and write $\hat S = \{k : \hat\beta_k \neq 0\}$. Then the KKT conditions are that
\[ \frac{1}{n} X^T X (\beta^0 - \hat\beta) = \lambda \hat\nu, \]
where $\|\hat\nu\|_\infty \leq 1$ and $\hat\nu_{\hat S} = \operatorname{sgn}(\hat\beta_{\hat S})$.
We expand this to say
\[ \frac{1}{n} \begin{pmatrix} X_S^T X_S & X_S^T X_N \\ X_N^T X_S & X_N^T X_N \end{pmatrix} \begin{pmatrix} \beta^0_S - \hat\beta_S \\ -\hat\beta_N \end{pmatrix} = \lambda \begin{pmatrix} \hat\nu_S \\ \hat\nu_N \end{pmatrix}. \]
Call the top and bottom equations (1) and (2) respectively.
It is easier to prove (ii) first. If there is such a solution, then $\hat\beta_N = 0$, and since $\operatorname{sgn}(\hat\beta) = \operatorname{sgn}(\beta^0)$, we have $\hat\nu_S = \operatorname{sgn}(\hat\beta_S) = \operatorname{sgn}(\beta^0_S)$. So from (1), we must have
\[ \frac{1}{n} X_S^T X_S (\beta^0_S - \hat\beta_S) = \lambda \operatorname{sgn}(\beta^0_S). \]
Inverting $\frac{1}{n} X_S^T X_S$, we learn that
\[ \beta^0_S - \hat\beta_S = \lambda \left( \frac{1}{n} X_S^T X_S \right)^{-1} \operatorname{sgn}(\beta^0_S). \]
Substitute this into (2) to get
\[ \lambda \cdot \frac{1}{n} X_N^T X_S \left( \frac{1}{n} X_S^T X_S \right)^{-1} \operatorname{sgn}(\beta^0_S) = \lambda \hat\nu_N. \]
By the KKT conditions, we know $\|\hat\nu_N\|_\infty \leq 1$, and the LHS is exactly $\lambda \Delta$.
To prove (i), we need to exhibit a $\hat\beta$ that agrees in sign with $\beta^0$ and satisfies the equations (1) and (2). In particular, we want $\hat\beta_N = 0$. So we try
\[ (\hat\beta_S, \hat\nu_S) = \left( \beta^0_S - \lambda \left( \frac{1}{n} X_S^T X_S \right)^{-1} \operatorname{sgn}(\beta^0_S),\ \operatorname{sgn}(\beta^0_S) \right), \qquad (\hat\beta_N, \hat\nu_N) = (0, \Delta). \]
This is by construction a solution of (1) and (2), and the first condition guarantees $\|\hat\nu_N\|_\infty = \|\Delta\|_\infty \leq 1$. We then only need to check that
\[ \operatorname{sgn}(\hat\beta_S) = \operatorname{sgn}(\beta^0_S), \]
which follows from the second condition.
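To see the theorem in action, one can run the Lasso on a synthetic noiseless design. The following sketch is our own illustration, assuming scikit-learn is available; its `Lasso(alpha=lam)` minimizes $\frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1$, i.e. $Q_\lambda$, so when the two conditions of the theorem hold we expect exact sign recovery:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, s = 200, 50, 5
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s] = [3.0, -2.0, 1.5, -1.0, 2.5]  # signal on S, large relative to lambda
Y = X @ beta0                             # noiseless model Y = X beta^0

lam = 0.05  # small enough that the |beta^0_k| condition is comfortably satisfied
betahat = Lasso(alpha=lam, fit_intercept=False, max_iter=100000).fit(X, Y).coef_
print("sign recovery:", np.array_equal(np.sign(betahat), np.sign(beta0)))
```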
Prediction and estimation
We now consider the question of how well the Lasso functions as a regression method. Consider the model
\[ Y = X\beta^0 + \varepsilon - \bar\varepsilon \mathbf{1}, \]
where the $\varepsilon_i$ are independent and have common sub-Gaussian parameter $\sigma$. Let $S$, $s$, $N$ be as before.
As before, the Lasso doesn’t always behave well, and whether or not it does
is controlled by the compatibility factor.
Definition (Compatibility factor). Define the compatibility factor to be
\[ \phi^2 = \inf_{\substack{\beta \in \mathbb{R}^p \\ \|\beta_N\|_1 \leq 3\|\beta_S\|_1 \\ \beta_S \neq 0}} \frac{\frac{1}{n}\|X\beta\|_2^2}{\frac{1}{s}\|\beta_S\|_1^2} = \inf_{\substack{\beta \in \mathbb{R}^p \\ \|\beta_S\|_1 = 1 \\ \|\beta_N\|_1 \leq 3}} \frac{s}{n} \|X_S \beta_S - X_N \beta_N\|_2^2. \]
Note that we are free to use a minus sign inside the norm since the problem is symmetric in $\beta_N \leftrightarrow -\beta_N$.
In some sense, this $\phi$ measures how well we can approximate $X_S \beta_S$ using just the noise variables: a small $\phi$ means $X_S \beta_S$ can be nearly reproduced by the columns in $N$.
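The infimum defining $\phi^2$ is over a cone and is not straightforward to compute exactly. As a rough illustration (our own sketch; the sampling scheme is an arbitrary choice), one can Monte Carlo over vectors satisfying the cone constraint to get an upper bound on $\phi^2$:

```python
import numpy as np

def phi2_upper_bound(X, S, trials=20000, seed=2):
    """Crude Monte Carlo upper bound on the compatibility factor phi^2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    s = len(S)
    N = [k for k in range(p) if k not in set(S)]
    best = np.inf
    for _ in range(trials):
        beta = np.zeros(p)
        beta[S] = rng.standard_normal(s)
        bN = rng.standard_normal(len(N))
        # Rescale so the cone constraint ||beta_N||_1 <= 3 ||beta_S||_1 holds.
        bN *= rng.uniform() * 3 * np.abs(beta[S]).sum() / np.abs(bN).sum()
        beta[N] = bN
        ratio = (np.sum((X @ beta) ** 2) / n) / (np.abs(beta[S]).sum() ** 2 / s)
        best = min(best, ratio)
    return best
```

Any feasible $\beta$ only gives an upper bound, so this can reveal when $\phi^2$ is small; certifying $\phi^2 > 0$ requires lower-bound arguments such as the eigenvalue bound below.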
Definition (Compatibility condition). The compatibility condition is $\phi^2 > 0$.
Note that if $\hat\Sigma = \frac{1}{n} X^T X$ has minimum eigenvalue $c_{\min} > 0$, then we have $\phi^2 \geq c_{\min}$. Indeed,
\[ \|\beta_S\|_1 = \operatorname{sgn}(\beta_S)^T \beta_S \leq \sqrt{s}\, \|\beta_S\|_2 \leq \sqrt{s}\, \|\beta\|_2, \]
and so
\[ \phi^2 \geq \inf_{\beta \neq 0} \frac{\frac{1}{n}\|X\beta\|_2^2}{\|\beta\|_2^2} = c_{\min}. \]
Of course, in the high-dimensional setting $p > n$ we cannot expect the minimum eigenvalue to be positive, but the infimum in the definition of $\phi^2$ is restricted to a cone of vectors, so we can still hope for a positive value of $\phi^2$.
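In the low-dimensional regime the eigenvalue bound is easy to check directly; a minimal sketch (our own):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 500, 20  # n > p, so Sigma-hat can have a positive minimum eigenvalue
X = rng.standard_normal((n, p))
Sigma_hat = X.T @ X / n
c_min = np.linalg.eigvalsh(Sigma_hat).min()
print("c_min =", c_min)  # phi^2 >= c_min whenever c_min > 0
```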
Theorem. Assume $\phi^2 > 0$, and let $\hat\beta$ be the Lasso solution with
\[ \lambda = A\sigma\sqrt{\log p / n}. \]
Then with probability at least $1 - 2p^{-(A^2/8 - 1)}$, we have
\[ \frac{1}{n}\|X(\beta^0 - \hat\beta)\|_2^2 + \lambda \|\hat\beta - \beta^0\|_1 \leq \frac{16\lambda^2 s}{\phi^2} = \frac{16 A^2 \sigma^2}{\phi^2} \cdot \frac{s \log p}{n}. \]
This is actually two bounds in one: it simultaneously bounds the error in the fitted values and the $\ell_1$ error in estimating $\beta^0$.
Recall that our previous bound was of order $1/\sqrt{n}$, and now we have order $1/n$. Note also that $\frac{\sigma^2 s}{n}$ is the error we would get if we magically knew which were the non-zero variables and did ordinary least squares on those variables.
This also tells us that if $\beta^0$ has a component that is large, then $\hat\beta$ must be non-zero in that component as well. So while the Lasso cannot predict exactly which variables are non-zero, we can at least get the important ones.
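One can watch both error terms in a simulation. The sketch below is our own illustration (the choice $A = 2$ is arbitrary, and scikit-learn's Lasso again stands in for the minimizer of $Q_\lambda$); the combined error should scale like $s \log p / n$:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p, s, sigma, A = 400, 1000, 10, 1.0, 2.0
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s] = 1.0
Y = X @ beta0 + sigma * rng.standard_normal(n)

lam = A * sigma * np.sqrt(np.log(p) / n)
betahat = Lasso(alpha=lam, fit_intercept=False, max_iter=100000).fit(X, Y).coef_

pred_err = np.sum((X @ (betahat - beta0)) ** 2) / n  # (1/n)||X(betahat - beta0)||_2^2
est_err = lam * np.abs(betahat - beta0).sum()        # lambda * ||betahat - beta0||_1
print(pred_err + est_err, "vs. rate s log p / n =", s * np.log(p) / n)
```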
Proof. Start with the basic inequality $Q_\lambda(\hat\beta) \leq Q_\lambda(\beta^0)$, which gives us
\[ \frac{1}{2n}\|X(\beta^0 - \hat\beta)\|_2^2 + \lambda\|\hat\beta\|_1 \leq \frac{1}{n}\varepsilon^T X(\hat\beta - \beta^0) + \lambda\|\beta^0\|_1. \]
We work on the event
\[ \Omega = \left\{ \frac{1}{n}\|X^T\varepsilon\|_\infty \leq \frac{1}{2}\lambda \right\}, \]
where after applying Hölder's inequality, we get
\[ \frac{1}{n}\|X(\beta^0 - \hat\beta)\|_2^2 + 2\lambda\|\hat\beta\|_1 \leq \lambda\|\hat\beta - \beta^0\|_1 + 2\lambda\|\beta^0\|_1. \]
We can move $2\lambda\|\hat\beta\|_1$ to the other side, and applying the triangle inequality, we have
\[ \frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 \leq 3\lambda\|\hat\beta - \beta^0\|_1. \]
If we manage to bound the RHS from above as well, so that
\[ 3\lambda\|\hat\beta - \beta^0\|_1 \leq c\lambda \sqrt{\frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2} \]
for some $c$, then we obtain the bound
\[ \frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 \leq c^2\lambda^2. \]
Plugging this back into the second bound, we also have
\[ 3\lambda\|\hat\beta - \beta^0\|_1 \leq c^2\lambda^2. \]
To obtain these bounds, we want to apply the definition of $\phi^2$ to $\hat\beta - \beta^0$. We thus need to show that $\hat\beta - \beta^0$ satisfies the cone condition in the infimum.
Write
\[ a = \frac{1}{n\lambda}\|X(\hat\beta - \beta^0)\|_2^2. \]
Then we have
\[ a + 2(\|\hat\beta_N\|_1 + \|\hat\beta_S\|_1) \leq \|\hat\beta_S - \beta^0_S\|_1 + \|\hat\beta_N\|_1 + 2\|\beta^0_S\|_1. \]
Simplifying, we obtain
\[ a + \|\hat\beta_N\|_1 \leq \|\hat\beta_S - \beta^0_S\|_1 + 2\|\beta^0_S\|_1 - 2\|\hat\beta_S\|_1. \]
Using the triangle inequality, we write this as
\[ a + \|\hat\beta_N - \beta^0_N\|_1 \leq 3\|\hat\beta_S - \beta^0_S\|_1. \]
So we immediately know we can apply the compatibility condition, which gives us
\[ \phi^2 \leq \frac{\frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2}{\frac{1}{s}\|\hat\beta_S - \beta^0_S\|_1^2}. \]
Also, we have
\[ \frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 + \lambda\|\hat\beta - \beta^0\|_1 \leq 4\lambda\|\hat\beta_S - \beta^0_S\|_1. \]
Thus, using the compatibility condition, we have
\[ \frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 + \lambda\|\hat\beta - \beta^0\|_1 \leq \frac{4\lambda}{\phi}\sqrt{\frac{s}{n}}\, \|X(\hat\beta - \beta^0)\|_2. \]
Thus, dividing through by $\frac{1}{\sqrt{n}}\|X(\hat\beta - \beta^0)\|_2$, we obtain
\[ \frac{1}{\sqrt{n}}\|X(\hat\beta - \beta^0)\|_2 \leq \frac{4\lambda\sqrt{s}}{\phi}. \tag{$*$} \]
So we substitute $(*)$ into the RHS and obtain
\[ \frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 + \lambda\|\hat\beta - \beta^0\|_1 \leq \frac{16\lambda^2 s}{\phi^2}. \]
If we want to be impressed by this result, then we should make sure that
the compatibility condition is a reasonable assumption to make on the design
matrix.
The compatibility condition and random design
For any $\Sigma \in \mathbb{R}^{p \times p}$, define
\[ \phi^2_\Sigma(S) = \inf_{\substack{\beta : \|\beta_N\|_1 \leq 3\|\beta_S\|_1 \\ \beta_S \neq 0}} \frac{\beta^T \Sigma \beta}{\|\beta_S\|_1^2 / |S|}. \]
Our original $\phi^2$ is then the same as $\phi^2_{\hat\Sigma}(S)$.
If we want to analyze how $\phi^2_\Sigma(S)$ behaves for a “random” $\Sigma$, then it would be convenient to know that this depends continuously on $\Sigma$. For our purposes, the following lemma suffices:
Lemma. Let $\Theta, \Sigma \in \mathbb{R}^{p \times p}$. Suppose $\phi^2_\Theta(S) > 0$ and
\[ \max_{j,k} |\Theta_{jk} - \Sigma_{jk}| \leq \frac{\phi^2_\Theta(S)}{32|S|}. \]
Then
\[ \phi^2_\Sigma(S) \geq \frac{1}{2}\phi^2_\Theta(S). \]
Proof. We suppress the dependence on $S$ for notational convenience. Let $s = |S|$ and
\[ t = \frac{\phi^2_\Theta(S)}{32s}. \]
We have
\[ |\beta^T(\Sigma - \Theta)\beta| \leq \|\beta\|_1 \|(\Sigma - \Theta)\beta\|_\infty \leq t\|\beta\|_1^2, \]
where we applied Hölder twice.
If $\|\beta_N\|_1 \leq 3\|\beta_S\|_1$, then we have
\[ \|\beta\|_1 \leq 4\|\beta_S\|_1 \leq 4 \frac{\sqrt{\beta^T \Theta \beta}}{\phi_\Theta / \sqrt{s}}. \]
Thus, we have
\[ \beta^T \Sigma \beta \geq \beta^T \Theta \beta - t\|\beta\|_1^2 \geq \beta^T \Theta \beta - \frac{\phi^2_\Theta}{32s} \cdot \frac{16 \beta^T \Theta \beta}{\phi^2_\Theta / s} = \frac{1}{2} \beta^T \Theta \beta. \]
Dividing by $\|\beta_S\|_1^2 / s$ and taking the infimum over the cone gives the result.
Define
\[ \phi^2_{\Sigma,s} = \min_{S : |S| = s} \phi^2_\Sigma(S). \]
Note that if
\[ \max_{j,k} |\Theta_{jk} - \Sigma_{jk}| \leq \frac{\phi^2_{\Theta,s}}{32s}, \]
then
\[ \phi^2_\Sigma(S) \geq \frac{1}{2}\phi^2_\Theta(S) \]
for all $S$ with $|S| = s$. In particular,
\[ \phi^2_{\Sigma,s} \geq \frac{1}{2}\phi^2_{\Theta,s}. \]
Theorem. Suppose the rows of $X$ are iid and each entry is sub-Gaussian with parameter $v$. Suppose $s\sqrt{\log p / n} \to 0$ as $n \to \infty$, and $\phi^2_{\Sigma^0,s}$ is bounded away from 0, where $\Sigma^0 = \mathbb{E}\hat\Sigma$. Then
\[ P\left( \phi^2_{\hat\Sigma,s} \geq \frac{1}{2}\phi^2_{\Sigma^0,s} \right) \to 1 \quad \text{as } n \to \infty. \]
Proof. It is enough to show that
\[ P\left( \max_{j,k} |\hat\Sigma_{jk} - \Sigma^0_{jk}| \geq \frac{\phi^2_{\Sigma^0,s}}{32s} \right) \to 0 \]
as $n \to \infty$.
Let $t = \frac{\phi^2_{\Sigma^0,s}}{32s}$. Then
\[ P\left( \max_{j,k} |\hat\Sigma_{jk} - \Sigma^0_{jk}| \geq t \right) \leq \sum_{j,k} P(|\hat\Sigma_{jk} - \Sigma^0_{jk}| \geq t). \]
Recall that
\[ \hat\Sigma_{jk} = \frac{1}{n} \sum_{i=1}^n X_{ij} X_{ik}. \]
So we can apply Bernstein's inequality to bound
\[ P(|\hat\Sigma_{jk} - \Sigma^0_{jk}| \geq t) \leq 2\exp\left( -\frac{nt^2}{2(64v^4 + 4v^2 t)} \right), \]
since $\sigma = 8v^2$ and $b = 4v^2$. So we can bound
\[ P\left( \max_{j,k} |\hat\Sigma_{jk} - \Sigma^0_{jk}| \geq t \right) \leq 2p^2 \exp\left( -\frac{cn}{s^2} \right) = 2\exp\left( -\frac{cn}{s^2}\left( 1 - \frac{2s^2}{cn}\log p \right) \right) \to 0 \]
for some constant $c$.
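The entrywise convergence of $\hat\Sigma$ to $\Sigma^0$ that drives this proof is easy to observe empirically; a minimal sketch (our own, taking $\Sigma^0 = I$ for simplicity):

```python
import numpy as np

rng = np.random.default_rng(6)
p = 200
for n in [100, 1000, 10000]:
    X = rng.standard_normal((n, p))   # rows iid with Sigma^0 = I
    Sigma_hat = X.T @ X / n
    err = np.max(np.abs(Sigma_hat - np.eye(p)))
    print(n, err)                     # decays roughly like sqrt(log(p) / n)
```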
Corollary. Suppose the rows of $X$ are iid mean-zero multivariate Gaussian with variance $\Sigma^0$. Suppose $\Sigma^0$ has minimum eigenvalue bounded from below by $c_{\min} > 0$, and suppose the diagonal entries of $\Sigma^0$ are bounded from above. If $s\sqrt{\log p / n} \to 0$, then
\[ P\left( \phi^2_{\hat\Sigma,s} \geq \frac{1}{2} c_{\min} \right) \to 1 \quad \text{as } n \to \infty. \]