3.5 Variable selection
So we have seen that the Lasso picks out some “important” variables and discards
the rest. How well does it do the job?
For simplicity, consider a noiseless linear model
\[ Y = X\beta^0. \]
Our objective is to find the set
\[ S = \{k : \beta^0_k \neq 0\}. \]
We may wlog assume $S = \{1, \ldots, s\}$ and $N = \{1, \ldots, p\} \setminus S$ (as usual $X$ is $n \times p$). We further assume that $\operatorname{rk}(X_S) = s$.
In general, it is difficult to find $S$ even if we know $|S|$. Under certain conditions, the Lasso can do the job correctly. This condition is dictated by the $\ell_\infty$ norm of the quantity
\[ \Delta = X_N^T X_S (X_S^T X_S)^{-1} \operatorname{sgn}(\beta^0_S). \]
We can understand this a bit as follows: the $k$th entry of this is the dot product of $\operatorname{sgn}(\beta^0_S)$ with $(X_S^T X_S)^{-1} X_S^T X_k$. This is the coefficient vector we would obtain if we tried to regress $X_k$ on $X_S$. If this is large, then this suggests we expect $X_k$ to look correlated to $X_S$, and so it would be difficult to determine if $k$ is part of $S$ or not.
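As an illustration (not part of the notes), the following numpy sketch computes $\Delta$ for a hypothetical Gaussian design and a hand-picked $\beta^0$ supported on the first $s$ coordinates, and checks whether $\|\Delta\|_\infty \le 1$:

import numpy as np

rng = np.random.default_rng(1)
n, p, s = 200, 50, 5

# Hypothetical design and true coefficients: the first s variables form the signal set S.
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s] = np.array([3.0, -2.0, 1.5, -1.0, 2.5])

S = np.arange(s)
N = np.arange(s, p)
XS, XN = X[:, S], X[:, N]

# Delta = X_N^T X_S (X_S^T X_S)^{-1} sgn(beta^0_S)
Delta = XN.T @ XS @ np.linalg.solve(XS.T @ XS, np.sign(beta0[S]))

print("||Delta||_inf =", np.max(np.abs(Delta)))
# If this is <= 1 (and the nonzero beta^0_k are large enough relative to lambda),
# the theorem below says some Lasso solution recovers sgn(beta^0).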
Theorem.
(i) If $\|\Delta\|_\infty \le 1$, or equivalently
\[ \max_{k \in N} \left|\operatorname{sgn}(\beta^0_S)^T (X_S^T X_S)^{-1} X_S^T X_k\right| \le 1, \]
and moreover
\[ |\beta^0_k| > \lambda \left| \left( \left( \tfrac{1}{n} X_S^T X_S \right)^{-1} \operatorname{sgn}(\beta^0_S) \right)_k \right| \quad \text{for all } k \in S, \]
then there exists a Lasso solution $\hat\beta^L_\lambda$ with $\operatorname{sgn}(\hat\beta^L_\lambda) = \operatorname{sgn}(\beta^0)$.
(ii) If there exists a Lasso solution with $\operatorname{sgn}(\hat\beta^L_\lambda) = \operatorname{sgn}(\beta^0)$, then $\|\Delta\|_\infty \le 1$.
We are going to make heavy use of the KKT conditions.
Proof. Write $\hat\beta = \hat\beta^L_\lambda$, and write $\hat S = \{k : \hat\beta_k \neq 0\}$. Then the KKT conditions are that
\[ \frac{1}{n} X^T X (\beta^0 - \hat\beta) = \lambda \hat\nu, \]
where $\|\hat\nu\|_\infty \le 1$ and $\hat\nu_{\hat S} = \operatorname{sgn}(\hat\beta_{\hat S})$.
We expand this to say
\[ \frac{1}{n} \begin{pmatrix} X_S^T X_S & X_S^T X_N \\ X_N^T X_S & X_N^T X_N \end{pmatrix} \begin{pmatrix} \beta^0_S - \hat\beta_S \\ -\hat\beta_N \end{pmatrix} = \lambda \begin{pmatrix} \hat\nu_S \\ \hat\nu_N \end{pmatrix}. \]
Call the top and bottom equations (1) and (2) respectively.
It is easier to prove (ii) first. If there is such a solution, then $\hat\beta_N = 0$ and $\hat\nu_S = \operatorname{sgn}(\hat\beta_S) = \operatorname{sgn}(\beta^0_S)$. So from (1), we must have
\[ \frac{1}{n} X_S^T X_S (\beta^0_S - \hat\beta_S) = \lambda \operatorname{sgn}(\beta^0_S). \]
Inverting $\frac{1}{n} X_S^T X_S$, we learn that
\[ \beta^0_S - \hat\beta_S = \lambda \left( \frac{1}{n} X_S^T X_S \right)^{-1} \operatorname{sgn}(\beta^0_S). \]
Substitute this into (2) to get
\[ \lambda \frac{1}{n} X_N^T X_S \left( \frac{1}{n} X_S^T X_S \right)^{-1} \operatorname{sgn}(\beta^0_S) = \lambda \hat\nu_N. \]
By the KKT conditions, we know $\|\hat\nu_N\|_\infty \le 1$, and the LHS is exactly $\lambda \Delta$.
To prove (i), we need to exhibit a $\hat\beta$ that agrees in sign with $\beta^0$ and satisfies the equations (1) and (2). In particular, we take $\hat\beta_N = 0$. So we try
\[ (\hat\beta_S, \hat\nu_S) = \left( \beta^0_S - \lambda \left( \frac{1}{n} X_S^T X_S \right)^{-1} \operatorname{sgn}(\beta^0_S),\ \operatorname{sgn}(\beta^0_S) \right), \qquad (\hat\beta_N, \hat\nu_N) = (0, \Delta). \]
By construction this satisfies (1) and (2), and $\|\hat\nu_N\|_\infty = \|\Delta\|_\infty \le 1$ by the first condition, so the KKT conditions hold. We then only need to check that
\[ \operatorname{sgn}(\hat\beta_S) = \operatorname{sgn}(\beta^0_S), \]
which follows from the second condition.
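Continuing the earlier illustration, we can check the theorem numerically. Here scikit-learn's Lasso is used since its objective $\frac{1}{2n}\|y - Xw\|_2^2 + \alpha\|w\|_1$ coincides with $Q_\lambda$ when no intercept is fitted and $\alpha = \lambda$; the design, coefficients and the value $\lambda = 0.1$ are hypothetical choices, and sign recovery should typically (though not for every draw of $X$) be reported for this configuration:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, s = 200, 50, 5
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s] = np.array([3.0, -2.0, 1.5, -1.0, 2.5])
y = X @ beta0                      # noiseless model Y = X beta^0

lam = 0.1                          # hypothetical; needs |beta^0_k| large relative to lambda on S
# sklearn's Lasso minimises (1/(2n))||y - Xw||^2 + alpha ||w||_1,
# which matches Q_lambda with alpha = lambda when no intercept is fitted.
fit = Lasso(alpha=lam, fit_intercept=False, max_iter=100000).fit(X, y)

print("sign recovery:", np.array_equal(np.sign(fit.coef_), np.sign(beta0)))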
Prediction and estimation
We now consider the question of how well the Lasso functions as a regression method. Consider the model
\[ Y = X\beta^0 + \varepsilon - \bar\varepsilon 1, \]
where the $\varepsilon_i$ are independent and have common sub-Gaussian parameter $\sigma$. Let $S$, $s$, $N$ be as before.
As before, the Lasso doesn’t always behave well, and whether or not it does
is controlled by the compatibility factor.
Definition (Compatibility factor). Define the compatibility factor to be
\[ \phi^2 = \inf_{\substack{\beta \in \mathbb{R}^p \\ \|\beta_N\|_1 \le 3\|\beta_S\|_1,\ \beta_S \ne 0}} \frac{\frac{1}{n}\|X\beta\|_2^2}{\frac{1}{s}\|\beta_S\|_1^2} = \inf_{\substack{\beta \in \mathbb{R}^p \\ \|\beta_S\|_1 = 1,\ \|\beta_N\|_1 \le 3}} \frac{s}{n}\|X_S \beta_S - X_N \beta_N\|_2^2. \]
Note that we are free to use a minus sign inside the norm, since the problem is symmetric under $\beta_N \leftrightarrow -\beta_N$.
In some sense, this $\phi$ measures how well we can approximate $X_S \beta_S$ just with the noise variables.
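The infimum defining $\phi^2$ is over a cone and is not straightforward to compute exactly. As a rough illustration (not from the notes), evaluating the ratio at random points of the cone gives an upper bound on $\phi^2$; the dimensions below are hypothetical:

import numpy as np

rng = np.random.default_rng(0)
n, p, s = 100, 200, 5
X = rng.standard_normal((n, p))
S = np.arange(s)                   # take S = {0, ..., s-1} for illustration
N = np.arange(s, p)

def ratio(beta):
    # (1/n)||X beta||_2^2 divided by (1/s)||beta_S||_1^2
    num = np.sum((X @ beta) ** 2) / n
    den = np.sum(np.abs(beta[S])) ** 2 / s
    return num / den

# Random search over the cone ||beta_N||_1 <= 3 ||beta_S||_1 only gives an
# UPPER bound on phi^2 (the infimum); it cannot certify phi^2 > 0.
best = np.inf
for _ in range(5000):
    b = rng.standard_normal(p)
    scale = 3 * np.sum(np.abs(b[S])) / max(np.sum(np.abs(b[N])), 1e-12)
    b[N] *= min(1.0, scale)        # rescale the N-part so the cone constraint holds
    best = min(best, ratio(b))
print("upper bound on phi^2:", best)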
Definition (Compatibility condition). The compatibility condition is $\phi^2 > 0$.
Note that if $\hat\Sigma = \frac{1}{n} X^T X$ has minimum eigenvalue $c_{\min} > 0$, then we have $\phi^2 \ge c_{\min}$. Indeed,
\[ \|\beta_S\|_1 = \operatorname{sgn}(\beta_S)^T \beta_S \le \sqrt{s}\,\|\beta_S\|_2 \le \sqrt{s}\,\|\beta\|_2, \]
and so
\[ \phi^2 \ge \inf_{\beta \ne 0} \frac{\frac{1}{n}\|X\beta\|_2^2}{\|\beta\|_2^2} = c_{\min}. \]
Of course, in the high-dimensional setting $p > n$ we cannot expect the minimum eigenvalue to be positive, but the infimum in the definition of $\phi^2$ is restricted to a cone of vectors, so we can still hope to have a positive value of $\phi^2$.
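When $n > p$, the crude bound $\phi^2 \ge c_{\min}$ can be checked directly; a small sketch with a hypothetical Gaussian design:

import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 50                      # low-dimensional case, so Sigma_hat can be invertible
X = rng.standard_normal((n, p))
Sigma_hat = X.T @ X / n

c_min = np.linalg.eigvalsh(Sigma_hat).min()
print("c_min =", c_min)             # if positive, phi^2 >= c_min for every choice of S
# When p > n, c_min = 0, and one has to rely on the restricted infimum
# in the definition of phi^2 instead.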
Theorem. Assume $\phi^2 > 0$, and let $\hat\beta$ be the Lasso solution with
\[ \lambda = A\sigma\sqrt{\log p / n}. \]
Then with probability at least $1 - 2p^{-(A^2/8 - 1)}$, we have
\[ \frac{1}{n}\|X(\beta^0 - \hat\beta)\|_2^2 + \lambda\|\hat\beta - \beta^0\|_1 \le \frac{16\lambda^2 s}{\phi^2} = \frac{16 A^2 \log p}{\phi^2} \cdot \frac{s\sigma^2}{n}. \]
This is actually two bounds: it simultaneously bounds the error in the fitted values and the $\ell_1$ error $\|\hat\beta - \beta^0\|_1$ of estimating $\beta^0$.
Recall that in our previous bound, we had a bound of $\sim \frac{1}{\sqrt n}$, and now we have $\sim \frac{1}{n}$. Note also that $\frac{s\sigma^2}{n}$ is the error we would get if we magically knew which were the non-zero variables and did ordinary least squares on those variables.
This also tells us that if $\beta^0$ has a component that is large, then $\hat\beta$ must be non-zero in that component as well. So while the Lasso cannot predict exactly which variables are non-zero, we can at least get the important ones.
Proof. Start with the basic inequality $Q_\lambda(\hat\beta) \le Q_\lambda(\beta^0)$, which gives us
\[ \frac{1}{2n}\|X(\beta^0 - \hat\beta)\|_2^2 + \lambda\|\hat\beta\|_1 \le \frac{1}{n}\varepsilon^T X(\hat\beta - \beta^0) + \lambda\|\beta^0\|_1. \]
We work on the event
\[ \Omega = \left\{ \frac{1}{n}\|X^T\varepsilon\|_\infty \le \frac{1}{2}\lambda \right\}, \]
where after applying Hölder's inequality, we get
\[ \frac{1}{n}\|X(\beta^0 - \hat\beta)\|_2^2 + 2\lambda\|\hat\beta\|_1 \le \lambda\|\hat\beta - \beta^0\|_1 + 2\lambda\|\beta^0\|_1. \]
We can move $2\lambda\|\hat\beta\|_1$ to the other side, and applying the triangle inequality, we have
\[ \frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 \le 3\lambda\|\hat\beta - \beta^0\|_1. \]
If we manage to bound the RHS from above as well, so that
\[ 3\lambda\|\hat\beta - \beta^0\|_1 \le c\lambda\frac{1}{\sqrt n}\|X(\hat\beta - \beta^0)\|_2 \]
for some $c$, then we obtain the bound
\[ \frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 \le c^2\lambda^2. \]
Plugging this back into the second bound, we also have
\[ 3\lambda\|\hat\beta - \beta^0\|_1 \le c^2\lambda^2. \]
To obtain these bounds, we want to apply the definition of $\phi^2$ to $\hat\beta - \beta^0$. We thus need to show that $\hat\beta - \beta^0$ satisfies the constraints in the infimum.
Write
\[ a = \frac{1}{n\lambda}\|X(\hat\beta - \beta^0)\|_2^2. \]
Then we have
\[ a + 2(\|\hat\beta_N\|_1 + \|\hat\beta_S\|_1) \le \|\hat\beta_S - \beta^0_S\|_1 + \|\hat\beta_N\|_1 + 2\|\beta^0_S\|_1. \]
Simplifying, we obtain
\[ a + \|\hat\beta_N\|_1 \le \|\hat\beta_S - \beta^0_S\|_1 + 2\|\beta^0_S\|_1 - 2\|\hat\beta_S\|_1. \]
Using the triangle inequality, we write this as
\[ a + \|\hat\beta_N - \beta^0_N\|_1 \le 3\|\hat\beta_S - \beta^0_S\|_1. \]
So we immediately know we can apply the compatibility condition, which gives us
\[ \phi^2 \le \frac{\frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2}{\frac{1}{s}\|\hat\beta_S - \beta^0_S\|_1^2}. \]
Also, we have
\[ \frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 + \lambda\|\hat\beta - \beta^0\|_1 \le 4\lambda\|\hat\beta_S - \beta^0_S\|_1. \]
Thus, using the compatibility condition, we have
\[ \frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 + \lambda\|\hat\beta - \beta^0\|_1 \le \frac{4\lambda}{\phi}\sqrt{\frac{s}{n}}\,\|X(\hat\beta - \beta^0)\|_2. \]
Thus, dividing through by $\frac{1}{\sqrt n}\|X(\hat\beta - \beta^0)\|_2$, we obtain
\[ \frac{1}{\sqrt n}\|X(\hat\beta - \beta^0)\|_2 \le \frac{4\lambda\sqrt s}{\phi}. \tag{$*$} \]
So we substitute $(*)$ into the RHS and obtain
\[ \frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 + \lambda\|\hat\beta - \beta^0\|_1 \le \frac{16\lambda^2 s}{\phi^2}. \]
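As a sanity check on this result (an illustration, not part of the notes), one can simulate the model with Gaussian noise, fit the Lasso at $\lambda = A\sigma\sqrt{\log p/n}$, and compare the left-hand side with the bound $16\lambda^2 s/\phi^2$. The sketch below ignores the $\bar\varepsilon 1$ centering term, uses the crude bound $\phi^2 \ge c_{\min}$ (valid here since $n > p$), and all dimensions and constants are hypothetical:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p, s, sigma, A = 400, 100, 5, 1.0, 4.0
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s] = 2.0
eps = sigma * rng.standard_normal(n)
y = X @ beta0 + eps                     # centering term eps_bar * 1 ignored for simplicity

lam = A * sigma * np.sqrt(np.log(p) / n)
fit = Lasso(alpha=lam, fit_intercept=False, max_iter=100000).fit(X, y)
diff = fit.coef_ - beta0

lhs = np.sum((X @ diff) ** 2) / n + lam * np.sum(np.abs(diff))
c_min = np.linalg.eigvalsh(X.T @ X / n).min()   # crude phi^2 >= c_min since n > p here
print("LHS =", lhs, " bound 16*lam^2*s/c_min =", 16 * lam**2 * s / c_min)

The realised left-hand side is typically far below the bound, which is only meant to capture the rate, not sharp constants.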
If we want to be impressed by this result, then we should make sure that
the compatibility condition is a reasonable assumption to make on the design
matrix.
The compatibility condition and random design
For any $\Sigma \in \mathbb{R}^{p \times p}$, define
\[ \phi^2_\Sigma(S) = \inf_{\substack{\beta : \|\beta_N\|_1 \le 3\|\beta_S\|_1 \\ \beta_S \ne 0}} \frac{\beta^T \Sigma \beta}{\|\beta_S\|_1^2 / |S|}. \]
Our original $\phi^2$ is then the same as $\phi^2_{\hat\Sigma}(S)$.
If we want to analyze how $\phi^2_\Sigma(S)$ behaves for a “random” $\Sigma$, then it would be convenient to know that this depends continuously on $\Sigma$. For our purposes, the following lemma suffices:
Lemma. Let $\Theta, \Sigma \in \mathbb{R}^{p \times p}$. Suppose $\phi^2_\Theta(S) > 0$ and
\[ \max_{j,k}|\Theta_{jk} - \Sigma_{jk}| \le \frac{\phi^2_\Theta(S)}{32|S|}. \]
Then
\[ \phi^2_\Sigma(S) \ge \frac{1}{2}\phi^2_\Theta(S). \]
Proof. We suppress the dependence on $S$ for notational convenience. Let $s = |S|$ and
\[ t = \frac{\phi^2_\Theta(S)}{32s}. \]
We have
\[ |\beta^T(\Sigma - \Theta)\beta| \le \|\beta\|_1\|(\Sigma - \Theta)\beta\|_\infty \le t\|\beta\|_1^2, \]
where we applied Hölder twice.
If $\|\beta_N\|_1 \le 3\|\beta_S\|_1$, then we have
\[ \|\beta\|_1 \le 4\|\beta_S\|_1 \le 4\frac{\sqrt{\beta^T\Theta\beta}}{\phi_\Theta/\sqrt s}. \]
Thus, we have
\[ \beta^T\Sigma\beta \ge \beta^T\Theta\beta - t\|\beta\|_1^2 \ge \beta^T\Theta\beta - \frac{\phi^2_\Theta}{32s}\cdot\frac{16\,\beta^T\Theta\beta}{\phi^2_\Theta/s} = \frac{1}{2}\beta^T\Theta\beta. \]
Dividing by $\|\beta_S\|_1^2/s$ and taking the infimum over the cone gives $\phi^2_\Sigma \ge \frac{1}{2}\phi^2_\Theta$.
Define
\[ \phi^2_{\Sigma,s} = \min_{S : |S| = s}\phi^2_\Sigma(S). \]
Note that if
\[ \max_{j,k}|\Theta_{jk} - \Sigma_{jk}| \le \frac{\phi^2_{\Theta,s}}{32s}, \]
then
\[ \phi^2_\Sigma(S) \ge \frac{1}{2}\phi^2_\Theta(S) \]
for all $S$ with $|S| = s$. In particular,
\[ \phi^2_{\Sigma,s} \ge \frac{1}{2}\phi^2_{\Theta,s}. \]
Theorem. Suppose the rows of $X$ are iid and each entry is sub-Gaussian with parameter $v$. Let $\Sigma^0 = \mathbb{E}\hat\Sigma$. Suppose $s\sqrt{\log p/n} \to 0$ as $n \to \infty$, and $\phi^2_{\Sigma^0,s}$ is bounded away from $0$. Then
\[ \mathbb{P}\left( \phi^2_{\hat\Sigma,s} \ge \frac{1}{2}\phi^2_{\Sigma^0,s} \right) \to 1 \text{ as } n \to \infty. \]
Proof. It is enough to show that
\[ \mathbb{P}\left( \max_{j,k}|\hat\Sigma_{jk} - \Sigma^0_{jk}| > \frac{\phi^2_{\Sigma^0,s}}{32s} \right) \to 0 \]
as $n \to \infty$.
Let $t = \frac{\phi^2_{\Sigma^0,s}}{32s}$. Then
\[ \mathbb{P}\left( \max_{j,k}|\hat\Sigma_{jk} - \Sigma^0_{jk}| \ge t \right) \le \sum_{j,k}\mathbb{P}\left(|\hat\Sigma_{jk} - \Sigma^0_{jk}| \ge t\right). \]
Recall that
\[ \hat\Sigma_{jk} = \frac{1}{n}\sum_{i=1}^n X_{ij}X_{ik}. \]
So we can apply Bernstein's inequality to bound
\[ \mathbb{P}\left(|\hat\Sigma_{jk} - \Sigma^0_{jk}| \ge t\right) \le 2\exp\left( -\frac{nt^2}{2(64v^4 + 4v^2 t)} \right), \]
since $\sigma = 8v^2$ and $b = 4v^2$.
So we can bound
\[ \mathbb{P}\left( \max_{j,k}|\hat\Sigma_{jk} - \Sigma^0_{jk}| \ge t \right) \le 2p^2\exp\left( -\frac{cn}{s^2} \right) = 2\exp\left( -\frac{n}{s^2}\left( c - \frac{2s^2\log p}{n} \right) \right) \to 0 \]
for some constant $c > 0$, since $s^2\log p/n \to 0$.
Corollary. Suppose the rows of $X$ are iid mean-zero multivariate Gaussian with variance $\Sigma^0$. Suppose $\Sigma^0$ has minimum eigenvalue bounded from below by $c_{\min} > 0$, and suppose the diagonal entries of $\Sigma^0$ are bounded from above. If $s\sqrt{\log p/n} \to 0$, then
\[ \mathbb{P}\left( \phi^2_{\hat\Sigma,s} \ge \frac{1}{2}c_{\min} \right) \to 1 \text{ as } n \to \infty. \]
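The quantity driving the proofs above is $\max_{j,k}|\hat\Sigma_{jk} - \Sigma^0_{jk}|$. A small simulation (illustration only, with the hypothetical choice $\Sigma^0 = I$) shows that it shrinks at roughly the $\sqrt{\log p/n}$ rate:

import numpy as np

rng = np.random.default_rng(4)
p = 200
for n in [200, 800, 3200]:
    X = rng.standard_normal((n, p))          # rows iid N(0, I), so Sigma^0 = I
    Sigma_hat = X.T @ X / n
    dev = np.max(np.abs(Sigma_hat - np.eye(p)))
    # compare the maximal entrywise deviation with the reference rate sqrt(log p / n)
    print(n, dev, np.sqrt(np.log(p) / n))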