3 The Lasso and beyond

3.1 The Lasso estimator
The Lasso (Tibshirani, 1996) is a seemingly small modification of Ridge regression
that solves our problems. It solves
\[
(\hat{\mu}^L_\lambda, \hat{\beta}^L_\lambda) = \operatorname*{argmin}_{(\mu, \beta) \in \mathbb{R} \times \mathbb{R}^p} \frac{1}{2n} \|Y - \mu \mathbf{1} - X\beta\|_2^2 + \lambda \|\beta\|_1.
\]
The key difference is that we use an $\ell_1$ norm on $\beta$ rather than the $\ell_2$ norm,
\[
\|\beta\|_1 = \sum_{k=1}^p |\beta_k|.
\]
This makes it drastically different from Ridge regression. We will see that for $\lambda$ large, it will make all of the entries of $\beta$ exactly zero, as opposed to being merely very close to zero.
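This sparsity is easy to see numerically. Below is a minimal sketch (not from the notes; the data, penalty levels and use of scikit-learn are all illustrative assumptions) comparing the fitted coefficient vectors of the Lasso and Ridge regression. Note that scikit-learn's Lasso minimizes $\frac{1}{2n}\|Y - X\beta\|_2^2 + \alpha\|\beta\|_1$, so its alpha plays the role of $\lambda$ here.
\begin{verbatim}
# Illustrative sketch (not from the notes): compare the sparsity of Lasso
# and Ridge coefficient estimates on simulated data.  All settings below
# (dimensions, penalty levels, seed) are arbitrary choices.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p, s = 100, 200, 5                         # n observations, p predictors, s signals
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                           # center the columns
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)   # scale columns to l2 norm sqrt(n)
beta0 = np.zeros(p)
beta0[:s] = 2.0                               # sparse true coefficient vector
y = X @ beta0 + rng.standard_normal(n)
y -= y.mean()                                 # center the response

# scikit-learn's Lasso minimizes (1/(2n))||y - Xb||_2^2 + alpha ||b||_1,
# so alpha plays the role of lambda in Q_lambda.  (Ridge penalizes
# alpha ||b||_2^2 without the 1/(2n) factor.)
lasso = Lasso(alpha=0.3).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("non-zero Lasso coefficients:", np.sum(lasso.coef_ != 0))
print("non-zero Ridge coefficients:", np.sum(np.abs(ridge.coef_) > 1e-8))
\end{verbatim}
Typically the Lasso fit has only a handful of non-zero coefficients, while essentially every Ridge coefficient is non-zero.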
We can compare this to best subset regression, where we replace $\|\beta\|_1$ with something of the form $\sum_{k=1}^p \mathbf{1}_{\{|\beta_k| > 0\}}$. But the beautiful property of the Lasso is that the optimization problem is now continuous, and in fact convex. This allows us to solve it using standard convex optimization techniques.
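For instance, one standard convex-optimization approach (an illustration only, not necessarily the algorithm used later in the course) is coordinate descent: with all other coordinates held fixed, minimizing $Q_\lambda$ over a single $\beta_k$ has a closed-form solution given by soft-thresholding. A rough sketch, assuming $X$ and $Y$ have already been centered:
\begin{verbatim}
# Rough sketch of coordinate descent for Q_lambda (illustrative; assumes X
# and y have been centered, with columns of X scaled to l2 norm sqrt(n)).
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/(2n))||y - X b||_2^2 + lam ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y - X @ beta                      # current residual y - X beta
    col_sq = (X ** 2).sum(axis=0) / n         # equals 1 when columns have norm sqrt(n)
    for _ in range(n_iter):
        for j in range(p):
            rj = resid + X[:, j] * beta[j]    # partial residual excluding coordinate j
            zj = X[:, j] @ rj / n
            new_bj = soft_threshold(zj, lam) / col_sq[j]
            resid = rj - X[:, j] * new_bj     # keep the residual up to date
            beta[j] = new_bj
    return beta

# Usage (with centered data): betahat = lasso_cd(X, y, lam=0.3)
\end{verbatim}
Each coordinate update is exact because the $\ell_1$ penalty separates across coordinates; by contrast, the best subset penalty would force a combinatorial search over subsets.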
Why is the $\ell_1$ norm so different from the $\ell_2$ norm? Just as in Ridge regression, we may center and scale $X$, and center $Y$, so that we can remove $\mu$ from the objective. Define
\[
Q_\lambda(\beta) = \frac{1}{2n} \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1.
\]
Any minimizer $\hat{\beta}^L_\lambda$ of $Q_\lambda(\beta)$ must also be a solution to
\[
\min \|Y - X\beta\|_2^2 \quad \text{subject to } \|\beta\|_1 \le \|\hat{\beta}^L_\lambda\|_1.
\]
Similarly, $\hat{\beta}^R_\lambda$ is a solution of
\[
\min \|Y - X\beta\|_2^2 \quad \text{subject to } \|\beta\|_2 \le \|\hat{\beta}^R_\lambda\|_2.
\]
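To see why the first of these equivalences holds (a short argument not spelled out above): if some $\beta$ with $\|\beta\|_1 \le \|\hat{\beta}^L_\lambda\|_1$ had $\|Y - X\beta\|_2^2 < \|Y - X\hat{\beta}^L_\lambda\|_2^2$, then
\[
Q_\lambda(\beta) = \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1 < \frac{1}{2n}\|Y - X\hat{\beta}^L_\lambda\|_2^2 + \lambda\|\hat{\beta}^L_\lambda\|_1 = Q_\lambda(\hat{\beta}^L_\lambda),
\]
contradicting the fact that $\hat{\beta}^L_\lambda$ minimizes $Q_\lambda$. The same argument applies to Ridge regression.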
So imagine we are given a value of $\|\hat{\beta}^L_\lambda\|_1$, and we try to solve the above optimization problem with pictures. The region $\|\beta\|_1 \le \|\hat{\beta}^L_\lambda\|_1$ is given by a rotated square.
On the other hand, the minimum of $\|Y - X\beta\|_2^2$ is at $\hat{\beta}^{\mathrm{OLS}}$, and the contours are ellipses centered around this point.
To solve the minimization problem, we should pick the smallest contour that hits the square, and pick the intersection point to be our estimate of $\beta^0$. The point is that since the unit ball in the $\ell_1$-norm has corners, this estimate is likely to lie at a corner, and hence to have a lot of zeroes. Compare this to the case of Ridge regression, where the constraint set is a ball. Generically, for Ridge regression, we would expect the solution to be non-zero everywhere.
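The pictures referred to above are not reproduced in this text. The following matplotlib sketch (purely illustrative; the OLS location and contour shapes are made up) draws the usual version of them: an $\ell_1$ ball and an $\ell_2$ ball against elliptical contours of the least-squares criterion.
\begin{verbatim}
# Purely illustrative sketch of the usual picture: l1 and l2 constraint
# sets against elliptical contours of ||Y - X beta||_2^2 centered at an
# (invented) OLS estimate.
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2 * np.pi, 400)
b_ols = np.array([1.5, 0.8])                  # made-up OLS location
fig, axes = plt.subplots(1, 2, figsize=(8, 4), sharex=True, sharey=True)

for ax, name in zip(axes, ["Lasso (l1 ball)", "Ridge (l2 ball)"]):
    for r in [0.4, 0.8, 1.2]:                 # elliptical least-squares contours
        ax.plot(b_ols[0] + 1.4 * r * np.cos(theta),
                b_ols[1] + 0.7 * r * np.sin(theta), "C1")
    if "l1" in name:
        ax.plot([1, 0, -1, 0, 1], [0, 1, 0, -1, 0], "C0")   # rotated square
    else:
        ax.plot(np.cos(theta), np.sin(theta), "C0")          # circle
    ax.set_title(name)
    ax.set_aspect("equal")

plt.show()
\end{verbatim}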
More generally, we can try to use the $\ell_q$ norm given by
\[
\|\beta\|_q = \left( \sum_{k=1}^p |\beta_k|^q \right)^{1/q}.
\]
We can plot their unit spheres and see what they look like:
[Figure: unit balls of the $\ell_q$ norm for $q = 0.5$, $q = 1$, $q = 1.5$ and $q = 2$.]
We see that $q = 1$ is the smallest value of $q$ for which there are corners, and also the largest value of $q$ for which the constraint set is still convex. Thus, $q = 1$ is the sweet spot for doing regression.
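The failure of convexity for $q < 1$ is easy to check numerically; the snippet below (illustrative only) shows that the midpoint of two points on the unit sphere leaves the unit ball when $q = 0.5$ but not when $q \ge 1$.
\begin{verbatim}
# Quick numerical check (illustrative): for q < 1 the "ball"
# {beta : ||beta||_q <= 1} is not convex, while for q >= 1 it is.
import numpy as np

def lq(beta, q):
    return np.sum(np.abs(beta) ** q) ** (1.0 / q)

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # both on the unit sphere
mid = 0.5 * (a + b)
for q in [0.5, 1.0, 1.5, 2.0]:
    # > 1 only for q = 0.5: the midpoint escapes the unit ball
    print(q, lq(mid, q))
\end{verbatim}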
But is this actually good? Suppose the columns of $X$ are scaled to have $\ell_2$ norm $\sqrt{n}$, and, after centering, we have a normal linear model
\[
Y = X\beta^0 + \varepsilon - \bar{\varepsilon}\mathbf{1}, \quad \text{with } \varepsilon \sim N_n(0, \sigma^2 I).
\]
Theorem. Let $\hat{\beta}$ be the Lasso solution with
\[
\lambda = A\sigma \sqrt{\frac{\log p}{n}}
\]
for some $A$. Then with probability $1 - 2p^{-(A^2/2 - 1)}$, we have
\[
\frac{1}{n} \|X\beta^0 - X\hat{\beta}\|_2^2 \le 4A\sigma \sqrt{\frac{\log p}{n}} \|\beta^0\|_1.
\]
Crucially, this bound depends on $p$ only through $\log p$, and not $p$ itself. On the other hand, unlike what we usually see for ordinary least squares, the rate is $\frac{1}{\sqrt{n}}$, and not $\frac{1}{n}$. We will later obtain better bounds when we make some assumptions on the design matrix.
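Before the proof, here is a quick simulation sketch of the theorem (not from the notes; the dimensions, sparsity, seed and value of $A$ are illustrative assumptions), again using scikit-learn's Lasso, whose alpha parameter matches $\lambda$ in $Q_\lambda$.
\begin{verbatim}
# Illustrative simulation (not from the notes) of the bound
# (1/n)||X beta0 - X betahat||_2^2 <= 4 A sigma sqrt(log p / n) ||beta0||_1.
# Dimensions, sparsity, seed and A are arbitrary choices.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, sigma, A = 200, 500, 1.0, 2.0
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)   # columns have l2 norm sqrt(n)
beta0 = np.zeros(p)
beta0[:10] = 1.0
eps = sigma * rng.standard_normal(n)
y = X @ beta0 + eps - eps.mean()              # Y = X beta0 + eps - epsbar 1

lam = A * sigma * np.sqrt(np.log(p) / n)
betahat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

lhs = np.sum((X @ (beta0 - betahat)) ** 2) / n
rhs = 4 * lam * np.sum(np.abs(beta0))
print(lhs, "<=", rhs, ":", lhs <= rhs)
\end{verbatim}
The inequality should hold in the vast majority of runs, since the failure probability $2p^{-(A^2/2 - 1)}$ is tiny for these choices.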
Proof.
We don't really have a closed-form solution for $\hat{\beta}$, and in general it doesn't exist. So the only thing we can use is that it in fact minimizes $Q_\lambda(\beta)$. Thus, by definition, we have
\[
\frac{1}{2n} \|Y - X\hat{\beta}\|_2^2 + \lambda \|\hat{\beta}\|_1 \le \frac{1}{2n} \|Y - X\beta^0\|_2^2 + \lambda \|\beta^0\|_1.
\]
We know exactly what $Y$ is: it is $X\beta^0 + \varepsilon - \bar{\varepsilon}\mathbf{1}$. If we plug this in, we get
\[
\frac{1}{2n} \|X\beta^0 - X\hat{\beta}\|_2^2 \le \frac{1}{n} \varepsilon^T X(\hat{\beta} - \beta^0) + \lambda \|\beta^0\|_1 - \lambda \|\hat{\beta}\|_1.
\]
Here we use the fact that $X$ is centered, and so is orthogonal to $\mathbf{1}$.
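In more detail (spelling out the substitution): we expand
\[
\|Y - X\hat{\beta}\|_2^2 = \|X(\beta^0 - \hat{\beta}) + \varepsilon - \bar{\varepsilon}\mathbf{1}\|_2^2 = \|X(\beta^0 - \hat{\beta})\|_2^2 + 2\varepsilon^T X(\beta^0 - \hat{\beta}) + \|\varepsilon - \bar{\varepsilon}\mathbf{1}\|_2^2,
\]
where the cross term simplifies because $(\varepsilon - \bar{\varepsilon}\mathbf{1})^T X = \varepsilon^T X$, while $\|Y - X\beta^0\|_2^2 = \|\varepsilon - \bar{\varepsilon}\mathbf{1}\|_2^2$; substituting these into the previous inequality and cancelling the common term gives the display above.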
Now Hölder's inequality tells us
\[
|\varepsilon^T X(\hat{\beta} - \beta^0)| \le \|X^T \varepsilon\|_\infty \|\hat{\beta} - \beta^0\|_1.
\]
We'd like to bound $\|X^T \varepsilon\|_\infty$, but it can be arbitrarily large, since $\varepsilon$ is Gaussian. However, with high probability it is small. Precisely, define the event
\[
\Omega = \left\{ \frac{1}{n} \|X^T \varepsilon\|_\infty \le \lambda \right\}.
\]
In a later lemma, we will show that $P(\Omega) \ge 1 - 2p^{-(A^2/2 - 1)}$. Assuming $\Omega$ holds, we have
\[
\frac{1}{2n} \|X\beta^0 - X\hat{\beta}\|_2^2 \le \lambda \|\hat{\beta} - \beta^0\|_1 - \lambda \|\hat{\beta}\|_1 + \lambda \|\beta^0\|_1 \le 2\lambda \|\beta^0\|_1,
\]
where the last step uses the triangle inequality $\|\hat{\beta} - \beta^0\|_1 \le \|\hat{\beta}\|_1 + \|\beta^0\|_1$.
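The probability bound for $\Omega$ is deferred to the later lemma; in the meantime, a quick simulation (an illustrative check, not part of the proof; the dimensions and seed are arbitrary) can estimate $P(\Omega)$ for the $\lambda$ used in the theorem.
\begin{verbatim}
# Illustrative check (not part of the proof): estimate P(Omega) where
# Omega = { (1/n)||X^T eps||_inf <= lambda }, lambda = A sigma sqrt(log p / n).
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma, A = 200, 500, 1.0, 2.0
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)   # columns have l2 norm sqrt(n)
lam = A * sigma * np.sqrt(np.log(p) / n)

reps, hits = 2000, 0
for _ in range(reps):
    eps = sigma * rng.standard_normal(n)
    if np.max(np.abs(X.T @ eps)) / n <= lam:
        hits += 1

print("estimated P(Omega):", hits / reps)
print("lower bound from the lemma:", 1 - 2 * p ** (-(A ** 2 / 2 - 1)))
\end{verbatim}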