3 The Lasso and beyond

3.1 The Lasso estimator
The Lasso (Tibshirani, 1996) is a seemingly small modification of Ridge regression
that solves our problems. It solves
\[
(\hat{\mu}^L_\lambda, \hat{\beta}^L_\lambda) = \operatorname*{argmin}_{(\mu, \beta) \in \mathbb{R} \times \mathbb{R}^p} \frac{1}{2n} \|Y - \mu \mathbf{1} - X\beta\|_2^2 + \lambda \|\beta\|_1.
\]
The key difference is that we use an $\ell_1$ norm on $\beta$ rather than the $\ell_2$ norm,
\[
\|\beta\|_1 = \sum_{k=1}^p |\beta_k|.
\]
This makes it drastically different from Ridge regression. We will see that for $\lambda$ large, it will make all of the entries of $\beta$ exactly zero, as opposed to being merely very close to zero.
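This sparsity is easy to see numerically. Below is a minimal sketch (not from the notes; the data, penalty levels and use of scikit-learn are all illustrative assumptions) comparing the fitted coefficient vectors of the Lasso and Ridge regression. Note that scikit-learn's Lasso minimizes $\frac{1}{2n}\|Y - X\beta\|_2^2 + \alpha\|\beta\|_1$, so its alpha plays the role of $\lambda$ here.
\begin{verbatim}
# Illustrative sketch (not from the notes): compare the sparsity of Lasso
# and Ridge coefficient estimates on simulated data.  All settings below
# (dimensions, penalty levels, seed) are arbitrary choices.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p, s = 100, 200, 5                         # n observations, p predictors, s signals
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                           # center the columns
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)   # scale columns to l2 norm sqrt(n)
beta0 = np.zeros(p)
beta0[:s] = 2.0                               # sparse true coefficient vector
y = X @ beta0 + rng.standard_normal(n)
y -= y.mean()                                 # center the response

# scikit-learn's Lasso minimizes (1/(2n))||y - Xb||_2^2 + alpha ||b||_1,
# so alpha plays the role of lambda in Q_lambda.  (Ridge penalizes
# alpha ||b||_2^2 without the 1/(2n) factor.)
lasso = Lasso(alpha=0.3).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("non-zero Lasso coefficients:", np.sum(lasso.coef_ != 0))
print("non-zero Ridge coefficients:", np.sum(np.abs(ridge.coef_) > 1e-8))
\end{verbatim}
Typically the Lasso fit has only a handful of non-zero coefficients, while essentially every Ridge coefficient is non-zero.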
We can compare this to best subset regression, where we replace $\|\beta\|_1$ with something of the form $\sum_{k=1}^p \mathbf{1}_{\{|\beta_k| > 0\}}$. But the beautiful property of the Lasso is that the optimization problem is now continuous, and in fact convex. This allows us to solve it using standard convex optimization techniques.
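For instance, one standard convex-optimization approach (an illustration only, not necessarily the algorithm used later in the course) is coordinate descent: with all other coordinates held fixed, minimizing $Q_\lambda$ over a single $\beta_k$ has a closed-form solution given by soft-thresholding. A rough sketch, assuming $X$ and $Y$ have already been centered:
\begin{verbatim}
# Rough sketch of coordinate descent for Q_lambda (illustrative; assumes X
# and y have been centered, with columns of X scaled to l2 norm sqrt(n)).
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/(2n))||y - X b||_2^2 + lam ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y - X @ beta                      # current residual y - X beta
    col_sq = (X ** 2).sum(axis=0) / n         # equals 1 when columns have norm sqrt(n)
    for _ in range(n_iter):
        for j in range(p):
            rj = resid + X[:, j] * beta[j]    # partial residual excluding coordinate j
            zj = X[:, j] @ rj / n
            new_bj = soft_threshold(zj, lam) / col_sq[j]
            resid = rj - X[:, j] * new_bj     # keep the residual up to date
            beta[j] = new_bj
    return beta

# Usage (with centered data): betahat = lasso_cd(X, y, lam=0.3)
\end{verbatim}
Each coordinate update is exact because the $\ell_1$ penalty separates across coordinates; by contrast, the best subset penalty would force a combinatorial search over subsets.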
Why is the $\ell_1$ norm so different from the $\ell_2$ norm? Just as in Ridge regression, we may center and scale $X$, and center $Y$, so that we can remove $\mu$ from the objective. Define
\[
Q_\lambda(\beta) = \frac{1}{2n} \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1.
\]
Any minimizer $\hat{\beta}^L_\lambda$ of $Q_\lambda(\beta)$ must also be a solution to
\[
\min \|Y - X\beta\|_2^2 \quad \text{subject to } \|\beta\|_1 \le \|\hat{\beta}^L_\lambda\|_1.
\]
Similarly, $\hat{\beta}^R_\lambda$ is a solution of
\[
\min \|Y - X\beta\|_2^2 \quad \text{subject to } \|\beta\|_2 \le \|\hat{\beta}^R_\lambda\|_2.
\]
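To see why the first of these equivalences holds (a short argument not spelled out above): if some $\beta$ with $\|\beta\|_1 \le \|\hat{\beta}^L_\lambda\|_1$ had $\|Y - X\beta\|_2^2 < \|Y - X\hat{\beta}^L_\lambda\|_2^2$, then
\[
Q_\lambda(\beta) = \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1 < \frac{1}{2n}\|Y - X\hat{\beta}^L_\lambda\|_2^2 + \lambda\|\hat{\beta}^L_\lambda\|_1 = Q_\lambda(\hat{\beta}^L_\lambda),
\]
contradicting the fact that $\hat{\beta}^L_\lambda$ minimizes $Q_\lambda$. The same argument applies to Ridge regression.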
So imagine we are given a value of $\|\hat{\beta}^L_\lambda\|_1$, and we try to solve the above optimization problem with pictures. The region $\|\beta\|_1 \le \|\hat{\beta}^L_\lambda\|_1$ is given by a rotated square.
On the other hand, the minimum of $\|Y - X\beta\|_2^2$ is at $\hat{\beta}^{\mathrm{OLS}}$, and the contours are ellipses centered around this point.
To solve the minimization problem, we should pick the smallest contour that hits the square, and pick the intersection point to be our estimate of $\beta^0$. The point is that since the unit ball in the $\ell_1$-norm has corners, this estimate is likely to lie at a corner, and hence to have a lot of zeroes. Compare this to the case of Ridge regression, where the constraint set is a ball. Generically, for Ridge regression, we would expect the solution to be non-zero everywhere.
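The pictures referred to above are not reproduced in this text. The following matplotlib sketch (purely illustrative; the OLS location and contour shapes are made up) draws the usual version of them: an $\ell_1$ ball and an $\ell_2$ ball against elliptical contours of the least-squares criterion.
\begin{verbatim}
# Purely illustrative sketch of the usual picture: l1 and l2 constraint
# sets against elliptical contours of ||Y - X beta||_2^2 centered at an
# (invented) OLS estimate.
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2 * np.pi, 400)
b_ols = np.array([1.5, 0.8])                  # made-up OLS location
fig, axes = plt.subplots(1, 2, figsize=(8, 4), sharex=True, sharey=True)

for ax, name in zip(axes, ["Lasso (l1 ball)", "Ridge (l2 ball)"]):
    for r in [0.4, 0.8, 1.2]:                 # elliptical least-squares contours
        ax.plot(b_ols[0] + 1.4 * r * np.cos(theta),
                b_ols[1] + 0.7 * r * np.sin(theta), "C1")
    if "l1" in name:
        ax.plot([1, 0, -1, 0, 1], [0, 1, 0, -1, 0], "C0")   # rotated square
    else:
        ax.plot(np.cos(theta), np.sin(theta), "C0")          # circle
    ax.set_title(name)
    ax.set_aspect("equal")

plt.show()
\end{verbatim}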
More generally, we can try to use the $\ell_q$ norm given by
\[
\|\beta\|_q = \left( \sum_{k=1}^p |\beta_k|^q \right)^{1/q}.
\]
We can plot their unit spheres and see what they look like:
[Figure: unit balls of the $\ell_q$ norm for $q = 0.5$, $q = 1$, $q = 1.5$ and $q = 2$.]
We see that $q = 1$ is the smallest value of $q$ for which there are corners, and also the largest value of $q$ for which the constraint set is still convex. Thus, $q = 1$ is the sweet spot for doing regression.
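The failure of convexity for $q < 1$ is easy to check numerically; the snippet below (illustrative only) shows that the midpoint of two points on the unit sphere leaves the unit ball when $q = 0.5$ but not when $q \ge 1$.
\begin{verbatim}
# Quick numerical check (illustrative): for q < 1 the "ball"
# {beta : ||beta||_q <= 1} is not convex, while for q >= 1 it is.
import numpy as np

def lq(beta, q):
    return np.sum(np.abs(beta) ** q) ** (1.0 / q)

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # both on the unit sphere
mid = 0.5 * (a + b)
for q in [0.5, 1.0, 1.5, 2.0]:
    # > 1 only for q = 0.5: the midpoint escapes the unit ball
    print(q, lq(mid, q))
\end{verbatim}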
But is this actually good? Suppose the columns of $X$ are scaled to have $\ell_2$ norm $\sqrt{n}$, and, after centering, we have a normal linear model
\[
Y = X\beta^0 + \varepsilon - \bar{\varepsilon}\mathbf{1}, \quad \text{with } \varepsilon \sim N_n(0, \sigma^2 I).
\]
Theorem. Let $\hat{\beta}$ be the Lasso solution with
\[
\lambda = A\sigma \sqrt{\frac{\log p}{n}}
\]
for some $A$. Then with probability $1 - 2p^{-(A^2/2 - 1)}$, we have
\[
\frac{1}{n} \|X\beta^0 - X\hat{\beta}\|_2^2 \le 4A\sigma \sqrt{\frac{\log p}{n}} \|\beta^0\|_1.
\]
Crucially, this bound depends on $p$ only through $\log p$, and not $p$ itself. On the other hand, unlike what we usually see for ordinary least squares, the rate is $\frac{1}{\sqrt{n}}$, and not $\frac{1}{n}$. We will later obtain better bounds when we make some assumptions on the design matrix.
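Before the proof, here is a quick simulation sketch of the theorem (not from the notes; the dimensions, sparsity, seed and value of $A$ are illustrative assumptions), again using scikit-learn's Lasso, whose alpha parameter matches $\lambda$ in $Q_\lambda$.
\begin{verbatim}
# Illustrative simulation (not from the notes) of the bound
# (1/n)||X beta0 - X betahat||_2^2 <= 4 A sigma sqrt(log p / n) ||beta0||_1.
# Dimensions, sparsity, seed and A are arbitrary choices.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, sigma, A = 200, 500, 1.0, 2.0
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)   # columns have l2 norm sqrt(n)
beta0 = np.zeros(p)
beta0[:10] = 1.0
eps = sigma * rng.standard_normal(n)
y = X @ beta0 + eps - eps.mean()              # Y = X beta0 + eps - epsbar 1

lam = A * sigma * np.sqrt(np.log(p) / n)
betahat = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

lhs = np.sum((X @ (beta0 - betahat)) ** 2) / n
rhs = 4 * lam * np.sum(np.abs(beta0))
print(lhs, "<=", rhs, ":", lhs <= rhs)
\end{verbatim}
The inequality should hold in the vast majority of runs, since the failure probability $2p^{-(A^2/2 - 1)}$ is tiny for these choices.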
Proof.
We don't really have a closed-form solution for $\hat{\beta}$, and in general it doesn't exist. So the only thing we can use is that it in fact minimizes $Q_\lambda(\beta)$. Thus, by definition, we have
\[
\frac{1}{2n} \|Y - X\hat{\beta}\|_2^2 + \lambda \|\hat{\beta}\|_1 \le \frac{1}{2n} \|Y - X\beta^0\|_2^2 + \lambda \|\beta^0\|_1.
\]
We know exactly what $Y$ is: it is $X\beta^0 + \varepsilon - \bar{\varepsilon}\mathbf{1}$. If we plug this in, we get
\[
\frac{1}{2n} \|X\beta^0 - X\hat{\beta}\|_2^2 \le \frac{1}{n} \varepsilon^T X(\hat{\beta} - \beta^0) + \lambda \|\beta^0\|_1 - \lambda \|\hat{\beta}\|_1.
\]
Here we use the fact that $X$ is centered, and so is orthogonal to $\mathbf{1}$.
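In more detail (spelling out the substitution): we expand
\[
\|Y - X\hat{\beta}\|_2^2 = \|X(\beta^0 - \hat{\beta}) + \varepsilon - \bar{\varepsilon}\mathbf{1}\|_2^2 = \|X(\beta^0 - \hat{\beta})\|_2^2 + 2\varepsilon^T X(\beta^0 - \hat{\beta}) + \|\varepsilon - \bar{\varepsilon}\mathbf{1}\|_2^2,
\]
where the cross term simplifies because $(\varepsilon - \bar{\varepsilon}\mathbf{1})^T X = \varepsilon^T X$, while $\|Y - X\beta^0\|_2^2 = \|\varepsilon - \bar{\varepsilon}\mathbf{1}\|_2^2$; substituting these into the previous inequality and cancelling the common term gives the display above.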
Now Hölder's inequality tells us
\[
|\varepsilon^T X(\hat{\beta} - \beta^0)| \le \|X^T \varepsilon\|_\infty \|\hat{\beta} - \beta^0\|_1.
\]
We'd like to bound $\|X^T \varepsilon\|_\infty$, but it can be arbitrarily large, since $\varepsilon$ is Gaussian. However, with high probability it is small. Precisely, define the event
\[
\Omega = \left\{ \frac{1}{n} \|X^T \varepsilon\|_\infty \le \lambda \right\}.
\]
In a later lemma, we will show that $P(\Omega) \ge 1 - 2p^{-(A^2/2 - 1)}$. Assuming $\Omega$ holds, we have
\[
\frac{1}{2n} \|X\beta^0 - X\hat{\beta}\|_2^2 \le \lambda \|\hat{\beta} - \beta^0\|_1 - \lambda \|\hat{\beta}\|_1 + \lambda \|\beta^0\|_1 \le 2\lambda \|\beta^0\|_1,
\]
where the last step uses the triangle inequality $\|\hat{\beta} - \beta^0\|_1 \le \|\hat{\beta}\|_1 + \|\beta^0\|_1$.
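The probability bound for $\Omega$ is deferred to the later lemma; in the meantime, a quick simulation (an illustrative check, not part of the proof; the dimensions and seed are arbitrary) can estimate $P(\Omega)$ for the $\lambda$ used in the theorem.
\begin{verbatim}
# Illustrative check (not part of the proof): estimate P(Omega) where
# Omega = { (1/n)||X^T eps||_inf <= lambda }, lambda = A sigma sqrt(log p / n).
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma, A = 200, 500, 1.0, 2.0
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)   # columns have l2 norm sqrt(n)
lam = A * sigma * np.sqrt(np.log(p) / n)

reps, hits = 2000, 0
for _ in range(reps):
    eps = sigma * rng.standard_normal(n)
    if np.max(np.abs(X.T @ eps)) / n <= lam:
        hits += 1

print("estimated P(Omega):", hits / reps)
print("lower bound from the lemma:", 1 - 2 * p ** (-(A ** 2 / 2 - 1)))
\end{verbatim}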