3.6 Computation of Lasso solutions
We have had enough of bounding things. In this section, let’s think about how
we can actually run the Lasso. What we present here is actually a rather general
method to find the minimum of a function, known as coordinate descent.
Suppose we have a function $f\colon \mathbb{R}^d \to \mathbb{R}$. We start with an initial guess $x^{(0)}$ and repeat for $m = 1, 2, \ldots$
\[
\begin{aligned}
  x_1^{(m)} &= \operatorname*{argmin}_{x_1} f(x_1, x_2^{(m-1)}, \ldots, x_d^{(m-1)})\\
  x_2^{(m)} &= \operatorname*{argmin}_{x_2} f(x_1^{(m)}, x_2, x_3^{(m-1)}, \ldots, x_d^{(m-1)})\\
  &\;\;\vdots\\
  x_d^{(m)} &= \operatorname*{argmin}_{x_d} f(x_1^{(m)}, x_2^{(m)}, \ldots, x_{d-1}^{(m)}, x_d)
\end{aligned}
\]
until the result stabilizes.
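As a concrete illustration, here is a minimal Python sketch of this cyclic scheme, using a generic numerical one-dimensional minimization for each coordinate update; the function names here (`coordinate_descent`, `f_j`) are ours for illustration, not from any particular library.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_descent(f, x0, n_sweeps=100, tol=1e-8):
    """Cyclic coordinate descent: repeatedly minimize f over one coordinate at a time."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_sweeps):
        x_old = x.copy()
        for j in range(x.size):
            # argmin over the j-th coordinate, all other coordinates held fixed
            def f_j(t, j=j):
                z = x.copy()
                z[j] = t
                return f(z)
            x[j] = minimize_scalar(f_j).x
        if np.max(np.abs(x - x_old)) < tol:  # stop once the result stabilizes
            break
    return x

# Example: f(x) = (x[0] - 1)^2 + |x[0]| + (x[1] + 2)^2 has minimizer (0.5, -2).
x_hat = coordinate_descent(lambda x: (x[0] - 1)**2 + abs(x[0]) + (x[1] + 2)**2, [0.0, 0.0])
```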
This was proposed for solving the Lasso a long time ago, and a Stanford group tried this out. However, instead of using $x_1^{(m)}$ when computing $x_2^{(m)}$, they used $x_1^{(m-1)}$ instead. This turned out to be pretty useless, and so the group abandoned the method. After trying some other methods, which weren't very good, they decided to revisit this method and fixed their mistake.
For this to work well, of course, the coordinatewise minimizations have to be easy (which is the case for the Lasso, where we even have explicit solutions). This converges to the global minimizer if the minimizer is unique, $\{x : f(x) \leq f(x^{(0)})\}$ is compact, and if $f$ has the form
\[
  f(x) = g(x) + \sum_j h_j(x_j),
\]
where $g$ is convex and differentiable, and each $h_j$ is convex but not necessarily differentiable. In the case of the Lasso, $g$ is the least squares term and the $h_j$ make up the $\ell_1$ term.
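In the Lasso case the explicit coordinatewise updates are soft-thresholding steps. Here is a minimal sketch, assuming the objective is written as $\frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1$ (the same $1/n$ normalization as in the KKT check below) and that no column of $X$ is identically zero; the function names are illustrative, not from a particular package.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, Y, lam, beta0=None, n_sweeps=100, tol=1e-8):
    """Cyclic coordinate descent for (1/(2n))||Y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else np.asarray(beta0, dtype=float).copy()
    col_sq = (X ** 2).sum(axis=0) / n      # (1/n) X_j^T X_j for each column j (assumed nonzero)
    resid = Y - X @ beta                   # maintain the residual for efficiency
    for _ in range(n_sweeps):
        beta_old = beta.copy()
        for j in range(p):
            # (1/n) X_j^T (partial residual excluding coordinate j)
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new_bj)   # keep the residual up to date
            beta[j] = new_bj
        if np.max(np.abs(beta - beta_old)) < tol:   # result has stabilized
            break
    return beta
```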
There are two things we can do to make this faster. We typically solve the Lasso on a grid of $\lambda$ values $\lambda_0 > \lambda_1 > \cdots > \lambda_L$, and then pick the appropriate $\lambda$ by $v$-fold cross-validation. In this case, we can start solving at $\lambda_0$, and then for each $i > 0$, we solve the $\lambda = \lambda_i$ problem by picking $x^{(0)}$ to be the optimal solution to the $\lambda_{i-1}$ problem. In fact, even if we already have a fixed $\lambda$ value we want to use, it is often advantageous to solve the Lasso with a larger $\lambda$ value first, and then use that as a warm start to get to our desired $\lambda$ value.
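A minimal sketch of this warm-start strategy, reusing a solver with the signature of the `lasso_coordinate_descent` sketch above:

```python
import numpy as np

def lasso_path(X, Y, lambdas, solver):
    """Solve the Lasso along a decreasing grid of lambda values with warm starts.

    `lambdas` should be decreasing; `solver(X, Y, lam, beta0)` is any
    coordinatewise Lasso solver, e.g. the lasso_coordinate_descent sketch above.
    """
    beta = np.zeros(X.shape[1])                # starting point for the largest lambda
    path = []
    for lam in lambdas:
        beta = solver(X, Y, lam, beta0=beta)   # warm start from the previous lambda
        path.append(beta.copy())
    return path
```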
Another strategy is an active set strategy. If $p$ is large, then this loop may take a very long time. Since we know the Lasso should set a lot of things to zero, for $\ell = 1, \ldots, L$, we set
\[
  A = \{k : \hat\beta^L_{\lambda_{\ell-1}, k} \neq 0\}.
\]
We then perform coordinate descent only on coordinates in $A$ to obtain a Lasso solution $\hat\beta$ with $\hat\beta_{A^c} = 0$. This may not be the actual Lasso solution. To check this, we use the KKT conditions. We set
\[
  V = \left\{ k \in A^c : \frac{1}{n} \big|X_k^T (Y - X\hat\beta)\big| > \lambda_\ell \right\}.
\]
If $V = \emptyset$, we are done. Otherwise, we add $V$ to our active set $A$, and then run coordinate descent again on this active set.
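A sketch of this active set loop with the KKT check, again assuming a coordinatewise solver like the one sketched above; here `beta_prev` stands for the solution at the previous grid value $\lambda_{\ell-1}$, and the function name is our own.

```python
import numpy as np

def active_set_lasso(X, Y, lam, beta_prev, solver, max_rounds=20):
    """Active set strategy: solve on the guessed support, then verify the KKT conditions."""
    n, p = X.shape
    A = set(np.flatnonzero(beta_prev))                  # A = {k : previous solution nonzero at k}
    beta = np.asarray(beta_prev, dtype=float).copy()    # zero outside A by construction
    for _ in range(max_rounds):
        idx = sorted(A)
        if idx:
            # coordinate descent restricted to the active set A
            beta[idx] = solver(X[:, idx], Y, lam, beta0=beta[idx])
        # KKT check on the inactive coordinates
        grad = np.abs(X.T @ (Y - X @ beta)) / n
        V = [k for k in range(p) if k not in A and grad[k] > lam]
        if not V:          # V is empty: beta is a genuine Lasso solution
            return beta
        A.update(V)        # otherwise enlarge the active set and repeat
    return beta
```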