3.6 Computation of Lasso solutions
We have had enough of bounding things. In this section, let’s think about how
we can actually run the Lasso. What we present here is actually a rather general
method to find the minimum of a function, known as coordinate descent.
Suppose we have a function $f\colon \mathbb{R}^d \to \mathbb{R}$. We start with an initial guess $x^{(0)}$ and repeat for $m = 1, 2, \ldots$
\[
\begin{aligned}
  x_1^{(m)} &= \operatorname*{argmin}_{x_1} f(x_1, x_2^{(m-1)}, \ldots, x_d^{(m-1)})\\
  x_2^{(m)} &= \operatorname*{argmin}_{x_2} f(x_1^{(m)}, x_2, x_3^{(m-1)}, \ldots, x_d^{(m-1)})\\
  &\;\;\vdots\\
  x_d^{(m)} &= \operatorname*{argmin}_{x_d} f(x_1^{(m)}, x_2^{(m)}, \ldots, x_{d-1}^{(m)}, x_d)
\end{aligned}
\]
until the result stabilizes.
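As a concrete illustration, here is a minimal Python sketch of this cyclic scheme, using a generic numerical one-dimensional minimization for each coordinate update; the function names here (`coordinate_descent`, `f_j`) are ours for illustration, not from any particular library.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_descent(f, x0, n_sweeps=100, tol=1e-8):
    """Cyclic coordinate descent: repeatedly minimize f over one coordinate at a time."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_sweeps):
        x_old = x.copy()
        for j in range(x.size):
            # argmin over the j-th coordinate, all other coordinates held fixed
            def f_j(t, j=j):
                z = x.copy()
                z[j] = t
                return f(z)
            x[j] = minimize_scalar(f_j).x
        if np.max(np.abs(x - x_old)) < tol:  # stop once the result stabilizes
            break
    return x

# Example: f(x) = (x[0] - 1)^2 + |x[0]| + (x[1] + 2)^2 has minimizer (0.5, -2).
x_hat = coordinate_descent(lambda x: (x[0] - 1)**2 + abs(x[0]) + (x[1] + 2)**2, [0.0, 0.0])
```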
This was proposed for solving the Lasso a long time ago, and a Stanford group tried this out. However, instead of using $x_1^{(m)}$ when computing $x_2^{(m)}$, they used $x_1^{(m-1)}$ instead. This turned out to be pretty useless, and so the group abandoned the method. After trying some other methods, which weren't very good, they decided to revisit this method and fixed their mistake.
For this to work well, of course, the coordinatewise minimizations have to be easy (which is the case for the Lasso, where we even have explicit solutions). This converges to the global minimizer if the minimizer is unique, $\{x : f(x) \leq f(x^{(0)})\}$ is compact, and if $f$ has the form
\[
  f(x) = g(x) + \sum_j h_j(x_j),
\]
where $g$ is convex and differentiable, and each $h_j$ is convex but not necessarily differentiable. In the case of the Lasso, $g$ is the least squares term and the $h_j$ make up the $\ell_1$ term.
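In the Lasso case the explicit coordinatewise updates are soft-thresholding steps. Here is a minimal sketch, assuming the objective is written as $\frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1$ (the same $1/n$ normalization as in the KKT check below) and that no column of $X$ is identically zero; the function names are illustrative, not from a particular package.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, Y, lam, beta0=None, n_sweeps=100, tol=1e-8):
    """Cyclic coordinate descent for (1/(2n))||Y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else np.asarray(beta0, dtype=float).copy()
    col_sq = (X ** 2).sum(axis=0) / n      # (1/n) X_j^T X_j for each column j (assumed nonzero)
    resid = Y - X @ beta                   # maintain the residual for efficiency
    for _ in range(n_sweeps):
        beta_old = beta.copy()
        for j in range(p):
            # (1/n) X_j^T (partial residual excluding coordinate j)
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new_bj)   # keep the residual up to date
            beta[j] = new_bj
        if np.max(np.abs(beta - beta_old)) < tol:   # result has stabilized
            break
    return beta
```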
There are two things we can do to make this faster. We typically solve the Lasso on a grid of $\lambda$ values $\lambda_0 > \lambda_1 > \cdots > \lambda_L$, and then pick the appropriate $\lambda$ by $v$-fold cross-validation. In this case, we can start solving at $\lambda_0$, and then for each $i > 0$, we solve the $\lambda = \lambda_i$ problem by picking $x^{(0)}$ to be the optimal solution to the $\lambda_{i-1}$ problem. In fact, even if we already have a fixed $\lambda$ value we want to use, it is often advantageous to solve the Lasso with a larger $\lambda$ value first, and then use that as a warm start to get to our desired $\lambda$ value.
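A minimal sketch of this warm-start strategy, reusing a solver with the signature of the `lasso_coordinate_descent` sketch above:

```python
import numpy as np

def lasso_path(X, Y, lambdas, solver):
    """Solve the Lasso along a decreasing grid of lambda values with warm starts.

    `lambdas` should be decreasing; `solver(X, Y, lam, beta0)` is any
    coordinatewise Lasso solver, e.g. the lasso_coordinate_descent sketch above.
    """
    beta = np.zeros(X.shape[1])                # starting point for the largest lambda
    path = []
    for lam in lambdas:
        beta = solver(X, Y, lam, beta0=beta)   # warm start from the previous lambda
        path.append(beta.copy())
    return path
```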
Another strategy is an active set strategy. If $p$ is large, then this loop may take a very long time. Since we know the Lasso should set a lot of things to zero, for $\ell = 1, \ldots, L$, we set
\[
  A = \{k : \hat\beta^L_{\lambda_{\ell-1}, k} \neq 0\}.
\]
We then perform coordinate descent only on coordinates in $A$ to obtain a Lasso solution $\hat\beta$ with $\hat\beta_{A^c} = 0$. This may not be the actual Lasso solution. To check this, we use the KKT conditions. We set
\[
  V = \left\{ k \in A^c : \frac{1}{n} \big|X_k^T (Y - X\hat\beta)\big| > \lambda_\ell \right\}.
\]
If $V = \emptyset$, we are done. Otherwise, we add $V$ to our active set $A$, and then run coordinate descent again on this active set.
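A sketch of this active set loop with the KKT check, again assuming a coordinatewise solver like the one sketched above; here `beta_prev` stands for the solution at the previous grid value $\lambda_{\ell-1}$, and the function name is our own.

```python
import numpy as np

def active_set_lasso(X, Y, lam, beta_prev, solver, max_rounds=20):
    """Active set strategy: solve on the guessed support, then verify the KKT conditions."""
    n, p = X.shape
    A = set(np.flatnonzero(beta_prev))                  # A = {k : previous solution nonzero at k}
    beta = np.asarray(beta_prev, dtype=float).copy()    # zero outside A by construction
    for _ in range(max_rounds):
        idx = sorted(A)
        if idx:
            # coordinate descent restricted to the active set A
            beta[idx] = solver(X[:, idx], Y, lam, beta0=beta[idx])
        # KKT check on the inactive coordinates
        grad = np.abs(X.T @ (Y - X @ beta)) / n
        V = [k for k in range(p) if k not in A and grad[k] > lam]
        if not V:          # V is empty: beta is a genuine Lasso solution
            return beta
        A.update(V)        # otherwise enlarge the active set and repeat
    return beta
```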