3.7 Extensions of the Lasso
There are many ways we can modify the Lasso. The first thing we might want to change is to replace the least squares loss with other log-likelihoods. Another way to modify the Lasso is to replace the $\ell_1$ penalty with something else in order to encourage a different form of sparsity.
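For instance, with a binary response we can pair the logistic log-likelihood with an $\ell_1$ penalty. A minimal sketch, assuming scikit-learn is available (the simulated data and the tuning value C are illustrative, not part of the notes):

```python
# Sketch: l1-penalised logistic regression (Lasso-type penalty, logistic loss).
# C is the inverse of the penalty strength; its value here is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:5] = 1.0                          # sparse true coefficient vector
prob = 1 / (1 + np.exp(-X @ beta0))
y = rng.binomial(1, prob)

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
model.fit(X, y)
print("non-zero coefficients:", np.sum(model.coef_ != 0))
```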
Example (Group Lasso). Given a partition $G_1 \cup \cdots \cup G_q = \{1, \ldots, p\}$, the group Lasso penalty is
\[
  \lambda \sum_{j=1}^{q} m_j \|\beta_{G_j}\|_2,
\]
where the $m_j$ are weights that account for the fact that the groups have different sizes. Typically, we take $m_j = \sqrt{|G_j|}$.
If we take $G_i = \{i\}$, then we recover the original Lasso. If we take $q = 1$, then we recover ridge regression. The effect of the penalty is to encourage the coefficients within a group to be either all zero or all non-zero.
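A common way to compute the group Lasso is proximal gradient descent, where each group is shrunk as a block by the group soft-thresholding operator. A minimal sketch in NumPy (the function names and iteration count are illustrative, and the groups are assumed to partition the coordinates):

```python
# Sketch: proximal gradient (ISTA) for the group Lasso with squared-error loss.
# groups is a list of index arrays partitioning {0, ..., p-1}; weights m_j = sqrt(|G_j|).
import numpy as np

def group_soft_threshold(v, t):
    """Proximal operator of t * ||.||_2: shrink the whole block towards zero."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)
    return (1 - t / norm) * v

def group_lasso(X, Y, groups, lam, n_iter=500):
    n, p = X.shape
    beta = np.zeros(p)
    # step = 1/L, where L = ||X||_2^2 / n is the Lipschitz constant of the gradient
    step = n / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        grad = -X.T @ (Y - X @ beta) / n      # gradient of (1/2n)||Y - X beta||^2
        z = beta - step * grad
        for G in groups:
            m_j = np.sqrt(len(G))
            beta[G] = group_soft_threshold(z[G], step * lam * m_j)
    return beta
```

Each group is either set exactly to zero or shrunk as a block, matching the all-zero/all-non-zero behaviour described above.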
Example. Another variation is the fused Lasso. If $\beta^0_{j+1}$ is expected to be close to $\beta^0_j$, then a fused Lasso penalty may be appropriate, with
\[
  \lambda_1 \sum_{j=1}^{p-1} |\beta_{j+1} - \beta_j| + \lambda_2 \|\beta\|_1.
\]
For example, if
\[
  Y_i = \mu_i + \varepsilon_i,
\]
and we believe that $(\mu_i)_{i=1}^n$ forms a piecewise constant sequence, we can estimate $\mu^0$ by
\[
  \operatorname*{argmin}_\mu \left\{ \|Y - \mu\|_2^2 + \lambda \sum_{i=1}^{n-1} |\mu_{i+1} - \mu_i| \right\}.
\]
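A hedged sketch of this piecewise-constant estimator using cvxpy (the choice of library, the simulated signal, and the value of $\lambda$ are assumptions, not part of the notes):

```python
# Sketch: fused-lasso signal estimation
#   argmin_mu ||Y - mu||_2^2 + lam * sum_i |mu_{i+1} - mu_i|.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
mu0 = np.repeat([0.0, 2.0, -1.0, 1.0], 25)      # piecewise constant mean
Y = mu0 + rng.standard_normal(mu0.size)

lam = 5.0                                        # illustrative tuning parameter
mu = cp.Variable(mu0.size)
objective = cp.Minimize(cp.sum_squares(Y - mu) + lam * cp.norm1(mu[1:] - mu[:-1]))
cp.Problem(objective).solve()
mu_hat = mu.value                                # estimated piecewise constant sequence
```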
Example (Elastic net). We can use
\[
  \hat\beta^{\mathrm{EN}}_{\lambda,\alpha} = \operatorname*{argmin}_\beta \left\{ \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\bigl(\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|_2^2\bigr) \right\}
\]
for $\alpha \in [0, 1]$. The $\ell_2$ penalty encourages highly positively correlated variables to have similar estimated coefficients.
For example, if we have duplicate columns, then the $\ell_1$ penalty encourages us to take one of the coefficients to be 0, while the $\ell_2$ penalty encourages the coefficients to be the same.
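A small illustration of this behaviour, assuming scikit-learn is available (note that sklearn's ElasticNet uses a slightly different parameterisation, with a factor $1/2$ on the squared $\ell_2$ term; the data and tuning values below are illustrative):

```python
# Sketch: with a duplicated column, the Lasso tends to pick one copy while the
# elastic net spreads the coefficient over both copies.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(2)
n = 200
x = rng.standard_normal(n)
X = np.column_stack([x, x, rng.standard_normal(n)])   # first two columns identical
Y = 2 * x + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, Y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, Y)
print("Lasso coefficients:      ", lasso.coef_)   # typically only one copy non-zero
print("Elastic net coefficients:", enet.coef_)    # the two copies get similar coefficients
```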
Another class of variations tries to reduce the bias of the Lasso. Although the bias of the Lasso is a necessary by-product of reducing the variance of the estimate, it is sometimes desirable to reduce this bias.
The LARS-OLS hybrid takes the $\hat S_\lambda$ obtained by the Lasso, and then re-estimates $\beta^0_{\hat S_\lambda}$ by OLS. We can also re-estimate using the Lasso on $X_{\hat S_\lambda}$, and this is known as the relaxed Lasso.
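A minimal sketch of the two-stage LARS-OLS hybrid idea (Lasso for selection, then OLS on the selected columns), assuming scikit-learn; the function name and tuning parameter are illustrative:

```python
# Sketch: select variables with the Lasso, then refit by OLS on the selected columns only.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def lasso_then_ols(X, Y, lam):
    lasso = Lasso(alpha=lam).fit(X, Y)
    S_hat = np.flatnonzero(lasso.coef_)        # estimated support from the Lasso
    beta_hat = np.zeros(X.shape[1])
    if S_hat.size > 0:
        ols = LinearRegression().fit(X[:, S_hat], Y)
        beta_hat[S_hat] = ols.coef_            # refit on the support to remove shrinkage
    return beta_hat
```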
In the adaptive Lasso, we obtain an initial estimate $\hat\beta$ of $\beta^0$, e.g. with the Lasso, and then solve
\[
  \hat\beta^{\mathrm{adapt}}_\lambda = \operatorname*{argmin}_{\beta:\,\beta_{\hat S^c} = 0} \left\{ \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda \sum_{k \in \hat S} \frac{|\beta_k|}{|\hat\beta_k|} \right\},
\]
where $\hat S = \{k : \hat\beta_k \neq 0\}$ is the support of the initial estimate.
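One standard way to compute this is to absorb the weights into the design: rescaling column $k$ by $|\hat\beta_k|$ turns the weighted penalty into a plain $\ell_1$ penalty. A sketch assuming scikit-learn (the function name and tuning parameters are illustrative):

```python
# Sketch: adaptive Lasso. The weighted penalty sum_k |beta_k| / |beta_init_k| is handled
# by rescaling column k of X by |beta_init_k|, running a plain Lasso, and rescaling back.
# Columns outside the initial support are constrained to zero by dropping them.
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, Y, lam_init, lam):
    beta_init = Lasso(alpha=lam_init).fit(X, Y).coef_
    S_hat = np.flatnonzero(beta_init)                        # initial support
    beta_hat = np.zeros(X.shape[1])
    if S_hat.size > 0:
        weights = np.abs(beta_init[S_hat])
        X_scaled = X[:, S_hat] * weights                     # absorb the weights
        fit = Lasso(alpha=lam).fit(X_scaled, Y)
        beta_hat[S_hat] = fit.coef_ * weights                # undo the rescaling
    return beta_hat
```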
We can also try to use a non-convex penalty. We can attempt to solve
\[
  \operatorname*{argmin}_\beta \left\{ \frac{1}{2n}\|Y - X\beta\|_2^2 + \sum_{k=1}^{p} p_\lambda(|\beta_k|) \right\},
\]
where $p_\lambda: [0, \infty) \to [0, \infty)$ is a non-convex function. One common example is the MCP (minimax concave penalty), whose derivative is given by
\[
  p'_\lambda(u) = \left(\lambda - \frac{u}{\gamma}\right)_+,
\]
where $\gamma$ is an extra tuning parameter. This tends to give estimates even sparser than the Lasso.
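Integrating this derivative from $0$ gives the penalty explicitly:
\[
  p_\lambda(u) = \int_0^u \left(\lambda - \frac{t}{\gamma}\right)_+ \,\mathrm{d}t =
  \begin{cases}
    \lambda u - \dfrac{u^2}{2\gamma} & \text{if } u \le \gamma\lambda,\\[4pt]
    \dfrac{\gamma\lambda^2}{2} & \text{if } u > \gamma\lambda,
  \end{cases}
\]
so the penalty behaves like the Lasso penalty near zero but flattens out beyond $\gamma\lambda$, which reduces the shrinkage bias on large coefficients.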