3.7 Extensions of the Lasso
There are many ways we can modify the Lasso. The first thing we might want to change is to replace the least squares loss with other log-likelihoods. Another way to modify the Lasso is to replace the $\ell_1$ penalty with something else in order to encourage a different form of sparsity.
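For instance, with a binary response we can pair the logistic log-likelihood with an $\ell_1$ penalty. A minimal sketch, assuming scikit-learn is available (the simulated data and the tuning value C are illustrative, not part of the notes):

```python
# Sketch: l1-penalised logistic regression (Lasso-type penalty, logistic loss).
# C is the inverse of the penalty strength; its value here is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:5] = 1.0                          # sparse true coefficient vector
prob = 1 / (1 + np.exp(-X @ beta0))
y = rng.binomial(1, prob)

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
model.fit(X, y)
print("non-zero coefficients:", np.sum(model.coef_ != 0))
```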
Example (Group Lasso). Given a partition $G_1 \cup \cdots \cup G_q = \{1, \ldots, p\}$, the group Lasso penalty is
\[
  \lambda \sum_{j=1}^{q} m_j \|\beta_{G_j}\|_2,
\]
where the $m_j$ are weights that account for the fact that the groups have different sizes. Typically, we take $m_j = \sqrt{|G_j|}$.
If we take $G_i = \{i\}$, then we recover the original Lasso. If we take $q = 1$, then we recover ridge regression. The effect of the penalty is to encourage the coefficients within a group to be either all zero or all non-zero.
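A common way to compute the group Lasso is proximal gradient descent, where each group is shrunk as a block by the group soft-thresholding operator. A minimal sketch in NumPy (the function names and iteration count are illustrative, and the groups are assumed to partition the coordinates):

```python
# Sketch: proximal gradient (ISTA) for the group Lasso with squared-error loss.
# groups is a list of index arrays partitioning {0, ..., p-1}; weights m_j = sqrt(|G_j|).
import numpy as np

def group_soft_threshold(v, t):
    """Proximal operator of t * ||.||_2: shrink the whole block towards zero."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)
    return (1 - t / norm) * v

def group_lasso(X, Y, groups, lam, n_iter=500):
    n, p = X.shape
    beta = np.zeros(p)
    # step = 1/L, where L = ||X||_2^2 / n is the Lipschitz constant of the gradient
    step = n / np.linalg.norm(X, 2) ** 2
    for _ in range(n_iter):
        grad = -X.T @ (Y - X @ beta) / n      # gradient of (1/2n)||Y - X beta||^2
        z = beta - step * grad
        for G in groups:
            m_j = np.sqrt(len(G))
            beta[G] = group_soft_threshold(z[G], step * lam * m_j)
    return beta
```

Each group is either set exactly to zero or shrunk as a block, matching the all-zero/all-non-zero behaviour described above.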
Example. Another variation is the fused Lasso. If $\beta^0_{j+1}$ is expected to be close to $\beta^0_j$, then a fused Lasso penalty may be appropriate, with
\[
  \lambda_1 \sum_{j=1}^{p-1} |\beta_{j+1} - \beta_j| + \lambda_2 \|\beta\|_1.
\]
For example, if
\[
  Y_i = \mu_i + \varepsilon_i,
\]
and we believe that $(\mu_i)_{i=1}^n$ forms a piecewise constant sequence, we can estimate $\mu^0$ by
\[
  \operatorname*{argmin}_\mu \left\{ \|Y - \mu\|_2^2 + \lambda \sum_{i=1}^{n-1} |\mu_{i+1} - \mu_i| \right\}.
\]
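A hedged sketch of this piecewise-constant estimator using cvxpy (the choice of library, the simulated signal, and the value of $\lambda$ are assumptions, not part of the notes):

```python
# Sketch: fused-lasso signal estimation
#   argmin_mu ||Y - mu||_2^2 + lam * sum_i |mu_{i+1} - mu_i|.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
mu0 = np.repeat([0.0, 2.0, -1.0, 1.0], 25)      # piecewise constant mean
Y = mu0 + rng.standard_normal(mu0.size)

lam = 5.0                                        # illustrative tuning parameter
mu = cp.Variable(mu0.size)
objective = cp.Minimize(cp.sum_squares(Y - mu) + lam * cp.norm1(mu[1:] - mu[:-1]))
cp.Problem(objective).solve()
mu_hat = mu.value                                # estimated piecewise constant sequence
```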
Example (Elastic net). We can use
\[
  \hat\beta^{\mathrm{EN}}_{\lambda,\alpha} = \operatorname*{argmin}_\beta \left\{ \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\bigl(\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|_2^2\bigr) \right\}
\]
for $\alpha \in [0, 1]$. The $\ell_2$ penalty encourages highly positively correlated variables to have similar estimated coefficients.
For example, if we have duplicate columns, then the $\ell_1$ penalty encourages us to take one of the coefficients to be 0, while the $\ell_2$ penalty encourages the coefficients to be the same.
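A small illustration of this behaviour, assuming scikit-learn is available (note that sklearn's ElasticNet uses a slightly different parameterisation, with a factor $1/2$ on the squared $\ell_2$ term; the data and tuning values below are illustrative):

```python
# Sketch: with a duplicated column, the Lasso tends to pick one copy while the
# elastic net spreads the coefficient over both copies.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(2)
n = 200
x = rng.standard_normal(n)
X = np.column_stack([x, x, rng.standard_normal(n)])   # first two columns identical
Y = 2 * x + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, Y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, Y)
print("Lasso coefficients:      ", lasso.coef_)   # typically only one copy non-zero
print("Elastic net coefficients:", enet.coef_)    # the two copies get similar coefficients
```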
Another class of variations tries to reduce the bias of the Lasso. Although the bias of the Lasso is a necessary by-product of reducing the variance of the estimate, it is sometimes desirable to reduce this bias.
The LARS-OLS hybrid takes the $\hat S_\lambda$ obtained by the Lasso, and then re-estimates $\beta^0_{\hat S_\lambda}$ by OLS. We can also re-estimate using the Lasso on $X_{\hat S_\lambda}$, and this is known as the relaxed Lasso.
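A minimal sketch of the two-stage LARS-OLS hybrid idea (Lasso for selection, then OLS on the selected columns), assuming scikit-learn; the function name and tuning parameter are illustrative:

```python
# Sketch: select variables with the Lasso, then refit by OLS on the selected columns only.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def lasso_then_ols(X, Y, lam):
    lasso = Lasso(alpha=lam).fit(X, Y)
    S_hat = np.flatnonzero(lasso.coef_)        # estimated support from the Lasso
    beta_hat = np.zeros(X.shape[1])
    if S_hat.size > 0:
        ols = LinearRegression().fit(X[:, S_hat], Y)
        beta_hat[S_hat] = ols.coef_            # refit on the support to remove shrinkage
    return beta_hat
```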
In the adaptive Lasso, we obtain an initial estimate $\hat\beta$ of $\beta^0$, e.g. with the Lasso, and then solve
\[
  \hat\beta^{\mathrm{adapt}}_\lambda = \operatorname*{argmin}_{\beta:\,\beta_{\hat S^c} = 0} \left\{ \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda \sum_{k \in \hat S} \frac{|\beta_k|}{|\hat\beta_k|} \right\},
\]
where $\hat S = \{k : \hat\beta_k \neq 0\}$ is the support of the initial estimate.
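One standard way to compute this is to absorb the weights into the design: rescaling column $k$ by $|\hat\beta_k|$ turns the weighted penalty into a plain $\ell_1$ penalty. A sketch assuming scikit-learn (the function name and tuning parameters are illustrative):

```python
# Sketch: adaptive Lasso. The weighted penalty sum_k |beta_k| / |beta_init_k| is handled
# by rescaling column k of X by |beta_init_k|, running a plain Lasso, and rescaling back.
# Columns outside the initial support are constrained to zero by dropping them.
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, Y, lam_init, lam):
    beta_init = Lasso(alpha=lam_init).fit(X, Y).coef_
    S_hat = np.flatnonzero(beta_init)                        # initial support
    beta_hat = np.zeros(X.shape[1])
    if S_hat.size > 0:
        weights = np.abs(beta_init[S_hat])
        X_scaled = X[:, S_hat] * weights                     # absorb the weights
        fit = Lasso(alpha=lam).fit(X_scaled, Y)
        beta_hat[S_hat] = fit.coef_ * weights                # undo the rescaling
    return beta_hat
```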
We can also try to use a non-convex penalty. We can attempt to solve
\[
  \operatorname*{argmin}_\beta \left\{ \frac{1}{2n}\|Y - X\beta\|_2^2 + \sum_{k=1}^{p} p_\lambda(|\beta_k|) \right\},
\]
where $p_\lambda: [0, \infty) \to [0, \infty)$ is a non-convex function. One common example is the MCP (minimax concave penalty), whose derivative is given by
\[
  p'_\lambda(u) = \left(\lambda - \frac{u}{\gamma}\right)_+,
\]
where $\gamma$ is an extra tuning parameter. This tends to give estimates even sparser than the Lasso.
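Integrating this derivative from $0$ gives the penalty explicitly:
\[
  p_\lambda(u) = \int_0^u \left(\lambda - \frac{t}{\gamma}\right)_+ \,\mathrm{d}t =
  \begin{cases}
    \lambda u - \dfrac{u^2}{2\gamma} & \text{if } u \le \gamma\lambda,\\[4pt]
    \dfrac{\gamma\lambda^2}{2} & \text{if } u > \gamma\lambda,
  \end{cases}
\]
so the penalty behaves like the Lasso penalty near zero but flattens out beyond $\gamma\lambda$, which reduces the shrinkage bias on large coefficients.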