3 The Lasso and beyond
We are interested in the situation where we have a very large number of variables
compared to the number of observations. For example, the data set might be all the
genetic and environmental information about patients, and we want to predict
whether they have diabetes. The key property is that we expect most of these
variables to be irrelevant, and we want to select the small subset that is relevant
and form prediction models based on it.
In these cases, ordinary least squares is not a very good tool. If there are
more variables than the number of data points, then it is likely that we can
pick the regression coefficients so that our model fits our data exactly, but this
model is likely to be spurious, since we are fine-tuning our model based on the
random fluctuations of a large number of irrelevant variables. Thus, we want to
find a regression method that penalizes non-zero coefficients. Note that ridge
regression is not good enough: it encourages small coefficients, but since it
uses the $\ell_2$ norm, it is quite lenient towards small, non-zero coefficients,
as the square of a small number is very small.
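To see the difference concretely, here is a small numerical sketch (not part of the notes; the libraries, sample sizes and penalty levels are our own illustrative choices). With $p > n$, least squares can interpolate the training data, ridge shrinks the coefficients but leaves essentially all of them non-zero, while an $\ell_1$ penalty (the Lasso, which we meet shortly) sets most of them exactly to zero.

```python
# Illustrative sketch only (not from the notes): least squares, ridge and the
# Lasso when p > n.  Assumes numpy and scikit-learn are available.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, p, s = 50, 200, 5                     # far more variables than observations
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:s] = 3.0                          # only s coefficients are truly non-zero
y = X @ beta0 + rng.standard_normal(n)

# Minimum-norm least squares: interpolates the training data when p > n.
beta_ols = np.linalg.pinv(X) @ y
print("OLS training RSS:", float(np.sum((y - X @ beta_ols) ** 2)))

# Ridge shrinks coefficients but essentially never sets them exactly to zero.
ridge = Ridge(alpha=10.0, fit_intercept=False).fit(X, y)
print("ridge coefficients exactly zero:", int(np.sum(ridge.coef_ == 0.0)))

# The Lasso sets most coefficients exactly to zero.
lasso = Lasso(alpha=0.5, fit_intercept=False).fit(X, y)
print("lasso coefficients exactly zero:", int(np.sum(lasso.coef_ == 0.0)))
```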
There is another reason to favour models that set many coefficients exactly to
zero. Consider the linear model
\[ Y = X\beta^0 + \varepsilon, \]
where as usual $\mathbb{E}\varepsilon = 0$ and $\operatorname{var}(\varepsilon) = \sigma^2 I$. Let $S = \{k : \beta^0_k \neq 0\}$, and let $s = |S|$.
As before, we suppose we have some a priori belief that $s \ll p$.
For the purposes of this investigation, suppose $X$ has full column rank, so
that we can use ordinary least squares. Then the prediction error is
\begin{align*}
\frac{1}{n} \mathbb{E}\|X\beta^0 - X\hat{\beta}^{\mathrm{OLS}}\|_2^2
&= \frac{1}{n} \mathbb{E}\bigl[(\hat{\beta}^{\mathrm{OLS}} - \beta^0)^T X^T X (\hat{\beta}^{\mathrm{OLS}} - \beta^0)\bigr]\\
&= \frac{1}{n} \mathbb{E}\operatorname{tr}\bigl[(\hat{\beta}^{\mathrm{OLS}} - \beta^0)(\hat{\beta}^{\mathrm{OLS}} - \beta^0)^T X^T X\bigr]\\
&= \frac{1}{n} \operatorname{tr}\bigl[\mathbb{E}\bigl((\hat{\beta}^{\mathrm{OLS}} - \beta^0)(\hat{\beta}^{\mathrm{OLS}} - \beta^0)^T\bigr) X^T X\bigr]\\
&= \frac{1}{n} \operatorname{tr}\bigl[\sigma^2 (X^T X)^{-1} X^T X\bigr]\\
&= \sigma^2 \frac{p}{n}.
\end{align*}
Note that this does not depend on what $\beta^0$ is, but only on $\sigma^2$, $p$ and $n$.
In the situation we are interested in, we expect $s \ll p$. So if we could find
$S$ and run ordinary least squares on just those variables, then we would have a
mean squared prediction error of $\sigma^2 s / n$, which would be much smaller.
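The $\sigma^2 p / n$ formula, and the gain from knowing $S$, can be checked with a quick Monte Carlo simulation. The sketch below is illustrative only; the design, the sample sizes and the helper name `mspe` are our own choices, not from the notes.

```python
# Illustrative simulation: mean squared prediction error of OLS on all p
# columns versus OLS restricted to the true support S = {1, ..., s}.
import numpy as np

rng = np.random.default_rng(1)
n, p, s, sigma = 100, 30, 3, 1.0
X = rng.standard_normal((n, p))          # fixed design with full column rank
beta0 = np.zeros(p)
beta0[:s] = 2.0                          # only the first s coefficients are non-zero

def mspe(cols, reps=2000):
    """Average (1/n) * ||X beta0 - X_M betahat||_2^2 over repeated noise draws."""
    XM = X[:, cols]
    errors = []
    for _ in range(reps):
        y = X @ beta0 + sigma * rng.standard_normal(n)
        betahat = np.linalg.lstsq(XM, y, rcond=None)[0]
        errors.append(np.sum((X @ beta0 - XM @ betahat) ** 2) / n)
    return float(np.mean(errors))

print("OLS on all p columns:", mspe(np.arange(p)), "  theory:", sigma**2 * p / n)
print("OLS on S only       :", mspe(np.arange(s)), "  theory:", sigma**2 * s / n)
```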
We first discuss a few classical model selection methods that do this.
The first approach we may think of is the best subsets method, where we try
to do regression on all possible choices of $S$ and see which is the “best”. For any
set $M$ of indices, we let $X_M$ be the submatrix of $X$ formed from the columns of
$X$ with index in $M$. We then regress $Y$ on $X_M$ for every $M \subseteq \{1, \ldots, p\}$, and
then pick the best model via cross-validation, for example. A big problem with
this is that the number of subsets grows exponentially with $p$, and so this becomes
infeasible for, say, $p > 50$.
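As a concrete (and deliberately naive) sketch of what best subsets involves, the function below, whose name and interface are our own, fits OLS on every size-$k$ subset and keeps the one with the smallest RSS; different subset sizes would then be compared by cross-validation. The loop over all $\binom{p}{k}$ subsets is exactly what makes the method infeasible for large $p$.

```python
# Naive best subsets sketch (illustrative only): exhaustively fit OLS on every
# size-k subset of columns and keep the subset with the smallest RSS.
from itertools import combinations
import numpy as np

def best_subset(X, y, k):
    """Return (columns, rss) for the size-k subset of columns minimising the RSS."""
    best_M, best_rss = None, np.inf
    for M in combinations(range(X.shape[1]), k):
        XM = X[:, list(M)]
        beta = np.linalg.lstsq(XM, y, rcond=None)[0]
        rss = float(np.sum((y - XM @ beta) ** 2))
        if rss < best_rss:
            best_M, best_rss = M, rss
    return best_M, best_rss
```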
Another method might be forward selection. We start with an intercept-only
model, and then add to the existing model the predictor that reduces the RSS
the most, and keep doing this until a fixed number of predictors have been
added. This is quite a nice approach, and is computationally quite fast even
if $p$ is large. However, this method is greedy: if it makes a poor choice at the
beginning, later steps cannot correct for it. In general, this method is rather
unstable, and is not great from a practical perspective.
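For contrast, here is a sketch of the greedy step (again illustrative, with our own function names, and assuming $X$ and $Y$ have been centred so the intercept can be dropped): at each stage it adds whichever remaining predictor gives the largest drop in RSS, and never revisits earlier choices.

```python
# Greedy forward selection sketch (illustrative only).  Assumes X and y are
# centred, so the intercept-only model corresponds to the empty set of columns.
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of the OLS fit of y on the given columns of X."""
    if not cols:
        return float(np.sum(y ** 2))     # empty (centred) model: predict zero
    XM = X[:, cols]
    beta = np.linalg.lstsq(XM, y, rcond=None)[0]
    return float(np.sum((y - XM @ beta) ** 2))

def forward_selection(X, y, k):
    """Greedily add, k times, the column that reduces the RSS the most."""
    selected = []
    for _ in range(k):
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        selected.append(min(remaining, key=lambda j: rss(X, y, selected + [j])))
    return selected
```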
