2.2 v-fold cross-validation
In practice, the above analysis is not very useful, since it doesn't actually tell us what $\lambda$ we should pick. If we are given a concrete data set, how do we know what $\lambda$ we should pick?
One common method is to use $v$-fold cross-validation. This is a very general technique that allows us to pick a regression method from a variety of options. We shall explain the method in terms of ridge regression, where our regression methods are parametrized by a single parameter $\lambda$, but it should be clear that this is massively more general.
Suppose we have some data set $(Y_i, x_i)_{i=1}^n$, and we are asked to predict the value of $Y$ given a new predictor $x$. We may want to pick $\lambda$ to minimize the following quantity:
$$\mathbb{E}\left\{\left(Y - x^T \hat{\beta}^R_\lambda(X, Y)\right)^2 \,\middle|\, X, Y\right\}.$$
It is difficult to actually do this, and so an easier target to minimize is the
expected prediction error
$$\mathbb{E}\left[\mathbb{E}\left\{\left(Y - x^T \hat{\beta}^R_\lambda(\tilde{X}, \tilde{Y})\right)^2 \,\middle|\, \tilde{X}, \tilde{Y}\right\}\right].$$
One thing to note about this is that we are thinking of $\tilde{X}$ and $\tilde{Y}$ as arbitrary data sets of size $n$, as opposed to the one we have actually got. This might be a more tractable problem, since we are not working with our actual data set, but with general data sets.
The method of cross-validation estimates this by splitting the data into $v$ folds $(X^{(1)}, Y^{(1)}), \ldots, (X^{(v)}, Y^{(v)})$, with associated index sets $A_1, \ldots, A_v$.
Here $A_i$ is called the set of observation indices of the $i$th fold: the set of all indices $j$ such that the $j$th data point lies in the $i$th fold.
We let $(X^{(-k)}, Y^{(-k)})$ be all the data except that in the $k$th fold. We define
$$\mathrm{CV}(\lambda) = \frac{1}{n} \sum_{k=1}^{v} \sum_{i \in A_k} \left(Y_i - x_i^T \hat{\beta}^R_\lambda(X^{(-k)}, Y^{(-k)})\right)^2.$$
We write $\lambda_{\mathrm{CV}}$ for the minimizer of this, and pick $\hat{\beta}^R_{\lambda_{\mathrm{CV}}}(X, Y)$ as our estimate. This tends to work very well in practice.
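To make this concrete, here is a minimal numpy sketch of the whole procedure. The helper names ridge_estimate and cv_error are ours, and we assume the usual closed form $\hat{\beta}^R_\lambda = (X^T X + \lambda I)^{-1} X^T Y$ for the ridge estimator:

```python
import numpy as np

def ridge_estimate(X, Y, lam):
    """Ridge estimate: beta_hat^R_lambda = (X^T X + lam I)^{-1} X^T Y (assumed form)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

def cv_error(X, Y, lam, v=5, rng=None):
    """The v-fold cross-validation criterion CV(lambda) defined above."""
    n = len(Y)
    rng = np.random.default_rng(rng)
    # Randomly partition the indices {0, ..., n-1} into v folds A_1, ..., A_v.
    folds = np.array_split(rng.permutation(n), v)
    total = 0.0
    for A_k in folds:
        mask = np.ones(n, dtype=bool)
        mask[A_k] = False  # (X^{(-k)}, Y^{(-k)}): all data except the kth fold
        beta = ridge_estimate(X[mask], Y[mask], lam)
        total += np.sum((Y[A_k] - X[A_k] @ beta) ** 2)
    return total / n

# Pick lambda_CV as the minimizer over a grid, e.g.:
# lambdas = np.logspace(-4, 4, 50)
# lam_cv = min(lambdas, key=lambda l: cv_error(X, Y, l, rng=0))
```

Note the fixed seed in the usage comment: the same random fold partition should be reused across all values of $\lambda$, so that the criterion compares like with like.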
But we can ask ourselves: why do we have to pick a single $\lambda$ to make the estimate? We can instead average over a range of different $\lambda$. Suppose we have computed $\hat{\beta}^R_\lambda$ on a grid of $\lambda$-values $\lambda_1 > \lambda_2 > \cdots > \lambda_L$. Our plan is to take a good weighted average of these estimates. Concretely, we want to minimize
$$\frac{1}{n} \sum_{k=1}^{v} \sum_{i \in A_k} \left(Y_i - \sum_{l=1}^{L} w_l\, x_i^T \hat{\beta}^R_{\lambda_l}(X^{(-k)}, Y^{(-k)})\right)^2$$
over $w \in [0, \infty)^L$. This is known as stacking. It tends to work better than just doing $v$-fold cross-validation. Indeed, it must, since $v$-fold cross-validation is just the special case where we restrict $w$ to be zero in all but one entry. Of course, this method comes with some heavy computational costs.
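That said, for this particular family the optimization over $w$ is cheap once the fitting is done: with the out-of-fold predictions precomputed, minimizing over $w \in [0, \infty)^L$ is exactly a nonnegative least squares problem. Here is a sketch, reusing the hypothetical ridge_estimate helper above and scipy's nnls solver (the name stacking_weights is ours):

```python
import numpy as np
from scipy.optimize import nnls

def stacking_weights(X, Y, lambdas, v=5, rng=None):
    """Stacking: fit weights w in [0, inf)^L over a grid of lambdas by
    nonnegative least squares on out-of-fold ridge predictions."""
    n, L = len(Y), len(lambdas)
    rng = np.random.default_rng(rng)
    folds = np.array_split(rng.permutation(n), v)
    # P[i, l] = x_i^T beta_hat^R_{lambda_l}(X^{(-k)}, Y^{(-k)}) for i in A_k.
    P = np.empty((n, L))
    for A_k in folds:
        mask = np.ones(n, dtype=bool)
        mask[A_k] = False  # leave out fold k when fitting
        for l, lam in enumerate(lambdas):
            beta = ridge_estimate(X[mask], Y[mask], lam)
            P[A_k, l] = X[A_k] @ beta
    # The stacking objective is (1/n)||Y - P w||^2, minimized over w >= 0.
    w, _ = nnls(P, Y)
    return w
```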