2.2 v-fold cross-validation
In practice, the above analysis is not very useful, since it doesn't actually tell us what $\lambda$ we should pick. If we are given a concrete data set, how do we know what $\lambda$ we should pick?
One common method is to use $v$-fold cross-validation. This is a very general technique that allows us to pick a regression method from a variety of options. We shall explain the method in terms of ridge regression, where our regression methods are parametrized by a single parameter $\lambda$, but it should be clear that this is massively more general.
Suppose we have some data set $(Y_i, x_i)_{i=1}^n$, and we are asked to predict the value of $Y$ given a new predictor $x$. We may want to pick $\lambda$ to minimize the following quantity:
$$\mathbb{E}\left\{\left(Y - x^T \hat{\beta}^R_\lambda(X, Y)\right)^2 \,\middle|\, X, Y\right\}.$$
It is difficult to actually do this, and so an easier target to minimize is the
expected prediction error
$$\mathbb{E}\left[\mathbb{E}\left\{\left(Y - x^T \hat{\beta}^R_\lambda(\tilde{X}, \tilde{Y})\right)^2 \,\middle|\, \tilde{X}, \tilde{Y}\right\}\right].$$
One thing to note about this is that we are thinking of $\tilde{X}$ and $\tilde{Y}$ as arbitrary data sets of size $n$, as opposed to the one we have actually got. This might be a more tractable problem, since we are not working with our actual data set, but with general data sets.
The method of cross-validation estimates this by splitting the data into $v$ folds $(X^{(1)}, Y^{(1)}), \ldots, (X^{(v)}, Y^{(v)})$, with associated index sets $A_1, \ldots, A_v$.
Here $A_i$ is called the set of observation indices of the $i$th fold: the set of all indices $j$ such that the $j$th data point lies in the $i$th fold.
We let $(X^{(-k)}, Y^{(-k)})$ be all the data except that in the $k$th fold. We define
$$\mathrm{CV}(\lambda) = \frac{1}{n} \sum_{k=1}^{v} \sum_{i \in A_k} \left(Y_i - x_i^T \hat{\beta}^R_\lambda(X^{(-k)}, Y^{(-k)})\right)^2.$$
We write $\lambda_{\mathrm{CV}}$ for the minimizer of this, and pick $\hat{\beta}^R_{\lambda_{\mathrm{CV}}}(X, Y)$ as our estimate. This tends to work very well in practice.
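To make this concrete, here is a minimal numpy sketch of the whole procedure. The helper names ridge_estimate and cv_error are ours, and we assume the usual closed form $\hat{\beta}^R_\lambda = (X^T X + \lambda I)^{-1} X^T Y$ for the ridge estimator:

```python
import numpy as np

def ridge_estimate(X, Y, lam):
    """Ridge estimate: beta_hat^R_lambda = (X^T X + lam I)^{-1} X^T Y (assumed form)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

def cv_error(X, Y, lam, v=5, rng=None):
    """The v-fold cross-validation criterion CV(lambda) defined above."""
    n = len(Y)
    rng = np.random.default_rng(rng)
    # Randomly partition the indices {0, ..., n-1} into v folds A_1, ..., A_v.
    folds = np.array_split(rng.permutation(n), v)
    total = 0.0
    for A_k in folds:
        mask = np.ones(n, dtype=bool)
        mask[A_k] = False  # (X^{(-k)}, Y^{(-k)}): all data except the kth fold
        beta = ridge_estimate(X[mask], Y[mask], lam)
        total += np.sum((Y[A_k] - X[A_k] @ beta) ** 2)
    return total / n

# Pick lambda_CV as the minimizer over a grid, e.g.:
# lambdas = np.logspace(-4, 4, 50)
# lam_cv = min(lambdas, key=lambda l: cv_error(X, Y, l, rng=0))
```

Note the fixed seed in the usage comment: the same random fold partition should be reused across all values of $\lambda$, so that the criterion compares like with like.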
But we can ask ourselves: why do we have to pick a single $\lambda$ to make the estimate? We can instead average over a range of different $\lambda$. Suppose we have computed $\hat{\beta}^R_\lambda$ on a grid of $\lambda$-values $\lambda_1 > \lambda_2 > \cdots > \lambda_L$. Our plan is to take a good weighted average of these estimates. Concretely, we want to minimize
$$\frac{1}{n} \sum_{k=1}^{v} \sum_{i \in A_k} \left(Y_i - \sum_{l=1}^{L} w_l\, x_i^T \hat{\beta}^R_{\lambda_l}(X^{(-k)}, Y^{(-k)})\right)^2$$
over $w \in [0, \infty)^L$. This is known as stacking. It tends to work better than just doing $v$-fold cross-validation. Indeed, it must, since $v$-fold cross-validation is just the special case where we restrict $w$ to be zero in all but one entry. Of course, this method comes with some heavy computational costs.
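That said, for this particular family the optimization over $w$ is cheap once the fitting is done: with the out-of-fold predictions precomputed, minimizing over $w \in [0, \infty)^L$ is exactly a nonnegative least squares problem. Here is a sketch, reusing the hypothetical ridge_estimate helper above and scipy's nnls solver (the name stacking_weights is ours):

```python
import numpy as np
from scipy.optimize import nnls

def stacking_weights(X, Y, lambdas, v=5, rng=None):
    """Stacking: fit weights w in [0, inf)^L over a grid of lambdas by
    nonnegative least squares on out-of-fold ridge predictions."""
    n, L = len(Y), len(lambdas)
    rng = np.random.default_rng(rng)
    folds = np.array_split(rng.permutation(n), v)
    # P[i, l] = x_i^T beta_hat^R_{lambda_l}(X^{(-k)}, Y^{(-k)}) for i in A_k.
    P = np.empty((n, L))
    for A_k in folds:
        mask = np.ones(n, dtype=bool)
        mask[A_k] = False  # leave out fold k when fitting
        for l, lam in enumerate(lambdas):
            beta = ridge_estimate(X[mask], Y[mask], lam)
            P[A_k, l] = X[A_k] @ beta
    # The stacking objective is (1/n)||Y - P w||^2, minimized over w >= 0.
    w, _ = nnls(P, Y)
    return w
```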