2.4 Making predictions
Let's now try to understand what exactly we are doing when we do ridge regression with a kernel $k$. To do so, we first think carefully about what we were doing in ordinary ridge regression, which corresponds to using the linear kernel. For the linear kernel, the $L_2$ norm $\|\beta\|_2^2$ corresponds exactly to the norm in the RKHS, $\|f\|_{\mathcal{H}}^2$. Thus, an alternative way of expressing ridge regression is
\[
  \hat{f} = \operatorname*{argmin}_{f \in \mathcal{H}} \left\{ \sum_{i=1}^n (Y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2 \right\}, \tag{$*$}
\]
where $\mathcal{H}$ is the RKHS of the linear kernel. Now this way of writing ridge regression makes sense for an arbitrary RKHS, so we might think this is what we should solve in general.

But if $\mathcal{H}$ is infinite-dimensional, then naively it would be quite difficult to solve $(*)$. The solution is provided by the representer theorem.
Theorem (Representer theorem). Let $\mathcal{H}$ be an RKHS with reproducing kernel $k$. Let $c$ be an arbitrary loss function and $J : [0, \infty) \to \mathbb{R}$ any strictly increasing function. Then the minimizer $\hat{f} \in \mathcal{H}$ of
\[
  Q_1(f) = c(Y, x_1, \ldots, x_n, f(x_1), \ldots, f(x_n)) + J(\|f\|_{\mathcal{H}}^2)
\]
lies in the linear span of $\{k(\cdot, x_i)\}_{i=1}^n$.
Given this theorem, we know our optimal solution can be written in the form
\[
  \hat{f}(\cdot) = \sum_{i=1}^n \hat{\alpha}_i k(\cdot, x_i),
\]
and thus we can rewrite our optimization problem as looking for the $\hat{\alpha} \in \mathbb{R}^n$ that minimizes
\[
  Q_2(\alpha) = c(Y, x_1, \ldots, x_n, K\alpha) + J(\alpha^T K \alpha)
\]
over $\alpha \in \mathbb{R}^n$ (with $K_{ij} = k(x_i, x_j)$).
For an arbitrary $c$, this can still be quite difficult, but in the case of ridge regression, this tells us that $(*)$ is equivalent to minimizing
\[
  \|Y - K\alpha\|_2^2 + \lambda \alpha^T K \alpha,
\]
and by differentiating, we find that the minimizer is given by solving
\[
  K\hat{\alpha} = K(K + \lambda I)^{-1} Y.
\]
Of course, $K\hat{\alpha}$ is also the vector of fitted values, so we see that if we minimize $(*)$, then the fitted values are indeed given by $K(K + \lambda I)^{-1} Y$, as in our original motivation.
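To make this concrete, here is a minimal sketch of kernel ridge regression along these lines, in Python with NumPy. The Gaussian kernel, its bandwidth, and the synthetic data are illustrative choices, not part of the notes.

```python
import numpy as np

def gaussian_kernel(X1, X2, bandwidth=0.5):
    """k(x, x') = exp(-||x - x'||^2 / (2 * bandwidth^2)); any positive-definite kernel would do."""
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

# Illustrative synthetic data
rng = np.random.default_rng(0)
n = 50
X = rng.uniform(0, 1, size=(n, 1))
Y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(n)

lam = 0.1
K = gaussian_kernel(X, X)                             # K_ij = k(x_i, x_j)
alpha_hat = np.linalg.solve(K + lam * np.eye(n), Y)   # solves (K + lam*I) alpha = Y

fitted = K @ alpha_hat                                # equals K (K + lam*I)^{-1} Y

# Predictions at new points via f_hat(x) = sum_i alpha_hat_i * k(x, x_i)
X_new = np.linspace(0, 1, 200)[:, None]
predictions = gaussian_kernel(X_new, X) @ alpha_hat
```

Solving $(K + \lambda I)\alpha = Y$ yields one solution of $K\hat{\alpha} = K(K + \lambda I)^{-1}Y$; when $K$ is singular the minimizer of $Q_2$ need not be unique, but the fitted values $K\hat{\alpha}$ are the same for any solution.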
Proof. Suppose $\hat{f}$ minimizes $Q_1$. We can then write $\hat{f} = u + v$, where $u \in V = \operatorname{span}\{k(\cdot, x_i) : i = 1, \ldots, n\}$ and $v \in V^\perp$. Then
\[
  \hat{f}(x_i) = \langle \hat{f}, k(\cdot, x_i)\rangle = \langle u + v, k(\cdot, x_i)\rangle = \langle u, k(\cdot, x_i)\rangle = u(x_i).
\]
So we know that
\[
  c(Y, x_1, \ldots, x_n, \hat{f}(x_1), \ldots, \hat{f}(x_n)) = c(Y, x_1, \ldots, x_n, u(x_1), \ldots, u(x_n)).
\]
Meanwhile,
\[
  \|\hat{f}\|_{\mathcal{H}}^2 = \|u + v\|_{\mathcal{H}}^2 = \|u\|_{\mathcal{H}}^2 + \|v\|_{\mathcal{H}}^2,
\]
using the fact that $u$ and $v$ are orthogonal. So we know
\[
  J(\|\hat{f}\|_{\mathcal{H}}^2) \geq J(\|u\|_{\mathcal{H}}^2),
\]
with equality iff $v = 0$. Hence $Q_1(\hat{f}) \geq Q_1(u)$, with equality iff $v = 0$, and so we must have $v = 0$ by optimality.

Thus, we know that the optimizer in fact lies in $V$, and then the formula for $Q_2$ just expresses $Q_1$ in terms of $\alpha$.
How well does our kernel machine do? Let $\mathcal{H}$ be an RKHS with reproducing kernel $k$, and let $f_0 \in \mathcal{H}$. Consider the model
\[
  Y_i = f_0(x_i) + \varepsilon_i \quad \text{for } i = 1, \ldots, n,
\]
and assume $\mathbb{E}\varepsilon = 0$ and $\operatorname{var}(\varepsilon) = \sigma^2 I$.

For convenience, we assume $\|f_0\|_{\mathcal{H}}^2 \leq 1$. There is no loss of generality in assuming this, since we can always achieve it by rescaling.

Let $K$ be the kernel matrix $K_{ij} = k(x_i, x_j)$, with eigenvalues $d_1 \geq d_2 \geq \cdots \geq d_n \geq 0$. Define
\[
  \hat{f}_\lambda = \operatorname*{argmin}_{f \in \mathcal{H}} \left\{ \sum_{i=1}^n (Y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2 \right\}.
\]
Theorem. We have
\[
  \frac{1}{n} \sum_{i=1}^n \mathbb{E}\bigl(f_0(x_i) - \hat{f}_\lambda(x_i)\bigr)^2
  \leq \frac{\sigma^2}{n} \sum_{i=1}^n \frac{d_i^2}{(d_i + \lambda)^2} + \frac{\lambda}{4n}
  \leq \frac{\sigma^2}{n} \cdot \frac{1}{\lambda} \sum_{i=1}^n \min\left(\frac{d_i}{4}, \lambda\right) + \frac{\lambda}{4n}.
\]
Given a data set, we can compute the eigenvalues $d_i$, and thus we can compute this error bound.
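For instance, here is a short sketch of how one might evaluate both bounds in the theorem from the eigenvalues of $K$. The kernel $k(x, x') = \min(x, x')$, the design and the noise level $\sigma^2$ are illustrative assumptions, and the helper name is ours.

```python
import numpy as np

def ridge_error_bounds(K, lam, sigma2):
    """Evaluate the two upper bounds in the theorem from the eigenvalues of K."""
    d = np.linalg.eigvalsh(K)        # eigenvalues of the symmetric PSD kernel matrix
    n = len(d)
    bound1 = sigma2 / n * np.sum(d ** 2 / (d + lam) ** 2) + lam / (4 * n)
    bound2 = sigma2 / (n * lam) * np.sum(np.minimum(d / 4, lam)) + lam / (4 * n)
    return bound1, bound2            # bound1 <= bound2 always

# Illustrative example
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, size=100))
K = np.minimum.outer(x, x)           # illustrative kernel k(x, x') = min(x, x')
print(ridge_error_bounds(K, lam=1.0, sigma2=1.0))
```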
Proof. We know from the representer theorem that
\[
  (\hat{f}_\lambda(x_1), \ldots, \hat{f}_\lambda(x_n))^T = K(K + \lambda I)^{-1} Y.
\]
Also, there is some $\alpha \in \mathbb{R}^n$ such that
\[
  (f_0(x_1), \ldots, f_0(x_n))^T = K\alpha.
\]
Moreover, on the example sheet, we show that
\[
  1 \geq \|f_0\|_{\mathcal{H}}^2 \geq \alpha^T K \alpha.
\]
Consider the eigen-decomposition $K = UDU^T$, where $U$ is orthogonal, $D_{ii} = d_i$ and $D_{ij} = 0$ for $i \neq j$. Then we have
\[
  \mathbb{E} \sum_{i=1}^n \bigl(f_0(x_i) - \hat{f}_\lambda(x_i)\bigr)^2 = \mathbb{E}\bigl\|K\alpha - K(K + \lambda I)^{-1}(K\alpha + \varepsilon)\bigr\|_2^2.
\]
Noting that $K\alpha = (K + \lambda I)(K + \lambda I)^{-1}K\alpha$, we obtain
\[
  = \mathbb{E}\bigl\|\lambda(K + \lambda I)^{-1}K\alpha - K(K + \lambda I)^{-1}\varepsilon\bigr\|_2^2
  = \underbrace{\lambda^2 \bigl\|(K + \lambda I)^{-1}K\alpha\bigr\|_2^2}_{(\text{I})} + \underbrace{\mathbb{E}\bigl\|K(K + \lambda I)^{-1}\varepsilon\bigr\|_2^2}_{(\text{II})},
\]
where the cross term vanishes since $\mathbb{E}\varepsilon = 0$.
At this stage, we can throw in the eigen-decomposition of $K$ to write (I) as
\[
  (\text{I}) = \lambda^2 \bigl\|(UDU^T + \lambda I)^{-1} UDU^T \alpha\bigr\|_2^2
  = \lambda^2 \bigl\|U(D + \lambda I)^{-1} \underbrace{DU^T \alpha}_{\theta}\bigr\|_2^2
  = \sum_{i=1}^n \theta_i^2 \frac{\lambda^2}{(d_i + \lambda)^2}.
\]
Now we have
\[
  \alpha^T K \alpha = \alpha^T UDU^T \alpha = \alpha^T UDD^+DU^T\alpha,
\]
where $D^+$ is diagonal with
\[
  D^+_{ii} = \begin{cases} d_i^{-1} & d_i > 0 \\ 0 & \text{otherwise.} \end{cases}
\]
We can then write this as
\[
  \alpha^T K \alpha = \sum_{d_i > 0} \frac{\theta_i^2}{d_i}.
\]
The key thing we know about this is that it is $\leq 1$.

Note that by definition of $\theta_i$, we see that if $d_i = 0$, then $\theta_i = 0$. So we can write
\[
  (\text{I}) = \sum_{i : d_i > 0} \frac{\theta_i^2}{d_i} \cdot \frac{d_i \lambda^2}{(d_i + \lambda)^2}
  \leq \lambda \max_{i = 1, \ldots, n} \frac{d_i \lambda}{(d_i + \lambda)^2},
\]
by Hölder's inequality with $(p, q) = (1, \infty)$. Finally, use the inequality
\[
  (a + b)^2 \geq 4ab
\]
to see that we have
\[
  (\text{I}) \leq \frac{\lambda}{4}.
\]
So we are left with (II), which is a bit easier. Using the trace trick, we have
\[
\begin{aligned}
  (\text{II}) &= \mathbb{E}\,\varepsilon^T (K + \lambda I)^{-1} K^2 (K + \lambda I)^{-1} \varepsilon \\
  &= \mathbb{E} \operatorname{tr}\bigl(K(K + \lambda I)^{-1} \varepsilon\varepsilon^T (K + \lambda I)^{-1} K\bigr) \\
  &= \operatorname{tr}\bigl(K(K + \lambda I)^{-1} \mathbb{E}(\varepsilon\varepsilon^T)(K + \lambda I)^{-1} K\bigr) \\
  &= \sigma^2 \operatorname{tr}\bigl(K^2(K + \lambda I)^{-2}\bigr) \\
  &= \sigma^2 \operatorname{tr}\bigl(UD^2U^T(UDU^T + \lambda I)^{-2}\bigr) \\
  &= \sigma^2 \operatorname{tr}\bigl(D^2(D + \lambda I)^{-2}\bigr) \\
  &= \sigma^2 \sum_{i=1}^n \frac{d_i^2}{(d_i + \lambda)^2}.
\end{aligned}
\]
Finally, writing
\[
  \frac{d_i^2}{(d_i + \lambda)^2} = \frac{d_i}{\lambda} \cdot \frac{d_i\lambda}{(d_i + \lambda)^2},
\]
we have
\[
  \frac{d_i^2}{(d_i + \lambda)^2} \leq \min\left(1, \frac{d_i}{4\lambda}\right),
\]
and we have the second bound.
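As a sanity check on the first bound, here is a small Monte Carlo sketch. The kernel $k(x, x') = \min(x, x')$, the choice $f_0 = k(\cdot, z)/\sqrt{k(z, z)}$ (which has $\|f_0\|_{\mathcal{H}} = 1$), and the noise level are illustrative assumptions, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(4)
n, sigma, lam, reps = 100, 1.0, 1.0, 500
x = np.sort(rng.uniform(0, 1, size=n))        # design points, held fixed across replications
K = np.minimum.outer(x, x)                    # illustrative kernel k(x, x') = min(x, x')

z = 0.7                                       # f_0 = k(., z) / sqrt(k(z, z)), so ||f_0||_H = 1
f0 = np.minimum(x, z) / np.sqrt(z)

S = K @ np.linalg.inv(K + lam * np.eye(n))    # smoother matrix K (K + lam*I)^{-1}
mse = 0.0
for _ in range(reps):
    Y = f0 + sigma * rng.standard_normal(n)
    mse += np.mean((f0 - S @ Y) ** 2) / reps  # (1/n) sum_i (f_0(x_i) - f_hat_lambda(x_i))^2

d = np.linalg.eigvalsh(K)
bound = sigma ** 2 / n * np.sum(d ** 2 / (d + lam) ** 2) + lam / (4 * n)
print(f"Monte Carlo MSE: {mse:.4f}  theoretical bound: {bound:.4f}")
```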
How can we interpret this? If we look at the formula, then we see that the bound is small if the $d_i$'s decay quickly, and the exact values of the $d_i$ depend only on the choice of the $x_i$. So suppose the $x_i$ are random, i.i.d. and independent of $\varepsilon$. Then the entire analysis is still valid by conditioning on $x_1, \ldots, x_n$.
We define $\hat{\mu}_i = d_i/n$ and $\lambda_n = \lambda/n$. Then we can rewrite our result to say
\[
  \frac{1}{n} \mathbb{E} \sum_{i=1}^n \bigl(f_0(x_i) - \hat{f}_\lambda(x_i)\bigr)^2
  \leq \frac{\sigma^2}{\lambda_n} \cdot \frac{1}{n} \sum_{i=1}^n \mathbb{E} \min\left(\frac{\hat{\mu}_i}{4}, \lambda_n\right) + \frac{\lambda_n}{4}
  =: \mathbb{E}\,\delta_n(\lambda_n).
\]
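To see how $\delta_n(\lambda_n)$ behaves, one can evaluate it over a grid of $\lambda_n$ values from the empirical eigenvalues $\hat{\mu}_i = d_i/n$. A sketch follows; again the kernel, the design density and $\sigma^2$ are illustrative assumptions, and the helper name is ours.

```python
import numpy as np

def delta_n(lambda_n, mu_hat, sigma2):
    """delta_n(lambda_n) = sigma^2 / lambda_n * mean(min(mu_hat / 4, lambda_n)) + lambda_n / 4."""
    return sigma2 / lambda_n * np.mean(np.minimum(mu_hat / 4, lambda_n)) + lambda_n / 4

rng = np.random.default_rng(2)
n = 200
x = np.sort(rng.uniform(0, 1, size=n))   # i.i.d. design, uniform on [0, 1]
K = np.minimum.outer(x, x)               # illustrative kernel k(x, x') = min(x, x')
mu_hat = np.linalg.eigvalsh(K) / n       # empirical eigenvalues mu_hat_i = d_i / n

grid = np.logspace(-4, 0, 50)
values = [delta_n(lam_n, mu_hat, sigma2=1.0) for lam_n in grid]
best = grid[int(np.argmin(values))]
print(f"lambda_n minimizing the bound: {best:.4f}")
```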
We want to relate this more directly to the kernel somehow. Given a density $p(x)$ on $\mathcal{X}$, Mercer's theorem guarantees the following eigenexpansion
\[
  k(x, x') = \sum_{j=1}^\infty \mu_j e_j(x) e_j(x'),
\]
where the eigenvalues $\mu_j$ and eigenfunctions $e_j$ obey
\[
  \mu_j e_j(x') = \int_{\mathcal{X}} k(x, x') e_j(x) p(x)\,\mathrm{d}x
  \quad\text{and}\quad
  \int_{\mathcal{X}} e_k(x) e_j(x) p(x)\,\mathrm{d}x = \mathbf{1}_{\{j = k\}}.
\]
One can then show that, up to some absolute constant factors,
\[
  \frac{1}{n} \mathbb{E} \sum_{i=1}^n \min\left(\frac{\hat{\mu}_i}{4}, \lambda_n\right)
  \approx \frac{1}{n} \sum_{i=1}^\infty \min\left(\frac{\mu_i}{4}, \lambda_n\right).
\]
For a particular density $p(x)$ on the input space, we can try to solve the integral equation to figure out what the $\mu_j$'s are, and hence find the expected bound.
When $k$ is the Sobolev kernel and $p(x)$ is the uniform density, it turns out that
\[
  \frac{\mu_j}{4} = \frac{1}{\pi^2(2j - 1)^2}.
\]
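One can check this numerically by discretizing the integral equation (a Nyström-type approximation: the eigenvalues of $K/m$ over $m$ points drawn from $p$ approximate the $\mu_j$). In the sketch below we assume, for illustration, that the Sobolev kernel takes the form $k(x, x') = \min(x, x')$ on $[0, 1]$, which is consistent with the stated eigenvalues; the helper name is ours.

```python
import numpy as np

def mercer_eigenvalues(kernel, sample, top=5):
    """Approximate the leading Mercer eigenvalues of `kernel` w.r.t. the law of `sample`."""
    m = len(sample)
    K = kernel(sample[:, None], sample[None, :])   # K_ij = k(x_i, x_j)
    return np.linalg.eigvalsh(K / m)[::-1][:top]   # eigenvalues of K/m, largest first

rng = np.random.default_rng(3)
sample = rng.uniform(0, 1, size=2000)              # draws from the uniform density p
approx = mercer_eigenvalues(np.minimum, sample)    # assumed kernel k(x, x') = min(x, x')
exact = 4 / (np.pi ** 2 * (2 * np.arange(1, 6) - 1) ** 2)   # mu_j = 4 / (pi^2 (2j-1)^2)
print(np.round(approx, 4))
print(np.round(exact, 4))
```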
By drawing a picture, we see that we can bound $\sum_{i=1}^\infty \min(\mu_i/4, \lambda_n)$ by
\[
  \sum_{i=1}^\infty \min\left(\frac{\mu_i}{4}, \lambda_n\right)
  \leq \int_0^\infty \min\left(\lambda_n, \frac{1}{\pi^2(2j - 1)^2}\right) \mathrm{d}j
  \leq \frac{\sqrt{\lambda_n}}{\pi} + \frac{\lambda_n}{2}.
\]
So we find that
\[
  \mathbb{E}(\delta_n(\lambda_n)) = O\left(\frac{\sigma^2\lambda_n^{-1/2}}{n} + \lambda_n\right).
\]
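The series-by-integral bound used above is easy to check numerically; a quick sketch (truncating the series, purely illustrative):

```python
import numpy as np

def series(lambda_n, terms=100_000):
    """Truncated version of sum_j min(1 / (pi^2 (2j-1)^2), lambda_n)."""
    j = np.arange(1, terms + 1)
    return np.sum(np.minimum(1 / (np.pi ** 2 * (2 * j - 1) ** 2), lambda_n))

def integral_bound(lambda_n):
    return np.sqrt(lambda_n) / np.pi + lambda_n / 2

for lam_n in (1e-1, 1e-2, 1e-3):
    print(lam_n, series(lam_n) <= integral_bound(lam_n))
```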
We can then find the $\lambda_n$ that minimizes this bound, and we see that we should pick
\[
  \lambda_n \sim \left(\frac{\sigma^2}{n}\right)^{2/3},
\]
which gives an error rate of
\[
  \left(\frac{\sigma^2}{n}\right)^{2/3}.
\]
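Concretely, treating the bound as exact up to constants, differentiating $g(\lambda_n) = \sigma^2\lambda_n^{-1/2}/n + \lambda_n$ gives
\[
  g'(\lambda_n) = -\frac{\sigma^2}{2n}\lambda_n^{-3/2} + 1 = 0
  \iff \lambda_n = \left(\frac{\sigma^2}{2n}\right)^{2/3} \asymp \left(\frac{\sigma^2}{n}\right)^{2/3},
\]
and substituting back, both terms of $g$ are then of order $(\sigma^2/n)^{2/3}$.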