2.4 Making predictions
Let's now try to understand what exactly we are doing when we do ridge regression with a kernel $k$. To do so, we first think carefully about what we were doing in ordinary ridge regression, which corresponds to using the linear kernel. For the linear kernel, the squared $\ell_2$ norm $\|\beta\|_2^2$ corresponds exactly to the squared norm $\|f\|_{\mathcal{H}}^2$ in the RKHS. Thus, an alternative way of expressing ridge regression is
\[
\hat f = \operatorname*{argmin}_{f \in \mathcal{H}}\left\{\sum_{i=1}^n \bigl(Y_i - f(x_i)\bigr)^2 + \lambda\|f\|_{\mathcal{H}}^2\right\}, \tag{$*$}
\]
where $\mathcal{H}$ is the RKHS of the linear kernel. Now this way of writing ridge regression makes sense for an arbitrary RKHS, so we might think this is what we should solve in general.

But if $\mathcal{H}$ is infinite-dimensional, then naively, it would be quite difficult to solve $(*)$. The solution is provided by the representer theorem.
Theorem (Representer theorem). Let $\mathcal{H}$ be an RKHS with reproducing kernel $k$. Let $c$ be an arbitrary loss function and $J : [0, \infty) \to \mathbb{R}$ any strictly increasing function. Then the minimizer $\hat f \in \mathcal{H}$ of
\[
Q_1(f) = c\bigl(Y, x_1, \ldots, x_n, f(x_1), \ldots, f(x_n)\bigr) + J\bigl(\|f\|_{\mathcal{H}}^2\bigr)
\]
lies in the linear span of $\{k(\,\cdot\,, x_i)\}_{i=1}^n$.
Given this theorem, we know our optimal solution can be written in the form
\[
\hat f(\,\cdot\,) = \sum_{i=1}^n \hat\alpha_i k(\,\cdot\,, x_i),
\]
and thus we can rewrite our optimization problem as looking for the $\hat\alpha \in \mathbb{R}^n$ that minimizes
\[
Q_2(\alpha) = c(Y, x_1, \ldots, x_n, K\alpha) + J(\alpha^T K\alpha)
\]
over $\alpha \in \mathbb{R}^n$, where $K_{ij} = k(x_i, x_j)$.
For an arbitrary $c$, this can still be quite difficult, but in the case of ridge regression, this tells us that $(*)$ is equivalent to minimizing
\[
\|Y - K\alpha\|_2^2 + \lambda\alpha^T K\alpha,
\]
and by differentiating with respect to $\alpha$, we find that the minimizer $\hat\alpha$ satisfies
\[
K\hat\alpha = K(K + \lambda I)^{-1}Y.
\]
Of course, $K\hat\alpha$ is also our vector of fitted values, so we see that if we minimize $(*)$, then the fitted values are indeed given by $K(K + \lambda I)^{-1}Y$, as in our original motivation.
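To make this concrete, here is a minimal numerical sketch of kernel ridge regression; the Gaussian kernel, the simulated data and the value of $\lambda$ are illustrative choices, not part of the notes. We solve $(K + \lambda I)\hat\alpha = Y$, so that the fitted values are $K\hat\alpha = K(K + \lambda I)^{-1}Y$, and predict at new points via $\hat f(x) = \sum_i \hat\alpha_i k(x, x_i)$.

```python
import numpy as np

def gaussian_kernel(X, Xp, sigma=1.0):
    # K_ij = exp(-||X_i - Xp_j||^2 / (2 sigma^2))
    sq = np.sum(X**2, 1)[:, None] + np.sum(Xp**2, 1)[None, :] - 2 * X @ Xp.T
    return np.exp(-sq / (2 * sigma**2))

def kernel_ridge_fit(X, Y, lam, kernel=gaussian_kernel):
    # Solve (K + lam I) alpha = Y, so K alpha = K (K + lam I)^{-1} Y are the fitted values.
    K = kernel(X, X)
    alpha = np.linalg.solve(K + lam * np.eye(len(Y)), Y)
    return alpha, K @ alpha

# Illustrative data: noisy observations of a smooth function
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 1))
Y = np.sin(2 * np.pi * X[:, 0]) + 0.3 * rng.standard_normal(50)
alpha_hat, fitted = kernel_ridge_fit(X, Y, lam=0.1)

# Predictions at new points x: f_hat(x) = sum_i alpha_hat_i k(x, x_i)
X_new = np.linspace(0, 1, 5)[:, None]
preds = gaussian_kernel(X_new, X) @ alpha_hat
```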
Proof. Suppose $\hat f$ minimizes $Q_1$. We can then write
\[
\hat f = u + v,
\]
where $u \in V = \operatorname{span}\{k(\,\cdot\,, x_i) : i = 1, \ldots, n\}$ and $v \in V^\perp$. Then
\[
\hat f(x_i) = \langle \hat f, k(\,\cdot\,, x_i)\rangle = \langle u + v, k(\,\cdot\,, x_i)\rangle = \langle u, k(\,\cdot\,, x_i)\rangle = u(x_i).
\]
So we know that
\[
c\bigl(Y, x_1, \ldots, x_n, \hat f(x_1), \ldots, \hat f(x_n)\bigr) = c\bigl(Y, x_1, \ldots, x_n, u(x_1), \ldots, u(x_n)\bigr).
\]
Meanwhile,
\[
\|\hat f\|_{\mathcal{H}}^2 = \|u + v\|_{\mathcal{H}}^2 = \|u\|_{\mathcal{H}}^2 + \|v\|_{\mathcal{H}}^2,
\]
using the fact that $u$ and $v$ are orthogonal. Since $J$ is strictly increasing, we know
\[
J\bigl(\|\hat f\|_{\mathcal{H}}^2\bigr) \geq J\bigl(\|u\|_{\mathcal{H}}^2\bigr),
\]
with equality iff $v = 0$. Hence $Q_1(\hat f) \geq Q_1(u)$, with equality iff $v = 0$, and so we must have $v = 0$ by optimality.

Thus, we know that the optimizer in fact lies in $V$, and then the formula for $Q_2$ just expresses $Q_1$ in terms of $\alpha$.
How well does our kernel machine do? Let $\mathcal{H}$ be an RKHS with reproducing kernel $k$, and let $f_0 \in \mathcal{H}$. Consider the model
\[
Y_i = f_0(x_i) + \varepsilon_i \quad\text{for } i = 1, \ldots, n,
\]
and assume $\mathbb{E}\varepsilon = 0$ and $\operatorname{var}(\varepsilon) = \sigma^2 I$.

For convenience, we assume $\|f_0\|_{\mathcal{H}}^2 \leq 1$. There is no loss of generality in assuming this, since we can always achieve it by rescaling.

Let $K$ be the kernel matrix $K_{ij} = k(x_i, x_j)$, with eigenvalues $d_1 \geq d_2 \geq \cdots \geq d_n \geq 0$. Define
\[
\hat f_\lambda = \operatorname*{argmin}_{f \in \mathcal{H}}\left\{\sum_{i=1}^n \bigl(Y_i - f(x_i)\bigr)^2 + \lambda\|f\|_{\mathcal{H}}^2\right\}.
\]
Theorem. We have
\[
\frac{1}{n}\sum_{i=1}^n \mathbb{E}\bigl(f_0(x_i) - \hat f_\lambda(x_i)\bigr)^2
\leq \frac{\sigma^2}{n}\sum_{i=1}^n \frac{d_i^2}{(d_i + \lambda)^2} + \frac{\lambda}{4n}
\leq \frac{\sigma^2}{n}\cdot\frac{1}{\lambda}\sum_{i=1}^n \min\left(\frac{d_i}{4}, \lambda\right) + \frac{\lambda}{4n}.
\]
Given a data set, we can compute the eigenvalues $d_i$, and thus we can compute this error bound.
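For instance, given a kernel matrix we can evaluate both upper bounds numerically from the eigenvalues $d_i$. The sketch below does this for an illustrative Gaussian kernel on simulated inputs (the kernel, inputs, $\sigma^2$ and grid of $\lambda$ values are all assumptions made for the example).

```python
import numpy as np

def error_bounds(K, sigma2, lam):
    # Eigenvalues d_1 >= ... >= d_n >= 0 of the kernel matrix K.
    d = np.clip(np.linalg.eigvalsh(K), 0, None)
    n = len(d)
    first = sigma2 / n * np.sum(d**2 / (d + lam)**2) + lam / (4 * n)
    second = sigma2 / (n * lam) * np.sum(np.minimum(d / 4, lam)) + lam / (4 * n)
    return first, second

# Illustrative example: Gaussian kernel on uniform inputs
rng = np.random.default_rng(1)
x = rng.uniform(size=200)
K = np.exp(-(x[:, None] - x[None, :])**2)
for lam in [0.1, 1.0, 10.0]:
    b1, b2 = error_bounds(K, sigma2=1.0, lam=lam)
    print(f"lambda={lam}: first bound {b1:.4f} <= second bound {b2:.4f}")
```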
Proof. We know from the representer theorem that
\[
\bigl(\hat f_\lambda(x_1), \ldots, \hat f_\lambda(x_n)\bigr)^T = K(K + \lambda I)^{-1}Y.
\]
Also, there is some $\alpha \in \mathbb{R}^n$ such that
\[
\bigl(f_0(x_1), \ldots, f_0(x_n)\bigr)^T = K\alpha.
\]
Moreover, on the example sheet, we show that
\[
1 \geq \|f_0\|_{\mathcal{H}}^2 \geq \alpha^T K\alpha.
\]
Consider the eigendecomposition $K = UDU^T$, where $U$ is orthogonal, $D_{ii} = d_i$ and $D_{ij} = 0$ for $i \neq j$. Then we have
\[
\mathbb{E}\sum_{i=1}^n \bigl(f_0(x_i) - \hat f_\lambda(x_i)\bigr)^2 = \mathbb{E}\bigl\|K\alpha - K(K + \lambda I)^{-1}(K\alpha + \varepsilon)\bigr\|_2^2.
\]
Noting that $K\alpha = (K + \lambda I)(K + \lambda I)^{-1}K\alpha$, we obtain
\[
= \mathbb{E}\bigl\|\lambda(K + \lambda I)^{-1}K\alpha - K(K + \lambda I)^{-1}\varepsilon\bigr\|_2^2
= \underbrace{\lambda^2\bigl\|(K + \lambda I)^{-1}K\alpha\bigr\|_2^2}_{\text{(I)}}
+ \underbrace{\mathbb{E}\bigl\|K(K + \lambda I)^{-1}\varepsilon\bigr\|_2^2}_{\text{(II)}},
\]
where the cross term vanishes since $\mathbb{E}\varepsilon = 0$ and the first term is deterministic.
At this stage, we can throw in the eigendecomposition of $K$ to write (I) as
\[
\text{(I)} = \lambda^2\bigl\|(UDU^T + \lambda I)^{-1}UDU^T\alpha\bigr\|_2^2
= \lambda^2\bigl\|U(D + \lambda I)^{-1}\underbrace{DU^T\alpha}_{\theta}\bigr\|_2^2
= \sum_{i=1}^n \frac{\theta_i^2\lambda^2}{(d_i + \lambda)^2}.
\]
Now we have
\[
\alpha^T K\alpha = \alpha^T UDU^T\alpha = \alpha^T UDD^+DU^T\alpha,
\]
where $D^+$ is diagonal with
\[
D^+_{ii} = \begin{cases} d_i^{-1} & d_i > 0\\ 0 & \text{otherwise.}\end{cases}
\]
We can then write this as
\[
\alpha^T K\alpha = \sum_{d_i > 0} \frac{\theta_i^2}{d_i}.
\]
The key thing we know about this quantity is that it is $\leq 1$.
Note that, by the definition of $\theta_i$, if $d_i = 0$ then $\theta_i = 0$. So we can write
\[
\text{(I)} = \sum_{i\colon d_i > 0} \frac{\theta_i^2}{d_i}\cdot\frac{d_i\lambda^2}{(d_i + \lambda)^2}
\leq \lambda \max_{i=1,\ldots,n} \frac{d_i\lambda}{(d_i + \lambda)^2}
\]
by Hölder's inequality with $(p, q) = (1, \infty)$. Finally, use the inequality
\[
(a + b)^2 \geq 4ab
\]
to see that
\[
\text{(I)} \leq \frac{\lambda}{4}.
\]
So we are left with (II), which is a bit easier. Using the trace trick, we have
\begin{align*}
\text{(II)} &= \mathbb{E}\,\varepsilon^T(K + \lambda I)^{-1}K^2(K + \lambda I)^{-1}\varepsilon\\
&= \mathbb{E}\operatorname{tr}\bigl(K(K + \lambda I)^{-1}\varepsilon\varepsilon^T(K + \lambda I)^{-1}K\bigr)\\
&= \operatorname{tr}\bigl(K(K + \lambda I)^{-1}\mathbb{E}(\varepsilon\varepsilon^T)(K + \lambda I)^{-1}K\bigr)\\
&= \sigma^2\operatorname{tr}\bigl(K^2(K + \lambda I)^{-2}\bigr)\\
&= \sigma^2\operatorname{tr}\bigl(UD^2U^T(UDU^T + \lambda I)^{-2}\bigr)\\
&= \sigma^2\operatorname{tr}\bigl(D^2(D + \lambda I)^{-2}\bigr)\\
&= \sigma^2\sum_{i=1}^n \frac{d_i^2}{(d_i + \lambda)^2}.
\end{align*}
Finally, writing
\[
\frac{d_i^2}{(d_i + \lambda)^2} = \frac{d_i}{\lambda}\cdot\frac{d_i\lambda}{(d_i + \lambda)^2},
\]
we have (using $(a + b)^2 \geq 4ab$ again)
\[
\frac{d_i^2}{(d_i + \lambda)^2} \leq \min\left(1, \frac{d_i}{4\lambda}\right),
\]
and we have the second bound.
How can we interpret this? If we look at the formula, we see that we benefit a lot if the $d_i$'s decay quickly, and the exact values of the $d_i$ depend only on the choice of the $x_i$. So suppose the $x_i$ are random, iid and independent of $\varepsilon$. Then the entire analysis is still valid by conditioning on $x_1, \ldots, x_n$.
We define $\hat\mu_i = d_i/n$ and $\lambda_n = \lambda/n$. Then we can rewrite our result to say
\[
\frac{1}{n}\mathbb{E}\sum_{i=1}^n \bigl(f_0(x_i) - \hat f_\lambda(x_i)\bigr)^2
\leq \frac{\sigma^2}{\lambda_n}\cdot\frac{1}{n}\sum_{i=1}^n \mathbb{E}\min\left(\frac{\hat\mu_i}{4}, \lambda_n\right) + \frac{\lambda_n}{4}
\equiv \mathbb{E}\,\delta_n(\lambda_n).
\]
We want to relate this more directly to the kernel somehow. Given a density $p(x)$ on $\mathcal{X}$, Mercer's theorem guarantees the eigenexpansion
\[
k(x, x') = \sum_{j=1}^\infty \mu_j e_j(x)e_j(x'),
\]
where the eigenvalues $\mu_j$ and eigenfunctions $e_j$ obey
\[
\mu_j e_j(x') = \int_{\mathcal{X}} k(x, x')e_j(x)p(x)\,\mathrm{d}x
\quad\text{and}\quad
\int_{\mathcal{X}} e_k(x)e_j(x)p(x)\,\mathrm{d}x = \mathbb{1}_{\{j = k\}}.
\]
One can then show that
\[
\frac{1}{n}\mathbb{E}\sum_{i=1}^n \min\left(\frac{\hat\mu_i}{4}, \lambda_n\right)
\leq \frac{1}{n}\sum_{j=1}^\infty \min\left(\frac{\mu_j}{4}, \lambda_n\right),
\]
up to some absolute constant factors. For a particular density $p(x)$ on the input space, we can try to solve the integral equation to figure out what the $\mu_j$'s are, and then we can find the expected bound.
When $k$ is the Sobolev kernel and $p(x)$ is the uniform density, it turns out that
\[
\frac{\mu_j}{4} = \frac{1}{\pi^2(2j - 1)^2}.
\]
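As a rough numerical check of this claim, here is a short sketch. It assumes the first-order Sobolev kernel $k(x, x') = \min(x, x')$ on $[0, 1]$ (an assumption about which kernel is meant); under the uniform density, the scaled eigenvalues $\hat\mu_i = d_i/n$ of the kernel matrix should then approach the $\mu_j$ above as $n$ grows.

```python
import numpy as np

# Empirical check: eigenvalues of the kernel matrix, scaled by 1/n, should
# approximate the Mercer eigenvalues mu_j = 4 / (pi^2 (2j - 1)^2).
rng = np.random.default_rng(2)
n = 2000
x = rng.uniform(size=n)
K = np.minimum(x[:, None], x[None, :])              # assumed kernel k(x, x') = min(x, x')
mu_hat = np.sort(np.linalg.eigvalsh(K))[::-1] / n   # hat{mu}_i = d_i / n, in decreasing order
j = np.arange(1, 6)
mu = 4 / (np.pi**2 * (2 * j - 1)**2)
print("empirical:", np.round(mu_hat[:5], 4))
print("Mercer:   ", np.round(mu, 4))
```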
By drawing a picture, we see that we can bound $\sum_{j=1}^\infty \min\left(\frac{\mu_j}{4}, \lambda_n\right)$ by
\[
\sum_{j=1}^\infty \min\left(\frac{\mu_j}{4}, \lambda_n\right)
\leq \int_0^\infty \lambda_n \wedge \frac{1}{\pi^2(2j - 1)^2}\,\mathrm{d}j
\leq \frac{\sqrt{\lambda_n}}{\pi} + \frac{\lambda_n}{2}.
\]
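Indeed, the two terms inside the minimum cross at $j^* = \frac{1}{2}\left(1 + \frac{1}{\pi\sqrt{\lambda_n}}\right)$, and bounding the integrand by $\lambda_n$ before $j^*$ and by $\frac{1}{\pi^2(2j-1)^2}$ after, we get
\[
\int_0^\infty \lambda_n \wedge \frac{1}{\pi^2(2j - 1)^2}\,\mathrm{d}j
\leq \lambda_n j^* + \int_{j^*}^\infty \frac{\mathrm{d}j}{\pi^2(2j - 1)^2}
= \frac{\lambda_n}{2} + \frac{\sqrt{\lambda_n}}{2\pi} + \frac{1}{2\pi^2(2j^* - 1)}
= \frac{\lambda_n}{2} + \frac{\sqrt{\lambda_n}}{\pi}.
\]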
So we find that
\[
\mathbb{E}\bigl(\delta_n(\lambda_n)\bigr) = O\left(\frac{\sigma^2}{n\lambda_n^{1/2}} + \lambda_n\right).
\]
We can then find the $\lambda_n$ that minimizes this, and we see that we should pick
\[
\lambda_n \sim \left(\frac{\sigma^2}{n}\right)^{2/3},
\]
which gives an error rate of $\sim \left(\frac{\sigma^2}{n}\right)^{2/3}$.
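To see where this choice comes from, write $g(\lambda_n) = \frac{\sigma^2}{n\lambda_n^{1/2}} + \lambda_n$ for the order of the bound. Then
\[
g'(\lambda_n) = -\frac{\sigma^2}{2n\lambda_n^{3/2}} + 1 = 0
\quad\Longleftrightarrow\quad
\lambda_n = \left(\frac{\sigma^2}{2n}\right)^{2/3} \sim \left(\frac{\sigma^2}{n}\right)^{2/3},
\]
and at this choice both terms of $g$ are of order $(\sigma^2/n)^{2/3}$.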