2.3 The kernel trick
We now come to the actual, main topic of the chapter. We start with a very
simple observation. An alternative way of writing the fitted values from Ridge
regression is
\[
(X^TX + \lambda I)X^T = X^T(XX^T + \lambda I).
\]
One should be careful that the $\lambda I$ are different matrices on the two sides, as they have different dimensions ($p \times p$ on the left and $n \times n$ on the right). Multiplying by the corresponding inverses on both sides, we have
\[
X^T(XX^T + \lambda I)^{-1} = (X^TX + \lambda I)^{-1}X^T.
\]
We can multiply the right-hand side by $Y$ to obtain $\hat{\beta}^R_\lambda$ from Ridge regression, and then multiply on the left by $X$ to obtain the fitted values. So we have
\[
XX^T(XX^T + \lambda I)^{-1}Y = X(X^TX + \lambda I)^{-1}X^TY = X\hat{\beta}^R_\lambda.
\]
Note that $X^TX$ is a $p \times p$ matrix, and takes $O(np^2)$ time to compute. On the other hand, $XX^T$ is an $n \times n$ matrix, and takes $O(n^2p)$ time to compute. If $p \gg n$, then this way of computing the fitted values would be much quicker. The key point is that the fitted values from Ridge regression depend only on $K = XX^T$ (and $Y$). Why is this important?
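To make this concrete, here is a minimal numpy sketch (not from the notes; the data is simulated and $\lambda = 1$ is an arbitrary choice) checking that the two expressions agree, with the second form touching the data only through $K = XX^T$ and $Y$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 200, 1.0              # p >> n, where the K-based form pays off
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# Primal form: O(np^2) to build the p x p matrix X^T X.
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
fitted_primal = X @ beta_hat

# Dual form: O(n^2 p) to build K = X X^T; only K and Y are needed.
K = X @ X.T
fitted_dual = K @ np.linalg.solve(K + lam * np.eye(n), Y)

print(np.allclose(fitted_primal, fitted_dual))  # True
```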
Suppose we believe we have a quadratic signal
\[
Y_i = x_i^T\beta + \sum_{k,\ell} x_{ik}x_{i\ell}\theta_{k\ell} + \varepsilon_i.
\]
Of course, we can do Ridge regression, as long as we add the products $x_{ik}x_{i\ell}$ as predictors. But this requires $O(p^2)$ many predictors. Even if we use the $XX^T$ way, this has a cost of $O(n^2p^2)$, and if we used the naive method, it would require $O(np^4)$ operations.
We can do better than this. The idea is that we might be able to compute $K$ directly. Consider
\[
(1 + x_i^Tx_j)^2 = 1 + 2x_i^Tx_j + \sum_{k,\ell} x_{ik}x_{i\ell}x_{jk}x_{j\ell}.
\]
This is equal to the inner product between vectors of the form
\[
(1, \sqrt{2}\,x_{i1}, \ldots, \sqrt{2}\,x_{ip}, x_{i1}x_{i1}, \ldots, x_{i1}x_{ip}, x_{i2}x_{i1}, \ldots, x_{ip}x_{ip}). \tag{$*$}
\]
If we set $K_{ij} = (1 + x_i^Tx_j)^2$ and form $K(K + \lambda I)^{-1}Y$, this is equivalent to performing ridge regression with $(*)$ as our predictors. Note that here we don't scale our columns to have the same $\ell_2$ norm. This is pretty interesting, because computing this is only $O(n^2p)$. We managed to kill a factor of $p$ in this computation. The key idea here was that the fitted values depend only on $K$, and not on the values of $x_{ij}$ themselves.
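The equivalence is easy to check numerically. The following sketch (not from the notes; the dimensions and data are arbitrary) compares the directly computed $K_{ij} = (1 + x_i^Tx_j)^2$ with the Gram matrix of the explicit feature map $(*)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 5
X = rng.normal(size=(n, p))

# Direct computation of K in O(n^2 p), without forming quadratic features.
K = (1.0 + X @ X.T) ** 2

# Explicit feature map (*): (1, sqrt(2) x_1, ..., sqrt(2) x_p, {x_k x_l for all k, l}).
Phi = np.hstack([
    np.ones((n, 1)),
    np.sqrt(2.0) * X,
    np.einsum("ik,il->ikl", X, X).reshape(n, p * p),
])

print(np.allclose(K, Phi @ Phi.T))  # True: same Gram matrix, no O(p^2) features needed
```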
Consider the general scenario where we try to predict the value of $Y$ given a predictor $x \in \mathcal{X}$. In general, we don't even assume $\mathcal{X}$ has some nice linear structure where we can do linear regression.

If we want to do Ridge regression, then one thing we can do is try to construct some map $\phi: \mathcal{X}\to\mathbb{R}^D$ for some $D$, and then run Ridge regression using $\{\phi(x_i)\}$ as our predictors. If we want to fit some complicated, non-linear model, then $D$ can potentially be very large, and this can be costly.
The observation above is that instead of trying to do this, we can perhaps find some magical way of directly computing
\[
K_{ij} = k(x_i, x_j) = \langle\phi(x_i), \phi(x_j)\rangle.
\]
If we can do so, then we can simply use the above formula to obtain the fitted values (we shall discuss the problem of making new predictions later).
Since it is only the function $k$ that matters, perhaps we can specify a "regression method" simply by providing a suitable function $k: \mathcal{X}\times\mathcal{X}\to\mathbb{R}$. If we are given such a function $k$, when would it arise from a "feature map" $\phi: \mathcal{X}\to\mathbb{R}^D$?

More generally, we will find that it is convenient to allow $\phi$ to take values in infinite-dimensional vector spaces instead (since we don't actually have to compute $\phi$!).
Definition (Inner product space). An inner product space is a real vector space $\mathcal{H}$ endowed with a map $\langle\cdot, \cdot\rangle: \mathcal{H}\times\mathcal{H}\to\mathbb{R}$ that obeys:
- Symmetry: $\langle u, v\rangle = \langle v, u\rangle$.
- Linearity: if $a, b\in\mathbb{R}$, then $\langle au + bw, v\rangle = a\langle u, v\rangle + b\langle w, v\rangle$.
- Positive definiteness: $\langle u, u\rangle \ge 0$, with $\langle u, u\rangle = 0$ iff $u = 0$.
If we want $k$ to come from a feature map $\phi$, then an immediate necessary condition is that $k$ has to be symmetric. There is also another condition that corresponds to the positive-definiteness of the inner product.
Proposition. Given $\phi: \mathcal{X}\to\mathcal{H}$, define $k: \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ by
\[
k(x, x') = \langle\phi(x), \phi(x')\rangle.
\]
Then for any $x_1, \ldots, x_n\in\mathcal{X}$, the matrix $K\in\mathbb{R}^{n\times n}$ with entries
\[
K_{ij} = k(x_i, x_j)
\]
is positive semi-definite.
Proof. Let $x_1, \ldots, x_n\in\mathcal{X}$, and $\alpha\in\mathbb{R}^n$. Then
\[
\sum_{i,j}\alpha_i k(x_i, x_j)\alpha_j = \sum_{i,j}\alpha_i\langle\phi(x_i), \phi(x_j)\rangle\alpha_j = \left\langle\sum_i\alpha_i\phi(x_i), \sum_j\alpha_j\phi(x_j)\right\rangle \ge 0
\]
since the inner product is positive definite.
This suggests the following definition:
Definition (Positive-definite kernel). A positive-definite kernel (or simply kernel) is a symmetric map $k: \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ such that for all $n\in\mathbb{N}$ and $x_1, \ldots, x_n\in\mathcal{X}$, the matrix $K\in\mathbb{R}^{n\times n}$ with entries
\[
K_{ij} = k(x_i, x_j)
\]
is positive semi-definite.
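As a numerical sanity check of this definition (a sketch, not part of the notes; the particular kernels and sample points are arbitrary), one can form the Gram matrix on some points and confirm that its eigenvalues are non-negative up to floating-point error:

```python
import numpy as np

def gram(k, xs):
    """Gram matrix K_ij = k(x_i, x_j) for a list of points xs."""
    return np.array([[k(x, y) for y in xs] for x in xs])

def is_psd(K, tol=1e-9):
    """Positive semi-definiteness via the spectrum of the symmetric matrix K."""
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

rng = np.random.default_rng(2)
xs = list(rng.normal(size=(20, 3)))

poly = lambda x, y: (1.0 + x @ y) ** 3                    # polynomial kernel, d = 3
gauss = lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2.0)  # Gaussian kernel, sigma = 1
print(is_psd(gram(poly, xs)), is_psd(gram(gauss, xs)))    # True True
```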
We will prove that every positive-definite kernel comes from a feature map.
However, before that, let’s look at some examples.
Example. Suppose $k_1, k_2, \ldots$ are kernels. Then
- If $\alpha_1, \alpha_2 \ge 0$, then $\alpha_1 k_1 + \alpha_2 k_2$ is a kernel. Moreover, if
\[
k(x, x') = \lim_{m\to\infty} k_m(x, x')
\]
exists, then $k$ is a kernel.
- The pointwise product $k_1k_2$ is a kernel, where
\[
(k_1k_2)(x, x') = k_1(x, x')k_2(x, x').
\]
Example. The linear kernel is $k(x, x') = x^Tx'$. To see that this is a kernel, note that it is given by the feature map $\phi = \mathrm{id}$ together with the standard inner product on $\mathbb{R}^p$.
Example. The polynomial kernel is $k(x, x') = (1 + x^Tx')^d$ for all $d\in\mathbb{N}$. We saw this last time with $d = 2$. This is a kernel since both $1$ and $x^Tx'$ are kernels, and sums and products preserve kernels.
Example. The Gaussian kernel is
\[
k(x, x') = \exp\left(-\frac{\|x - x'\|_2^2}{2\sigma^2}\right).
\]
The quantity $\sigma$ is known as the bandwidth.
To show that it is a kernel, we decompose
\[
\|x - x'\|_2^2 = \|x\|_2^2 + \|x'\|_2^2 - 2x^Tx'.
\]
We define
\[
k_1(x, x') = \exp\left(-\frac{\|x\|_2^2}{2\sigma^2}\right)\exp\left(-\frac{\|x'\|_2^2}{2\sigma^2}\right).
\]
This is a kernel by taking $\phi(\cdot) = \exp\left(-\frac{\|\cdot\|_2^2}{2\sigma^2}\right)$.
Next, we can define
\[
k_2(x, x') = \exp\left(\frac{x^Tx'}{\sigma^2}\right) = \sum_{r=0}^\infty \frac{1}{r!}\left(\frac{x^Tx'}{\sigma^2}\right)^r.
\]
This is a limit of linear combinations, with non-negative coefficients, of powers of the linear kernel, hence is a kernel by the closure properties above. Thus, it follows that $k = k_1k_2$ is a kernel.
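The decomposition is straightforward to verify numerically; a small sketch (not from the notes; $\sigma = 1$ and the points are arbitrary):

```python
import numpy as np

sigma = 1.0
rng = np.random.default_rng(3)
x, x2 = rng.normal(size=3), rng.normal(size=3)

# Gaussian kernel evaluated directly.
gauss = np.exp(-np.sum((x - x2) ** 2) / (2 * sigma ** 2))

# The two factors k_1 and k_2 from the decomposition above.
k1 = np.exp(-np.sum(x ** 2) / (2 * sigma ** 2)) * np.exp(-np.sum(x2 ** 2) / (2 * sigma ** 2))
k2 = np.exp(x @ x2 / sigma ** 2)

print(np.isclose(gauss, k1 * k2))  # True
```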
Note also that any feature map giving this $k$ must take values in an infinite-dimensional inner product space. Roughly speaking, this is because we have arbitrarily large powers of $x$ and $x'$ in the expansion of $k$.
Example. The Sobolev kernel is defined as follows: we take $\mathcal{X} = [0, 1]$, and let
\[
k(x, x') = \min(x, x').
\]
A slick proof that this is a kernel is to notice that this is the covariance function of Brownian motion.
Example. The Jaccard similarity is defined as follows: take $\mathcal{X}$ to be the set of all subsets of $\{1, \ldots, p\}$. For $x, x'\in\mathcal{X}$, define
\[
k(x, x') = \begin{cases}\dfrac{|x\cap x'|}{|x\cup x'|} & x\cup x'\neq\emptyset\\[1ex] 1 & \text{otherwise}\end{cases}.
\]
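For concreteness, a tiny sketch (not from the notes) computing the Jaccard similarity on Python sets:

```python
def jaccard(x, y):
    """Jaccard similarity between two finite sets."""
    if not x and not y:       # x union y is empty only if both sets are empty
        return 1.0
    return len(x & y) / len(x | y)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5
print(jaccard(set(), set()))          # 1.0
```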
Theorem (Moore–Aronszajn theorem). For every kernel $k: \mathcal{X}\times\mathcal{X}\to\mathbb{R}$, there exists an inner product space $\mathcal{H}$ and a feature map $\phi: \mathcal{X}\to\mathcal{H}$ such that
\[
k(x, x') = \langle\phi(x), \phi(x')\rangle.
\]
This is not actually the full Moore–Aronszajn theorem, but a simplified version of it.
Proof. Let $\mathcal{H}$ denote the vector space of functions $f: \mathcal{X}\to\mathbb{R}$ of the form
\[
f(\cdot) = \sum_{i=1}^n \alpha_i k(\cdot, x_i) \tag{$\dagger$}
\]
for some $n\in\mathbb{N}$, $\alpha_1, \ldots, \alpha_n\in\mathbb{R}$ and $x_1, \ldots, x_n\in\mathcal{X}$. If
\[
g(\cdot) = \sum_{j=1}^m \beta_j k(\cdot, x_j')\in\mathcal{H},
\]
then we tentatively define the inner product of $f$ and $g$ by
\[
\langle f, g\rangle = \sum_{i=1}^n\sum_{j=1}^m \alpha_i\beta_j k(x_i, x_j').
\]
We have to check that this is an inner product, but even before that, we need to check that it is well-defined, since $f$ and $g$ can be represented in the form $(\dagger)$ in multiple ways. To do so, simply observe that
\[
\sum_{i=1}^n\sum_{j=1}^m \alpha_i\beta_j k(x_i, x_j') = \sum_{i=1}^n \alpha_i g(x_i) = \sum_{j=1}^m \beta_j f(x_j'). \tag{$\ddagger$}
\]
The first equality shows that the definition of our inner product does not depend on the representation of $g$, while the second equality shows that it doesn't depend on the representation of $f$.
To show this is an inner product, note that it is clearly symmetric and bilinear. To show it is positive definite, note that we have
\[
\langle f, f\rangle = \sum_{i=1}^n\sum_{j=1}^n \alpha_i k(x_i, x_j)\alpha_j \ge 0
\]
since the kernel is positive semi-definite. It remains to check that if $\langle f, f\rangle = 0$, then $f = 0$ as a function. To this end, note the important reproducing property: by $(\ddagger)$, we have
\[
\langle k(\cdot, x), f\rangle = f(x).
\]
This says $k(\cdot, x)$ represents the evaluation-at-$x$ linear functional.
In particular, we have
\[
f(x)^2 = \langle k(\cdot, x), f\rangle^2 \le \langle k(\cdot, x), k(\cdot, x)\rangle\,\langle f, f\rangle = 0.
\]
Here we used the Cauchy–Schwarz inequality, which, if you inspect the proof, does not require positive definiteness, just positive semi-definiteness. So it follows that $f\equiv 0$. Thus, we know that $\mathcal{H}$ is indeed an inner product space.

We now have to construct a feature map. Define $\phi: \mathcal{X}\to\mathcal{H}$ by
\[
\phi(x) = k(\cdot, x).
\]
Then we have already observed that
\[
\langle\phi(x), \phi(x')\rangle = \langle k(\cdot, x), k(\cdot, x')\rangle = k(x, x'),
\]
as desired.
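The construction can be mirrored numerically: represent an element of $\mathcal{H}$ by its coefficients $\alpha_i$ and points $x_i$, implement the inner product by the double-sum formula, and check the reproducing property. A sketch (not from the notes; the Gaussian kernel on $\mathbb{R}$ and the particular coefficients are arbitrary choices for illustration):

```python
import numpy as np

k = lambda x, y: np.exp(-(x - y) ** 2 / 2)   # any kernel on R would do here

def evaluate(alpha, pts, x):
    """f(x) for f = sum_i alpha_i k(., x_i)."""
    return sum(a * k(xi, x) for a, xi in zip(alpha, pts))

def inner(alpha, pts, beta, pts2):
    """<f, g> = sum_{i,j} alpha_i beta_j k(x_i, x'_j)."""
    return sum(a * b * k(xi, xj)
               for a, xi in zip(alpha, pts)
               for b, xj in zip(beta, pts2))

alpha, pts = [1.5, -0.7], [0.0, 2.0]   # some f in H
x = 0.8

# Reproducing property: <k(., x), f> = f(x), where phi(x) = k(., x) has alpha = [1], pts = [x].
print(np.isclose(inner([1.0], [x], alpha, pts), evaluate(alpha, pts, x)))  # True

# <phi(x), phi(x')> = k(x, x').
print(np.isclose(inner([1.0], [x], [1.0], [1.3]), k(x, 1.3)))              # True
```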
We shall boost this theorem up to its full strength, where we say there is a "unique" inner product space with such a feature map. Of course, it will not be unique unless we impose appropriate conditions, since we can always just throw some extra elements into $\mathcal{H}$.
Recall (or learn) from functional analysis that any inner product space $B$ is a normed space, with norm
\[
\|f\|_B^2 = \langle f, f\rangle_B.
\]
We say a sequence $(f_m)$ in a normed space $B$ is Cauchy if $\|f_m - f_n\|_B \to 0$ as $n, m\to\infty$. An inner product space is complete if every Cauchy sequence has a limit (in the space), and a complete inner product space is called a Hilbert space.
One important property of a Hilbert space $B$ is that if $V$ is a closed subspace of $B$, then every $f\in B$ has a unique representation as $f = v + w$, where $v\in V$ and
\[
w\in V^\perp = \{u\in B: \langle u, y\rangle = 0 \text{ for all } y\in V\}.
\]
By adding limits of Cauchy sequences to $\mathcal{H}$, we can obtain a Hilbert space. Indeed, if $(f_m)$ is Cauchy in $\mathcal{H}$, then
\[
|f_m(x) - f_n(x)| \le k^{1/2}(x, x)\,\|f_m - f_n\|_{\mathcal{H}} \to 0
\]
as $m, n\to\infty$. Since every Cauchy sequence in $\mathbb{R}$ converges (i.e. $\mathbb{R}$ is complete), it makes sense to define a limiting function
\[
f(x) = \lim_{n\to\infty} f_n(x),
\]
and it can be shown that after augmenting $\mathcal{H}$ with such limits, we obtain a Hilbert space. In fact, it is a special type of Hilbert space.
Definition (Reproducing kernel Hilbert space (RKHS)). A Hilbert space $B$ of functions $f: \mathcal{X}\to\mathbb{R}$ is a reproducing kernel Hilbert space if for each $x\in\mathcal{X}$, there exists a $k_x\in B$ such that
\[
\langle k_x, f\rangle = f(x)
\]
for all $f\in B$.

The function $k: \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ given by
\[
k(x, x') = \langle k_x, k_{x'}\rangle = k_x(x') = k_{x'}(x)
\]
is called the reproducing kernel associated with $B$.
By the Riesz representation theorem, this condition is equivalent to saying that pointwise evaluation is a continuous linear functional on $B$.

We know the reproducing kernel associated with an RKHS is a positive-definite kernel, and the Moore–Aronszajn theorem can be extended to show that to each kernel $k$ there corresponds a unique reproducing kernel Hilbert space $\mathcal{H}$ and feature map $\phi: \mathcal{X}\to\mathcal{H}$ such that $k(x, x') = \langle\phi(x), \phi(x')\rangle$.
Example. Take the linear kernel $k(x, x') = x^Tx'$. By definition, we have
\[
\mathcal{H} = \left\{ f: \mathbb{R}^p\to\mathbb{R} \;\middle|\; f(x) = \sum_{i=1}^n \alpha_i x_i^Tx \text{ for some } n\in\mathbb{N},\ \alpha_i\in\mathbb{R},\ x_i\in\mathbb{R}^p \right\}.
\]
We then see that this is equal to
\[
\mathcal{H} = \{f: \mathbb{R}^p\to\mathbb{R} \mid f(x) = \beta^Tx \text{ for some } \beta\in\mathbb{R}^p\},
\]
and if $f(x) = \beta^Tx$ with $\beta = \sum_i\alpha_i x_i$, then $\|f\|_{\mathcal{H}}^2 = k(\beta, \beta) = \|\beta\|_2^2$.
Example. Take the Sobolev kernel, where $\mathcal{X} = [0, 1]$ and $k(x, x') = \min(x, x')$. Then $\mathcal{H}$ includes all linear combinations of functions of the form $x\mapsto\min(x, x')$, where $x'\in[0, 1]$, and their pointwise limits. These functions look like

[Figure: the graph of $x\mapsto\min(x, x')$ against $x$; it increases with slope $1$ up to $x = x'$ and is constant afterwards.]
Since we allow arbitrary linear combinations of these things and pointwise limits,
this gives rise to a large class of functions. In particular, this includes all Lipschitz
functions that are 0 at the origin.
In fact, the resulting space is a Sobolev space, with the norm given by
\[
\|f\| = \left(\int_0^1 f'(x)^2\,\mathrm{d}x\right)^{1/2}.
\]
For example, if we take $f = \min(\cdot, x)$, then by definition we have
\[
\|\min(\cdot, x)\|_{\mathcal{H}}^2 = \min(x, x) = x,
\]
whereas the formula we claimed gives
\[
\int_0^x 1^2\,\mathrm{d}t = x.
\]
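The same check can be run numerically for a less trivial element of $\mathcal{H}$. The sketch below (not from the notes; the coefficients and points are arbitrary, and the integral is approximated by a Riemann sum) compares the kernel formula for $\|f\|_{\mathcal{H}}^2$ with $\int_0^1 f'(x)^2\,\mathrm{d}x$ for $f = \alpha_1\min(\cdot, x_1) + \alpha_2\min(\cdot, x_2)$:

```python
import numpy as np

# f = alpha_1 min(., x_1) + alpha_2 min(., x_2), with arbitrary coefficients and points.
alpha = np.array([1.0, -0.5])
pts = np.array([0.3, 0.8])

# RKHS norm from the kernel: ||f||_H^2 = sum_{i,j} alpha_i alpha_j min(x_i, x_j).
K = np.minimum.outer(pts, pts)
norm_kernel = alpha @ K @ alpha

# Sobolev norm: f'(x) = sum_i alpha_i 1{x < x_i}; approximate the integral of f'(x)^2 on [0, 1].
xs = np.linspace(0.0, 1.0, 100000, endpoint=False)
fprime = (xs[None, :] < pts[:, None]).T.astype(float) @ alpha
norm_sobolev = np.sum(fprime ** 2) / len(xs)

print(norm_kernel, norm_sobolev)  # both approximately 0.2
```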
In general, it is difficult to understand the RKHS, but if we can do so, this
provides a lot of information on what we are regularizing in the kernel trick.