2 Kernel machines

III Modern Statistical Methods
2.6 Large-scale kernel machines
How do we apply kernel machines at large scale? Whenever we want to make a prediction with a kernel machine, we need $O(n)$ many steps, which is quite a pain if we work with a large data set, say a few million points. But even before that, forming $K \in \mathbb{R}^{n \times n}$ and inverting $K + \lambda I$ or equivalent can be computationally too expensive.
One of the more popular approaches to tackle these difficulties is to form a randomized feature map $\hat\phi : \mathcal{X} \to \mathbb{R}^b$ such that
\[
  \mathbb{E}\, \hat\phi(x)^T \hat\phi(x') = k(x, x').
\]
To increase the quality of the approximation, we can use
\[
  x \mapsto \frac{1}{\sqrt{L}} (\hat\phi_1(x), \ldots, \hat\phi_L(x))^T \in \mathbb{R}^{Lb},
\]
where the $\hat\phi_i(x)$ are iid with the same distribution as $\hat\phi$.
Let $\Phi$ be the matrix with $i$th row $\frac{1}{\sqrt{L}} (\hat\phi_1(x_i), \ldots, \hat\phi_L(x_i))$. When performing, e.g. kernel ridge regression, we can compute
\[
  \hat\theta = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T Y.
\]
The computational cost of this is $O((Lb)^3 + n(Lb)^2)$. The key thing of course is that this is linear in $n$, and we can choose the size of $L$ so that we have a good trade-off between the computational cost and the accuracy of the approximation.
Now to predict a new $x$, we form
\[
  \frac{1}{\sqrt{L}} (\hat\phi_1(x), \ldots, \hat\phi_L(x))\, \hat\theta,
\]
and this takes $O(Lb)$ operations.
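The computations above can be sketched numerically. This is a minimal illustration only: the data, the sizes $n$, $p$, $Lb$, and the random cosine feature map below are placeholder assumptions standing in for $\hat\phi$ (the principled construction of $\hat\phi$ for shift-invariant kernels is developed in the rest of this section).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem; all sizes here are illustrative assumptions.
n, p, Lb, lam = 1000, 5, 50, 1.0
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# Hypothetical random feature map standing in for
# (1/sqrt(L))(phi_1(x), ..., phi_L(x)): random cosine features.
W = rng.normal(size=(p, Lb))
u = rng.uniform(-np.pi, np.pi, size=Lb)
Phi = np.sqrt(2.0 / Lb) * np.cos(X @ W + u)  # n x Lb feature matrix

# theta_hat = (Phi^T Phi + lambda I)^{-1} Phi^T Y:
# O((Lb)^3) for the solve plus O(n (Lb)^2) to form Phi^T Phi.
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Lb), Phi.T @ Y)

# Predicting at a new x is O(Lb): feature vector times theta.
x_new = rng.normal(size=p)
pred = np.sqrt(2.0 / Lb) * np.cos(x_new @ W + u) @ theta
```

Note that only the $Lb \times Lb$ system is ever solved; the $n \times n$ kernel matrix $K$ is never formed, which is the entire point of the approximation.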
In 2007, Rahimi and Recht proposed a random map for shift-invariant kernels, i.e. kernels $k$ such that $k(x, x') = h(x - x')$ for some $h$ (we work with $\mathcal{X} = \mathbb{R}^p$). A common example is the Gaussian kernel.
One motivation of this comes from a classical result known as Bochner's theorem.

Theorem (Bochner's theorem). Let $k : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ be a continuous kernel. Then $k$ is shift-invariant if and only if there exists some distribution $F$ on $\mathbb{R}^p$ and $c > 0$ such that if $W \sim F$, then
\[
  k(x, x') = c\, \mathbb{E} \cos((x - x')^T W).
\]
Our strategy is then to find some $\hat\phi$ depending on $W$ such that
\[
  c \cos((x - x')^T W) = \mathbb{E}[\hat\phi(x) \hat\phi(x') \mid W],
\]
where the expectation is over the additional randomness in $\hat\phi$. We are going to take $b = 1$, so we don't need a transpose on the right.

The idea is then to play around with trigonometric identities to try to write $c \cos((x - x')^T W)$ in this form. After some work, we find the following solution:
Let $u \sim U[-\pi, \pi]$, and take $x, y \in \mathbb{R}$. Then
\[
  \mathbb{E} \cos(x + u) \cos(y + u) = \mathbb{E} (\cos x \cos u - \sin x \sin u)(\cos y \cos u - \sin y \sin u).
\]
Since $-u$ has the same distribution as $u$, we see that $\mathbb{E} \cos u \sin u = 0$. Also, $\cos^2 u + \sin^2 u = 1$. Since $u$ ranges uniformly in $[-\pi, \pi]$, by symmetry, we have $\mathbb{E} \cos^2 u = \mathbb{E} \sin^2 u = \frac{1}{2}$. So we find that
\[
  2 \mathbb{E} \cos(x + u) \cos(y + u) = \cos x \cos y + \sin x \sin y = \cos(x - y).
\]
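As a quick sanity check (not part of the argument), this identity can be verified numerically by averaging over a fine grid of $u$-values; the test points $x, y$ below are arbitrary:

```python
import numpy as np

x, y = 0.7, -1.3

# Average 2 cos(x+u) cos(y+u) over one period of u; the oscillating
# part cos(x + y + 2u) averages to zero, leaving cos(x - y).
u = np.linspace(-np.pi, np.pi, 200_001)
lhs = np.mean(2 * np.cos(x + u) * np.cos(y + u))
print(lhs, np.cos(x - y))  # the two agree
```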
Thus, given a shift-invariant kernel $k$ with associated distribution $F$, we set $W \sim F$ independently of $u$. Define
\[
  \hat\phi(x) = \sqrt{2c} \cos(W^T x + u).
\]
Then
\begin{align*}
  \mathbb{E}\, \hat\phi(x) \hat\phi(x') &= 2c\, \mathbb{E}[\mathbb{E}[\cos(W^T x + u) \cos(W^T x' + u) \mid W]] \\
  &= c\, \mathbb{E} \cos(W^T (x - x')) \\
  &= k(x, x').
\end{align*}
In order to get this to work, we must find the appropriate $W$. In certain cases, this $W$ is actually not too hard to find:
Example. Take
\[
  k(x, x') = \exp\left( -\frac{1}{2\sigma^2} \|x - x'\|_2^2 \right),
\]
the Gaussian kernel. If $W \sim N(0, \sigma^{-2} I)$, then
\[
  \mathbb{E} e^{i t^T W} = e^{-\|t\|_2^2 / (2\sigma^2)} = \mathbb{E} \cos(t^T W).
\]
W ).