2 Kernel machines

III Modern Statistical Methods
2.6 Large-scale kernel machines
How do we apply kernel machines at large scale? Whenever we want to make a prediction with a kernel machine, we need $O(n)$ many steps, which is quite a pain if we work with a large data set, say a few million points. But even before that, forming $K \in \mathbb{R}^{n \times n}$ and inverting $K + \lambda I$ or equivalent can be computationally too expensive.
One of the more popular approaches to tackle these difficulties is to form a randomized feature map $\hat\phi : \mathcal{X} \to \mathbb{R}^b$ such that
\[
  \mathbb{E}\, \hat\phi(x)^T \hat\phi(x') = k(x, x').
\]
To increase the quality of the approximation, we can use
\[
  x \mapsto \frac{1}{\sqrt{L}} (\hat\phi_1(x), \ldots, \hat\phi_L(x))^T \in \mathbb{R}^{Lb},
\]
where the $\hat\phi_i(x)$ are iid with the same distribution as $\hat\phi$.
Let $\Phi$ be the matrix with $i$th row $\frac{1}{\sqrt{L}} (\hat\phi_1(x_i), \ldots, \hat\phi_L(x_i))$. When performing, e.g. kernel ridge regression, we can compute
\[
  \hat\theta = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T Y.
\]
The computational cost of this is $O((Lb)^3 + n(Lb)^2)$. The key thing of course is that this is linear in $n$, and we can choose the size of $L$ so that we have a good trade-off between the computational cost and the accuracy of the approximation.
Now to predict a new $x$, we form
\[
  \frac{1}{\sqrt{L}} (\hat\phi_1(x), \ldots, \hat\phi_L(x))\, \hat\theta,
\]
and this takes $O(Lb)$ operations.
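The computations above can be sketched numerically. This is a minimal illustration only: the data, the sizes $n$, $p$, $Lb$, and the random cosine feature map below are placeholder assumptions standing in for $\hat\phi$ (the principled construction of $\hat\phi$ for shift-invariant kernels is developed in the rest of this section).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem; all sizes here are illustrative assumptions.
n, p, Lb, lam = 1000, 5, 50, 1.0
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# Hypothetical random feature map standing in for
# (1/sqrt(L))(phi_1(x), ..., phi_L(x)): random cosine features.
W = rng.normal(size=(p, Lb))
u = rng.uniform(-np.pi, np.pi, size=Lb)
Phi = np.sqrt(2.0 / Lb) * np.cos(X @ W + u)  # n x Lb feature matrix

# theta_hat = (Phi^T Phi + lambda I)^{-1} Phi^T Y:
# O((Lb)^3) for the solve plus O(n (Lb)^2) to form Phi^T Phi.
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Lb), Phi.T @ Y)

# Predicting at a new x is O(Lb): feature vector times theta.
x_new = rng.normal(size=p)
pred = np.sqrt(2.0 / Lb) * np.cos(x_new @ W + u) @ theta
```

Note that only the $Lb \times Lb$ system is ever solved; the $n \times n$ kernel matrix $K$ is never formed, which is the entire point of the approximation.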
In 2007, Rahimi and Recht proposed a random map for shift-invariant kernels, i.e. kernels $k$ such that $k(x, x') = h(x - x')$ for some $h$ (we work with $\mathcal{X} = \mathbb{R}^p$). A common example is the Gaussian kernel.
One motivation of this comes from a classical result known as Bochner's theorem.

Theorem (Bochner's theorem). Let $k : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ be a continuous kernel. Then $k$ is shift-invariant if and only if there exists some distribution $F$ on $\mathbb{R}^p$ and $c > 0$ such that if $W \sim F$, then
\[
  k(x, x') = c\, \mathbb{E} \cos((x - x')^T W).
\]
Our strategy is then to find some $\hat\phi$ depending on $W$ such that
\[
  c \cos((x - x')^T W) = \mathbb{E}[\hat\phi(x) \hat\phi(x') \mid W],
\]
where the expectation is over the additional randomness in $\hat\phi$. We are going to take $b = 1$, so we don't need a transpose on the right.

The idea is then to play around with trigonometric identities to try to write $c \cos((x - x')^T W)$ in this form. After some work, we find the following solution:
Let $u \sim U[-\pi, \pi]$, and take $x, y \in \mathbb{R}$. Then
\[
  \mathbb{E} \cos(x + u) \cos(y + u) = \mathbb{E} (\cos x \cos u - \sin x \sin u)(\cos y \cos u - \sin y \sin u).
\]
Since $-u$ has the same distribution as $u$, we see that $\mathbb{E} \cos u \sin u = 0$. Also, $\cos^2 u + \sin^2 u = 1$. Since $u$ ranges uniformly in $[-\pi, \pi]$, by symmetry, we have $\mathbb{E} \cos^2 u = \mathbb{E} \sin^2 u = \frac{1}{2}$. So we find that
\[
  2 \mathbb{E} \cos(x + u) \cos(y + u) = \cos x \cos y + \sin x \sin y = \cos(x - y).
\]
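As a quick sanity check (not part of the argument), this identity can be verified numerically by averaging over a fine grid of $u$-values; the test points $x, y$ below are arbitrary:

```python
import numpy as np

x, y = 0.7, -1.3

# Average 2 cos(x+u) cos(y+u) over one period of u; the oscillating
# part cos(x + y + 2u) averages to zero, leaving cos(x - y).
u = np.linspace(-np.pi, np.pi, 200_001)
lhs = np.mean(2 * np.cos(x + u) * np.cos(y + u))
print(lhs, np.cos(x - y))  # the two agree
```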
Thus, given a shift-invariant kernel $k$ with associated distribution $F$, we set $W \sim F$ independently of $u$. Define
\[
  \hat\phi(x) = \sqrt{2c} \cos(W^T x + u).
\]
Then
\begin{align*}
  \mathbb{E}\, \hat\phi(x) \hat\phi(x') &= 2c\, \mathbb{E}[\mathbb{E}[\cos(W^T x + u) \cos(W^T x' + u) \mid W]] \\
  &= c\, \mathbb{E} \cos(W^T (x - x')) \\
  &= k(x, x').
\end{align*}
In order to get this to work, we must find the appropriate $W$. In certain cases, this $W$ is actually not too hard to find:
Example. Take
\[
  k(x, x') = \exp\left( -\frac{1}{2\sigma^2} \|x - x'\|_2^2 \right),
\]
the Gaussian kernel. If $W \sim N(0, \sigma^{-2} I)$, then
\[
  \mathbb{E} e^{i t^T W} = e^{-\|t\|_2^2 / (2\sigma^2)} = \mathbb{E} \cos(t^T W).
\]
W ).