2.5 Other kernel machines
Recall that the representer theorem was much more general than what we used
it for. We could apply it to any loss function. In particular, we can apply it
to classification problems. For example, we might want to predict whether an
email is spam. In this case, we have $Y \in \{-1, 1\}^n$.
Suppose we are trying to find a hyperplane that separates the two sets
$\{x_i\}_{i : Y_i = 1}$ and $\{x_i\}_{i : Y_i = -1}$. Consider the really optimistic case where they can
indeed be completely separated by a hyperplane through the origin. So there
exists $\beta \in \mathbb{R}^p$ (the normal vector) with $Y_i x_i^T \beta > 0$ for all $i$. Moreover, it would
be nice if we could maximize the minimum distance between any of the points and
the hyperplane defined by $\beta$. This is given by
\[
\max_{\beta \in \mathbb{R}^p} \min_i \frac{Y_i x_i^T \beta}{\|\beta\|_2}.
\]
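Here $Y_i x_i^T \beta / \|\beta\|_2$ is indeed the distance from $x_i$ to the hyperplane: the distance from a point $x_i$ to $\{x : x^T \beta = 0\}$ is $|x_i^T \beta| / \|\beta\|_2$, and on the correct side we have $Y_i x_i^T \beta > 0$, so
\[
\frac{|x_i^T \beta|}{\|\beta\|_2} = \frac{Y_i x_i^T \beta}{\|\beta\|_2}.
\]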
Thus, we can formulate our problem as: maximize $M$ among $\beta \in \mathbb{R}^p$, $M \geq 0$, subject to
\[
\frac{Y_i x_i^T \beta}{\|\beta\|_2} \geq M \quad \text{for all } i.
\]
This optimization problem gives the hyperplane that maximizes the margin
between the two classes.
Let’s think about the more realistic case where we cannot completely separate
the two sets. We then want to impose a penalty for each point on the wrong
side, which grows with the distance by which the point misses the margin.
There are two options. We can use the penalty
\[
\lambda \sum_{i=1}^n \left( M - \frac{Y_i x_i^T \beta}{\|\beta\|_2} \right)_+,
\]
where $t_+ = t \mathbf{1}_{\{t \geq 0\}}$. Another option is to take
\[
\lambda \sum_{i=1}^n \left( 1 - \frac{Y_i x_i^T \beta}{M \|\beta\|_2} \right)_+,
\]
which is perhaps more natural, since now everything is “dimensionless”. We
will actually use neither, but will start with the second form and massage it into
something that can be attacked with our existing machinery. Starting with the
second option, we can write our optimization problem as
\[
\max_{M \geq 0,\, \beta \in \mathbb{R}^p} \left( M - \lambda \sum_{i=1}^n \left( 1 - \frac{Y_i x_i^T \beta}{M \|\beta\|_2} \right)_+ \right).
\]
Since the objective function is invariant to any positive scaling of $\beta$, we may
assume $\|\beta\|_2 = \frac{1}{M}$ (so that $M = \frac{1}{\|\beta\|_2}$ and $M \|\beta\|_2 = 1$), and rewrite this problem as maximizing
\[
\frac{1}{\|\beta\|_2} - \lambda \sum_{i=1}^n \left( 1 - Y_i x_i^T \beta \right)_+.
\]
Replacing the maximization of $\frac{1}{\|\beta\|_2}$ with the minimization of $\|\beta\|_2^2$, and adding the
penalty instead of subtracting it, we modify this to
\[
\min_{\beta \in \mathbb{R}^p} \left( \|\beta\|_2^2 + \lambda \sum_{i=1}^n \left( 1 - Y_i x_i^T \beta \right)_+ \right).
\]
The final change we make is that we replace $\lambda$ with $\frac{1}{\lambda}$, and multiply the whole
objective by $\lambda$ to get
\[
\min_{\beta \in \mathbb{R}^p} \left( \lambda \|\beta\|_2^2 + \sum_{i=1}^n \left( 1 - Y_i x_i^T \beta \right)_+ \right).
\]
This looks much more like what we’ve seen before, with $\lambda \|\beta\|_2^2$ being the penalty
term and $\sum_{i=1}^n (1 - Y_i x_i^T \beta)_+$ being the loss function.
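In code, this split into a loss term plus a penalty term is direct; a minimal NumPy sketch (the function name is illustrative, not from the notes):

```python
import numpy as np

def hinge_objective(beta, X, Y, lam):
    """lam * ||beta||_2^2  +  sum_i (1 - Y_i x_i^T beta)_+ ."""
    penalty = lam * np.sum(beta ** 2)                    # penalty term
    loss = np.sum(np.maximum(0.0, 1 - Y * (X @ beta)))   # hinge loss term
    return penalty + loss
```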
The final modification is that we want to allow planes that don’t necessarily
pass through the origin. To do this, we allow ourselves to translate all the
$x_i$’s by a fixed vector $\delta \in \mathbb{R}^p$. This gives
\[
\min_{\beta \in \mathbb{R}^p,\, \delta \in \mathbb{R}^p} \left( \lambda \|\beta\|_2^2 + \sum_{i=1}^n \left( 1 - Y_i (x_i - \delta)^T \beta \right)_+ \right).
\]
Since $\delta$ only ever enters through the product $\delta^T \beta$, we can simply replace $-\delta^T \beta$ with a constant $\mu \in \mathbb{R}$,
and write our problem as
\[
(\hat{\mu}, \hat{\beta}) = \operatorname*{argmin}_{(\mu, \beta) \in \mathbb{R} \times \mathbb{R}^p} \sum_{i=1}^n \left( 1 - Y_i (x_i^T \beta + \mu) \right)_+ + \lambda \|\beta\|_2^2. \tag{$*$}
\]
This gives the support vector classifier.
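To make ($*$) concrete, here is a minimal numerical sketch of the support vector classifier objective, minimized by subgradient descent with NumPy. It is only an illustration: the function name, the fixed step size, the iteration count and the toy data are arbitrary choices, not part of the notes, and in practice one would use a dedicated convex solver.

```python
import numpy as np

def fit_svc(X, Y, lam=1.0, lr=0.01, n_iter=2000):
    """Minimize sum_i (1 - Y_i (x_i^T beta + mu))_+ + lam * ||beta||_2^2
    over (mu, beta) by subgradient descent; Y has entries in {-1, +1}."""
    n, p = X.shape
    beta, mu = np.zeros(p), 0.0
    for _ in range(n_iter):
        margins = Y * (X @ beta + mu)        # Y_i (x_i^T beta + mu)
        active = margins < 1                 # points with positive hinge loss
        # Subgradient of the hinge-loss sum, plus the gradient of the ridge penalty.
        g_beta = -(Y[active][:, None] * X[active]).sum(axis=0) + 2 * lam * beta
        g_mu = -Y[active].sum()
        beta -= lr * g_beta
        mu -= lr * g_mu
    return mu, beta

# Toy usage: two Gaussian clusters labelled -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
Y = np.repeat([-1.0, 1.0], 20)
mu_hat, beta_hat = fit_svc(X, Y)
print(np.mean(np.sign(X @ beta_hat + mu_hat) == Y))   # training accuracy
```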
This is still just fitting a hyperplane. Now we would like to “kernelize” this,
so that we can fit a more general surface instead. Let $\mathcal{H}$ be the RKHS corresponding to
the linear kernel; replacing this with another reproducing kernel will then give more flexible decision boundaries. We can then write ($*$) as
\[
(\hat{\mu}_\lambda, \hat{f}_\lambda) = \operatorname*{argmin}_{(\mu, f) \in \mathbb{R} \times \mathcal{H}} \sum_{i=1}^n \left( 1 - Y_i (f(x_i) + \mu) \right)_+ + \lambda \|f\|_{\mathcal{H}}^2.
\]
The representer theorem (or rather, a slight variant of it) tells us that the above
optimization problem is equivalent to the support vector machine
\[
(\hat{\mu}_\lambda, \hat{\alpha}_\lambda) = \operatorname*{argmin}_{(\mu, \alpha) \in \mathbb{R} \times \mathbb{R}^n} \sum_{i=1}^n \left( 1 - Y_i (K_i^T \alpha + \mu) \right)_+ + \lambda \alpha^T K \alpha,
\]
where $K_{ij} = k(x_i, x_j)$, $K_i$ denotes the $i$th row of $K$, and $k$ is the reproducing kernel of $\mathcal{H}$. Predictions at a
new $x$ are then given by
\[
\operatorname{sign}\left( \hat{\mu}_\lambda + \sum_{i=1}^n \hat{\alpha}_{\lambda, i}\, k(x, x_i) \right).
\]
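Here the representer theorem gives $\hat{f}_\lambda = \sum_{i=1}^n \hat{\alpha}_{\lambda, i}\, k(\cdot, x_i)$, so that $\hat{f}_\lambda(x_j) = K_j^T \hat{\alpha}_\lambda$ and $\|\hat{f}_\lambda\|_{\mathcal{H}}^2 = \hat{\alpha}_\lambda^T K \hat{\alpha}_\lambda$. As a rough sketch of how $K$, $\alpha$ and $\mu$ enter, the following minimizes the finite-dimensional objective by subgradient descent and then applies the prediction rule above; the Gaussian kernel, the function names and the optimization settings are illustrative assumptions rather than anything prescribed here.

```python
import numpy as np

def gauss_kernel(x, y, sigma=1.0):
    # An example reproducing kernel; any positive-definite kernel k would do.
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def fit_svm(X, Y, k=gauss_kernel, lam=1.0, lr=0.001, n_iter=2000):
    """Minimize sum_i (1 - Y_i (K_i^T alpha + mu))_+ + lam * alpha^T K alpha."""
    n = X.shape[0]
    K = np.array([[k(xi, xj) for xj in X] for xi in X])   # K_ij = k(x_i, x_j)
    alpha, mu = np.zeros(n), 0.0
    for _ in range(n_iter):
        margins = Y * (K @ alpha + mu)
        active = margins < 1
        # Hinge terms contribute -Y_i K_i; the penalty contributes 2 * lam * K alpha.
        g_alpha = -K[active].T @ Y[active] + 2 * lam * (K @ alpha)
        g_mu = -Y[active].sum()
        alpha -= lr * g_alpha
        mu -= lr * g_mu
    return mu, alpha

def predict(x, X, mu, alpha, k=gauss_kernel):
    """sign(mu + sum_i alpha_i k(x, x_i))."""
    return np.sign(mu + sum(a * k(x, xi) for a, xi in zip(alpha, X)))
```

The problem is convex but non-smooth, so in practice a dedicated solver would be used; the sketch only shows the shape of the computation.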
One possible alternative to this is to use logistic regression. Suppose that
\[
\log \left( \frac{\mathbb{P}(Y_i = 1)}{\mathbb{P}(Y_i = -1)} \right) = x_i^T \beta^0,
\]
and that $Y_1, \ldots, Y_n$ are independent. An estimate of $\beta^0$ can be obtained through
maximizing the likelihood, or equivalently,
\[
\operatorname*{argmin}_{\beta \in \mathbb{R}^p} \sum_{i=1}^n \log\left( 1 + \exp(-Y_i x_i^T \beta) \right).
\]
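To see the equivalence, note that the log-odds model together with $\mathbb{P}(Y_i = 1) + \mathbb{P}(Y_i = -1) = 1$ gives $\mathbb{P}(Y_i = y) = \left( 1 + \exp(-y\, x_i^T \beta^0) \right)^{-1}$ for $y \in \{-1, 1\}$, so by independence the log-likelihood at $\beta$ is
\[
\ell(\beta) = -\sum_{i=1}^n \log\left( 1 + \exp(-Y_i x_i^T \beta) \right),
\]
and maximizing $\ell$ is the same as minimizing the sum above.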
To kernelize this, we add a penalty term $\lambda \|\beta\|_2^2$ and then kernelize as before.
The resulting optimization problem is then given by
\[
\operatorname*{argmin}_{f \in \mathcal{H}} \left( \sum_{i=1}^n \log\left( 1 + \exp(-Y_i f(x_i)) \right) + \lambda \|f\|_{\mathcal{H}}^2 \right).
\]
We can then solve this using the representer theorem.
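Concretely, the representer theorem lets us write $\hat{f} = \sum_{j=1}^n \hat{\alpha}_j k(\cdot, x_j)$, so that $f(x_i) = K_i^T \alpha$ and $\|f\|_{\mathcal{H}}^2 = \alpha^T K \alpha$, and the problem becomes a smooth convex minimization over $\alpha \in \mathbb{R}^n$. A minimal sketch, assuming the kernel matrix $K$ has already been formed and using plain gradient descent with arbitrary settings:

```python
import numpy as np

def fit_kernel_logistic(K, Y, lam=1.0, lr=0.01, n_iter=5000):
    """Minimize sum_i log(1 + exp(-Y_i (K alpha)_i)) + lam * alpha^T K alpha."""
    n = K.shape[0]
    alpha = np.zeros(n)
    for _ in range(n_iter):
        m = Y * (K @ alpha)                       # Y_i f(x_i)
        s = 1.0 / (1.0 + np.exp(np.clip(m, -30, 30)))   # sigma(-Y_i f(x_i)), clipped for stability
        # Gradient: sum_i -Y_i s_i K_i + 2 lam K alpha (K is symmetric).
        grad = -K @ (Y * s) + 2 * lam * (K @ alpha)
        alpha -= lr * grad
    return alpha

def predict_prob(k_vec, alpha):
    """Estimate of P(Y = 1) at a new x, where k_vec[i] = k(x, x_i)."""
    return 1.0 / (1.0 + np.exp(-(k_vec @ alpha)))
```

Unlike the hinge loss, this objective is smooth, so plain gradient descent (or Newton's method) applies directly.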