2.5 Other kernel machines
Recall that the representer theorem was much more general than what we used
it for. We could apply it to any loss function. In particular, we can apply it
to classification problems. For example, we might want to predict whether an
email is spam. In this case, we have $Y \in \{-1, 1\}^n$.
Suppose we are trying to find a hyperplane that separates the two sets
$\{x_i\}_{i : Y_i = 1}$ and $\{x_i\}_{i : Y_i = -1}$. Consider the really optimistic case where they can
indeed be completely separated by a hyperplane through the origin. So there
exists $\beta \in \mathbb{R}^p$ (the normal vector) with $Y_i x_i^T \beta > 0$ for all $i$. Moreover, it would
be nice if we could maximize the minimum distance between any of the points and
the hyperplane defined by $\beta$. This is given by
\[
\max_{\beta \in \mathbb{R}^p} \min_i \frac{Y_i x_i^T \beta}{\|\beta\|_2}.
\]
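Here $Y_i x_i^T \beta / \|\beta\|_2$ is indeed the distance from $x_i$ to the hyperplane: the distance from a point $x_i$ to $\{x : x^T \beta = 0\}$ is $|x_i^T \beta| / \|\beta\|_2$, and on the correct side we have $Y_i x_i^T \beta > 0$, so
\[
\frac{|x_i^T \beta|}{\|\beta\|_2} = \frac{Y_i x_i^T \beta}{\|\beta\|_2}.
\]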
Thus, we can formulate our problem as: maximize $M$ among $\beta \in \mathbb{R}^p$, $M \geq 0$, subject to
\[
\frac{Y_i x_i^T \beta}{\|\beta\|_2} \geq M \quad \text{for all } i.
\]
This optimization problem gives the hyperplane that maximizes the margin
between the two classes.
Let’s think about the more realistic case where we cannot completely separate
the two sets. We then want to impose a penalty for each point on the wrong
side, which grows with the distance by which the point misses the margin.
There are two options. We can use the penalty
\[
\lambda \sum_{i=1}^n \left( M - \frac{Y_i x_i^T \beta}{\|\beta\|_2} \right)_+,
\]
where $t_+ = t \mathbf{1}_{\{t \geq 0\}}$. Another option is to take
\[
\lambda \sum_{i=1}^n \left( 1 - \frac{Y_i x_i^T \beta}{M \|\beta\|_2} \right)_+,
\]
which is perhaps more natural, since now everything is “dimensionless”. We
will actually use neither, but will start with the second form and massage it into
something that can be attacked with our existing machinery. Starting with the
second option, we can write our optimization problem as
\[
\max_{M \geq 0,\, \beta \in \mathbb{R}^p} \left( M - \lambda \sum_{i=1}^n \left( 1 - \frac{Y_i x_i^T \beta}{M \|\beta\|_2} \right)_+ \right).
\]
Since the objective function is invariant to any positive scaling of $\beta$, we may
assume $\|\beta\|_2 = \frac{1}{M}$ (so that $M = \frac{1}{\|\beta\|_2}$ and $M \|\beta\|_2 = 1$), and rewrite this problem as maximizing
\[
\frac{1}{\|\beta\|_2} - \lambda \sum_{i=1}^n \left( 1 - Y_i x_i^T \beta \right)_+.
\]
Replacing the maximization of $\frac{1}{\|\beta\|_2}$ with the minimization of $\|\beta\|_2^2$, and adding the
penalty instead of subtracting it, we modify this to
\[
\min_{\beta \in \mathbb{R}^p} \left( \|\beta\|_2^2 + \lambda \sum_{i=1}^n \left( 1 - Y_i x_i^T \beta \right)_+ \right).
\]
The final change we make is that we replace $\lambda$ with $\frac{1}{\lambda}$, and multiply the whole
objective by $\lambda$ to get
\[
\min_{\beta \in \mathbb{R}^p} \left( \lambda \|\beta\|_2^2 + \sum_{i=1}^n \left( 1 - Y_i x_i^T \beta \right)_+ \right).
\]
This looks much more like what we’ve seen before, with $\lambda \|\beta\|_2^2$ being the penalty
term and $\sum_{i=1}^n (1 - Y_i x_i^T \beta)_+$ being the loss function.
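In code, this split into a loss term plus a penalty term is direct; a minimal NumPy sketch (the function name is illustrative, not from the notes):

```python
import numpy as np

def hinge_objective(beta, X, Y, lam):
    """lam * ||beta||_2^2  +  sum_i (1 - Y_i x_i^T beta)_+ ."""
    penalty = lam * np.sum(beta ** 2)                    # penalty term
    loss = np.sum(np.maximum(0.0, 1 - Y * (X @ beta)))   # hinge loss term
    return penalty + loss
```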
The final modification is that we want to allow planes that don’t necessarily
pass through the origin. To do this, we allow ourselves to translate all the
$x_i$’s by a fixed vector $\delta \in \mathbb{R}^p$. This gives
\[
\min_{\beta \in \mathbb{R}^p,\, \delta \in \mathbb{R}^p} \left( \lambda \|\beta\|_2^2 + \sum_{i=1}^n \left( 1 - Y_i (x_i - \delta)^T \beta \right)_+ \right).
\]
Since $\delta$ only ever enters through the product $\delta^T \beta$, we can simply replace $-\delta^T \beta$ with a constant $\mu \in \mathbb{R}$,
and write our problem as
\[
(\hat{\mu}, \hat{\beta}) = \operatorname*{argmin}_{(\mu, \beta) \in \mathbb{R} \times \mathbb{R}^p} \sum_{i=1}^n \left( 1 - Y_i (x_i^T \beta + \mu) \right)_+ + \lambda \|\beta\|_2^2. \tag{$*$}
\]
This gives the support vector classifier.
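To make ($*$) concrete, here is a minimal numerical sketch of the support vector classifier objective, minimized by subgradient descent with NumPy. It is only an illustration: the function name, the fixed step size, the iteration count and the toy data are arbitrary choices, not part of the notes, and in practice one would use a dedicated convex solver.

```python
import numpy as np

def fit_svc(X, Y, lam=1.0, lr=0.01, n_iter=2000):
    """Minimize sum_i (1 - Y_i (x_i^T beta + mu))_+ + lam * ||beta||_2^2
    over (mu, beta) by subgradient descent; Y has entries in {-1, +1}."""
    n, p = X.shape
    beta, mu = np.zeros(p), 0.0
    for _ in range(n_iter):
        margins = Y * (X @ beta + mu)        # Y_i (x_i^T beta + mu)
        active = margins < 1                 # points with positive hinge loss
        # Subgradient of the hinge-loss sum, plus the gradient of the ridge penalty.
        g_beta = -(Y[active][:, None] * X[active]).sum(axis=0) + 2 * lam * beta
        g_mu = -Y[active].sum()
        beta -= lr * g_beta
        mu -= lr * g_mu
    return mu, beta

# Toy usage: two Gaussian clusters labelled -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
Y = np.repeat([-1.0, 1.0], 20)
mu_hat, beta_hat = fit_svc(X, Y)
print(np.mean(np.sign(X @ beta_hat + mu_hat) == Y))   # training accuracy
```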
This is still just fitting a hyperplane. Now we would like to “kernelize” this,
so that we can fit a more general surface instead. Let $\mathcal{H}$ be the RKHS corresponding to
the linear kernel; replacing this with another reproducing kernel will then give more flexible decision boundaries. We can then write ($*$) as
\[
(\hat{\mu}_\lambda, \hat{f}_\lambda) = \operatorname*{argmin}_{(\mu, f) \in \mathbb{R} \times \mathcal{H}} \sum_{i=1}^n \left( 1 - Y_i (f(x_i) + \mu) \right)_+ + \lambda \|f\|_{\mathcal{H}}^2.
\]
The representer theorem (or rather, a slight variant of it) tells us that the above
optimization problem is equivalent to the support vector machine
\[
(\hat{\mu}_\lambda, \hat{\alpha}_\lambda) = \operatorname*{argmin}_{(\mu, \alpha) \in \mathbb{R} \times \mathbb{R}^n} \sum_{i=1}^n \left( 1 - Y_i (K_i^T \alpha + \mu) \right)_+ + \lambda \alpha^T K \alpha,
\]
where $K_{ij} = k(x_i, x_j)$, $K_i$ denotes the $i$th row of $K$, and $k$ is the reproducing kernel of $\mathcal{H}$. Predictions at a
new $x$ are then given by
\[
\operatorname{sign}\left( \hat{\mu}_\lambda + \sum_{i=1}^n \hat{\alpha}_{\lambda, i}\, k(x, x_i) \right).
\]
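Here the representer theorem gives $\hat{f}_\lambda = \sum_{i=1}^n \hat{\alpha}_{\lambda, i}\, k(\cdot, x_i)$, so that $\hat{f}_\lambda(x_j) = K_j^T \hat{\alpha}_\lambda$ and $\|\hat{f}_\lambda\|_{\mathcal{H}}^2 = \hat{\alpha}_\lambda^T K \hat{\alpha}_\lambda$. As a rough sketch of how $K$, $\alpha$ and $\mu$ enter, the following minimizes the finite-dimensional objective by subgradient descent and then applies the prediction rule above; the Gaussian kernel, the function names and the optimization settings are illustrative assumptions rather than anything prescribed here.

```python
import numpy as np

def gauss_kernel(x, y, sigma=1.0):
    # An example reproducing kernel; any positive-definite kernel k would do.
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def fit_svm(X, Y, k=gauss_kernel, lam=1.0, lr=0.001, n_iter=2000):
    """Minimize sum_i (1 - Y_i (K_i^T alpha + mu))_+ + lam * alpha^T K alpha."""
    n = X.shape[0]
    K = np.array([[k(xi, xj) for xj in X] for xi in X])   # K_ij = k(x_i, x_j)
    alpha, mu = np.zeros(n), 0.0
    for _ in range(n_iter):
        margins = Y * (K @ alpha + mu)
        active = margins < 1
        # Hinge terms contribute -Y_i K_i; the penalty contributes 2 * lam * K alpha.
        g_alpha = -K[active].T @ Y[active] + 2 * lam * (K @ alpha)
        g_mu = -Y[active].sum()
        alpha -= lr * g_alpha
        mu -= lr * g_mu
    return mu, alpha

def predict(x, X, mu, alpha, k=gauss_kernel):
    """sign(mu + sum_i alpha_i k(x, x_i))."""
    return np.sign(mu + sum(a * k(x, xi) for a, xi in zip(alpha, X)))
```

The problem is convex but non-smooth, so in practice a dedicated solver would be used; the sketch only shows the shape of the computation.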
One possible alternative to this is to use logistic regression. Suppose that
\[
\log \left( \frac{\mathbb{P}(Y_i = 1)}{\mathbb{P}(Y_i = -1)} \right) = x_i^T \beta^0,
\]
and that $Y_1, \ldots, Y_n$ are independent. An estimate of $\beta^0$ can be obtained through
maximizing the likelihood, or equivalently,
\[
\operatorname*{argmin}_{\beta \in \mathbb{R}^p} \sum_{i=1}^n \log\left( 1 + \exp(-Y_i x_i^T \beta) \right).
\]
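To see the equivalence, note that the log-odds model together with $\mathbb{P}(Y_i = 1) + \mathbb{P}(Y_i = -1) = 1$ gives $\mathbb{P}(Y_i = y) = \left( 1 + \exp(-y\, x_i^T \beta^0) \right)^{-1}$ for $y \in \{-1, 1\}$, so by independence the log-likelihood at $\beta$ is
\[
\ell(\beta) = -\sum_{i=1}^n \log\left( 1 + \exp(-Y_i x_i^T \beta) \right),
\]
and maximizing $\ell$ is the same as minimizing the sum above.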
To kernelize this, we add a penalty term $\lambda \|\beta\|_2^2$ and then kernelize as before.
The resulting optimization problem is then given by
\[
\operatorname*{argmin}_{f \in \mathcal{H}} \left( \sum_{i=1}^n \log\left( 1 + \exp(-Y_i f(x_i)) \right) + \lambda \|f\|_{\mathcal{H}}^2 \right).
\]
We can then solve this using the representer theorem.
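Concretely, the representer theorem lets us write $\hat{f} = \sum_{j=1}^n \hat{\alpha}_j k(\cdot, x_j)$, so that $f(x_i) = K_i^T \alpha$ and $\|f\|_{\mathcal{H}}^2 = \alpha^T K \alpha$, and the problem becomes a smooth convex minimization over $\alpha \in \mathbb{R}^n$. A minimal sketch, assuming the kernel matrix $K$ has already been formed and using plain gradient descent with arbitrary settings:

```python
import numpy as np

def fit_kernel_logistic(K, Y, lam=1.0, lr=0.01, n_iter=5000):
    """Minimize sum_i log(1 + exp(-Y_i (K alpha)_i)) + lam * alpha^T K alpha."""
    n = K.shape[0]
    alpha = np.zeros(n)
    for _ in range(n_iter):
        m = Y * (K @ alpha)                       # Y_i f(x_i)
        s = 1.0 / (1.0 + np.exp(np.clip(m, -30, 30)))   # sigma(-Y_i f(x_i)), clipped for stability
        # Gradient: sum_i -Y_i s_i K_i + 2 lam K alpha (K is symmetric).
        grad = -K @ (Y * s) + 2 * lam * (K @ alpha)
        alpha -= lr * grad
    return alpha

def predict_prob(k_vec, alpha):
    """Estimate of P(Y = 1) at a new x, where k_vec[i] = k(x, x_i)."""
    return 1.0 / (1.0 + np.exp(-(k_vec @ alpha)))
```

Unlike the hinge loss, this objective is smooth, so plain gradient descent (or Newton's method) applies directly.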