2 Kernel machines


2.1 Ridge regression
Ridge regression was introduced around 1970. The idea of Ridge regression is
to shrink our estimator a bit in order to lessen its variance, at the cost of
introducing a bias. The hope is that this trade-off results in a smaller mean
squared error.
Definition (Ridge regression). Ridge regression solves
$$(\hat\mu_\lambda^R, \hat\beta_\lambda^R) = \operatorname*{argmin}_{(\mu, \beta) \in \mathbb{R} \times \mathbb{R}^p} \left\{ \|Y - \mu 1 - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \right\},$$
where 1 is a vector of all 1's. Here λ ≥ 0 is a tuning parameter, and it controls
how much we penalize a large choice of β.
Here we explicitly have an intercept term. Usually, we eliminate this by
adding a column of 1's to X. But here we want to separate it out, since
we do not want to give a penalty for large µ. For example, if we are measuring
temperature, and we decide to measure in degrees Celsius rather than Kelvin,
then we don't want the resulting $\hat\mu$, $\hat\beta$ to change.
More precisely, our formulation of Ridge regression is so that if we make the
modification
$$Y' = c1 + Y,$$
then we have
$$\hat\mu_\lambda^R(Y') = \hat\mu_\lambda^R(Y) + c.$$
Note also that the Ridge regression formula makes sense only if the entries of
β have the same order of magnitude, or else the penalty will only have a significant
effect on the terms of large magnitude. Standard practice is to subtract from
each column of X its mean, and then scale it to have $\ell_2$ norm $\sqrt{n}$. The actual
number is not important here, but it will be in the case of the Lasso.
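For concreteness, here is a minimal numpy sketch of this standardization (the function and variable names are my own illustration, not part of the notes):

import numpy as np

def standardize(X):
    """Centre each column of X, then scale it to have l2 norm sqrt(n)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)              # subtract the column means
    norms = np.linalg.norm(Xc, axis=0)   # current l2 norm of each centred column
    return Xc * (np.sqrt(n) / norms)     # rescale so each column has norm sqrt(n)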
By differentiating, one sees that the solution to the optimization problem is
$$\hat\mu_\lambda^R = \bar Y = \frac{1}{n}\sum_i Y_i, \qquad \hat\beta_\lambda^R = (X^TX + \lambda I)^{-1} X^T Y.$$
Note that for any λ > 0, the matrix X^T X + λI is invertible, since X^T X is
positive semi-definite. In particular, this can work even if we have more
parameters than data points. This will be important when we work with the
Lasso later on.
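To make the closed-form solution concrete, here is a small numpy sketch (assuming the columns of X have already been centred as above; the names are illustrative):

import numpy as np

def ridge(X, Y, lam):
    """Closed-form Ridge estimates, assuming the columns of X are centred."""
    n, p = X.shape
    mu_hat = Y.mean()                                               # intercept = sample mean of Y
    beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)  # (X^T X + lam I)^{-1} X^T Y
    return mu_hat, beta_hat

Using np.linalg.solve rather than forming the inverse explicitly is the standard, numerically more stable choice.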
If some god-like being told us a suitable value of
λ
to use, then Ridge
regression always works better than ordinary least squares.
Theorem. Suppose rank(X) = p. Then for λ > 0 sufficiently small (depending
on $\beta^0$ and $\sigma^2$), the matrix
$$\mathbb{E}(\hat\beta^{OLS} - \beta^0)(\hat\beta^{OLS} - \beta^0)^T - \mathbb{E}(\hat\beta_\lambda^R - \beta^0)(\hat\beta_\lambda^R - \beta^0)^T \tag{$*$}$$
is positive definite.
Proof. We know that the first term is just $\sigma^2(X^TX)^{-1}$. The second term
has a variance part and a bias part. We first look at the bias:
$$\begin{aligned}
\mathbb{E}[\hat\beta_\lambda^R - \beta^0] &= (X^TX + \lambda I)^{-1}X^TX\beta^0 - \beta^0\\
&= (X^TX + \lambda I)^{-1}(X^TX + \lambda I - \lambda I)\beta^0 - \beta^0\\
&= -\lambda(X^TX + \lambda I)^{-1}\beta^0.
\end{aligned}$$
We can also compute the variance
$$\operatorname{var}(\hat\beta_\lambda^R) = \sigma^2(X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1}.$$
Note that both terms appearing in the squared error look like
$$(X^TX + \lambda I)^{-1}(\text{something})(X^TX + \lambda I)^{-1}.$$
So let's try to write $\sigma^2(X^TX)^{-1}$ in this form. Note that
$$\begin{aligned}
(X^TX)^{-1} &= (X^TX + \lambda I)^{-1}(X^TX + \lambda I)(X^TX)^{-1}(X^TX + \lambda I)(X^TX + \lambda I)^{-1}\\
&= (X^TX + \lambda I)^{-1}\bigl(X^TX + 2\lambda I + \lambda^2(X^TX)^{-1}\bigr)(X^TX + \lambda I)^{-1}.
\end{aligned}$$
Thus, we can write $(*)$ as
$$\begin{aligned}
&(X^TX + \lambda I)^{-1}\Bigl(\sigma^2\bigl(X^TX + 2\lambda I + \lambda^2(X^TX)^{-1}\bigr) - \sigma^2 X^TX - \lambda^2\beta^0(\beta^0)^T\Bigr)(X^TX + \lambda I)^{-1}\\
&\quad= \lambda(X^TX + \lambda I)^{-1}\Bigl(2\sigma^2 I + \sigma^2\lambda(X^TX)^{-1} - \lambda\beta^0(\beta^0)^T\Bigr)(X^TX + \lambda I)^{-1}.
\end{aligned}$$
Since λ > 0, this is positive definite iff
$$2\sigma^2 I + \sigma^2\lambda(X^TX)^{-1} - \lambda\beta^0(\beta^0)^T$$
is positive definite, which is true for $0 < \lambda < 2\sigma^2/\|\beta^0\|_2^2$.
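One can also check the theorem numerically. The sketch below (my own illustration, not part of the notes) compares the two mean squared error matrices for a small simulated design, with λ below the threshold $2\sigma^2/\|\beta^0\|_2^2$:

import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 50, 5, 1.0
X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p)
lam = sigma2 / np.sum(beta0**2)            # strictly below 2*sigma^2 / ||beta0||_2^2

XtX = X.T @ X
A_inv = np.linalg.inv(XtX + lam * np.eye(p))

mse_ols = sigma2 * np.linalg.inv(XtX)      # OLS is unbiased: its MSE matrix is its variance
var_ridge = sigma2 * A_inv @ XtX @ A_inv   # variance of the Ridge estimator
bias = -lam * A_inv @ beta0                # bias of the Ridge estimator
mse_ridge = var_ridge + np.outer(bias, bias)

# the theorem says the difference should be positive definite
print(np.linalg.eigvalsh(mse_ols - mse_ridge).min() > 0)   # expect True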
While this is nice, this is not really telling us much, because we don’t know
how to pick the correct λ. It also doesn't tell us when we should expect a big
improvement from Ridge regression.
To understand this better, we need to use the singular value decomposition.
Theorem (Singular value decomposition). Let $X \in \mathbb{R}^{n\times p}$ be any matrix. Then
it has a singular value decomposition (SVD)
$$\underset{n\times p}{X} = \underset{n\times n}{U}\;\underset{n\times p}{D}\;\underset{p\times p}{V^T},$$
where U, V are orthogonal matrices, and $D_{11} \geq D_{22} \geq \cdots \geq D_{mm} \geq 0$, where
$m = \min(n, p)$, and all other entries of D are zero.
When n > p, there is an alternative version where U is an n × p matrix
with orthonormal columns, and D is a p × p diagonal matrix. This is done by
replacing U by its first p columns and D by its first p rows. This is known as
the thin singular value decomposition. In this case, $U^TU = I_p$ but $UU^T$ is not
the identity.
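For instance, numpy returns the thin SVD directly (a small sketch of my own):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))                  # n > p
U, d, Vt = np.linalg.svd(X, full_matrices=False)  # thin SVD: U is 10 x 3, d holds the D_jj
print(np.allclose(U.T @ U, np.eye(3)))            # True: the columns of U are orthonormal
print(np.allclose(U @ U.T, np.eye(10)))           # False: U U^T is not the identity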
Let's now try to use this to understand Ridge regression a little better. Suppose
n > p. Then using the thin SVD, the fitted values from Ridge regression are
$$X\hat\beta_\lambda^R = X(X^TX + \lambda I)^{-1}X^TY = UDV^T(VD^2V^T + \lambda I)^{-1}VDU^TY.$$
We now note that $VD^2V^T + \lambda I = V(D^2 + \lambda I)V^T$, since V is still orthogonal.
We then have
$$(V(D^2 + \lambda I)V^T)^{-1} = V(D^2 + \lambda I)^{-1}V^T.$$
Since $D^2$ and λI are both diagonal, it is easy to compute their inverses as well.
We then have
$$X\hat\beta_\lambda^R = UD^2(D^2 + \lambda I)^{-1}U^TY = \sum_{j=1}^p U_j \frac{D_{jj}^2}{D_{jj}^2 + \lambda}U_j^TY.$$
Here $U_j$ refers to the jth column of U.
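The following sketch (illustrative only, assuming a centred design) checks this shrinkage formula against the closed-form fitted values:

import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 20, 4, 2.0
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)

# fitted values from the closed-form Ridge solution
fitted_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# fitted values from the SVD shrinkage formula
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)                   # the factors D_jj^2 / (D_jj^2 + lambda)
fitted_svd = U @ (shrink * (U.T @ Y))          # sum_j U_j * shrink_j * (U_j^T Y)
print(np.allclose(fitted_direct, fitted_svd))  # expect True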
Now note that the columns of U form an orthonormal basis for the column
space of X. If λ = 0, then this is just a fancy formula for the projection of Y
onto the column space of X. Thus, what this formula is telling us is that we
look at this projection, look at the coordinates in terms of the basis given by
the columns of U, and scale accordingly.
We can now concretely see the effect of our λ. The shrinking depends on the
size of $D_{jj}$, and the larger $D_{jj}$ is, the less shrinking we do.
This is not very helpful if we don't have a concrete interpretation of the $D_{jj}$,
or rather, the columns of U. It turns out the columns of U are what are known
as the normalized principal components of X.
Take $u \in \mathbb{R}^p$ with $\|u\|_2 = 1$. The sample variance of $Xu \in \mathbb{R}^n$ is then
$$\frac{1}{n}\|Xu\|_2^2 = \frac{1}{n}u^TX^TXu = \frac{1}{n}u^TVD^2V^Tu.$$
We write $w = V^Tu$. Then $\|w\|_2 = 1$ since V is orthogonal. So we have
$$\frac{1}{n}\|Xu\|_2^2 = \frac{1}{n}w^TD^2w = \frac{1}{n}\sum_j w_j^2 D_{jj}^2 \leq \frac{1}{n}D_{11}^2,$$
and the bound is achieved when $w = e_1$, or equivalently, $u = Ve_1 = V_1$. Thus,
$V_1$ gives the coefficients of the linear combination of columns of X that has the largest
sample variance. We can then write the result as
$$XV_1 = UDV^TV_1 = U_1D_{11}.$$
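A short numerical check of this identity and of the variance-maximising property (again my own illustration):

import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 4))
X = X - X.mean(axis=0)                            # centre the columns

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V1 = Vt[0]                                        # first right singular vector V_1
print(np.allclose(X @ V1, d[0] * U[:, 0]))        # X V_1 = U_1 D_11

u = rng.standard_normal(4)
u /= np.linalg.norm(u)                            # a random unit vector for comparison
print(np.linalg.norm(X @ V1)**2 >= np.linalg.norm(X @ u)**2)  # V_1 maximises the sample variance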
We can extend this to a description of the other columns of U, which is done
in the example sheet. Roughly, the subsequent principal components obey the
same optimality conditions with the added constraint of being orthogonal to all
earlier principal components.
The conclusion is that Ridge regression works best if $\mathbb{E}Y$ lies in the space
spanned by the large principal components of X.