Part III Modern Statistical Methods
Based on lectures by R. D. Shah
Notes taken by Dexter Chua
Michaelmas 2017
These notes are not endorsed by the lecturers, and I have modified them (often
significantly) after lectures. They are nowhere near accurate representations of what
was actually lectured, and in particular, all errors are almost surely mine.
The remarkable development of computing power and other technology now allows
scientists and businesses to routinely collect datasets of immense size and complexity.
Most classical statistical methods were designed for situations with many observations
and a few, carefully chosen variables. However, we now often gather data where we
have huge numbers of variables, in an attempt to capture as much information as we
can about anything which might conceivably have an influence on the phenomenon
of interest. This dramatic increase in the number of variables makes modern datasets
strikingly different, as well-established traditional methods perform either very poorly,
or often do not work at all.
Developing methods that are able to extract meaningful information from these large
and challenging datasets has recently been an area of intense research in statistics,
machine learning and computer science. In this course, we will study some of the
methods that have been developed to analyse such datasets. We aim to cover some of
the following topics.
Kernel machines: the kernel trick, the representer theorem, support vector
machines, the hashing trick.
Penalised regression: Ridge regression, the Lasso and variants.
Graphical modelling: neighbourhood selection and the graphical Lasso. Causal
inference through structural equation modelling; the PC algorithm.
High-dimensional inference: the closed testing procedure and the Benjamini–Hochberg procedure; the debiased Lasso.
Pre-requisites
Basic knowledge of statistics, probability, linear algebra and real analysis. Some
background in optimisation would be helpful but is not essential.
Contents
0 Introduction
1 Classical statistics
2 Kernel machines
2.1 Ridge regression
2.2 v-fold cross-validation
2.3 The kernel trick
2.4 Making predictions
2.5 Other kernel machines
2.6 Large-scale kernel machines
3 The Lasso and beyond
3.1 The Lasso estimator
3.2 Basic concentration inequalities
3.3 Convex analysis and optimization theory
3.4 Properties of Lasso solutions
3.5 Variable selection
3.6 Computation of Lasso solutions
3.7 Extensions of the Lasso
4 Graphical modelling
4.1 Conditional independence graphs
4.2 Structural equation modelling
4.3 The PC algorithm
5 High-dimensional inference
5.1 Multiple testing
5.2 Inference in high-dimensional regression
0 Introduction
In recent years, there has been a rather significant change in what sorts of data
we have to handle and what questions we ask about them, witnessed by the
popularity of the buzzwords “big data” and “machine learning”. In classical
statistics, we usually have a small set of parameters, and a very large data set.
We then use the large data set to estimate the parameters.
However, nowadays we often see scenarios where we have a very large number
of parameters, and the data set is relatively small. If we tried to apply classical
linear regression, then we could tune the parameters to fit the data perfectly, and
still have great freedom to change the parameters without affecting the fit.
One example is that we might want to test which genes are responsible for
a particular disease. In this case, there is a huge number of genes to consider,
and there is good reason to believe that most of them are irrelevant, i.e. the
parameters should be set to zero. Thus, we want to develop methods that find
the “best” fitting model that takes this into account.
Another problem we might encounter is that we just have a large data set,
and doing linear regression seems a bit silly. If we have so much data, we might
as well try to fit more complicated curves, such as polynomial functions and
friends. Perhaps more ambitiously, we might try to find the best continuously
differentiable function that fits the curve, or, as analysts will immediately suggest
as an alternative, weakly differentiable functions.
There are many things we can talk about, and we can’t talk about all of
them. In this course, we are going to cover 4 different topics of different size:
Kernel machines
The Lasso and its extensions
Graphical modelling and causal inference
Multiple testing and high-dimensional inference
The four topics are rather disjoint, and draw on different mathematical skills.
1 Classical statistics
This is a course on modern statistical methods. Before we study methods, we
give a brief summary of what we are not going to talk about, namely classical
statistics.
So suppose we are doing regression. We have some predictors $x_i \in \mathbb{R}^p$ and responses $Y_i \in \mathbb{R}$, and we hope to find a model that describes $Y$ as a function of $x$. For convenience, define
$$X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix} \in \mathbb{R}^{n \times p}, \qquad Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix} \in \mathbb{R}^n.$$
The linear model then assumes there is some $\beta^0 \in \mathbb{R}^p$ such that
$$Y = X\beta^0 + \varepsilon,$$
where $\varepsilon$ is some (hopefully small) vector of random errors. Our goal is then to estimate $\beta^0$ given the data we have.
If $X$ has full column rank, so that $X^TX$ is invertible, then we can use ordinary least squares to estimate $\beta^0$, with estimate
$$\hat{\beta}^{OLS} = \operatorname*{argmin}_{\beta \in \mathbb{R}^p}\|Y - X\beta\|_2^2 = (X^TX)^{-1}X^TY.$$
This assumes nothing about $\varepsilon$ itself, but if we assume that $\mathbb{E}\varepsilon = 0$ and $\operatorname{var}(\varepsilon) = \sigma^2 I$, then this estimate satisfies
$$\mathbb{E}_{\beta^0}\hat{\beta}^{OLS} = (X^TX)^{-1}X^TX\beta^0 = \beta^0,$$
$$\operatorname{var}_{\beta^0}(\hat{\beta}^{OLS}) = (X^TX)^{-1}X^T\operatorname{var}(\varepsilon)X(X^TX)^{-1} = \sigma^2(X^TX)^{-1}.$$
In particular, this is an unbiased estimator. Even better, it is the best linear unbiased estimator. More precisely, the Gauss–Markov theorem says that any other unbiased linear estimator $\tilde{\beta} = AY$ has $\operatorname{var}(\tilde{\beta}) - \operatorname{var}(\hat{\beta}^{OLS})$ positive semi-definite.
Of course, ordinary least squares is not the only way to estimate
β
0
. Another
common method for estimating parameters is maximum likelihood estimation,
and this works for more general models than linear regression. For people who are
already sick of meeting likelihoods, this will be the last time we meet likelihoods
in this course.
Suppose we want to estimate a parameter $\theta$ via knowledge of some data $Y$. We assume $Y$ has density $f(y; \theta)$. We define the log-likelihood by
$$\ell(\theta) = \log f(Y; \theta).$$
The maximum likelihood estimator then maximizes $\ell(\theta)$ over $\theta$ to get $\hat{\theta}$.
Similar to ordinary least squares, there is a theorem that says maximum likelihood estimation is the "best". To state it, we introduce the Fisher information matrix. This is a family of $d \times d$ matrices indexed by $\theta$, defined by
$$I_{jk}(\theta) = -\mathbb{E}_\theta\left[\frac{\partial^2}{\partial\theta_j\,\partial\theta_k}\ell(\theta)\right].$$
The relevant theorem is
Theorem (Cramér–Rao bound). If $\tilde{\theta}$ is an unbiased estimator for $\theta$, then $\operatorname{var}(\tilde{\theta}) - I^{-1}(\theta)$ is positive semi-definite.
Moreover, asymptotically, as $n \to \infty$, the maximum likelihood estimator is unbiased and achieves the Cramér–Rao bound.
Another wonderful fact about the maximum likelihood estimator is that asymptotically, it is normally distributed, and so it is something we understand well.
This might seem very wonderful, but there are a few problems here. The results we stated are asymptotic, but what we actually see in real life is that as $n \to \infty$, the value of $p$ also increases. In such contexts, the asymptotic properties don't tell us much. Another issue is that these results are all about unbiased estimators. In a lot of situations of interest, it turns out biased methods do much, much better than these unbiased methods.
Another consideration is that as $n$ gets large, we might want to use more complicated models than simple parametric models, as we have much more data to play with. This is not something ordinary least squares provides us with.
2 Kernel machines
We are going to start a little bit slowly, and think about our linear model $Y = X\beta^0 + \varepsilon$, where $\mathbb{E}(\varepsilon) = 0$ and $\operatorname{var}(\varepsilon) = \sigma^2 I$. Ordinary least squares is an unbiased estimator, so let's look at biased estimators.
For a biased estimator $\tilde{\beta}$, we should not study the variance, but the mean squared error
$$\mathbb{E}[(\tilde{\beta} - \beta^0)(\tilde{\beta} - \beta^0)^T] = \mathbb{E}\left[(\tilde{\beta} - \mathbb{E}\tilde{\beta} + \mathbb{E}\tilde{\beta} - \beta^0)(\tilde{\beta} - \mathbb{E}\tilde{\beta} + \mathbb{E}\tilde{\beta} - \beta^0)^T\right] = \operatorname{var}(\tilde{\beta}) + (\mathbb{E}\tilde{\beta} - \beta^0)(\mathbb{E}\tilde{\beta} - \beta^0)^T.$$
The first term is, of course, just the variance, and the second is the squared bias. So the point is that if we pick a clever biased estimator with a tiny variance, then this might do better than unbiased estimators with large variance.
2.1 Ridge regression
Ridge regression was introduced in around 1970. The idea of Ridge regression is
to try to shrink our estimator a bit in order to lessen the variance, at the cost of
introducing a bias. We would hope then that this will result in a smaller mean
squared error.
Definition (Ridge regression). Ridge regression solves
$$(\hat{\mu}^R_\lambda, \hat{\beta}^R_\lambda) = \operatorname*{argmin}_{(\mu, \beta) \in \mathbb{R} \times \mathbb{R}^p}\left\{\|Y - \mu\mathbf{1} - X\beta\|_2^2 + \lambda\|\beta\|_2^2\right\},$$
where $\mathbf{1}$ is a vector of all $1$'s. Here $\lambda \geq 0$ is a tuning parameter, and it controls how much we penalize a large choice of $\beta$.
Here we explicitly have an intercept term. Usually, we eliminate this by adding a column of $1$'s to $X$. But here we want to separate it out, since we do not want to penalize a large $\mu$. For example, if the response is temperature, and we decide to measure in degrees Celsius rather than Kelvin, then we don't want the resulting $\hat{\mu}$, $\hat{\beta}$ to change.
More precisely, our formulation of Ridge regression is such that if we make the modification
$$Y' = c\mathbf{1} + Y,$$
then we have
$$\hat{\mu}^R_\lambda(Y') = \hat{\mu}^R_\lambda(Y) + c.$$
Note also that the Ridge regression formula makes sense only if the entries of $\beta$ all have the same order of magnitude, or else the penalty will only have a significant effect on the terms of large magnitude. Standard practice is to subtract from each column of $X$ its mean, and then scale it to have $\ell_2$ norm $\sqrt{n}$. The actual number is not important here, but it will be in the case of the Lasso.
By differentiating, one sees that the solution to the optimization problem is
$$\hat{\mu}^R_\lambda = \bar{Y} = \frac{1}{n}\sum_i Y_i, \qquad \hat{\beta}^R_\lambda = (X^TX + \lambda I)^{-1}X^TY.$$
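The following is a minimal numerical sketch (in Python, not part of the lectured material) of this closed-form solution, with the response and predictors centered; the data and names are invented for illustration, and in practice one would also scale the columns of $X$ as described above.

```python
import numpy as np

# Minimal sketch of the closed-form Ridge solution (invented data).
# We center Y and the columns of X; scaling columns to l2-norm sqrt(n) is omitted here.

def ridge_fit(X, Y, lam):
    xbar, mu_hat = X.mean(axis=0), Y.mean()       # mu_hat^R_lambda = Ybar
    Xc, Yc = X - xbar, Y - mu_hat
    beta_hat = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ Yc)
    return mu_hat, xbar, beta_hat                 # predict at x via mu_hat + (x - xbar) @ beta_hat

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
beta0 = np.concatenate([[2.0, -1.0, 0.5], np.zeros(p - 3)])
Y = 1.0 + X @ beta0 + rng.normal(scale=0.5, size=n)
mu_hat, xbar, beta_hat = ridge_fit(X, Y, lam=5.0)
```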
Note that we can always pick $\lambda$ such that the matrix $X^TX + \lambda I$ is invertible. In particular, this can work even if we have more parameters than data points. This will be important when we work with the Lasso later on.
If some god-like being told us a suitable value of $\lambda$ to use, then Ridge regression always works better than ordinary least squares.
Theorem. Suppose $\operatorname{rank}(X) = p$. Then for $\lambda > 0$ sufficiently small (depending on $\beta^0$ and $\sigma^2$), the matrix
$$\mathbb{E}(\hat{\beta}^{OLS} - \beta^0)(\hat{\beta}^{OLS} - \beta^0)^T - \mathbb{E}(\hat{\beta}^R_\lambda - \beta^0)(\hat{\beta}^R_\lambda - \beta^0)^T \tag{$*$}$$
is positive definite.
Proof. We know that the first term is just $\sigma^2(X^TX)^{-1}$. The second term has a variance part and a bias part. We first look at the bias:
$$\mathbb{E}[\hat{\beta}^R_\lambda - \beta^0] = (X^TX + \lambda I)^{-1}X^TX\beta^0 - \beta^0 = (X^TX + \lambda I)^{-1}(X^TX + \lambda I - \lambda I)\beta^0 - \beta^0 = -\lambda(X^TX + \lambda I)^{-1}\beta^0.$$
We can also compute the variance
$$\operatorname{var}(\hat{\beta}^R_\lambda) = \sigma^2(X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1}.$$
Note that both terms appearing in the squared error look like
$$(X^TX + \lambda I)^{-1}(\text{something})(X^TX + \lambda I)^{-1}.$$
So let's try to write $\sigma^2(X^TX)^{-1}$ in this form. Note that
$$(X^TX)^{-1} = (X^TX + \lambda I)^{-1}(X^TX + \lambda I)(X^TX)^{-1}(X^TX + \lambda I)(X^TX + \lambda I)^{-1} = (X^TX + \lambda I)^{-1}\bigl(X^TX + 2\lambda I + \lambda^2(X^TX)^{-1}\bigr)(X^TX + \lambda I)^{-1}.$$
Thus, we can write $(*)$ as
$$(X^TX + \lambda I)^{-1}\bigl(\sigma^2(X^TX + 2\lambda I + \lambda^2(X^TX)^{-1}) - \sigma^2X^TX - \lambda^2\beta^0(\beta^0)^T\bigr)(X^TX + \lambda I)^{-1} = \lambda(X^TX + \lambda I)^{-1}\bigl(2\sigma^2I + \sigma^2\lambda(X^TX)^{-1} - \lambda\beta^0(\beta^0)^T\bigr)(X^TX + \lambda I)^{-1}.$$
Since $\lambda > 0$, this is positive definite iff
$$2\sigma^2I + \sigma^2\lambda(X^TX)^{-1} - \lambda\beta^0(\beta^0)^T$$
is positive definite, which is true for $0 < \lambda < 2\sigma^2/\|\beta^0\|_2^2$.
While this is nice, this is not really telling us much, because we don’t know
how to pick the correct
λ
. It also doesn’t tell us when we should expect a big
improvement from Ridge regression.
To understand this better, we need to use the singular value decomposition.
Theorem (Singular value decomposition). Let $X \in \mathbb{R}^{n \times p}$ be any matrix. Then it has a singular value decomposition (SVD)
$$\underset{n \times p}{X} = \underset{n \times n}{U}\ \underset{n \times p}{D}\ \underset{p \times p}{V^T},$$
where $U, V$ are orthogonal matrices, $D_{11} \geq D_{22} \geq \cdots \geq D_{mm} \geq 0$ with $m = \min(n, p)$, and all other entries of $D$ are zero.
When $n > p$, there is an alternative version where $U$ is an $n \times p$ matrix with orthonormal columns, and $D$ is a $p \times p$ diagonal matrix. This is obtained by replacing $U$ by its first $p$ columns and $D$ by its first $p$ rows. This is known as the thin singular value decomposition. In this case, $U^TU = I_p$, but $UU^T$ is not the identity.
Let's now try to understand Ridge regression a little better. Suppose $n > p$. Then using the thin SVD, the fitted values from Ridge regression are
$$X\hat{\beta}^R_\lambda = X(X^TX + \lambda I)^{-1}X^TY = UDV^T(VD^2V^T + \lambda I)^{-1}VDU^TY.$$
We now note that $VD^2V^T + \lambda I = V(D^2 + \lambda I)V^T$, since $V$ is orthogonal. We then have
$$(V(D^2 + \lambda I)V^T)^{-1} = V(D^2 + \lambda I)^{-1}V^T.$$
Since $D^2$ and $\lambda I$ are both diagonal, it is easy to compute their inverses as well. We then have
$$X\hat{\beta}^R_\lambda = UD^2(D^2 + \lambda I)^{-1}U^TY = \sum_{j=1}^p U_j\,\frac{D_{jj}^2}{D_{jj}^2 + \lambda}\,U_j^TY.$$
Here $U_j$ refers to the $j$th column of $U$.
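As a quick sanity check, the following sketch (with made-up data) verifies numerically that the Ridge fitted values agree with the SVD expression above.

```python
import numpy as np

# Check that the Ridge fitted values equal sum_j U_j * D_jj^2/(D_jj^2 + lambda) * U_j^T Y,
# using the thin SVD X = U D V^T. Data are invented for illustration.

rng = np.random.default_rng(1)
n, p, lam = 50, 5, 2.0
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

U, d, Vt = np.linalg.svd(X, full_matrices=False)    # thin SVD: d holds D_11, ..., D_pp
fitted_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
fitted_svd = U @ (d**2 / (d**2 + lam) * (U.T @ Y))  # shrink each coefficient U_j^T Y
print(np.allclose(fitted_direct, fitted_svd))       # True
```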
Now note that the columns of
U
form an orthonormal basis for the column
space of
X
. If
λ
= 0, then this is just a fancy formula for the projection of
Y
onto the column space of
X
. Thus, what this formula is telling us is that we
look at this projection, look at the coordinates in terms of the basis given by
the columns of U , and scale accordingly.
We can now concretely see the effect of our $\lambda$. The amount of shrinking depends on the size of $D_{jj}$: the larger $D_{jj}$ is, the less shrinking we do.
This is not very helpful if we don't have a concrete interpretation of the $D_{jj}$, or rather, of the columns of $U$. It turns out the columns of $U$ are what are known as the normalized principal components of $X$.
Take $u \in \mathbb{R}^p$ with $\|u\|_2 = 1$. The sample variance of $Xu \in \mathbb{R}^n$ is then
$$\frac{1}{n}\|Xu\|_2^2 = \frac{1}{n}u^TX^TXu = \frac{1}{n}u^TVD^2V^Tu.$$
We write $w = V^Tu$. Then $\|w\|_2 = 1$ since $V$ is orthogonal. So we have
$$\frac{1}{n}\|Xu\|_2^2 = \frac{1}{n}w^TD^2w = \frac{1}{n}\sum_j w_j^2D_{jj}^2 \leq \frac{1}{n}D_{11}^2,$$
and the bound is achieved when $w = e_1$, or equivalently, $u = Ve_1 = V_1$. Thus, $V_1$ gives the coefficients of the linear combination of columns of $X$ that has the largest sample variance. We can then write the result as
$$XV_1 = UDV^TV_1 = U_1D_{11}.$$
We can extend this to a description of the other columns of $U$, which is done on the example sheet. Roughly, the subsequent principal components obey the same optimality conditions with the added constraint of being orthogonal to all earlier principal components.
The conclusion is that Ridge regression works best if $\mathbb{E}Y$ lies in the space spanned by the large principal components of $X$.
2.2 v-fold cross-validation
In practice, the above analysis is not very useful, since it doesn’t actually tell
us what
λ
we should pick. If we are given a concrete data set, how do we know
what λ we should pick?
One common method is to use
v
-fold cross-validation. This is a very general
technique that allows us to pick a regression method from a variety of options.
We shall explain the method in terms of Ridge regression, where our regression
methods are parametrized by a single parameter
λ
, but it should be clear that
this is massively more general.
Suppose we have some data set $(Y_i, x_i)_{i=1}^n$, and we are asked to predict the value of $Y^*$ given a new predictor $x^*$. We may want to pick $\lambda$ to minimize
$$\mathbb{E}\left[(Y^* - (x^*)^T\hat{\beta}^R_\lambda(X, Y))^2 \mid X, Y\right].$$
It is difficult to actually do this, and so an easier target to minimize is the expected prediction error
$$\mathbb{E}\left[\mathbb{E}\left[(Y^* - (x^*)^T\hat{\beta}^R_\lambda(\tilde{X}, \tilde{Y}))^2 \mid \tilde{X}, \tilde{Y}\right]\right].$$
One thing to note about this is that we are thinking of $\tilde{X}$ and $\tilde{Y}$ as arbitrary data sets of size $n$, as opposed to the one we have actually got. This might be a more tractable problem, since we are not working with our actual data set, but general data sets.
The method of cross-validation estimates this by splitting the data into $v$ folds
$$(X^{(1)}, Y^{(1)}), \ldots, (X^{(v)}, Y^{(v)}),$$
with corresponding observation indices $A_1, \ldots, A_v$: the set $A_k$ consists of all indices $i$ such that the $i$th data point lies in the $k$th fold.
We let $(X^{(-k)}, Y^{(-k)})$ be all the data except that in the $k$th fold. We define
$$CV(\lambda) = \frac{1}{n}\sum_{k=1}^v\sum_{i \in A_k}\left(Y_i - x_i^T\hat{\beta}^R_\lambda(X^{(-k)}, Y^{(-k)})\right)^2.$$
We write $\lambda_{CV}$ for the minimizer of this, and pick $\hat{\beta}^R_{\lambda_{CV}}(X, Y)$ as our estimate. This tends to work very well in practice.
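As an illustration, here is a sketch of $v$-fold cross-validation for Ridge regression over a grid of $\lambda$ values; the data, fold assignment and grid are invented.

```python
import numpy as np

# Sketch of v-fold cross-validation for Ridge regression over a grid of lambdas.

def ridge_fit(X, Y, lam):
    xbar, mu_hat = X.mean(axis=0), Y.mean()
    Xc = X - xbar
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (Y - mu_hat))
    return mu_hat, xbar, beta                   # predict at x via mu_hat + (x - xbar) @ beta

def cv_ridge(X, Y, lambdas, v=5, seed=0):
    n = len(Y)
    folds = np.random.default_rng(seed).permutation(n) % v   # observation indices A_1, ..., A_v
    err = np.zeros(len(lambdas))
    for k in range(v):
        test = folds == k
        for j, lam in enumerate(lambdas):
            mu_hat, xbar, beta = ridge_fit(X[~test], Y[~test], lam)
            pred = mu_hat + (X[test] - xbar) @ beta
            err[j] += np.sum((Y[test] - pred) ** 2)
    return err / n                               # CV(lambda) for each lambda on the grid

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
Y = 1.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100)
lambdas = np.logspace(-2, 3, 20)
lam_cv = lambdas[np.argmin(cv_ridge(X, Y, lambdas))]
```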
But we can ask ourselves: why do we have to pick a single $\lambda$ to make the estimate? We can instead average over a range of different $\lambda$. Suppose we have computed $\hat{\beta}^R_\lambda$ on a grid of $\lambda$-values $\lambda_1 > \lambda_2 > \cdots > \lambda_L$. Our plan is to take a good weighted average over the $\lambda_l$. Concretely, we want to minimize
$$\frac{1}{n}\sum_{k=1}^v\sum_{i \in A_k}\left(Y_i - \sum_{l=1}^L w_l\,x_i^T\hat{\beta}^R_{\lambda_l}(X^{(-k)}, Y^{(-k)})\right)^2$$
over $w \in [0, \infty)^L$. This is known as stacking, and it tends to work better than just doing $v$-fold cross-validation. Indeed, it must do at least as well, since cross-validation is just the special case where we restrict $w$ to be zero in all but one entry. Of course, this method comes with some heavy computational costs.
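A hedged sketch of how the stacking weights might be computed: the out-of-fold Ridge predictions form a matrix with one column per $\lambda$, and the weights are then obtained by non-negative least squares. Everything here (data, grid, helper functions) is invented for illustration.

```python
import numpy as np
from scipy.optimize import nnls

# Sketch of stacking: choose non-negative weights w minimising ||Y - P w||_2,
# where P holds out-of-fold Ridge predictions, one column per lambda on the grid.

def ridge_fit(X, Y, lam):
    xbar, mu_hat = X.mean(axis=0), Y.mean()
    Xc = X - xbar
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (Y - mu_hat))
    return mu_hat, xbar, beta

def stacking_weights(X, Y, lambdas, v=5, seed=0):
    n, L = len(Y), len(lambdas)
    folds = np.random.default_rng(seed).permutation(n) % v
    P = np.zeros((n, L))                         # out-of-fold predictions
    for k in range(v):
        test = folds == k
        for j, lam in enumerate(lambdas):
            mu_hat, xbar, beta = ridge_fit(X[~test], Y[~test], lam)
            P[test, j] = mu_hat + (X[test] - xbar) @ beta
    w, _ = nnls(P, Y)                            # weights constrained to be >= 0
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
Y = 1.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100)
w = stacking_weights(X, Y, np.logspace(-2, 3, 10))
```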
2.3 The kernel trick
We now come to the actual, main topic of the chapter. We start with a very simple observation. An alternative way of writing the fitted values from Ridge regression begins with the identity
$$(X^TX + \lambda I)X^T = X^T(XX^T + \lambda I).$$
One should be careful that the $\lambda I$ are different matrices on the two sides, as they have different dimensions. Multiplying by inverses on both sides, we have
$$X^T(XX^T + \lambda I)^{-1} = (X^TX + \lambda I)^{-1}X^T.$$
We can multiply the right-hand side by $Y$ to obtain the $\hat{\beta}$ from Ridge regression, and multiply on the left by $X$ to obtain the fitted values. So we have
$$XX^T(XX^T + \lambda I)^{-1}Y = X(X^TX + \lambda I)^{-1}X^TY = X\hat{\beta}^R_\lambda.$$
Note that $X^TX$ is a $p \times p$ matrix, and takes $O(np^2)$ time to compute. On the other hand, $XX^T$ is an $n \times n$ matrix, and takes $O(n^2p)$ time to compute. If $p \gg n$, then this way of computing the fitted values is much quicker. The key point is that the fitted values from Ridge regression depend only on $K = XX^T$ (and $Y$). Why is this important?
Suppose we believe we have a quadratic signal
$$Y_i = x_i^T\beta + \sum_{k,\ell}x_{ik}x_{i\ell}\theta_{k\ell} + \varepsilon_i.$$
Of course, we can do Ridge regression, as long as we add the products $x_{ik}x_{i\ell}$ as predictors. But this requires $O(p^2)$ many predictors. Even if we use the $XX^T$ approach, this has a cost of $O(n^2p^2)$, and if we used the naive method, it would require $O(np^4)$ operations.
We can do better than this. The idea is that we might be able to compute $K$ directly. Consider
$$(1 + x_i^Tx_j)^2 = 1 + 2x_i^Tx_j + \sum_{k,\ell}x_{ik}x_{i\ell}x_{jk}x_{j\ell}.$$
This is equal to the inner product between vectors of the form
$$(1, \sqrt{2}x_{i1}, \ldots, \sqrt{2}x_{ip}, x_{i1}x_{i1}, \ldots, x_{i1}x_{ip}, x_{i2}x_{i1}, \ldots, x_{ip}x_{ip}). \tag{$*$}$$
If we set $K_{ij} = (1 + x_i^Tx_j)^2$ and form $K(K + \lambda I)^{-1}Y$, this is equivalent to performing Ridge regression with $(*)$ as our predictors. Note that here we don't scale our columns to have the same $\ell_2$ norm. This is pretty interesting, because computing $K$ costs only $O(n^2p)$. We have managed to kill a factor of $p$ in this computation. The key idea here was that the fitted values depend only on $K$, and not on the values of the $x_{ij}$ themselves.
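The following sketch (invented data) checks this equivalence numerically: kernel ridge with $K_{ij} = (1 + x_i^Tx_j)^2$ gives the same fitted values as ordinary Ridge regression on the explicit feature expansion $(*)$, with no centering or scaling, matching the remark above.

```python
import numpy as np

# Kernel trick check: fitted values from K(K + lam I)^{-1} Y with the degree-2
# polynomial kernel agree with ridge regression on the explicit O(p^2) feature map.

rng = np.random.default_rng(2)
n, p, lam = 40, 3, 1.0
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

def phi(x):                                     # explicit feature expansion (*)
    return np.concatenate([[1.0], np.sqrt(2) * x, np.outer(x, x).ravel()])

Phi = np.array([phi(x) for x in X])
fitted_features = Phi @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ Y)

K = (1 + X @ X.T) ** 2                          # only K is needed, at cost O(n^2 p)
fitted_kernel = K @ np.linalg.solve(K + lam * np.eye(n), Y)

print(np.allclose(fitted_features, fitted_kernel))   # True
```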
Consider the general scenario where we try to predict the value of $Y$ given a predictor $x \in \mathcal{X}$. In general, we don't even assume $\mathcal{X}$ has some nice linear structure where we can do linear regression.
If we want to do Ridge regression, then one thing we can do is to construct some map $\phi: \mathcal{X} \to \mathbb{R}^D$ for some $D$, and then run Ridge regression using $\{\phi(x_i)\}$ as our predictors. If we want to fit some complicated, non-linear model, then $D$ can potentially be very large, and this can be costly. The observation above is that instead of doing this, we can perhaps find some magical way of directly computing
$$K_{ij} = k(x_i, x_j) = \langle\phi(x_i), \phi(x_j)\rangle.$$
If we can do so, then we can simply use the above formula to obtain the fitted values (we shall discuss the problem of making new predictions later).
Since it is only the function $k$ that matters, perhaps we can specify a “regression method” simply by providing a suitable function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. If we are given such a function $k$, when would it arise from a “feature map” $\phi: \mathcal{X} \to \mathbb{R}^D$?
More generally, we will find that it is convenient to allow $\phi$ to take values in infinite-dimensional vector spaces instead (since we don't actually have to compute $\phi$!).
Definition (Inner product space). An inner product space is a real vector space $H$ endowed with a map $\langle\cdot,\cdot\rangle: H \times H \to \mathbb{R}$ that obeys
– Symmetry: $\langle u, v\rangle = \langle v, u\rangle$;
– Linearity: if $a, b \in \mathbb{R}$, then $\langle au + bw, v\rangle = a\langle u, v\rangle + b\langle w, v\rangle$;
– Positive definiteness: $\langle u, u\rangle \geq 0$, with $\langle u, u\rangle = 0$ iff $u = 0$.
If we want $k$ to come from a feature map $\phi$, then an immediate necessary condition is that $k$ has to be symmetric. There is also another condition that corresponds to the positive-definiteness of the inner product.
Proposition. Given $\phi: \mathcal{X} \to H$, define $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ by
$$k(x, x') = \langle\phi(x), \phi(x')\rangle.$$
Then for any $x_1, \ldots, x_n \in \mathcal{X}$, the matrix $K \in \mathbb{R}^{n \times n}$ with entries $K_{ij} = k(x_i, x_j)$ is positive semi-definite.
Proof. Let $x_1, \ldots, x_n \in \mathcal{X}$ and $\alpha \in \mathbb{R}^n$. Then
$$\sum_{i,j}\alpha_i k(x_i, x_j)\alpha_j = \sum_{i,j}\alpha_i\langle\phi(x_i), \phi(x_j)\rangle\alpha_j = \left\langle\sum_i\alpha_i\phi(x_i), \sum_j\alpha_j\phi(x_j)\right\rangle \geq 0,$$
since the inner product is positive definite.
This suggests the following definition:
Definition (Positive-definite kernel). A positive-definite kernel (or simply kernel) is a symmetric map $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that for all $n \in \mathbb{N}$ and $x_1, \ldots, x_n \in \mathcal{X}$, the matrix $K \in \mathbb{R}^{n \times n}$ with entries $K_{ij} = k(x_i, x_j)$ is positive semi-definite.
We will prove that every positive-definite kernel comes from a feature map.
However, before that, let’s look at some examples.
Example. Suppose $k_1, k_2, \ldots$ are kernels. Then
– If $\alpha_1, \alpha_2 \geq 0$, then $\alpha_1k_1 + \alpha_2k_2$ is a kernel. Moreover, if
$$k(x, x') = \lim_{m \to \infty}k_m(x, x')$$
exists, then $k$ is a kernel.
– The pointwise product $k_1k_2$ is a kernel, where
$$(k_1k_2)(x, x') = k_1(x, x')k_2(x, x').$$
Example. The linear kernel is $k(x, x') = x^Tx'$. To see that this is a kernel, note that it is given by the feature map $\phi = \mathrm{id}$ together with the standard inner product on $\mathbb{R}^p$.
Example. The polynomial kernel is $k(x, x') = (1 + x^Tx')^d$ for $d \in \mathbb{N}$. We saw this last time with $d = 2$. This is a kernel since both $1$ and $x^Tx'$ are kernels, and sums and products preserve kernels.
Example. The Gaussian kernel is
$$k(x, x') = \exp\left(-\frac{\|x - x'\|_2^2}{2\sigma^2}\right).$$
The quantity $\sigma$ is known as the bandwidth.
To show that it is a kernel, we decompose
$$\|x - x'\|_2^2 = \|x\|_2^2 + \|x'\|_2^2 - 2x^Tx'.$$
We define
$$k_1(x, x') = \exp\left(-\frac{\|x\|_2^2}{2\sigma^2}\right)\exp\left(-\frac{\|x'\|_2^2}{2\sigma^2}\right).$$
This is a kernel, with feature map $\phi(\cdot) = \exp\left(-\frac{\|\cdot\|_2^2}{2\sigma^2}\right)$.
Next, we can define
$$k_2(x, x') = \exp\left(\frac{x^Tx'}{\sigma^2}\right) = \sum_{r=0}^\infty\frac{1}{r!}\left(\frac{x^Tx'}{\sigma^2}\right)^r.$$
This is an infinite linear combination (with non-negative coefficients) of powers of a kernel, hence is a kernel. Thus, it follows that $k = k_1k_2$ is a kernel.
Note also that any feature map giving this $k$ must take values in an infinite-dimensional inner product space. Roughly speaking, this is because we have arbitrarily large powers of $x$ and $x'$ in the expansion of $k$.
Example. The Sobolev kernel is defined as follows: we take $\mathcal{X} = [0, 1]$, and let
$$k(x, x') = \min(x, x').$$
A slick proof that this is a kernel is to notice that it is the covariance function of Brownian motion.
Example. The Jaccard similarity is defined as follows: take $\mathcal{X}$ to be the set of all subsets of $\{1, \ldots, p\}$. For $x, x' \in \mathcal{X}$, define
$$k(x, x') = \begin{cases}\dfrac{|x \cap x'|}{|x \cup x'|} & x \cup x' \neq \emptyset\\[1mm] 1 & \text{otherwise}\end{cases}.$$
Theorem (Moore–Aronszajn theorem). For every kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, there exists an inner product space $H$ and a feature map $\phi: \mathcal{X} \to H$ such that
$$k(x, x') = \langle\phi(x), \phi(x')\rangle.$$
This is not actually the full Moore–Aronszajn theorem, but a simplified version of it.
Proof. Let $H$ denote the vector space of functions $f: \mathcal{X} \to \mathbb{R}$ of the form
$$f(\cdot) = \sum_{i=1}^n\alpha_i k(\cdot, x_i) \tag{$*$}$$
for some $n \in \mathbb{N}$, $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ and $x_1, \ldots, x_n \in \mathcal{X}$. If
$$g(\cdot) = \sum_{j=1}^m\beta_j k(\cdot, x'_j) \in H,$$
then we tentatively define the inner product of $f$ and $g$ by
$$\langle f, g\rangle = \sum_{i=1}^n\sum_{j=1}^m\alpha_i\beta_j k(x_i, x'_j).$$
We have to check that this is an inner product, but even before that, we need to check that it is well-defined, since $f$ and $g$ can be represented in the form $(*)$ in multiple ways. To do so, simply observe that
$$\sum_{i=1}^n\sum_{j=1}^m\alpha_i\beta_j k(x_i, x'_j) = \sum_{i=1}^n\alpha_i g(x_i) = \sum_{j=1}^m\beta_j f(x'_j). \tag{$\dagger$}$$
The first equality shows that the definition of our inner product does not depend on the representation of $g$, while the second equality shows that it doesn't depend on the representation of $f$.
To show this is an inner product, note that it is clearly symmetric and bilinear. To show it is positive definite, note that we have
$$\langle f, f\rangle = \sum_{i=1}^n\sum_{j=1}^n\alpha_i k(x_i, x_j)\alpha_j \geq 0,$$
since the kernel is positive semi-definite. It remains to check that if $\langle f, f\rangle = 0$, then $f = 0$ as a function. To this end, note the important reproducing property: by $(\dagger)$, we have
$$\langle k(\cdot, x), f\rangle = f(x).$$
This says $k(\cdot, x)$ represents the evaluation-at-$x$ linear functional.
In particular, we have
$$f(x)^2 = \langle k(\cdot, x), f\rangle^2 \leq \langle k(\cdot, x), k(\cdot, x)\rangle\langle f, f\rangle = 0.$$
Here we used the Cauchy–Schwarz inequality, which, if you inspect the proof, does not require positive definiteness, just positive semi-definiteness. So it follows that $f \equiv 0$. Thus, we know that $H$ is indeed an inner product space.
We now have to construct a feature map. Define $\phi: \mathcal{X} \to H$ by
$$\phi(x) = k(\cdot, x).$$
Then we have already observed that
$$\langle\phi(x), \phi(x')\rangle = \langle k(\cdot, x), k(\cdot, x')\rangle = k(x, x'),$$
as desired.
We shall boost this theorem up to its full strength, where we say there is a “unique” inner product space with such a feature map. Of course, it will not be unique unless we impose appropriate conditions, since we can always throw some extra elements into $H$.
Recall (or learn) from functional analysis that any inner product space $B$ is a normed space, with norm
$$\|f\|_B^2 = \langle f, f\rangle_B.$$
We say a sequence $(f_m)$ in a normed space $B$ is Cauchy if $\|f_m - f_n\|_B \to 0$ as $n, m \to \infty$. An inner product space is complete if every Cauchy sequence has a limit (in the space), and a complete inner product space is called a Hilbert space.
One important property of a Hilbert space $B$ is that if $V$ is a closed subspace of $B$, then every $f \in B$ has a unique representation as $f = v + w$, where $v \in V$ and
$$w \in V^\perp = \{u \in B : \langle u, y\rangle = 0 \text{ for all } y \in V\}.$$
By adding limits of Cauchy sequences to $H$, we can obtain a Hilbert space.
Indeed, if $(f_m)$ is Cauchy in $H$, then
$$|f_m(x) - f_n(x)| \leq k^{1/2}(x, x)\|f_m - f_n\|_H \to 0$$
as $m, n \to \infty$. Since every Cauchy sequence in $\mathbb{R}$ converges (i.e. $\mathbb{R}$ is complete), it makes sense to define a limiting function
$$f(x) = \lim_{n \to \infty}f_n(x),$$
and it can be shown that after augmenting $H$ with such limits, we obtain a Hilbert space. In fact, it is a special type of Hilbert space.
Definition (Reproducing kernel Hilbert space (RKHS)). A Hilbert space $B$ of functions $f: \mathcal{X} \to \mathbb{R}$ is a reproducing kernel Hilbert space if for each $x \in \mathcal{X}$, there exists a $k_x \in B$ such that
$$\langle k_x, f\rangle = f(x)$$
for all $f \in B$.
The function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ given by
$$k(x, x') = \langle k_x, k_{x'}\rangle = k_x(x') = k_{x'}(x)$$
is called the reproducing kernel associated with $B$.
By the Riesz representation theorem, this condition is equivalent to saying that pointwise evaluation is a continuous linear functional.
We know the reproducing kernel associated with an RKHS is a positive-definite kernel, and the Moore–Aronszajn theorem can be extended to show that to each kernel $k$, there is a unique reproducing kernel Hilbert space $H$ and feature map $\phi: \mathcal{X} \to H$ such that $k(x, x') = \langle\phi(x), \phi(x')\rangle$.
Example. Take the linear kernel
$$k(x, x') = x^Tx'.$$
By definition, we have
$$H = \left\{f: \mathbb{R}^p \to \mathbb{R} \;\middle|\; f(x) = \sum_{i=1}^n\alpha_i x_i^Tx \text{ for some } n \in \mathbb{N},\ \alpha_i \in \mathbb{R},\ x_1, \ldots, x_n \in \mathbb{R}^p\right\}.$$
We then see that this is equal to
$$H = \{f: \mathbb{R}^p \to \mathbb{R} \mid f(x) = \beta^Tx \text{ for some } \beta \in \mathbb{R}^p\},$$
and if $f(x) = \beta^Tx$, then $\|f\|_H^2 = k(\beta, \beta) = \|\beta\|_2^2$.
Example. Take the Sobolev kernel, where $\mathcal{X} = [0, 1]$ and $k(x, x') = \min(x, x')$. Then $H$ includes all linear combinations of functions of the form $x \mapsto \min(x, x')$, where $x' \in [0, 1]$, and their pointwise limits. (Each such function increases linearly with slope $1$ up to $x'$ and is constant afterwards.) Since we allow arbitrary linear combinations of these and pointwise limits, this gives rise to a large class of functions. In particular, it includes all Lipschitz functions that are $0$ at the origin.
In fact, the resulting space is a Sobolev space, with the norm given by
$$\|f\| = \left(\int_0^1 f'(x)^2\,\mathrm{d}x\right)^{1/2}.$$
For example, if we take $f = \min(\cdot, x)$, then by definition we have
$$\|\min(\cdot, x)\|_H^2 = \min(x, x) = x,$$
whereas the formula we claimed gives
$$\int_0^x 1^2\,\mathrm{d}t = x.$$
In general, it is difficult to understand the RKHS, but if we can do so, this
provides a lot of information on what we are regularizing in the kernel trick.
2.4 Making predictions
Let's now try to understand what exactly we are doing when we do Ridge regression with a kernel $k$. To do so, we first think carefully about what we were doing in ordinary Ridge regression, which corresponds to using the linear kernel. For the linear kernel, the penalty $\|\beta\|_2^2$ corresponds exactly to the squared RKHS norm $\|f\|_H^2$. Thus, an alternative way of expressing Ridge regression is
$$\hat{f} = \operatorname*{argmin}_{f \in H}\left\{\sum_{i=1}^n(Y_i - f(x_i))^2 + \lambda\|f\|_H^2\right\}, \tag{$*$}$$
where $H$ is the RKHS of the linear kernel. Now this way of writing Ridge regression makes sense for an arbitrary RKHS, so we might think this is what we should solve in general.
But if $H$ is infinite-dimensional, then naively, it would be quite difficult to solve $(*)$. The solution is provided by the representer theorem.
Theorem (Representer theorem). Let $H$ be an RKHS with reproducing kernel $k$. Let $c$ be an arbitrary loss function and $J: [0, \infty) \to \mathbb{R}$ any strictly increasing function. Then any minimizer $\hat{f} \in H$ of
$$Q_1(f) = c(Y, x_1, \ldots, x_n, f(x_1), \ldots, f(x_n)) + J(\|f\|_H^2)$$
lies in the linear span of $\{k(\cdot, x_i)\}_{i=1}^n$.
Given this theorem, we know our optimal solution can be written in the form
$$\hat{f}(\cdot) = \sum_{i=1}^n\hat{\alpha}_i k(\cdot, x_i),$$
and thus we can rewrite our optimization problem as looking for the $\hat{\alpha} \in \mathbb{R}^n$ that minimizes
$$Q_2(\alpha) = c(Y, x_1, \ldots, x_n, K\alpha) + J(\alpha^TK\alpha)$$
over $\alpha \in \mathbb{R}^n$ (with $K_{ij} = k(x_i, x_j)$).
For an arbitrary $c$, this can still be quite difficult, but in the case of Ridge regression, this tells us that $(*)$ is equivalent to minimizing
$$\|Y - K\alpha\|_2^2 + \lambda\alpha^TK\alpha,$$
and by differentiating, we find that the solution is given by solving
$$K\hat{\alpha} = K(K + \lambda I)^{-1}Y.$$
Of course, $K\hat{\alpha}$ is also the vector of fitted values, so we see that if we minimize $(*)$, then the fitted values are indeed given by $K(K + \lambda I)^{-1}Y$, as in our original motivation.
Proof. Suppose $\hat{f}$ minimizes $Q_1$. We can then write
$$\hat{f} = u + v,$$
where $u \in V = \operatorname{span}\{k(\cdot, x_i) : i = 1, \ldots, n\}$ and $v \in V^\perp$. Then
$$\hat{f}(x_i) = \langle\hat{f}, k(\cdot, x_i)\rangle = \langle u + v, k(\cdot, x_i)\rangle = \langle u, k(\cdot, x_i)\rangle = u(x_i).$$
So we know that
$$c(Y, x_1, \ldots, x_n, \hat{f}(x_1), \ldots, \hat{f}(x_n)) = c(Y, x_1, \ldots, x_n, u(x_1), \ldots, u(x_n)).$$
Meanwhile,
$$\|\hat{f}\|_H^2 = \|u + v\|_H^2 = \|u\|_H^2 + \|v\|_H^2,$$
using the fact that $u$ and $v$ are orthogonal. So we know
$$J(\|\hat{f}\|_H^2) \geq J(\|u\|_H^2),$$
with equality iff $v = 0$. Hence $Q_1(\hat{f}) \geq Q_1(u)$ with equality iff $v = 0$, and so we must have $v = 0$ by optimality.
Thus, we know that the optimizer in fact lies in $V$, and then the formula for $Q_2$ just expresses $Q_1$ in terms of $\alpha$.
How well does our kernel machine do? Let $H$ be an RKHS with reproducing kernel $k$, and let $f^0 \in H$. Consider the model
$$Y_i = f^0(x_i) + \varepsilon_i$$
for $i = 1, \ldots, n$, and assume $\mathbb{E}\varepsilon = 0$ and $\operatorname{var}(\varepsilon) = \sigma^2I$.
For convenience, we assume $\|f^0\|_H^2 \leq 1$. There is no loss of generality in assuming this, since we can always achieve it by rescaling.
Let $K$ be the kernel matrix $K_{ij} = k(x_i, x_j)$, with eigenvalues $d_1 \geq d_2 \geq \cdots \geq d_n \geq 0$. Define
$$\hat{f}_\lambda = \operatorname*{argmin}_{f \in H}\left\{\sum_{i=1}^n(Y_i - f(x_i))^2 + \lambda\|f\|_H^2\right\}.$$
Theorem. We have
$$\frac{1}{n}\sum_{i=1}^n\mathbb{E}(f^0(x_i) - \hat{f}_\lambda(x_i))^2 \leq \frac{\sigma^2}{n}\sum_{i=1}^n\frac{d_i^2}{(d_i + \lambda)^2} + \frac{\lambda}{4n} \leq \frac{\sigma^2}{n}\cdot\frac{1}{\lambda}\sum_{i=1}^n\min\left(\frac{d_i}{4}, \lambda\right) + \frac{\lambda}{4n}.$$
Given a data set, we can compute the eigenvalues $d_i$, and thus we can compute this error bound.
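For instance, the following sketch evaluates both upper bounds from the theorem for an invented data set using the Sobolev kernel; the inputs and the value of $\sigma^2$ are assumptions made purely for illustration.

```python
import numpy as np

# Evaluate the two upper bounds from the theorem as functions of lambda,
# for the Sobolev kernel k(x, x') = min(x, x') on invented uniform inputs.

rng = np.random.default_rng(3)
n, sigma2 = 200, 1.0
x = np.sort(rng.uniform(size=n))
K = np.minimum.outer(x, x)                            # kernel matrix
d = np.clip(np.linalg.eigvalsh(K)[::-1], 0.0, None)   # eigenvalues d_1 >= ... >= d_n >= 0

def error_bounds(lam):
    b1 = sigma2 / n * np.sum(d**2 / (d + lam) ** 2) + lam / (4 * n)
    b2 = sigma2 / (n * lam) * np.sum(np.minimum(d / 4, lam)) + lam / (4 * n)
    return b1, b2

for lam in [0.1, 1.0, 10.0]:
    print(lam, error_bounds(lam))
```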
Proof. We know from the representer theorem that
$$(\hat{f}_\lambda(x_1), \ldots, \hat{f}_\lambda(x_n))^T = K(K + \lambda I)^{-1}Y.$$
Also, there is some $\alpha \in \mathbb{R}^n$ such that
$$(f^0(x_1), \ldots, f^0(x_n))^T = K\alpha.$$
Moreover, on the example sheet, we show that
$$1 \geq \|f^0\|_H^2 \geq \alpha^TK\alpha.$$
Consider the eigendecomposition $K = UDU^T$, where $U$ is orthogonal, $D_{ii} = d_i$ and $D_{ij} = 0$ for $i \neq j$. Then we have
$$\mathbb{E}\sum_{i=1}^n(f^0(x_i) - \hat{f}_\lambda(x_i))^2 = \mathbb{E}\|K\alpha - K(K + \lambda I)^{-1}(K\alpha + \varepsilon)\|_2^2.$$
Noting that $K\alpha = (K + \lambda I)(K + \lambda I)^{-1}K\alpha$, we obtain
$$= \mathbb{E}\|\lambda(K + \lambda I)^{-1}K\alpha - K(K + \lambda I)^{-1}\varepsilon\|_2^2 = \underbrace{\lambda^2\|(K + \lambda I)^{-1}K\alpha\|_2^2}_{(\mathrm{I})} + \underbrace{\mathbb{E}\|K(K + \lambda I)^{-1}\varepsilon\|_2^2}_{(\mathrm{II})}.$$
At this stage, we can throw in the eigendecomposition of $K$ to write $(\mathrm{I})$ as
$$(\mathrm{I}) = \lambda^2\|(UDU^T + \lambda I)^{-1}UDU^T\alpha\|_2^2 = \lambda^2\|U(D + \lambda I)^{-1}\underbrace{DU^T\alpha}_{\theta}\|_2^2 = \sum_{i=1}^n\theta_i^2\frac{\lambda^2}{(d_i + \lambda)^2}.$$
Now we have
$$\alpha^TK\alpha = \alpha^TUDU^T\alpha = \alpha^TUDD^+DU^T\alpha,$$
where $D^+$ is diagonal with
$$D^+_{ii} = \begin{cases}d_i^{-1} & d_i > 0\\ 0 & \text{otherwise}\end{cases}.$$
We can then write this as
$$\alpha^TK\alpha = \sum_{d_i > 0}\frac{\theta_i^2}{d_i}.$$
The key thing we know about this is that it is $\leq 1$.
Note that by the definition of $\theta_i$, we see that if $d_i = 0$, then $\theta_i = 0$. So we can write
$$(\mathrm{I}) = \sum_{i: d_i > 0}\frac{\theta_i^2}{d_i}\cdot\frac{d_i\lambda^2}{(d_i + \lambda)^2} \leq \lambda\max_{i=1,\ldots,n}\frac{d_i\lambda}{(d_i + \lambda)^2}$$
by Hölder's inequality with $(p, q) = (1, \infty)$. Finally, use the inequality
$$(a + b)^2 \geq 4ab$$
to see that we have
$$(\mathrm{I}) \leq \frac{\lambda}{4}.$$
So we are left with $(\mathrm{II})$, which is a bit easier. Using the trace trick, we have
$$\begin{aligned}(\mathrm{II}) &= \mathbb{E}\,\varepsilon^T(K + \lambda I)^{-1}K^2(K + \lambda I)^{-1}\varepsilon\\ &= \mathbb{E}\operatorname{tr}\bigl(K(K + \lambda I)^{-1}\varepsilon\varepsilon^T(K + \lambda I)^{-1}K\bigr)\\ &= \operatorname{tr}\bigl(K(K + \lambda I)^{-1}\mathbb{E}(\varepsilon\varepsilon^T)(K + \lambda I)^{-1}K\bigr)\\ &= \sigma^2\operatorname{tr}\bigl(K^2(K + \lambda I)^{-2}\bigr)\\ &= \sigma^2\operatorname{tr}\bigl(UD^2U^T(UDU^T + \lambda I)^{-2}\bigr)\\ &= \sigma^2\operatorname{tr}\bigl(D^2(D + \lambda I)^{-2}\bigr)\\ &= \sigma^2\sum_{i=1}^n\frac{d_i^2}{(d_i + \lambda)^2}.\end{aligned}$$
Finally, writing
$$\frac{d_i^2}{(d_i + \lambda)^2} = \frac{d_i}{\lambda}\cdot\frac{d_i\lambda}{(d_i + \lambda)^2},$$
we have
$$\frac{d_i^2}{(d_i + \lambda)^2} \leq \min\left(1, \frac{d_i}{4\lambda}\right),$$
and we have the second bound.
How can we interpret this? If we look at the formula, then we see that we can get good guarantees if the $d_i$'s decay quickly, and the exact values of the $d_i$ depend only on the choice of the $x_i$. So suppose the $x_i$ are random, iid and independent of $\varepsilon$. Then the entire analysis is still valid by conditioning on $x_1, \ldots, x_n$.
We define $\hat{\mu}_i = d_i/n$ and $\lambda_n = \lambda/n$. Then we can rewrite our result to say
$$\frac{1}{n}\mathbb{E}\sum_{i=1}^n(f^0(x_i) - \hat{f}_\lambda(x_i))^2 \leq \frac{\sigma^2}{\lambda_n}\cdot\frac{1}{n}\sum_{i=1}^n\mathbb{E}\min\left(\frac{\hat{\mu}_i}{4}, \lambda_n\right) + \frac{\lambda_n}{4} =: \mathbb{E}\,\delta_n(\lambda_n).$$
We want to relate this more directly to the kernel itself. Given a density $p(x)$ on $\mathcal{X}$, Mercer's theorem guarantees the eigenexpansion
$$k(x, x') = \sum_{j=1}^\infty\mu_j e_j(x)e_j(x'),$$
where the eigenvalues $\mu_j$ and eigenfunctions $e_j$ obey
$$\mu_j e_j(x') = \int_{\mathcal{X}}k(x, x')e_j(x)p(x)\,\mathrm{d}x \quad\text{and}\quad \int_{\mathcal{X}}e_k(x)e_j(x)p(x)\,\mathrm{d}x = \mathbf{1}_{j=k}.$$
One can then show that
$$\frac{1}{n}\mathbb{E}\sum_{i=1}^n\min\left(\frac{\hat{\mu}_i}{4}, \lambda_n\right) \leq \frac{1}{n}\sum_{i=1}^\infty\min\left(\frac{\mu_i}{4}, \lambda_n\right)$$
up to some absolute constant factors. For a particular density $p(x)$ of the input space, we can try to solve the integral equation to figure out what the $\mu_j$'s are, and hence find the expected bound.
When $k$ is the Sobolev kernel and $p(x)$ is the uniform density, it turns out we have
$$\frac{\mu_j}{4} = \frac{1}{\pi^2(2j - 1)^2}.$$
By drawing a picture, we see that we can bound $\sum_{i=1}^\infty\min(\mu_i/4, \lambda_n)$ by
$$\sum_{i=1}^\infty\min\left(\frac{\mu_i}{4}, \lambda_n\right) \leq \int_0^\infty\min\left(\lambda_n, \frac{1}{\pi^2(2j - 1)^2}\right)\mathrm{d}j \leq \frac{\sqrt{\lambda_n}}{\pi} + \frac{\lambda_n}{2}.$$
So we find that
$$\mathbb{E}(\delta_n(\lambda_n)) = O\left(\frac{\sigma^2\lambda_n^{-1/2}}{n} + \lambda_n\right).$$
We can then find the $\lambda_n$ that minimizes this, and we see that we should pick
$$\lambda_n \sim \left(\frac{\sigma^2}{n}\right)^{2/3},$$
which gives an error rate of
$$\left(\frac{\sigma^2}{n}\right)^{2/3}.$$
2.5 Other kernel machines
Recall that the representer theorem was much more general than what we used it for. We could apply it to any loss function. In particular, we can apply it to classification problems. For example, we might want to predict whether an email is spam. In this case, we have $Y \in \{-1, 1\}^n$.
Suppose we are trying to find a hyperplane that separates the two sets $\{x_i\}_{i: Y_i = 1}$ and $\{x_i\}_{i: Y_i = -1}$. Consider the really optimistic case where they can indeed be completely separated by a hyperplane through the origin. So there exists $\beta \in \mathbb{R}^p$ (the normal vector) with $Y_ix_i^T\beta > 0$ for all $i$. Moreover, it would be nice if we could maximize the minimum distance between any of the points and the hyperplane defined by $\beta$. This distance is given by
$$\min_i\frac{Y_ix_i^T\beta}{\|\beta\|_2}.$$
Thus, we can formulate our problem as: maximize $M$ over $\beta \in \mathbb{R}^p$, $M \geq 0$, subject to
$$\frac{Y_ix_i^T\beta}{\|\beta\|_2} \geq M \quad\text{for all } i.$$
This optimization problem gives the hyperplane that maximizes the margin between the two classes.
Let's think about the more realistic case where we cannot completely separate the two sets. We then want to impose a penalty for each point on the wrong side of the margin.
There are two options. We can use the penalty
$$\lambda\sum_{i=1}^n\left(M - \frac{Y_ix_i^T\beta}{\|\beta\|_2}\right)_+,$$
where $t_+ = t\mathbf{1}_{t \geq 0}$. Another option is to take
$$\lambda\sum_{i=1}^n\left(1 - \frac{Y_ix_i^T\beta}{M\|\beta\|_2}\right)_+,$$
which is perhaps more natural, since now everything is “dimensionless”. We will actually use neither, but will start with the second form and massage it into something that can be attacked with our existing machinery. Starting with the second option, we can write our optimization problem as
$$\max_{M \geq 0,\ \beta \in \mathbb{R}^p}\left(M - \lambda\sum_{i=1}^n\left(1 - \frac{Y_ix_i^T\beta}{M\|\beta\|_2}\right)_+\right).$$
Since the objective function is invariant to any positive scaling of $\beta$, we may assume $\|\beta\|_2 = \frac{1}{M}$, and rewrite this problem as maximizing
$$\frac{1}{\|\beta\|_2} - \lambda\sum_{i=1}^n(1 - Y_ix_i^T\beta)_+.$$
Replacing the maximization of $\frac{1}{\|\beta\|_2}$ with the minimization of $\|\beta\|_2^2$, and adding instead of subtracting the penalty part, we modify this to
$$\min_{\beta \in \mathbb{R}^p}\left(\|\beta\|_2^2 + \lambda\sum_{i=1}^n(1 - Y_ix_i^T\beta)_+\right).$$
The final change we make is that we replace $\lambda$ with $\frac{1}{\lambda}$ and multiply the whole objective by $\lambda$, to get
$$\min_{\beta \in \mathbb{R}^p}\left(\lambda\|\beta\|_2^2 + \sum_{i=1}^n(1 - Y_ix_i^T\beta)_+\right).$$
This looks much more like what we've seen before, with $\lambda\|\beta\|_2^2$ being the penalty term and $\sum_{i=1}^n(1 - Y_ix_i^T\beta)_+$ being the loss function.
The final modification is that we want to allow hyperplanes that don't necessarily pass through the origin. To do this, we allow ourselves to translate all the $x_i$'s by a fixed vector $\delta \in \mathbb{R}^p$. This gives
$$\min_{\beta \in \mathbb{R}^p,\ \delta \in \mathbb{R}^p}\left(\lambda\|\beta\|_2^2 + \sum_{i=1}^n\bigl(1 - Y_i(x_i - \delta)^T\beta\bigr)_+\right).$$
Since $\delta^T\beta$ only ever appears as a single constant, we can simply replace it with a constant $\mu$, and write our problem as
$$(\hat{\mu}, \hat{\beta}) = \operatorname*{argmin}_{(\mu, \beta) \in \mathbb{R} \times \mathbb{R}^p}\sum_{i=1}^n\bigl(1 - Y_i(x_i^T\beta + \mu)\bigr)_+ + \lambda\|\beta\|_2^2. \tag{$*$}$$
This gives the support vector classifier.
This is still just fitting a hyperplane. Now we would like to “kernelize” this, so that we can fit a surface instead. Let $H$ be the RKHS corresponding to the linear kernel. We can then write $(*)$ as
$$(\hat{\mu}_\lambda, \hat{f}_\lambda) = \operatorname*{argmin}_{(\mu, f) \in \mathbb{R} \times H}\sum_{i=1}^n\bigl(1 - Y_i(f(x_i) + \mu)\bigr)_+ + \lambda\|f\|_H^2.$$
The representer theorem (or rather, a slight variant of it) tells us that the above optimization problem is equivalent to the support vector machine
$$(\hat{\mu}_\lambda, \hat{\alpha}_\lambda) = \operatorname*{argmin}_{(\mu, \alpha) \in \mathbb{R} \times \mathbb{R}^n}\sum_{i=1}^n\bigl(1 - Y_i(K_i^T\alpha + \mu)\bigr)_+ + \lambda\alpha^TK\alpha,$$
where $K_{ij} = k(x_i, x_j)$ and $k$ is the reproducing kernel of $H$. Predictions at a new $x$ are then given by
$$\operatorname{sign}\left(\hat{\mu}_\lambda + \sum_{i=1}^n\hat{\alpha}_{\lambda,i}k(x, x_i)\right).$$
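As an illustration only (this is not an algorithm from the lectures), one can fit the support vector machine objective above by crude subgradient descent over $(\mu, \alpha)$; serious implementations instead solve the associated quadratic programme. The data, kernel bandwidth and step sizes below are invented.

```python
import numpy as np

# Crude subgradient descent sketch for
#   sum_i (1 - Y_i (K_i^T alpha + mu))_+ + lam * alpha^T K alpha.

def svm_fit(K, Y, lam, steps=2000, lr=1e-3):
    n = len(Y)
    alpha, mu = np.zeros(n), 0.0
    for _ in range(steps):
        margin = Y * (K @ alpha + mu)
        active = margin < 1                                   # points violating the margin
        g_alpha = -K @ (Y * active) + 2 * lam * (K @ alpha)   # subgradient in alpha
        g_mu = -np.sum(Y * active)                            # subgradient in mu
        alpha -= lr * g_alpha
        mu -= lr * g_mu
    return alpha, mu

def gaussian_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 2))
Y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.5, 1.0, -1.0)    # a non-linear boundary
K = gaussian_kernel(X, X)
alpha, mu = svm_fit(K, Y, lam=0.1)

x_new = np.array([[0.0, 0.0], [2.0, 2.0]])
pred = np.sign(mu + gaussian_kernel(x_new, X) @ alpha)        # sign(mu + sum_i alpha_i k(x, x_i))
```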
One possible alternative to this is to use logistic regression. Suppose that
$$\log\frac{\mathbb{P}(Y_i = 1)}{\mathbb{P}(Y_i = -1)} = x_i^T\beta^0,$$
and that $Y_1, \ldots, Y_n$ are independent. An estimate of $\beta^0$ can be obtained by maximizing the likelihood, or equivalently, solving
$$\operatorname*{argmin}_{\beta \in \mathbb{R}^p}\sum_{i=1}^n\log\bigl(1 + \exp(-Y_ix_i^T\beta)\bigr).$$
To kernelize this, we add a penalty term $\lambda\|\beta\|_2^2$ and then proceed as before. The resulting optimization problem is
$$\operatorname*{argmin}_{f \in H}\left(\sum_{i=1}^n\log\bigl(1 + \exp(-Y_if(x_i))\bigr) + \lambda\|f\|_H^2\right).$$
We can then solve this using the representer theorem.
2.6 Large-scale kernel machines
How do we apply kernel machines at large scale? Whenever we want to make a prediction with a kernel machine, we need $O(n)$ many operations, which is quite a pain if we work with a large data set, say a few million points. But even before that, forming $K \in \mathbb{R}^{n \times n}$ and inverting $K + \lambda I$ (or the equivalent) can be computationally too expensive.
One of the more popular approaches to tackle these difficulties is to form a randomized feature map $\hat{\phi}: \mathcal{X} \to \mathbb{R}^b$ such that
$$\mathbb{E}\,\hat{\phi}(x)^T\hat{\phi}(x') = k(x, x').$$
To increase the quality of the approximation, we can use
$$x \mapsto \frac{1}{\sqrt{L}}\bigl(\hat{\phi}_1(x), \ldots, \hat{\phi}_L(x)\bigr)^T \in \mathbb{R}^{Lb},$$
where the $\hat{\phi}_i(x)$ are iid with the same distribution as $\hat{\phi}(x)$.
Let $\Phi$ be the matrix with $i$th row $\frac{1}{\sqrt{L}}(\hat{\phi}_1(x_i), \ldots, \hat{\phi}_L(x_i))$. When performing, e.g., kernel Ridge regression, we can compute
$$\hat{\theta} = (\Phi^T\Phi + \lambda I)^{-1}\Phi^TY.$$
The computational cost of this is $O((Lb)^3 + n(Lb)^2)$. The key thing, of course, is that this is linear in $n$, and we can choose $L$ to trade off between the computational cost and the accuracy of the approximation.
To predict at a new $x$, we form
$$\frac{1}{\sqrt{L}}\bigl(\hat{\phi}_1(x), \ldots, \hat{\phi}_L(x)\bigr)\hat{\theta},$$
which costs $O(Lb)$.
In 2007, Rahimi and Recht proposed a random feature map for shift-invariant kernels, i.e. kernels $k$ such that $k(x, x') = h(x - x')$ for some $h$ (here we work with $\mathcal{X} = \mathbb{R}^p$). A common example is the Gaussian kernel.
One motivation of this comes from a classical result known as Bochner’s
theorem.
Theorem (Bochner's theorem). Let $k: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ be a continuous kernel. Then $k$ is shift-invariant if and only if there exists some distribution $F$ on $\mathbb{R}^p$ and $c > 0$ such that if $W \sim F$, then
$$k(x, x') = c\,\mathbb{E}\cos((x - x')^TW).$$
Our strategy is then to find some $\hat{\phi}$ depending on $W$ (and some extra randomness) such that
$$c\cos((x - x')^TW) = \mathbb{E}\bigl[\hat{\phi}(x)\hat{\phi}(x') \mid W\bigr].$$
We are going to take $b = 1$, so we don't need a transpose on the right.
The idea is then to play around with trigonometric identities to try to produce $c\cos((x - x')^TW)$. After some work, we find the following solution: let $u \sim U[-\pi, \pi]$, and take $x, y \in \mathbb{R}$. Then
$$\mathbb{E}\cos(x + u)\cos(y + u) = \mathbb{E}(\cos x\cos u - \sin x\sin u)(\cos y\cos u - \sin y\sin u).$$
Since $u$ has the same distribution as $-u$, we see that $\mathbb{E}\cos u\sin u = 0$. Also, $\cos^2u + \sin^2u = 1$, and since $u$ ranges uniformly over $[-\pi, \pi]$, by symmetry we have $\mathbb{E}\cos^2u = \mathbb{E}\sin^2u = \frac{1}{2}$. So we find that
$$2\,\mathbb{E}\cos(x + u)\cos(y + u) = \cos x\cos y + \sin x\sin y = \cos(x - y).$$
Thus, given a shift-invariant kernel $k$ with associated distribution $F$, we take $W \sim F$ independently of $u$, and define
$$\hat{\phi}(x) = \sqrt{2c}\cos(W^Tx + u).$$
Then
$$\mathbb{E}\,\hat{\phi}(x)\hat{\phi}(x') = 2c\,\mathbb{E}\bigl[\mathbb{E}[\cos(W^Tx + u)\cos(W^Tx' + u) \mid W]\bigr] = c\,\mathbb{E}\cos(W^T(x - x')) = k(x, x').$$
In order to get this to work, we must find the appropriate $W$. In certain cases, this $W$ is actually not too hard to find:
Example. Take
$$k(x, x') = \exp\left(-\frac{1}{2\sigma^2}\|x - x'\|_2^2\right),$$
the Gaussian kernel. If $W \sim N(0, \sigma^{-2}I)$, then
$$\mathbb{E}\,e^{it^TW} = e^{-\|t\|_2^2/(2\sigma^2)} = \mathbb{E}\cos(t^TW).$$
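To see the approximation in action, here is a sketch of the random Fourier feature map for the Gaussian kernel, with $W$ drawn as in the example above and $u \sim U[-\pi, \pi]$; the parameter $L$ controls the quality of the approximation, and all numbers are illustrative.

```python
import numpy as np

# Random Fourier feature sketch for the Gaussian kernel:
# phi_hat(x) = sqrt(2) cos(W^T x + u), with W ~ N(0, sigma^{-2} I), u ~ U[-pi, pi],
# averaged over L independent copies so that Phi Phi^T approximates K.

rng = np.random.default_rng(5)
n, p, L, sigma = 200, 5, 500, 1.0
X = rng.normal(size=(n, p))

W = rng.normal(scale=1.0 / sigma, size=(p, L))        # frequencies
u = rng.uniform(-np.pi, np.pi, size=L)                # phases
Phi = np.sqrt(2.0 / L) * np.cos(X @ W + u)            # n x L randomized feature matrix

K_approx = Phi @ Phi.T
K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * sigma**2))
print(np.abs(K_approx - K_exact).max())               # small for large L
```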
3 The Lasso and beyond
We are interested in the situation where we have a very large number of variables
compared to the size of the data set. For example the data set might be all the
genetic and environmental information about patients, and we want to predict if
they have diabetes. The key property is that we expect that most of the data
are not useful, and we want to select a small subset of variables that are relevant,
and form prediction models based on them.
In these cases, ordinary least squares is not a very good tool. If there are
more variables than the number of data points, then it is likely that we can
pick the regression coefficients so that our model fits our data exactly, but this
model is likely to be spurious, since we are fine-tuning our model based on the
random fluctuations of a large number of irrelevant variables. Thus, we want to
find a regression method that penalizes non-zero coefficients. Note that Ridge regression is not good enough: it encourages small coefficients, but since it uses the $\ell_2$ norm, it's quite lenient towards small, non-zero coefficients, as the square of a small number is really small.
There is another reason to favour models that set a lot of coefficients to zero. Consider the linear model
$$Y = X\beta^0 + \varepsilon,$$
where as usual $\mathbb{E}\varepsilon = 0$ and $\operatorname{var}(\varepsilon) = \sigma^2I$. Let
$$S = \{k : \beta^0_k \neq 0\},$$
and let $s = |S|$. As before, we suppose we have some a priori belief that $s \ll p$.
For the purposes of this investigation, suppose $X$ has full column rank, so that we can use ordinary least squares. Then the prediction error is
$$\begin{aligned}\frac{1}{n}\mathbb{E}\|X\beta^0 - X\hat{\beta}^{OLS}\|_2^2 &= \frac{1}{n}\mathbb{E}(\hat{\beta}^{OLS} - \beta^0)^TX^TX(\hat{\beta}^{OLS} - \beta^0)\\ &= \frac{1}{n}\mathbb{E}\operatorname{tr}\bigl((\hat{\beta}^{OLS} - \beta^0)(\hat{\beta}^{OLS} - \beta^0)^TX^TX\bigr)\\ &= \frac{1}{n}\operatorname{tr}\bigl(\mathbb{E}(\hat{\beta}^{OLS} - \beta^0)(\hat{\beta}^{OLS} - \beta^0)^TX^TX\bigr)\\ &= \frac{1}{n}\operatorname{tr}\bigl(\sigma^2(X^TX)^{-1}X^TX\bigr)\\ &= \sigma^2\frac{p}{n}.\end{aligned}$$
Note that this does not depend on what $\beta^0$ is, but only on $\sigma^2$, $p$ and $n$.
In the situation we are interested in, we expect $s \ll p$. So if we could identify $S$ and run ordinary least squares on just these variables, then we would have a mean squared prediction error of $\sigma^2\frac{s}{n}$, which would be much, much smaller.
We first discuss a few classical model selection methods that try to do this.
The first approach we may think of is the best subsets method, where we try to do regression on all possible choices of $S$ and see which is the “best”. For any set $M$ of indices, we let $X_M$ be the submatrix of $X$ formed from the columns of $X$ with index in $M$. We then regress $Y$ on $X_M$ for every $M \subseteq \{1, \ldots, p\}$, and then pick the best model via cross-validation, for example. A big problem with this is that the number of subsets grows exponentially with $p$, and so this becomes infeasible for, say, $p > 50$.
Another method might be forward selection. We start with an intercept-only
model, and then add to the existing model the predictor that reduces the RSS
the most, and then keep doing this until a fixed number of predictors have been
added. This is quite a nice approach, and is computationally quite fast even
if
p
is large. However, this method is greedy, and if we make a mistake at the
beginning, then the method blows up. In general, this method is rather unstable,
and is not great from a practical perspective.
3.1 The Lasso estimator
The Lasso (Tibshirani, 1996) is a seemingly small modification of Ridge regression that solves our problems. It solves
$$(\hat{\mu}^L_\lambda, \hat{\beta}^L_\lambda) = \operatorname*{argmin}_{(\mu, \beta) \in \mathbb{R} \times \mathbb{R}^p}\frac{1}{2n}\|Y - \mu\mathbf{1} - X\beta\|_2^2 + \lambda\|\beta\|_1.$$
The key difference is that we use an $\ell_1$ norm on $\beta$ rather than the $\ell_2$ norm,
$$\|\beta\|_1 = \sum_{k=1}^p|\beta_k|.$$
This makes it drastically different from Ridge regression. We will see that for $\lambda$ large, it will make all of the entries of $\hat{\beta}$ exactly zero, as opposed to merely very close to zero.
We can compare this to best subset regression, where we replace $\|\beta\|_1$ with something of the form $\sum_{k=1}^p\mathbf{1}_{\beta_k \neq 0}$. But the beautiful property of the Lasso is that its optimization problem is continuous, and in fact convex. This allows us to solve it using standard convex optimization techniques.
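As a hedged illustration, scikit-learn's Lasso minimises exactly the objective $Q_\lambda$ discussed below (its alpha parameter plays the role of $\lambda$), so a sparse fit can be obtained as follows; the simulated data and the choice of $\lambda$ are for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso

# scikit-learn's Lasso minimises (1/(2n)) ||Y - X beta||_2^2 + alpha * ||beta||_1.
# Simulated data with only 3 non-zero coefficients out of p = 200.

rng = np.random.default_rng(6)
n, p = 100, 200
X = rng.normal(size=(n, p))
beta0 = np.zeros(p)
beta0[:3] = [3.0, -2.0, 1.5]
Y = X @ beta0 + rng.normal(size=n)

lam = 2.0 * np.sqrt(np.log(p) / n)              # the order of magnitude suggested by the theory below
fit = Lasso(alpha=lam).fit(X, Y)
print(np.sum(fit.coef_ != 0), fit.coef_[:5])    # most coefficients are exactly zero
```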
Why is the $\ell_1$ norm so different from the $\ell_2$ norm? Just as in Ridge regression, we may center and scale $X$, and center $Y$, so that we can remove $\mu$ from the objective. Define
$$Q_\lambda(\beta) = \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1.$$
Any minimizer $\hat{\beta}^L_\lambda$ of $Q_\lambda(\beta)$ must also be a solution to
$$\min\|Y - X\beta\|_2^2 \quad\text{subject to}\quad \|\beta\|_1 \leq \|\hat{\beta}^L_\lambda\|_1.$$
Similarly, $\hat{\beta}^R_\lambda$ is a solution of
$$\min\|Y - X\beta\|_2^2 \quad\text{subject to}\quad \|\beta\|_2 \leq \|\hat{\beta}^R_\lambda\|_2.$$
So imagine we are given the value of $\|\hat{\beta}^L_\lambda\|_1$, and we try to solve the above optimization problem with pictures. The region $\|\beta\|_1 \leq \|\hat{\beta}^L_\lambda\|_1$ is a rotated square (a diamond with corners on the coordinate axes). On the other hand, the minimum of $\|Y - X\beta\|_2^2$ is at $\hat{\beta}^{OLS}$, and the contours are ellipses centered around this point.
To solve the minimization problem, we should pick the smallest contour that hits the square, and pick the intersection point to be our estimate of $\beta^0$. The point is that since the unit ball in the $\ell_1$-norm has these corners, the estimate is likely to land on a corner, hence have a lot of zeroes. Compare this to the case of Ridge regression, where the constraint set is a ball: generically, we would expect the solution to be non-zero everywhere.
More generally, we can try to use the $\ell_q$ “norm” given by
$$\|\beta\|_q = \left(\sum_{k=1}^p|\beta_k|^q\right)^{1/q}.$$
Plotting the unit balls for $q = 0.5, 1, 1.5, 2$, we see that $q = 1$ is the smallest value of $q$ for which there are corners, and also the largest value of $q$ for which the constraint set is still convex. Thus, $q = 1$ is the sweet spot for doing regression.
But is this actually good? Suppose the columns of $X$ are scaled to have $\ell_2$ norm $\sqrt{n}$, and, after centering, we have a normal linear model
$$Y = X\beta^0 + \varepsilon - \bar{\varepsilon}\mathbf{1},$$
with $\varepsilon \sim N_n(0, \sigma^2I)$.
Theorem. Let $\hat{\beta}$ be the Lasso solution with
$$\lambda = A\sigma\sqrt{\frac{\log p}{n}}$$
for some $A$. Then with probability at least $1 - 2p^{-(A^2/2 - 1)}$, we have
$$\frac{1}{n}\|X\beta^0 - X\hat{\beta}\|_2^2 \leq 4A\sigma\sqrt{\frac{\log p}{n}}\|\beta^0\|_1.$$
Crucially, the dependence on $p$ is only through $\log p$, and not $p$ itself. On the other hand, unlike what we usually see for ordinary least squares, the rate involves $\frac{1}{\sqrt{n}}$ rather than $\frac{1}{n}$. We will later obtain better bounds when we make some assumptions on the design matrix.
Proof. We don't really have a closed-form expression for $\hat{\beta}$, and in general one doesn't exist. So the only thing we can use is that it minimizes $Q_\lambda(\beta)$. Thus, by definition, we have
$$\frac{1}{2n}\|Y - X\hat{\beta}\|_2^2 + \lambda\|\hat{\beta}\|_1 \leq \frac{1}{2n}\|Y - X\beta^0\|_2^2 + \lambda\|\beta^0\|_1.$$
We know exactly what $Y$ is: it is $X\beta^0 + \varepsilon - \bar{\varepsilon}\mathbf{1}$. If we plug this in, we get
$$\frac{1}{2n}\|X\beta^0 - X\hat{\beta}\|_2^2 \leq \frac{1}{n}\varepsilon^TX(\hat{\beta} - \beta^0) + \lambda\|\beta^0\|_1 - \lambda\|\hat{\beta}\|_1.$$
Here we use the fact that $X$ is centered, and so is orthogonal to $\mathbf{1}$.
Now Hölder tells us
$$|\varepsilon^TX(\hat{\beta} - \beta^0)| \leq \|X^T\varepsilon\|_\infty\|\hat{\beta} - \beta^0\|_1.$$
We'd like to bound $\|X^T\varepsilon\|_\infty$, but it can be arbitrarily large, since $\varepsilon$ is Gaussian. However, with high probability it is small. Precisely, define the event
$$\Omega = \left\{\frac{1}{n}\|X^T\varepsilon\|_\infty \leq \lambda\right\}.$$
In a lemma below, we will show that $\mathbb{P}(\Omega) \geq 1 - 2p^{-(A^2/2 - 1)}$. Assuming $\Omega$ holds, we have
$$\frac{1}{2n}\|X\beta^0 - X\hat{\beta}\|_2^2 \leq \lambda\|\hat{\beta} - \beta^0\|_1 - \lambda\|\hat{\beta}\|_1 + \lambda\|\beta^0\|_1 \leq 2\lambda\|\beta^0\|_1.$$
3.2 Basic concentration inequalities
We now have to prove the lemma we left out in the theorem just now. In this
section, we are going to prove a bunch of useful inequalities that we will later
use to prove bounds.
Consider the event $\Omega$ as defined before. Then
$$\mathbb{P}\left(\frac{1}{n}\|X^T\varepsilon\|_\infty > \lambda\right) = \mathbb{P}\left(\bigcup_{j=1}^p\left\{\frac{1}{n}|X_j^T\varepsilon| > \lambda\right\}\right) \leq \sum_{j=1}^p\mathbb{P}\left(\frac{1}{n}|X_j^T\varepsilon| > \lambda\right).$$
Now $\frac{1}{n}X_j^T\varepsilon \sim N(0, \frac{\sigma^2}{n})$. So we just have to bound tail probabilities of normal random variables.
The simplest tail bound we have is Markov's inequality.
Lemma (Markov's inequality). Let $W$ be a non-negative random variable. Then
$$\mathbb{P}(W \geq t) \leq \frac{1}{t}\mathbb{E}W.$$
Proof. We have
$$t\mathbf{1}_{W \geq t} \leq W.$$
The result then follows from taking expectations and dividing both sides by $t$.
While this is a very simple bound, it is actually quite powerful, because it assumes almost nothing about $W$. This allows for the following trick: given any strictly increasing function $\varphi: \mathbb{R} \to (0, \infty)$ and any random variable $W$, we have
$$\mathbb{P}(W \geq t) = \mathbb{P}(\varphi(W) \geq \varphi(t)) \leq \frac{\mathbb{E}\varphi(W)}{\varphi(t)}.$$
So we get a bound on the tail probability of $W$ for any such function. Even better, we can minimize the right-hand side over a class of functions to get an even better bound.
In particular, applying this with $\varphi(t) = e^{\alpha t}$ gives
Corollary (Chernoff bound). For any random variable $W$, we have
$$\mathbb{P}(W \geq t) \leq \inf_{\alpha > 0}e^{-\alpha t}\,\mathbb{E}e^{\alpha W}.$$
Note that $\mathbb{E}e^{\alpha W}$ is just the moment generating function of $W$, which we can often compute quite straightforwardly.
We can immediately apply this when $W$ is a normal random variable, $W \sim N(0, \sigma^2)$. Then
$$\mathbb{E}e^{\alpha W} = e^{\alpha^2\sigma^2/2}.$$
So we have
$$\mathbb{P}(W \geq t) \leq \inf_{\alpha > 0}\exp\left(\frac{\alpha^2\sigma^2}{2} - \alpha t\right) = e^{-t^2/(2\sigma^2)}.$$
Observe that in fact this tail bound works for any random variable whose moment generating function is bounded above by $e^{\alpha^2\sigma^2/2}$.
Definition (Sub-Gaussian random variable). A random variable $W$ is sub-Gaussian (with parameter $\sigma$) if
$$\mathbb{E}e^{\alpha(W - \mathbb{E}W)} \leq e^{\alpha^2\sigma^2/2}$$
for all $\alpha \in \mathbb{R}$.
Corollary. Any sub-Gaussian random variable $W$ with parameter $\sigma$ satisfies
$$\mathbb{P}(W - \mathbb{E}W \geq t) \leq e^{-t^2/(2\sigma^2)}.$$
In general, bounded random variables are sub-Gaussian.
Lemma (Hoeffding's lemma). If $W$ has mean zero and takes values in $[a, b]$, then $W$ is sub-Gaussian with parameter $\frac{b - a}{2}$.
Recall that the sum of two independent Gaussians is still a Gaussian. This continues to hold for sub-Gaussian random variables.
Proposition. Let $(W_i)_{i=1}^n$ be independent mean-zero sub-Gaussian random variables with parameters $(\sigma_i)_{i=1}^n$, and let $\gamma \in \mathbb{R}^n$. Then $\gamma^TW$ is sub-Gaussian with parameter
$$\left(\sum_i(\gamma_i\sigma_i)^2\right)^{1/2}.$$
Proof. We have
$$\mathbb{E}\exp\left(\alpha\sum_{i=1}^n\gamma_iW_i\right) = \prod_{i=1}^n\mathbb{E}\exp(\alpha\gamma_iW_i) \leq \prod_{i=1}^n\exp\left(\frac{\alpha^2}{2}\gamma_i^2\sigma_i^2\right) = \exp\left(\frac{\alpha^2}{2}\sum_{i=1}^n\sigma_i^2\gamma_i^2\right).$$
We can now prove our bound for the Lasso, which in fact works for any sub-Gaussian error distribution.
Lemma. Suppose $(\varepsilon_i)_{i=1}^n$ are independent, mean-zero and sub-Gaussian with common parameter $\sigma$. Let
$$\lambda = A\sigma\sqrt{\frac{\log p}{n}}.$$
Let $X$ be a matrix whose columns all have $\ell_2$ norm $\sqrt{n}$. Then
$$\mathbb{P}\left(\frac{1}{n}\|X^T\varepsilon\|_\infty \leq \lambda\right) \geq 1 - 2p^{-(A^2/2 - 1)}.$$
In particular, this includes $\varepsilon \sim N_n(0, \sigma^2I)$.
Proof. We have
$$\mathbb{P}\left(\frac{1}{n}\|X^T\varepsilon\|_\infty > \lambda\right) \leq \sum_{j=1}^p\mathbb{P}\left(\frac{1}{n}|X_j^T\varepsilon| > \lambda\right).$$
But $\pm\frac{1}{n}X_j^T\varepsilon$ are both sub-Gaussian with parameter
$$\sigma\left(\sum_i\left(\frac{X_{ij}}{n}\right)^2\right)^{1/2} = \frac{\sigma}{\sqrt{n}}.$$
Then by our previous corollary, we get
$$\sum_{j=1}^p\mathbb{P}\left(\frac{1}{n}|X_j^T\varepsilon| > \lambda\right) \leq 2p\exp\left(-\frac{\lambda^2n}{2\sigma^2}\right).$$
Note that we have the factor of $2$ since we need to consider the two cases $\frac{1}{n}X_j^T\varepsilon > \lambda$ and $-\frac{1}{n}X_j^T\varepsilon > \lambda$.
Plugging in our expression for $\lambda$, we can write the bound as
$$2p\exp\left(-\frac{1}{2}A^2\log p\right) = 2p^{1 - A^2/2}.$$
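A small Monte Carlo sketch (illustrative numbers only) of what the lemma says: with this choice of $\lambda$ and columns of $X$ scaled to $\ell_2$ norm $\sqrt{n}$, the event $\Omega$ should hold with probability at least $1 - 2p^{1 - A^2/2}$.

```python
import numpy as np

# Empirically check P(||X^T eps||_inf / n <= lambda) against the lower bound 1 - 2 p^{1 - A^2/2}.

rng = np.random.default_rng(7)
n, p, sigma, A = 200, 50, 1.0, 2.0
X = rng.normal(size=(n, p))
X = X * np.sqrt(n) / np.linalg.norm(X, axis=0)        # columns have l2-norm sqrt(n)
lam = A * sigma * np.sqrt(np.log(p) / n)

reps, hits = 2000, 0
for _ in range(reps):
    eps = rng.normal(scale=sigma, size=n)
    hits += np.max(np.abs(X.T @ eps)) / n <= lam
print(hits / reps, 1 - 2 * p ** (1 - A**2 / 2))       # empirical probability vs lower bound
```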
This is all we need for our result on the Lasso. We are going to go a bit
further into this topic of concentration inequalities, because we will need them
later when we impose conditions on the design matrix. In particular, we would
like to bound the tail probabilities of products.
Definition (Bernstein's condition). We say that a random variable $W$ satisfies Bernstein's condition with parameters $(\sigma, b)$, where $\sigma, b > 0$, if
$$\mathbb{E}\bigl[|W - \mathbb{E}W|^k\bigr] \leq \frac{1}{2}k!\,\sigma^2b^{k-2}$$
for $k = 2, 3, \ldots$.
The point is that these bounds on the moments let us bound the moment generating function of $W$.
Proposition (Bernstein's inequality). Let $W_1, W_2, \ldots, W_n$ be independent random variables with $\mathbb{E}W_i = \mu$, and suppose each $W_i$ satisfies Bernstein's condition with parameters $(\sigma, b)$. Then
$$\mathbb{E}e^{\alpha(W_i - \mu)} \leq \exp\left(\frac{\alpha^2\sigma^2/2}{1 - b|\alpha|}\right) \quad\text{for all } |\alpha| < \frac{1}{b},$$
$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^nW_i - \mu \geq t\right) \leq \exp\left(-\frac{nt^2}{2(\sigma^2 + bt)}\right) \quad\text{for all } t \geq 0.$$
Note that for large $t$, the bound decays like $e^{-t}$ instead of $e^{-t^2}$.
Proof. For the first part, we fix $i$ and write $W = W_i$. Let $|\alpha| < \frac{1}{b}$. Then
$$\mathbb{E}e^{\alpha(W - \mu)} = \sum_{k=0}^\infty\frac{1}{k!}\mathbb{E}\bigl[\alpha^k(W - \mu)^k\bigr] \leq 1 + \frac{\sigma^2\alpha^2}{2}\sum_{k=2}^\infty|\alpha|^{k-2}b^{k-2} = 1 + \frac{\sigma^2\alpha^2}{2}\cdot\frac{1}{1 - |\alpha|b} \leq \exp\left(\frac{\alpha^2\sigma^2/2}{1 - b|\alpha|}\right).$$
Then
$$\mathbb{E}\exp\left(\frac{\alpha}{n}\sum_{i=1}^n(W_i - \mu)\right) = \prod_{i=1}^n\mathbb{E}\exp\left(\frac{\alpha}{n}(W_i - \mu)\right) \leq \exp\left(\frac{n(\frac{\alpha}{n})^2\sigma^2/2}{1 - b\frac{\alpha}{n}}\right),$$
assuming $\frac{\alpha}{n} < \frac{1}{b}$. So it follows that
$$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^nW_i - \mu \geq t\right) \leq e^{-\alpha t}\exp\left(\frac{n(\frac{\alpha}{n})^2\sigma^2/2}{1 - b\frac{\alpha}{n}}\right).$$
Setting
$$\frac{\alpha}{n} = \frac{t}{bt + \sigma^2} \in \left[0, \frac{1}{b}\right)$$
gives the result.
Lemma. Let $W, Z$ be mean-zero sub-Gaussian random variables with parameters $\sigma_W$ and $\sigma_Z$ respectively. Then $WZ$ satisfies Bernstein's condition with parameters $(8\sigma_W\sigma_Z, 4\sigma_W\sigma_Z)$.
Proof. For any random variable $Y$ (which we will later take to be $WZ$), for $k > 1$ we know
$$\mathbb{E}|Y - \mathbb{E}Y|^k = 2^k\,\mathbb{E}\left|\frac{1}{2}Y - \frac{1}{2}\mathbb{E}Y\right|^k \leq 2^k\,\mathbb{E}\left(\frac{1}{2}|Y| + \frac{1}{2}|\mathbb{E}Y|\right)^k.$$
Note that
$$\left(\frac{1}{2}|Y| + \frac{1}{2}|\mathbb{E}Y|\right)^k \leq \frac{|Y|^k + |\mathbb{E}Y|^k}{2}$$
by Jensen's inequality. Applying Jensen's inequality again, we have
$$|\mathbb{E}Y|^k \leq \mathbb{E}|Y|^k.$$
Putting the whole thing together, we have
$$\mathbb{E}|Y - \mathbb{E}Y|^k \leq 2^k\,\mathbb{E}|Y|^k.$$
Now take $Y = WZ$. Then
$$\mathbb{E}|WZ - \mathbb{E}WZ|^k \leq 2^k\,\mathbb{E}|WZ|^k \leq 2^k\bigl(\mathbb{E}W^{2k}\bigr)^{1/2}\bigl(\mathbb{E}Z^{2k}\bigr)^{1/2},$$
by the Cauchy–Schwarz inequality.
We know that sub-Gaussian random variables satisfy a bound on their tail probabilities. We can use this to bound the moments of $W$ and $Z$. First note that
$$W^{2k} = \int_0^\infty\mathbf{1}_{x < W^{2k}}\,\mathrm{d}x.$$
Taking expectations of both sides, we get
$$\mathbb{E}W^{2k} = \int_0^\infty\mathbb{P}(x < W^{2k})\,\mathrm{d}x.$$
Since we have a tail bound on $W$ instead of $W^{2k}$, we substitute $x = t^{2k}$, so that $\mathrm{d}x = 2kt^{2k-1}\,\mathrm{d}t$. Then we get
$$\mathbb{E}W^{2k} = 2k\int_0^\infty t^{2k-1}\,\mathbb{P}(|W| > t)\,\mathrm{d}t \leq 4k\int_0^\infty t^{2k-1}\exp\left(-\frac{t^2}{2\sigma_W^2}\right)\mathrm{d}t,$$
where again the factor of $2$ accounts for both signs. We perform yet another substitution
$$x = \frac{t^2}{2\sigma_W^2}, \qquad \mathrm{d}x = \frac{t}{\sigma_W^2}\,\mathrm{d}t.$$
Then we get
$$\mathbb{E}W^{2k} \leq 2^{k+1}k\,\sigma_W^{2k}\int_0^\infty x^{k-1}e^{-x}\,\mathrm{d}x = 2^{k+1}k!\,\sigma_W^{2k}.$$
Plugging this back in, we have
$$\mathbb{E}|WZ - \mathbb{E}WZ|^k \leq 2^k\cdot2^{k+1}k!\,\sigma_W^k\sigma_Z^k = \frac{1}{2}k!\,2^{2k+2}\sigma_W^k\sigma_Z^k = \frac{1}{2}k!\,(8\sigma_W\sigma_Z)^2(4\sigma_W\sigma_Z)^{k-2}.$$
3.3 Convex analysis and optimization theory
We’ll leave these estimates aside for a bit, and give more background on convex
analysis and convex optimization. Recall the following definition:
Definition (Convex set). A set $A \subseteq \mathbb{R}^d$ is convex if for any $x, y \in A$ and $t \in [0, 1]$, we have
$$(1 - t)x + ty \in A.$$
We are actually more interested in convex functions. We shall allow our functions to take the value $\infty$, so let us define $\bar{\mathbb{R}} = \mathbb{R} \cup \{\infty\}$. The point is that if we want our function to be defined on $[a, b]$, then it is convenient to extend it to a function defined on all of $\mathbb{R}$ by setting it to be $\infty$ outside of $[a, b]$.
Definition (Convex function). A function $f: \mathbb{R}^d \to \bar{\mathbb{R}}$ is convex iff
$$f((1 - t)x + ty) \leq (1 - t)f(x) + tf(y)$$
for all $x, y \in \mathbb{R}^d$ and $t \in (0, 1)$. Moreover, we require that $f(x) < \infty$ for at least one $x$.
We say $f$ is strictly convex if the inequality is strict for all $x \neq y$ and $t \in (0, 1)$.
Definition (Domain). Define the domain of a function $f: \mathbb{R}^d \to \bar{\mathbb{R}}$ to be
$$\operatorname{dom}f = \{x : f(x) < \infty\}.$$
One sees that the domain of a convex function is always convex.
Proposition.
(i) Let $f_1, \ldots, f_m: \mathbb{R}^d \to \bar{\mathbb{R}}$ be convex with $\operatorname{dom}f_1 \cap \cdots \cap \operatorname{dom}f_m \neq \emptyset$, and let $c_1, \ldots, c_m \geq 0$. Then $c_1f_1 + \cdots + c_mf_m$ is a convex function.
(ii) If $f: \mathbb{R}^d \to \mathbb{R}$ is twice continuously differentiable, then
(a) $f$ is convex iff its Hessian is positive semi-definite everywhere;
(b) $f$ is strictly convex if its Hessian is positive definite everywhere.
Note that having a positive definite Hessian is not necessary for strict convexity, e.g. $x \mapsto x^4$ is strictly convex but has vanishing Hessian at $0$.
Now consider a constrained optimization problem
$$\text{minimize } f(x) \text{ subject to } g(x) = 0,$$
where $x \in \mathbb{R}^d$ and $g: \mathbb{R}^d \to \mathbb{R}^b$. The Lagrangian for this problem is
$$L(x, \theta) = f(x) + \theta^T g(x).$$
Why is this helpful? Suppose $c^*$ is the constrained minimum of $f$. Then note that for any $\theta$, we have
$$\inf_{x \in \mathbb{R}^d} L(x, \theta) \le \inf_{x \in \mathbb{R}^d : g(x) = 0} L(x, \theta) = \inf_{x \in \mathbb{R}^d : g(x) = 0} f(x) = c^*.$$
Thus, if we can find some $\theta^*, x^*$ such that $x^*$ minimizes $L(x, \theta^*)$ and $g(x^*) = 0$, then this is indeed the optimal solution.

This gives us a method to solve the optimization problem: for each fixed $\theta$, solve the unconstrained optimization problem $\operatorname{argmin}_x L(x, \theta)$. If we are doing this analytically, then we would have a formula for $x$ in terms of $\theta$. We then seek a $\theta$ such that $g(x) = 0$ holds.
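As a toy illustration of this recipe (my example, not from the lectures), consider minimizing $f(x) = \frac12\|x\|_2^2$ subject to $g(x) = a^Tx - 1 = 0$ for a fixed non-zero $a \in \mathbb{R}^d$. The Lagrangian is
$$L(x, \theta) = \tfrac12\|x\|_2^2 + \theta(a^Tx - 1),$$
and for fixed $\theta$ the unconstrained minimizer satisfies $\nabla_x L = x + \theta a = 0$, i.e. $x = -\theta a$. Choosing $\theta$ so that $g(x) = 0$ gives $-\theta\|a\|_2^2 = 1$, so $\theta^* = -1/\|a\|_2^2$ and $x^* = a/\|a\|_2^2$, which is indeed the minimum-norm point on the hyperplane $\{x : a^Tx = 1\}$.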
Subgradients
Usually, when we have a function to optimize, we take its derivative and set it
to zero. This works well if our function is actually differentiable. However, the
$\ell_1$ norm is not a differentiable function, since $|x|$ is not differentiable at $0$. This is not some exotic case we may hope to avoid: most of the time when solving the Lasso, we actively want our solutions to have zeroes, so we really want to get to these non-differentiable points.
Thus, we seek some generalized notion of derivative that works on functions
that are not differentiable.
Definition (Subgradient). A vector $v \in \mathbb{R}^d$ is a subgradient of a convex function $f$ at $x$ if $f(y) \ge f(x) + v^T(y - x)$ for all $y \in \mathbb{R}^d$.

The set of subgradients of $f$ at $x$ is denoted $\partial f(x)$, and is called the subdifferential.

[Figure: a differentiable convex function with a unique supporting line at $x$, and a non-differentiable one with several supporting lines at a kink.]

This is indeed a generalization of the derivative, since

Proposition. Let $f$ be convex and differentiable at $x \in \operatorname{int}(\operatorname{dom} f)$. Then $\partial f(x) = \{\nabla f(x)\}$.

The following properties are immediate from the definition.

Proposition. Suppose $f$ and $g$ are convex with $\operatorname{int}(\operatorname{dom} f) \cap \operatorname{int}(\operatorname{dom} g) \ne \emptyset$, and $\alpha > 0$. Then
$$\partial(\alpha f)(x) = \alpha\,\partial f(x) = \{\alpha v : v \in \partial f(x)\},\qquad \partial(f + g)(x) = \partial f(x) + \partial g(x).$$

The condition (for convex differentiable functions) that "$x^*$ is a minimum iff $f'(x^*) = 0$" now becomes

Proposition. If $f$ is convex, then
$$x^* \in \operatorname*{argmin}_{x \in \mathbb{R}^d} f(x) \iff 0 \in \partial f(x^*).$$

Proof. Both sides are equivalent to the requirement that $f(y) \ge f(x^*)$ for all $y$.
We are interested in applying this to the Lasso. So we want to compute the subdifferential of the $\ell_1$ norm. Let's first introduce some notation.

Notation. For $x \in \mathbb{R}^d$ and $A \subseteq \{1, \ldots, d\}$, we write $x_A$ for the sub-vector of $x$ formed by the components of $x$ indexed by $A$. We write $x_{-j} = x_{\{j\}^c} = x_{\{1, \ldots, d\}\setminus\{j\}}$. Similarly, we write $x_{-jk} = x_{\{j, k\}^c}$, etc.

We write
$$\operatorname{sgn}(x_i) = \begin{cases} -1 & x_i < 0\\ 1 & x_i > 0\\ 0 & x_i = 0\end{cases},$$
and $\operatorname{sgn}(x) = (\operatorname{sgn}(x_1), \ldots, \operatorname{sgn}(x_d))^T$.

First note that $\|\cdot\|_1$ is convex, as it is a norm.

Proposition. For $x \in \mathbb{R}^d$ and $A = \{j : x_j \ne 0\}$, we have
$$\partial\|x\|_1 = \{v \in \mathbb{R}^d : \|v\|_\infty \le 1,\ v_A = \operatorname{sgn}(x_A)\}.$$
Proof. It suffices to compute the subdifferential of each absolute value function, and then add them up.

For $j = 1, \ldots, d$, define $g_j: \mathbb{R}^d \to \mathbb{R}$ by sending $x$ to $|x_j|$. If $x_j \ne 0$, then $g_j$ is differentiable at $x$, and so we know $\partial g_j(x) = \{\operatorname{sgn}(x_j)e_j\}$, with $e_j$ the $j$th standard basis vector.

When $x_j = 0$, if $v \in \partial g_j(x)$, then
$$g_j(y) \ge g_j(x) + v^T(y - x)\quad\text{for all } y,$$
i.e.
$$|y_j| \ge v^T(y - x).$$
We claim this holds iff $v_j \in [-1, 1]$ and $v_{-j} = 0$. The $\Leftarrow$ direction is an immediate calculation, and to show $\Rightarrow$, we pick $y_{-j} = v_{-j} + x_{-j}$ and $y_j = 0$. Then we have
$$0 \ge v_{-j}^Tv_{-j}.$$
So we know that $v_{-j} = 0$. Once we know this, the condition says $|y_j| \ge v_jy_j$ for all $y_j$. This is true iff $v_j \in [-1, 1]$. Forming the set sum of the $\partial g_j(x)$ gives the result.
3.4 Properties of Lasso solutions
Let’s now apply this to the Lasso. Recall that the Lasso objective was
Q
λ
(β) =
1
2n
kY Xβk
2
2
+ λkβk
1
.
We know this is a convex function in
β
. So we know
ˆ
β
L
λ
is a minimizer of
Q
λ
(
β
)
iff 0 Q
λ
(
ˆ
β).
Let’s subdifferentiate Q
λ
and see what this amounts to. We have
Q
λ
(
ˆ
β) =
1
n
X
T
(Y Xβ)
+ λ
n
ˆν R
d
: kˆνk
1, ˆν|
ˆρ
L
λ
= sgn(
ˆ
β
L
λ,ˆρ
L
λ
)
o
,
where ˆρ
L
λ
= {j :
ˆ
β
L
λ,j
6= 0}.
Thus, 0 Q
λ
(
ˆ
β
λ
) is equivalent to
ˆν =
1
λ
·
1
n
X
T
(Y X
ˆ
β
λ
)
satisfying
kˆνk
1, ˆν
ˆρ
L
λ
= sgn(
ˆ
β
L
λ,ˆρ
L
λ
).
These are known as the KKT conditions for the Lasso.
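These conditions are easy to check numerically. The sketch below is mine, assuming scikit-learn is available; its `Lasso` objective matches $Q_\lambda$ above with `alpha` $= \lambda$, so we can fit a Lasso and verify that $\hat\nu$ behaves as the KKT conditions predict.

```python
# A small numerical check of the KKT conditions (a sketch; it assumes scikit-learn's
# Lasso, whose objective matches Q_lambda above with alpha = lambda).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, lam = 100, 20, 0.1
X = rng.standard_normal((n, p))
beta0 = np.zeros(p); beta0[:3] = [2.0, -1.5, 1.0]
Y = X @ beta0 + rng.standard_normal(n)

fit = Lasso(alpha=lam, fit_intercept=False, max_iter=50_000).fit(X, Y)
beta_hat = fit.coef_

# nu_hat should have |entries| <= 1, and equal sgn(beta_hat_k) whenever beta_hat_k != 0.
nu_hat = X.T @ (Y - X @ beta_hat) / (n * lam)
active = beta_hat != 0
print("max |nu_hat|:", np.abs(nu_hat).max())            # ~<= 1 (up to solver tolerance)
print("nu_hat on active set:", nu_hat[active])          # ~= signs of the active coefficients
print("signs on active set: ", np.sign(beta_hat[active]))
```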
In principle, there could be several solutions to the Lasso. However, at least
the fitted values are always unique.
Proposition. $X\hat\beta^L_\lambda$ is unique.

Proof. Fix $\lambda > 0$ and stop writing it. Suppose $\hat\beta^{(1)}$ and $\hat\beta^{(2)}$ are two Lasso solutions at $\lambda$. Then
$$Q(\hat\beta^{(1)}) = Q(\hat\beta^{(2)}) = c.$$
As $Q$ is convex, we know
$$c \le Q\left(\frac12\hat\beta^{(1)} + \frac12\hat\beta^{(2)}\right) \le \frac12 Q(\hat\beta^{(1)}) + \frac12 Q(\hat\beta^{(2)}) = c.$$
So $\frac12\hat\beta^{(1)} + \frac12\hat\beta^{(2)}$ is also a minimizer.

Since the two terms in $Q(\beta)$ are individually convex, it must be the case that
$$\left\|\frac12(Y - X\hat\beta^{(1)}) + \frac12(Y - X\hat\beta^{(2)})\right\|_2^2 = \frac12\left\|Y - X\hat\beta^{(1)}\right\|_2^2 + \frac12\left\|Y - X\hat\beta^{(2)}\right\|_2^2,$$
$$\left\|\frac12\left(\hat\beta^{(1)} + \hat\beta^{(2)}\right)\right\|_1 = \frac12\left\|\hat\beta^{(1)}\right\|_1 + \frac12\left\|\hat\beta^{(2)}\right\|_1.$$
Moreover, since $\|\cdot\|_2^2$ is strictly convex, we can have equality in the first line only if $X\hat\beta^{(1)} = X\hat\beta^{(2)}$. So we are done.
Definition (Equicorrelation set). Define the equicorrelation set $\hat{E}_\lambda$ to be the set of $k$ such that
$$\frac{1}{n}\left|X_k^T(Y - X\hat\beta^L_\lambda)\right| = \lambda,$$
or equivalently the set of $k$ with $\hat\nu_k = \pm 1$, which is well-defined since it depends only on the fitted values.

By the KKT conditions, $\hat{E}_\lambda$ contains the set of non-zeroes of any Lasso solution, but may be strictly bigger than that.

Note that if $\operatorname{rk}(X_{\hat{E}_\lambda}) = |\hat{E}_\lambda|$, then the Lasso solution must be unique: any two solutions $\hat\beta^{(1)}, \hat\beta^{(2)}$ are supported on $\hat{E}_\lambda$ and have the same fitted values, so
$$X_{\hat{E}_\lambda}\left(\hat\beta^{(1)}_{\hat{E}_\lambda} - \hat\beta^{(2)}_{\hat{E}_\lambda}\right) = 0,$$
and hence $\hat\beta^{(1)} = \hat\beta^{(2)}$.
3.5 Variable selection
So we have seen that the Lasso picks out some “important” variables and discards
the rest. How well does it do the job?
For simplicity, consider a noiseless linear model
$$Y = X\beta^0.$$
Our objective is to find the set
$$S = \{k : \beta^0_k \ne 0\}.$$
We may wlog assume $S = \{1, \ldots, s\}$, and write $N = \{1, \ldots, p\}\setminus S$ for the remaining indices (as usual $X$ is $n \times p$). We further assume that $\operatorname{rk}(X_S) = s$.

In general, it is difficult to find $S$ even if we know $|S|$. Under certain conditions, the Lasso can do the job correctly. This condition is dictated by the $\ell_\infty$ norm of the quantity
$$\Delta = X_N^TX_S(X_S^TX_S)^{-1}\operatorname{sgn}(\beta^0_S).$$
We can understand this a bit as follows: the $k$th entry of $\Delta$ is the dot product of $\operatorname{sgn}(\beta^0_S)$ with $(X_S^TX_S)^{-1}X_S^TX_k$, and the latter is the coefficient vector we would obtain if we tried to regress $X_k$ on $X_S$. If this is large, then this suggests $X_k$ looks correlated with $X_S$, and so it would be difficult to determine whether $k$ is part of $S$ or not.
Theorem.
(i) If $\|\Delta\|_\infty \le 1$, or equivalently
$$\max_{k \in N}\left|\operatorname{sgn}(\beta^0_S)^T(X_S^TX_S)^{-1}X_S^TX_k\right| \le 1,$$
and moreover
$$|\beta^0_k| > \lambda\left|\left[\left(\tfrac{1}{n}X_S^TX_S\right)^{-1}\operatorname{sgn}(\beta^0_S)\right]_k\right|\quad\text{for all } k \in S,$$
then there exists a Lasso solution $\hat\beta^L_\lambda$ with $\operatorname{sgn}(\hat\beta^L_\lambda) = \operatorname{sgn}(\beta^0)$.
(ii) If there exists a Lasso solution with $\operatorname{sgn}(\hat\beta^L_\lambda) = \operatorname{sgn}(\beta^0)$, then $\|\Delta\|_\infty \le 1$.

We are going to make heavy use of the KKT conditions.
Proof. Write $\hat\beta = \hat\beta^L_\lambda$, and write $\hat{S} = \{k : \hat\beta_k \ne 0\}$. Since $Y = X\beta^0$, the KKT conditions are that
$$\frac{1}{n}X^TX(\beta^0 - \hat\beta) = \lambda\hat\nu,$$
where $\|\hat\nu\|_\infty \le 1$ and $\hat\nu_{\hat{S}} = \operatorname{sgn}(\hat\beta_{\hat{S}})$.

We expand this to say
$$\frac{1}{n}\begin{pmatrix} X_S^TX_S & X_S^TX_N\\ X_N^TX_S & X_N^TX_N\end{pmatrix}\begin{pmatrix}\beta^0_S - \hat\beta_S\\ -\hat\beta_N\end{pmatrix} = \lambda\begin{pmatrix}\hat\nu_S\\ \hat\nu_N\end{pmatrix}.$$
Call the top and bottom rows (1) and (2) respectively.

It is easier to prove (ii) first. If there is such a solution, then $\hat\beta_N = 0$ and $\operatorname{sgn}(\hat\beta_S) = \operatorname{sgn}(\beta^0_S)$, so $\hat\nu_S = \operatorname{sgn}(\beta^0_S)$. So from (1), we must have
$$\frac{1}{n}X_S^TX_S(\beta^0_S - \hat\beta_S) = \lambda\hat\nu_S.$$
Inverting $\frac{1}{n}X_S^TX_S$, we learn that
$$\beta^0_S - \hat\beta_S = \lambda\left(\frac{1}{n}X_S^TX_S\right)^{-1}\operatorname{sgn}(\beta^0_S).$$
Substitute this into (2) to get
$$\lambda\cdot\frac{1}{n}X_N^TX_S\left(\frac{1}{n}X_S^TX_S\right)^{-1}\operatorname{sgn}(\beta^0_S) = \lambda\hat\nu_N.$$
By the KKT conditions, we know $\|\hat\nu_N\|_\infty \le 1$, and the LHS is exactly $\lambda\Delta$.

To prove (i), we need to exhibit a $\hat\beta$ that agrees in sign with $\beta^0$ and satisfies equations (1) and (2) together with the constraints on $\hat\nu$; in particular, $\hat\beta_N = 0$. So we try
$$(\hat\beta_S, \hat\nu_S) = \left(\beta^0_S - \lambda\left(\frac{1}{n}X_S^TX_S\right)^{-1}\operatorname{sgn}(\beta^0_S),\ \operatorname{sgn}(\beta^0_S)\right),\qquad (\hat\beta_N, \hat\nu_N) = (0, \Delta).$$
This is by construction a solution of (1) and (2), and $\|\hat\nu\|_\infty \le 1$ by the first condition. We then only need to check that
$$\operatorname{sgn}(\hat\beta_S) = \operatorname{sgn}(\beta^0_S),$$
which follows from the second condition.
Prediction and estimation

We now consider the question of how well the Lasso performs as a regression method. Consider the model
$$Y = X\beta^0 + \varepsilon - \bar\varepsilon\mathbf{1},$$
where the $\varepsilon_i$ are independent and have common sub-Gaussian parameter $\sigma$. Let $S$, $s$, $N$ be as before.

As before, the Lasso doesn't always behave well, and whether or not it does is controlled by the compatibility factor.

Definition (Compatibility factor). Define the compatibility factor to be
$$\phi^2 = \inf_{\substack{\beta \in \mathbb{R}^p\\ \|\beta_N\|_1 \le 3\|\beta_S\|_1\\ \beta_S \ne 0}} \frac{\frac{1}{n}\|X\beta\|_2^2}{\frac{1}{s}\|\beta_S\|_1^2} = \inf_{\substack{\beta \in \mathbb{R}^p\\ \|\beta_S\|_1 = 1\\ \|\beta_N\|_1 \le 3}} \frac{s}{n}\|X_S\beta_S - X_N\beta_N\|_2^2.$$
Note that we are free to use a minus sign inside the norm, since the problem is symmetric under $\beta_N \mapsto -\beta_N$.

In some sense, this $\phi$ measures how well we can approximate $X_S\beta_S$ using just the noise variables.

Definition (Compatibility condition). The compatibility condition is $\phi^2 > 0$.
Note that if $\hat\Sigma = \frac{1}{n}X^TX$ has minimum eigenvalue $c_{\min} > 0$, then we have $\phi^2 \ge c_{\min}$. Indeed,
$$\|\beta_S\|_1 = \operatorname{sgn}(\beta_S)^T\beta_S \le \sqrt{s}\,\|\beta_S\|_2 \le \sqrt{s}\,\|\beta\|_2,$$
and so
$$\phi^2 \ge \inf_{\beta\ne 0}\frac{\frac{1}{n}\|X\beta\|_2^2}{\|\beta\|_2^2} = c_{\min}.$$
Of course, in the high-dimensional setting we don't expect the minimum eigenvalue to be positive, but because of the restriction in the infimum in the definition of $\phi^2$, we can still hope to have a positive value of $\phi^2$.
Theorem. Assume $\phi^2 > 0$, and let $\hat\beta$ be the Lasso solution with
$$\lambda = A\sqrt{\log p/n}.$$
Then with probability at least $1 - 2p^{-(A^2/8 - 1)}$, we have
$$\frac{1}{n}\|X(\beta^0 - \hat\beta)\|_2^2 + \lambda\|\hat\beta - \beta^0\|_1 \le \frac{16\lambda^2 s}{\phi^2} = \frac{16A^2}{\phi^2}\cdot\frac{s\log p}{n}.$$

This is actually two bounds in one: it simultaneously bounds the error in the fitted values and the $\ell_1$ error in estimating $\beta^0$.

Recall that in our previous (slow rate) bound, the prediction error decayed like $1/\sqrt{n}$, and now it decays like $1/n$. Note also that $s/n$ (up to the $\log p$ factor and constants) is the prediction error we would get if we magically knew which variables were the non-zero ones and did ordinary least squares on those variables.

The $\ell_1$ bound also tells us that if $\beta^0$ has a component that is large, then $\hat\beta$ must be non-zero in that component as well. So while the Lasso cannot predict exactly which variables are non-zero, we can at least recover the important ones.
Proof. Start with the basic inequality $Q_\lambda(\hat\beta) \le Q_\lambda(\beta^0)$, which gives us
$$\frac{1}{2n}\|X(\beta^0 - \hat\beta)\|_2^2 + \lambda\|\hat\beta\|_1 \le \frac{1}{n}\varepsilon^TX(\hat\beta - \beta^0) + \lambda\|\beta^0\|_1.$$
We work on the event
$$\Omega = \left\{\frac{1}{n}\|X^T\varepsilon\|_\infty \le \frac{\lambda}{2}\right\},$$
on which, after applying Hölder's inequality to the first term on the right, we get
$$\frac{1}{n}\|X(\beta^0 - \hat\beta)\|_2^2 + 2\lambda\|\hat\beta\|_1 \le \lambda\|\hat\beta - \beta^0\|_1 + 2\lambda\|\beta^0\|_1.$$
We can move $2\lambda\|\hat\beta\|_1$ to the other side, and applying the triangle inequality, we have
$$\frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 \le 3\lambda\|\hat\beta - \beta^0\|_1.$$
If we manage to bound the right-hand side from above as well, so that
$$3\lambda\|\hat\beta - \beta^0\|_1 \le c\lambda\sqrt{\frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2}$$
for some $c$, then we obtain the bound
$$\frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 \le c^2\lambda^2,$$
and plugging this back into the second bound, we also have
$$3\lambda\|\hat\beta - \beta^0\|_1 \le c^2\lambda^2.$$
To obtain these bounds, we want to apply the definition of $\phi^2$ to $\hat\beta - \beta^0$. We thus need to show that $\hat\beta - \beta^0$ satisfies the conditions required in the infimum.

Write
$$a = \frac{1}{n\lambda}\|X(\hat\beta - \beta^0)\|_2^2.$$
Dividing the inequality obtained on $\Omega$ through by $\lambda$ and splitting the $\ell_1$ norms over $S$ and $N$, we have
$$a + 2(\|\hat\beta_N\|_1 + \|\hat\beta_S\|_1) \le \|\hat\beta_S - \beta^0_S\|_1 + \|\hat\beta_N\|_1 + 2\|\beta^0_S\|_1.$$
Simplifying, we obtain
$$a + \|\hat\beta_N\|_1 \le \|\hat\beta_S - \beta^0_S\|_1 + 2\|\beta^0_S\|_1 - 2\|\hat\beta_S\|_1.$$
Using the triangle inequality $\|\beta^0_S\|_1 - \|\hat\beta_S\|_1 \le \|\hat\beta_S - \beta^0_S\|_1$ and $\beta^0_N = 0$, we can write this as
$$a + \|\hat\beta_N - \beta^0_N\|_1 \le 3\|\hat\beta_S - \beta^0_S\|_1.$$
So we immediately know we can apply the compatibility condition, which gives us
$$\phi^2 \le \frac{\frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2}{\frac{1}{s}\|\hat\beta_S - \beta^0_S\|_1^2}.$$
Also, adding $\|\hat\beta_S - \beta^0_S\|_1$ to both sides of the previous display and multiplying through by $\lambda$, we have
$$\frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 + \lambda\|\hat\beta - \beta^0\|_1 \le 4\lambda\|\hat\beta_S - \beta^0_S\|_1.$$
Thus, using the compatibility condition, we have
$$\frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 + \lambda\|\hat\beta - \beta^0\|_1 \le \frac{4\lambda}{\phi}\sqrt{\frac{s}{n}}\,\|X(\hat\beta - \beta^0)\|_2. \tag{$*$}$$
Dropping the second term on the left and dividing through by $\sqrt{\frac{1}{n}}\,\|X(\hat\beta - \beta^0)\|_2$, we obtain
$$\sqrt{\frac{1}{n}}\,\|X(\hat\beta - \beta^0)\|_2 \le \frac{4\lambda\sqrt{s}}{\phi}.$$
Substituting this into the right-hand side of $(*)$, we obtain
$$\frac{1}{n}\|X(\hat\beta - \beta^0)\|_2^2 + \lambda\|\hat\beta - \beta^0\|_1 \le \frac{16\lambda^2 s}{\phi^2}.$$
The stated probability is the probability of the event $\Omega$, which follows from the concentration inequalities established earlier.
If we want to be impressed by this result, then we should make sure that
the compatibility condition is a reasonable assumption to make on the design
matrix.
The compatibility condition and random design
For any $\Sigma \in \mathbb{R}^{p\times p}$, define
$$\phi^2_\Sigma(S) = \inf_{\substack{\beta : \|\beta_N\|_1 \le 3\|\beta_S\|_1\\ \beta_S \ne 0}} \frac{\beta^T\Sigma\beta}{\|\beta_S\|_1^2/|S|}.$$
Our original $\phi^2$ is then the same as $\phi^2_{\hat\Sigma}(S)$.
If we want to analyse how $\phi^2_\Sigma(S)$ behaves for a "random" $\Sigma$, then it would be convenient to know that this depends continuously on $\Sigma$. For our purposes, the following lemma suffices:

Lemma. Let $\Theta, \Sigma \in \mathbb{R}^{p\times p}$. Suppose $\phi^2_\Theta(S) > 0$ and
$$\max_{j,k}|\Theta_{jk} - \Sigma_{jk}| \le \frac{\phi^2_\Theta(S)}{32|S|}.$$
Then
$$\phi^2_\Sigma(S) \ge \frac12\phi^2_\Theta(S).$$

Proof. We suppress the dependence on $S$ for notational convenience. Let $s = |S|$ and
$$t = \frac{\phi^2_\Theta(S)}{32s}.$$
We have
$$|\beta^T(\Sigma - \Theta)\beta| \le \|\beta\|_1\|(\Sigma - \Theta)\beta\|_\infty \le t\|\beta\|_1^2,$$
where we applied Hölder twice. If $\|\beta_N\|_1 \le 3\|\beta_S\|_1$, then we have
$$\|\beta\|_1 \le 4\|\beta_S\|_1 \le 4\sqrt{\beta^T\Theta\beta}\,\sqrt{s}/\phi_\Theta.$$
Thus, we have
$$|\beta^T(\Sigma - \Theta)\beta| \le \frac{\phi^2_\Theta}{32s}\cdot\frac{16s\,\beta^T\Theta\beta}{\phi^2_\Theta} = \frac12\beta^T\Theta\beta,$$
and so $\beta^T\Sigma\beta \ge \frac12\beta^T\Theta\beta$. Dividing by $\|\beta_S\|_1^2/s$ and taking the infimum gives the result.

Define
$$\phi^2_{\Sigma, s} = \min_{S : |S| = s}\phi^2_\Sigma(S).$$
Note that if
$$\max_{j,k}|\Theta_{jk} - \Sigma_{jk}| \le \frac{\phi^2_{\Theta, s}}{32s},$$
then
$$\phi^2_\Sigma(S) \ge \frac12\phi^2_\Theta(S)$$
for all $S$ with $|S| = s$. In particular,
$$\phi^2_{\Sigma, s} \ge \frac12\phi^2_{\Theta, s}.$$
Theorem. Suppose the rows of $X$ are iid and each entry is sub-Gaussian with parameter $v$. Suppose $s\sqrt{\log p/n} \to 0$ as $n \to \infty$, and $\phi^2_{\Sigma^0, s}$ is bounded away from $0$, where $\Sigma^0 = \mathbb{E}\hat\Sigma$. Then
$$\mathbb{P}\left(\phi^2_{\hat\Sigma, s} \ge \frac12\phi^2_{\Sigma^0, s}\right) \to 1\quad\text{as } n \to \infty.$$

Proof. It is enough to show that
$$\mathbb{P}\left(\max_{j,k}\left|\hat\Sigma_{jk} - \Sigma^0_{jk}\right| > \frac{\phi^2_{\Sigma^0, s}}{32s}\right) \to 0$$
as $n \to \infty$.
Let $t = \frac{\phi^2_{\Sigma^0, s}}{32s}$. Then
$$\mathbb{P}\left(\max_{j,k}|\hat\Sigma_{jk} - \Sigma^0_{jk}| > t\right) \le \sum_{j,k}\mathbb{P}\left(|\hat\Sigma_{jk} - \Sigma^0_{jk}| > t\right).$$
Recall that
$$\hat\Sigma_{jk} = \frac{1}{n}\sum_{i=1}^n X_{ij}X_{ik}.$$
Each summand is a product of sub-Gaussians, so we can apply Bernstein's inequality with $\sigma = 8v^2$ and $b = 4v^2$ to bound
$$\mathbb{P}\left(|\hat\Sigma_{jk} - \Sigma^0_{jk}| > t\right) \le 2\exp\left(-\frac{nt^2}{2(64v^4 + 4v^2t)}\right).$$
So we can bound
$$\mathbb{P}\left(\max_{j,k}|\hat\Sigma_{jk} - \Sigma^0_{jk}| > t\right) \le 2p^2\exp\left(-\frac{cn}{s^2}\right) = 2\exp\left(-\frac{cn}{s^2}\left(1 - \frac{2s^2\log p}{cn}\right)\right) \to 0$$
for some constant $c > 0$, since $s^2\log p/n \to 0$.
Corollary. Suppose the rows of $X$ are iid mean-zero multivariate Gaussian with covariance $\Sigma^0$. Suppose $\Sigma^0$ has minimum eigenvalue bounded from below by $c_{\min} > 0$, and the diagonal entries of $\Sigma^0$ are bounded from above. If $s\sqrt{\log p/n} \to 0$, then
$$\mathbb{P}\left(\phi^2_{\hat\Sigma, s} \ge \frac12 c_{\min}\right) \to 1\quad\text{as } n \to \infty.$$
3.6 Computation of Lasso solutions
We have had enough of bounding things. In this section, let’s think about how
we can actually run the Lasso. What we present here is actually a rather general
method to find the minimum of a function, known as coordinate descent.
Suppose we have a function $f: \mathbb{R}^d \to \mathbb{R}$. We start with an initial guess $x^{(0)}$ and repeat for $m = 1, 2, \ldots$:
$$x^{(m)}_1 = \operatorname*{argmin}_{x_1} f(x_1, x^{(m-1)}_2, \ldots, x^{(m-1)}_d),$$
$$x^{(m)}_2 = \operatorname*{argmin}_{x_2} f(x^{(m)}_1, x_2, x^{(m-1)}_3, \ldots, x^{(m-1)}_d),$$
$$\vdots$$
$$x^{(m)}_d = \operatorname*{argmin}_{x_d} f(x^{(m)}_1, x^{(m)}_2, \ldots, x^{(m)}_{d-1}, x_d),$$
until the result stabilizes.

This was proposed for solving the Lasso a long time ago, and a Stanford group tried it out. However, instead of using $x^{(m)}_1$ when computing $x^{(m)}_2$, they used $x^{(m-1)}_1$ instead. This turned out to be pretty useless, and so the group abandoned the method. After trying some other methods, which weren't very good, they decided to revisit this method and fixed their mistake.
For this to work well, of course, the coordinatewise minimizations have to be easy (which is the case for the Lasso, where we even have explicit solutions). Coordinate descent converges to a global minimizer if the minimizer is unique, $\{x : f(x) \le f(x^{(0)})\}$ is compact, and $f$ has the form
$$f(x) = g(x) + \sum_j h_j(x_j),$$
where $g$ is convex and differentiable, and each $h_j$ is convex but not necessarily differentiable. In the case of the Lasso, $g$ is the least squares term, and the $h_j$ are the coordinatewise $\ell_1$ penalty terms, as in the sketch below.
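Below is a minimal sketch (mine) of this coordinate descent scheme for the Lasso objective $Q_\lambda$, using the explicit soft-thresholding solution of each coordinatewise problem; the function and variable names are my own choices.

```python
# A minimal coordinate-descent sketch for Q_lambda(beta) = ||Y - X beta||^2/(2n) + lambda*||beta||_1.
# Not a production solver; the simple convergence check is my own choice.
import numpy as np

def soft_threshold(z, t):
    """Minimizer over b of (b - z)^2 / 2 + t * |b|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, Y, lam, beta_init=None, n_iter=200, tol=1e-8):
    n, p = X.shape
    beta = np.zeros(p) if beta_init is None else beta_init.copy()
    col_sq = (X ** 2).sum(axis=0) / n            # (1/n) X_j^T X_j for each column
    resid = Y - X @ beta                          # keep the residual up to date
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            old = beta[j]
            rho = X[:, j] @ resid / n + col_sq[j] * old   # (1/n) X_j^T (Y - X_{-j} beta_{-j})
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            if beta[j] != old:
                resid += X[:, j] * (old - beta[j])
                max_change = max(max_change, abs(beta[j] - old))
        if max_change < tol:
            break
    return beta
```

The `beta_init` argument makes the warm-start strategy described next easy to use.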
There are two things we can do to make this faster. We typically solve the Lasso on a grid of $\lambda$ values $\lambda_0 > \lambda_1 > \cdots > \lambda_L$, and then pick the appropriate $\lambda$ by $v$-fold cross-validation. In this case, we can start solving at $\lambda_0$, and then, for each $i > 0$, solve the $\lambda = \lambda_i$ problem by taking $x^{(0)}$ to be the optimal solution to the $\lambda_{i-1}$ problem. In fact, even if we already have a fixed $\lambda$ value we want to use, it is often advantageous to solve the Lasso with a larger $\lambda$-value first, and then use that as a warm start to get to our desired $\lambda$ value.
Another strategy is an active set strategy. If $p$ is large, then each sweep over the coordinates may take a very long time. Since we know the Lasso should set a lot of coefficients to zero, for $\ell = 1, \ldots, L$, we set
$$A = \{k : \hat\beta^L_{\lambda_{\ell-1}, k} \ne 0\}.$$
We then perform coordinate descent only on the coordinates in $A$ to obtain a candidate $\hat\beta$ with $\hat\beta_{A^c} = 0$. This may not be the actual Lasso solution at $\lambda_\ell$. To check this, we use the KKT conditions: we set
$$V = \left\{k \in A^c : \frac{1}{n}|X_k^T(Y - X\hat\beta)| > \lambda_\ell\right\}.$$
If $V = \emptyset$, then we are done. Otherwise, we add $V$ to our active set $A$, and then run coordinate descent again on this active set.
3.7 Extensions of the Lasso
There are many ways we can modify the Lasso. The first thing we might want to
change in the Lasso is to replace the least squares loss with other log-likelihoods.
Another way to modify the Lasso is to replace the $\ell_1$ penalty with something else in order to encourage a different form of sparsity.

Example (Group Lasso). Given a partition
$$G_1 \cup \cdots \cup G_q = \{1, \ldots, p\},$$
the group Lasso penalty is
$$\lambda\sum_{j=1}^q m_j\|\beta_{G_j}\|_2,$$
where the $\{m_j\}$ are weights that account for the fact that the groups have different sizes. Typically, we take $m_j = \sqrt{|G_j|}$.
If we take $G_i = \{i\}$, then we recover the original Lasso. If we take $q = 1$, then we recover Ridge regression. What this penalty does is encourage each group to be either entirely zero or entirely non-zero.
Example. Another variation is the fused Lasso. If $\beta^0_{j+1}$ is expected to be close to $\beta^0_j$, then a fused Lasso penalty may be appropriate, with
$$\lambda_1\sum_{j=1}^{p-1}|\beta_{j+1} - \beta_j| + \lambda_2\|\beta\|_1.$$
For example, if
$$Y_i = \mu_i + \varepsilon_i,$$
and we believe that $(\mu_i)_{i=1}^n$ form a piecewise constant sequence, we can estimate $\mu^0$ by
$$\operatorname*{argmin}_\mu\left\{\|Y - \mu\|_2^2 + \lambda\sum_{i=1}^{n-1}|\mu_{i+1} - \mu_i|\right\}.$$
Example (Elastic net). We can use
$$\hat\beta^{EN}_{\lambda, \alpha} = \operatorname*{argmin}_\beta\left\{\frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\big(\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|_2^2\big)\right\}$$
for $\alpha \in [0, 1]$. What the $\ell_2$ penalty does is encourage highly positively correlated variables to have similar estimated coefficients.

For example, if we have duplicate columns, then the $\ell_1$ penalty encourages us to take one of the coefficients to be $0$, while the $\ell_2$ penalty encourages the coefficients to be the same.
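As a quick illustration of the duplicate-columns remark, here is a small experiment (mine) using scikit-learn's `Lasso` and `ElasticNet`; their penalty parametrisations differ slightly from the display above, but the qualitative contrast is what matters.

```python
# Duplicate-column behaviour of the Lasso vs the elastic net (a sketch, not from the notes).
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(1)
n = 200
x = rng.standard_normal(n)
X = np.column_stack([x, x, rng.standard_normal(n)])   # columns 0 and 1 are identical
Y = 2 * x + 0.1 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, Y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X, Y)
print("Lasso coefficients:      ", lasso.coef_)   # tends to put the weight on one copy
print("Elastic net coefficients:", enet.coef_)    # tends to split the weight between the copies
```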
Another class of variations tries to reduce the bias of the Lasso. Although the bias of the Lasso is a necessary by-product of reducing the variance of the estimate, it is sometimes desirable to reduce this bias.

The LARS-OLS hybrid takes the set $\hat{S}_\lambda$ selected by the Lasso, and then re-estimates $\beta^0_{\hat{S}_\lambda}$ by OLS. We can also re-estimate using the Lasso on $X_{\hat{S}_\lambda}$, and this is known as the relaxed Lasso.

In the adaptive Lasso, we obtain an initial estimate $\hat\beta$ of $\beta^0$, e.g. with the Lasso, and then solve
$$\hat\beta^{\text{adapt}}_\lambda = \operatorname*{argmin}_{\beta : \beta_{\hat{S}^c} = 0}\left\{\frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda\sum_{k\in\hat{S}}\frac{|\beta_k|}{|\hat\beta_k|}\right\},$$
where $\hat{S}$ is the support of the initial estimate.
We can also try to use a non-convex penalty. We can attempt to solve
$$\operatorname*{argmin}_\beta\left\{\frac{1}{2n}\|Y - X\beta\|_2^2 + \sum_{k=1}^p p_\lambda(|\beta_k|)\right\},$$
where $p_\lambda: [0, \infty) \to [0, \infty)$ is a non-convex function. One common example is the MCP (minimax concave penalty), given by
$$p'_\lambda(u) = \left(\lambda - \frac{u}{\gamma}\right)_+,$$
where $\gamma$ is an extra tuning parameter. This tends to give estimates even sparser than the Lasso.
4 Graphical modelling
4.1 Conditional independence graphs
So far, we have been looking at prediction problems. But sometimes we may
want to know more than that. For example, there is a positive correlation
between the wine budget of a college, and the percentage of students getting
firsts. This information allows us to make predictions, in the sense that if we
happen to know the wine budget of a college, but forgot the percentage of
students getting firsts, then we can make a reasonable prediction of the latter
based on the former. However, this does not suggest any causal relation between the two: increasing the wine budget is probably not a good way to increase the percentage of students getting firsts!
Of course, it is unlikely that we can actually figure out causation just by
looking at the data. However, there are some things we can try to answer. If
we gather more data about the colleges, then we would probably find that the
colleges that have larger wine budget and more students getting firsts are also
the colleges with larger endowment and longer history. If we condition on all
these other variables, then there is not much correlation left between the wine
budget and the percentage of students getting firsts. This is what we are trying
to capture in conditional independence graphs.
We first introduce some basic graph terminology. For the purpose of condi-
tional independence graphs, we only need undirected graphs. But later, we need
the notion of directed graphs as well, so our definitions will be general enough to
include those.
Definition (Graph). A graph is a pair $G = (V, E)$, where $V$ is a set of vertices and $E \subseteq V \times V$ is such that $(v, v) \notin E$ for all $v \in V$.

Definition (Edge). We say there is an edge between $j$ and $k$, and that $j$ and $k$ are adjacent, if $(j, k) \in E$ or $(k, j) \in E$.

Definition (Undirected edge). An edge $(j, k)$ is undirected if also $(k, j) \in E$. Otherwise, it is directed, and we write $j \to k$ to represent it. We also say that $j$ is a parent of $k$, and write $\operatorname{pa}(k)$ for the set of all parents of $k$.

Definition ((Un)directed graph). A graph is (un)directed if all its edges are (un)directed.

Definition (Skeleton). The skeleton of $G$ is a copy of $G$ with every edge replaced by an undirected edge.

Definition (Subgraph). A graph $G_1 = (V_1, E_1)$ is a subgraph of $G = (V, E)$ if $V_1 \subseteq V$ and $E_1 \subseteq E$. A proper subgraph is one where either of the inclusions is a proper inclusion.
As discussed, we want a graph that encodes the conditional dependence of
different variables. We first define what this means. In this section, we only
work with undirected graphs.
Definition (Conditional independence). Let $X, Y, Z$ be random vectors with joint density $f_{XYZ}$. We say that $X$ is conditionally independent of $Y$ given $Z$, written $X \perp\!\!\!\perp Y \mid Z$, if
$$f_{XY\mid Z}(x, y \mid z) = f_{X\mid Z}(x \mid z)\,f_{Y\mid Z}(y \mid z).$$
Equivalently,
$$f_{X\mid YZ}(x \mid y, z) = f_{X\mid Z}(x \mid z)$$
for all $y$.
We shall ignore all the technical difficulties, and take as an assumption that
all these conditional densities exist.
Definition (Conditional independence graph (CIG)). Let $P$ be the law of $Z = (Z_1, \ldots, Z_p)^T$. The conditional independence graph (CIG) is the graph whose vertices are $\{1, \ldots, p\}$, and which contains an edge between $j$ and $k$ iff $Z_j$ and $Z_k$ are conditionally dependent given $Z_{-jk}$.
More generally, we make the following definition:
Definition (Pairwise Markov property). Let $P$ be the law of $Z = (Z_1, \ldots, Z_p)^T$. We say $P$ satisfies the pairwise Markov property with respect to a graph $G$ if for any distinct, non-adjacent vertices $j, k$, we have $Z_j \perp\!\!\!\perp Z_k \mid Z_{-jk}$.
Example. If $G$ is a complete graph, then $P$ satisfies the pairwise Markov property with respect to $G$.
The conditional independence graph is thus the minimal graph satisfying the
pairwise Markov property. It turns out that under mild conditions, the pairwise
Markov property implies something much stronger.
Definition (Separates). Given a triple of (disjoint) subsets of nodes $A, B, S$, we say $S$ separates $A$ from $B$ if every path from a node in $A$ to a node in $B$ contains a node in $S$.
Definition (Global Markov property). We say $P$ satisfies the global Markov property with respect to $G$ if for any triple of disjoint subsets $(A, B, S)$ of $V$ such that $S$ separates $A$ and $B$, we have $Z_A \perp\!\!\!\perp Z_B \mid Z_S$.
Proposition. If $P$ has a positive density and satisfies the pairwise Markov property with respect to $G$, then it also satisfies the global Markov property.
This is a really nice result, but we will not prove this. However, we will prove
a special case in the example sheet.
So how do we actually construct the conditional independence graph? To do
so, we need to test our variables for conditional dependence. In general, this is
quite hard. However, in the case where we have Gaussian data, it is much easier,
since independence is the same as vanishing covariance.
Notation ($M_{A,B}$). Let $M$ be a matrix. Then $M_{A,B}$ refers to the submatrix given by the rows in $A$ and columns in $B$.

Since we are going to talk about conditional distributions a lot, the following calculation will be extremely useful.

Proposition. Suppose $Z \sim N_p(\mu, \Sigma)$ and $\Sigma$ is positive definite. Then
$$Z_A \mid Z_B = z_B \sim N_{|A|}\left(\mu_A + \Sigma_{A,B}\Sigma_{B,B}^{-1}(z_B - \mu_B),\ \Sigma_{A,A} - \Sigma_{A,B}\Sigma_{B,B}^{-1}\Sigma_{B,A}\right).$$

Proof. Of course, we can just compute this directly, maybe using moment generating functions. But for pleasantness, we adopt a different approach. Note that for any matrix $M$, we have
$$Z_A = MZ_B + (Z_A - MZ_B).$$
We shall pick $M$ such that $Z_A - MZ_B$ is independent of $Z_B$, i.e. such that
$$0 = \operatorname{cov}(Z_B, Z_A - MZ_B) = \Sigma_{B,A} - \Sigma_{B,B}M^T.$$
So we should take
$$M = (\Sigma_{B,B}^{-1}\Sigma_{B,A})^T = \Sigma_{A,B}\Sigma_{B,B}^{-1}.$$
We already know that $Z_A - MZ_B$ is Gaussian, so to understand it, we only need to know its mean and variance. We have
$$\mathbb{E}[Z_A - MZ_B] = \mu_A - M\mu_B = \mu_A - \Sigma_{A,B}\Sigma_{B,B}^{-1}\mu_B,$$
$$\operatorname{var}(Z_A - MZ_B) = \Sigma_{A,A} - 2\Sigma_{A,B}\Sigma_{B,B}^{-1}\Sigma_{B,A} + \Sigma_{A,B}\Sigma_{B,B}^{-1}\Sigma_{B,B}\Sigma_{B,B}^{-1}\Sigma_{B,A} = \Sigma_{A,A} - \Sigma_{A,B}\Sigma_{B,B}^{-1}\Sigma_{B,A}.$$
Since $Z_A - MZ_B$ is independent of $Z_B$, conditioning on $Z_B = z_B$ simply shifts the mean by $Mz_B$, and we are done.
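The identity $Z_A = MZ_B + (Z_A - MZ_B)$ with $M = \Sigma_{A,B}\Sigma_{B,B}^{-1}$ is easy to check numerically; the following snippet (mine) verifies the zero cross-covariance and the Schur-complement variance by simulation.

```python
# A quick numerical sanity check (mine) of the conditional-Gaussian calculation:
# with M = Sigma_{A,B} Sigma_{B,B}^{-1}, the residual Z_A - M Z_B is uncorrelated with Z_B
# and its covariance is the Schur complement Sigma_{A,A} - Sigma_{A,B} Sigma_{B,B}^{-1} Sigma_{B,A}.
import numpy as np

rng = np.random.default_rng(0)
p, A, B = 5, [0, 1], [2, 3, 4]
G = rng.standard_normal((p, p))
Sigma = G @ G.T + np.eye(p)                              # a positive-definite covariance

M = Sigma[np.ix_(A, B)] @ np.linalg.inv(Sigma[np.ix_(B, B)])
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=200_000)
resid = Z[:, A] - Z[:, B] @ M.T

print(np.round(np.cov(resid.T, Z[:, B].T)[:2, 2:], 3))   # ~ 0: residual uncorrelated with Z_B
schur = Sigma[np.ix_(A, A)] - M @ Sigma[np.ix_(B, A)]
print(np.round(np.cov(resid.T) - schur, 3))              # ~ 0: matches the Schur complement
```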
Neighbourhood selection

We are going to specialize to $A = \{k\}$ and $B = \{1, \ldots, p\}\setminus\{k\}$. Then we can separate out the "mean" part and write
$$Z_k = M_k + Z_{-k}^T\Sigma_{-k,-k}^{-1}\Sigma_{-k,k} + \varepsilon_k,$$
where
$$M_k = \mu_k - \mu_{-k}^T\Sigma_{-k,-k}^{-1}\Sigma_{-k,k},\qquad \varepsilon_k \mid Z_{-k} \sim N\left(0,\ \Sigma_{k,k} - \Sigma_{k,-k}\Sigma_{-k,-k}^{-1}\Sigma_{-k,k}\right).$$
This now looks like we are doing regression.
We observe that

Lemma. Given $k$, let $j_0$ be such that $(Z_{-k})_j = Z_{j_0}$. This $j_0$ is either $j$ or $j + 1$, depending on whether it comes after or before $k$.

If the $j$th component of $\Sigma_{-k,-k}^{-1}\Sigma_{-k,k}$ is $0$, then $Z_k \perp\!\!\!\perp Z_{j_0} \mid Z_{-kj_0}$.

Proof. If the $j$th component of $\Sigma_{-k,-k}^{-1}\Sigma_{-k,k}$ is $0$, then the distribution of $Z_k \mid Z_{-k}$ does not depend on $(Z_{-k})_j = Z_{j_0}$. So we know
$$Z_k \mid Z_{-k} \overset{d}{=} Z_k \mid Z_{-kj_0}.$$
This is the same as saying $Z_k \perp\!\!\!\perp Z_{j_0} \mid Z_{-kj_0}$.

Neighbourhood selection exploits this fact. Given $x_1, \ldots, x_n$ which are iid copies of $Z$, and
$$X = (x_1, \cdots, x_n)^T \in \mathbb{R}^{n\times p},$$
we can estimate $\Sigma_{-k,-k}^{-1}\Sigma_{-k,k}$ by regressing $X_k$ on $X_{-k}$ using the Lasso (with an intercept term). We then obtain selected sets $\hat{S}_k$. There are two ways of estimating the CIG based on these:

OR rule: we add the edge $(j, k)$ if $j \in \hat{S}_k$ or $k \in \hat{S}_j$.

AND rule: we add the edge $(j, k)$ if $j \in \hat{S}_k$ and $k \in \hat{S}_j$.
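A sketch of neighbourhood selection is below (mine, assuming scikit-learn; using cross-validation to pick each Lasso penalty is my own simplification, not something prescribed by the notes).

```python
# Neighbourhood selection for a Gaussian CIG: Lasso-regress each column on the others,
# then combine the selected sets with the OR or AND rule.
import numpy as np
from sklearn.linear_model import LassoCV

def neighbourhood_selection(X, rule="OR"):
    n, p = X.shape
    S = [set() for _ in range(p)]
    for k in range(p):
        others = [j for j in range(p) if j != k]
        fit = LassoCV(cv=5, fit_intercept=True).fit(X[:, others], X[:, k])
        S[k] = {others[i] for i, c in enumerate(fit.coef_) if c != 0}
    edges = set()
    for j in range(p):
        for k in range(j + 1, p):
            if rule == "OR" and (k in S[j] or j in S[k]):
                edges.add((j, k))
            if rule == "AND" and (k in S[j] and j in S[k]):
                edges.add((j, k))
    return edges
```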
The graphical Lasso

Another way of finding the conditional independence graph is to compute $\operatorname{var}(Z_{jk} \mid Z_{-jk})$ directly. The following lemma will be useful:

Lemma. Let $M \in \mathbb{R}^{p\times p}$ be positive definite, and write
$$M = \begin{pmatrix} P & Q\\ Q^T & R\end{pmatrix},$$
where $P$ and $R$ are square. The Schur complement of $R$ is
$$S = P - QR^{-1}Q^T.$$
Note that this has the same size as $P$. Then
(i) $S$ is positive definite;
(ii)
$$M^{-1} = \begin{pmatrix} S^{-1} & -S^{-1}QR^{-1}\\ -R^{-1}Q^TS^{-1} & R^{-1} + R^{-1}Q^TS^{-1}QR^{-1}\end{pmatrix};$$
(iii) $\det(M) = \det(S)\det(R)$.

We have already seen this Schur complement when we computed $\operatorname{var}(Z_A \mid Z_{A^c})$, where we got
$$\operatorname{var}(Z_A \mid Z_{A^c}) = \Sigma_{A,A} - \Sigma_{A,A^c}\Sigma_{A^c,A^c}^{-1}\Sigma_{A^c,A} = (\Omega_{A,A})^{-1},$$
where $\Omega = \Sigma^{-1}$ is the precision matrix.

Specializing to the case where $A = \{j, k\}$, we have
$$\operatorname{var}(Z_{\{j,k\}} \mid Z_{-jk}) = \frac{1}{\det(\Omega_{A,A})}\begin{pmatrix}\Omega_{kk} & -\Omega_{jk}\\ -\Omega_{jk} & \Omega_{jj}\end{pmatrix}.$$
This tells us $Z_k \perp\!\!\!\perp Z_j \mid Z_{-kj}$ iff $\Omega_{jk} = 0$. Thus, we can estimate the conditional independence graph by estimating the precision matrix $\Omega$.
independence graph by computing the precision matrix Ω.
Our method to estimate the precision matrix is similar to the Lasso. Recall
that the density of N
p
(µ, Σ) is
P (z) =
1
(2π)
p/2
(det Σ)
1/2
exp
1
2
(z µ)
T
Σ
1
(z µ)
.
The log-likelihood of (
µ,
Ω) based on an iid sample (
X
1
, . . . , X
n
) is (after dropping
a constant)
`(µ, Ω) =
n
2
log det
1
2
n
X
i=1
(x
i
µ)
T
Ω(x
i
µ).
To simplify this, we let
¯x =
1
n
n
X
i=1
x
i
, S =
1
n
n
X
i=1
(x
i
¯x)(x
i
¯x)
T
.
Then
X
(x
i
µ)
T
Ω(x
i
µ) =
X
(x
i
¯x + ¯x µ)
T
Ω(x
i
¯x + ¯x µ)
=
X
(x
i
¯x)
T
Ω(x
i
¯x) + n(
¯
X µ)
T
Ω(
¯
X µ).
We have
X
(x
i
¯x)
T
Ω(x
i
¯x) =
X
tr
(x
i
¯x)
T
Ω(x
i
¯x)
= n tr(SΩ).
So we now have
$$\ell(\mu, \Omega) = -\frac{n}{2}\left(\operatorname{tr}(S\Omega) - \log\det\Omega + (\bar{x} - \mu)^T\Omega(\bar{x} - \mu)\right).$$
We are really interested in estimating $\Omega$. So we should maximize this over $\mu$, but that is easy, since this is the same as minimizing $(\bar{x} - \mu)^T\Omega(\bar{x} - \mu)$, and we know $\Omega$ is positive definite. So we should set $\mu = \bar{x}$. Thus, we have
$$\ell(\Omega) = \max_{\mu\in\mathbb{R}^p}\ell(\mu, \Omega) = -\frac{n}{2}\left(\operatorname{tr}(S\Omega) - \log\det\Omega\right).$$
So we can find the MLE of $\Omega$ by solving
$$\min_{\Omega : \Omega\succ 0}\left\{-\log\det\Omega + \operatorname{tr}(S\Omega)\right\}.$$
One can show that this is convex, and to find the MLE, we can just differentiate:
$$\frac{\partial}{\partial\Omega_{jk}}\log\det\Omega = (\Omega^{-1})_{jk},\qquad \frac{\partial}{\partial\Omega_{jk}}\operatorname{tr}(S\Omega) = S_{jk},$$
using that $S$ and $\Omega$ are symmetric. So provided that $S$ is positive definite, the maximum likelihood estimate is just
$$\hat\Omega = S^{-1}.$$
But we are interested in the high-dimensional situation, where we have loads of variables and $S$ cannot be positive definite. To solve this, we use the graphical Lasso.

The graphical Lasso solves
$$\operatorname*{argmin}_{\Omega : \Omega\succ 0}\left\{-\log\det\Omega + \operatorname{tr}(S\Omega) + \lambda\|\Omega\|_1\right\},$$
where
$$\|\Omega\|_1 = \sum_{j,k}|\Omega_{jk}|.$$
Often, people don't sum over the diagonal elements, as we only want to know whether off-diagonal elements ought to be zero. Similar to the case of the Lasso, this gives a sparse estimate of $\Omega$, from which we may estimate the conditional independence graph.
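In practice one rarely implements the graphical Lasso by hand; the sketch below (mine) uses scikit-learn's `GraphicalLasso`, whose penalty convention may differ in details from the display above, so treat the class and parameter names as assumptions to check.

```python
# Estimating a sparse precision matrix with the graphical Lasso (a sketch).
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
# A sparse precision matrix: a chain Z_1 - Z_2 - ... - Z_5.
Omega = np.eye(5) + np.diag(0.4 * np.ones(4), 1) + np.diag(0.4 * np.ones(4), -1)
Sigma = np.linalg.inv(Omega)
Z = rng.multivariate_normal(np.zeros(5), Sigma, size=500)

fit = GraphicalLasso(alpha=0.05).fit(Z)
Omega_hat = fit.precision_
print(np.round(Omega_hat, 2))   # off-diagonal entries away from the chain should be ~ 0
```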
4.2 Structural equation modelling
The conditional independence graph only tells us which variables are related to
one another. However, it doesn’t tell us any causal relation between the different
variables. We first need to explain what we mean by a causal model. For this,
we need the notion of a directed acyclic graph.
Definition (Path). A path from $j$ to $k$ is a sequence $j = j_1, j_2, \ldots, j_m = k$ of (at least two) distinct vertices such that $j_\ell$ and $j_{\ell+1}$ are adjacent for each $\ell$.

A path is directed if $j_\ell \to j_{\ell+1}$ for all $\ell$.

Definition (Directed acyclic graph (DAG)). A directed cycle is (almost) a directed path, but with the start and end points the same.

A directed acyclic graph (DAG) is a directed graph containing no directed cycles.

[Figure: two directed graphs on vertices $a, b, c$; the first is a DAG, the second contains a directed cycle and so is not a DAG.]

We will use directed acyclic graphs to encode causal structures, where we have a path from $a$ to $b$ if $a$ "affects" $b$.

Definition (Structural equation model (SEM)). A structural equation model $\mathcal{S}$ for a random vector $Z \in \mathbb{R}^p$ is a collection of equations
$$Z_k = h_k(Z_{P_k}, \varepsilon_k),\qquad k = 1, \ldots, p,$$
where $\varepsilon_1, \ldots, \varepsilon_p$ are independent, $P_k \subseteq \{1, \ldots, p\}\setminus\{k\}$, and the graph with $\operatorname{pa}(k) = P_k$ is a directed acyclic graph.
Example. Consider three random variables:

$Z_1 = 1$ if a student is taking a course, $0$ otherwise;
$Z_2 = 1$ if a student is attending catch-up lectures, $0$ otherwise;
$Z_3 = 1$ if a student heard about machine learning before attending the course, $0$ otherwise.

Suppose
$$Z_3 = \varepsilon_3 \sim \text{Bernoulli}(0.25),$$
$$Z_2 = \mathbf{1}_{\{\varepsilon_2(1 + Z_3) > \frac12\}},\quad \varepsilon_2 \sim U[0, 1],$$
$$Z_1 = \mathbf{1}_{\{\varepsilon_1(Z_2 + Z_3) > \frac12\}},\quad \varepsilon_1 \sim U[0, 1].$$
This is then an example of a structural equation model, and the corresponding DAG has edges $Z_3 \to Z_2$, $Z_3 \to Z_1$ and $Z_2 \to Z_1$.

Note that the structural equation model for $Z$ determines its distribution, but the converse is not true. For example, the following two distinct structural equation models give rise to the same distribution:
$$Z_1 = \varepsilon,\ Z_2 = Z_1\qquad\text{and}\qquad Z_1 = Z_2,\ Z_2 = \varepsilon.$$
Indeed, if we have two variables that are just always the same, it is hard to tell which is the cause and which is the effect.

It would be convenient if we could order our variables in a way such that $Z_k$ depends only on $Z_j$ for $j < k$. This is known as a topological ordering:

Definition (Descendant). We say $k$ is a descendant of $j$ if there is a directed path from $j$ to $k$. The set of descendants of $j$ will be denoted $\operatorname{de}(j)$.

Definition (Topological ordering). Given a DAG $G$ with $V = \{1, \ldots, p\}$, we say that a permutation $\pi: V \to V$ is a topological ordering if $\pi(j) < \pi(k)$ whenever $k \in \operatorname{de}(j)$.

Thus, given a topological ordering $\pi$, we can write $Z_k$ as a function of $\varepsilon_{\pi^{-1}(1)}, \ldots, \varepsilon_{\pi^{-1}(\pi(k))}$.
How do we understand structural equation models? They give us information that is not encoded in the distribution itself. One way to think about them is via interventions. We can modify a structural equation model by replacing the equation for $Z_k$ by setting, e.g., $Z_k = a$. In real life, this may correspond to forcing all students to go to catch-up workshops. This is called a perfect intervention. The modified SEM gives a new joint distribution for $Z$. Expectations or probabilities with respect to the new distribution are written by adding "$\operatorname{do}(Z_k = a)$". For example, we write
$$\mathbb{E}(Z_j \mid \operatorname{do}(Z_k = a)).$$
In general, this is different from $\mathbb{E}(Z_j \mid Z_k = a)$, since, for example, if we conditioned on $Z_2 = a$ in our example, then that would tell us something about $Z_3$.
Example. After the intervention $\operatorname{do}(Z_2 = 1)$, i.e. we force everyone to go to the catch-up lectures, we have a new SEM with
$$Z_3 = \varepsilon_3 \sim \text{Bernoulli}(0.25),\qquad Z_2 = 1,\qquad Z_1 = \mathbf{1}_{\{\varepsilon_1(1 + Z_3) > \frac12\}},\quad \varepsilon_1 \sim U[0, 1].$$
Then, for example, we can compute
$$\mathbb{P}(Z_1 = 1 \mid \operatorname{do}(Z_2 = 1)) = \frac12\cdot\frac34 + \frac34\cdot\frac14 = \frac{9}{16},$$
while by high school probability, we also have
$$\mathbb{P}(Z_1 = 1 \mid Z_2 = 1) = \frac{7}{12}.$$
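Both numbers are easy to confirm by simulating the SEM; the following Monte Carlo check is mine.

```python
# A Monte Carlo check (mine) of the two probabilities computed above:
# P(Z1 = 1 | do(Z2 = 1)) = 9/16 versus P(Z1 = 1 | Z2 = 1) = 7/12.
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000

def simulate(do_z2=None):
    z3 = (rng.random(N) < 0.25).astype(int)
    e2, e1 = rng.random(N), rng.random(N)
    z2 = (e2 * (1 + z3) > 0.5).astype(int) if do_z2 is None else np.full(N, do_z2)
    z1 = (e1 * (z2 + z3) > 0.5).astype(int)
    return z1, z2, z3

z1, z2, _ = simulate(do_z2=1)
print("P(Z1=1 | do(Z2=1)) ~", z1.mean(), " (9/16 =", 9 / 16, ")")

z1, z2, _ = simulate()
print("P(Z1=1 | Z2=1)     ~", z1[z2 == 1].mean(), " (7/12 =", round(7 / 12, 4), ")")
```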
To understand the DAGs associated to structural equation models, we would like to come up with Markov properties analogous to what we had for undirected graphs. This, in particular, requires the correct notion of "separation", which we call d-separation. Our notion should be such that if $S$ d-separates $A$ and $B$ in the DAG, then $Z_A$ and $Z_B$ are conditionally independent given $Z_S$. Let's think about some examples. For example, we might have a DAG that looks like this:

[Figure: a DAG on vertices $a, p, s, q, b$ in which every path from $a$ to $s$ passes through $q$, and the paths coming from $a$ and from $b$ meet head-to-head at $s$.]

Then we expect that
(i) $Z_a$ and $Z_s$ are not independent;
(ii) $Z_a$ and $Z_s$ are independent given $Z_q$;
(iii) $Z_a$ and $Z_b$ are independent;
(iv) $Z_a$ and $Z_b$ are not independent given $Z_s$.
We can explain a bit more about the last one. For example, the structural equation model might tell us $Z_s = Z_a + Z_b + \varepsilon$. In this case, if we know that $Z_a$ is large but $Z_s$ is small, then chances are $Z_b$ is also large (in the opposite sign). The point is that both $Z_a$ and $Z_b$ contribute to $Z_s$, and if we know one of the contributions and the result, we can say something about the other contribution as well.

Similarly, if we have a DAG that looks like
$$a \to s \leftarrow b,$$
then as above, we know that $Z_a$ and $Z_b$ are not independent given $Z_s$.
Another example is

[Figure: a DAG on vertices $a, p, s, q, b$ in which every path between $a$ and $b$ passes through $s$, with no head-to-head meeting at $s$ (for example a chain $a \to p \to s \to q \to b$).]

Here we expect that
$Z_a$ and $Z_b$ are not independent;
$Z_a$ and $Z_b$ are independent given $Z_s$.
To see the first, we observe that if we know about $Z_a$, then this allows us to predict some information about $Z_s$, which would in turn let us say something about $Z_b$.
Definition (Blocked). In a DAG, we say a path $(j_1, \ldots, j_m)$ between $j_1$ and $j_m$ is blocked by a set of nodes $S$ (with neither $j_1$ nor $j_m$ in $S$) if there is some $j_\ell \in S$ such that the path is not of the form $j_{\ell-1} \to j_\ell \leftarrow j_{\ell+1}$ at $j_\ell$, or there is some $j_\ell$ such that the path is of this form at $j_\ell$, but neither $j_\ell$ nor any of its descendants is in $S$.

Definition (d-separate). If $G$ is a DAG, given a triple of (disjoint) subsets of nodes $A, B, S$, we say $S$ d-separates $A$ from $B$ if $S$ blocks every path from $A$ to $B$.

For convenience, we define

Definition ($v$-structure). A set of three nodes is called a $v$-structure if one node is a child of the two other nodes, and these two nodes are not adjacent.

It is now natural to define

Definition (Markov properties). Let $P$ be the distribution of $Z$ and let $f$ be its density. Given a DAG $G$, we say $P$ satisfies
(i) the Markov factorization criterion if
$$f(z_1, \ldots, z_p) = \prod_{k=1}^p f(z_k \mid z_{\operatorname{pa}(k)});$$
(ii) the global Markov property if for all disjoint $A, B, S$ such that $A$ and $B$ are d-separated by $S$, we have $Z_A \perp\!\!\!\perp Z_B \mid Z_S$.

Proposition. If $P$ has a density with respect to a product measure, then (i) and (ii) are equivalent.
How does this fit into the structural equation model?

Proposition. Let $P$ be generated by a structural equation model with DAG $G$. Then $P$ obeys the Markov factorization property with respect to $G$.

Proof. We assume wlog that $G$ is topologically ordered (i.e. the identity map is a topological ordering). Then we can always write
$$f(z_1, \ldots, z_p) = f(z_1)f(z_2 \mid z_1)\cdots f(z_p \mid z_1, z_2, \ldots, z_{p-1}).$$
By the definition of a topological order, we know $\operatorname{pa}(k) \subseteq \{1, \ldots, k-1\}$. Moreover, $Z_k$ is a function of $Z_{\operatorname{pa}(k)}$ and the independent noise $\varepsilon_k$, so
$$Z_k \perp\!\!\!\perp Z_{\{1, \ldots, k-1\}\setminus\operatorname{pa}(k)} \mid Z_{\operatorname{pa}(k)}.$$
Thus,
$$f(z_k \mid z_1, \ldots, z_{k-1}) = f(z_k \mid z_{\operatorname{pa}(k)}).$$
4.3 The PC algorithm
We now want to try to find out the structural equation model given some data,
and in particular, determine the causal structure. As we previously saw, there is
no hope of determining this completely, even if we know the distribution of the
Z completely. Let’s consider the different obstacles to this problem.
Causal minimality
If $P$ is generated by an SEM with DAG $G$, then from the above, we know that $P$ is Markov with respect to $G$. The converse is also true: if $P$ is Markov with respect to a DAG $G$, then there exists an SEM with DAG $G$ that generates $P$. This immediately implies that $P$ will be Markov with respect to many DAGs. For example, a DAG whose skeleton is complete will always work. This suggests the following definition:
Definition (Causal minimality). A distribution $P$ satisfies causal minimality with respect to $G$ if it is Markov with respect to $G$, but not with respect to any proper subgraph of $G$.
Markov equivalent DAGs
It is natural to aim for finding a causally minimal DAG. However, this does
not give a unique solution, as we saw previously with the two variables that are
always the same.
In general, two different DAGs may satisfy the same set of d-separations,
and then a distribution is Markov with respect to one iff it is Markov with respect
to the other, and we cannot distinguish between the two.
Definition (Markov equivalence). For a DAG $G$, we let
$$\mathcal{M}(G) = \{\text{distributions } P \text{ such that } P \text{ is Markov with respect to } G\}.$$
We say two DAGs $G_1, G_2$ are Markov equivalent if $\mathcal{M}(G_1) = \mathcal{M}(G_2)$.

What is nice is that there is a rather easy way of determining when two DAGs are Markov equivalent.

Proposition. Two DAGs are Markov equivalent iff they have the same skeleton and the same set of $v$-structures.
The set of all DAGs that are Markov equivalent to a given DAG can be
represented by a CPDAG (completed partial DAG), which contains an edge
(j, k) iff some member of the equivalence class does.
Faithfulness

To describe the final issue, consider the SEM
$$Z_1 = \varepsilon_1,\qquad Z_2 = \alpha Z_1 + \varepsilon_2,\qquad Z_3 = \beta Z_1 + \gamma Z_2 + \varepsilon_3.$$
We take $\varepsilon \sim N_3(0, I)$. Then $Z = (Z_1, Z_2, Z_3) \sim N(0, \Sigma)$, where
$$\Sigma = \begin{pmatrix} 1 & \alpha & \beta + \alpha\gamma\\ \alpha & \alpha^2 + 1 & \alpha(\beta + \alpha\gamma) + \gamma\\ \beta + \alpha\gamma & \alpha(\beta + \alpha\gamma) + \gamma & \beta^2 + \gamma^2(\alpha^2 + 1) + 1 + 2\alpha\beta\gamma\end{pmatrix}.$$
The corresponding DAG has edges $Z_1 \to Z_2$, $Z_1 \to Z_3$ and $Z_2 \to Z_3$.

Now if we picked values of $\alpha, \beta, \gamma$ such that
$$\beta + \alpha\gamma = 0,$$
then we obtain an extra independence relation $Z_1 \perp\!\!\!\perp Z_3$ in our system. For example, if we pick $\beta = -1$ and $\alpha = \gamma = 1$, then
$$\Sigma = \begin{pmatrix} 1 & 1 & 0\\ 1 & 2 & 1\\ 0 & 1 & 2\end{pmatrix}.$$
While there is an extra independence relation, we cannot remove any edge while still satisfying the Markov property. Indeed:

If we remove $1 \to 2$, then this would require $Z_1 \perp\!\!\!\perp Z_2$, but this is not true.

If we remove $2 \to 3$, then this would require $Z_2 \perp\!\!\!\perp Z_3 \mid Z_1$, but we have
$$\operatorname{var}((Z_2, Z_3) \mid Z_1) = \begin{pmatrix} 2 & 1\\ 1 & 2\end{pmatrix} - \begin{pmatrix}1\\0\end{pmatrix}\begin{pmatrix}1 & 0\end{pmatrix} = \begin{pmatrix}1 & 1\\ 1 & 2\end{pmatrix},$$
and this is not diagonal.

If we remove $1 \to 3$, then this would require $Z_1 \perp\!\!\!\perp Z_3 \mid Z_2$, but
$$\operatorname{var}((Z_1, Z_3) \mid Z_2) = \begin{pmatrix} 1 & 0\\ 0 & 2\end{pmatrix} - \frac12\begin{pmatrix}1\\1\end{pmatrix}\begin{pmatrix}1 & 1\end{pmatrix} = \begin{pmatrix}\tfrac12 & -\tfrac12\\ -\tfrac12 & \tfrac32\end{pmatrix},$$
which is again not diagonal.

So this DAG satisfies causal minimality. However, $P$ can also be generated by the structural equation model
$$\tilde{Z}_1 = \tilde\varepsilon_1,\qquad \tilde{Z}_2 = \tilde{Z}_1 + \frac12\tilde{Z}_3 + \tilde\varepsilon_2,\qquad \tilde{Z}_3 = \tilde\varepsilon_3,$$
where the $\tilde\varepsilon_i$ are independent with
$$\tilde\varepsilon_1 \sim N(0, 1),\qquad \tilde\varepsilon_2 \sim N\left(0, \tfrac12\right),\qquad \tilde\varepsilon_3 \sim N(0, 2).$$
The corresponding DAG has edges $\tilde{Z}_1 \to \tilde{Z}_2$ and $\tilde{Z}_3 \to \tilde{Z}_2$. This is a strictly smaller DAG in terms of the number of edges involved, and it is easy to see that it satisfies causal minimality.
Definition (Faithfulness). We say $P$ is faithful to a DAG $G$ if it is Markov with respect to $G$ and, for all disjoint $A, B, S$, $Z_A \perp\!\!\!\perp Z_B \mid Z_S$ implies that $A$ and $B$ are d-separated by $S$.

Determining the DAG

We shall assume our distribution is faithful to some DAG $G_0$, and see if we can figure out $G_0$ from $P$, or even better, from data.

To find $G_0$, the following proposition helps us compute the skeleton:
Proposition. If nodes $j$ and $k$ are adjacent in a DAG $G$, then no set can d-separate them.

If they are not adjacent, and $\pi$ is a topological order for $G$ with $\pi(j) < \pi(k)$, then they are d-separated by $\operatorname{pa}(k)$.

Proof. Only the last part requires proof. Consider a path $j = j_1, \ldots, j_m = k$. Start reading the path from $k$ and go backwards. If it starts as $j_{m-1} \to k$, then $j_{m-1}$ is a parent of $k$ and it blocks the path (it lies in $\operatorname{pa}(k)$ and the path does not meet head-to-head there). Otherwise, it looks like $k \to j_{m-1}$. We keep going along the path until we first see a vertex $a$ at which the edges meet head-to-head, i.e. the path looks like
$$k \to \cdots \to a \leftarrow \cdots.$$
Such an $a$ must exist, since $j$ is not a descendant of $k$ by the topological ordering. So it suffices to show that neither $a$ nor any of its descendants lies in $\operatorname{pa}(k)$; but $a$ is a descendant of $k$, so if it did, this would form a directed cycle.
Finding the $v$-structures is harder, and at best we can do so up to Markov equivalence. To do that, observe the following:

Proposition. Suppose we have $j - \ell - k$ in the skeleton of a DAG, with $j$ and $k$ not adjacent.
(i) If $j \to \ell \leftarrow k$, then no set $S$ that d-separates $j$ and $k$ can have $\ell \in S$.
(ii) If there exists an $S$ that d-separates $j$ and $k$ with $\ell \notin S$, then $j \to \ell \leftarrow k$.

Denote the set of nodes adjacent to the vertex $k$ in the graph $G$ by $\operatorname{adj}(G, k)$.
We can now describe the first part of the PC algorithm, which finds the skeleton of the "true DAG":
(i) Set $\hat{G}$ to be the complete undirected graph. Set $\ell = -1$.
(ii) Repeat the following steps:
(a) Set $\ell = \ell + 1$.
(b) Repeat the following steps:
i. Select a new ordered pair of nodes $j, k$ that are adjacent in $\hat{G}$ and such that $|\operatorname{adj}(\hat{G}, j)\setminus\{k\}| \ge \ell$.
ii. Repeat the following steps:
A. Choose a new $S \subseteq \operatorname{adj}(\hat{G}, j)\setminus\{k\}$ with $|S| = \ell$.
B. If $Z_j \perp\!\!\!\perp Z_k \mid Z_S$, then delete the edge $j - k$, and store $S(k, j) = S(j, k) = S$.
C. Repeat until $j - k$ is deleted or all such $S$ have been chosen.
iii. Repeat until all pairs of adjacent nodes are inspected.
(c) Repeat until $\ell \ge p - 2$.
Suppose $P$ is faithful to a DAG $G_0$. At each stage of the algorithm, the skeleton of $G_0$ will be a subgraph of $\hat{G}$: we only delete an edge $j - k$ when we find some $S$ with $Z_j \perp\!\!\!\perp Z_k \mid Z_S$, and by faithfulness such $j, k$ cannot be adjacent in $G_0$. On the other hand, edges $(j, k)$ remaining at termination will have
$$Z_j \not\perp\!\!\!\perp Z_k \mid Z_S\quad\text{for all } S \subseteq \operatorname{adj}(\hat{G}, j)\setminus\{k\}\text{ and all } S \subseteq \operatorname{adj}(\hat{G}, k)\setminus\{j\}.$$
Since the adjacency sets of $\hat{G}$ contain those of $G_0$, this rules out the d-separating sets provided by the previous proposition, so $j$ and $k$ must be adjacent in $G_0$. Thus, $\hat{G}$ and $G_0$ have the same skeleton.
To find the $v$-structures, we then perform:
(i) For all $j - \ell - k$ in $\hat{G}$ with $j, k$ non-adjacent, do:
(a) If $\ell \notin S(j, k)$, then orient $j \to \ell \leftarrow k$.
This gives us the Markov equivalence class, and we may orient some of the remaining edges using other properties, such as acyclicity.

If we want to apply this to data sets, then we need to replace the oracle queries by conditional independence tests to decide whether variables are conditionally independent. However, errors in the algorithm propagate, and the whole process may be derailed by early errors. Moreover, the result of the algorithm may depend on how we iterate through the nodes. People have tried many ways to fix these problems, but in general, this method is rather unstable. Yet, if we have large data sets, it can produce quite decent results, as in the sketch below.
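Here is a compressed sketch (mine) of the skeleton phase, with the conditional independence oracle replaced by a Gaussian partial-correlation (Fisher $z$) test; it deliberately ignores the ordering issues just mentioned.

```python
# A simplified sketch of the skeleton-finding phase of the PC algorithm, using a
# Gaussian (Fisher z) partial-correlation test in place of the independence oracle.
import itertools
import numpy as np
from scipy.stats import norm

def ci_pvalue(corr, n, i, j, S):
    """p-value for Z_i independent of Z_j given Z_S under a Gaussian model."""
    idx = [i, j] + list(S)
    prec = np.linalg.pinv(corr[np.ix_(idx, idx)])
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    z = np.arctanh(np.clip(r, -0.9999, 0.9999)) * np.sqrt(n - len(S) - 3)
    return 2 * norm.sf(abs(z))

def pc_skeleton(X, alpha=0.01):
    n, p = X.shape
    corr = np.corrcoef(X, rowvar=False)
    adj = {j: set(range(p)) - {j} for j in range(p)}
    sepset = {}
    ell = -1
    while any(len(adj[j] - {k}) > ell for j in range(p) for k in adj[j]):
        ell += 1
        for j in range(p):
            for k in list(adj[j]):
                for S in itertools.combinations(adj[j] - {k}, ell):
                    if ci_pvalue(corr, n, j, k, S) > alpha:   # "accept" independence
                        adj[j].discard(k); adj[k].discard(j)
                        sepset[(j, k)] = sepset[(k, j)] = set(S)
                        break
    return adj, sepset
```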
5 High-dimensional inference
5.1 Multiple testing
Finally, we talk about high-dimensional inference. Suppose we have come up
with a large number of potential drugs, and want to see if they are effective in
killing bacteria. Naively, we might try to run a hypothesis test on each of them,
using a $p < 0.05$ threshold. But this is a terrible idea, since each test has a $0.05$ chance of giving a false positive, so even if all the drugs are actually useless, we would incorrectly believe that a lot of them are useful, which is not the case.

In general, suppose we have some null hypotheses $H_1, \ldots, H_m$. By definition, a $p$-value $p_i$ for $H_i$ is a random variable such that
$$\mathbb{P}_{H_i}(p_i \le \alpha) \le \alpha$$
for all $\alpha \in [0, 1]$.

Let $I_0 \subseteq \{1, \ldots, m\}$ be the set of true null hypotheses and $m_0 = |I_0|$ their number. Given a procedure for rejecting hypotheses (a multiple testing procedure), we let $N$ be the number of false rejections (false positives), and $R$ the total number of rejections. One can also think about the number of false negatives, but we shall not do that here.
Traditionally, multiple-testing procedures sought to control the family-wise error rate (FWER), defined by $\mathbb{P}(N \ge 1)$. The simplest way to control this is to use the Bonferroni correction, which rejects $H_i$ if $p_i \le \alpha/m$. Usually, we might have $\alpha = 0.05$, so this threshold is very small if we have lots of hypotheses (e.g. a million). Unsurprisingly, we have

Theorem. When using the Bonferroni correction, we have
$$\text{FWER} \le \mathbb{E}(N) \le \frac{m_0\alpha}{m} \le \alpha.$$

Proof. The first inequality is Markov's inequality, and the last is obvious. The second follows from
$$\mathbb{E}(N) = \mathbb{E}\left(\sum_{i\in I_0}\mathbf{1}_{p_i \le \alpha/m}\right) = \sum_{i\in I_0}\mathbb{P}\left(p_i \le \frac{\alpha}{m}\right) \le \frac{m_0\alpha}{m}.$$
The Bonferroni correction is a rather conservative procedure, since all these inequalities can be quite loose. When we have a large number of hypotheses, the criterion for rejection is very strict. Can we do better?

A more sophisticated approach is the closed testing procedure. For each non-empty subset $I \subseteq \{1, \ldots, m\}$, we let $H_I$ be the null hypothesis that $H_i$ is true for all $i \in I$. This is known as an intersection hypothesis. Suppose for each non-empty $I \subseteq \{1, \ldots, m\}$ we have an $\alpha$-level test $\phi_I$ for $H_I$ (a local test), so that
$$\mathbb{P}_{H_I}(\phi_I = 1) \le \alpha.$$
Here $\phi_I$ takes values in $\{0, 1\}$, and $\phi_I = 1$ means rejection. The closed testing procedure then rejects $H_I$ iff for all $J \supseteq I$, we have $\phi_J = 1$.

Example. Consider the following collection of local tests, where the rejected ones are marked in red:

[Figure: the lattice of intersection hypotheses, $H_{1234}$; $H_{123}, H_{124}, H_{134}, H_{234}$; $H_{12}, H_{13}, H_{14}, H_{23}, H_{24}, H_{34}$; $H_1, H_2, H_3, H_4$, with the local rejections marked in red.]

In this case, we reject $H_1$ but not $H_2$ by closed testing. While $H_{23}$ is rejected as an intersection hypothesis, we cannot tell whether it is $H_2$ or $H_3$ that should be rejected.

This might seem like a very difficult procedure to analyse, but it turns out to be extremely simple.

Theorem. Closed testing makes no false rejections with probability at least $1 - \alpha$. In particular, $\text{FWER} \le \alpha$.

Proof. In order for there to be a false rejection, we must have falsely rejected $H_{I_0}$ with the local test, which has probability at most $\alpha$.

But of course this doesn't immediately give us an algorithm we can apply to data. Different choices for the local test give rise to different multiple testing procedures. One famous example is Holm's procedure. This takes $\phi_I$ to be the Bonferroni test, where $\phi_I = 1$ iff $p_i \le \frac{\alpha}{|I|}$ for some $i \in I$.

When $m$ is large, we don't want to compute all the $\phi_I$, since there are $2^m - 1$ of them. So we might want to find a shortcut. With a moment of thought, we see that Holm's procedure amounts to the following:

Let $(i)$ be the index of the $i$th smallest $p$-value, so $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$.

Step 1: If $p_{(1)} \le \frac{\alpha}{m}$, then reject $H_{(1)}$ and go to step 2. Otherwise, accept all null hypotheses.

Step $i$: If $p_{(i)} \le \frac{\alpha}{m - i + 1}$, then reject $H_{(i)}$ and go to step $i + 1$. Otherwise, accept $H_{(i)}, H_{(i+1)}, \ldots, H_{(m)}$.

Step $m$: If $p_{(m)} \le \alpha$, then reject $H_{(m)}$. Otherwise, accept $H_{(m)}$.
The interesting thing about this is that it has the same bound on FWER as the Bonferroni correction, but its rejection thresholds are less stringent.

But if $m$ is very large, then the criterion for rejecting $H_{(1)}$ is still quite strict. The problem is that controlling FWER is a very strong requirement. Instead of controlling the probability that there is at least one false rejection, when $m$ is large it might be more reasonable to control the proportion of false discoveries. Many modern multiple testing procedures aim to control the false discovery rate
$$\text{FDR} = \mathbb{E}\left[\frac{N}{\max\{R, 1\}}\right].$$
The funny maximum in the denominator is just to avoid division by zero: when $R = 0$, we must have $N = 0$ as well, so what is put in the denominator doesn't really matter.

The Benjamini–Hochberg procedure attempts to control the FDR at level $\alpha$ and works as follows: let
$$\hat{k} = \max\left\{i : p_{(i)} \le \frac{\alpha i}{m}\right\}.$$
Then reject $H_{(1)}, \ldots, H_{(\hat{k})}$, or accept all hypotheses if $\hat{k}$ is not defined.
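Both Holm's procedure and the Benjamini–Hochberg procedure are one-liners to implement; the sketch below (mine) returns the set of rejected indices for each.

```python
# Minimal implementations (mine) of Holm's step-down procedure and the
# Benjamini-Hochberg procedure, returning the set of rejected indices.
import numpy as np

def holm(p, alpha=0.05):
    p = np.asarray(p); m = len(p)
    order = np.argsort(p)
    rejected = []
    for step, i in enumerate(order):            # thresholds alpha/m, alpha/(m-1), ..., alpha
        if p[i] <= alpha / (m - step):
            rejected.append(i)
        else:
            break                               # accept this and all larger p-values
    return set(rejected)

def benjamini_hochberg(p, alpha=0.05):
    p = np.asarray(p); m = len(p)
    order = np.argsort(p)
    below = np.nonzero(p[order] <= alpha * (np.arange(m) + 1) / m)[0]
    if below.size == 0:
        return set()                            # k-hat undefined: accept everything
    k_hat = below.max() + 1                     # number of rejections
    return set(order[:k_hat].tolist())
```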
Under certain conditions, this does control the false discovery rate.
Theorem. Suppose that for each $i \in I_0$, $p_i$ is independent of $\{p_j : j \ne i\}$. Then, using the Benjamini–Hochberg procedure, the false discovery rate satisfies
$$\text{FDR} = \mathbb{E}\left[\frac{N}{\max(R, 1)}\right] \le \frac{\alpha m_0}{m} \le \alpha.$$

Curiously, while the proof requires each $p_i$ to be independent of the others, in simulations, even when there is no hope that the $p_i$ are independent, the Benjamini–Hochberg procedure still appears to work very well, and people are still trying to understand what it is that makes it work so well in general.

Proof. The false discovery rate is
$$\mathbb{E}\left[\frac{N}{\max(R, 1)}\right] = \sum_{r=1}^m\mathbb{E}\left[\frac{N}{r}\mathbf{1}_{R=r}\right] = \sum_{r=1}^m\frac{1}{r}\,\mathbb{E}\left[\sum_{i\in I_0}\mathbf{1}_{p_i \le \alpha r/m}\mathbf{1}_{R=r}\right] = \sum_{i\in I_0}\sum_{r=1}^m\frac{1}{r}\,\mathbb{P}\left(p_i \le \frac{\alpha r}{m},\ R = r\right).$$
The brilliant idea is, for each $i \in I_0$, to let $R_i$ be the number of rejections when applying a modified Benjamini–Hochberg procedure to
$$p^{\setminus i} = \{p_1, \ldots, p_m\}\setminus\{p_i\}$$
with cutoff
$$\hat{k}_i = \max\left\{j : p^{\setminus i}_{(j)} \le \frac{\alpha(j+1)}{m}\right\}.$$
We observe that for $i \in I_0$ and $r \ge 1$, we have
$$\left\{p_i \le \frac{\alpha r}{m},\ R = r\right\} = \left\{p_i \le \frac{\alpha r}{m},\ p_{(r)} \le \frac{\alpha r}{m},\ p_{(s)} > \frac{\alpha s}{m}\text{ for all } s > r\right\}$$
$$= \left\{p_i \le \frac{\alpha r}{m},\ p^{\setminus i}_{(r-1)} \le \frac{\alpha r}{m},\ p^{\setminus i}_{(s-1)} > \frac{\alpha s}{m}\text{ for all } s > r\right\} = \left\{p_i \le \frac{\alpha r}{m},\ R_i = r - 1\right\}.$$
The key point is that the event $\{R_i = r - 1\}$ depends only on the other $p$-values. So the FDR is equal to
$$\text{FDR} = \sum_{i\in I_0}\sum_{r=1}^m\frac{1}{r}\,\mathbb{P}\left(p_i \le \frac{\alpha r}{m},\ R_i = r - 1\right) = \sum_{i\in I_0}\sum_{r=1}^m\frac{1}{r}\,\mathbb{P}\left(p_i \le \frac{\alpha r}{m}\right)\mathbb{P}(R_i = r - 1).$$
Using that $\mathbb{P}(p_i \le \frac{\alpha r}{m}) \le \frac{\alpha r}{m}$ by the definition of a $p$-value, this is
$$\le \frac{\alpha}{m}\sum_{i\in I_0}\sum_{r=1}^m\mathbb{P}(R_i = r - 1) = \frac{\alpha}{m}\sum_{i\in I_0}\mathbb{P}(R_i \in \{0, \ldots, m-1\}) = \frac{\alpha m_0}{m}.$$
This is one of the most used procedures in modern statistics, especially in
the biological sciences.
5.2 Inference in high-dimensional regression
We have more-or-less answered how to do multiple hypothesis testing, given that we know how to obtain the individual $p$-values. But how do we obtain these in the first place?

For example, we might be doing regression and trying to figure out which coefficients are non-zero. Consider the normal linear model $Y = X\beta^0 + \varepsilon$, where $\varepsilon \sim N_n(0, \sigma^2I)$. In the low-dimensional setting, we have
$$\sqrt{n}(\hat\beta^{OLS} - \beta^0) \sim N_p\left(0, \sigma^2\left(\tfrac{1}{n}X^TX\right)^{-1}\right).$$
Since this does not depend on $\beta^0$, we can use it to form confidence intervals and hypothesis tests.

However, if we have more coefficients than data points, then we can't do ordinary least squares, so we need to look for something else. For example, we might want to replace the OLS estimate with the Lasso estimate. However, $\sqrt{n}(\hat\beta^L_\lambda - \beta^0)$ has an intractable distribution. In particular, since $\hat\beta^L_\lambda$ has a bias, the distribution will depend on $\beta^0$ in a complicated way.
The recently introduced debiased Lasso tries to overcome these issues. See van de Geer, Bühlmann, Ritov and Dezeure (2014) for more details. Let $\hat\beta$ be the Lasso solution at $\lambda > 0$. Recall the KKT conditions, which say that $\hat\nu$ defined by
$$\frac{1}{n}X^T(Y - X\hat\beta) = \lambda\hat\nu$$
satisfies $\|\hat\nu\|_\infty \le 1$ and $\hat\nu_{\hat{S}} = \operatorname{sgn}(\hat\beta_{\hat{S}})$, where $\hat{S} = \{k : \hat\beta_k \ne 0\}$.

We set $\hat\Sigma = \frac{1}{n}X^TX$. Then we can rewrite the KKT conditions as
$$\hat\Sigma(\hat\beta - \beta^0) + \lambda\hat\nu = \frac{1}{n}X^T\varepsilon.$$
What we are trying to understand is $\hat\beta - \beta^0$. So it would be nice if we could find some sort of inverse to $\hat\Sigma$: then $\hat\beta - \beta^0$, plus a correction term involving $\hat\nu$, would be equal to a Gaussian.

Of course, the problem is that in the high-dimensional setting $\hat\Sigma$ has no hope of being invertible. So we want to find some approximate inverse $\hat\Theta$ so that the error we make is not too large. If we are equipped with such a $\hat\Theta$, then we have
$$\sqrt{n}(\hat\beta + \lambda\hat\Theta\hat\nu - \beta^0) = \frac{1}{\sqrt{n}}\hat\Theta X^T\varepsilon + \Delta,$$
where
$$\Delta = \sqrt{n}(\hat\Theta\hat\Sigma - I)(\beta^0 - \hat\beta).$$
We hope we can choose $\hat\Theta$ so that $\Delta$ is small. We can then use the quantity
$$\hat{b} = \hat\beta + \lambda\hat\Theta\hat\nu = \hat\beta + \frac{1}{n}\hat\Theta X^T(Y - X\hat\beta)$$
as our modified estimator, called the debiased Lasso.

How do we bound $\Delta$? We already know that (under compatibility and sparsity conditions) we can make the $\ell_1$ norm $\|\beta^0 - \hat\beta\|_1$ small with high probability. So if the $\ell_\infty$ norm of each of the rows of $\hat\Theta\hat\Sigma - I$ is small, then Hölder allows us to bound $\Delta$.
Write $\hat\theta_j$ for the $j$th row of $\hat\Theta$. Then the requirement $\|\hat\Sigma\hat\theta_j - e_j\|_\infty \le \eta$ (the $j$th row of $\hat\Theta\hat\Sigma - I$, using the symmetry of $\hat\Sigma$) is equivalent to
$$|(\hat\Sigma\hat\theta_j)_k| \le \eta \text{ for } k \ne j,\qquad |(\hat\Sigma\hat\theta_j)_j - 1| \le \eta.$$
Using the definition of $\hat\Sigma$, these are equivalent to
$$\frac{1}{n}|X_k^TX\hat\theta_j| \le \eta \text{ for } k \ne j,\qquad \left|\frac{1}{n}X_j^TX\hat\theta_j - 1\right| \le \eta.$$
The first is the same as saying
$$\frac{1}{n}\|X_{-j}^TX\hat\theta_j\|_\infty \le \eta.$$
This is quite reminiscent of the KKT conditions for the Lasso. So let us define
$$\hat\gamma^{(j)} = \operatorname*{argmin}_{\gamma\in\mathbb{R}^{p-1}}\left\{\frac{1}{2n}\|X_j - X_{-j}\gamma\|_2^2 + \lambda_j\|\gamma\|_1\right\},$$
$$\hat\tau^2_j = \frac{1}{n}X_j^T(X_j - X_{-j}\hat\gamma^{(j)}) = \frac{1}{n}\|X_j - X_{-j}\hat\gamma^{(j)}\|_2^2 + \lambda_j\|\hat\gamma^{(j)}\|_1.$$
The second equality is an exercise on the example sheet.

We can then set
$$\hat\theta_j = -\frac{1}{\hat\tau^2_j}\left(\hat\gamma^{(j)}_1, \ldots, \hat\gamma^{(j)}_{j-1}, -1, \hat\gamma^{(j)}_j, \ldots, \hat\gamma^{(j)}_{p-1}\right)^T.$$
The factor $\hat\tau_j^{-2}$ is there so that the second condition holds with equality. Indeed, by construction we have
$$X\hat\theta_j = \frac{X_j - X_{-j}\hat\gamma^{(j)}}{\frac{1}{n}X_j^T(X_j - X_{-j}\hat\gamma^{(j)})}.$$
Thus, $\frac{1}{n}X_j^TX\hat\theta_j = 1$, and by the KKT conditions for the Lasso defining $\hat\gamma^{(j)}$, we have
$$\frac{1}{n}\|X_{-j}^TX\hat\theta_j\|_\infty \le \frac{\lambda_j}{\hat\tau^2_j}.$$
Thus, with the choice of $\hat\Theta$ above, we have
$$\|\Delta\|_\infty \le \sqrt{n}\,\|\hat\beta - \beta^0\|_1\max_j\frac{\lambda_j}{\hat\tau^2_j}.$$
Now this is good as long as we can ensure that $\lambda_j/\hat\tau^2_j$ is small. When is this true?

We can consider a random design setting, where each row of $X$ is iid $N_p(0, \Sigma)$ for some positive-definite $\Sigma$. Write $\Omega = \Sigma^{-1}$.

Then by our study of the neighbourhood selection procedure, we know that for each $j$, we can write
$$X_j = X_{-j}\gamma^{(j)} + \varepsilon^{(j)},$$
where the $\varepsilon^{(j)}_i \mid X_{-j} \sim N(0, \Omega_{jj}^{-1})$ are iid and $\gamma^{(j)} = -\Omega_{jj}^{-1}\Omega_{-j,j}$. To apply our Lasso results, we need the $\gamma^{(j)}$ to be sparse. Let us therefore define
$$s_j = \sum_{k\ne j}\mathbf{1}_{\Omega_{jk}\ne 0},$$
and set $s_{\max} = \max(\max_j s_j, s)$.
Theorem. Suppose the minimum eigenvalue of $\Sigma$ is always at least $c_{\min} > 0$ and $\max_j\Sigma_{jj} \le 1$. Suppose further that $s_{\max}\sqrt{\log(p)/n} \to 0$. Then there exist constants $A_1, A_2$ such that, setting $\lambda = \lambda_j = A_1\sqrt{\log(p)/n}$, we have
$$\sqrt{n}(\hat{b} - \beta^0) = W + \Delta,\qquad W \mid X \sim N_p(0, \sigma^2\hat\Theta\hat\Sigma\hat\Theta^T),$$
and as $n, p \to \infty$,
$$\mathbb{P}\left(\|\Delta\|_\infty > A_2\frac{s\log(p)}{\sqrt{n}}\right) \to 0.$$
Note that here $X$ is not centred and scaled.
We see that in particular, $\sqrt{n}(\hat{b}_j - \beta^0_j) \approx N(0, \sigma^2(\hat\Theta\hat\Sigma\hat\Theta^T)_{jj})$. In fact, one can show that
$$d_j := (\hat\Theta\hat\Sigma\hat\Theta^T)_{jj} = \frac{\frac{1}{n}\|X_j - X_{-j}\hat\gamma^{(j)}\|_2^2}{\hat\tau^4_j}.$$
This suggests an approximate $(1 - \alpha)$-level confidence interval for $\beta^0_j$,
$$\text{CI} = \left(\hat{b}_j - Z_{\alpha/2}\,\sigma\sqrt{d_j/n},\ \hat{b}_j + Z_{\alpha/2}\,\sigma\sqrt{d_j/n}\right),$$
where $Z_\alpha$ is the upper $\alpha$-point of $N(0, 1)$. Note that here we are getting confidence intervals of width of order $\sqrt{1/n}$. In particular, there is no $\log p$ dependence if we are only trying to estimate the single coefficient $\beta^0_j$.
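The construction above is straightforward to code for a single coordinate $j$; the sketch below is mine, assuming scikit-learn's `Lasso` and treating the noise level $\sigma$ as known, and it simply applies the formulas for $\hat\tau^2_j$, $\hat{b}_j$ and $d_j$ given above.

```python
# A sketch of the debiased Lasso for a single coordinate j (lambda choices and names are mine).
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import Lasso

def debiased_lasso_ci(X, Y, j, lam, lam_j, sigma, alpha=0.05):
    n, p = X.shape
    beta_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=50_000).fit(X, Y).coef_

    others = [k for k in range(p) if k != j]
    gamma_hat = Lasso(alpha=lam_j, fit_intercept=False, max_iter=50_000).fit(X[:, others], X[:, j]).coef_
    resid_j = X[:, j] - X[:, others] @ gamma_hat
    tau2 = resid_j @ resid_j / n + lam_j * np.abs(gamma_hat).sum()

    # b_j = beta_hat_j + theta_j^T X^T (Y - X beta_hat) / n, with X theta_j = resid_j / tau2.
    b_j = beta_hat[j] + resid_j @ (Y - X @ beta_hat) / (n * tau2)
    d_j = (resid_j @ resid_j / n) / tau2 ** 2
    half_width = norm.ppf(1 - alpha / 2) * sigma * np.sqrt(d_j / n)
    return b_j - half_width, b_j + half_width
```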
Proof. Consider the sequence of events $\Lambda_n$ defined by the following properties:
$$\phi^2_{\hat\Sigma, s} \ge c_{\min}/2\ \text{ and }\ \phi^2_{\hat\Sigma_{-j,-j}, s_j} \ge c_{\min}/2 \text{ for all } j;$$
$$\frac{2}{n}\|X^T\varepsilon\|_\infty \le \lambda\ \text{ and }\ \frac{2}{n}\|X_{-j}^T\varepsilon^{(j)}\|_\infty \le \lambda_j \text{ for all } j;$$
$$\frac{1}{n}\|\varepsilon^{(j)}\|_2^2 \ge \Omega_{jj}^{-1}\left(1 - 4\sqrt{(\log p)/n}\right) \text{ for all } j.$$
Question 13 on example sheet 4 shows that $\mathbb{P}(\Lambda_n) \to 1$ for $A_1$ sufficiently large. So we will work on the event $\Lambda_n$.

By our results on the Lasso, we know
$$\|\beta^0 - \hat\beta\|_1 \le c_1 s\sqrt{\log p/n}$$
for some constant $c_1$. We now seek a lower bound for $\hat\tau^2_j$. Consider the linear models
$$X_j = X_{-j}\gamma^{(j)} + \varepsilon^{(j)},$$
where the sparsity of $\gamma^{(j)}$ is $s_j$, and $\varepsilon^{(j)}_i \mid X_{-j} \sim N(0, \Omega_{jj}^{-1})$. Note that
$$\Omega_{jj}^{-1} = \operatorname{var}(X_{ij} \mid X_{i,-j}) \le \operatorname{var}(X_{ij}) = \Sigma_{jj} \le 1.$$
Also, the maximum eigenvalue of $\Omega$ is at most $c_{\min}^{-1}$, so $\Omega_{jj} \le c_{\min}^{-1}$, and hence $\Omega_{jj}^{-1} \ge c_{\min}$. So by Lasso theory, we know
$$\|\gamma^{(j)} - \hat\gamma^{(j)}\|_1 \le c_2 s_j\sqrt{\frac{\log p}{n}}$$
for some constant $c_2$. Then we have
$$\hat\tau^2_j = \frac{1}{n}\|X_j - X_{-j}\hat\gamma^{(j)}\|_2^2 + \lambda_j\|\hat\gamma^{(j)}\|_1 \ge \frac{1}{n}\|\varepsilon^{(j)} + X_{-j}(\gamma^{(j)} - \hat\gamma^{(j)})\|_2^2$$
$$\ge \frac{1}{n}\|\varepsilon^{(j)}\|_2^2 - \frac{2}{n}\|X_{-j}^T\varepsilon^{(j)}\|_\infty\,\|\gamma^{(j)} - \hat\gamma^{(j)}\|_1 \ge \Omega_{jj}^{-1}\left(1 - 4\sqrt{\frac{\log p}{n}}\right) - c_2 s_j\sqrt{\frac{\log p}{n}}\cdot A_1\sqrt{\frac{\log p}{n}}.$$
In the limit, the right-hand side tends to $\Omega_{jj}^{-1}$. So for large $n$, this is at least
$$\frac12\Omega_{jj}^{-1} \ge \frac12 c_{\min}.$$
Thus, we have
$$\|\Delta\|_\infty \le \sqrt{n}\,\|\hat\beta - \beta^0\|_1\max_j\frac{\lambda_j}{\hat\tau^2_j} \le \sqrt{n}\cdot c_1 s\sqrt{\frac{\log p}{n}}\cdot\frac{2A_1\sqrt{\log p/n}}{c_{\min}} = \frac{2A_1c_1}{c_{\min}}\cdot\frac{s\log p}{\sqrt{n}} = A_2\frac{s\log p}{\sqrt{n}}.$$