4 Graphical modelling
4.1 Conditional independence graphs
So far, we have been looking at prediction problems. But sometimes we may
want to know more than that. For example, there is a positive correlation
between the wine budget of a college and the percentage of students getting
firsts. This information allows us to make predictions, in the sense that if we
happen to know the wine budget of a college, but forgot the percentage of
students getting firsts, then we can make a reasonable prediction of the latter
based on the former. However, this does not suggest any causal relation between
the two — increasing the wine budget is probably not a good way to increase
the percentage of students getting firsts!
Of course, it is unlikely that we can actually figure out causation just by
looking at the data. However, there are some things we can try to answer. If
we gather more data about the colleges, then we would probably find that the
colleges that have larger wine budgets and more students getting firsts are also
the colleges with larger endowments and longer histories. If we condition on all
these other variables, then there is not much correlation left between the wine
budget and the percentage of students getting firsts. This is what we are trying
to capture in conditional independence graphs.
We first introduce some basic graph terminology. For the purpose of conditional independence graphs, we only need undirected graphs. But later, we will need the notion of directed graphs as well, so our definitions will be general enough to include those.
Definition (Graph). A graph is a pair G = (V, E), where V is a set and E ⊆ V × V is such that (v, v) ∉ E for all v ∈ V.
Definition (Edge). We say there is an edge between j and k, and that j and k are adjacent, if (j, k) ∈ E or (k, j) ∈ E.
Definition (Undirected edge). An edge (j, k) is undirected if also (k, j) ∈ E. Otherwise, it is directed and we write j → k to represent it. We also say that j is a parent of k, and write pa(k) for the set of all parents of k.
Definition ((Un)directed graph). A graph is (un)directed if all its edges are
(un)directed.
Definition (Skeleton). The skeleton of G is a copy of G with every edge replaced by an undirected edge.
Definition (Subgraph). A graph G_1 = (V_1, E_1) is a subgraph of G = (V, E) if V_1 ⊆ V and E_1 ⊆ E. A proper subgraph is one where at least one of the inclusions is proper.
As discussed, we want a graph that encodes the conditional dependence of
different variables. We first define what this means. In this section, we only
work with undirected graphs.
Definition (Conditional independence). Let X, Y, Z be random vectors with joint density f_{XYZ}. We say that X is conditionally independent of Y given Z, written X ⊥⊥ Y | Z, if
f_{XY|Z}(x, y | z) = f_{X|Z}(x | z) f_{Y|Z}(y | z).
Equivalently,
f_{X|YZ}(x | y, z) = f_{X|Z}(x | z)
for all y.
We shall ignore all the technical difficulties, and take as an assumption that
all these conditional densities exist.
Definition (Conditional independence graph (CIG)). Let P be the law of Z = (Z_1, . . . , Z_p)^T. The conditional independence graph (CIG) is the graph whose vertices are {1, . . . , p}, and which contains an edge between j and k iff Z_j and Z_k are conditionally dependent given Z_{-jk}.
More generally, we make the following definition:
Definition (Pairwise Markov property). Let P be the law of Z = (Z_1, . . . , Z_p)^T. We say P satisfies the pairwise Markov property with respect to a graph G if for any distinct, non-adjacent vertices j, k, we have Z_j ⊥⊥ Z_k | Z_{-jk}.
Example. If G is a complete graph, then P satisfies the pairwise Markov property with respect to G.
The conditional independence graph is thus the minimal graph satisfying the
pairwise Markov property. It turns out that under mild conditions, the pairwise
Markov property implies something much stronger.
Definition (Separates). Given a triple of disjoint subsets of nodes A, B, S, we say S separates A from B if every path from a node in A to a node in B contains a node in S.
Definition (Global Markov property). We say P satisfies the global Markov property with respect to G if for any triple (A, B, S) of disjoint subsets of V such that S separates A from B, we have Z_A ⊥⊥ Z_B | Z_S.
Proposition. If P has a positive density and satisfies the pairwise Markov property with respect to G, then it also satisfies the global Markov property with respect to G.
This is a really nice result, but we will not prove this. However, we will prove
a special case in the example sheet.
So how do we actually construct the conditional independence graph? To do
so, we need to test our variables for conditional dependence. In general, this is
quite hard. However, in the case where we have Gaussian data, it is much easier,
since independence is the same as vanishing covariance.
Notation (M_{A,B}). Let M be a matrix. Then M_{A,B} refers to the submatrix given by the rows in A and the columns in B.
Since we are going to talk about conditional distributions a lot, the following
calculation will be extremely useful.
Proposition. Suppose Z ∼ N_p(µ, Σ) and Σ is positive definite. Then
Z_A | Z_B = z_B ∼ N_{|A|}(µ_A + Σ_{A,B} Σ_{B,B}^{-1} (z_B − µ_B), Σ_{A,A} − Σ_{A,B} Σ_{B,B}^{-1} Σ_{B,A}).
Proof. Of course, we can just compute this directly, maybe using moment generating functions. But for pleasantness, we adopt a different approach. Note that for any matrix M, we have
Z_A = M Z_B + (Z_A − M Z_B).
We shall pick M such that Z_A − M Z_B is independent of Z_B, i.e. such that
0 = cov(Z_B, Z_A − M Z_B) = Σ_{B,A} − Σ_{B,B} M^T.
So we should take
M = (Σ_{B,B}^{-1} Σ_{B,A})^T = Σ_{A,B} Σ_{B,B}^{-1}.
We already know that Z_A − M Z_B is Gaussian, so to understand it, we only need to know its mean and variance. We have
E[Z_A − M Z_B] = µ_A − M µ_B = µ_A − Σ_{A,B} Σ_{B,B}^{-1} µ_B,
var(Z_A − M Z_B) = Σ_{A,A} − 2 Σ_{A,B} Σ_{B,B}^{-1} Σ_{B,A} + Σ_{A,B} Σ_{B,B}^{-1} Σ_{B,B} Σ_{B,B}^{-1} Σ_{B,A}
                 = Σ_{A,A} − Σ_{A,B} Σ_{B,B}^{-1} Σ_{B,A}.
Since Z_A − M Z_B is independent of Z_B, conditioning on Z_B = z_B leaves its distribution unchanged, and Z_A = M z_B + (Z_A − M Z_B) then has exactly the stated conditional distribution. So we are done.
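The proof also suggests a quick numerical check: regressing samples of Z_A on Z_B should recover M = Σ_{A,B} Σ_{B,B}^{-1} as the coefficient matrix and the stated conditional variance as the residual covariance. The following Python sketch does this with NumPy; the dimension p = 4 and the partition A, B are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)

    # A small positive definite covariance matrix; the partition A, B is arbitrary.
    p = 4
    L = rng.normal(size=(p, p))
    Sigma = L @ L.T + p * np.eye(p)
    mu = rng.normal(size=p)
    A, B = [0, 1], [2, 3]

    # M = Sigma_{A,B} Sigma_{B,B}^{-1} and the conditional variance from the proposition.
    M = Sigma[np.ix_(A, B)] @ np.linalg.inv(Sigma[np.ix_(B, B)])
    cond_var = Sigma[np.ix_(A, A)] - M @ Sigma[np.ix_(B, A)]

    # Monte Carlo check: since Z_A - M Z_B is independent of Z_B, regressing
    # Z_A on Z_B recovers M as the coefficient matrix and cond_var as the
    # covariance of the residuals.
    Z = rng.multivariate_normal(mu, Sigma, size=200_000)
    ZA, ZB = Z[:, A], Z[:, B]
    ZAc, ZBc = ZA - ZA.mean(axis=0), ZB - ZB.mean(axis=0)
    M_hat = np.linalg.solve(ZBc.T @ ZBc, ZBc.T @ ZAc).T
    resid = ZA - ZB @ M_hat.T
    print(np.max(np.abs(M_hat - M)))                    # small, up to Monte Carlo error
    print(np.max(np.abs(np.cov(resid.T) - cond_var)))   # small, up to Monte Carlo error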
Neighbourhood selection
We are going to specialize to A = {k} and B = {1, . . . , p} \ {k}. Then we can separate out the "mean" part and write
Z_k = M_k + Z_{-k}^T Σ_{-k,-k}^{-1} Σ_{-k,k} + ε_k,
where
M_k = µ_k − µ_{-k}^T Σ_{-k,-k}^{-1} Σ_{-k,k},
ε_k | Z_{-k} ∼ N(0, Σ_{k,k} − Σ_{k,-k} Σ_{-k,-k}^{-1} Σ_{-k,k}).
This now looks like we are doing regression.
We observe that
Lemma. Given k, let j′ be such that (Z_{-k})_j = Z_{j′}. This j′ is either j or j + 1, depending on whether it comes before or after k. If the jth component of Σ_{-k,-k}^{-1} Σ_{-k,k} is 0, then Z_k ⊥⊥ Z_{j′} | Z_{-kj′}.
Proof. If the jth component of Σ_{-k,-k}^{-1} Σ_{-k,k} is 0, then the distribution of Z_k | Z_{-k} does not depend on (Z_{-k})_j = Z_{j′} (here j′ is either j or j + 1, depending on where k is). So we know that
Z_k | Z_{-k} =^d Z_k | Z_{-kj′}.
This is the same as saying Z_k ⊥⊥ Z_{j′} | Z_{-kj′}.
Neighbourhood selection exploits this fact. Given x_1, . . . , x_n which are iid ∼ Z and
X = (x_1, . . . , x_n)^T ∈ R^{n×p},
we can estimate Σ_{-k,-k}^{-1} Σ_{-k,k} by regressing X_k on X_{-k} using the Lasso (with an intercept term). We then obtain selected sets Ŝ_k. There are two ways of estimating the CIG based on these:
– OR rule: We add the edge (j, k) if j ∈ Ŝ_k or k ∈ Ŝ_j.
– AND rule: We add the edge (j, k) if j ∈ Ŝ_k and k ∈ Ŝ_j.
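As an illustration, here is a minimal sketch of neighbourhood selection in Python using scikit-learn's Lasso. The toy precision matrix, the sample size, and the use of LassoCV (which picks λ by cross-validation, so it targets prediction rather than exact support recovery) are all illustrative assumptions rather than anything prescribed above.

    import numpy as np
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(1)

    # Toy Gaussian data whose true precision matrix has a single off-diagonal
    # entry, so the true CIG has exactly one edge, between variables 0 and 1.
    p, n = 10, 500
    Omega = np.eye(p)
    Omega[0, 1] = Omega[1, 0] = 0.4
    X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)

    # Regress each X_k on X_{-k} with the Lasso (intercept included) and
    # record the selected variables S_hat[k].
    S_hat = []
    for k in range(p):
        others = [j for j in range(p) if j != k]
        beta = LassoCV(cv=5).fit(X[:, others], X[:, k]).coef_
        S_hat.append({others[j] for j in np.flatnonzero(beta)})

    # OR rule: edge {j, k} if j is in S_hat[k] or k is in S_hat[j];
    # AND rule: edge {j, k} if j is in S_hat[k] and k is in S_hat[j].
    edges_or = {tuple(sorted((j, k))) for k in range(p) for j in S_hat[k]}
    edges_and = {tuple(sorted((j, k))) for k in range(p) for j in S_hat[k] if k in S_hat[j]}
    print(edges_or, edges_and)   # ideally both contain (0, 1) and little else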
The graphical Lasso
Another way of finding the conditional independence graph is to compute var(Z_{{j,k}} | Z_{-jk}) directly. The following lemma will be useful:
Lemma. Let M ∈ R^{p×p} be positive definite, and write
M = [ P     Q
      Q^T   R ],
where P and R are square. The Schur complement of R is
S = P − Q R^{-1} Q^T.
Note that this has the same size as P. Then
(i) S is positive definite.
(ii) M^{-1} = [ S^{-1}                −S^{-1} Q R^{-1}
                −R^{-1} Q^T S^{-1}    R^{-1} + R^{-1} Q^T S^{-1} Q R^{-1} ].
(iii) det(M) = det(S) det(R).
We have seen this Schur complement before, when we looked at var(Z_A | Z_{A^c}): there we found
var(Z_A | Z_{A^c}) = Σ_{A,A} − Σ_{A,A^c} Σ_{A^c,A^c}^{-1} Σ_{A^c,A} = (Ω_{A,A})^{-1},
where Ω = Σ^{-1} is the precision matrix; the final equality follows from part (ii) of the lemma applied to Σ (after permuting the variables so that those in A come first).
Specializing to the case where A = {j, k}, we have
var(Z_{{j,k}} | Z_{-jk}) = 1/det(Ω_{A,A}) · [ Ω_{k,k}    −Ω_{j,k}
                                              −Ω_{j,k}    Ω_{j,j} ].
This tells us that Z_k ⊥⊥ Z_j | Z_{-kj} iff Ω_{jk} = 0. Thus, we can find the conditional independence graph by computing the precision matrix Ω.
Our method to estimate the precision matrix is similar to the Lasso. Recall that the density of N_p(µ, Σ) is
P(z) = (2π)^{-p/2} (det Σ)^{-1/2} exp(−(1/2) (z − µ)^T Σ^{-1} (z − µ)).
The log-likelihood of (µ, Ω) based on an iid sample (X_1, . . . , X_n) is (after dropping a constant)
ℓ(µ, Ω) = (n/2) log det Ω − (1/2) Σ_{i=1}^n (x_i − µ)^T Ω (x_i − µ).
To simplify this, we let
x̄ = (1/n) Σ_{i=1}^n x_i,    S = (1/n) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T.
Then
Σ_i (x_i − µ)^T Ω (x_i − µ) = Σ_i (x_i − x̄ + x̄ − µ)^T Ω (x_i − x̄ + x̄ − µ)
                            = Σ_i (x_i − x̄)^T Ω (x_i − x̄) + n (x̄ − µ)^T Ω (x̄ − µ).
We have
Σ_i (x_i − x̄)^T Ω (x_i − x̄) = Σ_i tr((x_i − x̄)^T Ω (x_i − x̄)) = Σ_i tr((x_i − x̄)(x_i − x̄)^T Ω) = n tr(SΩ),
using the cyclic property of the trace.
So we now have
ℓ(µ, Ω) = −(n/2) (tr(SΩ) − log det Ω + (x̄ − µ)^T Ω (x̄ − µ)).
We are really interested in estimating Ω. So we should maximize this over µ, but that is easy, since it is the same as minimizing (x̄ − µ)^T Ω (x̄ − µ), and we know Ω is positive definite. So we should set µ = x̄. Thus, we have
ℓ(Ω) = max_{µ ∈ R^p} ℓ(µ, Ω) = −(n/2) (tr(SΩ) − log det Ω).
So we can solve for the MLE of Ω by solving
min_{Ω: Ω ≻ 0} (− log det Ω + tr(SΩ)).
One can show that this is convex, and to find the MLE, we can just differentiate:
∂/∂Ω_{jk} log det Ω = (Ω^{-1})_{jk},    ∂/∂Ω_{jk} tr(SΩ) = S_{jk},
using that S and Ω are symmetric. So provided that S is positive definite, the maximum likelihood estimate is just Ω = S^{-1}.
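As a sanity check, we can verify numerically that Ω = S^{-1} minimises −log det Ω + tr(SΩ) over positive definite matrices, at least against small symmetric perturbations. The sketch below uses NumPy only; the sample size (with n > p so that S is positive definite) and the perturbation scale are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(3)

    # A positive definite sample covariance S (n > p, so S is invertible).
    n, p = 200, 5
    X = rng.normal(size=(n, p))
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n

    def objective(Omega):
        """-log det(Omega) + tr(S Omega); maximising l(Omega) minimises this."""
        sign, logdet = np.linalg.slogdet(Omega)
        return -logdet + np.trace(S @ Omega)

    Omega_mle = np.linalg.inv(S)
    base = objective(Omega_mle)

    # Any symmetric perturbation (small enough to keep Omega positive definite)
    # should only increase the objective if S^{-1} really is the minimiser.
    for _ in range(5):
        E = 0.05 * rng.normal(size=(p, p))
        E = (E + E.T) / 2
        print(objective(Omega_mle + E) >= base)   # expect True every time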
But we are interested in the high dimensional situation, where we have loads of
variables, and
S
cannot be positive definite. To solve this, we use the graphical
Lasso.
The graphical Lasso solves
argmin_{Ω: Ω ≻ 0} (− log det Ω + tr(SΩ) + λ ‖Ω‖_1),
where
‖Ω‖_1 = Σ_{j,k} |Ω_{jk}|.
Often, people don’t sum over the diagonal elements, as we want to know if off-
diagonal elements ought to be zero. Similar to the case of the Lasso, this gives a
sparse estimate of Ω from which we may estimate the conditional independence
graph.
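In practice we rarely implement this optimisation ourselves. One readily available implementation is scikit-learn's GraphicalLasso; the sketch below is illustrative, with toy data as in the neighbourhood selection example and the regularisation parameter alpha = 0.1 (playing the role of λ) an arbitrary choice.

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(4)

    # Toy data: the true precision matrix has a single off-diagonal entry,
    # so the true CIG has exactly one edge, between variables 0 and 1.
    p, n = 10, 500
    Omega_true = np.eye(p)
    Omega_true[0, 1] = Omega_true[1, 0] = 0.4
    X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega_true), size=n)

    # Fit the graphical Lasso; alpha plays the role of lambda above.
    Omega_hat = GraphicalLasso(alpha=0.1).fit(X).precision_

    # Estimated CIG: an edge between j and k whenever Omega_hat[j, k] is non-zero.
    edges = {(j, k) for j in range(p) for k in range(j + 1, p)
             if abs(Omega_hat[j, k]) > 1e-8}
    print(edges)   # ideally contains (0, 1) and little else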