4 Graphical modelling




4.1 Conditional independence graphs
So far, we have been looking at prediction problems. But sometimes we may
want to know more than that. For example, there is a positive correlation
between the wine budget of a college, and the percentage of students getting
firsts. This information allows us to make predictions, in the sense that if we
happen to know the wine budget of a college, but forgot the percentage of
students getting firsts, then we can make a reasonable prediction of the latter
based on the former. However, this does not suggest any causal relation between
the two: increasing the wine budget is probably not a good way to increase
the percentage of students getting firsts!
Of course, it is unlikely that we can actually figure out causation just by
looking at the data. However, there are some things we can try to answer. If
we gather more data about the colleges, then we would probably find that the
colleges that have larger wine budget and more students getting firsts are also
the colleges with larger endowment and longer history. If we condition on all
these other variables, then there is not much correlation left between the wine
budget and the percentage of students getting firsts. This is what we are trying
to capture in conditional independence graphs.
We first introduce some basic graph terminology. For the purpose of conditional
independence graphs, we only need undirected graphs. But later, we will need
the notion of directed graphs as well, so our definitions will be general enough to
include those.
Definition (Graph). A graph is a pair $G = (V, E)$, where $V$ is a set and $E \subseteq V \times V$ is such that $(v, v) \not\in E$ for all $v \in V$.
Definition (Edge). We say there is an edge between $j$ and $k$, and that $j$ and $k$ are adjacent, if $(j, k) \in E$ or $(k, j) \in E$.
Definition (Undirected edge). An edge $(j, k)$ is undirected if also $(k, j) \in E$. Otherwise, it is directed, and we write $j \to k$ to represent it. We also say that $j$ is a parent of $k$, and write $\operatorname{pa}(k)$ for the set of all parents of $k$.
Definition ((Un)directed graph). A graph is (un)directed if all its edges are
(un)directed.
Definition (Skeleton). The skeleton of $G$ is a copy of $G$ with every edge replaced by an undirected edge.
Definition (Subgraph). A graph $G_1 = (V_1, E_1)$ is a subgraph of $G = (V, E)$ if $V_1 \subseteq V$ and $E_1 \subseteq E$. A proper subgraph is one where at least one of the inclusions is proper.
As discussed, we want a graph that encodes the conditional dependence of
different variables. We first define what this means. In this section, we only
work with undirected graphs.
Definition (Conditional independence). Let $X, Y, Z$ be random vectors with joint density $f_{XYZ}$. We say that $X$ is conditionally independent of $Y$ given $Z$, written $X \perp\!\!\!\perp Y \mid Z$, if
\[
  f_{XY|Z}(x, y \mid z) = f_{X|Z}(x \mid z)\, f_{Y|Z}(y \mid z).
\]
Equivalently,
\[
  f_{X|YZ}(x \mid y, z) = f_{X|Z}(x \mid z)
\]
for all $y$.
We shall ignore all the technical difficulties, and take as an assumption that
all these conditional densities exist.
Definition (Conditional independence graph (CIG)). Let $P$ be the law of $Z = (Z_1, \ldots, Z_p)^T$. The conditional independence graph (CIG) is the graph whose vertices are $\{1, \ldots, p\}$, and which contains an edge between $j$ and $k$ iff $Z_j$ and $Z_k$ are conditionally dependent given $Z_{-jk}$, the vector of all remaining variables.
More generally, we make the following definition:
Definition (Pairwise Markov property). Let $P$ be the law of $Z = (Z_1, \ldots, Z_p)^T$. We say $P$ satisfies the pairwise Markov property with respect to a graph $G$ if for any distinct, non-adjacent vertices $j, k$, we have $Z_j \perp\!\!\!\perp Z_k \mid Z_{-jk}$.
Example. If $G$ is a complete graph, then $P$ satisfies the pairwise Markov property with respect to $G$.
The conditional independence graph is thus the minimal graph satisfying the
pairwise Markov property. It turns out that under mild conditions, the pairwise
Markov property implies something much stronger.
Definition (Separates). Given a triple of (disjoint) subsets of nodes $A, B, S$, we say $S$ separates $A$ from $B$ if every path from a node in $A$ to a node in $B$ contains a node in $S$.
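To make the separation condition concrete, here is a minimal Python sketch (not from the notes): $S$ separates $A$ from $B$ exactly when removing $S$ disconnects $A$ from $B$, so a breadth-first search from $A$ that refuses to enter $S$ suffices. The function name separates and the adjacency-dictionary representation are my own choices.

    from collections import deque

    def separates(adj, A, B, S):
        """Return True if S separates A from B in the undirected graph adj."""
        A, B, S = set(A), set(B), set(S)
        seen, queue = set(A), deque(A)
        while queue:                      # BFS from A, never stepping into S
            v = queue.popleft()
            for w in adj.get(v, ()):
                if w not in seen and w not in S:
                    seen.add(w)
                    queue.append(w)
        return not (seen & B)             # separated iff no node of B was reached

    # Example: in the path graph 1 - 2 - 3, the node 2 separates 1 from 3.
    adj = {1: [2], 2: [1, 3], 3: [2]}
    print(separates(adj, A=[1], B=[3], S=[2]))   # True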
Definition (Global Markov property). We say $P$ satisfies the global Markov property with respect to $G$ if for any triple $(A, B, S)$ of disjoint subsets of $V$ such that $S$ separates $A$ from $B$, we have $Z_A \perp\!\!\!\perp Z_B \mid Z_S$.
Proposition. If $P$ has a positive density and satisfies the pairwise Markov property with respect to $G$, then it also satisfies the global Markov property with respect to $G$.
This is a really nice result, but we will not prove this. However, we will prove
a special case in the example sheet.
So how do we actually construct the conditional independence graph? To do
so, we need to test our variables for conditional dependence. In general, this is
quite hard. However, in the case where we have Gaussian data, it is much easier,
since for jointly Gaussian variables independence is the same as vanishing covariance.
Notation ($M_{A,B}$). Let $M$ be a matrix. Then $M_{A,B}$ refers to the submatrix given by the rows in $A$ and columns in $B$.
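In code, this is just simultaneous row and column indexing; for instance, with numpy (an illustration of the notation, not something from the notes):

    import numpy as np

    M = np.arange(25).reshape(5, 5)
    A, B = [0, 2], [1, 3, 4]
    M_AB = M[np.ix_(A, B)]   # rows in A, columns in B: here a 2x3 submatrix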
Since we are going to talk about conditional distributions a lot, the following
calculation will be extremely useful.
Proposition. Suppose $Z \sim N_p(\mu, \Sigma)$ and $\Sigma$ is positive definite. Then
\[
  Z_A \mid Z_B = z_B \sim N_{|A|}\big(\mu_A + \Sigma_{A,B}\Sigma_{B,B}^{-1}(z_B - \mu_B),\; \Sigma_{A,A} - \Sigma_{A,B}\Sigma_{B,B}^{-1}\Sigma_{B,A}\big).
\]
Proof. Of course, we can just compute this directly, maybe using moment generating functions. But for pleasantness, we adopt a different approach. Note that for any $M$, we have
\[
  Z_A = M Z_B + (Z_A - M Z_B).
\]
We shall pick $M$ such that $Z_A - M Z_B$ is independent of $Z_B$, i.e. such that
\[
  0 = \operatorname{cov}(Z_B, Z_A - M Z_B) = \Sigma_{B,A} - \Sigma_{B,B} M^T.
\]
So we should take
\[
  M = (\Sigma_{B,B}^{-1}\Sigma_{B,A})^T = \Sigma_{A,B}\Sigma_{B,B}^{-1}.
\]
We already know that $Z_A - M Z_B$ is Gaussian, so to understand it, we only need to know its mean and variance. We have
\[
  \mathbb{E}[Z_A - M Z_B] = \mu_A - M \mu_B = \mu_A - \Sigma_{A,B}\Sigma_{B,B}^{-1}\mu_B
\]
and
\begin{align*}
  \operatorname{var}(Z_A - M Z_B) &= \Sigma_{A,A} - 2\Sigma_{A,B}\Sigma_{B,B}^{-1}\Sigma_{B,A} + \Sigma_{A,B}\Sigma_{B,B}^{-1}\Sigma_{B,B}\Sigma_{B,B}^{-1}\Sigma_{B,A} \\
  &= \Sigma_{A,A} - \Sigma_{A,B}\Sigma_{B,B}^{-1}\Sigma_{B,A}.
\end{align*}
Conditional on $Z_B = z_B$, we have $Z_A = M z_B + (Z_A - M Z_B)$, and since $Z_A - M Z_B$ is independent of $Z_B$, its conditional distribution is its unconditional one. So the conditional mean is $M z_B + \mu_A - M \mu_B = \mu_A + \Sigma_{A,B}\Sigma_{B,B}^{-1}(z_B - \mu_B)$, and the conditional variance is as computed above. Then we are done.
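The proof suggests an easy sanity check by simulation: with $M = \Sigma_{A,B}\Sigma_{B,B}^{-1}$, the residual $Z_A - M Z_B$ should be uncorrelated with $Z_B$ and have covariance $\Sigma_{A,A} - \Sigma_{A,B}\Sigma_{B,B}^{-1}\Sigma_{B,A}$. The following Python sketch (not part of the notes; the particular $\Sigma$ is made up for illustration) verifies this on simulated data.

    import numpy as np

    rng = np.random.default_rng(1)
    Sigma = np.array([[2.0, 0.6, 0.3],
                      [0.6, 1.0, 0.4],
                      [0.3, 0.4, 1.5]])
    mu = np.zeros(3)
    A, B = [0, 1], [2]

    Z = rng.multivariate_normal(mu, Sigma, size=200_000)
    M = Sigma[np.ix_(A, B)] @ np.linalg.inv(Sigma[np.ix_(B, B)])
    resid = Z[:, A] - Z[:, B] @ M.T              # Z_A - M Z_B for each sample

    C = np.cov(np.hstack([resid, Z[:, B]]).T)
    print(C[:2, 2])                              # cross-covariance with Z_B: close to 0
    print(C[:2, :2])                             # close to the conditional covariance below
    print(Sigma[np.ix_(A, A)] - M @ Sigma[np.ix_(B, A)])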
Neighbourhood selection
We are going to specialize to $A = \{k\}$ and $B = \{1, \ldots, p\} \setminus \{k\}$, writing $-k$ for the latter. Then we can separate out the “mean” part and write
\[
  Z_k = M_k + Z_{-k}^T \Sigma_{-k,-k}^{-1}\Sigma_{-k,k} + \varepsilon_k,
\]
where
\[
  M_k = \mu_k - \mu_{-k}^T \Sigma_{-k,-k}^{-1}\Sigma_{-k,k}, \qquad
  \varepsilon_k \mid Z_{-k} \sim N\big(0, \Sigma_{k,k} - \Sigma_{k,-k}\Sigma_{-k,-k}^{-1}\Sigma_{-k,k}\big).
\]
This now looks like we are doing regression.
We observe that
Lemma. Given $k$, let $j_0$ be such that $(Z_{-k})_j = Z_{j_0}$; this $j_0$ is either $j$ or $j + 1$, depending on whether $j_0$ comes before or after $k$. If the $j$th component of $\Sigma_{-k,-k}^{-1}\Sigma_{-k,k}$ is $0$, then $Z_k \perp\!\!\!\perp Z_{j_0} \mid Z_{-kj_0}$.
Proof. If the $j$th component of $\Sigma_{-k,-k}^{-1}\Sigma_{-k,k}$ is $0$, then the distribution of $Z_k \mid Z_{-k}$ will not depend on $(Z_{-k})_j = Z_{j_0}$ (here $j_0$ is either $j$ or $j + 1$, depending on where $k$ is). So we know that
\[
  Z_k \mid Z_{-k} \overset{d}{=} Z_k \mid Z_{-kj_0}.
\]
This is the same as saying $Z_k \perp\!\!\!\perp Z_{j_0} \mid Z_{-kj_0}$.
Neighbourhood selection exploits this fact. Given $x_1, \ldots, x_n$ which are iid $\sim Z$, and
\[
  X = (x_1^T, \cdots, x_n^T)^T,
\]
we can estimate $\Sigma_{-k,-k}^{-1}\Sigma_{-k,k}$ by regressing $X_k$ on $X_{-k}$ using the Lasso (with an intercept term). We then obtain selected sets $\hat{S}_k$. There are two ways of estimating the CIG based on these:
– OR rule: we add the edge $(j, k)$ if $j \in \hat{S}_k$ or $k \in \hat{S}_j$.
– AND rule: we add the edge $(j, k)$ if $j \in \hat{S}_k$ and $k \in \hat{S}_j$.
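The procedure is straightforward to implement. Below is a sketch (mine, not the notes') using scikit-learn's Lasso: each column is regressed on the others, the non-zero coefficients give the selected sets $\hat{S}_k$, and the OR or AND rule combines them into an edge set. The tuning parameter lam is taken as given, and scikit-learn's parametrisation of the Lasso penalty need not match the $\lambda$ used elsewhere in the notes.

    import numpy as np
    from sklearn.linear_model import Lasso

    def neighbourhood_selection(X, lam, rule="OR"):
        n, p = X.shape
        S_hat = []                                # S_hat[k] = selected set for variable k
        for k in range(p):
            others = [j for j in range(p) if j != k]
            fit = Lasso(alpha=lam).fit(X[:, others], X[:, k])   # intercept fitted by default
            S_hat.append({others[j] for j in np.flatnonzero(fit.coef_)})

        edges = set()
        for j in range(p):
            for k in range(j + 1, p):
                in_j, in_k = (k in S_hat[j]), (j in S_hat[k])
                if (rule == "OR" and (in_j or in_k)) or (rule == "AND" and (in_j and in_k)):
                    edges.add((j, k))
        return edges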
The graphical Lasso
Another way of finding the conditional independence graph is to compute $\operatorname{var}(Z_{\{j,k\}} \mid Z_{-jk})$ directly. The following lemma will be useful:
Lemma. Let $M \in \mathbb{R}^{p \times p}$ be positive definite, and write
\[
  M = \begin{pmatrix} P & Q \\ Q^T & R \end{pmatrix},
\]
where $P$ and $R$ are square. The Schur complement of $R$ is
\[
  S = P - Q R^{-1} Q^T.
\]
Note that this has the same size as $P$. Then
(i) $S$ is positive definite.
(ii)
\[
  M^{-1} = \begin{pmatrix} S^{-1} & -S^{-1} Q R^{-1} \\ -R^{-1} Q^T S^{-1} & R^{-1} + R^{-1} Q^T S^{-1} Q R^{-1} \end{pmatrix}.
\]
(iii) $\det(M) = \det(S)\det(R)$.
We have seen this Schur complement before, when we looked at $\operatorname{var}(Z_A \mid Z_{A^c})$, where we got
\[
  \operatorname{var}(Z_A \mid Z_{A^c}) = \Sigma_{A,A} - \Sigma_{A,A^c}\Sigma_{A^c,A^c}^{-1}\Sigma_{A^c,A} = \Omega_{A,A}^{-1},
\]
where $\Omega = \Sigma^{-1}$ is the precision matrix.
Specializing to the case where $A = \{j, k\}$, we have
\[
  \operatorname{var}(Z_{\{j,k\}} \mid Z_{-jk}) = \frac{1}{\det(\Omega_{A,A})}
  \begin{pmatrix} \Omega_{k,k} & -\Omega_{j,k} \\ -\Omega_{j,k} & \Omega_{j,j} \end{pmatrix}.
\]
This tells us that $Z_k \perp\!\!\!\perp Z_j \mid Z_{-kj}$ iff $\Omega_{jk} = 0$. Thus, we can approximate the conditional independence graph by computing the precision matrix $\Omega$.
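As a toy illustration (not from the notes), we can build a covariance matrix whose precision matrix has a prescribed zero, and read the conditional independence graph straight off $\Omega$:

    import numpy as np

    # Omega chosen by hand so that Omega[0, 2] = 0, i.e. Z_0 and Z_2 are
    # conditionally independent given Z_1.
    Omega = np.array([[ 2.0, -0.8,  0.0],
                      [-0.8,  2.0, -0.5],
                      [ 0.0, -0.5,  2.0]])
    Sigma = np.linalg.inv(Omega)

    recovered = np.linalg.inv(Sigma)
    edges = [(j, k) for j in range(3) for k in range(j + 1, 3)
             if abs(recovered[j, k]) > 1e-10]
    print(edges)   # [(0, 1), (1, 2)]: no edge between 0 and 2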
Our method to estimate the precision matrix is similar to the Lasso. Recall that the density of $N_p(\mu, \Sigma)$ is
\[
  P(z) = \frac{1}{(2\pi)^{p/2}(\det\Sigma)^{1/2}} \exp\left(-\frac{1}{2}(z - \mu)^T \Sigma^{-1}(z - \mu)\right).
\]
The log-likelihood of $(\mu, \Omega)$ based on an iid sample $(X_1, \ldots, X_n)$ is (after dropping a constant)
\[
  \ell(\mu, \Omega) = \frac{n}{2}\log\det\Omega - \frac{1}{2}\sum_{i=1}^n (x_i - \mu)^T\Omega(x_i - \mu).
\]
To simplify this, we let
\[
  \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad S = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T.
\]
Then
\begin{align*}
  \sum (x_i - \mu)^T\Omega(x_i - \mu) &= \sum (x_i - \bar{x} + \bar{x} - \mu)^T\Omega(x_i - \bar{x} + \bar{x} - \mu) \\
  &= \sum (x_i - \bar{x})^T\Omega(x_i - \bar{x}) + n(\bar{x} - \mu)^T\Omega(\bar{x} - \mu).
\end{align*}
We have
\[
  \sum (x_i - \bar{x})^T\Omega(x_i - \bar{x}) = \sum \operatorname{tr}\big((x_i - \bar{x})^T\Omega(x_i - \bar{x})\big) = \sum \operatorname{tr}\big(\Omega(x_i - \bar{x})(x_i - \bar{x})^T\big) = n\operatorname{tr}(S\Omega),
\]
using the cyclic property of the trace.
So we now have
\[
  \ell(\mu, \Omega) = -\frac{n}{2}\Big(\operatorname{tr}(S\Omega) - \log\det\Omega + (\bar{x} - \mu)^T\Omega(\bar{x} - \mu)\Big).
\]
We are really interested in estimating $\Omega$. So we should try to maximize this over $\mu$, but that is easy, since this is the same as minimizing $(\bar{x} - \mu)^T\Omega(\bar{x} - \mu)$, and we know $\Omega$ is positive definite. So we should set $\mu = \bar{x}$. Thus, we have
\[
  \ell(\Omega) = \max_{\mu \in \mathbb{R}^p} \ell(\mu, \Omega) = -\frac{n}{2}\big(\operatorname{tr}(S\Omega) - \log\det\Omega\big).
\]
So we can solve for the MLE of $\Omega$ by solving
\[
  \min_{\Omega : \Omega \succ 0} \big(-\log\det\Omega + \operatorname{tr}(S\Omega)\big).
\]
One can show that this is convex, and to find the MLE, we can just differentiate:
\[
  \frac{\partial}{\partial \Omega_{jk}} \log\det\Omega = (\Omega^{-1})_{jk}, \qquad
  \frac{\partial}{\partial \Omega_{jk}} \operatorname{tr}(S\Omega) = S_{jk},
\]
using that $S$ and $\Omega$ are symmetric. So provided that $S$ is positive definite, the maximum likelihood estimate is just
\[
  \hat{\Omega} = S^{-1}.
\]
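As a quick numerical sanity check (my own, not from the notes), we can verify on simulated data that, with $S$ positive definite, the profile log-likelihood $\ell(\Omega) = -\frac{n}{2}\big(\operatorname{tr}(S\Omega) - \log\det\Omega\big)$ is indeed maximised at $\hat{\Omega} = S^{-1}$, by comparing it with nearby positive definite matrices.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 200, 4                       # n much larger than p, so S is positive definite
    X = rng.multivariate_normal(np.zeros(p), np.eye(p) + 0.5, size=n)

    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar) / n   # note the 1/n, matching the definition above

    def loglik(Omega):
        sign, logdet = np.linalg.slogdet(Omega)
        assert sign > 0, "Omega must be positive definite"
        return -n / 2 * (np.trace(S @ Omega) - logdet)

    Omega_hat = np.linalg.inv(S)
    for _ in range(5):
        D = rng.normal(size=(p, p))
        D = (D + D.T) / 2               # small random symmetric perturbation
        assert loglik(Omega_hat) >= loglik(Omega_hat + 0.01 * D)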
But we are interested in the high-dimensional situation, where we have loads of variables, and $S$ cannot be positive definite. To solve this, we use the graphical Lasso.
The graphical Lasso solves
\[
  \operatorname*{argmin}_{\Omega : \Omega \succ 0} \big(-\log\det\Omega + \operatorname{tr}(S\Omega) + \lambda\|\Omega\|_1\big),
\]
where
\[
  \|\Omega\|_1 = \sum_{j,k} |\Omega_{jk}|.
\]
Often, people don't sum over the diagonal elements, as it is only the off-diagonal elements that we want to know whether to set to zero. Similar to the case of the Lasso, this gives a sparse estimate of $\Omega$, from which we may estimate the conditional independence graph.
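For completeness, here is how one might compute such an estimate with scikit-learn's GraphicalLasso (a sketch; the choice of alpha is arbitrary here, and the library's exact penalty convention, e.g. whether the diagonal is penalised, should be checked against its documentation rather than read off the formula above):

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(2)
    Omega_true = np.array([[ 2.0, -0.8,  0.0],
                           [-0.8,  2.0, -0.5],
                           [ 0.0, -0.5,  2.0]])
    X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(Omega_true), size=500)

    gl = GraphicalLasso(alpha=0.05).fit(X)
    Omega_hat = gl.precision_           # sparse estimate of the precision matrix
    edges = [(j, k) for j in range(3) for k in range(j + 1, 3)
             if abs(Omega_hat[j, k]) > 1e-8]
    print(edges)                        # estimated conditional independence graph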