4 Graphical modelling

III Modern Statistical Methods



4.2 Structural equation modelling
The conditional independence graph only tells us which variables are related to one another; it does not tell us anything about causal relations between the different variables. We first need to explain what we mean by a causal model, and for this, we need the notion of a directed acyclic graph.
Definition (Path). A path from $j$ to $k$ is a sequence $j = j_1, j_2, \ldots, j_m = k$ of (at least two) distinct vertices such that $j_\ell$ and $j_{\ell+1}$ are adjacent.

A path is directed if $j_\ell \to j_{\ell+1}$ for all $\ell$.
Definition (Directed acyclic graph (DAG)). A directed cycle is (almost) a
directed path but with the start and end points the same.
A directed acyclic graph (DAG) is a directed graph containing no directed
cycles.
[Figure: two directed graphs on vertices $a$, $b$, $c$: one containing no directed cycle (a DAG), and one containing a directed cycle (not a DAG).]
We will use directed acyclic graphs to encode causal structures, where we
have a path from a to b if a “affects” b.
Definition (Structural equation model (SEM)). A structural equation model $\mathcal{S}$ for a random vector $Z \in \mathbb{R}^p$ is a collection of equations
\[
  Z_k = h_k(Z_{\mathcal{P}_k}, \varepsilon_k),
\]
where $k = 1, \ldots, p$ and $\varepsilon_1, \ldots, \varepsilon_p$ are independent, and $\mathcal{P}_k \subseteq \{1, \ldots, p\} \setminus \{k\}$ are such that the graph with $\mathrm{pa}(k) = \mathcal{P}_k$ is a directed acyclic graph.
Example. Consider three random variables:
(i) $Z_1 = 1$ if a student is taking a course, $0$ otherwise;
(ii) $Z_2 = 1$ if a student is attending catch-up lectures, $0$ otherwise;
(iii) $Z_3 = 1$ if a student heard about machine learning before attending the course, $0$ otherwise.
Suppose
\[
  Z_3 = \varepsilon_3 \sim \mathrm{Bernoulli}(0.25),
\]
\[
  Z_2 = \mathbf{1}_{\{\varepsilon_2(1 + Z_3) > 1/2\}}, \quad \varepsilon_2 \sim U[0, 1],
\]
\[
  Z_1 = \mathbf{1}_{\{\varepsilon_1(Z_2 + Z_3) > 1/2\}}, \quad \varepsilon_1 \sim U[0, 1].
\]
This is then an example of a structural equation model, and the corresponding DAG is
[Figure: DAG on $Z_1, Z_2, Z_3$ with edges $Z_3 \to Z_2$, $Z_3 \to Z_1$ and $Z_2 \to Z_1$.]
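As a sanity check, this model is easy to simulate. A minimal sketch (the variable names mirror the equations above; nothing beyond NumPy is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Independent exogenous noise, one variable per structural equation.
eps3 = rng.random(n) < 0.25   # epsilon_3 ~ Bernoulli(0.25)
eps2 = rng.random(n)          # epsilon_2 ~ U[0, 1]
eps1 = rng.random(n)          # epsilon_1 ~ U[0, 1]

# Structural equations, evaluated in topological order Z3, Z2, Z1.
Z3 = eps3.astype(int)
Z2 = (eps2 * (1 + Z3) > 0.5).astype(int)
Z1 = (eps1 * (Z2 + Z3) > 0.5).astype(int)

print(Z3.mean(), Z2.mean(), Z1.mean())  # ~ 1/4, ~ 9/16, ~ 23/64
```

Each equation only uses its own noise term and the already-computed values of its parents, which is exactly what the DAG records.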
Note that the structural equation model for $Z$ determines its distribution, but the converse is not true. For example, the following two distinct structural equation models give rise to the same distribution:
\[
  Z_1 = \varepsilon, \quad Z_2 = Z_1 \qquad \text{and} \qquad Z_1 = Z_2, \quad Z_2 = \varepsilon.
\]
Indeed, if we have two variables that are just always the same, it is hard to tell
which is the cause and which is the effect.
It would be convenient if we could order our variables in a way that $Z_k$ depends only on $Z_j$ for $j < k$. This is known as a topological ordering:
Definition (Descendant). We say $k$ is a descendant of $j$ if there is a directed path from $j$ to $k$. The set of descendants of $j$ will be denoted $\mathrm{de}(j)$.
Definition (Topological ordering). Given a DAG $G$ with $V = \{1, \ldots, p\}$, we say that a permutation $\pi \colon V \to V$ is a topological ordering if $\pi(j) < \pi(k)$ whenever $k \in \mathrm{de}(j)$.
Thus, given a topological ordering $\pi$, we can write $Z_k$ as a function of $\varepsilon_{\pi^{-1}(1)}, \ldots, \varepsilon_{\pi^{-1}(\pi(k))}$.
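A topological ordering can be computed directly. The sketch below uses Kahn's algorithm (not named in the text; the encoding of the DAG as a parent map is my own convention), applied to the DAG of the running example, where $\mathrm{pa}(1) = \{2, 3\}$, $\mathrm{pa}(2) = \{3\}$ and $\mathrm{pa}(3) = \emptyset$:

```python
from collections import deque

def topological_order(pa):
    """Return the vertices in an order where every vertex comes after all
    of its parents (Kahn's algorithm).  `pa` maps a vertex to its parents."""
    children = {v: set() for v in pa}
    indeg = {v: len(parents) for v, parents in pa.items()}
    for v, parents in pa.items():
        for u in parents:
            children[u].add(v)
    queue = deque(v for v in pa if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in children[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(pa):
        raise ValueError("graph contains a directed cycle")
    return order

# Running example: pa(1) = {2, 3}, pa(2) = {3}, pa(3) is empty.
print(topological_order({1: {2, 3}, 2: {3}, 3: set()}))  # → [3, 2, 1]
```

The returned list is $(\pi^{-1}(1), \ldots, \pi^{-1}(p))$ in the notation of the definition; if the graph has a directed cycle, the queue runs dry before all vertices are output, which is how the error branch detects a non-DAG.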
How do we understand structural equation models? They give us information that is not encoded in the distribution itself. One way to think about them is via interventions. We can modify a structural equation model by replacing the equation for $Z_k$ by setting, e.g. $Z_k = a$. In real life, this may correspond to forcing all students to go to catch-up workshops. This is called a perfect intervention. The modified SEM gives a new joint distribution for $Z$. Expectations or probabilities with respect to the new distribution are written by adding “$\mathrm{do}(Z_k = a)$”. For example, we write
\[
  \mathbb{E}(Z_j \mid \mathrm{do}(Z_k = a)).
\]
In general, this is different to $\mathbb{E}(Z_j \mid Z_k = a)$, since, for example, if we conditioned on $Z_2 = a$ in our example, then that would tell us something about $Z_3$.
Example. After the intervention $\mathrm{do}(Z_2 = 1)$, i.e. we force everyone to go to the catch-up lectures, we have a new SEM with
\[
  Z_3 = \varepsilon_3 \sim \mathrm{Bernoulli}(0.25),
\]
\[
  Z_2 = 1,
\]
\[
  Z_1 = \mathbf{1}_{\{\varepsilon_1(1 + Z_3) > 1/2\}}, \quad \varepsilon_1 \sim U[0, 1].
\]
Then, for example, we can compute
\[
  P(Z_1 = 1 \mid \mathrm{do}(Z_2 = 1)) = \frac{1}{2} \cdot \frac{3}{4} + \frac{3}{4} \cdot \frac{1}{4} = \frac{9}{16},
\]
and by high school probability, we also have
\[
  P(Z_1 = 1 \mid Z_2 = 1) = \frac{7}{12}.
\]
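Both numbers are easy to confirm by Monte Carlo. A sketch, re-using the equations of the example: the interventional probability is estimated by re-running the modified SEM, the conditional one by filtering samples of the original SEM.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400_000

def sample(do_z2=None):
    """Sample the SEM; if do_z2 is given, the equation for Z2 is replaced."""
    Z3 = (rng.random(n) < 0.25).astype(int)
    if do_z2 is None:
        Z2 = (rng.random(n) * (1 + Z3) > 0.5).astype(int)
    else:
        Z2 = np.full(n, do_z2)
    Z1 = (rng.random(n) * (Z2 + Z3) > 0.5).astype(int)
    return Z1, Z2, Z3

# Interventional distribution: re-run the modified SEM with Z2 := 1.
Z1_do, _, _ = sample(do_z2=1)
print(Z1_do.mean())        # ~ 9/16 = 0.5625

# Observational conditioning: sample the original SEM, keep the Z2 = 1 cases.
Z1, Z2, Z3 = sample()
print(Z1[Z2 == 1].mean())  # ~ 7/12 ≈ 0.583

# The gap arises because conditioning also shifts Z3:
print(Z3[Z2 == 1].mean())  # ~ P(Z3 = 1 | Z2 = 1) = 1/3, not 1/4
```

The last line makes the remark above concrete: observing attendance at the catch-up lectures is evidence about $Z_3$, whereas forcing attendance leaves the distribution of $Z_3$ untouched.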
To understand the DAGs associated to structural equation models, we would like to come up with Markov properties analogous to what we had for undirected graphs. This, in particular, requires the correct notion of “separation”, which we call d-separation. Our notion should be such that if $S$ d-separates $A$ and $B$ in the DAG, then $Z_A$ and $Z_B$ are conditionally independent given $Z_S$. Let's think about some examples. For example, we might have a DAG that looks like this:
[Figure: DAG on $a, p, s, q, b$ with edges $a \to p \to s \leftarrow q \leftarrow b$.]
Then we expect that
(i) $Z_a$ and $Z_s$ are not independent;
(ii) $Z_a$ and $Z_s$ are independent given $Z_p$;
(iii) $Z_a$ and $Z_b$ are independent;
(iv) $Z_a$ and $Z_b$ are not independent given $Z_s$.
We can explain a bit more about the last one. For example, the structural equation model might tell us that
\[
  Z_s = Z_a + Z_b + \varepsilon.
\]
In this case, if we know that $Z_a$ is large but $Z_s$ is small, then chances are, $Z_b$ is also large (in the opposite sign). The point is that $Z_a$ and $Z_b$ both contribute to $Z_s$, and if we know one of the contributions and the result, we can say something about the other contribution as well.
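This effect shows up clearly in simulation. A sketch, assuming for concreteness that $Z_a$, $Z_b$ and $\varepsilon$ are independent standard normals (the text does not fix the distributions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

Za = rng.standard_normal(n)
Zb = rng.standard_normal(n)            # independent of Za
Zs = Za + Zb + rng.standard_normal(n)  # Zs = Za + Zb + eps

# Marginally, Za and Zb are independent: correlation ~ 0.
print(np.corrcoef(Za, Zb)[0, 1])

# Conditioning on Zs (here: restricting to samples with Zs near 0)
# induces a negative correlation: a large Za must be offset by a small Zb.
near_zero = np.abs(Zs) < 0.1
print(np.corrcoef(Za[near_zero], Zb[near_zero])[0, 1])  # ~ -0.5
```

For this Gaussian choice the exact conditional correlation given $Z_s$ works out to $-1/2$, which the restricted sample reproduces approximately.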
Similarly, if we have a DAG that looks like
[Figure: DAG with edges $a \to s \leftarrow b$.]
then as above, we know that $Z_a$ and $Z_b$ are not independent given $Z_s$.
Another example is
[Figure: the chain $a \to p \to s \to q \to b$.]
Here we expect that
(i) $Z_a$ and $Z_b$ are not independent;
(ii) $Z_a$ and $Z_b$ are independent given $Z_s$.
To see (i), we observe that if we know about $Z_a$, then this allows us to predict some information about $Z_s$, which would in turn let us say something about $Z_b$.
Definition (Blocked). In a DAG, we say a path $(j_1, \ldots, j_m)$ between $j_1$ and $j_m$ is blocked by a set of nodes $S$ (with neither $j_1$ nor $j_m$ in $S$) if there is some $j_\ell \in S$ and the path is not of the form
\[
  j_{\ell - 1} \to j_\ell \leftarrow j_{\ell + 1},
\]
or there is some $j_\ell$ such that the path is of this form, but neither $j_\ell$ nor any of its descendants are in $S$.
Definition (d-separate). If $G$ is a DAG, given a triple of (disjoint) subsets of nodes $A, B, S$, we say $S$ d-separates $A$ from $B$ if $S$ blocks every path from $A$ to $B$.
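The definition can be turned into code directly: enumerate the (undirected) paths between the two sets and check that each is blocked. This is a brute-force sketch, exponential in general (practical implementations use e.g. the Bayes-ball algorithm); a DAG is encoded as a map from each vertex to its set of children, a convention of this sketch rather than of the text.

```python
def descendants(children, j):
    """All vertices reachable from j along directed edges."""
    out, stack = set(), [j]
    while stack:
        for c in children[stack.pop()]:
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def is_blocked(path, S, children):
    """Is this path blocked by S?  Per the definition: a non-collider on the
    path that lies in S blocks, and a collider blocks unless it or one of
    its descendants lies in S."""
    for i in range(1, len(path) - 1):
        prev, cur, nxt = path[i - 1], path[i], path[i + 1]
        collider = cur in children[prev] and cur in children[nxt]
        if collider:
            if cur not in S and not (descendants(children, cur) & S):
                return True
        elif cur in S:
            return True
    return False

def d_separates(S, A, B, children):
    """S d-separates A from B if it blocks every path from A to B."""
    def neighbours(v):  # adjacency, ignoring edge directions
        return children[v] | {u for u in children if v in children[u]}
    def paths(a, b):    # all simple (undirected) paths from a to b
        stack = [(a, [a])]
        while stack:
            v, p = stack.pop()
            for w in neighbours(v):
                if w == b:
                    yield p + [w]
                elif w not in p:
                    stack.append((w, p + [w]))
    return all(is_blocked(p, S, children)
               for a in A for b in B for p in paths(a, b))

# The collider example above, a -> p -> s <- q <- b, as a child map:
G = {"a": {"p"}, "p": {"s"}, "q": {"s"}, "b": {"q"}, "s": set()}
print(d_separates(set(), {"a"}, {"b"}, G))  # True: Za and Zb independent
print(d_separates({"s"}, {"a"}, {"b"}, G))  # False: the collider is in S
```

The two printed lines reproduce facts (iii) and (iv) of the first example above.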
For convenience, we define
Definition ($v$-structure). A set of three nodes is called a $v$-structure if one node is a child of the two other nodes, and these two nodes are not adjacent.
It is now natural to define
Definition (Markov properties). Let $P$ be the distribution of $Z$ and let $f$ be the density. Given a DAG $G$, we say $P$ satisfies
(i) the Markov factorization criterion if
\[
  f(z_1, \ldots, z_p) = \prod_{k=1}^p f(z_k \mid z_{\mathrm{pa}(k)});
\]
(ii) the global Markov property if for all disjoint $A, B, S$ such that $A, B$ is d-separated by $S$, then $Z_A \amalg Z_B \mid Z_S$.
Proposition. If $P$ has a density with respect to a product measure, then (i) and (ii) are equivalent.
How does this fit into the structural equation model?
Proposition. Let $P$ be the distribution of a structural equation model with DAG $G$. Then $P$ obeys the Markov factorization property.
Proof. We assume $G$ is topologically ordered (i.e. the identity map is a topological ordering). Then we can always write
\[
  f(z_1, \ldots, z_p) = f(z_1) f(z_2 \mid z_1) \cdots f(z_p \mid z_1, z_2, \ldots, z_{p-1}).
\]
By definition of a topological order, we know $\mathrm{pa}(k) \subseteq \{1, \ldots, k-1\}$. Moreover, $Z_k$ is a function of $Z_{\mathrm{pa}(k)}$ and the noise $\varepsilon_k$, which is independent of $Z_1, \ldots, Z_{k-1}$ (these are functions of $\varepsilon_1, \ldots, \varepsilon_{k-1}$ alone). So we know
\[
  Z_k \amalg Z_{\{1, \ldots, k-1\} \setminus \mathrm{pa}(k)} \mid Z_{\mathrm{pa}(k)}.
\]
Thus,
\[
  f(z_k \mid z_1, \ldots, z_{k-1}) = f(z_k \mid z_{\mathrm{pa}(k)}).
\]