4 Graphical modelling

III Modern Statistical Methods



4.2 Structural equation modelling
The conditional independence graph only tells us which variables are related to one another; it does not tell us anything about causal relations between the different variables. We first need to explain what we mean by a causal model, and for this, we need the notion of a directed acyclic graph.
Definition (Path). A path from $j$ to $k$ is a sequence $j = j_1, j_2, \ldots, j_m = k$ of (at least two) distinct vertices such that $j_\ell$ and $j_{\ell+1}$ are adjacent.

A path is directed if $j_\ell \to j_{\ell+1}$ for all $\ell$.
Definition (Directed acyclic graph (DAG)). A directed cycle is (almost) a
directed path but with the start and end points the same.
A directed acyclic graph (DAG) is a directed graph containing no directed
cycles.
[Figure: two directed graphs on vertices $a$, $b$, $c$: one containing no directed cycle (a DAG), and one containing a directed cycle (not a DAG).]
We will use directed acyclic graphs to encode causal structures, where we
have a path from a to b if a “affects” b.
Definition (Structural equation model (SEM)). A structural equation model $\mathcal{S}$ for a random vector $Z \in \mathbb{R}^p$ is a collection of equations
\[
  Z_k = h_k(Z_{\mathcal{P}_k}, \varepsilon_k),
\]
where $k = 1, \ldots, p$ and $\varepsilon_1, \ldots, \varepsilon_p$ are independent, and $\mathcal{P}_k \subseteq \{1, \ldots, p\} \setminus \{k\}$ are such that the graph with $\mathrm{pa}(k) = \mathcal{P}_k$ is a directed acyclic graph.
Example. Consider three random variables:
(i) $Z_1 = 1$ if a student is taking a course, $0$ otherwise;
(ii) $Z_2 = 1$ if a student is attending catch-up lectures, $0$ otherwise;
(iii) $Z_3 = 1$ if a student heard about machine learning before attending the course, $0$ otherwise.
Suppose
\[
  Z_3 = \varepsilon_3 \sim \mathrm{Bernoulli}(0.25),
\]
\[
  Z_2 = \mathbf{1}_{\{\varepsilon_2(1 + Z_3) > 1/2\}}, \quad \varepsilon_2 \sim U[0, 1],
\]
\[
  Z_1 = \mathbf{1}_{\{\varepsilon_1(Z_2 + Z_3) > 1/2\}}, \quad \varepsilon_1 \sim U[0, 1].
\]
This is then an example of a structural equation model, and the corresponding DAG is
[Figure: DAG on $Z_1, Z_2, Z_3$ with edges $Z_3 \to Z_2$, $Z_3 \to Z_1$ and $Z_2 \to Z_1$.]
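As a sanity check, this model is easy to simulate. A minimal sketch (the variable names mirror the equations above; nothing beyond NumPy is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Independent exogenous noise, one variable per structural equation.
eps3 = rng.random(n) < 0.25   # epsilon_3 ~ Bernoulli(0.25)
eps2 = rng.random(n)          # epsilon_2 ~ U[0, 1]
eps1 = rng.random(n)          # epsilon_1 ~ U[0, 1]

# Structural equations, evaluated in topological order Z3, Z2, Z1.
Z3 = eps3.astype(int)
Z2 = (eps2 * (1 + Z3) > 0.5).astype(int)
Z1 = (eps1 * (Z2 + Z3) > 0.5).astype(int)

print(Z3.mean(), Z2.mean(), Z1.mean())  # ~ 1/4, ~ 9/16, ~ 23/64
```

Each equation only uses its own noise term and the already-computed values of its parents, which is exactly what the DAG records.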
Note that the structural equation model for $Z$ determines its distribution, but the converse is not true. For example, the following two distinct structural equation models give rise to the same distribution:
\[
  Z_1 = \varepsilon, \quad Z_2 = Z_1 \qquad \text{and} \qquad Z_1 = Z_2, \quad Z_2 = \varepsilon.
\]
Indeed, if we have two variables that are just always the same, it is hard to tell
which is the cause and which is the effect.
It would be convenient if we could order our variables in a way that $Z_k$ depends only on $Z_j$ for $j < k$. This is known as a topological ordering:
Definition (Descendant). We say $k$ is a descendant of $j$ if there is a directed path from $j$ to $k$. The set of descendants of $j$ will be denoted $\mathrm{de}(j)$.
Definition (Topological ordering). Given a DAG $G$ with $V = \{1, \ldots, p\}$, we say that a permutation $\pi \colon V \to V$ is a topological ordering if $\pi(j) < \pi(k)$ whenever $k \in \mathrm{de}(j)$.
Thus, given a topological ordering $\pi$, we can write $Z_k$ as a function of $\varepsilon_{\pi^{-1}(1)}, \ldots, \varepsilon_{\pi^{-1}(\pi(k))}$.
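A topological ordering can be computed directly. The sketch below uses Kahn's algorithm (not named in the text; the encoding of the DAG as a parent map is my own convention), applied to the DAG of the running example, where $\mathrm{pa}(1) = \{2, 3\}$, $\mathrm{pa}(2) = \{3\}$ and $\mathrm{pa}(3) = \emptyset$:

```python
from collections import deque

def topological_order(pa):
    """Return the vertices in an order where every vertex comes after all
    of its parents (Kahn's algorithm).  `pa` maps a vertex to its parents."""
    children = {v: set() for v in pa}
    indeg = {v: len(parents) for v, parents in pa.items()}
    for v, parents in pa.items():
        for u in parents:
            children[u].add(v)
    queue = deque(v for v in pa if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in children[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(pa):
        raise ValueError("graph contains a directed cycle")
    return order

# Running example: pa(1) = {2, 3}, pa(2) = {3}, pa(3) is empty.
print(topological_order({1: {2, 3}, 2: {3}, 3: set()}))  # → [3, 2, 1]
```

The returned list is $(\pi^{-1}(1), \ldots, \pi^{-1}(p))$ in the notation of the definition; if the graph has a directed cycle, the queue runs dry before all vertices are output, which is how the error branch detects a non-DAG.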
How do we understand structural equation models? They give us information that is not encoded in the distribution itself. One way to think about them is via interventions. We can modify a structural equation model by replacing the equation for $Z_k$ by setting, e.g. $Z_k = a$. In real life, this may correspond to forcing all students to go to catch-up workshops. This is called a perfect intervention. The modified SEM gives a new joint distribution for $Z$. Expectations or probabilities with respect to the new distribution are written by adding “$\mathrm{do}(Z_k = a)$”. For example, we write
\[
  \mathbb{E}(Z_j \mid \mathrm{do}(Z_k = a)).
\]
In general, this is different to $\mathbb{E}(Z_j \mid Z_k = a)$, since, for example, if we conditioned on $Z_2 = a$ in our example, then that would tell us something about $Z_3$.
Example. After the intervention $\mathrm{do}(Z_2 = 1)$, i.e. we force everyone to go to the catch-up lectures, we have a new SEM with
\[
  Z_3 = \varepsilon_3 \sim \mathrm{Bernoulli}(0.25),
\]
\[
  Z_2 = 1,
\]
\[
  Z_1 = \mathbf{1}_{\{\varepsilon_1(1 + Z_3) > 1/2\}}, \quad \varepsilon_1 \sim U[0, 1].
\]
Then, for example, we can compute
\[
  P(Z_1 = 1 \mid \mathrm{do}(Z_2 = 1)) = \frac{1}{2} \cdot \frac{3}{4} + \frac{3}{4} \cdot \frac{1}{4} = \frac{9}{16},
\]
and by high school probability, we also have
\[
  P(Z_1 = 1 \mid Z_2 = 1) = \frac{7}{12}.
\]
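Both numbers are easy to confirm by Monte Carlo. A sketch, re-using the equations of the example: the interventional probability is estimated by re-running the modified SEM, the conditional one by filtering samples of the original SEM.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400_000

def sample(do_z2=None):
    """Sample the SEM; if do_z2 is given, the equation for Z2 is replaced."""
    Z3 = (rng.random(n) < 0.25).astype(int)
    if do_z2 is None:
        Z2 = (rng.random(n) * (1 + Z3) > 0.5).astype(int)
    else:
        Z2 = np.full(n, do_z2)
    Z1 = (rng.random(n) * (Z2 + Z3) > 0.5).astype(int)
    return Z1, Z2, Z3

# Interventional distribution: re-run the modified SEM with Z2 := 1.
Z1_do, _, _ = sample(do_z2=1)
print(Z1_do.mean())        # ~ 9/16 = 0.5625

# Observational conditioning: sample the original SEM, keep the Z2 = 1 cases.
Z1, Z2, Z3 = sample()
print(Z1[Z2 == 1].mean())  # ~ 7/12 ≈ 0.583

# The gap arises because conditioning also shifts Z3:
print(Z3[Z2 == 1].mean())  # ~ P(Z3 = 1 | Z2 = 1) = 1/3, not 1/4
```

The last line makes the remark above concrete: observing attendance at the catch-up lectures is evidence about $Z_3$, whereas forcing attendance leaves the distribution of $Z_3$ untouched.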
To understand the DAGs associated to structural equation models, we would like to come up with Markov properties analogous to what we had for undirected graphs. This, in particular, requires the correct notion of “separation”, which we call d-separation. Our notion should be such that if $S$ d-separates $A$ and $B$ in the DAG, then $Z_A$ and $Z_B$ are conditionally independent given $Z_S$. Let's think about some examples. For example, we might have a DAG that looks like this:
[Figure: DAG on $a, p, s, q, b$ with edges $a \to p \to s \leftarrow q \leftarrow b$.]
Then we expect that
(i) $Z_a$ and $Z_s$ are not independent;
(ii) $Z_a$ and $Z_s$ are independent given $Z_p$;
(iii) $Z_a$ and $Z_b$ are independent;
(iv) $Z_a$ and $Z_b$ are not independent given $Z_s$.
We can explain a bit more about the last one. For example, the structural equation model might tell us that
\[
  Z_s = Z_a + Z_b + \varepsilon.
\]
In this case, if we know that $Z_a$ is large but $Z_s$ is small, then chances are, $Z_b$ is also large (in the opposite sign). The point is that $Z_a$ and $Z_b$ both contribute to $Z_s$, and if we know one of the contributions and the result, we can say something about the other contribution as well.
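This effect shows up clearly in simulation. A sketch, assuming for concreteness that $Z_a$, $Z_b$ and $\varepsilon$ are independent standard normals (the text does not fix the distributions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

Za = rng.standard_normal(n)
Zb = rng.standard_normal(n)            # independent of Za
Zs = Za + Zb + rng.standard_normal(n)  # Zs = Za + Zb + eps

# Marginally, Za and Zb are independent: correlation ~ 0.
print(np.corrcoef(Za, Zb)[0, 1])

# Conditioning on Zs (here: restricting to samples with Zs near 0)
# induces a negative correlation: a large Za must be offset by a small Zb.
near_zero = np.abs(Zs) < 0.1
print(np.corrcoef(Za[near_zero], Zb[near_zero])[0, 1])  # ~ -0.5
```

For this Gaussian choice the exact conditional correlation given $Z_s$ works out to $-1/2$, which the restricted sample reproduces approximately.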
Similarly, if we have a DAG that looks like
[Figure: DAG with edges $a \to s \leftarrow b$.]
then as above, we know that $Z_a$ and $Z_b$ are not independent given $Z_s$.
Another example is
[Figure: the chain $a \to p \to s \to q \to b$.]
Here we expect that
(i) $Z_a$ and $Z_b$ are not independent;
(ii) $Z_a$ and $Z_b$ are independent given $Z_s$.
To see (i), we observe that if we know about $Z_a$, then this allows us to predict some information about $Z_s$, which would in turn let us say something about $Z_b$.
Definition (Blocked). In a DAG, we say a path $(j_1, \ldots, j_m)$ between $j_1$ and $j_m$ is blocked by a set of nodes $S$ (with neither $j_1$ nor $j_m$ in $S$) if there is some $j_\ell \in S$ and the path is not of the form
\[
  j_{\ell - 1} \to j_\ell \leftarrow j_{\ell + 1},
\]
or there is some $j_\ell$ such that the path is of this form, but neither $j_\ell$ nor any of its descendants are in $S$.
Definition (d-separate). If $G$ is a DAG, given a triple of (disjoint) subsets of nodes $A, B, S$, we say $S$ d-separates $A$ from $B$ if $S$ blocks every path from $A$ to $B$.
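The definition can be turned into code directly: enumerate the (undirected) paths between the two sets and check that each is blocked. This is a brute-force sketch, exponential in general (practical implementations use e.g. the Bayes-ball algorithm); a DAG is encoded as a map from each vertex to its set of children, a convention of this sketch rather than of the text.

```python
def descendants(children, j):
    """All vertices reachable from j along directed edges."""
    out, stack = set(), [j]
    while stack:
        for c in children[stack.pop()]:
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def is_blocked(path, S, children):
    """Is this path blocked by S?  Per the definition: a non-collider on the
    path that lies in S blocks, and a collider blocks unless it or one of
    its descendants lies in S."""
    for i in range(1, len(path) - 1):
        prev, cur, nxt = path[i - 1], path[i], path[i + 1]
        collider = cur in children[prev] and cur in children[nxt]
        if collider:
            if cur not in S and not (descendants(children, cur) & S):
                return True
        elif cur in S:
            return True
    return False

def d_separates(S, A, B, children):
    """S d-separates A from B if it blocks every path from A to B."""
    def neighbours(v):  # adjacency, ignoring edge directions
        return children[v] | {u for u in children if v in children[u]}
    def paths(a, b):    # all simple (undirected) paths from a to b
        stack = [(a, [a])]
        while stack:
            v, p = stack.pop()
            for w in neighbours(v):
                if w == b:
                    yield p + [w]
                elif w not in p:
                    stack.append((w, p + [w]))
    return all(is_blocked(p, S, children)
               for a in A for b in B for p in paths(a, b))

# The collider example above, a -> p -> s <- q <- b, as a child map:
G = {"a": {"p"}, "p": {"s"}, "q": {"s"}, "b": {"q"}, "s": set()}
print(d_separates(set(), {"a"}, {"b"}, G))  # True: Za and Zb independent
print(d_separates({"s"}, {"a"}, {"b"}, G))  # False: the collider is in S
```

The two printed lines reproduce facts (iii) and (iv) of the first example above.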
For convenience, we define
Definition ($v$-structure). A set of three nodes is called a $v$-structure if one node is a child of the two other nodes, and these two nodes are not adjacent.
It is now natural to define
Definition (Markov properties). Let $P$ be the distribution of $Z$ and let $f$ be the density. Given a DAG $G$, we say $P$ satisfies
(i) the Markov factorization criterion if
\[
  f(z_1, \ldots, z_p) = \prod_{k=1}^p f(z_k \mid z_{\mathrm{pa}(k)});
\]
(ii) the global Markov property if for all disjoint $A, B, S$ such that $A, B$ is d-separated by $S$, then $Z_A \amalg Z_B \mid Z_S$.
Proposition. If $P$ has a density with respect to a product measure, then (i) and (ii) are equivalent.
How does this fit into the structural equation model?
Proposition. Let $P$ be the distribution of a structural equation model with DAG $G$. Then $P$ obeys the Markov factorization property.
Proof. We assume $G$ is topologically ordered (i.e. the identity map is a topological ordering). Then we can always write
\[
  f(z_1, \ldots, z_p) = f(z_1) f(z_2 \mid z_1) \cdots f(z_p \mid z_1, z_2, \ldots, z_{p-1}).
\]
By definition of a topological order, we know $\mathrm{pa}(k) \subseteq \{1, \ldots, k-1\}$. Moreover, $Z_k$ is a function of $Z_{\mathrm{pa}(k)}$ and the noise $\varepsilon_k$, which is independent of $Z_1, \ldots, Z_{k-1}$ (these are functions of $\varepsilon_1, \ldots, \varepsilon_{k-1}$ alone). So we know
\[
  Z_k \amalg Z_{\{1, \ldots, k-1\} \setminus \mathrm{pa}(k)} \mid Z_{\mathrm{pa}(k)}.
\]
Thus,
\[
  f(z_k \mid z_1, \ldots, z_{k-1}) = f(z_k \mid z_{\mathrm{pa}(k)}).
\]