4 Graphical modelling


4.3 The PC algorithm
We now want to recover the structural equation model from data, and in
particular, determine the causal structure. As we previously saw, there is
no hope of determining this completely, even if we know the distribution of
Z exactly. Let us consider the different obstacles to this problem.
Causal minimality
If P is generated by an SEM with DAG G, then from the above, we know that
P is Markov with respect to G. The converse is also true: if P is Markov with
respect to a DAG G, then there exists an SEM with DAG G that generates P.
This immediately implies that P will be Markov with respect to many DAGs.
For example, a DAG whose skeleton is complete will always work. This suggests
the following definition:
Definition (Causal minimality). A distribution P satisfies causal minimality
with respect to G if it is Markov with respect to G, but not with respect to
any proper subgraph of G.
Markov equivalent DAGs
It is natural to aim for finding a causally minimal DAG. However, this does
not give a unique solution, as we saw previously with the two variables that are
always the same.
In general, two different DAGs may satisfy the same set of d-separations,
and then a distribution is Markov with respect to one iff it is Markov with
respect to the other, and we cannot distinguish between the two.
Definition (Markov equivalence). For a DAG G, we let

M(G) = {distributions P such that P is Markov with respect to G}.

We say two DAGs G_1, G_2 are Markov equivalent if M(G_1) = M(G_2).
What is nice is that there is a rather easy way of determining when two
DAGs are Markov equivalent.
Proposition. Two DAGs are Markov equivalent iff they have the same skeleton
and the same set of v-structures.
The set of all DAGs that are Markov equivalent to a given DAG can be
represented by a CPDAG (completed partial DAG), which contains an edge
(j, k) iff some member of the equivalence class does.
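This criterion is easy to check computationally. Here is a small sketch in
Python, where a DAG is represented as a dictionary mapping each node to its
set of parents (this representation and the function names are my own):

```python
def skeleton(pa):
    """Undirected edges of a DAG given as a dict: node -> set of parents."""
    return {frozenset((j, k)) for k, ps in pa.items() for j in ps}

def v_structures(pa):
    """Triples j -> l <- k with j and k non-adjacent."""
    skel = skeleton(pa)
    return {(frozenset((j, k)), l)
            for l, ps in pa.items()
            for j in ps for k in ps
            if j != k and frozenset((j, k)) not in skel}

def markov_equivalent(pa1, pa2):
    return skeleton(pa1) == skeleton(pa2) and v_structures(pa1) == v_structures(pa2)

# The chain 1 -> 2 -> 3 and the fork 1 <- 2 -> 3 are Markov equivalent,
# while the v-structure 1 -> 2 <- 3 is not equivalent to either.
chain = {1: set(), 2: {1}, 3: {2}}
fork  = {1: {2}, 2: set(), 3: {2}}
coll  = {1: set(), 2: {1, 3}, 3: set()}
print(markov_equivalent(chain, fork))   # True
print(markov_equivalent(chain, coll))   # False
```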
Faithfulness
To describe the final issue, consider the SEM

\[
Z_1 = \varepsilon_1,\quad Z_2 = \alpha Z_1 + \varepsilon_2,\quad Z_3 = \beta Z_1 + \gamma Z_2 + \varepsilon_3.
\]

We take ε ∼ N_3(0, I). Then we have Z = (Z_1, Z_2, Z_3) ∼ N(0, Σ), where

\[
\Sigma = \begin{pmatrix}
1 & \alpha & \beta + \alpha\gamma\\
\alpha & \alpha^2 + 1 & \alpha(\beta + \alpha\gamma) + \gamma\\
\beta + \alpha\gamma & \alpha(\beta + \alpha\gamma) + \gamma & \beta^2 + \gamma^2(\alpha^2 + 1) + 1 + 2\alpha\beta\gamma
\end{pmatrix}.
\]
(The DAG of this SEM has edges Z_1 → Z_2, Z_1 → Z_3 and Z_2 → Z_3.)
Now if we pick values of α, β, γ such that

\[
\beta + \alpha\gamma = 0,
\]

then we obtain an extra independence relation Z_1 ⊥ Z_3 in our system. For
example, if we pick β = −1 and α = γ = 1, then

\[
\Sigma = \begin{pmatrix} 1 & 1 & 0\\ 1 & 2 & 1\\ 0 & 1 & 2 \end{pmatrix}.
\]
While there is an extra independence relation, we cannot remove any edge while
still satisfying the Markov property. Indeed:
- If we remove 1 → 2, then this would require Z_1 ⊥ Z_2, but this is not true.

- If we remove 2 → 3, then this would require Z_2 ⊥ Z_3 | Z_1, but we have

  \[
  \operatorname{var}((Z_2, Z_3) \mid Z_1) = \begin{pmatrix} 2 & 1\\ 1 & 2 \end{pmatrix} - \begin{pmatrix} 1\\ 0 \end{pmatrix}\begin{pmatrix} 1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 1\\ 1 & 2 \end{pmatrix},
  \]

  and this is not diagonal.

- If we remove 1 → 3, then this would require Z_1 ⊥ Z_3 | Z_2, but

  \[
  \operatorname{var}((Z_1, Z_3) \mid Z_2) = \begin{pmatrix} 1 & 0\\ 0 & 2 \end{pmatrix} - \frac{1}{2}\begin{pmatrix} 1\\ 1 \end{pmatrix}\begin{pmatrix} 1 & 1 \end{pmatrix} = \begin{pmatrix} \tfrac12 & -\tfrac12\\ -\tfrac12 & \tfrac32 \end{pmatrix},
  \]

  which is again non-diagonal.
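These computations are easy to verify numerically. Here is a minimal sketch
using numpy (the matrix names are my own), writing the SEM as Z = BZ + ε so
that Σ = (I − B)^{−1}(I − B)^{−T}:

```python
import numpy as np

# SEM coefficients: Z = B Z + eps with eps ~ N(0, I),
# and beta = -1, alpha = gamma = 1, so that beta + alpha*gamma = 0.
alpha, beta, gamma = 1.0, -1.0, 1.0
B = np.array([[0,     0,     0],
              [alpha, 0,     0],
              [beta,  gamma, 0]])

# Z = (I - B)^{-1} eps, hence Sigma = (I - B)^{-1} (I - B)^{-T}.
A = np.linalg.inv(np.eye(3) - B)
Sigma = A @ A.T
print(Sigma)  # [[1, 1, 0], [1, 2, 1], [0, 1, 2]]

# Conditional covariance of (Z_2, Z_3) given Z_1, via the Schur complement.
i, c = [1, 2], [0]
V = Sigma[np.ix_(i, i)] - Sigma[np.ix_(i, c)] @ np.linalg.inv(Sigma[np.ix_(c, c)]) @ Sigma[np.ix_(c, i)]
print(V)      # [[1, 1], [1, 2]] -- not diagonal
```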
So this DAG satisfies causal minimality. However, P can also be generated by
the structural equation model

\[
\tilde Z_1 = \tilde\varepsilon_1,\quad \tilde Z_2 = \tilde Z_1 + \tfrac12 \tilde Z_3 + \tilde\varepsilon_2,\quad \tilde Z_3 = \tilde\varepsilon_3,
\]

where the ε̃_i are independent with ε̃_1 ∼ N(0, 1), ε̃_2 ∼ N(0, 1/2),
ε̃_3 ∼ N(0, 2).
Then this has the DAG with edges Z̃_1 → Z̃_2 and Z̃_3 → Z̃_2.
This is a strictly smaller DAG in terms of the number of edges involved. It is
easy to see that this satisfies causal minimality.
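One can check numerically that this alternative SEM generates the same Σ; a
short sketch with numpy (names are again my own):

```python
import numpy as np

# Alternative SEM: Z1 = e1, Z2 = Z1 + Z3/2 + e2, Z3 = e3,
# with var(e1) = 1, var(e2) = 1/2, var(e3) = 2.
B = np.array([[0, 0, 0  ],
              [1, 0, 0.5],
              [0, 0, 0  ]])
D = np.diag([1.0, 0.5, 2.0])   # noise variances
A = np.linalg.inv(np.eye(3) - B)
print(A @ D @ A.T)             # [[1, 1, 0], [1, 2, 1], [0, 1, 2]] -- the same Sigma
```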
Definition (Faithfulness). We say P is faithful to a DAG G if it is Markov
with respect to G and for all disjoint A, B, S, Z_A ⊥ Z_B | Z_S implies that
A and B are d-separated by S.
Determining the DAG
We shall assume our distribution is faithful to some G^0, and see if we can
figure out G^0 from P, or even better, from data.

To find G^0, the following proposition helps us compute the skeleton:
Proposition. If nodes j and k are adjacent in a DAG G, then no set can
d-separate them. If they are not adjacent, and π is a topological order for
G with π(j) < π(k), then they are d-separated by pa(k).
Proof. Only the last part requires proof. Consider a path j = j_1, . . . , j_m = k.
Start reading the path from k and go backwards. If it starts as j_{m−1} → k,
then j_{m−1} is a parent of k, and hence blocks the path. Otherwise, it looks
like k → j_{m−1}. We keep going along the path until we first see something of
the form

\[
k \to \cdots \to a \leftarrow \cdots.
\]

This must exist, since j is not a descendant of k by the topological ordering.
The collider a blocks the path unless it or one of its descendants lies in the
conditioning set pa(k). So it suffices to show that a does not have a
descendant in pa(k); but if it did, then together with the directed path from
k to a this would form a closed loop.
Finding the v-structures is harder, and at best we can do so up to Markov
equivalence. To do that, observe the following:

Proposition. Suppose we have j − ℓ − k in the skeleton of a DAG.
(i) If j → ℓ ← k, then no S that d-separates j and k can have ℓ ∈ S.
(ii) If there exists an S that d-separates j and k and ℓ ∉ S, then j → ℓ ← k.
Denote the set of nodes adjacent to the vertex k in the graph G by adj(G, k).
We can now describe the first part of the PC algorithm, which finds the
skeleton of the "true DAG":

(i) Set Ĝ to be the complete undirected graph. Set ℓ = −1.
(ii) Repeat the following steps:
    (a) Set ℓ = ℓ + 1.
    (b) Repeat the following steps:
        i. Select a new ordered pair of nodes j, k that are adjacent in Ĝ and
           such that |adj(Ĝ, j) \ {k}| ≥ ℓ.
        ii. Repeat the following steps:
            A. Choose a new S ⊆ adj(Ĝ, j) \ {k} with |S| = ℓ.
            B. If Z_j ⊥ Z_k | Z_S, then delete the edge j − k, and store
               S(k, j) = S(j, k) = S.
            C. Repeat until the edge j − k is deleted or all such S have been
               chosen.
        iii. Repeat until all pairs of adjacent nodes have been inspected.
    (c) Repeat until ℓ ≥ p − 2.
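In code, the skeleton phase might look as follows. This is a sketch, not a
tuned implementation: ci_test stands in for the conditional independence
oracle (or, on data, a statistical test), and all names are my own.

```python
from itertools import combinations

def pc_skeleton(nodes, ci_test):
    """Skeleton phase of the PC algorithm.

    ci_test(j, k, S) returns True when Z_j is judged independent of
    Z_k given Z_S -- an oracle here, a statistical test in practice.
    """
    adj = {v: set(nodes) - {v} for v in nodes}   # (i) complete undirected graph
    sepset = {}
    ell = -1
    # (ii) grow the conditioning-set size while some pair still qualifies
    while any(len(adj[j] - {k}) >= ell + 1 for j in nodes for k in adj[j]):
        ell += 1
        for j in nodes:
            for k in list(adj[j]):               # ordered pairs adjacent in G-hat
                if len(adj[j] - {k}) < ell:
                    continue
                for S in combinations(sorted(adj[j] - {k}), ell):
                    if ci_test(j, k, set(S)):
                        adj[j].discard(k)        # delete the edge j - k
                        adj[k].discard(j)
                        sepset[(j, k)] = sepset[(k, j)] = set(S)
                        break
    return adj, sepset
```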
Suppose P is faithful to a DAG G^0. At each stage of the algorithm, the
skeleton of G^0 will be a subgraph of Ĝ. On the other hand, any edge (j, k)
remaining at termination will be such that Z_j is not independent of Z_k given
Z_S for all S ⊆ adj(Ĝ, j) \ {k} and all S ⊆ adj(Ĝ, k) \ {j}. So they must be
adjacent in G^0, since otherwise pa(k) (or pa(j)) would be such a set that
d-separates them. Thus, Ĝ and G^0 have the same skeleton.
To find the v-structures, we perform:

(i) For all non-adjacent pairs j, k with j − ℓ − k in Ĝ, do:
    (a) If ℓ ∉ S(j, k), then orient j → ℓ ← k.

This gives us the Markov equivalence class, and we may orient the other edges
using other properties such as acyclicity.
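This orientation step is short enough to sketch directly, continuing the
pc_skeleton sketch above (the arrow representation is my own):

```python
def orient_v_structures(adj, sepset):
    """Orient j -> l <- k whenever j, k are non-adjacent, l is a common
    neighbour, and l is not in the stored separating set S(j, k)."""
    arrows = set()
    for (j, k), S in sepset.items():     # sepset only holds non-adjacent pairs
        for l in adj[j] & adj[k]:        # common neighbours of j and k
            if l not in S:
                arrows.add((j, l))       # (a, b) means a -> b
                arrows.add((k, l))
    return arrows
```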
If we want to apply this to data sets, then we need to use conditional
independence tests instead of querying an oracle to decide which conditional
independences hold. However, errors in these tests propagate, and the whole
process may be derailed by mistakes made early on. Moreover, the result of the
algorithm may depend on the order in which we iterate through the nodes. People
have tried many ways to fix these problems, but in general, this method is
rather unstable. Yet, if we have large data sets, it can produce quite decent
results.
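For Gaussian data, one standard choice is a test based on the Fisher
z-transform of the sample partial correlation. A minimal sketch (the function
name, the default level α = 0.01, and the interface are my own):

```python
import numpy as np
from scipy import stats

def gaussian_ci_test(data, j, k, S, alpha=0.01):
    """Fisher z-test of Z_j independent of Z_k given Z_S for Gaussian data.

    data is an (n, p) array; returns True when independence is not rejected.
    Assumes n > len(S) + 3.
    """
    n = data.shape[0]
    idx = [j, k] + sorted(S)
    prec = np.linalg.inv(np.corrcoef(data[:, idx], rowvar=False))
    rho = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])   # partial correlation
    z = 0.5 * np.log((1 + rho) / (1 - rho))                # Fisher z-transform
    return np.sqrt(n - len(S) - 3) * abs(z) <= stats.norm.ppf(1 - alpha / 2)

# This plugs into the pc_skeleton sketch above, e.g.
#   ci = lambda j, k, S: gaussian_ci_test(X, j, k, S)
#   skel, sepset = pc_skeleton(list(range(X.shape[1])), ci)
```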