Introduction To Bayesian Networks - Koski - Noble

Timo Koski,
Institutionen för matematik,
Kungliga Tekniska Högskolan,
10044 STOCKHOLM, Sweden

John M. Noble,
Faculty of Mathematics, Informatics and Mechanics,
University of Warsaw,
ul. Banacha 2,
02-097 WARSZAWA, Poland
Contents
Introduction 1
3 Intervention Calculus 45
3.1 Causal Models and Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Conditioning by Observation and by Intervention . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 The Intervention Calculus for a Bayesian Network . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Causal Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.1 Establishing a Causal Model via a Controlled Experiment . . . . . . . . . . . . . . 51
3.5 Properties of Intervention Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.6 Confounding, The `Sure Thing' Principle and Simpson's Paradox . . . . . . . . . . . . . . 56
3.6.1 Confounding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.6.2 Simpson's Paradox . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.6.3 The Sure Thing Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.7 Identifiability: Back-Door and Front-Door Criteria . . . . . . . . . . . . . . . . . . . . . 59
3.7.1 Back Door Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.7.2 Front Door Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.7.3 Non-Identifiability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.8 Inference Rules for Intervention Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.8.1 Example: Front Door Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.8.2 Causal Inference by Surrogate Experiments . . . . . . . . . . . . . . . . . . . . . . . 77
3.9 Measurement Bias and Effect Restoration . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.9.1 The Matrix Adjustment Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.9.2 Effect Restoration Without External Studies . . . . . . . . . . . . . . . . . . . . . . 79
3.10 Identification of Counterfactuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.10.1 Counterfactual Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.10.2 Joint Counterfactual Probabilities and Intervention . . . . . . . . . . . . . . . . . . 84
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.12 Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.7 Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
10 Conditional Gaussian variables 203
10.1 Conditional Gaussian Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
10.1.1 Some Results on Marginalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
10.1.2 CG Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
10.2 The Junction Tree for Conditional Gaussian Distributions . . . . . . . . . . . . . . . . . . 208
10.3 Updating a CG distribution using a Junction Tree . . . . . . . . . . . . . . . . . . . . . . . 211
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
10.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
13.3 The Sensitivity of Queries to Parameter Changes . . . . . . . . . . . . . . . . . . . . . . . . 272
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
13.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
13.5 Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
16.6 PC and MMPC Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
16.7 Recursive Autonomy Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
16.8 Incompatible Immoralities: EDGE-OPT Algorithm . . . . . . . . . . . . . . . . . . . . . . 324
16.9 Hybrid Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
16.9.1 The Maximum Minimum Hill Climbing Algorithm . . . . . . . . . . . . . . . . . . 324
16.9.2 L1-Regularisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
16.9.3 Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
16.10 A Junction Tree Framework for Undirected Graphical Model Selection . . . . . . . . . 326
16.11 The Xie-Geng Algorithm for Learning a DAG . . . . . . . . . . . . . . . . . . . . . . . . 329
16.11.1 Description of the Xie-Geng Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 330
16.11.2 Proofs of Theorems 16.5 and 16.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
16.12 The Ma-Xie-Geng Algorithm for Learning Chain Graphs . . . . . . . . . . . . . . . . . . 338
16.12.1 Skeleton Recovery with a Separation Tree . . . . . . . . . . . . . . . . . . . . . . . . 338
16.12.2 Recovering the Complexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
16.13 Structure Learning and Faithfulness: an Evaluation . . . . . . . . . . . . . . . . . . . . . 341
16.13.1 Faithfulness and `real world' data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
16.13.2 Interaction effects without main effects . . . . . . . . . . . . . . . . . . . . . . . . 343
16.13.3 Hidden variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
16.13.4 The scope of structure learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
16.13.5 Application of FAS and RAI to financial data . . . . . . . . . . . . . . . . . . . . 344
16.13.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
16.13.7 The `Causal Discovery' Controversy . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
16.13.8 Faithfulness and the great leap of faith . . . . . . . . . . . . . . . . . . . . . . . . . 347
16.13.9 Inferring non-causation and causation . . . . . . . . . . . . . . . . . . . . . . . . . . 349
16.13.10 Summarising causal discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
16.14 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
16.15 Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
18 Monte Carlo Algorithms for Graph Search 375
18.1 A Stochastic Optimisation Algorithm for Essential Graphs . . . . . . . . . . . . . . . . . . 375
18.2 Structure MCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
18.3 Edge Reversal Moves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
18.4 Order MCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
18.5 Partition MCMC for Directed Acyclic Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 381
18.5.1 Scoring Partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
18.5.2 Partition Moves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
18.5.3 Permutation Moves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
18.5.4 Combination with Edge Reversal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
22 Variational Methods for Parameter Estimation 439
22.1 Complete Instantiations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
22.1.1 Triangulated Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
22.1.2 Non-Triangulated Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
22.2 Partially Observed Models and Expectation-Maximisation . . . . . . . . . . . . . . . . . . 442
22.2.1 Exact EM Algorithm for Exponential Families . . . . . . . . . . . . . . . . . . . . . 442
22.2.2 Mean Field Approximate EM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
22.3 Variational Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
INDEX 457
Introduction
The models that were later to be called Bayesian networks were introduced into artificial intelligence by J. Pearl (1982) [103], a seminal article in the literature of that field. A Bayesian network is simply a factorisation of a probability distribution and a corresponding directed acyclic graph (henceforth written DAG), where the edges of the DAG correspond to direct associations between variables in the factorisation.

The first Bayesian networks were connected with causal models, where the order of the variables in the factorisation represented cause to effect, and the directed arrows in the DAG represented direct causes. This is still one of the major uses of Bayesian networks. The leaf nodes of the network are the observables. From observations of leaf variables, inference is made for hidden variables via Bayes rule, hence the term Bayesian network. The terminology, therefore, derives from the probabilistic use of Bayes rule; the statistical use of Bayes rule, whereby uncertainty in the parameter value is expressed in terms of a prior distribution which is then updated to a posterior distribution when data is considered, is not in view here.
Graphical directed separation statements for the DAG imply corresponding conditional independence statements for the probability distribution. For large and complex systems, graphical separation algorithms provide a convenient and efficient method to establish probabilistic conditional independence statements.
The description `Bayesian networks' now covers a large field of problems and techniques of data analysis and probabilistic reasoning, where data is collected on a large number of variables and the aim is to factorise the distribution, represent it graphically and exploit the graphical representation. Perhaps the earliest work that explicitly uses directed graphs to represent possible dependencies among random variables is that by S. Wright (1921) [146], developed by the same author in 1934 [147].
Bayesian networks represent a small part of the wider field of graphical models. A Bayesian network is a probability distribution factorised along a DAG. In many examples this is not the most efficient model for representing the independence structure, and other classes of graphical models may then be more appropriate.
Situations where Bayesian networks provide the natural tools for analysis are, for example: computing the overall reliability of a system given the reliability of the individual components and how they interact; system security, where Bayesian networks are used as a tool for assessing intrusion evidence and whether a network is under attack; forensic analysis. Further applications are, for example: finding the most likely message that was sent across a noisy channel, restoring a noisy image, mapping genes onto a chromosome. One of the leading applications of techniques from the area is to establishing genome pathways. Given DNA hybridisation arrays, which simultaneously measure the expression levels for thousands of genes, a major challenge in computational biology is to uncover, from such measurements, gene/protein interactions and key biological features of cellular systems. This is discussed, for example, by Nir Friedman et al. in [46] (2000).
DAGs have proved useful in a large number of situations where the graph is constructed along causal principles; parent variables are considered to be direct causes. One field where causal networks have proved particularly effective has been epidemiological research, where DAGs have provided a framework for the problem of multiple confounding factors in genetic epidemiology, as discussed by Greenland, Pearl and Robins (1999) [56]. Bayesian networks offer an alternative to `naïve Bayes' models of supervised classification in machine learning, which enables more of the structure to be exploited. One of the first examples of this was the Chow-Liu tree (1968) [28].
Chapter 1
Conditional Independence and Graphical Models
A graphical model for a probability distribution over several variables is, quite simply, a graph, where
the random variables correspond to the node set of the graph and each graphical separation statement
implies the corresponding conditional independence statement for the random variables. The opposite
(that conditional independence implies graphical separation) in general does not hold. In a system with
a large number of variables, the task of determining graphical separation statements is, in general,
computationally far less demanding than the task of determining conditional independence, hence the
motivation for graphical models and applying graph theoretic results.
A Bayesian network is the representation of a probability distribution on a directed acyclic graph
(DAG). In this setting, the most useful notion of separation is D-separation (short for directed separation), which is defined later. If a probability distribution factorises along a DAG, then D-separation
statements in the DAG imply the corresponding conditional independence statements (although the
reverse implication is, in general, false).
In many problems, for example gene expression data where there are thousands of variables, it may be neither possible nor desirable to obtain a complete description of the dependence structure. The
aim for such problems is to learn a DAG which encodes the most important features of the dependence
structure. In classication problems, a complete description of the dependence structure is usually
unnecessary; algorithms only locate the key features of the dependence structure to ensure accurate
classication.
In the subject of Bayesian Networks, there are usually three situations in view: multinomial, Gaussian and Conditional Gaussian. P_{X1,...,Xd} will be used to denote the probability distribution over X1, . . . , Xd. That is, for the multinomial case, this is simply the probability function; the quantity P_{X1,...,Xd}(x1, . . . , xd) is simply the probability of obtaining a configuration with indices (x1, . . . , xd) ∈ 𝒳. When X is a multivariate Gaussian random vector, P_{X1,...,Xd} refers to the probability density function. When X is conditional Gaussian, the discrete variables are listed with lower index than the continuous, so that X = (X1, . . . , Xa, X_{a+1}, . . . , Xd), where there are a discrete variables and d − a continuous variables. Here P_{X1,...,Xa} is the probability function for the discrete variables, while for each configuration (x1, . . . , xa) of the discrete variables, P_{X_{a+1},...,Xd ∣ X1,...,Xa}(⋅ ∣ x1, . . . , xa) is a multivariate Gaussian probability distribution over R^{d−a} for the variables a + 1, . . . , d.
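To make this indexing convention concrete, a conditional Gaussian distribution may be stored as a table mapping each configuration of the discrete variables to the discrete probability together with the mean vector and covariance matrix of the conditional Gaussian. The following Python sketch (the numbers are invented for illustration; this is not an example from the text) evaluates the resulting mixed probability/density.

import numpy as np

# A minimal sketch of a conditional Gaussian (CG) distribution with
# a = 1 discrete variable X1 (binary) and d - a = 2 continuous variables
# (X2, X3).  For each discrete configuration x1, store the probability
# P(X1 = x1) together with the mean vector and covariance matrix of the
# conditional Gaussian over (X2, X3).
cg = {
    0: {"p": 0.3, "mean": np.array([0.0, 1.0]),
        "cov": np.array([[1.0, 0.2], [0.2, 2.0]])},
    1: {"p": 0.7, "mean": np.array([3.0, -1.0]),
        "cov": np.array([[0.5, 0.0], [0.0, 0.5]])},
}

def density(x1, x23):
    """Mixed value P(X1 = x1) * N(x23; mean(x1), cov(x1))."""
    part = cg[x1]
    diff = x23 - part["mean"]
    k = len(diff)
    norm = 1.0 / np.sqrt((2 * np.pi) ** k * np.linalg.det(part["cov"]))
    quad = diff @ np.linalg.solve(part["cov"], diff)
    return part["p"] * norm * np.exp(-0.5 * quad)

print(density(1, np.array([3.0, -1.0])))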
In this treatment, the discussion will be presented for discrete random variables, unless explicitly stated
otherwise.
For a simple graph that may contain both directed and undirected edges, the edge set E may be decomposed as E = D ∪ U, where D ∩ U = ∅, the empty set. The sets U and D are defined as follows.

Definition 1.2 (Parent, Child, Directed and Undirected Neighbour, Family). Consider a graph G = (V, E), where V = {1, . . . , d} and let E = D ∪ U, where D is the set of directed edges and U the set of undirected edges. Let α, β ∈ V. If (α, β) ∈ D, then β is referred to as a child of α and α as a parent of β.

The family of a node β is the set containing the node β together with its parents and undirected neighbours.

The notation α ∼ β will be used to denote that α ∈ N(β); namely, that α and β are neighbours. Note that α ∈ N(β) ⟹ β ∈ N(α).
In this text, a directed edge (α, β) is indicated by a pointed arrow from α to β ; that is, from the parent
to the child.
Definition 1.3 (Directed, Undirected Graph). If all edges of a graph are undirected, then the graph G is said to be undirected. If all edges are directed, then the graph is said to be directed. The undirected version of a graph G, denoted by G̃, is obtained by replacing the directed edges of G by undirected edges.
Definition 1.4 (Trail). Let G = (V, E) be a graph, where E = D ∪ U; D ∩ U = ∅, D denotes the directed edges and U the undirected edges. A trail τ between two nodes α ∈ V and β ∈ V is a collection of nodes τ = (τ1, . . . , τm), where τi ∈ V for each i = 1, . . . , m, τ1 = α and τm = β and such that for each i = 1, . . . , m − 1, τi ∼ τi+1. That is, for each i = 1, . . . , m − 1, either (τi, τi+1) ∈ D or (τi+1, τi) ∈ D or ⟨τi, τi+1⟩ ∈ U.
Note that, in general, a sub-graph may contain the same nodes as the original graph but fewer edges, whereas the sub-graph induced by a given node set contains every edge of the original graph between nodes of that set.
Definition 1.6 (Connected Graph, Connected Component). A graph is said to be connected if between any two nodes αj ∈ V and αk ∈ V there is a trail. A connected component of a graph G = (V, E) is an induced sub-graph G_A such that G_A is connected and such that, if A ≠ V, then for any two nodes (α, β) ∈ V × V such that α ∈ A and β ∈ V/A, there is no trail between α and β.
Definition 1.7 (Path, Directed Path). Let G = (V, E) denote a simple graph, where E = D ∪ U. That is, D ∩ U = ∅, D denotes the directed edges and U denotes the undirected edges. A path of length m from a node α to a node β is a sequence of distinct nodes (τ0, . . . , τm) such that τ0 = α and τm = β and such that (τi−1, τi) ∈ E for each i = 1, . . . , m. That is, for each i = 1, . . . , m, either (τi−1, τi) ∈ D, or ⟨τi−1, τi⟩ ∈ U.

The path is a directed path if (τi−1, τi) ∈ D for each i = 1, . . . , m. That is, there are no undirected edges along the directed path.
It follows that a trail in G is a sequence of nodes that form a path in the undirected version G̃ .
Unlike a trail, a directed path (τ0 , . . . , τm ) requires that the directed edge (τi , τi+1 ) ∈ D for all
i = 0, . . . , m − 1.
In both cases, the paths are directed; they consist of directed edges only; they do not contain undirected
edges.
Definition 1.9 (Cycle). Let G = (V, E) be a graph. An m-cycle in G is a sequence of distinct nodes τ0, . . . , τm−1 such that, setting τm = τ0, the sequence (τ0, . . . , τm) forms a path; that is, a path that returns to its starting node.
Definition 1.10 (Directed Acyclic Graph (DAG)). A graph G = (V, E) is said to be a directed acyclic graph if each edge is directed (that is, G is a simple graph such that for each pair (α, β) ∈ V × V, (α, β) ∈ E ⟹ (β, α) ∉ E) and for any node α ∈ V there does not exist any set of distinct nodes τ1, . . . , τm such that α ≠ τi for all i = 1, . . . , m and (α, τ1, . . . , τm, α) forms a directed path. That is, there are no m-cycles in G for any m ≥ 1.
Definition 1.11 (Tree, Leaf). A tree is a graph G = (V, E) that is connected and such that for any node α ∈ V, there is no trail between α and α, and for any two nodes α and β in V with α ≠ β, there is a unique trail. A leaf of a tree is a node that is connected to exactly one other node.
Definition 1.12 (Conditional Independence). Two random variables (or random vectors) X and Y are said to be conditionally independent given Z if

P_{X,Y,Z} = P_{X∣Z} P_{Y∣Z} P_Z.

This is written X ⊥ Y ∣ Z.
Theorem 1.13. The following are all equivalent to X ⊥ Y ∣ Z, using 𝒳_X, 𝒳_Y and 𝒳_Z to denote the state spaces of X, Y and Z respectively:

1. For all (x, y, z) ∈ 𝒳_X × 𝒳_Y × 𝒳_Z such that P_{Y∣Z}(y∣z) > 0 and P_Z(z) > 0,
P_{X∣Y,Z}(x∣y, z) = P_{X∣Z}(x∣z).

2. There exists a function a : 𝒳_X × 𝒳_Z → [0, 1] such that for all (x, y, z) ∈ 𝒳_X × 𝒳_Y × 𝒳_Z satisfying P_{Y,Z}(y, z) > 0,
P_{X∣Y,Z}(x∣y, z) = a(x, z).

3. For all (x, y, z) ∈ 𝒳_X × 𝒳_Y × 𝒳_Z such that P_Z(z) > 0,
P_{X,Y∣Z}(x, y∣z) = P_{X∣Z}(x∣z) P_{Y∣Z}(y∣z).

4. For all (x, y, z) ∈ 𝒳_X × 𝒳_Y × 𝒳_Z such that P_Z(z) > 0,
P_{X,Y,Z}(x, y, z) = P_{X∣Z}(x∣z) P_{Y,Z}(y, z).

5. There exist non-negative functions F : 𝒳_X × 𝒳_Z → R_+ and G : 𝒳_Y × 𝒳_Z → R_+ such that
P_{X,Y,Z}(x, y, z) = F(x, z) G(y, z) for all (x, y, z) ∈ 𝒳_X × 𝒳_Y × 𝒳_Z.
Any probability distribution over (X1, . . . , Xd) may be factorised, by the chain rule, as

$$P_{X_1,\ldots,X_d} = P_{X_{\sigma(1)}} \prod_{j=2}^{d} P_{X_{\sigma(j)} \mid X_{\sigma(1)},\ldots,X_{\sigma(j-1)}}$$

for any permutation σ of 1, . . . , d. Let Pa^{(σ)}(j) ⊂ {σ(1), . . . , σ(j − 1)} satisfy

$$X_{\sigma(j)} \perp \{X_{\sigma(1)},\ldots,X_{\sigma(j-1)}\} / X_{\mathrm{Pa}^{(\sigma)}(j)} \mid X_{\mathrm{Pa}^{(\sigma)}(j)},$$

while

$$X_{\sigma(j)} \not\perp \{X_{\sigma(1)},\ldots,X_{\sigma(j-1)}\} / X_{\Theta(j)} \mid X_{\Theta(j)}$$

for any strict subset Θ(j) ⊂ Pa^{(σ)}(j).

Then, by the first characterisation of conditional independence, and setting P_{X_{σ(j)} ∣ X_{Pa^{(σ)}(j)}} = P_{X_{σ(j)}} when Pa^{(σ)}(j) = ∅, the empty set,

$$P_{X_1,\ldots,X_d} = \prod_{j=1}^{d} P_{X_{\sigma(j)} \mid X_{\mathrm{Pa}^{(\sigma)}(j)}}.$$
Definition 1.14 (Factorisation Along a Directed Acyclic Graph, Bayesian Network). A factorisation of a probability distribution P_{X1,...,Xd} is a representation

$$P_{X_1,\ldots,X_d} = \prod_{j=1}^{d} P_{X_{\sigma(j)} \mid X_{\Xi^{(\sigma)}(j)}} \qquad (1.8)$$

such that for each j ∈ {1, . . . , d}, Ξ^{(σ)}(j) ⊆ {σ(1), . . . , σ(j − 1)}.

A Bayesian Network is a factorisation of a probability distribution

$$P_{X_1,\ldots,X_d} = \prod_{j=1}^{d} P_{X_{\sigma(j)} \mid X_{\mathrm{Pa}^{(\sigma)}(j)}} \qquad (1.9)$$

such that

1. Pa^{(σ)}(1) = ∅ (the empty set);

2. Pa^{(σ)}(j) ⊆ {σ(1), . . . , σ(j − 1)}.

Unless otherwise stated, it will be assumed that the variables are labelled in such a way that σ = I, the identity.
For Pa(j) = {l_{j,1}, . . . , l_{j,m_j}}, the state space of X_{Pa(j)} is 𝒳_{l_{j,1}} × . . . × 𝒳_{l_{j,m_j}}. For discrete variables, there are q_j = ∏_{a=1}^{m_j} k_{l_{j,a}} configurations. These may be labelled (π_j^{(l)})_{l=1}^{q_j} and the parameters required for the probability distribution P_{X1,...,Xd} are

$$\theta_{jil} = P_{X_j \mid X_{\mathrm{Pa}(j)}}(i \mid \pi_j^{(l)}), \qquad j = 1,\ldots,d, \quad i = 0,\ldots,k_j - 1, \quad l = 1,\ldots,q_j.$$
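As a concrete illustration of these parameters, the following Python sketch (a toy network with invented numbers, not one of the book's examples) stores the parameters θ_{jil} as conditional probability tables indexed by parent configurations and evaluates the joint probability of Equation (1.9).

import itertools

# A toy Bayesian network X1 -> X3 <- X2 over binary variables,
# with parent sets Pa(1) = {}, Pa(2) = {}, Pa(3) = {1, 2}.
parents = {1: (), 2: (), 3: (1, 2)}

# Conditional probability tables: cpt[j][parent_config] = distribution
# over the k_j = 2 states of X_j.  Each row of a CPT sums to one.
cpt = {
    1: {(): (0.6, 0.4)},
    2: {(): (0.7, 0.3)},
    3: {(0, 0): (0.9, 0.1), (0, 1): (0.4, 0.6),
        (1, 0): (0.5, 0.5), (1, 1): (0.1, 0.9)},
}

def joint(x):
    """P(X1 = x[1], X2 = x[2], X3 = x[3]) via the factorisation (1.9)."""
    p = 1.0
    for j, pa in parents.items():
        config = tuple(x[i] for i in pa)   # the parent configuration pi_j
        p *= cpt[j][config][x[j]]
    return p

# Sanity check: the joint probabilities sum to one.
assert abs(sum(joint({1: a, 2: b, 3: c})
               for a, b, c in itertools.product((0, 1), repeat=3)) - 1) < 1e-12
print(joint({1: 1, 2: 0, 3: 1}))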
The factorisation of Equation (1.8) in Definition 1.14 may be represented by a Directed Acyclic Graph. For example, a probability distribution over X, Y, Z, W satisfying such a factorisation may be represented by a DAG over the four nodes, as in the figure below.

[Figure: a DAG over the nodes X, Y, Z and W representing the factorisation.]
Figure 1.2: The chain connection α → β → γ.
Within a directed acyclic graph, there are three basic ways in which two nodes α, γ such that (α, γ) ∉ D and (γ, α) ∉ D can be connected via a third node. They are the chain, fork and collider connections respectively.
Chain Connections A chain connection between nodes α and γ is a connection via a node β such
that the graph contains directed edges α → β and β → γ , but no edge between α and γ .
Consider a probability distribution over (Xα, Xβ, Xγ) factorised according to the graph in Figure 1.2, as P_{Xα} P_{Xβ∣Xα} P_{Xγ∣Xβ}.

Clearly, Xα ⊥̸ Xγ in general;

$$P_{X_\alpha,X_\gamma}(x_1,x_3) = P_{X_\alpha}(x_1) \sum_{x_2 \in \mathcal{X}_2} P_{X_\beta \mid X_\alpha}(x_2 \mid x_1)\, P_{X_\gamma \mid X_\beta}(x_3 \mid x_2)$$

and, without further assumptions, this cannot be expressed in product form. Conditioned on Xβ, though,

$$P_{X_\alpha,X_\gamma \mid X_\beta}(\cdot,\cdot \mid x_2) = \frac{P_{X_\alpha,X_\beta,X_\gamma}(\cdot,x_2,\cdot)}{P_{X_\beta}(x_2)} = \frac{P_{X_\alpha}(\cdot)\, P_{X_\beta \mid X_\alpha}(x_2 \mid \cdot)\, P_{X_\gamma \mid X_\beta}(\cdot \mid x_2)}{P_{X_\beta}(x_2)}$$

$$= \left( \frac{P_{X_\alpha}(\cdot)\, P_{X_\beta \mid X_\alpha}(x_2 \mid \cdot)}{P_{X_\beta}(x_2)} \right) \left( P_{X_\gamma \mid X_\beta}(\cdot \mid x_2) \right) = \left( P_{X_\alpha \mid X_\beta}(\cdot \mid x_2) \right) \left( P_{X_\gamma \mid X_\beta}(\cdot \mid x_2) \right),$$
where Bayes rule has been used and so, following characterisation 3 of conditional independence from
Theorem 1.13, Xα ⊥ Xγ ∣Xβ .
Fork Connections A fork connection between two nodes Xα and Xγ is a situation where there is no edge between Xα and Xγ, but there is a node Xβ such that the graph contains directed edges Xβ → Xα and Xβ → Xγ. It is illustrated in Figure 1.3.

Figure 1.3: The fork connection α ← β → γ.

A distribution over the variables (Xα, Xβ, Xγ) that factorises according to the DAG in Figure 1.3 has factorisation P_{Xβ} P_{Xα∣Xβ} P_{Xγ∣Xβ}.

It is clear that Xα ⊥̸ Xγ in general;

$$P_{X_\alpha,X_\gamma}(x_1,x_3) = \sum_{x_2 \in \mathcal{X}_2} P_{X_\beta}(x_2)\, P_{X_\alpha \mid X_\beta}(x_1 \mid x_2)\, P_{X_\gamma \mid X_\beta}(x_3 \mid x_2)$$

and, without further assumptions, this cannot be expressed in product form. Conditioned on Xβ, though,

$$P_{X_\alpha,X_\gamma \mid X_\beta}(x_1,x_3 \mid x_2) = P_{X_\alpha \mid X_\beta}(x_1 \mid x_2)\, P_{X_\gamma \mid X_\beta}(x_3 \mid x_2),$$

so that Xα ⊥ Xγ ∣ Xβ.
Collider Connections A collider connection between two nodes α and γ is a connection such that the graph does not contain an edge between α and γ, but there is a node β such that the graph contains directed edges α → β and γ → β. A collider connection is illustrated in Figure 1.4.

Figure 1.4: The collider connection α → β ← γ.

The factorisation of the distribution P_{Xα,Xβ,Xγ} corresponding to the DAG for the collider is

$$P_{X_\alpha,X_\beta,X_\gamma} = P_{X_\alpha}\, P_{X_\gamma}\, P_{X_\beta \mid X_\alpha,X_\gamma}.$$

In general, Xα ⊥̸ Xγ ∣ Xβ. But for each (x, z) ∈ 𝒳_α × 𝒳_γ,

$$P_{X_\alpha,X_\gamma}(x,z) = \sum_{y} P_{X_\alpha}(x)\, P_{X_\gamma}(z)\, P_{X_\beta \mid X_\alpha,X_\gamma}(y \mid x,z) = P_{X_\alpha}(x)\, P_{X_\gamma}(z),$$

so that Xα ⊥ Xγ.
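These three behaviours are easy to check numerically. The sketch below builds a collider with invented parameters and verifies that Xα and Xγ are independent marginally but, in general, dependent once Xβ is instantiated.

import itertools

# Collider a -> b <- c with binary variables and invented parameters.
p_a = {0: 0.6, 1: 0.4}
p_c = {0: 0.7, 1: 0.3}
p_b_given_ac = {(0, 0): 0.9, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.1}  # P(b=0 | a, c)

def joint(a, b, c):
    pb0 = p_b_given_ac[(a, c)]
    return p_a[a] * p_c[c] * (pb0 if b == 0 else 1 - pb0)

# Marginally, P(a, c) factorises: X_alpha and X_gamma are independent.
for a, c in itertools.product((0, 1), repeat=2):
    pac = sum(joint(a, b, c) for b in (0, 1))
    assert abs(pac - p_a[a] * p_c[c]) < 1e-12

# Conditioned on the collider b = 0, the factorisation fails in general.
pb0 = sum(joint(a, 0, c) for a in (0, 1) for c in (0, 1))
p_a0_b0 = sum(joint(0, 0, c) for c in (0, 1)) / pb0
p_c0_b0 = sum(joint(a, 0, 0) for a in (0, 1)) / pb0
p_a0c0_b0 = joint(0, 0, 0) / pb0
print(p_a0c0_b0, p_a0_b0 * p_c0_b0)  # the values differ: dependence given X_beta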
A Causal Interpretation So far, the discussion has considered sets of random variables where, based on the ordering of the variables, the parent set of a variable is a subset of those of a lower order. The representation of a probability distribution by factorising along a Directed Acyclic Graph may be particularly useful if there are cause to effect relations between the variables, the ancestors being the cause and the descendants the effect. For a causal model, the connections have the following interpretations:
Fork Connection: Common cause For the fork connection, illustrated by Figure 1.3, Xβ may be a cause that influences both Xα and Xγ, which are effects. The variables are only related through Xβ. The situation is illustrated by the following example, taken from a cartoon by Albert Engström: during a convivial discussion at the bar one evening, about the unhygienic nature of galoshes, one of the participants pipes up, `you have a very good point there. Every time I wake up wearing my galoshes, I have a sore head.'
Let Xα denote the state of the feet and Xγ the state of the head. These two variables are related; Xα ⊥̸ Xγ. But there is a common cause; Xβ, which denotes the activities of the previous evening. Once it is known that he has spent a convivial evening drinking, the state of the feet gives no further information about the state of the head; Xα ⊥ Xγ ∣ Xβ.
Chain Connection This may similarly be understood as cause to effect. Xα influences Xβ, which in turn influences Xγ, but there is no direct causal relationship between the values taken by Xα and those taken by Xγ. If Xβ is unknown, then Xα ⊥̸ Xγ, but once the state of Xβ is established, Xα and Xγ give no further information about each other; Xα ⊥ Xγ ∣ Xβ.
Collider Connection For the collider connection, Xα and Xγ are unrelated; Xα ⊥ Xγ. But they both influence Xβ. For example, consider a burglar alarm (Xβ) that is activated if a burglary (Xα) takes place, but can also be activated if there is a minor earth tremor (Xγ).
One day, somebody calls you while you are at work to say that your burglar alarm is activated.
You get into the car to go home. But on the way home, you hear on the radio that there has been an
earth tremor in the area. As a result, you return to work.
Once Xβ is instantiated, the information that there has been an earth tremor influences the likelihood that a burglary has taken place; Xα ⊥̸ Xγ ∣ Xβ.
This is known as explaining away.
Attention is now turned to trails within a DAG, and characterisation of those along which information
can pass.
Definition 1.17 (S-Active Trail). Let G = (V, D) be a directed acyclic graph. Let S ⊂ V and let α, β ∈ V/S. A trail τ between the two variables α and β is said to be S-active if every collider node on τ either belongs to S or has a descendant in S, while every non-collider node on τ does not belong to S.
Definition 1.18 (Blocked Trail). A trail between α and β that is not S-active is said to be blocked by S.

The following definition is basic; it will be seen that if a probability distribution factorises along a DAG G and two nodes α and β are D-separated by S, then Xα ⊥ Xβ ∣ X_S.
Definition 1.19 (D-separation). Let G = (V, D) be a directed acyclic graph, where V = {1, . . . , d}. Let S ⊂ V. Two distinct nodes α and β not in S are D-separated by S if all trails between α and β are blocked by S.

Let A and B denote two sets of nodes. If every trail from any node in A to any node in B is blocked by S, then the sets A and B are said to be D-separated by S. This is written

A ⊥ B ∥G S.    (1.10)
The terminology D-separation is short for directed separation. The insertion of the letter `D' points
out that this is not the standard use of the term `separation' found in graph theory.
Definition 1.20 (D-connected). If two nodes α and β are not D-separated, they are said to be D-connected.

Notation The notation α ⊥̸ β ∥G S denotes that α and β are D-connected by S in the DAG G. Here α and β may refer to individual nodes or sets of nodes.
Example 1.21.
Consider the chain connection α → β → γ in the DAG in Figure 1.2 and the fork connection of Figure 1.3. For the chain connection of Figure 1.2, the D-separation statements are: α ⊥ γ ∥G β while α ⊥̸ γ ∥G ∅ (∅ denotes the empty set). For the DAG in Figure 1.3, α ⊥ γ ∥G β while α ⊥̸ γ ∥G ∅. These correspond to the conditional independence statements derived for probability distributions that factorise along these graphs. For Figure 1.4, α ⊥ γ ∥G ∅ while α ⊥̸ γ ∥G β. Again, these statements correspond to the conditional independence statements that may be derived from the fact that a distribution factorises along the DAG of Figure 1.4.
Let MB(α) denote the set of nodes which are either parents of α or children of α or a node which
shares a common child with α. Then α is D-separated from the rest of the network by MB(α). This
set of nodes is known as the Markov blanket of the node α.
Definition 1.22 (Markov Blanket). The Markov blanket of a node α in a DAG G = (V, D), denoted
MB(α), is the set consisting of the parents of α, the children of α and the nodes sharing a common
child with α.
Definition 1.23 (Instantiated Nodes). Let G = (V, D) be a directed acyclic graph. When considering statements α ⊥ β ∥G S and α ⊥̸ β ∥G S, the nodes in S are referred to as instantiated.
Consider the three types of connection in a DAG: chain, collider and fork. The `Bayes ball' algorithm determines whether information may pass between two nodes by asking whether a ball, bouncing along the trails of the graph according to the following rules, can travel from the one to the other.
For the chain connection illustrated in Figure 1.2, the Bayes ball algorithm indicates that if node
β is instantiated, then the ball does not move from α to γ through β . The communication in the
trail is blocked. If the node is not instantiated, then communication is possible.
For the fork connection illustrated in Figure 1.3, the algorithm states that if node β is instantiated,
then again communication between α and γ is blocked. If the node is not instantiated, then
communication is possible.
For the collider connection illustrated in Figure 1.4, the Bayes ball algorithm states that the ball does move from α to γ if node β or any of its descendants is instantiated. If β or a descendant is instantiated, this opens communication between the parents. If neither β nor any of its descendants are instantiated, then there is no communication.
For a collider node β , instantiating any of the descendants of β also opens communication. If node β
is not instantiated, and none of its descendants are instantiated, then there is no communication.
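These rules translate directly into a reachability algorithm. The following is a minimal Python sketch of the Bayes ball idea (it is not the authors' code): the ball travels `down' along directed edges and `up' against them, passing a chain or fork node only when the node is uninstantiated, and passing through a collider only when the node, or one of its descendants, is instantiated.

from collections import deque

def d_separated(dag, x, y, z):
    """Test whether nodes x and y are D-separated by the set z in a DAG.

    dag maps each node to the list of its children.
    """
    parents = {v: [] for v in dag}
    for v, children in dag.items():
        for c in children:
            parents[c].append(v)

    # Nodes in z, or with a descendant in z: such a node opens a collider.
    opens_collider = set(z)
    changed = True
    while changed:
        changed = False
        for v in dag:
            if v not in opens_collider and any(c in opens_collider for c in dag[v]):
                opens_collider.add(v)
                changed = True

    # Breadth-first search over (node, direction): 'up' = the ball
    # arrived from a child, 'down' = it arrived from a parent.
    queue = deque([(x, "up")])
    visited = set()
    while queue:
        node, direction = queue.popleft()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node == y and node not in z:
            return False  # y is reachable along an active trail
        if direction == "up" and node not in z:
            # Chain or fork: continue to parents and to children.
            queue.extend((p, "up") for p in parents[node])
            queue.extend((c, "down") for c in dag[node])
        elif direction == "down":
            if node not in z:
                # Chain: continue downwards to children.
                queue.extend((c, "down") for c in dag[node])
            if node in opens_collider:
                # Collider: the node (or a descendant) is instantiated,
                # so the ball may pass back up to the other parents.
                queue.extend((p, "up") for p in parents[node])
    return True

# Chain a -> b -> c: blocked by {b}, active given the empty set.
chain = {"a": ["b"], "b": ["c"], "c": []}
print(d_separated(chain, "a", "c", {"b"}))  # True
print(d_separated(chain, "a", "c", set()))  # False
# Collider a -> b <- c: active only when b (or a descendant) is instantiated.
collider = {"a": ["b"], "b": [], "c": ["b"]}
print(d_separated(collider, "a", "c", set()))  # True
print(d_separated(collider, "a", "c", {"b"}))  # False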
A DAG G = (V, D) satises the following important property:
Theorem 1.24. A DAG G = (V, D) contains an edge between two nodes α, β ∈ V if and only if α ⊥̸ β ∥G S for any S ⊆ V/{α, β}.
Proof The proof of this is straightforward and left as an exercise (Exercise 6 page 22).
Theorem 1.25 (D-Separation Implies Conditional Independence). Let G = (V, D) be a directed acyclic graph and let P be a probability distribution that factorises along G. Then for any three disjoint subsets A, B, S ⊂ V, it holds that X_A ⊥ X_B ∣ X_S (X_A and X_B are independent given X_S) if A ⊥ B ∥G S (A and B are D-separated by S).
Proof of Theorem 1.25 Let X = (X1, . . . , Xd) be a random vector. Let V = {1, . . . , d} denote the set of nodes of a directed acyclic graph G = (V, D) and suppose that P_X factorises along G. Let A ⊂ V, B ⊂ V and S ⊂ V be three disjoint sets of nodes. Suppose that A ⊥ B ∥G S. Let A, B and S denote also the random vectors X_A, X_B and X_S respectively and let 𝒳_A, 𝒳_B and 𝒳_S denote their respective state spaces. It is required to show that for all a ∈ 𝒳_A, b ∈ 𝒳_B and s ∈ 𝒳_S,

P_{A,B∣S}(a, b ∣ s) = P_{A∣S}(a ∣ s) P_{B∣S}(b ∣ s).
From Characterisation 5 of Theorem 1.13, it is required to show that there are two functions F and G such that

P_{A,B,S}(a, b, s) = F(a, s) G(b, s).
Let P(Xj ∣ Pa_j) denote the conditional probability function of Xj given the parent variables X_{Pa(j)}. Then

$$P(A, B, S) = \sum_{X_R} \prod_{j=1}^{d} P(X_j \mid \mathrm{Pa}_j),$$

where R = V/(A ∪ B ∪ S) denotes the set of uninstantiated variables; the variables of R fall into four classes, R_1, R_2, R_3 and R_4, discussed below.
Since all the nodes of R are uninstantiated and there is no S-active trail from A to B, it follows that any node α ∈ R_1 is either a collider which is neither in S nor has any descendants (Definition 1.8) in S, or the descendant of such a collider. Furthermore, any descendant of a variable in R_1 is also in R_1. Therefore, marginalising over the variables in R_1 does not involve the parent variables of A, B or S, nor does it involve the variables in R_2 or R_3 or their ancestors, since ∑_{X_j} P(X_j ∣ Pa_j) = 1.
There is no S -active trail from a variable in R4 to any variable in A or B . It follows that parents of
variables in R4 are either in R4 or in S (if the parent is not in S and there is an S -active trail between
the parent and a variable in either A or B, then there is an S-active trail from the variable itself; the link between the variable, its parent and the next variable on the trail is either an uninstantiated fork or an uninstantiated chain connection).
Now, using ∅ to denote the empty set, let S_2 = {α ∈ S ∣ Pa(α) ∩ R_2 ≠ ∅} (there is an S-active trail from A to a parent of α ∉ S but not from B to a parent of α ∉ S), S_3 = {α ∈ S ∣ Pa(α) ∩ R_3 ≠ ∅} (there is an S-active trail from B to a parent of α ∉ S but not from A to a parent of α ∉ S) and S_4 = S ∩ S_2^c ∩ S_3^c (nodes α ∈ S such that there is no S-active trail either from A to a parent of α ∉ S or from B to a parent of α ∉ S). Then S_2 ∩ S_3 = ∅, the empty set, otherwise there would be a collider node in S that would result in an active trail from A to B.

It is also clear that Pa(S_4) ⊆ S ∪ R_4, where Pa(S_4) denotes the parent variables of the variables in S_4; that is, Pa(S_4) = {Y ∣ (Y, X) ∈ E, X ∈ S_4}. The sets S_2, S_3, S_4 are disjoint. It follows that
$$P(A,B,S) = \sum_{X_{R_2}} \sum_{X_{R_3}} \sum_{X_{R_4}} \sum_{X_{R_1}} \left( \prod_{j \in R_1} P(X_j \mid \mathrm{Pa}_j) \prod_{j \in R_4} P(X_j \mid \mathrm{Pa}_j) \prod_{j \in S_4} P(X_j \mid \mathrm{Pa}_j) \right)$$

$$\times \left( \prod_{j \in A} P(X_j \mid \mathrm{Pa}_j) \prod_{j \in S_2} P(X_j \mid \mathrm{Pa}_j) \prod_{j \in R_2} P(X_j \mid \mathrm{Pa}_j) \right) \times \left( \prod_{j \in B} P(X_j \mid \mathrm{Pa}_j) \prod_{j \in S_3} P(X_j \mid \mathrm{Pa}_j) \prod_{j \in R_3} P(X_j \mid \mathrm{Pa}_j) \right).$$
The sums are taken from right to left, starting with $\sum_{X_{R_1}}$. None of the variables in R_1 are in A ∪ B ∪ S or the parent sets of variables in A, B, S_2, S_3, R_2 or R_3. The parents of variables in R_4 are either in R_4 or S; the parents of variables in S_4 are either in S_4 or R_4. The parents of variables in R_3 are in R_3 ∪ B. The parents of variables in R_2 are in R_2 ∪ A. It follows that P(A, B, S) has a factorisation of the form P(A, B, S) = F(A, S) G(B, S), as required.
Of course, the converse is not true in general; D-separation is a convenient way of locating some of
the independence structure of a distribution. It does not, in general, locate the entire independence
structure.
Definition 1.26 (Local Directed Markov Condition, Locally G-Markovian). Let X = (X1, . . . , Xd) be a random vector. A probability function P over X satisfies the local directed Markov condition with respect to a DAG G = (V, D) with node set V = {1, . . . , d} or, equivalently, is said to be locally G-Markovian, if and only if there is an ordering of the variables σ such that Pa^{(σ)}(j) ⊆ {σ(1), . . . , σ(j − 1)} for each j ∈ {1, . . . , d} and such that X_{σ(j)} is conditionally independent, given X_{Pa^{(σ)}(j)}, of X_{V/(V^{(σ)}(j) ∪ Pa^{(σ)}(j) ∪ {σ(j)})}, where V^{(σ)}(j) is the set of all descendants of σ(j) in G. That is,

$$X_{\sigma(j)} \perp X_{V / (V^{(\sigma)}(j) \cup \mathrm{Pa}^{(\sigma)}(j) \cup \{\sigma(j)\})} \mid X_{\mathrm{Pa}^{(\sigma)}(j)}. \qquad (1.11)$$
Proposition 1.27. Let P be a probability distribution over a random vector X = (X1, . . . , Xd). Then P satisfies the local directed Markov property (l.d.m.p.) with respect to a graph G = (V, D) if and only if there is an ordering of the variables σ such that P factorises along G.
Proof Firstly, assume that P is locally G-Markovian and assume that the variables are ordered in such a way that for each j ∈ {1, . . . , d}, X_j ⊥ X_{V/(V(j) ∪ Pa(j) ∪ {j})} ∣ X_{Pa(j)}, where V(j) is defined in Equation (1.11). Let π_j(x_1, . . . , x_{j−1}) denote the instantiation of X_{Pa(j)} when X is instantiated as (x_1, . . . , x_d). By Characterisation 1 of Theorem 1.13, for all j = 1, . . . , d and any π_j such that P_{X_{Pa(j)}}(π_j) > 0, it follows that

$$P_{X_1,\ldots,X_d} = \prod_{j=1}^{d} P_{X_j \mid X_{\mathrm{Pa}(j)}}$$

and hence, by definition, that P factorises along G.
Secondly, suppose that P factorises along a directed acyclic graph G = (V, D). Then it is clear (for example by using the Bayes ball algorithm) that

$$j \perp V/(V(j) \cup \mathrm{Pa}(j) \cup \{j\}) \;\|_G\; \mathrm{Pa}(j),$$

where V(j) is the set of variables defined by Equation (1.11). If Pa(j) is instantiated, then any trail from j to V/(V(j) ∪ Pa(j) ∪ {j}) has to pass through a node in Pa(j), which will be either a chain or fork connection. It follows from Theorem 1.25 that X_j ⊥ X_{V/(V(j) ∪ Pa(j) ∪ {j})} ∣ X_{Pa(j)}, so that P is locally G-Markovian.
Definition 1.28 (Query). A query in probabilistic inference is simply a conditional probability distribution over the variables of interest (the query variables), conditioned on information received.

Discussion of the main algorithms for answering queries is the subject of Chapters 7 and 8.
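Pending those chapters, a query over a very small network can always be answered by brute-force enumeration of the joint distribution. The following sketch (a two-variable network with invented numbers) illustrates the definition; the approach is exponential in the number of unobserved variables and serves only as a baseline.

import itertools

# Brute-force answer to a query P(X_target | evidence) by enumerating
# the joint distribution of a tiny invented network X1 -> X2.
parents = {1: (), 2: (1,)}
cpt = {1: {(): (0.8, 0.2)},
       2: {(0,): (0.9, 0.1), (1,): (0.3, 0.7)}}

def joint(x):
    p = 1.0
    for j, pa in parents.items():
        p *= cpt[j][tuple(x[i] for i in pa)][x[j]]
    return p

def query(target, evidence):
    """Return the conditional distribution P(X_target | evidence)."""
    hidden = [v for v in parents if v != target and v not in evidence]
    scores = []
    for val in (0, 1):
        total = 0.0
        for h in itertools.product((0, 1), repeat=len(hidden)):
            x = dict(evidence); x[target] = val; x.update(zip(hidden, h))
            total += joint(x)
        scores.append(total)
    norm = sum(scores)
    return [s / norm for s in scores]

print(query(1, {2: 1}))  # posterior over X1 given the observation X2 = 1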
1.5 Quick Medical Reference - Decision Theoretic: An Example

The `Or' disjunction of two propositions p and q, denoted by p ∨ q, is defined by the truth table

p q p∨q
1 1 1
1 0 1
0 1 1
0 0 0

while the `And' conjunction of two propositions p and q, denoted by p ∧ q, is defined by the truth table

p q p∧q
1 1 1
1 0 0
0 1 0
0 0 0

Here 1 = the proposition is true, 0 = the proposition is false.
Now consider the situation where p and q are independent causes of some effect, but where p and q only cause the effect with some probability less than 1.
Another simplifying assumption is that an individual contracts different diseases independently of each other. Under this assumption,
$$P_D = \prod_{i=1}^{m} P_{D_i}.$$
For the problem of classification, that is, diagnosing diseases given a list of symptoms, these two modelling assumptions come under the umbrella of `independence of competing risks'. This is a simplification but, nevertheless, it can produce an effective classifier.
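A minimal sketch of such a classifier is given below; the disease names, priors and symptom probabilities are invented for illustration. Each disease is scored separately, with the symptoms treated as conditionally independent given the disease status.

# A minimal diagnostic sketch under the independence assumptions above.
# All names and numbers are invented.
priors = {"flu": 0.1, "cold": 0.2}
# (P(symptom present | disease present), P(symptom present | disease absent))
likelihood = {
    "fever": {"flu": (0.9, 0.05), "cold": (0.3, 0.05)},
    "cough": {"flu": (0.6, 0.10), "cold": (0.8, 0.10)},
}

def posterior(disease, symptoms):
    """P(disease | symptoms), treating the symptoms as conditionally
    independent given the disease status."""
    p_yes, p_no = priors[disease], 1 - priors[disease]
    for s, present in symptoms.items():
        on, off = likelihood[s][disease]
        p_yes *= on if present else 1 - on
        p_no *= off if present else 1 - off
    return p_yes / (p_yes + p_no)

print(posterior("flu", {"fever": True, "cough": True}))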
Noisy `or' as a causal network Consider the DAG given in Figure 1.6, where B = A_1 ∨ A_2 ∨ . . . ∨ A_n.
This is the logical `or' and there is no noise. The noise then enters, as in the DAG given in Figure 1.7,
by considering that if any of the variables Ai , i = 1, . . . , n is present, then B is present unless something
has inhibited it, the inhibitors on each variable acting independently of each other.
Figure 1.6: The DAG with directed edges from each of A_1, A_2, . . . , A_n to B.
Noisy `or': inhibitors Consider the DAG in Figure 1.7, where q_i denotes the probability that the impact of A_i is inhibited.

Figure 1.7: The DAG with directed edges from each of A_1, A_2, . . . , A_n to B, the edge from A_i carrying the label 1 − q_i.
All variables are binary, and take value 1 if the cause, or effect, is present and 0 otherwise. In other words, P_{B∣A_i}(0∣1) = q_i. The assumption from the DAG is that all the inhibitors are independent. This implies that

$$P_{B \mid A_1,\ldots,A_n}(0 \mid a_1,\ldots,a_n) = \prod_{j \in Y} q_j,$$

where Y = {j ∈ {1, . . . , n} ∣ a_j = 1}. This may be described by a noisy `or' gate.
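One attraction of the noisy `or' is parsimony: the n inhibition probabilities q_1, . . . , q_n determine the entire conditional probability table over all 2^n parent configurations. A minimal sketch, with invented values of q_i:

# Noisy 'or': B = 0 only if every active cause A_i is inhibited,
# the inhibitors acting independently with probabilities q_i.
q = [0.2, 0.5, 0.4]  # invented inhibition probabilities q_1, q_2, q_3

def p_b_given_a(a):
    """Return P(B = 1 | A_1 = a[0], ..., A_n = a[n-1])."""
    p_off = 1.0
    for ai, qi in zip(a, q):
        if ai == 1:          # cause present: B stays off only if inhibited
            p_off *= qi
    return 1.0 - p_off

print(p_b_given_a([1, 0, 1]))  # 1 - q_1 * q_3 = 1 - 0.08 = 0.92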
Noisy `or' Gate The noisy `or' can be modelled directly, introducing the variables B_i, i = 1, . . . , n, where B_i takes the value 1 if the cause A_i is on and it is not inhibited, and 0 otherwise. The corresponding DAG is given in Figure 1.8,

Figure 1.8: The DAG with a directed edge from each A_i to B_i, the edge carrying the label 1 − q_i, and directed edges from each of B_1, B_2, . . . , B_n to B,

where B = B_1 ∨ B_2 ∨ . . . ∨ B_n and, for each i, P_{B_i∣A_i}(1∣1) = 1 − q_i and P_{B_i∣A_i}(1∣0) = 0.
Notes The models that were later to be called Bayesian networks were introduced into artificial intelligence by J. Pearl, in the article [103] (from 1982). Within the Artificial Intelligence literature, this is a seminal article. Perhaps the earliest work that uses directed graphs to represent possible dependencies among random variables is that by S. Wright (1921) [146]. An early article that considered the notion of a factorisation of a probability distribution along a directed acyclic graph representing causal dependencies is that by H. Kiiveri, T.P. Speed and J.B. Carlin (1984) [74], where a Markov property for Bayesian networks was defined. This was developed by J. Pearl in [106] (from 1990). D-separation, and the extent to which it characterises independence, is discussed by J. Pearl and T. Verma in [112] and by J. Pearl, D. Geiger and T. Verma in [111]. The Bayes ball is taken from R.D. Shachter [122]. The results for identifying independence in Bayesian networks are taken from D. Geiger, T. Verma and J. Pearl [51].
1.6 Exercises
1. Let (X, Y, W, Z) be disjoint sets of random variables, each with a finite state space. Prove that the following logical relations hold:

(a) (Decomposition) X ⊥ Y ∪ W ∣ Z ⟹ X ⊥ Y ∣ Z and X ⊥ W ∣ Z.

(b) (Contraction) X ⊥ Y ∣ Z and X ⊥ W ∣ Y ∪ Z ⟹ X ⊥ Y ∪ W ∣ Z.

(c) (Weak union) X ⊥ Y ∪ Z ∣ W ⟹ X ⊥ Y ∣ Z ∪ W.

(d) (Intersection; assume the joint probability function is strictly positive) X ⊥ Y ∣ W ∪ Z and X ⊥ W ∣ Y ∪ Z ⟹ X ⊥ Y ∪ W ∣ Z.

2. Let (X, Y, W, Z) be four sets of nodes in a DAG G = (V, D). Prove that the four relations of Exercise 1 hold with each conditional independence statement X ⊥ Y ∣ Z replaced by the corresponding D-separation statement X ⊥ Y ∥G Z (for the intersection property, no positivity condition is required).
3. Let X denote the state space for (X, Y, W, Z) and assume that PX,Y,W,Z (x, y, w, z) > 0 for each
(x, y, w, z) ∈ X . Does it hold in general that if X ⊥ Y ∣ Z ∪ W and X ⊥ W ∣ Z ∪ Y , then
X ⊥ Z ∣ Y ∪ W ? Either prove the result or illustrate why it is false.
4. Let V = A ∪ B ∪ S where A, B and S are disjoint subsets and suppose that A ⊥ B∣S . Prove that
for any α ∈ A and γ ∈ B ,
5. Let A be a variable in a DAG. Prove that if all the variables in the Markov blanket of A are instantiated, then A is D-separated from the remaining uninstantiated variables.
7. Let G = (V, D) denote a directed acyclic graph. Let X ⊆ V , Y ⊆ V and Z ⊆ V denote sets of
nodes and let α, β, γ, δ ∈ V /X ∪ Y ∪ Z denote individual nodes.
8. The notation X_A is used to denote the random (row) vector of all variables in set A. Let V = {X_1, . . . , X_d} be the d variables of a Bayesian network and assume that X_{V/{X_i}} = w. That is, all the variables except X_i are instantiated. Assume that X_i is a binary variable, taking values 0 or 1. The odds of an event A given B is defined as

$$O_P(A \mid B) = \frac{P(A \mid B)}{P(A^c \mid B)},$$

where A^c denotes the complement of A. Consider the odds O_P({X_i = 1} ∣ X_{V/{X_i}} = w) and show that this depends only on the variables in the Markov blanket (Definition 1.22) of X_i.
1.7 Answers
1. (a) X ⊥ Y ∪ W ∣ Z means P_{W,X,Y,Z}(w, x, y, z) = P_{X∣Z}(x∣z) P_{W,Y∣Z}(w, y∣z) P_Z(z). Summing over W gives P_{X,Y,Z}(x, y, z) = P_{X∣Z}(x∣z) P_{Y∣Z}(y∣z) P_Z(z); equivalent to X ⊥ Y ∣ Z. Similarly, summing over Y gives P_{W,X,Z}(w, x, z) = P_{X∣Z}(x∣z) P_{W∣Z}(w∣z) P_Z(z), equivalent to X ⊥ W ∣ Z.
(b) X ⊥ Y ∣ Z implies P_{X,Y,Z}(x, y, z) = P_{X∣Z}(x∣z) P_{Y∣Z}(y∣z) P_Z(z) and X ⊥ W ∣ Y ∪ Z implies P_{W,X,Y,Z}(w, x, y, z) = P_{X∣Y,Z}(x∣y, z) P_{W∣Y,Z}(w∣y, z) P_{Y,Z}(y, z). The first statement implies that for (x, y, z) such that P_{X,Y,Z}(x, y, z) > 0, P_{X∣Y,Z} = P_{X∣Z}, so, using P_{Y,Z} = P_{Y∣Z} P_Z, it follows that

$$P_{W,X,Y,Z} = P_{X \mid Z}\, P_{W \mid Y,Z}\, P_{Y \mid Z}\, P_Z,$$

so that X ⊥ W ∪ Y ∣ Z.
(c)

$$P_{X,Y,Z,W} = \frac{P_{X,W}\, P_{W,Y,Z}}{P_W} = a_{X,W}\, b_{Y,Z,W},$$

where $a_{X,W} = \frac{P_{X,W}}{P_W}$ and $b_{Y,Z,W} = P_{W,Y,Z}$, so that X ⊥ Y ∣ Z ∪ W from the characterisations of independence.
(d)

$$P_{X,Y,W,Z} = \frac{P_{X,W,Z}\, P_{Y,W,Z}}{P_{W,Z}} = \frac{P_{X,Y,Z}\, P_{W,Y,Z}}{P_{Y,Z}},$$

so that

$$\frac{P_{X,W,Z}}{P_{W,Z}} = \frac{P_{X,Y,Z}}{P_{Y,Z}},$$

that is,

$$P_{X \mid W,Z} = P_{X \mid Y,Z} = P_{X \mid Z},$$

giving

$$P_{X,Y,W,Z} = \frac{P_{X,Z}\, P_{Y,W,Z}}{P_Z}$$

and hence X ⊥ Y ∪ W ∣ Z.
2. (a) This is clear from the definition: Z blocks all trails between X and Y and all trails between X and W.

(b) Consider α ∈ X and β ∈ W. Any trail α ↔ β has either an instantiated fork or chain node in Y ∪ Z, or an uninstantiated collider that is not in Y ∪ Z, nor any of its descendants. It follows that such an uninstantiated collider is not in Z, neither are any of its descendants. If it has an instantiated fork or chain node in Y, then the trail from α to the instantiated fork or chain in Y is blocked by Z, since X ⊥ Y ∥G Z. Hence X ⊥ W ∪ Y ∥G Z.
(c) Let α ∈ X and β ∈ Y. Any trail is blocked by W. That is, it has either a fork or chain node in W, or a collider node that is not in W, with none of its descendants in W.

If it is blocked by a chain or fork in W, then the trail is also blocked by Z ∪ W. Consider the first collider on the trail, proceeding from α, not in W, with no descendants in W, that is either in Z or has a descendant in Z. Then the trail between α and the node in Z is blocked by W, since X ⊥ Y ∪ Z ∥G W. Since neither the collider nor any of its descendants are in W, it follows that the trail between α and the collider node is blocked by W, from which it follows that it has a chain or fork in W, from which it follows that X ⊥ Y ∥G Z ∪ W.
(d) Let α ∈ X and β ∈ Y. Any trail between them with no other nodes in X or Y is blocked by W ∪ Z. That is, it has either a fork or chain node in W ∪ Z, or a collider not in W ∪ Z with no descendants in W ∪ Z. Such a collider is therefore not in Z and has no descendants in Z.

Assume that the trail, blocked by W ∪ Z, is not blocked by Z. Let γ be the first fork or chain node along the trail that is in W. This trail is Z-active, but is blocked by Y ∪ Z. It therefore contains a fork or chain node in Y, contradicting the assertion that α was the only node in X and β the only node in Y on the trail.
3. No. A distribution with factorisation

$$P_Z\, P_{X \mid Z}\, P_{W \mid Z}\, P_{Y \mid W,Z}$$

clearly satisfies X ⊥ Y ∣ Z ∪ W and X ⊥ W ∣ Y ∪ Z, but there are distributions with such a factorisation that do not satisfy X ⊥ Z ∣ Y ∪ W.
4. Since A ⊥ B ∣ S, it follows from the weak union result in Exercise 1 that α ⊥ B ∣ A ∪ S/{α}. This, together with the condition α ⊥ γ ∣ A ∪ S/{α, γ}, implies (using X = {α}, W = B, Z = A ∪ S/{α, γ}, Y = {γ} in the contraction statement of Exercise 1) that

α ⊥ B ∪ {γ} ∣ A ∪ S/{α, γ},

as required.
Now suppose that α ⊥ γ ∣ (A ∪ B ∪ S)/{α, γ}. Since A ⊥ B ∣ S, it follows that α ⊥ B ∣ A ∪ S/{α}. This, together with the condition, gives (using X = {α}, Y = B, W = {γ} and Z = A ∪ S/{α, γ} in the intersection statement of Exercise 1) that α ⊥ B ∪ {γ} ∣ A ∪ S/{α, γ}, as required.
5. Recall the definition of the Markov blanket: parents of A, children of A and any variables sharing a child with A. Consider the `Bayes Ball' algorithm, started at A. The ball cannot travel through an instantiated chain or fork connection, nor can it travel through a collider where neither the collider nor any of its descendants are instantiated. Otherwise, it can travel through a node along the graph.

Therefore: if all variables in the Markov blanket are instantiated, the Bayes ball cannot pass through any of the parents (by definition, the connection is necessarily chain or fork). It cannot pass through a child to any offspring of the child (the connection is necessarily a chain). If it passes through an instantiated child to another parent of the instantiated child, it cannot pass further: the connection at the point of the instantiated parent of the instantiated child is either chain or fork.
6. Firstly, it is clear that if there is an edge between α and β, then α − β is an S-active trail for any S ⊆ V/{α, β}. If there is no edge between α and β, then there are two cases. Firstly, if α ∉ MB(β) (where MB denotes the Markov blanket), then α ⊥ β ∥G MB(β). If α ∈ MB(β), but there is no edge between α and β, then α and β are parents of a common child. Let C denote the set of variables that are common children of α and β, and let V_C denote the set consisting of the nodes of C together with all their descendants.

Let S = V/({α, β} ∪ V_C); then α ⊥ β ∥G S. Any trail between α and β through a common child is blocked by virtue of an uninstantiated collider where none of the descendants are instantiated. Any trail with a common ancestor is blocked by virtue of an instantiated fork. On any trail where α is an ancestor of β, or β an ancestor of α, there is an instantiated chain connection.
7. (a) All trails between X and Y contain either a fork or chain node in Z, or a collider not in Z with no descendants in Z. When Z and γ are instantiated, there is no trail between X and Y where all the colliders are either instantiated or have an instantiated descendant and all chain and fork connections are uninstantiated.

Suppose that X ⊥̸ {γ} ∥G Z and Y ⊥̸ {γ} ∥G Z. Then for some x ∈ X and some y ∈ Y there is a Z-active trail between x and γ and a Z-active trail between y and γ. Consider the trail between x and y formed by joining the two. If γ is a chain or fork node, then the trail is active when γ is not instantiated, contradicting X ⊥ Y ∥G Z.
(b) Assume the result is not true: that α ⊥̸ β ∥G {γ}, α ⊥̸ β ∥G {δ} and α ⊥ β ∥G {γ, δ}. Then there is a {γ}-active trail between α and β with δ as a fork or chain node. Assume that there is a collider node on the trail with γ as a descendant; then there is a collider ρ and a trail δ → ρ_1 → . . . → ρ_n → ρ and hence a directed path from δ to γ that does not contain α or β, contradicting γ ⊥ δ ∥G {α, β}. It follows that there is a trail between α and β containing δ with only fork and chain connections. Similarly, there is a trail between α and β containing γ with only forks and chains. Then there is a trail between δ and γ containing α with at most one collider (α) and another trail between δ and γ containing β with at most one collider (β). If δ ⊥ γ ∥G {α, β}, then neither of these are colliders and hence there is a cycle, hence a contradiction.
8. This is a direct consequence of the definition. Let x = (x_1, . . . , x_d) and y = (y_1, . . . , y_d) where y_j = x_j = w_j for j ≠ i, x_i = 1, y_i = 0. Let π_j(x) denote the parent configuration for variable j when X = x. Then

$$O_P(\{X_i = 1\} \mid X_{V/\{X_i\}} = w) = \frac{\prod_{j=1}^{d} P_{X_j \mid \mathrm{Pa}_j}(x_j \mid \pi_j(x))}{\prod_{j=1}^{d} P_{X_j \mid \mathrm{Pa}_j}(y_j \mid \pi_j(y))} = \frac{P_{X_i \mid \mathrm{Pa}_i}(1 \mid \pi_i(x)) \prod_{j : X_i \in \mathrm{Pa}_j} P_{X_j \mid \mathrm{Pa}_j}(x_j \mid \pi_j(x))}{P_{X_i \mid \mathrm{Pa}_i}(0 \mid \pi_i(y)) \prod_{j : X_i \in \mathrm{Pa}_j} P_{X_j \mid \mathrm{Pa}_j}(y_j \mid \pi_j(y))}$$

and, from the definition, this only involves the Markov blanket of X_i; P_{X_i∣Pa_i} involves X_i and the parents of X_i, while the other conditional probabilities involve the children of X_i and their parents.
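The cancellation may also be checked numerically. In the sketch below (an invented chain X1 → X2 → X3), the odds of X2 computed from the full joint distribution coincide with the odds computed from the two factors in which X2 appears.

# Numerical check, on an invented network X1 -> X2 -> X3, that the odds
# of X2 given all the other variables involve only the factors in which
# X2 appears: P(X2 | X1) and P(X3 | X2).
cpt1 = (0.6, 0.4)                            # P(X1)
cpt2 = {0: (0.9, 0.1), 1: (0.3, 0.7)}        # P(X2 | X1)
cpt3 = {0: (0.8, 0.2), 1: (0.25, 0.75)}      # P(X3 | X2)

def joint(x1, x2, x3):
    return cpt1[x1] * cpt2[x1][x2] * cpt3[x2][x3]

x1, x3 = 1, 0  # the instantiation w of everything except X2
odds_full = joint(x1, 1, x3) / joint(x1, 0, x3)
odds_blanket = (cpt2[x1][1] * cpt3[1][x3]) / (cpt2[x1][0] * cpt3[0][x3])
assert abs(odds_full - odds_blanket) < 1e-12  # the P(X1) factor cancels
print(odds_full)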
Chapter 2

Markov Models and Markov Equivalence
Definition 2.1 (Markov Model). Let V = {1, . . . , d} and let G = (V, D) be a directed acyclic graph with node set V and directed edge set D. Let 𝒱 denote the entire set of subsets of V. The Markov Model M_G of G = (V, D) is the set of triples (A, B, S) ∈ 𝒱 × 𝒱 × 𝒱, A, B, S disjoint, such that the D-separation statement A ⊥ B ∥G S holds in the DAG. That is,

M_G = {(A, B, S) ∈ 𝒱 × 𝒱 × 𝒱 ∣ A, B, S disjoint, A ⊥ B ∥G S}.    (2.1)

Let P be a probability distribution of a random vector X = (X1, . . . , Xd), whose components are indexed by V. Let I(P) denote the entire set of conditional independence statements associated with P;

I(P) = {(A, B, S) ∈ 𝒱 × 𝒱 × 𝒱 ∣ A, B, S disjoint, X_A ⊥ X_B ∣ X_S},    (2.2)

where, for any set C ⊆ V, X_C denotes the sub-vector of random variables indexed by C. The convention is that if S = ∅ (the empty set) then X_A ⊥ X_B ∣ X_S means X_A ⊥ X_B. A distribution P is said to belong to the Markov Model of G, written P ∈ M_G, if and only if M_G ⊆ I(P). The Markov model is the set of conditional independence relations satisfied by all distributions that are locally G-Markovian (Definition 1.26).
If a distribution P factorises along a DAG G = (V, D), then the collection of triples M_G defined in Equation (2.1) of Definition 2.1 represents the entire set of conditional independence statements that it is possible to infer from the DAG. Clearly, this collection does not, in general, represent the complete set of conditional independence statements that hold for P. In fact, the probability distributions modelling real world situations, corresponding to data sets, very rarely factorise along a DAG whose D-separation statements encode the entire set of independence statements. When this does hold, the DAG is known as a perfect I-map.
Definition 2.2 (Perfect I-Map, Faithful). Let G = (V, D) be a DAG with node set V = {1, . . . , d} which indexes a random vector X = (X1, . . . , Xd). Let 𝒱 denote the set of all subsets of V. The DAG G is known as a perfect I-map for a probability function P over X if and only if I(P) = M_G. A DAG G such that I(P) = M_G is said to be faithful to P.

A DAG G = (V, D) is consistent with I(P), defined by Equation (2.2), if and only if G is faithful to P, if and only if G is a perfect I-map of P.
A DAG that is faithful to a probability distribution P satisfies the following important property:

Theorem 2.3. Let G = (V, D) be faithful to a probability distribution P. Then the edge set D contains an edge between two nodes α and β if and only if Xα ⊥̸ Xβ ∣ X_S for any S ⊆ V/{α, β}.

Proof This is a straightforward consequence of Theorem 1.24: the edge set D contains an edge between α and β if and only if α ⊥̸ β ∥G S for any S ⊆ V/{α, β}.
Definition 2.4 (I-sub-map, I-map, I-equivalence, Markov Equivalence). Let G1 and G2 be two DAGs with the same node set. The DAG G1 is said to be an I-sub-map of G2 if M_{G1} ⊆ M_{G2}. They are said to be I-equivalent if M_{G1} = M_{G2}. I-equivalence is also known as Markov equivalence.
Example 2.5.
In the following example on three variables, all three factorisations give the same independence structure. Consider a probability distribution P_{X1,X2,X3} with factorisation

$$P_{X_1,X_2,X_3} = P_{X_1}\, P_{X_2 \mid X_1}\, P_{X_3 \mid X_2},$$

so that X1 ⊥ X3 ∣ X2. It follows that

$$P_{X_1,X_2,X_3} = P_{X_2}\, P_{X_1 \mid X_2}\, P_{X_3 \mid X_1,X_2} = P_{X_2}\, P_{X_1 \mid X_2}\, P_{X_3 \mid X_2},$$

$$P_{X_1,X_2,X_3} = P_{X_3}\, P_{X_2 \mid X_3}\, P_{X_1 \mid X_2,X_3} = P_{X_3}\, P_{X_2 \mid X_3}\, P_{X_1 \mid X_2},$$

since X1 ⊥ X3 ∣ X2. For the first and last of these, X2 is a chain node, while in the second of these X2 is a fork node. The conditional independence structure associated with chains and forks is the same. The three corresponding DAGs are given in Figure 2.1.

Figure 2.1: Three DAGs, each with the same D-separation structure: 1 → 2 → 3, 1 ← 2 → 3 and 1 ← 2 ← 3.
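As an aside, Markov equivalence of DAGs can be tested by the criterion of Verma and Pearl: two DAGs are Markov equivalent if and only if they have the same skeleton and the same immoralities (colliders whose two parents are not adjacent). The following sketch (not from the text) applies this test to the graphs of Figure 2.1 and to the collider 1 → 2 ← 3.

def skeleton(dag):
    """Undirected edge set of a DAG given as node -> list of children."""
    return {frozenset((u, v)) for u, cs in dag.items() for v in cs}

def immoralities(dag):
    """Colliders u -> w <- v whose parents u, v are not adjacent."""
    parents = {v: [] for v in dag}
    for u, cs in dag.items():
        for c in cs:
            parents[c].append(u)
    skel = skeleton(dag)
    out = set()
    for w, ps in parents.items():
        for i, u in enumerate(ps):
            for v in ps[i + 1:]:
                if frozenset((u, v)) not in skel:
                    out.add((frozenset((u, v)), w))
    return out

def markov_equivalent(g1, g2):
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

# The three DAGs of Figure 2.1, and the collider 1 -> 2 <- 3.
g_chain   = {1: [2], 2: [3], 3: []}
g_fork    = {1: [],  2: [1, 3], 3: []}
g_chain_r = {1: [],  2: [1], 3: [2]}
g_coll    = {1: [2], 2: [],  3: [2]}
print(markov_equivalent(g_chain, g_fork))     # True
print(markov_equivalent(g_chain, g_chain_r))  # True
print(markov_equivalent(g_chain, g_coll))     # False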
In general, the factorisations resulting from different orderings of the variables will not necessarily give I-equivalent maps. This is illustrated by the following example on four variables.
Example 2.6.
Consider a probability distribution with factorisation

$$P_{X_1,X_2,X_3,X_4} = P_{X_1}\, P_{X_2}\, P_{X_3 \mid X_1,X_2}\, P_{X_4 \mid X_3}.$$

The DAG associated with the factorisation is the one on the left in Figure 2.2. Assume that the DAG on the left in Figure 2.2 and P_{X1,X2,X3,X4} are faithful to each other. The factorisation obtained using the ordering (X1, X4, X3, X2) is:

$$P_{X_1,X_2,X_3,X_4} = P_{X_1}\, P_{X_4 \mid X_1}\, P_{X_3 \mid X_1,X_4}\, P_{X_2 \mid X_1,X_3,X_4} = P_{X_1}\, P_{X_4 \mid X_1}\, P_{X_3 \mid X_1,X_4}\, P_{X_2 \mid X_1,X_3}.$$

Here

X1 ⊥̸ X4;

X3 ⊥̸ {X1, X4}/Θ ∣ Θ for any Θ ⊆ {X1, X4}; and

X2 ⊥ X4 ∣ {X1, X3}, but X2 ⊥̸ {X1, X3, X4}/Θ ∣ Θ for any strict subset Θ ⊂ {X1, X3}.
Figure 2.2: DAGs with different D-separation properties, corresponding to different factorisations of the same distribution. Left: the DAG with edges 1 → 3, 2 → 3 and 3 → 4. Right: the DAG with edges 1 → 4, 1 → 3, 4 → 3, 1 → 2 and 3 → 2.
The corresponding DAG (the graph on the right of Figure 2.2) gives less information on conditional independence; X4 ⊥̸ X1 ∥G2 X3, using G2 to denote the DAG on the right in Figure 2.2. The two corresponding DAGs are shown in Figure 2.2. The graph on the right is a strict I-sub-map of the graph on the left.

This example illustrates that while D-separated variables are conditionally independent conditioned on the separating set, it does not hold that conditionally independent variables are necessarily D-separated.
The set of conditional independence statements I(P) of any probability distribution P satisfies the decomposition, contraction and weak union relations of Exercise 1 of Chapter 1, together with the intersection relation when P is strictly positive. These relations are discussed by Pearl in [105] (1988). The proofs of these are quite straightforward and have been left as exercises (Exercise 1 page 22).
The Markov model M_G for a DAG G also satisfies the following: let (X, Y, W, Z) be four sets of nodes in a DAG G = (V, D); then the decomposition, weak union, contraction and intersection relations hold with the conditional independence statements replaced by the corresponding D-separation statements.

These have been left as exercises (Exercise 2 page 22). A collection of triples (X_i, Y_i, S_i)_{i∈I}, where I denotes the indexing set and each (X_i, Y_i, S_i) ∈ 𝒱 × 𝒱 × 𝒱, with X_i, Y_i, S_i mutually disjoint, which satisfies these four conditions has come to be known as a graphoid. These statements do not axiomatise conditional independence; if a given set of triples satisfies these four conditions, there does not necessarily exist a probability distribution P for which the set of conditional independence statements is I(P). Conditional independence cannot be axiomatised; this was proved by Studeny [130].
A collection of D-separation statements for a DAG also satisfies the composition property:

X ⊥ Y ∥G Z and X ⊥ W ∥G Z ⟹ X ⊥ Y ∪ W ∥G Z.

A graphoid that also satisfies composition is known as a compositional graphoid. The Markov model of a DAG, M_G, is always a compositional graphoid; the collection of independence statements I(P) is not necessarily a compositional graphoid.
A Markov model M_G also satisfies the following two properties, which are not necessarily satisfied by a collection of conditional independence statements I(P); they are the D-separation analogues of the statements in Exercises 4 and 7 of Chapter 1.

Let V = A ∪ B ∪ S where A, B and S are disjoint subsets and suppose that A ⊥ B ∥G S. Then the corresponding statement holds for any α ∈ A and γ ∈ A ∪ S, with D-separation in place of conditional independence.

Let G = (V, D) denote a directed acyclic graph, let X ⊆ V, Y ⊆ V and Z ⊆ V denote sets of nodes and let α, β, γ, δ ∈ V/(X ∪ Y ∪ Z) denote individual nodes; then the statements of Exercise 7 of Chapter 1 hold with D-separation in place of conditional independence.

The proofs of these statements are left as exercises. They are included here simply as illustration of the additional structure that is required for a Markov model M_G, over and above the set of independence statements I(P), for a probability distribution that factorises over G.
The following basic example illustrates a situation where composition does not hold for the probability
distribution and where there is no faithful DAG.
Let Y1, Y2, Y3 denote the outcomes of three independent tosses of a fair coin, each taking the values 0 and 1 with probability 1/2, and define

$$X_1 = \begin{cases} 1 & Y_2 = Y_3 \\ 0 & Y_2 \neq Y_3 \end{cases} \qquad X_2 = \begin{cases} 1 & Y_1 = Y_3 \\ 0 & Y_1 \neq Y_3 \end{cases} \qquad X_3 = \begin{cases} 1 & Y_1 = Y_2 \\ 0 & Y_1 \neq Y_2 \end{cases}$$
Then X1, X2, X3 provide the classic example of three random variables that are pairwise independent, but not jointly independent.

$$P_{X_1,X_2,X_3}(1,1,1) = P(Y_1 = Y_2 = Y_3) = P_{Y_1,Y_2,Y_3}(1,1,1) + P_{Y_1,Y_2,Y_3}(0,0,0) = \frac{1}{4}$$

$$P_{X_1,X_2,X_3}(1,1,0) = P_{X_1,X_2,X_3}(1,0,1) = P_{X_1,X_2,X_3}(0,1,1) = P(Y_2 = Y_3, Y_1 = Y_3, Y_1 \neq Y_2) = 0$$

$$P_{X_1,X_2,X_3}(1,0,0) = P_{X_1,X_2,X_3}(0,1,0) = P_{X_1,X_2,X_3}(0,0,1) = P(Y_2 = Y_3, Y_1 \neq Y_3, Y_1 \neq Y_2)$$

$$= P(Y_1 = 1, Y_2 = Y_3 = 0) + P(Y_1 = 0, Y_2 = Y_3 = 1) = \frac{1}{4}$$

$$P_{X_1,X_2,X_3}(0,0,0) = 0$$

It follows that

$$P_{X_1,X_2}(1,1) = P_{X_1,X_2}(1,0) = P_{X_1,X_2}(0,1) = P_{X_1,X_2}(0,0) = \frac{1}{4},$$

so that P_{X_1}(1) = P_{X_1}(0) = 1/2 and in all cases

$$P_{X_i,X_j}(x_i, x_j) = P_{X_i}(x_i)\, P_{X_j}(x_j).$$

But

$$\frac{1}{4} = P_{X_1,X_2,X_3}(1,1,1) \neq P_{X_1}(1)\, P_{X_2}(1)\, P_{X_3}(1) = \frac{1}{8}.$$
Since X1 ⊥ X2 but X3 ⊥̸ {X1, X2}, X3 ⊥̸ X1 ∣ X2 and X3 ⊥̸ X2 ∣ X1, it follows that the factorisation obtained for the distribution P_{X1,X2,X3} is

$$P_{X_1,X_2,X_3} = P_{X_1}\, P_{X_3}\, P_{X_2 \mid X_1,X_3} = P_{X_2}\, P_{X_3}\, P_{X_1 \mid X_2,X_3}$$

and in none of the cases do the D-separation statements of the DAG corresponding to the Bayesian Network represent all the conditional independence statements of the distribution.
The type of situation described here, where the distribution does not satisfy a composition property,
can be summarised as follows: it is the situation where X1 tells you nothing about X3 and X2 tells you
nothing about X3 , but X1 and X2 taken together tell you everything about X3 . This is the principle on
which any good detective novel is based, as Edward Nelson puts it in his book `Radically Elementary
Probability Theory' [100].
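The computation above can be checked mechanically. The following is a minimal sketch (Python; the names and representation are invented for illustration, not part of the text) that enumerates the joint distribution of (X1, X2, X3) induced by the three coins and verifies pairwise independence together with the failure of joint independence.

```python
from itertools import product

# Joint distribution of (X1, X2, X3) induced by three fair coins Y1, Y2, Y3.
P = {}
for y1, y2, y3 in product([0, 1], repeat=3):
    x = (int(y2 == y3), int(y1 == y3), int(y1 == y2))
    P[x] = P.get(x, 0.0) + 0.125

def marginal(indices):
    """Marginal distribution of the X-variables at the given positions."""
    m = {}
    for outcome, p in P.items():
        key = tuple(outcome[i] for i in indices)
        m[key] = m.get(key, 0.0) + p
    return m

# Pairwise independence holds: P(Xi, Xj) = P(Xi) P(Xj) for every pair.
for i, j in [(0, 1), (0, 2), (1, 2)]:
    pij, pi, pj = marginal([i, j]), marginal([i]), marginal([j])
    assert all(abs(pij[(a, b)] - pi[(a,)] * pj[(b,)]) < 1e-12
               for a, b in product([0, 1], repeat=2))

# Joint independence fails: P(1,1,1) = 1/4 while the product of marginals is 1/8.
print(P[(1, 1, 1)],
      marginal([0])[(1,)] * marginal([1])[(1,)] * marginal([2])[(1,)])
```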
The argument shows that in experimental design situations where there are interaction effects, but no main effects (e.g. each chemical taken separately does not produce an effect, but the interaction between two chemicals causes an effect), composition will not hold and there will not exist a faithful DAG.
There is a whole industry of structure learning algorithms, based on the principle of Theorem 2.3, which delete an edge as soon as a conditioning set S is found such that X ⊥ Y ∣S. These algorithms are elegant, cost-effective and efficient, and they return accurate results if the underlying distribution has a faithful graph. They are discussed in Chapter 16. Their drawback is that they produce wildly inaccurate results when there does not exist an underlying faithful graph. Note that if such a structure learning algorithm were applied to the three-coin example above, where X1 ⊥ X2, X1 ⊥ X3 and X2 ⊥ X3, it would remove all the edges based on the results of conditioning on S = ∅ and return the empty graph, as the sketch below illustrates. The model delivered by the algorithm would then be the independence model, which represents a disastrous failure.
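As a concrete illustration of this failure mode, here is a small sketch (hypothetical Python, not from the text) of the deletion step with S = ∅ applied to the three-coin distribution: every pairwise test succeeds, so every edge is deleted and the empty graph is returned.

```python
from itertools import product

# Joint of the three-coin example, as above.
P = {}
for y in product([0, 1], repeat=3):
    x = (int(y[1] == y[2]), int(y[0] == y[2]), int(y[0] == y[1]))
    P[x] = P.get(x, 0.0) + 0.125

def marg(idx):
    m = {}
    for outcome, p in P.items():
        k = tuple(outcome[i] for i in idx)
        m[k] = m.get(k, 0.0) + p
    return m

def independent(i, j):  # marginal independence test, i.e. S = the empty set
    pij, pi, pj = marg([i, j]), marg([i]), marg([j])
    return all(abs(pij[(a, b)] - pi[(a,)] * pj[(b,)]) < 1e-12
               for (a, b) in pij)

edges = {(i, j) for i in range(3) for j in range(i + 1, 3)
         if not independent(i, j)}
print(edges)  # set(): all edges deleted; the empty (independence) graph is returned
```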
In many causal situations, the set of variables X may be split into observable variables Z and unobservable variables U. Usually, the observable variables are descendants of the unobservable; observations are made on the observable variables and, from these observations, inferences are made about the unobservable. For example, the variables of Z could represent symptoms, while those of U could represent the diseases that cause the symptoms.
Even if there is a faithful DAG corresponding to the full set of variables X = (U, Z), there is often no faithful DAG corresponding to the observable variables Z. For example, consider a probability distribution over the five variables {U, Z1, Z2, Z3, Z4} which factorises according to
PU,Z1,Z2,Z3,Z4 = PU PZ3 PZ4∣Z3 PZ1∣U,Z3 PZ2∣U,Z4
and suppose the corresponding graph, given by Figure 2.3, is faithful. In this example, there is no faithful DAG for the distribution over (Z1, Z2, Z3, Z4); the set of D-separation statements for any DAG along which the distribution can be factorised will be a strict subset of the set of conditional independence statements. Two examples of factorisations are:
PZ1,Z2,Z3,Z4 = PZ1 PZ2∣Z1 PZ3∣Z1,Z2 PZ4∣Z1,Z2,Z3
PZ1,Z3,Z4,Z2 = PZ1 PZ3∣Z1 PZ4∣Z3 PZ2∣Z1,Z3,Z4
When all 24 permutations are considered, either an edge Z2 ∼ Z3 or an edge Z1 ∼ Z4 will be present, even though Z2 ⊥ Z3 ∣Z4 and Z1 ⊥ Z4 ∣Z3. None of the DAGs corresponding to the Bayesian networks of the 24 possible orderings of the variables represents all the conditional independence statements.
Figure 2.3: the DAG over {U, Z1, Z2, Z3, Z4}, with directed edges U → Z1, U → Z2, Z3 → Z1, Z3 → Z4 and Z4 → Z2.
Figure 2.4: the collider X1 → X2 ← X3.
Figure 2.1 shows three directed acyclic graphs, each with the same D-separation statements: X1 ⊥ X3 ∥G X2, and the DAGs do not admit any other D-separation statements. The D-separation statements of the DAG in Figure 2.4 are different from those for the DAGs in Figure 2.1.
The key result in this section characterising Markov equivalence is Theorem 2.11, which states that the two features of a directed acyclic graph which are necessary and sufficient for determining its Markov structure are its immoralities and its skeleton. These are defined below.
Definition 2.9 (Immorality). Let G = (V, E) be a graph. Let E = D ∪ U, where D is the set of directed edges, U is the set of undirected edges and D ∩ U = ∅. An immorality in a graph is a triple (α, β, γ) such that (α, β) ∈ D and (γ, β) ∈ D, but (α, γ) ∈/ D, (γ, α) ∈/ D and ⟨α, γ⟩ ∈/ U.
Definition 2.10 (Skeleton). The skeleton of a graph G = (V, E) is the graph obtained by making the graph undirected. That is, the skeleton of G is the graph G̃ = (V, Ẽ), where ⟨α, β⟩ ∈ Ẽ ⇔ (α, β) ∈ D or (β, α) ∈ D or ⟨α, β⟩ ∈ U.
Theorem 2.11. Two DAGs are Markov equivalent if and only if they have the same skeleton and the
same immoralities.
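Theorem 2.11 suggests a simple mechanical test. The following sketch (illustrative Python with hypothetical helper names, not from the text) compares two DAGs, each given as a set of directed edges, by their skeletons and immoralities.

```python
def skeleton(edges):
    return {frozenset(e) for e in edges}

def immoralities(edges):
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    adjacent = skeleton(edges)
    return {(frozenset({a, c}), b)
            for b, pa in parents.items()
            for a in pa for c in pa
            if a != c and frozenset({a, c}) not in adjacent}

def markov_equivalent(g1, g2):
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

chain = {("X1", "X2"), ("X2", "X3")}       # X1 -> X2 -> X3
fork = {("X2", "X1"), ("X2", "X3")}        # X1 <- X2 -> X3
collider = {("X1", "X2"), ("X3", "X2")}    # X1 -> X2 <- X3
print(markov_equivalent(chain, fork))      # True
print(markov_equivalent(chain, collider))  # False: (X1, X2, X3) is an immorality
```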
The key to establishing Theorem 2.11 will be to consider the active trails (Definition 1.17) in the graph. The following two definitions are also required.
Definition 2.12 (S-active node). Let G = (V, E) be a directed acyclic graph and let S ⊂ V. Recall the definition of a trail (Definition 1.4) and the definition of an active trail (Definition 1.17). A node α ∈ V is said to be S-active if either α ∈ S or there is a directed path from the node α to a node β ∈ S.
Definition 2.13 (Minimal S-active trail). Let G = (V, E) be a directed acyclic graph and let S ⊂ V. An S-active trail τ in G between two nodes α and β is said to be a minimal S-active trail if it satisfies the following two properties:
1. if k is the number of nodes in the trail, the first node is α and the kth node is β, then there does not exist an S-active trail between α and β with fewer than k nodes and
2. there does not exist a different S-active trail ρ between α and β with exactly k nodes such that for all 1 < j < k either ρj = τj or ρj is a descendant of τj.
Lemma 2.14. Let G1 = (V, D1) and G2 = (V, D2) be two directed acyclic graphs with the same skeletons and the same immoralities. Then for all S ⊂ V, a trail is a minimal S-active trail in G1 if and only if it is a minimal S-active trail in G2.
Proof of Lemma 2.14 Recall the notation from Definition 1.2; α ∼ β denotes that two nodes (α, β) ∈ V × V are neighbours; for a directed graph G = (V, D), that is, either (α, β) ∈ D or (β, α) ∈ D. Since G1 and G2 have the same skeletons, any trail τ in G1 is also a trail in G2. Let S ⊂ V. Assume that τ is a minimal S-active trail in G1. It is now proved, by induction on the number of collider nodes along the trail, that τ is also an S-active trail in G2. By definition, a single node will be considered an S-active trail, for any S ⊂ V. The proof is in three parts: let τ be a minimal S-active trail in G1. Then
1. If τ does not contain any collider connections in G1, then it does not contain any collider connections in G2 and is S-active in G2.
2. If τ contains at least one collider connection centred at node τj, then τ is S-active in G2 if and only if τj is S-active in G2.
3. If τj is a collider node of τ in G1, then τj is an S-active node in G2.
Part 1 If τ is an S-active trail in G1 which does not contain any collider connections in G1, then none of the nodes on τ are in S. This can be seen by considering the Bayes ball algorithm, which characterises D-separation. It follows that the trail is S-active in G2 if and only if it does not contain a collider connection in G2.
Let τ be a minimal S-active trail in G1 with k nodes and no collider connections in G1. Suppose that a node τi is a collider node in G2, so that τi−1 and τi+1 are parents of τi in G2. Then, so that no new immoralities are introduced, it follows that τi−1 ∼ τi+1. Since τi is either a chain or a fork node in G1, it follows that in G1 the connections between the nodes τi−2, τi−1, τi, τi+1, τi+2 take one of the forms shown in Figure 2.5 when τi is a chain node, or one of those in Figure 2.6 when τi is a fork node.
Figure 2.5: Possible connections between the nodes when τi is a chain node.
It is clear from the figure that the trail of length k − 1 in G1, obtained by removing τi and using the direct link from τi−1 to τi+1, is also an S-active trail in G1, contradicting the assumption that τ was a minimal S-active trail. Hence τi is a chain node or a fork node in G2.
It follows that there are no collider connections along the trail τ taken in G2 and hence, since it does not contain any nodes that are in S, it is an S-active trail in G2.
Part 2 Assume that any minimal S-active trail in G1 containing n collider connections is also S-active in G2. This is true for n = 0 by part 1. Let τ be a trail with k nodes that is minimal S-active in G1 and with n + 1 collider connections in G1. Consider one of the collider connections, centred at τj, with parents τj−1 and τj+1. Let τ̃(0,j−1) = (τ0, τ1, . . . , τj−2, τj−1) and let τ̃(j+1,k) = (τj+1, . . . , τk). Both τ̃(0,j−1) and τ̃(j+1,k) are minimal S-active in G1 and they both have at most n collider connections. By the inductive hypothesis, they are therefore both S-active in G2.
Because the trail τ is minimal S-active in G1, it follows that τj−1 ∼/ τj+1, for the following reason. Both τj−1 and τj+1 are S-active nodes in G1 (they have a common descendant in S to make the trail active), and neither is in S (neither is the centre of a collider along τ). If τj−1 ∼ τj+1, then the trail on k − 1 nodes obtained by removing the node τj would be S-active in G1: any chain or fork (τj−2, τj−1, τj+1) or (τj−1, τj+1, τj+2) would be active because both τj−1 and τj+1 are
uninstantiated. Any collider (τj−2 , τj−1 , τj+1 ) or (τj−1 , τj+1 , τj+2 ) would be active because both τj−1
and τj+1 have a descendant in S . It follows that τj−1 ∼/ τj+1 . This holds in both G1 and G2 , since the
skeletons are the same.
Since τ̃(0,j−1) and τ̃(j+1,k) are both active, and τj−1 → τj ← τj+1 is a collider, the trail τ is active if and only if τj is an active node; that is, if and only if it is either in S or has a descendant in S.
Part 3 Let τ be a minimal S-active trail in G1 and let τj ∈ τ be a collider node in G1. Since the trail τ is a minimal S-active trail in G1, it follows either that τj ∈ S or that τj, considered in G1, has a descendant in S. That is, considered in G1, there is a directed path from τj to a node w ∈ S. Let ρ denote the shortest such path. If τj ∈ S, then the length of the path is 0 and τj is also an S-active node in G2.
Assume there is a directed edge from τj to w ∈ S in G1. If there are links from τj−1 to w or from τj+1 to w, then these links are τj−1 → w or τj+1 → w respectively; otherwise the DAG would have cycles. If both are present, then the trail τ violates the second condition of the minimality requirement; this is seen by considering the trail formed by taking w instead of τj in τ. It follows that either τj−1 ∼/ w or τj+1 ∼/ w, or neither of the edges is present. Without loss of generality, assume τj−1 ∼/ w (the argument proceeds in the same way if τj+1 ∼/ w). The diagram in Figure 2.7 may be useful.
Figure 2.7: τj−1 → τj ← τj+1, with τj → w.
Since w is not a parent of τj in G1, it cannot be a parent of τj in G2; both graphs have the same immoralities and (τj−1, τj, w) is not an immorality in G1.
Furthermore, τj−1 ∼/ τj+1: both are uninstantiated and, in G1, both have a common descendant in S, so that if τj−1 ∼ τj+1, then the trail with τj removed would be active whether the connections at τj−1 and τj+1 are chain, fork or collider, contradicting the minimality assumption.
Since both graphs have the same immoralities and τj−1 ∼/ τj+1, it follows that (τj−1, τj, τj+1) is an immorality in both G1 and G2 and hence that τj−1 is a parent of τj in G2. Therefore, τj is a parent of w in G2 and hence τj is an S-active node in G2.
Assume that for the shortest directed path ρ from τj to w in G1, the first l links have the same directed edges in G2. Suppose that the shortest directed path is ρ, where τj = ρ0, . . . , ρl+p = w, and consider the links ρl ∼ ρl+1 and ρl+1 ∼ ρl+2. If ρl ∼ ρl+2, then in G1 the directed edge ρl → ρl+2 is present, otherwise there is a cycle; and if the directed edge ρl → ρl+2 is present in G1, then the path ρ is not minimal. Therefore, ρl ∼/ ρl+2. This holds in both G1 and G2, because both graphs have the same skeletons. By a similar argument, ρl−1 ∼/ ρl+1 (there would be a cycle in G1 if (ρl+1, ρl−1) were present; ρ would not be minimal in G1 if (ρl−1, ρl+1) were present; since the skeletons are the same, ρl−1 ∼/ ρl+1 in both G1 and G2). Since ρl ∼/ ρl+2, it follows that ρl and ρl+2 are not both parents of ρl+1 in G2; otherwise G2 would contain an immorality not present in G1. Similarly, since ρl−1 ∼/ ρl+1, the edge ρl → ρl+1 is present in G2; otherwise G2 would have the immorality (ρl−1, ρl, ρl+1), which is not present in G1, since the edge (ρl−1, ρl) is present in G2 by assumption. It follows that the directed edges (ρl, ρl+1) and (ρl+1, ρl+2) are both present in G2. By induction, therefore, the whole directed path ρ is also present in G2 and hence τj is an S-active node in both G1 and G2.
Proof of Theorem 2.11 This follows directly: let G1 and G2 denote two DAGs with the same skeleton and the same immoralities. For any set S and any two nodes α and β, it follows from the lemma, together with the definition of D-separation, that if there is an S-active trail between the two variables in one of the graphs, then there is a minimal S-active trail in that graph and hence there is also a minimal S-active trail between the two variables in the other. If there is no S-active trail between the two variables in one of the graphs, then there is no S-active trail between the two variables in the other. By definition, two variables are D-separated by a set of variables S if and only if there is no S-active trail between the two variables. Two graphs are Markov equivalent, or I-equivalent (Definition 2.4), if and only if Equation (2.3) holds for all (α, β, S) ∈ V × V × V.
Consider again the example of Figure 2.3. In any DAG over the observable variables, neither the edge Z2 ∼ Z3 nor the edge Z1 ∼ Z4 need be present, since Z2 ⊥ Z3 ∣Z4 and Z1 ⊥ Z4 ∣Z3. Since Z1 ⊥ Z4 ∣Z3, it follows that Z1 − Z3 − Z4 cannot be an immorality. Since Z3 ⊥ Z2 ∣Z4, it follows that Z3 − Z4 − Z2 cannot be an immorality. Both Z1 − Z2 − Z4 and Z2 − Z1 − Z3 are required to be immoralities, which is not possible. Therefore, either Z1 − Z2 − Z4 is not an immorality, in which case the model returned contains the false independence statement Z1 ⊥ Z4 ∣{Z2, Z3}, or else Z2 − Z1 − Z3 is not an immorality, in which case the model returned contains the false independence statement Z2 ⊥ Z3 ∣{Z1, Z4}.
At the same time, the requirement that both Z1 − Z2 − Z4 and Z2 − Z1 − Z3 are immoralities can be resolved by adding a hidden variable U, to obtain Figure 2.3.
Figure 2.8: a DAG over {α1, α2, α3, α4} with directed edges (α1, α2), (α1, α3), (α1, α4), (α2, α4) and (α3, α4).
Figure 2.9: the two other members of the Markov equivalence class of the DAG in Figure 2.8, obtained by reversing the edge (α1, α2) or the edge (α1, α3).
For the DAG in Figure 2.8, all the DAGs with the same skeleton can be enumerated, and it is clear that those in Figure 2.9 are the only other two that satisfy the criteria. To find the DAGs equivalent to the one in Figure 2.8, the immorality (α2, α4, α3) has to be preserved and no new immoralities may be added. The directed edges (α1, α4), (α2, α4) and (α3, α4) are therefore essential: the directed edges (α2, α4) and (α3, α4) form the immorality, and the directed edge (α1, α4) is forced because the connection (α2, α1, α3) is either a fork or a chain, so that the orientation (α1, α4) is required to prevent a cycle. These three directed edges will be present in any equivalent DAG. The three remaining edges of the skeleton (α1 ∼ α2, α1 ∼ α3 and α1 ∼ α4) may be oriented in 2³ = 8 different ways, but only 5 of these lead to DAGs (the other orientations contain cycles) and, of these 5, only the three shown in Figures 2.8 and 2.9 have the same immoralities; this may be checked by brute force, as in the sketch below.
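Here is a brute-force verification of the enumeration (illustrative Python, with αi written ai, using the edge set assumed above for Figure 2.8).

```python
from itertools import product

fixed = {("a2", "a4"), ("a3", "a4")}                 # the immorality (a2, a4, a3)
free = [("a1", "a2"), ("a1", "a3"), ("a1", "a4")]    # the three edges to orient

def acyclic(edges):
    # Repeatedly strip nodes with no incoming edges; a cycle survives stripping.
    nodes = {v for e in edges for v in e}
    edges = set(edges)
    while nodes:
        sources = {n for n in nodes if all(b != n for _, b in edges)}
        if not sources:
            return False
        nodes -= sources
        edges = {e for e in edges if e[0] in nodes and e[1] in nodes}
    return True

def immoralities(edges):
    adjacent = {frozenset(e) for e in edges}
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    return {(frozenset({a, c}), b) for b, s in parents.items()
            for a in s for c in s if a != c and frozenset({a, c}) not in adjacent}

figure_2_8 = fixed | set(free)
dags = []
for flips in product([False, True], repeat=3):
    g = fixed | {(b, a) if flip else (a, b) for (a, b), flip in zip(free, flips)}
    if acyclic(g):
        dags.append(g)
equivalent = [g for g in dags if immoralities(g) == immoralities(figure_2_8)]
print(len(dags), len(equivalent))   # 5 and 3, as stated in the text
```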
A useful starting point for locating all the DAGs that are Markov equivalent to a given DAG is to locate the essential graph, given in the following definition.
Definition 2.15 (Essential Graph). Let G be a directed acyclic graph. The essential graph G∗ associated with G is the graph with the same skeleton as G, but where an edge is directed in G∗ if and only if it occurs as a directed edge with the same orientation in every DAG that is Markov equivalent to G. The directed edges of G∗ are the essential edges of G.
The edges that are directed in an essential graph are the compelled edges.
Definition 2.16 (Compelled Edge). Let G = (V, E) be a chain graph, where E = D ∪ U. A directed edge (α, β) ∈ D is said to be compelled if it occurs in at least one of the configurations in Figure 2.10.
Figure 2.10: the four configurations (top left, top right, bottom left, bottom right) in which the direction of the edge (α, β) is compelled.
Lemma 2.17. In an essential graph, the directed edges are the compelled edges; all other edges are undirected.
Proof From the definition, the directed edges are those that necessarily have the same direction in every Markov equivalent DAG. The figure in the top right shows the immoralities; these are necessarily directed. The direction is forced in the figure in the top left, otherwise the graph contains an additional immorality. The direction is forced in the structure on the bottom left, otherwise there is a cycle. The direction is forced in the structure in the bottom right, otherwise both (γ1, α) and (γ2, α) are forced to prevent cycles appearing and (γ1, α, γ2) is an additional immorality.
To show that these are the only compelled edges, consider two nodes α and β which are neighbours. Firstly, suppose that α and β do not have any other common neighbours. If α does not have a neighbour γ such that there is a directed edge (γ, α), then the direction (α, β) is not forced; no additional immorality or cycle is created by either direction.
Now suppose that α and β have at least one common neighbour. If there is no neighbour γ such that both (α, γ) and (γ, β) are in the directed edge set, then it is not necessary to force the direction (α, β) to prevent a cycle.
Suppose, furthermore, that there is no pair of common neighbours γ1 and γ2 such that (γ1, β) and (γ2, β) are both in the directed edge set and (γ1, α, γ2) is not an immorality. Then it is not necessary to force the direction (α, β) to prevent either (γ1, α, γ2) becoming an immorality or a cycle appearing.
Notes The terminology Markov model corresponding to a directed acyclic graph G = (V, E) was introduced into the literature and may be found in Andersson, Madigan, Perlman and Triggs [2]. The results on Markov equivalence are taken from T. Verma and J. Pearl [140]. The rules for determining compelled edges were formulated by Meek in [93]. A rigorous treatment covering chain graphs is [129] (Studený). Exercise 7 page 352 is taken from [150], while Exercises 4 page 352 and 5 page 352 are taken from Chickering [24].
Chapter 3
Intervention Calculus
Figure 3.1: on the left, hidden common causes H1 (of A and C) and H2 (of B and C); on the right, the direct causes A → C ← B.
The direction from cause to effect must be part of the modelling assumptions before the data is analysed, determined by other considerations. The data analysis only determines which directed edges remain and which are removed. From data, one can determine whether or not there is an association between earth tremors and alarms triggered; it is not possible to determine from the data what causes what.
This is self evident, but it turns out, surprisingly, to be necessary to state it. The purpose of the article by Freedman and Humphreys [43] (1999) was to point out the obvious fact that causality cannot be inferred from data alone; it was a necessary response to errors in the literature, where the term `causal discovery' had been used in surprising ways to describe the learning of a directed edge in a DAG, even after it had been established, with simple and concrete examples, that arrows learned from data alone do not in themselves represent causality. The article by Freedman and Humphreys is a good article; it is surprising that the literature had degenerated to such an extent that it was necessary for the authors to write it.
To define a causal network, an additional ingredient is needed; this is the concept of intervention, introduced by Judea Pearl in the seminal article [107] from 1995.
PY∣X(y∣x) = PX,Y(x, y) / PX(x).
This formula describes the way that the probability distribution of the random variable Y changes after X = x is observed. If, instead, the value X = x is forced by the observer, irrespective of other considerations, the conditional probability statement is invalid.
If random variables are linked through a causal model, expressed by a directed acyclic graph where parent variables have a causal effect on their children, some attempt can be made to compute the probability distribution over the remaining variables when the states of some variables are forced.
In a controlled experiment, a variable is forced to take a particular value, chosen at random, irrespective of the other variables in the network. In terms of the directed acyclic graph, the variable is instantiated with this value, the directed edges between the variable and its parents are removed (because the parents no longer have influence on the state of the variable) and all other conditional probabilities remain unaltered.
The following notation is used for marginalisation: for a function ϕ of xV and a subset A ⊆ V,
(∑V/A ϕ)(xA) = ∑xV/A ϕ(xV/A, xA).
Definition 3.1 (The Intervention Formula). The conditional probability of XV/A = xV/A, given that the variables XA were forced to take the values xA independently of all else, is written PV/A∥A(xV/A ∥ xA) and defined as
PV/A∥A(xV/A ∣ XA ← xA) = PV/A∥A(xV/A ∥ xA) = ∏v∈V/A Pv∣Pa(v)(xv ∣ xPa(v)). (3.1)
Equivalently,
PV/A∥A(xV/A ∥ xA) = PV(xV) / ∏v∈A Pv∣Pa(v)(xv ∣ xPa(v)). (3.2)
The last expression of Equation (3.1) is in terms of the required factorisation: instantiation of the variables indexed by the set A and elimination of those edges in D which lead from the parents of the nodes in A into the nodes in A. The terminology `local surgery' is used to describe such an elimination. A local surgery is performed and the conditional probabilities on the remaining edges are multiplied. This yields a factorisation along a mutilated graph, where the direct causes of the manipulated variables are put out of effect.
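A minimal sketch of Definition 3.1 follows (Python; the dictionary representation of conditional probability tables is invented for the illustration, not the book's notation): the joint is the product of all conditional probability tables, and the interventional distribution simply omits the factors of the intervened nodes.

```python
def joint(cpts, x):
    """Product of all CPTs: P_V(x) for a full assignment x (a dict node -> value)."""
    p = 1.0
    for v, (pa, table) in cpts.items():
        p *= table[(x[v], tuple(x[u] for u in pa))]
    return p

def intervention(cpts, x, forced):
    """P_{V/A||A}(x_{V/A} || x_A): omit the factors of the intervened nodes A."""
    p = 1.0
    for v, (pa, table) in cpts.items():
        if v not in forced:
            p *= table[(x[v], tuple(x[u] for u in pa))]
    return p

# X -> Y as in Example 3.2 below: forcing Y leaves the distribution of X unchanged.
cpts = {"X": ((), {(0, ()): 0.3, (1, ()): 0.7}),
        "Y": (("X",), {(0, (0,)): 0.9, (1, (0,)): 0.1,
                       (0, (1,)): 0.2, (1, (1,)): 0.8})}
print(intervention(cpts, {"X": 1, "Y": 0}, forced={"Y"}))  # 0.7 = P(X = 1)
```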
The intervention formula (3.1) is obtained by wiping out those factors from the factorisation which correspond to the interventions. An explicit translation of intervention in terms of `wiping out' equations was first proposed by Strotz and Wold [128] (1960).
The quantity PV/A∥A(.∥xA) from Definition 3.1 defines a family of probability measures over XV/A, which depends on the values xA; these may be considered as parameters. They are the values forced on the variables indexed by A. This family includes the original probability measure; if A = ∅, then PV/A∥A(.∥xA) = PX(.). This family is known as the intervention measure. In addition, the expression on the right hand side of (3.1) is called the intervention formula.
Intervention An `intervention' is an action taken to force a variable into a certain state, without reference to its own current state or the states of any of the other variables. It may be thought of as choosing the values x∗A for the variables XA by using a random generator independent of the variables X.
By contrast, conditioning by observation gives
PV/A∣A(xV/A ∣ xA) = PV(x) / PA(xA). (3.4)
Example 3.2.
Consider the DAG given in Figure 3.2, for `X having causal effect on Y'.
Figure 3.2: X → Y.
The intervention formula gives
PY∥X(y∥x) = PY∣X(y∣x).
Since X is a parent of Y, the intervention to force X = x produces exactly the same conditional probability distribution over Y as observing X = x. But if instead Y is forced, the intervention formula yields
PX∥Y(x∥y) = PX(x).
Clearly, PX∥Y(x∥y) ≠ PX∣Y(x∣y) as functions, unless X and Y are independent.
The `wet pavement' example is a classic illustration, introduced by Judea Pearl; see, for example, page 15 of [109]. The DAG represents a causal model for a wet pavement and is given in Figure 3.3. The season A has four states: spring, summer, autumn, winter. Rain B has two states: yes / no. Sprinkler C has two states: on / off. Wet pavement D has two states: yes / no. Slippery pavement E has two states: yes / no.
The joint probability distribution is factorised as
PA,B,C,D,E = PA PB∣A PC∣A PD∣B,C PE∣D.
Figure 3.3: the wet pavement DAG; A → B, A → C, B → D, C → D and D → E.
Suppose, without reference to the values of any of the other variables and without reference to the current state of the sprinkler, `sprinkler on' is now enforced. This could be, for example, regular maintenance work, which is carried out at regular intervals, irrespective of the season or other considerations. Then
PA,B,D,E∥C(.∥C ← 1) = PA,B,C,D,E(., ., 1, ., .) / PC∣A(1∣.) = PA PB∣A PD∣B,C(.∣., 1) PE∣D.
After observing that the sprinkler is on, it may be inferred that the season is dry, that it probably did not rain, and so on. If `sprinkler on' is enforced, without reference to the state of the system when the action is taken, then no such inference should be drawn in evaluating the effects of the intervention. The resulting DAG is given in Figure 3.4. It is the same as before, except that C = 1 is fixed and the edge between C and A disappears. The deletion of the factor PC∣A represents the understanding that whatever relationships existed between sprinklers and seasons prior to the action, found from PC∣A, are no longer in effect when the state of the variable is forced, as in a controlled experiment, without reference to the state of the system.
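The contrast can be made numeric. In the sketch below (Python; the probabilities are made up for illustration, the text supplies none), observing `sprinkler on' changes the belief about rain B, while forcing `sprinkler on' leaves it at its prior value.

```python
seasons = ["spring", "summer", "autumn", "winter"]
P_A = {s: 0.25 for s in seasons}
P_B_given_A = {"spring": 0.4, "summer": 0.1, "autumn": 0.5, "winter": 0.6}  # rain
P_C_given_A = {"spring": 0.4, "summer": 0.9, "autumn": 0.2, "winter": 0.1}  # sprinkler on

# See-conditioning: P(B = rain | C = on) via Bayes' rule (B and C linked through A).
p_c = sum(P_A[a] * P_C_given_A[a] for a in seasons)
p_b_see = sum(P_A[a] * P_C_given_A[a] * P_B_given_A[a] for a in seasons) / p_c

# Do-conditioning: the factor P(C | A) is deleted, so P(B || C <- on) = P(B).
p_b_do = sum(P_A[a] * P_B_given_A[a] for a in seasons)

print(round(p_b_see, 3), round(p_b_do, 3))  # 0.256 vs 0.4
```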
Definition 3.4 (Causal Model). Let X = (X1, . . . , Xd) be a random vector and let V = {1, . . . , d} denote the indexing set. A causal model consists of the following:
Figure 3.4: the mutilated graph after the intervention C ← 1; the edge between A and C has been removed.
1. A Bayesian network for PX; that is, an ordering σ of the indices V and a factorisation of the probability distribution
PV = ∏_{j=1}^{d} Pσ(j)∣Pa(σ)(j) (3.5)
where Pa(σ)(j) ⊆ {σ(1), . . . , σ(j − 1)} and is the smallest such subset such that (3.5) holds.
2. The node set V consists of two types of nodes, VI and VN, where VI ∩ VN = ∅ and VI ∪ VN = V. The nodes of VI are the interventional nodes; the nodes of VN are the non-interventional nodes, where no intervention is possible. The intervention formula (3.1) holds for each subset A ⊆ VI of interventional nodes and each xA ∈ XA.
The arrows α ↦ β of the DAG with either α or β (or both) in VI are causal arrows, indicating direct cause to effect. The remaining arrows are non-causal; a cause-to-effect relation between nodes α and β cannot be inferred from an arrow α ↦ β if both α, β ∈ VN.
In many cases, a model contains hidden variables, which cannot be observed. A special case of this is the semi-Markov model, where the hidden variables are common causes and where none of the hidden variables have observable ancestors.
Definition 3.5 (Semi-Markov Model). A semi-Markov model is a causal model for a random vector X with node set V = V ∪ U, where V are the observable variables, VI ⊂ V (intervention can be made on a subset of the observable nodes), VN = V/VI (observable nodes on which intervention cannot be made) and VN = VN ∪ U.
The nodes of VI correspond to interventional variables, U ⊂ VN are the hidden (latent) variables and VN represents the observable variables on which no intervention can be made.
For a semi-Markov model, the requirement is that the variables of U have no ancestors in V.
Notation Throughout, if a variable is named U, or Ui (for some index i), it may be assumed that the variable (or its index) belongs to U.
Figures 3.5 and 3.6: three connections between α and β via γ (top row) and the corresponding mutilated graphs after the intervention Xγ ← z (bottom row).
If it is possible to control variables, then it is possible to learn whether or not a collider represents independent causes with a common effect. If the DAG on the left hand side of Figure 3.1 represents a causal structure, then an experiment where the variable A is controlled will establish that it is not a direct cause of C, since an intervention on A leaves it separated from the rest of the network, as in Figure 3.7.
Figure 3.7: after the intervention A ← a, the node A is separated from the rest of the network, in which H1 → C, H2 → B and H2 → C remain.
Proposition 3.7. For each j ∈ V, let Pa(j) denote the set of parents of node j and XPa(j) the state space of XPa(j). For each (x, π) ∈ Xj × XPa(j),
Pj∥Pa(j)(x∥π) = Pj∣Pa(j)(x∣π). (3.6)
For all j ∈ V and each S ⊆ V such that S ∩ ({j} ∪ Pa(j)) = ∅, for each (xj, πj, xS) ∈ Xj × XPa(j) × XS,
Pj∥Pa(j),S(xj∥πj, xS) = Pj∣Pa(j)(xj∣πj).
Proof Equation (3.6) is established first. Pj∥Pa(j)(.∥πj) is a marginal distribution which depends on the enforced value XPa(j) ← πj. For all (xj, πj) ∈ Xj × XPa(j),
Pj∥Pa(j)(x∥π) = ∑_{xV/Pa(j): xj = x} ∏_{v∈V/Pa(j)} Pv∣Pa(v)(xv∣xPa(v)), evaluated at (xj, xPa(j)) = (x, π).
It follows, using
∑_{x∈Xv} Pv∣Pa(v)(x∣πv) = 1
for any πv ∈ XPa(v), that Pj∥Pa(j)(x∥π) = Pj∣Pa(j)(x∣π), which is Equation (3.6).
Once all the direct causes of a variable Xj are controlled, no other interventions will affect the conditional probability distribution of Xj.
Proof By marginalisation, followed by an application of the intervention formula (3.1), for each (x, π) ∈ Xj × XPa(j),
The probability measure after intervention is factorised along the mutilated graph. The following proposition determines the probabilities on the mutilated graph, where the conditioning is taken in the sense that first the `do'-conditioning XA ← xA is applied and then the set of variables Pa(j)/A is observed.
Summing from right to left, the variables in V/(A ∪ Pa(j) ∪ {j}) with j in their parent set are summed over. The right hand side may be thought of as a pre-intervention probability, which can be estimated from the data before the intervention C ← 1 is made. In this case, an estimate of the pre-intervention probability PD∣B,C(.∣., 1) is also an estimate of the post-intervention probability PD∣B∥C(.∣.∥1).
Proposition 3.11.
PV/{j}∥j(xV/{j}∥y) = PV/({j}∪Pa(j))∣j,Pa(j)(xV/({j}∪Pa(j))∣y, xPa(j)) PPa(j)(xPa(j))
Proof One term has been removed from the product, namely Pj∣Pa(j)(y∣xPa(j)), so that (with xj = y)
∏_{v∈V/{j}} Pv∣Pa(v)(xv∣xPa(v)) = PV(xV) / Pj∣Pa(j)(y∣xPa(j))
= PV(x) PPa(j)(xPa(j)) / Pj,Pa(j)(y, xPa(j))
= PV/({j}∪Pa(j))∣j,Pa(j)(xV/({j}∪Pa(j))∣y, xPa(j)) PPa(j)(xPa(j))
as required.
Proposition 3.12 (Adjustment for Direct Causes). Let G = (V, D) be a DAG and let B ⊂ V be such that ({j} ∪ Pa(j)) ∩ B = ∅. Then for any (xB, y) ∈ XB × Xj,
PB∥j(xB∥y) = ∑_{xV/(B∪{j})} PV/({j}∪Pa(j))∣j,Pa(j)(xV/({j}∪Pa(j))∣y, xPa(j)) PPa(j)(xPa(j)).
Figure 3.8: the causal structure C → A, C → B, A → B. Figure 3.9: the same structure after the intervention A ← a; the edge C → A is removed.
Consider the conditional probability of B when A is controlled, PB∥A(.∥a). The DAG illustrating the intervention is shown in Figure 3.9. Note that PC∥A = PC and that PB∣C∥A = PB∣C,A, where in the second term the do-conditioning A ← a is applied first, and then C is observed. It follows that
PB∥A(.∥a) = (PB∣A,C PC)↓C = ∑_c PB∣A,C(.∣a, c) PC(c).
This shows that to estimate PB∥A(.∥a) from data alone (i.e. without controlling A), it is necessary to be able to estimate PB∣A,C and PC from data. If C is observable, then the effect on the probability distribution of B of manipulating A may be estimated. But if C is a hidden random variable (sometimes the term latent is used), in the sense that no direct sample of the outcomes of C may be obtained, it will not be possible to estimate the probabilities used on the right hand side and hence it will not be possible to predict the effect on B of manipulating A. This is known as confounding.
When C is hidden, estimates of the effect of A on B from observational data are confounded by the effects of hidden variables. The following result is referred to as `The Sure Thing Principle'. It states that when Figure 3.8 represents the causal structure and there is do-conditioning on A, then Simpson's paradox does not hold.
Proposition 3.13. Consider three binary variables A, B, C with the network given in Figure 3.8. If
PB∣C∥A(1∣1∥1) < PB∣C∥A(1∣1∥0)
and
PB∣C∥A(1∣0∥1) < PB∣C∥A(1∣0∥0)
then
PB∥A(1∥1) < PB∥A(1∥0).
Proof Firstly,
PB∥A(1∥1) = ∑_{x=0}^{1} PB∣C∥A(1∣x∥1) PC∥A(x∥1) = ∑_{x=0}^{1} PB∣C∥A(1∣x∥1) PC(x).
Similarly,
PB∥A(1∥0) = ∑_{x=0}^{1} PB∣C∥A(1∣x∥0) PC(x).
It now follows directly from the assumptions that PB∥A(1∥1) < PB∥A(1∥0).
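A numerical illustration of Proposition 3.13 (the numbers below are invented for the illustration): within each stratum of C, A = 1 gives the higher success probability and the interventional comparison preserves this, while the observational comparison, which weights the strata by PC∣A instead of PC, reverses it — Simpson's paradox.

```python
P_C = {0: 0.5, 1: 0.5}                        # confounder C (e.g. severity)
P_A1_given_C = {0: 0.23, 1: 0.77}             # P(A=1 | C): treatment choice
P_B1 = {(1, 0): 0.93, (0, 0): 0.87,           # P(B=1 | A, C): A=1 is better
        (1, 1): 0.73, (0, 1): 0.69}           # within each stratum of C

# Do-conditioning (as in the proof): P(B=1 || A <- a) = sum_c P(B=1 | a, c) P(c).
do = {a: sum(P_B1[(a, c)] * P_C[c] for c in (0, 1)) for a in (0, 1)}

# See-conditioning: P(B=1 | A=a) weights the strata by P(c | a) instead of P(c).
def see(a):
    w = {c: (P_A1_given_C[c] if a else 1 - P_A1_given_C[c]) * P_C[c]
         for c in (0, 1)}
    z = sum(w.values())
    return sum(P_B1[(a, c)] * w[c] / z for c in (0, 1))

print(do[1], do[0])    # 0.83 > 0.78: no reversal under intervention
print(see(1), see(0))  # 0.776 < 0.829: the observational comparison reverses
```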
Example 3.14.
Consider an experiment in which soil fumigants, X, are to be used to increase oat crop yields, Y, by controlling the eelworm population, Z. The fumigants may also have direct effects on yields, both beneficial and adverse, besides the control of eelworms. We would like to assess the total effect of the fumigants on yields when the study is complicated by several factors. First, controlled, randomised experiments are infeasible: farmers insist on deciding for themselves which plots are to be fumigated. Secondly, the farmers' choice of treatment depends on last year's eelworm population, Z0. This is an unknown quantity, but it is strongly correlated with this year's population. This presents a classic case of confounding bias, which interferes with the assessment of the treatment effects, regardless of sample size. Fortunately, through laboratory analysis of soil samples, the eelworm populations before and after treatment can be determined. Furthermore, since fumigants are only active for a short period, they do not affect the growth of eelworms surviving the treatment; eelworm growth depends on the population of birds and other predators, which, in turn, is correlated with last year's eelworm population and hence with the treatment itself.
The situation may be represented by the causal diagram in Figure 3.11. The variables are:
X fumigants,
Y crop yields,
Z0 last year's eelworm population,
Z1 the eelworm population before treatment,
Z2 the eelworm population after treatment,
Z3 the eelworm population at the end of the season,
B the population of birds and other predators.
Definition 3.15 (Identifiable). The causal effect of X on Y is said to be identifiable if the quantity PY∥X can be computed uniquely from the probability distribution of the observable variables.
Figure 3.11: the causal diagram for the eelworm example; Z0 is unobserved, and Z2 and Z3 are descendants of X.
In this section, two graphical conditions are described which ensure that causal effects can be estimated consistently from observational data. The first of these is the back-door criterion, which is equivalent to the ignorability condition of Rosenbaum and Rubin [119] (1983). The second is the front-door criterion, which involves covariates that are affected by the treatment (in this example, Z2 and Z3).
Definition 3.16 (Back Door Criterion). A set of nodes C satisfies the back door criterion relative to an ordered pair of nodes (X, Y) ∈ V × V if
1. no node of C is a descendant of X and
2. C blocks every trail (in the sense of D-separation) between X and Y which contains an edge pointing to X.
If A and B are two disjoint subsets of nodes, C is said to satisfy the back door criterion relative to (A, B) if it satisfies the back door criterion relative to every pair (Xi, Xj) ∈ A × B.
Example 3.17.
In Figure 3.11, the set C = {Z0} satisfies the back door criterion relative to (X, Y). The node Z0 is unobservable. The set C = {Z1, Z2, Z3} does block all trails between X and Y with an arrow pointing into X, but it clearly does not satisfy the back door criterion, since both Z2 and Z3 are descendants of X.
The name `back door criterion' reflects the fact that the second condition requires that only trails with edges pointing into Xi be blocked. These trails can be seen as entering Xi through a back door.
Example 3.18.
Consider the back door criterion for the DAG given in Figure 3.12. The sets of variables C1 = {Z3, Z4} and C2 = {Z4, Z5} satisfy the back door criterion relative to the ordered pair of nodes (X, Y), whereas C3 = {Z4} does not satisfy the criterion relative to the ordered pair of nodes (X, Y): if Z4 is instantiated, the Bayes ball may pass through the collider connection from Z1 to Z2.
Figure 3.12: a DAG over {Z1, . . . , Z6, X, Y} in which C1 = {Z3, Z4} and C2 = {Z4, Z5} satisfy the back door criterion relative to (X, Y) and Z1 → Z4 ← Z2 is a collider connection.
Identifiability Consider a causal network and let A ⊆ V/{X, Y} be a subset of the variables which satisfies the back door criterion with respect to an ordered pair (X, Y). The aim is to show that the set of variables A plays a similar role to the variable C in the discussion on confounding.
The quantity PY∥X may be expressed as:
PY∥X = (PY,A∥X)↓A = (PY∣A∥X PA∥X)↓A.
The two terms in the sum may be expressed in terms of see-conditioning: PA∥X = PA and PY∣A∥X = PY∣A,X. These may be seen as follows. Firstly, since no variables of A are descendants of X, it follows that PA∥X = PA. The variables may be ordered as V = (Y1, . . . , Yn, X, Yn+1, . . . , Yn+m), where the ordering is chosen such that Pa(Yj) ⊆ {Y1, . . . , Yj−1} for j ≤ n, Pa(X) ⊆ {Y1, . . . , Yn}, Pa(Yj) ⊆ {Y1, . . . , Yn, X, Yn+1, . . . , Yj−1} for j ∈ {n + 1, . . . , n + m}, and where A ⊆ {Y1, . . . , Yn}. From the intervention formula,
PV/X∥X = ∏_{j=1}^{m+n} PYj∣Pa(Yj)
while
PV = PX∣Pa(X)(x∣πX) ∏_{j=1}^{m+n} PYj∣Pa(Yj).
Now marginalise over the variables Yn+1, . . . , Yn+m in both expressions, then marginalise over X in the second expression, and finally marginalise over all the remaining variables not in A. The same answer is obtained from both expressions, so that PA∥X = PA.
Second, since A blocks all trails between Y and X that have an edge pointing towards X, it follows that Y ⊥ (Pa(X)/A) ∥G A. It follows, with notation that should be clear, using Proposition 3.12, that
PY∣A∥X = (PY∣A,Pa(X)∥X PPa(X)/A∥X)↓Pa(X)/A
= (PY∣A,Pa(X),X PPa(X)/A)↓Pa(X)/A
= (PY∣A,X PPa(X)/A)↓Pa(X)/A
= PY∣A,X.
To go from the first to the second line: do-conditioning on X does not alter the probabilities of ancestors of X, hence PPa(X)/A∥X = PPa(X)/A; also, for conditional probabilities of Y, do-conditioning on ancestors of Y is the same as see-conditioning on ancestors of Y, hence PY∣A,Pa(X)∥X = PY∣A,Pa(X),X.
To go from the second to the third line: once X is known, Pa(X) gives no further information.
In conclusion,
PY∥X = (PY∣A,X PA)↓A. (3.9)
If a set of variables A satisfying the back door criterion with respect to (X, Y) can be chosen such that PA and PY∣A,X can be estimated from the observed data, then the distribution PY∥X can also be estimated from the observed data.
Lemma 3.19 (Identifiability). If a set of variables Z satisfies the back door criterion relative to (X, Y), then the causal effect of X on Y is given by the formula
PY∥X = (PY∣X,Z PZ)↓Z. (3.10)
Proof This follows directly from the definition of identifiability (Definition 3.15) and the analysis above.
Formula (3.10) is named adjustment for concomitants. The word identifiability refers to the fact that the concomitants Z satisfying the back door criterion are observable, and hence it is possible to compute, or identify, the intervention probability PY∥X(y∥x) using the `see' conditional probabilities (PXj∣Pa(j))_{j=1}^{d}.
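Formula (3.10) translates directly into code. The sketch below (illustrative Python; the function and table names are invented) estimates PY∥X from a table of the observable joint distribution of (X, Y, Z) by adjustment, and checks it on the confounded three-variable model of Section 3.6 with the numbers used earlier.

```python
from collections import defaultdict

def backdoor_adjust(joint, x, y):
    """joint maps (x, y, z) -> P(X=x, Y=y, Z=z); Z satisfies the back-door criterion.
    Returns sum_z P(Y=y | X=x, Z=z) P(Z=z)."""
    P_z = defaultdict(float)
    P_xz = defaultdict(float)
    for (xv, yv, zv), p in joint.items():
        P_z[zv] += p
        P_xz[(xv, zv)] += p
    return sum((joint.get((x, y, z), 0.0) / P_xz[(x, z)]) * P_z[z]
               for z in P_z if P_xz[(x, z)] > 0)

# Joint of the model C -> A, C -> B, A -> B, written as (A, B, C) triples.
joint = {}
for c, pc in {0: 0.5, 1: 0.5}.items():
    for a in (0, 1):
        pa = (0.77 if c else 0.23) if a else (0.23 if c else 0.77)
        pb = {(1, 0): 0.93, (0, 0): 0.87, (1, 1): 0.73, (0, 1): 0.69}[(a, c)]
        for b in (0, 1):
            joint[(a, b, c)] = pc * pa * (pb if b else 1 - pb)
print(backdoor_adjust(joint, x=1, y=1))   # 0.83, matching the interventional value
```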
Definition 3.20 (Front Door Criterion). A set of variables Z satisfies the front door criterion relative to the ordered pair (X, Y) if:
1. Z intercepts all directed paths from X to Y;
2. there is no unblocked back-door trail from X to Z; and
3. all back-door trails from Z to Y are blocked by X.
The situation is illustrated in Figure 3.13. The variable U is a hidden (latent) variable. The variable Z satisfies the front door criterion relative to (X, Y).
Figure 3.13: U → X, U → Y, X → Z and Z → Y; the variable U is hidden.
Theorem 3.21 (Front Door Criterion). Let Z satisfy the front door criterion relative to the ordered pair (X, Y). Then the causal effect on Y of an intervention on X is:
PY∥X = (PZ∣X PY∥Z)↓Z.
This is self evident; note that PY∥Z = (PY∣Z,U PU)↓U = (PY∣Z,X PX)↓X. In other words, if the see-conditionals PZ∣X, PY∣Z,X and PX are available, then the intervention PY∥X may be computed.
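The front-door formula can be checked numerically. The following sketch (Python; all the distributions are invented for the illustration) builds the observable joint over (X, Z, Y) from a hidden-confounder model with the structure of Figure 3.13 and evaluates the front-door expression; the result agrees with the effect computed from the full model, here 0.616 for PY∥X(1∥1).

```python
def frontdoor_adjust(joint, x, y, xs, zs):
    """joint maps (x, z, y) -> P(x, z, y); returns sum_z P(z|x) sum_x' P(y|x',z) P(x')."""
    P_x = {a: sum(joint[(a, z, b)] for z in zs for b in (0, 1)) for a in xs}
    def p_z_given_x(z, a):
        return sum(joint[(a, z, b)] for b in (0, 1)) / P_x[a]
    def p_y_given_xz(b, a, z):
        return joint[(a, z, b)] / sum(joint[(a, z, c)] for c in (0, 1))
    return sum(p_z_given_x(z, x)
               * sum(p_y_given_xz(y, a, z) * P_x[a] for a in xs)
               for z in zs)

# Observable joint generated from U -> X, U -> Y, X -> Z -> Y, with U hidden.
P_U = {0: 0.6, 1: 0.4}
P_X1_U = {0: 0.2, 1: 0.8}                       # P(X=1 | U)
P_Z1_X = {0: 0.3, 1: 0.9}                       # P(Z=1 | X)
P_Y1_ZU = {(0, 0): 0.1, (1, 0): 0.5,            # P(Y=1 | Z, U)
           (0, 1): 0.4, (1, 1): 0.9}
joint = {(x, z, y): sum(P_U[u]
                        * (P_X1_U[u] if x else 1 - P_X1_U[u])
                        * (P_Z1_X[x] if z else 1 - P_Z1_X[x])
                        * (P_Y1_ZU[(z, u)] if y else 1 - P_Y1_ZU[(z, u)])
                        for u in (0, 1))
         for x in (0, 1) for z in (0, 1) for y in (0, 1)}

print(frontdoor_adjust(joint, x=1, y=1, xs=(0, 1), zs=(0, 1)))  # 0.616
```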
3.7.3 Non-Identifiability
There are various conditions for non-identifiability of PY∥X. These include:
1. A necessary condition is that there is an unblockable back-door trail between X and Y; that is, a trail ending with an arrow pointing into X which cannot be blocked by observable non-descendants of X. This is not a sufficient condition, as Figure 3.13 illustrates: it shows a situation where there is a non-blockable back-door trail, yet PY∥X is identifiable (by the front-door criterion).
Figure 3.14: graphs illustrating non-identifiability of PY∥X.
3. Local identifiability is not a sufficient condition for global identifiability. In Figure 3.15, PZ1∥X, PZ2∥X, PY∥Z1 and PY∥Z2 are all identifiable, but PY∥X is not.
Figure 3.15: a graph with hidden variables U1 and U2, in which X → Z1 → Y and Z2 → Y; the causal effect along each individual edge is identifiable, but PY∥X is not.
Theorem 3.22 (Rules for Intervention Calculus). Let G = (V, D) be a DAG associated with a causal model and let P denote the probability distribution. For a subset X ⊆ V, let GX denote the graph obtained from G by removing all edges pointing into the nodes of X. For any disjoint subsets X, Y, Z, W of V, the following rules hold:
1. (Insertion and deletion of observations) If Y ⊥ Z ∥GX X ∪ W, then
PY∣Z,W∥X = PY∣W∥X. (3.11)
2. (Exchange of action and observation) If Y ⊥ Z ∥G′ X ∪ W, where G′ is obtained from G by removing the edges pointing into X and the edges pointing out of Z, then
PY∣W∥X,Z = PY∣Z,W∥X. (3.12)
3. (Insertion and deletion of actions) If Y ⊥ Z ∥G′′ X ∪ W, where G′′ is obtained from G by removing the edges pointing into X and the edges pointing into Z(W), then
PY∣W∥X,Z = PY∣W∥X, (3.13)
where the set Z(W) in the graph GX is the set of Z-nodes which are not ancestors of any W-node in GX.
Proof
1. The interventional distribution PV/X∥X factorises along the graph GX. Since the variables of X have no parents in GX, do- and see-conditioning on X are equivalent for distributions that factorise along GX. The separation statement implies that, for the mutilated graph (where X has been instantiated by intervention), Y is D-separated from Z by X ∪ W. Equation (3.11) follows because a D-separation statement implies the corresponding conditional independence statement.
2. The interventional distribution PV/X∥X factorises along GX. The D-separation statement of (3.12) implies, furthermore, that all X ∪ W-active trails in GX between Y and Z have an arrow from a node in Z to one of its children; hence all back-door trails from Z to Y in GX are blocked by X ∪ W. It follows that the operations of setting Z ← z and of conditioning on Z = z have the same effect on Y.
3. Assume that the D-separation condition holds; then all W ∪ X-active trails between Y and Z in GX have an edge γ ↦ β, where β ∈ Z(W) (removing the arrows into Z(W) blocks the trail). For such a node β, none of its descendants are in W. Therefore,
PY∣W∥X,Z(W) = PY∣W∥X.
The D-separation statement implies that Y ⊥ Z/Z(W) ∥GX X ∪ W, from which the result follows.
Corollary 3.23. Let P be a probability distribution which factorises according to a causal model (Definition 3.4). An intervention probability q = PY∥X(y∥x), where X and Y are disjoint subsets of V, is identifiable if there is a finite sequence of transformations, each conforming to one of the inference rules in Theorem 3.22, which reduces q to a probability expression that only involves see-conditioning.
Proof Clear.
The converse of Corollary 3.23 is also true; this will now be dealt with. Firstly, Tian and Pearl [135] (2002) developed systematic criteria for establishing the interventional statements that can be computed from see-conditioning statements. Huang and Valtorta [64] (2006) then established that these criteria can be obtained from the three rules of Theorem 3.22. The problem was also dealt with in Shpitser and Pearl [124] (2006); graphical criteria are discussed in Tian and Shpitser [136] (2010).
Only the interventional probability distributions over the observable nodes are of interest; the following rather obvious lemma helps to simplify the problem.
Lemma 3.24. If any of the three rules can be used on a model with graph G , it can also be used
on a model that is obtained by removing all hidden nodes U ∈ U that have no descendants among the
observable nodes V.
Proof Clear.
The following lemma establishes that only Rules 2 and 3 need be considered for a completeness theorem: any application of Rule 1 may be replaced by an application of Rule 2 followed by an application of Rule 3.
Proof Since all D-separation statements that hold before the removal of an edge remain true after the edge is removed, the conditions for the application of Rules 2 and 3 are satisfied if the condition for Rule 1 is satisfied.
In detail: suppose the D-separation statement of (3.11) holds. Then the D-separation statements of (3.12) and (3.13) both hold, so an application of Rule 2 gives:
PY∣W,Z∥X = PY∣W∥X,Z
and an application of Rule 3 gives:
PY∣W∥X,Z = PY∣W∥X,
so that PY∣W,Z∥X = PY∣W∥X, which is Rule 1.
At the heart of the systematic identification by Tian and Pearl [135] of the interventional statements that may be expressed in terms of see-conditioning is the concept of c-components. All the c-factors are computable from the probability distribution over the observed variables.
A c-component of the node set V is either a set containing a single node from V, if that node has no parents in U, or it consists of all the U-nodes which are c-component related to each other, together with all V-nodes that have a parent in U which is a member of that c-component.
Let H denote all the nodes of a c-component and let H′ = H ∩ V (the observable nodes of a c-component). Then a c-factor is simply PH′, the probability distribution over the observable nodes of the c-component.
The relation ∼c on U is reflexive, symmetric and transitive and hence defines a partition of U. Based on this relation, U can be divided into disjoint and mutually exclusive c-component related parts.
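The partition into c-components can be computed directly. The sketch below is hypothetical Python (the precise definition of ∼c is not restated here, so the code assumes that two hidden nodes are related when a chain of hidden nodes sharing observable children connects them).

```python
def c_components(V, U, parents):
    """parents[v]: set of parents of v; hidden nodes in U have no parents."""
    rep = {u: u for u in U}           # union-find over the hidden nodes
    def find(u):
        while rep[u] != u:
            rep[u] = rep[rep[u]]
            u = rep[u]
        return u
    for v in V:                        # hidden parents of a common child are related
        hidden = [u for u in parents.get(v, set()) if u in U]
        for a, b in zip(hidden, hidden[1:]):
            rep[find(a)] = find(b)
    groups = {}
    for u in U:
        groups.setdefault(find(u), set()).add(u)
    comps = [g | {v for v in V if parents.get(v, set()) & g}
             for g in groups.values()]
    comps += [{v} for v in V if not (parents.get(v, set()) & set(U))]
    return comps

# Example: U1 -> {X, Y}, U2 -> {Y, W}; Z has no hidden parent.
print(c_components(V={"X", "Y", "W", "Z"}, U={"U1", "U2"},
                   parents={"X": {"U1"}, "Y": {"U1", "U2", "X"},
                            "W": {"U2"}, "Z": {"X"}}))
# e.g. [{'U1', 'U2', 'X', 'Y', 'W'}, {'Z'}]
```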
Now suppose that P factorises according to a semi-Markov model (Definition 3.5). Lemmas 3.27 and 3.28 form the basis of the characterisation by Tian and Pearl [135] of the interventional statements that can be expressed by see-conditioning statements. The proofs given here are from [64]; they demonstrate that the lemmas follow from Rules 2 and 3 of Theorem 3.22.
A set S ⊆ V in a graph G is called an ancestral set if for each α ∈ S, every ancestor of α is also in S. The set an(α) denotes the ancestors of the node α; α ∈/ an(α). A topological ordering of the nodes in a graph G is an ordering σ of the nodes such that for each node β and each node γ ∈ an(β), σ(γ) < σ(β).
Lemma 3.27. Let W ⊆ C ⊆ V and suppose that W is an ancestral set in GC∪(Pa(C)∩U). Then
(PC∥V/C)↓C/W = PW∥V/W.
Proof Trivially,
(PC∥V/C)↓C/W = PW∥V/C.
If W is an ancestral set in GC∪(Pa(C)∩U), it follows that none of the nodes of W have parents in (C ∪ (Pa(C) ∩ U))/W, although they may have parents in Pa(C) ∩ V.
Since W is an ancestral set in GC∪(Pa(C)∩U), there is a topological ordering of the nodes in GC∪(Pa(C)∩U) that starts with all the nodes in W and continues with the other nodes.
The lemma may be proved by induction. If W = C, the lemma is trivially true. Otherwise, consider the first node, say α, in the topological order just described that is in C but not in W. By induction, it is necessary and sufficient to prove that if W ⊂ V, α ∈ V/W, C = W ∪ {α} and W is an ancestral set in GC∪(Pa(C)∩U), then
(PW∪{α}∥V/(W∪{α}))↓{α} = PW∥V/W.
Let the nodes of W be labelled W = {1, . . . , k} and let Y = V/(W ∪ {α}). With this notation, the identity to be established may be rewritten as:
(P{1,...,k},α∥Y)↓{α} = P{1,...,k}∥Y,α.
Marginalising over the variable labelled α and using the fact that a probability distribution sums to 1 gives:
(P{1,...,k},α∥Y)↓{α} = P{1,...,k}∥Y,
so that it remains to prove that PW∥Y = PW∥α,Y.
By construction,
W ⊥ {α} ∥GY,{α} Y.
This is because, in the graph GY,{α}, if there is a Y-active trail between α and a node i ∈ {1, . . . , k}, the trail cannot include any nodes in Y, since the Y-nodes are instantiated fork nodes. Consider, therefore, a Y-active trail in GY,{α} between α and i which does not contain any other nodes of W. Firstly, it cannot have an arrow pointing into α, since the arrows between α and its parents have been removed. If the trail has no collider connections, then it is of the form α ↦ . . . ↦ i, which is a contradiction, since the nodes of W are of a lower topological order than α. If it contains a collider which is either instantiated, or one of whose descendants is instantiated, then the links to the parents of the instantiated nodes have been removed, hence such a connection does not exist. Therefore, such a trail does not exist.
Using Rule 3,
PW∥Y = PW∥α,Y
as required; the lemma follows by induction.
Lemma 3.28. Let H ⊆ V and let H1′, . . . , Hn′ denote the c-components of the sub-graph GH∪(Pa(H)∩U). Let Hi = Hi′ ∩ H. Then
1.
PH∥V/H = ∏_{i=1}^{n} PHi∥V/Hi.
2. Each PHi∥V/Hi is computable from PH∥V/H in the following way. Let k be the number of variables in H and let α1 < . . . < αk be a topological order of the variables of H in GH∪(Pa(H)∩U). Let H(j) = {α1, . . . , αj} for j = 1, . . . , k and H(0) = ∅ (the empty set). Then each PHi∥V/Hi, i = 1, . . . , n, may be obtained from the quantities
PH(j)∥V/H(j) = (PH∥V/H)↓H/H(j), j = 0, 1, . . . , k. (3.14)
Furthermore, the results of this lemma are a consequence of Rules 2 and 3 of Theorem 3.22.
Proof The first statement is proved first; then Equation (3.14) is established and finally the second statement is proved.
The proof is by induction. When H includes exactly one node from V, the result is clearly true from the definition.
Suppose that the two statements are true for all H with ∣H∣ ≤ k for an integer k, and consider an arbitrary set E ⊂ V of size ∣E∣ = k + 1. Let H = {α1, . . . , αk} and E = H ∪ {αk+1}, where the indices correspond to the topological order. Let H1′, . . . , Hn′ be the c-components of H ∪ (Pa(H) ∩ U) in GH∪(Pa(H)∩U) and let Hi = Hi′ ∩ H for 1 ≤ i ≤ n. Let Y = V/E.
Now consider the c-components of E ∪ (Pa(E) ∩ U) in GE∪(Pa(E)∩U).
If Pa(αk+1) ∩ Pa(H) ∩ U = ∅, then the c-components are E1′, . . . , En+1′, where Ei′ = Hi′ for i = 1, . . . , n and En+1′ = {αk+1} ∪ (Pa(αk+1) ∩ U). It follows that Ei := Ei′ ∩ E = Hi for i = 1, . . . , n and En+1 = {αk+1}. In this case, let m = n + 1.
If Pa(αk+1) ∩ Pa(H) ∩ U ≠ ∅, then αk+1 shares at least one parent in U with a node in H. Let E1′, . . . , Em′ denote the c-components of E ∪ (Pa(E) ∩ U) in GE∪(Pa(E)∩U) and let Ei = Ei′ ∩ E. It follows that, relabelling if necessary, Ei = Hi for i = 1, . . . , m − 1 and Em = {αk+1} ∪ ⋃_{i=m}^{n} Hi.
Since V/H = (V/E) ∪ {αk+1}, the inductive hypothesis gives
PH∥V/H = ∏_{i=1}^{n} PHi∥V/Hi,
that is,
PH∥(V/E)∪{αk+1} = ∏_{i=1}^{n} PHi∥(V/E)∪{αk+1}∪H1^{i−1}∪H_{i+1}^{n},
where the notation Hi^j for i < j means ⋃_{l=i}^{j} Hl. For the first statement, it is required to prove that:
PH,{αk+1}∥V/(H∪{αk+1}) = PHm^{n},{αk+1}∥(V/E)∪H1^{m−1} ∏_{j=1}^{m−1} PHj∥V/Hj.
j=1
The notation Y = V /E will be used. The D-separation statement: H á {αk+1 }∥GY ,α Y holds. This
k+1
is because any Y -active trail in GY ,α between αk+1 and a node in H does not include any node in Y
since nodes in Y are instantiated fork nodes. Since the arrows from parents of αk+1 have been removed
and the nodes of H are of a lower topological order, any active trail contains an instantiated collider
connection. But the instantiated nodes are in Y , hence links to their parents have been removed, hence
such a trail does not exist.
Using Rule 3, it follows that:
PH∥Y = PH∥αk+1,Y,
so that
PH,αk+1∥Y = Pαk+1∣H∥Y PH∥αk+1,Y = Pαk+1∣H∥Y ∏_{j=1}^{m−1} PHj∥H1^{j−1},H_{j+1}^{n},Y,αk+1 × ∏_{j=m}^{n} PHj∥H1^{j−1},H_{j+1}^{n},Y,αk+1.
Moreover,
∏_{j=1}^{m−1} PHj∥H1^{j−1},H_{j+1}^{n},Y,αk+1 = PH1^{m−1}∥Hm^{n},Y,αk+1.
As before,
PH1^{m−1}∥Hm^{n},Y,αk+1 = PH1^{m−1}∥Hm^{n},Y,
and
αk+1 ⊥ Hm^{n} ∥GY,Hm^n Y ∪ H1^{m−1}.
It follows that
PH,αk+1∥Y = Pαk+1∣H∥Y ∏_{j=1}^{m−1} PHj∥H1^{j−1},H_{j+1}^{n},Y ∏_{j=m}^{n} PHj∥H1^{j−1},H_{j+1}^{n},Y,αk+1
= Pαk+1∣H1^{m−1}∥Hm^{n},Y PH1^{m−1}∥Hm^{n},Y ∏_{j=m}^{n} PHj∥H1^{j−1},H_{j+1}^{n},Y,αk+1
= PH1^{m−1},αk+1∥Hm^{n},Y ∏_{j=m}^{n} PHj∥H1^{j−1},H_{j+1}^{n},Y,αk+1.
Now let H̃i^{(j)} = (H^{(j)} ∩ Hi) ∪ (Pa(Hi) ∩ U) and let H̃i = Hi ∪ (Pa(Hi) ∩ U). Then, by construction, H̃i^{(j)} is an ancestral subset of Hi ∪ (Pa(Hi) ∩ U) in GHi∪(Pa(Hi)∩U) and hence (by extending the set V for a moment to include all the nodes of H̃i^{(j)}), it follows by Lemma 3.27 that:
PH̃i^{(j)}∥V/H^{(j)} = (PH̃i∥V/Hi)↓Hi/H^{(j)},
so that
PH^{(j)}∩Hi∥V/H^{(j)} = (PHi∥V/Hi)↓Hi/H^{(j)}.
It follows that
(PH∥V/H)↓H/H^{(j)} = ∏_{i=1}^{n} PH^{(j)}∩Hi∥V/H^{(j)}.
Equation (3.14) now follows because, do-conditioned on V/H^{(j)}, the node sets H^{(j)} ∩ Hi, i = 1, . . . , n, are D-separated from each other.
It follows that
PE∥V/E = PE^{(k+1)}∥V/E^{(k+1)} = PHm^{n},αk+1∥V/(Hm^{n}∪{αk+1}) ∏_{i=1}^{m−1} PHi∥V/Hi.
Now,
so that
Based on the first statement of Lemma 3.28, establishing the non-identifiability of a statement may be reduced to establishing the non-identifiability of a statement within a c-component. The relevant result, Theorem 3.29, applies when
1. G is itself a c-component,
Proof Non-identifiability is established if it can be shown that there is a back-door trail between a node in S and a node in V/S. Assume that there is no back-door trail; then there does not exist a node υ ∈ U which is an ancestor of both a node in S and a node in V/S. It follows that G is not itself a c-component, which is a contradiction.
Lemmas 3.27 and 3.28 provide the basis of a complete identification algorithm for computing do-conditioning statements PS∥V/S for S ⊆ V in terms of see-conditioning statements, in the sense that when it does not give the output fail, it returns the correct answer. Theorem 3.29 establishes that the algorithm is complete, in the sense that it returns the output fail when and only when the statement is not identifiable.
Let V1, . . . , Vn be a partition of V, where Vj = Vj′ ∩ V and V1′, . . . , Vn′ are the c-components of the sub-graph GV∪Pa(V). Let S1, . . . , Sl be a partition of S, where S1′, . . . , Sl′ are the c-components of the sub-graph GS∪(Pa(S)∩U) and Sj = Sj′ ∩ V for j = 1, . . . , l. The subsets are labelled such that Sj ⊆ Vj for j = 1, . . . , l; this can clearly be done without loss of generality. Now
1. Compute each c-factor PVj∥V/Vj from the probability distribution over the observable variables.
2. Compute each PSj∥V/Sj using Algorithm 3.8 (identify(C, T) below), with C = Sj, T = Vj and Q = PVj∥V/Vj.
3. If, in part 2, Algorithm 3.8 gives the output fail for any of the Sj, j = 1, . . . , l, then PS∥V/S is not identifiable and the output given is fail. Otherwise, PS∥V/S is identifiable and is given by:
PS∥V/S = ∏_{j=1}^{l} PSj∥V/Sj.
If A = C, the output is
PC∥V/C = (PT∥V/T)↓T/C.
If A = T, the output is fail.
3. Else (if C ⊂ A ⊂ T):
Algorithm 3.8 is therefore recursive, until it either finds an expression for PC∥V/C or else returns the output fail.
It now follows that if PV/T∥T is identifiable, then so is PS∥T for any S ⊆ V/T, by marginalisation:
PS∥T = (PV/T∥T)↓(V/T)/S.
Moreover,
(PV/T∥T)↓V/(T∪D) = PD∥V/D.
It follows that:
1. Let D = (S ∪ an(S))GV/T ∩ V.
4. Else, output
PS∥T = (PD∥V/D)↓D/S.
Theorem 3.30. The three inference rules of Theorem 3.22, together with standard probability manipulations, are complete for determining the identifiability of PH∥V/H for all H ⊂ V.
Proof Lemmas 3.27 and 3.28 follow from the inference rules of Theorem 3.22 (as proved by Huang and Valtorta [64]). These form the basis of the algorithms described above. By Theorem 3.29, the algorithms give the output fail if and only if the statement is not identifiable; by standard probability manipulations, they give the correct answer otherwise.
3.8.1 Example: Front Door Criterion
Consider the front door graph of Figure 3.13, with U hidden. The quantities PZ∥X, PY∥Z and PY∥X may be computed using the rules of Theorem 3.22.
1. PZ∥X. Since every back-door trail from X to Z is blocked (by the collider at Y), Rule 2 gives
PZ∥X = PZ∣X. (3.18)
2. PY∥Z. Writing
PY∥Z = (PY∣X∥Z PX∥Z)↓X,
Rule 3 gives PX∥Z = PX and Rule 2 gives PY∣X∥Z = PY∣X,Z. It follows that
PY∥Z = (PY∣X,Z PX)↓X. (3.19)
3. PY∥X. Writing
PY∥X = (PY∣Z∥X PZ∥X)↓Z,
it follows from (3.18) that PZ∥X = PZ∣X. Rule 2 may be applied, since Y ⊥ Z ∥GX,Z X, to give PY∣Z∥X = PY∥X,Z. Rule 3 may be applied, since Y ⊥ X ∥GX,Z Z, to give:
PY∥X,Z = PY∥Z,
which was computed in terms of see-conditioning in (3.19). Putting all this together gives:
PY∥X = (PZ∣X (PY∣X,Z PX)↓X)↓Z.
All the other causal effects (for example, PY,Z∥X and PX,Z∥Y) can be derived from the rules of Theorem 3.22.
2. PY∥X is identifiable in GZ.
If the first of these holds, it follows that Y ⊥ Z ∥GX,Z X and hence PY∥X = PY∥X,Z. This represents the causal effect of X on Y in a model that factorises along GZ, which is identifiable by the second condition. These conditions are satisfied by the two models in Figure 3.16. Translated to the cholesterol example, they require that there be no direct effect of diet on heart disease and no confounding effect between cholesterol level and heart disease, unless there is an intermediate variable between the two which can be measured. For the first figure, the conditions are clear. For the second figure, PY∥X is identifiable in GZ because
Figure 3.16: two models, with hidden variables, satisfying the two conditions above.
PY∥X. Secondly, PW∣U is unknown, but there are two observable variables (Z, W) which together give sufficient information to identify PY∥X without bias.
Example 3.31.
The Head Start Program is discussed in Magidson [87] (1977). This was a government programme within the United States of America aimed at giving assistance to children. Magidson's sample consists of 148 children who received the programme and 155 control children.
Let X be an indicator variable, indicating whether or not the child received the programme. Y is the outcome variable of the Metropolitan Readiness Test (a test which supposedly measures cognitive ability). U represents socio-economic status. This is unobserved and may be considered, following the discussion of Magidson, as a sufficient confounder. Figure 3.17 gives three possible situations: the first where W is measured as a proxy variable for U; the second and third where W and Z (family income) are measured as proxy variables of U.
Figure 3.17: U hidden; causal models with proxy variables on U. For (a), PW∣U is required to identify PY∥X. For (b) and (c), under further assumptions on Z and W, the effect PY∥X may be estimated from data.
In Figure 3.17, U satisfies the back-door criterion relative to (X, Y), but its proxy variables W and Z do not. For each of the models,
PY∥X = (PY∣X,U PU)↓U.
If the conditional distribution PW∣U is known (and W is observable) then, under additional assumptions on PW∣U, it is possible to construct an asymptotically unbiased estimator of PY∥X. Note that
PY,W∣X = (PY,U∣X PW∣U)↓U.
Set
VU:Y∣X(. : y∣x) = (PY,U∣X(1, y∣x), . . . , PY,U∣X(k, y∣x))ᵗ, VW:Y∣X(. : y∣x) = (PY,W∣X(1, y∣x), . . . , PY,W∣X(k, y∣x))ᵗ
and let MW∣U denote the k × k matrix with (i, j) entry PW∣U(i∣j):
MW∣U =
⎛ PW∣U(1∣1) . . . PW∣U(1∣k) ⎞
⎜ ⋮ ⋱ ⋮ ⎟
⎝ PW∣U(k∣1) . . . PW∣U(k∣k) ⎠.
If MW ∣U is invertible, then:
−1
VU ∶Y ∣X (. ∶ y∣x) = MW ∣U VW ∶Y ∣X (. ∶ y∣x)
↓Y ↓Y
Similarly, set VU ∶∣X = (VU ∶Y ∣X ) so that VU ∶∣X (u∣x) = PU ∣X (u∣x) and similarly VW ∶∣X = (WW ∶Y ∣X ) ,
then
−1
VU ∶∣X = MW ∣U VW ∶∣X .
It follows that if PW∣U is known, then the causal effect of manipulating X, i.e. PY∥X, is estimable and
is given by:
PY∥X = ((PY,U∣X PU) / PU∣X)^{↓U} = (((MW∣U^{-1} PY,W∣X)(MW∣U^{-1} PW)) / (MW∣U^{-1} PW∣X))^{↓U},
where PY,W∣X(·, y∣x), PW and PW∣X denote the corresponding column vectors indexed by the values of
W, so that the vectors on the right hand side are indexed by the values of U.
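As a sketch of how this matrix adjustment could be carried out numerically (an illustration under assumed synthetic distributions, not a procedure from the text), the snippet below simulates the quantities that an external study (the matrix M = MW∣U) and observational data would supply, then recovers the causal effect by inverting M:

```python
import numpy as np

rng = np.random.default_rng(7)
k = 3
# Ground truth, used only to generate the "observed" quantities below.
P_U = rng.dirichlet(np.ones(k))                  # P(u)
M = rng.dirichlet(np.ones(k), size=k).T          # M[w, u] = P(W=w | U=u), assumed known
P_U_given_x = rng.dirichlet(np.ones(k))          # P(u | x) for one fixed x
P_y_given_xU = rng.random(k)                     # P(Y=y | x, u) for fixed (x, y)

# Quantities estimable from an external study plus observational data:
P_W = M @ P_U                                    # P(w)
P_W_given_x = M @ P_U_given_x                    # P(w | x)
P_yW_given_x = M @ (P_y_given_xU * P_U_given_x)  # P(y, w | x)

# Matrix adjustment: invert M to restore the U-level quantities, then adjust.
Minv = np.linalg.inv(M)
P_yU_given_x = Minv @ P_yW_given_x               # P(y, u | x)
P_U_rec = Minv @ P_W                             # P(u)
P_U_given_x_rec = Minv @ P_W_given_x             # P(u | x)
P_y_do_x = np.sum(P_yU_given_x * P_U_rec / P_U_given_x_rec)

print(P_y_do_x, np.sum(P_y_given_xU * P_U))      # the two values agree
```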
For the models under consideration, W á {X, Y, Z}∥G U and Y á {W, Z}∥G {U, X}. Hence
PZ,W∣X = (PZ,W,U∣X)^{↓U} = (PW∣Z,U,X PZ∣U,X PU∣X)^{↓U} = (PW∣U PZ∣U PU∣X)^{↓U}.
Similarly,
PY,W∣X = (PW∣U PY∣X,U PU∣X)^{↓U},
PY,Z∣X = (PY∣X,U PZ∣X,U PU∣X)^{↓U},
PY,Z,W∣X = (PW∣U PZ∣X,U PY∣X,U PU∣X)^{↓U}.
Let PZ,W denote the k × k matrix with first row
(1, PW∣X(1∣x), . . . , PW∣X(k − 1∣x)),
first column (1, PZ∣X(1∣x), . . . , PZ∣X(k − 1∣x))^t, and entry PZ,W∣X(i, j∣x) in row i + 1, column j + 1,
for 1 ≤ i, j ≤ k − 1. Let QZ,W denote the corresponding matrix with Y = y adjoined: first entry
PY∣X(y∣x), first row continuing (PY,W∣X(y, 1∣x), . . . , PY,W∣X(y, k − 1∣x)), first column continuing
(PY,Z∣X(y, 1∣x), . . . , PY,Z∣X(y, k − 1∣x))^t, and entry PY,Z,W∣X(y, i, j∣x) in row i + 1, column j + 1. (3.20)
Let UW,U denote the k × k matrix whose i-th row is
(1, PW∣U(1∣σ(i)), . . . , PW∣U(k − 1∣σ(i)))
for a suitable ordering σ of the states of U, and let
ΔU = diag(PY∣X,U(y∣x, σ(1)), . . . , PY∣X,U(y∣x, σ(k))). (3.21)
Then
PZ,W^{-1} QZ,W = UW,U^{-1} ΔU UW,U.
It follows that the problem of recovering PW∣U (the entries of UW,U) rests on the eigenvalue decomposition of
PZ,W^{-1} QZ,W. Once PW∣U is known, the matrix adjustment method may be used to evaluate the causal
effect on Y of manipulating X. This requires additionally that QZ,W be invertible and that the probabilities
PY∣X,U(y∣x, 1), . . . , PY∣X,U(y∣x, k) take distinct values for given (x, y).
The result is presented in the following theorem:
Theorem 3.32. Suppose U is a sufficient confounder relative to (X, Y) and suppose that
1. Two proxy variables of U that are conditionally independent of each other given U can be observed;
call them W and Z. Both W ⊥ {X, Y, Z}∣U and Y ⊥ {W, Z}∣{U, X} hold.
2. W, Z and the confounder U are discrete variables, with a given finite number of categories, k.
3. The matrices PZ,W and QZ,W defined in (3.20) are invertible.
4. The probabilities PY∣X,U(y∣x, 1), . . . , PY∣X,U(y∣x, k) take distinct values for given x and y.
Then PY∥X is identifiable.
Proof The proof is based on the following two-step procedure, which recovers PX,Y,U from PX,Y,Z,W.
Stage 1: Solve an eigenvalue problem for PZ,W^{-1} QZ,W to recover PW∣U (the matrix UW,U).
Step 1 First solve ∣PZ,W^{-1} QZ,W − λIk∣ = 0, where Ik denotes the k × k identity matrix. The solutions
λ1, . . . , λk are the eigenvalues of PZ,W^{-1} QZ,W. They satisfy:
∣PZ,W^{-1} QZ,W − λIk∣ = ∣ΔU − λIk∣ = 0,
where ΔU is defined by (3.21). It follows that λi = PY∣X,U(y∣x, σ(i)) and hence the elements of ΔU are
estimable.
To obtain the eigenvector ηi corresponding to λi, let H = (η1, . . . , ηk); then H satisfies:
PZ,W^{-1} QZ,W H = H ΔU.
By the condition that the λi take different values, it follows that η1, . . . , ηk are uniquely determined.
Let A = UW,U^{-1} E, where E = diag(α1, . . . , αk) for non-zero values of (α1, . . . , αk); then:
PZ,W^{-1} QZ,W A = UW,U^{-1} ΔU E = UW,U^{-1} E ΔU = A ΔU.
It follows that A is also a matrix of eigenvectors of PZ,W^{-1} QZ,W and hence, with a particular choice of
α1, . . . , αk, A = UW,U^{-1} E = H.
It follows that the inverse H^{-1} of the estimable matrix H satisfies (using UW,U^{-1} E = H):
UW,U = E H^{-1}, the k × k matrix with (i, j) entry αi H^{-1}_{ij}, whose i-th row is
(1, PW∣U(1∣σ(i)), . . . , PW∣U(k − 1∣σ(i))),
so that the conditional probabilities PW∣U may be read off.
Step 2 Since
PX,Y,W = (PX,Y,U PW∣U)^{↓U},
PX,Y,U may now be recovered using PW∣U, and
PY∥X = (PY∣X,U PU)^{↓U} = ((PX,Y,U / PX,U) PU)^{↓U}
is identifiable.
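The two-step procedure can be checked numerically. The sketch below (illustrative code with assumed synthetic distributions, not part of the proof) constructs PZ,W and QZ,W from a known model and confirms that the eigenvalues of PZ,W^{-1} QZ,W are the values PY∣X,U(y∣x, u):

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3
P_U_given_x = rng.dirichlet(np.ones(k))       # P(u | x), x fixed
P_W_given_U = rng.dirichlet(np.ones(k), k).T  # column u holds P(W | U=u)
P_Z_given_U = rng.dirichlet(np.ones(k), k).T  # column u holds P(Z | U=u)
P_y_given_xU = np.linspace(0.2, 0.8, k)       # P(Y=y | x, u): distinct values

# Column u of B is b_u = (1, P(W=1|u), ..., P(W=k-1|u)); A analogous for Z.
B = np.vstack([np.ones(k), P_W_given_U[:-1, :]])
A = np.vstack([np.ones(k), P_Z_given_U[:-1, :]])
D = np.diag(P_U_given_x)

P_ZW = A @ D @ B.T                    # the matrix P_{Z,W} of (3.20)
Q_ZW = A @ (D * P_y_given_xU) @ B.T   # Q_{Z,W}: same, weighted by P(y | x, u)

eigvals = np.sort(np.linalg.eigvals(np.linalg.solve(P_ZW, Q_ZW)).real)
print(eigvals, np.sort(P_y_given_xU))  # identical up to numerical error
```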
Definition 3.33 (Stochastic Process). Let X be a set and (E, E) a measurable space. A stochastic
process Y indexed by X with state space (E, E) is a family of measurable mappings
{Y(x) : x ∈ X} from a probability space (Ω, F, P) into (E, E).
There is no requirement from the definition of `process' that the index set X should represent `time'.
In the counterfactual set-up, Y′ has state space XY, the same state space as Y. Attention is
restricted to the situation where E = XY is either finite, in which case E is simply the set of all possible
subsets, or else E = R, in which case E = B(R), the Borel σ-algebra over R, the smallest collection of
subsets necessary to define integration (and hence a probability measure).
Suppose that the state space of Y is XY = {0, 1}, where 0 represents `death' and 1 represents `cure'.
Suppose that x1 was the dose administered and the outcome was `death'. Consider the counterfactual
query: `would the patient have survived if we had given a treatment dose x2?' In terms of the
counterfactual process, the quantity to be computed is therefore P(Y′(x2) = 1 ∣ Y′(x1) = 0).
In some limited cases, with serious additional modelling assumptions, this quantity can be computed
from the one-dimensional marginal distributions. For example, suppose we assume that, for x1 < x2 ,
{Y ′ (x1 ) = 1} ⊆ {Y ′ (x2 ) = 1}. This means that we assume that if the patient survives a low dose of the
treatment, he will also survive a higher dose. The treatment does not have side effects which kill the
patient; increasing the treatment dose increases the chance of success.
P(Y′(x2) = 0 ∣ Y′(x1) = 0) = P(Y′(x2) = 0) / P(Y′(x1) = 0) for x2 > x1, and = 1 for x2 ≤ x1.
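For illustration (with invented numbers): if P(Y′(x1) = 0) = 0.4 and P(Y′(x2) = 0) = 0.1 for doses x2 > x1, the formula gives P(Y′(x2) = 0 ∣ Y′(x1) = 0) = 0.1/0.4 = 0.25; a patient who died at dose x1 would, under the monotonicity assumption, have survived at dose x2 with probability 0.75.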
Several types of counterfactual query can be considered; if X is a cause and Y an effect within a larger
network, x1 could either be observed, or forced by intervention; the query is then `we observed effect
Y = y when we observed X = x1. What would have happened if we had forced X ← x2 by intervention?'
To construct the appropriate counterfactual probability distribution, we add the counterfactual
process Y′, indexed by X, which does not have X as a parent. At the same time, the original variable
Y, with parent X, remains in the graph and the counterfactual query is to compute PY′(x2)∣Y,X(·∣y, x1).
The counterfactual graph for an intervention set A is constructed as follows:
1. For each node β on a causal path between A and Y, add a c-process node β′. These are the counterfactual process nodes, corresponding to the counterfactual process, enumerating the value taken by the variable for each x ∈ XA.
A c-process node β′ has all the parents of β except the nodes in A; there are no links from nodes
in A to c-process nodes.
2. For each α ∈ A and each β on the causal path between A and Y (including all the nodes in Y),
add in an arrow α → β. If β and γ, where β, γ ∉ A, are on the causal path between A and Y and
there is an arrow β → γ, add in a process arrow β′ ⇒ γ′.
3. Add in a process to variable arrow β′ ↠ β for each process node and its corresponding variable
node. This is shorthand for an arrow β′(x) → β for each x ∈ XA.
4. Add in variable to process arrows γ ↣ β′ for each γ ∈ Pa(β)/A. This is shorthand for γ → β′(x)
for each x ∈ XA.
Between Process Nodes If α = f(β1, . . . , βk, v) in the original DAG, where Pa(α) = {β1, . . . , βk}
and v is the random effect, then α′(x) = f(β1′(x), . . . , βk′(x), v) for each x ∈ XA; correspondingly, there
is a process arrow βj′ ⇒ α′ for each j.
Example 3.34.
Suppose we are interested in how likely a patient would be to have a certain symptom Y (1 = yes,
0 = no), given a dose x of a drug X, assuming we know that the patient took dose x′ of the drug
and exhibited the symptom. Suppose there is a mediating variable W, for example blood pressure,
and that it is the blood pressure which is the cause of the symptom. Furthermore, we also know that
the patient took dose d of a drug D and we have measured a symptom Z = z. We know PZ∣D, the
conditional probability distribution for symptom Z given drug D.
The blood pressure / symptom pair (W, Y) may therefore be considered as a counterfactual process indexed
by the dose of the drug. The random variable Y(x) indicates whether or not the patient exhibits the symptom
when dose x is administered. In this language, the problem is therefore to compute PY(x)∣Y(x′),Z,D.
Figure 3.18 (a) shows the original DAG for the Bayesian network; (b) shows the network in terms of
functional relations.
Figure 3.18: (a) A DAG; (b) the graph expressed as a functional relations graph.
Figure 3.19: the graph of Figure 3.18 (b) extended with the counterfactual process.
Notes
Key references for the presentation here are Edwards [40] (2000), chapter 9, and Lauritzen (2001) [82]. The idea of
deletion of connections (in terms of wiping out equations in a multivariate model) is found in Strotz
and Wold (1960) [128]. The intervention formula is due to J. Pearl, but is also given independently
in the first edition of Spirtes, Glymour and Scheines (2002) [127]. The designation semi-Markovian
model follows [134]. The paper [64] (2006) summarises the recent developments in the problem of
identifiability and presents an algorithmic solution. The results by Y. Huang and M. Valtorta in [64] show
that the do-calculus rules of Pearl [107] and [108] (1995) are complete, in the sense that if a causal
effect is identifiable, then the causal effect can be computed in terms of observational quantities. The
article [43] by Freedman and Humphreys makes the obvious point that causality cannot be learned
from data and is a necessary response to errors that inexplicably crept into the literature.
Figure 3.20: Graph of Figure 3.19 (b) with the process nodes written out
3.11 Exercises
1. The two parts of this exercise are very similar and straightforward, illustrating how d-separation
in the mutilated graph corresponds to conditional independence in the remaining variables after
do-conditioning.
(a) Let G be a Directed Acyclic Graph, and suppose that a probability distribution P may
be factorised along G. Let G−X denote the graph obtained by deleting from G all arrows
pointing towards X (that is, all links between X and its parents are deleted). Prove that if
Y and Z are d-separated in G−X by X, then
PY∣Z∥X(·∣·∥x) = PY∥X(·∥x).
2. Suppose the causal relations between the variables (X1 , X2 , X3 , X4 , X5 , X6 , Y, Z) may be ex-
pressed by the DAG given in Figure 3.21. Which of the following sets satisfy the back door
criterion with respect to the ordered pair of nodes (Y, Z)? C1 = {X1 , X2 }, C2 = {X4 , X5 },
C3 = {X4 }.
State all sets of nodes that satisfy the back door criterion with respect to the ordered set of nodes
(Z, Y ).
Figure 3.21: the DAG with edges X1 → X3, X1 → X4, X2 → X4, X2 → X5, X3 → Y, X4 → Y, X4 → Z, X5 → Z, Y → X6 and X6 → Z.
3. Let a set of variables C satisfy the back door criterion relative to (X, Y). Prove that
PY∥X(y∥x) = Σ_c PY∣X,C(y∣x, c) PC(c).
4. Let C be a set of variables in a Bayesian Network and let X be a variable such that C contains
no descendants of X. Prove, from the definition, that PC∥X = PC.
5. Let V = {X1, . . . , Xd} denote a set of variables. Let V = Z ∪ U, where the variables in Z are
observable and the variables in U are unobservable. Assume that the probability distribution
over the variables in V may be factorised along a Directed Acyclic Graph G = (V, D), where
no variable in U is a descendant of any variable in Z; that is, the model is semi-Markovian.
Consider a single variable, say Xj ∈ Z. Assume that there is no trail between Xj and Xk, for
Xk ∈ Z, with only fork and chain connections which contains a variable Xi ∈ U. Show that the
causal effect of intervening on Xj may be computed from the distribution over the observable variables.

3.12 Answers

1. (a) Since Y and Z are d-separated by X in G−X, Y and Z are independent under the distribution
P̃ = P⋅∥X(⋅∥x) obtained by do-conditioning, so that
PY∣Z∥X(·∣·∥x) = P̃Y∣Z = P̃Y = PY∥X(·∥x).
(b) Let V denote the variable set and let P̃V/C = PV/C∥C(·∥xC). Then P̃ factorises along the
graph G^{−C}_{V/C} (the subgraph of G^{−C} with the nodes C removed), with conditional
probability potentials, for X ∉ C, P̃X∣P̃aX = PX∣PaX, where P̃aX = PaX/C, PaX denotes the original
parent set, and the variables in PaX ∩ C are instantiated with the appropriate values.
If A á B∥G^{−C} C ∪ W, then any trail from A to B either has a fork or chain node in C ∪ W or
a collider node that is not in C ∪ W with no descendants in C ∪ W. It follows that, on the
graph G^{−C}_{V/C}, any trail from A to B either has a fork or chain node in W or a collider node
that is not in W with no descendants in W; edges are deleted, but not added, by taking the
subgraph restricted to the variables of V/C, and hence no new trails are added by removing
the nodes in C. It follows that A á B∥G^{−C}_{V/C} W and hence that XA ⊥ XB ∣ XW under P̃.
2. C1 = {X1, X2} does not satisfy the back door criterion; Y − X4 − Z is a trail between Y and Z
with an edge pointing to Y which is not blocked by C1.
C2 = {X4, X5} satisfies the back door criterion: the trail Y − X6 − Z does not have an edge pointing
towards Y. The other trails pass through X4. For the trails Y − X4 − Z and Y − X3 − X1 − X4 − Z, X4
is an instantiated fork or chain respectively, hence C2 blocks the trail. For Y − X3 − X1 − X4 − X2 − X5 − Z,
X5 is an instantiated chain and hence the trail is blocked. All trails between Y and Z have been
considered.
C3 = {X4} does not satisfy the back door criterion: on the trail Y − X3 − X1 − X4 − X2 − X5 − Z, X4 is an
instantiated collider, which opens the trail, and no other node of the trail is in C3.
For the back door criterion with respect to (Z, Y), the sets have to block all trails with an
arrow pointing towards Z. This means that any set that contains X6, X4 and any node from
{X3, X1, X2, X5} will satisfy the back door criterion with respect to (Z, Y); any set that does not
will not.
3. It is clear that
PY∥X(y∥x) = Σ_c PY∣C∥X(y∣c∥x) PC∥X(c∥x).
Since C blocks all trails between Y and X that have an edge pointing towards X, it follows that
Y á (PaX/C)∥G C. It follows, with notation that should be clear, using Proposition 3.12, that
PY∣C∥X(y∣c∥x) = PY∣C,X(y∣c, x).
Furthermore, since none of the variables in C are descendants of X, it follows (again, using
Proposition 3.12) that
PC∥X(c∥x) = PC(c)
and the result follows. The facts that PC∥X(c∥x) = PC(c)
and PPaX/C∥X(π/c∥x) = PPaX/C(π/c) are clear by comparing the original DAG and the mutilated
graph. A formal algebraic proof that PC∥X(c∥x) = PC(c) is given in the next exercise.
4. The variables may be ordered as V = {Y1, . . . , Yn, X, Yn+1, . . . , Yn+m}, where the ordering is chosen
such that Pa(Yj) ⊆ {Y1, . . . , Yj−1} for j ≤ n and Pa(X) ⊆ {Y1, . . . , Yn}. Then
PV/X∥X(y1, . . . , ym+n∥x) = ∏_{j=1}^{m+n} PYj∣Paj(yj∣πj),
while
PV(y1, . . . , ym+n, x) = PX∣Pa(X)(x∣πX) ∏_{j=1}^{m+n} PYj∣Paj(yj∣πj).
Now sum over the variables Yn+1, . . . , Yn+m in both expressions, then sum over X in the second
expression. Then sum over all remaining variables not in C. The same answer obtains for both
expressions, so that
PC∥X = PC.
5. Firstly,
so that
Chapter 4
The Pioneering Work of Arthur Cayley
By A. Cayley*.
The following question was suggested to me, either by some of Prof. Boole's memoirs on
the subject of probabilities, or in conversation with him, I forget which; it seems to me a
good instance of the class of questions to which it belongs.
Given the probability α that a cause A will act, and the probability p that, A acting, the
effect will happen; also the probability β that a cause B will act, and the probability q that,
B acting, the effect will happen; required the total probability of the effect.
As an instance of the precise case contemplated, take the following: say a day is called
windy if there is at least w of wind, and a day is called rainy if there is at least r of rain,
and a day is called stormy if there is at least W of wind, or if there is at least R of rain.
The day may therefore be stormy because of there being at least W of wind, or because
of there being at least R of rain, or on both accounts; but if there is less than W of wind
and less than R of rain, the day will not be stormy. Then α is the probability that a day
chosen at random will be windy, p the probability that a windy day chosen at random will
be stormy, β the probability that a day chosen at random will be rainy, q the probability
that a rainy day chosen at random will be stormy. The quantities λ, µ introduced in the
solution of the question mean in this particular instance, λ the probability that a windy
day chosen at random will be stormy by reason of the quantity of wind, or in other words,
that there will be at least W of wind, µ the probability that a rainy day chosen at random
will be stormy by reason of the quantity of rain, or in other words, that there will be at
least R of rain.
The sense of the terms being clearly understood, the problem presents of course no difficulty.
Let λ be the probability that the cause A acting will act efficaciously; µ the probability
that the cause B acting will act efficaciously; then
p = λ + (1 − λ)µβ,
q = µ + (1 − µ)αλ,
and
ρ = λα + µβ − λµαβ.
In the particular case α = 1, this gives ρ = λ + (1 − λ)µβ; that is, ρ = p, for p is in this case
the probability that (acting as a cause which is certain to act) the effect will happen, or
what is the same thing, p is the probability that the effect will happen.
Machynlleth, August 16, 1853.
*Communicated by the Author.
In this short note, Cayley gives a prototype example of a causal network; rain and wind both have
causal effects on the state of the day (stormy or not), which may be inhibited. He demonstrates the
key principle of modularity: taking a problem with several variables and splitting it into its simpler
component conditional probabilities, by considering the direct causal influences for each variable and
the natural factorisation of the probability distribution in this problem into these conditional probabilities.
It should also be pointed out that Cayley was no stranger to graph theory; he proved Cayley's tree
formula, that there are n^{n−2} distinct labelled trees of order n (1889) [19], and established links between
graph theory and group theory, representing groups by graphs. The Cayley graph is named after him.
The variables here may be taken as
C = 1 (wind) or 0 (no wind),   D = 1 (rain) or 0 (no rain),
Figure 4.1: rain and wind as causes of storm, with the arrow from rain to storm labelled µ and the arrow from wind to storm labelled λ.
with α = PC(1), β = PD(1), and
Y = 1 (storm) or 0 (no storm).
⎪
Then, in Cayley's notation, if there is rain, it causes a storm with probability µ; if there is wind,
it causes a storm with probability λ. The corresponding `network', on three variables, is seen in
Figure 4.1. The subscripts µ and λ on the arrows indicate the probability that the cause, if active, will
trigger the effect.
This is a noisy `or' gate, which can be expressed as a logical `or' gate by the addition of two
variables, R and W. The variable R denotes severe rain; that is, the `rain' variable reaches the
threshold to trigger a storm. This happens if the quantity of rain is above a threshold. The W variable
denotes severe wind; that is, the `wind' variable reaches the threshold to trigger a storm. This
happens if the strength of wind is above a threshold. The variables forming the logical or gate have the
conditional probability values given below; PW∣C denotes the conditional probability function for the
variable W given C and PR∣D denotes the conditional probability function for the variable R given D.
PW∣C: PW∣C(1∣1) = λ, PW∣C(0∣1) = 1 − λ, PW∣C(1∣0) = 0, PW∣C(0∣0) = 1;
PR∣D: PR∣D(1∣1) = µ, PR∣D(0∣1) = 1 − µ, PR∣D(1∣0) = 0, PR∣D(0∣0) = 1.
The network may now be expressed graphically according to Figure 4.2. This DAG is a representation
of the factorisation that Cayley is using:
PC,D,W,R,Y = PC PD PW∣C PR∣D PY∣W,R,
where PY∣W,R denotes the CPP for the variable Y, given W and R. For Y = 1, these values are given
in the following table:
Figure 4.2: rain → R and wind → W, with both R and W parents of storm.
PY∣W,R(1∣w, r): PY∣W,R(1∣1, 1) = 1, PY∣W,R(1∣1, 0) = 1, PY∣W,R(1∣0, 1) = 1, PY∣W,R(1∣0, 0) = 0.
From the factorisation, p, the probability that a windy day, chosen at random, will be stormy, is p = PY∣C(1∣1):
p = PY∣C(1∣1) = Σ_{x1} PD(x1) Σ_{x2} PR∣D(x2∣x1) Σ_{x3} PY∣R,W(1∣x2, x3) PW∣C(x3∣1)
= βλµ + βµ(1 − λ) + β(1 − µ)λ + (1 − β)λ
= βµ − βλµ + λ = λ + (1 − λ)βµ.
Similarly, q, the probability that a rainy day, chosen at random, will be stormy, q = PY∣D(1∣1), is given
by
q = µ + (1 − µ)αλ,
as computed by Cayley. Cayley then derives the expression for the marginal probability of a stormy day,
ρ = PY(1):
PY(1) = Σ_{x1} PC(x1) Σ_{x2} PD(x2) Σ_{x3} PR∣D(x3∣x2) Σ_{x4} PW∣C(x4∣x1) PY∣R,W(1∣x3, x4)
= Σ_{x3} PR(x3) Σ_{x4} PW(x4) PY∣R,W(1∣x3, x4)
= PR(1)PW(1) + PR(1)PW(0) + PR(0)PW(1)
= αλ + βµ − αβλµ.
This simple construction from 1853 possibly represents the first example of a causal network and the
first construction of a noisy-or gate, with the concept of an inhibitor.
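Cayley's computation is easily reproduced mechanically from the factorisation PC PD PW∣C PR∣D PY∣W,R. The following sketch (with arbitrary illustrative values for α, β, λ, µ; not from the text) sums out the noisy-or network by brute force and confirms ρ = αλ + βµ − αβλµ:

```python
import itertools

alpha, beta, lam, mu = 0.30, 0.25, 0.6, 0.7   # arbitrary illustrative values

P_C = {1: alpha, 0: 1 - alpha}                # wind
P_D = {1: beta, 0: 1 - beta}                  # rain
P_W_C = {(1, 1): lam, (0, 1): 1 - lam, (1, 0): 0.0, (0, 0): 1.0}  # P(W=w | C=c)
P_R_D = {(1, 1): mu, (0, 1): 1 - mu, (1, 0): 0.0, (0, 0): 1.0}    # P(R=r | D=d)

def P_Y_WR(w, r):
    """Logical 'or' gate: a storm occurs exactly when W = 1 or R = 1."""
    return 1.0 if (w or r) else 0.0

rho = sum(P_C[c] * P_D[d] * P_W_C[(w, c)] * P_R_D[(r, d)] * P_Y_WR(w, r)
          for c, d, w, r in itertools.product([0, 1], repeat=4))
print(rho, alpha * lam + beta * mu - alpha * beta * lam * mu)  # equal
```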
4.2 Arthur Cayley and Judea Pearl's Intervention Calculus
where Paσ (j) ⊂ {σ(1), . . . , σ(j − 1)} is the parent set of node σ(j) when ordering σ is employed and
Ξσ (j) = {σ(1), . . . , σ(j − 1)}/Paσ (j).
Let V = {1, . . . , d} denote the node set which indexes the variables and X = (X1, . . . , Xd) the random
vector; let the indexing set for the state space of variable Xj be Xj = {0, 1, . . . , kj − 1} and the indexing
set for the state space of X be X = ×_{j=1}^d Xj. Let R(X) denote the ring of polynomial functions on R^X.
A conditional independence statement XA ⊥ XB ∣ XC, where A, B and C are disjoint subsets of V,
translates, using Proposition 8.1 from Sturmfels (2002) [130], into a set of homogeneous quadratic polynomials
on R(X), and these polynomials generate an ideal. Let I_{A⊥B∣C} denote the ideal generated by
the statement XA ⊥ XB ∣ XC. The ideal for a collection of independence statements, for example those
corresponding to a factorisation, is defined as the sum of the ideals; let M = {XAi ⊥ XBi ∣ XCi : i = 1, . . . , m}; then
IM = Σ_{i=1}^m I_{Ai⊥Bi∣Ci}.
Cayley is using the expression of the conditional independence statements that define the factorisation
in terms of polynomials to obtain the two polynomial equations
p = λ + (1 − λ)µβ,   q = µ + (1 − µ)λα   (4.1)
and writes, `. . . which determine λ and µ'. This amounts to finding roots of the two polynomials in
λ, µ:
f1(λ, µ) = λ + (1 − λ)µβ − p,   f2(λ, µ) = µ + (1 − µ)λα − q.
In terms of algebraic geometry, equation (4.1) defines the affine variety
V(f1, f2) = {(λ, µ) : f1(λ, µ) = 0, f2(λ, µ) = 0}.
In his brief note, Cayley has pointed out the connections between Bayesian networks and algebraic
geometry, a subject that he knew well. Cayley did much to clarify a large number of interrelated
theorems in algebraic geometry and is known for the Cayley surface (1869) [17].
Chapter 5
Moral Graph, Independence Graph, Chain Graphs
The definition of a chain graph is given below and it is shown that an essential graph is a chain graph,
although not vice versa. The study of chain graphs is developed in Section 5.2.
Definition 5.1 (Chain Graph). A chain graph is a graph G = (V, E), where the edge set contains
both directed and undirected edges, E = D ∪ U, where D is the set of directed edges and U the set of
undirected edges. The node set V can be partitioned into n disjoint subsets V = V1 ∪ . . . ∪ Vn, where the
sets V1, . . . , Vn are the node sets of the connected components of (V, U), the graph obtained by removing
all the directed edges, such that:
1. GVj is an undirected graph for all j = 1, . . . , n;
2. for any i ≠ j, and any α ∈ Vi, β ∈ Vj, there is no cycle in G = (V, E) (Definition 1.9) containing
both α and β.
The chain graph consists of components where the edges are undirected, which are connected by
directed edges. The components with undirected edges are known as chain components, which are
defined below.
Definition 5.2 (Chain Component). Let G = (V, E) be a chain graph, where E = D ∪ U and D is the set
of directed edges. Let Ĝ = (V, U) denote the graph obtained by removing all the directed edges from E.
Each connected component of Ĝ is known as a chain component.
The chain components (Vj , Uj ), j = 1, . . . , n of G therefore satisfy the following conditions:
1. Vj ⊆ V and Uj is the edge set obtained by retaining all undirected edges ⟨α, β⟩ ∈ E such that
α ∈ Vj and β ∈ Vj .
Theorem 5.3 states that any essential graph is necessarily a chain graph and presents the additional features
required to ensure that a chain graph is an essential graph corresponding to a directed acyclic graph.
It gives a characterisation for essential graphs that is useful for structure learning algorithms.
Figure 5.1: the three forbidden configurations referred to in Theorem 5.3.
Theorem 5.3. Let G = (V, E) be a graph, where E = D ∪ U. There exists a directed acyclic graph G∗
for which G is the corresponding essential graph if and only if G satisfies the following conditions:
1. G is a chain graph,
2. every chain component of G is triangulated,
3. the configurations shown in Figure 5.1 do not occur in any induced sub-graph of a three variable
set {α, β, γ} ⊂ V for the first two configurations, or a four variable set {α, β, γ, δ} for the third
configuration.
Proof Proof that an essential graph satisfies the conditions. To prove that it is a chain graph, the
first part of the definition is easily satisfied and it is sufficient to show that there is no cycle in (V, E)
containing α ∈ Vi and β ∈ Vj for two distinct chain components Vi and Vj.
Recall that the edges of a cycle τ0 , . . . , τn are either directed (τi , τi+1 ) or undirected ⟨τi , τi+1 ⟩. Let
(τ, γ) denote a directed edge in the cycle where γ ∈ Vj . Both connected components will have a node γ
with this property. If there is an undirected edge ⟨γ, γ1 ⟩ in the cycle, then there is also an undirected
edge ⟨τ, γ1 ⟩ in the graph. If there is a directed edge (τ, γ1 ) or (γ1 , τ ) then the edge between γ and γ1
is compelled contradicting the fact that it is undirected. If there is an undirected edge ⟨τ, γ1 ⟩, then
τ ∈ Vj . Proceeding inductively, it is clear that if there is a cycle, then there is an undirected edge
⟨τ1 , τ2 ⟩ where τ1 ∈ Vi and τ2 ∈ Vj contradicting the fact that the two chain components are distinct. It
follows that an essential graph is a chain graph.
Secondly, if there is a cycle of length ≥ 4 of undirected edges without a chord, then the DAG will
have a directed cycle, otherwise additional immoralities will appear when the edges are directed; hence
the chain components are triangulated.
Thirdly, the configuration stated cannot appear in an essential graph. The fourth requirement
follows from the definition of an essential graph.
For the other direction: suppose a graph satisfies the four conditions stated. All the directed edges
appear in configurations that are compelled and, from the forbidden subgraphs, no undirected edges
appear in compelled configurations where there should be a directed edge. It remains to show that the
undirected edges may be oriented in a way that produces a directed acyclic graph.
For each chain component, orient the edges so that the chain component is a directed acyclic
triangulated graph; this can be done. Then, since the first structure is forbidden, this operation
does not produce additional immoralities in the whole graph. Furthermore, since there are no cycles
containing two nodes α and β with α ∈ Vj and β ∈ Vk for j ≠ k, this operation does not produce directed
cycles. The graph is therefore the essential graph of a DAG.
Definition 5.4 (Moral Graph). Let G = (V, D) be a directed acyclic graph. The moral graph G(m) =
(V, U) is the undirected graph such that for any α, β ∈ V, ⟨α, β⟩ ∈ U if and only if either (α, β) ∈ D
or (β, α) ∈ D or {α, β} ⊆ Pa(γ) for some γ ∈ V. That is, the moral graph is the graph obtained by
first, for each node, adding links between all the parent variables of the node and then undirecting all
the directed edges.
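Moralisation is entirely mechanical; a minimal sketch (illustrative code, not from the text, applied to a small hypothetical DAG):

```python
from itertools import combinations

def moral_graph(nodes, directed_edges):
    """Marry each node's parents, then drop all edge directions."""
    undirected = {frozenset(e) for e in directed_edges}
    parents = {v: {a for (a, b) in directed_edges if b == v} for v in nodes}
    for v in nodes:
        for a, b in combinations(sorted(parents[v]), 2):
            undirected.add(frozenset((a, b)))  # link the parents of v
    return undirected

# Hypothetical DAG: 1 -> 2, 1 -> 3, 2 -> 3, 1 -> 4, 3 -> 4.
print(moral_graph([1, 2, 3, 4], [(1, 2), (1, 3), (2, 3), (1, 4), (3, 4)]))
```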
Theorem 5.5. Let G = (V, D) be a directed acyclic graph and let G(m) = (V, U) be its moral graph.
There is an edge ⟨α, β⟩ ∈ U if and only if α á/ β∥G V/{α, β}. That is, the moral graph has an edge if
and only if α and β are not D-separated by the remaining variables.
Definition 5.6 (Independence Graph). Let X = (X1, . . . , Xd) be a random vector. The independence
graph G = (V, U) is the undirected graph with vertex set V = {1, . . . , d} and where ⟨α, β⟩ ∈ U for α ≠ β
if and only if Xα ⊥/ Xβ ∣ X−(α,β), where the notation X−(α,β) denotes X without components Xα and Xβ.
Recall the definition of separator (Definition 7.15). The independence graph satisfies the following
property:
Theorem 5.7. Let X = (X1, . . . , Xd) be a random vector, let V = {1, . . . , d} be the indexing set for X
and let G = (V, U) be the independence graph of X. Then for three disjoint sets A, B and S such that
V = A ∪ B ∪ S, it holds that XA ⊥ XB ∣ XS (A and B are conditionally independent given S) if and only if
A á B ∥ S (A and B are separated by S).
Proof Firstly, assume that for three disjoint sets A, B and S such that V = A ∪ B ∪ S, A á B ∥ S in the
independence graph. Then, for each α1, α2 ∈ A and β ∈ B, set C = V/{α1, α2, β}. From the definition
of the independence graph,
Xα1 ⊥ Xβ ∣ X_{C∪{α2}} and Xα2 ⊥ Xβ ∣ X_{C∪{α1}}.
It follows from the intersection property, which states that if X ⊥ Y∣W ∪ Z and X ⊥ W∣Y ∪ Z then
X ⊥ W ∪ Y∣Z, that
XA ⊥ Xβ ∣ X−(A∪{β}).
Successive applications of the intersection property to the variables with indices in B give
XA ⊥ XB ∣ XS.
Now assume that XA ⊥ XB ∣ XS. Then, for each α ∈ A and β ∈ B, this may be rewritten as
Xα ⊥ XB ∣ X_{S∪A/{α}},
from which it follows that
Xα ⊥ Xβ ∣ X−(α,β),
so that there is no edge between any α ∈ A and β ∈ B in the independence graph; that is, A á B ∥ S.
Theorem 5.8. Let P be a probability distribution that factorises along a DAG G = (V, D). Let G (m) =
(V, U (m) ) denote its moral graph and let G (i) = (V, U (i) ) denote the independence graph of P. Then
U (i) ⊆ U (m) . Furthermore, if (V, D) is faithful to P, then U (i) = U (m) .
Proof From Theorem 5.5, the moral graph has an edge ⟨α, β⟩ if and only if α á/ β∥G V/{α, β}; there
is no edge ⟨α, β⟩ if and only if α á β∥G V/{α, β}. Since D-separation implies conditional independence
(Theorem 1.25), it follows that the lack of an edge ⟨α, β⟩ implies Xα ⊥ Xβ ∣ X−(α,β). From this, it follows
directly that U(i) ⊆ U(m).
For a faithful DAG, D-separation and conditional independence are equivalent, from which it follows
that U (i) = U (m) when P and G = (V, D) are faithful.
If a distribution P does not have a faithful representation, then for any DAG U (i) ⊂ U (m) .
Corollary 5.9. Let X = (X1, . . . , Xd) be a random vector and V = {1, . . . , d} be its indexing set. Let
G = (V, D) be a directed acyclic graph, along which P, the probability distribution of X, factorises, and
let G(m) be the moral graph. Let V = A ∪ B ∪ S, where A, B and S are disjoint subsets. Then A á B ∥ S
(A and B separated by S in G(m)) implies XA ⊥ XB ∣ XS (A and B conditionally independent given S).
PX1 ,X2 ,X3 ,X4 (x1 , x2 , x3 , x4 ) = C exp {−β12 (x1 − x2 ) − β23 (x2 − x3 ) − β34 (x3 − x4 ) − β14 (x1 − x4 )} .
PX1 ,X2 ,X3 ,X4 = PX1 PX2 ∣X1 PX3 ∣X1 ,X2 PX4 ∣X1 ,X2 ,X3 .
Note that
PX1 ,X2 ,X3 = C exp {−β12 (x1 − x2 ) − β23 (x2 − x3 )} ∑ exp {−β34 (x3 − x4 ) − β14 (x1 − x4 )} .
x4
Figure 5.2: the DAG with edges 1 → 2, 1 → 3, 2 → 3, 1 → 4 and 3 → 4. Figure 5.3: its moral graph (the same skeleton, undirected).
PX1 ,X2 ,X3 ,X4 = PX1 PX2 ∣X1 PX3 ∣X1 ,X2 PX4 ∣X1 ,X3 .
with DAG given by Figure 5.2. The moral graph of the DAG of Figure 5.2 is given in Figure 5.3.
Whatever the ordering of the variables, the moral graph of the resulting Bayesian network will be
triangulated. The cliques of the moral graph are the parent/variable sets of the factorisation.
More of the independence structure is revealed in this example by the factor graph shown in Figure 5.4,
which is a chain graph.
PX1 ,X2 ,X3 ,X4 = PX1 PX2 PX3 ,X4 ∣X1 ,X2 ,
but where neither X1 ⊥ X4 ∣X3 nor X2 ⊥ X3 ∣X4 hold. Such a distribution could arise, for example, with
a probability distribution
Figure 5.4: the factor graph (a chain graph) for this example, on the nodes 1, 2, 3, 4.
PU,X1 ,X2 ,X3 ,X4 = PU PX1 PX2 PX3 ∣X1 ,U PX4 ∣X2 ,U
The additional flexibility available for modelling when chain graphs are used should be clear. Chain
graphs, however, still satisfy the composition property and therefore separation statements in a chain
graph do not characterise the independence structure; there does not exist a faithful chain graph for
Example 2.7, the three-coin example.
where A = {j : ∃k : (k, j) ∈ D} and C denotes the collection of cliques of the chain components; clique
C is the domain of the function φC for each C ∈ C.
To generalise from DAGs to chain graphs, some additional denitions and machinery are necessary.
The approach taken here follows Ma-Xie-Geng (2008) [86].
A head-to-head section in a chain graph plays the same role as an immorality in a DAG.
Definition 5.12 (Section, Terminal). The terminals of a trail ρ = (ρ0, . . . , ρk) are simply the nodes
at each end, ρ0 and ρk. A section of a trail ρ = (ρ0, . . . , ρk) is a maximal undirected subroute σ =
(ρi, . . . , ρj). In other words, either ρi = ρ0, or else i ≠ 0 and there is a directed edge ρi−1 ↦ ρi or
ρi ↦ ρi−1; similarly, either j = k or else there is a directed edge ρj ↦ ρj+1 or ρj+1 ↦ ρj.
The vertices ρi and ρj are called terminals. ρi (respectively ρj) is a head terminal if i > 0 and G contains the
directed edge ρi−1 ↦ ρi (respectively j < k and G contains the edge ρj+1 ↦ ρj), and a tail terminal if i > 0 and
G contains the edge ρi ↦ ρi−1 (respectively j < k and G contains the edge ρj ↦ ρj+1).
A section σ of ρ is a head-to-head section if it has two head terminals; otherwise it is a non
head-to-head section.
For a set of vertices S ⊂ V, a section σ is outside S if {ρi, . . . , ρj} ∩ S = ∅; otherwise we say that
σ is hit by S.
A complex within a trail in a chain graph plays a similar role to a collider node in a trail in a DAG.
Definition 5.13 (Complex). A complex in G is a trail ρ = (ρ0, . . . , ρk) such that ρ0 ↦ ρ1 and ρk ↦ ρk−1
are in G and, for i = 1, . . . , k − 2, G contains the undirected edges ρi − ρi+1. The vertices ρ0 and ρk are
the parents of the complex and {ρ1, . . . , ρk−1} the region of the complex.
The pattern of a chain graph corresponds to taking the skeleton of a DAG and directing those edges
which belong to immoralities.
Definition 5.14 (Complex Arrow, Pattern, Moral Graph). A directed edge in the chain graph is known
as a complex arrow if it belongs to a complex of G. The pattern of G, denoted G∗, is the graph obtained
by undirecting all directed edges which are not complex arrows. The moral graph G(m) of a chain graph
is the graph obtained by first, for each complex, adding an undirected edge between each pair of parents
of the complex, and then undirecting all the edges.
For a chain graph, the descendants of a node are those for which there is a trail where each edge is
either undirected or directed from the node to the descendant.
Definition 5.15 (Descendant). A node β is a descendant of a node α if there is a path ρ = (ρ0, ρ1, . . . , ρk)
such that ρ0 = α, ρk = β and, for i = 0, . . . , k − 1, there is either an undirected edge ⟨ρi, ρi+1⟩ or a
directed edge (ρi, ρi+1) in G.
For a DAG, a connection is open if it is an uninstantiated fork or chain, or if it is a collider which
is either instantiated or has an instantiated descendant. In chain graphs, this has to be developed
slightly.
Definition 5.16 (Intervented). A trail ρ in G is intervented by a subset S of V if and only if there
exists a section σ of ρ such that:
1. either σ is a head-to-head section with respect to ρ and σ and all its descendants are outside S,
or
2. σ is a non head-to-head section with respect to ρ and σ is hit by S.
Note In [86], the requirement in 1. that the descendants are also outside S is not given. It is clear
that this is necessary, by considering the situation where the chain graph is a DAG.
The notion of C -separation for chain graphs corresponds to D-separation for DAGs.
Definition 5.17 (C-Separation). Let A, B and S be three disjoint subsets of V of a chain graph G
such that A and B are non-empty. The sets A and B are C-separated by S, written A á B∥G S, if and
only if every trail with one of its terminals in A and the other in B is intervented by S. The set S is a
C-separator for A and B.
The definition of Markov equivalence is the same, with C-separation substituted for D-separation.
Definition 5.18 (Markov Equivalence). Two chain graphs G1 and G2 are said to be Markov equivalent
if, for any three disjoint subsets A, B and S with both A and B non-empty,
A á B∥G1 S ⇔ A á B∥G2 S.
Having formulated the concepts for chain graphs that correspond to those for DAGs, the key result for
chain graphs corresponds directly to Theorem 2.11.
Theorem 5.19. Two chain graphs G1 and G2 are Markov equivalent if and only if they have the same
skeleton and the same complexes. That is, they have the same pattern.
Proof Frydenberg [48] (1990). It is similar to the proof of Theorem 2.11 for DAGs.
A distribution that factorises according to a chain graph is said to be Markovian with respect to the
chain graph.
Definition 5.20 (Markovian). A distribution P is said to be Markovian with respect to a chain graph
G if C-separation statements imply the corresponding independence statements:
A á B∥G S ⇒ XA ⊥ XB ∣ XS.
The definition of faithfulness for chain graphs is analogous to faithfulness for DAGs.
Definition 5.21 (Chain Graph Faithfulness). A distribution P is said to be faithful with respect to a
chain graph G if C-separation statements and independence statements are equivalent:
A á B∥G S ⇔ XA ⊥ XB ∣ XS.
Definition 5.22. Let G = (V, E) be a chain graph. Let C = {C1, . . . , CH} be a collection of distinct sets
of variables such that V = ∪_{j=1}^H Cj. Let T denote the graph (C, U), where U is a set of labelled undirected
edges: Uij ∈ U if and only if Ci ∩ Cj ≠ ∅; the label is Ci ∩ Cj and Uij is the separator.
T is said to be a tree if removal of the nodes of Uij, for any pair i ≠ j, splits T into two disjoint trees
Ti (with node set denoted Ci) and Tj (with node set denoted Cj). Let Vi = ∪_{C∈Ci} C and Vj = ∪_{C∈Cj} C.
A tree T with node set C is a separation tree for chain graph G if and only if:
1. ∪_{C∈C} C = V, and
2. for every separator S on T, with V1 and V2 the variable sets of the two subtrees obtained by removing it,
V1/S á V2/S ∥G S.
The separation tree has similarities to the junction tree, but it does not require that the collection
{C1 , . . . , CH } are cliques or that every separator is complete.
A separation tree can be constructed quite easily from the independence graph.
Theorem 5.23. Let X = (X1 , . . . , Xd ) be a random vector and let G (i) denote the independence graph.
Any junction tree constructed from any triangulation of G (i) is a separation tree.
Proof This is obvious, since any separation statement in the independence graph implies the corre-
sponding C -separation statement in the chain graph.
Lemma 5.24. Let α and β be two adjacent nodes in a chain graph G , then any separation tree T for
G contains a tree-node C such that {α, β} ⊆ C .
Proof Assume not, then there exists a separator K on T such that α ∈ V1 /K and β ∈ V2 /K , where
Vi denotes the variable set of the subtree Ti obtained by removing the edge attached by separator K ,
for i = 1, 2. This implies that α á β∥G K , which is false.
The separation tree satisfies several properties which will be useful in Section 16.12 for learning a chain graph.
Some of them are collected in the following theorem.
Theorem 5.25. Let T be a separation tree for a chain graph G = (V, E). Nodes α and β are C-separated
by some set Sαβ ⊂ V in G if and only if one of the following conditions holds:
1. α and β are not both contained in the same node C for any C ∈ C.
2. α, β ∈ C for some C ∈ C, but for any separator S ⊂ C, {α, β} ⊄ S, and there exists a set S′αβ ⊂ C
such that
α á β∥G S′αβ.
3. There is a C ∈ C such that {α, β} ⊆ C, there is a separator S ⊂ C such that {α, β} ⊆ S, but there
is a subset S′αβ of either ∪_{C:α∈C} C or ∪_{C:β∈C} C such that
α á β∥G S′αβ.
The following proposition shows that, similarly to the situation with DAGs, the parents for each
complex are all contained within the same tree node.
Proposition 5.26. Let G be a chain graph and T a separation tree of G . For any complex ρ in G ,
there exists a tree-node C ∈ C such that Pa(ρ) ⊆ C .
The proofs of Theorem 5.25 and Proposition 5.26 are given after the following example.
Example 5.27 (Chain Graph, Moral Graph, Separation Tree).
Figure 5.7 (a) shows a chain graph, while (b) shows the moralised graph. Figure 5.8 shows a
separation tree. The vertex set for the separation tree here is:
C = {{A, B, C}, {B, C, D}, {C, D, E}, {D, E, F }, {E, I}, {I, J}, {D, F, G}, {F, G, K, H}}.
In this case, the separation tree is the junction tree corresponding to a triangulation of the moral
graph, but a separation tree does not necessarily have to satisfy this property.
Lemma 5.28. Let G = (V, E) be a chain graph and let α, β ∈ V. There exists an edge α ∼ β in E if
and only if α á/ β∥G S for every S ⊆ V/{α, β}.
Figure 5.7: (a) a chain graph; (b) the moralised graph.
Figure 5.8: a separation tree with tree nodes {A, B, C}, {B, C, D}, {C, D, E}, {D, E, F}, {E, I}, {I, J}, {D, F, G}, {F, G, K, H}, and separators labelled by the corresponding intersections.
Proof If there is an edge α ∼ β , whether directed or undirected, then clearly α á/ β∥S for any
S ⊆ V /{α, β}. Let Pa(α) = {γ ∶ (γ, α) ∈ E} ∪ {γ ∶ ⟨γ, α⟩ ∈ E}. In other words, the parents of a node α
are all nodes for which there is either a directed edge from the node to α or an undirected edge between
the node and α.
Suppose there is no edge α ∼ β , then α á β∥G Pa(α) if β is an ancestor of α, α á β∥G Pa(β) if α is
an ancestor of β and both statements are true if α is not an ancestor of β and β is not an ancestor of
α.
Proof of Theorem 5.25 Clearly, if any of the three conditions holds, then there is a C-sep-set Sαβ
such that α á β∥G Sαβ.
Assume that, for a given separation tree T, none of the conditions holds. That is, there exist
α, β and C such that α, β ∈ C, there is a separator S ⊂ C such that α, β ∈ S, and for every subset S of
either ∪_{C:α∈C} C or ∪_{C:β∈C} C, α á/ β∥G S.
Note that, using the definitions from the proof of Lemma 5.28, if there is no edge α ∼ β, then
either α á β∥G Pa(α) or α á β∥G Pa(β) or both. Since Pa(α) ⊂ ∪_{C:α∈C} C and Pa(β) ⊂ ∪_{C:β∈C} C, this is
a contradiction; hence there is an edge α ∼ β in G, hence (by Lemma 5.28) there is no set R such that
α á β∥G R, and the theorem is proved.
Proof of Proposition 5.26 Suppose that α and β are parents of a complex κ = (α, γ1, . . . , γk, β),
where k ≥ 1. Suppose that for every tree-node C ∈ C, {α, β} ∩ C ≠ {α, β}. Consider two tree-nodes
C1 and C2 such that α ∈ C1 and β ∈ C2. Let C1 − D1 − ⋯ − Dn − C2 denote the path in the tree
from C1 to C2 and let S = C1 ∩ D1. If S ∩ {α, β} = ∅, then {γ1, . . . , γk} ∩ S ≠ ∅. This implies that
α á/ β∥S, since instantiation of any non-empty subset of the set {γ1, . . . , γk} opens the connection.
This contradicts the fact that S is a separator in the separation tree. It follows that either α ∈ S, and
hence α ∈ D1, or β ∈ S, and hence β ∈ D1; hence, inductively, it follows that there is a tree-node C such
that {α, β} ⊆ C.
Chapter 6
Evidence and Metrics
Let X denote a finite state space and let P denote a probability function over X; for any event A ⊆ X,
P(A) = Σ_{x∈A} P(x).
The space X contains a finite number of elements and the event algebra A is simply the set of all
possible subsets of X. If an event A ⊂ X is observed, then the probability P is updated, using the
definition of conditional probability
P∗(B) = P(B∣A) = P(AB)/P(A),
to a probability function P∗ over X that satisfies
P∗(x) = P(x)/P(A) for x ∈ A, and P∗(x) = 0 for x ∉ A.
Definition 6.1 (Jeffrey's Update). Let G1, . . . , Gr be a collection of mutually exclusive and exhaustive
events, let µj = P(Gj) and let λj = P∗(Gj) denote the updated probabilities. Jeffrey's rule for computing
the update of the probability for any A ⊆ X is given by
P∗(A) = Σ_{j=1}^r P∗(Gj) P(A∣Gj), (6.1)
which corresponds to
P∗(x) = (λj/µj) P(x), x ∈ Gj, j = 1, . . . , r. (6.2)
The information leading to the update may be considered as an event Ξ such that Ξ ⊄ X. The
probability measure P is extended to accommodate the event Ξ in the following way: for the set of
mutually exclusive and exhaustive events G1, . . . , Gr and any A ⊆ X, Ξ ⊥ A∣Gj for each j = 1, . . . , r.
The conditional probabilities of the events G1, . . . , Gr given Ξ are specified as P(Gj∣Ξ) = λj. Then, for
any A ⊆ X, the probability update is
P̃(A) = P(A∣Ξ) = Σ_{j=1}^r P(A∣Gj, Ξ) P(Gj∣Ξ) = Σ_{j=1}^r λj P(A∣Gj).
Using P(A∣Gj) = P(A ∩ Gj)/P(Gj) and µj = P(Gj), this gives
P̃(x) = (λj/µj) P(x), x ∈ Gj, j = 1, . . . , r.
Pearl's Update Pearl's update is a re-expression of Jeffrey's update, where the information is presented
in a slightly different format. Information received is that an event Ξ ⊄ X has happened, where
Ξ ⊥ A∣Gj for each j = 1, . . . , r, where (Gj)_{j=1}^r is a set of mutually exclusive and exhaustive events. The
information, though, is given in terms of likelihood ratios. Instead of λj = P(Gj∣Ξ), the information
is expressed as a collection of likelihood ratios ρj = P(Ξ∣Gj)/P(Ξ∣G1) for j = 1, . . . , r: ratios of the likelihood of
Ξ given Gj compared with the likelihood of Ξ given G1. That is, ρj represents the likelihood ratio
for the event Ξ given that Gj occurs, compared with G1. Note that ρ1 = 1. Using the same
notation µj = P(Gj), for any A ⊆ X, an application of Bayes rule gives
P̃(A) = P(A∣Ξ) = Σ_{j=1}^r P(A∣Gj) P(Gj∣Ξ)
= Σ_{j=1}^r P(A∣Gj) P(Ξ∣Gj)P(Gj)/P(Ξ)
= Σ_{j=1}^r P(A∣Gj) P(Ξ∣Gj)P(Gj)/(Σ_{k=1}^r P(Ξ∣Gk)P(Gk))
= Σ_{j=1}^r P(A∣Gj) ρj µj/(Σ_{k=1}^r ρk µk).
Definition 6.2 (Pearl's update). Let P denote a probability distribution over X and let G1, . . . , Gr be
mutually exclusive (that is, Gi ∩ Gj = ∅ for all i ≠ j) and exhaustive (that is, ∪_{j=1}^r Gj = X) events, where
P(Gj) = µj. Let ρ1 = 1 and let ρj, j = 2, . . . , r, denote a collection of positive numbers. Then, for each x ∈ X,
the Pearl update P̃ is defined as
P̃(x) = ρj P(x)/(Σ_{k=1}^r ρk µk), x ∈ Gj, j = 1, . . . , r. (6.3)
This is clearly a well defined probability function over X. The numbers ρj are interpreted as likelihood
ratios, where Ξ is an event Ξ ⊄ X and P is extended to include Ξ such that Ξ ⊥ A∣Gj for each j = 1, . . . , r
and A ⊆ X, and ρj = P(Ξ∣Gj)/P(Ξ∣G1).
Pearl's update and Jeffrey's rule are equivalent. The original probability space has been extended;
information has been received of a form that cannot be expressed in terms of events, or subsets, of the
original probability space.
Example 6.3.
A piece of cloth is to be sold on the market. The colour C is either green (cg ), blue (cb ) or violet
(cv ). Tomorrow, the piece of cloth will either be sold (s) or not (sc ); this is denoted by the variable S .
Experience gives the following probability distribution over C, S
S/C cg cb cv
PC,S = s 0.12 0.12 0.32
sc 0.18 0.18 0.08
The marginal distribution over C is
cg cb cv
PC = .
0.3 0.3 0.4
The piece of cloth is inspected by candle light. From the inspection by candle light, the probability
over C is assessed as:
cg cb cv
QC = .
0.7 0.25 0.05
This is a situation where Jeffrey's rule may be used to update the probability:
QS,C = QC PS∣C = (QC/PC) PS,C.
For example,
QS,C(s, cg) = (λg/µg) P(s, cg) = (0.7/0.3) × 0.12 = 0.28.
Updating the whole distribution in this way gives
Updating the whole distribution in this way gives
S/C cg cb cv
QC,S = s 0.28 0.10 0.04
sc 0.42 0.15 0.01
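The whole table is reproduced by scaling each cell by the ratio of the new to the old colour probability; a minimal sketch (illustrative code, assuming the tables above):

```python
P_CS = {('s', 'cg'): 0.12, ('s', 'cb'): 0.12, ('s', 'cv'): 0.32,
        ('sc', 'cg'): 0.18, ('sc', 'cb'): 0.18, ('sc', 'cv'): 0.08}
P_C = {'cg': 0.3, 'cb': 0.3, 'cv': 0.4}    # prior marginal over colour
Q_C = {'cg': 0.7, 'cb': 0.25, 'cv': 0.05}  # assessed by candle light

# Jeffrey's rule: scale each cell by the new-to-old colour probability ratio.
Q_CS = {(s, c): p * Q_C[c] / P_C[c] for (s, c), p in P_CS.items()}
print(Q_CS[('s', 'cg')], sum(Q_CS.values()))  # 0.28, and the table sums to 1
```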
6.2 Evidence
For a Bayesian network, three different types of evidence will be discussed: hard evidence, soft evidence
and virtual evidence. The definitions used are as follows:
Definition 6.4 (Hard Evidence, Soft Evidence, Virtual Evidence). The definitions are:
A hard finding is an instantiation, {Xi = xi^{(l)}}, for a particular value of i ∈ {1, . . . , d} and a
particular value of l ∈ {1, . . . , ki}. This specifies that variable Xi is in state xi^{(l)}.
A soft finding on a variable Xj specifies the probability distribution of the variable Xj. That is,
the conditional probability function PXj∣Paj is replaced by a probability function P∗Xj with domain
Xj.
A virtual finding on a variable Xj specifies a likelihood function L over the states of Xj; it is
interpreted as the instantiation of an additional variable E with PE∣Xj(e∣xj^{(m)}) ∝ L(xj^{(m)}).
Soft evidence and virtual evidence are different. When soft evidence is received on a variable, the links
between the variable and its parents are severed; if soft evidence is received on variable Xj, then the
conditional probability function PXj∣Paj is replaced by a new probability function P∗Xj.
Soft evidence basically applies to the situation described in the discussion of intervention calculus;
it is assumed that the Bayesian network has been derived from causal principles, where the parents of
a variable are direct causes. The soft evidence gives a new distribution over the variable, where the
new distribution is not influenced by its parents. The state of the variable is forced, as in a controlled
experiment, without reference to the other variables, while the new distribution P∗Xj describes the
probability of which state of Xj is enforced.
When virtual evidence is received, the links are preserved; the evidence is interpreted as an additional
variable, which is instantiated.
Virtual Evidence and the DAG The following shows how, in general, virtual evidence can be
considered as an additional node E in the DAG. Consider a set of variables V = {X1, . . . , Xd}, where
the joint probability distribution is factorised as
PX1,...,Xd = ∏_{j=1}^d PXj∣Paj.
Suppose that virtual evidence is received on variable Xj. This may be expressed as a variable E and,
by d-separation properties, the updated distribution PX1,...,Xd,E has the factorisation
PX1,...,Xd,E = (∏_{k=1}^d PXk∣Pak) PE∣Xj. (6.5)
The variable E is a `dummy variable', in the sense that its state space and distribution do not need
to be defined; the virtual evidence is interpreted as a particular instantiation {E = e} for this variable
and this is the only information that is needed. From Equation (6.5),
PX1,...,Xd∣E(·, . . . , ·∣e) = (∏_k PXk∣Pak) PE∣Xj(e∣·)/PE(e).
From Equation (6.4),
L(xj^{(m)}) / (Σ_{i=1}^{kj} L(xj^{(i)}) PXj∣Paj(xj^{(i)}∣πj)) = PE∣Xj(e∣xj^{(m)}) / PE(e),   m = 1, . . . , kj,
and
L(xj^{(m1)}) / L(xj^{(m2)}) = PE∣Xj(e∣xj^{(m1)}) / PE∣Xj(e∣xj^{(m2)}).
When applying virtual evidence, create an extra node on the network, with conditional probabilities
PE∣Xj(1∣xj^{(m)}) ∝ L(xj^{(m)}); any values satisfying 0 < PE∣Xj(1∣xj^{(m)}) < 1 for L(xj^{(m)}) > 0 will suffice,
with PE∣Xj(0∣xj^{(m)}) = 1 − PE∣Xj(1∣xj^{(m)}). For a Bayesian networks programme, these values need to be
defined, although the only conditional probability values used are those for E = 1. Then update the
network with the hard evidence E = 1.
Equivalence with Pearl's Update Represented on a DAG, the virtual evidence node E satisfies
E á V/{Xj}∥G Xj. The virtual evidence {E = e} may be expressed as Pearl's update with Ξ = {E = e}
and the partition events Gm = {Xj = xj^{(m)}} for m = 1, . . . , kj. The collection (Gm)_{m=1}^{kj} is mutually
exclusive and exhaustive, with
ρm = PE∣Xj(e∣xj^{(m)})/PE∣Xj(e∣xj^{(1)}),   µm = PXj(xj^{(m)}),   m = 1, . . . , kj.
Then, after extending P to accommodate the new variable E, the probability distribution PX1,...,Xd
is updated to P̃X1,...,Xd = PX1,...,Xd∣E(·, . . . , ·∣e), where
P̃X1,...,Xd(x1^{(i1)}, . . . , xd^{(id)}) = PX1,...,Xd(x1^{(i1)}, . . . , xd^{(id)}) ρ_{ij}/(Σ_{m=1}^{kj} µm ρm).
Example 6.5.
Consider a DAG on five variables, X1, X2, X3, X4 and X5, given in Figure 6.1. Suppose that a piece
of virtual evidence is received on the variable X3. This evidence may be modelled by a variable E,
which is inserted into the DAG, giving the DAG in Figure 6.2. The state of X3 affects the virtual evidence
that is observed.
Figure 6.1: a DAG on five variables; X1 and X2 are parents of X3, which is a parent of X4 and X5.
From Figure 6.2, it is clear that (X1 , X2 , X4 , X5 ) á E∥G X3 . The decomposition along the DAG gives
P(E∣X1 , X2 , X3 , X4 , X5 ) = P(E∣X3 ) and P(X1 , X2 , X4 , X5 ∣X3 , E) = P(X1 , X2 , X4 , X5 ∣X3 ).
Suppose that on any given day, there is a burglary at any given house with probability 10^{-4}. If there
is a burglary, then the alarm will go off with probability 0.95; if there is no burglary, then it does not
go off. One day, Professor Noddy receives a call from his neighbour Margarita, saying that she may
have heard Professor Noddy's burglar alarm going off. Professor Noddy decides that it is four times
more likely that Margarita did hear the alarm going off than that she was mistaken.
Figure 6.2: the DAG of Figure 6.1 with the virtual evidence node E added as a child of X3.
Let A take value 1 to denote the alarm going off and 0 otherwise, let B = 1 denote that a burglary
takes place and 0 otherwise, and let E denote the variable `telephone call'; E = 1 is the evidence that
Noddy received the call from Margarita. This evidence can be interpreted by extending P to include the
variable E, where B ⊥ E∣A (the virtual evidence is received on A; B is the remainder of the network),
and the relevant quantity is
λ = PE∣A(1∣1)/PE∣A(1∣0) = 4.
PB: PB(1) = 10^{-4}, PB(0) = 1 − 10^{-4};
PA∣B: PA∣B(1∣1) = 0.95, PA∣B(0∣1) = 0.05, PA∣B(1∣0) = 0, PA∣B(0∣0) = 1.
Therefore
PB,A: PB,A(1, 1) = 0.95 × 10^{-4}, PB,A(1, 0) = 0.05 × 10^{-4}, PB,A(0, 1) = 0, PB,A(0, 0) = 1 − 10^{-4},
so that
PA(1) = 0.95 × 10^{-4}, PA(0) = 1 − 0.95 × 10^{-4}.
Applying the update with likelihood ratios (4, 1) on the events {A = 1}, {A = 0} gives
P̃B(1) = P̃B,A(1, 1) + P̃B,A(1, 0) = (10^{-4} × (3.80 + 0.05)) / (1 + 2.85 × 10^{-4}) ≃ 3.85 × 10^{-4}.
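The figure can be checked directly from Pearl's update (6.3); a minimal numerical sketch (illustrative code, not from the text):

```python
P_BA = {(1, 1): 0.95e-4, (1, 0): 0.05e-4, (0, 1): 0.0, (0, 0): 1 - 1e-4}
rho = {1: 4.0, 0: 1.0}  # likelihood ratios on A: P(E=1|A=a) / P(E=1|A=0)

Z = sum(p * rho[a] for (b, a), p in P_BA.items())             # normalisation
post_B1 = sum(p * rho[a] for (b, a), p in P_BA.items() if b == 1) / Z
print(post_B1)  # approximately 3.85e-4
```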
Consider two common measures of divergence between probability distributions. Let P and Q be two
probability functions over the same finite state space X = {x1, . . . , xk} and let pj = P(xj) and qj = Q(xj)
for j = 1, . . . , k.
Definition 6.9 (Kullback Leibler Divergence). The Kullback Leibler divergence between two probability
distributions P and Q over the same state space X is defined as
DKL(P∥Q) = Σ_{j=1}^k pj ln(pj/qj).
The Kullback Leibler divergence is non-negative (left as an exercise) and DKL(P∥Q) = 0 ⇔ P ≡ Q, but
it is not a distance in the sense of Definition 6.7; it does not, in general, satisfy DKL(P∥Q) = DKL(Q∥P).
Example 6.10.
Let
P(1) = (0.02, 0.98), Q(1) = (0.0364, 0.9636), P(2) = (0.01, 0.99), Q(2) = (0.00471, 0.99529).
Then
D2(P^{(1)}, Q^{(1)}) = √((0.02 − 0.0364)² + (0.98 − 0.9636)²) = 0.0232,
D2(P^{(2)}, Q^{(2)}) = √((0.00471 − 0.01)² + (0.99529 − 0.99)²) = 0.00748,
so the change represented by the second adjustment is less than one third of the change represented
by the first if the change is measured using the quadratic distance measure. For the Kullback Leibler divergence,
DKL(P^{(1)}∥Q^{(1)}) = 0.02 ln(0.02/0.0364) + 0.98 ln(0.98/0.9636) = 0.004562,
DKL(P^{(2)}∥Q^{(2)}) = 0.01 ln(0.01/0.00471) + 0.99 ln(0.99/0.99529) = 0.00225,
so the change represented by the second adjustment is approximately one half of the change represented
by the first. Clearly, different distance measures give different impressions of the relative importance
of parameter changes.
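A short script makes the comparison concrete; the sketch below (illustrative code, assuming the distributions of Example 6.10) also evaluates the Chan-Darwiche distance introduced in the next definition:

```python
import math

def d2(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def dkl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def dcd(p, q):
    ratios = [qi / pi for pi, qi in zip(p, q)]
    return math.log(max(ratios)) - math.log(min(ratios))

pairs = [((0.02, 0.98), (0.0364, 0.9636)), ((0.01, 0.99), (0.00471, 0.99529))]
for P, Q in pairs:
    print(d2(P, Q), dkl(P, Q), dcd(P, Q))
```

The Chan-Darwiche values are approximately 0.616 for the first pair and 0.758 for the second: the opposite ranking to D2 and DKL, reflecting the fact that the second change moves a small probability by a larger multiplicative factor.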
Definition 6.11 (Chan-Darwiche Distance). Let P and Q be two probability functions over a finite
state space X. That is, P : X → [0, 1] and Q : X → [0, 1], with Σ_{x∈X} P(x) = 1 and Σ_{x∈X} Q(x) = 1. The
Chan-Darwiche distance is defined as
DCD(P, Q) = ln max_{x∈X} (Q(x)/P(x)) − ln min_{x∈X} (Q(x)/P(x)),
where, by definition, 0/0 = 1.
Unlike the Kullback Leibler divergence, the Chan-Darwiche distance is a distance; it satisfies the
three requirements of Definition 6.7. This result is stated in Theorem 6.13.
The support of a probability function defined on a finite state space, namely those points where it is
strictly positive (relating to outcomes that can happen), is important when comparing two different
probability functions over the same state space.
Definition 6.12 (Support). Let P be a probability function over a countable state space X; that is,
P : X → [0, 1] and Σ_{x∈X} P(x) = 1. The support of P is defined as the subset SP ⊆ X such that
SP = {x ∈ X : P(x) > 0}.
Theorem 6.13. The Chan-Darwiche distance measure is a distance measure, in the sense that for
any three probability functions P1, P2, P3 over a state space X, the following three properties hold:
1. (positivity) DCD(P1, P2) ≥ 0, with DCD(P1, P2) = 0 if and only if P1 ≡ P2;
2. (symmetry) DCD(P1, P2) = DCD(P2, P1);
3. (triangle inequality) DCD(P1, P3) ≤ DCD(P1, P2) + DCD(P2, P3).
Proof Positivity and symmetry are clear and are left as exercises. It only remains to prove the
triangle inequality. Since the state space is discrete and finite, it follows that there exist y, z ∈ X such
that
DCD(P1, P3) = ln max_x (P3(x)/P1(x)) − ln min_x (P3(x)/P1(x)) = ln (P3(y)/P1(y)) − ln (P3(z)/P1(z))
= ln (P3(y)/P2(y)) + ln (P2(y)/P1(y)) − ln (P3(z)/P2(z)) − ln (P2(z)/P1(z))
= (ln (P3(y)/P2(y)) − ln (P3(z)/P2(z))) + (ln (P2(y)/P1(y)) − ln (P2(z)/P1(z)))
≤ (ln max_x (P3(x)/P2(x)) − ln min_x (P3(x)/P2(x))) + (ln max_x (P2(x)/P1(x)) − ln min_x (P2(x)/P1(x)))
= DCD(P2, P3) + DCD(P1, P2).
This distance is relatively easy to compute. It has the advantage over the Kullback Leibler divergence
(which is not a true distance measure) that it may be used to obtain bounds on odds ratios.
Definition 6.14 (Odds). Let P be a probability measure over X and let A ⊂ X and B ⊂ X. The odds
for A versus Ac given B is defined as
OP(A∣B) = P(A∣B)/P(Ac∣B).
Comparison with the Kullback Leibler Divergence and Euclidean Distance Consider two
probability distributions P = (p1, p2, p3) and Q = (q1, q2, q3) over {1, 2, 3} defined by
p1 = a, p2 = b − a, p3 = 1 − b,
q1 = ka, q2 = b − ka, q3 = 1 − b.
Then
DKL(P∥Q) = −a ln k − (b − a) ln((b − ka)/(b − a)).
Consider the events A = {1}, B = {1, 2}; then OP(A∣B) = a/(b − a) and OQ(A∣B) = ka/(b − ka), and the odds ratio
is given by
OQ(A∣B)/OP(A∣B) = k(b − a)/(b − ka).
As a → 0, DKL(P∥Q) → 0, while OQ(A∣B)/OP(A∣B) → k. It is therefore not possible to find a bound on the odds
ratio in terms of the Kullback Leibler divergence.
Similarly,
D2(P, Q) = √2 · a∣1 − k∣ → 0 as a → 0,
while DCD(P, Q) = ln k − ln((b − ka)/(b − a)) (for k > 1), which tends to ln k ≠ 0 as a → 0.
Theorem 6.15. Let P and Q be two probability distributions over the same finite state space X and let
A and B be two subsets of X. Let Ac = X/A and Bc = X/B. Let OP(A∣B) = P(A∣B)/P(Ac∣B) and OQ(A∣B) =
Q(A∣B)/Q(Ac∣B). Then
e^{−DCD(P,Q)} ≤ OQ(A∣B)/OP(A∣B) ≤ e^{DCD(P,Q)}.
The bound is sharp in the sense that for any pair of distributions (P, Q) there are subsets A and B of
X such that the bound is attained.
Proof of Theorem 6.15 Without loss of generality, it may be assumed that P and Q have the same
support; that is, P(x) > 0 ⇔ Q(x) > 0. Otherwise DCD(P, Q) = +∞ and the statement is trivially
true: for any A, B ⊆ X, 0 ≤ OQ(A∣B)/OP(A∣B) ≤ +∞. For P and Q with the same support, let
r(x) = Q(x)/P(x). For any two subsets A, B ⊆ X,
OQ(A∣B)/OP(A∣B) = (Q(A ∩ B)/P(A ∩ B)) · (P(Ac ∩ B)/Q(Ac ∩ B)),
and since min_z r(z) P(C) ≤ Q(C) ≤ max_z r(z) P(C) for any C ⊆ X, while
e^{DCD(P,Q)} = max_z r(z)/min_z r(z),
it follows that
e^{−DCD(P,Q)} ≤ OQ(A∣B)/OP(A∣B) ≤ e^{DCD(P,Q)},
as required, thus proving the first part.
To prove that the bound is tight, consider x such that r(x) = max_{z∈X} r(z) and y such that r(y) = min_{z∈X} r(z). Set A = {x} and B = {x, y}. Then
$$O_Q(A|B) = \frac{Q(x)}{Q(y)} = \frac{r(x)P(x)}{r(y)P(y)}, \qquad O_P(A|B) = \frac{P(x)}{P(y)},$$
so that
$$\frac{O_Q(A|B)}{O_P(A|B)} = \frac{r(x)}{r(y)} = e^{D_{CD}(P,Q)}.$$
Theorem 6.15 may be used to obtain bounds on arbitrary queries Q(A|B) for the measure Q in terms of P(A|B): with d = DCD(P, Q),
$$\frac{P(A|B)e^{-d}}{1+(e^{-d}-1)P(A|B)} \le Q(A|B) \le \frac{P(A|B)e^{d}}{1+(e^{d}-1)P(A|B)}. \tag{6.7}$$
Proof Equation (6.7) is a straightforward consequence of Theorem 6.15. The computation is left as an exercise.
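The computations in Definition 6.11, Theorem 6.15 and Equation (6.7) are easy to check numerically. The following is a minimal sketch (not from the text; the distributions, the events and the value of P(A|B) are illustrative choices), assuming distributions with common support represented as lists.

```python
# Sketch: Chan-Darwiche distance (Definition 6.11), odds (Definition 6.14)
# and the bounds of Theorem 6.15 / Equation (6.7).
import math

def chan_darwiche(P, Q):
    """D_CD(P,Q) = ln max Q/P - ln min Q/P over a common finite state space."""
    ratios = [q / p for p, q in zip(P, Q)]   # assumes common support, p > 0
    return math.log(max(ratios)) - math.log(min(ratios))

def odds(P, A, B):
    """O_P(A|B) = P(A|B) / P(A^c|B), with A, B given as index sets."""
    pAB = sum(P[i] for i in B if i in A)
    pAcB = sum(P[i] for i in B if i not in A)
    return pAB / pAcB

P = [0.2, 0.3, 0.5]                       # illustrative distributions
Q = [0.3, 0.3, 0.4]
d = chan_darwiche(P, Q)
ratio = odds(Q, {0}, {0, 1, 2}) / odds(P, {0}, {0, 1, 2})
assert math.exp(-d) <= ratio <= math.exp(d)     # Theorem 6.15

# Equation (6.7): bounds on Q(A|B) in terms of P(A|B) and d.
p = 0.25                                  # an illustrative value of P(A|B)
lower = p * math.exp(-d) / (1 + (math.exp(-d) - 1) * p)
upper = p * math.exp(d) / (1 + (math.exp(d) - 1) * p)
print(d, lower, upper)
```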
Theorem 6.17. Let P be a probability distribution over a countable state space X and let G1, . . . , Gn be a collection of mutually exclusive and exhaustive events. Let λj = P(Gj) for j = 1, . . . , n. Let Q denote the probability distribution such that Q(Gj) = µj for j = 1, . . . , n and such that, for all x ∈ X,
$$Q(x) = \frac{\mu_j}{\lambda_j}P(x), \quad x\in G_j.$$
Then
$$D_{CD}(P,Q) = \ln\max_j\frac{\lambda_j}{\mu_j} - \ln\min_j\frac{\lambda_j}{\mu_j}.$$
Under the Chan-Darwiche distance measure, Jeffrey's rule may be considered optimal, in the following sense.
Theorem 6.19. Let P denote a probability distribution over X and let G1, . . . , Gr denote a collection of mutually exclusive and exhaustive events. Let µj = P(Gj), let λ1, . . . , λr be a collection of non-negative numbers such that ∑_{j=1}^r λj = 1 and let Q be the probability distribution over X defined by
$$Q(x) = \frac{\lambda_j}{\mu_j}P(x), \quad x\in G_j.$$
Then Q minimises DCD(P, R) over all probability distributions R over X satisfying R(Gi) = λi for i = 1, . . . , r; that is, DCD(P, Q) ≤ DCD(P, R) for every such R.
Proof Let Q denote the distribution generated by Jeffrey's rule and let R be any distribution that satisfies the constraint R(Gj) = Q(Gj) = λj, j = 1, . . . , r. If P and R do not have the same support (Definition 6.12), then +∞ = DCD(P, R) ≥ DCD(P, Q). If they have the same support, let j denote the value such that λj/µj = max_i λi/µi and let k denote the value such that λk/µk = min_i λi/µi. Let α = max_{x∈X} R(x)/P(x). Then
$$\alpha\mu_j = \alpha\sum_{x\in G_j}P(x) \ge \sum_{x\in G_j}P(x)\frac{R(x)}{P(x)} = R(G_j) = \lambda_j,$$
so that
$$\alpha \ge \frac{\lambda_j}{\mu_j}.$$
Set β = min_{x∈X} R(x)/P(x); a similar argument gives β ≤ λk/µk. It follows that the distance between P and R is
$$D_{CD}(P,R) = \ln\max_{x\in\mathcal X}\frac{R(x)}{P(x)} - \ln\min_{x\in\mathcal X}\frac{R(x)}{P(x)} = \ln\alpha - \ln\beta \ge \ln\frac{\lambda_j}{\mu_j} - \ln\frac{\lambda_k}{\mu_k} = \ln\max_i\frac{\lambda_i}{\mu_i} - \ln\min_i\frac{\lambda_i}{\mu_i} = D_{CD}(P,Q).$$
Pearl's Method of Virtual Evidence Recall Pearl's method of virtual evidence. The Chan-Darwiche distance between the original distribution and the updated distribution has a convenient expression.
Theorem 6.20. Let P be a probability distribution over a finite state space X; let λ1 = 1 and let λ2, . . . , λr be positive numbers. Let G1, . . . , Gr be a collection of mutually exclusive and exhaustive subsets of X. Let µj = ∑_{x∈Gj} P(x) for j = 1, . . . , r. Let Q be defined as
$$Q(x) = \frac{\lambda_j}{\sum_{k=1}^r\mu_k\lambda_k}P(x), \quad x\in G_j.$$
Then Q is a probability distribution over X and
$$D_{CD}(P,Q) = \ln\max_j\lambda_j - \ln\min_j\lambda_j.$$
Proof Firstly, it is clear from the construction that ∑_{x∈X} Q(x) = 1 and that Q(x) ≥ 0 for all x ∈ X, so that Q is a probability function. From the definition,
$$\frac{Q(x)}{P(x)} = \frac{\lambda_j}{\sum_k\mu_k\lambda_k}, \quad x\in G_j.$$
It follows that
$$D_{CD}(P,Q) = \ln\max_{x\in\mathcal X}\frac{Q(x)}{P(x)} - \ln\min_{x\in\mathcal X}\frac{Q(x)}{P(x)} = \ln\max_j\frac{\lambda_j}{\sum_k\mu_k\lambda_k} - \ln\min_j\frac{\lambda_j}{\sum_k\mu_k\lambda_k} = \ln\max_j\lambda_j - \ln\min_j\lambda_j,$$
as required.
Corollary 6.21. Let OQ and OP denote the odds functions associated with the probability measures defined in Theorem 6.20 and let d = ln max_j λj − ln min_j λj. Then, for any A, B ⊆ X,
$$e^{-d} \le \frac{O_Q(A|B)}{O_P(A|B)} \le e^{d}.$$
Example 6.22.
The `Burglary' example may be developed to illustrate these results. Let A denote the event that the alarm goes off, B the event that a burglary takes place and E the evidence of the telephone call from Jemima. According to Pearl's method, this evidence can be interpreted as
$$\lambda = \frac{P_{E|A}(1|1)}{P_{E|A}(1|0)} = 4.$$
Therefore, the distance between the original distribution P and the update Q(.) = P(.|E = 1) derived according to Pearl's method is DCD(P, Q) = ln 4 ≃ 1.386. This distance may be used to bound QB(1), the probability of a burglary after the update to incorporate the evidence. Using the bound stated in the corollary, with d = ln 4,
$$\frac{P_B(1)e^{-d}}{1+(e^{-d}-1)P_B(1)} \le Q_B(1) \le \frac{P_B(1)e^{d}}{1+(e^{d}-1)P_B(1)},$$
so that 2.50 × 10^{−5} ≤ QB(1) ≤ 4.00 × 10^{−4}. An application of Pearl's virtual evidence rule gives QB(1) = 3.85 × 10^{−4}.
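The numbers in the example can be reproduced with a few lines of code. In this sketch the prior PB(1) = 10^{−4} is an assumption, chosen to be consistent with the bounds quoted above.

```python
# Sketch: reproducing the bounds of Example 6.22.
import math

d = math.log(4)        # D_CD(P,Q) for Pearl's update with lambda = 4
pB = 1e-4              # assumed prior probability of a burglary

lower = pB * math.exp(-d) / (1 + (math.exp(-d) - 1) * pB)
upper = pB * math.exp(d) / (1 + (math.exp(d) - 1) * pB)
print(lower, upper)    # approximately 2.50e-05 and 4.00e-04
```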
Notes The article [37] discusses probability updates when the information received does not fit into the framework of the standard definition. The Chan-Darwiche distance measure is proposed in [20]. The article [22] by Chan and Darwiche discusses the application of Jeffrey's update rule and Pearl's method to virtual evidence. These two articles provide the basis for the chapter.
6.6 Exercises
1. Jeffrey's Rule In a certain country, people use only two car models, Volvo and Saab, which come in two colours, red and blue. The sales statistics suggest P(Volvo) = P(Saab) = 1/2. Furthermore, P(red|Volvo) = 0.7 and P(red|Saab) = 0.2. You are on holiday in this region and you are standing outside a large underground garage, which you may not enter. The attendant of the garage communicates his impression that 40% of the cars in the garage are red. What is the probability that the first car leaving the garage is a Volvo?
2. Pearl's Method The two parts of this question are virtually identical.
(a) Let A denote an event that gives uncertain information (or virtual / soft evidence) about the partition (that is, a collection of mutually exclusive and exhaustive events) {Gj}_{j=1}^n. Suppose that A satisfies
$$P(A|G_j, B) = P(A|G_j), \quad j = 1, 2, \dots, n,$$
for any event B, and let λj = P(A|Gj). Show that
$$P(B|A) = \frac{\sum_{j=1}^n\lambda_jP(B\cap G_j)}{\sum_{j=1}^n\lambda_jP(G_j)}.$$
Check that P(.|A) satisfies the definition of the Pearl update (Definition 6.2).
(b) Let P denote a probability distribution before evidence is obtained and suppose that a piece of evidence Ξ gives uncertain information about the partition (that is, the collection of mutually exclusive and exhaustive events) {Gj}_{j=1}^n. Suppose that Ξ is not in the original event space and that, for any event A in the original event space, Ξ ⊥ A|Gj for each j = 1, . . . , n. Suppose that this evidence is specified by the posterior probabilities qj = P(Gj|Ξ), j = 1, . . . , n. Let
$$\rho_j = \frac{P(\Xi|G_j)}{P(\Xi|G_1)}, \quad j = 1, 2, \dots, n,$$
and
$$\lambda_j = \frac{q_j}{P(G_j)}, \quad j = 1, 2, \dots, n.$$
For any event C, compute the probability P(C|Ξ) obtained by Pearl's method of virtual evidence and show that this gives the same result as Jeffrey's rule of update.
3. Let X1, X2, X3 be three binary random variables, each taking values in {0, 1}, such that
$$P_{X_1,X_2,X_3}(x_1,x_2,x_3) = \frac18, \quad (x_1,x_2,x_3)\in\{0,1\}^3.$$
Now let V be an additional binary random variable and let E = {V = 1}. Here V stands for virtual information. Suppose that the conditional probability function of V given X3 satisfies
$$\frac{P_{V|X_3}(1|1)}{P_{V|X_3}(1|0)} = \lambda.$$
Let
$$G_1 = \{(x_1,x_2,x_3)\in\{0,1\}^3 \mid x_3 = 0\}$$
and
$$G_2 = \{(x_1,x_2,x_3)\in\{0,1\}^3 \mid x_3 = 1\}.$$
The events G1 and G2 are mutually exclusive and exhaustive. Use Pearl's method of virtual evidence to obtain the updated probability distribution P̃_{X1,X2,X3}(.) = P_{X1,X2,X3}(.|E).
4. Let G = (V, E) be a Directed Acyclic Graph, where V = (X1, . . . , Xd), and let P and Q be two probability distributions factorised along G. Let
$$\theta_{jil} = P_{X_j|Pa_j}\left(x_j^{(i)}\,\middle|\,\pi_j^{(l)}\right).$$
Suppose that the conditional probabilities for P and Q are the same except for one single (j, l) variable / parent configuration, where P_{Xj|Paj}(.|π_j^{(l)}) is given by θ_{j.l} and Q_{Xj|Paj}(.|π_j^{(l)}) is given by θ̃_{j.l}. Let DKL denote the Kullback-Leibler divergence. Show that
$$D_{KL}(P,Q) = P_{\Pi_j}\left(\pi_j^{(l)}\right)\sum_{i}\theta_{jil}\ln\frac{\theta_{jil}}{\tilde\theta_{jil}};$$
that is, the Kullback-Leibler divergence between the two conditional distributions, weighted by the probability of the parent configuration.
5. Let DCD denote the Chan-Darwiche distance. Prove the remaining two statements of Theorem 6.13; that for any P and Q,
$$D_{CD}(P,Q) \ge 0,\ \text{with equality if and only if}\ P = Q,$$
and
$$D_{CD}(P,Q) = D_{CD}(Q,P).$$
6. Let P be a probability distribution over a finite state space X, let G1, . . . , Gn be a collection of mutually exclusive and exhaustive events and let λj = P(Gj) for j = 1, . . . , n. Let µ1, . . . , µn be non-negative numbers with ∑_{j=1}^n µj = 1 and let Q be the distribution given by Jeffrey's rule:
$$Q(A) = \sum_{j=1}^n\mu_jP(A|G_j).$$
Show that
$$D_{CD}(P,Q) = \ln\max_j\frac{\lambda_j}{\mu_j} - \ln\min_j\frac{\lambda_j}{\mu_j},$$
in line with Theorem 6.17.
7. (a) Find a calibration of the Chan-Darwiche distance in terms of the distance between two Bernoulli trials. That is, let P = (p0, p1) and Q = (q0, q1). Find the number cd(k) such that if q0 = 1 − cd(k), q1 = cd(k) and p0 = p1 = 1/2, then
$$D_{CD}(P,Q) = k.$$
(b) Carry out the same calibration for the Kullback-Leibler divergence; that is, find the number KL(k) such that if q0 = 1 − KL(k), q1 = KL(k) and p0 = p1 = 1/2, then DKL(P∥Q) = k. Show that
$$KL(k) = \frac12 \pm \frac12\sqrt{1 - e^{-2k}}.$$
8. Jensen's inequality Let ϕ(x) be a convex function and let X be a discrete real-valued random variable defined on a finite space X. Prove, by induction, that
$$E[\phi(X)] \ge \phi(E[X]).$$
9. Let X = {0, 1}^d and let Q and P be the product distributions defined by
$$Q(x) = \prod_{i=1}^d q_i^{x_i}(1-q_i)^{1-x_i} \qquad\text{and}\qquad P(x) = \prod_{i=1}^d p_i^{x_i}(1-p_i)^{1-x_i},$$
where, for this example, it is assumed that 0 < qi < 1 and 0 < pi < 1 (i.e. the inequalities are strict) for all i ∈ {1, . . . , d}.
(a) Compute DCD(P, Q).
(b) Suppose, in addition, that qi = q and pi = p for all i, and let OQ = q/(1 − q) and OP = p/(1 − p). Show that
$$D_{CD}(P,Q) = d\ln\max\left(\frac{O_Q}{O_P}, \frac{O_P}{O_Q}\right). \tag{6.9}$$
10. A piece of cloth is to be sold on the market. The colour C is either green (cg), blue (cb) or violet (cv). Tomorrow, the piece of cloth will either be sold (s) or not (sc); this is denoted by the variable S. Experience gives the following probability distribution over (C, S):

              cg     cb     cv
  PC,S:  s    0.12   0.12   0.32
         sc   0.18   0.18   0.08

so that the margin over C is

         cg    cb    cv
  PC:    0.3   0.3   0.4

The piece of cloth is inspected by candle light. Since it cannot be seen perfectly, this only gives soft evidence. From the inspection by candle light, the probability over C is assessed as:

         cg    cb    cv
  QC:    0.7   0.25  0.05

Jeffrey's rule then gives the updated joint distribution

              cg     cb     cv
  QS,C:  s    0.28   0.10   0.04
         sc   0.42   0.15   0.01
(a) Compute DCD(P, Q), the Chan-Darwiche distance between the original and updated distributions.
(b) Compute the bounds on the odds ratios given by Corollary 6.18 in this example. Compare with OQ(cg|s)/OP(cg|s).
(c) Suppose that Q*C = (0.25, 0.25, 0.50). Compute DCD(P, Q*) and the bounds on the odds ratios given by Corollary 6.18. Again, compare with OQ*(cg|s)/OP(cg|s). The distribution Q* is closer to P than Q and hence the bounds are tighter.
(d) Now consider the following problem: the probability that the piece of cloth is green, given that it is sold tomorrow, is, before updating, 0.214. What evidence would satisfy the constraint that the updated probability that the cloth is green, given that it is sold tomorrow, does not exceed 0.3?
6.7 Answers
1. Let A denote car type and C colour. Events to be updated: P*C(red) = 0.4, P*C(blue) = 0.6. Original joint probability function:

               R      B
  PA,C:   V    0.35   0.15
          S    0.10   0.40

so
PC(red) = 0.45, PC(blue) = 0.55,
and

               R      B
  PA|C:   V    7/9    3/11
          S    2/9    8/11

Jeffrey's rule:

                               R       B
  P*A,C = PA|C P*C:   V    14/45    9/55
                      S     4/45   24/55

so that
$$P^*(\text{Volvo}) = \frac{14}{45} + \frac{9}{55} = \frac{47}{99}.$$
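The computation may be verified exactly with rational arithmetic; the following sketch simply re-implements the Jeffrey's rule update above.

```python
# Sketch: exact check of Answer 1 using fractions.
from fractions import Fraction as F

P = {('V', 'R'): F(35, 100), ('V', 'B'): F(15, 100),
     ('S', 'R'): F(10, 100), ('S', 'B'): F(40, 100)}
P_star_C = {'R': F(4, 10), 'B': F(6, 10)}      # updated colour margins

PC = {c: sum(p for (a, col), p in P.items() if col == c) for c in ('R', 'B')}
# Jeffrey's rule: P*(a, c) = P(a | c) P*(c)
P_star = {(a, c): p / PC[c] * P_star_C[c] for (a, c), p in P.items()}
print(P_star[('V', 'R')] + P_star[('V', 'B')])   # Fraction(47, 99)
```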
2. (a) A ⊥ B|Gj for each Gj, j = 1, . . . , n, so P(A|Gj, B) = P(A|Gj). It follows, using P(B|Gj)P(Gj) = P(BGj) and λj = P(A|Gj), that
$$P(B|A) = \frac{\sum_{j=1}^n\lambda_jP(B\cap G_j)}{\sum_{j=1}^n\lambda_jP(G_j)}.$$
For an outcome x,
$$P(x|A) = \frac{\lambda_j}{\sum_k\lambda_kP(G_k)}P(x) = \frac{\rho_j}{\sum_k\rho_kP(G_k)}P(x), \quad x\in G_j,\ j = 1, \dots, n,$$
where ρk = λk/λ1 = P(A|Gk)/P(A|G1), which is the definition of the Pearl update.
(b) Jeffrey's rule is valid for a piece of information Ξ that alters the probabilities on the partition events G1, . . . , Gn and such that P(Ξ|Gj, B) = P(Ξ|Gj) for any event B. Let P*(C) = P(C|Ξ), the updated probability for an event C. Then the update under Jeffrey's rule is, for any outcome x,
$$P^*(x) = \sum_{j=1}^nP(x|G_j)P^*(G_j) = \sum_{j=1}^nq_jP(x|G_j) = q_kP(x|G_k), \quad x\in G_k.$$
Pearl's method of virtual evidence, with
$$\rho_j = \frac{P(\Xi|G_j)}{P(\Xi|G_1)}, \quad j = 1, \dots, n,$$
gives
$$P^*(x) = P(x|\Xi) = \frac{\rho_j}{\sum_{k=1}^n\rho_kP(G_k)}P(x), \quad x\in G_j.$$
Using P(Ξ|G) = P(G|Ξ)P(Ξ)/P(G), it follows that ρj = (qj/P(Gj))/(q1/P(G1)) = λj/λ1, so that, since ∑_{k=1}^n λkP(Gk) = ∑_{k=1}^n qk = 1,
$$P^*(x) = \frac{\lambda_j}{\sum_{k=1}^n\lambda_kP(G_k)}P(x) = q_jP(x|G_j), \quad x\in G_j,$$
which is the same as the Jeffrey's rule update.
3. In this example, X1, X2, X3 are mutually independent; PX1,X2,X3 = PX1PX2PX3, with PXj(1) = PXj(0) = 1/2 for j = 1, 2, 3. Virtual evidence on X3 is treated as a virtual node V with the single parent X3, so (using PV|X3(1|1)/PV|X3(1|0) = λ)
$$\tilde P_{X_3}(1) = P_{X_3}(1)\frac{P_{V|X_3}(1|1)}{P_{V|X_3}(1|1)P_{X_3}(1) + P_{V|X_3}(1|0)P_{X_3}(0)} = P_{X_3}(1)\frac{2\lambda}{\lambda+1} = \frac{\lambda}{\lambda+1},$$
so
$$\tilde P_{X_1,X_2,X_3}(x_1,x_2,1) = P_{X_1}(x_1)P_{X_2}(x_2)\tilde P_{X_3}(1) = \frac{\lambda}{4(\lambda+1)},$$
$$\tilde P_{X_1,X_2,X_3}(x_1,x_2,0) = P_{X_1}(x_1)P_{X_2}(x_2)\tilde P_{X_3}(0) = \frac{1}{4(\lambda+1)}$$
for each value of (x1, x2) ∈ {(0, 0), (0, 1), (1, 0), (1, 1)}.
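A direct numerical check of this update (the value λ = 3 is an illustrative assumption): Pearl's method multiplies the prior by the likelihood P(V = 1|x) up to scale and renormalises, and the result agrees with the closed forms λ/(4(λ + 1)) and 1/(4(λ + 1)) above.

```python
# Sketch: Pearl's virtual-evidence update of the uniform distribution on
# {0,1}^3, with likelihood ratio lam on X3.
from itertools import product

lam = 3.0                                   # illustrative likelihood ratio
P = {x: 1 / 8 for x in product((0, 1), repeat=3)}
like = {x: (lam if x[2] == 1 else 1.0) for x in P}   # P(V=1|x) up to scale
Z = sum(P[x] * like[x] for x in P)                   # = (lam + 1) / 2
P_tilde = {x: P[x] * like[x] / Z for x in P}

print(P_tilde[(0, 0, 1)], lam / (4 * (lam + 1)))     # agree
print(P_tilde[(0, 0, 0)], 1 / (4 * (lam + 1)))       # agree
```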
4. Assume that the variables are ordered so that Paj ⊆ {X1, . . . , Xj−1}. Then
$$D_{KL}(P,Q) = \sum_xP(x)\ln\frac{P(x)}{Q(x)} = \sum_{x=(x_1,\dots,x_d)}\prod_{k=1}^dP_{X_k|Pa_k}(x_k|\pi_k(x))\ \ln\frac{\prod_{k=1}^dP_{X_k|Pa_k}(x_k|\pi_k(x))}{\prod_{k=1}^dQ_{X_k|Pa_k}(x_k|\pi_k(x))}$$
$$= \sum_{x\,:\,\pi_j(x)=\pi_j^{(l)}}\ \prod_{k=1}^dP_{X_k|Pa_k}(x_k|\pi_k(x))\ \ln\frac{\theta_{ji_jl}}{\tilde\theta_{ji_jl}}$$
$$= \sum_{i_j}\theta_{ji_jl}\ln\frac{\theta_{ji_jl}}{\tilde\theta_{ji_jl}}\ \sum_{(x_1,\dots,x_{j-1})\,:\,\pi_j(x)=\pi_j^{(l)}}\ \prod_{k=1}^{j-1}P_{X_k|Pa_k}(x_k|\pi_k(x)) = P_{\Pi_j}\left(\pi_j^{(l)}\right)\sum_{i}\theta_{jil}\ln\frac{\theta_{jil}}{\tilde\theta_{jil}}.$$
5. By definition,
$$D_{CD}(P,Q) = \max_x\ln\frac{P(x)}{Q(x)} - \min_x\ln\frac{P(x)}{Q(x)}.$$
Clearly, for any function f, maxx f(x) ≥ minx f(x), so the distance is non-negative. If DCD(P, Q) = 0, it follows that P(x)/Q(x) = α, a constant, for all x ∈ X. It follows that P(x) = αQ(x), so that
$$1 = \sum_xP(x) = \alpha\sum_xQ(x) = \alpha,$$
hence α = 1 and P = Q. For symmetry, note that max_x ln(Q(x)/P(x)) = −min_x ln(P(x)/Q(x)) and min_x ln(Q(x)/P(x)) = −max_x ln(P(x)/Q(x)), so that DCD(Q, P) = DCD(P, Q).
6. Take any point x ∈ X; then x ∈ Gj for exactly one j. It follows that
$$Q(x) = \mu_jP(x|G_j) = \frac{\mu_j}{\lambda_j}P(x)$$
for the j such that x ∈ Gj. Therefore
$$D_{CD}(P,Q) = \max_x\ln\frac{P(x)}{Q(x)} - \min_x\ln\frac{P(x)}{Q(x)} = \ln\max_j\frac{\lambda_j}{\mu_j} - \ln\min_j\frac{\lambda_j}{\mu_j},$$
as required.
7. (a) With p0 = p1 = 1/2, q0 = 1 − θ and q1 = θ, for θ ≥ 1/2,
$$D_{CD}(Q,P) = \ln 2\theta - \ln 2(1-\theta) = \ln\frac{\theta}{1-\theta}.$$
Hence, let θ(k) denote the value of θ such that DCD(Q, P) = k; then
$$e^k = \frac{\theta}{1-\theta},$$
giving
$$\theta(k) = \frac{e^k}{1+e^k}.$$
Considering 0 ≤ θ ≤ 1/2 gives, for a Chan-Darwiche distance k,
$$\theta(k) = \frac{e^{-k}}{1+e^{-k}}.$$
(b)
$$D_{KL}(P\|Q) = \frac12\ln\frac{1}{2\theta(k)} + \frac12\ln\frac{1}{2(1-\theta(k))} = k,$$
so that
$$\theta(1-\theta) = \frac14e^{-2k},$$
$$\left(\theta - \frac12\right)^2 = \frac14\left(1 - e^{-2k}\right),$$
$$\theta(k) = \frac12 \pm \frac12\sqrt{1-e^{-2k}}.$$
8. Definition of convexity for a function ϕ: for any λ ∈ [0, 1] and any (x, y),
$$\phi(\lambda x + (1-\lambda)y) \le \lambda\phi(x) + (1-\lambda)\phi(y).$$
This is the case n = 2. Assume the result for n points and set µ = ∑_{j=1}^{n+1} pjxj. Then
$$\phi(\mu) = \phi\left(\sum_{j=1}^{n+1}p_jx_j\right) \le p_{n+1}\phi(x_{n+1}) + (1-p_{n+1})\phi\left(\sum_{j=1}^n\frac{p_j}{1-p_{n+1}}x_j\right)$$
and, by the inductive hypothesis (since ∑_{j=1}^n pj/(1 − p_{n+1}) = 1),
$$\phi\left(\sum_{j=1}^n\frac{p_j}{1-p_{n+1}}x_j\right) \le \sum_{j=1}^n\frac{p_j}{1-p_{n+1}}\phi(x_j),$$
so that
$$\phi(\mu) \le \sum_{j=1}^{n+1}\phi(x_j)p_j;$$
that is, ϕ(E[X]) ≤ E[ϕ(X)].
9. (a) The likelihood ratio between Q and P is well defined and is given by
$$LR(x) = \frac{Q(x)}{P(x)} = \prod_{i=1}^d\left(\frac{q_i}{p_i}\right)^{x_i}\left(\frac{1-q_i}{1-p_i}\right)^{1-x_i}.$$
Let m be the configuration defined by Equation (6.10), which takes mi = 1 if qi/pi ≥ (1 − qi)/(1 − pi) and mi = 0 otherwise. Then, for every x ∈ X,
$$LR(x) \le LR(m) = \prod_{i=1}^d\max\left(\frac{q_i}{p_i}, \frac{1-q_i}{1-p_i}\right). \tag{6.11}$$
Next let m̄ be the binary complement of m defined by Equation (6.10). That is, for each i ∈ {1, . . . , d}, m̄i = 1 − mi, giving m̄i = 0 if mi = 1 and m̄i = 1 if mi = 0. Then it holds that
$$LR(x) \ge LR(\bar m) = \prod_{i=1}^d\min\left(\frac{q_i}{p_i}, \frac{1-q_i}{1-p_i}\right). \tag{6.12}$$
It now follows from the definition of the Chan-Darwiche distance measure (Definition 6.11) that
$$D_{CD}(P,Q) = \ln LR(m) - \ln LR(\bar m) = \sum_{i=1}^d\ln\frac{\max\left(q_i/p_i,\ (1-q_i)/(1-p_i)\right)}{\min\left(q_i/p_i,\ (1-q_i)/(1-p_i)\right)}.$$
(b) With qi = q and pi = p for all i, so that OQ = q/(1 − q) and OP = p/(1 − p), each factor in the product reduces to max(OQ/OP, OP/OQ), giving
$$D_{CD}(P,Q) = d\ln\max\left(\frac{O_Q}{O_P}, \frac{O_P}{O_Q}\right).$$
If, say, OQ/OP > OP/OQ, then DCD(P, Q) = d(ln OQ − ln OP).
10. (a)
$$D_{CD}(P,Q) = \ln\max_i\frac{\lambda_i}{\mu_i} - \ln\min_i\frac{\lambda_i}{\mu_i} = \ln\frac{0.7}{0.3} - \ln\frac{0.05}{0.4} = 2.93.$$
(c) For Q*, the evidence is weaker and the bounds are therefore tighter.
(d) In this case,
$$\frac{0.214e^{-d}}{1+(e^{-d}-1)\times0.214} \le Q(c_g|s) \le \frac{0.214e^{d}}{1+(e^{d}-1)\times0.214}.$$
The constraint QC|S(cg|s) ≤ 0.3 is satisfied if
$$\frac{0.214e^{d}}{1+(e^{d}-1)\times0.214} \le 0.3,$$
giving d ≤ 0.454. The current distribution over colour is (µg, µb, µv) = (0.3, 0.3, 0.4). The problem now reduces to finding (λg, λb, λv) such that QC|S(cg|s) = 0.3 and
$$\ln\max\left(\frac{\lambda_g}{0.3}, \frac{\lambda_b}{0.3}, \frac{\lambda_v}{0.4}\right) - \ln\min\left(\frac{\lambda_g}{0.3}, \frac{\lambda_b}{0.3}, \frac{\lambda_v}{0.4}\right) = 0.454.$$
Since
$$Q_{C,S}(c_j, s) = \frac{\lambda_j}{\mu_j}P_{C,S}(c_j, s), \quad j = g, b, v,$$
it follows that
$$Q_{C|S}(c_g|s) = \frac{0.4\lambda_g}{0.4\lambda_g + 0.4\lambda_b + 0.8\lambda_v}.$$
Setting QC|S(cg|s) = 0.3 gives
$$0.28\lambda_g - 0.12\lambda_b - 0.24\lambda_v = 0,$$
and, using λg + λb + λv = 1 to eliminate λb, λv = (10λg − 3)/3, so that
$$0.454 = \ln\frac{\lambda_g}{0.3} - \ln\frac{10\lambda_g - 3}{1.2},$$
giving
$$\lambda_g = \frac{3e^{0.454}}{10e^{0.454} - 4} = 0.402,$$
with λv = 0.34 and λb = 0.258.
Chapter 7

Marginalisation, Triangulated Graphs and Junction Trees

7.1 Functions and Domains

Let V denote a set of nodes, indexed by Ṽ, where each node v carries a variable Xv with state space Xv. The notation
$$x_D = \times_{v\in D}x_v$$
is used to denote a configuration (or a collection of outcomes) on the nodes in D. Furthermore, for any set W ⊂ V, let W̃ denote the indexing set for W. The notation XW will also be used to denote XW̃, X W to denote X W̃ and xW to denote xW̃. Suppose D ⊆ W ⊆ Ṽ and that xW ∈ XW; that is, xW = ×v∈W xv. Then, ordering the variables of W so that XW = XD × XW/D, the projection of xW onto D is defined as the variable xD that satisfies
$$x_W = (x_D, x_{W/D}),$$
where the meaning of the notation `(, )' is clear from the context. Here A/B denotes the set difference; i.e. the elements in the set A not included in B.
Definition 7.1 (Function, Domain). Consider a function ϕ ∶ XD → R+. The space XD is known as the domain of the function. If the domain is the state space of a random vector X D, then X D may also be referred to as the domain of the function.
In this setting, a function over a domain XD has ∏_{j∈D} kj entries, where kj denotes the number of states of the variable Xj. For W ⊂ V, the domain of a function over XW may also be denoted by the collection of random variables W.
Addition, Multiplication, Division For functions defined on the same domain, addition, multiplication and division are defined pointwise where, by definition,
$$a(x) = 0,\ b(x) = 0 \implies \frac{a(x)}{b(x)} = 0.$$
Functions over different domains If the function ϕ1 is defined over the domain XD1 and the function ϕ2 is defined over the domain XD2, then multiplication and division of the functions may be defined by first extending both functions to the domain XD1∪D2.
Definition 7.2 (Extending the Domain). Let the function ϕ be defined on a domain XD, where D ⊂ W̃ ⊆ Ṽ. Then ϕ, defined over the domain XD, is extended to the domain XW̃ in the following way. For each xW̃ ∈ XW̃,
$$\phi(x_{\tilde W}) = \phi(x_D),$$
where xD is the projection of xW̃ onto XD, using the definition of xD (and hence xW̃) from the beginning of the section, page 141. In other words, the extended function depends on xW̃ only through xD.
Addition, Multiplication and Division of Functions over Different Domains Addition, multiplication and division of functions over different domains are defined by first extending the domains of definition using Definition 7.2, so that the functions are defined over the same domain, followed by standard pointwise addition, multiplication or division.
Multiplication of functions may be expressed in the following terms: the product ϕ1ϕ2 of functions ϕ1 and ϕ2, defined over domains XD1 and XD2, is defined as
$$(\phi_1\phi_2)(x_{D_1\cup D_2}) = \phi_1(x_{D_1})\,\phi_2(x_{D_2}),$$
where xD1 and xD2 denote the projections of xD1∪D2 onto XD1 and XD2 respectively.
Marginalisation For a function ϕ with domain XW and a subset U ⊆ W, the marginal of ϕ over U is defined as
$$\left(\sum_{W/U}\phi\right)(x_U) = \sum_{z\in X_{W/U}}\phi(z, x_U),$$
where the arguments have been rearranged so that those corresponding to W/U appear first, z ∈ XW/U is the projection of (z, xU) ∈ XW onto XW/U and xU ∈ XU is the projection of (z, xU) ∈ XW onto XU.
The following notation is also used for marginalising a function with domain XW .
$$\phi^{\downarrow U} = \left(\sum_{W/U}\phi\right).$$
The marginalisation operation obeys the following rules: for U ⊆ W′ ⊆ W,
$$\left(\phi^{\downarrow W'}\right)^{\downarrow U} = \phi^{\downarrow U};$$
that is, marginalisation may be carried out in stages, and the order in which the variables are summed out does not affect the result.
A collection of functions
Φ = {ϕ1, . . . , ϕm}
is known as a charge; its contraction is the product ϕ1⋯ϕm. The same notation is often used to denote the contraction of a charge and the set of functions (the charge). The context makes it clear which is intended.
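These operations are straightforward to implement. The following is a minimal sketch (an illustrative implementation, not the book's; it assumes binary variables and represents a function as a dictionary from configurations of an ordered domain to values) of domain extension, multiplication over different domains and marginalisation.

```python
# Sketch: functions with domains (Section 7.1) -- extension, multiplication
# and marginalisation, for binary variables.
from itertools import product

def extend(phi, dom, new_dom):
    """Extend phi (a dict over configurations of dom) to new_dom, which
    contains dom; the extended function depends only on the dom coordinates."""
    idx = [new_dom.index(v) for v in dom]
    return {x: phi[tuple(x[i] for i in idx)]
            for x in product((0, 1), repeat=len(new_dom))}

def multiply(phi1, dom1, phi2, dom2):
    """Multiply after extending both factors to the union of the domains."""
    dom = list(dict.fromkeys(dom1 + dom2))          # ordered union
    e1, e2 = extend(phi1, dom1, dom), extend(phi2, dom2, dom)
    return {x: e1[x] * e2[x] for x in e1}, dom

def marginalise(phi, dom, keep):
    """phi^{down U}: sum out every variable of dom not in keep."""
    idx = [dom.index(v) for v in keep]
    out = {}
    for x, val in phi.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + val
    return out, list(keep)

# Example: two factors with overlapping domains.
phi1 = {(0,): 0.3, (1,): 0.7}                                # domain (X1,)
phi2 = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}  # domain (X1, X2)
prod, dom = multiply(phi1, ['X1'], phi2, ['X1', 'X2'])
m, _ = marginalise(prod, dom, ['X2'])
print(m)   # the margin over X2: {(0,): 0.41, (1,): 0.59}
```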
Probability function factorised along a DAG The joint probability function pX1,...,Xd is itself a function, with domain X. If the joint probability function may be factorised according to a DAG G = (V, D), the decomposition is written as
$$p_{X_1,\dots,X_d} = \prod_{j=1}^dp_{X_j|\Pi_j}.$$
Then, for each j = 1, . . . , d, the function ϕj defined by ϕj = pXj|Πj has domain XDj = Xj × XΠ̃j, where Dj = {j} ∪ Π̃j.
Example 7.4.
Consider a probability function over six variables that may be factorised along the directed acyclic graph in Figure 7.1. The functions corresponding to the conditional probabilities are
$$\phi_1 = p_{X_1},\ \phi_2 = p_{X_2|X_1},\ \phi_3 = p_{X_3|X_1},\ \phi_4 = p_{X_4|X_2},\ \phi_5 = p_{X_5|X_2,X_3},\ \phi_6 = p_{X_6|X_3}.$$
[Figure 7.1: A directed acyclic graph over X1, . . . , X6, with edges X1 → X2, X1 → X3, X2 → X4, X2 → X5, X3 → X5 and X3 → X6.]
Definition 7.5 (Domain Graph). The domain graph for the set of functions in Φ is an undirected graph with the variables as nodes and with links between any pair of variables that are members of the same domain.
Figure 7.2 illustrates the domain graph associated with the DAG of Figure 7.1. The domain graph of a DAG is the moral graph, Definition 5.4. The maximal cliques of the moral graph are illustrated in Figure 7.3.
[Figure 7.2: The moral graph (domain graph) of the DAG in Figure 7.1.]
It is clear that the domain graph of a Bayesian network is the moral graph, since by definition all the parents are connected to each other and to the variable.
[Figure 7.3: The maximal cliques of the moral graph: {X2, X4}, {X1, X2, X3}, {X2, X3, X5} and {X3, X6}.]
The Distributive Law If the variables XA do not belong to the domain of ϕ1, then
$$\sum_{X_A}\phi_1\phi_2 = \phi_1\sum_{X_A}\phi_2.$$
In coordinates, let ϕ1 have domain XD1∪D3 and ϕ2 domain XD2∪D3∪D4, where D1, D2, D3 and D4 are disjoint. By the distributive law, the marginalisation over XD2 may be written as
$$\sum_{X_{D_2}}\phi_1\phi_2 = \phi_1\sum_{X_{D_2}}\phi_2.$$
The function over XD2 × XD3 × XD4 is first marginalised down to a function over XD3 × XD4. This function is transmitted to the function over XD1 × XD3, with which it is multiplied. The domains of the two functions to be multiplied have to be extended to XD1 × XD3 × XD4. Using X1, X2, X3, X4 to denote the associated domains XD1, XD2, XD3 and XD4, the domains under consideration for the operations are illustrated in Figure 7.4. First, the function ϕ2, defined over (X2, X3, X4), is marginalised to a function over (X3, X4); this is then extended, by multiplying with ϕ1, to a function over (X1, X3, X4).
[Figure 7.4: The message ∑_{X2} ϕ2, with domain (X3, X4), passed from (X2, X3, X4) to (X1, X3, X4).]
Consider the computation for marginalising a contraction of a charge Φ defined over a state space X = X1 × X2 × ⋯ × X6, where
$$\Phi^{\downarrow\emptyset} = \sum_{x\in\mathcal X}\Phi(x)$$
and the notation Φ↓U is defined on page 142. With the order of summation x2, x4, x6, x5, x3, x1, the sum may be written taking sums from right to left. The computation, carried out in this order (right to left), may be represented by the graph in Figure 7.5; a computational tree, according to the distributive law, is given in Figure 7.6.
[Figure 7.5: The domain graph for the computation, over X1, . . . , X6.]
[Figure 7.6: A computational tree for the marginalisation, according to the distributive law, with nodes (X1, X2), (X5, X6), (X1, X3, X5) and (X3, X4).]
Recall (page 142) that the operation Φ↓U (x) means marginalising Φ over all variables not in the set U .
Definition 7.7 (Elimination of a Variable). The variable Xv, with index v ∈ W̃ = Ṽ/Ũ, is eliminated from ∑_{xV/U∈XV/U} Φ(xV/U, xU) by the following procedure, where contraction means multiplying together all the functions in the charge.
1. Let Φv (or ΦXv) denote the contraction of the functions in Φ that have Xv in their domain; that is,
$$\Phi_v = \prod_{j\,\mid\,v\in D_j}\phi_j.$$
2. Compute the function
$$\phi^{(v)} = \sum_{x_v\in X_v}\Phi_v,$$
whose domain no longer contains Xv.
3. Set Φ^{−Xv} = {ϕj ∶ v ∉ Dj} ∪ {ϕ^{(v)}}.
Those functions that do not contain Xv in their domain have been retained; the others have been multiplied together and then marginalised over Xv (thus eliminating the variable) to give ϕ(v). This function has been added to the collection, and all those containing Xv (other than ϕ(v)) have been removed.
(Note that the notation Φ−Xv has two meanings: it is used to denote the collection of functions, and it is also used to denote the contraction of the charge obtained by multiplying together the functions in the collection. The meaning is determined by the context.) Having removed Xv, it remains to compute the marginal of the contraction of Φ−Xv over the remaining variables.
The quantity Φ↓U can be computed through successive elimination of the variables Xv ∈ W/U. The task, of course, is to find a sequence for marginalising the variables such that, at each stage, the variable is eliminated from as small a domain as possible. The procedure outlined above may be considered graphically in terms of undirected graphs and their triangulations.
Definition 7.8 (Complete Graph, Complete Subset). A graph G is complete (or a clique) if every pair of nodes is joined by an undirected edge. That is, for each (α, β) ∈ V × V with α ≠ β, (α, β) ∈ E and (β, α) ∈ E. In other words, ⟨α, β⟩ ∈ U, where U denotes the set of undirected edges. A subset of nodes is called complete if it induces a complete sub graph.
Definition 7.9 (Maximal Clique). A maximal clique is a complete sub graph that is maximal with respect to ⊆. In other words, a maximal clique is not a sub graph of any other complete graph.
Definition 7.10 (Simplicial Node). Recall the definition of family, found in Definition 1.2. For an undirected graph, the family of a node β is F(β) = {β} ∪ N(β), where N(β) denotes the set of neighbours of β. A node β in an undirected graph is called simplicial if its family F(β) is a maximal clique.
This means that, in an undirected graph, a node β is simplicial if all its neighbours are neighbours of each other.
Definition 7.11 (Connectedness, Strong Components). Let G = (V, E) be a simple graph, where E = U ∪ D. That is, E may contain both directed and undirected edges. Let α → β denote that there is a path (Definition 1.7) from α to β. If there is both α → β and β → α, then α and β are said to be connected. This is written:
α ↔ β.
This is clearly an equivalence relation. The equivalence class for α is denoted by [α]. In other words, β ∈ [α] if and only if β ↔ α. These equivalence classes are called the strong components of G.
Note that a graph is connected if between any two nodes there exists a trail (Definition 1.6), but two nodes α and β are only said to be connected if there is a path from α to β and a path from β to α, where the definition of a `path' is given in Definition 1.7.
Definition 7.12 (Chord). Let G = (V, E) be a graph. Let σ be an n-cycle in G. A chord of this cycle is a pair (αi, αj) of non-consecutive nodes in σ such that αi ∼ αj in G.
Definition 7.13 (Triangulated). An undirected graph is said to be triangulated if every cycle of length ≥ 4 has a chord.
Lemma 7.14. If G = (V, E) is triangulated, then for any A ⊆ V the induced graph GA is also triangulated.
Proof Consider any cycle of length ≥ 4 in the restricted graph. All the edges connecting these nodes remain. If the cycle possessed a chord in the original graph, the chord remains in the restricted graph.
Definition 7.15 (Separator). Let G = (V, E) be a graph. Let α, β ∈ V be two nodes. A subset S ⊆ V is called an (α, β) separator if every trail between α and β has at least one node in S. Let A ⊂ V, B ⊂ V. A set S ⊂ V separates A and B if it is an (α, β) separator for each (α, β) ∈ A × B. A and B are then said to be separated by S. The notation used in this text is A ⊥ B ∥ S.
Definition 7.17 (Decomposition). A triple (A, B, S) of disjoint subsets of the node set V of an undirected graph G is said to form a decomposition of G if
V = A ∪ B ∪ S,
S separates A from B,
and S is a complete subset of V.
A, B or S may be the empty set. If both A and B are non-empty, then the decomposition is proper. A triple (A, B, S) of disjoint subsets of the node set V of an undirected graph is said to form a weak decomposition of G, or to weakly decompose G, if V = A ∪ B ∪ S and S separates A from B.
A weak decomposition differs from a decomposition in that the separator set S is not necessarily complete. Clearly, every graph can be decomposed into its connected components (Definition 1.6). If the graph is undirected, then the connected components are the strong components (Definition 7.11).
An undirected graph G is said to be decomposable if either:
1. it is complete, or
2. it possesses a proper decomposition (A, B, S) such that both sub graphs GA∪S and GB∪S are decomposable.
This is a recursive definition, which is permissible, since the decomposition (A, B, S) is required to be proper, so that GA∪S and GB∪S have fewer nodes than the original graph G.
Consider the graph in Figure 7.7. In the first stage, set S = {α3}, with A = {α1, α2} and B = {α4, α5, α6}. Then S is a complete subset and S separates A from B. Then A ∪ S = {α1, α2, α3} and GA∪S is complete. B ∪ S = {α3, α4, α5, α6}. The graph GB∪S is decomposable; take S2 = {α3, α5}, A2 = {α4} and B2 = {α6}. Then GA2∪S2 and GB2∪S2 are complete.
Theorem 7.20. Let G = (V, U) be an undirected graph. The following conditions 1), 2) and 3) are equivalent.
1. G is decomposable.
2. G is triangulated.
3. For every pair of distinct nodes (α, β) ∈ V × V, every minimal (α, β) separator is complete.
[Figure 7.7: An undirected graph with maximal cliques {α1, α2, α3}, {α3, α4, α5} and {α3, α5, α6}.]
Proof of Theorem 7.20: 3) ⟹ 1) If G is complete, then the result is clear. If G is not complete, then choose two distinct nodes (α, β) ∈ V × V that are not adjacent. Let S ⊆ V/{α, β} denote the minimal separator for the pair (α, β). Let A denote the node set of the connected component of GV/S containing α and let B = V/(A ∪ S). Then (A, B, S) provides three disjoint subsets, where S is complete. We have to show that GA∪S and GB∪S are decomposable. The procedure can be repeated on both GA∪S and GB∪S and repeated recursively, stopping when GA′∪S′ is complete for a set A′ and corresponding separator S′; hence the graph is decomposable.
Definition 7.21 (Perfect Node Elimination Sequence). Let V = {α1, . . . , αd} denote the node set of a graph G. A perfect node elimination sequence of a graph G is an ordering of the node set {α1, . . . , αd} such that for each j in 1 ≤ j ≤ d − 1, αj is a simplicial node of the sub graph of G induced by {αj, αj+1, . . . , αd}.
Lemma 7.22. Every triangulated graph G has a simplicial node. Moreover, if G is not complete, then it has two non-adjacent simplicial nodes.
Proof The lemma is trivial if either G is complete, or G has two or three nodes. Assume that G is not complete. Suppose the result is true for all graphs with fewer nodes than G. Consider two non-adjacent nodes α and β. Let S denote the minimal separator of α and β. Let A denote the node set of the connected component of GV/S containing α and let B = V/(A ∪ S), so that β ∈ B.
By induction, either GA∪S is complete, or else it has two non-adjacent simplicial nodes. Since GS is complete, it follows that at least one of the two simplicial nodes is in A. Such a node is therefore also simplicial in G, because none of its neighbours is in B.
If GA∪S is complete, then any node of A is a simplicial node of G.
In all cases, there is a simplicial node of G in A. Similarly, there is a simplicial node in B. These two nodes are then non-adjacent simplicial nodes of G.
Theorem 7.23. A graph G is triangulated if and only if it has a perfect node elimination sequence.
Proof Suppose that G is triangulated. Assume that every triangulated graph with fewer nodes than G has a perfect elimination sequence. By the previous lemma, G has a simplicial node α. Removing α returns a triangulated graph. (Consider any cycle of length ≥ 4 with a chord. If the cycle remains after the node is removed, then the chord is not removed.) By proceeding inductively, it follows that G has a perfect elimination sequence.
Conversely, assume that G has a perfect sequence, say {α1, . . . , αd}. Consider any cycle of length ≥ 4. Let j be the first index such that αj is in the cycle. Let V(C) denote the node set of the cycle and let Vj = {αj, . . . , αd}. Then V(C) ⊆ Vj. Since αj is simplicial in GVj+1, the neighbours of αj in the cycle are adjacent, hence the cycle has a chord. Therefore G is triangulated.
Definition 7.24 (Eliminating a Node). Let G = (V, E) be an undirected graph. A node α is eliminated from an undirected graph G in the following way:
1. For all pairs of neighbours (β, γ) of α, add a link if G does not already contain one. The added links are called fill-ins.
2. Remove α.
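Node elimination and the characterisation of Theorem 7.23 may be sketched as follows (an illustrative implementation, not the book's; graphs are assumed to be stored as dictionaries of neighbour sets). A graph is triangulated precisely when repeatedly eliminating a simplicial node, which never generates fill-ins, empties the graph.

```python
# Sketch: eliminating a node (Definition 7.24) and testing triangulation
# via a perfect elimination sequence (Theorem 7.23 / Lemma 7.22).
from itertools import combinations

def eliminate(G, v):
    """Eliminate v: connect all pairs of its neighbours, then remove v."""
    fill_ins = [(a, b) for a, b in combinations(sorted(G[v]), 2)
                if b not in G[a]]
    for a, b in fill_ins:
        G[a].add(b); G[b].add(a)
    for u in G[v]:
        G[u].discard(v)
    del G[v]
    return fill_ins

def is_triangulated(G):
    G = {v: set(nbrs) for v, nbrs in G.items()}      # work on a copy
    while G:
        simplicial = [v for v in G
                      if all(b in G[a] for a, b in combinations(G[v], 2))]
        if not simplicial:
            return False
        eliminate(G, simplicial[0])                  # no fill-ins needed
    return True

# A 4-cycle without a chord is not triangulated; adding a chord fixes it.
C4 = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
print(is_triangulated(C4))                           # False
C4[1].add(3); C4[3].add(1)
print(is_triangulated(C4))                           # True
```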
[Three figures: an undirected graph over α1, . . . , α6, and the graphs obtained from it by successively eliminating nodes.]
Given an elimination sequence σ for the nodes of G, let G^σ denote the graph obtained from G by adding all of the fill-ins generated when the nodes are eliminated in the order σ. In G^σ, any node α together with its neighbours of higher elimination order forms a complete subset. The neighbours of α of higher elimination order are denoted by Nσ(α). The sets {α} ∪ Nσ(α) are the elimination domains corresponding to the elimination sequence σ.
An efficient algorithm clearly tries to minimise the number of fill-ins. If possible, one should find an elimination sequence that does not introduce fill-ins.
Every maximal clique of G^σ is an elimination domain; that is, it is of the form {α} ∪ Nσ(α) for some node α.
Proof Let C be a maximal clique in G^σ and let α be the variable in C of the lowest elimination order. Then C = {α} ∪ Nσ(α).
An efficient algorithm ought to find an elimination sequence for the domain graph that yields maximal cliques of minimal total size. For any elimination sequence σ, the graph G^σ is triangulated.
Proof By construction, the elimination sequence σ for the graph G^σ does not require any fill-ins.
Recall that a graph is triangulated if and only if it has an elimination sequence without fill-ins. This is equivalent to the statement that an undirected graph is triangulated if and only if all nodes can be eliminated by successively eliminating a node α such that the family Fα = {α} ∪ Nα is complete. From the definition, such a node α is a simplicial node.
The maximal cliques of the domain graph contain the domains of the functions of the charge. Recall Definition 7.7, which describes the procedure for eliminating a variable in a marginalisation. When a node Xv is eliminated from the graph G, the resulting graph is denoted by G−Xv. Graphically, the procedure described in Definition 7.7 is the same as that of Definition 7.24, eliminating a node. If G is the domain graph for a set of functions Φ, then it is clear from Definition 7.24 that the graph G−Xv is the domain graph for the set of functions Φ−Xv. Therefore, if the domain graph is triangulated, there is a perfect elimination sequence; that is, an order for eliminating the variables such that, at each stage, the elimination domain corresponds to a maximal clique in the current domain graph.
Definition 7.30 (Junction Trees). Let C be a collection of subsets of a finite set V and let T be a tree with C as its node set. Then T is said to be a junction tree (or join tree) if any intersection C1 ∩ C2 of a pair C1, C2 of sets in C is contained in every node on the unique path in T between C1 and C2. Let G be an undirected graph and C the family of its maximal cliques. If T is a junction tree with C as its node set, then T is known as a junction tree for the graph G.
Theorem 7.31. There exists a junction tree T of maximal cliques for the graph G if and only if G is decomposable.
Proof Firstly, we prove that if the graph is decomposable, then there exists a junction tree of the maximal cliques. The proof is by construction; a sequence is established in the following way. Firstly, a simplicial node α is chosen; Fα is therefore a maximal clique. The algorithm continues by choosing nodes from Fα that only have neighbours in Fα. The set of nodes Fα is labelled C1 and the set of those nodes in Fα that have neighbours not in Fα is labelled S1. This set is a separator.
Now remove the nodes in Fα that do not have neighbours outside Fα and name the new graph G′. Choose a new node α in the graph G′ such that Fα is a maximal clique. Repeat the process, with the index j, where j is the previous index plus 1.
When the parts have been established, each separator Si is then connected to a maximal clique Cj with j > i such that Si ⊂ Cj. This is always possible, because Si is a complete set and, in the elimination sequence described above, the first node of Si is eliminated when dealing with a maximal clique of index greater than i.
It is necessary to prove that the structure constructed is a tree and that it has the junction tree property.
Firstly, each maximal clique has at most one parent, so there are no multiple paths. The structure is therefore a tree.
To prove the junction tree condition, consider two maximal cliques Ci and Cj with i > j, and let α be a member of both. There is a unique path between Ci and Cj.
Because α is not eliminated when dealing with Cj, it is a member of Sj. By construction, it is also a member of the child of Cj, say Ck. Arguing similarly, it is also a member of the child of Ck and, by induction, it is a member of Ci and, of course, of all the separators in between.
The converse is trivial; if the maximal cliques can be arranged as a junction tree, then a perfect elimination sequence may be constructed as follows: take a simplicial node from a maximal clique which is a leaf of the junction tree and remove it. If this is not the only simplicial node in the chosen maximal clique, the maximal clique remains as a leaf of the junction tree; otherwise the maximal clique is removed from the junction tree, and the resulting maximal clique tree is a junction tree. Hence there is a perfect elimination sequence, hence the graph is triangulated (and decomposable).
Example 7.32.
Consider the directed acyclic graph in Figure 7.11. The corresponding moral graph is given in Figure 7.12.
[Figure 7.11: A directed acyclic graph over the nodes α1, . . . , α9.]
Consider the elimination sequence
(α8, α7, α4, α9, α2, α3, α1, α5, α6).
There are two fill-ins; these are ⟨α1, α5⟩, corresponding to the elimination of α2, and ⟨α1, α6⟩, corresponding to the elimination of α3. The corresponding triangulated graph is given in Figure 7.13.
The junction tree construction may be applied. The maximal cliques and separators, with the labels resulting from the diagram, are shown in Figure 7.14 and put together to form the junction tree, or join tree, shown in Figure 7.15.
Later, when using the algorithm for updating, it will be useful to designate one node as the root.
[Figure 7.12: The moral graph of the DAG in Figure 7.11.]
[Figure 7.13: The triangulated graph, with fill-ins ⟨α1, α5⟩ and ⟨α1, α6⟩.]
Definition 7.33 (Rooted Tree). A rooted tree T is a tree graph with a designated node ρ called the root. A leaf of a tree is a node that is joined to at most one other node.
Definition 7.34 (Running Intersection Property). C is said to have the running intersection property (r.i.p.) if there is an order σ of {1, . . . , n} such that for each j ≥ 2 there is an l such that σ(l) < σ(j) and
$$C_{\sigma(j)} \cap \left(\bigcup_{i=1}^{j-1}C_{\sigma(i)}\right) \subseteq C_{\sigma(j)}\cap C_{\sigma(l)}. \tag{7.1}$$
An order of the maximal cliques that satisfies r.i.p. is said to be a perfect order of the maximal cliques.
[Figure 7.14: The maximal cliques and separators from Figure 7.13: C1 = {α4, α7, α8}, S1 = {α4, α7}; C2 = {α4, α6, α7}, S2 = {α4, α6}; C3 = {α3, α4, α6}, S3 = {α3, α6}; C4 = {α5, α6, α9}, S4 = {α5, α6}; C5 = {α1, α2, α5}, S5 = {α1, α5}; C6 = {α1, α3, α6}, S6 = {α1, α6}.]
Theorem 7.35. For an undirected graph G = (V, U) with maximal cliques C = {C1, . . . , Cn}, there exists a perfect order of the maximal cliques if and only if G is triangulated. Furthermore, for any such order, the tree constructed by adding the edges σ(j) ∼ σ(l(j)), where for each j ≥ 2 a single l(j) ∈ {1, . . . , j − 1} is chosen such that
$$C_{\sigma(j)} \cap \left(\bigcup_{i=1}^{j-1}C_{\sigma(i)}\right) \subseteq C_{\sigma(j)}\cap C_{\sigma(l(j))} \tag{7.2}$$
holds, is a junction tree.
Proof The graph is triangulated if and only if the maximal cliques can be arranged as a junction tree. If there is a perfect order of the maximal cliques, then clearly the method described for constructing a tree from these maximal cliques (an edge between Cσ(j) and the maximal clique Cσ(l(j)) such that (7.2) holds) gives the junction tree property; namely, that for any two cliques Cα, Cβ, the intersection Cα ∩ Cβ is contained in each separator on the unique path Cα ↔ Cβ in the tree. On the other hand, if there is a junction tree, then we may choose arbitrarily one node as root, call it σ(1), and then proceed by choosing σ(j) as any neighbour of σ(1), . . . , σ(j − 1) that has not yet appeared in the order. This order of the maximal cliques satisfies the r.i.p.
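The construction in Theorem 7.35 can be sketched directly. In the code below the clique list is taken from Example 7.32 (Figures 7.14 and 7.15); the seventh clique {α1, α5, α6} is inferred from Figure 7.15, and it is assumed that the cliques are supplied in a perfect order.

```python
# Sketch: building a junction tree from a perfect order of maximal cliques
# by attaching each clique to an earlier one containing its intersection
# with all predecessors (Equation (7.2)).
cliques = [frozenset(c) for c in (
    {'a4', 'a7', 'a8'}, {'a4', 'a6', 'a7'}, {'a3', 'a4', 'a6'},
    {'a1', 'a3', 'a6'}, {'a1', 'a5', 'a6'}, {'a1', 'a2', 'a5'},
    {'a5', 'a6', 'a9'})]

edges = []
for j in range(1, len(cliques)):
    prev_union = frozenset().union(*cliques[:j])
    S = cliques[j] & prev_union
    # Running intersection property: S sits inside a single earlier clique.
    l = next(i for i in range(j) if S <= cliques[i])
    edges.append((sorted(cliques[l]), sorted(cliques[j]), sorted(S)))

for cl, cj, s in edges:
    print(cl, '--', s, '--', cj)   # clique -- separator -- clique
```

Running the sketch reproduces the separators S1, . . . , S6 of Figure 7.14.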
[Figure 7.15: A junction tree (or join tree) constructed from the triangulated graph in Figure 7.13, with nodes C1, . . . , C7 and separators S1, . . . , S6.]
[Figure: A rooted tree, with the root at the top, internal nodes γ and δ, and leaves at the bottom.]
Notes The material is standard in algorithmic graph theory. See, for example, [55]. The proof of Theorem 7.20 follows the lines of Cowell, Dawid, Lauritzen and Spiegelhalter in [32].
Chapter 8

Junction Trees and Message Passing

The task is to describe a scheme of message passing (propagation) between the maximal cliques of a junction tree to compute the marginal distribution over a set of variables A ⊂ V/E, given hard evidence {X E = xE} on a set of variables E:
$$P_{V/E|E}(x_{V/E}|x_E)^{\downarrow A} = \sum_{x_{V/(A\cup E)}\in X_{V/(A\cup E)}}P_{V/E|E}(x_A, x_{V/(A\cup E)}|x_E).$$
The message passing algorithm described here is the one used by the R packages gRain and bnlearn, and also by many other software programmes that deal with Bayesian networks; the algorithm is based on representing the joint distribution of a Bayesian network using the so-called Aalborg formula
$$P_X(x_1,\dots,x_d) = \frac{\prod_{C\in\mathcal C}\phi_C(x_C)}{\prod_{S\in\mathcal S}\phi_S(x_S)}$$
(which will be established later in this section), where C denotes the set of maximal cliques and S the set of separators of a junction tree.
Definition 8.1. A joint probability PX over a random vector X = (X1, . . . , Xd) is said to be factorised according to G if there exist functions, or factors, ϕA defined on ×v∈Ã Xv, where A is a complete set of nodes in G, such that
$$P_X(x) = \prod_A\phi_A(x_A),$$
where the notation is clear (see Section 7.1, page 141); the product is over all the functions.
Recall Definition 7.15 of a separator and Definition 7.17 of a decomposition. In the definition, A, B or S may be the empty set ∅.
Proposition 8.2. Let G be a decomposable undirected graph and let (A, B, S) decompose G. Then the following two statements are equivalent:
1. P factorises according to G;
2. both PA∪S and PB∪S factorise along GA∪S and GB∪S respectively, and
$$P(x) = \frac{P_{A\cup S}(x_{A\cup S})\,P_{B\cup S}(x_{B\cup S})}{P_S(x_S)}.$$
Proof of 1) ⟹ 2) Since the graph is decomposable, its maximal cliques can be organised as a junction tree. Hence, without loss of generality, the factorisation can be taken to be of the form
$$P(x) = \prod_{C\in\mathcal C}\phi_C(x_C),$$
where the product is over the maximal cliques of G. Since (A, B, S) decomposes G, any maximal clique of G can either be taken as a subset of A ∪ S or as a subset of B ∪ S. Furthermore, S is a strict subset of any maximal clique of A ∪ S containing S, and S is a strict subset of any maximal clique of B ∪ S containing S. Letting C denote a maximal clique, it follows that either C ⊆ A ∪ S or C ⊆ B ∪ S. Since S is itself complete, it is a subset of a maximal clique containing S, so that no maximal clique in the decomposition will appear in both A ∪ S and B ∪ S. Set
$$h(x_{A\cup S}) = \prod_{C\subseteq A\cup S}\phi_C(x_C), \qquad k(x_{B\cup S}) = \prod_{C\subseteq B\cup S}\phi_C(x_C).$$
Then
$$P(x) = h(x_{A\cup S})\,k(x_{B\cup S})$$
and
$$P_{A\cup S}(x_{A\cup S}) = h(x_{A\cup S})\,k_S(x_S), \qquad P_{B\cup S}(x_{B\cup S}) = h_S(x_S)\,k(x_{B\cup S}),$$
where kS is defined as kS(xS) = ∑_{X_B} k(x_{B∪S}) and hS(xS) = ∑_{X_A} h(x_{A∪S}). It follows that
$$P(x) = h(x_{A\cup S})k(x_{B\cup S}) = \frac{P_{A\cup S}(x_{A\cup S})\,P_{B\cup S}(x_{B\cup S})}{k_S(x_S)\,h_S(x_S)}.$$
Since
$$P_S(x_S) = \sum_{X_A}\sum_{X_B}P(x) = h_S(x_S)\,k_S(x_S),$$
it follows that
$$P(x) = \frac{P_{A\cup S}(x_{A\cup S})\,P_{B\cup S}(x_{B\cup S})}{P_S(x_S)}.$$
Proof of 2) ⟹ 1) If both PA∪S and PB∪S factorise along GA∪S and GB∪S respectively and the given formula holds, then
$$P(x) = \frac{1}{P_S(x_S)}\prod_{C\subseteq A\cup S}\phi_C(x_C)\prod_{C\subseteq B\cup S}\phi_C(x_C). \tag{8.1}$$
For one maximal clique C satisfying C ⊆ A ∪ S and C ∩ S ≠ ∅, set ψC = ϕC/PS; for all other C, set ψC = ϕC. Then
$$P(x) = \prod_{C\subseteq V}\psi_C(x_C),$$
which is a factorisation according to G. Since PA∪S = ∏_{C⊆A∪S} ϕC and PB∪S = ∏_{C⊆B∪S} ϕC in Equation (8.1), it follows by a recursive application of the proposition that
$$P(x) = \frac{\prod_{C\in\mathcal C}P_C(x_C)}{\prod_{S\in\mathcal S}P_S(x_S)},$$
where C denotes the set of maximal cliques and S denotes the set of separators.
Now consider a probability distribution PX factorised along a DAG G:
$$P_X(x) = \prod_{v=1}^dP_{X_v|\Pi_v}(x_v|\pi_v(x)),$$
where πv(x) denotes the parent configuration of Xv for an instantiation x. It is clear that this may be expressed as a factorisation according to the moralised graph G^mor, which is undirected:
$$P_X(x) = \prod_{v=1}^d\phi_{A_v}(x_{A_v}),$$
where Av = {v} ∪ Π̃v; by the construction of the moral graph, each Av is a complete set of nodes in G^mor. Hence a probability distribution factorised along the DAG is also factorised along the moral graph G^mor. For implementing algorithms, the problem is that it may not be possible to represent the sets (Av)_{v=1}^d on a tree. To enable this, G^mor is triangulated to give (G^mor)^t. Recall that (G^mor)^t is decomposable and its maximal cliques can be organised into a junction tree T. The probability distribution can clearly be factorised as
$$P_X(x) = \prod_{C\in\mathcal C}\phi_C(x_C),$$
where ϕC(xC) is the product of all those P(xv|xΠ̃v) all of whose arguments belong to C. This factorisation is not necessarily unique. It corresponds to a triangulation of the moral graph, where C are the maximal cliques. It follows that
$$P_X(x) = \frac{\prod_{C\in\mathcal C}P_C(x_C)}{\prod_{S\in\mathcal S}P_S(x_S)}, \tag{8.2}$$
where C denotes the set of maximal cliques and S denotes the set of separators of (G^mor)^t, which may be organised according to a junction tree. This is the definition of a factorisation along a junction tree.
Definition 8.3 (Factorisation along a Junction Tree, Marginal Charge). Let PX be a probability distribution over a random vector X = (X1, . . . , Xd). Suppose that the variables can be organised as a junction tree, with maximal cliques C and separators S, such that PX has the representation given in Equation (8.2), where PC and PS denote the marginal probability functions over the maximal clique variables C ∈ C and separator variables S ∈ S respectively. The representation in Equation (8.2) is known as the factorisation along the junction tree, and the charge
Φ = {PS ∶ S ∈ S, PC ∶ C ∈ C}
is known as the marginal charge.
From the foregoing discussion, it is clear that Definition 8.3 is a special case of Definition 8.1, with an appropriate choice of functions in Definition 8.1.
Entering Evidence Equation (8.2) expresses the probability distribution in terms of functions over the maximal cliques and separators of (G^mor)^t, or the junction tree. Suppose that hard evidence is obtained on the variables U; namely, that for U ⊆ V, {X U = y U}, and the probability over the variables V/U has to be updated accordingly.
The algorithm described below describes a procedure such that, for any function f ∶ X → R+ (not necessarily a probability function) that is expressed as
$$f(x) = \frac{\prod_{C\in\mathcal C}\phi_C(x_C)}{\prod_{S\in\mathcal S}\phi_S(x_S)} \tag{8.3}$$
for a collection of functions Φ = {ϕC, C ∈ C, ϕS, S ∈ S}, where C and S are the maximal cliques and separators of a junction tree, the algorithm updates Φ to a collection of functions Φ* = {fC, C ∈ C, fS, S ∈ S} that satisfy
$$f_C(x_C) = \sum_{z\in X_{V/C}}f(z, x_C) \qquad\text{and}\qquad f_S(x_S) = \sum_{z\in X_{V/S}}f(z, x_S)$$
for each C ∈ C and each S ∈ S. It follows that if the algorithm is applied using
$$\phi_C(x_C) = \begin{cases}P_C(x_C) & x_{C\cap U} = y_{C\cap U}\\ 0 & x_{C\cap U}\ne y_{C\cap U}\end{cases}
\qquad\text{and}\qquad
\phi_S(x_S) = \begin{cases}P_S(x_S) & x_{S\cap U} = y_{S\cap U}\\ 0 & x_{S\cap U}\ne y_{S\cap U},\end{cases}$$
then
$$P_{X_U}(y_U) = \sum_{z\in X_{C/(U\cap C)}}f_C(z, y_{U\cap C}) = \sum_{z\in X_{S/(U\cap S)}}f_S(z, y_{U\cap S})$$
for all S ∈ S and all C ∈ C. This quantity may therefore be computed by marginalising the maximal clique or separator with the smallest domain; dividing fC and fS by this quantity gives PC|U(.|y U) and PS|U(.|y U) respectively, and hence a representation of the conditional distribution in terms of marginal distributions over the maximal cliques and separators.
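Entering hard evidence amounts to zeroing the incompatible entries of each table. A minimal sketch (tables as dictionaries over configurations of an ordered domain are an assumption of this illustration):

```python
# Sketch: entering hard evidence {X_U = y_U} into a clique table by
# zeroing every entry that disagrees with y on the evidence variables.
def enter_evidence(phi, dom, evidence):
    """Set phi(x) = 0 whenever x_{C ∩ U} != y_{C ∩ U}."""
    return {x: (val if all(x[dom.index(v)] == y
                           for v, y in evidence.items() if v in dom) else 0.0)
            for x, val in phi.items()}

phi_C = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.1}
print(enter_evidence(phi_C, ['X1', 'X2'], {'X2': 1}))
# {(0, 0): 0.0, (0, 1): 0.3, (1, 0): 0.0, (1, 1): 0.1}
```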
Consider first a function F ∶ X × Y × Z → R+ of the form
$$F(x,y,z) = \frac{f(x,z)\,g(y,z)}{h(z)}, \tag{8.4}$$
for non-negative functions f ∶ X × Z → R+, g ∶ Y × Z → R+ and h ∶ Z → R+.
Decomposition (8.4) for the function F is of the form given in Equation (8.3), with maximal cliques C1 = {X, Z}, C2 = {Z, Y} and separator S = {Z}, arranged according to the junction tree in Figure 8.1.
[Figure 8.1: The junction tree with nodes XZ and YZ and separator Z.]
The following procedure returns a representation
$$F(x,y,z) = \frac{F_1(x,z)\,F_2(y,z)}{F_3(z)},$$
where F1, F2 and F3 are the margins of F over (X, Z), (Y, Z) and Z respectively.
Firstly,
$$F_1(x,z) = \sum_{y\in Y}F(x,y,z) = \sum_{y\in Y}\frac{f(x,z)g(y,z)}{h(z)} = \frac{f(x,z)}{h(z)}\sum_{y\in Y}g(y,z).$$
Define the auxiliary function h*(z) = ∑_{y∈Y} g(y, z) and the update f*(x, z) = f(x, z)h*(z)/h(z); then clearly
$$f^*(x,z) = f(x,z)\frac{h^*(z)}{h(z)} = F_1(x,z).$$
The calculation of the marginal function F1(x, z) by means of the auxiliary function h*(z) may be described as passing a local message flow from ZY to XZ through their separator Z. The factor
$$\frac{h^*(z)}{h(z)}$$
is called the update ratio. It follows that
$$F(x,y,z) = f^*(x,z)\frac{1}{h^*(z)}g(y,z).$$
Similarly, a message can be passed in the other direction, i.e. from XZ to ZY. Using the same procedure, set
$$\tilde h(z) = \sum_{x\in X}f^*(x,z) = \sum_{x\in X}F_1(x,z) = F_3(z).$$
Next, set
$$\tilde g(y,z) = g(y,z)\frac{\tilde h(z)}{h^*(z)}.$$
Then
$$F(x,y,z) = F_1(x,z)\frac{1}{\tilde h(z)}\tilde g(y,z) = F_1(x,z)\frac{1}{F_3(z)}\tilde g(y,z)$$
and
$$F_2(y,z) = \sum_{x\in X}F(x,y,z) = \tilde g(y,z)\frac{1}{F_3(z)}\sum_{x\in X}F_1(x,z) = \tilde g(y,z).$$
Passing messages in both directions results in a new overall representation of the function F(x, y, z):
$$F(x,y,z) = f^*(x,z)\frac{1}{h^*(z)}g(y,z) = f^*(x,z)\frac{1}{h^*(z)}\frac{h^*(z)}{\tilde h(z)}\tilde g(y,z) = f^*(x,z)\frac{1}{\tilde h(z)}\tilde g(y,z) = F_1(x,z)\frac{1}{F_3(z)}F_2(y,z).$$
The original representation has been transformed into a new representation in which all the functions are marginal functions.
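The two flows above can be verified numerically. The following sketch (binary X, Y, Z and randomly generated positive tables are illustrative assumptions) performs one flow in each direction and checks that the resulting functions are exactly the margins F1, F2 and F3.

```python
# Sketch: two-clique message passing on F(x,y,z) = f(x,z) g(y,z) / h(z),
# with arrays indexed as f[x, z], g[y, z], h[z].
import numpy as np

rng = np.random.default_rng(0)
f, g = rng.random((2, 2)), rng.random((2, 2))
h = rng.random(2)
F = f[:, None, :] * g[None, :, :] / h[None, None, :]   # F(x,y,z)

# Flow from YZ to XZ: auxiliary h*(z) = sum_y g(y,z), update f* = f h*/h.
h_star = g.sum(axis=0)
f = f * h_star / h
# Flow from XZ to YZ: h~(z) = sum_x f*(x,z), update g~ = g h~/h*.
h_tilde = f.sum(axis=0)
g = g * h_tilde / h_star

assert np.allclose(f, F.sum(axis=1))             # F1(x,z)
assert np.allclose(g, F.sum(axis=0))             # F2(y,z)
assert np.allclose(h_tilde, F.sum(axis=(0, 1)))  # F3(z)
assert np.allclose(f[:, None, :] * g[None, :, :] / h_tilde[None, None, :], F)
```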
This idea is now extended to arbitrary non-negative functions represented on junction trees. Let T be a junction tree with maximal cliques C and separators S, and let
$$\Phi = \{\phi_C ∶ C\in\mathcal C,\ \phi_S ∶ S\in\mathcal S\} \tag{8.5}$$
be a charge; that is, a collection of non-negative functions such that ϕC ∶ XC → R+ and ϕS ∶ XS → R+ for each C ∈ C and each S ∈ S.
Definition 8.4 (Contraction of a Charge on a Junction Tree). The contraction of a charge (8.5) over a junction tree is defined as
$$f(x) = \frac{\prod_{C\in\mathcal C}\phi_C(x_C)}{\prod_{S\in\mathcal S}\phi_S(x_S)}. \tag{8.6}$$
Local Message Passing Let C1 and C2 be two neighbouring nodes in T, separated by S. Set
$$\phi^*_S(x_S) = \sum_{z\in X_{C_1/S}}\phi_{C_1}(z, x_S) \tag{8.7}$$
and set
$$\lambda_S = \frac{\phi^*_S}{\phi_S}, \tag{8.8}$$
where, by definition, 0/0 = 0 is used in the division of functions. λS is known as the update ratio. The `message passing' is defined as the operation of updating ϕS to ϕ*S and ϕC2 to
$$\phi^*_{C_2} = \lambda_S\,\phi_{C_2}. \tag{8.9}$$
All other functions remain unchanged. The scheme of local message passing is illustrated in Figure 8.3.
[Figure 8.3: The flow of a message from C1 to C2 through the separator S: ϕ*C2 = λSϕC2.]
Lemma 8.5. A flow does not change the contraction of a charge.
Proof Let the contraction before the flow from C1 to C2 be
$$f(x) = \frac{\prod_{C\in\mathcal C}\phi_C(x_C)}{\prod_{S\in\mathcal S}\phi_S(x_S)}. \tag{8.10}$$
Let the contraction after the flow be denoted by f* and the charge after the flow from C1 to C2 by
Φ* = {ϕ*C ∶ C ∈ C, ϕ*S ∶ S ∈ S}.
For x such that ϕS(xS) > 0 and ϕ*S(xS) > 0,
$$\frac{\phi^*_{C_2}}{\phi^*_S} = \frac{\lambda_S\,\phi_{C_2}}{\phi^*_S} = \frac{\left(\phi^*_S/\phi_S\right)\phi_{C_2}}{\phi^*_S} = \frac{\phi_{C_2}}{\phi_S},$$
so that f*(x) = f(x) and the result is proved in this case.
For x such that ϕS(xS) = 0: it follows that f(x) = 0, and hence that ϕC1(xC1) = 0 and that λS(xS) = 0. It therefore follows from the definition of ϕ*C2 that ϕ*C2 = 0, and hence that f*(x) = 0, so that 0 = f*(x) = f(x).
For x such that ϕS(xS) > 0, but ϕ*S(xS) = 0, it follows directly that λS(xS) = 0, so that f*(x) = 0. It remains to show that f(x) = 0. From the definition,
$$\phi^*_S(x_S) = \sum_{z\in X_{C_1/S}}\phi_{C_1}(z, x_S) = 0.$$
Since ϕC1(xC1) ≥ 0 for all xC1 ∈ XC1, it follows that ϕC1(z, xS) = 0 for all z ∈ XC1/S. Since ϕC1 appears in the numerator of the contraction, f(x) = 0.
In all cases, it follows that a flow does not change the contraction of a charge.
8.5 Schedules
The aim of this section is to describe how to construct a series of transmissions between the various maximal cliques of a junction tree, to update a set of functions to a set of functions that have the same contraction as the original and which are the marginals of the contraction over the maximal cliques and separators. First, some definitions and notations are established.
The following definition gives the technical terms that will be used.
Definition 8.7 (Schedule, Active Flow, Fully Active Schedule). A schedule is an ordered list of directed edges of T specifying which flows are to be passed and in which order.
A flow is said to be active relative to a schedule if, before it is sent, the source has already received active flows from all its neighbours in T, with the exception of the sink; namely, the node to which it is sending its flow. A schedule is full if it contains an active flow in each direction along every edge of the tree T. A schedule is active if it contains only active flows. It is fully active if it is both full and active.
It follows from this definition that the first active flow must originate in a leaf of T.
Figure 8.4 shows a DAG and Figure 8.5 a corresponding junction tree. An example of a fully active schedule for the junction tree given in Figure 8.5, where the maximal clique BEL is chosen as the root, would be:
AT → ELT, EK → ELT, ELT → BEL, BLS → BEL, BDE → BEL,
BEL → ELT, ELT → EK, ELT → AT, BEL → BLS, BEL → BDE.
[Figure 8.4: A directed acyclic graph over the variables A, S, T, E, L, B, K, D.]
[Figure 8.5: A junction tree with maximal cliques AT, ELT, EK, BEL, BLS, BDE and separators T, E, EL, BL, BE.]
Proposition 8.9. For any tree T, there exists a fully active schedule.
Proof If there is only one maximal clique, the proposition is clear; no transmissions are necessary. Assume that there is more than one maximal clique. Let C0 denote a leaf in T. Let T0 be the sub-tree of T obtained by removing C0 and the corresponding edge S0. Assume that the proposition is true for T0. Adding the flow
C0 → S0 → T0
to the beginning of the schedule and
C0 ← S0 ← T0
to the end of the schedule provides a fully active schedule for T.
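The recursive construction in the proof translates directly into code. A sketch (the junction tree of Figure 8.5, stored as an adjacency dictionary, is used as an illustration):

```python
# Sketch: generating a fully active schedule by peeling off a leaf,
# recursing, and wrapping the inner schedule with an inward flow first
# and an outward flow last (Proposition 8.9).
def fully_active_schedule(tree):
    if len(tree) <= 1:
        return []
    leaf = next(v for v, nbrs in tree.items() if len(nbrs) == 1)
    (parent,) = tree[leaf]
    rest = {v: nbrs - {leaf} for v, nbrs in tree.items() if v != leaf}
    inner = fully_active_schedule(rest)
    return [(leaf, parent)] + inner + [(parent, leaf)]

tree = {'AT': {'ELT'}, 'EK': {'ELT'}, 'ELT': {'AT', 'EK', 'BEL'},
        'BEL': {'ELT', 'BLS', 'BDE'}, 'BLS': {'BEL'}, 'BDE': {'BEL'}}
for flow in fully_active_schedule(tree):
    print('%s -> %s' % flow)      # 10 flows, one in each direction per edge
```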
The aim is to show that, after the passage of a fully active schedule of flows over a junction tree, the resulting charge is the marginal charge. That is, all the functions of the charge are the marginals of the contraction of the charge over the respective maximal cliques and separators. Furthermore, there is global consistency after the passage of a fully active schedule of flows over a junction tree. This will be defined later but, loosely speaking, it means that if there are several apparent ways to compute a probability distribution over a set of variables using the functions of the marginal charge, they will all give the same answer.
Definition 8.10 (The Base of a Sub-tree, Restriction of a Charge, Live Sub-tree). Let T′ be a sub-tree of T, with nodes C′ ⊆ C and edges S′ ⊆ S. The base of T′ is defined as the set of variables
U′ ∶= ∪_{C∈C′} C.
Let
Φ = {ϕC ∶ C ∈ C, ϕS ∶ S ∈ S}
be a charge for T. Its restriction to T′ is defined as
ΦT′ = {ϕC ∶ C ∈ C′, ϕS ∶ S ∈ S′}.
Recall Definition 8.4. The contraction of ΦT′ is defined as
$$\frac{\prod_{C\in\mathcal C'}\phi_C(x_C)}{\prod_{S\in\mathcal S'}\phi_S(x_S)}.$$
A sub-tree T′ is said to be live with respect to the schedule of flows if it has already received active flows from all its neighbours.
Theorem 8.11. Let
Φ0 = {ϕ0C ∶ C ∈ C, ϕ0S ∶ S ∈ S}
denote an initial charge for a function f that has the factorisation
$$f(x) = \frac{\prod_{C\in\mathcal C}\phi^0_C(x_C)}{\prod_{S\in\mathcal S}\phi^0_S(x_S)},$$
where C and S are the sets of maximal cliques and separators for a junction tree T. Suppose that Φ0 is modified by a sequence of flows according to some schedule. Then, whenever T′ is live, the contraction of the charge for T′ is the margin of the contraction f of the charge for T on U′.
Proof Assume that T′ ⊂ T and that T′ is live. Let C* denote the last neighbour to have passed a flow into T′. Let T* be the sub-tree obtained by adding C* and the associated edge S* to T′. Let C*, S* and U* be the maximal cliques, separators and the base of T*. By the junction tree property of T, the separator associated with the edge S* joining C* to T′ is
S* = C* ∩ U′.
Also,
C* = C′ ∪ {C*}, S* = S′ ∪ {S*} and U* = U′ ∪ C*.
The induction hypothesis: the assertion holds for the contraction of the charge on T*. That is, using
$$f_{U^*}(x_{U^*}) = \sum_{U/U^*}f(x),$$
the contraction of the charge restricted to T* equals fU*. Let
Φ = {ϕC ∶ C ∈ C, ϕS ∶ S ∈ S}
denote the charge just before the last flow from C* into T′. It follows that, after the flow, the contraction of the charge restricted to T′ is
$$\frac{\phi^*_{S^*}}{\phi_{S^*}}\cdot\frac{\prod_{C\in\mathcal C'}\phi_C}{\prod_{S\in\mathcal S'}\phi_S} = \sum_{x_{C^*/S^*}}\frac{\phi_{C^*}\prod_{C\in\mathcal C'}\phi_C}{\phi_{S^*}\prod_{S\in\mathcal S'}\phi_S} = \sum_{x_{U^*/U'}}f_{U^*}(x_{U^*}) = f_{U'}(x_{U'}),$$
as required.
Corollary 8.12. Let {ϕC, C ∈ C, ϕS, S ∈ S} denote the current functions over the maximal cliques and separators. For any set A ⊆ V, let fA = ∑_{X_{V/A}} f, the marginal over A. Whenever a maximal clique C is live, its corresponding function is ϕC = fC = ∑_{X_{V/C}} f.
Proof A single maximal clique is a sub-tree. The result is immediate from the theorem.
Corollary 8.13. Using the notation of Corollary 8.12, whenever active flows have passed in both directions across an edge in T, the function for the associated separator is ϕS = fS = ∑_{X_{V/S}} f.
Proof The function ϕS for the associated separator is, by definition of the update,
$$\phi_S = \sum_{X_{C/S}}\phi_C,$$
where C is the maximal clique that most recently passed a flow across the edge. Since C is live when it sends the flow,
$$\phi_S = \sum_{X_{C/S}}\phi_C = \sum_{X_{C/S}}f_C = f_S,$$
as required.
Proposition 8.14 (The Main Result). After passage of a fully active schedule of flows, the resulting charge is the marginal charge Φ and its contraction represents f. In other words, the following formula, known as the Aalborg formula, holds:
$$f(x) = \frac{\prod_{C\in\mathcal C}f_C(x_C)}{\prod_{S\in\mathcal S}f_S(x_S)}.$$
Proof This follows from the previous two corollaries and Lemma 8.5, stating that the contraction is unaltered by the flows.
In what follows, the current charge on the junction tree is denoted by Φ = {ϕC ∶ C ∈ C, ϕS ∶ S ∈ S}.
Definition 8.15 (Local Consistency). A junction tree T is said to be locally consistent if, whenever C1 ∈ C and C2 ∈ C are two neighbours with separator S = C1 ∩ C2,
$$\sum_{X_{C_1/(C_1\cap C_2)}}\phi_{C_1} = \phi_S = \sum_{X_{C_2/(C_1\cap C_2)}}\phi_{C_2}.$$
Definition 8.16 (Global Consistency). A junction tree T (or its charge) is said to be globally consistent if for every C1 ∈ C and C2 ∈ C it holds that
$$\sum_{X_{C_1/(C_1\cap C_2)}}\phi_{C_1} = \sum_{X_{C_2/(C_1\cap C_2)}}\phi_{C_2}.$$
Global consistency means that the marginalisations to C1 ∩ C2 of ϕC1 and ϕC2 coincide for every C1 and C2 in C. The following results show that, for a junction tree, local consistency implies global consistency.
Proposition 8.17. After passage of a fully active schedule of flows, a junction tree T is locally consistent.
Proof The two corollaries of the main result give that, for any two neighbouring C1 and C2,
$$\sum_{C_1/S}f_{C_1} = f_S = \sum_{C_2/S}f_{C_2}.$$
An equilibrium, or fixed point, has been reached, in the sense that any new flows passed after passage of a fully active schedule do not alter the functions. The update ratio for another message from C1 to C2 becomes
$$\lambda_S = \frac{\sum_{C_1/S}f_{C_1}}{f_S} = 1.$$
Global Consistency of Junction Trees In this paragraph, it is shown that for junction trees, local consistency implies global consistency.
By definition, a junction tree is a tree such that the intersection C1 ∩ C2 of any pair C1 and C2 in C is contained in every node on the unique trail in T between C1 and C2. The set C1 ∩ C2 can be empty and, in this case, it is (by convention) a subset of every other set.
Proposition 8.18. A locally consistent junction tree is globally consistent.
Proof In a junction tree, the intersection C1 ∩ C2 of any pair C1 and C2 in C is contained in every node on the unique path in T between C1 and C2. Assume that C1 ∩ C2 is non-empty. Consider the unique path from C1 to C2. Let the nodes on the path be denoted by {C(i)}_{i=0}^n, with C(0) = C1 and C(n) = C2, so that C(i) and C(i+1) are neighbours. Denote the separator between C(i) and C(i+1) by S(i); by the junction tree property,
C1 ∩ C2 ⊆ S(i).
For a set of variables C, let ∑C denote ∑_{XC}. The assumption of local consistency means that, for any two neighbours C(i) and C(i+1),
$$\sum_{C^{(i)}/(C_1\cap C_2\cap C^{(i)})}\phi_{C^{(i)}} = \sum_{S^{(i)}/(C_1\cap C_2\cap S^{(i)})}\ \sum_{C^{(i)}/S^{(i)}}\phi_{C^{(i)}} = \sum_{S^{(i)}/(C_1\cap C_2\cap S^{(i)})}\ \sum_{C^{(i+1)}/S^{(i)}}\phi_{C^{(i+1)}} = \sum_{C^{(i+1)}/(C_1\cap C_2\cap C^{(i+1)})}\phi_{C^{(i+1)}}.$$
In particular, the marginalisations of ϕC1 and ϕC(1) coincide. The procedure is continued along the path until the node C2 is reached. The result is proved.
Corollary 8.19. After the passage of a fully active schedule of flows, a junction tree is globally consistent.
Proof This follows from Proposition 8.17, stating that after passage of a fully active schedule of flows a junction tree T is locally consistent, together with Proposition 8.18.
The algorithm for updating considered the maximal cliques of a junction tree, which send and receive messages locally; the global update is performed entirely by a series of local computations. By organising the variables into maximal cliques and separators on a junction tree and determining a schedule, there is no need for global computations in the inference problem; the global update is achieved entirely by passing messages between neighbours in the tree according to a schedule, and the algorithm terminates automatically when the update is completed.
8.7 Using a Junction Tree with Virtual Evidence and Soft Evidence
The junction tree method may be extended to the problem of updating in the light of virtual evidence
and soft evidence.
Dealing with virtual evidence is straightforward; for each virtual finding, one adds a virtual node, as
illustrated in Figure 6.2, which is instantiated according to the virtual finding. This simply adds the
virtual finding node to the maximal clique containing the variable for which there is a virtual finding.
If virtual evidence is given on a variable X with state space (x1 , . . . , xn ), and the evidence is given
in the form
ρ1 = 1,   ρj = PE∣X (1∣xj ) / PE∣X (1∣x1 ),   j = 2, . . . , n,
the conditional probabilities
              x1    x2    . . .   xn
PE∣X (1∣.) =   a    aρ2   . . .   aρn
may be used, for some a > 0 such that 0 ≤ PE∣X (1∣xj ) ≤ 1 for each j = 1, . . . , n.
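As a small numerical illustration (the likelihood values below are hypothetical, not from the text), the
table can be computed directly in R:
> pE.given.x <- c(0.9, 0.3, 0.1)    # hypothetical likelihoods P(E=1|x_j)
> rho <- pE.given.x / pE.given.x[1] # rho_1 = 1 by construction
> a <- 0.9                          # any a > 0 keeping a*rho within [0,1]
> a * rho
[1] 0.9 0.3 0.1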
To absorb soft evidence, remove the links from each variable Y1 , . . . , Ym to which soft evidence is
applied. Provided the nodes on which the soft evidence is received are d-separated from each other
and d-separated from the nodes on which hard evidence is received after surgery, simply replace the
conditional probabilities PYj ∣Π(Yj ) with P∗Yj ; the independence ensures that the marginal probabilities
for these variables after updating will remain as P∗Yj .
The Lazy big maximal clique algorithm If soft evidence and hard evidence are received on
variables that are d-connected after surgery, then incorporating soft evidence cannot be carried out
in such a straightforward manner. The problem is that the approach outlined above inserts P∗Yj , the
marginal probability after updating, in the place of the a-priori assessment PYj∣Πj , without reference
to the other pieces of evidence. The updated distribution should have P∗Yj as the marginal
distribution over Yj .
One method for incorporating soft evidence is discussed in [138]. The input is a Bayesian network
with a collection of soft and hard findings. The method returns a joint probability distribution with
two properties:
1. The findings are the marginals of the updated distribution.
2. The updated distribution is the closest to the original distribution, in Kullback-Leibler divergence,
that satisfies this constraint (that the findings are the marginals of the updated distribution).
The junction tree is modified to incorporate soft evidence in the following way.
1. After surgery, construct a junction tree in which all the variables that have soft evidence are in
the same maximal clique, the big maximal clique C1 .
2. Let C1 (the big maximal clique) be the root node, apply the hard evidence and run the first half
of the fully active schedule; that is, propagate from the leaves to the root node.
3. Once the big maximal clique C1 has been updated with the information from all the other
maximal cliques, absorb all the soft evidence into C1 . This is described below.
4. Distribute the evidence according to the method described in Section 8.5, the second part: send
the messages from the updated root out to the leaves.
If the big maximal clique is updated to provide a probability function (namely, a non-negative function
that sums to 1), then the distribution of evidence will update the functions over the maximal cliques
and separators to probability distributions over the respective maximal cliques and separators.
Absorbing the Soft Evidence Suppose the big maximal clique C1 has soft evidence on the variables
(Y1 , . . . , Yk ); that is, soft evidence is received that Y1 , . . . , Yk have distributions QY1 , . . . , QYk
respectively. Let QC1 denote the probability function over the variables in C1 after the soft evidence
has been absorbed. Then it is required that, for each j ∈ {1, . . . , k},
QYj = ∑_{XC1/{Yj}} QC1 ;
that is, the marginal of QC1 over all variables other than Yj is QYj .
The important feature of soft evidence (Definition 6.4) is that after soft evidence has been received,
the variable has no parent variables. The Iterative Proportional Fitting Procedure (IPFP), therefore,
may be employed. It goes in cycles of length k . Firstly, normalise the function over C1 (after the hard
evidence has been received) so that it is a probability distribution PC1 . Then set
P^(0)_{C1} = PC1
and, for j = 1, . . . , k , set
P^(mk+j−1)_{Yj} = ∑_{XC1/{Yj}} P^(mk+j−1)_{C1}
and
P^(mk+j)_{C1} = P^(mk+j−1)_{C1} QYj / P^(mk+j−1)_{Yj} .
This is repeated until the desired accuracy is obtained. It is well established that, for discrete
distributions with finite state space, the IPFP algorithm converges to the distribution that minimises
the Kullback-Leibler distance from the original distribution (see, for example, [10] (1959)).
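The following is a minimal sketch of the IPFP cycle in R; the clique table over C1 = {Y1, Y2} and
the soft findings are hypothetical illustrations, not taken from the text.
> P <- matrix(c(0.2, 0.3, 0.1, 0.4), nrow = 2,
+             dimnames = list(Y1 = c("y", "n"), Y2 = c("y", "n")))
> Q <- list(Y1 = c(y = 0.5, n = 0.5), Y2 = c(y = 0.7, n = 0.3))
> for (cycle in 1:50) {
+   P <- P * (Q$Y1 / rowSums(P))        # match the Y1 margin
+   P <- t(t(P) * (Q$Y2 / colSums(P)))  # match the Y2 margin
+ }
> rowSums(P)  # now (approximately) equal to Q$Y1
> colSums(P)  # exactly Q$Y2 after the last step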
Notes The original paper describing the use of junction trees for updating a Bayesian network is by
S.L. Lauritzen and D.J. Spiegelhalter [80]. The propagation presented is the approach of Lauritzen and
Spiegelhalter, discussed in [83]; the technicalities differ slightly between implementations in software.
The proofs of the main results for the message passing algorithm were originally presented in [33].
The Iterative Proportional Fitting Procedure dates back to Deming and Stephan (1940) [35]; it is the
basis for updating a junction tree in the light of soft evidence. The basic technique is taken from M.
Valtorta, Y.G. Kim and J. Vomlel (2002) [138].
8.8 Exercises
1. Let PX1,X2,X3,X4,X5 be a probability distribution over five variables that has factorisation
PX1,X2,X3,X4,X5 = (PX1,X2 PX2,X3,X4 PX4,X5) / (PX2 PX4).
Suppose hard evidence X3 = a is received. Let
fX1,X2,X3,X4,X5 (x1 , x2 , x3 , x4 , x5 ) = pX1,X2,X3,X4,X5 (x1 , x2 , a, x4 , x5 ) if x3 = a, and 0 if x3 ≠ a.
Work through the stages of the message passing algorithm to obtain functions ψX1,X2 , ψX2 ,
ψX2,X3,X4 (., a, .), ψX4 , ψX4,X5 such that
ψX1,X2 = ∑_{x4,x5} fX1,...,X5 (., ., a, x4 , x5 ),
ψX2 = ∑_{x1,x4,x5} fX1,...,X5 (x1 , ., a, x4 , x5 ),
ψX2,X3,X4 = ∑_{x1,x5} fX1,...,X5 (x1 , ., a, ., x5 ),
ψX4 = ∑_{x1,x2,x5} fX1,...,X5 (x1 , x2 , a, ., x5 ),
ψX4,X5 = ∑_{x1,x2} fX1,...,X5 (x1 , x2 , a, ., .)
and
fX1,...,X5 = (ψX1,X2 ψX2,X3,X4 ψX4,X5) / (ψX2 ψX4).
2. (a) Prove that Kruskal's algorithm returns a tree of maximal weight: Consider d nodes, labelled
(α1 , . . . , αd ) and a weight bij corresponding to each pair of nodes {αi , αj }. The tree of
maximal weight is the tree with nodes {α1 , . . . , αd } such that the score ∑e∈T be , where the
sum is taken over all edges e included in the tree, is greater than or equal to the score for
any other tree.
Kruskal's algorithm proceeds as follows:
i. The d variables yield d(d − 1)/2 edges. The edges are indexed in decreasing order,
according to their weights b1 , b2 , . . . , bd(d−1)/2 .
ii. The edges b1 and b2 are selected. Then the edge b3 is selected if it does not form a
cycle.
iii. This is repeated through b4 , . . . , bd(d−1)/2 , in that order, adding edges if they do not
form a cycle and discarding them if they form a cycle.
This may be proved by induction.
(b) Prove that Prim's algorithm returns a tree of maximal weight. This proceeds by rst
choosing the edge of maximal weight and then subsequently choosing additional edges to
add to the tree where the additional link has maximal weight.
3. Let C denote the set of maximal cliques from a triangulated graph. A pre-I-tree is a tree over C
with separators S = C1 ∩ C2 for adjacent maximal cliques C1 and C2 . The weight of a pre-I-tree
is the sum of the number of variables in the separators.
8.9 Answers
ψX2 (x2 ) = ∑_{x4} ψX2,X3,X4 (x2 , a, x4 ) = ∑_{x4} pX2,X3,X4 (x2 , a, x4 ) = pX2,X3 (x2 , a).
3. (a) Let T be any non-maximal spanning tree. Let T1 ⊂ T2 ⊂ . . . ⊂ T′ denote a sequence of
trees constructed through Prim's algorithm, terminating in a tree T′ of maximal weight. Let the
construction be such that a link from T is chosen whenever possible. Let m be the first stage
where this is not possible and let C1 − C2 with separator S be the link actually chosen (C1 ∈ Tm ,
C2 ∉ Tm ; the separator S of C1 − C2 has maximal weight among those not yet used). In T ,
there is a path between C1 and C2 . The path contains a link C3 − C4 with separator S′ such
that C3 ∈ Tm , C4 ∉ Tm . Possibly C2 is C4 . Since C3 − C4 could not be chosen, it follows that
∣S′∣ < ∣S∣ and therefore S contains variables not in S′. Therefore, T does not satisfy the junction
tree condition.
(b) Consider the tree of maximal weight constructed by Prim's algorithm and let T1 , . . . , Tn = T
denote the successive trees. Assume that T is not a junction tree; then at some stage m, Tm can
be extended to a junction tree T′ while Tm+1 cannot. Let C1 − C2 with separator S be the link
chosen at this stage; C2 ∈ Tm+1 . Since Tm+1 cannot be extended to a junction tree, the link
C1 − C2 is not in T′, so there is a path in T′ between C1 and C2 not containing the link C1 − C2 .
This path contains a link C3 − C4 with separator S′ such that C3 ∈ Tm and C4 ∉ Tm . Since T′
is a junction tree, it follows that S ⊆ S′ and, since S was chosen through Prim's algorithm, it
follows that ∣S∣ ≥ ∣S′∣, so that S = S′. Now remove the link C3 − C4 from T′ and add the link
C1 − C2 . The result is a junction tree extending Tm+1 , contradicting the assumption that it
cannot be extended to a junction tree.
4. (a) Consider any cycle of length n ≥ 4 in the moral graph. If an edge in the cycle that was
added at the moralisation stage is removed, there will be a cycle of length n + 1. Successively
removing edges from the cycle that were added at the moralisation stage, the graph will still
have a cycle, containing the vee structure from the original graph instead of the parent-parent
edge. Hence, if the moral graph is not triangulated, the skeleton of the original graph is not a
tree; hence the original graph is not singly connected.
(b) The maximal cliques of the moral graph are the variable/parent configurations. Suppose that
there are two variables U and V in a separator. If U − V is a parent-parent configuration in both
maximal cliques being separated, then there is a cycle in the original graph, hence a contradiction.
If U − V represents a parent-child configuration in both maximal cliques that it is separating,
then there is a contradiction: both maximal cliques taken together form a single complete subset,
contradicting the fact that there are two maximal cliques. If U − V represents a parent-parent
configuration in one maximal clique and a parent-child configuration in the other, then there is
a cycle in the skeleton of the original graph, hence a contradiction.
Chapter 9
Bayesian Networks in R
9.1 Introduction
R is now the dominant language of statistical computing.
There are excellent packages available in R for Bayesian networks, for inference using a given Bayesian
network and for learning the structure of a Bayesian network. This chapter introduces some of the
software available in R for Bayesian networks and discusses graphs in R and inference using networks
that have already been defined. Parameter learning and structure learning are considered later.
The packages considered are gRain by Søren Højsgaard and bnlearn.
Having installed R and a suitable editor (for example RStudio), the relevant packages have to be
installed.
gRain and related packages Information for gRain is available on the author's web page:
http://people.math.aau.dk/~sorenh/software/gR/
The package, along with all the supporting packages, has to be installed. As pointed out on the web
page, under `4 Installation', the package uses the packages graph, RBGL and Rgraphviz. These
packages are not on CRAN, but on Bioconductor. To install these packages, execute the following.
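The exact commands are not reproduced in this copy; a sketch using the current Bioconductor
installation interface (BiocManager) would be:
> install.packages("BiocManager")
> BiocManager::install(c("graph", "RBGL", "Rgraphviz"))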
Warning This can take a long time. Furthermore, there may be some interactive questions requiring
yes/no answers.
After this, gRain may be installed from CRAN in the usual way:
> install.packages("gRain")
The package bnlearn also has some useful inference functions, although its main focus is
learning. Install it in the usual way:
> install.packages("bnlearn")
9.2 Graphs in R
This section considers the various graphs that appear in graphical modelling and how to render them
in R. In addition to the packages mentioned so far, the package ggm has some useful functions for
graphical Markov models.
>install.packages("ggm")
>install.packages("igraph")
> library("bnlearn")
> library("gRain")
> library("ggm")
> library("igraph")
> library(RBGL)
> library("gRbase")
(The gRbase package contains the function ug(); it is loaded automatically when gRain is loaded.)
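The command creating the undirected graph used below is missing from this copy; judging from the
edge list and cliques printed later, it was presumably something like:
> library(gRbase)
> ugraph <- ug(~a:b + b:c:d + e)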
Plotting the graph requires the package Rgraphviz
> library(Rgraphviz)
> plot(ugraph)
The default output of ug() is a graphNEL object. The arguments result="igraph" or
result="matrix" return an igraph object or an adjacency matrix instead. There is a plot method
for igraph objects in the igraph package.
Figure 9.1: Undirected Graph
Edges can be added or deleted quite easily using the addEdge() and removeEdge() commands:
> nodes(ugraph)
[1] "a" "b" "c" "d" "e"
> str(edgeList(ugraph))
List of 4
$ : chr [1:2] "a" "b"
$ : chr [1:2] "b" "c"
$ : chr [1:2] "b" "d"
$ : chr [1:2] "c" "d"
> maxClique(ugraph)
$maxCliques
$maxCliques[[1]]
[1] "b" "c" "d"
$maxCliques[[2]]
[1] "b" "a"
$maxCliques[[3]]
[1] "e"
ugraph is not complete; this can be seen using the is.complete command:
> is.complete(ugraph)
[1] FALSE
The command separates from the RBGL package, indicates whether or not there is graphical sepa-
ration:
> separates("a","d",c("b","c"),ugraph)
[1] TRUE
The boundary bd(α) of a vertex α is the set of vertices adjacent to α, adj(α), which (for an undirected
graph) is equal to the set of neighbours. The closure is the boundary together with the node:
cl(α) = bd(α) ∪ {α}.
> adj(ugraph,"c")
$c
[1] "d" "b"
> closure("c",ugraph)
[1] "c" "d" "b"
We can also establish whether or not nodes are simplicial, if the graph is triangulated, and obtain the
connected components.
> is.simplicial("b",ugraph)
[1] FALSE
> simplicialNodes(ugraph)
[1] "a" "c" "d" "e"
> connComp(ugraph)
[[1]]
[1] "a" "b" "c" "d"
[[2]]
[1] "e"
> is.triangulated(ugraph)
[1] TRUE
If we want to establish whether (A, B, S) forms a decomposition, where S is complete and separates
A and B , the function is is.decomposition:
> is.decomposition("a","d",c("b","c"),ugraph)
[1] FALSE
> mcs(ugraph)
[1] "a" "b" "c" "d" "e"
> mcs(ugraph,root=c("d","c","a"))
[1] "d" "c" "b" "a" "e"
It is convenient if the cliques satisfy the running intersection property: Cj ∩ (C1 ∪ . . . ∪ Cj−1 ) ⊆ Ci
for some i < j . Define Sj = Cj ∩ (C1 ∪ . . . ∪ Cj−1 ) and Rj = Cj /Sj , with S1 = ∅. Any clique Ci with
Sj ⊆ Ci and i < j is a possible parent of Cj . The rip() function returns such a list if the graph is
triangulated.
> rip(ugraph)
cliques
1 : b a
2 : b c d
3 : e
separators
1 :
2 : b
3 :
parents
1 : 0
2 : 1
3 : 0
Figure 9.2: Directed Acyclic Graph
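The command that creates this DAG, dgraph, is not shown in this copy; from the node and edge
listings below, a plausible reconstruction using gRbase (each term lists a node followed by its parents)
is:
> dgraph <- dag(~a + b:a + c:a:b + d:c:e + e:a + f + g:f)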
> nodes(dgraph)
[1] "a" "b" "c" "d" "e" "g" "f"
> str(edges(dgraph))
List of 7
$ a: chr [1:3] "b" "c" "e"
$ b: chr "c"
$ c: chr "d"
$ d: chr(0)
$ e: chr "d"
$ g: chr(0)
$ f: chr "g"
edges gives a list of the children for each node. Alternatively, the edges are listed by:
> str(edgeList(dgraph))
List of 7
$ : chr [1:2] "a" "b"
$ : chr [1:2] "a" "c"
$ : chr [1:2] "a" "e"
The vpar() function returns a list with an element for each node together with its parents.
The parents, the children, and the ancestral set an(A) of a set A (the set together with all its
ancestors) can be obtained by:
> parents("d",dgraph)
[1] "c" "e"
> children("c",dgraph)
[1] "d"
> ancestralSet(c("b","e"),dgraph)
[1] "a" "b" "e"
> ag <- ancestralGraph(c("b","e"),dgraph)
> plot(ag)
D-separation can be obtained by the dSep function from the ggm package.
> dSep(as(dgraph,"matrix"),"c","e","a")
[1] TRUE
> adjm<-matrix(c(0,1,1,0,1,0,0,1,1,0,0,0,1,1,1,0),nrow=4)
> rownames(adjm)<-colnames(adjm)<-letters[1:4]
> adjm
a b c d
a 0 1 1 1
b 1 0 0 1
c 1 0 0 1
d 0 1 0 0
> gG<-as(adjm,"graphNEL")
> plot(gG,"neato")
> gG1<-as(adjm,"igraph")
> plot(gG1,layout=layout.spring)
Figure 9.5: Mixed Graph
Is it a Chain Graph? The is.chaingraph() function from the lcd package determines whether a
mixed graph is a chain graph. The input is an adjacency matrix.
> install.packages("lcd")
> library(lcd)
> is.chaingraph(as(gG1,"matrix"))
$result
[1] FALSE
$vert.order
NULL
$chain.size
NULL
The graph is not a chain graph; a and d are in the same chain component and therefore there should
not be a directed edge a → d.
The + operator could be considered slightly misleading. There are other ways to enter the conditional
probability potentials:
> t.a<-cptable(~tub|asia,values=c(5,95,1,99),levels=yn)
> t.a<-cptable(c("tub","asia"),values=c(5,95,1,99),levels=yn)
There are also special functions ortable() and andtable(). For example, e.lt could be entered by:
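The command itself is missing in this copy; assuming e.lt is the table for either as a logical OR of
lung and tub, a sketch would be:
> e.lt <- ortable(~either + lung + tub, levels = yn)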
The plot is shown in Figure 9.6. The fictitious situation being modelled is the following: you return
from a visit to Asia and find that you have a cough. A visit to Asia increases the chances of catching
tuberculosis. Meanwhile, smoking causes both lung cancer and bronchitis. Tuberculosis and lung
cancer both give the same results for an X-ray. Bronchitis causes dyspnoea (shortness of breath); both
lung cancer and tuberculosis have equal chances of causing dyspnoea.
> grn1c<-compile(grn1)
> summary(grn1c)
Independence network: Compiled: TRUE Propagated: FALSE
Nodes : chr [1:8] "asia" "tub" "smoke" "lung" "bronc" "either" ...
Number of cliques: 6
Maximal clique size: 3
Maximal state space in cliques: 8
> g<-grn1$dag
> mg<-moralize(g)
> tmg<-triangulate(mg)
> rip(tmg)
cliques
1 : asia tub
2 : either lung tub
3 : either lung bronc
4 : smoke lung bronc
5 : either dysp bronc
6 : either xray
separators
1 :
2 : tub
3 : either lung
4 : lung bronc
5 : either bronc
6 : either
parents
1 : 0
2 : 1
3 : 2
4 : 3
5 : 3
6 : 5
> junctree<-rip(tmg)
> plot(junctree)
Figure 9.7: Junction Tree for Asia Network
> grn1c.ev<-
+ setFinding(grn1c,nodes=c("asia","dysp"),states=c("yes","yes"))
This creates a new grain object. The grain objects with (grn1c.ev) and without (grn1c) the evidence
can be queried to give marginal probabilities:
> querygrain(grn1c.ev,nodes=c("lung","bronc"),type="marginal")
$lung
lung
yes no
0.09952515 0.90047485
$bronc
bronc
yes no
0.8114021 0.1885979
> querygrain(grn1c,nodes=c("lung","bronc"),type="marginal")
$lung
lung
yes no
0.055 0.945
$bronc
bronc
yes no
0.45 0.55
The evidence in a grain object can be retrieved with the getFinding() function, while the probability
of observing the evidence is obtained using the pFinding() function:
> getFinding(grn1c.ev)
Finding:
asia: yes
dysp: yes
Pr(Finding)= 0.004501375
> pFinding(grn1c.ev)
[1] 0.004501375
> querygrain(grn1c.ev,nodes=c("lung","bronc"),type="joint")
bronc
lung yes no
yes 0.06298076 0.03654439
no 0.74842132 0.15205354
> querygrain(grn1c.ev,nodes=c("lung","bronc"),type="conditional")
bronc
lung yes no
yes 0.07761966 0.1937688
no 0.92238034 0.8062312
These are both conditioned on the evidence; the former is the joint distribution of lung and bronc
conditioned on the evidence, while the latter is the conditional distribution of lung given bronc and
the evidence.
If it is known beforehand that a specific subset U of the variables will be of interest, it is
computationally faster to ensure that they are in the same clique. Consider the grain object grn1c2,
where the variables of interest are forced into the root clique:
> grn1c2<-compile(grn1,root=c("lung","bronc","tub"),propagate=TRUE)
> grn1c2.ev<-setFinding(grn1c2,nodes=c("asia","dysp"),states=c("yes","yes"))
> grn1c.ev<-setFinding(grn1c,nodes="asia",states="yes",propagate=FALSE)
> grn1c.ev<-setFinding(grn1c.ev,nodes="dysp",states="yes",propagate=FALSE)
> grn1c.ev<-propagate(grn1c.ev)
> grn1c.ev<-retractFinding(grn1c.ev,nodes="asia")
> getFinding(grn1c.ev)
Finding:
dysp: yes
Pr(Finding)= 0.4359706
> grn1c.ev<-retractFinding(grn1c.ev)
> getFinding(grn1c.ev)
NULL
> plot(g)
> simdagchest<-grain(g,data=chestSim500)
extractCPT - data.frame
> simdagchest<-compile(simdagchest,propagate=TRUE,smooth=0.1)
> querygrain(simdagchest,nodes=c("lung","bronc"),type="marginal")
$lung
lung
yes no
0.046 0.954
$bronc
bronc
yes no
0.454 0.546
Alternatively, a grain object may be built from an undirected triangulated graph. Recall that tmg is
g which has been moralised and then triangulated. Then
> simugchest<-grain(tmg,data=chestSim500,smooth=0.1)
extractCPT - data.frame
> simugchest<-compile(simugchest,propagate=TRUE)
> plot(simugchest)
> simulate(grn1c.ev,nsim=5)
asia tub smoke lung bronc either xray dysp
1 yes yes no no no yes yes yes
2 yes no yes no yes no no yes
3 yes no yes no yes no no yes
4 yes no yes no yes no no yes
5 yes no yes no yes no no yes
The xtabs() function may be used to obtain (approximately) the joint distribution of lung and bronc
conditioned on the finding:
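The command is missing in this copy; a sketch, simulating from the conditioned network and
tabulating (the simulation size is arbitrary), is:
> sim <- simulate(grn1c.ev, nsim = 10000)
> xtabs(~ lung + bronc, data = sim) / nrow(sim)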
9.3.7 Prediction
The predict() function is used for prediction. The default is type="class", which gives the class
with the highest probability, given the observed values of the predictors. Firstly, we generate some
data:
> mydata<-simulate(grn1c.ev,nsim=5)
> mydata
asia tub smoke lung bronc either xray dysp
1 yes no yes no yes no no yes
2 yes no no no yes no no yes
3 yes no no no yes no no yes
4 yes no yes no yes no no yes
5 yes no no no no no no yes
then we try to predict the most probable conguration of lung and the most probable conguration
of bronc, given all the others.
> predict(grn1c,response=c("lung","bronc"),newdata=mydata,
+ predictors=c("smoke","asia","tub","dysp","xray"),type="class")
$pred
$pred$lung
[1] "no" "no" "no" "no" "no"
$pred$bronc
[1] "yes" "yes" "yes" "yes" "yes"
$pFinding
[1] 0.002123915 0.001388412 0.001388412 0.002123915 0.001388412
These are read as follows: the variables lung and bronc are treated individually; this does not give
the jointly most probable configuration. The entire conditional distribution of lung and bronc is
obtained as follows:
> predict(grn1c,response=c("lung","bronc"),newdata=mydata,
+ predictors=c("smoke","asia","tub","dysp","xray"),type="dist")
$pred
$pred$lung
yes no
[1,] 0.0036677551 0.9963322
[2,] 0.0005200187 0.9994800
[3,] 0.0005200187 0.9994800
[4,] 0.0036677551 0.9963322
[5,] 0.0005200187 0.9994800
$pred$bronc
yes no
[1,] 0.9221067 0.07789335
[2,] 0.7739757 0.22602430
[3,] 0.7739757 0.22602430
[4,] 0.9221067 0.07789335
[5,] 0.7739757 0.22602430
$pFinding
[1] 0.002123915 0.001388412 0.001388412 0.002123915 0.001388412
http://www.mimuw.edu.pl/~noble/courses/BayesianNetworks/data/
Copy the file into a local directory, then load it into R.
>library(bnlearn)
>library(gRain)
> sachs.interventional <- read.table("~/data/sachs.interventional.txt", header=TRUE,
colClasses = "factor")
> isachs<-sachs.interventional
> val.str=paste("[PKC][PKA|PKC][praf|PKC:PKA]",
+ "[pmek|PKC:PKA:praf][p44.42|pmek:PKA]",
+ "[pakts473|p44.42:PKA][P38|PKC:PKA]",
+ "[pjnk|PKC:PKA][plcg][PIP3|plcg]",
+ "[PIP2|plcg:PIP3]")
> val=model2network(val.str)
> isachs=isachs[, 1:11]
> for(i in names(isachs))
+ levels(isachs[, i]) = c("LOW","AVERAGE","HIGH")
> fitted = bn.fit(val, isachs, method = "bayes")
The variable val contains the DAG for the Bayesian network. Given the structure, bn.fit estimates
the conditional probabilities. There are several methods for doing this; here the conditional
probability potentials simply contain the (Bayesian) estimates from the data.
Once the BN (DAG and CPPs) has been specified, we construct a junction tree for inference. The
junction tree algorithm is provided by the gRain package.
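The conversion commands are not shown in this copy; one way, using bnlearn's as.grain() to convert
the fitted network (the object name jtree is an assumption carried through the sketches below), is:
> jtree <- compile(as.grain(fitted))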
Now suppose that we have hard evidence, or a finding, that node p44.42 is in state LOW. This is
inserted quite simply by:
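Again, the command is missing in this copy; consistent with the object jprop queried below,
presumably something like:
> jprop <- setFinding(jtree, nodes = "p44.42", states = "LOW")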
Let us now check the marginal distribution of the node pakts473 with and without the evidence.
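A sketch of the corresponding queries (output omitted):
> querygrain(jtree, nodes = "pakts473")   # without the evidence
> querygrain(jprop, nodes = "pakts473")   # with the finding p44.42 = LOW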
The maximum a posteriori states may be found by nding the largest element of the target distribution:
> names(which.max(querygrain(jprop,nodes=c("PKA"))$PKA))
[1] "LOW"
The cpdist and cpquery commands from bnlearn do the same thing.
The cpquery command returns the probability of a specific event, conditional on evidence described
by a logical expression. For example:
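The example itself is missing in this copy; a sketch using bnlearn's approximate (logic sampling)
inference on the fitted object would be:
> cpquery(fitted, event = (pakts473 == "LOW"),
+         evidence = (p44.42 == "LOW"))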
9.4 Exercises
1. Professor Noddy is in his office when he receives the news that the burglar alarm in his home has
gone off. Convinced that a burglar has broken in, he starts to drive home. But, on his way, he
hears on the radio that there has been a minor earth tremor in the area. Since an earth tremor
can set off a burglar alarm, he therefore returns to his office.
The conditional probability tables are:
        E/R     y      n
PR∣E =   y     0.99   0.01
         n     0.05   0.95
                  E/B     y      n
PA∣B,E (y∣., .) =  y     0.98   0.95
                   n     0.95   0.03
        y      n                    y       n
PB =   0.01   0.99           PE =  0.001   0.999
Find
PB∣A (y∣y) and PB∣A,R (y∣y, y).
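A sketch of this network in gRain (the node names, and the convention that CPT values run with the
child's states fastest and then the parents in the order listed, are assumptions):
> library(gRain)
> yn <- c("yes", "no")
> b <- cptable(~burglar, values = c(1, 99), levels = yn)
> e <- cptable(~tremor, values = c(1, 999), levels = yn)
> r <- cptable(~radio | tremor, values = c(99, 1, 5, 95), levels = yn)
> a <- cptable(~alarm | burglar:tremor,
+              values = c(98, 2, 95, 5, 95, 5, 3, 97), levels = yn)
> net <- compile(grain(compileCPT(list(b, e, r, a))))
> querygrain(setFinding(net, nodes = "alarm", states = "yes"),
+            nodes = "burglar")
> querygrain(setFinding(net, nodes = c("alarm", "radio"),
+                       states = c("yes", "yes")), nodes = "burglar")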
2. Consider the conditional probability tables
        A/B    b1    b2    b3    b4
PB∣A =   a1    0.6   0.1   0.2   0.1
         a2    0.2   0.5   0.1   0.2
and
        B/C    c1    c2
         b1    0.8   0.2
PC∣B =   b2    0.8   0.2
         b3    0.2   0.8
         b4    0.2   0.8
Figure 9.8: Sore Throat Network
3. Consider the Bayesian Network in Figure 9.8. You have a sore throat (T). There are two possible
causes; either you have a cold (C), or else you have Green Monkey Disease (G). A symptom of
GMD is spots (S).
The conditional probabilities are:
                  G/C     y       n
PF∣C,G (y∣., .) =  y     0.990   0.700
                   n     0.800   0.200
                  G/C     y       n
PT∣C,G (y∣., .) =  y     0.999   0.900
                   n     0.800   0.300
               y       n
PS∣G (y∣.) =  0.010   0.001          PC (y) = 0.20,   PG (y) = 0.10
(a) Let E = (F, T, S) and enter the evidence e = (n, n, y). That is, {F = n, T = n, S = y}.
(b) Compute the updated joint probability distribution of C,G given the evidence, PC,G∣E (., .∣e).
(c) Compute the most probable explanation of the evidence e. This is the configuration of the
remaining variables V /E that gives the largest value for PE∣V /E . It therefore also maximises
PV /E∣E (.∣e).
(d) Consider a vector of evidence variables E = (E1 , . . . , Em ), instantiated as e = (e1 , . . . , em ).
The conflict measure of the evidence is defined as:
4. Now consider the sachs.interventional.txt data in the notes. Find the moral graph,
triangulate it and construct a junction tree.
For the network with the parameters given in the notes from the sachs.interventional.txt
data, note that PKA is a parent of all the nodes in the praf -> pmek -> p44.42 -> pakts473
chain. Use the junction tree algorithm to update the probabilities over these nodes given the
evidence that PKA is LOW, and given the evidence that PKA is HIGH.
Use any of the other techniques discussed on this network.
Chapter 10
Conditional Gaussian Variables
where Xj denotes the state space for variable j . The following notation will also be used:
X = X∆ × XΓ .
Attention is restricted to the case where the continuous variables, conditioned on the discrete
variables, have a Gaussian distribution, so XΓ = R^∣Γ∣ . For the discrete variables,
Xj = {i_j^(1) , . . . , i_j^(kj) }.
The following notation will be used to indicate that a random vector X1 conditioned on X2 = x2 has
distribution F :
X1 ∣ X2 = x2 ∼ F.
The moment generating function is useful for the definition of a multivariate normal distribution.
The moment generating function is useful because it uniquely determines the distribution of a random
vector X . That is, a joint probability determines a unique moment generating function, and the
moment generating function uniquely determines a corresponding joint probability. The moment
generating function is essentially a Laplace transform.
A multivariate normal distribution is defined as follows:
ϕ(p1 , . . . , pd ) = exp { ∑_{j=1}^{d} pj µj + (1/2) ∑_{j,k} pj pk Cjk } ,   p ∈ R^d .
If a random vector X ∼ N (µ, C), then E[Xi ] = µi for each i = 1, . . . , d and Cov(Xi , Xj ) = Cij for each
(i, j). If C is positive definite, then the joint density function of X = (X1 , . . . , Xd ) is given by
πX1,...,Xd (x1 , . . . , xd ) = (1 / ((2π)^{d/2} ∣C∣^{1/2})) exp {− (1/2) (x − µ)C^{−1} (x − µ)^t } ,   x ∈ R^d ,
where x = (x1 , . . . , xd ) and µ = (µ1 , . . . , µd ) are row vectors and ∣C∣ denotes the determinant of C .
Such a random vector is written X ∼ CG(∣∆∣, ∣Γ∣); if the numbers of discrete and continuous random
variables are, respectively, ∣∆∣ = p and ∣Γ∣ = q , then X ∼ CG(p, q).
πXΓ∣X∆ (x∣i) = (1 / ((2π)^{∣Γ∣/2} ∣C(i)∣^{1/2})) exp {− (1/2) (x − µ(i))C(i)^{−1} (x − µ(i))^t } ,   (10.2)
for each i ∈ X∆ with PX∆ (i) > 0.
For this discussion, it is assumed that PX∆ (i) > 0 for each i ∈ X∆ .
g(i) = log PX∆ (i) + (1/2) (log det K(i) − ∣Γ∣ log 2π − µ(i)K(i)µ(i)^t ) .   (10.6)
From Equation (10.2), it is clear that conditioning on the discrete variables gives a family of
multivariate normal distributions. The canonical parameters of the Gaussian distribution are defined
as (h(i), K(i)) and the mean parameters as (µ(i), C(i)). Conditioned on X∆ = i,
E [XΓ ∣ X∆ = i] = µ(i)
and
E [(XΓ − µ(i))^t (XΓ − µ(i)) ∣ X∆ = i] = C(i)
(where the random vectors are taken to be row vectors).
Parametrisation of the CG Distribution The canonical parameters for the joint distribution,
defined by the pair of functions (PX∆ , πXΓ∣X∆ ), are defined as (g, h, K), where the parameters
(h(i), K(i)) are defined by Equations (10.4) and (10.5) respectively and g(i) is defined by Equation
(10.6). Similarly, the mean parameters are defined as (P, µ, C), where (µ(i), C(i)) are the mean
parameters of the conditional distribution and P(i) is the probability function over the discrete
variables.
Proof The following calculation shows that XA∩Γ ∣ {XB = xB} ∪ {XA∩∆ = xA∩∆} has a multivariate
Gaussian distribution. Firstly, it is clear that
The conditional density function on the right-hand side is obtained by conditioning the distribution
of XA∩Γ ∣ X∆ = x∆ on XB∩Γ = xB∩Γ . Since (XA∩Γ , XB∩Γ ) ∣ X∆ = x∆ has a multivariate Gaussian
distribution, and the conditional distribution of a multivariate Gaussian, conditioned on some of its
component variables, is again multivariate Gaussian, it follows that the conditional distribution is
multivariate Gaussian. The proof is complete.
If the variables to be marginalised are discrete, then complicated mixture distributions arise. The
following theorem gives a situation where the marginalisation yields a CG distribution.
Proposition 10.5. Let A ⊆ V denote a subset of the indexing set for the variables. If X is CG and
B = V ∖ A (namely, B is the set of all indices in V that are not in A) and B ⊆ ∆ and
X B ⊥ X Γ ∣ X ∆∖B ,
then X A ∼ CG.
Proof Clearly, from the definition of a CG distribution, it is necessary and sufficient to show that
the distribution of XA∩Γ conditioned on XA∩∆ is multivariate normal, with dimension ∣A ∩ Γ∣. The
proof requires the following identity: if XB ⊥ XΓ ∣ X∆∖B , then
This is a straightforward consequence of the definition of conditional independence. Recall that, from
the definition of a CG distribution, πXΓ∣X∆ (xΓ ∣ x∆ ) is multivariate normal. Therefore the conditional
distribution of XΓ conditioned on X∆/B is multivariate Gaussian, and therefore the conditional
distribution of XΓ∩A conditioned on X∆/B is multivariate Gaussian. The proof is complete.
10.1.2 CG Regression
An important special case of CG distributions are those that follow a CG regression. The requirement
here is that the continuous variables depend linearly on their continuous parents. This is the situation
treated by most software with a facility for CG distributions.
Definition 10.6 (CG Regression). Let Z = (Z1 , . . . , Zs ) be a continuous random (row) vector and let
I be a discrete random (row) vector with probability function pI . Let I denote the state space for I .
If a random (row) vector Y = (Y1 , . . . , Yr ) has the property that
Y ∣ {I = i, Z = z} ∼ N (α(i) + zB(i), Σ(i)) ,
where α(i) is a 1 × r row vector, B(i) is an s × r matrix and Σ(i) is an r × r covariance matrix, then
Y is said to follow a CG regression.
Let X denote a random vector, containing both discrete and continuous variables, which have been
ordered so that the probability distribution may be factorised along a Directed Acyclic Graph G =
(V, D). Let Xγ be a continuous variable, with parent set Π(γ). Suppose that X has a CG distribution
that satises the additional CG regression requirement. Then the conditional distribution for Xγ ,
conditioned on its parent nodes Π(γ) is the CG regression
ϕ(i, z, xγ ) = (1 / (√(2π) σ(i))) exp {− (xγ − α(i) − zβ^t )² / (2σ²(i)) } .   (10.7)
Example 10.7.
This example is taken from [84]. The emissions from a waste incinerator differ because of
compositional differences in incoming waste. Another important factor is the way in which the waste
is burnt, which can be monitored by measuring the concentration of carbon dioxide in the emissions.
The efficiency of the filter depends on its technical state and also on the amount and composition of
the waste. The emission of heavy metals depends both on the concentration of metals in the incoming
waste and the emission of dust particles in general. The emission of dust is monitored by measuring
the penetration of light.
The situation may be modelled using the directed acyclic marked graph (DAMG) in Figure 10.1;
marked because there are two types of nodes. In this case, these are discrete and continuous. In
HUGIN, nodes with a double circle are continuous nodes. The categorical variables are F : filter
state, W : waste type, B : method of burning. The continuous variables are Min : metals in the waste,
Mout : metals emitted, E : filter efficiency, D: dust emission, C : carbon dioxide concentration in the
emission and L: light penetration. The set ∆ = {F, W, B} is the set of discrete variables, while
Γ = {C, D, E, L, Min , Mout } is the set of continuous variables.
If HUGIN is being used then, having entered the graph (using double circles to indicate `Gaussian'
nodes), the conditional probability distributions can be inserted. For a conditional Gaussian distribution, the
Figure 10.1: Waste Incinerator Network
continuous nodes cannot have discrete nodes as descendants. For a continuous node, HUGIN requests
the mean and variance, the parameters to describe a CG regression.
10.2 The Junction Tree for Conditional Gaussian Distributions
This section describes a junction tree approach, due to Lauritzen [81] (1992), for finding the updated
conditional Gaussian distribution when hard evidence is inserted on some of the nodes. The problem
here is that while marginalising a CG distribution over one of its continuous variables gives another CG
distribution, marginalising a CG distribution over one of its discrete variables does not necessarily give
a CG distribution. Therefore, care has to be taken in the construction of the junction tree. Ideally, the
junction tree should be constructed so that the marginal distributions over the cliques and separators
are CG distributions, to enable appropriate marginalisations to be made. This requires some additional
restrictions on the construction of the cliques and separators.
Definition 10.8 (Marked Graph). A marked graph is a graph with several types of nodes; the type
of a node is its mark.
In the context of directed acyclic graphs for conditional Gaussian distributions, there are two markings;
discrete and continuous, for the types of variables represented by each type of node.
Definition 10.9 (CG Decomposition). A triple (A, B, S) of disjoint subsets of the node set V of an
undirected marked graph G is said to form a CG decomposition of G if V = A ∪ B ∪ S and the
following three conditions hold:
1. S separates A from B ,
2. S is a complete subset of V ,
3. Either S ⊆ ∆, or B ⊆ Γ or both.
When this holds, (A, B, S) is said to CG-decompose G into the components GA∪S and GB∪S .
If only the first two conditions hold, then (A, B, S) is said to form a decomposition. Thus, a
decomposition ignores the markings of the graph, while a CG decomposition takes them into account.
The logic is as follows: if B contains only continuous nodes with multivariate Gaussian distribution,
then the marginal over the separator will again be multivariate Gaussian. If the separator contains
only discrete nodes and B both continuous and discrete, then marginalising rst over all the Gaussian
nodes in B and then marginalising over the discrete nodes in B not in the separator gives the exact
probability distribution over the separator.
Decomposable unmarked graphs are triangulated; any cycle of length 4 or more has a chord. CG
decomposable marked graphs are further characterised by requiring that if there is a path between
two discrete variables containing only continuous variables, then there is an edge between the two
discrete variables.
Proposition 10.11. For an undirected marked graph G , the following are equivalent:
1. G is CG decomposable.
2. G is triangulated, and for any path (δ1 , α1 , . . . , αn , δ2 ) between two discrete nodes (δ1 , δ2 ) where
(α1 , . . . , αn ) are all continuous, δ1 and δ2 are neighbours.
3. For any α and β in G , every minimal (α, β) separator is complete. If both α and β are discrete,
then their minimal separator contains only discrete nodes.
Proof of 1 ⇒ 2 The proof, as before for unmarked graphs, is by induction. The inductive hypothesis
is: all undirected CG decomposable graphs with n or fewer nodes are triangulated and satisfy the
conditions of statement 2.
This is clearly true for a graph on one node.
Let G be a CG decomposable graph on n + 1 nodes.
Either G is complete, in which case the properties of statement 2 clearly follow,
Or There exist sets A, B , S , where V = A ∪ B ∪ S , where either B ⊆ Γ or S ⊆ ∆ or both, and such
that GA∪S and GB∪S are CG decomposable. Then any cycle of length 4 without a chord must pass
through both A and B . By decomposability, S separates A from B . Therefore the cycle must pass
through S at least twice. Since S is complete, the cycle will therefore have a chord. Since GA∪S and
GB∪S are triangulated, it follows that G is also triangulated. If the nodes of S are discrete, it follows
that any path between two discrete variables passing through S satisfies the condition of statement 2.
If B ⊆ Γ, then since all paths in GA∪S and all paths in GB∪S satisfy the condition of statement 2, it is
clear that all paths passing through S will also satisfy the condition of statement 2. It follows that G
satisfies the conditions of statement 2.
Proof of 2 ⇒ 3 Assume that G is triangulated, with the additional property in statement 2. Consider
two nodes α and β and let S be their minimal separator. Let A denote the set of all nodes that may
be connected to α by a trail that does not contain nodes in S and let B denote all nodes that may be
connected to β by a trail that does not contain nodes in S . Every node γ ∈ S must be adjacent to some
node in A and some node in B , otherwise GV /(S/{γ}) would not be connected. This would contradict
the minimality of S , since S/{γ} would separate α from β . Suppose that the condition in statement
2 holds and consider the minimal separator for two discrete nodes α, β , which are not neighbours.
The separator is complete. Denote the separator by S . Consider Ŝ , which is S with the continuous
nodes removed. Then Ŝ separates α and β on the subgraph induced by the discrete variables. But
the condition of statement 2 implies that α and β are then also separated by Ŝ on G . Therefore, Ŝ
separates α and β . It follows that the minimal separator for two discrete nodes contains only discrete
nodes.
Proof of 3 ⇒ 1 If G is complete, it follows that every node is discrete and the result is clear. Let α
and β be two discrete nodes that are not contained within their minimal separator. Let S denote their
minimal separator. Let A denote the maximal connected component of V /S containing α and let B = V /(A ∪ S).
Then (A, B, S) provides a decomposition, with S ⊆ ∆. Suppose that two such discrete nodes cannot
be found. Let α and β be two nodes that are not contained within their minimal separator, where β
is continuous. Let S denote the minimal separator. Let B denote the largest connected component
of V /S containing β . Suppose that B contains a discrete node γ . Then S separates γ from α and
therefore consists entirely of discrete nodes. Therefore, either S ⊆ ∆, or B ⊆ Γ, as required.
The construction of the junction tree has to be modified. Starting from the directed acyclic graph,
the graph is first moralised by adding links between all the parents of each variable and then making
all the edges undirected, as before. Then, sufficient edges are added to ensure that the graph is
CG-decomposable.
Next, a junction tree is constructed. As before, this is an organisation of a collection of subsets of
the variables V into a tree, such that if A and B are two nodes on the junction tree, then the variables
in A ∩ B appear in each node on the path between A and B .
Definition 10.12 (CG Root). A node R on a junction tree is a CG root if any pair of neighbours A,
B , such that A lies on the path between R and B (so that A is closer to R than B ), satisfies
B ∖ A ⊆ Γ   or   A ∩ B ⊆ ∆ (or both).
This condition is equivalent to the statement that the triple (A/(A ∩ B), B/(A ∩ B), A ∩ B) forms
a CG decomposition of GA∪B . This means that when a separator between two neighbouring cliques
is not purely discrete, the clique furthest away from the root has only continuous nodes beyond the
separator.
Theorem 10.13. The cliques of a CG decomposable marked graph can be organised into a junction
tree with at least one CG root.
Proof As with the unmarked graph, choose simplicial nodes, one after the other. This is done in
such a way that either the separator (the nodes not removed) is all discrete, or else the nodes that
are removed are all continuous, until it is not possible to find any other such nodes.
The remaining graph is therefore a clique, by the following arguments: either all the remaining
discrete nodes are in the same clique, or else there is not a simplicial discrete node, since the minimal
separator between two discrete nodes consists entirely of discrete nodes. Assume there is not a simplicial
discrete node. If there are discrete nodes remaining, then the family of any simplicial continuous node
contains a discrete node that does not have neighbours outside the family and is therefore simplicial.
It follows that all the discrete nodes are in the same clique, the family of any remaining continuous
node.
The final clique, constructed in this way, clearly satisfies the properties of a CG root.
PX∣Πc(X),Πd(X) (x∣z, y) = (1 / (2πγ(y))^{1/2}) exp {− (x − α(y) − β(y)z^t )² / (2γ(y)) } ,
where Πc (X) denotes the continuous parents and Πd (X) the discrete parents of X . Here, α is a
function, β is a (row) vector of the same length as z and γ is the conditional variance.
For the separators S , the initialisation is: ϕS ≡ 1 for each S ∈ S .
From this, expanding the parentheses, taking logarithms and identifying terms gives the canonical
parameters (gX , hX , KX ) for PX∣Π(X) . The log partition function is
gX (y) = − α(y)² / (2γ(y)) − (1/2) log(2πγ(y)),
and the other parameters are given by
hX (y) = (α(y)/γ(y)) ( 1   −β(y) )
and
KX (y) = (1/γ(y)) (    1         −β(y)
                    −β(y)^t   β(y)^t β(y) ) .
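As a numerical check on these formulas, a small sketch in R (the function name and inputs are
illustrative only, not from the text):
> cg.canonical <- function(alpha, beta, gamma) {
+   ## g, h, K for a univariate CG regression with mean alpha + z beta^t
+   g <- -alpha^2 / (2 * gamma) - 0.5 * log(2 * pi * gamma)
+   h <- (alpha / gamma) * c(1, -beta)
+   K <- rbind(c(1, -beta), cbind(-beta, outer(beta, beta))) / gamma
+   list(g = g, h = h, K = K)
+ }
> cg.canonical(alpha = 1, beta = c(0.5, -0.2), gamma = 2)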
ϕY,X1,X2 (y, x1 , x2 ) = χ(y) exp { g(y) + h1 (y)x1^t + h2 (y)x2^t
− (1/2) (x1 , x2 ) ( K11  K12 ; K12^t  K22 ) (x1 , x2 )^t } ,
where
χ(y) = 1 if PY (y) > 0, and χ(y) = 0 if PY (y) = 0.
K is symmetric and the triple (g, h, K) represents the canonical characteristics. Recall the standard
result that, taking z ∈ R^p as a row vector and K a positive definite p × p symmetric matrix,
(1/(2π)^{p/2}) ∫_{R^p} exp {− (1/2) zKz^t } dz = 1/√(det(K)),
and hence that, for a ∈ R^p and K a positive definite p × p symmetric matrix,
∫_{R^p} exp {az^t − (1/2) zKz^t } dz = exp {(1/2) aK^{−1}a^t } (2π)^{p/2} / √(det(K)).
From this it follows, after some routine calculation, that if X1 is a random p-vector with positive
definite covariance matrix, then
∫_{R^p} ϕY,X1,X2 (y, x1 , x2 ) dx1 = χ(y) exp { g̃(y) + h̃(y)x2^t − (1/2) x2 K̃ x2^t } ,
where
g̃(y) = g(y) + (1/2) ( p log(2π) − log det(K11 (y)) + h1 (y)K11 (y)^{−1} h1 (y)^t ) ,
h̃(y) = h2 (y) − h1 (y)K11 (y)^{−1} K12 (y) and K̃(y) = K22 (y) − K12 (y)^t K11 (y)^{−1} K12 (y).
If neither h nor K depends on y2 , then summing over y2 gives
ϕ̃(y1 , x) = exp { h̃(y1 )x^t − (1/2) xK̃(y1 )x^t } ∑_{y2} χ(y1 , y2 ) exp {g(y1 , y2 )} .
The function ϕ̃ is therefore CG, with canonical characteristics g̃(y1 ) = log ∑_{y2} exp {g(y1 , y2 )}
and h̃, K̃ as before.
If either h or K depends on y2 , then marginalisation will not produce a CG distribution, so an
approximation is used. For this, it is convenient to consider the mean parameters (P, C, µ), where
P(y1 , y2 ) = P((Y1 , Y2 ) = (y1 , y2 )) and (µ(y1 , y2 ), C(y1 , y2 )) are the conditional mean and covariance.
The approximation is as follows: ϕ̃ is defined as CG with mean parameters (P̃, C̃, µ̃) given by
P̃(y1 ) = ∑_{y2} P(y1 , y2 ),
µ̃(y1 ) = (1/P̃(y1 )) ∑_{y2} P(y1 , y2 )µ(y1 , y2 ),
C̃(y1 ) = (1/P̃(y1 )) ∑_{y2} P(y1 , y2 ) ( C(y1 , y2 ) + (µ(y1 , y2 ) − µ̃(y1 ))^t (µ(y1 , y2 ) − µ̃(y1 )) ) .
It is relatively straightforward to compute that this approximate marginalisation has the correct ex-
pected value and second moments.
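A minimal numerical sketch of this weak marginalisation, for one continuous dimension (all numbers
are illustrative, not from the text):
> P  <- matrix(c(0.3, 0.2, 0.1, 0.4), nrow = 2,
+              dimnames = list(y1 = c("a", "b"), y2 = c("u", "v")))
> mu <- matrix(c(0.0, 1.0, 2.0, 3.0), nrow = 2)   # mu(y1, y2)
> C  <- matrix(c(1.0, 0.5, 2.0, 1.5), nrow = 2)   # C(y1, y2), variances
> P.tilde  <- rowSums(P)
> mu.tilde <- rowSums(P * mu) / P.tilde
> C.tilde  <- rowSums(P * (C + (mu - mu.tilde)^2)) / P.tilde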
Marginalising over both Discrete and Continuous When marginalising over both types of
variables, rst the continuous variables are marginalised, and then the discrete.
Entering Evidence Two types of evidence can be entered; rstly, evidence that a continuous variable
Y is instantiated as y for some y ∈ R. Suppose PX∣Π(X) has canonical characteristics (g, h, K), where
either X = Y or Y ∈ Π(X). The vector (X, Π(X)) may be re-ordered so that Y appears last, so that
the canonical characteristics are written as
fSd (s) = 0 if the evidence states that s is impossible, and fSd (s) = 1 otherwise.
The Fully Active Schedule The fully active schedule may now be applied. Firstly, the evidence
is inserted. This is hard evidence: certain states for discrete variables are excluded, or the continuous
variables take certain fixed values. The information then has to be propagated. Starting at the leaves,
send all messages to a CG root. A message from C to C′ computes ϕ∗S = ∑_{C/S} ϕC , where the sum
denotes an integral for a continuous variable and a sum for a discrete variable, updates ϕC′ to
ϕ∗C′ = (ϕ∗S / ϕS ) ϕC′ , and updates ϕS to ϕ∗S .
Note that multiplying or dividing two functions simply involves computing the canonical
characteristics (either exactly, or those of the approximating function); if ϕ1 has characteristics
(g1 , h1 , K1 ) and ϕ2 has characteristics (g2 , h2 , K2 ), then ϕ1 ϕ2 has characteristics
(g1 + g2 , h1 + h2 , K1 + K2 ) and ϕ1 /ϕ2 has characteristics (g1 − g2 , h1 − h2 , K1 − K2 ).
When the root has received all messages, at this stage the potential over the root is normalised.
That is, it is multiplied by a suitable constant to make it a probability. All the messages propagated
to the CG root are proper marginalisations and therefore the distribution over the CG root, after the
evidence is received, is an exact CG distribution.
For the propagation back out to the leaves, it will not, in general, be possible to make exact
marginalisations. The same procedure is used; for a message from C to C′ separated by S , set
ϕ∗S = ∑_{C/S} ϕC , update ϕC′ to (ϕ∗S / ϕS )ϕC′ and update ϕS to ϕ∗S .
Having inserted hard evidence and run the schedule, since the potential over the root has been
normalised, the resulting functions are probability distributions.
The approximate marginalisations give an approximate update but, by construction, since the tree
has a strong root, the tree will be consistent: the exact marginalisation of a clique in the direction of
the strong root gives exactly the approximating distribution over the separator that is produced by
the approximate marginalisation when computing away from the root.
The Termination Although the resulting algorithm has produced approximate distributions over
the cliques, which are conditional Gaussian, with the correct expectation vector and covariance struc-
ture, it should be clear from the algorithm that dividing the function over the clique by the function
over the adjacent separator in the direction of the root gives the exact conditional distribution of the
clique conditioned on the separator.
Notes The application of junction tree methods to conditional Gaussian distributions was taken from
Lauritzen [81].
10.4 Exercises
1. Let
X = (X ∆ , XΓ ) ∼ CG(∣∆∣, 1).
Let I denote the state space for X ∆ and let P denote the probability function for the random
vector X ∆ . Prove that
E [XΓ ] = ∑_{i∈I} P(i)µ(i)
and
V (XΓ ) = ∑_{i∈I} P(i)σ(i)² + ∑_{i∈I} P(i) (µ(i) − E [XΓ ])² .
2. Let X ∼ CG(2, 2) and let I1 and I2 be binary variables. Find the canonical parameters for the
distribution.
3. Show that if a Conditional Gaussian Distribution is marginalised over a subset of the continuous
variables, the resulting distribution is again a CG distribution. Find the canonical characteristics
of the marginal distribution in terms of the original canonical characteristics, stating the standard
results about multivariate normal random variables that you are using.
4. Suppose that hard evidence is entered into a subset of the continuous variables of a CG distri-
bution. Show that the updated distribution is again a CG distribution and express the mean
parameters (conditional expectation vector and covariance matrix) of the updated distribution
in terms of the mean parameters of the original distribution.
5. This example is taken from Lauritzen [84]. It is a fictitious problem connected with controlling
the emission of heavy metals from a waste incinerator. The type of incoming waste W affects
the metals in the waste Min , the dust emission D and the filter efficiency E . The quantity of
metals in the waste Min affects the metals emission Mout . Another important factor is the waste
burning regimen B , which is monitored via the carbon dioxide concentration in the emission C .
The burning regimen, the waste type and the filter efficiency E affect the dust emission D. The
dust emission affects the metals emission and it is monitored by recording the light penetration
L. The state of the filter F (whether it is intact or defective) affects E .
The variables F , W , B are qualitative variables with states (the filter is either intact or defective,
the waste is either industrial or household, the burning regimen is either stable or unstable). The
variables E , C , D, L, Min and Mout are continuous. The directed acyclic marked graph is given
in Figure 10.1.
and E = log ρ. The variable D, dust emission, is again on a logarithmic scale, as is C , the
CO2 concentration and L, the light penetrability. Light penetrability is roughly inversely
proportional to the square root of dust concentration. The metal in waste Min and metal
emission Mout variables are on logarithmic scales.
Suppose that the waste burned is of industrial type (W = 1), the light penetration variable
is measured as L = 1.1 and the CO2 concentration is measured as C = −0.9.
Find the updated probability distributions for B and F and the updated means and vari-
ances for Min , Mout and D.
Chapter 11
Gaussian and Conditional Gaussian Graphical Models in R
The packages ggm, deal, glasso, gRc, pcalg, bnlearn, gRim are useful.
Consider X ∼ N (µ, Σ). The matrix K = Σ^{−1} is known as the concentration matrix. The partial
correlation between Xu and Xv given all the other variables may be derived from K as
ρuv∣V /{u,v} = − Kuv / √(Kuu Kvv ) .
Thus, the independence graph does not have an edge u ↔ v if and only if Kuv = 0.
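As a quick sketch of this relationship (using the carcass data introduced just below, the sample
covariance, and the conc2pcor() function that appears later in the chapter):
> library(gRbase)
> data(carcass)
> S.carc <- cov(carcass)   # presumably how S.carc, used below, was formed
> K <- solve(S.carc)       # concentration matrix
> round(conc2pcor(K), 2)   # partial correlations; near-zero suggests no edge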
Consider an illustrative example of `carcass' data:
> library(gRbase)
> data(carcass)
> head(carcass)
Fat11 Meat11 Fat12 Meat12 Fat13 Meat13 LeanMeat
1 17 51 12 51 12 61 56.52475
2 17 49 15 48 15 54 57.57958
3 14 38 11 34 11 40 55.88994
4 17 58 12 58 11 58 61.81719
5 14 51 12 48 13 54 62.95964
6 20 40 14 40 14 45 54.57870
Fat13 is conditionally independent of Meat12 and LeanMeat is also conditionally independent of Meat12.
A stepwise backward model selection procedure can be carried out as follows:
>library(gRim)
> sat.carc = cmod(~.^.,data=carcass)
> aic.carc = stepwise(sat.carc)
> library(Rgraphviz)
> plot(as(aic.carc,"graphNEL"),"fdp")
The BIC gives a higher penalty for complexity and also removes the edge between Fat13 and Meat13.
> edge.carc=cmod(edgeList(as(gen.carc,"graphNEL")),data=carcass)
> edge.carc
Model: A cModel with 7 variables
graphical : TRUE decomposable : FALSE
-2logL : 11387.24 mdim : 22 aic : 11431.24
ideviance : 2453.99 idf : 15 bic : 11515.73
deviance : 19.79 df : 6
The matrix K is estimated by iterative proportional scaling. The point is that the estimate has to
satisfy the constraint that Kuv = 0 when there is no edge u ↔ v .
> carcfit1 =
+ ggmfit(S.carc,n=nrow(carcass),edgeList(as(gen.carc,"graphNEL")))
> carcfit1[c("dev","df","iter")]
$dev
[1] 19.78537
$df
[1] 6
$iter
[1] 774
Hypothesis Testing A likelihood ratio test, to see whether model M1 gives a better fit than model
M2 , may be carried out as follows:
$df
[1] 2
This would indicate that the smaller model does not fit well.
The function ciTest_mvn() tests single conditional independence hypotheses. To test LeanMeat ⊥
Meat13 ∣ remaining variables:
> ciTest_mvn(list(cov=S.carc,n.obs=nrow(carcass)),
+ set=~LeanMeat+Meat13+Meat11+Meat12+Fat11+Fat12+Fat13)
Testing LeanMeat _|_ Meat13 | Meat11 Meat12 Fat11 Fat12 Fat13
Statistic (DEV): 1.687 df: 1 p-value: 0.1940 method: CHISQ
Gaussian conditional independence can be tested from the pcalg package as follows:
> library(pcalg)
> C.carc=cov2cor(S.carc)
> gaussCItest(7,2,c(1,3,4,5,6),list(C=C.carc,n=nrow(carcass)))
[1] 0.003077247
For a decomposition (A, B, S) of the independence graph, the maximum likelihood estimate of the
concentration matrix combines the estimates from the two components:
K̂ = (K̂A∪S )^{A∪B∪S} + (K̂B∪S )^{A∪B∪S} − ((SS,S )^{−1})^{A∪B∪S} ,
where the superscript A ∪ B ∪ S indicates that the matrix is padded with zeros up to the full
dimension.
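In the code below, K.hat must already exist; presumably it was initialised as a zero matrix, e.g.:
> K.hat <- matrix(0, nrow = ncol(carcass), ncol = ncol(carcass),
+                 dimnames = list(names(carcass), names(carcass)))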
> AC=c("Fat11","Fat12","Fat13","Meat11","LeanMeat")
> BC=c("Meat11","Meat12","Meat13","Fat11","Fat12")
> C=c("Fat11","Fat12","Meat11")
> K.hat[AC,AC]=K.hat[AC,AC]+solve(S.carc[AC,AC])
> K.hat[BC,BC]=K.hat[BC,BC]+solve(S.carc[BC,BC])
> K.hat[C,C]=K.hat[C,C]-solve(S.carc[C,C])
> round(100*K.hat)
Fat11 Meat11 Fat12 Meat12 Fat13 Meat13 LeanMeat
Fat11 44 1 -20 -7 -16 6 10
Meat11 1 16 -4 -6 -4 -5 -5
Fat12 -20 -4 54 6 -20 -4 9
Meat12 -7 -6 6 14 0 -9 0
Fat13 -16 -4 -20 0 55 0 7
Meat13 6 -5 -4 -9 0 16 0
LeanMeat 10 -5 9 0 7 0 26
> Sigma.hat=solve(K.hat)
> round(Sigma.hat,2)
Fat11 Meat11 Fat12 Meat12 Fat13 Meat13 LeanMeat
Fat11 11.34 0.74 8.42 2.06 7.66 -0.76 -9.08
Meat11 0.74 32.97 0.67 35.94 2.01 31.97 5.33
Fat12 8.42 0.67 8.91 0.31 6.84 -0.60 -7.95
Meat12 2.06 35.94 0.31 51.79 2.45 41.47 5.41
Fat13 7.66 2.01 6.84 2.45 7.62 0.89 -6.93
Meat13 -0.76 31.97 -0.60 41.47 0.89 41.44 6.43
LeanMeat -9.08 5.33 -7.95 5.41 -6.93 6.43 12.90
> round(S.carc,2)
Fat11 Meat11 Fat12 Meat12 Fat13 Meat13 LeanMeat
Fat11 11.34 0.74 8.42 2.06 7.66 -0.76 -9.08
Meat11 0.74 32.97 0.67 35.94 2.01 31.97 5.33
Fat12 8.42 0.67 8.91 0.31 6.84 -0.60 -7.95
Meat12 2.06 35.94 0.31 51.79 2.18 41.47 6.03
Fat13 7.66 2.01 6.84 2.18 7.62 0.38 -6.93
Meat13 -0.76 31.97 -0.60 41.47 0.38 41.44 7.23
LeanMeat -9.08 5.33 -7.95 6.03 -6.93 7.23 12.90
Model Search using gRim Setting search="headlong" causes edges to be searched in a random
order, which can make the search faster.
> ind.carc=cmod(~.^1,data=carcass)
> set.seed(123)
> forw.carc=stepwise(ind.carc,search="headlong",
+ direction="forward",k=log(nrow(carcass)),details=0)
> forw.carc
Model: A cModel with 7 variables
graphical : TRUE decomposable : TRUE
-2logL : 11393.53 mdim : 23 aic : 11439.53
ideviance : 2447.70 idf : 16 bic : 11527.87
deviance : 26.08 df : 5
> plot(forw.carc,"neato")
The function essentialGraph() from the ggm package returns the essential graph of a DAG. For
example:
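The example is missing in this copy; a minimal sketch with a small DAG built via ggm's DAG()
would be:
> library(ggm)
> d <- DAG(y ~ x, z ~ y)   # a DAG specified by regression formulas
> essentialGraph(d)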
Model Selection A DAG may be established using the pcalg package. The PC algorithm may be
used to find the skeleton. The pcalg::skeleton command ensures that the relevant version of the
skeleton command is used.
> library(pcalg)
> c.carc=cov2cor(S.carc)
> suffStat=list(C=c.carc,n=nrow(carcass))
> indepTest=gaussCItest
> skeleton.carc=pcalg::skeleton(suffStat,gaussCItest,p=ncol(carcass),alpha=0.05)
> nodes(skeleton.carc@graph)=names(carcass)
> names(carcass)
[1] "Fat11" "Meat11" "Fat12" "Meat12" "Fat13" "Meat13"
[7] "LeanMeat"
> str(skeleton.carc@sepset[[1]])
List of 7
$ : NULL
$ : int(0)
$ : NULL
$ : int(0)
$ : NULL
$ : int(0)
$ : NULL
This is read as follows: the first variable, Fat11, was found to be marginally independent of Meat11,
Meat12 and Meat13; this is seen from the designation int(0) (separation by the empty set), while
NULL indicates that no separating set was found. Similarly,
> str(skeleton.carc@sepset[[2]])
List of 7
$ : NULL
$ : NULL
$ : int(0)
$ : NULL
$ : int 4
$ : NULL
$ : int 6
This indicates that Meat11 (the second variable) is marginally independent of Fat12 (the third variable), conditionally independent of Fat13 (the fifth variable on the list) given Meat12 (the fourth variable on the list), and conditionally independent of LeanMeat given Meat13, etc.
In pcalg, there are several options for turning a skeleton together with sep-sets into a DAG. These
are:
udag2pdag()
udag2pdagRelaxed()
udag2pdagSpecial()
Read the help pages to find out the differences. For example:
> pdag.carc=udag2pdagRelaxed(skeleton.carc,verbose=0)
> nodes(pdag.carc@graph)=names(carcass)
> plot(pdag.carc@graph,"neato")
Undirected edges are shown as double-arrowed edges. This graph is not an essential graph; the arrow from Meat12 to Meat13 is not part of an immorality, nor is it a compelled edge.
Both steps (skeleton and edge orientation) can be called simultaneously using the function pc().
For example,
> cpdag.carc=pc(suffStat,gaussCItest,p=ncol(carcass),alpha=0.05)
> plot(cpdag.carc@graph)
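An alternative is provided by the lcd package, which learns the Markov equivalence class of a chain graph from an undirected graph and its junction tree (here rendered with igraph):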
> library(lcd)
> ug<-naive.getug.norm(carcass,0.05)
> jtree<-ug.to.jtree(ug)
> cg<-learn.mec.norm(jtree,cov(carcass),nrow(carcass),0.01,"CG")
> icg<-as(cg,"igraph")
> E(icg)$arrow.mode<-2
> E(icg)[is.mutual(icg)]$arrow.mode<-0
> V(icg)$size<-40
> plot(icg,layout=layout.kamada.kawai)
The conditional density of the continuous variables Γ, given the discrete cell i, is Gaussian:
$$\pi_{\Gamma|\Delta}(y|i) = \frac{1}{(2\pi)^{q/2}|\Sigma|^{1/2}}\exp\Big\{-\frac{1}{2}(y-\mu(i))^{t}\Sigma^{-1}(y-\mu(i))\Big\}.$$
For illustration, consider two data sets from gRbase; milkcomp1 and wine.
The CGstats() function calculates the number of observations and means of the continuous vari-
ables for each cell i, together (by default) with a common covariance matrix.
> data(milkcomp1,package='gRbase')
> head(milkcomp1)
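The output below is from CGstats(); the call creating the object SS used later is not shown in the transcript, but is presumably:
> SS=CGstats(milkcomp1)
> SS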
$center
a b c d e f g
fat 6.64125 8.01000 7.0525 7.40125 8.13375 7.518571 6.97375
protein 5.48750 5.28750 5.4750 5.81750 5.26250 5.295714 5.58000
lactose 5.49125 5.48875 5.4675 5.31375 5.40625 5.382857 5.41500
$cov
fat protein lactose
fat 2.31288338 0.19928422 -0.07028198
protein 0.19928422 0.12288675 -0.03035208
lactose -0.07028198 -0.03035208 0.04529896
$cont.names
[1] "fat" "protein" "lactose"
$disc.names
[1] "treat"
$disc.levels
[1] 7
> apply(SS$center,1,sd)/apply(SS$center,1,mean)
fat protein lactose
0.07415672 0.03656048 0.01186589
> can.parms=CGstats2mmodParms(SS,type="ghk")
> print(can.parms,simplify=FALSE)
$g
treat
a b c d e f g
-745.4933 -729.3707 -740.4563 -743.5508 -712.6533 -710.4957 -740.1503
$h
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.7869693 1.628323 0.9975735 0.873605 1.686407 1.343883 0.8641906
[2,] 88.2214588 85.006209 87.6318341 90.151087 84.137184 84.816560 88.5106553
[3,] 181.5552534 180.651093 180.9626426 179.064184 178.337697 177.745064 180.1855748
$K
[,1] [,2] [,3]
[1,] 0.5055681 -0.7503065 0.2816613
[2,] -0.7503065 10.8648914 6.1157915
[3,] 0.2816613 6.1157915 26.6103828
$gentype
[1] "mixed"
$cont.names
[1] "fat" "protein" "lactose"
$disc.names
[1] "treat"
$disc.levels
[1] 7
Let j denote the level of the treatment factor; the relative variation of the canonical parameter h(j) across the levels may be assessed by the coefficients of variation:
> apply(can.parms$h,1,sd)/apply(can.parms$h,1,mean)
[1] 0.32484006 0.02614999 0.00793359
> conc2pcor(can.parms$K)
[,1] [,2] [,3]
[1,] 1.00000000 0.3201373 -0.07679125
[2,] 0.32013725 1.0000000 -0.35967845
[3,] -0.07679125 -0.3596784 1.00000000
This suggests that the partial correlation between fat and lactose is zero.
Conditional Gaussian Models To construct a marked graph, the information on marking has to
be provided:
> uG1=ug(~a:b+b:c+c:d)
> uG2=ug(~a:b+a:d+c:d)
> mcsmarked(uG1,discrete=c("a","d"))
character(0)
> mcsmarked(uG2,discrete=c("a","d"))
[1] "a" "d" "b" "c"
> plot(uG1)
> plot(uG2)
For the first graph, both a and d have to be in the CG-root, hence the root contains all the variables; the CG-Gaussian tree contains exactly one node. For the second graph, the CG-root is the clique {a, d}.
Using gRim for CG-models The function mmod() enables CG models to be defined and fitted. The parameters are obtained using coef(). The parametrisation may be specified as either canonical or mean field. For canonical parameters:
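The object milk used below is not created in this transcript; from the compareModels() output later in this section, its definition is evidently a mixed model with generators {treat, fat, protein} and {protein, lactose}:
> milk=mmod(~treat:fat:protein+protein:lactose, data=milkcomp1)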
> coef(milk,type="ghk")
$g
treat
a b c d e f g
-676.0550 -666.0859 -675.0546 -690.9918 -664.9730 -666.7805 -680.0217
$h
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] -1.134727 -0.2838037 -0.9178505 -1.021725 -0.2012326 -0.5374842 -1.043008
[2,] 84.348819 81.3413696 83.8953921 86.850963 81.0040255 81.8196051 84.952805
[3,] 164.633541 164.6335413 164.6335413 164.633541 164.6335413 164.6335413 164.633541
$K
[,1] [,2] [,3]
[1,] 0.5025868 -0.815040 0.000000
[2,] -0.8150400 10.762254 5.666744
[3,] 0.0000000 5.666744 24.645834
$gentype
[1] "mixed"
$cont.names
[1] "fat" "protein" "lactose"
$disc.names
[1] "treat"
$disc.levels
[1] 7
$N
[1] 55
$SSD
[,1] [,2] [,3]
[1,] 127.208586 10.960632 -3.865509
[2,] 10.960632 6.758771 -1.669364
[3,] -3.865509 -1.669364 2.491443
$SS
fat protein lactose
fat 3143.500 2227.141 2199.894
protein 2227.141 1648.827 1627.220
lactose 2199.894 1627.220 1620.993
Updating Models Models are updated using update(). A list with one or more of the components add.edge, drop.edge, add.term, drop.term is specified. The updates are made in the order given. For example:
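A minimal hypothetical illustration (the edge chosen is arbitrary):
> milk2=update(milk, list(add.edge=~treat:lactose))
> milk2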
Inference Functions such as ciTest(), testInEdges(), testOutEdges() etc. have the same be-
haviour as with pure discrete and pure continuous networks. For example:
> ciTest(milkcomp1)
Testing treat _|_ fat | protein dm lactose
Statistic (DEV): 8.742 df: 6 p-value: 0.1886 method: CHISQ
> testInEdges(milk,getInEdges(milk$glist))
statistic df p.value aic V1 V2 action
1 11.06071 6 0.086518199 -0.9392919 treat fat +
2 18.68943 6 0.004721598 6.6894264 treat protein -
3 8.27794 1 0.004012963 6.2779399 fat protein -
4 10.24527 1 0.001370352 8.2452747 protein lactose -
> testOutEdges(milk,getOutEdges(milk$glist))
statistic df p.value aic V1 V2 action
1 3.8928582 6 0.6911730 8.107142 treat lactose -
2 0.9827155 1 0.3215293 1.017285 fat lactose -
> milk3=update(milk,list(drop.edge=~treat:protein))
> compareModels(milk,milk3)
Large:
:"treat" "fat" "protein"
:"protein" "lactose"
Small:
:"protein" "lactose"
:"treat" "fat"
:"fat" "protein"
-2logL: 18.69 df: 6 AIC(k= 2.0): 6.69 p.value: 0.155100
> testdelete(milk,c("treat","protein"))
dev: 18.689 df: 6 p.value: 0.00472 AIC(k=2.0): 6.7 edge: treat:protein
Notice: Test perfomed by comparing likelihood ratios
> testadd(milk,c("treat","lactose"))
dev: 3.893 df: 6 p.value: 0.69117 AIC(k=2.0): 8.1 edge: treat:lactose
Notice: Test perfomed by comparing likelihood ratios
Stepwise Model Selection The stepwise() function in gRim implements stepwise selection. The following starts from the saturated model and uses the BIC criterion. This function can take a while to produce the output.
> data(wine,package='gRbase')
> mm=mmod(~.^.,data=wine)
> mm2=stepwise(mm,k=log(nrow(wine)),details=0)
> plot(mm2)
Chapter 12
Learning the Conditional Probability Functions
12.1 Introduction
Let X = (X1, …, Xd) be a random vector, whose probability distribution factorises along a DAG G = (V, D). This chapter considers the task of learning the conditional probability potentials, when the DAG G is given, when presented with an n × d data matrix of instantiations
$$x = \begin{pmatrix} x^{(1)} \\ \vdots \\ x^{(n)} \end{pmatrix},$$
an instantiation of the random matrix
$$X = \begin{pmatrix} X^{(1)} \\ \vdots \\ X^{(n)} \end{pmatrix},$$
whose rows are independent copies of X. Three cases are considered:
Gaussian
Multinomial
Conditional Gaussian.
In the Gaussian case, the model is
$$\begin{cases} X = \mu + \epsilon, & \epsilon \sim N(0,\Sigma) \\ \mu_j = \beta_{j0} + \sum_{k\in \mathrm{Pa}(j)}\beta_{jk}\mu_k, \end{cases}$$
where Pa(j) denotes the indices of the parent nodes of Xj and, for different instantiations, the ϵ are i.i.d. Estimation of the parameters βkj, k ∈ {0} ∪ Pa(j), is carried out simply by maximum likelihood estimation, which is equivalent to least squares for Gaussian variables. That is, the parameters β are estimated by minimising
$$\sum_{i=1}^{n}\Big(x_{ij} - \beta_{0j} - \sum_{k\in \mathrm{Pa}_j} x_{ik}\beta_{kj}\Big)^{2}.$$
We assume that the nodes have been ordered so that Paj ⊆ {1, …, j − 1}. Denote the estimator by β̂. Then
$$\hat{\mu}_j = \hat{\beta}_{0j} + \sum_{k\in \mathrm{Pa}_j}\hat{\mu}_k\hat{\beta}_{kj}.$$
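In R this amounts to one ordinary regression per node; a minimal sketch, assuming a data frame d of observations and a hypothetical list pa giving the parent names of each non-root node:
> pa=list(Fat12="Fat11", Fat13=c("Fat11","Fat12"))      # hypothetical parent sets
> fits=lapply(names(pa), function(j)
+    lm(reformulate(pa[[j]], response=j), data=d))      # least squares, node by node
> lapply(fits, coef)                                    # the estimates of beta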
The estimate of Σ is slightly harder than the estimate of µ, since we have to ensure that the conditional independence constraints are satisfied; the estimate Σ̂ has to correspond to the factorisation. The estimate Σ̂11 is simply the m.l.e. variance estimate derived from the univariate sample x.1 from a N(µ1, Σ11) distribution.
For j > 1, assume that the components of the (j−1) × (j−1) sub-matrix Σ^(j−1), with entries Σab, 1 ≤ a ≤ j−1, 1 ≤ b ≤ j−1, have already been estimated. Then set (Σ̂^(j)−1)_{ij} = 0 for i ∉ Paj. Now let A denote the j × j symmetric matrix obtained as the maximiser of
$$\frac{|A|^{1/2}}{(2\pi)^{j/2}}\exp\Big\{-\frac{1}{2}\sum_{i=1}^{n}\sum_{a=1}^{j}\sum_{b=1}^{j}(x_{ia}-\hat{\mu}_a)(x_{ib}-\hat{\mu}_b)A_{ab}\Big\},$$
subject to the constraint that A_{ij} = 0 for i < j, i ∉ Paj. For i ∈ Paj ∪ {j}, set Σ̂_{ij} = A^{-1}_{ij}.
Conditioned Gaussian Part of Conditional Gaussian Similarly, for Conditional Gaussian, the parameters of the Gaussian variables, conditioned on the discrete, are estimated by maximum likelihood in the standard way for multivariate Gaussian.
Let x be an n × d data matrix, representing n instantiations of X. The aim is to estimate the values of (θjil)_{j,i,l} based on the data matrix.
For discrete variables there are two approaches; the maximum likelihood method and the Bayesian
approach.
Definition 12.1 (Likelihood function, Likelihood Estimate, Log Likelihood Function). The likelihood function of the parameters θ is defined as
$$L(\theta|x) = P_X(x|\theta).$$
There is an elegant expression of the likelihood function in terms of the Shannon Entropy and Kullback
Leibler divergence given below.
Definition 12.2 (Shannon Entropy). The Shannon Entropy, or Entropy, of a probability distribution θ = (θ1, …, θk), where θj ≥ 0, j = 1, …, k and θ1 + … + θk = 1, is defined as
$$H(\theta) = -\sum_{j=1}^{k}\theta_j\log\theta_j.$$
In the definition of H(θ), the convention 0 log 0 = 0 is used, obtained by continuous extension of the function x log x, x > 0.
Definition 12.3 (Kullback Leibler Divergence). The Kullback Leibler Divergence between two discrete probability functions f and g with the same state space X is defined as
$$D_{KL}(f|g) = \sum_{x\in X} f(x)\log\frac{f(x)}{g(x)}.$$
The Kullback Leibler divergence has the property that it is non-negative, and for two probability measures defined on the same finite state space, DKL(f∣g) = 0 if and only if f = g. This is a consequence of Jensen's inequality and is now stated.
Lemma 12.4. For any two discrete probability distributions f and g, it holds that
$$D_{KL}(f|g) \ge 0.$$
Proof of Lemma 12.4 The proof uses Jensen's Inequality1, namely, that for any convex function ϕ, E[ϕ(X)] ≥ ϕ(E[X]), with equality if and only if either ϕ(x) = ax + b or P(X = y) = 1 for some point y. Note that f(x) ≥ 0 for all x ∈ X and that ∑_{x∈X} f(x) = ∑_{x∈X} g(x) = 1. Using this, together with the fact that − log is convex, yields
$$D_{KL}(f|g) = -\sum_{x\in X} f(x)\log\Big(\frac{g(x)}{f(x)}\Big) \ge -\log\Big(\sum_{x\in X} f(x)\frac{g(x)}{f(x)}\Big) = -\log 1 = 0,$$
with equality if and only if f = g.
The likelihood function may be expressed in terms of the Shannon Entropy and the Kullback Leibler divergence as follows.
Theorem 12.5. Let Ln(θ∣x) denote the likelihood function for the parameter vector θ = (θ1, …, θk), where the n-vector x denotes the outcomes of n independent trials, each taking values in X = (x^(1), …, x^(k)), and ni denotes the number of times x^(i) appears in the list x. Let
$$\hat{\theta} = \Big(\frac{n_1}{n},\ldots,\frac{n_k}{n}\Big).$$
Then
$$-\frac{1}{n}\log L_n(\theta|x) = H(\hat{\theta}) + D_{KL}(\hat{\theta}|\theta). \qquad (12.2)$$
1 J.L. Jensen (1859 - 1925) published this in Acta Mathematica in 1906.
Proof of Theorem 12.5 Since P_X(x^(n) ∣ θ) = ∏_{i=1}^{k} θ_i^{n_i}, it follows directly that
$$-\frac{1}{n}\log P_X(x\mid\hat{\theta}) = -\frac{1}{n}\sum_{i=1}^{k}n_i\log\hat{\theta}_i = -\sum_{i=1}^{k}\hat{\theta}_i\log\hat{\theta}_i = H(\hat{\theta}). \qquad (12.3)$$
This is the Shannon entropy for the empirical distribution, given by Definition 12.2. Furthermore,
$$-\frac{1}{n}\log L_n(\theta) := -\frac{1}{n}\log P_X(x\mid\theta) = -\frac{1}{n}\log\prod_{i=1}^{k}\theta_i^{n_i} = -\sum_{i=1}^{k}\hat{\theta}_i\log\theta_i = -\sum_{i=1}^{k}\hat{\theta}_i\log\hat{\theta}_i + \sum_{i=1}^{k}\hat{\theta}_i\log\frac{\hat{\theta}_i}{\theta_i} = H(\hat{\theta}) + D_{KL}(\hat{\theta}|\theta).$$
Since the Kullback Leibler distance is non-negative, it now follows directly that the maximum likelihood estimate θ̂MLE of θ is given by
$$\hat{\theta}_{MLE} = \Big(\frac{n_1}{n},\ldots,\frac{n_k}{n}\Big).$$
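In R, θ̂MLE is simply the vector of relative frequencies; for example, with simulated data:
> x=sample(c("a","b","c"), 1000, replace=TRUE, prob=c(0.5,0.3,0.2))
> table(x)/length(x)   # maximum likelihood estimate of (theta_1, theta_2, theta_3)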
Recall that, for parameter estimation in statistics, the same notation θ̂ is used for an estimate, esti-
mator, and estimating function for a parameter θ.
Let Yj denote the number of occurrences of x^(j) among the n trials, so that θ̂_{ML,j} = Yj/n. Then
$$E[Y_j] = n\theta_j,\qquad E[Y_j^2] = \sum_{k_1=1}^{n}\sum_{k_2=1}^{n}E[1_{x^{(j)}}(X_{k_1})1_{x^{(j)}}(X_{k_2})] = n\theta_j + n(n-1)\theta_j^2,$$
giving
$$V(Y_j) = n\theta_j(1-\theta_j),$$
and, for i ≠ j,
$$E[Y_iY_j] = n(n-1)\theta_i\theta_j,$$
so that
$$\mathrm{Cov}(Y_i,Y_j) = -n\theta_i\theta_j.$$
Define the matrix C by Cjj = θj(1 − θj) and Cij = −θiθj for i ≠ j. Furthermore, the following central limit theorem is standard from multivariate analysis:
$$\sqrt{n}\,(\hat{\theta}_{ML}-\theta)\ \xrightarrow[n\to+\infty]{}\ N(0,C).$$
Let
$$X = \begin{pmatrix} X^{(1)} \\ \vdots \\ X^{(n)} \end{pmatrix}$$
denote the random matrix, where each row represents an independent copy of X, and let
$$x = \begin{pmatrix} x^{(1)} \\ \vdots \\ x^{(n)} \end{pmatrix}$$
denote its instantiation. Setting
$$n_k(x_j^{(i)},\pi_j^{(l)}) = \begin{cases} 1 & (x^{(k)})_j = x_j^{(i)},\ \pi_j(x^{(k)}) = \pi_j^{(l)} \\ 0 & \text{otherwise} \end{cases}$$
and
$$n(x_j^{(i)},\pi_j^{(l)}) = \sum_{k=1}^{n}n_k(x_j^{(i)},\pi_j^{(l)}),$$
The crucial point is that, using Equation 12.4, the likelihood function has product form:
$$L(\theta|x) = P_{X|\Theta}(x|\theta) = \prod_{k=1}^{n}P_{X|\Theta}(x^{(k)}|\theta) = \prod_{k=1}^{n}\prod_{j=1}^{d}\prod_{l=1}^{q_j}\prod_{i=1}^{k_j}\theta_{jil}^{\,n_k(x_j^{(i)},\pi_j^{(l)})} = \prod_{j=1}^{d}\prod_{l=1}^{q_j}\prod_{i=1}^{k_j}\theta_{jil}^{\,n(x_j^{(i)},\pi_j^{(l)})}. \qquad (12.5)$$
This is a product of likelihood functions; each (j, l) represents the likelihood function for the parameters (θjil)_{i=1}^{kj} based on n(πj^(l)) independent observations, where
$$n(\pi_j^{(l)}) := \sum_{i=1}^{k_j}n(x_j^{(i)},\pi_j^{(l)}).$$
Furthermore, the estimator satisfies n(πj^(l)) θ̂_{j·l} ∼ Mult(n(πj^(l)); θ_{j1l}, …, θ_{jk_jl}), where the random vectors θ̂_{j·l} = (θ̂_{jil})_{i=1}^{kj} are independent over (j, l), and for each (j, l)
$$\mathrm{Cov}(\hat{\theta}_{ML;ji_1l},\hat{\theta}_{ML;ji_2l}) = C^{(jl)}_{i_1i_2} = \begin{cases} \frac{1}{n(\pi_j^{(l)})}\theta_{jil}(1-\theta_{jil}) & i_1 = i_2 = i \\ -\frac{1}{n(\pi_j^{(l)})}\theta_{ji_1l}\theta_{ji_2l} & i_1 \neq i_2, \end{cases}$$
with the corresponding central limit theorem
$$\sqrt{n(\pi_j^{(l)})}\,(\hat{\theta}_{j\cdot l}-\theta_{j\cdot l})\ \xrightarrow[n(\pi_j^{(l)})\to+\infty]{}\ N(0,C^{(jl)}).$$
To keep computations within a reasonable framework, it is important that the prior distribution is from a conjugate family.
Definition 12.6 (Conjugate Prior). A prior distribution from a family that is closed under sampling is known as a conjugate prior.
Consider the thumb-tack experiment: the tack is thrown n times and the outcomes are recorded as
$$x^{(n)} = (x_1,\ldots,x_n)^t.$$
Each trial is a Bernoulli trial with probability θ of success (obtaining a 1). This is denoted by Xi ∼ Be(θ), i = 1, …, n. Using the Bayesian approach, the parameter θ is regarded as the outcome of a random variable, which is denoted by Θ. The outcomes are conditionally independent, given θ. This is denoted by
Xi ⊥ Xj ∣Θ, i ≠ j.
When Θ = θ is given, the random variables X1, …, Xn are independent. Let X^(n) = (X1, …, Xn)^t, so that
$$P_{X^{(n)}|\Theta}(x^{(n)}|\theta) = \prod_{l=1}^{n}\theta^{x_l}(1-\theta)^{1-x_l} = \theta^k(1-\theta)^{n-k},$$
where k = ∑_{l=1}^{n} x_l.
The problem is to use x^(n) to make an assessment of θ and then use this to assess the probability function for a further outcome X_{n+1}. The Bayesian approach is, starting with a prior density πΘ(.) over the parameter space Θ̃ = [0, 1], to find the posterior density π_{Θ∣X^(n)}(.∣x^(n)).
Let πΘ be the uniform density on [0, 1]. This represents no initial preference concerning θ; all values are equally plausible2. The choice of prior may seem arbitrary but, following the computations below, it should be clear that, for a large class of priors, the final answer does not depend much on the choice of prior if the thumb-tack is thrown a large number of times.
$$\int_0^1 P_{X^{(n)}|\Theta}(x^{(n)}|\theta)\pi_\Theta(\theta)\,d\theta = \int_0^1\theta^k(1-\theta)^{n-k}\,d\theta = \frac{k!(n-k)!}{(n+1)!}. \qquad (12.6)$$
The posterior distribution is a Beta density:
$$\pi_{\Theta|X^{(n)}}(\theta|x^{(n)}) = \begin{cases} \frac{(n+1)!}{k!(n-k)!}\theta^k(1-\theta)^{n-k} & 0 \le \theta \le 1 \\ 0 & \text{otherwise.} \end{cases} \qquad (12.7)$$
The Beta distribution is not restricted to integer values of its parameters; the Euler gamma function is needed to extend the definition beyond the positive integers.
Definition 12.7 (Euler Gamma Function). The Euler Gamma Function Γ : (0, +∞) → (0, +∞) is defined as
$$\Gamma(\alpha) = \int_0^\infty x^{\alpha-1}e^{-x}\,dx. \qquad (12.8)$$
Lemma 12.8. For all α > 0, Γ(α + 1) = αΓ(α). If n is an integer satisfying n ≥ 1, then Γ(n) = (n − 1)!.
Proof Note that Γ(1) = ∫_0^∞ e^{−x} dx = 1. For all α > 0, integration by parts gives
$$\Gamma(\alpha+1) = \int_0^\infty x^\alpha e^{-x}\,dx = \alpha\Gamma(\alpha). \qquad (12.9)$$
For Bernoulli sampling, given a sequence x = (x1, …, xn) containing k 1's and n − k 0's, the likelihood function is L(θ) = θ^k(1 − θ)^{n−k}. Since
$$\pi(\theta|x) \propto L(\theta|x)\,\pi(\theta),$$
2 All statistical methods contain some ad-hoc element and, in Bayesian statistics, this is contained in the choice of prior distribution. The results obtained from any statistical analysis are only reliable if there is sufficient data so that any inference will be robust under a rather general choice of prior. There are well known difficulties with the statement that a uniform prior represents no preference concerning the value of θ. If the prior density for Θ is uniform, then the prior density of Θ² will not be uniform, so `no preference' for values of Θ indicates that there is a distinct preference among possible initial values of Θ². If π(x) = 1 for 0 < x < 1 is the density function for Θ and π₂ is the density function for Θ², then π₂(x) = 1/(2x^{1/2}) for 0 < x < 1.
where π is the prior, it therefore follows that the prior should have the form
$$\pi(\theta) \propto \theta^a(1-\theta)^b$$
for some values a and b, to guarantee that both prior and posterior come from the same conjugate family. In this case, the family of distributions is the family of Beta distributions, defined as follows:
Definition 12.9 (Beta Density). The beta density Beta(α, β) with parameters α > 0 and β > 0 is defined as the function
$$\psi(t) = \begin{cases} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}t^{\alpha-1}(1-t)^{\beta-1} & t\in[0,1] \\ 0 & t\notin[0,1]. \end{cases} \qquad (12.10)$$
The Beta density is a probability density function for all real α > 0 and β > 0. It follows that, for Binomial sampling, updating may be carried out very easily for any prior distribution within the Beta family. Suppose the prior distribution π0 is the B(α, β) density function and n trials are observed, with k taking the value 1 and n − k taking the value 0. Then the posterior satisfies
$$\pi_{\Theta|X^{(n)}}(\theta|x^{(n)}) \propto \theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1}.$$
Since ∫_0^1 π_{Θ∣X^(n)}(θ∣x^(n)) dθ = 1, therefore
$$\pi_{\Theta|X^{(n)}}(\theta|x^{(n)}) = \begin{cases} \frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+k)\Gamma(\beta+n-k)}\theta^{\alpha+k-1}(1-\theta)^{\beta+n-k-1} & \theta\in(0,1) \\ 0 & \theta\notin(0,1). \end{cases}$$
Definition 12.10 (Maximum Posterior Estimate). The maximum posterior estimate, θ̂MAP, is the value of θ which maximises the posterior density π_{Θ∣X^(n)}(θ∣x^(n)). Here,
$$\hat{\theta}_{MAP} = \frac{k+\alpha-1}{n+\alpha+\beta-2}.$$
Note that when the prior density is uniform, as in the case above, the MAP and MLE are exactly the same. The parameter, of course, is not an end in itself; it ought to be regarded as a means to computing the predictive probability. The posterior is used to compute this.
The Predictive Probability for the Next Toss Suppose that π_{Θ∣X^(n)}(θ∣x^(n)) has a B(α+k, β+n−k) distribution. The predictive probability for the next toss, for a = 0 or 1, is given by
$$P_{X_{n+1}|X^{(n)}}(a|x^{(n)}) = \int_0^1 P_{X_{n+1}}(a|\theta)\,\pi_{\Theta|X^{(n)}}(\theta|x^{(n)})\,d\theta.$$
Since P_{X_{n+1}∣Θ}(1∣θ) = θ, it follows (using Equation (12.9)) that
$$P_{X_{n+1}|X^{(n)}}(1|x^{(n)}) = \frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+k)\Gamma(\beta+n-k)}\int_0^1\theta^{\alpha+k}(1-\theta)^{\beta+n-k-1}\,d\theta = \frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+k)\Gamma(\beta+n-k)}\cdot\frac{\Gamma(\alpha+k+1)\Gamma(\beta+n-k)}{\Gamma(\alpha+\beta+n+1)} = \frac{\alpha+k}{\alpha+\beta+n}.$$
In particular, note that the uniform prior, π0(θ) = 1 for θ ∈ (0, 1), is the B(1, 1) density function, so that for binomial sampling with a uniform prior, the predictive probability is
$$P_{X_{n+1}|X^{(n)}}(1|x^{(n)}) = \frac{k+1}{n+2};\qquad P_{X_{n+1}|X^{(n)}}(0|x^{(n)}) = \frac{n+1-k}{n+2}. \qquad (12.11)$$
This distribution, or more precisely the probability (k+1)/(n+2), is known as the Laplace rule of succession.
For multinomial sampling, with k possible outcomes x^(1), …, x^(k), the likelihood is
$$L(\theta|x) = \prod_{j=1}^{k}\theta_j^{n_j},$$
where nj = ∑_{i=1}^{n} 1_{x^(j)}(xi); i.e. the number of times outcome x^(j) appears in the sequence x, for j = 1, …, k. It follows that, to ensure that the prior and posterior are within the same conjugate family, the prior has the form
$$\pi(\theta_1,\ldots,\theta_k) \propto \prod_{j=1}^{k}\theta_j^{a_j}.$$
It follows that the only possible family of distributions to use is the Dirichlet family, defined as follows.
Definition 12.11 (Dirichlet Density). The Dirichlet density Dir(a1, …, ak) is the function
$$\pi(\theta_1,\ldots,\theta_k) = \begin{cases} \frac{\Gamma(a_1+\ldots+a_k)}{\prod_{j=1}^{k}\Gamma(a_j)}\prod_{j=1}^{k}\theta_j^{a_j-1} & \theta_j \ge 0,\ \sum_{j=1}^{k}\theta_j = 1, \\ 0 & \text{otherwise}, \end{cases} \qquad (12.12)$$
where Γ denotes the Euler Gamma Function, given in Definition 12.7. The parameters (a1, …, ak) are all strictly positive and are known as hyperparameters.
This density, and integration with respect to this density function, are to be understood in the following sense. Since θk = 1 − ∑_{j=1}^{k−1} θj, it follows that π may be written as π(θ1, …, θk) = π̃(θ1, …, θ_{k−1}), where
$$\tilde{\pi}(\theta_1,\ldots,\theta_{k-1}) = \begin{cases} \frac{\Gamma(a_1+\ldots+a_k)}{\prod_{j=1}^{k}\Gamma(a_j)}\Big(\prod_{j=1}^{k-1}\theta_j^{a_j-1}\Big)\Big(1-\sum_{j=1}^{k-1}\theta_j\Big)^{a_k-1} & \theta_j \ge 0,\ \sum_{j=1}^{k-1}\theta_j \le 1, \\ 0 & \text{otherwise.} \end{cases} \qquad (12.13)$$
Clearly, when k = 2, this reduces to the Beta density. The Dirichlet density is a probability density function.
Properties of the Dirichlet Density The family of Dirichlet densities Dir(α1, …, αk), α1 > 0, …, αk > 0, is closed under sampling: consider a prior distribution πΘ ∼ Dir(α1, …, αk) and suppose that observations of n independent trials are made: x := (x1, …, xn), where nj = ∑_{i=1}^{n} 1_{x^(j)}(xi), i.e. the number of appearances of x^(j) in the sequence, for j = 1, …, k. Let π_{Θ∣X} denote the posterior distribution. Then
$$\pi_{\Theta|X} \sim Dir(\alpha_1+n_1,\ldots,\alpha_k+n_k).$$
The Dirichlet density is usually written exclusively as a function of k variables, πΘ(θ1, …, θk), where there are k − 1 independent variables and θk = 1 − ∑_{j=1}^{k−1} θj.
Mean Posterior Estimate The mean posterior estimate is the expected value of the posterior distribution. Here,
$$\hat{\theta}_{i,MEP} = \int\theta_i\,\pi(\theta_1,\ldots,\theta_k|x,\alpha)\,d\theta_1\ldots d\theta_k = \frac{n_i+\alpha_i}{\sum_{j=1}^{k}n_j+\sum_{j=1}^{k}\alpha_j}.$$
This computation is left as an exercise.
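A one-line numerical illustration of the mean posterior estimate, with hypothetical counts and hyperparameters:
> nvec=c(12,5,3); alpha=c(1,1,1)            # hypothetical counts and a Dir(1,1,1) prior
> (nvec+alpha)/(sum(nvec)+sum(alpha))       # theta-hat_{i,MEP}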
The prior distribution over the entire collection of parameters Θ is taken to be πΘ = ∏_{j,l} π_{Θjl}. That is, the distributions over (θ_{j·l})_{(j,l)} are mutually independent for different (j, l). Suppose an n × d data matrix is obtained, with n complete instantiations. Let n(x_j^(i), π_j^(l)) denote the number of times that the configuration (x_j^(i), π_j^(l)) appears in x. It follows that
$$\pi_{\Theta|X}(\theta|x) = \pi_\Theta(\theta)\frac{P_{X|\Theta}(x|\theta)}{P_X(x)}.$$
Recall the expression for P_{X∣Θ}(x∣θ) found in Equation (12.5). It follows that
$$\pi_{\Theta|X}(\theta|x) = \frac{1}{P_X(x)}\prod_{j=1}^{d}\prod_{l=1}^{q_j}\pi_{\Theta_{jl}}(\theta_{jl})\Big(\prod_{i=1}^{k_j}\theta_{jil}^{\,n(x_j^{(i)},\pi_j^{(l)})}\Big).$$
The posterior distribution of θ_{j·l} depends only on counts of family configurations at node j and not on configurations at any other node.
Predictive Distribution The predictive distribution of a new case x^(n+1) may be computed using the posterior density; with an n × d data matrix x of n complete instantiations, θjil, defined in Equation (12.1), is estimated by
$$\tilde{\theta}_{jil} = \frac{n(x_j^{(i)},\pi_j^{(l)})+\alpha_{jil}}{n(\pi_j^{(l)})+\sum_{i=1}^{k_j}\alpha_{jil}}. \qquad (12.14)$$
This is the predictive conditional probability that variable X_{n+1,j} attains value x_j^(i), given the parent configuration π_j^(l) and the cases stored in x. Let X denote the n × d matrix where each row X_{k.} is an independent copy of X = (X1, …, Xd). Recall that
$$\pi_{\Theta|X} = \prod_{j=1}^{d}\prod_{l=1}^{q_j}\pi_{\Theta_{jl}|X}.$$
Note that P_{Pa_{n+1,j}∣Θ} is an expression containing sums and products of (θ_{ail})_{a=1,…,j−1, i=1,…,ka, l=1,…,qa}. It follows that π_{Θ∣Pa_{n+1,j},X} may be expressed as a product
$$\pi_{\Theta|Pa_{n+1,j},X} = A\big((\theta_{a\cdot l})_{a=1,\ldots,j-1,\,l=1,\ldots,q_a}\big)\,\pi_{\Theta_{jl}|X}(\theta_{j\cdot l}|x)\prod_{a=j+1}^{d}\prod_{l=1}^{q_a}\pi_{\Theta_{al}|X}(\theta_{a\cdot l}|x),$$
where A is a probability density over (θ_{ail})_{a=1,…,j−1, i=1,…,ka, l=1,…,qa}. Then, by computations as before, with θ̃jil defined by Equation (12.14),
$$\lim_{n\to+\infty}\frac{\hat{\theta}_{i,MEP}}{\hat{\theta}_{i,MLE}} = 1.$$
When a complete new case is observed in which the configuration (x_j^(i), π_j^(l)) appears, the counts are updated to n*(·, π_j^(l)), where
$$n^*(x_j^{(r)},\pi_j^{(l)}) = \begin{cases} n(x_j^{(r)},\pi_j^{(l)}) & r \neq i \\ n(x_j^{(i)},\pi_j^{(l)})+1 & r = i. \end{cases}$$
The virtual sample size for π_j^(l) is updated as
$$s^* = n(\pi_j^{(l)})+1+\sum_{i=1}^{k_j}\alpha_{jil}.$$
A Missing Instantiation Suppose the instantiation at node j is missing in the new case, while the parent configuration π_j^(l) is present. Let
$$x = \begin{pmatrix} x^{(1)} \\ \vdots \\ x^{(n)} \end{pmatrix}$$
denote the complete instantiations and let x^(n+1) denote instantiation n + 1, where the value x_{n+1,j} is missing. The distribution of the random vector θ_{j·l} ∣ x, x^(n+1) is expressed as the mixture of distributions
$$\sum_{i=1}^{k_j}w_i\,Dir\big(n(x_j^{(1)},\pi_j^{(l)})+\alpha_{j1l},\ \ldots,\ n(x_j^{(i)},\pi_j^{(l)})+1+\alpha_{jil},\ \ldots,\ n(x_j^{(k_j)},\pi_j^{(l)})+\alpha_{jk_jl}\big),$$
where
$$w_i = P_{X_{j,n+1}|Pa_{j,n+1},X}(x_j^{(i)}|\pi_j^{(l)},x) = \int\theta_{jil}\,\pi_{\Theta|X}(\theta|x)\,d\theta.$$
Updating: Parent Configuration and the State at Node j are Missing Consider a new case x^(n+1) where both the state and the parent configuration of node j are missing. Then the distribution of θ_{j·l} ∣ x, x^(n+1) is given as the mixture of distributions
$$\sum_{i=1}^{k_j}v_i\,Dir\big(n(x_j^{(1)},\pi_j^{(l)})+\alpha_{j1l},\ \ldots,\ n(x_j^{(i)},\pi_j^{(l)})+1+\alpha_{jil},\ \ldots,\ n(x_j^{(k_j)},\pi_j^{(l)})+\alpha_{jk_jl}\big) + v^*\,Dir\big(n(x_j^{(1)},\pi_j^{(l)})+\alpha_{j1l},\ \ldots,\ n(x_j^{(k_j)},\pi_j^{(l)})+\alpha_{jk_jl}\big),$$
where
$$v_i = P_{X_j,Pa_j|X,X^{(n+1)}}(x_j^{(i)},\pi_j^{(l)}|x,x^{(n+1)}),\qquad i = 1,\ldots,k_j,$$
and
$$v^* = 1-P_{Pa_j|X,X^{(n+1)}}(\pi_j^{(l)}|x,x^{(n+1)}).$$
Fractional Updating The preceding shows that adding new cases with missing values results in dealing with increasingly messy mixtures, with increasing numbers of components. The standard way to deal with this is to approximate the true update by a single Dirichlet distribution, taking the updated distribution as
$$Dir\big(n^*(x_j^{(1)},\pi_j^{(l)})+\alpha_{j1l},\ \ldots,\ n^*(x_j^{(k_j)},\pi_j^{(l)})+\alpha_{jk_jl}\big),$$
where
$$n^*(x_j^{(i)},\pi_j^{(l)}) = n(x_j^{(i)},\pi_j^{(l)})+P_{X_j,Pa_j|X,X^{(n+1)}}(x_j^{(i)},\pi_j^{(l)}|x,x^{(n+1)}),\qquad i = 1,\ldots,k_j.$$
Fading If the parameters change with time, then information learnt a long time ago may not be so useful. A way to make the old cases less relevant is to have the sample size discounted by a fading factor qF, a positive number less than one. The counts are updated as
$$n^*(x_j^{(r)},\pi_j^{(l)}) = \begin{cases} q_F\,n(x_j^{(r)},\pi_j^{(l)}) & r \neq i \\ 1+q_F\,n(x_j^{(i)},\pi_j^{(l)}) & r = i. \end{cases}$$
If (π_j^(l), x_j^(i)) is observed for some i = 1, …, kj, the virtual sample size for parent configuration π_j^(l) is updated to
$$n^*(\pi_j^{(l)}) = 1+q_F\,n(\pi_j^{(l)})$$
and
$$n^*(\pi_j^{(a)}) = q_F\,n(\pi_j^{(a)}),\qquad a \neq l.$$
The effective sample size after n updates therefore satisfies the recursion
$$s_n = q_Fs_{n-1}+1,\qquad s_0 = s,$$
so that
$$s_n = q_F^ns+\sum_{i=0}^{n-1}q_F^i = q_F^ns+\frac{1-q_F^n}{1-q_F}.$$
The limiting effective maximal sample size is therefore
$$s^* = \frac{1}{1-q_F}.$$
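A small sketch illustrating the fading recursion and its limit (all names hypothetical):
> fade=function(s, qF, n) { for (i in 1:n) s=qF*s+1; s }   # iterate s <- qF*s + 1
> fade(10, 0.9, 50)      # close to the limit
> 1/(1-0.9)              # limiting effective sample size s* = 10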
$$P_{X|D}(x|D) = \int\prod_{k=1}^{n}P_{X|\Theta,D}(x^{(k)}|\theta,D)\prod_{j=1}^{d}\prod_{l=1}^{q_j}\phi(\theta_{j\cdot l}|\alpha_{j\cdot l},D)\,d\theta_{j\cdot l},$$
where φ(θ_{j·l}∣α_{j·l}) is a compact way of referring to the Dirichlet density Dir(α_{j1l}, …, α_{jk_jl}). Because P_{X∣Θ,D}(x∣θ, D) has a convenient product form, computing the Dirichlet integral is straightforward and gives
$$L(D|x) := P_{X|D}(x|D) = \prod_{j=1}^{d}\prod_{l=1}^{q_j}\frac{\Gamma\big(\sum_{i=1}^{k_j}\alpha_{jil}\big)}{\Gamma\big(n(\pi_j^{(l)})+\sum_{i=1}^{k_j}\alpha_{jil}\big)}\prod_{i=1}^{k_j}\frac{\Gamma\big(n(x_j^{(i)},\pi_j^{(l)})+\alpha_{jil}\big)}{\Gamma(\alpha_{jil})}. \qquad (12.15)$$
The computation is left as an exercise; this is the Cooper Herskovitz likelihood for the graph structure.
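A sketch of the log of one node's factor in (12.15), computed with base R's lgamma(); nmat and alpha are hypothetical kj × qj arrays of the counts n(x_j^(i), π_j^(l)) and hyperparameters αjil:
> ch.logscore=function(nmat, alpha) {
+   # column l is parent configuration pi_j^(l); colSums(nmat) gives n(pi_j^(l))
+   sum(lgamma(colSums(alpha)) - lgamma(colSums(nmat)+colSums(alpha))) +
+     sum(lgamma(nmat+alpha) - lgamma(alpha))
+ }
> ch.logscore(matrix(c(3,1,2,4),2,2), matrix(1,2,2))   # toy example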
Consider a statistic t = t(X).
Definition 12.12 (Bayesian Sufficiency). A statistic T, defined as T = t(X), such that for every prior πΘ within the space of prior distributions under consideration there is a function ϕ with
$$\pi_{\Theta|X}(\theta|x) = \frac{p_X(x|\theta)\pi_\Theta(\theta)}{p_X(x)} = \phi(\theta,t(x)) \qquad (12.16)$$
is called a Bayesian sufficient statistic for θ.
This definition states that for learning about θ based on X, the statistic T contains all the relevant information, since the posterior distribution depends on X only through T. The following result shows that if the conditional distribution of X given t(X) does not depend on θ, then t(X) is Bayesian sufficient for θ. If the families of probability measures have finite dimensional parameter spaces, then the converse is also true. If there are an infinite number of parameters, counterexamples to the converse statement may be obtained.
Proposition 12.13. Suppose that
$$X \perp \theta \mid T, \qquad (12.17)$$
where Equation (12.17) means that
$$p_{X|T,\Theta}(x|t,\theta) = p_{X|T}(x|t). \qquad (12.18)$$
Then T = t(X) is Bayesian sufficient for θ.
Proof of Proposition 12.13 As usual, let T = t(X). An application of Bayes rule gives
$$\pi_{\Theta|X}(\theta|x) = \frac{p_{X|T}(x|t(x))\,p_{T|\Theta}(t(x)|\theta)\,\pi_\Theta(\theta)}{p_{X|T}(x|t(x))\,p_T(t(x))} = \frac{p_{T|\Theta}(t(x)|\theta)\,\pi_\Theta(\theta)}{p_T(t(x))},$$
which is a function of θ and t(x) only, proving the claim.
As an example, consider a sequence of Bernoulli trials
$$x = (x_1,\ldots,x_n)^t.$$
That is, for each j = 1, …, n, xj = 1 or 0. The statistic t is a function of n variables, defined as
$$t(x) = \sum_{j=1}^{n}x_j.$$
That is, when t is applied to a sequence of n 0's and 1's, it returns the number of 1's in the sequence. Here, T = t(X) = ∑_{j=1}^{n} Xj and therefore T has a binomial distribution with parameters n and θ, since it is the sum of independent Bernoulli trials. The probability function of T is given by
$$p_{T|\Theta}(k|\theta) = \begin{cases} \binom{n}{k}\theta^k(1-\theta)^{n-k} & k = 0,1,\ldots,n \\ 0 & \text{other }k. \end{cases}$$
Since t is a function of x, it follows that
$$p_{X,T|\Theta}(x,k|\theta) = \begin{cases} \theta^k(1-\theta)^{n-k} & k = t(x) \\ 0 & \text{other }k, \end{cases}$$
from which
$$p_{X|T,\Theta}(x|k,\theta) = \frac{p_{X,T|\Theta}(x,k|\theta)}{p_{T|\Theta}(k|\theta)} = \binom{n}{k}^{-1},\qquad k = t(x).$$
The right hand side does not depend on θ, so that Equation (12.18) holds and hence Equation (12.17) follows. Therefore, if x = (x1, …, xn) are n independent Bernoulli trials, each with parameter θ, the function t such that t(x) = ∑_{l=1}^{n} xl is a Bayesian sufficient statistic for the parameter θ. In the thumb-tack example, given in subsection 12.5.1, the posterior distribution, based on a uniform prior, is an explicit function of the data x only through the function t(x).
Now consider a random vector X and suppose that t is a generic sufficient statistic. Since t is a function of X (i.e. T = t(X)), it follows, using the rules of conditional probability and Equation (12.18), that
$$p_{X|\Theta}(x|\theta) = p_{X,T|\Theta}(x,t(x)|\theta) = p_{X|T,\Theta}(x|t(x),\theta)\,p_{T|\Theta}(t(x)|\theta) = p_{X|T}(x|t(x))\,p_{T|\Theta}(t(x)|\theta) = h(x)\,p_{T|\Theta}(t(x)|\theta),$$
where
$$h(x) = p_{X|T}(x|t(x)) = p_{X|t(X)}(x|t(x)).$$
$$X \perp (Y,\theta)\mid T. \qquad (12.22)$$
That is, once t(X) is given, there is no additional statistical information in X about Y or θ. The problem is to predict Y statistically using a function of X.
Proposition 12.15. Let t denote a function and let T = t(X). If X, Y, T, θ satisfy X ⊥ Y ∣ (T, θ) and X ⊥ θ ∣ T, then
$$\pi_{\Theta|Y,X,T}(\theta\mid y,x,t) = \pi_{\Theta|Y,T}(\theta\mid y,t). \qquad (12.23)$$
It follows that
$$p_{Y|X,T}(y|x,t) = \frac{p_{X,Y|T}(x,y|t)}{p_{X|T}(x|t)} = p_{Y|T}(y|t). \qquad (12.24)$$
An application of Bayes rule gives
$$\pi_{\Theta|Y,X,T}(\theta|y,x,t) = \frac{p_{Y,X,T|\Theta}(y,x,t|\theta)\,\pi_\Theta(\theta)}{p_{Y,X,T}(y,x,t)} = \frac{p_{Y,X|T,\Theta}(y,x|t,\theta)\,p_{T|\Theta}(t|\theta)\,\pi_\Theta(\theta)}{p_{Y,X,T}(y,x,t)} = \frac{p_{Y|T,\Theta}(y|t,\theta)\,p_{X|T,\Theta}(x|t,\theta)\,p_{T|\Theta}(t|\theta)\,\pi_\Theta(\theta)}{p_{Y,X,T}(y,x,t)},$$
where the conditional independence X ⊥ Y ∣ (θ, T) was used. Then, since X ⊥ θ ∣ T, it follows that p_{X∣T,Θ}(x∣t, θ) = p_{X∣T}(x∣t) and hence, using the identity (12.24), that
$$\pi_{\Theta|Y,X,T}(\theta|y,x,t) = \frac{p_{Y|T,\Theta}(y|t,\theta)\,p_{X|T}(x|t)\,p_{T|\Theta}(t|\theta)\,\pi_\Theta(\theta)}{p_{Y|X,T}(y\mid x,t)\,p_{X|T}(x\mid t)\,p_T(t)} = \frac{p_{Y|T,\Theta}(y|t,\theta)\,p_{T|\Theta}(t|\theta)\,\pi_\Theta(\theta)}{p_{Y|T}(y\mid t)\,p_T(t)} = \pi_{\Theta|Y,T}(\theta\mid y,t).$$
This proves (12.23). Now suppose the variables of a Bayesian network are ordered so that the parent sets satisfy
$$\Pi_j \subseteq \{X_1,\ldots,X_{j-1}\}.$$
Definition 12.16 (Parameter Modularity). A set of parameters Θ for a Bayesian Network satisfies parameter modularity if it may be decomposed into d distinct parameter sets Θ1, …, Θd such that for j = 1, …, d, the parameters in vector Θj are directly linked only to node Xj.
This definition was introduced by Heckerman, Geiger and Chickering (1995) [62]. Under the assumption of parameter modularity, the DAG may be expanded by adding the parameter nodes as parent variables in the graph, with directed links from each node in the set Θj to the node Xj, giving an extended graph that is directed and acyclic, where p_{X1,…,Xd∣Θ} has the decomposition
$$p_{X_1,\ldots,X_d|\Theta} = \prod_{j=1}^{d}p_{X_j|\Theta_j,\Pi_j}. \qquad (12.25)$$
Furthermore, under the assumption of modularity, Θ1, …, Θd are independent random vectors and the joint prior distribution is a product of individual priors: πΘ = ∏_{j=1}^{d} πΘj.
The following notation is useful: let X̃j = ((X1, Θ1), …, (Xj−1, Θj−1)) and let tj be the function with
$$t_j(\tilde{X}_j) = \Pi_j.$$
In other words, the parent set Πj is a prediction sufficient statistic for (Xj, Θj), in the sense that there is no further information in ((X1, Θ1), …, (Xj−1, Θj−1)) relevant to uncertainty about either Θj or Xj. In a Bayesian network where the parameters satisfy the modularity assumption (Definition 12.16), (Πj, Xj) is a Bayesian sufficient statistic for Θj. The modularity assumption is clearly satisfied when Equation (12.25) holds.
Notes The discussion of the thumb-tack and learning for DAGs is taken from D. Heckerman [61]
and [62]. Learning from incomplete data is discussed in [116]. Another treatment of learning is found
in [99] (Neapolitan). The Savage distribution is due to J.L. Savage [121]. The Dickey distribution is
due to J.M. Dickey [38].
12.11 Exercises
1. Suppose one has a data base C with n cases of configurations over a collection of variables V. Let Sp(V) denote the set of possible configurations over V and let #(v) denote the number of cases of configuration v. Define P^C(v) = #(v)/n. Let P^M denote a probability distribution over Sp(V). Assume that P^C(v) = 0 if and only if P^M(v) = 0 and discount these configurations. Define S^M(C) = −∑_{c∈C} log P^M(c) and S^C(C) = −∑_{c∈C} log P^C(c). Let DKL denote the Kullback Leibler distance. Show that
$$S^M(C)-S^C(C) = n\,D_{KL}(P^C|P^M).$$
2. (a) Consider the thumb-tack experiment and the conditional independence model for the problem and the uniform prior density for θ. Let X denote the vector of n i.i.d. copies of the random variable and let X_{n+1} denote an additional copy, independent of X. Let x denote an outcome of X. What is P_{X_{n+1}∣X}(head∣x)?
(b) Prove the Laplace Rule of Succession. Namely, let {X1, …, X_{n+1}} be independent, identically distributed Bernoulli random variables, where P_{Xi}(1) = 1 − P_{Xi}(0) = θ and θ ∼ U(0, 1). Then the Laplace Rule of Succession states that
$$P_{X_{n+1}|X_1+\ldots+X_n}(1|s) = \frac{s+1}{n+2}.$$
3. Let Θ ∼ Beta(α, β). Compute E[Θ] and V(Θ). You may use the fact that if Θ ∼ Beta(α, β) then its density is given by
$$\pi(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha-1}(1-\theta)^{\beta-1},\qquad \theta\in[0,1].$$
4. Let (X1, …, X_{n+1})^t be a vector of independent identically distributed random variables, each with probability distribution given by P_X(x^(i)) = θi, i = 1, …, k. Suppose that the prior distribution over θ is Dir(αq1, …, αqk), where ∑_{i=1}^{k} qi = 1. Let X = (X1, …, Xn) and let x be an n-vector of outcomes where x^(i) appears ni times, for i = 1, …, k and ∑_{i=1}^{k} ni = n. Show that
$$P_{X_{n+1}|X}(x^{(i)}\mid x) = \int_{S_L}\theta_i\,\pi(\theta_1,\ldots,\theta_L|x;\alpha q)\,d\theta_1\ldots d\theta_L = \frac{n_i+\alpha q_i}{n+\alpha}. \qquad (12.26)$$
5. Let Θ = (Θ1, …, ΘL) be a continuous random vector with Dir(α1, …, αL) distribution. Compute V(Θi).
6. (a) Let
$$V \sim Dir(a_1,\ldots,a_K),$$
and set
$$U_i = \frac{V_ix_i^{-1}}{\sum_{i=1}^{K}V_ix_i^{-1}},\qquad i = 1,\ldots,K,$$
where x = (x1, …, xK) is a vector of positive real numbers; that is, xi > 0 for each i = 1, …, K. Show that U = (U1, …, UK) has density function
$$\frac{\Gamma\big(\sum_{i=1}^{K}a_i\big)}{\prod_{i=1}^{K}\Gamma(a_i)}\prod_{i=1}^{K}u_i^{a_i-1}\Big(\frac{1}{\sum_{i=1}^{K}u_ix_i}\Big)^{\sum_{i=1}^{K}a_i}\prod_{i=1}^{K}x_i^{a_i}.$$
This is written
$$U \sim S(a,x).$$
This is due to J.L. Savage [121]. Note that the Dirichlet density is obtained as a special case when xi = c for i = 1, …, K.
The next two parts illustrate how the Savage distribution can arise in Bayesian analysis, for updating an objective distribution over the subjective assessments of a probability distribution by several different researchers, faced with a common set of data.
(b) Consider several researchers studying an unknown quantity X, where X can take values in {1, 2, …, K}. Each researcher has his own initial assessment of the probability distribution V = (V1, …, VK) for the value that X takes. That is, for a particular researcher,
$$V_i = P_X(i),\qquad i = 1,\ldots,K.$$
It is assumed that
$$V \sim Dir(a_1,\ldots,a_K).$$
Each researcher observes the same set of data, with the common likelihood function
$$l_i = P(\text{data}\mid\{X=i\}),\qquad i = 1,\ldots,K.$$
Let
$$U_i = P(\{X=i\}\mid\text{data}),\qquad i = 1,2,\ldots,K.$$
Show that
$$U \sim S(a,l^{-1}).$$
(c) Show that the family of distributions S(a, l^{-1}) is closed under updating of the opinion populations. In other words, show that if
$$V \sim S(a,z),$$
then, after observing data with likelihood l,
$$U \sim S(a,z\times l^{-1}),$$
where z × l^{-1} denotes the componentwise product.
7. Consider a Bayesian Network over two binary variables A and B, where the Directed Acyclic Graph is A → B and A and B each take the values 0 or 1. Let (Θa, Θb∣y, Θb∣n) denote three independent random variables representing the unknown parameters. Let θa = P_{A∣Θa}(1∣θa), θb∣y = P_{B∣A,Θb∣y}(1∣1, θb∣y), θb∣n = P_{B∣A,Θb∣n}(1∣0, θb∣n). Let the prior distributions over the parameters be
$$\pi_a(\theta) = \begin{cases} 3\theta^2 & 0\le\theta\le1 \\ 0 & \theta\notin[0,1], \end{cases}\qquad \pi_{b|y}(\theta) = \begin{cases} 12\theta^2(1-\theta) & 0\le\theta\le1 \\ 0 & \theta\notin[0,1], \end{cases}\qquad \pi_{b|n}(\theta) = \begin{cases} 12\theta(1-\theta)^2 & 0\le\theta\le1 \\ 0 & \theta\notin[0,1]. \end{cases}$$
Suppose that there is a single instantiation, where B = 1 is observed, but A is unknown. Perform the approximate updating.
8. Consider n independent trials with likelihood function
$$L(\theta;x) = \prod_{j=1}^{L}\theta_j^{n_j},$$
where nj is the number of times the symbol xj (in a finite alphabet with L symbols) is present in x, and ∑_{j=1}^{L} θj = 1. For the prior distribution over θ, a finite Dirichlet mixture is taken, given by
$$\pi_\Theta(\theta) = \sum_{i=1}^{k}\lambda_i\,Dir\big(\alpha^{(i)}q_1^{(i)},\ldots,\alpha^{(i)}q_L^{(i)}\big),$$
where λi ≥ 0, ∑_{i=1}^{k} λi = 1 (the mixture weights), α^(i) > 0, q_j^(i) > 0 and ∑_{j=1}^{L} q_j^(i) = 1 for every i. Compute the mean posterior estimate θ̂_{j;MP} for j = 1, …, L.
9. Let ϕ(θ_{j·l}, α_{j·l}) denote the Dirichlet density Dir(α_{j1l}, …, α_{jk_jl}). By performing the required integration, prove that the likelihood function for the graph structure, defined by
$$P_{X|D}(x|D) = \int\prod_{k=1}^{n}P_{X|\Theta,D}(x^{(k)}|\theta,D)\prod_{j=1}^{d}\prod_{l=1}^{q_j}\phi(\theta_{j\cdot l},\alpha_{j\cdot l})\,d\theta_{j\cdot l},$$
is given by
$$P_{X|D}(x|D) = \prod_{j=1}^{d}\prod_{l=1}^{q_j}\frac{\Gamma\big(\sum_{i=1}^{k_j}\alpha_{jil}\big)}{\Gamma\big(n(\pi_j^{(l)})+\sum_{i=1}^{k_j}\alpha_{jil}\big)}\prod_{i=1}^{k_j}\frac{\Gamma\big(n(x_j^{(i)},\pi_j^{(l)})+\alpha_{jil}\big)}{\Gamma(\alpha_{jil})}.$$
You may use the Dirichlet integral
$$\int_0^1\int_0^{1-\theta_1}\cdots\int_0^{1-(\theta_1+\ldots+\theta_{n-2})}\Big(\prod_{j=1}^{n-1}\theta_j^{\alpha_j-1}\Big)\Big(1-\sum_{j=1}^{n-1}\theta_j\Big)^{\alpha_n-1}d\theta_{n-1}\ldots d\theta_1 = \frac{\prod_{j=1}^{n}\Gamma(\alpha_j)}{\Gamma\big(\sum_{j=1}^{n}\alpha_j\big)}.$$
What parameters α_{j·l} are used if a uniform prior is taken on every θ_{j·l}? You may use Γ(n) = (n − 1)!.
12.12 Short Answers
1. Firstly, note that S^M(C) = −∑_{v∈Sp(V)} #(v) log P^M(v) and S^C(C) = −∑_{v∈Sp(V)} #(v) log P^C(v), giving
$$S^M(C)-S^C(C) = n\sum_{v\in Sp(V)}P^C(v)\log\frac{P^C(v)}{P^M(v)} = n\,D_{KL}(P^C|P^M).$$
3.
$$E[\Theta] = \int_0^1\theta\,\pi(\theta)\,d\theta = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_0^1\theta^{\alpha}(1-\theta)^{\beta-1}\,d\theta = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\cdot\frac{\Gamma(\alpha+1)\Gamma(\beta)}{\Gamma(\alpha+\beta+1)} = \frac{\alpha}{\alpha+\beta}.$$
4. Using Γ(α + 1) = αΓ(α) and writing α = ∑_{j=1}^{L} αj,
$$E[\Theta_i] = \int\theta_i\,\pi_\Theta(\theta)\,d\theta = \frac{\Gamma(\alpha)}{\prod_{j=1}^{L}\Gamma(\alpha_j)}\int\Big(\prod_{j\neq i}\theta_j^{\alpha_j-1}\Big)\theta_i^{\alpha_i}\,d\theta = \frac{\Gamma(\alpha)\big(\prod_{j\neq i}\Gamma(\alpha_j)\big)\Gamma(\alpha_i+1)}{\prod_{j=1}^{L}\Gamma(\alpha_j)\,\Gamma(\alpha+1)} = \frac{\Gamma(\alpha)\Gamma(\alpha_i+1)}{\Gamma(\alpha+1)\Gamma(\alpha_i)} = \frac{\alpha_i}{\alpha}.$$
Similarly,
$$E[\Theta_i^2] = \frac{\Gamma(\alpha)\Gamma(\alpha_i+2)}{\Gamma(\alpha+2)\Gamma(\alpha_i)} = \frac{(\alpha_i+1)\alpha_i}{(\alpha+1)\alpha}.$$
This gives
$$V(\Theta_i) = E[\Theta_i^2]-E[\Theta_i]^2 = \frac{\alpha_i(\alpha-\alpha_i)}{\alpha^2(\alpha+1)}.$$
6. (a) The free variables are (v1, …, v_{K−1}) with the constraint vK = 1 − ∑_{j=1}^{K−1} vj. Set
$$S = \sum_{j=1}^{K}\frac{v_j}{x_j} = \frac{1}{x_K}+\sum_{j=1}^{K-1}v_j\Big(\frac{1}{x_j}-\frac{1}{x_K}\Big);$$
then
$$u_j = \frac{v_j}{x_jS},\quad j = 1,\ldots,K-1,\qquad u_K = 1-\sum_{j=1}^{K-1}u_j,$$
and
$$S = \frac{1}{x_K+\sum_{j=1}^{K-1}(x_j-x_K)u_j} = \frac{1}{\sum_{j=1}^{K}x_ju_j}.$$
The Jacobian determinant for v → u may be computed by noting that vj = uj xj S and using
$$\frac{\partial S}{\partial u_\alpha} = -S^2(x_\alpha-x_K),$$
so that
$$\frac{\partial v_i}{\partial u_\alpha} = \begin{cases} -S^2u_ix_i(x_\alpha-x_K) & \alpha\neq i \\ Sx_i-S^2u_ix_i(x_i-x_K) & \alpha = i. \end{cases}$$
The matrix of which the determinant is to be computed is therefore S^{K−1} ∏_{i=1}^{K−1} xi · M, where
$$M = I-S\begin{pmatrix} u_1 \\ \vdots \\ u_{K-1} \end{pmatrix}(x_1-x_K,\ldots,x_{K-1}-x_K).$$
The eigenvector e of M with eigenvalue λ ≠ 1 satisfies (M − λI)e = 0 with
$$e = c\begin{pmatrix} u_1 \\ \vdots \\ u_{K-1} \end{pmatrix},$$
and therefore λ satisfies
$$1-\lambda = S\sum_{j=1}^{K-1}u_j(x_j-x_K) = S\Big(\sum_{j=1}^{K-1}u_jx_j-x_K+x_Ku_K\Big) = S\Big(\frac{1}{S}-x_K\Big),$$
so that λ = SxK. It follows that the density in the new coordinates is
$$\frac{\Gamma\big(\sum_{j=1}^{K}a_j\big)}{\prod_{j=1}^{K}\Gamma(a_j)}\Big(\prod_{j=1}^{K-1}(Sx_ju_j)^{a_j-1}\Big)\Big(1-S\sum_{j=1}^{K-1}x_ju_j\Big)^{a_K-1}S^{K}\prod_{j=1}^{K}x_j.$$
Simplifying, using 1 − S∑_{j=1}^{K−1} xj uj = S xK uK and S = 1/∑_{j=1}^{K} xj uj, this is
$$\frac{\Gamma\big(\sum_{j=1}^{K}a_j\big)}{\prod_{j=1}^{K}\Gamma(a_j)}\Big(\prod_{j=1}^{K}x_j^{a_j}\Big)\Big(\prod_{j=1}^{K}u_j^{a_j-1}\Big)\Big(\frac{1}{\sum_{j=1}^{K}x_ju_j}\Big)^{\sum_{j=1}^{K}a_j},$$
as required.
(b) The work was in the previous part, computing the distribution. This exercise is now a straightforward application of Bayes rule:
$$U_i = P(\{X=i\}\mid\text{data}) = \frac{P_X(i)\,l_i}{P(\text{data})} = \frac{V_il_i}{\sum_{i=1}^{K}V_il_i};$$
the denominator follows because ∑_{i=1}^{K} Ui = 1. The distribution of U now satisfies the definition of the S(a, l^{-1}) distribution of the previous part.
(c) Again, assume data is obtained, the likelihood is li = P(data∣X = i) and the prior distribution is S(a, z), so that V may be written as Vi = Wi z_i^{-1}/∑_{j=1}^{K} Wj z_j^{-1}, where W ∼ Dir(a). Then
$$U_i = P(\{X=i\}\mid\text{data}) = \frac{V_il_i}{P(\text{data})} = \frac{W_iz_i^{-1}l_i}{\sum_{i=1}^{K}W_iz_i^{-1}l_i},$$
so that U ∼ S(a, z × l^{-1}).
7. Fractional updating is used: the counts are updated to
$$n^*(x_j^{(i)},\pi_j^{(l)}) = n(x_j^{(i)},\pi_j^{(l)})+P_{X_j,Pa_j|E}(x_j^{(i)},\pi_j^{(l)}|e^*),$$
where P is the probability computed using the prior and E = (X_{i1}, …, X_{im}) consists of those variables that are instantiated in the partial observation; e* denotes the values that these variables take in the incomplete instantiation.
To update the distribution over Θa, the effective sample sizes on which the prior is based are needed. Furthermore, for Xj = A, Paj = ∅, so P_{Xj∣Paj,E} = P_{A∣B}(.∣1). For Xj = B, Paj = A, so that P_{Xj∣Paj,E} = P_{B∣A,B}(.∣., 1).
The computations of PA and PB∣A are straightforward; PA∣B is obtained using Bayes rule. Note that
$$P_A(1) = \int_0^1 P_{A|\Theta_a}(1|\theta)\pi_{\Theta_a}(\theta)\,d\theta = 3\int_0^1\theta^3\,d\theta = \frac{3}{4},\qquad P_A(0) = \frac{1}{4},$$
$$P_{B|A}(1|1) = \int_0^1 P_{B|A,\Theta_{b|y}}(1|1,\theta)\pi_{\Theta_{b|y}}(\theta)\,d\theta = 12\int_0^1\theta^3(1-\theta)\,d\theta = \frac{3}{5},\qquad P_{B|A}(0|1) = \frac{2}{5},$$
$$P_{B|A}(1|0) = 12\int_0^1\theta^2(1-\theta)^2\,d\theta = \frac{2}{5},\qquad P_{B|A}(0|0) = \frac{3}{5},$$
so
$$P_B(1) = \frac{2}{5}\times\frac{1}{4}+\frac{3}{5}\times\frac{3}{4} = \frac{11}{20},\qquad P_B(0) = \frac{9}{20},$$
$$P_{A|E^*}(1|e^*) = P_{A|B}(1|1) = \frac{P_A(1)P_{B|A}(1|1)}{P_B(1)} = \frac{9}{11},\qquad P_{A|E^*}(0|e^*) = \frac{2}{11},$$
$$P_{A,B|B}((0,1)|1) = P_{A|B}(0|1) = \frac{2}{11},\qquad P_{A,B|B}((1,1)|1) = P_{A|B}(1|1) = \frac{9}{11},$$
$$P_{A,B|B}((0,0)|1) = P_{A,B|B}((1,0)|1) = 0.$$
So the updated densities are
$$\pi_{a|e^*}(\theta) = \frac{\Gamma(5)}{\Gamma(3+\frac{9}{11})\Gamma(1+\frac{2}{11})}\theta^{2+\frac{9}{11}}(1-\theta)^{\frac{2}{11}},\qquad\theta\in[0,1],$$
$$\pi_{b|y,e^*}(\theta) = \frac{\Gamma(5+\frac{9}{11})}{\Gamma(3+\frac{9}{11})\Gamma(2)}\theta^{2+\frac{9}{11}}(1-\theta),\qquad\theta\in[0,1],$$
$$\pi_{b|n,e^*}(\theta) = \frac{\Gamma(5+\frac{2}{11})}{\Gamma(2+\frac{2}{11})\Gamma(3)}\theta^{1+\frac{2}{11}}(1-\theta)^2,\qquad\theta\in[0,1].$$
Notations As usual, for a variable Xj, let Paj denote the set of parent variables, let (π_j^(l))_{l=1}^{q_j} denote the configurations of the parent variables, and set
$$\theta_{jil} = P_{X_j|Pa_j}(x_j^{(i)}|\pi_j^{(l)}),$$
so that ∑_{i=1}^{kj} θjil = 1 for each (j, l). The collection θjil, j = 1, …, d, i = 1, …, kj, l = 1, …, qj, with the constraint given above, denotes the entire set of parameters for the network.
The functions P_{Xj∣Paj} will be referred to as potentials or CPPs (conditional probability potentials).
[Figure 13.1: The Fire network. Edges: fire → smoke, fire → alarm, tamper → alarm, alarm → leaving, leaving → report.]
The problem considered in this section is to decide whether an individual parameter is relevant to a given query constraint and, if it is, to compute the minimum amount of change needed to that parameter to enforce the constraint. The constraints considered are in the form of hard evidence where the collection E is instantiated as e, where E = (X_{e1}, …, X_{em}) is a subset of (X1, …, Xd).
Consider the Bayesian network called Fire.1 The model is shown in Figure 13.1. The network models the scenario of whether or not there is a fire in the building. Let F denote `fire', T denote `tampering', S `smoke', A `alarm', L `leaving' and R `report'. A fire may cause smoke to be seen; it may also cause the alarm to go off. Equally, if somebody tampers with the alarm, this could also cause it to go off, even without a fire. When people hear the alarm, they may leave the building and when a large number of people leave the building at an unscheduled time, this may be reported to the fire department.
Now consider the following evidence: {report = true, smoke = false}. That is, the fire department receives a report that people are evacuating the building, but no smoke is observed. This evidence should make it more likely that the fire alarm has been tampered with than that there is a real fire. Let t denote `true' and f denote `false'. Suppose that the conditional probability values for this network, derived, perhaps, from experience, are
PF: PF(t) = 0.01, PF(f) = 0.99; PT: PT(t) = 0.02, PT(f) = 0.98.
PR∣L: PR∣L(t∣t) = 0.75, PR∣L(f∣t) = 0.25; PR∣L(t∣f) = 0.01, PR∣L(f∣f) = 0.99.
PS∣F: PS∣F(t∣t) = 0.9, PS∣F(f∣t) = 0.1; PS∣F(t∣f) = 0.01, PS∣F(f∣f) = 0.99.
PL∣A: PL∣A(t∣t) = 0.88, PL∣A(f∣t) = 0.12; PL∣A(t∣f) = 0.001, PL∣A(f∣f) = 0.999.
PA∣F,T(t∣., .): PA∣F,T(t∣t, t) = 0.5, PA∣F,T(t∣t, f) = 0.99, PA∣F,T(t∣f, t) = 0.85, PA∣F,T(t∣f, f) = 0.0001.
The evidence is (R, S) = (t, f). The probability that someone has tampered with the alarm, given this evidence, is
$$P_{T|R,S}(t|t,f) = \frac{P_{T,R,S}(t,t,f)}{P_{R,S}(t,f)}.$$
Using the notation XZ to denote the state space of a variable Z ,
1
This Bayesian network is distributed with the evaluation version of the commercial HUGIN Graphical User Interface,
by HUGIN Expert.
$$P_{T,R,S}(t,t,f) = P_T(t)\sum_{X_L}p_{R|L}(t|\cdot)\sum_{X_A}P_{L|A}\sum_{X_F}P_{A|T,F}(\cdot|t,\cdot)P_{S|F}(f|\cdot)P_F,$$
and P_{R,S}(t, f) is obtained by summing the joint probability over both states of T. Similarly,
$$P_{F|R,S}(t|t,f) = \frac{P_{F,R,S}(t,t,f)}{P_{R,S}(t,f)},$$
with P_{F,R,S}(t, t, f) computed analogously.
Suppose that it is known from experience that the probability that the alarm has been tampered with should be no less than 0.65, given this evidence. The network should therefore be adjusted to accommodate this. It is simplest to try changing only one network parameter. Suppose that the probability function PT is to be adjusted. Let θ = PT(t), and let α and β denote the constants, not depending on θ, such that
$$P_{T,R,S}(t,t,f) = \theta\alpha,\qquad P_{R,S}(t,f) = (\alpha-\beta)\theta+\beta,$$
so that
$$P_{T|R,S}(t|t,f) = \frac{P_{T,R,S}(t,t,f)}{P_{R,S}(t,f)} = \frac{\alpha\theta}{(\alpha-\beta)\theta+\beta}.$$
The solution of the equation P_{T∣R,S}(t∣t,f)(θ) = 0.65 is θ = 0.0364.
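Solving αθ/((α − β)θ + β) = q for θ gives θ = qβ/(α − q(α − β)); a one-line helper in R, with α and β as computed from the network (names hypothetical):
> theta.for.query=function(q, alpha, beta) q*beta/(alpha - q*(alpha-beta))
> # theta.for.query(0.65, alpha, beta) returns 0.0364 for the Fire network values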
Similarly, let ψ = PR∣L(t∣f). Keeping all other potentials fixed, P_{T∣R,S}(t∣t,f) may be computed as a function of ψ, and the equation P_{T∣R,S}(t∣t,f)(ψ) = 0.65 has solution ψ = 0.00471.
For all other single parameter adjustments, the equation does not have a solution in the interval [0, 1]. Therefore, if only one parameter is to be adjusted, the constraint P_{T∣R,S}(t∣t,f) = 0.65 can be dealt with in either of the following two ways:
1. Increase the prior probability of tampering, PT(t), from 0.02 to at least 0.0364.
2. Decrease the probability of a false report, given that there is no evacuation, from 0.01 to less than 0.00471.
It turns out for this example that it is not possible to enforce the desired constraint by adjusting a single parameter in any of the CPPs of the variables fire, smoke, alarm and leaving.
Definition 13.3 (Proportional Scaling Property). A Bayesian network satisfies the proportional scaling property if for each conditional probability distribution θ_{j·l}, where θjil = p_{Xj∣Paj}(x_j^(i)∣π_j^(l)), there is a parameter t^(jl) such that
$$P_{X_j|Pa_j}(\cdot|\pi_j^{(l)}) = \big(\alpha_{j1l}+\beta_{j1l}t^{(jl)},\ \ldots,\ \alpha_{jk_jl}+\beta_{jk_jl}t^{(jl)}\big),$$
where ∑_{m=1}^{kj} αjml = 1 and ∑_{m=1}^{kj} βjml = 0.
Theorem 13.4. Consider a Bayesian network over a collection of variables V = {X1, …, Xd}. Suppose that the network satisfies proportional scaling, where there is a single variable parameter t in a conditional probability distribution θ_{j·l}. Then for any E = (X_{i1}, …, X_{im}) and e = (x_{i1}^{(c1)}, …, x_{im}^{(cm)}),
$$P_E(e)(t) = at+b$$
for two constants a and b that depend on e.
Proof
$$P_E(e) = \sum_{y\in X:\,(y_{i_1},\ldots,y_{i_m})=(x_{i_1}^{(c_1)},\ldots,x_{i_m}^{(c_m)})}P_V(y_1,\ldots,y_d) = \sum_{y\in X:\,(y_{i_1},\ldots,y_{i_m})=(x_{i_1}^{(c_1)},\ldots,x_{i_m}^{(c_m)})}P_{X_j|Pa_j}(y_j|\pi_j(y))\prod_{k\neq j}P_{X_k|Pa_k}(y_k|\pi_k(y)), \qquad (13.1)$$
and it is clear from the definition of proportional scaling, and from Equation (13.1), that t enters linearly. It therefore follows that
$$P_E(e)(t) = at+b.$$
It follows that for two disjoint sets of variables A and E, there are numbers a(e), b(e), c(x, e), d(x, e) such that
$$P_{A|E}(x|e)(t) = \frac{P_{A,E}(x,e)}{P_E(e)} = \frac{ct+d}{at+b}.$$
The Optimality of Proportional Scaling Consider one of the conditional probability distributions (θj1l, …, θjk_jl) and suppose that θj1l is to be altered to a different value, denoted by θ̃j1l. Under proportional scaling, the probabilities of the other states are given by
$$\tilde{\theta}_{jil} = \frac{1-\tilde{\theta}_{j1l}}{1-\theta_{j1l}}\,\theta_{jil},\qquad i = 2,\ldots,k_j.$$
This corresponds to the scaling parameter
$$t^{(jl)} = \frac{1-\tilde{\theta}_{j1l}}{1-\theta_{j1l}}$$
with
$$\alpha_{j1l} = 1,\qquad \alpha_{jil} = 0,\quad i = 2,\ldots,k_j.$$
Proportional scaling turns out to be optimal under the Chan-Darwiche distance measure.
Theorem 13.5. Consider a probability distribution P factorised according to a DAG G. Suppose the value θj1l is changed to θ̃j1l. Among the class of probability distributions Q factorised along G with Q_{Xj∣Paj}(x_j^(1)∣π_j^(l)) = θ̃j1l, min_{Q∈Q} D_{CD}(P, Q) is obtained for Q such that θ̃_{a·b} = θ_{a·b} for all (a, b) ≠ (j, l) and
$$\tilde{\theta}_{jil} = \frac{1-\tilde{\theta}_{j1l}}{1-\theta_{j1l}}\,\theta_{jil}.$$
Under proportional scaling, the Chan-Darwiche distance is then given by
$$D_{CD}(P,Q) = \Big|\ln\frac{\tilde{\theta}_{j1l}}{\theta_{j1l}}-\ln\frac{1-\tilde{\theta}_{j1l}}{1-\theta_{j1l}}\Big|.$$
Proof Let P be a distribution that factorises along a DAG G, with conditional probabilities θaib = P_{Xa∣Paa}(x_a^(i)∣π_a^(b)). Let Q denote the distribution that factorises along G, with conditional probabilities θ̃j1l and
$$\tilde{\theta}_{jil} = \frac{1-\tilde{\theta}_{j1l}}{1-\theta_{j1l}}\,\theta_{jil},\qquad i = 2,\ldots,k_j.$$
This is the distribution generated by the proportional scheme. Let R denote any other probability belonging to class Q.
If θj1l = 1 and θ̃j1l < 1, then there is a θ̃jkl > 0 with θjkl = 0, and it follows that D_{CD}(P, Q) = D_{CD}(P, R) = +∞. If θj1l = 0 and θ̃j1l > 0 then, similarly, it follows directly that D_{CD}(P, Q) = D_{CD}(P, R) = +∞.
Consider 0 < θj1l < 1. Firstly, consider θ̃j1l > θj1l. Then
$$\min_{x\in X}\frac{Q(x)}{P(x)} = \frac{1-\tilde{\theta}_{j1l}}{1-\theta_{j1l}},\qquad \max_{x\in X}\frac{Q(x)}{P(x)} = \frac{\tilde{\theta}_{j1l}}{\theta_{j1l}}.$$
Similarly, if θ̃j1l < θj1l, then
$$\max_{x\in X}\frac{Q(x)}{P(x)} = \frac{1-\tilde{\theta}_{j1l}}{1-\theta_{j1l}},\qquad \min_{x\in X}\frac{Q(x)}{P(x)} = \frac{\tilde{\theta}_{j1l}}{\theta_{j1l}},$$
so
P and R may be expressed as P_{X,Y} and R_{X,Y}, where (X, Y) are two sets of variables. Using P_{X,Y} = P_X P_{Y∣X} and R_{X,Y} = R_X R_{Y∣X}, and (x*, y*) and (x_*, y_*) to denote the points where the maxima and minima of the ratios are achieved, it follows that the extreme ratios R(x)/P(x) are at least as extreme as those for Q. Now let X denote the set of variables (X1, …, Xj) and Y the set of variables (Xj+1, …, Xd); the claimed optimality follows.
The query constraints considered are of the form
$$P_{Y|E}(y|e)-P_{Z|E}(z|e) \ge \epsilon, \qquad (13.2)$$
$$\frac{P_{Y|E}(y|e)}{P_{Z|E}(z|e)} \ge \epsilon. \qquad (13.3)$$
The notation will be abbreviated by writing P(y∣e) when the meaning is clear from the context.
Let PX denote the probability function for a collection of variables X = (X1, …, Xd), which may be factorised along a graph G = (V, E) (where V = {X1, …, Xd}), with given conditional probability potentials θjil = P_{Xj∣Paj}(x_j^(i)∣π_j^(l)). Then
$$P_X(x) = \prod_{j=1}^{d}\prod_{l=1}^{q_j}\prod_{i=1}^{k_j}\theta_{jil}^{\,n_j(i,l)},$$
where nj(i, l) = 1 if the child-parent configuration (x_j^(i), π_j^(l)) appears in x and 0 otherwise. Suppose that the probabilities (θj1l, …, θjk_jl) are parametrised by (t_1^(jl), …, t_{mj}^(jl)), where mj ≤ kj − 1. The following result holds.
Theorem 13.6. Let X = (X1, …, Xd) denote a set of variables and let P be a probability distribution that factorises along a DAG G with node set V = {X1, …, Xd}. Let θjil = P_{Xj∣Paj}(x_j^(i)∣π_j^(l)). Suppose that for each (j, l) the probabilities (θj1l, …, θjk_jl) are parametrised by (t_1^(jl), …, t_{mjl}^(jl)), where mjl ≤ kj − 1. Let E = (X_{e1}, …, X_{em}) denote a subset of X and let e = (x_{e1}^{(i1)}, …, x_{em}^{(im)}) denote an instantiation of E. Then for all 1 ≤ k ≤ mjl,
$$\frac{\partial}{\partial t_k^{(jl)}}P_E(e) = \sum_{i=1}^{k_j}\frac{P_{E,X_j,Pa_j}(e,x_j^{(i)},\pi_j^{(l)})}{\theta_{jil}}\,\frac{\partial\theta_{jil}}{\partial t_k^{(jl)}}.$$
Proof Differentiating the factorisation of PE(e), it follows that
$$\frac{\partial}{\partial t_k^{(jl)}}P_E(e) = \sum_{i=1}^{k_j}P_{E|X_j,Pa_j}(e|x_j^{(i)},\pi_j^{(l)})\,P_{Pa_j}(\pi_j^{(l)})\,\frac{\partial\theta_{jil}}{\partial t_k^{(jl)}} = \sum_{i=1}^{k_j}\frac{P_{X_j,Pa_j|E}(x_j^{(i)},\pi_j^{(l)}|e)\,P_E(e)\,P_{Pa_j}(\pi_j^{(l)})}{P_{X_j,Pa_j}(x_j^{(i)},\pi_j^{(l)})}\,\frac{\partial\theta_{jil}}{\partial t_k^{(jl)}} = \sum_{i=1}^{k_j}\frac{P_{X_j,Pa_j,E}(x_j^{(i)},\pi_j^{(l)},e)}{P_{X_j|Pa_j}(x_j^{(i)}|\pi_j^{(l)})}\,\frac{\partial\theta_{jil}}{\partial t_k^{(jl)}} = \sum_{i=1}^{k_j}\frac{P_{X_j,Pa_j,E}(x_j^{(i)},\pi_j^{(l)},e)}{\theta_{jil}}\,\frac{\partial\theta_{jil}}{\partial t_k^{(jl)}},$$
as required.
Proportional Scaling Again, the complete set of variables is X = (X1, …, Xd), with a joint probability distribution P that may be factorised along a Directed Acyclic Graph G. Evidence is received on a subset of the variables E = (X_{e1}, …, X_{em}). Consider a proportional scaling scheme, where each conditional probability distribution (θj1l, …, θjk_jl) has exactly one parameter. Under proportional scaling, this may be represented as θj1l = t^(jl), and there are non-negative numbers a_2^(jl), …, a_{kj}^(jl) satisfying ∑_{α=2}^{kj} a_α^(jl) = 1, such that
$$\theta_{j1l} = t^{(jl)},\qquad \theta_{j\alpha l} = a_\alpha^{(jl)}\big(1-t^{(jl)}\big),\quad \alpha = 2,\ldots,k_j.$$
Then, an application of Theorem 13.6 in the simplified setting of proportional scaling immediately gives
$$P_E(e) = \alpha+\beta t^{(jl)},$$
where α and β do not depend on t^(jl). It follows that for any t^(jl), ∂P_E(e)/∂t^(jl) = β, where β is constant (i.e. it does not depend on t^(jl)). This observation makes it straightforward, under proportional scaling, to find the necessary change in a single parameter t^(jl) (if such a parameter change is possible) to enforce a query constraint.
Corollary 13.7. Write λy = ∂P_{Y,E}(y,e)/∂t^(jl), λz = ∂P_{Z,E}(z,e)/∂t^(jl) and λ = ∂P_E(e)/∂t^(jl); by the above, these are constants. To satisfy the constraint given by Equation (13.2), the parameter t^(jl) has to be changed to t^(jl) + δ, where δ satisfies
$$\big(P_{Y,E}(y,e)+\lambda_y\delta\big)-\big(P_{Z,E}(z,e)+\lambda_z\delta\big) \ge \epsilon\big(P_E(e)+\lambda\delta\big). \qquad (13.6)$$
To satisfy the constraint given by Equation (13.3), the parameter t^(jl) has to be changed to t^(jl) + δ, where
$$P_{Y,E}(y,e)+\lambda_y\delta \ge \epsilon\big(P_{Z,E}(z,e)+\lambda_z\delta\big).$$
Proof Since P_{Y∣E}(y∣e) = P_{Y,E}(y,e)/P_E(e), it follows that P_{Y∣E}(y∣e) − P_{Z∣E}(z∣e) ≥ ϵ is equivalent to P_{Y,E}(y,e) − P_{Z,E}(z,e) ≥ ϵ P_E(e). A change δ in the parameter changes P_{Y,E}(y,e), P_{Z,E}(z,e) and P_E(e) to P_{Y,E}(y,e) + δλy, P_{Z,E}(z,e) + δλz and P_E(e) + δλ respectively. To enforce the difference constraint, it follows that δ satisfies
$$\big(P_{Y,E}(y,e)+\lambda_y\delta\big)-\big(P_{Z,E}(z,e)+\lambda_z\delta\big) \ge \epsilon\big(P_E(e)+\lambda\delta\big),$$
and Equation (13.6) follows directly.
Definition 13.8 (Sensitivity). Let P denote a parametrised family of probability distributions, over a finite, discrete state space X, parametrised by k parameters (θ1, …, θk) ∈ Θ̃, where Θ̃ ⊆ R^k denotes the parameter space. Let P_{(θ1,…,θk)}(.) denote the probability function over X when the parameters are fixed at θ1, …, θk. Then the sensitivity of P to parameter θj is defined as
$$S_j(\mathcal{P})(\theta_1,\ldots,\theta_k) = \max_{x\in X}\frac{\partial}{\partial\theta_j}\ln P_{(\theta_1,\ldots,\theta_k)}(x)-\min_{x\in X}\frac{\partial}{\partial\theta_j}\ln P_{(\theta_1,\ldots,\theta_k)}(x).$$
Example 13.9. If P is a family of distributions for a binary variable, with state space X = {x0, x1} and a single parameter θ, then
$$S(\mathcal{P})(\theta) = \Big|\frac{\partial}{\partial\theta}\ln\frac{P^{(\theta)}(x_1)}{P^{(\theta)}(x_0)}\Big|.$$
This section restricts attention to a single parameter model. Consider a network with d variables, X = (X1, …, Xd), where one particular variable Xj is a binary variable. The other variables may be multivalued. Let
$$t^{(jl)} = P_{X_j|Pa_j}(x_j^{(1)}|\pi_j^{(l)}).$$
Let Y denote a collection of variables, taken from (X1, …, Xd), and let Y = y denote an instantiation of these variables. Let y denote the event {Y = y} and let y^c denote the event {Y ≠ y}. Similarly, let e denote the event {E = e}, where E is a different sub-collection of variables from X. From Definition 13.8, the sensitivity of a query P(y∣e) to the parameter t^(jl) is
$$\Big|\frac{\partial}{\partial t^{(jl)}}\ln\frac{P(y|e)}{P(y^c|e)}\Big|.$$
The following theorem provides a simple bound on the derivative in terms of P(y∣e) and t(jl) only.
Theorem 13.10. Suppose Xj is a binary variable taking values x_j^(1) or x_j^(0). Set
$$t^{(jl)} = P_{X_j|Pa_j}(x_j^{(1)}|\pi_j^{(l)}).$$
Then
$$\Big|\frac{\partial}{\partial t^{(jl)}}P(y|e)\Big| \le \frac{P(y|e)(1-P(y|e))}{t^{(jl)}(1-t^{(jl)})}. \qquad (13.8)$$
The example given after the proof shows that this bound is sharp; there are situations where the
derivative assumes the bound exactly.
Proof Firstly,
$$\frac{\partial}{\partial t^{(jl)}}P(y|e) = \frac{1}{P(e)}\frac{\partial}{\partial t^{(jl)}}P(y,e)-\frac{P(y,e)}{P^2(e)}\frac{\partial}{\partial t^{(jl)}}P(e).$$
It follows that
$$\frac{\partial}{\partial t^{(jl)}}P(y|e) = \frac{(1-t^{(jl)})P(y,x_j^{(1)},\pi_j^{(l)}|e)-t^{(jl)}P(y,x_j^{(0)},\pi_j^{(l)}|e)}{t^{(jl)}(1-t^{(jl)})}-\frac{(1-t^{(jl)})P(y|e)P_{X_j,Pa_j|E}(x_j^{(1)},\pi_j^{(l)}|e)-t^{(jl)}P(y|e)P_{X_j,Pa_j|E}(x_j^{(0)},\pi_j^{(l)}|e)}{t^{(jl)}(1-t^{(jl)})} \qquad (13.9)$$
$$= \frac{(1-t^{(jl)})\big(P_{Y,X_j,Pa_j|E}(y,x_j^{(1)},\pi_j^{(l)}|e)-P(y|e)P_{X_j,Pa_j|E}(x_j^{(1)},\pi_j^{(l)}|e)\big)}{t^{(jl)}(1-t^{(jl)})}-\frac{t^{(jl)}\big(P_{Y,X_j,Pa_j|E}(y,x_j^{(0)},\pi_j^{(l)}|e)-P(y|e)P_{X_j,Pa_j|E}(x_j^{(0)},\pi_j^{(l)}|e)\big)}{t^{(jl)}(1-t^{(jl)})}, \qquad (13.10)$$
and
$$\Big|\frac{\partial}{\partial t^{(jl)}}P(y|e)\Big| \le \frac{P(y|e)(1-P(y|e))}{t^{(jl)}(1-t^{(jl)})}.$$
The proof of Theorem 13.10 is complete.
Corollary 13.11.
$$\Big|\frac{\partial}{\partial t^{(jl)}}\ln\frac{P(y|e)}{P(y^c|e)}\Big| \le \frac{1}{t^{(jl)}(1-t^{(jl)})}. \qquad (13.11)$$
Proof Immediate.
It is clear that the worst situation from a robustness point of view arises when the parameter value
t(jl) is close to either 0 or 1, while the query takes values that are close to neither 0 nor 1.
Example 13.12. This example shows that the bounds given by inequalities (13.8) and (13.11) are sharp, in the sense that there are examples where the bounds are attained. Consider the network given in Figure 13.2, where X and Y are binary variables taking values from (x0, x1) and (y0, y1) respectively, PX(x0) = θx and PY(y0) = θy. Suppose that E is a deterministic binary variable; that is, P({E = e}∣{X = Y}) = 1 and P({E = e}∣{X ≠ Y}) = 0.
[Figure 13.2: The network X → E ← Y.]
$$P_X(x_0) = \theta_x,\ P_X(x_1) = 1-\theta_x;\qquad P_Y(y_0) = \theta_y,\ P_Y(y_1) = 1-\theta_y;$$
$$P_{E|X,Y}(e|x_0,y_0) = 1,\quad P_{E|X,Y}(e|x_0,y_1) = 0,\quad P_{E|X,Y}(e|x_1,y_0) = 0,\quad P_{E|X,Y}(e|x_1,y_1) = 1.$$
Here,
$$P_{Y|E}(y_0|e) = \frac{\theta_x\theta_y}{\theta_x\theta_y+(1-\theta_x)(1-\theta_y)},$$
and
$$\frac{\partial}{\partial\theta_x}P_{Y|E}(y_0|e) = \frac{\theta_y(1-\theta_y)}{(\theta_x\theta_y+(1-\theta_x)(1-\theta_y))^2},$$
while
$$\frac{P_{Y|E}(y_0|e)(1-P_{Y|E}(y_0|e))}{\theta_x(1-\theta_x)} = \frac{\theta_y(1-\theta_y)}{(\theta_x\theta_y+(1-\theta_x)(1-\theta_y))^2},$$
showing that the bound (13.8) is achieved.
For the bound (13.11), note from the above that
$$\frac{\partial}{\partial\theta_x}\ln P_{Y|E}(y_0|e) = \frac{P_{Y|E}(y_1|e)}{\theta_x(1-\theta_x)}$$
and, because P_{Y∣E}(y0∣e) + P_{Y∣E}(y1∣e) = 1,
$$\frac{\partial}{\partial\theta_x}\ln P_{Y|E}(y_1|e) = -\frac{P_{Y|E}(y_0|e)}{\theta_x(1-\theta_x)},$$
so that
$$\frac{\partial}{\partial\theta_x}\ln\frac{P_{Y|E}(y_0|e)}{P_{Y|E}(y_1|e)} = \frac{1}{\theta_x(1-\theta_x)},$$
showing that the bound (13.11) is attained.
Theorem 13.13. Let P be a parametrised family of probability distributions, factorised along the same DAG, with a single parameter θ. Let Xj be a binary variable and let θ = P^(θ)_{Xj∣Paj}(x_j^(0)∣π_j^(l)); all the other CPPs remain fixed. Let O_θ = θ/(1−θ). Consider a parameter change from θ = t to θ = s; note that O_t = t/(1−t) and O_s = s/(1−s). Let P^(θ)(y∣e) denote the probability value of a query when θ is the parameter value, and let
$$\tilde{O}_\theta(y|e) = \frac{P^{(\theta)}(y|e)}{1-P^{(\theta)}(y|e)}.$$
Then
$$\frac{O_t}{O_s} \le \frac{\tilde{O}_s(y|e)}{\tilde{O}_t(y|e)} \le \frac{O_s}{O_t},\qquad s \ge t;\qquad\qquad \frac{O_s}{O_t} \le \frac{\tilde{O}_s(y|e)}{\tilde{O}_t(y|e)} \le \frac{O_t}{O_s},\qquad s \le t.$$
This gives the bound
$$\big|\ln\tilde{O}_s(y|e)-\ln\tilde{O}_t(y|e)\big| \le |\ln O_s-\ln O_t|.$$
Proof Let x denote the probability of the query, P(y∣e), when the value of the parameter is z. Note that, for 0 < a ≤ b < 1,
$$\int_a^b \frac{dx}{x(1-x)} = \int_a^b \frac{dx}{x} + \int_a^b \frac{dx}{1-x} = \ln \frac{b(1-a)}{a(1-b)}.$$
Then, for t ≤ s, Equation (13.8) gives
$$-\int_t^s \frac{dz}{z(1-z)} \leq \int_{P^{(t)}(y\mid e)}^{P^{(s)}(y\mid e)} \frac{dx}{x(1-x)} \leq \int_t^s \frac{dz}{z(1-z)},$$
so that
$$-\ln \frac{s(1-t)}{t(1-s)} \leq \ln \left( \frac{P^{(s)}(y\mid e)}{P^{(t)}(y\mid e)} \cdot \frac{1 - P^{(t)}(y\mid e)}{1 - P^{(s)}(y\mid e)} \right) \leq \ln \frac{s(1-t)}{t(1-s)},$$
giving immediately that
$$\frac{O_t}{O_s} \leq \frac{\widetilde{O}_s(y\mid e)}{\widetilde{O}_t(y\mid e)} \leq \frac{O_s}{O_t}.$$
For s ≤ t the argument is similar and gives
$$\frac{O_s}{O_t} \leq \frac{\widetilde{O}_s(y\mid e)}{\widetilde{O}_t(y\mid e)} \leq \frac{O_t}{O_s}.$$
In both cases,
$$\left| \ln \widetilde{O}_s(y\mid e) - \ln \widetilde{O}_t(y\mid e) \right| \leq \left| \ln O_s - \ln O_t \right|.$$
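As a numerical illustration of Theorem 13.13, the following sketch reuses the network of Example 13.12 with the assumed value θ_y = 0.3, and compares the change in the query odds with the change in the parameter odds; for this particular network the bound is attained.

theta.y <- 0.3                                  # assumed illustrative value
query <- function(t) t * theta.y / (t * theta.y + (1 - t) * (1 - theta.y))
odds <- function(p) p / (1 - p)
t <- 0.4; s <- 0.7                              # parameter change from t to s
odds.ratio.query <- odds(query(s)) / odds(query(t))  # tilde{O}_s / tilde{O}_t
odds.ratio.param <- odds(s) / odds(t)                # O_s / O_t
c(odds.ratio.query, odds.ratio.param)  # both equal 3.5: the bound is attained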
Notes The observation that the probability of evidence is a linear function of any single parameter in the model, and hence that the conditional probability is the ratio of two linear functions, is due to Castillo, Gutiérrez and Hadi (1997) [12] and [13]. The most significant developments in sensitivity analysis, which comprise practically the whole chapter, were introduced by Chan and Darwiche in the article [21] (2002) and developed in the articles [20] (2005) and [22].
13.4 Exercises
1. Consider a Bernoulli trial, with probability function P_X(⋅∣t) defined by P_X(1∣t) = t and P_X(0∣t) = 1 − t. Recall the definition of sensitivity, Definition 13.8. Compute the sensitivity with respect to the parameter t.
2. Consider the `fire' example given in the text. Suppose that the evidence is (R, S) = (t, f). Let P_T(t) = θ be a variable parameter, so that P_T(f) = 1 − θ, and suppose that all the other probabilities are fixed, according to the values given. From an initial value θ_0 = 0.02, compute the lower bound for the change δ required to satisfy the query constraint
$$\frac{P_{T\mid R,S}(t\mid t,f)}{P_{F\mid R,S}(t\mid t,f)} \geq 10,$$
corresponding to Corollary 13.7, and express the probabilities needed in terms of the conditional probabilities given. This represents the constraint that, given the report without smoke, it is 10 times more likely that the alarm has been tampered with than that there is a real fire.
3. Consider the network of Example 13.12, where now
$$P_{X,Y,E} = P_X P_Y P_{E\mid X,Y}$$
with
$$P_X = \begin{array}{cc} x_0 & x_1 \\ \theta_x & 1-\theta_x \end{array}, \qquad P_Y = \begin{array}{cc} y_0 & y_1 \\ \theta_y & 1-\theta_y \end{array}, \qquad P_{E\mid X,Y}(e\mid \cdot,\cdot) = \begin{array}{c|cc} X/Y & y_0 & y_1 \\ \hline x_0 & \alpha & \beta \\ x_1 & \beta & \alpha \end{array}$$
and β < α.
(a) Compute ∂/∂θ_x ln (P_{Y∣E}(y_0∣e)/P_{Y∣E}(y_1∣e)) and compare the result with the bound from Corollary 13.11.
4. (a) On Odds and the Weight of Evidence Let P be a probability distribution over a space X. The odds of an event A ⊆ X given B ⊆ X under P, denoted O_P(A ∣ B), is defined as
$$O_P(A\mid B) = \frac{P(A\mid B)}{P(A^c\mid B)}. \qquad (13.12)$$
The weight of evidence E ⊆ X for A given B is defined as
$$W(A : E\mid B) = \ln \frac{O_P(A\mid E \cap B)}{O_P(A\mid B)}. \qquad (13.13)$$
Show that
$$W(A : E\mid B) = \ln \frac{P(E\mid A\cap B)}{P(E\mid A^c\cap B)}. \qquad (13.14)$$
(b) On a generalised Odds and the Weight of Evidence Let P denote a probability distribution over a space X and let H_1 ⊆ X, H_2 ⊆ X, G ⊆ X and E ⊆ X. The odds of H_1 compared to H_2 given G, denoted O_P(H_1/H_2 ∣ G), is defined as
$$O_P(H_1/H_2\mid G) = \frac{P(H_1\mid G)}{P(H_2\mid G)}. \qquad (13.15)$$
The weight of evidence E for H_1 compared to H_2 given G is defined as
$$W(H_1/H_2 : E\mid G) = \ln \frac{O_P(H_1/H_2\mid G\cap E)}{O_P(H_1/H_2\mid G)}. \qquad (13.16)$$
Show that
$$W(H_1/H_2 : E\mid G) = \ln \frac{P(E\mid H_1\cap G)}{P(E\mid H_2\cap G)}. \qquad (13.17)$$
This is clearly a log-likelihood ratio; these notions are another expression for the logarithm of the Bayes factor.

13.5 Answers
2. The parameter is in the variable T, which has no parents; Pa_T = ϕ. According to the corollary, it is required to choose δ such that the constraint of Corollary 13.7 holds, where
$$\lambda_T = \frac{P_{R,S,T}(t,f,t)}{\theta_0},$$
since P_{R,S,T,R}(t, f, t, f) = 0, and
$$P_{R,S,T}(t,f,t) = P_T(t) \sum_{x_f} P_F(x_f) P_{S\mid F}(f\mid x_f) \sum_{x_a} P_{A\mid T,F}(x_a\mid t, x_f) \sum_{x_l} P_{L\mid A}(x_l\mid x_a) P_{R\mid L}(t\mid x_l),$$
$$P_{F,R,S}(t,t,f) = P_F(t) P_{S\mid F}(f\mid t) \sum_{x_t} P_T(x_t) \sum_{x_a} P_{A\mid T,F}(x_a\mid x_t, t) \sum_{x_l} P_{L\mid A}(x_l\mid x_a) P_{R\mid L}(t\mid x_l),$$
$$P_{F,R,S,T}(t,t,f,t) = \theta_0 P_F(t) P_{S\mid F}(f\mid t) \sum_{x_a} P_{A\mid F,T}(x_a\mid t,t) \sum_{x_l} P_{L\mid A}(x_l\mid x_a) P_{R\mid L}(t\mid x_l),$$
$$P_{F,R,S,T}(t,t,f,f) = (1-\theta_0) P_F(t) P_{S\mid F}(f\mid t) \sum_{x_a} P_{A\mid F,T}(x_a\mid t,f) \sum_{x_l} P_{L\mid A}(x_l\mid x_a) P_{R\mid L}(t\mid x_l).$$
3. (a)
$$P_{Y\mid E}(y_0\mid e) = \frac{P_{Y,E}(y_0,e)}{P_E(e)} = \frac{P_Y(y_0)\sum_{i=0}^1 P_X(x_i)\, P_{E\mid X,Y}(e\mid x_i, y_0)}{\sum_{i,j=0}^1 P_Y(y_j)\, P_X(x_i)\, P_{E\mid X,Y}(e\mid x_i, y_j)},$$
so that
$$P_{Y\mid E}(y_0\mid e) = \frac{(\alpha-\beta)\theta_x\theta_y + \theta_y\beta}{2\theta_x\theta_y(\alpha-\beta) + (\beta-\alpha)(\theta_x+\theta_y) + \alpha},$$
$$P_{Y\mid E}(y_1\mid e) = \frac{(\alpha-\beta)\theta_x\theta_y + \alpha + \beta\theta_x - \alpha(\theta_x+\theta_y)}{2\theta_x\theta_y(\alpha-\beta) + (\beta-\alpha)(\theta_x+\theta_y) + \alpha}.$$
Hence
$$\ln\frac{P_{Y\mid E}(y_0\mid e)}{P_{Y\mid E}(y_1\mid e)} = \ln\left((\alpha-\beta)\theta_x\theta_y + \theta_y\beta\right) - \ln\left((\alpha-\beta)\theta_x\theta_y + \alpha + \beta\theta_x - \alpha(\theta_x+\theta_y)\right)$$
and
$$\frac{\partial}{\partial\theta_x}\ln\frac{P_{Y\mid E}(y_0\mid e)}{P_{Y\mid E}(y_1\mid e)} = \frac{\alpha-\beta}{(\alpha-\beta)\theta_x+\beta} + \frac{\alpha-\beta}{\alpha-(\alpha-\beta)\theta_x} = \frac{1}{\theta_x + \frac{\beta}{\alpha-\beta}} + \frac{1}{\frac{\alpha}{\alpha-\beta} - \theta_x}.$$
Set θ̃_x = β/(α−β) + θ_x; then
$$\frac{\partial}{\partial\theta_x}\ln\frac{P_{Y\mid E}(y_0\mid e)}{P_{Y\mid E}(y_1\mid e)} = \frac{1}{\widetilde{\theta}_x} + \frac{1}{1 + \frac{2\beta}{\alpha-\beta} - \widetilde{\theta}_x}.$$
Since θ̃_x > θ_x and 1 + 2β/(α−β) − θ̃_x = 1 − θ_x + β/(α−β) > 1 − θ_x, it follows that
$$\left|\frac{\partial}{\partial\theta_x}\ln\frac{P_{Y\mid E}(y_0\mid e)}{P_{Y\mid E}(y_1\mid e)}\right| < \frac{1}{\theta_x(1-\theta_x)},$$
so the bound from Corollary 13.11 holds strictly when β > 0.
(b)
$$O_s(y_0\mid e) = \frac{(\alpha-\beta)s\theta_y + \theta_y\beta}{(\alpha-\beta)s\theta_y + \alpha + \beta s - \alpha(s+\theta_y)},$$
so that
$$\frac{O_s(y_0\mid e)}{O_t(y_0\mid e)} = \left(\frac{(\alpha-\beta)s+\beta}{(\alpha-\beta)t+\beta}\right)\left(\frac{\alpha-(\alpha-\beta)t}{\alpha-(\alpha-\beta)s}\right) = \left(\frac{s + \frac{\beta}{\alpha-\beta}}{t + \frac{\beta}{\alpha-\beta}}\right)\left(\frac{\frac{\alpha}{\alpha-\beta} - t}{\frac{\alpha}{\alpha-\beta} - s}\right).$$
4. (a)
(b)
Chapter 14

Structure Learning
14.1 Introduction
This chapter considers the problem of learning the structure of a DAG corresponding to a Bayesian network for a random (row) vector X = (X_1, ..., X_d), when presented with an n × d data matrix x, considered as an instantiation of a random matrix
$$\mathbf{X} = \begin{pmatrix} X_{1.} \\ \vdots \\ X_{n.} \end{pmatrix},$$
where X_{1.}, ..., X_{n.} is a collection of independent identically distributed random vectors, each with the same distribution as X. The notation X_{j.} means (X_{j1}, ..., X_{jd}) for j = 1, ..., n.
Methods available fall into two basic categories: search-and-score techniques, where a score function is used and the algorithm attempts to find the structure that maximises the score function, and constraint-based methods, where conditional independence tests are carried out and the independence relations thus established provide constraints, limiting the edges that can be added. Broadly speaking, algorithms can therefore be placed in one of three categories: search-and-score, constraint-based and hybrid, where hybrid algorithms use features from both approaches.
The aim of this chapter is to give a broad introduction and describe some of the search-and-score
algorithms. Constraint based algorithms will be dealt with in considerably more detail in Chapter 16,
while Markov chain Monte Carlo (MCMC), the most popular search-and-score approach, will be dealt
with in Chapter 18.
The straightforward approach of maximising the likelihood, or a posterior distribution, over graph structures leads to a problem that, at first glance, may appear fairly straightforward: there is a finite number of different possible DAGs G = (V, D) with d nodes. In general, though, testing all possible structures is not computationally feasible, because the number of possible DAGs grows super-exponentially in the number of nodes. In [118], Robinson gave the following recursive function for computing the number N(d) of acyclic directed graphs with d nodes:
$$N(d) = \sum_{i=1}^d (-1)^{i+1} \binom{d}{i} 2^{i(d-i)} N(d-i). \qquad (14.1)$$
For d = 5 this gives N(5) = 29281 and for d = 10 it gives approximately 4.2 × 10^{18}; N(d) is a very large number even for small values of d, so an exhaustive search over all structures is clearly not feasible, even for modest values of d.
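As a short sketch, Robinson's recursion (14.1) may be evaluated directly in R; double precision arithmetic is used, so the values for large d are approximate.

n.dags <- function(d) {
  N <- c(1)                      # N[k + 1] stores N(k); N(0) = 1
  for (k in 1:d) {
    i <- 1:k
    N <- c(N, sum((-1)^(i + 1) * choose(k, i) * 2^(i * (k - i)) * N[k - i + 1]))
  }
  N[d + 1]
}
n.dags(5)    # 29281
n.dags(10)   # approximately 4.2e18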
Definition 14.1 (Structural Hamming Distance). The Structural Hamming Distance (SHD) between two DAGs G_1 = (V, D_1) and G_2 = (V, D_2) is defined as the number of pairs of nodes on which the two graphs differ: pairs for which an edge is present in one graph and absent from the other, together with pairs for which an edge is present in both graphs but with different orientations. The Structural Hamming Distance between two essential graphs G_1 = (V, E_1) and G_2 = (V, E_2) is defined as
$$\mathrm{SHD}(E_1, E_2) = \min_{D_1\in\mathcal{E}_1,\, D_2\in\mathcal{E}_2} \mathrm{SHD}(D_1, D_2),$$
where 𝓔_1 is the set of DAGs within the Markov equivalence class of E_1 and 𝓔_2 is the set of DAGs within the Markov equivalence class of E_2; D_1 and D_2 are the edge sets for directed acyclic graphs chosen from the equivalence classes 𝓔_1 and 𝓔_2 respectively.
The SHD is a distance measure, or metric, in the sense that it satisfies the definition of a distance or metric; that is, it satisfies
$$\mathrm{SHD}(E_1, E_2) \geq 0 \quad \forall E_1, E_2, \qquad \mathrm{SHD}(E_1, E_2) = 0 \Leftrightarrow E_1 = E_2,$$
together with symmetry and the triangle inequality.

[Figures 14.1 and 14.2: two essential graphs over the nodes A, B, C, D, E, F, differing only in the edge C − E; they are used below to illustrate the SHD.]
The Structural Hamming Distance measures the distance between two essential graphs but, if a comparison is being made between a `fitted' graph and a `true' graph, it does not distinguish between `false positives' (edges in the fitted graph that are not in the true graph) and `false negatives' (edges present in the true graph that are absent from the fitted graph). The distance thus defined between the two graphs in Figures 14.1 and 14.2 is 1, since there is a valid orientation of the edges in Figure 14.1 where all except C − E (which is not present) have the same orientation as the edges in Figure 14.2.
For sparse graphs, with a large number of nodes, the specificity measure will always be approximately 1 for an algorithm with a tendency to wrongly reject edges rather than wrongly include edges. The following definitions for sensitivity and specificity are therefore more convenient for sparse graphs, the sensitivity modified as follows:

Definition 14.2 (Sensitivity and Specificity). For the construction of the skeleton, the sensitivity is defined as
$$\mathrm{TPR} = \frac{\text{number of edges correctly identified}}{\text{number of edges correctly identified} + \text{number falsely rejected}}, \qquad (14.4)$$
while the proposed definition for specificity, Equation (14.5), measures the rate of wrong inclusion relative to the edges identified, rather than relative to all absent edges.

Equation (14.5) is not the standard definition of specificity, but if its value is close to 1, it implies that the rate of wrong inclusion is insignificant, rather than that the graph is large and sparse; the latter would lead to a value close to 1 using the definition in Equation (14.3), even if the number of edges wrongly included is large compared with the total number of edges in the true graph.
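As an illustrative sketch, the counts needed for Equation (14.4) can be obtained from the compare function of the bnlearn package; the two model strings below are arbitrary assumptions. Note that compare counts directed arcs; for a comparison of skeletons, the skeleton function may be applied to both graphs first.

library(bnlearn)
true.dag <- model2network("[A][B|A][C|A][D|B:C]")   # assumed `true' graph
fitted.dag <- model2network("[A][B|A][C][D|B:C]")   # assumed fitted graph
counts <- compare(true.dag, fitted.dag)             # tp, fp and fn counts
counts$tp / (counts$tp + counts$fn)                 # sensitivity, Equation (14.4)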
The Kullback-Leibler divergence between probability distributions p and q over the same finite space X is
$$D_{KL}(p\|q) = \sum_{x\in\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}.$$
In view here is the divergence between the factorisation over a true directed acyclic graph G_1 = (V, D_1) and a fitted directed acyclic graph G_2 = (V, D_2). Let p̂_1 and p̂_2 denote the fitted probability distributions from the data, according to the factorisations along G_1 and G_2 respectively. The fitted distribution p̂ is the same for each directed acyclic graph within the Markov equivalence class of an essential graph.
The Prior Distribution for the Graph Structure There are several possible ways of constructing a prior distribution p_D. If it is known a priori that the graph structure lies within a subset A ⊆ D̃, then an obvious choice is the uniform prior over A:
$$P_D(D) = \begin{cases} \frac{1}{|A|} & D \in A, \\ 0 & \text{otherwise,} \end{cases}$$
where ∣A∣ is the number of elements in the subset A ⊆ D̃.
The Bayesian selection rule for a graph G = (V, D) uses the graph which maximises the posterior probability
$$P_{D\mid\mathbf{X}}(D\mid\mathbf{x}) \propto P_D(D)\, P_{\mathbf{X}\mid D}(\mathbf{x}\mid D), \qquad (14.6)$$
where P_D is the prior probability over the space of edge sets. The prior odds ratio for two different edge sets D_1 and D_2 is defined as P_D(D_1)/P_D(D_2) and the posterior odds ratio is defined as P_{D∣X}(D_1∣x)/P_{D∣X}(D_2∣x). Equation (14.6) may then be expressed as
$$\text{Posterior odds} = \text{Likelihood ratio} \times \text{Prior odds} = \frac{S(D_1)}{S(D_2)},$$
where S(D) = P_D(D) P_{X∣D}(x∣D). Using factorisations along the relevant graphs, the computation of a ratio, rather than simply computing each score function, is sometimes easier if the two graphs have some part of the structure in common.
Computing the posterior distribution is an NP-hard problem: Cooper [30] proves that the inference problem is NP-hard, and this is discussed further in [26]. Here NP stands for `non-deterministic polynomial time'; NP-hard therefore means `at least as hard as any NP problem', although it might, in fact, be harder. Koivisto and Sood [75] (2004) constructed the first algorithm with a complexity less than super-exponential for finding the posterior probability of a network, at the expense of limiting the maximum number of parents for each variable; the run time is O(d2^d + d^{k+1}C(n)), where d is the number of nodes, k is the maximum in-degree permitted and C(n) is the cost of computing a single local conditional marginal likelihood for n instantiations.
AIC and BIC Score Functions One standard score function is the Akaike Information Criterion (AIC), defined as
$$\mathrm{AIC}(D\mid\mathbf{x}) = \ln P_{\mathbf{X}\mid D,\Theta}(\mathbf{x}\mid D, \widehat{\theta}) - \dim(D),$$
where θ̂ is the maximum likelihood estimate of the parameters and dim(D) is the number of free parameters of the model; the Bayesian Information Criterion (BIC) replaces the penalty dim(D) by (dim(D)/2) ln n.
The BDeu Score The BDeu score was introduced by Heckerman, Geiger and Chickering [62]. The BD score is simply the score function given by Equation (14.7), the posterior probability over directed acyclic graphs, assuming that the variables each have a multinomial distribution. The BDeu score uses a uniform prior over graph structures, so that the posterior distribution is proportional to the likelihood, and then multiplies by a factor that penalises according to the number of edges where the graph differs from some `target' graph, based on prior information. The BDeu score function for a directed acyclic graph, based on the data, is defined as
$$S_{BDeu}(D\mid\mathbf{x}) = P_{\mathbf{X}\mid D}(\mathbf{x}\mid D)\,\kappa^{\delta(D)},$$
where x denotes the n × d data matrix of n independent instantiations of the d variables in the variable set V, D denotes the edge set for the directed acyclic graph G = (V, D), κ is a number 0 < κ ≤ 1, and δ(D) denotes the number of edges in D that differ from those in a `target' graph, a graph that is a priori considered most likely, based on prior information. The BD score function is the BDeu score function with κ = 1. When the aim is to construct a graph representing the dependence relations in the data with as few edges as possible, δ(D) simply counts the number of edges in the edge set D.
Definition 14.4 (Prior Sample Size). The prior sample size is defined as the quantity
$$\tilde{n} = \sum_{j,i,l} \alpha_{jil}.$$
The quantity ñ is considered to be the weight attached to the prior assessment; loosely speaking, it is the `number' of observations on which the prior is based.
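In practice these score functions are available directly. The following sketch uses the bnlearn package and its built-in learning.test data set (an illustrative assumption); the iss argument of the Bayesian Dirichlet score plays the role of the prior sample size ñ.

library(bnlearn)
data(learning.test)
dag <- hc(learning.test)                  # any DAG over the variables will do
score(dag, learning.test, type = "aic")   # Akaike Information Criterion
score(dag, learning.test, type = "bic")   # Bayesian Information Criterion
score(dag, learning.test, type = "bde", iss = 10)  # Bayesian Dirichlet score,
                                          # prior (imaginary) sample size 10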
Sparse Candidate At each stage n, candidate parent sets C_i are computed; a variable Y is admitted as a candidate parent of X_i if the test statistic
$$\sum_{(x,y,z)\in\mathcal{X}_{X_i}\times\mathcal{X}_Y\times\mathcal{X}_{\mathrm{Pa}_i^{(n-1)}}} n_{X_i,Y,\mathrm{Pa}_i^{(n-1)}}(x,y,z)\, \ln \frac{n_{X_i,Y,\mathrm{Pa}_i^{(n-1)}}(x,y,z)\; n_{\mathrm{Pa}_i^{(n-1)}}(z)}{n_{Y,\mathrm{Pa}_i^{(n-1)}}(y,z)\; n_{X_i,\mathrm{Pa}_i^{(n-1)}}(x,z)}$$
is sufficiently high, where Pa_i^{(n−1)} denotes the parent set of X_i in the network from stage n − 1. Here MB denotes the Markov blanket (parents, children and parents of children) and, for a set W, n_W(w) denotes the number of appearances of configuration w in the data matrix x. If the test statistic is low, it supports X_i ⊥ Y ∣ Pa_i^{(n−1)}, and hence Y is not a candidate parent. There are other ways of determining the candidate parents: anything in the current Markov blanket not d-separated from the variable by the Markov blanket should be included as a candidate parent.
The next step is to find a high-scoring network D_n in which Pa_i^{D_n} ⊂ C_i for i = 1, ..., d.
Optimal Reinsertion The optimal reinsertion algorithm, introduced by A. Moore and W.-K. Wong (2003) [96], is a search-and-score algorithm that works along the following lines: at each step a target node is chosen, all edges entering or leaving the target are deleted, and the optimal combination of in-edges and out-edges is found; the node is then re-inserted with these edges. This involves searching through the legal candidate parent sets and, for each candidate parent set, the legal child sets. Optimal reinsertion may be combined with sparse candidate.
Forward phase Let E0 denote the graph with no edges. Let En denote the essential graph from
stage n of the forward phase. Consider all possible DAGs within the Markov equivalence class,
all possible DAGs obtained by adding exactly one edge to a DAG from this equivalence class and
consider the set of essential graphs corresponding to this collection of DAGs. Let En+1 denote
the essential graph with the highest score if it has a higher score than En and continue to forward
phase stage n + 1. Otherwise, terminate the forward phase, with output En .
Backward phase Let Ẽ0 denote the output graph from the forward phase. Let Ẽn denote the
output graph from stage n of the backward phase. Consider all possible DAGs corresponding to
the equivalence class Ẽn , all possible DAGs formed by an edge deletion from these DAGs and
consider the set of essential graphs corresponding to this collection of DAGs. Let Ẽn+1 denote the
essential graph with the highest score if it is higher than that for Ẽn and continue to backward
phase stage n+1. Otherwise terminate; Ẽn is the output of the backward phase and of the greedy
equivalence search algorithm.
After the forward and backward phases, this algorithm is guaranteed to return an optimal structure provided there exists a faithful DAG. The faithfulness assumption may be relaxed; the algorithm returns a suitable structure provided the weaker composition condition holds (compositional graphoid, Equation (2.1.1)). The composition axiom is essential for the algorithm to return the correct graph. The necessity of composition is clear from the following three-variable example: let Y_1, Y_2, Y_3 be independent binary variables with P(Y_i = 1) = P(Y_i = 0) = 1/2, and set X_1 = 1(Y_2 = Y_3), X_2 = 1(Y_1 = Y_3), X_3 = 1(Y_1 = Y_2). Since X_1 ⊥ X_2, X_1 ⊥ X_3 and X_2 ⊥ X_3, adding a single edge to the empty graph will not increase the score. The algorithm will therefore terminate after the first step of the forward phase and return the empty graph, even though the three variables are not mutually independent.
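The failure of composition in this example is easily reproduced by simulation. The following R sketch shows that the pairwise tables carry no signal, while the joint distribution is strongly constrained.

set.seed(1)
n <- 10000
y1 <- rbinom(n, 1, 0.5); y2 <- rbinom(n, 1, 0.5); y3 <- rbinom(n, 1, 0.5)
x1 <- as.integer(y2 == y3)          # pairwise independent ...
x2 <- as.integer(y1 == y3)
x3 <- as.integer(y1 == y2)
chisq.test(table(x1, x2))$p.value   # large p-values: no pairwise dependence
chisq.test(table(x1, x3))$p.value
chisq.test(table(x2, x3))$p.value
table(x1 + x2 + x3)                 # ... yet the sum is always 1 or 3:
                                    # the variables are jointly dependent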
Notes The Cooper-Herskovitz likelihood was introduced by Cooper and Herskovitz in [31]. In [30], Cooper proves that the inference problem for structure learning is NP-hard. In [26], Chickering, Heckerman and Meek prove, under some assumptions, that identifying high-scoring structures in search-and-score algorithms is NP-hard. Koivisto and Sood [75] (2004) constructed the first algorithm that had a complexity less than super-exponential for finding the posterior probability of a network. The Chow-Liu tree is taken from [28] (1968). The K2 algorithm is by Cooper and Herskovitz [31] (1992). The robotics example is due to E. Lazkano, B. Sierra, A. Astigarraga and J.M. Martínez-Otzeta [79] (2007). The max-min hill climbing (MMHC) algorithm is found in [137]. The Markov chain Monte Carlo model composition algorithm, known as MC³, and the augmented Markov chain Monte Carlo model composition (AMC³) algorithm were introduced by Madigan and York [89] in 1995 and Madigan, Andersson, Perlman and Volinsky [88] in 1997.
14.4 Exercises
These exercises should be carried out using R. The bnlearn package may be useful.
1. Chow-Liu Tree Generate three columns, c1, c2 and c3, each containing independent random samples of 50 Be(1/2) observations. Here Be(1/2) means Bernoulli trials, returning 0 with probability 1/2 and 1 with probability 1/2. Let c4 = c1 + c2 and let c5 = c3 + c4. Implement the Kruskal algorithm on the variables c1, c2, c3, c4, c5 and see which edges are chosen.
2. Chow-Liu Tree Download the data set from the URL address
http://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.data
A description of the data is found at the address
http://archive.ics.uci.edu/ml/datasets/Zoo
The data set presents attributes of various animals: whether it has hair, feathers, lays eggs, produces milk, whether it is airborne, aquatic, a predator, whether or not it has teeth, a backbone, whether it breathes, is venomous, has fins, legs, a tail, whether it is domestic, and whether it is `catsize'. The last variable is a classification of the type of animal.
(a) Compute the estimated probability distribution for all the variables except for the `class' variable, assuming that they are independent. What is the Kullback-Leibler distance between the empirical distribution and the estimate using the independence model?
(b) Perform Kruskal's algorithm to determine the optimal Chow-Liu tree. Use the data from all the variables, except for the class variable, to construct a single Chow-Liu tree. Calculate the estimated probability distribution, assuming that the distribution factorises according to the Chow-Liu tree. Calculate the Kullback-Leibler distance between this estimate and the empirical probability distribution.
Note You have to specify a root for the Chow-Liu tree. This determines the directions of the arrows. All possible Chow-Liu trees from the same skeleton are Markov equivalent.
(c) Classification See how the Chow-Liu tree performs for classification. Compute the classifier using the data and then use the classifier to predict the classes of the same data set. Such a procedure is not so satisfactory; different data should be used for training and classification.
(d) Perform an MMPC algorithm on the zoo data, using a nominal significance level of 0.05. Compare it with the Chow-Liu tree after removing those edges that fail the significance test at the 0.05 level.
The following R code solves the Chow-Liu tree problem. The `50 warnings' basically come from zero-divided-by-zero problems. The code should be modified (by adding a small value, such as 0.01, to each cell) to prevent this. The classifier works reasonably well in any case.
> library("bnlearn")
> zoo <- read.csv("~/data/zoo.data", header=F)
model:
[undirected graph]
nodes: 16
arcs: 15
undirected arcs: 15
directed arcs: 0
average markov blanket size: 1.88
average neighbourhood size: 1.88
average branching factor: 0.00
pred    1  2  3  4  5  6  7
   1   11  0  0  0  0  0  0
   2    0  7  0  0  0  0  0
   3    0  0  0  0  4  0  0
   4    0  0  0  6  0  0  0
   6    0  0  0  0  0  1  0
   7    0  0  0  0  0  0  2
> resmmpc <- mmpc(zoo[,-c(1,18)])
> print(resmmpc$arcs)
from to
[1,] "hair" "aquatic"
[2,] "hair" "milk"
[3,] "eggs" "milk"
[4,] "milk" "catsize"
[5,] "milk" "eggs"
[6,] "milk" "hair"
[7,] "aquatic" "fins"
[8,] "aquatic" "predator"
[9,] "aquatic" "hair"
[10,] "predator" "domestic"
[11,] "predator" "aquatic"
[12,] "toothed" "backbone"
[13,] "backbone" "tail"
[14,] "backbone" "toothed"
[15,] "fins" "aquatic"
[16,] "tail" "backbone"
[17,] "domestic" "predator"
[18,] "catsize" "milk"
Chapter 15

Data Storage, Product Approximations, Chow-Liu Trees

15.1 Introduction
Let X = (X_1, ..., X_d) denote a random vector with probability function P_{X_1,...,X_d}. Let
$$\mathcal{X}_j = (x_j^{(1)}, \ldots, x_j^{(k_j)})$$
denote the state space of X_j and let
$$\mathcal{X} = \times_{j=1}^d \mathcal{X}_j.$$
The number of elements in the state space is ∣X∣ = ∏_{j=1}^d k_j and, without further assumptions on P, ∣X∣ − 1 elements are required to store the entire distribution.
The problem of storing the entire probability distribution is one of many expressions of the `curse of dimensionality'. The size of the problem is reduced if one instead stores lower-dimensional marginals and approximates the distribution by an appropriate product of lower-dimensional marginals.
The topic of storing a high-dimensional discrete probability distribution in a digital medium appeared in the journal literature, probably for the first time, in the work of J. Hartmanis (1959) [59] and P.M. Lewis II (1959) [85]. The Chow-Liu tree, by Chow and Liu (1968) [28], approximately 10 years later, provides an influential and effective solution to the problem. Chow and Liu gave an algorithm for selecting first-order factors for the product approximation so that, among all such first-order approximations, the constructed approximation has the minimum Kullback-Leibler distance to the actual distribution to be stored.
the distribution. The classical marginal problem considers an `inverse problem': given a family (P_{W_i})_{i=1}^s of probability distributions, for s < 2^d − 2 and W_i ⊂ V = {X_1, ..., X_d}, the question is whether there exists a probability distribution P_V that satisfies the so-called collective compatibility condition given by Equation (15.1). Here the notation ↓A means the marginalisation down to a set of variables A. This problem is of importance in the following setting: if the full probability distribution cannot be estimated and stored, it may be possible to estimate and store probability distributions over selected subsets of the variables. These subsets should be chosen so that, formally, the collection of distributions over the subsets is compatible. Some fundamental contributions to this problem are due to H.G. Kellerer [73] and others.
If the sets (W_i)_{i=1}^s are disjoint, satisfying ∪_i W_i = V, then the problem has an obvious trivial solution:
$$P(x) = \prod_{i=1}^s P_{W_i}(x_{W_i}),$$
where the product operation means first extending the probabilities P_{W_i}, as functions, to functions P̃_{W_i} over the domain V, where (using obvious notation) P̃_{W_i}(x_{W_i}, x_{V∖W_i}) = P_{W_i}(x_{W_i}) for each x_{W_i} ∈ X_{W_i}, and then multiplying; with the appropriate projections of x,
$$P(x) = \prod_{j=1}^s P_{W_j}(x_{W_j}).$$
If (W_j)_{j=1}^s are not disjoint, then clearly the collection of probabilities (P_{W_j})_{j=1}^s should satisfy a pairwise compatibility condition: with C_{ij} = W_i ∩ W_j,
$$P_{C_{ij}} = P_{W_i}^{\downarrow C_{ij}} = P_{W_j}^{\downarrow C_{ij}} \qquad \forall (i,j) \in \{1,\ldots,s\}^2.$$
The following example, due to Vorobev (1962) [141], shows that pairwise compatibility does not imply collective compatibility.

Example 15.1. Let V = {1, 2, 3}, W_1 = {2, 3}, W_2 = {1, 3}, W_3 = {1, 2}. Suppose that the following three pairwise joint distributions are specified:
$$P_{W_1}(0,0) = P_{W_1}(1,1) = \tfrac{1}{2}, \qquad P_{W_2}(0,1) = P_{W_2}(1,0) = \tfrac{1}{2}, \qquad P_{W_3}(0,0) = P_{W_3}(1,1) = \tfrac{1}{2}.$$
Then W_1 ∩ W_2 = {3} and
$$P_{W_1}^{\downarrow\{3\}}(x_3) = P_{W_2}^{\downarrow\{3\}}(x_3) = \tfrac{1}{2}, \qquad x_3 \in \{0,1\};$$
W_1 ∩ W_3 = {2} and
$$P_{W_1}^{\downarrow\{2\}}(x_2) = P_{W_3}^{\downarrow\{2\}}(x_2) = \tfrac{1}{2}, \qquad x_2 \in \{0,1\};$$
W_2 ∩ W_3 = {1} and
$$P_{W_2}^{\downarrow\{1\}}(x_1) = P_{W_3}^{\downarrow\{1\}}(x_1) = \tfrac{1}{2}, \qquad x_1 \in \{0,1\}.$$
If P* were a distribution over V with these marginals, then
$$\tfrac{1}{2} = P_{W_1}(0,0) = P^*(0,0,0) + P^*(1,0,0) \leq P_{W_2}(0,0) + P_{W_3}(1,0) = 0,$$
which is a contradiction. The three marginals satisfy a pairwise compatibility condition, but not a collective compatibility condition.
Without loss of generality, let V = ∪_{j=1}^s W_j. The condition which ensures that pairwise compatibility implies collective compatibility is known as the acyclicity condition.

Definition 15.2 (Acyclic, Running Intersection Property). Suppose that there is an ordering of the sets W_1, ..., W_s such that for each j there is an l < j such that
$$B_j = W_j \cap \left(\cup_{k=1}^{j-1} W_k\right) \subseteq W_j \cap W_l. \qquad (15.2)$$
This property is known as the running intersection property. A set of subsets W_1, ..., W_s having the running intersection property, for some ordering, is known as acyclic.

For Example 15.1, with the ordering W_1, W_2, W_3, B_3 = W_3 ∩ (W_1 ∪ W_2) = {1, 2}, but {1, 2} is not a subset of W_1 or W_2. It follows that Equation (15.2) does not hold for this ordering. It is easy to check that there is no ordering that satisfies Equation (15.2); hence acyclicity does not hold for Example 15.1.
The following important result is due to Beeri et al. (1983) [5].

Theorem 15.3. Acyclicity is equivalent to `pairwise compatibility for all (i, j) implies collective compatibility'. Furthermore, under acyclicity, there is a unique product form extension,
$$P^*(x) = \frac{\prod_{l=1}^s P(x_{W_l})}{\prod_{h=1}^{s-1} P(x_{V_h})}, \qquad (15.3)$$
where V_1, ..., V_{s−1} are the separators of a junction tree formed from W_1, ..., W_s, as in the proof below.
Proof Assume the acyclic / running intersection property. Consider the variables as nodes of a graph, in which the sets (W_i)_{i=1}^s are maximal cliques. The running intersection property is equivalent to a perfect order of the maximal cliques, which implies that the maximal cliques W_1, ..., W_s can be arranged as a junction tree: for each j ∈ {2, ..., s}, choose an l(j) < j from the set {l : W_j ∩ (∪_{k=1}^{j−1} W_k) = W_j ∩ W_l} and insert an edge j − l(j). In this way, we have a tree with s − 1 edges. For each edge ⟨j, l⟩, let V_{⟨j,l⟩} = W_j ∩ W_l and let U denote the (undirected) edge set. Now consider any collection (P_{W_j})_{j=1}^s which is pairwise compatible. Then we can define a distribution P* by
$$P^*(x) = \frac{\prod_{j=1}^s P_{W_j}(x_{W_j})}{\prod_{\langle j,l\rangle\in U} P_{V_{\langle j,l\rangle}}(x_{V_{\langle j,l\rangle}})},$$
where P_{V_{⟨j,l⟩}} = (P_{W_l})^{↓V_{⟨j,l⟩}} = (P_{W_j})^{↓V_{⟨j,l⟩}}. Since the sets (W_i)_{i=1}^s are arranged on a junction tree, any intersection W_α ∩ W_β is contained in W_γ ∩ W_δ for any edge γ − δ on the unique path α ↔ β in the tree. Hence the acyclic property gives `pairwise compatibility implies collective compatibility'.

Now suppose that pairwise compatibility implies collective compatibility for W_1, ..., W_s and assume that acyclicity does not hold. Taking W_1, ..., W_s as the maximal cliques of an undirected graph, lack of acyclicity is equivalent to the existence of a cycle of length ≥ 4 in the graph without a chord. Let the cycle be α_1, ..., α_m. Then there are W_{j_1}, ..., W_{j_m} such that {α_i, α_{i+1}} ⊆ W_{j_i} for i = 1, ..., m, using α_{m+1} = α_1. Furthermore, the lack of a chord implies that W_{j_a} ∩ W_{j_b} = ∅ for ∣a − b∣ ≥ 2, where a and b are taken mod m. We may therefore find (similarly to Vorobev's example) distributions P_{W_{j_1}}, ..., P_{W_{j_m}} which are pairwise compatible, but where the marginal distributions P_{W_{j_i}}^{↓{α_i, α_{i+1}}} (using α_{m+1} ≡ α_1) are not collectively compatible.

Uniqueness of the representation (15.3) requires that P_{W_i}(x_{W_i}) > 0 for all x_{W_i} and all i = 1, ..., s.
Let
$$B_j = W_j \cap \left(\cup_{k=1}^{j-1} W_k\right), \qquad j = 2, \ldots, s,$$
and let A_j = W_j ∖ B_j, so that W_j = A_j ∪ B_j. It follows that A_1, ..., A_s is a partition of V and that the sets (B_j)_{j=1}^s satisfy
$$B_j \subset A_i \cup B_i \quad \text{for some } i \in \{1, \ldots, j-1\}.$$
This leads to the definition of a dependence structure, the term used to describe collections (A_j, B_j)_{j=1}^s which satisfy this property.

Definition 15.4 (Dependence Structure). Let (A_i)_{i=1}^k be a partition of a set V and let S be a sequence of pairs of subsets of V, S = (A_i, B_i)_{i=1}^k, satisfying
$$B_1 = \phi, \qquad B_r \subset A_i \cup B_i \quad \text{for some } 1 \leq i \leq r-1, \quad r = 2, \ldots, k.$$
Definition 15.5 (Product Approximation). Let S be a dependence structure. Then the probability distribution defined by
$$P^{(S)}(x) = P_{A_1}(x_{A_1}) \prod_{j=2}^k P_{A_j\mid B_j}(x_{A_j}\mid x_{B_j})$$
is called a product approximation. A product approximation is clearly a well-defined probability distribution. Furthermore, it satisfies the following compatibility condition:
Lemma 15.6. Let S = (A_j, B_j)_{j=1}^k be a dependence structure. Then, for each j,
$$P^{(S)\downarrow (A_j\cup B_j)} = P_{A_j\cup B_j}.$$
Proof Marginalising P^{(S)} gives
$$P^{(S)\downarrow (A_j\cup B_j)}(x_{A_j\cup B_j}) = P_{A_j\mid B_j}(x_{A_j}\mid x_{B_j}) \sum_{A_1\cup\ldots\cup A_{j-1}\setminus B_j} P_{A_1}(x_{A_1}) \prod_{k=2}^{j-1} P_{A_k\mid B_k}(x_{A_k}\mid x_{B_k}).$$
It remains to show that P^{(S)↓B_j}(x_{B_j}) = P_{B_j}(x_{B_j}). This follows inductively; B_1 = ϕ. Assume that P_{A_i∪B_i} = P^{(S)↓(A_i∪B_i)} for i = 1, ..., j − 1. Then B_j ⊂ A_i ∪ B_i for some i ∈ {1, ..., j − 1}, so that P^{(S)↓B_j} = P_{B_j}, and the result follows by induction.
In terms of sets (W_i)_{i=1}^s with the running intersection property, the product approximation may be written
$$P^{(S)} = \frac{\prod_{i=1}^s P_{W_i}}{\prod_{i=2}^s P_{B_i}},$$
where B_i = W_i ∩ ∪_{j=1}^{i−1} W_j and the convention P_ϕ ≡ 1 is used. In this situation, clearly
$$P^{(S)\downarrow W_j} = P_{W_j}$$
and
$$A_j \perp \left(\cup_{k=1}^{j-1} A_k\right) \setminus B_j \;\Big|\; B_j \quad \text{under } P^{(S)}.$$
Definition 15.7 (Shannon Entropy). Let A ⊆ V. The Shannon entropy of the set of variables A for a probability distribution P is defined as
$$H_P(A) = -\sum_{x_A} P_A(x_A) \ln P_A(x_A),$$
where P_A = P^{↓A}.
For a dependence structure S = (A_i, B_i)_{i=1}^k and a probability distribution Q that factorises according to Q = ∏_{i=1}^k Q_{A_i∣B_i}, it is straightforward to compute that
$$D_{KL}(P\|Q) = -H_P(V) - \sum_{x\in\mathcal{X}} P(x) \sum_{i=1}^k \ln Q_{A_i\mid B_i}(x_{A_i}\mid x_{B_i})$$
$$= -H_P(V) - \sum_{i=1}^k \sum_{x_{A_i\cup B_i}} P_{A_i\cup B_i}(x_{A_i\cup B_i}) \ln Q_{A_i\mid B_i}(x_{A_i}\mid x_{B_i})$$
$$= -H_P(V) - \sum_{i=1}^k \sum_{x_{B_i}} P_{B_i}(x_{B_i}) \sum_{x_{A_i}} P_{A_i\mid B_i}(x_{A_i}\mid x_{B_i}) \ln Q_{A_i\mid B_i}(x_{A_i}\mid x_{B_i}).$$
Now use Gibbs' inequality: for any two probability distributions f and g over the same state space,
$$\sum_{j=1}^L f_j \ln f_j \geq \sum_{j=1}^L f_j \ln g_j. \qquad (15.5)$$
It follows that each inner sum is minimised by the choice Q_{A_i∣B_i} = P_{A_i∣B_i}, and
$$D_{KL}(P\|P^{(S)}) = -H_P(V) + \sum_{i=1}^k \left(H_P(A_i\cup B_i) - H_P(B_i)\right).$$
Definition 15.8 (Mutual Information). The mutual information I(A, B) between two disjoint sets of variables A and B is defined as
$$I(A,B) = \sum_{x_{A\cup B}} P_{A\cup B}(x_A, x_B) \ln \frac{P_{A\cup B}(x_A, x_B)}{P_A(x_A)\, P_B(x_B)} = H_P(A) + H_P(B) - H_P(A\cup B).$$
Since H_P(A_i ∪ B_i) − H_P(B_i) = H_P(A_i) − I(A_i, B_i), it follows that, if one is choosing a dependence structure S = (A_i, B_i)_{i=1}^k from within a class of dependence structures with the same storage properties, the dependence structure is chosen to maximise
$$Q(S) = -\sum_{i=1}^k H(A_i) + \sum_{i=1}^k I(A_i, B_i).$$
For a first-order (Chow-Liu) product approximation, the dependence structure is restricted so that
$$|A_i \cup B_i| \leq 2, \qquad i = 1, \ldots, k.$$
Let G = (V, U) denote an undirected graph, where V = {1, ..., d} is the indexing set for the nodes and U is the undirected edge set. An undirected graph G is complete if U = {⟨i, j⟩ : 1 ≤ i < j ≤ d}. The degree of a node i is defined as the number of distinct edges containing the node i.
A subgraph H of G is a graph (V_1, U_1) where V_1 ⊆ V and U_1 ⊆ U. A subgraph is induced by A ⊆ V if V_1 = A and U_1 = U ∩ (A × A). A subgraph H is a spanning subgraph of G if it is connected and V_1 = V.
An undirected tree T is a connected undirected graph that has no cycles; it follows that there is a unique path between any two nodes. A spanning tree of a graph G is a spanning subgraph of G which is a tree.
A labelled tree is a tree on d nodes where each node is labelled by one of the integers {1, ..., d}. In the sequel, labelled trees will be referred to as trees.
A weighted undirected graph is
$$G = ((V, U)\mid w),$$
where w : U → R_+ (the non-negative real numbers). The weight of a tree is the sum of its edge weights.
The weight to be used by the Chow-Liu algorithm will be dened via the mutual information
j1 = ϕ, jr ∈ {i1 , . . . , ir−1 } ⊆ V r = 2, . . . , d.
A Chow-Liu dependence structure will give a tree. Since the tree connects all the nodes, it is a spanning
tree. Arrows are directed from jr to ir . If jr = ϕ, there is no arrow pointing to the node ir . Any node
in a directed tree with jr = ϕ is called a root. By construction, i1 is the only root. A tree with exactly
one root is said to be proper.
Note that
ir ⊥P(S) {i1 , . . . , ir−1 }/{jr }∣jr .
d d
Q(σ) = − ∑ H(ir ) + ∑ I(ir , jr ).
r=1 r=1
The Chow-Liu dependence structure denes a product approximation of a known probability distribu-
tion P by
d
P(σ) (x) = Pi1 (xi1 ) ∏ P(xir ∣xjr ).
r=2
The following theorem is the first main result in Chow and Liu [28] (1968).

Theorem 15.10. Let P be a probability distribution over X. Let G = ((V, U)∣w) be a complete weighted graph with w given by
$$w(j,k) = I(j,k), \qquad \langle j,k\rangle \in U,$$
where the I(j, k)s are computed using the P_{j,k}s. Then the maximum weight spanning tree of G defines a Chow-Liu dependence structure σ, which maximises
$$Q(\sigma) = -\sum_{r=1}^d H(i_r) + \sum_{r=2}^d I(i_r, j_r).$$

Proof Firstly, Σ_{r=1}^d H(i_r) = Σ_{i=1}^d H(i), so that the first term in Q(σ) is independent of σ; hence the problem is equivalent to the maximisation of Σ_{r=2}^d I(i_r, j_r).
Let P(X) denote the space of all probability distributions over X. Let T_d = (V, σ) be a spanning tree on V, where σ is a Chow-Liu dependence structure, and let 𝒯_d denote the set of all spanning trees; then
$$\mathcal{P}(\mathcal{X}, \mathcal{T}_d) = \{P^{(\sigma)}\}$$
is the set of all tree-dependent probability distributions on X, and P(X, 𝒯_d) ⊂ P(X). The empirical probability is defined as
$$\widehat{P}_n(x) = \frac{1}{n} \sum_{k=1}^n \mathbf{1}_x(x^{(k)}).$$

Lemma 15.11. The maximum likelihood estimate P̂^{(ML)} is given by
$$\widehat{P}^{(ML)} = \arg\min_{P\in\mathcal{P}(\mathcal{X},\mathcal{T}_d)} D_{KL}(\widehat{P}_n\|P).$$
Proof
$$D_{KL}(\widehat{P}_n\|P) = -\widehat{H}_n(V) - \sum_{x\in\mathcal{X}} \widehat{P}_n(x)\ln P(x),$$
where Ĥ_n(V) = −Σ_{x∈X} P̂_n(x) ln P̂_n(x). Note that this does not depend on the tree structure. For the other part,
$$\sum_{x\in\mathcal{X}} \widehat{P}_n(x)\ln P(x) = \frac{1}{n}\sum_{j=1}^n \ln P(x^{(j)}),$$
which is the log-likelihood function. Hence, finding the maximum likelihood estimate is equivalent to the reverse I-projection of P̂_n onto the set of tree-dependent distributions P(X, 𝒯_d).
Write the (normalised) log-likelihood as
$$L(\sigma, P) = \frac{1}{n}\sum_{j=1}^n \ln P(x^{(j)}\mid \sigma, P).$$
Then
$$L(\sigma, P) = \frac{1}{n}\sum_{x\in\mathcal{X}_{i_1}} N(x)\ln P_{i_1}(x) + \frac{1}{n}\sum_{r=2}^d \sum_{(x,y)\in\mathcal{X}_{i_r}\times\mathcal{X}_{j_r}} N(x,y)\ln P_{i_r,j_r}(x,y) - \frac{1}{n}\sum_{r=2}^d \sum_{x\in\mathcal{X}_{j_r}} N(x)\ln P_{j_r}(x),$$
where N(x) denotes the number of appearances of the appropriate configuration x in the data matrix x. This reduces to
$$L(\sigma, P) = \sum_{x\in\mathcal{X}_{i_1}} \widehat{P}_{n;i_1}(x)\ln P_{i_1}(x) + \sum_{r=2}^d \sum_{(x,y)\in\mathcal{X}_{i_r}\times\mathcal{X}_{j_r}} \widehat{P}_{n;i_r,j_r}(x,y)\ln P_{i_r,j_r}(x,y) - \sum_{r=2}^d \sum_{x\in\mathcal{X}_{j_r}} \widehat{P}_{n;j_r}(x)\ln P_{j_r}(x),$$
so that
$$L(\sigma, P) = \sum_{x\in\mathcal{X}_{i_1}} \widehat{P}_{n;i_1}(x)\ln P_{i_1}(x) + \sum_{r=2}^d \sum_{y\in\mathcal{X}_{j_r}} \widehat{P}_{n;j_r}(y) \sum_{x\in\mathcal{X}_{i_r}} \widehat{P}_{n;i_r\mid j_r}(x\mid y)\ln P_{i_r\mid j_r}(x\mid y). \qquad (15.6)$$
The log-likelihood L(σ, P) is to be maximised. For a fixed structure σ, it therefore follows from Gibbs' inequality that the maximum likelihood estimates are
$$\widehat{P}^{(ML)}_{i_1} = \widehat{P}_{n;i_1}, \qquad \widehat{P}^{(ML)}_{i_r\mid j_r} = \frac{\widehat{P}_{n;i_r,j_r}}{\widehat{P}_{n;j_r}}, \quad r = 2,\ldots,d,$$
from which
$$L(\sigma, \widehat{P}^{(ML)}) = \sum_{x\in\mathcal{X}_{i_1}} \widehat{P}_{n;i_1}(x)\ln\widehat{P}_{n;i_1}(x) + \sum_{r=2}^d \sum_{(x,y)\in\mathcal{X}_{i_r}\times\mathcal{X}_{j_r}} \widehat{P}_{n;i_r,j_r}(x,y)\ln\frac{\widehat{P}_{n;i_r,j_r}(x,y)}{\widehat{P}_{n;j_r}(y)}$$
$$= \sum_{r=1}^d \sum_{x\in\mathcal{X}_{i_r}} \widehat{P}_{n;i_r}(x)\ln\widehat{P}_{n;i_r}(x) + \sum_{r=2}^d \widehat{I}(i_r, j_r),$$
where
$$\widehat{I}(i_r, j_r) = \sum_{(x,y)\in\mathcal{X}_{i_r}\times\mathcal{X}_{j_r}} \widehat{P}_{n;i_r,j_r}(x,y)\ln\frac{\widehat{P}_{n;i_r,j_r}(x,y)}{\widehat{P}_{n;i_r}(x)\,\widehat{P}_{n;j_r}(y)}$$
is the plug-in estimate of the mutual information. Clearly, the first term in the expression for L(σ, P̂^{(ML)}) does not depend on σ, and hence the maximum likelihood estimate σ^{(ML)} is given by
$$\sigma^{(ML)} = \arg\max_\sigma \left\{\sum_{r=2}^d \widehat{I}(i_r, j_r)\right\}.$$
The number of spanning trees on d nodes is d^{d−2}; this is Cayley's formula. An exhaustive search is therefore not feasible in practice. However, as pointed out by Chow and Liu [28], a greedy approach finds the maximum weight spanning tree. There are several well-known standard algorithms for finding the spanning tree of maximum weight, for example Kruskal's algorithm and Prim's algorithm. These algorithms are almost identical and find the maximum weight spanning tree in O(d² ln d) time.
Kruskal's algorithm runs as follows.
1. List the d(d − 1)/2 edge weights in decreasing order, b_1 ≥ b_2 ≥ ⋯ ≥ b_{d(d−1)/2}.
2. The edges b_1 and b_2 are selected. Then the edge b_3 is added, if it does not form a cycle.
3. This is repeated, through b_4, ..., b_{d(d−1)/2}, in that order, adding edges if they do not form a cycle and discarding them if they do.
This procedure returns a unique tree if the weights are different; if two weights are equal, one may impose an arbitrary ordering. From the d(d − 1)/2 edges, exactly d − 1 will be chosen.
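The following from-scratch R sketch implements the procedure just described, with empirical pairwise mutual information as the edge weights; x is assumed to be a data frame of discrete variables. The bnlearn package provides a ready-made implementation as chow.liu.

mutual.information <- function(a, b) {
  # plug-in estimate of I(a, b) from two discrete vectors
  p.ab <- table(a, b) / length(a)
  p.a <- rowSums(p.ab); p.b <- colSums(p.ab)
  sum(ifelse(p.ab > 0, p.ab * log(p.ab / outer(p.a, p.b)), 0))
}
chow.liu.skeleton <- function(x) {
  d <- ncol(x)
  edges <- t(combn(d, 2))                      # all d(d-1)/2 candidate edges
  w <- apply(edges, 1, function(e) mutual.information(x[[e[1]]], x[[e[2]]]))
  edges <- edges[order(w, decreasing = TRUE), , drop = FALSE]  # step 1: sort
  comp <- 1:d                                  # connected-component labels
  tree <- NULL
  for (k in seq_len(nrow(edges))) {            # steps 2-3: add each edge
    i <- edges[k, 1]; j <- edges[k, 2]         # unless it would form a cycle
    if (comp[i] != comp[j]) {
      tree <- rbind(tree, edges[k, ])
      comp[comp == comp[j]] <- comp[i]
    }
  }
  tree                                         # the d - 1 edges of the skeleton
}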
Lemma 15.12. Kruskal's algorithm returns the tree with the maximum weight.

Proof The result may be proved by induction. It is clearly true for 2 nodes. Assume that it is true for d nodes and consider a collection of d + 1 nodes, labelled (X_1, X_2, ..., X_{d+1}), where they are ordered so that, for each j = 1, ..., d + 1, the maximal tree from (X_1, ..., X_j) gives the maximal tree from any selection of j nodes from the full set of d + 1 nodes. Let b_{(i,j)} denote the weight of edge (i, j) for 1 ≤ i < j ≤ d + 1; edges will be considered to be undirected. Let T_j^{(d+1)} denote the maximal tree obtained by selecting j nodes from the d + 1, and consider T_{d+1}^{(d+1)}.

Let Z denote the leaf node in T_{d+1}^{(d+1)} such that, among all leaf nodes in T_{d+1}^{(d+1)}, the edge (Z, Y) in T_{d+1}^{(d+1)} has the smallest weight. Removing the node Z gives the maximal tree on d nodes from the set of d + 1 nodes. This is seen as follows: clearly, there is no tree with larger weight that can be formed with these d nodes, otherwise that tree, with the addition of the leaf edge (Z, Y), would be a tree on d + 1 nodes with greater weight than T_{d+1}^{(d+1)}. It follows that Z = X_{d+1} and hence that X_{d+1} is a leaf node of T_{d+1}^{(d+1)}.

By the inductive hypothesis, T_d^{(d+1)} may be obtained by applying Kruskal's algorithm to the weights (b_{(i,j)})_{1≤i<j≤d}. Now consider an application of Kruskal's algorithm to the weights (b_{(i,j)})_{1≤i<j≤d+1} and note that, for any (i, j) with i < j such that the undirected edge (X_i, X_j) forms part of the tree T_d^{(d+1)}, b_{(i,d+1)} < b_{(i,j)} and b_{(j,d+1)} < b_{(i,j)}. Therefore, if the edges (b_{(i,j)})_{1≤i<j≤d+1} are listed according to their weight and the Kruskal algorithm applied, then all the edges used in T_d^{(d+1)} will appear further up the list than any edge (b_{(k,d+1)})_{k=1}^d, and therefore all the edges of T_d^{(d+1)} will be included by the algorithm before the edges (b_{(k,d+1)})_{k=1}^d are considered. It follows that T_{d+1}^{(d+1)} is the graph obtained by applying Kruskal's algorithm to the weights (b_{(i,j)})_{1≤i<j≤d+1}, completing the induction.

The same holds for Prim's algorithm: it is clear that, with the same ordering of the weights, Prim's algorithm returns the same tree as Kruskal's algorithm.
Theorem 15.15. Suppose that P factorises according to a polytree. Then Kruskal's algorithm will locate the skeleton of the polytree.

Proof Let A = {i}, B = {j} and D = {k} be three distinct nodes. Assume that i ⊥_P j ∣ k. This can happen in the following cases:
$$i \to k \to j, \qquad i \leftarrow k \leftarrow j, \qquad i \leftarrow k \to j,$$
where m → n indicates a directed path. In all cases, I(i, k∣j) > 0 and I(k, j∣i) > 0, from which it follows that I(i, j) < min(I(i, k), I(k, j)). Kruskal's algorithm takes the edge of largest weight that does not form a cycle; the algorithm will therefore not choose the edge (i, j) if there is a node k between i and j in σ. For i → k ← j, i ⊥_P j, and hence the edge i − j will not be chosen by Kruskal's algorithm. It is straightforward to find appropriate directions for the edges: if there are edges i − j − k, then the edges take directions i → j ← k if and only if I(i, k) = 0.
Let P_0 denote the distribution generating the data and let
$$P^{(\sigma)_0} = \arg\min_{P\in\mathcal{P}(\mathcal{X},\mathcal{T}_d)} D_{KL}(P_0\|P).$$
Then P^{(σ)_0} is the reverse I-projection of P_0 and the corresponding structure σ^{(0)} = (i_r^0, j_r^0)_{r=2}^d is the Chow-Liu dependence structure. If P_0 ∈ P(X, 𝒯_d), then P^{(σ)_0} = P_0.
Let T_d^0 denote the tree structure corresponding to σ^{(0)}; then
$$\mathcal{W}(T_d^0) = \sum_{r=2}^d I^0(i_r^0, j_r^0),$$
where the I^0(i_r^0, j_r^0) are the mutual informations computed with P_0^{↓(i_r^0, j_r^0)}; this is the tree of maximal weight.
Let
$$\widehat{P}^{(\sigma_{ML;n})} = \arg\min_{P\in\mathcal{P}(\mathcal{X},\mathcal{T}_d)} D_{KL}(\widehat{P}_n\|P),$$
where P̂_n denotes the empirical distribution and σ_{ML} denotes the maximum likelihood Chow-Liu dependence structure. Let W(T; n) denote the weight of a tree T based on the probability distribution P̂_n and, in particular, let W(T_d^{(ML)}(n); n) denote the Chow-Liu dependence tree weight based on σ_{ML} and P̂_n. Then the following result holds.

Theorem 15.16.
$$\mathcal{W}(T_d^{(ML)}(n); n) \xrightarrow{n\to+\infty} \mathcal{W}(T_d^0) \qquad P_0\text{-a.s.}$$
Proof This is a consequence of the strong law of large numbers. Firstly, since X is finite, the strong law of large numbers gives that
$$\max_{x\in\mathcal{X}} |\widehat{P}_n(x) - P_0(x)| \xrightarrow{n\to+\infty} 0 \qquad P_0\text{-a.s.},$$
from which almost sure convergence of all empirical marginal distributions follows, and in particular of the pairwise marginals for all pairs (i, j). It follows that, for each tree T_d,
$$\mathcal{W}(T_d; n) \xrightarrow{n\to+\infty} \mathcal{W}(T_d) \qquad P_0\text{-a.s.}$$
and, since the number of trees is finite,
$$\max_{T_d\in\mathcal{T}_d} |\mathcal{W}(T_d; n) - \mathcal{W}(T_d)| \xrightarrow{n\to+\infty} 0.$$
Note that, by construction, W(T_d; n) ≤ W(T_d^{(ML)}(n); n) for all T_d and each n. Now let δ > 0. There is an n_δ such that, for all n ≥ n_δ,
$$\max_{T_d\in\mathcal{T}_d} |\mathcal{W}(T_d; n) - \mathcal{W}(T_d)| \leq \frac{\delta}{2}.$$
Then, for all n ≥ n_δ, W(T_d^{(ML)}(n); n) ≥ W(T_d^0; n) ≥ W(T_d^0) − δ/2, while W(T_d^{(ML)}(n); n) ≤ W(T_d^{(ML)}(n)) + δ/2 ≤ W(T_d^0) + δ/2, so that
$$|\mathcal{W}(T_d^{(ML)}(n); n) - \mathcal{W}(T_d^0)| < \delta \qquad P_0\text{-a.s.}$$
This result does not assert convergence of the sequence of trees T_d^{(ML)}(n) unless the set 𝒯_d^0 of trees of maximal weight contains exactly one element.
15.6 Classification
Many of the techniques of supervised learning, or classification, involve a Bayes rule and an approximate distribution. Variables are of two types: symptom variables X^O (O for observable) and class variables, or diagnosis variables, X^C. A prior distribution P_C is placed over the class variables, evidence is obtained in the form of an instantiation x^O of the symptom variables X^O, and the posterior distribution over the class variables is obtained using Bayes rule:
$$P_{C\mid O} = \frac{P_C P_{O\mid C}}{P_O} \propto P_C P_{O\mid C}.$$
In supervised classification, the probabilities P_{O∣C} are learned by observing the instantiations x^O in training examples where x^C is given. When classifying (where the class x^C is unknown), the class that maximises P_C P_{O∣C} is chosen, for a given set of symptoms x^O.
Often in classification, the distribution P_{O∣C} has too many states and instead a set of lower-dimensional marginals is considered:
$$P_{O\mid C} \simeq P_{A_1\mid C} \prod_{j=2}^s P_{A_j\mid B_j, C},$$
where, for each x_C, S_C := (A_j, B_j)_{j=1}^s is a dependence structure; the dependence structures may depend on x_C. The class variable x_C is then chosen to maximise
$$P_C\, P_{A_1\mid C} \prod_{j=2}^s P_{A_j\mid B_j, C}.$$
For the naïve Bayes classifier, the approximation
$$Q_{X\mid C} = \prod_{j=1}^d P_{X_j\mid C}$$
is used. The aim is then, for an observation x, to find the value c that maximises P_C Q_{X∣C}(x∣·).
Classification comes in two stages. The first is constructing the classifier: a large number of observations of X are made, assumed independent, for each value of C, where the value of C is known. From this, P_{X_j∣C} is estimated by
$$\widehat{P}_{X_j\mid C}(x\mid c) = \frac{n(x,c)}{n(c)},$$
where n(x, c) is the number of observations with (X_j, C) = (x, c) in the sample and n(c) is the number of observations with C = c. If a prior distribution P_C has been placed over the class variable C, the score function is then
$$P_C(c) \prod_{j=1}^d \widehat{P}_{X_j\mid C}(x_j\mid c),$$
and an observation x is assigned to the class c that maximises this function. If there is no prior, then the likelihood function ∏_{j=1}^d P̂_{X_j∣C}(x_j∣c) is used.
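A minimal from-scratch sketch of this classifier in R follows; the data frame x of discrete variables, the class vector cl and the smoothing value eps are assumptions of this illustration. The naive.bayes function of the bnlearn package is a ready-made alternative.

fit.nb <- function(x, cl, eps = 0.01) {
  cpts <- lapply(x, function(col) {
    tab <- table(col, cl) + eps          # small count avoids zero cells
    sweep(tab, 2, colSums(tab), "/")     # columns give P(X_j = . | C = c)
  })
  list(prior = prop.table(table(cl)), cpts = cpts)
}
classify.nb <- function(model, obs) {
  # obs: named character vector of observed values, one entry per variable
  classes <- names(model$prior)
  scores <- sapply(classes, function(cl.value) {
    log(model$prior[[cl.value]]) +
      sum(sapply(names(obs), function(v) log(model$cpts[[v]][obs[[v]], cl.value])))
  })
  names(which.max(scores))               # class maximising the score function
}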
Example 15.17.
In the article [28] by Chow and Liu, the example of character recognition is discussed. A person writes a numeral 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9 in a rectangular space and the machine has to recognise which of the ten characters has been written. The rectangle is split into a 12 × 8 grid and each of the 96 cells is coded as a 1 or a 0 according to whether or not it contains writing. In the example, 7000 numerals were used as training examples to construct the classifier, which was then applied to 12000 examples, with a success rate of 91%.
Chow-Liu Tree Suppose that V = {X_1, ..., X_d, C}, where X = (X_1, ..., X_d) is a random vector to be observed and C is a class variable. With classification, an observation x is assigned to the category c that maximises p_C(·) p_{X∣C}(x∣·). The Chow-Liu tree presents an improvement over the naïve classifier. For each category c ∈ C, the best-fitting Chow-Liu tree is estimated from the training variables,
$$Q_{X\mid C} = \prod_{j=1}^d \widehat{P}_{X_j\mid X_{\pi_c(j)}, C},$$
and then the observation x is assigned to the category c that maximises the score function S_{C,X} = P_C Q_{X∣C}(x∣·) if there is a prior P_C over the categories, or the score function S_{C,X} = Q_{X∣C}(x∣·) if the initial assessment is that all categories are equally likely.
The article [28] which introduced the Chow-Liu tree considers the problem of machine recognition of handwritten numerals 0, 1, 2, 3, 4, 5, 6, 7, 8, 9; there are c = 10 pattern classes. Let a_i denote the numeral i. There is a prior distribution p = (p_0, p_1, ..., p_9) over the numerals. The number is written on a 12 × 8 rectangle and 96 binary measurements are used to represent the numeral: 1 if the cell contains writing and 0 otherwise. In the example given in [28], 19000 numerals produced by 4 inventory clerks were scanned; 7000 of these were employed as training examples, to find the best-fitting trees and estimate the probabilities p_0, ..., p_9. The optimal trees for each of the 10 numerals were obtained. For the remaining numerals, the observation x = (x_1, ..., x_96) was considered and, by Bayes rule, assigned to the numeral a_i maximising p_i Q_{X∣C}(x∣a_i).
Chapter 16

Constraint-Based Structure Learning Algorithms
Definition 16.1 (Partial Correlation). The partial correlation ρ_{X,Y∣S} between X and Y given S is defined as
$$\rho_{X,Y\mid S} = \rho\left(X - \Sigma_{XS}\Sigma_{SS}^{-1}S,\; Y - \Sigma_{YS}\Sigma_{SS}^{-1}S\right),$$
where Σ_{SS} is the covariance matrix of S, Σ_{XS} is the covariance between X and S, and Σ_{YS} is the covariance between Y and S. The partial correlation may be viewed in the following way: regress X on S and regress Y on S; the partial correlation is the correlation between the residuals from these two regressions.
For multivariate Gaussian variables, X ⊥ Y ∣ S if and only if ρ_{X,Y∣S} = 0. To test this, first regress X against S and store the residuals R_1, then regress Y against S and store the residuals R_2; the estimated partial correlation ρ̂_{X,Y∣S} is the sample correlation between R_1 and R_2. Fisher's z-transform of the partial correlation is defined as
$$z(\widehat{\rho}_{X,Y\mid S}) = \frac{1}{2}\log\left(\frac{1+\widehat{\rho}_{X,Y\mid S}}{1-\widehat{\rho}_{X,Y\mid S}}\right).$$
Consider the null hypothesis H_0 : ρ_{X,Y∣S} = 0 versus the alternative H_1 : ρ_{X,Y∣S} ≠ 0 (two-sided test). The null hypothesis is rejected at significance level α if and only if
$$\sqrt{n - |S| - 3}\;|z(\widehat{\rho}_{X,Y\mid S})| \geq z_{\alpha/2},$$
where z_α is the value such that P(Z ≥ z_α) = α for Z ∼ N(0, 1). The distribution of the sample partial correlation was described by Fisher (1924) [42].
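The test just described may be sketched in a few lines of R; here x and y are numeric vectors and S a data frame of conditioning variables (all assumptions of this illustration).

partial.cor.test <- function(x, y, S, alpha = 0.05) {
  r1 <- residuals(lm(x ~ ., data = S))        # regress x on S
  r2 <- residuals(lm(y ~ ., data = S))        # regress y on S
  rho <- cor(r1, r2)                          # estimated partial correlation
  z <- 0.5 * log((1 + rho) / (1 - rho))       # Fisher's z-transform
  stat <- sqrt(length(x) - ncol(S) - 3) * abs(z)
  list(rho = rho, statistic = stat,
       reject = stat >= qnorm(1 - alpha / 2)) # two-sided test at level alpha
}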
Under the assumption that the variables are multivariate Gaussian, the statement X ⊥ Y ∣ S (where X and Y are variables and S is a vector) may also be tested by considering Σ̂^{−1}, where Σ is the covariance matrix of (X, Y, S) and Σ̂^{−1} is either the inverse, or a generalised inverse, of the estimate Σ̂: if X ⊥ Y ∣ S, then (Σ^{−1})_{XY} = 0.

For discrete variables, the conditional independence X ⊥ Y ∣ S may be tested using the statistic
$$G^2(X, Y, S) = 2\sum_{x,y,s} n(x,y,s)\log\frac{n(x,y,s)\,n(s)}{n(x,s)\,n(y,s)}. \qquad (16.1)$$
Asymptotically, under the null hypothesis, this has a χ² distribution on (j_x − 1)(j_y − 1)j_s degrees of freedom, where j_x, j_y and j_s are the numbers of values that X, Y and S respectively can take.
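A direct R sketch of Equation (16.1) follows, for factors x and y and a single conditioning factor s (an assumption made for simplicity; several conditioning variables can be combined into one factor with interaction()).

g2.test <- function(x, y, s) {
  n.xys <- table(x, y, s)
  g2 <- 0
  for (k in seq_len(dim(n.xys)[3])) {
    slice <- n.xys[, , k]                       # counts n(x, y, s) for fixed s
    expected <- outer(rowSums(slice), colSums(slice)) / sum(slice)
    pos <- slice > 0                            # zero cells contribute nothing
    g2 <- g2 + 2 * sum(slice[pos] * log(slice[pos] / expected[pos]))
  }
  df <- (nlevels(x) - 1) * (nlevels(y) - 1) * nlevels(s)
  c(statistic = g2, df = df, p.value = pchisq(g2, df, lower.tail = FALSE))
}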
16.3 The K2 Structural Learning Algorithm
The algorithm assumes that an order has been established for the d nodes X_1, ..., X_d so that, for each i, the parent nodes Pa_i for variable X_i are established among the nodes X_1, ..., X_{i−1}. For j = 1, ..., i − 1, the empirical Kullback-Leibler divergence between the two empirical probability distributions of (X_1, ..., X_i), one determined by the graph with and the other determined by the graph without the directed edge (i, j), is measured, and the edge is retained if a) the divergence is sufficiently large and b) node i does not already have the maximum permitted number of parents (four, in the application below). That is, if Pa_i is the current parent set of X_i and X_j is under consideration, the quantity in question is
$$\widehat{Q}(i,j) = \sum \widehat{P}_{X_i,\mathrm{Pa}_i,X_j}\log\frac{\widehat{P}_{X_i,\mathrm{Pa}_i,X_j}\,\widehat{P}_{\mathrm{Pa}_i}}{\widehat{P}_{X_i,\mathrm{Pa}_i}\,\widehat{P}_{\mathrm{Pa}_i,X_j}} = \sum \widehat{P}_{X_i,\mathrm{Pa}_i,X_j}\log\frac{\widehat{P}_{X_i\mid \mathrm{Pa}_i,X_j}}{\widehat{P}_{X_i\mid \mathrm{Pa}_i}}.$$
Under the null hypothesis that X_i ⊥ X_j ∣ Pa_i,
$$2n\widehat{Q}(i,j) \sim \chi^2_{n(\mathrm{Pa}_i)(n(X_i)-1)(n(X_j)-1)},$$
where n(X_i) denotes the number of elements in the state space of X_i, and similarly for X_j and Pa_i. The resulting algorithm is a greedy algorithm, with all the advantages and disadvantages that this implies.
When the K2 algorithm is used, the learnt structure depends entirely on the order chosen for the variables before the learning process starts. It is therefore usual to repeat the algorithm with several different randomly chosen orders (say 1000) and choose the best result: the one with the lowest Kullback-Leibler divergence between the fitted distribution and the empirical distribution.
This example is taken from the paper [79]. It shows an application of Bayesian network learning techniques to task execution in mobile robots. The task here is for the robot to locate an open door and travel through it. The robot emits sonar pulses and is equipped with eight detectors, which detect the echoes. From this information, it has to decide where the door is located. An action has to be taken: step to the left, step to the right, or straight ahead. This is the class variable, and the class has to be determined by the signals received by the eight detectors. Since the signals are not independent of each other (the echoes may be created by the same object), the model is improved by incorporating a dependence structure.

In this experiment, the problem is to learn the structure of the Bayesian network and to estimate the probability potentials from the training data base. The K2 algorithm is employed to establish a suitable structure. For the robot learning example, the maximum number of parents is set to four; the size of the probability potentials cannot be too large, since the robot is expected to find the door and travel through it in real time. The intensity of an echo may be modelled as a continuous random variable, but the variables are discretised for computational convenience; in general, it is not convenient to use a variable with more than 20 different values. In this experiment, the K2 runs were repeated 1000 times, with different random orderings, and the network with the optimal value was selected. The resulting network for the eight variables is shown in Figure 16.1.
Figure 16.1: Network produced by the K2 algorithm. Here the nodes Sj represent the signals received
by the sensors. The variable C , not shown, which is a parent to all the variables shown, denotes the
class variable, the action to be performed.
In addition to the eight variables shown in the network, there is also a class variable C, the direction to be taken, which is a parent of all the nodes in X = (S_1, S_2, S_3, S_4, S_5, S_6, S_7, S_8). The network is estimated using a uniform prior distribution over C, with C taken as an ancestor variable in each random ordering chosen for the nodes in X; the action performed is the action that maximises p̂_{X,C}, where p̂ is the estimate of the distribution from the training examples, factorised according to the DAG in Figure 16.1.
The PC algorithm assumes that there exists a faithful graphical representation for the probability distribution. First, a complete (undirected) graph is created. Then, for n = 0, 1, 2, ..., an edge ⟨α, β⟩ is removed if and only if there is a set S_{α,β} of size n such that X_α ⊥ X_β ∣ X_{S_{α,β}}. This is the approach to finding the skeleton. A vee structure (α, γ, β) is declared to be an immorality if and only if γ ∉ S_{α,β} (known as the minimal sepset). The remaining compelled edges are then added using Meek's rules to obtain the essential graph: edges α ∼ β that appear in the structures given in Figure 2.10 (Definition 2.16, page 42) are directed as in Figure 2.10.
The algorithm starts with an input order for the variables (X_1, ..., X_d). Stage 1 of the PC algorithm is given in Algorithm 6. The MMPC differs from the PC in one respect: there is a gentle change whereby, at each stage, the best variable is added into the parent set. Stage 1 of the MMPC algorithm is given in Algorithm 7. After the first stage of the PC / MMPC algorithm, the candidate parent/children sets may contain too many variables; the next stage prunes them. This is Algorithm 8.
After Stages 1 and 2 of the PC / MMPC algorithm, there may still be false positives. Suppose a probability distribution may be represented by the DAG in Figure 16.2. Working from T, the node C may enter the output, and remain in the output.

[Figure 16.2: a DAG with edges T → A, A → C, B → A and B → C.]

This is because C is dependent on T conditioned on all subsets of T's parents and children, namely ϕ (the empty set) and {A}. Note that the collider connection T → A ← B is opened when A is instantiated, so that, when A is instantiated and B is uninstantiated, T is d-connected with C. For ϕ (the empty set), T → A → C is a chain connection, where A is uninstantiated, so that T is d-connected to C. T and C are d-separated if and only if A and B are simultaneously instantiated; that is, T ⊥ C ∣ {A, B}. But B is independent of T given the empty set, so it will be removed from Z. Therefore, the link T − C will not be removed. This is corrected by considering the parent / child sets of the other variables: when working from C, both A and B will be in the parent / child set, and T ⊥ C ∣ {A, B}. The third stage of the algorithm (Algorithm 9) removes these false positives.
Establishing the Essential Graph Having recorded the sepsets (sets S_{XY} such that X ⊥ Y ∣ S_{XY}), it is now straightforward to construct the essential graph. For each vee structure (X, Z, Y) (that is, a structure such that {X, Y} ⊂ Z^{(Z)} but X ∉ Z^{(Y)}), check whether or not Z ∈ S_{XY}. If Z ∈ S_{XY}, then (X, Z, Y) is not an immorality and the edges X − Z − Y remain undirected at this stage. If Z ∉ S_{XY}, then (X, Z, Y) is an immorality. Finally, add in the additional compelled edges using Meek's rules: edges α ∼ β that appear in the structures given in Figure 2.10 (Definition 2.16, page 42) are directed as in Figure 2.10.
Starting with a variable set V, the initial graph is the complete graph, with undirected edges between each pair of variables {X, Y}. For each pair {X, Y} ⊂ V, it is checked whether or not X ⊥ Y and, if this holds, the edge X − Y is removed and S_{X,Y} = ϕ, the empty set, is recorded (S_{X,Y} is the separator). For each vee structure X − Z − Y where there is no edge X − Y, the triple (X, Z, Y) is an immorality X → Z ← Y. For each pair {X, Y} that do not have an edge between them, the set S_{X,Y} used to determine the edge removal, using X ⊥ Y ∣ S_{X,Y}, is recorded. After this initialisation (stage 0), the algorithm proceeds recursively. At stage n + 1, do the following.

Start with the skeleton from stage n. For each vee-structure α − γ − β, the vee-structure is an immorality if γ ∉ S_{α,β} and it is not an immorality if γ ∈ S_{α,β}. Add in the remaining compelled edges. The resulting graph is the stage n essential graph, which is a chain graph. Locate the chain components.
Starting with a chain component that has no descendants and proceeding backwards, consider in turn each chain component G_C = (C, U_C) and the subgraph G_D formed by taking the chain component G_C together with the chain components that contain parent variables of G_C and all the directed edges connecting these chain components. Let D denote the variable set for G_D. For each Y ∈ C and each neighbour X of Y (consider first the parents in different connected components, and then the undirected neighbours in the component G_C), check whether there is a set S_{XY} ⊂ D of size n such that X ⊥ Y ∣ S_{XY}. If there is, then remove the edge between X and Y and record S_{XY}. Remove the chain component G_C and proceed recursively until the whole graph has been considered.

This is repeated, increasing the size of the conditioning sets, until the size of the largest neighbour set in the undirected graph is equal to n; then the algorithm terminates. Undirect all the edges, find the immoralities and add in the remaining compelled edges. The output is the resulting essential graph.
Note In [150], the algorithm presented is slightly different: once an edge is directed, it is not subsequently undirected. It is difficult to see the theoretical justification for this; the modification presented here ensures that, when there is a faithful graph and assuming a perfect oracle, the output graph is the essential graph.
[Figure: a DAG over the variables X_1, ..., X_7, on which the following example is based.]
Suppose also that there is a `perfect oracle': the independence tests give the correct results. After the first round, X_1 ⊥̸ X_2, X_1 ⊥̸ X_6, X_1 ⊥̸ X_7, but X_1 ⊥ {X_3, X_4, X_5}; X_2 ⊥̸ X_j for any j; X_3 ⊥̸ X_j for j = 4, 5, 6, 7; X_4 ⊥̸ X_j for j = 5, 6, 7; X_5 ⊥̸ X_j for j = 6, 7; and X_6 ⊥̸ X_7.
After the CI tests with conditioning sets of size 0 have been carried out, the immoralities are determined:
$$(X_1, X_2, X_3), \quad (X_1, X_6, X_3), \quad (X_1, X_7, X_3), \quad (X_1, X_2, X_4), \quad (X_1, X_6, X_4).$$
[Figure: the partially directed graph over X_1, ..., X_7 obtained after the size-0 stage.]
At this point, D is removed. Since A_1 and A_2 have no ancestors, they are considered separately and the algorithm is finished. If A_1 and A_2 were descendants of other chain components, a chain component with no descendants would be chosen and the algorithm would continue until all chain components had been considered. The algorithm is then repeated with conditioning sets of size 2, and so on, until the termination condition is satisfied.
At each stage, either add a directed edge to D, choosing an edge in E and directing it (any direction that does not produce a cycle is admissible), or delete an edge from G, or reverse an edge in G, or leave the graph unaltered. From all the possibilities `add an edge', `delete an edge', `reverse an edge', `leave the graph unaltered', choose the one that gives the greatest score; that is, the operation that produces the greatest reduction in the Kullback–Leibler divergence between the probability modelled along the graph and the empirical probability.
The algorithm may be modified as follows: instead of the best change, make the best change that results in a graph that has not already appeared. When 15 changes occur without an increase in the best score ever encountered during the search, the algorithm terminates. The DAG that produced the best score is then returned. One such hybrid method is the MMHC algorithm of Tsamardinos, Brown and Aliferis, which starts with the constraint-based MMPC stage to locate the skeleton and then carries out a search-and-score stage, using the skeleton obtained from MMPC as the candidate edge set. Two other hybrid methods are described below.
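The bnlearn package (used in Chapter 17) provides an implementation of this hybrid scheme; a minimal sketch, using the learning.test data set that ships with the package:

library(bnlearn)
data(learning.test)
# Restrict with the constraint-based MMPC skeleton, then maximise a
# score by hill climbing over the candidate edge set.
net <- mmhc(learning.test)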
16.9.2 L1-Regularisation
One method, introduced by Schmidt, Niculescu-Mizil and Murphy (2007) [123], places constraints on
the model and then uses an L1 score function, described below, as the basis of a search and score
within the constrained space.
The method can be employed with Gaussian or binary variables. The binary case is outlined here.
In this algorithm, there is no restriction on the number of parents that a variable may have, but there is a constraint on the way in which the parents influence the variable. The state space of variable Xj is {−1, 1} for each j and the conditional probabilities are modelled so that the logit function is linear:

ln ( P(Xj = 1 ∣ πj) / P(Xj = −1 ∣ πj) ) = θj,0 + ∑_{k=1}^{pj} θjk πjk    (16.2)

where πj = (πj1, . . . , πjpj), the configuration of Paj, is a sequence of ±1 corresponding to the states of the parent variables. The parent variables are only permitted to influence ln(p/(1 − p)) linearly; no interactions are permitted. This permits a large number of parents, since the number of parameters is linear, rather than exponential, in the number of parents.
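A minimal numerical illustration of Equation (16.2), with hypothetical parameter values; theta0 plays the role of θj,0 and theta the role of (θj1, . . . , θjpj).

# Conditional probability that X_j = 1 given the parent configuration pa,
# with parent states coded as -1/+1 and a linear logit (no interactions).
p.xj.one <- function(theta0, theta, pa) {
  eta <- theta0 + sum(theta * pa)   # the linear logit of Equation (16.2)
  1 / (1 + exp(-eta))
}
p.xj.one(theta0 = 0.5, theta = c(1, -2), pa = c(1, -1))  # two parents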
The algorithm works in two stages. Like the MMPC, it first produces candidate parent / children sets for each variable, thereby constraining the search space. Having determined the candidate parent / children sets, it runs the hill climbing part of the MMHC algorithm of Tsamardinos, Brown and Aliferis to obtain the structure, keeping the conditional probabilities of the form in Equation (16.2). For a vector x = (x1, . . . , xd) of 1's and −1's, let Paj denote all the variables other than j; that is, all variables permitted as possible parents for j at this stage. Let x̃(j) denote the vector x without xj. Let

LL(j, θj, x) = ln P(xj ∣ x̃(j); θj) = −ln (1 + exp (−xj (θj,0 + ∑_{k=1}^{d−1} θjk x̃(j)k)))

denote the log likelihood function and, for x the data matrix with rows x(1), . . . , x(n), let

LL(j, θj, x) = ∑_{k=1}^{n} LL(j, θj, x(k)).
The L1 score function is

LL(j, θj, x) − λ∥θj∥1,

where ∥θj∥1 = ∑_{k=1}^{d−1} ∣θjk∣ and λ is chosen appropriately. The sum is over the parameters corresponding to dependence on parent variables; the parameter θj,0 is not included. The article [123] has some discussion about the appropriate choice of λ.
The L1 regularisation, if λ is appropriately chosen, has the effect of choosing vectors θj with a substantial number of zero components. Because of this property, it tends to favour models with a lower number of parameters. For this reason, L1 regularisation is a technique of increasing importance.
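A minimal sketch of the candidate-parent step in R, assuming the glmnet package as the L1-penalised optimiser (the article [123] describes its own implementation); for each variable, the variables receiving non-zero coefficients are retained as candidates.

library(glmnet)

# x: n-by-d matrix of -1/+1 entries with named columns; lambda: L1 penalty.
candidate_sets <- function(x, lambda) {
  lapply(seq_len(ncol(x)), function(j) {
    fit <- glmnet(x[, -j, drop = FALSE], factor(x[, j]),
                  family = "binomial", lambda = lambda)
    beta <- as.vector(coef(fit))[-1]   # drop the intercept theta_{j,0}
    colnames(x)[-j][beta != 0]         # candidate parents/children of X_j
  })
}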
Xα ⊥̸ Xβ ∣ XWα,β/{α,β}.
In other words, only those cliques containing either α or β (or both) need to be considered, leading to
a reduction in the size of the conditioning sets and hence to more accurate conditional independence
tests.
There may be some additional gain, in terms of reducing the size of the conditioning sets, if the junction tree can be successively updated, by removing from Ũ edges that have been considered, for which it has been established that they are not in U.
Vats and Nowak (2014) [139] provide a framework for this, by considering the so-called region graph. A region graph is simply a directed acyclic graph, where a node of the region graph (which we call a region-node) is a subset of V, the node set. The region graph of interest is constructed as follows: the first generation of regions, R1, is the collection C of cliques of a junction tree. These are the ancestor nodes of the region graph. Generation Ri+1 is the set of all pairwise intersections of sets in Ri with cardinality greater than or equal to 2, for i = 1, . . . , L − 1, where L is the maximum value of i for which Ri constructed in this way is non-empty.
The edge set of a region graph contains an edge R → S if and only if R ∈ Ri and S ∈ Ri+1 for some
i ∈ {1, . . . , L − 1} and there is a set T ∈ Ri such that R ∩ T = S .
Vats and Nowak propose an algorithm for locating the independence graph G = (V, U), given a decomposable graph H = (V, Ũ) where U ⊆ Ũ. The algorithm is given as Algorithm 10; some further notation is needed before introducing it.
For a region R of a region graph, let

R̄ = ⋃_{S ∈ an(R) ∪ {R}} S.    (16.3)

In other words, R̄ is the union of region R and all its ancestors. In terms of the junction tree for H, this is the union of all cliques which have R as a subset.
For a node set S, let K(S) denote the complete undirected graph with node set S. For a set of nodes R, let WR denote the edge set W restricted to R.
Suppose the independence graph G = (V, U) is given on the left of Figure 16.5 and it is known that U ⊆ Ũ, where H = (V, Ũ) is the graph on the right. Here V = {1, 2, 3, 4, 5, 6, 7}.
The algorithm proceeds as follows:
Figure 16.5: Graph G = (V, U) and H = (V, Ũ); Ũ ⊇ U
Clique {1, 2, 3, 5} has child {1, 3, 5}. Removing the edges of the child leaves edges ⟨1, 2⟩, ⟨2, 3⟩, ⟨2, 5⟩ from Ũ to be estimated. Therefore, look at R1 (the cliques of the junction tree with the complete graphs of the separators removed).
Clique {1, 2, 3, 5}: edges ⟨1, 2⟩, ⟨2, 3⟩, ⟨2, 5⟩ considered; ⟨2, 3⟩ and ⟨2, 5⟩ removed. Edge ⟨1, 2⟩ added to Û.
Clique {1, 3, 4, 5}: children are {1, 3, 5} and {3, 4, 5}. Therefore, only edge ⟨1, 4⟩ is considered. This is retained; it is therefore removed from Ũ and added to Û.
Clique {3, 4, 5, 6}: children are {3, 4, 5} and {4, 5, 6}. Only edge ⟨3, 6⟩ is considered. It is removed from Ũ.
Clique {4, 5, 6, 7}: child is {4, 5, 6}. Edges considered are ⟨4, 7⟩, ⟨5, 7⟩ and ⟨6, 7⟩. They are removed from Ũ. Edges ⟨5, 7⟩ and ⟨6, 7⟩ are added to Û.
At this stage, a new junction tree may be computed, using the edges from Ũ ∪ Û. This may be more efficient. Alternatively, we may continue with the same junction tree.
Figure 16.6: Junction Tree (above) and Region Graph (below) for Figure 16.5
After deleting these edges from Ũ, generation R2 is the first generation that satisfies the property.
Region {1, 3, 5}: this has one child, which is {3, 5}. The edges under consideration are therefore ⟨1, 3⟩ and ⟨1, 5⟩. These have not been considered before. They are removed from Ũ and edge ⟨1, 3⟩ is added to Û.
Region {3, 4, 5}: this has two children, {3, 5} and {4, 5}. Only one edge is considered: ⟨3, 4⟩. This is removed from Ũ. It is not present in U and therefore (assuming a perfect oracle) is not added to Û.
Region {4, 5, 6}: this has one child, {4, 5}. The edges under consideration are therefore ⟨4, 6⟩ and ⟨5, 6⟩. They are removed from Ũ. The edge ⟨4, 6⟩ is added to Û.
At this stage, a new junction tree may be computed. If we proceed with the current junction tree, we look at R3. This contains regions {3, 5} and {4, 5}. Each region consists of two nodes; for each region, there is one edge under consideration. The edge ⟨3, 5⟩ is removed from Ũ and added to Û; similarly with ⟨4, 5⟩.
At this stage, Ũ = ϕ and Û = U.
by Theorem 5.7. Recall the definition of weak decomposition (Definition 7.17). The independence graph is subsequently decomposed, the sep-sets recorded at each stage. With the reconstruction, an edge ⟨α, β⟩ appears in the final graph if and only if it appears in all parts of the decomposition that contain both nodes α and β; otherwise a suitable immorality is added, dictated by the sep-sets in the usual manner. The compelled edges are then added and the essential graph is returned.
The algorithm assumes that independence statements Xα ⊥ Xβ ∣ X−{α,β} can be verified. This may be a weak point for large numbers of random variables, since the power of conditional independence tests decays as the number of variables in the conditioning set grows. The algorithm may be combined with the algorithm of Section 16.10 from Vats and Nowak [139] to reduce the size of the conditioning sets if there is additional a priori information that the independence graph is contained within a decomposable graph H = (V, Ũ).
Theorem 2.3 is essential for proving that the algorithm returns a faithful DAG when it exists. From Definition 5.6 (the definition of the independence graph) together with Theorem 5.7, it follows that graphical separation statements in the independence graph are equivalent to the corresponding conditional independence statements for the probability distribution. By Theorem 5.5 together with the definition of the independence graph (Definition 5.6), the independence graph is equivalent to the moral graph of a faithful DAG, when a faithful DAG exists.
The following two theorems are used crucially in proving that the graph returned by the algorithm
is the skeleton of a DAG along which the distribution may be factorised.
Theorem 16.5. Let G = (V, D) be a DAG. Suppose that A ⫫ B∥G S for three subsets A, B, S ⊂ V. Let α ∈ A and β ∈ A ∪ S. Then α ⫫ β∥G R for some R ⊂ A ∪ B ∪ S if and only if α ⫫ β∥G R′ for a subset R′ ⊂ A ∪ S.
Theorem 16.6. Let G = (V, D) be a DAG and suppose that A, B, S ⊂ V such that A ⫫ B∥G S. Let α, β ∈ S. Then there is a subset R ⊆ A ∪ B ∪ S such that α ⫫ β∥G R if and only if either there is a subset R′ ⊂ A ∪ S or there is a subset R′ ⊂ B ∪ S such that α ⫫ β∥G R′.
These statements are, at face value, precisely what one would expect. Their proofs, though, are somewhat involved. The Xie-Geng algorithm is based on these statements, which enable the edge set for the whole graph to be determined by examining subsets of the variables. The proofs of these theorems are given later, after the description of the algorithm.
An undirected graph is constructed. This is the graph G = (V, U) where ⟨α, β⟩ ∈ U if and only if Xα ⊥̸ Xβ ∣ X−{α,β}. This is the independence graph (Definition 5.6). It is therefore equivalent to the moral graph of a faithful DAG if a faithful DAG exists (Exercise 6 page 352).
A weak decomposition (A, B, S) (Definition 7.17) of the moral graph is found, if such a decomposition exists.
Construct GA∪S and GB∪S, where for each γ, δ ∈ A ∪ S, ⟨γ, δ⟩ ∈ UA∪S if and only if Xγ ⊥̸ Xδ ∣ X(A∪S)/{γ,δ}, and similarly for GB∪S. These are the independence graphs for A ∪ S and B ∪ S respectively.
Before the assembly stage, the following additional stage is carried out on the cliques which are obtained
from the recursive decomposition:
For each clique A in the decomposition and each pair {α, β} ⊆ A, check whether there is a subset
S ⊂ A/{α, β} such that Xα ⊥ Xβ ∣XS . If there is, then remove the edge ⟨α, β⟩ and let Sα,β = S ,
the sep-set of {α, β}.
Two sub-skeletons LA∪S = (A ∪ S, UA∪S) and LB∪S = (B ∪ S, UB∪S) are combined to form LA∪B∪S = (A ∪ B ∪ S, UA∪B∪S), where

UA∪B∪S = (UA∪S ∪ UB∪S)/{⟨α, β⟩ ∣ α, β ∈ S, ⟨α, β⟩ ∈/ UA∪S ∩ UB∪S}.
This is done recursively until all the pieces have been added.
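A minimal sketch of the combination rule in R (edge sets represented as two-column matrices of node names; the node names here are illustrative):

# Combine the skeletons over A-union-S and B-union-S: take the union of
# the edge sets, then delete any edge within S that is not in both.
combine_skeletons <- function(U.AS, U.BS, S) {
  key <- function(E) apply(E, 1, function(e) paste(sort(e), collapse = "-"))
  U <- unique(rbind(U.AS, U.BS))
  in.both <- key(U) %in% intersect(key(U.AS), key(U.BS))
  within.S <- apply(U, 1, function(e) all(e %in% S))
  U[!(within.S & !in.both), , drop = FALSE]
}
u1 <- rbind(c("a", "s1"), c("s1", "s2"))
u2 <- rbind(c("b", "s2"))
combine_skeletons(u1, u2, S = c("s1", "s2"))  # the edge s1 - s2 is deleted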
Establishing Correctness If there is a faithful DAG for the distribution and a perfect oracle, then
the algorithm returns the essential graph of the faithful DAG. This is established as follows:
Suppose that GA∪C and GB∪C are faithful for the distributions over A ∪ C and B ∪ C respectively
and are combined according to the rules given to give GA∪B∪C . The results of Theorems 16.5 and 16.6
may be used to establish the D-separation properties:
For any α ∈ A and β ∈ B , Sα,β = C and therefore there is no edge α ∼ β in a faithful DAG for the
distribution over A ∪ B ∪ C . Following the reconstruction, there is no edge in GA∪B∪C .
For α, β ∈ C, the reconstruction has an edge α ∼ β in GA∪B∪C if and only if there are edges in both GA∪C and GB∪C. Theorem 16.6 states that if G(A ∪ B ∪ C) is a DAG over the variables A ∪ B ∪ C, then there is a set R ⊆ A ∪ B ∪ C such that α ⫫ β∥G(A∪B∪C) R if and only if either there is a set R′ ⊂ A ∪ C such that α ⫫ β∥G(A∪C) R′ or there is a set R′ ⊂ B ∪ C such that α ⫫ β∥G(B∪C) R′. Therefore, if G(A ∪ B ∪ C) is a faithful graph for pA∪B∪C, then its skeleton contains an edge α ∼ β between two variables in C if and only if both GA∪C and GB∪C contain the edge.
[Figure 16.7: a DAG over the variables A, B, C, D, E, F, G, H.]
Suppose that we also have a perfect oracle (each conditional independence test gives the correct result). The first step of the algorithm is to construct the independence graph, given in Figure 16.8. This is constructed by starting with the empty graph and adding an undirected edge ⟨α, β⟩ if and only if Xα ⊥̸ Xβ ∣ X−{α,β}. If the DAG in Figure 16.7 is faithful, the independence graph is the moral graph.
Figure 16.8: The Moral / Independence Graph for the DAG of Figure 16.7
The graph is now decomposed recursively; for example, take {C, G}: this decomposes the graph into {A, C, D, F} and {B, E, C, F, G, H}. The independence graphs for these two sets of variables are illustrated in Figure 16.9.
Now consider the piece on the left hand side of Figure 16.9. The set {C, D} may be used as the
separation set, and the independence graphs of the two pieces {A, C, D} and {C, D, F } are shown in
Figure 16.10.
The edge C − D does not appear in the first graph, since C ⊥ D∣A. This is clear from the DAG in Figure 16.7, which is faithful to the distribution. It therefore follows that in the reconstruction stage, the edge C − D will not be present and that C − F − D will be an immorality.
[Figure 16.9: the independence graphs for {A, C, D, F} and {B, E, C, F, G, H}.]
[Figure 16.10: the independence graphs for the two pieces {A, C, D} and {C, D, F}.]
{A} and {B, E, F, G, H} are separated by {C, D}: A ⊥ {B, E, F, G, H}∣{C, D}. The two pieces are {A, C, D} and {B, C, D, E, F, G, H}.
Consider {A, C, D}. The graph is C − A − D, since for the variables {A, C, D}, C ⊥ D∣A. This is decomposed further into {A, D} and {A, C}; C ⊥ D∣A. This decomposition is complete; the pieces are cliques and cannot be decomposed further.
Consider {B, C, D, E, F, G, H}. Then B ⊥ {D, F, G, H}∣{C, E}. The decomposition is into {B, C, E} and {C, D, E, F, G, H}. The piece {B, C, E} is a clique, since B ⊥̸ C∣E.
For {C, D, E, F, G, H}, E ⊥ {D, F, H}∣{C, G}, so it is decomposed into {C, D, F, G, H} and {C, E, G}. {C, E, G} is a clique at this stage, since C ⊥̸ G∣E.
For {C, D, F, G, H}, C ⊥ G∣{D, F, H}, so the graph of this piece does not contain the edge C − G. Now consider {C, D, F, H} and decompose into {C, D, F} and {D, F, H}; C ⊥ H∣{D, F}. Since C ⊥̸ D∣F, the piece {C, D, F} is a clique. Since D ⊥̸ H∣F, {D, F, H} is also a clique.
Now the cliques are considered and edges removed according to the principle of Theorem 2.3.
[Figure 16.11: the essential graph returned by the reconstruction.]
For {D, F, H}, the edge D − H is removed, with separation set (sep-set) ϕ, since D ⊥ H (with no instantiated nodes, each trail in the original DAG is blocked by a collider).
For the final stage of the `deconstruction' phase, edge B − C is removed because B ⊥ C, with sep-set SBC = ϕ.
Reconstruction For the reconstruction, these are put together, using the rule that GA∪B∪C has an edge between two variables in C if and only if both GA∪C and GB∪C contain the edge. At this stage, the edge C − D is removed from the final graph, since at the earlier stage SCD = {A}. Similarly, C − G is removed because SCG = {D, F, H}. Vee-structures (α, γ, β) are immoralities if and only if γ ∈/ Sα,β. This gives the essential graph of Figure 16.11.
Lemma 16.8. Let G = (V, D) be a directed acyclic graph. Let α, β ∈ V and let S ⊂ V. Let F = G^m_{An({α,β}∪S)}, where An(W) denotes the set W together with all nodes that are ancestor nodes in G for any node in W; first, the subgraph is taken, then it is moralised. Then S separates α and β in F if and only if α ⫫ β∥G S.
Proof of Lemma 16.8 Assume that there is a path from α to β in F that has no nodes in S. Then a trail from α to β in G may be found by taking the directed edge in G if it corresponds to an edge in F, or two edges forming a collider if there is no corresponding edge in F (the two directed edges corresponding to the immorality whose parents were joined when the graph was moralised).
If the collider node, or any of its descendants, is in S, then the node is S-active. Assume that there is one collider γ that is not S-active. Then each parent node (they are both in F) is either an ancestor
Lemma 16.9. Let G = (V, D) be a DAG and let S ⊂ V. Two nodes {α, β} are D-separated by S if and only if they are D-separated by an({α, β}) ∩ S, where an(W) = An(W)/W.
Proof of Lemma 16.9 Set S′ = an({α, β}) ∩ S. Since S ⊇ S′, it follows that if α ⫫ β∥G S′ then (trivially) there is a subset R ⊂ S such that α ⫫ β∥G R.
Now suppose that α ⫫̸ β∥G S′. By Lemma 16.8, there is a path ρ connecting α and β in G^m_{An({α,β}∪S′)} that does not contain any vertex of S′, and hence ρ does not contain any vertex in S/{α, β}.
Suppose that α and β are D-separated by S0 ⊆ S. Since an({α, β}) ∩ S0 ⊆ S′, it follows that ρ does not contain any vertex in an({α, β}) ∩ S0 and hence, by Lemma 16.8, α ⫫̸ β∥G S0. It follows that if there is a subset R ⊆ S such that α ⫫ β∥G R, then α ⫫ β∥G an({α, β}) ∩ S. The proof of Lemma 16.9 is complete.
Lemma 16.10. Let G = (V, D) be a DAG and suppose that ρ is a trail between two non-adjacent vertices α and β. If there are any nodes in ρ that are not in An({α, β}), then the trail ρ is blocked by any subset S ⊆ an({α, β}).
Proof It is clear that such a trail contains a collider connection where the collider node is not in An({α, β}); hence the node is not in an({α, β}), nor does it have a descendant in this set. The proof of Lemma 16.10 is complete.
Let

S′ = (an({α}) ∪ an({δ})) ∩ (A ∪ S).

By Lemma 16.9, it is sufficient to show that S′ blocks every trail ρ between α and δ. There are two cases:

α ⫫ β∥G (S′ ∪ S).
Now suppose there is a trail ρ contained in An({α, δ}) between α and δ that is not blocked by S′. Let W = S′ ∪ S. Then W blocks ρ. There is therefore at least one node in ρ that is in W/S′. Note that W/S′ ⊆ B. Let γ ∈ W/S′ denote the first node on the trail ρ, starting from α, that is in W/S′. Let ρ′ denote the sub-trail of ρ between α and γ. Since ρ is not blocked by S′, neither is ρ′. Since γ is the only node of ρ′ that is in B, it follows that if ρ′ is S′-active, it is also W-active, and hence α ⫫̸ γ∥G (S′ ∪ S), which is a contradiction.
If every sequence satisfies these properties, then clearly the properties are satisfied for every trail and hence, from the definition, α ⫫ β∥G S.
If α ⫫ β∥G S, then consider any such sequence of nodes. Take a subsequence by removing the loops, so that any node appears at most once. This is a trail. Since D-separation holds, the trail has the property listed. The property therefore holds for the original sequence.
It is clear that if there is a set R′ ⊂ A ∪ S or R′ ⊂ B ∪ S, then R = R′ ⊂ A ∪ B ∪ S satisfies the criterion.
Now suppose there is a set R̃ ⊂ A ∪ B ∪ S and let γ1, γ2 ∈ S be such that γ1 ⫫ γ2∥G R̃. By Lemma 16.9, γ1 ⫫ γ2∥G R, where

R = an({γ1, γ2}) ∩ R̃.
Suppose, without loss of generality (exchanging the roles of γ1 and γ2 if necessary), that γ2 is not an ancestor of γ1. Let

R1 = (an(γ1) ∪ an(γ2)) ∩ (A ∪ S) and R2 = (an(γ1) ∪ an(γ2)) ∩ (B ∪ S).
To prove that R1 or R2 D-separates γ1 and γ2, it is sufficient to show that, for any two trails ρ1 in A ∪ S and ρ2 in B ∪ S, either ρ1 is blocked by R1, or ρ2 is blocked by R2, or both.
Consider the two cases separately:
For the first case, since both R1 and R2 are subsets of an(γ1) ∪ an(γ2), it follows from Lemma 16.10 that ρj is blocked by both R1 and R2.
Now consider the second case. Suppose that ρ1 is R1-active and ρ2 is R2-active. Both ρ1 and ρ2 are blocked by R = R1 ∪ R2. It follows that ρ1 has a node in R/R1 and ρ2 has a node in R/R2. Let δ1 denote the node of ρ1 in R/R1 closest to γ1, and δ2 the node of ρ2 in R/R2 closest to γ1; note that δ1 ∈ R/R1 ⊆ B and δ2 ∈ R/R2 ⊆ A. Let ρ′1 and ρ′2 denote the sub-trails of ρ1 and ρ2 between γ1 ↔ δ1 and γ1 ↔ δ2 respectively. Note that ρ′1 is R1-active and ρ′2 is R2-active. Connecting at γ1 gives a sequence ρ′ between δ1 and δ2 through γ1. Note that ρ′ may not be a trail, since there may be repeated nodes.
Any node that is not a collider node on ρ′1 is not in R1 ∪ S, since it is in an(γ1) ∪ an(γ2) and since neither ρ1 nor ρ′1 is blocked by R1. Similarly, S does not contain any non-collider node on ρ′2. Therefore, except perhaps for γ1, ρ′ does not have any chain or fork connections whose central node is in S.
Let ν1 denote the neighbour of γ1 on ρ′1. Since ν1 ∈ an(γ1) ∪ an(γ2) and it is not γ2, it is an ancestor of γ1 or of γ2. If the orientation were γ1 → ν1, then γ2 would be an ancestor of γ1, contradicting the assumption. Therefore the edge is oriented ν1 → γ1. Similarly for ν2, a neighbour of γ1 on ρ′2. It follows that (ν1, γ1, ν2) is a collider on ρ′. Therefore S does not contain any nodes on ρ′ that are not collider nodes on the trail.
Consider any collider node c in ρ′j (that is, the centre of a collider connection in ρ′j). It is either in Rj or else has a descendant in Rj. Since c ∈ an(γ1) ∪ an(γ2), it follows that γ1 or γ2 is a descendant of c. Since γ1, γ2 ∈ S, it follows that each collider node in ρ′ is either in S or has a descendant in S.
It follows that δ1 ⫫̸ δ2∥G S, contradicting A ⫫ B∥G S. It follows that either γ1 ⫫ γ2∥G R1 or γ1 ⫫ γ2∥G R2. The proof of Theorem 16.5 is complete.
Another preparatory lemma is needed, before proving Theorem 16.6.
Lemma 16.11. Two non-adjacent nodes α and β in a directed acyclic graph G = (V, D) are D-separated by a set S ⊂ V if and only if, for any sequence λ = (α, λ1, . . . , λn−1, β) (where the same node can appear more than once) with edges between each consecutive pair,
either λ contains a chain or a fork connection such that the chain node or fork node is in S, or
λ contains a collider connection such that the collider node is not in S and has no descendant in S.
A sequence λ with edges between each consecutive pair that satisfies this property is said to be blocked by S.
Proof of Lemma 16.11 The result of Theorem 1.24 page 15, stating that a DAG G = (V, D) has an edge between α and β in D if and only if α ⫫̸ β∥G S for any subset S, is used crucially here, together with the definition of `faithful': that conditional independence statements and D-separation statements are equivalent.
For α ∈ A and β ∈ C, the graph GA∪B∪C in the reconstruction has an edge α ∼ β if and only if there is an edge α ∼ β in the graph GA∪C. Theorem 16.5 states that if G(A ∪ B ∪ C) is a DAG over the variables A ∪ B ∪ C, then there is a set R ⊂ A ∪ B ∪ C such that α ⫫ β∥G(A∪B∪C) R if and only if there is a set R′ ⊆ A ∪ C such that α ⫫ β∥G(A∪C) R′. It follows that if G(A ∪ B ∪ C) is faithful for pA∪B∪C, then its skeleton contains an edge α ∼ β between two variables α ∈ A and β ∈ C if and only if G(A ∪ C) contains an edge between α and β. The proof of Lemma 16.11 is complete.
Local skeletons are recovered for each individual tree-node of the separation tree. By Condition
1 of Theorem 5.25, edges deleted in any local skeleton are also absent in the global skeleton. This
is the same principle used in the Xie-Geng algorithm for DAGs.
All the information from local skeletons is combined to give a global undirected graph, which has
all the edges of the skeleton, but may contain additional edges.
Theorem 16.12. Suppose there is a chain graph faithful to a probability distribution P. Given a perfect
oracle (i.e. each test for conditional independence gives the correct answer, rejecting CI when it is false
and not rejecting when the CI statement is true), Algorithm 11 returns the skeleton of a faithful chain
graph.
Proof This uses Theorem 5.25. There is an edge between two nodes in a chain graph if and only if α ⫫̸ β∥G S for any subset S ⊆ V/{α, β}. The three lines which delete edges therefore only delete edges which cannot appear in the skeleton. The output is therefore a graph which contains all the edges of the skeleton.
At the same time, if α ≁ β, then one of the three conditions of Theorem 5.25 holds. For condition 1, there is no edge ⟨α, β⟩ if α and β do not appear in the same tree-node.
If condition 2 holds, then the edge ⟨α, β⟩ is removed in Stage 1 or Stage 2.
If condition 3 holds, then either α ⫫ β∥G Pa(α) or α ⫫ β∥G Pa(β) (or both), where Pa(γ) denotes {δ ∶ (δ, γ) ∈ D or ⟨δ, γ⟩ ∈ U}. The edge ⟨α, β⟩ is therefore removed in Stage 3.
Before proving that Algorithm 12 orients the edges correctly, the following preparatory lemma is
necessary.
Lemma 16.13. Any arrow oriented by Algorithm 12 gives the same orientation as the arrow in G.
Proof The result is almost immediate, but requires faithfulness: lack of C-separation implies that the corresponding conditional independence statement does not hold.
If all trails α ↔ β are blocked by Sα,β, but opened by γ, then γ is either a node in the region of a complex on the trail between α and β, or a descendant of such a node. Since γ is adjacent to α, it follows that G contains the arrow (α, β).
Theorem 16.14. If G ′ is the skeleton of a chain graph G which is faithful to the probability distribution
P over X , then the output G ∗ of Algorithm 12 is the pattern of G .
Proof This follows from Theorem 16.12 and Proposition 5.26. Firstly, Theorem 16.12 provides the correct skeleton. Clearly, if all the C-sep-sets were recorded, Algorithm 12 would consider all ordered pairs of nodes (α, β) and, for each γ such that ⟨α, γ⟩ ∈ G′ (the skeleton), determine whether or not it was a complex arrow (Definition ??). The algorithm would then return the pattern: the graph where all the complex arrows are directed and the others are undirected.
The only remaining issue is whether or not the sep-sets provided by Algorithm 11 are sufficient. By Proposition 5.26, for any complex (α, ρ1, . . . , ρn, β), there is a tree-node C which contains both parents α and β of the complex. Hence the set of C-sep-sets returned by Algorithm 11 is sufficient, and Algorithm 12 returns the correct pattern.
Example 16.15.
Suppose the chain graph in Figure 5.7 gives a faithful graphical representation of the conditional independence structure of a probability distribution P. Suppose that we derive the separation tree of Figure 5.8. This separation tree is not optimal, in the sense that the tree-node FGKH could be decomposed further into two tree-nodes FKG and GH separated by G. Given a perfect oracle, Algorithm 11 will return the correct skeleton and the set of C-sep-sets will be sufficient for Algorithm 12 to return the correct pattern.
The C-sep-sets found by Algorithm 11 are:
Stage 1 (each tree-node): SBC = {A}, SCD = {B} (we cannot separate B − C at this stage, nor D − E), SFG = {D}, SFH = {G}, SKH = {G}.
Stage 2 This simply looks at the edges removed from each tree-node. An edge between two nodes is present in the skeleton if and only if it is present between the two nodes for every tree-node. After Stage 2, the only edge still present in the graph which is not present in the skeleton is D − E.
Stage 3 The edge D − E is removed with sep-set SDE = {C, F}, using tree-nodes CDE and DEF.
probability distribution for which there exists a faithful DAG, the results verified that the algorithm is efficient and produces a graph that corresponds well to the distribution that generated the data, with low computational overheads. The feature of the algorithm of making all required tests with smaller conditioning sets before moving on to larger ones increases accuracy over methods that do not do this. The additional use made of the structure, identifying the chain components of the essential graph at each stage, ensures that fewer statistical calls (references to the data set) are required.
Some features were noted in the performance of the algorithm. In earlier stages, some contradictory directions appeared; that is, pairs of immoralities X → Y ← Z, Y → Z ← W, in situations where the edge Y ∼ Z would be deleted in subsequent rounds of the algorithm following tests with larger conditioning sets. The direction chosen for the edge during that round was dictated by which immorality appeared first. If the test X ⊥ Z∣SX,Z, yielding a sep-set SX,Z, was carried out first, then the edge would take the direction Y ← Z. After carrying out the CI tests and determining the directions, Meek's orientation rules were applied to determine the structures for the next round of the algorithm. The algorithm worked very well; with 10000 observations, it produced a graph that had the correct skeleton and only 4 edges with incorrect orientation.
The standard test of performance of an algorithm is based on its ability to recover a probability distribution used to simulate data. There are several standard networks, including the ALARM network, that are used: data is simulated from the network and the algorithm applied to the simulated data. Freedman and Humphreys (2000) pp 33-34 [43] are somewhat scathing in their assessment of this procedure of verifying the utility of an algorithm using simulated data from a distribution known to have good properties. They write,
The ALARM network is supposed to represent causal relations between variables relevant to hospital emergency rooms, and Spirtes Glymour Scheines (1993) [126] p 11 claim to have discovered almost all the adjacencies and edge directions `from sample data'. However, these `sample data' are simulated; the hospitals and patients exist only in the computer program. The assumptions made by SGS (1993) [126] are all satisfied by fiat, having been programmed into the computer: the question of whether they are satisfied in the real world is not addressed. After all, computer programs operate on numbers, not on blood pressures or pulmonary ventilation levels (two of the many evocative labels on nodes in the ALARM network).
These kinds of simulations tell us very little about the extent to which modelling assumptions hold true for substantive applications.
The constraint-based algorithms all depend crucially on the modelling assumption that there is a DAG that is faithful to the set of conditional dependence / independence statements that can be established. We have already pinpointed two difficulties that can arise in the `real world': interaction effects without main effects, and hidden common causes.
Figure 16.12: H is hidden and does not appear in the data matrix
If the RAI algorithm is applied to the variables X, Y, Z, W, whose associations are described by the d-connection statements of the DAG in Figure 16.12, then X ⊥ Z∣W, giving the immorality X → Y ← Z, and Y ⊥ W∣Z, giving the immorality Y → X ← W. Even if there is a perfect oracle (sufficient data to give correct results for each CI test, so that the results are consistent with the probability distribution over (X, Y, Z, W)), the edge between X and Y is a reversed edge, X ↔ Y. This notation means that, from the CI tests, one test gives a direction X → Y, the other gives a direction X ← Y, and the algorithm will choose the direction depending on the order in which the tests are carried out.
In the RAI algorithm, the direction that an edge takes in the output graph, under such circumstances, is determined by the order of the variables. If the test result X ⊥ Z∣W appears first, the output graph will contain X → Y, and thus the graph will contain the false d-separation statement W ⊥ Y ∣{X, Z}; if the result Y ⊥ W∣Z appears first, the output graph will contain the edge Y → X and the false d-separation statement X ⊥ Z∣{W, Y}. The two possibilities are given in Figure 16.13.
Figure 16.13: Possible outputs applying a constraint-based algorithm to variables (X, Y, Z, W) from Figure 16.12
variables should be d-connected in the output graph. Yet the output graph, following application of the
raw RAI algorithm, gave pairs of d-separated variables, which indicates that conditional independence
was falsely accepted due to weak tests.
In order to deal with the situation where `accept independence' from tests with large conditioning sets contradicted d-connection statements with lower order conditioning sets, Barros adopted a more conservative approach than the argumentation of Bromberg and Margaritis [9] and modified the algorithm so that it did not accept an independence statement that resulted in a d-separation in the output graph contradicting a dependence statement that had already been established. This modification worked well.
The output still gave a large number of `reversed edges'. While the ALARM network gave one or two, the financial data set gave approximately 28 reversed edges, indicating situations like that of the DAG in Figure 16.12, with possible output graphs corresponding to Figure 16.13.
The presence of a substantial number of `common cause' hidden variables would explain this.
This was a randomly chosen `real world' data set, and probably not appropriate for an algorithm based on a `faithfulness' assumption. The variables here do not satisfy one of the motivating features of the faithfulness assumption, that the variables stand in causal relation to each other; their association is more likely to be a result of hidden common causes, such as government policies, or global financial considerations that influence the various stock markets.
The same difficulties seemed to arise in other applications. The RAI algorithm was applied to the genetic data found in Friedman et al. [46]. Tentative results seem to give substantially different output depending on the input order of the variables, suggesting hidden common causes.
16.13.6 Conclusion
Constraint-based algorithms offer a fast approach, which is convenient with data matrices when d, the number of variables, is very large. They can be many times faster than search-and-score algorithms. Unfortunately, these algorithms tend to assume `faithfulness' and work on the principle of removing an edge whenever a conditional independence test gives the result `do not reject X ⊥ Y ∣S'. This leads to several difficulties. Firstly, since tests with larger conditioning sets are weaker, it can lead to situations where deletion of an edge contradicts earlier d-connection statements; this difficulty is present even if there is a faithful DAG corresponding to the independence structure. Secondly, two-factor, or higher order, interactions are not detected if there are no `main effects'. Thirdly, hidden variables can lead to contradictory edges, resulting in d-separation statements not present in the probability distribution. If there is no faithful DAG that describes the underlying independence structure, this can manifest itself in other ways.
Modifications to remove the first of these difficulties have been considered, for example by Bromberg and Margaritis [9] using argumentation, and the more conservative approach of Barros [4], retaining all dependence statements that have been established through rejecting independence.
The second and third of these difficulties have not been fully addressed by constraint-based algorithms.
`... undirected models cannot be used to model causality in the sense of Pearl [109], which is
useful in many domains such as molecular biology, where interventions can be performed.'
The thrust of the quote is that directed edges, whose direction can be interpreted as cause to effect, can be learned from data. But placing a causal interpretation on a directed arrow in a graph that has been learned purely by applying a structure learning algorithm to data can be misleading.
In a situation where interventions can be performed, a causal directed graph can be obtained from the undirected graph through further controlled experiments. Consider the situation of three variables (X, Y, Z) where X ⊥ Z∣Y, but X ⊥̸ Y, X ⊥̸ Z, Y ⊥̸ Z, Y ⊥̸ X∣Z and Y ⊥̸ Z∣X. There are three DAGs along which the distribution pX,Y,Z may be factorised, given in Figure 16.14. Suppose that an intervention may be carried out on the variable Y, forcing its state. This has the effect of removing arrows from parents of Y to Y. If the state Y ← y is forced, this gives the graphs in Figure 16.15.
If all the states of Y can be explored, in a controlled experiment, by randomly assigning levels of
the `treatment' variable Y , the causal structure can be determined from the Markov structure, but not
otherwise.
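In the notation of the bnlearn package (used in Chapter 17), the effect of such an intervention can be sketched as follows; the chain X → Y → Z is one of the three Markov-equivalent DAGs, and the node names are those of the text.

library(bnlearn)
dag <- model2network("[X][Y|X][Z|Y]")   # one DAG with X independent of Z given Y
# Forcing Y <- y removes the arrows from Pa(Y) into Y; here that is X -> Y.
# (bnlearn also provides mutilated() for interventions on fitted networks.)
intervened <- drop.arc(dag, from = "X", to = "Y")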
Markowetz and Spang [91] discuss the application of intervention calculus to perturbation experiments that infer gene function and regulatory pathways.
[Figure 16.14: the three DAGs over (X, Y, Z) along which the distribution pX,Y,Z may be factorised.]
[Figure 16.15: the graphs of Figure 16.14 after the intervention Y ← y; the arrows into Y are removed.]
As Freedman and Humphreys point out (1999) [43], commenting on automated causal learning,
`these claims are premature at best and the examples used in [126] to illustrate the algorithms are
indicative of failure rather than success.' They point out that `the gap between association and
causation has yet to be bridged.'
[Figure 16.16: the DAG over (Y1, Y2, Y3, X1, X2, X3) that best represents the associations of Example 2.7.]
`... the faithfulness condition can be thought of as the assumption that conditional independence relations are due to causal structure rather than to accidents of parameter values.' Spirtes et al. (2000) [127]
Example 2.7 gives an instance of a situation where the probability distribution does not have a faithful graphical representation. For the variables (Y1, Y2, Y3, X1, X2, X3), the DAG that best represents the associations between the variables is given by Figure 16.16. In this graph, X1 = 1 if Y2 = Y3 and 0 otherwise. X1 ⊥ Y2 and X1 ⊥ Y3, but X1 ⊥̸ {Y2, Y3}. In this situation the influence of Y2 and Y3 on X1 is not seen if the variables are considered separately, but the interaction effect is decisive.
Another statement of the same principle is found in Meek (1995) [93]:
In cases where P(G) (the set of distributions that factorise along a graph G) can be parametrised by a family of distributions with a parameter of finite dimension, the set of unfaithful distributions typically has Lebesgue measure zero. (Spirtes et al. (2000) [127] pp 42 - 2)
This assumption, that the set of observable variables O may be extended to a set V = (U, O) where
U represents unobserved common causes, or confounders, and that there will exist a DAG over V
that is faithful to the probability distribution over V , is re-stated in Robins, Scheines, Spirtes and
Wasserman (2003) [117]. There is strong interest in classes of faithful distributions in the literature;
the work of Zhang and Spirtes [151] requires that the class of distributions under consideration satisfy
a stronger assumption than faithfulness in order to obtain uniform consistency in causal inference for a
certain class of problems; [117] illustrates non-existence of uniform consistency when only faithfulness is
assumed, because of the possibility of non-faithful distributions in the closure of the set of distributions
under consideration.
Consider again Example 2.7 and suppose that O = (X1, X2, X3), the values for (X1, X2, X3), are observable and U = (Y1, Y2, Y3), the results of (Y1, Y2, Y3), are hidden. Clearly, the set of distributions over 6 binary variables that factorise over the DAG in Figure 16.16 can be described by a finite parameter space: 15 parameters are required to describe the entire set of distributions, and the parameter space is [0, 1]^15. Furthermore, it is clear that the parameters describing the distribution over (Y1, Y2, Y3, X1, X2, X3) in Example 2.7 correspond to exactly one point in the parameter space, which has Lebesgue measure zero. Nevertheless, examples where knowledge of two causes is required to explain the effect, and where knowledge only of a single cause tells you nothing about the effect, arise all the time in practice, in the real world.
Furthermore, the parametrisation of any distribution that has an independence structure has
Lebesgue measure zero in the parameter space of all distributions over the variables in question.
Meek's argument can equally well be used to argue against searching for any independence structure
at all.
Faithfulness appears to be a convenient hypothesis for producing beautiful mathematics (and the relation between DAGs and probability distributions under this assumption has produced a very elegant and attractive mathematical theory), but it is difficult to see that it necessarily applies to real-world situations; the real world does not respect the fact that the set of parameters that describe the situation has Lebesgue measure zero in a mathematical parameter space. Divergence between `real world' behaviour and the assumption that it should fit into a convenient mathematical framework has been termed `The Mind Projection Fallacy' by E.T. Jaynes (2003) [70].
[Figure: hidden common causes, H1 of X and Z, and H2 of Z and Y.]
If one were using immoralities as a guide to causation, one would conclude that X and Y were common causes of Z. As Freedman and Humphreys point out in [43], commenting on Spirtes Glymour Scheines (1993) [126] on a DAG produced from a sociological data set,
The graph says, for instance, that race and religion cause region of residence.
In the context, this is nonsensical and raises a timely note of caution when inferring causality.
[Figure: the immorality X → Z ← Y.]
The epidemiologists discovered an important truth - smoking is bad for you. The epi-
demiologists made this discovery by looking at the data and using their brains, two skills
that are not readily automated. .... The examples in SGS (1993) [126] count against the
automation principle, not for it.'
The conclusion drawn by the authors of this article is that the output produced by structure learning algorithms provides valuable information: it can give good information about associations and can certainly point towards the possibility of causal relations. But such algorithms do not even begin to automate the process of learning causality; it is still necessary for researchers to use their brains to design experiments, examine the data, and use their brains again, taking into account circumstances and contexts additional to the raw data, to reach conclusions. As the example from SGS (1993) [126], extended by Freedman and Humphreys [43], shows, causation cannot be deduced from the presence of an immorality and, indeed, cannot be inferred from the output of structure learning algorithms alone.
Notes The PC algorithm was introduced by Spirtes et al. (1993) [126], while the MMPC was introduced by Tsamardinos et al. (2006) [137]. The FAS algorithm is discussed in Fast [41]. Recursive Autonomy Identification is due to Yehezkel and Lerner (2009) [150].
16.14 Exercises
1. This problem is motivated by the following consideration: when searching for a graph with a suitable structure to fit a given data set with reasonable accuracy, Markov chain Monte Carlo techniques are often used. These algorithms are computationally more efficient if they change as few edges as possible at each transition, while ensuring that the chain can move through the entire space of graphs. It is also more efficient to search the space of essential graphs, to ensure that the chain does not spend time moving between graphs that are Markov equivalent.
This exercise shows that, even in a simple setting, it is necessary to change at least two edges per move to ensure that the algorithm can move from the current essential graph to a different essential graph.
(a) Let V = {X1, X2, X3}. Construct an undirected graph by adding an edge between two nodes α and β if and only if α ⊥̸ β∣S for any subset S ⊆ V/{α, β}.
(b) Construct the independence graph.
(c) What happens if V = {Y1, Y2, Y3, X1, X2, X3}?
4. For any DAG G = (V, D), an edge (X, Y ) ∈ D is said to be covered in G if PaX = PaY /{X}. Let
G1 = (V, D1 ) be a DAG and let G2 = (V, D2 ) be obtained by reversing the edge (X, Y ) ∈ D1 .
Prove that G2 is Markov equivalent to G1 if and only if (X, Y ) is covered in G1 .
5. Let G1 and G2 be two Markov equivalent DAGs and suppose that there are exactly m edges in G1
with the opposite orientation in G2 . Using Exercise 4, prove that there is a sequence of exactly
m distinct edge reversals in G1 with the following properties:
6. Let G = (V, D) be a directed acyclic graph. Prove that G^m, the moral graph, contains an undirected edge ⟨X, Y⟩ if and only if X ⫫̸ Y∥G V/{X, Y} (X and Y are not d-separated by V/{X, Y}).
7. Recall the Recursive Autonomy Identification algorithm, Subsection 16.7 page 321.
(a) In the description of stage 0, where an edge between X and Y is removed if and only if X ⊥ Y, assume that the resulting skeleton is correct. Why is (X, Z, Y) an immorality if there are edges X − Z and Z − Y but no edge X − Y?
(b) Assume that the graph in Figure 16.20 is a faithful graph for PX1,X2,X3,X4. Assume that the data set is sufficiently large so that each test for independence gives the correct result. Outline how the algorithm proceeds, sketching the graphs returned at each stage of the algorithm, stating the reasons for deleting edges and directing edges.
(c) Assume that the graph in Figure 16.21 is a faithful graph for PX1,X2,X3,X4 and that each independence test gives the correct result. Outline how the algorithm proceeds.
[Figure 16.20: a DAG over X1, X2, X3, X4.]
[Figure 16.21: a DAG over X1, X2, X3, X4.]
(d) Assume that the graph in Figure 2.3 is faithful to the distribution PU1 ,Z1 ,Z2 ,Z3 ,Z4 and
that variable U1 is hidden. What is the output of the RAI algorithm if the input is
(Z1 , Z2 , Z3 , Z4 )? What is the output of the RAI algorithm if the input order is (Z4 , Z3 , Z2 , Z1 )?
16.15 Answers
1. (a) Yes: A → B ← C is an essential graph.
(b) Recall that the essential graph is the graph where directions are retained on and only on
those edges that retain the same direction in every graph in the Markov equivalence class.
Hence A−B ← C is not an essential graph since A → B ← C and A ← B ← C are not Markov
equivalent; if B ← C is present and (A, B, C) is not an immorality, this forces A ← B .
With this in mind, the essential graphs are:
The three graphs A → B ← C , B → A ← C , A → C ← B , the three graphs with one
(undirected) edge between two of the nodes and the third node unconnected, the graph with
no edges between any of the nodes, the three graphs with two undirected edges A − B − C ,
A − C − B , C − A − B . The graph with three undirected edges between A, B and C .
(c) A, B ← C ; A − B ← C ; A ← B ← C ; A → B, C ; A → B − C ; A → B → C . None of them are
essential graphs.
4. When the edge (X, Y) is reversed in a DAG G1 to form a new graph G2, the two graphs are Markov equivalent if and only if G2 has the same skeleton and the same immoralities, and there are no cycles in G2.
When (X, Y) is removed and (Y, X) is added, there are no new immoralities if and only if, for each Z ∈ Pa(X), there is a link between Y and Z. The link is (Z, Y), since otherwise there is a cycle in G1. Therefore Pa(X) ⊆ Pa(Y) in G1.
No immoralities are removed and no cycles are introduced if and only if, for any Z ∈ Pa(Y)/{X}, Z ∈ Pa(X), so Pa(Y)/{X} ⊆ Pa(X). It follows that Pa(X) = Pa(Y)/{X}.
5. Assume that none of the m edges are covered. Then using the previous exercise, for each
edge (X, Y ) to be altered, either there is a node Z ∈ Pa(Y )/Pa(X) or there is a node Z ∈
Pa(X)/Pa(Y ). If there is a node Z ∈ Pa(Y )/Pa(X) then (X, Y, Z) is an immorality, so that the
direction X → Y remains the same in any Markov equivalent graph. It follows that for each of
the m edges (X, Y ), there is a variable Z ∈ Pa(X)/Pa(Y ). If the direction of the edge (Z, X)
is not also reversed, then (Z, X, Y ) is an immorality in the new graph, which is a contradiction.
It follows that there is at least one covered edge among the m edges. Change the orientation of
this edge. After the change, there is a covered edge among the remaining m − 1 and by induction
the target graph is obtained after m changes.
6. Firstly, note that the moral graph contains an edge X − Y if and only if Y ∈ MB(X), the Markov blanket of X. MB(X) consists of Pa(X) (the parents of X), Ch(X) (the children of X) and all parents that share a child with X; that is, all neighbours of X together with those variables that are linked to X when the graph is moralised.
Note that

X ⫫ V/({X} ∪ MB(X))∥G MB(X),

so that, using the weak union result of Exercise 2 page 22, if Y ∈/ MB(X),

X ⫫ Y∥G V/{X, Y}.
the edge X1 → X4 is removed. The algorithm now terminates, returning the essential graph
with undirected edges ⟨X1 , X2 ⟩ and ⟨X1 , X3 ⟩ and directed edges ⟨X2 , X4 ⟩ and ⟨X3 , X4 ⟩.
Chapter 17
Parameter Learning
2. Incremental Association iamb: based on the Markov blanket detection algorithm of the same
name, which is based on a two-phase selection scheme (a forward selection followed by an attempt
to remove false positives).
3. Fast Incremental Association fast.iamb: a variant of IAMB which uses speculative stepwise
forward selection to reduce the number of conditional independence tests.
2. Tabu Search tabu: a modified hill climbing able to escape local optima by selecting a network that minimally decreases the score function.
1. Max-Min Parents and Children mmpc: a forward selection technique for neighbourhood
detection based on the maximization of the minimum association measure observed with any
subset of the nodes selected in the previous iterations.
2. Hiton Parents and Children si.hiton.pc: a fast forward selection technique for neighbourhood detection designed to exclude nodes early based on the marginal association. The implementation follows the Semi-Interleaved variant of the algorithm.
3. Chow-Liu chow.liu: an application of the minimum-weight spanning tree and the information inequality. It learns the tree structure closest to the true one in the probability space.
4. ARACNE aracne: an improved version of the Chow-Liu algorithm that is able to learn polytrees.
> library(bnlearn)
> data(marks)
> str(marks)
'data.frame': 88 obs. of 5 variables:
$ MECH: num 77 63 75 55 63 53 51 59 62 64 ...
$ VECT: num 82 78 73 72 63 61 67 70 60 72 ...
First create an empty network with the nodes corresponding to the variables using the empty.graph
function:
> ug<-empty.graph(names(marks))
The arcs presented in Whittaker (1990) from Figure 17.1 may be added as follows:
> arcs(ug,ignore.cycles=TRUE)=matrix(
+ c("MECH","VECT","MECH","ALG","VECT","MECH",
+ "VECT","ALG","ALG","MECH","ALG","VECT",
+ "ALG","ANL","ALG","STAT","ANL","ALG",
+ "ANL","STAT","STAT","ALG","STAT","ANL"),
+ ncol=2, byrow = TRUE,
+ dimnames=list(c(),c("from","to")))
> plot(ug)
[Plot: the undirected graph ug over MECH, VECT, ALG, ANL, STAT.]
The resulting ug object belongs to the class bn. It has several components: ug$learning, ug$nodes, ug$arcs.
$learning is not useful in this example, since this component gives information about the results of the structure learning algorithm used to generate the network and its tuning parameters (which were not used here).
$nodes gives information about the Markov blanket of each node, while $arcs gives the arcs present in the network.
> ug
model:
[undirected graph]
nodes: 5
arcs: 6
undirected arcs: 6
directed arcs: 0
average markov blanket size: 2.40
average neighbourhood size: 2.40
average branching factor: 0.00
model:
[STAT][ANL|STAT][ALG|ANL:STAT][VECT|ALG][MECH|VECT:ALG]
nodes: 5
arcs: 6
undirected arcs: 0
directed arcs: 6
average markov blanket size: 2.40
average neighbourhood size: 2.40
average branching factor: 1.20
A DAG can be specified by its adjacency matrix. The function all.equal() indicates whether two graphs are equal.
> mat=matrix(c(0,1,1,0,0,0,0,1,0,0,0,0,
+ 0,1,1,0,0,0,0,1,0,0,0,0,0),
+ nrow=5,
+ dimnames=list(nodes(dag),nodes(dag)))
> mat
MECH VECT ALG ANL STAT
MECH 0 0 0 0 0
VECT 1 0 0 0 0
ALG 1 1 0 0 0
ANL 0 0 1 0 0
STAT 0 0 1 1 0
> dag2=empty.graph(nodes(dag))
> amat(dag2)=mat
> all.equal(dag,dag2)
[1] TRUE
A new bn object may be created by adding (set.arc), dropping (drop.arc) or reversing (reverse.arc) arcs from the original. For example:
A topological ordering of the nodes (from ancestors to descendants) may be obtained by the func-
tion node.ordering(). The neighbours and Markov blanket may be found using nbr() and mb()
respectively. The %in% command may be used to establish membership.
> node.ordering(dag)
[1] "STAT" "ANL" "ALG" "VECT" "MECH"
> nbr(dag,"ANL")
[1] "ALG" "STAT"
> mb(dag,"ANL")
[1] "ALG" "STAT"
> "ANL" %in% mb(dag,"ALG")
[1] TRUE
We can check that the Markov blanket of a variable consists of its parents, its children and the other parents of its children:
> chld=children(dag,"VECT")
> par=parents(dag,"VECT")
> o.par=sapply(chld,parents,x=dag)
> unique(c(chld,par,o.par[o.par != "VECT"]))
[1] "MECH" "ALG"
> mb(dag,"VECT")
[1] "MECH" "ALG"
> library(Rgraphviz)
Loading required package: grid
> h = list(arcs=vstructs(dag2,arcs=TRUE),lwd=4,col="black")
> graphviz.plot(dag2,highlight=h,layout="fdp",main="dag2")
[Plot: dag2, with the vee-structure arcs highlighted.]
The essential graph, showing the Markov equivalence class, is returned by cpdag(). The function moral() returns the moral graph.
> plot(cpdag(dag2))
model:
[undirected graph]
nodes: 5
arcs: 6
undirected arcs: 6
directed arcs: 0
average markov blanket size: 2.40
average neighbourhood size: 2.40
average branching factor: 0.00
The parameter value α = 0.05 is the nominal significance level for each χ2 test for independence. The mmhc algorithm learns a different network, but it is Markov equivalent to the network learned by the gs algorithm and has the same BIC score.
These structure learning algorithms often only direct an edge when a particular direction gives a better fit, leaving other edges undirected. The function cextend() gets one graph out of the Markov equivalence class, which may be used for scoring purposes. The BIC score for the learned graph may be obtained as follows; the documentation lists other scoring criteria that are available (such as AIC).
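The scoring call itself is not reproduced in this extract; a sketch, assuming the learned object is called bn.gs as in the gs example above, would be ("bic-g" is the Gaussian BIC):

> score(cextend(bn.gs), data = marks, type = "bic-g")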
model:
[STAT][ANL|STAT][ALG|ANL:STAT][VECT|ALG][MECH|VECT:ALG]
nodes: 5
arcs: 6
undirected arcs: 0
directed arcs: 6
average markov blanket size: 2.40
average neighbourhood size: 2.40
The type of estimator (maximum likelihood or Bayes) can be specified as either mle (maximum likelihood estimates) or bayes (the posterior Bayesian estimate arising from a flat, non-informative prior). Only mle is available with continuous (Gaussian) data; the bayes method considers Dirichlet densities over the parameter space.
The parameters of a fitted network can easily be replaced. For example, ALG has two parents, ANL and STAT. For the Gaussian network, the restriction is that the standard deviation of the residuals at each node does not depend on the parent values. We consider

ALG = β0 + β1 ANL + β2 STAT + ϵ,

where the ϵ ∼ N(0, σ2) are independent and identically distributed. This is carried out by:
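The code is missing from this extract; a sketch, assuming the fitted object is called fitted, would be the following (bnlearn accepts an lm object as replacement parameters for a Gaussian node):

> fitted = bn.fit(dag, data = marks)
> fitted$ALG = lm(ALG ~ ANL + STAT, data = marks)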
A bn.fit object can be created from scratch using the custom.fit function. For example:
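The original example is not reproduced here; a minimal sketch with hypothetical parameter values, for a two-node Gaussian network A → B, would be:

> dag.ab = model2network("[A][B|A]")
> dist = list(
+   A = list(coef = c("(Intercept)" = 0), sd = 1),
+   B = list(coef = c("(Intercept)" = 1, A = 0.5), sd = 2))
> fitted.ab = custom.fit(dag.ab, dist = dist)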
17.1.5 Discretisation
The only continuous models that can be accommodated are Gaussian. When the data is manifestly not
Gaussian, it is better to discretise it and to construct a Bayesian network over multinomial variables.
There are several methods of discretisation available; look up the documentation for discretize. For
example:
> ?discretize
> dmarks = discretize(marks, breaks=2, method="quantile")
> bn.dgs=gs(dmarks)
> plot(bn.dgs)
> all.equal(cpdag(bn.dgs),cpdag(bn.gsdirect))
[1] "Different number of directed/undirected arcs"
The network learned from the discretised data is different; MECH is independent of the other variables. The parameters may be fitted to the structure using the discretised data:
> fitted3=bn.fit(cextend(bn.dgs),data=dmarks)
> fitted3$ALG
ANL
ALG [9,49] (49,70]
[15,50] 0.7777778 0.2558140
(50,80] 0.2222222 0.7441860
For the marks data, Edwards (2000) [40] assumed that the students fell into two distinct groups (which we call A and B). He then used a classification technique involving the EM algorithm to assign the students to the two different classes. The results were as follows: group A contained students 1-44 and 46-52, while group B contained students 45 and 53-88. We add in this latent variable, construct a network for group A and another network for group B, then discretise the variables and learn the network when the latent variable is included. The results are:
> latent=factor(c(rep("A",44),"B",rep("A",7),rep("B",36)))
> bn.A = hc(marks[latent=="A",])
> bn.B = hc(marks[latent=="B",])
> modelstring(bn.A)
[1] "[MECH][ALG|MECH][VECT|ALG][ANL|ALG][STAT|ALG:ANL]"
> modelstring(bn.B)
[1] "[MECH][ALG][ANL][STAT][VECT|MECH]"
> dmarks=discretize(marks,breaks=2,method="interval")
> dmarks2=cbind(dmarks,LAT=latent)
> bn.LAT=hc(dmarks2)
> bn.LAT
model:
[MECH][ANL][LAT|MECH:ANL][VECT|LAT][ALG|LAT][STAT|LAT]
nodes: 6
arcs: 5
undirected arcs: 0
directed arcs: 5
average markov blanket size: 2.00
average neighbourhood size: 1.67
average branching factor: 0.83
Note that for the learned network, the variable LAT has two parents: MECH and ANL. If MECH, VECT, ALG, ANL and STAT were continuous, this distribution (a discrete node with continuous parents) would therefore not fall into the CG framework.
1. Outliers are removed. This is because, for continuous data, Bayesian networks only support multivariate Gaussian distributions; outliers make the Gaussian modelling assumptions less likely to hold.
2. Structure learning is repeated several times, so that there is more chance of finding a global maximiser for the score function.
3. The networks discovered in the previous step are averaged. This is a technique from Claeskens and Hjort (2008) [29]. The averaged network uses arcs present in (say) 85% of the networks.
We try this on the sachs.data.txt data set, found in the data directory of the course home page:
> library(bnlearn)
> sachs.data <- read.delim("~/data/sachs.data.txt")
> sachs<-sachs.data
> dsachs=discretize(sachs,method="hartemink",breaks=3,ibreaks=60,idisc="quantile")
Each variable in the dsachs data frame is a factor with three levels, corresponding approximately to
low, normal and high expression. Now apply bootstrap resampling to learn a set of 500 networks to
be used for model averaging:
> boot=boot.strength(data=dsachs,R=500,algorithm="hc",algorithm.args=list(score="bde",iss=10))
> boot[(boot$strength>0.85)&(boot$direction>=0.5),]
from to strength direction
1 praf pmek 1.000 0.5180000
23 plcg PIP2 1.000 0.5100000
24 plcg PIP3 1.000 0.5220000
34 PIP2 PIP3 1.000 0.5120000
56 p44.42 pakts473 1.000 0.5620000
57 p44.42 PKA 0.992 0.5665323
67 pakts473 PKA 1.000 0.5690000
89 PKC P38 1.000 0.5100000
90 PKC pjnk 1.000 0.5100000
100 P38 pjnk 0.954 0.5062893
The virtual sample size is 10, which is very low. Arcs are significant if they appear in at least 85% of the networks and in the direction that appears most frequently. The averaged network is formed quite simply using the averaged.network function:
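For instance (avg.boot is just an illustrative name):
> avg.boot = averaged.network(boot, threshold = 0.85)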
An alternative approach is to average the results of several hill climbing searches, each starting from a different network. The initial condition can be generated using a distribution over the space of connected graphs. An algorithm to do this was proposed by Ide and Cozman [69] (2002). It is implemented by the function random.graph(). It is carried out as follows:
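A sketch of the whole procedure (assuming the discretised dsachs data; "ic-dag" is the name random.graph uses for the Ide-Cozman sampler, and custom.strength computes arc strengths from a list of networks):
> nodes = names(dsachs)
> start = random.graph(nodes = nodes, method = "ic-dag", num = 500)
> netlist = lapply(start, function(net) {
+   hc(dsachs, score = "bde", iss = 10, start = net) })
> rnd = custom.strength(netlist, nodes = nodes)
> avg.start = averaged.network(rnd, threshold = 0.85)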
The networks have the same skeleton, although some of the directions are different.
The score is computed by first taking cpdag to get the essential graph and then taking cextend to form a DAG.
> score(cextend(cpdag(avg.start)),dsachs,type="bde",iss=10)
[1] -8498.877
The bnlearn package contains a default level for the threshold, which is used by averaged.network when none is specified:
> averaged.network(boot)
model:
[praf][plcg][p44.42][PKC][pmek|praf][PIP2|plcg][pakts473|p44.42][P38|PKC]
[pjnk|PKC][PIP3|plcg:PIP2][PKA|p44.42:pakts473]
nodes: 11
arcs: 9
undirected arcs: 0
directed arcs: 9
average markov blanket size: 1.64
average neighbourhood size: 1.64
average branching factor: 0.82
Let p̂₍₁₎ ≤ ⋯ ≤ p̂₍ₖ₎ denote the order statistics for the arc strengths stored in boot. Now let t denote a threshold and set

$$\tilde p_{(k)}(t) = \begin{cases} 1 & \hat p_{(k)} \geq t \\ 0 & \hat p_{(k)} < t. \end{cases}$$

This denotes the `empirical' probability function for arc strengths for the graph where arcs are present if and only if p̂₍ₖ₎ ≥ t. Let p̃(⋅) denote the resulting vector. Now choose t̂ to minimise the distance

$$\big\| F_{\hat p(\cdot)} - F_{\tilde p(\cdot)} \big\|,$$

where F_{p̂(⋅)} and F_{p̃(⋅)} are the empirical distribution functions of p̂(⋅) and p̃(⋅) respectively; the threshold t̂ is the value that minimises this.
> wh = matrix(c(rep("INT",11),names(isachs)[1:11]),ncol=2)
> bn.wh = tabu(isachs,whitelist=wh,score="bde",iss=10,tabu=50)
> tiers=list("INT",names(isachs)[1:11])
> bl = tiers2blacklist(nodes=tiers)
> bn.tiers=tabu(isachs,blacklist=bl,score="bde",iss=10,tabu=50)
While the two methods given above, producing bn.wh and bn.tiers, show how to force certain arcs into a network, they do not exploit the structure of the intervention.
The way to model an intervention is described as follows: the value of INT identifies which node is subject to an intervention. Therefore, we start by constructing a named list of which observations are manipulated for each node.
> INT2=sapply(1:11,function(x){which(isachs$INT==x)})
> nodes=names(isachs)[1:11]
> names(INT2)=nodes
Now pass the list to tabu as an additional argument for mbde (the modied BDe score function).
> start=random.graph(nodes=nodes,method="melancon",num=500,burn.in=10^5,every=100)
> netlist=lapply(start,function(net){
+ tabu(isachs[,1:11],score="mbde",exp=INT2,iss=10,start=net,tabu=50)})
> arcs=custom.strength(netlist,nodes=nodes,cpdag=FALSE)
> bn.mbde=averaged.network(arcs,threshold=0.85)
Warning messages:
1: In averaged.network.backend(strength = strength, nodes = nodes, :
arc pjnk -> PKA would introduce cycles in the graph, ignoring.
17.2 Exercises
1. This exercise uses the asia data set found in the bnlearn package.
(a) Create a bn object with the network structure shown in Figure 17.3.
[Figure 17.3: the asia network, with nodes asia, smoke, tub, lung, either, bronc, xray and dysp.]
(b) Derive the skeleton, the moral graph, and the essential graph representing the Markov
equivalence class. Plot them using graphviz.plot.
(c) Identify the parents, the children, the neighbours and the Markov blanket of each node.
(d) For the network in Figure 17.3, estimate the CPPs.
(e) Using the data asia, use the MMPC algorithm (mmpc in bnlearn) to learn the skeleton, followed by hill climbing to learn the direction of the arrows. Is the output DAG Markov equivalent to the graph in Figure 17.3?
2. This exercise again uses the marks data.
(a) Discretise the data using a quantile transform and different numbers of intervals (say 2 to 5). Learn the network structure. How does the structure change with the discretisation?
(b) Repeat the discretisation using interval discretisation, using up to five intervals. Compare the resulting networks with those obtained previously using quantile discretisation.
(c) Does Hartemink's discretisation algorithm perform better than either quantile or interval
discretisation? How does its behaviour depend on the number of initial breaks?
3. The ALARM network is a standard network used to test new algorithms. A synthetic data set
alarm is found in the bnlearn package. Type:
> library(bnlearn)
> ?alarm
On the bottom right quadrant of RStudio, click on ALARM Monitoring System (synthetic) data set. This gives a description. Go to the bottom, under Examples. You will find the structure of the `true' network.
(a) Create a bn object for the true network using the model string provided in the documentation.
(b) Compare the networks learned from the data using different constraint-based algorithms with the true network, both in terms of structural differences and also using either BIC or BDe.
(c) How are these constraint-based strategies affected by different choices of α (the nominal significance level of each test)?
(d) Now learn the structure with hill-climbing and tabu search, using the posterior density BDe as a score function. How does the network change with the hyperparameter iss (imaginary sample size)?
(e) Does the length of the tabu list have a significant impact on the network structures learned using tabu?
(f) Does the learned network depend on whether BDe or BIC is being used as a score criterion?
4. Now consider the data from Sachs et al., found in sachs.data.txt on the course home page. Use the original data set, not the discretised data set.
(a) Evaluate the networks learned by hill-climbing with BIC and BGe, using cross-validation and the log-likelihood loss function.
(b) Use bootstrap resampling to evaluate the distribution of the number of arcs present in each of the networks learned. Do they differ significantly?
(c) Compute the averaged network structure for sachs using hill-climbing with BGe and different hyperparameters (imaginary sample sizes). How does the value of the significance threshold change as iss increases?
Chapter 18
Monte Carlo Algorithms for Graph Search
There are various Monte Carlo approaches to locating a structure. These involve running a stochastic process through the space of possible structures and using this either to build up a posterior distribution over the space of structures (Markov chain Monte Carlo) or else to design a process with sufficient mobility that is attracted to highly scoring structures, scoring each structure visited. The output from a stochastic optimisation algorithm is simply the structure visited with the highest score.
As usual, X = (X₁, …, X_d) denotes the random vector of variables,

$$\mathbf{X} = \begin{pmatrix} X^{(1)} \\ \vdots \\ X^{(n)} \end{pmatrix}$$

denotes an n × d random matrix of n independent copies of X, and x denotes the data matrix, an instantiation of X.
where n(E) is the number of DAGs within the equivalence class and D ∈ equiv(E). The posterior is
then given by:
Let E denote the current edge set. As usual, E = D ∪ U, where D denotes the directed edges and U denotes the undirected edges. ⟨α, β⟩ ∈ U denotes an undirected edge; (α, β) ∈ D denotes a directed edge α ↦ β. For F_{ij} and F_{jk}, where F_{pq} is defined below, consider the 16 possible graphs generated by keeping all other edges the same and modifying any edges between the two pairs [X_i, X_j] and [X_j, X_k] (where [α, β] simply denotes the ordered pair of vertices) according to the four possibilities for each pair:
$$F_{pq} = \begin{cases} 1 & (X_p, X_q) \notin D,\ (X_q, X_p) \notin D,\ \langle X_p, X_q \rangle \notin U \\ 2 & (X_q, X_p) \in D \\ 3 & (X_p, X_q) \in D \\ 4 & \langle X_p, X_q \rangle \in U. \end{cases} \qquad (18.2)$$
Suppose the current state is E⁽⁰⁾ and label the 16 possible graphs E⁽⁰⁾, E⁽¹⁾, …, E⁽¹⁵⁾ generated by all the possibilities of F_{ij} and F_{jk}. For each graph, check whether it is an essential graph, using the criteria of Theorem 5.3.
That is, it has to be a chain graph: for α ∈ V_i and β ∈ V_j, where V_i and V_j are two separate chain components, there is no cycle containing both α and β (that is, no sequence ρ₀, …, ρ_m, ρ_{m+1} = ρ₀ with either (ρ_i, ρ_{i+1}) ∈ D or ⟨ρ_i, ρ_{i+1}⟩ ∈ U for each i = 0, …, m). The chain components have to be triangulated and the graph must not contain forbidden substructures (those in Figure 5.1).
where S_{E∣X} is defined by (18.1), or indeed any other reasonable score function. Select E(t + 1) = E⁽ˡ⁾ with probability y_l, l = 0, 1, …, 15.
This gives a process which works through the space of essential graphs, guiding the process (at least
locally) to highly scoring structures, while the stochastic element ensures that the process can escape
from a local maximum with positive probability.
Since the aim is to examine each graph E(0), …, E(N) visited, together with all those that were checked as candidates when the transition probabilities were computed, and then choose the one that maximises S(E) over E ∈ {graphs evaluated}, the following variation may be more efficient.
Start with an empty graph. Let E(0) denote the empty graph and let E(t) denote the graph
selected at step t.
For each cycle of ½ d(d − 1)(d − 2) steps, randomly select σ, an ordering of {1, …, d}, each with probability 1/d!, and, for j = 1, …, d, i = 1, …, d − 1, i ≠ j, k = i + 1, …, d, k ≠ j, do the following:
1. For the triple of nodes {X_{σ(i)}, X_{σ(j)}, X_{σ(k)}}, consider all 16 possibilities of (F_{σ(i),σ(j)}, F_{σ(j),σ(k)}) (defined in Equation (18.2)) when applied to the current essential graph and record those for which the new graph is an essential graph.
2. Let E⁽⁰⁾ = E(t) and let E⁽¹⁾, …, E⁽¹⁵⁾ denote the other 15 possibilities. For E⁽⁰⁾, …, E⁽¹⁵⁾, set y_k = 0 if E⁽ᵏ⁾ is not an essential graph; otherwise, set y_k = R(E⁽ᵏ⁾), where R is a suitable score function.
After the algorithm has run for the required length of time (several cycles of length ½ d(d − 1)(d − 2)), the graph E(t), t ∈ {0, …, N}, with the highest score is selected.
Difficulties with Metropolis-Hastings This algorithm has computational advantages. Only three nodes at a time are considered, with the possibility of at most 15 different essential graphs. It provides a stochastic search algorithm, where the aim is to find a highly scoring structure. But it seems very difficult to modify it to produce a Metropolis-Hastings scheme with a `theoretically' correct stationary distribution. If N(E) denotes the space of all essential graphs that can be obtained by such a procedure, then Q(E, E′), the probability of proposing E′ given a current state E, does not have a convenient expression, and neither does the acceptance probability

$$\alpha_{E,E'} = \min\left(1, \frac{S(E')\,Q(E', E)}{S(E)\,Q(E, E')}\right).$$
$$Q(D(t), D') = \begin{cases} \frac{1}{|N(D(t))|} & D' \in N(D(t)) \\ 0 & \text{otherwise.} \end{cases}$$
The acceptance probability is the usual Metropolis-Hastings ratio,

$$\alpha_{D(t),D'} = \min\left(1, \frac{S(D' \mid x)\,Q(D', D(t))}{S(D(t) \mid x)\,Q(D(t), D')}\right).$$
Notation Let D denote a directed edge set. For a node X_i, let D^{X_i ← π} denote the graph D where the edges Pa_i ↦ X_i are removed and a new parent set π is imposed on X_i. For a graph D, let 1(D) denote the indicator function, returning value 1 if D is a DAG and 0 otherwise. For a given ordering of the nodes, L_i(π ∣ x) denotes a score function for node i having parent set π.
Choice of New Parent Sets Let D₀ denote the graph D after the links Pa_i ↦ X_i and Pa_j ↦ X_j have been removed. The new parent set for X_i, π̃_i, is sampled from the distribution:
Conditioned on choosing REV (deciding to make a move of reverse-edge type), the proposal probability for the move D ↦ D′, where D′ is obtained by exchanging the parent sets (π_i, π_j) of nodes (X_i, X_j) for (π̃_i, π̃_j), is:

$$Q(D, D') = \frac{1}{N(D)}\, Q(\tilde\pi_i \mid D_0, X_j)\, Q(\tilde\pi_j \mid D_0^{X_i \leftarrow \tilde\pi_i}),$$
where N (D) is the number of edges in D. The acceptance is:
Adding the Reverse Move to the Sampler A value p_R ∈ (0, 1) is chosen. If the current graph is not empty, then with probability p_R it is decided to make a reverse move, and with probability p_S = 1 − p_R it is decided to make a standard move (addition or deletion). Since the standard moves comprise an ergodic Markov chain (albeit not with the desired level of mobility), the mixture is also ergodic.
$$L(D \mid x) = \prod_{j=1}^{d} \tilde L(j, \pi_j \mid x),$$

where j denotes node j in D and π_j denotes its parent set. Assume that the prior P_D(D) also has the form:

$$P_D(D) = \prod_{j=1}^{d} Q(j, \mathrm{Pa}_j)$$
and set

$$S(j, \mathrm{Pa}_j \mid x) = \tilde L(j, \mathrm{Pa}_j \mid x)\, Q(j, \mathrm{Pa}_j). \qquad (18.3)$$
The score R(σ∣x) for a given ordering σ, given the data x, is given by:

$$R(\sigma \mid x) = \sum_{D \in \sigma} P(D \mid x) \propto \prod_{j=1}^{d} \sum_{\mathrm{Pa}_{\sigma(j)} \in \sigma} S(\sigma(j), \mathrm{Pa}_{\sigma(j)} \mid x), \qquad (18.4)$$

where S is a score function, D ∈ σ denotes a DAG compatible with node ordering σ and Pa_{σ(j)} ∈ σ denotes that the parent set of σ(j) is compatible with node ordering σ.
A hard limit K is placed on the size of each parent set. This reduces the complexity of scoring each node, since the number of candidate parent sets is at most of order d^K.
It is much easier to consider moves between node orders. There are a variety of proposals for moves from σ to σ′; for example, choose two elements at random and flip them. The move σ ↦ σ′ is proposed with probability Q(σ, σ′); the proposal is accepted with probability

$$\min\left(1, \frac{R(\sigma' \mid x)\,Q(\sigma', \sigma)}{R(\sigma \mid x)\,Q(\sigma, \sigma')}\right).$$
Sampling the DAG Having converged to the stationary distribution over orders σ, orderings σ* are sampled proportionally to R(σ∣x). A DAG is then sampled for a fixed order in the following way: the parent sets are sampled independently for each variable X_i, according to the score function (18.3). This makes the problem much easier.
The Problem with Bias The posterior distribution over orderings is:

$$P(\sigma \mid x) = \sum_{D} P(\sigma \mid D)\, P(D \mid x).$$

Here P(D∣x) is simply the Cooper-Herskovits likelihood. This differs from the score function (18.4) through the term P(σ∣D), which is simply the inverse of the number of orders that the DAG belongs to. On average, the number of orders that each DAG belongs to is exponentially large (it can range from 1 to d!). Neglecting this term in the order MCMC algorithm therefore weights DAGs by the number of orders they belong to.
Layering of a DAG The nodes of a DAG may be layered. A layering is a partition satisfying the condition that no node in layer k is either an ancestor or a descendant of any other node in layer k. The layers are indexed by N = {1, 2, 3, …}. It is known as a minimal layering if each node has the minimal index value such that the partition is a layering.
The minimal layering clearly satisfies (for example) that all ancestor nodes are in layer 1. Furthermore, all nodes in layer k have at least one parent in layer k − 1.
Consider a minimal layering with m levels and let (k1 , . . . , km ) denote the number of nodes in each
layer. The number of DAGs belonging to such a partition is given by:
$$a_{k_1,\dots,k_m} = \frac{d!}{k_1! \cdots k_m!} \prod_{j=2}^{m} \left(2^{k_{j-1}} - 1\right)^{k_j} \prod_{j=3}^{m} 2^{\,k_j S_{j-2}},$$

where S_j = ∑_{i=1}^{j} k_i. The first term is simply the number of ways of distributing d nodes in m partition elements of size k₁, …, k_m respectively. The second is the number of ways that nodes in each partition can have parents in the previous partition; subtracting 1 excludes the case where nodes receive no edges. The third term is the number of ways that nodes can have parents from partitions other than the one directly below.
$$S(\Lambda \mid x) = \sum_{D} P(\Lambda \mid D, x)\, P(D \mid x) = \sum_{D \in \Lambda} P(D \mid x) \propto \prod_{j=1}^{d} \sum_{\mathrm{Pa}_j \in \Lambda} S(X_j, \mathrm{Pa}_j \mid x),$$
where D ∈ Λ denotes that the DAG D is compatible with the layering specied by Λ and Paj ∈ Λ
denotes that the parent set of variable j is compatible with Λ.
The MCMC will propose a move Λ ↦ Λ′ by defining a set N(Λ) of neighbours and choosing each with equal probability. The acceptance probability is

$$\alpha_{\Lambda,\Lambda'} = \min\left(1, \frac{|N(\Lambda)|\, S(\Lambda' \mid x)}{|N(\Lambda')|\, S(\Lambda \mid x)}\right). \qquad (18.5)$$

The neighbours are obtained by merging two adjacent layers or splitting one layer, so the number of neighbours is

$$(m-1) + \sum_{i=1}^{m} \sum_{c=1}^{k_i - 1} \binom{k_i}{c} = m - 1 + \sum_{i=1}^{m} \left(2^{k_i} - 2\right) = \left(\sum_{i=1}^{m} 2^{k_i}\right) - m - 1.$$
When merging layer i with layer i + 1, the score changes simply with the indicator function of whether the parent sets are legal under the new layering. The alterations affect only the nodes in the layer labelled i + 2 before the merge (the number of possible parent sets has increased) and, of course, those in layer i + 1 before the merge. Instead of being forced to have at least one parent from layer i before the merge, links with these variables are excluded; now the variables from layer i + 1 (before the merge) are forced to have a parent in layer i − 1.
Splitting and merging thus defined give reversible moves, so that the acceptance defined by (18.5) is positive.
It is straightforward to see that the chain is irreducible; from one partition any other partition can be reached in a finite number of moves which have positive probability. If necessary, the chain can stay still with positive probability to ensure aperiodicity.
Choose two nodes at random, with the constraint that they are in adjacent layers. There are

$$M(\Lambda) = \sum_{i=1}^{m-1} k_i k_{i+1}$$

possible choices of pairs. Each pair is chosen with equal probability and the move is accepted with probability

$$\min\left(1, \frac{M(\Lambda)\, S(\Lambda' \mid x)}{M(\Lambda')\, S(\Lambda \mid x)}\right).$$
Chapter 19
Dynamic Bayesian Networks
19.1 Introduction
Dynamic Bayesian networks (DBNs) are an important tool that has proved useful for a large class of problems. The thesis of Kevin Murphy (2002) [97] provides a comprehensive introduction to the topic. The first mention of dynamic Bayesian networks seems to be by Dean and Kanazawa (1989) [34].
The DBN framework provides a way to extend Bayesian network machinery to model probability distributions over collections of random variables (Z_t)_{t≥0}. The parameter t ∈ {0, 1, 2, …} represents time. Typically, the variables at a time slice t are partitioned into Z_t = (U_t, X_t, Y_t), representing the input, hidden and output variables of the model. The term `dynamic' refers to the fact that the system being modelled is dynamic; the basic structure of the network itself remains the same over time.
$$P_{Z^0,\dots,Z^t} = P_{Z^0} \prod_{s=1}^{k-1} P_{Z^s \mid Z^0,\dots,Z^{s-1}} \prod_{s=k}^{t} P_{Z^s \mid Z^{s-k},\dots,Z^{s-1}},$$

where, for t ≥ k,

$$P_{Z^t \mid Z^{t-k},\dots,Z^{t-1}} = \prod_{j} P_{Z_t^j \mid \mathrm{Pa}(Z_t^j)},$$

Z_t^j is the jth node at time t, which could be a component of either X_t, Y_t or U_t, and the set Pa(Z_t^j) of parents of Z_t^j belongs to the collection {Z^{t−k}, …, Z^t}.
The arrows within the same time slice do not represent causality.
The requirement is that the subgraph restricted to {Z^t, …, Z^{t+k−1}} is the same for each t ≥ 0 and the conditional probabilities P_{Z_t^j ∣ Pa(Z_t^j)} are the same for each t ≥ k. Furthermore, for 1 ≤ i ≤ j ≤ k and each s ≥ j, the subgraph restricted to {Z^{s+i}, …, Z^{s+j}} is a subgraph of the subgraph restricted to {Z^{s+i−1}, …, Z^{s+j}}.
The arcs between slices go from left to right, reflecting the causal flow of time. If there is an arc from Z^j_{t−1} to Z^j_t, the node Z^j is said to be persistent. The arcs within a slice may have arbitrary direction, so long as the overall DBN is a DAG. The arcs within a time slice may also be undirected, since they model correlation or constraints rather than causation; the resulting model is then a (dynamic) chain graph.
The parameters of the conditional probabilities P_{Z_t^j ∣ Pa(Z_t^j)} are time-invariant for t ≥ k, i.e., the model is time-homogeneous. If parameters can change, they may be added to the state space and treated as random variables, or alternatively a hidden variable may be added that selects which set of parameters to use.
Within the engineering community, DBNs have become a popular tool, because they can express
a large number of models and are often computationally tractable.
DBNs have been successfully applied in the reconstruction of genetic networks, where genes do not remain static; rather, their expression levels fluctuate constantly. An increased expression level of a gene will result in increased levels of mRNA from that gene, which will in turn influence the expression levels of other genes. DBNs have proved to be a successful way of analysing genetic expression data.
With a Dynamic Bayesian Network, the n × d data matrix no longer represents n independent
instantiations of a random d-vector. Rather, the rows represent time slices of a process {X(t) ∶ t ∈ N}.
Some assumptions (for example time homogeneity) have to be made in order to learn structure and
parameters.
If the number of instantiations n available is large in comparison to d, then standard multivariate time series techniques may be used effectively. If n is small compared with d, other techniques (such as LASSO L1 regularisation) should be used.
$$X(t) = \mu_0 + t\mu_1 + \sum_{j=1}^{p} A_j X(t-j) + \sum_{k=1}^{q} B_k \epsilon_{t+1-k},$$
where ϵt ∼ N (0, Σ) are i.i.d. (the distribution is not necessarily normal, but the normality assumption,
if true, leads to sharper estimation).
The MA part often leads to instability for estimation; we therefore only consider VAR(p) processes:

$$X(t) = \mu_0 + t\mu_1 + \sum_{j=1}^{p} A_j X(t-j) + \epsilon_t.$$
> install.packages("vars")
> library(vars)
Within vars, there is a test data set, Canada, which contains 4 macroeconomic indicators: prod (labour productivity), e (employment), U (unemployment rate) and rw (real wages). A VAR(2) model is fitted quite simply with the command:
> data(Canada)
> can = VAR(Canada,p=2)
> summary(can)
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
e prod rw U
e 1.00000 -0.03155 -0.1487 -0.6809
prod -0.03155 1.00000 0.1269 0.0763
rw -0.14870 0.12691 1.0000 0.1568
U -0.68090 0.07630 0.1568 1.0000
The default value of the type argument, which estimates µ₀ and sets µ₁ = 0, is const. To set µ₀ = 0 and µ₁ = 0, use type="none"; to estimate a trend only, use type="trend"; to estimate both constant and trend, use type="both":
> VAR(Canada,p=2,type="none")
> VAR(Canada,p=2,type="trend")
> VAR(Canada,p=2,type="both")
The stability function verifies the covariance stationarity of a VAR process, using cumulative sums of residuals. This may be carried out by:
> var.2c=VAR(Canada,p=2,type="const")
> stab=stability(var.2c,type="OLS-CUSUM")
> plot(stab)
There are several tests for normality which come under normality.test.
> normality.test(var.2c)
$JB
JB-Test (multivariate)
$Skewness
$Kurtosis
The function serial.test carries out the Portmanteau (i.e. Ljung-Box) test:
> serial.test(var.2c,lags.pt=16,type="PT.adjusted")
The VARMA model is standard and is treated in any reasonable text on time series.
LASSO and Least Angle Regression Given a set of input measurements (x_{j,1}, …, x_{j,d}) for j = 1, …, n and outcome measurements y_j, j = 1, …, n, taken as observations on independent variables, the lasso fits a linear model

$$\hat y_j = \hat\beta_0 + \sum_{i=1}^{d} x_{j,i}\, \hat\beta_i.$$

Minimise ∑_{j=1}^{n} (y_j − ŷ_j)² subject to ∑_{i=1}^{d} ∣β_i∣ ≤ s for a constraint value s.
The bound s is a tuning parameter. When s is sufficiently large, the constraint has no effect and the solution is simply the usual multiple linear least squares regression of y on x₁, …, x_d.
For smaller values of s (s ≥ 0), the solutions are shrunken versions of the least squares estimates. The L1 penalisation often forces some of the coefficient estimates β̂_i to be zero.
The choice of s therefore plays a similar role to choosing the number of predictors in a regression
model.
Cross-validation is the standard tool for estimating the best value for s.
Forward stepwise regression achieves the same objective as regularisation by adding in explanatory
variables one at a time:
Find the predictor xj which is most correlated to y and add it into the model. Take residuals
r = y − ŷ.
Continue, at each stage adding to the model the predictor most correlated with r.
The Least Angle Regression procedure follows the same general scheme, but does not add a predictor
fully into the model. The coecient of that predictor is increased only until that predictor is no longer
the one most correlated with the residual r. Then some other competing predictor is included.
Increase the coecient βj in the direction of the sign of its correlation with y . Take residuals
r = y − ŷ. Stop when some other predictor xk has as much correlation with r as xj has.
Increase (βj , βk ) in their joint least squares direction, until some other predictor xm has as much
correlation with the residual r.
It can be shown that, with one modification, this procedure gives the entire path of lasso solutions as s is varied from 0 to infinity. The modification needed is: if a non-zero coefficient hits zero, remove it from the active set of predictors and recompute the joint direction.
Cross-Validation Cross validation is a class of model evaluation methods where some of the data is removed before training begins; when training is done, the removed data is used to test the performance of the learned model on new data.
Holdout The holdout method is the simplest kind of cross validation. The data set is separated into two sets: the training set and the testing set. The function approximator fits a function using the training set only. Then the function approximator is asked to predict the output values for the data in the testing set (it has never seen these output values before). The errors it makes are accumulated as before to give the mean absolute test set error, which is used to evaluate the model.
K-fold Cross Validation K-fold cross validation is one way to improve over the holdout method.
The data set is divided into k subsets, and the holdout method is repeated k times. Each time,
one of the k subsets is used as the test set and the other k-1 subsets are put together to form
a training set. Then the average error across all k trials is computed. The advantage of this
method is that it matters less how the data gets divided. Every data point gets to be in a test set
exactly once, and gets to be in a training set k-1 times. The variance of the resulting estimate is
reduced as k is increased. The disadvantage of this method is that the training algorithm has to
be rerun from scratch k times, which means it takes k times as much computation to make an
evaluation. A variant of this method is to randomly divide the data into a test and training set
k dierent times. The advantage of doing this is that you can independently choose how large
each test set is and how many trials you average over.
Leave-one-out Leave-one-out cross validation is K-fold cross validation taken to its logical
extreme, with K equal to n, the number of data points in the set. That means that the function
approximator is trained on all the data except for one point n separate times and a prediction
is made for that point. As before the average error is computed and used to evaluate the model.
The evaluation given by leave-one-out cross validation error (LOO-XVE) is good, but at rst
pass it seems very expensive to compute.
19.3.1 Implementation
There are several packages available in R for DBN learning. One of the most prominent is the lars package by Hastie and Efron [60] (2012). Other packages available are the glmnet package by Friedman et al. [44] (2010) and penalized by Goeman [54] (2012). For illustration, we use the arth800 data set from the GeneNet package. This describes the expression levels of 800 genes of Arabidopsis thaliana during the diurnal cycle. We consider a subset arth12 of 12 of the genes.
> library(lars)
> library(GeneNet)
> data(arth800)
> subset=c(60,141,260,333,365,424,441,512,521,578,789,799)
> arth12=arth800.expr[,subset]
Now lars is used to estimate a model for a target variable specified by a vector (say y) and a set of possible parents specified by a matrix of predictors (say x). The arth800 data set consists of two time series, each of 11 points in length. That is, there are two repeated measurements for each time point. To estimate a VAR(1) process, firstly remove the two repeated measurements for the first time point
of y and the two repeated measurements for the last time point of x. They cannot be used for LASSO,
since y(t) needs x(t − 1).
> x = arth12[1:(nrow(arth12)-2),]
> y = arth12[-(1:2),"265768_at"]
> lasso.fit = lars(y=y,x=x,type="lasso")
> plot(lasso.fit)
[Figure 19.1: the lars coefficient path for lasso.fit, plotting the standardized coefficients against ∣β∣/max∣β∣.]
The figure is interpreted as follows: the aim is to predict y(t) (the expression levels for the gene labelled 265768_at) by the expression levels one time unit earlier (given at time index t − 2, because we have double measurements for each time point), x(t − 2). The regression is carried out by evaluating the coefficients β which minimise ∑_{t=3}^{22} (y(t) − ∑_j x_j(t−2) β_j)², subject to the constraint ∑_j ∣β_j∣ ≤ s, for increasing values of s. For the x-axis, this is presented as ∣β∣/max∣β∣, where ∣β∣ = ∑_j ∣β_j∣ and max∣β∣ is the value of ∑_j ∣β_j∣ for the unconstrained solution.
> coef(lasso.fit)
Structure learning (i.e. deciding which directed edges to include in the network) is carried out via
cross-validation. The cv.lars function does this.
> lasso.cv=cv.lars(y=y,x=x,mode="fraction")
The output gives the MSE (mean squared error) as a function of ∣β∣/max∣β∣ (where ∣β∣ denotes the constraint and max∣β∣ denotes the value of ∑_j ∣β_j∣ for the unconstrained problem); the output is shown in Figure 19.2. The optimal set of arcs is chosen to minimise the mean squared error.
> frac=lasso.cv$index[which.min(lasso.cv$cv)]
> predict(lasso.fit,s=frac,type="coef",mode="fraction")
$s
[1] 0.1919192
$fraction
[1] 0.1919192
$mode
[1] "fraction"
$coefficients
265768_at 263426_at 260676_at 258736_at 257710_at 255764_at
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
255070_at 253425_at 253174_at 251324_at 245319_at 245094_at
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 -0.6420806
The non-zero coefficients indicate the arcs pointing to gene 265768_at that are to be included, for the optimal value s=frac computed by cv.lars.
[Figure 19.2: the cross-validated MSE from cv.lars as a function of the fraction ∣β∣/max∣β∣.]
The number of steps can be controlled by setting the mode argument of predict to step.
> predict(lasso.fit,s=3,type="coef",mode="step")$coefficients
265768_at 263426_at 260676_at 258736_at 257710_at 255764_at 255070_at 253425_at
-0.02152962 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
253174_at 251324_at 245319_at 245094_at
0.00000000 0.00000000 0.00000000 -0.72966658
> predict(lasso.fit,s=0.2,type="coef",mode="lambda")$coefficients
265768_at 263426_at 260676_at 258736_at 257710_at 255764_at 255070_at 253425_at
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
253174_at 251324_at 245319_at 245094_at
0.0000000 0.0000000 0.0000000 -0.6961228
The lars package also ts least angle regression and stepwise regression.
> lar.fit=lars(y=y,x=x,type="lar")
> lar.cv=cv.lars(y=y,x=x,type="lar")
> step.fit=lars(y=y,x=x,type="stepwise")
> step.cv=cv.lars(y=y,x=x,type="stepwise")
> install.packages("simone")
> library(simone)
> ?simone
It works on the principle that the n × d data matrix contains n sequential observations of the d variables and it fits a VAR(1) model. The default is clustering = FALSE.
The output is the number of edges in the network depending on the penalisation (default: BIC). A sequencing display of the network as the penalty is reduced is obtained by:
> result=simone(arth12,type="time-course")
> plot.simone(result)
The analysis can be carried out with clustering; edges are penalised if latent clustering is discovered
while constructing the network.
> resultcluster=simone(arth12,type="time-course",clustering=TRUE,control=ctrl)
The sequencing display of the network indicates that clustering has not changed the output much.
Learning is carried out in two stages: firstly, learning the graph encoding the first order partial dependencies with DBNScoreStep1.
> step1=DBNScoreStep1(arth12,method="ls")
> edgesG1=BuildEdges(score=step1$S1ls,threshold=0.50,prec=6)
> nrow(edgesG1)
[1] 27
> step2=DBNScoreStep2(step1$S1ls,data=arth12,method="ls",alpha1=0.50)
> edgesG=BuildEdges(score=step2,threshold=0.05,prec=6)
If T > t (node X_i(t) is omitted), the query is called smoothing. It returns a smoothed value of X̂_i(t); the aim of the query is noise reduction.
Queries which ask for the Most Probable Explanation can be performed for filtering, smoothing and prediction with the lars package.
To see how it works, consider the arth12 data set:
> library(GeneNet)
> data(arth800)
> subset = c(60, 141, 260, 333, 365, 424, 441, 512,
+ 521, 578, 789, 799)
> arth12 = arth800.expr[, subset]
> library(lars)
> x = arth12[1:(nrow(arth12) - 2), ]
> y = arth12[-(1:2), "265768_at"]
y contains the expression levels of gene 265768_at at all times except for time 0 (recall that there are
two measurements at each time). x contains the whole data set for all times except for the last one,
labelled 24.
frac contains the value of the index that minimises the cross-validated error. Therefore, this is the value that is used to build the model. Estimation of the expression levels of 265768_at may be carried out quite simply by:
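A sketch (reusing the lasso.fit and frac objects computed with cv.lars above; type = "fit" asks predict.lars for fitted values rather than coefficients):
> est = predict(lasso.fit, newx = x, s = frac, type = "fit", mode = "fraction")$fit
> round(est, 2)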
The estimated expression levels at 20-1 and 20-2 are a result of filtering, while the others given here are a result of smoothing.
The values of 24-1 and 24-2 can be predicted by:
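For instance (a sketch; the last two rows of arth12 hold the two replicated measurements at time 24):
> last = arth12[(nrow(arth12) - 1):nrow(arth12), ]
> predict(lasso.fit, newx = last, s = frac, type = "fit", mode = "fraction")$fit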
The penalized package fits LASSO models which are compatible with bnlearn. Therefore, more complex conditional probability queries can be carried out using cpquery and cpdist if the model is first learned in this way.
> library(penalized)
> lambda = optL1(response = y, penalized = x)$lambda
> lasso.t = penalized(response = y, penalized = x,
+ lambda1 = lambda)
# nonzero coefficients: 2
> coef(lasso.t)
(Intercept) 245094_at
14.0402894 -0.7059011
The only parent of gene 265768_at is 245094_at, which seems to act as an inhibitor.
This suggests that a model with this explanatory variable might be useful. Such a DBN can be created
in the following way:
> dbn1 = model2network("[245094_at][265768_at|245094_at]")
> xp.mean = mean(x[, "245094_at"])
> xp.sd = sd(x[, "245094_at"])
> dbn1.fit = custom.fit(dbn1,
+     dist = list("245094_at" = list(coef = xp.mean, sd = xp.sd),
+                 "265768_at" = lasso.t))
Since the data is continuous, there are two possibilities: either create a Gaussian network, or discretise
the variables. The network dbn1 is Gaussian. The mean xp.mean and standard deviation xp.sd need
to be specied.
The regression analysis suggests that high expression levels of 245094_at at time t − 1 lead to low
expression levels of 265768_at at time t. The cpquery function can be used:
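For instance, a minimal sketch (the threshold values 6.5 and 8 are illustrative only, chosen inside the range of the observed expression levels):
> cpquery(dbn1.fit,
+     event = (`265768_at` > 8),
+     evidence = (`245094_at` > 6.5))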
Note With this package, it is not permitted to condition on events of measure 0. Therefore, intervals must be specified both for the event and for the evidence.
The function cpdist may be used to generate random observations. To compare the conditional
distributions for both pieces of evidence, use:
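A sketch with illustrative evidence intervals:
> low = cpdist(dbn1.fit, nodes = "265768_at", evidence = (`245094_at` < 6.5))
> high = cpdist(dbn1.fit, nodes = "265768_at", evidence = (`245094_at` >= 6.5))
> summary(low)
> summary(high)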
Now suppose that the variables at time t are not independent of those at t − 2 given t − 1. It is then a
good idea to construct a DBN which depends on lags 1 and 2. To check whether the introduction of
t − 2 to explain t improves the model:
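One possible construction (a sketch, not the authors' code: the lag-1 and lag-2 predictor matrices are built by shifting the rows by two positions per time step, since each time point is measured twice; the name lasso.s matches the model referred to below):
> y2 = arth12[-(1:4), "265768_at"]              # response at time t
> x1 = arth12[3:(nrow(arth12) - 2), ]           # predictors at time t - 1
> x2 = arth12[1:(nrow(arth12) - 4), ]           # predictors at time t - 2
> colnames(x2) = paste(colnames(x2), "lag2", sep = ".")
> xx = cbind(x1, x2)
> lambda2 = optL1(response = y2, penalized = xx)$lambda
> lasso.s = penalized(response = y2, penalized = xx, lambda1 = lambda2)
> coef(lasso.s)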
The assumption is that the DBN is time homogeneous. These results suggest a network structure
which can be created as follows:
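One possible structure (a sketch; the .l1 and .l2 suffixes for the lagged copies of 245094_at are hypothetical names, since the text does not show the authors' naming convention):
> dbn2 = model2network(paste0("[245094_at.l2][245094_at.l1|245094_at.l2]",
+     "[265768_at|245094_at.l1:245094_at.l2]"))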
The parameters of dbn2 may be estimated via maximum likelihood. The parameters of 265768_at and 245094_at may then be substituted with those from the LASSO models lasso.t and lasso.s.
19.7 Exercises
1. Consider the Canada data set from the vars package. Load the data set, make some exploratory analysis and estimate a VAR(1) process for this data set. Estimate the auto-regressive matrix A and the constant matrix B which define the VAR(1) model.
Compare the results with the LASSO matrix when the L1 penalty is estimated by cross-validation. What are your conclusions?
2. Consider the arth800 data set from the GeneNet package. Load the data set. The time series expression of the 800 genes is included in a data set called arth800.expr. Investigate its properties.
Compute the variances of each of the 800 variables, plot them in decreasing order and create a data set with those variables whose variance is greater than 2.
Can you fit a VAR process using the vars package (unlikely)? Suggest alternative approaches (such as LASSO) and apply them. Estimate a DBN with each approach and compare the DBNs. Plot the DBNs using plot from G1DBN.
Chapter 20
Factor Graphs and the Sum Product Algorithm
This chapter describes the Sum Product Algorithm, henceforth abbreviated SPA, which was introduced
by Wiberg [145] (1996). It is an algorithm for obtaining the marginals of a factorised function. It
has also become known as Loopy Belief Propagation. It operates on factor graphs. SPA can be
considered as the most elementary of a family of related algorithms, consisting of double-loop algorithms
(see Heskes et al. [63] (2003)), Generalised Belief Propagation (see Yedidia et al. [149] (2005)), Expectation Propagation (see [94] (2001)), Expectation Consistent Approximate Inference (see Opper and Winther [102] (2005)), the Max-Product Algorithm (see Weiss and Freeman [143] (2001)), the Survey Propagation Algorithm (see Braunstein, Mézard and Zecchina [7] (2004) and [6] (2005)) and Fractional Belief Propagation (see Tatikonda [133] (2003)), to name but a few variants. SPA and its variants provide a natural method for a wide variety of applications: Wiberg [145] discusses applications to error correcting codes, an application developed by McEliece, MacKay and Cheng [92] (1998). It is used for satisfiability problems in combinatorial optimisation [7] and computer vision (stereo matching: Sun, Zheng and Shum [131] (2003); image restoration: Tanaka [132] (2002)). More recently, a variant known as the `Stochastic Belief Propagation' algorithm was developed by Noorshams and Wainwright [101] (2013), with applications to image analysis. For that situation, the number of states of each variable is large, so that only a few of the states are randomly selected for update in each cycle of the algorithm.
Definition 20.1 (Factorisability). The function ϕ is said to be factorisable if it factors into a product of several local functions γ_j, each defined on a local domain D_j, such that

$$\phi(x) = \prod_{j} \gamma_j(x_{D_j}), \qquad (20.1)$$
where the domains of the functions have been extended to X (Definition 7.2). This is also known as the `one i (eye) problem'. The aim of this chapter is to describe a procedure for computing the marginalisation, which exploits the way in which the global function is factorised and uses the current values to update the values assigned to each variable. The method involves a factor graph, which is an example of a bipartite graph.
Definition 20.2 (Bipartite Graph). A graph G is bipartite if its node set can be partitioned into two sets W and U in such a way that every edge in G has one endpoint in W and the other in U.
A factor graph is a bipartite graph that expresses the structure of the factorisation given by Equa-
tion (20.1). The graph has the following properties:
there is a variable node (an element of U) for each variable. A capital letter X will be used to denote the variable node, a small letter the value x in the state space X_X associated with the variable;
there is a function node (an element of W) for each function γ_j; γ_j will be used to denote both the local function and the node;
there is an undirected edge connecting variable node X_i to factor node γ_j if and only if X_i is in the local domain of γ_j.
In other words, a factor graph is a representation of the relation `is an argument of'.
A Bayesian Network has a joint probability distribution that factorises according to a DAG. This joint
distribution can be converted into a factor graph. Each function is the local function PXi ∣Πi and edges
are drawn from this node to Xi and to its parents Πi . The DAG corresponding to the factorisation
of PX1 ,X2 ,X3 ,X4 is shown in Figure 20.1 and the corresponding factor graph in Figure 20.2.
[Figure 20.1: a directed acyclic graph over X1, X2, X3, X4.]
[Figure 20.2: The Factor Graph Corresponding to the Directed Acyclic Graph in Figure 20.1.]
µ_{X→γ_j} denotes the message sent from the variable node X to the function node γ_j in the sum product algorithm, and µ_{γ_j→X} denotes the message sent from the function node γ_j to the variable node X.
[Figure: the two messages µ_{X→γ_j} and µ_{γ_j→X} passed in opposite directions along the edge between X and γ_j.]
Recall the definition of neighbour (Definition 1.2). N_v will be used to denote the set of neighbours of a node v. A factor graph is undirected. By the definition of a factor graph, all the neighbours of a node will be of the opposite type to the node itself.
The message sent from node v on edge e is the product of the local function at v (or the unit function if v is a variable node) with all messages received at v on edges other than e, and is then marginalised to the variable associated with e. The messages are defined recursively as follows.
Definition 20.4 (Sum Product Update Rule). For x ∈ X_k, and for each X_k ∈ N_{γ_j},

$$\mu_{X_k \to \gamma_j}(x) = \begin{cases} \prod_{h \in N_{X_k} \setminus \{\gamma_j\}} \mu_{h \to X_k}(x) & \forall x \in X_k,\ N_{X_k} \neq \emptyset \\ 1 & N_{X_k} = \emptyset, \end{cases} \qquad (20.3)$$
and for each γ_j ∈ N_{X_k},

$$\mu_{\gamma_j \to X_k}(x) = \sum_{y} \gamma_j(y, x) \prod_{X_i \in N_{\gamma_j} \setminus \{X_k\}} \mu_{X_i \to \gamma_j}(y_i), \qquad (20.4)$$

where ∅ denotes the empty set, and where the domain of γ_j has been extended to X and variable X_k takes the last position; y_j is the value taken by variable X_j (j ≠ k).
Definition 20.5 (Initialisation). All messages are initialised to the unit function: µ_{X_k→γ_j} ≡ 1 and µ_{γ_j→X_k} ≡ 1, for each variable node X_k and each function node γ_j.
Definition 20.6 (Termination). The termination at a node is the product of all messages directed towards that node:

$$\mu_{X_k}(x) = \prod_{\gamma_j \in N_{X_k}} \mu_{\gamma_j \to X_k}(x) \qquad (20.5)$$

and

$$\mu_{\gamma_j}(y) = \gamma_j(y) \prod_{X_k \in N_{\gamma_j}} \mu_{X_k \to \gamma_j}(y_k).$$

Note that the function node receives communications from precisely those variables that are in the domain of the function.
After sending sufficiently many messages according to a suitable schedule, the termination at the variable node yields the marginalisation, or a suitable approximation to the marginalisation, over that variable. That is,

$$\mu_{X_i}(x) \simeq \sum_{y \in X_{\tilde V \setminus \{i\}}} \phi(y, x),$$

where the arguments of ϕ have been rearranged so that variable X_i appears last.
Note Consider the problem where the potentials initially represent probability distributions over the domains and where hard evidence is inserted, rendering some of the states `impossible'. For the initialisation, only those states that are possible are included and the initialisation is set to 1; the other states are not included (equivalently, the corresponding initialisation is set to zero). The termination at a node then gives the joint probability distribution of the variable and the evidence. If a conditional probability is required, then the answer has to be normalised.
The Schedule One node is arbitrarily chosen as a root and, for the purposes of constructing a
schedule, the edges are directed to form a directed acyclic graph, where the root has no parents. If the
graph is a tree, then the choice of directed acyclic graph is uniquely dened by the choice of the root
node. Computation begins at the leaves of the factor graph.
Each leaf variable node sends the trivial identity function to its parents.
Each node waits for the message from all its children before computing the message to be sent
to its parents.
Once the root has received messages from all its children, it sends messages to all its children.
Each node waits for messages from all its parents before computing the message to be sent to its
children.
This is repeated from root to leaves and is iterated a suitable number of times. No iterations are needed
if the factor graph is cycle free. This is known as a generalised forward and backward algorithm.
Theorem 20.7 (Wiberg). Let

$$\phi(x) = \prod_{j} \gamma_j(x_{D_j})$$

and let G be a factor graph with no cycles, representing ϕ. Then, for any variable node X_k, the marginal of ϕ at x ∈ X_k is

$$\sum_{y \in X_{\tilde V \setminus \{k\}}} \phi(y, x) = \mu_{X_k}(x),$$

where the arguments of ϕ have been rearranged so that the kth variable appears last and µ_{X_k}(x) is given in Equation (20.5).
Example 20.8. Before giving a proof of Wiberg's theorem, the following example may be instructive. Consider the chain factor graph

X1 -- γ1(x1, x2) -- X2 -- γ2(x2, x3) -- X3.

Then

$$\mu_{\gamma_1 \to X_1}(x_1) = \sum_{x_2 \in X_2} \gamma_1(x_1, x_2)\, \mu_{X_2 \to \gamma_1}(x_2) = \sum_{(x_2, x_3) \in X_2 \times X_3} \gamma_1(x_1, x_2)\, \gamma_2(x_2, x_3)$$
$$\mu_{X_2}(x_2) = \mu_{\gamma_1 \to X_2}(x_2)\, \mu_{\gamma_2 \to X_2}(x_2) = \sum_{(x_1, x_3) \in X_1 \times X_3} \gamma_1(x_1, x_2)\, \gamma_2(x_2, x_3),$$

which is the required marginalisation. The theorem of N. Wiberg states that if the factor graph is a tree, then after a full schedule, the terminations give the required marginalisations.
Proof of Theorem 20.7 Consider Figure 20.6. Suppose that a full schedule has been performed on
a tree. The proof proceeds in three steps.
Step 1: Decompose the factor graph into n components, R₁, …, R_n. Choose a variable X_i and suppose that n edges enter the variable node X_i. Since there are no cycles, the margin ∑_{y∈X_{Ṽ∖{i}}} ϕ(y, x_i) (where the arguments of ϕ have been suitably rearranged) may be written as

$$\sum_{y \in X_{\tilde V \setminus \{i\}}} \phi(y, x_i) = \sum_{y \in X \mid y_i = x_i} \prod_{j \in R_1} \gamma_j(y_{D_j}) \prod_{j \in R_2} \gamma_j(y_{D_j}) \cdots \prod_{j \in R_n} \gamma_j(y_{D_j}) = \prod_{k=1}^{n} \sum_{y_{R_k} \in X_{R_k} \mid y_i = x_i} \prod_{j \in R_k} \gamma_j(y_{D_j}) = \prod_{k=1}^{n} \nu_{R_k}(x_i),$$
where the notation is clear. The last expression has the same form as the termination formula. Therefore the assertion is proved if it can be established that

$$\nu_{R_k}(x_i) = \mu_{\gamma_k^0 \to X_i}(x_i), \qquad k = 1, \dots, n,$$

where γ₁⁰, …, γ_n⁰ are the n function nodes that are neighbours of X_i. Due to the clear symmetry, it is only necessary to consider one of these.
Step 2 Consider the decomposition of R₁. The case where γ₁⁰ has three neighbours is illustrated in Figure 20.7. In the three variable case shown in Figure 20.7, X₁ is the node under consideration and γ₁⁰ is outside R₃ and R₄. Suppose the variables neighbouring γ₁⁰ are X₁, Y₁, …, Y_m and the regions corresponding to Y₁, …, Y_m are R₁₁, …, R₁ₘ respectively. Then ν_{R₁} can be decomposed as
$$\nu_{R_1}(x_1) = \sum_{y \in X_{R_1} \mid y_1 = x_1} \prod_{j \in R_1} \gamma_j(y_{D_j}) = \sum_{(y_1, \dots, y_m)} \gamma_1^0(x_1, y_1, \dots, y_m) \prod_{k=1}^{m} \Bigg( \sum_{z_{R_{1k}} \in X_{R_{1k}} \mid Y_k = y_k} \prod_{j \in R_{1k}} \gamma_j(z_{D_j}) \Bigg) = \sum_{(y_1, \dots, y_m)} \gamma_1^0(x_1, y_1, \dots, y_m) \prod_{k=1}^{m} \tilde\nu_{R_{1k}}(y_k),$$

where the notation Y_k = y_k means that the value of the variable denoted Y_k takes the value y_k in z_{R_{1k}}. The notation X_{R_{1k}} denotes all the variable nodes that are neighbours of function nodes in R_{1k}, retaining the same indices as the full set of variables.
Crucially, note that if variable Xj is a leaf node in the graph, then ν̃Rj ≡ 1.
The expression for ν_{R₁} has the same form as the update rule given for µ_{γ_j→X} in Equation (20.4). In other words, if ν̃_{R_j}(y_j) = µ_{X_j→γ₁⁰}(y_j) for each j, then the result is proved. The algorithm proceeds to the leaf nodes of the factor graph.
Step 3 There are two cases. If the leaf node is a function node (as in step 1, going from a variable to functions), then (clearly from the graph) this is a function (h, say) of a single variable (say Y) and, from (20.4),

$$\mu_{h \to Y}(y) = h(y).$$

If the leaf node is a variable node X (as in step 2, going from functions to variables), then the leaf variable is adjacent to a single function h (or else it is not a leaf), which has neighbours (Y₁, …, Y_m, X), say; then the message from X contributes only a unit factor to the update (20.4), since if X is a leaf, then h is the only neighbour of X and hence µ_{X→h}(x) ≡ 1 from (20.3).
By tracing backward from the leaf nodes, it is now clear, by induction, that

$$\sum_{y \in X_{\tilde V \setminus \{i\}}} \phi(y, x_i) = \prod_{j=1}^{n} \mu_{\gamma_j^0 \to X_i}(x_i),$$

together with the formula for the message from a variable node to a function node:

$$\mu_{X \to \gamma}(x) = \prod_{h \in N_X \setminus \{\gamma\}} \mu_{h \to X}(x).$$
Suppose the factor graph is a tree. Then, since any variable to function message is the product of
all but one of the factors in the termination formula, it is clear that µX (x) may be computed as the
product of the two messages that were passed in opposite directions, a) from the variable X to one of
the functions and b) from the function to the variable X .
20.3 The Sum Product Algorithm on General Graphs
The question of whether the scheme converges to the right answer has been considered in [95]. In general, there are two major obstacles:
1. the sum-product algorithm may fail to converge at all;
2. if the sum-product algorithm converges, it is not clear whether the convergence is to the required marginal.
If the factors are all strictly positive, a fixed point exists [149]. This does not imply convergence towards the fixed point and there is no guarantee that the fixed point is stable.
Mooij and Kappen [95] give sufficient conditions under which the mapping has a fixed point and under which there is convergence to the fixed point.
$$\phi = \prod_{j=1}^{d} \psi_j \prod_{\langle j,k \rangle \in U} \psi_{jk}, \qquad (20.6)$$
where the domain of ψj is Xj and the domain of ψjk is Xj × Xk ; U ⊆ {⟨j, k⟩ ∶ 1 ≤ j < k ≤ d}. The charge
here is:
In this case, the function nodes corresponding to the ψj s are leaf nodes, while the function nodes ψjk
only receive a message from one neighbour before passing a message onto a variable node. Therefore,
the message passed on from the function node is identical to the message received by the function
node; no multiplication is required.
For functions that factorise according to Equation (20.6), it follows that only variable to variable
messages need be considered; messages are propagated along the edges of the undirected graph G =
(Ṽ , U ).
Let Muv denote the message transmitted along the edge ⟨u, v⟩ in the direction u ↦ v . The message
passing algorithm discussed so far, in this setting, may be expressed as:
$$\begin{cases} M_{uv}^{0} \equiv 1 \\ M_{uv}^{t+1}(x_v) = \sum_{y \in X_u} \psi_u(y)\, \psi_{uv}(y, x_v) \prod_{j \in N(u) \setminus \{v\}} M_{ju}^{t}(y), \end{cases}$$
where N (u) denotes the neighbours of node u in graph G . If the factor graph is a tree, the messages
are sent into a root, then propagated back out to the leaves, resulting in exact marginalisations. If the
factor graph contains loops, then a suitable schedule is chosen and the updates are iterated.
Suppose that M_{uv}^t → M_{uv}^* as t → +∞. The termination is:

$$\mu_v(x_v) = \psi_v(x_v) \prod_{u \in N(v)} M_{uv}^{*}(x_v).$$
The Stochastic Probability Updates of Noorshams and Wainwright [101] consider the situation where the state space X_j = (x_j^{(1)}, …, x_j^{(k_j)}) for each variable is large. Therefore, not all elements of the state space are updated at each iteration. The algorithm proceeds as follows:
Stochastic Update
1. Initialise the message vectors: M_{vu}^0(x_u^{(k)}) ≡ 1.
2. Update

$$M_{vu}^{t+1}(\cdot) = (1 - \lambda^t)\, M_{vu}^{t}(\cdot) + \lambda^t\, \tilde\Gamma_{uv}\big(\cdot, x_v^{(J_{vu}^t)}\big).$$
Now suppose that k_j = K for some fixed K ∈ ℕ. The computational complexity of this algorithm is O(d) operations per edge per round.
The number λᵗ is chosen as λᵗ = 1/(1 + t). It has to satisfy:
1. λᵗ → 0 as t → +∞,
2. ∑_{t=1}^{∞} λᵗ = +∞, to ensure `infinite travel'.
Application to Image Restoration This algorithm is presented in [101], where results on convergence are established. It is applied to image processing and computer vision: a 200 × 200 image (40000 pixels), with K = 256 grey-scale levels.
The model is the Potts model: it is assumed that the state space for each variable is Xj = {1, . . . , K}
and
$$\psi_{uv}(i,j) = \begin{cases} 1 & i = j \\ \gamma & i \neq j \end{cases}$$
$$\begin{cases} \beta_{uv}(j) = \psi_u(j)\,(1 + (K-1)\gamma) \\[4pt] \Gamma_{uv}(i,j) = \begin{cases} \dfrac{1}{1 + (K-1)\gamma} & i = j \\[6pt] \dfrac{\gamma}{1 + (K-1)\gamma} & i \neq j. \end{cases} \end{cases}$$
For the application to image processing, the lattice is used: the edge set joins each pixel to its horizontally and vertically adjacent pixels. The parameter in the Potts model is γ = 0.05; this is a smoothing parameter. A picture of the moon is taken, which is then contaminated by adding i.i.d. N(0, 0.1²) variables to each pixel. The algorithm is then run, where evidence is entered on the singleton potentials:
$$\psi_j(x) \leftarrow \begin{cases} 1 & \text{intensity} = x \\ 0 & \text{otherwise.} \end{cases}$$
This is slightly different from the earlier discussion of the sum-product algorithm; the single variable potentials ψ_j represent the raw data, while the edge potentials ψ_{jk} represent smoothing.
The propagation algorithm is applied and the output is the most likely value for each pixel.
The experiments indicate that the Stochastic Probability Update gives good results.
Notes The sum product algorithm is due to N. Wiberg (1996) [145], and was developed further, with applications to Bayesian networks, by F.R. Kschischang, B.J. Frey and H.-A. Loeliger (2001) [78] and S.M. Aji and R.J. McEliece (2000) [1]. The stochastic update algorithm and its application to image processing were introduced by N. Noorshams and M.J. Wainwright [101] (2013).
20.5 Exercise
Consider the directed acyclic graph below.
[Figure: a DAG with nodes B, E, A and R, and arrows B → A, E → A and E → R.]
P_{R∣E}:
  R \ E    0       1
  0        0.99    0.05
  1        0.01    0.95

P_{A∣B,E}(0∣·,·):
  E \ B    0       1
  0        0.97    0.05
  1        0.05    0.02
Assume that the joint distribution P_{A,B,E,R} factorises recursively according to the Bayesian network shown in the figure. Using the sum-product algorithm, compute the conditional probability P_{B∣A}(⋅∣1).
20.6 Answer
The computation of PB∣A (1∣1) is given. The key point is that when hard evidence A = 1 is received,
this is accommodated by considering XA = {1} and only considering a = 1. When this is done, the
termination at variable B will give the function PB,A (., 1); this has to be normalised appropriately to
give the conditional probability.
The factor graph is given in Figure 20.9.
[Figure 20.9: the factor graph, with function nodes p_B, p_E, p_{A∣B,E} and p_{R∣E} and variable nodes A, B, E and R.]
A is observed to be 1, so

µ_{A→p_{A∣B,E}}(1) = 1 ∀(b, e),   µ_{p_E→E} = P_E,   µ_{R→p_{R∣E}} = (1, 1).

The message µ_{p_{A∣B,E}→A} is not needed, so we do not compute it; neither is µ_{E→p_{R∣E}} nor µ_{p_{R∣E}→R}.
The message µ_{B→p_B} is not needed, because we are interested in the variable B and we need the product of the messages from the function nodes to the variable B.
Finally,

$$(P_{B \mid A}(0 \mid 1), P_{B \mid A}(1 \mid 1)) = \beta\,\big(\mu_{p_B \to B}(0)\,\mu_{p_{A \mid B,E} \to B}(0),\ \mu_{p_B \to B}(1)\,\mu_{p_{A \mid B,E} \to B}(1)\big) = \frac{1}{0.968469627}\,(0.037203936,\ 0.93165691).$$
Chapter 21
Graphical Models and Exponential Families
This chapter deals with multivariate distributions which fall within the framework of exponential families. The dependence structure is expressed as a graphical model. For an exponential family of full rank, there is a one-to-one mapping between the canonical parameters and the mean value parameters. We discuss conjugate duality and the Fenchel-Legendre transform between the log-partition function A(θ), θ ∈ Θ (the canonical parameter space), and A*(µ), µ ∈ M, where M is the mean-value parameter space and µ denotes the mean value vector of the sufficient statistic vector. The Kullback-Leibler divergence has a particularly convenient form for exponential families; we discuss the primal, dual and mixed forms in terms of the canonical and mean value parametrisations. We consider mean field approximations, to obtain a mean field lower bound for A(θ).
The parameters in the vector θ are known as the canonical parameters or exponential parameters. Attention will be restricted to distributions where ∣I∣ = p < +∞; namely, I has a finite number, p, of elements.
Since ∑x∈X PX(x∣θ) = 1 for discrete variables and ∫X πX(x∣θ)dx = 1 for continuous variables, it follows that the quantity A, known as the log partition function, is given by the expression

A(θ) = log ∑x∈X h(x) exp{⟨θ, Φ(x)⟩},

with the sum replaced by an integral in the continuous case.
Set

P(x; θ) = PX(x∣θ)/h(x). (21.1)
With the set of functions Φ fixed, each parameter vector θ indexes a particular probability function PX(.∣θ) belonging to the family. The exponential parameters of interest belong to the parameter space, which is the set

Θ := {θ ∈ ℝ^p ∣ A(θ) < +∞}. (21.2)
Definition 21.2 (Regular Families). An exponential family for which the domain Θ of Equation (21.2) is an open set is known as a regular family.
Definition 21.3 (Minimal Representation). An exponential family, defined using a collection of functions Φ for which there is no linear combination ⟨a, Φ(x)⟩ = ∑α∈I aα ϕα(x) equal to a constant, is known as a minimal representation.
For a minimal representation, there is a unique parameter vector θ associated with each distribution. When the representation is over-complete, there exists an affine subset of parameter vectors θ, each associated with the same distribution.
Recall the definition of sufficiency, given in Definition 12.12. The following lemma is crucial. Its proof is left as an exercise.
Lemma 21.5. Let X = (X1, . . . , Xd) be a random vector with joint probability function

PX(x∣θ) = h(x) exp{⟨θ, Φ(x)⟩ − A(θ)};

then Φ(X), which will be denoted Φ, is a sufficient statistic for θ. If the representation is minimal, then Φ(X) is a minimal sufficient statistic for θ.
Bernoulli Consider the random variable X, taking values 0 or 1, with probability function PX(1) = p, PX(0) = 1 − p. This may be written as

PX(x) = { p^x (1 − p)^(1−x),  x ∈ {0, 1}
        { 0,                  other x.
Then

PX(x) = exp{x log(p/(1 − p)) + log(1 − p)}
      = exp{xθ + log(1 − p)}
      = exp{xθ − log(1 + e^θ)},
In the language of exponential families, X = {0, 1}, Φ = {ϕ} where ϕ(x) = x, and h(0) = h(1) = 1. In other words, θ = log(p/(1 − p)), which gives A(θ) = log(1 + e^θ), so that p = e^θ/(1 + e^θ).
Gaussian Recall that the one dimensional Gaussian density is of the form

π(x∣µ, σ) = (1/(√(2π) σ)) exp{−(x − µ)²/(2σ²)}.

This may be expressed in terms of an exponential family as follows: X = ℝ, h(x) = 1, Φ = {ϕ1, ϕ2} where ϕ1(x) = x and ϕ2(x) = −x².
It follows that

1 = e^(−A(θ)) ∫_{−∞}^{∞} e^(θ1 x − θ2 x²) dx,

so that

A(θ) = (1/2) log π − (1/2) log θ2 + θ1²/(4θ2),
and the parameter space is Θ = ℝ × (0, +∞). The correspondence with (µ, σ) is

θ1 = µ/σ²,    θ2 = 1/(2σ²).
Poisson Recall that the probability function for a Poisson distribution with parameter µ is given by

P(x∣µ) = (µ^x/x!) e^(−µ),    x = 0, 1, 2, . . .

This is an exponential family with h(x) = 1/x! and θ = log µ, so that P(x∣µ) = P(x; θ)h(x), where

P(x; θ) = e^(xθ − e^θ).
Beta Recall that the probability density function for a Beta distribution is given by

π(x∣α, β) = { (Γ(α + β)/(Γ(α)Γ(β))) x^(α−1) (1 − x)^(β−1),  x ∈ [0, 1]
            { 0,                                            other x.

This is an exponential family, with X = (0, 1), h ≡ 1, α − 1 = θ1, β − 1 = θ2, Φ = {ϕ1, ϕ2} where ϕ1(x) = log x, ϕ2(x) = log(1 − x). Then

A(θ) = log Γ(θ1 + 1) + log Γ(θ2 + 1) − log Γ(θ1 + θ2 + 2).
For each Xj ∈ V, j = 1, . . . , d, the random variable Xj takes values 0 or 1, each with probability 1/2. For any two components Xs and Xt of the random vector X, component Xs has a direct causal effect on Xt only if (Xs, Xt) ∈ D.
The notation will be simplified in the following way: V and D will be used to denote the sets of nodes (variables) and directed edges respectively; the same notation will also be used to denote the indexing sets of nodes and directed edges. In other words, the notations V = {1, . . . , d} and D = {(s, t) ∣ (Xs, Xt) ∈ D} will also be used; the meaning will be clear from the context. The probability distribution over the possible configurations is modelled by an exponential family with probability function PX(.∣θ) of the form
PX(x∣θ) = exp{ ∑_{s=1}^{d} θs xs + ∑_{(s,t)∈D} θ(s,t) xs xt − A(θ) }.
Letting Pai denote the parent set of node Xi and πi(x) the instantiation of Pai corresponding to the instantiation {X = x}, this may be rewritten as

PX(x∣θ) = ∏_{i=1}^{d} PXi∣Pai(xi∣πi(x), θ),
where (clearly) each conditional probability PXi∣Pai(xi∣πi(x), θ) is obtained by normalising exp{θi xi + ∑_{s∈Pai} θ(s,i) xs xi} over xi ∈ {0, 1}.
This model may be generalised. For example, one may consider higher order interactions. To include coupling of triples (Xs, Xt, Xu), one would add a monomial xs xt xu with corresponding exponential parameter θ(s,t,u). More generally, a set C of indices of interacting variables may be considered, giving

PX(x∣θ) = exp{ ∑_{C∈C} θC ∏_{s∈C} xs − A(θ) }.
The QMR-DT (Quick Medical Reference - Decision Theoretic) database is a large scale probabilistic database that is intended to be used as a diagnostic aid in the domain of internal medicine. It is a bipartite graphical model; that is, a graphical model where the nodes are of one of two types. The upper layer of nodes (the parents) represents diseases and the lower layer of nodes represents symptoms. There are approximately 600 disease nodes and 4000 symptom nodes in the database.
A finding, or evidence, is a set of observed symptoms, denoted by a vector of length 4000, each entry being 1 or 0 depending upon whether the symptom is present or absent. This will be denoted f, which is an instantiation of the random vector F. The vector d will be used to represent the diseases; this is considered as an instantiation of the random vector D. Let dj denote component j of vector d and let fj denote component j of vector f. Then, if the occurrences of the various diseases are taken to be independent of each other, the following factorisation holds:

PD(d) = ∏_j PDj(dj).
This may be represented by a noisy-OR model. Let qi0 denote the probability that symptom i is present in the absence of any disease and qij the probability that disease j induces symptom i; then the probability that symptom i is absent, given a vector of diseases d, is

PFi∣D(0∣d) = exp{ −∑_j θij dj − θi0 },

where θij ≡ −log(1 − qij) are the transformed parameters.
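A small sketch of the noisy-OR computation follows; the values of q are made-up illustrations, not taken from QMR-DT.

```python
import numpy as np

q_i0 = 0.01                       # leak: symptom present with no disease (assumed)
q_ij = np.array([0.8, 0.3, 0.5])  # disease j turns the symptom on with prob q_ij (assumed)
d = np.array([1, 0, 1])           # an instantiation of the disease vector

theta_i0 = -np.log(1.0 - q_i0)    # transformed parameters theta = -log(1 - q)
theta_ij = -np.log(1.0 - q_ij)

# P(F_i = 0 | d) in both the product form and the exponential-family form.
p_direct = (1.0 - q_i0) * np.prod((1.0 - q_ij) ** d)
p_expfam = np.exp(-(theta_ij @ d) - theta_i0)
assert np.isclose(p_direct, p_expfam)
```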
Differentiating the log partition function gives

∂A(θ)/∂θα = ∑_{x∈X} e^(⟨θ,Φ(x)⟩−A(θ)) ϕα(x) h(x) = Eθ[ϕα(X)], (21.4)

∂²A(θ)/∂θα∂θβ = Eθ[ϕα(X)ϕβ(X)] − Eθ[ϕα(X)]Eθ[ϕβ(X)] = Covθ(ϕα(X), ϕβ(X)).
It is easy to show, and a standard fact, that any covariance matrix is non-negative definite. It now follows that A is a convex function on Θ.
Mapping to Mean Parameters Given a vector of functions Φ, set F(θ) = Eθ[Φ(X)] and let M = F(Θ). For an arbitrary exponential family defined by the functions Φ, define the mapping

Λ(θ) := Eθ[Φ(X)].

To each θ ∈ Θ, the mapping Λ associates a vector of mean parameters µ = Λ(θ) belonging to the set M. Note that, by Equation (21.4),

Λ(θ) = ∇A(θ).
The mapping Λ is one to one, and hence invertible on its image, when the representation is minimal.
The image of Θ is the interior of M.
Consider a Bernoulli random variable X with state space {0, 1}; that is, PX(0) = 1 − p and PX(1) = p. Now consider an overcomplete exponential representation, with sufficient statistics ϕ0(x) = 1{0}(x) and ϕ1(x) = 1{1}(x), so that A(θ0, θ1) = log(e^θ0 + e^θ1) and

∂A(θ)/∂θ0 = e^(θ0 − A(θ0,θ1)) = 1 − p = µ0,
∂A(θ)/∂θ1 = e^(θ1 − A(θ0,θ1)) = p = µ1.
The set M of mean parameters is the simplex {(µ0, µ1) ∈ ℝ+ × ℝ+ ∣ µ0 + µ1 = 1}. For any fixed µ = (µ0, µ1) where µ0 ≥ 0, µ1 ≥ 0, µ0 + µ1 = 1, the inverse image is

Λ^(−1)(µ) = {(θ0, θ1) ∈ ℝ² ∣ e^θ0/(e^θ0 + e^θ1) = µ0} = {(θ0, θ1) ∈ ℝ² ∣ θ1 − θ0 = log(µ1/µ0)}.
In an over-parametrised, or over-complete, representation there is no longer a bijection between Θ and Λ(Θ). Instead, there is a bijection between elements of Λ(Θ) and affine subsets of Θ. A pair (θ, µ) is said to be dually coupled if µ = Λ(θ), and hence θ ∈ Λ^(−1)(µ).
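The identity Λ(θ) = ∇A(θ) can be checked numerically. The sketch below does so for the minimal Bernoulli representation, where A(θ) = log(1 + e^θ); the tolerance and test point are arbitrary choices.

```python
import numpy as np

# Finite-difference check that grad A(theta) equals the mean parameter p.
A = lambda t: np.log1p(np.exp(t))
theta, h = 0.7, 1e-6
grad_A = (A(theta + h) - A(theta - h)) / (2 * h)  # numerical derivative of A
mean_param = np.exp(theta) / (1 + np.exp(theta))  # E_theta[X] = p
assert abs(grad_A - mean_param) < 1e-8
```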
The choice of notation is deliberately suggestive; the variables in the Fenchel-Legendre dual turn out to have an interpretation as the mean parameters. Recall the definition of P given by Equation (21.1); namely, if PX(x∣θ) is the probability function (or density function), then

P(x; θ) = PX(x∣θ)/h(x).
Definition 21.9 (Boltzmann-Shannon Entropy). The Boltzmann-Shannon entropy of PX(x∣θ) with respect to h is defined as

−H(PX(x∣θ)) = Eθ[log P(X; θ)] = Eθ[⟨θ, Φ(X)⟩] − A(θ) = ⟨θ, µ⟩ − A(θ). (21.7)
Let θ(µ) denote a value of θ that maximises F(µ, θ), if such a value exists in Θ. The result follows directly by using the definition given by Equation (21.5) together with Equation (21.7). Otherwise, let θ^(n)(µ) denote a sequence such that lim_{n→+∞} F(µ, θ^(n)(µ)) = A*(µ). The first statement of the theorem follows directly.
For the second part, choose θ ∈ Θ and set µ(θ) = ∇θ A(θ). By definition of M, note that µ(θ) ∈ M. Since A is convex, it follows that µ(θ) maximises ⟨θ, µ⟩ − A*(µ) over µ ∈ M, so that A(θ) = ⟨θ, µ(θ)⟩ − A*(µ(θ)). From this,

A(θ) = sup_{µ∈M} {⟨θ, µ⟩ − A*(µ)}. (21.6)
Examples The conjugate dual pair (A, A∗ ) is now computed for several examples of exponential
families.
Bernoulli For the Bernoulli family, the supremum defining A*(µ) is attained where

µ = e^θ(µ)/(1 + e^θ(µ)).

It follows that

e^θ(µ) = µ/(1 − µ),

so that

A*(µ) = µ log µ − µ log(1 − µ) − log(1 + µ/(1 − µ)),

which gives

A*(µ) = µ log µ + (1 − µ) log(1 − µ).
Gaussian Recall that

A(θ) = (1/2) log π − (1/2) log θ2 + θ1²/(4θ2),

so that

A*(µ) = sup_{θ∈Θ} {θ1 µ1 + θ2 µ2 − (1/2) log π + (1/2) log θ2 − θ1²/(4θ2)}.
This is maximised when

µ1 − θ1(µ)/(2θ2(µ)) = 0,
µ2 + 1/(2θ2(µ)) + θ1²(µ)/(4θ2²(µ)) = 0,

which gives

θ2(µ1, µ2) = −1/(2(µ1² + µ2)),
θ1(µ) = −µ1/(µ1² + µ2),

and

A*(µ1, µ2) = −1/2 − (1/2) log π − (1/2) log(−2(µ1² + µ2)).
Note that M = {(µ1, µ2) ∈ ℝ² ∣ µ1² + µ2 < 0}.
Exponential Distribution Recall that Θ = (0, +∞) and that A(θ) = − log(θ). By a straightforward
computation,
A∗ (µ) = −1 − log(−µ)
and
M = (−∞, 0).
Poisson Distribution Recall that Θ = R and that A(θ) = exp{θ}. It is a straightforward computa-
tion to see that
A∗ (µ) = µ log µ − µ
and that
M = (0, +∞).
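As a numerical sanity check of the Poisson conjugate dual, the following sketch maximises θµ − e^θ over a grid and compares with µ log µ − µ; the grid and the value of µ are arbitrary choices.

```python
import numpy as np

# A*(mu) = sup_theta { theta * mu - exp(theta) } should equal mu log mu - mu.
mu = 2.5
thetas = np.linspace(-10, 10, 200001)          # fine grid; supremum is at theta = log mu
sup_val = np.max(thetas * mu - np.exp(thetas))
assert abs(sup_val - (mu * np.log(mu) - mu)) < 1e-6
```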
The Kullback-Leibler divergence between two probability vectors q = (q1, . . . , qM) and p = (p1, . . . , pM) is defined as

DKL(q∣p) = ∑_{j=1}^{M} qj log(qj/pj).

Equivalently,

DKL(q∣p) = Eq[log(q(X)/p(X))], (21.8)

where X is a random vector with state space X = (x1, . . . , xM) and Eq is expectation with respect to the measure such that qj = P(X = xj). The definition of the Kullback-Leibler divergence may be extended to continuous distributions using Equation (21.8), where q and p denote the respective density functions. In this case, Equation (21.8) is taken as

DKL(q∣p) = ∫_{ℝ^d} q(x) log(q(x)/p(x)) dx.
When q and p are members of the same exponential family, the Kullback-Leibler distance may be computed in terms of the parameters. The key result, for expressing the distance in terms of the partition function, is Fenchel's inequality,

A(θ) + A*(µ) ≥ ⟨θ, µ⟩, (21.9)

which can be seen directly from the definition of A*(µ), with equality if and only if µ = Λ(θ) and θ ∈ Λ^(−1)(µ). That is, for µ = Λ(θ) and θ ∈ Λ^(−1)(µ),

A(θ) + A*(µ) = ⟨θ, µ⟩. (21.10)
Consider an exponential family of distributions, and consider two exponential parameter vectors, θ1 ∈ Θ and θ2 ∈ Θ. When distributions are from the same exponential family, the notation D(θ1∣θ2) is used to denote DKL(p(.∣θ1)∣p(.∣θ2)). Set µi = Λ(θi). Using the parameter to denote the distribution with respect to which the expectation is taken, note that

D(θ1∣θ2) = Eθ1[log(P(X∣θ1)/P(X∣θ2))] = A(θ2) − A(θ1) − ⟨µ1, θ2 − θ1⟩. (21.11)

The representation of the Kullback-Leibler divergence given in Equation (21.11) is known as the primal form of the KL divergence.
Taking µ1 = Λ(θ1) and applying Equation (21.10), the Kullback-Leibler distance may also be written

D(θ1∣θ2) ≡ D̃(µ1∣θ2) = A(θ2) + A*(µ1) − ⟨µ1, θ2⟩. (21.12)

The representation given in Equation (21.12) is known as the mixed form of the KL divergence. Recall the definition of A* given by Equation (21.5):

A*(µ) = sup_{θ∈Θ} {⟨µ, θ⟩ − A(θ)}.

It follows that inf_{µ∈M} D̃(µ∣θ) = 0. Finally, taking µ2 = Λ(θ2) and applying Equation (21.10) once again to Equation (21.12) yields the so-called dual form of the KL divergence:

D̃̃(µ1∣µ2) ≡ D(θ1∣θ2) = A*(µ1) − A*(µ2) − ⟨θ2, µ1 − µ2⟩. (21.13)
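The agreement of the primal, mixed and dual forms can be verified numerically; the sketch below does this for the minimal Bernoulli family, with arbitrary test parameters.

```python
import numpy as np

# For the minimal Bernoulli family: A(theta) = log(1 + e^theta),
# A*(mu) = mu log mu + (1 - mu) log(1 - mu), Lambda(theta) = grad A(theta).
A   = lambda t: np.log1p(np.exp(t))
As  = lambda m: m * np.log(m) + (1 - m) * np.log(1 - m)
Lam = lambda t: np.exp(t) / (1 + np.exp(t))

t1, t2 = 0.3, -1.1
m1, m2 = Lam(t1), Lam(t2)
primal = A(t2) - A(t1) - m1 * (t2 - t1)            # Equation (21.11)
mixed  = A(t2) + As(m1) - m1 * t2                  # Equation (21.12)
dual   = As(m1) - As(m2) - t2 * (m1 - m2)          # Equation (21.13)
direct = m1 * np.log(m1 / m2) + (1 - m1) * np.log((1 - m1) / (1 - m2))
assert np.allclose([primal, mixed, dual], direct)
```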
Mean field theory techniques are discussed and it is shown how they may be used to obtain estimates of the log partition function A(θ). This is equivalent to the problem of finding an appropriate normalising constant to make a function into a probability density, a problem that often arises when updating using Bayes' rule.
Mean field theory is based on the variational principle of Equation (21.6). The two fundamental difficulties associated with the variational problem are the nature of the constraint set M and the lack of an explicit form for the dual function A*. Mean field theory entails limiting the optimisation to a subset of distributions for which A* is relatively easy to characterise.
More specifically, the discussion of this chapter is restricted to the case where the functions ϕα are either linear or quadratic. The problem therefore reduces to considering a graph G = (V, U), where the node set V denotes the variables and the edge set U denotes a direct association between the variables. For this discussion, the edges in U are assumed to be undirected. As usual, V and U denote the node (variable) and undirected edge sets; the same notation is used for the indexing sets. That is, with minor abuse of notation (clear from the context), V and U are also used to mean V = {1, . . . , d} and U = {⟨s, t⟩ ∣ ⟨Xs, Xt⟩ ∈ U}. Specifically, the probability distributions under consideration are of the form
PX(x∣θ) = exp{ ∑_{s∈V} θs xs + ∑_{⟨s,t⟩∈U} θ⟨s,t⟩ xs xt − A(θ) }.
Let H denote a sub-graph of G over which it is feasible to perform exact calculations. In an exponential formulation, the set of all distributions that respect the structure of H can be represented by a linear subspace of the exponential parameters. Let I(H) denote the subset of indices associated with cliques in H. Then the set of exponential parameters corresponding to distributions structured according to H is given by

E(H) := {θ ∈ Θ ∣ θα = 0, α ∈ I∖I(H)}.

The simplest example is to consider the completely disconnected graph H = (V, ∅). Then

E(H) = {θ ∈ Θ ∣ θ⟨s,t⟩ = 0 for all ⟨s, t⟩ ∈ U}.
Optimisation and Lower Bounds Let PX(x∣θ) denote the target distribution that is to be approximated. The basis of the mean field approximation is the following: any valid mean parameter specifies a lower bound on the log partition function, established using Jensen's inequality.
Proof The proof is given for discrete variables; the proof for continuous variables is exactly the same,
replacing the sum with an integral.
The inequality (a) follows from Jensen's inequality; the last line follows from Theorem 21.10.
There are difficulties in computing the lower bound in cases where there is not an explicit form for A*(µ). The mean field approach circumvents this difficulty by restricting the optimisation to M(G; H), the subset of mean parameters realised by distributions which respect the structure of H. Let µ(n) denote a sequence such that, for each n, µ(n) ∈ M(G, H), and such that µ(n) → µ as n → +∞.
Note that µ ∈ M(G; H). Since θ ∈ Θ, it follows that µ ∈ M. The distribution associated with µ minimises the Kullback-Leibler divergence between the approximating distribution and the target distribution, subject to the constraint that µ ∈ M(G; H). Recall the mixed form of the Kullback-Leibler divergence, Equation (21.12):

D̃(µ∣θ) = A(θ) + A*(µ) − ⟨µ, θ⟩.
Naive Mean Field Updates In the naive mean field approach, a fully factorised distribution is chosen. This is equivalent to the approximation obtained by taking an empty edge set to approximate the original distribution. The naive mean field updates are a set of recursions for finding a stationary point of the resulting optimisation problem.
Let X = (X1, . . . , Xd) be a random vector with state space X = {0, 1}^d (d binary variables). Suppose that the distribution may be factorised along an undirected graph G = (V, U). The probability function is given by

PX(x∣θ) = exp{ ∑_{j=1}^{d} θj xj + ∑_{⟨i,j⟩∈U} θ⟨i,j⟩ xi xj − A(θ) }.
The naive mean field approach involves considering the graph with no edges. In this restricted class,

PX(x∣θ^(H)) = exp{ ∑_{j=1}^{d} θj xj − A(θ^(H)) },

where θ^(H) is the collection of parameters θs^(H) = θs, s = 1, . . . , d, and θ^(H)(s, t) ≡ 0. Note that

µs = Eθ(H)[Xs] and µ(s,t) = Eθ(H)[Xs Xt] = µs µt.
With the restriction to product form distributions, (Xs)_{s=1}^{d} are independent Bernoulli variables and hence

A*H(µ) = ∑_{s=1}^{d} {µs log µs + (1 − µs) log(1 − µs)}.

Set

F(µ; θ) = ∑_{s=1}^{d} θs µs + ∑_{⟨s,t⟩∈U} θ⟨s,t⟩ µs µt − ∑_{s=1}^{d} (µs log µs + (1 − µs) log(1 − µs)).
Note that, in each coordinate µs, the function F is strictly concave. It is easy to see that the maximum is attained when, for all 1 ≤ s ≤ d, (µt)_{t=1}^{d} satisfies
θs + ∑_{t:⟨s,t⟩∈U} θ⟨s,t⟩ µt − log(µs/(1 − µs)) = 0,

or

log(µs/(1 − µs)) = θs + ∑_{t∈N(s)} θ⟨s,t⟩ µt.
Note that if log(y/(1 − y)) = x, then y = σ(x), where

σ(x) = 1/(1 + e^(−x)).
The algorithm then proceeds by setting

µs^(j+1) = σ( θs + ∑_{t∈N(s)} θ⟨s,t⟩ µt^(j) ).
As discussed in [72] (page 222), the lower bound thus computed seems to provide a good approximation
to the true value.
Notes
The material for Chapter 21 is taken mostly from Wainwright and Jordan [142]. It is developed further in [72]. Possible improvements to the lower bound are proposed by Humphreys and Titterington in [66]. The book by Barndorff-Nielsen [3] is the standard treatise on exponential families and the required convex analysis.
21.8 Exercises: Graphical Models and Exponential Families
1. Prove Lemma 21.5.
2. Consider the multinomial distribution

p(x1, x2, x3∣η) = (n!/(x1! x2! x3!)) ∏_{i=1}^{3} pi^{xi},    x1 + x2 + x3 = n,
where θ = {(θ(j))_{j=1}^{n}, (θ(j, k)) : (j, k) ∈ E}, E denotes the edge set and x ∈ {0, 1}^n. Let q denote the probability function

qX(x∣θ) = exp{ ∑_{j=1}^{n} θ(j)x(j) − AH(θ) }.
Let

A*H(µ) = sup_θ {⟨µ, θ⟩ − AH(θ)}.
log(µ(1)/(1 − µ(1))) = θ(1) + θ(1, 2)µ(2) + θ(1, 3)µ(3),
log(µ(2)/(1 − µ(2))) = θ(2) + θ(1, 2)µ(1),
log(µ(3)/(1 − µ(3))) = θ(3) + θ(1, 3)µ(1).
(d) Write Matlab code to compute numerical approximations to the values (µ(1), µ(2), µ(3)) that give the naive mean field approximation to the log partition function A(θ); a sketch is given below.
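A sketch of such a computation, written in Python rather than Matlab, follows; the θ values are arbitrary test inputs.

```python
import numpy as np

theta = np.array([0.5, -0.2, 0.1])   # test values for theta(1), theta(2), theta(3)
theta12, theta13 = 0.3, -0.4         # test values for theta(1,2), theta(1,3)
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

# Fixed-point iteration of the three equations above.
mu = np.full(3, 0.5)
for _ in range(500):
    mu[0] = sigma(theta[0] + theta12 * mu[1] + theta13 * mu[2])
    mu[1] = sigma(theta[1] + theta12 * mu[0])
    mu[2] = sigma(theta[2] + theta13 * mu[0])
print(mu)  # approximate (mu(1), mu(2), mu(3))
```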
Chapter 22
Variational Methods for Parameter Estimation
Recall that a probability distribution which factorises along a junction tree may be written as

P = ∏_{C∈C} PC / ∏_{S∈S} PS,

where C and S denote the collections of cliques and separators.
Consider the multivariate setting. Let x = (x1, . . . , xd) denote an instantiation of the random vector X = (X1, . . . , Xd). For each s ∈ {1, . . . , d}, xs ∈ Xs = (1, . . . , ks), x ∈ X = ×_{s=1}^{d} Xs, xC = {xs : s ∈ C} and xS = {xs : s ∈ S}. For simplicity, the values taken by the variables are denoted by their indices. Set

p(C, xC) := PC(xC), C ∈ C, xC ∈ XC,
p(S, xS) := PS(xS), S ∈ S, xS ∈ XS.
The empirical estimates based on n observations are

p̂C(xC) = (1/n) ∑_{j=1}^{n} 1_{xC}(xj,C),    p̂S(xS) = (1/n) ∑_{j=1}^{n} 1_{xS}(xj,S),

where xj,C denotes the value for clique C of instantiation j, j = 1, . . . , n, and similarly for xj,S. By construction, these are clearly consistent; if S ⊂ C then p̂(S, xS) = ∑_{xC∖xS} p̂(C, xC).
This may be written as an exponential family with over-complete canonical representation:
Factorisation along a Chow-Liu Tree Now suppose that the distribution factorises along a Chow-Liu tree. Equation (22.1) may now be written

P(x) = exp{ ∑_{s=1}^{d} θ(s; xs) + ∑_{(s,t)∈E} θ(s, t; xs, xt) },
where

θ(s; xs) = log PXs(xs),    θ(s, t; xs, xt) = log( PXs,Xt(xs, xt) / (PXs(xs)PXt(xt)) ),

and E denotes the edge set of the graph. The maximum likelihood estimates of the parameters are given by replacing the probabilities with their empirical counterparts:

θ̂(s; xs) = log p̂Xs(xs),    θ̂(s, t; xs, xt) = log( p̂Xs,Xt(xs, xt) / (p̂Xs(xs) p̂Xt(xt)) ).
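A sketch of this plug-in computation for a single tree edge (s, t) follows; the function name and the assumption that both variables take values in {0, . . . , k − 1}, with every cell observed so the logarithms are finite, are illustrative choices.

```python
import numpy as np

def tree_parameters(xs, xt, k):
    """Plug-in MLE of theta(s; .) and theta(s, t; ., .) from paired samples."""
    n = len(xs)
    p_s = np.bincount(xs, minlength=k) / n        # empirical marginal of X_s
    p_t = np.bincount(xt, minlength=k) / n        # empirical marginal of X_t
    p_st = np.zeros((k, k))                       # empirical pairwise marginal
    np.add.at(p_st, (xs, xt), 1.0 / n)
    theta_s = np.log(p_s)                         # theta(s; x_s)
    theta_st = np.log(p_st / np.outer(p_s, p_t))  # theta(s, t; x_s, x_t)
    return theta_s, theta_st
```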
More generally, suppose that

P(x) = ∏_{C∈C} ϕC(xC),

where C denotes the collection of cliques of the undirected graph. The probability distribution may be written as

P(x∣θ) = exp{ ∑_{C∈C} θC(xC) − A(θ) }.
The scaled log likelihood of the n observations is then

L(θ) = (1/n) ∑_{j=1}^{n} ( ∑_{C∈C} θC(xj,C) − A(θ) ) = ∑_{C∈C} ∑_{xC∈XC} θC(xC) p̂C(xC) − A(θ). (22.2)
Then

(∂/∂θC(xC)) L(θ) = p̂C(xC) − (∂/∂θC(xC)) A(θ) = p̂C(xC) − pC(xC). (22.3)

Here, the fact that the family is in its canonical exponential form has been used, so that (∂/∂θC(xC)) A(θ) = Eθ[p̂C(xC)] = pC(xC) (from (22.2), p̂C(xC) is the sufficient statistic).
The aim is to find the MLE (where (∂/∂θC(xC)) L(θ) = 0). The iterative proportional fitting scheme proceeds as follows: at step t a clique C′ is chosen according to a schedule and the update

θC′^(t+1)(xC′) = θC′^(t)(xC′) + log( p̂C′(xC′) / pC′^(t)(xC′) ),    θC^(t+1) = θC^(t) for C ≠ C′,

is applied, where

pC^(t)(xC) := Pθ(t)(XC = xC) ∀xC ∈ XC.
The sequence satisfies two important properties, which are stated as a proposition.

Proposition 22.1. 1. A(θ^(t+1)) = A(θ^(t)).
2. pC′,θ(t+1)(xC′) = p̂C′(xC′).
For the first statement,

A(θ^(t+1)) = log ∑_x exp{ ∑_{C∈C} θC^(t+1)(xC) }
           = log ∑_x exp{ ∑_{C∈C} θC^(t)(xC) + log( p̂C′(xC′) / pC′^(t)(xC′) ) }
           = log ∑_{xC′} ∑_{x∖xC′} exp{ ∑_{C∈C∖C′} θC^(t)(xC) } ( p̂C′(xC′) e^(θC′^(t)(xC′)) / pC′^(t)(xC′) ).

Now use

∑_{x∖xC′} exp{ ∑_{C∈C} θC^(t)(xC) } = e^(A(θ^(t))) pC′^(t)(xC′),

from which

A(θ^(t+1)) = A(θ^(t)) + log ∑_{xC′} p̂C′(xC′) = A(θ^(t)).
For the second statement,

pC′,θ(t+1)(xC′) = ( p̂C′(xC′) / pC′,θ(t)(xC′) ) pC′,θ(t)(xC′) = p̂C′(xC′).
It therefore follows that the IPF algorithm corresponds to a co-ordinate ascent method for maximising
the objective (22.2).
The Schedule Convexity of the log-partition function gives, by standard results, that the IPF algorithm converges. The main issue is efficiency. One approach is the following (a sketch of the basic IPF update is given after the list):

1. Triangulate the graph and construct a junction tree. Fix a schedule for the junction tree.
2. For each node of the junction tree, consider the true model (the sub-graph of cliques and separators of the true model) and use the IPF scheme to update each clique of the triangulated graph according to the schedule.
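The following sketch shows the basic IPF update (not the junction-tree schedule) on a toy model whose cliques are the two single variables of a 2 × 2 table. Each step rescales the current joint so that one clique marginal matches its empirical target, exactly as in the multiplicative update above; the targets are arbitrary test values.

```python
import numpy as np

p = np.full((2, 2), 0.25)              # initial joint P_theta(x0, x1)
target0 = np.array([0.7, 0.3])         # empirical marginal for clique {0}
target1 = np.array([0.4, 0.6])         # empirical marginal for clique {1}

for _ in range(20):                    # cycle through the cliques
    p *= (target0 / p.sum(axis=1))[:, None]  # match the x0 marginal
    p *= (target1 / p.sum(axis=0))[None, :]  # match the x1 marginal

assert np.allclose(p.sum(axis=1), target0)
assert np.allclose(p.sum(axis=0), target1)
```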
The maximum likelihood estimate θ̂ is obtained by maximising the log probability of the observed data y. This is referred to as the incomplete log likelihood in the EM setting. The incomplete log likelihood is given by the integral

L(θ; y) = log ∫_X exp{⟨θ, ϕ(x, y)⟩ − A(θ)} h(x) dx = Ay(θ) − A(θ). (22.4)
For each fixed y, the set My of valid mean parameters is defined as

My = {µ ∈ ℝ^p : µ = Eθ[ϕ(X, y)] for some θ ∈ Θ}.
From (22.5), it follows that Ay(θ) ≥ ⟨µ, θ⟩ − A*y(µ) for any µ. A lower bound for the incomplete log likelihood is therefore

L(θ, y) = Ay(θ) − A(θ) ≥ ⟨µ, θ⟩ − A*y(µ) − A(θ) := L̃(µ, θ).

With this set-up, the EM algorithm is coordinate ascent on the function L̃, which gives a lower bound. The steps of the EM algorithm are:
µy^(t+1) = arg max_{µ∈My} L̃(µ, θ^(t))      (E step)
θ^(t+1) = arg max_{θ∈Θ} L̃(µy^(t+1), θ)     (M step). (22.7)
Note that if L̃ were equal to the log likelihood L, then the E step would be equivalent to finding the expectation µ for parameter vector θ^(t), while the M step would be precisely the problem of finding the maximum likelihood estimator based on the expected sufficient statistics µy^(t+1).
The EM algorithm described can be used to estimate the parameters of a conditional Gaussian model. For example, consider the straightforward setting where Y = (Y1, . . . , Yr) are Gaussian variables, with Y∣{X = j} = Yj for j = 1, . . . , r. Suppose that X, the index of the components, is unobserved. The state space for X is X = {1, . . . , r} and X has a multinomial distribution.
The complete likelihood may be written as

Lθ(x, y) = exp{ ∑_{j=1}^{r} 1{j}(x) (αj + γj y + γ̃j y² − Aj(γj, γ̃j)) − A(α) },

where θ = (α, γ, γ̃); the parameter α ∈ ℝ^r parametrises the multinomial distribution over the hidden variable X and the pair (γj, γ̃j) parametrises the Gaussian distribution of the j-th mixture component. The log-partition function Aj(γj, γ̃j) is for the conditionally Gaussian distribution of Y given X = j, while A(α) = log ∑_{j=1}^{r} exp{αj} normalises the multinomial distribution.
When the complete likelihood is viewed as an exponential family, the sufficient statistics are the collection of triples (1{j}(x), y 1{j}(x), y² 1{j}(x)), j = 1, . . . , r. The conditional distribution of X given Y = y is

p(x∣y, θ) ∝ exp{ ∑_{j=1}^{r} 1{j}(x) (αj + γj y + γ̃j y² − Aj(γj, γ̃j)) }.
It follows that the mean parameter pj∣y = P(X = j∣Y = y) is

pj∣y = exp{αj + γj y + γ̃j y² − Aj(γj, γ̃j)} / ∑_{k=1}^{r} exp{αk + γk y + γ̃k y² − Ak(γk, γ̃k)}.
Similarly, the remaining mean parameters are pj∣y y and pj∣y y². The computation of the mean parameter µy = (pj∣y, pj∣y y, pj∣y y²) corresponds to the E step.
The M step consists of maximising ⟨µy^(t+1), θ⟩ − A(θ) over θ. Some computation shows that this problem takes the form of finding (α, γ, γ̃) ∈ Θ which maximises
∑_{j=1}^{r} ∑_{i=1}^{n} (αj pj∣yi + γj pj∣yi yi + γ̃j pj∣yi yi² − pj∣yi Aj(γj, γ̃j)) − nA(α).
The optimisation therefore decouples into separate maximisation problems: one for the α vector parametrising the mixture weights and one for each of the (γj, γ̃j) pairs specifying the Gaussian components. The solution is:
pj∣α = (1/n) ∑_{i=1}^{n} pj∣yi,
Eγj,γ̃j[Y∣X = j] = ∑_{i=1}^{n} pj∣yi yi / ∑_{i=1}^{n} pj∣yi,
Eγj,γ̃j[Y²∣X = j] = ∑_{i=1}^{n} pj∣yi yi² / ∑_{i=1}^{n} pj∣yi.
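A sketch of the resulting EM iteration for a univariate Gaussian mixture follows; the data and the initial values are test inputs, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])
w = np.array([0.5, 0.5])   # mixture weights p_{j|alpha}
m = np.array([-1.0, 1.0])  # component means
v = np.array([1.0, 1.0])   # component variances

for _ in range(50):
    # E step: responsibilities p_{j|y_i} proportional to w_j * N(y_i; m_j, v_j).
    dens = w * np.exp(-(y[:, None] - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M step: weighted first and second moments, as in the display above.
    w = resp.mean(axis=0)
    m = (resp * y[:, None]).sum(axis=0) / resp.sum(axis=0)
    second = (resp * y[:, None] ** 2).sum(axis=0) / resp.sum(axis=0)
    v = second - m ** 2
print(w, m, v)  # estimated weights, means and variances
```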
If the maximisation in the E step is carried out over a restricted set M of tractable mean parameters, the update becomes

µy^(t+1) = arg max_{µ∈M} {⟨µ, θ^(t)⟩ − A*y(µ)}.

The E step then no longer closes the gap between the incomplete log-likelihood L and the auxiliary function L̃, and there are no longer guarantees that the algorithm goes uphill.
where the function η : ℝ^p → ℝ^p gives some additional flexibility. Assume, furthermore, that the prior distribution over Θ also lies in an exponential family and is of conjugate prior form. This exponential family is specified by the sufficient statistics {η(θ), −A(η(θ))} ∈ ℝ^d × ℝ. The log partition function B(ξ, λ) is defined in the usual way.
log pξ*,λ*(y) = log ∫ ( ∫ p(x, y∣θ) dx ) pξ*,λ*(θ) dθ = log ∫ pξ*,λ*(θ) p(y∣θ) dθ,

so that, for each fixed y, the set My is the set of mean parameters of the form µ = E[ϕ(X, y)].
The variational Bayes algorithm is based on optimising this lower bound using only distributions of product form over (Θ, X). Such an optimisation is referred to as `free form'. Using (22.6), which holds for any µ, the right hand side of Equation (22.9) has a lower bound, optimised by the following iteration:
µ^(t+1) = arg max_{µ∈My} {⟨µ, η^(t)⟩ − A*y(µ)}                                  (VB-E step)
(η^(t+1), A^(t+1)) = arg max_{(η,A)} {⟨µ^(t+1) + ξ*, η⟩ − (1 + λ*)A − B*(η, A)}  (VB-M step). (22.12)
These coordinate-wise optimisations have explicit solutions; in particular, the VB-M step is solved by setting

(ξ^(t+1), λ^(t+1)) = (ξ* + µ^(t+1), λ* + 1).
[1] S.M. Aji and R.J. McEliece [2000] The Generalised Distributive Law IEEE Transactions on Information Theory vol. 46 pp. 325 - 343
[2] S.A. Andersson, D. Madigan, M.D. Perlman and C.M. Triggs [1997] A graphical characterisation of lattice conditional independence models Annals of Mathematics and Artificial Intelligence vol. 21 pp. 27 - 50
[3] O. Barndorff-Nielsen [1978] Information and Exponential Families in Statistical Theory Wiley
[4] Barros, B. [2012] Incremental Learning Algorithms for Financial Data Modelling Master's Thesis,
Linköping University, Department of Mathematics LiTH-MAT-INT-A2012/01SE
[5] Beeri, C.; Fagin, R.; Maier, D.; Yannakakis, M. [1983] On the desirability of acyclic database
schemes J. Assoc. Comput. Mach. 30 pp 479 - 513.
[6] Braunstein, A.; Mézard, M.; Zecchina, R. [2005] An Algorithm for Satisfiability Random Structures and Algorithms, vol. 27, no. 2, pp. 201 - 226
http://dx.doi.org/10.1002/rsa.20057
[7] Braunstein, A.; Zecchina, R. [2004] Survey Propagation as Local Equilibrium Equations Journal
of Statistical Mechanics: Theory and Experiment vol. 2004, no. 6 pp. P06007
https://stacks.iop.org/1742-5468/2004/P06007
[8] Brockwell, P.J.; Davis, R.A. [1991] Time Series: Theory and Methods (second edition) Springer
[9] F. Bromberg, D. Margaritis [2009] Improving the reliability of causal discovery from small data
sets using argumentation Journal of Machine Learning Research vol. 10 pp. 301 - 340
[10] D.T. Brown [1959] A Note on Approximations to Discrete Probability Distributions Information
and Control vol. 2 pp. 386 - 392
[11] Bulashevska, S.; Eils, R. [2005] Inferring genetic regulatory logic from expression data Bioinformatics vol. 21 no. 11 pp. 2706 - 2713
[12] E. Castillo, J.M. Gutiérrez, A.S. Hadi [1996] A New Method for Efficient Symbolic Propagation in Discrete Bayesian Networks Networks vol. 28 no. 1 pp. 31 - 43
[13] E. Castillo, J.M. Gutiérrez, A.S. Hadi [1997] Sensitivity Analysis in Discrete Bayesian Networks
IEEE Transactions on Systems, Man and Cybernetics - Part A: Systems and Humans vol. 27 no.
4
[14] Cayley, A. [1853] Note on a Question in the Theory of Probabilities The London, Edinburgh, and
Dublin Philosophical Magazine and Journal of Science vol. VI. - fourth series July - December,
1853, Taylor and Francis. p. 259
[15] Cayley, A. [1854] On the theory of groups as depending on the symbolic equation θn = 1 Phil. Mag.
vol. 7 no. 4 pp 40 - 47
[16] Cayley, A. [1858] A Memoir on the Theory of Matrices Phil. Trans. of the Royal Soc. of London,
vol 148 p. 24
[17] Cayley, A. [1869] A Memoir on Cubic Surfaces Philosophical Transactions of the Royal Society of London (The Royal Society) vol. 159 pp. 231 - 326
[18] Cayley, A. [1878] Desiderata and suggestions: No. 2. The Theory of groups: graphical representation Amer. J. Math. vol. 1 no. 2 pp. 174 - 176
[19] Cayley, A. [1889] A Theorem on Trees Quarterly Journal of Mathematics vol. 23 pp. 376 - 378
[20] H. Chan, A. Darwiche [2005] A Distance Measure for Bounding Probabilistic Belief Change International Journal of Approximate Reasoning vol. 38 pp. 149 - 174
[21] H. Chan, A. Darwiche [2002] When do Numbers Really Matter? Journal of Artificial Intelligence Research vol. 17 pp. 265 - 287
[22] H. Chan, A. Darwiche [2005] On the Revision of Probabilistic Beliefs Using Uncertain Evidence
Articial Intelligence vol. 163 pp. 67-90
[23] Cheng, J.; Greiner, R.; Kelly, J.; Bell, D.A.; Liu, W. [2002] Learning Bayesian networks from data: An information-theory based approach Artificial Intelligence vol. 137 pp. 43 - 90
[24] D.M. Chickering [1995] A transformational characterization of Bayesian network structures In Hanks, S. and Besnard, P., editors, Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 87 - 98, Morgan Kaufmann
[25] Chickering, D.M. [2002] Optimal structure identification with greedy search Journal of Machine Learning Research vol. 3 pp. 507 - 554
[26] D.M. Chickering, D. Heckerman, C. Meek [2004] Large Sample Learning of Bayesian Networks is NP-Hard Journal of Machine Learning Research vol. 5 pp. 1287 - 1330
[27] Chiquet, J.; Smith, A.; Grasseau, G.; Matias, C.; Ambroise, C. [2009] SIMoNe: Statistical Inference for Modular Networks Bioinformatics vol. 25 no. 3 pp. 417 - 418
[28] C.K. Chow and C.N. Liu [1968] Approximating Discrete Probability Distributions with Dependence
Trees IEEE Transactions on Information Theory, vol. IT - 14 no. 3
[29] Claeskens, G.; Hjort, N.L. [2008] Model selection and model averaging Cambridge University Press, Cambridge
[30] G.F. Cooper [1990] The Computational Complexity of Probabilistic Inference using Bayesian Belief Networks Artificial Intelligence vol. 42 pp. 393 - 405
[31] G.F. Cooper and E. Herskovitz [1992] A Bayesian Method for the Induction of Probabilistic Networks from Data Machine Learning vol. 9 pp. 309 - 347
[32] R.G. Cowell, A.P. Dawid, S.L. Lauritzen and D.J. Spiegelhalter [1999] Probabilistic Networks and Expert Systems Springer, New York
[33] A.P. Dawid [1992] Applications of a General Propagation Algorithm for Probabilistic Expert Sys-
tems Statistics and Computing vol. 2 pp. 25 - 36
[34] Dean, T.; Kanazawa, K. [1989] A Model for Reasoning about Persistence and Causation Compu-
tational Intelligence vol. 5, no. 2, pp.142 - 150.
[35] W.E. Deming and F.F. Stephan [1940] On a Least Squares Adjustment of a Sampled Frequency
Table when the Expected Marginal Totals are Known Annals of Mathematical Statistics vol. 11
pp. 427 - 444
[36] Dempster, A.P.; Laird, N.M.; Rubin, D.B. [1977] Maximum Likelihood from Incomplete Data via the EM Algorithm Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1 - 38
[37] P. Diaconis and S.L. Zabell [1982] Updating Subjective Probability Journal of the American Sta-
tistical Association vol. 77 (380) pp. 822 - 830
[38] J.M. Dickey [1983] Multiple Hypergeometric Functions: Probabilistic Interpretations and Statistical
Uses Journal of the American Statistical Association, 1983, vol. 78 (383) pp. 628 - 637
[39] Drton, M.; Sturmfels, B.; Sullivant, S. [2009] Lectures on algebraic statistics Birkhäuser
[40] D. Edwards [2000] Introduction to Graphical Modelling chapter 9: Causal Inference. Springer
[41] A. Fast [2010] Learning the structure of Bayesian networks with constraint satisfaction Ph.D.
thesis, Graduate School of the University of Massachusetts Amherst, Department of Computer
Science
[42] Fisher, R.A. [1924] The Distribution of the Partial Correlation Coefficient Metron vol. 3 no. 3-4 pp. 329 - 332
[43] D. Freedman and P. Humphreys [1999] Are there Algorithms that Discover Causal Structure? Synthese vol. 121 pp. 29 - 54
[44] Friedman, J.; Hastie, T.; Tibshirani, R. [2010] Regularisation Paths for Generalised Linear Models via Coordinate Descent J. Stat. Softw. vol. 33 no. 1 pp. 1 - 22
[45] Friedman, N.; Nachman, I.; Pe'er, D. [1999] Learning Bayesian network structure from massive datasets: the `sparse candidate' algorithm Proc. Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI '99) pp. 196 - 205
[46] Friedman, N.; Linial, M.; Nachman, I.; Pe'er, D. [2000] Using Bayesian Networks to Analyse
Expression Data Journal of Computational Biology 7 no 3/4 pp 601 - 620
[47] Friedman, N.; Koller, D. [2003] Being Bayesian About Network Structure: A Bayesian Approach to Structure Discovery in Bayesian Networks Machine Learning vol. 50 pp. 95 - 125
[48] Friedman, N. [2004] Inferring Cellular Networks Using Probabilistic Graphical Models Science Vol
303 no 5659 pp 799-805 DOI: 10.1126/science.1094068
[49] Gamerman, D.; Lopes, H.F. [2006] Markov chain Monte Carlo: stochastic simulation for Bayesian
inference Chapman and Hall CRC
[50] Garcia, L.D.; Stillman, M.; Sturmfels, B. [2005] Algebraic geometry of Bayesian networks Journal of Symbolic Computation vol. 39 pp. 331 - 355
[51] D. Geiger, T. Verma and J. Pearl [1990] Identifying Independence in Bayesian Networks Networks
vol. 20 pp. 507 - 534.
[52] Gentry, J.; Long, L.; Gentleman, R.; Seth; Hahne, F.; Sarkar, D.; Hansen, K. [2012] Rgraphviz: provides plotting capabilities for R graph objects. R package version 1.32.0
[53] Giudici, P.; Castelo, R. [2003] Improving Markov chain Monte Carlo Model Search for Data Mining Machine Learning vol. 50 pp. 127 - 158
[54] Goeman, J.J. [2012] penalized R package R package version 0.9-41
[55] M.C. Golumbic [2004] Algorithmic Graph Theory and Perfect Graphs Elsevier
[56] Greenland, S.; Pearl, J.; Robins, J.M. [1999] Causal diagrams for epidemiologic research Epidemi-
ology pp 37 - 48
[57] Greenland, S.; Lash, T. [2008] Bias Analysis in: Modern Epidemiology, 3rd ed., Ed. K Rothman,
S. Greenland and T. Lash, pp 345 - 380. Philadelphia: Lippincott, Williams and Wilkins.
[58] Grzegorczyk, M.; Husmeier, D. [2008] Improving the Structure MCMC Sampler for Bayesian Networks by introducing a New Edge Reversal Move Mach. Learn. vol. 71 pp. 265 - 305
[59] Hartmanis, J. [1959] Application of some Basic Inequalities for Entropy Information and Control vol. 2 pp. 199 - 213
[60] Hastie, T.; Efron, B. [2012] lars: least angle regression, lasso and forward stagewise R package version 1.1
[61] D. Heckerman [1998] A Tutorial on Learning with Bayesian Networks Report # MSR-TR-95-06 Microsoft Research, Redmond, Washington
http://research.microsoft.com/~heckerman/
[62] D. Heckerman, D. Geiger and D.M. Chickering [1995] Learning Bayesian Networks: The Combi-
nation of Knowledge and Statistical Data Machine Learning vol. 20 pp. 197 - 243
[63] Heskes, T.; Albers, C.; Kappen, H.J. [2003] Approximate Inference and Constrained Optimisation in Proc. of the 19th Annual Conference on Uncertainty in Artificial Intelligence (UAI-03) San Francisco, CA: Morgan Kaufmann Publishers, pp. 313 - 320
[64] Huang, Y.; Valtorta, M. [2006] Pearl's Calculus of Intervention is Complete Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence pp. 217 - 224 UAI Press
[65] Huang, Y.; Valtorta, M. [2008] On the Completeness of an Identifiability Algorithm for Semi-Markov Models Ann. Math. Artif. Intell. vol. 54 pp. 363 - 408
[66] K. Humphreys and D.M. Titterington [2000] Improving the Mean - Field Approximation in Be-
lief Networks using Bahadur's Reparameterisation of the Multivariate Binary Distribution Neural
Processing Letters vol. 12 pp. 183 - 197
[67] Højsgaard, S. [2012] Graphical Independence Networks with the gRain Package for R Journal of
Statistical Software, vol. 46 no.10 pp. 1-26.
http://www.jstatsoft.org/v46/i10/
[68] Højsgaard,S.; Edwards, D.; Lauritzen, S. [2012] Graphical Models with R Springer
[69] Ide, J.S.; Cozman, F.G. [2002] Random generation of Bayesian networks In: SBIA '02: Proceedings of the 16th Brazilian symposium on artificial intelligence, Springer, pp. 366 - 375
[70] Jaynes, E.T. [2003] Probability Theory. The Logic of Science Cambridge University Press
[71] R.C. Jeffrey [1965] The Logic of Decision McGraw-Hill, New York (second ed., University of Chicago Press, Chicago, 1983; Paperback correction, 1990)
[72] M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, L.K. Saul [1999] An Introduction to Variational
Methods for Graphical Models Machine Learning vol. 37 pp. 183 - 233
[73] Kellerer, H.G. [1991] Indecomposable marginal problems Advances in probability distributions with given marginals: beyond the copulas, Springer Verlag, Berlin, pp. 139 - 149
[74] H. Kiiveri, T.P. Speed, J.B. Carlin [1984] Recursive Causal Models J. Austral. Math. Soc. (series
A) vol. 36 pp. 30 - 52
[75] M. Koivisto and K. Sood [2004] Exact Bayesian Structure Discovery in Bayesian Networks Journal of Machine Learning Research vol. 5 pp. 549 - 573
[76] Kuipers, J.; Moffa, G. [2015] Partition MCMC for Inference on Acyclic Digraphs preprint: arxiv:1504.05006v1
[77] Kuroki, M.; Pearl, J. [2014] Measurement Bias and Effect Restoration in Causal Inference Biometrika vol. 101 no. 2 pp. 423 - 437
[78] F.R. Kschischang, B.J. Frey, H-A. Loeliger [2001] Factor Graphs and the Sum Product Algorithm IEEE Transactions on Information Theory vol. 47, February, pp. 498 - 519
[79] E. Lazkano, B. Sierra, A. Astigarraga, J.M. Martínez-Otzeta [2007] On the use of Bayesian Networks to Develop Behaviours for Mobile Robots Robotics and Autonomous Systems vol. 55 pp. 253 - 265
[80] S.L. Lauritzen, D.J. Spiegelhalter [1988] Local Computations of Probabilities on Graphical Structures and their Applications to Expert Systems Journal of the Royal Statistical Society B (Methodological) vol. 50 no. 2 pp. 157 - 224
[81] S.L. Lauritzen [1992] Propagation of Probabilities, Means and Variances in Mixed Graphical Association Models Journal of the American Statistical Association vol. 87 no. 420 pp. 1098 - 1108
[82] S. Lauritzen [2001] Causal Inference from Graphical Models in Complex Stochastic Systems pp.
63 - 108, Chapman and Hall
[83] S. Lauritzen and D. Spiegelhalter [1988] Local Computations with Probabilities on Graphical Struc-
tures and their Application to Expert Systems (with discussion) Journal of the Royal Statistical
Society, Series B, vol. 50, pp. 157 - 224
[84] S. Lauritzen [1992] Propagation of Probabilities, Means and Variances in Mixed Graphical Asso-
ciation Models Journal of the American Statistical Association vol. 87 no. 420 pp. 1098 - 1108
[85] Lewis II, P.M. [1959] Approximating Probability Distributions to Reduce Storage Requirements Information and Control vol. 2 pp. 214 - 225
[86] Ma, Z.; Xie, X.; Geng, Z. [2008] Structure Learning of Chain Graphs via Decomposition J. Mach. Learn. Res. vol. 9 pp. 2847 - 2880
[87] Magidson, J. [1977] Toward a Causal Model Approach for Adjusting for Pre-Existing Differences in the Non-Equivalent Control Group Situation: A General Alternative to ANCOVA Eval. Rev. vol. 1 pp. 399 - 420
[88] D. Madigan, S.A. Andersson, M.D. Perlman, C.T. Volinsky [1996] Bayesian Model Averaging and
Model Selection for Markov Equivalence Classes of Acyclic Digraphs Communications In Statistics:
Theory and Methods vol. 25, no. 11 pp. 2493-2519
[89] D. Madigan and J. York [1995] Bayesian Graphical Models for Discrete Data International Statis-
tical Review vol. 63 pp. 215 - 232
[90] Mardia, K.V.; Kent, J.T.; Bibby, J.M. [1979] Multivariate Analysis Academic Press, London
[91] Markowetz, F.; Spang, R. [2007] Inferring Cellular Networks - A Review BMC Bioinformatics vol. 8 (Suppl 6): S5
[92] McEliece, R.J.; MacKay, D.J.C.; Cheng, J.-F. [1998] Turbo Decoding as an Instance of Pearl's
`Belief Propagation' Algorithm IEEE J. Select. Areas Commun. vol. 16 pp. 140 - 152
[93] C. Meek [1995] Causal inference and causal explanation with background knowledge Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence pp. 403 - 410
[94] Minka, T. [2001] Expectation Propagation for Approximate Bayesian Inference in Proc. of the 17th Annual Conf. on Uncertainty in Artificial Intelligence (UAI-01) San Francisco, CA: Morgan Kaufmann Publishers, pp. 362 - 369
[95] Mooij, J.M.; Kappen, H.J. [2007] Sufficient Conditions for Convergence of the Sum-Product Algorithm IEEE Transactions on Information Theory, vol. 53 no. 12, pp. 4422 - 4437
[96] Moore, A.; Wong, W-K. [2003] Optimal Reinsertion: A new search operator for accelerated and
more accurate Bayesian network structure learning Proceedings of the Twentieth International
Conference on Machine Learning (ICML - 2003), Washington DC
[97] Murphy, K.P. [2002] Dynamic Bayesian Networks: Representation, Inference and Learning Uni-
versity of California, Berkeley, Ph.D. thesis (Computer Science)
[98] R. Neal [1992] Connectionist Learning of Belief Networks Artificial Intelligence vol. 56 pp. 71 - 113
[99] R.E. Neapolitan [2004] Learning Bayesian Networks Pearson Prentice Hall, Upper Saddle River,
New Jersey.
[100] Nelson, E. [1987] Radically Elementary Probability Theory Princeton University Press
[101] Noorshams, N; Wainwright, M.J. [2013] Stochastic Belief Propagation: A Low-Complexity Alter-
native to the Sum-Product Algorithm IEEE Transactions on Information Theory vol. 59 no. 4 pp.
1981 - 2000
[102] Opper, M.; Winther, O. [2005] Expectation Consistent Approximate Inference Journal of Machine Learning Research vol. 6 pp. 2177 - 2204
[103] J. Pearl [1982] Reverend Bayes on Inference Engines: A Distributed Hierarchical Approach AAAI
- 82 Proceedings pp. 133 - 136
[104] J. Pearl [1987] Evidential Reasoning Using Stochastic Simulation of Causal Models Artificial Intelligence, vol. 32, pp. 245 - 257
[105] Pearl, J. [1988] Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
Morgan Kaufmann, San Mateo, CA.
[106] J. Pearl [1990] Probabilistic Reasoning in Intelligent Systems 2nd revised printing, Morgan Kaufmann Publishers Inc., San Francisco
[107] J. Pearl [1995] Causal Diagrams for Empirical Research Biometrika vol. 82 pp. 669 - 710
[108] J. Pearl [1995] Causal Inference from Indirect Experiments Artificial Intelligence in Medicine vol. 7 pp. 561 - 582
[109] J. Pearl [2000] Causality Cambridge University Press
[110] Pearl, J.; Dechter, R. [1996] Identifying Independencies in Causal Graphs with Feedback Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence (UAI'96) pp. 420 - 426, Morgan Kaufmann
[111] J. Pearl, D. Geiger and T. Verma [1989] Conditional Independence and its Representations
Kybernetica vol. 25 no. 2 pp. 33 - 44
[112] J. Pearl and T. Verma [1987] The Logic of Representing Dependencies by Directed Acyclic Graphs
Proceedings of the AAAI, Seattle, Washington pp. 374 - 379
[113] Pearl, J. [2010] On Measurement Bias in Causal Inference In: Proc. 20th Conf. Uncertainty in Artificial Intelligence, pp. 425 - 432, Catalina Island
[114] J.M. Pena [2007] Approximate Counting of Graphical Models Via MCMC Proceedings of the 11th Conference in Artificial Intelligence pp. 352 - 359
[115] Pistone, G.; Riccomagno, E.; Wynn, H. [2001] Algebraic Statistics: Computational Commutative
Algebra in Statistics Chapman and Hall, Boca Raton.
[116] M. Ramoni and P. Sebastiani [1997] Parameter Estimation in Bayesian Networks from Incomplete
Databases Knowledge Media Institute, KMI-TR-57
[117] Robins, J.M.; Scheines, R.; Spirtes, P.; Wasserman, L. [2003] Uniform consistency in causal inference Biometrika vol. 90 no. 3 pp. 491 - 515
[118] R.W. Robinson [1977] Counting Unlabelled Acyclic Digraphs Springer Lecture Notes in Mathematics: Combinatorial Mathematics V, C.H.C. Little (ed.) pp. 28 - 43
[119] Rosenbaum, P.; Rubin, D. [1983] The Central Role of the Propensity Score in Observational Studies for Causal Effects Biometrika vol. 70 pp. 41 - 55
[120] Sadeghi, K.; Lauritzen, S. [2012] Markov Properties for Mixed Graphs submitted to Bernoulli, available on arxiv
http://arxiv.org/pdf/1109.5909v2.pdf
[121] L.J. Savage [1966] Foundations of Statistics John Wiley and Sons, New York
[122] R.D. Shachter [1998] Bayes Ball: The Rational Pastime for Determining Irrelevance and Requisite Information in Belief Networks and Influence Diagrams Proceedings of the 14th Annual Conference on Uncertainty in Artificial Intelligence (ed. G.F. Cooper and S. Moral) pp. 480 - 487, Morgan Kaufmann, San Francisco, CA
[123] Schmidt, M.; Niculescu-Mizil, A.; Murphy, K. [2007] Learning graphical model structure using l1-regularization paths Proceedings of the National Conference on Artificial Intelligence vol. 22 no. 2 pp. 12 - 78
[124] Shpitser, I.; Pearl, J. [2006] Identification of Joint Interventional Distributions in Recursive Semi-Markovian Causal Models In: Proceedings of the Twenty-First National Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press. pp. 1219 - 1226
[125] Shpitser, I.; Pearl, J. [2008] Complete Identification Methods for the Causal Hierarchy Journal of Machine Learning Research vol. 9 pp. 1941 - 1979
[126] Spirtes, P.; Glymour, C.; Scheines, R. [1993] Causation, Prediction and Search Lecture Notes in
Statistics no. 81 Springer-Verlag New York
[127] P. Spirtes, C. Glymour and R. Scheines [2000] Causation, Prediction and Search second edition,
The MIT press.
[128] Strotz, R.H.; Wold, H.O.A. [1960] Recursive versus Nonrecursive Systems: An Attempt at Syn-
thesis Econometrica vol. 28 pp. 417-427
[129] M. Studený [2005] Probabilistic Conditional Independence Structures Springer Verlag.
[130] Sturmfels, B. [2002] Solving Systems of Polynomial Equations In: CBMS Lectures Series, Amer-
ican Mathematical Society.
[131] Sun, J.; Zheng, N.-N.; Shum, H.-Y. [2003] Stereo Matching using Belief Propagation IEEE Trans-
actions on Pattern Analysis and Machine Intelligence vol. 25 no. 7 pp. 787 - 800
[132] Tanaka, K. [2002] Statistical-Mechanical Approach to Image Processing Journal of Physics A:
Mathematical and General vol. 35 no. 37 pp. R81 - R150
http://stacks.iop.org/0305-4470/35/R81
[133] Tatikonda, S.C. [2003] Convergence of the Sum-Product Algorithm in Proceedings 2003 IEEE
Information Theory Workshop
[134] Tian, J.; Pearl, J. [2002] A General Identification Condition for Causal Effects Proceedings of the Eighteenth National Conference on Artificial Intelligence, AAAI Press, Menlo Park, California pp. 567 - 573
[135] Tian, J.; Pearl, J. [2002] On the Testable Implications of Causal Models with Hidden Variables in Proceedings of UAI-02, pp. 519 - 527
[136] Tian, J.; Shpitser, I. [2010] On Identifying Causal Effects In: Dechter, R.; Geffner, H.; Halpern, J. eds., Heuristics, Probability and Causality: A Tribute to Judea Pearl UK: College Publications, pp. 415 - 444
[137] I. Tsamardinos, L.E. Brown and C.F. Aliferis [2006] The Max - Min Hill - Climbing Bayesian
Network Structure Learning Algorithm Machine Learning vol. 65 pp. 31 - 78
[138] M. Valtorta, Y.G. Kim, J. Vomlel [2002] Soft Evidential Update for Probabilistic Multiagent
Systems International Journal of Approximate Reasoning vol. 29 no. 1 pp. 71 - 106
[139] Vats, D.; Nowak, R.D. [2014] A Junction Tree Framework for Undirected Graphical Model Selection Journal of Machine Learning Research vol. 15 pp. 147 - 191
[140] T. Verma and J. Pearl [1992] An Algorithm for Deciding if a Set of Observed Independencies has a Causal Explanation in Uncertainty in Artificial Intelligence, Proceedings of the Eighth Conference (D. Dubois, M.P. Wellman, B. D'Ambrosio and P. Smets, eds.) San Francisco: Morgan Kaufmann pp. 323 - 330
[141] Vorobev, N. N. [1962] Consistent families of measures and their extensions Theory of Probability
and its Applications vol. 7 pp 147 - 162
[142] M.J. Wainwright, M.I. Jordan [2003] Graphical Models, Exponential Families and Variational Inference Technical Report 649, Department of Statistics, University of California, Berkeley
[143] Weiss, Y.; Freeman, W.T. [2001] On the Optimality of Solutions of the Max-Product Belief-Propagation Algorithm in Arbitrary Graphs IEEE Transactions on Information Theory vol. 47 no. 2 pp. 736 - 744
[144] Whittaker, J. [1990] Graphical models in applied multivariate statistics Wiley
[145] N. Wiberg [1996] Codes and Decoding on General Graphs Linköping Studies in Science and
Technology. Dissertation 440 Linköpings Universitet, Linköping, 1996
[146] S. Wright [1921] Correlation and Causation Journal of Agricultural Research vol. 20 pp. 557 - 585
[147] Wright, S. [1934] The method of path coecients Ann. Math. Statist. vol 5 pp 161 - 215.
[148] X. Xie and Z. Geng [2008] A recursive method for structural learning of directed acyclic graphs Journal of Machine Learning Research vol. 9 pp. 459 - 483
[149] Yedidia, J.S.; Freeman, W.T.; Weiss, Y. [2005] Constructing Free-Energy Approximations and
Generalised Belief Propagation Algorithms IEEE Transactions on Information Theory, vol. 51 no.
7 pp. 2282-2312
[150] R. Yehezkel and B. Lerner [2009] Bayesian network structure learning by recursive autonomy identification Journal of Machine Learning Research vol. 10 pp. 1527 - 1570
[151] Zhang, J.; Spirtes, P. [2002] Strong faithfulness and uniform consistency in causal inference Proceedings of the nineteenth conference on uncertainty in artificial intelligence pp. 632 - 639, Morgan Kaufmann Publishers Inc.