Introduction to Bayesian Networks

John M. Noble,
Faculty of Mathematics, Informatics and Mechanics,
University of Warsaw,
ul. Banacha 2,
02-097 WARSZAWA, Poland

Timo Koski,
Institutionen för matematik,
Kungliga Tekniska Högskolan,
10044 STOCKHOLM, Sweden
Contents

Introduction 1

1 Conditional Independence and Graphical Models 3


1.1 Notational preliminaries: Graphical and Probabilistic . . . . . . . . . . . . . . . . . . . . . 3
1.2 Conditional Independence and Factorisation . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Directed Acyclic Graphs and Probability Distributions . . . . . . . . . . . . . . . . 9
1.2.2 Connections in a Directed Acyclic Graph and Conditional Independence . . . . . 10
1.2.3 Bayes Ball . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3 D-Separation and Conditional Independence . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4 The Locally Directed Markov Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.5 Quick Medical Reference - Decision Theoretic: An Example . . . . . . . . . . . . . . . . . 18
1.5.1 Propositional Logic and Noisy Logic Gates . . . . . . . . . . . . . . . . . . . . . . . 19
1.5.2 QMR - DT Data Base . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.7 Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2 Markov models and Markov equivalence 29


2.1 I-maps and Markov equivalence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.1.1 Properties of Conditional Expectation and D-Separation . . . . . . . . . . . . . . . 32
2.2 Characterisation of Markov Equivalence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.2.1 Example 2.8 (Hidden Variables) Revisited . . . . . . . . . . . . . . . . . . . . . . . 40
2.3 Markov Equivalence and the Essential Graph . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3 Intervention Calculus 45
3.1 Causal Models and Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Conditioning by Observation and by Intervention . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 The Intervention Calculus for a Bayesian Network . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Causal Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.1 Establishing a Causal Model via a Controlled Experiment . . . . . . . . . . . . . . 51
3.5 Properties of Intervention Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.6 Confounding, The 'Sure Thing' Principle and Simpson's Paradox . . . . . . . . . . . . . . 56
3.6.1 Confounding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.6.2 Simpson's Paradox . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.6.3 The Sure Thing Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.7 Identifiability: Back-Door and Front-Door Criteria . . . . . . . . . . . . . . . . . . . . . . . 59
3.7.1 Back Door Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.7.2 Front Door Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.7.3 Non-Identifiability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.8 Inference Rules for Intervention Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.8.1 Example: Front Door Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.8.2 Causal Inference by Surrogate Experiments . . . . . . . . . . . . . . . . . . . . . . . 77
3.9 Measurement Bias and Effect Restoration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.9.1 The Matrix Adjustment Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.9.2 Effect Restoration Without External Studies . . . . . . . . . . . . . . . . . . . . . . 79
3.10 Identification of Counterfactuals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.10.1 Counterfactual Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.10.2 Joint Counterfactual Probabilities and Intervention . . . . . . . . . . . . . . . . . . 84
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.12 Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

4 The Pioneering Work of Arthur Cayley 93


4.1 Cayley's Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.2 Arthur Cayley and Judea Pearl's intervention calculus . . . . . . . . . . . . . . . . . . . . 97
4.3 Arthur Cayley: algebraic geometry and Bayesian networks . . . . . . . . . . . . . . . . . . 97

5 Moral Graph, Independence Graph, Chain Graphs 99


5.1 The Moral Graph and the Independence Graph . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.2 Chain Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2.2 Factorisation along a Chain Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.2.3 Separation Trees for Chain Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

6 Evidence and Metrics 113


6.1 Probability Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.1.1 Jeffrey's Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.2 Evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.3 Virtual Evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.4 Measures of Divergence between Probability Distributions . . . . . . . . . . . . . . . . . . 120
6.5 The Chan - Darwiche Distance Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.5.1 Soft Evidence and Virtual Evidence . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.7 Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

7 Marginalisation, Triangulated Graphs and Junction Trees 141


7.1 Functions and Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.2 Marginalisation and Graphical Representations . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.3 Decomposable Graphs and Node Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.4 Junction Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
7.5 Perfect Orders of Maximal Cliques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

8 Junction trees and message passing 159


8.1 Factorisation along an Undirected Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.2 Factorising along a Junction Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
8.3 Flow of Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.3.1 First Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.4 Local Computation on Junction Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
8.5 Schedules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
8.6 Local and Global Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
8.7 Using a Junction Tree with Virtual Evidence and Soft Evidence . . . . . . . . . . . . . . . 173
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
8.8 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
8.9 Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

9 Bayesian Networks in R 181


9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
9.2 Graphs in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
9.2.1 Undirected Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
9.2.2 Directed Acyclic Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
9.2.3 Mixed Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
9.3 Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
9.3.1 Specifying the Conditional Probability Potentials . . . . . . . . . . . . . . . . . . . 189
9.3.2 Building the Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
9.3.3 Compilation - Finding the Clique Potentials . . . . . . . . . . . . . . . . . . . . . . 190
9.3.4 Absorbing Evidence and Answering Queries . . . . . . . . . . . . . . . . . . . . . . 192
9.3.5 Building a Network from Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
9.3.6 Simulation using a Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
9.3.7 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
9.3.8 Building a Bayesian Network using bnlearn . . . . . . . . . . . . . . . . . . . . . . 197
9.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200

10 Conditional Gaussian variables 203
10.1 Conditional Gaussian Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
10.1.1 Some Results on Marginalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
10.1.2 CG Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
10.2 The Junction Tree for Conditional Gaussian Distributions . . . . . . . . . . . . . . . . . . 208
10.3 Updating a CG distribution using a Junction Tree . . . . . . . . . . . . . . . . . . . . . . . 211
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
10.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215

11 Gaussian and Conditional Gaussian Graphical Models in R 217


11.1 Undirected Gaussian Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
11.2 Decomposition of UGGMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
11.3 Directed Gaussian Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
11.4 Gaussian Chain Graph Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
11.5 Conditional Gaussian Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224

12 Learning the Conditional Probability Functions 231


12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
12.2 Gaussian and Conditional Gaussian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 231
12.3 Discrete Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
12.4 Maximum Likelihood for Discrete Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
12.4.1 Maximum Likelihood for Multinomial Sampling . . . . . . . . . . . . . . . . . . . . 233
12.4.2 MLE for a Probability Factorised along a DAG . . . . . . . . . . . . . . . . . . . . 236
12.5 The Bayesian Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
12.5.1 Independent Bernoulli trials and the Beta distribution . . . . . . . . . . . . . . . . 238
12.5.2 Multinomial Sampling and the Dirichlet Integral . . . . . . . . . . . . . . . . . . . . 241
12.5.3 Distribution for Conditional Probabilities of a Bayesian network . . . . . . . . . . . 242
12.6 Updating, Missing Data, Fractional Updating . . . . . . . . . . . . . . . . . . . . . . . . . . 244
12.7 Likelihood Function for the Graph Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 246
12.8 Bayesian Sufficient Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
12.9 Prediction Sufficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
12.10 Prediction Sufficiency for a Bayesian Network . . . . . . . . . . . . . . . . . . . . . . . . . . 250
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
12.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
12.12 Short Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256

13 Parameters and Sensitivity 263


13.1 Parameter Changes to Satisfy Query Constraints . . . . . . . . . . . . . . . . . . . . . . . . 263
13.2 Proportional Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
13.2.1 Query Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
13.2.2 Binary Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271

13.3 The Sensitivity of Queries to Parameter Changes . . . . . . . . . . . . . . . . . . . . . . . . 272
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
13.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
13.5 Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279

14 Structure Learning 281


14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
14.2 Distance Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
14.2.1 Structural Hamming Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
14.2.2 Sensitivity and Specificity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
14.2.3 The Kullback Leibler Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
14.3 Search and Score Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
14.3.1 Score Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286
14.3.2 Sparse Candidate Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
14.3.3 Greedy Search and Greedy Equivalence Search . . . . . . . . . . . . . . . . . . . . . 288
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
14.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290

15 Data Storage, Product Approximations, Chow Liu Trees 295


15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
15.2 Product Approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
15.2.1 Existence of Extensions with Given Marginals . . . . . . . . . . . . . . . . . . . . . 295
15.2.2 Dependence Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
15.3 Reverse I -Projection and the Optimal Product Approximation . . . . . . . . . . . . . . . 300
15.4 The Optimal Chow-Liu Product Approximation . . . . . . . . . . . . . . . . . . . . . . . . 301
15.4.1 Chow Liu Tree with known P . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
15.4.2 Chow-Liu Algorithm with Unknown P . . . . . . . . . . . . . . . . . . . . . . . . . . 303
15.4.3 The Log Likelihood Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
15.4.4 The Chow-Liu Algorithm and Polytrees . . . . . . . . . . . . . . . . . . . . . . . . . 306
15.5 Asymptotic Consistency of the Maximum Likelihood Estimate . . . . . . . . . . . . . . . . 307
15.6 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309

16 Constraint-Based Structure Learning Algorithms 311


16.1 Structure Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
16.2 Testing for Conditional Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
16.2.1 Gaussian variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
16.2.2 Discrete Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
16.2.3 Hypothesis Testing and Statistical Theory . . . . . . . . . . . . . . . . . . . . . . . 313
16.3 The K2 Structural Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
16.4 Three phase dependency analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
16.5 Fast Adjacency Search (FAS) algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315

16.6 PC and MMPC Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
16.7 Recursive Autonomy Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
16.8 Incompatible Immoralities: EDGE-OPT Algorithm . . . . . . . . . . . . . . . . . . . . . . 324
16.9 Hybrid Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
16.9.1 The Maximum Minimum Hill Climbing Algorithm . . . . . . . . . . . . . . . . . . 324
16.9.2 L1-Regularisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
16.9.3 Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
16.10 A Junction Tree Framework for Undirected Graphical Model Selection . . . . . . . . . . . 326
16.11 The Xie-Geng Algorithm for Learning a DAG . . . . . . . . . . . . . . . . . . . . . . . . . . 329
16.11.1 Description of the Xie-Geng Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 330
16.11.2 Proofs of Theorems 16.5 and 16.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
16.12 The Ma-Xie-Geng Algorithm for Learning Chain Graphs . . . . . . . . . . . . . . . . . . . 338
16.12.1 Skeleton Recovery with a Separation Tree . . . . . . . . . . . . . . . . . . . . . . . . 338
16.12.2 Recovering the Complexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
16.13 Structure Learning and Faithfulness: an Evaluation . . . . . . . . . . . . . . . . . . . . . . 341
16.13.1 Faithfulness and 'real world' data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
16.13.2 Interaction effects without main effects . . . . . . . . . . . . . . . . . . . . . . . . . 343
16.13.3 Hidden variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
16.13.4 The scope of structure learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
16.13.5 Application of FAS and RAI to financial data . . . . . . . . . . . . . . . . . . . . . 344
16.13.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
16.13.7 The 'Causal Discovery' Controversy . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
16.13.8 Faithfulness and the great leap of faith . . . . . . . . . . . . . . . . . . . . . . . . . 347
16.13.9 Inferring non-causation and causation . . . . . . . . . . . . . . . . . . . . . . . . . . 349
16.13.10 Summarising causal discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
16.14 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
16.15 Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354

17 Bayesian Networks in R: Structure and Parameter Learning 357


17.1 Bayesian Networks with bnlearn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
17.1.1 Creating and Manipulating Network Structures . . . . . . . . . . . . . . . . . . . . 358
17.1.2 Visualising Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
17.1.3 Structure Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
17.1.4 Parameter Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
17.1.5 Discretisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
17.1.6 Latent Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
17.1.7 Application to Gene Expression Data . . . . . . . . . . . . . . . . . . . . . . . . . . 368
17.1.8 Interventional Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
17.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373

18 Monte Carlo Algorithms for Graph Search 375
18.1 A Stochastic Optimisation Algorithm for Essential Graphs . . . . . . . . . . . . . . . . . . 375
18.2 Structure MCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
18.3 Edge Reversal Moves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
18.4 Order MCMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
18.5 Partition MCMC for Directed Acyclic Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 381
18.5.1 Scoring Partitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
18.5.2 Partition Moves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
18.5.3 Permutation Moves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
18.5.4 Combination with Edge Reversal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383

19 Dynamic Bayesian Networks 385


19.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
19.2 Multivariate Time Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
19.3 Lasso Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
19.3.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
19.4 simone: Statistical Inference for MOdular NEtworks . . . . . . . . . . . . . . . . . . . . . . 396
19.5 GeneNet, GIDBN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
19.6 Inference for Dynamic Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
19.7 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402

20 Factor graphs and the sum product algorithm 403


20.1 Factorisation and Local Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
20.2 The Sum Product Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
20.3 The Sum Product Algorithm on General Graphs . . . . . . . . . . . . . . . . . . . . . . . . 411
20.4 Stochastic Probability Updates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 414
20.5 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
20.6 Answer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416

21 Graphical Models and Exponential Families 419


21.1 Introduction to Exponential Families . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
21.2 Standard Examples of Exponential Families . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
21.3 Graphical Models and Exponential Families . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
21.4 Properties of the log Partition Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425
21.5 Fenchel Legendre Conjugate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
21.6 Kullback Leibler Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
21.7 Mean Field Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
21.8 Exercises: Graphical Models and Exponential Families . . . . . . . . . . . . . . . . . . . . 436

22 Variational Methods for Parameter Estimation 439
22.1 Complete Instantiations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
22.1.1 Triangulated Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 439
22.1.2 Non-Triangulated Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
22.2 Partially Observed Models and Expectation-Maximisation . . . . . . . . . . . . . . . . . . 442
22.2.1 Exact EM Algorithm for Exponential Families . . . . . . . . . . . . . . . . . . . . . 442
22.2.2 Mean Field Approximate EM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
22.3 Variational Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445

Literature Cited 449

INDEX 457

Introduction

The models that were later to be called Bayesian networks were introduced into artificial intelligence by
J. Pearl (1982) [103], a seminal article in the literature of that field. A Bayesian network is simply
a factorisation of a probability distribution and a corresponding directed acyclic graph (henceforth
written DAG), where the edges of the DAG correspond to direct associations between variables in the
factorisation.
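
As a minimal sketch of what a factorisation along a DAG means, consider a hypothetical three-variable network with edges A -> B and A -> C, all variables binary. The variable names and the probability tables below are illustrative assumptions and do not come from the text. The joint distribution is obtained as the product of one conditional table per node, each conditioned on that node's parents.

```python
from itertools import product

# Hypothetical DAG: A -> B and A -> C (all variables binary).
parents = {"A": (), "B": ("A",), "C": ("A",)}

# Illustrative tables: cpt[node][parent_values] = P(node = 1 | parents).
cpt = {
    "A": {(): 0.3},
    "B": {(0,): 0.2, (1,): 0.9},
    "C": {(0,): 0.5, (1,): 0.1},
}

def joint(assignment):
    """P(assignment) as the product of P(node | parents) over the DAG."""
    prob = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assignment[p] for p in pa)]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob

# Tabulate the joint distribution over all 2^3 configurations (order: A, B, C).
names = list(parents)
table = {
    values: joint(dict(zip(names, values)))
    for values in product((0, 1), repeat=len(names))
}
total = sum(table.values())  # a valid factorisation sums to one
```

Three small tables (1 + 2 + 2 free parameters) determine all eight joint probabilities here, which is the economy that the factorisation is intended to provide.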
The first Bayesian networks were connected with causal models, where the order of the variables
in the factorisation represented cause to effect, and the directed arrows in the DAG represented direct
causes. This is still one of the major uses of Bayesian networks. The leaf nodes of the network are the
observables. From observations of leaf variables, inference is made for hidden variables via Bayes rule,
hence the term Bayesian network. The terminology, therefore, derives from the probabilistic use of
Bayes rule; the statistical use of Bayes rule, whereby uncertainty in the parameter value is expressed in
terms of a prior distribution which is then updated to a posterior distribution when data is considered,
is not in view here.
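
The probabilistic use of Bayes rule can be illustrated with a hypothetical two-node network H -> E, where H is a hidden variable and E is an observed leaf. The names and all numerical values are assumptions made for this sketch only.

```python
# Hypothetical two-node network: hidden cause H -> observed leaf E (both binary).
prior_h = {0: 0.99, 1: 0.01}        # P(H = h), illustrative values
likelihood_e1 = {0: 0.05, 1: 0.90}  # P(E = 1 | H = h), illustrative values

def posterior_given_e(e):
    """Return P(H = h | E = e) for h in {0, 1} using Bayes rule."""
    # Unnormalised terms P(H = h) * P(E = e | H = h).
    weights = {
        h: prior_h[h] * (likelihood_e1[h] if e == 1 else 1.0 - likelihood_e1[h])
        for h in prior_h
    }
    evidence = sum(weights.values())  # P(E = e)
    return {h: w / evidence for h, w in weights.items()}

post = posterior_given_e(1)  # distribution of the hidden variable after observing E = 1
```

No prior distribution over parameters appears anywhere in this calculation; the probability tables are treated as known, which is the sense in which the term 'Bayesian' is used for these networks.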
Graphical directed separation statements for the DAG imply corresponding conditional independence
statements for the probability distribution. For large and complex systems, graphical separation
algorithms provide a convenient and efficient method to establish probabilistic conditional independence
statements.
The description 'Bayesian networks' now covers a large field of problems and techniques of data
analysis and probabilistic reasoning, where data is collected on a large number of variables and the
aim is to factorise the distribution, represent it graphically and exploit the graphical representation.
Perhaps the earliest work that explicitly uses directed graphs to represent possible dependencies among
random variables is that by S. Wright (1921) [146], developed by the same author in 1934 [147].
Bayesian networks represent a small part of the wider field of graphical models. A Bayesian network
is a probability distribution factorised along a DAG. In many examples this is not the most efficient
model for representing the independence structure and there is a wider field of graphical models.
Situations where Bayesian networks provide the natural tools for analysis include: computing
the overall reliability of a system given the reliability of the individual components and how
they interact; system security, where Bayesian networks are used as a tool for assessing intrusion
evidence and whether a network is under attack; and forensic analysis. Further applications include
finding the most likely message that was sent across a noisy channel, restoring a noisy image and mapping
genes onto a chromosome. One of the leading applications of techniques from the area is establishing
genome pathways. Given DNA hybridisation arrays, which simultaneously measure the expression
levels for thousands of genes, a major challenge in computational biology is to uncover, from such
measurements, gene/protein interactions and key biological features of cellular systems. This is discussed,
for example, by Nir Friedman et al. (2000) [46].
DAGs have proved useful in a large number of situations where the graph is constructed along
causal principles; parent variables are considered to be direct causes. One field where causal networks
have proved particularly effective is epidemiological research, where DAGs have provided a
framework for the problem of multiple confounding factors in genetic epidemiology, as discussed by
Greenland, Pearl and Robins (1999) [56]. Bayesian networks offer an alternative to 'naïve Bayes' models
of supervised classification in machine learning, one which enables more of the structure to be exploited.
One of the first examples of this was the Chow-Liu tree (1968) [28].
Chapter 1

Conditional Independence and Graphical Models

A graphical model for a probability distribution over several variables is, quite simply, a graph, where
the random variables correspond to the node set of the graph and each graphical separation statement
implies the corresponding conditional independence statement for the random variables. The opposite
(that conditional independence implies graphical separation) in general does not hold. In a system with
a large number of variables, the task of determining graphical separation statements is, in general,
computationally far less demanding than the task of determining conditional independence, hence the
motivation for graphical models and applying graph theoretic results.
A Bayesian network is the representation of a probability distribution on a directed acyclic graph
(DAG). In this setting, the most useful notion of separation is D-separation (short for directed separation),
which is defined later. If a probability distribution factorises along a DAG, then D-separation
statements in the DAG imply the corresponding conditional independence statements (although the
reverse implication is, in general, false).
In many problems, for example gene expression data where there are thousands of variables, it may
not be either possible or desirable to obtain a complete description of the dependence structure. The
aim for such problems is to learn a DAG which encodes the most important features of the dependence
structure. In classification problems, a complete description of the dependence structure is usually
unnecessary; algorithms only locate the key features of the dependence structure to ensure accurate
classification.

1.1 Notational preliminaries: Graphical and Probabilistic


Random Variables Let $X = (X_1, \ldots, X_d)'$ denote a vector of random variables. The random
variables under consideration are of two types, discrete with finite state space and continuous. Let $\mathcal{X}_j$
denote the state space of variable $j$. If $X_j$ is continuous, the state space is $\mathbb{R}$. If $X_j$ has a finite state
space with $k_j$ elements, say $(x_j^{(1)}, \ldots, x_j^{(k_j)})$, then $\mathcal{X}_j = \{0, \ldots, k_j - 1\}$ denotes the indexing set. The
state space of the random vector $X$ is the product space $\mathcal{X} = \times_{j=1}^{d} \mathcal{X}_j$.


In the subject of Bayesian Networks, there are usually three situations in view: multinomial,
Gaussian and Conditional Gaussian. $P_{X_1,\ldots,X_d}$ will be used to denote the probability distribution over
$X_1, \ldots, X_d$. For the multinomial case, this is simply the probability function; the quantity
$P_{X_1,\ldots,X_d}(x_1, \ldots, x_d)$ is the probability of obtaining a configuration with indices $(x_1, \ldots, x_d) \in \mathcal{X}$.
When $X$ is a multivariate Gaussian random vector, $P_{X_1,\ldots,X_d}$ refers to the probability density
function. When $X$ is conditional Gaussian, the discrete variables are listed with lower index than
the continuous, so that $X = (X_1, \ldots, X_a, X_{a+1}, \ldots, X_d)$ where there are $a$ discrete variables and $d - a$
continuous variables. Here $P_{X_1,\ldots,X_a}$ is the probability function for the discrete variables, while for each
configuration $(x_1, \ldots, x_a)$ of the discrete variables, $P_{X_{a+1},\ldots,X_d \mid X_1,\ldots,X_a}(\cdot \mid x_1, \ldots, x_a)$ is a multivariate
Gaussian probability distribution over $\mathbb{R}^{d-a}$ for the variables $a + 1, \ldots, d$.

If $X = (X_1, \ldots, X_d)'$ and $A \subset \{1, \ldots, d\}$ where $A = (a(1), \ldots, a(m))$, then $X_A := (X_{a(1)}, \ldots, X_{a(m)})'$.

In this treatment, the discussion will be presented for discrete random variables, unless explicitly stated
otherwise.

Notation and Definitions for Graphs


Definition 1.1 (Graph, Simple Graph). A graph G = (V, E) consists of a finite set of nodes V and
an edge set E , where each edge is contained in V × V . The edge set therefore consists of ordered pairs
of nodes, which we denote (α, β) or α → β .
Let V = {1, . . . , d}. A graph G = (V, E) is said to be simple if E does not contain any edges of the
form (α, α) (that is a loop from the node to itself ) and any edge (α, β) ∈ E that appears in E does so
exactly once. That is, multiple edges are not permitted.
For any two distinct nodes α and β ∈ V , the ordered pair (α, β) ∈ E if and only if there is a directed
edge from α to β . An undirected edge will be denoted ⟨α, β⟩. In terms of directed edges,

⟨α, β⟩ ∈ E ⇔ (α, β) ∈ E and (β, α) ∈ E.

For a simple graph that may contain both directed and undirected edges, the edge set E may be decomposed
as E = D ∪ U , where D ∩ U = ∅, the empty set. The sets U and D are defined by

⟨α, β⟩ ∈ U ⇔ (α, β) ∈ E and (β, α) ∈ E,

(α, β) ∈ D ⇔ (α, β) ∈ E and (β, α) ∉ E.


If (α, β) ∈ D, we may also denote this by α → β ∈ D. If ⟨α, β⟩ ∈ U , we may also denote this by α − β ∈ U .
If either (α, β) ∈ D or (β, α) ∈ D or ⟨α, β⟩ ∈ U , but we do not specify which, we may denote this by
α ∼ β . For the definitions of 'path', 'trail' and 'cycle', an undirected edge will be considered as a single
edge.
All the graphs considered in this treatment will be simple graphs and the term `graph' will be used to
mean `simple graph'. If (α, β) ∈ D, this is denoted by an arrow going from α to β . If ⟨α, β⟩ ∈ U , this
is denoted by an undirected edge between the two variables α and β .

Definition 1.2 (Parent, Child, Directed and Undirected Neighbour, Family). Consider a graph G =
(V, E), where V = {1, . . . , d} and let E = D ∪ U , where D is the set of directed edges and U the set of
undirected edges. Let α, β ∈ V . If (α, β) ∈ D, then β is referred to as a child of α and α as a parent of β.

For any node α ∈ V , the set of parents is defined as

Pa(α) = {β ∈ V ∣ (β, α) ∈ D} (1.1)

and the set of children is defined as

Ch(α) = {β ∈ V ∣ (α, β) ∈ D}. (1.2)

For any subset A ⊆ V , the set of parents of A is defined as

Pa(A) = ∪α∈A {β ∈ V \ A ∣ (β, α) ∈ D}. (1.3)

The set of directed neighbours of a node α is defined as

N(d) (α) = Pa(α) ∪ Ch(α)

and the set of undirected neighbours of α as

N(u) (α) = {β ∈ V ∣ ⟨α, β⟩ ∈ U }. (1.4)

For any subset A ⊆ V , the set of undirected neighbours of A is defined as

N(u) (A) = ∪α∈A {β ∈ V \ A ∣ ⟨α, β⟩ ∈ U }. (1.5)

For a node α, the set of neighbours N (α) is defined as

N (α) = N(u) (α) ∪ N(d) (α).

The family of a node β is the set containing the node β together with its parents and undirected
neighbours. It is denoted:

F (β) = {β} ∪ Pa(β) ∪ N(u) (β) = {family of β}.

When G is undirected, this reduces to F (β) = {β} ∪ N (β).

The notation α ∼ β will be used to denote that α ∈ N (β); namely, that α and β are neighbours. Note
that α ∈ N (β) ⟹ β ∈ N (α).

In this text, a directed edge (α, β) is indicated by a pointed arrow from α to β ; that is, from the parent
to the child.
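
As an illustration of Definitions 1.1 and 1.2, the following Python sketch stores a graph as a set of ordered pairs, with an undirected edge ⟨α, β⟩ held as both (α, β) and (β, α), and recovers D, U, Pa, Ch and the family of a node. The function names and the example edge set are assumptions made for this illustration only; they do not come from the text.

```python
# Hypothetical representation: E is a set of ordered pairs (a, b).
# An undirected edge <a, b> is stored as both (a, b) and (b, a), as in Definition 1.1.

def split_edges(E):
    """Decompose E into the directed edges D and the undirected edges U."""
    D = {(a, b) for (a, b) in E if (b, a) not in E}
    U = {frozenset((a, b)) for (a, b) in E if (b, a) in E}
    return D, U

def parents(D, node):
    return {a for (a, b) in D if b == node}

def children(D, node):
    return {b for (a, b) in D if a == node}

def undirected_neighbours(U, node):
    return {b for edge in U if node in edge for b in edge if b != node}

def family(D, U, node):
    """The node together with its parents and undirected neighbours."""
    return {node} | parents(D, node) | undirected_neighbours(U, node)

# Example graph (an assumption): 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4 and an undirected edge 4 - 5.
E = {(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (5, 4)}
D, U = split_edges(E)
print(parents(D, 4), children(D, 1), family(D, U, 4))
```

Storing each undirected edge once, as a `frozenset`, keeps D and U disjoint, which matches the decomposition E = D ∪ U used in the definitions.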

Definition 1.3 (Directed, Undirected Graph). If all edges of a graph are undirected, then the graph G
is said to be undirected. If all edges are directed, then the graph is said to be directed. The undirected
version of a graph G , denoted by G̃ , is obtained by replacing the directed edges of G by undirected edges.

Definition 1.4 (Trail). Let G = (V, E) be a graph, where E = D ∪ U ; D ∩ U = ∅, D denotes the directed
edges and U the undirected edges. A trail τ between two nodes α ∈ V and β ∈ V is a collection of
nodes τ = (τ1 , . . . , τm ), where τi ∈ V for each i = 1, . . . , m, τ1 = α and τm = β and such that for each
i = 1, . . . , m − 1, τi ∼ τi+1 . That is, for each i = 1, . . . , m − 1, either (τi , τi+1 ) ∈ D or (τi+1 , τi ) ∈ D or
⟨τi , τi+1 ⟩ ∈ U .

Definition 1.5 (Sub-graph, Induced Sub-graph). Let A ⊆ V and EA ⊆ E ∩ (A × A). Then F = (A, EA )
is a sub-graph of G .
If A ⊂ V and EA = E ∩ (A × A), then GA = (A, EA ) is the sub-graph induced by A.

Note that, in general, a sub-graph on a given node set may contain fewer edges than G has between
those nodes, whereas the sub-graph induced by that node set contains all of them.

Definition 1.6 (Connected Graph, Connected Component). A graph is said to be connected if between
any two nodes αj ∈ V and αk ∈ V there is a trail. A connected component of a graph G = (V, E) is
an induced sub-graph GA such that GA is connected and such that if A ≠ V , then for any two nodes
(α, β) ∈ V × V such that α ∈ A and β ∈ V \ A, there is no trail between α and β .

Definition 1.7 (Path, Directed Path). Let G = (V, E) denote a simple graph, where E = D ∪ U . That
is, D ∩ U = ∅, D denotes the directed edges and U denotes the undirected edges. A path of length m
from a node α to a node β is a sequence of distinct nodes (τ0 , . . . , τm ) such that τ0 = α and τm = β
such that (τi−1 , τi ) ∈ E for each i = 1, . . . , m. That is, for each i = 1, . . . , m, either (τi−1 , τi ) ∈ D, or
⟨τi−1 , τi ⟩ ∈ U .
The path is a directed path if (τi−1 , τi ) ∈ D for each i = 1, . . . , m. That is, there are no undirected
edges along the directed path.

It follows that a trail in G is a sequence of nodes that form a path in the undirected version G̃ .
Unlike a trail, a directed path (τ0 , . . . , τm ) requires that the directed edge (τi , τi+1 ) ∈ D for all
i = 0, . . . , m − 1.

Definition 1.8 (Descendant, Ancestor). Let G = (V, E) be a graph. A node α is a descendant of a
node β if and only if there is a directed path from β to α. A node γ is an ancestor of a node α if and
only if there is a directed path from γ to α.
Let E = U ∪ D, where U denotes the undirected edges and D denotes the directed edges. The set of
descendants D(α) of a node α is defined as

D(α) = {β ∈ V ∣ ∃τ = (τ0 , . . . , τk ) ∶ τ0 = α, τk = β, (τj , τj+1 ) ∈ D, j = 0, 1, . . . , k − 1}. (1.6)

That is, nodes β such that there is a directed path from α to β .

The set of ancestors A(α) of a node α is defined as

A(α) = {β ∈ V ∣ ∃τ = (τ0 , . . . , τk ) ∶ τ0 = β, τk = α, (τj , τj+1 ) ∈ D, j = 0, 1, . . . , k − 1}. (1.7)

That is, nodes β such that there is a directed path from β to α.

In both cases, the paths are directed; they consist of directed edges only and do not contain undirected
edges.
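
The sets D(α) and A(α) can be computed by a simple graph search. The sketch below is one possible way of doing this, assuming the directed edges are held as a Python set of pairs; the function names are illustrative and the example graph is the four-node DAG with edges 1 → 2, 1 → 3, 2 → 4, 3 → 4.

```python
def descendants(D, node):
    """Nodes reachable from `node` by a directed path of length at least one."""
    found, stack = set(), [node]
    while stack:
        a = stack.pop()
        for (p, c) in D:
            if p == a and c not in found:
                found.add(c)
                stack.append(c)
    return found

def ancestors(D, node):
    """Ancestors are the descendants in the graph with every edge reversed."""
    return descendants({(b, a) for (a, b) in D}, node)

D = {(1, 2), (1, 3), (2, 4), (3, 4)}
print(descendants(D, 1))  # the nodes 2, 3 and 4
print(ancestors(D, 4))    # the nodes 1, 2 and 3
```

Only edges in D are followed, so undirected edges never contribute to either set, in line with the remark above.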

Definition 1.9 (Cycle). Let G = (V, E) be a graph. An m-cycle in G is a sequence of distinct nodes

τ0 , . . . , τm−1

such that τ0 , . . . , τm−1 , τ0 is a path (Definition 1.7).

Definition 1.10 (Directed Acyclic Graph (DAG)). A graph G = (V, E) is said to be a directed acyclic
graph if each edge is directed (that is, G is a simple graph such that for each pair (α, β) ∈ V × V ,
(α, β) ∈ E ⟹ (β, α) ∉ E ) and for any node α ∈ V there does not exist any set of distinct nodes
τ1 , . . . , τm such that α ≠ τi for all i = 1, . . . , m and (α, τ1 , . . . , τm , α) forms a directed path. That is,
there are no m-cycles in G for any m ≥ 1.
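
Acyclicity can be checked by trying to order the nodes so that every parent precedes its children; such an ordering exists exactly when there is no directed cycle. The sketch below uses the standard approach of repeatedly removing nodes with no remaining parents (often called Kahn's algorithm). It is an illustration under the assumption that the graph is given as a node set V and a set D of directed edges; the function name is hypothetical.

```python
def topological_order(V, D):
    """Return an ordering with parents before children, or None if D has a directed cycle."""
    indegree = {v: 0 for v in V}
    for (_, child) in D:
        indegree[child] += 1
    ready = sorted(v for v in V if indegree[v] == 0)  # nodes with no parents
    order = []
    while ready:
        a = ready.pop(0)
        order.append(a)
        for (p, c) in sorted(D):
            if p == a:
                indegree[c] -= 1          # edge a -> c has been "removed"
                if indegree[c] == 0:
                    ready.append(c)
    return order if len(order) == len(V) else None

print(topological_order({1, 2, 3, 4}, {(1, 2), (1, 3), (2, 4), (3, 4)}))  # [1, 2, 3, 4]
print(topological_order({1, 2, 3}, {(1, 2), (2, 3), (3, 1)}))            # None
```

An ordering returned by this function is one in which the parent set of each node is contained in the set of earlier nodes, which is the property required of the orderings σ used in the factorisations of Section 1.2.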

Definition 1.11 (Tree, Leaf). A tree is a graph G = (V, E) that is connected and such that for any
node α ∈ V , there is no trail between α and α and for any two nodes α and β in V with α ≠ β , there
is a unique trail. A leaf of a tree is a node that is connected to exactly one other node.

1.2 Conditional Independence and Factorisation


Definition 1.12 (Independence). Two random vectors X and Y are independent if their joint probability
distribution factorises as
PX,Y = PX PY .

X and Y are conditionally independent given a random vector Z if

PX,Y,Z = PX∣Z PY ∣Z PZ .

This is written X ⊥ Y ∣Z .

The following characterisations of conditional independence follow from the definition.

Theorem 1.13. The following are all equivalent to X ⊥ Y ∣Z : using XX , XY and XZ to denote the
state spaces of X , Y and Z respectively:

1. For all (x, y, z) ∈ XX × XY × XZ such that PY ∣Z (y∣z) > 0 and PZ (z) > 0,

PX∣Y,Z (x∣y, z) = PX∣Z (x∣z).



2. There exists a function a ∶ XX × XZ → [0, 1] such that for all (x, y, z) ∈ XX × XY × XZ satisfying
PY,Z (y, z) > 0,

PX∣Y,Z (x∣y, z) = a(x, z)

3. There exist functions a ∶ XX × XZ → R+ and b ∶ XY × XZ → R+ such that for all (x, y, z) ∈
XX × XY × XZ satisfying PZ (z) > 0,

PX,Y ∣Z (x, y∣z) = a(x, z)b(y, z)

4. For all (x, y, z) ∈ XX × XY × XZ such that PZ (z) > 0,

\[ P_{X,Y,Z}(x, y, z) = \frac{P_{X,Z}(x, z)\,P_{Y,Z}(y, z)}{P_Z(z)}. \]

5. There exist functions a ∶ XX × XZ → R+ and b ∶ XY × XZ → R+ such that

PX,Y,Z (x, y, z) = a(x, z)b(y, z).

Proof of Theorem 1.13 The proof is trivial and is therefore omitted.
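
For small discrete examples, the characterisations can be checked numerically. The sketch below tests characterisation 4 in the multiplied form P(x, y, z) P(z) = P(x, z) P(y, z), which avoids dividing by P(z). It assumes the joint distribution is given as a Python dictionary listing every state (x, y, z), including those with probability zero; the function names and the numerical values are assumptions for illustration.

```python
from itertools import product

def marginal(joint, keep):
    """Sum a joint table {(x, y, z): p} down to the coordinates listed in `keep`."""
    out = {}
    for state, p in joint.items():
        key = tuple(state[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def is_cond_independent(joint, tol=1e-12):
    """Check P(x,y,z) P(z) = P(x,z) P(y,z) for every state (characterisation 4)."""
    pxz = marginal(joint, (0, 2))
    pyz = marginal(joint, (1, 2))
    pz = marginal(joint, (2,))
    return all(abs(p * pz[(z,)] - pxz[(x, z)] * pyz[(y, z)]) <= tol
               for (x, y, z), p in joint.items())

# Hypothetical tables: P(z), P(x | z) and P(y | z) for binary variables.
p_z = [0.3, 0.7]
p_x_given_z = [[0.9, 0.1], [0.2, 0.8]]
p_y_given_z = [[0.5, 0.5], [0.4, 0.6]]
independent = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
               for x, y, z in product(range(2), repeat=3)}

# A joint in which X = Y with probability one, so X and Y are dependent given Z.
dependent = {(x, y, z): (0.25 if x == y else 0.0)
             for x, y, z in product(range(2), repeat=3)}

print(is_cond_independent(independent), is_cond_independent(dependent))  # True False
```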

Recall that for any collection of events A1 , . . . , An ,

P(A1 ∩ . . . ∩ An ) = P(A1 )P(A2 ∣A1 ) . . . P(An ∣A1 ∩ . . . ∩ An−1 ).

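Clearly, any probability distribution $P_{X_1,\ldots,X_d}$ over $\mathcal{X}$ may be factorised as

\[ P_{X_1,\ldots,X_d} = P_{X_{\sigma(1)}} \prod_{j=2}^{d} P_{X_{\sigma(j)} \mid X_{\sigma(1)},\ldots,X_{\sigma(j-1)}} \]

for any permutation σ of 1, . . . , d. Let $\mathrm{Pa}^{(\sigma)}(j) \subseteq \{\sigma(1), \ldots, \sigma(j-1)\}$ satisfy

• $X_{\sigma(j)} \perp \{X_{\sigma(1)}, \ldots, X_{\sigma(j-1)}\} \setminus X_{\mathrm{Pa}^{(\sigma)}(j)} \mid X_{\mathrm{Pa}^{(\sigma)}(j)}$

• $X_{\sigma(j)} \not\perp \{X_{\sigma(1)}, \ldots, X_{\sigma(j-1)}\} \setminus X_{\Theta(j)} \mid X_{\Theta(j)}$ for any strict subset $\Theta(j) \subset \mathrm{Pa}^{(\sigma)}(j)$.

Then, by the first characterisation of conditional independence and setting $P_{X_{\sigma(j)} \mid X_{\mathrm{Pa}^{(\sigma)}(j)}} = P_{X_{\sigma(j)}}$
when $\mathrm{Pa}^{(\sigma)}(j) = \emptyset$, the empty set,

\[ P_{X_1,\ldots,X_d} = \prod_{j=1}^{d} P_{X_{\sigma(j)} \mid X_{\mathrm{Pa}^{(\sigma)}(j)}}. \]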

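Definition 1.14 (Factorisation, Bayesian Network). A factorisation of a probability distribution is a
decomposition

\[ P_{X_1,\ldots,X_d} = \prod_{j=1}^{d} P_{X_{\sigma(j)} \mid X_{\Xi^{(\sigma)}(j)}} \tag{1.8} \]

such that for each j ∈ {1, . . . , d}, $\Xi^{(\sigma)}(j) \subseteq \{\sigma(1), \ldots, \sigma(j-1)\}$.
A Bayesian Network is a factorisation of a probability distribution

\[ P_{X_1,\ldots,X_d} = \prod_{j=1}^{d} P_{X_{\sigma(j)} \mid X_{\mathrm{Pa}^{(\sigma)}(j)}} \tag{1.9} \]

such that

1. $\mathrm{Pa}^{(\sigma)}(1) = \emptyset$ (the empty set)

2. $\mathrm{Pa}^{(\sigma)}(j) \subseteq \{\sigma(1), \ldots, \sigma(j-1)\}$

3. $X_{\sigma(j)} \perp \{X_{\sigma(1)}, \ldots, X_{\sigma(j-1)}\} \setminus X_{\mathrm{Pa}^{(\sigma)}(j)} \mid X_{\mathrm{Pa}^{(\sigma)}(j)}$

4. For any strict subset $\Theta(j) \subset \mathrm{Pa}^{(\sigma)}(j)$ of $\mathrm{Pa}^{(\sigma)}(j)$,
$X_{\sigma(j)} \not\perp \{X_{\sigma(1)}, \ldots, X_{\sigma(j-1)}\} \setminus X_{\Theta(j)} \mid X_{\Theta(j)}$.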

Unless otherwise stated, it will be assumed that the variables are labelled in such a way that σ = I ,
the identity.

For $\mathrm{Pa}(j) = \{l_{j,1}, \ldots, l_{j,m_j}\}$, the state space of $X_{\mathrm{Pa}(j)}$ is $\mathcal{X}_{l_{j,1}} \times \ldots \times \mathcal{X}_{l_{j,m_j}}$. For discrete variables, there
are $q_j = \prod_{a=1}^{m_j} k_{l_{j,a}}$ configurations. These may be labelled $(\pi_j^{(l)})_{l=1}^{q_j}$ and the parameters required for the
probability distribution $P_{X_1,\ldots,X_d}$ are

\[ \theta_{jil} = P_{X_j \mid X_{\mathrm{Pa}(j)}}(i \mid \pi_j^{(l)}), \qquad j = 1, \ldots, d, \quad i = 0, \ldots, k_j - 1, \quad l = 1, \ldots, q_j. \]

The factorisation of Equation (1.8) in Definition 1.14 may be represented by a Directed Acyclic Graph.
For example, if the probability distribution over X, Y, Z, W satisfies

PX,Y,Z,W = PX PY ∣X PZ∣X PW ∣Y,Z ,


the factorisation may be represented by the graph in Figure 1.1.
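
To make the parameter count concrete, the sketch below computes q_j and the number of parameters θ_jil for the four-variable factorisation above, under the assumption (made only for this illustration) that X, Y, Z and W are all binary. The function name and the dictionary layout are hypothetical. The count of "free" parameters uses the fact that, for each parent configuration, the k_j probabilities sum to one.

```python
from math import prod

def parameter_counts(k, pa):
    """k[j]: number of states of variable j; pa[j]: list of parents of j.

    Returns (q, total, free), where q[j] is the number of parent configurations,
    total counts all theta_jil and free discounts the sum-to-one constraints.
    """
    q = {j: prod(k[p] for p in pa[j]) for j in k}   # empty product is 1
    total = sum(k[j] * q[j] for j in k)
    free = sum((k[j] - 1) * q[j] for j in k)
    return q, total, free

k = {"X": 2, "Y": 2, "Z": 2, "W": 2}
pa = {"X": [], "Y": ["X"], "Z": ["X"], "W": ["Y", "Z"]}
q, total, free = parameter_counts(k, pa)
print(q, total, free)
```

For this example q = {X: 1, Y: 2, Z: 2, W: 4}, giving 18 parameters θ_jil of which 9 are free, compared with the 15 free parameters of an unrestricted joint distribution over four binary variables.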

1.2.1 Directed Acyclic Graphs and Probability Distributions


Now consider a random vector X = (X1 , . . . , Xd ).
Definition 1.15 (Factorisation along a Directed Acyclic Graph). A decomposition of a probability
distribution over a random vector X = (X1 , . . . , Xd ) which satisfies Equation (1.8) with respect to
an ordering σ is said to factorise according to a directed acyclic graph (V, D) if V = {1, . . . , d}, the
indexing set for the variables, is the node set of the graph and for each j = 1, . . . , d, $\Xi^{(\sigma)}(j)$ is the
parent set for node σ(j). The factorisation corresponds to a Bayesian network if $\Xi^{(\sigma)}(j) = \mathrm{Pa}^{(\sigma)}(j)$,
where $X_{\sigma(j)} \not\perp \{X_{\sigma(1)}, \ldots, X_{\sigma(j-1)}\} \setminus X_{\Theta(j)} \mid X_{\Theta(j)}$ for any strict subset $\Theta(j) \subset \mathrm{Pa}^{(\sigma)}(j)$.

Figure 1.1: DAG representing the factorisation of a probability distribution (edges X → Y, X → Z, Y → W, Z → W)

Figure 1.2: A Chain Connection (α → β → γ)

1.2.2 Connections in a Directed Acyclic Graph and Conditional Independence


Definition 1.16 (Instantiated). When the state of a variable is known, the variable is said to be instantiated.

Within a directed acyclic graph, there are three basic ways in which two nodes α, γ such that (α, γ) ∉ D
and (γ, α) ∉ D can be connected via a third node. They are the chain, fork and collider connections
respectively.

Chain Connections A chain connection between nodes α and γ is a connection via a node β such
that the graph contains directed edges α → β and β → γ , but no edge between α and γ .
Consider a probability distribution over (Xα , Xβ , Xγ ) factorised according to the graph in Figure 1.2,
as PXα PXβ ∣Xα PXγ ∣Xβ .
Clearly, Xα ⊥̸ Xγ in general;

\[ P_{X_\alpha,X_\gamma}(x_1, x_3) = P_{X_\alpha}(x_1) \sum_{x_2 \in \mathcal{X}_\beta} P_{X_\beta \mid X_\alpha}(x_2 \mid x_1)\, P_{X_\gamma \mid X_\beta}(x_3 \mid x_2) \]

and, without further assumptions, this cannot be expressed in product form.

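Conditioned on the instantiation Xβ = x2 ,

\[
\begin{aligned}
P_{X_\alpha,X_\gamma \mid X_\beta}(\cdot, \cdot \mid x_2)
&= \frac{P_{X_\alpha,X_\beta,X_\gamma}(\cdot, x_2, \cdot)}{P_{X_\beta}(x_2)}
 = \frac{P_{X_\alpha}(\cdot)\,P_{X_\beta \mid X_\alpha}(x_2 \mid \cdot)\,P_{X_\gamma \mid X_\beta}(\cdot \mid x_2)}{P_{X_\beta}(x_2)} \\
&= \left( \frac{P_{X_\alpha}(\cdot)\,P_{X_\beta \mid X_\alpha}(x_2 \mid \cdot)}{P_{X_\beta}(x_2)} \right) P_{X_\gamma \mid X_\beta}(\cdot \mid x_2)
 = P_{X_\alpha \mid X_\beta}(\cdot \mid x_2)\, P_{X_\gamma \mid X_\beta}(\cdot \mid x_2),
\end{aligned}
\]

where Bayes' rule has been used and so, following characterisation 3 of conditional independence from
Theorem 1.13, Xα ⊥ Xγ ∣Xβ .

Figure 1.3: A Fork Connection (α ← β → γ)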

Fork Connections A fork connection between two nodes Xα and Xγ is a situation where there is no
edge between Xα and Xγ , but there is a node Xβ such that the graph contains directed edges Xβ → Xα
and Xβ → Xγ . It is illustrated in Figure 1.3.
A distribution over the variables (Xα , Xβ , Xγ ) that factorises according to the DAG in Figure 1.3
has factorisation

PXα ,Xβ ,Xγ = PXβ PXα ∣Xβ PXγ ∣Xβ .

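It is clear that Xα ⊥̸ Xγ in general;

\[ P_{X_\alpha,X_\gamma}(x_1, x_3) = \sum_{x_2 \in \mathcal{X}_\beta} P_{X_\beta}(x_2)\, P_{X_\alpha \mid X_\beta}(x_1 \mid x_2)\, P_{X_\gamma \mid X_\beta}(x_3 \mid x_2) \]

and, without further assumptions, this cannot be expressed in product form. Conditioned on Xβ ,
though:

\[ P_{X_\alpha,X_\gamma \mid X_\beta} = \frac{P_{X_\alpha,X_\gamma,X_\beta}}{P_{X_\beta}} = \frac{P_{X_\beta}\, P_{X_\alpha \mid X_\beta}\, P_{X_\gamma \mid X_\beta}}{P_{X_\beta}} = P_{X_\alpha \mid X_\beta}\, P_{X_\gamma \mid X_\beta}. \]

It follows that Xα ⊥ Xγ ∣Xβ , by characterisation 3 of conditional independence in the statement of
Theorem 1.13.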

Collider Connections A collider connection between two nodes α and γ is a connection such that
the graph does not contain an edge between α and γ , but there is a node β such that the graph contains
directed edges α → β and γ → β . A collider connection is illustrated in Figure 1.4.
The factorisation of the distribution PXα ,Xβ ,Xγ corresponding to the DAG for the collider is

PXα ,Xβ ,Xγ = PXα PXγ PXβ ∣Xα ,Xγ .


Figure 1.4: A Collider Connection (α → β ← γ)

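In general, Xα ⊥̸ Xγ ∣Xβ . But for each (x, z) ∈ Xα × Xγ ,

\[
\begin{aligned}
P_{X_\alpha,X_\gamma}(x, z) &= \sum_{y \in \mathcal{X}_\beta} P_{X_\alpha}(x)\, P_{X_\gamma}(z)\, P_{X_\beta \mid X_\alpha,X_\gamma}(y \mid x, z) \\
&= P_{X_\alpha}(x)\, P_{X_\gamma}(z) \sum_{y \in \mathcal{X}_\beta} P_{X_\beta \mid X_\alpha,X_\gamma}(y \mid x, z) \\
&= P_{X_\alpha}(x)\, P_{X_\gamma}(z),
\end{aligned}
\]

so that Xα ⊥ Xγ .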
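
The three derivations can be checked numerically on small examples. The sketch below builds one joint distribution over binary (Xα, Xβ, Xγ) for each connection type and reports two numbers: the largest violation of marginal independence of Xα and Xγ, and the largest violation of conditional independence given Xβ. All conditional probability tables are hypothetical values chosen for illustration, and the function name `gaps` is an assumption.

```python
from itertools import product

def gaps(joint):
    """Return (marginal gap, conditional gap) for a table {(a, b, c): p}.

    marginal gap:    max |P(a,c) - P(a)P(c)|
    conditional gap: max |P(a,b,c)P(b) - P(a,b)P(b,c)|
    """
    def p(a=None, b=None, c=None):
        return sum(q for (i, j, k), q in joint.items()
                   if a in (None, i) and b in (None, j) and c in (None, k))
    states = list(product(range(2), repeat=3))
    marginal = max(abs(p(a=a, c=c) - p(a=a) * p(c=c)) for a, _, c in states)
    conditional = max(abs(p(a, b, c) * p(b=b) - p(a=a, b=b) * p(b=b, c=c))
                      for a, b, c in states)
    return marginal, conditional

m = [0.6, 0.4]                           # hypothetical marginal of a root node
n = [0.7, 0.3]                           # hypothetical marginal of a second root node
f = [[0.9, 0.1], [0.3, 0.7]]             # hypothetical f[parent][child]
g = [[0.8, 0.2], [0.25, 0.75]]           # hypothetical g[parent][child]
h = {(0, 0): 0.05, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.95}  # P(beta = 1 | alpha, gamma)

S3 = list(product(range(2), repeat=3))
chain = {(a, b, c): m[a] * f[a][b] * g[b][c] for a, b, c in S3}
fork = {(a, b, c): m[b] * f[b][a] * g[b][c] for a, b, c in S3}
collider = {(a, b, c): m[a] * n[c] * (h[a, c] if b else 1 - h[a, c]) for a, b, c in S3}

for name, joint in [("chain", chain), ("fork", fork), ("collider", collider)]:
    print(name, gaps(joint))
```

For the chain and the fork the first number is clearly positive and the second is zero up to rounding; for the collider the pattern is reversed, in agreement with the three derivations.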

A Causal Interpretation So far, the discussion has considered sets of random variables where,
based on the ordering of the variables, the parent set of a variable is a subset of those of a lower
order. The representation of a probability distribution by factorising along a Directed Acyclic Graph
may be particularly useful if there are cause to effect relations between the variables, the ancestors
being the cause and the descendants the effect. For a causal model, the connections have the following
interpretations:

Fork Connection: Common cause For the fork connection, illustrated by Figure 1.3, Xβ may
be a cause that influences both Xα and Xγ , which are effects. The variables are only related through
Xβ . The situation is illustrated by the following example, taken from a cartoon by Albert Engström:
during a convivial discussion at the bar one evening, about the unhygienic nature of galoshes, one
of the participants pipes up, 'You have a very good point there. Every time I wake up wearing my
galoshes, I have a sore head.'
Let Xα denote the state of the feet and Xγ the state of the head. These two variables are related;
Xα ⊥̸ Xγ . But there is a common cause, Xβ , which denotes the activities of the previous evening.
Once it is known that he has spent a convivial evening drinking, the state of the feet gives no further
information about the state of the head; Xα ⊥ Xγ ∣Xβ .

Chain Connection This may similarly be understood as cause to effect. Xα influences Xβ , which
in turn influences Xγ , but there is no direct causal relationship between the values taken by Xα and
those taken by Xγ . If Xβ is unknown, then Xα ⊥̸ Xγ , but once the state of Xβ is established, Xα and
Xγ give no further information about each other; Xα ⊥ Xγ ∣Xβ .

Collider Connection For the collider connection, Xα and Xγ are unrelated; Xα ⊥ Xγ . But they
both influence Xβ . For example, consider a burglar alarm (Xβ ) that is activated if a burglary (Xα ) takes
place, but can also be activated if there is a minor earth tremor (Xγ ).
One day, somebody calls you while you are at work to say that your burglar alarm is activated.
You get into the car to go home. But on the way home, you hear on the radio that there has been an
earth tremor in the area. As a result, you return to work.
Once Xβ is instantiated, the information that there has been an earth tremor influences the likelihood
that a burglary has taken place; Xα ⊥̸ Xγ ∣Xβ .
This is known as explaining away.
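
A small numerical version of the burglar alarm story may help. The sketch below assumes binary variables and entirely hypothetical probabilities (a burglary prior of 0.01, a tremor prior of 0.02 and an alarm table chosen for illustration); none of these numbers come from the text. It compares the probability of a burglary after the phone call with the probability after the radio report.

```python
from itertools import product

p_burglary = 0.01          # hypothetical P(X_alpha = 1)
p_tremor = 0.02            # hypothetical P(X_gamma = 1)
# hypothetical P(alarm = 1 | burglary, tremor)
p_alarm = {(0, 0): 0.001, (1, 0): 0.95, (0, 1): 0.30, (1, 1): 0.97}

def joint(b, t, a):
    """P(burglary = b, tremor = t, alarm = a) from the collider factorisation."""
    pb = p_burglary if b else 1 - p_burglary
    pt = p_tremor if t else 1 - p_tremor
    pa = p_alarm[(b, t)] if a else 1 - p_alarm[(b, t)]
    return pb * pt * pa

def p_burglary_given(alarm, tremor=None):
    """P(burglary = 1 | alarm) or, if tremor is given, P(burglary = 1 | alarm, tremor)."""
    ts = (0, 1) if tremor is None else (tremor,)
    num = sum(joint(1, t, alarm) for t in ts)
    den = sum(joint(b, t, alarm) for b, t in product((0, 1), ts))
    return num / den

after_call = p_burglary_given(alarm=1)
after_radio = p_burglary_given(alarm=1, tremor=1)
print(round(after_call, 3), round(after_radio, 3))
```

With these assumed numbers the probability of a burglary is about 0.58 once the alarm is known to be active, and falls to about 0.03 once the tremor is also known: the tremor explains the alarm away.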

Attention is now turned to trails within a DAG, and characterisation of those along which information
can pass.

Definition 1.17 (S -Active Trail). Let G = (V, D) be a directed acyclic graph. Let S ⊂ V and let
α, β ∈ V \ S . A trail τ between the two variables α and β is said to be S -active if

1. Every collider node in τ is in S , or has a descendant (Definition 1.8) in S .

2. Every other node is outside S .

Definition 1.18 (Blocked Trail). A trail between α and β that is not S -active is said to be blocked
by S .

The following definition is basic; it will be seen that if a probability distribution factorises along a
DAG G and two nodes α and β are D-separated by S , then Xα ⊥ Xβ ∣XS .

Definition 1.19 (D-separation). Let G = (V, D) be a directed acyclic graph, where V = {1, . . . , d}. Let
S ⊂ V . Two distinct nodes α and β not in S are D-separated by S if all trails between α and β are
blocked by S .
Let A and B denote two sets of nodes. If every trail from any node in A to any node in B is blocked
by S , then the sets A and B are said to be D-separated by S . This is written

A ⊥ B ∥_G S. (1.10)

The terminology D-separation is short for directed separation. The insertion of the letter 'D' points
out that this is not the standard use of the term 'separation' found in graph theory.

Definition 1.20 (D-connected). If two nodes α and β are not D-separated, they are said to be
D-connected.

Notation The notation α ⊥̸ β ∥_G S denotes that α and β are D-connected by S in the DAG G . Here
α and β may refer to individual nodes or sets of nodes.

Example 1.21.
Consider the chain connection α → β → γ in the DAG in Figure 1.2 and the fork connection of
Figure 1.3. For the chain connection of Figure 1.2, the D-separation statements are: α ⊥ γ ∥_G β
while α ⊥̸ γ ∥_G ∅ (∅ denotes the empty set). For the DAG in Figure 1.3, α ⊥ γ ∥_G β while α ⊥̸ γ ∥_G ∅.
These correspond to the conditional independence statements derived for probability distributions
that factorise along these graphs. For Figure 1.4, α ⊥ γ ∥_G ∅ while α ⊥̸ γ ∥_G β . Again, these statements
correspond to the conditional independence statements that may be derived from the fact that a
distribution factorises along the DAG of Figure 1.4.
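
The statements of Example 1.21 can be reproduced by applying Definitions 1.17 to 1.19 directly. The sketch below enumerates the trails between two nodes and tests each one for S-activity. Two assumptions are made: only trails without repeated nodes are enumerated (a standard simplification, taken here without proof), and the graph is small enough for exhaustive enumeration, which grows exponentially with the number of nodes. The function names are illustrative.

```python
def descendants(D, node):
    """Nodes reachable from `node` along directed edges."""
    found, stack = set(), [node]
    while stack:
        a = stack.pop()
        for (p, c) in D:
            if p == a and c not in found:
                found.add(c)
                stack.append(c)
    return found

def is_active(D, trail, S):
    """Definition 1.17: colliders need themselves or a descendant in S; other nodes must avoid S."""
    for prev, mid, nxt in zip(trail, trail[1:], trail[2:]):
        collider = (prev, mid) in D and (nxt, mid) in D
        if collider:
            if mid not in S and not (descendants(D, mid) & S):
                return False
        elif mid in S:
            return False
    return True

def simple_trails(D, a, b):
    """All trails from a to b that visit no node twice, ignoring edge directions."""
    adj = {}
    for (p, c) in D:
        adj.setdefault(p, set()).add(c)
        adj.setdefault(c, set()).add(p)
    def extend(trail):
        if trail[-1] == b:
            yield trail
            return
        for nxt in adj.get(trail[-1], ()):
            if nxt not in trail:
                yield from extend(trail + [nxt])
    yield from extend([a])

def d_separated(D, a, b, S):
    """True if every trail between a and b is blocked by the set S."""
    return not any(is_active(D, t, S) for t in simple_trails(D, a, b))

chain = {("a", "b"), ("b", "c")}       # Figure 1.2
fork = {("b", "a"), ("b", "c")}        # Figure 1.3
collider = {("a", "b"), ("c", "b")}    # Figure 1.4
for name, D in [("chain", chain), ("fork", fork), ("collider", collider)]:
    print(name, d_separated(D, "a", "c", set()), d_separated(D, "a", "c", {"b"}))
```

The chain and the fork are D-separated by {β} but not by ∅, whereas the collider is D-separated by ∅ but not by {β}.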

Let MB(α) denote the set of nodes which are either parents of α or children of α or a node which
shares a common child with α. Then α is D-separated from the rest of the network by MB(α). This
set of nodes is known as the Markov blanket of the node α.

Definition 1.22 (Markov Blanket). The Markov blanket of a node α in a DAG G = (V, D), denoted
MB(α), is the set consisting of the parents of α, the children of α and the nodes sharing a common
child with α.
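
Definition 1.22 translates directly into code. The sketch below assumes the DAG is given as a set of directed edges; the function name is illustrative and the example is the DAG of Figure 1.1.

```python
def markov_blanket(D, node):
    """Parents, children and co-parents (other parents of the node's children)."""
    pa = {a for (a, b) in D if b == node}
    ch = {b for (a, b) in D if a == node}
    co_parents = {a for (a, b) in D if b in ch and a != node}
    return pa | ch | co_parents

# DAG of Figure 1.1: X -> Y, X -> Z, Y -> W, Z -> W.
D = {("X", "Y"), ("X", "Z"), ("Y", "W"), ("Z", "W")}
print(markov_blanket(D, "Y"))  # parent X, child W and co-parent Z
```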

1.2.3 Bayes Ball


The Bayes ball provides a convenient method for deciding whether or not two nodes are D-separated
by a set S in a DAG G = (V, D). Variables are D-connected by a set S if the Bayes ball can be passed
between them employing the following rule. The nodes which are not in S are depicted as unshaded;
nodes in S as shaded.

Definition 1.23 (Instantiated Nodes). Let G = (V, D) be a directed acyclic graph. When considering
statements α ⊥ β ∥_G S and α ⊥̸ β ∥_G S , the nodes in S are referred to as instantiated.

Figure 1.5: Bayes Ball

Consider the three types of connection in a DAG; chain, collider and fork.

• For the chain connection illustrated in Figure 1.2, the Bayes ball algorithm indicates that if node
β is instantiated, then the ball does not move from α to γ through β . The communication in the
trail is blocked. If the node is not instantiated, then communication is possible.

• For the fork connection illustrated in Figure 1.3, the algorithm states that if node β is instantiated,
then again communication between α and γ is blocked. If the node is not instantiated, then
communication is possible.

• For the collider connection illustrated in Figure 1.4, the Bayes ball algorithm states that the ball
does move from α to γ if node β or any of its descendants is instantiated; this opens
communication between the parents. If neither β nor any of its
descendants is instantiated, then there is no communication.

For a collider node β , instantiating any of the descendants of β also opens communication. If node β
is not instantiated, and none of its descendants are instantiated, then there is no communication.
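
These rules can be turned into a search that records, for each node, whether the ball arrived from a child or from a parent. The sketch below is one common way of organising such a search and is offered as an illustration rather than as the algorithm of this text; the function name, the direction labels and the edge-set representation are assumptions.

```python
def bayes_ball_reachable(D, source, S):
    """Nodes outside S that the ball can reach from `source` when the nodes in S are instantiated."""
    pa, ch = {}, {}
    for (a, b) in D:
        ch.setdefault(a, set()).add(b)
        pa.setdefault(b, set()).add(a)

    # Nodes that are in S or have a descendant in S: a collider here lets the ball bounce back up.
    opens, stack = set(), list(S)
    while stack:
        y = stack.pop()
        if y not in opens:
            opens.add(y)
            stack.extend(pa.get(y, ()))

    reached, visited = set(), set()
    stack = [(source, "from_child")]     # start as if the ball had come up from a child
    while stack:
        y, came = stack.pop()
        if (y, came) in visited:
            continue
        visited.add((y, came))
        if y not in S:
            reached.add(y)
        if came == "from_child" and y not in S:
            # chain or fork through an uninstantiated node: pass on in both directions
            stack += [(z, "from_child") for z in pa.get(y, ())]
            stack += [(z, "from_parent") for z in ch.get(y, ())]
        elif came == "from_parent":
            if y not in S:               # uninstantiated chain node: continue downwards
                stack += [(z, "from_parent") for z in ch.get(y, ())]
            if y in opens:               # collider that is opened by S: bounce to the parents
                stack += [(z, "from_child") for z in pa.get(y, ())]
    return reached - {source}

collider = {("a", "b"), ("c", "b"), ("b", "d")}
print("c" in bayes_ball_reachable(collider, "a", set()))   # False
print("c" in bayes_ball_reachable(collider, "a", {"d"}))   # True
```

Two nodes are then D-connected by S exactly when one is reachable from the other, so the search gives the same answers as checking every trail, but in time linear in the size of the graph.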
A DAG G = (V, D) satisfies the following important property:

Theorem 1.24. A DAG G = (V, D) contains an edge between two nodes α, β ∈ V if and only if
α ⊥̸ β ∥_G S for every S ⊆ V \ {α, β}.

Proof The proof of this is straightforward and left as an exercise (Exercise 6 page 22).

1.3 D-Separation and Conditional Independence


The following key result shows that if a probability distribution factorises along a given DAG G ,
then every D-separation statement for the DAG implies the corresponding conditional independence
statement for the distribution.

Theorem 1.25 (D-Separation Implies Conditional Independence). Let G = (V, D) be a directed acyclic
graph and let P be a probability distribution that factorises along G . Then for any three disjoint subsets
A, B, S ⊂ V , it holds that XA ⊥ XB ∣XS (XA and XB are independent given XS ) if A ⊥ B ∥_G S (A and
B are D-separated by S ).

Proof of Theorem 1.25 Let X = (X1 , . . . , Xd ) be a random vector. Let V = {1, . . . , d} denote the
set of nodes of a directed acyclic graph G = (V, D) and suppose that PX factorises along G . Let A ⊂ V ,
B ⊂ V and S ⊂ V be three disjoint sets of nodes. Suppose that A ⊥ B ∥_G S . Let A, B and S also denote
the random vectors XA , XB and XS respectively and let $\mathcal{X}_A$, $\mathcal{X}_B$ and $\mathcal{X}_S$ denote their respective state
spaces. It is required to show that for all $a \in \mathcal{X}_A$, $b \in \mathcal{X}_B$ and $s \in \mathcal{X}_S$,

PA,B∣S (a, b∣s) = PA∣S (a∣s)PB∣S (b∣s).

Let R = V \ (A ∪ B ∪ S). Let

E1 = {α ∈ V ∣ there is an S -active trail from A to α},

E2 = {α ∈ V ∣ there is an S -active trail from B to α},

$R_1 = R \cap E_1 \cap E_2$, $R_2 = R \cap E_1 \cap E_2^c$, $R_3 = R \cap E_2 \cap E_1^c$, $R_4 = R \cap R_1^c \cap R_2^c \cap R_3^c = R \cap E_1^c \cap E_2^c$.

From Characterisation 5 of Theorem 1.13, it is required to show that there are two functions F and G
such that

PA,B,S (a, b, s) = F (a, s)G(b, s).

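Let $P(X_j \mid \mathrm{Pa}_j)$ denote the conditional probability function of $X_j$ given the parent variables $X_{\mathrm{Pa}(j)}$.
Then

\[
\begin{aligned}
P(X_1, \ldots, X_d) = {} & \prod_{j \in A} P(X_j \mid \mathrm{Pa}_j) \prod_{j \in B} P(X_j \mid \mathrm{Pa}_j) \prod_{j \in S} P(X_j \mid \mathrm{Pa}_j) \\
& \times \prod_{j \in R_1} P(X_j \mid \mathrm{Pa}_j) \prod_{j \in R_2} P(X_j \mid \mathrm{Pa}_j) \prod_{j \in R_3} P(X_j \mid \mathrm{Pa}_j) \prod_{j \in R_4} P(X_j \mid \mathrm{Pa}_j).
\end{aligned}
\]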

Since all the nodes of R are uninstantiated and there is no S -active trail from A to B , it follows that
any node α ∈ R1 is either a collider that is not in S and has no descendants (Definition 1.8)
in S , or a descendant of such a collider. Furthermore, any descendant of a variable in R1 is also in
R1 . Therefore, marginalising over the variables in R1 does not involve the parent variables of A, B or
S , nor does it involve the variables in R2 or R3 or their ancestors, since ∑Xj P(Xj ∣Paj ) = 1.

There is no S -active trail from a variable in R4 to any variable in A or B . It follows that parents of
variables in R4 are either in R4 or in S (if the parent is not in S and there is an S -active trail between
the parent and a variable in either A or B , then there is an S -active trail from the variable itself; the
link between the variable, its parent and the next variable on the trail is either an uninstantiated fork or
an uninstantiated chain connection).
Now, using ∅ to denote the empty set, let S2 = {α ∈ S ∣ Pa(α) ∩ R2 ≠ ∅} (there is an S -active trail
from A to a parent of α that is not in S , but not from B to such a parent), S3 = {α ∈ S ∣ Pa(α) ∩ R3 ≠ ∅} (there
is an S -active trail from B to a parent of α that is not in S , but not from A to such a parent) and $S_4 = S \cap S_2^c \cap S_3^c$
(nodes α ∈ S such that there is no S -active trail either from A or from B to a
parent of α that is not in S ). Then S2 ∩ S3 = ∅, the empty set, otherwise there would be a collider node in S that
would result in an active trail from A to B .

It is also clear that Pa(S4 ) ⊆ S ∪ R4 , where Pa(S4 ) denotes the parent variables of the variables in S4 ;
that is, Pa(S4 ) = {Y ∣ (Y, X) ∈ D, X ∈ S4 }. The sets S2 , S3 , S4 are disjoint. It follows that

\[
\begin{aligned}
P(A, B, S) = \sum_{X_{R_2}} \sum_{X_{R_3}} \sum_{X_{R_4}} \sum_{X_{R_1}}
& \left( \prod_{j \in R_1} P(X_j \mid \mathrm{Pa}_j) \prod_{j \in R_4} P(X_j \mid \mathrm{Pa}_j) \prod_{j \in S_4} P(X_j \mid \mathrm{Pa}_j) \right) \\
& \times \left( \prod_{j \in A} P(X_j \mid \mathrm{Pa}_j) \prod_{j \in S_2} P(X_j \mid \mathrm{Pa}_j) \prod_{j \in R_2} P(X_j \mid \mathrm{Pa}_j) \right) \\
& \times \left( \prod_{j \in B} P(X_j \mid \mathrm{Pa}_j) \prod_{j \in S_3} P(X_j \mid \mathrm{Pa}_j) \prod_{j \in R_3} P(X_j \mid \mathrm{Pa}_j) \right).
\end{aligned}
\]

The sums are taken from right to left, starting with $\sum_{X_{R_1}}$. None of the variables in R1 are in A ∪ B ∪ S
or the parent sets of variables in A, B , S2 , S3 , R2 or R3 . The parents of variables in R4 are either in
R4 or S , the parents of variables in S4 are either in S4 or R4 . The parents of variables in R3 are in R3 ∪ B .
The parents of variables in R2 are in R2 ∪ A. It follows that P(A, B, S) has a factorisation

P(A, B, S) = ψ1 (S)ψ2 (A, S)ψ3 (B, S)


where the constructions of ψ1 , ψ2 and ψ3 are clear from the context. This factorisation clearly satisfies
the required criteria. It follows that D-separation implies conditional independence.

Of course, the converse is not true in general; D-separation is a convenient way of locating some of
the independence structure of a distribution. It does not, in general, locate the entire independence
structure.

1.4 The Locally Directed Markov Property


This section introduces the local directed Markov condition, a necessary and sufficient condition so that
a probability function P over a set of variables V can be factorised along a graph G .

Definition 1.26 (Local Directed Markov Condition, Locally G -Markovian). Let X = (X1 , . . . , Xd )
be a random vector. A probability function P over X satisfies the local directed Markov condition
with respect to a DAG G = (V, D) with node set V = {1, . . . , d} or, equivalently, is said to
be locally G -Markovian if and only if there is an ordering of the variables σ such that
$\mathrm{Pa}^{(\sigma)}(j) \subseteq \{\sigma(1), \ldots, \sigma(j-1)\}$ for each j ∈ {1, . . . , d} and such that $X_{\sigma(j)}$ is conditionally independent, given
$X_{\mathrm{Pa}^{(\sigma)}(j)}$, of $X_{V \setminus (V^{(\sigma)}(j) \cup \mathrm{Pa}^{(\sigma)}(j) \cup \{\sigma(j)\})}$, where $V^{(\sigma)}(j)$ is the set of all descendants of σ(j) in G . That is,

\[ V^{(\sigma)}(j) = \{\beta \in V \mid \text{there is a directed path from } \sigma(j) \text{ to } \beta\} \tag{1.11} \]

and:

\[ X_{\sigma(j)} \perp X_{V \setminus (V^{(\sigma)}(j) \cup \mathrm{Pa}^{(\sigma)}(j) \cup \{\sigma(j)\})} \mid X_{\mathrm{Pa}^{(\sigma)}(j)}. \]
(σ)

Proposition 1.27. Let P be a probability distribution over a random vector X = (X1 , . . . , Xd ). Then
P satisfies the local directed Markov condition with respect to a graph G = (V, D) if and only if there is an ordering of the
variables σ such that P factorises along G .

Proof Firstly, assume that P is locally G -Markovian and assume that the variables are ordered
in such a way that for each j ∈ {1, . . . , d}, $X_j \perp X_{V \setminus (V(j) \cup \mathrm{Pa}(j) \cup \{j\})} \mid X_{\mathrm{Pa}(j)}$, where V (j) is defined in
Equation (1.11). Let $\pi_j(x_1, \ldots, x_{j-1})$ denote the instantiation of $X_{\mathrm{Pa}(j)}$ when X is instantiated
as (x1 , . . . , xd ). By Characterisation 1 of Theorem 1.13, for all j = 1, . . . , d and any $\pi_j$ such that
$P_{X_{\mathrm{Pa}(j)}}(\pi_j) > 0$,

\[ P_{X_j \mid X_1,\ldots,X_{j-1}}(x_j \mid x_1, \ldots, x_{j-1}) = P_{X_j \mid X_{\mathrm{Pa}(j)}}(x_j \mid \pi_j) \]

with $P_{X_j \mid X_1,\ldots,X_{j-1}}(x_j \mid x_1, \ldots, x_{j-1}) = P_{X_j}(x_j)$ if $\mathrm{Pa}(j) = \emptyset$. It follows directly that

\[ P_{X_1,\ldots,X_d} = \prod_{j=1}^{d} P_{X_j \mid X_{\mathrm{Pa}(j)}} \]

and hence, by definition, that P factorises along G .

Secondly, suppose that P factorises along a directed acyclic graph G = (V, D). Then it is clear (for
example by using the Bayes ball algorithm) that

Xj á XV /(V (j)∪Pa(j)) ∥G XPa(j)

where V (j) is the set of variables defined by Equation (1.11). If Pa(j) is instantiated, then any trail
from j to V /(V (j) ∪ Pa(j) ∪ {j}) has to pass through a node in Pa(j), at which the connection is either
a chain or a fork. It follows from Theorem 1.25 that

Xj ⊥ XV /(V (j)∪Pa(j)∪{j}) ∣XPa(j) ,


from which it follows that P is locally G -Markovian.
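
The first half of the proof can be checked numerically on a small case. The following is a minimal sketch, assuming a hypothetical four-node DAG (1 → 2, 1 → 3, 2 → 4, 3 → 4) and made-up conditional probability values, none of which come from the text. It builds the joint distribution as a product of conditional probabilities and then tests conditional independence statements by brute force.

```python
from itertools import product

# Hypothetical DAG (not from the text): 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4.
parents = {1: (), 2: (1,), 3: (1,), 4: (2, 3)}

# Illustrative values of P(X_j = 1 | parent values); all variables are binary.
p_one = {
    1: {(): 0.3},
    2: {(0,): 0.2, (1,): 0.7},
    3: {(0,): 0.6, (1,): 0.1},
    4: {(0, 0): 0.05, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
}

def cpt(j, xj, pa_values):
    """P(X_j = xj | X_Pa(j) = pa_values)."""
    p = p_one[j][pa_values]
    return p if xj == 1 else 1.0 - p

def joint(x):
    """Product of the conditional probabilities; x maps node -> value."""
    result = 1.0
    for j, pa in parents.items():
        result *= cpt(j, x[j], tuple(x[k] for k in pa))
    return result

nodes = sorted(parents)
table = {x: joint(dict(zip(nodes, x)))
         for x in product((0, 1), repeat=len(nodes))}

def marginal(assignment):
    """Probability that the listed nodes take the listed values."""
    return sum(p for x, p in table.items()
               if all(x[n - 1] == v for n, v in assignment.items()))

def cond_indep(a, b, given, tol=1e-12):
    """Check X_a indep. of X_b given X_given via P(a,b,s)P(s) = P(a,s)P(b,s)."""
    for values in product((0, 1), repeat=2 + len(given)):
        va, vb, vs = values[0], values[1], values[2:]
        s = dict(zip(given, vs))
        lhs = marginal({a: va, b: vb, **s}) * marginal(s)
        rhs = marginal({a: va, **s}) * marginal({b: vb, **s})
        if abs(lhs - rhs) > tol:
            return False
    return True
```

With these assumed numbers, `cond_indep(3, 2, (1,))` and `cond_indep(4, 1, (2, 3))` hold, as the local directed Markov condition requires, while `cond_indep(2, 3, ())` fails because nodes 2 and 3 share the parent 1.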
Once a probability distribution has been factorised according to a Bayesian Network, the next task is
to use it to answer queries.

Definition 1.28 (Query). A query in probabilistic inference is simply a conditional probability distribution
over the variables of interest (the query variables), conditioned on information received.

Discussion of the main algorithms for answering queries is the subject of chapters 7 and 8.
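
As a small illustration of a query, the sketch below answers one by brute-force enumeration rather than by any of the algorithms of the later chapters. The network R → W ← S and all of its numbers are hypothetical, chosen only for illustration.

```python
# Hypothetical binary network R -> W <- S (not from the text).
P_R = {1: 0.2, 0: 0.8}
P_S = {1: 0.1, 0: 0.9}
# P(W = 1 | R = r, S = s), keyed by (r, s); illustrative values only.
P_W1 = {(0, 0): 0.0, (0, 1): 0.9, (1, 0): 0.8, (1, 1): 0.99}

def joint(r, s, w):
    """Joint probability from the factorisation P_R P_S P_{W|R,S}."""
    pw = P_W1[(r, s)]
    return P_R[r] * P_S[s] * (pw if w == 1 else 1.0 - pw)

def query_r_given_w(w):
    """Return the distribution P(R | W = w): sum out S, then normalise."""
    unnorm = {r: sum(joint(r, s, w) for s in (0, 1)) for r in (0, 1)}
    z = sum(unnorm.values())
    return {r: p / z for r, p in unnorm.items()}

posterior = query_r_given_w(1)
```

Here R is the query variable and W = 1 is the information received. Enumeration is exponential in the number of variables summed out, which is why dedicated algorithms are needed for larger networks.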

1.5 Quick Medical Reference - Decision Theoretic: An Example


In classification problems, the aim is to infer the value of a class variable, given the values of the
observables. It is often unrealistic to hope to obtain a full profile of the probability distribution; the
aim is rather to exploit enough of the structure to obtain a good classifier.
We now consider the noisy logic gate, which we express as a Bayesian Network and give the QMR -
DT (Quick Medical Reference - Decision Theoretic) data base of diseases and symptoms as an example.
A disease may result in a symptom, but this is not certain. The noisy logic gate approximation
may be used to construct a classifier; given the symptoms exhibited, the problem is to diagnose the
illnesses.

1.5.1 Propositional Logic and Noisy Logic Gates


In logic, the `Or' disjunction of two propositions p and q is denoted by p ∨ q and is defined by the truth
table

p q p∨q
1 1 1
1 0 1
0 1 1
0 0 0
while the `And' conjunction of two propositions p and q , denoted by p ∧ q , is defined by the truth table

p q p∧q
1 1 1
1 0 0
0 1 0
0 0 0
Here 1 = the proposition is true, 0 = the proposition is false.
Now consider the situation where p and q are independent causes of some effect, but where p and q
only cause the effect with some probability less than 1.

1.5.2 QMR - DT Data Base


The QMR - DT database is a large scale probabilistic data base that is intended to be used as a
diagnostic aid in the domain of internal medicine. It stores information on approximately 600 diseases
and approximately 4000 symptoms. The quantities PD (the joint probability function for the selection
of diseases that a randomly chosen individual may have) and PS∣D (the joint probability function that
the victim exhibits a selection of symptoms given a particular selection of diseases) are estimated
from the data bank. Consider the example of diseases and symptoms. Let q0j denote the probability
that symptom j is present in the absence of any disease and qij the probability that disease i induces
symptom j . Using S = (S1 , . . . , Sn ) to denote symptoms and D = (D1 , . . . , Dm ) to denote diseases,
an instantiation s of S will be an n-vector where each entry is either 0 to denote that the symptom is
absent, or 1 to denote that it is present. Similarly, an instantiation d of D is an m-vector of 1's and
0's where 1 corresponds to presence and 0 to absence of the corresponding disease.
Under the modelling assumption, the probability that symptom j is absent, given a vector of
diseases d is

$$P_{S_j \mid D}(0 \mid d) = (1 - q_{0j}) \prod_{i=1}^{m} (1 - q_{ij})^{d_i}.$$

Another simplifying assumption is that an individual contracts different diseases independently of each
other. Under this assumption,

$$P_D = \prod_{i=1}^{m} P_{D_i}.$$

For the problem of classification, that is diagnosing diseases given a list of symptoms, these two
modelling assumptions come under the umbrella of independence of competing risks. This is a
simplification but, nevertheless, can produce an effective classifier.
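
The two modelling assumptions can be combined into a toy classifier. The following is a minimal sketch with m = 2 diseases and n = 2 symptoms; all numbers are hypothetical, and the sketch additionally assumes that the symptoms are conditionally independent given the diseases, which the text above does not state explicitly.

```python
from itertools import product
from math import prod

# Hypothetical parameters (not taken from the QMR-DT data base).
prior = [0.01, 0.05]      # P(D_i = 1)
leak = [0.02, 0.10]       # q_{0j}: symptom j present without any disease
q = [[0.8, 0.0],          # q[i][j]: probability that disease i induces symptom j
     [0.3, 0.6]]

def p_symptom_absent(j, d):
    """P(S_j = 0 | D = d) = (1 - q_0j) * prod_i (1 - q_ij)^(d_i)."""
    return (1 - leak[j]) * prod((1 - q[i][j]) ** d[i] for i in range(len(d)))

def p_diseases(d):
    """P(D = d) when the diseases are contracted independently."""
    return prod(prior[i] if d[i] else 1 - prior[i] for i in range(len(d)))

def p_symptoms_given(s, d):
    """P(S = s | D = d); assumes symptoms are independent given d."""
    return prod(p_symptom_absent(j, d) if s[j] == 0
                else 1 - p_symptom_absent(j, d)
                for j in range(len(s)))

def posterior(s):
    """P(D = d | S = s) for every d, by enumeration over all 2^m vectors d."""
    unnorm = {d: p_diseases(d) * p_symptoms_given(s, d)
              for d in product((0, 1), repeat=len(prior))}
    z = sum(unnorm.values())
    return {d: p / z for d, p in unnorm.items()}
```

For example, `posterior((1, 0))` returns the distribution over the four disease vectors when only the first symptom is observed. The enumeration is exponential in m, so it is only a device for checking small cases and is not feasible for a data base with hundreds of diseases.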

Noisy `or' as a causal network Consider the DAG given in Figure 1.6 where B = A1 ∨A2 ∨. . .∨An .
This is the logical `or' and there is no noise. The noise then enters, as in the DAG given in Figure 1.7,
by considering that if any of the variables Ai , i = 1, . . . , n is present, then B is present unless something
has inhibited it, the inhibitors on each variable acting independently of each other.

[Figure: DAG with a directed edge from each of A1 , A2 , . . . , An to B .]

Figure 1.6: A Logical `Or' Gate

Noisy `or': inhibitors Consider the DAG in Figure 1.7, where qi denotes the probability that the
impact of Ai is inhibited.

[Figure: DAG with a directed edge from each of A1 , A2 , . . . , An to B ; the edge from Ai is labelled 1 − qi .]

Figure 1.7: Noisy `Or' Junction

All variables are binary, and take value 1 if the cause, or effect, is present and 0 otherwise. In other
words, PB∣Ai (0∣1) = qi . The assumption from the DAG is that all the inhibitors are independent. This
implies that

$$P_{B \mid A_1, \ldots, A_n}(0 \mid a_1, \ldots, a_n) = \prod_{j \in Y} q_j,$$

where Y = {j ∈ {1, . . . , n}∣aj = 1}. This may be described by a noisy `or' gate.

Noisy `or' Gate The noisy `or' can be modelled directly, introducing the variables Bi , i = 1, . . . , n,
where Bi takes the value 1 if the cause Ai is on and it is not inhibited and 0 otherwise. The
corresponding DAG is given in Figure 1.8

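[Figure: nodes B1 , B2 , . . . , Bn in the top row, each linked to the node Ai below it by an edge labelled
1 − qi ; directed edges from the second row to B .]

Figure 1.8: Noisy `Or' Gate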

where

PB∣B1 ,...,Bn (1∣b1 , . . . , bn ) = b1 ∨ . . . ∨ bn .

The B1 , . . . , Bn are introduced as mutually independent inhibitors, and

PBi ∣Ai (0∣1) = qi ,

giving the result given above.
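
The agreement between the two descriptions can be checked by enumeration. The sketch below assumes n = 3 and hypothetical inhibition probabilities; it computes P(B = 0 | a) once from the product formula and once by summing over the intermediate variables B1 , . . . , Bn of the gate, taking P(Bi = 1 | Ai = 1) = 1 − qi and P(Bi = 1 | Ai = 0) = 0 as described above.

```python
from itertools import product
from math import prod

q = [0.1, 0.3, 0.5]  # hypothetical inhibition probabilities q_i

def noisy_or_absent(a):
    """P(B = 0 | a): product of q_j over the causes j that are present."""
    return prod(qj for qj, aj in zip(q, a) if aj == 1)

def noisy_or_absent_via_gate(a):
    """The same probability, obtained by summing over b = (b_1, ..., b_n)."""
    total = 0.0
    for b in product((0, 1), repeat=len(a)):
        weight = 1.0
        for bi, ai, qi in zip(b, a, q):
            p_on = (1 - qi) if ai == 1 else 0.0   # P(B_i = 1 | A_i = a_i)
            weight *= p_on if bi == 1 else 1 - p_on
        if not any(b):                            # B = b_1 or ... or b_n equals 0
            total += weight
    return total
```

The two functions return the same value for every instantiation a; when no cause is present the empty product gives P(B = 0 | a) = 1.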

Notes The models that were later to be called Bayesian networks were introduced into artificial
intelligence by J. Pearl, in the article [103] (from 1982). Within the Artificial Intelligence literature, this
is a seminal article. Perhaps the earliest work that uses directed graphs to represent possible
dependencies among random variables is that by S. Wright (1921) [146]. An early article that considered the
notion of a factorisation of a probability distribution along a directed acyclic graph representing causal
dependencies is that by H. Kiiveri, T.P. Speed and J.B. Carlin (1984) [74], where a Markov property
for Bayesian networks was defined. This was developed by J. Pearl in [106] (from 1990). D-separation,
and the extent to which it characterises independence, is discussed by J. Pearl and T. Verma in [112]
and by J. Pearl, D. Geiger and T. Verma in [111]. The Bayes ball is taken from R.D. Shachter [122].
The results for identifying independence in Bayesian networks are taken from D. Geiger, T. Verma and
J. Pearl [51].
1.6 Exercises
1. Let (X, Y, W, Z) be disjoint sets of random variables, each with a finite state space. Prove that
the following logical relations hold:

(a) decomposition Prove that if X ⊥ Y ∪ W ∣Z then X ⊥ Y ∣Z and X ⊥ W ∣Z .


(b) contraction Prove that if X ⊥ Y ∣Z and X ⊥ W ∣Y ∪ Z then X ⊥ W ∪ Y ∣Z .
(c) weak union Prove that if X ⊥ Y ∪ Z∣W then X ⊥ Y ∣Z ∪ W .
(d) intersection Prove that if X ⊥ Y ∣W ∪ Z and X ⊥ W ∣Y ∪ Z then X ⊥ W ∪ Y ∣Z .

2. Let (X, Y, W, Z) be four sets of nodes in a DAG G = (V, D). Prove the following;

(a) decomposition Prove that if X á Y ∪ W ∥G Z then X á Y ∥G Z and X á W ∥G Z .


(b) contraction Prove that if X á Y ∥G Z and X á W ∥G Y ∪ Z then X á W ∪ Y ∥G Z .
(c) weak union Prove that if X á Y ∪ Z∥G W then X á Y ∥G Z ∪ W .
(d) intersection Prove that if X á Y ∥G W ∪ Z and X á W ∥G Y ∪ Z then X á W ∪ Y ∥G Z .

3. Let X denote the state space for (X, Y, W, Z) and assume that PX,Y,W,Z (x, y, w, z) > 0 for each
(x, y, w, z) ∈ X . Does it hold in general that if X ⊥ Y ∣ Z ∪ W and X ⊥ W ∣ Z ∪ Y , then
X ⊥ Z ∣ Y ∪ W ? Either prove the result or illustrate why it is false.

4. Let V = A ∪ B ∪ S where A, B and S are disjoint subsets and suppose that A ⊥ B∣S . Prove that
for any α ∈ A and γ ∈ A ∪ S ,

α ⊥ γ∣(A ∪ S)/{α, γ} ⇔ α ⊥ γ∣(A ∪ B ∪ S)/{α, γ}.

5. Let A be a variable in a DAG. Prove that if all the variables in the Markov blanket of A are
instantiated, then A is d-separated from the remaining uninstantiated variables.

6. Prove Theorem 1.24.

7. Let G = (V, D) denote a directed acyclic graph. Let X ⊆ V , Y ⊆ V and Z ⊆ V denote sets of
nodes and let α, β, γ, δ ∈ V /X ∪ Y ∪ Z denote individual nodes.

(a) Prove that if X á Y ∥G Z and X á Y ∥G Z ∪ {γ} then either X á {γ}∥G Z or Y á {γ}∥G Z


(b) Prove that if α á β∥G {γ, δ} and γ á δ∥G {α, β} then either α á β∥G {γ} or α á β∥G {δ}.

8. The notation X A is used to denote the random (row) vector of all variables in set A. Let
V = {X1 , . . . , Xd } be the d variables of a Bayesian network and assume that X V /{Xi } = w. That
is, all the variables except Xi are instantiated. Assume that Xi is a binary variable, taking values
0 or 1. The odds of an event A given B is defined as:

$$O_P(A \mid B) = \frac{P(A \mid B)}{P(A^c \mid B)}$$

where $A^c$ denotes the complement of A. Consider the odds

$$O_P(\{X_i = 1\} \mid \{X_{V/\{X_i\}} = w\}),$$

and show that this depends only on the variables in the Markov blanket (Definition 1.22) of Xi .

1.7 Answers
1. (a) X ⊥ Y ∪ W ∣Z means PW,X,Y,Z (w, x, y, z) = PX∣Z (x∣z)PW,Y ∣Z (w, y∣z)PZ (z). Summing over W
gives PX,Y,Z (x, y, z) = PX∣Z (x∣z)PY ∣Z (y∣z)PZ (z), which is equivalent to X ⊥ Y ∣Z .
Similarly, summing over Y gives PW,X,Z (w, x, z) = PX∣Z (x∣z)PW ∣Z (w∣z)PZ (z), equivalent to
X ⊥ W ∣Z .
(b) X ⊥ Y ∣Z implies PX,Y,Z (x, y, z) = PX∣Z (x∣z)PY ∣Z (y∣z)PZ (z) and X ⊥ W ∣Y ∪ Z implies
PW,X,Y,Z (w, x, y, z) = PX∣Y,Z (x∣y, z)PW ∣Y,Z (w∣y, z)PY,Z (y, z). The first statement implies
that for (x, y, z) such that PX,Y,Z (x, y, z) > 0, PX∣Y,Z = PX∣Z , so, using PY,Z = PY ∣Z PZ , it
follows that

PW,X,Y,Z (w, x, y, z) = PX∣Z (x∣z)PW ∣Y,Z (w∣y, z)PY ∣Z (y∣z)PZ (z)


= PX∣Z (x∣z)PW,Y ∣Z (w, y∣z)PZ (z),

so that X ⊥ W ∪ Y ∣Z .
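(c)
$$P_{X,Y,Z,W} = \frac{P_{X,W}\, P_{W,Y,Z}}{P_W} = a_{X,W}\, b_{Y,Z,W}$$
where $a_{X,W} = P_{X,W}/P_W$ and $b_{Y,Z,W} = P_{W,Y,Z}$, so that X ⊥ Y ∣Z ∪ W from the characterisations of
independence.
(d)
$$P_{X,Y,W,Z} = \frac{P_{X,W,Z}\, P_{Y,W,Z}}{P_{W,Z}} = \frac{P_{X,Y,Z}\, P_{W,Y,Z}}{P_{Y,Z}}$$
and hence
$$\frac{P_{X,W,Z}}{P_{W,Z}} = \frac{P_{X,Y,Z}}{P_{Y,Z}}$$
so that
$$P_{X \mid W,Z} = P_{X \mid Y,Z} = P_{X \mid Z}$$
giving
$$P_{X,Y,W,Z} = \frac{P_{X,Z}\, P_{Y,W,Z}}{P_Z}$$
and hence
X ⊥ Y ∪ W ∣Z.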

2. (a) This is clear from the definition: Z blocks all trails between X and Y and all trails between
X and W .
(b) Consider α ∈ X and β ∈ W . Any trail α ↔ β has either an instantiated fork or chain node
in Y ∪ Z or an uninstantiated collider that is not in Y ∪ Z and has no descendants in Y ∪ Z . It
follows that such an uninstantiated collider is not in Z and that none of its descendants are in Z .
If the trail has an instantiated fork or chain node in Y , then the trail from α to the instantiated
fork or chain node in Y is blocked by Z since X á Y ∥G Z . Hence X á W ∪ Y ∥G Z .
(c) Let α ∈ X and β ∈ Y . Any trail is blocked by W . That is, it has either a fork or chain node
in W or a collider node that is not in W and has no descendants in W .
If it is blocked by a chain or fork in W , then the trail is also blocked by Z ∪ W . Consider
the first collider on the trail, proceeding from α, not in W , with no descendants in W , that
is either in Z or has a descendant in Z . Then the trail between α and the node in Z is
blocked by W since X á Y ∪ Z∥G W . Since neither the collider nor any of its descendants
are in W , it follows that the trail between α and the collider node is blocked by W , from
which it follows that it has a chain or fork in W , from which it follows that X á Y ∥G Z ∪ W .
(d) Let α ∈ X and β ∈ Y . Any trail between them with no other nodes in X or Y is blocked by
W ∪ Z . That is, it has either a fork or chain node in W ∪ Z or a collider not in W ∪ Z with
no descendants in W ∪ Z . Such a collider is therefore not in Z and has no descendants in
Z.
Assume that the trail blocked by W ∪ Z is not blocked by Z . Let γ be the first fork or
chain node along the trail that is in W . This trail is Z -active, but is blocked by Y ∪ Z .
It therefore contains a fork or chain node in Y , contradicting the assertion that α was the
only node in X and β the only node in Y on the trail.

3. The result stated is false. Counterexample: any distribution that factorises as

PZ PX∣Z PW ∣Z PY ∣W,Z

clearly satisfies X ⊥ Y ∣Z ∪ W and X ⊥ W ∣Y ∪ Z , but there are distributions with such a
factorisation that do not satisfy X ⊥ Z∣Y ∪ W .

4. Since A ⊥ B∣S , it follows from the weak union result in Exercise 1 that α ⊥ B∣A ∪ S/{α}. This,
together with the condition α ⊥ γ∣A ∪ S/{α, γ} imply (using X = {α}, W = B , Z = A ∪ S/{α, γ},
Y = {γ} in the contraction statement Exercise 1) that

α ⊥ γ∣A ∪ B ∪ S/{α, γ}.

as required.
Now suppose that α ⊥ γ∣(A ∪ B ∪ S)/{α, γ}. Since A ⊥ B∣S , it follows that α ⊥ B∣A ∪ S/{α}.
This, together with the condition, give (using X = {α}, Y = B , W = {γ} and Z = A ∪ S/{α, γ} in
the intersection statement of Exercise 1) that α ⊥ B ∪ {γ}∣A ∪ S/{α, γ} as required.

5. Recall the definition of the Markov blanket: the parents of A, the children of A and any variables
sharing a child with A. Consider the `Bayes Ball' algorithm, started at A. The ball cannot travel
through an instantiated chain or fork connection, nor can it travel through a collider where neither
the collider nor any of its descendants is instantiated. Otherwise, it can travel through a node along
the graph.
Therefore: if all variables in the Markov blanket are instantiated, the Bayes ball cannot pass through
any of the parents (by definition, the connection is necessarily chain or fork). It cannot pass
through a child to any offspring of the child (the connection is necessarily chain). If it passes
through an instantiated child to another parent of the instantiated child, it cannot pass further:
the connection at the instantiated parent of the instantiated child is either chain or fork.

6. Firstly, clearly if there is an edge between α and β , then α − β is an S -active trail for any
S ⊆ V /{α, β}. If there is no edge between α and β , then there are two cases. Firstly, if α ∈/ M B(β)
(where MB denotes Markov blanket), then α á β∥G M B(β). If α ∈ M B(β), but there is no edge
between α and β , then α and β are parents of a common child. Let C denote the set of variables
that are common children of α and β . Let

VC = C ∪ {δ ∣ there is a directed path γ → δ some γ ∈ C}.

Let S = V /({α, β}∪VC ), then α á β∥G S . Then any trail between α and β through a common child
is blocked by virtue of an uninstantiated collider where none of the descendants are instantiated.
Any trail with a common ancestor is blocked by virtue of an instantiated fork. On any trail
where α is an ancestor of β or β an ancestor of α, there is an instantiated chain connection.

7. (a) All trails between X and Y contain either a fork or chain node in Z , or collider not in Z
with no descendants in Z . When Z and γ are instantiated, there is no trail between X and
Y where all the colliders are either instantiated or have an instantiated descendant and all
chain and fork connections are uninstantiated.
Suppose that X á / {γ}∥G Z and Y á/ {γ}∥G Z . Then for any x ∈ X and any y ∈ Y there is
a Z -active trail between x and γ and a Z -active trail between y and γ . Consider the trail
between x and y formed by joining the two. If γ is a chain or fork node, then the trail is
active when γ is not instantiated, contradicting X á Y ∥G Z .

(b) Assume the result is not true and that α á/ β∥G {γ} and α á/ β∥G {δ} and that α á β∥G {γ, δ}.
Then there is a {γ}-active trail between α and β with δ as a fork or chain node. Assume
that there is a collider node on the trail with γ as a descendant, then there is a collider ρ
and a trail δ → ρ1 → . . . → ρn → ρ and hence a directed path from δ to γ that does not
contain α or β contradicting γ á δ∥G {α, β}. It follows that there is a trail between α and β
containing δ with only fork and chain connections. Similarly, there is a trail between α and
β containing γ with only forks and chains. Then there is a trail between δ and γ containing
α with at most one collider α and another trail between δ and γ containing β with at most
one collider {β}. If δ á γ∣{α, β} then neither of these are colliders and hence there is a
cycle, hence a contradiction.

8. This is a direct consequence of the definition. Let x = (x1 , . . . , xd ) and y = (y1 , . . . , yd ) where
yj = xj = wj for j ≠ i, xi = 1, yi = 0. Let πj (x) denote the parent configuration for variable j
when X = x. Then, since

$$O_P(\{X_i = 1\} \mid \{X_{V/\{X_i\}} = w\}) = \frac{P(\{X_i = 1\}, \{X_{V/\{X_i\}} = w\})}{P(\{X_i = 0\}, \{X_{V/\{X_i\}} = w\})},$$

it follows that

$$O_P(\{X_i = 1\} \mid \{X_{V/\{X_i\}} = w\}) = \frac{\prod_{j=1}^{d} P_{X_j \mid \mathrm{Pa}_j}(x_j \mid \pi_j(x))}{\prod_{j=1}^{d} P_{X_j \mid \mathrm{Pa}_j}(y_j \mid \pi_j(y))} = \frac{P_{X_i \mid \mathrm{Pa}_i}(1 \mid \pi_i(x)) \prod_{j : X_i \in \mathrm{Pa}_j} P_{X_j \mid \mathrm{Pa}_j}(x_j \mid \pi_j(x))}{P_{X_i \mid \mathrm{Pa}_i}(0 \mid \pi_i(y)) \prod_{j : X_i \in \mathrm{Pa}_j} P_{X_j \mid \mathrm{Pa}_j}(y_j \mid \pi_j(y))}$$

and, from the definition, this only involves the Markov blanket of Xi ; PXi ∣Pai involves Xi and
the parents of Xi , the other conditional probabilities involve the children of Xi and their parents.

Chapter 2

Markov models and Markov equivalence

2.1 I-maps and Markov equivalence


If a probability distribution factorises according to a directed acyclic graph, then any D-separation
statement in the graph implies the corresponding conditional independence statement for the distribu-
tion. If each D-separation statement for a DAG G implies the corresponding conditional independence
statement for a probability distribution P, then P is said to belong to the Markov model of G .

Definition 2.1 (Markov Model). Let V = {1, . . . , d} and let G = (V, D) be a directed acyclic graph with
node set V and directed edge set D. Let 𝒱 denote the set of all subsets of V . The Markov Model
MG of G = (V, D) is the set of triples (A, B, S) ∈ 𝒱 × 𝒱 × 𝒱 , with A, B, S disjoint, such that the
D-separation statement A á B∥G S holds in the DAG. That is,

MG = {(A, B, S) ∈ 𝒱 × 𝒱 × 𝒱 ∣ A, B, S disjoint, A á B∥G S}. (2.1)

Let P be a probability distribution of a random vector X = (X1 , . . . , Xd ), whose components are indexed
by V . Let I(P) denote the entire set of conditional independence statements associated with P;

I(P) = {(A, B, S) ∈ 𝒱 × 𝒱 × 𝒱 ∣ A, B, S disjoint, XA ⊥ XB ∣XS } (2.2)

where, for any set C ⊆ V , XC denotes the sub-vector of random variables indexed by C . The convention
is that if S = ∅ (the empty set) then XA ⊥ XB ∣XS means XA ⊥ XB . A distribution P is said to belong
to the Markov Model of G , written P ∈ MG , if and only if MG ⊆ I(P). The Markov model is the
set of conditional independence relations satisfied by all distributions that are locally G -Markovian
(Definition 1.26).
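
For a small DAG the Markov model can be listed explicitly. The sketch below is one possible implementation, not the method used in the text: it decides D-separation by the standard equivalent criterion of separation in the moral graph of the smallest ancestral set containing A ∪ B ∪ S , and it restricts attention to non-empty A and B . The dictionary representation (node mapped to its tuple of parents) and the function names are assumptions made for illustration.

```python
from itertools import combinations, product

def ancestors_of(nodes, parents):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for p in parents[v]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(a, b, s, parents):
    """True if the disjoint node sets a and b are D-separated by s."""
    keep = ancestors_of(set(a) | set(b) | set(s), parents)
    # Moralise the ancestral subgraph: join parent-child and co-parent pairs.
    nbrs = {v: set() for v in keep}
    for v in keep:
        pa = list(parents[v])          # parents of v lie in keep by construction
        for p in pa:
            nbrs[v].add(p)
            nbrs[p].add(v)
        for p1, p2 in combinations(pa, 2):
            nbrs[p1].add(p2)
            nbrs[p2].add(p1)
    # Search outwards from a without entering s; fail if b is reached.
    seen, stack = set(a), list(a)
    while stack:
        v = stack.pop()
        for w in nbrs[v] - set(s):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return not (seen & set(b))

def markov_model(parents):
    """All triples (A, B, S), A and B non-empty, with A D-separated from B by S."""
    nodes = sorted(parents)
    model = set()
    for labels in product("ABS-", repeat=len(nodes)):
        a = frozenset(n for n, l in zip(nodes, labels) if l == "A")
        b = frozenset(n for n, l in zip(nodes, labels) if l == "B")
        s = frozenset(n for n, l in zip(nodes, labels) if l == "S")
        if a and b and d_separated(a, b, s, parents):
            model.add((a, b, s))
    return model

chain = {1: (), 2: (1,), 3: (2,)}        # 1 -> 2 -> 3
collider = {1: (), 2: (1, 3), 3: ()}     # 1 -> 2 <- 3
```

For the chain, `markov_model(chain)` contains only the statement that 1 and 3 are separated by 2 (in both orders), whereas for the collider 1 and 3 are separated by the empty set but not by {2}. The enumeration visits 4^d labellings, so it is only practical for very small d.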

If a distribution P factorises along a DAG G = (V, D), then the collection of triples MG defined in
Equation (2.1) of Definition 2.1 represents the entire set of conditional independence statements that it is
possible to infer from the DAG. Clearly, this collection does not, in general, represent the complete set
of conditional independence statements that hold for P. In fact, the probability distributions modelling
real world situations, corresponding to data sets, very rarely factorise along a DAG whose D-separation
statements encode the entire set of independence statements. When it does hold, the DAG is known as
a perfect I -map.

Definition 2.2 (Perfect I -Map, Faithful). Let G = (V, D) be a DAG with node set V = {1, . . . , d} which
indexes a random vector X = (X1 , . . . , Xd ). Let 𝒱 denote the set of all subsets of V . The DAG G is
known as a perfect I -map for a probability function P over X if and only if I(P) = MG . A DAG G
such that I(P) = MG is said to be faithful to P.

A DAG G = (V, D) is consistent with I(P) dened by Equation (2.2) if and only if G is faithful to P,
if and only if G is a perfect I -map of P.

A DAG that is faithful to a probability distribution P satisfies the following important property:

Theorem 2.3. Let G = (V, D) be faithful to a probability distribution P. Then the edge set D contains
an edge between two nodes α and β if and only if Xα ⊥/ Xβ ∣XS for every S ⊆ V /{α, β}.

Proof This is a straightforward consequence of Theorem 1.24, that the edge set D contains an edge
between α and β if and only if α á/ β∥G S for every S ⊆ V /{α, β}.

A set of variables (X1 , . . . , Xd ) may be ordered in d! ways. Each permutation σ of 1, . . . , d gives
an ordering (Xσ(1) , . . . , Xσ(d) ) and for each permutation, the distribution may be factorised according
to a Bayesian network, represented by the corresponding directed acyclic graph.
The input for the construction of the directed acyclic graph consists of a list of d conditional
independence statements, one for each variable, all of the form Xσ(j) ⊥ Σσj ∣Paσj , where

Σσj = {Xσ(1) , . . . , Xσ(j−1) }/Paσ (j).

This is the set of σ -predecessors of Xσ(j) , without Paσ (j).


For a given collection of variables V = {X1 , . . . , Xd }, there may be several different DAGs, each
with exactly the same Markov model M. Two DAGs which determine exactly the same Markov model
are said to be I -equivalent.

Definition 2.4 (I -sub-map, I -map, I -equivalence, Markov Equivalence). Let G1 and G2 be two DAGs
with the same node set. The DAG G1 is said to be an I -sub-map of G2 if MG1 ⊆ MG2 . They are said
to be I -equivalent if MG1 = MG2 . I -equivalence is also known as Markov equivalence.

Example 2.5.

In the following example on three variables, all three factorisations give the same independence struc-
ture. Consider a probability distribution PX1 ,X2 ,X3 with factorisation

PX1 ,X2 ,X3 = PX1 PX2 ∣X1 PX3 ∣X2 .

It follows that

PX1 ,X2 ,X3 = PX2 PX1 ∣X2 PX3 ∣X1 ,X2 = PX2 PX1 ∣X2 PX3 ∣X2 ,

using X1 ⊥ X3 ∣X2 . Also,

PX1 ,X2 ,X3 = PX3 PX2 ∣X3 PX1 ∣X2 ,X3 = PX3 PX2 ∣X3 PX1 ∣X2 ,

since X1 ⊥ X3 ∣X2 . For the first and last of these, X2 is a chain node, while in the second of these X2
is a fork node. The conditional independence structure associated with chains and forks is the same.
The three corresponding DAGs are given in Figure 2.1.
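
The equality of the three factorisations can be confirmed numerically. The following is a minimal sketch with hypothetical conditional probability tables for the chain X1 → X2 → X3 ; it builds the joint distribution from the first factorisation, computes the marginals needed for the other two from that joint, and compares.

```python
from itertools import product

# Hypothetical tables for the chain X1 -> X2 -> X3 (binary variables).
p1 = {0: 0.6, 1: 0.4}
p2_given_1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # p2_given_1[x1][x2]
p3_given_2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}   # p3_given_2[x2][x3]

joint = {(x1, x2, x3): p1[x1] * p2_given_1[x1][x2] * p3_given_2[x2][x3]
         for x1, x2, x3 in product((0, 1), repeat=3)}

def marg(keep):
    """Marginal over the coordinates listed in keep (0-based positions)."""
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

m2, m3, m12, m23 = marg([1]), marg([2]), marg([0, 1]), marg([1, 2])

# Fork at X2: P(x2) P(x1 | x2) P(x3 | x2).
fork = {(x1, x2, x3):
        m2[(x2,)] * (m12[(x1, x2)] / m2[(x2,)]) * (m23[(x2, x3)] / m2[(x2,)])
        for x1, x2, x3 in joint}

# Reversed chain: P(x3) P(x2 | x3) P(x1 | x2).
reverse = {(x1, x2, x3):
           m3[(x3,)] * (m23[(x2, x3)] / m3[(x3,)]) * (m12[(x1, x2)] / m2[(x2,)])
           for x1, x2, x3 in joint}
```

Both `fork` and `reverse` agree with `joint` up to rounding, which reflects the fact that the only conditional independence used is X1 ⊥ X3 ∣X2 .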

[Figure: three DAGs on the nodes 1, 2, 3: 1 → 2 → 3, 1 ← 2 → 3 and 1 ← 2 ← 3.]

Figure 2.1: Three DAGs, each with the same D-separation structure

In general, the factorisations resulting from different orderings of the variables will not necessarily give
I -equivalent maps. This is illustrated by the following example on four variables.

Example 2.6.

Consider a probability distribution over four variables, which may be factorised as

PX1 ,X2 ,X3 ,X4 = PX1 PX2 PX3 ∣X1 ,X2 PX4 ∣X3 .

The DAG associated with the factorisation is the one on the left in Figure 2.2. Assume that the DAG
on the left in Figure 2.2 and PX1 ,X2 ,X3 ,X4 are faithful to each other. The factorisation obtained using
the ordering (X1 , X4 , X3 , X2 ) is:

PX1 ,X2 ,X3 ,X4 = PX1 PX4 ∣X1 PX3 ∣X1 ,X4 PX2 ∣X1 ,X3 ,X4 = PX1 PX4 ∣X1 PX3 ∣X1 ,X4 PX2 ∣X1 ,X3 .

This is the factorisation corresponding to a Bayesian network since

• X1 ⊥/ X4 ,

• X3 ⊥/ {X1 , X4 }/Θ∣Θ for any Θ ⊆ {X1 , X4 } and

• X2 ⊥ X4 ∣{X1 , X3 }, but X2 ⊥/ {X1 , X3 , X4 }/Θ∣Θ for any strict subset Θ ⊂ {X1 , X3 }.

[Figure: two DAGs on the nodes 1, 2, 3, 4. Left: 1 → 3, 2 → 3, 3 → 4. Right: 1 → 2, 1 → 3, 1 → 4,
4 → 3, 3 → 2.]

Figure 2.2: DAGs with different D-separation properties, corresponding to different factorisations of
the same distribution

The corresponding DAG (the graph on the right of Figure 2.2) gives less information on conditional
independence; X4 á / X1 ∥G2 X3 , using G2 to denote the DAG on the right in Figure 2.2. The two
corresponding DAGs are shown in Figure 2.2. The graph on the right is a strict I -sub-map of the
graph on the left.

This example illustrates that while D-separated variables are conditionally independent conditioned
on the separating set, it does not hold that conditionally independent variables are necessarily D-
separated.

2.1.1 Properties of Conditional Independence and D-Separation


For a probability distribution P over a set of variables, the collection of conditional independence
statements I(P) satisfies the following: Let (X, Y, W, Z) be disjoint sets of random variables, then the
following relations hold:

1. decomposition If X ⊥ Y ∪ W ∣Z then X ⊥ Y ∣Z and X ⊥ W ∣Z .

2. contraction If X ⊥ Y ∣Z and X ⊥ W ∣Y ∪ Z then X ⊥ W ∪ Y ∣Z .

3. weak union If X ⊥ Y ∪ Z∣W then X ⊥ Y ∣Z ∪ W .

4. intersection If X ⊥ Y ∣W ∪ Z and X ⊥ W ∣Y ∪ Z then X ⊥ W ∪ Y ∣Z .

These relations are discussed by Pearl in [105] (1988). The proofs of these are quite straightforward
and have been left as exercises (Exercise 1 page 22).
The Markov model MG for a DAG G also satisfies the following: let (X, Y, W, Z) be four sets of
nodes in a DAG G = (V, D), then the following relations hold:

1. decomposition If X á Y ∪ W ∥G Z then X á Y ∥G Z and X á W ∥G Z .

2. contraction If X á Y ∥G Z and X á W ∥G Y ∪ Z then X á W ∪ Y ∥G Z .

3. weak union If X á Y ∪ Z∥G W then X á Y ∥G Z ∪ W .



4. intersection If X á Y ∥G W ∪ Z and X á W ∥G Y ∪ Z then X á W ∪ Y ∥G Z .

These have been left as exercises (Exercise 2 page 22). A collection of triples (Xi , Yi , Si )i∈I , where I
denotes the indexing set and, for each i, Xi , Yi , Si are mutually disjoint subsets of V , which satisfies these
four conditions has come to be known as a graphoid. These statements do not axiomatise conditional
independence; if a given set of triples satisfies these four conditions, there does not necessarily exist a
probability distribution P for which the set of conditional independence statements is I(P). Conditional
independence cannot be axiomatised; this was proved by Studeny [130].

A collection of D-separation statements for a DAG also satisfies the composition property:

5. composition If both X á Y ∥G S and X á Z∥G S hold, then X á Y ∪ Z∥G S .

A graphoid that also satisfies composition is known as a compositional graphoid. The Markov model
of a DAG MG is always a compositional graphoid; the collection of independence statements I(P) is
not necessarily a compositional graphoid.

A Markov model MG also satisfies the following two properties, which are not necessarily satisfied by
a collection of conditional independence statements I(P).

• Let V = A ∪ B ∪ S where A, B and S are disjoint subsets and suppose that A á B∥G S . Then for
any α ∈ A and γ ∈ A ∪ S ,

α á γ∥G (A ∪ S)/{α, γ} ⇔ α á γ∥G (A ∪ B ∪ S)/{α, γ}.

• Let G = (V, D) denote a directed acyclic graph. Let X ⊆ V , Y ⊆ V and Z ⊆ V denote sets of
nodes and let α, β, γ, δ ∈ V /X ∪ Y ∪ Z denote individual nodes.

- If X á Y ∥G Z and X á Y ∥G Z ∪ {γ} then either X á {γ}∥G Z or Y á {γ}∥G Z .

- If α á β∥G {γ, δ} and γ á δ∥G {α, β} then either α á β∥G {γ} or α á β∥G {δ}.

The proofs of these statements are left as exercises. They are included here simply as illustration of the
additional structure that is required for a Markov model MG over and above the set of independence
statements I(P) for a probability distribution that factorises over G .

The following basic example illustrates a situation where composition does not hold for the probability
distribution and where there is no faithful DAG.

Example 2.7 (Tossing Three Coins).


Let Y1 , Y2 , Y3 be three independent identically distributed binary variables, with probability function
PY (0) = PY (1) = 1/2. Let

$$X_1 = \begin{cases} 1 & Y_2 = Y_3 \\ 0 & Y_2 \neq Y_3 \end{cases} \qquad
X_2 = \begin{cases} 1 & Y_1 = Y_3 \\ 0 & Y_1 \neq Y_3 \end{cases} \qquad
X_3 = \begin{cases} 1 & Y_1 = Y_2 \\ 0 & Y_1 \neq Y_2 \end{cases}$$

Then X1 , X2 , X3 provide the classic example of three random variables that are pairwise independent,
but not jointly independent.

$$P_{X_1,X_2,X_3}(1,1,1) = P(Y_1 = Y_2 = Y_3) = P_{Y_1,Y_2,Y_3}(1,1,1) + P_{Y_1,Y_2,Y_3}(0,0,0) = \frac{1}{4}$$

$$P_{X_1,X_2,X_3}(1,1,0) = P_{X_1,X_2,X_3}(1,0,1) = P_{X_1,X_2,X_3}(0,1,1) = P(Y_2 = Y_3 = Y_1, Y_1 \neq Y_2) = 0$$

$$P_{X_1,X_2,X_3}(1,0,0) = P_{X_1,X_2,X_3}(0,1,0) = P_{X_1,X_2,X_3}(0,0,1) = P(Y_2 = Y_3, Y_1 \neq Y_3, Y_1 \neq Y_2)$$
$$= P(Y_1 = 1, Y_2 = Y_3 = 0) + P(Y_1 = 0, Y_2 = Y_3 = 1) = \frac{1}{4}$$

$$P_{X_1,X_2,X_3}(0,0,0) = 0$$

It follows that

$$P_{X_1,X_2}(1,1) = P_{X_1,X_2}(1,0) = P_{X_1,X_2}(0,1) = P_{X_1,X_2}(0,0) = \frac{1}{4}$$

so that $P_{X_1}(1) = P_{X_1}(0) = \frac{1}{2}$ and in all cases

$$P_{X_1,X_2} = P_{X_1} P_{X_2}.$$

But

$$\frac{1}{4} = P_{X_1,X_2,X_3}(1,1,1) \neq P_{X_1}(1) P_{X_2}(1) P_{X_3}(1) = \frac{1}{8}.$$
Since X1 ⊥ X2 but X3 ⊥/ {X1 , X2 }, X3 ⊥/ X1 ∣X2 and X3 ⊥/ X2 ∣X1 , it follows that the factorisation
obtained for the distribution PX1 ,X2 ,X3 is

PX1 ,X2 ,X3 = PX1 PX2 PX3 ∣X1 ,X2 .

In the corresponding DAG, X1 á X2 ∥G ∅ holds, but X1 á/ X3 ∥G ∅ and X2 á/ X3 ∥G ∅, even though the
independence statements X1 ⊥ X3 and X2 ⊥ X3 hold.
By considering other orderings of the variables, the other possible factorisations are

PX1 ,X2 ,X3 = PX1 PX3 PX2 ∣X1 ,X3 = PX2 PX3 PX1 ∣X2 ,X3

and in none of the cases do the D-separation statements of the DAG corresponding to the Bayesian
Network represent all the conditional independence statements of the distribution.
The type of situation described here, where the distribution does not satisfy a composition property,
can be summarised as follows: it is the situation where X1 tells you nothing about X3 and X2 tells you
nothing about X3 , but X1 and X2 taken together tell you everything about X3 . This is the principle on
which any good detective novel is based, as Edward Nelson puts it in his book `Radically Elementary
Probability Theory' [100].
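
The probabilities in Example 2.7 can be reproduced by direct enumeration. The sketch below uses exact rational arithmetic; the helper names are illustrative only.

```python
from itertools import product
from fractions import Fraction

# Joint distribution of (X1, X2, X3) induced by three fair coins Y1, Y2, Y3.
joint = {}
for y1, y2, y3 in product((0, 1), repeat=3):
    x = (int(y2 == y3), int(y1 == y3), int(y1 == y2))
    joint[x] = joint.get(x, Fraction(0)) + Fraction(1, 8)

def prob(**fixed):
    """Probability that the named variables (x1, x2, x3) take the given values."""
    idx = {"x1": 0, "x2": 1, "x3": 2}
    return sum(p for x, p in joint.items()
               if all(x[idx[k]] == v for k, v in fixed.items()))

# Pairwise independence: P(a = u, b = v) = P(a = u) P(b = v) for every pair.
pairwise_independent = all(
    prob(**{a: u, b: v}) == prob(**{a: u}) * prob(**{b: v})
    for a, b in (("x1", "x2"), ("x1", "x3"), ("x2", "x3"))
    for u, v in product((0, 1), repeat=2)
)

# Joint independence fails already at the point (1, 1, 1).
jointly_independent = (prob(x1=1, x2=1, x3=1)
                       == prob(x1=1) * prob(x2=1) * prob(x3=1))
```

Here `pairwise_independent` evaluates to true and `jointly_independent` to false, matching the values 1/4 and 1/8 computed in the example.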

The argument shows that in experimental design situations where there are interaction effects, but
no main effects (e.g. each chemical taken separately does not produce an effect, but the interaction
between two chemicals causes an effect), composition will not hold and there will not exist a faithful
DAG.
There is a whole industry of structure learning algorithms, based on the principle of Theorem 2.3,
which delete an edge as soon as a conditioning set S is found such that X ⊥ Y ∣S . These algorithms are
elegant, cost-effective, efficient, and return accurate results if the underlying distribution has a faithful
graph. They are discussed in chapter 16. Their drawback is that they produce wildly inaccurate
results when there does not exist an underlying faithful graph. Note that if such a structure learning
algorithm were applied to the three-coin example above, where X1 ⊥ X2 , X1 ⊥ X3 and X2 ⊥ X3 , such
an algorithm would remove all the edges based on the results of conditioning on S = ∅ and return the
empty graph. The model delivered by the algorithm would then be the independence model, which
represents a disastrous failure.
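
This failure can be reproduced with a deliberately simplified edge-deletion procedure. The sketch below is not any particular published algorithm: it assumes an oracle that tests conditional independence exactly from the known joint distribution of the three-coin example, starts from the complete graph and deletes an edge as soon as some separating set is found.

```python
from itertools import combinations, product
from fractions import Fraction

# Exact joint of (X1, X2, X3) from the three-coin example, indexed 0, 1, 2.
joint = {}
for y in product((0, 1), repeat=3):
    x = (int(y[1] == y[2]), int(y[0] == y[2]), int(y[0] == y[1]))
    joint[x] = joint.get(x, Fraction(0)) + Fraction(1, 8)

def marg(assign):
    """Probability that the variables in assign take the given values."""
    return sum(p for x, p in joint.items()
               if all(x[i] == v for i, v in assign.items()))

def indep(i, j, s):
    """Exact test of X_i indep. of X_j given X_s: P(i,j,s)P(s) = P(i,s)P(j,s)."""
    for vals in product((0, 1), repeat=2 + len(s)):
        cond = dict(zip(s, vals[2:]))
        lhs = marg({i: vals[0], j: vals[1], **cond}) * marg(cond)
        rhs = marg({i: vals[0], **cond}) * marg({j: vals[1], **cond})
        if lhs != rhs:
            return False
    return True

def skeleton_by_edge_deletion(n):
    """Start from the complete graph; drop i - j once a separating set is found."""
    edges = set(combinations(range(n), 2))
    for i, j in sorted(edges):
        others = [k for k in range(n) if k not in (i, j)]
        for size in range(len(others) + 1):
            if any(indep(i, j, s) for s in combinations(others, size)):
                edges.discard((i, j))
                break
    return edges
```

On this distribution `skeleton_by_edge_deletion(3)` returns the empty edge set, because every pair is independent given S = ∅, even though, for instance, `indep(0, 1, (2,))` is false.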

Example 2.8 (Hidden Variables).

In many causal situations, the set of variables X may be split into observable variables Z and unobserv-
able variables, U . Usually, the observable variables are descendants of the unobservable; observations
are made on the observable and, from these observations, inferences made about the unobservable. For
example, the variables of Z could represent symptoms, while those of U could represent the diseases
that cause the symptoms.
Even if there is a faithful DAG corresponding to the full set of variables X = (U, Z), there is often
no faithful DAG corresponding to the observable variables Z . For example, consider a probability
distribution over the 5 variables {U, Z1 , Z2 , Z3 , Z4 } which factorises according to

PU,Z1 ,Z2 ,Z3 ,Z4 = PU PZ3 PZ4 ∣Z3 PZ1 ∣U,Z3 PZ2 ∣U,Z4

and suppose the corresponding graph given by Figure 2.3 is faithful. In this example, there is no faithful
DAG for the distribution over (Z1 , Z2 , Z3 , Z4 ); the set of D-separation statements for any DAG along
which the distribution can be factorised will be a strict subset of the set of conditional independence
statements. Two examples of factorisations are:

PZ1 ,Z2 ,Z3 ,Z4 = PZ1 PZ2 ∣Z1 PZ3 ∣Z1 ,Z2 PZ4 ∣Z1 ,Z2 ,Z3

PZ1 ,Z3 ,Z4 ,Z2 = PZ1 PZ3 ∣Z1 PZ4 ∣Z3 PZ2 ∣Z1 ,Z3 ,Z4

When all 24 permutations are considered, either an edge Z2 ∼ Z3 or an edge Z1 ∼ Z4 will be present,
even though Z2 ⊥ Z3 ∣Z4 and Z1 ⊥ Z4 ∣Z3 . None of the DAGs corresponding to the Bayesian Networks
of all 24 possible orderings of the variables will represent all the CI statements.
36 CHAPTER 2. MARKOV MODELS AND MARKOV EQUIVALENCE

[DAG with edges U → Z1, U → Z2, Z3 → Z1, Z3 → Z4, Z4 → Z2]

Figure 2.3: Faithful DAG, U1 hidden

2.2 Characterisation of Markov Equivalence


When trying to fit a graphical model to data, all the DAGs in a Markov equivalence class will fit
the data equally well and efficient algorithms for finding the structure will therefore only examine the
different equivalence classes, rather than all the different possible DAGs.

[DAG with edges X1 → X2 and X3 → X2]

Figure 2.4: Not Markov Equivalent to the Graphs in Figure 2.1

Figure 2.1 shows three directed acyclic graphs, each with the same D-separation statements;
X1 ⊥ X3 ∥G X2 holds and none of these DAGs admits any other D-separation statement. The D-separation
statements of the DAG in Figure 2.4 are different from those for the DAGs in Figure 2.1.
The key result in this section characterising Markov equivalence is Theorem 2.11, which states
that the two features of a directed acyclic graph which are necessary and sufficient for determining its
Markov structure are its immoralities and its skeleton. These are defined below.

Definition 2.9 (Immorality). Let G = (V, E) be a graph. Let E = D ∪ U, where D is the set of directed
edges, U is the set of undirected edges and D ∩ U = ∅. An immorality in a graph is a triple (α, β, γ)
such that (α, β) ∈ D and (γ, β) ∈ D, but (α, γ) ∉ D, (γ, α) ∉ D and ⟨α, γ⟩ ∉ U.

Definition 2.10 (Skeleton). The skeleton of a graph G = (V, E) is the graph obtained by making the
graph undirected. That is, the skeleton of G is the graph G̃ = (V, Ẽ), where ⟨α, β⟩ ∈ Ẽ ⇔ (α, β) ∈ D or
(β, α) ∈ D or ⟨α, β⟩ ∈ U.

The characterisation is given by the following theorem.

Theorem 2.11. Two DAGs are Markov equivalent if and only if they have the same skeleton and the
same immoralities.
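
The criterion of Theorem 2.11 is straightforward to check mechanically. The following is a minimal sketch, assuming a DAG is represented as a set of (parent, child) pairs over a common node set; the function names are illustrative and not taken from the text.

```python
from itertools import combinations


def skeleton(dag):
    """Undirected edge set of a DAG given as a set of (parent, child) pairs."""
    return {frozenset(edge) for edge in dag}


def immoralities(dag):
    """Triples (a, b, c) with a -> b <- c and a, c non-adjacent (a < c)."""
    skel = skeleton(dag)
    parents = {}
    for a, b in dag:
        parents.setdefault(b, set()).add(a)
    found = set()
    for b, pa in parents.items():
        for a, c in combinations(sorted(pa), 2):
            if frozenset((a, c)) not in skel:
                found.add((a, b, c))
    return found


def markov_equivalent(g1, g2):
    """Theorem 2.11: same skeleton and same immoralities."""
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)


chain = {("X1", "X2"), ("X2", "X3")}
fork = {("X2", "X1"), ("X2", "X3")}
collider = {("X1", "X2"), ("X3", "X2")}
```

Here `markov_equivalent(chain, fork)` holds, while the collider has the same skeleton but the additional immorality (X1, X2, X3) and is therefore not equivalent to either.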

The key to establishing Theorem 2.11 will be to consider the active trails (Definition 1.17) in the graph.
The following two definitions are also required.

Definition 2.12 (S-active node). Let G = (V, E) be a Directed Acyclic Graph and let S ⊂ V. Recall
the definition of a trail (Definition 1.4) and the definition of an active trail (Definition 1.17). A node
α ∈ V is said to be S-active if either α ∈ S or there is a directed path from the node α to a node β ∈ S.

Definition 2.13 (Minimal S-active trail). Let G = (V, E) be a Directed Acyclic Graph and let S ⊂ V.
An S-active trail τ in G between two nodes α and β is said to be a minimal S-active trail if it satisfies
the following two properties:

1. if k is the number of nodes in the trail, the first node is α and the k th node is β, then there does
not exist an S-active trail between α and β with fewer than k nodes and

2. there does not exist a different S-active trail ρ between α and β with exactly k nodes such that
for all 1 < j < k either ρj = τj or ρj is a descendant of τj.
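
The notions of S-active node and S-active trail can be sketched as follows. This is an illustration under stated assumptions: a DAG is a set of (parent, child) pairs, the trail is supplied as a list of consecutive neighbours (this is not verified), and only interior nodes are tested, which is one common reading of Definition 1.17.

```python
def descendants(dag, node):
    """All nodes reachable from `node` by a directed path of length >= 1."""
    children = {}
    for a, b in dag:
        children.setdefault(a, set()).add(b)
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def is_s_active_node(dag, node, S):
    """Definition 2.12: the node is in S or has a descendant in S."""
    return node in S or bool(descendants(dag, node) & set(S))


def is_s_active_trail(dag, trail, S):
    """Interior colliders must be S-active nodes; interior chain and fork
    nodes must lie outside S."""
    for prev, mid, nxt in zip(trail, trail[1:], trail[2:]):
        collider = (prev, mid) in dag and (nxt, mid) in dag
        if collider and not is_s_active_node(dag, mid, S):
            return False
        if not collider and mid in S:
            return False
    return True


# Hypothetical example: collider a -> c <- b with a child c -> d.
dag = {("a", "c"), ("b", "c"), ("c", "d")}
```

In this example the trail (a, c, b) is blocked when S is empty and becomes active once the descendant d of the collider is placed in S.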

The proof of Theorem 2.11 follows directly from Lemma 2.14.

Lemma 2.14. Let G1 = (V, D1) and G2 = (V, D2) be two directed acyclic graphs with the same skeletons
and the same immoralities. Then for all S ⊂ V, a trail is a minimal S-active trail in G1 if and only if
it is a minimal S-active trail in G2.

Proof of Lemma 2.14 Recall the notation from Definition 1.2; α ∼ β denotes that two nodes (α, β) ∈
V × V are neighbours. For a directed graph G = (V, D), this means that either (α, β) ∈ D or (β, α) ∈ D. Since G1
and G2 have the same skeletons, any trail τ in G1 is also a trail in G2. Let S ⊂ V. Assume that τ is a
minimal S-active trail in G1. It is now proved, by induction on the number of collider nodes along the
trail, that τ is also an S-active trail in G2. By definition, a single node will be considered an S-active
trail, for any S ⊂ V. The proof is in three parts: Let τ be a minimal S-active trail in G1. Then

1. If τ contains no colliders in G1, then it is S-active in G2.

2. If τ contains at least one collider connection centred at node τj, then τ is S-active in G2 if and
only if τj is S-active in G2.

3. If τ contains at least one collider centred at node τj, then τj is an S-active node in G2.

Part 1 If τ is an S-active trail in G1 and does not contain any collider connections in G1, then none of
the nodes on τ are in S. This can be seen by considering the Bayes ball algorithm, which characterises
d-separation. It follows that the trail is S-active in G2 if and only if it does not contain a collider
connection in G2.
Let τ be a minimal S-active trail in G1 with k nodes and no collider connections in G1. Suppose
that a node τi is a collider node in G2, so that τi−1 and τi+1 are parents of τi in G2. Then, so that no
new immoralities are introduced, it follows that τi−1 ∼ τi+1. Since τi is either a chain or a fork in G1,
it follows that in G1, the connections between nodes τi−2, τi−1, τi, τi+1, τi+2 take one of the forms shown
in Figure 2.5 when τi is a chain node or those in Figure 2.6 when τi is a fork node.

[Four configurations, each containing the edge between τi−1 and τi+1 that bypasses τi:
(a) τi−2 → τi−1 → τi+1 → τi+2 with τi−1 → τi → τi+1;
(b) τi−2 ← τi−1 → τi+1 → τi+2 with τi−1 → τi → τi+1;
(c) τi−2 ← τi−1 ← τi+1 → τi+2 with τi+1 → τi → τi−1;
(d) τi−2 ← τi−1 ← τi+1 ← τi+2 with τi+1 → τi → τi−1.]

Figure 2.5: Possible connections between the nodes when τi is a chain node

It is clear from the figure that the trail with k − 1 nodes in G1, obtained by removing τi and using the
direct link from τi−1 to τi+1, is also an S-active trail in G1, contradicting the assumption that τ was a
minimal S-active trail. Hence τi is a chain node or a fork node in G2.
It follows that there are no collider connections along the trail τ taken in G2 and hence, since it
does not contain any nodes that are in S, it is an S-active trail in G2.

Part 2 Assume that any minimal S-active trail in G1 containing n collider connections is also S-active
in G2. This is true for n = 0 by part 1. Let τ be a trail with k nodes that is minimal S-active in G1
and with n + 1 collider connections in G1. Consider one of the collider connections centred at τj, with
parents τj−1 and τj+1. Let τ̃(0,j−1) = (τ0, τ1, . . . , τj−2, τj−1) and let τ̃(j+1,k) = (τj+1, . . . , τk). Both
τ̃(0,j−1) and τ̃(j+1,k) are minimal S-active in G1 and they both have a number of collider connections less than
or equal to n. By the inductive hypothesis, they are therefore both S-active in G2.
Because the trail τ is minimal S-active in G1, it follows that τj−1 ≁ τj+1. The reason is as follows.
Both τj−1 and τj+1 are S-active nodes in G1 (they have a common descendant in S to make the trail active)
and neither is in S (neither is the centre of a collider along τ). If τj−1 ∼ τj+1, then the trail
on k − 1 nodes obtained by removing the node τj would be S-active in G1:
any chain or fork (τj−2, τj−1, τj+1) or (τj−1, τj+1, τj+2) would be active because both τj−1 and τj+1 are
uninstantiated, and any collider (τj−2, τj−1, τj+1) or (τj−1, τj+1, τj+2) would be active because both τj−1
and τj+1 have a descendant in S. This contradicts minimality, so τj−1 ≁ τj+1. This holds in both G1 and G2, since the
skeletons are the same.
Since τ̃(0,j−1) and τ̃(j+1,k) are both active, and τj−1 → τj ← τj+1 is a collider, the trail τ is active if and
only if τj is an active node. That is, it is either in S or has a descendant in S.

[Two configurations, each containing the edge between τi−1 and τi+1 that bypasses τi:
(a) τi−2 ← τi−1 ← τi+1 → τi+2 with τi−1 ← τi → τi+1;
(b) τi−2 ← τi−1 → τi+1 → τi+2 with τi−1 ← τi → τi+1.]

Figure 2.6: Possible connections between the nodes if τi is a fork node

Part 3 Let τ be a minimal S-active trail in G1 and let τj ∈ τ be a collider node in G1. Since the trail
τ is a minimal S-active trail in G1, it follows either that τj ∈ S or τj, considered in G1, has a descendant
in S. That is, considered in G1, there is a directed path from τj to a node w ∈ S. Let ρ denote the
shortest such path. If τj ∈ S, then the length of the path is 0 and τj is also an S-active node in G2.
Assume there is a directed edge from τj to w ∈ S in G1. If there are links from τj−1 to w or τj+1
to w, then these links are τj−1 → w or τj+1 → w respectively, otherwise the DAG would have cycles. If
both are present, then the trail τ violates the second property of the minimality requirement. This
is seen by considering the trail formed by taking w instead of τj in τ. It follows that τj−1 ≁ w
or τj+1 ≁ w (or both). Without loss of generality, assume τj−1 ≁ w (since the
argument proceeds in the same way if τj+1 ≁ w). The diagram in Figure 2.7 may be useful.

[DAG with edges τj−1 → τj, τj+1 → τj, τj → w and τj+1 → w]

Figure 2.7: Illustration where τj is an uninstantiated collider node

Since w is not a parent of τj in G1, it cannot be a parent of τj in G2, since both graphs have the
same immoralities and (τj−1, τj, w) is not an immorality in G1.
Furthermore, τj−1 ≁ τj+1 (since they are both uninstantiated, and, in G1, both have a common
descendant in S, so that if τj−1 ∼ τj+1 then the trail with τj removed would be active whether the
connections at τj−1 and τj+1 are chain, fork or collider, contradicting the minimality assumption).
Since both graphs have the same immoralities and τj−1 ≁ τj+1, it follows that (τj−1, τj, τj+1) is an
immorality in both G1 and G2 and hence that τj−1 is a parent of τj in G2. Therefore, τj is a parent of
w in G2 and τj is therefore an S-active node in G2.
Assume that for the shortest directed path ρ from τj to w in G1, the first l links have the same
directed edges in G2. Now suppose that the shortest directed path is ρ, where τj = ρ0, . . . , ρl+p = w
and consider the links ρl ∼ ρl+1 and ρl+1 ∼ ρl+2. If ρl ∼ ρl+2, then in G1, the directed edge ρl → ρl+2
is present, otherwise there is a cycle. If the directed edge ρl → ρl+2 is present in G1, then the path
ρ is not minimal. Therefore, ρl ≁ ρl+2. This holds in both G1 and G2, because both graphs have the
same skeletons. By a similar argument, ρl−1 ≁ ρl+1 (there would be a cycle in G1 if (ρl+1, ρl−1) were
present; ρ would not be minimal in G1 if (ρl−1, ρl+1) were present; since the skeletons are the same,
ρl−1 ≁ ρl+1 in both G1 and G2). Since ρl ≁ ρl+2, it follows that ρl and ρl+2 are not both parents of ρl+1 in
G2; otherwise G2 would contain an immorality not present in G1. Similarly, since ρl−1 ≁ ρl+1, the edge
ρl → ρl+1 is present in G2, otherwise G2 would have the immorality (ρl−1, ρl, ρl+1), since the edge
(ρl−1, ρl) is present in G2 by assumption. It follows that the directed edges (ρl, ρl+1) and (ρl+1, ρl+2)
are both present in G2. By induction, therefore, the whole directed path ρ is also present in G2 and
hence τj is an S-active node in both G1 and G2.

Proof of Theorem 2.11 This follows directly: let G1 and G2 denote two DAGs with the same
skeleton and the same immoralities. For any set S and any two nodes α and β, it follows from the
lemma, together with the definition of D-separation, that

α ⊥ β∥G1 S ⇔ α ⊥ β∥G2 S; (2.3)

if there is an S-active trail between the two variables in one of the graphs, then there is a minimal
S-active trail in that graph and hence there is also a minimal S-active trail between the two variables
in the other. If there is no S-active trail between the two variables in one of the graphs then there is
no S-active trail between the two variables in the other. By definition, two variables are D-separated
by a set of variables S if and only if there is no S-active trail between the two variables. Two graphs
are Markov equivalent, or I-equivalent (Definition 2.4), if and only if Equation (2.3) holds for all
α, β ∈ V and all S ⊂ V.

2.2.1 Example 2.8 (Hidden Variables) Revisited


Consider the situation where there is a faithful DAG for the variable set X = (U, Z) where U denotes
the set of hidden variables and Z the set of observable variables. In situations where there is no faithful
DAG for the variable set Z, it is possible using the principles of Theorems 2.3 and 5.5 to locate a set
of hidden variables Ũ such that a faithful graph can be constructed for (Ũ, Z).
If the principle of Theorem 2.3 is applied to Z = (Z1, Z2, Z3, Z4) in Example 2.8, then the
edges Z1 ∼ Z2, Z2 ∼ Z4, Z3 ∼ Z4 and Z1 ∼ Z3 are present. The directions are yet to be specified. No other edges will be
present, since Z2 ⊥ Z3 ∣Z4 and Z1 ⊥ Z4 ∣Z3. Since Z1 ⊥ Z4 ∣Z3, it follows that Z1 − Z3 − Z4 cannot be an
immorality. Since Z3 ⊥ Z2 ∣Z4, it follows that Z3 − Z4 − Z2 cannot be an immorality. Both Z1 − Z2 − Z4
and Z2 − Z1 − Z3 are required to be immoralities, which is not possible. Therefore, either Z1 − Z2 − Z4
is not an immorality, in which case the model returned contains the false independence statement
Z1 ⊥ Z4 ∣{Z2, Z3}, or else Z2 − Z1 − Z3 is not an immorality, in which case the model returned contains
the false independence statement Z2 ⊥ Z3 ∣{Z1, Z4}.
At the same time, the requirement that both Z1 − Z2 − Z4 and Z2 − Z1 − Z3 are immoralities can
be resolved by adding in a hidden variable U to obtain Figure 2.3.
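
The edge set found above can be reproduced by a small computation. The sketch below assumes, as in Example 2.8, that the DAG with the hidden variable is faithful, so that D-separation in the full DAG (with conditioning sets drawn from Z only) stands in for conditional independence among the observables. The D-separation test uses the standard moralised ancestral graph criterion; all names are illustrative.

```python
from itertools import combinations


def d_separated(dag, x, y, S):
    """D-separation of x and y given S via the moralised ancestral graph."""
    parents = {}
    for a, b in dag:
        parents.setdefault(b, set()).add(a)
    # Smallest ancestral set containing x, y and S.
    anc, stack = set(), [x, y, *S]
    while stack:
        n = stack.pop()
        if n not in anc:
            anc.add(n)
            stack.extend(parents.get(n, ()))
    # Moralise: keep parent-child links and marry parents with a common child.
    und = {n: set() for n in anc}
    for child in anc:
        pa = parents.get(child, set())
        for a in pa:
            und[a].add(child)
            und[child].add(a)
        for a, c in combinations(pa, 2):
            und[a].add(c)
            und[c].add(a)
    # x and y are separated iff every undirected path passes through S.
    seen, stack = {x}, [x]
    while stack:
        for m in und[stack.pop()]:
            if m == y:
                return False
            if m not in seen and m not in S:
                seen.add(m)
                stack.append(m)
    return True


# DAG of Example 2.8, read off the factorisation of the joint distribution.
dag = {("U", "Z1"), ("U", "Z2"), ("Z3", "Z1"), ("Z4", "Z2"), ("Z3", "Z4")}
Z = ["Z1", "Z2", "Z3", "Z4"]


def separable(a, b):
    """Is there some subset of the other observables that separates a and b?"""
    others = [z for z in Z if z not in (a, b)]
    return any(d_separated(dag, a, b, set(S))
               for r in range(len(others) + 1)
               for S in combinations(others, r))


skeleton_Z = {frozenset((a, b)) for a, b in combinations(Z, 2) if not separable(a, b)}
```

The resulting `skeleton_Z` contains exactly the four edges listed above, and the two statements described as false, Z1 ⊥ Z4 ∣{Z2, Z3} and Z2 ⊥ Z3 ∣{Z1, Z4}, both fail the D-separation test in the DAG with U included.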

2.3 Markov Equivalence and the Essential Graph


Consider the DAG in Figure 2.8. Using the characterisation given by Theorem 2.11, the DAG in
Figure 2.8 is equivalent to the DAGs in Figure 2.9.

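[DAG with edges α1 → α2, α1 → α3, α1 → α4, α2 → α4, α3 → α4]

Figure 2.8: A DAG on four nodes

[Two DAGs. Left: α3 → α1, α1 → α2, α1 → α4, α2 → α4, α3 → α4.
Right: α2 → α1, α1 → α3, α1 → α4, α2 → α4, α3 → α4.]

Figure 2.9: The equivalent DAGs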

For the DAG in Figure 2.8, all the DAGs with the same skeleton can be enumerated, and it is
clear that those in Figure 2.9 are the only two that satisfy the criteria. To find the DAGs equivalent
to the one in Figure 2.8, the immorality (α2, α4, α3) has to be preserved and no new immoralities
may be added. The directed edges (α1, α4), (α2, α4) and (α3, α4) are therefore essential; the directed
edges (α2, α4) and (α3, α4) to form the immorality, the directed edge (α1, α4) because the connection
(α2, α1, α3) is either a fork or chain, forcing (α1, α4) to prevent a cycle. These three directed edges
will be present in any equivalent DAG. The three edges that are not part of the immorality, α1 ∼ α2,
α1 ∼ α3 and α1 ∼ α4, may be oriented in 2³ = 8 different ways, but
only 5 of these lead to DAGs (the other graphs contain cycles) and of these 5, only the three shown in
Figures 2.8 and 2.9 have the same immoralities.
A useful starting point for locating all the DAGs that are Markov equivalent to a given DAG is to
locate the essential graph, given in the following definition.

Definition 2.15 (Essential Graph). Let G be a Directed Acyclic Graph. The essential graph G∗
associated with G is the graph with the same skeleton as G, but where an edge is directed in G∗ if and
only if it occurs as a directed edge with the same orientation in every DAG that is Markov equivalent
to G. The directed edges of G∗ are the essential edges of G.
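
For a graph as small as the one in Figure 2.8, the equivalence class and the essential graph can be found by brute force. The sketch below assumes the edges of Figure 2.8 are α1 → α2, α1 → α3, α1 → α4, α2 → α4 and α3 → α4 (written a1, ..., a4), keeps the two edges of the immorality fixed and enumerates the orientations of the remaining three; it is an illustration, not an efficient algorithm.

```python
from itertools import combinations, product


def immoralities(dag):
    """Triples (a, b, c) with a -> b <- c and a, c non-adjacent (a < c)."""
    skel = {frozenset(e) for e in dag}
    found = set()
    for b in {child for _, child in dag}:
        pa = sorted(a for a, c in dag if c == b)
        found |= {(a, b, c) for a, c in combinations(pa, 2)
                  if frozenset((a, c)) not in skel}
    return found


def is_acyclic(dag):
    """Repeatedly strip nodes without incoming edges; fail if none exists."""
    edges = set(dag)
    nodes = {n for e in dag for n in e}
    while nodes:
        sources = {n for n in nodes if all(b != n for _, b in edges)}
        if not sources:
            return False
        nodes -= sources
        edges = {(a, b) for a, b in edges if a not in sources}
    return True


immorality_edges = {("a2", "a4"), ("a3", "a4")}
free_edges = [("a1", "a2"), ("a1", "a3"), ("a1", "a4")]
fig_2_8 = immorality_edges | set(free_edges)

# Orient the three free edges in all 2**3 ways and keep the acyclic graphs.
dags = []
for flips in product((False, True), repeat=3):
    g = immorality_edges | {(b, a) if f else (a, b)
                            for (a, b), f in zip(free_edges, flips)}
    if is_acyclic(g):
        dags.append(g)

equivalent = [g for g in dags if immoralities(g) == immoralities(fig_2_8)]

# Essential graph: edges with the same orientation in every equivalent DAG.
essential_directed = set.intersection(*equivalent)
undirected = {frozenset(e) for e in fig_2_8 - essential_directed}
```

Restricting attention to orientations that keep the immorality loses nothing, since every equivalent DAG must contain it. The enumeration gives 5 DAGs, 3 of which are equivalent to Figure 2.8, and the essential graph has the three directed edges into a4 with a1 ∼ a2 and a1 ∼ a3 undirected.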

The edges that are directed in an essential graph are the compelled edges.

Definition 2.16 (Compelled Edge). Let G = (V, E) be a chain graph, where E = D ∪ U. A directed
edge (α, β) ∈ D is said to be compelled if it occurs in at least one of the configurations in Figure 2.10.

[Four configurations involving α, β, γ1 and γ2, arranged top left, top right, bottom left and bottom right]

Figure 2.10: The directed edge (α, β) is compelled (Definition 2.16)

Lemma 2.17. In an essential graph, the directed edges are the compelled edges; all other edges are
undirected.

Proof From the definition, the directed edges are those that necessarily have the same direction in
every Markov equivalent DAG. The figure in the top right shows the immoralities; these are necessarily
directed. The direction is forced in the figure in the top left; otherwise the graph contains an additional
immorality. The direction is forced in the structure on the bottom left; otherwise there is a cycle. The
direction is forced in the structure in the bottom right; otherwise both (γ1, α) and (γ2, α) are forced
to prevent cycles appearing and (γ1, α, γ2) is an additional immorality.
To show that these are the only compelled edges: Consider two nodes α and β which are neighbours.
Firstly, suppose that α and β do not have any other common neighbours. If α does not have a neighbour
γ such that there is a directed edge (γ, α), then the direction (α, β) is not forced; no additional
immorality or cycle is created by either direction.
Now suppose that α and β have at least one common neighbour. Suppose that there are no
neighbours γ such that both (α, γ) and (γ, β) are in the directed edge set. Then it is not necessary to force the
direction (α, β) to prevent a cycle.
Suppose, furthermore, that there is no pair of common neighbours γ1 and γ2, such that (γ1, β)
and (γ2, β) are both in the directed edge set and (γ1, α, γ2) is not an immorality. Then it is not
necessary to force the direction (α, β) to prevent either (γ1, α, γ2) becoming an immorality or else a
cycle appearing.

Notes The terminology Markov model corresponding to a Directed Acyclic Graph G = (V, E) was
introduced into the literature and may be found in Andersson, Madigan, Perlman and Triggs [2]. The
results on Markov equivalence are taken from T. Verma and J. Pearl [140]. The rules for determining
compelled edges were formulated by Meek in [93]. A rigorous treatment covering chain graphs is [129]
(Studený). Exercise 7 page 352 is taken from [150], while Exercises 4 page 352 and 5 page 352 are
taken from Chickering [24].
Chapter 3

Intervention Calculus

3.1 Causal Models and Bayesian Networks


In many applications, a Bayesian network is constructed as a causal model, where for each variable,
its parent variables are considered to be direct causes that influence the value taken by the variable.
For example, an earth tremor or a burglary can cause the burglar alarm to go off and the arrows in
the associated collider DAG represent cause to effect relations. It is self evident, but nevertheless has to
be stated, that only associations can be inferred from an n×d data matrix x of instantiations; directions
of cause to effect cannot be inferred from data alone. When conditional independence statements are
learned from data, this can be interpreted as a Markov model and it may be possible to construct an
efficient factorisation of the distribution using these conditional independence statements. Clearly, this
factorisation cannot be understood as a causal model, unless there are other modelling assumptions.
For example, consider a model containing observable variables A, B, C, where there are hidden
variables H1, H2 that are unknown to the experimenter. If the causal diagram representing the causal
relations between these variables is given by the DAG on the left in Figure 3.1, then the learned DAG,
along which the distribution of A, B, C can be factorised, is the DAG on the right of Figure 3.1.
This is the correct DAG, in that it preserves the D-connection properties between A, B, C, but the
collider connection cannot be interpreted as A and B having a causal effect on C; they are effects of
the latent common causes H1 and H2.

[Two DAGs. Left: H1 → A, H1 → C, H2 → C, H2 → B. Right: A → C, B → C.]

Figure 3.1: Hidden causes and the learned DAG

If a Bayesian network is to be interpreted as a causal model, then the possible directions of cause
to effect must be part of the modelling assumptions before the data is analysed, determined by other
considerations. The data analysis only determines which directed edges remain and which are removed.
From data, one can determine whether or not there is an association between earth tremors and alarms
triggered; it is not possible to determine from the data what causes what.
This is self evident, but surprisingly it turns out that it is necessary to state it. The purpose of
the article by Freedman and Humphreys [43] (1999) was to point out the obvious fact that causality
cannot be inferred from data alone; it was a necessary response to obvious errors in the literature.
The term `causal discovery' has been used in surprising ways, to describe learning a directed edge in
a DAG, even after it had been established, with simple, concrete and obvious examples, that the idea
that such arrows learned from data alone actually represent causality is ridiculous, and long after
the publication of [43] illustrating that it is ridiculous. The article by Freedman and Humphreys is a
good article; it is surprising that the literature had degenerated to such an extent that it was necessary
for the authors to write it.
To define a causal network, an additional ingredient is needed; this is the concept of intervention,
introduced by Judea Pearl in the seminal article [107] from 1995.

3.2 Conditioning by Observation and by Intervention


Let X and Y be two random variables and suppose that X = x is observed. Then the conditional
probability of Y = y is defined as

\[ P_{Y \mid X}(y \mid x) = \frac{P_{X,Y}(x, y)}{P_X(x)}. \]

This formula describes the way that the probability distribution of the random variable Y changes
after X = x is observed. If, instead, the value X = x is forced by the observer, irrespective of other
considerations, the conditional probability statement is invalid.
If random variables are linked through a causal model, expressed by a directed acyclic graph,
where parent variables have a causal effect on their children, some attempt can be made to compute
the probability distribution over the remaining variables when the states of some variables are forced.
In a controlled experiment, a variable is forced to take a particular value, chosen at random,
irrespective of the other variables in the network. In terms of the directed acyclic graph, the variable
is instantiated with this value, the directed edges between the variable and its parents are removed
(because the parents no longer have influence on the state of the variable) and all other conditional
probabilities remain unaltered.

3.3 The Intervention Calculus for a Bayesian Network


Notations For any A = (i1, . . . , im) ⊂ V, let XA = (Xi1, . . . , Xim) and, for x = (x1, . . . , xd) ∈ X, let
xA = (xi1, . . . , xim). The set difference, written V/A, is defined as all the indices in V which are not
included in A. A change of order of the variables will be employed for convenience, which should be
clear from the context:

\[ x = x_V = \times_{v=1}^{d} x_v = (\times_{v \in V/A}\, x_v).(\times_{v \in A}\, x_v) = x_{V/A}.x_A. \]

Let ϕ be a function defined on X. Then the quantity ∑V/A ϕ is defined as

\[ \Big( \sum_{V/A} \phi \Big)(x_A) = \sum_{x_{V/A}} \phi(x_{V/A}.x_A). \]

Definition 3.1 (The Intervention Formula). The conditional probability of XV/A = xV/A, given that
the variables XA were forced to take the values xA independently of all else, is written

PV/A∥A (xV/A ∣XA ← xA) or PV/A∥A (xV/A ∥xA)

and defined as

\[ P_{V/A \| A}(x_{V/A} \mid X_A \leftarrow x_A) = P_{V/A \| A}(x_{V/A} \| x_A) = \prod_{v \in V/A} P_{v \mid \mathrm{Pa}(v)}(x_v \mid x_{\mathrm{Pa}(v)}). \tag{3.1} \]

Note that (3.1) is equivalent to:

\[ P_{V/A \| A}(x_{V/A} \| x_A) = \frac{P_V(x_V)}{\prod_{v \in A} P_{v \mid \mathrm{Pa}(v)}(x_v \mid x_{\mathrm{Pa}(v)})}. \tag{3.2} \]

The last expression of Equation (3.1) is in terms of the required factorisation; instantiation of the
variables indexed by the set A and elimination of those edges in D which lead from the parents of the
nodes in A to the nodes in A. The terminology `local surgery' is used to describe such an elimination.
A local surgery is performed and the conditional probabilities on the remaining edges are multiplied.
This yields a factorisation along a mutilated graph where the direct causes of the manipulated variable
are put out of effect.

The intervention formula (3.1) is obtained by wiping out those factors from the factorisation which
correspond to the interventions. An explicit translation of intervention in terms of `wiping out' equations
was first proposed by Strotz and Wold [128] (1960).
The quantity PV/A∥A (.∥xA) from Definition 3.1 defines a family of probability measures over XV/A,
which depends on the values xA, which may be considered as parameters. These are the values forced
on the variables indexed by A. This family includes the original probability measure; if A = ∅, then
PV/A∥A (.∥xA) = PX (.). This family is known as the intervention measure. In addition, the expression
on the right hand side of (3.1) is called the intervention formula.

Intervention An `intervention' is an action taken to force a variable into a certain state, without
reference to its own current state, or the states of any of the other variables. It may be thought of as
choosing the values x∗A for the variables XA by using a random generator independent of the variables
X.

Remark In the same style of notation, conditioning by observation is

PXV/A ∣XA (xV/A ∣see(xA)) = PXV/A ∣XA (xV/A ∣xA) (3.3)

where, by the standard definition of conditional probability,

\[ P_{V/A \mid A}(x_{V/A} \mid x_A) = \frac{P_V(x)}{P_A(x_A)}. \tag{3.4} \]

Example 3.2.

Consider the DAG given in Figure 3.2, for `X having causal effect on Y'.

X → Y

Figure 3.2: A DAG for X having causal effect on Y

The factorisation of PX,Y along the DAG in Figure 3.2 is

\[ P_{X,Y}(x, y) = P_{Y \mid X}(y \mid x) P_X(x) \]

and the intervention formula gives

\[ P_{Y \| X}(y \| x) = P_{Y \mid X}(y \mid x). \]

Since X is a parent of Y, the intervention to force X = x produces exactly the same conditional
probability distribution over Y as observing X = x. But if instead Y is forced, the intervention formula
yields

\[ P_{X \| Y}(x \| y) = P_X(x). \]

Clearly, PX∥Y (x∥y) ≠ PX∣Y (x∣y) as functions unless X and Y are independent.
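
The contrast in Example 3.2 can be made concrete with numbers. The probabilities below are hypothetical and chosen only for illustration; the functions apply Equation (3.2) (divide the joint by the factor of the forced variable) and ordinary conditioning.

```python
from fractions import Fraction as F

# Hypothetical tables for the DAG X -> Y with binary X and Y.
p_x = {0: F(3, 10), 1: F(7, 10)}
p_y_given_x = {0: {0: F(9, 10), 1: F(1, 10)},   # indexed as p_y_given_x[x][y]
               1: {0: F(2, 10), 1: F(8, 10)}}

joint = {(x, y): p_x[x] * p_y_given_x[x][y] for x in (0, 1) for y in (0, 1)}


def do_x(x):
    """P_{Y||X}(. || x): the factor P_X is removed from the factorisation."""
    return {y: joint[(x, y)] / p_x[x] for y in (0, 1)}


def do_y(y):
    """P_{X||Y}(. || y): the factor P_{Y|X} is removed, leaving P_X."""
    return {x: joint[(x, y)] / p_y_given_x[x][y] for x in (0, 1)}


def see_y(y):
    """P_{X|Y}(. | y): conditioning by observation."""
    p_y = sum(joint[(x, y)] for x in (0, 1))
    return {x: joint[(x, y)] / p_y for x in (0, 1)}
```

With these numbers `do_x(1)` coincides with the row `p_y_given_x[1]`, `do_y(1)` returns the marginal `p_x`, and `see_y(1)` differs from `p_x` because X and Y are dependent.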

Example 3.3 (The DAG for a wet pavement).

The `wet pavement' example is a classic illustration, introduced by Judea Pearl. See, for example, page
15 of [109]. The DAG represents a causal model for a wet pavement and is given in Figure 3.3. The
season A has four states; spring, summer, autumn, winter. Rain B has two states; yes / no. Sprinkler
C has two states; on / off. Wet pavement D has two states; yes / no. Slippery pavement E has two
states; yes / no.
The joint probability distribution is factorised as

\[ P_{A,B,C,D,E} = P_A \, P_{B \mid A} \, P_{C \mid A} \, P_{D \mid B,C} \, P_{E \mid D}. \]


[DAG with edges A → B, A → C, B → D, C → D, D → E]

Figure 3.3: DAG for wet pavement, no intervention

Suppose, without reference to the values of any of the other variables and without reference to the
current state of the sprinkler, `sprinkler on' is now enforced. This could be, for example, regular
maintenance work, which is carried out at regular intervals, irrespective of the season or other considerations.
Then

\[ P_{A,B,D,E \| C}(. \| C \leftarrow 1) = \frac{P_{A,B,C,D,E}(., ., 1, ., .)}{P_{C \mid A}(1 \mid .)} = P_A \, P_{B \mid A} \, P_{D \mid B,C}(. \mid ., 1) \, P_{E \mid D}. \]

After observing that the sprinkler is on, it may be inferred that the season is dry and that it probably
did not rain and so on. If `sprinkler on' is enforced, without reference to the state of the system when
the action is taken, then no such inference should be drawn in evaluating the effects of the intervention.
The resulting DAG is given in Figure 3.4. It is the same as before, except that C = 1 is fixed and the
edge between C and A disappears. The deletion of the factor PC∣A represents the understanding that
whatever relationships existed between sprinklers and seasons prior to the action, found from

\[ P_{A,B,D,E \mid C}(., ., ., . \mid 1), \]

are no longer in effect when the state of the variable is forced, as in a controlled experiment, without
reference to the state of the system.
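
The difference between observing and enforcing `sprinkler on' can be illustrated numerically. All conditional probability tables below are hypothetical (the text does not give numbers); the intervention is computed from Equation (3.2) by dividing the joint by the factor for C given A.

```python
from fractions import Fraction as F
from itertools import product

seasons = ("spring", "summer", "autumn", "winter")
binary = (0, 1)

# Hypothetical tables; each entry is the probability that the child equals 1.
p_a = {a: F(1, 4) for a in seasons}
p_b1 = dict(zip(seasons, (F(4, 10), F(1, 10), F(5, 10), F(6, 10))))   # rain
p_c1 = dict(zip(seasons, (F(3, 10), F(8, 10), F(2, 10), F(1, 20))))   # sprinkler
p_d1 = {(0, 0): F(0), (0, 1): F(9, 10), (1, 0): F(9, 10), (1, 1): F(99, 100)}
p_e1 = {0: F(0), 1: F(7, 10)}


def bern(p1, value):
    """Probability of `value` for a binary variable with P(value = 1) = p1."""
    return p1 if value == 1 else 1 - p1


def joint(a, b, c, d, e):
    """Factorisation P_A P_{B|A} P_{C|A} P_{D|B,C} P_{E|D}."""
    return (p_a[a] * bern(p_b1[a], b) * bern(p_c1[a], c)
            * bern(p_d1[(b, c)], d) * bern(p_e1[d], e))


def intervened(a, b, d, e, c=1):
    """Equation (3.2): the joint at C = c divided by P_{C|A}(c | a)."""
    return joint(a, b, c, d, e) / bern(p_c1[a], c)


# Distribution of the season after forcing C <- 1 ...
season_do = {a: sum(intervened(a, b, d, e) for b, d, e in product(binary, repeat=3))
             for a in seasons}

# ... and after merely observing C = 1.
p_c_is_1 = sum(joint(a, b, 1, d, e) for a in seasons
               for b, d, e in product(binary, repeat=3))
season_see = {a: sum(joint(a, b, 1, d, e) for b, d, e in product(binary, repeat=3)) / p_c_is_1
              for a in seasons}
```

Under these assumed tables `season_do` equals the prior `p_a`, whereas `season_see` shifts mass towards summer: observing the sprinkler is informative about the season, forcing it is not.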

3.4 Causal Models


Having defined the family of intervention measures, the concept of causal model may now be defined.

[The DAG of Figure 3.3 with C = 1 fixed and the edge A → C removed; remaining edges A → B, B → D, C → D, D → E]

Figure 3.4: Sprinkler `on' is forced

Definition 3.4 (Causal Model). Let X = (X1, . . . , Xd) be a random vector and let V = {1, . . . , d}
denote the indexing set. A causal model consists of the following:

1. A Bayesian Network for PX, that is, an ordering σ of the indices V and a factorisation of the
probability distribution

\[ P_V = \prod_{j=1}^{d} P_{\sigma(j) \mid \mathrm{Pa}^{(\sigma)}(j)} \tag{3.5} \]

where Pa(σ)(j) ⊆ {σ(1), . . . , σ(j − 1)} and is the smallest such subset such that (3.5) holds.

2. The node set V consists of two types of nodes; VI and VN, where VI ∩ VN = ∅ and VI ∪ VN =
V. The nodes VI are the interventional nodes and VN are the non-interventional nodes, where
no intervention is possible. The intervention formula (3.1) holds for each subset A ⊆ VI of
interventional nodes and each xA ∈ XA.

The arrows α ↦ β of the DAG for either α or β (or both) in VI are causal arrows, indicating direct
cause to effect. The remaining arrows are non-causal; a cause to effect relation between nodes α and
β cannot be inferred from an arrow α ↦ β if both α, β ∈ VN.

In many cases, a model contains hidden variables, which cannot be observed. A special case of this is
the semi-Markov model, where the hidden variables are common causes and where none of the hidden
variables have observable ancestors.

Definition 3.5 (Semi-Markov Model). A semi-Markov model is a causal model for a random vector
X with node set V = 𝒱 ∪ 𝒰, where 𝒱 are the observable variables, VI ⊂ 𝒱 (intervention can be made
on a subset of the observable nodes), ṼN = 𝒱/VI (observable nodes on which intervention cannot be
made) and VN = ṼN ∪ 𝒰.
The nodes of VI correspond to interventional variables, 𝒰 ⊂ VN are the hidden (latent) variables
and ṼN represent the observable variables on which no intervention can be made.
For a semi-Markov model, the requirement is that the variables of 𝒰 have no ancestors in 𝒱.

Notation Throughout, if a variable is named U, or Ui (for some index i), it may be assumed that
the variable (or its index) belongs to 𝒰.

3.4.1 Establishing a Causal Model via a Controlled Experiment


If sufficient data is available, a suitable Bayesian Network may be learned from the data. A causal
model cannot be established from data alone. Additional information is needed, which is obtained
through interventions on the interventional variables.
For example, the three graphs in Figure 3.5 are Markov equivalent; if the probability distribution
factorises along one of these graphs, it also factorises along the others. The chains α → γ → β and
α ← γ ← β and the fork α ← γ → β are all Markov equivalent, with D-separation structure α ⊥ β∥γ.
If any of these DAGs represents a causal network, then it is not possible to learn the causal network
from the data alone.
Suppose that it is possible to intervene by controlling the variable Xγ. Then, if one of these graphs is
the DAG for a causal network, it will be possible to establish which one through a controlled experiment.
Figure 3.6 shows the associated structural model when the control Xγ ← z has been applied, forcing
Xγ to be independent of its ancestors. A controlled experiment, where the direct causal links between
Xγ and its parent variables have been eliminated, will exhibit independence structure Xα ⊥ {Xβ, Xγ}
for the chain α → γ → β, {Xα, Xγ} ⊥ Xβ for the chain α ← γ ← β and Xα ⊥ Xβ ∣Xγ for the fork
α ← γ → β. Once the associations
Xα ⊥ Xβ ∣Xγ, Xα ⊥̸ Xβ, Xα ⊥̸ Xγ, Xα ⊥̸ Xγ ∣Xβ, Xβ ⊥̸ Xγ and Xβ ⊥̸ Xγ ∣Xα have been established, an
additional controlled experiment, if it is possible to control the variable Xγ with interventions to force
all possible values of Xγ, will determine which graph within the equivalence class is appropriate.
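
The way a controlled experiment on Xγ separates the three equivalent graphs can be sketched by applying the intervention formula (3.1) to each of them. The conditional probabilities are hypothetical: roots are fair coins and each child copies its single parent with probability 9/10.

```python
from fractions import Fraction as F
from itertools import product


def cpt(child_value, parent_values):
    """Hypothetical table: fair coin for a root, noisy copy of a single parent."""
    if not parent_values:
        return F(1, 2)
    return F(9, 10) if child_value == parent_values[0] else F(1, 10)


def do_gamma(dag, z):
    """Formula (3.1) with A = {gamma}: product of the factors for alpha and beta."""
    parents = {v: [a for a, b in dag if b == v] for v in ("alpha", "beta")}
    dist = {}
    for xa, xb in product((0, 1), repeat=2):
        x = {"alpha": xa, "beta": xb, "gamma": z}
        dist[(xa, xb)] = (cpt(xa, [x[p] for p in parents["alpha"]])
                          * cpt(xb, [x[p] for p in parents["beta"]]))
    return dist


def responds(dag, node):
    """Does the marginal of `node` differ between gamma <- 0 and gamma <- 1?"""
    idx = 0 if node == "alpha" else 1

    def prob_one(z):
        return sum(p for x, p in do_gamma(dag, z).items() if x[idx] == 1)

    return prob_one(0) != prob_one(1)


graphs = {
    "chain alpha->gamma->beta": {("alpha", "gamma"), ("gamma", "beta")},
    "chain alpha<-gamma<-beta": {("beta", "gamma"), ("gamma", "alpha")},
    "fork alpha<-gamma->beta": {("gamma", "alpha"), ("gamma", "beta")},
}
signature = {name: (responds(g, "alpha"), responds(g, "beta"))
             for name, g in graphs.items()}
```

Each graph produces a different pattern of responses to the forced value of Xγ (only Xβ responds, only Xα responds, or both respond), which is what allows the experiment to identify the graph within the equivalence class.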

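[Three DAGs on the nodes α, β, γ: the chains α → γ → β and α ← γ ← β and the fork α ← γ → β]

Figure 3.5: Three Markov Equivalent Graphs

[The three DAGs of Figure 3.5 with Xγ ← z instantiated and every edge into γ removed]

Figure 3.6: Graphs from Figure 3.5 with intervention Xγ ← z applied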

If it is possible to control variables, then it is possible to learn whether or not a collider represents
independent causes with a common effect. If the DAG on the left hand side of Figure 3.1 represents
a causal structure, then an experiment where variable A is controlled will establish that it is not a
direct cause of C, since an intervention on A leaves it separated from the rest of the network, as in
Figure 3.7.

[DAG with A = a instantiated and isolated; remaining edges H1 → C, H2 → C, H2 → B]

Figure 3.7: Hidden causes H1 and H2; intervention A = a

3.5 Properties of Intervention Calculus


The following propositions summarise some basic properties of the intervention calculus.

Proposition 3.6. If Xi has no parents, then for all x ∈ X

PV /{i}∥i (xV /{i} ∥xi ) = PV /{i}∣i (xV /{i} ∣xi ).

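Proof This is straightforward from the definition:

\[
P_{V/\{i\}\,\|\,i}(x_{V/\{i\}} \,\|\, x_i) = \prod_{j \neq i} P_{j\mid \mathrm{Pa}(j)}(x_j \mid \pi_j),
\]

while if Xi does not have any parents,

\[
P_{V/\{i\}\mid i}(x_{V/\{i\}} \mid x_i)
= \frac{\prod_{j} P_{j\mid \mathrm{Pa}(j)}(x_j \mid \pi_j)}{P_i(x_i)}
= \frac{P_i(x_i)\prod_{j \neq i} P_{j\mid \mathrm{Pa}(j)}(x_j \mid \pi_j)}{P_i(x_i)}
= \prod_{j \neq i} P_{j\mid \mathrm{Pa}(j)}(x_j \mid \pi_j).
\]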
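Proposition 3.6 can be checked numerically on a small network. The following is a minimal sketch, assuming a hypothetical three-node DAG C → A, C → B, A → B with binary variables; the conditional probability tables and the function names (`factors`, `do`, `see`) are illustrative choices and do not come from the text. The function `do` applies the intervention formula by dropping the factor of the controlled node, and `see` performs ordinary conditioning.

```python
from itertools import product

# Hypothetical CPTs (assumptions for illustration) for the DAG C -> A, C -> B, A -> B.
P_C1 = 0.3                                   # P_C(1)
P_A1_given_C = {0: 0.2, 1: 0.9}              # P_{A|C}(1 | c)
P_B1_given_AC = {(0, 0): 0.1, (0, 1): 0.5,   # P_{B|A,C}(1 | a, c), keyed by (a, c)
                 (1, 0): 0.4, (1, 1): 0.8}


def bern(p1, value):
    """Probability of `value` for a binary variable with P(1) = p1."""
    return p1 if value == 1 else 1.0 - p1


def factors(a, b, c):
    """The three factors of the joint distribution, keyed by node name."""
    return {"C": bern(P_C1, c),
            "A": bern(P_A1_given_C[c], a),
            "B": bern(P_B1_given_AC[(a, c)], b)}


def joint(a, b, c):
    f = factors(a, b, c)
    return f["A"] * f["B"] * f["C"]


def do(node, a, b, c):
    """Intervention formula: the factor of the controlled node is removed."""
    f = factors(a, b, c)
    del f[node]
    result = 1.0
    for value in f.values():
        result *= value
    return result


def see(node, a, b, c):
    """Ordinary conditioning on the value that `node` takes in (a, b, c)."""
    fixed = {"A": a, "B": b, "C": c}[node]
    marginal = sum(joint(*v) for v in product((0, 1), repeat=3)
                   if dict(zip("ABC", v))[node] == fixed)
    return joint(a, b, c) / marginal


# C has no parents, so do- and see-conditioning on C agree (Proposition 3.6);
# A has the parent C, so in general they differ.
print(do("C", 1, 1, 1), see("C", 1, 1, 1))
print(do("A", 1, 1, 1), see("A", 1, 1, 1))
```

In this sketch the two numbers on the first printed line coincide, while those on the second line differ because A is not a root node.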

Proposition 3.7. For each j ∈ V , let Pa(j) denote the set of parents of node j and XPa(j) the state
space of XPa(j) . For each (x, π) ∈ Xj × XPa(j) ,

Pj∥Pa(j) (x∥π) = Pj∣Pa(j) (x∣π). (3.6)

For all j ∈ V and each S ⊆ V such that S ∩ ({j} ∪ {Pa(j)}) = ∅, for each (xj , πj , xS ) ∈ Xj × XPa(j) × XS ,

Pj∥Pa(j),S (xj ∥πj , xS ) = Pj∣Pa(j) (xj ∣πj ). (3.7)



Proof Equation (3.6) is established first. Pj∥Pa(j)(.∥πj) is a marginal distribution which depends on
the enforced value XPa(j) ← πj. For all (xj, πj) ∈ Xj × XPa(j),

\[
P_{j\|\mathrm{Pa}(j)}(x_j\|\pi_j) = \sum_{y_{V/\mathrm{Pa}(j)}\,:\,y_j = x_j} P_{V/\mathrm{Pa}(j)\,\|\,\mathrm{Pa}(j)}(y_{V/\mathrm{Pa}(j)}\|\pi_j).
\]

An application of (3.1), the intervention formula, yields

\[
P_{j\|\mathrm{Pa}(j)}(x\|\pi) = \sum_{x_{V/\mathrm{Pa}(j)}\,:\,x_j = x}
\left.\left(\prod_{v \in V/\mathrm{Pa}(j)} P_{v\mid \mathrm{Pa}(v)}(x_v \mid x_{\mathrm{Pa}(v)})\right)\right|_{(x_j,\,x_{\mathrm{Pa}(j)}) = (x,\pi)}.
\]

It follows, using

\[
\sum_{x \in \mathcal{X}_v} P_{v\mid \mathrm{Pa}(v)}(x \mid \pi_v) = 1
\]

for any πv ∈ XPa(v), that

\[
P_{j\|\mathrm{Pa}(j)}(x\|\pi) = P_{j\mid \mathrm{Pa}(j)}(x \mid \pi)
\]

for all (x, π) ∈ Xj × XPa(j) as required. The proof of Equation (3.7) is similar.

Once all the direct causes of a variable Xj are controlled, no other interventions will affect the
conditional probability distribution of Xj.

The following property is another straightforward consequence of the definition.

Proposition 3.8. For any (x, π) ∈ Xj × XPa(j) ,


PPa(j)∥j (π∥x) = PPa(j) (π).

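Proof By marginalisation, followed by an application of the intervention formula (3.1), for each
(x, π) ∈ Xj × XPa(j),

\[
\begin{aligned}
P_{\mathrm{Pa}(j)\|j}(\pi\|x)
&= \sum_{x_{V/(\{j\}\cup \mathrm{Pa}(j))}} P_{V/(\{j\}\cup \mathrm{Pa}(j)),\,\mathrm{Pa}(j)\,\|\,j}(x_{V/(\{j\}\cup \mathrm{Pa}(j))}, \pi\,\|\,x) \\
&= \sum_{x_{V/(\{j\}\cup \mathrm{Pa}(j))}} \left.\frac{P_V(x_V)}{P_{j\mid \mathrm{Pa}(j)}(x\mid\pi)}\right|_{(x_j,\,x_{\mathrm{Pa}(j)}) = (x,\pi)} \\
&= \left.\frac{\sum_{x_{V/(\{j\}\cup \mathrm{Pa}(j))}} P_V(x_V)}{P_{j\mid \mathrm{Pa}(j)}(x\mid\pi)}\right|_{(x_j,\,x_{\mathrm{Pa}(j)}) = (x,\pi)} \\
&= \frac{P_{j,\mathrm{Pa}(j)}(x,\pi)}{P_{j\mid \mathrm{Pa}(j)}(x\mid\pi)}
 = \frac{P_{\mathrm{Pa}(j)}(\pi)\,P_{j\mid \mathrm{Pa}(j)}(x\mid\pi)}{P_{j\mid \mathrm{Pa}(j)}(x\mid\pi)} \\
&= P_{\mathrm{Pa}(j)}(\pi).
\end{aligned}
\]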

The probability measure after intervention is factorised along the mutilated graph. The following
proposition determines the probabilities on the mutilated graph.

Proposition 3.9. Let A ⊂ V. For j ∉ A and any (y, xPa(j)/A, xA) ∈ Xj × XPa(j)/A × XA,

Pj∣Pa(j)/A∥A (y∣xPa(j)/A ∥xA ) = Pj∣Pa(j) (y∣xPa(j) ),

where xPa(j) is determined by xPa(j)/A and xA, and where the conditioning is taken in the sense of:
first the `do' conditioning XA ← xA is applied and then the set of variables Pa(j)/A is observed.

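Proof By definition of conditional probability,

\[
P_{j\mid \mathrm{Pa}(j)/A\,\|\,A}(y \mid x_{\mathrm{Pa}(j)/A}\,\|\,x_A)
= \frac{P_{j,\mathrm{Pa}(j)/A\,\|\,A}(y, x_{\mathrm{Pa}(j)/A}\,\|\,x_A)}{P_{\mathrm{Pa}(j)/A\,\|\,A}(x_{\mathrm{Pa}(j)/A}\,\|\,x_A)}.
\]

An application of the intervention formula to the numerator gives

\[
\begin{aligned}
P_{j,\mathrm{Pa}(j)/A\,\|\,A}(y, x_{\mathrm{Pa}(j)/A}\,\|\,x_A)
&= \sum_{\mathcal{X}_{V/(\{j\}\cup \mathrm{Pa}(j)\cup A)}} \prod_{v \notin A} P_{v\mid \mathrm{Pa}(v)}(x_v \mid x_{\mathrm{Pa}(v)}) \\
&= P_{j\mid \mathrm{Pa}(j)}(y \mid x_{\mathrm{Pa}(j)}) \sum_{\mathcal{X}_{V/(\{j\}\cup \mathrm{Pa}(j)\cup A)}} \prod_{v \notin A\cup\{j\}} P_{v\mid \mathrm{Pa}(v)}(x_v \mid x_{\mathrm{Pa}(v)}),
\end{aligned}
\]

where xj = y, and to the denominator gives

\[
P_{\mathrm{Pa}(j)/A\,\|\,A}(x_{\mathrm{Pa}(j)/A}\,\|\,x_A)
= \sum_{\mathcal{X}_j} P_{j\mid \mathrm{Pa}(j)}(x_j \mid x_{\mathrm{Pa}(j)}) \sum_{\mathcal{X}_{V/(\{j\}\cup \mathrm{Pa}(j)\cup A)}} \prod_{v \notin A\cup\{j\}} P_{v\mid \mathrm{Pa}(v)}(x_v \mid x_{\mathrm{Pa}(v)}).
\]

Summing from right to left, the variables in V/(A ∪ Pa(j) ∪ {j}) that are descendants of j are summed
over first and each of their factors sums to 1, so the inner sum does not depend on xj. Since the
factor Pj∣Pa(j)(xj∣xPa(j)) sums to 1 over xj, it follows that

\[
P_{\mathrm{Pa}(j)/A\,\|\,A}(x_{\mathrm{Pa}(j)/A}\,\|\,x_A)
= \sum_{\mathcal{X}_{V/(\{j\}\cup \mathrm{Pa}(j)\cup A)}} \prod_{v \notin A\cup\{j\}} P_{v\mid \mathrm{Pa}(v)}(x_v \mid x_{\mathrm{Pa}(v)}),
\]

which is the sum that multiplies Pj∣Pa(j)(y∣xPa(j)) in the numerator, giving

\[
P_{j\mid \mathrm{Pa}(j)/A\,\|\,A}(y \mid x_{\mathrm{Pa}(j)/A}\,\|\,x_A) = P_{j\mid \mathrm{Pa}(j)}(y \mid x_{\mathrm{Pa}(j)})
\]

as claimed. The proof is complete.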

Example 3.10 (Wet Pavement Revisited).


Consider the conditional probability PD∣B∥C (.∣.∥1). Here B and C are parents of D, Pa(D)/B = C .
Plugging into the formula in the preceding proposition,

PD∣B∥C (.∣.∥1) = PD∣B,C (.∣., 1).



The right hand side may be thought of as a pre-intervention probability, which can be estimated from
the data before the intervention C ← 1 is made. In this case, an estimate of the pre-intervention
probability PD∣B,C (.∣., 1) is also an estimate of the post-intervention probability PD∣B∥C (.∣.∥1).

Transformations of Probability The following proposition is almost a direct consequence of the
definition. It presents a simple rearrangement of the intervention formula in a special case.

Proposition 3.11.

PV /{j}∥j (xV /{j} ∥y) = PV /({j}∪Pa(j))∣j,Pa(j) (xV /({j}∪Pa(j)) ∣y, xPa(j) )PPa(j) (xPa(j) )

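Proof An application of the definition gives

\[
P_{V/\{j\}\,\|\,j}(x_{V/\{j\}}\,\|\,y) = \prod_{v \in V/\{j\}} P_{v\mid \mathrm{Pa}(v)}(x_v \mid x_{\mathrm{Pa}(v)}).
\]

One term has been removed in the product, namely, Pj∣Pa(j)(y∣xPa(j)), so that (with xj = y)

\[
\begin{aligned}
\prod_{v \in V/\{j\}} P_{v\mid \mathrm{Pa}(v)}(x_v \mid x_{\mathrm{Pa}(v)})
&= \frac{P_V(x_V)}{P_{j\mid \mathrm{Pa}(j)}(y \mid x_{\mathrm{Pa}(j)})} \\
&= \frac{P_V(x_V)\,P_{\mathrm{Pa}(j)}(x_{\mathrm{Pa}(j)})}{P_{j,\mathrm{Pa}(j)}(y, x_{\mathrm{Pa}(j)})} \\
&= P_{V/(\{j\}\cup \mathrm{Pa}(j))\mid j,\mathrm{Pa}(j)}(x_{V/(\{j\}\cup \mathrm{Pa}(j))} \mid y, x_{\mathrm{Pa}(j)})\,P_{\mathrm{Pa}(j)}(x_{\mathrm{Pa}(j)})
\end{aligned}
\]

as required.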

Proposition 3.12 (Adjustment for Direct Causes ). Let G = (V, D) be a DAG and let B ⊂ V such
that ({j} ∪ Pa(j)) ∩ B = ∅ (the empty set). Then for any (x, y) ∈ Xj × XB ,

\[
P_{B\|j}(y\|x) = \sum_{\mathcal{X}_{\mathrm{Pa}(j)}} P_{B\mid j,\mathrm{Pa}(j)}(y \mid x, x_{\mathrm{Pa}(j)})\,P_{\mathrm{Pa}(j)}(x_{\mathrm{Pa}(j)}). \tag{3.8}
\]

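Proof of Proposition 3.12 Firstly, with xj = y (so that y now denotes the value forced on Xj),

\[
P_{B\|j}(x_B\|y) = \sum_{\mathcal{X}_{V/(B\cup\{j\})}} P_{V/\{j\}\,\|\,j}(x_{V/\{j\}}\,\|\,y).
\]

By Proposition 3.11, this may be written as

\[
P_{B\|j}(x_B\|y) = \sum_{\mathcal{X}_{V/(B\cup\{j\})}} P_{V/(\{j\}\cup \mathrm{Pa}(j))\mid \{j\}\cup \mathrm{Pa}(j)}(x_{V/(\{j\}\cup \mathrm{Pa}(j))} \mid y, x_{\mathrm{Pa}(j)})\,P_{\mathrm{Pa}(j)}(x_{\mathrm{Pa}(j)}).
\]

Summing over all the variables in V/(B ∪ {j}) other than those of Pa(j) gives

\[
P_{B\|j}(x_B\|y) = \sum_{\mathcal{X}_{\mathrm{Pa}(j)}} P_{B\mid j,\mathrm{Pa}(j)}(x_B \mid y, x_{\mathrm{Pa}(j)})\,P_{\mathrm{Pa}(j)}(x_{\mathrm{Pa}(j)})
\]

as required. The proof is complete.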

3.6 Confounding, The `Sure Thing' Principle and Simpson's Paradox


3.6.1 Confounding
Consider the DAG given in Figure 3.8. It corresponds to the factorisation:

PA,B,C = PB∣A,C PA∣C PC .

Figure 3.8: Illustration for Confounding (C → A, C → B, A → B)

Figure 3.9: Intervention on A (A = a; the arrow C → A is removed)

Consider the conditional probability of B, when A is controlled; PB∥A (.∥a). The DAG illustrating the
intervention is shown in Figure 3.9. Note that

\[
P_{B\|A}(.\|a) = \sum_{c \in \mathcal{X}_C} P_{B,C\|A}(., c\|a)
\]

and that

\[
P_{B,C\|A}(., .\|a) = P_{B\mid C\|A}(.\mid .\|a)\,P_{C\|A}(.\|a) = P_{B\mid A,C}(.\mid a, .)\,P_C,
\]

where, in PB∣C∥A, the do-conditioning A ← a is applied first, and then C is observed. It
follows that

\[
P_{B\|A}(.\|a) = \sum_{c \in \mathcal{X}_C} P_{B\mid A,C}(.\mid a, c)\,P_C(c).
\]

This shows that to estimate PB∥A (.∥a) from data alone (i.e. without controlling A), it is necessary to
be able to estimate PB∣A,C and PC from data. If C is observable, then the effect on the probability
distribution of B of manipulating A may be estimated. But if C is a hidden random variable (sometimes
the term latent is used) in the sense that no direct sample of the outcomes of C may be obtained, it
will not be possible to estimate the probabilities used on the right hand side and hence it will not be
possible to predict the effect on B of manipulating A. This is known as confounding.
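The difference between the adjusted quantity and plain conditioning can be seen on a toy instance of Figure 3.8. The following is a minimal sketch, assuming binary variables and illustrative probability tables that are not taken from the text; `do_probability` evaluates the sum over c of PB∣A,C(1∣a, c) PC(c), while `see_probability` evaluates PB∣A(1∣a), which weights the same terms by PC∣A(c∣a) instead.

```python
# Toy model for the DAG C -> A, C -> B, A -> B; all numbers are illustrative assumptions.
P_C = {0: 0.5, 1: 0.5}                      # P_C(c)
P_A1_given_C = {0: 0.2, 1: 0.8}             # P_{A|C}(1 | c)
P_B1_given_AC = {(0, 0): 0.3, (0, 1): 0.7,  # P_{B|A,C}(1 | a, c), keyed by (a, c)
                 (1, 0): 0.4, (1, 1): 0.8}


def p_a_given_c(a, c):
    p1 = P_A1_given_C[c]
    return p1 if a == 1 else 1.0 - p1


def do_probability(a):
    """P_{B||A}(1 || a): adjust for C using its marginal distribution."""
    return sum(P_B1_given_AC[(a, c)] * P_C[c] for c in P_C)


def see_probability(a):
    """P_{B|A}(1 | a): ordinary conditioning, which weights by P_{C|A}(c | a)."""
    joint_ac = {c: p_a_given_c(a, c) * P_C[c] for c in P_C}   # P_{A,C}(a, c)
    p_a = sum(joint_ac.values())
    return sum(P_B1_given_AC[(a, c)] * joint_ac[c] / p_a for c in P_C)


for a in (0, 1):
    print(a, round(do_probability(a), 4), round(see_probability(a), 4))
```

With these assumed numbers, conditioning suggests a much larger effect of A on B than the intervention produces, because C influences both A and B. If C were hidden, only `see_probability` could be estimated from data.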

3.6.2 Simpson's Paradox


Consider three binary variables, A, B and C . Simpson's paradox is the observation that there are
situations where

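\[
\frac{P_{B\mid C,A}(1\mid 1,1)/P_{B\mid C,A}(0\mid 1,1)}{P_{B\mid C,A}(1\mid 1,0)/P_{B\mid C,A}(0\mid 1,0)} > 1
\quad\text{and}\quad
\frac{P_{B\mid C,A}(1\mid 0,1)/P_{B\mid C,A}(0\mid 0,1)}{P_{B\mid C,A}(1\mid 0,0)/P_{B\mid C,A}(0\mid 0,0)} > 1,
\]

but

\[
\frac{P_{B\mid A}(1\mid 1)/P_{B\mid A}(0\mid 1)}{P_{B\mid A}(1\mid 0)/P_{B\mid A}(0\mid 0)} < 1.
\]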
For example, let A denote `treatment', B `recovery' and C `blood pressure'. Simpson's paradox states
that even if the treatment is comparatively good within the group where high blood pressure
is observed after treatment and also comparatively good within the group where low blood pressure
is observed after treatment, it may nevertheless be bad for the population as a whole. This occurs if
`treatment' increases blood pressure and increased blood pressure reduces the chances of recovery.
This situation is illustrated by the DAG given in Figure 3.10, where A denotes treatment, B
recovery and C blood pressure. Suppose that C is a hidden variable. Even if the `treatment' variable
A can be controlled, an intervention on A does not remove any arrows from the causal diagram; there
is the possibility of a Simpson's paradox, even with a controlled experiment.
If A denotes `treatment' and B `recovery' and C denotes a common cause of both A and B , as in
Figure 3.8, Simpson's paradox may be resolved if A can be controlled, because controlling A breaks
the causal link between C and A. This is the sure thing principle, considered next, which states that if
the treatment improves the chances of recovery for each level of the `common cause' variable C , then
it is good for the population as a whole.
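A concrete instance of the three inequalities can be produced with a few lines of arithmetic. The following is a minimal sketch for the blood pressure setting in which treatment raises blood pressure; the numbers and the function names are illustrative assumptions and do not come from the text. The marginal probability PB∣A is obtained from the law of total probability, PB∣A(1∣a) = ∑c PB∣C,A(1∣c, a) PC∣A(c∣a).

```python
# Illustrative numbers (assumptions) with treatment A raising blood pressure C.
P_C1_given_A = {0: 0.1, 1: 0.9}              # P_{C|A}(1 | a)
P_B1_given_CA = {(1, 1): 0.3, (1, 0): 0.2,   # P_{B|C,A}(1 | c, a), keyed by (c, a)
                 (0, 1): 0.8, (0, 0): 0.7}


def odds(p):
    return p / (1.0 - p)


def stratum_odds_ratio(c):
    """Odds ratio of recovery, treated versus untreated, within blood pressure level c."""
    return odds(P_B1_given_CA[(c, 1)]) / odds(P_B1_given_CA[(c, 0)])


def p_b1_given_a(a):
    """P_{B|A}(1 | a) by the law of total probability over C."""
    p_c1 = P_C1_given_A[a]
    return P_B1_given_CA[(1, a)] * p_c1 + P_B1_given_CA[(0, a)] * (1.0 - p_c1)


def marginal_odds_ratio():
    return odds(p_b1_given_a(1)) / odds(p_b1_given_a(0))


print(stratum_odds_ratio(1), stratum_odds_ratio(0), marginal_odds_ratio())
```

In this sketch both stratum odds ratios exceed 1, while the odds ratio for the population as a whole is below 1, since the treated group is dominated by the high blood pressure stratum where recovery is rare.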

Figure 3.10: A = treatment, B = recovery, C = blood pressure

3.6.3 The Sure Thing Principle


Consider again the situation of Figure 3.8. Suppose that A is controlled; values for the variable A are
assigned at random, so the link C → A is broken and hence the effect on B of manipulating A is not
confounded by the effects of hidden variables. The following result is referred to as `The Sure Thing
Principle'. It states that when Figure 3.8 represents the causal structure and there is do-conditioning
on A, then Simpson's paradox does not hold.
Proposition 3.13. Consider three binary variables A, B , C with the network given in Figure 3.8.
If
PB∣C∥A (1∣1∥1) < PB∣C∥A (1∣1∥0)
and
PB∣C∥A (1∣0∥1) < PB∣C∥A (1∣0∥0)
then

PB∥A (1∥1) < PB∥A (1∥0).


The notation means: first A is forced, then C is observed.

Proof Firstly,

PB∥A (1∥1) = PB∣C∥A (1∣1∥1)PC∥A (1∥1) + PB∣C∥A (1∣0∥1)PC∥A (0∥1).


Since C is a parent of A,

PC∥A (.∥1) = PC (.).


It follows that

\[
P_{B\|A}(1\|1) = \sum_{x=0}^{1} P_{B\mid C\|A}(1\mid x\|1)\,P_{C\|A}(x\|1) = \sum_{x=0}^{1} P_{B\mid C\|A}(1\mid x\|1)\,P_C(x).
\]

Similarly,

\[
P_{B\|A}(1\|0) = \sum_{x=0}^{1} P_{B\mid C\|A}(1\mid x\|0)\,P_C(x).
\]
It now follows directly from the assumptions that

PB∥A (1∥1) < PB∥A (1∥0),


which is the stated result.
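The final step of the proof can be written out explicitly. The following short derivation is a sketch that uses only the two displayed sums and the hypotheses of Proposition 3.13:

```latex
\[
P_{B\|A}(1\|1) - P_{B\|A}(1\|0)
= \sum_{x=0}^{1} \Bigl( P_{B\mid C\|A}(1\mid x\|1) - P_{B\mid C\|A}(1\mid x\|0) \Bigr) P_C(x) < 0 .
\]
```

Each bracket is strictly negative by assumption, and the weights PC(x) are non-negative and sum to 1, so at least one negative term has positive weight. The same weights PC(x) appear for both values of A only because A is controlled; under see-conditioning the weights would be PC∣A(x∣a), which depend on a, and the argument would fail.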

3.7 Identifiability: Back-Door and Front-Door Criteria


In a wide variety of situations, the aim is to compute the effects of an intervention, when it is not
possible to carry out a controlled experiment. The following example, quoted at the beginning of
Pearl's seminal article [107], illustrates the issues involved.

Example 3.14.

Consider an experiment in which soil fumigants X are to be used to increase oat crop yields Y,
by controlling the eelworm population, Z. These may also have direct effects, both beneficial and
adverse, on yields, besides the control of eelworms. We would like to assess the total effects of the
fumigants on yields when the study is complicated by several factors. First, controlled, randomised
experiments are infeasible: farmers insist on deciding for themselves which plots are to be fumigated.
Secondly, the farmers' choice of treatment depends on last year's eelworm population Z0. This is an
unknown quantity, but is strongly correlated with this year's population. This presents a classic case
of confounding bias, which interferes with the assessment of the treatment effects, regardless of sample
size. Fortunately, through laboratory analysis of soil samples, the eelworm populations before and
after treatment can be determined. Furthermore, since fumigants are only active for a short period,
they do not affect the growth of eelworms surviving the treatment; eelworm growth depends on the
population of birds and other predators. This, in turn, is correlated with last year's eelworm population
and hence with the treatment itself.

The situation may be represented by the causal diagram in Figure 3.11. The variables are:

• X: fumigants,

• Y: crop yields,

• Z0: last year's eelworm population,

• Z1: eelworm population before treatment,

• Z2: eelworm population after treatment,

• Z3: eelworm population at the end of the season,

• B: population of birds and other predators.

In this example, the variables B and Z0 are hidden variables.


The issue is whether interventional probabilities PY∥X (.∥X ← x) may be computed from information
on the observables (Z1, Z2, Z3, X, Y). When they can, they are said to be identifiable.

Definition 3.15 (Identifiable). The causal effect of X on Y is said to be identifiable if the quantity
PY∥X can be computed uniquely from the probability distribution of the observable variables.

Figure 3.11: A causal diagram representing the effect of fumigants X on yields Y

In this section, two graphical conditions are described which ensure that causal effects can be
estimated consistently from observational data. The first of these is named the back door criterion and is
equivalent to the ignorability condition of Rosenbaum and Rubin [119] (1983). The second of these is
the front-door criterion. This involves covariates which are affected by the treatment (in this example
Z2 and Z3).

3.7.1 Back Door Criterion


The back door criterion is defined as follows:

Definition 3.16 (Back Door Criterion). A set of nodes C satisfies the back door criterion relative to
an ordered pair of nodes (X, Y) ∈ V × V if

1. no node of C is a descendant of X and

2. C blocks every trail (in the sense of D-separation) between X and Y which contains an edge
pointing to X.

If A and B are two disjoint subsets of nodes, C is said to satisfy the back door criterion relative to
(A, B) if it satisfies the back door criterion relative to any pair (Xi, Xj) ∈ A × B.

Example 3.17.

In Figure 3.11, the set C = {Z0} satisfies the back door criterion relative to (X, Y). The node Z0 is
unobservable. The set C = {Z1, Z2, Z3} does block all trails between X and Y with an arrow pointing
into X, but clearly does not satisfy the back door criterion since both Z2 and Z3 are descendants of X.

The name `back door criterion' reflects the fact that the second condition requires that only trails with
an arrow pointing into Xi be blocked. Such trails can be seen as entering Xi through a back door.

Example 3.18.

Consider the back door criterion DAG, given in Figure 3.12. The sets of variables C1 = {Z3, Z4}
and C2 = {Z4, Z5} satisfy the back door criterion relative to the ordered pair of nodes (X, Y), whereas
C3 = {Z4} does not satisfy the criterion relative to the ordered pair of nodes (X, Y); if Z4 is instantiated,
the Bayes ball may pass through the collider connection from Z1 to Z2.

Figure 3.12: Back Door Criterion
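The claims of Example 3.18 can be checked mechanically. The following is a minimal sketch under a loudly stated assumption: the edge list `EDGES` is a guess at the structure of Figure 3.12, chosen to match the standard back-door example in Pearl's work, and should be replaced by the edges of the actual figure. The check uses two known facts: D-separation can be tested on the moralised ancestral graph, and the second condition of Definition 3.16 is equivalent to C separating X from Y in the graph with all arrows emerging from X removed, provided C contains no descendant of X. All function names are illustrative.

```python
from itertools import combinations

# ASSUMED edge list for Figure 3.12 (hypothetical reconstruction, not taken from the text).
EDGES = [("Z1", "Z3"), ("Z1", "Z4"), ("Z2", "Z4"), ("Z2", "Z5"),
         ("Z3", "X"), ("Z4", "X"), ("Z4", "Y"), ("Z5", "Y"),
         ("X", "Z6"), ("Z6", "Y")]


def parents(edges, node):
    return {a for a, b in edges if b == node}


def ancestors(edges, nodes):
    """Return `nodes` together with all of their ancestors."""
    result, stack = set(nodes), list(nodes)
    while stack:
        for p in parents(edges, stack.pop()):
            if p not in result:
                result.add(p)
                stack.append(p)
    return result


def d_separated(edges, x, y, given):
    """Test whether x and y are D-separated by `given`, via the moralised ancestral graph."""
    keep = ancestors(edges, {x, y} | set(given))
    sub = [(a, b) for a, b in edges if a in keep and b in keep]
    undirected = {frozenset(e) for e in sub}
    for node in keep:                       # marry the parents of each common child
        for p, q in combinations(sorted(parents(sub, node)), 2):
            undirected.add(frozenset((p, q)))
    seen, stack = {x}, [x]                  # search for a path from x to y avoiding `given`
    while stack:
        cur = stack.pop()
        if cur == y:
            return False
        for e in undirected:
            if cur in e:
                (nxt,) = e - {cur}
                if nxt not in seen and nxt not in given:
                    seen.add(nxt)
                    stack.append(nxt)
    return True


def satisfies_back_door(edges, x, y, c):
    nodes = {v for e in edges for v in e}
    descendants_of_x = {n for n in nodes if n != x and x in ancestors(edges, {n})}
    if descendants_of_x & set(c):           # condition 1 of Definition 3.16
        return False
    without_out_of_x = [(a, b) for a, b in edges if a != x]
    return d_separated(without_out_of_x, x, y, set(c))   # condition 2


for c in ({"Z3", "Z4"}, {"Z4", "Z5"}, {"Z4"}):
    print(sorted(c), satisfies_back_door(EDGES, "X", "Y", c))
```

Under the assumed edge list, the sets {Z3, Z4} and {Z4, Z5} pass and {Z4} fails, in agreement with Example 3.18: conditioning on Z4 alone opens the trail through Z1 and Z2.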

Identifiability Consider a causal network and let A ⊆ V /{X, Y} be a subset of the variables which
satisfies the back door criterion with respect to an ordered pair (X, Y). The aim is to show that the
set of variables A plays a similar role to the variable C in the discussion on confounding.
The quantity PY∥X may be expressed as:

\[
P_{Y\|X} = (P_{Y,A\|X})^{\downarrow A} = (P_{Y\mid A\|X}\,P_{A\|X})^{\downarrow A}
\]

where the notation ↓ A means: marginalise over the variables in A.

The two factors in the sum may be expressed in terms of see-conditioning; PA∥X = PA and PY∣A∥X =
PY∣A,X. Firstly, since no variables of A are descendants of X, it follows
that PA∥X = PA. This is seen as follows: the variables may be ordered as V = (Y1, . . . , Yn, X, Yn+1, . . . , Yn+m)
where the ordering is chosen such that Pa(Yj) ⊆ {Y1, . . . , Yj−1} for j ≤ n, Pa(X) ⊆ {Y1, . . . , Yn} and
Pa(Yj) ⊆ {Y1, . . . , Yn, X, Yn+1, . . . , Yj−1} for j ∈ {n + 1, . . . , n + m} and where A ⊆ {Y1, . . . , Yn}. From
the intervention formula,

\[
P_{V/X\,\|\,X} = \prod_{j=1}^{m+n} P_{Y_j\mid \mathrm{Pa}_j}
\]

while

\[
P_V = P_{X\mid \mathrm{Pa}(X)}(x \mid \pi_X) \prod_{j=1}^{m+n} P_{Y_j\mid \mathrm{Pa}_j}.
\]

Now, marginalise over variables Yn+1, . . . , Yn+m in both expressions, then marginalise over X in the
second expression and finally marginalise over all remaining variables not in A. The same answer
obtains for both expressions, so that

\[
P_{A\|X} = P_A.
\]

Second, since A blocks all trails between Y and X that have an edge pointing towards X, it follows
that Y ⊥ (Pa(X)/A) ∥G A. It follows, with notation that should be clear, using Proposition 3.12 that

\[
\begin{aligned}
P_{Y\mid A\|X} &= (P_{Y\mid A,\mathrm{Pa}(X)\|X}\,P_{\mathrm{Pa}(X)/A\,\|\,X})^{\downarrow \mathrm{Pa}(X)/A} \\
&= (P_{Y\mid A,\mathrm{Pa}(X),X}\,P_{\mathrm{Pa}(X)/A})^{\downarrow \mathrm{Pa}(X)/A} \\
&= (P_{Y\mid A,X}\,P_{\mathrm{Pa}(X)/A})^{\downarrow \mathrm{Pa}(X)/A} \\
&= P_{Y\mid A,X}.
\end{aligned}
\]

To go from the first to the second line: do-conditioning on X does not alter the probabilities for ancestors of
X, hence PPa(X)/A∥X = PPa(X)/A. Also, for conditional probabilities of Y, do-conditioning on ancestors
of Y is the same as see-conditioning on ancestors of Y, hence PY∣A,Pa(X)∥X = PY∣A,Pa(X),X.

To go from the second to the third line: once X is known, Pa(X) gives no further information.

Hence, if A ∩ (X ∪ Y) = ∅, then using PA∥X = PA and summing over A gives:

\[
P_{Y\|X} = (P_{Y\mid A,X}\,P_A)^{\downarrow A}. \tag{3.9}
\]

If a set of variables A satisfying the back door criterion with respect to (X, Y) can be chosen such
that PA and PY∣A,X can be estimated from the observed data, then the distribution PY∥X can also be
estimated from the observed data.

Lemma 3.19 (Identifiability). If a set of variables Z satisfies the back door criterion relative to (X, Y),
then the causal effect of X on Y is given by the formula

\[
P_{Y\|X} = (P_{Y\mid X,Z}\,P_Z)^{\downarrow Z} \tag{3.10}
\]

and the intervention of X on Y is said to be identifiable.

Proof This follows directly from the definition of identifiable (Definition 3.15) and the analysis
above.

Formula (3.10) is named adjustment for concomitants. The word identifiability refers to the fact that the
concomitants Z satisfying the back door criterion are observable and hence it is possible to compute, or
identify, the intervention probability PY∥X (y∥x) using the `see' conditional probabilities $(P_{X_j\mid \mathrm{Pa}_j})_{j=1}^{d}$.

3.7.2 Front Door Criterion


The front door criterion is defined as follows:

Definition 3.20 (Front Door Criterion). A set of variables Z satisfies the front door criterion relative
to the ordered pair (X, Y) if:

• Z intercepts all directed paths from X to Y,

• there is no unblocked back-door path between X and Z,

• every back-door path between Z and Y is blocked by X.

The situation is illustrated in Figure 3.13. The variable U is a hidden (latent) variable. The variable
Z satisfies the front door criterion relative to (X, Y).

Figure 3.13: Front Door Criterion (X → Z → Y, with the hidden variable U a parent of both X and Y)

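The result is the following:

Theorem 3.21 (Front Door Criterion). Let Z satisfy the front door criterion relative to the ordered
pair (X, Y). Then the causal effect on Y of an intervention on X is:

\[
P_{Y\|X} = (P_{Z\mid X}\,P_{Y\|Z})^{\downarrow Z}, \qquad P_{Y\|Z} = (P_{Y\mid X,Z}\,P_X)^{\downarrow X}.
\]

The first identity holds because there is no unblocked back-door path between X and Z, so that
PZ∥X = PZ∣X, and because Z intercepts all directed paths from X to Y; in Figure 3.13, note that
PY∥Z = (PY∣Z,U PU)↓U. The second identity is the adjustment formula (3.10), since X blocks every
back-door path between Z and Y. In other words, if the see-conditional probabilities PZ∣X and
PY∣X,Z and the marginal PX are available, then the intervention PY∥X may be computed.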
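The front door adjustment can be verified numerically on a model with the structure of Figure 3.13. The following is a minimal sketch, assuming binary variables, the edges U → X, U → Y, X → Z, Z → Y, and illustrative parameter values that are not taken from the text. It uses the front-door adjustment in the standard form PY∥X(y∥x) = ∑z PZ∣X(z∣x) ∑x′ PY∣X,Z(y∣x′, z) PX(x′), computed only from the observable distribution of (X, Z, Y), and compares it with the value obtained from the full model including the hidden U.

```python
from itertools import product

# Illustrative parameters (assumptions) for U -> X, U -> Y, X -> Z, Z -> Y.
P_U1 = 0.4
P_X1_given_U = {0: 0.2, 1: 0.7}
P_Z1_given_X = {0: 0.3, 1: 0.9}
P_Y1_given_ZU = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.8}   # keyed by (z, u)


def bern(p1, value):
    return p1 if value == 1 else 1.0 - p1


def joint(u, x, z, y):
    return (bern(P_U1, u) * bern(P_X1_given_U[u], x)
            * bern(P_Z1_given_X[x], z) * bern(P_Y1_given_ZU[(z, u)], y))


# Observable distribution over (X, Z, Y): the hidden variable U is summed out.
OBS = {(x, z, y): sum(joint(u, x, z, y) for u in (0, 1))
       for x, z, y in product((0, 1), repeat=3)}


def obs(**fixed):
    """Probability of the event given by keyword arguments among x, z, y."""
    return sum(p for (x, z, y), p in OBS.items()
               if all({"x": x, "z": z, "y": y}[k] == v for k, v in fixed.items()))


def front_door(y, x):
    """Front door adjustment, using observable quantities only."""
    total = 0.0
    for z in (0, 1):
        p_z_given_x = obs(x=x, z=z) / obs(x=x)
        inner = sum(obs(x=xp, z=z, y=y) / obs(x=xp, z=z) * obs(x=xp) for xp in (0, 1))
        total += p_z_given_x * inner
    return total


def truth(y, x):
    """P_{Y||X}(y || x) from the full model, which requires knowledge of U."""
    return sum(bern(P_Z1_given_X[x], z) * bern(P_Y1_given_ZU[(z, u)], y) * bern(P_U1, u)
               for z in (0, 1) for u in (0, 1))


print(front_door(1, 1), truth(1, 1), obs(x=1, y=1) / obs(x=1))
```

In this sketch the first two printed values agree, while the third, the see-conditional PY∣X(1∣1), differs from them because of the hidden common cause U.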

3.7.3 Non-Identifiability
There are various conditions relating to non-identifiability of PY∥X. These include:

1. A necessary condition for non-identifiability is that there is an unblockable back-door path
between X and Y; that is, a path ending with an arrow pointing into X which cannot be blocked
by observable non-descendants of X. This is not a sufficient condition, as Figure 3.13 illustrates.
This shows a situation where there is a non-blockable back-door path, yet PY∥X is identifiable
(front-door criterion).

2. A sufficient condition for non-identifiability of PY∥X is existence of a confounding path between X
and any of its children on a path from X to Y; two examples are given in Figure 3.14.

Figure 3.14: Sufficient condition for non-identifiability

3. Local identifiability is not a sufficient condition for global identifiability. In Figure 3.15, PZ1∥X,
PZ2∥X, PY∥Z1, PY∥Z2 are all identifiable, but PY∥X is not.

Figure 3.15: Local identifiability is not sufficient for global identifiability

3.8 Inference Rules for Intervention Calculus


Let X, Y and Z be arbitrary disjoint sets of nodes in a DAG G = (V, D). $G_{\overline{X}}$ denotes the graph obtained
from G by deleting all arrows {(α, β) ∈ D ∶ β ∈ X} (the arrows pointing into X), while $G_{\underline{X}}$ denotes the
graph obtained from G by deleting all arrows {(α, β) ∶ (α, β) ∈ D, α ∈ X} (the arrows emerging from X).
The notation $G_{\overline{X},\underline{Z}}$ denotes the graph obtained from G
by deleting all arrows {(α, β) ∈ D ∶ β ∈ X or α ∈ Z}. The following theorem states three rules that
are valid for every interventional distribution compatible with G.

Theorem 3.22 (Rules for Intervention Calculus). Let G = (V, D) be a DAG associated with a causal
model and let P denote the probability distribution. For any disjoint subsets X, Y, Z, W of V, the
following rules hold:

1. (insertion / deletion of observations)

\[
Y \perp Z \,\|_{G_{\overline{X}}}\, X \cup W \;\Rightarrow\; P_{Y\mid Z,W\|X} = P_{Y\mid W\|X} \tag{3.11}
\]

2. (action / observation exchange)

\[
Y \perp Z \,\|_{G_{\overline{X},\underline{Z}}}\, X \cup W \;\Rightarrow\; P_{Y\mid W\|X,Z} = P_{Y\mid Z,W\|X} \tag{3.12}
\]

3. (insertion / deletion of actions)

\[
Y \perp Z \,\|_{G_{\overline{X},\overline{Z(W)}}}\, X \cup W \;\Rightarrow\; P_{Y\mid W\|X,Z} = P_{Y\mid W\|X}, \tag{3.13}
\]

where Z(W) is the set of Z-nodes which are not ancestors of any W node in $G_{\overline{X}}$.

Proof of Theorem 3.22

1. The interventional distribution PV /X∥X factorises along the graph $G_{\overline{X}}$. Since the variables of
X have no parents in $G_{\overline{X}}$, do- and see-conditioning on X are equivalent for distributions that
factorise along $G_{\overline{X}}$. The separation statement implies that, for the mutilated graph (where X
has been instantiated by intervention), Y is D-separated from Z by X ∪ W. Equation (3.11)
follows because a D-separation statement implies the corresponding conditional independence
statement.

2. The interventional distribution PV /X∥X factorises along $G_{\overline{X}}$. The D-separation statement of (3.12)
implies, furthermore, that all X ∪ W-active trails in $G_{\overline{X}}$ between Y and Z have an arrow from
a node in Z to one of its children, hence all back-door paths from Z to Y in $G_{\overline{X}}$ are blocked by
X ∪ W. It follows that the operations setting Z ← z or conditioning on Z = z have the same
effect on Y.

3. Assume that the D-separation condition holds; then all W ∪ X-active trails between Y and Z
in $G_{\overline{X}}$ have an edge γ ↦ β where β ∈ Z(W) (since removing arrows into Z(W) blocks the trail).
For such a node β, none of its descendants are in W. Therefore,

\[
P_{Y\mid W\|X,Z(W)} = P_{Y\mid W\|X}.
\]

The D-separation statement implies that $Y \perp Z/Z(W)\,\|_{G_{\overline{X}}}\, X \cup W$, from which the result follows.

Corollary 3.23. Let P be a probability distribution which factorises according to a causal model
(Definition 3.4). An intervention probability q = PY∥X (y∥x), where X and Y are disjoint subsets of V,
is identifiable if there is a finite sequence of transformations, each conforming to one of
the inference rules in Theorem 3.22, which reduces q to a probability expression that only involves see
conditioning.

Proof Clear

The converse of Corollary 3.23 is also true. This will now be dealt with. Firstly, Tian and Pearl [135] (2002)
developed systematic criteria for establishing the interventional statements that could be computed
from see-conditioning statements. Huang and Valtorta [64] (2006) then established that these criteria
could be obtained from the three rules of Theorem 3.22. The problem was also dealt with in Shpitser
and Pearl [124] (2006); graphical criteria are discussed in Tian and Shpitser [136] (2010).
Only the interventional probability distributions over the observable nodes are of interest; the
following rather obvious lemma helps to simplify the problem.

Lemma 3.24. If any of the three rules can be used on a model with graph G, it can also be used
on a model that is obtained by removing all hidden nodes in U that have no descendants among the
observable nodes V.

Proof Clear.
The following lemma establishes that only rules 2 and 3 need be considered for a completeness
theorem.

Lemma 3.25. Rule 1 follows from rule 2 and rule 3.

Proof Since all D-separation statements before removal of an edge remain true after the edge is
removed, the conditions for the application of rules 2 and 3 are satisfied if the condition for rule 1
is satisfied. The application of rule 1 can be replaced by the application of rule 2 followed by the
application of rule 3.
In detail: suppose the D-separation statement of (3.11) holds. Then the D-separation statements
of (3.12) and (3.13) both hold, so an application of rule 2 gives:

PY∣W,Z∥X = PY∣W∥X,Z

and an application of rule 3 gives:

PY∣W∥X,Z = PY∣W∥X

from which it follows that:

PY∣Z,W∥X = PY∣W∥X .

This is the statement of Rule 1.

At the heart of the systematic identification of interventional statements that may be expressed in
terms of see-conditioning by Tian and Pearl [135] is the concept of c-components. All the c-factors are
computable from the probability distribution over the observed variables.

Definition 3.26 (c-component). Recall that the full node set is the disjoint union V ∪ U, where V is the
set of observable nodes and U is the set of hidden nodes. Two nodes α, β ∈ U are related under the
c-component relation (written α ∼c β) if and only if at least one of the following holds:

1. there is an edge between α and β

2. α and β are both parents of the same node γ ∈ V

3. both α and β are in c-component relation with respect to another node δ ∈ U.

A c-component of the node set V is either a set containing a single node from V if that node has no
parents in U or it consists of all the U nodes which are c-component related to each other, together
with all V nodes that have a parent in U which is a member of that c-component.

Let H denote all the nodes of a c-component and let H′ = H ∩ V (the observable nodes of a
c-component). Then a c-factor is simply PH′, the probability distribution over the observable nodes of
the c-component.
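Definition 3.26 translates directly into a small grouping procedure. The following is a minimal sketch that returns the observable parts H′ of the c-components; the function name `c_components`, its arguments and the use of a union-find structure are illustrative choices, not part of the text. The example graph is the front door structure with observable nodes X, Z, Y and a single hidden node U that is a parent of X and Y, which is an assumption about the layout of Figure 3.13.

```python
def c_components(observed, hidden, edges):
    """Observable parts of the c-components of a graph with hidden nodes.

    `edges` is a list of (parent, child) pairs; `observed` and `hidden` are sets.
    """
    root = {u: u for u in hidden}            # union-find forest over the hidden nodes

    def find(u):
        while root[u] != u:
            root[u] = root[root[u]]
            u = root[u]
        return u

    def union(u, v):
        root[find(u)] = find(v)

    def hidden_parents(node):
        return [a for a, b in edges if b == node and a in hidden]

    for a, b in edges:                       # condition 1: an edge between two hidden nodes
        if a in hidden and b in hidden:
            union(a, b)
    for node in observed:                    # condition 2: hidden parents of a common child
        hp = hidden_parents(node)
        for u in hp[1:]:
            union(hp[0], u)
    # condition 3 (transitivity) is handled implicitly by the union-find structure

    groups = {}
    for node in observed:
        hp = hidden_parents(node)
        key = find(hp[0]) if hp else ("single", node)   # no hidden parent: singleton component
        groups.setdefault(key, set()).add(node)
    return sorted(sorted(g) for g in groups.values())


# Front door structure (assumed): U -> X, U -> Y, X -> Z, Z -> Y, with U hidden.
print(c_components({"X", "Z", "Y"}, {"U"},
                   [("U", "X"), ("U", "Y"), ("X", "Z"), ("Z", "Y")]))
```

For the assumed front door graph the observable nodes split into the groups {X, Y} and {Z}: X and Y share the hidden parent U, while Z has no hidden parent and forms a c-component on its own.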

The relation ∼c on U is reflexive, symmetric and transitive and hence defines a partition of U. Based
on this relation, U can be divided into disjoint c-component related parts.
Now suppose that P factorises according to a semi-Markov model (Definition 3.5). Lemmas 3.27
and 3.28 form the basis of the characterisation of Tian and Pearl [135] of interventional statements
that can be expressed by see-conditioning statements. The proofs given here are from [64], which
demonstrates that they follow from Rules 2 and 3 of Theorem 3.22.

Lemma 3.27. If W ⊆ C ⊆ V and W is an ancestral set in $G_{C\cup(\mathrm{Pa}(C)\cap U)}$, then

\[
(P_{C\|V/C})^{\downarrow C/W} = P_{W\|V/W}.
\]

Furthermore, this statement is a consequence of Rule 3 from Theorem 3.22.

A set S ⊆ V in a graph G is called an ancestral set if for each α ∈ S, every ancestor of α is also in S.
The set an(α) denotes the ancestors of the node α; α ∉ an(α). A topological ordering of the nodes in a
graph G is an ordering σ of the nodes such that for each node β and each node γ ∈ an(β), σ(γ) < σ(β).

Proof Trivially,

\[
(P_{C\|V/C})^{\downarrow C/W} = P_{W\|V/C}.
\]

It has to be shown that this is equal to PW∥V/W.

If W is an ancestral set in $G_{C\cup(\mathrm{Pa}(C)\cap U)}$, it follows that none of the nodes of W have parents in
(C ∪ (Pa(C) ∩ U))/W, although they may have parents in Pa(C) ∩ V.
Since W is an ancestral set in $G_{C\cup(\mathrm{Pa}(C)\cap U)}$, there is a topological ordering of the nodes in
$G_{C\cup(\mathrm{Pa}(C)\cap U)}$ that starts with all the nodes in W and continues with the other nodes.
The lemma may be proved by induction. If W = V, the lemma is trivially true. Otherwise, consider
the first node, say α, in the topological order just described that is in C, but not in W. By induction,
it is necessary and sufficient to prove that if W ⊂ V, α ∈ V /W, C = W ∪ {α} and W is an ancestral set
in $G_{C\cup(\mathrm{Pa}(C)\cap U)}$, then

\[
(P_{C\|V/C})^{\downarrow C/W} = P_{W\|V/W}.
\]

Let the nodes of W be labelled: W = {1, . . . , k} and let Y = V/(W ∪ {α}). With obvious notation, the
identity to be established may be rewritten as:

\[
(P_{W,\alpha\|Y})^{\downarrow C/W} = P_{W\|Y,\alpha}.
\]

Marginalising over the variable labelled α and using the fact that a probability distribution sums to 1
gives:

\[
(P_{W,\alpha\|Y})^{\downarrow \alpha} = (P_{\alpha\mid W\|Y})^{\downarrow \alpha}\,P_{W\|Y} = P_{W\|Y}.
\]

By construction,

\[
W \perp \{\alpha\}\,\|_{G_{\overline{Y},\overline{\{\alpha\}}}}\, Y.
\]

This is because in the graph $G_{\overline{Y},\overline{\{\alpha\}}}$, if there is a Y-active trail between α and a node i ∈ {1, . . . , k}, the trail
cannot include any nodes in Y, since Y nodes are instantiated fork nodes. Therefore, any Y-active
trail in $G_{\overline{Y},\overline{\{\alpha\}}}$ between α and i which does not contain any other nodes of W, firstly, cannot have an
arrow pointing into α, since the arrows between α and its parents have been removed. If the trail has
no collider connections, then it is of the form α ↦ . . . ↦ i, which is a contradiction since the nodes of
W are of a lower topological order than α. If it contains a collider which is either instantiated, or one
of its descendants is instantiated, then the links to the parents of the instantiated nodes have been
removed, hence such a connection does not exist. Therefore, such a trail does not exist.

Using rule 3,

\[
P_{W\|Y} = P_{W\|\alpha,Y}
\]

and the result follows.

Lemma 3.28. Let H ⊆ V and let H1′, . . . , Hn′ denote the c-components of the sub-graph $G_{H\cup(\mathrm{Pa}(H)\cap U)}$.
Let Hi = Hi′ ∩ H. Then

1.
\[
P_{H\|V/H} = \prod_{i=1}^{n} P_{H_i\|V/H_i}
\]

2. Each PHi∥V/Hi is computable from PH∥V/H in the following way. Let k be the number of variables
in H and let α1 < . . . < αk be a topological order of the variables of H in $G_{H\cup(\mathrm{Pa}(H)\cap U)}$. Let
$H^{(j)}$ = {α1, . . . , αj} for j = 1, . . . , k and $H^{(0)}$ = ∅ (the empty set). Then each PHi∥V/Hi ∶ i =
1, . . . , n is given by:

\[
P_{H_i\|V/H_i} = \prod_{j\,:\,\alpha_j \in H_i} \frac{P_{H^{(j)}\|V/H^{(j)}}}{P_{H^{(j-1)}\|V/H^{(j-1)}}}.
\]

Each $P_{H^{(j)}\|V/H^{(j)}}$ ∶ j = 0, 1, . . . , k is given by the marginalisation:

\[
P_{H^{(j)}\|V/H^{(j)}} = (P_{H\|V/H})^{\downarrow H/H^{(j)}}. \tag{3.14}
\]

Furthermore, the results of this lemma are a consequence of Rules 2 and 3 of Theorem 3.22.

Proof The first statement is proved first, then Equation (3.14) is established and finally the second
statement is proved.
The proof is by induction. When H includes exactly one node from V, the result is clearly true,
from the definition.
Suppose that the two statements are true for H ∶ ∣H∣ ≤ k for an integer k and consider an arbitrary
set E ⊂ V of size ∣E∣ = k + 1. Let H = {α1, . . . , αk} and E = H ∪ {αk+1}, where the indices correspond
to the topological order. Let H1′, . . . , Hn′ be the c-components of H ∪ (Pa(H) ∩ U) in $G_{H\cup(\mathrm{Pa}(H)\cap U)}$
and let Hi = Hi′ ∩ H for 1 ≤ i ≤ n. Let Y = V/E.
Now consider the c-components of E ∪ (Pa(E) ∩ U) in $G_{E\cup(\mathrm{Pa}(E)\cap U)}$.

If Pa(αk+1) ∩ Pa(H) ∩ U = ∅, then the c-components are $E_1', \ldots, E_{n+1}'$, where Ei′ = Hi′ for i = 1, . . . , n
and $E_{n+1}'$ = {αk+1} ∪ (Pa(αk+1) ∩ U). It follows that Ei ∶= Ei′ ∩ E = Hi for i = 1, . . . , n and En+1 = {αk+1}.
In this case, let m = n + 1.
If Pa(αk+1) ∩ Pa(H) ∩ U ≠ ∅, then αk+1 shares at least one parent in U with a node in H. Let
$E_1', \ldots, E_m'$ denote the c-components of E ∪ (Pa(E) ∩ U) in $G_{E\cup(\mathrm{Pa}(E)\cap U)}$ and let Ei = Ei′ ∩ E. It
follows that, relabelling if necessary, Ei = Hi for i = 1, . . . , m − 1 and $E_m = \{\alpha_{k+1}\} \cup \bigcup_{i=m}^{n} H_i$.

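By the inductive hypothesis,

\[
P_{H\|V/H} = \prod_{i=1}^{n} P_{H_i\|V/H_i},
\]

which may be rewritten as:

\[
P_{H\|(V/E)\cup\{\alpha_{k+1}\}} = \prod_{i=1}^{n} P_{H_i\,\|\,(V/E)\cup\{\alpha_{k+1}\}\cup H_1^{i-1}\cup H_{i+1}^{n}}
\]

where the notation $H_i^j$ for i < j means: $\cup_{k=i}^{j} H_k$. For the first statement, it is required to prove that:

\[
P_{H,\{\alpha_{k+1}\}\,\|\,V/(H\cup\{\alpha_{k+1}\})} = P_{H_m^n,\{\alpha_{k+1}\}\,\|\,(V/E)\cup H_1^{m-1}} \prod_{j=1}^{m-1} P_{H_j\|(V/H_j)}.
\]

A straightforward factorisation gives:

\[
P_{H,\{\alpha_{k+1}\}\,\|\,(V/(H\cup\{\alpha_{k+1}\}))} = P_{\{\alpha_{k+1}\}\mid H\,\|\,(V/(H\cup\{\alpha_{k+1}\}))}\,P_{H\,\|\,(V/(H\cup\{\alpha_{k+1}\}))}. \tag{3.15}
\]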

The notation $Y = V/E$ will be used. The D-separation statement $H \perp \{\alpha_{k+1}\} \,\|_{G_{Y, \alpha_{k+1}}}\, Y$ holds. This
is because any Y-active trail in $G_{Y, \alpha_{k+1}}$ between $\alpha_{k+1}$ and a node in H does not include any node in Y,
since nodes in Y are instantiated fork nodes. Since the arrows from parents of $\alpha_{k+1}$ have been removed
and the nodes of H are of a lower topological order, any active trail contains an instantiated collider
connection. But the instantiated nodes are in Y, hence links to their parents have been removed, hence
such a trail does not exist.
Using Rule 3, it follows that:

PH∥Y = PH∥αk+1 ,Y ,
70 CHAPTER 3. INTERVENTION CALCULUS

from which (3.15) gives:

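\[ P_{H, \alpha_{k+1} \| Y} = P_{\alpha_{k+1} \mid H \| Y} \, P_{H \| \alpha_{k+1}, Y} = P_{\alpha_{k+1} \mid H \| Y} \prod_{j=1}^{m-1} P_{H_j \| H_1^{j-1}, H_{j+1}^{n}, Y, \alpha_{k+1}} \times \prod_{j=m}^{n} P_{H_j \| H_1^{j-1}, H_{j+1}^{n}, Y, \alpha_{k+1}}. \]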

By the inductive hypothesis,

\[ \prod_{j=1}^{m-1} P_{H_j \| H_1^{j-1}, H_{j+1}^{n}, Y, \alpha_{k+1}} = P_{H_1^{m-1} \| H_m^n, Y, \alpha_{k+1}}. \]

As before,

\[ H_1^{m-1} \perp \alpha_{k+1} \,\|_{G_{Y, H_m^n, \alpha_{k+1}}}\, Y \cup H_m^n. \]

This is because if there is a $Y \cup H_m^n$-active trail in $G_{Y, H_m^n, \alpha_{k+1}}$ between $\alpha_{k+1}$ and $H_1^{m-1}$, the trail does
not contain any nodes of Y, since all nodes of Y are instantiated forks. Since the links between $\alpha_{k+1}$
and its parents have been removed and since the nodes of $H_1^{m-1}$ are of lower topological order than
$\alpha_{k+1}$, the trail must contain at least one collider which is either instantiated or has an instantiated
descendant. But all instantiated nodes have had links to their parents removed, hence no such trail
exists.
It now follows from Rule 3 that:

\[ P_{H_1^{m-1} \| H_m^n, Y, \alpha_{k+1}} = P_{H_1^{m-1} \| H_m^n, Y}. \]

The next step is to show that

\[ \alpha_{k+1} \perp H_m^n \,\|_{G_{Y, H_m^n}}\, Y \cup H_1^{m-1}. \]

If there is a $Y \cup H_1^{m-1}$-active trail in $G_{Y, H_m^n}$ between $\alpha_{k+1}$ and $H_m^n$, the trail does not contain any
node of Y, since nodes in Y are instantiated forks. Furthermore, there are no arrows into a node of $H_m^n$
since links between $H_m^n$ and its descendants have been removed. Consider the shortest $Y \cup H_1^{m-1}$-active
trail between $\alpha_{k+1}$ and a node in $H_m^n$. Let the node be designated $\beta$. Except for the end points, the
trail contains no nodes in V, hence all are in U. Since the nodes of $H_m^n$ are of a lower topological order
than $\alpha_{k+1}$, the trail contains at least one fork node. It follows from the definition of c-component that
$\alpha_{k+1}$ and $\beta$ belong to the same c-component, which is a contradiction.

Using Rule 2, it therefore follows that:

\[ P_{\alpha_{k+1} \mid H \| Y} = P_{\alpha_{k+1} \mid H_1^{m-1}, H_m^n \| Y} = P_{\alpha_{k+1} \mid H_1^{m-1} \| H_m^n, Y}. \]

Putting these together, it follows that:


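\[ P_{H, \alpha_{k+1} \| Y} = P_{\alpha_{k+1} \mid H \| Y} \prod_{j=1}^{m-1} P_{H_j \| H_1^{j-1}, H_{j+1}^{n}, Y} \prod_{j=m}^{n} P_{H_j \| H_1^{j-1}, H_{j+1}^{n}, Y, \alpha_{k+1}} \]
\[ = P_{\alpha_{k+1} \mid H_1^{m-1} \| H_m^n, Y} \, P_{H_1^{m-1} \| H_m^n, Y} \prod_{j=m}^{n} P_{H_j \| H_1^{j-1}, H_{j+1}^{n}, Y, \alpha_{k+1}} \]
\[ = P_{H_1^{m-1}, \alpha_{k+1} \| H_m^n, Y} \prod_{j=m}^{n} P_{H_j \| H_1^{j-1}, H_{j+1}^{n}, Y, \alpha_{k+1}}. \]

This establishes the first statement, since $P_{H, \alpha_{k+1} \| Y} = P_{E \| V/E}$.

Now Equation (3.14) is established. From the first part,

\[ \left( P_{H \| V/H} \right)^{\downarrow H/H^{(j)}} = \prod_{i=1}^{n} \left( P_{H_i \| V/H_i} \right)^{\downarrow H_i/H^{(j)}}. \]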

Now let $\tilde{H}_i^{(j)} = (H^{(j)} \cap H_i) \cup (\mathrm{Pa}(H_i) \cap U)$ and let $\tilde{H}_i = H_i \cup (\mathrm{Pa}(H_i) \cap U)$. Then, by construction,
$\tilde{H}_i^{(j)}$ is an ancestral subset of $H_i \cup (\mathrm{Pa}(H_i) \cap U)$ in $G_{H_i \cup (\mathrm{Pa}(H_i) \cap U)}$ and hence (by extending the set
V for a moment to include all the nodes of $\tilde{H}_i^{(j)}$), it follows by Lemma 3.27 that:

\[ P_{\tilde{H}_i^{(j)} \| V/H^{(j)}} = \left( P_{\tilde{H}_i \| V/H_i} \right)^{\downarrow H_i/H^{(j)}}. \]

Marginalising both sides over $\mathrm{Pa}(H_i) \cap U$ gives:

\[ P_{H^{(j)} \cap H_i \| V/H^{(j)}} = \left( P_{H_i \| V/H_i} \right)^{\downarrow H_i/H^{(j)}}. \]

It follows that

\[ \left( P_{H \| V/H} \right)^{\downarrow H/H^{(j)}} = \prod_{i=1}^{n} P_{H^{(j)} \cap H_i \| V/H^{(j)}}. \]

Equation (3.14) now follows because, do-conditioned on $V/H^{(j)}$, the node sets $H^{(j)} \cap H_i : i = 1, \ldots, n$ are
D-separated from each other.

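It follows from (3.14) (by considering $H^{(k)} = H$ and $H^{(k+1)} = E$) that

\[ P_{H \| V/H} = \left( P_{H, \alpha_{k+1} \| V/(H \cup \{\alpha_{k+1}\})} \right)^{\downarrow \{\alpha_{k+1}\}} = \left( P_{E \| V/E} \right)^{\downarrow \{\alpha_{k+1}\}}. \qquad (3.16) \]

By the inductive hypothesis, H satisfies statement 2, so that:

\[ P_{H_i \| V/H_i} = \prod_{j \mid \alpha_j \in H_i} \frac{P_{H^{(j)} \| V/H^{(j)}}}{P_{H^{(j-1)} \| V/H^{(j-1)}}}, \qquad i = 1, \ldots, n. \qquad (3.17) \]

By construction, $E^{(j)} = H^{(j)}$ for $j = 1, \ldots, k$ and $E^{(k+1)} = E = \{\alpha_1, \ldots, \alpha_{k+1}\}$. For $j = 1, \ldots, k$, it
follows from (3.16) that the following are equal:

\[ P_{E^{(j)} \| V/E^{(j)}} = \left( P_{E \| V/E} \right)^{\downarrow (H \cup \{\alpha_{k+1}\})/H^{(j)}} = \left( P_{H \| V/H} \right)^{\downarrow H/H^{(j)}} = P_{H^{(j)} \| V/H^{(j)}}. \]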
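
From (3.15) (end of proof of Part 1), it follows that:

\[ P_{E \| V/E} = P_{E^{(k+1)} \| V/E^{(k+1)}} = P_{H_m^n, \alpha_{k+1} \| V/(H_m^n \cup \{\alpha_{k+1}\})} \prod_{i=1}^{m-1} P_{H_i \| V/H_i}. \]

From this, it follows from (3.17) that:

\[ P_{E_m \| V/E_m} = P_{H_m^n, \alpha_{k+1} \| V/(H_m^n \cup \{\alpha_{k+1}\})} = \frac{P_{E^{(k+1)} \| V/E^{(k+1)}}}{\prod_{i=1}^{m-1} P_{H_i \| V/H_i}}. \]

Now, writing the left hand side as a telescoping product,

\[ P_{E^{(k+1)} \| V/E^{(k+1)}} = \prod_{j=0}^{k} \frac{P_{E^{(j+1)} \| V/E^{(j+1)}}}{P_{E^{(j)} \| V/E^{(j)}}} = \prod_{i=1}^{m} \prod_{j : \alpha_j \in E_i} \frac{P_{E^{(j)} \| V/E^{(j)}}}{P_{E^{(j-1)} \| V/E^{(j-1)}}} \]

so that

\[ P_{E_m \| V/E_m} = \prod_{j \mid \alpha_j \in E_m} \frac{P_{E^{(j)} \| V/E^{(j)}}}{P_{E^{(j-1)} \| V/E^{(j-1)}}} \]

as required, and the lemma is proved.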

Based on the first statement of Lemma 3.28, establishing the non-identifiability of a statement may
be reduced to establishing the non-identifiability of a statement within a c-component. The relevant
result is the following:

Theorem 3.29. Let G be a semi-Markovian model. If

1. G is itself a c-component,

2. S ⊂ V in G and GS∪(Pa(S)∩U) has only one c-component,

3. all the nodes of V/S are ancestors of S in G ,

then PS∥V/S is not identifiable in G .

Proof Non-identifiability is established if it can be shown that there is a back-door path between a
node in S and a node in V/S . Assume there is no back-door path; then there does not exist a node
υ ∈ U which is an ancestor of both a node in S and a node in V/S . It follows that G is not itself a
c-component, which is a contradiction.

Lemmas 3.27 and 3.28 provide the basis of a sound identification algorithm for computing do-
conditioning statements PS∥V/S for S ⊆ V in terms of see-conditioning statements, in the sense that
when it does not give an output fail, it returns the correct answer. Theorem 3.29 establishes that the
algorithm is complete, in the sense that it returns an output fail when and only when the statement
is not identifiable.

Algorithm 1 Algorithm: Compute PS∥V/S


INPUT: S ⊆ V

OUTPUT: Expression for PS∥V/S or fail.

Let V1 , . . . , Vn be a partition of V, where Vj = Vj′ ∩ V and V1′ , . . . , Vn′ are the c-components of the
sub-graph GV∪Pa(V) . Let S1 , . . . , Sl be a partition of S where S1′ , . . . , Sl′ are the c-components of the
sub-graph GS∪(Pa(S)∩U) and Sj = Sj′ ∩ V for j = 1, . . . , l. The subsets are labelled such that Sj ⊆ Vj for
j = 1, . . . , l; this can clearly be done without loss of generality. Now

1. Compute each PVj ∥V/Vj with Lemma 3.28;

2. Compute each PSj ∥V/Sj using Algorithm 3.8 (identify (C, T ) below), with C = Sj , T = Vj and
Q = PVj ∥V/Vj .

3. If in part 2. Algorithm 3.8 gives the output fail for any of the Sj ∶ j = 1, . . . , l, then PS∥V/S is
not identifiable and the output given is fail. Otherwise, PS∥V/S is identifiable and is given by:

\[ P_{S \| V/S} = \prod_{j=1}^{l} P_{S_j \| V/S_j}. \]

This follows from Lemma 3.28.



Algorithm 2 Algorithm: Identify (C, T )


INPUT: C and T where C ⊆ T ⊆ V, where GC∪(Pa(C)∩U) and GT ∪(Pa(T )∩U) are both composed of a
single c-component.

OUTPUT: Expression for PC∥V/C or fail.

Let $A = (\mathrm{an}(C) \cup C)_{G_{T \cup (\mathrm{Pa}(T) \cap U)}}$.

1. If A = C , then the output is PC∥V/C . By Lemma 3.27, this is given by:

\[ P_{C \| V/C} = \left( P_{T \| V/T} \right)^{\downarrow T/C}. \]

2. Else: if A = T (and T ≠ C ) then output fail.

3. Else: (if C ⊂ A ⊂ T ).

(a) By Lemma 3.27, compute:


↓T /A
PA∥V/A = (PT ∥V/T ) .

(b) Assume that, in GA∪(Pa(A)∩U) , C is contained in a c-component T1′ . Set T1 = T1′ ∩ A.


(c) Compute PT1 ∥V/T1 from PA∥V/A by Lemma 3.28.
(d) The output is the output of algorithm identify applied to (C, T1 ).
Algorithm 3.8 (Identify) is therefore recursive, until either it finds an expression for PC∥V/C or else
returns the output fail.
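
The control flow of the Identify recursion can be sketched in Python. This is only an illustration under simplifying assumptions that are not in the text: each hidden variable is represented as a bidirected edge between exactly two observable nodes, the ancestral set A is restricted to observable nodes, and the function returns a list of symbolic steps rather than a probability expression. The names `ancestors_within`, `c_component_of` and `identify` are hypothetical.

```python
def ancestors_within(nodes, parents, targets):
    """Return `targets` plus their ancestors, using only directed edges inside `nodes`."""
    found, stack = set(targets), list(targets)
    while stack:
        child = stack.pop()
        for parent in parents.get(child, ()):
            if parent in nodes and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def c_component_of(nodes, bidirected, seed):
    """Observable part of the c-component of the subgraph on `nodes` containing `seed`."""
    component, stack = {seed}, [seed]
    while stack:
        v = stack.pop()
        for a, b in bidirected:
            for s, t in ((a, b), (b, a)):
                if s == v and t in nodes and t not in component:
                    component.add(t)
                    stack.append(t)
    return component


def identify(c, t, parents, bidirected, trace=None):
    """Return the list of steps leading to P_{C || V/C}, or None for 'fail'."""
    c, t = set(c), set(t)
    trace = [] if trace is None else trace
    a = ancestors_within(t, parents, c)
    if a == c:                       # case 1: marginalise T/C out of P_{T || V/T}
        trace.append(("marginalise", frozenset(t - c)))
        return trace
    if a == t:                       # case 2: not identifiable
        return None
    # case 3: C is strictly inside A, which is strictly inside T
    trace.append(("marginalise", frozenset(t - a)))
    t1 = c_component_of(a, bidirected, next(iter(c)))
    trace.append(("restrict to c-component", frozenset(t1)))
    return identify(c, t1, parents, bidirected, trace)
```

For instance, with a directed edge X → Y and a hidden common parent of X and Y, `identify({'Y'}, {'X', 'Y'}, {'Y': ['X']}, [('X', 'Y')])` returns `None`, corresponding to the output fail. The recursion terminates because in case 3 the new set T1 is strictly smaller than T.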

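It now follows that if PV/T∥T is identifiable, then so is PS∥T for any S ⊆ V/T , by marginalisation:

\[ P_{S \| T} = \left( P_{V/T \| T} \right)^{\downarrow (V/T)/S}. \]

Algorithm 3.8 (Compute PS∥T) is based on the following consideration: let $D = (S \cup \mathrm{an}(S))_{G_{V/T}} \cap V$. Then D is an
ancestral set in V/T and hence

\[ \left( P_{V/T \| T} \right)^{\downarrow V/(T \cup D)} = P_{D \| V/D}. \]

It follows that:

\[ P_{S \| T} = \left( \left( P_{V/T \| T} \right)^{\downarrow V/(T \cup D)} \right)^{\downarrow D/S} = \left( P_{D \| V/D} \right)^{\downarrow D/S}. \]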

Algorithm 3 Algorithm: Compute PS∥T


INPUT: Two disjoint observable variable sets S ⊆ V and T ⊂ V, where T is interventional.

OUTPUT: The expression for PS∥T or fail.

1. Let $D = (S \cup \mathrm{an}(S))_{G_{V/T}} \cap V$.

2. Use the algorithm: Computing PS∥V/S to compute PD∥V/D .

3. If the algorithm returns fail, then the output is fail.

4. Else, output
↓D/S
PS∥T = (PD∥V/D ) .

The converse of Corollary 3.23 can now be stated:

Theorem 3.30. The three inference rules of Theorem 3.22, together with standard probability manip-
ulations, are complete for determining the identifiability of PH∥V/H for all H ⊂ V.

Proof Lemmas 3.27 and 3.28 follow from the inference rules of Theorem 3.22 (as proved by Huang
and Valtorta [64]). These form the basis of the three algorithms above. By Theorem 3.29, it follows
that the algorithms give the output fail if and only if the statement is not identifiable; by standard
probability manipulations, they give the correct answer otherwise.

3.8.1 Example: Front Door Criterion


Suppose that PU,X,Z,Y factorises according to the DAG in Figure 3.13 and that U is hidden.
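
1. $P_{Z \| X}$: $X \perp Z \,\|_{G_X}\, \emptyset$ and hence Rule 2 gives:

\[ P_{Z \| X} = P_{Z \mid X}. \qquad (3.18) \]

2. $P_{Y \| Z}$: $G_Z$ contains a back-door path from Z to Y, which is Z ← X ← U → Y. This path is
blocked if X is instantiated.

\[ P_{Y \| Z} = \left( P_{Y \mid X \| Z} \, P_{X \| Z} \right)^{\downarrow X}. \]

Since $Z \perp X \,\|_{G_Z}\, \emptyset$, it follows that Rule 3 may be applied:

\[ P_{X \| Z} = P_X. \]

Since $Z \perp Y \,\|_{G_Z}\, \emptyset$, Rule 2 gives:

\[ P_{Y \mid X \| Z} = P_{Y \mid X, Z}. \]

It follows that

\[ P_{Y \| Z} = \left( P_{Y \mid X, Z} \, P_X \right)^{\downarrow X}. \qquad (3.19) \]

This is a special case of the back-door formula (3.10).

3. $P_{Y \| X}$. Writing

\[ P_{Y \| X} = \left( P_{Y \mid Z \| X} \, P_{Z \| X} \right)^{\downarrow Z} \]

it follows from (3.18) that $P_{Z \| X} = P_{Z \mid X}$. Rule 2 may be applied to give $P_{Y \mid Z \| X} = P_{Y \| X, Z}$, since
$Y \perp X \,\|_{G_{X,Z}}\, Z$. Rule 3 may be applied, since $Y \perp X \,\|_{G_{X,Z}}\, Z$, to give:

\[ P_{Y \mid Z \| X} = P_{Y \| Z}, \]

which was computed in terms of see-conditioning in (3.19). Putting all this together gives:

\[ P_{Y \| X} = \left( P_{Z \mid X} \left( P_{Y \mid X, Z} \, P_X \right)^{\downarrow X} \right)^{\downarrow Z}. \]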

All the other causal effects (for example PY,Z∥X and PX,Z∥Y ) can be derived from the rules of Theo-
rem 3.22.
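
The front door formula for PY∥X can be checked numerically. The sketch below assumes binary variables and the DAG suggested by the back-door path quoted above (U → X, U → Y, X → Z, Z → Y); the conditional tables are randomly generated and purely illustrative. It compares the formula, computed from the observable distribution of (X, Z, Y) alone, with the value obtained from the full model including the hidden U.

```python
import numpy as np

rng = np.random.default_rng(0)


def random_cpt(*shape):
    """Random conditional probability table; the last axis sums to one."""
    table = rng.random(shape)
    return table / table.sum(axis=-1, keepdims=True)


# Hypothetical model: U -> X, U -> Y, X -> Z, Z -> Y, with U hidden.
p_u = random_cpt(2)            # P(u)
p_x_u = random_cpt(2, 2)       # P(x | u), indexed [u, x]
p_z_x = random_cpt(2, 2)       # P(z | x), indexed [x, z]
p_y_zu = random_cpt(2, 2, 2)   # P(y | z, u), indexed [z, u, y]

# Full joint P(u, x, z, y) and the observable marginal P(x, z, y).
joint = np.einsum('u,ux,xz,zuy->uxzy', p_u, p_x_u, p_z_x, p_y_zu)
p_xzy = joint.sum(axis=0)

# Reference value, which uses the hidden U: sum_{u,z} P(u) P(z|x) P(y|z,u).
truth = np.einsum('u,xz,zuy->xy', p_u, p_z_x, p_y_zu)

# Front door formula, which uses observable quantities only.
p_x = p_xzy.sum(axis=(1, 2))
p_xz = p_xzy.sum(axis=2)
p_z_given_x = p_xz / p_x[:, None]
p_y_given_xz = p_xzy / p_xz[:, :, None]
inner = np.einsum('xzy,x->zy', p_y_given_xz, p_x)        # (P_{Y|X,Z} P_X) summed over X
front_door = np.einsum('xz,zy->xy', p_z_given_x, inner)  # then multiplied by P_{Z|X}, summed over Z
```

Under these assumptions `front_door` and `truth` agree up to floating point error, for every value of x and y.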
3.9. MEASUREMENT BIAS AND EFFECT RESTORATION 77

3.8.2 Causal Inference by Surrogate Experiments


Suppose that the causal effect of X on Y , PY∥X , is of interest, but it is not identifiable and the variable
X cannot be controlled via a randomised experiment. For example, if we are interested in assessing
the effect of cholesterol X on heart disease Y , it may be possible to exercise control over the subject's
diet Z rather than directly controlling the quantity of cholesterol in the subject's blood.
Formally, this problem amounts to transforming the problem PY∥X into expressions where do-
conditioning is only on variables in Z . By Theorem 3.22, the following conditions are sufficient:

1. X intercepts all directed paths from Z to Y .

2. PY∥X is identifiable in GZ .

If the first of these holds, it follows that $Y \perp Z \,\|_{G_{X,Z}}\, X$ and hence $P_{Y \| X} = P_{Y \| X, Z}$. This represents
the causal effects of X on Y in a model that factorises along GZ , which is identifiable by the second
condition. These conditions are satisfied by the two models in Figure 3.16. Translated to the cholesterol
example, they require that there be no direct effect of diet on heart disease and no confounding effect
between cholesterol level and heart disease unless there is an intermediate variable between the two
which can be measured. For the first figure, the conditions are clear. For the second figure, PY∥X is
identifiable in GZ because

\[ P_{Y \| X, Z} = \left( P_{U_2} P_{U_3} P_{U_4} P_{Y \mid W, U_3, U_4} P_{W \mid X, U_2} \right)^{\downarrow W, U_2, U_3, U_4} = \left( P_{Y \mid W} P_{W \mid X} \right)^{\downarrow W}. \]

Figure 3.16: Do-condition on surrogate variable Z

3.9 Measurement Bias and Effect Restoration


Consider a situation where we would like to compute the causal effect of X on Y (namely PY∥X ), in the
situation where there is a sufficient confounder U . Confounder means that, as a result of its presence,
the effect PY∥X is not identifiable; sufficient in this context means that PY∥X could be computed if U
were observable. There are situations where an observable W may give sufficient information about U
to enable PY∥X to be identified from data.
The material for this section is taken from Kuroki and Pearl [77] (2014) and deals with two situa-
tions: firstly, the situation where PW∣U is known and W gives sufficient information about U to identify
PY∥X . Secondly, PW∣U is unknown, but there are two observable variables (Z, W ) which together give
sufficient information to identify PY∥X without bias.

Example 3.31.
The Head Start Program is discussed in Magidson [87] (1977). This was a government programme
within the United States of America aimed at giving assistance to children. Magidson's sample consists
of 148 children who received the programme and 155 control children.
Let X be an indicator variable, indicating whether or not the child received the programme. Y is
the outcome variable of the Metropolitan Readiness Test (a test which supposedly measures cognitive
ability). U represents socio-economic status. This is unobserved and may be considered, following the
discussion of Magidson, as a sufficient confounder. Figure 3.17 gives three possible situations; the first
where W is measured as a proxy variable for U , the second and third where W and Z (family income)
are measured as proxy variables of U .

Figure 3.17: U hidden; causal models with proxy variables on U . For (a), PW∣U is required to identify
PY∥X . For (b) and (c), under further assumptions on Z and W , the effect PY∥X may be estimated
from data.

In Figure 3.17, U satisfies the back-door criterion relative to (X, Y ), but its proxy variables W and
Z do not. For each of the models,

\[ P_{Y \| X} = \left( P_{Y \mid X, U} \, P_U \right)^{\downarrow U}. \]

If the conditional distribution PW∣U is known (and W is observable) then, under additional assumptions
on PW∣U , it is possible to construct an asymptotically unbiased estimator of PY∥X .

3.9.1 The Matrix Adjustment Method


We now consider the model in Figure 3.17 (a), under the assumption that PW∣U is known, and show
how to compute PY∥X . The method is known as the matrix adjustment method.
Assume that U and W both have finite state space, with k elements. Without loss of generality, let
U and W both have state space {1, . . . , k}. The main idea for recovering PX,Y,U from both PX,Y,W and
PW∣U , the matrix adjustment method, is found in Greenland and Lash [57] (2008) p. 360 and discussed
in Pearl [113] (2010). The discussion here is taken from Kuroki and Pearl [77] (2014).

\[ P_{Y, W \mid X} = \left( P_{Y, U \mid X} \, P_{W \mid U} \right)^{\downarrow U} \]

Set

\[ V_{U:Y \mid X}(\cdot : y \mid x) = \begin{pmatrix} P_{Y,U \mid X}(1, y \mid x) \\ \vdots \\ P_{Y,U \mid X}(k, y \mid x) \end{pmatrix}, \qquad V_{W:Y \mid X}(\cdot : y \mid x) = \begin{pmatrix} P_{Y,W \mid X}(1, y \mid x) \\ \vdots \\ P_{Y,W \mid X}(k, y \mid x) \end{pmatrix} \]

and

\[ M_{W \mid U} = \begin{pmatrix} P_{W \mid U}(1 \mid 1) & \cdots & P_{W \mid U}(1 \mid k) \\ \vdots & \ddots & \vdots \\ P_{W \mid U}(k \mid 1) & \cdots & P_{W \mid U}(k \mid k) \end{pmatrix}. \]

If $M_{W \mid U}$ is invertible, then:

\[ V_{U:Y \mid X}(\cdot : y \mid x) = M_{W \mid U}^{-1} V_{W:Y \mid X}(\cdot : y \mid x). \]

Similarly, set $V_{U: \mid X} = (V_{U:Y \mid X})^{\downarrow Y}$, so that $V_{U: \mid X}(u \mid x) = P_{U \mid X}(u \mid x)$, and similarly $V_{W: \mid X} = (V_{W:Y \mid X})^{\downarrow Y}$;
then

\[ V_{U: \mid X} = M_{W \mid U}^{-1} V_{W: \mid X}. \]

It follows that if PW∣U is known, then the causal effect of manipulating X , i.e. PY∥X , is estimable and
is given by:

\[ P_{Y \| X} = \left( \frac{P_{Y, U \mid X} \, P_U}{P_{U \mid X}} \right)^{\downarrow U} = \left( \frac{\left( M_{W,U}^{-1t} P_{Y, W \mid X} \right)^{\downarrow W} \left( M_{W,U}^{-1t} P_W \right)^{\downarrow W}}{\left( M_{W,U}^{-1t} P_{W \mid X} \right)^{\downarrow W}} \right)^{\downarrow U}, \]

where PW and PW∣X (.∣x), for each x ∈ X , are taken as column k -vectors.
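
The matrix adjustment method can be illustrated numerically. The following sketch assumes the model of Figure 3.17 (a) with binary X and Y, a three-state U and W, and randomly generated tables; all variable names and numbers are illustrative only. The error matrix `m_wu` plays the role of MW∣U (rows indexed by w, columns by u) and is assumed known, while the remaining inputs to the adjustment are computed from the observable distribution of (X, Y, W).

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3  # number of states of U and of W; X and Y are binary here


def random_cpt(*shape):
    """Random conditional probability table; the last axis sums to one."""
    table = rng.random(shape)
    return table / table.sum(axis=-1, keepdims=True)


# Hypothetical model U -> X, U -> Y, X -> Y, U -> W, with U hidden.
p_u = random_cpt(k)               # P(u)
p_x_u = random_cpt(k, 2)          # P(x | u), indexed [u, x]
p_y_xu = random_cpt(2, k, 2)      # P(y | x, u), indexed [x, u, y]
# M[w, u] = P(w | u); the diagonal weight keeps the matrix invertible.
m_wu = 0.7 * np.eye(k) + 0.3 * random_cpt(k, k).T

# Observable distribution P(x, y, w).
p_xyu = np.einsum('u,ux,xuy->xyu', p_u, p_x_u, p_y_xu)
p_xyw = np.einsum('xyu,wu->xyw', p_xyu, m_wu)

# Observable inputs to the adjustment.
p_x = p_xyw.sum(axis=(1, 2))
p_w = p_xyw.sum(axis=(0, 1))
p_yw_x = p_xyw / p_x[:, None, None]   # P(y, w | x)
p_w_x = p_yw_x.sum(axis=1)            # P(w | x)

# Invert the known error matrix to move from W to U.
m_inv = np.linalg.inv(m_wu)
p_yu_x = np.einsum('uw,xyw->xyu', m_inv, p_yw_x)   # P(y, u | x)
p_u_x = np.einsum('uw,xw->xu', m_inv, p_w_x)       # P(u | x)
p_u_hat = m_inv @ p_w                              # P(u)

# P_{Y || X} = sum_u P(y, u | x) P(u) / P(u | x).
effect = np.einsum('xyu,u->xy', p_yu_x / p_u_x[:, None, :], p_u_hat)

# Reference value computed directly from the hidden quantities.
truth = np.einsum('xuy,u->xy', p_y_xu, p_u)
```

In this construction `effect` reproduces `truth` exactly (up to rounding) because the observable tables are computed from the model; with estimated tables the inversion can amplify sampling error when the error matrix is close to singular.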

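3.9.2 Effect Restoration Without External Studies


Now consider the more difficult problem of estimating causal effects without prior knowledge of PW∣U .
This is not possible for the first of the models of Figure 3.17, but may be possible, under additional
assumptions, for the second and third models in that figure.
For each given (x, y), let σ be a permutation of 1, . . . , k such that

\[ P_{Y \mid X, U}(y \mid x, \sigma(1)) \geq \ldots \geq P_{Y \mid X, U}(y \mid x, \sigma(k)). \]

For the models under consideration, $W \perp \{X, Y, Z\} \,\|_G\, U$ and $Y \perp \{W, Z\} \,\|_G\, \{U, X\}$.

\[ P_{Z, W \mid X} = \left( P_{Z, W, U \mid X} \right)^{\downarrow U} = \left( P_{W \mid Z, U, X} P_{Z \mid U, X} P_{U \mid X} \right)^{\downarrow U} = \left( P_{W \mid U} P_{Z \mid U} P_{U \mid X} \right)^{\downarrow U}. \]

Similarly,

\[ P_{Y, W \mid X} = \left( P_{W \mid U} P_{Y \mid X, U} P_{U \mid X} \right)^{\downarrow U} \]

\[ P_{Y, Z \mid X} = \left( P_{Y \mid X, U} P_{Z \mid X, U} P_{U \mid X} \right)^{\downarrow U} \]

\[ P_{Y, Z, W \mid X} = \left( P_{W \mid U} P_{Z \mid X, U} P_{Y \mid X, U} P_{U \mid X} \right)^{\downarrow U} \]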
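Let

\[ P_{Z,W} = \begin{pmatrix} 1 & P_{W \mid X}(1 \mid x) & \cdots & P_{W \mid X}(k-1 \mid x) \\ P_{Z \mid X}(1 \mid x) & P_{Z,W \mid X}(1, 1 \mid x) & \cdots & P_{Z,W \mid X}(1, k-1 \mid x) \\ \vdots & \vdots & \ddots & \vdots \\ P_{Z \mid X}(k-1 \mid x) & P_{Z,W \mid X}(k-1, 1 \mid x) & \cdots & P_{Z,W \mid X}(k-1, k-1 \mid x) \end{pmatrix}, \]

\[ Q_{Z,W} = \begin{pmatrix} P_{Y \mid X}(y \mid x) & P_{Y,W \mid X}(y, 1 \mid x) & \cdots & P_{Y,W \mid X}(y, k-1 \mid x) \\ P_{Y,Z \mid X}(y, 1 \mid x) & P_{Y,Z,W \mid X}(y, 1, 1 \mid x) & \cdots & P_{Y,Z,W \mid X}(y, 1, k-1 \mid x) \\ \vdots & \vdots & \ddots & \vdots \\ P_{Y,Z \mid X}(y, k-1 \mid x) & P_{Y,Z,W \mid X}(y, k-1, 1 \mid x) & \cdots & P_{Y,Z,W \mid X}(y, k-1, k-1 \mid x) \end{pmatrix}, \qquad (3.20) \]

\[ U_{W,U} = \begin{pmatrix} 1 & P_{W \mid U}(1 \mid \sigma(1)) & \cdots & P_{W \mid U}(k-1 \mid \sigma(1)) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & P_{W \mid U}(1 \mid \sigma(k)) & \cdots & P_{W \mid U}(k-1 \mid \sigma(k)) \end{pmatrix}, \]

\[ R_{Z,U} = \begin{pmatrix} 1 & P_{Z \mid X,U}(1 \mid x, \sigma(1)) & \cdots & P_{Z \mid X,U}(k-1 \mid x, \sigma(1)) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & P_{Z \mid X,U}(1 \mid x, \sigma(k)) & \cdots & P_{Z \mid X,U}(k-1 \mid x, \sigma(k)) \end{pmatrix}, \]

\[ \Delta_U = \mathrm{diag}\left( P_{Y \mid X,U}(y \mid x, \sigma(1)), \ldots, P_{Y \mid X,U}(y \mid x, \sigma(k)) \right), \qquad (3.21) \]

\[ M_U = \mathrm{diag}\left( P_{U \mid X}(\sigma(1) \mid x), \ldots, P_{U \mid X}(\sigma(k) \mid x) \right). \]

Note that P and Q can be written:

\[ P_{Z,W} = R_{Z,U}^{t} M_U U_{W,U}, \qquad Q_{Z,W} = R_{Z,U}^{t} M_U \Delta_U U_{W,U}. \]

Provided $P_{Z,W}$ is invertible, it follows that:

\[ P_{Z,W}^{-1} Q_{Z,W} = U_{W,U}^{-1} \Delta_U U_{W,U}. \]

It follows that the recovery of PW∣U from UW,U rests on the eigenvalue decomposition of
$P_{Z,W}^{-1} Q_{Z,W}$. Once PW∣U is known, the matrix adjustment method may be used to evaluate the causal
effect on Y of manipulating X . This requires additionally that QZ,W be invertible and the probabilities
PY∣X,U (y∣x, 1), . . . , PY∣X,U (y∣x, k) take distinct values for given (x, y).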
The result is presented in the following theorem:

Theorem 3.32. Suppose U is a sufficient confounder relative to (X, Y ) and suppose that

1. Two proxy variables of U that are conditionally independent of each other given U can be observed;
call them W and Z . Both W ⊥ {X, Y, Z}∣U and Y ⊥ {W, Z}∣{U, X} hold.

2. W, Z and the confounder U are discrete variables, with a given finite number of categories, k .

3. Both PZ,W and QZ,W defined by (3.20) are invertible.

4. The probabilities PY∣X,U (y∣x, 1), . . . , PY∣X,U (y∣x, k) take distinct values for given x and y ,

then the causal effect PY∥X of X on Y is identifiable.

Proof The proof is based on the following two-step procedure that recovers PX,Y,U from PX,Y,Z,W .

• Step 1: Solve an eigenvalue problem of $P_{Z,W}^{-1} Q_{Z,W}$ to recover PW∣U from UW,U .

• Step 2: Recover PX,Y,U using the matrix adjustment method.

Step 1 First solve $|P_{Z,W}^{-1} Q_{Z,W} - \lambda I_k| = 0$, where $I_k$ denotes the k × k identity matrix. The solutions
$\lambda_1, \ldots, \lambda_k$ are the eigenvalues of $P_{Z,W}^{-1} Q_{Z,W}$. They satisfy:

\[ |P_{Z,W}^{-1} Q_{Z,W} - \lambda I_k| = |\Delta_U - \lambda I_k| = 0 \]

where $\Delta_U$ is defined by (3.21). It follows that $\lambda_i = P_{Y \mid X,U}(y \mid x, \sigma(i))$ and hence the elements of $\Delta_U$ are
estimable.
To obtain the eigenvector $\eta_i$ corresponding to $\lambda_i$, let $H = (\eta_1, \ldots, \eta_k)$; then H satisfies:

\[ P_{Z,W}^{-1} Q_{Z,W} H = H \Delta_U. \]

By the condition that the $\lambda_i$ take different values, it follows that $\eta_1, \ldots, \eta_k$ are uniquely determined up to
scalar multiples.
Let $A = U_{W,U}^{-1} E$ where $E = \mathrm{diag}(\alpha_1, \ldots, \alpha_k)$ for non-zero values of $(\alpha_1, \ldots, \alpha_k)$; then:

\[ P_{Z,W}^{-1} Q_{Z,W} A = U_{W,U}^{-1} \Delta_U E = U_{W,U}^{-1} E \Delta_U = A \Delta_U. \]

It follows that A is also a matrix of eigenvectors of $P_{Z,W}^{-1} Q_{Z,W}$ and hence, with a particular choice of
$\alpha_1, \ldots, \alpha_k$, $A = U_{W,U}^{-1} E = H$.
It follows that the inverse $H^{-1}$ of the estimable matrix H satisfies (using $U_{W,U}^{-1} E = H$):

\[ U_{W,U} = \begin{pmatrix} 1 & P_{W \mid U}(1 \mid \sigma(1)) & \cdots & P_{W \mid U}(k-1 \mid \sigma(1)) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & P_{W \mid U}(1 \mid \sigma(k)) & \cdots & P_{W \mid U}(k-1 \mid \sigma(k)) \end{pmatrix} = E H^{-1} = \begin{pmatrix} \alpha_1 H^{-1}_{11} & \cdots & \alpha_1 H^{-1}_{1k} \\ \vdots & \ddots & \vdots \\ \alpha_k H^{-1}_{k1} & \cdots & \alpha_k H^{-1}_{kk} \end{pmatrix}. \]

It follows, equating the first column, that $\alpha_j = 1 / H^{-1}_{j1}$ for $j = 1, \ldots, k$. This shows that $U_{W,U}$ is identifiable
from $E H^{-1}$ since H is estimable. It follows that every element PW∣U of UW,U can be obtained.
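
Step 2 Since

\[ P_{X, Y, W} = \left( P_{X, Y, U} \, P_{W \mid U} \right)^{\downarrow U} \]

it now follows that

\[ P_{Y \| X} = \left( P_{Y \mid X, U} \, P_U \right)^{\downarrow U} = \left( \frac{P_{X, Y, U}}{P_{X, U}} \, P_U \right)^{\downarrow U} \]

is identifiable.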
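
Step 1 of the proof can be traced numerically. The sketch below fixes one pair (x, y), uses three states for each of U, W and Z (labelled 0, 1, 2, with the last state dropped when forming the bordered matrices), and uses hand-picked illustrative tables; none of the numbers or helper names come from the text. The matrices `P` and `Q` are built from observable tables only, and the rows of UW,U are then recovered from the eigenvectors.

```python
import numpy as np

k = 3
# Hypothetical conditional tables for one fixed pair (x, y).
p_u_x = np.array([0.5, 0.3, 0.2])            # P(u | x)
p_y_u = np.array([0.5, 0.8, 0.2])            # P(y | x, u), distinct values
p_w_u = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.7, 0.2],
                  [0.2, 0.1, 0.7]])          # P(w | u), indexed [u, w]
p_z_u = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.1, 0.3, 0.6]])          # P(z | x, u), indexed [u, z]

# Observable tables P(z, w | x) and P(y, z, w | x) for the fixed y.
p_zw = np.einsum('u,uz,uw->zw', p_u_x, p_z_u, p_w_u)
p_yzw = np.einsum('u,u,uz,uw->zw', p_u_x, p_y_u, p_z_u, p_w_u)


def bordered(table):
    """Matrix of (3.20): total, the two marginals, and the table without its last state."""
    m = np.empty((k, k))
    m[0, 0] = table.sum()
    m[0, 1:] = table.sum(axis=0)[:k - 1]
    m[1:, 0] = table.sum(axis=1)[:k - 1]
    m[1:, 1:] = table[:k - 1, :k - 1]
    return m


P, Q = bordered(p_zw), bordered(p_yzw)

# Eigen-decomposition of P^{-1} Q; eigenvalues sorted in decreasing order (the order sigma).
eigvals, H = np.linalg.eig(np.linalg.solve(P, Q))
order = np.argsort(-eigvals.real)
eigvals, H = eigvals.real[order], H.real[:, order]

# U = E H^{-1}, where E is fixed by requiring the first column of U to be all ones.
H_inv = np.linalg.inv(H)
alpha = 1.0 / H_inv[:, 0]
U_rec = alpha[:, None] * H_inv
p_w_u_rec = U_rec[:, 1:]     # recovered P(w | sigma(i)) for the first k-1 states of W
```

The recovered eigenvalues are the values PY∣X,U (y∣x, σ(i)) and the rows of `p_w_u_rec` are the corresponding rows of PW∣U, in the order σ. The remaining state of W follows because each row sums to one.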

3.10 Identification of Counterfactuals


A counterfactual is simply a hypothetical statement that cannot be tested directly. For example,
following the network of Arthur Cayley (chapter 4), it is not possible to act to cause a storm-force
gale, nor to dissociate experimentally by an intervention the effects of wind and rain from their common
causes; such an intervention is not possible.
Another example of a counterfactual is the following: a given dose of treatment was administered
to a patient. The dose was decided upon as a result of standard diagnostic procedures. It failed to cure
the disease and the patient died. Would a stronger dose have cured the patient? Or would a stronger
dose have still failed to control the disease? Or would the patient have died of side effects from the
treatment?
Such questions, of course, have importance. One would like to know whether or not the wrong
treatment was administered for future reference; what to do in future cases that exhibit similar symp-
toms and whether there are possibilities of adjusting the treatment, once administered, if it does not
seem to be having the desired effect.
Let X denote `treatment', taking values in $\mathcal{X}$ and let $x \in \mathcal{X}$ denote a generic element. Let Y denote
the `effect'. This could be binary (0 or 1) if the only question of interest is whether or not the patient
was cured, or it could (for example) be a real valued random variable, denoting the quantity of an
enzyme after treatment.
To formulate the counterfactual query, Y should no longer be considered as a single random variable,
but rather as a stochastic process Y ′ indexed by $\mathcal{X}$. A stochastic process, in its greatest generality, is
defined as follows:

Definition 3.33 (Stochastic Process). Let $\mathcal{X}$ be a set and $(E, \mathcal{E})$ a measurable space. A stochastic
process Y indexed by $\mathcal{X}$ with state space $(E, \mathcal{E})$ is a family of measurable mappings
$\{Y(x) : x \in \mathcal{X}\}$ from a probability space (Ω, F, P) into $(E, \mathcal{E})$. The space $(E, \mathcal{E})$ is called the state space.

There is no requirement from the definition of `process' that the index set $\mathcal{X}$ should represent `time'.
In the counterfactual set-up, Y ′ has state space $\mathcal{X}_Y$, the same state space as Y . Attention is
restricted to the situation where $E = \mathcal{X}_Y$ is either finite, in which case $\mathcal{E}$ is simply the set of all possible
subsets, or else E = R, in which case $\mathcal{E} = \mathcal{B}(R)$, the Borel σ -algebra over R, the smallest collection of
subsets necessary to define integration (and hence a probability measure).
3.10. IDENTIFICATION OF COUNTERFACTUALS 83

Suppose that the state space of Y is Y = {0, 1}, where 0 represents `death' and 1 represents `cure'.
Suppose that x1 was the dose administered and the outcome was `death'. Consider the counterfactual
query: `would the patient have survived if we had given a treatment dose x2 ?' In terms of the
counterfactual process, the quantity to be computed is therefore:

P(Y ′ (x2 ) = 1∣Y ′ (x1 ) = 0).

In some limited cases, with serious additional modelling assumptions, this quantity can be computed
from the one-dimensional marginal distributions. For example, suppose we assume that, for x1 < x2 ,
{Y ′ (x1 ) = 1} ⊆ {Y ′ (x2 ) = 1}. This means that we assume that if the patient survives a low dose of the
treatment, he will also survive a higher dose. The treatment does not have side effects which kill the
patient; increasing the treatment dose increases the chance of success.

Under this assumption,

\[ P(Y'(x_2) = 0 \mid Y'(x_1) = 0) = \begin{cases} \dfrac{P(Y'(x_2) = 0)}{P(Y'(x_1) = 0)} & x_2 > x_1 \\ 1 & x_2 \leq x_1. \end{cases} \]
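
The formula can be checked against a simple threshold model in which the monotonicity assumption holds by construction. In the sketch below the cure probabilities, the dose labels and the function names are all illustrative assumptions: a single noise value u decides the outcome at every dose, with Y′(x) = 1 when u falls below the cure probability of dose x, and the noise is replaced by an evenly spaced grid so that the check is deterministic.

```python
def prob_still_dies(p_death, x1, x2):
    """P(Y'(x2) = 0 | Y'(x1) = 0) under the monotonicity assumption.

    p_death maps each dose x to the marginal probability P(Y'(x) = 0).
    """
    if x2 <= x1:
        return 1.0
    return p_death[x2] / p_death[x1]


# Hypothetical threshold model: higher doses have higher cure probability.
cure_prob = {1: 0.2, 2: 0.5, 3: 0.9}
n = 1000
noise = [(i + 0.5) / n for i in range(n)]   # evenly spaced stand-in for U(0, 1)


def outcome(dose, u):
    """Counterfactual outcome Y'(dose) for the noise value u (1 = cure, 0 = death)."""
    return 1 if u < cure_prob[dose] else 0


def brute_force(x1, x2):
    """Conditional probability computed directly from the joint behaviour of the process."""
    dead_x1 = [u for u in noise if outcome(x1, u) == 0]
    dead_both = [u for u in dead_x1 if outcome(x2, u) == 0]
    return len(dead_both) / len(dead_x1)


# One-dimensional marginals P(Y'(x) = 0), the only input the formula needs.
p_death = {x: sum(outcome(x, u) == 0 for u in noise) / n for x in cure_prob}
```

Here `prob_still_dies(p_death, 1, 2)` and `brute_force(1, 2)` agree, although the first uses only the marginal death probabilities while the second uses the joint behaviour of the process across doses.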
Several types of counterfactual query can be considered; if X is a cause and Y an effect within a larger
network, x1 could either be observed, or forced by intervention; the query is then `we observed effect
Y = y when we observed X = x1 . What would have happened if we had forced X ← x2 by intervention?'
To construct the appropriate counterfactual probability distribution, we add the counterfactual
process Y ′ , indexed by X , which does not have X as a parent. At the same time, the original variable
Y , with parent X remains in the graph and the counterfactual query is to compute

P(Y ′ (x2 ) = y∣Y = y, X = x1 ).

3.10.1 Counterfactual Graphs


A counterfactual Bayesian Network is a Bayesian network that is obtained by extending the original
network in a way that can be used to answer the counterfactual query.
Firstly, it is important that the same `random' component is considered. In other words, if we
observe an instantiation of Y when we do X ← x1 and we are asking about the probability distribution,
conditional on this information, of the distribution of Y if we had done X ← x2 instead, we need a
network which contains both Y (x1 ) and Y (x2 ) and also assumes the same random influence.
Therefore, we start with a formulation of the DAG which involves functional equations. That is, if
Xj has parents Pa(j), then Xj = fj (Pa(j), Uj ) where U1 , . . . , Ud are i.i.d. U (0, 1) random variables, the
functions fj ∶ j = 1, . . . , d are deterministic functions and the parents of Xj in the DAG are Pa(j)∪{Uj }.
We extend the DAG to answer the counterfactual query in the following way: If the query involves
(a) do-conditioning on a subset A ⊆ V and (b) asking a counterfactual question about the causal effect
on a set of nodes Y , then for each node β on the causal path from A to Y

1. add a c-process node β ′ . These are the counterfactual process nodes, corresponding to the coun-
terfactual process, enumerating the value taken by the variable for each x ∈ XA .
A c-process node β ′ has all the parents of β except the nodes in A; there are no links from nodes
in A to c-process nodes.

2. For each α ∈ A and each β on the causal path between A and Y (including all the nodes in Y ),
add in an arrow α → β . If β and γ , where β, γ ∈/ A are on the causal path between A and Y and
there is an arrow β → γ , add in a process arrow β ′ ⇒ γ ′ .

3. Add in a process to variable arrow β ′ ↠ β for each process node and its corresponding variable
node. This is shorthand for an arrow β ′ (x) → β for each x ∈ XA .

4. Add in variable to process nodes γ ↣ β ′ for each γ ∈ Pa(β)/A. This is shorthand for γ → β ′ (x)
for each x ∈ XA .

5. If there is an arrow β ′ ⇒ γ ′ , then remove the arrow β → γ (if it exists).

Within a Process Node A process node α′ is shorthand for {α′ (x) ∶ x ∈ XA }.

Between Process Nodes If α = f (β1 , . . . , βk , v) in the original DAG, where Pa(α) = {β1 , . . . , βk }
and v is the random effect, then

α′ (j) = f (β1′ (j), . . . , βk′ (j), v)

for each j .
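
The role of the shared random effect v can be illustrated with a small enumeration. The sketch below assumes a chain X → W → Y in which X is set externally and is independent of the noise, and it uses discrete noise variables and hand-picked functions instead of U (0, 1) variables so that the computation is exact; all of these choices, and the function names, are illustrative assumptions. The factual and counterfactual outcomes are computed from the same noise values, and the observed outcome is used to re-weight the noise.

```python
from itertools import product

# Hypothetical discrete noise distributions for W and Y.
p_u2 = {0: 0.5, 1: 0.3, 2: 0.2}
p_u3 = {0: 0.6, 1: 0.4}


def f_w(x, u2):
    """Functional equation for W given its parent X and its noise."""
    return 1 if x + u2 >= 2 else 0


def f_y(w, u3):
    """Functional equation for Y given its parent W and its noise."""
    return 1 if (w == 1 or u3 == 1) else 0


def y_under(x, u2, u3):
    """Value of Y when X is set to x and the noise is held at (u2, u3)."""
    return f_y(f_w(x, u2), u3)


def counterfactual(x_obs, y_obs, x_cf, y_cf):
    """P(Y'(x_cf) = y_cf | X = x_obs, Y = y_obs), with the same noise in both worlds."""
    numerator = denominator = 0.0
    for u2, u3 in product(p_u2, p_u3):
        weight = p_u2[u2] * p_u3[u3]
        if y_under(x_obs, u2, u3) == y_obs:          # noise consistent with the observation
            denominator += weight
            if y_under(x_cf, u2, u3) == y_cf:        # same noise, different dose
                numerator += weight
    return numerator / denominator
```

With these tables, `counterfactual(0, 0, 1, 1)` is the probability that Y would have been 1 under x = 1, given that Y = 0 was observed under x = 0; only the noise configuration u2 = 1, u3 = 0 contributes to the numerator.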

Example 3.34.

Suppose we are interested in how likely a patient would be to have a certain symptom Y (1 = yes,
0 = no), given a dose x of a drug X assuming we know that the patient took dose x′ of the drug
and exhibited the symptom. Suppose there is a mediating variable W , for example: blood pressure,
and that it is the blood pressure which is the cause of the symptom. Furthermore, we also know that
the patient took dose d of a drug D and we have measured a symptom Z = z . We know PZ∣D , the
conditional probability distribution for symptom Z given drug D.
The blood pressure / symptom (W, Y ) may therefore be considered as a counterfactual process indexed
by the dose of drug. The random variable Y (x) indicates whether or not the patient exhibits the symp-
tom when dose x is administered. In this language, the problem is therefore to compute PY (x)∣Y (x′ ),Z,D .
Figure 3.18 (a) shows the original DAG for the Bayesian Network; (b) shows the network in terms of
functional relations.

3.10.2 Joint Counterfactual Probabilities and Intervention


Figure 3.18: (a) A DAG, (b) the graph expressed as a functional relations graph

Figure 3.19: (c) the graph extended to answer a counterfactual query

Figure 3.20: Graph of Figure 3.19 (b) with the process nodes written out

Notes The `do - calculus' is due to Judea Pearl in [109] and [107]. It enables conclusions to be
drawn about the effects of active interventions, based on passive observations. The other main sources
for the presentation here are Edwards [40] (2000) chapter 9 and Lauritzen (2001) [82]. The idea of
deletion of connections (in terms of wiping out equations in a multivariate model) is found in Strotz
and Wold (1960) [128]. The intervention formula is due to J. Pearl, but is also given independently
in the first edition of Spirtes, Glymour and Scheines (2002) [127]. The designation semi-Markovian
model follows [134]. The paper [64] (2006) summarises the recent developments in the problem of
identifiability and presents an algorithmic solution. The results by Y. Huang, M. Valtorta in [64] show
that the do-calculus rules of Pearl [107] and [108] (1995) are complete in the sense that if a causal
effect is identifiable, then the causal effects can be computed in terms of observational quantities. The
article [43] by Freedman and Humphreys makes the obvious point that causality cannot be learned
from data and is a necessary response to errors that inexplicably crept into the literature.
3.11 Exercises
1. The two parts of this exercise are very similar and straightforward, illustrating how d-separation
in the mutilated graph corresponds to conditional independence in the remaining variables after
do-conditioning.

(a) Let G be a Directed Acyclic Graph, and suppose that a probability distribution P may
be factorised along G . Let G −X denote the graph obtained by deleting from G all arrows
pointing towards X (that is, all links between X and its parents are deleted). Prove that if
Y and Z are d-separated in G −X by X , then

PY ∣Z∥X (.∣.∥x) = PY ∥X (.∥x),

where the conditioning is taken from right to left.


(b) Let A, B, C, W be disjoint sets of nodes in a Bayesian Network. Let G denote the Directed
Acyclic Graph describing the causal network, and let G −C denote the graph with all edges
between C and parents of C removed.
Prove that if A and B are d-separated by (C, W ) on the graph G −C , then

PA∣W,B∥C (xA ∣xW , xB , ∥xC ) = PA∣W ∥C (xA ∣xW ∥xC ),

where the conditioning is performed from right to left.

2. Suppose the causal relations between the variables (X1 , X2 , X3 , X4 , X5 , X6 , Y, Z) may be ex-
pressed by the DAG given in Figure 3.21. Which of the following sets satisfy the back door
criterion with respect to the ordered pair of nodes (Y, Z)? C1 = {X1 , X2 }, C2 = {X4 , X5 },
C3 = {X4 }.
State all sets of nodes that satisfy the back door criterion with respect to the ordered set of nodes
(Z, Y ).

[Figure 3.21 shows a DAG with edges X1 → X3 , X1 → X4 , X2 → X4 , X2 → X5 , X3 → Y , X4 → Y ,
X4 → Z , X5 → Z , Y → X6 , X6 → Z .]

Figure 3.21: Causal Relations between Variables


3. Let a set of variables C satisfy the back door criterion relative to (X, Y ). Prove that

\[ P_{Y \| X}(y \| x) = \sum_{c} P_{Y \mid C, X}(y \mid c, x) \, P_C(c). \]

4. Let C be a set of variables in a Bayesian Network and let X be a variable such that C contains
no descendants of X . Prove, from the definition, that

PC∥X (c∥x) = PC (c).

5. Let V = {X1 , . . . , Xd } denote a set of variables. Let V = Z ∪ U , where the variables in Z are
observable and the variables in U are unobservable. Assume that the probability distribution
over the variables in V may be factorised along a Directed Acyclic Graph G = (V, D), where
no variable in U is a descendant of any variable in Z . That is, the model is semi-Markovian.
Consider a single variable, say Xj ∈ Z . Assume that there is no trail between Xj and Xk for
Xk ∈ Z with only fork and chain connections which contains a variable Xi ∈ U . Show that

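\[ P_{Z/\{X_j\} \| X_j}(x_{Z/\{j\}} \| x_j) = P_{Z/(\{X_j\} \cup \mathrm{Pa}_j) \mid X_j, \mathrm{Pa}_j \cap Z}\left( x_{Z/(\{j\} \cup \mathrm{Pa}_j)} \mid x_j, x_{\mathrm{Pa}_j \cap Z} \right) P_{\mathrm{Pa}_j \cap Z}\left( x_{\mathrm{Pa}_j \cap Z} \right). \]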
3.12 Answers
1. (a) Let $\tilde{P}_{V/\{X\}} = P_{V/\{X\} \| X}(\cdot \| x)$. Then $\tilde{P}$ factorises along $G^{-X}_{V/\{X\}}$, the subgraph of $G^{-X}$ over the
variables V /{X}. The probability tables are, for Y ≠ X and parent sets $\widetilde{\mathrm{Pa}}_Y = \mathrm{Pa}_Y/\{X\}$
($\widetilde{\mathrm{Pa}}_Y$ is the original parent set $\mathrm{Pa}_Y$ with X removed), $\tilde{P}_{Y \mid \widetilde{\mathrm{Pa}}_Y} = P_{Y \mid \mathrm{Pa}_Y}$ with the instan-
tiation X ← x for every appearance of X in $\mathrm{Pa}_Y$. If $Y \perp Z \,\|_{G^{-X}}\, X$, then all trails between Y
and Z in $G^{-X}$ have either X as a fork or chain node, or else have a collider node that is not
X and which does not have X as a descendant. It follows that all trails between Y and Z
in $G^{-X}_{V/\{X\}}$ have at least one collider node and hence that $Y \perp Z \,\|_{G^{-X}_{V/\{X\}}}\, \emptyset$ (d-separated when
none of the other variables are instantiated). It follows that, under probability distribution
$\tilde{P}$, Y ⊥ Z , so that

\[ P_{Y \mid Z \| X}(\cdot \mid \cdot \| x) = \tilde{P}_{Y \mid Z} = \tilde{P}_Y = P_{Y \| X}(\cdot \| x). \]

(b) Let V denote the variable set and let P̃_{V/C} = P_{V/C∥C}(.∥x_C). Then P̃_{V/C} factorises along the
graph G^{−C}_{V/C} (the subgraph of G^{−C} with the nodes C removed) with, for X ∉ C, conditional
probability potentials P̃_{X∣P̃a_X} = P_{X∣Pa_X}, where P̃a_X = Pa_X/C, Pa_X denotes the original
parent set, and the variables in Pa_X ∩ C are instantiated with the appropriate values.
If A ⊥ B ∥_{G^{−C}} C ∪ W then any trail from A to B either has a fork or chain node in C ∪ W or
a collider node that is not in C ∪ W with no descendants in C ∪ W. It follows that, on the
graph G^{−C}_{V/C}, any trail from A to B either has a fork or chain node in W or a collider node
that is not in W with no descendants in W; edges are deleted, but not added, by taking the
subgraph restricted to the variables of V/C and hence no new trails are added by removing
the nodes in C. It follows that A ⊥ B ∥_{G^{−C}_{V/C}} W and hence that

P_{A∣W,B∥C}(x_A∣x_W, x_B∥x_C) = P̃_{A∣W,B}(x_A∣x_W, x_B) = P̃_{A∣W}(x_A∣x_W) = P_{A∣W∥C}(x_A∣x_W∥x_C)

which is the result.

2. C1 = {X1 , X2 } does not satisfy the back door criterion; Y − X4 − Z is a trail between Y and Z
with an edge pointing to Y which is not blocked by C1 .

C2 = {X4 , X5 } satisfies the back door criterion; trail Y − X6 − Z does not have an edge pointing
towards Y . The other trails pass through X4 . For the trails Y −X4 −Z and Y −X3 −X1 −X4 −Z , X4
is an instantiated fork or chain respectively, hence C2 blocks the trail. For Y −X3 −X1 −X4 −X2 −X5 −Z ,
X5 is an instantiated chain and hence the trail is blocked. All trails between Y and Z have been
considered.

For the backdoor criterion with respect to (Z, Y ), the sets have to block all trails with an
arrow pointing towards Z . This means that any set that contains X6 , X4 and any node from
{X3 , X1 , X2 , X5 } will satisfy the backdoor criterion with respect to (Z, Y ); any set that does not
will not.
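
The checks in this answer can be reproduced mechanically. The following is a minimal sketch, not part of the original text. It assumes that Figure 3.21 has the directed edges listed in `EDGES` (a reconstruction from the figure layout), tests d-separation through the moralised ancestral graph, and applies the back door criterion in its usual form: the candidate set contains no descendant of the first node, and it d-separates the pair once the edges leaving the first node are deleted. The function names are illustrative only.

```python
from itertools import combinations

# Assumed (reconstructed) edge list for Figure 3.21, as (parent, child) pairs.
EDGES = [("X1", "X3"), ("X1", "X4"), ("X2", "X4"), ("X2", "X5"),
         ("X3", "Y"), ("X4", "Y"), ("X4", "Z"), ("X5", "Z"),
         ("Y", "X6"), ("X6", "Z")]


def parents(edges, node):
    return {a for a, b in edges if b == node}


def ancestors(edges, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents(edges, stack.pop()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def descendants(edges, node):
    """Return all strict descendants of a node."""
    seen, stack = set(), [node]
    while stack:
        cur = stack.pop()
        for a, b in edges:
            if a == cur and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def d_separated(edges, x, y, given):
    """d-separation of x and y given a set, via the moralised ancestral graph."""
    keep = ancestors(edges, {x, y} | set(given))
    sub = [(a, b) for a, b in edges if a in keep and b in keep]
    und = {frozenset(e) for e in sub}
    for node in keep:  # marry the parents of each common child
        for p, q in combinations(sorted(parents(sub, node)), 2):
            und.add(frozenset((p, q)))
    seen, stack = {x}, [x]  # look for a path from x to y that avoids `given`
    while stack:
        cur = stack.pop()
        for e in und:
            if cur in e:
                (other,) = e - {cur}
                if other == y:
                    return False
                if other not in seen and other not in given:
                    seen.add(other)
                    stack.append(other)
    return True


def back_door(edges, x, y, c):
    """Back door criterion for the ordered pair (x, y) and candidate set c."""
    if set(c) & descendants(edges, x):
        return False
    no_out = [(a, b) for a, b in edges if a != x]  # delete edges leaving x
    return d_separated(no_out, x, y, set(c))
```

Under the assumed edge list, `back_door(EDGES, "Y", "Z", {"X4", "X5"})` holds, while the sets {X1, X2} and {X4} fail: the first leaves Y − X4 − Z open, and the second opens the collider at X4 without blocking the trail through X5.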


3. It is clear that
P_{Y∥X}(y∥x) = ∑_c P_{Y∣C∥X}(y∣c∥x) P_{C∥X}(c∥x).

Since C blocks all trails between Y and X that have an edge pointing towards X , it follows that
Y ⊥ (Pa_X/C) ∥_G C. It follows, with notation that should be clear, using Proposition 3.12 that

P_{Y∣C∥X}(y∣c∥x) = ∑_{π/c} P_{Y∣C,Pa_X∥X}(y∣c, π/c∥x) P_{Pa_X/C∥X}(π/c∥x)
                 = ∑_{π/c} P_{Y∣C,Pa_X,X}(y∣c, π/c, x) P_{Pa_X/C}(π/c)
                 = ∑_{π/c} P_{Y∣C,X}(y∣c, x) P_{Pa_X/C}(π/c)
                 = P_{Y∣C,X}(y∣c, x).

Furthermore, since none of the variables in C are descendants of X , it follows (again, using
Proposition 3.12) that
PC∥X (c∥x) = PC (c)

and the result follows. The fact that PC∥X (c∥x) = PC (c),

PY ∣C,PaX ∥X (y∣c, π/c∥x) = PY ∣C,PaX ,X (y∣c, π/c, x)

and PPaX /C∥X (π/c∥x) = PPaX /C (π/c) is clear by comparing the original DAG and the mutilated
graph. A formal algebraic proof that PC∥X (c∥x) = PC (c) is given in the next exercise.
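
The identity can be checked numerically on a small example. The sketch below is an illustration only: the three-node DAG C → X, C → Y, X → Y and all probability values are hypothetical, chosen so that {C} satisfies the back door criterion relative to (X, Y). It compares the intervention formula (the factorisation with the factor for X removed) with the adjustment sum ∑_c P(y∣c, x) P(c) computed from the joint distribution, and with the ordinary conditional P(y∣x).

```python
from itertools import product

# Hypothetical CPTs for the DAG  C -> X, C -> Y, X -> Y  (all variables binary).
P_C = {0: 0.7, 1: 0.3}
P_X_GIVEN_C = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # indexed [c][x]
P_Y1_GIVEN_XC = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # P(Y=1 | x, c)


def p_y(y, x, c):
    p1 = P_Y1_GIVEN_XC[(x, c)]
    return p1 if y == 1 else 1.0 - p1


def joint(c, x, y):
    return P_C[c] * P_X_GIVEN_C[c][x] * p_y(y, x, c)


def intervention(y, x):
    """P(y || x) from the intervention formula: omit the factor for X."""
    return sum(P_C[c] * p_y(y, x, c) for c in (0, 1))


def back_door_adjustment(y, x):
    """sum_c P(y | c, x) P(c), using only marginals of the joint distribution."""
    total = 0.0
    for c in (0, 1):
        p_c = sum(joint(c, xx, yy) for xx, yy in product((0, 1), repeat=2))
        p_cx = sum(joint(c, x, yy) for yy in (0, 1))
        total += joint(c, x, y) / p_cx * p_c
    return total


def observational(y, x):
    """Ordinary conditional probability P(y | x), for comparison."""
    num = sum(joint(c, x, y) for c in (0, 1))
    den = sum(joint(c, x, yy) for c in (0, 1) for yy in (0, 1))
    return num / den
```

With these values the two intervention computations agree, while the observational conditional probability is different, which is the point of the adjustment.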

4. The variables may be ordered as V = {Y1 , . . . , Yn , X, Yn+1 , . . . , Yn+m } where the ordering is chosen
such that Pa(Yj ) ⊆ {Y1 , . . . , Yj−1 } for j ≤ n, Pa(X) ⊆ {Y1 , . . . , Yn },

Pa(Yj ) ⊆ {Y1 , . . . , Yn , X, Yn+1 , . . . , Yj−1 }

for j ∈ {n + 1, . . . , n + m} and where C ⊆ {Y1 , . . . , Yn }; such an ordering exists because C contains
no descendants of X . From the intervention formula,

P_{V/X∥X}(y_1, . . . , y_{m+n}∥x) = ∏_{j=1}^{m+n} P_{Yj∣Paj}(y_j∣π_j)

while

P_V(y_1, . . . , y_{m+n}, x) = P_{X∣Pa(X)}(x∣π_X) ∏_{j=1}^{m+n} P_{Yj∣Paj}(y_j∣π_j).

Now, sum over variables Yn+1 , . . . , Yn+m in both expressions, then sum over X in the second
expression. Then sum over all remaining variables not in C . The same answer obtains for both
expressions, so that

P_{C∥X} = P_C.

5. Firstly,

PV /{Xj }∥{Xj } = PV /({Xj }∪Paj )∣Xj ,Paj PPaj .

Now let PaU = Paj ∩ U , Ua denote ancestors of Xj in U , Ub ancestors of Z/({Xj } ∪ Paj ) in U


and Uc = U /(PaU ∪ Ua ∪ Ub ).
Sum over the variables in V /(Z ∪ PaU ), then, from the condition that there are no trails between
Xj and other variables in Z that contain only fork or chain connections, Z/{Xj } ∪ Paj is d-
separated from PaU by {Xj }. It follows that

PZ∪PaU /{Xj }∥{Xj } = PZ/({Xj }∪(Paj ∩Z))∣Xj ,(Paj ∩Z),PaU PPaj


= PZ/({Xj }∪(Paj ∩Z))∣Xj ,(Paj ∩Z) PPaj

so that

PZ/{Xj }∥{Xj } = PZ/({Xj }∪(Paj ∩Z)∪PaU )∣Xj ,(Paj ∩Z) PPaj ∩Z .


Chapter 4

The Pioneering Work of Arthur Cayley

4.1 Cayley's Contribution


Arthur Cayley F.R.S. (16 August 1821 - 26 January 1895) was a British mathematician, known for
his work in pure mathematics. His contributions include the so-called Cayley-Hamilton theorem, that
every square matrix satisfies its own characteristic polynomial, which he verified for matrices of order
2 and 3 (1858) [16]. He was the first to define the concept of a group in the modern way, as a set with
a binary operation satisfying certain laws. From group theory, he is known for Cayley's theorem, which
states that every group G is isomorphic to a subgroup of the symmetric group acting on G (1854) [15].
In the context of Bayesian networks, attention is drawn to a short article by Arthur Cayley from
1853, where in an example that takes less than one page, he seems to develop several principles that
later formed the basis of the subject of Bayesian networks, in particular, the `noisy or' gate.
Here is the article in its entirety.

XXXVII. Note on a Question in the Theory of Probabilities.

By A. Cayley*.

The following question was suggested to me, either by some of Prof. Boole's memoirs on
the subject of probabilities, or in conversation with him, I forget which; it seems to me a
good instance of the class of questions to which it belongs.
Given the probability α that a cause A will act, and the probability p that A acting the
effect will happen; also the probability β that a cause B will act, and the probability q that
B acting the effect will happen; required the total probability of the effect.
As an instance of the precise case contemplated, take the following: say a day is called
windy if there is at least w of wind, and a day is called rainy if there is at least r of rain,
and a day is called stormy if there is at least W of wind, or if there is at least R of rain.
The day may therefore be stormy because of there being at least W of wind, or because
of there being at least R of rain, or on both accounts; but if there is less than W of wind
and less than R of rain, the day will not be stormy. Then α is the probability that a day


chosen at random will be windy, p the probability that a windy day chosen at random will
be stormy, β the probability that a day chosen at random will be rainy, q the probability
that a rainy day chosen at random will be stormy. The quantities λ, µ introduced in the
solution of the question mean in this particular instance, λ the probability that a windy
day chosen at random will be stormy by reason of the quantity of wind, or in other words,
that there will be at least W of wind, µ the probability that a rainy day chosen at random
will be stormy by reason of the quantity of rain, or in other words, that there will be at
least R of rain.
The sense of the terms being clearly understood, the problem presents of course no difficulty.
Let λ be the probability that the cause A acting will act efficaciously; µ the probability
that the cause B acting will act efficaciously; then

p = λ + (1 − λ)µβ

q = µ + (1 − µ)αλ,

which determine λ, µ; and the total probability ρ of the eect is given by

ρ = λα + µβ − λµαβ,

suppose, for instance, α = 1, then

p = λ + (1 − λ)µβ, q = µ + λ − λµ, ρ = λ + µβ − λµβ,

that is, ρ = p, for p is in this case the probability that (acting as a cause which is certain to
act) the effect will happen, or what is the same thing, p is the probability that the effect
will happen.
Machynlleth, August 16, 1853.
*Communicated by the Author.

In this short note, Cayley gives a prototype example of a causal network; rain and wind both have
causal effects on the state of the day (stormy or not), which may be inhibited. He demonstrates the
key principle of modularity, taking a problem with several variables and splitting it into its simpler
component conditional probabilities, by considering the direct causal influences for each variable and
considering the natural factorisation of the probability distribution in this problem into these
conditional probabilities.
It should also be pointed out that Cayley was no stranger to graph theory; he proved Cayley's tree
formula, that there are n^{n−2} distinct labelled trees of order n (1889) [19] and established links between
graph theory and group theory, representing groups by graphs. The Cayley graph is named after him.
The variables here may be taken as

C = 1 (wind), C = 0 (no wind);        D = 1 (rain), D = 0 (no rain),

with

α = P_C(1),   β = P_D(1).

Let Y be the variable denoting whether there is a storm:

Y = 1 (storm), Y = 0 (no storm).

[Figure 4.1: Rain and wind causing a storm. Directed edges rain → storm (labelled µ) and wind → storm (labelled λ).]


Then, in Cayley's notation, if there is rain, it causes a storm with probability µ; if there is wind,
it causes a storm with probability λ. The corresponding `network', on three variables, is seen in
Figure 4.1. The subscripts µ and λ on the arrows indicate the probability that the cause, if active, will
trigger the effect.
This is a noisy `or' gate, which can be expressed as a logical `or' gate by the addition of two
variables, R and W . The variable R denotes severe rain, that is that the `rain' variable reaches the
threshold to trigger a storm. This happens if the quantity of rain is above a threshold. The W variable
denotes severe wind; that is, that the `wind' variable reaches the threshold to trigger a storm. This
happens if the strength of wind is above a threshold. To form the logical `or' gate, the variables W and R have the
conditional probability values given below; P_{W∣C} denotes the conditional probability function for the
variable W given C and P_{R∣D} denotes the conditional probability function for the variable R given D.

P_{W∣C}:    C \ W     1      0
            1         λ      1 − λ
            0         0      1

P_{R∣D}:    D \ R     1      0
            1         µ      1 − µ
            0         0      1

The network may now be expressed graphically according to Figure 4.2. This DAG is a
representation of the factorisation that Cayley is using:

P_{C,D,R,W,Y} = P_C P_D P_{W∣C} P_{R∣D} P_{Y∣W,R}

where P_{Y∣W,R} denotes the CPP for the variable Y , given W and R. For Y = 1, these values are given
in the following table:

P_{Y∣W,R}(1∣., .):    W \ R     1    0
                      1         1    1
                      0         1    0

[Figure 4.2: Rain and wind: logical `or' gate. Directed edges rain → R, wind → W, R → storm, W → storm.]

From the factorisation,

P_W(1) = ∑_x P_{W∣C}(1∣x) P_C(x) = λα,    P_R(1) = µβ.

From Cayley, p is the probability that a windy day, chosen at random, will be stormy; since C is the
wind variable, p = P_{Y∣C}(1∣1).

p = P_{Y∣C}(1∣1) = ∑_{x1} P_D(x1) ∑_{x2} P_{R∣D}(x2∣x1) ∑_{x3} P_{Y∣R,W}(1∣x2, x3) P_{W∣C}(x3∣1)
  = βλµ + βµ(1 − λ) + β(1 − µ)λ + (1 − β)λ
  = βµ − βλµ + λ = λ + (1 − λ)βµ.

Similarly, q , the probability that a rainy day, chosen at random, will be stormy, q = P_{Y∣D}(1∣1), is given
by

q = µ + (1 − µ)αλ,

as computed by Cayley. Cayley is deriving the expression for the marginal probability of a stormy day,
ρ = P_Y(1):

P_Y(1) = ∑_{x1} P_C(x1) ∑_{x2} P_D(x2) ∑_{x3} P_{R∣D}(x3∣x2) ∑_{x4} P_{W∣C}(x4∣x1) P_{Y∣R,W}(1∣x3, x4)
       = ∑_{x3} P_R(x3) ∑_{x4} P_W(x4) P_{Y∣R,W}(1∣x3, x4)
       = P_R(1)P_W(1) + P_R(1)P_W(0) + P_R(0)P_W(1)
       = αλ + βµ − αβλµ.

This simple construction from 1853 possibly represents the first example of a causal network and the
first construction of a noisy-or gate, with the concept of an inhibitor.
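
Cayley's three formulas can be confirmed by brute-force enumeration of the joint distribution over (C, D, W, R, Y). The following sketch is an illustration only; the function names and the numerical values of α, β, λ, µ in the usage note are arbitrary choices, and C is taken as the wind variable and D as the rain variable, as defined above.

```python
from itertools import product


def cayley_joint(alpha, beta, lam, mu):
    """Yield (c, d, w, r, y, probability) for the wind/rain/storm network."""
    for c, d, w, r in product((0, 1), repeat=4):
        p = (alpha if c else 1 - alpha) * (beta if d else 1 - beta)
        p_w1 = lam if c else 0.0  # severe wind is only possible on a windy day
        p_r1 = mu if d else 0.0   # severe rain is only possible on a rainy day
        p *= (p_w1 if w else 1 - p_w1) * (p_r1 if r else 1 - p_r1)
        y = 1 if (w or r) else 0  # logical 'or' gate for the storm variable
        yield c, d, w, r, y, p


def prob(alpha, beta, lam, mu, event, given=lambda s: True):
    """P(event | given), where a state s is the list [c, d, w, r, y]."""
    num = den = 0.0
    for *state, p in cayley_joint(alpha, beta, lam, mu):
        if given(state):
            den += p
            if event(state):
                num += p
    return num / den
```

For example, with α = 0.5, β = 0.4, λ = 0.3, µ = 0.6 and `storm = lambda s: s[4] == 1`, the call `prob(0.5, 0.4, 0.3, 0.6, storm)` agrees with αλ + βµ − αβλµ, and conditioning on `s[0] == 1` or `s[1] == 1` reproduces p and q respectively.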

4.2 Arthur Cayley and Judea Pearl's intervention calculus


There is the cryptic remark towards the end of Arthur Cayley's paper, which indicates that he may
already have had the framework of Judea Pearl's intervention calculus in mind when considering causal
probabilistic models. The phrase ` .... acting a cause which is certain to act' may be a clumsy way of
expressing a brilliant insight into the intervention calculus, if by `acting' he means intervening to force
the state of the variable.
This reading may be somewhat strained; in Arthur Cayley's example, no human intervention is
possible to force the states of the wind or rain variables. Since `wind' and `rain' are both ancestor
variables, no links are removed from the DAG and in Pearl's framework, intervention conditioning
is the same as the standard conditioning on an observation. The wording suggests, though, that he
understood, from causal principles, that the two equations relating λ and µ to p and q remain valid if
the conditioning on an ancestor variable is forced by intervention, rather than simply observed, one of
the features of Pearl's intervention calculus.

4.3 Arthur Cayley: algebraic geometry and Bayesian networks


The emerging field of algebraic statistics (Pistone et al. (2001) [115], Drton et al. (2009) [39])
advocates polynomial algebra as a tool in the statistical analysis of experiments and discrete data; the
connection between algebraic geometry and Bayesian networks is discussed by Garcia et al. (2005)
in [50].
For a probability distribution over a set of variables, the conditional independence statements
for subsets X, Y, Z, W satisfy the logical relations of decomposition, contraction, weak union and
intersection, described on page 32 chapter 2.
A factorisation is equivalent to a set of conditional independence statements;

{X_{σ(j)} ⊥ X_{Ξσ(j)} ∣ X_{Paσ(j)} : j = 1, . . . , d},

where Paσ (j) ⊂ {σ(1), . . . , σ(j − 1)} is the parent set of node σ(j) when ordering σ is employed and
Ξσ (j) = {σ(1), . . . , σ(j − 1)}/Paσ (j).
Let V = {1, . . . , d} denote the node set which indexes the variables, X = (X1 , . . . , Xd ) the random
vector, let the indexing set for the state space for variable Xj be 𝒳_j = {0, 1, . . . , kj − 1} and the indexing
set for the state space for X be 𝒳 = ×_{j=1}^{d} 𝒳_j. Let R(𝒳) denote the ring of polynomial functions on ℝ^𝒳.
A conditional independence statement XA ⊥ XB ∣XC , where A, B and C are disjoint subsets of V ,
translates, using Proposition 8.1 from Sturmfels (2002) [130], into a set of homogeneous quadratic
polynomials in R(𝒳), and these polynomials generate an ideal. Let I_{A⊥B∣C} denote the ideal generated by
the statement XA ⊥ XB ∣XC . The ideal for a collection of independence statements, for example those
corresponding to a factorisation, is defined as the sum of the ideals; let M = {X_{Ai} ⊥ X_{Bi} ∣X_{Ci} : i =
1, . . . , m}, then

I_M = I_{A1⊥B1∣C1} + . . . + I_{Am⊥Bm∣Cm}.
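
In the simplest case, where the joint table is indexed only by the states of (XA, XB, XC), the quadratic polynomials in question are the 2 × 2 minors of the matrix of joint probabilities for each fixed state of XC; the statement XA ⊥ XB∣XC holds exactly when all of them vanish. The sketch below is an illustration under that assumption (no further variables to be summed out), with hypothetical function names and a nested-list table `table[a][b][c]`.

```python
from itertools import combinations


def ci_minors(table):
    """All 2x2 minors p[a][b][c]*p[a2][b2][c] - p[a][b2][c]*p[a2][b][c].

    `table[a][b][c]` is a joint probability table for (X_A, X_B, X_C).
    """
    n_a, n_b, n_c = len(table), len(table[0]), len(table[0][0])
    minors = []
    for c in range(n_c):
        for a, a2 in combinations(range(n_a), 2):
            for b, b2 in combinations(range(n_b), 2):
                minors.append(table[a][b][c] * table[a2][b2][c]
                              - table[a][b2][c] * table[a2][b][c])
    return minors


def is_conditionally_independent(table, tol=1e-12):
    """True when every quadratic generator vanishes (up to rounding)."""
    return all(abs(m) < tol for m in ci_minors(table))
```

A table built as P_C(c) P_{A∣C}(a∣c) P_{B∣C}(b∣c) passes the test, while a generic table does not.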



Cayley is using the expression of the conditional independence statements that define the factorisation
in terms of polynomials to obtain the two polynomial equations

p = λ + (1 − λ)µβ,
q = µ + (1 − µ)λα,                (4.1)

and writes, `.... which determine λ and µ'. This amounts to finding roots of the two polynomials in
λ, µ:

f1 (λ, µ) = λ + (1 − λ)µβ − p,
f2 (λ, µ) = µ + (1 − µ)λα − q.

In terms of algebraic geometry, equation (4.1) defines the affine variety

V (f1 , f2 ) = {(λ, µ) ∈ ℝ² ∣ f1 (λ, µ) = f2 (λ, µ) = 0} .

In his brief note, Cayley has pointed out the connections between Bayesian networks and algebraic
geometry, a subject that he knew well. Cayley did much to clarify a large number of interrelated
theorems in algebraic geometry and is known for the Cayley surface (1869) [17].
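
The points of this variety that lie in the unit square can be found by elimination. The following sketch is an illustration and not Cayley's own procedure: solving f1 = 0 for λ gives λ = (p − µβ)/(1 − µβ), and substituting into f2 = 0 leaves the quadratic β(α − 1)µ² + (1 − αβ − αp + βq)µ + (αp − q) = 0 in µ. The function name and tolerances are arbitrary.

```python
import math


def solve_cayley(alpha, beta, p, q):
    """Return the solutions (lam, mu) in [0, 1]^2 of f1 = f2 = 0."""
    a = beta * (alpha - 1.0)
    b = 1.0 - alpha * beta - alpha * p + beta * q
    c = alpha * p - q
    if abs(a) < 1e-15:  # for example alpha = 1: the equation in mu is linear
        if abs(b) < 1e-15:
            return []
        candidates = [-c / b]
    else:
        disc = b * b - 4.0 * a * c
        if disc < 0:
            return []
        root = math.sqrt(disc)
        candidates = [(-b + root) / (2 * a), (-b - root) / (2 * a)]
    solutions = []
    for mu in candidates:
        denom = 1.0 - mu * beta
        if abs(denom) < 1e-15:
            continue
        lam = (p - mu * beta) / denom
        if -1e-12 <= lam <= 1 + 1e-12 and -1e-12 <= mu <= 1 + 1e-12:
            solutions.append((lam, mu))
    return solutions
```

For instance, α = 0.5, β = 0.4, λ = 0.3, µ = 0.6 give p = 0.468 and q = 0.66, and `solve_cayley(0.5, 0.4, 0.468, 0.66)` recovers the single admissible pair (λ, µ); the second root of the quadratic lies outside [0, 1].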
Chapter 5

Moral Graph, Independence Graph, Chain Graphs

The definition of a chain graph is given below and it is shown that an essential graph is a chain graph,
although not vice versa. The study of chain graphs will be developed in Section 5.2.

Definition 5.1 (Chain Graph). A chain graph is a graph G = (V, E), where the edge set contains
both directed and undirected edges, E = D ∪ U , where D is the set of directed edges and U the set of
undirected edges. The node set V can be partitioned into n disjoint subsets V = V1 ∪ . . . ∪ Vn where the
sets V1 , . . . , Vn are the node sets of the connected components of (V, U ), the graph obtained by removing
all the directed edges, such that:

1. GVj is an undirected graph for all j = 1, . . . , n

2. For any i ≠ j , and any α ∈ Vi , β ∈ Vj , there is no cycle in G = (V, E) (Definition 1.9) containing
both α and β .

The chain graph consists of components where the edges are undirected, which are connected by
directed edges. The components with undirected edges are known as chain components, which are
defined below.

Definition 5.2 (Chain Component). Let G = (V, E) be a chain graph, where E = D ∪ U , D is the set
of directed edges. Let Ĝ = (V, U ) denote the graph obtained by removing all the directed edges from E .
Each connected component of Ĝ is known as a chain component.
The chain components (Vj , Uj ), j = 1, . . . , n of G therefore satisfy the following conditions:

1. Vj ⊆ V and Uj is the edge set obtained by retaining all undirected edges ⟨α, β⟩ ∈ E such that
α ∈ Vj and β ∈ Vj .

2. There is no undirected edge in E from any node in V /Vj to any node in Vj .
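
Since the chain components are the connected components of (V, U ), they can be found by a standard graph search over the undirected edges only. The following is a minimal sketch; the function name and the representation of the graph as a node list plus a list of undirected edge pairs are assumptions made for the illustration.

```python
def chain_components(nodes, undirected_edges):
    """Connected components of (V, U): the chain components of a chain graph."""
    neighbours = {v: set() for v in nodes}
    for a, b in undirected_edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    components, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:  # depth-first search along undirected edges only
            for nb in neighbours[stack.pop()]:
                if nb not in component:
                    component.add(nb)
                    stack.append(nb)
        seen |= component
        components.append(component)
    return components
```

For a hypothetical graph on the nodes 1, 2, 3, 4 whose only undirected edge is ⟨3, 4⟩, the call `chain_components([1, 2, 3, 4], [(3, 4)])` returns the components {1}, {2} and {3, 4}; the directed edges play no role in this step.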

Theorem 5.3 states any essential graph is necessarily a chain graph and presents the additional features
required to ensure that a chain graph is an essential graph corresponding to a directed acyclic graph.
It gives a characterisation for essential graphs that is useful for structure learning algorithms.


[Figure 5.1: Forbidden subgraphs]

Theorem 5.3. Let G = (V, E) be a graph, where E = D ∪ U . There exists a directed acyclic graph G ∗
for which G is the corresponding essential graph if and only if G satisfies the following conditions:

1. G is a chain graph,

2. Each chain component of G is triangulated,

3. The configurations shown in Figure 5.1 do not occur in any induced sub-graph of a three variable
set {α, β, γ} ⊂ V for the first two configurations or a four variable set {α, β, γ, δ} for the third
configuration.

4. Every directed edge (α1 , α2 ) ∈ D is compelled in G .

Proof Proof that an essential graph satisfies the conditions. To prove that it is a chain graph, the
first part of the definition is easily satisfied and it is sufficient to show that there is no cycle in (V, E)
containing α ∈ Vi and β ∈ Vj for two distinct chain components Vi and Vj .
Recall that the edges of a cycle τ0 , . . . , τn are either directed (τi , τi+1 ) or undirected ⟨τi , τi+1 ⟩. Let
(τ, γ) denote a directed edge in the cycle where γ ∈ Vj . Both connected components will have a node γ
with this property. If there is an undirected edge ⟨γ, γ1 ⟩ in the cycle, then there is also an undirected
edge ⟨τ, γ1 ⟩ in the graph. If there is a directed edge (τ, γ1 ) or (γ1 , τ ) then the edge between γ and γ1
is compelled, contradicting the fact that it is undirected. If there is an undirected edge ⟨τ, γ1 ⟩, then
τ ∈ Vj . Proceeding inductively, it is clear that if there is a cycle, then there is an undirected edge
⟨τ1 , τ2 ⟩ where τ1 ∈ Vi and τ2 ∈ Vj , contradicting the fact that the two chain components are distinct. It
follows that an essential graph is a chain graph.
Secondly, if there is a cycle of length ≥ 4 of undirected edges without a chord, then the DAG will
have a directed cycle, otherwise additional immoralities will appear when the edges are directed, hence
the chain components are triangulated.
Thirdly, the configurations stated cannot appear in an essential graph. The fourth requirement
follows from the definition of an essential graph.

For the other direction: suppose a graph satisfies the four conditions stated. All the directed edges
appear in configurations that are compelled and from the forbidden subgraphs, no undirected edges
appear in compelled configurations where there should be a directed edge. It remains to show that the
undirected edges may be oriented in a way that produces a directed acyclic graph.
For each chain component, orient the edges so that the chain component is a directed acyclic
triangulated graph. This can be done. Then, since the first structure is forbidden, this operation
does not produce additional immoralities in the whole graph. Furthermore, since there are no cycles
containing two nodes α and β with α ∈ Vj and β ∈ Vk for j ≠ k , this operation does not produce directed
cycles. The graph is therefore the essential graph of a DAG.

5.1 The Moral Graph and the Independence Graph


Let (P, G) be a Bayesian network; that is, a probability distribution P over a random vector X =
(X1 , . . . , Xd ), such that P factorises along a directed acyclic graph G = (V, D), and this is no longer
true if any variable is eliminated from any of the parent sets.

Definition 5.4 (Moral Graph). Let G = (V, D) be a directed acyclic graph. The moral graph G (m) =
(V, U ) is the undirected graph such that for any α, β ∈ V , ⟨α, β⟩ ∈ U if and only if either (α, β) ∈ D
or (β, α) ∈ D or {α, β} ⊆ Pa(γ) for some γ ∈ V . That is, the moral graph is the graph obtained by
firstly, for each node, adding links between all the parent variables of the node and then undirecting all
the directed edges.
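
The two steps of the definition translate directly into code. The following is a minimal sketch; the function name and the representation of the DAG as a node list with (parent, child) pairs are assumptions made for the illustration.

```python
from itertools import combinations


def moral_graph(nodes, directed_edges):
    """Undirected edge set of the moral graph of a DAG given as (parent, child) pairs."""
    parents = {v: set() for v in nodes}
    for a, b in directed_edges:
        parents[b].add(a)
    undirected = {frozenset(e) for e in directed_edges}  # undirect every edge
    for v in nodes:
        for p, q in combinations(sorted(parents[v]), 2):  # link each pair of parents
            undirected.add(frozenset((p, q)))
    return undirected
```

For the collider a → c ← b, the moral graph contains the added edge ⟨a, b⟩ in addition to ⟨a, c⟩ and ⟨b, c⟩; for a chain a → b → c no edge is added.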

The moral graph satisfies the following property:

Theorem 5.5. Let G = (V, D) be a directed acyclic graph and let G (m) = (V, U ) be its moral graph.
There is an edge ⟨α, β⟩ ∈ U if and only if α ⊥̸ β ∥_G V /{α, β}. That is, the moral graph has an edge ⟨α, β⟩ if
and only if α and β are not D-separated by the remaining variables.

Proof The proof of this is left as an exercise (Exercise 6 page 352).

The independence graph is defined as follows:

Definition 5.6 (Independence Graph). Let X = (X1 , . . . , Xd ) be a random vector. The independence
graph G = (V, U ) is the undirected graph with vertex set V = {1, . . . , d} and where ⟨α, β⟩ ∈ U for α ≠ β
if and only if Xα ⊥̸ Xβ ∣X−(α,β) where the notation X−(α,β) denotes X without components Xα and Xβ .

Recall the definition of separator (Definition 7.15). The independence graph satisfies the following
property:

Theorem 5.7. Let X = (X1 , . . . , Xd ) be a random vector, let V = {1, . . . , d} be the indexing set for X
and let G = (V, U ) be the independence graph of X . Then for three disjoint sets A, B and S such that
V = A ∪ B ∪ S , it holds that A ⊥ B∣S (A and B are conditionally independent given S ) if and only if
A á B 8 S (A and B separated by S ).

Proof Firstly, assume that for three disjoint sets A, B and S such that V = A ∪ B ∪ S , the sets A and B are
separated by S in the independence graph. Then, for each α1 , α2 ∈ A and β ∈ B , set C = V /{α1 , α2 , β}. From the definition
of the independence graph,

Xα1 ⊥ Xβ ∣XC∪{α2 } and Xα2 ⊥ Xβ ∣XC∪{α1 } .

It follows from the intersection property, which states that if X ⊥ Y ∣W ∪ Z and X ⊥ W ∣Y ∪ Z then
X ⊥ W ∪ Y ∣Z , that

(Xα1 , Xα2 ) ⊥ Xβ ∣XC .

By successive applications of the intersection property to each variable, it follows that

XA ⊥ Xβ ∣X−(A∪{β}) .

This holds for all β ∈ B . The intersection property gives:

XA ⊥ Xβ1 ∣X−(A∪{β1 }) and XA ⊥ Xβ2 ∣X−(A∪{β2 }) ⇒ XA ⊥ (Xβ1 , Xβ2 )∣X−(A∪{β1 ,β2 }) .

Successive applications of the intersection property to the variables with indices in B give

XA ⊥ XB ∣XS .

Now assume that XA ⊥ XB ∣XS . Then, for each α ∈ A and β ∈ B , this may be rewritten as

(Xα , XA/{α} ) ⊥ (Xβ , XB/β )∣XS .

Using the weak union result, that X ⊥ Y ∪ Z∣W ⇒ X ⊥ Y ∣Z ∪ W it follows that

Xα ⊥ XB ∣XS∪A/{α}

and another application gives

Xα ⊥ Xβ ∣X−(α,β) .

Theorem 5.8. Let P be a probability distribution that factorises along a DAG G = (V, D). Let G (m) =
(V, U (m) ) denote its moral graph and let G (i) = (V, U (i) ) denote the independence graph of P. Then
U (i) ⊆ U (m) . Furthermore, if (V, D) is faithful to P, then U (i) = U (m) .

Proof From Theorem 5.5, the moral graph has an edge ⟨α, β⟩ if and only if α ⊥̸ β ∥_G V /{α, β}; there
is no edge ⟨α, β⟩ if and only if α ⊥ β ∥_G V /{α, β}. Since D-separation implies conditional independence
(Theorem 1.25), it follows that the lack of an edge ⟨α, β⟩ implies Xα ⊥ Xβ ∣X−(α,β) . From this, it follows
directly that U (i) ⊆ U (m) .
For a faithful DAG, D-separation and conditional independence are equivalent, from which it follows
that U (i) = U (m) when P and G = (V, D) are faithful.

If a distribution P does not have a faithful representation, then for any DAG U (i) ⊂ U (m) .

The following corollary is an obvious consequence of the preceding.

Corollary 5.9. Let X = (X1 , . . . , Xd ) be a random vector and V = {1, . . . , d} be its indexing set. Let
G = (V, D) be a directed acyclic graph, along which P, the probability distribution of X , factorises and
let G (m) be the moral graph. Let V = A ∪ B ∪ S where A, B and S are disjoint subsets. Then A á B 8 S
(A and B separated by S in G (m) ) implies XA ⊥ XB ∣XS (A and B conditionally independent given S ).

Proof A clear consequence of the preceding arguments.

5.2 Chain Graphs


5.2.1 Motivation
Consider the problem of finding a graphical model where each graphical separation statement implies
the corresponding conditional independence statement, and the aim is to locate a graph structure
which encodes as much of the conditional independence structure as possible. When there does not
exist a faithful DAG, a Bayesian Network cannot encode the complete set of conditional independence
statements. Chain graphs give a substantially broader class of graphical models which can encode
more of the conditional independence structure.

Example 5.10 (Chain Graph (1)).


Consider the situation where a probability distribution is constructed out of pairwise potentials:

PX1 ,X2 ,X3 ,X4 (x1 , x2 , x3 , x4 ) = C exp {−β12 (x1 − x2 ) − β23 (x2 − x3 ) − β34 (x3 − x4 ) − β14 (x1 − x4 )} .

Consider a factorisation of this distribution

PX1 ,X2 ,X3 ,X4 = PX1 PX2 ∣X1 PX3 ∣X1 ,X2 PX4 ∣X1 ,X2 ,X3 .

Note that

P_{X1,X2,X3} = C exp {−β12 (x1 − x2 ) − β23 (x2 − x3 )} ∑_{x4} exp {−β34 (x3 − x4 ) − β14 (x1 − x4 )} .

[Figure 5.2: DAG for 4 variable example. Directed edges 1 → 2, 1 → 3, 2 → 3, 1 → 4, 3 → 4.]

[Figure 5.3: Moral graph for 4 variable example. Undirected edges 1 − 2, 1 − 3, 2 − 3, 1 − 4, 3 − 4.]

It follows that the BN is given by the following factorisation:

PX1 ,X2 ,X3 ,X4 = PX1 PX2 ∣X1 PX3 ∣X1 ,X2 PX4 ∣X1 ,X3 .

with DAG given by Figure 5.2. The moral graph of the DAG of Figure 5.2 is given in Figure 5.3.
Whatever the ordering of the variables, the moral graph of the resulting Bayesian network will be
triangulated. The cliques of the moral graph are the parent/variable sets of the factorisation.

More of the independence structure is revealed in this example by the factor graph shown in Figure 5.4,
which is a chain graph.

Example 5.11 (Chain Graph (2)).


Figure 5.5 gives an example of a chain graph which is not an essential graph, where the chain compo-
nents are nevertheless triangulated. Its chain components are shown in Figure 5.6.
Figure 5.5 is the chain graph of a probability distribution which has factorisation:

PX1 ,X2 ,X3 ,X4 = PX1 PX2 PX3 ,X4 ∣X1 ,X2 ,

but where neither X1 ⊥ X4 ∣X3 nor X2 ⊥ X3 ∣X4 holds. Such a distribution could arise, for example, with
a probability distribution

P_{U,X1,X2,X3,X4} = P_U P_{X1} P_{X2} P_{X3∣X1,U} P_{X4∣X2,U}

where U is a hidden variable.

[Figure 5.4: Factor graph for 4 variable example]

[Figure 5.5: Chain Graph, Not an Essential Graph]

[Figure 5.6: Chain Components of Chain Graph, Figure 5.5]

The additional flexibility available for modelling when chain graphs are used should be clear. Chain
graphs, however, still satisfy the composition property and therefore separation statements in a chain
graph do not characterise the independence structure; there does not exist a faithful chain graph for
Example 2.7, the three-coin example.

5.2.2 Factorisation along a Chain Graph


Let X = (X1 , . . . , Xd ) be a random vector indexed by V = {1, . . . , d}. Let P denote the probability
distribution of X . The probability distribution is said to factorise along the chain graph G = (V, E) if
and only if there exist functions φ_C , C ∈ 𝒞 , such that it has a decomposition of the form:

P_{X1,...,Xd} = ∏_{j∈A} P_{Xj∣X_{Pa(j)}} ∏_{C∈𝒞} φ_C

where A = {j ∶ ∃k ∶ (k, j) ∈ D} and 𝒞 denotes the collection of cliques of the chain components; clique
C is the domain of the function φ_C for each C ∈ 𝒞 .

To generalise from DAGs to chain graphs, some additional definitions and machinery are necessary.
The approach taken here follows Ma-Xie-Geng (2008) [86].
A head-to-head section in a chain graph plays the same role as an immorality in a DAG.

Definition 5.12 (Section, Terminal). The terminals of a trail ρ = (ρ0 , . . . , ρk ) are simply the nodes
at each end, ρ0 and ρk . A section of a trail ρ = (ρ0 , . . . , ρk ) is a maximal undirected subroute σ =
(ρi , . . . , ρj ). In other words, either ρi = ρ0 or else i ≠ 0 and there is a directed edge ρi−1 ↦ ρi or
ρi ↦ ρi−1 ; similarly, either j = k or else there is a directed edge ρj ↦ ρj+1 or ρj+1 ↦ ρj .
The vertices ρi and ρj are called the terminals of the section. ρi (ρj ) is a head terminal if i > 0 and G contains the
directed edge ρi−1 ↦ ρi (or j < k and G contains the edge ρj+1 ↦ ρj ), and a tail terminal if i > 0 and the
graph G contains the edge ρi ↦ ρi−1 (or j < k and the graph contains the edge ρj ↦ ρj+1 ).
A section σ of ρ is a head-to-head section if it has two head terminals, otherwise it is a non
head-to-head section.
For a set of vertices S ⊂ V , a section σ is outside S if {ρi , . . . , ρj } ∩ S = ∅; otherwise we say that
σ is hit by S .

A complex within a trail in a chain graph plays a similar role to a collider node in a trail in a DAG.

Definition 5.13 (Complex). A complex in G is a trail ρ = (ρ0 , . . . , ρk ) such that ρ0 ↦ ρ1 and ρk ↦ ρk−1
are in G and, for i = 1, . . . , k − 2, G contains the undirected edges ρi − ρi+1 . The vertices ρ0 and ρk are
the parents of the complex and {ρ1 , . . . , ρk−1 } the region of the complex.

The pattern of a chain graph corresponds to taking the skeleton of a DAG and directing those edges
which belong to immoralities.

Definition 5.14 (Complex Arrow, Pattern, Moral Graph). A directed edge in the chain graph is known
as a complex arrow if it belongs to a complex of G . The pattern of G , denoted G ∗ , is the graph obtained
by undirecting all directed edges which are not complex arrows. The moral graph G (m) of a chain graph
is the graph obtained by first, for each complex, adding an undirected edge between each pair of parents
of the complex, and then undirecting all the edges.

For a chain graph, the descendants of a node are those for which there is a trail where each edge is
either undirected or directed from the node to the descendant.

Definition 5.15 (Descendant). A node β is a descendant of a node α if there is a path ρ = (ρ0 , ρ1 , . . . , ρk )
such that ρ0 = α, ρk = β and for i = 0, . . . , k − 1 there is either an undirected edge ⟨ρi , ρi+1 ⟩ or a
directed edge (ρi , ρi+1 ) in G .
For a DAG, a connection is open if it is an uninstantiated fork or chain, or if it is a collider which
is either instantiated or has an instantiated descendant. In chain graphs, this has to be developed
slightly.
Definition 5.16 (Intervented). A trail ρ in G is intervented by a subset S of V if and only if there
exists a section σ of ρ such that:
1. either σ is a head to head section with respect to ρ and σ and all its descendants are outside S ,
or

2. σ is a non-head-to-head section with respect to ρ and σ is hit by S .

Note In [86], the requirement in 1. that the descendants are also outside S is not given. It is clear
that this is necessary, by considering the situation where the chain graph is a DAG.

The notion of C-separation for chain graphs corresponds to D-separation for DAGs.
Definition 5.17 (C-Separation). Let A, B and S be three disjoint subsets of the vertex set V of a chain
graph G such that A and B are non-empty. The sets A and B are C-separated by S, written A ⊥ B ∥G S, if and
only if every trail with one of its terminals in A and the other in B is intervented by S. The set S is a
C-separator for A and B.
The definition of Markov equivalence is the same, with C-separation substituted for D-separation.
Definition 5.18 (Markov Equivalence). Two chain graphs G1 and G2 are said to be Markov equivalent
if, for any three disjoint subsets A, B and S with both A and B non-empty,
A ⊥ B ∥G1 S ⇔ A ⊥ B ∥G2 S.
Having formulated the concepts for chain graphs that correspond to those for DAGs, the key result for
chain graphs corresponds directly to Theorem 2.11.
Theorem 5.19. Two chain graphs G1 and G2 are Markov equivalent if and only if they have the same
skeleton and the same complexes. That is, they have the same pattern.
108 CHAPTER 5. MORAL GRAPH, INDEPENDENCE GRAPH, CHAIN GRAPHS

Proof See Frydenberg [48] (1990). The proof is similar to that of Theorem 2.11 for DAGs.

A distribution that factorises according to a chain graph is said to be Markovian with respect to the
chain graph.

Definition 5.20 (Markovian). A distribution P is said to be Markovian with respect to a chain graph
G if C-separation statements imply the corresponding independence statements:

A ⊥ B ∥G S ⇒ XA ⊥ XB ∣ XS.

The definition of faithfulness for chain graphs is analogous to faithfulness for DAGs.

Definition 5.21 (Chain Graph Faithfulness). A distribution P is said to be faithful with respect to a
chain graph G if C-separation statements and independence statements are equivalent:

A ⊥ B ∥G S ⇔ XA ⊥ XB ∣ XS.

5.2.3 Separation Trees for Chain Graphs


A DAG can be moralised, the moral graph triangulated and the triangulated moral graph decomposed
into a junction tree. This is the basis of the Aalborg inference engine. The moral graph for a chain
graph is given by Definition 5.14. In Ma-Xie-Geng [86], the separation tree is proposed as the analogous
object to the junction tree.

Definition 5.22. Let G = (V, E) be a chain graph. Let 𝒞 = {C1, …, CH} be a collection of distinct sets
of variables such that V = C1 ∪ … ∪ CH. Let T denote the graph (𝒞, U), where U is a set of labelled undirected
edges: Uij ∈ U if and only if Ci ∩ Cj ≠ ∅; the label is Ci ∩ Cj and Uij is the separator.
T is said to be a tree if removal of the edge Uij, for any pair i ≠ j, splits T into two disjoint trees
Ti (with node set denoted 𝒞i) and Tj (with node set denoted 𝒞j). Let Vi denote the union of the sets C ∈ 𝒞i
and Vj the union of the sets C ∈ 𝒞j.
A tree T with node set 𝒞 is a separation tree for the chain graph G if and only if:

1. the union of all C ∈ 𝒞 is V and

2. for any separator S ∈ U, with V1 and V2 the variable sets of the two subtrees obtained by removing S,
as defined above,

V1 ∖ S ⊥ V2 ∖ S ∥G S.

The separation tree has similarities to the junction tree, but it does not require that the collection
{C1 , . . . , CH } are cliques or that every separator is complete.

A separation tree can be constructed quite easily from the independence graph.

Theorem 5.23. Let X = (X1 , . . . , Xd ) be a random vector and let G (i) denote the independence graph.
Any junction tree constructed from any triangulation of G (i) is a separation tree.

Proof This is obvious, since any separation statement in the independence graph implies the corre-
sponding C -separation statement in the chain graph.
Lemma 5.24. Let α and β be two adjacent nodes in a chain graph G. Then any separation tree T for
G contains a tree-node C such that {α, β} ⊆ C.

Proof Assume not. Then there exists a separator K on T such that α ∈ V1 ∖ K and β ∈ V2 ∖ K, where
Vi denotes the variable set of the subtree Ti obtained by removing the edge labelled by the separator K,
for i = 1, 2. This implies that α ⊥ β ∥G K, which is false, since adjacent nodes cannot be C-separated.

The separation tree satises several properties which will be useful in 16.12 for learning a chain graph.
Some of them are collected in the following theorem.
Theorem 5.25. Let T be a separation tree for a chain graph G = (V, E). Nodes α and β are C-
separated by some set Sαβ ⊂ V in G if and only if one of the following conditions holds:
1. α and β are not both contained in the same tree-node C for any C ∈ 𝒞.

2. α, β ∈ C for some C ∈ 𝒞, but for any separator S ⊂ C, {α, β} ⊄ S, and there exists a set Sαβ ⊂ C
such that

α ⊥ β ∥G Sαβ.

3. There is a C ∈ 𝒞 such that {α, β} ⊆ C, there is a separator S ⊂ C such that {α, β} ⊆ S, but there
is a subset Sαβ of either the union of all C ∈ 𝒞 containing α or the union of all C ∈ 𝒞 containing β such that

α ⊥ β ∥G Sαβ.

The following proposition shows that, similarly to the situation with DAGs, the parents for each
complex are all contained within the same tree node.
Proposition 5.26. Let G be a chain graph and T a separation tree of G. For any complex ρ in G,
there exists a tree-node C ∈ 𝒞 such that Pa(ρ) ⊆ C.
The proofs of Theorem 5.25 and Proposition 5.26 are given after the following example.
Example 5.27 (Chain Graph, Moral Graph, Separation Tree).
Figure 5.7 (a) shows a chain graph, while (b) shows the moralised graph. Figure 5.8 shows a
separation tree. The vertex set for the separation tree here is:

𝒞 = {{A, B, C}, {B, C, D}, {C, D, E}, {D, E, F}, {E, I}, {I, J}, {D, F, G}, {F, G, K, H}}.

In this case, the separation tree is the junction tree corresponding to a triangulation of the moral
graph, but a separation tree does not necessarily have to satisfy this property.
Lemma 5.28. Let G = (V, E) be a chain graph and let α, β ∈ V. There exists an edge α ∼ β in E if
and only if α ⊥̸ β ∥G S for every S ⊆ V ∖ {α, β}.

[Figure 5.7: Chain Graph, Moralised Graph. Panels (a) and (b) are drawn on the nodes A, B, C, D, E, F, G, H, I, J, K.]

[Figure 5.8: A Separation Tree for Figure 5.7. Tree-nodes: ABC, BCD, CDE, DEF, EI, IJ, DFG, FGKH; separator labels: BC, CD, DE, E, I, DF, GF.]



Proof If there is an edge α ∼ β, whether directed or undirected, then clearly α ⊥̸ β ∥G S for any
S ⊆ V ∖ {α, β}. Let Pa(α) = {γ ∶ (γ, α) ∈ E} ∪ {γ ∶ ⟨γ, α⟩ ∈ E}. In other words, the parents of a node α
are all nodes for which there is either a directed edge from the node to α or an undirected edge between
the node and α.
Suppose there is no edge α ∼ β. Then α ⊥ β ∥G Pa(α) if β is an ancestor of α, α ⊥ β ∥G Pa(β) if α is
an ancestor of β, and both statements are true if neither α nor β is an ancestor of the other.

Proof of Theorem 5.25 Clearly, if any of the three conditions holds, then there is a C-separator Sαβ
such that α ⊥ β ∥G Sαβ.
Assume that, for a given separation tree T, none of the conditions holds. That is, there exist
α, β and C such that α, β ∈ C, there is a separator S ⊂ C such that α, β ∈ S, and for every subset S′ of
either the union of all C ∈ 𝒞 containing α or the union of all C ∈ 𝒞 containing β, α ⊥̸ β ∥G S′.
Note that, using the definitions from the proof of Lemma 5.28, if there is no edge α ∼ β, then
either α ⊥ β ∥G Pa(α) or α ⊥ β ∥G Pa(β) or both. Since Pa(α) is contained in the union of all C ∈ 𝒞 containing α
and Pa(β) is contained in the union of all C ∈ 𝒞 containing β, this is a contradiction; hence there is an
edge α ∼ β in G and so, by Lemma 5.28, there is no set R such that α ⊥ β ∥G R. The theorem is proved.

Proof of Proposition 5.26 Suppose that α and β are the parents of a complex κ = (α, γ1, …, γk, β),
where k ≥ 1. Suppose that for every tree-node C ∈ 𝒞, {α, β} ∩ C ≠ {α, β}. Consider two tree-nodes
C1 and C2 such that α ∈ C1 and β ∈ C2. Let C1 − D1 − … − Dn − C2 denote the path in the tree
from C1 to C2 and let S = C1 ∩ D1. If S ∩ {α, β} = ∅, then {γ1, …, γk} ∩ S ≠ ∅. This implies that
α ⊥̸ β ∥G S, since instantiation of any non-empty subset of the set {γ1, …, γk} opens the connection.
This contradicts the fact that S is a separator in the separation tree. It follows that either α ∈ S and
hence α ∈ D1, or β ∈ S and hence β ∈ D1; hence, inductively, it follows that there is a tree-node C such
that {α, β} ⊆ C.
Chapter 6

Evidence and Metrics

6.1 Probability Updates


Let $V = \{X_1, \ldots, X_d\}$ denote the set of random variables, let $\mathcal{X}_j = (x_j^{(1)}, \ldots, x_j^{(k_j)})$ denote the state space
for variable $X_j$ and let $\mathcal{X} = \times_{j=1}^{d} \mathcal{X}_j$ denote the state space for the collection $V$. Let $P : \mathcal{X} \to [0, 1]$
denote the probability function of $(X_1, \ldots, X_d)$. Let $x = (x_1^{(i_1)}, \ldots, x_d^{(i_d)})$ denote an element of $\mathcal{X}$. For
subsets $A \subseteq \mathcal{X}$, $P(A)$ will be used to denote

\[ P(A) = \sum_{x \in A} P(x). \]

The space $\mathcal{X}$ contains a finite number of elements and the event algebra $\mathcal{A}$ is simply the set of all
possible subsets of $\mathcal{X}$. If an event $A \subset \mathcal{X}$ is observed, then the probability $P$ is updated to a probability
measure $P^*$ using the definition of conditional probability,

\[ P^*(B) = P(B \mid A) = \frac{P(AB)}{P(A)}, \]

that is, to a probability function $P^*$ over $\mathcal{X}$ that satisfies

\[ P^*(x) = \begin{cases} \dfrac{P(x)}{P(A)} & x \in A, \\[1ex] 0 & x \notin A. \end{cases} \]
6.1.1 Jeffrey's Rule


There may be evidence that is not expressed in the form that an event $A \subseteq \mathcal{X}$ has occurred. It often
happens in experimental settings that the probability space and event space are determined in advance
and then information is acquired that is not of the form that an event, as a subset of the original
probability space, has occurred.
Jeffrey's rule is for the particular situation where the additional information leads to a re-assessment
of the probabilities for a collection $(G_j)_{j=1}^{r}$ of mutually exclusive and exhaustive events from $P(G_j)$ to
$P^*(G_j)$ and where it may be assumed that the conditional probabilities $P(A \mid G_j)$ remain unaltered for
$j = 1, \ldots, r$ and all $A \subseteq \mathcal{X}$.


Definition 6.1 (Jeffrey's Update). Jeffrey's rule for computing the update of the probability for
any $A \subseteq \mathcal{X}$ is given by

\[ P^*(A) = \sum_{j=1}^{r} P^*(G_j)\, P(A \mid G_j). \tag{6.1} \]

Let $P(G_j) = \mu_j$ and $P^*(G_j) = \lambda_j$ for $j = 1, \ldots, r$. Then, for $x \in \mathcal{X}$,

\[ P^*(x) = \frac{\lambda_j}{\mu_j}\, P(x), \qquad x \in G_j, \quad j = 1, \ldots, r. \tag{6.2} \]
The information leading to the update may be considered as an event $\Xi$ such that $\Xi \not\subseteq \mathcal{X}$. The
probability measure $P$ is extended to accommodate the event $\Xi$ in the following way: for the set of
mutually exclusive and exhaustive events $G_1, \ldots, G_r$ and any $A \subseteq \mathcal{X}$, $\Xi \perp A \mid G_j$ for each $j = 1, \ldots, r$.
The conditional probabilities of the events $G_1, \ldots, G_r$ given $\Xi$ are specified as $P(G_j \mid \Xi) = \lambda_j$. Then, for
any $A \subseteq \mathcal{X}$, the probability update is

\[ \tilde{P}(A) = P(A \mid \Xi) = \sum_{j=1}^{r} P(A \mid G_j, \Xi)\, P(G_j \mid \Xi) = \sum_{j=1}^{r} \lambda_j\, P(A \mid G_j). \]

Using $P(A \mid G_j) = \dfrac{P(A \cap G_j)}{P(G_j)}$ and $\mu_j = P(G_j)$, this gives

\[ \tilde{P}(x) = \frac{\lambda_j}{\mu_j}\, P(x), \qquad x \in G_j, \quad j = 1, \ldots, r. \]

Pearl's Update Pearl's update is a re-expression of Jeffrey's update, where the information is
presented in a slightly different format. Information received is that an event $\Xi \not\subseteq \mathcal{X}$ has happened, where
$\Xi \perp A \mid G_j$ for each $j = 1, \ldots, r$ and $(G_j)_{j=1}^{r}$ is a set of mutually exclusive and exhaustive events. The
information, though, is given in terms of likelihood ratios. Instead of $\lambda_j = P(G_j \mid \Xi)$, the information
is expressed as a collection of likelihood ratios

\[ \rho_j = \frac{P(\Xi \mid G_j)}{P(\Xi \mid G_1)}, \qquad j = 1, \ldots, r, \]

ratios of the likelihood of $\Xi$ given $G_j$ compared with the likelihood of $\Xi$ given $G_1$. That is, $\rho_j$ represents
the likelihood ratio that the event $\Xi$ occurs given that $G_j$ occurs, compared with $G_1$. Note that $\rho_1 = 1$.
Using the same notation $\mu_j = P(G_j)$, for any $A \subseteq \mathcal{X}$, an application of Bayes rule gives

\begin{align*}
\tilde{P}(A) = P(A \mid \Xi) &= \sum_{j=1}^{r} P(A \mid G_j)\, P(G_j \mid \Xi) \\
&= \sum_{j=1}^{r} P(A \mid G_j)\, \frac{P(\Xi \mid G_j)\, P(G_j)}{P(\Xi)} \\
&= \sum_{j=1}^{r} P(A \mid G_j)\, \frac{P(\Xi \mid G_j)\, P(G_j)}{\sum_{k=1}^{r} P(\Xi \mid G_k)\, P(G_k)} \\
&= \sum_{j=1}^{r} P(A \mid G_j)\, \frac{\rho_j \mu_j}{\sum_{k=1}^{r} \rho_k \mu_k}.
\end{align*}

Definition 6.2 (Pearl's update). Let $P$ denote a probability distribution over $\mathcal{X}$ and let $G_1, \ldots, G_r$ be
mutually exclusive (that is, $G_i \cap G_j = \emptyset$ for all $i \neq j$) and exhaustive (that is, $\cup_{j=1}^{r} G_j = \mathcal{X}$) events, where
$P(G_j) = \mu_j$. Let $\rho_1 = 1$ and let $\rho_j$, $j = 2, \ldots, r$, denote a collection of numbers. Then, for each $x \in \mathcal{X}$,
the Pearl update $\tilde{P}$ is defined as

\[ \tilde{P}(x) = P(x)\, \frac{\rho_j}{\sum_{k=1}^{r} \rho_k \mu_k}, \qquad x \in G_j, \quad j = 1, \ldots, r. \tag{6.3} \]

This is clearly a well defined probability function over $\mathcal{X}$. The numbers $\rho_j$ are interpreted as likelihood
ratios, where $\Xi$ is an event with $\Xi \not\subseteq \mathcal{X}$, $P$ is extended to include $\Xi$ in such a way that $\Xi \perp A \mid G_j$ for each
$j = 1, \ldots, r$ and each $A \subseteq \mathcal{X}$, and $\rho_j = P(\Xi \mid G_j) / P(\Xi \mid G_1)$.
Pearl's update and Jeffrey's rule are equivalent: setting $\lambda_j = \rho_j \mu_j / \sum_{k=1}^{r} \rho_k \mu_k$ in (6.2) gives (6.3).
In both cases the original probability space has been extended; information has been received of a form
that cannot be expressed in terms of events, or subsets, of the original probability space.
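The equivalence can be checked numerically. The following is a minimal sketch, not taken from the text: the function names `jeffrey_update` and `pearl_update`, the dictionary representation and the four-point distribution are all assumptions chosen for illustration. It applies Pearl's update with likelihood ratios ρ, reads off the resulting block probabilities λ, and confirms that Jeffrey's rule with those λ returns the same distribution.

```python
def block_masses(p, block):
    """mu_j = P(G_j): total probability of each block of the partition."""
    mu = {}
    for x, px in p.items():
        mu[block[x]] = mu.get(block[x], 0.0) + px
    return mu

def jeffrey_update(p, block, lam):
    """Equation (6.2): P*(x) = P(x) * lambda_j / mu_j for x in G_j."""
    mu = block_masses(p, block)
    return {x: px * lam[block[x]] / mu[block[x]] for x, px in p.items()}

def pearl_update(p, block, rho):
    """Equation (6.3): P~(x) = P(x) * rho_j / sum_k rho_k mu_k for x in G_j."""
    mu = block_masses(p, block)
    z = sum(rho[j] * mu[j] for j in mu)
    return {x: px * rho[block[x]] / z for x, px in p.items()}

# Hypothetical distribution on four states, partitioned into two blocks.
p = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
block = {"a": 0, "b": 0, "c": 1, "d": 1}
rho = {0: 1.0, 1: 3.0}               # likelihood ratios, with rho for block 0 equal to 1

pearl = pearl_update(p, block, rho)
lam = block_masses(pearl, block)     # block probabilities implied by Pearl's update
jeffrey = jeffrey_update(p, block, lam)
print(pearl)
print(jeffrey)
```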

Example 6.3.

A piece of cloth is to be sold on the market. The colour $C$ is either green ($c_g$), blue ($c_b$) or violet
($c_v$). Tomorrow, the piece of cloth will either be sold ($s$) or not ($s^c$); this is denoted by the variable $S$.
Experience gives the following probability distribution $P_{C,S}$ over $(C, S)$:

    S \ C    c_g     c_b     c_v
    s        0.12    0.12    0.32
    s^c      0.18    0.18    0.08

The marginal distribution $P_C$ over $C$ is

             c_g     c_b     c_v
    P_C      0.3     0.3     0.4

The piece of cloth is inspected by candle light. From the inspection by candle light, the probability
over $C$ is assessed as:

             c_g     c_b     c_v
    Q_C      0.7     0.25    0.05

This is a situation where Jeffrey's rule may be used to update the probability:

\[ Q_{S,C} = Q_C\, P_{S \mid C} = \frac{Q_C}{P_C}\, P_{S,C}. \]

This gives, for example,

\[ Q_{S,C}(s, c_g) = \frac{\lambda_g}{\mu_g}\, P_{S,C}(s, c_g) = \frac{0.7}{0.3} \times 0.12 = 0.28. \]

Updating the whole distribution in this way gives $Q_{C,S}$:

    S \ C    c_g     c_b     c_v
    s        0.28    0.10    0.04
    s^c      0.42    0.15    0.01
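The whole updated table can be reproduced in a few lines. This is a sketch using the numbers of Example 6.3; the nested-dictionary layout and the variable names are choices made here, not part of the text.

```python
# Joint distribution P(S, C) from Example 6.3, indexed as joint[s][c].
joint = {
    "s":  {"cg": 0.12, "cb": 0.12, "cv": 0.32},
    "sc": {"cg": 0.18, "cb": 0.18, "cv": 0.08},
}
# Reassessed colour marginal Q_C after the candle-light inspection.
q_c = {"cg": 0.7, "cb": 0.25, "cv": 0.05}

# Original colour marginal P_C, obtained by summing the joint over S.
p_c = {c: sum(row[c] for row in joint.values()) for c in q_c}

# Jeffrey's rule: rescale every cell in column c by Q_C(c) / P_C(c).
updated = {s: {c: joint[s][c] * q_c[c] / p_c[c] for c in q_c} for s in joint}

for s, row in updated.items():
    print(s, {c: round(v, 2) for c, v in row.items()})
```

Because each column is rescaled by a single factor, the conditional distribution of S given C is left unchanged, which is exactly the assumption under which Jeffrey's rule applies.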

6.2 Evidence
For a Bayesian network, three different types of evidence will be discussed: hard evidence, soft evidence
and virtual evidence. The definitions used are as follows.

Definition 6.4 (Hard Evidence, Soft Evidence, Virtual Evidence). The definitions are:

• A hard finding is an instantiation $\{X_i = x_i^{(l)}\}$ for a particular value of $i \in \{1, \ldots, d\}$ and a
particular value of $l \in \{1, \ldots, k_i\}$. This specifies that variable $X_i$ is in state $x_i^{(l)}$.

• Hard evidence is a collection of hard findings.

• A soft finding on a variable $X_j$ specifies the probability distribution of the variable $X_j$. That is,
the conditional probability function $P_{X_j \mid \mathrm{Pa}_j}$ is replaced by a probability function $P^*_{X_j}$ with domain
$\mathcal{X}_j$.

• Soft evidence is a collection of soft findings.

• A virtual finding on variable $X_j$ is a collection of values $\{L(x_j^{(m)}),\ m = 1, \ldots, k_j\}$ such that
the updated conditional probability function for $X_j \mid \mathrm{Pa}_j = \pi_j^{(n)}$ is, for $m = 1, \ldots, k_j$,

\[ P^*_{X_j \mid \mathrm{Pa}_j}(x_j^{(m)} \mid \pi_j^{(n)}) = \frac{1}{\sum_{q=1}^{k_j} P_{X_j \mid \mathrm{Pa}_j}(x_j^{(q)} \mid \pi_j^{(n)})\, L(x_j^{(q)})}\; P_{X_j \mid \mathrm{Pa}_j}(x_j^{(m)} \mid \pi_j^{(n)})\, L(x_j^{(m)}). \tag{6.4} \]

• Virtual evidence is a collection of virtual findings.

Soft evidence and virtual evidence are different. When soft evidence is received on a variable, the links
between the variable and its parents are severed; if soft evidence is received on variable $X_j$, then the
conditional probability function $P_{X_j \mid \mathrm{Pa}_j}$ is replaced by a new probability function $P^*_{X_j}$.
Soft evidence basically applies to the situation described in the discussion of intervention calculus;
it is assumed that the Bayesian network has been derived from causal principles, where the parents of
a variable are direct causes. The soft evidence gives a new distribution over the variable, where the
new distribution is not influenced by its parents. The state of the variable is forced, as in a controlled
experiment, without reference to the other variables, while the new distribution $P^*_{X_j}$ describes the
probability of which state of $X_j$ is enforced.
When virtual evidence is received, the links are preserved; the evidence is interpreted as an additional
variable, which is instantiated.

6.3 Virtual Evidence


A virtual finding on variable $X_j$ affects the probability $P_{X_j \mid \mathrm{Pa}_j}$, without affecting any other conditional
probabilities. The following discussion shows how to incorporate virtual evidence by extending the
probability space by the addition of a virtual variable.

Virtual Evidence and the DAG The following shows how, in general, virtual evidence can be
considered as an additional node $E$ in the DAG. Consider a set of variables $V = \{X_1, \ldots, X_d\}$, where
the joint probability distribution is factorised as

\[ P_{X_1, \ldots, X_d} = \prod_{j=1}^{d} P_{X_j \mid \mathrm{Pa}_j}. \]

Suppose that virtual evidence is received on variable $X_j$. This may be expressed as a variable $E$ and,
by d-separation properties, the updated distribution $P_{X_1, \ldots, X_d, E}$ has a factorisation

\[ P_{X_1, \ldots, X_d, E} = \Big( \prod_{k} P_{X_k \mid \mathrm{Pa}_k} \Big) P_{E \mid X_j}. \tag{6.5} \]

The variable $E$ is a `dummy variable', in the sense that its state space and distribution do not need
to be defined; the virtual evidence is interpreted as a particular instantiation $\{E = e\}$ for this variable
and this is the only information that is needed. From Equation (6.5),

\[ P_{X_1, \ldots, X_d \mid E}(\cdot, \ldots, \cdot \mid e) = \Big( \prod_{k} P_{X_k \mid \mathrm{Pa}_k} \Big) \frac{P_{E \mid X_j}(e \mid \cdot)}{P_E(e)}. \]

From Equation (6.4),

\[ \frac{L(x_j^{(m)})}{\sum_{i=1}^{k_j} L(x_j^{(i)})\, P_{X_j \mid \mathrm{Pa}_j}(x_j^{(i)} \mid \pi_j^{(n)})} = \frac{P_{E \mid X_j}(e \mid x_j^{(m)})}{P_E(e)}, \qquad m = 1, \ldots, k_j, \]

so that for $m_1$ and $m_2$ in $\{1, \ldots, k_j\}$,

\[ \frac{L(x_j^{(m_1)})}{L(x_j^{(m_2)})} = \frac{P_{E \mid X_j}(e \mid x_j^{(m_1)})}{P_{E \mid X_j}(e \mid x_j^{(m_2)})}. \]

When applying virtual evidence, create an extra node on the network, with conditional probabilities
$P_{E \mid X_j}(1 \mid x_j^{(m)}) \propto L(x_j^{(m)})$; any values satisfying $0 < P_{E \mid X_j}(1 \mid x_j^{(m)}) < 1$ for $L(x_j^{(m)}) > 0$ will suffice,
with $P_{E \mid X_j}(0 \mid x_j^{(m)}) = 1 - P_{E \mid X_j}(1 \mid x_j^{(m)})$. For a Bayesian networks programme, these values need to be
defined, although the only conditional probability values used are those for $E = 1$. Then update the
network with the hard evidence $E = 1$.

Equivalence with Pearl's Update Represented on a DAG, the virtual evidence node $E$ satisfies
$E \perp V \setminus \{X_j\} \,\|_{\mathcal{G}}\, X_j$. The virtual evidence $\{E = e\}$ may be expressed as Pearl's update with $\Xi = \{E = e\}$
and the partition events $G_m = \{X_j = x_j^{(m)}\}$ for $m = 1, \ldots, k_j$. The events $(G_m)_{m=1}^{k_j}$ are mutually
exclusive and exhaustive. Set $\rho_1 = 1$ and

\[ \rho_m = \frac{P_{E \mid X_j}(e \mid x_j^{(m)})}{P_{E \mid X_j}(e \mid x_j^{(1)})}, \qquad m = 2, \ldots, k_j. \]

Set

\[ \mu_m = P_{X_j}(x_j^{(m)}), \qquad m = 1, \ldots, k_j. \]

Then, after extending $P$ to accommodate the new variable $E$, the probability distribution $P_{X_1, \ldots, X_d}$
is updated to $\tilde{P}_{X_1, \ldots, X_d} = P_{X_1, \ldots, X_d \mid E}(\cdot, \ldots, \cdot \mid e)$, where

\[ \tilde{P}_{X_1, \ldots, X_d}(x_1^{(i_1)}, \ldots, x_d^{(i_d)}) = P_{X_1, \ldots, X_d}(x_1^{(i_1)}, \ldots, x_d^{(i_d)})\, \frac{\rho_{i_j}}{\sum_{m=1}^{k_j} \mu_m \rho_m}. \]

Example 6.5.

Consider a DAG on five variables, $X_1, X_2, X_3, X_4$ and $X_5$, given in Figure 6.1. Suppose that a piece
of virtual evidence is received on the variable $X_3$. This evidence may be modelled by a variable $E$
that is inserted into the DAG, giving the DAG in Figure 6.2. The state of $X_3$ affects the virtual evidence
that is observed.

[Figure 6.1: Before Virtual Evidence is Added. DAG on the nodes X1, X2, X3, X4, X5.]

From Figure 6.2, it is clear that $(X_1, X_2, X_4, X_5) \perp E \,\|_{\mathcal{G}}\, X_3$. The decomposition along the DAG gives
$P(E \mid X_1, X_2, X_3, X_4, X_5) = P(E \mid X_3)$ and $P(X_1, X_2, X_4, X_5 \mid X_3, E) = P(X_1, X_2, X_4, X_5 \mid X_3)$.

Example 6.6 (Burglary).

Suppose that on any given day, there is a burglary at any given house with probability $10^{-4}$. If there
is a burglary, then the alarm will go off with probability 0.95; if there is no burglary, then it does not
go off. One day, Professor Noddy receives a call from his neighbour Margarita, saying that she may
have heard Professor Noddy's burglar alarm going off. Professor Noddy decides that it is four times
more likely that Margarita did hear the alarm going off than that she was mistaken.

[Figure 6.2: After the Virtual Evidence Node is Added. DAG on the nodes X1, X2, X3, X4, X5 and E.]


Let $A$ take value 1 to denote the alarm going off and 0 otherwise, $B = 1$ to denote that a burglary
takes place and 0 otherwise, and let $E$ denote the variable `telephone call'; $E = 1$ is the evidence that
Noddy received the call from Margarita. This evidence can be interpreted by extending $P$ to include the
variable $E$, where $B \perp E \mid A$ (the virtual evidence is received on $A$; $B$ is the remainder of the network)
and the relevant quantity is

\[ \lambda = \frac{P_{E \mid A}(1 \mid 1)}{P_{E \mid A}(1 \mid 0)} = 4. \]

Then, the update of $P_{B,A}$ requires $P_A$. The probabilities $P_B$ and the conditional probabilities $P_{A \mid B}$ are

    B        1           0
    P_B      10^{-4}     1 - 10^{-4}

    B \ A    1       0
    1        0.95    0.05
    0        0       1

Using $P_{B,A} = P_B P_{A \mid B}$, the joint probabilities $P_{B,A}$ are

    B \ A    1                    0
    1        0.95 × 10^{-4}       0.05 × 10^{-4}
    0        0                    1 - 10^{-4}

so that

    A        1                    0
    P_A      0.95 × 10^{-4}       1 - 0.95 × 10^{-4}

and hence, using Pearl's update,

\[ \tilde{P}_{B,A}(\cdot, 1) = P_{B,A \mid E}(\cdot, 1 \mid 1) = P_{B,A}(\cdot, 1)\, \frac{4}{4 \times 0.95 \times 10^{-4} + 1 - 0.95 \times 10^{-4}}, \]

\[ \tilde{P}_{B,A}(\cdot, 0) = P_{B,A \mid E}(\cdot, 0 \mid 1) = P_{B,A}(\cdot, 0)\, \frac{1}{4 \times 0.95 \times 10^{-4} + 1 - 0.95 \times 10^{-4}}. \]

Exactly the same thing may be computed directly; using $\lambda_0 = 1$ and $\lambda_1 = 4$,

\begin{align*}
\tilde{P}_{B,A} = P_{B,A \mid E}(\cdot, \cdot \mid 1) &= \frac{P_{B,A,E}(\cdot, \cdot, 1)}{P_E(1)} \\
&= \frac{P_B P_{A \mid B} P_{E \mid A}(1 \mid \cdot)}{P_{E \mid A}(1 \mid 1) P_A(1) + P_{E \mid A}(1 \mid 0) P_A(0)} = P_B P_{A \mid B}\, \frac{\lambda_a}{\lambda_1 P_A(1) + \lambda_0 P_A(0)},
\end{align*}

where $a$ denotes the value taken by variable $A$.

It follows that

\[ \tilde{P}_B(1) = \tilde{P}_{B,A}(1, 1) + \tilde{P}_{B,A}(1, 0) = 10^{-4} \times \frac{3.80 + 0.05}{1 + 2.85 \times 10^{-4}} \simeq 3.85 \times 10^{-4}. \]
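The same number can be obtained by adding the virtual node explicitly and conditioning on the hard evidence E = 1, as described in Section 6.3. The sketch below enumerates the joint distribution by brute force; the values 0.8 and 0.2 for the conditional probabilities of E are hypothetical, since only their ratio of 4 matters.

```python
from itertools import product

p_b = {1: 1e-4, 0: 1 - 1e-4}                    # P(B)
p_a_given_b = {1: {1: 0.95, 0: 0.05},           # P(A | B = 1)
               0: {1: 0.0, 0: 1.0}}             # P(A | B = 0)
# Hypothetical CPT entries P(E = 1 | A); only the ratio 0.8 / 0.2 = 4 is used.
p_e1_given_a = {1: 0.8, 0: 0.2}

# Unnormalised posterior weights P(B = b, A = a, E = 1).
weights = {(b, a): p_b[b] * p_a_given_b[b][a] * p_e1_given_a[a]
           for b, a in product((1, 0), repeat=2)}
p_e1 = sum(weights.values())                    # P(E = 1)
posterior = {ba: w / p_e1 for ba, w in weights.items()}

# Marginalise A out to obtain the updated burglary probability.
posterior_burglary = posterior[(1, 1)] + posterior[(1, 0)]
print(f"{posterior_burglary:.3e}")
```

Replacing 0.8 and 0.2 by any other pair of values in (0, 1) with ratio 4, for example 0.4 and 0.1, leaves `posterior_burglary` unchanged.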

6.4 Measures of Divergence between Probability Distributions


A distance is a more specific measure of divergence, which satisfies the properties given in the following
definition.

Definition 6.7 (Distance). A measure of divergence $D$ between probability distributions is a distance
if it satisfies the following three properties: for any three probability distributions $P_1$, $P_2$ and $P_3$ over
the same space $\mathcal{X} = (x_1, \ldots, x_k)$,

• Positivity: $D(P_1, P_2) \geq 0$. Furthermore, $D(P_1, P_2) = 0 \Leftrightarrow P_1 \equiv P_2$.

• Symmetry: $D(P_1, P_2) = D(P_2, P_1)$.

• Triangle Inequality: $D(P_1, P_3) \leq D(P_1, P_2) + D(P_2, P_3)$.

Consider two common measures of divergence between probability distributions. Let $P$ and $Q$ be two
probability functions over the same finite state space $\mathcal{X} = (x_1, \ldots, x_k)$ and let $p_j = P(x_j)$ and $q_j = Q(x_j)$
for $j = 1, \ldots, k$.

Definition 6.8 (Euclidean Distance). The quadratic or Euclidean distance is defined as

\[ D_2(P, Q) = \sqrt{\sum_{j=1}^{k} (p_j - q_j)^2}. \]

Definition 6.9 (Kullback Leibler Divergence). The Kullback Leibler divergence between two probability
distributions $P$ and $Q$ over the same state space $\mathcal{X}$ is defined as

\[ D_{KL}(P \| Q) = \sum_{j=1}^{k} p_j \ln \frac{p_j}{q_j}. \]

The Kullback Leibler divergence is non negative (left as an exercise) and $D_{KL}(P \| Q) = 0 \Leftrightarrow P \equiv Q$, but
it is not a distance in the sense of Definition 6.7; it does not, in general, satisfy $D_{KL}(P \| Q) = D_{KL}(Q \| P)$.

Example 6.10.
Let

\[ P^{(1)} = (0.02, 0.98), \quad Q^{(1)} = (0.0364, 0.9636), \quad P^{(2)} = (0.01, 0.99), \quad Q^{(2)} = (0.00471, 0.99529). \]

Then

\begin{align*}
D_2(P^{(1)}, Q^{(1)}) &= \sqrt{(0.02 - 0.0364)^2 + (0.98 - 0.9636)^2} = 0.0232, \\
D_2(P^{(2)}, Q^{(2)}) &= \sqrt{(0.00471 - 0.01)^2 + (0.99529 - 0.99)^2} = 0.00748,
\end{align*}

so the change represented by the second adjustment is less than one third of the change represented
by the first if the change is measured using the quadratic distance measure. For the Kullback Leibler
divergence,

\begin{align*}
D_{KL}(P^{(1)} \| Q^{(1)}) &= 0.02 \ln \frac{0.02}{0.0364} + 0.98 \ln \frac{0.98}{0.9636} = 0.004562, \\
D_{KL}(P^{(2)} \| Q^{(2)}) &= 0.01 \ln \frac{0.01}{0.00471} + 0.99 \ln \frac{0.99}{0.99529} = 0.00225,
\end{align*}

so the change represented by the second adjustment is approximately one half of the change represented
by the first. Clearly, different divergence measures give different impressions of the relative importance
of parameter changes.
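The figures in Example 6.10 can be reproduced with the short sketch below. The function names are chosen here for illustration, and the convention that terms with $p_j = 0$ contribute zero to the Kullback Leibler sum is an assumption that the text does not state.

```python
import math

def euclidean(p, q):
    """Quadratic (Euclidean) distance of Definition 6.8."""
    return math.sqrt(sum((pj - qj) ** 2 for pj, qj in zip(p, q)))

def kl_divergence(p, q):
    """Kullback Leibler divergence of Definition 6.9.
    Assumption: terms with p_j = 0 are skipped (0 ln 0 taken as 0);
    q_j = 0 with p_j > 0 is not handled and raises ZeroDivisionError."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)

p1, q1 = (0.02, 0.98), (0.0364, 0.9636)
p2, q2 = (0.01, 0.99), (0.00471, 0.99529)

for name, p, q in (("first", p1, q1), ("second", p2, q2)):
    print(name, round(euclidean(p, q), 5), round(kl_divergence(p, q), 6))
```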

6.5 The Chan - Darwiche Distance Measure


A drawback of both the Kullback Leibler divergence and the quadratic distance measure is that they do not
emphasise the proportional difference between two probability values when they are close to zero. The
following distance measure was proposed by Chan and Darwiche. It will be seen that it is particularly
useful when odds ratios are to be compared.

Definition 6.11 (Chan - Darwiche Distance). Let $P$ and $Q$ be two probability functions over a finite
state space $\mathcal{X}$. That is, $P : \mathcal{X} \to [0, 1]$ and $Q : \mathcal{X} \to [0, 1]$, $\sum_{x \in \mathcal{X}} P(x) = 1$ and $\sum_{x \in \mathcal{X}} Q(x) = 1$. The Chan
- Darwiche distance is defined as

\[ D_{CD}(P, Q) = \ln \max_{x \in \mathcal{X}} \frac{Q(x)}{P(x)} - \ln \min_{x \in \mathcal{X}} \frac{Q(x)}{P(x)}, \]

where, by definition, $\frac{0}{0} = 1$.

Unlike the Kullback - Leibler divergence, the Chan - Darwiche distance is a distance; it satisfies the
three requirements of Definition 6.7. This result is stated in Theorem 6.13.

The support of a probability function defined on a finite state space, namely the set of points where it is
strictly positive (the outcomes that can happen), is important when comparing two different
probability functions over the same state space.

Definition 6.12 (Support). Let $P$ be a probability function over a countable state space $\mathcal{X}$; that is,
$P : \mathcal{X} \to [0, 1]$ and $\sum_{x \in \mathcal{X}} P(x) = 1$. The support of $P$ is defined as the subset $S_P \subseteq \mathcal{X}$ such that

\[ S_P = \{x \in \mathcal{X} \mid P(x) > 0\}. \tag{6.6} \]

Theorem 6.13. The Chan - Darwiche distance measure is a distance measure, in the sense that for
any three probability functions $P_1, P_2, P_3$ over a state space $\mathcal{X}$, the following three properties hold:

• Positivity: $D_{CD}(P_1, P_2) \geq 0$ and $D_{CD}(P_1, P_2) = 0 \Leftrightarrow P_1 \equiv P_2$.

• Symmetry: $D_{CD}(P_1, P_2) = D_{CD}(P_2, P_1)$.

• Triangle Inequality: $D_{CD}(P_1, P_2) + D_{CD}(P_2, P_3) \geq D_{CD}(P_1, P_3)$.


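Proof Positivity and symmetry are clear and are left as exercises. It only remains to prove the
triangle inequality. Since the state space is discrete and finite, it follows that there exist $y, z \in \mathcal{X}$ such
that

\begin{align*}
D_{CD}(P_1, P_3) &= \ln \max_{x \in \mathcal{X}} \frac{P_3(x)}{P_1(x)} - \ln \min_{x \in \mathcal{X}} \frac{P_3(x)}{P_1(x)} = \ln \frac{P_3(y)}{P_1(y)} - \ln \frac{P_3(z)}{P_1(z)} \\
&= \ln \frac{P_3(y)}{P_2(y)} + \ln \frac{P_2(y)}{P_1(y)} - \ln \frac{P_3(z)}{P_2(z)} - \ln \frac{P_2(z)}{P_1(z)} \\
&= \left( \ln \frac{P_3(y)}{P_2(y)} - \ln \frac{P_3(z)}{P_2(z)} \right) + \left( \ln \frac{P_2(y)}{P_1(y)} - \ln \frac{P_2(z)}{P_1(z)} \right) \\
&\leq \left( \ln \max_{x \in \mathcal{X}} \frac{P_3(x)}{P_2(x)} - \ln \min_{x \in \mathcal{X}} \frac{P_3(x)}{P_2(x)} \right) + \left( \ln \max_{x \in \mathcal{X}} \frac{P_2(x)}{P_1(x)} - \ln \min_{x \in \mathcal{X}} \frac{P_2(x)}{P_1(x)} \right) \\
&= D_{CD}(P_2, P_3) + D_{CD}(P_1, P_2).
\end{align*}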
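A direct implementation of Definition 6.11 makes the three properties easy to check on examples. In the sketch below, the function name `chan_darwiche` and the three test distributions are illustrative assumptions; the convention 0/0 = 1 is the one in the definition, and distributions with different supports are given distance $+\infty$.

```python
import math

def chan_darwiche(p, q):
    """D_CD(P, Q) = ln max_x Q(x)/P(x) - ln min_x Q(x)/P(x), with 0/0 := 1."""
    ratios = []
    for px, qx in zip(p, q):
        if px == 0 and qx == 0:
            ratios.append(1.0)       # convention 0/0 = 1
        elif px == 0 or qx == 0:
            return math.inf          # the supports differ
        else:
            ratios.append(qx / px)
    return math.log(max(ratios)) - math.log(min(ratios))

# Three hypothetical distributions on a three-point state space.
p1 = (0.2, 0.3, 0.5)
p2 = (0.1, 0.6, 0.3)
p3 = (0.4, 0.4, 0.2)

d12, d23, d13 = chan_darwiche(p1, p2), chan_darwiche(p2, p3), chan_darwiche(p1, p3)
print(d12, d23, d13, d13 <= d12 + d23)
```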

This distance is relatively easy to compute. It has the advantage over the Kullback Leibler divergence
(which is not a true distance measure) that it may be used to obtain bounds on odds ratios.

Definition 6.14 (Odds). Let $P$ be a probability measure over $\mathcal{X}$ and let $A \subset \mathcal{X}$ and $B \subset \mathcal{X}$. The odds
for $A$ versus $A^c$ given $B$ is defined as

\[ O_P(A \mid B) = \frac{P(A \mid B)}{P(A^c \mid B)}. \]

Comparison with the Kullback Leibler Divergence and Euclidean Distance Consider two
probability distributions $P = (p_1, p_2, p_3)$ and $Q = (q_1, q_2, q_3)$ over $\{1, 2, 3\}$ defined by

\[ p_1 = a, \quad p_2 = b - a, \quad p_3 = 1 - b, \]

\[ q_1 = ka, \quad q_2 = b - ka, \quad q_3 = 1 - b. \]

Then

\[ D_{KL}(P \| Q) = -a \ln k - (b - a) \ln \frac{b - ka}{b - a}. \]

Consider the events $A = \{1\}$, $B = \{1, 2\}$; then $O_P(A \mid B) = \frac{a}{b - a}$ and $O_Q(A \mid B) = \frac{ka}{b - ka}$, and the odds ratio
is given by

\[ \frac{O_Q(A \mid B)}{O_P(A \mid B)} = \frac{k(b - a)}{b - ka}. \]

As $a \to 0$, $D_{KL}(P \| Q) \to 0$, while $\frac{O_Q(A \mid B)}{O_P(A \mid B)} \to k$. It is therefore not possible to find a bound on the odds
ratio in terms of the Kullback Leibler divergence.

Similarly, in this example, the Euclidean distance is

\[ D_2(P, Q) = \sqrt{2}\, a\, |1 - k| \xrightarrow{a \to 0} 0, \]

while

\[ D_{CD}(P, Q) = \ln \max \left( \frac{1}{k}, \frac{b - a}{b - ka}, 1 \right) - \ln \min \left( \frac{1}{k}, \frac{b - a}{b - ka}, 1 \right) \xrightarrow{a \to 0} |\ln k|. \]

Neither the Kullback Leibler divergence nor the Euclidean distance can be used to provide uniform
bounds on the odds ratios; even if there is a large relative difference between pairs of probability values
for $P$ and $Q$, it will be ignored if the absolute values of these probabilities are small.
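The limiting behaviour can be seen numerically. The sketch below evaluates the three measures and the odds ratio for the family above; the function name `compare` and the parameter values b = 0.5 and k = 2 are illustrative assumptions.

```python
import math

def compare(a, b=0.5, k=2.0):
    """Return (D_KL, D_2, D_CD, odds ratio) for the family p = (a, b-a, 1-b),
    q = (ka, b-ka, 1-b), with A = {1} and B = {1, 2}."""
    p = (a, b - a, 1 - b)
    q = (k * a, b - k * a, 1 - b)
    kl = sum(pj * math.log(pj / qj) for pj, qj in zip(p, q))
    d2 = math.sqrt(sum((pj - qj) ** 2 for pj, qj in zip(p, q)))
    ratios = [qj / pj for pj, qj in zip(p, q)]
    cd = math.log(max(ratios)) - math.log(min(ratios))
    odds_ratio = k * (b - a) / (b - k * a)
    return kl, d2, cd, odds_ratio

for a in (1e-1, 1e-3, 1e-6):
    print(a, compare(a))
```

As a decreases, the first two entries shrink towards zero while the Chan - Darwiche distance and the odds ratio stay near ln 2 and 2 respectively.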
The Chan Darwiche distance measure is useful, because it can be used to obtain sharp bounds on the
way that odds change as the probability distribution changes.

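Theorem 6.15. Let $P$ and $Q$ be two probability distributions over the same finite state space $\mathcal{X}$ and let
$A$ and $B$ be two subsets of $\mathcal{X}$. Let $A^c = \mathcal{X} \setminus A$ and $B^c = \mathcal{X} \setminus B$. Let $O_P(A \mid B) = \frac{P(A \mid B)}{P(A^c \mid B)}$ and
$O_Q(A \mid B) = \frac{Q(A \mid B)}{Q(A^c \mid B)}$. Then

\[ e^{-D_{CD}(P, Q)} \leq \frac{O_Q(A \mid B)}{O_P(A \mid B)} \leq e^{D_{CD}(P, Q)}. \]

The bound is sharp in the sense that for any pair of distributions $(P, Q)$ there are subsets $A$ and $B$ of
$\mathcal{X}$ such that

\[ \frac{O_Q(A \mid B)}{O_P(A \mid B)} = \exp\{D_{CD}(P, Q)\}, \qquad \frac{O_Q(A^c \mid B)}{O_P(A^c \mid B)} = \exp\{-D_{CD}(P, Q)\}. \]

Proof of Theorem 6.15 Without loss of generality, it may be assumed that $P$ and $Q$ have the same
support; that is, $P(x) > 0 \Leftrightarrow Q(x) > 0$. Otherwise $D_{CD}(P, Q) = +\infty$ and the statement is trivially
true; for any $A, B \subseteq \mathcal{X}$, $0 \leq \frac{O_Q(A \mid B)}{O_P(A \mid B)} \leq +\infty$. For $P$ and $Q$ with the same support, let
$r(x) = \frac{Q(x)}{P(x)}$. For any two subsets $A, B \subseteq \mathcal{X}$,

\begin{align*}
\frac{O_Q(A \mid B)}{O_P(A \mid B)} &= \frac{Q(A \mid B)}{1 - Q(A \mid B)} \cdot \frac{1 - P(A \mid B)}{P(A \mid B)} = \frac{Q(AB)}{Q(A^c B)} \cdot \frac{P(A^c B)}{P(AB)} = \frac{\sum_{x \in AB} Q(x) \sum_{x \in A^c B} P(x)}{\sum_{x \in A^c B} Q(x) \sum_{x \in AB} P(x)} \\
&= \frac{\sum_{x \in AB} r(x) P(x) \sum_{x \in A^c B} P(x)}{\sum_{x \in A^c B} r(x) P(x) \sum_{x \in AB} P(x)} \leq \frac{\max_{z \in \mathcal{X}} r(z) \sum_{x \in AB} P(x) \sum_{x \in A^c B} P(x)}{\min_{z \in \mathcal{X}} r(z) \sum_{x \in A^c B} P(x) \sum_{x \in AB} P(x)} \\
&= \frac{\max_{z \in \mathcal{X}} r(z)}{\min_{z \in \mathcal{X}} r(z)}.
\end{align*}

Similarly,

\[ \frac{O_Q(A \mid B)}{O_P(A \mid B)} \geq \frac{\min_{z \in \mathcal{X}} r(z)}{\max_{z \in \mathcal{X}} r(z)}. \]

From the definition of $D_{CD}(P, Q)$, it follows directly that

\[ e^{D_{CD}(P, Q)} = \frac{\max_{z \in \mathcal{X}} r(z)}{\min_{z \in \mathcal{X}} r(z)}, \]

hence

\[ e^{-D_{CD}(P, Q)} \leq \frac{O_Q(A \mid B)}{O_P(A \mid B)} \leq e^{D_{CD}(P, Q)}, \]

as required, thus proving the first part.

To prove that the bound is tight, consider $x$ such that $r(x) = \max_{z \in \mathcal{X}} r(z)$ and $y$ such that $r(y) =
\min_{z \in \mathcal{X}} r(z)$. Set $A = \{x\}$ and $B = \{x, y\}$. Then

\[ O_Q(A \mid B) = \frac{r(x) P(x)}{r(y) P(y)}. \]

Since $O_P(A \mid B) = \frac{P(x)}{P(y)}$ and $e^{D_{CD}(P, Q)} = \frac{\max_{z \in \mathcal{X}} r(z)}{\min_{z \in \mathcal{X}} r(z)}$, it follows that

\[ \frac{O_Q(A \mid B)}{O_P(A \mid B)} = e^{D_{CD}(P, Q)}. \]

Similarly, let $C = \{y\}$; then

\[ \frac{O_Q(C \mid B)}{O_P(C \mid B)} = e^{-D_{CD}(P, Q)}. \]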

Theorem 6.15 may be used to obtain bounds on arbitrary queries Q(A∣B) for the measure Q in terms
of P(A∣B).

Corollary 6.16. Set $d = D_{CD}(P, Q)$, then

\[ \frac{P(A \mid B)\, e^{-d}}{1 + (e^{-d} - 1) P(A \mid B)} \leq Q(A \mid B) \leq \frac{P(A \mid B)\, e^{d}}{1 + (e^{d} - 1) P(A \mid B)}. \tag{6.7} \]

Proof Equation (6.7) is a straightforward consequence of Theorem 6.15. The computation is left as
an exercise.
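One possible route to (6.7) is sketched here, under the assumption $0 < P(A \mid B) < 1$ and with the shorthand $p = P(A \mid B)$, $q = Q(A \mid B)$ introduced only for this sketch. Theorem 6.15 gives $\frac{q}{1 - q} \leq e^{d} \frac{p}{1 - p}$, and since $t \mapsto \frac{t}{1 + t}$ is increasing on $[0, \infty)$,

\begin{align*}
q = \frac{q/(1 - q)}{1 + q/(1 - q)} &\leq \frac{e^{d} p/(1 - p)}{1 + e^{d} p/(1 - p)} = \frac{e^{d} p}{1 - p + e^{d} p} = \frac{p\, e^{d}}{1 + (e^{d} - 1) p}.
\end{align*}

The lower bound follows in the same way from $\frac{q}{1 - q} \geq e^{-d} \frac{p}{1 - p}$, with $d$ replaced by $-d$ throughout.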

6.5.1 Soft Evidence and Virtual Evidence


Jeffrey's Rule Let $P$ denote a probability distribution over a finite state space $\mathcal{X}$ and let $Q$ denote
the distribution obtained by updating according to Jeffrey's rule. The following formula may be
established.

Theorem 6.17. Let $P$ be a probability distribution over a countable state space $\mathcal{X}$ and let $G_1, \ldots, G_n$
be a collection of mutually exclusive and exhaustive events. Let $\lambda_j = P(G_j)$ for $j = 1, \ldots, n$. Let $Q$
denote the probability distribution such that $Q(G_j) = \mu_j$ for $j = 1, \ldots, n$ and such that for all $x \in \mathcal{X}$

\[ Q(x) = P(x)\, \frac{\mu_j}{\lambda_j}, \qquad x \in G_j. \]

In other words, $Q$ is the Jeffrey's update of $P$, defined by $Q(G_j) = \mu_j$, $j = 1, \ldots, n$. Then

\[ D_{CD}(P, Q) = \ln \max_{j} \frac{\lambda_j}{\mu_j} - \ln \min_{j} \frac{\lambda_j}{\mu_j}. \]

Proof This follows directly and is left as an exercise.
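As a numerical check of the formula, the sketch below uses the cloth distributions of Example 6.3 and compares the distance computed cell by cell with the distance computed from the block probabilities alone. The flattened list layout and the names `before` and `after` (the probabilities of the blocks under P and Q) are choices made for this illustration.

```python
import math

# Example 6.3, flattened row by row: (s, cg), (s, cb), (s, cv), (sc, cg), (sc, cb), (sc, cv).
p = [0.12, 0.12, 0.32, 0.18, 0.18, 0.08]   # original joint P
q = [0.28, 0.10, 0.04, 0.42, 0.15, 0.01]   # Jeffrey update Q

before = [0.3, 0.3, 0.4]                   # P(G_j): colour marginal before the update
after = [0.7, 0.25, 0.05]                  # Q(G_j): colour marginal after the update

# Distance computed directly from Definition 6.11.
cell_ratios = [qx / px for px, qx in zip(p, q)]
d_direct = math.log(max(cell_ratios)) - math.log(min(cell_ratios))

# Distance computed from the block probabilities only, as in Theorem 6.17.
block_ratios = [pj / qj for pj, qj in zip(before, after)]
d_blocks = math.log(max(block_ratios)) - math.log(min(block_ratios))

print(d_direct, d_blocks)
```

Both computations give the logarithm of (0.7/0.3)/(0.05/0.4), roughly 2.93, because every cell in a block is rescaled by the same factor.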

This immediately gives the following bound.

Corollary 6.18. Let $O_P$ and $O_Q$ denote the odds function before and after applying Jeffrey's rule. Let

\[ d = \ln \max_{j} \frac{\lambda_j}{\mu_j} - \ln \min_{j} \frac{\lambda_j}{\mu_j}. \]

Then for any two events $A$ and $B$,

\[ e^{-d} \leq \frac{O_P(A \mid B)}{O_Q(A \mid B)} \leq e^{d}. \]

Proof This follows directly.

Under the Chan - Darwiche distance measure, Jeffrey's rule may be considered optimal, in the following
sense.
Theorem 6.19. Let $P$ denote a probability distribution over $\mathcal{X}$ and let $G_1, \ldots, G_r$ denote a collection of
mutually exclusive and exhaustive events. Let $\mu_j = P(G_j)$, let $\lambda_1, \ldots, \lambda_r$ be a collection of non negative
numbers such that $\sum_{j=1}^{r} \lambda_j = 1$ and let $Q$ be the probability distribution over $\mathcal{X}$ defined by

\[ Q(x) = P(x)\, \frac{\lambda_j}{\mu_j}, \qquad x \in G_j. \]

Then $Q$ minimises $D_{CD}(P, R)$ subject to the constraint that $R$ is a probability distribution
over $\mathcal{X}$ such that $R(G_i) = \lambda_i$ for $i = 1, \ldots, r$.

Proof Let $Q$ denote the distribution generated by Jeffrey's rule and let $R$ be any distribution that
satisfies the constraint $R(G_j) = Q(G_j) = \lambda_j$, $j = 1, \ldots, r$. If $P$ and $R$ do not have the same support
(Definition 6.12), then $+\infty = D_{CD}(P, R) \geq D_{CD}(P, Q)$. If they have the same support, let $j$ denote the
value such that $\frac{\lambda_j}{\mu_j} = \max_i \frac{\lambda_i}{\mu_i}$ and let $k$ denote the value such that $\frac{\lambda_k}{\mu_k} = \min_i \frac{\lambda_i}{\mu_i}$. Let
$\alpha = \max_{x \in \mathcal{X}} \frac{R(x)}{P(x)}$. Then

\[ \alpha \mu_j = \alpha \sum_{x \in G_j} P(x) \geq \sum_{x \in G_j} P(x)\, \frac{R(x)}{P(x)} = R(G_j) = \lambda_j, \]

so that

\[ \alpha \geq \frac{\lambda_j}{\mu_j}. \]

Set $\beta = \min_{x \in \mathcal{X}} \frac{R(x)}{P(x)}$; then a similar argument gives $\beta \leq \frac{\lambda_k}{\mu_k}$. It follows that the distance between $P$ and
$R$ is

\begin{align*}
D_{CD}(P, R) &= \ln \max_{x \in \mathcal{X}} \frac{R(x)}{P(x)} - \ln \min_{x \in \mathcal{X}} \frac{R(x)}{P(x)} = \ln \alpha - \ln \beta \\
&\geq \ln \frac{\lambda_j}{\mu_j} - \ln \frac{\lambda_k}{\mu_k} = \ln \max_{i} \frac{\lambda_i}{\mu_i} - \ln \min_{i} \frac{\lambda_i}{\mu_i} = D_{CD}(P, Q).
\end{align*}

Therefore $Q$ gives the smallest distance.



Pearl's Method of Virtual Evidence Recall Pearl's Method of Virtual Evidence. The CD distance
between the original distribution and the updated distribution has a convenient expression.

Theorem 6.20. Let $P$ be a probability distribution over a finite state space $\mathcal{X}$ and let $\lambda_1 = 1$ and $\lambda_2, \ldots, \lambda_r$ be positive numbers. Let $G_1, \ldots, G_r$ be a collection of mutually exclusive and exhaustive subsets of $\mathcal{X}$. Let $\mu_j = \sum_{x \in G_j} P(x)$ for $j = 1, \ldots, r$. Let $Q$ be defined as

$$Q(x) = P(x) \frac{\lambda_j}{\sum_{k=1}^r \mu_k \lambda_k}, \qquad x \in G_j.$$

Then $Q$ is a probability distribution over $\mathcal{X}$ and

$$D_{CD}(P, Q) = \ln \max_i \lambda_i - \ln \min_i \lambda_i.$$

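Proof Firstly, it is clear from the construction that $\sum_{x \in \mathcal{X}} Q(x) = 1$ and that $Q(x) \ge 0$ for all $x \in \mathcal{X}$, so that $Q$ is a probability function. From the definition,

$$\frac{Q(x)}{P(x)} = \frac{\lambda_j}{\sum_k \mu_k \lambda_k}, \qquad x \in G_j.$$

It follows that

$$D_{CD}(P, Q) = \ln \max_{x \in \mathcal{X}} \frac{Q(x)}{P(x)} - \ln \min_{x \in \mathcal{X}} \frac{Q(x)}{P(x)} = \ln \max_j \frac{\lambda_j}{\sum_k \mu_k \lambda_k} - \ln \min_j \frac{\lambda_j}{\sum_k \mu_k \lambda_k} = \ln \max_j \lambda_j - \ln \min_j \lambda_j$$

as required.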
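The construction in Theorem 6.20 is short enough to check directly. The following is a minimal sketch, assuming a hypothetical four-point prior and hypothetical likelihood ratios $(1, 4)$; the names `pearl_update` and `cd_distance` are illustrative and do not come from the text.

```python
import math

def pearl_update(p, partition, lam):
    """Virtual-evidence update: Q(x) = P(x) * lambda_j / sum_k mu_k lambda_k for x in G_j."""
    mu = [sum(p[x] for x in block) for block in partition]   # mu_j = P(G_j)
    norm = sum(m * l for m, l in zip(mu, lam))                # sum_k mu_k lambda_k
    q = {}
    for block, lam_j in zip(partition, lam):
        for x in block:
            q[x] = p[x] * lam_j / norm
    return q

def cd_distance(p, q):
    """Chan-Darwiche distance; this sketch assumes P and Q share the same support."""
    ratios = [q[x] / p[x] for x in p if p[x] > 0]
    return math.log(max(ratios)) - math.log(min(ratios))

# Hypothetical prior, partition and likelihood ratios (illustration only).
p = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
partition = [("a", "b"), ("c", "d")]
lam = [1.0, 4.0]

q = pearl_update(p, partition, lam)
d = cd_distance(p, q)
print(sum(q.values()), d, math.log(4.0))
```

The normalising constant cancels in the ratio $Q(x)/P(x)$, which is why the distance depends only on the largest and smallest $\lambda_j$.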

This immediately gives the following bound.

Corollary 6.21. Let $O_Q$ and $O_P$ denote the odds functions associated with the probability measures defined in Theorem 6.20 and let

$$d = D_{CD}(P, Q) = \ln \max_i \lambda_i - \ln \min_i \lambda_i.$$

Then for any events $A, B \subseteq \mathcal{X}$,

$$e^{-d} \le \frac{O_Q(A \mid B)}{O_P(A \mid B)} \le e^{d}.$$

Proof This follows directly and is left as an exercise.

Example 6.22.

The 'Burglary' example may be developed to illustrate these results. Let $A$ denote the event that the alarm goes off, $B$ the event that a burglary takes place and let $E$ denote the evidence of the telephone call from Jemima. According to Pearl's method, this evidence can be interpreted as

$$\lambda = \frac{P_{E \mid A}(1 \mid 1)}{P_{E \mid A}(1 \mid 0)} = 4.$$

Therefore, the distance between the original distribution $P$ and the update $Q(\cdot) = P(\cdot \mid E = 1)$ derived according to Pearl's method is $D_{CD}(P, Q) = \ln 4 \simeq 1.386$. This distance may be used to bound $Q_B(1)$, the probability of a burglary, after the update to incorporate the evidence. Using the bound stated in the corollary,

$$\frac{P_B(1) e^{-d}}{1 + (e^{-d} - 1) P_B(1)} \le Q_B(1) \le \frac{P_B(1) e^{d}}{1 + (e^{d} - 1) P_B(1)},$$

so that $2.50 \times 10^{-5} \le Q_B(1) \le 4.00 \times 10^{-4}$. An application of Pearl's virtual evidence rule gives $Q_B(1) = 3.85 \times 10^{-4}$.
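The two-sided bound on the updated probability is easy to evaluate. The following is a minimal sketch; the prior burglary probability $10^{-4}$ is an assumption made here because it reproduces the bounds quoted in the example (the prior itself is not restated in this passage), and `posterior_bounds` is an illustrative name.

```python
import math

def posterior_bounds(prior, d):
    """Bounds on an updated probability when the Chan-Darwiche distance is d."""
    lower = prior * math.exp(-d) / (1 + (math.exp(-d) - 1) * prior)
    upper = prior * math.exp(d) / (1 + (math.exp(d) - 1) * prior)
    return lower, upper

prior_burglary = 1e-4     # assumed value, chosen to match the quoted bounds
d = math.log(4)           # distance for a likelihood ratio of 4

lower, upper = posterior_bounds(prior_burglary, d)
print(f"{lower:.3e} <= Q_B(1) <= {upper:.3e}")
```

The value $Q_B(1) = 3.85 \times 10^{-4}$ reported in the example lies inside this interval, close to the upper end.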

Notes The article [37] discusses probability updates when the information received does not fit into the framework of the standard definition. The Chan-Darwiche distance measure is proposed in [20]. The article [22] by Chan and Darwiche discusses the application of Jeffrey's update rule and Pearl's method to virtual evidence. These two articles provide the basis for the chapter.
6.6 Exercises
1. Jeffrey's Rule In a certain country, people use only two car models, Volvo and Saab, which come in two colours, red and blue. The sales statistics suggest $P(\text{Volvo}) = P(\text{Saab}) = 1/2$. Furthermore, $P(\text{red} \mid \text{Volvo}) = 0.7$ and $P(\text{red} \mid \text{Saab}) = 0.2$. You are on holiday in this region and you are standing outside a large underground garage, which you may not enter. The attendant of the garage communicates his impression that 40% of the cars in the garage are red. What is the probability that the first car leaving the garage is a Volvo?

2. Pearl's Method The two parts of this question are virtually identical.

(a) Let $A$ denote an event that gives uncertain information (or virtual / soft evidence) about the partition (that is, a collection of mutually exclusive and exhaustive events) $\{G_j\}_{j=1}^n$. Suppose that $A$ satisfies

$$P(A \mid G_j, B) = P(A \mid G_j), \qquad j = 1, 2, \ldots, n$$

for every event $B$. This is an assumption of conditional independence; the event $A$ is independent of all other events given the partition $\{G_j\}$. Set $\lambda_j = P(A \mid G_j)$ and show that for any event $B$,

$$P(B \mid A) = \frac{\sum_{j=1}^n \lambda_j P(B \cap G_j)}{\sum_{j=1}^n \lambda_j P(G_j)}.$$

Check that $P(\cdot \mid A)$ satisfies the definition of the Pearl update (Definition 6.2).

(b) Let $P$ denote a probability distribution before evidence is obtained and suppose that a piece of evidence $\Xi$ gives uncertain information about the partition (that is, the collection of mutually exclusive and exhaustive events) $\{G_j\}_{j=1}^n$. Suppose that $\Xi$ is not in the original event space and that for any event $A$ in the original event space, $\Xi \perp A \mid G_j$ for each $j = 1, \ldots, n$. Suppose that this evidence is specified by the posterior probabilities

$$P^*(G_j) = P(G_j \mid \Xi) = q_j, \qquad j = 1, 2, \ldots, n.$$

Let

$$\rho_j = \frac{P(\Xi \mid G_j)}{P(\Xi \mid G_1)}, \qquad j = 1, 2, \ldots, n$$

and

$$\lambda_j = \frac{q_j}{P(G_j)}, \qquad j = 1, 2, \ldots, n.$$

For any event $C$, compute the probability $P(C \mid \Xi)$ obtained by Pearl's method of virtual evidence and show that this gives the same result as Jeffrey's rule of update.


3. Let $X_1, X_2, X_3$ be three binary random variables, each taking values in $\{0, 1\}$, such that

$$P_{X_1, X_2, X_3}(x_1, x_2, x_3) = \frac{1}{8},$$

for $(x_1, x_2, x_3) \in \{0, 1\}^3$.

Now let $V$ be an additional binary random variable and let $E = \{V = 1\}$. Here $V$ stands for virtual information. Suppose that the conditional probability function of $V$ given $X_3$ satisfies

$$P_{V \mid X_3}(1 \mid 1) = \lambda P_{V \mid X_3}(1 \mid 0).$$

Let $G_1$ and $G_2$ be the two events

$$G_1 = \{(x_1, x_2, x_3) \in \{0, 1\}^3 \mid x_3 = 0\}$$

and

$$G_2 = \{(x_1, x_2, x_3) \in \{0, 1\}^3 \mid x_3 = 1\}.$$

The events $G_1$ and $G_2$ are mutually exclusive and exhaustive. Use Pearl's method of virtual evidence to obtain the updated probability distribution

$$\tilde{P}_{X_1, X_2, X_3}(x_1, x_2, x_3) = P_{X_1, X_2, X_3 \mid V}(x_1, x_2, x_3 \mid 1), \qquad (x_1, x_2, x_3) \in \{0, 1\}^3.$$

4. Let $\mathcal{G} = (V, E)$ be a Directed Acyclic Graph, where $V = (X_1, \ldots, X_d)$, and let $P$ and $Q$ be two probability distributions factorised along $\mathcal{G}$. Let

$$\theta_{jil} = P_{X_j \mid Pa_j}(x_j^{(i)} \mid \pi_j^{(l)}).$$

Suppose that the conditional probabilities for $P$ and $Q$ are the same except for one single $(j, l)$ variable / parent configuration, where $P_{X_j \mid Pa_j}(\cdot \mid \pi_j^{(l)})$ is given by $\theta_{j.l}$ and $Q_{X_j \mid Pa_j}(\cdot \mid \pi_j^{(l)})$ is given by $\tilde{\theta}_{j.l}$. Let $D_{KL}$ denote the Kullback-Leibler distance. Show that

$$D_{KL}(P \| Q) = P(\{Pa_j = \pi_j^{(l)}\}) \, d_{KL}(\theta_{j.l}, \tilde{\theta}_{j.l}).$$

5. Let $D_{CD}$ denote the Chan-Darwiche distance. Prove the remaining two statements of Theorem 6.13; that for any $P$ and $Q$,

$$D_{CD}(P, Q) \ge 0, \qquad D_{CD}(P, Q) = 0 \Rightarrow P = Q$$

and

$$D_{CD}(P, Q) = D_{CD}(Q, P).$$

6. Let $P$ be a probability distribution over a countable state space $\mathcal{X}$ and let $G_1, \ldots, G_n$ be a collection of mutually exclusive and exhaustive events. Let $\lambda_j = P(G_j)$ for $j = 1, \ldots, n$. Let $Q$ denote the probability distribution such that $Q(G_j) = \mu_j$ for $j = 1, \ldots, n$ and such that for any other event $A$,

$$Q(A) = \sum_{j=1}^n \mu_j P(A \mid G_j).$$

In other words, $Q$ is the Jeffrey's update of $P$, defined by $Q(G_j) = \mu_j$, $j = 1, \ldots, n$. Prove that

$$D_{CD}(P, Q) = \ln \max_j \frac{\lambda_j}{\mu_j} - \ln \min_j \frac{\lambda_j}{\mu_j},$$

where $D_{CD}$ denotes the Chan-Darwiche distance.

7. (a) Find a calibration of the Chan-Darwiche distance in terms of the distance between two Bernoulli trials. That is, let $P = (p_0, p_1)$ and $Q = (q_0, q_1)$. Find the number $cd(k)$ such that if $q_0 = 1 - cd(k)$ and $q_1 = cd(k)$ and $p_0 = p_1 = \frac{1}{2}$, then

$$D_{CD}(P, Q) = k.$$

You should obtain

$$cd(k) = \frac{e^{\pm k}}{1 + e^{\pm k}}.$$

(b) Find a calibration of the Kullback-Leibler distance; that is, the number $KL(k)$ such that if $q_0 = 1 - KL(k)$, $q_1 = KL(k)$ and $p_0 = p_1 = \frac{1}{2}$, then $D_{KL}(P \| Q) = k$. You should obtain

$$KL(k) = \frac{1}{2} \pm \frac{1}{2}\sqrt{1 - e^{-2k}}.$$

8. Jensen's inequality Let $\phi(x)$ be a convex function and $X$ a discrete real-valued random variable, defined on a finite space $\mathcal{X}$. Prove, by induction, that

$$E[\phi(X)] \ge \phi(E[X]).$$

Hence prove that $D_{KL}(P \| Q) \ge 0$ with equality if and only if $P = Q$.

9. The Chan-Darwiche Distance between Two Multivariate Bernoulli Distributions Consider $d$ independent Bernoulli trials, $X = (X_1, \ldots, X_d)$, where the 'success' probabilities for each trial may differ. The distribution of the random vector $X$ is known as a multivariate Bernoulli distribution. This example considers the distance between two multivariate Bernoulli distributions where the 'success' probabilities for the two distributions are given by the vectors $p = (p_1, \ldots, p_d)$ and $q = (q_1, \ldots, q_d)$ respectively.
Let $\mathcal{X}$ be the binary hypercube; that is, $\mathcal{X} = \{0, 1\}^d$ and let $x \in \mathcal{X}$ denote an element in $\mathcal{X}$. Then $x = (x_i)_{i=1}^d$, where $x_i \in \{0, 1\}$. Let $Q$ and $P$ be two multivariate Bernoulli probability functions over $\mathcal{X}$. That is, $Q : \mathcal{X} \to [0, 1]$ and $P : \mathcal{X} \to [0, 1]$ are defined such that for each $x \in \mathcal{X}$,

$$Q(x) = \prod_{i=1}^d q_i^{x_i} (1 - q_i)^{1 - x_i}$$

and

$$P(x) = \prod_{i=1}^d p_i^{x_i} (1 - p_i)^{1 - x_i}$$

where, for this example, it is assumed that $0 < q_i < 1$ and $0 < p_i < 1$ (i.e. the inequalities are strict) for all $i \in \{1, \ldots, d\}$.

(a) Show that

$$D_{CD}(P, Q) = \sum_{i=1}^d \ln \max\left(\frac{O_{q,i}}{O_{p,i}}, \frac{O_{p,i}}{O_{q,i}}\right), \qquad (6.8)$$

where $O_{p,i} = \frac{p_i}{1 - p_i}$ and $O_{q,i} = \frac{q_i}{1 - q_i}$.

(b) Let $q_i = q$ and $p_i = p$ for all $i$, and $0 < q < 1$ and $0 < p < 1$. Show that, in this case,

$$D_{CD}(P, Q) = d \ln \max\left(\frac{O_Q}{O_P}, \frac{O_P}{O_Q}\right). \qquad (6.9)$$

10. A piece of cloth is to be sold on the market. The colour $C$ is either green ($c_g$), blue ($c_b$) or violet ($c_v$). Tomorrow, the piece of cloth will either be sold ($s$) or not ($s^c$); this is denoted by the variable $S$. Experience gives the following probability distribution $P_{C,S}$ over $C, S$:

    S / C    c_g     c_b     c_v
    s        0.12    0.12    0.32
    s^c      0.18    0.18    0.08

The marginal distribution $P_C$ over $C$ is

    c_g     c_b     c_v
    0.3     0.3     0.4

The piece of cloth is inspected by candle light. Since it cannot be seen perfectly, this only gives soft evidence. From the inspection by candle light, the probability $Q_C$ over $C$ is assessed as:

    c_g     c_b     c_v
    0.7     0.25    0.05

The Jeffrey's update gives $Q_{S,C} = Q_C P_{S \mid C} = \frac{Q_C}{P_C} P_{S,C}$, which is

    S / C    c_g     c_b     c_v
    s        0.28    0.10    0.04
    s^c      0.42    0.15    0.01

(a) Compute $D_{CD}(P, Q)$, the Chan-Darwiche distance between the original and updated distributions.
(b) Compute the bounds on the odds ratios given by Corollary 6.18 in this example. Compare with $\frac{O_Q(c_g \mid s)}{O_P(c_g \mid s)}$.
(c) Suppose that $Q^*_C = (0.25, 0.25, 0.50)$. Compute $D_{CD}(P, Q^*)$ and the bounds on the odds ratios given by Corollary 6.18. Again, compare with $\frac{O_{Q^*}(c_g \mid s)}{O_P(c_g \mid s)}$. The distribution $Q^*$ is closer to $P$ than $Q$ and hence the bounds are tighter.
(d) Now consider the following problem: the probability that the piece of cloth is green, given that it is sold tomorrow is, before updating, 0.214. What evidence would satisfy the constraint that the updated probability that the cloth is green, given that it is sold tomorrow, does not exceed 0.3?
6.7 Answers
1. Let $A$ denote car type and $C$ colour. Events to be updated: $P^*_C(\text{red}) = 0.4$, $P^*_C(\text{blue}) = 0.6$. Original joint probability function $P_{A,C}$:

    car / colour    R       B
    V               0.35    0.15
    S               0.1     0.4

so $P_C(\text{red}) = 0.45$, $P_C(\text{blue}) = 0.55$ and the conditional probability function $P_{A \mid C}$ is

    car / colour    R       B
    V               7/9     3/11
    S               2/9     8/11

Jeffrey's rule gives $P^*_{A,C} = P_{A \mid C} P^*_C$:

    car / colour    R       B
    V               14/45   9/55
    S               4/45    24/55

so that

$$P^*(\text{Volvo}) = \frac{47}{99}.$$
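The arithmetic of this answer can be reproduced exactly with rational numbers. The following is a minimal sketch using the joint table and the attendant's colour assessment from the exercise; the variable names are illustrative.

```python
from fractions import Fraction as F

# Original joint distribution over (car, colour), from the exercise.
joint = {
    ("V", "R"): F(35, 100), ("V", "B"): F(15, 100),
    ("S", "R"): F(10, 100), ("S", "B"): F(40, 100),
}
# Soft evidence: updated marginal over colour.
new_colour = {"R": F(4, 10), "B": F(6, 10)}

# Original marginal over colour: P_C(red) = 9/20, P_C(blue) = 11/20.
colour_marginal = {
    c: sum(pr for (_, col), pr in joint.items() if col == c) for c in new_colour
}

# Jeffrey's rule: P*(a, c) = P(a | c) P*(c) = P(a, c) P*(c) / P(c).
updated = {
    (car, c): pr * new_colour[c] / colour_marginal[c]
    for (car, c), pr in joint.items()
}

p_volvo = sum(pr for (car, _), pr in updated.items() if car == "V")
print(p_volvo)  # 47/99
```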
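2. (a) $A \perp B \mid G_j$ for each $G_j$, $j = 1, \ldots, n$, so $P(A \mid G_j, B) = P(A \mid G_j)$. It follows, using $P(B \mid G_j) P(G_j) = P(B \cap G_j)$ and $\lambda_j = P(A \mid G_j)$, that

$$P(B \mid A) = \sum_j P(B \mid A, G_j) P(G_j \mid A) = \sum_j P(B \mid G_j) \frac{P(A \mid G_j) P(G_j)}{P(A)} = \frac{\sum_j \lambda_j P(B \cap G_j)}{\sum_j \lambda_j P(G_j)}.$$

For an outcome $x$,

$$P(x \mid A) = P(x) \frac{\lambda_j}{\sum_k \lambda_k P(G_k)} = P(x) \frac{\rho_j}{\sum_k \rho_k P(G_k)}, \qquad x \in G_j, \quad j = 1, \ldots, n,$$

where $\rho_k = \frac{\lambda_k}{\lambda_1} = \frac{P(A \mid G_k)}{P(A \mid G_1)}$, which is the definition of the Pearl update.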
(b) Jeffrey's rule is valid for a piece of information $\Xi$ that alters the probabilities on the partition events $G_1, \ldots, G_n$ and such that $P(\Xi \mid G_j, B) = P(\Xi \mid G_j)$ for any event $B$. Let $P^*(C) = P(C \mid \Xi)$, the updated probability for an event $C$. Then the update under Jeffrey's rule is, for any outcome $x$,

$$P^*(x) = \sum_{j=1}^n P(x \mid G_j) P^*(G_j) = \sum_{j=1}^n q_j P(x \mid G_j) = q_k P(x \mid G_k), \qquad x \in G_k.$$

Pearl's method for updating given a piece of information $\Xi$ and a partition $G_1, \ldots, G_n$ is to set

$$\rho_j = \frac{P(\Xi \mid G_j)}{P(\Xi \mid G_1)}, \qquad j = 1, \ldots, n.$$

The Pearl update is defined by

$$P^*(x) = P(x \mid \Xi) = P(x) \frac{\rho_j}{\sum_{k=1}^n \rho_k P(G_k)}, \qquad x \in G_j.$$

Using $P(\Xi \mid G) = \frac{P(G \mid \Xi) P(\Xi)}{P(G)}$,

$$\rho_j = \frac{P(G_j \mid \Xi)}{P(G_j)} \cdot \frac{P(G_1)}{P(G_1 \mid \Xi)} = \frac{\lambda_j}{\lambda_1}$$

so that

$$P^*(x) = P(x) \frac{\lambda_j}{\sum_{k=1}^n \lambda_k P(G_k)}, \qquad x \in G_j,$$

which gives an expression for Pearl's update in terms of the $\lambda_j = \frac{q_j}{P(G_j)} = \frac{P^*(G_j)}{P(G_j)} = \frac{P(G_j \mid \Xi)}{P(G_j)}$.
To show that it is the same as Jeffrey's rule, for $x \in G_j$,

$$P^*(x) = P(x) \frac{P(\Xi \mid G_j)}{P(\Xi)} = \frac{P(x \cap G_j)}{P(G_j)} \cdot \frac{P(\Xi \mid G_j) P(G_j)}{P(\Xi)} = P(x \mid G_j) P(G_j \mid \Xi) = q_j P(x \mid G_j),$$

which is the Jeffrey's update.

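3. In this example, $X_1, X_2, X_3$ are mutually independent; $P_{X_1, X_2, X_3} = P_{X_1} P_{X_2} P_{X_3}$, with $P_{X_j}(1) = P_{X_j}(0) = \frac{1}{2}$ for $j = 1, 2, 3$. Virtual evidence on $X_3$ is treated as a node with a single parent $X_3$, so (using $P_{X_3}(1) = P_{X_3}(0) = \frac{1}{2}$),

$$\tilde{P}_{X_3}(1) = P_{X_3}(1) \frac{P_{V \mid X_3}(1 \mid 1)}{P_{V \mid X_3}(1 \mid 1) P_{X_3}(1) + P_{V \mid X_3}(1 \mid 0) P_{X_3}(0)} = P_{X_3}(1) \frac{2\lambda}{\lambda + 1} = \frac{\lambda}{\lambda + 1}$$

so

$$\tilde{P}_{X_1, X_2, X_3}(x_1, x_2, 1) = P_{X_1}(x_1) P_{X_2}(x_2) \tilde{P}_{X_3}(1) = \frac{\lambda}{4(\lambda + 1)},$$

$$\tilde{P}_{X_1, X_2, X_3}(x_1, x_2, 0) = P_{X_1}(x_1) P_{X_2}(x_2) \tilde{P}_{X_3}(0) = \frac{1}{4(\lambda + 1)}$$

for each value of $(x_1, x_2) \in \{(0, 0), (0, 1), (1, 0), (1, 1)\}$. The eight probabilities sum to $\frac{4\lambda + 4}{4(\lambda + 1)} = 1$, as they must.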

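4. Assume that the variables are ordered so that $Pa_j \subseteq \{X_1, \ldots, X_{j-1}\}$. Then

$$D_{KL}(P \| Q) = \sum_x P(x) \ln \frac{P(x)}{Q(x)}$$

$$= \sum_{x = (x_1^{(i_1)}, \ldots, x_d^{(i_d)})} \prod_{k=1}^d P_{X_k \mid Pa_k}(x_k^{(i_k)} \mid \pi_k(x)) \ln \frac{\prod_{k=1}^d P_{X_k \mid Pa_k}(x_k^{(i_k)} \mid \pi_k(x))}{\prod_{k=1}^d Q_{X_k \mid Pa_k}(x_k^{(i_k)} \mid \pi_k(x))}$$

$$= \sum_{x \mid \pi_j(x) = \pi_j^{(l)}} \prod_{k=1}^d P_{X_k \mid Pa_k}(x_k^{(i_k)} \mid \pi_k(x)) \ln \frac{\theta_{j i_j l}}{\tilde{\theta}_{j i_j l}}$$

$$= \sum_{i_j} \theta_{j i_j l} \ln \frac{\theta_{j i_j l}}{\tilde{\theta}_{j i_j l}} \sum_{(x_1^{(i_1)}, \ldots, x_{j-1}^{(i_{j-1})}) \mid \pi_j(x) = \pi_j^{(l)}} \prod_{k=1}^{j-1} P_{X_k \mid Pa_k}(x_k^{(i_k)} \mid \pi_k(x))$$

$$= d_{KL}(\theta_{j.l}, \tilde{\theta}_{j.l}) \, P_{Pa_j}(\pi_j^{(l)}).$$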

5. By definition,

$$D_{CD}(P, Q) = \max_x \ln \frac{P(x)}{Q(x)} - \min_x \ln \frac{P(x)}{Q(x)}.$$

Clearly, for any function $f$, $\max_x f(x) \ge \min_x f(x)$ so the distance is non-negative. If $D_{CD}(P, Q) = 0$, it follows that $\frac{P(x)}{Q(x)} = \alpha$, a constant, for all $x \in \mathcal{X}$. It follows that $P(x) = \alpha Q(x)$ so that

$$1 = \sum_x P(x) = \alpha \sum_x Q(x) = \alpha$$

and hence $\alpha = 1$, so that $P(x) = Q(x)$ for all $x \in \mathcal{X}$.

For the second point, if $P(x) > 0$ for a point where $Q(x) = 0$, or $P(x) = 0$ for a point where $Q(x) > 0$, then $D_{CD}(P, Q) = D_{CD}(Q, P) = +\infty$.
Now consider $P(x) > 0 \Leftrightarrow Q(x) > 0$. For a strictly positive function $f$, $\max_x f(x) = \frac{1}{\min_x (1/f(x))}$, since the point where the maximum of $f(x)$ is attained is the point where the minimum of $1/f(x)$ is attained. It follows that

$$D_{CD}(P, Q) = \max_x \ln \frac{P(x)}{Q(x)} - \min_x \ln \frac{P(x)}{Q(x)} = \ln \frac{1}{\min_x \frac{Q(x)}{P(x)}} - \ln \frac{1}{\max_x \frac{Q(x)}{P(x)}} = -\min_x \ln \frac{Q(x)}{P(x)} + \max_x \ln \frac{Q(x)}{P(x)} = D_{CD}(Q, P).$$

6. Take any point $x \in \mathcal{X}$; then $x \in G_j$ for exactly one $j$. It follows that $Q(x) = \mu_j P(x \mid G_j) = \mu_j \frac{P(x)}{\lambda_j}$ for $j$ such that $x \in G_j$. Therefore

$$D_{CD}(P, Q) = \max_x \ln \frac{P(x)}{Q(x)} - \min_x \ln \frac{P(x)}{Q(x)} = \max_j \ln \frac{\lambda_j}{\mu_j} - \min_j \ln \frac{\lambda_j}{\mu_j}$$

as required.

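7. (a) Let $P = (p_0, p_1)$ be a Bernoulli trial with success probability $p_1 = \frac{1}{2}$ and $Q$ a Bernoulli trial with success probability $q_1 = \theta > \frac{1}{2}$. Then

$$D_{CD}(Q, P) = \ln 2\theta - \ln 2(1 - \theta) = \ln \frac{\theta}{1 - \theta}.$$

Hence, let $\theta(k)$ denote the value of $\theta$ such that $D_{CD}(Q, P) = k$; then

$$e^k = \frac{\theta}{1 - \theta}$$

giving

$$\theta(k) = \frac{e^k}{1 + e^k}.$$

Considering $0 \le \theta \le \frac{1}{2}$ gives, for a Chan-Darwiche distance $k$,

$$\theta(k) = \frac{e^{-k}}{1 + e^{-k}}.$$

(b) Here

$$D_{KL}(P \| Q) = \frac{1}{2} \ln \frac{1}{2\theta(k)} + \frac{1}{2} \ln \frac{1}{2(1 - \theta(k))} = k,$$

so that

$$\theta(1 - \theta) = \frac{1}{4} e^{-2k}, \qquad \left(\theta - \frac{1}{2}\right)^2 = \frac{1}{4}\left(1 - e^{-2k}\right), \qquad \theta(k) = \frac{1}{2} \pm \frac{1}{2}\sqrt{1 - e^{-2k}}.$$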
8. Definition of convexity for a function $\phi$: for any $\lambda \in [0, 1]$ and any $(x, y)$,

$$\phi(\lambda x + (1 - \lambda) y) \le \lambda \phi(x) + (1 - \lambda) \phi(y).$$

Proof of result by induction: if $\mathcal{X} = \{x_1, x_2\}$, set $p_1 = \lambda$, $p_2 = 1 - \lambda$; then $\mu = E[X] = p_1 x_1 + p_2 x_2$, so the definition of convexity gives

$$\phi(\mu) \le p_1 \phi(x_1) + p_2 \phi(x_2) = E[\phi(X)],$$

with equality if and only if $\phi(x) = a + bx$ for $x \in \{x_1, x_2, \mu\}$.

Assume the result is true for any probability distribution over $\{x_1, \ldots, x_n\}$. Consider a probability distribution $(p_1, \ldots, p_{n+1})$ over $(x_1, \ldots, x_{n+1})$. Then

$$\phi(\mu) = \phi\left(\sum_{j=1}^{n+1} p_j x_j\right) \le p_{n+1} \phi(x_{n+1}) + (1 - p_{n+1}) \phi\left(\sum_{j=1}^{n} \frac{p_j}{1 - p_{n+1}} x_j\right)$$

and, by the inductive hypothesis (since $\sum_{j=1}^n \frac{p_j}{1 - p_{n+1}} = 1$),

$$\phi\left(\sum_{j=1}^{n} \frac{p_j}{1 - p_{n+1}} x_j\right) \le \sum_{j=1}^{n} \frac{p_j}{1 - p_{n+1}} \phi(x_j)$$

so that

$$\phi(\mu) \le \sum_{j=1}^{n+1} \phi(x_j) p_j$$

with equality if and only if $\phi(x) = ax + b$ for $x \in \{\mu, x_1, \ldots, x_n\}$ as required.

It follows that

$$D_{KL}(P \| Q) = -\sum_j p_j \ln \frac{q_j}{p_j} \ge -\ln \sum_j p_j \frac{q_j}{p_j} = -\ln \sum_j q_j = -\ln 1 = 0$$

with equality if and only if $p = q$.

9. (a) The likelihood ratio between $Q$ and $P$ is well defined and is given by

$$LR(x) = \frac{Q(x)}{P(x)} = \prod_{i=1}^d \left(\frac{q_i}{p_i}\right)^{x_i} \left(\frac{1 - q_i}{1 - p_i}\right)^{1 - x_i}.$$

For each $i \in \{1, \ldots, d\}$, let $m_i$ be defined as

$$m_i = \begin{cases} 1 & \text{if } \frac{q_i}{p_i} \ge \frac{1 - q_i}{1 - p_i} \\ 0 & \text{otherwise.} \end{cases} \qquad (6.10)$$

Then $m = (m_i)_{i=1}^d \in \mathcal{X}$ and, by construction, it follows that for all $x \in \mathcal{X}$,

$$LR(x) \le LR(m) = \prod_{i=1}^d \max\left(\frac{q_i}{p_i}, \frac{1 - q_i}{1 - p_i}\right). \qquad (6.11)$$

Next let $\bar{m}$ be the binary complement of $m$ defined by Equation (6.10). That is, for each $i \in \{1, \ldots, d\}$, $\bar{m}_i = 1 - m_i$, giving $\bar{m}_i = 0$ if $m_i = 1$ and $\bar{m}_i = 1$ if $m_i = 0$. Then it holds that

$$LR(x) \ge LR(\bar{m}) = \prod_{i=1}^d \min\left(\frac{q_i}{p_i}, \frac{1 - q_i}{1 - p_i}\right). \qquad (6.12)$$

It now follows from the definition of the Chan-Darwiche distance measure (Definition 6.11) that

$$D_{CD}(P, Q) = \ln LR(m) - \ln LR(\bar{m}) = \sum_{i=1}^d \ln \frac{\max\left(\frac{q_i}{p_i}, \frac{1 - q_i}{1 - p_i}\right)}{\min\left(\frac{q_i}{p_i}, \frac{1 - q_i}{1 - p_i}\right)}.$$

For $i$ such that $m_i = 1$ it clearly holds that

$$\frac{\max\left(\frac{q_i}{p_i}, \frac{1 - q_i}{1 - p_i}\right)}{\min\left(\frac{q_i}{p_i}, \frac{1 - q_i}{1 - p_i}\right)} = \frac{q_i / p_i}{(1 - q_i)/(1 - p_i)} = \frac{O_{q,i}}{O_{p,i}},$$

where $O$ denotes the odds;

$$O_{q,i} = \frac{q_i}{1 - q_i}, \qquad O_{p,i} = \frac{p_i}{1 - p_i}.$$

Similarly for $i$ such that $m_i = 0$ it holds that

$$\frac{\max\left(\frac{q_i}{p_i}, \frac{1 - q_i}{1 - p_i}\right)}{\min\left(\frac{q_i}{p_i}, \frac{1 - q_i}{1 - p_i}\right)} = \frac{O_{p,i}}{O_{q,i}},$$

from which the result follows.

(b) Let $q_i = q$ and $p_i = p$ for all $i$, and $0 < q < 1$ and $0 < p < 1$. Then

$$Q(x) = q^k (1 - q)^{d - k}, \qquad P(x) = p^k (1 - p)^{d - k},$$

where $k$ is the number of digital ones in $x$. It follows that

$$D_{CD}(P, Q) = d \ln \max\left(\frac{O_Q}{O_P}, \frac{O_P}{O_Q}\right).$$

If, say, $\frac{O_Q}{O_P} > \frac{O_P}{O_Q}$, then

$$D_{CD}(P, Q) = d \left(\ln O_Q - \ln O_P\right).$$
10. (a) Theorem 6.17 gives

$$D_{CD}(P, Q) = \ln \max_i \frac{\lambda_i}{\mu_i} - \ln \min_i \frac{\lambda_i}{\mu_i} = \ln \frac{0.7}{0.3} - \ln \frac{0.05}{0.4} = 2.93.$$

(b) Corollary 6.18 gives

$$0.05 \le \frac{O_Q(c_g \mid s)}{O_P(c_g \mid s)} \le 18.73.$$

This suggests that the distributions have changed dramatically. Note that $P_{C \mid S}(c_g \mid s) = \frac{0.12}{0.56} = 0.214$, while $Q_{C \mid S}(c_g \mid s) = \frac{0.28}{0.42} = 0.667$, so that

$$\frac{O_Q(c_g \mid s)}{O_P(c_g \mid s)} = \frac{0.667/0.333}{0.214/0.786} = 7.34.$$
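Parts (a) and (b) can be recomputed from the tables in the exercise. The following is a minimal sketch; the dictionary layout and the helper `odds_green_given_sold` are illustrative choices. Working without intermediate rounding gives $d \approx 2.927$ and an odds ratio of about $7.33$, which agree with the figures quoted in the answer up to rounding.

```python
import math

# Joint distribution P_{C,S} from the exercise, keyed by (sale status, colour).
P = {
    ("s", "g"): 0.12, ("s", "b"): 0.12, ("s", "v"): 0.32,
    ("sc", "g"): 0.18, ("sc", "b"): 0.18, ("sc", "v"): 0.08,
}
mu = {"g": 0.3, "b": 0.3, "v": 0.4}      # original marginal over colour
lam = {"g": 0.7, "b": 0.25, "v": 0.05}   # colour assessment by candle light

# Jeffrey's update: Q(s, c) = P(s, c) * lambda_c / mu_c.
Q = {(s, c): pr * lam[c] / mu[c] for (s, c), pr in P.items()}

# Chan-Darwiche distance between P and Q.
ratios = [Q[key] / P[key] for key in P]
d = math.log(max(ratios)) - math.log(min(ratios))

def odds_green_given_sold(dist):
    """Odds of 'green' against 'not green', conditional on the cloth being sold."""
    sold = {c: dist[("s", c)] for c in mu}
    return sold["g"] / (sum(sold.values()) - sold["g"])

odds_ratio = odds_green_given_sold(Q) / odds_green_given_sold(P)
print(d, math.exp(-d), math.exp(d), odds_ratio)
```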
(c) If the new distribution over colour is $Q^*_C = (0.25, 0.25, 0.50)$, then $D_{CD}(P, Q^*) = 0.406$ and

$$0.513 \le \frac{O_{Q^*}(c_g \mid s)}{O_P(c_g \mid s)} \le 1.946.$$

The evidence is weaker and the bounds are therefore tighter. In this case,

$$\frac{O_{Q^*}(c_g \mid s)}{O_P(c_g \mid s)} = \frac{Q^*(c_g \mid s)}{Q^*(c_g \mid s^c)} \cdot \frac{P(c_g \mid s^c)}{P(c_g \mid s)} = \frac{Q^*(c_g, s)}{Q^*(s)} \cdot \frac{Q^*(s^c)}{Q^*(c_g, s^c)} \cdot \frac{P(c_g, s^c)}{P(s^c)} \cdot \frac{P(s)}{P(c_g, s)}$$

$$= \frac{Q^*(c_g) P(s \mid c_g) \left(Q^*(c_g) P(s^c \mid c_g) + Q^*(c_v) P(s^c \mid c_v) + Q^*(c_b) P(s^c \mid c_b)\right)}{\left(Q^*(c_g) P(s \mid c_g) + Q^*(c_v) P(s \mid c_v) + Q^*(c_b) P(s \mid c_b)\right) Q^*(c_g) P(s^c \mid c_g)} \cdot \frac{P(c_g, s^c)}{P(s^c)} \cdot \frac{P(s)}{P(c_g, s)} = 1.756.$$

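(d) Inequality (6.7) gives

$$\frac{0.214 e^{-d}}{1 + (e^{-d} - 1) \times 0.214} \le Q(c_g \mid s) \le \frac{0.214 e^{d}}{1 + (e^{d} - 1) \times 0.214}.$$

The constraint $Q_{C \mid S}(c_g \mid s) \le 0.3$ is satisfied if

$$\frac{0.214 e^{d}}{1 + (e^{d} - 1) \times 0.214} \le 0.3$$

giving $d \le 0.454$. The current distribution over colour is $(\mu_g, \mu_b, \mu_v) = (0.3, 0.3, 0.4)$. The problem now reduces to finding $(\lambda_g, \lambda_b, \lambda_v)$ such that $Q_{C \mid S}(c_g \mid s) = 0.3$ and

$$\ln \max\left(\frac{\lambda_g}{0.3}, \frac{\lambda_b}{0.3}, \frac{\lambda_v}{0.4}\right) - \ln \min\left(\frac{\lambda_g}{0.3}, \frac{\lambda_b}{0.3}, \frac{\lambda_v}{0.4}\right) = 0.454.$$

Since

$$Q_{C,S}(c_j, s) = \frac{\lambda_j}{\mu_j} P_{C,S}(c_j, s), \qquad j = g, b, v,$$

and $P_{S \mid C}(s \mid c_g) = P_{S \mid C}(s \mid c_b) = 0.4$, $P_{S \mid C}(s \mid c_v) = 0.8$, it follows that

$$Q_{C \mid S}(c_g \mid s) = \frac{0.4 \lambda_g}{0.4 \lambda_g + 0.4 \lambda_b + 0.8 \lambda_v}.$$

With $Q_{C \mid S}(c_g \mid s) = 0.3$,

$$0.28 \lambda_g - 0.12 \lambda_b - 0.24 \lambda_v = 0,$$

with constraint $\lambda_g + \lambda_b + \lambda_v = 1$, so that $10 \lambda_g - 3 \lambda_v = 3$.

If the maximum and minimum are then given by $\lambda_g$ and $\lambda_v$ respectively, then

$$0.454 = \ln \frac{\lambda_g}{0.3} - \ln \frac{10 \lambda_g - 3}{1.2}$$

giving

$$\lambda_g = \frac{3 e^{0.454}}{10 e^{0.454} - 4} = 0.402, \qquad \lambda_v = 0.34, \qquad \lambda_b = 0.258.$$

Finally, to check that the solution is valid, $\frac{0.4}{\lambda_v} = \frac{0.4}{0.34} = 1.176 > 1.163 = \frac{0.3}{0.258} = \frac{0.3}{\lambda_b}$, so that $\frac{\lambda_v}{0.4}$ gives the minimum. Similarly, $\frac{\lambda_g}{0.3}$ clearly gives the maximum in the first term.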
Chapter 7

Marginalisation, Triangulated Graphs and Junction Trees

7.1 Functions and Domains


Notation Let $V = \{X_1, \ldots, X_d\}$ denote the set of random variables, where variable $X_j$ has state space $\mathcal{X}_j = (x_j^{(1)}, \ldots, x_j^{(k_j)})$ for $j = 1, \ldots, d$. Let $\mathcal{X} = \times_{j=1}^d \mathcal{X}_j$ denote the state space for the random vector $X = (X_1, \ldots, X_d)$. Let $\tilde{V} = \{1, \ldots, d\}$ denote the indexing set for the variables. For $D \subset \tilde{V}$, where $D = \{j_1, \ldots, j_m\}$, let $\mathcal{X}_D = \times_{j \in D} \mathcal{X}_j$ and let $X_D = (X_{j_1}, \ldots, X_{j_m})$. Let $x \in \mathcal{X}$ denote a generic element of $\mathcal{X}$ and let $x_D = (x_{j_1}, \ldots, x_{j_m}) \in \mathcal{X}_D$, when $x = (x_1, \ldots, x_d) \in \mathcal{X}$. The notation

$$x_D = \times_{v \in D} x_v$$

is used to denote a configuration (or a collection of outcomes) on the nodes in $D$. Furthermore, for any set $W \subset V$, let $\tilde{W}$ denote the indexing set for $W$. The notation $\mathcal{X}_W$ will also be used to denote $\mathcal{X}_{\tilde{W}}$, $X_W$ to denote $X_{\tilde{W}}$ and $x_W$ to denote $x_{\tilde{W}}$. Suppose $D \subseteq W \subseteq \tilde{V}$ and that $x_W \in \mathcal{X}_W$. That is, $x_W = \times_{v \in W} x_v$. Then, ordering the variables of $W$ so that $\mathcal{X}_W = \mathcal{X}_D \times \mathcal{X}_{W/D}$, the projection of $x_W$ onto $D$ is defined as the variable $x_D$ that satisfies

$$x_W = (x_D, x_{W/D}),$$

where the meaning of the notation '$(\cdot, \cdot)$' is clear from the context. Here $A/B$ denotes the set difference; i.e. the elements in the set $A$ not included in $B$.

Definition 7.1 (Function, Domain). Consider a function $\phi : \mathcal{X}_D \to \mathbb{R}_+$. The space $\mathcal{X}_D$ is known as the domain of the function. If the domain is the state space of a random vector $X_D$, then $X_D$ may also be referred to as the domain of the function.

In this setting, a function over a domain $\mathcal{X}_D$ has $\prod_{j \in D} k_j$ entries. For $W \subset V$, the domain $\mathcal{X}_W$ of a function may also be denoted by the collection of random variables $W$.


Addition, Multiplication, Division For functions defined on the same domain, addition, multiplication and division are defined pointwise where, by definition,

$$a(x) = 0, \; b(x) = 0 \implies \frac{a(x)}{b(x)} = 0.$$

Functions over different domains If function $\phi_1$ is defined over domain $\mathcal{X}_{D_1}$ and function $\phi_2$ is defined over domain $\mathcal{X}_{D_2}$, then multiplication and division of functions may be defined by first extending both functions to the domain $\mathcal{X}_{D_1 \cup D_2}$.

Definition 7.2 (Extending the Domain). Let the function $\phi$ be defined on a domain $\mathcal{X}_D$, where $D \subset \tilde{W} \subseteq \tilde{V}$. Then $\phi$, defined over a domain $\mathcal{X}_D$, is extended to the domain $\mathcal{X}_{\tilde{W}}$ in the following way. For each $x_{\tilde{W}} \in \mathcal{X}_{\tilde{W}}$,

$$\phi(x_{\tilde{W}}) = \phi(x_D),$$

where $x_D$ is the projection of $x_{\tilde{W}}$ onto $\mathcal{X}_D$, using the definition of $x_D$ (and hence $x_{\tilde{W}}$) from the beginning of the section. In other words, the extended function depends on $x_{\tilde{W}}$ only through $x_D$.

Addition, Multiplication and Division of Functions over Different Domains Addition, multiplication and division of functions over different domains are defined by first extending the domains of definition using Definition 7.2 so that the functions are defined over the same domain, followed by standard pointwise addition, multiplication or division.

Multiplication of functions may be expressed in the following terms: the product $\phi_1 . \phi_2$ of functions $\phi_1$ and $\phi_2$, defined over domains $\mathcal{X}_{D_1}$ and $\mathcal{X}_{D_2}$, is defined as

$$(\phi_1 . \phi_2)(x_{D_1 \cup D_2}) = \phi_1(x_{D_1 \cup D_2}) \phi_2(x_{D_1 \cup D_2}),$$

where $\phi_1$ and $\phi_2$ have first been extended to $\mathcal{X}_{D_1 \cup D_2}$.

Let $D_\phi$ denote the index set for the domain variables of a function $\phi$. Then for two functions $\phi_1$ and $\phi_2$, $D_{\phi_1 . \phi_2} = D_{\phi_1} \cup D_{\phi_2}$.

Marginalisation The operation of marginalisation is now considered more generally. Let $U \subseteq W \subseteq V$ and let $\phi$ be a function defined over $\mathcal{X}_W$. The expression $\sum_{\mathcal{X}_{W/U}} \phi$ denotes the margin (or the sum margin) of $\phi$ over $\mathcal{X}_U$ and is defined for $x_U \in \mathcal{X}_U$ by

$$\left(\sum_{W/U} \phi\right)(x_U) = \sum_{z \in \mathcal{X}_{W/U}} \phi(z, x_U),$$

where the arguments have been rearranged so that those corresponding to $W/U$ appear first, $z \in \mathcal{X}_{W/U}$ is the projection of $(z, x_U) \in \mathcal{X}_W$ onto $\mathcal{X}_{W/U}$ and $x_U \in \mathcal{X}_U$ the projection of $(z, x_U) \in \mathcal{X}_W$ onto $\mathcal{X}_U$. The following notation is also used for marginalising a function with domain $\mathcal{X}_W$:

$$\phi^{\downarrow U} = \left(\sum_{W/U} \phi\right).$$

The marginalisation operation obeys the following rules:

1. The Commutative Law: for any two sets of variables $U \subset V$ and $W \subset V$,

$$(\phi^{\downarrow U})^{\downarrow W} = (\phi^{\downarrow W})^{\downarrow U}.$$

2. The Distributive Law: If $\mathcal{X}_{D_1}$ is the domain of $\phi_1$ and $D_1 \subseteq \tilde{V}$, then $(\phi_1 \phi_2)^{\downarrow D_1} = \phi_1 (\phi_2)^{\downarrow D_1}$.
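The operations of this section (extension, pointwise multiplication and marginalisation) can be sketched with functions stored as tables. The following is a minimal illustration, not an implementation taken from the text: the class `Factor`, the helpers `multiply` and `marginalise`, the binary state spaces and the table entries are all assumptions chosen for the example. It checks the distributive law on one instance.

```python
from itertools import product

class Factor:
    """A non-negative function over the product of the state spaces of `variables`."""
    def __init__(self, variables, states, table):
        self.variables = tuple(variables)   # ordered variable names (the domain)
        self.states = states                # dict: variable -> tuple of states
        self.table = table                  # dict: tuple of values -> float

    def value(self, assignment):
        # Extension of the domain: only the function's own variables are read.
        return self.table[tuple(assignment[v] for v in self.variables)]

def multiply(f, g):
    """Pointwise product after extending both functions to the union of their domains."""
    variables = f.variables + tuple(v for v in g.variables if v not in f.variables)
    states = {**f.states, **g.states}
    table = {}
    for values in product(*(states[v] for v in variables)):
        assignment = dict(zip(variables, values))
        table[values] = f.value(assignment) * g.value(assignment)
    return Factor(variables, states, table)

def marginalise(f, keep):
    """Sum margin of f over the variables of its domain that are in `keep`."""
    variables = tuple(v for v in f.variables if v in keep)
    table = {}
    for values, x in f.table.items():
        assignment = dict(zip(f.variables, values))
        key = tuple(assignment[v] for v in variables)
        table[key] = table.get(key, 0.0) + x
    return Factor(variables, {v: f.states[v] for v in variables}, table)

# Hypothetical binary variables and table entries.
binary = (0, 1)
phi1 = Factor(("X1", "X3"), {"X1": binary, "X3": binary},
              {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0})
phi2 = Factor(("X2", "X3"), {"X2": binary, "X3": binary},
              {(0, 0): 0.5, (0, 1): 1.5, (1, 0): 2.5, (1, 1): 3.5})

D1 = {"X1", "X3"}
lhs = marginalise(multiply(phi1, phi2), D1)   # (phi1 phi2) marginalised to D1
rhs = multiply(phi1, marginalise(phi2, D1))   # phi1 times (phi2 marginalised to D1)
print(lhs.table)
print(rhs.table)
```

The right-hand side never builds the table over all three variables, which is the practical benefit of the distributive law.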

Definition 7.3 (Charge, Contraction). A charge

$$\Phi = \{\phi_1, \ldots, \phi_m\}$$

is defined as a set of functions on $\mathcal{X}$.

A contraction of a charge, or set of functions, is an operation of multiplication of functions, after extending them to $\mathcal{X}$, that returns the function

$$\Phi(x) = \prod_{j=1}^m \phi_j(x).$$

The same notation is often used to denote the contraction of a charge and the set of functions (the charge). The context makes it clear which is intended.

Probability function factorised along a DAG The joint probability function $p_{X_1, \ldots, X_d}$ is itself a function, with domain $\mathcal{X}$. If the joint probability function may be factorised according to a DAG $\mathcal{G} = (V, D)$, the decomposition is written as

$$p_{X_1, \ldots, X_d} = \prod_{j=1}^d p_{X_j \mid \Pi_j}.$$

Then for each $j = 1, \ldots, d$, $\phi_j$ defined by $\phi_j = p_{X_j \mid \Pi_j}$ is a function with domain $\mathcal{X}_{D_j} = \mathcal{X}_j \times \mathcal{X}_{\tilde{\Pi}_j}$ and $D_j = \{j\} \cup \tilde{\Pi}_j$.

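Example 7.4.
Consider a probability function over six variables that may be factorised along the directed acyclic graph in Figure 7.1. The functions corresponding to the conditional probabilities are

$$\phi_1 = p_{X_1}, \quad \phi_2 = p_{X_2 \mid X_1}, \quad \phi_3 = p_{X_3 \mid X_1}, \quad \phi_4 = p_{X_4 \mid X_2}, \quad \phi_5 = p_{X_5 \mid X_2, X_3}, \quad \phi_6 = p_{X_6 \mid X_3}.$$

The corresponding domains are

$$\mathcal{X}_{D_1} = \mathcal{X}_1, \quad \mathcal{X}_{D_2} = \mathcal{X}_2 \times \mathcal{X}_1, \quad \mathcal{X}_{D_3} = \mathcal{X}_3 \times \mathcal{X}_1,$$

$$\mathcal{X}_{D_4} = \mathcal{X}_4 \times \mathcal{X}_2, \quad \mathcal{X}_{D_5} = \mathcal{X}_5 \times \mathcal{X}_2 \times \mathcal{X}_3, \quad \mathcal{X}_{D_6} = \mathcal{X}_6 \times \mathcal{X}_3.$$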
[Figure 7.1: A Bayesian Network on 6 variables, with directed edges X1 → X2, X1 → X3, X2 → X4, X2 → X5, X3 → X5 and X3 → X6.]

Definition 7.5 (Domain Graph). The domain graph for the set of functions in $\Phi$ is an undirected graph with the variables as nodes and with links between any pair of variables which are members of the same domain.
Figure 7.2 illustrates the domain graph associated with the DAG of Figure 7.1. The domain graph of a DAG is the moral graph, Definition 5.4. The maximal cliques of the moral graph are illustrated in Figure 7.3.

[Figure 7.2: Domain graph of Bayesian Network in Figure 7.1, on the nodes X1, ..., X6.]

It is clear that the domain graph of a Bayesian network is the moral graph, since by definition all the parents are connected to each other and to the variable.

7.2 Marginalisation and Graphical Representations


[Figure 7.3: Maximal Cliques of the Graph in Figure 7.2: {X2, X4}, {X1, X2, X3}, {X2, X3, X5}, {X3, X6}.]

Let $\phi_1$ be a function with domain $\mathcal{X}_{D_1}$ and let $\phi_2$ be a function with domain $\mathcal{X}_{D_2}$. Suppose that $A \subset D_1 \cup D_2$ and their product $\phi_1 \phi_2$ is to be marginalised over $\mathcal{X}_A$. If $A \cap D_1 = \emptyset$ (the empty set), then

$$\sum_{\mathcal{X}_A} \phi_1 \phi_2 = \phi_1 \sum_{\mathcal{X}_A} \phi_2.$$

In coordinates, let $\phi_1$ have domain $\mathcal{X}_{D_1 \cup D_3}$ and $\phi_2$ domain $\mathcal{X}_{D_2 \cup D_3 \cup D_4}$, where $D_1$, $D_2$, $D_3$ and $D_4$ are disjoint. By the distributive law, the marginalisation may be written as

$$\sum_{x_2 \in \mathcal{X}_{D_2}} \phi_1(x_1, x_3) \phi_2(x_2, x_3, x_4) = \phi_1(x_1, x_3) \sum_{x_2 \in \mathcal{X}_{D_2}} \phi_2(x_2, x_3, x_4).$$

The function over $\mathcal{X}_{D_2} \times \mathcal{X}_{D_3} \times \mathcal{X}_{D_4}$ is first marginalised down to a function over $\mathcal{X}_{D_3} \times \mathcal{X}_{D_4}$. This function is transmitted to the function over $\mathcal{X}_{D_1} \times \mathcal{X}_{D_3}$, by which it is multiplied. The domains of the two functions to be multiplied have to be extended to $\mathcal{X}_{D_1} \times \mathcal{X}_{D_3} \times \mathcal{X}_{D_4}$. Using $X_1, X_2, X_3, X_4$ to denote the associated domains $\mathcal{X}_{D_1}$, $\mathcal{X}_{D_2}$, $\mathcal{X}_{D_3}$ and $\mathcal{X}_{D_4}$, the domains under consideration for the operations are illustrated in Figure 7.4. First, the function $\phi_2$, defined over $(X_2, X_3, X_4)$, is considered. This is marginalised to a function over $(X_3, X_4)$ and is then extended, by multiplying with $\phi_1$, to a function over $(X_1, X_3, X_4)$.

[Figure 7.4: The Distributive Law. The node (X2, X3, X4) is linked to the node (X1, X3, X4) by an edge labelled (X3, X4).]

Example 7.6 (Example of a Marginalisation).

Consider the computation for marginalising a contraction of a charge $\Phi$ defined over a state space $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2 \times \mathcal{X}_3 \times \mathcal{X}_4 \times \mathcal{X}_5 \times \mathcal{X}_6$ where

$$\Phi(x) = \phi_1(x_1, x_3, x_5) \phi_2(x_1, x_2) \phi_3(x_3, x_4) \phi_4(x_5, x_6).$$

More particularly, consider the computation of

$$\Phi^{\downarrow \emptyset} = \sum_{x \in \mathcal{X}} \Phi(x),$$

where the notation $\Phi^{\downarrow U}$ is defined in Section 7.1. With the order of summation $x_2, x_4, x_6, x_5, x_3, x_1$, the sum may be written (taking sums from right to left) as

$$\sum_{x_1 \in \mathcal{X}_1} \sum_{x_3 \in \mathcal{X}_3} \sum_{x_5 \in \mathcal{X}_5} \phi_1(x_1, x_3, x_5) \sum_{x_6 \in \mathcal{X}_6} \phi_4(x_5, x_6) \sum_{x_4 \in \mathcal{X}_4} \phi_3(x_3, x_4) \sum_{x_2 \in \mathcal{X}_2} \phi_2(x_1, x_2).$$

The computation, carried out in this order (right to left), may be represented by the graph in Figure 7.5; a computational tree, according to the distributive law, is given in Figure 7.6.

[Figure 7.5: Associations of Variables, on the nodes X1, ..., X6.]

[Figure 7.6: A Computational Tree for the Marginalisation. The nodes (X1, X2), (X5, X6) and (X3, X4) each point to the node (X1, X3, X5).]
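The saving from the distributive law in Example 7.6 can be seen by computing the same total in two ways. The following is a minimal sketch, assuming three states per variable and randomly generated table entries (both are hypothetical choices made for the illustration).

```python
import itertools
import random

random.seed(0)
n = 3  # hypothetical number of states per variable

def random_table(arity):
    """A table of non-negative entries indexed by tuples of state indices."""
    return {idx: random.random() for idx in itertools.product(range(n), repeat=arity)}

phi1 = random_table(3)   # phi1(x1, x3, x5)
phi2 = random_table(2)   # phi2(x1, x2)
phi3 = random_table(2)   # phi3(x3, x4)
phi4 = random_table(2)   # phi4(x5, x6)

# Brute force: sum the contraction over all n**6 configurations.
brute = sum(
    phi1[x1, x3, x5] * phi2[x1, x2] * phi3[x3, x4] * phi4[x5, x6]
    for x1, x2, x3, x4, x5, x6 in itertools.product(range(n), repeat=6)
)

# Distributive law: push the sums over x2, x4 and x6 inside first.
m2 = {x1: sum(phi2[x1, x2] for x2 in range(n)) for x1 in range(n)}
m3 = {x3: sum(phi3[x3, x4] for x4 in range(n)) for x3 in range(n)}
m4 = {x5: sum(phi4[x5, x6] for x6 in range(n)) for x5 in range(n)}

fast = sum(
    phi1[x1, x3, x5] * m4[x5] * m3[x3] * m2[x1]
    for x1, x3, x5 in itertools.product(range(n), repeat=3)
)
print(brute, fast)
```

The second computation touches tables of at most three variables, whereas the brute-force sum ranges over the full six-variable state space.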

Recall (Section 7.1) that the operation $\Phi^{\downarrow U}(x)$ means marginalising $\Phi$ over all variables not in the set $U$.

Definition 7.7 (Elimination of a Variable). The variable $X_v$, with index $v \in \tilde{W} = \tilde{V}/\tilde{U}$, is eliminated from $\sum_{x_{V/U} \in \mathcal{X}_{V/U}} \Phi(x_{V/U}, x_U)$ by the following procedure, where contraction means multiplying together all the functions in the charge.

1. Let $\Phi_v$ (or $\Phi_{X_v}$) denote the contraction of the functions in $\Phi$ that have $X_v$ in their domain; that is,

$$\Phi_v = \prod_{j \mid v \in D_j} \phi_j.$$

2. Let $\phi^{(v)}$ (or $\phi^{(X_v)}$) denote the function $\sum_{x_v \in \mathcal{X}_v} \Phi_v$.

3. Find a new set of functions $\Phi^{-v}$ (or $\Phi^{-X_v}$) by setting

$$\Phi^{-v} = (\Phi \cup \{\phi^{(v)}\}) / \Phi_v,$$

where $\Phi_v$ is here read as the set of functions that were multiplied together in step 1.

This is the definition of $\Phi^{-v}$, also denoted by $\Phi^{-X_v}$.

Those functions that do not contain $X_v$ in their domain have been retained; the others have been multiplied together and then marginalised over $X_v$ (thus eliminating the variable) to give $\phi^{(v)}$. This function has been added to the collection, and all those containing $X_v$ (other than $\phi^{(v)}$) have been removed.
(Note that the notation $\Phi^{-X_v}$ has two meanings: it is used to denote the collection of functions, and it is also used to denote the contraction of the charge obtained by multiplying together the functions in the collection. The meaning is determined by the context.) Having removed $X_v$, it remains to compute

$$\sum_{x_{W/\{X_v\}}} \Phi^{-X_v}(x_U, x_{W/\{X_v\}}).$$

The quantity

$$\Phi^{\downarrow U}(x_U) = \sum_{x_{W/U} \in \mathcal{X}_{W/U}} \Phi(x_{W/U}, x_U)$$

can be computed through successive elimination of the variables $X_v \in W/U$. The task, of course, is to find a sequence for marginalising the variables such that, at each stage, the variable is to be eliminated from as small a domain as possible. The procedure outlined above may be considered graphically in terms of undirected graphs and their triangulations.
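The three steps of the elimination procedure can be sketched as follows. This is a minimal illustration under stated assumptions: functions are stored as pairs (tuple of variable indices, table), every variable is taken to be binary, and the names `contract`, `eliminate` and `make_table` are hypothetical. The charge used is the one from Example 7.6, with randomly generated entries.

```python
import random
from itertools import product

STATES = (0, 1)  # hypothetical: every variable is binary

def contract(functions):
    """Multiply functions after extending them to the union of their domains."""
    variables = tuple(sorted({v for vs, _ in functions for v in vs}))
    table = {}
    for values in product(STATES, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        prod = 1.0
        for vs, t in functions:
            prod *= t[tuple(assignment[v] for v in vs)]
        table[values] = prod
    return variables, table

def eliminate(charge, v):
    """Return the charge with variable v eliminated (steps 1 to 3)."""
    with_v = [f for f in charge if v in f[0]]
    without_v = [f for f in charge if v not in f[0]]
    variables, table = contract(with_v)                 # step 1: contraction
    keep = tuple(u for u in variables if u != v)
    summed = {}
    for values, x in table.items():                     # step 2: sum out x_v
        key = tuple(val for u, val in zip(variables, values) if u != v)
        summed[key] = summed.get(key, 0.0) + x
    return without_v + [(keep, summed)]                 # step 3: new charge

def make_table(variables, rng):
    """A function with random non-negative entries over the given variables."""
    return variables, {vals: rng.random()
                       for vals in product(STATES, repeat=len(variables))}

rng = random.Random(0)
charge = [make_table((1, 3, 5), rng), make_table((1, 2), rng),
          make_table((3, 4), rng), make_table((5, 6), rng)]

brute_force = sum(contract(charge)[1].values())

reduced = charge
for v in (2, 4, 6, 5, 3, 1):      # elimination order used in Example 7.6
    reduced = eliminate(reduced, v)
total = reduced[0][1][()]         # single function left, over the empty domain
print(brute_force, total)
```

With this order no intermediate table involves more than three variables; a poor order could force a contraction over a much larger domain, which is the motivation for studying elimination sequences graphically.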

7.3 Decomposable Graphs and Node Elimination


Recall the definition of an induced sub graph (Definition 1.5); a sub graph induced by a subset $A \subset V$ is the graph $\mathcal{G}_A = (A, E_A)$ where $E_A = E \cap A \times A$. The following definitions are necessary.

Definition 7.8 (Complete Graph, Complete Subset). A graph $\mathcal{G}$ is complete (or a clique) if every pair of nodes is joined by an undirected edge. That is, for each $(\alpha, \beta) \in V \times V$ with $\alpha \ne \beta$, $(\alpha, \beta) \in E$ and $(\beta, \alpha) \in E$. In other words, $\langle \alpha, \beta \rangle \in U$, where $U$ denotes the set of undirected edges. A subset of nodes is called complete if it induces a complete sub graph.

Definition 7.9 (Maximal Clique). A maximal clique is a complete sub graph that is maximal with respect to $\subseteq$. In other words, a maximal clique is not a sub graph of any other complete graph.

Definition 7.10 (Simplicial Node). Recall the definition of family, found in Definition 1.2. For an undirected graph, the family of a node $\beta$ is $F(\beta) = \{\beta\} \cup N(\beta)$, where $N(\beta)$ denotes the set of neighbours of $\beta$. A node $\beta$ in an undirected graph is called simplicial if its family $F(\beta)$ is a maximal clique.

This means that, in an undirected graph, a node $\beta$ is simplicial if all its neighbours are neighbours of each other.
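The neighbour characterisation of a simplicial node translates directly into a test. The following is a minimal sketch; the adjacency-set representation and the function name `is_simplicial` are illustrative choices. The graph used is the moral graph of the DAG in Example 7.4, with edges read off from the domains $D_1, \ldots, D_6$.

```python
from itertools import combinations

def is_simplicial(adj, node):
    """True if every pair of neighbours of `node` is itself joined by an edge."""
    return all(b in adj[a] for a, b in combinations(adj[node], 2))

# Moral graph of the DAG in Example 7.4; X2 - X3 is the 'moral' edge
# joining the two parents of X5.
edges = [("X1", "X2"), ("X1", "X3"), ("X2", "X3"), ("X2", "X4"),
         ("X2", "X5"), ("X3", "X5"), ("X3", "X6")]

adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

simplicial = sorted(v for v in adj if is_simplicial(adj, v))
print(simplicial)  # ['X1', 'X4', 'X5', 'X6']
```

The nodes X2 and X3 are not simplicial: for instance, X1 and X4 are both neighbours of X2 but are not neighbours of each other.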

Definition 7.11 (Connectedness, Strong Components). Let $\mathcal{G} = (V, E)$ be a simple graph, where $E = U \cup D$. That is, $E$ may contain both directed and undirected edges. Let $\alpha \to \beta$ denote that there is a path (Definition 1.7) from $\alpha$ to $\beta$. If there is both $\alpha \to \beta$ and $\beta \to \alpha$ then $\alpha$ and $\beta$ are said to be connected. This is written:

$$\alpha \leftrightarrow \beta.$$

This is clearly an equivalence relation. The equivalence class for $\alpha$ is denoted by $[\alpha]$. In other words, $\beta \in [\alpha]$ if and only if $\beta \leftrightarrow \alpha$. These equivalence classes are called strong components of $\mathcal{G}$.

Note that a graph is connected if between any two nodes there exists a trail (Definition 1.6), but any two nodes $\alpha$ and $\beta$ are only said to be connected if there is a path from $\alpha$ to $\beta$ and a path from $\beta$ to $\alpha$, where the definition of a 'path' is given in Definition 1.7.

Definition 7.12 (Chord). Let $\mathcal{G} = (V, E)$ be a graph. Let $\sigma$ be an $n$-cycle in $\mathcal{G}$. A chord of this cycle is a pair $(\alpha_i, \alpha_j)$ of non-consecutive nodes in $\sigma$ such that $\alpha_i \sim \alpha_j$ in $\mathcal{G}$.

Definition 7.13 (Triangulated). An undirected graph is said to be triangulated if every cycle of length $\ge 4$ has a chord.

Lemma 7.14. If $\mathcal{G} = (V, E)$ is triangulated, then the induced graph $\mathcal{G}_A$ is also triangulated.

Proof Consider any cycle of length $\ge 4$ in the restricted graph. All the edges connecting these nodes remain. If the cycle possessed a chord in the original graph, the chord remains in the restricted graph.

Definition 7.15 (Separator). Let G = (V, E) be a graph. Let α, β ∈ V be two nodes. A subset S ⊆ V is
called an α, β separator if every trail between α and β has at least one node in S . Let A ⊂ V , B ⊂ V .
A set S ⊂ V separates A and B if it is an α, β separator for each (α, β) ∈ A × B . A and B are said to
be separated by S . The notation used in this text is A ⊥ B ∥ S .
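As an illustration, whether S is an α, β separator can be checked by removing S and testing whether β is still reachable from α. The sketch below uses a breadth-first search on a hypothetical 4-cycle; the function name and the graph are assumptions, and the function assumes that neither end node lies in S.

```python
from collections import deque


def separates(adj, S, a, b):
    """True if every trail from a to b passes through S (a and b assumed not in S)."""
    blocked = set(S)
    seen, queue = {a}, deque([a])
    while queue:
        u = queue.popleft()
        if u == b:
            return False  # found a trail that avoids S
        for v in adj[u]:
            if v not in blocked and v not in seen:
                seen.add(v)
                queue.append(v)
    return True


# Hypothetical 4-cycle a - b - c - d - a.
cycle4 = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}

print(separates(cycle4, {"b", "d"}, "a", "c"))
print(separates(cycle4, {"b"}, "a", "c"))
```

Here {b, d} separates a from c while {b} alone does not, so {b, d} is a minimal separator in the sense of the next definition. It is not complete, which is consistent with the 4-cycle having no chord.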

Definition 7.16 (Minimal Separator). Let A ⊆ V , B ⊆ V and S ⊆ V be three disjoint subsets of V .
Let S separate A and B . The separator S is said to be a minimal separator of A and B if no proper
subset of S is itself a separator of A and B .

Definition 7.17 (Decomposition, Weak Decomposition). Let G = (V, U ) be an undirected graph. A
triple (A, B, S) of disjoint subsets of the node set V of an undirected graph is said to form a
decomposition of G or to decompose G if

V = A ∪ B ∪ S

and

• S separates A from B ,

• S is a complete subset of V ,

• both GA∪S and GB∪S are decomposable.

A, B or S may be the empty set. If both A and B are non empty, then the decomposition is proper.

A triple (A, B, S) of disjoint subsets of the node set V of an undirected graph is said to form a weak
decomposition of G or to weakly decompose G if V = A ∪ B ∪ S , S separates A from B and both GA∪S
and GB∪S are weakly decomposable.

A weak decomposition differs from a decomposition in that the separator set S is not necessarily
complete. Clearly, every graph can be decomposed into its connected components (Definition 1.6). If
the graph is undirected, then the connected components are the strong components (Definition 7.11).

Definition 7.18 (Decomposable Graph). An undirected graph G is decomposable if either

1. it is complete, or

2. it possesses a proper decomposition (A, B, S) such that both sub graphs GA∪S and GB∪S are
decomposable.

This is a recursive definition, which is permissible, since the decomposition (A, B, S) is required to be
proper, so that GA∪S and GB∪S have fewer nodes than the original graph G .

Example 7.19 (Decomposable Graph).

Consider the graph in Figure 7.7. In the first stage, set S = {α3 }, with A = {α1 , α2 } and B =
{α4 , α5 , α6 }. Then S is complete and S separates A from B . Then A ∪ S = {α1 , α2 , α3 }
and GA∪S is a maximal clique. B ∪ S = {α3 , α4 , α5 , α6 }. The graph GB∪S is decomposable; take
S2 = {α3 , α5 }, A2 = {α4 } and B2 = {α6 }. Then GA2 ∪S2 and GB2 ∪S2 are maximal cliques.

Theorem 7.20. Let G = (V, U ) be an undirected graph. The following conditions 1), 2) and 3) are
equivalent.

1. G is decomposable.

2. G is triangulated.

3. For every pair of nodes (α, β) ∈ V × V , their minimal separator is complete.



Figure 7.7: Example of a Decomposable Graph

Proof of Theorem 7.20: 1) ⟹ 2) The proof is by induction on the number of nodes. Inductive
hypothesis: all undirected decomposable graphs with n nodes or fewer are triangulated. This is true
for one node.
Let G be a decomposable graph with n + 1 nodes. There are two alternatives:
Either G is complete, in which case it is triangulated,
Or: by the definition of decomposable, there are three disjoint subsets A, B, S such that S is a
complete subset, S separates A from B , V = A ∪ B ∪ S and GA∪S and GB∪S are decomposable. The
decomposition is proper, hence GA∪S and GB∪S have at most n nodes. Therefore, by
the inductive hypothesis GA∪S and GB∪S are triangulated. Therefore, a cycle of length ≥ 4 without a
chord cannot lie entirely in A ∪ S or entirely in B ∪ S ; it must contain nodes from both A and B .
Since S separates A from B , any such cycle must pass through S at least twice, at two non
consecutive nodes of the cycle. But then this cycle has a chord, since S is a complete subset.

Proof of Theorem 7.20: 2) ⟹ 3) Assume that G = (V, U ) is an undirected, triangulated graph.
Let S be a minimal separator for two nodes α and β . Let A denote the set such that α ∈ A and GA is
the largest connected sub-graph of GV /S such that α is in the node set. Let B = V /(A ∪ S). For every
node γ ∈ S , there is a node τ ∈ A such that ⟨γ, τ ⟩ ∈ U and there is a node σ ∈ B such that ⟨γ, σ⟩ ∈ U .
Otherwise S/{γ} would be a separator for α and β , contradicting the minimality of S . Hence, for any
pair (γ, δ) ∈ S × S of distinct nodes, there exist paths γ, τ1 , . . . , τm , δ and γ, σ1 , . . . , σn , δ where all the
nodes {τ1 , . . . , τm } are in A and all the nodes {σ1 , . . . σn } are in B . Then γ, τ1 , . . . , τm , δ, σn , . . . , σ1 , γ
is a cycle of length ≥ 4 and therefore has a chord. Assume that τ1 , . . . , τm and σ1 , . . . , σn have been
chosen so that the paths are as short as possible (that is, there is no shorter path from γ to δ with all
intervening nodes in A and no shorter path from γ to δ with all intervening nodes in B ).
The chord cannot be of the form ⟨τi , τj ⟩ for some (i, j) or ⟨σk , σl ⟩ for any (k, l), nor can it join γ or δ
to a non consecutive τi or σk , because of the minimality of the lengths of the chosen paths. It cannot be
of the form ⟨τi , σk ⟩ either, because S separates A from B . Therefore, ⟨γ, δ⟩ ∈ U . Therefore, γ and δ are
adjacent for every pair (γ, δ) ∈ S × S of distinct nodes. It follows that S is complete.

Proof of Theorem 7.20: 3) ⟹ 1) If G is complete, then the result is clear. If G is not complete,
then choose two distinct nodes (α, β) ∈ V × V that are not adjacent. Let S ⊆ V /{α, β} denote a
minimal separator for the pair (α, β). Let A denote the node set of the maximal connected component
of GV /S containing α and let B = V /(A ∪ S). Then (A, B, S) provides three disjoint subsets, where S is
complete and separates A from B ; the decomposition is proper, since α ∈ A and β ∈ B .
We have to show that GA∪S and GB∪S are decomposable. The procedure can be repeated on both GA∪S
and GB∪S and repeated recursively, stopping when GA′ ∪S ′ is complete for a set A′ and corresponding
separator S ′ , hence the graph is decomposable.

Definition 7.21 (Perfect Node Elimination Sequence). Let V = {α1 , . . . , αd } denote the node set
of a graph G . A perfect node elimination sequence of a graph G is an ordering of the node set
{α1 , . . . , αd } such that for each j in 1 ≤ j ≤ d − 1, αj is a simplicial node of the sub graph of G
induced by {αj , αj+1 , . . . , αd }.

Lemma 7.22. Every triangulated graph G has a simplicial node. Moreover, if G is not complete, then
it has two non adjacent simplicial nodes.

Proof The lemma is trivial if either G is complete, or else G has two or three nodes. Assume that G
is not complete. Suppose the result is true for all graphs with fewer nodes than G . Consider two non
adjacent nodes α and β . Let S denote the minimal separator of α and β . Let GA denote the largest
connected component of GV /S such that α ∈ A and let B = V /(A ∪ S), so that β ∈ B .
By induction, either GA∪S is complete, or else it has two non adjacent simplicial nodes. Since GS
is complete, it follows that at least one of the two simplicial nodes is in A. Such a node is therefore
also simplicial in G , because none of its neighbours is in B .
If GA∪S is complete, then any node of A is a simplicial node of G .
In all cases, there is a simplicial node of G in A. Similarly, there is a simplicial node in B . These
two nodes are then non adjacent simplicial nodes of G .

Theorem 7.23. A graph G is triangulated if and only if it has a perfect node elimination sequence.

Proof Suppose that G is triangulated. Assume that every triangulated graph with fewer nodes than
G has a perfect elimination sequence. By the previous lemma, G has a simplicial node α. Removing
α returns a triangulated graph. (Consider any cycle of length ≥ 4 with a chord. If the cycle remains
after the node is removed, then the chord is not removed). By proceeding inductively, it follows that
G has a perfect elimination sequence.
Conversely, assume that G has a perfect sequence, say {α1 , . . . , αd }. Consider any cycle of length
≥ 4. Let j be the first index such that αj is in the cycle. Let V (C) denote the node set of the cycle
and let Vj = {αj , . . . , αd }. Then V (C) ⊆ Vj . Since αj is simplicial in GVj , the neighbours of αj in the
cycle are adjacent, hence the cycle has a chord. Therefore G is triangulated.
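The first half of the proof suggests a simple, if not particularly fast, test for triangulation: repeatedly remove a simplicial node until either the graph is empty or no simplicial node is left. The following Python sketch is one way to do this; the adjacency-dictionary representation, the function name and the two small test graphs are assumptions made for illustration.

```python
def perfect_elimination_sequence(adj):
    """Return a perfect node elimination sequence, or None if none exists.

    Relies on Lemma 7.22 and Theorem 7.23: a triangulated graph always has a
    simplicial node, and removing it leaves a triangulated graph.
    """
    remaining = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    order = []
    while remaining:
        simplicial = next(
            (u for u in sorted(remaining)
             if all(w in remaining[v]
                    for v in remaining[u] for w in remaining[u] if v != w)),
            None,
        )
        if simplicial is None:
            return None  # no simplicial node: the graph is not triangulated
        for v in remaining.pop(simplicial):
            remaining[v].discard(simplicial)
        order.append(simplicial)
    return order


# Hypothetical examples: a chordless 4-cycle, and the same cycle with chord a-c.
cycle4 = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
chorded = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"a", "c"}}

print(perfect_elimination_sequence(cycle4))
print(perfect_elimination_sequence(chorded))
```

The chordless cycle has no simplicial node, so the function returns None; once the chord is added, a perfect sequence is found.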

Definition 7.24 (Eliminating a Node). Let G = (V, E) be an undirected graph. A node α is eliminated
from an undirected graph G in the following way:

1. For all pairs of neighbours (β, γ) of α add a link if G does not already contain one. The added
links are called fill-ins.

2. Remove α.

The resulting graph is denoted by G −α .


For example, consider the graph in Figure 7.8. This graph is already triangulated. But suppose one
did not notice this and one decided to eliminate node α3 from the graph in Figure 7.8. The resulting
graph is given in Figure 7.9.
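The two steps of Definition 7.24 translate directly into code. The sketch below is illustrative only: the function name, the adjacency-dictionary representation and the star-shaped example graph are assumptions, and the example is not the graph of Figure 7.8.

```python
from itertools import combinations


def eliminate(adj, node):
    """Eliminate `node`; return the new graph and the list of fill-ins added."""
    new = {u: set(vs) for u, vs in adj.items() if u != node}
    for vs in new.values():
        vs.discard(node)  # step 2: remove the node and its links
    fill_ins = []
    for u, v in combinations(sorted(adj[node]), 2):
        if v not in new[u]:  # step 1: link every pair of neighbours
            new[u].add(v)
            new[v].add(u)
            fill_ins.append((u, v))
    return new, fill_ins


# Hypothetical star graph: centre c joined to x, y and z.
star = {"c": {"x", "y", "z"}, "x": {"c"}, "y": {"c"}, "z": {"c"}}
graph_without_c, fill_ins = eliminate(star, "c")
print(fill_ins)
```

Eliminating the centre of the star forces three fill-ins, whereas eliminating any leaf first requires none; this is why the choice of elimination order matters.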

Figure 7.8: Example for Eliminating a Node

Figure 7.9: Graph 7.8 with α3 Eliminated

Definition 7.25 (Elimination Sequence). An elimination sequence of G is a linear ordering of its
nodes.
Let σ be an elimination sequence and let Λ denote the fill-ins produced by eliminating the nodes of G in
the order σ . Denote by G σ the graph G extended by Λ.

Example 7.26.
Consider the graph in Figure 7.8. Suppose the elimination sequence α3 , α2 , α4 , α1 , α5 , α6 is employed. Then
the fill-ins, for each stage, will be ⟨α1 , α6 ⟩, ⟨α1 , α5 ⟩, ⟨α2 , α6 ⟩, ⟨α5 , α6 ⟩ for α3 , then ⟨α1 , α4 ⟩, ⟨α4 , α6 ⟩ for
α2 . No further fill-ins are required. The graph G σ is given in Figure 7.10.

Definition 7.27 (Elimination sequence, elimination domains). An elimination sequence σ is a linear
ordering of the set of nodes V = {α1 , . . . , αd } where for each α ∈ {1, . . . , d}, σ(α) denotes the number
assigned to variable Xα . A node β is said to be of higher elimination order than α if σ(β) > σ(α). The
elimination domain of a node α is the set of neighbours of α of higher elimination order.

Figure 7.10: G σ . Elimination sequence (α3 , α2 , α4 , α1 , α5 , α6 )

In G σ , any node α together with its neighbours of higher elimination order form a complete subset. The
neighbours of α of higher elimination order are denoted by Nσ(α) . The sets Nσ(α) are the elimination
domains corresponding to the elimination sequence σ .

An efficient algorithm clearly tries to minimise the number of fill-ins. If possible, one should find an
elimination sequence that does not introduce fill-ins.

Proposition 7.28. Every maximal clique in G σ is of the form {α} ∪ Nσ(α) for some α ∈ V .

Proof Let C be a maximal clique in G σ and let α be a node in C of the lowest elimination order.
Every other node of C is a neighbour of α of higher elimination order, so C ⊆ {α} ∪ Nσ(α) . Since
{α} ∪ Nσ(α) is complete and C is maximal, C = {α} ∪ Nσ(α) .

An efficient algorithm ought to find an elimination sequence for the domain graph that yields maximal
cliques of minimal total size.

The following proposition is clear.

Proposition 7.29. Any G σ is a triangulation of G .

Proof By construction, the elimination sequence σ for graph G σ does not require any fill-ins. Hence
σ is a perfect elimination sequence for G σ , which is therefore triangulated by Theorem 7.23.

Recall that a graph is triangulated if and only if it has an elimination sequence without fill-ins. This
is equivalent to the statement that an undirected graph is triangulated if and only if all nodes can be
eliminated by successively eliminating a node α such that the family Fα = {α} ∪ Nα is complete. From
the definition, such a node α is a simplicial node.

Marginalisation and triangulation of graphs Let G = (V, U ) be an undirected graph, where
V = {X1 , . . . , Xd }. Recall the definition of the domain graph (Definition 7.5) and note that the maximal
cliques of the domain graph are the domains of the functions of the charge. Recall Definition 7.7, which
describes the procedure for eliminating a variable in a marginalisation. When a node Xv is eliminated
from the graph G , the resulting graph is denoted by G −Xv . Graphically, the procedure described in
Definition 7.7 is the same as Definition 7.24, eliminating a node. If G is the domain graph for a set
of functions Φ, then it is clear from Definition 7.24 that the graph G −Xv is the domain graph for the
set of functions Φ−Xv . Therefore, if the domain graph is triangulated, there is a perfect elimination
sequence; that is, there is an order for eliminating the variables such that, at each stage, the elimination
domain corresponds to a maximal clique in the current domain graph.

7.4 Junction Trees


Decomposable graphs provide the basis for one of the key methods for updating a probability
distribution described in terms of a Bayesian network. The DAG is moralised and then triangulated using
the most efficient triangulation algorithms available. The triangulated graph is then decomposed and
organised to form a junction tree, which supports effective algorithms. The purpose of this section
is to define junction trees and to show how to construct them. They provide a key tool for updating
a Bayesian network.

Definition 7.30 (Junction Trees). Let C be a collection of subsets of a finite set V and T be a tree
with C as its node set. Then T is said to be a junction tree (or join tree) if any intersection C1 ∩ C2
of a pair C1 , C2 of sets in C is contained in every node on the unique path in T between C1 and C2 .
Let G be an undirected graph and C the family of its maximal cliques. If T is a junction tree with C as
its node set, then T is known as a junction tree for the graph G .

Theorem 7.31. There exists a junction tree T of maximal cliques for the graph G if and only if G is
decomposable.

Proof Firstly, we prove that if the graph is decomposable, then there exists a junction tree of the
maximal cliques. The proof is by construction; a sequence is established in the following way. Firstly,
a simplicial node α is chosen; Fα is therefore a maximal clique. The algorithm continues by choosing
nodes from Fα that only have neighbours in Fα . The set of nodes Fα is labelled C1 and the set of those
nodes in Fα that have neighbours not in Fα is labelled S1 . This set is a separator.
Now remove the nodes in Fα that do not have neighbours outside Fα and name the new graph G ′ .
Choose a new node α in the graph G ′ such that Fα is a maximal clique. Repeat the process, with the
index j , where j is the previous index, plus 1.
When the parts have been established (as indicated in the diagram below), each separator Si is
then connected to a maximal clique Cj with j > i and such that Si ⊂ Cj . This is always possible,
because Si is a complete set and, in the elimination sequence described above, the first point of Si is
eliminated when dealing with a maximal clique of index greater than i.
It is necessary to prove that the structure constructed is a tree and that it has the junction tree
property.

Firstly, each maximal clique has at most one parent, so there are not multiple paths. The structure
is therefore a tree.
To prove the junction tree condition, consider two maximal cliques, Ci and Cj with i > j and let α
be a member of both. There is a unique path between Ci and Cj .
Because α is not eliminated when dealing with Cj , it is a member of Sj . By construction, it is also
a member of the child of Cj , say Ck . Arguing similarly, it is also a member of the child of Ck and, by
induction it is also a member of Ci and, of course, all the separators in between.

The converse is trivial; if the maximal cliques can be arranged as a junction tree, then we can construct
a perfect elimination sequence by: take a simplicial node from a maximal clique which is a leaf of the
junction tree and remove the node. If this is not the only simplicial node in the chosen maximal clique,
the maximal clique remains as a leaf of the junction tree, otherwise the maximal clique is removed
from the junction tree; the resulting maximal clique tree is a junction tree. Hence there is a perfect
elimination sequence, hence the graph is triangulated (and decomposable).

Example 7.32.
Consider the directed acyclic graph in Figure 7.11. The corresponding moral graph is given in Fig-
ure 7.12.

Figure 7.11: A Directed Acyclic Graph

An appropriate elimination sequence for this moral graph is

(α8 , α7 , α4 , α9 , α2 , α3 , α1 , α5 , α6 ).

There are two fill-ins; these are ⟨α1 , α5 ⟩ corresponding to the elimination of α2 and ⟨α1 , α6 ⟩,
corresponding to the elimination of α3 . The corresponding triangulated graph is given in Figure 7.13.
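The elimination in this example can be replayed in code. The figures are not reproduced here, so the edge list of the moral graph below is an assumption: it is reconstructed as the union of the edges inside the maximal cliques listed in Figure 7.14, with the two fill-ins removed. The function name is also hypothetical; node αi is written as the integer i.

```python
from itertools import combinations


def triangulate(edges, order):
    """Eliminate nodes in `order`; return (fill-ins, maximal cliques of the result)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    fill_ins, families = [], []
    for node in order:
        nbrs = adj.pop(node)
        for v in nbrs:
            adj[v].discard(node)
        families.append({node} | nbrs)  # node plus neighbours of higher order
        for u, v in combinations(sorted(nbrs), 2):
            if v not in adj[u]:
                adj[u].add(v)
                adj[v].add(u)
                fill_ins.append((u, v))
    # keep only the families that are not strictly contained in another one
    cliques = [c for c in families if not any(c < d for d in families)]
    return fill_ins, cliques


# Assumed moral graph (reconstructed, see the note above).
moral_edges = [(1, 2), (1, 3), (2, 5), (3, 4), (3, 6), (4, 6), (4, 7),
               (4, 8), (5, 6), (5, 9), (6, 7), (6, 9), (7, 8)]
order = (8, 7, 4, 9, 2, 3, 1, 5, 6)

fill_ins, cliques = triangulate(moral_edges, order)
print(fill_ins)
print(cliques)
```

Under this assumed edge list the sketch yields exactly the two fill-ins named in the example and seven maximal cliques.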
The junction tree construction may be applied. The maximal cliques and separators, with the labels
resulting from the diagram, are shown in Figure 7.14 and put together to form the junction tree, or
join tree, shown in Figure 7.15.
Later, when using the algorithm for updating, it will be useful to designate one node as the root.

Figure 7.12: Moral Graph corresponding to Figure 7.11

Figure 7.13: The triangulated graph corresponding to Figure 7.12

Definition 7.33 (Rooted Tree). A rooted tree T is a tree graph with a designated node ρ called the
root. A leaf of a tree is a node that is joined to at most one other node.

7.5 Perfect Orders of Maximal Cliques


Closely associated with the concept of a Junction Tree is the concept of running intersection property
and perfect order of maximal cliques. Let C = {C1 , . . . , Cn } denote a collection of maximal cliques.

Definition 7.34 (Running Intersection Property). C is said to have the running intersection property
(r.i.p.) if there is an order σ of {1, . . . , n} such that for each j ≥ 2 there is an l ∈ {1, . . . , j − 1} such that

\[
C_{\sigma(j)} \cap \Bigl( \bigcup_{i=1}^{j-1} C_{\sigma(i)} \Bigr) \subseteq C_{\sigma(j)} \cap C_{\sigma(l)}. \tag{7.1}
\]

An order of the maximal cliques that satisfies r.i.p. is said to be a perfect order of the maximal cliques.

C1 = {α4 , α7 , α8 }, S1 = {α4 , α7 }
C2 = {α4 , α6 , α7 }, S2 = {α4 , α6 }
C3 = {α3 , α4 , α6 }, S3 = {α3 , α6 }
C4 = {α5 , α6 , α9 }, S4 = {α5 , α6 }
C5 = {α1 , α2 , α5 }, S5 = {α1 , α5 }
C6 = {α1 , α3 , α6 }, S6 = {α1 , α6 }
C7 = {α1 , α5 , α6 }

Figure 7.14: The Maximal Cliques and Separators from Figure 7.13

Theorem 7.35. For an undirected graph G = (V, U ) with maximal cliques C = {C1 , . . . , Cn }, there
exists a perfect order of the maximal cliques if and only if G is triangulated. Furthermore, for any
perfect order σ , the tree constructed by adding the edge σ(j) ∼ σ(l(j)), where for each j ≥ 2
a single l(j) ∈ {1, . . . , j − 1} is chosen such that

\[
C_{\sigma(j)} \cap \Bigl( \bigcup_{i=1}^{j-1} C_{\sigma(i)} \Bigr) \subseteq C_{\sigma(j)} \cap C_{\sigma(l(j))}, \tag{7.2}
\]

is a junction tree.

Proof The graph is triangulated if and only if the maximal cliques can be arranged as a junction tree
(Theorems 7.20 and 7.31).
If there is a perfect order of the maximal cliques, then clearly the method described for constructing a
tree from these maximal cliques (edge between Cσ(j) and maximal clique Cσ(l(j)) such that (7.2) holds)
gives the junction tree property; namely, that for any two cliques Cα , Cβ , Cα ∩ Cβ is contained in each
separator on the unique path Cα ↔ Cβ on the tree. On the other hand, if there is a junction tree,
then we may choose arbitrarily one node as root, call it σ(1) and then proceed by choosing σ(j) as
any neighbour of σ(1), . . . , σ(j − 1) that has not yet appeared in the order. This order of the maximal
cliques satisfies r.i.p.
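The construction in Theorem 7.35 can be sketched as follows. The function name and the list-of-sets representation are assumptions; the cliques are those of Figure 7.14, taken in the order C7, C6, C5, C4, C3, C2, C1 (that is, with C7 as root), and node αi is written as the integer i.

```python
def junction_tree_from_order(cliques):
    """Link each clique j >= 2 to an earlier clique satisfying condition (7.2).

    `cliques` is a list of sets in the proposed order. Returns a list of
    (parent index, child index, separator) triples; raises ValueError if the
    order does not satisfy the running intersection property.
    """
    edges = []
    for j in range(1, len(cliques)):
        seen = set().union(*cliques[:j])
        separator = cliques[j] & seen
        parent = next((l for l in range(j) if separator <= cliques[l]), None)
        if parent is None:
            raise ValueError(f"running intersection property fails at position {j}")
        edges.append((parent, j, separator))
    return edges


# Cliques of Figure 7.14 in the order C7, C6, C5, C4, C3, C2, C1.
ordered = [{1, 5, 6}, {1, 3, 6}, {1, 2, 5}, {5, 6, 9}, {3, 4, 6}, {4, 6, 7}, {4, 7, 8}]
tree = junction_tree_from_order(ordered)
for parent, child, separator in tree:
    print(parent, child, sorted(separator))
```

The separators produced are {α1, α6}, {α1, α5}, {α5, α6}, {α3, α6}, {α4, α6} and {α4, α7}, which are S6, S5, S4, S3, S2 and S1 of Figure 7.14. An order such as {1, 2}, {3, 4}, {2, 3} is rejected, since the third set meets both earlier sets but is contained in neither.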

C1 - [S1] - C2 - [S2] - C3 - [S3] - C6 - [S6] - C7
C5 - [S5] - C7
C4 - [S4] - C7

Figure 7.15: A Junction Tree (or join tree) constructed from the triangulated graph in Figure 7.13

Figure 7.16: Illustration of a Rooted Tree (a root with children γ and δ , below which are three leaves)

Notes The material is standard from algorithmic graph theory. See, for example, [55]. The proof of
Theorem 7.20 follows the lines of Cowell, Dawid, Lauritzen and Spiegelhalter in [32].
Chapter 8

Junction trees and message passing

The task is to describe a scheme of message passing (propagation) between the maximal cliques of a
junction tree to compute the marginal distribution over a set of variables A ⊂ V /E , given hard evidence
{X E = xE } on a set of variables E :

\[
P_{V/E \mid E}(x_{V/E} \mid x_E)^{\downarrow A} = \sum_{x_{V/(A \cup E)} \in \mathcal{X}_{V/(A \cup E)}} P_{V/E \mid E}(x_A, x_{V/(A \cup E)} \mid x_E).
\]

The message passing algorithm described here is the one used by the R packages gRain and bnlearn
and also many other software programs that deal with Bayesian Networks; the algorithm is based
on representing the joint distribution of a Bayesian network using the so-called Aalborg formula

\[
P_X(x_1, \ldots, x_n) = \frac{\prod_{C \in \mathcal{C}} \phi_C(x_C)}{\prod_{S \in \mathcal{S}} \phi_S(x_S)},
\]

(which will be established later in this section), where

C = maximal cliques of the triangulated moral graph

and

S = separators of the junction tree

and each ϕC and ϕS is a function over the respective maximal clique C and separator S . The
propagation presented is the approach of Lauritzen and Spiegelhalter, discussed in [83]; the technicalities
may differ slightly in the various implementations available.

8.1 Factorisation along an Undirected Graph


Let G = (V, U ) be an undirected graph, where V = {X1 , . . . Xd } is a set of discrete variables.

Definition 8.1. A joint probability PX over a random vector X = (X1 , . . . , Xd ) is said to be factorised
according to G if there exist functions or factors, ϕA defined on ×v∈Ã Xv where A is a complete set of
nodes in G such that

\[
P_X(x) = \prod_{A} \phi_A(x_A)
\]

where the notation is clear (see section 7.1, page 141); the product is over all the functions.

Recall Definition 7.15 of a separator and Definition 7.17 of a decomposition. In the definition, A, B
or S may be the empty set, ∅.

Proposition 8.2. Let G be a decomposable undirected graph and let (A, B, S) decompose G . Then the
following two statements are equivalent:

1. P factorises along G and

2. both PA∪S and PB∪S factorise along GA∪S and GB∪S respectively and

\[
P(x) = \frac{P_{A \cup S}(x_{A \cup S}) \, P_{B \cup S}(x_{B \cup S})}{P_S(x_S)}.
\]

Proof of 1) ⟹ 2) Since the graph is decomposable, its maximal cliques can be organised as a
junction tree. Hence, without loss of generality, the factorisation can be taken to be of the form

\[
P(x) = \prod_{C \in \mathcal{C}} \phi_C(x_C),
\]

where the product is over the maximal cliques of G . Since (A, B, S) decomposes G , any maximal clique
of G can either be taken as a subset of A ∪ S or as a subset of B ∪ S . Furthermore, S is a strict subset
of any maximal clique of A ∪ S containing S and S is a strict subset of any maximal clique of B ∪ S
containing S . Letting C denote a maximal clique, it follows that

\[
P(x) = \prod_{C \subseteq A \cup S} \phi_C(x_C) \prod_{C \subseteq B \cup S} \phi_C(x_C).
\]

Since S is itself complete, it is a subset of any maximal clique containing S , so that no maximal clique
in the decomposition will appear in both A ∪ S and B ∪ S . Set

\[
h(x_{A \cup S}) = \prod_{C \subseteq A \cup S} \phi_C(x_C), \qquad k(x_{B \cup S}) = \prod_{C \subseteq B \cup S} \phi_C(x_C).
\]

Then

\[
P(x) = h(x_{A \cup S}) \, k(x_{B \cup S})
\]

and the marginal distribution is given by

\[
P_{A \cup S}(x_{A \cup S}) = h(x_{A \cup S}) \sum_{x_B \in \mathcal{X}_B} k(x_{B \cup S}) = h(x_{A \cup S}) \, k_S(x_S)
\]

and

\[
P_{B \cup S}(x_{B \cup S}) = k(x_{B \cup S}) \, h_S(x_S),
\]

where kS and hS are defined as kS (xS ) = ∑XB k(xB∪S ) and hS (xS ) = ∑XA h(xA∪S ). It follows that

\[
P(x) = h(x_{A \cup S}) \, k(x_{B \cup S}) = \frac{P_{A \cup S}(x_{A \cup S}) \, P_{B \cup S}(x_{B \cup S})}{k_S(x_S) \, h_S(x_S)}.
\]

Since

\[
P_S(x_S) = \sum_{x_A \in \mathcal{X}_A} h(x_{A \cup S}) \sum_{x_B \in \mathcal{X}_B} k(x_{B \cup S}) = h_S(x_S) \, k_S(x_S),
\]

it follows that

\[
P(x) = \frac{P_{A \cup S}(x_{A \cup S}) \, P_{B \cup S}(x_{B \cup S})}{P_S(x_S)}.
\]

Since PA∪S (xA∪S ) = kS (xS ) ∏C⊆A∪S ϕC (xC ) and PB∪S (xB∪S ) = hS (xS ) ∏C⊆B∪S ϕC (xC ), where kS
and hS are functions on the complete set S , it follows that PA∪S and
PB∪S factorise along the corresponding graphs. This establishes the proof of 1) ⟹ 2).

Proof of 2) ⟹ 1) If both PA∪S and PB∪S factorise along GA∪S and GB∪S respectively and the
given formula holds, then

\[
P(x) = \frac{1}{P_S(x_S)} \prod_{C \subseteq A \cup S} \phi_C(x_C) \prod_{C \subseteq B \cup S} \phi_C(x_C). \tag{8.1}
\]

Choose a maximal clique C ⊆ A ∪ S such that S ⊆ C and set ψC = ϕC /PS . For all other
C , set ψC = ϕC ; then

\[
P(x) = \prod_{C \subseteq V} \psi_C(x_C),
\]

so that P factorises along G .

Since PA∪S = ∏C⊆A∪S ϕC and PB∪S = ∏C⊆B∪S ϕC in Equation (8.1), it follows by a recursive application
of the proposition that

\[
P(x) = \frac{\prod_{C \in \mathcal{C}} P_C(x_C)}{\prod_{S \in \mathcal{S}} P_S(x_S)},
\]

where C denotes the set of maximal cliques and S denotes the set of separators.
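The formula can be checked numerically on the smallest non-trivial case, the chain X - Z - Y with A = {X}, B = {Y} and S = {Z}. The sketch below is illustrative only: the binary conditional probability tables are invented, and the helper name is hypothetical.

```python
from itertools import product

# Invented binary tables; P(x, z, y) = p(z) p(x | z) p(y | z) factorises along X - Z - Y.
p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}  # key (x, z)
p_y_given_z = {(0, 0): 0.3, (1, 0): 0.7, (0, 1): 0.5, (1, 1): 0.5}  # key (y, z)

P = {
    (x, z, y): p_z[z] * p_x_given_z[x, z] * p_y_given_z[y, z]
    for x, z, y in product((0, 1), repeat=3)
}


def marginal(table, keep):
    """Sum a joint table over all coordinates except those at positions `keep`."""
    out = {}
    for config, value in table.items():
        key = tuple(config[i] for i in keep)
        out[key] = out.get(key, 0.0) + value
    return out


P_xz = marginal(P, (0, 1))  # clique marginal over A ∪ S
P_zy = marginal(P, (1, 2))  # clique marginal over B ∪ S
P_s = marginal(P, (1,))     # separator marginal over S

max_error = max(
    abs(P[x, z, y] - P_xz[x, z] * P_zy[z, y] / P_s[(z,)])
    for x, z, y in P
)
print(max_error)
```

The maximal discrepancy is zero up to floating-point rounding, as the proposition predicts for a distribution that factorises along the chain.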

8.2 Factorising along a Junction Tree


Let P be a probability distribution that factorises along a directed acyclic graph G = (V, E). The
factorisation is given by

\[
P_X(x) = \prod_{v=1}^{d} P_{X_v \mid \Pi_v}(x_v \mid \pi_v(x)),
\]

where πv (x) denotes the instantiation of the parent set Πv of Xv under x. It is clear that this may be
expressed as a factorisation according to the moralised graph G mor , which is undirected:

\[
P_X(x) = \prod_{v=1}^{d} \phi_{A_v}(x_{A_v})
\]

where Av = {Xv } ∪ Πv and

\[
\phi_{A_v}(x_{A_v}) = P_{X_v \mid \Pi_v}(x_v \mid \pi_v(x)).
\]

Each set Av is complete in G mor , because moralisation joins all parents of a common child.
Hence a probability distribution factorised along the DAG is also factorised along the moral graph G mor .
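A minimal sketch of moralisation illustrates why each family Av is complete in the moral graph. The function name and the representation of a DAG as a dictionary from each node to its parent set are assumptions made for illustration.

```python
from itertools import combinations


def moralise(parents):
    """Return the moral graph of a DAG given as {node: set of parents}.

    Every node is linked to its parents, and all parents of a common child
    are linked to each other; directions are then dropped.
    """
    adj = {v: set() for v in parents}
    for child, pa in parents.items():
        for p in pa:
            adj.setdefault(p, set())
            adj[child].add(p)
            adj[p].add(child)
        for u, v in combinations(sorted(pa), 2):
            adj[u].add(v)  # "marry" the parents
            adj[v].add(u)
    return adj


# Hypothetical collider X -> Z <- Y.
dag = {"X": set(), "Y": set(), "Z": {"X", "Y"}}
moral = moralise(dag)
print(moral)
```

In the collider, the family of Z is {X, Y, Z}; after moralisation X and Y are adjacent, so this family is a complete set and P(z ∣ x, y) is a valid factor on the moral graph.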
For implementing algorithms, the problem is that it may not be possible to represent the sets A1 , . . . , Ad
on a tree. To enable this, G mor is triangulated to give (G mor )t . Recall that (G mor )t is decomposable
and its maximal cliques can be organised into a junction tree T . The probability distribution can
clearly be factorised as

\[
P_X(x) = \prod_{C \in \mathcal{C}} \phi_C(x_C),
\]

where ϕC (xC ) is the product of those conditional probabilities P(xv ∣πv (x)) that are assigned to C , each
factor being assigned to exactly one maximal clique that contains all of its arguments. This
factorisation is not necessarily unique. It corresponds to a triangulation of the moral graph, where C
are the maximal cliques. It follows that

\[
P_X(x) = \frac{\prod_{C \in \mathcal{C}} P_C(x_C)}{\prod_{S \in \mathcal{S}} P_S(x_S)}, \tag{8.2}
\]

where C denotes the set of maximal cliques and S denotes the set of separators of (G mor )t , which may
be organised according to a junction tree. This is the definition of a factorisation along a junction tree.

Definition 8.3 (Factorisation along a Junction Tree, Marginal Charge). Let PX be a probability
distribution over a random vector X = (X1 , . . . , Xd ). Suppose that the variables can be organised as a
junction tree, with maximal cliques C and separators S such that PX has the representation given in
Equation (8.2), where PC and PS denote the marginal probability functions over the maximal clique variables
C ∈ C and separator variables S ∈ S respectively. The representation in Equation (8.2) is known as the
factorisation along the junction tree, and the charge

Φ = {PS ∶ S ∈ S, PC ∶ C ∈ C}

is known as the marginal charge.

From the foregoing discussion, it is clear that Definition 8.3 is a special case of Definition 8.1, with
appropriate choice of functions in Definition 8.1.

Entering Evidence Equation (8.2) expresses the probability distribution in terms of functions over
the maximal cliques and separators of (G mor )t , or the junction tree. Suppose that hard evidence is
obtained on the variables U ; namely, that for U ⊆ V , {X U = y U } and the probability over the variables
V /U has to be updated accordingly.

The algorithm described below provides a procedure such that for any function f ∶ X → R+ (not
necessarily a probability function) that is expressed as

\[
f(x) = \frac{\prod_{C \in \mathcal{C}} \phi_C(x_C)}{\prod_{S \in \mathcal{S}} \phi_S(x_S)}, \tag{8.3}
\]

for a collection of functions Φ = {ϕC , C ∈ C, ϕS , S ∈ S} where C and S are the maximal cliques and
separators of a junction tree, the algorithm updates Φ to a collection of functions Φ∗ = {fC , C ∈
C, fS , S ∈ S} that satisfy

\[
f_C(x_C) = \sum_{z \in \mathcal{X}_{V/C}} f(z, x_C)
\]

and

\[
f_S(x_S) = \sum_{z \in \mathcal{X}_{V/S}} f(z, x_S)
\]

for each C ∈ C and each S ∈ S . It follows that if the algorithm is applied using

\[
\phi_C(x_C) =
\begin{cases}
P_C(x_C) & x_{C \cap U} = y_{C \cap U} \\
0 & x_{C \cap U} \neq y_{C \cap U}
\end{cases}
\]

and

\[
\phi_S(x_S) =
\begin{cases}
P_S(x_S) & x_{S \cap U} = y_{S \cap U} \\
0 & x_{S \cap U} \neq y_{S \cap U}
\end{cases}
\]

then

\[
P_{X_U}(y_U) = \sum_{z \in \mathcal{X}_{C/(U \cap C)}} f_C(z, y_{U \cap C}) = \sum_{z \in \mathcal{X}_{S/(U \cap S)}} f_S(z, y_{U \cap S})
\]

for all S ∈ S and all C ∈ C . This quantity may therefore be computed by marginalising the maximal
clique or separator with the smallest domain; dividing fC and fS by this quantity will give PC∣U (.∣y U )
and PS∣U (.∣y U ) respectively and hence a representation of the conditional distribution in terms of
marginal distributions over the maximal cliques and separators.

8.3 Flow of Messages


8.3.1 First Example
Consider a non-negative function with domain X × Y × Z , F ∶ X × Y × Z → R+ , which may be written
as

f (x, z)g(y, z)
F (x, y, z) = , (8.4)
h(z)
for non negative functions f ∶ X × Z → R+ , g ∶ Y × Z → R+ and h ∶ Z → R+ .
Decomposition (8.4) for the function F is of the form given in Equation (8.3), with maximal cliques
C1 = {X, Z}, C2 = {Z, Y } and separator S = {Z} arranged according to the junction tree in Figure 8.1.

X - Z - Y

Figure 8.1: Undirected Graph for the Three Variables

XZ - [Z] - YZ

Figure 8.2: Junction Tree for message passing

The following procedure returns a representation

\[
F(x, y, z) = \frac{F_1(x, z) \, F_2(y, z)}{F_3(z)},
\]

where

\[
F_1(x, z) = \sum_{y \in \mathcal{Y}} F(x, y, z), \qquad
F_2(y, z) = \sum_{x \in \mathcal{X}} F(x, y, z), \qquad
F_3(z) = \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} F(x, y, z).
\]

Firstly,

\[
F_1(x, z) = \sum_{y \in \mathcal{Y}} F(x, y, z) = \sum_{y \in \mathcal{Y}} \frac{f(x, z) \, g(y, z)}{h(z)} = \frac{f(x, z)}{h(z)} \sum_{y \in \mathcal{Y}} g(y, z).
\]

Define the auxiliary function h∗ (z) = ∑y g(y, z), and the update f ∗ (x, z) = f (x, z) h∗ (z)/h(z); then clearly

\[
f^*(x, z) = f(x, z) \, \frac{h^*(z)}{h(z)} = F_1(x, z).
\]

The calculation of the marginal function F1 (x, z) by means of the auxiliary function h∗ (z) may be
described as passing a local message flow from ZY to XZ through their separator Z . The factor

\[
\frac{h^*(z)}{h(z)}
\]

is called the update ratio. It follows that

\[
F(x, y, z) = \frac{f(x, z) \, g(y, z)}{h(z)} = \frac{f(x, z) \, g(y, z) \, h^*(z)}{h^*(z) \, h(z)} = F_1(x, z) \, \frac{1}{h^*(z)} \, g(y, z).
\]

The passage of the flow has resulted in a new representation of F (x, y, z) similar to the original, but
where one of the factors is a marginal function.

Similarly, a message can be passed in the other direction, i.e. from XZ to ZY . Using the same
procedure, set

\[
\tilde{h}(z) = \sum_{x \in \mathcal{X}} F_1(x, z) = \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} F(x, y, z) = F_3(z).
\]

Next, set

\[
\tilde{g}(y, z) = g(y, z) \, \frac{\tilde{h}(z)}{h^*(z)}.
\]

It then follows that g̃(y, z) = F2 (y, z), because

\[
F(x, y, z) = F_1(x, z) \, \frac{1}{\tilde{h}(z)} \, \tilde{g}(y, z) = F_1(x, z) \, \frac{1}{F_3(z)} \, \tilde{g}(y, z)
\]

and hence, since F3 (z) = ∑x∈X F1 (x, z), that

\[
F_2(y, z) = \sum_{x \in \mathcal{X}} F(x, y, z) = \tilde{g}(y, z) \sum_{x \in \mathcal{X}} \frac{F_1(x, z)}{F_3(z)} = \tilde{g}(y, z).
\]

Passing messages in both directions results in a new overall representation of the function F (x, y, z):

\begin{align*}
F(x, y, z) &= f^*(x, z) \, \frac{1}{h^*(z)} \, g(y, z) = f^*(x, z) \, \frac{1}{h^*(z)} \, \frac{h^*(z)}{\tilde{h}(z)} \, \tilde{g}(y, z) \\
&= f^*(x, z) \, \frac{1}{\tilde{h}(z)} \, \tilde{g}(y, z) \\
&= F_1(x, z) \, \frac{1}{F_3(z)} \, F_2(y, z).
\end{align*}

The original representation using functions has been transformed into a new representation where all
the functions are marginal functions.

This idea is now extended to arbitrary non negative functions represented on junction trees.
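The two flows of the first example can be traced numerically. In the sketch below the binary state spaces and the strictly positive factors f, g and h are invented for illustration; the remaining names follow the symbols used above (h_star for h∗, h_tilde for h̃ and so on).

```python
from itertools import product

X = Y = Z = (0, 1)

# Invented strictly positive factors, so that no division by zero occurs.
f = {(x, z): 1.0 + x + 2 * z for x, z in product(X, Z)}
g = {(y, z): 2.0 + y * z for y, z in product(Y, Z)}
h = {z: 1.0 + z for z in Z}

F = {(x, y, z): f[x, z] * g[y, z] / h[z] for x, y, z in product(X, Y, Z)}

# Flow from YZ to XZ through the separator Z.
h_star = {z: sum(g[y, z] for y in Y) for z in Z}
f_star = {(x, z): f[x, z] * h_star[z] / h[z] for x, z in product(X, Z)}

# Flow from XZ back to YZ.
h_tilde = {z: sum(f_star[x, z] for x in X) for z in Z}
g_tilde = {(y, z): g[y, z] * h_tilde[z] / h_star[z] for y, z in product(Y, Z)}

# Marginals of F computed directly, for comparison.
F1 = {(x, z): sum(F[x, y, z] for y in Y) for x, z in product(X, Z)}
F2 = {(y, z): sum(F[x, y, z] for x in X) for y, z in product(Y, Z)}
F3 = {z: sum(F[x, y, z] for x, y in product(X, Y)) for z in Z}

error = max(
    max(abs(f_star[k] - F1[k]) for k in F1),
    max(abs(g_tilde[k] - F2[k]) for k in F2),
    max(abs(h_tilde[k] - F3[k]) for k in F3),
)
print(error)
```

After the two flows the stored functions coincide with the marginals F1, F2 and F3 up to rounding error.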

8.4 Local Computation on Junction Trees


Consider a junction tree T with nodes C and separators S and let

\[
\Phi = \{\phi_C : C \in \mathcal{C}, \ \phi_S : S \in \mathcal{S}\} \tag{8.5}
\]

be a charge; that is, a collection of non negative functions such that ϕC ∶ XC → R+ and ϕS ∶ XS → R+
for each C ∈ C and each S ∈ S .

Definition 8.4 (Contraction of a Charge on a Junction Tree). The contraction of a charge (8.5) over
a junction tree is defined as

\[
f(x) = \frac{\prod_{C \in \mathcal{C}} \phi_C(x_C)}{\prod_{S \in \mathcal{S}} \phi_S(x_S)}. \tag{8.6}
\]

Local Message Passing Let C1 and C2 be two neighbouring nodes in T separated by S .
Set

\[
\phi_S^*(x_S) = \sum_{z \in \mathcal{X}_{C_1/S}} \phi_{C_1}(z, x_S) \tag{8.7}
\]

and set

\[
\lambda_S = \frac{\phi_S^*}{\phi_S} \tag{8.8}
\]

where, by definition, 0/0 = 0 is used in division of functions. λS is known as the update ratio. The
`message passing' is defined as the operation of updating ϕS to ϕ∗S and ϕC2 to

\[
\phi_{C_2}^* = \lambda_S \, \phi_{C_2}. \tag{8.9}
\]

All other functions remain unchanged. The scheme of local message passing is illustrated in Figure 8.3.

C1 - [S] -> C2, carrying the message λS and giving the update ϕ∗C2 = λS ϕC2

Figure 8.3: Flow from C1 to C2

Lemma 8.5. Let f ∶ X → R+ be the contraction of a charge Φ = {ϕS , S ∈ S, ϕC , C ∈ C} on a junction


tree (Denition 8.4) where C is the collection of maximal cliques and S the collection of separators. A
ow does not change the contraction of the charge.

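Proof The initial contraction is given by

$$
f(x) = \frac{\prod_{C \in \mathcal{C}} \phi_C(x_C)}{\prod_{S \in \mathcal{S}} \phi_S(x_S)}. \tag{8.10}
$$

Let the contraction after the flow be denoted by f ∗ and the charge after the flow from C1 to C2 be
denoted by

$$
\Phi^* = \{\phi^*_C : C \in \mathcal{C},\ \phi^*_S : S \in \mathcal{S}\}.
$$

Then, with S the separator between C1 and C2 ,

$$
f^*(x) = \frac{\phi^*_{C_2}(x_{C_2}) \prod_{C \in \mathcal{C},\, C \neq C_2} \phi_C(x_C)}{\phi^*_S(x_S) \prod_{T \in \mathcal{S},\, T \neq S} \phi_T(x_T)}. \tag{8.11}
$$

There are three cases to consider.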

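• For x such that ϕS (xS ) > 0 and ϕ∗S (xS ) > 0,

$$
\frac{\phi^*_{C_2}}{\phi^*_S} = \frac{\lambda_S\,\phi_{C_2}}{\phi^*_S} = \frac{\left(\phi^*_S / \phi_S\right) \phi_{C_2}}{\phi^*_S} = \frac{\phi_{C_2}}{\phi_S}
$$

and the result is proved.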

• For x such that ϕS (xS ) = 0: it follows that f (x) = 0 and hence that ϕC1 (xC1 ) = 0 and that
λS (xS ) = 0. It therefore follows from the definition of ϕ∗C2 that ϕ∗C2 = 0 and hence that f ∗ (x) = 0,
so that 0 = f ∗ (x) = f (x).

• For x such that ϕS (xS ) > 0, but ϕ∗S (xS ) = 0, it follows directly that λS (xS ) = 0, so that
f ∗ (x) = 0. It remains to show that f (x) = 0. From the definition,

$$
0 = \phi^*_S(x_S) = \sum_{z \in \mathcal{X}_{C_1 \setminus S}} \phi_{C_1}(z, x_S).
$$

Since ϕC1 (xC1 ) ≥ 0 for all xC1 ∈ XC1 , it follows that ϕC1 (z, xS ) = 0 for all z ∈ XC1 /S . Since

$$
f(x) = \phi_{C_1}(x_{C_1})\,\frac{\phi_{C_2}(x_{C_2}) \prod_{C \in \mathcal{C},\, C \neq C_1, C_2} \phi_C(x_C)}{\phi_S(x_S) \prod_{T \in \mathcal{S},\, T \neq S} \phi_T(x_T)},
$$

it follows directly from the facts that the domains of the maximal cliques other than C1 and C2
and separators other than S do not include XS , and that ϕC2 (xC2 )/ϕS (xS ) < +∞, that f (x) = 0, hence
f (x) = f ∗ (x).

In all cases, it follows that a flow does not change the contraction of a charge.
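Lemma 8.5 can be checked numerically on a small case. The following base R sketch is only an illustration: the cliques C1 = {x, z} and C2 = {y, z}, the separator S = {z}, the binary state spaces and all the numbers are assumptions made for the example, and the helper `contraction` is hypothetical. It applies (8.7), (8.8) and (8.9) once and compares the contraction before and after the flow.

```r
# Hypothetical potentials: C1 = {x, z}, C2 = {y, z}, separator S = {z}.
phi_C1 <- matrix(c(1, 2, 3, 4), nrow = 2, dimnames = list(x = 1:2, z = 1:2))
phi_C2 <- matrix(c(5, 0, 1, 2), nrow = 2, dimnames = list(y = 1:2, z = 1:2))
phi_S  <- c(2, 4)

# Contraction f(x, y, z) = phi_C1(x, z) * phi_C2(y, z) / phi_S(z);
# entries with phi_S(z) = 0 are left at 0, following the division convention.
contraction <- function(p1, p2, ps) {
  f <- array(0, dim = c(2, 2, 2))                 # indexed as f[x, y, z]
  for (x in 1:2) for (y in 1:2) for (z in 1:2) {
    if (ps[z] > 0) f[x, y, z] <- p1[x, z] * p2[y, z] / ps[z]
  }
  f
}

f_before <- contraction(phi_C1, phi_C2, phi_S)

# One flow from C1 to C2.
phi_S_new  <- colSums(phi_C1)                          # (8.7): sum out x
lambda_S   <- ifelse(phi_S > 0, phi_S_new / phi_S, 0)  # (8.8): update ratio
phi_C2_new <- sweep(phi_C2, 2, lambda_S, `*`)          # (8.9): scale column z

f_after <- contraction(phi_C1, phi_C2_new, phi_S_new)
print(all.equal(f_before, f_after))
```

The individual functions ϕS and ϕC2 change, but the comparison in the last line should report that the two contractions agree.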

8.5 Schedules
The aim of this section is to describe how to construct a series of transmissions between the various
maximal cliques of a junction tree, to update a set of functions to a set of functions that have the same
contraction as the original and which are the marginals of the contraction over the maximal cliques
and separators. First, some definitions and notations are established.

Definition 8.6 (Sub-tree, Neighbouring Clique). A sub-tree T ′ of a junction tree T is a connected
set of nodes of T together with the edges in T between them.
A maximal clique C of a junction tree T is a neighbour of a sub-tree T ′ if the corresponding node
of T is not a node of T ′ but is connected to T ′ by an edge of T .

The following definition gives the technical terms that will be used.

Definition 8.7 (Schedule, Active Flow, Fully Active Schedule). A schedule is an ordered list of directed
edges of T specifying which flows are to be passed and in which order.
A flow is said to be active relative to a schedule if before it is sent the source has already received
active flows from all its neighbours in T , with the exception of the sink; namely, the node to which it
is sending its flow. A schedule is full if it contains an active flow in each direction along every edge of
the tree T . A schedule is active if it contains only active flows. It is fully active if it is both full and
active.

It follows from this definition that the first active flow must originate in a leaf of T .

Example 8.8 (Fully active schedule).

Figure 8.4 shows a DAG and 8.5 a corresponding junction tree. An example of a fully active schedule
for the junction tree given in Figure 8.5, where the maximal clique BEL is chosen as the root, would
be:

AT → ELT, BLS → BEL, BDE → BEL, EK → ELT, ELT → BEL

BEL → ELT, ELT → EK, ELT → AT, BEL → BLS, BEL → BDE.

[Diagram: a DAG on the nodes A, S, T, E, L, B, K and D.]

Figure 8.4: Example of a DAG

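[Diagram: a junction tree with nodes AT, ELT, EK, BLS, BEL and BDE; its edges, with separators in
parentheses, are AT − ELT (T), ELT − EK (E), ELT − BEL (EL), BLS − BEL (BL) and BEL − BDE (BE).]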

Figure 8.5: Corresponding Junction Tree
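The schedule of Example 8.8 can be checked mechanically against Definition 8.7. The base R sketch below is an illustration only; the helper names `neighbours` and `is_fully_active` are hypothetical, and the tree is entered by hand from Figure 8.5.

```r
# Edges of the junction tree of Figure 8.5, one row per edge.
tree <- rbind(c("AT", "ELT"), c("EK", "ELT"), c("ELT", "BEL"),
              c("BLS", "BEL"), c("BDE", "BEL"))

# The schedule of Example 8.8, one row per flow (source, sink).
schedule <- rbind(
  c("AT", "ELT"), c("BLS", "BEL"), c("BDE", "BEL"), c("EK", "ELT"), c("ELT", "BEL"),
  c("BEL", "ELT"), c("ELT", "EK"), c("ELT", "AT"), c("BEL", "BLS"), c("BEL", "BDE"))

neighbours <- function(node, tree) {
  c(tree[tree[, 1] == node, 2], tree[tree[, 2] == node, 1])
}

is_fully_active <- function(schedule, tree) {
  sent <- character(0)                        # active flows so far, as "from>to"
  for (i in seq_len(nrow(schedule))) {
    from <- schedule[i, 1]
    to   <- schedule[i, 2]
    # Active: the source has received flows from all neighbours except the sink.
    for (n in setdiff(neighbours(from, tree), to)) {
      if (!(paste(n, from, sep = ">") %in% sent)) return(FALSE)
    }
    sent <- c(sent, paste(from, to, sep = ">"))
  }
  # Full: a flow in each direction along every edge.
  both_ways <- c(paste(tree[, 1], tree[, 2], sep = ">"),
                 paste(tree[, 2], tree[, 1], sep = ">"))
  all(both_ways %in% sent)
}

print(is_fully_active(schedule, tree))
```

Moving the flow ELT → BEL to the front of the list makes the check fail, since ELT would then send before hearing from AT and EK.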

Proposition 8.9. For any tree T , there exists a fully active schedule.

Proof If there is only one maximal clique, the proposition is clear; no transmissions are necessary.
Assume that there is more than one maximal clique. Let C0 denote a leaf in T . Let T0 be a sub-tree
of T obtained by removing C0 and the corresponding edge S0 . Assume that the proposition is true for
T0 . Adding the edge
C0 → S0 → T0
to the beginning of the schedule and
C0 ← S0 ← T0
to the end of the schedule provides a fully active schedule for T .
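The inductive argument is constructive and translates directly into a recursion. The following base R sketch is one possible reading of it, not a prescribed implementation; the function name `fully_active_schedule` and the edge-matrix representation are assumptions made for the illustration.

```r
# edges: two-column character matrix, one row per edge of the tree.
# Returns a two-column matrix of flows (source, sink) in schedule order.
fully_active_schedule <- function(edges) {
  if (nrow(edges) == 0) return(matrix(character(0), ncol = 2))
  degree <- table(c(edges[, 1], edges[, 2]))
  leaf <- names(degree)[degree == 1][1]            # a leaf C0
  row  <- which(edges[, 1] == leaf | edges[, 2] == leaf)
  nb   <- setdiff(edges[row, ], leaf)              # the node of T0 adjacent to C0
  inner <- fully_active_schedule(edges[-row, , drop = FALSE])
  # C0 sends first, then the schedule for T0 runs, then C0 receives last.
  rbind(c(leaf, nb), inner, c(nb, leaf))
}

# Junction tree of Figure 8.5, entered by hand.
tree <- rbind(c("AT", "ELT"), c("EK", "ELT"), c("ELT", "BEL"),
              c("BLS", "BEL"), c("BDE", "BEL"))
s <- fully_active_schedule(tree)
print(s)
```

The schedule produced this way generally differs from the one in Example 8.8; fully active schedules are not unique.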

The aim is to show that after the passage of a fully active schedule of flows over a junction tree, the
resulting charge is the marginal charge. That is, all the functions of the charge are the marginals of
the contraction of the charge over the respective maximal cliques and separators. Furthermore, there
is global consistency after the passage of a fully active schedule of flows over a junction tree. This will
be defined later, but loosely speaking, it means that if there are several apparent ways to compute a
probability distribution over a set of variables using the functions of the marginal charge, they will all
give the same answer.

Definition 8.10 (The Base of a Sub-tree, Restriction of a Charge, Live Sub-tree). Let T ′ be a sub-tree
of T , with nodes C ′ ⊆ C and edges S ′ ⊆ S . The base of T ′ is defined as the set of variables

$$
U' := \bigcup_{C \in \mathcal{C}'} C.
$$

Let

$$
\Phi = \{\phi_C : C \in \mathcal{C},\ \phi_S : S \in \mathcal{S}\}
$$

be a charge for T . Its restriction to T ′ is defined as

$$
\Phi_{T'} = \{\phi_C : C \in \mathcal{C}',\ \phi_S : S \in \mathcal{S}'\}.
$$

Recall Definition 8.4. The contraction of ΦT ′ is defined as

$$
\frac{\prod_{C \in \mathcal{C}'} \phi_C(x_C)}{\prod_{S \in \mathcal{S}'} \phi_S(x_S)}.
$$

A sub-tree T ′ is said to be live with respect to the schedule of flows if it has already received active flows
from all its neighbours.

Proposition 8.11. Let

$$
\Phi^0 = \{\phi^0_C : C \in \mathcal{C},\ \phi^0_S : S \in \mathcal{S}\}
$$

denote an initial charge for a function f that has factorisation

$$
f(x) = \frac{\prod_{C \in \mathcal{C}} \phi^0_C(x_C)}{\prod_{S \in \mathcal{S}} \phi^0_S(x_S)}
$$

where C and S are the sets of maximal cliques and separators for a junction tree T . Suppose that Φ0 is
modified by a sequence of flows according to some schedule. Then, whenever T ′ is live, the contraction
of the charge for T ′ is the margin of the contraction f of the charge for T on U ′ .

Proof Assume that T ′ ⊂ T and that T ′ is live. Let C ∗ denote the last neighbour to have passed a
flow into T ′ . Let T ∗ be the sub-tree obtained by adding C ∗ and the associated edge S ∗ to T ′ . Let
$\mathcal{C}^*$, $\mathcal{S}^*$ and U ∗ be the maximal cliques, separators and the base of T ∗ . By the junction tree property
of T , the separator associated with the edge S ∗ joining C ∗ to T ′ is

$$
S^* = C^* \cap U'.
$$

Also,

$$
\mathcal{C}^* = \mathcal{C}' \cup \{C^*\} \quad \text{and} \quad \mathcal{S}^* = \mathcal{S}' \cup \{S^*\},
$$

and the set of base variables for T ∗ is

$$
U^* = U' \cup C^*.
$$

The induction hypothesis: The assertion holds for the contraction of the charge on T ∗ . That is,
write

$$
f_{U^*}(x_{U^*}) = \sum_{U \setminus U^*} f(x)
$$

and let

$$
\Phi = \{\phi_C : C \in \mathcal{C},\ \phi_S : S \in \mathcal{S}\}
$$

denote the charge just before the last flow from C ∗ into T ′ . It follows that

$$
f_{U^*}(x_{U^*}) = \frac{\prod_{C \in \mathcal{C}^*} \phi_C(x_C)}{\prod_{S \in \mathcal{S}^*} \phi_S(x_S)} = \frac{\phi_{C^*}(x_{C^*})}{\phi_{S^*}(x_{S^*})}\,\frac{\prod_{C \in \mathcal{C}'} \phi_C(x_C)}{\prod_{S \in \mathcal{S}'} \phi_S(x_S)}. \tag{8.12}
$$

Lemma 8.5 states that a flow does not change the contraction of a charge. Let {ϕ∗C , C ∈ C ′ , ϕ∗S , S ∈ S ′ }
be the updated functions over the maximal cliques and separators after the flow.
The aim is to find the marginal fU ′ = ∑U /U ′ f of f on U ′ and to show that after the flow,

$$
f_{U'}(x_{U'}) = \frac{\prod_{C \in \mathcal{C}'} \phi^*_C(x_C)}{\prod_{S \in \mathcal{S}'} \phi^*_S(x_S)}.
$$

Note that ϕC ∗ = ϕ∗C ∗ and that ϕ∗S ∗ = ∑C ∗ /S ∗ ϕC ∗ . It follows that

$$
f_{U'} = \sum_{U^* \setminus U'} f_{U^*} = \left( \frac{\sum_{C^* \setminus S^*} \phi_{C^*}(x_{C^*})}{\phi^*_{S^*}(x_{S^*})} \right) \frac{\prod_{C \in \mathcal{C}'} \phi^*_C(x_C)}{\prod_{S \in \mathcal{S}'} \phi^*_S(x_S)} = \frac{\prod_{C \in \mathcal{C}'} \phi^*_C(x_C)}{\prod_{S \in \mathcal{S}'} \phi^*_S(x_S)}
$$

and the proof is complete.



Corollary 8.12. Let {ϕC , C ∈ C, ϕS , S ∈ S} denote the current functions over the maximal cliques and
separators. For any set A ⊆ V , let fA = ∑XV /A f , the marginal over A. Whenever a maximal clique C
is live, its corresponding function is ϕC = fC = ∑XV /C f .

Proof A single maximal clique is a sub-tree. The result is immediate from Proposition 8.11.

Corollary 8.13. Using the notation of Corollary 8.12, whenever active flows have passed in both
directions across an edge in T , the function for the associated separator is ϕS = fS = ∑XV /S f .

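Proof The function ϕS for the associated separator is, by definition of the update,

$$
\phi_S = \sum_{\mathcal{X}_{C \setminus S}} \phi_C,
$$

so that

$$
\sum_{\mathcal{X}_{C \setminus S}} \phi_C = \sum_{\mathcal{X}_{C \setminus S}} f_C = f_S,
$$

because ϕC is fC by the previous corollary.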

Proposition 8.14 (The Main Result). After passage of a fully active schedule of flows, the resulting
charge is the marginal charge Φ and its contraction represents f . In other words, the following formula,
known as the Aalborg formula, holds:

$$
f(x) = \frac{\prod_{C \in \mathcal{C}} f_C(x_C)}{\prod_{S \in \mathcal{S}} f_S(x_S)}.
$$

Proof This follows from the previous two corollaries and Lemma 8.5, stating that the contraction is
unaltered by the flows.
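The main result can be illustrated on the smallest non-trivial junction tree. The base R sketch below is a hedged example, not part of the original development: it assumes a chain of three cliques C1 = {a, b}, C2 = {b, c}, C3 = {c, d} with binary variables, arbitrary strictly positive potentials and separators initialised to 1. It runs the fully active schedule C1 → C2, C3 → C2, C2 → C1, C2 → C3 and compares every function of the resulting charge with the corresponding marginal of f computed by brute force.

```r
# Hypothetical initial charge (all numbers arbitrary and strictly positive).
phi_ab <- matrix(c(1, 2, 3, 4), 2, 2)     # rows a, columns b
phi_bc <- matrix(c(2, 1, 1, 3), 2, 2)     # rows b, columns c
phi_cd <- matrix(c(1, 1, 2, 5), 2, 2)     # rows c, columns d
phi_b  <- c(1, 1)
phi_c  <- c(1, 1)

# Contraction f(a, b, c, d) of the initial charge; k stands for the state of c.
f <- array(0, dim = c(2, 2, 2, 2))
for (a in 1:2) for (b in 1:2) for (k in 1:2) for (d in 1:2) {
  f[a, b, k, d] <- phi_ab[a, b] * phi_bc[b, k] * phi_cd[k, d] / (phi_b[b] * phi_c[k])
}

# C1 -> C2: sum out a, then scale row b of phi_bc by the update ratio.
new_b  <- colSums(phi_ab)
phi_bc <- phi_bc * (new_b / phi_b)
phi_b  <- new_b
# C3 -> C2: sum out d, then scale column c of phi_bc.
new_c  <- rowSums(phi_cd)
phi_bc <- sweep(phi_bc, 2, new_c / phi_c, `*`)
phi_c  <- new_c
# C2 -> C1: sum out c, then scale column b of phi_ab.
new_b  <- rowSums(phi_bc)
phi_ab <- sweep(phi_ab, 2, new_b / phi_b, `*`)
phi_b  <- new_b
# C2 -> C3: sum out b, then scale row c of phi_cd.
new_c  <- colSums(phi_bc)
phi_cd <- phi_cd * (new_c / phi_c)
phi_c  <- new_c

# Each function should now equal the marginal of f on its own variables.
checks <- c(
  ab = isTRUE(all.equal(phi_ab, apply(f, c(1, 2), sum))),
  bc = isTRUE(all.equal(phi_bc, apply(f, c(2, 3), sum))),
  cd = isTRUE(all.equal(phi_cd, apply(f, c(3, 4), sum))),
  b  = isTRUE(all.equal(phi_b, apply(f, 2, sum))),
  c  = isTRUE(all.equal(phi_c, apply(f, 3, sum))))
print(checks)
```

Because the functions are the marginals and the contraction is unchanged, the Aalborg formula follows for this example; sending any further flow would give an update ratio equal to 1.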

8.6 Local and Global Consistency


Recall that T denotes the junction tree, the set of maximal cliques which form the nodes of T is
denoted C and the intersections of neighbours in the tree T are the separators, denoted by S . Recall
that the functions associated with C ∈ C and S ∈ S are denoted by ϕC and ϕS respectively, and that
the charge on T , Φ, is defined as:

$$
\Phi = \{\phi_C : C \in \mathcal{C},\ \phi_S : S \in \mathcal{S}\}.
$$

Definition 8.15 (Local Consistency). A junction tree T is said to be locally consistent if whenever
C1 ∈ C and C2 ∈ C are two neighbours with separator S = C1 ∩ C2 , then

$$
\sum_{\mathcal{X}_{C_1 \setminus (C_1 \cap C_2)}} \phi_{C_1} = \phi_S = \sum_{\mathcal{X}_{C_2 \setminus (C_1 \cap C_2)}} \phi_{C_2}.
$$

Definition 8.16 (Global Consistency). A junction tree T (or its charge) is said to be globally consistent
if for every C1 ∈ C and C2 ∈ C it holds that

$$
\sum_{\mathcal{X}_{C_1 \setminus (C_1 \cap C_2)}} \phi_{C_1} = \sum_{\mathcal{X}_{C_2 \setminus (C_1 \cap C_2)}} \phi_{C_2}.
$$

Global consistency means that the marginalisations to C1 ∩ C2 of ϕC1 and ϕC2 coincide for every C1
and C2 in C . The following results show that, for a junction tree, local consistency implies global
consistency.

Proposition 8.17. After a passage of a fully active schedule of flows, a junction tree T is locally
consistent.

Proof The two corollaries of the main result give that for any two neighbouring C1 and C2 ,

$$
\sum_{C_1 \setminus S} f_{C_1} = f_S = \sum_{C_2 \setminus S} f_{C_2}.
$$

An equilibrium, or fixed point, has been reached, in the sense that any new flows passed after passage
of a fully active schedule do not alter the functions. The update ratio for another message from C1 to
C2 becomes

$$
\lambda_S = \frac{\sum_{C_1 \setminus S} f_{C_1}}{f_S} = 1.
$$

Global Consistency of Junction Trees In this paragraph, it is shown that for junction trees, local
consistency implies global consistency.

By definition, a junction tree is a tree such that the intersection C1 ∩ C2 of any pair C1 and C2 in C is
contained in every node on the unique trail in T between C1 and C2 . The set C1 ∩ C2 can be empty
and, in this case, it is (by convention) a subset of every other set.

Proposition 8.18. A locally consistent junction tree is globally consistent.

Proof In a junction tree the intersection C1 ∩ C2 of any pair C1 and C2 in C is contained in every node
on the unique path in T between C1 and C2 . Assume that C1 ∩ C2 is non-empty. Consider the unique
path from C1 to C2 . Let the nodes on the path be denoted by C (0) , C (1) , . . . , C (n) with C (0) = C1 and
C (n) = C2 , so that C (i) and C (i+1) are neighbours. Denote the separator between C (i) and C (i+1) by

$$
S^{(i)} = C^{(i)} \cap C^{(i+1)}.
$$

Then, for all i,

$$
C_1 \cap C_2 \subseteq S^{(i)}.
$$

For a set of variables C , let ∑C denote ∑XC . The assumption of local consistency means that for any
two neighbours

$$
\sum_{C^{(i)} \setminus S^{(i)}} \phi_{C^{(i)}} = \sum_{C^{(i)} \setminus (C^{(i+1)} \cap C^{(i)})} \phi_{C^{(i)}} = \sum_{C^{(i+1)} \setminus (C^{(i)} \cap C^{(i+1)})} \phi_{C^{(i+1)}} = \sum_{C^{(i+1)} \setminus S^{(i)}} \phi_{C^{(i+1)}} = \phi_{S^{(i)}}.
$$

Starting with the leftmost marginalisation,

$$
\begin{aligned}
\sum_{C^{(i)} \setminus (C_1 \cap C_2 \cap C^{(i)})} \phi_{C^{(i)}} &= \sum_{S^{(i)} \setminus (C_1 \cap C_2 \cap S^{(i)})} \; \sum_{C^{(i)} \setminus S^{(i)}} \phi_{C^{(i)}} \\
&= \sum_{S^{(i)} \setminus (C_1 \cap C_2 \cap S^{(i)})} \left( \sum_{C^{(i+1)} \setminus S^{(i)}} \phi_{C^{(i+1)}} \right) = \sum_{C^{(i+1)} \setminus (C_1 \cap C_2 \cap C^{(i+1)})} \phi_{C^{(i+1)}}.
\end{aligned}
$$

In particular, for i = 0, the marginalisations of ϕC1 and ϕC (1) coincide. The procedure is continued
along the path until the node C2 is reached. The result is proved.

Corollary 8.19. After the passage of a fully active schedule of flows, a junction tree is globally
consistent.

Proof This follows from Proposition 8.17, stating that after passage of a fully active schedule of flows a
junction tree T is locally consistent, together with Proposition 8.18.

The algorithm for updating considered the maximal cliques of a junction tree, which sent and received
messages locally; the global update is performed entirely by a series of local computations. By
organising the variables into maximal cliques and separators on a junction tree and determining a schedule,
there is no need for global computations in the inference problem; the global update is achieved
entirely by passing messages between neighbours in the tree according to a schedule and the algorithm
terminates automatically when the update is completed.

8.7 Using a Junction Tree with Virtual Evidence and Soft Evidence
The junction tree may be extended to the problem of updating in the light of virtual evidence and soft
evidence.
Dealing with virtual evidence is straightforward; for each virtual finding, one adds in a virtual node
as illustrated in Figure 6.2, which will be instantiated according to the virtual finding. This simply
adds the virtual finding node to the maximal clique containing the variable for which there is a virtual
finding.
If virtual evidence is given on a variable X with state space (x1 , . . . , xn ), and the evidence is given
in the form

$$
\rho_1 = 1, \qquad \rho_j = \frac{P_{E|X}(1 \mid x_j)}{P_{E|X}(1 \mid x_1)}, \quad j = 2, \ldots, n,
$$

the conditional probabilities

$$
P_{E|X}(1 \mid \cdot) = \begin{array}{cccc} x_1 & x_2 & \ldots & x_n \\ \hline a & a\rho_2 & \ldots & a\rho_n \end{array}
$$

may be used, for some a > 0 such that 0 ≤ PE∣X (1∣xj ) ≤ 1 for each j = 1, . . . , n.
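The role of the constant a can be seen in a small base R sketch. The prior on X, the ratios ρj and the particular choice a = 1/max ρj below are all assumptions made for the illustration; any a keeping the entries in [0, 1] would serve.

```r
prior <- c(x1 = 0.5, x2 = 0.3, x3 = 0.2)   # hypothetical current distribution of X
rho   <- c(1, 2, 0.5)                      # likelihood ratios, with rho_1 = 1

a   <- 1 / max(rho)                        # one admissible choice of a
lik <- a * rho                             # the table P_{E|X}(1 | x_j)

# Updating on E = 1: multiply by the likelihood and renormalise.
posterior <- prior * lik / sum(prior * lik)
print(posterior)

# A different admissible a gives the same updated distribution.
posterior_alt <- prior * (0.1 * rho) / sum(prior * (0.1 * rho))
print(all.equal(posterior, posterior_alt))
```

The constant cancels on normalisation, which is why only the ratios ρj need to be specified.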

To absorb soft evidence, remove the links into each variable Y1 , . . . , Ym to which soft evidence is
applied (that is, the links from its parents). Provided the nodes on which the soft evidence is received
are d-separated from each other and d-separated from the nodes on which hard evidence is received
after surgery, simply replace the conditional probabilities PYj ∣Π(Yj ) with P∗Yj ; the independence ensures
that the marginal probabilities for these variables after updating will remain as P∗Yj .

The Lazy big maximal clique algorithm If soft evidence and hard evidence are received on
variables that are d-connected after surgery, then incorporating soft evidence cannot be carried out
in such a straightforward manner. The problem is that the approach outlined above inserts P∗Yj , the
marginal probability after updating, in the place of the a-priori assessment PYj ∣Πj , without reference to
other pieces of evidence. The updated distribution should have P∗Yj as the marginal distribution over
Yj .
One method for incorporating soft evidence is discussed in [138]. The input is a Bayesian network
with a collection of soft and hard findings. The method returns a joint probability distribution with
two properties:

1. The findings are the marginal distributions for the updated distribution.

2. The updated distribution is the closest to the original distribution (where the Kullback Leibler
divergence is used) that satisfies this constraint (that the findings are the marginals of the updated
distribution).

The junction tree is modified to incorporate soft evidence in the following way.

1. After surgery, construct a junction tree, in which all the variables that have soft evidence are in
the same maximal clique - the big maximal clique C1 .

2. Let C1 (the big maximal clique) be the root node, apply the hard evidence and run the first half
of the fully active schedule; that is, propagating from the leaves to the root node.

3. Once the big maximal clique C1 has been updated with the information from all the other
maximal cliques, absorb all the soft evidence into C1 . This is described below.

4. Distribute the evidence according to the method described in Section 8.5, the second part; sending
the messages from the updated root out to the leaves.

If the big maximal clique is updated to provide a probability function (namely a non-negative function
that sums to 1), then the distribution of evidence will update the functions over the maximal cliques
and separators to probability distributions over the respective maximal cliques and separators.

Absorbing the Soft Evidence Suppose the big maximal clique C1 has soft evidence on the
variables (Y1 , . . . , Yk ). Suppose soft evidence is received that Y1 , . . . , Yk have distributions QY1 , . . . , QYk
respectively. Let QC1 denote the probability function over the variables in C1 after the soft evidence
has been absorbed. Then it is required that, for each j ∈ {1, . . . , k},

$$
Q_{Y_j} = \sum_{\mathcal{X}_{C_1 \setminus \{Y_j\}}} Q_{C_1}.
$$

That is, the marginal of QC1 obtained by summing over all variables other than Yj is QYj .
The important feature of soft evidence (Definition 6.4) is that after soft evidence has been received,
the variable has no parent variables. The Iterative Proportional Fitting Procedure (IPFP), therefore,
may be employed. It goes in cycles of length k . Firstly, normalise the function over C1 (after the hard
evidence has been received) so that it is a probability distribution PC1 . Then set

$$
P^{(0)}_{C_1} = P_{C_1}
$$

and, for m = 0, 1, 2, . . . and j = 1, . . . , k , set

$$
P^{(mk+j-1)}_{Y_j} = \sum_{\mathcal{X}_{C_1 \setminus \{Y_j\}}} P^{(mk+j-1)}_{C_1}
\qquad \text{and} \qquad
P^{(mk+j)}_{C_1} = \frac{P^{(mk+j-1)}_{C_1}\, Q_{Y_j}}{P^{(mk+j-1)}_{Y_j}}.
$$

This is repeated until the desired accuracy is obtained. It has been well established that, for discrete
distributions with finite state space, the IPFP algorithm converges to the distribution that minimises
the Kullback Leibler divergence from the original distribution (see, for example, [10] (1959)).
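The IPFP cycle can be sketched in a few lines of base R. The example below is a minimal illustration under stated assumptions: the big clique is taken to contain only the two binary variables Y1 and Y2 (so k = 2), the starting table P and the targets Q1 and Q2 are invented, all entries are strictly positive, and the function name `ipfp` and its stopping rule are hypothetical.

```r
# Hypothetical normalised clique table: rows index Y1, columns index Y2.
P  <- matrix(c(0.4, 0.1, 0.2, 0.3), 2, 2)
Q1 <- c(0.7, 0.3)                          # soft evidence on Y1
Q2 <- c(0.6, 0.4)                          # soft evidence on Y2

ipfp <- function(P, Q1, Q2, tol = 1e-12, max_cycles = 1000) {
  for (m in seq_len(max_cycles)) {
    P <- P * (Q1 / rowSums(P))               # j = 1: rescale so the Y1-margin is Q1
    P <- sweep(P, 2, Q2 / colSums(P), `*`)   # j = 2: rescale so the Y2-margin is Q2
    if (max(abs(rowSums(P) - Q1)) < tol) break
  }
  P
}

Q <- ipfp(P, Q1, Q2)
print(round(Q, 4))
print(rowSums(Q))    # should be close to Q1
print(colSums(Q))    # should be close to Q2
```

Each step multiplies whole rows or whole columns by constants, so the cross-product ratio of the 2 × 2 table is left unchanged; only the margins move towards the soft findings.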

Notes The original paper describing the use of junction trees for updating a Bayesian network is by
S.L. Lauritzen and D.J. Spiegelhalter [80]. The propagation presented is the approach of Lauritzen and
Spiegelhalter, discussed in [83]; the technicalities differ slightly between implementations in software.
The proofs of the main results for the message passing algorithm were originally presented in [33]. The
Iterative Proportional Fitting Procedure dates back to Deming and Stephan (1940) [35]. This is the
basis for updating a junction tree in the light of soft evidence. The basic technique is taken from M.
Valtorta, Y.G. Kim, J. Vomlel (2002) [138].
8.8 Exercises
1. Let PX1 ,X2 ,X3 ,X4 ,X5 be a probability distribution over five variables that has factorisation

$$
P_{X_1,X_2,X_3,X_4,X_5} = \frac{P_{X_1,X_2}\, P_{X_2,X_3,X_4}\, P_{X_4,X_5}}{P_{X_2}\, P_{X_4}}.
$$

Suppose hard evidence X3 = a is received. Let

$$
f_{X_1,X_2,X_3,X_4,X_5}(x_1, x_2, x_3, x_4, x_5) =
\begin{cases}
p_{X_1,X_2,X_3,X_4,X_5}(x_1, x_2, a, x_4, x_5) & x_3 = a \\
0 & x_3 \neq a.
\end{cases}
$$

Work through the stages of the message passing algorithm to obtain functions ψX1 ,X2 , ψX2 ,
ψX2 ,X3 ,X4 (., a, .), ψX4 , ψX4 ,X5 such that

$$
\begin{cases}
\psi_{X_1,X_2} = \sum_{x_4,x_5} f_{X_1,\ldots,X_5}(\cdot, \cdot, a, x_4, x_5), \\
\psi_{X_2} = \sum_{x_1,x_4,x_5} f_{X_1,\ldots,X_5}(x_1, \cdot, a, x_4, x_5), \\
\psi_{X_2,X_3,X_4} = \sum_{x_1,x_5} f_{X_1,\ldots,X_5}(x_1, \cdot, a, \cdot, x_5), \\
\psi_{X_4} = \sum_{x_1,x_2,x_5} f_{X_1,\ldots,X_5}(x_1, x_2, a, \cdot, x_5), \\
\psi_{X_4,X_5} = \sum_{x_1,x_2} f_{X_1,\ldots,X_5}(x_1, x_2, a, \cdot, \cdot)
\end{cases}
$$

and

$$
f_{X_1,\ldots,X_5} = \frac{\psi_{X_1,X_2}\, \psi_{X_2,X_3,X_4}\, \psi_{X_4,X_5}}{\psi_{X_2}\, \psi_{X_4}}.
$$

2. (a) Prove that Kruskal's algorithm returns a tree of maximal weight: Consider d nodes, labelled
(α1 , . . . , αd ) and a weight bij corresponding to each pair of nodes {αi , αj }. The tree of
maximal weight is the tree with nodes {α1 , . . . , αd } such that the score ∑e∈T be , where the
sum is taken over all edges e included in the tree, is greater than or equal to the score for
any other tree.
Kruskal's algorithm proceeds as follows:
i. The d variables yield d(d − 1)/2 edges. The edges are indexed in decreasing order,
according to their weights b1 , b2 , . . . , bd(d−1)/2 .
ii. The edges b1 and b2 are selected. Then the edge b3 is selected if it does not form a
cycle.
iii. This is repeated through b4 , . . . , bd(d−1)/2 , in that order, adding edges if they do not
form a cycle and discarding them if they form a cycle.
This may be proved by induction.
(b) Prove that Prim's algorithm returns a tree of maximal weight. This proceeds by first
choosing the edge of maximal weight and then subsequently choosing additional edges to
add to the tree where the additional link has maximal weight.
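For experimenting with Exercise 2(a) before attempting the proof, the steps of Kruskal's algorithm can be sketched in base R as follows. This is an illustration only: the function name `kruskal_max`, the data-frame layout and the four-node example with its weights are all assumptions, and cycles are detected by tracking a component label for each node.

```r
# edges: data frame with columns i, j (node indices) and b (weight).
kruskal_max <- function(d, edges) {
  edges <- edges[order(edges$b, decreasing = TRUE), ]   # step i: decreasing weights
  comp  <- seq_len(d)                 # component label of each node
  keep  <- logical(nrow(edges))
  for (r in seq_len(nrow(edges))) {
    ci <- comp[edges$i[r]]
    cj <- comp[edges$j[r]]
    if (ci != cj) {                   # joins two components, so no cycle is formed
      keep[r] <- TRUE
      comp[comp == cj] <- ci          # merge the two components
    }                                 # otherwise the edge is discarded
  }
  edges[keep, ]
}

# Hypothetical weights on the six edges between four nodes.
edges <- data.frame(i = c(1, 1, 2, 1, 2, 3),
                    j = c(2, 3, 3, 4, 4, 4),
                    b = c(5, 4, 3, 1, 2, 6))
tree <- kruskal_max(4, edges)
print(tree)
print(sum(tree$b))
```

In this example the three heaviest edges happen not to form a cycle, so the selected tree is clearly of maximal weight.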

3. Let C denote the set of maximal cliques from a triangulated graph. A pre-I-tree is a tree over C
with separators S = C1 ∩ C2 for adjacent maximal cliques C1 and C2 . The weight of a pre-I-tree
is the sum of the number of variables in the separators.


(a) Prove that a junction tree is a pre-I-tree of maximal weight.


(b) Prove that any pre-I-tree of maximal weight is a junction tree.

4. A polytree is a DAG whose skeleton is a tree.

(a) Prove that the moral graph of a polytree is triangulated.


(b) Prove that the separators in a junction tree for a polytree consist of exactly one variable.
8.9 Answers
1. Using X2 X3 X4 as root, X1 X2 → X2 X3 X4 : a trivial message is passed. h∗X2 (x2 ) = ∑x1 PX1 X2 (x1 , x2 ) =
PX2 (x2 ), so the update ratio is λ(x2 ) = PX2 (x2 )/PX2 (x2 ) = 1; PX2 (x2 ) is updated to PX2 (x2 ) and
PX2 ,X3 ,X4 is updated to

$$
\begin{cases}
P_{X_2,X_3,X_4}(x_2, a, x_4) & x_3 = a \\
0 & \text{otherwise.}
\end{cases}
$$

X4 X5 → X2 X3 X4 : again a trivial message is passed, h∗X4 (x4 ) = PX4 (x4 ), and the update ratio is 1.
After this message is passed, the maximal clique X2 X3 X4 is live, with potential

$$
\psi_{X_2,X_3,X_4}(x_2, x_3, x_4) =
\begin{cases}
P_{X_2,X_3,X_4}(x_2, a, x_4) & x_3 = a \\
0 & \text{otherwise.}
\end{cases}
$$

X2 X3 X4 → X1 X2 :

$$
\psi_{X_2}(x_2) = \sum_{x_4} \psi_{X_2,X_3,X_4}(x_2, a, x_4) = \sum_{x_4} P_{X_2,X_3,X_4}(x_2, a, x_4) = P_{X_2,X_3}(x_2, a),
$$

the update ratio is λ(x2 ) = PX2 ,X3 (x2 , a)/PX2 (x2 ) and

$$
\psi_{X_1,X_2}(x_1, x_2) = P_{X_1,X_2}(x_1, x_2)\,\frac{P_{X_2,X_3}(x_2, a)}{P_{X_2}(x_2)}.
$$

Similarly, for X2 X3 X4 → X4 X5 ,

$$
\psi_{X_4}(x_4) = P_{X_3,X_4}(a, x_4), \qquad
\psi_{X_4,X_5}(x_4, x_5) = P_{X_4,X_5}(x_4, x_5)\,\frac{P_{X_3,X_4}(a, x_4)}{P_{X_4}(x_4)}.
$$

2. See Lemma 15.12 page 305.

3. (a) Let T be any non-maximal spanning tree. Let T1 ⊂ T2 ⊂ . . . ⊂ T ′ denote a sequence of
maximal trees constructed through Prim's algorithm. Let the construction be so that a link
from T is chosen whenever possible. Let m be the first stage where this is not possible and
let C1 − C2 with separator S be the link actually chosen (C1 ∈ Tm , C2 ∉ Tm ; the separator S
of C1 − C2 has maximal weight among the links not yet used). In T , there is a path between
C1 and C2 . The path contains a link C3 − C4 with separator S ′ such that C3 ∈ Tm , C4 ∉ Tm .
Possibly C2 is C4 . Since C3 − C4 could not be chosen, it follows that ∣S ′ ∣ < ∣S∣ and therefore S
contains variables not in S ′ . Therefore, T does not satisfy the junction tree condition.
(b) Consider the tree of maximal weight constructed by Prim's algorithm and let T1 , . . . , Tn = T
denote the successive trees. Assume that T is not a junction tree; then at some stage m,
Tm can be extended to a junction tree T ′ while Tm+1 cannot. Let C1 − C2 with separator S
be the link chosen at this stage; C2 ∈ Tm+1 . Since Tm+1 cannot be extended to a junction
tree, the link C1 − C2 is not in T ′ , so there is a path in T ′ between C1 and C2 not containing
the link C1 − C2 . This path contains a link C3 − C4 with separator S ′ such that C3 ∈ Tm and
C4 ∉ Tm . Since T ′ is a junction tree, it follows that S ⊆ S ′ and since S was chosen through
Prim's algorithm, it follows that ∣S∣ ≥ ∣S ′ ∣ so that S = S ′ . Now remove the link C3 − C4 from
T ′ and add the link C1 − C2 . The result is a junction tree extending Tm+1 , contradicting
the assumption that it cannot be extended to a junction tree.

4. (a) Consider any cycle of length n ≥ 4 in the moral graph. If an edge in the cycle is removed
that was added at the moralisation stage, there will be a cycle of length n + 1. Successively
removing edges from the cycle that were added at the moralisation stage, the graph will
still have a cycle, containing the vee structure from the original graph instead of the parent-
parent edge. Hence, if the moral graph is not triangulated, the skeleton of the original graph
is not a tree, hence the original graph is not singly connected.
(b) The maximal cliques of the moral graph are the variable/parent configurations. Suppose
that there are two variables U and V in a separator. If U − V is a parent-parent configuration
in both maximal cliques being separated, then there is a cycle in the original graph, hence a
contradiction. If U − V represents a parent-child configuration in both maximal cliques
that it is separating, then there is a contradiction; both maximal cliques taken together
form a single complete subset, contradicting the fact that there are two maximal cliques.
If U − V represents a parent-parent configuration in one maximal clique and a parent-child
configuration in another, then there is a cycle in the skeleton of the original graph, hence a
contradiction.
Chapter 9

Bayesian Networks in R

9.1 Introduction
It has become clear that R is now the most effective and dominant language of statistical computing.
There are excellent packages available in R for Bayesian Networks, for inference using a given Bayesian
Network and for learning the structure of a Bayesian Network. This chapter introduces some of the
software in R available for Bayesian Networks and discusses graphs in R and inference using networks
that have already been dened. Parameter learning and structure learning are considered later.
The packages considered are gRain by Søren Højsgaard and bnlearn.
Having installed R and a suitable editor (for example Rstudio), the relevant packages have to be
installed.

gRain and related packages Information for gRain is available on the author's web page:

http://people.math.aau.dk/~sorenh/software/gR/

The package, along with all the supporting packages, has to be installed. As pointed out on the web
page, under `4 Installation', the package uses the packages graph, RBGL and Rgraphviz. These
packages are not on CRAN, but on `bioconductor'. To install these packages, execute

> install.packages("BiocManager")
> BiocManager::install(c("graph","RBGL","Rgraphviz"))

Warning This can take a long time. Furthermore, there may be some interactive questions requiring
yes/no answers.

After this, gRain may be installed from CRAN in the usual way:

> install.packages("gRain")

The package bnlearn also has some useful inference functions, although its main consideration is
learning. Install it in the usual way:

> install.packages("bnlearn")


9.2 Graphs in R
This section considers the various graphs that appear in graphical modelling and how to render them
in R. In addition to the packages mentioned so far, the package ggm has some useful functions for
graphical Markov models.

> install.packages("ggm")

Another useful graphics package is igraph:

> install.packages("igraph")

These packages should be activated:

> library("bnlearn")
> library("gRain")
> library("ggm")
> library("igraph")
> library(RBGL)
> library("gRbase")

9.2.1 Undirected Graphs


An undirected graph can be created using the ug() function. For example:

> ugraph <- ug(c("a","b"),c("b","c","d"),"e")


> ugraph
A graphNEL graph with undirected edges
Number of Nodes = 5
Number of Edges = 4

(The gRbase package contains the function ug(). It is automatically activated if gRain is activated.)
Plotting the graph requires the package Rgraphviz:

> library(Rgraphviz)
> plot(ugraph)

The default output of ug() is a graphNEL object. The arguments result = "igraph" or result
= "matrix" return an igraph object or an adjacency matrix instead. There is a plot method for igraph
objects in the igraph package.

> uigraph <- ug(~a:b+b:c:d+e, result="igraph")


>library("igraph")
> plot(uigraph,layout=layout.spring)

[Plot of the undirected graph with edges a − b, b − c, b − d and c − d and the isolated node e.]

Figure 9.1: Undirected Graph

Edges can be added or deleted quite easily using the addEdge() and removeEdge() commands:

> ugrapha <- addEdge("a","c",ugraph)


> ugraphb <- removeEdge("c","d",ugraph)

The nodes and edges can be recovered quite easily:

> nodes(ugraph)
[1] "a" "b" "c" "d" "e"
> str(edgeList(ugraph))
List of 4
$ : chr [1:2] "a" "b"
$ : chr [1:2] "b" "c"
$ : chr [1:2] "b" "d"
$ : chr [1:2] "c" "d"

The function maxClique() returns the cliques of the graph:

> maxClique(ugraph)
$maxCliques
$maxCliques[[1]]
[1] "b" "c" "d"

$maxCliques[[2]]
[1] "b" "a"

$maxCliques[[3]]
[1] "e"

ugraph is not complete; this can be seen using the is.complete command:

> is.complete(ugraph)
[1] FALSE

The command separates, from the RBGL package, indicates whether or not there is graphical
separation:

> separates("a","d",c("b","c"),ugraph)
[1] TRUE

This shows that {b, c} separates a and d.
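What graphical separation means can be spelt out without any graph package. The base R sketch below is an illustration only, not the RBGL implementation: the helper `separated` is hypothetical, and the adjacency matrix is entered by hand from the edges of ugraph. It searches outward from the first node without entering the separating set and reports whether the second node can still be reached.

```r
# Adjacency matrix of ugraph: edges a-b, b-c, b-d, c-d; e is isolated.
adj <- matrix(0, 5, 5, dimnames = list(letters[1:5], letters[1:5]))
for (e in list(c("a", "b"), c("b", "c"), c("b", "d"), c("c", "d"))) {
  adj[e[1], e[2]] <- adj[e[2], e[1]] <- 1
}

# TRUE if every path from `from` to `to` passes through a node in `sep`.
separated <- function(adj, from, to, sep) {
  reached  <- from
  frontier <- from
  while (length(frontier) > 0) {
    nb <- colnames(adj)[colSums(adj[frontier, , drop = FALSE]) > 0]
    frontier <- setdiff(nb, c(reached, sep))   # new nodes outside the separator
    reached  <- c(reached, frontier)
  }
  !(to %in% reached)
}

print(separated(adj, "a", "d", c("b", "c")))   # agrees with separates() above
print(separated(adj, "a", "d", "c"))           # the path a - b - d avoids c
```

Since b alone already blocks every path out of a, the set {b} would also separate a and d in this graph.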

Subgraphs can be obtained using subGraph. For example:

> usub <- subGraph(c("b","c","d","e"),ugraph)


> plot(usub)

The boundary bd(α) of a vertex α is the set of vertices adjacent to α, adj(α) which is equal (for
an undirected graph) to the set of neighbours. The closure is the boundary together with the node:
cl(α) = bd(α) ∪ {α}.

> adj(ugraph,"c")
$c
[1] "d" "b"

> closure("c",ugraph)
[1] "c" "d" "b"

We can also establish whether or not nodes are simplicial, if the graph is triangulated, and obtain the
connected components.

> is.simplicial("b",ugraph)
[1] FALSE
> simplicialNodes(ugraph)
[1] "a" "c" "d" "e"
> connComp(ugraph)
[[1]]
[1] "a" "b" "c" "d"

[[2]]
[1] "e"

> is.triangulated(ugraph)
[1] TRUE

If we want to establish if (A, B, S) forms a decomposition where S is complete and separates A and B ,
the function is is.decomposition

> is.decomposition("a","d",c("b","c"),ugraph)
[1] FALSE

A perfect elimination sequence can be obtained by mcs (maximum cardinality search):

> mcs(ugraph)
[1] "a" "b" "c" "d" "e"

We can have some control over the ordering:

> mcs(ugraph,root=c("d","c","a"))
[1] "d" "c" "b" "a" "e"

It is convenient if the cliques satisfy the running intersection property: for each j > 1,
Cj ∩ (C1 ∪ . . . ∪ Cj−1 ) ⊆ Ci for some i < j . Define Sj = Cj ∩ (C1 ∪ . . . ∪ Cj−1 ) and Rj = Cj /Sj with
S1 = ∅. Any clique Ci where Sj ⊆ Ci with i < j is a possible parent of Cj . The rip function returns
such a list if the graph is triangulated.

> rip(ugraph)
cliques
1 : b a
2 : b c d
3 : e
separators
1 :
2 : b
3 :
parents
1 : 0
2 : 1
3 : 0
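The running intersection property itself is easy to test for a given ordering of cliques. The base R sketch below is only an illustration of the definition, not of how rip works internally; the function name `has_rip` is hypothetical and the clique list is copied by hand from the output above.

```r
# TRUE if, for every j > 1, S_j = C_j ∩ (C_1 ∪ ... ∪ C_{j-1}) lies inside some earlier clique.
has_rip <- function(cliques) {
  for (j in seq_along(cliques)[-1]) {
    earlier <- cliques[seq_len(j - 1)]
    S_j <- intersect(cliques[[j]], unique(unlist(earlier)))
    in_one <- any(vapply(earlier, function(Ci) all(S_j %in% Ci), logical(1)))
    if (!in_one) return(FALSE)
  }
  TRUE
}

cliques <- list(c("b", "a"), c("b", "c", "d"), "e")    # ordering reported by rip(ugraph)
print(has_rip(cliques))

# The ordering matters: the same cliques of a chain can pass or fail.
print(has_rip(list(c("a", "b"), c("b", "c"), c("c", "d"))))
print(has_rip(list(c("a", "b"), c("c", "d"), c("b", "c"))))
```

In the last ordering the third clique meets the earlier ones in {b, c}, which is not contained in either of them, so the property fails.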

Graphs may be triangulated using the triangulate function:

> uguntriang <- ug(~a:b:c+c:d+d:e+a:e)


> is.triangulated(uguntriang)
[1] FALSE
> plot(uguntriang)
> utriang <- triangulate(uguntriang)
> is.triangulated(utriang)
[1] TRUE
> plot(utriang)

9.2.2 Directed Acyclic Graphs


A DAG may be created using the dag() function. It can be used in several ways. For example:

> dgraph <- dag(~a, ~b*a, ~c*a*b, ~d*c*e, ~e*a, ~g*f)


> plot(dgraph)

[Plot of the DAG with edges a → b, a → c, a → e, b → c, c → d, e → d and f → g.]
Figure 9.2: Directed Acyclic Graph

Nodes and edges may be listed as follows:

> nodes(dgraph)
[1] "a" "b" "c" "d" "e" "g" "f"
> str(edges(dgraph))
List of 7
$ a: chr [1:3] "b" "c" "e"
$ b: chr "c"
$ c: chr "d"
$ d: chr(0)
$ e: chr "d"
$ g: chr(0)
$ f: chr "g"

edges gives a list of the children for each node. Alternatively, the edges are listed by:

> str(edgeList(dgraph))
List of 7
$ : chr [1:2] "a" "b"
$ : chr [1:2] "a" "c"
$ : chr [1:2] "a" "e"

$ : chr [1:2] "b" "c"


$ : chr [1:2] "c" "d"
$ : chr [1:2] "e" "d"
$ : chr [1:2] "f" "g"

The vpar() function returns a list with an element for each node together with its parents.

> vpardgraph <- vpar(dgraph)


> vpardgraph$c
[1] "c" "a" "b"

The parents and children of a node, and the ancestral set an(A) of a set A (that is, A together with all
its ancestors), can be obtained by:

> parents("d",dgraph)
[1] "c" "e"
> children("c",dgraph)
[1] "d"
> ancestralSet(c("b","e"),dgraph)
[1] "a" "b" "e"
> ag <- ancestralGraph(c("b","e"),dgraph)
> plot(ag)

[Plot of the ancestral graph of {b, e}, on the nodes a, b and e.]

Figure 9.3: Directed Acyclic Graph

The moralize function moralises the graph:

> moral <- moralize(dgraph)


> plot(moral)

D-separation can be obtained by the dSep function from the ggm package.

> dSep(as(dgraph,"matrix"),"c","e","a")
[1] TRUE

[Plot of the moral graph of dgraph, on the nodes a, b, c, d, e, f and g.]

Figure 9.4: Moralised Graph

9.2.3 Mixed Graphs


Chain graphs, of which essential graphs are a subset, are mixed. They are represented in the graph
and igraph packages as directed graphs with multiple edges. A convenient way of defining them is to
use adjacency matrices.

> adjm<-matrix(c(0,1,1,0,1,0,0,1,1,0,0,0,1,1,1,0),nrow=4)
> rownames(adjm)<-colnames(adjm)<-letters[1:4]
> adjm
a b c d
a 0 1 1 1
b 1 0 0 1
c 1 0 0 1
d 0 1 0 0

This matrix can be used to create a graphNEL object:

> gG<-as(adjm,"graphNEL")
> plot(gG,"neato")

The graph is shown in Figure 9.5.


Note that Rgraphviz interprets symmetric entries as double-headed arrows. It does not distinguish
between bi-directed and undirected edges. The same is true if the graph is treated as an igraph object.
The graph from igraph is obtained as follows:

> gG1<-as(adjm,"igraph")
> plot(gG1,layout=layout.spring)

Figure 9.5: Mixed Graph

Is it a Chain Graph? The is.chaingraph() function from the lcd package determines whether a
mixed graph is a chain graph. The input is an adjacency matrix.

> install.packages("lcd")
> library(lcd)
> is.chaingraph(as(gG1,"matrix"))
$result
[1] FALSE

$vert.order
NULL

$chain.size
NULL

The graph is not a chain graph; a and d are in the same chain component and therefore there should
not be a directed edge a → d.
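
The reason for the failure can be made visible directly from the adjacency matrix. The base-R sketch below checks only one necessary condition for a chain graph (no directed edge inside a chain component); it is not the algorithm used by is.chaingraph(), and the variable names are illustrative.

```r
adjm <- matrix(c(0,1,1,0, 1,0,0,1, 1,0,0,0, 1,1,1,0), nrow = 4,
               dimnames = list(letters[1:4], letters[1:4]))

und <- adjm == 1 & t(adjm) == 1   # symmetric entries: undirected edges
dir <- adjm == 1 & t(adjm) == 0   # asymmetric entries: directed edges

# Nodes in the same chain component are linked by undirected paths.
reach <- (und | diag(nrow(adjm)) == 1) * 1
for (k in seq_len(nrow(adjm))) reach <- ((reach %*% reach) > 0) * 1
same_component <- reach == 1

# A directed edge inside a chain component rules out a chain graph.
offending <- which(dir & same_component, arr.ind = TRUE)
data.frame(from = rownames(adjm)[offending[, "row"]],
           to   = colnames(adjm)[offending[, "col"]])
```

For this matrix the undirected edges a - b and b - d place a and d in one component, so the directed edge from a to d is reported.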

9.3 Bayesian Networks


9.3.1 Specifying the Conditional Probability Potentials
Consider the `Asia' example of Lauritzen and Spiegelhalter. The conditional probability potentials may be specified
as follows:

> library(gRain)
Loading required package: gRbase
> yn <- c("yes","no")
> a<-cptable(~asia, values=c(1,99),levels=yn)
> t.a<-cptable(~tub+asia,values=c(5,95,1,99),levels=yn)
> s<-cptable(~smoke, values=c(5,5),levels=yn)
> l.s<-cptable(~lung+smoke,values=c(1,9,1,99),levels=yn)
> b.s<-cptable(~bronc+smoke,values=c(6,4,3,7),levels=yn)
> e.lt<-cptable(~either+lung+tub,values=c(1,0,1,0,1,0,0,1),levels=yn)
> x.e<-cptable(~xray+either,values=c(98,2,5,95),levels=yn)
> d.be<-cptable(~dysp+bronc+either, values=c(9,1,7,3,8,2,1,9), levels = yn)

The + operator could be considered slightly misleading: the first variable in the formula is the child and
the remaining variables are its parents. There are other ways to enter the conditional
probability potentials:

> t.a<-cptable(~tub|asia,values=c(5,95,1,99),levels=yn)
> t.a<-cptable(c("tub","asia"),values=c(5,95,1,99),levels=yn)

There are also special functions ortable() and andtable(). For example, e.lt could be entered by:

> e.lt <-ortable(~either+lung+tub, levels=yn)
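
The values of a logical-OR table can be generated rather than typed. The base-R sketch below is not the implementation of ortable(); it assumes the ordering used in the cptable() call for e.lt above (the child varies fastest, then lung, then tub) and reproduces the same vector of values.

```r
yn <- c("yes", "no")

# All parent configurations, with lung varying fastest.
parents <- expand.grid(lung = yn, tub = yn, stringsAsFactors = FALSE)

# either = yes exactly when at least one parent is yes.
either_yes <- as.numeric(parents$lung == "yes" | parents$tub == "yes")

# Interleave P(either = yes | .) and P(either = no | .) per configuration.
or_values <- as.vector(rbind(either_yes, 1 - either_yes))
or_values
```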

9.3.2 Building the Network


A network is created with the function grain(), which returns an object of class grain:

> plist<-compileCPT(list(a, t.a, s, l.s, b.s, e.lt, x.e, d.be))
> grn1<-grain(plist)
> summary(grn1)
Independence network: Compiled: FALSE Propagated: FALSE
Nodes : chr [1:8] "asia" "tub" "smoke" "lung" "bronc" "either" ...
> plot(grn1)

The plot is shown in Figure 9.6. The fictitious situation being modelled is the following: you return
from a visit to Asia and find that you have a cough. A visit to Asia increases the chances of catching
tuberculosis. Meanwhile, smoking causes both lung cancer and bronchitis. Tuberculosis and lung
cancer both give the same results for an x-ray. Bronchitis causes dyspnoea (shortness of breath);
lung cancer and tuberculosis have equal chances of causing dyspnoea.

9.3.3 Compilation - Finding the Clique Potentials


The network has to be compiled and propagated before queries can be made.

> grn1c<-compile(grn1)
> summary(grn1c)
Independence network: Compiled: TRUE Propagated: FALSE
Nodes : chr [1:8] "asia" "tub" "smoke" "lung" "bronc" "either" ...
Number of cliques: 6
Maximal clique size: 3
Maximal state space in cliques: 8

Figure 9.6: Asia Network

The various steps of compile can be carried out separately:

> g<-grn1$dag
> mg<-moralize(g)
> tmg<-triangulate(mg)
> rip(tmg)
cliques
1 : asia tub
2 : either lung tub
3 : either lung bronc
4 : smoke lung bronc
5 : either dysp bronc
6 : either xray
separators
1 :
2 : tub
3 : either lung
4 : lung bronc
5 : either bronc
6 : either
parents
1 : 0
2 : 1
3 : 2
4 : 3
5 : 3
6 : 5

> junctree<-rip(tmg)
> plot(junctree)

The plot is shown in Figure 9.7.

Figure 9.7: Junction Tree for Asia Network

9.3.4 Absorbing Evidence and Answering Queries


Suppose, for example, that we have evidence that someone has visited
Asia and has dyspnoea. This evidence is entered as follows:

> grn1c.ev<-
+ setFinding(grn1c,nodes=c("asia","dysp"),states=c("yes","yes"))

This creates a new grain object. The grain objects with evidence (grn1c.ev) and without evidence (grn1c) can be
queried to give marginal probabilities:

> querygrain(grn1c.ev,nodes=c("lung","bronc"),type="marginal")
$lung
lung
yes no
0.09952515 0.90047485

$bronc
bronc
yes no
0.8114021 0.1885979

> querygrain(grn1c,nodes=c("lung","bronc"),type="marginal")
$lung
lung
yes no
0.055 0.945

$bronc
bronc
yes no
0.45 0.55

The evidence in a grain object can be retrieved with the getFinding() function, while the probability
of observing the evidence is obtained using the pFinding() function:

> getFinding(grn1c.ev)
Finding:
asia: yes
dysp: yes
Pr(Finding)= 0.004501375
> pFinding(grn1c.ev)
[1] 0.004501375
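
For a network this small, the values returned by pFinding() and querygrain() can be cross-checked by brute-force enumeration of the joint distribution. The base-R sketch below is not how gRain works (gRain uses the junction tree); it assumes the reading of the cptable() values given above (first entry of each pair is the probability of yes), and the helper bern() is hypothetical.

```r
yn <- c("yes", "no")

# P(child = "yes" | parent state), read off the cptable() calls above.
p_asia  <- 0.01
p_smoke <- 0.5
p_tub   <- c(yes = 0.05, no = 0.01)   # indexed by asia
p_lung  <- c(yes = 0.10, no = 0.01)   # indexed by smoke
p_bronc <- c(yes = 0.60, no = 0.30)   # indexed by smoke
p_xray  <- c(yes = 0.98, no = 0.05)   # indexed by either
p_dysp  <- matrix(c(0.9, 0.7, 0.8, 0.1), 2, 2,
                  dimnames = list(bronc = yn, either = yn))

# Hypothetical helper: probability of a binary state given P(yes).
bern <- function(state, p_yes) ifelse(state == "yes", p_yes, 1 - p_yes)

# Enumerate all configurations; either is a deterministic OR of lung and tub.
cfg <- expand.grid(asia = yn, tub = yn, smoke = yn, lung = yn, bronc = yn,
                   xray = yn, dysp = yn, stringsAsFactors = FALSE)
cfg$either <- ifelse(cfg$lung == "yes" | cfg$tub == "yes", "yes", "no")

# Joint probability as the product of the conditional probability potentials.
cfg$p <- bern(cfg$asia, p_asia) * bern(cfg$smoke, p_smoke) *
  bern(cfg$tub, p_tub[cfg$asia]) * bern(cfg$lung, p_lung[cfg$smoke]) *
  bern(cfg$bronc, p_bronc[cfg$smoke]) * bern(cfg$xray, p_xray[cfg$either]) *
  bern(cfg$dysp, p_dysp[cbind(cfg$bronc, cfg$either)])

ev <- cfg$asia == "yes" & cfg$dysp == "yes"
p_evidence <- sum(cfg$p[ev])                              # compare with pFinding()
post_lung  <- sum(cfg$p[ev & cfg$lung == "yes"]) / p_evidence
prior_lung <- sum(cfg$p[cfg$lung == "yes"])
c(p_evidence = p_evidence, post_lung = post_lung, prior_lung = prior_lung)
```

Enumeration grows exponentially with the number of variables, which is why the junction tree is used in practice.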

Joint and conditional distributions may be computed as follows:

> querygrain(grn1c.ev,nodes=c("lung","bronc"),type="joint")
bronc
lung yes no
yes 0.06298076 0.03654439
no 0.74842132 0.15205354
> querygrain(grn1c.ev,nodes=c("lung","bronc"),type="conditional")
bronc
lung yes no
yes 0.07761966 0.1937688
no 0.92238034 0.8062312

These are both conditioned on the evidence; the former is the joint distribution of lung and bronc
conditioned on the evidence, while the latter is the conditional distribution of lung given bronc and
the evidence.
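
The relation between the two tables can be verified directly: dividing each bronc column of the joint table by its column total gives the conditional table. A minimal base-R check, using the numbers printed above:

```r
yn <- c("yes", "no")

# Joint distribution of lung and bronc given the evidence (copied from above).
joint <- matrix(c(0.06298076, 0.74842132, 0.03654439, 0.15205354), 2, 2,
                dimnames = list(lung = yn, bronc = yn))

# Conditional distribution of lung given bronc: normalise each column.
cond <- prop.table(joint, margin = 2)
round(cond, 4)
```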

If it is known beforehand that a specific subset U of the variables will be of interest, it is
computationally faster to ensure that they are in the same clique. Consider the grain object grn1c2, where
the variables of interest are forced into the root clique:

> grn1c2<-compile(grn1,root=c("lung","bronc","tub"),propagate=TRUE)
> grn1c2.ev<-setFinding(grn1c2,nodes=c("asia","dysp"),states=c("yes","yes"))

and now compare the computing times:

> system.time({for (i in 1:50)
+ querygrain(grn1c.ev,nodes=c("lung","bronc","tub"),type="joint")})
user system elapsed
1.275 0.004 1.279
> system.time({for (i in 1:50)
+ querygrain(grn1c2.ev,nodes=c("lung","bronc","tub"),type="joint")})
user system elapsed
0.012 0.000 0.013

The second method is much faster.

Evidence can be entered incrementally by calling setFinding() repeatedly. Set propagate=FALSE
while evidence is being entered and call propagate() at the end:

> grn1c.ev<-setFinding(grn1c,nodes="asia",states="yes",propagate=FALSE)
> grn1c.ev<-setFinding(grn1c.ev,nodes="dysp",states="yes",propagate=FALSE)
> grn1c.ev<-propagate(grn1c.ev)

Evidence can be retracted (removed) using the retractFinding() function:

> grn1c.ev<-retractFinding(grn1c.ev,nodes="asia")
> getFinding(grn1c.ev)
Finding:
dysp: yes
Pr(Finding)= 0.4359706

Omitting nodes implies that all the evidence is retracted:

> grn1c.ev<-retractFinding(grn1c.ev)
> getFinding(grn1c.ev)
NULL

9.3.5 Building a Network from Data


For an n × d data matrix x which represents n independent instantiations of d variables (X1 , . . . , Xd ),
the conditional probability potentials can be estimated. Recall that g is the DAG for the `Asia' network.
The input is: a data frame and a DAG where the nodes are the names of the variables. To avoid 0s
in the CPPs, a small smoothing number is added to all the frequencies (which are then normalised to
ensure that they are probabilities). This can be, for example, 0.1.

> plot(g)
> simdagchest<-grain(g,data=chestSim500)
extractCPT - data.frame
> simdagchest<-compile(simdagchest,propagate=TRUE,smooth=0.1)
> querygrain(simdagchest,nodes=c("lung","bronc"),type="marginal")
$lung
lung
yes no
0.046 0.954

$bronc
bronc
yes no
0.454 0.546
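
The effect of the smoothing number can be seen on a single table of counts. The base-R sketch below uses hypothetical counts (not taken from chestSim500) and is only meant to illustrate the idea described above: add a small constant to every cell, then normalise within each parent configuration.

```r
# Hypothetical counts of lung (rows) by smoke (columns); one cell is empty.
counts <- matrix(c(0, 40, 3, 57), 2, 2,
                 dimnames = list(lung = c("yes", "no"), smoke = c("yes", "no")))
smooth <- 0.1

cpt_raw    <- prop.table(counts, margin = 2)           # contains a zero
cpt_smooth <- prop.table(counts + smooth, margin = 2)  # strictly positive
cpt_raw
cpt_smooth
```

Each column still sums to one after smoothing, so the result is a valid conditional probability potential.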

Alternatively, a grain object may be built from an undirected triangulated graph. Recall that tmg is
g which has been moralised and then triangulated. Then

> simugchest<-grain(tmg,data=chestSim500,smooth=0.1)
extractCPT - data.frame
> simugchest<-compile(simugchest,propagate=TRUE)
> plot(simugchest)

9.3.6 Simulation using a Network


To simulate data from the Asia network, with the evidence that a person has visited Asia and has
returned with dyspnoea, the function simulate() may be used:

> simulate(grn1c.ev,nsim=5)
asia tub smoke lung bronc either xray dysp
1 yes yes no no no yes yes yes
2 yes no yes no yes no no yes
3 yes no yes no yes no no yes
4 yes no yes no yes no no yes
5 yes no yes no yes no no yes

The xtabs() function may be used to obtain (approximately) the joint distribution of lung and bronc
conditioned on the finding:

> xtabs(~lung+bronc, data=simulate(grn1c.ev,nsim=1000))/1000
bronc
lung yes no
yes 0.064 0.028
no 0.757 0.151

9.3.7 Prediction
The predict() function is used for prediction. The default is type = "class", which gives the class
with the highest probability, given the observed values of the predictors. Firstly, we generate some
data:

> mydata<-simulate(grn1c.ev,nsim=5)
> mydata
asia tub smoke lung bronc either xray dysp
1 yes no yes no yes no no yes
2 yes no no no yes no no yes
3 yes no no no yes no no yes
4 yes no yes no yes no no yes
5 yes no no no no no no yes

then we try to predict the most probable state of lung and the most probable state
of bronc, given the variables listed as predictors.

> predict(grn1c,response=c("lung","bronc"),newdata=mydata,
+ predictors=c("smoke","asia","tub","dysp","xray"),type="class")
$pred
$pred$lung
[1] "no" "no" "no" "no" "no"

$pred$bronc
[1] "yes" "yes" "yes" "yes" "yes"

$pFinding
[1] 0.002123915 0.001388412 0.001388412 0.002123915 0.001388412

These are read as follows: the variables lung and bronc are treated individually; this does not give the
joint most probable configuration. The conditional distribution of each of lung and bronc is obtained
as follows:

> predict(grn1c,response=c("lung","bronc"),newdata=mydata,
+ predictors=c("smoke","asia","tub","dysp","xray"),type="dist")
$pred
$pred$lung
yes no
[1,] 0.0036677551 0.9963322
[2,] 0.0005200187 0.9994800
[3,] 0.0005200187 0.9994800
[4,] 0.0036677551 0.9963322
[5,] 0.0005200187 0.9994800

$pred$bronc
yes no
[1,] 0.9221067 0.07789335
[2,] 0.7739757 0.22602430
[3,] 0.7739757 0.22602430
[4,] 0.9221067 0.07789335
[5,] 0.7739757 0.22602430

$pFinding
[1] 0.002123915 0.001388412 0.001388412 0.002123915 0.001388412

9.3.8 Building a Bayesian Network using bnlearn


Inference may also be carried out using the bnlearn package. For illustration, consider the gene
expression analysis from the paper by Sachs et al.:
Sachs, K.; Perez, O.; Pe'er, D.; Lauffenburger, D.A.; Nolan, G.P. (2005) Causal Protein-Signaling
Networks Derived from Multiparameter Single-Cell Data. Science 308 (5721): 523-529.
The relevant data is found in sachs.interventional.txt in the data directory of the course web
page:

http://www.mimuw.edu.pl/~noble/courses/BayesianNetworks/data/

Copy the file into your local directory, then load it into R.

> library(bnlearn)
> library(gRain)
> sachs.interventional <- read.table("~/data/sachs.interventional.txt", header=TRUE,
+ colClasses = "factor")
> isachs<-sachs.interventional

It is important to have colClasses = "factor". The Bayesian Network is constructed as follows:



> val.str=paste("[PKC][PKA|PKC][praf|PKC:PKA]",
+ "[pmek|PKC:PKA:praf][p44.42|pmek:PKA]",
+ "[pakts473|p44.42:PKA][P38|PKC:PKA]",
+ "[pjnk|PKC:PKA][plcg][PIP3|plcg]",
+ "[PIP2|plcg:PIP3]", sep="")
> val=model2network(val.str)
> isachs=isachs[, 1:11]
> for(i in names(isachs))
+ levels(isachs[, i]) = c("LOW","AVERAGE","HIGH")
> fitted = bn.fit(val, isachs, method = "bayes")

The variable val contains the DAG for the Bayesian network. Given the structure, bn.fit estimates
the conditional probabilities. There are several methods for doing this; with method = "bayes", the
Conditional Probability Potentials contain posterior estimates computed from the data.

Once the BN (DAG and CPPs) has been specified, we construct a junction tree for inference. The
junction tree algorithm is provided by the gRain package.

> jtree <- compile(as.grain(fitted))

Now suppose that we have hard evidence, or a finding, that node p44.42 is in state LOW. This is
inserted quite simply by:

> jprop <- setFinding(jtree, nodes = "p44.42",
+ states="LOW")

Let us now check the marginal distribution of the node pakts473 with and without the evidence.

> querygrain(jtree, nodes="pakts473")$pakts473
pakts473
LOW AVERAGE HIGH
0.60893407 0.31041282 0.08065311

The distribution of pakts473 conditioned on the evidence is:

> querygrain(jprop, nodes="pakts473")$pakts473
pakts473
LOW AVERAGE HIGH
0.665161776 0.333333333 0.001504891

The maximum a posteriori states may be found by finding the largest element of the target distribution:

> names(which.max(querygrain(jprop,nodes=c("PKA"))$PKA))
[1] "LOW"

The cpdist and cpquery commands from bnlearn answer the same kind of query approximately, by simulation:

> particles <- cpdist(fitted, nodes="pakts473",evidence=(p44.42=="LOW"))
> prop.table(table(particles))
particles
LOW AVERAGE HIGH
0.669962 0.330038 0.000000

The cpquery command returns the probability of a specific event, which is described by another logical
expression. For example:

> cpquery(fitted,event=(pakts473=="LOW")&(PKA != "HIGH"),
+ evidence = (p44.42 == "LOW")|(praf=="LOW"))
[1] 0.5696073
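
Because these answers are obtained by simulation, they vary slightly from run to run and differ from the exact junction tree values. The idea can be illustrated on a hypothetical two-node network A -> B in base R; this is a sketch of logic (rejection) sampling in general, not bnlearn's implementation, and the probabilities are made up for the illustration.

```r
set.seed(1)
n <- 100000

# Hypothetical network: P(A = 1) = 0.3, P(B = 1 | A = 1) = 0.8, P(B = 1 | A = 0) = 0.1.
a <- rbinom(n, 1, 0.3)
b <- rbinom(n, 1, ifelse(a == 1, 0.8, 0.1))

# Keep only the samples that agree with the evidence B = 1,
# then estimate P(A = 1 | B = 1) by a relative frequency.
keep <- b == 1
estimate <- mean(a[keep] == 1)

# Exact value from Bayes' rule, for comparison.
exact <- 0.3 * 0.8 / (0.3 * 0.8 + 0.7 * 0.1)
c(estimate = estimate, exact = exact)
```

The accuracy depends on how many samples survive the evidence filter, so unlikely evidence requires many more samples.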

9.4 Exercises
1. Professor Noddy is in his office when he receives the news that the burglar alarm in his home has
gone off. Convinced that a burglar has broken in, he starts to drive home. But, on his way, he
hears on the radio that there has been a minor earth tremor in the area. Since an earth tremor
can set off a burglar alarm, he therefore returns to his office.

(a) Construct the Bayesian network associated with the situation.


(b) Suppose that the variables are listed as R for the radio broadcast (y/n), A for the alarm
(y/n), B for the burglary (y/n) and E for the earthquake (y/n), where y stands for `yes'
and n stands for `no'. Suppose that the conditional probability potentials associated with
the Bayesian Network are

PR∣E (rows: E, columns: R)

E \ R    y       n
y        0.99    0.01
n        0.05    0.95

PA∣B,E (y∣., .) (rows: E, columns: B)

E \ B    y       n
y        0.98    0.95
n        0.95    0.03

PB

y       n
0.01    0.99

PE

y        n
0.001    0.999

Find
PB∣A (y∣y), PB∣A (y∣y), PB∣A,R (y∣y, y)

2. You have two CPPs from a Bayesian Network:

PB∣A (rows: A, columns: B)

A \ B    b1     b2     b3     b4
a1       0.6    0.1    0.2    0.1
a2       0.2    0.5    0.1    0.2

and

PC∣B (rows: B, columns: C)

B \ C    c1     c2
b1       0.8    0.2
b2       0.8    0.2
b3       0.2    0.8
b4       0.2    0.8

Establish whether or not A ⊥ C .


Figure 9.8: Sore throat model

3. Consider the Bayesian Network in Figure 9.8. You have a sore throat (T). There are two possible
causes; either you have a cold (C), or else you have Green Monkey Disease (G). A symptom of
GMD is spots (S).
The conditional probabilities are:

PF ∣C,G (y∣., .) (rows: G, columns: C)

G \ C    y        n
y        0.990    0.700
n        0.800    0.200

PT ∣C,G (y∣., .) (rows: G, columns: C)

G \ C    y        n
y        0.999    0.900
n        0.800    0.300

PS∣G (y∣.) (columns: G)

y        n
0.010    0.001

PC (y) = 0.20, PG (y) = 0.10

(a) Let E = (F, T, S) and enter the evidence e = (n, n, y). That is, {F = n, T = n, S = y}.
(b) Compute the updated joint probability distribution of C,G given the evidence, PC,G∣E (., .∣e).
(c) Compute the most probable explanation of the evidence e. This is the configuration of the
remaining variables V /E that gives the largest value of the joint probability PV /E,E (., e). It
therefore also maximises PV /E∣E (.∣e).
(d) Consider a vector of evidence variables E = (E1 , . . . , Em ), instantiated as e = (e1 , . . . , em ).
The conflict measure of the evidence is defined as:

\[ \operatorname{conf}(e) = \log_2 \frac{\prod_{j=1}^{m} P_{E_j}(e_j)}{P_E(e)}. \]

If the evidence variables are independent of each other, then the conflict measure will clearly
be zero. If the pieces of evidence corroborate each other, for example PE1 ∣E2 (e1 ∣e2 ) > PE1 (e1 ),
so that given E2 = e2 the event E1 = e1 is more likely than the unconditional event, the
conflict measure will be negative. If the pieces of evidence conflict, then the conflict measure
will be positive.
Compute the conflict measure for the evidence in this example.

4. Now consider the sachs.interventional.txt data in the notes. Find the moral graph, triangulate
it and construct a junction tree.
For the network with the parameters given in the notes from the sachs.interventional.txt
data, note that PKA is a parent of all the nodes in the praf -> pmek -> p44.42 -> pakts473
chain. Use the junction tree algorithm to update the probabilities over these nodes when we have
evidence that PKA is LOW and, separately, when we have evidence that PKA is HIGH.
Apply any of the other techniques discussed to this network.
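
The conflict measure of Exercise 3(d) is a one-line computation once the marginal probabilities of the individual pieces of evidence and the joint probability of the evidence are available. The base-R sketch below uses hypothetical probabilities (not those of the exercise), and the function name conflict is illustrative.

```r
# conf(e) = log2( prod_j P(e_j) / P(e) )
conflict <- function(marginals, joint) log2(prod(marginals) / joint)

indep <- conflict(c(0.2, 0.5), 0.10)   # joint equals the product of marginals
corro <- conflict(c(0.2, 0.5), 0.16)   # joint larger than the product
confl <- conflict(c(0.2, 0.5), 0.02)   # joint smaller than the product
c(independent = indep, corroborating = corro, conflicting = confl)
```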
Chapter 10

Conditional Gaussian variables

10.1 Conditional Gaussian Distributions


One very important family of distributions that is accommodated by standard Bayesian network
software is the family of conditional Gaussian distributions.
Let X = (X ∆ , X Γ ) where X ∆ is a discrete random vector and X Γ is a continuous random vector.
Let ∆ denote the indexing set for X ∆ and Γ the indexing set for X Γ variables. Let ∣∆∣ denote the
number of variables in ∆ and ∣Γ∣ denote the number of variables in Γ. Random vectors will be taken
as row vectors. The state space is

X = X1 × . . . × X∣∆∣ × X∣∆∣+1 × . . . × X∣∆∣+∣Γ∣ ,

where Xj denotes the state space for variable j . The following notation will also be used:

X∆ = X1 × . . . × X∣∆∣ , XΓ = X∣∆∣+1 × . . . × X∣∆∣+∣Γ∣ ,

X = X∆ × XΓ .

Attention is restricted to the case where the continuous variables, conditioned on the discrete variables,
have Gaussian distribution, so XΓ = R^∣Γ∣ . For the discrete variables,

\[ \mathcal{X}_j = \{ i_j^{(1)}, \ldots, i_j^{(k_j)} \}. \]

A particular configuration i ∈ X∆ is called a cell.

The following notation will be used to indicate that a random vector X 1 conditioned on X 2 = x2 has
distribution F :

X 1 ∣ X 2 = x2 ∼ F.

The moment generating function is useful for the definition of a multivariate normal distribution.

Definition 10.1 (Moment Generating Function). Let X = (X1 , . . . , Xd ) be a random vector. Its
moment generating function is the function MX ∶ Rd → R defined as

\[ M_X(p_1, \ldots, p_d) = E\left[ \exp\left\{ \sum_{j=1}^{d} p_j X_j \right\} \right]. \]

The moment generating function is useful, because it uniquely determines the distribution of a
random vector X . That is, a joint distribution determines a unique moment generating function, and
the moment generating function uniquely determines a corresponding joint distribution. The moment
generating function is essentially a Laplace transform.
A multivariate normal distribution is defined as follows:

Definition 10.2 (Multivariate Normal Distribution). A random vector X = (X1 , . . . , Xd ) is said to
have a multivariate normal distribution, written X ∼ N (µ, C), if its moment generating function is of
the form

\[ \phi(p_1, \ldots, p_d) = \exp\left\{ \sum_{j=1}^{d} p_j \mu_j + \frac{1}{2} \sum_{j,k} p_j p_k C_{jk} \right\}, \qquad p \in \mathbb{R}^d. \]

If a random vector X ∼ N (µ, C), then E[Xi ] = µi for each i = 1, . . . , d and Cov(Xi , Xj ) = Cij for each
(i, j). If C is positive definite, then the joint density function of X = (X1 , . . . , Xd ) is given by

\[ \pi_{X_1, \ldots, X_d}(x_1, \ldots, x_d) = \frac{1}{(2\pi)^{d/2} |C|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu) C^{-1} (x - \mu)^t \right\}, \qquad x \in \mathbb{R}^d, \]

where x = (x1 , . . . , xd ) and µ = (µ1 , . . . , µd ) are row vectors and ∣C∣ denotes the determinant of C .
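
The density formula can be checked numerically. The base-R sketch below evaluates it with the row-vector convention used here and compares the result, for a diagonal covariance matrix, with a product of univariate normal densities; the function name dmvnorm_row is hypothetical and the numbers are arbitrary.

```r
# Multivariate normal density for a row vector x, mean mu and covariance C.
dmvnorm_row <- function(x, mu, C) {
  d <- length(mu)
  dev <- matrix(x - mu, nrow = 1)                     # row vector (x - mu)
  quad <- as.numeric(dev %*% solve(C) %*% t(dev))     # (x - mu) C^{-1} (x - mu)^t
  exp(-quad / 2) / ((2 * pi)^(d / 2) * sqrt(det(C)))
}

mu <- c(0, 0)
C  <- diag(c(1, 4))       # independent components with variances 1 and 4
x  <- c(0.3, -1)

joint_density   <- dmvnorm_row(x, mu, C)
product_density <- dnorm(0.3, mean = 0, sd = 1) * dnorm(-1, mean = 0, sd = 2)
c(joint_density, product_density)
```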

The conditional Gaussian distribution, or CG distribution, may now be defined.

Definition 10.3 (CG Distribution). A collection of random variables X = (X ∆ , X Γ ) is said to follow
a CG distribution if for each i ∈ X∆ ,

X Γ ∣{X ∆ = i} ∼ N (µ(i), C(i)) . (10.1)

The notation for such a Conditional Gaussian distribution is

X ∼ CG(∣∆∣, ∣Γ∣).

If the numbers of discrete and continuous random variables are, respectively, ∣∆∣ = p and ∣Γ∣ = q , then
X ∼ CG(p, q).

If C(i)^−1 is well defined, then the conditional density function of X Γ conditioned on X ∆ = i is

\[ \pi_{X_\Gamma \mid X_\Delta}(x \mid i) = \frac{1}{(2\pi)^{q/2} \sqrt{\det C(i)}} \exp\left\{ -\frac{1}{2} (x - \mu(i)) C(i)^{-1} (x - \mu(i))^t \right\} \tag{10.2} \]

for all i ∈ X∆ such that PX ∆ (i) > 0.
For this discussion, it is assumed that PX ∆ (i) > 0 for each i ∈ X∆ .

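Directly from Equation (10.2),

\[ P_{X_\Delta}(i)\, \pi_{X_\Gamma \mid X_\Delta}(x \mid i) = \chi(i) \exp\left\{ g(i) + x h(i) - \frac{1}{2} x K(i) x^t \right\} \tag{10.3} \]

where χ(i) = 1 if PX ∆ (i) > 0 and 0 if PX ∆ (i) = 0,

\[ h(i) = C(i)^{-1} \mu(i)^t, \tag{10.4} \]

\[ K(i) = C(i)^{-1} \tag{10.5} \]

and

\[ g(i) = \log P_{X_\Delta}(i) + \frac{1}{2} \left( \log \det K(i) - |\Gamma| \log 2\pi - \mu(i) K(i) \mu(i)^t \right). \tag{10.6} \]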
From Equation (10.2), it is clear that conditioning on the discrete variables gives a family of multivariate
normal distributions. The canonical parameters of the Gaussian distribution are defined as (h(i), K(i))
and the mean parameters as (µ(i), C(i)). Conditioned on X ∆ = i,

\[ E\left[ X_\Gamma \mid X_\Delta = i \right] = \mu(i) \]

and

\[ E\left[ (X_\Gamma - \mu(i))^t (X_\Gamma - \mu(i)) \mid X_\Delta = i \right] = C(i) \]

(where the random vectors are taken to be row vectors).

Parametrisation of the CG Distribution The canonical parameters for the joint distribution,
defined by the pair of functions (PX ∆ , πX Γ ∣X ∆ ), are defined as (g, h, K), where the parameters (h(i), K(i))
are defined by Equations (10.4) and (10.5) respectively and g(i) is defined by Equation (10.6).
Similarly, the mean parameters are defined as (P, µ, C), where (µ(i), C(i)) are the mean parameters
of the conditional distribution and P(i) is the probability function over the discrete variables.
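
The passage between mean parameters and canonical parameters can be checked numerically for a single cell i. The base-R sketch below uses arbitrary illustrative values for P(i), µ(i) and C(i), computes (g, h, K) from Equations (10.4) to (10.6), and confirms that the canonical form (10.3) agrees with P(i) times the Gaussian density (10.2).

```r
# Mean parameters for one cell i (illustrative values only).
p  <- 0.25
mu <- c(1, -2)
C  <- matrix(c(2, 0.5, 0.5, 1), 2, 2)

# Canonical parameters, Equations (10.4) to (10.6).
K <- solve(C)
h <- K %*% mu                                   # column vector C^{-1} mu^t
g <- log(p) + 0.5 * (log(det(K)) - length(mu) * log(2 * pi) -
                       as.numeric(t(mu) %*% K %*% mu))

# Evaluate both forms at an arbitrary point x.
x <- c(0.4, -1.5)
canonical <- exp(g + sum(x * h) - 0.5 * as.numeric(t(x) %*% K %*% x))
dev <- x - mu
direct <- p * exp(-0.5 * as.numeric(t(dev) %*% K %*% dev)) /
  ((2 * pi)^(length(mu) / 2) * sqrt(det(C)))

# Going back from canonical to mean parameters.
C_back  <- solve(K)
mu_back <- as.numeric(C_back %*% h)
c(canonical = canonical, direct = direct)
```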

10.1.1 Some Results on Marginalization


The aim will be to factorise the CG distribution along an appropriate junction tree, so that evidence
can be inserted and propagated. The difficulty arises that if a CG distribution is marginalised over
some of its discrete variables, the resulting distribution is in general no longer CG. For efficient computation
along a junction tree, it is desirable that the CG property be preserved as far as possible for the
marginal distributions on the cliques and separators. The following results give properties that help
determine appropriate factorisations of the distribution.
Proposition 10.4. Let X have a CG distribution. Let V = ∆ ∪ Γ denote the indexing set. Let A and
B be two disjoint sets such that V = A ∪ B , then the conditional distribution of X A given X B = xB is
CG.

Proof The following calculation shows that X A∩Γ ∣ {X B = xB } ∪ {X A∩∆ = xA∩∆ } has a multivariate
Gaussian distribution. Firstly, it is clear that

πX A∩Γ ∣X A∩∆ ,X B (xA∩Γ ∣ xA∩∆ , xB ) = πX A∩Γ ∣X ∆ ,X B∩Γ (xA∩Γ ∣ x∆ , xB∩Γ ) .

The conditional density function on the right hand side is obtained by conditioning the distribution
of (X A∩Γ , X B∩Γ ) ∣ X ∆ = x∆ on X B∩Γ = xB∩Γ . Since (X A∩Γ , X B∩Γ ) ∣ X ∆ = x∆ has a multivariate Gaussian
distribution, and the conditional distribution of a multivariate Gaussian, conditioning on some of
its component variables, is again multivariate Gaussian, it follows that the conditional distribution is
multivariate Gaussian. The proof is complete.
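
The fact used in the proof, that conditioning a multivariate Gaussian on some of its components gives a Gaussian, comes with explicit formulae for the conditional mean and covariance. The base-R sketch below applies the standard formulae in the row-vector convention; the function name cond_gaussian and the numbers are illustrative.

```r
# Conditional distribution of X_A given X_B = xB for X ~ N(mu, C):
#   mean = mu_A + (xB - mu_B) C_BB^{-1} C_BA
#   cov  = C_AA - C_AB C_BB^{-1} C_BA
cond_gaussian <- function(mu, C, A, B, xB) {
  CBB_inv <- solve(C[B, B, drop = FALSE])
  dev <- matrix(xB - mu[B], nrow = 1)
  list(
    mean = mu[A] + as.numeric(dev %*% CBB_inv %*% C[B, A, drop = FALSE]),
    cov  = C[A, A, drop = FALSE] -
      C[A, B, drop = FALSE] %*% CBB_inv %*% C[B, A, drop = FALSE]
  )
}

mu <- c(0, 0)
C  <- matrix(c(2, 1, 1, 2), 2, 2)
res <- cond_gaussian(mu, C, A = 1, B = 2, xB = 1)
res$mean
res$cov
```

The conditional covariance does not depend on the observed value xB, only on which components are conditioned on.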
If the variables to be marginalised are discrete, then complicated mixture distributions arise. The
following proposition gives a situation where the marginalisation yields a CG distribution.

Proposition 10.5. Let A ⊆ V denote a subset of the indexing set for the variables. If X is CG and
B = V ∖ A (namely, B is the set of all indices in V that are not in A) and B ⊆ ∆ and

X B ⊥ X Γ ∣ X ∆∖B ,

then X A ∼ CG.

Proof Clearly, from the definition of a CG distribution, it is necessary and sufficient to show that

X A∩Γ ∣ X ∆∖B ∼ N∣A∩Γ∣

(multivariate normal, with dimension ∣A ∩ Γ∣). The proof requires the following identity: If X B ⊥ X Γ ∣
X ∆∖B , then

πX Γ ∣X ∆ (xΓ ∣x∆ ) = πX Γ ∣X B ,X ∆∖B (xΓ ∣xB , x∆∖B ) = πX Γ ∣X ∆∖B (xΓ ∣x∆∖B ).

This is a straightforward consequence of the definition of conditional independence. Recall that, from
the definition of a CG distribution, πX Γ ∣X ∆ (xΓ ∣ x∆ ) is a multivariate normal density. Therefore
the conditional distribution of X Γ conditioned on X ∆∖B is multivariate Gaussian, and hence the
conditional distribution of X Γ∩A conditioned on X ∆∖B is multivariate Gaussian. The proof is complete.
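
To see why the independence condition is needed, consider the simplest case in which all discrete variables are marginalised, so that B = ∆. The following short calculation is an illustration added under that assumption. The marginal density of the continuous variables is the mixture

\[ \pi_{X_\Gamma}(x) = \sum_{i \in \mathcal{X}_\Delta} P_{X_\Delta}(i)\, \pi_{X_\Gamma \mid X_\Delta}(x \mid i), \]

where each component is the N(µ(i), C(i)) density of Equation (10.2). In general this is a mixture of Gaussians and not itself Gaussian. If, however, X_∆ ⊥ X_Γ, then the components do not depend on i, that is µ(i) = µ and C(i) = C for every cell, and

\[ \pi_{X_\Gamma}(x) = \pi_{X_\Gamma \mid X_\Delta}(x \mid i) \sum_{i \in \mathcal{X}_\Delta} P_{X_\Delta}(i) = \pi_{X_\Gamma \mid X_\Delta}(x \mid i), \]

which is the N(µ, C) density, in agreement with Proposition 10.5.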

10.1.2 CG Regression
An important special case of CG distributions consists of those that follow a CG regression. The requirement
here is that the continuous variables depend linearly on their continuous parents. This is the situation
that is treated by most software with a facility for CG distributions.

Definition 10.6 (CG Regression). Let Z = (Z1 , . . . , Zs ) be a continuous random (row) vector and let
I be a discrete random (row) vector with probability function pI . Let I denote the state space for I . If
a random (row) vector Y = (Y1 , . . . , Yr ) has the property that

Y ∣ {I = i, Z = z} ∼ Nr (A(i) + zB(i), C(i)) ,

where, for each i ∈ I ,

• A(i) is a 1 × r row vector,

• B(i) is an s × r matrix,

• C(i) is an r × r positive semi-definite symmetric matrix,

then Y is said to follow a CG regression.

Let X denote a random vector, containing both discrete and continuous variables, which have been
ordered so that the probability distribution may be factorised along a Directed Acyclic Graph G =
(V, D). Let Xγ be a continuous variable, with parent set Π(γ). Suppose that X has a CG distribution
that satisfies the additional CG regression requirement. Then the conditional distribution for Xγ ,
conditioned on its parent nodes Π(γ), is the CG regression

Xγ ∣ {Πd (γ) = i, Πc (γ) = z} ∼ N (α(i) + zβ, σ 2 (i)) ,

where the discrete variables Πd (γ) of Π(γ) take values i and the continuous variables Πc (γ) of Π(γ)
take values z . Here α(i) is a number, σ 2 (i) = V(Xγ ∣Π(γ)) and β is a column vector with dimension
equal to the dimension of the continuous component z so that zβ is a well defined inner product. Thus,
the conditional density is Gaussian, ϕ(i, z, xγ ), equal to

\[ \phi(i, z, x_\gamma) = \frac{1}{\sqrt{2\pi}\, \sigma(i)} \exp\left\{ -\frac{(x_\gamma - \alpha(i) - z\beta)^2}{2\sigma^2(i)} \right\}. \tag{10.7} \]
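
The CG regression density is a univariate normal density whose mean α(i) + zβ depends on both the discrete and the continuous parents. The base-R sketch below evaluates it for hypothetical parameter values (two discrete parent states and two continuous parents); the function name and all numbers are illustrative.

```r
# Density of X_gamma given discrete parent state i and continuous parents z,
# for a CG regression with mean alpha(i) + z beta and variance sigma2(i).
cg_regression_density <- function(x, i, z, alpha, beta, sigma2) {
  m <- alpha[[i]] + sum(z * beta)       # conditional mean
  exp(-(x - m)^2 / (2 * sigma2[[i]])) / sqrt(2 * pi * sigma2[[i]])
}

# Hypothetical parameters.
alpha  <- c(low = 1.0, high = 3.0)
sigma2 <- c(low = 0.5, high = 2.0)
beta   <- c(0.5, -1.0)

value <- cg_regression_density(2.0, "high", z = c(1, 0.2), alpha, beta, sigma2)
value
```

Changing the discrete state i shifts the intercept and the variance, while the continuous parents enter only through the linear term.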
Example 10.7.
This example is taken from [84]. The emissions from a waste incinerator differ because of compositional
differences in incoming waste. Another important factor is the way in which the waste is burnt, which
can be monitored by measuring the concentration of carbon dioxide in the emissions. The efficiency
of the filter depends on its technical state and also on the amount and composition of the waste. The
emission of heavy metals depends both on the concentration of metals in the incoming waste and the
emission of dust particles in general. The emission of dust is monitored by measuring the penetration
of light.
The situation may be modelled using a directed acyclic marked graph (DAMG), shown in Figure 10.1;
marked because there are two types of nodes. In this case, these are discrete and continuous. In
HUGIN, nodes with a double circle are continuous nodes. The categorical variables are F : filter
state, W : waste type, B : method of burning. The continuous variables are Min : metals in the waste,
Mout : metals emitted, E : filter efficiency, D : dust emission, C : carbon dioxide concentration in
emission and L : light penetration. The set ∆ = {F, W, B} is the set of discrete variables, while
Γ = {C, D, E, L, Min , Mout } is the set of continuous variables.
If HUGIN is being used, the graph is entered first, using double circles to indicate `Gaussian' nodes,
and the conditional probability distributions are then inserted. For a conditional Gaussian distribution, the
continuous nodes cannot have discrete nodes as descendants. For a continuous node, HUGIN requests
the mean and variance, the parameters to describe a CG regression.

Figure 10.1: Marked Graph

10.2 The Junction Tree for Conditional Gaussian Distributions


When a CG distribution is arranged as a directed acyclic marked graph, it is done in such a way that
no continuous nodes have discrete children. The assumption is that for a continuous variable X , with
parents Π(X) = (Πd (X), Πc (X)), where Πd (X) are the discrete parents and Πc (X) are the continuous
parents,

X∣{Πd (X) = y, Πc (X) = z} ∼ N (α(y) + (β(y), z), γ(y)),

where (β(y), z) denotes the inner product of β(y) and z .

This section describes a junction tree approach due to Lauritzen [81] (1992), for finding the updated
conditional Gaussian distribution when hard evidence is inserted on some of the nodes. The problem
here is that while marginalising a CG distribution over one of its continuous variables gives another CG
distribution, marginalising a CG distribution over one of its discrete variables does not necessarily give
a CG distribution. Therefore, care has to be taken in the construction of the junction tree. Ideally, the
junction tree should be constructed so that the marginal distributions over the cliques and separators
are CG distributions, to enable appropriate marginalisations to be made. This requires some additional
restrictions on the construction of the cliques and separators.

Definition 10.8 (Marked Graph). A marked graph is a graph where there are several types of nodes;
the type of the node is the mark.

In the context of directed acyclic graphs for conditional Gaussian distributions, there are two markings;
discrete and continuous, for the types of variables represented by each type of node.

Definition 10.9 (CG Decomposition). A triple (A, B, S) of disjoint subsets of the node set V of an
undirected marked graph G is said to form a CG decomposition of G if V = A ∪ B ∪ S and the following
three conditions hold:

1. S separates A from B ,

2. S is a complete subset of V ,

3. Either S ⊆ ∆, or B ⊆ Γ or both.

When this holds, (A, B, S) is said to CG-decompose G into the components GA∪S and GB∪S .

If only the first two conditions hold, then (A, B, S) is said to form a decomposition. Thus, a
decomposition ignores the markings of the graph, while a CG decomposition takes them into account.
The logic is as follows: if B contains only continuous nodes with multivariate Gaussian distribution,
then the marginal over the separator will again be multivariate Gaussian. If the separator contains
only discrete nodes and B contains both continuous and discrete nodes, then marginalising first over all the Gaussian
nodes in B and then marginalising over the discrete nodes in B not in the separator gives the exact
probability distribution over the separator.
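
The three conditions of Definition 10.9 are easy to check mechanically for a given triple. The base-R sketch below is an illustration with a hypothetical helper and a hypothetical three-node graph d1 - c1 - c2 (d1 discrete, c1 and c2 continuous). It uses the observation that, when V is the disjoint union of A, B and S, the set S separates A from B exactly when there is no edge between A and B.

```r
# Hypothetical helper: check the conditions of Definition 10.9.
is_cg_decomposition <- function(adj, A, B, S, discrete) {
  separates <- !any(adj[A, B] == 1)                 # condition 1: no A-B edge
  S_sub <- adj[S, S, drop = FALSE]
  complete <- all(S_sub[upper.tri(S_sub)] == 1)     # condition 2: S complete
  marks_ok <- all(S %in% discrete) || !any(B %in% discrete)  # condition 3
  c(decomposition = separates && complete,
    cg_decomposition = separates && complete && marks_ok)
}

nodes <- c("d1", "c1", "c2")
adj <- matrix(0, 3, 3, dimnames = list(nodes, nodes))
adj["d1", "c1"] <- adj["c1", "d1"] <- 1
adj["c1", "c2"] <- adj["c2", "c1"] <- 1

res1 <- is_cg_decomposition(adj, A = "d1", B = "c2", S = "c1", discrete = "d1")
res2 <- is_cg_decomposition(adj, A = "c2", B = "d1", S = "c1", discrete = "d1")
rbind(res1, res2)
```

The two calls differ only in the roles of A and B, which shows that a CG decomposition, unlike an ordinary decomposition, is not symmetric in A and B.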

Definition 10.10 (CG Decomposable). An undirected marked graph is said to be CG decomposable
if it is complete, or if there exists a CG decomposition (A, B, S), where both A and B are non empty,
into CG decomposable sub-graphs GA∪S and GB∪S .

Decomposable unmarked graphs are triangulated; any cycle of length 4 or more has a chord. CG
decomposable marked graphs are further characterised by requiring that if there is a path between two
discrete variables whose intermediate nodes are all continuous, then there is an edge between the two discrete
variables.

Proposition 10.11. For an undirected marked graph G , the following are equivalent:

1. G is CG decomposable.

2. G is triangulated, and for any path (δ1 , α1 , . . . , αn , δ2 ) between two discrete nodes (δ1 , δ2 ) where
(α1 , . . . , αn ) are all continuous, δ1 and δ2 are neighbours.

3. For any α and β in G , every minimal (α, β) separator is complete. If both α and β are discrete,
then their minimal separator contains only discrete nodes.
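
The additional path condition in statement 2 can be tested on a small marked graph. The base-R sketch below uses a hypothetical helper and a hypothetical graph with discrete nodes d1, d2 and a continuous node c1; it checks only the path condition and does not test triangulation.

```r
# Hypothetical helper: pairs of non-adjacent discrete nodes that are joined
# by a path whose intermediate nodes are all continuous.
violating_pairs <- function(adj, discrete) {
  if (length(discrete) < 2) return(character(0))
  cont <- setdiff(rownames(adj), discrete)
  # Reachability within the sub-graph induced by the continuous nodes.
  reach <- (adj[cont, cont, drop = FALSE] == 1 | diag(length(cont)) == 1) * 1
  for (k in seq_along(cont)) reach <- ((reach %*% reach) > 0) * 1
  out <- character(0)
  for (p in combn(discrete, 2, simplify = FALSE)) {
    if (adj[p[1], p[2]] == 1) next                 # already neighbours
    n1 <- cont[adj[p[1], cont] == 1]               # continuous neighbours of p[1]
    n2 <- cont[adj[p[2], cont] == 1]               # continuous neighbours of p[2]
    if (length(n1) > 0 && length(n2) > 0 && any(reach[n1, n2] == 1))
      out <- c(out, paste(p, collapse = "-"))
  }
  out
}

nodes <- c("d1", "d2", "c1")
adj <- matrix(0, 3, 3, dimnames = list(nodes, nodes))
adj["d1", "c1"] <- adj["c1", "d1"] <- 1
adj["d2", "c1"] <- adj["c1", "d2"] <- 1
before <- violating_pairs(adj, discrete = c("d1", "d2"))

adj["d1", "d2"] <- adj["d2", "d1"] <- 1            # add the missing edge
after <- violating_pairs(adj, discrete = c("d1", "d2"))
list(before = before, after = after)
```

In the first graph the path d1 - c1 - d2 violates the condition; adding the edge between d1 and d2 removes the violation.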

Proof of 1 ⇒ 2 The proof, as before for unmarked graphs, is by induction. The inductive hypothesis
is: all undirected CG decomposable graphs with n or fewer nodes are triangulated and satisfy the
conditions of statement 2.
This is clearly true for a graph on one node.
Let G be a CG decomposable graph on n + 1 nodes.
Either G is complete, in which case the properties of 2 clearly follow,
Or there exist sets A, B , S , where V = A ∪ B ∪ S , where either B ⊆ Γ or S ⊆ ∆ or both, and such
that GA∪S and GB∪S are CG decomposable. Then any cycle of length 4 or more without a chord must pass
through both A and B . By decomposability, S separates A from B . Therefore the cycle must pass
through S at least twice. Since S is complete, the cycle will therefore have a chord. Since GA∪S and
GB∪S are triangulated, it follows that G is also triangulated. If the nodes of S are discrete, it follows
that any path between two discrete variables passing through S satisfies the condition of statement 2.
If B ⊆ Γ, then since all paths in GA∪S and all paths in GB∪S satisfy the condition of statement 2, it is
clear that all paths passing through S will also satisfy the condition of statement 2. It follows that G
satisfies the conditions of statement 2.

Proof of 2 ⇒ 3 Assume that G is triangulated, with the additional property in statement 2. Consider
two nodes α and β and let S be their minimal separator. Let A denote the set of all nodes that may
be connected to α by a trail that does not contain nodes in S and let B denote all nodes that may be
connected to β by a trail that does not contain nodes in S . Every node γ ∈ S must be adjacent to some
node in A and some node in B , otherwise GV /(S/{γ}) would not be connected. This would contradict
the minimality of S , since S/{γ} would separate α from β . Suppose that the condition in statement
2 holds and consider the minimal separator for two discrete nodes α, β , which are not neighbours.
The separator is complete. Denote the separator by S . Consider Ŝ , which is S with the continuous
nodes removed. Then Ŝ separates α and β on the sub graph induced by the discrete variables. But
the condition of statement 2 implies that α and β are also separated on G . Therefore, Ŝ separates α
and β . It follows that the minimal separator for two discrete nodes contains only discrete nodes.

Proof of 3 ⇒ 1 If G is complete, then it is CG decomposable by definition and the result is clear.
Otherwise, let α and β be two discrete nodes that are not neighbours and let S denote their
minimal separator. Let A denote the connected component of V /S containing α and let B = V /(A ∪ S).
Then (A, B, S) provides a decomposition, with S ⊆ ∆. Suppose that two such discrete nodes cannot
be found. Let α and β be two nodes that are not neighbours, where β
is continuous. Let S denote the minimal separator. Let B denote the connected component
of V /S containing β. Suppose that B contains a discrete node γ. Then S separates γ from α and
therefore consists entirely of discrete nodes. Therefore, either S ⊆ ∆, or B ⊆ Γ, as required.

The construction of the junction tree has to be modified. Starting from the directed acyclic graph, the
graph is first moralised by adding in the links between all the parents of each variable and then making
all the edges undirected, as before. Then, sufficient edges are added in to ensure that the graph is
CG decomposable.
Next, a junction tree is constructed. As before, this is an organisation of a collection of subsets of
the variables V into a tree, such that if A and B are two nodes on the junction tree, then the variables
in A ∩ B appear in each node on the path between A and B .

Definition 10.12 (CG Root). A node R on a junction tree is a CG root if any pair of neighbours A,
B, such that A lies on the path between R and B (so that A is closer to R than B), satisfies

B/A ⊆ Γ or B ∩ A ⊆ ∆ or both.

This condition is equivalent to the statement that the triple (A/(A ∩ B), B/(A ∩ B), A ∩ B) forms
a CG decomposition of GA∪B. This means that when a separator between two neighbouring cliques
is not purely discrete, the clique further away from the root has only continuous nodes beyond the
separator.

Theorem 10.13. The cliques of a CG decomposable marked graph can be organised into a junction
tree with at least one CG root.

Proof As with the unmarked graph, choose simplicial nodes, one after the other. This is done in
such a way that either the separator (the nodes not removed) are all discrete, or else the nodes that
are removed are all continuous, until it is not possible to find any other such nodes.
The remaining graph is therefore a clique, by the following arguments: either all the remaining
discrete nodes are in the same clique, or else there is not a simplicial discrete node, since the minimal
separator between two discrete nodes consists entirely of discrete nodes. Assume there is not a simplicial
discrete node. If there are discrete nodes remaining, then the family of any simplicial continuous node
contains a discrete node that does not have neighbours outside the family and is therefore simplicial.
It follows that all the discrete nodes are in the same clique, the family of any remaining continuous
node.
The final clique, constructed in this way, clearly satisfies the properties of a CG root.

10.3 Updating a CG distribution using a Junction Tree


The random vectors are taken as row vectors when they are several attributes measured on a single
run of an experiment.
For each clique C on the CG junction tree, let ϕC = ∏X∈C/S PX∣Π(X), where PX∣Π(X) is a discrete
probability function if X is discrete and a conditional Gaussian density if X is continuous, and where
S denotes those variables in C that were not simplicial during the junction tree construction.
For each continuous variable X,

$$
P_{X \mid \Pi_c(X), \Pi_d(X)}(x \mid z, y) = \frac{1}{(2\pi\gamma(y))^{1/2}} \exp\left\{ -\frac{\left(x - \alpha(y) - \beta(y) z^t\right)^2}{2\gamma(y)} \right\},
$$

where Πc(X) denotes the continuous parents and Πd(X) denotes the discrete parents. Here, α is a function,
β is a (row) vector of the same length as z and γ is the conditional variance.
For the separators, the initialisation is: ϕS ≡ 1 for each separator S.
Expanding the square in the density, taking logarithms and identifying terms gives the canonical
parameters (gX, hX, KX) for PX∣Π(X), with respect to the row vector (x, z). The log partition function is

$$
g_X(y) = -\frac{\alpha(y)^2}{2\gamma(y)} - \frac{1}{2}\log(2\pi\gamma(y)),
$$

and the other parameters are given by

$$
h_X(y) = \frac{\alpha(y)}{\gamma(y)} \begin{pmatrix} 1 & -\beta(y) \end{pmatrix}
$$

and

$$
K_X(y) = \frac{1}{\gamma(y)} \begin{pmatrix} 1 & -\beta(y) \\ -\beta(y)^t & \beta(y)^t \beta(y) \end{pmatrix}.
$$
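
As a numerical check of the canonical parameters, the following base R sketch (the function name `cg_canonical` and the numerical values are assumptions chosen for illustration) builds (g, h, K) from (α, β, γ) for one fixed discrete configuration and compares g + h wᵗ − ½ w K wᵗ, with w = (x, z), against the log of the Gaussian density.

```r
# Canonical characteristics of a univariate Gaussian with mean
# alpha + beta z^t and variance gamma, as a function of w = (x, z).
cg_canonical <- function(alpha, beta, gamma) {
  g <- -alpha^2 / (2 * gamma) - 0.5 * log(2 * pi * gamma)
  h <- (alpha / gamma) * c(1, -beta)
  K <- rbind(c(1, -beta), cbind(-beta, outer(beta, beta))) / gamma
  list(g = g, h = h, K = K)
}

# Illustrative values (not from the text)
alpha <- 1
beta <- c(0.5, -2)
gamma <- 0.3
x <- 0.7
z <- c(1.2, -0.4)

cp <- cg_canonical(alpha, beta, gamma)
w <- c(x, z)
log_canonical <- cp$g + sum(cp$h * w) - 0.5 * drop(w %*% cp$K %*% w)
log_density <- dnorm(x, mean = alpha + sum(beta * z), sd = sqrt(gamma), log = TRUE)
```

The two quantities `log_canonical` and `log_density` agree up to rounding error.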

Marginalisation: Continuous Variables Suppose ϕY,X1,X2 is CG, where Y are discrete variables
and X1 and X2 are continuous variables. That is, ϕ is given by

$$
\phi_{Y, X_1, X_2}(y, x_1, x_2) = \chi(y) \exp\left\{ g(y) + h_1(y) x_1^t + h_2(y) x_2^t - \frac{1}{2} (x_1, x_2) \begin{pmatrix} K_{11}(y) & K_{12}(y) \\ K_{21}(y) & K_{22}(y) \end{pmatrix} \begin{pmatrix} x_1^t \\ x_2^t \end{pmatrix} \right\},
$$

where

$$
\chi(y) = \begin{cases} 1 & P_Y(y) > 0 \\ 0 & P_Y(y) = 0, \end{cases}
$$

K is symmetric (so that K21 = K12ᵗ) and the triple (g, h, K) represents the canonical characteristics. Recall the standard
result that, taking z ∈ Rp as a row vector, and K a positive definite p × p symmetric matrix,

$$
\frac{1}{(2\pi)^{p/2}} \int_{\mathbb{R}^p} \exp\left\{ -\frac{1}{2} z K z^t \right\} dz = \frac{1}{\sqrt{\det(K)}}
$$

and hence that for a row vector a ∈ Rp and K a positive definite p × p symmetric matrix

$$
\int_{\mathbb{R}^p} \exp\left\{ a z^t - \frac{1}{2} z K z^t \right\} dz = \exp\left\{ \frac{1}{2} a K^{-1} a^t \right\} \frac{(2\pi)^{p/2}}{\sqrt{\det(K)}}.
$$
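From this, it follows, after some routine calculation (completing the square in x1), that if X1 is a
random p-vector and K11(y) is positive definite, then

$$
\int_{\mathbb{R}^p} \phi_{Y, X_1, X_2}(y, x_1, x_2)\, dx_1 = \chi(y) \exp\left\{ \tilde{g}(y) + \tilde{h}(y) x_2^t - \frac{1}{2} x_2 \tilde{K}(y) x_2^t \right\},
$$

where

$$
\tilde{g}(y) = g(y) + \frac{1}{2}\left( p \log(2\pi) - \log\det(K_{11}(y)) + h_1(y) K_{11}(y)^{-1} h_1(y)^t \right),
$$

$$
\tilde{h}(y) = h_2(y) - h_1(y) K_{11}(y)^{-1} K_{12}(y), \qquad \tilde{K}(y) = K_{22}(y) - K_{21}(y) K_{11}(y)^{-1} K_{12}(y).
$$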
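
The marginalisation over continuous variables can be checked numerically. The sketch below is an illustration under stated assumptions: the function name `marg_cont` and the numbers are hypothetical, a single discrete configuration is fixed, and the formulas used are those obtained by completing the square in x1, namely h2 − h1 K11⁻¹ K12 and K22 − K21 K11⁻¹ K12 for the linear and quadratic terms. The result is compared with a one-dimensional numerical integral.

```r
# Integrate the variables indexed by idx out of exp{g + h x^t - x K x^t / 2}.
marg_cont <- function(g, h, K, idx) {
  K11 <- K[idx, idx, drop = FALSE]
  K12 <- K[idx, -idx, drop = FALSE]
  K21 <- K[-idx, idx, drop = FALSE]
  K22 <- K[-idx, -idx, drop = FALSE]
  h1 <- h[idx]
  h2 <- h[-idx]
  K11inv <- solve(K11)
  p <- length(idx)
  list(
    g = g + 0.5 * (p * log(2 * pi) - log(det(K11)) + drop(h1 %*% K11inv %*% h1)),
    h = drop(h2 - h1 %*% K11inv %*% K12),
    K = K22 - K21 %*% K11inv %*% K12
  )
}

# Illustrative two-dimensional example: integrate out the first coordinate
g <- -1
h <- c(0.3, -0.2)
K <- matrix(c(2, 0.5, 0.5, 1), 2)
m <- marg_cont(g, h, K, idx = 1)

x2 <- 0.8
integrand <- function(x1) {
  exp(g + h[1] * x1 + h[2] * x2 -
        0.5 * (K[1, 1] * x1^2 + 2 * K[1, 2] * x1 * x2 + K[2, 2] * x2^2))
}
numeric_value <- integrate(integrand, -Inf, Inf)$value
formula_value <- exp(m$g + m$h * x2 - 0.5 * m$K[1, 1] * x2^2)
```

The closed form `formula_value` agrees with `numeric_value` up to the accuracy of the quadrature.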

Marginalisation: Discrete Variables Consider a CG function ϕY1,Y2,X, where Y1 and Y2 denote
sets of discrete variables and X a set of continuous variables. Consider marginalisation over Y2. Firstly,
if h(y1, y2) = h̃(y1) and K(y1, y2) = K̃(y1) for some functions h̃ and K̃ (i.e. they do not depend on y2),
then ϕ̃, the marginal of ϕY1,Y2,X, is simply

$$
\tilde{\phi}(y_1, x) = \exp\left\{ \tilde{h}(y_1) x^t - \frac{1}{2} x \tilde{K}(y_1) x^t \right\} \sum_{y_2} \chi(y_1, y_2) \exp\left\{ g(y_1, y_2) \right\}.
$$

The function ϕ̃ is therefore CG with canonical characteristics

$$
\tilde{g}(y_1) = \log \sum_{y_2} \chi(y_1, y_2) \exp\left\{ g(y_1, y_2) \right\}
$$

and h̃, K̃ as before.
If either h or K depends on y2, then a marginalisation will not produce a CG distribution, so an
approximation is used. For this, it is convenient to consider the mean parameters, (P, C, µ), where
P(y1, y2) = P((Y1, Y2) = (y1, y2)) and

$$
X \mid \{(Y_1, Y_2) = (y_1, y_2)\} \sim N(\mu(y_1, y_2), C(y_1, y_2)).
$$

The approximation is as follows: ϕ̃ is defined as CG with mean parameters (P̃, C̃, µ̃) defined as:

$$
\tilde{P}(y_1) = \sum_{y_2} P(y_1, y_2),
$$

$$
\tilde{\mu}(y_1) = \frac{1}{\tilde{P}(y_1)} \sum_{y_2} P(y_1, y_2)\, \mu(y_1, y_2),
$$

$$
\tilde{C}(y_1) = \frac{1}{\tilde{P}(y_1)} \sum_{y_2} P(y_1, y_2) \left( C(y_1, y_2) + (\mu(y_1, y_2) - \tilde{\mu}(y_1))^t (\mu(y_1, y_2) - \tilde{\mu}(y_1)) \right).
$$

It is relatively straightforward to compute that this approximate marginalisation has the correct
expected value and second moments.
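
The moment matching can be illustrated in base R. In this sketch the function name `weak_marginal` and the two-component example are assumptions; for a fixed y1, the rows of `mu` and the elements of the list `C` are indexed by the values of y2.

```r
# Collapse a mixture of Gaussians (over y2) to a single Gaussian with the
# same total mass, mean and covariance.
weak_marginal <- function(P, mu, C) {
  P_tilde <- sum(P)
  w <- P / P_tilde                        # conditional weights of y2 given y1
  mu_tilde <- colSums(w * mu)             # row k of mu is weighted by w[k]
  C_tilde <- Reduce(`+`, lapply(seq_along(P), function(k) {
    d <- mu[k, ] - mu_tilde
    w[k] * (C[[k]] + outer(d, d))
  }))
  list(P = P_tilde, mu = mu_tilde, C = C_tilde)
}

# Illustrative example with two values of y2 and two continuous variables
P <- c(0.2, 0.3)
mu <- rbind(c(0, 0), c(2, 1))
C <- list(diag(2), 2 * diag(2))
wm <- weak_marginal(P, mu, C)

# Exact second moment of the mixture, for comparison
w <- P / sum(P)
second_moment <- w[1] * (C[[1]] + outer(mu[1, ], mu[1, ])) +
  w[2] * (C[[2]] + outer(mu[2, ], mu[2, ]))
matched_moment <- wm$C + outer(wm$mu, wm$mu)
```

Here `matched_moment` coincides with `second_moment`, which is the sense in which the approximation preserves the first two moments.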

Marginalising over both Discrete and Continuous When marginalising over both types of
variables, first the continuous variables are marginalised, and then the discrete.

Entering Evidence Two types of evidence can be entered. The first is evidence that a continuous variable
Y is instantiated as y for some y ∈ R. Suppose PX∣Π(X) has canonical characteristics (g, h, K), where
either X = Y or Y ∈ Π(X). The vector (X, Π(X)) may be re-ordered so that Y appears last, so that,
for each configuration i of the discrete variables, the canonical characteristics are written as

$$
h(i) = \begin{pmatrix} h_1(i) & h_Y(i) \end{pmatrix}, \qquad K(i) = \begin{pmatrix} K_{11}(i) & K_{1Y}(i) \\ K_{Y1}(i) & K_{YY}(i) \end{pmatrix}.
$$

It is straightforward to show that when Y ← y is instantiated, the potential is replaced by a function ψ
of the remaining variables with canonical characteristics (g∗, h∗, K∗) given by

$$
K^*(i) = K_{11}(i),
$$

$$
h^*(i) = h_1(i) - y K_{Y1}(i),
$$

$$
g^*(i) = g(i) + h_Y(i) y - \frac{1}{2} y^2 K_{YY}(i).
$$
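
Entering evidence on a continuous variable amounts to substituting Y = y into the exponent and collecting terms in the remaining variables. The base R sketch below makes this concrete for one discrete configuration; the function name `enter_evidence` and the numbers are assumptions, and Y is taken to be the last coordinate.

```r
# Condition the canonical characteristics (g, h, K) on the last coordinate = y.
enter_evidence <- function(g, h, K, y) {
  n <- length(h)
  list(
    g = g + h[n] * y - 0.5 * y^2 * K[n, n],
    h = h[-n] - y * K[n, -n],
    K = K[-n, -n, drop = FALSE]
  )
}

# Illustrative values (not from the text)
g <- 0.2
h <- c(0.1, -0.4, 0.7)
K <- matrix(c(2, 0.5, 0.3,
              0.5, 1.5, 0.2,
              0.3, 0.2, 1), 3, byrow = TRUE)
y <- 1.3
x <- c(0.5, -1)

log_potential <- function(g, h, K, w) g + sum(h * w) - 0.5 * drop(w %*% K %*% w)
reduced <- enter_evidence(g, h, K, y)
full_value <- log_potential(g, h, K, c(x, y))
reduced_value <- log_potential(reduced$g, reduced$h, reduced$K, x)
```

Evaluating the original log potential at (x, y) and the reduced one at x gives the same number.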
The algorithm accommodates evidence on discrete variables in the form of information that certain
states are impossible. If (X, Π(X)) contains discrete variables, then let Sd = ({X} ∪ Π(X)) ∩ ∆ where
∆ denotes the set of discrete variables. Then replace PX∣Π(X) by a function

$$
\psi_{X, \Pi(X)} = f_{S_d} P_{X \mid \Pi(X)}
$$

where for each s ∈ XSd,

$$
f_{S_d}(s) = \begin{cases} 0 & \text{evidence states that } s \text{ is impossible} \\ 1 & \text{otherwise.} \end{cases}
$$

The Fully Active Schedule The fully active schedule may now be applied. Firstly, the evidence
is inserted. This is hard evidence, that certain states for discrete variables are excluded, or that the
continuous variables take certain fixed values. The information then has to be propagated. Start at
the leaves, send all messages to a CG root. A message from C to C′, with separator S, computes

$$
\phi^*_S = \sum_{C \setminus S} \phi_C,
$$

where the sum denotes an integral for a continuous variable and a sum for a discrete variable, updates ϕC′ to

$$
\phi^*_{C'} = \frac{\phi^*_S}{\phi_S} \phi_{C'}
$$

and updates ϕS to ϕ∗S.

Note that, when two functions are multiplied or divided, this simply involves firstly computing
the canonical characteristics (either exactly, or those for the approximating function); then, if ϕ1
has characteristics (g1, h1, K1) and ϕ2 has characteristics (g2, h2, K2), ϕ1ϕ2 has characteristics
(g1 + g2, h1 + h2, K1 + K2) and ϕ1/ϕ2 has characteristics (g1 − g2, h1 − h2, K1 − K2).
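
In code, multiplication and division of potentials therefore reduce to addition and subtraction of their characteristics. The sketch below assumes, for illustration, that each potential is stored as a list with components `g`, `h` and `K` for a single discrete configuration, and that both potentials have already been extended to the same set of continuous variables; the names `cg_mult` and `cg_div` are hypothetical.

```r
# Product and quotient of two CG potentials in canonical form.
cg_mult <- function(p1, p2) Map(`+`, p1[c("g", "h", "K")], p2[c("g", "h", "K")])
cg_div  <- function(p1, p2) Map(`-`, p1[c("g", "h", "K")], p2[c("g", "h", "K")])

# Illustrative potentials over two continuous variables
p1 <- list(g = -0.5, h = c(1, 0), K = diag(2))
p2 <- list(g = 0.3, h = c(0.2, -0.1), K = matrix(c(1, 0.4, 0.4, 2), 2))

prod12 <- cg_mult(p1, p2)
back_to_p1 <- cg_div(prod12, p2)   # dividing by p2 again recovers p1
```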
When the root has received all messages, at this stage the potential over the root is normalised.
That is, it is multiplied by a suitable constant to make it a probability. All the messages propagated
to the CG root are proper marginalisations and therefore the distribution over the CG root, after the
evidence is received, is an exact CG distribution.
For the propagation back out to the leaves, it will not, in general, be possible to make exact
marginalisations. The same procedure is used; for a message from C to C′ separated by S, set
ϕ∗S = ∑C/S ϕC, update ϕC′ to (ϕ∗S/ϕS)ϕC′ and update ϕS to ϕ∗S.

Having inserted hard evidence and run the schedule, since the potential over the root has been
normalised, the resulting functions are probability distributions.
The approximate marginalisations give an approximate update but, since the tree has a CG (strong)
root, the tree will be consistent: by construction, the exact marginalisation of a clique in
the direction of the root gives exactly the approximating distribution over the separator that is
produced by the approximate marginalisation when computing away from the root.

The Termination Although the resulting algorithm has produced approximate distributions over
the cliques, which are conditional Gaussian, with the correct expectation vector and covariance
structure, it should be clear from the algorithm that dividing the function over the clique by the function
over the adjacent separator in the direction of the root gives the exact conditional distribution of the
clique conditioned on the separator.

Notes The application of junction tree methods to conditional Gaussian distributions was taken from
Lauritzen [81].
10.4 Exercises
1. Let

X = (X ∆ , XΓ ) ∼ CG(∣∆∣, 1).

   Let I denote the state space for X∆ and let P denote the probability function for the random
   vector X∆. Prove that

$$
E[X_\Gamma] = \sum_{i \in I} P(i) \mu(i)
$$

   and

$$
V(X_\Gamma) = \sum_{i \in I} P(i) \sigma(i)^2 + \sum_{i \in I} P(i) \left( \mu(i) - E[X_\Gamma] \right)^2.
$$

2. Let X ∼ CG(2, 2) and let I1 and I2 be binary variables. Find the canonical parameters for the
distribution.

3. Show that if a Conditional Gaussian Distribution is marginalised over a subset of the continuous
variables, the resulting distribution is again a CG distribution. Find the canonical characteristics
of the marginal distribution in terms of the original canonical characteristics, stating the standard
results about multivariate normal random variables that you are using.

4. Suppose that hard evidence is entered into a subset of the continuous variables of a CG
   distribution. Show that the updated distribution is again a CG distribution and express the mean
   parameters (conditional expectation vector and covariance matrix) of the updated distribution
   in terms of the mean parameters of the original distribution.

5. This example is taken from Lauritzen [84]. It is a fictitious problem connected with controlling
   the emission of heavy metals from a waste incinerator. The type of incoming waste W affects
   the metals in the waste Min, the dust emission D and the filter efficiency E. The quantity of
   metals in the waste Min affects the metals emission Mout. Another important factor is the waste
   burning regimen B, which is monitored via the carbon dioxide concentration in the emission C.
   The burning regimen, the waste type and the filter efficiency E affect the dust emission D. The
   dust emission affects the metals emission and it is monitored by recording the light penetration
   L. The state of the filter F (whether it is intact or defective) affects E.
   The variables F, W, B are qualitative variables with two states each (the filter is either intact or
   defective, the waste is either industrial or household, the burning regimen is either stable or
   unstable). The variables E, C, D, L, Min and Mout are continuous. The directed acyclic marked
   graph is given in Figure 10.1.

   (a) • Moralise and triangulate the graph.
       • Is it possible to construct a junction tree with a CG root?

   (b) • Moralise the graph.
       • By adding in as few links as possible, construct a CG decomposable graph.
       • Construct a junction tree. What are the possible strong roots for the junction tree?
   (c) (HUGIN exercise) Programme the model in HUGIN, with the following conditional
       probabilities:
PB (1) = 0.85, PB (0) = 0.15 1 = stable 0 = unstable

PF (1) = 0.95 PF (0) = 0.05 1 = intact 0 = defect

PW (1) = 0.25 PW (0) = 0.75 1 = industrial 0 = household

E∣(F, W ) = (1, 0) ∼ N (−3.2, 0.00002) E∣(F, W ) = (0, 0) ∼ N (−0.5, 0.0001)

E∣(F, W ) = (1, 1) ∼ N (−3.9, 0.00002) E∣(F, W ) = (0, 1) ∼ N (−0.4, 0.0001)

D∣(B, W, E) = (1, 1, x) ∼ N (6.5 + x, 0.03) D∣(B, W, E) = (1, 0, x) ∼ N (6.0 + x, 0.04)

D∣(B, W, E) = (0, 1, x) ∼ N (7.5 + x, 0.1) D∣(B, W, E) = (0, 0, x) ∼ N (7.0 + x, 0.1)

C∣B = 1 ∼ N (−2, 0.1) C∣B = 0 ∼ N (−1, 0.3)


L∣D = x ∼ N (3 − x/2, 0.25)
Min ∣W = 1 ∼ N (0.5, 0.01) Min ∣W = 0 ∼ N (−0.5, 0.005)

Mout ∣D = x, Min = y ∼ N (x + y, 0.002)

The variable E, filter efficiency, is represented on a logarithmic scale. It is assumed that

dust out = dust in × ρ

and E = log ρ. The variable D, dust emission, is again on a logarithmic scale, as is C , the
CO2 concentration and L, the light penetrability. Light penetrability is roughly inversely
proportional to the square root of dust concentration. The metal in waste Min and metal
emission Mout variables are on logarithmic scales.
Suppose that the waste burned is of industrial type (W = 1), the light penetration variable
is measured as L = 1.1 and the CO2 concentration is measured as C = −0.9.
Find the updated probability distributions for B and F and the updated means and
variances for Min, Mout and D.
Chapter 11

Gaussian and Conditional Gaussian

Graphical Models in R

The packages ggm, deal, glasso, gRc, pcalg, bnlearn, gRim are useful.
Consider X ∼ N (µ, Σ). The matrix K = Σ−1 is known as the concentration matrix. The partial
correlation between Xu and Xv given all the other variables may be derived from K as:

$$
\rho_{uv \mid V \setminus \{u, v\}} = -\frac{K_{uv}}{\sqrt{K_{uu} K_{vv}}}.
$$
Thus, the independence graph does not have an edge u ↔ v if and only if Kuv = 0.
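
The formula can be checked without any of the graphical modelling packages. The following base R sketch uses the built-in `mtcars` data purely as a stand-in (an assumption; it is not the carcass data), with a hypothetical helper `pcor_from_cov`, and compares the partial correlation obtained from K with the correlation of regression residuals.

```r
# Partial correlations given all remaining variables, from a covariance matrix.
pcor_from_cov <- function(S) {
  K <- solve(S)            # concentration matrix
  P <- -cov2cor(K)         # -K_uv / sqrt(K_uu * K_vv)
  diag(P) <- 1
  P
}

X <- mtcars[, c("mpg", "wt", "hp")]
P <- pcor_from_cov(cov(X))

# The partial correlation of mpg and wt given hp equals the correlation of
# the residuals after regressing each of them on hp.
r_mpg <- resid(lm(mpg ~ hp, data = mtcars))
r_wt <- resid(lm(wt ~ hp, data = mtcars))
residual_cor <- cor(r_mpg, r_wt)
```

The entry `P["mpg", "wt"]` agrees with `residual_cor`.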
Consider an illustrative example of `carcass' data:

> library(gRbase)
> data(carcass)
> head(carcass)
Fat11 Meat11 Fat12 Meat12 Fat13 Meat13 LeanMeat
1 17 51 12 51 12 61 56.52475
2 17 49 15 48 15 54 57.57958
3 14 38 11 34 11 40 55.88994
4 17 58 12 58 11 58 61.81719
5 14 51 12 48 13 54 62.95964
6 20 40 14 40 14 45 54.57870

The concentration matrix can be estimated as:

> S.carc = cov.wt(carcass,method="ML")$cov


> K.carc = solve(S.carc)
> round(100*K.carc)
Fat11 Meat11 Fat12 Meat12 Fat13 Meat13 LeanMeat
Fat11 44 3 -20 -7 -16 4 10
Meat11 3 16 -3 -6 -6 -6 -3

Fat12 -20 -3 54 6 -21 -5 9
Meat12 -7 -6 6 14 -1 -9 0
Fat13 -16 -6 -21 -1 56 3 7
Meat13 4 -6 -5 -9 3 16 -1
LeanMeat 10 -3 9 0 7 -1 26

The partial correlation is obtained using cov2pcor:

> PC.carc = cov2pcor(S.carc)


> round(100*PC.carc)
Fat11 Meat11 Fat12 Meat12 Fat13 Meat13 LeanMeat
Fat11 100 -11 41 30 32 -16 -29
Meat11 -11 100 9 41 19 35 16
Fat12 41 9 100 -24 38 18 -24
Meat12 30 41 -24 100 2 61 2
Fat13 32 19 38 2 100 -9 -18
Meat13 -16 35 18 61 -9 100 7
LeanMeat -29 16 -24 2 -18 7 100

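The partial correlations of Meat12 with Fat13 and with LeanMeat are both approximately 0.02, which suggests that Fat13 is conditionally independent of Meat12 and that LeanMeat is also conditionally independent of Meat12, given the remaining variables.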
A stepwise backward model selection procedure can be carried out as follows:

> library(gRim)
> sat.carc = cmod(~.^.,data=carcass)
> aic.carc = stepwise(sat.carc)
> library(Rgraphviz)
> plot(as(aic.carc,"graphNEL"),"fdp")

The BIC gives a higher penalty for complexity and also removes the edge between Fat13 and Meat13.

> bic.carc = stepwise(sat.carc,k=log(nrow(carcass)))


> bic.carc
Model: A cModel with 7 variables
graphical : TRUE decomposable : TRUE
-2logL : 11376.07 mdim : 24 aic : 11424.07
ideviance : 2465.16 idf : 17 bic : 11516.25
deviance : 8.62 df : 4
> plot(as(bic.carc,"graphNEL"),"fdp")

11.1 Undirected Gaussian Graphical Models


The model gen.carc for the carcass data specifies a UGGM (undirected Gaussian graphical model)
with edges missing for all partial correlations of absolute value less than or equal to 0.12.

> gen.carc = cmod(~Fat11*Fat12*Meat12*Meat13
+ + Fat11*Fat12*Fat13*LeanMeat
+ +Meat11*Meat12*Meat13
+ +Meat11*Fat13*LeanMeat,data=carcass)
> gen.carc
Model: A cModel with 7 variables
graphical : TRUE decomposable : FALSE
-2logL : 11387.24 mdim : 22 aic : 11431.24
ideviance : 2453.99 idf : 15 bic : 11515.73
deviance : 19.79 df : 6
> plot(gen.carc,"neato")

Alternatively, the model could be specified as follows:

> edge.carc=cmod(edgeList(as(gen.carc,"graphNEL")),data=carcass)
> edge.carc
Model: A cModel with 7 variables
graphical : TRUE decomposable : FALSE
-2logL : 11387.24 mdim : 22 aic : 11431.24
ideviance : 2453.99 idf : 15 bic : 11515.73
deviance : 19.79 df : 6

The matrix K is estimated by iterative proportional scaling. The point is that the estimate has to satisfy
the constraint that Kuv = 0 when there is no edge u ↔ v.

> carcfit1 =
+ ggmfit(S.carc,n=nrow(carcass),edgeList(as(gen.carc,"graphNEL")))
> carcfit1[c("dev","df","iter")]
$dev
[1] 19.78537

$df
[1] 6

$iter
[1] 774
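
The idea behind iterative proportional scaling can be sketched in base R. This is a minimal illustration under stated assumptions, not the implementation used by ggmfit(): the function name `ips_fit` is hypothetical, the model is given by a list of cliques (as column indices), and the built-in `mtcars` data stand in for the carcass data. Each step adjusts the block of K belonging to one clique so that the fitted covariance agrees with the sample covariance on that clique; entries of K outside the cliques are never changed and so remain zero.

```r
ips_fit <- function(S, cliques, tol = 1e-10, maxit = 1000) {
  p <- nrow(S)
  K <- diag(p)                       # start from the independence model
  for (it in seq_len(maxit)) {
    K_old <- K
    for (cl in cliques) {
      rest <- setdiff(seq_len(p), cl)
      if (length(rest) > 0) {
        adj <- K[cl, rest, drop = FALSE] %*%
          solve(K[rest, rest, drop = FALSE]) %*%
          K[rest, cl, drop = FALSE]
      } else {
        adj <- 0
      }
      # after this update, solve(K)[cl, cl] equals S[cl, cl]
      K[cl, cl] <- solve(S[cl, cl, drop = FALSE]) + adj
    }
    if (max(abs(K - K_old)) < tol) break
  }
  K
}

# Chain graph 1 - 2 - 3 - 4 on four variables of a stand-in data set
S <- cor(mtcars[, c("mpg", "wt", "hp", "qsec")])
K_hat <- ips_fit(S, cliques = list(1:2, 2:3, 3:4))
Sigma_hat <- solve(K_hat)
```

At convergence, `K_hat` is zero for every missing edge and `Sigma_hat` reproduces the sample values on each clique.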

Hypothesis Testing A likelihood ratio test, to see whether model M1 gives a better fit than a
sub-model M2, may be carried out as follows:

> comparemodels=function(m1,m2){
+ lrt = m2$fitinfo$dev - m1$fitinfo$dev
+ dfdiff = m2$fitinfo$dimension[4]-m1$fitinfo$dimension[4]
+ names(dfdiff)=NULL
+ list('lrt'=lrt,'df'=dfdiff)
+ }
> comparemodels(aic.carc,bic.carc)
$lrt
[1] 8.372649

$df
[1] 2

With 2 degrees of freedom, the statistic 8.37 corresponds to a p-value of approximately 0.015, which would indicate that the smaller model does not fit well.
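
The p-value for the likelihood ratio statistic may be computed from the chi-squared distribution. A short base R sketch, using the statistic and degrees of freedom reported above (the variable names are illustrative):

```r
lrt <- 8.372649     # likelihood ratio statistic reported by comparemodels()
df <- 2             # difference in degrees of freedom
p_value <- pchisq(lrt, df = df, lower.tail = FALSE)   # upper tail probability
```

For 2 degrees of freedom the upper tail probability equals exp(-lrt / 2), which gives a convenient check.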

The function ciTest_mvn() tests single conditional independence hypotheses. To test LeanMeat ⊥
Meat13∣remaining variables

> ciTest_mvn(list(cov=S.carc,n.obs=nrow(carcass)),
+ set=~LeanMeat+Meat13+Meat11+Meat12+Fat11+Fat12+Fat13)
Testing LeanMeat _|_ Meat13 | Meat11 Meat12 Fat11 Fat12 Fat13
Statistic (DEV): 1.687 df: 1 p-value: 0.1940 method: CHISQ

Gaussian conditional independence can be tested from the pcalg package as follows:

> library(pcalg)
> C.carc=cov2cor(S.carc)
> gaussCItest(7,2,c(1,3,4,5,6),list(C=C.carc,n=nrow(carcass)))
[1] 0.003077247
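
The call above tests variable 7 (LeanMeat) against variable 2 (Meat11) given the other five variables, and the value returned is a p-value. A test of this kind can be sketched in base R using Fisher's z-transform of the partial correlation. The function name `fisher_z_test` and the use of `mtcars` as stand-in data are assumptions, and the sketch is not claimed to reproduce the internals of gaussCItest().

```r
# Two-sided test of zero partial correlation between variables i and j
# given the set S, from a correlation matrix C based on n observations.
fisher_z_test <- function(C, i, j, S, n) {
  idx <- c(i, j, S)
  K <- solve(C[idx, idx, drop = FALSE])
  r <- -K[1, 2] / sqrt(K[1, 1] * K[2, 2])          # partial correlation
  z <- sqrt(n - length(S) - 3) * abs(atanh(r))     # Fisher z statistic
  2 * pnorm(z, lower.tail = FALSE)
}

C <- cor(mtcars[, c("mpg", "wt", "hp")])
p_val <- fisher_z_test(C, i = 1, j = 2, S = 3, n = nrow(mtcars))
p_indep <- fisher_z_test(diag(3), i = 1, j = 2, S = 3, n = 50)   # exactly uncorrelated
```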

11.2 Decomposition of UGGMs


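Consider the model with BIC penalisation. Let A = {Fat13, LeanMeat}, B = {Meat12, Meat13}
and S = {Fat11, Fat12, Meat11}. Then (A, B, S) is a decomposition of its independence graph.
Furthermore, both MA∪S and MB∪S are saturated. It follows that

$$
\hat{K}_{A \cup S} = (S_{A \cup S})^{-1} \quad \text{and} \quad \hat{K}_{B \cup S} = (S_{B \cup S})^{-1},
$$

where S with a subscript denotes the corresponding sub-matrix of the sample covariance matrix.
The MLE of K may therefore be found using:

$$
\hat{K} = (\hat{K}_{A \cup S})^{A \cup B \cup S} + (\hat{K}_{B \cup S})^{A \cup B \cup S} - ((S_{S})^{-1})^{A \cup B \cup S},
$$

where the superscript A ∪ B ∪ S indicates that the matrix is padded with zeros to a matrix with rows
and columns indexed by all the variables.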

> K.hat = S.carc
> K.hat[]=0
> AC=c("Fat11","Fat12","Fat13","Meat11","LeanMeat")
> BC=c("Meat11","Meat12","Meat13","Fat11","Fat12")
> C=c("Fat11","Fat12","Meat11")
> K.hat[AC,AC]=K.hat[AC,AC]+solve(S.carc[AC,AC])
> K.hat[BC,BC]=K.hat[BC,BC]+solve(S.carc[BC,BC])
> K.hat[C,C]=K.hat[C,C]-solve(S.carc[C,C])
> round(100*K.hat)
Fat11 Meat11 Fat12 Meat12 Fat13 Meat13 LeanMeat
Fat11 44 1 -20 -7 -16 6 10
Meat11 1 16 -4 -6 -4 -5 -5
Fat12 -20 -4 54 6 -20 -4 9
Meat12 -7 -6 6 14 0 -9 0
Fat13 -16 -4 -20 0 55 0 7
Meat13 6 -5 -4 -9 0 16 0
LeanMeat 10 -5 9 0 7 0 26
> Sigma.hat=solve(K.hat)
> round(Sigma.hat,2)
Fat11 Meat11 Fat12 Meat12 Fat13 Meat13 LeanMeat
Fat11 11.34 0.74 8.42 2.06 7.66 -0.76 -9.08
Meat11 0.74 32.97 0.67 35.94 2.01 31.97 5.33
Fat12 8.42 0.67 8.91 0.31 6.84 -0.60 -7.95
Meat12 2.06 35.94 0.31 51.79 2.45 41.47 5.41
Fat13 7.66 2.01 6.84 2.45 7.62 0.89 -6.93
Meat13 -0.76 31.97 -0.60 41.47 0.89 41.44 6.43
LeanMeat -9.08 5.33 -7.95 5.41 -6.93 6.43 12.90

> round(S.carc,2)
Fat11 Meat11 Fat12 Meat12 Fat13 Meat13 LeanMeat
Fat11 11.34 0.74 8.42 2.06 7.66 -0.76 -9.08
Meat11 0.74 32.97 0.67 35.94 2.01 31.97 5.33
Fat12 8.42 0.67 8.91 0.31 6.84 -0.60 -7.95
Meat12 2.06 35.94 0.31 51.79 2.18 41.47 6.03
Fat13 7.66 2.01 6.84 2.18 7.62 0.38 -6.93
Meat13 -0.76 31.97 -0.60 41.47 0.38 41.44 7.23
LeanMeat -9.08 5.33 -7.95 6.03 -6.93 7.23 12.90

Model Search using gRim Setting search = headlong causes edges to be searched in a random
order, which can make the search faster.

> ind.carc=cmod(~.^1,data=carcass)
> set.seed(123)
> forw.carc=stepwise(ind.carc,search="headlong",
+ direction="forward",k=log(nrow(carcass)),details=0)
> forw.carc
Model: A cModel with 7 variables
graphical : TRUE decomposable : TRUE
-2logL : 11393.53 mdim : 23 aic : 11439.53
ideviance : 2447.70 idf : 16 bic : 11527.87
deviance : 26.08 df : 5
> plot(forw.carc,"neato")

11.3 Directed Gaussian Graphical Models


A DAG may be constructed as follows: for example,

> gdag1=DAG(LeanMeat~Meat13:Fat11:Fat12, Meat13~Meat11:Meat12,


+ Fat12~Fat11,Fat13~Meat11:Meat12, Meat12~Meat11)
> plot(as(gdag1,"graphNEL"))
> fdag1=fitDag(gdag1,S.carc,nrow(carcass))
> fdag1$dev
[1] 552.2726
> fdag1$df
[1] 12

The function essentialGraph() from the ggm package returns the essential graph of a DAG. For
example:

> eG1 = as(essentialGraph(gdag1),"igraph")


> V(eG1)$size = 40
> E(eG1)$arrow.mode=2
> E(eG1)[is.mutual(eG1)]$arrow.mode = 0
> plot(eG1,layout=layout.kamada.kawai)

Model Selection A DAG may be established using the pcalg package. The PC algorithm
may be used to find the skeleton. The pcalg::skeleton command ensures that the relevant version
of the command skeleton is used.

> library(pcalg)
> c.carc=cov2cor(S.carc)
> suffStat=list(C=c.carc,n=nrow(carcass))
> indepTest=gaussCItest
> skeleton.carc=pcalg::skeleton(suffStat,gaussCItest,p=ncol(carcass),alpha=0.05)
> nodes(skeleton.carc@graph)=names(carcass)
> names(carcass)
[1] "Fat11" "Meat11" "Fat12" "Meat12" "Fat13" "Meat13"
[7] "LeanMeat"
> str(skeleton.carc@sepset[[1]])
List of 7
$ : NULL
$ : int(0)
$ : NULL
$ : int(0)
$ : NULL
$ : int(0)
$ : NULL

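This is read as follows: the first variable, Fat11, was found to be marginally independent of Meat11,
Meat12 and Meat13 (variables 2, 4 and 6). This is seen from the designation int(0), an empty separating
set; NULL indicates that no separating set is recorded for that pair. Similarly,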

> str(skeleton.carc@sepset[[2]])
List of 7
$ : NULL
$ : NULL
$ : int(0)
$ : NULL
$ : int 4
$ : NULL
$ : int 6

This indicates that Meat11 (the second variable) is marginally independent of Fat12, conditionally
independent of Fat13 (5th variable on the list) given Meat12 (4th variable on the list) and conditionally
independent of LeanMeat given Meat13 etc.
In pcalg, there are several options for turning a skeleton together with sep-sets into a partially
directed graph. These are:

• udag2pdag()

• udag2pdagRelaxed()

• udag2pdagSpecial()

Read the help pages to find out the differences. For example:

> pdag.carc=udag2pdagRelaxed(skeleton.carc,verbose=0)
> nodes(pdag.carc@graph)=names(carcass)
> plot(pdag.carc@graph,"neato")

Undirected edges are shown as double-arrowed edges. This graph is not an essential graph; the arrow
from Meat12 to Meat13 is not part of an immorality, neither is it a compelled edge.
Both steps (skeleton and edge orientation) can be called simultaneously using the function pc().
For example,

> cpdag.carc=pc(suffStat,gaussCItest,p=ncol(carcass),alpha=0.05)
> plot(cpdag.carc@graph)

11.4 Gaussian Chain Graph Models


To build a chain graph model, firstly an undirected graph, corresponding to the independence graph,
is constructed.
The package lcd is useful. This uses a constraint-based algorithm due to Ma et al. [86] (2008).
A junction tree for the undirected graph is then derived. The algorithm then performs a series of
conditional independence tests following a scheme based on the junction tree. This may be applied to
the carcass data:

> library(lcd)
> ug<-naive.getug.norm(carcass,0.05)
> jtree<-ug.to.jtree(ug)
> cg<-learn.mec.norm(jtree,cov(carcass),nrow(carcass),0.01,"CG")
> icg<-as(cg,"igraph")
> E(icg)$arrow.mode<-2
> E(icg)[is.mutual(icg)]$arrow.mode<-0
> V(icg)$size<-40
> plot(icg,layout=layout.kamada.kawai)

11.5 Conditional Gaussian Models


Recall that, for a homogeneous conditional Gaussian model, the covariance of the continuous variables,
conditioned on the discrete, is the same for each value of the discrete variables. That is, conditioned on
the discrete variables taking configuration i, the Gaussian variables have distribution:

$$
\pi_{\Gamma \mid \Delta}(y \mid i) = \frac{1}{(2\pi)^{q/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (y - \mu(i))^t \Sigma^{-1} (y - \mu(i)) \right\}.
$$

For illustration, consider two data sets from gRbase: milkcomp1 and wine.
The CGstats() function calculates the number of observations and means of the continuous
variables for each cell i, together (by default) with a common covariance matrix.

> data(milkcomp1,package='gRbase')
> head(milkcomp1)
treat fat protein dm lactose
1 d 6.16 6.65 18.55 5.06
2 c 4.06 5.44 18.32 5.23
3 f 9.25 5.67 20.68 5.15
4 b 5.82 5.62 17.57 5.74
5 a 4.98 5.37 16.38 5.55
6 b 9.06 5.08 20.21 5.29
> library(gRim)
> SS = CGstats(milkcomp1,varnames=c("treat","fat","protein","lactose"))
> SS
$n.obs
treat
a b c d e f g
8 8 8 8 8 7 8

$center
a b c d e f g
fat 6.64125 8.01000 7.0525 7.40125 8.13375 7.518571 6.97375
protein 5.48750 5.28750 5.4750 5.81750 5.26250 5.295714 5.58000
lactose 5.49125 5.48875 5.4675 5.31375 5.40625 5.382857 5.41500

$cov
fat protein lactose
fat 2.31288338 0.19928422 -0.07028198
protein 0.19928422 0.12288675 -0.03035208
lactose -0.07028198 -0.03035208 0.04529896

$cont.names
[1] "fat" "protein" "lactose"

$disc.names
[1] "treat"

$disc.levels
[1] 7
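
The quantities returned by CGstats() can be reproduced by hand. The base R sketch below is an illustration under stated assumptions: the function name `cg_stats` is hypothetical, the built-in `iris` data (one factor, several continuous variables) stand in for milkcomp1, at least two continuous variables are supplied, and the common covariance is taken to be the pooled within-cell sum of squares divided by the total number of observations.

```r
# Cell counts, cell means and pooled (ML) covariance for one discrete factor.
cg_stats <- function(data, disc, cont) {
  f <- data[[disc]]
  X <- as.matrix(data[, cont, drop = FALSE])
  center <- sapply(split(as.data.frame(X), f), colMeans)   # columns = cells
  resid <- X - t(center)[as.character(f), , drop = FALSE]   # subtract cell means
  list(n.obs = table(f), center = center, cov = crossprod(resid) / nrow(X))
}

cont <- c("Sepal.Length", "Petal.Length")
res <- cg_stats(iris, disc = "Species", cont = cont)

# Cross-check: weighted sum of the within-cell covariance matrices
within <- Reduce(`+`, lapply(split(iris[, cont], iris$Species),
                             function(d) cov(d) * (nrow(d) - 1))) / nrow(iris)
```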

The coefficients of variation are:

> apply(SS$center,1,sd)/apply(SS$center,1,mean)
fat protein lactose
0.07415672 0.03656048 0.01186589
The corresponding canonical parameters are:

> can.parms=CGstats2mmodParms(SS,type="ghk")
> print(can.parms,simplify=FALSE)
$g
treat
a b c d e f g
-745.4933 -729.3707 -740.4563 -743.5508 -712.6533 -710.4957 -740.1503

$h
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 0.7869693 1.628323 0.9975735 0.873605 1.686407 1.343883 0.8641906
[2,] 88.2214588 85.006209 87.6318341 90.151087 84.137184 84.816560 88.5106553
[3,] 181.5552534 180.651093 180.9626426 179.064184 178.337697 177.745064 180.1855748

$K
[,1] [,2] [,3]
[1,] 0.5055681 -0.7503065 0.2816613
[2,] -0.7503065 10.8648914 6.1157915
[3,] 0.2816613 6.1157915 26.6103828

$gentype
[1] "mixed"

$cont.names
[1] "fat" "protein" "lactose"

$disc.names
[1] "treat"

$disc.levels
[1] 7

Let j denote the level of the treatment factor, then h(j) takes the form:

h(j) = (hfat (j), hprotein (j), hlactose (j))

The coefficients of variation for h are:

> apply(can.parms$h,1,sd)/apply(can.parms$h,1,mean)
[1] 0.32484006 0.02614999 0.00793359

This suggests that hlactose is a constant function of j . In other words,

lactose ⊥ treat∣(fat, protein).


The partial correlation matrix is:

> conc2pcor(can.parms$K)
[,1] [,2] [,3]
[1,] 1.00000000 0.3201373 -0.07679125
[2,] 0.32013725 1.0000000 -0.35967845
[3,] -0.07679125 -0.3596784 1.00000000

This suggests that the partial correlation between fat and lactose is zero. Therefore,

lactose ⊥ fat∣(treat, protein).


The generators of the model are simply the cliques of the CG junction tree. The mmod() function from
gRim allows CG models to be defined using model formulae. For example, to construct a model
with generators {treat, fat, protein} and {protein, lactose}:

> milkmod = mmod(~treat*fat*protein+protein*lactose, data=milkcomp1)


> milkmod
Model: A mModel with 4 variables
graphical : TRUE decomposable : TRUE
-2logL : 428.47 mdim : 26 aic : 480.47
ideviance : 18.97 idf : 15 bic : 532.66
deviance : 2.11 df : 7

Conditional Gaussian Models To construct a marked graph, the information on marking has to
be provided:

> uG1=ug(~a:b+b:c+c:d)
> uG2=ug(~a:b+a:d+c:d)
> mcsmarked(uG1,discrete=c("a","d"))
character(0)
> mcsmarked(uG2,discrete=c("a","d"))
[1] "a" "d" "b" "c"
> plot(uG1)
> plot(uG2)

For the first graph, mcsmarked() returns character(0): the discrete nodes a and d are joined by the path
a, b, c, d through continuous nodes but are not neighbours, so the marked graph is not CG decomposable.
Both a and d have to be in the CG root, hence the root contains all the variables; the CG junction tree
contains exactly one node. For the second one, the CG root is the clique {a, d}.

Using gRim for CG-models The function mmod() enables CG models to be defined and fitted.

> glist = ~treat:fat:protein+protein:lactose


> milk=mmod(glist,data=milkcomp1)
> milk
Model: A mModel with 4 variables
graphical : TRUE decomposable : TRUE
-2logL : 428.47 mdim : 26 aic : 480.47
ideviance : 18.97 idf : 15 bic : 532.66
deviance : 2.11 df : 7
> summary(milk)
Mixed interaction model:
Generators:
:"treat" "fat" "protein"
:"protein" "lactose"
Discrete: 1 Continuous: 3
Is graphical: TRUE Is decomposable: TRUE
logL: -214.233011, iDeviance: 241.774364

The parameters are obtained using coef(). The parametrisation may be specified as either canonical
or mean parameters. For canonical parameters:

> coef(milk,type="ghk")
$g
treat
a b c d e f g
-676.0550 -666.0859 -675.0546 -690.9918 -664.9730 -666.7805 -680.0217

$h
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] -1.134727 -0.2838037 -0.9178505 -1.021725 -0.2012326 -0.5374842 -1.043008
[2,] 84.348819 81.3413696 83.8953921 86.850963 81.0040255 81.8196051 84.952805
[3,] 164.633541 164.6335413 164.6335413 164.633541 164.6335413 164.6335413 164.633541

$K
[,1] [,2] [,3]
[1,] 0.5025868 -0.815040 0.000000
[2,] -0.8150400 10.762254 5.666744
[3,] 0.0000000 5.666744 24.645834

$gentype
[1] "mixed"

$cont.names
[1] "fat" "protein" "lactose"

$disc.names
[1] "treat"

$disc.levels
[1] 7

$N
[1] 55

$SSD
[,1] [,2] [,3]
[1,] 127.208586 10.960632 -3.865509
[2,] 10.960632 6.758771 -1.669364
[3,] -3.865509 -1.669364 2.491443

$SS
fat protein lactose
fat 3143.500 2227.141 2199.894
protein 2227.141 1648.827 1627.220
lactose 2199.894 1627.220 1620.993

Updating Models Models are updated using update(). A list with one or more components
add.edge, drop.edge, add.term, drop.term is specified. The updates are made in the order given.
For example:

> milk2 = update(milk,list(add.edge=~fat:lactose,drop.edge=~treat:protein))


> milk2
Model: A mModel with 4 variables
graphical : TRUE decomposable : TRUE
-2logL : 446.17 mdim : 21 aic : 488.17
ideviance : 10.12 idf : 10 bic : 530.33
deviance : 10.96 df : 12

Inference Functions such as ciTest(), testInEdges(), testOutEdges() etc. have the same behaviour
as with pure discrete and pure continuous networks. For example:

> ciTest(milkcomp1)
Testing treat _|_ fat | protein dm lactose
Statistic (DEV): 8.742 df: 6 p-value: 0.1886 method: CHISQ

> testInEdges(milk,getInEdges(milk$glist))
statistic df p.value aic V1 V2 action
1 11.06071 6 0.086518199 -0.9392919 treat fat +
2 18.68943 6 0.004721598 6.6894264 treat protein -
3 8.27794 1 0.004012963 6.2779399 fat protein -
4 10.24527 1 0.001370352 8.2452747 protein lactose -

> testOutEdges(milk,getOutEdges(milk$glist))
statistic df p.value aic V1 V2 action
1 3.8928582 6 0.6911730 8.107142 treat lactose -
2 0.9827155 1 0.3215293 1.017285 fat lactose -

> milk3=update(milk,list(drop.edge=~treat:protein))
> compareModels(milk,milk3)
Large:
:"treat" "fat" "protein"
:"protein" "lactose"
Small:
:"protein" "lactose"
:"treat" "fat"
:"fat" "protein"
-2logL: 18.69 df: 6 AIC(k= 2.0): 6.69 p.value: 0.155100

> testdelete(milk,c("treat","protein"))
dev: 18.689 df: 6 p.value: 0.00472 AIC(k=2.0): 6.7 edge: treat:protein
Notice: Test perfomed by comparing likelihood ratios
> testadd(milk,c("treat","lactose"))
dev: 3.893 df: 6 p.value: 0.69117 AIC(k=2.0): 8.1 edge: treat:lactose
Notice: Test perfomed by comparing likelihood ratios
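The p-values reported by testdelete() and testadd() can be checked by hand, assuming they are upper tail probabilities of a chi-squared distribution with the stated degrees of freedom. For an even number of degrees of freedom the tail has a closed form, which the Python sketch below uses; the function name is hypothetical and this is not how gRim computes the value.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """P(chi2_df > x) for even df, via exp(-x/2) * sum_{i < df/2} (x/2)^i / i!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form used here needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # term is now (x/2)^i / i!
        total += term
    return math.exp(-half) * total

p_delete = chi2_upper_tail_even_df(18.689, 6)   # edge treat:protein
p_add = chi2_upper_tail_even_df(3.893, 6)       # edge treat:lactose
print(round(p_delete, 5), round(p_add, 5))
```

Under this assumption the two values agree with the printed p.value entries for testdelete() and testadd() to the precision shown.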

Stepwise Model Selection The stepwise() function in gRim implements stepwise selection. The
following starts from the saturated model and uses the BIC criterion (penalty k = log n). This
function can take a while to produce the output.

> data(wine,package='gRbase')
> mm=mmod(~.^.,data=wine)
> mm2=stepwise(mm,k=log(nrow(wine)),details=0)
> plot(mm2)
Chapter 12

Learning the Conditional Probability Functions

12.1 Introduction
Let X = (X1 , . . . , Xd ) be a random vector, whose probability distribution factorises along a DAG
G = (V, D). This chapter considers the task of learning the conditional probability potentials, when
the DAG G is given, when presented with an n × d data matrix of instantiations

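\[
\mathbf{x} = \begin{pmatrix} x_{(1)} \\ \vdots \\ x_{(n)} \end{pmatrix}
\]

which are the realisation of a random matrix

\[
\mathbf{X} = \begin{pmatrix} X_{(1)} \\ \vdots \\ X_{(n)} \end{pmatrix}.
\]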
Bayesian network analysis restricts itself to three settings:

• Gaussian

• Multinomial

• Conditional Gaussian.

12.2 Gaussian and Conditional Gaussian Networks


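For Gaussian networks, X ∼ N(µ, Σ). The model is:

\[
\begin{cases}
X = \mu + \epsilon, & \epsilon \sim N(0, \Sigma), \\
\mu_j = \beta_{0j} + \sum_{k \in \mathrm{Pa}(j)} \beta_{kj}\, \mu_k ,
\end{cases}
\]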


where Pa(j) denotes the indices of the parent nodes of $X_j$ and, for different instantiations, the ϵ are
i.i.d. Estimation of the parameters $\beta_{kj} : k \in \{0\} \cup \mathrm{Pa}(j)$ is carried out by maximum likelihood
estimation, which is equivalent to least squares for Gaussian variables. That is, the parameters β are
estimated by minimising:

\[
\sum_{i=1}^{n} \Big( x_{ij} - \beta_{0j} - \sum_{k \in \mathrm{Pa}_j} x_{ik}\, \beta_{kj} \Big)^{2} .
\]

We assume that the nodes have been ordered so that $\mathrm{Pa}_j \subseteq \{1, \ldots, j-1\}$. Denote the estimator by $\hat\beta$.
Then

\[
\hat\mu_j = \hat\beta_{0j} + \sum_{k \in \mathrm{Pa}_j} \hat\mu_k\, \hat\beta_{kj} .
\]
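As an illustration of the least-squares step, the following Python sketch treats the simplest case of a node with a single parent, where the minimiser has a closed form. The function name and the data are made up for illustration; with several parents one would solve the normal equations instead.

```python
def fit_single_parent(parent, child):
    """Least-squares (beta0, beta1) for child = beta0 + beta1 * parent + noise."""
    n = len(parent)
    mean_p = sum(parent) / n
    mean_c = sum(child) / n
    sxx = sum((p - mean_p) ** 2 for p in parent)
    sxy = sum((p - mean_p) * (c - mean_c) for p, c in zip(parent, child))
    beta1 = sxy / sxx
    beta0 = mean_c - beta1 * mean_p
    return beta0, beta1

# Hypothetical observations of a parent variable and its child.
x_parent = [1.0, 2.0, 3.0, 4.0]
x_child = [3.1, 4.9, 7.2, 8.8]

beta0, beta1 = fit_single_parent(x_parent, x_child)
mu_parent = sum(x_parent) / len(x_parent)   # m.l.e. of the root mean
mu_child = beta0 + beta1 * mu_parent        # plug-in mean of the child
print(beta0, beta1, mu_child)
```

In this one-parent case the plug-in mean of the child coincides with its sample mean, as expected for least squares with an intercept.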

The estimate of Σ is slightly harder than the estimate of µ, since we have to ensure that the conditional
independence constraints are satisfied; the estimate $\hat\Sigma$ has to correspond to the factorisation.
The estimate $\hat\Sigma_{11}$ is simply the m.l.e. variance estimate derived from the univariate sample $x_{\cdot 1}$ from a
$N(\mu_1, \Sigma_{11})$ distribution.

For j > 1, assume that the components of the $(j-1) \times (j-1)$ sub-matrix $\Sigma^{(j-1)}$, with entries
$\Sigma_{ab} : 1 \le a \le j-1,\ 1 \le b \le j-1$, have already been estimated. Then set
$\big( (\hat\Sigma^{(j)})^{-1} \big)_{ij} = 0$ for $i \notin \mathrm{Pa}_j$. Now let A denote
the j × j symmetric matrix obtained as the maximiser of:

\[
\frac{|A|^{1/2}}{(2\pi)^{j/2}} \exp\Big\{ -\frac{1}{2} \sum_{i=1}^{n} \sum_{a=1}^{j} \sum_{b=1}^{j}
(x_{ia} - \hat\mu_a)(x_{ib} - \hat\mu_b) A_{ab} \Big\}
\]

subject to the constraint that $A_{ij} = 0$ for $i < j$, $i \notin \mathrm{Pa}_j$. For $i \in \mathrm{Pa}_j \cup \{j\}$, set
$\hat\Sigma_{ij} = (A^{-1})_{ij}$.

Conditioned Gaussian Part of Conditional Gaussian Similarly, for Conditional Gaussian, the
parameters of the Gaussian variables, conditioned on the discrete, are estimated by maximum likelihood
in the standard way for multivariate Gaussian.

12.3 Discrete Variables


Let $X = (X_1, \ldots, X_d)$ be a random vector of discrete variables, with probability function $P_{X_1,\ldots,X_d}$ over
state space $\mathcal{X} = \times_{j=1}^{d} \mathcal{X}_j$ where $\mathcal{X}_j = (x_j^{(1)}, \ldots, x_j^{(k_j)})$ is the state space for $X_j$. Suppose that P factorises
according to a DAG G = (V, D). For $x \in \mathcal{X}$, let $\pi_j(x)$ denote the parent configuration for variable $X_j$
when X = x and let $(\pi_j^{(l)})_{l=1}^{q_j}$ denote a listing of the possible parent configurations for $X_j$. Let

\[
\theta_{jil} = P_{X_j \mid \mathrm{Pa}_j}\big(x_j^{(i)} \mid \pi_j^{(l)}\big), \qquad j = 1, \ldots, d,\quad i = 1, \ldots, k_j,\quad l = 1, \ldots, q_j. \tag{12.1}
\]

Let $\mathbf{x}$ be an n × d data matrix, representing n instantiations of X. The aim is to estimate the values of
$(\theta_{jil})_{j,i,l}$ based on the data matrix.
For discrete variables there are two approaches: the maximum likelihood method and the Bayesian
approach.

12.4 Maximum Likelihood for Discrete Variables


We describe maximum likelihood for multinomial sampling. The crucial point about a Bayesian network
is modularity; the components of the network are conditionally independent and each component can
be treated separately, as a multinomial.

12.4.1 Maximum Likelihood for Multinomial Sampling


Let X be a random variable with state space $\mathcal{X} = (x^{(1)}, \ldots, x^{(k)})$ and $\theta_i = P_X(x^{(i)})$, $i = 1, \ldots, k$, where
$\theta_i \ge 0$ for $i = 1, \ldots, k$ and $\sum_{i=1}^{k} \theta_i = 1$. Let $\mathbf{X} = (X_1, \ldots, X_n)^t$ be n independent identically distributed
copies of X and let $\mathbf{x} = (x_1^{(i_1)}, \ldots, x_n^{(i_n)})^t$ be an instantiation of $\mathbf{X}$.
Let

\[
n_l = \text{number of times } x^{(l)} \text{ appears in } \mathbf{x}, \qquad l = 1, \ldots, k,
\]

so that $n = n_1 + \ldots + n_k$. The probability of $\mathbf{x}$ is:

\[
P_{\mathbf{X}}(\mathbf{x} \mid \theta) = \theta_1^{n_1} \cdots \theta_k^{n_k} .
\]

Maximum Likelihood Let

\[
\Theta = \Big\{ (\theta_1, \theta_2, \ldots, \theta_k) \;\Big|\; \theta_j \ge 0,\ j = 1, \ldots, k,\ \sum_{j=1}^{k} \theta_j = 1 \Big\}
\]

denote the parameter space.

Definition 12.1 (Likelihood Function, Maximum Likelihood Estimate, Log Likelihood Function). The likelihood
function of the parameters θ is defined as

\[
L(\theta \mid \mathbf{x}) = P_{\mathbf{X}}(\mathbf{x} \mid \theta).
\]

The maximum likelihood estimate $\hat\theta_{MLE}$ is defined as the value of θ that maximises $L(\theta \mid \mathbf{x})$. The log
likelihood function is:

\[
\log L(\theta_1, \theta_2, \ldots, \theta_k) = \log P_{\mathbf{X}}(\mathbf{x} \mid \theta),
\]

where log is used to denote the natural logarithm.

There is an elegant expression of the likelihood function in terms of the Shannon Entropy and Kullback
Leibler divergence given below.

Definition 12.2 (Shannon Entropy). The Shannon Entropy, or Entropy, of a probability distribution
$\theta = (\theta_1, \ldots, \theta_k)$, where $\theta_j \ge 0$, $j = 1, \ldots, k$ and $\theta_1 + \ldots + \theta_k = 1$, is defined as

\[
H(\theta) = - \sum_{j=1}^{k} \theta_j \log \theta_j .
\]

In the definition of H(θ), the convention 0 log 0 = 0 is used, obtained by continuous extension of the
function x log x, x > 0.

Note that H(θ) ≥ 0. Recall the definition of Kullback Leibler divergence:

Definition 12.3 (Kullback Leibler Divergence). The Kullback Leibler Divergence between two discrete
probability functions f and g with the same state space $\mathcal{X}$ is defined as

\[
D_{KL}(f \mid g) = \sum_{x \in \mathcal{X}} f(x) \log \frac{f(x)}{g(x)} .
\]

The Kullback Leibler divergence has the property that it is non-negative, and for two probability
measures defined on the same finite state space, $D_{KL}(f \mid g) = 0$ if and only if f = g. This is a consequence
of Jensen's inequality and is now stated.

Lemma 12.4. For any two discrete probability distributions f and g , it holds that

DKL (f ∣g) ≥ 0

and DKL (f ∣g) = 0 if and only if f ≡ g .

Proof of Lemma 12.4 The proof uses Jensen's Inequality¹, namely, that for any convex function ϕ,
E[ϕ(X)] ≥ ϕ(E[X]), with equality if and only if either ϕ(x) = ax + b or P(X = y) = 1 for some point
y. Note that f(x) ≥ 0 for all $x \in \mathcal{X}$ and that $\sum_{x \in \mathcal{X}} f(x) = \sum_{x \in \mathcal{X}} g(x) = 1$. Using this, together with the
fact that − log is convex, yields

\[
D_{KL}(f \mid g) = - \sum_{x \in \mathcal{X}} f(x) \log \Big( \frac{g(x)}{f(x)} \Big)
\ge - \log \Big( \sum_{x \in \mathcal{X}} f(x) \frac{g(x)}{f(x)} \Big) = - \log 1 = 0
\]

with equality if and only if f = g.

The likelihood function may be expressed in terms of the Shannon Entropy and the Kullback Leibler
divergence as follows:

Theorem 12.5. Let

\[
L_n(\theta \mid \mathbf{x}) = P_{\mathbf{X} \mid \Theta}(\mathbf{x} \mid \theta) = \prod_{i=1}^{k} \theta_i^{n_i}
\]

denote the likelihood function for the parameter vector $\theta = (\theta_1, \ldots, \theta_k)$, where the n-vector $\mathbf{x}$ denotes
the outcomes of n independent trials, each taking values in $\mathcal{X} = (x^{(1)}, \ldots, x^{(k)})$, and $n_i$ denotes the
number of times $x^{(i)}$ appears in the list $\mathbf{x}$. Let

\[
\hat\theta = \Big( \frac{n_1}{n}, \ldots, \frac{n_k}{n} \Big) .
\]

Then

\[
- \frac{1}{n} \log L_n(\theta \mid \mathbf{x}) = H(\hat\theta) + D_{KL}(\hat\theta \mid \theta) \tag{12.2}
\]

where H denotes the Shannon entropy, so that

\[
H(\hat\theta) = - \sum_{i=1}^{k} \hat\theta_i \log \hat\theta_i = - \frac{1}{n} \log P_{\mathbf{X} \mid \Theta}(\mathbf{x} \mid \hat\theta)
\]

and $D_{KL}$ the Kullback Leibler divergence, so that

\[
D_{KL}(\hat\theta \mid \theta) = \sum_{i=1}^{k} \hat\theta_i \log \frac{\hat\theta_i}{\theta_i} .
\]

¹ J.L. Jensen (1859 - 1925) published this in Acta Mathematica in 1906.

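Proof of Theorem 12.5 Since $P_{\mathbf{X}}(\mathbf{x} \mid \theta) = \prod_{i=1}^{k} \theta_i^{n_i}$ it follows directly that

\[
- \frac{1}{n} \log P_{\mathbf{X}}(\mathbf{x} \mid \hat\theta) = - \frac{1}{n} \sum_{i=1}^{k} n_i \log \hat\theta_i
= - \sum_{i=1}^{k} \hat\theta_i \log \hat\theta_i = H(\hat\theta). \tag{12.3}
\]

This is the Shannon entropy for the empirical distribution, given by Definition 12.2.

For arbitrary θ ∈ Θ, it therefore follows directly that

\[
\begin{aligned}
- \frac{1}{n} \log L_n(\theta) := - \frac{1}{n} \log P_{\mathbf{X}}(\mathbf{x} \mid \theta)
&= - \frac{1}{n} \log \prod_{i=1}^{k} \theta_i^{n_i}
= - \frac{1}{n} \sum_{i=1}^{k} n_i \log \theta_i
= - \sum_{i=1}^{k} \hat\theta_i \log \theta_i \\
&= - \sum_{i=1}^{k} \hat\theta_i \log \hat\theta_i + \sum_{i=1}^{k} \hat\theta_i \log \frac{\hat\theta_i}{\theta_i} \\
&= H(\hat\theta) + D_{KL}(\hat\theta \mid \theta)
\end{aligned}
\]

and Theorem 12.5 is proved.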
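The identity of Theorem 12.5 is easy to check numerically. The Python sketch below does so for one made-up set of counts and one arbitrary θ; the helper names are hypothetical.

```python
import math

def entropy(p):
    # Shannon entropy with the convention 0 log 0 = 0.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(f, g):
    # D_KL(f | g); terms with f(x) = 0 contribute 0.
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g) if fi > 0)

counts = [5, 3, 2]                      # n_1, ..., n_k (illustrative)
n = sum(counts)
theta_hat = [c / n for c in counts]     # empirical distribution
theta = [0.5, 0.25, 0.25]               # an arbitrary parameter value

lhs = -sum(c * math.log(t) for c, t in zip(counts, theta)) / n
rhs = entropy(theta_hat) + kl_divergence(theta_hat, theta)
print(lhs, rhs)
```

The two sides agree up to floating point error, and the divergence term vanishes when θ is set equal to the empirical distribution.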

Since the Kullback Leibler divergence is non-negative, and equal to zero only when its two arguments coincide,
it now follows directly that the maximum likelihood estimate $\hat\theta_{MLE}$ of θ is given by

\[
\hat\theta_{MLE} = \Big( \frac{n_1}{n}, \ldots, \frac{n_k}{n} \Big) .
\]

Recall that, for parameter estimation in statistics, the same notation $\hat\theta$ is used for an estimate, estimator,
and estimating function for a parameter θ.

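It is important to have, at least approximately, the distribution of the estimator. Let

\[
Y_j = \sum_{k=1}^{n} \mathbf{1}_{x^{(j)}}(X_k),
\]

the number of times that $x^{(j)}$ appears in $\mathbf{X}$. Then $\hat\theta_j = \frac{1}{n} Y_j$,

\[
E[Y_j] = n \theta_j ,
\]

\[
E[Y_j^2] = \sum_{k_1=1}^{n} \sum_{k_2=1}^{n} E\big[ \mathbf{1}_{x^{(j)}}(X_{k_1}) \mathbf{1}_{x^{(j)}}(X_{k_2}) \big] = n \theta_j + n(n-1) \theta_j^2 ,
\]

giving

\[
V(Y_j) = n \theta_j (1 - \theta_j)
\]

and, for i ≠ j,

\[
E[Y_i Y_j] = n(n-1) \theta_i \theta_j
\]

so that

\[
\mathrm{Cov}(Y_i, Y_j) = - n \theta_i \theta_j .
\]

Since $\hat\theta = \frac{1}{n} Y$, it follows that

\[
E[\hat\theta_{ML}] = \theta, \qquad \mathrm{Cov}\big( \sqrt{n} (\hat\theta_{ML} - \theta) \big) = C
\]

where $C_{jj} = \theta_j (1 - \theta_j)$ and $C_{ij} = - \theta_i \theta_j$ for i ≠ j. Furthermore, the following central limit theorem is standard
from multivariate analysis:

\[
\sqrt{n} (\hat\theta_{ML} - \theta) \xrightarrow[n \to +\infty]{} N(0, C).
\]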

12.4.2 MLE for a Probability Factorised along a DAG


Let θ denote the entire collection of parameters $(\theta_{jil})$ arranged as a vector and let Θ denote the
parameter space. Let

\[
\mathbf{X} = \begin{pmatrix} X_{(1)} \\ \vdots \\ X_{(n)} \end{pmatrix}
\]

denote the random matrix, where each row represents an independent copy of X, and let

\[
\mathbf{x} = \begin{pmatrix} x_{(1)} \\ \vdots \\ x_{(n)} \end{pmatrix},
\]

the n × d data matrix, denote an instantiation of $\mathbf{X}$.

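Firstly, the probability function $P_X$ may be written as

\[
P_X = \prod_{j=1}^{d} P_{X_j \mid \mathrm{Pa}_j} \tag{12.4}
\]

Setting

\[
n_k\big(x_j^{(i)}, \pi_j^{(l)}\big) =
\begin{cases}
1 & (x_{(k)})_j = x_j^{(i)},\ \pi_j(x_{(k)}) = \pi_j^{(l)} \\
0 & \text{otherwise}
\end{cases}
\]

and

\[
n\big(x_j^{(i)}, \pi_j^{(l)}\big) = \sum_{k=1}^{n} n_k\big(x_j^{(i)}, \pi_j^{(l)}\big),
\]

the crucial point is that, using Equation (12.4), the likelihood function has product form:

\[
L(\theta \mid \mathbf{x}) = P_{\mathbf{X} \mid \Theta}(\mathbf{x} \mid \theta)
= \prod_{k=1}^{n} P_{X \mid \Theta}(x_{(k)} \mid \theta)
= \prod_{j=1}^{d} \prod_{l=1}^{q_j} \prod_{i=1}^{k_j} \prod_{k=1}^{n} \theta_{jil}^{\, n_k(x_j^{(i)}, \pi_j^{(l)})}
= \prod_{j=1}^{d} \prod_{l=1}^{q_j} \Big( \prod_{i=1}^{k_j} \theta_{jil}^{\, n(x_j^{(i)}, \pi_j^{(l)})} \Big). \tag{12.5}
\]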

This is the product of likelihood functions; each (j, l) represents the likelihood function for the parameters
$(\theta_{jil})_{i=1}^{k_j}$ based on $n(\pi_j^{(l)})$ independent observations where

\[
n\big(\pi_j^{(l)}\big) := \sum_{i=1}^{k_j} n\big(x_j^{(i)}, \pi_j^{(l)}\big).
\]

It follows that the maximum likelihood estimate is

\[
\hat\theta_{ML;jil} = \frac{n\big(x_j^{(i)}, \pi_j^{(l)}\big)}{n\big(\pi_j^{(l)}\big)}
= \frac{\text{frequency of } (x_j^{(i)}, \pi_j^{(l)}) \text{ configuration}}{\text{total frequency of } \pi_j^{(l)} \text{ configuration}} .
\]

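Furthermore, the estimator satisfies $n(\pi_j^{(l)})\, \hat\theta_{j.l} \sim \mathrm{Mult}\big(n(\pi_j^{(l)});\ \theta_{j1l}, \ldots, \theta_{jk_j l}\big)$, where the family of random
vectors $\big( (\hat\theta_{jil})_{i=1}^{k_j} \big)_{(j,l)}$ are independent and, for each (j, l),

\[
E[\hat\theta_{ML;jil}] = \theta_{jil},
\]

\[
\mathrm{Cov}\big(\hat\theta_{ML;j i_1 l}, \hat\theta_{ML;j i_2 l}\big) = C^{(jl)}_{i_1 i_2} =
\begin{cases}
\dfrac{1}{n(\pi_j^{(l)})}\, \theta_{jil} (1 - \theta_{jil}) & i_1 = i_2 = i \\[2ex]
- \dfrac{1}{n(\pi_j^{(l)})}\, \theta_{j i_1 l}\, \theta_{j i_2 l} & i_1 \ne i_2 ,
\end{cases}
\]

and, asymptotically for each (j, l),

\[
\sqrt{n(\pi_j^{(l)})}\, \big( \hat\theta_{ML;j.l} - \theta_{j.l} \big) \xrightarrow[n(\pi_j^{(l)}) \to +\infty]{} N\big(0, C^{(jl)}\big).
\]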

12.5 The Bayesian Approach


The classical approach to statistics starts by approximating a situation by constructing a probability
model with unknown parameters. Data is then obtained and the parameters estimated from the data.
The estimates are then plugged into the model and the estimated probability is then used to make
predictions. These predictions are considered to be approximate, where the approximation is from two
sources: 1) A probability model does not give a full description of the problem; 2) the true parameter
values are unknown and approximate values are used.
The Bayesian approach deals with the uncertainty in the parameter value by modelling the uncertainty
as a probability distribution, so that the parameter may be regarded as the outcome of a random
variable with this distribution. A probability model to approximate the situation is established that
has some unknown parameters and the uncertainty in the parameter values is modelled by placing
a probability distribution $\pi_\Theta$ over the parameter space Θ. The distribution $\pi_\Theta$, which models the
uncertainty in the parameters before any data is gathered, is known as the prior distribution. When
data $\mathbf{x}$, an instantiation of $\mathbf{X}$, is gathered, this assessment is updated to:

\[
\pi_{\Theta \mid \mathbf{X}}(\theta \mid \mathbf{x}) = \frac{P_{\mathbf{X} \mid \Theta}(\mathbf{x} \mid \theta)\, \pi_\Theta(\theta)}{P_{\mathbf{X}}(\mathbf{x})}
= \frac{P_{\mathbf{X} \mid \Theta}(\mathbf{x} \mid \theta)\, \pi_\Theta(\theta)}{\int_\Theta P_{\mathbf{X} \mid \Theta}(\mathbf{x} \mid \theta)\, \pi_\Theta(\theta)\, d\theta} .
\]

This is then used to compute the predictive probability of a random variable Y that is independent
of $\mathbf{X}$ once the parameter value is known; $Y \perp \mathbf{X} \mid \Theta$. This is computed as

\[
P_{Y \mid \mathbf{X}}(y \mid \mathbf{x}) = \int_\Theta P_{Y \mid \mathbf{X}, \Theta}(y \mid \mathbf{x}, \theta)\, \pi_{\Theta \mid \mathbf{X}}(\theta \mid \mathbf{x})\, d\theta
= \int_\Theta P_{Y \mid \Theta}(y \mid \theta)\, \pi_{\Theta \mid \mathbf{X}}(\theta \mid \mathbf{x})\, d\theta .
\]

To keep computations within a reasonable framework, it is important that the prior distribution is
from a conjugate family.

Definition 12.6 (Conjugate Prior). A prior distribution from a family that is closed under sampling
is known as a conjugate prior.

12.5.1 Independent Bernoulli trials and the Beta distribution


The following discussion motivates and explains the Bayesian approach, illustrating it by a basic
example.
If a thumb-tack is thrown in the air, it will come to rest either on its point (0) or on its head (1).
Suppose the thumb-tack is flipped n times in identical conditions. Let $x^{(n)}$ denote the sequence of
outcomes

\[
x^{(n)} = (x_1, \ldots, x_n)^t .
\]

Each trial is a Bernoulli trial with probability θ of success (obtaining a 1). This is denoted by

\[
X_i \sim \mathrm{Be}(\theta), \qquad i = 1, \ldots, n.
\]

Using the Bayesian approach, the parameter θ is regarded as the outcome of a random variable,
which is denoted by Θ. The outcomes are conditionally independent, given θ. This is denoted by

\[
X_i \perp X_j \mid \Theta, \qquad i \ne j.
\]

When Θ = θ is given, the random variables $X_1, \ldots, X_n$ are independent. Let $X^{(n)} = (X_1, \ldots, X_n)^t$ so
that

\[
P_{X^{(n)} \mid \Theta}\big(x^{(n)} \mid \theta\big) = \prod_{l=1}^{n} \theta^{x_l} (1 - \theta)^{1 - x_l} = \theta^{k} (1 - \theta)^{n-k}
\]

where $k = \sum_{l=1}^{n} x_l$.

The problem is to use $x^{(n)}$ to make an assessment of θ and then use this to assess the probability function
for a further outcome $X_{n+1}$. The Bayesian approach is, starting with a prior density $\pi_\Theta(\cdot)$ over the
parameter space $\tilde\Theta = [0, 1]$, to find the posterior density $\pi_{\Theta \mid X^{(n)}}(\cdot \mid x^{(n)})$:

\[
\pi_{\Theta \mid X^{(n)}}\big(\theta \mid x^{(n)}\big)
= \frac{P_{X^{(n)} \mid \Theta}(x^{(n)} \mid \theta)\, \pi_\Theta(\theta)}{P_{X^{(n)}}(x^{(n)})}
= \frac{P_{X^{(n)} \mid \Theta}(x^{(n)} \mid \theta)\, \pi_\Theta(\theta)}{\int P_{X^{(n)} \mid \Theta}(x^{(n)} \mid \phi)\, \pi_\Theta(\phi)\, d\phi} .
\]

Let $\pi_\Theta$ be the uniform density on [0, 1]. This represents no initial preference concerning θ; all values
are equally plausible². The choice of prior may seem arbitrary, but following the computations below,
it should be clear that, from a large class of priors, the final answer does not depend much on the
choice of prior if the thumb-tack is thrown a large number of times.

With the uniform prior,

\[
\int_0^1 P_{X^{(n)} \mid \Theta}\big(x^{(n)} \mid \theta\big)\, \pi_\Theta(\theta)\, d\theta = \int_0^1 \theta^{k} (1 - \theta)^{n-k}\, d\theta = \frac{k!\,(n-k)!}{(n+1)!} . \tag{12.6}
\]

The posterior distribution is a Beta density

\[
\pi_{\Theta \mid X^{(n)}}\big(\theta \mid x^{(n)}\big) =
\begin{cases}
\dfrac{(n+1)!}{k!\,(n-k)!}\, \theta^{k} (1 - \theta)^{n-k} & 0 \le \theta \le 1 \\[1.5ex]
0 & \text{otherwise.}
\end{cases} \tag{12.7}
\]

The Beta distribution is not restricted to integer parameter values; the Euler Gamma function is needed to
extend the definition from positive integers to all positive real parameters.

Definition 12.7 (Euler Gamma Function). The Euler Gamma Function Γ(α) : (0, +∞) → (0, +∞) is
defined as

\[
\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x}\, dx. \tag{12.8}
\]

The Euler Gamma function satisfies the following properties.

Lemma 12.8. For all α > 0, Γ(α + 1) = αΓ(α). If n is an integer satisfying n ≥ 1, then

\[
\Gamma(n) = (n-1)!
\]

Proof Note that $\Gamma(1) = \int_0^{\infty} e^{-x}\, dx = 1$. For all α > 0, integration by parts gives

\[
\Gamma(\alpha + 1) = \int_0^{\infty} x^{\alpha} e^{-x}\, dx = \alpha \Gamma(\alpha). \tag{12.9}
\]

The result follows directly.

² All statistical methods contain some ad hoc element and, in Bayesian statistics, this is contained in the choice of
prior distribution. The results obtained from any statistical analysis are only reliable if there is sufficient data so that
any inference will be robust under a rather general choice of prior.
There are well known difficulties with the statement that a uniform prior represents no preference concerning the value
of θ. If the prior density for Θ is uniform, then the prior density of Θ² will not be uniform, so `no preference' for values
of Θ indicates that there is a distinct preference among possible initial values of Θ². If $\pi_1(x) = 1$ for 0 < x < 1 is the
density function for Θ and $\pi_2$ is the density function for Θ², then $\pi_2(x) = \frac{1}{2 x^{1/2}}$ for 0 < x < 1.

For Bernoulli sampling, given a sequence $x = (x_1, \ldots, x_n)$ containing k 1's and n − k 0's, the likelihood
function is $L(\theta) = \theta^{k} (1 - \theta)^{n-k}$. Since

\[
\pi(\theta \mid x) \propto L(\theta \mid x)\, \pi(\theta)
\]

where π is the prior, it follows that the prior should have the form

\[
\pi(\theta) \propto \theta^{a} (1 - \theta)^{b}
\]

for some values a and b to guarantee that both prior and posterior come from the same conjugate
family. In this case, the family of distributions is the family of Beta distributions, defined as follows:

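Definition 12.9 (Beta Density). The beta density Beta(α, β) with parameters α > 0 and β > 0 is
defined as the function

\[
\psi(t) =
\begin{cases}
\dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)}\, t^{\alpha - 1} (1 - t)^{\beta - 1} & t \in [0, 1] \\[1.5ex]
0 & t \notin [0, 1]
\end{cases} \tag{12.10}
\]

The Beta density is a probability density function for all real α > 0 and β > 0. It follows that, for
Binomial sampling, updating may be carried out very easily for any prior distribution within the Beta
family. Suppose the prior distribution $\pi_0$ is the B(α, β) density function, and n trials are observed, with k
taking the value 1 and n − k taking the value 0. Then

\[
\begin{aligned}
\pi_{\Theta \mid X^{(n)}}\big(\theta \mid x^{(n)}\big)
&= \frac{P_{X^{(n)} \mid \Theta}(x^{(n)} \mid \theta)\, \pi_\Theta(\theta)}{P_{X^{(n)}}(x^{(n)})} \\
&= \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)\, P_{X^{(n)}}(x^{(n)})}\, \theta^{\alpha + k - 1} (1 - \theta)^{\beta + n - k - 1}
= c\, \theta^{\alpha + k - 1} (1 - \theta)^{\beta + n - k - 1} .
\end{aligned}
\]

Since $\int_0^1 \pi_{\Theta \mid X^{(n)}}(\theta \mid x^{(n)})\, d\theta = 1$, it follows that:

\[
\pi_{\Theta \mid X^{(n)}}\big(\theta \mid x^{(n)}\big) =
\begin{cases}
\dfrac{\Gamma(\alpha + \beta + n)}{\Gamma(\alpha + k) \Gamma(\beta + n - k)}\, \theta^{\alpha + k - 1} (1 - \theta)^{\beta + n - k - 1} & \theta \in (0, 1) \\[1.5ex]
0 & \theta \notin (0, 1),
\end{cases}
\]

so that $\pi_{\Theta \mid X^{(n)}}(\theta \mid x^{(n)})$ is a B(α + k, β + n − k) density.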

Definition 12.10 (Maximum Posterior Estimate). The maximum posterior estimate, $\hat\theta_{MAP}$, is the
value of θ which maximises the posterior density $\pi_{\Theta \mid X^{(n)}}(\theta \mid x^{(n)})$.

When the posterior density is B(k + α, n − k + β), an easy computation gives

\[
\hat\theta_{MAP} = \frac{k + \alpha - 1}{n + \alpha + \beta - 2} .
\]

Note that when the prior density is uniform, as in the case above, the MAP and MLE are exactly the
same. The parameter, of course, is not an end in itself. The parameter ought to be regarded as a
means to computing the predictive probability. The posterior is used to compute this.
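The Beta update and the MAP formula amount to a few additions. The Python sketch below (function names are hypothetical, the counts are made up) applies them with a uniform prior, for which the MAP estimate should equal k/n.

```python
def beta_posterior(alpha, beta, k, n):
    """Parameters of the posterior after k ones in n Bernoulli trials, prior B(alpha, beta)."""
    return alpha + k, beta + (n - k)

def beta_mode(a, b):
    """Mode of a B(a, b) density; this formula assumes a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("interior mode formula requires a > 1 and b > 1")
    return (a - 1) / (a + b - 2)

# Uniform prior B(1, 1); 7 heads in 10 tosses (illustrative numbers).
a_post, b_post = beta_posterior(1, 1, k=7, n=10)
theta_map = beta_mode(a_post, b_post)
theta_mle = 7 / 10
print(a_post, b_post, theta_map)
```

With any other Beta prior the two estimates differ, but the difference shrinks as n grows because the prior parameters are swamped by the counts.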

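The Predictive Probability for the Next Toss Suppose that $\pi_{\Theta \mid X^{(n)}}(\theta \mid x^{(n)})$ has a B(α + k, β + n − k)
distribution. The predictive probability for the next toss, for a = 0 or 1, is given by

\[
P_{X_{n+1} \mid X^{(n)}}\big(a \mid x^{(n)}\big) = \int_0^1 P_{X_{n+1} \mid \Theta}(a \mid \theta)\, \pi_{\Theta \mid X^{(n)}}\big(\theta \mid x^{(n)}\big)\, d\theta .
\]

Since $P_{X_{n+1} \mid \Theta}(1 \mid \theta) = \theta$, it follows (using Equation (12.9)) that

\[
\begin{aligned}
P_{X_{n+1} \mid X^{(n)}}\big(1 \mid x^{(n)}\big)
&= \frac{\Gamma(\alpha + \beta + n)}{\Gamma(\alpha + k) \Gamma(\beta + n - k)} \int_0^1 \theta^{\alpha + k} (1 - \theta)^{\beta + n - k - 1}\, d\theta \\
&= \frac{\Gamma(\alpha + \beta + n)}{\Gamma(\alpha + k) \Gamma(\beta + n - k)} \cdot \frac{\Gamma(\alpha + k + 1) \Gamma(\beta + n - k)}{\Gamma(\alpha + \beta + n + 1)} \\
&= \frac{\alpha + k}{\alpha + \beta + n} .
\end{aligned}
\]

In particular, note that the uniform prior, $\pi_0(\theta) = 1$ for θ ∈ (0, 1), is the B(1, 1) density function, so
that for binomial sampling with a uniform prior, the predictive probability is

\[
P_{X_{n+1} \mid X^{(n)}}\big(1 \mid x^{(n)}\big) = \frac{k + 1}{n + 2}, \qquad
P_{X_{n+1} \mid X^{(n)}}\big(0 \mid x^{(n)}\big) = \frac{n + 1 - k}{n + 2} . \tag{12.11}
\]

This distribution, or more precisely $\frac{k+1}{n+2}$, is known as the Laplace rule of succession.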
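As a numerical cross-check of the derivation, the Python sketch below evaluates the Gamma function ratio directly (on the log scale, via math.lgamma) and compares it with the closed form (α + k)/(α + β + n). The function names and the counts are illustrative.

```python
import math

def predictive_one_closed_form(alpha, beta, k, n):
    # Closed form for P(X_{n+1} = 1 | data) under a B(alpha, beta) prior.
    return (alpha + k) / (alpha + beta + n)

def predictive_one_gamma_ratio(alpha, beta, k, n):
    # Same quantity from the ratio of Gamma functions in the derivation.
    log_ratio = (
        math.lgamma(alpha + beta + n)
        - math.lgamma(alpha + k)
        - math.lgamma(beta + n - k)
        + math.lgamma(alpha + k + 1)
        + math.lgamma(beta + n - k)
        - math.lgamma(alpha + beta + n + 1)
    )
    return math.exp(log_ratio)

# Uniform prior and 7 ones in 10 trials: the rule of succession gives 8/12.
p_closed = predictive_one_closed_form(1, 1, 7, 10)
p_ratio = predictive_one_gamma_ratio(1, 1, 7, 10)
print(p_closed, p_ratio)
```

Working with log-Gamma values avoids overflow when n is large, which is why the ratio is not computed with factorials.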

12.5.2 Multinomial Sampling and the Dirichlet Integral


Consider the case of multinomial sampling, where there are k possible outcomes in the state space
$\mathcal{X} = (x^{(1)}, \ldots, x^{(k)})$ and $P_X(x^{(j)}) = \theta_j$, $j = 1, \ldots, k$, so that $\theta_1 + \ldots + \theta_k = 1$. Consider n independent
trials, $\mathbf{X} = (X_1, \ldots, X_n)^t$, with outcomes $\mathbf{x} = (x_1, \ldots, x_n)$.
The likelihood function for θ, given $\mathbf{x}$, is:

\[
L(\theta \mid \mathbf{x}) = \theta_1^{n_1} \cdots \theta_k^{n_k}
\]

where $n_j = \sum_{i=1}^{n} \mathbf{1}_{x^{(j)}}(x_i)$; i.e. the number of times outcome $x^{(j)}$ appears in the sequence $\mathbf{x}$, for
j = 1, . . . , k. It follows that, to ensure that the prior and posterior are within the same conjugate
family, the prior has the form:

\[
\pi(\theta) \propto \theta_1^{\alpha_1} \cdots \theta_k^{\alpha_k} .
\]

The natural conjugate family of distributions to use is therefore the Dirichlet family, defined as follows.

Definition 12.11 (Dirichlet Density). The Dirichlet density $\mathrm{Dir}(a_1, \ldots, a_k)$ is the function

\[
\pi(\theta_1, \ldots, \theta_k) =
\begin{cases}
\dfrac{\Gamma(a_1 + \ldots + a_k)}{\prod_{j=1}^{k} \Gamma(a_j)} \prod_{j=1}^{k} \theta_j^{a_j - 1} & \theta_j \ge 0,\ \sum_{j=1}^{k} \theta_j = 1, \\[1.5ex]
0 & \text{otherwise,}
\end{cases} \tag{12.12}
\]

where Γ denotes the Euler Gamma Function, given in Definition 12.7. The parameters $(a_1, \ldots, a_k)$ are
all strictly positive and are known as hyper parameters.

This density, and integration with respect to this density function, are to be understood in the following
sense. Since $\theta_k = 1 - \sum_{j=1}^{k-1} \theta_j$, it follows that π may be written as $\pi(\theta_1, \ldots, \theta_k) = \tilde\pi(\theta_1, \ldots, \theta_{k-1})$, where

\[
\tilde\pi(\theta_1, \ldots, \theta_{k-1}) =
\begin{cases}
\dfrac{\Gamma(a_1 + \ldots + a_k)}{\prod_{j=1}^{k} \Gamma(a_j)} \Big( \prod_{j=1}^{k-1} \theta_j^{a_j - 1} \Big) \Big( 1 - \sum_{j=1}^{k-1} \theta_j \Big)^{a_k - 1} & \theta_j \ge 0,\ \sum_{j=1}^{k-1} \theta_j \le 1, \\[1.5ex]
0 & \text{otherwise.}
\end{cases} \tag{12.13}
\]

Clearly, when k = 2, this reduces to the Beta density. The Dirichlet density is a probability density
function.

Properties of the Dirichlet Density The family of Dirichlet densities $\mathrm{Dir}(\alpha_1, \ldots, \alpha_k) : \alpha_1 > 0, \ldots, \alpha_k > 0$
is closed under sampling: Consider a prior distribution $\pi_\Theta \sim \mathrm{Dir}(\alpha_1, \ldots, \alpha_k)$ and suppose
that observations of n independent trials are made: $\mathbf{x} := (x_1, \ldots, x_n)$ where $n_j = \sum_{i=1}^{n} \mathbf{1}_{x^{(j)}}(x_i)$, i.e.
the number of appearances of $x^{(j)}$ in the sequence, for j = 1, . . . , k. Let $\pi_{\Theta \mid \mathbf{X}}$ denote the posterior
distribution. Then

\[
\pi_{\Theta \mid \mathbf{X}}(\theta_1, \ldots, \theta_k \mid \mathbf{x}) \sim \mathrm{Dir}(\alpha_1 + n_1, \ldots, \alpha_k + n_k).
\]

The Dirichlet density is usually written exclusively as a function of k variables, $\pi_\Theta(\theta_1, \ldots, \theta_k)$, where
there are k − 1 independent variables and $\theta_k = 1 - \sum_{j=1}^{k-1} \theta_j$.

Mean Posterior Estimate The mean posterior estimate is the expected value of the posterior distribution.
Here,

\[
\hat\theta_{i,MEP} = \int \theta_i\, \pi(\theta_1, \ldots, \theta_k \mid \mathbf{x}, \alpha)\, d\theta_1 \ldots d\theta_k
= \frac{n_i + \alpha_i}{\sum_{j=1}^{k} n_j + \sum_{j=1}^{k} \alpha_j} .
\]

This computation is left as an exercise.
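Without carrying out the exercise, the resulting formula can at least be illustrated. The Python sketch below evaluates the mean posterior estimate exactly with rational arithmetic; the function name is hypothetical and, purely to keep the fractions exact, the hyper parameters are assumed to be integers.

```python
from fractions import Fraction

def dirichlet_posterior_mean(counts, alphas):
    """Posterior mean of each theta_i under a Dir(alphas) prior and multinomial counts.

    Integer hyper parameters are assumed here so that exact fractions can be used.
    """
    total = sum(counts) + sum(alphas)
    return [Fraction(n_i + a_i, total) for n_i, a_i in zip(counts, alphas)]

# Counts (5, 3, 2) with the uniform Dirichlet prior Dir(1, 1, 1) (illustrative).
theta_mep = dirichlet_posterior_mean([5, 3, 2], [1, 1, 1])
print(theta_mep)
```

Unlike the maximum likelihood estimate, every component is strictly positive even when the corresponding count is zero.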

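12.5.3 Distribution for Conditional Probabilities of a Bayesian network


The notation $\theta_{j.l} = (\theta_{j1l}, \ldots, \theta_{jk_j l})$ is used to denote the probability distribution over the states of $X_j$,
given that $\pi_j^{(l)}$ is the parent configuration. The prior distribution over $\theta_{j.l}$ is taken to be

\[
\pi_{\Theta_{jl}} \sim \mathrm{Dir}(\alpha_{j1l}, \ldots, \alpha_{jk_j l}).
\]

The prior distribution over the entire collection of parameters Θ is taken to be $\pi_\Theta = \prod_{jl} \pi_{\Theta_{jl}}$. That
is, the distributions over $(\theta_{j.l})_{(j,l)}$ are mutually independent for different (j, l). Suppose an n × d data
matrix is obtained, with n complete instantiations. Let $n(x_j^{(i)}, \pi_j^{(l)})$ denote the number of times that
the configuration $(x_j^{(i)}, \pi_j^{(l)})$ appears in $\mathbf{x}$. It follows that

\[
\pi_{\Theta \mid \mathbf{X}}(\theta \mid \mathbf{x}) = \pi_\Theta(\theta)\, \frac{P_{\mathbf{X} \mid \Theta}(\mathbf{x} \mid \theta)}{P_{\mathbf{X}}(\mathbf{x})} .
\]

Recall the expression for $P_{\mathbf{X} \mid \Theta}(\mathbf{x} \mid \theta)$ found in Equation (12.5). It follows that

\[
\pi_{\Theta \mid \mathbf{X}}(\theta \mid \mathbf{x}) = \frac{1}{P_{\mathbf{X}}(\mathbf{x})} \prod_{j=1}^{d} \prod_{l=1}^{q_j} \pi_{\Theta_{jl}}(\theta_{jl}) \Big( \prod_{i=1}^{k_j} \theta_{jil}^{\, n(x_j^{(i)}, \pi_j^{(l)})} \Big) .
\]

From this, it follows directly that $\pi_{\Theta \mid \mathbf{X}} = \prod_{jl} \pi_{\Theta_{jl} \mid \mathbf{X}}$, where

\[
\pi_{\Theta_{jl} \mid \mathbf{X}}(\cdot \mid \mathbf{x}) \sim \mathrm{Dir}\big( n(x_j^{(1)}, \pi_j^{(l)}) + \alpha_{j1l}, \ldots, n(x_j^{(k_j)}, \pi_j^{(l)}) + \alpha_{jk_j l} \big).
\]

The posterior distribution of $\theta_{j.l}$ depends only on counts of family configurations at node j and not on
configurations at any other node.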

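Predictive Distribution The predictive distribution of a new case $x_{(n+1)}$ may be computed using
the posterior density; with an n × d data matrix $\mathbf{x}$ of n complete instantiations, $\theta_{jil}$, defined in
Equation (12.1), will be estimated by:

\[
\tilde\theta_{jil} = P_{X_{n+1,j} \mid \mathrm{Pa}_{n+1,j}, \mathbf{X}}\big( x_j^{(i)} \mid \pi_j^{(l)}, \mathbf{x} \big). \tag{12.14}
\]

This is the predictive conditional probability that variable $X_{n+1,j}$ attains value $x_j^{(i)}$, given the parent
configuration $\pi_j^{(l)}$ and the cases stored in $\mathbf{x}$. Let $\mathbf{X}$ denote the n × d matrix where each row $X_{k.}$ is an
independent copy of $X = (X_1, \ldots, X_d)$. Recall that

\[
\pi_{\Theta \mid \mathbf{X}} = \prod_{j=1}^{d} \prod_{l=1}^{q_j} \pi_{\Theta_{jl} \mid \mathbf{X}} .
\]

Using Bayes rule,

\[
\pi_{\Theta \mid \mathrm{Pa}_{n+1,j}, \mathbf{X}}
= \frac{P_{\mathrm{Pa}_{n+1,j} \mid \Theta, \mathbf{X}}}{P_{\mathrm{Pa}_{n+1,j} \mid \mathbf{X}}}\, \pi_{\Theta \mid \mathbf{X}}
= \frac{P_{\mathrm{Pa}_{n+1,j} \mid \Theta}}{P_{\mathrm{Pa}_{n+1,j} \mid \mathbf{X}}}\, \pi_{\Theta \mid \mathbf{X}} .
\]

Note that $P_{\mathrm{Pa}_{n+1,j} \mid \Theta}$ is an expression containing sums and products of $(\theta_{ail})_{a=1,\ldots,j-1,\ i=1,\ldots,k_a,\ l=1,\ldots,q_a}$. It
follows that $\pi_{\Theta \mid \mathrm{Pa}_{n+1,j}, \mathbf{X}}$ may be expressed as a product

\[
\pi_{\Theta \mid \mathrm{Pa}_{n+1,j}, \mathbf{X}} = A\big( (\theta_{a.l})_{a=1,\ldots,j-1,\ l=1,\ldots,q_a} \big)\, \pi_{\Theta_{jl} \mid \mathbf{X}}(\theta_{j.l} \mid \mathbf{x}) \prod_{a=j+1}^{d} \prod_{l=1}^{q_a} \pi_{\Theta_{al} \mid \mathbf{X}}(\theta_{a.l} \mid \mathbf{x})
\]

where A is a probability density over $(\theta_{ail})_{a=1,\ldots,j-1,\ i=1,\ldots,k_a,\ l=1,\ldots,q_a}$. Then, by computations as before,
with $\tilde\theta_{jil}$ defined by Equation (12.14), and writing $\alpha_{j.l} = \sum_{i=1}^{k_j} \alpha_{jil}$,

\[
\begin{aligned}
\tilde\theta_{jil} &= P_{X_{n+1,j} \mid \mathrm{Pa}_{n+1,j}, \mathbf{X}}\big( x_j^{(i)} \mid \pi_j^{(l)}, \mathbf{x} \big) \\
&= \int_S P_{X_{n+1,j} \mid \mathrm{Pa}_{n+1,j}, \Theta, \mathbf{X}}\big( x_j^{(i)} \mid \pi_j^{(l)}, \theta, \mathbf{x} \big)\, \pi_{\Theta \mid \mathrm{Pa}_{n+1,j}, \mathbf{X}}\big( \theta \mid \pi_j^{(l)}, \mathbf{x} \big)\, d\theta \\
&= \int_{S_{jl}} \theta_{jil}\, \pi_{\Theta_{j.l} \mid \mathbf{X}}(\theta_{j.l} \mid \mathbf{x})\, d\theta_{j.l} \\
&= \int_{S_{jl}} \theta_{jil}\, \frac{\Gamma\big( n(\pi_j^{(l)}) + \alpha_{j.l} \big)}{\prod_{m=1}^{k_j} \Gamma\big( n(x_j^{(m)}, \pi_j^{(l)}) + \alpha_{jml} \big)} \prod_{m=1}^{k_j} \theta_{jml}^{\, n(x_j^{(m)}, \pi_j^{(l)}) + \alpha_{jml} - 1}\, d\theta_{j.l} \\
&= \frac{\Gamma\big( n(\pi_j^{(l)}) + \alpha_{j.l} \big)}{\prod_{m=1}^{k_j} \Gamma\big( n(x_j^{(m)}, \pi_j^{(l)}) + \alpha_{jml} \big)} \int_{S_{jl}} \prod_{m \ne i} \theta_{jml}^{\, n(x_j^{(m)}, \pi_j^{(l)}) + \alpha_{jml} - 1}\ \theta_{jil}^{\, n(x_j^{(i)}, \pi_j^{(l)}) + \alpha_{jil}}\, d\theta_{j.l} \\
&= \frac{\Gamma\big( n(\pi_j^{(l)}) + \alpha_{j.l} \big)}{\prod_{m=1}^{k_j} \Gamma\big( n(x_j^{(m)}, \pi_j^{(l)}) + \alpha_{jml} \big)} \times \frac{\Gamma\big( n(x_j^{(i)}, \pi_j^{(l)}) + \alpha_{jil} + 1 \big) \prod_{m \ne i} \Gamma\big( n(x_j^{(m)}, \pi_j^{(l)}) + \alpha_{jml} \big)}{\Gamma\big( n(\pi_j^{(l)}) + \alpha_{j.l} + 1 \big)} \\
&= \frac{n(x_j^{(i)}, \pi_j^{(l)}) + \alpha_{jil}}{n(\pi_j^{(l)}) + \sum_{i=1}^{k_j} \alpha_{jil}} ,
\end{aligned}
\]

where $S_{jl}$ is defined as

\[
S_{jl} = \Big\{ (\theta_{jil})_{i=1}^{k_j} \;\Big|\; \theta_{jil} \ge 0,\ i = 1, \ldots, k_j,\ \sum_{i=1}^{k_j} \theta_{jil} = 1 \Big\} .
\]

Comparing with $\hat\theta_{i,MLE} = \frac{n_i}{n}$, note that

\[
\lim_{n \to +\infty} \frac{\hat\theta_{i,MEP}}{\hat\theta_{i,MLE}} = 1.
\]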

12.6 Updating, Missing Data, Fractional Updating


Updating Suppose the cases $x_{(1)}, \ldots, x_{(n)}$ are complete. Suppose next that $x_j^{(i)}$ and $\pi_j^{(l)}$ are observed
in $x_{(n+1)}$. Then, by Bayes rule,

\[
\theta_{j.l} \mid \big( x_{(1)}, \ldots, x_{(n)}, (x_{n+1,j}^{(i)}, \pi_{n+1,j}^{(l)}) \big) \sim \mathrm{Dir}\big( n^*(x_j^{(1)}, \pi_j^{(l)}) + \alpha_{j1l}, \ldots, n^*(x_j^{(k_j)}, \pi_j^{(l)}) + \alpha_{jk_j l} \big),
\]

where

\[
n^*\big( x_j^{(r)}, \pi_j^{(l)} \big) =
\begin{cases}
n\big( x_j^{(r)}, \pi_j^{(l)} \big) & r \ne i \\
n\big( x_j^{(i)}, \pi_j^{(l)} \big) + 1 & r = i.
\end{cases}
\]

The virtual sample size for $\pi_j^{(l)}$ is updated as

\[
s^* = n\big( \pi_j^{(l)} \big) + 1 + \sum_{i=1}^{k_j} \alpha_{jil} .
\]

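A Missing Instantiation Suppose the instantiation at node j is missing in the new case; the parent
configuration $\pi_j^{(l)}$ is present. Let

\[
\mathbf{x} = \begin{pmatrix} x_{(1)} \\ \vdots \\ x_{(n)} \end{pmatrix}
\]

denote the complete instantiations and let $x_{(n+1)}$ denote instantiation n + 1 where the value $x_{n+1,j}$ is
missing. The distribution of the random vector $\theta_{j.l} \mid \mathbf{x}, x_{(n+1)}$ is expressed as the mixture of distributions

\[
\sum_{i=1}^{k_j} w_i\, \mathrm{Dir}\big( n(x_j^{(1)}, \pi_j^{(l)}) + \alpha_{j1l}, \ldots, n(x_j^{(i)}, \pi_j^{(l)}) + 1 + \alpha_{jil}, \ldots, n(x_j^{(k_j)}, \pi_j^{(l)}) + \alpha_{jk_j l} \big),
\]

where

\[
w_i = P_{X_{j,n+1} \mid \mathrm{Pa}_{j,n+1}, \mathbf{X}}\big( x_j^{(i)} \mid \pi_j^{(l)}, \mathbf{x} \big) = \int \theta_{jil}\, \pi_{\Theta \mid \mathbf{X}}(\theta \mid \mathbf{x})\, d\theta .
\]

Updating: Parent Configuration and the state at node j are missing Consider a new case
$x_{(n+1)}$ where both the state and the parent configuration of node j are missing. Then the distribution
of $\theta_{j.l} \mid \mathbf{x}, x_{(n+1)}$ is given as the mixture of distributions

\[
\begin{aligned}
&\sum_{i=1}^{k_j} v_i\, \mathrm{Dir}\big( n(x_j^{(1)}, \pi_j^{(l)}) + \alpha_{j1l}, \ldots, n(x_j^{(i)}, \pi_j^{(l)}) + 1 + \alpha_{jil}, \ldots, n(x_j^{(k_j)}, \pi_j^{(l)}) + \alpha_{jk_j l} \big) \\
&\qquad + v^*\, \mathrm{Dir}\big( n(x_j^{(1)}, \pi_j^{(l)}) + \alpha_{j1l}, \ldots, n(x_j^{(k_j)}, \pi_j^{(l)}) + \alpha_{jk_j l} \big),
\end{aligned}
\]

where

\[
v_i = P_{X_j, \mathrm{Pa}_j \mid \mathbf{X}, X_{(n+1)}}\big( x_j^{(i)}, \pi_j^{(l)} \mid \mathbf{x}, x_{(n+1)} \big), \qquad i = 1, \ldots, k_j
\]

and

\[
v^* = 1 - P_{\mathrm{Pa}_j \mid \mathbf{X}, X_{(n+1)}}\big( \pi_j^{(l)} \mid \mathbf{x}, x_{(n+1)} \big).
\]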

Fractional Updating The preceding shows that adding new cases with missing values results in
dealing with increasingly messy mixtures, with increasing numbers of components. The standard way
to deal with this is to use a single Dirichlet distribution that approximates the true update, taking the
updated distribution as:

\[
\theta_{j.l} \sim \mathrm{Dir}\big( n^*(x_j^{(1)}, \pi_j^{(l)}) + \alpha_{j1l}, \ldots, n^*(x_j^{(k_j)}, \pi_j^{(l)}) + \alpha_{jk_j l} \big)
\]

where

\[
n^*\big( x_j^{(i)}, \pi_j^{(l)} \big) = n\big( x_j^{(i)}, \pi_j^{(l)} \big) + P_{X_j, \mathrm{Pa}_j \mid \mathbf{X}, X_{(n+1)}}\big( x_j^{(i)}, \pi_j^{(l)} \mid \mathbf{x}, x_{(n+1)} \big), \qquad i = 1, \ldots, k_j .
\]

This is known as fractional updating.
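In code, fractional updating simply adds a probability mass to each count instead of adding 1 to a single count. The Python sketch below illustrates this for one parent configuration; the function name and the numbers are made up, and the joint probabilities of (state, parent configuration) given the incomplete case are assumed to have been obtained beforehand by inference in the current network.

```python
def fractional_update(counts, joint_probs):
    """Add to each count the probability of (state i, this parent configuration).

    counts[i]      : current n(x_j^(i), pi_j^(l)) for one parent configuration l
    joint_probs[i] : P(X_j = x_j^(i), Pa_j = pi_j^(l) | data, incomplete new case),
                     assumed to be supplied by a separate inference step.
    """
    if len(counts) != len(joint_probs):
        raise ValueError("one probability is needed per state")
    return [n_i + p_i for n_i, p_i in zip(counts, joint_probs)]

# Binary node, one parent configuration; the new case puts total mass 0.5 on this
# configuration, split 0.2 / 0.3 between the two states (illustrative values).
updated = fractional_update([4.0, 6.0], [0.2, 0.3])
print(updated)
```

The total count for the parent configuration grows by the probability that the new case has that configuration, here 0.5, rather than by a whole observation.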



Fading If the parameters change with time, then information learnt a long time ago may not be so
useful. A way to make the old cases less relevant is to have the sample size discounted by a fading
factor $q_F$, a positive number less than one.

The fading update is as follows: using $n(x_j^{(i)}, \pi_j^{(l)})$ in this section to denote what was previously written $n(x_j^{(i)}, \pi_j^{(l)}) + \alpha_{jil}$,
if $(x_j^{(i)}, \pi_j^{(l)})$ is observed in the next instantiation, then n is updated to $n^*$ where

\[
n^*\big( x_j^{(r)}, \pi_j^{(l)} \big) =
\begin{cases}
q_F\, n\big( x_j^{(r)}, \pi_j^{(l)} \big) & r \ne i \\
1 + q_F\, n\big( x_j^{(i)}, \pi_j^{(l)} \big) & r = i.
\end{cases}
\]

If $(\pi_j^{(l)}, x_j^{(i)})$ is observed for some i = 1, . . . , $k_j$, the virtual sample size for parent configuration $\pi_j^{(l)}$ is
updated to

\[
n^*\big( \pi_j^{(l)} \big) = 1 + q_F\, n\big( \pi_j^{(l)} \big)
\]

and

\[
n^*\big( \pi_j^{(a)} \big) = q_F\, n\big( \pi_j^{(a)} \big), \qquad a \ne l.
\]

Consider the sequence of equations

\[
s_n = q_F\, s_{n-1} + 1, \qquad s_0 = s.
\]

This may be solved as

\[
s_n = q_F^{\,n}\, s + \sum_{i=0}^{n-1} q_F^{\,i} = q_F^{\,n}\, s + \frac{1 - q_F^{\,n}}{1 - q_F} .
\]

The limiting effective maximal sample size is therefore

\[
s^* = \frac{1}{1 - q_F} .
\]
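The limiting value can be seen by iterating the recursion directly. The Python sketch below does this for an illustrative fading factor and starting size; the function name is hypothetical.

```python
def faded_sample_size(s0, q_f, steps):
    """Iterate s_n = q_f * s_{n-1} + 1 starting from s_0 = s0."""
    if not 0 < q_f < 1:
        raise ValueError("the fading factor must lie strictly between 0 and 1")
    s = s0
    for _ in range(steps):
        s = q_f * s + 1
    return s

q_f = 0.9
limit = 1 / (1 - q_f)                    # limiting effective sample size
s_after_one = faded_sample_size(50.0, q_f, 1)
s_after_many = faded_sample_size(50.0, q_f, 200)
print(s_after_one, s_after_many, limit)
```

Whatever the starting size, the effective sample size settles near 1/(1 − q_F), so with q_F = 0.9 the estimates behave roughly as if they were based on the ten most recent cases.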

12.7 Likelihood Function for the Graph Structure


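The Bayesian approach to parameter learning leads to a straightforward and elegant approach to
computing a likelihood function for the graph structure, given data $\mathbf{x}$. This was introduced by Cooper
and Herskovits (1992) [31]. Let G = (V, D) denote a directed acyclic graph with edge set D. Let $\mathcal{D}$
denote the collection of all possible edge sets over DAGs with node set V, where V is the collection of
random variables. Suppose that, for each $D \in \mathcal{D}$, we have a prior distribution $\pi_{\Theta \mid D}(\theta \mid D)$ for the parameters
θ when the edge set is D. The likelihood for the graph structure D given data $\mathbf{x}$ is:

\[
P_{\mathbf{X} \mid D}(\mathbf{x} \mid D) = \int P_{\mathbf{X} \mid \Theta, D}(\mathbf{x} \mid \theta, D)\, \pi_{\Theta \mid D}(\theta \mid D)\, d\theta .
\]

Since $\pi_{\Theta \mid D}(\theta \mid D) = \prod_{j=1}^{d} \prod_{l=1}^{q_j} \pi_{\Theta_{j.l} \mid D}(\theta_{j.l} \mid D)$, it follows that:

\[
P_{\mathbf{X} \mid D}(\mathbf{x} \mid D) = \int \prod_{k=1}^{n} P_{X \mid \Theta, D}(x_{(k)} \mid \theta, D) \prod_{j=1}^{d} \prod_{l=1}^{q_j} \phi(\theta_{j.l} \mid \alpha_{j.l}, D)\, d\theta_{j.l}
\]

where $\phi(\theta_{j.l} \mid \alpha_{j.l})$ is a compact way of referring to the Dirichlet density $\mathrm{Dir}(\alpha_{j1l}, \ldots, \alpha_{jk_j l})$.

Because $P_{\mathbf{X} \mid \Theta, D}(\mathbf{x} \mid \theta, D)$ has a convenient product form, computing the Dirichlet integral is
straightforward and gives

\[
L(D \mid \mathbf{x}) := P_{\mathbf{X} \mid D}(\mathbf{x} \mid D) = \prod_{j=1}^{d} \prod_{l=1}^{q_j} \frac{\Gamma\big( \sum_{i=1}^{k_j} \alpha_{jil} \big)}{\Gamma\big( n(\pi_j^{(l)}) + \sum_{i=1}^{k_j} \alpha_{jil} \big)} \prod_{i=1}^{k_j} \frac{\Gamma\big( n(x_j^{(i)}, \pi_j^{(l)}) + \alpha_{jil} \big)}{\Gamma(\alpha_{jil})} . \tag{12.15}
\]

The computation is left as an exercise; this is the Cooper Herskovits likelihood for the graph structure.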
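In practice the product of Gamma functions is evaluated on the log scale. The Python sketch below computes the contribution of a single node to the log of the likelihood in Equation (12.15); the function name and the list-of-lists layout are illustrative assumptions. As a sanity check it uses a parentless binary node with a uniform prior, for which the result should reduce to the integral in Equation (12.6).

```python
import math

def log_node_score(counts_by_config, alphas_by_config):
    """Log of one node's factor in the structure likelihood.

    counts_by_config[l][i] : n(x_j^(i), pi_j^(l))
    alphas_by_config[l][i] : alpha_jil
    A node without parents has a single (empty) parent configuration.
    """
    score = 0.0
    for counts, alphas in zip(counts_by_config, alphas_by_config):
        score += math.lgamma(sum(alphas)) - math.lgamma(sum(counts) + sum(alphas))
        for n_il, a_il in zip(counts, alphas):
            score += math.lgamma(n_il + a_il) - math.lgamma(a_il)
    return score

# Thumb-tack data: k = 7 ones and n - k = 3 zeros, uniform prior Dir(1, 1).
log_score = log_node_score([[7, 3]], [[1.0, 1.0]])
expected = math.factorial(7) * math.factorial(3) / math.factorial(11)
print(math.exp(log_score), expected)
```

The log score of a whole graph is the sum of these node contributions, so a change to one edge only requires recomputing the terms for the node whose parent set changed.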

12.8 Bayesian Sufficient Statistics


Let X be an n × d random matrix, where each row is an independent copy of a discrete random vector
X = (X1 , . . . , Xd ) and let θ be a continuous random vector of unknown parameters. Suppose that, for
a parameter vector θ, X has conditional probability function pX (.∣θ). Suppose that there is a prior
density πΘ (θ) over the parameter space and suppose that t is a function or a statistic of X,

t = t (X) .

Definition 12.12 (Bayesian Sufficiency). A statistic T defined as T = t (X) such that for every prior
πΘ within the space of prior distributions under consideration, there is a function ϕ such that

\[ \pi_{\Theta|X}(\theta|x) = \frac{p_X(x|\theta)\,\pi_\Theta(\theta)}{p_X(x)} = \phi(\theta, t(x)) \tag{12.16} \]

is called a Bayesian sufficient statistic for θ.
This definition states that for learning about θ based on X, the statistic T contains all the relevant
information, since the posterior distribution depends on X only through T .

The following result shows that if the conditional distribution of X given t(X) does not depend on θ,
then t(X) is Bayesian sufficient for θ. If the families of probability measures have finite dimensional
parameter spaces, then the converse is also true. If there are an infinite number of parameters,
counterexamples may be obtained to the converse statement.

Proposition 12.13. Let t denote a function and let T = t(X). If

X ⊥ θ∣T, (12.17)
where Equation (12.17) means that

pX∣T,Θ (x∣t, θ) = pX∣T (x∣t) independent of θ ∈ Θ (12.18)

then T = t(X) is a Bayesian sufficient statistic for θ.

Proof of Proposition 12.13 As usual, let T = t(X). An application of Bayes rule gives

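\[ \pi_{\Theta|X,T}(\theta|x,t) = \frac{p_{X,T|\Theta}(x,t|\theta)\,\pi_\Theta(\theta)}{p_{X,T}(x,t)} = \frac{p_{X|T,\Theta}(x|t,\theta)\,p_{T|\Theta}(t|\theta)\,\pi_\Theta(\theta)}{p_{X,T}(x,t)}, \tag{12.19} \]

and Equation (12.19) with an application of (12.18) gives

\[ \pi_{\Theta|X,T}(\theta|x,t) = \frac{p_{X|T}(x|t)\,p_{T|\Theta}(t|\theta)\,\pi_\Theta(\theta)}{p_{X|T}(x|t)\,p_T(t)} = \frac{p_{T|\Theta}(t|\theta)\,\pi_\Theta(\theta)}{p_T(t)} = \pi_{\Theta|T}(\theta|t). \tag{12.20} \]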
The proposition is proved by setting ϕ(θ, t(x)) = πΘ∣T (θ∣t(x)).

Example 12.14 (Tossing a Thumb-tack).


In the thumb-tack experiment described in section 12.5.1, there is a single parameter, θ. In this
paragraph, a Bayesian sufficient statistic is derived for θ, for a suitable class of prior distributions. Let
πΘ denote the prior density function for θ, and let Θ denote the random variable with this density
function. In this case, X is an n × 1 matrix, a column vector, which will be written as X = (X1 , . . . , Xn )t ,
a sequence of n independent Bernoulli trials, each with probability θ of success (that is Xj ∼ Be(θ),
j = 1, . . . , n and Xi ⊥ Xj ∣θ for i ≠ j ). The sequence of outcomes will be denoted by the vector

x = (x1 , . . . , xn )t .
That is, for each j = 1, . . . , n, xj = 1 or 0. The statistic t is a function of n variables, defined as

\[ t(x) = \sum_{j=1}^{n} x_j. \]

That is, when t is applied to a sequence of n 0's and 1's, it returns the number of 1's in the sequence.
Here, T = t(X) = ∑_{j=1}^{n} Xj and therefore T has a binomial distribution with the parameters n and θ,
since it is the sum of independent Bernoulli trials. The probability function of T is given by


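\[ p_{T|\Theta}(k|\theta) = \begin{cases} \binom{n}{k} \theta^k (1-\theta)^{n-k} & k = 0, 1, \ldots, n \\ 0 & \text{other } k. \end{cases} \]

Since t is a function of x, it follows that, for k = 0, 1, . . . , n,

\[ p_{X,T|\Theta}(x,k|\theta) = \begin{cases} \theta^k (1-\theta)^{n-k} & t(x) = k \\ 0 & \text{otherwise,} \end{cases} \]

from which, for every x with t(x) = k,

\[ p_{X|T,\Theta}(x|k,\theta) = \frac{p_{X,T|\Theta}(x,k|\theta)}{p_{T|\Theta}(k|\theta)} = \frac{1}{\binom{n}{k}}. \]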

The right hand side does not depend on θ, from which equation (12.18) holds and hence equation
(12.17) follows. Therefore, if x = (x1 , . . . , xn ) are n independent Bernoulli trials, each with parameter
θ, the function t such that t(x) = ∑_{l=1}^{n} xl is a Bayesian sufficient statistic for the parameter θ. In the
thumb-tack example, given in subsection 12.5.1, the posterior distribution, based on a uniform prior, is
an explicit function of the data x only through the function t(x).
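The key step of the example, that the conditional probability of the sequence given the number of successes is free of θ, can be checked directly. The Python sketch below is illustrative only; the function names and the particular sequence and values of θ are assumptions made for the example.

```python
from math import comb


def seq_prob(x, theta):
    """P(X = x | theta) for independent Bernoulli(theta) trials."""
    k = sum(x)
    return theta**k * (1.0 - theta) ** (len(x) - k)


def cond_prob_given_total(x, theta):
    """P(X = x | T = t(x), theta), where T ~ Binomial(n, theta)."""
    n, k = len(x), sum(x)
    p_total = comb(n, k) * theta**k * (1.0 - theta) ** (n - k)
    return seq_prob(x, theta) / p_total


x = (1, 0, 0, 1, 1)                      # hypothetical outcome, t(x) = 3
values = [cond_prob_given_total(x, th) for th in (0.2, 0.5, 0.9)]
print(values, 1 / comb(5, 3))            # every entry equals 1 / C(5, 3)
```

The three conditional probabilities coincide even though the three values of θ are very different, which is the content of Equation (12.18) for this model.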

Now consider a random vector X and suppose now that t is a generic sufficient statistic. Since t is a
function of X (i.e. t = t(X)), it follows, using the rules of conditional probability and equation (12.18),
that

pX∣Θ (x∣θ) = pX,T ∣Θ (x, t(x)∣θ) = pX∣T,Θ (x∣t(x), θ)pT ∣Θ (t(x)∣θ) = pX∣T (x∣t(x))pT ∣Θ (t(x)∣θ).

In other words, there is a factorisation of the form

pX∣Θ (x∣θ) = g(t(x), θ)h(x), (12.21)

where
h(x) = pX∣T (x∣t(x)) = pX∣t(X) (x∣t(x)).

In statistical literature, t(X) is often defined to be a sufficient statistic if there is a factorisation of
the type given by equation (12.21). Equation (12.21) is in fact a characterisation of sufficiency in the
sense that the likelihood function for θ depends on data only through t; the aspects of data that do
not influence the value of t are not needed for inference about θ, as long as pX∣Θ (x∣θ) is the object of
study. In the example above, and in many other cases, this offers a data reduction. That is, for any n,
a sample of size n can be reduced to a quantity of fixed dimension.

12.9 Prediction Sufficiency


Let X be a discrete random vector, Y a discrete random variable or vector, t a function and let
T = t(X). Let θ be a parameter vector. Suppose X, Y , Θ and T satisfy

X ⊥ (Y , θ) ∣T. (12.22)

That is, once t(X) is given, there is no additional statistical information in X about Y or θ. The
problem is to predict Y statistically using a function of X .

Proposition 12.15. Let t denote a function and let T = t(X). If X, Y , T, θ satisfy X ⊥ Y ∣(T, θ) and
X ⊥ θ∣T , then

\[ \pi_{\Theta|Y,X,T}(\theta \mid y, x, t) = \pi_{\Theta|Y,T}(\theta \mid y, t). \tag{12.23} \]

Proof Firstly, X ⊥ Y ∣(θ, T ) and X ⊥ θ∣T implies that X ⊥ Y ∣T . It follows that

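\[ \begin{aligned}
p_{X,Y|T}(x,y|t) &= \int_\Theta p_{X,Y|\Theta,T}(x,y|\theta,t)\,\pi_{\Theta|T}(\theta|t)\,d\theta \\
&= \int_\Theta p_{X|\Theta,T}(x|\theta,t)\,p_{Y|\Theta,T}(y|\theta,t)\,\pi_{\Theta|T}(\theta|t)\,d\theta \\
&= p_{X|T}(x|t) \int_\Theta p_{Y|\Theta,T}(y|\theta,t)\,\pi_{\Theta|T}(\theta|t)\,d\theta \\
&= p_{X|T}(x|t)\,p_{Y|T}(y|t).
\end{aligned} \]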

It follows that
\[ p_{Y|X,T}(y|x,t) = \frac{p_{X,Y|T}(x,y|t)}{p_{X|T}(x|t)} = p_{Y|T}(y|t). \tag{12.24} \]
An application of Bayes rule gives

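\[ \begin{aligned}
\pi_{\Theta|Y,X,T}(\theta|y,x,t) &= \frac{p_{Y,X,T|\Theta}(y,x,t|\theta)\,\pi_\Theta(\theta)}{p_{Y,X,T}(y,x,t)} = \frac{p_{Y,X|T,\Theta}(y,x|t,\theta)\,p_{T|\Theta}(t|\theta)\,\pi_\Theta(\theta)}{p_{Y,X,T}(y,x,t)} \\
&= \frac{p_{Y|T,\Theta}(y|t,\theta)\,p_{X|T,\Theta}(x|t,\theta)\,p_{T|\Theta}(t|\theta)\,\pi_\Theta(\theta)}{p_{Y,X,T}(y,x,t)},
\end{aligned} \]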

where the conditional independence X ⊥ Y ∣(θ, T ) was used. Then, since X ⊥ θ∣T , it follows that
pX∣T,Θ (x∣t, θ) = pX∣T (x∣t) and hence, using the identity (12.24), that

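\[ \pi_{\Theta|Y,X,T}(\theta|y,x,t) = \frac{p_{Y|T,\Theta}(y|t,\theta)\,p_{X|T}(x|t)\,p_{T|\Theta}(t|\theta)\,\pi_\Theta(\theta)}{p_{Y|X,T}(y \mid x,t)\,p_{X|T}(x \mid t)\,p_T(t)} = \frac{p_{Y|T,\Theta}(y|t,\theta)\,p_{T|\Theta}(t|\theta)\,\pi_\Theta(\theta)}{p_{Y|T}(y \mid t)\,p_T(t)}. \]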

It follows that

\[ \pi_{\Theta|Y,X,T}(\theta|y,x,t) = \frac{p_{Y,T|\Theta}(y,t \mid \theta)\,\pi_\Theta(\theta)}{p_{Y,T}(y,t)} = \pi_{\Theta|Y,T}(\theta|y,t), \]
as claimed.

12.10 Prediction Sufficiency for a Bayesian Network


Let G = (V, E) denote a DAG with V = {X1 , . . . , Xd }, where the nodes are numbered, for convenience
such that for each j ,

Πj ⊆ {X1 , . . . , Xj−1 },

where Πj (as usual) denotes the parent set for Xj .


Using a fully Bayesian approach to the problem, the parameter vector θ is considered as an observation
on a random vector Θ and for each j = 1, . . . , d the parameter vector θj an observation on a
random vector Θj .

Definition 12.16 (Parameter Modularity). A set of parameters Θ for a Bayesian Network satisfies
parameter modularity if it may be decomposed into d distinct parameter sets Θ1 , . . . , Θd such that for
j = 1, . . . , d, the parameters in vector Θj are directly linked only to node Xj .

This definition was introduced by Heckerman, Geiger and Chickering (1995) [62].

Under the assumption of parameter modularity, the DAG may be expanded by adding the parameter
nodes as parent variables in the graph, and directed links from each node in the set Θj to the node Xj
giving an extended graph that is directed and acyclic, where pX1 ,...,Xd ∣Θ has the decomposition

\[ p_{X_1,\ldots,X_d|\Theta} = \prod_{j=1}^{d} p_{X_j|\Theta_j,\Pi_j}. \tag{12.25} \]

Furthermore, under the assumption of modularity, Θ1 , . . . , Θd are independent random vectors and the
joint prior distribution is a product of individual priors; πΘ = ∏_{j=1}^{d} πΘj .
The following notation is useful:

X̃j ∶= ((X1 , Θ1 ), . . . , (Xj−1 , Θj−1 )) , j = 1, . . . , d

and, for j = 1, . . . , d, tj is used to denote the function such that

tj (X̃j ) = Πj .

It follows directly from equation (12.25) that

X̃j ⊥ (Xj , Θj ) ∣Πj .

In other words, the parent set Πj is a prediction sufficient statistic for (Xj , Θj ) in the sense that there
is no further information in ((X1 , Θ1 ), . . . , (Xj−1 , Θj−1 )) relevant to uncertainty about either Θj or
Xj .
In a Bayesian network where the parameters satisfy the modularity assumption (Definition 12.16),
(Πj , Xj ) are a Bayesian sufficient statistic for Θj . The modularity assumption is clearly satisfied when
Equation (12.25) holds.

Notes The discussion of the thumb-tack and learning for DAGs is taken from D. Heckerman [61]
and [62]. Learning from incomplete data is discussed in [116]. Another treatment of learning is found
in [99] (Neapolitan). The Savage distribution is due to J.L. Savage [121]. The Dickey distribution is
due to J.M. Dickey [38].
12.11 Exercises
1. Suppose one has a database C with n cases of configurations over a collection of variables V .
Let Sp(V ) denote the set of possible configurations over V and let #(v) denote the number of
cases of configuration v . Define P C (v) = #(v)/n. Let P M denote a probability distribution over
Sp(V ). Assume that P C (v) = 0 if and only if P M (v) = 0 and discount these configurations.
Define S M (C) = − ∑c∈C log P M (c).
Let DKL denote the Kullback Leibler distance. Show that

S M (C) − S C (C) = nDKL (P C ∣P M ).

2. (a) Consider the thumb-tack experiment and the conditional independence model for the problem
and the uniform prior density for θ. Let X denote the vector of n i.i.d. copies of the
random variable and let Xn+1 denote an additional copy, independent of X. Let x denote
an outcome of X. What is PXn+1∣X (head∣x)?
(b) Prove the Laplace Rule of Succession. Namely, let {X1 , . . . , Xn+1 } be independent, identically
distributed Bernoulli random variables, where PXi (1) = 1 − PXi (0) = θ and θ ∼ U (0, 1).
Then the Laplace Rule of Succession states that

\[ P_{X_{n+1}|X_1+\ldots+X_n}(1|s) = \frac{s+1}{n+2}. \]
3. Let Θ ∼ Beta(α, β). Compute E[Θ] and V(Θ). You may use the fact that if Θ ∼ Beta(α, β)
then its density is given by

\[ \pi(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{\alpha-1} (1-\theta)^{\beta-1}, \qquad \theta \in [0,1]. \]

4. Let (X1 , . . . , Xn+1 )t be a vector of independent identically distributed random variables, each with
probability distribution given by PX (x(i) ) = θi , i = 1, . . . , L. Suppose that the prior distribution
over θ is Dir(αq1 , . . . , αqL ) where ∑_{i=1}^{L} qi = 1. Let X = (X1 , . . . , Xn ) and let x be an n-vector of
outcomes where x(i) appears ni times, for i = 1, . . . , L and ∑_{i=1}^{L} ni = n. Show that

\[ P_{X_{n+1}|X}(x^{(i)} \mid x) = \int_{S_L} \theta_i\, \pi(\theta_1, \ldots, \theta_L \mid x; \alpha q)\, d\theta_1 \ldots d\theta_L = \frac{n_i + \alpha q_i}{n + \alpha}. \tag{12.26} \]

5. Let Θ = (Θ1 , . . . , ΘL ) be a continuous random vector with Dir (α1 , . . . , αL ) distribution.
Compute V (Θi ).

6. (a) Let V = (V1 , . . . , VK ) be a continuous random vector, with

V ∼ Dir (a1 , . . . , aK ) ,

and set

\[ U_i = \frac{V_i x_i^{-1}}{\sum_{m=1}^{K} V_m x_m^{-1}}, \qquad i = 1, \ldots, K, \]

where x = (x1 , . . . , xK ) is a vector of positive real numbers; that is, xi > 0 for each i =
1, . . . , K . Show that U = (U1 , . . . , UK ) has density function

\[ \frac{\Gamma\left(\sum_{i=1}^{K} a_i\right)}{\prod_{i=1}^{K} \Gamma(a_i)} \prod_{i=1}^{K} u_i^{a_i - 1} \left( \frac{1}{\sum_{i=1}^{K} u_i x_i} \right)^{\sum_{i=1}^{K} a_i} \prod_{i=1}^{K} x_i^{a_i}. \]

This density is denoted

U ∼ S (a, x) .

This is due to J.L. Savage [121]. Note that the Dirichlet density is obtained as a special
case when xi = c for i = 1, . . . , K .
The next two parts illustrate how the Savage distribution can arise in Bayesian analysis, for
updating an objective distribution over the subjective assessments of a probability distribution
by several different researchers, faced with a common set of data.
(b) Consider several researchers studying an unknown quantity X , where X can take values in
{1, 2, . . . , K}. Each researcher has his own initial assessment of the probability distribution
V = (V1 , . . . , VK ) for the value that X takes. That is, for a particular researcher,

Vi = PX (i) , i = 1, . . . , K.

It is assumed that

V ∼ Dir (a1 , . . . , aK ) .

Each researcher observes the same set of data with the common likelihood function

li = P (data∣{X = i}) , i = 1, . . . , K.

The coherent posterior probability of a researcher is

Ui = P ({X = i} ∣ data) , i = 1, 2, . . . , K.

Let U = (U1 , . . . , UK ). Prove that

U ∼ S (a, l−1 ) ,

where a = (a1 , . . . , aK ) and l−1 = (l_1^{−1}, . . . , l_K^{−1}). This is due to J.M. Dickey [38].

(c) Show that the family of distributions S (a, l−1 ) is closed under updating of the opinion
populations. In other words, if

V ∼ S (a, z) ,

before the data is considered, then

U ∼ S (a, z × l−1 ) ,

after the data update, where

z × l−1 = (z1 l_1^{−1}, . . . , zK l_K^{−1}).

7. Consider a Bayesian Network over two binary variables A and B , where the Directed Acyclic
Graph is A → B and A and B each take the values 0 or 1. Let (Θa , Θb∣y , Θb∣n ) denote three
independent random variables representing the unknown parameters. Let θa = PA∣Θa (1∣θa ), θb∣y =
PB∣A,Θb∣y (1∣1, θb∣y ), θb∣n = PB∣A,Θb∣n (1∣0, θb∣n ). Let the prior distributions over the parameters be



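\[ \pi_a(\theta) = \begin{cases} 3\theta^2 & 0 \le \theta \le 1 \\ 0 & \theta \notin [0,1], \end{cases} \]

\[ \pi_{b|y}(\theta) = \begin{cases} 12\theta^2(1-\theta) & 0 \le \theta \le 1 \\ 0 & \theta \notin [0,1], \end{cases} \]

\[ \pi_{b|n}(\theta) = \begin{cases} 12\theta(1-\theta)^2 & 0 \le \theta \le 1 \\ 0 & \theta \notin [0,1]. \end{cases} \]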
Suppose that there is a single instantiation, where B = 1 is observed, but A is unknown. Perform
the approximate updating.

8. Let the likelihood for θ = (θ1 , . . . , θL ) with data x be given by

\[ L(\theta; x) = \prod_{j=1}^{L} \theta_j^{n_j}, \]

where nj is the number of times the symbol xj (in a finite alphabet with L symbols) is present
in x and ∑_{j=1}^{L} θj = 1. For the prior distribution over θ , a finite Dirichlet mixture is taken, given
by

\[ \pi_\Theta(\theta) = \sum_{i=1}^{k} \lambda_i\, \mathrm{Dir}\left(\alpha^{(i)} q_1^{(i)}, \ldots, \alpha^{(i)} q_L^{(i)}\right), \]

where λi ≥ 0, ∑_{i=1}^{k} λi = 1 (the mixture distribution), α(i) > 0, q_j^{(i)} > 0, ∑_{j=1}^{L} q_j^{(i)} = 1 for every i.
Compute the mean posterior estimate θ̂j;M P for j = 1, . . . , L.

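9. Let ϕ(θj.l , αj.l ) denote the Dirichlet density Dir(αj1l , . . . , αjkj l ). By performing the required
integration, prove that the likelihood function for the graph structure, defined by

\[ P_{X|D}(x|D) = \int \prod_{k=1}^{n} P_{X|\Theta,D}(x^{(k)}|\theta, D) \prod_{j=1}^{d} \prod_{l=1}^{q_j} \phi(\theta_{j.l}, \alpha_{j.l})\, d\theta_{j.l} \]

is given by

\[ P_{X|D}(x|D) = \prod_{j=1}^{d} \prod_{l=1}^{q_j} \frac{\Gamma\left(\sum_{i=1}^{k_j} \alpha_{jil}\right)}{\Gamma\left(n(\pi_j^l) + \sum_{i=1}^{k_j} \alpha_{jil}\right)} \prod_{i=1}^{k_j} \frac{\Gamma\left(n(x_j^i, \pi_j^l) + \alpha_{jil}\right)}{\Gamma(\alpha_{jil})}. \]

You may use the identity:

\[ \int_0^1 \int_0^{1-\theta_1} \cdots \int_0^{1-(\theta_1+\ldots+\theta_{n-2})} \left( \prod_{j=1}^{n-1} \theta_j^{\alpha_j - 1} \right) \left( 1 - \sum_{j=1}^{n-1} \theta_j \right)^{\alpha_n - 1} d\theta_{n-1} \ldots d\theta_1 = \frac{\prod_{j=1}^{n} \Gamma(\alpha_j)}{\Gamma\left(\sum_{j=1}^{n} \alpha_j\right)}. \]

What parameters αj.l are used if a uniform prior is taken on every θj.l ? You may use Γ(n) =
(n − 1)!.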
12.12 Short Answers
1. Firstly, note that

\[ S^M(C) = -\sum_{c \in C} \log P^M(c) = -\sum_{v \in Sp(V)} \#(v) \log P^M(v) = -\sum_{v \in Sp(V)} n P^C(v) \log P^M(v). \]

This is true for all M ; in particular, take M = C , so that

\[ S^C(C) = -\sum_{v \in Sp(V)} n P^C(v) \log P^C(v) \]

giving

\[ S^M(C) - S^C(C) = n \sum_{v \in Sp(V)} P^C(v) \log \frac{P^C(v)}{P^M(v)} = n D_{KL}(P^C | P^M). \]

2. (a) Suppose x contains k heads. Then


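\[ \begin{aligned}
P_{X_{n+1}|X}(H|x) &= \int_0^1 P_{X_{n+1}|\Theta,X}(H|\theta,x)\,\pi_{\Theta|X}(\theta|x)\,d\theta \\
&= \frac{(n+1)!}{k!(n-k)!} \int_0^1 \theta^{k+1} (1-\theta)^{n-k}\, d\theta \\
&= \frac{(n+1)!}{k!(n-k)!} \cdot \frac{(k+1)!(n-k)!}{(n+2)!} = \frac{k+1}{n+2}.
\end{aligned} \]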

(b) Using the evaluation of the Beta integral,

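\[ \begin{aligned}
P_{X_{n+1}|X_1+\ldots+X_n}(1|s) &= \int_0^1 P_{X_{n+1}|X_1+\ldots+X_n,\Theta}(1|s,\theta)\,\pi_{\Theta|X_1+\ldots+X_n}(\theta|s)\,d\theta \\
&= \int_0^1 \theta\, \frac{P_{X_1+\ldots+X_n|\Theta}(s|\theta)\,\pi_\Theta(\theta)}{\int P_{X_1+\ldots+X_n|\Theta}(s|\theta)\,\pi_\Theta(\theta)\,d\theta}\, d\theta \\
&= \frac{\binom{n}{s} \int_0^1 \theta^{s+1} (1-\theta)^{n-s}\, d\theta}{\binom{n}{s} \int_0^1 \theta^{s} (1-\theta)^{n-s}\, d\theta} \\
&= \frac{\dfrac{n!}{s!(n-s)!} \cdot \dfrac{(s+1)!(n-s)!}{(n+2)!}}{\dfrac{n!}{s!(n-s)!} \cdot \dfrac{s!(n-s)!}{(n+1)!}} = \frac{s+1}{n+2}.
\end{aligned} \]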

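3.
\[ E[\Theta] = \int_0^1 \theta \pi(\theta)\, d\theta = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \int_0^1 \theta^{\alpha} (1-\theta)^{\beta-1}\, d\theta = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \frac{\Gamma(\alpha+1)\Gamma(\beta)}{\Gamma(\alpha+\beta+1)} = \frac{\alpha}{\alpha+\beta} \]

using Γ(x + 1) = xΓ(x).

\[ V(\Theta) = E[\Theta^2] - E[\Theta]^2. \]

\[ E[\Theta^2] = \int_0^1 \theta^2 \pi(\theta)\, d\theta = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \frac{\Gamma(\alpha+2)\Gamma(\beta)}{\Gamma(\alpha+\beta+2)} = \frac{(\alpha+1)\alpha}{(\alpha+\beta+1)(\alpha+\beta)} \]

so

\[ V(\Theta) = \frac{(\alpha+1)\alpha}{(\alpha+\beta+1)(\alpha+\beta)} - \frac{\alpha^2}{(\alpha+\beta)^2} = \frac{(\alpha+1)\alpha(\alpha+\beta) - \alpha^2(\alpha+\beta+1)}{(\alpha+\beta+1)(\alpha+\beta)^2} = \frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}. \]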

4.

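\[ \begin{aligned}
P_{X_{n+1}|X}(x^{(i)}|x) &= \int_S P_{X_{n+1}|\Theta,X}(x^{(i)}|\theta,x)\,\pi_{\Theta|X}(\theta|x)\,d\theta \\
&= \frac{\Gamma(n+\alpha)}{\prod_{j=1}^{L} \Gamma(\alpha q_j + n_j)} \int_S \left( \prod_{j \neq i} \theta_j^{n_j + \alpha q_j - 1} \right) \theta_i^{n_i + \alpha q_i}\, d\theta \\
&= \frac{\Gamma(n+\alpha)}{\prod_{j=1}^{L} \Gamma(n_j + \alpha q_j)} \cdot \frac{\left( \prod_{j \neq i} \Gamma(n_j + \alpha q_j) \right) \Gamma(n_i + 1 + \alpha q_i)}{\Gamma(n + \alpha + 1)} \\
&= \frac{\Gamma(n_i + 1 + \alpha q_i)\,\Gamma(n+\alpha)}{\Gamma(n_i + \alpha q_i)\,\Gamma(n + \alpha + 1)} \\
&= \frac{n_i + \alpha q_i}{n + \alpha}.
\end{aligned} \]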
5. Let α = α1 + . . . + αL . Then, taking dθ in the appropriate sense and using

Γ(α + 1) = αΓ(α),

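\[ \begin{aligned}
E[\Theta_i] &= \int \theta_i\, \pi_\Theta(\theta)\, d\theta = \frac{\Gamma(\alpha)}{\prod_{j=1}^{L} \Gamma(\alpha_j)} \int \left( \prod_{j \neq i} \theta_j^{\alpha_j - 1} \right) \theta_i^{\alpha_i}\, d\theta \\
&= \frac{\Gamma(\alpha)}{\prod_{j=1}^{L} \Gamma(\alpha_j)} \cdot \frac{\left( \prod_{j \neq i} \Gamma(\alpha_j) \right) \Gamma(\alpha_i + 1)}{\Gamma(\alpha + 1)} = \frac{\Gamma(\alpha)\,\Gamma(\alpha_i + 1)}{\Gamma(\alpha + 1)\,\Gamma(\alpha_i)} = \frac{\alpha_i}{\alpha}.
\end{aligned} \]

Similarly,

\[ E[\Theta_i^2] = \frac{\Gamma(\alpha)\,\Gamma(\alpha_i + 2)}{\Gamma(\alpha + 2)\,\Gamma(\alpha_i)} = \frac{(\alpha_i + 1)\alpha_i}{(\alpha + 1)\alpha}. \]

This gives

\[ V(\Theta_i) = \frac{\alpha_i(\alpha_i + 1)}{\alpha(\alpha + 1)} - \frac{\alpha_i^2}{\alpha^2} = \frac{\alpha\alpha_i^2 + \alpha\alpha_i - \alpha\alpha_i^2 - \alpha_i^2}{\alpha^2(\alpha + 1)} = \frac{\alpha_i(\alpha - \alpha_i)}{\alpha^2(\alpha + 1)}. \]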

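6. (a) The free variables are (v1 , . . . , vK−1 ) with the constraint vK = 1 − ∑_{j=1}^{K−1} vj . Set

\[ S = \sum_{j=1}^{K} \frac{v_j}{x_j} = \frac{1}{x_K} + \sum_{j=1}^{K-1} v_j \left( \frac{1}{x_j} - \frac{1}{x_K} \right) \]

then

\[ u_j = \frac{v_j}{x_j S}, \quad j = 1, \ldots, K, \qquad \text{and} \qquad u_K = 1 - \sum_{j=1}^{K-1} u_j \]

and, since ∑_{j=1}^{K} vj = 1, it follows that 1 = ∑_{j=1}^{K} vj = S ∑_{j=1}^{K} xj uj so that

\[ S = \frac{1}{x_K + \sum_{j=1}^{K-1} (x_j - x_K) u_j} = \frac{1}{\sum_{j=1}^{K} x_j u_j}. \]

The original density (in terms of the free variables) is

\[ \frac{\Gamma\left(\sum_{j=1}^{K} a_j\right)}{\prod_{j=1}^{K} \Gamma(a_j)} \left( \prod_{j=1}^{K-1} v_j^{a_j - 1} \right) \left( 1 - \sum_{j=1}^{K-1} v_j \right)^{a_K - 1}. \]

The Jacobian determinant for v → u may be computed by noting that vj = uj xj S and using

\[ \frac{\partial S}{\partial u_\alpha} = -S^2 (x_\alpha - x_K) \]

so that

\[ \frac{\partial v_i}{\partial u_\alpha} = \begin{cases} -S^2 u_i x_i (x_\alpha - x_K) & \alpha \neq i \\ S x_i - S^2 u_i x_i (x_i - x_K) & \alpha = i. \end{cases} \]

The Jacobian matrix is therefore S diag(x1 , . . . , xK−1 )M , with determinant S^{K−1} (∏_{i=1}^{K−1} xi ) det M , where

\[ M = I - S \begin{pmatrix} u_1 \\ \vdots \\ u_{K-1} \end{pmatrix} (x_1 - x_K, \ldots, x_{K-1} - x_K). \]

Clearly 1 is an eigenvalue of multiplicity K − 2 for M . The remaining eigenvalue λ of M
may be computed by noting that the vector e that satisfies

(M − λI)e = 0

satisfies e = c (u1 , . . . , uK−1 )t and therefore λ satisfies

\[ 1 - \lambda = S \sum_{j=1}^{K-1} u_j (x_j - x_K) = S \left( \sum_{j=1}^{K-1} u_j x_j - x_K + x_K u_K \right) = S \left( \frac{1}{S} - x_K \right) \]

so that λ = SxK . It follows that the density in the new coordinates is

\[ \frac{\Gamma\left(\sum_{j=1}^{K} a_j\right)}{\prod_{j=1}^{K} \Gamma(a_j)} \left( \prod_{j=1}^{K-1} (S x_j u_j)^{a_j - 1} \right) \left( 1 - S \sum_{j=1}^{K-1} x_j u_j \right)^{a_K - 1} S^K \prod_{j=1}^{K} x_j. \]

Since SxK uK = 1 − S ∑_{j=1}^{K−1} xj uj , it follows that the density is

\[ \frac{\Gamma\left(\sum_{j=1}^{K} a_j\right)}{\prod_{j=1}^{K} \Gamma(a_j)} \prod_{j=1}^{K} x_j^{a_j} \left( \prod_{j=1}^{K} u_j^{a_j - 1} \right) \left( \frac{1}{\sum_{j=1}^{K} x_j u_j} \right)^{\sum_{j=1}^{K} a_j} \]

as required.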
(b) The work was in the previous part, computing the distribution. This exercise is now a
straightforward application of Bayes rule.

\[ U_i = P(\{X = i\} | \text{data}) = \frac{P_X(i)\, l_i}{P(\text{data})} = \frac{V_i l_i}{\sum_{m=1}^{K} V_m l_m}; \]

the denominator follows because ∑_{i=1}^{K} Ui = 1. The distribution of U now satisfies the definition
of the S(a, l−1 ) distribution of the previous exercise.
(c) Again, assume data is obtained and the likelihood is li = P(data∣X = i) and the prior
distribution is S(a, z). Then

\[ U_i = P(\{X = i\} | \text{data}) = \frac{V_i l_i}{P(\text{data})} = \frac{W_i z_i^{-1} l_i}{P(\text{data}) \sum_{m=1}^{K} W_m z_m^{-1}}, \]

where W ∼ Dir(a1 , . . . , aK ). Since ∑_{i=1}^{K} Ui = 1, it follows that

\[ U_i = \frac{W_i z_i^{-1} l_i}{\sum_{m=1}^{K} W_m z_m^{-1} l_m} \]

so that the distribution of U satisfies the definition of a S(a, z × l−1 ) distribution.


7. With approximate updating, the independence structure of the distributions over (θjil )_{i=1}^{k_j} is
retained (the distributions for each (j, l) are mutually independent). Let n(x_j^{(i)}, π_j^{(l)}) denote the
effective number of (x_j^{(i)}, π_j^{(l)}) configurations upon which the prior distribution is based, then

\[ \Theta_{j.l} \sim \mathrm{Dir}\left( n(x_j^{(1)}, \pi_j^{(l)}), \ldots, n(x_j^{(k_j)}, \pi_j^{(l)}) \right) \]

before the update. After a partially observed instantiation, this is updated to

\[ \mathrm{Dir}\left( n^*(x_j^{(1)}, \pi_j^{(l)}), \ldots, n^*(x_j^{(k_j)}, \pi_j^{(l)}) \right) \]

where

\[ n^*(x_j^{(i)}, \pi_j^{(l)}) = n(x_j^{(i)}, \pi_j^{(l)}) + P_{X_j,Pa_j|E}(x_j^{(i)}, \pi_j^{(l)} | e^*) \]

where P is the probability computed using the prior and E = (Xi1 , . . . , Xim ), those variables that
are instantiated in the partial observation; e∗ denotes the values that these variables take in the
incomplete instantiation.
To update the distribution over Θa , the effective sample sizes on which the prior is based are
needed. Furthermore, for Xj = A, Paj = ∅, so PXj ,Paj∣E = PA∣B (.∣1). For Xj = B , Paj = A, so
that PXj ,Paj∣E = PA,B∣B (., .∣1).

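The computations of PA and PB∣A are straightforward; PA∣B is obtained using Bayes rule. Note
that

\[ \begin{aligned}
P_A(1) &= \int_0^1 P_{A|\Theta_a}(1|\theta)\,\pi_{\Theta_a}(\theta)\,d\theta = 3 \int_0^1 \theta^3\, d\theta = \frac{3}{4}, & P_A(0) &= \frac{1}{4}, \\
P_{B|A}(1|1) &= \int_0^1 P_{B|A,\Theta_{b|y}}(1|1,\theta)\,\pi_{\Theta_{b|y}}(\theta)\,d\theta = 12 \int_0^1 \theta^3 (1-\theta)\, d\theta = \frac{3}{5}, & P_{B|A}(0|1) &= \frac{2}{5}, \\
P_{B|A}(1|0) &= 12 \int_0^1 \theta^2 (1-\theta)^2\, d\theta = \frac{2}{5}, & P_{B|A}(0|0) &= \frac{3}{5},
\end{aligned} \]

so

\[ \begin{aligned}
P_B(1) &= \frac{2}{5} \times \frac{1}{4} + \frac{3}{5} \times \frac{3}{4} = \frac{11}{20}, \qquad P_B(0) = \frac{9}{20}, \\
P_{A|E}(1|e^*) &= P_{A|B}(1|1) = \frac{P_A(1)\,P_{B|A}(1|1)}{P_B(1)} = \frac{9}{11}, \qquad P_{A|E}(0|e^*) = \frac{2}{11}, \\
P_{A,B|B}((0,1)|1) &= P_{A|B}(0|1) = \frac{2}{11}, \\
P_{A,B|B}((1,1)|1) &= P_{A|B}(1|1) = \frac{9}{11}, \\
P_{A,B|B}((0,0)|1) &= P_{A,B|B}((1,0)|1) = 0.
\end{aligned} \]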

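So the updated densities are

\[ \pi_{a|e^*}(\theta) = \frac{\Gamma(5)}{\Gamma\left(3 + \frac{9}{11}\right) \Gamma\left(1 + \frac{2}{11}\right)}\, \theta^{2 + \frac{9}{11}} (1-\theta)^{\frac{2}{11}}, \qquad \theta \in [0,1], \]

\[ \pi_{b|y,e^*}(\theta) = \frac{\Gamma\left(5 + \frac{9}{11}\right)}{\Gamma\left(3 + \frac{9}{11}\right) \Gamma(2)}\, \theta^{2 + \frac{9}{11}} (1-\theta), \qquad \theta \in [0,1], \]

\[ \pi_{b|n,e^*}(\theta) = \frac{\Gamma\left(5 + \frac{2}{11}\right)}{\Gamma\left(2 + \frac{2}{11}\right) \Gamma(3)}\, \theta^{1 + \frac{2}{11}} (1-\theta)^2, \qquad \theta \in [0,1]. \]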

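8. First note that

\[ \pi_{\Theta|X}(\theta|x) = (\text{const})\, \pi_\Theta(\theta)\, P_{X|\Theta}(x|\theta) = (\text{const}) \sum_{i=1}^{k} \lambda_i \frac{\Gamma(\alpha^{(i)})}{\prod_{j=1}^{L} \Gamma(\alpha^{(i)} q_j^{(i)})} \prod_{j=1}^{L} \theta_j^{\alpha^{(i)} q_j^{(i)} + n_j - 1}. \]

Each term is proportional to a Dir(α(i) q_1^{(i)} + n1 , . . . , α(i) q_L^{(i)} + nL ) density, so the posterior is again a
finite Dirichlet mixture,

\[ \pi_{\Theta|X}(\theta|x) = \sum_{i=1}^{k} \tilde\lambda_i\, \mathrm{Dir}\left(\alpha^{(i)} q_1^{(i)} + n_1, \ldots, \alpha^{(i)} q_L^{(i)} + n_L\right), \qquad \tilde\lambda_i \propto \lambda_i \frac{\Gamma(\alpha^{(i)})}{\Gamma(n + \alpha^{(i)})} \prod_{j=1}^{L} \frac{\Gamma(\alpha^{(i)} q_j^{(i)} + n_j)}{\Gamma(\alpha^{(i)} q_j^{(i)})}, \]

where the weights λ̃i are normalised to sum to 1. By standard Dirichlet integral calculations,

\[ \hat\theta_{j;MP} = E[\theta_j | x] = \sum_{i=1}^{k} \tilde\lambda_i\, \frac{\alpha^{(i)} q_j^{(i)} + n_j}{n + \alpha^{(i)}} \]

where n is the total sample size.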

9. (straightforward application of Dirichlet integrals)


Chapter 13

Parameters and Sensitivity

Notations As usual, for a variable Xj , let Paj denote the set of parent variables and let (π_j^{(i)})_{i=1}^{q_j}
denote the possible configurations for the parent set. Set

\[ \theta_{jil} = P_{X_j|Pa_j}(x_j^{(i)} | \pi_j^{(l)}), \]

so that ∑_{i=1}^{k_j} θjil = 1 for each (j, l). The collection of θjil ∶ j = 1, . . . , d, i = 1, . . . , kj , l = 1, . . . , qj with
the constraint given above denotes the entire set of parameters for the network.
The functions PXj ∣Paj will be referred to as potentials or CPPs (conditional probability potentials).

13.1 Parameter Changes to Satisfy Query Constraints


Definition 13.1 (Query, Query Constraint). A query in probabilistic inference is simply a conditional
probability distribution, over the variables of interest (the query variables) conditioned on information
received. A query constraint is a restriction; for example, if it is known that two conditional probabilities
differ by a certain amount, or if there is a restriction on the ratio between two conditional probabilities.

Figure 13.1: The DAG for the Bayesian Network `Fire', with edges fire → smoke, fire → alarm, tamper → alarm, alarm → leaving and leaving → report.

The problem considered in this section is to decide whether an individual parameter is relevant to
a given query constraint and, if it is, to compute the minimum amount of change needed to that
parameter to enforce the constraint. The constraints considered are in the form of hard evidence where
the collection E is instantiated as e, where E = (Xe1 , . . . , Xem ) is a subset of (X1 , . . . , Xd ).

Example 13.2 (Fire).

Consider the Bayesian network called Fire.¹ The model is shown in Figure 13.1. The network models
the scenario of whether or not there is a fire in the building. Let F denote `fire', T denote `tampering',
S `smoke', A `alarm', L `leaving' and R `report'. A fire may cause smoke to be seen; it may also
cause the alarm to go off. Equally, if somebody tampers with the alarm, this could also cause it to
go off, even without a fire. When people hear the alarm, they may leave the building and when a
large number of people leave the building at an unscheduled time, this may be reported to the fire
department.
Now consider the following evidence: {report = true, smoke = false}. That is, the fire department
receives a report that people are evacuating the building, but no smoke is observed. This evidence
should make it more likely that the fire alarm has been tampered with than that there is a real fire. Let
t denote `true' and f denote `false'. Suppose that the conditional probability values for this network
derived, perhaps, from experience, are

PF :          t       f
              0.01    0.99

PT :          t       f
              0.02    0.98

PR∣L (rows: L, columns: R):
    L     R = t    R = f
    t     0.75     0.25
    f     0.01     0.99

PS∣F (rows: F , columns: S ):
    F     S = t    S = f
    t     0.9      0.1
    f     0.01     0.99

PL∣A (rows: A, columns: L):
    A     L = t    L = f
    t     0.88     0.12
    f     0.001    0.999

PA∣F,T (t∣., .) (rows: F , columns: T ):
    F     T = t    T = f
    t     0.5      0.99
    f     0.85     0.0001

The evidence is (R, S) = (t, f ). The probability that someone has tampered with the alarm given this
evidence is

PT,R,S (t, t, f )
PT ∣R,S (t∣t, f ) = .
PR,S (t, f )
Using the notation XZ to denote the state space of a variable Z ,

\[ P_{T,R,S}(t,t,f) = P_T(t) \sum_{X_L} P_{R|L}(t|.) \sum_{X_A} P_{L|A} \sum_{X_F} P_{A|T,F}(.|t,.)\, P_{S|F}(f|.)\, P_F \]

and

\[ P_{R,S}(t,f) = \sum_{X_T} P_T \sum_{X_L} P_{R|L}(t|.) \sum_{X_A} P_{L|A} \sum_{X_F} P_{A|T,F}\, P_{S|F}(f|.)\, P_F. \]

Similarly,

\[ P_{F|R,S}(t|t,f) = \frac{P_{F,R,S}(t,t,f)}{P_{R,S}(t,f)}, \]

and

\[ P_{F,R,S}(t,t,f) = P_F(t)\, P_{S|F}(f|t) \sum_{X_T} P_T \sum_{X_A} P_{A|T,F}(.|.,t) \sum_{X_L} P_{L|A}\, P_{R|L}(t|.) \]

¹ This Bayesian network is distributed with the evaluation version of the commercial HUGIN Graphical User Interface,
by HUGIN Expert.

The computations are straightforward and give

PT ∣R,S (t∣t, f ) = 0.501, PF ∣R,S (t∣t, f ) = 0.0294.

Suppose that it is known from experience that the probability that the alarm has been tampered
with should be no less than 0.65 given this evidence. The network should therefore be adjusted to
accommodate. It is simplest to try changing only one network parameter. Suppose that the probability
function PT is to be adjusted. Let θ = PT (t). Let

\[ \alpha = \sum_{X_L} P_{R|L}(t|.) \sum_{X_A} P_{L|A} \sum_{X_F} P_{A|T,F}(.|t,.)\, P_{S|F}(f|.)\, P_F \]

so that

PT,R,S (t, t, f ) = θα

and

\[ \beta = \sum_{X_L} P_{R|L}(t|.) \sum_{X_A} P_{L|A} \sum_{X_F} P_{A|T,F}(.|f,.)\, P_{S|F}(f|.)\, P_F, \]

so that

PR,S (t, f ) = θα + (1 − θ)β.

Then the computation of α and β is straightforward arithmetic and

\[ P_{T|R,S}(t|t,f) = \frac{P_{T,R,S}(t,t,f)}{P_{R,S}(t,f)} = \frac{\alpha\theta}{(\alpha - \beta)\theta + \beta}. \]

The solution to the equation

\[ \frac{\alpha\theta}{(\alpha - \beta)\theta + \beta} = 0.65 \]

is θ = 0.0364.

Similarly, let ψ = PR∣L (t∣f ). Keeping all other potentials fixed, PT∣R,S (t∣t, f ) may be computed as a
function of ψ and the equation PT∣R,S (t∣t, f )(ψ) = 0.65 has solution ψ = 0.00471.

For all other single parameter adjustments, the equation does not have a solution in the interval [0, 1].
Therefore, if only one parameter is to be adjusted, the constraint PT ∣R,S (t∣t, f ) = 0.65 can be dealt
with in either of the following two ways:

1. Increase PT (t) from 0.02 to greater than 0.0364, or

2. Decrease the probability of a false report, given that there is an evacuation, from 0.01 to less
than 0.00471.

It turns out for this example that it is not possible to enforce the desired constraint by adjusting a
single parameter in any of the CPPs of the variables fire, smoke, alarm and leaving.
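The numbers in Example 13.2 can be reproduced by brute-force enumeration of the joint distribution. The following Python sketch is an illustration, not the method used by the authors or by HUGIN; the dictionary layout and the helper names `bern`, `joint` and `prob` are assumptions introduced here. The conditional probability tables are read with the row variable as the conditioning variable, and the quantities `alpha` and `beta` are obtained by setting P_T(t) to 1 and 0 respectively.

```python
from itertools import product

# CPTs copied from Example 13.2; True/False stand for t/f.
P_F = {True: 0.01, False: 0.99}
P_T = {True: 0.02, False: 0.98}
P_S_t = {True: 0.9, False: 0.01}                        # P(S = t | F)
P_A_t = {(True, True): 0.5, (True, False): 0.99,
         (False, True): 0.85, (False, False): 0.0001}   # P(A = t | F, T)
P_L_t = {True: 0.88, False: 0.001}                      # P(L = t | A)
P_R_t = {True: 0.75, False: 0.01}                       # P(R = t | L)


def bern(p_true, value):
    """Probability that a binary variable equals `value`."""
    return p_true if value else 1.0 - p_true


def joint(f, t, s, a, l, r, p_tamper):
    """Joint probability from the DAG factorisation, with P_T(t) adjustable."""
    return (bern(P_F[True], f) * bern(p_tamper, t) * bern(P_S_t[f], s)
            * bern(P_A_t[(f, t)], a) * bern(P_L_t[a], l) * bern(P_R_t[l], r))


def prob(p_tamper=P_T[True], **fixed):
    """Sum the joint over all configurations consistent with `fixed`."""
    names = ("f", "t", "s", "a", "l", "r")
    total = 0.0
    for cfg in product((True, False), repeat=6):
        assign = dict(zip(names, cfg))
        if all(assign[k] == v for k, v in fixed.items()):
            total += joint(**assign, p_tamper=p_tamper)
    return total


evidence = {"r": True, "s": False}
p_tamper_post = prob(t=True, **evidence) / prob(**evidence)   # about 0.501
p_fire_post = prob(f=True, **evidence) / prob(**evidence)     # about 0.0294

# alpha = P(R=t, S=f | T=t) and beta = P(R=t, S=f | T=f).
alpha = prob(p_tamper=1.0, **evidence)
beta = prob(p_tamper=0.0, **evidence)
theta = 0.65 * beta / (0.35 * alpha + 0.65 * beta)            # about 0.0364
print(p_tamper_post, p_fire_post, theta)
```

Enumeration over all 2^6 configurations is only feasible because the network is tiny; for larger networks the sums would be organised along a junction tree instead.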

13.2 Proportional Scaling


A network where each conditional probability distribution (θjil )_{i=1}^{k_j} has at most one variable parameter
t^{(jl)} is said to satisfy the proportional scaling property.

Definition 13.3 (Proportional Scaling Property). A Bayesian network satisfies the proportional scaling
property if for each conditional probability distribution θj.l , where θjil = pXj∣Paj (x_j^{(i)}∣π_j^{(l)}), there is
a parameter t^{(jl)} such that

\[ P_{X_j|Pa_j}(.|\pi_j^{(l)}) = \left( \alpha_{j1l} + \beta_{j1l} t^{(jl)}, \ldots, \alpha_{jk_jl} + \beta_{jk_jl} t^{(jl)} \right), \]

where ∑_{m=1}^{k_j} αjml = 1 and ∑_{m=1}^{k_j} βjml = 0.

Theorem 13.4. Consider a Bayesian network over a collection of variables V = {X1 , . . . , Xd }. Suppose
that the network satisfies proportional scaling, where there is a single variable parameter t in a
conditional probability distribution θj.l . Then for any E = (Xi1 , . . . , Xim ) and e = (x_{i1}^{(c1)}, . . . , x_{im}^{(cm)}),

PE (e)(t) = at + b

for two constants a and b that depend on e.

Proof of Theorem 13.4 Let θjil = αjil + βjil t for i = 1, . . . , kj . Then

\[ \begin{aligned}
P_E(e) &= \sum_{y \in \mathcal{X} \,:\, (y_{i_1}, \ldots, y_{i_m}) = (x_{i_1}^{(c_1)}, \ldots, x_{i_m}^{(c_m)})} P_V(y_1, \ldots, y_d) \\
&= \sum_{y \in \mathcal{X} \,:\, (y_{i_1}, \ldots, y_{i_m}) = (x_{i_1}^{(c_1)}, \ldots, x_{i_m}^{(c_m)})} P_{X_j|Pa_j}(y_j | \pi_j(y)) \prod_{k \neq j} P_{X_k|Pa_k}(y_k | \pi_k(y))
\end{aligned} \tag{13.1} \]

and it is clear from the definition of proportional scaling, and from Equation (13.1), that t enters
linearly. It therefore follows that

PE (e)(t) = at + b.

It follows that for two disjoint sets of variables A and E , there are numbers a(e), b(e), c(x, e), d(x, e)
such that

PA,E (x, e) ct + d
PA∣E (x∣e)(t) = = .
PE (e) at + b

The Optimality of Proportional Scaling Consider one of the conditional probability distributions
(θj1l , . . . , θjkj l ) and suppose that θj1l is to be altered to a different value, denoted by θ̃j1l . Under
proportional scaling, the probabilities of the other states are given by

\[ \tilde\theta_{jil} = \theta_{jil}\, \frac{1 - \tilde\theta_{j1l}}{1 - \theta_{j1l}}, \qquad i = 2, \ldots, k_j. \]

This is clearly proportional scaling with

\[ t^{(jl)} = \frac{1 - \tilde\theta_{j1l}}{1 - \theta_{j1l}}, \]

βjil = θjil i = 2, . . . , kj , βj1l = θj1l − 1

αjil = 0 i = 2, . . . , kj , αj1l = 1.
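The rescaling rule above is short enough to state as code. The following Python sketch is illustrative; the function name `proportional_scaling` and the example distribution are assumptions, not taken from the text. It changes the first entry of one conditional distribution and multiplies the remaining entries by the common factor t^{(jl)}.

```python
def proportional_scaling(theta, new_first):
    """Set theta[0] to new_first and rescale the remaining entries.

    theta is one conditional distribution (theta_{j1l}, ..., theta_{jk_jl});
    the other entries are multiplied by (1 - new_first) / (1 - theta[0]).
    """
    if not 0.0 <= new_first <= 1.0:
        raise ValueError("new_first must be a probability")
    if theta[0] == 1.0:
        raise ValueError("cannot rescale: the remaining mass is zero")
    t = (1.0 - new_first) / (1.0 - theta[0])
    return [new_first] + [t * p for p in theta[1:]]


old = [0.2, 0.5, 0.3]                     # hypothetical distribution
new = proportional_scaling(old, 0.4)      # roughly [0.4, 0.375, 0.225]
print(new, sum(new))
```

The ratios between the untouched entries are preserved (here 0.5 : 0.3 before and after), which is the property that gives the scheme its name.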

Proportional scaling turns out to be optimal under the Chan - Darwiche distance measure.

Theorem 13.5. Consider a probability distribution P factorised according to a DAG G . Suppose the
value θj1l is changed to θ̃j1l . Among the class Q of probability distributions Q factorised along G with
QXj∣Paj (x_j^{(1)}∣π_j^{(l)}) = θ̃j1l , minQ∈Q DCD (P, Q) is obtained for Q such that θ̃a.b = θa.b for all (a, b) ≠ (j, l)
and

\[ \tilde\theta_{jil} = \theta_{jil}\, \frac{1 - \tilde\theta_{j1l}}{1 - \theta_{j1l}}, \qquad i = 2, \ldots, k_j. \]
Under proportional scaling, the Chan - Darwiche distance is then given by

DCD (P, Q) = ∣ ln θ̃j1l − ln θj1l ∣ + ∣ ln(1 − θ̃j1l ) − ln(1 − θj1l )∣.

Proof Let P be a distribution that factorises along a DAG G , with conditional probabilities θaib =
PXa∣Paa (x_a^{(i)}∣π_a^{(b)}). Let Q denote the distribution that factorises along G , with conditional probabilities

θ̃aib = QXa∣Paa (x_a^{(i)}∣π_a^{(b)}),

where θ̃j1l is given,

θ̃aib = θaib (a, b) ≠ (j, l)

and

\[ \tilde\theta_{jil} = \theta_{jil}\, \frac{1 - \tilde\theta_{j1l}}{1 - \theta_{j1l}}, \qquad i = 2, \ldots, k_j. \]
This is the distribution generated by the proportional scheme. Let R denote any other probability
belonging to class Q.

If θj1l = 1 and θ̃j1l < 1, then there is a θ̃jkl > 0 with θjkl = 0 and it follows that DCD (P, Q) = DCD (P, R) =
+∞.

If θj1l = 0 and θ̃j1l > 0 then, similarly, it follows directly that DCD (P, Q) = DCD (P, R) = +∞.

Consider 0 < θj1l < 1. Firstly, consider θ̃j1l > θj1l . Then

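\[ \max_{x \in \mathcal{X}} \frac{Q(x)}{P(x)} = \max\left( \frac{\tilde\theta_{j1l}}{\theta_{j1l}}, \frac{1 - \tilde\theta_{j1l}}{1 - \theta_{j1l}} \right) = \frac{\tilde\theta_{j1l}}{\theta_{j1l}} \]

and

\[ \min_{x \in \mathcal{X}} \frac{Q(x)}{P(x)} = \frac{1 - \tilde\theta_{j1l}}{1 - \theta_{j1l}}. \]

Similarly, if θ̃j1l < θj1l , then max_{x∈X} Q(x)/P(x) = (1 − θ̃j1l )/(1 − θj1l ) and min_{x∈X} Q(x)/P(x) = θ̃j1l /θj1l , so

DCD (P, Q) = ∣ ln θ̃j1l − ln θj1l ∣ + ∣ ln(1 − θ̃j1l ) − ln(1 − θj1l )∣.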


Let R denote any other distribution that factorises along G with RXj∣Paj (x_j^{(1)}∣π_j^{(l)}) = θ̃j1l . The next
task is to prove that DCD (P, R) ≥ DCD (P, Q).

P and R may be expressed as PX,Y and RX,Y where (X, Y ) are two sets of variables. Using PX,Y =
PX PY∣X and RX,Y = RX RY∣X and (x^∗, y^∗) and (x_∗, y_∗) to denote the points where the maxima and
minima of the ratios are achieved, it follows that

\[ \begin{aligned}
D_{CD}(P_{X,Y}, R_{X,Y}) &= \ln \frac{P_{X,Y}(x^*, y^*)}{R_{X,Y}(x^*, y^*)} - \ln \frac{P_{X,Y}(x_*, y_*)}{R_{X,Y}(x_*, y_*)} \\
&= \ln \frac{P_X(x^*)}{R_X(x^*)} - \ln \frac{P_X(x_*)}{R_X(x_*)} + \ln \frac{P_{Y|X}(y^*|x^*)}{R_{Y|X}(y^*|x^*)} - \ln \frac{P_{Y|X}(y_*|x_*)}{R_{Y|X}(y_*|x_*)} \\
&\geq D_{CD}(P_X, R_X) + \ln \frac{P_{Y|X}(y^*|x^*)}{R_{Y|X}(y^*|x^*)} - \ln \frac{P_{Y|X}(y_*|x_*)}{R_{Y|X}(y_*|x_*)}.
\end{aligned} \]

Now, because (x^∗, y^∗) maximises the ratio, it follows that y^∗ maximises the ratio PY∣X (.∣x^∗)/RY∣X (.∣x^∗) and hence
that ln [PY∣X (y^∗∣x^∗)/RY∣X (y^∗∣x^∗)] ≥ 0; similarly, ln [PY∣X (y_∗∣x_∗)/RY∣X (y_∗∣x_∗)] ≤ 0, so that

DCD (PX,Y , RX,Y ) ≥ DCD (PX , RX ).

Now let X denote the set of variables (X1 , . . . , Xj ) and Y the set of variables (Xj+1 , . . . , Xd ). It follows
that

DCD (P, R) ≥ DCD (PX1 ,...,Xj , RX1 ,...,Xj )


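where the notation is clear. Finally, for any z corresponding to parent configuration π_j^{(l)},

\[ \begin{aligned}
D_{CD}(P_{X_1,\ldots,X_j}, R_{X_1,\ldots,X_j}) &\geq \ln \max_{x \,:\, (x_1,\ldots,x_{j-1}) = z} \frac{P_{X_1,\ldots,X_j}}{R_{X_1,\ldots,X_j}} - \ln \min_{x \,:\, (x_1,\ldots,x_{j-1}) = z} \frac{P_{X_1,\ldots,X_j}}{R_{X_1,\ldots,X_j}} \\
&= \ln \max_{x \,:\, (x_1,\ldots,x_{j-1}) = z} \frac{P_{X_j|Pa_j}}{R_{X_j|Pa_j}} - \ln \min_{x \,:\, (x_1,\ldots,x_{j-1}) = z} \frac{P_{X_j|Pa_j}}{R_{X_j|Pa_j}} \\
&= D_{CD}(\theta_{j.l}, \tilde\theta_{j.l}).
\end{aligned} \]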

13.2.1 Query Constraints


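Let $Y$, $Z$ denote two random variables such that $Y \notin E$ and $Z \notin E$. The query constraints considered in this section are of the following types:

• Difference constraint:
\[
P_{Y \mid E}(y \mid e) - P_{Z \mid E}(z \mid e) \geq \epsilon, \tag{13.2}
\]

• Ratio constraint:
\[
\frac{P_{Y \mid E}(y \mid e)}{P_{Z \mid E}(z \mid e)} \geq \epsilon. \tag{13.3}
\]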

The notation will be abbreviated by writing: P(y∣e) when the abbreviation is clear from the context.

Let $P_X$ denote the probability function for a collection of variables $X = (X_1, \ldots, X_d)$, which may be factorised along a graph $G = (V, E)$ (where $V = \{X_1, \ldots, X_d\}$), with given conditional probability potentials $\theta_{jil} = P_{X_j \mid \mathrm{Pa}_j}(x_j^{(i)} \mid \pi_j^{(l)})$. Then

\[
P_X(x) = \prod_{j=1}^{d} \prod_{l=1}^{q_j} \prod_{i=1}^{k_j} \theta_{jil}^{n_j(i,l)},
\]

where $n_j(i, l) = 1$ if the child-parent configuration $(x_j^{(i)}, \pi_j^{(l)})$ appears in $x$ and $0$ otherwise. Suppose that the probabilities $(\theta_{j1l}, \ldots, \theta_{j,k_j,l})$ are parametrised by $(t_1^{(jl)}, \ldots, t_{m_j}^{(jl)})$, where $m_j \leq k_j - 1$. The following result holds.
Theorem 13.6. Let $X = (X_1, \ldots, X_d)$ denote a set of variables and let $P$ be a probability distribution that factorises along a DAG $G$ with node set $V = \{X_1, \ldots, X_d\}$. Let $\theta_{jil} = P_{X_j \mid \mathrm{Pa}_j}(x_j^{(i)} \mid \pi_j^{(l)})$. Suppose that for each $(j, l)$ the probabilities $(\theta_{j1l}, \ldots, \theta_{j,k_j,l})$ are parametrised by $(t_1^{(jl)}, \ldots, t_{m_{jl}}^{(jl)})$ where $m_{jl} \leq k_j - 1$. Let $E = (X_{e_1}, \ldots, X_{e_m})$ denote a subset of $X$ and let $e = (x_{e_1}^{(i_1)}, \ldots, x_{e_m}^{(i_m)})$ denote an instantiation of $E$. Then for all $1 \leq k \leq m_{jl}$,

\[
\frac{\partial}{\partial t_k^{(jl)}} P_E(e) = \sum_{i=1}^{k_j} \frac{P_{E, X_j, \mathrm{Pa}_j}(e, x_j^{(i)}, \pi_j^{(l)})}{\theta_{jil}} \frac{\partial \theta_{jil}}{\partial t_k^{(jl)}}.
\]

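Proof Firstly,

\begin{align*}
P_E(e) &= \sum_{i,l} P_{E \mid X_j, \mathrm{Pa}_j}(e \mid x_j^{(i)}, \pi_j^{(l)}) P_{X_j \mid \mathrm{Pa}_j}(x_j^{(i)} \mid \pi_j^{(l)}) P_{\mathrm{Pa}_j}(\pi_j^{(l)}) \\
&= \sum_{i,l} P_{E \mid X_j, \mathrm{Pa}_j}(e \mid x_j^{(i)}, \pi_j^{(l)}) \, \theta_{jil} \, P_{\mathrm{Pa}_j}(\pi_j^{(l)}).
\end{align*}

It follows that

\begin{align*}
\frac{\partial}{\partial t_k^{(jl)}} P_E(e) &= \sum_{i=1}^{k_j} P_{E \mid X_j, \mathrm{Pa}_j}(e \mid x_j^{(i)}, \pi_j^{(l)}) P_{\mathrm{Pa}_j}(\pi_j^{(l)}) \frac{\partial \theta_{jil}}{\partial t_k^{(jl)}} \\
&= \sum_{i=1}^{k_j} \frac{P_{X_j, \mathrm{Pa}_j \mid E}(x_j^{(i)}, \pi_j^{(l)} \mid e) P_E(e) P_{\mathrm{Pa}_j}(\pi_j^{(l)})}{P_{X_j, \mathrm{Pa}_j}(x_j^{(i)}, \pi_j^{(l)})} \frac{\partial \theta_{jil}}{\partial t_k^{(jl)}} \\
&= \sum_{i=1}^{k_j} \frac{P_{X_j, \mathrm{Pa}_j, E}(x_j^{(i)}, \pi_j^{(l)}, e)}{P_{X_j \mid \mathrm{Pa}_j}(x_j^{(i)} \mid \pi_j^{(l)})} \frac{\partial \theta_{jil}}{\partial t_k^{(jl)}} \\
&= \sum_{i=1}^{k_j} \frac{P_{X_j, \mathrm{Pa}_j, E}(x_j^{(i)}, \pi_j^{(l)}, e)}{\theta_{jil}} \frac{\partial \theta_{jil}}{\partial t_k^{(jl)}}
\end{align*}

as required.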

Proportional Scaling Again, the complete set of variables is $X = (X_1, \ldots, X_d)$, with a joint probability distribution $P$ that may be factorised along a Directed Acyclic Graph $G$. Evidence is received on a subset of the variables $E = (X_{e_1}, \ldots, X_{e_m})$. Consider a proportional scaling scheme, where each conditional probability distribution $(\theta_{j1l}, \ldots, \theta_{j k_j l})$ has exactly one parameter. Under proportional scaling, this may be represented as $\theta_{j1l} = t^{(jl)}$ and there are non-negative numbers $a_2^{(jl)}, \ldots, a_{k_j}^{(jl)}$ satisfying $\sum_{\alpha=2}^{k_j} a_\alpha^{(jl)} = 1$, such that

\begin{align*}
\theta_{j1l} &= t^{(jl)}, \\
\theta_{j\alpha l} &= a_\alpha^{(jl)} (1 - t^{(jl)}), \qquad \alpha = 2, \ldots, k_j.
\end{align*}

Then, an application of Theorem 13.6 in the simplified setting of proportional scaling immediately gives

\[
\frac{\partial}{\partial t^{(jl)}} P_E(e) = \frac{P_{E, X_j, \mathrm{Pa}_j}(e, x_j^{(1)}, \pi_j^{(l)})}{\theta_{j1l}} - \sum_{\alpha=2}^{k_j} a_\alpha^{(jl)} \frac{P_{E, X_j, \mathrm{Pa}_j}(e, x_j^{(\alpha)}, \pi_j^{(l)})}{\theta_{j\alpha l}}. \tag{13.4}
\]

When a proportional scaling scheme is used, Theorem 13.4 gives

PE (e) = α + βt(jl) ,

where $\alpha$ and $\beta$ do not depend on $t^{(jl)}$. It follows that for any $t^{(jl)}$, $\frac{\partial}{\partial t^{(jl)}} P_E(e) = \beta$, where $\beta$ is constant (i.e. it does not depend on $t^{(jl)}$). This observation makes it straightforward, under proportional scaling, to find the necessary change in a single parameter $t^{(jl)}$ (if such a parameter change is possible) to enforce a query constraint.

13.2.2 Binary Variables


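Assume that variable $X_j$ is binary, with $P_{X_j \mid \mathrm{Pa}_j}(x_j^{(1)} \mid \pi_j^{(l)}) = t^{(jl)}$ and $P_{X_j \mid \mathrm{Pa}_j}(x_j^{(0)} \mid \pi_j^{(l)}) = 1 - t^{(jl)}$. Then Equation (13.4) reduces to:

\[
\frac{\partial}{\partial t^{(jl)}} P_E(e) = \frac{P_{E, X_j, \mathrm{Pa}_j}(e, x_j^{(1)}, \pi_j^{(l)})}{t^{(jl)}} - \frac{P_{E, X_j, \mathrm{Pa}_j}(e, x_j^{(0)}, \pi_j^{(l)})}{1 - t^{(jl)}}. \tag{13.5}
\]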
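The statement $Y = y, E = e$ may be treated as hard evidence. By Theorem 13.4, it follows that there are real numbers $\lambda$, $\lambda_y$ and $\lambda_z$ such that

\begin{align*}
\lambda &= \frac{\partial}{\partial t^{(jl)}} P_E(e) = \frac{P_{E, X_j, \mathrm{Pa}_j}(e, x_j^{(1)}, \pi_j^{(l)})}{t^{(jl)}} - \frac{P_{E, X_j, \mathrm{Pa}_j}(e, x_j^{(0)}, \pi_j^{(l)})}{1 - t^{(jl)}}, \\
\lambda_y &= \frac{\partial}{\partial t^{(jl)}} P_{Y,E}(y, e) = \frac{P_{Y, E, X_j, \mathrm{Pa}_j}(y, e, x_j^{(1)}, \pi_j^{(l)})}{t^{(jl)}} - \frac{P_{Y, E, X_j, \mathrm{Pa}_j}(y, e, x_j^{(0)}, \pi_j^{(l)})}{1 - t^{(jl)}}
\end{align*}

and

\[
\lambda_z = \frac{\partial}{\partial t^{(jl)}} P_{Z,E}(z, e) = \frac{P_{Z, E, X_j, \mathrm{Pa}_j}(z, e, x_j^{(1)}, \pi_j^{(l)})}{t^{(jl)}} - \frac{P_{Z, E, X_j, \mathrm{Pa}_j}(z, e, x_j^{(0)}, \pi_j^{(l)})}{1 - t^{(jl)}}.
\]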
The following is a corollary of Theorem 13.6, which reduces to Equation (13.5) for the binary case.

Corollary 13.7. To satisfy the constraint given by Equation (13.2), the parameter t(jl) has to be
changed to t(jl) + δ, where δ satisfies

PY,E (y, e) − PZ,E (z, e) − ϵPE (e) ≥ δ(−λy + λz + ϵλ). (13.6)

To satisfy the constraint given by Equation (13.3), the parameter t(jl) has to be changed to t(jl) + δ ,
where

PY,E (y, e) − ϵPZ,E (z, e) ≥ δ(−λy + ϵλz ). (13.7)


Proof Since $P_{Y \mid E}(y \mid e) = \frac{P_{Y,E}(y, e)}{P_E(e)}$, it follows that $P_{Y \mid E}(y \mid e) - P_{Z \mid E}(z \mid e) \geq \epsilon$ is equivalent to $P_{Y,E}(y, e) - P_{Z,E}(z, e) \geq \epsilon P_E(e)$. A change of $\delta$ in the parameter $t^{(jl)}$ changes $P_{Y,E}(y, e)$, $P_{Z,E}(z, e)$ and $P_E(e)$ to $P_{Y,E}(y, e) + \delta\lambda_y$, $P_{Z,E}(z, e) + \delta\lambda_z$ and $P_E(e) + \delta\lambda$ respectively. To enforce the difference constraint, it follows that $\delta$ satisfies
\[
(P_{Y,E}(y, e) + \lambda_y \delta) - (P_{Z,E}(z, e) + \lambda_z \delta) \geq \epsilon (P_E(e) + \lambda \delta).
\]
Equation (13.6) follows directly.

Similarly, to enforce the ratio constraint, the following inequality is required:

\[
\frac{P_{Y,E}(y, e) + \lambda_y \delta}{P_{Z,E}(z, e) + \lambda_z \delta} \geq \epsilon.
\]
Equation (13.7) now follows directly and the proof is complete.
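The way Corollary 13.7 is used in practice can be sketched in a few lines. The network below (binary X with children Y and E, parameter t = P(X = 1)) and all of its numbers are hypothetical and chosen only for illustration; the helper names are likewise assumptions. The slopes are computed from Equation (13.5) and the inequality (13.6) is solved for δ, taking the sign of the coefficient into account.

```python
def joint(t, y, cpt_y, cpt_e):
    """P(Y = y, E = e) for the hypothetical network Y <- X -> E with P(X = 1) = t."""
    total = 0.0
    for x, p_x in ((1, t), (0, 1.0 - t)):
        p_y = cpt_y[x] if y == 1 else 1.0 - cpt_y[x]
        total += p_x * p_y * cpt_e[x]
    return total


def slope(y, cpt_y, cpt_e):
    """Derivative of P(Y = y, E = e) with respect to t, following Equation (13.5)."""
    p_y1 = cpt_y[1] if y == 1 else 1.0 - cpt_y[1]
    p_y0 = cpt_y[0] if y == 1 else 1.0 - cpt_y[0]
    return p_y1 * cpt_e[1] - p_y0 * cpt_e[0]


def solve_delta(lhs, coeff):
    """Describe the set of delta satisfying lhs >= delta * coeff."""
    if coeff > 0:
        return ("delta <=", lhs / coeff)
    if coeff < 0:
        return ("delta >=", lhs / coeff)
    return ("any delta", None) if lhs >= 0 else ("no delta", None)


cpt_y = {1: 0.9, 0: 0.2}   # hypothetical P(Y = 1 | X = x)
cpt_e = {1: 0.5, 0: 0.4}   # hypothetical P(E = e | X = x)
t, eps = 0.3, 0.2          # current parameter and constraint level

p_ye, p_ze = joint(t, 1, cpt_y, cpt_e), joint(t, 0, cpt_y, cpt_e)
p_e = p_ye + p_ze
lam_y, lam_z = slope(1, cpt_y, cpt_e), slope(0, cpt_y, cpt_e)
lam = lam_y + lam_z

# Difference constraint (13.2) with y = 1 and z = 0, solved through (13.6).
direction, bound = solve_delta(p_ye - p_ze - eps * p_e,
                               -lam_y + lam_z + eps * lam)
```

With these numbers the coefficient of δ is negative, so the result is a lower bound on δ; moving the parameter to `t + bound` makes the constraint hold with equality. The sketch ignores the additional requirement that `t + delta` must remain in [0, 1].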

13.3 The Sensitivity of Queries to Parameter Changes


In line with the Chan-Darwiche distance measure, sensitivity is defined in the following way.

Definition 13.8 (Sensitivity). Let $P$ denote a parametrised family of probability distributions, over a finite, discrete state space $\mathcal{X}$, parametrised by $k$ parameters $(\theta_1, \ldots, \theta_k) \in \tilde\Theta$, where $\tilde\Theta \subseteq \mathbb{R}^k$ denotes the parameter space. Let $P^{(\theta_1, \ldots, \theta_k)}(\cdot)$ denote the probability function over $\mathcal{X}$ when the parameters are fixed at $\theta_1, \ldots, \theta_k$. Then the sensitivity of $P$ to parameter $\theta_j$ is defined as
\[
S_j(P)(\theta_1, \ldots, \theta_k) = \max_{x \in \mathcal{X}} \frac{\partial}{\partial \theta_j} \ln P^{(\theta_1, \ldots, \theta_k)}(x) - \min_{x \in \mathcal{X}} \frac{\partial}{\partial \theta_j} \ln P^{(\theta_1, \ldots, \theta_k)}(x).
\]

Example 13.9.
If $P$ is a family of binary variables, with state space $\mathcal{X} = \{x_0, x_1\}$ and a single parameter $\theta$, then

\[
S(P)(\theta) = \left| \frac{\partial}{\partial \theta} \ln \frac{P^{(\theta)}(x_1)}{P^{(\theta)}(x_0)} \right|.
\]

This section restricts attention to a single parameter model. Consider a network with $d$ variables, $X = (X_1, \ldots, X_d)$, where one particular variable $X_j$ is a binary variable. The other variables may be multivalued. Let
\[
t^{(jl)} = P_{X_j \mid \mathrm{Pa}_j}(x_j^{(1)} \mid \pi_j^{(l)}).
\]
Let $Y$ denote a collection of variables, taken from $(X_1, \ldots, X_d)$, and let $Y = y$ denote an instantiation of these variables. Let $y$ denote the event $\{Y = y\}$ and let $y^c$ denote the event $\{Y \neq y\}$. Similarly, let $e$ denote the event $\{E = e\}$, where $E$ is a different sub-collection of variables from $X$. From Definition 13.8, the sensitivity of a query $P(y \mid e)$ to the parameter $t^{(jl)}$ is defined as

\[
\left| \frac{\partial}{\partial t^{(jl)}} \ln \frac{P(y \mid e)}{P(y^c \mid e)} \right|.
\]
The following theorem provides a simple bound on the derivative in terms of $P(y \mid e)$ and $t^{(jl)}$ only.

Theorem 13.10. Suppose $X_j$ is a binary variable taking values $x_j^{(1)}$ or $x_j^{(0)}$. Set

\[
t^{(jl)} = P_{X_j \mid \mathrm{Pa}_j}(x_j^{(1)} \mid \pi_j^{(l)}).
\]

Then
\[
\left| \frac{\partial}{\partial t^{(jl)}} P(y \mid e) \right| \leq \frac{P(y \mid e)(1 - P(y \mid e))}{t^{(jl)}(1 - t^{(jl)})}. \tag{13.8}
\]

The example given after the proof shows that this bound is sharp; there are situations where the derivative attains the bound exactly.

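Proof of Theorem 13.10 Firstly, $P(y \mid e) = \frac{P(y, e)}{P(e)}$, so that

\[
\frac{\partial}{\partial t^{(jl)}} P(y \mid e) = \frac{1}{P(e)} \frac{\partial}{\partial t^{(jl)}} P(y, e) - \frac{P(y, e)}{P^2(e)} \frac{\partial}{\partial t^{(jl)}} P(e).
\]

Using this, Equation (13.5) gives

\begin{align}
\frac{\partial}{\partial t^{(jl)}} P(y \mid e)
&= \frac{(1 - t^{(jl)}) P(y, x_j^{(1)}, \pi_j^{(l)} \mid e) - t^{(jl)} P(y, x_j^{(0)}, \pi_j^{(l)} \mid e)}{t^{(jl)}(1 - t^{(jl)})} \nonumber \\
&\quad - \frac{(1 - t^{(jl)}) P(y \mid e) P_{X_j, \mathrm{Pa}_j \mid E}(x_j^{(1)}, \pi_j^{(l)} \mid e) - t^{(jl)} P(y \mid e) P_{X_j, \mathrm{Pa}_j \mid E}(x_j^{(0)}, \pi_j^{(l)} \mid e)}{t^{(jl)}(1 - t^{(jl)})} \tag{13.9} \\
&= \frac{(1 - t^{(jl)}) \left( P_{Y, X_j, \mathrm{Pa}_j \mid E}(y, x_j^{(1)}, \pi_j^{(l)} \mid e) - P(y \mid e) P_{X_j, \mathrm{Pa}_j \mid E}(x_j^{(1)}, \pi_j^{(l)} \mid e) \right)}{t^{(jl)}(1 - t^{(jl)})} \nonumber \\
&\quad - \frac{t^{(jl)} \left( P_{Y, X_j, \mathrm{Pa}_j \mid E}(y, x_j^{(0)}, \pi_j^{(l)} \mid e) - P(y \mid e) P_{X_j, \mathrm{Pa}_j \mid E}(x_j^{(0)}, \pi_j^{(l)} \mid e) \right)}{t^{(jl)}(1 - t^{(jl)})}. \tag{13.10}
\end{align}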

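With the shorthand notation $y^c$ to denote the event $\{Y \neq y\}$,

\begin{align*}
& P_{X_j, \mathrm{Pa}_j, Y \mid E}(x_j^{(1)}, \pi_j^{(l)}, y \mid e) - P(y \mid e) P_{X_j, \mathrm{Pa}_j \mid E}(x_j^{(1)}, \pi_j^{(l)} \mid e) \\
&\quad \leq P_{X_j, \mathrm{Pa}_j, Y \mid E}(x_j^{(1)}, \pi_j^{(l)}, y \mid e) - P_{Y \mid E}(y \mid e) P_{X_j, \mathrm{Pa}_j, Y \mid E}(x_j^{(1)}, \pi_j^{(l)}, y \mid e) \\
&\quad = P_{X_j, \mathrm{Pa}_j, Y \mid E}(x_j^{(1)}, \pi_j^{(l)}, y \mid e)(1 - P(y \mid e)) \\
&\quad \leq P(y \mid e)(1 - P(y \mid e))
\end{align*}

and

\begin{align*}
& P(y \mid e) P_{X_j, \mathrm{Pa}_j \mid E}(x_j^{(1)}, \pi_j^{(l)} \mid e) - P_{X_j, \mathrm{Pa}_j, Y \mid E}(x_j^{(1)}, \pi_j^{(l)}, y \mid e) \\
&\quad = (1 - P(y^c \mid e)) P_{X_j, \mathrm{Pa}_j \mid E}(x_j^{(1)}, \pi_j^{(l)} \mid e) - P_{X_j, \mathrm{Pa}_j \mid E}(x_j^{(1)}, \pi_j^{(l)} \mid e) + P_{X_j, \mathrm{Pa}_j, Y \mid E}(x_j^{(1)}, \pi_j^{(l)}, y^c \mid e) \\
&\quad = P_{X_j, \mathrm{Pa}_j, Y \mid E}(x_j^{(1)}, \pi_j^{(l)}, y^c \mid e) - P(y^c \mid e) P_{X_j, \mathrm{Pa}_j \mid E}(x_j^{(1)}, \pi_j^{(l)} \mid e) \\
&\quad \leq P_{X_j, \mathrm{Pa}_j, Y \mid E}(x_j^{(1)}, \pi_j^{(l)}, y^c \mid e)(1 - P(y^c \mid e)) \\
&\quad \leq P(y^c \mid e)(1 - P(y^c \mid e)) \\
&\quad = (1 - P(y \mid e)) P(y \mid e).
\end{align*}

From this, it follows directly from Equation (13.10) that

\[
\left| \frac{\partial}{\partial t^{(jl)}} P(y \mid e) \right| \leq \frac{P(y \mid e)(1 - P(y \mid e))}{t^{(jl)}(1 - t^{(jl)})}.
\]
The proof of Theorem 13.10 is complete.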

Corollary 13.11. The sensitivity of $P(y \mid e)$ to the parameter $t^{(jl)}$ is bounded by

\[
\left| \frac{\partial}{\partial t^{(jl)}} \ln \frac{P(y \mid e)}{P(y^c \mid e)} \right| \leq \frac{1}{t^{(jl)}(1 - t^{(jl)})}. \tag{13.11}
\]

Proof Immediate.

It is clear that the worst situation from a robustness point of view arises when the parameter value
t(jl) is close to either 0 or 1, while the query takes values that are close to neither 0 nor 1.

Example 13.12.

This example shows that the bounds given by inequalities (13.8) and (13.11) are sharp, in the sense
that there are examples where the bounds are attained. Consider the network given in Figure 13.2,
where X and Y are binary variables taking values from (x0 , x1 ) and (y0 , y1 ) respectively. PX (x0 ) = θx
and PY (y0 ) = θy . Suppose that E is a deterministic binary variable; that is, P({E = e}∣{X = Y }) = 1
and P({E = e}∣{X ≠ Y }) = 0.

[Figure 13.2: The Network Used in Example 13.12. Nodes X, Y and E, with directed edges X → E and Y → E.]

The probability potentials are

P_X:   x0    x1            P_Y:   y0    y1
       θx    1 − θx               θy    1 − θy

P_{E|X,Y}(e | ., .):

  X/Y   y0   y1
  x0    1    0
  x1    0    1

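from which it follows that

\[
P_{Y \mid E}(y_0 \mid e) = \frac{P_{Y,E}(y_0, e)}{P_E(e)} = \frac{P_Y(y_0) \sum_x P_X(x) P_{E \mid X,Y}(e \mid x, y_0)}{\sum_{x,y} P_X(x) P_Y(y) P_{E \mid X,Y}(e \mid x, y)} = \frac{\theta_y \theta_x}{\theta_y \theta_x + (1 - \theta_y)(1 - \theta_x)}
\]

and
\[
\frac{\partial}{\partial \theta_x} P_{Y \mid E}(y_0 \mid e) = \frac{\theta_y (1 - \theta_y)}{(\theta_x \theta_y + (1 - \theta_x)(1 - \theta_y))^2}
\]
while

\[
\frac{P_{Y \mid E}(y_0 \mid e)(1 - P_{Y \mid E}(y_0 \mid e))}{\theta_x (1 - \theta_x)} = \frac{\theta_y \theta_x (1 - \theta_y)(1 - \theta_x)}{\theta_x (1 - \theta_x)(\theta_x \theta_y + (1 - \theta_x)(1 - \theta_y))^2} = \frac{\theta_y (1 - \theta_y)}{(\theta_x \theta_y + (1 - \theta_x)(1 - \theta_y))^2},
\]

so that
\[
\frac{\partial}{\partial \theta_x} P_{Y \mid E}(y_0 \mid e) = \frac{P_{Y \mid E}(y_0 \mid e)(1 - P_{Y \mid E}(y_0 \mid e))}{\theta_x (1 - \theta_x)},
\]
showing that the bound (13.8) is achieved.
For the bound (13.11), note from the above that

\[
\frac{\partial}{\partial \theta_x} P_{Y \mid E}(y_0 \mid e) = \frac{P_{Y \mid E}(y_0 \mid e) P_{Y \mid E}(y_1 \mid e)}{\theta_x (1 - \theta_x)}
\]

so that
\[
\frac{\partial}{\partial \theta_x} \ln P_{Y \mid E}(y_0 \mid e) = \frac{P_{Y \mid E}(y_1 \mid e)}{\theta_x (1 - \theta_x)}
\]
and, because $P_{Y \mid E}(y_0 \mid e) + P_{Y \mid E}(y_1 \mid e) = 1$,

\[
\frac{\partial}{\partial \theta_x} P_{Y \mid E}(y_1 \mid e) = -\frac{\partial}{\partial \theta_x} P_{Y \mid E}(y_0 \mid e) = -\frac{P_{Y \mid E}(y_0 \mid e) P_{Y \mid E}(y_1 \mid e)}{\theta_x (1 - \theta_x)}
\]

so that
\[
\frac{\partial}{\partial \theta_x} \ln \frac{P_{Y \mid E}(y_0 \mid e)}{P_{Y \mid E}(y_1 \mid e)} = \frac{1}{\theta_x (1 - \theta_x)},
\]

so that equality is achieved in bound (13.11).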
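The two equalities of Example 13.12 can be confirmed numerically. The sketch below is illustrative only: the parameter values and the finite-difference step `h` are arbitrary choices, and central differences stand in for the exact derivatives.

```python
import math


def posterior_y0(theta_x, theta_y):
    """P(Y = y0 | E = e) in Example 13.12, where E = e exactly when X and Y agree."""
    both_0 = theta_x * theta_y
    both_1 = (1.0 - theta_x) * (1.0 - theta_y)
    return both_0 / (both_0 + both_1)


def log_odds_y0(theta_x, theta_y):
    """ln of P(y0 | e) / P(y1 | e)."""
    p = posterior_y0(theta_x, theta_y)
    return math.log(p / (1.0 - p))


theta_x, theta_y, h = 0.3, 0.6, 1e-6   # arbitrary illustrative values

p = posterior_y0(theta_x, theta_y)

# Central finite differences in theta_x.
derivative = (posterior_y0(theta_x + h, theta_y)
              - posterior_y0(theta_x - h, theta_y)) / (2.0 * h)
log_odds_derivative = (log_odds_y0(theta_x + h, theta_y)
                       - log_odds_y0(theta_x - h, theta_y)) / (2.0 * h)

bound_13_8 = p * (1.0 - p) / (theta_x * (1.0 - theta_x))
bound_13_11 = 1.0 / (theta_x * (1.0 - theta_x))
```

Here `derivative` matches `bound_13_8` and `log_odds_derivative` matches `bound_13_11` up to the finite-difference error, which is the sense in which both bounds are sharp.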

The following results bound the odds.

Theorem 13.13. Let $P$ be a parametrised family of probability distributions, factorised along the same DAG, with a single parameter $\theta$. Let $X_j$ be a binary variable and let $\theta = P^{(\theta)}_{X_j \mid \mathrm{Pa}_j}(x_j^{(0)} \mid \pi_j^{(l)})$; all the other CPPs remain fixed. Let $O_\theta = \frac{\theta}{1 - \theta}$. Consider a parameter change from $\theta = t$ to $\theta = s$. Note that $O_t = \frac{t}{1 - t}$ and $O_s = \frac{s}{1 - s}$. Let $P^{(\theta)}(y \mid e)$ denote the probability value of a query when $\theta$ is the parameter value. Let $\tilde O_\theta(y \mid e) = \frac{P^{(\theta)}(y \mid e)}{1 - P^{(\theta)}(y \mid e)}$. Then

\[
\frac{O_t}{O_s} \leq \frac{\tilde O_s(y \mid e)}{\tilde O_t(y \mid e)} \leq \frac{O_s}{O_t}, \qquad s \geq t,
\]

\[
\frac{O_s}{O_t} \leq \frac{\tilde O_s(y \mid e)}{\tilde O_t(y \mid e)} \leq \frac{O_t}{O_s}, \qquad s \leq t.
\]
This gives the bound
\[
\left| \ln \tilde O_s(y \mid e) - \ln \tilde O_t(y \mid e) \right| \leq \left| \ln O_s - \ln O_t \right|.
\]

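Proof Let $x$ denote the probability of the query $P(y \mid e)$ when the value of the parameter $t^{(jl)}$ is $z$. Note that, for $0 < a \leq b < 1$,
\[
\int_a^b \frac{dx}{x(1 - x)} = \int_a^b \frac{dx}{x} + \int_a^b \frac{dx}{1 - x} = \ln \left( \frac{b}{a} \cdot \frac{1 - a}{1 - b} \right).
\]
Then, for $t \leq s$, Equation (13.8) gives

\[
-\int_t^s \frac{dz}{z(1 - z)} \leq \int_{P_t(y \mid e)}^{P_s(y \mid e)} \frac{dx}{x(1 - x)} \leq \int_t^s \frac{dz}{z(1 - z)},
\]

so that
\[
-\ln \left( \frac{s}{t} \cdot \frac{1 - t}{1 - s} \right) \leq \ln \left( \frac{P_s(y \mid e)}{P_t(y \mid e)} \cdot \frac{1 - P_t(y \mid e)}{1 - P_s(y \mid e)} \right) \leq \ln \left( \frac{s}{t} \cdot \frac{1 - t}{1 - s} \right)
\]
giving immediately that
\[
\frac{O_t}{O_s} \leq \frac{\tilde O_s(y \mid e)}{\tilde O_t(y \mid e)} \leq \frac{O_s}{O_t}.
\]
For $s \leq t$ the argument is similar and gives

\[
\frac{O_s}{O_t} \leq \frac{\tilde O_s(y \mid e)}{\tilde O_t(y \mid e)} \leq \frac{O_t}{O_s}.
\]

In both cases
\[
\left| \ln \tilde O_s(y \mid e) - \ln \tilde O_t(y \mid e) \right| \leq \left| \ln O_s - \ln O_t \right|
\]

and the result follows.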
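The odds bound of Theorem 13.13 can be illustrated on a small network. The sketch below is an assumption-laden example rather than anything from the original text: it uses a three-node network X → E ← Y in which P(e | x, y) equals `alpha` when X and Y agree and `beta` otherwise, with arbitrary numerical values, and it varies the parameter P(X = x0) from t to s.

```python
import math


def query_odds(theta_x, theta_y, alpha, beta):
    """Odds of Y = y0 against Y = y1 given E = e, by enumerating the joint."""
    p_x = {0: theta_x, 1: 1.0 - theta_x}
    p_y = {0: theta_y, 1: 1.0 - theta_y}
    joint = {}
    for y in (0, 1):
        joint[y] = sum(p_x[x] * p_y[y] * (alpha if x == y else beta)
                       for x in (0, 1))
    return joint[0] / joint[1]


def parameter_odds(theta):
    """O_theta = theta / (1 - theta)."""
    return theta / (1.0 - theta)


theta_y, alpha, beta = 0.6, 0.9, 0.2   # hypothetical fixed potentials
t, s = 0.3, 0.7                        # parameter change with s >= t

ratio = query_odds(s, theta_y, alpha, beta) / query_odds(t, theta_y, alpha, beta)
lower = parameter_odds(t) / parameter_odds(s)
upper = parameter_odds(s) / parameter_odds(t)

log_change_query = abs(math.log(ratio))
log_change_parameter = abs(math.log(parameter_odds(s)) - math.log(parameter_odds(t)))
```

For these values the ratio of query odds lies strictly inside the interval `[lower, upper]`, so the change in the log-odds of the query is smaller than the change in the log-odds of the parameter.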

Notes The observation that the probability of evidence is a linear function of any single parameter in the model, and hence that the conditional probability is the ratio of two linear functions, is due to Castillo, Gutiérrez and Hadi (1997) [12] and [13]. The most significant developments in sensitivity analysis, which comprise practically the whole chapter, were introduced by Chan and Darwiche in the article [21] (2002) and developed in the articles [20] (2005) and [22].
13.4 Exercises
1. Consider a Bernoulli trial, with probability function $P_X(\cdot \mid t)$ defined by

\[
P_X(x \mid t) = t^x (1 - t)^{1 - x}, \qquad x = 0, 1, \quad t \in [0, 1].
\]

Recall the definition of sensitivity, Definition 13.8. Compute the sensitivity with respect to the parameter $t$.

2. Consider the `fire' example given in the text. Suppose that the evidence is $(R, S) = (t, f)$. Let $P_T(t) = \theta$ be a variable parameter so that $P_T(f) = 1 - \theta$ and suppose that all the other probabilities are fixed, according to the values given. From an initial value $\theta_0 = 0.02$, compute the lower bound for the change $\delta$ required to satisfy the query constraint

\[
\frac{P_{T \mid R,S}(t \mid t, f)}{P_{F \mid R,S}(t \mid t, f)} \geq 10
\]

corresponding to Corollary 13.7 and express the probabilities needed in terms of the conditional probabilities given. This represents the constraint that, given the report without smoke, it is 10 times more likely that the alarm has been tampered with than that there is a real fire.

3. Consider a probability distribution

\[
P_{X,Y,E} = P_X P_Y P_{E \mid X,Y}
\]

where $X, Y, E$ are all binary variables and

P_X:   x0    x1            P_Y:   y0    y1
       θx    1 − θx               θy    1 − θy

P_{E|X,Y}(e | ., .):

  X/Y   y0   y1
  x0    α    β
  x1    β    α

and $\beta < \alpha$.

(a) Compute $\frac{\partial}{\partial \theta_x} \ln \frac{P_{Y \mid E}(y_0 \mid e)}{P_{Y \mid E}(y_1 \mid e)}$ and compare the result with the bound from Corollary 13.11.

(b) Let $O_s(y_0 \mid e) = \frac{P_{Y \mid E}(y_0 \mid e)}{P_{Y \mid E}(y_1 \mid e)}$ when $\theta_x = s$. Compute $\frac{O_s(y_0 \mid e)}{O_t(y_0 \mid e)}$ and compare with the bounds given by Theorem 13.13.

4. (a) On Odds and the Weight of Evidence Let $P$ be a probability distribution over a space $\mathcal{X}$. The odds of an event $A \subseteq \mathcal{X}$ given $B \subseteq \mathcal{X}$ under $P$, denoted by $O_P(A \mid B)$, is defined as

\[
O_P(A \mid B) = \frac{P(A \mid B)}{P(A^c \mid B)}. \tag{13.12}
\]

The weight of evidence $E$ in favour of an event $A$ given $B$, denoted by $W(A : E \mid B)$, is defined as
\[
W(A : E \mid B) = \ln \frac{O_P(A \mid B \cap E)}{O_P(A \mid B)}. \tag{13.13}
\]
Show that if $P(E \cap A^c \cap B) > 0$, then

\[
W(A : E \mid B) = \ln \frac{P(E \mid A \cap B)}{P(E \mid A^c \cap B)}. \tag{13.14}
\]

(b) On a generalised Odds and the Weight of Evidence Let $P$ denote a probability distribution over a space $\mathcal{X}$ and let $H_1 \subseteq \mathcal{X}$, $H_2 \subseteq \mathcal{X}$, $G \subseteq \mathcal{X}$ and $E \subseteq \mathcal{X}$. The odds of $H_1$ compared to $H_2$ given $G$, denoted by $O_P(H_1 / H_2 \mid G)$, is defined as

\[
O_P(H_1 / H_2 \mid G) = \frac{P(H_1 \mid G)}{P(H_2 \mid G)}. \tag{13.15}
\]

The generalised weight of evidence is defined by

\[
W(H_1 / H_2 : E \mid G) = \ln \frac{O_P(H_1 / H_2 \mid G \cap E)}{O_P(H_1 / H_2 \mid G)}. \tag{13.16}
\]

Show that if $P(H_1 \cap G \cap E) > 0$ and $P(H_2 \cap G \cap E) > 0$ then

\[
W(H_1 / H_2 : E \mid G) = \ln \frac{P(E \mid H_1 \cap G)}{P(E \mid H_2 \cap G)}. \tag{13.17}
\]

This is clearly a log-likelihood ratio and these notions are another expression for

posterior odds = likelihood ratio × prior odds.


13.5 Answers
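1.
\begin{align*}
S(P)(t) &= \max_{x \in \{0,1\}} \frac{d}{dt} \ln P_X(x \mid t) - \min_{x \in \{0,1\}} \frac{d}{dt} \ln P_X(x \mid t) \\
&= \max_{x \in \{0,1\}} \frac{d}{dt} \left( x \ln t + (1 - x) \ln(1 - t) \right) - \min_{x \in \{0,1\}} \frac{d}{dt} \left( x \ln t + (1 - x) \ln(1 - t) \right) \\
&= \max_{x \in \{0,1\}} \left( \frac{x}{t} - \frac{1 - x}{1 - t} \right) - \min_{x \in \{0,1\}} \left( \frac{x}{t} - \frac{1 - x}{1 - t} \right) \\
&= \frac{1}{t} + \frac{1}{1 - t} = \frac{1}{t(1 - t)}.
\end{align*}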

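2. The parameter is in the variable $T$, which has no parents; $\mathrm{Pa}_T = \emptyset$. According to the corollary, it is required to choose $\delta$ such that

\[
P_{T,R,S}(t, t, f) - 10 P_{F,R,S}(t, t, f) \geq \delta(-\lambda_T + 10\lambda_F),
\]

where
\[
\lambda_T = \frac{P_{R,S,T}(t, f, t)}{\theta_0}
\]
since $P_{R,S,T,T}(t, f, t, f) = 0$, and

\[
\lambda_F = \frac{P_{F,R,S,T}(t, t, f, t)}{\theta_0} - \frac{P_{F,R,S,T}(t, t, f, f)}{1 - \theta_0}.
\]

The probabilities are obtained by summation:

\begin{align*}
P_{R,S,T}(t, f, t) &= P_T(t) \sum_{x_f} P_F(x_f) P_{S \mid F}(f \mid x_f) \sum_{x_a} P_{A \mid T,F}(x_a \mid t, x_f) \sum_{x_l} P_{L \mid A}(x_l \mid x_a) P_{R \mid L}(t \mid x_l), \\
P_{F,R,S}(t, t, f) &= P_F(t) P_{S \mid F}(f \mid t) \sum_{x_t} P_T(x_t) \sum_{x_a} P_{A \mid T,F}(x_a \mid x_t, t) \sum_{x_l} P_{L \mid A}(x_l \mid x_a) P_{R \mid L}(t \mid x_l), \\
P_{F,R,S,T}(t, t, f, t) &= \theta_0 P_F(t) P_{S \mid F}(f \mid t) \sum_{x_a} P_{A \mid F,T}(x_a \mid t, t) \sum_{x_l} P_{L \mid A}(x_l \mid x_a) P_{R \mid L}(t \mid x_l), \\
P_{F,R,S,T}(t, t, f, f) &= (1 - \theta_0) P_F(t) P_{S \mid F}(f \mid t) \sum_{x_a} P_{A \mid F,T}(x_a \mid t, f) \sum_{x_l} P_{L \mid A}(x_l \mid x_a) P_{R \mid L}(t \mid x_l).
\end{align*}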

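3. (a)
\[
P_{Y \mid E}(y_0 \mid e) = \frac{P_{Y,E}(y_0, e)}{P_E(e)} = \frac{P_Y(y_0) \sum_{i=0}^{1} P_X(x_i) P_{E \mid X,Y}(e \mid x_i, y_0)}{\sum_{i,j=0}^{1} P_Y(y_j) P_X(x_i) P_{E \mid X,Y}(e \mid x_i, y_j)},
\]
so that
\[
P_{Y \mid E}(y_0 \mid e) = \frac{(\alpha - \beta)\theta_x \theta_y + \theta_y \beta}{2\theta_x \theta_y (\alpha - \beta) + (\beta - \alpha)(\theta_x + \theta_y) + \alpha},
\]
\[
P_{Y \mid E}(y_1 \mid e) = \frac{(\alpha - \beta)\theta_x \theta_y + \alpha + \beta\theta_x - \alpha(\theta_x + \theta_y)}{2\theta_x \theta_y (\alpha - \beta) + (\beta - \alpha)(\theta_x + \theta_y) + \alpha}.
\]

Hence
\[
\ln \frac{P_{Y \mid E}(y_0 \mid e)}{P_{Y \mid E}(y_1 \mid e)} = \ln \left( (\alpha - \beta)\theta_x \theta_y + \theta_y \beta \right) - \ln \left( (\alpha - \beta)\theta_x \theta_y + \alpha + \beta\theta_x - \alpha(\theta_x + \theta_y) \right)
\]
and

\begin{align*}
\frac{\partial}{\partial \theta_x} \ln \frac{P_{Y \mid E}(y_0 \mid e)}{P_{Y \mid E}(y_1 \mid e)} &= \frac{\alpha - \beta}{(\alpha - \beta)\theta_x + \beta} + \frac{\alpha - \beta}{\alpha - (\alpha - \beta)\theta_x} \\
&= \frac{1}{\theta_x + \frac{\beta}{\alpha - \beta}} + \frac{1}{\frac{\alpha}{\alpha - \beta} - \theta_x}.
\end{align*}

Set $\tilde\theta_x = \frac{\beta}{\alpha - \beta} + \theta_x$; then

\[
\frac{\partial}{\partial \theta_x} \ln \frac{P_{Y \mid E}(y_0 \mid e)}{P_{Y \mid E}(y_1 \mid e)} = \frac{1}{\tilde\theta_x} + \frac{1}{1 + \frac{2\beta}{\alpha - \beta} - \tilde\theta_x}.
\]

Clearly, if $\beta > 0$,

\[
\left| \frac{\partial}{\partial \theta_x} \ln \frac{P_{Y \mid E}(y_0 \mid e)}{P_{Y \mid E}(y_1 \mid e)} \right| < \frac{1}{\theta_x (1 - \theta_x)},
\]
while for $\beta = 0$ the bound is attained.

(b)
\[
O_s(y_0 \mid e) = \frac{(\alpha - \beta) s \theta_y + \theta_y \beta}{(\alpha - \beta) s \theta_y + \alpha + \beta s - \alpha(s + \theta_y)}
\]
so that
\begin{align*}
\frac{O_s(y_0 \mid e)}{O_t(y_0 \mid e)} &= \left( \frac{(\alpha - \beta) s + \beta}{(\alpha - \beta) t + \beta} \right) \left( \frac{\alpha - (\alpha - \beta) t}{\alpha - (\alpha - \beta) s} \right) \\
&= \left( \frac{s + \frac{\beta}{\alpha - \beta}}{t + \frac{\beta}{\alpha - \beta}} \right) \left( \frac{\frac{\alpha}{\alpha - \beta} - t}{\frac{\alpha}{\alpha - \beta} - s} \right).
\end{align*}

For $s < t$, clearly

\[
1 \geq \frac{O_s(y_0 \mid e)}{O_t(y_0 \mid e)} \geq \left( \frac{s}{t} \right) \left( \frac{1 - t}{1 - s} \right)
\]
as required.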

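4. (a)

\begin{align*}
W(A : E \mid B) &= \ln \frac{O_P(A \mid BE)}{O_P(A \mid B)} = \ln \frac{P(A \mid BE) \, P(A^c \mid B)}{P(A^c \mid BE) \, P(A \mid B)} \\
&= \ln \frac{P(ABE) P(BE)}{P(BE) P(A^c BE)} \cdot \frac{P(A^c B) P(B)}{P(B) P(AB)} = \ln \frac{P(ABE) P(A^c B)}{P(A^c BE) P(AB)} = \ln \frac{P(E \mid AB)}{P(E \mid A^c B)}.
\end{align*}

(b)

\begin{align*}
W(H_1 / H_2 : E \mid G) &= \ln \frac{O_P(H_1 / H_2 \mid GE)}{O_P(H_1 / H_2 \mid G)} = \ln \frac{P(H_1 \mid GE) P(H_2 \mid G)}{P(H_2 \mid GE) P(H_1 \mid G)} \\
&= \ln \frac{P(H_1 GE) P(GE) P(H_2 G) P(G)}{P(GE) P(H_2 GE) P(G) P(H_1 G)} = \ln \frac{P(E \mid H_1 G)}{P(E \mid H_2 G)}.
\end{align*}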
Chapter 14

Structure Learning

14.1 Introduction
This chapter considers the problem of learning the structure of a DAG corresponding to a Bayesian network for a random (row) vector $X = (X_1, \ldots, X_d)$ when presented with an $n \times d$ data matrix $\mathbf{x}$, considered as an instantiation of a random matrix

\[
\mathbf{X} = \begin{pmatrix} X_{1.} \\ \vdots \\ X_{n.} \end{pmatrix}
\]

where $X_{1.}, \ldots, X_{n.}$ is a collection of independent identically distributed random vectors, each with the same distribution as $X$. The notation $X_{j.}$ means $(X_{j1}, \ldots, X_{jd})$ for $j = 1, \ldots, n$.
Methods available fall into two main categories: search-and-score techniques, where a score function is used and the algorithm attempts to find the structure that maximises the score function, and constraint-based methods, where conditional independence tests are carried out and the independence relations thus established provide constraints, limiting the edges that can be added. Hybrid algorithms, which use features from both constraint-based and search-and-score methods, form a third category.
The aim of this chapter is to give a broad introduction and describe some of the search-and-score algorithms. Constraint-based algorithms will be dealt with in considerably more detail in Chapter 16, while Markov chain Monte Carlo (MCMC), the most popular search-and-score approach, will be dealt with in Chapter 18.
The straightforward approach of maximising the likelihood, or a posterior distribution, over graph structures leads to a problem that, at first glance, may appear fairly straightforward. There is a finite number of different possible DAGs $G = (V, D)$ with $d$ nodes. In general, though, testing all possible structures is not computationally feasible. This is because the number of possible DAGs grows super-exponentially in the number of nodes. In [118], Robinson gave the following recursive function for computing the number $N(d)$ of acyclic directed graphs with $d$ nodes:

\[
N(d) = \sum_{i=1}^{d} (-1)^{i+1} \binom{d}{i} 2^{i(d-i)} N(d - i), \qquad N(0) = 1. \tag{14.1}
\]

For $d = 5$ it is 29281 and for $d = 10$ it is approximately $4.2 \times 10^{18}$. Thus $N(d)$ is a very large number, even for small values of $d$, and it is clearly not feasible to evaluate every candidate structure, even for modest values of $d$.
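Robinson's recursion is easy to evaluate directly, which makes the super-exponential growth concrete. The following is a minimal sketch, assuming the recursion with exponent $i(d-i)$ and the starting value $N(0) = 1$; the function name `num_dags` is chosen here for illustration.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def num_dags(d):
    """Number of labelled DAGs on d nodes, via Robinson's recursion."""
    if d == 0:
        return 1  # starting value: the empty graph
    return sum((-1) ** (i + 1) * comb(d, i) * 2 ** (i * (d - i)) * num_dags(d - i)
               for i in range(1, d + 1))


# Exact integer counts for small numbers of nodes.
counts = {d: num_dags(d) for d in range(1, 11)}
```

The recursion itself is cheap to evaluate (Python integers do not overflow); what is infeasible is visiting each of the `counts[d]` structures in turn.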

14.2 Distance Measures


When measuring distance, there are two criteria of interest. Firstly, the graph alone can be considered: a distance measure between graphs will simply compare the numbers of edges and their orientations between graphs. Secondly, the differences between the probability distributions, estimated from data and factorised along the graph, may be considered.

14.2.1 Structural Hamming Distance


This is a distance measure that simply measures the distance between graphs. In the context of fitting a Bayesian network, structures that are Markov equivalent should be considered equal, since only the Markov equivalence class can be obtained from data. The Structural Hamming Distance between two DAGs is defined as follows.

Definition 14.1 (Structural Hamming Distance). The Structural Hamming Distance between two DAGs $G_1 = (V, D_1)$ and $G_2 = (V, D_2)$ is defined as

SHD(D1, D2) = (number of edges that have to be added to D1)
            + (number of edges that have to be deleted from D1)
            + (number of edges in D1 that have to have their direction changed)

to obtain D2.

The Structural Hamming Distance between two essential graphs $G_1 = (V, E_1)$ and $G_2 = (V, E_2)$ is defined as

\[
SHD_{ess}(E_1, E_2) = \min_{D_1 \in \mathcal{E}_1, \, D_2 \in \mathcal{E}_2} SHD(D_1, D_2)
\]

where $\mathcal{E}_1$ is the set of DAGs within the Markov equivalence class of $E_1$ and $\mathcal{E}_2$ is the set of DAGs within the Markov equivalence class of $E_2$. $D_1$ and $D_2$ are the edge sets for directed acyclic graphs chosen from the equivalence classes $\mathcal{E}_1$ and $\mathcal{E}_2$ respectively.

The SHD is a distance measure, or metric, in the sense that it satisfies the definition of a distance or metric. That is, it satisfies:

• SHD(E1, E2) ≥ 0 ∀E1, E2,

• SHD(E1, E2) = 0 ⇔ E1 = E2,

• SHD(E1, E2) = SHD(E2, E1) ∀E1, E2,

• SHD(E1, E3) ≤ SHD(E1, E2) + SHD(E2, E3) ∀E1, E2, E3.

[Figure 14.1: Essential graph G1, on the nodes A, B, C, D, E, F]

[Figure 14.2: Essential graph G2, on the nodes A, B, C, D, E, F]

The Structural Hamming Distance measures the distance between two essential graphs, but if comparison is being made between a `fitted' graph and a `true' graph, the SHD distance measure does not distinguish between `false positives' (edges in the fitted graph that are not in the true graph) and `false negatives' (edges not present in the fitted graph that are present in the true graph).
The distance thus defined between the two graphs in Figures 14.1 and 14.2 is 1, since there is a valid orientation of the edges in Figure 14.1 where all except C − E (which is not present) have the same orientation as the edges in Figure 14.2.
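For two fixed DAGs, the count in Definition 14.1 can be sketched as follows. This is an illustrative implementation only: edges are represented as (parent, child) pairs, the function name and the two example edge sets are hypothetical, and the minimisation over Markov equivalence classes required for essential graphs is not attempted.

```python
def shd(edges_1, edges_2):
    """Structural Hamming Distance between two DAG edge sets of (parent, child) pairs."""
    d1, d2 = set(edges_1), set(edges_2)
    skeleton_1 = {frozenset(edge) for edge in d1}
    skeleton_2 = {frozenset(edge) for edge in d2}

    additions = len(skeleton_2 - skeleton_1)   # adjacencies only in the second graph
    deletions = len(skeleton_1 - skeleton_2)   # adjacencies only in the first graph
    reversals = sum(1 for (a, b) in d1 if (b, a) in d2)  # shared adjacency, opposite direction
    return additions + deletions + reversals


# Hypothetical example: one reversal (A-B) and one added edge (C-D).
dag_1 = [("A", "B"), ("B", "C")]
dag_2 = [("B", "A"), ("B", "C"), ("C", "D")]
distance = shd(dag_1, dag_2)
```

Because additions and deletions swap roles when the arguments are exchanged, the function is symmetric in its two arguments, in line with the metric properties listed above.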

14.2.2 Sensitivity and Specificity


Rough measures of `goodness of fit', when comparing the skeleton of a fitted graph with that of a true graph, are the sensitivity and specificity. Sensitivity, or True Positive Rate (TPR), is defined as follows:

\[
TPR = \frac{\text{number of edges correctly identified}}{\text{number of edges correctly identified} + \text{number of edges falsely rejected}}. \tag{14.2}
\]
The specificity, SPC, is defined as
\[
SPC = \frac{\text{number of edges correctly rejected}}{\text{number of edges correctly rejected} + \text{number of edges wrongly included}}. \tag{14.3}
\]
The TPR measure is useful, but for the relatively sparse graphs in view for genetics data, where the parent / child sets are limited, the SPC measure is not so useful, unless it is modified. For $d$ variables, there are $\binom{d}{2} = d(d-1)/2$ possible edges to consider for the skeleton, quadratic in the number of variables. For sparse graphs, with a large number of nodes, the specificity measure will always be approximately 1 for an algorithm with a tendency to wrongly reject edges rather than wrongly include edges. The following definitions for sensitivity and specificity are therefore more convenient; the sensitivity corresponds to the usual definition, while the specificity is modified. The following definitions are proposed for sparse graphs:

Definition 14.2 (Sensitivity and Specificity). For the construction of the skeleton, the sensitivity is defined as
\[
TPR = \frac{\text{number of edges correctly identified}}{\text{number of edges correctly identified} + \text{number falsely rejected}} \tag{14.4}
\]
while the proposed definition for specificity is

\[
SPC = \frac{\text{total number of edges in the skeleton}}{\text{total number of edges in the skeleton} + \text{number of edges wrongly included}}. \tag{14.5}
\]

Equation (14.5) is not the standard definition of specificity, but if the value is close to 1, it implies that the rate of wrong inclusion is insignificant, rather than that the graph is large and sparse, which would lead to a value close to 1 using the definition in Equation (14.3) even if the number of edges wrongly included is large compared with the total number of edges in the true graph.
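The difference between the standard and the proposed specificity shows up clearly on a sparse example. The sketch below makes two assumptions that are not stated in the text: "the skeleton" in Equation (14.5) is read as the true skeleton, and the 20-node path graph with its fitted counterpart is invented for illustration.

```python
from itertools import combinations


def skeleton_rates(nodes, true_edges, fitted_edges):
    """Return (TPR, standard SPC as in (14.3), sparse-graph SPC as in (14.5))."""
    true_s = {frozenset(edge) for edge in true_edges}
    fitted_s = {frozenset(edge) for edge in fitted_edges}
    all_pairs = {frozenset(pair) for pair in combinations(nodes, 2)}

    correctly_identified = len(true_s & fitted_s)
    falsely_rejected = len(true_s - fitted_s)
    wrongly_included = len(fitted_s - true_s)
    correctly_rejected = len(all_pairs - true_s - fitted_s)

    tpr = correctly_identified / (correctly_identified + falsely_rejected)
    spc_standard = correctly_rejected / (correctly_rejected + wrongly_included)
    # Assumption: "total number of edges in the skeleton" means the true skeleton.
    spc_sparse = len(true_s) / (len(true_s) + wrongly_included)
    return tpr, spc_standard, spc_sparse


nodes = list(range(20))
true_edges = [(i, i + 1) for i in range(19)]                         # sparse path graph
fitted_edges = true_edges[:15] + [(i, i + 2) for i in range(10)]     # 10 spurious edges
tpr, spc_standard, spc_sparse = skeleton_rates(nodes, true_edges, fitted_edges)
```

With 10 spurious edges against 19 true ones, the standard specificity still comes out above 0.94 because of the many absent pairs, whereas the sparse-graph version drops to about 0.66.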

14.2.3 The Kullback Leibler Divergence


The Kullback-Leibler Divergence may be used as the basis of measuring the distance between two DAGs over $d$ variables, with respect to a data set. Recall the definition of the Kullback-Leibler Divergence between two probability functions $p$ and $q$, each defined over the same state space $\mathcal{X}$:

\[
D_{KL}(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}.
\]

In view here is the divergence between the factorisation over a true directed acyclic graph $G_1 = (V, D_1)$ and a fitted directed acyclic graph $G_2 = (V, D_2)$. Let $\hat p_1$ and $\hat p_2$ denote the fitted probability distributions from the data, according to the factorisations along $G_1$ and $G_2$ respectively. The fitted distribution $\hat p$ is the same for each directed acyclic graph within the Markov equivalence class of an essential graph.
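As a small illustration of the definition, the divergence between two probability functions given as lists over a common state space can be computed as below. The handling of zero probabilities (terms with p(x) = 0 contribute nothing; q(x) = 0 with p(x) > 0 gives an infinite divergence) follows the usual conventions and is an assumption here, as are the example distributions.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) with natural logarithms."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue             # usual convention: 0 * log(0 / q) = 0
        if qx == 0.0:
            return math.inf      # p puts mass where q has none
        total += px * math.log(px / qx)
    return total


p_hat_1 = [0.5, 0.5]   # hypothetical fitted distribution along G1
p_hat_2 = [0.9, 0.1]   # hypothetical fitted distribution along G2
divergence = kl_divergence(p_hat_1, p_hat_2)
reverse = kl_divergence(p_hat_2, p_hat_1)
```

The two directions generally give different values, which is why the Kullback-Leibler divergence is described as a divergence rather than a metric.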

14.3 Search and Score Algorithms


For a Bayesian network with a directed acyclic graph G = (V, D), the edge set D is often referred to as
the structure of the network. Let D̃ denote the set of all possible edge sets that give a directed acyclic
graph with node set V and suppose that D is unknown and has to be inferred only from the n × d data
matrix x.
For a given structure, the prior distribution over each parameter vector $\theta_{j.l}$ is taken from the family $\mathrm{Dir}(\alpha_{j1l}, \ldots, \alpha_{j k_j l})$ for all nodes and parent configurations $(j, l)$. There is a prior distribution $p_D$ over the collection of possible structures $\tilde D$, which is the probability function for a random variable $D$ taking values in $\tilde D$.
The Prior Distribution for the Graph Structure There are several possible ways of constructing a prior distribution $p_D$. If it is known a priori that the graph structure lies within a subset $A \subseteq \tilde D$, then an obvious choice is the uniform prior over $A$;

\[
P_D(D) = \begin{cases} \dfrac{1}{|A|} & \text{if } D \in A \\ 0 & \text{otherwise} \end{cases}
\]
where $|A|$ is the number of elements in a subset $A \subseteq \tilde D$.
The Bayesian selection rule for a graph $G = (V, D)$ uses the graph which maximises the posterior probability

\[
P_{D \mid X}(D \mid x) = \frac{P_{X \mid D}(x \mid D) P_D(D)}{P_X(x)}. \tag{14.6}
\]
The task is then, for a given $x$, to find the $D$ that maximises the Bayesian Dirichlet score function

\[
S(D) = P_{X \mid D}(x \mid D) P_D(D), \tag{14.7}
\]

where $P_D$ is the prior probability over the space of edge sets. The prior odds ratio for two different edge sets $D_1$ and $D_2$ is defined as $\frac{P_D(D_1)}{P_D(D_2)}$ and the posterior odds ratio is defined as $\frac{P_{D \mid X}(D_1 \mid x)}{P_{D \mid X}(D_2 \mid x)}$. Equation (14.6) may then be expressed as

\[
\text{Posterior odds} = \text{Likelihood ratio} \times \text{Prior odds} = \frac{S(D_1)}{S(D_2)}.
\]
Using factorisations along the relevant graphs, the computation of a ratio, rather than simply computing each score function, is sometimes easier if the two graphs have some part of the structure in common.
Computing the posterior distribution is an NP-hard problem; Cooper [30] proves that the inference problem is NP-hard, which means that it is at least as hard as any problem in NP. This is discussed in [26]. Koivisto and Sood [75] (2004) constructed the first algorithm with a complexity less than super-exponential for finding the posterior probability of a network, at the expense of limiting the maximum number of parents for each variable; the run time is $O(d 2^d + d^{k+1} C(n))$, where $d$ is the number of nodes, $k$ is the maximum in-degree permitted and $C(n)$ is the cost of computing a single local conditional marginal likelihood for $n$ instantiations.

Aside: P, NP and NP Hard Problems A problem is assigned to the NP (non-deterministic polynomial time) class if it is solvable in polynomial time by a non-deterministic Turing machine. (A non-deterministic Turing machine is a `parallel' Turing machine which can take many computational paths simultaneously, with the restriction that the parallel Turing machines cannot communicate.) A P-problem (whose solution time is bounded by a polynomial) is always also NP. If a problem is known to be NP, and a solution to the problem is somehow known, then demonstrating the correctness of the solution can always be reduced to a single P (polynomial time) verification. A problem is NP-hard if an algorithm for solving it can be translated into one for solving any other NP (non-deterministic polynomial time) problem. NP-hard therefore means `at least as hard as any NP-problem' although it might, in fact, be harder.

14.3.1 Score Functions


For a given data matrix $x$, one example of a score function is simply a function proportional to the posterior probability given, for example, by Equation (14.7). It is often considered that this score function gives too much preference to graphs with a large number of edges.

AIC and BIC Score Functions One standard score function is the Akaike Information Criterion (AIC), defined as:

\[
AIC(D) = -2 \log L(D \mid x) + 2|\theta| \tag{14.8}
\]

where $|\theta| := \sum_{j=1}^{d} q_j (k_j - 1)$ denotes the number of parameters required to define the network and $L(D \mid x)$ is the Cooper-Herskovits likelihood given by Equation (12.15). The Bayesian Information Criterion (BIC) is similar, but uses the penalty $(\log n)|\theta|$ in place of $2|\theta|$:

\[
BIC(D) = -2 \log L(D \mid x) + (\log n)|\theta|. \tag{14.9}
\]
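The two criteria differ only in the penalty term, as the following sketch shows. The log-likelihood value, the state counts `k`, the parent configuration counts `q` and the sample size are all hypothetical numbers used for illustration; in practice the log-likelihood would come from Equation (12.15).

```python
import math


def num_parameters(k, q):
    """|theta| = sum_j q_j (k_j - 1): k_j states and q_j parent configurations per node."""
    return sum(q_j * (k_j - 1) for k_j, q_j in zip(k, q))


def aic(log_likelihood, n_params):
    """Akaike Information Criterion, Equation (14.8)."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n):
    """Bayesian Information Criterion, Equation (14.9), for n data rows."""
    return -2.0 * log_likelihood + math.log(n) * n_params


k = [2, 2, 3]            # hypothetical numbers of states of X1, X2, X3
q = [1, 2, 4]            # hypothetical numbers of parent configurations
n_params = num_parameters(k, q)
log_likelihood = -100.0  # hypothetical value of log L(D | x)

aic_value = aic(log_likelihood, n_params)
bic_value = bic(log_likelihood, n_params, n=50)
```

For any sample size with log n > 2, that is n ≥ 8, the BIC penalty per parameter exceeds the AIC penalty, so BIC favours sparser structures.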

The BDeu Score The BDeu score was introduced by Heckerman, Geiger and Chickering [62]. The BD score is simply the score function given by Equation (14.7), the posterior probability over directed acyclic graphs, assuming that the variables each have a multinomial distribution. The BDeu score uses a uniform prior over graph structures, so that the posterior distribution is proportional to the likelihood, and then multiplies by a factor that penalises according to the number of edges where the graph differs from some `target' graph, based on prior information. The BDeu score function for a directed acyclic graph, based on the data, is defined as follows.

Definition 14.3 (BD, BDeu Score Function).

\[
S(D; x) = \kappa^{\delta(D)} \prod_{j=1}^{d} \prod_{l=1}^{q_j} \frac{\Gamma\left( \sum_{i=1}^{k_j} \alpha_{jil} \right)}{\Gamma\left( n(\pi_j^l) + \sum_{i=1}^{k_j} \alpha_{jil} \right)} \prod_{i=1}^{k_j} \frac{\Gamma\left( n(x_j^i \mid \pi_j^l) + \alpha_{jil} \right)}{\Gamma(\alpha_{jil})}, \tag{14.10}
\]

where $x$ denotes the $n \times d$ data matrix of $n$ independent instantiations of the $d$ variables in the variable set $V$, $D$ denotes the edge set for the directed acyclic graph $G = (V, D)$, $\kappa$ is a number $0 < \kappa \leq 1$, and $\delta(D)$ denotes the number of edges in $D$ that differ from those in a `target' graph, a graph that is a priori considered most likely, based on prior information.
The BD score function is the BDeu score function with $\kappa = 1$.

When the aim is to construct a graph representing the dependence relations in the data, with as few edges as possible, $\delta(D)$ simply counts the number of edges in the edge set $D$.

Definition 14.4 (Prior Sample Size). The prior sample size is defined as the quantity
\[
\tilde n = \sum_{jil} \alpha_{jil}.
\]

The quantity $\tilde n = \sum_{jil} \alpha_{jil}$ is considered to be the weight attached to the prior assessment. Loosely speaking, it is the `number' of observations on which the prior is based.

For the BDeu score function, the value $\kappa = \frac{1}{1 + n + \tilde n}$ is often chosen. Note that, with this choice of $\kappa$, the BDeu and BIC are similar; the BIC penalty is the number of parameters, while the BDeu penalty is the number of edges.
14.3.2 Sparse Candidate Algorithm


The discussion now moves on to a selection of search and score algorithms. The first of these is the
sparse candidate algorithm, which was developed by Friedman, Nachman and Pe'er (1999) [45] and
used for analysis of genetic expression data in Friedman et al. (2000) [46]. The main idea of the
technique is to identify a relatively small number of candidate parents for each variable. This is based
on simple local statistics, such as correlation. Attention is then restricted to networks in which the
parent set is a subset of the candidate parent set.
The algorithm proceeds as follows: let $D_n$ denote the DAG chosen at iteration n and let $\mathrm{Pa}_i^{(n)}$ denote
the parent set for variable $X_i$ in $D_n$.
• For i = 1, . . . , d, choose the candidate set $C_i^{(n)} = \{Y_1, \ldots, Y_k\}$ of candidate variables for $\mathrm{Pa}_i^{(n)}$, the
parent set for variable $X_i$. The set $C_i^{(n)}$ is chosen as $\mathrm{Pa}_i^{(n-1)}$ together with children and parents
of children of $X_i$ in $D_n$, and all those variables $Y \notin MB(X_i)$ such that the score

\[
\sum_{(x,y,z) \in \mathcal{X}_{X_i} \times \mathcal{X}_Y \times \mathcal{X}_{\mathrm{Pa}_i^{(n-1)}}}
n_{X_i, Y, \mathrm{Pa}_i^{(n-1)}}(x, y, z)
\ln \frac{n_{X_i, Y, \mathrm{Pa}_i^{(n-1)}}(x, y, z)\, n_{\mathrm{Pa}_i^{(n-1)}}(z)}
{n_{Y, \mathrm{Pa}_i^{(n-1)}}(y, z)\, n_{X_i, \mathrm{Pa}_i^{(n-1)}}(x, z)}
\]

is sufficiently high. Here MB denotes Markov blanket (parents, children and parents of children).
Also, for a set W , $n_W(w)$ denotes the number of appearances of configuration w in the data
matrix x. If the test statistic is low, it supports $X_i \perp Y \mid \mathrm{Pa}_i^{(n-1)}$ and hence Y is not a candidate
parent.
There are other ways of determining the candidate parents; anything in the current Markov blanket
not d-separated from the variable by the Markov blanket should be included as a candidate
parent.

• Find a high scoring network $D_n$ where $\mathrm{Pa}_i^{D_n} \subset C_i^{(n)}$ for i = 1, . . . , d.
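The candidate-selection score above is a conditional mutual information computed from counts. The sketch below evaluates it in base R for three discrete vectors; the function name `cmi_statistic` is hypothetical, and a multi-variable parent set is assumed to have been collapsed into a single vector `z` beforehand.

```r
# Count-based statistic
#   sum_{x,y,z} n(x,y,z) * ln( n(x,y,z) n(z) / (n(y,z) n(x,z)) )
# for discrete vectors x, y and a conditioning vector z.
cmi_statistic <- function(x, y, z) {
  n_xyz <- table(x, y, z)
  n_z   <- as.vector(apply(n_xyz, 3, sum))
  n_yz  <- apply(n_xyz, c(2, 3), sum)
  n_xz  <- apply(n_xyz, c(1, 3), sum)
  total <- 0
  for (a in seq_len(dim(n_xyz)[1]))
    for (b in seq_len(dim(n_xyz)[2]))
      for (k in seq_len(dim(n_xyz)[3])) {
        n <- n_xyz[a, b, k]
        if (n > 0)   # empty cells contribute nothing
          total <- total + n * log(n * n_z[k] / (n_yz[b, k] * n_xz[a, k]))
      }
  total
}

# Toy data: y_dep copies x, y_ind is balanced against x within each level of z.
x     <- c(0, 0, 1, 1, 0, 0, 1, 1)
z     <- c(0, 0, 0, 0, 1, 1, 1, 1)
y_dep <- x
y_ind <- c(0, 1, 0, 1, 0, 1, 0, 1)
print(c(dependent = cmi_statistic(x, y_dep, z),
        independent = cmi_statistic(x, y_ind, z)))
```

A high value argues for keeping the variable as a candidate parent; the threshold for "sufficiently high" is left to the user.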

Optimal Reinsertion The optimal reinsertion algorithm, introduced by A. Moore and W-K. Wong
(2003) [96], is a search-and-score algorithm that works along the following lines: at each step a
target node is chosen, all edges entering or leaving the target are deleted, the optimal combination
of in-edges and out-edges is found, and the node is re-inserted with these edges. This involves searching
through the legal candidate parent sets and, for each candidate parent set, the legal child sets.
Optimal reinsertion may be combined with the sparse candidate algorithm.

14.3.3 Greedy Search and Greedy Equivalence Search


The Greedy Search was introduced by Meek (1997) in his Ph.D. thesis and correctness was proved by
Chickering (2002) [25] under the assumption that there was a DAG faithful to the probability distri-
bution. It works along the following lines to produce a DAG, along which the probability distribution
factorises, starting from the graph with no edges:

• Forward phase Let $E_0$ denote the graph with no edges. Let $E_n$ denote the essential graph from
stage n of the forward phase. Consider all possible DAGs within the Markov equivalence class of $E_n$,
all possible DAGs obtained by adding exactly one edge to a DAG from this equivalence class, and
the set of essential graphs corresponding to this collection of DAGs. Let $E_{n+1}$ denote
the essential graph with the highest score if it has a higher score than $E_n$, and continue to forward
phase stage n + 1. Otherwise, terminate the forward phase, with output $E_n$.

• Backward phase Let $\tilde{E}_0$ denote the output graph from the forward phase. Let $\tilde{E}_n$ denote the
output graph from stage n of the backward phase. Consider all possible DAGs corresponding to
the equivalence class $\tilde{E}_n$, all possible DAGs formed by an edge deletion from these DAGs, and
the set of essential graphs corresponding to this collection of DAGs. Let $\tilde{E}_{n+1}$ denote the
essential graph with the highest score if it is higher than that for $\tilde{E}_n$, and continue to backward
phase stage n + 1. Otherwise terminate; $\tilde{E}_n$ is the output of the backward phase and of the greedy
equivalence search algorithm.

After the forward and backward phase, this algorithm is guaranteed to return an optimal structure
provided there exists a faithful DAG. The faithfulness assumption may be relaxed; the algorithm re-
turns a suitable structure provided the weaker composition condition holds (compositional graphoid,
Equation (2.1.1)). The compositional axiom is essential for the algorithm to return the correct graph.

The necessity of composition is clear from the three variable example, where Y1 , Y2 , Y3 are independent
binary variables with P(Yi = 1) = P(Yi = 0) = 1/2, and X1 = 1(Y2 = Y3 ), X2 = 1(Y1 = Y3 ), X3 = 1(Y1 = Y2 ). Since
X1 ⊥ X2 , X1 ⊥ X3 and X2 ⊥ X3 , adding a single edge to the empty graph will not increase the score.
The algorithm will therefore terminate after the first step of the forward phase and return the empty
graph.
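The pairwise independence claimed in this example can be checked by enumerating the eight equally likely outcomes of (Y1, Y2, Y3). The base R sketch below does this; the helper name `pair_indep` is illustrative only.

```r
# Enumerate the 8 equally likely outcomes of (Y1, Y2, Y3) and build X1, X2, X3.
y <- expand.grid(y1 = 0:1, y2 = 0:1, y3 = 0:1)
x <- data.frame(x1 = as.integer(y$y2 == y$y3),
                x2 = as.integer(y$y1 == y$y3),
                x3 = as.integer(y$y1 == y$y2))

# TRUE when the joint table of (a, b) equals the product of its margins.
pair_indep <- function(a, b) {
  joint <- table(a, b) / length(a)
  isTRUE(all.equal(as.vector(joint),
                   as.vector(outer(rowSums(joint), colSums(joint)))))
}

print(c(pair_indep(x$x1, x$x2), pair_indep(x$x1, x$x3), pair_indep(x$x2, x$x3)))

# The three variables are nevertheless jointly dependent:
# P(X1 = 1, X2 = 1, X3 = 1) is 1/4 rather than 1/8.
print(mean(x$x1 == 1 & x$x2 == 1 & x$x3 == 1))
```

Each Xi is independent of each of the other two separately but not of the pair jointly, which is exactly a failure of composition.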

Notes The Cooper-Herskovits likelihood was introduced by Cooper and Herskovits in [31]. In [30],
Cooper proves that the inference problem for structure learning is NP-hard. In [26], Chickering,
Heckerman and Meek prove, under some assumptions, that identifying high scoring structures in
search-and-score algorithms is NP-hard. Koivisto and Sood [75] (2004) constructed the first algorithm
that had a complexity less than super-exponential for finding the posterior probability of a network.
The Chow-Liu tree is taken from [28] (1969). The K2 algorithm is by Cooper and Herskovits [31]
(1992). The robotics example is due to E. Lazkano, B. Sierra, A. Astigarraga, and J.M. Martínez-Otzeta
[79] (2007). The maximum minimum hill climbing algorithm is found in [137]. The Markov chain
Monte Carlo model composition algorithm, known as $MC^3$, and the augmented Markov chain Monte
Carlo model composition ($AMC^3$) algorithm were introduced by Madigan and York [89] in 1995 and
Madigan, Andersson, Perlman and Volinsky [88] in 1997.
14.4 Exercises
These exercises should be carried out using R. The bnlearn package may be useful.

1. Chow - Liu Tree Generate three columns, c1, c2 and c3, each containing independent random
samples of 50 Be(1/2) observations. Here Be(1/2) means Bernoulli trials, returning 0 with
probability 1/2 and 1 with probability 1/2. Let c4 = c1 + c2 and let c5 = c3 + c4. Implement the
Kruskal algorithm on the variables c1, c2, c3, c4, c5 and see which edges are chosen.

2. Chow - Liu Tree Download the data set from the URL address
http://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.data
A description of the data is found at the address
http://archive.ics.uci.edu/ml/datasets/Zoo
The data set presents attributes of various animals; hair type, feather type, egg type, milk type,
whether it is airborne, aquatic, a predator, whether or not it has teeth, a backbone, whether
it breathes, or is venomous, has fins, legs, tail, or is domestic, or catsize. The last variable is a
classification of the type of animal.

(a) Compute the estimated probability distribution for all the variables except for the `class'
variable, assuming that they are independent. What is the Kullback-Leibler distance
between the empirical distribution and the estimate using the independence model?
(b) Perform Kruskal's algorithm, to determine the optimal Chow-Liu tree. Use the data
from all the variables, except for the class variable, to construct a single Chow-Liu tree.
Calculate the estimated probability distribution, assuming that the distribution factorises
according to the Chow-Liu tree. Calculate the Kullback-Leibler distance between this
estimate and the empirical probability distribution.
Note You have to specify a root for the Chow-Liu tree. This determines the directions of
the arrows. All possible Chow-Liu trees from the same skeleton are Markov equivalent.
(c) Classification See how the Chow-Liu tree performs for classification. Compute the
classifier using the data and then use the classifier to predict the classes of the same data
set. Such a procedure is not so satisfactory; different data should be used for training and
classification.
(d) Perform an MMPC algorithm on the zoo data, using a nominal significance level of 0.05.
Compare the result with the Chow-Liu tree from which those edges that fail the significance
test at the 0.05 level have been removed.

The following R code solves the Chow-Liu tree problem. The `50 warnings' basically come from zero
divided by zero problems. The code should be modified (by adding on a small value such as 0.01 to
each cell) to prevent this. The classifier works reasonably well in any case.

> library("bnlearn")
> zoo <- read.csv("~/data/zoo.data", header=F)


> colnames(zoo) <- c("animal name", "hair", "feathers", "eggs","milk","airborne","aquatic",


+ "predator","toothed","backbone","breathes","venomous","fins", "legs","tail
> for(i in 2:ncol(zoo)) zoo[,i] <- factor(zoo[,i]) # conversion to factors
> s <- sample(100,70)
> trainingdata <- zoo[s,]
> testdata <- zoo[-s,]
> res <- chow.liu(trainingdata[,-c(1,18)]) # learning the structure
> print(res)

Bayesian network learned via Pairwise Mutual Information methods

model:
[undirected graph]
nodes: 16
arcs: 15
undirected arcs: 15
directed arcs: 0
average markov blanket size: 1.88
average neighbourhood size: 1.88
average branching factor: 0.00

learning algorithm: Chow-Liu


mutual information estimator: Maximum Likelihood (disc.)
training node:
tests used in the learning procedure: 120
> print(res$arcs)
from to
[1,] "hair" "milk"
[2,] "milk" "hair"
[3,] "feathers" "legs"
[4,] "legs" "feathers"
[5,] "eggs" "milk"
[6,] "milk" "eggs"
[7,] "eggs" "toothed"
[8,] "toothed" "eggs"
[9,] "milk" "catsize"
[10,] "catsize" "milk"
[11,] "airborne" "legs"
[12,] "legs" "airborne"
[13,] "aquatic" "breathes"

[14,] "breathes" "aquatic"


[15,] "predator" "legs"
[16,] "legs" "predator"
[17,] "predator" "domestic"
[18,] "domestic" "predator"
[19,] "toothed" "legs"
[20,] "legs" "toothed"
[21,] "backbone" "legs"
[22,] "legs" "backbone"
[23,] "backbone" "tail"
[24,] "tail" "backbone"
[25,] "breathes" "legs"
[26,] "legs" "breathes"
[27,] "venomous" "legs"
[28,] "legs" "venomous"
[29,] "fins" "legs"
[30,] "legs" "fins"
> plot(res)
> res2 <- pdag2dag(res, colnames(zoo)[c(2,5,4,14,17,3,6,9,10,11,12,15,7,8,13,16)]) # directing t
> plot(res2)
> # parameters estimation for every class
> res3 <- list()
> for(i in 1:7){
+ res3[[i]] <- bn.fit(res2, trainingdata[trainingdata$type==i,-c(1,18)],method="bayes")
+ }
>
> lik <- array(dim=c(7,nrow(testdata))) # matrix of likelihoods
> for(i in 1:7) for (j in 1:nrow(testdata)){
+ lik[i,j] <- logLik(res3[[i]],testdata[j,-c(1,18)])
+ }
>
> pred <- apply(lik,2,which.max)
> table (pred, testdata$type)

pred 1 2 3 4 5 6 7
1 11 0 0 0 0 0 0
2 0 7 0 0 0 0 0
3 0 0 0 0 4 0 0
4 0 0 0 6 0 0 0
6 0 0 0 0 0 1 0

7 0 0 0 0 0 0 2
> resmmpc <- mmpc(zoo[,-c(1,18)])
> print(resmmpc$arcs)
from to
[1,] "hair" "aquatic"
[2,] "hair" "milk"
[3,] "eggs" "milk"
[4,] "milk" "catsize"
[5,] "milk" "eggs"
[6,] "milk" "hair"
[7,] "aquatic" "fins"
[8,] "aquatic" "predator"
[9,] "aquatic" "hair"
[10,] "predator" "domestic"
[11,] "predator" "aquatic"
[12,] "toothed" "backbone"
[13,] "backbone" "tail"
[14,] "backbone" "toothed"
[15,] "fins" "aquatic"
[16,] "tail" "backbone"
[17,] "domestic" "predator"
[18,] "catsize" "milk"
Chapter 15

Data Storage, Product Approximations, Chow Liu Trees

15.1 Introduction
Let X = (X1 , . . . , Xd ) denote a random vector with probability function $P_{X_1,\ldots,X_d}$. Let
\[
\mathcal{X}_j = (x_j^{(1)}, \ldots, x_j^{(k_j)})
\]
denote the state space of $X_j$, j = 1, . . . , d and let
\[
\mathcal{X} = \times_{j=1}^{d} \mathcal{X}_j .
\]
The number of elements in the state space is $|\mathcal{X}| = \prod_{j=1}^{d} k_j$ and, without further assumptions on P,
$|\mathcal{X}| - 1$ values are required to store the entire distribution.
The problem of storing the entire probability distribution is one of many expressions of the `curse
of dimensionality'. The size of the problem is reduced if one instead stores lower dimensional marginals
and approximates the distribution by an appropriate product of lower dimensional marginals.
The topic of storing a high dimensional discrete probability distribution in a digital medium
appeared in the journal literature, probably for the first time, in papers by J. Hartmanis (1959) [59]
and P.M. Lewis II (1959) [85]. The Chow-Liu tree by Chow and Liu (1969) [28], approximately 10 years
later, provides an influential and effective solution to the problem. Chow and Liu gave an algorithm
for selecting first order factors for the product approximation so that among all such first order
approximations, the constructed approximation has the minimum Kullback-Leibler distance to the
actual distribution to be stored.

15.2 Product Approximations


15.2.1 Existence of Extensions with Given Marginals
For a probability distribution $P_{X_1,\ldots,X_d}$ over a set of random variables X = (X1 , . . . , Xd ), there are
$\sum_{j=1}^{d-1} \binom{d}{j} = 2^d - 2$ lower dimensional marginal distributions, which may be obtained by marginalising
the distribution. The classical marginal problem considers an `inverse problem'; given a family $(P_{W_i})_{i=1}^{s}$
of probability distributions, for $s < 2^d - 2$ and $W_i \subset V = \{X_1, \ldots, X_d\}$, the question is whether there
exists a probability distribution $P_V$ that satisfies the so-called collective compatibility condition given
by Equation (15.1).

\[
P_{W_i} = (P_V)^{\downarrow W_i} \quad \forall i = 1, \ldots, s \tag{15.1}
\]

Here the notation ↓ A means the marginalisation down to a set of variables A. This problem is of
importance in the following setting: if the full probability distribution cannot be estimated and stored,
it may be possible to estimate and store probability distributions over selected subsets of the variables.
These subsets should be chosen so that, formally, the collection of distributions over the subsets are
compatible. Some fundamental contributions to this problem are due to H.G. Kellerer [73] and others.
If the sets $(W_i)_{i=1}^{s}$ are disjoint, satisfying $\cup_i W_i = V$, then the problem has an obvious trivial solution:

\[
P(x) = \prod_{i=1}^{s} P_{W_i}(x_{W_i})
\]

where the product operation means first extending the probabilities $P_{W_i}$ as functions, to functions $\tilde{P}_{W_i}$
over the domain V where (using obvious notation) $\tilde{P}_{W_i}(x_{W_i}, x_{V \setminus W_i}) = P_{W_i}(x_{W_i})$ for each $x_{W_i} \in \mathcal{X}_{W_i}$
and then multiplying. With the appropriate projections of x,

\[
P(x) = \prod_{j=1}^{s} P_{W_j}(x_{W_j}).
\]

If $(W_j)_{j=1}^{s}$ are not disjoint, then clearly the collection of probabilities $(P_{W_j})_{j=1}^{s}$ should satisfy a pairwise
compatibility condition: writing $C_{ij} = W_i \cap W_j$,
\[
P_{C_{ij}} = P_{W_i}^{\downarrow C_{ij}} = P_{W_j}^{\downarrow C_{ij}} \quad \forall (i, j) \in \{1, \ldots, s\}^2 .
\]

The following example due to Vorobev (1962) [141] shows that pairwise compatibility does not imply
collective compatibility.

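Example 15.1 (Vorobev's example).

Let V = {1, 2, 3}, W1 = {2, 3}, W2 = {1, 3}, W3 = {1, 2}. Suppose that the following three pairwise joint
distributions are specified:

\[
P_{W_1}(x_2, x_3) =
\begin{array}{c|cc}
x_2 \backslash x_3 & 0 & 1 \\ \hline
0 & \frac{1}{2} & 0 \\
1 & 0 & \frac{1}{2}
\end{array}
\qquad
P_{W_2}(x_3, x_1) =
\begin{array}{c|cc}
x_1 \backslash x_3 & 0 & 1 \\ \hline
0 & 0 & \frac{1}{2} \\
1 & \frac{1}{2} & 0
\end{array}
\qquad
P_{W_3}(x_1, x_2) =
\begin{array}{c|cc}
x_1 \backslash x_2 & 0 & 1 \\ \hline
0 & \frac{1}{2} & 0 \\
1 & 0 & \frac{1}{2}
\end{array}
\]

These are pairwise compatible; W1 ∩ W2 = {3} and
\[
P_{W_1}^{\downarrow \{3\}}(x_3) = P_{W_2}^{\downarrow \{3\}}(x_3) =
\begin{array}{cc} 0 & 1 \\ \hline \frac{1}{2} & \frac{1}{2} \end{array} .
\]

W1 ∩ W3 = {2} and
\[
P_{W_1}^{\downarrow \{2\}}(x_2) = P_{W_3}^{\downarrow \{2\}}(x_2) =
\begin{array}{cc} 0 & 1 \\ \hline \frac{1}{2} & \frac{1}{2} \end{array} .
\]

W2 ∩ W3 = {1} and
\[
P_{W_2}^{\downarrow \{1\}}(x_1) = P_{W_3}^{\downarrow \{1\}}(x_1) =
\begin{array}{cc} 0 & 1 \\ \hline \frac{1}{2} & \frac{1}{2} \end{array} .
\]

If a common extension $P^*$ existed, it would follow that (for example)

\[
\frac{1}{2} = P_{W_1}(0, 0) = P^*(0, 0, 0) + P^*(1, 0, 0) \le P_{W_2}(0, 0) + P_{W_3}(1, 0) = 0,
\]
which is a contradiction. The three marginals satisfy a pairwise compatibility condition, but not a
collective compatibility condition.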
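Vorobev's example can be verified mechanically. The base R sketch below stores the three pairwise tables, checks that the shared one-dimensional marginals agree, and then counts the atoms (x1, x2, x3) that could carry positive mass under a common extension. The variable names are illustrative; the argument relies only on the fact that an extension can give an atom no more mass than any of its pairwise marginals.

```r
vals <- c("0", "1")
p_w1 <- matrix(c(0.5, 0, 0, 0.5), 2, 2, dimnames = list(x2 = vals, x3 = vals))
p_w2 <- matrix(c(0, 0.5, 0.5, 0), 2, 2, dimnames = list(x1 = vals, x3 = vals))
p_w3 <- matrix(c(0.5, 0, 0, 0.5), 2, 2, dimnames = list(x1 = vals, x2 = vals))

# Pairwise compatibility: the shared one-dimensional marginals agree.
same <- function(u, v) isTRUE(all.equal(unname(u), unname(v)))
pairwise_ok <- same(colSums(p_w1), colSums(p_w2)) &&   # marginal of x3
  same(rowSums(p_w1), colSums(p_w3)) &&                # marginal of x2
  same(rowSums(p_w2), rowSums(p_w3))                   # marginal of x1

# An atom can have positive mass only if all three pairwise tables are
# positive at its projections (indices 1, 2 stand for values 0, 1).
atoms <- expand.grid(x1 = 1:2, x2 = 1:2, x3 = 1:2)
allowed <- p_w1[cbind(atoms$x2, atoms$x3)] > 0 &
  p_w2[cbind(atoms$x1, atoms$x3)] > 0 &
  p_w3[cbind(atoms$x1, atoms$x2)] > 0

print(pairwise_ok)
print(sum(allowed))   # number of atoms that may carry mass
```

With no admissible atom, no probability distribution on the three variables can have all three tables as marginals.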

Without loss of generality, let $V = \cup_{j=1}^{s} W_j$. The condition to ensure that pairwise compatibility implies
collective compatibility is known as the acyclic condition.

Definition 15.2 (Acyclic, Running Intersection Property). Suppose that there is an ordering of the
sets W1 , . . . , Ws such that for each j ≥ 2 there is an l < j such that

\[
B_j = W_j \cap \left( \cup_{k=1}^{j-1} W_k \right) \subseteq W_j \cap W_l \tag{15.2}
\]

This property is known as the running intersection property. A collection of sets W1 , . . . , Ws having the
running intersection property, given some ordering, is known as acyclic.

Remark In Example 15.1, if the ordering W1 , W2 , W3 is chosen, then

W3 ∩ (W1 ∪ W2 ) = {1, 2},

but {1, 2} is not a subset of W1 or W2 . It follows that Equation (15.2) does not hold for this ordering.
It is easy to check that there is no ordering that satisfies Equation (15.2), hence acyclicity does not
hold for Example 15.1.
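The check over all orderings can be automated for small collections. The following base R sketch tests Equation (15.2) for a given ordering and then searches all orderings by brute force; the names `has_rip`, `perms` and `is_acyclic` are hypothetical helpers, and the exhaustive search is only sensible for a handful of sets.

```r
# Does the ordering W[[1]], ..., W[[s]] satisfy Equation (15.2)?
has_rip <- function(W) {
  if (length(W) < 2) return(TRUE)
  for (j in 2:length(W)) {
    B_j <- intersect(W[[j]], unlist(W[seq_len(j - 1)]))
    # B_j is contained in W_j, so it suffices to find an earlier W_l containing B_j.
    ok <- any(vapply(W[seq_len(j - 1)],
                     function(W_l) all(B_j %in% W_l), logical(1)))
    if (!ok) return(FALSE)
  }
  TRUE
}

# All permutations of a vector (recursive; fine for small s).
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v))
    for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  out
}

# Acyclic means: some ordering has the running intersection property.
is_acyclic <- function(W) {
  any(vapply(perms(seq_along(W)), function(o) has_rip(W[o]), logical(1)))
}

vorobev <- list(c(2, 3), c(1, 3), c(1, 2))
chain   <- list(c(1, 2), c(2, 3), c(3, 4))
print(c(vorobev = is_acyclic(vorobev), chain = is_acyclic(chain)))
```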

The following important result is due to Beeri et al. (1983) [5].

Theorem 15.3. Acyclicity is equivalent to `pairwise compatibility for all (i, j) implies collective
compatibility'. Furthermore, under acyclicity, there is a unique product form extension,

\[
P^*(x) = \frac{\prod_{l=1}^{s} P(x_{W_l})}{\prod_{h=1}^{s-1} P(x_{V_h})} \tag{15.3}
\]

where each $V_h$ is the intersection of two or more $W_l$.



Proof Assume we have the acyclic / running intersection property. Consider the variables as nodes of
a graph, where the sets $(W_i)_{i=1}^{s}$ are maximal cliques. The running intersection property is equivalent
to a perfect order of the maximal cliques, which implies that the maximal cliques W1 , . . . , Ws can
be arranged as a junction tree, where for each j ∈ {2, . . . , s}, we choose an l(j) < j from the set
$\{ l : W_j \cap (\cup_{k=1}^{j-1} W_k) = W_j \cap W_l \}$ and insert an edge j − l(j). In this way, we have a tree with s − 1 edges.
For each edge ⟨j, l⟩, let $V_{\langle j,l \rangle} = W_j \cap W_l$ and let U denote the (undirected) edge set. Now consider any
collection $(P_{W_j})_{j=1}^{s}$ which is pairwise compatible. Then we can define a distribution $P^*$ by:

\[
P^*(x) = \frac{\prod_{l=1}^{s} P_{W_l}(x_{W_l})}{\prod_{\langle j,l \rangle \in U} P_{V_{\langle j,l \rangle}}(x_{V_{\langle j,l \rangle}})}
\]

where $P_{V_{\langle j,l \rangle}} = (P_{W_l})^{\downarrow V_{\langle j,l \rangle}} = (P_{W_j})^{\downarrow V_{\langle j,l \rangle}}$. Since the sets $(W_i)_{i=1}^{s}$ are arranged on a junction tree,
any intersection $W_\alpha \cap W_\beta$ is contained in $W_\gamma \cap W_\delta$ for any edge γ − δ on the unique path α ↔ β in the
tree. Hence the acyclic property gives (pairwise compatibility implies collective compatibility).
Now suppose that pairwise compatibility implies collective compatibility for W1 , . . . , Ws and assume
that acyclicity is not possible. Taking W1 , . . . , Ws as the maximal cliques of an undirected graph, lack
of acyclicity is equivalent to existence of a cycle of length ≥ 4 in the graph without a chord. Let the
cycle be $\alpha_1, \ldots, \alpha_m$. Then there are $W_{j_1}, \ldots, W_{j_m}$ such that $\{\alpha_i, \alpha_{i+1}\} \subseteq W_{j_i}$ for i = 1, . . . , m, using
$\alpha_{m+1} = \alpha_1$. Furthermore, the lack of a chord implies that $W_{j_a} \cap W_{j_b} = \emptyset$ for |a − b| ≥ 2, where we take a and
b mod m. We may therefore find (similar to Vorobev's example) distributions $P_{W_{j_1}}, \ldots, P_{W_{j_m}}$ which
are pairwise compatible, but where the distributions $P_{W_{j_i}}^{\downarrow \{\alpha_i, \alpha_{i+1}\}}$ (using $\alpha_{m+1} \equiv \alpha_1$) are not collectively
compatible.

Uniqueness of representation (15.3) requires that PWi (xWi ) > 0 for all xWi and all i = 1, . . . , s.

15.2.2 Dependence Structures


Let W1 , . . . , Ws be sets of random variables, $V = \cup_{j=1}^{s} W_j$ and suppose that W1 , . . . , Ws satisfy the
running intersection property of Equation (15.2). With this ordering, set $B_1 = \emptyset$ and

\[
B_j = W_j \cap \left( \cup_{k=1}^{j-1} W_k \right), \quad j = 2, \ldots, s.
\]

Let $A_j = W_j \setminus B_j$ so that $W_j = A_j \cup B_j$. It follows that A1 , . . . , As is a partition of V and that the sets
$(B_j)_{j=1}^{s}$ satisfy
\[
B_j \subset A_i \cup B_i \quad \text{for some } i \in \{1, \ldots, j - 1\}.
\]

This leads to the definition of a dependence structure, the term used to describe collections $(A_j, B_j)_{j=1}^{s}$
which satisfy this property.

Definition 15.4 (Dependence Structure). Let $(A_i)_{i=1}^{k}$ be a partition of a set V and let S be a sequence
of pairs of subsets of V , $S = (A_i, B_i)_{i=1}^{k}$, satisfying

\[
B_1 = \emptyset, \qquad B_r \subset A_i \cup B_i \ \text{ for some } 1 \le i \le r - 1, \quad r = 2, \ldots, k.
\]

Then S is a dependence structure.



Definition 15.5 (Product Approximation). Let S be a dependence structure. Then the probability
distribution defined by
\[
P^{(S)}(x) = P_{A_1}(x_{A_1}) \prod_{j=2}^{k} P_{A_j \mid B_j}(x_{A_j} \mid x_{B_j})
\]

is called the product approximation of the probability distribution P determined by S .

A product approximation is clearly a well defined probability distribution. Furthermore, it satisfies the
following compatibility condition:

Lemma 15.6.

\[
P^{(S) \downarrow A_j \cup B_j}(x_{A_j \cup B_j}) = P_{A_j \cup B_j}(x_{A_j \cup B_j}) \quad \forall x \in \mathcal{X}, \ j = 1, \ldots, s.
\]

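Proof By marginalising over $A_{j+1} \cup \ldots \cup A_s$,

\[
P^{(S) \downarrow A_1 \cup \ldots \cup A_j}(x_{A_1 \cup \ldots \cup A_j}) = P_{A_1}(x_{A_1}) \prod_{k=2}^{j} P_{A_k \mid B_k}(x_{A_k} \mid x_{B_k}), \quad j = 1, \ldots, s
\]

so that

\[
\begin{aligned}
P^{(S) \downarrow A_j \cup B_j}(x_{A_j \cup B_j})
&= P_{A_j \mid B_j}(x_{A_j} \mid x_{B_j}) \sum_{x_{(A_1 \cup \ldots \cup A_{j-1}) \setminus B_j}} P_{A_1}(x_{A_1}) \prod_{k=2}^{j-1} P_{A_k \mid B_k}(x_{A_k} \mid x_{B_k}) \\
&= P_{A_j \mid B_j}(x_{A_j} \mid x_{B_j}) \sum_{x_{(A_1 \cup \ldots \cup A_{j-1}) \setminus B_j}} P^{(S) \downarrow A_1 \cup \ldots \cup A_{j-1}}(x_{A_1 \cup \ldots \cup A_{j-1}}) \\
&= P_{A_j \mid B_j}(x_{A_j} \mid x_{B_j}) \, P^{(S) \downarrow B_j}(x_{B_j}).
\end{aligned}
\]

It remains to show that $P^{(S) \downarrow B_j}(x_{B_j}) = P_{B_j}(x_{B_j})$. This follows inductively; $B_1 = \emptyset$. Assume that
$P_{A_i \cup B_i} = P^{(S) \downarrow (A_i \cup B_i)}$ for all i = 1, . . . , j − 1. Since $B_j \subset A_i \cup B_i$ for some i ∈ {1, . . . , j − 1},
marginalising both sides down to $B_j$ gives $P^{(S) \downarrow B_j} = P_{B_j}$ and the result follows by induction.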

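Note that if $W_i = A_i \cup B_i$ for i = 1, . . . , s and $P_{W_i}$ are given, then

\[
P^{(S)} = \frac{\prod_{i=1}^{s} P_{W_i}}{\prod_{i=2}^{s} P_{B_i}}
\]

where $B_i = W_i \cap \left( \cup_{j=1}^{i-1} W_j \right)$ and the convention $P_{\emptyset} \equiv 1$ is used. In this situation, clearly

\[
P^{(S) \downarrow W_j} = P_{W_j}.
\]

It follows directly from this factorisation that

\[
A_j \perp_{P^{(S)}} \left( \cup_{k=1}^{j-1} A_k \right) \setminus B_j \mid B_j .
\]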

15.3 Reverse I -Projection and the Optimal Product Approximation


The relative entropy, or information divergence, or I -divergence, or Kullback-Leibler distance, written
$D_{KL}(P \| P^{(S)})$, is defined by

\[
D_{KL}(P \| P^{(S)}) = \sum_{x \in \mathcal{X}} P(x) \ln \frac{P(x)}{P^{(S)}(x)}. \tag{15.4}
\]
The task of Optimal Product Representation of P is, for a given dependence structure S , to find a $P_S^*$
such that $D_{KL}(P \| P_S)$ is minimised. The solution $P_S^*$ is called a Reverse I -Projection of P onto the set of
all probability measures with S as a dependence structure.

Definition 15.7 (Shannon Entropy). Let A ⊆ V . The Shannon entropy of the set of variables A for
a probability distribution P is defined as

\[
H_P(A) := - \sum_{x_A \in \mathcal{X}_A} P_A(x_A) \ln P_A(x_A)
\]

where $P_A = P^{\downarrow A}$.

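With A = V , it follows that, for a probability distribution Q,

\[
D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \ln P(x) - \sum_{x \in \mathcal{X}} P(x) \ln Q(x) = -H_P(V) - \sum_{x \in \mathcal{X}} P(x) \ln Q(x).
\]

For a dependence structure $S = (A_i, B_i)_{i=1}^{s}$ and a probability distribution Q that factorises according
to $Q = \prod_{i=1}^{s} Q_{A_i \mid B_i}$, it is straightforward to compute that

\[
\begin{aligned}
D_{KL}(P \| Q) &= -H_P(V) - \sum_{x \in \mathcal{X}} P(x) \sum_{i=1}^{s} \ln Q_{A_i \mid B_i}(x_{A_i} \mid x_{B_i}) \\
&= -H_P(V) - \sum_{i=1}^{s} \sum_{x_{A_i \cup B_i}} P_{A_i \cup B_i}(x_{A_i \cup B_i}) \ln Q_{A_i \mid B_i}(x_{A_i} \mid x_{B_i}) \\
&= -H_P(V) - \sum_{i=1}^{s} \sum_{x_{B_i}} P_{B_i}(x_{B_i}) \sum_{x_{A_i}} P_{A_i \mid B_i}(x_{A_i} \mid x_{B_i}) \ln Q_{A_i \mid B_i}(x_{A_i} \mid x_{B_i}).
\end{aligned}
\]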

Now use Gibbs' inequality; for any two probability distributions f and g over the same state space,

\[
\sum_{j=1}^{L} f_j \ln f_j \ge \sum_{j=1}^{L} f_j \ln g_j . \tag{15.5}
\]

This follows from the fact that

\[
D_{KL}(f \| g) = \sum_{j=1}^{L} f_j \ln \frac{f_j}{g_j} \ge 0
\]
with equality if and only if f = g . Applying (15.5) to each inner sum over $x_{A_i}$, with $f = P_{A_i \mid B_i}(\cdot \mid x_{B_i})$
and $g = Q_{A_i \mid B_i}(\cdot \mid x_{B_i})$, it follows that the reverse I -projection of P onto a dependency
structure $S = (A_i, B_i)_{i=1}^{s}$ is
\[
P^{(S)} = \prod_{i=1}^{s} P_{A_i \mid B_i}
\]

and

\[
D_{KL}(P \| P^{(S)}) = -H_P(V) + \sum_{i=1}^{s} \left( H_P(A_i \cup B_i) - H_P(B_i) \right).
\]
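The entropy expression for the divergence can be checked numerically. The sketch below, in base R, uses an arbitrary joint distribution on three binary variables and the chain structure A1 = {1}, (A2, B2) = ({2}, {1}), (A3, B3) = ({3}, {2}), which is an illustrative choice; for this structure the sum reduces to H(1,2) + H(2,3) − H(2).

```r
entropy <- function(q) { q <- q[q > 0]; -sum(q * log(q)) }

set.seed(2)
p <- array(runif(8), dim = c(2, 2, 2))
p <- p / sum(p)

p12 <- apply(p, c(1, 2), sum)
p23 <- apply(p, c(2, 3), sum)
p2  <- apply(p, 2, sum)

# Product approximation for the chain structure 1 -> 2 -> 3.
p_s <- array(0, dim = c(2, 2, 2))
for (x1 in 1:2) for (x2 in 1:2) for (x3 in 1:2)
  p_s[x1, x2, x3] <- p12[x1, x2] * p23[x2, x3] / p2[x2]

kl_direct  <- sum(p * log(p / p_s))                       # Equation (15.4)
kl_entropy <- -entropy(p) + entropy(p12) + entropy(p23) - entropy(p2)
print(c(direct = kl_direct, via_entropies = kl_entropy))
```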

Definition 15.8 (Mutual Information). The mutual information I(A, B) between two disjoint sets of
variables A and B is defined as

\[
I(A, B) = H(A) + H(B) - H(A \cup B).
\]

This may be written as

\[
I(A, B) = \sum_{x_{A \cup B}} P_{A \cup B}(x_{A \cup B}) \ln \frac{P_{A \cup B}(x_{A \cup B})}{P_A(x_A) P_B(x_B)} = D_{KL}(P_{A \cup B} \| P_A P_B).
\]
Note that $I(A, B) = 0 \Leftrightarrow X_A \perp X_B$.

If one is choosing a dependence structure $S = (A_i, B_i)_{i=1}^{s}$ from within a class $\mathcal{S}$ of dependence structures
with the same storage properties, then, since $H(A_i \cup B_i) - H(B_i) = H(A_i) - I(A_i, B_i)$ and $H_P(V)$
does not depend on S , minimising $D_{KL}(P \| P^{(S)})$ is equivalent to choosing S to maximise
\[
Q(S) = - \sum_{i=1}^{s} H(A_i) + \sum_{i=1}^{s} I(A_i, B_i).
\]
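The two expressions for the mutual information in Definition 15.8 can be compared on small tables. The base R sketch below takes the joint table of two variables as a matrix; the function names are illustrative.

```r
entropy <- function(q) { q <- q[q > 0]; -sum(q * log(q)) }

# I(A, B) = H(A) + H(B) - H(A u B), from the joint table p_ab.
mutual_info <- function(p_ab) {
  entropy(rowSums(p_ab)) + entropy(colSums(p_ab)) - entropy(p_ab)
}

# I(A, B) as the Kullback-Leibler distance between P_AB and P_A P_B.
mutual_info_kl <- function(p_ab) {
  q <- outer(rowSums(p_ab), colSums(p_ab))
  keep <- p_ab > 0
  sum(p_ab[keep] * log(p_ab[keep] / q[keep]))
}

indep   <- outer(c(0.3, 0.7), c(0.4, 0.6))       # independent variables
dep     <- diag(0.5, 2)                          # B is a copy of A
generic <- matrix(c(0.1, 0.2, 0.3, 0.4), 2, 2)

print(c(mutual_info(indep), mutual_info(dep), mutual_info(generic)))
print(mutual_info_kl(generic))
```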

15.4 The Optimal Chow-Liu Product Approximation


For a Chow-Liu tree, the dependence structure $(A_i, B_i)_{i=1}^{k}$ satisfies

\[
|A_i \cup B_i| \le 2, \quad i = 1, \ldots, k.
\]

Let G = (V, U ) denote an undirected graph, where V = {1, . . . , d} is the indexing set for the nodes and
U is the undirected edge set. An undirected graph G is complete if U = {⟨i, j⟩ ∶ 1 ≤ i < j ≤ d}. The
degree of a node i is defined as the number of distinct edges containing the node i.
A subgraph H of G is a graph (V1 , U1 ) where V1 ⊆ V and U1 ⊆ U . A subgraph H is induced by
A ⊂ V if V1 = A and U1 = U ∩ (A × A). A subgraph H is a spanning subgraph of G if it is connected and
V1 = V .
An undirected tree T is a connected undirected graph that has no cycles. It follows that there is
a unique path between any two nodes. A spanning tree of a graph G is a spanning subgraph of G which
is a tree.
A labelled tree is a tree on d nodes where each node is labelled by one of the integers {1, . . . , d}. In
the sequel, labelled trees will be referred to as trees.
A weighted undirected graph is
G = ((V, U )∣w)
where w ∶ U → R+ (non-negative real numbers). The weight of a tree is the sum of its edge weights.
The weight to be used by the Chow-Liu algorithm will be defined via the mutual information

\[
w(j, k) := I(j, k) = H(j) + H(k) - H(j \cup k).
\]



15.4.1 Chow Liu Tree with known P


Definition 15.9 (Chow-Liu Dependence Structure). Let $(i_r)_{r=1}^{d}$ be an arbitrary permutation of V =
{1, . . . , d}. The singleton sets $A_r = \{i_r\}$, r = 1, . . . , d, are a partition of V . Let σ be a sequence of pairs
of singletons of V , $\sigma = (i_r, j_r)_{r=1}^{d}$, where

\[
j_1 = \emptyset, \qquad j_r \in \{i_1, \ldots, i_{r-1}\} \subseteq V, \quad r = 2, \ldots, d.
\]

Then σ is a Chow-Liu dependence structure.

A Chow-Liu dependence structure will give a tree. Since the tree connects all the nodes, it is a spanning
tree. Arrows are directed from $j_r$ to $i_r$. If $j_r = \emptyset$, there is no arrow pointing to the node $i_r$. Any node
in a directed tree with $j_r = \emptyset$ is called a root. By construction, $i_1$ is the only root. A tree with exactly
one root is said to be proper.

Note that
\[
i_r \perp_{P^{(\sigma)}} \{i_1, \ldots, i_{r-1}\} \setminus \{j_r\} \mid j_r .
\]

For σ thus defined,

\[
Q(\sigma) = - \sum_{r=1}^{d} H(i_r) + \sum_{r=2}^{d} I(i_r, j_r).
\]

The Chow-Liu dependence structure defines a product approximation of a known probability
distribution P by
\[
P^{(\sigma)}(x) = P_{i_1}(x_{i_1}) \prod_{r=2}^{d} P(x_{i_r} \mid x_{j_r}).
\]

The following theorem is the first main result in Chow and Liu [28] (1968).

Theorem 15.10. Let P be a probability distribution over $\mathcal{X}$. Let G = ((V, U )∣w) be a complete weighted
graph with w given by
\[
w(j, k) = I(j, k), \quad \langle j, k \rangle \in U
\]

where the I(j, k)s are computed using the $P_{j,k}$s. Then the maximum weight spanning tree of G defines
a Chow-Liu dependence structure σ , which maximises

\[
Q(\sigma) = - \sum_{r=1}^{d} H(i_r) + \sum_{r=2}^{d} I(i_r, j_r).
\]

Proof Firstly, $\sum_{r=1}^{d} H(i_r) = \sum_{i=1}^{d} H(i)$ so that the first term in Q(σ) is independent of σ , hence the
problem is equivalent to the maximisation of $\sum_{r=2}^{d} I(i_r, j_r)$. This sum is the total weight of the spanning
tree with edges $\langle i_r, j_r \rangle$, r = 2, . . . , d, so it is maximised by a maximum weight spanning tree of G.

15.4.2 Chow-Liu Algorithm with Unknown P


For P unknown, suppose there is an n × d data matrix x, where
\[
\mathbf{x} = \begin{pmatrix} x_{(1)} \\ \vdots \\ x_{(n)} \end{pmatrix}.
\]
Each $x_{(j)} \in \mathcal{X}$ for j = 1, . . . , n.

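Let $\mathcal{P}(\mathcal{X})$ denote the space of all probability distributions over $\mathcal{X}$; that is

\[
\mathcal{P}(\mathcal{X}) = \{ P \mid P = \{P(x)\}_{x \in \mathcal{X}} \}.
\]

Let $T_d = (V, \sigma)$ be a spanning tree on V , where σ is a Chow-Liu dependence structure. Let $\mathcal{T}_d$ denote
the set of all spanning trees, then
\[
\mathcal{P}(\mathcal{X}, \mathcal{T}_d) = \{ P^{(\sigma)} \}
\]

is the set of all tree dependent probability distributions on $\mathcal{X}$ and $\mathcal{P}(\mathcal{X}, \mathcal{T}_d) \subset \mathcal{P}(\mathcal{X})$. The empirical
probability is defined as
\[
\hat{P}_n(x) = \frac{1}{n} \sum_{k=1}^{n} 1_x(x_{(k)}).
\]
Lemma 15.11. The maximum likelihood estimate $\hat{P}^{(ML)}$ is given by

\[
\hat{P}^{(ML)} = \arg \min_{P \in \mathcal{P}(\mathcal{X}, \mathcal{T}_d)} D_{KL}(\hat{P}_n \| P).
\]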

Proof
\[
D_{KL}(\hat{P}_n \| P) = -\hat{H}_n(V) - \sum_{x \in \mathcal{X}} \hat{P}_n(x) \ln P(x),
\]

where $\hat{H}_n(V) = - \sum_{x \in \mathcal{X}} \hat{P}_n(x) \ln \hat{P}_n(x)$. Note that this does not depend on the tree structure. For the
other part,
\[
\sum_{x \in \mathcal{X}} \hat{P}_n(x) \ln P(x) = \frac{1}{n} \sum_{j=1}^{n} \ln P(x_{(j)}),
\]
which is the log likelihood function, divided by n. Hence, the maximum likelihood estimate is equivalent to the
reverse I -projection of $\hat{P}_n$ onto the set of tree dependent distributions $\mathcal{P}(\mathcal{X}, \mathcal{T}_d)$.
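The identity used in the proof, that the average log likelihood equals the sum of the empirical probabilities times ln P(x), can be seen on simulated data. The base R sketch below uses two binary variables and an arbitrary strictly positive candidate model; the data and the matrix `p_model` are made up for illustration.

```r
set.seed(3)
n <- 200
x <- data.frame(a = rbinom(n, 1, 0.3), b = rbinom(n, 1, 0.6))

# Candidate model P on {0,1}^2 (rows: a = 0, 1; columns: b = 0, 1).
p_model <- matrix(c(0.1, 0.2, 0.3, 0.4), 2, 2)

# Empirical distribution, with both levels forced to appear in the table.
p_hat <- table(factor(x$a, levels = 0:1), factor(x$b, levels = 0:1)) / n

avg_loglik <- mean(log(p_model[cbind(x$a + 1, x$b + 1)]))  # (1/n) sum_j ln P(x_(j))
cross_term <- sum(p_hat * log(p_model))                    # sum_x P_hat(x) ln P(x)
print(c(avg_loglik, cross_term))
```

Maximising the first quantity over a family of distributions is therefore the same as minimising the Kullback-Leibler distance from the empirical distribution over that family.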

15.4.3 The Log Likelihood Function


When $\sigma = (i_r, j_r)_{r=1}^{d}$ is a Chow-Liu dependence structure, the parameter set P is the set of two
dimensional distributions given by

\[
P = \{ P_{i,j} \mid (i, j) \in V \times V, \ i \ne j \}.
\]

The corresponding parametric probability is

\[
P^{(\sigma)}(x) = P_{i_1}(x_{i_1}) \prod_{r=2}^{d} P_{i_r \mid j_r}(x_{i_r} \mid x_{j_r})
= \prod_{r=1}^{d} P_{i_r}(x_{i_r}) \prod_{r=2}^{d} \frac{P_{i_r, j_r}(x_{i_r}, x_{j_r})}{P_{i_r}(x_{i_r}) P_{j_r}(x_{j_r})},
\qquad x = (x_j)_{j=1}^{d} \in \mathcal{X}.
\]

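The likelihood function is therefore

\[
L(\sigma, P) = \prod_{j=1}^{n} P^{(\sigma)}(x_{(j)} \mid P)
\]

and the log likelihood function, divided by n, is

\[
L(\sigma, P) = \frac{1}{n} \sum_{j=1}^{n} \ln P(x_{(j)} \mid \sigma, P),
\]

which may be re-written as

\[
L(\sigma, P) = \frac{1}{n} \sum_{x \in \mathcal{X}_{i_1}} N(x) \ln P_{i_1}(x)
+ \frac{1}{n} \sum_{r=2}^{d} \sum_{(x,y) \in \mathcal{X}_{i_r} \times \mathcal{X}_{j_r}} N(x, y) \ln P_{i_r, j_r}(x, y)
- \frac{1}{n} \sum_{r=2}^{d} \sum_{x \in \mathcal{X}_{j_r}} N(x) \ln P_{j_r}(x)
\]

where N (x) denotes the number of appearances of the appropriate configuration x in the data matrix x.
This reduces to
\[
L(\sigma, P) = \sum_{x \in \mathcal{X}_{i_1}} \hat{P}_{n;i_1}(x) \ln P_{i_1}(x)
+ \sum_{r=2}^{d} \sum_{(x,y) \in \mathcal{X}_{i_r} \times \mathcal{X}_{j_r}} \hat{P}_{n;i_r,j_r}(x, y) \ln P_{i_r,j_r}(x, y)
- \sum_{r=2}^{d} \sum_{x \in \mathcal{X}_{j_r}} \hat{P}_{n;j_r}(x) \ln P_{j_r}(x).
\]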

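Using the notation $P_{i_r \mid j_r} = \frac{P_{i_r, j_r}}{P_{j_r}}$, this may be written as

\[
L(\sigma, P) = \sum_{x \in \mathcal{X}_{i_1}} \hat{P}_{n;i_1}(x) \ln P_{i_1}(x)
+ \sum_{r=2}^{d} \sum_{y \in \mathcal{X}_{j_r}} \hat{P}_{n;j_r}(y) \sum_{x \in \mathcal{X}_{i_r}} \hat{P}_{n;i_r \mid j_r}(x \mid y) \ln P_{i_r \mid j_r}(x \mid y). \tag{15.6}
\]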

The log likelihood L(σ, P) is to be maximised. For a fixed structure σ , it therefore follows from Gibbs'
inequality that the maximum likelihood estimates are:

\[
P^{(ML)}_{i_1} = \hat{P}_{n;i_1}, \qquad
P^{(ML)}_{i_r \mid j_r} = \frac{\hat{P}_{n;i_r,j_r}}{\hat{P}_{n;j_r}}, \quad r = 2, \ldots, d,
\]

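from which

\[
\begin{aligned}
L(\sigma, P^{(ML)})
&= \sum_{x \in \mathcal{X}_{i_1}} \hat{P}_{n;i_1}(x) \ln \hat{P}_{n;i_1}(x)
+ \sum_{r=2}^{d} \sum_{(x,y) \in \mathcal{X}_{i_r} \times \mathcal{X}_{j_r}} \hat{P}_{n;i_r,j_r}(x, y) \ln \frac{\hat{P}_{n;i_r,j_r}(x, y)}{\hat{P}_{n;j_r}(y)} \\
&= \sum_{r=1}^{d} \sum_{x \in \mathcal{X}_{i_r}} \hat{P}_{n;i_r}(x) \ln \hat{P}_{n;i_r}(x)
+ \sum_{r=2}^{d} \sum_{(x,y) \in \mathcal{X}_{i_r} \times \mathcal{X}_{j_r}} \hat{P}_{n;i_r,j_r}(x, y) \ln \frac{\hat{P}_{n;i_r,j_r}(x, y)}{\hat{P}_{n;i_r}(x) \hat{P}_{n;j_r}(y)} \\
&= \sum_{r=1}^{d} \sum_{x \in \mathcal{X}_{i_r}} \hat{P}_{n;i_r}(x) \ln \hat{P}_{n;i_r}(x) + \sum_{r=2}^{d} \hat{I}(i_r, j_r)
\end{aligned}
\]

where
\[
\hat{I}(i_r, j_r) = \sum_{(x,y) \in \mathcal{X}_{i_r} \times \mathcal{X}_{j_r}} \hat{P}_{n;i_r,j_r}(x, y) \ln \frac{\hat{P}_{n;i_r,j_r}(x, y)}{\hat{P}_{n;i_r}(x) \hat{P}_{n;j_r}(y)}
\]
is the plug in estimate of the mutual information. Clearly, the first term in the expression for
$L(\sigma, P^{(ML)})$ does not depend on σ and hence the maximum likelihood estimate $\sigma^{(ML)}$ is given by
\[
\sigma^{(ML)} = \arg\max_{\sigma} \left\{ \sum_{r=2}^{d} \hat{I}(i_r, j_r) \right\}.
\]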
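The plug in estimates supply the edge weights for the tree search. A minimal base R sketch follows, assuming the data are held in a data frame of discrete columns; the function names `plug_in_mi` and `mi_weights` and the toy data are illustrative.

```r
# Plug in estimate of the mutual information between two discrete vectors.
plug_in_mi <- function(u, v) {
  p_uv <- table(u, v) / length(u)                 # empirical joint distribution
  q <- outer(rowSums(p_uv), colSums(p_uv))        # product of empirical margins
  keep <- p_uv > 0                                # empty cells contribute nothing
  sum(p_uv[keep] * log(p_uv[keep] / q[keep]))
}

# Symmetric d x d matrix of pairwise weights for a data frame of d columns.
mi_weights <- function(data) {
  d <- ncol(data)
  w <- matrix(0, d, d, dimnames = list(names(data), names(data)))
  for (j in seq_len(d - 1)) for (k in (j + 1):d)
    w[j, k] <- w[k, j] <- plug_in_mi(data[[j]], data[[k]])
  w
}

# Toy data: b copies a, while c is exactly balanced against a.
a <- rep(0:1, 50)
toy <- data.frame(a = a, b = a, c = rep(c(0, 0, 1, 1), 25))
w <- mi_weights(toy)
print(round(w, 4))
```

Cells with zero count are skipped here, which is one simple way of avoiding the zero divided by zero problem; adding a small constant to each cell is an alternative.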

The number of spanning trees on d nodes is $d^{d-2}$. This is Cayley's formula. An exhaustive search is not
feasible in practice. Besides, as pointed out by Chow and Liu [28], a greedy approach finds the maximal
spanning tree.
There are several well known standard algorithms for finding the spanning tree of maximum weight,
for example Kruskal's algorithm and Prim's algorithm. These algorithms are closely related and find
the maximum weight spanning tree in $O(d^2 \ln d)$ time.

Kruskal's algorithm Kruskal's Algorithm runs as follows:


1. The d variables yield d(d − 1)/2 edges. The edges are indexed in decreasing order, according to
their weights b1 , b2 , b3 , . . . , bd(d−1)/2 .

2. The edges b1 and b2 are selected. Then the edge b3 is added, if it does not form a cycle.

3. This is repeated, through b4 , . . . bd(d−1)/2 , in that order, adding edges if they do not form a cycle
and discarding them if they form a cycle.

This procedure returns a unique tree if the weights are different. If two weights are equal, one may
impose an arbitrary ordering. From the d(d − 1)/2 edges, exactly d − 1 will be chosen.
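The three steps of Kruskal's algorithm can be sketched in base R as follows. The function name `kruskal_max` is illustrative, the input is assumed to be a symmetric weight matrix, and cycles are detected by tracking a component label for each node, which is one simple choice among several.

```r
# Maximum weight spanning tree by Kruskal's algorithm.
# w: symmetric d x d matrix of edge weights. Returns a (d - 1) x 2 edge matrix.
kruskal_max <- function(w) {
  d <- nrow(w)
  edges <- which(upper.tri(w), arr.ind = TRUE)          # all pairs j < k
  edges <- edges[order(w[upper.tri(w)], decreasing = TRUE), , drop = FALSE]
  component <- seq_len(d)                               # component label per node
  tree <- matrix(integer(0), ncol = 2)
  for (e in seq_len(nrow(edges))) {
    j <- unname(edges[e, 1]); k <- unname(edges[e, 2])
    if (component[j] != component[k]) {                 # no cycle: keep the edge
      tree <- rbind(tree, c(j, k))
      component[component == component[k]] <- component[j]
    }
    if (nrow(tree) == d - 1) break
  }
  tree
}

# Toy weights on 4 nodes.
w <- matrix(0, 4, 4)
w[1, 2] <- 0.9; w[2, 3] <- 0.8; w[1, 3] <- 0.7
w[3, 4] <- 0.2; w[1, 4] <- 0.1; w[2, 4] <- 0.05
w <- w + t(w)
tree <- kruskal_max(w)
print(tree)
print(sum(w[tree]))   # total weight of the selected tree
```

In the toy example the edge (1, 3) is discarded because nodes 1 and 3 are already connected through node 2.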

Lemma 15.12. Kruskal's algorithm returns the tree with the maximum weight.

Proof The result may be proved by induction. It is clearly true for 2 nodes. Assume that it is
true for d nodes and consider a collection of d + 1 nodes, labelled (X1 , X2 , . . . , Xd+1 ), where they are
ordered so that for each j = 1, . . . , d + 1, the maximal tree from (X1 , . . . , Xj ) gives the maximal tree
from any selection of j nodes from the full set of d + 1 nodes. Let $b_{(i,j)}$ denote the weight of edge (i, j)
for 1 ≤ i < j ≤ d + 1. Edges will be considered to be undirected. Let $T_j^{(d+1)}$ denote the maximal tree
obtained by selecting j nodes from the d + 1 and consider $T_{d+1}^{(d+1)}$.
Let Z denote the leaf node in $T_{d+1}^{(d+1)}$ such that among all leaf nodes in $T_{d+1}^{(d+1)}$ the edge (Z, Y ) in
$T_{d+1}^{(d+1)}$ has the smallest weight. Removing the node Z gives the maximal tree on d nodes from the set
of d + 1 nodes. This is seen as follows. Clearly, there is no tree with larger weight that can be formed
with these d nodes, otherwise the tree on d nodes with larger weight, with the addition of the leaf
(Z, Y ) would be a tree on d + 1 nodes with greater weight than $T_{d+1}^{(d+1)}$. It follows that $Z = X_{d+1}$ and
hence that $X_{d+1}$ is a leaf node of $T_{d+1}^{(d+1)}$.
By the inductive hypothesis, $T_d^{(d+1)}$ may be obtained by applying Kruskal's algorithm to the weights
$(b_{(i,j)})_{1 \le i < j \le d}$. Now consider an application of Kruskal's algorithm to the weights $(b_{(i,j)})_{1 \le i < j \le d+1}$ and
note that for any (i, j) with i < j such that the undirected edge $(X_i, X_j)$ forms part of the tree $T_d^{(d+1)}$,
$b_{(i,d+1)} < b_{(i,j)}$ and $b_{(j,d+1)} < b_{(i,j)}$. Therefore, if the edges $(b_{(i,j)})_{1 \le i < j \le d+1}$ are listed according to their
weight and the Kruskal algorithm applied, then all the edges used in $T_d^{(d+1)}$ will appear further up the
list than any edge $(b_{(k,d+1)})_{k=1}^{d}$ and therefore all the edges of $T_d^{(d+1)}$ will be included by the algorithm
before the edges $(b_{(k,d+1)})_{k=1}^{d}$ are considered. It follows that $T_{d+1}^{(d+1)}$ is the graph obtained by applying
Kruskal's algorithm to the nodes (X1 , . . . , Xd+1 ).


Corollary 15.13 (Prim's Algorithm). The tree of maximal weight may be chosen by choosing any
initial node Ci , adding a link Ci − Cj where j is chosen such that bij = maxk bik and at each stage,
adding the node Ca to the tree that maximises bak over nodes Ck already in the tree.

Proof It is clear that, with the same ordering of the weight, Prim's algorithm returns the same tree
as Kruskal's algorithm.
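The node-by-node construction in Corollary 15.13 can be sketched as follows. The function name and the weights are hypothetical, chosen only for illustration; the sketch checks on one small example that two different starting nodes lead to the same set of edges.

```python
def prim_max_spanning_tree(d, weight, start=0):
    """Grow a maximum weight spanning tree from `start`, one node at a time.

    `weight` maps a pair (i, j) with i < j to the weight of the edge i - j.
    """
    def w(i, j):
        return weight[(min(i, j), max(i, j))]

    in_tree = {start}
    tree = []
    while len(in_tree) < d:
        # Choose the node a outside the tree and the node k inside the tree
        # that maximise the weight b_{ak}.
        a, k = max(((a, k) for a in range(d) if a not in in_tree for k in in_tree),
                   key=lambda e: w(*e))
        in_tree.add(a)
        tree.append((k, a))
    return tree

# Hypothetical weights with no ties.
weights = {(0, 1): 0.9, (0, 2): 0.2, (0, 3): 0.1,
           (1, 2): 0.8, (1, 3): 0.3, (2, 3): 0.7}
from_node_0 = {frozenset(e) for e in prim_max_spanning_tree(4, weights, start=0)}
from_node_3 = {frozenset(e) for e in prim_max_spanning_tree(4, weights, start=3)}
print(from_node_0 == from_node_3)
```

This naive version rescans all candidate pairs at every step; a priority queue would be the usual refinement.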

15.4.4 The Chow-Liu Algorithm and Polytrees


A probability distribution $P_{X_1,\ldots,X_d}$ factorises according to a polytree if there is an ordering σ of the
variables such that
$$P_{X_1,\ldots,X_d} = \prod_{j=1}^{d} P_{X_{\sigma(j)} \mid \Pi_j^{(\sigma)}}, \qquad \Pi_j^{(\sigma)} \subseteq \{X_{\sigma(1)}, \ldots, X_{\sigma(j-1)}\}$$
and where the directed graph, formed by placing directed edges from each variable in $\Pi_j^{(\sigma)}$ to $X_{\sigma(j)}$
for j = 1, . . . , d, is a tree.
A distribution that factorises along a polytree satisfies the condition: if $\Pi_j^{(\sigma)} = \{Y_1^{(\sigma,j)}, \ldots, Y_m^{(\sigma,j)}\}$,
then
$$P_{\Pi_j^{(\sigma)}} = \prod_{k=1}^{m} P_{Y_k^{(\sigma,j)}}.$$
To extend the Chow-Liu algorithm to polytrees, the conditional mutual information is required:
$$I(A, C \mid B) = \sum_{x_{A \cup B \cup C}} P_{A \cup B \cup C}(x_{A \cup B \cup C}) \ln \frac{P_{A \cup C \mid B}(x_A, x_C \mid x_B)}{P_{A \mid B}(x_A \mid x_B)\, P_{C \mid B}(x_C \mid x_B)}.$$
The following lemma is required.
Lemma 15.14. If A ⊥P B∣C , then
min(I(A, C), I(B, C)) ≥ I(A, B).

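Proof This follows from observing that if $A \perp_P B \mid C$, then
$$I(A, B) + I(A, C \mid B) = I(A, C), \qquad I(A, B) + I(C, B \mid A) = I(B, C),$$
which follows from:
$$A \perp_P B \mid C \;\Leftrightarrow\; P_{A \cup C \mid B} = \frac{P_{A \cup B \cup C}}{P_B} = \frac{P_{A \cup C}\, P_{B \cup C}}{P_B\, P_C}$$
and hence
$$I(A, C \mid B) = \sum P_{A \cup B \cup C} \ln \frac{P_{A \cup C}\, P_B}{P_{A \cup B}\, P_C} = I(A, C) - I(A, B).$$
The other is similar. Since conditional mutual information is non-negative, the two identities give the stated inequality.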
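The identity used in the proof can be checked numerically on a small example. The sketch below builds a hypothetical binary chain A → C → B (so that A ⊥ B ∣ C holds by construction) with invented conditional probability tables, and compares I(A, B) + I(A, C∣B) with I(A, C). The helper names are assumptions made for the illustration.

```python
import math
from collections import defaultdict
from itertools import product

def marginal(joint, idx):
    """Marginal pmf of the coordinates listed in idx, keyed by tuples."""
    out = defaultdict(float)
    for x, p in joint.items():
        out[tuple(x[i] for i in idx)] += p
    return out

def mutual_information(joint, a, b):
    """I(a, b) in nats, for coordinate positions a and b."""
    pa, pb, pab = marginal(joint, [a]), marginal(joint, [b]), marginal(joint, [a, b])
    return sum(p * math.log(p / (pa[(u,)] * pb[(v,)]))
               for (u, v), p in pab.items() if p > 0)

def conditional_mutual_information(joint, a, c, b):
    """I(a, c | b) = sum P(a,c,b) ln[ P(a,c,b) P(b) / (P(a,b) P(c,b)) ]."""
    pb = marginal(joint, [b])
    pab, pcb = marginal(joint, [a, b]), marginal(joint, [c, b])
    pacb = marginal(joint, [a, c, b])
    return sum(p * math.log(p * pb[(w,)] / (pab[(u, w)] * pcb[(v, w)]))
               for (u, v, w), p in pacb.items() if p > 0)

# Hypothetical chain A -> C -> B on binary variables.
p_a = [0.6, 0.4]
p_c_given_a = [[0.8, 0.2], [0.3, 0.7]]
p_b_given_c = [[0.9, 0.1], [0.25, 0.75]]
A, B, C = 0, 1, 2  # coordinate positions in the joint pmf
joint = {(a, b, c): p_a[a] * p_c_given_a[a][c] * p_b_given_c[c][b]
         for a, b, c in product(range(2), repeat=3)}

i_ab = mutual_information(joint, A, B)
i_ac = mutual_information(joint, A, C)
i_bc = mutual_information(joint, B, C)
i_ac_given_b = conditional_mutual_information(joint, A, C, B)
print(i_ab + i_ac_given_b, i_ac)   # the two values agree up to rounding
print(min(i_ac, i_bc) >= i_ab)
```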

Theorem 15.15. Suppose that P factorises according to a polytree. Kruskal's algorithm will locate the
skeleton of the polytree.

Proof Let A = {i}, B = {j} and C = {k} be three distinct nodes. Assume that i ⊥P j∣k. This can
happen in the following cases:

i → k → j, i ← k ← j, i ← k → j,

where m → n indicates a directed path. In all cases, I(i, k∣j) > 0 and I(k, j∣i) > 0. It follows from the identities in the proof of Lemma 15.14 that

min(I(i, k), I(j, k)) > I(i, j).

Kruskal's algorithm takes the edge of largest weight that does not form a cycle. The algorithm will
therefore not choose the edge (i, j) if there is a node k between i and j in the polytree.
For i → k ← j, i ⊥P j, so that I(i, j) = 0, and hence the edge i − j will not be chosen by Kruskal's algorithm.

It is straightforward to find appropriate directions for the edges; if there are edges i − j − k, then the
edges take directions i → j ← k if and only if I(i, k) = 0.

15.5 Asymptotic Consistency of the Maximum Likelihood Estimate


Let $W(T_d^{(ML)}(n)) = \sum_{r=2}^{d} \hat{I}(i_r, j_r)$ denote the weight of the tree computed by the Chow-Liu algorithm.
Suppose that there is a distribution $P_0$ which is the true, but unknown distribution. Let

$$P_{(\sigma)_0} = \operatorname{argmin}_{P \in \mathcal{P}(\mathcal{X}, \mathcal{T}_d)} D_{KL}(P_0 \| P).$$

Then $P_{(\sigma)_0}$ is the reverse I-projection of $P_0$ and the corresponding structure $\sigma^{(0)} = (i_r^0, j_r^0)_{r=2}^{d}$ is the
Chow-Liu dependence structure. If $P_0 \in \mathcal{P}(\mathcal{X}, \mathcal{T}_d)$, then $P_{(\sigma)_0} = P_0$.
Let $T_d^0$ denote the tree structure corresponding to $\sigma^{(0)}$; then
$$W(T_d^0) = \sum_{r=2}^{d} I^0(i_r^0, j_r^0)$$
where $I^0(i_r^0, j_r^0)$ are the mutual informations computed with $P_0^{\downarrow (i_r^0, j_r^0)}$, and $T_d^0$ is the tree of maximal
weight.
Let
$$\hat{P}^{(\sigma_{ML})}_{ML;n} = \operatorname{argmin}_{P \in \mathcal{P}(\mathcal{X}, \mathcal{T}_d)} D_{KL}(P_n \| P)$$
where $P_n$ denotes the empirical distribution and $\sigma_{ML}$ denotes the maximum likelihood Chow-Liu
dependence structure. Let $W(T; n)$ denote the weight of tree T based on probability distribution $P_n$
and, in particular, let $W(T_d^{(ML)}(n); n)$ denote the Chow-Liu dependence tree weight based on $\sigma_{ML}$
and $P_n$. Then the following result holds:

Theorem 15.16.
$$W(T_d^{(ML)}(n); n) \xrightarrow{n \to +\infty} W(T_d^0) \qquad P_0\text{-a.s.}$$

Proof This is a consequence of the strong law of large numbers; firstly, since $\mathcal{X}$ is finite, the strong
law of large numbers gives that
$$\max_{x \in \mathcal{X}} |P_n(x) - P_0(x)| \xrightarrow{n \to +\infty} 0 \qquad P_0\text{-a.s.}$$
from which a.s. convergence of all empirical marginal distributions follows and in particular
$$\hat{I}_n(i, j) \xrightarrow{n \to +\infty} I^0(i, j)$$
for all pairs (i, j). It follows that for each tree $T_d$,
$$W(T_d; n) \xrightarrow{n \to +\infty} W(T_d) \qquad P_0\text{-a.s.}$$
and hence, since $\mathcal{T}_d$ is a finite set,
$$\max_{T_d \in \mathcal{T}_d} |W(T_d; n) - W(T_d)| \xrightarrow{n \to +\infty} 0.$$
Note that, by construction, $W(T_d; n) \le W(T_d^{(ML)}(n); n)$ for all $T_d$ and each n. Now let
$$\mathcal{T}_d^0 = \{T_d \in \mathcal{T}_d \mid W(T_d) = W(T_d^0)\}.$$
Since $\mathcal{T}_d$ is finite, there is a positive constant δ such that
$$\delta = \min_{T_d \in \mathcal{T}_d \setminus \mathcal{T}_d^0} |W(T_d^0) - W(T_d)| > 0.$$
Choose n large enough such that $P_0$-a.s.
$$\max_{T_d \in \mathcal{T}_d} |W(T_d; n) - W(T_d)| \le \frac{\delta}{2}.$$
There is an $n_\delta$ such that this holds for all $n \ge n_\delta$ and such that there is a tree in $\mathcal{T}_d^0$, say $T_d^0$, such that
$W(T_d^{(ML)}(n)) = W(T_d^0)$ and such that
$$|W(T_d^0; n) - W(T_d^0)| \le \frac{\delta}{2}.$$
In other words, for any ϵ > 0 with $\delta/2 > \epsilon$, it holds that for $n > n_\epsilon$,
$$|W(T_d^{(ML)}(n); n) - W(T_d^0)| < \epsilon \qquad P_0\text{-a.s.}$$

This result does not assert convergence of the sequence of trees $T_d^{(ML)}(n)$ unless the set $\mathcal{T}_d^0$ contains
exactly one element.

15.6 Classification
Many of the techniques of supervised learning, or classification, involve a Bayes rule and an approximate
distribution. Variables are of two types, symptom variables $X_O$ (O for observable) and class
variables, or diagnosis variables, $X_C$. A prior distribution $P_C$ is placed over the class variables, evidence
is obtained in the form of an instantiation $x_O$ of the symptom variables $X_O$, and the posterior
distribution over the class variables is obtained using Bayes rule:

$$P_{C \mid O} = \frac{P_C\, P_{O \mid C}}{P_O} \propto P_C\, P_{O \mid C}.$$

In supervised classification, the probabilities $P_{O \mid C}$ are learned by observing the instantiations $x_O$ in
training examples where $x_C$ is given. When classifying (where the class $x_C$ is unknown), the class that
maximises $P_C P_{O \mid C}$ is chosen, for a given set of symptoms $x_O$.

Often in classification, the distribution $P_{O \mid C}$ has too many states and instead a set of lower dimensional
marginals is considered:

$$\mathcal{P}(x_C) = \{P_{A_j \mid B_j, C}(\cdot \mid \cdot, x_C) : j = 1, \ldots, s\}$$

where for each $x_C$, $S_C := (A_j, B_j)_{j=1}^{s}$ is a dependence structure. The dependence structures may depend
on $x_C$. The class variable $x_C$ is then chosen to maximise
$$P_C\, P_{A_1 \mid C} \prod_{j=1}^{s} P_{A_j \mid B_j, C}.$$

The Naïve Classifier The naïve classifier considers $X_1, \ldots, X_d$ to be independent conditioned on
C, so that the approximation Q to the probability distribution P, given by

$$Q_{X \mid C} = \prod_{j=1}^{d} P_{X_j \mid C}$$

is used. The aim is then, for an observation x, to find the value c that maximises $P_C Q_{X \mid C}(x \mid \cdot)$.

Classification comes in two stages; firstly, constructing the classifier. For constructing the classifier,
a large number of observations of X are made, assumed independent, for each value of C, where the
value of C is known. From this, $P_{X_j \mid C}$ is estimated by $\hat{P}_{X_j \mid C}(x \mid c) = n(x, c)/n(c)$ where n(x, c) is the number
of observations with $(X_j, C) = (x, c)$ in the sample and n(c) is the number of observations with C = c.

If a prior distribution $P_C$ has been placed over the class variable C, the score function is then

$$P_C(c) \prod_{j=1}^{d} \hat{P}_{X_j \mid C}(x_j \mid c)$$

and an observation x is assigned to the class c that maximises this function. If there is no prior, then
the likelihood function $\prod_{j=1}^{d} \hat{P}_{X_j \mid C}(x_j \mid c)$ is used.
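The construction and use of the naïve classifier can be sketched as follows, assuming discrete features. The function names and the toy training set are hypothetical. The estimates are the plain relative frequencies n(x, c)/n(c), so a feature value never seen with a class gives that class a score of zero; no smoothing is applied in this sketch.

```python
from collections import Counter

def fit_naive_classifier(samples, labels):
    """Estimate P(X_j = x | C = c) by the relative frequency n(x, c) / n(c)."""
    n_c = Counter(labels)
    n_xc = Counter()
    for x, c in zip(samples, labels):
        for j, xj in enumerate(x):
            n_xc[(j, xj, c)] += 1

    def cond_prob(j, xj, c):
        return n_xc[(j, xj, c)] / n_c[c]

    return n_c, cond_prob

def classify(x, n_c, cond_prob, prior=None):
    """Return the class maximising prior(c) * prod_j P(x_j | c).

    With prior=None the likelihood alone is used.
    """
    best, best_score = None, -1.0
    for c in sorted(n_c):
        score = prior[c] if prior is not None else 1.0
        for j, xj in enumerate(x):
            score *= cond_prob(j, xj, c)
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical training data: two binary features, two classes.
samples = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 1), (0, 0)]
labels = ["a", "a", "a", "b", "b", "b"]
n_c, cond_prob = fit_naive_classifier(samples, labels)
print(classify((1, 1), n_c, cond_prob))
print(classify((0, 0), n_c, cond_prob, prior={"a": 0.5, "b": 0.5}))
```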

Example 15.17.

In the article [28] by Chow and Liu, the example of character recognition is discussed. A person writes
a number, 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9 in a rectangular space and the machine has to recognise which of
the ten characters has been written. The rectangle is split into a 12 × 8 grid and each of the 96 spaces
is coded as a 1 or a 0 depending on whether or not the cell contains writing. In the example, 7000 numerals
were used as training examples to construct the classifier, which was then applied to 12000 examples,
with a success rate of 91%.

Chow-Liu Tree Suppose that $V = \{X_1, \ldots, X_d, C\}$, where $X = (X_1, \ldots, X_d)$ is a random vector to
be observed and C is a class variable. With classification, an observation x is assigned to the category
c that maximises $P_C(\cdot) P_{X \mid C}(x \mid \cdot)$.
The Chow-Liu tree presents an improvement over the naïve classifier. For each category $c \in \mathcal{C}$,
the best fitting Chow-Liu tree is estimated from the training examples:
$$Q_{X \mid C} = \prod_{j=1}^{d} \hat{P}_{X_j \mid X_{\pi_c(j)}, C}$$

and then the observation x is assigned to the category c that maximises the score function $S_{C,X} =
P_C Q_{X \mid C}(x \mid \cdot)$ if there is a prior $P_C$ over the categories, or the score function $S_{C,X} = Q_{X \mid C}(x \mid \cdot)$ if the
initial assessment is that all categories are equally likely.
The article [28], which introduced the Chow-Liu tree, considers the problem of machine recognition
of handwritten numerals, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. There are c = 10 pattern classes. Let $a_i$ denote the
numeral i. There is a prior distribution $p = (p_0, p_1, \ldots, p_9)$ over the numerals. The numeral is written on
a 12 × 8 rectangle and 96 binary measurements are used to represent it; 1 if the cell contains
writing and 0 otherwise. In the example given in [28], 19000 numerals produced by 4 inventory clerks
were scanned. 7000 of these were employed as training examples, to find the best fitting trees and
estimate the probabilities $p_0, \ldots, p_9$. The optimal trees for each of the 10 numerals were obtained. For
the remaining numerals, the observation $x = (x_1, \ldots, x_{96})$ was considered. By Bayes rule,

$$P_{C \mid X}(a_k \mid x) = \frac{P_{X \mid C}(x \mid a_k)\, P_C(a_k)}{P_X(x)}.$$

The quantity $P_{X \mid C}(x \mid a_k)$ was estimated by $Q_{X \mid C}(x \mid a_k)$ and the following classification rule was used:
the numeral was declared to be of class $a_k$ if $P_C(a_k) Q_{X \mid C}(x \mid a_k) \ge P_C(a_i) Q_{X \mid C}(x \mid a_i)$ for all i ≠ k. Using
the trees, the error rate was reduced from 0.09 to 0.04 compared with the model produced by assuming
independence between the contents of the 96 cells.
Chapter 16

Constraint-Based Structure Learning Algorithms

16.1 Structure Learning


Let X = (X1 , . . . , Xd ) be a random (row) vector and X an n × d random matrix, where each row is
an i.i.d. copy of X . Let x be an n × d data matrix, where the modelling assumption is that x is an
instantiation of X.
The object of structure learning is to learn the DAG of a Bayesian Network for X from x.
Structure learning algorithms fall broadly into two categories; search and score, and constraint
based. Many of the learning algorithms available are hybrid algorithms, which involve both constraint-
based and search-and-score principles.
For search and score algorithms, a score function is used to score each candidate network and the
search looks for a network with a high score. The score is usually based on the likelihood function
(often the Cooper-Herskovits likelihood, Equation (12.15)), which is combined with a penalisation term
for models that have a large number of parameters. Since the search space tends to be large, these
algorithms tend to be computationally expensive. They are considered in later chapters.
The other category of structure learning algorithm is constraint based. They tend to make the
rather bold assumption that there exists a faithful DAG for the underlying probability distribution
and work according to the principle of Theorem 2.3. That is, starting with a complete graph, they
remove an edge α ∼ β from the skeleton whenever a conditional independence statement Xα ⊥ Xβ ∣XS is
established for some subset S ⊆ V ∖ {α, β}. Here, using ϕ to denote the empty set, Xα ⊥ Xβ ∣Xϕ means:
Xα ⊥ Xβ. A resulting vee-structure α − γ − β (where α ≁ β) is declared an immorality if γ ∉ S; it is
declared not to be an immorality if γ ∈ S. The edges of the skeleton corresponding to immoralities
are directed accordingly, the other compelled edges are directed and the algorithm returns an essential
graph.


16.2 Testing for Conditional Independence


16.2.1 Gaussian variables
For multivariate Gaussian variables, a test of X ⊥ Y∣S may be carried out using partial correlation.

Definition 16.1 (Partial Correlation). The partial correlation $\rho_{X,Y \mid S}$ between X and Y given S is
defined as
$$\rho_{X,Y \mid S} = \rho\left(X - \Sigma_{XS} \Sigma_{SS}^{-1} S,\; Y - \Sigma_{YS} \Sigma_{SS}^{-1} S\right),$$

where $\Sigma_{SS}$ is the covariance matrix of S, $\Sigma_{XS}$ is the covariance between X and S, $\Sigma_{YS}$ is the covariance
between Y and S. The partial correlation may be viewed in the following way: regress X on S
and regress Y on S; the partial correlation is the correlation between the residuals from these two
regressions.
For multivariate Gaussian variables, X ⊥ Y∣S if and only if $\rho_{X,Y \mid S} = 0$. To test this, first regress X against S
and store the residuals $R_1$, and regress Y against S and store the residuals $R_2$. The estimated partial
correlation is the sample correlation between $R_1$ and $R_2$. Fisher's z-transform of the partial correlation
is defined as:

$$z(\hat{\rho}_{X,Y \mid S}) = \frac{1}{2} \log\left(\frac{1 + \hat{\rho}_{X,Y \mid S}}{1 - \hat{\rho}_{X,Y \mid S}}\right).$$

Consider the null hypothesis $H_0: \rho_{X,Y \mid S} = 0$ versus the alternative $H_1: \rho_{X,Y \mid S} \ne 0$ (two sided test).
The null hypothesis is rejected at significance level α if and only if

$$\sqrt{n - |S| - 3}\; |z(\hat{\rho}_{X,Y \mid S})| \ge z_{\alpha/2}$$

where $z_\alpha$ is the value such that $P(Z \ge z_\alpha) = \alpha$ for $Z \sim N(0, 1)$. The distribution of the sample partial
correlation was described by Fisher (1924) [42].
Under the assumption that the variables are multivariate Gaussian, the statement X ⊥ Y∣S (where
X and Y are variables and S is a vector) may be tested by considering $\hat{\Sigma}^{-1}$, where Σ is the covariance
matrix of (X, Y, S) and $\hat{\Sigma}^{-1}$ is either the inverse, or a generalised inverse, of the estimated covariance matrix $\hat{\Sigma}$. If X ⊥ Y∣S then
$(\Sigma^{-1})_{XY} = 0$.
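The test can be sketched using only the Python standard library. Instead of running two regressions, the sketch computes the sample partial correlation with the standard recursion that removes one conditioning variable at a time, which gives the same value as the correlation of the residuals. All function names are assumptions made for the illustration, and the data are simulated from a hypothetical Gaussian chain X → Z → Y, for which X ⊥ Y ∣ Z holds.

```python
import math
import random
from statistics import NormalDist

def correlation(u, v):
    """Sample correlation of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def partial_correlation(data, x, y, s):
    """Sample partial correlation of columns x and y given the columns in tuple s."""
    if not s:
        return correlation(data[x], data[y])
    z, rest = s[0], s[1:]
    r_xy = partial_correlation(data, x, y, rest)
    r_xz = partial_correlation(data, x, z, rest)
    r_yz = partial_correlation(data, y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def fisher_z_independent(data, x, y, s, alpha=0.05):
    """Return True when H0: rho_{X,Y|S} = 0 is NOT rejected at level alpha."""
    n = len(data[x])
    r = partial_correlation(data, x, y, s)
    z = 0.5 * math.log((1 + r) / (1 - r))           # Fisher's z-transform
    stat = math.sqrt(n - len(s) - 3) * abs(z)
    return stat < NormalDist().inv_cdf(1 - alpha / 2)  # compare with z_{alpha/2}

# Simulated data from the chain X -> Z -> Y.
random.seed(0)
n = 2000
xs = [random.gauss(0, 1) for _ in range(n)]
zs = [x + random.gauss(0, 1) for x in xs]
ys = [z + random.gauss(0, 1) for z in zs]
data = {"X": xs, "Y": ys, "Z": zs}

print(fisher_z_independent(data, "X", "Y", ()))       # marginal test
print(fisher_z_independent(data, "X", "Y", ("Z",)))   # test given Z
```

A return value of True means only that independence was not rejected, not that it was established; Section 16.2.3 discusses this distinction.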

16.2.2 Discrete Variables


For discrete variables, testing for conditional independence is carried out, quite simply, using the usual
$\chi^2$ test. To test whether or not X ⊥ Y∣S, let n(x, y, s) denote the number of times (X, Y, S) = (x, y, s)
appears in the data, and n(x, s), n(y, s), n(s) the number of instances of (X, S) = (x, s), (Y, S) = (y, s),
S = s respectively. The $G^2$ statistic, which is standard, is defined as

$$G^2(X, Y, S) = 2 \sum_{x,y,s} n(x, y, s) \log \frac{n(x, y, s)\, n(s)}{n(x, s)\, n(y, s)}. \qquad (16.1)$$

Asymptotically, this is distributed as a $\chi^2$ distribution on $(j_x - 1)(j_y - 1) j_s$ degrees of freedom, where
$j_x$, $j_y$ and $j_s$ are the number of values that X, Y and S respectively can take.
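Equation (16.1) can be sketched directly from the counts. The function name and the two toy data sets are hypothetical. Cells with n(x, y, s) = 0 contribute nothing to the sum, and the numbers of values $j_x$, $j_y$, $j_s$ are taken here to be the numbers of distinct values observed, which is an assumption of the sketch.

```python
import math
from collections import Counter

def g_squared(rows):
    """Return (G^2, degrees of freedom) for testing X ⊥ Y | S.

    `rows` is an iterable of observations (x, y, s).
    """
    rows = list(rows)
    n_xys = Counter(rows)
    n_xs = Counter((x, s) for x, _, s in rows)
    n_ys = Counter((y, s) for _, y, s in rows)
    n_s = Counter(s for _, _, s in rows)
    g2 = 2.0 * sum(n * math.log(n * n_s[s] / (n_xs[(x, s)] * n_ys[(y, s)]))
                   for (x, y, s), n in n_xys.items())
    jx = len({x for x, _, _ in rows})
    jy = len({y for _, y, _ in rows})
    js = len(n_s)
    return g2, (jx - 1) * (jy - 1) * js

# Counts that factorise exactly within the single stratum s = 0.
independent_rows = ([(0, 0, 0)] * 2 + [(0, 1, 0)] * 4 +
                    [(1, 0, 0)] * 3 + [(1, 1, 0)] * 6)
# X and Y always equal: as dependent as possible.
dependent_rows = [(0, 0, 0)] * 5 + [(1, 1, 0)] * 5

print(g_squared(independent_rows))
print(g_squared(dependent_rows))
```

The statistic is then compared with a quantile of the $\chi^2$ distribution with the returned degrees of freedom; for one degree of freedom at the 5% level that quantile is approximately 3.84.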

16.2.3 Hypothesis Testing and Statistical Theory


There are two basic difficulties with the method of declaring X ⊥ Y∣S when the null hypothesis is not
rejected at a significance level α. The first is that while the nominal significance is α, there are rather
many tests carried out. All that can be said about the true significance level is that it is at most
Nα, where N is the total number of tests carried out. Nevertheless, for a single hypothesis test, a
result `reject H0' at significance level α is a good indicator that the data suggests that the alternative
hypothesis H1 is true; if H0 is rejected, then the dependence represented by H1 is clearly and distinctly
present in the data matrix x, even if it is not necessarily present in the probability distribution of the
random vector X.
There is, however, a much more serious problem. In statistical theory, the conclusion reached when
a test fails to reject the null hypothesis is, simply, `there is insufficient evidence to reject the null
hypothesis'. The `court of law' metaphor is appropriate here; a `not guilty' verdict may simply mean
that the evidence is insufficient to establish guilt beyond all reasonable doubt. It does not establish that
the defendant did not commit the crime. There are two possible reasons for a failure to reject a null
hypothesis: either the null hypothesis happens to be true, or else the null hypothesis is false, but the
test is not sufficiently powerful to detect this. The constraint based algorithms discussed all accept an
independence statement X ⊥ Y∣S if the result of the test is `do not reject independence'. The problem is
that these independence statements are added to the list of constraints, and the output network satisfies
the corresponding D-separation statements, even if they contradict some of the `reject conditional independence'
statements that have been obtained by rejecting a null hypothesis of conditional independence.
Several approaches have been suggested to try to limit acceptance of conditional independence statements
that are incompatible with the independence statements rejected. A. Fast in [41] suggests using the power
of the test, but points out the computational difficulties with this. He does not, though, address the
problem that if X ⊥̸ Y∣S, then the corresponding D-connection statement should be in the network.
Blomberg and Margaritis in [9] formalise the identification of all inconsistencies that stem from standard
probability theory and provide respective algorithms.
All the constraint based algorithms discussed need to be modified to ensure that the conditional
dependence statements obtained by rejecting conditional independence statements correspond to D-
connection statements in the resulting DAG.

Some of the most prominent constraint-based algorithms are now described.

16.3 The K2 Structural Learning Algorithm


Let X = (X1 , . . . , Xd ) and V = {X1 , . . . , Xd , C}, where C is a class variable, (X1 , . . . , Xd ) are variables
to be observed, from which the class should be inferred.
The K2 structure learning algorithm, introduced by Cooper and Herskovits, is an algorithm to
locate associations between the variables $(X_1, \ldots, X_d)$. Since the number of entries required to define
the conditional probability functions increases exponentially with the number of parents, the algorithm
limits the number of parents a node can take. An upper limit of four parents is a value widely used.

The algorithm assumes that an order has been established for the d nodes $X_1, \ldots, X_d$ so that,
for each i, the parent nodes $\mathrm{Pa}_i$ for variable $X_i$ are chosen from among the nodes $X_1, \ldots, X_{i-1}$. For
j = 1, . . . , i − 1, the empirical Kullback-Leibler divergence between the two empirical probability
distributions of $(X_1, \ldots, X_i)$, one determined by the graph with and the other determined by the
graph without the directed edge (i, j), is measured and the edge is retained if a) the divergence is
sufficiently large and b) node i does not already have 4 parents. That is, if $\mathrm{Pa}_i$ is the current parent
set of $X_i$ and $X_j$ is under consideration, the quantity in question is

$$Q(i, j) = \sum \hat{P}_{X_i, \mathrm{Pa}_i, X_j} \log \frac{\hat{P}_{X_i, \mathrm{Pa}_i, X_j}\, \hat{P}_{\mathrm{Pa}_i}}{\hat{P}_{X_i, \mathrm{Pa}_i}\, \hat{P}_{\mathrm{Pa}_i, X_j}} = \sum \hat{P}_{X_i, \mathrm{Pa}_i, X_j} \log \frac{\hat{P}_{X_i \mid \mathrm{Pa}_i, X_j}}{\hat{P}_{X_i \mid \mathrm{Pa}_i}}.$$

Under the null hypothesis that $X_i \perp X_j \mid \mathrm{Pa}_i$, $2nQ(i, j) \sim \chi^2_{n(\mathrm{Pa}_i)(n(X_i)-1)(n(X_j)-1)}$, where $n(X_i)$
denotes the number of elements in the state space of $X_i$; similarly for $X_j$ and $\mathrm{Pa}_i$.
The resulting algorithm is a greedy algorithm, with all the advantages and disadvantages that this
implies.
When the K2 algorithm is used, the learnt structure depends entirely on the order chosen for the
variables before the learning process starts. It is therefore usual to repeat the algorithm
with several different randomly chosen orders (say 1000) and choose the best; the one with the lowest
Kullback-Leibler divergence between the fitted distribution and the empirical distribution.

Example 16.2 (Robotics).

This example is taken from the paper [79]. It shows an application of Bayesian network learning
techniques to task execution in mobile robots. The task here is for the robot to locate an open door
and travel through it.
The robot emits sonar pulses and is equipped with eight detectors, which detect the echoes. From
this information, it has to decide where the door is located.
An action has to be taken: step to left, right, or straight ahead. This is the class variable and the
class has to be determined by the signals received by the eight detectors. Since the signals are not
independent of each other (the echoes may be created by the same object), the model is improved by
incorporating a dependence structure.
In this experiment, the problem is to learn the structure of the Bayesian network and to estimate
the probability potentials from the training data base.
The K2 algorithm is employed to establish a suitable structure. For the robot learning example,
the maximum number of parents is set to four. The size of the probability potentials cannot be too large, since
the robot is expected to find the door and travel through it in real time.
The intensity of an echo may be modelled as a continuous random variable, but the variables are
discretised for computational convenience. In general, it is not convenient to use a variable with more
than 20 different values.
In the Bayesian Robotics experiment, the experiments were repeated 1000 times and nets with
optimal values selected.
The resulting network for the eight variables is shown in Figure 16.1.
[Figure: directed acyclic graph on the sensor nodes S1, . . . , S8]

Figure 16.1: Network produced by the K2 algorithm. Here the nodes Sj represent the signals received
by the sensors. The variable C , not shown, which is a parent to all the variables shown, denotes the
class variable, the action to be performed.

In addition to the 8 variables shown in the network, there is also a class variable C , the direction
to be taken, which is a parent of all the nodes in X = (S1 , S2 , S3 , S4 , S5 , S6 , S7 , S8 ). The network is
estimated using a uniform prior distribution over C , which is an ancestor variable for each random
ordering chosen for the nodes in X ; the action performed is the action that maximises p̂X,C where
p̂ is the estimate of the distribution from the training examples, factorised according to the DAG in
Figure 16.1.

16.4 Three phase dependency analysis


The three phase dependency analysis algorithm (denoted TPDA) was introduced by Cheng, Greiner,
Kelly, Bell and Liu (2002) [23], who write, `this TPDA algorithm is correct (i.e., will produce the
perfect model of the distribution) given a sufficient quantity of training data whenever the underlying
model is monotone DAG faithful.' The algorithm requires the faithfulness assumption to hold and relies
on Theorem 2.3. The TPDA algorithm works in three phases; draughting, thickening and thinning,
outlined in Algorithm 4, which gives the main steps of the algorithm. A precise description of the
algorithm, together with a proof that it returns a faithful DAG when one exists, is found in [23].
Strictly speaking, the TPDA algorithm is a hybrid algorithm, since the first stage (draughting) is
the Chow-Liu tree, which is a search and score procedure.

Algorithm 4 The Three Phase Dependency Analysis Algorithm

Stage 1: Draughting Locate the Chow-Liu tree
  This stage is simply Kruskal's algorithm
Stage 2: Thickening Add edges
  for i = 1, . . . , d − 1, j = i + 1, . . . , d do
    Let $C_{ij}$ be the set of neighbours of $X_i$ or $X_j$ on a path between $X_i$ and $X_j$
    if $X_i \not\perp X_j \mid C$ for every subset $C \subseteq C_{ij}$ then
      add an edge $X_i \sim X_j$
    else
      do not add an edge $X_i \sim X_j$
      and let $S_{ij}$ denote the set such that $X_i \perp X_j \mid S_{ij}$
    end if
  end for
Stage 3: Thinning Removing unnecessary edges
  for i = 1, . . . , d − 1, j = i + 1, . . . , d do
    Let $C_{ij}$ denote common neighbours of $X_i$ and $X_j$
    if $X_i \sim X_j$ and there is a set $C \subseteq C_{ij}$ such that $X_i \perp X_j \mid C$ then
      remove the edge between $X_i$ and $X_j$.
    end if
  end for
Stage 4: Directing edges For each vee structure $X_i \sim X_k \sim X_j$, $(X_i, X_k, X_j)$ is an immorality
  if $X_k \notin S_{ij}$, otherwise it is not. Once the immoralities have been added, the additional compelled
  edges are obtained using Meek's rules.

16.5 Fast Adjacency Search (FAS) algorithm


The FAS algorithm is perhaps the simplest constraint-based algorithm. Firstly, an input order of
the variables $(X_1, \ldots, X_d)$ is chosen, and then Algorithm 5 is applied. The algorithm works on the principle
that there exists a faithful graphical representation for the probability distribution. First, a complete
(undirected) graph is created. Then for n = 0, 1, 2, . . . an edge ⟨α, β⟩ is removed if and only if there
is a set $S_{\alpha,\beta}$ of size n such that $X_\alpha \perp X_\beta \mid X_{S_{\alpha,\beta}}$. This is the approach to finding the skeleton. A vee
structure (α, γ, β) is declared to be an immorality if and only if $\gamma \notin S_{\alpha,\beta}$ (known as the minimal sepset).
The remaining compelled edges are added using Meek's rules to obtain the essential graph. That is,
edges α ∼ β that appear in the structures given in Figure 2.10 (Definition 2.16, page 42) are directed as in
Figure 2.10.

16.6 PC and MMPC Algorithms


The PC algorithm was introduced by Spirtes, Glymour and Scheines [127] (1993) and was modified
to produce the MMPC algorithm in [137] (2006). It is an algorithm for locating the skeleton of a faithful
DAG (should such a DAG exist) and hence for constructing the essential graph. It works in three stages.
Firstly, a forward stage starts with an empty graph, and adds in all possible edges. There are possibly
too many edges after this stage. Secondly, a backward stage removes some of the edges. The resulting
graph, after the second stage, will contain no false negatives, but may still contain some false positives.
A third stage is implemented to remove the false positives. The algorithm runs as follows:

The algorithm starts with an input order for the variables $(X_1, \ldots, X_d)$. Stage 1 of the PC algorithm
is given in Algorithm 6.
The MMPC differs from the PC in one aspect: there is a gentle change whereby at each stage the
best variable is added into the parent set. Stage 1 of the MMPC algorithm is given in Algorithm 7.
After the first stage of the PC / MMPC algorithm, the candidate parent/children sets may contain
too many variables. The next stage prunes them. This is Algorithm 8.
After Stages 1 and 2 of the PC / MMPC algorithm, there may still be false positives. Suppose a
probability distribution may be represented by the DAG in Figure 16.2. Working from T, the node C
may enter the output, and remain in the output.

[Figure: DAG on T, A, B, C; consistent with the discussion below, its edges are T → A, B → A, A → C and B → C]

Figure 16.2: MMPC: A False Positive from Algorithm 8

This is because C is dependent on T, conditioned on all subsets of T's parents and children; namely,
ϕ (the empty set) and {A}. Note that the collider connection T → A ← B is opened when A is instantiated
so that, when A is instantiated and B is uninstantiated, T is D-connected with C. For ϕ (the empty
set), T → A → C is a chain connection, where A is uninstantiated, so that T is D-connected to C.

Algorithm 5 The FAS Algorithm


Start with Full Undirected Graph
Stage 0
for α = 1, . . . , d − 1, β = α + 1, . . . , d do
  if $X_\alpha \perp X_\beta$ then
    remove edge ⟨α, β⟩
    set $S_{\alpha,\beta} = \phi$
  end if
end for
for k ≥ 1 do
  Stage k
  for α = 1, . . . , d − 1, β = α + 1, . . . , d do
    Work through sets S ⊂ V ∖ {α, β}
    of size k from lowest to highest sum of indices
    if current graph contains edge ⟨α, β⟩ then
      Test $X_\alpha \perp X_\beta \mid X_S$. If true, then remove edge ⟨α, β⟩,
      set $S_{\alpha,\beta} = S$ and move to next (α, β) pair.
    end if
  end for
end for
Termination: Terminate when either all nodes have fewer than k neighbours, or else a pre-specified
terminal value is reached.
for α = 1, . . . , d − 1, β = α + 1, . . . , d, γ = 1, . . . , d; γ ≠ α, β do
  if α − γ − β is a vee-structure then
    if $\gamma \notin S_{\alpha,\beta}$ then
      α − γ − β is an immorality
    else
      α − γ − β is not an immorality
    end if
  end if
end for
Now direct the additional compelled edges using Meek's rules.
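The skeleton search and the identification of immoralities in Algorithm 5 can be sketched as follows. The conditional independence test is abstracted as a function `independent(a, b, S)`, which is an assumption of the sketch; in practice it would be one of the tests of Section 16.2. The example oracle is hand-written for the hypothetical DAG 0 → 2 ← 1, 2 → 3, and the final orientation step with Meek's rules is not included.

```python
from itertools import combinations

def fas_skeleton(d, independent, max_size=None):
    """Skeleton search in the style of Algorithm 5 on nodes 0..d-1.

    `independent(a, b, S)` returns True when X_a ⊥ X_b | X_S.
    Returns the set of remaining edges and the recorded sepsets.
    """
    edges = {frozenset(e) for e in combinations(range(d), 2)}  # complete graph
    sepset = {}
    k = 0
    while max_size is None or k <= max_size:
        for a, b in combinations(range(d), 2):
            if frozenset((a, b)) not in edges:
                continue
            others = [v for v in range(d) if v not in (a, b)]
            # Subsets of size k, from lowest to highest sum of indices.
            for S in sorted(combinations(others, k), key=sum):
                if independent(a, b, set(S)):
                    edges.discard(frozenset((a, b)))
                    sepset[(a, b)] = set(S)
                    break
        k += 1
        # Terminate when every node has fewer than k neighbours.
        if all(sum(v in e for e in edges) < k for v in range(d)):
            break
    return edges, sepset

def immoralities(d, edges, sepset):
    """Vee-structures a - c - b (a, b non-adjacent) with c outside the sepset of (a, b)."""
    found = []
    for a, b in combinations(range(d), 2):
        if frozenset((a, b)) in edges:
            continue
        for c in range(d):
            if c in (a, b):
                continue
            if (frozenset((a, c)) in edges and frozenset((c, b)) in edges
                    and c not in sepset[(a, b)]):
                found.append((a, c, b))
    return found

def oracle(a, b, S):
    """Hand-written D-separation oracle for the DAG 0 -> 2 <- 1, 2 -> 3."""
    pair = frozenset((a, b))
    if pair == frozenset((0, 1)):
        return not (S & {2, 3})   # the collider 2 and its descendant 3 must be unobserved
    if pair in (frozenset((0, 3)), frozenset((1, 3))):
        return 2 in S             # the chain through 2 is blocked by 2
    return False                  # adjacent pairs are never separated

edges, sepset = fas_skeleton(4, oracle)
print(sorted(tuple(sorted(e)) for e in edges))
print(immoralities(4, edges, sepset))
```

With an exact oracle the output depends only on the underlying DAG; with statistical tests the order in which pairs and conditioning sets are visited can affect the result.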

Algorithm 6 The PC Algorithm: Stage 1


for i = 1, . . . , d do
  initialise $Z_0^{(i)} = \phi$, the empty set.
  these will become the parent/children sets of the variables
end for
for i = 1, . . . , d do
  for j = 1, . . . , d, j ≠ i do
    check whether $X_i \perp X_j \mid Z_{j-1}^{(i)}$. If it is not, let $Z_j^{(i)} = Z_{j-1}^{(i)} \cup \{X_j\}$. If the independence statement
    holds, then set $Z_j^{(i)} = Z_{j-1}^{(i)}$ and set $S_{X_i X_j} = Z_{j-1}^{(i)}$. $S_{X_i X_j}$ is known as the sepset, or separating
    set; a set that satisfies $X_i \perp X_j \mid S_{ij}$.
  end for
  (for the skipped index j = i, set $Z_i^{(i)} = Z_{i-1}^{(i)}$)
end for
Set $Z^{(i)} = Z_d^{(i)}$.

Algorithm 7 The MMPC Algorithm: Stage 1


for i = 1, . . . , d do
  initialise $Z_0^{(i)} = \phi$, the empty set.
  these will become the parent/children sets of the variables
end for
for i = 1, . . . , d do
  for k = 1, . . . , d, k ≠ i do
    Let $j_k^* = \operatorname{argmax}_{j \notin \{j_1^*, \ldots, j_{k-1}^*\}} G(X_i, X_j, Z_{k-1}^{(i)})$.
    Check whether $X_i \perp X_{j_k^*} \mid Z_{k-1}^{(i)}$. If it is not, let $Z_k^{(i)} = Z_{k-1}^{(i)} \cup \{X_{j_k^*}\}$. If the independence statement
    holds, then set $Z_k^{(i)} = Z_{k-1}^{(i)}$ and set $S_{X_i X_{j_k^*}} = Z_{k-1}^{(i)}$. $S_{X_i X_{j_k^*}}$ is known as the sepset, or separating
    set; a set that satisfies $X_i \perp X_{j_k^*} \mid S_{i j_k^*}$.
  end for
end for
if i = d then
  $Z_d^{(i)} = Z_{d-1}^{(i)}$
end if
Set $Z^{(i)} = Z_d^{(i)}$.

Algorithm 8 The PC / MMPC Algorithm: Stage 2


Suppose that $Z^{(i)}$ contains k variables.
Label them $Y_1, \ldots, Y_k$. Let $Z_k^{(i)} = Z^{(i)}$.
for j = 0, . . . , k − 1 do
  Check whether there exists a set $S \subseteq Z_{k-j}^{(i)} \setminus \{Y_{k-j}\}$ such that $X_i \perp Y_{k-j} \mid S$
  if there is then
    set $Z_{k-j-1}^{(i)} = Z_{k-j}^{(i)} \setminus \{Y_{k-j}\}$, and set $S_{X_i Y_{k-j}} = S$
  else
    Set $Z_{k-j-1}^{(i)} = Z_{k-j}^{(i)}$.
  end if
end for
Let $Z^{(i)} = Z_0^{(i)}$
This set contains all the variables which have an edge either to or from the variable $X_i$.

T and C are D-separated if and only if A and B are simultaneously instantiated; that is, T ⊥
C∣{A, B}. But B is independent of T given the empty set, so it will be removed from $Z^{(T)}$. Therefore,
the link T − C will not be removed.

This is corrected by considering the parent / child sets of the other variables. When working from C,
both A and B will be in the parent / child set, and T ⊥ C∣{A, B}. The third stage of the algorithm
(Algorithm 9) removes these false positives.

Algorithm 9 The PC / MMPC Algorithm: Stage 3


Let $(Z^{(i)})_{i=1}^{d}$ denote the parent / child sets for all the variables arrived at after Algorithm 8.
Let $X_{\sigma(1)}, \ldots, X_{\sigma(k)}$ denote the set of variables in $Z^{(i)}$, the parent / child set for $X_i$ arrived at after
Algorithm 8
Set $Y_0^{(i)} = Z^{(i)}$.
for j = 1, . . . , k do
  set
  $$Y_j^{(i)} = \begin{cases} Y_{j-1}^{(i)} \setminus \{X_{\sigma(j)}\} & X_i \notin Z^{(\sigma(j))} \\ Y_{j-1}^{(i)} & X_i \in Z^{(\sigma(j))}. \end{cases}$$
end for
for i = 1, . . . , d do
  Set $Z^{(i)} = Y_k^{(i)}$
end for
This returns the complete parent / child set for $X_i$.
The sepsets for the variables removed in the third stage have already been established.

Establishing the Essential Graph Having recorded the sepsets, sets such that $X \perp Y \mid S_{XY}$, it
is now straightforward to construct the essential graph. For each vee structure (X, Z, Y) (that is, a
structure such that $\{X, Y\} \subset Z^{(Z)}$, but $X \notin Z^{(Y)}$), check whether or not $Z \in S_{XY}$. If $Z \in S_{XY}$, then
(X, Z, Y) is not an immorality; the edges X − Z − Y remain undirected at this stage. If $Z \notin S_{XY}$, then
(X, Z, Y) is an immorality.
Finally, add in the additional compelled edges using Meek's rules; edges α ∼ β that appear in the
structures given in Figure 2.10 (Definition 2.16, page 42) are directed as in Figure 2.10.

16.7 Recursive Autonomy Identication


The Recursive Autonomy Identification (RAI) algorithm is from Yehezkel and Lerner [150] (2009). Like the
FAS algorithm, it tries to keep the size of the sepsets as small as possible. The general idea is similar
to the FAS algorithm, but it tries to locate, and use, more of the chain graph structure of the essential
graph at each stage. At stage n + 1, instead of simply checking all possible subsets of size n + 1 to
determine whether or not there is a set S such that X ⊥ Y∣S, only those components of the current
chain graph after stage n that can have influence are considered. This reduces the number of tests
that have to be carried out at stage n + 1.
The algorithm assumes that there is a faithful graph and aims to locate its essential graph. It carries
out all relevant tests of X ⊥ Y∣S for X, Y ∈ V and S ⊂ V with ∣S∣ = n (the
subset S has n variables) before making tests of X ⊥ Y∣S where ∣S∣ = n + 1, since tests are less reliable
when the conditioning sets are larger.
The first step of the algorithm is as follows.

• Starting with a variable set V, the initial graph is the complete graph, with undirected edges
between each pair of variables {X, Y}.

• For each pair {X, Y} ⊂ V, it is checked whether or not X ⊥ Y and, if this holds, the edge X − Y
is removed. Record $S_{X,Y} = \phi$, the empty set ($S_{X,Y}$ is the separator).

• For each vee structure X − Z − Y where there is no edge X − Y, the triple (X, Z, Y) is an immorality
X → Z ← Y.

• The remaining compelled edges are added.

For each pair {X, Y} that does not have an edge between them, the set $S_{X,Y}$ used to determine the edge
removal using $X \perp Y \mid S_{X,Y}$ is recorded.
After this initialisation (stage 0), the algorithm proceeds recursively. At stage n+1, do the following.

• Start with the skeleton from stage n. For each vee-structure α − γ − β, the vee-structure is an
immorality if $\gamma \notin S_{\alpha,\beta}$ and it is not an immorality if $\gamma \in S_{\alpha,\beta}$. Add in the remaining compelled
edges. The resulting graph is the stage n essential graph, which is a chain graph. Locate the
chain components.

• Starting with a chain component that has no descendants and proceeding backwards, consider
in turn each chain component $G_C$ and the subgraph $G_D$ formed by taking the chain component
$G_C = (C, U_C)$ together with the chain components that contain parent variables of $G_C$ and all the
directed edges connecting these chain components. Let D denote the variable set for $G_D$. For each
Y ∈ C and each neighbour X of Y (consider first the parents in different connected components,
and then the undirected neighbours in the component $G_C$), check whether there is a set $S_{XY} \subset D$
of size n such that $X \perp Y \mid S_{XY}$. If there is, then remove the edge between X and Y and record
$S_{XY}$. Remove the chain component $G_C$ and proceed recursively until the whole graph has been
considered.

This is repeated until the size of the largest neighbour set in the undirected graph is equal to n,
at which point the algorithm terminates. Undirect all the edges, find the immoralities and add in the remaining
compelled edges. The output is the resulting essential graph.
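
Locating the chain components amounts to finding the connected components of the graph formed by the undirected edges alone. The following Python sketch shows one way to do this; the function name `chain_components` and the seven-node example graph are hypothetical and chosen only for illustration.

```python
def chain_components(nodes, undirected_edges):
    """Connected components of the graph restricted to its undirected edges."""
    adj = {v: set() for v in nodes}
    for a, b in undirected_edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for v in nodes:
        if v in seen:
            continue
        stack, comp = [v], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        components.append(comp)
    return components


# Hypothetical chain graph on seven nodes: directed edges are ignored,
# only the undirected edges are passed in.
nodes = [1, 2, 3, 4, 5, 6, 7]
undirected = [(3, 4), (3, 5), (4, 5), (2, 6), (2, 7), (6, 7)]
components = chain_components(nodes, undirected)
```

The components are returned in the order in which their first node appears in `nodes`; ordering them from descendants to ancestors, as RAI requires, would need the directed edges as well.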

Note In [150], the algorithm presented is slightly different; once an edge is directed, it is not
subsequently undirected. It is difficult to see the theoretical justification for this; the modification presented
here ensures that, when there is a faithful graph and assuming a perfect oracle, the output graph is
the essential graph.

Example 16.3 (Example for Recursive Autonomy Identification).


Suppose that the DAG in Figure 16.3 is faithful to the distribution $P_{X_1,X_2,X_3,X_4,X_5,X_6,X_7}$.

[Figure: DAG with directed edges X3 → X4, X3 → X5, X1 → X2, X4 → X2, X2 → X6, X6 → X7]

Figure 16.3: Example to illustrate RAI algorithm

Suppose also a `perfect oracle'; independence tests give the correct results. After the first round,
$X_1 \not\perp X_2$, $X_1 \not\perp X_6$, $X_1 \not\perp X_7$, but $X_1 \perp \{X_3, X_4, X_5\}$. $X_2 \not\perp X_j$ for any $j$, $X_3 \not\perp X_j$ for $j = 4, 5, 6, 7$,
$X_4 \not\perp X_j$ for $j = 5, 6, 7$, $X_5 \not\perp X_j$ for $j = 6, 7$ and $X_6 \not\perp X_7$.

After the CI tests with conditioning sets of size 0 have been carried out, the immoralities are
determined:
(X1 , X2 , X3 ), (X1 , X6 , X3 ), (X1 , X7 , X3 ), (X1 , X2 , X4 ), (X1 , X6 , X4 ),

(X1 , X7 , X4 ), (X1 , X2 , X5 ), (X1 , X6 , X5 ), (X1 , X7 , X5 ).

The edges X3 − X4, X3 − X5, X4 − X5, X2 − X6, X2 − X7 and X6 − X7 remain undirected.


Removing the directed edges, the chain components are A1 = {X1}, D = {X2, X6, X7} and
A2 = {X3, X4, X5}. D stands for descendant, A for ancestor.
Within D, X2 ⊥ X7 ∣X6 and this is the only CI statement with a conditioning set size 1. The edge
X2 − X7 is therefore removed and X2 − X6 − X7 is not an immorality, since X6 ∈ S2,7 (the sep set).
Within A2 , X4 ⊥ X5 ∣X3 , hence X4 − X5 is removed and X4 − X3 − X5 is not an immorality since
X3 ∈ S4,5 .
Now consider the directed edges from A1 and A2 to D. {X3 , X4 , X5 } ⊥ {X6 , X7 }∣X2 , leading to
removal of the 6 corresponding directed edges. {X3 , X5 } ⊥ {X2 }∣{X4 }. Finally, Meek's rules may be
used to direct X2 → X6 and X6 → X7 giving the essential graph in Figure 16.4.

[Figure: essential graph with directed edges X1 → X2, X4 → X2, X2 → X6, X6 → X7 and undirected edges X3 − X4, X3 − X5]

Figure 16.4: Example for RAI algorithm: essential graph

At this point, D is removed. Since A1 and A2 have no ancestors, they are considered separately and
the algorithm is finished. If A1 and A2 were descendants of other chain components, a chain component
with no descendants would be chosen and the algorithm would continue until all chain components had been
considered.
The algorithm is then repeated with conditioning sets of size 2, and so on, until the termination
condition is satisfied.

16.8 Incompatible Immoralities: EDGE-OPT Algorithm


It is possible that, in the constraint based structure learning algorithms, immoralities that are
incompatible with each other may emerge. This could, for example, be vee structures declared not to be
immoralities, which lead to a cycle of length ≥ 4 without a chord, or two immoralities that give opposite
orientations for an edge.
There are two possible reasons for incompatibilities; either the independence tests give inaccurate
results, or else there does not exist a faithful DAG.
A. Fast in [41] uses a constraint based method for dealing with this. The EDGE-OPT algorithm
starts with the edges of the skeleton produced by a constraint based algorithm (FAS, MMPC
or RAI), chooses an orientation of the edges at random to produce a DAG and then considers each
vee-structure in turn, deciding locally whether or not it should be a collider.
The list of constraints (that is, for each X, Y with no edge between them, the statement $X \perp Y \mid S_{XY}$,
and X ∼ Y if there is no sepset) is established before EDGE-OPT starts. At each stage, it chooses a
vee-structure, examines the possibilities for orientation of the edges in the vee structure, and chooses
the orientation that satisfies the largest number of constraints.

16.9 Hybrid Algorithms


We now consider various hybrid algorithms, which start by constraining the space and then use search
and score techniques within the constrained space.

16.9.1 The Maximum Minimum Hill Climbing Algorithm


The MMHC algorithm by Tsamardinos, Brown and Aliferis (2006) [137] is a hybrid algorithm. Firstly,
the set of edges of which the skeleton is a subset is obtained using the constraint-based Maximum
Minimum Parents Children algorithm. The sep sets, though, are not recorded, since they are
unnecessary. Having obtained the skeleton, the orientation of the edges is obtained via a search-and-score
procedure, known as the MMHC algorithm. It works as follows: let V = {X1, . . . , Xd} and denote the
current graph by G = (V, D).

• Start with the empty graph G = (V, D) where D = ϕ.

• At each stage, either add a directed edge to D, choosing an edge in E (the candidate edge set
obtained from MMPC) and directing it (any direction that does not produce a cycle is admissible),
or delete an edge from G, or reverse an
edge in G, or leave the graph unaltered. From all the possibilities of `add an edge', `delete an
edge', `reverse an edge', `leave the graph unaltered', choose the one that gives the greatest score;
that is, the operation that produces the greatest reduction in the Kullback Leibler divergence
between the probability modelled along the graph and the empirical probability.

• Repeat until the score is not changed.



The algorithm may be modified as follows: instead of the best change, make the best change that
results in a graph that has not already appeared. When 15 changes occur without an increase in the
best score ever encountered during the search, the algorithm terminates. The DAG that produced the
best score is then returned. In summary, MMHC starts with the constraint based MMPC stage to locate the skeleton
and then carries out a search and score based hill climbing stage, using the skeleton obtained from MMPC
as the candidate edge set. Two other hybrid methods are described below.
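
The basic (unmodified) search stage can be sketched as a greedy loop over add, delete and reverse moves restricted to the candidate edge set. The Python below is a minimal sketch under stated assumptions, not the implementation of [137]: the function names are hypothetical, the score is passed in as an arbitrary function of the edge set, and the toy score used in the example (closeness to a fixed target graph) merely stands in for a data-based score.

```python
def creates_cycle(edges, u, v):
    """True if adding u -> v closes a directed cycle, i.e. u is reachable from v."""
    stack, seen = [v], set()
    while stack:
        w = stack.pop()
        if w == u:
            return True
        if w not in seen:
            seen.add(w)
            stack.extend(b for a, b in edges if a == w)
    return False


def moves(dag, candidates):
    """Yield every DAG one add, delete or reverse operation away from dag."""
    for x, y in candidates:
        for u, v in ((x, y), (y, x)):
            if (u, v) in dag:
                rest = dag - {(u, v)}
                yield rest                              # delete u -> v
                if not creates_cycle(rest, v, u):
                    yield rest | {(v, u)}               # reverse to v -> u
            elif (v, u) not in dag and not creates_cycle(dag, u, v):
                yield dag | {(u, v)}                    # add u -> v


def hill_climb(candidates, score):
    """Greedy search from the empty graph; stop when no move improves the score."""
    dag = frozenset()
    while True:
        best = max(moves(dag, candidates), key=score, default=dag)
        if score(best) <= score(dag):
            return dag
        dag = best


# Toy score (an assumption): minus the number of edges that differ from a target.
target = frozenset({("A", "C"), ("B", "C")})
candidates = [("A", "B"), ("A", "C"), ("B", "C")]
result = hill_climb(candidates, lambda g: -len(g ^ target))
```

The modification described above (avoiding graphs that have already appeared and stopping after 15 non-improving changes) would replace the stopping rule in `hill_climb`; it is omitted to keep the sketch short.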

16.9.2 L1-Regularisation
One method, introduced by Schmidt, Niculescu-Mizil and Murphy (2007) [123], places constraints on
the model and then uses an L1 score function, described below, as the basis of a search and score
within the constrained space.
The method can be employed with Gaussian or binary variables. The binary case is outlined here.
In this algorithm, there is no restriction on the number of parents that a variable may have, but
there is a constraint on the way in which the parents influence the variable. The state space of variable
$X_j$ is {−1, 1} for each j and the conditional probabilities are modelled so that the logit function is
linear:

\[
\ln\left(\frac{p_{X_j \mid \mathrm{Pa}_j}(1 \mid \pi_j)}{1 - p_{X_j \mid \mathrm{Pa}_j}(1 \mid \pi_j)}\right) = \theta_{j,0} + \sum_{k=1}^{p_j} \theta_{j,k}\, \pi_{j,k} \qquad (16.2)
\]

where $\pi_j = (\pi_{j,1}, \ldots, \pi_{j,p_j})$, the configuration of $\mathrm{Pa}_j$, is a sequence of ±1 corresponding to the states of
the parent variables. The parent variables are only permitted to influence $\ln\frac{p}{1-p}$ linearly; no interactions
are permitted. This permits a large number of parents, since the number of parameters is linear, rather
than exponential, in the number of parents.
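
Inverting the logit in (16.2) gives the conditional probability as a logistic function of the linear predictor. The following Python sketch illustrates this; the function name `prob_plus_one` and the numerical parameter values are assumptions chosen for illustration.

```python
import math


def prob_plus_one(theta0, theta, parent_states):
    """P(X_j = 1 | parents) when the log-odds are theta0 + sum_k theta_k * pi_k."""
    eta = theta0 + sum(t * s for t, s in zip(theta, parent_states))
    return 1.0 / (1.0 + math.exp(-eta))


# Two parents with states +1 and -1; hypothetical parameter values.
p = prob_plus_one(0.2, [1.0, -0.5], [1, -1])
# With all parameters zero, both states are equally likely.
p_null = prob_plus_one(0.0, [0.0, 0.0], [1, -1])
```

Each parent contributes a single parameter, so a variable with p_j parents needs p_j + 1 parameters rather than one per parent configuration.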
The algorithm works in two stages: like the MMPC, it first produces candidate parent / children sets
for each variable. Having constrained the search space, it then uses a search and score algorithm to
determine the candidate parent / children sets. Having determined the parent / children sets, it runs the
hill climbing part of the MMHC algorithm of Tsamardinos, Brown and Aliferis to obtain the structure,
keeping the conditional probabilities of the form in Equation (16.2). For a vector $x = (x_1, \ldots, x_d)$ of
1's and −1's, let $\mathrm{Pa}_j$ denote all the variables without j; that is, all variables permitted as possible
parents for j at this stage. Let $\tilde{x}^{(j)}$ denote the vector x without $x_j$. Let

\[
LL(j, \theta_j, x) = \log p_{X_j \mid \mathrm{Pa}_j}(x_j \mid \tilde{x}^{(j)})
\]

denote the log likelihood function and, for $\mathbf{x}$ the data matrix with rows $x_{(1)}, \ldots, x_{(n)}$, let

\[
LL(j, \theta_j, \mathbf{x}) = \sum_{k=1}^{n} LL(j, \theta_j, x_{(k)}).
\]

The parameters $\theta_j$ are chosen to maximise the L1 regularisation score function,

\[
L_1R(\theta_j, \mathbf{x}) = LL(j, \theta_j, \mathbf{x}) - \lambda \|\theta_j\|_1,
\]



where $\|\theta_j\|_1 = \sum_{k=1}^{d-1} |\theta_{jk}|$ and λ is chosen appropriately. The sum is over the parameters corresponding
to dependence on parent variables; the parameter $\theta_{j,0}$ is not included. The article [123] has some
discussion about the appropriate choice of λ.
The L1 regularisation, if λ is appropriately chosen, has the effect of choosing vectors $\theta_j$ with a
substantial number of zero components. Because of this property, it tends to favour a lower number of
parameters in the model. For this reason, L1 regularisation is a technique of increasing
importance.
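
The score for a single variable can be written out directly from the logit model. The Python below is a sketch under stated assumptions: the function names are hypothetical, the data rows are tuples of ±1, and the identity log p(x_j ∣ rest) = −log(1 + exp(−x_j η)), with η the linear predictor, is used for states coded as ±1. Maximising the score over θ is not shown.

```python
import math


def log_lik(j, theta0, theta, rows):
    """Sum over data rows of log p(x_j | all other coordinates) under the logit model."""
    total = 0.0
    for x in rows:
        others = x[:j] + x[j + 1:]
        eta = theta0 + sum(t * s for t, s in zip(theta, others))
        total += -math.log1p(math.exp(-x[j] * eta))
    return total


def l1_score(j, theta0, theta, rows, lam):
    """Log likelihood minus lam times the L1 norm of theta; theta0 is not penalised."""
    return log_lik(j, theta0, theta, rows) - lam * sum(abs(t) for t in theta)


# Hypothetical data matrix with three binary (+1 / -1) variables.
rows = [(1, -1, 1), (-1, -1, 1)]
score_null = l1_score(0, 0.0, [0.0, 0.0], rows, lam=0.5)
score_full = l1_score(0, 0.0, [1.0, -2.0], rows, lam=0.5)
```

With θ = 0 the penalty vanishes and each row contributes log(1/2); non-zero components of θ must improve the likelihood by more than λ times their absolute value to be worthwhile, which is what drives many components to zero.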

16.9.3 Gibbs sampling


A related approach to the problem of structure learning is found in Bulashevska and Eils (2005) [11].
The structure learning algorithm is intended for analysis of gene expression data, to locate gene
regulatory interactions. As with Schmidt, Niculescu-Mizil and Murphy (2007) [123], the parents influence
the offspring independently of each other and the algorithm forms `noisy OR' and `noisy AND' gates.
The parent sets are chosen using Gibbs sampling. The generic techniques of Gibbs sampling are found
in Gamerman and Lopes (2006) [49].

16.10 A Junction Tree Framework for Undirected Graphical Model Selection

Let X = (X1 , . . . , Xd ) be a random vector, with indexing set V = {1, . . . , d}. An undirected graphical
model is simply an undirected graph G = (V, U ) where U is a set of undirected edges, such that G is
the independence graph of X .
The edge set U contains an edge ⟨α, β⟩ if and only if $X_\alpha \not\perp X_\beta \mid X_{-(\alpha,\beta)}$ (i.e. $X_\alpha$ is not independent of
$X_\beta$ conditioned on all the other components of the random vector X).
Learning the independence graph may be problematic when d is large, since the conditional
independence tests available have lower power when the number of states of the conditioning set is
large.
The process may be facilitated if additional a-priori information is available, of the form $U \subseteq \tilde{U}$,
where $\tilde{U}$ is an undirected edge set with node set V. Let $H = (V, \tilde{U})$.
If H is triangulated, a junction tree may be constructed from the cliques. It is clear that $\langle\alpha, \beta\rangle \notin U$
if none of the cliques contain both α and β, since this implies that $\langle\alpha, \beta\rangle \notin \tilde{U}$.
If there is a clique C such that α, β ∈ C, then it is not necessary to consider $X_{-(\alpha,\beta)}$; let $\mathcal{C}$
denote the clique-set, and let C denote a generic element of $\mathcal{C}$. Let $\mathcal{C}_\alpha = \{C \in \mathcal{C} : \alpha \in C\}$ and let
$W_{\alpha,\beta} = (\cup_{C \in \mathcal{C}_\alpha} C) \cup (\cup_{C \in \mathcal{C}_\beta} C)$. In other words, $W_{\alpha,\beta}$ is the collection of nodes contained in cliques
which contain either α or β (or both). Then, by obvious properties of the independence graph,
⟨α, β⟩ ∈ U if and only if

\[
X_\alpha \not\perp X_\beta \mid X_{W_{\alpha,\beta} \setminus \{\alpha,\beta\}}.
\]

In other words, only those cliques containing either α or β (or both) need to be considered, leading to
a reduction in the size of the conditioning sets and hence to more accurate conditional independence
tests.
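
The reduced conditioning set is straightforward to compute from the clique list. The Python sketch below illustrates this; the function name `conditioning_nodes` and the four cliques used in the example are assumptions chosen for illustration.

```python
def conditioning_nodes(cliques, alpha, beta):
    """Nodes in cliques containing alpha or beta, with alpha and beta removed."""
    w = set()
    for c in cliques:
        if alpha in c or beta in c:
            w |= set(c)
    return w - {alpha, beta}


# Hypothetical cliques of a triangulated graph on the nodes 1, ..., 7.
cliques = [{1, 2, 3, 5}, {1, 3, 4, 5}, {3, 4, 5, 6}, {4, 5, 6, 7}]
cond_12 = conditioning_nodes(cliques, 1, 2)
cond_67 = conditioning_nodes(cliques, 6, 7)
```

In this example the test for the edge ⟨1, 2⟩ conditions on three variables rather than on the five that conditioning on all remaining variables would require.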
There may be some additional gain, in terms of reducing the size of the conditioning sets, if the
junction tree can be successively updated, by removing from $\tilde{U}$ edges that have been considered, for
which it has been established that they are not in U.
Vats and Nowak [139] (2014) provide a framework for this, by considering the so-called region graph.
A region graph is simply a directed acyclic graph, where a node of the region graph (which we call a
region-node) is a subset of V, the node set. The region graph of interest is constructed as follows: the first
generation of regions, $\mathcal{R}_1$, is the collection $\mathcal{C}$ of cliques of a junction tree. These are the ancestor
nodes of the region graph. Generation $\mathcal{R}_{i+1}$ is the set of all pairwise intersections of sets in $\mathcal{R}_i$ with
cardinality greater than or equal to 2, for i = 1, . . . , L − 1, where L is the maximum value of i for which
$\mathcal{R}_i$ constructed in this way is non-empty.
The edge set of a region graph contains an edge R → S if and only if $R \in \mathcal{R}_i$ and $S \in \mathcal{R}_{i+1}$ for some
i ∈ {1, . . . , L − 1} and there is a set $T \in \mathcal{R}_i$ such that R ∩ T = S.
Vats and Nowak propose an algorithm for locating the independence graph G = (V, U), given a
decomposable graph $H = (V, \tilde{U})$ where $U \subseteq \tilde{U}$. The algorithm is given as Algorithm 10; some further
notation is needed before introducing it.
For a region R of a region graph, let

\[
\overline{R} = \cup_{S \in \{an(R), R\}} S \qquad (16.3)
\]

In other words, $\overline{R}$ is the union of region R and all its ancestors. In terms of the junction tree for H,
this is the union of all cliques which have R as a subset.
For a node set S, let K(S) denote the complete undirected graph with node set S. For a set of
nodes R, let $W_R$ denote the edge set W restricted to R and let

\[
W'_R = W_R \setminus \{\cup_{S \in ch(R)} K(S)\}. \qquad (16.4)
\]

Algorithm 10 returns the independence graph G = (V, U ).
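
The two quantities in (16.3) and (16.4) can be computed as in the following Python sketch. It assumes that the edge set in (16.4) is restricted to the node set R itself and that regions are stored as frozensets with explicit parent and child maps; the function names and the small region graph (four cliques and their three separators) are hypothetical.

```python
from itertools import combinations


def complete_graph(nodes):
    """Edge set of the complete undirected graph K(S) on the given nodes."""
    return {frozenset(p) for p in combinations(sorted(nodes), 2)}


def ancestors(region, parents):
    """All regions from which the given region can be reached."""
    out, stack = set(), list(parents.get(region, ()))
    while stack:
        r = stack.pop()
        if r not in out:
            out.add(r)
            stack.extend(parents.get(r, ()))
    return out


def closure(region, parents):
    """Union of the region and all its ancestors, as in (16.3)."""
    return set(region).union(*ancestors(region, parents))


def edges_to_test(region, children, edge_set):
    """Edges inside the region that lie in no child region, as in (16.4)."""
    restricted = {e for e in edge_set if e <= region}
    covered = set()
    for s in children.get(region, ()):
        covered |= complete_graph(s)
    return restricted - covered


c1, c2, c3, c4 = (frozenset(s) for s in ({1, 2, 3, 5}, {1, 3, 4, 5}, {3, 4, 5, 6}, {4, 5, 6, 7}))
s1, s2, s3 = frozenset({1, 3, 5}), frozenset({3, 4, 5}), frozenset({4, 5, 6})
children = {c1: [s1], c2: [s1, s2], c3: [s2, s3], c4: [s3]}
parents = {s1: [c1, c2], s2: [c2, c3], s3: [c3, c4]}
u_tilde = set().union(*(complete_graph(c) for c in (c1, c2, c3, c4)))

first_clique_edges = edges_to_test(c1, children, u_tilde)
second_clique_edges = edges_to_test(c2, children, u_tilde)
```

Each edge returned by `edges_to_test` would then be tested conditionally on the remaining nodes of `closure(region, parents)`; the conditional independence test itself is outside the scope of this sketch.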

Correctness of Algorithm 10 It remains to show that, assuming a perfect oracle, Algorithm 10
returns the independence graph G = (V, U). Firstly, it is clear that the algorithm thus constructed
considers all the edges of $\tilde{U}$. Secondly, it is straightforward (and left as an exercise) to show that, if
the distribution factorises along the junction tree, then $X_\alpha \not\perp X_\beta \mid X_{\overline{R} \setminus \{\alpha,\beta\}} \Leftrightarrow X_\alpha \not\perp X_\beta \mid X_{V \setminus \{\alpha,\beta\}}$. From
this it follows that, assuming a perfect oracle, the algorithm returns the independence graph.

Example 16.4 (Region Graph).

Suppose the independence graph G = (V, U) is given on the left of Figure 16.5 and it is known that
$U \subseteq \tilde{U}$, where the graph $H = (V, \tilde{U})$ is on the right. Here V = {1, 2, 3, 4, 5, 6, 7}.
The algorithm proceeds as follows:

Algorithm 10 Finding Independence Graph Given Decomposable Graph as Wrapper


Input: A graph $H = (V, \tilde{U})$ such that $U \subseteq \tilde{U}$
Output: The independence graph G = (V, U)
Step 1: Initialise $\hat{U} = \phi$ (empty set) and find the region graph of H.
Step 2: Suppose the generations of regions are $\mathcal{R}_1, \ldots, \mathcal{R}_L$
and j is the smallest value such that there exists a region $R \in \mathcal{R}_j$ such that $\tilde{U}'_R \neq \phi$,
where $\tilde{U}'_R$ is defined by (16.4)
for each $R \in \mathcal{R}_j$ do
Compute $\overline{R}$ defined by (16.3) and $\tilde{U}'_R$ defined by (16.4). For each edge $\langle\alpha, \beta\rangle \in \tilde{U}'_R$, remove the
edge from $\tilde{U}$. Add the edge ⟨α, β⟩ to $\hat{U}$ if and only if $X_\alpha \not\perp X_\beta \mid X_{\overline{R} \setminus \{\alpha,\beta\}}$.
end for
(Note: at this stage, $\tilde{U} \cup \hat{U} \supseteq U$.)
Step 3: Compute a new junction tree and region graph using an efficient triangulation of edge set
$\tilde{U} \cup \hat{U}$
Step 4: If $\tilde{U} = \phi$ then terminate, otherwise go to Step 2.

[Figure: two undirected graphs on the nodes 1, . . . , 7; G on the left, H on the right]

Figure 16.5: Graph G = (V, U) and $H = (V, \tilde{U})$; $\tilde{U} \supseteq U$

• Clique 1, 2, 3, 5 has child 1, 3, 5. Removing the edges of the child leaves edges ⟨1, 2⟩, ⟨2, 3⟩, ⟨2, 5⟩ from
$\tilde{U}$ to be estimated. Therefore, look at $\mathcal{R}_1$ (the cliques of the junction tree) with the complete
graphs of the separators removed.
Clique 1, 2, 3, 5: edges ⟨1, 2⟩, ⟨2, 3⟩, ⟨2, 5⟩ are considered; ⟨2, 3⟩ and ⟨2, 5⟩ are removed. Edge ⟨1, 2⟩ is added
to $\hat{U}$.
Clique 1, 3, 4, 5: the children are 1, 3, 5 and 3, 4, 5. Therefore, only edge ⟨1, 4⟩ is considered. This is
retained. It is therefore removed from $\tilde{U}$ and added to $\hat{U}$.
Clique 3, 4, 5, 6: the children are 3, 4, 5 and 4, 5, 6. Only edge ⟨3, 6⟩ is considered. It is removed from
$\tilde{U}$.
Clique 4, 5, 6, 7: the child is 4, 5, 6. The edges considered are ⟨4, 7⟩, ⟨5, 7⟩ and ⟨6, 7⟩. They are removed
from $\tilde{U}$. Edges ⟨5, 7⟩ and ⟨6, 7⟩ are added to $\hat{U}$.

[Figure: junction tree with cliques {1, 2, 3, 5}, {1, 3, 4, 5}, {3, 4, 5, 6}, {4, 5, 6, 7} and separators {1, 3, 5}, {3, 4, 5}, {4, 5, 6}; region graph with generations {1, 2, 3, 5}, {1, 3, 4, 5}, {3, 4, 5, 6}, {4, 5, 6, 7}, then {1, 3, 5}, {3, 4, 5}, {4, 5, 6}, then {3, 5}, {4, 5}]

Figure 16.6: Junction Tree (above) and Region Graph (below) for Figure 16.5

• At this stage, a new junction tree may be computed, using the edges from $\tilde{U} \cup \hat{U}$. This may be
more efficient. Alternatively, we may continue with the same junction tree.
After deleting these edges from $\tilde{U}$, generation $\mathcal{R}_2$ is the first generation that satisfies the property.
Region 1, 3, 5: This has one child, which is 3, 5. The edges under consideration are therefore
⟨1, 3⟩ and ⟨1, 5⟩. These have not been considered before. They are removed from $\tilde{U}$ and edge
⟨1, 3⟩ is added to $\hat{U}$.
Region 3, 4, 5: This has two children, 3, 5 and 4, 5. Only one edge is considered; ⟨3, 4⟩. This is
removed from $\tilde{U}$. It is not present in U and therefore (assuming a perfect oracle) is not added
to $\hat{U}$.
Region 4, 5, 6: This has one child, 4, 5. The edges under consideration are therefore ⟨4, 6⟩ and
⟨5, 6⟩. They are removed from $\tilde{U}$. The edge ⟨4, 6⟩ is added to $\hat{U}$.

• At this stage, a new junction tree may be computed. If we proceed with the current junction
tree, we look at $\mathcal{R}_3$. This contains regions 3, 5 and 4, 5. Each region consists of two nodes; for
each region, there is one edge under consideration. The edge ⟨3, 5⟩ is removed from $\tilde{U}$ and added
to $\hat{U}$; similarly with ⟨4, 5⟩.

At this stage, $\tilde{U} = \phi$ and $\hat{U} = U$.

16.11 The Xie-Geng Algorithm for Learning a DAG


The algorithm of Xie-Geng [148] (2008) provides a framework for learning a DAG which is essentially
different from FAS, PC/MMPC and RAI. This algorithm again assumes that there exists a faithful DAG
and, at the final stage of edge removal, removes edges according to the principle of Theorem 2.3,
although this last stage may be omitted if one does not have a priori information that there exists a
faithful DAG. The algorithm starts by finding the independence graph (Definition 5.6), which is useful
by Theorem 5.7. Recall the definition of weak decomposition (Definition 7.17). The independence
graph is subsequently decomposed, the sep-sets being recorded at each stage. With the reconstruction, an
edge ⟨α, β⟩ appears in the final graph if and only if it appears in all parts of the decomposition that
contain both nodes α and β; otherwise a suitable immorality is added, dictated by the sep-sets in the
usual manner. The compelled edges are then added and the essential graph is returned.
The algorithm assumes that independence statements $X_\alpha \perp X_\beta \mid X_{-\{\alpha,\beta\}}$ can be verified. This may
be a weak point for large numbers of random variables, since the power of conditional independence
tests decreases as the number of variables in the conditioning set grows. The algorithm may be
combined with the algorithm of Section 16.10 from Vats and Nowak [139] to reduce the size of the
conditioning sets if there is additional a-priori information that the independence graph is contained
within a decomposable graph $H = (V, \tilde{U})$.
Theorem 2.3 is essential for proving that the algorithm returns a faithful DAG when it exists.
From Definition 5.6 (the definition of the independence graph) together with Theorem 5.7, it follows
that graphical separation statements in the independence graph are equivalent to the corresponding
conditional independence statements for the probability distribution. By Theorem 5.5 together with
the definition of the independence graph (Definition 5.6), the independence graph is equivalent to the
moral graph of a faithful DAG, when a faithful DAG exists.
The following two theorems are used crucially in proving that the graph returned by the algorithm
is the skeleton of a DAG along which the distribution may be factorised.

Theorem 16.5. Let G = (V, D) be a DAG. Suppose that $A \perp B \parallel_G S$ for three subsets A, B, S ⊂ V. Let
α ∈ A and β ∈ A ∪ S. Then $\alpha \perp \beta \parallel_G R$ for some R ⊂ A ∪ B ∪ S if and only if $\alpha \perp \beta \parallel_G R'$ for a subset
R′ ⊂ A ∪ S.

Theorem 16.6. Let G = (V, D) be a DAG and suppose that A, B, S ⊂ V are such that $A \perp B \parallel_G S$. Let
α, β ∈ S. Then there is a subset R ⊆ A ∪ B ∪ S such that $\alpha \perp \beta \parallel_G R$ if and only if either there is a subset
R′ ⊂ A ∪ S or there is a subset R′ ⊂ B ∪ S such that $\alpha \perp \beta \parallel_G R'$.

These statements appear, at face value, to be precisely what one would expect. Their proofs, though, are
somewhat involved. The Xie-Geng algorithm is based on these statements, which
enable the edge set for the whole graph to be deduced from examining subsets of the variables. The
proofs of these theorems are given later, after the description of the algorithm.

16.11.1 Description of the Xie-Geng Algorithm


The algorithm proceeds as follows:

• An undirected graph is constructed. This is the graph G = (V, U) where ⟨α, β⟩ ∈ U if and only if
$X_\alpha \not\perp X_\beta \mid X_{-(\alpha,\beta)}$. This is the independence graph (Definition 5.6). It is therefore equivalent to
the moral graph of a faithful DAG if a faithful DAG exists (Exercise 6 page 352).

• A weak decomposition (A, B, S) (Definition 7.17) of the moral graph is found, if such a
decomposition exists.

• For each α ∈ A \ S and β ∈ B \ S, set $S_{\alpha,\beta} = S$, the separator of α, β.

• Construct $G^{i}_{A\cup S}$ and $G^{i}_{B\cup S}$, where for each γ, δ ∈ A ∪ S, $\langle\gamma, \delta\rangle \in U^{i}_{A\cup S}$ if and only if
$X_\gamma \not\perp X_\delta \mid X_{(A\cup S)\setminus\{\gamma,\delta\}}$, and similarly for $G^{i}_{B\cup S}$. These are the independence graphs for A ∪ S and B ∪ S
respectively.

• Find weak decompositions of $G^{i}_{A\cup S}$ and $G^{i}_{B\cup S}$, construct the independence graphs associated with these
weak decompositions and continue recursively until it is not possible to decompose any of these
pieces further. At each stage, if W ⊂ V is decomposed into A′, B′, S′, set $S_{\alpha,\beta} = S'$ for each
α ∈ A′, β ∈ B′.

Before the assembly stage, the following additional stage is carried out on the cliques which are obtained
from the recursive decomposition:

• For each clique A in the decomposition and each pair {α, β} ⊆ A, check whether there is a subset
S ⊂ A \ {α, β} such that $X_\alpha \perp X_\beta \mid X_S$. If there is, then remove the edge ⟨α, β⟩ and let $S_{\alpha,\beta} = S$,
the sep-set of {α, β}.

From these pieces, the DAG is constructed as follows:

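• Two sub-skeletons $L_{A\cup S} = (A \cup S, U_{A\cup S})$ and $L_{B\cup S} = (B \cup S, U_{B\cup S})$ are combined to form

\[
L_{A\cup B\cup S} = (A \cup B \cup S, U_{A\cup B\cup S})
\]

where
\[
U_{A\cup B\cup S} = (U_{A\cup S} \cup U_{B\cup S}) \setminus \{\langle\alpha, \beta\rangle \mid \alpha, \beta \in S, \langle\alpha, \beta\rangle \notin U_{A\cup S} \cap U_{B\cup S}\}.
\]

• This is done recursively until all the pieces have been added.

• For each separator $S_{\alpha,\beta}$, orient a vee-structure (α, γ, β) as an immorality α → γ ← β if $\gamma \notin S_{\alpha,\beta}$.

• Orient the compelled edges.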
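
The rule for combining two sub-skeletons can be sketched in Python as follows. The function name `combine` and the example edge sets are assumptions made for illustration; edges are represented as two-element frozensets.

```python
def combine(edges_as, edges_bs, separator):
    """Union of two sub-skeletons, keeping an edge inside the separator only if both have it."""
    union = edges_as | edges_bs
    both = edges_as & edges_bs
    return {e for e in union if not e <= separator or e in both}


def edge(a, b):
    return frozenset((a, b))


# Hypothetical pieces A ∪ S = {A, C, D} and B ∪ S = {C, D, F} with separator S = {C, D}.
u_as = {edge("A", "C"), edge("A", "D")}
u_bs = {edge("C", "F"), edge("D", "F"), edge("C", "D")}
u_combined = combine(u_as, u_bs, frozenset({"C", "D"}))
```

The edge between C and D appears in only one of the two pieces, so it is dropped from the combined skeleton, while all edges with at least one end point outside the separator are kept.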

Establishing Correctness If there is a faithful DAG for the distribution and a perfect oracle, then
the algorithm returns the essential graph of the faithful DAG. This is established as follows:
Suppose that $G_{A\cup C}$ and $G_{B\cup C}$ are faithful for the distributions over A ∪ C and B ∪ C respectively
and are combined according to the rules given to give $G_{A\cup B\cup C}$. The results of Theorems 16.5 and 16.6
may be used to establish the D-separation properties:
For any α ∈ A and β ∈ B, $S_{\alpha,\beta} = C$ and therefore there is no edge α ∼ β in a faithful DAG for the
distribution over A ∪ B ∪ C. Following the reconstruction, there is no such edge in $G_{A\cup B\cup C}$.
For α, β ∈ C, the reconstruction has an edge α ∼ β in $G_{A\cup B\cup C}$ if and only if there are edges in both
$G_{A\cup C}$ and $G_{B\cup C}$. Theorem 16.6 states that, if G(A ∪ B ∪ C) is a DAG over the variables A ∪ B ∪ C, then
there is a set R ⊆ A ∪ B ∪ C such that $\alpha \perp \beta \parallel_{G(A\cup B\cup C)} R$ if and only if either there is a set R′ ⊂ A ∪ C such
that $\alpha \perp \beta \parallel_{G(A\cup C)} R'$ or there is a set R′ ⊂ B ∪ C such that $\alpha \perp \beta \parallel_{G(B\cup C)} R'$. Therefore, if G(A ∪ B ∪ C)
is a faithful graph for $p_{A\cup B\cup C}$, then its skeleton contains an edge α ∼ β between two variables in C if
and only if both $G_{A\cup C}$ and $G_{B\cup C}$ contain the edge.

Example 16.7 (Example for the Xie-Geng Algorithm).


Suppose that the probability distribution $P_{A,B,C,D,E,F,G,H}$ and the DAG in Figure 16.7 are faithful.

[Figure: DAG on the nodes A, B, C, D, E, F, G, H]

Figure 16.7: A faithful DAG to illustrate the Xie-Geng algorithm

Suppose that we also have a perfect oracle (each conditional independence test gives the correct result).
The first step of the algorithm is to construct the independence graph, given in Figure 16.8. This is
constructed by starting with the empty graph and adding an undirected edge ⟨α, β⟩ if and only if
$X_\alpha \not\perp X_\beta \mid X_{-(\alpha,\beta)}$. If the DAG in Figure 16.7 is faithful, the independence graph is the moral graph.

[Figure: undirected graph on the nodes A, B, C, D, E, F, G, H]

Figure 16.8: The Moral / Independence Graph for the DAG of Figure 16.7

The graph is now decomposed recursively; for example, take {C, G}; this decomposes the graph
into {A, C, D, F} and {B, E, C, F, G, H}. The independence graphs for these two sets of variables are
illustrated in Figure 16.9.
Now consider the piece on the left hand side of Figure 16.9. The set {C, D} may be used as the
separation set, and the independence graphs of the two pieces {A, C, D} and {C, D, F} are shown in
Figure 16.10.
The edge C − D does not appear in the first graph, since C ⊥ D∣A. This is clear from the DAG in
Figure 16.7, which is faithful to the distribution. It therefore follows that in the reconstruction stage,
the edge C − D will not be present and that C − F − D will be an immorality.

More fully, the decomposition phase can proceed as follows:


[Figure: independence graphs on {A, C, D, F} (left) and {B, C, E, F, G, H} (right)]

Figure 16.9: First stage of decomposition for the Xie-Geng algorithm

[Figure: independence graphs on {A, C, D} (left) and {C, D, F} (right)]

Figure 16.10: Further stage of decomposition for the Xie-Geng algorithm

• {A} and {B, E, F, G, H} are separated by {C, D}; A ⊥ {B, E, F, G, H}∣{C, D}. The two pieces
are: {A, C, D} and {B, C, D, E, F, G, H}.

• Consider {A, C, D}. The graph is D − A − C, since for variables {A, D, C}, C ⊥ D∣A.

• This is decomposed further into {A, D} and {A, C}; C ⊥ D∣A. This decomposition is complete;
the pieces are cliques and cannot be decomposed further.

• Consider {B, C, D, E, F, G, H}. Then B ⊥ {D, F, H, G}∣{C, E}. The decomposition is into
{B, C, E} and {C, D, E, F, G, H}. The piece {B, C, E} is a clique, since $B \not\perp C \mid E$.

• For {C, D, E, F, G, H}, E ⊥ {D, F, H}∣{C, G}, so it is decomposed into {C, D, F, G, H} and
{C, E, G}. {C, E, G} is a clique at this stage, since $C \not\perp G \mid E$.

• For {C, D, F, G, H}, C ⊥ G∣{D, F, H}, so the graph of this piece does not contain the edge C − G.

• G ⊥ {C, F, D}∣H, so decompose {C, D, F, G, H} into {G, H} and {C, D, F, H}.

• Now consider {C, D, F, H} and decompose into {C, D, F} and {H, D, F}; C ⊥ H∣{D, F}. Since
$C \not\perp D \mid F$, the piece {C, D, F} is a clique. Since $D \not\perp H \mid F$, {H, D, F} is also a clique.

Now the cliques are considered and edges removed according to the principle of Theorem 2.3.

• For {C, D, F} the edge C − D is not removed; $C \not\perp D$ and $C \not\perp D \mid F$.

[Figure: essential graph on the nodes A, B, C, D, E, F, G, H]

Figure 16.11: Structure learning example: essential graph

• For {D, F, H}, the edge D − H is removed, with separation set (sep set) ϕ, since D ⊥ H (with no
instantiated nodes, there is an open collider in each trail in the original DAG).

• For the final stage of the `deconstruction' phase, edge B − C is removed because B ⊥ C, with
sepset $S_{BC} = \phi$.

Reconstruction For the reconstruction, these are put together, using the rule that $G_{A\cup B\cup C}$ has an
edge between two variables in C if and only if both $G_{A\cup C}$ and $G_{B\cup C}$ do. At this stage, the edge C − D
is removed from the final graph, since at the earlier stage $S_{CD} = \{A\}$. Similarly, C − G is removed
because $S_{CG} = \{D, F, H\}$. Vee structures (α, γ, β) are immoralities if and only if $\gamma \notin S_{\alpha,\beta}$. This gives
the essential graph of Figure 16.11.

16.11.2 Proofs of Theorems 16.5 and 16.6


Finally, in the discussion of the Xie-Geng algorithm, we prove Theorems 16.5 and 16.6, thus establishing
the correctness of the Xie-Geng algorithm under the assumptions that there exists a faithful graph and
that there is a perfect oracle.
Theorem 16.5 requires some preparatory lemmas:

Lemma 16.8. Let G = (V, D) be a directed acyclic graph. Let α, β ∈ V and S ⊂ V. Let $F = (G_{An(\{\alpha,\beta\}\cup S)})^m$, where
An(W) denotes the set W together with all nodes that are ancestor nodes in G for any node in W.
First, the subgraph is taken, then it is moralised. Then S separates α and β in F if and only if
$\alpha \perp \beta \parallel_G S$.

Proof of Lemma 16.8 Assume that there is a path from α to β in F that has no nodes in S . Then
a trail from α to β in G may be found by taking the directed edge in G if it corresponds to an edge
in F or two edges to form a collider if there is no corresponding edge in F ; the two directed edges
corresponding to the immorality that was removed when the graph was moralised.
If the collider node, or any of its descendants, is in S, then the node is S-active. Assume that there
is one collider γ that is not S-active. Then each parent node (they are both in F) is either an ancestor
of α or an ancestor of β and hence the collider node is either an ancestor of α or an ancestor of β. It
follows that there is a directed path from that node to α or β that does not pass through S. Assume
that it is α and consider the trail between α and β with the part between α and γ replaced by this
directed path from γ to α.
Proceeding inductively, a trail can be constructed such that the only colliders are S-active and
there are no other nodes in S on the trail. It follows that $\alpha \not\perp \beta \parallel_G S$.
Now assume that all paths from α to β in F have at least one node in S . Consider any trail in
G between α and β . The skeleton of any trail that has only fork or chain connections is in F and
hence has a node in S . Consider any trail in G and consider the S -active collider connections. In H,
there is an undirected edge ⟨X, Y ⟩ for any collider connection (X, Z, Y ) such that Z is S -active. If the
trail has nodes not in An({α, β} ∪ S), then it clearly has a collider that is uninstantiated and has no
descendants in S . If all the nodes of the trail are in An({α, β} ∪ S), then since the undirected path
in H formed by taking the directed edge ⟨X, Y ⟩ instead of ⟨X, Z⟩, ⟨Z, Y ⟩ has a node in S , it follows
that the original trail has a fork or chain node in S and hence is blocked. The proof of Lemma 16.8 is
complete.
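
Lemma 16.8 suggests a direct computational test for D-separation. The following Python sketch is illustrative only and is not taken from the text: the function names and the dictionary `parents` (mapping each node to a list of its parents) are assumptions made for the example. It forms the ancestral set An({α, β} ∪ S), moralises the induced subgraph and then searches for a path from α to β that avoids S.

```python
from itertools import combinations

def ancestral_set(parents, nodes):
    """Return An(nodes): the nodes together with all of their ancestors."""
    result, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def d_separated(parents, a, b, S):
    """Test D-separation of a and b by S (a, b not in S) as in Lemma 16.8."""
    S = set(S)
    keep = ancestral_set(parents, {a, b} | S)
    # Undirected adjacency of the moral graph of the induced subgraph.
    adj = {v: set() for v in keep}
    for child in keep:
        pa = list(parents.get(child, ()))      # all parents lie in `keep`
        for p in pa:                           # drop the directions
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(pa, 2):       # marry the parents
            adj[p].add(q)
            adj[q].add(p)
    # Depth-first search for a path from a to b with no node in S.
    seen, stack = {a}, [a]
    while stack:
        v = stack.pop()
        if v == b:
            return False
        for w in adj[v] - S - seen:
            seen.add(w)
            stack.append(w)
    return True

# Hypothetical example: the collider A -> C <- B with C -> D.
example = {"C": ["A", "B"], "D": ["C"]}
marginal = d_separated(example, "A", "B", set())     # True
given_c = d_separated(example, "A", "B", {"C"})      # False
given_d = d_separated(example, "A", "B", {"D"})      # False
```

In the example, conditioning on the collider C, or on its descendant D, brings C into the ancestral set, so the parents A and B are married and the separation is lost.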

Lemma 16.9. Let G = (V, D) be a directed acyclic graph and let S ⊂ V. Two nodes α and β are
D-separated by some subset of S if and only if they are D-separated by an({α, β}) ∩ S, where
an(W) = An(W) ∖ W.

Proof of Lemma 16.9 Set S′ = an({α, β}) ∩ S. Since S′ ⊆ S, it follows that if α ⊥ β ∥G S′ then
(trivially) there is a subset R ⊆ S such that α ⊥ β ∥G R.
Now suppose that α ⊥/ β ∥G S′. By Lemma 16.8, there is a path ρ connecting α and β in
(G_{An({α,β})})^m that does not contain any vertex of S′, and hence ρ does not contain any vertex in
S ∖ {α, β}. Suppose that α and β are D-separated by S0 ⊆ S. Since an({α, β}) ∩ S0 ⊆ S′, it follows that ρ
does not contain any vertex in an({α, β}) ∩ S0 and hence, by Lemma 16.8, α ⊥/ β ∥G S0, a contradiction.
It follows that if there is a subset R ⊆ S such that α ⊥ β ∥G R, then α ⊥ β ∥G an({α, β}) ∩ S. The proof
of Lemma 16.9 is complete.

Lemma 16.10. Let G = (V, D) be a DAG and suppose that ρ is a trail between two non-adjacent
vertices α and β. If there are any nodes in ρ that are not in An({α, β}), then the trail ρ is blocked by
any subset S ⊆ an({α, β}).

Proof It is clear that such a trail contains a collider connection where the collider node is not in
An({α, β}); hence the collider node is not in an({α, β}), nor does it have a descendant in this set.
The proof of Lemma 16.10 is complete.

We are now in a position to prove Theorem 16.5.

Proof of Theorem 16.5 Since A ∪ B ∪ S ⊇ A ∪ S, it follows trivially that existence of a suitable
subset of A ∪ S implies existence of a suitable subset of A ∪ B ∪ S.
To prove that existence of a subset in A ∪ B ∪ S implies existence of a subset in A ∪ S, assume that
α and δ are two vertices in A and A ∪ S respectively, that are D-separated by a subset of A ∪ B ∪ S.
Let

S′ = (an({α}) ∪ an({δ})) ∩ (A ∪ S).

By Lemma 16.9, it is sufficient to show that S′ blocks every trail ρ between α and δ. There are two
cases:

• ρ is not contained completely in An({α, δ});

• ρ is contained completely in An({α, δ}).

By Lemma 16.10, in the first case, ρ is blocked by S′ since S′ ⊂ an({α}) ∪ an({δ}).
For the second case, A ⊥ B ∥G S implies that {α} ∪ (S′ ∩ A) ⊥ {β} ∣ S for each β ∈ B and hence (using
Exercise 2 page 22) that α ⊥ β ∥G (S′ ∩ A) ∪ S. Since S′ ⊆ A ∪ S, it follows that

α ⊥ β ∥G (S′ ∪ S).

Now suppose there is a trail ρ contained in An({α, δ}) between α and δ that is not blocked by S ′ .
Let W = S′ ∪ S. Then W blocks ρ. There is therefore at least one node in ρ that is in W ∖ S′. Note
that W ⊆ B. Let γ ∈ W ∖ S′ denote the first node on the trail ρ, starting from α, that is in W ∖ S′.
Let ρ′ denote the sub-trail of ρ between α and γ. Since ρ is not blocked by S′, neither is ρ′. Since
γ is the only node of ρ′ that is in B, it follows that if ρ′ is S′-active, it is also W-active and hence
α ⊥/ γ ∥G (S′ ∪ S), which is a contradiction.
If every sequence satisfies these properties, then clearly every trail satisfies these properties
and hence, from the definition, α ⊥ β ∥G S.
If α ⊥ β ∥G S, then consider any such sequence of nodes. Take a subsequence by removing the loops
so that any node appears at most once. This is a trail. Since D-separation holds, the trail has the
property listed. The property therefore holds for the original sequence.
It is clear that if there is a set R′ ⊂ A ∪ S or R′ ⊂ B ∪ S, then R = R′ ⊂ A ∪ B ∪ S satisfies the
criterion.
Now suppose there is a set R̃ ⊂ A∪B ∪S such that α ⊥ β∥G R̃ and let γ1 , γ2 ∈ S such that γ1 ⊥ γ2 ∥G R̃.
By Lemma 16.9, γ1 ⊥ γ2 ∥G R where

R = (an(γ1 ) ∪ an(γ2 )) ∩ (A ∪ B ∪ S).

Suppose that γ2 is not an ancestor of γ1 . This can be done without loss of generality, by exchanging
the roles of γ1 and γ2 if necessary.
Let
R1 = (an(γ1 ) ∪ an(γ2 )) ∩ (A ∪ S).

R2 = (an(γ1 ) ∪ an(γ2 )) ∩ (B ∪ S).

To prove that R1 or R2 D-separates γ1 and γ2, it is sufficient to show that, for any two trails ρ1 in
A ∪ S and ρ2 in B ∪ S, either ρ1 is blocked by R1, or ρ2 is blocked by R2, or both.
Consider the two cases separately:

• One of the trails ρj is not completely contained in An({γ1, γ2}).

• Both trails ρ1 and ρ2 are contained in An({γ1, γ2}).

For the first case, since both R1 and R2 are subsets of an(γ1) ∪ an(γ2), it follows from Lemma 16.10
that ρj is blocked by both R1 and R2.
Now consider the second case. Suppose that ρ1 is R1-active and ρ2 is R2-active. Both ρ1 and ρ2
are blocked by R = R1 ∪ R2. It follows that ρ1 has a node in R ∖ R1 and ρ2 has a node in R ∖ R2. Let δ1
and δ2 denote the nodes of this kind on ρ1 and ρ2 respectively that are closest to γ1. As with the
previous exercise, δ1 ∈ R ∖ R1 ⊆ B and δ2 ∈ R ∖ R2 ⊆ A. Let ρ′1 and ρ′2 denote the subtrails of ρ1 and ρ2
between γ1 and δ1, and between γ1 and δ2, respectively. Note that ρ′1 is R1-active and ρ′2 is R2-active.
Connecting at γ1 gives a sequence ρ′ between δ1 and δ2 through γ1. Note that ρ′ may not be a trail,
since there may be repeated nodes.
Any node that is not a collider node in ρ′1 , since it is in an(γ1 ) ∪ an(γ2 ) and since neither ρ1 nor ρ′1
are blocked by R1 , is not in R1 ∪ S . Similarly S does not contain any collider node on ρ′2 . Therefore,
except perhaps for γ1 , ρ′ does not have any collider connections where the collider node is in S .
Let ν1 denote the neighbour of γ1 on ρ′1 . Since ν1 ∈ an(γ1 )∪ an(γ2 ) and it is not γ2 , it is an ancestor
of γ1 or γ2 . If the orientation is γ1 → ν1 , then γ2 is an ancestor of γ1 , contradicting the assumption.
Therefore the edge is oriented ν1 → γ1 . Similarly, for ν2 a neighbour of γ1 on ρ′2 . It follows that
(ν1 , γ1 , ν2 ) is a collider on ρ′ . Therefore S does not contain any nodes on ρ′ that are not collider nodes
on the trail.
Consider any collider node c in ρ′j (that is, the centre of a collider connection in ρ′j ). It is either
in Rj or else has a descendant in Rj . Since c ∈ an(γ1 ) ∪ an(γ2 ), it follows that γ1 ∈ S or γ2 ∈ S is a
descendant of c. Since γ1 ∈ S , it follows that each collider node in ρ′ is either in S or has a descendant
in S .
It follows that δ1 ⊥/ δ2 ∥G S, contradicting A ⊥ B ∥G S. It follows that either γ1 ⊥ γ2 ∥G R1 or
γ1 ⊥ γ2 ∥G R2. The proof of Theorem 16.5 is complete.
Another preparatory lemma is needed, before proving Theorem 16.6.

Lemma 16.11. Two non-adjacent nodes α and β in a directed acyclic graph G = (V, D) are D-separated
by a set S ⊂ V if and only if, for any sequence λ = (α, λ1, . . . , λn−1, β) (where the same node can appear
more than once) with edges between each consecutive pair,

• either λ contains a chain or a fork connection such that the chain node or fork node is in S, or

• λ contains a collider connection such that the collider node is not in S and has no descendant in
S.

A sequence λ with edges between each consecutive pair that satisfies this property is said to be blocked
by S.

Proof of Lemma 16.11 The result of Theorem 1.24 page 15, stating that a DAG G = (V, D) has an
edge between α and β in D if and only if α ⊥/ β ∥G S for every subset S, is used crucially here, together
with the definition of `faithful', that conditional independence statements and D-separation statements
are equivalent.
For α ∈ A and β ∈ C, the graph G_{A∪B∪C} in the reconstruction has an edge α ∼ β if and only if there
is an edge α ∼ β in the graph G_{A∪C}. Theorem 16.5 states that if G(A ∪ B ∪ C) is a DAG over the
variables A ∪ B ∪ C then there is a set R ⊂ A ∪ B ∪ C such that α ⊥ β ∥G(A∪B∪C) R if and only if there
is a set R′ ⊆ A ∪ C such that α ⊥ β ∥G(A∪C) R′. It follows that if G(A ∪ B ∪ C) is faithful for p_{A∪B∪C}
then its skeleton contains an edge α ∼ β between two variables α ∈ A and β ∈ C if and only if G(A ∪ C)
contains an edge between α and β. The proof of Lemma 16.11 is complete.

Theorem 16.6 now follows almost directly:

Proof of Theorem 16.6 If α, β ∈ V, then any vee-structure α − γ − β such that γ ∉ S_{α,β} is an
immorality, hence the immoralities are correct. The proof of Theorem 16.6 is complete.

16.12 The Ma-Xie-Geng Algorithm for Learning Chain Graphs


We now turn attention to learning chain graphs. As with the Xie-Geng algorithm for learning DAGs,
the Ma-Xie-Geng algorithm for learning chain graphs starts by learning the independence graph. From
the independence graph, a separation tree can be learned (Theorem 5.23) and, from the separation
tree, edges may be deleted and edges oriented according to the principles described in Section 5.2.
The region graph of Section 16.10 can provide an effective alternative to the separation tree.

16.12.1 Skeleton Recovery with a Separation Tree


The skeleton of the chain graph may be recovered from the separation tree with the help of
Theorem 5.25. The assumption is that there exists a chain graph which is faithful to the probability
distribution. The recovery follows Algorithm 11. The algorithm consists of three main parts:

• Local skeletons are recovered for each individual tree-node of the separation tree. By Condition
1 of Theorem 5.25, edges deleted in any local skeleton are also absent in the global skeleton. This
is the same principle used in the Xie-Geng algorithm for DAGs.

• All the information from local skeletons is combined to give a global undirected graph, which has
all the edges of the skeleton, but may contain additional edges.

• Finally, the extra edges are eliminated.

Theorem 16.12. Suppose there is a chain graph faithful to a probability distribution P. Given a perfect
oracle (i.e. each test for conditional independence gives the correct answer, rejecting CI when it is false
and not rejecting when the CI statement is true), Algorithm 11 returns the skeleton of a faithful chain
graph.

Algorithm 11 Recovering the Skeleton of a Chain Graph

Input: A separation tree T of G and the set of independence statements of P
Output: The skeleton of G and a set S of C-separators
Stage 1: Recover Local Skeletons
Set S = ∅
for each tree node C_h do
    Start from a complete undirected graph G_h with vertex set C_h
    for each pair of nodes {α, β} ⊂ C_h do
        if ∃ S_{α,β} ⊂ C_h such that X_α ⊥ X_β ∣ X_{S_{α,β}} then
            Delete the edge ⟨α, β⟩ in G_h
            Add S_{α,β} to S
        end if
    end for
end for
Stage 2: Combine Local Skeletons
Combine the graphs G_h = (C_h, E_h) into an undirected graph G′ = (V, ∪_h E_h)
for each pair of nodes {α, β} contained in more than one tree-node and ⟨α, β⟩ ∈ G′ do
    if ∃ C_h such that {α, β} ⊂ C_h and ⟨α, β⟩ ∉ E_h then
        Delete the edge ⟨α, β⟩ from G′
    end if
end for
Stage 3: Remove Extra Edges
for each pair of nodes {α, β} contained in more than one tree-node and ⟨α, β⟩ ∈ G′ do
    if X_α ⊥ X_β ∣ X_{S_{α,β}} for some S_{α,β} ⊂ N_{G′}(α) or N_{G′}(β) which is not a subset of any C_h
    with {α, β} ⊂ C_h then
        Delete ⟨α, β⟩ from G′
        Add S_{α,β} to S
    end if
end for
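
Stages 1 and 2 of Algorithm 11 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the names `local_skeletons`, `combine` and `subsets` are hypothetical, the oracle is assumed to be a callable `indep(a, b, S)` returning True when X_a ⊥ X_b ∣ X_S, and Stage 3 is omitted. The toy oracle describes an undirected chain A − B − C − D, which is an assumption made purely for the example.

```python
from itertools import combinations

def subsets(items):
    """All subsets of `items`, smallest first."""
    items = sorted(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield set(combo)

def local_skeletons(tree_nodes, indep):
    """Stage 1: edge set of each local skeleton and the separators found."""
    sepsets, local_edges = {}, []
    for C in tree_nodes:
        edges = set()
        for a, b in combinations(sorted(C), 2):
            candidates = subsets(set(C) - {a, b})
            sep = next((T for T in candidates if indep(a, b, T)), None)
            if sep is None:
                edges.add(frozenset((a, b)))       # no separator: keep edge
            else:
                sepsets[frozenset((a, b))] = sep   # record the separator
        local_edges.append(edges)
    return local_edges, sepsets

def combine(tree_nodes, local_edges):
    """Stage 2: keep an edge unless a tree node containing it deleted it."""
    union = set().union(*local_edges)
    return {e for e in union
            if all(e in E for C, E in zip(tree_nodes, local_edges)
                   if e <= set(C))}

# Toy oracle (assumption): separation in the chain A - B - C - D.
pos = {"A": 0, "B": 1, "C": 2, "D": 3}

def indep(a, b, S):
    lo, hi = sorted((pos[a], pos[b]))
    return any(lo < pos[s] < hi for s in S)

tree_nodes = [{"A", "B", "C"}, {"B", "C", "D"}]
local_edges, sepsets = local_skeletons(tree_nodes, indep)
skeleton = combine(tree_nodes, local_edges)
```

Searching separators in order of increasing size keeps the conditioning sets as small as possible, which matters in practice because tests with large conditioning sets are weaker.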

Proof This uses Theorem 5.25. There is an edge between two nodes α and β in a chain graph if and
only if α ⊥/ β ∥G S for every subset S ⊆ V ∖ {α, β}. The three lines which delete edges therefore only
delete edges which cannot appear in the skeleton. The output is therefore a graph which contains all
the edges of the skeleton.
At the same time, if α ≁ β, then one of the three conditions of Theorem 5.25 holds. For condition
1, there is no edge ⟨α, β⟩ if α and β do not appear in the same tree node.
If condition 2 holds, then the edge ⟨α, β⟩ is removed in Stage 1 or Stage 2.
If condition 3 holds, then either α ⊥ β ∥G Pa(α) or α ⊥ β ∥G Pa(β) (or both), where Pa(γ) denotes
{δ ∶ (δ, γ) ∈ D or ⟨δ, γ⟩ ∈ U}. The edge ⟨α, β⟩ is therefore removed in Stage 3.

16.12.2 Recovering the Complexes


Algorithm 12 locates and orients the complex arrows of G , after the skeleton G ′ has been located.

Algorithm 12 Complex Recovery

Input: The conditional independence statements of P, the skeleton G′ of G and the set S of
C-separators from Algorithm 11.
Output: The pattern G∗ of G
Initialise: G∗ = G′
for each ordered pair (α, β) such that S_{α,β} ∈ S do
    for each ⟨α, γ⟩ in G∗ do
        if X_α ⊥/ X_β ∣ X_{S_{α,β} ∪ {γ}} then
            Orient ⟨α, γ⟩ as (α, γ) in G∗
        end if
    end for
end for
The resulting graph G∗ is the pattern of G.
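
The orientation step can be sketched in Python as below. The sketch is an assumption-laden illustration, not the authors' code: `recover_complex_arrows` is a hypothetical name, the skeleton is assumed to be a set of two-element frozensets, `sepsets` maps each separated pair to its recorded separator, and `indep(a, b, S)` is a perfect oracle. An edge α − γ is oriented as α → γ when adding γ to the separator of (α, β) destroys the independence, which is the reading used in Example 16.15.

```python
def recover_complex_arrows(skeleton, sepsets, indep):
    """Return the set of arrows (a, c) oriented by the complex-recovery rule."""
    arrows = set()
    for pair, sep in sepsets.items():
        first, second = tuple(pair)
        # Visit both ordered pairs (a, b) and (b, a).
        for a, b in ((first, second), (second, first)):
            for edge in skeleton:
                if a not in edge:
                    continue
                (c,) = edge - {a}                  # the neighbour of a
                if c != b and not indep(a, b, set(sep) | {c}):
                    arrows.add((a, c))             # orient a - c as a -> c
    return arrows

# Toy example (assumption): the immorality A -> C <- B.
toy_skeleton = {frozenset(("A", "C")), frozenset(("B", "C"))}
toy_sepsets = {frozenset(("A", "B")): set()}

def toy_indep(a, b, S):
    # A and B are independent unless the collider C is conditioned on.
    return "C" not in S

toy_arrows = recover_complex_arrows(toy_skeleton, toy_sepsets, toy_indep)
```

Edges of the skeleton that never receive an arrow remain undirected in the pattern.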

Before proving that Algorithm 12 orients the edges correctly, the following preparatory lemma is
necessary.

Lemma 16.13. Any arrow oriented by Algorithm 12 gives the same orientation as the arrow in G .

Proof The result is clear and relies on faithfulness: lack of C-separation implies that the
corresponding conditional independence statement does not hold.
If all trails α ↔ β are blocked by S_{α,β}, but opened by γ, then γ is either a node in the region of
a complex on the trail between α and β or a descendant of such a node. Since γ is adjacent to α, it
follows that G contains the arrow (α, γ).

Theorem 16.14. If G ′ is the skeleton of a chain graph G which is faithful to the probability distribution
P over X , then the output G ∗ of Algorithm 12 is the pattern of G .

Proof This follows from Theorem 16.12 and Proposition 5.26. Firstly, Theorem 16.12 provides the
correct skeleton. Clearly, if all the C-sep-sets were recorded, Algorithm 12 would consider all ordered
pairs of nodes (α, β) and, for each γ such that ⟨α, γ⟩ ∈ G′ (the skeleton), determine whether or not
(α, γ) was a complex arrow (Definition ??). The algorithm would then return the pattern: the graph
where all the complex arrows are directed and the others are undirected.
The only remaining issue is whether or not the sep-sets provided by Algorithm 11 are sufficient.
By Proposition 5.26, for any complex (α, ρ1, . . . , ρn, β), there is a tree-node C which contains both
parents α and β of the complex. Hence the set of C-sep-sets returned by Algorithm 11 is sufficient and
hence Algorithm 12 returns the correct pattern.

Example 16.15.

Suppose the chain graph in Figure 5.7 gives a faithful graphical representation of the conditional
independence structure of a probability distribution P. Suppose that we derive the separation tree
of Figure 5.8. This separation tree is not optimal, in the sense that the tree-node F GKH could
be decomposed further into two tree-nodes F KG and GH separated by G. Given a perfect oracle,
Algorithm 11 will return the correct skeleton and the set of C-sep-sets will be sufficient for Algorithm 12
to return the correct pattern.
The C-sep-sets found by Algorithm 11 are:

• Stage 1 (each tree node): S_BC = {A}, S_CD = {B} (we cannot separate B − C at this stage, nor
D − E), S_FG = {D}, S_FH = {G}, S_KH = {G}.

• Stage 2: this simply looks at the edges removed from each tree-node. An edge between two nodes
is present in the skeleton if and only if it is present between the two nodes for every tree-node.

After Stage 2, the only additional edge still present in the graph which is not present in the skeleton
is D − E.

• Stage 3: The edge D − E is removed with sep-set S_DE = {C, F} using tree-nodes CDE and DEF.

The complex arrows are D → F, C → E, F → K and G → K. Algorithm 12 detects these because
G ⊥/ F ∣ S_GF ∪ {K}, i.e. G ⊥/ F ∣ {D, K}, for the immorality G → K ← F.
For the other complex, D ⊥/ C ∣ S_CD ∪ {F} and D ⊥/ C ∣ S_CD ∪ {E}.
The pattern has thus been established.

16.13 Structure Learning and Faithfulness: an Evaluation


16.13.1 Faithfulness and `real world' data
The Recursive Autonomy Identification algorithm was analysed by B. Barros (2012) [4], applying it
both to data simulated from test networks and to a financial data set. When applied to data
simulated from the ALARM network, the algorithm performed very well; the performance was
consistent with the results described by Yehezkel and Lerner [150]. For a data set generated by a
probability distribution for which there exists a faithful DAG, the results verified that the algorithm is
efficient and produces a graph that corresponds well to the distribution that generated the data, with
low computational overheads. The feature of the algorithm of making all required tests with smaller
conditioning sets before moving on to larger ones increases accuracy over methods that do not do this.
The additional use made of the structure, identifying the chain components of the essential graph at
each stage, ensures that fewer statistical calls (references to the data set) are required.
Some features were noted in the performance of the algorithm. In earlier stages, some contradictory
directions appeared: that is, pairs of immoralities X → Y ← Z, Y → Z ← W, in situations where
the edge Y ∼ Z would be deleted in subsequent rounds of the algorithm following tests with larger
conditioning sets. The direction chosen for the edge during that round was dictated by which
immorality appeared first. If the test X ⊥ Z ∣ S_{X,Z}, yielding a sep-set S_{X,Z}, was carried out first, then
the edge would take the direction Y ← Z. After carrying out the CI tests and determining the directions,
Meek's orientation rules were applied to determine the structures for the next round of the algorithm.
The algorithm worked very well; with 10000 observations, it produced a graph that had the correct
skeleton and only 4 edges with incorrect orientation.
The test of performance of an algorithm is based on the ability of the algorithm to recover a
probability distribution used to simulate data. There are several standard networks, including the
ALARM network, that are used. Data is simulated from the network and the algorithm applied to
the simulated data. Freedman and Humphreys (2000) p 33,34 [43] are somewhat scathing in their
assessment of this procedure for verifying the utility of an algorithm, of using simulated data from a
distribution known to have good properties. They write,

The ALARM network is supposed to represent causal relations between variables relevant to
hospital emergency rooms, and Spirtes Glymour Scheines (1993) [126] p 11 claim to have
discovered almost all the adjacencies and edge directions `from sample data'. However,
these `sample data' are simulated; the hospitals and patients exist only in the computer
program. The assumptions made by SGS (1993) [126] are all satisfied by fiat, having been
programmed into the computer: the question of whether they are satisfied in the real world
is not addressed. After all, computer programs operate on numbers, not on blood pressures
or pulmonary ventilation levels (two of the many evocative labels on nodes in the ALARM
network).

Freedman and Humphreys continue by stating,

These kinds of simulations tell us very little about the extent to which modelling assump-
tions hold true for substantive applications.

The constraint based algorithms all depend crucially on the modelling assumption that there is a DAG
that is faithful to the set of conditional dependence / independence statements that can be established.
We have already pinpointed two difficulties that can arise in the `real world': interaction effects without
main effects and hidden common causes.

16.13.2 Interaction effects without main effects


Example 2.7 gives an example of a situation where these constraint based algorithms will miss key
associations between the variables. Any situation where factors taken individually give no information,
but where there are two-factor or higher order interactions without main effects, will not be
detected. If applied to genetic data, for example, the algorithm will not be able to detect situations
where a single gene by itself has no apparent effect, but where the genome pathway may be opened by
two genes acting together.
This situation will not lead to internal inconsistencies in the functioning of the algorithms;
associations of this type will simply be missed and the output will be a DAG that does not show these
associations, but it need not lead to reversed edges (situations where the algorithm has to choose
between two contradictory directions for an edge).
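
An interaction without main effects can be checked by direct enumeration. The Python sketch below is an illustration under stated assumptions: it takes Y2 and Y3 to be independent fair bits (an assumption made for the example) and sets X1 = 1 if Y2 = Y3 and 0 otherwise. The helper names `prob` and `independent` are hypothetical. Pairwise tests find nothing, although the pair (Y2, Y3) determines X1 completely.

```python
from fractions import Fraction
from itertools import product

# Each outcome is (y2, y3, x1) with x1 = 1 exactly when y2 == y3.
outcomes = [(y2, y3, int(y2 == y3)) for y2, y3 in product((0, 1), repeat=2)]
weight = Fraction(1, len(outcomes))          # the four outcomes are equally likely

def prob(event):
    """Probability of the set of outcomes for which `event` is true."""
    return sum(weight for o in outcomes if event(o))

def independent(i, j):
    """Check P(o[i]=a, o[j]=b) == P(o[i]=a) * P(o[j]=b) for all a, b."""
    return all(
        prob(lambda o: o[i] == a and o[j] == b)
        == prob(lambda o: o[i] == a) * prob(lambda o: o[j] == b)
        for a, b in product((0, 1), repeat=2)
    )

x1_indep_y2 = independent(2, 0)              # True: no main effect of Y2
x1_indep_y3 = independent(2, 1)              # True: no main effect of Y3

# Jointly the dependence is total: P(X1 = 1 | Y2 = 0, Y3 = 0) = 1,
# whereas the marginal probability P(X1 = 1) is 1/2.
p_x1 = prob(lambda o: o[2] == 1)
p_x1_given_pair = (prob(lambda o: o == (0, 0, 1))
                   / prob(lambda o: o[:2] == (0, 0)))
```

A constraint based algorithm that removes an edge as soon as one marginal test fails to reject independence would delete both edges into X1 here.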

16.13.3 Hidden variables


In a `real world' situation, there may well be hidden variables which are not measured and the
experimenter may be unaware of their existence. This can lead to reversed edges, as the following example
illustrates. Suppose that X, Y, Z, W are variables that are recorded, while H is a hidden variable, a
common cause of X and Y , whose presence is not suspected by the researcher. Suppose that the causal
relations between H, X, Y, Z, W are given by Figure 16.12.

[Figure: DAG on X, Y, Z, W and H with edges H → X, H → Y, W → X, W → Z and Z → Y]

Figure 16.12: H is hidden and does not appear in the data matrix

If the RAI algorithm is applied to the variables X, Y, Z, W, whose associations are described by the
d-connection statements of the DAG in Figure 16.12, then X ⊥ Z ∣ W, giving the immorality
X → Y ← Z, and Y ⊥ W ∣ Z, giving the immorality Y → X ← W. Even if there is a perfect oracle
(sufficient data to give correct results for each CI test so that the results are consistent with the
probability distribution over (X, Y, Z, W)), the edge between X and Y is a reversed edge, X ↔ Y. This
notation means that, from the CI tests, one test gives a direction X → Y, the other gives a direction
X ← Y, and the algorithm will choose the direction depending on the order in which the tests are
carried out.
In the RAI algorithm, the direction that an edge takes in the output graph under such
circumstances is determined by the order of the variables; if the test result X ⊥ Z ∣ W appears first, the
output graph will contain X → Y and thus the graph will contain the false d-separation statement
W ⊥ Y ∣ {X, Z}, while if the result W ⊥ Y ∣ Z appears first, the output graph will contain the edge Y → X
and the false d-separation statement X ⊥ Z ∣ {W, Y}. The two possibilities are given in Figure 16.13.

[Figure: two DAGs on X, Y, Z, W, both with edges W → X, W → Z and Z → Y; the left panel has X → Y and the right panel has Y → X]

Figure 16.13: Possible outputs applying constraint based algorithm to variables (X, Y, Z, W ) from
Figure 16.12
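
The statements used in this example can be checked mechanically. The Python sketch below is illustrative and rests on an assumption: it reads the DAG of Figure 16.12 as having edges H → X, H → Y, W → X, W → Z and Z → Y, which is the structure implied by the discussion above. The function `d_separated` is a hypothetical helper that tests D-separation through the moralised ancestral graph.

```python
from itertools import combinations

# Assumed reading of Figure 16.12: each node mapped to its list of parents.
PARENTS = {"X": ["H", "W"], "Y": ["H", "Z"], "Z": ["W"]}

def d_separated(parents, a, b, S):
    """D-separation of a and b by S via the moralised ancestral graph."""
    S = set(S)
    keep, stack = {a, b} | S, [a, b, *S]
    while stack:                               # collect all ancestors
        for p in parents.get(stack.pop(), ()):
            if p not in keep:
                keep.add(p)
                stack.append(p)
    adj = {v: set() for v in keep}
    for child in keep:                         # moralise the subgraph
        pa = list(parents.get(child, ()))
        for p in pa:
            adj[p].add(child)
            adj[child].add(p)
        for p, q in combinations(pa, 2):
            adj[p].add(q)
            adj[q].add(p)
    seen, stack = {a}, [a]
    while stack:                               # look for a path avoiding S
        v = stack.pop()
        if v == b:
            return False
        for w in adj[v] - S - seen:
            seen.add(w)
            stack.append(w)
    return True

# Separations among the observed variables that a perfect oracle reports.
x_z_given_w = d_separated(PARENTS, "X", "Z", {"W"})        # True
y_w_given_z = d_separated(PARENTS, "Y", "W", {"Z"})        # True

# No set of observed variables separates X and Y, so the edge X - Y is kept.
xy_never_separated = not any(
    d_separated(PARENTS, "X", "Y", set(c))
    for r in range(3) for c in combinations(["W", "Z"], r))

# Statements implied by the two candidate outputs; both fail in the true DAG.
w_y_given_xz = d_separated(PARENTS, "W", "Y", {"X", "Z"})  # False
x_z_given_wy = d_separated(PARENTS, "X", "Z", {"W", "Y"})  # False
```

Each candidate output graph therefore asserts one d-separation that the distribution over (X, Y, Z, W) does not satisfy, whichever orientation of the edge between X and Y is chosen.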

16.13.4 The scope of structure learning


Algorithms can detect associations, at the level of `descriptive statistics', without reference to the
process that generates the data and the nature of randomness. At the level of descriptive statistics,
the scope of constraint based algorithms is viewed along the following lines: from the n × d data
matrix, an empirical distribution can be established (or, at least, if d is very large, empirical probability
distributions of the marginalisation to subsets of the variables can be established). Any test result that
produces X ⊥/ Y ∣ S corresponds to a d-connection statement that is to be retained in the output graph;
any test result where X ⊥ Y ∣ S is not rejected does not have to be retained in the output graph. The
output graph attempts to have as few edges as possible, while retaining all the d-connection statements
that were established through rejecting independence.
For large numbers of variables, there are clear difficulties that make serious inferential statistics
impossible. The assumption is that the n × d data matrix represents n independent instantiations of
a d-dimensional random vector X. This assumption, together with an assumption that n is sufficiently
large for a central limit theorem effect to hold, is required for the test statistics to be approximately
χ² distributed. Even if the nominal significance level α chosen for rejecting a null hypothesis can be
considered as a measure of a probability in any serious way, the number of tests required is so large that
the overall significance level could be close to 1. In terms of descriptive statistics, the output graph can
be informative, but it is difficult to reach inferential conclusions from the output of these algorithms.
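
The growth of the overall significance level can be illustrated with a short calculation. The Python sketch below makes a simplifying assumption that does not hold for real CI tests, namely that the m tests are independent, in which case the probability of at least one false rejection is 1 − (1 − α)^m. The function names are hypothetical, and only the marginal tests (one per pair of variables) are counted.

```python
def overall_significance(alpha, m):
    """P(at least one false rejection) for m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

def pairwise_tests(d):
    """Number of marginal independence tests: one per pair of d variables."""
    return d * (d - 1) // 2

# With d = 18 variables there are already 153 pairwise tests before any
# conditioning sets are considered.
m = pairwise_tests(18)
level = overall_significance(0.05, m)
```

With a nominal level of 0.05 the value of `level` exceeds 0.99, and the tests with non-empty conditioning sets only add to the count.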

16.13.5 Application of FAS and RAI to financial data


After testing the FAS and RAI algorithms on the training example of the ALARM network, where
they performed well, the work of Barros [4] proceeded to run these algorithms on a financial data set,
composed of the closing values of 18 stock market indices (Amsterdam stock index, Austrian traded
index, Brussels stock index, etc ...) from 1st January 2005 to 1st January 2011, approximately 1000
instantiations of 18 variables.
The aim of the thesis was to detect changes in associations between the variables, to learn a
structure, detect when the structure was no longer appropriate and update.
In the financial data set, the raw RAI algorithm gave no independence statements after the first
round; for each pair of variables (X, Y), the result was `reject independence'. Therefore, any pair of
variables should be d-connected in the output graph. Yet the output graph, following application of the
raw RAI algorithm, gave pairs of d-separated variables, which indicates that conditional independence
was falsely accepted due to weak tests.
In order to deal with the situation where `accept independence' from tests with large conditioning
sets contradicted d-connection statements with lower order conditioning sets, Barros adopted a
more conservative approach than the argumentation of Bromberg and Margaritis [9] and modified
the algorithm so that it did not accept an independence statement that resulted in a d-separation
in the output graph contradicting a dependence statement that had already been established. This
modification worked well.
The output still gave a large number of `reversed edges'. While the ALARM network gave one or
two, the financial data set gave approximately 28 reversed edges, indicating situations of the kind that
appear in the DAG in Figure 16.12, with possible output graphs corresponding to Figure 16.13.
The presence of a substantial number of `common cause' hidden variables would explain this.
This was a randomly chosen `real world' data set and probably not appropriate for an algorithm
based on a `faithfulness' assumption. The variables here do not satisfy one of the motivating features of
the faithfulness assumption, that the variables stand in causal relation to each other; their association
is more likely to be a result of hidden common causes, such as government policies, or global financial
considerations that influence the various stock markets.
The same difficulties seemed to arise in other applications. The RAI algorithm was applied to
the genetic data found in Friedman et al. [46]. Tentative results seem to give substantially different
output depending on the input order of the variables, suggesting hidden common causes.

16.13.6 Conclusion
Constraint based algorithms offer a fast approach, which is convenient with data matrices when d, the
number of variables, is very large. They can be many times faster than search and score algorithms.
Unfortunately, these algorithms tend to assume `faithfulness' and work on the principle of removing an
edge whenever a conditional independence test gives the result `do not reject X ⊥ Y ∣ S'. This leads to
several difficulties. Firstly, since tests with larger conditioning sets are weaker, it can lead to situations
where deletion of an edge can contradict earlier d-connection statements. This difficulty is present even
if there is a faithful DAG corresponding to the independence structure. Secondly, two-factor or higher
order interactions are not detected if there are no `main effects'. Thirdly, hidden variables can lead to
contradictory edges, resulting in d-separation statements not present in the probability distribution. If
there is no faithful DAG that describes the underlying independence structure, this can manifest itself
in other ways.
Modifications to remove the first of these difficulties have been considered, for example by Bromberg
and Margaritis [9] using argumentation and the more conservative approach of Barros [4] retaining all
dependence statements that have been established through rejecting independence.
The second and third of these difficulties have not been fully addressed by constraint based
algorithms.

16.13.7 The `Causal Discovery' Controversy


The discussion about structure learning has described various methods to locate structures that
represent the independence relations within a data set. All these methods, search and score, constraint
based, hybrid, yield results that fall under the heading of descriptive statistics. The search and score
methods simply examine some of the available structures and choose the structure with the highest
score of those examined. On the `classical' side, there is no measure of confidence for the structure
chosen; on the `Bayesian' side, even if a prior distribution is placed over the structure space and the
posterior used as the basis of a score function, there is no posterior assessment of the probability for
the structure to lie in a certain subspace of the set of possible structures; only a small number of
structures are visited and the structure chosen is the one visited that gives the largest score. With
constraint based methods, even if the hypothesis that the data matrix represents n instantiations of
i.i.d. random vectors held, the number of tests is so large that even with a small nominal significance
level for each test, the overall significance level approaches 1.
The output structure can give useful information at the level of descriptive statistics, but little or
no formal inference can be made. This is generally the case in multivariate statistics, where methods
are often more successful as descriptive than inferential tools.
Assume, though, that statistical associations have been established. Substantial parts of the
literature claim that a rigorous engine for inferring causation from association has been established.
For example, Spirtes, Glymour and Scheines (1993) [126] claim to have algorithms for discovering causal
relations based only on empirical data. The underlying assumption seems to be that, for a large class
of problems, when immoralities are learned from data and Meek's rules then applied, cause to effect
can be inferred for the directed edges of the essential graph. Schmidt, Niculescu-Mizil and Murphy
(2007) [123] write, explaining why they are constructing techniques to produce directed graphs,

`... undirected models cannot be used to model causality in the sense of Pearl [109], which is
useful in many domains such as molecular biology, where interventions can be performed.'

The thrust of the quote is that directed edges whose direction can be interpreted as cause to effect,
can be learned from data. But placing a causal interpretation on a directed arrow in a graph that has
been learned purely by applying a structure learning algorithm to data can be misleading.
In a situation where interventions can be performed, a causal directed graph can be obtained
from the undirected graph through further controlled experiments. Consider the situation on three
variables (X, Y, Z) where X ⊥ Z ∣ Y, but X ⊥/ Y, X ⊥/ Z, Y ⊥/ Z, Y ⊥/ X ∣ Z and Y ⊥/ Z ∣ X. There are three
DAGs along which the distribution p_{X,Y,Z} may be factorised, given in Figure 16.14. Suppose that an
intervention may be carried out on the variable Y, forcing its state. This has the effect of removing
arrows from parents of Y to Y. If the state Y ← y is forced, this gives the graphs in Figure 16.15.
If all the states of Y can be explored, in a controlled experiment, by randomly assigning levels of
the `treatment' variable Y , the causal structure can be determined from the Markov structure, but not
otherwise.
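
The effect of the intervention on the three Markov equivalent DAGs can be sketched as a graph surgery. The Python snippet below is a minimal illustration: the function `intervene` and the dictionary keys are hypothetical names, and each DAG is represented simply as a set of (parent, child) pairs.

```python
def intervene(edges, target):
    """Graph surgery: remove every arrow pointing into the intervened node."""
    return {(u, v) for (u, v) in edges if v != target}

# The three DAGs for which X and Z are independent given Y.
dags = {
    "fork": {("Y", "X"), ("Y", "Z")},            # X <- Y -> Z
    "chain_x_to_z": {("X", "Y"), ("Y", "Z")},    # X -> Y -> Z
    "chain_z_to_x": {("Z", "Y"), ("Y", "X")},    # X <- Y <- Z
}

# Forcing the state of Y removes the arrows from the parents of Y to Y.
after = {name: intervene(edges, "Y") for name, edges in dags.items()}
distinct_after = len({frozenset(e) for e in after.values()})
```

Before the intervention the three graphs encode the same independence structure. After it they differ: in the fork Y still influences both X and Z, in the chain from X to Z it influences only Z, and in the chain from Z to X it influences only X, which is why a controlled experiment can distinguish them.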
Markowetz and Spang [91] discuss the application of intervention calculus to perturbation experiments
aimed at inferring gene function and regulatory pathways.
16.13. STRUCTURE LEARNING AND FAITHFULNESS: AN EVALUATION 347

[Figure: the three DAGs X ← Y → Z, X → Y → Z and X ← Y ← Z]

Figure 16.14: Three Markov equivalent DAGs

[Figure: the three graphs after forcing Y = y: X ← Y → Z; Y → Z with X isolated; X ← Y with Z isolated]

Figure 16.15: Intervention Y ← y in Figure 16.14

As Freedman and Humphreys point out (1999) [43], commenting on automated causal learning,
`these claims are premature at best and the examples used in [126] to illustrate the algorithms are
indicative of failure rather than success.' They point out that `the gap between association and
causation has yet to be bridged.'

16.13.8 Faithfulness and the great leap of faith


One of the leading assumptions behind `causal discovery' is the assumption that distributions of interest
satisfy the faithfulness assumption, that there is a DAG G with variable set V = (U, O) where U denotes
the unobserved variables and O the observed variables and a probability distribution P over (U, O)
such that P factorises along G and G gives a faithful graphical representation of the independence
structure.
This is described as follows:
348 CHAPTER 16. CONSTRAINT-BASED STRUCTURE LEARNING ALGORITHMS

[Figure: DAG with arrows Y2 → X1, Y3 → X1, Y1 → X2, Y3 → X2, Y1 → X3, Y2 → X3]

Figure 16.16: DAG for the natural factorisation; it is not faithful

` .... the faithfulness condition can be thought of as the assumption that conditional
independence relations are due to causal structure rather than to accidents of parameter
values.' Spirtes et. al. (2000) [127]

Example 2.7 gives an instance of a situation where the probability distribution does not have a faithful
graphical representation. For the variables (Y1, Y2, Y3, X1, X2, X3), the DAG that best represents the
associations between the variables is given by Figure 16.16. In this graph, X1 = 1 if Y2 = Y3 and 0
otherwise. X1 ⊥ Y2 and X1 ⊥ Y3, but X1 ⊥̸ {Y2, Y3}. In this situation the influence of Y2 and Y3 on X1
is not seen if the variables are considered separately, but the interaction effect is decisive.
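These independence statements can be checked by exact enumeration. The base R sketch below assumes, as in Exercise 2 of Section 16.14, that Y2 and Y3 are independent with P(1) = P(0) = 1/2; the helper indep is an illustrative name of ours.

```r
# The four equally likely outcomes of (Y2, Y3); X1 = 1 exactly when Y2 == Y3.
grid <- expand.grid(Y2 = 0:1, Y3 = 0:1)
grid$X1 <- as.integer(grid$Y2 == grid$Y3)
grid$p <- 1 / 4

# Illustrative helper: TRUE when the joint probability table of a and b
# equals the product of its two margins.
indep <- function(a, b, p) {
  joint <- xtabs(p ~ a + b)
  product <- outer(rowSums(joint), colSums(joint))
  isTRUE(all.equal(as.vector(joint), as.vector(product)))
}

x1_y2 <- indep(grid$X1, grid$Y2, grid$p)      # X1 against Y2 alone
x1_y3 <- indep(grid$X1, grid$Y3, grid$p)      # X1 against Y3 alone
pair <- interaction(grid$Y2, grid$Y3)         # the pair (Y2, Y3) as one variable
x1_pair <- indep(grid$X1, pair, grid$p)       # X1 against the pair

c(x1_y2 = x1_y2, x1_y3 = x1_y3, x1_pair = x1_pair)
```

The first two checks report independence and the third does not, so the dependence is visible only when Y2 and Y3 are considered jointly.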
Another statement of the same principle is found in Meek (1995) [93]:

In cases where P(G) (the set of distributions that factorise along a graph G) can be
parametrised by a family of distributions with a parameter of finite dimensions, the set of
unfaithful distributions typically has Lebesgue measure zero. (Spirtes et. al. (2000) [127]
pp 42 - 2)

This assumption, that the set of observable variables O may be extended to a set V = (U, O) where
U represents unobserved common causes, or confounders, and that there will exist a DAG over V
that is faithful to the probability distribution over V , is re-stated in Robins, Scheines, Spirtes and
Wasserman (2003) [117]. There is strong interest in classes of faithful distributions in the literature;
the work of Zhang and Spirtes [151] requires that the class of distributions under consideration satisfy
a stronger assumption than faithfulness in order to obtain uniform consistency in causal inference for a
certain class of problems; [117] illustrates non-existence of uniform consistency when only faithfulness is
assumed, because of the possibility of non-faithful distributions in the closure of the set of distributions
under consideration.
Consider again Example 2.7 and suppose that O = (X1, X2, X3), the values for (X1, X2, X3) are
observable and U = (Y1, Y2, Y3), the results of (Y1, Y2, Y3) are hidden. Clearly, the set of distributions
over 6 binary variables that factorises over the DAG in Figure 16.16 can be described by a finite
parameter space; 15 parameters are required to describe the entire set of distributions; the parameter
space is [0, 1]^15. Furthermore, it is clear that the parameters to describe the distribution over
(Y1, Y2, Y3, X1, X2, X3) in Example 2.7 correspond to exactly one point in the parameter space, which
has Lebesgue measure zero. Nevertheless, examples where knowledge of two causes is required to explain
the effect and where knowledge only of a single cause tells you nothing about an effect arise all
the time in practice, in the real world.
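The count of 15 parameters can be reproduced with a short calculation. The sketch assumes the parent sets implied by the construction of the example (each Xi has two of the Yj as parents and each Yj has none); the variable names are ours.

```r
# Number of parents of each node in the DAG of Figure 16.16 (assumed from the
# construction: X1 depends on (Y2, Y3), X2 on (Y1, Y3), X3 on (Y1, Y2)).
n_parents <- c(Y1 = 0, Y2 = 0, Y3 = 0, X1 = 2, X2 = 2, X3 = 2)

# A binary node with k binary parents needs one free probability for each
# of the 2^k parent configurations.
n_params <- sum(2^n_parents)
n_params
```

The three root nodes contribute one parameter each and the three child nodes four each, giving 3 + 12 = 15.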
Furthermore, the parametrisation of any distribution that has an independence structure has
Lebesgue measure zero in the parameter space of all distributions over the variables in question.
Meek's argument can equally well be used to argue against searching for any independence structure
at all.
Faithfulness appears a convenient hypothesis to produce beautiful mathematics (and the relation
between DAGs and probability distributions under this assumption has produced a very elegant and
attractive mathematical theory), but it is difficult to see that it necessarily applies to real world
situations; the real world does not respect the fact that the set of parameters that describe the situation
have Lebesgue measure zero in a mathematical parameter space. Divergence between `real world'
behaviour and the assumption that it should fit into a convenient mathematical framework has been
termed `The Mind Projection Fallacy' by E.T. Jaynes (2003) [70].

16.13.9 Inferring non-causation and causation


Robins, Scheines, Spirtes and Wasserman (2003) [117] describe situations where non-causation can be
inferred. A situation where such an inference can be made is given by Figure 16.12 representing the
causal associations between variables, where H is hidden and X, Y, W are observable. In this example,
X is not a cause of Y , neither is Y a cause of X . This can be inferred from the CI tests; from the
results X ⊥ Z∣W and Y ⊥ W∣Z, it is possible to infer that the relation between X and Y is not cause
to effect in either direction and that a common cause H would explain the test results.
The discovery of an immorality, though, does not necessarily imply causation. Suppose H1 and
H2 are hidden and X, Z, Y are observable in Figure 16.17. The distribution over (X, Z, Y ) factorises
according to Figure 16.18.

[Figure: DAG with arrows H1 → X, H1 → Z, H2 → Z, H2 → Y]

Figure 16.17: H1 and H2 hidden

If one were using immoralities as a guide to causation, one would conclude that X and Y were common
causes of Z. As Freedman and Humphreys point out in [43], commenting on a DAG that Spirtes, Glymour
and Scheines (1993) [126] produced from a sociological data set,

The graph says, for instance, that race and religion cause region of residence.

In the context, this is nonsensical and raises a timely note of caution when inferring causality.
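The way two hidden common causes produce an apparent immorality can be checked through d-separation. The sketch below uses the bnlearn functions model2network and dsep and assumes that Figure 16.17 has the arrows H1 → X, H1 → Z, H2 → Z and H2 → Y; the object names are ours.

```r
library(bnlearn)

# DAG assumed for Figure 16.17; H1 and H2 are the hidden common causes.
g <- model2network("[H1][H2][X|H1][Z|H1:H2][Y|H2]")

# d-separation statements that involve only the observable variables.
marg_xy <- dsep(g, "X", "Y")        # X and Y, marginally
cond_xy <- dsep(g, "X", "Y", "Z")   # X and Y, given Z
marg_xz <- dsep(g, "X", "Z")        # X and Z, marginally
marg_yz <- dsep(g, "Y", "Z")        # Y and Z, marginally

c(marg_xy = marg_xy, cond_xy = cond_xy, marg_xz = marg_xz, marg_yz = marg_yz)
```

Only the marginal statement for X and Y holds, which is the pattern that a structure learning algorithm reads as X → Z ← Y, although neither X nor Y is a cause of Z.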

[Figure: DAG X → Z ← Y]

Figure 16.18: DAG for (X, Y, Z) from Figure 16.17

16.13.10 Summarising causal discovery


Freedman and Humphreys go on to summarise the attempts to automate `causal discovery' with the
example of smoking and lung cancer,

`The epidemiologists discovered an important truth - smoking is bad for you. The epidemiologists
made this discovery by looking at the data and using their brains, two skills
that are not readily automated. .... The examples in SGS (1993) [126] count against the
automation principle, not for it.'

The conclusion drawn by the authors of this article is that the output produced by structure learning
algorithms provides invaluable information. It can give good information about associations and can
certainly point towards the possibility of causal relations, but it does not even begin to automate the
process of learning causality; it is still necessary for researchers to use their brains to design experiments,
examine the data and use their brains again, taking into account circumstances and contexts additional
to the raw data, to reach conclusions. As the example from SGS (1993) [126], extended by Freedman
and Humphreys [43], shows, causation cannot be deduced from the presence of an immorality and,
indeed, cannot be inferred from the output of structure learning algorithms alone.

Notes The PC algorithm was introduced by Spirtes et al. [126] (1993), while the MMPC was introduced
by Tsamardinos et al. [137] (2006). The FAS algorithm is discussed in Fast [41]. Recursive
Autonomy Identification is due to Yehezkel and Lerner [150] (2009).

16.14 Exercises
1. This problem is motivated by the following consideration: when searching for a graph with a
suitable structure to fit a given data set with reasonable accuracy, Markov chain Monte Carlo
techniques are often used. These algorithms are computationally more efficient if they change
as few edges as possible at each transition, while ensuring that the chain can move through the
entire space of graphs. It is also more efficient to search the space of essential graphs, to ensure
that the chain does not spend time moving between graphs that are Markov equivalent.
This exercise shows that even in a simple setting, it is necessary to change at least two edges per
move to ensure that the algorithm can move from the current essential graph to a different essential
graph.

(a) Consider a collider connection A → B ← C . Is this an essential graph?


(b) List all the essential graphs on three variables.
(c) List all the graphs that may be obtained by altering one edge of the graph A → B ← C ,
through either adding or removing a directed edge or an undirected edge, or from directing
an undirected edge, or from `un-directing' a directed edge, or reversing the direction of a
directed edge. Which of these graphs are essential graphs?

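2. Let Y1, Y2, Y3 be three independent identically distributed variables with probability function
P(1) = P(0) = 1/2. Let

\[
X_1 = \begin{cases} 1 & Y_2 = Y_3 \\ 0 & \text{otherwise} \end{cases}
\qquad
X_2 = \begin{cases} 1 & Y_1 = Y_3 \\ 0 & \text{otherwise} \end{cases}
\qquad
X_3 = \begin{cases} 1 & Y_1 = Y_2 \\ 0 & \text{otherwise} \end{cases}
\]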

(a) Let V = {X1, X2, X3}. Construct an undirected graph by adding an edge between two nodes
α and β if and only if α ⊥̸ β∣S for any subset S ⊆ V/{α, β}.
(b) Construct the independence graph.
(c) What happens if V = {Y1 , Y2 , Y3 , X1 , X2 , X3 }?

3. Consider the second structure in Figure 16.19.

(a) Is it a chain graph?


(b) Is it an essential graph? If not, why not?
(c) If it is a chain graph, what are the chain components? Are they triangulated?
(d) Do there exist any substructures on three variables from the graph on the right of the form
of the graph on the left?

[Figure: two graphs; the first on three nodes, the second on the four nodes α, β, γ, δ]

Figure 16.19: Figure for Exercise 3

The following two exercises are taken from Chickering [24].

4. For any DAG G = (V, D), an edge (X, Y) ∈ D is said to be covered in G if Pa(X) = Pa(Y)/{X}. Let
G1 = (V, D1) be a DAG and let G2 = (V, D2) be obtained by reversing the edge (X, Y) ∈ D1.
Prove that G2 is Markov equivalent to G1 if and only if (X, Y) is covered in G1.

5. Let G1 and G2 be two Markov equivalent DAGs and suppose that there are exactly m edges in G1
with the opposite orientation in G2 . Using Exercise 4, prove that there is a sequence of exactly
m distinct edge reversals in G1 with the following properties:

• Each edge reversed is covered when it is reversed.
• After each reversal, the resulting graph H is Markov equivalent to G2.
• After all reversals, H = G2.

6. Let G = (V, D) be a directed acyclic graph. Prove that G^m, the moral graph, contains an
undirected edge ⟨X, Y⟩ if and only if X ⊥̸ Y ∥_G V/{X, Y} (X and Y are not d-separated by
V/{X, Y}).

7. Recall the Recursive Autonomy Identification algorithm, Subsection 16.7 page 321.

(a) In the description of stage 0, where an edge between X and Y is removed if and only if
X ⊥ Y, assume that the resulting skeleton is correct. Why is (X, Z, Y) an immorality if
there are edges X − Z and Z − Y but no edge X − Y?
(b) Assume that the graph in Figure 16.20 is a faithful graph for P_{X1,X2,X3,X4}. Assume that
the data set is sufficiently large so that each test for independence gives the correct result.
Outline how the algorithm proceeds, sketching the graphs returned at each stage of the
algorithm, stating the reasons for deleting edges and directing edges.
(c) Assume that the graph in Figure 16.21 is a faithful graph for P_{X1,X2,X3,X4} and that each
independence test gives the correct result. Outline how the algorithm proceeds.

[Figure: DAG with arrows X1 → X3, X2 → X3, X3 → X4]

Figure 16.20: Directed acyclic graph for algorithm, example 1

[Figure: DAG with arrows X1 → X2, X1 → X3, X2 → X4, X3 → X4]

Figure 16.21: Directed acyclic graph for algorithm, example 2

(d) Assume that the graph in Figure 2.3 is faithful to the distribution P_{U1,Z1,Z2,Z3,Z4} and
that variable U1 is hidden. What is the output of the RAI algorithm if the input is
(Z1, Z2, Z3, Z4)? What is the output of the RAI algorithm if the input order is (Z4, Z3, Z2, Z1)?

16.15 Answers
1. (a) Yes: A → B ← C is an essential graph.
(b) Recall that the essential graph is the graph where directions are retained on and only on
those edges that retain the same direction in every graph in the Markov equivalence class.
Hence A−B ← C is not an essential graph since A → B ← C and A ← B ← C are not Markov
equivalent; if B ← C is present and (A, B, C) is not an immorality, this forces A ← B .
With this in mind, the essential graphs are:
The three graphs A → B ← C , B → A ← C , A → C ← B , the three graphs with one
(undirected) edge between two of the nodes and the third node unconnected, the graph with
no edges between any of the nodes, the three graphs with two undirected edges A − B − C ,
A − C − B , C − A − B . The graph with three undirected edges between A, B and C .
(c) A, B ← C ; A − B ← C ; A ← B ← C ; A → B, C ; A → B − C ; A → B → C . None of them are
essential graphs.

2. (a) The graph contains no edges, since X1 ⊥ X2, X1 ⊥ X3 and X2 ⊥ X3.
(b) The graph is complete; X1 ⊥̸ X2∣X3, X1 ⊥̸ X3∣X2 and X2 ⊥̸ X3∣X1.
(c) Again, the graph constructed according to the `faithfulness' principle (no edge X ∼ Y
whenever X ⊥ Y∣S for some S) is empty since Y1 ⊥ X1, Y1 ⊥ X2, X1 ⊥ X2. Taking S = ϕ
(the empty set) in each case, any pair of variables is independent.

3. (a) Yes; it is a chain graph.


(b) No; the edge α − β appears in a compelled configuration and should be directed α ↦ β.
(c) There is one chain component, which contains all the nodes {α, β, γ, δ}, with the edges
γ ↦ β and δ ↦ β removed. It is a tree and it is clearly triangulated, because it contains no
cycles.
(d) No; {γ, α, δ} has two undirected edges, {γ, β, δ} forms an immorality centred at β, {α, δ, β}
has three edges and {α, γ, β} has three edges.
This shows that, to establish that a chain graph is an essential graph, it is not sufficient simply
to show that the chain components are triangulated and that the substructure on the left in
Figure 16.19 does not appear.

4. When edge (X, Y) is reversed in a DAG G1 to form a new graph G2, the two graphs are Markov
equivalent if and only if they have the same skeleton and the same immoralities and there are no
cycles in G2.
When (X, Y) is removed and (Y, X) is added, there are no new immoralities if and only if for
each Z ∈ Pa(X), there is a link between Y and Z. The link is (Z, Y), otherwise there is a cycle in
G1. Therefore Pa(X) ⊆ Pa(Y) in G1.
No immoralities are removed and no cycles are introduced if and only if for any Z ∈ Pa(Y )/{X},
Z ∈ Pa(X), so Pa(Y )/{X} ⊆ Pa(X). It follows that Pa(X) = Pa(Y )/{X}.

5. Assume that none of the m edges are covered. Then using the previous exercise, for each
edge (X, Y ) to be altered, either there is a node Z ∈ Pa(Y )/Pa(X) or there is a node Z ∈
Pa(X)/Pa(Y ). If there is a node Z ∈ Pa(Y )/Pa(X) then (X, Y, Z) is an immorality, so that the
direction X → Y remains the same in any Markov equivalent graph. It follows that for each of
the m edges (X, Y ), there is a variable Z ∈ Pa(X)/Pa(Y ). If the direction of the edge (Z, X)
is not also reversed, then (Z, X, Y ) is an immorality in the new graph, which is a contradiction.
It follows that there is at least one covered edge among the m edges. Change the orientation of
this edge. After the change, there is a covered edge among the remaining m − 1 and by induction
the target graph is obtained after m changes.

6. Firstly, note that the moral graph contains an edge X − Y if and only if Y ∈ MB(X), the Markov
blanket of X. MB(X) is the set X, together with Pa(X) (the parents of X) and Ch(X) (the
children of X) and all parents that share a child with X. That is, X together with all neighbours
of X and those variables that are linked to X when the graph is moralised.
Note that
X ⊥ V/MB(X) ∥_G MB(X)

so that, using the weak union result of Exercise 2 page 22, if Y ∉ MB(X),

X ⊥ Y ∥_G V/{X, Y}.

If Y ∈ MB(X), then the moral graph contains an edge X − Y and X ⊥̸ Y ∥_G V/{X, Y}.

7. (a) If X − Z − Y is a fork or chain connection, X ⊥̸ Y so that X − Y would not be removed. It
follows that if the vee structure X − Z − Y remains in the final graph, it is an immorality.
(b) Stage 1: add all edges, all are undirected.
Stage 2: X1 ⊥ X2 so remove X1 − X2 . No other edges removed. X1 − X3 − X2 is a collider,
X1 − X4 − X2 is a collider. No additional compelled edges.
The chain components are: {X1 }, {X2 } and {X3 , X4 }.
Stage 3: Consider chain component {X3 , X4 } together with chain components containing
parents. X1 ⊥ X4 ∣X3 , X2 ⊥ X4 ∣X3 so remove X1 − X4 and X2 − X4 . Now X3 → X4 is
compelled.
There are now no parent sets of size greater than 1 hence the algorithm terminates.
(c) Stage 1: add all edges to make the complete undirected graph.
Stage 2: test X ⊥ Y ∣ϕ; no edges removed, none of the variables are (pairwise) independent.
There is therefore only one chain component after this stage; the complete graph. Stage
3: test X ⊥ Y ∣Z for each triple (12 tests). The only independence result is X2 ⊥ X3 ∣X1 ,
therefore edge ⟨X2 , X3 ⟩ is removed. The triple (X2 , X4 , X3 ) is an immorality. The triple
(X2 , X1 , X3 ) is not an immorality. By Meek's rules, the edge ⟨X1 , X4 ⟩ is compelled X1 → X4 .
Stage 4: the chain components are the subgraph with {X2, X1, X3} and the subgraph {X4}.
Start with {X4}, this is connected to {X2, X1, X3}. At this stage, X1 ⊥ X4∣{X2, X3} so that
the edge X1 → X4 is removed. The algorithm now terminates, returning the essential graph
with undirected edges ⟨X1, X2⟩ and ⟨X1, X3⟩ and directed edges ⟨X2, X4⟩ and ⟨X3, X4⟩.
Chapter 17

Bayesian Networks in R: Structure and Parameter Learning

17.1 Bayesian Networks with bnlearn


This tutorial is based predominantly on the bnlearn package, a package by Marco Scutari. It supports
a wide variety of structure learning algorithms, which are described in the package documentation:

Constraint Based Algorithms


1. Grow-Shrink gs: based on the Grow-Shrink Markov Blanket, the first (and simplest) Markov
blanket detection algorithm used in a structure learning algorithm.

2. Incremental Association iamb: based on the Markov blanket detection algorithm of the same
name, which is based on a two-phase selection scheme (a forward selection followed by an attempt
to remove false positives).

3. Fast Incremental Association fast.iamb: a variant of IAMB which uses speculative stepwise
forward selection to reduce the number of conditional independence tests.

4. Interleaved Incremental Association inter.iamb: another variant of IAMB which uses
forward stepwise selection to avoid false positives in the Markov blanket detection phase.

Search and Score Learning Algorithms


1. Hill-Climbing hc: a hill climbing greedy search on the space of the directed graphs. The
optimised implementation uses score caching, score decomposability and score equivalence to
reduce the number of duplicated tests.

2. Tabu Search tabu: a modified hill climbing able to escape local optima by selecting a network
that minimally decreases the score function.


Hybrid Learning Algorithms


1. Max-Min Hill-Climbing mmhc: a hybrid algorithm which combines the Max-Min Parents and
Children algorithm (to restrict the search space) and the Hill-Climbing algorithm (to find the
optimal network structure in the restricted space).

2. Restricted Maximization rsmax2: a more general implementation of the Max-Min Hill-Climbing,
which can use any combination of constraint-based and score-based algorithms.

Other (Constraint-Based) Learning Algorithms


These algorithms learn the structure of the undirected graph underlying the Bayesian network, which
is known as the skeleton of the network or the (partial) correlation graph. Therefore all the arcs are
undirected, and no attempt is made to detect their orientation. They are often used in hybrid learning
algorithms.

1. Max-Min Parents and Children mmpc: a forward selection technique for neighbourhood
detection based on the maximization of the minimum association measure observed with any
subset of the nodes selected in the previous iterations.

2. Hiton Parents and Children si.hiton.pc: a fast forward selection technique for neigh-
bourhood detection designed to exclude nodes early based on the marginal association. The
implementation follows the Semi-Interleaved variant of the algorithm.

3. Chow-Liu chow.liu: an application of the minimum-weight spanning tree and the information
inequality. It learns the tree structure closest to the true one in the probability space.

4. ARACNE aracne: an improved version of the Chow-Liu algorithm that is able to learn polytrees.
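As a rough illustration of how the algorithms listed above are called, the following sketch applies one function from several of the families to the marks data set used later in this chapter. The choice of algorithms and the object name learned are ours; all tuning parameters are left at their defaults.

```r
library(bnlearn)
data(marks)

# Learn a structure from the same data with four of the listed algorithms.
learned <- list(
  gs   = gs(marks),    # constraint-based
  iamb = iamb(marks),  # constraint-based
  hc   = hc(marks),    # score-based
  mmhc = mmhc(marks)   # hybrid
)

# Number of arcs reported for each learned graph.
sapply(learned, narcs)
```

Each call returns an object of class bn, so the results can be compared, plotted or scored in the same way.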

17.1.1 Creating and Manipulating Network Structures


The following illustrates how to create objects of class bn. We consider the marks data set, which
gives the exam scores of 88 students across five different topics: mechanics, vectors, algebra, analysis
and statistics. The original data set was investigated by Mardia et al. (1979) [90] and subsequently
became a benchmark for structure learning (e.g. Whittaker (1990) [144]). It is a data set within the
bnlearn package under the name marks.

> library(bnlearn)
> data(marks)
> str(marks)
'data.frame': 88 obs. of 5 variables:
$ MECH: num 77 63 75 55 63 53 51 59 62 64 ...
$ VECT: num 82 78 73 72 63 61 67 70 60 72 ...
$ ALG : num 67 80 71 63 65 72 65 68 58 60 ...
$ ANL : num 67 70 66 70 70 64 65 62 62 62 ...
$ STAT: num 81 81 81 68 63 73 68 56 70 45 ...

First create an empty network with the nodes corresponding to the variables using the empty.graph
function:

> ug<-empty.graph(names(marks))

The arcs presented in Whittaker (1990) from Figure 17.1 may be added as follows:

> arcs(ug,ignore.cycles=TRUE)=matrix(
+ c("MECH","VECT","MECH","ALG","VECT","MECH",
+ "VECT","ALG","ALG","MECH","ALG","VECT",
+ "ALG","ANL","ALG","STAT","ANL","ALG",
+ "ANL","STAT","STAT","ALG","STAT","ANL"),
+ ncol=2, byrow = TRUE,
+ dimnames=list(c(),c("from","to")))
> plot(ug)

[Figure: undirected graph on the nodes MECH, VECT, ALG, ANL, STAT]

Figure 17.1: Marks network: undirected graph

The resulting ug object belongs to class bn. It has several components: ug$learning, ug$nodes,
ug$arcs.
$learning is not useful in this example, since this component gives information about the results of
the structure learning algorithm used to generate the network and its tuning parameters (which were
not used here).
$nodes gives information about the Markov blanket of each node, while $arcs gives the arcs
present in the network.

> ug

Random/Generated Bayesian network

model:
[undirected graph]
nodes: 5
arcs: 6
undirected arcs: 6
directed arcs: 0
average markov blanket size: 2.40
average neighbourhood size: 2.40
average branching factor: 0.00

generation algorithm: Empty

> dag = empty.graph(names(marks))


> arcs(dag)=matrix(c("VECT","MECH","ALG","MECH","ALG","VECT",
+ "ANL","ALG","STAT","ALG","STAT","ANL"),
+ ncol=2,byrow=TRUE,
+ dimnames=list(c(),c("from","to")))
> dag

Random/Generated Bayesian network

model:
[STAT][ANL|STAT][ALG|ANL:STAT][VECT|ALG][MECH|VECT:ALG]
nodes: 5
arcs: 6
undirected arcs: 0
directed arcs: 6
average markov blanket size: 2.40
average neighbourhood size: 2.40
average branching factor: 1.20

generation algorithm: Empty
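The same DAG can also be built directly from the model string printed above, using the bnlearn function model2network. The sketch below is an alternative construction; the helper arc_key and the object names are ours.

```r
library(bnlearn)
data(marks)

# Construction from an arc matrix, as in the text.
dag <- empty.graph(names(marks))
arcs(dag) <- matrix(c("VECT", "MECH", "ALG", "MECH", "ALG", "VECT",
                      "ANL", "ALG", "STAT", "ALG", "STAT", "ANL"),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(NULL, c("from", "to")))

# Construction from the model string reported when the network is printed.
dag_ms <- model2network("[STAT][ANL|STAT][ALG|ANL:STAT][VECT|ALG][MECH|VECT:ALG]")

# Illustrative helper: the arc set as a sorted character vector.
arc_key <- function(g) sort(paste(arcs(g)[, "from"], arcs(g)[, "to"], sep = "->"))
identical(arc_key(dag), arc_key(dag_ms))
```

The model string is convenient for small networks, since the parents of every node are visible in one line.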

A dag can be specified by its adjacency matrix. The function all.equal() indicates whether two
graphs are equal.

> mat=matrix(c(0,1,1,0,0,0,0,1,0,0,0,0,
+ 0,1,1,0,0,0,0,1,0,0,0,0,0),
+ nrow=5,
+ dimnames=list(nodes(dag),nodes(dag)))
> mat
MECH VECT ALG ANL STAT
MECH 0 0 0 0 0
VECT 1 0 0 0 0
ALG 1 1 0 0 0
ANL 0 0 1 0 0
STAT 0 0 1 1 0
> dag2=empty.graph(nodes(dag))
> amat(dag2)=mat
> all.equal(dag,dag2)
[1] TRUE

A new bn object may be created by adding (set.arc), dropping (drop.arc) or reversing (reverse.arc)
arcs from the original. For example:

> dag3 = empty.graph(nodes(dag))


> dag3 = set.arc(dag3,"VECT","MECH")
> dag3 = set.arc(dag3,"ALG","MECH")
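Dropping and reversing arcs follow the same pattern. The sketch below rebuilds dag3 and then applies the bnlearn functions drop.arc and reverse.arc; the names dag4 and dag5 are ours.

```r
library(bnlearn)
data(marks)

# Rebuild dag3 as in the text: VECT -> MECH and ALG -> MECH.
dag3 <- empty.graph(names(marks))
dag3 <- set.arc(dag3, from = "VECT", to = "MECH")
dag3 <- set.arc(dag3, from = "ALG", to = "MECH")

# Remove ALG -> MECH, leaving a single arc.
dag4 <- drop.arc(dag3, from = "ALG", to = "MECH")

# Reverse VECT -> MECH, so that MECH becomes the parent of VECT.
dag5 <- reverse.arc(dag3, from = "VECT", to = "MECH")

arcs(dag4)
parents(dag5, "VECT")
```

Each function returns a new bn object and leaves its argument unchanged, so the result has to be assigned.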

A topological ordering of the nodes (from ancestors to descendants) may be obtained by the func-
tion node.ordering(). The neighbours and Markov blanket may be found using nbr() and mb()
respectively. The %in% command may be used to establish membership.

> node.ordering(dag)
[1] "STAT" "ANL" "ALG" "VECT" "MECH"
> nbr(dag,"ANL")
[1] "ALG" "STAT"
> mb(dag,"ANL")
[1] "ALG" "STAT"
> "ANL" %in% mb(dag,"ALG")
[1] TRUE

We can check that the Markov blanket of a variable consists of its parents, its children and the other parents of its children:

> chld=children(dag,"VECT")
> par=parents(dag,"VECT")
> o.par=sapply(chld,parents,x=dag)
> unique(c(chld,par,o.par[o.par != "VECT"]))
[1] "MECH" "ALG"
> mb(dag,"VECT")
[1] "MECH" "ALG"

17.1.2 Visualising Graphical Models


The structures from bnlearn may be plotted using functions provided by graph and Rgraphviz
packages (Gentry et. al. [52](2012)). The graphviz.plot function takes a bn object and returns the
corresponding graph object.
In bnlearn, vee-structure refers to a collider connection.

> library(Rgraphviz)
Loading required package: grid
> h = list(arcs=vstructs(dag2,arcs=TRUE),lwd=4,col="black")
> graphviz.plot(dag2,highlight=h,layout="fdp",main="dag2")

The output is shown in Figure 17.2.

[Figure: plot of dag2, with nodes ANL, STAT, ALG, MECH, VECT]

Figure 17.2: Plot obtained using graphviz.plot

The essential graph, showing the Markov equivalence class, is returned by cpdag. The function moral
returns the moral graph.

> plot(cpdag(dag2))

for example gives a plot of the essential graph corresponding to dag2.
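For this particular DAG the essential graph is easy to anticipate: the two parents of ALG are adjacent, as are the two parents of MECH, so there are no immoralities and no edge direction is compelled. The sketch below checks this with vstructs, cpdag and moral; it rebuilds dag2 from its model string so that it can be run on its own, and the object names are ours.

```r
library(bnlearn)

# The DAG dag2 from the text, rebuilt from its model string.
dag2 <- model2network("[STAT][ANL|STAT][ALG|ANL:STAT][VECT|ALG][MECH|VECT:ALG]")

# Immoralities (vee-structures): none are expected here.
vs <- vstructs(dag2)
nrow(vs)

# Essential graph: with no immoralities, no directed arc should remain.
eg <- cpdag(dag2)
nrow(directed.arcs(eg))

# Moral graph of the DAG, returned as an undirected bn object.
mg <- moral(dag2)
```

With no immoralities, moralisation adds no edges, so the moral graph has the same edges as the skeleton of dag2.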

17.1.3 Structure Learning


In bnlearn, the Grow-Shrink algorithm is implemented simply by the function gs. The name is natural
following the procedure; first a candidate Markov blanket is grown for each node and then
unnecessary nodes are removed. (The Max-Min Parents and Children (MMPC) algorithm is available
separately through the function mmpc.)

> bn.gs <- gs(marks)


> bn.gs
Bayesian network learned via Constraint-based methods

model:
[undirected graph]
nodes: 5
arcs: 6
undirected arcs: 6
directed arcs: 0
average markov blanket size: 2.40
average neighbourhood size: 2.40
average branching factor: 0.00

learning algorithm: Grow-Shrink


conditional independence test: Pearson's Correlation
alpha threshold: 0.05
tests used in the learning procedure: 44
optimized: TRUE

The parameter value α = 0.05 is the nominal significance level for each conditional independence test
(here, the test based on Pearson's correlation). The mmhc algorithm learns a different network, but it is
Markov equivalent to the network learned by the gs algorithm and has the same BIC score.

These structure learning algorithms often only direct an edge when a particular direction gives a
better fit, leaving other edges undirected. The function cextend() returns one DAG from the Markov
equivalence class, which may be used for scoring purposes. The BIC score for the learned graph may
be obtained as follows. The documentation lists other scoring criteria that are available (such as AIC).

> bn.gsdirect <- cextend(bn.gs)


> bn.gsdirect

Bayesian network learned via Constraint-based methods

model:
[STAT][ANL|STAT][ALG|ANL:STAT][VECT|ALG][MECH|VECT:ALG]
nodes: 5
arcs: 6
undirected arcs: 0
directed arcs: 6
average markov blanket size: 2.40
average neighbourhood size: 2.40
average branching factor: 1.20

learning algorithm: Grow-Shrink


conditional independence test: Pearson's Correlation
alpha threshold: 0.05
tests used in the learning procedure: 44
optimized: TRUE

> score(bn.gsdirect, data=marks, type="bic-g")


[1] -1720.15
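Other scoring criteria are requested through the type argument of score. The sketch below compares the Gaussian log-likelihood, AIC and BIC for the DAG obtained above; it rebuilds that DAG from its model string so that it can be run on its own, and the object names are ours.

```r
library(bnlearn)
data(marks)

# The DAG returned by cextend() in the text, rebuilt from its model string.
dag <- model2network("[STAT][ANL|STAT][ALG|ANL:STAT][VECT|ALG][MECH|VECT:ALG]")

ll  <- score(dag, data = marks, type = "loglik-g")  # Gaussian log-likelihood
aic <- score(dag, data = marks, type = "aic-g")     # log-likelihood with the AIC penalty
bic <- score(dag, data = marks, type = "bic-g")     # log-likelihood with the BIC penalty

c(loglik = ll, AIC = aic, BIC = bic)
```

With 88 observations the BIC penalty per parameter, log(88)/2, exceeds the AIC penalty of 1, so the three values should be ordered loglik > AIC > BIC.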

17.1.4 Parameter Learning


Having established the network, the next task is to learn the parameters. With bnlearn, this is
performed by the bn.fit function.

> fitted = bn.fit(bn.gsdirect, data=marks)


> fitted

Bayesian network parameters

Parameters of node MECH (Gaussian distribution)

Conditional density: MECH | VECT + ALG


Coefficients:
(Intercept) VECT ALG
-12.3647583 0.4658693 0.5484053
Standard deviation of the residuals: 13.97432

Parameters of node VECT (Gaussian distribution)

Conditional density: VECT | ALG


Coefficients:
(Intercept) ALG
12.4183094 0.7543653
Standard deviation of the residuals: 10.48167

Parameters of node ALG (Gaussian distribution)

Conditional density: ALG | ANL + STAT


Coefficients:
(Intercept) ANL STAT


24.7254768 0.3482454 0.2273881
Standard deviation of the residuals: 6.871428

Parameters of node ANL (Gaussian distribution)

Conditional density: ANL | STAT


Coefficients:
(Intercept) STAT
24.5824229 0.5223601
Standard deviation of the residuals: 11.86392

Parameters of node STAT (Gaussian distribution)

Conditional density: STAT


Coefficients:
(Intercept)
42.30682
Standard deviation of the residuals: 17.25559

The type of estimator (maximum likelihood or Bayes) can be specified through the method argument, as
either mle (maximum likelihood estimates) or bayes (the posterior Bayesian estimate arising from a flat,
non-informative prior). Only mle is available with continuous (Gaussian) data; bayes considers
Dirichlet densities over the parameter space.
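As an illustration of the two estimators, the sketch below fits the same structure to a discretised copy of the marks data with method = "mle" and with method = "bayes". The discretisation into two quantile-based levels and the imaginary sample size iss = 1 are our choices for the example, not values taken from the text.

```r
library(bnlearn)
data(marks)

# Discretise each score into two levels so that the network is multinomial.
dmarks <- discretize(marks, breaks = 2, method = "quantile")
dag <- model2network("[STAT][ANL|STAT][ALG|ANL:STAT][VECT|ALG][MECH|VECT:ALG]")

# Maximum likelihood estimates of the conditional probability tables.
fit_mle <- bn.fit(dag, data = dmarks, method = "mle")

# Posterior estimates under a uniform Dirichlet prior; iss is the imaginary
# sample size, which controls the weight of the prior.
fit_bayes <- bn.fit(dag, data = dmarks, method = "bayes", iss = 1)

fit_mle$STAT$prob
fit_bayes$STAT$prob
```

The Bayesian estimates are pulled slightly towards the uniform distribution, which also avoids undefined entries for parent configurations that do not occur in the data.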

The parameters of a fitted network can easily be replaced. For example, ALG has two parents, ANL and
STAT. For the Gaussian network, the restriction is that the standard deviation for the residuals at each
node is the same. We consider
ALG = β0 + β1 ANL + β2 STAT + ϵ
where ϵ ∼ N(0, σ²), independent identically distributed. This is carried out by:

> fitted$ALG = list(coef=c("(Intercept)"=25, "ANL"=0.5, "STAT"=0.25),sd=6.5)


> fitted$ALG

Parameters of node ALG (Gaussian distribution)

Conditional density: ALG | ANL + STAT


Coefficients:
(Intercept) ANL STAT
25.00 0.50 0.25
Standard deviation of the residuals: 6.5

A bn.fit object can be created from scratch using the custom.fit function. For example:

> MECH.par = list(coef=c("(Intercept)"=-10, "VECT"=0.5, "ALG"=0.6),sd = 13)


> VECT.par = list(coef=c("(Intercept)"=10, "ALG"=1),sd=10)
> ALG.par=list(coef=c("(Intercept)"=25,"ANL"=0.5,"STAT"=0.25),sd=6.5)
> ANL.par=list(coef=c("(Intercept)"=25,"STAT"=0.5),sd=12)
> STAT.par=list(coef=c("(Intercept)"=43),sd=17)
> dist=list(MECH=MECH.par,VECT=VECT.par,ALG=ALG.par,ANL=ANL.par,STAT=STAT.par)
> fitted2 = custom.fit(bn.gsdirect,dist=dist)
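Once a bn.fit object is available, whether estimated from data or specified by hand, new observations can be simulated from it with the bnlearn function rbn. The sketch below fits the Gaussian network to the marks data and draws a sample; the seed, the sample size of 200 and the object names are arbitrary choices of ours.

```r
library(bnlearn)
data(marks)

# Structure used in the text, rebuilt from its model string, with
# parameters estimated from the marks data.
dag <- model2network("[STAT][ANL|STAT][ALG|ANL:STAT][VECT|ALG][MECH|VECT:ALG]")
fitted <- bn.fit(dag, data = marks)

# Draw 200 simulated observations from the fitted Gaussian network.
set.seed(1)
sim <- rbn(fitted, n = 200)

dim(sim)
head(sim, 3)
```

Comparing summaries of the simulated data with those of marks gives a quick informal check of the fitted model.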

17.1.5 Discretisation
The only continuous models that can be accommodated are Gaussian. When the data is manifestly not
Gaussian, it is better to discretise it and to construct a Bayesian network over multinomial variables.
There are several methods of discretisation available; look up the documentation for discretize. For
example:

> ?discretize
> dmarks = discretize(marks, breaks=2, method="quantile")
> bn.dgs=gs(dmarks)
> plot(bn.dgs)
> all.equal(cpdag(bn.dgs),cpdag(bn.gsdirect))
[1] "Different number of directed/undirected arcs"

The network learned from the discretised data is different; MECH is independent of the other variables.

The parameters may be fitted to the structure using the discretised data:

> fitted3=bn.fit(cextend(bn.dgs),data=dmarks)
> fitted3$ALG

Parameters of node ALG (multinomial distribution)

Conditional probability table:

         ANL
ALG          [9,49]   (49,70]
  [15,50] 0.7777778 0.2558140
  (50,80] 0.2222222 0.7441860

17.1.6 Latent Variables


Probability distributions often fail to have a faithful graphical representation because there are latent
(or hidden) variables missing from the model.
17.1. BAYESIAN NETWORKS WITH BNLEARN 367

For the marks data, Edwards (2000) [40] assumed that the students fell into two distinct groups
(which we call A and B). He then used a classification technique involving the EM algorithm to assign
the students to the two different classes. The results were as follows: group A contained students 1-44
and 46-52, while group B contained students 45 and 53-88. We add in this latent variable and
construct a network for group A and another network for group B. We then discretise the variables
and learn the network when the latent variable is included. The results are:

> latent=factor(c(rep("A",44),"B",rep("A",7),rep("B",36)))
> bn.A = hc(marks[latent=="A",])
> bn.B = hc(marks[latent=="B",])
> modelstring(bn.A)
[1] "[MECH][ALG|MECH][VECT|ALG][ANL|ALG][STAT|ALG:ANL]"
> modelstring(bn.B)
[1] "[MECH][ALG][ANL][STAT][VECT|MECH]"
> dmarks=discretize(marks,breaks=2,method="interval")
> dmarks2=cbind(dmarks,LAT=latent)
> bn.LAT=hc(dmarks2)
> bn.LAT

Bayesian network learned via Score-based methods

model:
[MECH][ANL][LAT|MECH:ANL][VECT|LAT][ALG|LAT][STAT|LAT]
nodes: 6
arcs: 5
undirected arcs: 0
directed arcs: 5
average markov blanket size: 2.00
average neighbourhood size: 1.67
average branching factor: 0.83

learning algorithm: Hill-Climbing


score: BIC (disc.)
penalization coefficient: 2.238668
tests used in the learning procedure: 40
optimized: TRUE

Note that for the learned network, variable LAT has two parents: MECH and ANL. If MECH, VECT, ALG,
ANL, STAT were continuous, the discrete node LAT would have continuous parents, and this distribution
would therefore not fall into the conditional Gaussian (CG) framework.

17.1.7 Application to Gene Expression Data


The analysis for large arrays of gene expression data is dealt with in the following steps:

1. Outliers are removed. This is because, for continuous data, Bayesian networks only support
multivariate Gaussian distributions; outliers make the Gaussian modelling assumptions less likely
to hold.

2. Structure learning is repeated several times, so that there is more chance of finding a global
maximiser for the score function.

3. The networks discovered in the previous step are averaged. This is a technique from Claeskens
and Hjort (2008) [29]. The averaged network uses arcs present in (say) 85% of the networks.

We try this on the sachs.data.txt data set, found in the data directory of the course home page:

> library(bnlearn)
> sachs.data <- read.delim("~/data/sachs.data.txt")
> sachs<-sachs.data
> dsachs=discretize(sachs,method="hartemink",breaks=3,ibreaks=60,idisc="quantile")

Each variable in the dsachs data frame is a factor with three levels, corresponding approximately to
low, normal and high expression. Now apply bootstrap resampling to learn a set of 500 networks to
be used for model averaging:

> boot=boot.strength(data=dsachs,R=500,algorithm="hc",algorithm.args=list(score="bde",iss=10))
> boot[(boot$strength>0.85)&(boot$direction>=0.5),]
from to strength direction
1 praf pmek 1.000 0.5180000
23 plcg PIP2 1.000 0.5100000
24 plcg PIP3 1.000 0.5220000
34 PIP2 PIP3 1.000 0.5120000
56 p44.42 pakts473 1.000 0.5620000
57 p44.42 PKA 0.992 0.5665323
67 pakts473 PKA 1.000 0.5690000
89 PKC P38 1.000 0.5100000
90 PKC pjnk 1.000 0.5100000
100 P38 pjnk 0.954 0.5062893

The virtual sample size is 10, which is very low. Arcs are significant if they appear in at least 85% of
the networks and in the direction that appears most frequently. The averaged network is formed quite
simply using the averaged.network function:

> avg.boot = averaged.network(boot,threshold=0.85)
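The idea behind the averaging can be sketched in base R with adjacency matrices. The four toy networks on nodes A, B, C and the helper make_adj are hypothetical, and the sketch simplifies matters by counting directed arc frequencies only, whereas boot.strength reports strength and direction separately.

```r
# Sketch: average a list of networks and keep arcs present in at least 85% of them.
nodes <- c("A", "B", "C")

# Hypothetical helper: adjacency matrix with entry [from, to] = 1 for each arc.
make_adj <- function(from, to) {
  m <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
  m[cbind(from, to)] <- 1
  m
}

nets <- list(
  make_adj("A", "B"),
  make_adj(c("A", "B"), c("B", "C")),
  make_adj("A", "B"),
  make_adj(c("A", "C"), c("B", "B"))
)

strength <- Reduce(`+`, nets) / length(nets)   # relative frequency of each arc
averaged <- (strength >= 0.85) * 1             # arcs that pass the threshold
averaged
```

Here only the arc from A to B, present in all four networks, survives; the arcs between B and C each appear in a quarter of the networks and are dropped.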



An alternative approach is to average the results of several hill climbing searches, each starting from
a different network. The initial condition can be generated using a distribution over the space of
connected graphs. An algorithm to do this was proposed by Ide and Cozman (2002) [69]. It is implemented
by the function random.graph(). It is carried out as follows:

> library(bnlearn)


> nodes=names(dsachs)
> start=random.graph(nodes=nodes,method="ic-dag",num=500)
> netlist=lapply(start,function(net){
+ hc(dsachs,score="bde",iss=10,start=net)})
> rnd=custom.strength(netlist,nodes=nodes)
> rnd[(rnd$strength>0.85)&(rnd$direction>=0.5),]
from to strength direction
1 praf pmek 1 0.500
11 pmek praf 1 0.500
23 plcg PIP2 1 0.500
24 plcg PIP3 1 0.620
33 PIP2 plcg 1 0.500
34 PIP2 PIP3 1 0.620
56 p44.42 pakts473 1 0.500
57 p44.42 PKA 1 0.507
66 pakts473 p44.42 1 0.500
67 pakts473 PKA 1 0.507
89 PKC P38 1 0.500
90 PKC pjnk 1 0.500
99 P38 PKC 1 0.500
100 P38 pjnk 1 0.500
109 pjnk PKC 1 0.500
110 pjnk P38 1 0.500
> avg.start=averaged.network(rnd,threshold=0.85)
Warning messages:
1: In averaged.network.backend(strength = strength, nodes = nodes, :
arc pjnk -> PKC would introduce cycles in the graph, ignoring.
2: In averaged.network.backend(strength = strength, nodes = nodes, :
arc pjnk -> P38 would introduce cycles in the graph, ignoring.
> all.equal(cpdag(avg.boot),cpdag(avg.start))
[1] TRUE

The networks have the same skeleton, although some of the directions are different.
The score is computed first by taking cpdag to get an essential graph and then by taking cextend
to form a DAG.

> score(cextend(cpdag(avg.start)),dsachs,type="bde",iss=10)
[1] -8498.877

The bnlearn package contains a default level for the threshold, which is used when averaged.network
is called without a threshold argument:

> averaged.network(boot)

Random/Generated Bayesian network

model:
[praf][plcg][p44.42][PKC][pmek|praf][PIP2|plcg][pakts473|p44.42][P38|PKC]
[pjnk|PKC][PIP3|plcg:PIP2][PKA|p44.42:pakts473]
nodes: 11
arcs: 9
undirected arcs: 0
directed arcs: 9
average markov blanket size: 1.64
average neighbourhood size: 1.64
average branching factor: 0.82

generation algorithm: Model Averaging


significance threshold: 0.954

The default threshold is computed as follows. Let

$$\hat p_{(\cdot)} = \{0 \le \hat p_{(1)} \le \dots \le \hat p_{(k)} \le 1\}$$

denote the order statistics for the arc strengths stored in boot. Now let t denote a candidate threshold
and, for i = 1, . . . , k, set

$$\tilde p_{(i)}(t) = \begin{cases} 1 & \hat p_{(i)} \ge t \\ 0 & \hat p_{(i)} < t. \end{cases}$$

This denotes the `empirical' probability function for arc strengths for the graph where arcs are present
if and only if $\hat p_{(i)} \ge t$. Let $\tilde p_{(\cdot)}$ denote the resulting vector.
Now consider

$$L_1(t, \hat p_{(\cdot)}) := \int \left| F_{\hat p_{(\cdot)}}(x) - F_{\tilde p_{(\cdot)}}(x) \right| dx$$

where $F_{\hat p_{(\cdot)}}$ and $F_{\tilde p_{(\cdot)}}$ are the empirical distribution functions of $\hat p_{(\cdot)}$ and $\tilde p_{(\cdot)}$ respectively. The
threshold $\hat t$ is the value of t that minimises this quantity.
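The criterion can be sketched directly in base R. The function l1_threshold below is a hypothetical illustration of the minimisation just described, restricted to candidate thresholds equal to the observed strengths; it is not the code used inside bnlearn.

```r
# Sketch: choose the threshold t minimising the L1 distance between the
# empirical distribution function of the strengths and that of the 0/1
# indicators 1(strength >= t).
l1_threshold <- function(strength) {
  p <- sort(strength)
  k <- length(p)
  knots  <- c(0, p, 1)          # breakpoints of the step function on [0, 1]
  widths <- diff(knots)         # lengths of the k + 1 intervals between knots
  Fhat   <- (0:k) / k           # value of the ecdf of p on each interval
  l1 <- function(t) {
    frac.zero <- mean(p < t)    # ecdf of the indicators on [0, 1)
    sum(widths * abs(Fhat - frac.zero))
  }
  cand <- unique(p)             # candidate thresholds
  dist <- vapply(cand, l1, numeric(1))
  cand[which.min(dist)]
}

# Toy arc strengths: three weak arcs and three strong arcs.
t.hat <- l1_threshold(c(0.10, 0.15, 0.20, 0.90, 0.95, 1.00))
t.hat
```

For these toy strengths the minimiser is the smallest of the three strong values, so exactly the strong arcs are retained.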

17.1.8 Interventional Data


The data set in sachs.interventional.txt gives data from different experiments, where the
interventions, which force the levels of certain variables, differ from experiment to experiment.

> isachs <- read.table("~/data/sachs.interventional.txt",header=TRUE,colClasses="factor")

It is important that colClasses="factor", so that every column is read as a factor.


One (less useful) way of dealing with the situation is to include the intervention INT in the network
and make all the variables depend on it. This is done using the whitelist argument, which contains
all possible arcs from INT to the other nodes. These arcs are then forced to be present in the learned
network structure.

> wh = matrix(c(rep("INT",11),names(isachs)[1:11]),ncol=2)
> bn.wh = tabu(isachs,whitelist=wh,score="bde",iss=10,tabu=50)

The tabu learning algorithm gives more stable results here.


Not all the arcs in wh are necessary. The tiers2blacklist function may be used to blacklist all
arcs going towards INT, thus ensuring that only outgoing arcs are present.

> tiers=list("INT",names(isachs)[1:11])
> bl = tiers2blacklist(nodes=tiers)
> bn.tiers=tabu(isachs,blacklist=bl,score="bde",iss=10,tabu=50)

While the two methods given above, producing bn.wh and bn.tiers, show how to force certain arrows
into a network, they do not involve the structure of the intervention.
The way to model an intervention is described as follows: the value of INT identifies which node is
subject to an intervention. Therefore, we start by constructing a named list of which observations are
manipulated for each node.

> INT2=sapply(1:11,function(x){which(isachs$INT==x)})
> nodes=names(isachs)[1:11]
> names(INT2)=nodes

Now pass the list to tabu as an additional argument for mbde (the modified BDe score function).

> start=random.graph(nodes=nodes,method="melancon",num=500,burn.in=10^5,every=100)
> netlist=lapply(start,function(net){
+ tabu(isachs[,1:11],score="mbde",exp=INT2,iss=10,start=net,tabu=50)})
> arcs=custom.strength(netlist,nodes=nodes,cpdag=FALSE)
> bn.mbde=averaged.network(arcs,threshold=0.85)
Warning messages:
1: In averaged.network.backend(strength = strength, nodes = nodes, :
arc pjnk -> PKA would introduce cycles in the graph, ignoring.
2: In averaged.network.backend(strength = strength, nodes = nodes, :
arc PKC -> PKA would introduce cycles in the graph, ignoring.
3: In averaged.network.backend(strength = strength, nodes = nodes, :
arc PKC -> P38 would introduce cycles in the graph, ignoring.
4: In averaged.network.backend(strength = strength, nodes = nodes, :
arc pjnk -> P38 would introduce cycles in the graph, ignoring.
> bn.mbde2 <- cextend(cpdag(bn.mbde))
> graphviz.plot(bn.mbde2)

17.2 Exercises
1. This exercise uses the asia data set found in the bnlearn package.

(a) Create a bn object with the network structure shown in Figure 17.3.

[Figure 17.3: Asia Network, on the nodes asia, smoke, tub, lung, either, bronc, xray and dysp]

(b) Derive the skeleton, the moral graph, and the essential graph representing the Markov
equivalence class. Plot them using graphviz.plot.
(c) Identify the parents, the children, the neighbours and the Markov blanket of each node.
(d) For the network in Figure 17.3, estimate the CPPs.
(e) Using the data asia, use the MMPC algorithm (mmpc in bnlearn) to learn
the skeleton followed by hill climbing to learn the direction of the arrows. Is the output
DAG Markov equivalent to the graph in Figure 17.3?

2. The marks data set is found in the bnlearn package.

(a) Discretise the data using a quantile transform and different numbers of intervals (say 2 to
5). Learn the network structure. How does the structure change with the discretisation?
(b) Repeat the discretisation using interval discretisation, using up to five intervals. Compare
the resulting networks with those obtained previously using quantile discretisation.
(c) Does Hartemink's discretisation algorithm perform better than either quantile or interval
discretisation? How does its behaviour depend on the number of initial breaks?

3. The ALARM network is a standard network used to test new algorithms. A synthetic data set
alarm is found in the bnlearn package. Type:

> library(bnlearn)
> ?alarm

In the bottom right quadrant of RStudio, click on ALARM Monitoring System (synthetic)
data set. This gives a description. Go to the bottom under Examples. You will find the
structure of the `true' network.

(a) Create a bn object for the true network using the model string provided in the documenta-
tion.
(b) Compare the networks learned from the data using different constraint based algorithms
with the true network, both in terms of structural differences and also using either BIC or
BDe.
(c) How are these constraint based strategies affected by different choices of α (the nominal
significance level of each test)?
(d) Now learn the structure with hill-climbing and tabu search, using the posterior density BDe
as a score function. How does the network change with the hyperparameter iss (imaginary
sample size)?
(e) Does the length of the tabu list have a significant impact on the network structures learned
using tabu?
(f) Does the learned network depend on whether BDe or BIC is being used as a score criterion?

4. Now consider the data from Sachs et al., found in sachs.data.txt on the course home page.
Use the original data set, not the discretised data set.

(a) Evaluate the networks learned by hill-climbing with BIC and BGe, using cross-validation
and the log-likelihood loss function.
(b) Use bootstrap resampling to evaluate the distribution of the number of arcs present in each
of the networks learned. Do they differ significantly?
(c) Compute the averaged network structure for sachs using hill-climbing with BGe and
different hyperparameters (imaginary sample sizes). How does the value of the significance
threshold change as iss increases?
Chapter 18

Monte Carlo Algorithms for Graph Search

There are various Monte Carlo approaches to locating a structure. These involve running a stochastic
process through the space of possible structures and using this either to build up a posterior distribution
over the space of structures (Markov Chain Monte Carlo) or else designing a process with sufficient
mobility, that is attracted to highly scoring structures, and scoring each structure visited. The output
from a stochastic optimisation algorithm is simply the structure visited with the highest score.
As usual, X = (X1 , . . . , Xd ) denotes the random vector of variables,

$$\mathbf{X} = \begin{pmatrix} X^{(1)} \\ \vdots \\ X^{(n)} \end{pmatrix}$$

denotes an n × d random matrix of n independent copies of X, and x denotes the data matrix, an
instantiation of $\mathbf{X}$.

18.1 A Stochastic Optimisation Algorithm for Essential Graphs


The following discussion is loosely based on the Markov chain Monte Carlo model composition
algorithm, known as MC³, and the augmented Markov chain Monte Carlo model composition (AMC³)
algorithm from Madigan, Andersson, Perlman and Volinsky (1997) [88]. This algorithm provides a
stochastic process which works through the space of essential graphs. It is not intended to provide
a process that gives the correct stationary distribution; the aim is simply to find a process which
is sufficiently mobile, where the direction is biased towards highly scoring structures and where the
stochastic component will help the process to escape from local maxima.
Let E denote the space of edge sets for essential graphs. The aim is to construct a Markov chain
{E(t), t = 0, 1, 2, . . .} with state space E.
Firstly, assume that PD(D) is equal for each D ∈ equiv(E), where E denotes the edge set of an
essential graph and equiv(E) denotes the space of DAGs which have E as their essential graph. The
prior over essential graphs is then

$$P_E(E) = n(E) P_D(D)$$

where n(E) is the number of DAGs within the equivalence class and D ∈ equiv(E). The posterior is
then given by:


$$P_{E|X}(E|x) \propto P_E(E) L(D, x), \qquad D \in \mathrm{equiv}(E).$$


A penalisation may be useful; let κ ∈ (0, 1) and let ∣E∣ denote the number of edges in the graph. If
sparser graphs are desirable, then it may be useful to consider a score function

$$S_{E|X}(E|x) = \kappa^{|E|} P_{E|X}(E|x), \qquad \kappa \in (0, 1). \tag{18.1}$$


The difficulty with constructing Markov chains over the set of essential graphs is that if only a single
edge is modified at a time, the chain may not move. This is seen rather simply with the immorality
A → B ← C. This is an essential graph on three variables. Any alteration of a single edge (either by
adding in one of (A, C), (C, A) or ⟨A, C⟩, or un-directing one of the directed edges or changing the
direction of an edge) gives a graph that is not an essential graph. It is therefore not possible to move
in a single step from the immorality (A, B, C) (where B is the collider node) to a different essential
graph on the variables (A, B, C). Filling in the details is left as an exercise (Example 1, Page 351).
The MC³ algorithm therefore considers triples of nodes and works as follows. Let E(0) be the edge
set of an arbitrarily chosen essential graph. To move from E(t) to E(t + 1), do the following:

• Choose three nodes (Xi , Xj , Xk ) at random, where Xi ≠ Xj , Xj ≠ Xk , Xi ≠ Xk , taking any
possible triple of nodes each with equal probability.

• Let E denote the current edge set. As usual, E = D ∪ U where D denotes the directed edges
and U denotes the undirected edges. ⟨α, β⟩ ∈ U denotes an undirected edge; (α, β) ∈ D denotes a
directed edge α ↦ β. For Fij and Fjk, where Fpq is defined below, consider the 16 possible graphs
generated by keeping all other edges the same and modifying any edges between the two pairs
[Xi , Xj ] and [Xj , Xk ] (where [α, β] simply denotes the ordered pair of vertices) according to the
four possibilities for each pair:

$$F_{pq} = \begin{cases} 1 & (X_p, X_q) \notin D,\ (X_q, X_p) \notin D,\ \langle X_p, X_q \rangle \notin U \\ 2 & (X_q, X_p) \in D \\ 3 & (X_p, X_q) \in D \\ 4 & \langle X_p, X_q \rangle \in U. \end{cases} \tag{18.2}$$

• Suppose the current state is E^(0) and label the 16 possible graphs E^(0), E^(1), . . . , E^(15) generated
by all the possibilities of Fij and Fjk. For each graph, check whether it is an essential graph,
using the criteria of Theorem 5.3.
That is, it has to be a chain graph: for α ∈ Vi and β ∈ Vj, where Vi and Vj are two separate chain
components, there is no cycle containing both α and β (a cycle being a sequence ρ0 , . . . , ρm , ρm+1 = ρ0
with either (ρi , ρi+1 ) ∈ D or ⟨ρi , ρi+1 ⟩ ∈ U for each i = 0, . . . , m). The chain components have to
be triangulated and the graph must not contain forbidden substructures (those in Figure 5.1).

• For each possible graph E^(l), l = 0, . . . , 15, set

$$y_l = \begin{cases} 0 & E^{(l)} \text{ not essential} \\ S_{E|X}(E^{(l)}|x) & E^{(l)} \text{ essential} \end{cases}$$

where $S_{E|X}$ is defined by (18.1), or indeed any other reasonable score function. Select E(t + 1) =
E^(l) with probability proportional to y_l, that is, with probability $y_l / \sum_{m=0}^{15} y_m$, l = 0, 1, . . . , 15.
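The selection step can be sketched in base R as follows. The vector y of scores is hypothetical; a zero entry stands for a candidate that is not an essential graph, so it can never be selected.

```r
# Sketch: choose the next state among E^(0), ..., E^(15) with probability
# proportional to the scores y_l.
set.seed(3)   # arbitrary seed
y <- c(0.8, 0, 0, 1.5, 0, 0.2, 0, 0, 0, 0, 0.5, 0, 0, 0, 0, 0)  # hypothetical scores
probs <- y / sum(y)                          # normalise to a probability vector
l <- sample(0:15, size = 1, prob = probs)    # label of the selected graph
l
```

Since E^(0) is the current state and is itself essential, the probabilities are always well defined.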

This gives a process which works through the space of essential graphs, guiding the process (at least
locally) to highly scoring structures, while the stochastic element ensures that the process can escape
from a local maximum with positive probability.
Since the aim is to examine each graph E(0), . . . , E(N ) visited, together with all those that were
checked as candidates when the transition probabilities were computed and then choose the one that
maximises S(E) ∶ E ∈ {graphs evaluated}, the following variation may be more efficient.

• Start with an empty graph. Let E(0) denote the empty graph and let E(t) denote the graph
selected at step t.

• For each cycle of ½ d(d − 1)(d − 2) steps, randomly select σ, an ordering of {1, . . . , d}, each with
probability 1/d! and, for j = 1, . . . , d, i = 1, . . . , d − 1, i ≠ j, k = i + 1, . . . , d, k ≠ j, do the following:

1. For the triple of nodes {Xσ(i) , Xσ(j) , Xσ(k) }, consider all 16 possibilities of (Fσ(i),σ(j) , Fσ(j),σ(k) )
(defined in Equation (18.2)) when applied to the current essential graph and record those for
which the new graph is an essential graph.

2. Let E^(0) = E(t) and let E^(1), . . . , E^(15) denote the other 15 possibilities. For E^(0), . . . , E^(15), set
yk = 0 if E^(k) is not an essential graph; otherwise, set yk = R(E^(k)), where R is a suitable score
function.

3. Let E(t + 1) = E^(k) where $y_k = \max_j \{ y_j : E^{(j)} \notin \{E(0), \dots, E(t)\} \}$.

• After the algorithm has run for the required length of time (several cycles of length ½ d(d − 1)(d − 2)),
the graph among E(0), . . . , E(N) with the highest score, $\max_{t \in \{0,\dots,N\}} R(E(t))$, is selected.

Difficulties with Metropolis Hastings This algorithm has computational advantages. Only three
nodes at a time are considered, with the possibility of at most 15 different essential graphs. It provides
a stochastic search algorithm, where the aim is to find a highly scoring structure. But it seems very
difficult to modify it to produce a Metropolis Hastings scheme with a `theoretically' correct stationary
distribution. If N (E) denotes the space of all essential graphs that can be obtained by such a procedure,
then Q(E, E′), the probability of proposing E′ given a current state E, does not have a convenient
expression and neither does the acceptance probability

$$\alpha_{E,E'} = \min\left(1, \frac{S(E')Q(E',E)}{S(E)Q(E,E')}\right).$$

18.2 Structure MCMC


The classical MCMC method for learning the underlying structure of a Bayesian network dates back
to Madigan and York (1995) [89]. A prior PD is required over the space of directed edge sets D. The
score function used is the Cooper-Herskovits likelihood, L(D, x), given by Equation (12.15). The aim
is to construct a Markov chain whose stationary distribution is the posterior distribution
$$P_{D|X}(D|x) \propto P_D(D) \times L(D, x).$$


The Markov chain is generated by the operations of addition and deletion of single edges. Given
directed edge set D(t) at iteration t, let N (D(t)) denote all directed edge sets which may be derived
from D(t) by one edge added or deleted, together with D(t) itself. A new edge set D′ is sampled from
the set N (D(t)) with proposal probability

$$Q(D(t), D') = \begin{cases} \dfrac{1}{|N(D(t))|} & D' \in N(D(t)) \\ 0 & \text{otherwise.} \end{cases}$$

The acceptance probability is:

$$\alpha_{D(t),D'} = \min\left\{1, \frac{Q(D', D(t))\,P(D'|x)}{Q(D(t), D')\,P(D(t)|x)}\right\} = \min\left\{1, \frac{|N(D(t))|\,P(D'|x)}{|N(D')|\,P(D(t)|x)}\right\}.$$
The stationary distribution of this chain is PD∣X (.∣x).
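The sampler can be sketched in base R for a small number of nodes. The helper names is_dag, neighbours and structure_mcmc are hypothetical, and the log-score used in the example is a toy function (an assumption) that simply rewards agreement with the immorality 1 → 2 ← 3; in practice it would be the logarithm of PD(D)L(D, x).

```r
# Sketch: structure MCMC with single-edge addition and deletion moves.
# A DAG is a 0/1 adjacency matrix A with A[i, j] = 1 for the edge i -> j.

# TRUE if A has no directed cycle (repeatedly strip nodes without children).
is_dag <- function(A) {
  A <- A != 0
  while (nrow(A) > 0) {
    sinks <- which(rowSums(A) == 0)
    if (length(sinks) == 0) return(FALSE)
    A <- A[-sinks, -sinks, drop = FALSE]
  }
  TRUE
}

# N(A): all DAGs obtained by adding or deleting one edge, together with A.
neighbours <- function(A) {
  d <- nrow(A)
  out <- list(A)
  for (i in seq_len(d)) for (j in seq_len(d)) {
    if (i == j) next
    B <- A
    B[i, j] <- 1 - A[i, j]                     # toggle the edge i -> j
    if (is_dag(B)) out[[length(out) + 1]] <- B
  }
  out
}

# Metropolis-Hastings with uniform proposals over N(A).
structure_mcmc <- function(log_score, d, n_iter) {
  A <- matrix(0, d, d)                         # start from the empty graph
  visited <- vector("list", n_iter)
  for (t in seq_len(n_iter)) {
    nb <- neighbours(A)
    B  <- nb[[sample.int(length(nb), 1)]]
    log_alpha <- log_score(B) - log_score(A) +
      log(length(nb)) - log(length(neighbours(B)))
    if (log(runif(1)) < log_alpha) A <- B
    visited[[t]] <- A
  }
  visited
}

set.seed(4)                                    # arbitrary seed
target <- matrix(0, 3, 3); target[1, 2] <- 1; target[3, 2] <- 1
log_score <- function(A) -2 * sum(A != target) # toy log-score (assumption)
samples <- structure_mcmc(log_score, d = 3, n_iter = 2000)
freq.target <- mean(vapply(samples, function(A) all(A == target), logical(1)))
```

The quantity freq.target estimates the stationary probability of the target graph under the toy score; no burn-in is discarded in this sketch.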
There are modifications of the basic algorithm: let N (D(t)) denote the space of all DAGs obtained
by addition, deletion, or reversal of a single edge from the current DAG. This is a straightforward
modification; the work of Giudici and Castelo (2003) [53] shows that it leads to substantial gains in
efficiency.
The samples are generated from randomly chosen starting points and the sequence of DAGs
recorded after some suitable burn-in period. The burn-in should be long enough to decouple the
chains from their starting points.

18.3 Edge Reversal Moves


The main problem with structure MCMC is slow convergence. The following edge reversal move was
introduced by Grzegorczyk and Husmeier (2008) [58].
If an edge Xi ↦ Xj is to be reversed, the two nodes Xi and Xj are first orphaned; that is, links
from Pai to Xi are removed and links from Paj to Xj are removed. This involves removing Xi ↦ Xj.
Next, the edge Xj ↦ Xi is inserted. Then the remainder of the new parent set of Xi is established
according to a suitable score function and finally a new parent set for Xj is established.
This move is clearly reversible under mild conditions on the way that the new parent sets are
established; consider the new graph. Suppose that Xj ↦ Xi is to be reversed. Firstly, all the edges
that have just been added, establishing the new parent sets, are removed. Then Xi ↦ Xj is inserted,
then additional parents of Xj and finally parents of Xi are established.

Notation Let D denote a directed edge set. For a node Xi, let $D^{X_i \leftarrow \pi}$ denote the graph D where
the edges Pai ↦ Xi are removed and a new parent set π is imposed on Xi. For a graph D, let 1(D)
denote the indicator function, returning value 1 if D is a DAG and 0 otherwise. Let

$$Z(X_i|D) = \sum_{\pi : 1(D^{X_i \leftarrow \pi}) = 1} L_i(\pi|x)$$

where, for a given ordering of the nodes, $L_i(\pi|x)$ denotes a score function for node i having parent set
π. Let

$$Z^*(X_i|D, X_j) = \sum_{\pi : 1(D^{X_i \leftarrow \pi}) = 1,\ X_j \in \pi} L_i(\pi|x).$$

Choice of New Parent Sets Let D0 denote the graph D after links Pai ↦ Xi and Paj ↦ Xj have
been removed. The new parent set for Xi, $\tilde\pi_i$, is sampled from the distribution:

$$Q(\tilde\pi_i|D_0, X_j) = \frac{L_i(\tilde\pi_i|x)\, 1(D_0^{X_i \leftarrow \tilde\pi_i})\, 1(X_j \in \tilde\pi_i)}{Z^*(X_i|D_0, X_j)}.$$

Having sampled $\tilde\pi_i$, $\tilde\pi_j$ is now sampled from the distribution:

$$Q(\tilde\pi_j|D_0^{X_i \leftarrow \tilde\pi_i}) := \frac{L_j(\tilde\pi_j|x)\, 1\big((D_0^{X_i \leftarrow \tilde\pi_i})^{X_j \leftarrow \tilde\pi_j}\big)}{Z(X_j|D_0^{X_i \leftarrow \tilde\pi_i})}.$$

Conditioned on choosing REV (deciding to make a move of reverse-edge type), the proposal probability
for the move D ↦ D′, where D′ is obtained by exchanging the parent sets (πi , πj ) of nodes (Xi , Xj )
for $(\tilde\pi_i, \tilde\pi_j)$, is:

$$Q(D, D') = \frac{1}{N(D)}\, Q(\tilde\pi_i|D_0, X_j)\, Q(\tilde\pi_j|D_0^{X_i \leftarrow \tilde\pi_i})$$

where N (D) is the number of edges in D. The acceptance probability is:

$$\alpha(D, D') = \min\left(1, \frac{N(D)}{N(D')} \cdot \frac{Z^*(X_i|D_0, X_j)}{Z^*(X_j|D_0', X_i)} \cdot \frac{Z(X_j|D_0^{X_i \leftarrow \tilde\pi_i})}{Z(X_i|{D_0'}^{X_j \leftarrow \pi_j})}\right).$$

Adding Reverse Move to the Sampler A value pR ∈ (0, 1) is chosen. If the current graph is
not empty, then with probability pR it is decided to make a reverse move and with probability
pS = 1 − pR it is decided to make a standard move (addition or deletion). Since the standard moves
comprise an ergodic Markov chain (albeit not with the desired level of mobility), the mixture is also
ergodic.

18.4 Order MCMC


The order MCMC algorithm was introduced by Friedman and Koller (2003) [47], to establish the
ordering of the nodes. The nodes 1, . . . , d are ordered according to a given permutation σ. The DAGs
that correspond to a given order are simply those where each node may only have parents of a lower order. Once the
posterior distribution over orders has been established, the DAG can be constructed relatively easily
using other methods (for example, the K2 algorithm).
Recall that the Cooper-Herskovits Likelihood (12.15) has product form, which may be written as:

$$L(D|x) = \prod_{j=1}^{d} \tilde L(j, \pi_j|x)$$

where j denotes node j in D and πj (also written Paj) denotes its parent set. Assume that the prior
PD(D) also has product form:

$$P_D(D) = \prod_{j=1}^{d} Q(j, \pi_j)$$

and set

$$S(j, \pi_j|x) = Q(j, \pi_j)\, \tilde L(j, \pi_j|x). \tag{18.3}$$

The score R(σ∣x) for a given ordering σ, given the data x, is given by:

$$R(\sigma|x) = \sum_{D \in \sigma} P(D|x) \propto \prod_{j=1}^{d} \sum_{\mathrm{Pa}_{\sigma(j)} \in \sigma} S(\sigma(j), \mathrm{Pa}_{\sigma(j)}|x) \tag{18.4}$$

where S is a score function, D ∈ σ denotes a DAG compatible with node ordering σ and Paσ(j) ∈ σ
denotes that the parent set of σ(j) is compatible with node ordering σ.
A hard limit K is placed on the size of each parent set. This reduces the complexity of scoring
each node to order d^K, the order of the number of candidate parent sets of size at most K among d nodes.
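The product-of-sums form of the score of an order can be sketched in base R. The function names parent_sets and order_score are hypothetical, and the local score used in the example is a toy function (an assumption) standing in for S defined in (18.3).

```r
# Sketch: score of a node order as a product over nodes of the sum of local
# scores over all parent sets (of size at most K) drawn from the predecessors.

# All subsets of `candidates` with at most K elements, including the empty set.
parent_sets <- function(candidates, K) {
  sets <- list(candidates[0])                  # the empty parent set
  for (s in seq_len(min(K, length(candidates)))) {
    # combn over positions avoids the special case of a length-one vector
    sets <- c(sets, combn(length(candidates), s,
                          FUN = function(idx) candidates[idx],
                          simplify = FALSE))
  }
  sets
}

# sigma[1] comes first in the order; a node may only have parents before it.
order_score <- function(sigma, local_score, K) {
  total <- 1
  for (pos in seq_along(sigma)) {
    preds <- sigma[seq_len(pos - 1)]
    terms <- vapply(parent_sets(preds, K),
                    function(pa) local_score(sigma[pos], pa), numeric(1))
    total <- total * sum(terms)
  }
  total
}

toy_score <- function(node, parents) exp(-length(parents))  # toy local score
r.order <- order_score(c(2, 3, 1), toy_score, K = 2)
r.order
```

With this toy score every order receives the same value; a data-dependent local score would make the orders distinguishable. For larger problems the sums would be accumulated on the log scale.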
It is much easier to consider moves between node orders. There are a variety of proposals for moves
from σ to σ′; for example, choose two nodes at random and flip their positions. The move σ ↦ σ′ is
proposed with probability Q(σ, σ′); the proposal is accepted with probability

$$\alpha_{\sigma,\sigma'} = \min\left(1, \frac{Q(\sigma', \sigma)\, R(\sigma'|x)}{Q(\sigma, \sigma')\, R(\sigma|x)}\right).$$

Sampling the DAG Having converged to the stationary distribution over orders σ, orderings σ∗
are then sampled proportionally to R(σ∣x). A DAG is sampled for a fixed order in the following way:
the parent sets are sampled independently for each variable Xi, from among the parent sets compatible
with the order, according to the score function (18.3). Fixing the order makes the problem much easier,
since any such choice of parent sets gives a DAG.

The Problem with Bias The posterior distribution over orderings is:

$$P(\sigma|x) = \sum_{D} P(\sigma, D|x) = \sum_{D \in \sigma} P(\sigma|D)\, P(D|x).$$

Here P(D∣x) is simply the Cooper-Herskovits likelihood. This differs from the score function (18.4)
through the term P(σ∣D), which is simply the inverse of the number of orders that the DAG belongs
to. On average, the number of orders that each DAG belongs to is exponentially large. (It can range
from 1 to d!). Neglecting this term in the order MCMC algorithm then weights DAGs by the number
of orders they belong to.
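The number of orders to which a DAG belongs can be counted by brute force for small d. The base R sketch below is an illustration with hypothetical helper names; it enumerates all d! permutations, so it is only usable for a handful of nodes.

```r
# Sketch: count the node orders compatible with a DAG given as a 0/1
# adjacency matrix A (A[i, j] = 1 for the edge i -> j).

# All permutations of the vector v, as a list.
permutations <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in permutations(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

n_orders <- function(A) {
  edges <- which(A != 0, arr.ind = TRUE)       # one row (parent, child) per edge
  ok <- vapply(permutations(seq_len(nrow(A))), function(s) {
    pos <- order(s)                            # pos[v] = position of node v
    all(pos[edges[, 1]] < pos[edges[, 2]])     # every parent precedes its child
  }, logical(1))
  sum(ok)
}

empty <- matrix(0, 3, 3)
chain <- empty; chain[1, 2] <- 1; chain[2, 3] <- 1   # 1 -> 2 -> 3
immor <- empty; immor[1, 2] <- 1; immor[3, 2] <- 1   # 1 -> 2 <- 3
c(n_orders(empty), n_orders(chain), n_orders(immor))
```

On three nodes the empty graph is compatible with all 3! = 6 orders, the chain with exactly one and the immorality with two, so an unweighted sum over orders counts these three DAGs six times, once and twice respectively.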

18.5 Partition MCMC for Directed Acyclic Graphs


Partition MCMC was introduced recently by Kuipers and Moffa (2015) [76]. With partition MCMC,
the moves are not between DAGs and the aim of the algorithm is not to end up with a distribution
over DAGs; rather, it is to end up with a distribution over layerings of DAGs (defined below).

Layering of a DAG The nodes of a DAG may be layered. A layering is a partition satisfying the
condition that, for each k, no node in layer k is either an ancestor or a descendant of any other node in
layer k. The layers are indexed by N = {1, 2, 3, . . .}. It is known as a minimal layering if each node has the
minimal index value such that the partition is a layering.
The minimal layering clearly satisfies (for example) that all nodes without parents are in layer 1.
Furthermore, all nodes in layer k ≥ 2 have at least one parent in layer k − 1.
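A minimal layering can be computed by repeatedly peeling off the nodes that have no parents among the nodes still remaining. The base R function below, minimal_layering, is a hypothetical sketch of this idea for a DAG stored as a 0/1 adjacency matrix.

```r
# Sketch: minimal layering of a DAG with adjacency matrix A
# (A[i, j] = 1 for the edge i -> j). Returns the layer index of each node.
minimal_layering <- function(A) {
  d <- nrow(A)
  layer <- integer(d)
  remaining <- seq_len(d)
  k <- 0
  while (length(remaining) > 0) {
    k <- k + 1
    sub <- A[remaining, remaining, drop = FALSE]
    sources <- remaining[colSums(sub) == 0]    # no parents among remaining nodes
    if (length(sources) == 0) stop("the graph contains a directed cycle")
    layer[sources] <- k
    remaining <- setdiff(remaining, sources)
  }
  layer
}

chain <- matrix(0, 3, 3); chain[1, 2] <- 1; chain[2, 3] <- 1   # 1 -> 2 -> 3
immor <- matrix(0, 3, 3); immor[1, 2] <- 1; immor[3, 2] <- 1   # 1 -> 2 <- 3
minimal_layering(chain)
minimal_layering(immor)
```

The chain has one node in each of three layers, while the immorality has the two parents in layer 1 and the collider in layer 2.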
Consider a minimal layering with m levels and let (k1 , . . . , km ) denote the number of nodes in each
layer. The number of DAGs belonging to such a partition is given by:

$$a_{k_1,\dots,k_m} = \frac{d!}{k_1! \cdots k_m!} \prod_{j=2}^{m} \left(2^{k_{j-1}} - 1\right)^{k_j} \prod_{j=3}^{m} 2^{k_j S_{j-2}}$$

where $S_j = \sum_{i=1}^{j} k_i$. The first term is simply the number of ways of distributing d nodes in m partition
elements of size k1 , . . . , km respectively. The second is the number of ways that nodes in each partition
can have parents in the previous partition. Subtracting 1 excludes the case where nodes receive no
edges. The third term is the number of ways that nodes can have parents from partitions other than
the one directly below.
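The counting formula is easy to check numerically for small d. The base R function count_dags below is a hypothetical sketch; summing it over all layer-size vectors (k1, . . . , km) with d = 3 should give 25, the number of DAGs on three labelled nodes.

```r
# Sketch: number of DAGs whose minimal layering has layer sizes k = (k_1, ..., k_m).
count_dags <- function(k) {
  m <- length(k)
  S <- cumsum(k)                               # S[j] = k_1 + ... + k_j
  a <- factorial(sum(k)) / prod(factorial(k))  # assign nodes to layers
  if (m >= 2) for (j in 2:m) a <- a * (2^k[j - 1] - 1)^k[j]  # parent in layer j - 1
  if (m >= 3) for (j in 3:m) a <- a * 2^(k[j] * S[j - 2])    # parents further back
  a
}

# All layer-size vectors for d = 3.
shapes <- list(3, c(1, 2), c(2, 1), c(1, 1, 1))
counts <- vapply(shapes, count_dags, numeric(1))
counts
sum(counts)
```

The four shapes contribute 1, 3, 9 and 12 DAGs respectively.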

18.5.1 Scoring Partitions


A score S(P ) is assigned to each partition P. This is done as follows: let λ = (k1 , . . . , km ), where
k1 + . . . + km = d and kj ≥ 1 for each j ∈ {1, . . . , m}, denote the shape of a layering; it specifies that
there are m layers, with kj nodes in the jth layer for j = 1, . . . , m.
Let σ denote a permutation of the nodes. This specifies which nodes are in which layer. If λ =
(k1 , . . . , km ), then $X_{\sigma(1)}, \dots, X_{\sigma(k_1)}$ belong to layer 1; nodes $X_{\sigma(k_1+\dots+k_j+1)}, \dots, X_{\sigma(k_1+\dots+k_j+k_{j+1})}$ are in
layer j + 1 for j = 1, . . . , m − 1.
Permuting nodes within a layer does not change anything. Let $\pi_{\lambda,\sigma}$ denote a representative
permutation; that is, λ together with permutation $\pi_{\lambda,\sigma}$ gives the same layering as λ together with σ. Let
$\Lambda = (\lambda, \pi_{\lambda,\sigma})$. The score for Λ is:

$$S(\Lambda|x) = \sum_{D} P(\Lambda|D, x)\, P(D|x) = \sum_{D \in \Lambda} P(D|x) \propto \prod_{j=1}^{d} \sum_{\mathrm{Pa}_j \in \Lambda} S(X_j, \mathrm{Pa}_j|x)$$

where D ∈ Λ denotes that the DAG D is compatible with the layering specified by Λ and Paj ∈ Λ
denotes that the parent set of variable j is compatible with Λ.

The MCMC will propose a move Λ ↦ Λ′, by defining a set N (Λ) of neighbours and choosing each
with equal probability. The acceptance probability is:

$$\alpha_{\Lambda,\Lambda'} = \min\left(1, \frac{|N(\Lambda)|\, S(\Lambda'|x)}{|N(\Lambda')|\, S(\Lambda|x)}\right) \tag{18.5}$$

18.5.2 Partition Moves


The basic partition moves are:
• Splitting a layer in two;

• Merging two adjacent layers.


When splitting a layer of size k into two parts, one of size c and one of size k − c, there are $\binom{k}{c}$ ways
to do it. There are m − 1 ways to merge two adjacent layers. The size of the neighbourhood is therefore:

$$(m - 1) + \sum_{i=1}^{m} \sum_{c=1}^{k_i - 1} \binom{k_i}{c} = m - 1 + \sum_{i=1}^{m} \left(2^{k_i} - 2\right) = \left(\sum_{i=1}^{m} 2^{k_i}\right) - m - 1.$$
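The closed form for the size of the neighbourhood can be checked against a direct count. Both base R functions below are hypothetical sketches; the layer sizes in the example are arbitrary.

```r
# Sketch: number of split and merge moves available from layer sizes k.

# Closed form: (sum of 2^k_i) - m - 1.
n_neighbours <- function(k) sum(2^k) - length(k) - 1

# Direct count: m - 1 merges plus, for each layer, the number of ways of
# choosing a non-empty proper subset of size c = 1, ..., k_i - 1.
n_neighbours_brute <- function(k) {
  merges <- length(k) - 1
  splits <- sum(vapply(k, function(ki) {
    if (ki < 2) return(0)
    sum(choose(ki, seq_len(ki - 1)))
  }, numeric(1)))
  merges + splits
}

k <- c(2, 1, 3)        # arbitrary layer sizes
n_neighbours(k)
n_neighbours_brute(k)
```

For layer sizes (2, 1, 3) both counts give 10: two merges, two ways to split the first layer and six ways to split the third.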
When merging layer i with layer i + 1, the score changes only through the indicator function of whether
the parent sets are legal under the new layering. The only nodes affected are those in the layer labelled
i + 2 before the merge, for which the number of possible parent sets has increased, and (of course) those
in layer i + 1 before the merge. Instead of being forced to have at least one parent from layer i, as before
the merge, the latter are no longer allowed links with these variables; the variables from layer i + 1
(before the merge) are now forced to have a parent in layer i − 1.
Splitting and merging thus defined give reversible moves, so that the acceptance probability defined by (18.5)
is positive.
It is straightforward to see that the chain is irreducible; from one partition any other partition can
be reached in a finite number of moves which have positive probability. If necessary, the chain can stay
still with positive probability to ensure aperiodicity.

18.5.3 Permutation Moves


The permutation moves are simpler. Two strategies can be adopted:
• Choose two nodes at random, with the constraint that they are in different layers, and swap
them. There are

$$M(\Lambda) = \sum_{i=1}^{m} \frac{k_i (d - k_i)}{2}$$

possibilities, each chosen with equal probability. The move is accepted with probability
$\min\left(1, \frac{M(\Lambda)\, S(\Lambda'|x)}{M(\Lambda')\, S(\Lambda|x)}\right)$.

• Choose two nodes at random, with the constraint that they are in adjacent layers, and swap
them. There are

$$M(\Lambda) = \sum_{i=1}^{m-1} k_i k_{i+1}$$

possible choices of pairs. They are each chosen with equal probability and the move is accepted
with probability $\min\left(1, \frac{M(\Lambda)\, S(\Lambda'|x)}{M(\Lambda')\, S(\Lambda|x)}\right)$.

18.5.4 Combination with Edge Reversal


The Edge Reversal Move discussed in Section 18.3 may be combined with these. In this context, firstly a DAG is chosen compatible with Λ, with probability proportional to its score. Then the reverse move is proposed and accepted with the probabilities given in Section 18.3. Then the corresponding partition / permutation Λ′ is computed. The Edge-Reversal move is not ergodic, but if probabilities pR > 0, pλ > 0, pσ > 0 of taking an Edge Reversal, Partition and Permutation move respectively are specified in advance, where pR + pλ + pσ = 1, the process is ergodic, with the correct stationary distribution.
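Explicitly, let $P_{D' \mid D}(\Lambda' \mid \Lambda)$ denote the probability of a transition to Λ′ through an edge reversal move D to D′. Let $Q(D' \mid D)$ denote the transition probability of a move from D to D′ given that the move is edge reversal. Then

$$P_{D' \mid D}(\Lambda' \mid \Lambda) = Q(D' \mid D)\, \frac{P(D \mid x)}{P(\Lambda \mid x)}.$$

This move satisfies the detailed balance equation

$$\frac{P(D' \mid x)}{P(D \mid x)} = \frac{Q(D' \mid D)}{Q(D \mid D')}$$

from which

$$\frac{P_{D' \mid D}(\Lambda' \mid \Lambda)}{P_{D \mid D'}(\Lambda \mid \Lambda')} = \frac{P(\Lambda' \mid x)}{P(\Lambda \mid x)}. \tag{18.6}$$

Finally, there may be more than one path between layerings;

$$P(\Lambda' \mid \Lambda) = \sum_{D, D'} P_{D' \mid D}(\Lambda' \mid \Lambda)$$

is the total transition probability. Now, from Equation (18.6), it follows that:

$$P(\Lambda \mid x)\, P(\Lambda' \mid \Lambda) = P(\Lambda' \mid x)\, P(\Lambda \mid \Lambda').$$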
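The step from the detailed balance equation for the edge reversal move to Equation (18.6) can be written out explicitly. The lines below only substitute the definition of the transition probability through D and D′ given above; no further assumptions are made.

```latex
\begin{align*}
\frac{P_{D' \mid D}(\Lambda' \mid \Lambda)}{P_{D \mid D'}(\Lambda \mid \Lambda')}
 &= \frac{Q(D' \mid D)\, P(D \mid x) / P(\Lambda \mid x)}
         {Q(D \mid D')\, P(D' \mid x) / P(\Lambda' \mid x)} \\
 &= \frac{Q(D' \mid D)}{Q(D \mid D')} \cdot \frac{P(D \mid x)}{P(D' \mid x)}
    \cdot \frac{P(\Lambda' \mid x)}{P(\Lambda \mid x)} \\
 &= \frac{P(D' \mid x)}{P(D \mid x)} \cdot \frac{P(D \mid x)}{P(D' \mid x)}
    \cdot \frac{P(\Lambda' \mid x)}{P(\Lambda \mid x)}
  = \frac{P(\Lambda' \mid x)}{P(\Lambda \mid x)}.
\end{align*}
```

Multiplying through by the denominators and summing over all pairs D, D′ then gives the detailed balance relation for the total transition between layerings.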


Chapter 19

Dynamic Bayesian Networks

19.1 Introduction
Dynamic Bayesian networks (DBNs) are an important tool that has proved useful for a large class of problems. The thesis of Kevin Murphy (2002) [97] provides a comprehensive introduction to the topic. The first mention of dynamic Bayesian networks seems to be by Dean and Kanazawa (1989) [34].
The DBN framework provides a way to extend Bayesian network machinery to model probability distributions over collections of random variables $(Z_t)_{t \geq 0}$. The parameter t ∈ {0, 1, 2, . . .} represents time. Typically, the variables at a time slice t are partitioned into $Z_t = (U_t, X_t, Y_t)$ representing the input, hidden and output variables of the model. The term `dynamic' refers to the fact that the system being modelled is dynamic; the basic structure of the network remains the same over time.

Definition 19.1. A k-slice Dynamic Bayesian network is a DAG corresponding to a factorisation of the probability distribution over the variables $\{Z_0, Z_1, \ldots\}$ such that for $t \geq k$,

$$P_{Z_0,\ldots,Z_t} = P_{Z_0} \prod_{s=1}^{k-1} P_{Z_s \mid Z_0,\ldots,Z_{s-1}} \prod_{s=k}^{t} P_{Z_s \mid Z_{s-k},\ldots,Z_{s-1}}$$

where, for $t \geq k$,

$$P_{Z_t \mid Z_{t-k},\ldots,Z_{t-1}} = \prod_{j} P_{Z_t^j \mid \mathrm{Pa}(Z_t^j)},$$

$Z_t^j$ is the $j$th node at time $t$, which could be a component of either $X_t$, $Y_t$ or $U_t$, and the set $\mathrm{Pa}(Z_t^j)$ of parents of $Z_t^j$ belongs to the collection

$$Z_{t-k}, \ldots, Z_{t-1}, \{Z_t^1, \ldots, Z_t^{j-1}\}.$$

The arrows within the same time slice do not represent causality.
The requirement is that the subgraph restricted to $\{Z_t, \ldots, Z_{t+k-1}\}$ is the same for each $t \geq 0$ and the conditional probabilities $P_{Z_t^j \mid \mathrm{Pa}(Z_t^j)}$ are the same for each $t \geq k$. Furthermore, for $1 \leq i \leq j \leq k$, and each $s \geq j$, the subgraph restricted to $\{Z_{s+i}, \ldots, Z_{s+j}\}$ is a subgraph of the subgraph restricted to $\{Z_{s+i-1}, \ldots, Z_{s+j}\}$.
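As a worked special case, take k = 1 in Definition 19.1 (an assumption made here purely for illustration). The first product is then empty and the factorisation reduces to a first-order Markov chain over time slices, with each slice transition factorising over the nodes of the slice:

```latex
\begin{align*}
P_{Z_0,\ldots,Z_t} &= P_{Z_0} \prod_{s=1}^{t} P_{Z_s \mid Z_{s-1}}, \\
P_{Z_s \mid Z_{s-1}} &= \prod_{j} P_{Z_s^j \mid \mathrm{Pa}(Z_s^j)},
\qquad \mathrm{Pa}(Z_s^j) \subseteq Z_{s-1} \cup \{Z_s^1, \ldots, Z_s^{j-1}\}.
\end{align*}
```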


The arcs between slices run from left to right, reflecting the causal flow of time. If there is an arc from $Z_{t-1}^j$ to $Z_t^j$, the node $Z^j$ is said to be persistent. The arcs within a slice may have arbitrary direction, so long as the overall DBN is a DAG. The arcs within a time slice may be undirected, since they model correlation or constraints rather than causation. The resulting model is then a (dynamic) chain graph.
The parameters of the conditional probabilities $P_{Z_t^j \mid \mathrm{Pa}(Z_t^j)}$ are time-invariant for $t \geq k$, i.e., the model is time-homogeneous. If parameters can change, they may be added to the state-space and treated as random variables or alternatively a hidden variable may be added that selects which set of parameters to use.
Within the engineering community, DBNs have become a popular tool, because they can express
a large number of models and are often computationally tractable.
DBNs have been successfully applied to the reconstruction of genetic networks, where genes do not remain static, but rather their expression levels fluctuate constantly. Increased expression level of a gene will result in increased levels of mRNA from that gene which will in turn influence the expression levels of other genes. DBNs have proved to be a successful way of analysing genetic expression data.
With a Dynamic Bayesian Network, the n × d data matrix no longer represents n independent
instantiations of a random d-vector. Rather, the rows represent time slices of a process {X(t) ∶ t ∈ N}.
Some assumptions (for example time homogeneity) have to be made in order to learn structure and
parameters.
If the number of instantiations n available is large in comparison to d, then standard multivariate time series techniques may be used effectively. If n is small compared with d, other techniques (such as LASSO L1 regularisation) should be used.

19.2 Multivariate Time Series


A VARMA(p,q) model (vector auto regressive moving average, lags p and q for the auto-regressive and moving average parts respectively) is a model:

$$X(t) = \mu_0 + t\mu_1 + \sum_{j=1}^{p} A_j X(t - j) + \sum_{k=1}^{q} B_k \epsilon_{t+1-k}$$

where $\epsilon_t \sim N(0, \Sigma)$ are i.i.d. (the distribution is not necessarily normal, but the normality assumption, if true, leads to sharper estimation).
The MA part often leads to instability for estimation; we therefore only consider VAR(p) processes;

$$X(t) = \mu_0 + t\mu_1 + \sum_{j=1}^{p} A_j X(t - j) + \epsilon_t$$
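To make the VAR equation concrete, the following base R sketch simulates a VAR(1) process with no trend (µ1 = 0) and recovers µ0 and A1 by ordinary least squares, regressing each component of X(t) on X(t − 1). The coefficient matrix, intercept, noise level and sample size are hypothetical values chosen for illustration; this is not the estimation routine of any particular package.

```r
# Simulate X(t) = mu0 + A X(t-1) + eps_t and estimate mu0 and A by least squares.
set.seed(1)
d <- 2
n <- 2000
A   <- matrix(c( 0.5, 0.1,
                -0.2, 0.3), nrow = d, byrow = TRUE)   # hypothetical A_1
mu0 <- c(1, -1)                                        # hypothetical intercept

X <- matrix(0, nrow = n, ncol = d)
for (t in 2:n) {
  X[t, ] <- mu0 + as.vector(A %*% X[t - 1, ]) + rnorm(d, sd = 0.5)
}

# Row t of Y is X(t); the matching row of Xlag is X(t - 1).
Y    <- X[-1, , drop = FALSE]
Xlag <- X[-n, , drop = FALSE]

fit     <- lm(Y ~ Xlag)          # one regression per component of X(t)
mu0_hat <- coef(fit)[1, ]        # estimated intercepts
A_hat   <- t(coef(fit)[-1, ])    # row i holds the lag coefficients of equation i

round(A_hat, 2)
round(mu0_hat, 2)
```

With a sample this long the estimates should be close to the values used in the simulation; with n small relative to d they would not be, which is the situation that motivates regularisation.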

The package vars fits a vector auto regressive model:

> install.packages("vars")
> library(vars)

Within vars, there is a test data-set Canada, which contains 4 macroeconomic indicators: prod (labour productivity), e (employment), U (unemployment rate) and rw (real wages). A VAR(2) model is fitted quite simply with the command:

> data(Canada)
> can = VAR(Canada,p=2)
> summary(can)

VAR Estimation Results:


=========================
Endogenous variables: e, prod, rw, U
Deterministic variables: const
Sample size: 82
Log Likelihood: -175.819
Roots of the characteristic polynomial:
0.995 0.9081 0.9081 0.7381 0.7381 0.1856 0.1429 0.1429
Call:
VAR(y = Canada, p = 2)

Estimation results for equation e:


==================================
e = e.l1 + prod.l1 + rw.l1 + U.l1 + e.l2 + prod.l2 + rw.l2 + U.l2 + const

Estimate Std. Error t value Pr(>|t|)


e.l1 1.638e+00 1.500e-01 10.918 < 2e-16 ***
prod.l1 1.673e-01 6.114e-02 2.736 0.00780 **
rw.l1 -6.312e-02 5.524e-02 -1.143 0.25692
U.l1 2.656e-01 2.028e-01 1.310 0.19444
e.l2 -4.971e-01 1.595e-01 -3.116 0.00262 **
prod.l2 -1.017e-01 6.607e-02 -1.539 0.12824
rw.l2 3.844e-03 5.552e-02 0.069 0.94499
U.l2 1.327e-01 2.073e-01 0.640 0.52418
const -1.370e+02 5.585e+01 -2.453 0.01655 *
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.3628 on 73 degrees of freedom


Multiple R-Squared: 0.9985,Adjusted R-squared: 0.9984
F-statistic: 6189 on 8 and 73 DF, p-value: < 2.2e-16

Estimation results for equation prod:


=====================================
prod = e.l1 + prod.l1 + rw.l1 + U.l1 + e.l2 + prod.l2 + rw.l2 + U.l2 + const

Estimate Std. Error t value Pr(>|t|)


e.l1 -0.17277 0.26977 -0.640 0.52390
prod.l1 1.15043 0.10995 10.464 3.57e-16 ***
rw.l1 0.05130 0.09934 0.516 0.60710
U.l1 -0.47850 0.36470 -1.312 0.19362
e.l2 0.38526 0.28688 1.343 0.18346
prod.l2 -0.17241 0.11881 -1.451 0.15104
rw.l2 -0.11885 0.09985 -1.190 0.23778
U.l2 1.01592 0.37285 2.725 0.00805 **
const -166.77552 100.43388 -1.661 0.10109
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.6525 on 73 degrees of freedom


Multiple R-Squared: 0.9787,Adjusted R-squared: 0.9764
F-statistic: 419.3 on 8 and 73 DF, p-value: < 2.2e-16

Estimation results for equation rw:


===================================
rw = e.l1 + prod.l1 + rw.l1 + U.l1 + e.l2 + prod.l2 + rw.l2 + U.l2 + const

Estimate Std. Error t value Pr(>|t|)


e.l1 -0.268833 0.322619 -0.833 0.407
prod.l1 -0.081065 0.131487 -0.617 0.539
rw.l1 0.895478 0.118800 7.538 1.04e-10 ***
U.l1 0.012130 0.436149 0.028 0.978
e.l2 0.367849 0.343087 1.072 0.287
prod.l2 -0.005181 0.142093 -0.036 0.971
rw.l2 0.052677 0.119410 0.441 0.660
U.l2 -0.127708 0.445892 -0.286 0.775
const -33.188339 120.110525 -0.276 0.783
---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.7803 on 73 degrees of freedom


Multiple R-Squared: 0.9989,Adjusted R-squared: 0.9987
F-statistic: 8009 on 8 and 73 DF, p-value: < 2.2e-16

Estimation results for equation U:


==================================
U = e.l1 + prod.l1 + rw.l1 + U.l1 + e.l2 + prod.l2 + rw.l2 + U.l2 + const

Estimate Std. Error t value Pr(>|t|)


e.l1 -0.58076 0.11563 -5.023 3.49e-06 ***
prod.l1 -0.07812 0.04713 -1.658 0.101682
rw.l1 0.01866 0.04258 0.438 0.662463
U.l1 0.61893 0.15632 3.959 0.000173 ***
e.l2 0.40982 0.12296 3.333 0.001352 **
prod.l2 0.05212 0.05093 1.023 0.309513
rw.l2 0.04180 0.04280 0.977 0.331928
U.l2 -0.07117 0.15981 -0.445 0.657395
const 149.78056 43.04810 3.479 0.000851 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 0.2797 on 73 degrees of freedom


Multiple R-Squared: 0.9726,Adjusted R-squared: 0.9696
F-statistic: 324 on 8 and 73 DF, p-value: < 2.2e-16

Covariance matrix of residuals:


e prod rw U
e 0.131635 -0.007469 -0.04210 -0.06909
prod -0.007469 0.425711 0.06461 0.01392
rw -0.042099 0.064613 0.60886 0.03422
U -0.069087 0.013923 0.03422 0.07821

Correlation matrix of residuals:



e prod rw U
e 1.00000 -0.03155 -0.1487 -0.6809
prod -0.03155 1.00000 0.1269 0.0763
rw -0.14870 0.12691 1.0000 0.1568
U -0.68090 0.07630 0.1568 1.0000

The default value of the type argument is "const", which estimates µ0 and sets µ1 = 0. To set µ0 = 0 and µ1 = 0, type:

> VAR(Canada,p=2,type="none")

To set µ0 = 0 while estimating an unknown trend µ1 , type:

> VAR(Canada,p=2,type="trend")

To estimate both an intercept µ0 and a trend µ1 , type:

> VAR(Canada,p=2,type="both")

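The stability function checks the structural stability of a fitted VAR process (that is, whether the parameters remain constant over the sample), using empirical fluctuation processes such as cumulative sums of residuals. This may be carried out by: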

> var.2c=VAR(Canada,p=2,type="const")
> stab=stability(var.2c,type="OLS-CUSUM")
> plot(stab)

There are several tests for normality which come under normality.test.

> normality.test(var.2c)
$JB

JB-Test (multivariate)

data: Residuals of VAR object var.2c


Chi-squared = 5.094, df = 8, p-value = 0.7475

$Skewness

Skewness only (multivariate)

data: Residuals of VAR object var.2c


Chi-squared = 1.7761, df = 4, p-value = 0.7769

$Kurtosis

Kurtosis only (multivariate)

data: Residuals of VAR object var.2c


Chi-squared = 3.3179, df = 4, p-value = 0.5061

The function serial.test carries out the Portmanteau (i.e. Ljung-Box) test

> serial.test(var.2c,lags.pt=16,type="PT.adjusted")

Portmanteau Test (adjusted)

data: Residuals of VAR object var.2c


Chi-squared = 231.5907, df = 224, p-value = 0.3497

The VARMA model is standard and is treated in any reasonable text on Time Series, for example [?].

19.3 Lasso Learning


One of the most prominent applications of DBNs is to gene expression data and locating regulatory pathways. The main difficulty is that n (the number of instantiations) tends to be small compared with d (the number of genes under investigation). On the other hand, gene expression networks tend to be sparse.
One technique that has developed and is quite effective in such situations is L1 regularisation, or LASSO learning.

LASSO and Least Angle Regression Given a set of input measurements $(x_{j,1}, \ldots, x_{j,d})$ for $j = 1, \ldots, n$ and outcome measurements $y_j : j = 1, \ldots, n$, taken as observations on independent variables, the lasso fits a linear model

$$\hat{y}_j = \hat{\beta}_0 + \sum_{i=1}^{d} x_{j,i} \hat{\beta}_i.$$

The criterion it uses is:

Minimise $\sum_{j=1}^{n} (y_j - \hat{y}_j)^2$ subject to $\sum_{i=1}^{d} |\beta_i| \leq s$ for a constraint value s.

The bound s is a tuning parameter. When s is sufficiently large, the constraint has no effect and the solution is simply the usual multiple linear least squares regression of y on x1 , . . . , xd .
For smaller values of s (s ≥ 0), the solutions are shrunken versions of the least squares estimates. The L1 penalisation often forces some of the coefficient estimates β̂j to be zero. The choice of s therefore plays a similar role to choosing the number of predictors in a regression model.
Cross-validation is the standard tool for estimating the best value for s.
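The shrinking and zeroing effect of the L1 penalty can be seen in a small base R sketch. Note the assumptions: it solves the equivalent penalised (Lagrangian) form, minimising half the residual sum of squares plus lambda times the L1 norm, rather than the constrained form with bound s; and it uses cyclic coordinate descent with soft-thresholding, which is a different technique from the least angle regression algorithm used by the lars package. The function names, the simulated data and the values of lambda are hypothetical.

```r
# Soft-thresholding: shrink z towards zero by g, returning 0 when |z| <= g.
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Coordinate descent for (1/2) * sum((y - x %*% beta)^2) + lambda * sum(abs(beta)).
# x and y are centred first, so the intercept is not penalised.
lasso_cd <- function(x, y, lambda, n_iter = 200) {
  x <- scale(x, center = TRUE, scale = FALSE)
  y <- y - mean(y)
  beta <- rep(0, ncol(x))
  for (iter in seq_len(n_iter)) {
    for (j in seq_along(beta)) {
      # Residual of the current fit with predictor j left out.
      r_j <- y - x[, -j, drop = FALSE] %*% beta[-j]
      beta[j] <- soft_threshold(sum(x[, j] * r_j), lambda) / sum(x[, j]^2)
    }
  }
  beta
}

set.seed(2)
x <- matrix(rnorm(100 * 5), nrow = 100)               # hypothetical predictors
y <- 3 * x[, 1] - 2 * x[, 2] + rnorm(100, sd = 0.1)   # only two of them matter

b_ols   <- lasso_cd(x, y, lambda = 0)     # no penalty: least squares fit
b_mid   <- lasso_cd(x, y, lambda = 20)    # moderate penalty: shrunken estimates
b_large <- lasso_cd(x, y, lambda = 1e6)   # very large penalty: every coefficient is 0
```

A small lambda corresponds to a large bound s and a large lambda to a small s, so the three fits move from the unconstrained least squares solution towards the empty model.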
Forward stepwise regression achieves the same objective as regularisation by adding in explanatory variables one at a time:

• Start with all coefficients βj equal to zero.

• Find the predictor xj which is most correlated with y and add it into the model. Take residuals r = y − ŷ.

• Continue, at each stage adding to the model the predictor most correlated with r.

• Until: all predictors are in the model.
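The steps above can be sketched in base R as follows. The sketch records only the order in which predictors enter; the function name `forward_stepwise_order` and the simulated data are hypothetical, and absolute correlation is assumed as the measure of "most correlated".

```r
# Order in which predictors enter a forward stepwise fit, choosing at each
# stage the predictor most correlated (in absolute value) with the residual.
forward_stepwise_order <- function(x, y) {
  active <- integer(0)
  r <- y - mean(y)                       # residual of the empty model
  while (length(active) < ncol(x)) {
    candidates <- setdiff(seq_len(ncol(x)), active)
    cors <- abs(cor(x[, candidates, drop = FALSE], r))
    active <- c(active, candidates[which.max(cors)])
    fit <- lm(y ~ x[, active, drop = FALSE])   # refit with the enlarged model
    r <- residuals(fit)
  }
  active
}

set.seed(3)
x <- matrix(rnorm(100 * 5), nrow = 100)               # hypothetical predictors
y <- 3 * x[, 1] - 2 * x[, 2] + rnorm(100, sd = 0.1)
entry_order <- forward_stepwise_order(x, y)
entry_order
```

In practice the procedure would be stopped early, with the stopping point chosen by cross-validation, rather than run until every predictor is included.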

The Least Angle Regression procedure follows the same general scheme, but does not add a predictor fully into the model. The coefficient of that predictor is increased only until that predictor is no longer the one most correlated with the residual r. Then some other competing predictor is included.

Least Angle Regression algorithm The algorithm proceeds as follows:

• Start with all coefficients βj equal to zero.

• Find the predictor xj most correlated with y.

• Increase the coefficient βj in the direction of the sign of its correlation with y. Take residuals r = y − ŷ. Stop when some other predictor xk has as much correlation with r as xj has.

• Increase (βj , βk ) in their joint least squares direction, until some other predictor xm has as much correlation with the residual r.

• Continue until: all predictors are in the model.

It can be shown that, with one modification, this procedure gives the entire path of lasso solutions, as s is varied from 0 to infinity. The modification needed is: if a non-zero coefficient hits zero, remove it from the active set of predictors and recompute the joint direction.

Cross-Validation Cross validation is a model evaluation method where some of the data is removed
before training begins. Then when training is done, the data that was removed can be used to test
the performance of the learned model on new data. This is the basic idea for the class of model
evaluation methods called cross validation.

• Holdout The holdout method is the simplest kind of cross validation. The data set is separated into two sets; the training set and the testing set. The function approximator fits a function using the training set only. Then the function approximator is asked to predict the output values for the data in the testing set (it has never seen these output values before). The errors it makes are accumulated to give the mean absolute test set error, which is used to evaluate the model.

• K-fold Cross Validation K-fold cross validation is one way to improve over the holdout method. The data set is divided into K subsets, and the holdout method is repeated K times. Each time, one of the K subsets is used as the test set and the other K − 1 subsets are put together to form a training set. Then the average error across all K trials is computed. The advantage of this method is that it matters less how the data gets divided. Every data point gets to be in a test set exactly once, and gets to be in a training set K − 1 times. The variance of the resulting estimate is reduced as K is increased. The disadvantage of this method is that the training algorithm has to be rerun from scratch K times, which means it takes K times as much computation to make an evaluation. A variant of this method is to randomly divide the data into a test and training set K different times. The advantage of doing this is that you can independently choose how large each test set is and how many trials you average over.

• Leave-one-out Leave-one-out cross validation is K-fold cross validation taken to its logical extreme, with K equal to n, the number of data points in the set. That means that the function approximator is trained on all the data except for one point n separate times and a prediction is made for that point. As before the average error is computed and used to evaluate the model. The evaluation given by leave-one-out cross validation error (LOO-XVE) is good, but at first pass it seems very expensive to compute.
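The K-fold and leave-one-out schemes can be sketched in a few lines of base R. The sketch assumes an ordinary linear model as the function approximator and mean squared error as the error measure (the holdout description above uses absolute error); the function name `kfold_mse` and the simulated data are hypothetical.

```r
# K-fold cross-validated mean squared error of a linear model.
kfold_mse <- function(x, y, K) {
  n <- length(y)
  fold <- sample(rep(seq_len(K), length.out = n))   # random fold label per point
  dat <- data.frame(y = y, x)
  errs <- numeric(K)
  for (k in seq_len(K)) {
    test <- fold == k
    fit  <- lm(y ~ ., data = dat[!test, , drop = FALSE])       # train on K - 1 folds
    pred <- predict(fit, newdata = dat[test, , drop = FALSE])  # predict held-out fold
    errs[k] <- mean((dat$y[test] - pred)^2)
  }
  mean(errs)                                         # average error over the K trials
}

set.seed(4)
x <- matrix(rnorm(100 * 3), nrow = 100)              # hypothetical predictors
y <- 3 * x[, 1] - 2 * x[, 2] + rnorm(100, sd = 0.5)

cv10 <- kfold_mse(x, y, K = 10)          # 10-fold cross validation
loo  <- kfold_mse(x, y, K = length(y))   # leave-one-out: K equal to n
```

Setting K equal to n makes every fold a single data point, so the same function covers the leave-one-out case at the cost of n model fits.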

19.3.1 Implementation
There are several packages available in R for DBN learning. One of the most prominent is the lars package by Hastie and Efron [60] (2012). Other packages available are the glmnet package by Friedman et al. [44] (2010) and penalized by Goeman [54] (2012). For illustration, we use the arth800MTS data set from the GeneNet package. This describes the expression levels of 800 genes of Arabidopsis thaliana during the diurnal cycle. We consider a subset arth12 of 12 of the genes.

> library(lars)
> library(GeneNet)
> data(arth800)
> subset=c(60,141,260,333,365,424,441,512,521,578,789,799)
> arth12=arth800.expr[,subset]

Now lars is used to estimate a model for a target variable specified by a vector (say y) and a set of possible parents specified by a matrix of predictors (say x). The arth800 data set consists of two time series, each of 11 points in length. That is, there are two repeated measurements for each time point. To estimate a VAR(1) process, firstly remove the two repeated measurements for the first time point of y and the two repeated measurements for the last time point of x. They cannot be used for LASSO, since y(t) needs x(t − 1).

> x = arth12[1:(nrow(arth12)-2),]
> y = arth12[-(1:2),"265768_at"]
> lasso.fit = lars(y=y,x=x,type="lasso")
> plot(lasso.fit)

The plot is shown in Figure 19.1.

[Figure: LASSO coefficient paths. x-axis: |beta|/max|beta|; y-axis: Standardized Coefficients.]

Figure 19.1: Lasso output

The figure is interpreted as follows: the aim is to predict y(t) (the expression levels for the gene labelled 265768_at) by the expression levels one time unit earlier, x(t − 2) (given at time index t − 2 because we have double measurements for each time). The regression is carried out by evaluating the coefficients β which minimise $\sum_{t=3}^{22} \left(y(t) - \sum_{j=1}^{12} x_j(t-2)\beta_j\right)^2$, subject to a constraint that $\sum_{j=1}^{12} |\beta_j| \leq s$ for s increasing. For the x-axis, this is presented as $|\beta| / \max|\beta|$, where $|\beta| = \sum_{j=1}^{12} |\beta_j|$ and $\max|\beta|$ is the value of $\sum_{j=1}^{12} |\beta_j|$ for the unconstrained problem.

The values of the coefficients are denoted by different colours and the plot shows how they change as the value of s increases. The vertical lines indicate the points at which new coefficients are introduced. The coefficients may be obtained by

> coef(lasso.fit)

Structure learning (i.e. deciding which directed edges to include in the network) is carried out via
cross-validation. The cv.lars function does this.

> lasso.cv=cv.lars(y=y,x=x,mode="fraction")

The output gives the MSE (mean squared error) as a function of $|\beta| / \max|\beta|$ (where $|\beta|$ denotes the constraint and $\max|\beta|$ denotes the value of $\sum_{j=1}^{12} |\beta_j|$ for the unconstrained problem) and the output is shown in Figure 19.2. The optimal set of arcs is chosen to minimise the mean squared error.

> frac=lasso.cv$index[which.min(lasso.cv$cv)]
> predict(lasso.fit,s=frac,type="coef",mode="fraction")
$s
[1] 0.1919192

$fraction
[1] 0.1919192

$mode
[1] "fraction"

$coefficients
265768_at 263426_at 260676_at 258736_at 257710_at 255764_at
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
255070_at 253425_at 253174_at 251324_at 245319_at 245094_at
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 -0.6420806

The non-zero coefficients indicate the arcs into the gene 265768_at to be included for the optimal value s=frac computed by cv.lars.

[Figure: cross-validated MSE (y-axis) against the fraction of the final L1 norm (x-axis).]

Figure 19.2: Lasso cross validation

The number of steps can be controlled by setting the mode argument of predict to step.

> predict(lasso.fit,s=3,type="coef",mode="step")$coefficients
265768_at 263426_at 260676_at 258736_at 257710_at 255764_at 255070_at 253425_at
-0.02152962 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000
253174_at 251324_at 245319_at 245094_at
0.00000000 0.00000000 0.00000000 -0.72966658

The L1 penalty can be specified with mode = lambda

> predict(lasso.fit,s=0.2,type="coef",mode="lambda")$coefficients
265768_at 263426_at 260676_at 258736_at 257710_at 255764_at 255070_at 253425_at
0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
253174_at 251324_at 245319_at 245094_at
0.0000000 0.0000000 0.0000000 -0.6961228

The lars package also fits least angle regression and stepwise regression.

> lar.fit=lars(y=y,x=x,type="lar")
> lar.cv=cv.lars(y=y,x=x,type="lar")
> step.fit=lars(y=y,x=x,type="stepwise")
> step.cv=cv.lars(y=y,x=x,type="stepwise")

19.4 simone: Statistical Inference for MOdular NEtworks


The simone package by Chiquet et al. [27] (2009) implements LASSO specifically for dynamic Bayesian networks. Install the package, activate it and get information using

> install.packages("simone")
> library(simone)
> ?simone

It works on the principle that the n×d data matrix contains n sequential observations of the d variables and it fits a VAR(1) model. The default is clustering = FALSE.

> result = simone(arth12,type="time-course")

The output is the number of edges in the network depending on the penalisation (default: BIC). A
sequencing display of the network as the penalty is reduced is obtained by:

> plot.simone(result)

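The analysis can be carried out with clustering; edges are penalised if latent clustering is discovered while constructing the network. The object ctrl passed as the control argument below is not defined in this listing; it is assumed to be a list of options created beforehand.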

> resultcluster=simone(arth12,type="time-course",clustering=TRUE,control=ctrl)

The sequencing display of the network indicates that clustering has not changed the output much.

19.5 GeneNet, G1DBN


> install.packages("G1DBN")
> library(G1DBN)
> data(arth800line)
> subset=c(60,141,260,333,365,424,441,512,521,578,789,799)
> arth12=as.matrix(arth800line[,subset])

Learning is carried out in two stages: firstly, learning the graph encoding the first order partial dependencies with DBNScoreStep1.

> step1=DBNScoreStep1(arth12,method="ls")
> edgesG1=BuildEdges(score=step1$S1ls,threshold=0.50,prec=6)
> nrow(edgesG1)
[1] 27

The help commands describe the second step.

> step2=DBNScoreStep2(step1$S1ls,data=arth12,method="ls",alpha1=0.50)
> edgesG=BuildEdges(score=step2,threshold=0.05,prec=6)

19.6 Inference for Dynamic Bayesian Networks


For a given DBN (where the network structure and the conditional probability potentials have been specified), the queries of interest are usually those of computing the marginal distribution of Xi (t) conditioned on all nodes other than Xi (t) at times 1, . . . , T . In line with standard time series problems, these problems fall into three categories:

• If T = t, the query is called filtering.

• If T > t (node Xi (t) is omitted), the query is called smoothing. It returns a smoothed value X̂i (t); the aim of the query is noise reduction.

• If T < t, the query is called prediction.

Queries which ask for the Most Probable Explanation can be performed for filtering, smoothing and prediction with the lars package.
To see how it works, consider the arth12 data set:

> library(GeneNet)
> data(arth800)
> subset = c(60, 141, 260, 333, 365, 424, 441, 512,
+ 521, 578, 789, 799)
> arth12 = arth800.expr[, subset]
> library(lars)
> x = arth12[1:(nrow(arth12) - 2), ]
> y = arth12[-(1:2), "265768_at"]

y contains the expression levels of gene 265768_at at all times except for time 0 (recall that there are
two measurements at each time). x contains the whole data set for all times except for the last one,
labelled 24.

> lasso.fit = lars(y = y, x = x, type = "lasso")


> lasso.cv = cv.lars(y = y, x = x, mode = "fraction")
> frac = lasso.cv$index[which.min(lasso.cv$cv)]

frac contains the value of the index that minimises the cross-validated mean squared error. Therefore, this is the value that is used to build the model. Estimation for the expression levels of 265768_at may be carried out quite simply by:

> lasso.est = predict(lasso.fit, type = "fit",
+ newx = x, s = frac,
+ mode = "fraction")$fit
> lasso.est
     0-1      0-2      1-1      1-2      2-1      2-2      4-1
7.099782 6.894064 7.166249 7.157744 7.592092 7.379432 7.990548
     4-2      8-1      8-2     12-1     12-2     13-1     13-2
8.078921 8.353137 8.333108 8.940241 8.780302 8.816387 8.758480
    14-1     14-2     16-1     16-2     20-1     20-2
8.542374 8.417818 7.446577 7.329513 6.717392 6.747178

The estimated expression levels at 20-1 and 20-2 are a result of filtering, while the others given here are a result of smoothing.
The values of 24-1 and 24-2 can be predicted by:

> lasso.pred = predict(lasso.fit, type = "fit",


+ newx = arth12[c("24-1", "24-2"), ],
+ s = frac, mode = "fraction")$fit
> lasso.pred
24-1 24-2
6.822643 6.882054

The penalized package fits LASSO models which are compatible with bnlearn. Therefore, more complex conditional probability queries can be carried out using cpquery and cpdist if the model is first learned in this way.

> library(penalized)
> lambda = optL1(response = y, penalized = x)$lambda
> lasso.t = penalized(response = y, penalized = x,
+ lambda1 = lambda)
# nonzero coefficients: 2
> coef(lasso.t)
(Intercept) 245094_at
14.0402894 -0.7059011

The only parent of gene 265768_at is 245094_at, which seems to act as an inhibitor.

This suggests that a model with this explanatory variable might be useful. Such a DBN can be created
in the following way:

> library(bnlearn)
> dbn1 =
+ model2network("[245094_at][265768_at|245094_at]")
> xp.mean = mean(x[, "245094_at"])
> xp.sd = sd(x[, "245094_at"])
> dbn1.fit =
+ custom.fit(dbn1,
+ dist = list("245094_at" = list(coef = xp.mean,
+ sd = xp.sd), "265768_at" = lasso.t))

Since the data is continuous, there are two possibilities: either create a Gaussian network, or discretise the variables. The network dbn1 is Gaussian. The mean xp.mean and standard deviation xp.sd need to be specified.
The regression analysis suggests that high expression levels of 245094_at at time t − 1 lead to low expression levels of 265768_at at time t. The cpquery function can be used:

> cpquery(dbn1.fit, event = (`265768_at` > 8),
+ evidence = (`245094_at` > 8))
[1] 0.2454624
> cpquery(dbn1.fit, event = (`265768_at` > 8),
+ evidence = (`245094_at` < 8))
[1] 0.9829545

Note With this package, it is not permitted to condition on events of measure 0. Therefore, intervals must be specified both for event and evidence.

The function cpdist may be used to generate random observations. To compare the conditional
distributions for both pieces of evidence, use:

> dist.low = cpdist(dbn1.fit, node = "265768_at",
+ evidence = (`245094_at` < 8))
> dist.high = cpdist(dbn1.fit, node = "265768_at",
+ evidence = (`245094_at` > 8))

These may be plotted and the densities compared.

Now suppose that the variables at time t are not independent of those at t − 2 given t − 1. It is then a
good idea to construct a DBN which depends on lags 1 and 2. To check whether the introduction of
t − 2 to explain t improves the model:

> y = arth12[-(1:2), "245094_at"]


> colnames(x)[12] = "245094_at1"
> lambda = optL1(response = y, penalized = x)$lambda
> lasso.s = penalized(response = y, penalized = x,
+ lambda1 = lambda)
> coef(lasso.s)
(Intercept) 258736_at 257710_at 255070_at 245319_at
-2.659077706 -0.009220815 0.273648262 -0.444106451 -0.134050990
245094_at1
1.589716443

The assumption is that the DBN is time homogeneous. These results suggest a network structure
which can be created as follows:

> dbn2 = empty.graph(c("265768_at", "245094_at",


+ "258736_at", "257710_at", "255070_at",
+ "245319_at", "245094_at1"))
> dbn2 = set.arc(dbn2, "245094_at", "265768_at")
> for (node in names(coef(lasso.s))[-c(1, 6)])
+ dbn2 = set.arc(dbn2, node, "245094_at")
> dbn2 = set.arc(dbn2, "245094_at1", "245094_at")

The parameters of dbn2 may be estimated via maximum likelihood. The parameters of 265768_at and 245094_at may then be substituted with those from the LASSO models lasso.t and lasso.s.
19.7 Exercises
1. Consider the Canada data set from the vars package. Load the data set, make some exploratory analysis and estimate a VAR(1) process for this data set. Estimate the auto-regressive matrix A and the constant matrix B which define the VAR(1) model.
Compare the results with the LASSO matrix when the L1 penalty is estimated by cross-validation. What are your conclusions?

2. Consider the arth800 data set from the GeneNet package. Load the data set. The time series expression of the 800 genes is included in a data set called arth800.expr. Investigate its properties.
Compute the variances of each of the 800 variables, plot them in decreasing order and create a data set with those variables whose variance is greater than 2.
Can you fit a VAR process using the vars package (unlikely)? Suggest alternative approaches (such as LASSO) and apply them. Estimate a DBN with each approach and compare the DBNs. Plot the DBNs using plot from G1DBN.

Chapter 20

Factor graphs and the sum product algorithm

This chapter describes the Sum Product Algorithm, henceforth abbreviated SPA, which was introduced by Wiberg [145] (1996). It is an algorithm for obtaining the marginals of a factorised function. It has also become known as Loopy Belief Propagation. It operates on factor graphs. SPA can be considered as the most elementary of a family of related algorithms, consisting of double-loop algorithms (see Heskes et al. [63] (2003)), Generalised Belief Propagation (see Yedidia et al. [149] (2005)), Expectation Propagation (see [94] (2001)), Expectation Consistent Approximate Inference (see Opper and Winter [102] (2005)), the Max-Product Algorithm (see Weiss and Freeman [143] (2001)), the Survey Propagation Algorithm (see Braunstein, Mézard and Zecchina [7] (2004) and [6] (2005)) and Fractional Belief Propagation (see Tatikonda [133] (2003)), to name but a few variants. SPA and its variants provide a natural method for a wide variety of applications: Wiberg [145] discusses applications to error correcting codes, an application developed by McEliece, MacKay and Cheng [92] (1998). It is used for satisfiability problems in combinatorial optimisation [7] and in computer vision (stereo matching: Sun, Zheng and Shum [131] (2003); image restoration: Tanaka [132] (2002)). More recently, a variant known as the `Stochastic Belief Propagation' algorithm was developed by Noorshams and Wainwright [101] (2013) with applications to image analysis. For that situation, the number of states of each variable is large, so that only a few of the states are randomly selected for update in each cycle of the algorithm.

20.1 Factorisation and Local Functions


As usual, let $\tilde{V} = \{1, \ldots, d\}$, and for each $j \in \tilde{V}$ let $\mathcal{X}_j = (x_j^{(1)}, \ldots, x_j^{(k_j)})$ denote the finite state space of variable $X_j$. Let $\mathcal{X} = \times_{j=1}^{d} \mathcal{X}_j$. The space $\mathcal{X}$ is the configuration space. Let $\phi$ denote a function defined on $\mathcal{X}$. Let $x = (x_1, \ldots, x_d) \in \mathcal{X}$ denote a configuration and, for a subset $D \subseteq \{1, \ldots, d\}$, where $D = \{j_1, \ldots, j_m\}$, let $x_D = (x_{j_1}, \ldots, x_{j_m})$ and $\mathcal{X}_D = \times_{v \in D} \mathcal{X}_v$.
A domain $\mathcal{X}_D$ for $D \subset \{1, \ldots, d\}$ (where the subset is strict) is called a local domain.

Definition 20.1 (Factorisability). The function $\phi$ is said to be factorisable if it factors into a product of several local functions $\gamma_j$ each defined on local domains, such that

$$\phi(x) = \prod_{j \in J} \gamma_j(x_{D_j}) \tag{20.1}$$

for a collection of local domains $\mathcal{X}_{D_j}$, $j \in J$ where $J = \{1, 2, \ldots, q\}$ and $q \leq d$.

For a factorisable function $\phi$, consider the problem of computing the marginal

\[ \phi_i(x_i) = \sum_{z \in \mathcal{X}_{\tilde V \setminus \{i\}}} \prod_{j \in J} \gamma_j(z, x_i), \tag{20.2} \]

where the domains of the functions have been extended to $\mathcal{X}$ (Definition 7.2). This is also known
as the `one i (eye) problem'. The aim of this chapter is to describe a procedure for computing the
marginalisation which exploits the way in which the global function is factorised and uses the current
values to update the values assigned to each variable. The method involves a factor graph, which is an
example of a bipartite graph.

Definition 20.2 (Bipartite Graph). A graph G is bipartite if its node set can be partitioned into two
sets W and U in such a way that every edge in G has one node in W and another in U.

A factor graph is a bipartite graph that expresses the structure of the factorisation given by
Equation (20.1). The graph has the following properties:

• there is a variable node (an element of U) for each variable. A capital letter $X$ will be used to
denote the variable node and a small letter $x$ the value in the state space $\mathcal{X}_X$ associated with the
variable.

• there is a function node (an element of W) for each function $\gamma_j$; the symbol $\gamma_j$ will be used to denote both
the local function and the node.

• there is an undirected edge connecting variable node $X_i$ to function node $\gamma_j$ if and only if $X_i$ is in the local
domain of $\gamma_j$.

In other words, a factor graph is a representation of the relation `is an argument of'.

Example 20.3 (A Bayesian Network as a Factor Graph).

A Bayesian Network has a joint probability distribution that factorises according to a DAG. This joint
distribution can be converted into a factor graph. Each function node is a local function $P_{X_i \mid \Pi_i}$, and edges
are drawn from this node to $X_i$ and to each of its parents $\Pi_i$. The DAG corresponding to the factorisation

\[ P_{X_1 X_2 X_3 X_4} = P_{X_1} P_{X_2 \mid X_1} P_{X_3 \mid X_1 X_2} P_{X_4 \mid X_3} \]

of $P_{X_1, X_2, X_3, X_4}$ is shown in Figure 20.1 and the corresponding factor graph in Figure 20.2.

[Figure 20.1: A Directed Acyclic Graph, with edges X1 → X2, X1 → X3, X2 → X3 and X3 → X4]

[Figure 20.2: The Factor Graph Corresponding to the Directed Acyclic Graph in Figure 20.1, with function nodes pX1, pX2∣X1, pX3∣X1,X2 and pX4∣X3]
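The construction of a factor graph from a factorisation can be sketched in a few lines. The following is a minimal illustration, not part of the original text: the node labels and the helper names `factor_graph_edges` and `is_tree` are assumptions introduced here. It builds the edge set of the factor graph of Example 20.3 from the arguments of each local function and checks whether the resulting graph is a tree.

```python
# Scopes (arguments) of the local functions in Example 20.3; node names are illustrative.
factor_scopes = {
    "p(X1)": ["X1"],
    "p(X2|X1)": ["X2", "X1"],
    "p(X3|X1,X2)": ["X3", "X1", "X2"],
    "p(X4|X3)": ["X4", "X3"],
}

def factor_graph_edges(scopes):
    """One undirected edge (function node, variable node) per argument."""
    return [(f, v) for f, args in scopes.items() for v in args]

def is_tree(scopes):
    """A graph is a tree iff it is connected and has (number of nodes - 1) edges."""
    edges = factor_graph_edges(scopes)
    nodes = set(scopes) | {v for args in scopes.values() for v in args}
    parent = {n: n for n in nodes}          # union-find for connectivity

    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n

    for f, v in edges:
        parent[find(f)] = find(v)
    connected = len({find(n) for n in nodes}) == 1
    return connected and len(edges) == len(nodes) - 1

chain_scopes = {"g1": ["X1", "X2"], "g2": ["X2", "X3"]}   # a chain on three variables
print(len(factor_graph_edges(factor_scopes)))  # 8
print(is_tree(factor_scopes))                  # False
print(is_tree(chain_scopes))                   # True
```

The factor graph of Example 20.3 has eight nodes and eight edges, so it contains a cycle (through X1, pX2∣X1, X2 and pX3∣X1,X2), whereas a chain of pairwise factors gives a tree.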

20.2 The Sum Product Algorithm


Figure 20.3 indicates the messages to be passed. The following notation is introduced:

\[ \mu_{X \to \gamma_j}(x), \quad x \in \mathcal{X}_X : \text{ variable to local function.} \]

This is the message sent from the variable node $X$ to the function node $\gamma_j$ in the sum product algorithm, and

\[ \mu_{\gamma_j \to X}(x), \quad x \in \mathcal{X}_X : \text{ local function to variable.} \]

This is the message sent from the function node $\gamma_j$ to the variable node $X$.

[Figure 20.3: Updates in a Factor Graph; the message µX→γj passes from X to γj and the message µγj→X from γj to X]

Recall the definition of neighbour (Definition 1.2). $N_v$ will be used to denote the set of neighbours of a
node $v$. A factor graph is undirected. By the definition of a factor graph, all the neighbours of a node
will be of the opposite type to the node itself.
The message sent from node $v$ on edge $e$ is the product of the local function at $v$ (or the unit function
if $v$ is a variable node) with all messages received at $v$ on edges other than $e$, marginalised to
the variable associated with $e$. The messages are defined recursively as follows.

Definition 20.4 (Sum Product Update Rule). For each $X_k \in N_{\gamma_j}$,

\[ \mu_{X_k \to \gamma_j}(x) = \begin{cases} \prod_{h \in N_{X_k} \setminus \{\gamma_j\}} \mu_{h \to X_k}(x) & N_{X_k} \setminus \{\gamma_j\} \neq \emptyset \\ 1 & N_{X_k} \setminus \{\gamma_j\} = \emptyset \end{cases} \qquad \forall x \in \mathcal{X}_k \tag{20.3} \]

and for each $\gamma_j \in N_{X_k}$,

\[ \mu_{\gamma_j \to X_k}(x) = \sum_{y \in \mathcal{X}_{D_j \setminus \{k\}}} \gamma_j(y, x) \prod_{X_l \in N_{\gamma_j} \setminus \{X_k\}} \mu_{X_l \to \gamma_j}(y_l) \qquad \forall x \in \mathcal{X}_k \tag{20.4} \]

where $\emptyset$ denotes the empty set, the arguments of $\gamma_j$ have been arranged so that variable $X_k$
takes the last position, the sum runs over the remaining arguments of $\gamma_j$, and $y_l$ is the value taken by variable $X_l$ ($l \neq k$).
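The two update rules can be sketched as follows. This is a minimal illustration under stated assumptions, not the book's implementation: messages and local functions are stored as Python dictionaries, the function names `variable_to_function` and `function_to_variable` are hypothetical, and the sum in the second rule runs over the other arguments of the local function.

```python
from itertools import product

def variable_to_function(incoming, states):
    """Rule (20.3): pointwise product of the messages arriving at a variable
    from all neighbouring function nodes except the recipient.
    `incoming` is a list of dicts state -> value; an empty list gives 1."""
    out = {}
    for x in states:
        value = 1.0
        for msg in incoming:
            value *= msg[x]
        out[x] = value
    return out

def function_to_variable(gamma, scope, target, incoming, states):
    """Rule (20.4): multiply the local function by the messages from every
    neighbouring variable except `target`, then sum those variables out.
    `gamma` maps tuples ordered as `scope` to values; `incoming[v]` is the
    message from variable v; `states[v]` lists the states of v."""
    others = [v for v in scope if v != target]
    out = {}
    for x in states[target]:
        total = 0.0
        for ys in product(*(states[v] for v in others)):
            assignment = dict(zip(others, ys))
            assignment[target] = x
            term = gamma[tuple(assignment[v] for v in scope)]
            for v in others:
                term *= incoming[v][assignment[v]]
            total += term
        out[x] = total
    return out

# Usage with a hypothetical local function gamma(X1, X2) on binary states.
states = {"X1": [0, 1], "X2": [0, 1]}
gamma = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
leaf_msg = variable_to_function([], states["X1"])      # X1 has no other neighbour
msg = function_to_variable(gamma, ["X1", "X2"], "X2", {"X1": leaf_msg}, states)
print(msg)  # {0: 4.0, 1: 6.0}
```

Here X1 is a leaf, so it sends the unit message, and the message from the function node to X2 is simply the local function summed over x1.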

The flow of computation in a factor graph is illustrated in Figure 20.4.

Figure 20.4: Updates in a fragment of a Factor Graph

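Definition 20.5 (Initialisation). The initialisation is

\[ \mu_{X_k \to \gamma_j}(x) = 1 \qquad \forall x \in \mathcal{X}_k \]

for each $X_k \in N_{\gamma_j}$, and

\[ \mu_{\gamma_j \to X_k}(x) = 1 \qquad \forall x \in \mathcal{X}_k \]

for each $\gamma_j \in N_{X_k}$; that is, for each variable node $X_k$ and each function node $\gamma_j$ that are neighbours.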
Definition 20.6 (Termination). The termination at a node is the product of all messages directed
towards that node:

\[ \mu_{X_k}(x) = \prod_{\gamma_j \in N_{X_k}} \mu_{\gamma_j \to X_k}(x), \qquad x \in \mathcal{X}_k \tag{20.5} \]

and

\[ \mu_{\gamma_j}(x_{D_j}) = \prod_{X_k \in N_{\gamma_j}} \mu_{X_k \to \gamma_j}(x_k) \qquad \forall x_{D_j} \in \mathcal{X}_{D_j}. \]

Note that the function node receives communications from precisely those variables that are in the
domain of the function.

After sufficiently many messages have been sent according to a suitable schedule, the termination at the
variable node yields the marginalisation, or a suitable approximation to the marginalisation, over that
variable. That is,

\[ \mu_{X_i}(x) = \sum_{y \in \mathcal{X}_{\tilde V \setminus \{i\}}} \phi(y, x) \qquad \forall x \in \mathcal{X}_i, \]

where the arguments of $\phi$ have been rearranged so that variable $X_i$ appears last.

Note Consider the problem where the potentials initially represent probability distributions over the
domains and where hard evidence is inserted, rendering the potential zero over the `impossible' states. For the
initialisation, only those states that are possible are included, with the initialisation set to 1; the other
states are not included (equivalently, the corresponding initialisation is set to zero). The termination
at a node then gives the joint probability of the variable and the evidence. If a conditional
probability is required, then the answer has to be normalised.

The Schedule One node is arbitrarily chosen as a root and, for the purposes of constructing a
schedule, the edges are directed to form a directed acyclic graph, where the root has no parents. If the
graph is a tree, then the choice of directed acyclic graph is uniquely defined by the choice of the root
node. Computation begins at the leaves of the factor graph.

• Each leaf variable node sends the trivial identity function to its parents.

• Each leaf function node sends a description of γ to its parents.

• Each node waits for the messages from all its children before computing the message to be sent
to its parents.

• Once the root has received messages from all its children, it sends messages to all its children.

• Each node waits for messages from all its parents before computing the message to be sent to its
children.

This is repeated from root to leaves and is iterated a suitable number of times. No iterations are needed
if the factor graph is cycle free. This is known as a generalised forward and backward algorithm.

The following result was proved by N. Wiberg [145].

Theorem 20.7 (Wiberg). Let

\[ \phi(x) = \prod_{j} \gamma_j(x_{D_j}) \]

and let $G$ be a factor graph with no cycles, representing $\phi$. Then, for any variable node $X_k$, the
marginal of $\phi$ at $x \in \mathcal{X}_k$ is

\[ \mu_{X_k}(x) = \sum_{y \in \mathcal{X}_{\tilde V \setminus \{k\}}} \phi(y, x), \]

where the arguments of $\phi$ have been rearranged so that the $k$th variable appears last and $\mu_{X_k}(x)$ is given
in Equation (20.5).

Example 20.8.

Before giving a proof of Wiberg's theorem, the following example may be instructive. Consider

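\[ \phi(x_1, x_2, x_3) = \gamma_1(x_1, x_2)\,\gamma_2(x_2, x_3). \]

The factor graph is then a tree, given in Figure 20.5.

[Figure 20.5: An Example on Three Variables and Two Functions; the chain X1, γ1(x1, x2), X2, γ2(x2, x3), X3]

In this case, the messages are:

\begin{align*}
\mu_{X_1 \to \gamma_1}(x_1) &= \mu_{X_3 \to \gamma_2}(x_3) = 1, \\
\mu_{\gamma_1 \to X_2}(x_2) &= \sum_{x_1 \in \mathcal{X}_1} \gamma_1(x_1, x_2)\,\mu_{X_1 \to \gamma_1}(x_1) = \sum_{x_1 \in \mathcal{X}_1} \gamma_1(x_1, x_2), \\
\mu_{\gamma_2 \to X_2}(x_2) &= \sum_{x_3 \in \mathcal{X}_3} \gamma_2(x_2, x_3)\,\mu_{X_3 \to \gamma_2}(x_3) = \sum_{x_3 \in \mathcal{X}_3} \gamma_2(x_2, x_3), \\
\mu_{X_2 \to \gamma_2}(x_2) &= \mu_{\gamma_1 \to X_2}(x_2) = \sum_{x_1 \in \mathcal{X}_1} \gamma_1(x_1, x_2), \\
\mu_{X_2 \to \gamma_1}(x_2) &= \mu_{\gamma_2 \to X_2}(x_2) = \sum_{x_3 \in \mathcal{X}_3} \gamma_2(x_2, x_3), \\
\mu_{\gamma_1 \to X_1}(x_1) &= \sum_{x_2 \in \mathcal{X}_2} \gamma_1(x_1, x_2)\,\mu_{X_2 \to \gamma_1}(x_2) = \sum_{(x_2, x_3) \in \mathcal{X}_2 \times \mathcal{X}_3} \gamma_1(x_1, x_2)\,\gamma_2(x_2, x_3), \\
\mu_{\gamma_2 \to X_3}(x_3) &= \sum_{(x_1, x_2) \in \mathcal{X}_1 \times \mathcal{X}_2} \gamma_1(x_1, x_2)\,\gamma_2(x_2, x_3).
\end{align*}

Note that the variable terminations are

\begin{align*}
\mu_{X_1}(x_1) &= \mu_{\gamma_1 \to X_1}(x_1) = \sum_{(x_2, x_3) \in \mathcal{X}_2 \times \mathcal{X}_3} \gamma_1(x_1, x_2)\,\gamma_2(x_2, x_3), \\
\mu_{X_2}(x_2) &= \mu_{\gamma_1 \to X_2}(x_2)\,\mu_{\gamma_2 \to X_2}(x_2) = \sum_{(x_1, x_3) \in \mathcal{X}_1 \times \mathcal{X}_3} \gamma_1(x_1, x_2)\,\gamma_2(x_2, x_3), \\
\mu_{X_3}(x_3) &= \mu_{\gamma_2 \to X_3}(x_3) = \sum_{(x_1, x_2) \in \mathcal{X}_1 \times \mathcal{X}_2} \gamma_1(x_1, x_2)\,\gamma_2(x_2, x_3),
\end{align*}

which are the required marginalisations. The theorem of N. Wiberg states that if the factor graph is a
tree, then after a full schedule, the terminations give the required marginalisation.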
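The messages of Example 20.8 can be checked numerically. The sketch below is only an illustration: the binary state space and the tables `g1` and `g2` are arbitrary values chosen here, not taken from the text. It computes the messages in the order given above and compares each termination with the marginal obtained by summing the product of the two local functions directly.

```python
from itertools import product

S = [0, 1]  # common state space for X1, X2, X3 (an assumption for the illustration)
g1 = {(0, 0): 0.5, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.25}   # gamma_1(x1, x2)
g2 = {(0, 0): 1.5, (0, 1): 0.5, (1, 0): 1.0, (1, 1): 3.0}    # gamma_2(x2, x3)

# Inward messages to X2 (the leaf variables X1 and X3 send the constant 1).
f1_to_x2 = {x2: sum(g1[x1, x2] for x1 in S) for x2 in S}
f2_to_x2 = {x2: sum(g2[x2, x3] for x3 in S) for x2 in S}
# Outward messages: X2 forwards to each function what it received from the other.
f1_to_x1 = {x1: sum(g1[x1, x2] * f2_to_x2[x2] for x2 in S) for x1 in S}
f2_to_x3 = {x3: sum(g2[x2, x3] * f1_to_x2[x2] for x2 in S) for x3 in S}

terminations = {
    "X1": f1_to_x1,
    "X2": {x2: f1_to_x2[x2] * f2_to_x2[x2] for x2 in S},
    "X3": f2_to_x3,
}

def brute_marginal(index):
    """Marginal of gamma_1 * gamma_2 on the variable in position `index`."""
    out = {s: 0.0 for s in S}
    for x in product(S, repeat=3):
        out[x[index]] += g1[x[0], x[1]] * g2[x[1], x[2]]
    return out

for idx, name in enumerate(["X1", "X2", "X3"]):
    print(name, terminations[name], brute_marginal(idx))
```

For each variable the two printed dictionaries agree, as the theorem asserts for a factor graph that is a tree.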

Proof of Theorem 20.7 Consider Figure 20.6. Suppose that a full schedule has been performed on
a tree. The proof proceeds in three steps.

Step 1: Decompose the factor graph into $n$ components, $R_1, \ldots, R_n$. Choose a variable $X_i$ and
suppose that $n$ edges enter the variable node $X_i$. Since there are no cycles, the margin $\sum_{y \in \mathcal{X}_{\tilde V \setminus \{i\}}} \phi(y, x_i)$
(where the arguments of $\phi$ have been suitably rearranged) may be written as

\begin{align*}
\sum_{y \in \mathcal{X}_{\tilde V \setminus \{i\}}} \phi(y, x_i)
&= \sum_{y \in \mathcal{X} \mid y_i = x_i} \; \prod_{j \in R_1} \gamma_j(y_{D_j}) \prod_{j \in R_2} \gamma_j(y_{D_j}) \cdots \prod_{j \in R_n} \gamma_j(y_{D_j}) \\
&= \prod_{k=1}^{n} \; \sum_{y_{R_k} \in \mathcal{X}_{R_k} \mid y_i = x_i} \; \prod_{j \in R_k} \gamma_j(y_{D_j}) \\
&= \prod_{k=1}^{n} \nu_{R_k}(x_i),
\end{align*}

where $\nu_{R_k}(x_i)$ denotes the inner sum over the variables of component $R_k$ with $y_i = x_i$ held fixed. The last expression has the same form as the termination formula.
Therefore the assertion is proved if it can be established that

\[ \nu_{R_k}(x_i) = \mu_{\gamma_k^0 \to X_i}(x_i), \qquad k = 1, \ldots, n, \]

where $\gamma_1^0, \ldots, \gamma_n^0$ are the $n$ function nodes that are neighbours of $X_i$. By symmetry, it is
only necessary to consider one of these.

[Figure 20.6: Step 1 (chosen variable X1, which has two neighbours)]

[Figure 20.7: Step 2]

Step 2 Consider the decomposition of $R_1$. The case where $\gamma_1^0$ has three neighbours is illustrated in
Figure 20.7. In the three variable case shown in Figure 20.7, $X_1$ is the node under consideration and
$\gamma_1^0$ is outside $R_3$ and $R_4$. Suppose the variables neighbouring $\gamma_1^0$ are $X_1, Y_1, \ldots, Y_m$ and the regions
corresponding to $Y_1, \ldots, Y_m$ are $R_{11}, \ldots, R_{1m}$ respectively. Then $\nu_{R_1}$ can be decomposed as

\begin{align*}
\nu_{R_1}(x_1) &= \sum_{y \in \mathcal{X}_{R_1} \mid y_1 = x_1} \; \prod_{j \in R_1} \gamma_j(y_{D_j}) \\
&= \sum_{(y_1, \ldots, y_m)} \gamma_1^0(x_1, y_1, \ldots, y_m) \prod_{k=1}^{m} \Bigg( \sum_{z_{R_{1k}} \in \mathcal{X}_{R_{1k}} \mid Y_k = y_k} \; \prod_{j \in R_{1k}} \gamma_j(z_{D_j}) \Bigg) \\
&= \sum_{(y_1, \ldots, y_m)} \gamma_1^0(x_1, y_1, \ldots, y_m) \prod_{k=1}^{m} \tilde\nu_{R_{1k}}(y_k),
\end{align*}

where the notation $Y_k = y_k$ means that the variable $Y_k$ takes the value $y_k$ in
$z_{R_{1k}}$. The notation $\mathcal{X}_{R_{1k}}$ denotes the state space of all the variable nodes that are neighbours of function nodes in $R_{1k}$,
retaining the same indices as the full set of variables.
Crucially, note that if variable $Y_k$ is a leaf node in the graph, then $\tilde\nu_{R_{1k}} \equiv 1$.

The expression for $\nu_{R_1}$ has the same form as the update rule given for $\mu_{\gamma_j \to X}$ in Equation (20.4).
In other words, if $\tilde\nu_{R_{1k}}(y_k) = \mu_{Y_k \to \gamma_1^0}(y_k)$ for each $k$, then the result is proved. The argument continues
in this way until the leaf nodes of the factor graph are reached.

Step 3 There are two cases. If the leaf node is a function node (as in Step 1, going from a variable
to functions), then (clearly from the graph) this is a function ($h$, say) of a single variable ($Y$, say) and,
from (20.4),

\[ \nu(y) = h(y) = \mu_{h \to Y}(y). \]

If the leaf node is a variable node $X$ (as in Step 2, going from functions to variables), then the leaf
variable is adjacent to a single function $h$ (or else it is not a leaf), which has neighbours $(Y_1, \ldots, Y_m, X)$,
say. Then

\[ \tilde\nu(x) = 1 = \mu_{X \to h}(x), \]

since if $X$ is a leaf, then $h$ is the only neighbour of $X$ and hence $\mu_{X \to h}(x) \equiv 1$ from (20.3).
By tracing backward from the leaf nodes, it is now clear, by induction, that

\[ \sum_{y \in \mathcal{X}_{\tilde V \setminus \{i\}}} \phi(y, x_i) = \prod_{j=1}^{n} \mu_{\gamma_j^0 \to X_i}(x_i), \]

where $(\gamma_1^0, \ldots, \gamma_n^0)$ are the neighbours of node $X_i$.

Termination Consider the termination formula

\[ \mu_X(x) = \prod_{\gamma_j \in N_X} \mu_{\gamma_j \to X}(x), \]

together with the formula for the message from a variable node to a function node:

\[ \mu_{X \to \gamma_j}(x) = \prod_{h \in N_X \setminus \{\gamma_j\}} \mu_{h \to X}(x). \]

Suppose the factor graph is a tree. Then, since any variable to function message is the product of
all but one of the factors in the termination formula, it is clear that $\mu_X(x)$ may be computed as the
product of the two messages that were passed in opposite directions, a) from the variable $X$ to one of
the functions and b) from the function to the variable $X$.

20.3 The Sum Product Algorithm on General Graphs


The result of Wiberg shows that the sum product algorithm gives the correct answer after a finite
schedule when the factor graph is a tree. Unfortunately, even in relatively simple examples
(Example 20.3), the factor graph is not a tree. The problem of finding conditions under which a propagation
scheme converges to the right answer has been considered in [95]. In general, there are two major
obstacles.

1. If the sum-product algorithm converges, it is not clear whether the convergence is to the required
marginal.

2. The sum-product algorithm does not always converge.

If the factors are all strictly positive, a fixed point exists [149]. This does not imply convergence towards
the fixed point and there is no guarantee that the fixed point is stable.
Mooij and Kappen [95] give sufficient conditions under which the mapping has a fixed point and
there is convergence to the fixed point.

20.4 Stochastic Probability Updates


This section considers the article by Noorshams and Wainwright [101]. Some simplification of the
message passing algorithm can be made if it is assumed that the function $\phi$ over a domain $\mathcal{X} = \times_{j=1}^{d} \mathcal{X}_j$
is of the form:

\[ \phi = \prod_{j=1}^{d} \psi_j \prod_{\langle j, k \rangle \in U} \psi_{jk} \tag{20.6} \]

where the domain of $\psi_j$ is $\mathcal{X}_j$ and the domain of $\psi_{jk}$ is $\mathcal{X}_j \times \mathcal{X}_k$; $U \subseteq \{\langle j, k \rangle : 1 \le j < k \le d\}$. The charge
here is:

\[ \Phi = \{\psi_j : j \in \tilde V; \; \psi_{jk} : \langle j, k \rangle \in U\}. \]

In this case, the function nodes corresponding to the $\psi_j$ are leaf nodes, while each function node $\psi_{jk}$
only receives a message from one neighbour before passing a message on to a variable node. Therefore,
the message passed on from the function node is computed from the single message received by the function
node; no multiplication of incoming messages is required.
For functions that factorise according to Equation (20.6), it follows that only variable to variable
messages need be considered; messages are propagated along the edges of the undirected graph
$G = (\tilde V, U)$.
Let $M_{uv}$ denote the message transmitted along the edge $\langle u, v \rangle$ in the direction $u \mapsto v$. The message
passing algorithm discussed so far, in this setting, may be expressed as:

\[ \begin{cases} M^{0}_{uv} \equiv 1 \\[4pt] M^{t+1}_{uv}(x_v) = \sum_{y \in \mathcal{X}_u} \psi_u(y)\,\psi_{uv}(y, x_v) \prod_{j \in N(u) \setminus \{v\}} M^{t}_{ju}(y) \end{cases} \]

where $N(u)$ denotes the neighbours of node $u$ in graph $G$. If the factor graph is a tree, the messages
are sent into a root, then propagated back out to the leaves, resulting in exact marginalisations. If the
factor graph contains loops, then a suitable schedule is chosen and the updates are iterated.
Suppose that $M^{t}_{uv} \to M^{*}_{uv}$ as $t \to +\infty$. The termination is:

\[ \mathbb{P}(X_u = x_u^{(k)}) = \psi_u(x_u^{(k)}) \prod_{w \in N(u)} M^{*}_{wu}(x_u^{(k)}). \]
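The variable to variable recursion and its termination can be sketched on a small tree. Everything specific in the sketch is an assumption made for illustration: the four-node tree, the potentials `psi_node` and `psi_edge`, the synchronous update schedule, and the helper names. The termination is normalised to sum to one before it is compared with the marginal obtained by direct summation.

```python
from itertools import product

K = 3                                    # number of states per variable (assumption)
nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (1, 3)]         # a small tree
psi_node = {u: [1.0 + 0.5 * ((u + k) % K) for k in range(K)] for u in nodes}

def psi_edge(a, b):
    """Symmetric edge potential, the same on every edge (assumption)."""
    return 1.0 if a == b else 0.3

nbrs = {u: [] for u in nodes}
for u, v in edges:
    nbrs[u].append(v)
    nbrs[v].append(u)

# M[(u, v)][x_v] is the message from u to v, initialised to 1.
M = {(u, v): [1.0] * K for u in nodes for v in nbrs[u]}

def incoming(M, u, v, y):
    """Product of the messages into u from all neighbours except v, at state y."""
    p = 1.0
    for j in nbrs[u]:
        if j != v:
            p *= M[(j, u)][y]
    return p

for _ in range(len(nodes)):              # more sweeps than the diameter of the tree
    M = {
        (u, v): [
            sum(psi_node[u][y] * psi_edge(y, xv) * incoming(M, u, v, y) for y in range(K))
            for xv in range(K)
        ]
        for (u, v) in M
    }

def belief(u):
    """Termination at u, normalised to sum to one."""
    b = [psi_node[u][k] * incoming(M, u, None, k) for k in range(K)]
    z = sum(b)
    return [x / z for x in b]

def brute_marginal(u):
    """Normalised marginal at u by summing over every configuration."""
    b = [0.0] * K
    for x in product(range(K), repeat=len(nodes)):
        w = 1.0
        for n in nodes:
            w *= psi_node[n][x[n]]
        for a, c in edges:
            w *= psi_edge(x[a], x[c])
        b[x[u]] += w
    z = sum(b)
    return [val / z for val in b]

for u in nodes:
    print(u, belief(u), brute_marginal(u))
```

On a tree the two columns agree up to rounding; on a graph with loops the same loop would simply be iterated, with no guarantee of agreement.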

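The Stochastic Probability Updates of Noorshams and Wainwright [101] consider the situation where
the state space $\mathcal{X}_j = (x_j^{(1)}, \ldots, x_j^{(k_j)})$ for each variable is large. Therefore, not all elements of the state
space are updated at each iteration. The algorithm proceeds as follows:

Off-line Phase For the off-line phase, compute:

\[ \beta_{uv}(x_v^{(j)}) = \sum_{i=1}^{k_u} \psi_{uv}(x_u^{(i)}, x_v^{(j)})\,\psi_v(x_v^{(j)}), \qquad \tilde\Gamma_{uv}(\cdot, x_v^{(j)}) = \frac{\psi_{uv}(\cdot, x_v^{(j)})}{\sum_{i=1}^{k_u} \psi_{uv}(x_u^{(i)}, x_v^{(j)})}, \]

so that $\psi_{uv}\,\psi_v = \tilde\Gamma_{uv}\,\beta_{uv}$ and each column $\tilde\Gamma_{uv}(\cdot, x_v^{(j)})$ sums to one.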

Stochastic Update

1. Initialise the message vectors $M^{0}_{vu}(x_u^{(k)}) \equiv 1$.

2. (a) Compute the product of incoming messages

\[ \tilde M_{v/u}(x_v^{(j)}) = \prod_{w \in N(v) \setminus \{u\}} M^{t}_{wv}(x_v^{(j)}). \]

(b) Pick a random index $J^{t}_{vu}$ according to the probability distribution

\[ p^{t}_{vu}(x_v^{(j)}) \propto \tilde M_{v/u}(x_v^{(j)})\,\beta_{uv}(x_v^{(j)}), \qquad j \in \{1, \ldots, k_v\}. \]

(c) Update the message vector $M^{t+1}_{vu}$ with step-size $\lambda^{t} \in (0, 1)$ (the superscript is an index):

\[ M^{t+1}_{vu}(\cdot) = (1 - \lambda^{t})\,M^{t}_{vu}(\cdot) + \lambda^{t}\,\tilde\Gamma_{uv}(\cdot, x_v^{(J^{t}_{vu})}). \]

Now suppose that $k_j = K$ for some fixed $K \in \mathbb{N}$. The computational complexity of this algorithm is
$O(K)$ operations per edge per round.
The number $\lambda^{t}$ is chosen as $\lambda^{t} = \frac{1}{1+t}$.
It has to satisfy:

1. $\lambda^{t} \to 0$ as $t \to +\infty$,

2. $\sum_{t=1}^{\infty} \lambda^{t} = +\infty$, to ensure `infinite travel'.
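The stochastic update for a single directed edge can be sketched as follows. This is an illustration under explicit assumptions rather than the algorithm of [101] in full: the product of incoming messages is held fixed, the potentials are hypothetical values, and the columns of the matrix called `Gamma` are taken to be the columns of the edge potential normalised to sum to one. With the step size 1/(1+t) the message is then a running average of sampled columns, and it should approach the normalised deterministic update.

```python
import random

K = 4
gamma = 0.2
# psi_uv[i][j]: edge potential with x_u = i and x_v = j (Potts-like, an assumption).
psi_uv = [[1.0 if i == j else gamma for j in range(K)] for i in range(K)]
psi_v = [1.0, 2.0, 0.5, 1.5]             # hypothetical node potential at v
incoming = [1.0, 0.5, 2.0, 1.0]          # hypothetical product of messages into v, held fixed

# Off-line phase: column sums, beta and the column-normalised matrix Gamma.
col_sum = [sum(psi_uv[i][j] for i in range(K)) for j in range(K)]
beta = [psi_v[j] * col_sum[j] for j in range(K)]
Gamma = [[psi_uv[i][j] / col_sum[j] for j in range(K)] for i in range(K)]

# Sampling weights for the random index J (proportional to incoming * beta).
weights = [incoming[j] * beta[j] for j in range(K)]

rng = random.Random(0)                   # fixed seed so that the run is reproducible
M = [1.0] * K                            # initial message from v to u
for t in range(20000):
    lam = 1.0 / (1.0 + t)
    J = rng.choices(range(K), weights=weights)[0]
    M = [(1.0 - lam) * M[i] + lam * Gamma[i][J] for i in range(K)]

# Deterministic update for the same edge, normalised to sum to one.
bp = [sum(psi_uv[i][j] * psi_v[j] * incoming[j] for j in range(K)) for i in range(K)]
total = sum(bp)
bp = [b / total for b in bp]

print(M)
print(bp)
```

Each stochastic step touches one column of length K, whereas the deterministic update sums over all K values of the sending variable for each of the K values of the receiving variable.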

Application to Image Restoration This algorithm is presented in [101], where results on
convergence are established. It is applied to image processing and computer vision: a 200 × 200 image (40000
pixels), with K = 256 grey-scale levels.
The model is the Potts model: it is assumed that the state space for each variable is $\mathcal{X}_j = \{1, \ldots, K\}$
and

\[ \psi_{uv}(i, j) = \begin{cases} 1 & i = j \\ \gamma & i \neq j. \end{cases} \]

For the Potts model,

\[ \beta_{uv}(j) = \psi_v(j)\,(1 + (K - 1)\gamma), \qquad \Gamma_{uv}(i, j) = \begin{cases} \dfrac{1}{1 + (K - 1)\gamma} & i = j \\[8pt] \dfrac{\gamma}{1 + (K - 1)\gamma} & i \neq j. \end{cases} \]
For the application to image processing, the lattice is used; the edge set is

\[ U = \{\langle (x, y), (x + 1, y) \rangle : x = 1, \ldots, 199, \; y = 1, \ldots, 200; \quad \langle (x, y), (x, y + 1) \rangle : x = 1, \ldots, 200, \; y = 1, \ldots, 199\}. \]

The parameter in the Potts model is $\gamma = 0.05$. This is a smoothing parameter. A picture of the moon
is taken, which is then contaminated by adding i.i.d. $N(0, 0.1^2)$ variables to each pixel. The algorithm
is then run, where evidence is entered on the singleton potentials:

\[ \psi_j(x) \leftarrow \begin{cases} 1 & \text{intensity} = x \\ 0 & \text{otherwise.} \end{cases} \]

This is slightly different from the earlier discussion of the sum-product algorithm; the single variable
potentials $\psi_j$ represent the raw data; the edge potentials $\psi_{jk}$ represent smoothing.
The propagation algorithm is applied and the output is the most likely value for each pixel.
The experiments indicate that the Stochastic Probability Update gives good results.
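The Potts quantities and the lattice edge set used in this application are easy to tabulate. The sketch below only checks sizes and normalisation for the stated values K = 256, γ = 0.05 and a 200 × 200 lattice; the variable names are choices made here and no image data is involved.

```python
K, gamma = 256, 0.05

# For the Potts potential every column of psi_uv has the same sum.
col_sum = 1.0 + (K - 1) * gamma
Gamma_diag = 1.0 / col_sum              # entry Gamma_uv(i, j) for i == j
Gamma_off = gamma / col_sum             # entry Gamma_uv(i, j) for i != j

# Each column of Gamma has one diagonal entry and K - 1 off-diagonal entries.
column_total = Gamma_diag + (K - 1) * Gamma_off

n = 200
U = [((x, y), (x + 1, y)) for x in range(1, n) for y in range(1, n + 1)]
U += [((x, y), (x, y + 1)) for x in range(1, n + 1) for y in range(1, n)]

print(col_sum)        # 13.75
print(column_total)   # 1.0 up to rounding
print(len(U))         # 79600 undirected edges
```

With 79600 edges and K = 256, a full deterministic update costs of the order of K² operations per directed edge, which is the motivation for updating only a sampled state per edge in each round.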

Notes The sum product algorithm is due to N. Wiberg (1996) [145], and was developed further,
with applications to Bayesian networks, by F.R. Kschischang, B.J. Frey and H.-A. Loeliger (2001) [78]
and S.M. Aji and R.J. McEliece (2000) [1]. The stochastic update algorithm and its application to image
processing were introduced by N. Noorshams and M.J. Wainwright [101] (2013).
20.5 Exercise
Consider the directed acyclic graph below.

[Figure 20.8: Burglary, Earthquake and Radio; a DAG with edges B → A, E → A and E → R]

The variables are B - Burglary, A - Alarm, E - Earthquake and R - news broadcast.

These are random variables with the states (0 - no (false), 1 - yes (true)). The alarm is reliable for
detecting burglary, but also responds to minor earthquakes. Radio broadcasts tell about occurrences
of such earthquakes, but are not always correct. The conditional probability distributions for this
problem are given below.

PR∣E (r∣e):
            e = 0    e = 1
    r = 0   0.99     0.05
    r = 1   0.01     0.95

PA∣B,E (0∣b, e):
            b = 0    b = 1
    e = 0   0.97     0.05
    e = 1   0.05     0.02

PB (1) = 0.01, PE (1) = 0.999

Assume that the joint distribution PA,B,E,R factorises recursively according to the Bayesian network
shown in the figure. Using the sum-product algorithm, compute

1. the conditional probability PB∣A (1∣1)

2. the conditional probability PB∣A,R (1∣1, 1).
20.6 Answer
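The computation of $P_{B \mid A}(1 \mid 1)$ is given. The key point is that when hard evidence $A = 1$ is received,
this is accommodated by considering $\mathcal{X}_A = \{1\}$ and only considering $a = 1$. When this is done, the
termination at variable $B$ will give the function $P_{B,A}(\cdot, 1)$; this has to be normalised appropriately to
give the conditional probability.
The factor graph is given in Figure 20.9.

[Figure 20.9: Factor Graph, with variable nodes a, b, e, r and function nodes pB(b), pE(e), pA∣B,E(a, b, e) and pR∣E(e, r)]

Vectors below are indexed by the states $(0, 1)$.

\[ \mu_{P_B \to B} = P_B = (0.99, 0.01) \]

\[ \mu_{B \to P_{A \mid B,E}} = \mu_{P_B \to B} = (0.99, 0.01) \]

$A$ is observed to be 1, so

\[ \mu_{A \to P_{A \mid B,E}}(1) = 1 \qquad \forall (b, e). \]

Now $\mu_{P_{A \mid B,E} \to E}$ needs a marginalisation:

\begin{align*}
\mu_{P_{A \mid B,E} \to E}(e) &= \sum_{b} P_{A \mid B,E}(1 \mid b, e)\,\mu_{B \to P_{A \mid B,E}}(b)\,\mu_{A \to P_{A \mid B,E}}(1) \\
&= \sum_{b} P_{A \mid B,E}(1 \mid b, e)\,\mu_{B \to P_{A \mid B,E}}(b) \\
&= (0.03 \times 0.99 + 0.95 \times 0.01, \; 0.95 \times 0.99 + 0.98 \times 0.01) = (0.0392, 0.9503).
\end{align*}

\[ \mu_{P_E \to E} = P_E = (0.001, 0.999) \]

\[ \mu_{R \to P_{R \mid E}} = (1, 1) \]

\[ \mu_{P_{R \mid E} \to E} = \sum_{r} P_{R \mid E}(r \mid \cdot) = (1, 1) \]

All messages have been propagated to the root $E$.

The message $\mu_{E \to P_E}$ is not involved in the computation of $P_{B \mid A}$, so it is not computed.

\[ \mu_{E \to P_{A \mid B,E}} = \mu_{P_E \to E}\,\mu_{P_{R \mid E} \to E} = (0.001, 0.999) \]

The message $\mu_{P_{A \mid B,E} \to A}$ is not needed, so it is not computed. Neither are $\mu_{E \to P_{R \mid E}}$ and $\mu_{P_{R \mid E} \to R}$.

\begin{align*}
\mu_{P_{A \mid B,E} \to B}(b) &= \sum_{e} P_{A \mid B,E}(1 \mid b, e)\,\mu_{E \to P_{A \mid B,E}}(e)\,\mu_{A \to P_{A \mid B,E}}(1) \\
&= \sum_{e} P_{A \mid B,E}(1 \mid b, e)\,\mu_{E \to P_{A \mid B,E}}(e) \\
&= (0.03 \times 0.001 + 0.95 \times 0.999, \; 0.95 \times 0.001 + 0.98 \times 0.999) = (0.94908, 0.97997).
\end{align*}

The message $\mu_{B \to P_B}$ is not needed, because the variable of interest is $B$ and the termination at $B$ is the product of
the messages from the function nodes to the variable $B$.

Finally, with $\beta$ denoting the normalising constant,

\begin{align*}
(P_{B \mid A}(0 \mid 1), P_{B \mid A}(1 \mid 1)) &= \beta\,\big(\mu_{P_B \to B}(0)\,\mu_{P_{A \mid B,E} \to B}(0), \; \mu_{P_B \to B}(1)\,\mu_{P_{A \mid B,E} \to B}(1)\big) \\
&= \frac{1}{0.9493889}\,(0.9395892, \; 0.0097997) \approx (0.98968, \; 0.01032).
\end{align*}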
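The answer can be cross-checked by summing the joint distribution directly. The sketch below is a brute-force check, not the sum-product computation itself: the dictionaries encode the tables of the exercise with the prior on E as used in the message $\mu_{P_E \to E} = (0.001, 0.999)$, and the helper names are assumptions introduced here. It also recomputes the two function-to-variable messages from the answer.

```python
from itertools import product

P_B = {0: 0.99, 1: 0.01}
P_E = {0: 0.001, 1: 0.999}      # as used in the answer: mu_{P_E -> E} = (0.001, 0.999)
# P(A = 0 | b, e), keyed by (b, e), read from the table in the exercise.
P_A0 = {(0, 0): 0.97, (1, 0): 0.05, (0, 1): 0.05, (1, 1): 0.02}
# P(R = r | e), keyed by (r, e).
P_R = {(0, 0): 0.99, (1, 0): 0.01, (0, 1): 0.05, (1, 1): 0.95}

def joint(a, b, e, r):
    """Joint probability from the factorisation P_B P_E P_{A|B,E} P_{R|E}."""
    pa = P_A0[b, e] if a == 0 else 1.0 - P_A0[b, e]
    return P_B[b] * P_E[e] * pa * P_R[r, e]

def posterior_B(evidence):
    """P(B = . | evidence) by summing the joint over all configurations."""
    unnorm = {0: 0.0, 1: 0.0}
    for a, b, e, r in product([0, 1], repeat=4):
        config = {"A": a, "B": b, "E": e, "R": r}
        if all(config[k] == v for k, v in evidence.items()):
            unnorm[b] += joint(a, b, e, r)
    z = sum(unnorm.values())
    return {b: p / z for b, p in unnorm.items()}

# The two function-to-variable messages of the answer, recomputed.
msg_to_E = [sum((1.0 - P_A0[b, e]) * P_B[b] for b in (0, 1)) for e in (0, 1)]
msg_to_B = [sum((1.0 - P_A0[b, e]) * P_E[e] for e in (0, 1)) for b in (0, 1)]

print(msg_to_E)                         # approximately [0.0392, 0.9503]
print(msg_to_B)                         # approximately [0.94908, 0.97997]
print(posterior_B({"A": 1}))            # part 1
print(posterior_B({"A": 1, "R": 1}))    # part 2
```

The first posterior should coincide with the normalised product of $\mu_{P_B \to B}$ and $\mu_{P_{A \mid B,E} \to B}$; the second call gives a reference value for part 2 of the exercise.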
Chapter 21

Graphical Models and Exponential Families

This chapter deals with multivariate distributions which fall within the framework of exponential families.
The dependence structure is expressed as a graphical model. For an exponential family of full rank,
there is a one-to-one mapping between canonical parameters and mean field parameters. We discuss conjugate
duality and the Fenchel-Legendre transform between the log-partition function $A(\theta)$, $\theta \in \Theta$ (the
canonical parameter space), and $A^{*}(\mu)$, $\mu \in \mathcal{M}$, where $\mathcal{M}$ is the mean-value parameter space and $\mu$
denotes the mean value vector of the sufficient statistic vector. The Kullback-Leibler divergence has a
particularly convenient form for exponential families; we discuss the primal, dual and mixed forms in
terms of the canonical and mean value parametrisations. We consider mean field approximations, to
obtain a mean field lower bound for $A(\theta)$.

21.1 Introduction to Exponential Families


The notations are as before. Let $V = \{X_1, \ldots, X_d\}$ denote the random variables. For $j = 1, \ldots, d$, $\mathcal{X}_j$
will denote the state space for variable $X_j$. If $X_j$ is continuous, then $\mathcal{X}_j \subseteq \mathbb{R}$ (the real numbers). If $X_j$ is
discrete, then $\mathcal{X}_j = \{x_j^{(1)}, \ldots, x_j^{(k_j)}\}$, where $k_j$ is possibly $+\infty$. As usual, the notation $X = (X_1, \ldots, X_d)$
denotes the row vector of variates. An instantiation of $X$ will be denoted $x \in \mathcal{X}_1 \times \cdots \times \mathcal{X}_d \equiv \mathcal{X}$ (when no
subscript is employed, $\mathcal{X}$ denotes the product space, which is the state space of the row vector $X$).

An exponential family is a family of probability distributions satisfying certain properties, listed in
Definition 21.1 below. For the purposes of Bayesian Networks, the emphasis is on discrete variables
and Gaussian variables.

Definition 21.1 (Exponential Family). An exponential family is a family of probability distributions
$\{P_\theta : \theta \in \Theta\}$, where $\Theta$ is a parameter space. These are defined by a probability mass function $P_X(\cdot \mid \theta)$ if
$X$ are discrete variables, or a probability density function $\pi_X(\cdot \mid \theta)$ for continuous variables, indexed by
a parameter set $\Theta \subseteq \mathbb{R}^p$ (where $p$ is possibly infinite), where there is a function $\Phi : \mathcal{X} \to \mathbb{R}^p$, a function
$A : \Theta \to \mathbb{R}$ and a function $h : \mathcal{X} \to \mathbb{R}$ such that

\[ P_X(x \mid \theta) = \exp\{\langle \theta, \Phi(x) \rangle - A(\theta)\}\,h(x) \]

if $X$ is a discrete random vector and

\[ \pi_X(x \mid \theta) = \exp\{\langle \theta, \Phi(x) \rangle - A(\theta)\}\,h(x) \]

if $X$ is a continuous random vector.


It is convenient to use the notation $I$ to denote the indexing set for the parameters; $\theta = (\theta_\alpha)_{\alpha \in I}$.
Then $\Phi$ denotes a collection of functions $\Phi = (\phi_\alpha)_{\alpha \in I}$, where $\phi_\alpha : \mathcal{X} \to \mathbb{R}$. The inner product notation
is defined as

\[ \langle \theta, \Phi(x) \rangle = \sum_{\alpha \in I} \theta_\alpha \phi_\alpha(x). \]

The parameters in the vector $\theta$ are known as the canonical parameters or exponential parameters.

Attention will be restricted to distributions where $|I| = p < +\infty$; namely, $I$ has a finite number, $p$, of
elements.
Since $\sum_{x \in \mathcal{X}} P_X(x \mid \theta) = 1$ for discrete variables and $\int_{\mathcal{X}} \pi_X(x \mid \theta)\,dx = 1$ for continuous variables, it follows
that the quantity $A$, known as the log partition function, is given by the expression

\[ A(\theta) = \log \int_{\mathcal{X}} \exp\{\langle \theta, \Phi(x) \rangle\}\,h(x)\,dx \]

for continuous variables and

\[ A(\theta) = \log \sum_{x \in \mathcal{X}} \exp\{\langle \theta, \Phi(x) \rangle\}\,h(x) \]

for discrete variables. It is assumed that $h$, $\theta$ and $\Phi$ satisfy appropriate conditions so that $A$ is finite.

Set

\[ P(x; \theta) = \frac{P_X(x \mid \theta)}{h(x)}. \tag{21.1} \]

With the set of functions $\Phi$ fixed, each parameter vector $\theta$ indexes a particular probability function
$P_X(\cdot \mid \theta)$ belonging to the family. The exponential parameters of interest belong to the parameter space,
which is the set

\[ \Theta = \{\theta \in \mathbb{R}^p \mid A(\theta) < +\infty\}. \tag{21.2} \]

It will be seen shortly that $A$ is a convex function of $\theta$.

Definition 21.2 (Regular Families). An exponential family for which the domain $\Theta$ of Equation (21.2)
is an open set is known as a regular family.

Attention will be restricted to regular families.

Definition 21.3 (Minimal Representation). An exponential family defined using a collection of
functions $\Phi$ for which there is no linear combination $\langle a, \Phi(x) \rangle = \sum_{\alpha \in I} a_\alpha \phi_\alpha(x)$ equal to a constant is known
as a minimal representation.

For a minimal representation, there is a unique parameter vector $\theta$ associated with each distribution.

Definition 21.4 (Over-complete). An over-complete representation is a representation that is not
minimal; there is a linear combination of the elements of $\Phi$ which yields a constant.

When the representation is over-complete, there exists an affine subset of parameter vectors $\theta$, each
associated with the same distribution.

Recall the definition of sufficiency, given in Definition 12.12. The following lemma is crucial. Its proof
is left as an exercise.

Lemma 21.5. Let $X = (X_1, \ldots, X_d)$ be a random vector with joint probability function

\[ P_X(x \mid \theta) = \exp\{\langle \theta, \Phi(x) \rangle - A(\theta)\}\,h(x), \qquad x \in \mathcal{X}; \]

then $\Phi(X)$, which will be denoted $\Phi$, is a sufficient statistic for $\theta$. If the representation is minimal,
then $\Phi(X)$ is a minimal sufficient statistic for $\theta$.

Proof Exercise 1 page 436.

21.2 Standard Examples of Exponential Families


The purpose of this section is to take some basic distributions, which are well known, and illustrate
that they satisfy the definition of an exponential family.

Bernoulli Consider the random variable $X$, taking values 0 or 1, with probability function $P_X(1) = p$,
$P_X(0) = 1 - p$. This may be written as

\[ P_X(x) = \begin{cases} p^{x}(1 - p)^{1 - x} & x \in \{0, 1\} \\ 0 & \text{other } x. \end{cases} \]

Then, for $x \in \{0, 1\}$,

\begin{align*}
P_X(x) &= \exp\left\{x \log\left(\frac{p}{1 - p}\right) + \log(1 - p)\right\} \\
&= \exp\{x\theta + \log(1 - p)\} \\
&= \exp\{x\theta - \log(1 + e^{\theta})\},
\end{align*}

where $\theta = \log\left(\frac{p}{1 - p}\right)$.

Notation Here, the quantity $\theta$ denotes the canonical parameter.

In the language of exponential families, $\mathcal{X} = \{0, 1\}$, $\Phi = \{\phi\}$ where $\phi(x) = x$, $h(0) = h(1) = 1$,

\[ P_X(0 \mid \theta) = e^{-A(\theta)}, \qquad P_X(1 \mid \theta) = e^{\theta - A(\theta)}. \]

In other words

\[ \log P_X(x \mid \theta) = \theta x - A(\theta), \]

which gives

\[ 1 = P_X(0 \mid \theta) + P_X(1 \mid \theta) = e^{-A(\theta)}(1 + e^{\theta}) \]

so that

\[ A(\theta) = \log(1 + \exp\{\theta\}). \]

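Gaussian Recall that the one dimensional Gaussian density is of the form

\[ \pi(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\}. \]

This may be expressed in terms of an exponential family as follows: $\mathcal{X} = \mathbb{R}$, $h(x) = 1$, $\Phi = \{\phi_1, \phi_2\}$
where $\phi_1(x) = x$ and $\phi_2(x) = -x^2$, so that

\[ \log \pi(x \mid \theta) = \theta_1 x - \theta_2 x^2 - A(\theta) \]

where

\[ 1 = e^{-A(\theta)} \int_{-\infty}^{\infty} e^{\theta_1 x - \theta_2 x^2}\,dx. \]

The log partition function is therefore

\[ A(\theta) = \frac{1}{2} \log \pi - \frac{1}{2} \log \theta_2 + \frac{\theta_1^2}{4\theta_2} \]

and the parameter space is

\[ \Theta = \{(\theta_1, \theta_2) \in \mathbb{R}^2 \mid \theta_2 > 0\}. \]

Expanding the square in the exponent of the density gives $-\frac{x^2}{2\sigma^2} + \frac{\mu x}{\sigma^2} - \frac{\mu^2}{2\sigma^2}$, so that in the `usual' notation

\[ \theta_1 = \frac{\mu}{\sigma^2}, \qquad \theta_2 = \frac{1}{2\sigma^2}. \]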
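The closed form for the Gaussian log partition function can be checked numerically. The following sketch is an illustration only: the values of µ and σ, the integration range and the trapezoidal rule are choices made here, and the canonical parameters are taken as θ1 = µ/σ² and θ2 = 1/(2σ²), the values for which the density above matches the exponential family form.

```python
import math

def log_partition_closed(theta1, theta2):
    """A(theta) = (1/2) log(pi) - (1/2) log(theta2) + theta1^2 / (4 theta2)."""
    return 0.5 * math.log(math.pi) - 0.5 * math.log(theta2) + theta1 ** 2 / (4.0 * theta2)

def log_partition_numeric(theta1, theta2, lo=-40.0, hi=40.0, n=40000):
    """Trapezoidal approximation of log of the integral of exp(theta1 x - theta2 x^2)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * math.exp(theta1 * x - theta2 * x * x)
    return math.log(total * h)

mu, sigma = 1.5, 0.8                      # arbitrary illustrative values
theta1, theta2 = mu / sigma ** 2, 1.0 / (2.0 * sigma ** 2)

closed = log_partition_closed(theta1, theta2)
numeric = log_partition_numeric(theta1, theta2)
# In the usual notation, A equals log(sqrt(2 pi) sigma) + mu^2 / (2 sigma^2).
usual = math.log(math.sqrt(2.0 * math.pi) * sigma) + mu ** 2 / (2.0 * sigma ** 2)

print(closed, numeric, usual)
```

The three printed values agree to high accuracy, which confirms both the formula for A(θ) and the correspondence between (θ1, θ2) and (µ, σ).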
Exponential Recall that an Exponential density is of the form

\[ \pi(x \mid \lambda) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & x < 0. \end{cases} \]

This is an exponential family, taking $\mathcal{X} = (0, +\infty)$, $h \equiv 1$, $\Phi = \{\phi\}$, where $\phi(x) = -x$, and $\theta = \lambda$, so that
$e^{-A(\theta)} = \theta$, yielding $A(\theta) = -\log \theta$, $\Theta = (0, +\infty)$.

Poisson Recall that the probability function for a Poisson distribution with parameter $\mu$ is given
by

\[ \mathbb{P}(x \mid \mu) = \frac{\mu^{x}}{x!}\,e^{-\mu}, \qquad x = 0, 1, 2, \ldots \]

This is an exponential family with $h(x) = \frac{1}{x!}$ and $\theta = \log \mu$, so that $\mathbb{P}(x \mid \mu) = P(x; \theta)\,h(x)$, where

\[ P(x; \theta) = e^{x\theta - e^{\theta}}. \]

This gives $A(\theta) = \exp\{\theta\}$. Since $\mu > 0$ and $\theta = \log \mu$, it follows that $\Theta = \mathbb{R}$.

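Beta Recall that the probability density function for a Beta distribution is given by

\[ \pi(x \mid \alpha, \beta) = \begin{cases} \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha - 1}(1 - x)^{\beta - 1} & x \in [0, 1] \\[8pt] 0 & \text{other } x. \end{cases} \]

This is an exponential family, with $\mathcal{X} = (0, 1)$, $h \equiv 1$, $\alpha - 1 = \theta_1$, $\beta - 1 = \theta_2$, $\Phi = \{\phi_1, \phi_2\}$ where
$\phi_1(x) = \log x$, $\phi_2(x) = \log(1 - x)$. Then

\[ \log \pi(x \mid \theta) = \theta_1 \log x + \theta_2 \log(1 - x) - A(\theta), \]

where the log partition function $A$ is given by

\[ A(\theta) = \log \Gamma(\theta_1 + 1) + \log \Gamma(\theta_2 + 1) - \log \Gamma(\theta_1 + \theta_2 + 2) \]

and the parameter space is $\Theta = (-1, \infty)^2$.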
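The log partition functions for the Poisson and Beta families can be checked in the same spirit. This is a rough numerical sketch under assumptions chosen here: the Poisson sum is truncated after 200 terms, the Beta integral is approximated by a midpoint rule, and the parameter values are arbitrary.

```python
import math

def poisson_log_partition(theta, terms=200):
    """log of sum over x of exp(x * theta) * h(x), with h(x) = 1 / x! (truncated sum)."""
    return math.log(sum(math.exp(x * theta - math.lgamma(x + 1)) for x in range(terms)))

def beta_log_partition(theta1, theta2, n=200000):
    """Midpoint-rule approximation of log of the integral of x^theta1 (1-x)^theta2 on (0, 1)."""
    h = 1.0 / n
    s = sum(((i + 0.5) * h) ** theta1 * (1.0 - (i + 0.5) * h) ** theta2 for i in range(n))
    return math.log(s * h)

theta = math.log(3.0)                     # Poisson with mean mu = 3
print(poisson_log_partition(theta), math.exp(theta))

t1, t2 = 1.5, 3.0                         # Beta with alpha = 2.5, beta = 4
closed = math.lgamma(t1 + 1) + math.lgamma(t2 + 1) - math.lgamma(t1 + t2 + 2)
print(beta_log_partition(t1, t2), closed)
```

In both cases the numerical value should agree with the closed form, exp(θ) for the Poisson family and the combination of log Gamma functions for the Beta family.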

21.3 Graphical Models and Exponential Families


The scalar examples described in section 21.2 serve as building blocks for the construction of exponential
families which have an underlying graphical structure.

Example 21.6 (Sigmoid Belief Network Model).

The sigmoid belief network model, described below, was introduced by R. Neal (1992) [98]. It is an
exponential family, with an underlying graphical structure.
Consider a directed acyclic graph $G = (V, D)$, where $V = \{X_1, \ldots, X_d\}$ is the set of variables, along
which the probability distribution of $X = (X_1, \ldots, X_d)$ may be factorised. Suppose that for each
$X_j \in V$, $j = 1, \ldots, d$, the random variable $X_j$ takes values 0 or 1, each with probability 1/2. For any
two components $X_s$ and $X_t$ of the random vector $X$, component $X_s$ has a direct causal effect on $X_t$
only if $(X_s, X_t) \in D$.

The notation will be simplified in the following way: $V$ and $D$ will be used to denote the sets of nodes
(variables) and directed edges respectively; the same notation will also be used to denote the indexing
sets of nodes and directed edges. In other words the notations

\[ V = \{1, \ldots, d\} \quad \text{and} \quad D = \{(s, t) \mid (X_s, X_t) \in D\} \]

will also be used. The meaning will be clear from the context. The probability distribution over the
possible configurations is modelled by an exponential family with probability function $P_X(\cdot \mid \theta)$ of the
form

\[ P_X(x \mid \theta) = \exp\left\{\sum_{s=1}^{d} \theta_s x_s + \sum_{(s,t) \in D} \theta_{(s,t)}\,x_s x_t - A(\theta)\right\}. \]

With $\mathrm{Pa}_i$ denoting the parent set of node $X_i$ and $\pi_i(x)$ denoting the instantiation of $\mathrm{Pa}_i$
corresponding to the instantiation $\{X = x\}$, this may be rewritten as

\[ P_X(x \mid \theta) = \prod_{i=1}^{d} P_{X_i \mid \mathrm{Pa}_i}(x_i \mid \pi_i(x), \theta), \]

where

\[ P_{X_i \mid \mathrm{Pa}_i}(x_i \mid \pi_i(x), \theta) = \frac{\exp\left\{x_i\left(\theta_i + \sum_{x_j \in \pi_i(x)} \theta_{(ij)}\,x_j\right)\right\}}{1 + \exp\left\{\theta_i + \sum_{x_j \in \pi_i(x)} \theta_{(ij)}\,x_j\right\}}, \]

where the sum over $x_j \in \pi_i(x)$ runs over the values taken by the parents of $X_i$. The index set is $I = V \cup D$. The domain is $\Theta = \mathbb{R}^n$, where $n = |I|$.
Since the sum that defines $A(\theta)$ is finite for all $\theta \in \mathbb{R}^n$, it follows that the family is regular. It is
minimal, since there is no linear combination of the functions equal to a constant.
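The sigmoid form of the conditional probabilities can be sketched directly. The three-node DAG and the numerical parameter values below are hypothetical, chosen only to illustrate the formula; the dictionary `theta_edge` is keyed by (parent, child).

```python
import math
from itertools import product

parents = {1: [], 2: [1], 3: [1, 2]}                   # hypothetical DAG on three binary nodes
theta_node = {1: 0.2, 2: -0.5, 3: 1.0}                 # hypothetical node parameters
theta_edge = {(1, 2): 1.5, (1, 3): -0.7, (2, 3): 0.4}  # hypothetical edge parameters

def conditional(i, xi, x):
    """P(X_i = xi | parents of X_i as instantiated in x), in the sigmoid form."""
    z = theta_node[i] + sum(theta_edge[(j, i)] * x[j] for j in parents[i])
    return math.exp(xi * z) / (1.0 + math.exp(z))

def joint(x):
    """Product of the conditional probabilities along the DAG."""
    p = 1.0
    for i in parents:
        p *= conditional(i, x[i], x)
    return p

total = sum(joint(dict(zip((1, 2, 3), xs))) for xs in product((0, 1), repeat=3))
print(total)                                           # 1.0 up to rounding
print(conditional(3, 1, {1: 1, 2: 0, 3: 1}))           # logistic function of 1.0 - 0.7
```

Each conditional is a logistic function of a linear combination of the parent values, so the two values of each conditional sum to one and the product over the nodes is a probability distribution.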

This model may be generalised. For example, one may consider higher order interactions. To include
coupling of triples $(X_s, X_t, X_u)$, one would add a monomial $x_s x_t x_u$ with corresponding exponential
parameter $\theta_{(s,t,u)}$. More generally, the set $\mathcal{C}$ of indices of interacting variables may be considered,
giving

\[ P_X(x \mid \theta) = \exp\left\{\sum_{C \in \mathcal{C}} \theta_{(C)} \prod_{s \in C} x_s - A(\theta)\right\}. \]

Example 21.7 (Noisy `or' as an Exponential Family).

The QMR - DT (Quick Medical Reference - Decision Theoretic) database is a large scale probabilistic
data base that is intended to be used as a diagnostic aid in the domain of internal medicine. It is a
21.4. PROPERTIES OF THE LOG PARTITION FUNCTION 425

bipartite graphical model; that is, a graphical model where the nodes may be of one of two types. The
upper layer of nodes (the parents) represent diseases and the lower layer of nodes represent symptoms.
There are approximately 600 disease nodes and 4000 symptom nodes in the database.
An evidence, or nding will be a set of observed symptoms, denoted by a vector of length 4000,
each entry being a 1 or 0 depending upon whether or not the symptom is present or absent. This will
be denoted f , which is an instantiation of the random vector F . The vector d will be used to represents
the diseases; this is considered as an instantiation of the random vector D. Let dj denote component
j of vector d and let fj denote component j of vector f . Then, if the occurrence of various diseases
are taken to be independent of each other, the following factorisation holds:

PF ,D (f , d) = PF ∣D (f ∣d)PD (d) = ∏ PFi ∣D (fi ∣d) ∏ PDj (dj ).


i j

This may be represented by noisy `or' model. Let qi0 denote the probability that symptom i is present
in the absence of any disease and qij the probability that disease j induces symptom i, then the
probability that symptom i is absent, given a vector of diseases d is

\[ P_{F_i \mid D}(0 \mid d) = (1 - q_{i0}) \prod_j (1 - q_{ij})^{d_j}. \]

The noisy `or' may then be rewritten in an exponential form:

\[ P_{F_i \mid D}(0 \mid d) = \exp\Big\{ -\sum_j \theta_{ij} d_j - \theta_{i0} \Big\}, \]

where θij ≡ − log(1 − qij ) (including j = 0) are the transformed parameters.
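The equivalence of the product form and the exponential form can be checked numerically. The following is a minimal sketch, assuming the convention θ = − log(1 − q), so that the exponent is non-positive; the values of q0, q and d are hypothetical and chosen only for illustration.

```python
import math

def p_absent_product(q0, q, d):
    """P(F_i = 0 | d) from the product form (1 - q0) * prod_j (1 - q_j)^{d_j}."""
    prob = 1.0 - q0
    for qj, dj in zip(q, d):
        prob *= (1.0 - qj) ** dj
    return prob

def p_absent_exponential(q0, q, d):
    """Same probability via theta = -log(1 - q) and exp{-sum_j theta_j d_j - theta_0}."""
    theta0 = -math.log(1.0 - q0)
    theta = [-math.log(1.0 - qj) for qj in q]
    return math.exp(-sum(t * dj for t, dj in zip(theta, d)) - theta0)

# Hypothetical numbers: a leak probability and three candidate diseases.
q0 = 0.05            # symptom present with no disease
q = [0.8, 0.3, 0.6]  # q_ij for diseases j = 1, 2, 3
d = [1, 0, 1]        # diseases 1 and 3 are present

product_form = p_absent_product(q0, q, d)
exponential_form = p_absent_exponential(q0, q, d)
```

Both functions return the same value (up to rounding), since each factor (1 − q)^d equals exp{−θ d}.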

21.4 Properties of the log Partition Function


Firstly, some basic properties of the log partition function A(θ) are discussed, which are then developed
using convex analysis, discussed in [3]. Let Eθ [.] denote expectation with respect to p(.∣θ) for discrete
variables, or π(.∣θ) for continuous variables. Of particular importance is the idea that the vector
µ, where µi ∶= Eθ [ϕi (X)], provides an alternative parametrisation of the exponential family. Here
expectation is defined as

\[ E_\theta[f(X)] = \int_{\mathcal{X}} \pi_X(x \mid \theta) f(x)\, dx \]
if X is a continuous random vector and

\[ E_\theta[f(X)] = \sum_{x \in \mathcal{X}} P_X(x \mid \theta) f(x) \]

if X is a discrete random vector. Recall that, for discrete variables,

\[ A(\theta) = \log \sum_{x \in \mathcal{X}} e^{\langle \theta, \Phi(x) \rangle} h(x). \tag{21.3} \]

Provided expectations and variances exist, it follows that


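\[ \frac{\partial}{\partial \theta_\alpha} A(\theta) = \sum_{x \in \mathcal{X}} e^{\langle \theta, \Phi(x) \rangle - A(\theta)} \phi_\alpha(x) h(x) = E_\theta[\phi_\alpha(X)]. \tag{21.4} \]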

Taking second derivatives yields


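\[ \frac{\partial^2}{\partial \theta_\alpha \partial \theta_\beta} A(\theta) = E_\theta[\phi_\alpha(X)\phi_\beta(X)] - E_\theta[\phi_\alpha(X)]\, E_\theta[\phi_\beta(X)] = \mathrm{Cov}_\theta(\phi_\alpha(X), \phi_\beta(X)). \]

It is a standard fact, and easy to show, that any covariance matrix is non-negative definite. It
now follows that, on Θ, A is a convex function.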
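The identity ∇A(θ) = Eθ [Φ(X)] can be illustrated on a model small enough to enumerate. The sketch below assumes a hypothetical two-variable binary model with Φ(x) = (x1 , x2 , x1 x2 ) and h ≡ 1, and compares a central finite-difference gradient of A with the mean parameters computed by direct summation; the parameter values are arbitrary.

```python
import itertools
import math

STATES = list(itertools.product([0, 1], repeat=2))

def phi(x):
    """Sufficient statistics (x1, x2, x1*x2) of the illustrative model."""
    return (x[0], x[1], x[0] * x[1])

def inner(theta, x):
    return sum(t * f for t, f in zip(theta, phi(x)))

def log_partition(theta):
    """A(theta) = log sum_x exp<theta, Phi(x)>, by enumeration."""
    return math.log(sum(math.exp(inner(theta, x)) for x in STATES))

def mean_parameters(theta):
    """mu = E_theta[Phi(X)], by enumeration."""
    a = log_partition(theta)
    mu = [0.0, 0.0, 0.0]
    for x in STATES:
        p = math.exp(inner(theta, x) - a)
        for k, f in enumerate(phi(x)):
            mu[k] += p * f
    return mu

def numerical_gradient(theta, eps=1e-6):
    """Central finite-difference approximation to the gradient of A."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        dn = list(theta)
        up[k] += eps
        dn[k] -= eps
        grad.append((log_partition(up) - log_partition(dn)) / (2 * eps))
    return grad

theta = [0.3, -0.7, 1.2]  # arbitrary illustrative parameters
mu = mean_parameters(theta)
grad = numerical_gradient(theta)
```

The two vectors agree to within the finite-difference error, which is the content of Equation (21.4).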

Mapping to Mean Parameters Given a vector of functions Φ, set F (θ) = Eθ [Φ(X)] and let
M = F (Θ). For an arbitrary exponential family defined by

PX (x∣θ) = exp {⟨θ, Φ(x)⟩ − A(θ)} h(x),

a mapping Λ ∶ Θ → M may be defined as follows:

Λ(θ) ∶= Eθ [Φ(X)].

To each θ ∈ Θ, the mapping Λ associates a vector of mean parameters µ = Λ(θ) belonging to the set
M. Note that, by Equation (21.4),

Λ(θ) = ∇A(θ).

The mapping Λ is one to one, and hence invertible on its image, when the representation is minimal.
The image of Θ is the interior of M.

Example 21.8 (Bernoulli Trial).

Consider a Bernoulli random variable X with state space {0, 1}. That is, pX (0) = 1 − p and pX (1) = p.
Now consider an over-complete exponential representation

PX (x∣θ) = exp {θ0 (1 − x) + θ1 x − A(θ0 , θ1 )}

so that

A(θ0 , θ1 ) = log (eθ0 + eθ1 ) .

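Here Θ = R2 , ϕ0 (x) = 1 − x and ϕ1 (x) = x, so that

\[ \frac{\partial}{\partial \theta_0} A(\theta) = e^{\theta_0 - A(\theta_0, \theta_1)} = 1 - p = \mu_0, \qquad \frac{\partial}{\partial \theta_1} A(\theta) = e^{\theta_1 - A(\theta_0, \theta_1)} = p = \mu_1. \]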

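The set M of mean parameters is the simplex {(µ0 , µ1 ) ∈ R+ × R+ ∣ µ0 + µ1 = 1}. For any fixed µ = (µ0 , µ1 )
where µ0 ≥ 0, µ1 ≥ 0, µ0 + µ1 = 1, the inverse image is

\[ \Lambda^{-1}(\mu) = \Big\{ (\theta_0, \theta_1) \in \mathbb{R}^2 \;\Big|\; \frac{e^{\theta_0}}{e^{\theta_0} + e^{\theta_1}} = \mu_0 \Big\}, \]

which may be rewritten as

\[ \Lambda^{-1}(\mu) = \Big\{ (\theta_0, \theta_1) \in \mathbb{R}^2 \;\Big|\; \theta_1 - \theta_0 = \log \frac{\mu_1}{\mu_0} \Big\}. \]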
In an over-parametrised, or over-complete representation, there is no longer a bijection between Θ and
Λ(Θ). Instead, there is a bijection between elements of Λ(Θ) and affine subsets of Θ. A pair (θ, µ)
is said to be dually coupled if µ = Λ(θ), and hence θ ∈ Λ−1 (µ).

21.5 Fenchel Legendre Conjugate


The Fenchel Legendre conjugate of the log partition function A is defined as follows:

\[ A^*(\mu) := \sup_{\theta \in \Theta} \{ \langle \mu, \theta \rangle - A(\theta) \}. \tag{21.5} \]

The choice of notation is deliberately suggestive; the variables in the Fenchel Legendre dual turn out
to have interpretation as the mean parameters. Recall the definition of P given by Equation (21.1);
namely, if PX (x∣θ) is the probability function (or density function), then

\[ P(x; \theta) = \frac{P_X(x \mid \theta)}{h(x)}. \]

Definition 21.9 (Boltzmann - Shannon Entropy). The Boltzmann - Shannon entropy of PX (x∣θ) with
respect to h is defined as

\[ H(P_X(x \mid \theta)) = -E_\theta[\log P(X; \theta)]. \]

The following is the main result of the chapter.

Theorem 21.10. For any µ ∈ M, let θ(µ) ∈ Λ−1 (µ). Then

A∗ (µ) = −H(PX (x∣θ(µ))).

In terms of this dual, for θ ∈ Θ, the log partition function may be expressed as:

\[ A(\theta) = \sup_{\mu \in \mathcal{M}} \{ \langle \theta, \mu \rangle - A^*(\mu) \}. \tag{21.6} \]

Proof of Theorem 21.10 From the definition µ = Eθ [Φ(X)], it follows that

−H(PX (x∣θ)) = Eθ [log P (X; θ)] = Eθ [⟨θ, Φ(X)⟩] − A(θ) = ⟨θ, µ⟩ − A(θ). (21.7)

Consider the function

F (µ, θ) = ⟨µ, θ⟩ − A(θ).

Let θ(µ) denote a value of θ that maximises F (µ, θ), if such a value exists in Θ; in that case the first
statement of the theorem follows directly from the definition given by Equation (21.5) together with
Equation (21.7). Otherwise, let θ(n) (µ) denote a sequence such that limn→+∞ F (µ, θ(n) (µ)) = A∗ (µ); the
first statement of the theorem then follows by passing to the limit.

For the second part, choose θ ∈ Θ and set µ(θ) = ∇θ A(θ). By definition of M, note that µ(θ) ∈ M.
Since A is convex, θ maximises θ′ ↦ ⟨µ(θ), θ′ ⟩ − A(θ′ ), so that

A(θ) = ⟨µ(θ), θ⟩ − A∗ (µ(θ)).

But, from the definition of A∗ (µ), it follows that for all µ ∈ M,

A(θ) ≥ ⟨µ, θ⟩ − A∗ (µ).

From this,

\[ A(\theta) = \sup_{\mu \in \mathcal{M}} \{ \langle \mu, \theta \rangle - A^*(\mu) \} \]

and Theorem 21.10 is established.

Examples The conjugate dual pair (A, A∗ ) is now computed for several examples of exponential
families.

Bernoulli Recall that A(θ) = log(1 + exp{θ}) for θ ∈ R. It follows that

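\[ A^*(\mu) = \sup_{\theta \in \mathbb{R}} \{ \theta \mu - \log(1 + e^{\theta}) \}. \]

The supremum is attained for θ(µ) satisfying

\[ \mu = \frac{e^{\theta(\mu)}}{1 + e^{\theta(\mu)}}. \]

It follows that

\[ e^{\theta(\mu)} = \frac{\mu}{1 - \mu} \]

and

\[ \theta(\mu) = \log \mu - \log(1 - \mu), \]

so that

\[ A^*(\mu) = \mu \log \mu - \mu \log(1 - \mu) - \log\Big(1 + \frac{\mu}{1 - \mu}\Big), \]

which gives

\[ A^*(\mu) = \mu \log \mu + (1 - \mu) \log(1 - \mu). \]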

Gaussian Recall that Θ = {(θ1 , θ2 )∣θ2 > 0} and

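\[ A(\theta) = \frac{1}{2} \log \pi - \frac{1}{2} \log \theta_2 + \frac{\theta_1^2}{4\theta_2}. \]

The conjugate is

\[ A^*(\mu) = \sup_{\theta \in \Theta} \Big\{ \theta_1 \mu_1 + \theta_2 \mu_2 - \frac{1}{2} \log \pi + \frac{1}{2} \log \theta_2 - \frac{\theta_1^2}{4\theta_2} \Big\}. \]

This is maximised when

\[ \mu_1 - \frac{\theta_1(\mu)}{2\theta_2(\mu)} = 0, \qquad \mu_2 + \frac{1}{2\theta_2(\mu)} + \frac{\theta_1^2(\mu)}{4\theta_2^2(\mu)} = 0, \]

which gives

\[ \theta_2(\mu_1, \mu_2) = -\frac{1}{2(\mu_1^2 + \mu_2)}, \qquad \theta_1(\mu) = -\frac{\mu_1}{\mu_1^2 + \mu_2}, \]

and

\[ A^*(\mu_1, \mu_2) = -\frac{1}{2} - \frac{1}{2} \log \pi - \frac{1}{2} \log(-2(\mu_1^2 + \mu_2)). \]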
Note that

M = {(µ1 , µ2 )∣µ21 + µ2 < 0}.

Exponential Distribution Recall that Θ = (0, +∞) and that A(θ) = − log(θ). By a straightforward
computation,

A∗ (µ) = −1 − log(−µ)

and

M = (−∞, 0).

Poisson Distribution Recall that Θ = R and that A(θ) = exp{θ}. It is a straightforward computa-
tion to see that

A∗ (µ) = µ log µ − µ

and that

M = (0, +∞).
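The closed forms for the Bernoulli and Poisson conjugates can be checked by carrying out the supremum in Equation (21.5) numerically. The sketch below is illustrative only: it uses a ternary search, which is valid because θ ↦ θµ − A(θ) is concave, and the search interval [−30, 30] and the test values of µ are assumptions made for the example.

```python
import math

def conjugate(a, mu, lo=-30.0, hi=30.0, iters=200):
    """Approximate A*(mu) = sup_theta {theta*mu - A(theta)} by ternary search.

    The objective is concave in theta, so shrinking the bracket is safe.
    """
    def objective(t):
        return t * mu - a(t)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective((lo + hi) / 2.0)

def bernoulli_A(theta):
    return math.log1p(math.exp(theta))

def poisson_A(theta):
    return math.exp(theta)

mu_b = 0.3  # illustrative Bernoulli mean, in (0, 1)
mu_p = 2.5  # illustrative Poisson mean, in (0, +inf)

bernoulli_numeric = conjugate(bernoulli_A, mu_b)
bernoulli_closed = mu_b * math.log(mu_b) + (1 - mu_b) * math.log(1 - mu_b)

poisson_numeric = conjugate(poisson_A, mu_p)
poisson_closed = mu_p * math.log(mu_p) - mu_p
```

In both cases the numerical supremum agrees with the closed-form expression to within rounding error.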

21.6 Kullback Leibler Divergence


Recall Definition 6.9, the Kullback Leibler distance between two probability distributions p ∈ [0, 1]M
and q ∈ [0, 1]M :

\[ D_{KL}(q \mid p) = \sum_{j=1}^{M} q_j \ln \frac{q_j}{p_j}. \]

This may be written as

\[ D_{KL}(q \mid p) = E_q\Big[ \log \frac{q(X)}{p(X)} \Big], \tag{21.8} \]

where X is a random vector with state space X = (x1 , . . . , xM ) and Eq is expectation with respect
to the measure such that qj = P(X = xj ). The definition of Kullback Leibler may be extended to
continuous distributions using Equation (21.8), where q and p denote the respective density functions.
In this case, Equation (21.8) is taken as

\[ D_{KL}(q \mid p) = \int_{\mathbb{R}^d} q(x) \log \frac{q(x)}{p(x)}\, dx. \]
When q and p are members of the same exponential family, the Kullback Leibler distance may be
computed in terms of the parameters. The key result, for expressing the distance in terms of the
partition function, is Fenchel's inequality given in Equation (21.9), which can be seen directly from
the definition of A∗ (µ).

A(θ) + A∗ (µ) ≥ ⟨µ, θ⟩, (21.9)

with equality if and only if µ = Λ(θ) and θ ∈ Λ−1 (µ). That is, for µ = Λ(θ) and θ ∈ Λ−1 (µ),

A(θ) + A∗ (µ) = ⟨µ, θ⟩. (21.10)

Consider an exponential family of distributions, and consider two exponential parameter vectors, θ1 ∈ Θ
and θ2 ∈ Θ. When distributions are from the same exponential family, the notation D(θ1 ∣θ2 ) is used
to denote DKL (p(.∣θ1 )∣p(.∣θ2 )). Set µi = Λ(θi ). Using the parameter to denote the distribution with
respect to which the expectation is taken, note that

\[ D(\theta_1 \mid \theta_2) = E_{\theta_1}\Big[ \log \frac{P(X \mid \theta_1)}{P(X \mid \theta_2)} \Big] = A(\theta_2) - A(\theta_1) - \langle \mu_1, \theta_2 - \theta_1 \rangle. \tag{21.11} \]

The representation of the Kullback Leibler divergence given in Equation (21.11) is known as the primal
form of the KL divergence.

Taking µ1 = Λ(θ1 ) and applying Equation (21.10), the Kullback Leibler distance may also be written

\[ D(\theta_1 \mid \theta_2) \equiv \tilde{D}(\mu_1 \mid \theta_2) = A(\theta_2) + A^*(\mu_1) - \langle \mu_1, \theta_2 \rangle. \tag{21.12} \]

The representation given in Equation (21.12) is known as the mixed form of the KL divergence. Recall
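the definition of A∗ given by

\[ A^*(\mu) := \sup_{\theta \in \Theta} \{ \langle \mu, \theta \rangle - A(\theta) \} \]

and recall Equation (21.6) from Theorem 21.10,

\[ A(\theta) = \sup_{\mu \in \mathcal{M}} \{ \langle \theta, \mu \rangle - A^*(\mu) \}. \]

Equation (21.6) may be rewritten as

\[ \inf_{\mu \in \mathcal{M}} \{ A(\theta) + A^*(\mu) - \langle \theta, \mu \rangle \} = 0. \]

It follows that inf µ∈M D̃(µ∣θ) = 0.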

Finally, taking µ2 = Λ(θ2 ) and applying Equation (21.10) once again to Equation (21.12) yields the
so-called dual form of the KL divergence;

\[ \tilde{\tilde{D}}(\mu_1 \mid \mu_2) \equiv D(\theta_1 \mid \theta_2) = A^*(\mu_1) - A^*(\mu_2) - \langle \theta_2, \mu_1 - \mu_2 \rangle. \tag{21.13} \]
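The primal, mixed and dual forms can be compared numerically with the direct definition of the divergence. The sketch below assumes the minimal Bernoulli family, with A(θ) = log(1 + exp{θ}), µ = Λ(θ) = 1/(1 + exp{−θ}) and A∗ (µ) = µ log µ + (1 − µ) log(1 − µ); the two parameter values are arbitrary.

```python
import math

def A(theta):
    """Log partition function of the minimal Bernoulli family."""
    return math.log1p(math.exp(theta))

def mean(theta):
    """Mean parameter mu = Lambda(theta) = dA/dtheta."""
    return 1.0 / (1.0 + math.exp(-theta))

def A_star(mu):
    """Conjugate dual: the negative entropy of a Bernoulli(mu) variable."""
    return mu * math.log(mu) + (1.0 - mu) * math.log(1.0 - mu)

def kl_direct(mu1, mu2):
    """E_q[log q/p] for q = Bernoulli(mu1), p = Bernoulli(mu2)."""
    return (mu1 * math.log(mu1 / mu2)
            + (1.0 - mu1) * math.log((1.0 - mu1) / (1.0 - mu2)))

def kl_primal(theta1, theta2):
    return A(theta2) - A(theta1) - mean(theta1) * (theta2 - theta1)

def kl_mixed(mu1, theta2):
    return A(theta2) + A_star(mu1) - mu1 * theta2

def kl_dual(mu1, mu2, theta2):
    return A_star(mu1) - A_star(mu2) - theta2 * (mu1 - mu2)

theta1, theta2 = 0.4, -1.1  # arbitrary canonical parameters
mu1, mu2 = mean(theta1), mean(theta2)

values = [
    kl_direct(mu1, mu2),
    kl_primal(theta1, theta2),
    kl_mixed(mu1, theta2),
    kl_dual(mu1, mu2, theta2),
]
```

All four entries of `values` coincide up to rounding, as Equations (21.11) to (21.13) require for dually coupled pairs.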

21.7 Mean Field Theory


In this section, probability distributions of the form

\[ P_X(x \mid \theta) = \exp\Big\{ \sum_{\alpha} \theta_\alpha \phi_\alpha(x) - A(\theta) \Big\} h(x) \]

are considered. Mean field theory techniques are discussed and it is shown how they may be used to
obtain estimates of the log partition function A(θ). This is equivalent to the problem of finding an
appropriate normalising constant to make a function into a probability density, a problem that often
arises when updating using Bayes rule.
Mean Field Theory is based on the variational principle of Equation (21.6). The two fundamental
difficulties associated with the variational problem are the nature of the constraint set M and the lack
of an explicit form for the dual function A∗ . Mean field theory entails limiting the optimization to a
subset of distributions for which A∗ is relatively easy to characterise.
More specifically, the discussion of this chapter is restricted to the case where the functions ϕα are
either linear or quadratic. The problem therefore reduces to considering a graph G = (V, U ), where the
node set V denotes the variables and the edge set U denotes a direct association between the variables.
For this discussion, the edges in U are assumed to be undirected. As usual, V and U denote the node
(variable) and undirected edge sets; the same notation is used for the indexing sets. That is, with
minor abuse of notation (clear from the context), V and U are also used to mean: V = {1, . . . , d} and
U = {⟨s, t⟩∣⟨Xs , Xt ⟩ ∈ E}. Specifically, the probability distributions under consideration are of the form

\[ P_X(x \mid \theta) = \exp\Big\{ \sum_{s \in \tilde{V}} \theta_s x_s + \sum_{(s,t) \in \tilde{E}} \theta_{(s,t)} x_s x_t - A(\theta) \Big\}. \]
Let H denote a sub-graph of G over which it is feasible to perform exact calculations. In an exponential
formulation, the set of all distributions that respect the structure of H can be represented by a linear
subspace of the exponential parameters. Let I(H) denote the subset of indices associated with cliques
in H . Then the set of exponential parameters corresponding to distributions structured according to
H is given by

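\[ E(H) := \{ \theta \in \Theta \mid \theta_\alpha = 0,\ \alpha \in I \setminus I(H) \}. \]

The simplest example is to consider the completely disconnected graph H = (V, ∅). Then

\[ E(H) = \{ \theta \in \Theta \mid \theta_{(s,t)} = 0,\ (s,t) \in E \}. \]

The associated distributions are of the product form

\[ P_X(x \mid \theta) = \prod_{s \in \tilde{V}} P_{X_s}(x_s \mid \theta_s). \]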

Optimisation and Lower Bounds Let PX (x∣θ) denote the target distribution that is to be approximated.
The basis of mean field approximation is the following: any valid mean parameter specifies a
lower bound on the log partition function, established using Jensen's inequality.

Proposition 21.11 (Mean Field Lower Bound).

\[ A(\theta) \ge \sup_{\mu \in \mathcal{M}} \{ \langle \theta, \mu \rangle - A^*(\mu) \} \]

Proof The proof is given for discrete variables; the proof for continuous variables is exactly the same,
replacing the sum with an integral.

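\[ \begin{aligned}
A(\theta) &= \log \sum_{x \in \mathcal{X}} \exp\{ \langle \theta, \Phi(x) \rangle \} \\
&= \log \sum_{x \in \mathcal{X}} P_X(x \mid \theta) \exp\{ \langle \theta, \Phi(x) \rangle - \log P_X(x \mid \theta) \} \\
&= \log E_\theta[ \exp\{ \langle \theta, \Phi(X) \rangle - \log P_X(X \mid \theta) \} ] \\
&\overset{(a)}{\ge} \langle \theta, E_\theta[\Phi(X)] \rangle - E_\theta[ \log P_X(X \mid \theta) ] \\
&= \langle \theta, \mu \rangle - A^*(\mu).
\end{aligned} \]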

The inequality (a) follows from Jensen's inequality; the last line follows from Theorem 21.10.

There are difficulties in computing the lower bound in cases where there is not an explicit form for
A∗ (µ). The mean field approach circumvents this difficulty by restricting to

\[ \mathcal{M}(G; H) := \{ \mu \in \mathbb{R}^d \mid \mu = E_\theta[\Phi(X)],\ \theta \in E(H) \}. \]

Note that M(G; H) ⊂ M, hence

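\[ A(\theta) \ge \sup_{\mu \in \mathcal{M}} \{ \langle \theta, \mu \rangle - A^*(\mu) \} \ge \sup_{\mu \in \mathcal{M}(G;H)} \{ \langle \theta, \mu \rangle - A^*(\mu) \}. \]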

This lower bound is the best that can be obtained by restricting to H .

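Let µ(n) denote a sequence such that, for each n, µ(n) ∈ M(G; H), such that µ(n) → µ as n → +∞, and such
that

\[ \langle \theta, \mu^{(n)} \rangle - A^*(\mu^{(n)}) \xrightarrow{\, n \to +\infty \,} \sup_{\mu \in \mathcal{M}(G;H)} \{ \langle \theta, \mu \rangle - A^*(\mu) \}. \]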

Note that µ ∈ M(G; H). Since θ ∈ Θ, it follows that µ ∈ M. The distribution associated with
µ minimises the Kullback Leibler divergence between the approximating distribution and the target
distribution, subject to the constraint that µ ∈ M(G; H). Recall the mixed form of the Kullback
Leibler divergence; namely, Equation (21.12):

\[ \tilde{D}(\mu \mid \theta) = A(\theta) + A^*(\mu) - \langle \mu, \theta \rangle. \]

Naive Mean Field Updates In the naive mean field approach, a fully factorised distribution is
chosen. This is equivalent to the approximation obtained by taking an empty edge set to approximate
the original distribution. The naive mean field updates are a set of recursions for finding a stationary
point of the resulting optimisation problem.

Example 21.12 (Sigmoid Network Model).



Let X = (X1 , . . . , Xd ) be a random vector with state space X = {0, 1}d (d binary variables). Suppose
that the distribution may be factorised along an undirected graph G = (V, U ). The probability function
is given by

\[ P_X(x \mid \theta) = \exp\Big\{ \sum_{j=1}^{d} \theta_j x_j + \sum_{\langle i,j \rangle \in U} \theta_{\langle i,j \rangle} x_i x_j - A(\theta) \Big\}. \]
The naive mean field approach involves considering the graph with no edges. In this restricted class,

\[ P_X(x \mid \theta) = \exp\Big\{ \sum_{j=1}^{d} \theta_j x_j - A(\theta^{(H)}) \Big\}, \]

where θ(H) is the collection of parameters θs(H) = θs , s = 1, . . . , d and θ(H) (s, t) ≡ 0. Note that

µs = Eθ [ϕs (X)] = Eθ [Xs ]

and

µ(s,t) = Eθ [ϕs,t (X)] = Eθ [Xs Xt ].

When θ ∈ E(H), it follows that X1 , . . . , Xd are independent, so that

µ(s,t) = Eθ [Xs Xt ] = µs µt .

The optimisation is therefore restricted to the set of parameters

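\[ \mathcal{M}(G; H) = \{ (\mu_s)_{s=1}^{d}, (\mu_{\langle s,t \rangle})_{\langle s,t \rangle \in \{1,\dots,d\}^2} \mid 0 \le \mu_s \le 1,\ \mu_{\langle s,t \rangle} = \mu_s \mu_t \}. \]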

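With the restriction to product form distributions, X1 , . . . , Xd are independent Bernoulli variables and
hence

\[ A_H^*(\mu) = \sum_{s=1}^{d} \{ \mu_s \log \mu_s + (1 - \mu_s) \log(1 - \mu_s) \}. \]

Set

\[ F(\mu; \theta) = \sum_{s=1}^{d} \theta_s \mu_s + \sum_{\langle s,t \rangle \in U} \theta_{\langle s,t \rangle} \mu_s \mu_t - \sum_{s=1}^{d} ( \mu_s \log \mu_s + (1 - \mu_s) \log(1 - \mu_s) ), \]

then the lower bound is given by

\[ A(\theta) \ge \sup_{(\mu_s)_{s=1}^{d} \in [0,1]^d} F(\mu; \theta). \]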

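Note that, as a function of each single µs with the others held fixed, F is strictly concave. It is easy to
see that the maximum is attained when, for all 1 ≤ s ≤ d, (µt )dt=1 satisfies

\[ \theta_s + \sum_{t : \langle s,t \rangle \in U} \theta_{\langle s,t \rangle} \mu_t - \log \frac{\mu_s}{1 - \mu_s} = 0, \]

or

\[ \log \frac{\mu_s}{1 - \mu_s} = \theta_s + \sum_{t \in N(s)} \theta_{\langle s,t \rangle} \mu_t. \]

Note that if

\[ \log \frac{y}{1 - y} = x, \]

then

\[ y = \sigma(x), \]

where

\[ \sigma(x) = \frac{1}{1 + e^{-x}}. \]

The algorithm then proceeds by setting

\[ \mu_s^{(j+1)} = \sigma\Big( \theta_s + \sum_{t \in N(s)} \theta_{\langle s,t \rangle} \mu_t^{(j)} \Big). \]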

As discussed in [72] (page 222), the lower bound thus computed seems to provide a good approximation
to the true value.
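The naive mean field recursion can be sketched in a few lines for a model small enough that A(θ) is computable by enumeration. The code below is an illustration under stated assumptions, not the implementation of [72]: nodes are indexed from 0, the parameter values are hypothetical, and the coordinates are updated one at a time within each sweep (a sequential variant of the recursion above, chosen because each single-coordinate update cannot decrease F ).

```python
import itertools
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def naive_mean_field(theta_node, theta_edge, sweeps=200):
    """Sequential naive mean field updates mu_s <- sigma(theta_s + sum_t theta_st mu_t)."""
    d = len(theta_node)
    neighbours = {s: [] for s in range(d)}
    for (s, t), w in theta_edge.items():
        neighbours[s].append((t, w))
        neighbours[t].append((s, w))
    mu = [0.5] * d
    for _ in range(sweeps):
        for s in range(d):
            field = theta_node[s] + sum(w * mu[t] for t, w in neighbours[s])
            mu[s] = sigmoid(field)
    return mu

def lower_bound(mu, theta_node, theta_edge):
    """F(mu; theta): the mean field lower bound on A(theta)."""
    value = sum(th * m for th, m in zip(theta_node, mu))
    value += sum(w * mu[s] * mu[t] for (s, t), w in theta_edge.items())
    value -= sum(m * math.log(m) + (1 - m) * math.log(1 - m) for m in mu)
    return value

def exact_log_partition(theta_node, theta_edge):
    """A(theta) by brute-force enumeration over {0,1}^d (small d only)."""
    total = 0.0
    for x in itertools.product([0, 1], repeat=len(theta_node)):
        energy = sum(th * xi for th, xi in zip(theta_node, x))
        energy += sum(w * x[s] * x[t] for (s, t), w in theta_edge.items())
        total += math.exp(energy)
    return math.log(total)

# Hypothetical three-node model with edges <0,1> and <0,2>.
theta_node = [0.5, -0.3, 0.2]
theta_edge = {(0, 1): 1.0, (0, 2): -0.8}

mu = naive_mean_field(theta_node, theta_edge)
bound = lower_bound(mu, theta_node, theta_edge)
exact = exact_log_partition(theta_node, theta_edge)
```

Here `bound` never exceeds `exact`, and `mu` satisfies the fixed point equations to numerical precision; the gap between the two is the Kullback Leibler divergence from the product distribution to the target.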

Notes
The material for Chapter 21 is taken mostly from Wainwright and Jordan [142]. It is developed further
in [72]. Possible improvements to the lower bound are proposed by Humphreys and Titterington in [66].
The book by Barndorff-Nielsen [3] is the standard treatise on exponential families and the required
convex analysis.

21.8 Exercises: Graphical Models and Exponential Families
1. Prove lemma 21.5.

2. Let (X1 , X2 , X3 ) be random variables, with joint probability function

\[ p(x_1, x_2, x_3 \mid \eta) = \frac{n!}{x_1!\, x_2!\, x_3!} \prod_{j=1}^{3} p_j^{x_j}, \qquad x_1 + x_2 + x_3 = n, \]

where p1 = η 2 , p2 = 2η(1 − η) and p3 = (1 − η)2 and 0 ≤ η ≤ 1.

(a) Is this an exponential family?


(b) Obtain the minimal sufficient statistic for θ.
(c) Compute the mean parameter in terms of η .
(d) Compute the Fenchel Legendre Conjugate of the log partition function.
(e) Prove that the Kullback Leibler Divergence is given by

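\[ D(\theta_1 \mid \theta_2) = A(\theta_2) - A(\theta_1) - \langle \mu_1, \theta_2 - \theta_1 \rangle, \]

\[ \tilde{D}(\mu_1 \mid \theta_2) = A(\theta_2) + A^*(\mu_1) - \langle \mu_1, \theta_2 \rangle, \]

\[ \tilde{\tilde{D}}(\mu_1 \mid \mu_2) = A^*(\mu_1) - A^*(\mu_2) - \langle \theta_2, \mu_1 - \mu_2 \rangle. \]

State the definitions of the terms used in these equations.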


(f) Compute the primal form of the Kullback Leibler divergence D(θ1 ∣θ2 ), where θ1 and θ2 are
the canonical parameters, for this example. Compute the dual form, expressed in terms of
the mean parameters.

3. (Mean Field Update) Consider a probability function, given by

\[ p_X(x \mid \theta) = \exp\Big\{ \sum_{j=1}^{n} \theta(j) x(j) + \sum_{(i,j) \in E} \theta(i,j) x(i) x(j) - A(\theta) \Big\}, \]

where θ = {(θ(j))nj=1 , (θ(j, k)), (j, k) ∈ E}, E denotes the edge set and x ∈ {0, 1}n . Let q denote
the probability function

\[ q_X(x \mid \theta) = \exp\Big\{ \sum_{j=1}^{n} \theta(j) x(j) - A_H(\theta) \Big\}. \]

Let

\[ A_H^*(\mu) = \sup_{\theta} \{ \langle \mu, \theta \rangle - A_H(\theta) \}. \]

(a) Prove that


\[ A_H^*(\mu) = \sum_{j=1}^{n} \{ \mu(j) \log \mu(j) + (1 - \mu(j)) \log(1 - \mu(j)) \}. \]


(b) Prove that

\[ A(\theta) \ge \sup_{\mu} \Big\{ \sum_{j=1}^{n} \theta(j) \mu(j) + \sum_{(j,k) \in E} \theta(j,k) \mu(j) \mu(k) - A_H^*(\mu) \Big\}. \]

(c) Consider the probability distribution

\[ p(x_1, x_2, x_3; \theta) = \exp\Big\{ \sum_{j=1}^{3} \theta(j) x_j + \theta(1,2) x_1 x_2 + \theta(1,3) x_1 x_3 - A(\theta) \Big\}. \]

Show that the expression in the previous part is maximised for (µ(1), µ(2), µ(3)) that satisfy

\[ \log \frac{\mu(1)}{1 - \mu(1)} = \theta(1) + \theta(1,2) \mu(2) + \theta(1,3) \mu(3), \]

\[ \log \frac{\mu(2)}{1 - \mu(2)} = \theta(2) + \theta(1,2) \mu(1), \]

\[ \log \frac{\mu(3)}{1 - \mu(3)} = \theta(3) + \theta(1,3) \mu(1). \]

(d) Write a Matlab code to compute numerical approximations to the values (µ(1), µ(2), µ(3))
that give the naive mean field approximation to the log partition function A(θ).
Chapter 22

Variational Methods for Parameter Estimation

22.1 Complete Instantiations


Let x be an n × d data matrix with n i.i.d. instantiations of X = (X1 , . . . , Xd ), which has distribution
Pθ . Here θ is the parameter vector; θ ∈ Θ ⊂ Rp . If {Pθ ∶ θ ∈ Θ} is an exponential family, there are
several useful techniques that may be used for parameter estimation.
The distributions encountered in Bayesian Networks and, more generally, graphical models are
usually multinomial, multivariate Gaussian, or Conditional Gaussian. All these are exponential families
and lend themselves to the techniques discussed.

22.1.1 Triangulated Graphs


Probability distributions that factorise over a triangulated graph present the most straightforward
situation for parameter estimation. Such a probability distribution P may be written in the form:

\[ P = \frac{\prod_{C \in \mathcal{C}} P_C}{\prod_{S \in \mathcal{S}} P_S}, \]
where C and S denote the collections of cliques and separators.
Consider the multivariate setting. Let x = (x1 , . . . , xd ) denote an instantiation of the random vector
X = (X1 , . . . , Xd ). For each s ∈ {1, . . . , d}, xs ∈ Xs = (1, . . . , ks ), x ∈ X = ×ds=1 Xs , xC = {xs ∶ s ∈ C} and
xS = {xs ∶ s ∈ S}. For simplicity, the values taken by the variables are noted by their indices.

This is a multinomial distribution written as an exponential family in an over-complete representation.


The mean field parameters are simply:

\[ p(C, x_C) := P_C(x_C), \qquad C \in \mathcal{C},\ x_C \in \mathcal{X}_C, \]

\[ p(S, x_S) := P_S(x_S), \qquad S \in \mathcal{S},\ x_S \in \mathcal{X}_S. \]

The maximum likelihood estimators are:

\[ \hat{p}_C(x_C) = \frac{1}{n} \sum_{j=1}^{n} 1_{x_C}(x_{j,C}), \qquad \hat{p}_S(x_S) = \frac{1}{n} \sum_{j=1}^{n} 1_{x_S}(x_{j,S}), \]
where xj,C denotes the value for clique C of instantiation j ∶ j = 1, . . . , n and similarly for xj,S . By
construction, these are clearly consistent; if S ⊂ C then p̂(S, xS ) = ∑xC /xS p̂(C, xC ).
This may be written as an exponential family with over-complete canonical representation:

\[ P(x) = \exp\Big\{ \sum_{C \in \mathcal{C}} \psi_C(x_C) - \sum_{S \in \mathcal{S}} \psi_S(x_S) \Big\} \tag{22.1} \]

where ψC = log PC and ψS = log PS .

Factorisation along a Chow-Liu Tree Now suppose that the distribution factorises along a Chow-Liu
tree. Equation (22.1) may now be written:

\[ P(x) = \exp\Big\{ \sum_{s=1}^{d} \theta(s; x_s) + \sum_{(s,t) \in E} \theta(s, t; x_s, x_t) \Big\} \]

where

\[ \theta(s; x_s) = \log P_{X_s}(x_s), \qquad \theta(s, t; x_s, x_t) = \log \frac{P_{X_s, X_t}(x_s, x_t)}{P_{X_s}(x_s) P_{X_t}(x_t)} \]
and E denotes the edge set of the graph. The maximum likelihood estimates of the parameters are
given by:

\[ \hat{\theta}(s; x_s) = \log \hat{p}_s(x_s), \qquad \hat{\theta}(s, t; x_s, x_t) = \log \frac{\hat{p}_{s,t}(x_s, x_t)}{\hat{p}_s(x_s) \hat{p}_t(x_t)}. \]

22.1.2 Non-Triangulated Graphs


For non-triangulated graphs, there is no closed form expression for the maximum likelihood estimates.
Recall that a probability distribution factorises along an undirected graph if P may be written as:

\[ P(x) = \prod_{C \in \mathcal{C}} \phi_C(x_C) \]

where C denotes the collection of cliques of the undirected graph. The probability distribution may be
written as:

\[ P(x) = \exp\Big\{ \sum_{C \in \mathcal{C}} \theta_C(x_C) - A(\theta) \Big\} \]

where A(θ) is the log partition function and θ = {θC (xC ) ∶ C ∈ C, xC ∈ XC }.


An iterative proportional fitting (IPF) method may be used, since the log partition function is convex.
Let

\[ L(\theta) = \frac{1}{n} \sum_{j=1}^{n} \Big( \sum_{C \in \mathcal{C}} \theta_C(x_{j,C}) - A(\theta) \Big) = \sum_{C \in \mathcal{C}} \sum_{x_C \in \mathcal{X}_C} \theta_C(x_C) \hat{p}_C(x_C) - A(\theta) \tag{22.2} \]

then

\[ \frac{\partial}{\partial \theta_C(x_C)} L(\theta) = \hat{p}_C(x_C) - \frac{\partial}{\partial \theta_C(x_C)} A(\theta) = \hat{p}_C(x_C) - p_C(x_C). \tag{22.3} \]

Here we have used the fact that the exponential family is in its canonical form, so that
∂A(θ)/∂θC (xC ) = Eθ [p̂C (xC )] = pC (xC ) (from (22.2), p̂C (xC ) is the sufficient statistic).

The aim is to find the MLE (where ∂L(θ)/∂θC (xC ) = 0). The iterative proportional fitting scheme proceeds
as follows:

At iterations t = 0, 1, 2, . . . let θ(t) denote the current vector of parameter estimates.

• Choose a clique C = C(t) and compute the local marginal distribution

\[ p_C^{(t)}(x_C) := P_{\theta^{(t)}}(X_C = x_C) \qquad \forall x_C \in \mathcal{X}_C. \]

• Update the canonical parameter vector:

\[ \theta_C^{(t+1)}(x_C) = \begin{cases} \theta_C^{(t)}(x_C) + \log \dfrac{\hat{p}_C(x_C)}{p_C^{(t)}(x_C)} & C = C(t),\ x_C \in \mathcal{X}_C \\ \theta_C^{(t)}(x_C) & \text{otherwise.} \end{cases} \]

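The sequence satisfies two important properties, which are stated as a proposition.

Proposition 22.1.
1. A(θ(t+1) ) = A(θ(t) ).

2. For each t, after the update of the clique C = C(t), the right hand side of Equation (22.3) vanishes
for that clique; that is, pC,θ(t+1) (xC ) = p̂C (xC ).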

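Proof Suppose C(t) = C ′ . Defining θ(t+1) in this way gives:

\[ \begin{aligned}
A(\theta^{(t+1)}) &= \log \sum_{x} \exp\Big\{ \sum_{C \in \mathcal{C}} \theta_C^{(t+1)}(x_C) \Big\} \\
&= \log \sum_{x} \exp\Big\{ \sum_{C \in \mathcal{C}} \theta_C^{(t)}(x_C) + \log \frac{\hat{p}_{C'}(x_{C'})}{p_{C'}^{(t)}(x_{C'})} \Big\} \\
&= \log \sum_{x_{C'}} \frac{\hat{p}_{C'}(x_{C'})\, e^{\theta_{C'}^{(t)}(x_{C'})}}{p_{C'}^{(t)}(x_{C'})} \sum_{x \setminus x_{C'}} \exp\Big\{ \sum_{C \in \mathcal{C} \setminus C'} \theta_C^{(t)}(x_C) \Big\}.
\end{aligned} \]

Now use:

\[ p_{C', \theta}(x_{C'}) = e^{-A(\theta) + \theta_{C'}(x_{C'})} \sum_{x \setminus x_{C'}} e^{\sum_{C \ne C'} \theta_C(x_C)} \]

from which

\[ A(\theta^{(t+1)}) = A(\theta^{(t)}) + \log \sum_{x_{C'}} \hat{p}_{C'}(x_{C'}) = A(\theta^{(t)}). \]

This follows because ∑xC ′ p̂C ′ (xC ′ ) = 1.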

For the second part, the parameter update gives:

\[ p_{C, \theta^{(t+1)}}(x_C) = \frac{\hat{p}_C(x_C)}{p_{C, \theta^{(t)}}(x_C)}\, p_{C, \theta^{(t)}}(x_C) = \hat{p}_C(x_C). \]

It therefore follows that the IPF algorithm corresponds to a co-ordinate ascent method for maximising
the objective (22.2).
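The IPF scheme can be sketched on a graph small enough for marginals to be computed by enumeration. The example below is illustrative only: it assumes four binary variables on a 4-cycle (a non-triangulated graph whose cliques are its edges), and the "empirical" clique marginals are derived from a hypothetical positive weight function rather than from real data.

```python
import itertools
import math

STATES = list(itertools.product([0, 1], repeat=4))
CLIQUES = [(0, 1), (1, 2), (2, 3), (3, 0)]  # edges of a 4-cycle

def restrict(x, clique):
    """Sub-configuration x_C of x on the clique C."""
    return tuple(x[s] for s in clique)

def score(theta, x):
    return sum(theta[c][restrict(x, c)] for c in CLIQUES)

def log_partition(theta):
    return math.log(sum(math.exp(score(theta, x)) for x in STATES))

def model_marginal(theta, clique):
    """Clique marginal p_C under the current parameters, by enumeration."""
    a = log_partition(theta)
    marginal = {}
    for x in STATES:
        key = restrict(x, clique)
        marginal[key] = marginal.get(key, 0.0) + math.exp(score(theta, x) - a)
    return marginal

def ipf(p_hat, sweeps=200):
    """Cycle through the cliques, adding log(p_hat_C / p_C) to theta_C."""
    pairs = list(itertools.product([0, 1], repeat=2))
    theta = {c: {xc: 0.0 for xc in pairs} for c in CLIQUES}
    for _ in range(sweeps):
        for c in CLIQUES:
            p_model = model_marginal(theta, c)
            for xc in pairs:
                theta[c][xc] += math.log(p_hat[c][xc] / p_model[xc])
    return theta

# Hypothetical positive "empirical" distribution and its clique marginals.
weights = {x: 1.0 + x[0] + 2.0 * x[1] * x[2] + x[3] * (1 - x[0]) for x in STATES}
total = sum(weights.values())
p_hat = {}
for c in CLIQUES:
    p_hat[c] = {}
    for x in STATES:
        key = restrict(x, c)
        p_hat[c][key] = p_hat[c].get(key, 0.0) + weights[x] / total

theta = ipf(p_hat)
```

Starting from θ = 0, the log partition function stays at log 16 throughout, in line with the first part of Proposition 22.1, and after the sweeps the model marginal of every clique matches the corresponding empirical marginal.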

The Schedule Convexity of the log-partition function gives, by standard results, that the IPF
algorithm converges. The main issue is efficiency. One approach is to

1. Triangulate the graph and construct a junction tree. Fix a schedule for the junction tree.

2. For each node of the junction tree, consider the true model (the sub-graph of cliques and separa-
tors of the true model) and use the IPF scheme to update each clique of the triangulated graph
according to the schedule.

22.2 Partially Observed Models and Expectation-Maximisation


Now suppose that the random vector X is not observed directly, but rather a `noisy' version Y is
observed. The expectation-maximisation (EM) algorithm of Dempster et al. [36] may be used.

22.2.1 Exact EM Algorithm for Exponential Families


Suppose we have a random vector (X, Y ) where X are unobserved and Y are observable. Suppose the
probability model is:

pθ (x, y) = exp {⟨θ, ϕ(x, y)⟩ − A(θ)} h(x)

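The conditional distribution of X given Y is:

\[ p_\theta(x \mid y) = \frac{\exp\{ \langle \theta, \phi(x, y) \rangle \}}{\int_{\mathcal{X}} \exp\{ \langle \theta, \phi(x, y) \rangle \} h(x)\, dx} =: \exp\{ \langle \theta, \phi(x, y) \rangle - A_y(\theta) \} \]

(this is the definition of Ay ). For each fixed y , the conditional distribution of X is therefore an
exponential family with log partition function Ay given by:

\[ A_y(\theta) = \log \int_{\mathcal{X}} \exp\{ \langle \theta, \phi(x, y) \rangle \} h(x)\, dx. \]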

The maximum likelihood estimate θ̂ is obtained by maximising the log probability of the observed data
y . This is referred to as the incomplete log likelihood in the EM setting. The incomplete log likelihood
is given by the integral:

\[ L(\theta; y) = \log \int_{\mathcal{X}} \exp\{ \langle \theta, \phi(x, y) \rangle - A(\theta) \} h(x)\, dx = A_y(\theta) - A(\theta). \tag{22.4} \]

For each fixed y , the set My of valid mean parameters is defined as:

\[ \mathcal{M}_y = \{ \mu \in \mathbb{R}^p : \mu = E_\theta[\phi(X, y)],\ \theta \in \Theta \}. \]

The Fenchel-Legendre conjugate may be used to obtain:

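\[ A_y(\theta) = \sup_{\mu \in \mathcal{M}_y} \{ \langle \theta, \mu \rangle - A_y^*(\mu) \} \tag{22.5} \]

where the conjugate dual is defined variationally as:

\[ A_y^*(\mu) := \sup_{\theta \in \mathrm{dom}(A_y)} \{ \langle \mu, \theta \rangle - A_y(\theta) \}. \tag{22.6} \]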

From (22.5), it follows that Ay (θ) ≥ ⟨µ, θ⟩ − A∗y (µ) for any µ. A lower bound for the incomplete log
likelihood is therefore:

\[ L(\theta, y) = A_y(\theta) - A(\theta) \ge \langle \mu, \theta \rangle - A_y^*(\mu) - A(\theta) =: \tilde{L}(\mu, \theta). \]

With this set up, the EM algorithm is coordinate ascent on the function L̃, which gives a
lower bound. The steps of the EM algorithm are:

\[ \begin{cases} \mu_y^{(t+1)} = \arg\max_{\mu \in \mathcal{M}_y} \tilde{L}(\mu, \theta^{(t)}) & \text{E step} \\ \theta^{(t+1)} = \arg\max_{\theta \in \Theta} \tilde{L}(\mu_y^{(t+1)}, \theta) & \text{M step} \end{cases} \tag{22.7} \]

Note that if L̃ were equal to the log likelihood L, then the E step would be equivalent to finding the
expectation µ for parameter vector θ(t) , while the M step would be precisely the problem of finding
the maximum likelihood estimator based on expected sufficient statistics µy(t+1) .

The maximisation of the E step gives L̃(µy(t+1) , θ(t) ) = L(θ(t) , y), while the maximisation of the
M step gives L(θ(t+1) , y) ≥ L̃(µy(t+1) , θ(t+1) ).

Example 22.2 (EM for Conditional Gaussian).

The EM algorithm described can be used to estimate the parameters for a Conditional Gaussian
model. For example, consider the straightforward setting where Y = (Y1 , . . . , Yr ) are Gaussian variables,
Y ∣{X = j} = Yj for j = 1, . . . , r. Suppose that X , the index of the components, is unobserved. The
state space for X is X = {1, . . . , r} and X has a multinomial distribution.
The complete likelihood may be written as:

\[ L_\theta(x, y) = \exp\Big\{ \sum_{j=1}^{r} 1_j(x) \{ \alpha_j + \gamma_j y + \tilde{\gamma}_j y^2 - A_j(\gamma_j, \tilde{\gamma}_j) \} - A(\alpha) \Big\} \]

where θ = (α, γ, γ̃), the parameter α ∈ Rr parametrises the multinomial distribution over the hidden
variable X and the pair (γj , γ̃j ) parametrises the Gaussian distribution of the j th mixture component.
The log-partition function Aj (γj , γ̃j ) is for the conditionally Gaussian distribution of Y given X = j ,
while A(α) = log ∑rj=1 exp {αj } normalises the multinomial distribution.

When the complete likelihood is viewed as an exponential family, the sufficient statistics are the
collection of triples

\[ \Psi_j(x, y) := \{ 1_j(x),\ 1_j(x) y,\ 1_j(x) y^2 \}, \qquad j = 1, \dots, r. \]

Consider a collection of i.i.d. observations (y1 , . . . , yn ). To each observation there is associated a
triplet (µi , ηi , η̃i ) ∈ Rr × Rr × Rr corresponding to expectations of the triplet of sufficient statistics
Ψj (X, yi ) ∶ j = 1, . . . , r. The conditional distribution has form:

\[ p(x \mid y, \theta) \propto \exp\Big\{ \sum_{j=1}^{r} 1_{\{j\}}(x) ( \alpha_j + \gamma_j y + \tilde{\gamma}_j y^2 - A_j(\gamma_j, \tilde{\gamma}_j) ) \Big\}. \]

It follows that the mean parameter pj∣y = P(X = j∣Y = y) is:

\[ p_{j \mid y} = \frac{\exp\{ \alpha_j + \gamma_j y + \tilde{\gamma}_j y^2 - A_j(\gamma_j, \tilde{\gamma}_j) \}}{\sum_{k=1}^{r} \exp\{ \alpha_k + \gamma_k y + \tilde{\gamma}_k y^2 - A_k(\gamma_k, \tilde{\gamma}_k) \}}. \]

Similarly, the remaining mean parameters are:

\[ \eta_{j \mid y} = p_{j \mid y}\, y, \qquad \tilde{\eta}_{j \mid y} = p_{j \mid y}\, y^2. \]

The computations of the mean parameter µy = (pj∣y , pj∣y y, pj∣y y 2 ) correspond to the E step.

The M step requires finding θ = (α, γ, γ̃) to maximise

\[ \langle \mu_y^{(t+1)}, \theta \rangle - A(\theta). \]

Some computation shows that this problem takes the form of finding (α, γ, γ̃) ∈ Θ which maximises:

\[ \sum_{j=1}^{r} \sum_{i=1}^{n} \big( \alpha_j p_{j \mid y_i} + \gamma_j p_{j \mid y_i} y_i + \tilde{\gamma}_j p_{j \mid y_i} y_i^2 - p_{j \mid y_i} A_j(\gamma_j, \tilde{\gamma}_j) \big) - n A(\alpha). \]

The optimisation therefore decouples into separate maximisation problems: one for the α vector
parametrising the mixtures and one for each of the (γj , γ̃j ) pairs specifying the Gaussian mixtures.

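The optimum solution is therefore the value of (α, γ, γ̃) such that, for each j = 1, . . . , r,

\[ \begin{cases} p_{j \mid \alpha} = \dfrac{1}{n} \sum_{i=1}^{n} p_{j \mid y_i} \\[2mm] E_{\gamma_j, \tilde{\gamma}_j}[Y \mid X = j] = \dfrac{\sum_{i=1}^{n} p_{j \mid y_i}\, y_i}{\sum_{i=1}^{n} p_{j \mid y_i}} \\[2mm] E_{\gamma_j, \tilde{\gamma}_j}[Y^2 \mid X = j] = \dfrac{\sum_{i=1}^{n} p_{j \mid y_i}\, y_i^2}{\sum_{i=1}^{n} p_{j \mid y_i}}. \end{cases} \]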
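The E and M steps of this example can be sketched for a one-dimensional mixture. The code below is an illustration under stated assumptions: it works in the familiar (weight, mean, variance) parametrisation rather than in (α, γ, γ̃), the data and starting values are hypothetical, and the variance is recovered from the matched first and second moments.

```python
import math

def normal_pdf(y, mean, var):
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gaussian_mixture(ys, weights, means, variances, iterations=50):
    """EM for a univariate Gaussian mixture with hidden component index X."""
    weights, means, variances = list(weights), list(means), list(variances)
    r = len(weights)
    history = []  # incomplete log likelihood at the start of each iteration
    for _ in range(iterations):
        # E step: responsibilities p_{j|y_i} = P(X = j | Y = y_i).
        resp = []
        loglik = 0.0
        for y in ys:
            joint = [weights[j] * normal_pdf(y, means[j], variances[j]) for j in range(r)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([v / total for v in joint])
        history.append(loglik)
        # M step: match the weighted first and second moments.
        for j in range(r):
            nj = sum(p[j] for p in resp)
            first = sum(p[j] * y for p, y in zip(resp, ys)) / nj
            second = sum(p[j] * y * y for p, y in zip(resp, ys)) / nj
            weights[j] = nj / len(ys)
            means[j] = first
            variances[j] = second - first ** 2
    return weights, means, variances, history

# Hypothetical data: two well separated groups of five observations.
ys = [-2.1, -1.9, -2.4, -1.6, -2.0, 1.8, 2.2, 2.5, 1.5, 2.0]
weights, means, variances, history = em_gaussian_mixture(
    ys, weights=[0.5, 0.5], means=[-1.0, 1.0], variances=[1.0, 1.0]
)
```

The recorded log likelihoods in `history` are non-decreasing, which is the ascent property of exact EM, and the fitted means settle close to the two group averages.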

22.2.2 Mean Field Approximate EM


Suppose that it is not feasible to compute the sufficient statistics. Then the E step can be replaced
by a Mean Field E step where the maximum is taken over a reduced space of models:

\[ \mu_y^{(t+1)} = \arg\max_{\mu \in \mathcal{M}_{\mathrm{red}}} \{ \langle \mu, \theta^{(t)} \rangle - A_y^*(\mu) \}. \]

The E step no longer closes the gap between the incomplete log-likelihood L and the auxiliary function
L̃, and there are no longer guarantees that the algorithm goes uphill.

22.3 Variational Bayes


Assume that the complete distribution lies in an exponential family

p(x, y∣θ) = exp {⟨η(θ), ϕ(x, y)⟩ − A(η(θ))}

where the function η ∶ Rp → Rp gives some additional flexibility. Assume, furthermore, that the prior
distribution over Θ also lies in an exponential family and is of conjugate prior form:

pξ,λ (θ) = exp {⟨ξ, η(θ)⟩ − λA(η(θ)) − B(ξ, λ)} . (22.8)

This exponential family is specified by the sufficient statistics: {η(θ), −A(η(θ))} ∈ Rd × R. The log
partition function B(ξ, λ) is defined in the usual way:

\[ B(\xi, \lambda) := \log \int_{\Theta} \exp\{ \langle \xi, \eta(\theta) \rangle - \lambda A(\eta(\theta)) \}\, d\theta. \]

Now consider the problem of computing the marginal likelihood pξ∗ ,λ∗ (y) where y is an observed
datum and (ξ ∗ , λ∗ ) are fixed values of the hyperparameters. This requires averaging over both x (the
unobserved variables) and the parameter space Θ.

log pξ∗ ,λ∗ (y) = log ∫ (∫ p(x, y∣θ)dx) pξ∗ ,λ∗ (θ)dθ = log ∫ pξ∗ ,λ∗ (θ)p(y∣θ)dθ.

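A simple application of Jensen's inequality gives:

\[ \log p_{\xi^*, \lambda^*}(y) = \log E_{\xi, \lambda}\Big[ \frac{p_{\xi^*, \lambda^*}(\Theta)}{p_{\xi, \lambda}(\Theta)}\, p(y \mid \Theta) \Big] \ge E_{\xi, \lambda}[ \log p(y \mid \Theta) ] + E_{\xi, \lambda}\Big[ \log \frac{p_{\xi^*, \lambda^*}(\Theta)}{p_{\xi, \lambda}(\Theta)} \Big] \]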
with equality for (ξ, λ) = (ξ ∗ , λ∗ ). From Equation (22.4),

log p(y∣Θ) = Ay (η(Θ)) − A(η(Θ))



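so that

\[ \log p_{\xi^*, \lambda^*}(y) \ge E_{\xi, \lambda}[ A_y(\eta(\Theta)) - A(\eta(\Theta)) ] + E_{\xi, \lambda}\Big[ \log \frac{p_{\xi^*, \lambda^*}(\Theta)}{p_{\xi, \lambda}(\Theta)} \Big] \tag{22.9} \]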
where Ay is the log partition function of the conditional density p(x∣y, θ).

For each fixed y , the set My is the set of mean parameters of the form µ = E[ϕ(X, y)].

The variational Bayes algorithm is based on optimising this lower bound using only distributions of
product form over (Θ, X ). Such an optimisation is referred to as `free form'. Using (22.6),

Ay (η) ≥ ⟨µ, η⟩ − A∗y (µ)

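for any µ and hence the right hand side of Equation (22.9) has lower bound:

\[ E_{\xi, \lambda}[ \langle \mu(\Theta), \eta(\Theta) \rangle - A_y^*(\mu(\Theta)) - A(\eta(\Theta)) ] + E_{\xi, \lambda}\Big[ \log \frac{p_{\xi^*, \lambda^*}(\Theta)}{p_{\xi, \lambda}(\Theta)} \Big] \tag{22.10} \]

for any function µ(θ). The expression in (22.10), restricting to µ constant, is:

\[ ( \langle \mu, \bar{\eta} \rangle - A_y^*(\mu) - \bar{A} ) + E_{\xi, \lambda}\Big[ \log \frac{p_{\xi^*, \lambda^*}(\Theta)}{p_{\xi, \lambda}(\Theta)} \Big], \tag{22.11} \]

where η̄ = Eξ,λ [η(Θ)] and Ā = Eξ,λ [A(η(Θ))]. Using (22.8),

\[ \log \frac{p_{\xi^*, \lambda^*}(\theta)}{p_{\xi, \lambda}(\theta)} = \langle \xi^* - \xi, \eta(\theta) \rangle - (\lambda^* - \lambda) A(\eta(\theta)) - ( B(\xi^*, \lambda^*) - B(\xi, \lambda) ) \]

so that:

\[ E_{\xi, \lambda}\Big[ \log \frac{p_{\xi^*, \lambda^*}(\Theta)}{p_{\xi, \lambda}(\Theta)} \Big] = \langle \bar{\eta}, \xi^* - \xi \rangle + \langle -\bar{A}, \lambda^* - \lambda \rangle - B(\xi^*, \lambda^*) + B(\xi, \lambda). \]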
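Now recall the definition of B ∗ (Fenchel Legendre conjugate of B ):

\[ B^*(\mu_1, \mu_2) = \sup_{\xi, \lambda} \{ \mu_1 \xi + \mu_2 \lambda - B(\xi, \lambda) \}. \]

Then, since {η(θ), −A(η(θ))} are the sufficient statistics, ∂B/∂ξ = η̄ and ∂B/∂λ = −Ā, where η̄ and Ā
denote the expectations of η(Θ) and A(η(Θ)) under pξ,λ , so that:

\[ B^*(\bar{\eta}, \bar{A}) = \langle \bar{\eta}, \xi \rangle + \langle -\bar{A}, \lambda \rangle - B(\xi, \lambda). \]

Hence the decoupled optimisation problem is equivalent to maximising:

\[ \langle \mu + \xi^*, \bar{\eta} \rangle - A_y^*(\mu) + \langle \lambda^* + 1, -\bar{A} \rangle - B^*(\bar{\eta}, \bar{A}) \]

over µ ∈ My and (η̄, Ā) ∈ dom(B).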


A coordinate ascent amounts to first maximising over µ and then maximising over the mean
parameters (η̄, Ā). This generates a sequence of iterates (µ(t) , η̄ (t) , Ā(t) ). The updates are:

\[ \begin{cases} \mu^{(t+1)} = \arg\max_{\mu \in \mathcal{M}_y} \{ \langle \mu, \bar{\eta}^{(t)} \rangle - A_y^*(\mu) \} & \text{VB-E Step} \\ (\bar{\eta}^{(t+1)}, \bar{A}^{(t+1)}) = \arg\max_{(\bar{\eta}, \bar{A})} \{ \langle \mu^{(t+1)} + \xi^*, \bar{\eta} \rangle - (1 + \lambda^*) \bar{A} - B^*(\bar{\eta}, \bar{A}) \} & \text{VB-M Step} \end{cases} \tag{22.12} \]


These coordinate-wise optimisations have explicit solutions; the explicit solution of the VB-E Step
is:

\[ \mu^{(t+1)} = E_{\bar{\eta}^{(t)}}[ \phi(X, y) ]. \]

Similarly, setting

\[ (\xi^{(t+1)}, \lambda^{(t+1)}) = (\xi^* + \mu^{(t+1)}, \lambda^* + 1) \]

then

\[ \bar{\eta}^{(t+1)} = E_{(\xi^{(t+1)}, \lambda^{(t+1)})}[ \eta(\Theta) ]. \]

Literature Cited

[1] S.M. Aji and R.J McEliece [2000] The Generalised Distributive Law IEEE Transactions on Infor-
mation Theory vol. 46 pp. 325 - 343
[2] S.A. Andersson, D. Madigan, M.D. Perlman and C.M. Triggs [1997] A graphical characterisation
of lattice conditional independence models Annals of Mathematics and Artificial Intelligence vol.
21 pp. 27 - 50
[3] O. Barndorff-Nielsen [1978] Information and Exponential Families in Statistical Theory Wiley
[4] Barros, B. [2012] Incremental Learning Algorithms for Financial Data Modelling Master's Thesis,
Linköping University, Department of Mathematics LiTH-MAT-INT-A2012/01SE
[5] Beeri, C.; Fagin, R.; Maier, D.; Yannakakis, M. [1983] On the desirability of acyclic database
schemes J. Assoc. Comput. Mach. 30 pp 479 - 513.
[6] Braunstein, A.; Mézard, M.; Zecchina, R. [2005] An Algorithm for Satisfiability Random
Structures and Algorithms, vol. 27, no. 2, pp. 201 - 226
http://dx.doi.org/10.1002/rsa.20057
[7] Braunstein, A.; Zecchina, R. [2004] Survey Propagation as Local Equilibrium Equations Journal
of Statistical Mechanics: Theory and Experiment vol. 2004, no. 6 pp. P06007
https://stacks.iop.org/1742-5468/2004/P06007
[8] Brockwell, P.J.; Davis, R.A. [1991] Time Series: Theory and Methods (second edition) Springer
[9] F. Bromberg, D. Margaritis [2009] Improving the reliability of causal discovery from small data
sets using argumentation Journal of Machine Learning Research vol. 10 pp. 301 - 340
[10] D.T. Brown [1959] A Note on Approximations to Discrete Probability Distributions Information
and Control vol. 2 pp. 386 - 392
[11] Bulashevska, S.; Eils, R. [2005] Inferring genetic regulatory logic from expression data Bioinfor-
matics vol 21 no 11 pp 2706 - 2713
[12] E. Castillo, J.M. Gutiérrez, A.S. Hadi [1996] A New Method for Efficient Symbolic Propagation in
Discrete Bayesian Networks Networks vol. 28 no. 1 pp. 31 - 43
[13] E. Castillo, J.M. Gutiérrez, A.S. Hadi [1997] Sensitivity Analysis in Discrete Bayesian Networks
IEEE Transactions on Systems, Man and Cybernetics - Part A: Systems and Humans vol. 27 no.
4
[14] Cayley, A. [1853] Note on a Question in the Theory of Probabilities The London, Edinburgh, and
Dublin Philosophical Magazine and Journal of Science vol. VI. - fourth series July - December,
1853, Taylor and Francis. p. 259
[15] Cayley, A. [1854] On the theory of groups as depending on the symbolic equation θ^n = 1 Phil. Mag.
vol. 7 no. 4 pp 40 - 47
[16] Cayley, A. [1858] A Memoir on the Theory of Matrices Phil. Trans. of the Royal Soc. of London,
vol 148 p. 24


[17] Cayley, A. [1869] A Memoir on Cubic Surfaces Philosophical Transactions of the Royal Society of
London (The Royal Society) vol 159 pp 231 - 326
[18] Cayley, A. [1878] Desiderata and suggestions: No. 2. The Theory of groups: graphical representation
Amer. J. Math. vol. 1 no. 2 pp 174 - 176
[19] Cayley, A. [1889] A Theorem on Trees Quarterly Journal of Mathematics vol 23 pp 276-378
[20] H. Chan, A. Darwiche [2005] A Distance Measure for Bounding Probabilistic Belief Change Inter-
national Journal of Approximate Reasoning vol. 38 pp. 149 - 174
[21] H. Chan, A. Darwiche [2002] When do Numbers Really Matter? Journal of Artificial Intelligence
Research vol. 17 pp. 265 - 287
[22] H. Chan, A. Darwiche [2005] On the Revision of Probabilistic Beliefs Using Uncertain Evidence
Artificial Intelligence vol. 163 pp. 67-90
[23] Cheng, J.; Greiner, R.; Kelly, J.; Bell, D. A.; Liu, W. [2002] Learning Bayesian networks from
data: An information-theory based approach Artificial Intelligence vol 137 pp 43 - 90.
[24] D.M. Chickering [1995] A transformational characterization of Bayesian network structures In
Hanks, S. and Besnard, P., editors, Proceedings of the Eleventh Conference on Uncertainty in
Artificial Intelligence, pages 87 - 98 Morgan Kaufmann.
[25] Chickering, D. M. [2002] Optimal structure identification with greedy search Journal of Machine
Learning Research, 507 - 554.
[26] D.M. Chickering, D. Heckerman, C. Meek [2004] Large Sample Learning of Bayesian Networks is
NP - Hard Journal of Machine Learning Research vol. 5 pp. 1287 - 1330
[27] Chiquet, J.; Smith, A.; Grasseau, G.; Matias, C.; Ambroise, C. [2009] SIMoNe: Statistical Inference
for Modular Networks Bioinformatics 25(3):417 - 418
[28] C.K. Chow and C.N. Liu [1968] Approximating Discrete Probability Distributions with Dependence
Trees IEEE Transactions on Information Theory, vol. IT - 14 no. 3
[29] Claeskens G, Hjort NL [2008] Model selection and model averaging Cambridge University Press,
Cambridge
[30] G.F. Cooper [1990] The Computational Complexity of Probabilistic Inference using Bayesian Belief
Networks Artificial Intelligence vol. 42 pp. 393 - 405
[31] G.F. Cooper and E. Herskovitz [1992] A Bayesian Method for the Induction of Probabilistic Networks
from Data Machine Learning vol. 9 pp. 309 - 347
[32] R.G. Cowell, A.P. Dawid, S.L. Lauritzen and D.J. Spiegelhalter [1999] Probabilistic Networks and
Expert Systems Springer, New York
[33] A.P. Dawid [1992] Applications of a General Propagation Algorithm for Probabilistic Expert Sys-
tems Statistics and Computing vol. 2 pp. 25 - 36
[34] Dean, T.; Kanazawa, K. [1989] A Model for Reasoning about Persistence and Causation Compu-
tational Intelligence vol. 5, no. 2, pp.142 - 150.
[35] W.E. Deming and F.F. Stephan [1940] On a Least Squares Adjustment of a Sampled Frequency
Table when the Expected Marginal Totals are Known Annals of Mathematical Statistics vol. 11
pp. 427 - 444
[36] Dempster, A.P.; Laird, N.M.; Rubin, D.B. [1977] Maximum Likelihood from Incomplete Data via the
EM Algorithm Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1 - 38
[37] P. Diaconis and S.L. Zabell [1982] Updating Subjective Probability Journal of the American Sta-
tistical Association vol. 77 (380) pp. 822 - 830

[38] J.M. Dickey [1983] Multiple Hypergeometric Functions: Probabilistic Interpretations and Statistical
Uses Journal of the American Statistical Association, 1983, vol. 78 (383) pp. 628 - 637
[39] Drton, M.; Sturmfels, B.; Sullivant, S. [2009] Lectures on algebraic statistics Birkhäuser

[40] D. Edwards [2000] Introduction to Graphical Modelling chapter 9: Causal Inference. Springer

[41] A. Fast [2010] Learning the structure of Bayesian networks with constraint satisfaction Ph.D.
thesis, Graduate School of the University of Massachusetts Amherst, Department of Computer
Science
[42] Fisher, R.A. [1924] The Distribution of the Partial Correlation Coefficient Metron vol. 3 no. 3-4
pp. 329 - 332.
[43] D. Freedman and P. Humphreys [1999] Are there Algorithms that Discover Causal Structure? Synthese
vol. 121 pp. 29 - 54
[44] Friedman, J.; Hastie, T.; Tibshirani, R. [2010] Regularisation Paths for Generalised Linear Models
via Coordinate Descent J Stat Softw 33(1):1 - 22
[45] Friedman, N.; Nachman, I.; Pe'er, D. [1999] Learning Bayesian network structure from massive
datasets: the `sparse candidate' algorithm Proc. Sixteenth Conference on Uncertainty in Artificial
Intelligence (UAI '99) pp 196 - 205

[46] Friedman, N.; Linial, M.; Nachman, I.; Pe'er, D. [2000] Using Bayesian Networks to Analyse
Expression Data Journal of Computational Biology 7 no 3/4 pp 601 - 620
[47] Friedman, N.; Koller, D. [2003] Being Bayesian About Network Structure: A Bayesian Approach
to Structure Discovery in Bayesian Networks Machine Learning, vol. 50 pp. 95 - 125
[48] Friedman, N. [2004] Inferring Cellular Networks Using Probabilistic Graphical Models Science Vol
303 no 5659 pp 799-805 DOI: 10.1126/science.1094068

[49] Gamerman, D.; Lopes, H.F. [2006] Markov chain Monte Carlo: stochastic simulation for Bayesian
inference Chapman and Hall CRC
[50] Garcia, L.D.; Stillman, M.; Sturmfels, B. [2005] Algebraic geometry of Bayesian networks Journal
of Symbolic Computation 39 pp 331 - 355
[51] D. Geiger, T. Verma and J. Pearl [1990] Identifying Independence in Bayesian Networks Networks
vol. 20 pp. 507 - 534.
[52] Gentry J, Long L, Gentleman R, Seth, Hahne F, Sarkar D, Hansen K [2012] Rgraphviz: provides
plotting capabilities for R graph objects. R package version 1.32.0
[53] Giudici, P.; Castelo, R. [2003] Improving Markov chain Monte Carlo Model Search for Data Mining
Machine Learning vol. 50 pp. 127 - 158
[54] Goeman, J.J. [2012] penalized R package R package version 0.9-41

[55] M.C. Golumbic [2004] Algorithmic Graph Theory and Perfect Graphs Elsevier

[56] Greenland, S.; Pearl, J.; Robins, J.M. [1999] Causal diagrams for epidemiologic research Epidemi-
ology pp 37 - 48
[57] Greenland, S.; Lash, T. [2008] Bias Analysis in: Modern Epidemiology, 3rd ed., Ed. K Rothman,
S. Greenland and T. Lash, pp 345 - 380. Philadelphia: Lippincott, Williams and Wilkins.
[58] Grzegorczyk, M.; Husmeier, D. [2008] Improving the Structure MCMC Sampler for Bayesian Networks
by introducing a New Edge Reversal Move Mach. Learn vol. 71 pp. 265 - 305

[59] Hartmanis, J. [1959] Application of some Basic Inequalities for Entropy Information and Control
vol. 2 pp 199 - 213
[60] Hastie T.; Efron, B. [2012] lars: least angle regression, lasso and forward stagewise R package
version 1.1
[61] D. Heckerman [1998] A Tutorial on Learning with Bayesian Networks Report # MSR-TR-95-06
Microsoft Research, Redmond, Washington
http://research.microsoft.com/~heckerman/
[62] D. Heckerman, D. Geiger and D.M. Chickering [1995] Learning Bayesian Networks: The Combi-
nation of Knowledge and Statistical Data Machine Learning vol. 20 pp. 197 - 243
[63] Heskes, T.; Albers, C.; Kappen, H.J. [2003] Approximate Inference and Constrained Optimisation
in Proc. of the 19th Annual Conference on Uncertainty in Artificial Intelligence (UAI-03) San
Francisco, CA: Morgan Kaufmann Publishers, pp. 313 - 320
[64] Huang, Y.; Valtorta, M. [2006] Pearl's Calculus of Intervention is Complete Proceedings of the
22nd Conference on Uncertainty in Artificial Intelligence pp. 217-224 UAI Press
[65] Huang, Y.; Valtorta, M. [2008] On the Completeness of an Identifiability Algorithm for Semi-Markov
Models Ann Math Artif Intell vol. 54 pp. 363 - 408
[66] K. Humphreys and D.M. Titterington [2000] Improving the Mean - Field Approximation in Be-
lief Networks using Bahadur's Reparameterisation of the Multivariate Binary Distribution Neural
Processing Letters vol. 12 pp. 183 - 197
[67] Højsgaard, S. [2012] Graphical Independence Networks with the gRain Package for R Journal of
Statistical Software, vol. 46 no.10 pp. 1-26.
http://www.jstatsoft.org/v46/i10/
[68] Højsgaard,S.; Edwards, D.; Lauritzen, S. [2012] Graphical Models with R Springer
[69] Ide, J.S.; Cozman, F.G. [2002] Random generation of Bayesian networks In: SBIA '02: Proceedings
of the 16th Brazilian symposium on artificial intelligence, Springer, pp 366 - 375
[70] Jaynes, E.T. [2003] Probability Theory. The Logic of Science Cambridge University Press
[71] R.C. Jeffrey [1965] The Logic of Decision McGraw - Hill, New York (second ed., University of
Chicago Press, Chicago, 1983; Paperback correction, 1990)
[72] M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, L.K. Saul [1999] An Introduction to Variational
Methods for Graphical Models Machine Learning vol. 37 pp. 183 - 233
[73] Kellerer, H.G. [1991] Indecomposable marginal problems Advances in probability distributions with
given marginals: beyond the copulas, Springer Verlag, Berlin, pp 139 - 149
[74] H. Kiiveri, T.P. Speed, J.B. Carlin [1984] Recursive Causal Models J. Austral. Math. Soc. (series
A) vol. 36 pp. 30 - 52
[75] M. Koivisto and K. Sood [2004] Exact Bayesian Structure Discovery in Bayesian Networks Journal
of Machine Learning Research vol. 5 pp. 549 - 573
[76] Kuipers, J.; Moffa, G. [2015] Partition MCMC for Inference on Acyclic Digraphs preprint:
arxiv:1504.05006v1
[77] Kuroki, M.; Pearl, J. [2014] Measurement Bias and Effect Restoration in Causal Inference
Biometrika vol. 101 no. 2 pp. 423 - 437
[78] F.R. Kschischang, B.J. Frey, H-A. Loeliger [2001] Factor Graphs and the Sum Product Algorithm
IEEE Transactions on Information Theory vol. 47 February, pp. 498 - 519

[79] E. Lazkano, B. Sierra, A. Astigarraga, J.M. Martínez - Otzeta [2007] On the use of Bayesian
Networks to Develop Behaviours for Mobile Robots Robots and Autonomous Systems vol. 55 pp.
253 - 265
[80] S.L. Lauritzen, D.J. Spiegelhalter [1988] Local Computations of Probabilities on Graphical Structures
and their Applications to Expert Systems Journal of the Royal Statistical Society B (Methodological)
vol. 50 no. 2 pp. 157 - 224
[81] S.L. Lauritzen [1992] Propagation of Probabilities, Means and Variances in Mixed Graphical Association
Models Journal of the American Statistical Association vol. 87 no. 420 pp. 1098 - 1108
[82] S. Lauritzen [2001] Causal Inference from Graphical Models in Complex Stochastic Systems pp.
63 - 108, Chapman and Hall
[83] S. Lauritzen and D. Spiegelhalter [1988] Local Computations with Probabilities on Graphical Struc-
tures and their Application to Expert Systems (with discussion) Journal of the Royal Statistical
Society, Series B, vol. 50, pp. 157 - 224
[84] S. Lauritzen [1992] Propagation of Probabilities, Means and Variances in Mixed Graphical Asso-
ciation Models Journal of the American Statistical Association vol. 87 no. 420 pp. 1098 - 1108
[85] Lewis II, P.M. [1959] Approximating Probability Distributions to Reduce Storage Requirements Information
and Control vol. 2 pp 214 - 225
[86] Ma, Z.; Xie, X.; Geng, Z. [2008] Structure Learning of Chain Graphs via Decomposition J. Mach.
Learn Res 9 pp 2847 - 2880
[87] Magidson, J. [1977] Toward a Causal Model Approach for Adjusting for Pre-Existing Differences
in the Non-Equivalent Control Group Situation: A General Alternative to ANCOVA Eval. Rev.
vol. 1 pp. 399 - 420.
[88] D. Madigan, S.A. Andersson, M.D. Perlman, C.T. Volinsky [1996] Bayesian Model Averaging and
Model Selection for Markov Equivalence Classes of Acyclic Digraphs Communications In Statistics:
Theory and Methods vol. 25, no. 11 pp. 2493-2519
[89] D. Madigan and J. York [1995] Bayesian Graphical Models for Discrete Data International Statis-
tical Review vol. 63 pp. 215 - 232
[90] Mardia KV, Kent JT, Bibby JM [1979] Multivariate analysis. Academic, London
[91] Markowetz, F.; Spang, R. [2007] Inferring Cellular Networks - A Review BMC bioinformatics vol.
8 (Suppl 6) : S5
[92] McEliece, R.J.; MacKay, D.J.C.; Cheng, J.-F. [1998] Turbo Decoding as an Instance of Pearl's
`Belief Propagation' Algorithm IEEE J. Select. Areas Commun. vol. 16 pp. 140 - 152
[93] C. Meek [1995] Causal inference and causal explanation with background knowledge Proceedings
of the Eleventh Conference on Uncertainty in Artificial Intelligence pp 403 - 410
[94] Minka, T. [2001] Expectation Propagation for Approximate Bayesian Inference in Proc. of the
17th Annual Conf. on Uncertainty in Artificial Intelligence (UAI-01) San Francisco, CA: Morgan
Kaufmann Publishers, pp. 362 - 369
[95] Mooij, J.M.; Kappen, H.J. [2007] Sufficient Conditions for Convergence of the Sum-Product Algorithm
IEEE Transactions on Information Theory, vol. 53 no. 12, pp 4422 - 4437
[96] Moore, A.; Wong, W-K. [2003] Optimal Reinsertion: A new search operator for accelerated and
more accurate Bayesian network structure learning Proceedings of the Twentieth International
Conference on Machine Learning (ICML - 2003), Washington DC
[97] Murphy, K.P. [2002] Dynamic Bayesian Networks: Representation, Inference and Learning Uni-
versity of California, Berkeley, Ph.D. thesis (Computer Science)

[98] R. Neal [1992] Connectionist Learning of Belief Networks Artificial Intelligence vol. 56 pp. 71 - 113
[99] R.E. Neapolitan [2004] Learning Bayesian Networks Pearson Prentice Hall, Upper Saddle River,
New Jersey.
[100] Nelson, E. [1987] Radically Elementary Probability Theory Princeton University Press
[101] Noorshams, N; Wainwright, M.J. [2013] Stochastic Belief Propagation: A Low-Complexity Alter-
native to the Sum-Product Algorithm IEEE Transactions on Information Theory vol. 59 no. 4 pp.
1981 - 2000
[102] Opper, M.; Winther, O. [2005] Expectation Consistent Approximate Inference Journal of Machine
Learning Research vol. 6 pp. 2177 - 2204
[103] J. Pearl [1982] Reverend Bayes on Inference Engines: A Distributed Hierarchical Approach AAAI
- 82 Proceedings pp. 133 - 136
[104] J. Pearl [1987] Evidential Reasoning Using Stochastic Simulation of Causal Models Artificial Intelligence,
vol. 32, pp. 245-257.
[105] Pearl, J. [1988] Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
Morgan Kaufmann, San Mateo, CA.
[106] J. Pearl [1990] Probabilistic Reasoning in Intelligent Systems 2nd revised printing, Morgan
Kaufmann Publishers Inc., San Francisco
[107] J. Pearl [1995] Causal Diagrams for Empirical Research Biometrika vol. 82 pp. 669 - 710
[108] J. Pearl [1995] Causal Inference from Indirect Experiments Artificial Intelligence in Medicine vol.
7 pp. 561 - 582
[109] J. Pearl [2000] Causality Cambridge University Press
[110] Pearl, J.; Dechter, R. [1996] Identifying Independencies in Causal Graphs with Feedback Proceedings
of the Twelfth International Conference in Uncertainty in Artificial Intelligence (UAI'96) pp.
420 - 426, Morgan Kaufmann
[111] J. Pearl, D. Geiger and T. Verma [1989] Conditional Independence and its Representations
Kybernetika vol. 25 no. 2 pp. 33 - 44
[112] J. Pearl and T. Verma [1987] The Logic of Representing Dependencies by Directed Acyclic Graphs
Proceedings of the AAAI, Seattle, Washington pp. 374 - 379
[113] Pearl, J. [2010] On Measurement Bias in Causal Inference In: Proc. 20th Conf. Uncertainty in
Artificial Intelligence, pp. 425 - 432, Catalina Island.
[114] J.M. Pena [2007] Approximate Counting of Graphical Models Via MCMC Proceedings of the 11th
Conference in Artificial Intelligence pp. 352- 359
[115] Pistone, G.; Riccomagno, E.; Wynn, H. [2001] Algebraic Statistics: Computational Commutative
Algebra in Statistics Chapman and Hall, Boca Raton.
[116] M. Ramoni and P. Sebastiani [1997] Parameter Estimation in Bayesian Networks from Incomplete
Databases Knowledge Media Institute, KMI-TR-57
[117] Robins, J.M.; Scheines, R.; Spirtes, P.; Wasserman, L. [2003] Uniform consistency in causal
inference Biometrika vol 90 no 3 pp 491- 515
[118] R.W. Robinson [1977] Counting Unlabelled Acyclic Digraphs Springer Lecture Notes in Mathematics:
Combinatorial Mathematics V, C.H.C. Little (ed.) pp. 28 - 43.
[119] Rosenbaum, P.; Rubin, D. [1983] The Central Role of Propensity Score in Observational Studies
for Causal Effects Biometrika vol. 70 pp. 41-55

[120] Sadeghi, K.; Lauritzen, S, [2012] Markov Properties for Mixed Graphs submitted to Bernoulli,
available on arxiv
http://arxiv.org/pdf/1109.5909v2.pdf
[121] J. L. Savage [1966] Foundations of Statistics John Wiley and Sons, New York.
[122] R.D. Shachter [1998] Bayes Ball: The Rational Pass Time for Determining Irrelevance and
Requisite Information in Belief Networks and Influence Diagrams Proceedings of the 14th Annual
Conference on Uncertainty in Artificial Intelligence (ed G.F. Cooper and S. Moral) pp. 480 - 487,
Morgan Kaufmann, San Francisco, CA.
[123] Schmidt, M.; Niculescu-Mizil, A.; Murphy, K. [2007] Learning graphical model structure using
l1-regularization paths Proceedings of the National Conference on Artificial Intelligence vol 22 no
2 pp 12- 78
[124] Shpitser, I.; Pearl, J. [2006] Identification of Joint Interventional Distributions in Recursive Semi-Markovian
Causal Models In: Proceedings of the Twenty-First National Conference on Artificial
Intelligence. Menlo Park, CA: AAAI Press. pp 1219 - 1226
[125] Shpitser, I.; Pearl, J. [2008] Complete Identification Methods for Causal Hierarchy Journal of
Machine Learning Research vol. 9 pp. 1941 - 1979
[126] Spirtes, P.; Glymour, C.; Scheines, R. [1993] Causation, Prediction and Search Lecture Notes in
Statistics no. 81 Springer-Verlag New York
[127] P. Spirtes, C. Glymour and R. Scheines [2000] Causation, Prediction and Search second edition,
The MIT press.
[128] Strotz, R.H.; Wold, H.O.A. [1960] Recursive versus Nonrecursive Systems: An Attempt at Syn-
thesis Econometrica vol. 28 pp. 417-427
[129] M. Studený [2005] Probabilistic Conditional Independence Structures Springer Verlag.
[130] Sturmfels, B. [2002] Solving Systems of Polynomial Equations In: CBMS Lectures Series, Amer-
ican Mathematical Society.
[131] Sun, J.; Zheng, N.-N.; Shum, H.-Y. [2003] Stereo Matching using Belief Propagation IEEE Trans-
actions on Pattern Analysis and Machine Intelligence vol. 25 no. 7 pp. 787 - 800
[132] Tanaka, K. [2002] Statistical-Mechanical Approach to Image Processing Journal of Physics A:
Mathematical and General vol. 35 no. 37 pp. R81 - R150
http://stacks.iop.org/0305-4470/35/R81
[133] Tatikonda, S.C. [2003] Convergence of the Sum-Product Algorithm in Proceedings 2003 IEEE
Information Theory Workshop
[134] Tian, J.; Pearl, J. [2002] A General Identification Condition for Causal Effects Proceedings of
the Eighteenth National Conference on Artificial Intelligence, AAAI Press, Menlo Park California
pp. 567 - 573.
[135] Tian, J.; Pearl, J. [2002] On the Testable Implications of Causal Models with Hidden Variables in
Proceedings of UAI-02, pp. 519 - 527
[136] Tian, J.; Shpitser, I. [2010] On Identifying Causal Effects In: Dechter, R.; Geffner, H.; Halpern,
J. eds., Heuristics, Probability and Causality: A Tribute to Judea Pearl UK: College Publications,
pp. 415 - 444.
[137] I. Tsamardinos, L.E. Brown and C.F. Aliferis [2006] The Max - Min Hill - Climbing Bayesian
Network Structure Learning Algorithm Machine Learning vol. 65 pp. 31 - 78
[138] M. Valtorta, Y.G. Kim, J. Vomlel [2002] Soft Evidential Update for Probabilistic Multiagent
Systems International Journal of Approximate Reasoning vol. 29 no. 1 pp. 71 - 106

[139] Vats, D.; Nowak, R.D. [2014] A Junction Tree Framework for Undirected Graphical Model Selection
Journal of Machine Learning Research vol. 15 pp. 147 - 191
[140] T. Verma and J. Pearl [1992] An Algorithm for Deciding if a Set of Observed Independencies has a
Causal Explanation in Uncertainty in Artificial Intelligence, Proceedings of the Eighth Conference
(D. Dubois, M.P. Wellman, B. D'Ambrosio and P. Smets, eds.) San Francisco: Morgan Kaufmann
pp. 323 - 330
[141] Vorobev, N. N. [1962] Consistent families of measures and their extensions Theory of Probability
and its Applications vol. 7 pp 147 - 162
[142] M.J. Wainwright, M.I. Jordan [2003] Graphical Models, Exponential Families and Variational Inference
Technical report 649, Department of Statistics, University of California, Berkeley
[143] On the Optimality of Solutions of the Max-Product Belief-Propagation Algorithm in Arbitrary
Graphs IEEE Transactions on Information Theory vol. 47 no. 2 pp. 736 - 744
[144] Whittaker, J. [1990] Graphical models in applied multivariate statistics Wiley
[145] N. Wiberg [1996] Codes and Decoding on General Graphs Linköping Studies in Science and
Technology. Dissertation 440 Linköpings Universitet, Linköping, 1996
[146] S. Wright [1921] Correlation and Causation Journal of Agricultural Research vol. 20 pp. 557 - 585
[147] Wright, S. [1934] The method of path coefficients Ann. Math. Statist. vol 5 pp 161 - 215.
[148] X. Xie and Z. Geng [2008] A recursive method for structural learning of directed acyclic graphs
Journal of machine learning research vol. 9 pp. 459 - 483
[149] Yedidia, J.S.; Freeman, W.T.; Weiss, Y. [2005] Constructing Free-Energy Approximations and
Generalised Belief Propagation Algorithms IEEE Transactions on Information Theory, vol. 51 no.
7 pp. 2282-2312
[150] R. Yehezkel and B. Lerner [2009] Bayesian network structure learning by recursive autonomy
identification Journal of Machine Learning Research vol. 10 pp 1527 - 1570
[151] Zhang, J; Spirtes, P. [2002] Strong faithfulness and uniform consistency in causal inference Proceedings
of the nineteenth conference on uncertainty in artificial intelligence pp 632 - 639, Morgan
Kaufmann Publishers Inc.
Index

D-connected, 13
D-separation, 13
    conditional independence, 15
I-equivalence, 30
I-map, 30, 29-32
    perfect, 30
I map
    I-sub-map, 30
active
    minimal active trail, 37
    node, 37
active flow, 167
ancestor, 6
back door criterion, 60, 59-62
Bayes ball, 14
Bayesian network, 9
Bernoulli, 421
beta
    density, 239, 240, 423
    integral, 239
bipartite graph, 404
bipartite graphical model, 425
Boltzmann - Shannon entropy, 427
canonical parameters, 420
chain component, 99
Chan - Darwiche distance, 121
charge, 143
    restriction, 169
chord, 148, 209
common cause, 12
commutative law, 143
compelled edge, 42
complete, 209
conditional Gaussian distribution, 204, 203-207
    mean parameters, 205
    parametrisation, 205
    update using a Junction tree, 208-214
conditional Gaussian regression, 206
conditional independence, 7
configuration, 141
confounding, 56, 59-62
conjugate dual, 428
connected
    two nodes, 148
connection
    chain, 10
    collider, 11
    fork, 11
consistency, 171-173
    global, 172
    local, 171
contraction, 143
contraction of a charge on a junction tree, 165
controlled experiment, 46
    to establish the model within the equivalence class, 51
Cooper Herskovitz likelihood, 247
cycle, 7, 209
descendant, 6
Dickey, J.M., 253
Dirichlet
    density, 241
    integral, 241
distance, 120
    Chan - Darwiche, 121-127
    Euclidean, 120
distributive law, 143
domain, 141, 141
    extending the, 142
elimination
    domain, 152
    of a variable, 146
    order, 152
    sequence, 152
entropy, 419
Euler Gamma function, 239
evidence, 116-119
    hard, 116
    soft, 116
    virtual, 116, 119
    virtual on a DAG, 117
    virtual, Pearl's method, 126, 128
    weight of, 278
explaining away, 13
exponential distribution, 423
exponential family, 419, 419
exponential parameters, 420
factor graph, 403
factorisability, 403
factorisation, 9
    along a DAG, 9
factorisation of a probability function
    along an undirected graph, 159
fading, 246
faithful, 30
Fenchel inequality, 430
Fenchel Legendre conjugate, 427
finding
    hard, soft, virtual, 116

fire, 264
flow of messages, 163, 165-171
    CG distribution, 208
fractional updating, 245
function, 141, 141
    addition, multiplication, division, 142
function node, 404
Gaussian, 422
graph, 4
    CG decomposable, 209
    CG decomposition, 208
    chain, 99, 99
    complete, 147
    connected, connected component, 6
    decomposable, 149
    decomposition, 148
    directed, 4, 6
    directed acyclic (DAG), 7
    directed acyclic marked graph, 207
    domain graph, 144
    essential, 42, 41-375
    family, 5
    moral, 101
    simple, 4
    sub-graph, induced sub-graph, 6
    triangulated, 148
    undirected, 4, 6
    weak decomposition, 148
greedy algorithm, 314
HUGIN, 264
identifiability, 62, 59-62
immorality, 36
independence, 7
instantiated, 10
intervention
    formula, 47
    measure, 47
intervention formula, 47
iterative proportional fitting procedure, 175
Jeffrey's rule, 113, 114, 124, 128
Jensen, J.L., 234
junction tree, 154-155
    construction, 154
    factorisation along, 162
    soft evidence, 173
K2 structural learning algorithm, 313-314
Kullback Leibler divergence, 120, 234, 314, 419, 430-431
    dual form, 431
    mixed form, 431
    primal form, 431
Laplace rule of succession, 241, 252
leaf, 7, 156
likelihood
    estimate, 233
    function, 233
local surgery, 47
locally directed Markov property, 17
log likelihood function, 233
log partition function, 420, 425-427
marginal charge, 162
marginalisation, 142
    computational tree, 146
    graphical representations, 144
Markov blanket, 14, 22
Markov chain Monte Carlo
    learning the graph structure, 375
    Model Composition Algorithm, 375
Markov equivalence, 30, 29-101
    characterisation, 36-40
    theorem, 36
Markov model, 29
maximal clique, 147, 154
maximum minimum hill climbing algorithm, 325
maximum minimum parents children algorithm, 317
maximum posterior estimate, 240
mean field lower bound, 432
mean field theory, 431-435
mean parameters, 205, 426
mean posterior estimate, 242
minimal representation, 421
missing data, 244
modularity, 251
moment generating function, 204
multinomial
    sampling, 241
multivariate normal distribution, 204
naive mean field update, 433
node
    child, 5
    neighbour, 5
    parent, 5
    simplicial, 148, 154
node elimination, 151, 154
    fill ins, 151
    perfect sequence, 151
noisy `or', 93, 424
    causal network, 20
    gate, 21
    inhibitor, 20
NP hard, 285
odds, 122, 276, 277
one eye problem, 404
optimisation
    constraint based, 281
    score function, 281
over-complete representation, 421
path, 6
    directed, 6
pattern recognition, 310
PC algorithm, 317
Pearl's update, 114, 114
Poisson distribution, 423
prediction sufficiency, 249

    for a Bayesian network, 250
predictive distribution, 243
predictive probability, 241
proportional scaling, 266-272
    optimality of, 267
propositional logic, 93, 424
    disjunction, 19
QMR - DT database, 19, 424
query, 18, 263
    constraint, 269-272
query constraint, 263
regular family, 420
root, 156
    CG, 210
Savage, J.L., 253
schedule, 167, 167
    fully active, 167
sensitivity, 272
separator, 148, 154, 209
    minimal, 148, 209
Shannon, 233
Shannon entropy, 233
sigmoid belief network model, 423
Simpson's paradox, 57
skeleton, 36
statistic
    Bayesian sufficient, 247, 421
    minimal sufficient, 421
strong component, 148
structure, 284
    likelihood, 246
    prior distribution, 284
sub-tree
    base, 169
    live, 169
sufficiency
    Bayesian, 421
sum product algorithm, 403
sum product rule
    initialisation, 406
    schedule, 407
    termination, 407
sum product update rule, 406
support, 121
sure thing principle, 56-57
thumb-tack, 248
trail, 6
    active, 13
    blocked, 13
tree, 7
    rooted, 156
triangulated, 209
Turing machine, 285
update ratio, 164
variable node, 404
variational principle, 427
weight of evidence, 277
