Search Methods in Artificial Intelligence
Deepak Khemani
The text begins with an introduction to search spaces that confront intelligent agents. It
illustrates how basic algorithms like depth first search and breadth first search run into
exponentially growing spaces. Discussions on heuristic search follow along with stochastic
local search, algorithm A*, and problem decomposition. The role of search in playing board
games, deduction in logic, automated planning, and machine learning is described next. The
book concludes with coverage of constraint satisfaction.
Deepak Khemani has been actively working in the field of artificial intelligence (AI) for over
four decades - first as a student at Indian Institute of Technology (IIT) Bombay and then as a
Professor in the Department of Computer Science and Engineering at IIT Madras. Currently
he is Professor at Plaksha University, Mohali. He has three well-received courses on AI on
SWAYAM, a MOOC platform launched by the Government of India. He is also the author of
A First Course in Artificial Intelligence (2013).
Cambridge University Press
www.cambridge.org
Information on this title: www.cambridge.org/9781009284325
© Deepak Khemani 2024
This publication is in copyright. Subject to statutory exception and to the provisions
of relevant collective licensing agreements, no reproduction of any part may take
place without the written permission of Cambridge University Press & Assessment.
First published 2024
Printed in India
A catalogue record for this publication is available from the British Library
ISBN 978-1-009-28432-5 Hardback
ISBN 978-1-009-28433-2 Paperback
Cambridge University Press & Assessment has no responsibility for the persistence
or accuracy of URLs for external or third-party internet websites referred to in this
publication and does not guarantee that any content on such websites is, or will
remain, accurate or appropriate.
Contents
Preface xi
Acknowledgements xiii
1 Introduction 1
1.1 Can Machines Think? 1
1.2 Problem Solving 4
1.3 Neural Networks 6
1.3.1 Deep neural networks 8
1.4 Symbolic AI 10
1.4.1 Symbols, language, and knowledge 10
1.4.2 Symbol systems 12
1.4.3 An architecture for cognition 13
1.5 The Core of Intelligence 14
1.5.1 Remember the past and learn from it 15
1.5.2 Understand and represent the present 18
1.5.3 Imagine the future and shape it 23
A Note for the Reader 26
Exercises 26
2 Search Spaces 29
2.1 The State Space 30
2.1.1 Generate and test 31
2.2 Search Spaces 32
2.3 Configuration Problems 33
2.3.1 The map colouring problem 33
2.3.2 The N-queens problem 34
2.3.3 The SAT problem 35
2.4 Planning Problems 35
2.4.1 The 8-puzzle 36
2.4.2 River crossing puzzles 37
2.4.3 The water jug problem 38
2.4.4 The travelling salesman problem 39
2.5 The Solution Space 42
2.5.1 Constructive methods 42
2.5.2 Perturbative methods 43
Summary 44
Exercises 44
3 Blind Search 47
3.1 Search Trees 48
3.2 Depth First Search 49
3.2.1 How to stop going around in circles 51
3.2.2 Reconstructing the path 53
3.2.3 The complete algorithm 55
3.2.4 Backtracking in DFS 57
3.3 Breadth First Search 58
3.4 Comparing DFS and BFS 60
3.4.1 Space complexity 61
3.4.2 Time complexity 62
3.4.3 Quality of solution 64
3.4.4 Completeness 64
3.5 Depth First Iterative Deepening 64
3.5.1 Depth bounded depth first search 65
3.5.2 Depth first iterative deepening 66
3.5.3 Space complexity 67
3.5.4 Time complexity 67
3.5.5 Quality of solution 67
3.5.6 Completeness 69
3.6 Uninformed Search 70
Summary 71
Exercises 72
4 Heuristic Search 75
4.1 Heuristic Functions 76
4.1.1 Map colouring 77
4.1.2 SAT 77
4.1.3 The 8-puzzle 77
4.1.4 Route finding 78
4.1.5 Travelling salesperson problem 80
4.2 Best First Search 81
4.2.1 Quality of solution 84
4.2.2 Completeness 85
4.2.3 Space complexity 85
4.2.4 Time complexity 86
4.3 Local Search Methods 86
4.3.1 An optimization problem 87
4.3.2 Hill climbing 88
4.3.3 Completeness 89
4.3.4 Quality of solution 89
4.3.5 Time complexity 89
4.3.6 Space complexity 90
4.4 Heuristic Search Terrains 90
4.4.1 Hill climbing in the blocks world domain 90
4.4.2 Heuristic functions 91
4.4.3 The SAT landscape 95
Preface
This book is meant for the serious practitioner-to-be of constructing intelligent machines.
Machines that are aware of the world around them, that have goals to achieve, and the ability to
imagine the future and make appropriate choices to achieve those goals. It is an introduction to
a fundamental building block of artificial intelligence (AI). As the book shows, search is central
to intelligence.
Clearly AI is not one monolithic algorithm but a collection of processes working in
tandem, an idea espoused by Marvin Minsky in his book The Society of Mind (1986). Human
problem solving has three critical components. The ability to make use of experiences stored
in memory; the ability to reason and make inferences from what one knows; and the ability to
search through the space of possibilities. This book focuses on the last of these. In the real world
we sense the world using vision, sound, touch, and smell. An autonomous agent will need to be
able to do so as well. Language, and the written word, is perhaps a distinguishing feature of the
human species. It is the key to communication which means that human knowledge becomes
pervasive and is shared with future generations. The development of mathematical sciences has
sharpened our understanding of the world and allows us to compute probabilities over choices
to take calculated risks. All these abilities and more are needed by an autonomous agent.
Can one massive neural network be the embodiment of AI? Certainly, the human brain as
a seat of intelligence suggests that. Everything we humans do has its origin in activity in our
brains, which we call the mind. Perched on the banks of a stream in the mountains we perceive
the world around us and derive a sense of joy and well-being. In a fit of contented creativity, we
may pen an essay or a poem using our faculty of language. We may call a friend on the phone
and describe the scene around us, allowing the friend to visualize the serene surroundings. She
may reflect upon her own experiences and recall a holiday she had on the beach. You might start
humming your favourite song and then be suddenly jolted out of your reverie remembering that
friends are coming over for dinner. You get up and head towards your home with cooking plans
brewing in your head.
So, in principle at least one can imagine a massive neural network that could do all the
above. But how would it be implemented? What kind of a training process would instil all such
knowledge and memories in the neural brain? Human beings go through a lifetime of learning.
A human baby, unlike a fawn, is an utterly helpless creature and needs to be nurtured for years.
Taught in schools, influenced by peer groups, moulded through culture and religion, coached in
sports. Every human is said to be unique, even identical twins. We celebrate this diversity, even
when it is sometimes a source of crime and conflict. Are we ready for idiosyncratic machines?
Or do we aim for identical assembly line robots? But what or who would they be like? And
what about issues of fairness? And generation of harmful or misleading content?
The twenty-first century has seen an explosion in machine learning as exemplified by deep
neural networks which outperform humans on many classification tasks, and large language
models that can generate an essay, a college application, or a poem in a jiffy. Massive computing
power and humungous amounts of data have made this possible. It has been very impressive,
but has it peaked? Do we need to move on and seek another path to the Holy Grail, machines
which autonomously solve problems for us? Instead of blanket ingestion of all data on the
internet, perhaps we need to build machines which learn from human expertise to become
experts in specific domains. And do useful things for us.
This book is a step in that direction. It is designed to be a complete guide to one specific
aspect of problem solving - the use of search methods. It is intended to be one in-depth module
for the task of building AI, and its contents can be covered in a one semester course. We begin by
learning to walk with small problems, and gradually build a repertoire of search algorithms that
would allow us to navigate the high seas and vast deserts. The algorithms are general purpose,
but our representations are tailormade for the individual domains. We urge the interested reader
to implement the algorithms described here and develop a suite of search algorithms that can
be used to solve specific problems.
One common feature in all these algorithms is that they operate on symbolic data, where
symbols stand for things meaningful to us, and algorithms operate upon them. This approach
is, as hypothesized by Herbert Simon and Alan Newell in 1976, both necessary and sufficient
to create AI.
Maybe one day these many algorithms and the different problems they solve will come
together in an integrated entity as a step towards artificial general intelligence. But that will
need advances in knowledge representation where different domains and problems can be
uniformly expressed in a common language. There is work still ahead for us.
Acknowledgements
Several people have contributed in myriad ways to this book, some directly and some indirectly.
Many students in my class, both online and offline, have triggered a thought process by asking
incisive questions and making insightful observations. I gratefully acknowledge all of them
collectively. They have made the job of teaching and learning rewarding.
Baskaran Sankaranarayanan has been involved with my courses over the last few years,
answering student queries, helping with question papers and figures, and most importantly by
standardizing the way in which algorithms are written in pseudocode. He has written the appendix in
this book on algorithm style, and the algorithms in the book conform to that style.
Sutanu Chakraborti has been a long-time collaborator working in AI. He wrote a chapter on
natural language processing in my previous book, A First Course in Artificial Intelligence. In
this book he has written one chapter on machine learning, despite a pressing schedule.
I am indebted to the following who have read and commented upon parts of the book -
Nitin Dhiman, Aditi Khemani, Kamal Lodaya, Adwait Pramod Parsodkar, Devika Sethi, and
Shashank Shekhar.
I am grateful to the team at CUP for their constant support from the very beginning. Vaishali
Thapliyal took up my book proposal with gusto and got reviews from external reviewers
expeditiously, yielding some very valuable suggestions and feedback. When the manuscript
was ready Vikash Tiwari and Ankush Kumar initiated the production process immediately.
Aniruddha De and Karan Gupta have done an excellent job with proofreading and copy editing,
ironing out many discrepancies and bringing uniformity to the writing style.
Finally, I would like to thank the friends and family who have supported the book writing
in many ways.
Chapter 1
Introduction
We will adopt the overall goal of artificial intelligence (AI) to be ‘to build machines with minds,
in the full and literal sense’ as prescribed by the Canadian philosopher John Haugeland (1985).
Not to create machines with a clever imitation of human-like intelligence. Or machines
that exhibit behaviours that would be considered intelligent if done by humans - but to build
machines that reason.
This book focuses on search methods for problem solving. We expect the user to define
the goals to be achieved and the domain description, including the moves available with the
machine. The machine then finds a solution employing first principles methods based on search.
A process of trial and error. The ability to explore different options is fundamental to thinking.
As we describe subsequently, such methods are just amongst the many in the armoury of
an intelligent agent. Understanding and representing the world, learning from past experiences,
and communicating with natural language are other equally important abilities, but beyond
the scope of this book. We also do not assume that the agent has meta-level abilities of being
self-aware and having goals of its own. While these have a philosophical value, our goal is to
make machines do something useful, with as general a problem solving approach as possible.
This and other definitions of what AI is do not prescribe how to test if a machine is
intelligent. In fact, there is no clear-cut universally accepted definition of intelligence. To put
an end to the endless debates on machine intelligence that ensued, the brilliant scientist Alan
Turing proposed a behavioural test.
Since then, many programs have produced text based interactions that are convincingly
human-like, for example, ChatGPT1 being one of the latest. Advances in machine learning
algorithms for building language models from large amounts of training data have enabled
machines to churn out remarkably well structured impressive text. Humans are quite willing to
believe that if it talks like a human, then it must think like a human. Even when the very first
chat program Eliza threw back user sentences with an interrogative twist, its creator Joseph
Weizenbaum was shocked to discover that his secretary was confiding her personal woes to
the program (Weizenbaum, 1966). Pamela McCorduck (2004) has observed in Machines Who
Think that in medieval Europe people were willing to ascribe intelligence to mechanical toy
statues that could nod or shake their head in response to a question.
Clearly relying on human impressions based on interaction in natural language is not
the best way of determining whether a machine is intelligent or not. With more and more
machines becoming good at generating text rivalling that produced by humans, a need is being
felt for something that delves deeper and tests whether the machine is actually reasoning when
answering questions.
Hector Levesque and colleagues have proposed a new test of intelligence which they call
the Winograd Schema Challenge, after Terry Winograd who first suggested it (Levesque et al.,
2012; Levesque, 2017). The idea is that the test cannot be answered by having a large language
model or access to the internet but would need common sense knowledge about the world. The
test subject is given a sentence that refers to two entities of the same kind and a pronoun that
could refer to either one of them. The question is which one, and the task is called anaphora
resolution. The ambiguity can easily be resolved by humans using common sense knowledge.
The strategy is to have two variations of the sentence, each having a different word or a phrase
that leads to different anaphora resolution. One of the versions is presented to the subject with
a question about what the pronoun refers to. Guesswork on a series of such questions is only
expected to produce about half the correct answers, whereas a knowledgeable (read intelligent)
agent would do much better. The following is the example attributed to Winograd (1972).
• The town councillors refused to give the angry demonstrators a permit because they feared
violence. Who feared violence?
(a) The town councillors
(b) The angry demonstrators
• The town councillors refused to give the angry demonstrators a permit because they
advocated violence. Who advocated violence?
(a) The town councillors
(b) The angry demonstrators
In both cases, two options are given to the subject who has to choose one of the two. Here are
two more examples of the Winograd Schema Challenge, with two sets of sentences, each one
of which is presented and followed by a question.
• The trophy doesn’t fit in the brown suitcase because it’s too big. What is too big?
(a) the trophy
(b) the suitcase
1 ChatGPT: Optimizing Language Models for Dialogue. https://openai.com/blog/chatgpt/, accessed December 2022.
• The trophy doesn’t fit in the brown suitcase because it’s too small. What is too small?
(a) the trophy
(b) the suitcase
The following sentence is from the First Winograd Challenge at the International Joint
Conference on AI in 2016 (Davis et al., 2017).
• John took the water bottle out of the backpack so that it would be lighter.
• John took the water bottle out of the backpack so that it would be handy.
What does ‘it’ refer to? Again, two options are given to the subject who is asked to choose one.
The authors report that the Winograd Schema Test was preceded by a pronoun disambiguation
test in a single sentence, with examples chosen from naturally occurring text. Only those
programs that did well in the first test were allowed to advance to the Winograd Schema Test.
Here is an example from their paper which has been taken from the story ‘Sylvester and the
Magic Pebble’.
• The donkey wished a wart on its hind leg would disappear, and it did.
What vanished? The important thing is that such problems can be solved only if the subject is
well versed with sufficient common sense knowledge about the world and also the structure of
language.
A question one might ask is why should a test of intelligence be language based? After all,
intelligence manifests itself in other ways as well. Could one of these also be an indicator of
intelligence?
One area that has been proposed is in the arts, where creativity is the driving force. Computer
generated art has time and again come to the limelight. Many artworks by AARON, the drawing
artist created by Harold Cohen (1928-2016), have been demonstrated at AI conferences over
the years (Cohen, 2016). A slew of text-to-image AI systems including DALL-E, Midjourney,
and Stable Diffusion have all been released for public use recently.
Erik Belgum and colleagues have proposed a Turing Test for musical intelligence (Belgum
et al., 1989). In the fall of 1997, Douglas Hofstadter organized a series of five public symposia
centred on the burning question ‘Are Computers Approaching Human-Level Creativity?’
at Indiana University. The fourth symposium was about a particular computer program,
David Cope’s EMI (Experiments in Musical Intelligence) as a composer of music in the
style of various classical composers (Cope, 2004). A two-hour concert took place in which
compositions written by EMI and compositions written by famous human composers were
performed without identification, and the audience was asked to vote for which pieces they
thought were human-composed and which were computer-composed. Subsequently, David
Coco-Pope published an article written by a computer program EWI (Experiments in Written
Intelligence) in the style of Hofstadter, as Hofstadter himself grudgingly conceded at the end
of the article (Hofstadter, 2009).
After the 2011 spectacular win by IBM’s program Watson in the game of Jeopardy
over two players who were widely considered to be the best that the game had seen, the
company unveiled a program Chef Watson with the following claim - ‘In our application, a
computationally creative computer can automatically design and discover culinary recipes that
are flavorful, healthy, and novel!’2 The market is now abuzz with robots that can cook for you,
for example, as reported in Cain (2022).
Recently, when DeepMind’s AlphaGo program beat the reigning world champion Lee
Sedol in the oriental game of go, the entire world sat up and took notice (Silver et al., 2016).
This followed an equally impressive win almost twenty years earlier in 1997 when IBM’s Deep
Blue program beat the then world champion Garry Kasparov in the game of chess (Campbell
et al., 2002). Both the games are two person board games in which programs can search game
trees as described in Chapter 8. The challenge in these games is to search the huge trees that
present themselves. While chess is played on an 8 x 8 board, go is played on a 19 x 19 board,
which generates a much larger game tree. And yet a combination of machine learning and
selective search proved invincible. Both these games are conceptually simple even though the
search trees are large. In the author’s opinion, only when a computer program can play the
game of contract bridge at the level described in Ottlik and Kelsey (1983) can we legitimately
stake a claim to have created an AI.
Meanwhile, one should perhaps take a cue from Alan Turing himself, move away from the
bickering, and get on with the design and implementation of autonomous machines who3 do
useful things for us. In the summer of 1956, John McCarthy and Marvin Minsky had organized
the Dartmouth Conference with the following stated goal - ‘The study is to proceed on the
basis of the conjecture that every aspect of learning or any other feature of intelligence can in
principle be so precisely described that a machine can be made to simulate it’ (McCorduck,
2004). That is the spirit of our quest for AI.
Figure 1.1 An autonomous agent operates in a three stage cycle. It receives input from its
sensory mechanism, it deliberates over the inputs and its goals, and acts in the world.
neuron is connected to many other neurons and each connection has a weight that evolves with
experience. This changing of weights is associated with the process of learning.
The neurons at the sensing end of the brain accept information coming in from various
senses like sight, sound, smell, taste, and touch. The general model of processing is that once a
neuron is activated, it sends a signal down its principal nerve called the axon, which distributes
the signal to other connected neurons. The weights of the connections determine which
connected neurons receive how much of the signal. Eventually the signals reach the neurons
at the output end, sending signals down the motor neurons that activate muscles that produce
sounds from the mouth and movement of the limbs.
Some simple creatures may be just reactive, recognizing food or prey and triggering
appropriate actions, but as we move up the hierarchy, there may be more complex processing
happening in the brain, involving memory (in Greek mythology, the dog Argos recognizes
Odysseus at once when humans could not), planning (monkeys in Japan have been known to
season their food with salt water), and reasoning (remember all those experiments with mice in
mazes and Pavlov’s dog). Whatever the cognitive capability of the creature, our view of their
brains can be captured as shown in Figure 1.2.
Different life forms have differently sized brains relative to the sizes of their bodies. Earlier
life forms had simple brains often referred to as the reptilian brain. In the 1960s, the American
neuroscientist Paul MacLean (1990) formulated the Triune Brain model, which is based on
Figure 1.2 The neural animal brain. All life forms represent knowledge in the form of weights of
connections between neurons in their brain and body. The numbers do not mean anything to
us, and we say that the representation is sub-symbolic.
the division of the human brain into three distinct regions. MacLean’s model suggests that the
human brain is organized into a hierarchy, which itself is based on an evolutionary view of
brain development. The three regions are as follows:
1. Reptilian or primal brain (basal ganglia) was the first to evolve and is the one in charge of
our primal instincts.
2. Paleomammalian or emotional brain (limbic system) was the next to evolve and handles
our emotions.
3. Neomammalian or rational brain (neocortex), which is responsible for what we call
thinking.
According to MacLean, the hierarchical organization of the human brain represents the gradual
acquisition of the brain structures through evolution. The human brain, considered by many
to be the most complex piece of matter in the universe, is made up of a cerebrum, the brain
stem, and the cerebellum. The cerebrum is considered to be the seat of thought and, in humans,
comprises two halves, each having an inner white core and an outer cerebral cortex made up
of grey matter.
It is generally believed that the larger the brain, the greater the cognitive abilities of the
owner.
Figure 1.3 A neuron is a simple processing device that receives signals and generates an
impulse as shown on the left. On the right is an example of a classification problem in which
no line can be drawn to separate the shaded circle from the unshaded ones.
node is activated, indicating the class label. What the neural network has learnt is the association
between the pattern of activation in the input layer and the class label at the output layer.
The fact that the input may be an image of a scene is only in the mind of the user, as is the
name given to the class label. In the figure, the names are five animals, but the neural network
has no idea that one is talking of animals, or a particular animal like a horse or a bear. It just
knows which label to activate with a given image.
Figure 1.4 A feedforward artificial neural network learns a function from the input space to the
output classes. Learning happens via the adjustment of edge weights. The labels of the output
classes are meaningful only to the user.
This knowledge is not explicit or symbolic in the network. It is buried in the weights of
the edges from nodes in one layer to the next one. These weights are instrumental in directing
the activation from the input layer to the relevant output layer node. Nowhere in the network is
there any indication that one is looking at a giraffe or a lion. Such representations of knowledge
are often called sub-symbolic in contrast with the explicit symbolic representations we humans
commonly use.
Neural networks learn what they learn by a process of training. The most common form of
training is called supervised learning, in which a user presents input patterns to the network,
and for each input shows what the output label should be. Every time an input pattern is
presented, the network makes its own decision of what the activation value of the class label
is. For example, if a bear is shown to the network, it might compute the output values as [0.2,
0.1, 0.0, 0.4, 0.3] when the expected output is [0, 0, 0, 1, 0], indicating that it is the fourth
node (the bear). The error in the actual output defines a loss function that Backprop (as it is
also known) aims to minimize. Most variations of the algorithm compute the gradient of the
loss function with respect to the weights and do a small change in the edge weights in each
cycle to reduce the loss. This can be viewed as gradient descent, an algorithm we look at later
in the book.
The other forms of learning that are popular are unsupervised learning in which algorithms
can learn to identify clusters in data, and reinforcement learning in which feedback from the
world is used to adjust relevant weights. Reinforcement learning has achieved great success
in game playing programs that learn how to play by playing hundreds of thousands of games
against themselves, learning how to evaluate board positions from the outcomes of the games.
[Figure: a neural network mapping an input image to the output labels Pear, Litchi, Plum, Guava, and Orange.]
of open source software like Tensorflow4 from Google that makes the task of implementing
machine learning models easier for researchers.
More recently, generative neural networks have been successfully deployed for language
generation and even creating paintings, for example, from OpenAI.5 Generative models embody
a form of unsupervised learning from large amounts of data, and are then trained to generate
data like the one the algorithms were trained on. After having been fed with millions of images
and text and their associated captions, they have now learnt to generate similar pictures or stories
from similar text commands. Programs like ChatGPT, Imagen, and DALL-E have created quite
a flurry amongst many users on the internet.
Deep neural networks are very good at the task of pattern recognition. Qualitatively, they
are no different from the earlier networks, but in terms of performance they are far superior. The
main task they are very good at is classification, a task that some researchers have commented
is accomplished ‘in the blink of an eye’ by all life forms (Darwiche, 2018). The question one
might ask is what after that?
Both in the case of generative models and deep neural network based classification, one
must remember that the programs are throwing back at us whatever data has been fed to them.
They do not understand what they are writing or drawing even though there is some correlation
between the input query or command and the output generated.
For understanding and acting upon such perceived data, one needs to create models of the
world to reason with. This is best done by explicit symbolic representations, which have the
added benefit that they can contribute to explanations.
4 https://developers.google.com/machine-learning/crash-course/first-steps-with-tensorflow/toolkit, accessed
December 2022.
5 https://openai.com/blog/generative-models/, accessed December 2022.
1.4 Symbolic AI
Imagine that you are the coach of a football team watching the game when your team is down
two-nil. You have been watching the game for the best part of seventy minutes. The need of the
hour is to make a couple of changes.6 You pull out two players who are playing under their par
and send in two substitutes.
Can a neural network take such a decision? Clearly not. The knowledge of the neural
network is a kind of long term memory arrived at by training on many past examples. The
neural network does not have the apparatus to represent the world around it in a dynamic
scenario. What an agent also needs is short term memory that represents the current problem,
and facilitates reasoning about the situation and planning of actions. We will talk about short
term and long term memory in a little more detail in Chapter 7. But now we introduce the main
idea at the core of this book - symbolic reasoning.
6 Even as I write this, France has scored two goals in two minutes to draw level with Argentina in the FIFA World
Cup final of 2022.
in German, seks in Norwegian, seis in Spanish and Portuguese, shash in Sanskrit, and sitta in
Arabic. Likewise, the number seven is saat in Hindi, saith in Welsh, sieben in German, sedam
in Serbian, septem in Latin, and sapta in Sanskrit. At the same time, the same word may mean
different things in different languages, much to the consternation of uninformed travellers,
who may not realize that gift in German means poison or a toxin, and helmet in Finnish means
pearls. In addition, the diversity in the world across regions results in communities having
fine-grained words indicating small differences in what they encounter in their lives. Nordic
countries have a multitude of words for different kinds of snow. A Swede may use Kramsno for
squeezy snow, perfect for making snowballs, and Lappvante for thick, falling snow amongst
the many words that residents of Kerala may club into one word, snow. At the same time, the
people in Kerala have different names for a variety of what the Scandinavian countries might
just refer to as a banana. Ethapazham, for example, is the name of the longest banana available,
chenkadali is the red banana, and poovan a small banana.
Observe that when we talk of trees and fruits and animals, we talk of them as we perceive
them. There is a process of reification or abstraction that happens here. A human body is made
up of about 10²⁷ atoms, but we do not think of it at that level of detail. We cannot. We think
of a person as an individual and think about the body parts as individual entities. The atoms
that are part of our body are anyway transient, but our notion of the self is persistent. In his
book titled Creation, author Steve Grand (2001) highlights the fact that when we perceive
a stationary cloud atop a mountain pass, it is really moist wind blowing over it with water
molecules condensing as they reach the top and becoming visible even as they flow on. The
cloud, like our own body, is in our mind. The concepts that we form in our heads are often at
a convenient level of aggregation. In any case, humans started off by giving names to what we
see and what we do. Our visual perception system has a finite field of vision. A fascinating
chronicle on the sizes of objects in the universe, The Powers of Ten, lists different physical
entities that exist at different scales (Morrison et al., 1986). In the book, and a short movie
of the same name, the authors zoom out from human-size objects to the very ends of our
universe, and then zoom back in and onwards to the subatomic level. Our human perception
is limited from about 10⁻⁴ m where we can see a pollen grain shining in a ray of sunlight, to
larger objects - a mustard seed (10⁻³ m), a fingernail (10⁻² m), a sunbird (10⁻¹ m), a child
(10⁰ m), a small tree (10¹ m), a pond (10² m), the Golden Gate bridge (10³ m), and a small
town seen from a hill (10⁴ m). We find it easy to give names for objects at these scales. For
larger or smaller scales, we have to rely on science to inform us. We know that the diameter of
the solar system is 11,826,600,000 km, and the diameter of the Milky Way is about 100,000 light
years across. We know that a virus is about 10⁻⁷ m, and the size of the carbon atom nucleus
is about 10⁻¹⁴ m. All this is secondary knowledge derived from our scientific endeavour, even
though we often cannot visualize very large or very small distances at the extreme scales.
Quantum mechanics has further obfuscated our understanding of the world. Marcelo Gleiser
(2022) writes that quantum physics has redefined our understanding of matter. In the 1920s,
the wave-particle duality of light was extended to include all material objects, from electrons
to you. Cutting-edge experiments now explore how biological macromolecules can behave as
both particle and wave.
When we talk of the spoken word, we think of it as a symbol that stands for something.
Symbols take on a life of their own when they are represented by tangible marks, which are not
transitory like sounds, but have a degree of permanency associated with them.
The earliest humans known to engrave symbols on clay were the Sumerians in ancient
Mesopotamia, which is often known as the cradle of civilization. The first engravings were
pictographs, but soon evolved into more abstract entities like symbols from an alphabet. The
earliest form of writing was cuneiform writing.
First developed around 3200 B.C. by Sumerian scribes in the ancient city-state of
Uruk, in present-day Iraq, as a means of recording transactions, cuneiform writing
was created by using a reed stylus to make wedge-shaped indentations in clay tablets.
Cuneiform as a robust writing tradition endured 3,000 years. The script - not itself a
language - was used by scribes of multiple cultures over that time to write a number of
languages other than Sumerian, most notably Akkadian, a Semitic language that was
the lingua franca of the Assyrian and Babylonian Empires.7
It was replaced by alphabetic writing sometime after the first century AD. The breakthrough
came when symbols were not only employed as images representing objects and events like a
hunt, but abstract entities like sounds. A set of symbols forms an alphabet. Alphabetic symbols
could now come together to form words, and words could form sentences. The spoken word
became the written word. Different natural languages evolved in many regions of the world.
The common theme was writing.
The faculty of language in turn created a mechanism of knowledge dissemination. Starting
with stories in the oral tradition, the invention of writing made it possible for us to leave a
permanent imprint for anyone to read at any time in any place. The invention of the internet
made all this information available for everyone instantaneously.
The basis of the written word was the idea of symbols.
• Symbol: A perceptible something that stands for something else. For example, alphabet
symbols, numerals, road signs, musical notation.
• Symbol System: A collection of symbols - a pattern. For example, words, arrays, lists,
even a tune.
7 https://www.archaeology.org/issues/213-1605/features/4326-cuneiform-the-world-s-oldest-writing, accessed
September 2022
• Physical Symbol System: That obeys laws of some kind, a formal system. For example,
long division, computing with an abacus, an algorithm that operates on a data structure
(which is a symbol system).
The idea of symbolic reasoning goes back to olden times. John Haugeland (1985) traces the
evolution of the idea of thinking being symbolic to medieval Europe, reproduced here: Galileo
Galilei (1564-1642) in The Assayer (published 1623) says that ‘tastes, odors, colors, and so
on are no more than mere names so far as the object in which we locate them are concerned,
and that they reside in consciousness’. Further, that ‘philosophy is written in this grand book,
the universe ... It is written in the language of mathematics, and its characters are triangles,
circles, and other geometric figures’. Galileo Galilei gave us what we call the laws of motion,
and his explanations were expressed in geometry. The English philosopher Thomas Hobbes
(1588-1679) first put forward the view that thinking itself is the manipulation of symbols.
Galileo had said that all reality is mathematical in the sense that everything is made up of
particles, and our sensing of smell or taste was how we reacted to those particles. Hobbes
extended this notion to say that thought too was made up of (expressed in) particles which the
thinker manipulated. However, he had no answer to the question of how a symbol can mean
anything. In De Corpore, Hobbes first describes the view that reasoning is computation early
in Chapter 1. ‘By reasoning’, he says, ‘I understand computation.’ Hobbes was influenced by
Galileo. Just as geometry could represent motion, thinking could be done by manipulation
of mental symbols. Rene Descartes (1596-1650) further extended the idea by saying that
‘thoughts themselves are symbolic representations’. Descartes was the first to clarify that a
symbol and what it symbolizes are two different things, but then he ran into the mind-body
dualism. If reasoning is the manipulation of meaningful symbols according to rational rules,
then who is manipulating the symbols? It can be either mechanical or meaningful, but how can
it be both? How can a mechanical manipulator pay attention to meaning? These are some of the
questions we are still to find answers for.
Figure 1.6 An architecture for cognition. In classical AI, an intelligent agent senses the world
around it and maps it to a symbolic representation making inferences, and planning for
its goals.
these pixels and extracts information about the objects in the image do we say that we have
a symbolic representation. Early work on pattern recognition was syntactic in nature, for
example, as described in Gonzalez and Thomason (1978). One would extract edges in the
image and apply grammar rules to combine edges to (say) recognize handwritten characters.
Processing complex images was not feasible at all. However, neural networks have proven to
be excellent at processing images and recognizing patterns and individuals.
In our proposed architecture, the task of deliberation is done using symbolic reasoning. As
shown in Figure 1.6, the deliberation phase may invoke many different algorithms. We outline
the different processes in the next section.
The question of meaning was also raised by the American philosopher John Searle (1932-)
with his Chinese room thought experiment in which a native English speaker locked inside a
room with boxes of symbols and a set of instructions on how to manipulate them could answer
questions in the Chinese language slipped in below the door on pieces of paper, without understanding a
word of the Chinese language.
The digital computer manipulates symbols based on a set of instructions given to it. Does
it understand the meaning of the symbols that it is manipulating? If it is adding two numbers,
does it know that it is adding numbers? Do all of us understand what a number is (McCulloch,
1961)? Or if it is beating a world champion in the game of chess, does it even know that it is
playing chess, or what winning is?
We will sidestep these questions on meaning and focus instead on utility and meaningful
action. Build machines that operate in a purposeful goal directed manner.
In this book we assume that our goal is to build machines that autonomously solve problems
for us, and that the goals of the machines are the goals we have given them to solve. Given a
problem to solve, and given a set of operators in its repertoire, the task of the problem solver is
to choose actions that will achieve the goal.
At the core is the ability to create a model of the world in which the agent is operating, and
reason about its goals, plans, and actions with the representation. The model of the world is the
base for all cognitive activity. This model contains the memories of the agent, lessons learnt,
and the representation of the world in which it operates. The agent needs the ability to imagine
worlds that are not immediately perceptible, or which the agent may desire to create.
Broadly speaking, there are three kinds of processes that come together to solve a problem,
and which form the core of intelligent behaviour.
• ‘Similar problems have similar solutions’ is the adage behind memory based reasoning.
• And, importantly, problems are indeed often similar.
In the simplest form, CBR maintains a case base of problem solution pairs <p, s>. The problem
part of a case is a description of the problem that the case solves. The description may be
attribute value pairs or it could be in natural language text. The following is the 4R cycle that
CBR follows; a small code sketch of the cycle appears after the list.
Figure 1.7 A memory based agent employs the knowledge stored in its memory to solve a
problem in the domain. Based on the outcome of each instance of solving a problem, the
agent refines its knowledge and improves over time.
• Retrieve: When the agent encounters a new problem, it searches the case base for the most
similar problem. Often more than one case is retrieved in the style of k nearest neighbours
retrieval.
• Reuse: The solution that is retrieved along with the case is adapted to the current problem.
This could involve identifying the parameters in which the current description differs from
the retrieved one, and adjusting the solution part accordingly.
• Revise: If the solution does not work, tweak it. This could involve human intervention.
• Retain: If the revised solution is significantly different, add it to the case base. The next
time a similar problem shows up, this could be useful.
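The sketch below, not taken from the book, illustrates the cycle in Python, assuming cases are described by numeric attribute-value pairs and retrieval is nearest-neighbour; the case base, the attribute names, and the trivial adaptation step are invented for illustration.

```python
from math import dist

# Hypothetical case base: (problem description, solution) pairs.
case_base = [
    ({"length": 2.0, "width": 1.0}, "layout-A"),
    ({"length": 3.5, "width": 1.5}, "layout-B"),
    ({"length": 5.0, "width": 2.0}, "layout-C"),
]

def retrieve(problem, k=1):
    """Retrieve: find the k most similar cases (nearest neighbours on the attributes)."""
    def distance(case):
        desc, _ = case
        return dist(list(desc.values()), list(problem.values()))
    return sorted(case_base, key=distance)[:k]

def reuse(problem, case):
    """Reuse: adapt the retrieved solution to the new problem (trivially, here)."""
    desc, solution = case
    return solution      # a real system would adjust parameters of the solution

def retain(problem, solution):
    """Retain: add the (possibly revised) solution back into the case base."""
    case_base.append((problem, solution))

new_problem = {"length": 3.2, "width": 1.4}
best = retrieve(new_problem)[0]
solution = reuse(new_problem, best)
retain(new_problem, solution)
print(solution)          # -> layout-B
```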
CBR has been particularly useful in domains that are not well modelled and where the problem
solving knowledge is more experiential than analytical. One of the earliest successes was the
Clavier system developed to cure aircraft parts at Lockheed Missiles and Space Company
in Sunnyvale, California (Watson, 1997). The task involved placing parts made of composite
materials that kept changing in an autoclave, which is an expensive resource. The quality of the
product depended on where it was placed on the tray, and operators were essentially following
similar layouts from the past. Curing is an unrewarding art, rather than a science, but Clavier
reduced the discrepancy reports considerably as its case base grew from an initial twenty to
several thousands. Figure 1.8 shows a schematic of a CBR system employed for knowledge
management in a manufacturing setting (Khemani et al., 2002).
CBR is a form of instance based learning (see also Chapter 11) in which the system
memorizes past experiences and remembers them. Another approach is to assimilate the
knowledge accrued from experience into compact structures that can be used. Neural
Figure 1.8 A CBR system in the manufacturing industry. The data recorded from different
locations in the shop floor is assimilated into a case base. The resulting system can have
multiple applications. The CBR is essentially a tool for knowledge management.
networks are examples of such learning, but there have also been more explainable
structures like decision trees, where attributes and their values used for classification can
be read from the path in the tree. Consider a small data set shown in Table 1.1 for the sake
of illustration. There are three attributes A, B, and C in this data set with values {a1, a2,
a3}, {b1, b2, b3}, and {c1, c2} respectively. Each row in the table carries one of two class
labels, ‘Yes’ or ‘No’.
One could of course use CBR for prediction given a new problem in which the values of
the three attributes are given. But for such well defined domains, it is convenient to build a
decision tree. A decision tree is a discrimination tree that tests for the value of one variable at
the root node, and then traverses an appropriate branch to test the value of another variable. The
algorithm for constructing a decision tree with nominal attribute values is the well known ID3
algorithm (Mitchell, 1997). The basic idea behind the algorithm is to choose that attribute that
separates the two classes as cleanly as possible. A decision tree for the data in Table 1.1 is shown
in Figure 1.9.
When a new record of the values for A, B, and C comes in, it is dropped down the tree.
At each node, the value of some attribute is tested and the record follows an appropriate
branch. Leaves in the tree are labelled with class information. Observe that other trees may
be possible, testing a different attribute at each stage. The ID3 algorithm is designed to build
short trees; a small code sketch of its attribute selection appears after Figure 1.9.
Table 1.1
A B C Outcome
a1 b1 c1 Yes
a1 b1 c2 No
a1 b2 c1 Yes
a1 b3 c2 No
a1 b2 c2 No
a2 b1 c1 Yes
a2 b2 c1 Yes
a2 b1 c2 Yes
a3 b1 c1 Yes
a3 b2 c1 Yes
a3 b1 c2 No
a3 b2 c2 No
Figure 1.9 A decision tree based on the data from Table 1.1. The root node tests for the
attribute A and traverses the appropriate branch. The leaf nodes have the class labels.
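As an illustration, and not taken from the book, the following Python sketch computes the information gain of each attribute for the data in Table 1.1; this is the quantity an ID3-style learner uses to decide which attribute to test at a node.

```python
from collections import Counter
from math import log2

# Table 1.1 as (A, B, C, Outcome) tuples.
data = [
    ("a1", "b1", "c1", "Yes"), ("a1", "b1", "c2", "No"),
    ("a1", "b2", "c1", "Yes"), ("a1", "b3", "c2", "No"),
    ("a1", "b2", "c2", "No"),  ("a2", "b1", "c1", "Yes"),
    ("a2", "b2", "c1", "Yes"), ("a2", "b1", "c2", "Yes"),
    ("a3", "b1", "c1", "Yes"), ("a3", "b2", "c1", "Yes"),
    ("a3", "b1", "c2", "No"),  ("a3", "b2", "c2", "No"),
]
attributes = {"A": 0, "B": 1, "C": 2}

def entropy(rows):
    """Entropy of the class labels (Yes/No) in a set of rows."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(rows, col):
    """Reduction in entropy obtained by splitting the rows on one attribute."""
    total = len(rows)
    remainder = 0.0
    for value in {row[col] for row in rows}:
        subset = [row for row in rows if row[col] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(rows) - remainder

for name, col in attributes.items():
    print(name, round(information_gain(data, col), 3))
# An ID3-style learner tests the attribute with the highest gain at a node.
```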
We the thinking creatures create our own worlds in our minds. And it is only our own
creation that is meaningful to us. We create categories in our heads, for example, a human, a
fox, a river, and a tree. We define an ontology in terms of which we represent the world. Every
domain of study from physics to chemistry, to biology, to economics defines its own ontology.
Or its own terminology. While the notion of an ontology has roots in philosophy, it has found
a formal definition in computer science, as an explicit specification of a conceptualization
(Gruber, 1993).
Behind every word of a language there sits a concept, and a knowledgeable agent relates
that concept to others, within an ontology. For example, we associate the word ‘banana’ with
a particular kind of fruit, growing on a particular kind of a single stem tree, which has a skin
that can be peeled off before eating, and which has leaves that can be used to serve a meal on.
Likewise, with verbs we can conjure up actions mentally, for example, jogging. In any case,
whenever we use words in a language, they just stand for some concepts that an agent may have
in its knowledge representation scheme. Roger Schank and his group at Yale University showed
that the moment one talks of a person going into a restaurant one needs to retrieve all that one
knows about what typically happens in a restaurant to make sense of the conversation (Schank
and Abelson, 1977; Schank and Riesbeck, 1981). At around the same time, Marvin Minsky
(1975) published his idea of frames for knowledge representation, which eventually led to the
ideas of object oriented programming.
With the advent of the internet, when programs could talk to each other, defining ontologies
gained prominence. Computational ontologies are a means to formally model the structure
of a system, that is, the relevant entities and relations that emerge from its observation and
which are useful for our purposes. The backbone of an ontology consists of a generalization/
specialization hierarchy of concepts, that is, a taxonomy (Guarino et al., 2009).
Figure 1.10 shows a snippet of a sample ontology represented as a frame system. The
shaded squares are abstract frames, corresponding to concepts in an ontology. The IS-A slot in
a concept defines an abstraction hierarchy, for example, ‘a dog is a mammal’. The unshaded
nodes represent individuals or instances of concepts. An ontology may have other kinds of
links, for example, the fact that ‘Ted is a dog owned by Socrates who is a human’.
The idea of semantic networks was already well developed. A semantic network is
a graphical model in which nodes representing concepts are connected with labelled edges
representing relations. Early work on semantic nets was motivated by natural language
processing. Ross Quillian is often credited with crystallizing the idea (Quillian, 1967, 1968).
Subsequently, the idea of semantic nets evolved into the idea of knowledge graphs, which
were semantic networks spread over the internet. The idea of the Semantic Web evolved in the
twenty-first century. In 2012, Google adopted the term ‘knowledge graph’ (Singhal, 2012).
A knowledge graph is a collection of nodes and named edges. We can create an abstract
type called event and describe the rest of the relations for an instance of that event. It has
become common to express these as triples <subject, predicate, object> or <subject, property,
value>. For example, here is an incident of kids fighting: ‘Divya hit Atul with a stick yesterday
afternoon in a park.’
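As a small sketch of the idea, here is that incident reified as an event node and stored as <subject, predicate, object> triples in Python; the event identifier and the predicate names are invented for illustration.

```python
# Reify the incident as an event node 'e1' and record everything about it as
# <subject, predicate, object> triples.  Predicate names are illustrative.
triples = [
    ("e1", "instance_of", "HittingEvent"),
    ("e1", "agent",       "Divya"),
    ("e1", "patient",     "Atul"),
    ("e1", "instrument",  "stick"),
    ("e1", "time",        "yesterday afternoon"),
    ("e1", "location",    "park"),
]

def objects(subject, predicate):
    """Query the store: which objects does this (subject, predicate) pair point to?"""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("e1", "agent"))       # ['Divya']
print(objects("e1", "instrument"))  # ['stick']
```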
Figure 1.10 A snapshot of an ontology captured in a frame system. The shaded nodes are
abstract concepts, and the unshaded ones are instances of concepts. An ontology may define
an abstraction hierarchy as well as other kinds of relations.
Figure 1.11 shows how information about musical performances and poetry could be
represented as a knowledge graph.
Representation is only one side of the coin. Reasoning is the other.
We are never privy to everything there is to know. We have partial knowledge of the
world and can try and fill in by making inferences. The process of making inferences is called
reasoning. There are three kinds of inferences. All are in some way connected with a logical
relation often captured as ‘IF antecedent THEN consequent’, expressed as a sentence in some
logic. When the sentence is true, then it often expresses a causal connection from the antecedent
to the consequent. But as Judea Pearl (2019) has shown us, there can be confusion between
causality and correlation, which has been exploited by the tobacco industry to contest the
connection between smoking and cancer.
• Deduction: From a given set of facts, infer another fact that is necessarily true. Deduction
is the bread and butter of logic. We study deduction in Chapter 10. Deduction is sound
because it goes from the antecedent to the consequent. For example, the statement ‘If
X is a trapezium then X is a quadrilateral’ is true by definition. So if anyone has drawn
a trapezium, then she has drawn a quadrilateral. Deduction only makes explicit what is
implicit in the knowledge base.
Figure 1.11 A triple <subject, predicate, object> store can store heterogeneous information in
a knowledge graph, with nodes connected by directed edges. Each labelled edge goes from
subject to object and is labelled by the property or the predicate. The figure shows a snippet of
a knowledge graph relating to music and poetry.
• Induction: From a given set of facts, infer a new fact. Also known as generalization.
Induction can create new knowledge. Recognizing that a number of entities in the domain
share some common property and asserting that as a general statement. For example, from
the observations,
the lawn will be wet. But we also know that if the sprinkler is on, the lawn will be wet, or if
children are playing with a hose, the lawn will be wet, or if there was a flood (as is common
in these times of climate change), the lawn will be wet. Then if we observe that the lawn is
wet, how can we infer the cause? Which causal connection shall we make use of? Larson
calls it the selection problem for AI. And yet, we manage to make abductive inferences all
the time. Very often we use other facts we know. For example, we may know that the sky
has been clear and so rain cannot be the cause of the grass being wet. Medical diagnosis,
incidentally, is making abductive inferences. A doctor may suspect COVID if you have
cough and cold, especially if a new wave has started. But the doctor relies on a clinical
test to validate the hypothesis. Observe that we do not face this difficulty with deduction
in which if we know the antecedent of a rule to be true, then the consequent necessarily
follows.
Larson says that as humans the majority of the inferences we do are abductive in nature, and
that is why they can be error prone too. The following scenario illustrates plausible inferences.
If you are running with the ball in a football game, you need to be aware of where the other
players are and what they intend to do. This inference of intention comes from background
knowledge about the strategies and tactics used by the team. You should be able to imagine that
if you kick the ball to where your teammate should be running to, then he would have a better
shot at the goal. The opponents no doubt are thinking about it too. Why is the opposing team
player running towards that spot? Making inferences is the basic cognitive act for intelligent
minds and we are constantly making inferences.
Another example is the work done by Roger Schank and his group on knowledge of stereotypical
situations, which is instrumental in generating expectations about what must have
happened and what to expect. If we hear that ‘John went to a restaurant. He ordered a masala
dosa. He left satisfied,’ we can imagine what must have happened because we have knowledge
about how restaurants function, even though the story is cryptic. We know that he must have eaten
the dosa, and must have paid for it, because that is the normal behaviour in a restaurant.
In summary, the agent must be able to reason with what it knows to infer what is implicit
(deduction) or even what is unknown (induction) to create new knowledge. It must be able to
hypothesize connections between facts and events (abduction) to anticipate what is happening
in the world around it, and what other agents are up to. It must be able to recognize intentions
and plans of collaborators and adversaries, make its own plans, evaluate and choose the best
ones, execute them, monitor them as they are executed, replan if necessary or take advantage of
an unexpected opportunity. It must also be able to use the science of probability to judge which
of its possible decisions is most likely to succeed.
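To make the deductive step concrete, here is a tiny forward-chaining sketch, not the book's algorithm, that repeatedly applies IF-THEN rules whose antecedents are already known facts until nothing new can be added; the rules extend the trapezium example above and are otherwise invented.

```python
# Rules as (antecedents, consequent): IF all antecedents hold THEN the consequent holds.
rules = [
    ({"X is a trapezium"}, "X is a quadrilateral"),
    ({"X is a quadrilateral"}, "X is a polygon"),
]
facts = {"X is a trapezium"}

# Forward chaining: make explicit what is implicit in the knowledge base.
changed = True
while changed:
    changed = False
    for antecedents, consequent in rules:
        if antecedents <= facts and consequent not in facts:
            facts.add(consequent)
            changed = True

print(facts)   # now includes 'X is a quadrilateral' and 'X is a polygon'
```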
The previous two subsections have described in a nutshell those aspects of AI that would
each need a complete book for any justice to be done. We have briefly dwelt upon these to
highlight the fact that they are necessary for building intelligent systems, along with other
processes shown in Figure 1.6.
While all these are necessary, we now come to the subject matter of this book - solving
problems by first principles by projecting decisions into the future to tease out those that solve the
problem. The search methods that we study in this book arrive at solutions by trying out different
options available to the agent. In the following section we outline the contents of this book.
8 Consider the two sayings - ‘Out of sight, out of mind’ versus ‘Absence makes the heart grow fonder’.
Chapter 4 introduces informed search aimed to guide the search towards the goal instead
of the blind or uninformed search methods of Chapter 3. We introduce the idea of a heuristic
function h(n) that looks at a state n and computes an estimate of its distance to the goal state.
Search that employs such a heuristic function is called heuristic search. From the set of all
available candidates, the algorithm best first search picks that node that appears to be closest
to the goal. We show that if a solution exists, the algorithm will find it. That is, it is complete
for finite state spaces. The performance of the algorithm depends upon the quality of the heuristic
estimate, and it has been empirically found that most implementations still need exponential
time and space.
In an effort to save on space, we resort to local search. The algorithm hill climbing burns its
bridges and considers only the neighbours of the current state (or candidate solution). Instead
of using the GoalTest function, it looks for an optimum value of the heuristic function. While
it does result in reduced complexity, it loses out on completeness. It does not guarantee finding
a solution and can get stuck in a local optimum. Then begins the quest for variations in local
search more likely to succeed in finding a solution. We look at the algorithm beam search that
explores more than one path, and at tabu search that allows a search algorithm to get off a
local optimum and continue exploration. We also introduce an aspect of stochastic search with
iterated hill climbing.
Chapter 5 is devoted to stochastic local search methods. All the algorithms in the chapter
draw inspiration from processes in nature. Simulated annealing begins with randomized moves
and gradually makes them deterministic, reminiscent of the annealing process used to form
materials in the lowest energy states. Genetic algorithms mimic the process of survival of
the fittest in natural selection and mix and churn the components available in a population
of candidates. We look at how genetic algorithms solve the travelling salesperson problem.
Finally, we introduce the ant colony optimization algorithm that draws inspiration from how
ants communicate with each other via pheromone trails and collaborate to find shortest paths.
All three algorithms studied are popular in the optimization community.
The algorithms studied so far do not guarantee an optimal solution (except breadth first
search). Chapter 6 introduces the well known algorithm A* that employs heuristic search and
also guarantees an optimal solution even for infinite state spaces. We say that the algorithm is
admissible and we present a proof of its admissibility. We introduce the problem of sequence
alignment from bioinformatics where A* is applicable. We then look at space saving variations
of A* that can solve much bigger problems than A* can.
Chapter 7 looks at problem decomposition and how problems can be solved in parts. We
begin by looking at pattern directed inference systems in which an algorithm looks for patterns
in the given state and triggers appropriate actions. The production system architecture lays the
foundations of building rule based expert systems. We present the Rete algorithm which is an
efficient implementation that is used in many business rule management systems. Such systems
also serve as a vehicle of declarative programming exemplified by the language OPS5 in which
the user just writes the rules and an inference engine decides which rules to apply to what data.
We then look at a goal directed approach to breaking down problems into subproblems
with And-Or graphs. The idea is to decompose the problem into simpler subproblems and
continue the process till the reduced problems are primitive problems with trivial solutions. We
present the algorithm AO* that can be used to find an optimal least cost decomposition strategy.
Finally, Chapter 12 looks at constraint satisfaction which offers the tantalizing possibility
of integrating the different kinds of processes needed for intelligence into one. We study finite
domain constraint satisfaction problems, and show how search and reasoning can be combined,
and look at some algorithms where reasoning is effective in reducing the search effort. The
most attractive feature of constraints is that they offer a unifying formalism for representation,
while solutions can be found by general purpose methods. Eugene Freuder (1997), one of the
founding figures in constraint programming, has said - ‘Constraint Programming represents
one of the closest approaches computer science has yet made to the Holy Grail of programming:
the user states the problem, the computer solves it’.
Exercises
1. Alan Turing prescribed the Imitation Game as a test of whether machines can think or
not. We call the test the Turing Test. Discuss the merits and demerits of the Turing Test.
Is the ability to chat intelligently a sufficient indicator of intelligence? What is the role of
world knowledge in a meaningful conversation? How should a machine react to a topic it
does not know about and still convince the judge that he is chatting with a human? Should
a machine introduce errors or delays intentionally, for example, when given a massive
arithmetic problem?
2. Devise three sets of questions for the Winograd Schema Challenge that would require
world knowledge to answer correctly.
3. Natural language is notoriously ambiguous, a fact that has been widely exploited to create
punch lines that surprise the listener. For example, ‘Time flies like an arrow, fruit flies like
a banana’ sometimes attributed to Groucho Marx. When humans parse language, they start
building a semantic picture of what they are listening to. Garden path sentences force the
listener to abandon an initial likely interpretation after hearing the complete sentence. For
example, ‘The old man’s glasses were filled with sherry’. Given the sentence ‘She shot
the girl with the rifle’, how would a computer chat program answer the question:
‘Who had the rifle?’
4. In Chapter 8, we discuss games as models of rational behaviours aimed at maximizing
the agent’s own payoff. While this is an economic model that explains why individuals,
corporates, and nations behave the way they do, what does it say about the collective
intelligence of humankind whose focus is on arms manufacture, sale, and use, even while
climate change looms upon us?
Chapter 2
Search Spaces
In this chapter we lay the foundations of problem solving using first principles. The
first principles approach requires that the agent represent the domain in some way
and investigate the consequences of its actions by simulating the actions on these
representations. The representations are often referred to as models of the domain and
the simulations as search. This approach is also known as model based reasoning,
as opposed to problem solving using memory or knowledge, which, incidentally, has
its own requirements of searching over representations, but at a sub-problem solving
retrieval level.
We begin with the notion of a state space and then look at the notion of search spaces
from the perspective of search algorithms. We characterize problems as planning
problems and configuration problems, and the corresponding search spaces that
are natural to them. We also present two iconic problems, the Boolean satisfiability
problem (SAT) and the travelling salesman problem (TSP), among others.
In this chapter we lay the foundations of the search spaces that an agent would explore.
First, we imagine the space of possibilities. Next, we look at a mechanism to navigate this
space. And then in the chapters that follow we figure out what search strategy an algorithm can
use to do so efficiently.
Our focus is on creating domain independent solvers, or agents, which can be used to solve
a variety of problems. We expect that the users of our solvers will implement some domain
specific functions1 in a specified form that will create the domain specific search space for our
domain independent algorithms to search in. In effect, these domain specific functions create
the space, which our algorithm will view as a graph over which to search. But the graph is
not supplied to the search algorithm upfront. Rather, it is constructed on the fly during search.
This is done by the user supplied neighbourhood function that links a node in this graph to
its neighbours, generating them when invoked. The neighbourhood function takes a node as
an input and computes, or returns, the set of neighbours in the abstract graph for the search
algorithm to search in.
1 Throughout this book, we will use the word ‘function’ as a synonym for a program. The function accepts the input
as an argument and returns the output it computes. This is in accordance with the style of functional programming.
Figure 2.1 A state space is the set of possible states. Each state can be seen as a node in
an implicit graph. A neighbourhood function we call MoveGen(N) takes a node as input and
returns the neighbours of N that can be reached in one move. We call the given state the
start state S and a desired state a goal state G. There may be more than one goal state.
The figure shows the neighbours of the state S generated by MoveGen. The graph is generated
on the fly by repeated calls to MoveGen.
The state space can be seen as a graph. However, it is a graph that is implicit and defined
by the MoveGen function, as depicted in Figure 2.1. Observe that we have shown the edges in
this figure as directed edges. This is only to highlight the exploration away from the start state
S. In practice, the edges in a state space may be directed when the moves are not reversible, or
undirected when they are. For example, in the domain of cooking, chopping vegetables cannot
be undone, but in the river crossing problem described later, one could simply row the boat
back. In some problems some moves may be reversible while others are not, for example, the
water jug problem described later. The MoveGen function captures all these variations. For
any node as input, it returns the set of neighbour nodes reachable in one move.
MoveGen(N)
Input: a node N
Output: the neighbours of node N
For the moment, we will not name the moves, and simply rely on the MoveGen function
to provide the resulting states. Later, in Chapter 9, we will look at how the automated planning
community explicitly represents actions and reasons with them.
GoalTest(N)
Input: a node N
Output: true if N is a goal state, and false otherwise
The GoalTest function is a predicate that returns true if the input is a goal state. The high
level search algorithm maintains a set of candidate states, which is traditionally called OPEN,
and repeatedly picks some node N from OPEN, till it picks a goal node. OPEN is initialized to
the start state S. It returns the goal state if it finds one, else it returns fail. That happens if there
is no way of reaching the goal state. In graph theory terms, there is no path to the goal state.
Algorithm 2.1 Algorithm SimpleSearch picks some node N from OPEN. If N is the goal,
it terminates with N, else it calls MoveGen and adds the neighbours of N to OPEN.
SimpleSearch()
1 OPEN ← {S}
2 while OPEN is not empty
3     do pick some node N from OPEN
4        OPEN ← OPEN − {N}
5        if GoalTest(N) = TRUE
6           then return N
7           else OPEN ← OPEN ∪ MoveGen(N)
8 return FAILURE
This problem solving strategy is also known as generate and test. It embodies the strategy
of navigating the space, in some order, generating candidates one by one, and testing each
for being a goal state. As one can imagine, the choice of which node to pick from OPEN will
determine how quickly the algorithm terminates. This choice will be addressed in the next few
chapters.
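The generate and test loop can also be written compactly in a programming language. The following Python sketch is only an illustration of Algorithm 2.1; the representation of OPEN as a set and the random choice of the next candidate are assumptions made for this sketch, since the algorithm deliberately leaves the choice unspecified.

import random

def simple_search(start, move_gen, goal_test):
    # OPEN holds the candidate states waiting to be inspected
    open_states = {start}
    while open_states:
        # pick some node from OPEN; here the choice is made at random
        node = random.choice(list(open_states))
        open_states.remove(node)
        if goal_test(node):
            return node
        # add the neighbours of the node to OPEN
        open_states |= set(move_gen(node))
    return None   # FAILURE: no goal state could be generated

Like Algorithm 2.1, this sketch keeps no record of the states already inspected, so on a state space with cycles it may never terminate; the refinements that address this appear in Chapter 3.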
Figure 2.2 Given a tiny state space with a start state S and a goal state G, the simple search
algorithm defines a search space which is a tree. Starting with S, it can go down any branch in
search of a goal node. Observe that the tree has some infinite branches.
As the simple search algorithm does not specify which candidate to pick for inspection
next, it could go down any path in the search tree. Observe that some of these paths are of
infinite length, wherein the algorithm traverses cycles in the state space. In the following
chapters we will look at different strategies for exploring this search space and study their
properties.
We will evaluate the performance of our search algorithms on the following four criteria.
• Quality of solution. We may optionally specify a quality measure for the algorithm. We
begin with the length of the path found as a measure of quality. Later, we will associate
edge costs with each move, and then the total cost of the path will be the measure. This will
typically be the case for planning problems (described later) where we can associate a
cost with the solution found.
• Space complexity. This looks at the amount of space the algorithm requires to execute. We
will see that this will be a critical measure, as the number of candidates in the search space
often grows exponentially.
• Time complexity. This describes the amount of time needed by the algorithm, measured by
the number of candidates inspected. The most desirable complexity will be linear in path
length, but one will often have to contend with exponential complexity.
• Completeness. An algorithm is complete if it is guaranteed to find a solution whenever one exists.
The next few chapters will occupy us with finding increasingly better algorithms on the above
parameters, often making a tradeoff on one to improve upon another.
We will now identify two distinct kinds of problems. Planning problems are problems in
which the sequence of moves to the goal state is of interest. Often in such problems the goal
state is clearly specified, and one needs to find a path to a goal state. Configuration problems,
on the other hand, have only an abstract description of the desired state, and the task is to find a
state that conforms to the description. The path to the goal state is generally of no interest. We
describe both kinds of problems with some examples. We begin with configuration problems,
because the high level algorithm described earlier is more suited to such problems, since it only
returns the goal state.
Figure 2.3 A map colouring problem (left) represented as a planar graph, and a solution
(right). Each node in the graph is a region and the colours next to it are the set of allowed
colours for that region. An edge between two nodes exists if the two regions have a common
boundary.
Figure 2.4 Three solutions for an 8-queens problem. A board with eight queens may be
represented by a list signifying the column for each queen. This is possible because each
column can have only one queen. The solution on the left is represented by (e, b, d, f, h, c, a, g)
where e is the column for the queen in row 1, b is the column for the queen in row 2, and so on.
(b ∨ ¬c) ∧ (c ∨ ¬d) ∧ (¬b) ∧ (¬a ∨ ¬e) ∧ (e ∨ ¬c) ∧ (¬c ∨ ¬d)
The SAT problem has been well studied and finds many applications. Many problems can
be formulated as SAT and then solved by one of the general-purpose SAT solvers of the kind we
will study. This exemplifies the spirit of general or weak methods, in which we seek to explore
general-purpose problem solving methods, which can then be applied to different individual
problems.
The SAT problem is also an epitome of exponential complexity. SAT was one of the earliest
problems to be proven NP-complete (Cook, 1971). NP is the class of problems that can be
solved in non-deterministic polynomial time, and a problem is NP-complete if it is in NP and
every problem in NP can be reduced to it. Solving the SAT problem by brute
force can be unviable when the number of variables is large. A formula with 100 variables will
have 2^100 or about 10^30 candidates. Even if we could inspect a million candidates per second,
we would need 3 × 10^14 centuries or so. Clearly that is in the realm of the impossible as far
as humankind is concerned. Further, it is believed that NP-complete problems do not have
algorithms whose worst case running time is better than exponential in the input size.
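To make the brute force estimate concrete, the Python sketch below enumerates all 2^N assignments of a formula. The encoding of a clause as a list of signed integers (+i for variable i, −i for its negation) is an assumption of this sketch and not a notation used elsewhere in the book.

from itertools import product

def brute_force_sat(num_vars, clauses):
    # try all 2^num_vars assignments of True/False to the variables 1..num_vars
    for values in product([False, True], repeat=num_vars):
        assignment = {i + 1: values[i] for i in range(num_vars)}
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assignment        # a satisfying assignment
    return None                      # no assignment satisfies the formula

# (b or not c) and (not b), with a=1, b=2, c=3: satisfied when b and c are both false
print(brute_force_sat(3, [[2, -3], [-2]]))

With 100 variables the loop would have to run 2^100 times, which is exactly the impossibility discussed above.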
One often looks at specialized classes of SAT formulas labelled as k-SAT, in which each
clause has at most k literals. It has been shown that 3-SAT is NP-complete. On the other hand,
2-SAT is solvable in polynomial time. For k-SAT, complexity is measured in terms of the size
of the formula, which in turn is at most polynomial in the number of variables.
start looking at search algorithms in greater detail. We will also look at representations used
by the planning community, in which named actions are represented explicitly, in Chapter 9.
Meanwhile, we look at some examples of planning problems.
Figure 2.5 The 8-puzzle consists of eight tiles on a 3 X 3 grid. A tile can slide into an adjacent
location if it is empty. A move is labelled R if a tile moves right, and likewise for up (U), down
(D), and left (L). Planning algorithms use named moves.
The directed arrows signify the direction of search. In the actual puzzle, the moves are
reversible, so the arrows should be bidirectional, but are not depicted in the figure for simplicity.
Interestingly, not every starting position leads to the given goal position. This is because the set
of possible states is partitioned into two disjoint halves. Two states that differ only by a swap
of two adjacent tiles belong to different partitions, and there is no path (sequence of moves)
from one to the other. Exploiting this property to popularize his puzzle, Sam Loyd offered a
prize of USD 1,000, a lot of money in those days, for solving such an instance of the problem.
No one could claim the prize though, because there was no solution for the published instance!
One of the earliest textbooks on artificial intelligence (AI) (Nilsson, 1971) used this puzzle
extensively, comparing different heuristics for performance. We shall look at heuristic search in
Chapter 4. This puzzle, like its 3-dimensional cousin, the Rubik’s cube, exemplifies problems
with non-serializable subgoals. That is, if one chooses an order of bringing tiles into their
destined place, for example, the top row first, then while solving subsequent subgoals one
is forced to disrupt the earlier partial solution. Solvers of the Rubik’s cube must be familiar
with this phenomenon. The trick we employ as humans is to learn whole sequences of moves,
called macros, or macro operators, for solving each subgoal, while turning a blind eye to the
disruptions in the intermediate states. The researcher Richard Korf explored ways of learning
such operators for his doctoral thesis (Korf, 1985).
A larger version of the 8-puzzle is the 15-puzzle, on a 4 × 4 grid. While the 8-puzzle
has 9! = 362,880 states (in two partitions), the 15-puzzle has about 10^13 or 10 trillion states.
Searching through these is indeed a formidable task, but people do find solutions. Searching
for the shortest path solution, however, is harder, because the average length of the shortest
path is 53, the longest being 80. It was only in 2005 that Korf and Schultze (2005) reported a
large-scale parallel breadth first search running for over 52 hours to find the shortest path. We
will look at breadth first search in Chapter 3. We will also look at the puzzle again in Chapter
4 on heuristic search.
Representation 1 (two lists, one per bank): start state [[G L C B] [ ]], goal state [[ ] [G L C B]]
Representation 2 (elements on the left bank only): start state [G L C B], goal state [ ]
Representation 3 (elements on the side where the boat is, with the bank label): start state [G L C Left], goal state [G L C Right]
In each representation, the man M is assumed to be where the boat B is.
Figure 2.6 Three representations for the man, goat, lion, and cabbage problem. The one on
the left has two lists, one for each bank. The representation in the middle has only elements
on the left bank. The representation on the right has a list of elements on the bank where the
boat is, along with the label specifying which bank it is. The reader is encouraged to write the
MoveGen functions for these representations.
The reader is encouraged to write the MoveGen and GoalTest functions for the above
representations.
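As one illustration, here is a possible MoveGen and GoalTest for the third representation, written in Python. The encoding of a state as a Python list, the helper safe, and the constant ALL_ITEMS are choices made only for this sketch; other encodings are equally valid.

ALL_ITEMS = ['G', 'L', 'C']

def move_gen_river(state):
    # state: the items on the bank where the boat (and the man) is,
    # followed by that bank's label, e.g. the start state ['G', 'L', 'C', 'Left']
    *items, side = state
    far_items = [x for x in ALL_ITEMS if x not in items]
    other_side = 'Right' if side == 'Left' else 'Left'

    def safe(unattended):
        # the goat may not be left alone with the lion or with the cabbage
        return not ('G' in unattended and ('L' in unattended or 'C' in unattended))

    neighbours = []
    for cargo in [None] + items:              # row across empty-handed or with one item
        left_behind = [x for x in items if x != cargo]
        if safe(left_behind):
            new_items = far_items + ([cargo] if cargo else [])
            neighbours.append(sorted(new_items) + [other_side])
    return neighbours

def goal_test_river(state):
    # everything, including the boat, is on the right bank
    return state[-1] == 'Right' and sorted(state[:-1]) == sorted(ALL_ITEMS)

From the start state ['G', 'L', 'C', 'Left'] the only neighbour generated is ['G', 'Right']; that is, the man must begin by taking the goat across.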
A note about representation would not be out of place here. Very often we choose a
representation designed for a specific implementation, with the semantics being only in the
mind of the programmer. This can have two drawbacks. One, the representation may not
be suitable for another problem or implementation. Two, the representation may not make
sense to someone else who looks at the program. On both counts, it should be good practice
to choose clear and meaningful representations that are interpretable by other humans or
programs.
Figure 2.7 The states reachable from the start (8, 0, 0) with three jugs of capacity 8, 5, and 3
litres. Undirected edges on the periphery represent reversible moves. The figure also identifies
three goal states in which 4 litres of water has been measured.
2 In the days when this problem was posed, perhaps only men did this work.
Figure 2.8 The TSP is to construct a tour of N cities visiting each exactly once and returning
to the starting city. The objective function is to minimize the cost of the tour. This figure shows
two different tours of eight Indian cities.
Note: Maps not to scale and do not represent authentic international boundaries.
The first city in a tour can be chosen in N ways, the second in N − 1 ways, the third in N − 2 ways, and so on. Thus the possibilities are N!. However, for each
tour, the starting point could well have been any of the N cities in the tour, so there are in effect
(N − 1)! distinct tours. Also, for every tour, travelling in the opposite direction would result
in the same cost, assuming the costs are symmetric. In that case, one can say that there are
(N − 1)!/2 distinct tours. But since most algorithms cannot always identify such duplications, we
often say that the problem is of complexity N!.
Now the factorial function grows much faster than the exponential function. Let us compare
an N variable SAT problem with an N-city TSP. Table 2.1 compares the number of candidate
solutions for the two problems as N increases. Observe the ratio between the two. The TSP
problem does indeed grow much faster!
Let us look at the value 2^100 which is the number of candidates for a 100-variable SAT
problem. This number is 1,267,650,600,228,229,401,496,703,205,376. In the US number
naming system, this is 1 nonillion, 267 octillion, 650 septillion, 600 sextillion, 228 quintillion,
229 quadrillion, 401 trillion, 496 billion, 703 million, 205 thousand, 376. This is about 10^30.
The number in every row for SAT is double the number in the previous row. If one were
to take a sheet of paper that is 0.1 millimetre thick and double the thickness (by folding it) one
hundred times, the resulting stack would be 13.4 billion light years tall. It would reach from
Earth to beyond the most distant galaxy we can see with the most powerful telescopes - almost
to the edge of the observable universe. A 100-variable SAT is hard enough. But 100! is a much
bigger number. The following output from a simple Lisp program shows the number.
3 Lisp has a built-in feature to handle large numbers. It has long been a favourite language of AI researchers, primarily
because it allows dynamic structures to be built naturally, and its functional style allows the creation of domain
specific operators easily. It has been on a little bit of decline with the advent of newer languages, and a diminishing
community makes it daunting for new entrants to try their hand at it now.
(factorial 100)
=>
933262154439441526816992388562667004907159682643816214685
929638952175999932299156089414639761565182862536979208272
23758251185210916864000000000000000000000000
This number is larger than 10^157, and clearly it is impossible to inspect all possible tours
in a 100-city problem. Inspecting all candidates is not a viable approach, and the TSP
problem has been shown to be NP-hard (Garey and Johnson, 1979). Exact solutions are hard to
find for a given large problem, which makes it difficult to evaluate an algorithm. A library
of TSP problems, TSPLIB (Reinelt, 2013), with exact solutions is available on the web. An exact
solution for 15,112 German cities from TSPLIB was found in 2001 using the cutting-plane
method proposed by George Dantzig, Ray Fulkerson, and Selmer Johnson in 1954, based on
linear programming (Dantzig et al., 1954). The computations were performed on a network
of 110 processors. In May 2004, the TSP of visiting all 24,978 cities in Sweden was solved: a
tour approximately 72,500 kilometres long was found, and it was proven that no shorter tour
exists. In March 2005, the TSP of visiting all 33,810 points in a circuit board was solved using
Concorde (Applegate et al., 2007): a tour of length 66,048,945 units was found, and it was
proven that no shorter tour exists.
One reason why TSP is much harder than SAT is that it is an optimization problem. We
are looking for the lowest cost tour. In SAT, when we inspect a candidate, we know whether it
is a solution or not. On the other hand, looking at a valid TSP tour, there is no way of knowing
whether it is the optimal in the general case. Chapter 6 describes the Branch&Bound algorithm
that can find the optimal tour without having to search the entire space, and in Chapter 4
we will study heuristic search algorithms that can give very good solutions. Under certain
conditions simpler algorithms work well, for example, when the cost function behaves like a
metric distance function that satisfies the triangle inequality. The triangle inequality says that
the sum of any two sides of a triangle is at least as large as the third side. Human beings are often satisfied
with good solutions in lieu of optimal ones, especially when the cost of finding them is much
lower. It is said that we are satisficers and not optimizers.
Stochastic local search (SLS) methods (Hoos and Stutzle, 2005) can find very good solutions
quite quickly. We will study such methods in Chapter 5. For example, for randomly generated
problems of 25 million cities, a solution quality within 0.3% of the estimated optimal solution
was found in 8 CPU days on an IBM RS6000 machine (Applegate et al., 2003). More results
on performance can be obtained from the website for the DIMACS (the Center for Discrete
Mathematics and Theoretical Computer Science, http://dimacs.rutgers.edu/) implementation
challenge on TSP (Johnson et al., 2003; Applegate et al., 2007). Some more interesting TSP
problems available (Applegate et al., 2007) are as follows: The World TSP - a 1,904,711-city
TSP consisting of all locations in the world that are registered as populated cities or towns, as
well as several research bases in Antarctica; National TSP Collection - a set of 27 problems,
ranging in size from 28 cities in Western Sahara to 71,009 cities in China. Many of these instances
remain unsolved, providing a challenge for new TSP codes; and VLSI TSP Collection - a set of
102 problems based on VLSI data sets from the University of Bonn. The problems range in size
from 131 cities to 744,710 cities.
TSP problems arise in many applications (Johnson, 1990), for example, drilling of circuit
boards (Litke, 1984), where the drill has to travel over all the hole locations, X-ray
crystallography (Bland and Shallcross, 1989), genome sequencing (Agarwala et al., 2000),
and VLSI fabrications (Korte, 1990). These can give rise to problems with thousands of cities,
with the last one reporting 1.2 million cities. Many of these problems are what are known as
Euclidean TSPs, in which the distance between two nodes (cities) is the Euclidean distance.
One can devise approximation algorithms that work in polynomial time. Arora (1998) reports
that in general, for any c > 0, there is a polynomial-time algorithm that finds a tour of length
at most (1 + 1/c) times the optimal for geometric instances of TSP, which is a more general
case of a Euclidean TSP. Special cases of TSPs can be solved easily. For example, if all the
cities are known to lie on the perimeter of a convex polygon, a simple greedy algorithm
TSP-NearestNeighbour works well.
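A greedy construction of this kind is easy to state. The Python sketch below is only an illustration of the nearest neighbour idea; the distance function and the example coordinates are assumptions made for this sketch.

import math

def tsp_nearest_neighbour(cities, dist, start):
    # grow the tour by always moving to the nearest city not yet visited
    tour = [start]
    unvisited = set(cities) - {start}
    while unvisited:
        nearest = min(unvisited, key=lambda c: dist(tour[-1], c))
        tour.append(nearest)
        unvisited.remove(nearest)
    return tour + [start]            # return to the starting city

points = {'A': (0, 0), 'B': (1, 0), 'C': (1, 1), 'D': (0, 2)}
euclid = lambda p, q: math.dist(points[p], points[q])
print(tsp_nearest_neighbour(points, euclid, 'A'))   # ['A', 'B', 'C', 'D', 'A']

For general cost functions, however, the greedy tour can be far from optimal, which is why the chapters that follow spend so much effort on better search strategies.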
Interestingly, TSP can be seen both as a planning problem and a configuration problem,
hinting that the two are just different ways of looking at a problem. As a planning problem,
we can think of solving it by constructing a tour by moving from one city to another. As
a configuration problem, we can think of the solution as a particular ordering of the cities.
Algorithms that take the second view are said to operate in the solution space.
In constructive methods, each move takes you one step closer to a solution. The solution could be a plan for a planning problem, where
constructive approaches are natural and intuitive, and we construct the plan move by move.
But constructive methods can also be used for configuration problems, where again the
solution is synthesized piece by piece. In the TSP, we can synthesize the tour one edge at a time.
In the map colouring problem, we assign a colour region by region. In the N-queens problem,
we can imagine, or represent, an empty board and place the queens one by one. In the SAT
problem, we can assign values to variables one at a time.
Figure 2.9 A flip-one-bit MoveGen function for SAT flips one bit of a candidate. For a 5-variable
SAT problem, it produces a set of five neighbours.
However, there can be other neighbourhood functions as well. For example, one could flip
any two bits, or flip one or two bits, and so on. We shall explore the implication of choosing
different neighbourhood functions in Chapter 4. Meanwhile, here is a point to ponder. Why not
allow the neighbourhood function to change any number of bits? Then the solution would be
just one move away!
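A flip-one-bit neighbourhood function of the kind shown in Figure 2.9 takes only a few lines to write. In the Python sketch below, a candidate is encoded as a tuple of 0s and 1s, one entry per variable; that encoding, and the companion flip-two-bits function, are assumptions made only for this illustration.

from itertools import combinations

def flip_one_bit(candidate):
    # one neighbour per variable: N neighbours for N variables
    return [candidate[:i] + (1 - candidate[i],) + candidate[i + 1:]
            for i in range(len(candidate))]

def flip_two_bits(candidate):
    # one neighbour per pair of variables
    neighbours = []
    for i, j in combinations(range(len(candidate)), 2):
        new = list(candidate)
        new[i], new[j] = 1 - new[i], 1 - new[j]
        neighbours.append(tuple(new))
    return neighbours

print(flip_one_bit((1, 0, 1, 1, 0)))    # the five neighbours of a 5-variable candidate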
In the TSP problem, one representation of the candidates, known as the path representation,
is to list the cities in the order they are visited. For example, for the five cities Chennai, Bangalore,
Goa, Delhi, and Mandi depicted by the initial letters, a candidate tour could be [B, M, G, C, D].
One perturbation function could be a 2-city exchange in which a new permutation is created by
swapping some two cities. We shall also explore different neighbourhood functions in Chapter 4.
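The two-city exchange can be sketched in the same style. The list encoding of a tour below follows the path representation just described; the function name is an assumption of this sketch.

from itertools import combinations

def two_city_exchange(tour):
    # tour: cities in the order visited, e.g. ['B', 'M', 'G', 'C', 'D']
    neighbours = []
    for i, j in combinations(range(len(tour)), 2):
        new_tour = tour[:]
        new_tour[i], new_tour[j] = new_tour[j], new_tour[i]   # swap two cities
        neighbours.append(new_tour)
    return neighbours

print(len(two_city_exchange(['B', 'M', 'G', 'C', 'D'])))   # 10 neighbours for 5 cities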
In Chapter 9, we will look at plan space methods for planning problems. In the plan space,
each node is a candidate plan, and the search is for those candidates that are solution plans.
The neighbourhood functions there will be a little different from the simple perturbation that is
possible for configuration problems like SAT and TSP.
Summary
In this chapter we have laid the foundations of problem solving by search. Search is a first
principles approach in which the problem solver simulates the effects of its intended actions,
and by a process of searching through various possibilities, arrives at a solution.
First principles methods are needed to solve problems in the first place. When faced with
a new problem, this is the only approach to finding a solution. But we do not employ them all
the time. For problems that are similar to the ones encountered earlier, making use of memory
and learning is much more effective. Humankind has even learned to employ social memory
through language, telling stories and writing books for others to read and benefit from.
A search method has to wade through a sea of possibilities, as the number of choices grows
exponentially. We have named this adversary CombEx (Khemani, 2013) for combinatorial
explosion and likened it to the mythical Hydra that Hercules had to battle. Hydra would sprout
new heads every time Hercules cut off one. Our search spaces also multiply as we inspect a
candidate and delve deeper. And yet we are encouraged by the story that Hercules was able to
triumph over Hydra.
In the rest of this book we shall study different ways of exploring search spaces and
different kinds of problems that can be formulated as search problems. In the next chapter we
begin by setting up the basic mechanisms for search, and in the following chapters we will
investigate how to battle CombEx.
Exercises
8. Given a set of three variables {a, b, c}, construct a SAT problem such that every assignment
is a solution.
9. Of the nine locations in the 8-puzzle, a blank in the centre allows four moves, a blank on the
side allows three moves, and a blank in a corner allows two moves. Design a representation
and a MoveGen function that works with the representation.
10. Given the representation of the water jug problem as a triple, devise the MoveGen function.
11. Given the following representation for the man, goat, lion, and cabbage river crossing
problem, write the MoveGen and GoalTest functions. The representation is a list of
two lists: one with the elements on the left bank and the other with elements on the right
bank. The start state is [[G L C B] []] indicating that all four elements are on the left bank.
Observe that B stands for the boat, and the assumption is that the man is where the boat is,
since only he can row the boat.
12. Write the MoveGen and GoalTest functions when the representation is only one list,
with the elements on the left bank. The corresponding start state is [G L C B].
13. Write the MoveGen and GoalTest functions with the following representation: there
is only one list, with the elements on the side where the boat is, along with the side. The
corresponding start state is [G L C Left].
14. In the perturbative move described in the chapter for SAT, one is allowed to flip one bit in
the candidate. If there are N variables, then this results in N neighbours being generated.
What would the neighbourhood size be if the move was to flip two bits? And if one could
flip one or two bits?
15. In the SAT problem with N variables, if we flip one bit, the path to the solution in the worst
case would be of length N, when each bit has to be flipped. If the neighbourhood function
allowed one to change any number of bits, then the solution would be only one move away.
Comment on this neighbourhood function.
Chapter 3
Blind Search
In this chapter we introduce the basic machinery needed for search. We devise
algorithms for navigating the implicit search space and look at their properties.
One distinctive feature of the algorithms in this chapter is that they are all blind or
uninformed. This means that the way the algorithms search the space is always the
same irrespective of the problem instance being solved.
We look at a few variations and analyse them on the four parameters we defined in the
last chapter: completeness, quality of solution, time complexity, and space complexity.
We observe that complexity becomes a stumbling block, as our principal foe CombEx
inevitably rears its head. We end by making a case for different approaches to fight
CombEx in the chapters that follow.
In the last chapter we looked at the notion of search spaces. Search spaces, as shown in Figure
2.2, are trees corresponding to the different traversals possible in the state space or the solution
space. In this chapter we begin by constructing the machinery, viz. algorithms, for navigating
this space. We begin our study with the corresponding tiny state space shown in Figure 3.1.
The tiny search problem has seven nodes, including the start node S, the goal node G, and
five other nodes named A, B, C, D, and E. Without any loss of generality, let us assume that the
nodes are states in a state space. The algorithms apply to the solution space as well. The left side
of the figure describes the MoveGen function with the notation Node ^ (list of neighbours). On
the right side is the corresponding graph which, remember, is implicit and not given upfront.
The algorithm itself works with the MoveGen function and also the GoalTest function. The
latter, for this example, simply knows that state G is the goal node. For configuration problems
like the N-queens, it will need to inspect the node given as the argument.
The search space that an algorithm explores is implicit. It is generated on the fly by the
MoveGen function, as described in Algorithm 2.1. The candidates generated are added to
what is traditionally called OPEN, from where they are picked one by one for inspection. In
this chapter we represent OPEN as a list data structure. We also replace the nondeterministic
move of picking some candidate from OPEN with the operation of picking the node at the head
of the list. By doing so, we shift the onus of devising a strategy for picking the next move to
the way we maintain OPEN. The modified version that stores OPEN as a list is described in
Algorithm 3.1.
S -> (A, B, C)
A -> (S, B, D)
B -> (S, A, D)
C -> (S, G)
D -> (A, B, E)
E -> (D, G)
G -> (C, E)
Figure 3.1 A tiny search problem. The figure on the left shows the MoveGen function in the
format Node -> (List of neighbours). The figure on the right shows the corresponding state
space. Observe that all moves are bidirectional.
Algorithm 3.1 Algorithm SimpleSearch picks node N from the head of OPEN. If N is the
goal, it terminates with N, else it calls MoveGen and adds the neighbours of N to OPEN in
some manner. The way this is done determines the behaviour of the search algorithm.
SimpleSearch(S)
1 OPEN ← [S]
2 while OPEN is not empty
3     N ← head OPEN
4     OPEN ← tail OPEN
5     if GoalTest(N) = TRUE
6        then return N
7        else Combine MoveGen(N) with OPEN
8 return FAILURE
Figure 3.2 A part of the search tree for the tiny search problem of Figure 3.1. For each
node, the children are ordered as in the list given by MoveGen. Every path in the search tree
continues till it encounters the goal node G. Observe that some paths are never-ending.
in the above tree. The actual behaviour would depend on how OPEN is organized in the
implementation. Different algorithms do their own thing and explore the space in different
ways. The way they do so is captured in the corresponding search tree, which may be a subtree
of the above tree.
The search tree depicts the space as viewed by a given search algorithm. Each variation
of the algorithm generates its own search tree. Given that we have decided to extract the next
candidate from the head of the OPEN list, we need to decide how to add the new nodes to
OPEN. We begin by treating OPEN as a stack data structure. The resulting search is known as
depth first search.
Algorithm 3.2 Algorithm DFS picks node N from the head of OPEN. If N is the goal it
terminates with N, else it calls MoveGen and concatenates the list of neighbours of N with
OPEN.
DFS(S)
1 OPEN ← [S]
2 while OPEN is not empty
3     N ← head OPEN
4     OPEN ← tail OPEN
5     if GoalTest(N) = TRUE
6        then return N
7        else OPEN ← MoveGen(N) ++ OPEN
8 return FAILURE
Algorithm DFS starts searching at node S. The three neighbours of S are A, B, and C, and
these are added to the OPEN list after the algorithm has inspected node S and removed it from
OPEN. The new nodes are added in the order generated by the MoveGen function, with node
A ending up at the head of OPEN, or at the top of the stack. DFS picks A next. The search tree
explored by the algorithm is the entire tree shown earlier and repeated in Figure 3.3. The figure
also shows on the right how OPEN evolves after each step. Remember that at each stage the
next node is picked from the head of OPEN.
[S]
[A, B, C]
[S, B, D, B, C]
[A, B, C, B, D, B, C]
Figure 3.3 The entire search tree is available for algorithm DFS, which is the simplest instance
of depth first search. It dives down the never-ending leftmost path and goes into an infinite
loop. The OPEN list is shown on the right at each level, with the algorithm always picking the
node at the head at each stage.
Algorithm DFS dives headlong into the first branch it sees. The fact that it does so on the
leftmost branch is only because the first node in the list returned by MoveGen always ends up
at the head of OPEN. OPEN behaves like a stack with the LIFO property. The behaviour of
depth first search can thus be seen to embody the newest nodes first strategy. In the search tree
that translates to deepest nodes first, giving the algorithm its name.
Even though the state space is finite, the search tree is infinite. This is another way of
saying that a search algorithm could well go in cycles, moving around in the state space without
moving towards the goal. As can be seen, the search simply oscillates between nodes S and A,
and will keep doing so till the machine crashes. In the river crossing problem this would simply
mean repeatedly going over to the other bank (with the goat) and coming back. For the 8-puzzle
this would mean moving the same tile back and forth.
Observe that in the search tree in Figure 3.3, there are several occurrences of the goal node
G, each with a different path leading to it. Will some search algorithm find one?
There are two issues with algorithm DFS. First, the problem of looping indefinitely as seen
above. And second, even if the algorithm were to find the goal, it only returns the goal node
(Line 6) and not the path that we seek. We resolve the two lacunae one by one.
When we pick a node N from OPEN for inspection, we add it to CLOSED. And before adding the neighbours
of N generated by the MoveGen function, we remove those nodes that are already present
in CLOSED. The corresponding search tree generated by the algorithm is shown in Figure
3.4 with solid edges. On the right, we show both OPEN and CLOSED as search progresses.
The search is still depth first. Only that nodes already present in CLOSED are not generated
again.
With this modification, our search algorithm does terminate by finding a path to the goal
node. Moreover, the infinite search tree has transmogrified into a finite one. The perceptive
reader would have noticed that the path found is not the shortest path. We will see below that
the other option of treating OPEN as a queue does find the shortest path.
OPEN              CLOSED
[S]               [ ]
[A, B, C]         [S]
[B, D, B, C]      [A, S]
Figure 3.4 When nodes already on CLOSED are not added to OPEN, the search space
accessible to DFS shrinks dramatically to the subtree with solid edges. It finds the path to
the goal as shown. Note that the search tree depends on the left to right order of depth first
search, which is why it does not go beyond the nodes D, B, and C on the right. All their
neighbours would already be on CLOSED.
Looking at Figure 3.4, one can see that while nodes on CLOSED are not added again,
nodes already on OPEN are again added. These are B and C as children of S, and D as a child of
A. In fact, the path found to G is through the copy of D added later. If we filter out neighbours of
a node that are already on OPEN, in addition to CLOSED, then we get an even smaller search
tree as shown in Figure 3.5.
The resulting search tree includes each node from the state space exactly once. In this
example it does include all nodes of the state space, but for a larger graph it may not have
done so. Also, the path found by this variation is different from the one in Figure 3.4. The
numbers in the search tree represent the order in which depth first search inspects the nodes
till termination. We next turn our attention to modifying our algorithm to reconstruct and
return the path found.
Figure 3.5 When only new nodes, which are neither on OPEN nor on CLOSED, are added to
OPEN, the search tree shrinks even more. Now only one copy of every node exists in the search
tree. Note that the path found is different too. The numbers next to nodes show the order in
which the nodes are inspected.
MoveGen
S -> (A, B, C)
A -> (S, B, D)
B -> (S, A, D)
C -> (S, G)
D -> (A, B, E)
E -> (D, G)
G -> (C, E)
Figure 3.6 The modified search space for the tiny search problem where each node stores the
parent too in a nodePair. The state space and the MoveGen function are repeated on the left.
The figure also shows the path found by the depth first algorithm with only new nodes being
added to OPEN.
Let us trace the reconstruction of the path with the tiny search problem. As shown in Figure 3.6, when GoalTest succeeds with the
nodePair (G,E), CLOSED contains the following nodePairs: (E,D), (D,A), (A,S), and (S,null).
The algorithm begins by initializing the path P to [G]. As long as the parent of the last node
added to the path P is not null, it concatenates the parent to the path and looks for the parent
of the parent in CLOSED. Starting with E, the following nodes are successively concatenated
to the path: D, A, and S, at which point the parent of S is null and the algorithm terminates with [S,
A, D, E, G].
The algorithm is shown as Algorithm 3.3. It accepts a node pair of the form (goal, parent)
and constructs the path by tracing back successive parents till it reaches the start node whose
parent is null. Lines 1-4 show an ancillary function FindLink that retrieves the nodePair whose
first element is the node given as argument, and which therefore holds that node's parent.
Algorithm 3.3. Algorithm ReconstructPath accepts the nodePair containing the goal
node and constructs the path by tracing the parents via the nodePairs stored in CLOSED.
It uses an ancillary function FindLink(node, CLOSED) which fetches the nodePair in
which node is the first element.
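The pseudocode of Algorithm 3.3 is not reproduced here, so the following Python sketch simply follows the description above; the representation of CLOSED as a list of (node, parent) pairs is the only assumption.

def find_link(node, closed):
    # fetch the (node, parent) pair in CLOSED whose first element is node
    for pair in closed:
        if pair[0] == node:
            return pair
    return None

def reconstruct_path(node_pair, closed):
    node, parent = node_pair
    path = [node]
    while parent is not None:
        path.append(parent)
        _, parent = find_link(parent, closed)
    return list(reversed(path))      # the path from the start state to the goal

With CLOSED = [('E', 'D'), ('D', 'A'), ('A', 'S'), ('S', None)] and the pair ('G', 'E'), reconstruct_path returns ['S', 'A', 'D', 'E', 'G'], as in the trace above.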
Algorithm 3.4. Algorithm DFS works with node pairs to keep track of the parent of
each node. It treats OPEN like a stack. It removes any nodes generated earlier from
the set of neighbours returned by MoveGen, and for each remaining node adds the parent
to make a pair; the new pairs are placed in front of the tail of OPEN. For a planning problem,
we call the module to reconstruct the path. For a configuration problem, we would only
return the goal node when it is found.
DFS(S)
1 OPEN ← (S, null) : [ ]
2 CLOSED ← empty list
3 while OPEN is not empty
4     nodePair ← head OPEN
5     (N, _ ) ← nodePair
6     if GoalTest(N) = TRUE
7        return ReconstructPath(nodePair, CLOSED)
8     else CLOSED ← nodePair : CLOSED
9          children ← MoveGen(N)
10         newNodes ← RemoveSeen(children, OPEN, CLOSED)
11         newPairs ← MakePairs(newNodes, N)
12         OPEN ← newPairs ++ (tail OPEN)
13 return empty list
There are two exit points for algorithm DFS. The first is in Line 7 when it finds the goal
node. The second is in Line 13 when the while loop has ended, and it has no more nodes left
in OPEN to inspect. It then returns the empty path. The latter exit happens if the state space is
finite and there is no path to any goal node. If the state space itself were to be infinite then DFS
could go down an infinite path, even if there was a finite path to a goal node.
The other ancillary functions are described below. When node N is not the goal node, its
neighbours are generated by calling MoveGen (Line 9). Those neighbours that have already
been inspected, in CLOSED, and those waiting to be inspected, in OPEN, are filtered out
(Line 10). For each node in nodeList, it makes two calls to algorithm OccursIn, once to check
if the node occurs in OPEN, and once to check for its presence in CLOSED. The algorithm
RemoveSeen is described below.
Algorithm 3.5. Algorithm RemoveSeen accepts a list of nodes, and two lists of
nodePairs, OPEN and CLOSED, and filters out those nodes in the nodeList that are
present in either CLOSED or OPEN.
The above program is a recursive program that steps through nodeList, removing a node if
it is present in either CLOSED or OPEN (Lines 4, 5) or keeping it and recursively processing
the tail of nodeList (Line 6). Algorithm RemoveSeen in turn calls algorithm OccursIn to
check if a given node occurs somewhere in the list nodePairs as a first element of a pair.
Algorithm OccursIn is described below.
Algorithm 3.6. The procedure OccursIn checks for the presence of a node, as a first
element of some nodePair in the list nodePairs
OccursIn(node, nodePairs)
1 if nodePairs is empty
2 return FALSE
3 elseif node = first head nodePairs
4 return TRUE
5 else return OccursIn(node, tail nodePairs)
The final piece of the jigsaw is the algorithm MakePairs that is called in DFS (Line 11) with
a list of (new) nodes to be added to OPEN. In DFS the input to this is the nodes from which
nodes generated earlier have been filtered by RemoveSeen. MakePairs accepts this list and
the parent they were generated from and forms pairs to be added to OPEN.
Algorithm 3.7. MakePairs recurses down a list of nodes in nodeList, for each making
a pair with the parent node. For example, MakePairs([A, B, C, D], S) = [(A,S), (B,S), (C,S),
(D, S)]
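The pseudocode bodies of RemoveSeen and MakePairs are not reproduced here, so the following Python equivalents are only a sketch of the descriptions above, written iteratively rather than recursively, with OPEN and CLOSED assumed to be lists of (node, parent) pairs.

def occurs_in(node, node_pairs):
    # is node the first element of some pair in node_pairs?
    return any(node == pair[0] for pair in node_pairs)

def remove_seen(node_list, open_pairs, closed_pairs):
    # keep only nodes that appear on neither OPEN nor CLOSED
    return [n for n in node_list
            if not occurs_in(n, open_pairs) and not occurs_in(n, closed_pairs)]

def make_pairs(node_list, parent):
    # make_pairs(['A', 'B', 'C', 'D'], 'S') = [('A','S'), ('B','S'), ('C','S'), ('D','S')]
    return [(n, parent) for n in node_list]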
MoveGen
S -> (A, B, C)
A -> (S, B, D)
B -> (S, A, D)
C -> (S, G)
D -> (A, B)
G -> (C)
Figure 3.7 If we remove node E from the tiny search problem, then DFS is compelled to
backtrack from the paths via A and B on the left in the search tree and find the path via node
C on the right. When only new nodes are added, both DFS and BFS (described later) generate
the same subtree shown in solid lines, but they explore it in a different order.
Backtracking happens when the algorithm reaches a dead end or has run out of options,
and it backtracks to the parent to try the next option. In our implementation of DFS using a
stack, this happens naturally. The reader is encouraged to simulate the progress of DFS on the
above problem. DFS first goes down to D via node A. Node D is a dead end as it has no new
neighbours, so after it is removed from OPEN it does not add any nodes to it. Node B naturally
comes to the head of the list and likewise does not add anything to OPEN. Node C then becomes
the next node and leads to the only path to the goal G. In general, when all options at some
level are tried in a particular branch without adding new ones, the alternatives come to the fore
automatically at the head of OPEN.
Depth first search admits the possibility of keeping only one copy of the state. Then instead
of adding a neighbour state to OPEN, one can add two moves fi and bi, where fi is a forward
move and bi is the corresponding backward move which undoes the forward move. When fi
is picked, then the state is modified to reflect the change. When MoveGen is applied to the
resulting state it returns a list of forward and backward moves which are added to the head of
OPEN. If there are no new moves to be made, then, as before, nothing is added and then bi is
the next move to be picked. That changes the state back to what it was before fi was applied.
Backtracking is thus explicit. The reader is encouraged to modify the program to work in this
manner with only one state.
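One way to realize the single-state idea is with recursion, where applying the backward move when a call returns makes the backtracking explicit. The sketch below is only an illustration; the parameters legal_moves, apply_move, and undo_move are assumptions standing for the forward and backward moves described above, and the state is assumed to be freezable into a tuple for cycle checking.

def backtracking_dfs(state, legal_moves, apply_move, undo_move, goal_test, visited=None):
    # keeps a single copy of the state: apply a move, recurse, and undo it on backtracking
    if visited is None:
        visited = set()
    key = tuple(state)
    if key in visited:
        return None                  # this state has already been explored
    visited.add(key)
    if goal_test(state):
        return []                    # goal reached: no further moves needed
    for move in legal_moves(state):
        apply_move(state, move)      # forward move f_i changes the state in place
        plan = backtracking_dfs(state, legal_moves, apply_move, undo_move, goal_test, visited)
        undo_move(state, move)       # backward move b_i restores the state
        if plan is not None:
            return [move] + plan     # prepend the successful move to the plan
    return None                      # dead end: backtrack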
For the problem in Figure 3.7, DFS ends up inspecting the entire state space. The reader
should verify that the order of inspecting them is S, A, D, B, C, G. The algorithm breadth first
search (BFS) also inspects the entire space but in a different order. The reader should work that
out after studying the algorithm BFS described below.
Algorithm 3.8. The BFS algorithm differs from DFS only in the manner in which OPEN
is maintained. Here the new neighbours join the queue for being inspected.
BFS(S)
1 OPEN ← (S, null) : [ ]
2 CLOSED ← empty list
3 while OPEN is not empty
4     nodePair ← head OPEN
5     (N, _ ) ← nodePair
6     if GoalTest(N) = TRUE
7        return ReconstructPath(nodePair, CLOSED)
8     else CLOSED ← nodePair : CLOSED
9          children ← MoveGen(N)
10         newNodes ← RemoveSeen(children, OPEN, CLOSED)
11         newPairs ← MakePairs(newNodes, N)
12         OPEN ← (tail OPEN) ++ newPairs
13 return empty list
The new order of maintaining OPEN as a queue completely changes the behaviour of the
search. The newest nodes first strategy led DFS to dive headlong into the search tree, and the
state space. The FIFO, or newest nodes last, strategy of BFS forces new nodes to queue up to
be inspected when their turn comes. Figure 3.8 shows the behaviour of BFS on the tiny problem
of Figure 3.1. Nodes A, B, and C, the neighbours of S, are first in line. After A is inspected, its
only new neighbour D is added, but to the rear of the queue. It is node B that is inspected after
A. It has no new neighbours to add. Then search visits C and generates the goal node G as a
child. Now, having finished with the first level in the search tree, the algorithm inspects D and
then node G. When the path is reconstructed, it is the shortest path S-C-G. The order of visiting
nodes, level by level, is also marked in Figure 3.8.
The conservative nature of BFS, to stick as close to the start node as possible, is an
insurance against going into infinite loops. This is true even when we explore the entire search
tree without filtering out any neighbours, as shown in the figure with dashed edges. This is a
direct consequence of the level by level push into the search tree. There may be multiple paths
to the goal node, as depicted by multiple occurrences of G in the complete search tree. Since
BFS ventures incrementally away from the start node, it will find the shortest path when it first
appears in some level. This is true even if the state space itself is infinite.
OPEN              CLOSED
[S]               [ ]
[A, B, C]         [S]
[B, C, D]         [A, S]
[C, D]            [B, A, S]
[D, G]            [C, B, A, S]
[G, E]            [D, C, B, A, S]
Figure 3.8 When OPEN is a queue, the behaviour of search is starkly different from that in
DFS. Algorithm BFS goes down the search tree level by level as indicated by the numbers
showing the order in which nodes are visited. As a consequence, it terminates with the
shortest path to the goal G.
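Since DFS and BFS differ only in where the new pairs join OPEN, both can share one Python skeleton. The sketch below reuses the reconstruct_path, remove_seen, and make_pairs sketches given earlier in this chapter, and encodes the tiny problem of Figure 3.1 as a dictionary; these encodings are assumptions of the sketch.

def search(start, move_gen, goal_test, strategy='stack'):
    open_list = [(start, None)]
    closed = []
    while open_list:
        node_pair = open_list.pop(0)             # always pick the head of OPEN
        node, _ = node_pair
        if goal_test(node):
            return reconstruct_path(node_pair, closed)
        closed.append(node_pair)
        new_nodes = remove_seen(move_gen(node), open_list, closed)
        new_pairs = make_pairs(new_nodes, node)
        if strategy == 'stack':                  # DFS: newest nodes first
            open_list = new_pairs + open_list
        else:                                    # BFS: newest nodes last
            open_list = open_list + new_pairs
    return []

graph = {'S': ['A', 'B', 'C'], 'A': ['S', 'B', 'D'], 'B': ['S', 'A', 'D'],
         'C': ['S', 'G'], 'D': ['A', 'B', 'E'], 'E': ['D', 'G'], 'G': ['C', 'E']}
print(search('S', graph.get, lambda n: n == 'G', 'stack'))   # DFS: ['S', 'A', 'D', 'E', 'G']
print(search('S', graph.get, lambda n: n == 'G', 'queue'))   # BFS: ['S', 'C', 'G']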
The reader is encouraged to explore the problem described in Figure 3.7 with the two
algorithms DFS and BFS, in all their three avatars: with nodes already on CLOSED being filtered
out, with nodes on either CLOSED or OPEN being filtered out, and with all nodes returned by
MoveGen being added to OPEN.
Figure 3.9 As depth first search dives into a tree with 4 children for each node, it adds 3 nodes
to OPEN at every level. As search sweeps from left to right, this number will come down. The
figure shows the nodes on OPEN in grey when DFS is about to pick the 6th node from OPEN
for inspection.
DFS has dived into the search tree impetuously along the first path it saw. At each level, as
it goes deeper it adds b nodes and picks one immediately after, leaving (b - 1) nodes on OPEN
where b is the branching factor. In Figure 3.9, b = 4, and as the figure shows, 3 nodes were
added to OPEN at each level, shown in grey.
BFS on the other hand has just reached level 2 when it picks the 6th node in the same search
tree, as shown in Figure 3.10.
Figure 3.10 As breadth first search pushes into a tree with 4 children for each node, it
multiplies the number of nodes in OPEN by 4 at every level. Thus OPEN grows exponentially
with depth. The figure shows the nodes on OPEN in grey when BFS is just about to pick the
6th node from OPEN for inspection.
When it is about to pick the 6th node, BFS has finished inspecting all the nodes in level
1. In the process, it has added 4 nodes for each node it inspected at level 1. When it starts on
level 2, it has 16 nodes in OPEN, which is 4^2 nodes. When it is about to pick the 6th node, there
are 16 nodes on OPEN shown in grey in Figure 3.10. In general, as BFS enters level d, it is
confronted with b^d nodes at level d where b is the branching factor.
Figure 3.11 The search frontiers for DFS in grey and BFS in unshaded nodes. The outer
envelope is meant to give a feel of exponential growth.
In the case of BFS, the size of OPEN is exponential in the depth d. At depth 0, there is one
node, or b^0. When d = 1, we have b nodes. When d = 2, we have b^2 nodes. And so on. As we
go deeper, we multiply the number of nodes in the current layer by b. When BFS is about to
commence at level d, we have
|OPEN_BFS| = b^d
For DFS, OPEN grows in a linear fashion. This is because DFS does not end up generating
b children for every node in the current layer, but only for the one node that it picks. We only
add b nodes to OPEN as we go to the next layer. At depth 0, there is 1 node. At depth 1 we add
b nodes. Of these, we remove the first for inspection, so in effect we have (b − 1) nodes left. At
every layer, we add b nodes and pick one for inspection. Counting the node that the algorithm
is about to pick at depth d, we have
|OPEN_DFS| = d(b − 1) + 1
DFS thus scores over BFS on space complexity. The OPEN list grows only in a linear
fashion as compared to the exponential growth of BFS.
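To get a feel for the difference, take a branching factor b = 10 and depth d = 10: BFS would be holding b^d = 10^10, that is ten billion, nodes on OPEN as it enters the last level, while DFS would be holding only d(b − 1) + 1 = 10 × 9 + 1 = 91.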
Figure 3.11 also suggests a way of visualizing the progress of search. Assuming that
the MoveGen function generates children from left to right, DFS dives down into the
leftmost branch. Assuming that the tree is of finite depth, which would be the case for a
finite state space, it may backtrack at some point, and explore the search tree sweeping it
from left to right. BFS on the other hand pushes into the search tree layer by layer, with
OPEN growing exponentially. In either case, the algorithm is oblivious of the goal, till the
point it hits upon it.
Figure 3.12 The time taken by search to find the goal node depends upon where the goal
node lies in the search tree. If it is at location A deep into the tree on the left, then DFS will
find it quickly, while BFS will wade through all the levels above it first. If it is at location B on the right
closer to the root, then BFS will find it early while DFS will sweep through the entire tree from
left to right.
N_DFS = (d + 1)
N_BFS = (b^d - 1)/(b - 1) + 1
We observe that for large b, BFS has a slightly higher time complexity than DFS. But the
important observation is that both are exponential.
3.4.4 Completeness
A corollary of the above property is that BFS is complete. If a path exists from the start state
to the goal state, then BFS will find a path. This will happen even if the search tree is infinite,
for example, as in Figure 3.3. Not only that, it will find the shortest path to the goal node. The
moment a goal node appears in a new layer, it will be found. DFS on the other hand can go
down an infinite path, again as in Figure 3.3, even though there is a path to the goal.
For finite state spaces with only previously unseen nodes being added, as in Algorithms
3.4 and 3.8, both algorithms are complete. If there is a path to the goal node, both will find one
and will also terminate with failure if there is none. This is because, in every cycle, one node
is moved from OPEN to CLOSED and eventually OPEN becomes empty. We also say that the
search algorithms are systematic when they explore the entire finite state space, whatever the
order they choose to do so in.
Most real world problems have finite state spaces, so we will consider both algorithms to
be equal on this count. Infinite state spaces can come from the domain of mathematics, and we
will perhaps need to be careful here. For example, in number theory, if we were to search for a
counterexample to Fermat’s Last Theorem (look for x, y and z such that x^3 + y^3 = z^3), the search
will be futile, and no algorithm will terminate.
On the four criteria for analysing search algorithms, BFS did better on the quality measure,
guaranteeing the shortest path, and DFS did better on space complexity, requiring only linear
space to store OPEN. We next look at an algorithm that gives us the best of these two worlds,
with surprisingly low extra cost.
can imagine, if you spend too much time over the initial moves, you have less for the remaining.
Now, as we will study in Chapter 8, the most used algorithm is based on DFS. The programs
look ahead a certain number of moves, called plies, before evaluating the board position. The
farther you look ahead, the better the analysis. Chess programmers came up with the idea of
doing a flexible amount of lookahead, based on the available time, and the idea of iterative
deepening was born.
Figure 3.13 DB-DFS searches the state space in a depth first manner but does not venture
beyond a boundary at depth d.
Algorithm 3.9 describes DB-DFS, which takes the depth bound as an argument, depthBound,
and executes DFS only within the bound. It stores the depth information in the nodePair, which
now in fact becomes a triple. But we retain its name. The root is at depth 0. DB-DFS invokes
MoveGen only when the current node is within the depth bound.
DB-DFS(S, depthBound)
1 OPEN ← (S, null, 0) : [ ]
2 CLOSED ← empty list
3 while OPEN is not empty
4 nodePair ← head OPEN
The MakePairs module will need to be modified as well, as we now deal with triples. The
same applies to ReconstructPath and its ancillary functions.
Let us analyse this algorithm. DB-DFS, being essentially DFS, requires only linear space
for the OPEN list. However, it does not guarantee the shortest path, or even a path to the goal.
What is more, if there are multiple instances of the goal within the bound, it may find the one
farther away. And if the goal node lies beyond the bound, it will not reach it at all.
This is where the idea of iterative deepening comes in - which is to do a sequence of
DB-DFS searches looking incrementally farther in every cycle, like BFS does.
Algorithm 3.10. DFID does a series of DB-DFSs with increasing depth bounds
DFID(start)
1 depthBound ← 0
2 while TRUE
3 do DB-DFS(start, depthBound)
4 depthBound ← depthBound + 1
Can one get the best of both worlds? That is, the shortest path using linear space. The answer
is yes, almost. A small incremental cost must be paid. We analyse the high level algorithm and
highlight its positive features with a couple of problems. We also give some suggestions on how
some lacunae can be addressed. We begin with the simpler arguments.
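For readers who wish to experiment, here is a minimal Python sketch of DB-DFS and the DFID driver. It assumes that move_gen(n) returns the neighbours of n and that goal_test(n) recognizes a goal node; the representation, with each entry on OPEN being a (node, parent, depth) triple and CLOSED a dictionary of parent pointers, is our own choice. Like Algorithm 3.4 it filters nodes already in CLOSED, and it returns as soon as a path is found; termination when no path exists is taken up later in the chapter.

def db_dfs(start, move_gen, goal_test, depth_bound):
    # Depth bounded DFS. OPEN holds (node, parent, depth) triples; CLOSED maps
    # a node to its parent for path reconstruction.
    open_list = [(start, None, 0)]
    closed = {}
    while open_list:
        node, parent, depth = open_list.pop(0)      # head of OPEN
        if node in closed:
            continue
        closed[node] = parent
        if goal_test(node):
            return reconstruct_path(node, closed)
        if depth < depth_bound:                     # expand only within the bound
            children = [(c, node, depth + 1) for c in move_gen(node)
                        if c not in closed]
            open_list = children + open_list        # children go to the head (DFS)
    return []

def reconstruct_path(node, closed):
    path = [node]
    while closed[node] is not None:
        node = closed[node]
        path.append(node)
    return list(reversed(path))

def dfid(start, move_gen, goal_test):
    # A series of DB-DFS calls with increasing depth bounds, in the spirit of
    # Algorithm 3.10; this sketch stops when a path is found.
    depth_bound = 0
    while True:
        path = db_dfs(start, move_gen, goal_test, depth_bound)
        if path:
            return path
        depth_bound += 1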
Now, for a full tree with branching factor b, the following holds:
L = (b - 1)I + 1
This gives us I = (L - 1)/(b - 1); that is, the number of internal nodes is roughly a fraction 1/(b - 1) of the number of leaves.
Thus, the extra work done by DFID is not significant for large b. When the search tree is a
binary tree, with b = 2, we know from our data structures background that L = I + 1. That is,
DFID would do double the work done by BFS. But as b increases, this ratio comes down. For
b = 11, for example, DFID does only 10% more work than BFS. That is a price we would be
willing to pay for the benefit of linear space complexity.
The fact that the extra work done by DFID is not too much is surprising for some people.
But it should not be so. This is to be expected of exponential growth. As we go deeper into a
tree, we multiply the number of nodes in the previous layer by the branching factor. The work
done on every new layer is much more than the work done on the entire tree before that!
That is the nature of our adversary, CombEx!
2 This was pointed out by Siddharth Sagar, an undergraduate at IIT Dharwad in 2019, in a class I was teaching.
S → (A, B)
A → (S, C)
B → (S, D)
C → (E, D, A)
D → (B, C, G)
E → (C, G)
G → (D, E)
Figure 3.14 An example to highlight the need to add nodes in CLOSED to OPEN again, for
DFID to find the shortest path from S to G.
The reader is encouraged to try out all the variations of DFS and BFS described in this
chapter. We are interested in the variation in which only new nodes not already in OPEN or
CLOSED are added to OPEN. The reader should ascertain that BFS finds the shortest path
S-B-D-G. But, as illustrated in Figure 3.15, DFID does not.
Paths of length 0 | Paths of length 1 | Paths of length 2 | Paths of length 3 | Paths of length 4
Figure 3.15 DFID explores paths of increasing length in successive cycles. When nodes in
CLOSED are not added to OPEN, DFID does not explore the path from B to D when exploring
paths of length 3 because D is already on CLOSED as a child of C, and consequently fails to
find the shortest path S-B-D-G.
The shortest path to the goal, S-B-D-G, is of length 3. However, node D on this path is
not added to OPEN as a child of B because it has already been inspected as a child of C and is
on CLOSED. It is left as an exercise for the reader to verify that not filtering nodes already on
CLOSED does indeed work. We begin by modifying the DB-DFS function as follows. We do not
filter out nodes already in CLOSED, and for the sake of simplicity, we add all new neighbours
to OPEN after constructing nodePairs (now triples) with them. It would be natural to ask
whether one should stop filtering out nodes already on OPEN as well. If the reader is convinced
that the nodes on OPEN can be filtered out, she is encouraged to modify the algorithm to do so.
This is left as something for the reader to ponder over (also see the exercises).
3.5.6 Completeness
It is not yet time to celebrate though. The second problem is that of avoiding looping. Remember
that we introduced CLOSED precisely to do this. If we do not filter nodes in CLOSED, which
we still maintain to be able to reconstruct the path, how does one stop the program from going
into loops? This question has two subparts to it.
One, if looping cannot be stopped, will we still find the shortest path? The interested
reader should hand simulate search on the earlier examples and verify that indeed the
shortest path will be found. There may be wasteful loops, but not infinite ones. This is
because in every cycle the algorithm explores paths of up to a certain length, even if they
may have cycles. For example, in the discussed problem when the depth bound is 3, the
algorithm will also consider S-A-S-B as a candidate, before finding the path S-B-D-G.
The reader should satisfy herself that when the algorithm picks the goal node it has indeed
found the shortest path.
The other subpart needs more attention though. What if there is no path to the goal? This
would happen if the goal node is not in the partition in which the start node is. Will the program
go into incrementally longer and longer loops, and never terminate? That, indeed, is a danger.
Imagine that in our tiny problem, G, or any other node, was not a goal node. Then it would
continue looking at longer paths in an infinite loop. We still need a termination criterion when
the solution does not exist. In our original DFS and BFS algorithms it was when OPEN became
empty. Now we need a different one, since OPEN will never become empty.
One solution could be to count the number of new nodes visited by the algorithm in each
call. While a node may be added to CLOSED more than once, because it can be visited more
than once through different paths, it is counted only once.
The algorithm should now return two things. One is the count of the number of new nodes
in the state space visited by DB-DFS in a call. This count will depend upon the depth bound
given to it. Initially it will be 1 when depthBound = 0. When depthBound = 1, the count is b + 1,
where b is the branching factor. And so on. If the
goal node does not exist in the graph, then, after searching the entire graph, the count will be
the same in two consecutive calls. The calling DFID algorithm will then be able to call it quits
and report failure. The other parameter it returns is the path found, or the empty list if in that
call a path was not found.
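One possible realization of this idea in Python follows. It is our own sketch, not a listing from the text: no nodes are filtered, each entry on OPEN carries the whole path to its node, and DB-DFS returns both the path found and the count of distinct nodes seen in that call.

def db_dfs_counting(start, move_gen, goal_test, depth_bound):
    # No filtering of CLOSED: all neighbours are pushed, so paths with cycles
    # are also explored up to the depth bound. Returns (path, count), where
    # count is the number of distinct nodes seen in this call.
    open_list = [(start, [start])]              # (node, path from start) pairs
    seen = {start}
    while open_list:
        node, path = open_list.pop(0)
        if goal_test(node):
            return path, len(seen)
        if len(path) - 1 < depth_bound:          # depth of node = edges on its path
            children = move_gen(node)
            seen.update(children)
            open_list = [(c, path + [c]) for c in children] + open_list
    return [], len(seen)

def dfid_counting(start, move_gen, goal_test):
    # Terminate with failure when two consecutive calls see the same number
    # of distinct nodes, as described in the text.
    previous_count = -1
    depth_bound = 0
    while True:
        path, count = db_dfs_counting(start, move_gen, goal_test, depth_bound)
        if path:
            return path
        if count == previous_count:
            return []                            # no path exists
        previous_count = count
        depth_bound += 1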
Earlier we had seen that pruning nodes already on CLOSED fails to find the shortest path
on some graphs (see Figure 3.15). It turns out that there can be graphs where DFID fails to even
find a path if we were to prune CLOSED and also use count as described above.3 Consider the
curious graph shown in Figure 3.16.
S → (A, C, B)
A → (S, C)
B → (S, C, E, D)
C → (A, S, B, E)
D → (B, E, G)
E → (C, B, D)
G → (D)
Figure 3.16 Another example to highlight the need to add nodes in CLOSED to OPEN again
in DFID. Without that, and when the algorithm kept count of new nodes in every cycle, DFID
would not even find a path from S to G.
Consider a version of DFID which filtered nodes already on CLOSED and also maintained
a count of the number of nodes seen in each cycle. Then, on the above graph, when exploring
paths of length 2 the algorithm would have visited the 6 nodes S, A, C, B, E, and D. In the next
cycle it would again visit the same 6 nodes. In a manner reminiscent of Figure 3.15, DFID
would visit D via S-C-E-D first, and later not extend S-B to D, thus missing the 3-step path
to G. What is more, since the node count in the two cycles, 6, is the same, the algorithm will
terminate and report failure.
This gives us another reason to not filter nodes already on CLOSED in DFID. What about
filtering nodes already on OPEN? The reader is encouraged to find out.
3 As pointed out by my colleague S. Baskaran as we were formulating a quiz problem for the course on search
methods.
Figure 3.17 The trajectory of DFS and BFS in the state space. BFS gradually moves away
from the start node S, while DFS, shown in grey nodes, dives headlong into the state space.
The arrows depict the trajectory and not the edges in the state space. The goal node could be
anywhere in the state space. Neither algorithm is cognizant of the goal node while searching
and each traverses the space in a predetermined manner.
Summary
In this chapter we have built the basic machinery for search. Given a MoveGen or neighbourhood
function, the search algorithm simultaneously generates and explores the search space. This
happens until the GoalTest function signals success or, for a finite search space, until it has
exhausted all possibilities.
The search algorithm has its own perspective of the space being searched, and that manifests
itself in the form of a search tree that it generates and explores. Every variation of the search
algorithm has its own corresponding search tree that it explores.
We have studied three variations of our basic algorithm. DFS dives headlong into the
search space. Its major advantage is that it requires only linear space. But it does not guarantee
the shortest path, and for infinite spaces may not return any path even if there exists one. BFS
conservatively pushes into the search space and guarantees the shortest path, even for infinite
search spaces. DFID combines the best of both, with surprisingly low extra cost.
In terms of time complexity, all three are unable to combat CombEx, inspecting an
exponential number of nodes as they go deeper. This is partly because they are all blind or
uninformed. The way they search the given space is always the same in a given domain as
defined by the MoveGen function, oblivious of where the goal node is. The GoalTest function
is only used to recognize a goal node when the search stumbles upon it. In the next chapter
we study heuristic search, which exploits some extra knowledge to skew the exploration more
towards the goal node.
Exercises
1. Implement the DFS and BFS algorithms in which a node in the search tree stores the entire
path up to it in the state space. Then, when a goal node is found, the corresponding path
would be readily available.
2. Investigate the DFS (Algo. 3.4) and BFS (Algo 3.6) search algorithms on the tiny problem
shown in Figure 3.7. Draw the search trees for each of the two algorithms and list the order
in which the two algorithms explore the state space. What is the path found by each of the
two algorithms?
3. For the problem shown in Figure 3.7 investigate how DFS and BFS behave when nodes
already in OPEN are allowed to be added to OPEN again. How does the performance
compare with the algorithms when this is not allowed?
4. For the problem shown in Figure 3.7, investigate how DFS and BFS behave when nodes
already in OPEN and CLOSED are allowed to be added to OPEN again. How does the
performance compare with the algorithms when this is not allowed?
5. Repeat the above three paper and pencil simulations on the state space depicted in Figure 3.14.
6. [Baskaran] The following figure shows a map with several locations connected by 2-way
edges (roads). The MoveGen function returns neighbours in alphabetical order. The
start node is S and the goal node is G. Module RemoveSeen removes neighbours already
present in OPEN/CLOSED lists. List the nodes in the order visited by DFS and BFS.
Draw the search trees generated by both, clearly identifying the nodes on CLOSED and on
OPEN, and the path found.
7. List the nodes in the order visited by DFID (when all new nodes are added to OPEN). What
is the path found?
8. [Baskaran] The MoveGen function for the following graph generates the neighbours in
alphabetical order. S is the start node and G is the goal node. List the nodes in the order in
which DFS will visit the nodes till termination. What is the path found?
9. Will your answer be the same if the edge GC in the above graph were to be bidirectional?
10. Given the MoveGen function below, draw the state space. S is the start node and G is the
goal node. Draw the search trees for each of the three variations for DFS and BFS and list
the nodes in the order in which the six algorithms visit them.
S → (A B C D)
A → (E J B S)
B → (A F S)
C → (G H D S)
D → (C I S)
E → (K J A)
F → (J K B)
G → (L C)
H → (L M I C)
I → (H D)
J → (E F A)
K → (F E)
L → (M H G)
M → (H L)
11. Algorithm DFID (the version which adds all nodes generated by MoveGen without
filtering any nodes) is traversing the state space generated by the above MoveGen function
till termination. S is the start node and G is the goal node. List the nodes visited in each
cycle up to depth 4 or till termination, whichever happens first.
12. [Baskaran] Write the MoveGen functions for the four graphs in the following figure.
13. [Baskaran] Study the following state space. Which of the variations of DFS, BFS, and
DFID, finds a path from S to G? Which of them finds an optimal path?
14. Formulate the cryptarithmetic problem as a state space search problem. An example
problem is shown here. Each letter needs to be assigned to a unique digit such that the
arithmetic sum adds up correctly.
SEND
+ MORE
MONEY
15. Modify the DFS algorithm and its ancillary functions with moves fi and bi as described in
Section 3.2.4. Use this version to create a demo of the N-queens problem with one board
on display and queens being placed and removed from the board as the search progresses.
16. Write down the modules ReconstructPath and MakePairs along with the ancillary
functions for implementing DB-DFS, given the fact that nodePair is now a triple after the
depth information is added to it.
17. While developing DFID we argued in the chapter for not filtering out nodes already in
CLOSED while adding new nodePairs to OPEN. Should one do the same for nodes already
on OPEN? Hint: All nodes on OPEN are to the right of a given node in the search tree. Add
an edge A-B to the problem in Figure 3.10 and try it out.
Chapter 4
Heuristic Search
Having introduced the machinery needed for search in the last chapter, we look at
approaches to informed search. The algorithms introduced in the last chapter were
blind, or uninformed, taking no cognizance at all of the actual problem instance to
be solved and behaving in the same bureaucratic manner wherever the goal might be.
In this chapter we introduce the idea of heuristic search, which uses domain specific
knowledge to guide exploration. This is done by devising a heuristic function that
estimates the distance to the goal for each candidate in OPEN.
When heuristic functions are not very accurate, search complexity is still
exponential, as revealed by experiments. We then investigate local search methods
that do not maintain an OPEN list, and study gradient based methods to optimize the
heuristic value.
Knowledge is necessary for intelligence. Without knowledge, problem solving with search is
blind. We saw this in the last chapter. In general, knowledge is that sword in the armoury of
a problem solver that can cut through the complexity. Knowledge accrues over time, either
distilled from our own experiences or assimilated from interaction with others - parents,
teachers, authors, coaches, and friends. Knowledge is the outcome of learning and exists in
diverse forms, varying from tacit to explicit. When we learn to ride a bicycle, we know it but are
unable to articulate our knowledge. We are concerned with explicit knowledge. Most textbook
knowledge is explicit, for example, knowing how to implement a leftist heap data structure.
In a well known incident from ancient Greece, it is said that Archimedes, considered by
many to be the greatest scientist of the third century BC, ran naked onto the streets of Syracuse.
King Hieron II was suspicious that a goldsmith had cheated him by adulterating a bar of gold
given to him for making a crown. He asked Archimedes to investigate without damaging the
crown. Stepping into his bathtub Archimedes noticed the water spilling out, and realized in a
flash that if the gold were to be adulterated with silver, then it would displace more water since
silver was less dense. This was his epiphany moment when he discovered what we now know
as the Archimedes principle. And he ran onto the streets shouting ‘Eureka, eureka!’ We now
call such an enlightening moment a Eureka moment!
‘Eureka’ comes from the Ancient Greek word εὕρηκα (heureka), meaning ‘I have found
(it)’. We will interpret this assertion to mean ‘I know (it)’. The word heuristic is etymologically
related to it and refers to a similar state of knowing. In this chapter we focus on giving a sense of
direction to our search algorithms which, hitherto, were blind, choosing the node from OPEN
in an uninformed manner. We will now empower our search algorithm with some knowledge
of which of the choices presenting themselves via OPEN is the best. Figure 4.1 illustrates the
notion of an estimated distance to the goal node as embodied in a heuristic function. If there
is more than one goal in the domain, then the heuristic value is the smallest of the estimated
distances. As we will see, such heuristic knowledge is not infallible, but more often than not
will favourably impact the time complexity of search.
Figure 4.1 Node S is the start node and G the yet undiscovered goal node. Nodes in grey are
on CLOSED and unfilled nodes are on OPEN. The arrows from nodes on OPEN point to the
parent which added them first. The heuristic function estimates the distance of each node N
to the goal. This estimate, h(N), can help decide which node to pick from OPEN. The figure
depicts the estimated distance only for four nodes.
distance, computed by an inexpensive procedure, and may not be accurate. If we were to search
ahead from each of the candidates to decide which one to pick, that would defeat the purpose.
We look at heuristic functions for some example problems, including some from Chapter 2.
4.1.2 SAT
Defining a heuristic function for the Boolean satisfiability, or SAT, problem (Section 2.3.3)
happens naturally. For a candidate solution, the value of h(n) can be the number of clauses
satisfied. For the example SAT problem,
(b ∨ ¬c) ∧ (c ∨ ¬d) ∧ (¬b) ∧ (¬a ∨ ¬e) ∧ (e ∨ ¬c) ∧ (¬c ∨ ¬d),
the candidate 10101 satisfies clauses 2, 3, 5, and 6. Thus h(10101) = 4. The reader would
have noticed that this is not consistent with our idea of a distance function, which would
have decreasing values as we satisfy more clauses, and presumably be closer to the goal. The
observation is true. One simple way to convert this into such a function is to redefine it as
follows: let h(n) be the number of clauses not satisfied by the candidate, that is, the total number
of clauses minus the number satisfied.
This would have the property that h(goal) = 0. Later in this chapter we will look at an
algorithm that treats our search problem as a problem of minimization of h(n). With the simpler
definition of just counting the number of clauses satisfied, this would become a maximization
problem instead. A variation of this function could be to consider the weighted sum of clauses
satisfied, with larger weights for smaller clauses.
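In code the basic heuristic is a few lines. The sketch below is ours: a clause is represented as a list of (variable index, required bit) pairs, and a candidate as a bit string over the variables a to e.

def clauses_satisfied(candidate, clauses):
    # candidate: a bit string such as "10101" for (a, b, c, d, e)
    # clauses: each clause is a list of (index, value) literals, e.g.,
    # (b or not c) becomes [(1, '1'), (2, '0')]
    count = 0
    for clause in clauses:
        if any(candidate[i] == v for i, v in clause):
            count += 1
    return count

# The example formula from the text, over the variables a, b, c, d, e.
F = [
    [(1, '1'), (2, '0')],   # (b or not c)
    [(2, '1'), (3, '0')],   # (c or not d)
    [(1, '0')],             # (not b)
    [(0, '0'), (4, '0')],   # (not a or not e)
    [(4, '1'), (2, '0')],   # (e or not c)
    [(2, '0'), (3, '0')],   # (not c or not d)
]

print(clauses_satisfied("10101", F))    # prints 4, as computed above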
We consider two heuristic functions, h_Hamming(n) and h_Manhattan(n). The first simply counts
the number of tiles out of place. This assumes that each tile will take the same effort to get to
its destination. The second adds up the distance of each tile from its destination. The distance
is measured in terms of the number of moves either horizontally or vertically but does not
consider the fact that the way has to be cleared for any tile to move.
Figure 4.2 Two heuristic functions for the 8-puzzle. The Hamming distance counts the number
of tiles out of place, and the Manhattan distance computes the Manhattan distance of each tile
from its final location. Observe that neither is perfect, and both neighbours of the start node
have a higher value.
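Both functions are straightforward to compute from the board. In the sketch below, which is only an illustration, a state is a tuple of nine entries listing the tiles row by row, with 0 standing for the blank; the particular start state is made up.

def hamming(state, goal):
    # Number of tiles (ignoring the blank, 0) that are out of place.
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def manhattan(state, goal):
    # Sum over the tiles of horizontal plus vertical distance to their goal
    # cells, on a 3 x 3 board stored row by row.
    total = 0
    for tile in range(1, 9):
        i, j = state.index(tile), goal.index(tile)
        total += abs(i // 3 - j // 3) + abs(i % 3 - j % 3)
    return total

goal = (1, 2, 3, 4, 5, 6, 7, 8, 0)
start = (1, 2, 3, 0, 4, 6, 7, 5, 8)      # a made-up instance
print(hamming(start, goal), manhattan(start, goal))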
Both the above heuristic functions vastly underestimate the actual number of moves that
the solution will need. As we will see in Chapter 6 on finding optimal paths, it is desirable to
underestimate the distance, but at the same time be as close to the actual distance as possible.
Remember that the heuristic distance is only an estimate. If we had an oracle that would tell
us the exact distance, there would be no need for search. The algorithm would directly head
towards the goal.
The heuristic functions above are computed by inspecting the given state and the goal
state. The more a given state looks like the goal state, the closer it is assumed to be. In that
sense, a distance function is the inverse of a similarity function, though the latter normally has a
range [0, 1]. However, similarity can be misleading. It could easily be the case of so near yet
so far. One way to improve the heuristic function that researchers have tried is to add a certain
value to the distance if two tiles are in the same row but in the wrong order. This kind of
tuning has been common in the early research on heuristic search, until the time when machine
learning made it possible to learn heuristic functions.
superimposed on a map of the South Chennai region. If you look carefully, you can also see a
couple of routes to Anna University Alumni Club suggested to us by Google Maps.
Figure 4.3 A city map can be represented as a graph, here shown superimposed on a map.
Observe the two routes from IIT Madras to Anna Alumni Club suggested by Google Maps as
shown with thick dashed edges. A heuristic function would first drive the search towards the
nodes in the circle in the middle.
One expects the shortest route, or the fastest route, from a map finding algorithm, and that
is what we generally get. What is underneath the hood of such applications? We will get some
answers in Chapter 6, but for now let us focus on the use of a heuristic function to help us make
the choices, as depicted in Figure 4.1.
The heuristic function h(n) is an estimate of the distance from a given node to the goal
node. While ideally we would like the road length, perhaps weighted by congestion, we assume
we have access only to location information. Let the coordinates of the two points be (x_i, y_i) and
(x_k, y_k). The following distance measures are commonly used in geometric spaces. The
Manhattan distance is |x_i - x_k| + |y_i - y_k|, and the Euclidean distance is
((x_i - x_k)^2 + (y_i - y_k)^2)^(1/2). The Minkowski norm, named after the German mathematician
Hermann Minkowski, is a generalization of the Manhattan and Euclidean distance measures: the
distance of order p is (|x_i - x_k|^p + |y_i - y_k|^p)^(1/p), which reduces to the Manhattan distance
for p = 1 and to the Euclidean distance for p = 2.
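In code all three measures are one-liners; the following fragment, with made-up coordinates, is only for concreteness.

def minkowski(point1, point2, p):
    # p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
    return sum(abs(x1 - x2) ** p for x1, x2 in zip(point1, point2)) ** (1 / p)

start = (0, 0)         # made-up grid coordinates
goal = (60, 40)
print(minkowski(start, goal, 1))    # 100.0 (Manhattan)
print(minkowski(start, goal, 2))    # about 72.1 (Euclidean)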
Observe that, whatever distance measure we use, the search algorithm will head to that
point on the Adyar river that appears to be closest to the destination, before we end up seeking
ways to cross the river. This is characteristic of heuristic functions, which are estimates based
on incomplete information.
Figure 4.4 On the left, the tour constructed by the NearestNeighbour heuristic, which starts
from a random node and goes to the nearest neighbour on the next hop. On the right, some
initial edges added by the Greedy heuristic, adding shortest edges first. Both algorithms have
to carefully avoid prematurely closing smaller cycles that do not visit all the cities.
Another popular algorithm is called the Savings heuristic. The algorithm begins by choosing
a pivot vertex and constructing (N - 1) tours of length 2, anchored on the pivot. It then merges
two tours by removing two edges containing the pivot and connecting the other ends of the two
removed edges. The algorithm is illustrated in Figure 4.5. Let L1 and L2 be the lengths of the
two edges removed. And let Lnew be the length of the new edge added. The reduction in edge
costs or the saving is ((L1 + L2) - Lnew). The two edges to be removed are chosen such that the
saving is maximum, and that explains the name.
Figure 4.5 The Savings heuristic begins with the construction of (N - 1) tours of length 2 from
a fulcrum node shown in black. Then in (N - 2) cycles it progressively merges two tours by
deleting one edge from each of the two and adding one edge to reconnect them.
The algorithms described above construct a solution in one shot. They do not search for
alternatives. In Chapter 6 we will look at another constructive heuristic method for TSP that
employs search, designed to guarantee the optimal tour. But search is more common in the
solution space with perturbative methods, and we will explore them later in this chapter and the
next. As far as search is concerned, the cost of a tour for TSP can itself be used as the estimate,
with the understanding that the lower it is the better.
Heuristic functions add another dimension to the search space, along which the heuristic
value for every node is specified. In a sense, a heuristic function defines the landscape that
search operates on. The nature of the landscape is determined by the heuristic function, and the
way we traverse it is determined by the neighbourhood or MoveGen function. We shall explore
variations of these later. First, we look at modifying the search algorithm from Chapter 2 to
incorporate the heuristic function.
1 As the great bard said, a rose by any other name would smell as sweet.
Algorithm 4.1. Algorithm BestFirstSearch sorts the OPEN to bring the best node to
the head. The third parameter in the nodePair stores the heuristic value of the node.
BestFirstSearch(S)
1 OPEN ← (S, null, h(S)) : [ ]
2 CLOSED ← empty list
3 while OPEN is not empty
4 nodePair ← head OPEN
5 (N, _, _) ← nodePair
6 if GoalTest(N) = TRUE
7 return ReconstructPath(nodePair, CLOSED)
8 else CLOSED ← nodePair : CLOSED
9 children ← MoveGen(N)
10 newNodes ← RemoveSeen(children, OPEN, CLOSED)
11 newPairs ← MakePairs(newNodes, N)
12 OPEN ← sort_h(newPairs ++ tail OPEN)
13 return empty list
We should point out that sorting the OPEN gives us an algorithm that is correct, although it
may not be the most efficient. This is because sorting is expensive. The task is to pick the node
with the minimum heuristic value from OPEN, and this is best done by maintaining OPEN as a
priority queue. Using a similar argument, CLOSED should be maintained as a hash table, since
the task is to retrieve a specific nodePair based on a node name.
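A Python sketch along these lines is given below; the signatures and the representation are our own choices. OPEN is a heap ordered on h, and CLOSED is a dictionary of parent pointers. Instead of filtering nodes already on OPEN, this version allows duplicates on the heap and simply skips an entry whose node has already been moved to CLOSED.

import heapq

def best_first_search(start, move_gen, goal_test, h):
    # OPEN is a priority queue ordered on h; CLOSED maps a node to its parent.
    counter = 0                        # tie-breaker so the heap never compares nodes
    open_heap = [(h(start), counter, start, None)]
    closed = {}
    while open_heap:
        _, _, node, parent = heapq.heappop(open_heap)
        if node in closed:             # a stale duplicate entry: skip it
            continue
        closed[node] = parent
        if goal_test(node):
            path = [node]
            while closed[path[-1]] is not None:
                path.append(closed[path[-1]])
            return list(reversed(path))
        for child in move_gen(node):
            if child not in closed:
                counter += 1
                heapq.heappush(open_heap, (h(child), counter, child, node))
    return []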
The behaviour of BestFirstSearch is illustrated in Figure 4.6 for an abstract search tree.
The values in the nodes are heuristic values, and nodes in CLOSED are shaded. As can be seen
in the illustration, search can jump around in the search tree and not follow a predetermined
pattern like it did in DFS and BFS. Patrick Henry Winston (1943-2019), one of the pioneers
of AI at the Massachusetts Institute of Technology (MIT), said that a new branch may sprout
at any moment in the search tree. This can happen because the heuristic value does not always
reflect the ground truth. Remember, the heuristic value is only an estimate, and hence it is
perfectly possible that the heuristic function can go wrong and the focus of search may later
shift to another part of the tree.
Figure 4.6 The nodes are labelled with their heuristic values. BestFirstSearch expands the
best node in each cycle, that is, the node with the lowest heuristic value. As Patrick Winston
once said, a new branch may sprout in the tree any time.
Figure 4.7 depicts a small path finding problem that we shall test different algorithms on.
The thick curve represents a river, not part of the problem description, which might explain the
structure of the graph. As one can see, there are three places where one can cross the river. The
nodes are placed on a grid of unit size 10 kilometres, so that their coordinates can be computed
easily, as can be the heuristic function. The edge labels are the costs of traversing the edges.
Node I is the start node and node W is the goal node.
Figure 4.7 A tiny route finding problem. The nodes are placed on a grid where each edge is 10
kilometres. Each edge is labelled with the cost of traversing that edge. Node I is the start node
and node W is the goal node. The thick grey curve is not a part of the graph and represents a
river flowing with three bridges across it.
We will adopt the Manhattan distance function as the heuristic function for ease of
computation. Thus h(W) = 0, being the goal node, and h(R) = 40 since it is four hops away.
The start node has a value h(I) = 100, six steps horizontally and four vertically from the goal.
The node that BestFirstSearch picks at each stage is determined completely by the
heuristic function. In that sense, BestFirstSearch is only a forward looking algorithm,
focused on reaching the goal node. As we will illustrate, it does not guarantee the shortest path.
In Chapter 6, we will look at the well known algorithm A* that guarantees the shortest path,
while being guided by a heuristic function as well.
Figure 4.8 shows the progress of BestFirstSearch on the problem given above. Since
BestFirstSearch ignores the edge costs, we have removed them from the figure. Instead,
we have shown the h-value for each node, which is the Manhattan distance to the goal node.
The reader is encouraged to work out the order in which the algorithm inspects the nodes, by
simulating the algorithm. This order is shown as labels of the nodes inspected. The shaded
nodes are still on OPEN when the algorithm terminates. The directed edges show the parents
pointers from the goal all the way back to the start node.
Figure 4.8 BestFirstSearch is guided by heuristic function values shown next to each
node. The numbers in the nodes show the order in which they are inspected. The path
found is shown by the backward arrows. The shaded nodes are the nodes in OPEN when
BestFirstSearch terminates. The dashed edges are not generated by the algorithm.
The algorithm inspects eight nodes and finds the path <I P Q K F N W> with cost 195,
as summed up from the edge costs in Figure 4.7. Observe that node M is visited after K, but is
not on the path, because F was first generated as a child of K. Also observe that the path is not
optimal.
Let us analyse the algorithm from the perspective of the four parameters we are looking at.
In the version of BestFirstSearch adapted from DFS, each node is generated and added to
OPEN exactly once, and that determines, once and for all, who the parent of a node is.
When edge or move costs are equal, BestFirstSearch can still find non-optimal paths.
This happens because the heuristic function is not perfect. As described later, heuristic
functions define terrains where the gradient embodies the direction the heuristic points to. As
we will see in the blocks world domain later, these terrains define local optima which search
gravitates towards. This is the case for problems like the 8-puzzle and also the Rubik’s cube.
BestFirstSearch can still get around and find a solution, even though it may not be optimal.
The local search algorithms we study next will in fact become incomplete.
4.2.2 Completeness
The BestFirstSearch algorithm is complete for finite graphs. The argument is the same
that we applied for BFS and DFS. In every cycle, the algorithm picks one node from OPEN
and inspects it. There are a finite number of nodes in the connected component of the graph.
And with filtering out of nodes already on OPEN or CLOSED, each node is added exactly
once to OPEN. There are two ways the algorithm can terminate. One, when OPEN becomes
empty. This means the goal node was not present in the graph. Or two, it finds the goal node.
When the graph is infinite, we cannot make the claim of completeness. While the heuristic
function is meant to guide search towards the goal, if the function does not yield a good estimate,
the search may wander off. If the heuristic function happens to be Machiavellian, it could even
drive the search in the opposite direction.
Figure 4.9 Given that the heuristic function is imperfect, the search frontiers for BestFirstSearch
shown in grey turn out to be exponential in practice too. Unshaded nodes show the frontier for
BFS spanning the exponentially growing width of the search tree.
Penetrance = L/N, where L is the length of the path found and N is the total number of nodes
generated during the search.
If the heuristic function were to be perfect, then penetrance would tend to one.
The effective branching factor, B, is the number of successors generated by a ‘typical’ node
for a given search problem. This is estimated by imagining a tree with branching factor B and
depth L, with total N nodes. As we know from the last chapter, these three parameters satisfy
the following constraint:
N = (B^(L+1) - 1)/(B - 1)
As one can see, the smaller N is, the lower B will be.
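Given N and L, this constraint can be solved for B numerically; a small bisection sketch, ours and purely illustrative, follows.

def effective_branching_factor(n, l, tolerance=1e-6):
    # Solve n = (B**(l + 1) - 1) / (B - 1) for B by bisection, using the fact
    # that the right hand side grows monotonically with B.
    def total_nodes(b):
        return (b ** (l + 1) - 1) / (b - 1)

    low, high = 1.000001, float(n)
    while high - low > tolerance:
        mid = (low + high) / 2
        if total_nodes(mid) < n:
            low = mid
        else:
            high = mid
    return (low + high) / 2

print(effective_branching_factor(121, 4))   # close to 3.0: a full b = 3 tree of depth 4 has 121 nodes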
The bottom line is that in practice both time and space complexity of BestFirstSearch
are exponential in nature. This is also a consequence of the search being global in nature,
retaining all nodes not yet inspected in OPEN. As we have discussed earlier, this is the basis
of search being complete for finite spaces. Even if the heuristic function is poor, the algorithm
will inspect all nodes before it terminates. In the real world, one is sometimes willing to trade
completeness with complexity. This, for example, is the case with most TSP solvers. As we
shall see, this is often also the case with SAT solvers.
In the following sections we look at search methods requiring low space. These are
approaches that only look in the neighbourhood of the current node and are called local search
methods.
How does the algorithm behave with this change? The first observation is that it commits to
moving to one of the neighbours of the current node, the best one. Second, because we have not
made any other change, and the filtering of nodes continues, the search will never turn back. It
will terminate if it finds the goal node, or if no new neighbour can be generated.
Figure 4.10 Algorithm HillClimbing or steepest gradient ascent moves locally in the direction
where improvement in the heuristic value is the highest. Figure from Khemani (2013).
images about the horizontal axis. In one, the highest point is the goal, and in the other, it is the
lowest. Both are optimization problems. The first maximizes the height, the second minimizes
it. Both choose the steepest gradient, and both terminate when the gradient becomes zero. The
corresponding variation of our BestFirstSearch algorithm is called HillClimbing.
HillClimbing(S)
1 bestNode ← S
2 nextNode ← head sort_h MoveGen(bestNode) > best to worst order
3 while h(nextNode) is better than h(bestNode)
4 bestNode ← nextNode
5 nextNode ← head sort_h MoveGen(bestNode)
6 return bestNode
The astute reader would have noticed that Algorithm 4.2 does unnecessary work. This is
because it has been arrived at as an adaption of BestFirstSearch. There is no need to sort the
set of neighbours. Replacing Line 2 with the following would be more elegant:
nextNode ← best(MoveGen(bestNode))
Here best is a function, Max or Min, that chooses the best neighbour. This can be done in
one pass over the neighbours.
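A compact Python rendering of this one-pass version, assuming we are maximizing h and that move_gen returns the neighbours, might look as follows; it is only a sketch.

def hill_climbing(start, move_gen, h):
    # Steepest gradient ascent: move to the best neighbour as long as it is
    # strictly better than the current node, and return the node reached.
    best_node = start
    while True:
        neighbours = move_gen(best_node)
        if not neighbours:
            return best_node
        next_node = max(neighbours, key=h)    # one pass, no sorting
        if h(next_node) <= h(best_node):      # no improvement: stop
            return best_node
        best_node = next_node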
Algorithm 4.2 does not discuss the path reconstruction part. This is left as an exercise for
the reader. Like the SimpleSearch algorithm we started with, this is more suited to solving a
configuration problem, where one has only to return the goal node. As mentioned before, the
only termination criterion is when there is no better node in the neighbourhood. One implicitly
assumes that that is the goal node. If wishes were horses...! When our climber stops, the
situation is more likely to be the one in Figure 4.11.
We analyse the HillClimbing algorithm on the four parameters we are looking at.
4.3.3 Completeness
The algorithm terminates on all finite domains. It will also terminate on all infinite domains if
the heuristic values are bounded. At some point, no neighbour will be better than the current
node, and the algorithm will terminate.
However, the algorithm may not find a solution node, or a path to the solution node, because
it may halt at a local optimum. It is not complete.
We will look at how the planning community formally defines planning domains in
Chapter 11. There picking up and putting a block down are separate, named, moves. In this
chapter we will adopt a simpler approach. A move is one where a one-armed robot can pick up
one clear block, which has nothing on top of it, and place it on another clear block or on the
table. The task is to rearrange the blocks in some desired way.
Figure 4.12 depicts a typical planning problem. This problem was used by Elaine Rich in
her very popular book on AI that appeared in the 1980s (Rich, 1983). The example illustrates
the efficacy of a good heuristic function. The figure depicts the start state on the left and the
goal state on the right. The goal state is completely specified here. The planning community
generally uses a partial goal description, which would correspond to a set of many states that
satisfy the goal description.
Figure 4.12 The blocks world domain. A one-armed robot has the task of rearranging the
blocks in the start state S, to achieve the goal state G. A move consists of picking up a block
which has nothing on it and placing it on another block with nothing on it or on the table. Only
one block can be placed on another block. The table is large enough to accommodate any
number of the blocks.
The first heuristic function, h1(n), adds 1 for every block that is resting on the thing it should be
resting on in the goal state, and subtracts 1 for every block that is not.
h1(S) = (-1) + 1 + 1 + 1 + (-1) + 1 = 2
h1(G) = 1 + 1 + 1 + 1 + 1 + 1 = 6
The heuristic value of the goal is 6, signifying that all six blocks are where they should
be. Observe that this makes our problem a maximizing problem, with higher values being
better. Figure 4.13 depicts the progress of search on the state space. Remember that each
move is reversible. It starts by generating the neighbours of the start state and computes their
heuristic values.
h1(P) = (-1) + 1 + 1 + 1 + (-1) + 1 = 2
h1(Q) = 1 + 1 + 1 + 1 + (-1) + 1 = 4
h1(R) = (-1) + 1 + 1 + 1 + (-1) + 1 = 2
h1(T) = (-1) + 1 + 1 + 1 + (-1) + 1 = 2
Figure 4.13 HillClimbing begins by generating the neighbours of state S. All moves are
reversible. Q is best neighbour with a value 4, as per h1(N), and it moves to it. But Q is also a
local maximum as all four of its neighbours have a heuristic value 2. The algorithm terminates
at Q without reaching the goal state.
The best neighbour is Q, in which the robot arm has moved block A on top of block E.
HillClimbing moves to node Q, which in turn has two new neighbours in which block B is
moved, on top of A in node U, and on the table in node V. The other two neighbours of Q are S
and P, when block A is moved, and they already exist in the state space.
h1(U) = 1 + (-1) + 1 + 1 + (-1) + 1 = 2
h1(V) = 1 + (-1) + 1 + 1 + (-1) + 1 = 2
All neighbours of Q are worse than Q, and it is a local maximum. HillClimbing terminates
here, without solving the planning problem.
The above heuristic function only checks whether a given block is perched on something
it should be on. The second heuristic function, h2(n), is more perceptive. It checks whether the
entire tower below it is as it should be. For every block that is sitting on a correct tower below, it
adds the number of objects below it, including the table. That is, it adds the height of the block
from the table. And for every block on a wrong tower, it subtracts the same. The following are
the heuristic values of the start state and the goal state:
h2(S) = (-4) + 3 + 2 + 1 + (-2) + 1 = 1
h2(G) = 5 + 3 + 2 + 1 + 4 + 1 = 16
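To make the two functions concrete, here is a small Python sketch. The representation of a state as a list of stacks listed bottom to top, and our reading of the start and goal states of Figure 4.12, are assumptions for illustration only.

def support_below(state, block):
    # state: a list of stacks, each stack a list of blocks from bottom to top.
    # Returns the block directly below `block`, or 'table'.
    for stack in state:
        if block in stack:
            i = stack.index(block)
            return 'table' if i == 0 else stack[i - 1]
    return None

def tower_below(state, block):
    # The sequence under `block`, from the table upwards.
    for stack in state:
        if block in stack:
            return ['table'] + stack[:stack.index(block)]
    return None

def h1(state, goal):
    score = 0
    for stack in state:
        for block in stack:
            correct = support_below(state, block) == support_below(goal, block)
            score += 1 if correct else -1
    return score

def h2(state, goal):
    score = 0
    for stack in state:
        for block in stack:
            below = tower_below(state, block)
            weight = len(below)      # objects below the block, counting the table
            correct = below == tower_below(goal, block)
            score += weight if correct else -weight
    return score

start = [['D', 'C', 'B', 'A'], ['F', 'E']]   # our reading of state S, bottom to top
goal = [['D', 'C', 'B', 'E', 'A'], ['F']]    # our reading of state G
print(h1(start, goal), h2(start, goal))      # prints 2 1, matching the values above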
With the second heuristic function, HillClimbing again begins by generating the four
neighbours as shown in Figure 4.14. The h-values are shown below. It can be observed that
h2(n) is more discriminating than h1(n) was, giving each of the four states different values. Also,
it evaluates node P to be the best node, instead of node Q.
h2(P) = (-1) + 3 + 2 + 1 + (-2) + 1 = 4
h2(Q) = (-3) + 3 + 2 + 1 + (-2) + 1 = 2
h2(R) = (-4) + 3 + 2 + 1 + (-5) + 1 = -2
h2(T) = (-4) + 3 + 2 + 1 + (-1) + 1 = 2
The best neighbour is P, in which the robot picks up A and puts it on the table. Node
P in turn has six new neighbours, three for moving block B and three for moving E. Figure
4.14 depicts only three of these, W, X, and Y, where block E is moved onto A, B, and the table
respectively, ignoring the poorer moves for block B.
h2(W) = (-1) + 3 + 2 + 1 + (-2) + 1 = 4
h2(X) = (-1) + 3 + 2 + 1 + 4 + 1 = 10
h2(Y) = (-1) + 3 + 2 + 1 + (-1) + 1 = 5
The reader is encouraged to verify that the three new neighbours in which block B is moved
are all worse than P.
The best neighbour is node X where E has been moved onto block B. As one can see from
the problem description in Figure 4.12, node X is quite close to the goal node. This is also
reflected in its heuristic value h2(X) = 10. Node X has five new successors. Three for moving
block F, not shown in the figure, and two for block A. Of these, one is the goal node G with the
best heuristic value. HillClimbing moves to it, and terminates with a plan to reach the goal -
move A to table, move E onto B, move A onto E.
The simple planning problem discussed above in some detail illustrates the effect of the
choice of the heuristic function on local search. It also highlights the fact that the heuristic
function defines a terrain on which search progresses.
Figure 4.14 With the heuristic function h2 again HillClimbing starts with the same four
neighbours of S. This time P is best with a value 4, as per h2(N), and it moves to it. P has six
new neighbours, three moves each for blocks E and B. We have only drawn moves of E. Next,
node X is best with a value 10 which leads to the goal G in the next cycle, and the algorithm
terminates with a working plan.
F = (a ∨ b) ∧ (a ∨ c) ∧ (c ∨ d) ∧ (b ∨ ¬d) ∧ (a ∨ d) ∧ (b ∨ c) ∧ (c ∨ ¬d) ∧ (b ∨ d) ∧ (¬b ∨ d)
The Boolean formula F has four variables and nine clauses. The number of candidates
is 2^4. Let h(n) be the number of clauses satisfied by a candidate, and let the neighbourhood
function be flip-1-bit. The goal has a value 9 when all nine clauses are satisfied. With the
flip-1-bit neighbourhood function, each node has four neighbours. The landscape is drawn
in Figure 4.15.
There are four types of edges incident on the nodes, based on their heuristic values relative
to their neighbours.
Figure 4.15 The heuristic terrain for a 4-variable SAT problem with nine clauses. The heuristic
value is the number of clauses satisfied by a candidate’s valuation. The darkest nodes are
global maxima. There is one local maximum, 1010, shown in grey. Directed edges from any
node indicate steepest gradient. Thick arrows are unambiguous steepest gradient edges and
dashed arrows have other competing edges. Double lined edges are ridges. The remaining
grey edges are never in contention.
Thick arrows lead to a unique best neighbour. Steepest gradient ascent has a clear path to
follow. For example, the arrow from node 1000 to 1010. Dashed arrows
represent the case when there is more than one best neighbour. HillClimbing then has more
than one option for making the move, and the tie break would depend upon the implementation.
There are three such arrows emanating from 0100 with value 5 to nodes 1100, 0110, and 0101,
each with value 7. The third kind of edge is shown with grey undirected edges, indicating that
neither node is an option to move to from the other. Node 0010 is connected to two such nodes,
0110 and 0011 with value 7. All three nodes have a better option available elsewhere. Finally,
a double lined undirected dashed edge represents a ridge, connecting to a node with the same
value. HillClimbing would not traverse such an edge. Node 1000 with value 6 is connected by
such an edge to 1001. However, it has a better neighbour, 1010 with value 8, that it can move to.
Node 1010 itself is connected by two ridges to nodes 1110 and 1011. Since these are the best
neighbours, node 1010 becomes a maximum. In fact, it is a local maximum.
Observe that the nature of an edge depends only on what other nodes the given node is
connected to. And this would be different for different neighbourhood functions.
There are two global maxima, 0111 and 1111, shown in dark rectangles with white text.
There is one local maximum, 1010. These are nodes on which steepest gradient ascent halts.
Observe that a local maximum like 1010 may be connected to nodes via ridges which are
not conducive to gradient ascent. Such groups of nodes define a region in the terrain called a
plateau. The reader is encouraged to identify the starting candidates for the SAT from which
HillClimbing reaches a global maximum.
The fact that a given node is a local optimum is a consequence of (a) the heuristic function
and (b) the neighbourhood function. In the problem from Figure 4.12, we saw that the choice
of the heuristic function determines the path that the search traverses. Figure 4.14 is an example
of a heuristic function that defines a monotonic terrain in which the steepest gradient succeeds
in finding the goal node.
swapped in the path representation. The tour on the left is the given tour and the one on the right
is the result of the perturbation. The move is reversible.
If there are N cities, then the 2-city-exchange yields NC2 neighbours. One could devise
similar, but denser, functions swapping more than two cities.
It is more common, however, to employ edge exchange neighbourhood functions, possibly
because edges are the ones that contribute to the tour cost. In such operators, one removes a
certain number of edges from a tour and inserts new edges to obtain a new tour. Figure 4.17
shows a 2-edge-exchange operator.
Figure 4.16 The perturbation operator 2-city-exchange swaps the position of two cities in the
path representation of a tour. This results in four edges being deleted and four new edges
being introduced.
Figure 4.17 The perturbation operator 2-edge-exchange removes two dashed edges from
tour on the left and adds two new dashed edges on the right. Observe that the direction of the
arrows on the thick edges on the right has been reversed. In the path representation, 2-edge-
exchange can be implemented by reversing a subtour of the original tour. In this illustration it is
the subtour with these thick edges.
There is only one possible way of inserting two new edges and having a valid tour. Given
that in an N city TSP there are N edges, the two edges to be removed can be chosen in NC2
ways, yielding as many neighbours. For the path representation, the 2-edge-exchange can be
implemented by inverting a subsequence of the cities.
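In the path representation this is a one-line operation; the sketch below, with a tour as a list of city indices, shows the reversal.

def two_edge_exchange(tour, i, j):
    # Remove the edges (tour[i-1], tour[i]) and (tour[j], tour[j+1]) and
    # reconnect the tour by reversing the subsequence tour[i..j].
    return tour[:i] + list(reversed(tour[i:j + 1])) + tour[j + 1:]

tour = [0, 1, 2, 3, 4, 5]
print(two_edge_exchange(tour, 1, 3))   # [0, 3, 2, 1, 4, 5]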
Likewise, in a 3-edge-exchange, three edges are removed from the tour, and three new
ones are inserted to construct a complete tour. As shown in Figure 4.18, this can be done in four
ways. The three edges can themselves be selected in NC3 ways.
Figure 4.18 In the 3-edge-exchange, three edges from the given tour are removed, as shown
on the top. Three new edges can be added in four ways, as shown in the four figures below.
The 4-edge-exchange has more neighbours. A point to note is that the 2-city-exchange is
just one of the 4-edge-exchange neighbours. The reader is encouraged to find the others.
as illustrated in Figure 2.8. They are 01111, 10111, 11011, 11101, and 11110. These five are
the neighbours of 11111 in the search space. Let us call this neighbourhood function N1. When
k = 2, then the neighbourhood function is N2. Each candidate would have ten new neighbours,
since we can choose two bits in 5C2 = 10 ways. For 11111, they are 00111, 01011, 01101,
01110, 10011, 10101, 10110, 11001, 11010, and 11100.
Figure 4.15 shows the heuristic terrain for a 4-variable SAT problem with the neighbourhood
function N1. The reader is encouraged to redraw the graph using the neighbourhood function N2
which flips some two bits.
If there are N variables, then we could have a set of functions {N1, N2, ..., NN} which
change a fixed number of bits. The kth neighbourhood function would have NCk neighbours. We
could also have neighbourhood functions like N12 that changes one or two bits. N12 would have
fifteen neighbours. And then N123 and so on till N1...N, the last one allowing one to change any
number of bits. There are then 2^N - 1 neighbours to choose from. Observe that the last one represents a
fully connected graph.
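Such neighbourhood functions are easy to generate; the sketch below, which treats a candidate as a bit string, uses itertools.

from itertools import combinations

def neighbourhood(candidate, k):
    # N_k: all candidates obtained by flipping exactly k bit positions.
    flip = {'0': '1', '1': '0'}
    neighbours = []
    for positions in combinations(range(len(candidate)), k):
        bits = list(candidate)
        for i in positions:
            bits[i] = flip[bits[i]]
        neighbours.append(''.join(bits))
    return neighbours

print(neighbourhood('11111', 1))        # the five N1 neighbours of 11111
print(len(neighbourhood('11111', 2)))   # 10, the number of N2 neighbours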
Gradient based methods have traditionally been discussed in the context of optimization,
which in general is a hard problem. Our interest in optimization arose from the objective of
finding a node with the best heuristic value.
Maximization and minimization are two sides of a coin. A maximization problem can be
converted to a minimization problem by prefixing a negation sign to the objective function
(which in our case is the heuristic function). Thinking of it as a minimization problem does help
us get some insight into the algorithm. Imagine rolling a ball down a slope with the intention
of sending it to the lowest point of a valley. Keep aside physics for a moment and imagine that
there is no momentum. Then the ball would roll down the steepest gradient, but only till the
point where the gradient becomes zero. The moment it becomes zero, it would come to a halt.
VariableNeighbourhoodDescent()
1 node ← start
2 for i ← 1 to n
3 moveGen ← MoveGen_i > the ith neighbourhood function
4 node ← HillClimbing(node, moveGen)
5 return node
Figure 4.19 revisits the SAT example with a neighbourhood function N2. Each node has
six neighbours, as illustrated for node 0000. We have not drawn all the forty-eight edges in the
graph, but only those that participate in the hill climbing process.
Figure 4.19 A subset of the edges on the four variable SAT problem when we use the
neighbourhood function N2 that flips two bits in a candidate. The nodes in grey are just one
hop away from the solution. The remaining two nodes also have a gradient ascending to the
solution.
As seen in the figure, the search terrain does not have any local maxima. The two nodes
0111 and 1111 are the solutions as before. Twelve of the remaining fourteen nodes, shown in
grey, are just one step away from a solution. The remaining two nodes also have a steepest
gradient path that leads to the solution.
Figure 4.20 BeamSearch picks the best b nodes from OPEN and expands them. The set of
neighbours of these b nodes forms the new OPEN. At every level, it keeps b nodes from
OPEN in contention. Hopefully, the path to the goal goes through one of these b nodes. In this
illustration b = 2.
BeamSearch has been effectively used in speech recognition, where a set of phonemes
need to be eventually combined into words, and then sentences. Eugene Charniak and Drew
McDermott (1985) quote the following example where there may be ambiguity in speech
understanding. If you think that someone from New York is telling you that ‘everything in
the city costs a nominal egg’ they are more likely saying that ‘everything in the city costs an
arm and a leg’. Another touching example is that of a young child telling an acquaintance that
she has ‘sixty-five roses’ when her diagnosis was ‘cystic fibrosis’. Matt Payne (2021) says the
following in his blog: ‘First used for speech recognition in 1976, beam search is used often in
models that have encoders and decoders with LSTM or Gated Recurrent Unit modules built in.
To understand where this algorithm is used a little more let’s take a look at how NLP models
generate output, to see where Beam search comes into play.’
The algorithm that the speech community calls Viterbi search (Xie and Limin, 2004)
maintains a short list of the most probable words at each time step, and only extends transitions
from those words into the next time step. The algorithm is an implementation of BeamSearch,
keeping a few options in contention at each point of time as it processes the input sequence of
phonemes.
Let us look at the search tree explored by BeamSearch for the tiny SAT problem mentioned
in Figure 4.15:
F = (a ∨ b) ∧ (a ∨ c) ∧ (c ∨ d) ∧ (b ∨ ¬d) ∧ (a ∨ d) ∧ (b ∨ c) ∧ (c ∨ ¬d) ∧ (b ∨ d) ∧ (¬b ∨ d)
The Boolean formula has four variables and nine clauses. Let h(n) be the number of clauses
satisfied by a candidate, and the neighbourhood function be flip-1-bit. The goal node would
have a value 9 since there are nine clauses in the formula. Starting with the candidate 0000, the
progress of BeamSearch with beam width b = 2 is shown in Figure 4.21. The value alongside
each node is the number of clauses satisfied. Observe that starting with a value 3 for 0000, both
HillClimbing and BeamSearch can make a maximum of six moves, since the node at the
next level needs to be better at each stage. Where there are more than two best nodes at any
level, the algorithm selects the ones on the left. The nodes selected by BeamSearch are shown
in shaded rectangles in the figure. The solution found by the algorithm is 1111 after four moves.
Figure 4.21 BeamSearch with width b = 2 on the problem shown in Figure 4.15. If there are
more than two best nodes in some level, then the leftmost two are selected. The algorithm
finds the solution 1111 after four moves. Observe that HillClimbing would not have found the
solution because after 1000 it moves to 1010, which is a local maximum. Neither algorithm reaches the other
solution, 0111.
For compactness, we have not drawn neighbours that exist at an earlier level, since
they would have a lower heuristic value. For example, node 1000 would have 0000 too as a
neighbour. Consequently, each candidate appears exactly once in the search tree. When there
is a tie between nodes at some level, then we have broken it in favour of the moves on the left.
This assumes that the neighbourhood function flips bits from left to right. Observe that
HillClimbing would have failed to find the solution under these conditions, because it would
get stuck at node 1010, which is a local maximum. Since BeamSearch keeps more than one
option, our algorithm was able to find the solution.
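For readers who want to experiment, the following is a minimal Python sketch of BeamSearch on this SAT instance. The clause encoding, the function names, and the tie-breaking (Python's stable sort) are my own choices, so the exact beam contents may differ slightly from Figure 4.21, though the sketch does reach the satisfying assignment 1111.

    # Clauses of the formula in Figure 4.15, assumed encoding with bits ordered (a, b, c, d);
    # each clause is a list of (bit index, required value) pairs.
    CLAUSES = [
        [(0, 1), (1, 1)], [(0, 1), (2, 1)], [(2, 1), (3, 1)],
        [(1, 1), (3, 0)], [(0, 1), (3, 1)], [(1, 1), (2, 1)],
        [(2, 1), (3, 0)], [(1, 1), (3, 1)], [(1, 0), (3, 1)],
    ]

    def h(candidate):
        # Heuristic value: number of clauses satisfied by a 0/1 tuple.
        return sum(any(candidate[i] == v for i, v in clause) for clause in CLAUSES)

    def flip_1_bit(candidate):
        # Neighbourhood function: all candidates one bit-flip away.
        return [tuple(b ^ (i == j) for j, b in enumerate(candidate))
                for i in range(len(candidate))]

    def beam_search(start, width=2, max_levels=10):
        open_list = [start]
        best = start
        for _ in range(max_levels):
            # The neighbours of the current beam form the new OPEN.
            neighbours = [n for node in open_list for n in flip_1_bit(node)]
            neighbours.sort(key=h, reverse=True)
            open_list = neighbours[:width]          # keep the best `width` nodes
            if h(open_list[0]) > h(best):
                best = open_list[0]
            if h(best) == len(CLAUSES):             # goal test: all clauses satisfied
                break
        return best

    print(beam_search((0, 0, 0, 0)))                # reaches (1, 1, 1, 1)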
The reader is encouraged to look at the search tree explored by BeamSearch for the tiny
SAT problem given below from Section 2.3.3. Choose 1111 as the starting node and beam
width b = 2.
F = (b ∨ ¬c) ∧ (c ∨ ¬d) ∧ (¬b) ∧ (¬a ∨ ¬e) ∧ (e ∨ ¬c) ∧ (¬c ∨ ¬d)
This example shows us that even in small search spaces BeamSearch can get trapped in
local optima. However, the algorithm continues to be of considerable interest, not least because
much larger beam widths are possible with memory becoming abundant.
Figure 4.22 The values of two heuristic functions along the solution path for a simple 8-puzzle
instance. The darker tiles are the ones out of place. The graph below plots the values of
the two heuristic functions, the squares for the Manhattan distance and the circles for the
Hamming distance. Observe that after the first move, the state is at a local minimum according
to both heuristic functions.
We will, however, focus here on search, and turn our attention to mechanisms that enable
local search to escape from local optima.
This can be done simply by removing the criterion of the best neighbour being better than
the current node. Once this restriction is removed, search, which is akin to HillClimbing in
other ways, merrily carries on endlessly. Remember that there is no way to determine whether
an optimum is local or global. A different termination criterion has to be introduced again. This
can be for a fixed number of cycles, or stopping when no significant improvement is made for a
certain number of cycles. For problems where GoalTest is available, like the SAT problem, one
can have an additional exit condition if the goal is found, not least to avoid unnecessary work.
This simple modification would not work by itself though. The freedom from moving only
to a better move allows the search to go beyond an optimum. But once past it, what is to stop
the algorithm from coming right back in the next step? We prevent this by disallowing some
moves at each stage, or making them prohibited, albeit temporarily. That is how the algorithm
gets its name (Glover, 1986). The high level algorithm is described below.
Algorithm 4.4. TabuSearch moves to the best allowed neighbour till some termination
criterion. Allowed here means not tabu or taboo.
TabuSearch(Start)
    N ← Start
    bestSeen ← N
    until some termination criterion
        N ← best(allowed(MoveGen(N)))
        if N better than bestSeen
            bestSeen ← N
    return bestSeen
Tabu search is different from hill climbing in two ways. One, the termination criterion is
different. And two, it introduces a notion of allowed neighbours. By allowed, we mean moves
that are not tabu, or taboo. The condition of being tabu is implemented in different ways for
different problems. A simple way would be to maintain a small CLOSED as a circular list,
disallowing the most recent states from being visited again. More often in the literature, though, an
embargo is imposed on some moves or perturbations themselves in solution space search.
We illustrate the idea with our favourite flip-1-bit operator for a hypothetical 7-variable
SAT problem. As search progresses, some bits cannot be changed for some time. The period
for which a bit is tabu is set by a parameter called tabu tenure (tt). In our illustration, we use
tt = 2. This means that once a bit is flipped, it is quarantined2 for two cycles. One can keep track
of which bits are allowed by keeping a memory array with a value for each bit initialized to zero.
M = [0 0 0 0 0 0 0]
The subsequent value for each bit could be the last time the bit was changed. The value
could be compared with a counter of the cycles to decide whether the bit is tabu or not. A simpler
way is to maintain M as a timer, counting down to when that bit can be changed again.
2 As I write this in the times of the Coronavirus, this seems to be the most appropriate word.
The moment it is changed, the value is initialized to tt, and decremented in every cycle.
Figure 4.23 Illustrating TabuSearch with tt = 2. As bits are flipped on the left, the corresponding
bit in M counts down to when they can be flipped again. The first move above changes the
third bit as shown on the left. The shaded numbers on the right show the corresponding values
in M. The third bit can only be flipped when it becomes 0 after two moves.
The algorithm works as follows. It flips all the allowed bits to generate neighbours and
moves to the best neighbour, irrespective of whether it is better or not. The figure above
depicts three cycles in the TabuSearch execution. In the first cycle, bit 3 is changed, and the
corresponding value in M is set to 2, which is the tabu tenure. That value decrements to 1 in the
next cycle, in which bit 6 is changed. When bit 1 is changed in the third cycle, bit 3 is available
for flipping again.
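A minimal Python sketch of this scheme is given below, assuming a caller-supplied evaluation function score to be maximized; the memory M is kept as a countdown timer, as in Figure 4.23. The function names and the iteration limit are illustrative choices.

    def tabu_search(start, score, tabu_tenure=2, max_iters=100):
        # Flip-1-bit tabu search over 0/1 tuples; `score` is the value to maximize.
        n = len(start)
        current, best = start, start
        M = [0] * n            # countdown timers; bit i may be flipped only when M[i] == 0
        for _ in range(max_iters):
            allowed = [i for i in range(n) if M[i] == 0]
            if not allowed:
                break
            # Flip each allowed bit and pick the best neighbour, better or not.
            scored = []
            for i in allowed:
                neighbour = tuple(b ^ (i == j) for j, b in enumerate(current))
                scored.append((score(neighbour), i, neighbour))
            _, flipped, current = max(scored)
            # Age the memory and quarantine the bit that was just flipped.
            M = [max(0, m - 1) for m in M]
            M[flipped] = tabu_tenure
            if score(current) > score(best):
                best = current
        return best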
How does tabu search perform on the SAT problem of Figure 4.15? Are there any states
from which a goal node is not reached? Does it work when tt = 2? What about when tt = 1?
This is left as an exercise for the reader.
Clearly, TabuSearch looks at a subset of the neighbours at each stage, ignoring the moves
that are in quarantine. What if a tabu move leads to a very good candidate? Some implementations
of tabu search include an aspiration criterion. This says that if one of the tabu neighbours has a
value better than any seen in the entire run, then that tabu move should be allowed.
It may also happen that some components are perturbed much more often than others. One way of giving a boost to exploration is to drive the search to newer areas by devaluing the nodes generated by more frequent moves. This can be done by maintaining a frequency memory F = [f1, f2, ..., fN]. Then, if the heuristic value of a node generated by modifying the kth component (flipping the kth bit in the case of SAT) is h(nodek), it can be attenuated as

h(nodek) ← h(nodek) − c × fk
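In code this is a one-line adjustment; the penalty constant c and the frequency array F below are illustrative names rather than anything prescribed by the algorithm.

    def attenuated_value(h_value, k, F, c=0.5):
        # Devalue a node produced by perturbing component k in proportion to how
        # often that component has already been perturbed (F[k]).
        return h_value - c * F[k]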
A similar approach can be employed to solve TSP using tabu search. One way of marking
tabu moves would be to maintain a 2-dimensional memory array showing which pairs of edges
were removed in 2-edge-exchange or a 3-dimensional one for 3-edge exchange.
Algorithm 4.5. IHC does HillClimbing N times, starting each time from a different randomly selected starting point.

IHC(N)
1  bestNode ← random candidate solution
2  repeat N times
3      currentBest ← HillClimbing(new random candidate solution)
4      if h(currentBest) is better than h(bestNode)
5          bestNode ← currentBest
6  return bestNode
The hope is that one of these instances of hill climbing will strike gold. The reader is
encouraged to look at the SAT problem in Figure 4.15 and find out how many of the starting
nodes in the solution space would lead to a solution.
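A minimal Python sketch of iterated hill climbing is shown below; it assumes the caller supplies a generator of random candidates, a hill climbing routine, and the heuristic value h, so it is only the restart wrapper.

    def iterated_hill_climbing(n_restarts, random_candidate, hill_climbing, h):
        # Run hill climbing from several random starting points; keep the best result.
        best = hill_climbing(random_candidate())
        for _ in range(n_restarts - 1):
            result = hill_climbing(random_candidate())
            if h(result) > h(best):
                best = result
        return best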
Clearly, the performance of IHC depends upon the nature of the landscape defined by
the heuristic function. The nodes from where hill climbing succeeds can be said to define the
footprint of HillClimbing. The boundary of the footprint is defined by the set of local minima
surrounding the global maximum. If the algorithm were to start from any node in the footprint,
it would have a smooth ascent to the summit. The larger the size of the footprint, the greater
the probability of IHC succeeding. If the heuristic function were like the Jagged Mountain
mentioned in Section 4.4, the footprint would be small, and many iterations would be required
to have a reasonable chance of success. At the other extreme, if the landscape were to be like
Mount Fuji, one iteration would be enough.
Even though both algorithms examine more than one path on the landscape, IHC is different
from BeamSearch. BeamSearch has one starting point, and it spawns many paths from that,
bounded by the beam width. IHC, on the other hand, begins from different randomly chosen
starting points. It is the first algorithm we have seen that has an element of randomness. The
next chapter is devoted entirely to randomized algorithms.
Summary
We arrived in this chapter in search of knowledge to guide search. Having developed the basic
search machinery in the last chapter, we were looking for ways for search to be directed towards
the goal. This was achieved by devising heuristic functions that somehow estimated the distance
to goal, or conversely the similarity with goal. The more similar a state is to the goal state,
the closer it should be. Heuristic functions estimate this closeness, and the BestFirstSearch
algorithm capitalizes on this knowledge to search in an informed manner.
The performance of heuristic search is only as good as the heuristic function. With a good
function, both time and space complexity are low, and with a very good one they may even be linear.
This is not the case in practice though, and the complexity is often exponential.
We then introduced the idea of local search, trading off completeness for lower complexity.
Algorithm HillClimbing views the problem as an optimization problem and chooses the
steepest gradient path to reach the optimum. However, the problem of local optima then crops
up, where the gradient is zero, and search gets trapped. That led to our quest for algorithms to get
around the local optima problem.
We introduced the notion of exploration in addition to exploitation. This chapter was devoted
to deterministic search. We ended with IHC, which introduces an element of randomness. In
the next chapter we build upon randomized algorithms and look at some popular stochastic
methods employed by the optimization community.
Exercises
1. The BestFirstSearch algorithm picks the node with the best heuristic value from OPEN.
Compare the relative advantages of (a) appending the new nodes to the tail of OPEN
and scanning for the best node, (b) appending the new nodes and then sorting OPEN, (c)
sorting the new nodes and inserting them into a sorted OPEN, and (d) maintaining OPEN
as a priority queue. How will the size of the search space influence your choice?
2. You are writing a program to solve the map colouring problem with regions A, B, C, D, E,
F, G, and H using heuristic search. The input is defined as a graph in which the neighbours
of a region are connected by edges. Each region has an allowed set of colours. What kind
of heuristic functions (that look at neighbours only) can you think of (a) to choose which
region to colour next and (b) what colour to pick from the set of available colours?
Heuristic Search | 109
3. You are writing a program to solve an N variable SAT problem with clauses varying in size
from 1 to K. Let us say you are trying a constructive method in which you pick a variable
to try a value, true or false, one by one. Let us say you maintain a data structure in which
you keep track of the sizes of clauses along with how many variables in each clause have been
assigned a value in the partial assignment. What heuristic function would you use to choose the
next variable to try a value for?
4. Let us say you are implementing best first search for path finding on a finite city map
and are using the Euclidean distance as a heuristic function. Algorithm 4.1 removes the
neighbours of a node returned by MoveGen that are already on OPEN or CLOSED before
adding them to OPEN (Line 10). How would the performance of the algorithm be affected
if this step was removed? Would it still be complete? What about complexity?
5. [Baskaran] The following figure shows a map with several locations on a grid where
each tile is 1 unit by 1 unit in size, connected by 2-way edges (roads). The MoveGen
function returns neighbours in alphabetical order. The start node is S, and the goal node
is G. Module RemoveSeen removes neighbours already present in OPEN/CLOSED lists.
List the nodes in the order visited by BestFirstSearch. Draw the search tree generated,
clearly identifying the nodes on CLOSED and on OPEN, and the path found. How does
HillClimbing perform on this problem? What about BeamSearch with width=2?
6. [Baskaran] The following figure is another map, where the nodes are placed on a grid of
5 X 5 units. The MoveGen function returns neighbours in alphabetical order. The start
node is S and the goal node is G. Module RemoveSeen removes neighbours already present
in OPEN/CLOSED lists. List the nodes in the order visited by BestFirstSearch. Draw
the search tree generated, clearly identifying the nodes on CLOSED and on OPEN, and the
path found. How does HillClimbing perform on this problem? What about BeamSearch
with width=2?
7. Consider the following city map. The cities are laid out on a square grid of side 10
kilometres. S is the start node and G is the goal node. Labels on edges are costs. Draw the
subgraph generated by the BestFirstSearch algorithm till it terminates. Let the nodes on
OPEN be single circles. Mark each node inspected by
• a double circle,
• sequence number,
• parent pointer on an edge,
• its cost as used by the algorithm.
List the path found along with its cost.
8. The following graphs represent a city map. The nodes are placed on a grid where each side
is 10 units. Node H is the start node and node T is the goal node. Use Manhattan distance
as the heuristic function where needed. The label on each edge represents the cost of the
edge. Label the nodes with the heuristic values. List the nodes in the order inspected by (a)
HillClimbing and (b) BestFirstSearch till it terminates. Does it find a path to the goal
node? If yes, list the path found along with its cost.
9. Repeat the exercise in the previous question for the following graph. Let S be the start node
and G be the goal node.
10. Try out the two heuristic functions in Section 4.4.2 for the blocks world domain on the
problem defined in the following figure.
14. The following figure depicts the search terrain for the SAT problem defined below.
F = (a ∨ b) ∧ (a ∨ c) ∧ (c ∨ d) ∧ (b ∨ ¬d) ∧ (a ∨ d) ∧ (b ∨ c) ∧ (c ∨ ¬d) ∧ (b ∨ d) ∧ (¬b ∨ d)
Study the figure and identify the nodes starting from which HillClimbing reaches the
solution node.
15. Analyse the above SAT problem repeated below when the N2 neighbourhood function is
employed. What is the nature of the terrain?
F = (a ∨ b) ∧ (a ∨ c) ∧ (c ∨ d) ∧ (b ∨ ¬d) ∧ (a ∨ d) ∧ (b ∨ c) ∧ (c ∨ ¬d) ∧ (b ∨ d) ∧ (¬b ∨ d)
A subset of the edges on the 4-variable SAT problem when we use the neighbourhood
function N2 that flips two bits in a candidate. The nodes in grey are just one hop away from
the solution. The node 0000 leads to 1010, and the next move is to a solution. Likewise for
node 1000, where there are paths other than the one shown. Thus on this problem there
is no local maximum when we use the neighbourhood function N2.
16. How does TabuSearch perform on the SAT problem of Figure 4.15? Are there any states
from which a goal node is not reached? Does it work when tt = 2? What about when
tt = 1?
17. [Baskaran] A TSP problem on seven cities is shown in the accompanying table and city
locations.
      A    B    C    D    E    F    G
A     -   50   36   28   30   72   50
B    50    -   82   36   58   41   71
C    36   82    -   50   32   92   42
D    28   36   50    -   22   45   36
E    30   58   32   22    -   61   20
F    72   41   92   45   61    -   61
G    50   71   42   36   20   61    -
a. Using B as the starting city, construct a tour using the NearestNeighbour heuristic. Use
the distance matrix to compute the tour. What is the cost of the tour found?
b. Construct a tour using the Greedy heuristic. What is the cost of the tour found?
c. Perform 2-city-exchange on the tour generated by Greedy heuristic. Choose B and E
for city exchange. What is the cost of the new tour?
d. Perform 2-edge-exchange on the tour generated by Greedy heuristic, use the edges
BC and DE for exchange. What is the tour found and what is its cost?
e. Construct a tour using the Savings heuristic with city A as the fulcrum. What is the
cost of the tour found?
18. Write the algorithm to implement the Savings heuristic for solving a TSP.
chapter 5
Stochastic Local Search
Search spaces can be huge. The number of choices faced by a search algorithm can grow
exponentially. We have named this combinatorial explosion, the principal adversary of
search, CombEx. In Chapter 4 we looked at one strategy to battle CombEx, the use of
knowledge in the form of heuristic functions - knowledge that would point towards
the goal node. Yet, for many problems, such heuristics are hard to acquire and often
inadequate, and algorithms continue to demand exponential time.
We also look at the power of many for problem solving, as opposed to a sole crusader.
Population based methods have given a new dimension to solving optimization
problems.
Douglas Hofstadter says that humans are not known to have a head for numbers (Hofstadter,
1996). For most of us, the numbers 3.2 billion and 5.3 million seem vaguely similar and big. A
very popular book (Gamow, 1947) was titled One, Two, Three ... Infinity. The author, George
Gamow, talks about the Hottentot tribes who had only the numbers one, two, and three in their
vocabulary, and beyond that used the word many. Bill Gates is famously reputed to have said,
‘Most people overestimate what they can do in one year and underestimate what they can do
in ten years.’
So, how big is big? Why are computer scientists wary of combinatorial growth? In
Table 2.1 we looked at the exponential function 2^N and the factorial N!, which are respectively
the sizes of search spaces for SAT and TSP, with N variables or cities. How long will it take to
inspect all the states when N = 50?
For a SAT problem with 50 variables, 2^50 = 1,125,899,906,842,624. How big is that? Let
us say we can inspect a million or 10^6 nodes a second. We would then need 1,125,899,906.8
seconds, which is about 35.7 years! There are N! = 3.041409320 × 10^64 non-distinct tours
(each distinct tour has 2N representations) of 50 cities. This would need more than a thousand
trillion centuries to inspect. Surely you are not willing to wait that long, and brute force is not
an option.
In Chapter 4 we introduced heuristic functions, and then we also looked at local search,
which basically follows the diktat of the gradient.1 The strategy is exploitation of the gradient
function resulting in travelling along the steepest gradient (see Figure 4.11). We also observed
that the HillClimbing algorithm works when the ‘hill’ we are climbing is monotonic like
Mount Fuji. The algorithm fails when there are local optima, as depicted in Figure 4.12.
Figure 5.1 Most heuristic functions define a jagged mountain where HillClimbing is doomed
to fail.
The question is: if one has to reach the pinnacle in an optimization problem where the
gradient is non-monotonic, what, other than brute force, would work?
Towards the end of the last chapter, we introduced the notions of exploitation versus
exploration. Incidentally, while we did start off with the heuristic function defining the terrain
for optimization, whose gradient we exploit, the techniques are more general. The optimization
community often uses the phrase objective function or evaluation function for what is to be
1 By gradient we will mean the difference between the values of the current node and the neighbour in question. By
steepest gradient we refer to the neighbour for which this difference is the maximum.
optimized. The algorithms are of course name agnostic. A little later in this chapter we will also
use the phrase fitness function in the context of evolutionary algorithms.
In the next section we inject a dose of exploration into the search process, allowing the
local search to make some random moves, often against the yoke of the steepest gradient. As
we will see, this helps the search avoid getting stuck in a local optimum.
Algorithm 5.1 Algorithm RandomWalk explores the search space in a random fashion.
At each point, it moves to a random neighbour.
RandomWalk()
1  node ← random candidate solution or start
2  bestNode ← node
3  for i ← 1 to n
4      node ← RandomChoose(MoveGen(node))
5      if node is better than bestNode
6          then bestNode ← node
7  return bestNode
If we assume that instead of selecting randomly from the choices offered by MoveGen, we
can generate a random neighbour, then the time complexity of selecting a random neighbour
will become constant, irrespective of how dense or sparse the neighbourhood function is.
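For the flip-1-bit neighbourhood, for instance, a random neighbour can be generated directly in constant time, as in this small sketch.

    import random

    def random_neighbour(candidate):
        # Flip one randomly chosen bit instead of enumerating the whole neighbourhood.
        i = random.randrange(len(candidate))
        return tuple(b ^ (i == j) for j, b in enumerate(candidate))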
Next, we look at a variation that is not purely random but selects a randomly generated
neighbour with a probability that is dependent upon the gradient. In this framework, the
RandomWalk can be seen as always accepting the random move with a probability half,
irrespective of whether it leads to a better neighbour or worse.
The gradient defined by the heuristic function serves to draw the search towards the goal node. We do want our search to be attracted towards the goal. That is
what HillClimbing did single-mindedly, doing only exploitation. Now we inject an element
of randomness into an algorithm that does prefer the gradient but is not bound by it. We call this
algorithm StochasticHillClimbing.
Let C be the current node, and let N be the random neighbour being considered. Let eval(N)
be the objective function for our optimization problem, the value that we want to optimize.
This could well be the heuristic value h(N) if we were doing heuristic guided search, as in
HillClimbing. We define the gradient ΔE as the difference between the values of N and C.

ΔE = eval(N) − eval(C)

We move from C to N with some probability depending upon ΔE. For all neighbours
N, this probability is non-zero, even if the neighbour is worse than the current node. For a
maximization problem, the larger the ΔE the better is node N, and the higher should be the
probability of moving from node C to node N. A function that serves our purpose is the sigmoid
function. The probability P of making the move is given by

P = 1 / (1 + e^(−ΔE/T))
where T is a parameter that determines the shape of the curve. Observe that the sigmoid
function always evaluates to a value between 0 and 1. As ΔE tends to −∞, the probability tends
to 0. This means that for a very bad neighbour, the probability will be very low. Conversely, as
ΔE tends to +∞, the probability tends to 1. Thus, StochasticHillClimbing moves to better
neighbours with a higher probability than to worse ones. The shape of the sigmoid function is
shown in Figure 5.2.
Figure 5.2 The sigmoid function is an increasing function, asymptotically approaching 0 and 1
at the two ends.
How does the parameter T influence the shape of the curve? When T → ∞, the
probability P is 0.5 for all values of ΔE. This means that the probability does not depend on
ΔE and is therefore like a random walk. Purely explorative. On the other hand, as T → 0, the
sigmoid function approaches a step function, with the probability being 0 for all ΔE less than
0, and 1 for all values greater than 0. Thus it is totally deterministic, like HillClimbing. If the
neighbour is better, it always moves to it, otherwise never. Purely exploitative.
For values of T between these extremes, the behaviour of stochastic hill climbing is a blend of
exploitation and exploration. The higher the value of T, the more random the movement, and
the lower the value, the more deterministic it is. The algorithm StochasticHillClimbing is
described below.
StochasticHillClimbing(T, start)
1  node ← start
2  bestNode ← node
3  while some termination criteria      ▷ M cycles in a simple case
4      neighbour ← RandomNeighbour(node)
5      ΔE ← Eval(neighbour) − Eval(node)
6      if Random(0, 1) < 1 / (1 + e^(−ΔE/T))
7          node ← neighbour
8      if Eval(node) > Eval(bestNode)
9          bestNode ← node
10 return bestNode
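A quick way to get a feel for line 6 is to compute the acceptance probability for a few values of ΔE and T; the numbers below are purely illustrative.

    import math

    def accept_probability(delta_e, T):
        # Probability of moving to a neighbour whose gradient is delta_e at temperature T.
        return 1.0 / (1.0 + math.exp(-delta_e / T))

    for T in (100.0, 1.0, 0.01):
        probs = [round(accept_probability(d, T), 3) for d in (-5, -1, 0, 1, 5)]
        print("T =", T, ":", probs)
    # Large T: every probability is close to 0.5 (random walk).
    # Small T: close to a step function, near 0 for worse moves and near 1 for better ones.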
In the annealing analogy, the parameter T plays the role of temperature. We start with a high value of T,
allowing search to explore more initially, and gradually reduce T, allowing the gradient to have
a greater say in the decision.
Figure 5.3 illustrates how the sigmoid function changes with decreasing values of T. On the
top left, the probability is 0.5 when T tends to infinity. Then as we look at decreasing values of
T, the sigmoid function emerges, becomes more pronounced, and ends on the bottom left as a
step function when T approaches 0.
Figure 5.3 The probability curves (sigmoid function) for different values of T. When T → ∞
the function has a value 0.5 irrespective of ΔE, in the graph on the top left. As T decreases,
the curve becomes more pronounced, and as T → 0, the sigmoid function becomes a step
function with value 1 for ΔE > 0 and 0 otherwise, as shown in the graph on the bottom left.
SimulatedAnnealing(start, numberOfEpochs)
1  node ← start
2  bestNode ← node
3  T ← some large value
At each point in the inner loop, the algorithm generates a random neighbour and moves
to it with a certain probability. Initially, when T is high, its moves are random, whether the
neighbour is better or worse. But as T is reduced, the algorithm would prefer a better neighbour,
with a probability that is higher the greater the gradient. It can still move to worse neighbours, but
with lower and lower probability as T is reduced.
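Since the full pseudocode is not reproduced above, here is a minimal Python sketch of the idea: an outer loop of epochs that lowers T, and an inner loop of random moves accepted with the sigmoid probability. The geometric cooling factor, the epoch sizes, and the overflow guard are my choices, not the book's prescriptions.

    import math
    import random

    def simulated_annealing(start, evaluate, random_neighbour,
                            T0=100.0, cooling=0.95, epochs=50, moves_per_epoch=100):
        # Maximize `evaluate` by accepting random neighbours with probability
        # 1 / (1 + exp(-dE/T)), lowering the temperature T after every epoch.
        node, best, T = start, start, T0
        for _ in range(epochs):
            for _ in range(moves_per_epoch):
                neighbour = random_neighbour(node)
                dE = evaluate(neighbour) - evaluate(node)
                x = -dE / T
                p = 0.0 if x > 700 else 1.0 / (1.0 + math.exp(x))   # guard against overflow
                if random.random() < p:
                    node = neighbour
                    if evaluate(node) > evaluate(best):
                        best = node
            T *= cooling      # behaviour shifts from exploration towards exploitation
        return best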
How does this stochastic search perform better than a random walk? The following intuition
might help. Consider the situation in a maximization problem in which the algorithm is at a
lower maximum A and must go down a valley (against the slope) and then climb up to a higher
maximum B. The landscape is depicted in Figure 5.4. To travel from A to B, the search has to
first arrive at L, after which the gradient will again be favourable. Likewise, if the algorithm has
to travel from B to A, it has to also go via the low point L.
Figure 5.4 To move from peak A to peak B, the algorithm has to overcome a smaller energy
difference than to move from peak B to peak A. Consequently, at high temperature, algorithm
SimulatedAnnealing is more likely to move from A to B than vice versa.
To travel from A to L, the search must overcome the energy difference ΔE = eval(A) −
eval(L), going against the gradient with a sequence of low probability moves. Likewise, if it
has to travel from B to L, it must overcome the corresponding energy difference, which here is larger. Over a period of time the
probability of going from B to L is going to be lower than the probability of going from A to
L because the energy difference is greater. Thus it is more likely that the search will end up
at B rather than A. Even at a local optimum, the search carries on, stepping off it, like tabu
search did.
When will SimulatedAnnealing perform well? Clearly when the magnitude of work
against the gradient away from a local optimum is not too large. One can imagine a jagged
surface with an overall upward trend (in the case of maximization) where search overcomes
minor pitfalls on the way. This is illustrated in Figure 5.5.
Figure 5.5 SimulatedAnnealing works well when there is an overall upward trend towards the
global optimum, with a jagged surface with many local optima. It would not work well on the
surface depicted with the dashed line with long downward slopes.
While SimulatedAnnealing would work on the shaded surface depicted with a solid line
in the figure, it would not do so with the surface shown by the dashed line. This is because it
would be less likely to make a long sequence of low probability moves down a long slope that
would take search onto a neighbouring upwards slope.
Next, we look at a couple of algorithms that harness the power of teamwork. We first look
at how nature optimizes the design of life forms, if indeed one can talk of nature as having
agency. By creating a large population in which competition is the driving force and survival
the mantra. Another key factor in nature’s experiments is that the offspring are not a minor
variation of one parent as in our algorithms so far, but are produced by more than one parent,
usually two, and where a child inherits features from (both) the parents.
This is the process of Darwinian evolution by natural selection, named after Charles Darwin, who first published it. When we talk of creation,
we often end up asking the question why? When we talk of natural selection, we ignore the why,
and focus instead on the question how? We2 dwell on the second question.
Let us, for argument’s sake, say that nature, a name we give to everything around us, has
a goal, and that it is to experiment with and create different life forms. Like a child with a
Lego set.
The key idea behind evolution is that things may happen by chance, and nature is such that
‘things that persist, persist; things that don’t, don’t’ - a profound observation made by Steve
Grand in his book Creation (Grand, 2001). In the cauldron of organic compounds that churned
over aeons of time on Earth, certain molecules got together and had the happy habit of not
being destroyed. Eventually these molecules that survived became large and sophisticated and
acquired the ability to grow. The biological cell emerged as a stable structure - with a core and
a cell body, and instructions stored in large molecules about its own structure. A single cell,
however complex, by itself would be a drop in the sea of other forms getting assembled. When
it learns to replicate, and create other similar entities, life begins to take root.
Then came the competition. Competition for matter and for energy to build bodies with.
And here comes into play the phrase Darwin gave us, survival of the fittest. The fit are those
that garner the resources. In fact, one can say that the fit are those that survive. If that were
all, then some particular body form would have eventually dominated, and the world would
have been homogeneous and stable. But nature had more tricks up its sleeve. It allowed an
occasional accident to happen during the reproduction process. Once in a while a small error
crept in while copying the instruction molecules from parent to child. As accidents are prone to
be, most accidents were disastrous. But once in an even rarer while the random mutation would
result in an improvement that would bequeath an advantage to the child in the survival game.
An advantage that was evolutionary in nature, because the change would also be propagated to
subsequent generations. And the ratchet mechanism comes into play. Once a superior life form
is discovered, it dominates, thrives, and propagates.
Eventually, some better specimens would become markedly different, and a new species
would be born.
Individuals need matter to build their bodies. Assembling bodies from ‘raw’ matter is a
specialized task, something we can naturally think of plants as doing. Most life forms found
it more expedient to consume other life forms, which gave them pre-processed matter - food.
One species is the food for another. The prey and the predator. Giving rise to more competition.
There was not only competition between the species, but, perhaps more importantly,
competition within the species. Looking ahead in this narrative, the fastest foxes got the rabbits,
and the fastest rabbits escaped and survived. Our next optimization method motivated by this
phenomenon of survival of the fittest is presented in the following section.
The game of life gathered steam when nature hit upon the mechanism of bisexual
reproduction. Instead of one parent creating an offspring, now two of them collaborated to
produce one. The instruction set the child acquired was now an amalgamation of the instructions
of the two parents, and there was always the possibility that it would be better than each of
its parents. Combined with the thousands of mutation accidents that happened occasionally
2 The thoughts presented here are the ruminations of the author and should by no means be treated as an authoritative
source on the life sciences.
resulting in new species, a plethora of life forms have since evolved. About 540 million years ago
this churning of genes in the Cambrian explosion ‘filled the seas with an astonishing diversity
of animals’ (Fox, 2016). Almost all the creatures present in the world today - ‘arthropods with
legs and compound eyes, worms with feathery gills and swift predators that could crush prey in
tooth-rimmed jaws’ - can be traced back to the event.
One difference between plants and animals is that the latter are more sentient, though
some people would dispute this. In any case, animals are mobile and have the freedom to move
around. This becomes important since an individual needs to find a mate if it is to pass on its
genes to an offspring. Nature has craftily injected the phenomenon of attraction into this fitness
game. An individual seeks a mate with whom it would have a fitter offspring, and there is a
plethora of ways in which individuals of a species seek to attract one, who in turn has similar
genetically inherited reciprocal goals. As more couples produce better offspring, the species
become fitter by natural selection. So much so, that in current times the Homo sapiens have
become such a dominant species that we are threatening our own existence by the destruction
of the ecosystem we live in. Paradoxical but true.
The natural ecosystem has emerged as an environment in which the same matter is recycled
over and over to sustain the diverse life forms. Life inevitably leads to death. An individual does
not hoard matter for ever and returns it upon its demise. Eventually every living creature’s body
is consumed by another life form, and the cycle of birth, growth, and death goes on incessantly.
The individual dies but the species survives. Unless there is a catastrophic event that wipes out
entire species. Remember the dinosaurs? This organization of the finite amount of matter into
an unending life and death cycle is called the ecosystem, and the entire transformation is driven
by energy, which to a large part comes directly or indirectly from the Sun. Figure 5.6 depicts a
minuscule fragment of our ecosystem. The arrows depict a qualitative positive influence of the
population of one species on another. For example, the more the number of grasshoppers in the
environment, the more will be the number of robins who can feed on them.
Figure 5.6 A small fragment of the vast ecosystem. The natural world contains millions of
species interacting with each other. Arrows depict a positive influence of the population of one
species on another.
Opponents of the evolution theory look at the clockwork-like movement of the Sun, the
Moon, the planets and the stars, and the wonderful explosion of life on Earth and ask whether
all this would have been possible without a maker. Richard Dawkins responds that nature is ‘the
blind watchmaker’ which accepts and preserves any advancements that come about because
of chance events (Dawkins, 1986). The anthropic cosmological principle3 also comes out in
support of natural selection. Just because it is so improbable that the Earth and life on it could
have come about by a long sequence of improbable events does not mean that that sequence
of improbable events did not happen when it did. If you toss a coin a hundred times and get a
hundred heads, it does not mean that it could not have happened by chance.
Evolution by natural selection has two processes working in tandem. To quote the French
poet Paul Valery: ‘It takes two to invent anything. The one makes up combinations; the other
chooses, recognizes what he wishes and what is important to him in the mass of the things
which the former has imparted to him’ (Hadamard, 1945). All living creatures carry a blueprint
of their design in their genes. The genotype is an organism’s hereditary information encoded
in the DNA. When a child is born of two parents, it receives a combination of genes inherited
from both. This mixing up of genes is not deterministic but has an element of randomness. This
is illustrated by the fact that human siblings can be very different from each other. The physical
manifestation of the inherited genes is in the phenotype, for example, size and shape, metabolic
activities, and patterns of movement.4 The phenotype is the living being that is living out there
in the world. The idea of survival of the fittest is embodied in the fact that it is the individuals of
the species that compete in the real world. If they survive, find mates, and procreate, then their
genotype is inherited by their offspring. In this fashion the species culls out weaker individuals,
selects the better ones, and in the process becomes fitter. One can say that nature is continually
improving upon its design of a species. It is this evolutionary algorithm that we seek to mimic
when we devise GeneticAlgorithms. John Holland is credited with inventing genetic algorithms
along with his other work on complex adaptive systems (Holland, 1975).
Consider two seven-bit parents, 1111111 and 0000000. A single point crossover would choose a random point in the two chromosomes and create a child with the
left part of one and the right part of the second. The second child would be a complement of
the first. The first child would be one of 1000000, 1100000, 1110000, 1111000, 1111100, and
1111110. One can observe that the children produced by the crossover operation can be quite
far in the solution space (in terms of Hamming distance). Figure 5.7 illustrates this process.
Figure 5.7 The pair of nodes on the left illustrate a move made by simulated annealing with
a small perturbation on the current node. The three nodes on the right show how a new
candidate is generated by a genetic algorithm by mixing up the genes from two parents. The
resulting child may be far in the search space from both parents. G is the solution node.
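As a sketch, single point crossover on bit strings is a two-line operation in Python; the strings below are the hypothetical seven-bit parents used above.

    import random

    def single_point_crossover(p1, p2):
        # The first child takes the left part of p1 and the right part of p2;
        # the second child is the complement.
        point = random.randrange(1, len(p1))
        return p1[:point] + p2[point:], p2[:point] + p1[point:]

    print(single_point_crossover("1111111", "0000000"))   # e.g. ('1110000', '0001111')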
Two, in nature, mating happens first and the offspring are produced after that. In
GeneticAlgorithms, the offspring are simply clones of the fittest parents, with fitter parents
possibly getting to reproduce more than once. Given a population of N candidates {M1, ...,
MN}, we produce a new population of N parents {P1, ..., PN}. This is done in N cycles, and
in each cycle one member Mi is chosen with a probability in proportion to its fitness f(Mi).
Imagine a roulette wheel where each member gets a sector with arc length proportional to its
fitness. The wheel is spun N times. After each spin, the winning member gets to reproduce, and
the clone is added to the set of parents. This phase is called Reproduction.
Three, in nature bisexual reproduction happens when a male parent mates with a female
parent. The male is the one who transfers some genetic material to the female, who nurtures it.
The choice of the mate is exercised by the individuals concerned and is usually based on how
attractive one finds a member of the opposite sex. In addition, physical proximity is a must, and
that is why certain species thrive in specific geographic regions. In GeneticAlgorithms this
process of mating is egalitarian to the extreme. An individual mates with a randomly chosen
individual. There are no conditions on gender or region. From the population {P1, ..., PN}
of parents produced in the Reproduction phase, pairs are randomly selected to mate. In the
Crossover phase, each selected pair of parents produces a pair of children. In this way a new
population {C1, ..., CN} of N children is produced. The simplest idea is to replace the original
population {M1, ..., MN} of N members with these children, and begin the cycle again. Some
researchers consider it prudent to retain some K of the fittest original members, or the K fittest
parents, and only replace the remaining (N − K) members with the (N − K) fittest children.
As one can imagine, the success of a GeneticAlgorithm would depend upon the abundant
availability of the building blocks, or the genes or the components, in the population. This
means that a large diverse population is more likely to throw up better candidates, as compared
to a small or a less diverse one. This is especially true in a changing environment. It becomes
imperative that a species retains genes that would be beneficial in a different environment. It is
said that the cheetah evolved to be a perfect hunting creature for the grasslands. So much so that
the genetic makeup in the species started becoming homogenous, with all individuals having
similar genes. It is also speculated that for the same reason the cheetah is unable to adapt when
there is a large scale destruction of its natural habitat, and is in danger of becoming extinct. On
the other hand, Australians discovered in the recent weather upheavals that mice can adapt and
proliferate, leading to huge losses for farmers. It is a fact that in current times, dominated by
human activity, many species are becoming extinct at a rapid rate. The following example from
the earliest book on genetic algorithms (Goldberg, 1989) illustrates this danger of losing out on
a specific gene if the population is small.
Consider the problem of finding a 5-bit string over the alphabet {0,1} where X is the
number represented by the string and the fitness function is the value X². Let us say we start
with four strings:

M1 = 01101    (X = 13, fitness = 169)
M2 = 11000    (X = 24, fitness = 576)
M3 = 01000    (X = 8, fitness = 64)
M4 = 10011    (X = 19, fitness = 361)

The total fitness of the population is 1170 and the average fitness is 293. The probabilities
of the four candidates being reproduced are as follows:

P(M1) = 169/1170 ≈ 0.14,  P(M2) = 576/1170 ≈ 0.49,  P(M3) = 64/1170 ≈ 0.05,  P(M4) = 361/1170 ≈ 0.31
As one can see, M2 has the highest probability of being reproduced, followed by M4, M1,
and M3. We spin the roulette wheel four times and let us say that we get two copies of M2, one
of M4, and one of M1. The set of parents then is
P1 = 01101
P2 = 11000
P3 = 11000
P4 = 10011
Let us say that we (randomly) mate P1 with P2, and P3 with P4. We apply a single point
crossover (randomly) after four bits for the first pair, and after two bits for the second. The set
of children we now get, along with their fitness values, is

C1 = 01100    (X = 12, fitness = 144)
C2 = 11001    (X = 25, fitness = 625)
C3 = 11011    (X = 27, fitness = 729)
C4 = 10000    (X = 16, fitness = 256)
The total fitness of the population is 1754 and the average fitness is 439. The new population
is a fitter population, with C2 and C3 being much fitter than the other two. It is quite likely that
in the next cycle both C2 and C3 will get two copies each. But if that happens, the third gene, or
bit, would have disappeared from the population, and however much we churn the genes after
that, we can never generate the candidate 11111 which has the highest possible fitness.
To keep the possibility of breaking free from the confines of a restricted gene pool, genetic
algorithms take another leaf out of nature’s book. Every once in a while some gene is perturbed
randomly. This, the third phase, is called Mutation. Now since a random move is more likely
to be detrimental to the fitness of the candidate, this should be done very rarely. It does not
promise a better candidate but keeps the possibility alive.
A GeneticAlgorithm is thus a process of starting with a population of N chromosomes,
or candidates, and producing new candidates by trying different combinations of the genes, or
components, inherited from two parents at a time. The following three steps are repeated until
some termination criterion is reached.
1. Reproduction. Produce a new population in N cycles. In each cycle select one member with
probability proportional to its fitness and add it to the new population.
2. Crossover. Randomly pair the resulting population, and for each pair do a random mixing
up of genes, or components.
3. Mutation. Once in a while randomly replace a gene in some candidate with another one.
GeneticAlgorithm()
1  P ← create N candidate solutions      ▷ initial population
2  repeat
3      compute the fitness value for each member of P
4      S ← with probability proportional to fitness value, randomly select N members from P
5      offspring ← partition S into two halves, and randomly mate and crossover members to generate N offspring
6      with a low probability mutate some offspring
7      replace the k weakest members of P with the k strongest offspring
8  until some termination criteria
9  return the best member of P
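The following Python sketch puts the three phases together for the 5-bit X² problem discussed above. The mutation rate, the number of generations, and the replacement policy (children replace the whole population) are illustrative choices, and the roulette wheel is implemented directly.

    import random

    def fitness(chrom):
        # Fitness of a 5-bit string: the square of the number it represents.
        return int(chrom, 2) ** 2

    def roulette_select(population):
        # Reproduction: pick one member with probability proportional to its fitness.
        total = sum(fitness(c) for c in population)
        spin, running = random.uniform(0, total), 0.0
        for c in population:
            running += fitness(c)
            if running >= spin:
                return c
        return population[-1]

    def crossover(p1, p2):
        point = random.randrange(1, len(p1))
        return p1[:point] + p2[point:], p2[:point] + p1[point:]

    def mutate(chrom, rate=0.02):
        return "".join(b if random.random() > rate else str(1 - int(b)) for b in chrom)

    def genetic_algorithm(population, generations=20):
        for _ in range(generations):
            parents = [roulette_select(population) for _ in population]      # Reproduction
            children = []
            for i in range(0, len(parents) - 1, 2):                          # Crossover
                children.extend(crossover(parents[i], parents[i + 1]))
            population = [mutate(c) for c in children]                       # Mutation
        return max(population, key=fitness)

    print(genetic_algorithm(["01101", "11000", "01000", "10011"]))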
GeneticAlgorithms have been popular because when carefully crafted they yield very
good solutions. Starting with a random population of candidates, over a period of time the
population becomes fitter, and the members gravitate towards the various maxima in the
domain. This is illustrated in Figure 5.8.
Figure 5.8 The initial population may be randomly distributed as shown on the left, but as
GeneticAlgorithm is run the population has more members around the peaks, as shown on the
right. The three dark circles represent three optima in the search space.
When we say that the GeneticAlgorithms have to be carefully crafted, it pertains to the
many choices that have to be made. What is the population size and how are the initial members
generated? Are they generated randomly or are they the output of some other method, like
iterated hill climbing or simulated annealing? What is the crossover operator that is employed?
What is the nature of mutation and how often is it done? We look at the possibilities of crossover
operators for the well-studied domain of TSP.
Figure 5.9 To play the game of choosing three numbers first that add up to 15, arrange the
numbers in a magic square and then play noughts and crosses on the grid that is defined.
The point made above is that good representation often facilitates reasoning. ‘The Sapir-
Whorf hypothesis, also known as the linguistic relativity hypothesis, refers to the proposal
that the particular language one speaks influences the way one thinks about reality’ (Lucy,
2001). A weaker version of this hypothesis which says that language influences thought has
found evidential support (Gerrig and Banaji, 1994; Ottenheimer, 2009). Vocabularies in natural
languages are tuned to geography. The Inuit have a variety of names for different kinds of what
the rest of us call snow, and the Malayalis likewise for what many know as a banana. It is
often difficult to translate some words accurately into other languages, for example, hyggelig in
Danish. On the programming front newer languages are still being designed for different kinds
of applications. One assumes that the programming language influences how programmers
think, with implications for correctness.
TSP has sometimes been referred to as the holy grail of computer science and has attracted
considerable attention. Gerhard Reinelt of Heidelberg University has a very informative
Stochastic Local Search | 131
webpage5 called TSPLIB which has a collection of TSP problems and tools. We look at three
different representations for a tour in TSP and the crossover operations that can be implemented
on them (Larranaga et al., 1999).
In the following discussion, we will work with a TSP of 15 cities {A, B, ..., O}. Like in
Chapter 4, we assume every city to be directly connected to every other city, even if some edge
costs may be prohibitively high. This is to ensure that all possible tours are valid tours, even
though some may have unacceptable cost.
5 http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/ .
The cycle crossover (CX) preserves positions: every city in a child occupies the same position it had in one of the two parents. Consider the parents P1 = ODGLFHKMBJACNIE and P2 = HGMFNADKICOELBJ. Let us say that we intend to copy the first city O in P1 into C1. Then we cannot
copy H from P2 in the same place in C1. Therefore, we copy H too from P1. This, in turn, leads
to A being copied from P1 as well. The corresponding city in P2 is O, which we have already
copied into C1. This identifies a cycle, which we call Cycle 1 in both parents, with O, H, and A
in P1 and correspondingly H, A, and O in P2. The identification of Cycle 1 is shown in the top
half of Figure 5.10.
Figure 5.10 The top half shows the identification of Cycle 1 and the bottom half shows the
identification of Cycle 2 in CX.
The bottom half shows the identification of Cycle 2. We start with city D in P1 and follow
through to G, K, and M before returning to D. In a similar fashion we identify Cycle 3 starting
with city L in P1, and then Cycle 4 and Cycle 5. The reader is encouraged to verify that the five
cycles are as depicted in the top half of Figure 5.11. In the figure the odd numbered cycle cities
in P1 are shaded, as are the even numbered cycle cities in P2.
Cycle labels by position: 1 2 2 3 3 1 2 2 4 5 1 5 3 4 5
P1 = ODGLFHKMBJACNIE
P2 = HGMFNADKICOELBJ
Figure 5.11 In the given example, five cycles are identified by CX. The first child C1 gets the
odd numbered cities from P1 and the even numbered cities from P2. These are shown in
shaded boxes. The remaining cities go into forming C2.
In CX we construct C1 by copying all the odd numbered cycles from P1 and the even
numbered cycles from P2. This is illustrated in the bottom half of Figure 5.11. All the shaded
cities are in C1 and we can observe where they have been copied from the two parents. The
unshaded cities form the second child C2, copied from the odd numbered cycles in P2 and even
numbered cycles in P1.
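A Python sketch of CX is given below. It labels the cycles by following positions between the two parents and then distributes odd and even cycles as described; on the parents above it reproduces the cycle labels 1 2 2 3 3 1 2 2 4 5 1 5 3 4 5.

    def cycle_crossover(p1, p2):
        # Label every position with its cycle number, then build C1 from the odd
        # cycles of p1 and the even cycles of p2 (C2 takes the complement).
        n = len(p1)
        cycle_of = [0] * n
        cycle = 0
        for start in range(n):
            if cycle_of[start]:
                continue
            cycle += 1
            pos = start
            while not cycle_of[pos]:
                cycle_of[pos] = cycle
                pos = p1.index(p2[pos])   # where the city of p2 at this position sits in p1
        c1 = "".join(p1[i] if cycle_of[i] % 2 else p2[i] for i in range(n))
        c2 = "".join(p2[i] if cycle_of[i] % 2 else p1[i] for i in range(n))
        return c1, c2

    print(cycle_crossover("ODGLFHKMBJACNIE", "HGMFNADKICOELBJ"))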
The partially mapped crossover (PMX) also copies some cities from parents in the same
position in the children, though with a slightly different procedure. One key difference is that
it copies entire subtours from the two parents into the children, before filling in the rest. This
potentially allows short subtours to survive from parents into the next generation.
Consider the parent P1 = ODGLFHKMBJACNIE from the previous example. In PMX
a subtour is selected to be copied into the first child C1. Let this subtour be HKMBJ. So far
this is like the single point crossover which selects part of a parent chromosome. Remember
that P1 could have also been written as HKMBJACNIEODGLF after rotation, and HKMBJ is
the left half here. The second half cannot be directly copied from P2 because, as discussed
earlier, this will include some cities already in C1. To get around this duplication problem, PMX
begins with a mapping of the chosen subtour with the corresponding cities in the other parent
P2 = HGMFNADKICOELBJ. This is illustrated in Figure 5.12.
C1 = _ _ _ _ _ H K M B J _ _ _ _ _
P1 = ODGLFHKMBJACNIE
P2 = HGMFNADKICOELBJ
Figure 5.12 PMX begins by copying a subtour HKMBJ from P1 into the first child C1 and
establishing a partial map of this subtour with P2. One would like to copy the remaining cities
from P2 but the locations for cities A, D, I, and C in P2 are already occupied by cities H, K, B,
and J respectively. K has already found a place in C1.
Having selected a subtour from P1, we need to copy the remaining cities from P2, keeping
in mind that we already have the subtour HKMBJ. City A has been displaced and we cannot
copy A in place from P2 because it is occupied by H in C1. This is where the partial map comes
into play. We look at the location of H in P2. Its location is free in C1 and serves as a refuge for
A. We could have followed the partial map in the reverse direction as well, starting with H in
P2, which cannot be copied because it is already in C1. Instead, we copy A which H maps to in
the partial map.
Finding the destination for D, the second displaced city, requires a little more work. The
location of D in P2 is occupied by K in C1. We follow the map and consider the location of K in
P2. That in turn is occupied by M in C1. Finally, we look at the location of M in P2 and find that
the corresponding spot is available for D. This process of finding the destination for D, having
found one for A, is illustrated in Figure 5.13.
Figure 5.13 PMX copies the subtour HKMBJ from P1 into C1 and looks to copy the rest from
P2. After moving A to the position occupied by its image H, where should city D be in C1? The
answer: follow the partial map.
A similar process is followed for the cities I and C. City K from P2 already has a place in
C1. After the five displaced cities A, D, K, I, and C have been accommodated, the remaining
cities G, F, O, N, E, and L are directly copied from P2 into C1. The reader should verify that C1
is AGDFOHKMBJNELIC and C2, after a similar process, is OMGLFADKICHJNBE.
The order crossover (OX) also copies a subtour from P1 into C1. The remaining cities are
then inserted in the order that they occur in the other parent P2. This is illustrated in Figure 5.14.
Figure 5.14 In OX we copy a subtour from P1 into C1 and insert the remaining cities in the
order they occur in P2.
As seen in the previous example, this crossover also has the property of preserving entire
subtours from parents in the offspring. The reader should verify that the second child C2 is
OGLFHADKICMBJNE.
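OX is the simplest of the three to code; the sketch below keeps the chosen segment in place and fills the remaining positions, left to right, with the missing cities in the order they appear in the other parent.

    def order_crossover(p1, p2, i, j):
        # Copy the subtour p1[i:j] into the child in place and fill the rest from p2.
        segment = set(p1[i:j])
        fillers = iter(c for c in p2 if c not in segment)
        return "".join(p1[k] if i <= k < j else next(fillers) for k in range(len(p1)))

    p1 = "ODGLFHKMBJACNIE"
    p2 = "HGMFNADKICOELBJ"
    print(order_crossover(p1, p2, 5, 10))   # keeps HKMBJ from p1
    print(order_crossover(p2, p1, 5, 10))   # the second child, OGLFHADKICMBJNE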
Index:  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
City:   A  B  C  D  E  F  G  H  I  J   K   L   M   N   O
Path tour:   O D G L F H K M B J A C N I E
From city:   A B C D E F G H I J K L M N O
Adjacency:   C J N G O H L K E A M F B I D
The first observation one can make is that for a tour being traversed in one direction, there
is a unique representation. In the above example, the first city can only be C because one arrives
at it from A, the first city in the index. Then, since we go from C to N, only N can be in the
corresponding position in the representation with index C. And so on. If one were to traverse
the tour in the opposite direction, we would have another unique representation in which the
first entry can only be J, the other neighbour of A in the path representation. And A would be
at the location indexed by C. Every tour then has exactly two representations, one for each
direction of travel. This contrasts with the 2N representations any tour of N cities has in the
path representation.
The second observation is that not every permutation of N cities in adjacency representation
is a valid tour. For example, the permutation that begins with CAB cannot be a tour because this
contains a cycle A → C → B → A. A consequence of this is that the crossover operators must take
care to avoid cycles of less than N cities.
One crossover operation that is popular with adjacency representation is the alternating
edges crossover (AEX). Here one constructs the first child C1 as follows. Given any starting
city, choose the first successor from P1 in the adjacency representation and then the next
one from P2, and so on. We illustrate this with the two parents ODGLFHKMBJACNIE and
HGMFNADKICOELBJ we looked at earlier. First we must represent them in adjacency
representation. The reader should verify that the representations are CJNGOHLKEAMFBID
and DJOKLNMGCHIBFAE respectively. Observe that there is no identified starting point in the
tour representation. For any city Ci Table 5.2 tells us what the next city Cj in the tour is. There is
no starting point in the path representation too, where every rotation of a tour is the same tour.
The process of implementing the AEX is depicted in Figure 5.15, where the two parents
are in the adjacency representation. The top half of the figure shows the two parents, and the
bottom half shows the process of constructing the child C1. Let us say we start constructing the
tour starting at city F. Then, as is in P1, we move to H. Then from H to G as in P2, from G to L
as in P1, from L to B as in P2, from B to J, and then from J to H!
Parents (path form):        P1 = ODGLFHKMBJACNIE     P2 = HGMFNADKICOELBJ
Parents (adjacency form):   P1 = CJNGOHLKEAMFBID     P2 = DJOKLNMGCHIBFAE
Starting with F:  F → H (from P1),  H → G (from P2),  G → L (from P1),
                  L → B (from P2),  B → J (from P1),  J → H (from P2)
Figure 5.15 In AEX every step in the tour is dictated by alternating parents in the adjacency
representation. If the child tour begins at F, then the first move is F → H as in P1, the second H →
G as in P2, the third G → L as in P1, and so on. After six entries, H is repeated, and one may
have to pick a different city.
As seen in this example, after six entries we are back to the city H, which is not allowed.
This is typically resolved by choosing a city that is available instead of closing the loop. One
could even optimistically choose the nearest available neighbour to form a shorter tour.
Choosing a shorter edge is the idea behind the heuristic crossover (HX) in which at every
point the shorter of the two edges suggested by the two parents is selected, as long as it does
not close the loop.
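As a concrete illustration, here is a minimal Python sketch of AEX under the conventions above. It assumes each parent is given in adjacency representation as a dictionary mapping a city to its successor, and whenever alternating would close a premature loop it repairs the tour by picking some unvisited city at random (a nearest-neighbour repair, as in HX, would instead consult the edge costs).

import random

def aex_crossover(p1, p2, start):
    """Alternating edges crossover (AEX) on adjacency representations.

    p1, p2 : dicts mapping each city to its successor in the parent tour.
    start  : city at which construction of the child begins.
    Returns the child as a list of cities in path representation.
    """
    cities = set(p1)
    child = [start]
    visited = {start}
    parents = [p1, p2]
    current, turn = start, 0
    while len(child) < len(cities):
        candidate = parents[turn % 2][current]
        if candidate in visited:
            # Closing a premature loop is not allowed; repair by choosing
            # any city that has not been used yet (a random one here).
            candidate = random.choice(sorted(cities - visited))
        child.append(candidate)
        visited.add(candidate)
        current = candidate
        turn += 1
    return child

For the two parents used above, starting at city F this reproduces the sequence F, H, G, L, B, J before the first repair step becomes necessary.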
Both the above crossover operators produce one child from two parents. To generate N
offspring one must go through N cycles of randomly selecting pairs.
One advantage of the adjacency representation and the crossover operators used is that one
could produce offspring from multiple parents. We could choose, say, three parents for each
child to be produced and select the moves made in the child by inheriting from them in a cyclic
or heuristic manner. This would allow mixing up of the genes from more than two parents.
Index       1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
City        A B C D E F G H I J  K  L  M  N  O

Path-Tour = ODGLFHKMBJACNIE
Let Ord-Tour be the tour we are constructing. We initialize Ord-Tour to be the empty list
[]. We process the cities in Path-Tour one by one and add the index of the city at the tail of the
list. The first city is O, with index 15.
Ord-Tour = [15]
Next, we delete the inserted city from the index and shift the cities left. In case of city O,
there is nothing to shift. The next city is D, with index 4.
Ord-Tour = [15, 4]
After deleting D from the index, we shift the remaining cities to the left. Observe that the indices of the cities beyond D have decreased by 1, as shown in Table 5.3.
Table 5.3
Index       1 2 3 4 5 6 7 8 9 10 11 12 13
City        A B C E F G H I J K  L  M  N
The next city is G and Ord-Tour = [15, 4, 6], and the updated index is shown in Table 5.4.

Table 5.4
Index       1 2 3 4 5 6 7 8 9 10 11 12
City        A B C E F H I J K L  M  N

The next city is L and Ord-Tour = [15, 4, 6, 10], with the updated index in Table 5.5.

Table 5.5
Index       1 2 3 4 5 6 7 8 9 10 11
City        A B C E F H I J K M  N
We always use the updated index. The reader should verify that after all the cities have been processed the ordinal representation is

Ord-Tour = [15, 4, 6, 10, 5, 5, 7, 7, 2, 5, 1, 1, 3, 2, 1]
Given the above ordinal tour we can recreate the path by an inverse process. We begin
again with the index of the cities in Table 5.1, reproduced here. We initialize the Path-Tour to
an empty string (or list). Path-Tour = [].
Index       1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
City        A B C D E F G H I J  K  L  M  N  O
We pick the first element from the Ord-Tour, which is 15, and add the 15th city, O, at the
tail of Path-Tour.
Path-Tour = O
We delete the corresponding city in the index and shift the remaining cities to the left.
At this point, there is nothing to shift. Next, we pick the second number from the Ord-Tour,
which is 4, and append the corresponding city, D, at the end of Path-Tour.
Path-Tour = OD.
Index       1 2 3 4 5 6 7 8 9 10 11 12 13
City        A B C E F G H I J K  L  M  N
The next number is 6 and the corresponding city is G. Path-Tour = ODG. The updated index is the same as Table 5.4, reproduced here.

Index       1 2 3 4 5 6 7 8 9 10 11 12
City        A B C E F H I J K L  M  N
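The two conversions just traced can be written down compactly. The following Python sketch assumes the reference index and the example tour used above.

def path_to_ordinal(path_tour, index):
    """Convert a tour in path representation to ordinal representation.

    path_tour : list of cities, e.g. list("ODGLFHKMBJACNIE")
    index     : reference sequence of cities, e.g. list("ABCDEFGHIJKLMNO")
    """
    remaining = list(index)
    ordinal = []
    for city in path_tour:
        pos = remaining.index(city) + 1   # positions are 1-based
        ordinal.append(pos)
        remaining.pop(pos - 1)            # delete the city and shift left
    return ordinal

def ordinal_to_path(ordinal_tour, index):
    """Inverse conversion, from ordinal representation back to a path."""
    remaining = list(index)
    path = []
    for pos in ordinal_tour:
        path.append(remaining.pop(pos - 1))
    return path

# For the running example this prints [15, 4, 6, 10, 5, 5, 7, 7, 2, 5, 1, 1, 3, 2, 1].
print(path_to_ordinal(list("ODGLFHKMBJACNIE"), list("ABCDEFGHIJKLMNO")))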
and the head has been replaced twice’. John Holland compared cities to self-organized higher
level entities, even though the people and their activities keep changing - ‘Buyers, sellers,
administrations, streets, bridges, and buildings are always changing, so that a city’s coherence
is somehow imposed on a perpetual flux of people and structures. Like the standing wave in
front of a rock in a fast-moving stream, a city is a pattern in time.’
Emergent systems is a field which studies the emergence of complexity from the interaction
and aggregation of many simple elements (Johnson, 2002).
‘Emergence is what happens when a multitude of little things - neurons, bacteria,
people - exhibit properties beyond the ability of any individual, simply through the act of
making a few basic choices: Left or right? Attack or ignore? Buy or sell? The ant colony is the
classic example, of course. This meta-organism possesses abilities and intelligence far greater
than the sum of its parts: The colony knows when food is nearby, or when to take evasive action,
or, amazingly, just how many ants need to leave the colony to forage for the day’s food or ward
off an attack’ - (Ito and Howe, 2017).
Another field of study that looks at the interaction of a multitude of simple elements is Chaos
Theory (Gleick, 1987; Holland, 1999). Chaos theory is an interdisciplinary branch that looks
at compact mathematical models that describe patterns of behaviour that are highly sensitive
to initial conditions. Even though the systems may be deterministic, minute differences in
measurement of initial conditions can make prediction of future behaviour almost impossible.
In the words of Edward Lorenz (1993), ‘Chaos: When the present determines the future, but
the approximate present does not approximately determine the future.’ The book by Gleick
describes concepts like self-similarity which can be used to model the structure of coastlines
(which look the same at any level of magnification), trees, and other seemingly random structures
in nature. Benoit Mandelbrot, a Polish-born French-American mathematician, devoted his
research to finding the simple mathematical rules that give rise to the rough and irregular shapes
of the real world. His work gave rise to the world of fractal geometry.6 The fascinating images
produced by the simple equations of the Mandelbrot sets, Julia sets, and the Sierpinski triangle
all have unconstrained self-similar structures where any amount of zooming in presents the
same picture. A fascinating documentary The Secret Life of Chaos by Jim Al-Khalili starts off
with Turing and the study of chaos, and goes on to describe how genetic churning could have
created complex life.
The key idea in emergent systems is that life forms have emerged through simple elements
interacting with each other to give rise to complex structures. The British mathematician John
Conway brought this to light in a spectacular ‘game’ that he invented called the Game of Life7
or simply Life. It was publicized by Martin Gardner in his column on Mathematical Games in
Scientific American (Gardner, 1970). The Game of Life is a cellular automaton made up of an
infinite two dimensional grid of cells in principle. In computer implementations we generally
stitch the left end of the screen with the right end, and the top with the bottom, defining an
endless toroidal array of cells.
Each cell can be in two states, dead or alive, 0 or 1. The ‘game’ has no active players, and
once initialized, takes a life of its own. At each time step, each cell obeys the following simple
rules of birth, death, and survival based on the current states of its eight neighbours:
• If a cell is alive at time t and has two or three neighbours alive, then it stays alive at time
t+1. Else it dies.
• If a cell is dead at time t it becomes alive at time t+1 if it has exactly three neighbours that
are alive. Else it remains dead.
The rules are such that cells thrive unless there is overcrowding or, at the other end,
loneliness. In the cellular automaton, time moves in discrete steps and the system evolves. The
fascinating thing about the game is that depending upon the starting position the cells evolve
in unpredictable ways. Three cells in a row left to themselves rotate around the middle cell, as
two die of loneliness and two new ones are born in each cycle. There are static structures like
four cells arranged in a square, since each has three live neighbours. But if another structure
were to come and crash into it, the tranquillity would be broken when the number of alive
neighbours crosses the overcrowding threshold of three. Movement of patterns is possible, for example, the famous Gosper's glider gun which 'emits objects or organisms' that tumble
across the 2-dimensional space. This is illustrated in Figure 5.16. But there are more interesting
patterns in which societies of cells grow and wane as time goes along. The reader is encouraged
to implement the game and experiment with the rules.
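For readers who take up that suggestion, here is a small Python sketch of one generation of the game on a toroidal grid of 0s and 1s, directly encoding the birth and survival rules stated above.

def life_step(grid):
    """Compute the next generation of Conway's Game of Life on a toroidal grid."""
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count the eight neighbours, wrapping around the edges.
            alive = sum(grid[(r + dr) % rows][(c + dc) % cols]
                        for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                        if (dr, dc) != (0, 0))
            if grid[r][c] == 1:
                nxt[r][c] = 1 if alive in (2, 3) else 0   # survival rule
            else:
                nxt[r][c] = 1 if alive == 3 else 0        # birth rule
    return nxt

Starting from three cells in a row and calling life_step repeatedly shows the blinking behaviour described above.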
Figure 5.16 Gosper’s glider gun emits this 5-cell ‘organism’ that creates an illusion of
movement in the Game of Life. The detail at the top shows one time step. The shaded cells are alive. The number in each cell is the number of alive neighbours. The numbers in bold mark the cells that will be alive in the next time step. The figure at the bottom shows four time steps ending in a replica of the starting structure, shifted to the right and down.
What the game demonstrates is that simple elements operating with simple rules can lead
to the emergence of higher level entities. We often see flocks of birds in flight coordinating their
movements in tandem as if they were one entity. Collections of individual elements acting in a decentralized, self-organizing manner are also referred to as exhibiting swarm intelligence.
Colonies of ants and termites, schools of fish, flocks of birds, and herds of land animals are
some groups that characterize such behaviour.
Simple components interacting with simple rules result in the emergence of higher level
patterns that persist. The phenomenon of reproduction emerged when nature figured out
encoding body plans at the genomic level. These instructions coded in the genes orchestrate
the diversification of cells into different kinds of tissues that make up a living creature, which consumes food that provides both the matter to build its body and the energy to do so.
Competition for food and natural selection ensured that the best designs survived. Bisexual
reproduction accelerated the production of better designs when individuals competed for mates
in addition to food. The world we live in with all its diversity is the culmination of this mindless
application of simple rules over millions of years, including the evolution of creatures with
minds that could contemplate and ponder over this very phenomenon.
In Chapter 10 of his book Gödel, Escher, Bach, Hofstadter (1979) describes how composite
systems can be viewed. In an accompanying writeup, Ant Fugue, he describes how a colony of
ants can be treated as a more complex organism behaving in a purposeful manner, even creating
living bridges with their own bodies. The human brain too is a complex system made up of
large numbers of simple elements, the neurons. Neither the ant in a colony nor a neuron in a
neural network has the big picture which the composite entity has.
We look at an optimization algorithm that is inspired by ant colonies.
Figure 5.17 Initially ants A, B, C, D, and E set out in search of food. When ant A finds some
food, it grabs it and dutifully heads back. It follows its own pheromone trail back, and
strengthens it. Ants F, G, and H have set out meanwhile. Ant F followed the trails left by ant A,
and will come back with food too, further strengthening the trail.
Motivated by the behaviour of the ant colony, Marco Dorigo and his associates devised
an optimization algorithm in which simple problem solving agents cooperate with each other,
finding better and better solutions by following the cues given out by other agents (Colorni,
Dorigo, and Maniezzo, 1991; Dorigo, 2004). The algorithm is called ant colony optimization
(ACO). The main idea here is that an army of simple agents repeatedly solves a problem
individually, and in each new cycle is influenced by the solutions synthesized by all agents in
the previous cycle. A little bit like how a cell in the Game of Life is influenced by the states of
its neighbours at the previous time step. The algorithm is best illustrated with the TSP problem.
Let us say we have an N city TSP to solve. We place M ants randomly on the nodes in the
graph. Then, in each cycle, each ant constructs a tour independently using a simple stochastic
greedy algorithm as described below.
In each cycle beginning at time t the ants start constructing a tour, completing it at time t + N. At each choice point at City_i an ant chooses a city to move to in a probabilistic manner. The probability of moving to City_j is influenced by two factors. One, η_ij, called visibility, is inversely proportional to the cost of the edge between City_i and City_j. The other, τ_ij(t), is the total pheromone on edge_ij at the beginning of the cycle at time t. The two factors are moderated by parameters α and β. The probability of the kth ant moving from City_i to City_j in the cycle beginning at time t is given by

    p_ij^k(t) = [τ_ij(t)]^α × [η_ij]^β / Σ_l [τ_il(t)]^α × [η_il]^β

where the sum in the denominator is over the cities l that ant k is still allowed to visit.
In nature, shorter trails get reinforced faster, since ants following them return sooner and deposit more pheromone on the path to the lump of food. The goal in TSP is to construct the shortest tour. Shorter tours should
have more pheromone. This is achieved by assuming that an ant deposits the same amount of
pheromone on all the edges of the tour it has found. The amount of pheromone deposited is
inversely proportional to the length Lk of the tour constructed by the kth ant. After constructing
a tour in N time steps, each ant k deposits an amount of pheromone Q/Lk on all the edges it
traversed, where Q is a parameter the user can control.
Being a chemical substance, pheromone evaporates in real life. In ACO, this phenomenon is incorporated by a parameter ρ, which is the rate of evaporation. We assume that a fraction proportional to ρ evaporates in every cycle. The total pheromone on edge_ij after the cycle is over is

    τ_ij(t + N) = (1 − ρ) × τ_ij(t) + Δτ_ij(t, t + N)

where Δτ_ij(t, t + N) is the pheromone deposited by all the ants in the cycle beginning at time t and ending at time t + N.
In this manner, M ants repeatedly construct tours in a greedy stochastic manner, choosing
the next city heuristically at each point in time, and not looking back. Each ant deposits an
amount of pheromone inversely proportional to the cost of the tour it finds. Shorter tours get
more pheromone on the edges, and more ants follow those edges, further strengthening the
levels of pheromone. The termination criterion can either be a fixed number of cycles or a
threshold on the change in the best tour cost between two consecutive cycles. The algorithm is given below.
Algorithm 5.5 The ant colony algorithm for TSP (TSP-ACO) employs M ants to
construct tours independently in a greedy stochastic manner. In every cycle each ant is
drawn towards edges on shorter tours found by other ants.
TSP-ACO()
1   bestTour ← nil
2   repeat
3       randomly place M ants on N cities
4       for each ant a                                  ▷ construct tour
5           for n ← 1 to N
6               ant a selects an edge from the distribution Pn
7       update bestTour
8       for each ant a                                  ▷ update pheromone
9           for each edge (u, v) in the ant’s tour
10              deposit pheromone ∝ 1/tour-length on edge (u, v)
11  until some termination criteria
12  return bestTour
The reader is encouraged to implement this simple algorithm, all the better with a graphical
user interface depicting the pheromone levels with the thickness of the edges.
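A minimal Python sketch along the lines of Algorithm 5.5 is given below. The parameter names alpha, beta, rho, and Q follow the discussion above; a fully connected cost matrix is assumed, the termination criterion is simply a fixed number of cycles, and the next city is chosen by roulette-wheel selection over τ^α · η^β.

import random

def tsp_aco(dist, n_ants, n_cycles, alpha=1.0, beta=2.0, rho=0.5, Q=100.0):
    """Ant colony optimization for the TSP.

    dist : dict of dicts, dist[i][j] = cost of edge (i, j), for all i != j.
    Returns the best tour found (list of cities) and its length.
    """
    cities = list(dist)
    tau = {i: {j: 1.0 for j in cities if j != i} for i in cities}   # pheromone
    best_tour, best_len = None, float("inf")

    def tour_length(tour):
        return sum(dist[tour[k]][tour[(k + 1) % len(tour)]] for k in range(len(tour)))

    for _ in range(n_cycles):
        tours = []
        for _ in range(n_ants):
            current = random.choice(cities)
            tour, unvisited = [current], set(cities) - {current}
            while unvisited:
                # Probability of each candidate ~ tau^alpha * visibility^beta.
                weights = [(j, (tau[current][j] ** alpha) * ((1.0 / dist[current][j]) ** beta))
                           for j in unvisited]
                total = sum(w for _, w in weights)
                r, acc, nxt = random.uniform(0, total), 0.0, weights[-1][0]
                for j, w in weights:
                    acc += w
                    if r <= acc:
                        nxt = j
                        break
                tour.append(nxt)
                unvisited.remove(nxt)
                current = nxt
            tours.append(tour)

        # Evaporation, then deposits proportional to 1 / tour length.
        for i in tau:
            for j in tau[i]:
                tau[i][j] *= (1.0 - rho)
        for tour in tours:
            length = tour_length(tour)
            if length < best_len:
                best_tour, best_len = tour, length
            for k in range(len(tour)):
                a, b = tour[k], tour[(k + 1) % len(tour)]
                tau[a][b] += Q / length
                tau[b][a] += Q / length
    return best_tour, best_len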
Summary
In this chapter we studied randomized algorithms in the context of stochastic local search for
optimization. All the algorithms we looked at were motivated by real world phenomena. The
first, SimulatedAnnealing, is based on a transition from random search to gradient descent
(or ascent) in a quest to head towards the global optimum while avoiding the local ones. This is
based on a technique used in moulding materials into desired low energy states.
Then we moved on to drawing inspiration from natural phenomena. The world around us
has emerged from random mixing and matching combined with the survival of the persistent.
As simple elements combine to give rise to higher level phenomena, competition for resources
leads to selection of the fittest elements, and the rise of different species that evolve. Selection
provides a ratchet mechanism that preserves better designs, as and when the random processes
stumble upon them. This led us to GeneticAlgorithms which mimic the process of evolution.
Next we looked at how populations of ants coordinating with each other through biosemiotics
provide us with a cue for distributed optimization techniques in the form of the ACO algorithm.
Emergence refers to complex entities emerging from the interaction of many simple ones,
like in a colony of ants. The human brain is one such complex entity that is made up of millions
of simple elements called neurons. We defer the discussion on neural networks, which are
information processing architectures that learn, to a later chapter when we discuss learning and
its relation to search.
In the next chapter we come back to deterministic approaches to finding optimal solutions.
We had ventured into optimization, motivated by the desire to find the node with the best
heuristic value.
The heuristic function was devised to speed up search. We return to problem solving where
we have a goal node, with the objective of finding the shortest path. The need for employing a
heuristic function will come in again. We seek not only to find the optimal path to a goal, but
also to do so quickly in an informed manner.
Exercises
1. Devise an algorithm to take a value P between 0 and 1, and return a 1 with probability P,
and 0 otherwise.
2. Extend the above algorithm to accept fitness values of a set of candidates {P1,
P2,..., PN} and produce a new population where each candidate is cloned with a probability
proportional to its fitness. A candidate may have more than one cloned child, and some
candidates may have none.
3. Write a program to generate random instances of the TSP problem, given the number of
cities N as an input. This entails generating the random coordinates of the N cities. Assume
Euclidean distance as the cost of each edge. Implement and compare the performance of
the following three sets of algorithms.
a. Choose different perturbation operators for the path representation. Choose a
starting temperature T, a cooling function, and a random starting tour. Implement the
SimulatedAnnealing algorithm and plot the temperature and the tour cost with time.
b. Accept the population size P as an input, experiment with different crossover operators
for solving the TSP with a GeneticAlgorithm. Plot the best and average tour costs
with time.
c. Accept the number of ants M as an input, and experiment with the three parameters α, β, and ρ used in the ACO algorithm. Plot the best and average tour costs with time.
4. A tour is shown in the figure below; the edges are bidirectional. Use A, B, C, D, E, F, G,
H, I, and J as reference sequence (index sequence) for preparing ordinal and adjacency
representations. Use this tour to answer the following questions:
9. Given the set of cities {A, B, C, D, E, F, G}, is there a tour whose ordinal representation is
[2 2 2 2 2 2 2]? If yes, then what is the path representation of the tour? If no, explain why.
10. Given the set of cities {A, B, C, D, E, F, G}, is there a tour whose ordinal representation is
[7 6 5 4 3 2 1]? If yes, then what is the path representation of the tour? If no, explain why.
11. Given the set of cities {A, B, C, D, E, F, G}, is there a tour whose ordinal representation is
[2 2 2 2 2 2 1]? If yes, then what is the path representation of the tour? If no, explain why.
12. Given the set of cities {A, B, C, D, E, F, G}, is there a tour whose ordinal representation is
[2 1 2 1 2 1 1]? If yes, then what is the path representation of the tour? If no, explain why.
13. Is there a unique ordinal representation of every tour?
14. Given the set of cities {A, B, C, D, E, F, G} and the ordinal representations [2 1 2 1 2 1
1] and [7 6 5 4 3 2 1], represent the two tours in path representation. Do a single point
crossover on the ordinal tours and verify that the resulting representations are valid tours.
15. Programming exercise: Randomly create a TSP with N cities to be displayed on the monitor.
Let the graph be fully connected with edge weights being the Euclidean distance. Do not
display all the edges. Ask the user to choose the parameters M, α, β, and ρ. Ask the user
to choose the number of cycles. Execute the TSP-ACO algorithm, displaying the best tour
found at the end of each cycle. Draw the thickness of each edge in proportion to the amount
of pheromone on the edge. If there are any edges that have a higher amount of pheromone
than the best ones on the tour, draw them too.
16. Trace the evolution of the following three patterns in Conway’s Game of Life.
chapter 6
Algorithm A* and Variations
In this chapter we look at the algorithm A* for finding optimal solutions. It is a heuristic
search algorithm that guarantees an optimal solution. It does so by combining the goal
seeking of best first search with a tendency to keep as close to the source as possible.
We begin by looking at the algorithm branch & bound that focuses only on the latter,
before incorporating the heuristic function.
We revert to graph search for the study of algorithms that guarantee optimal solutions. The task
is to find a shortest path in a graph from a start node to a goal node. We have already studied
algorithms BFS and DFID in Chapter 3. The key idea there was to extend that partial path
which was the shortest. We begin with the same strategy. Except that now we add weights to
edges in the graph. Without edge weights, the optimal or shortest path has the least number of
edges in the path. With edge weights added, we modify this notion to the sum of the weights
on the edges.
The common theme continuing in our search algorithms is as follows:
Pick the best node from OPEN and extend it, till you pick the goal node.
The question that remains is the definition of ‘best’. In DFS, the deepest node is the best
node. In BestFirstSearch, the node that appears to be closest to the goal is the best. In BFS,
the node closest to the start node is the best. We begin by extending the idea behind breadth
first search.
We can generalize our common theme as follows. With every node N on OPEN, we associate
a number that stands for the estimated cost of the final solution. For BestFirstSearch, this
is the estimated distance from N to the goal node, ignoring the cost up to node N. For BFS,
it is, implicitly, the depth in the search tree. For the branch and bound algorithm described
below, it is the sum of the edge costs from the start node to N. Here we ignore the cost beyond
node N.
Algorithm 6.1. Algorithm B&B searches in the space of partial paths. It extends the
cheapest partial path till it finds one to the goal.
B&B-FirstCut(S)
1   OPEN ← ([S], 0) : [ ]
2   while OPEN is not empty
3       pathPair ← head OPEN
4       (path, cost) ← pathPair
5       N ← head path
6       if GoalTest(N) = true
7           then return reverse(path)
8       else
9           children ← MoveGen(N)
10          newPaths ← MakePaths(children, pathPair)
11          OPEN ← sort_cost(newPaths ++ tail OPEN)
12  return empty list

MakePaths(children, pathPair)
1   if children is empty
2       then return empty list
3   else
4       (path, cost) ← pathPair
5       M ← head children
6       N ← head path
7       return ([M : path], cost + k(N, M)) : MakePaths(tail children, pathPair)
The function MakePaths generates all children of the node N and extends the path
[N ... S] by adding each child M in turn to the head of the path, along with the cost of the path
[M N ... S]. The new search nodes of the form ([M N ... S], cost) are added to OPEN in
Line 11. OPEN is maintained as a sorted list, or more efficiently, as a priority queue. We look at
the performance of this algorithm on the tiny search problem depicted in Figure 6.1.
S → (A, B, C)
A → (S, B, D)
B → (S, A, D)
C → (S, G)
D → (A, B, G, E)
E → (D, G)
G → (C, D, E)
Figure 6.1 A tiny search space. S is the start node and G is the goal node. The labels on the
edges are edge costs. Note that the placement of nodes in the diagram does not reflect
edge costs.
The algorithm begins by adding the pathPair ([S], 0) to OPEN. This is the only node on
OPEN. It is removed and the neighbours of S are added in a sorted order.
OPEN = [([B S], 3), ([A S], 6), ([C S], 8)]
The path SB with cost 3 is removed, and the neighbours of B are added.
OPEN = [([A B S], 5), ([A S], 6), ([D B S], 7), ([C S], 8)]
Observe that the algorithm has found two paths to the node A, one with cost 5 and the other
with cost 6. It will next pick ([A B S], 5) for expansion. At this point one can argue that it should
abandon and delete the other path ([A S], 6). We will investigate this a little later, when we will
keep only one copy of a node instead of all the paths that lead to it. Meanwhile, we can observe
that the simple strategy of extending the cheapest path at all times will lead to an unfettered
explosion of the search space as depicted in Figure 6.2.
Figure 6.2 The first cut B&B algorithm keeps multiple paths to nodes in its search space
including loops like SBABA. After eight expansions of the shaded nodes, it has found one path
to the goal G with cost 13. But it will only pick that when it has exhausted all cheaper paths, in
the process generating other paths to G.
When there is a collection of short edges in the state space, then the algorithm goes into
loops because of its strategy of always extending the cheapest path. In our example, the nodes
S, A, B, and D have short edges connecting them. Apart from the wasted work, there is a larger
danger lurking here. While the algorithm is guaranteed to find the shortest path if there is one, if
there is no path to the goal it will keep looping endlessly searching for one. We saw this danger
lurking in algorithm DFID in Chapter 3. One way to mitigate this problem is to stop the search
from going into loops. This can be done by checking if a new child already exists on the partial
path being extended. Algorithm 6.2 is a version of B&B which does that.
Algorithm 6.2. A version of B&B that avoids getting into loops. If a child of N is already
present in the path, then that is discarded.
B&B(S)
1   OPEN ← ([S], 0) : [ ]
2   while OPEN is not empty
3       pathPair ← head OPEN
4       (path, cost) ← pathPair
5       N ← head path
6       if GoalTest(N) = true
7           then return reverse(path)
8       else
9           children ← MoveGen(N)
10          noloops ← RemoveSeen(children, path)
11          newPaths ← MakePaths(noloops, pathPair)
12          OPEN ← sort_cost(newPaths ++ tail OPEN)
13  return empty list
The function RemoveSeen, similar to the one we saw in Chapter 3, removes any nodes that
are looping back to the path. It does so by calling OccursIn which checks if a node is already
in a given list. Figure 6.3 is the same search space as in Figure 6.2 except that loops have been
removed by the modified B&B algorithm.
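A Python sketch of this loop-checking version of B&B is shown below; it keeps OPEN as a list of (cost, path) pairs sorted by cost, with each path stored in reverse as in the pseudocode. A priority queue would be the better choice for larger problems.

def branch_and_bound(start, goal_test, move_gen, k):
    """B&B over partial paths with loop checking.  Paths are stored in
    reverse (the head of the list is the most recently added node)."""
    open_list = [(0, [start])]                 # (cost, reversed partial path)
    while open_list:
        cost, path = open_list.pop(0)          # cheapest partial path
        n = path[0]
        if goal_test(n):
            return list(reversed(path)), cost
        for m in move_gen(n):
            if m in path:                      # RemoveSeen: avoid loops
                continue
            open_list.append((cost + k(n, m), [m] + path))
        open_list.sort(key=lambda pair: pair[0])
    return None, float("inf")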
Figure 6.3 The pruned search space from Figure 6.2 generated by B&B with loop checking.
Values next to nodes on OPEN are known path costs.
With the above modification, the B&B algorithm will not enter an infinite loop. It will still
explore all distinct paths without loops. Even this can be wasteful. For example, the algorithm
will first extend the path SBAD which has cost 9 to G with cost 15, before it eventually picks the
node G on SBDG with cost 13. This is because of the implicit assumption that the known cost
of SBAD, viz. 9, is the estimated cost of the final solution, which is less than 13. This happens
because it does not know that the node D in the two options is the same. If we store the graph
itself, keeping exactly one copy of any node, then it would circumvent this problem. Figure 6.4
shows the search space at the moment when B&B is about to finally pick the goal node from
OPEN with cost 13. As can be seen, it has found the optimal path.
Figure 6.4 The moment when B&B is about to pick the goal node with path cost 13. Observe
that all other nodes on OPEN have a higher cost.
One observation is key. At the point where the B&B algorithm picked the goal G, all other
branches in the search tree had higher costs for the partial paths. And since those costs can only
go up, the path found must be optimal.
If one is to treat the cost stored with each node as an estimate of the cost of the optimal
solution leading from that node, then it is imperative that the estimate be a lower bound on the
actual cost. That is, it should be an underestimate. When that happens, then at the moment the
algorithm picks the goal node it has found the optimal path. Because it is the lowest cost node
on OPEN, and all other nodes have higher estimated lower bound costs. In B&B, we have a
bound on every branch which allows us to decide whether to process it further.
We look at the performance of B&B on the four parameters for search algorithms.
We will return to graph search a little later and prove that underestimation leads to optimal
solutions. Before that we look at an application of B&B to the travelling salesperson
problem (TSP).
Cities    A     B     C     D     E
A         –     2     6     100   110
B         2     –     4     80    90
C         6     4     –     60    70
D         100   80    60    –     10
E         110   90    70    10    –
We begin with a search node S containing all tours. The value associated with this node is
an estimate of cost of the best tour in S. One way to estimate the absolute lower bound for the
cost of any TSP is to associate each city with the two shortest edges emanating from it. This
follows from the fact that in a valid tour each city is connected to exactly two cities. This will
clearly be a lower bound.
The following are the two cheapest edges and the costs associated with each city in our tiny problem:

City A: AB (2) and AC (6), total 8
City B: AB (2) and BC (4), total 6
City C: BC (4) and AC (6), total 10
City D: DE (10) and CD (60), total 70
City E: DE (10) and CE (70), total 80

Since each edge is counted at both of its end cities, the lower bound is half the sum of these totals, which for S works out to (8 + 6 + 10 + 70 + 80)/2 = 87.
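A small Python sketch of this bound is given below, assuming the cost matrix is a dict of dicts with no self entries; edges that are forbidden in a branch can be passed in a set, while edges that must be included are omitted here to keep the sketch short.

def two_edge_lower_bound(dist, excluded=frozenset()):
    """Lower bound on any TSP tour: half the sum, over all cities, of the
    two cheapest usable edges at that city.  Edges in `excluded` (given as
    frozensets {u, v}) may not be used."""
    total = 0.0
    for city in dist:
        costs = sorted(dist[city][other] for other in dist[city]
                       if frozenset((city, other)) not in excluded)
        total += costs[0] + costs[1]
    return total / 2.0

# For the five-city matrix above this returns 87.0, and excluding edge AB gives 175.0.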
Following the heuristic, we select the cheapest edge AB and partition S into two. One, called S_AB, includes the edge AB, and the other, S_¬AB, excludes it. The estimated cost of S_AB is the same as the cost of S since edge AB is counted in S. For S_¬AB the edge AB must be excluded. The following are now the two cheapest edges and the costs associated with each city:

City A: AC (6) and AD (100), total 106
City B: BC (4) and BD (80), total 84
City C: BC (4) and AC (6), total 10
City D: DE (10) and CD (60), total 70
City E: DE (10) and CE (70), total 80

This gives S_¬AB an estimated cost of (106 + 84 + 10 + 70 + 80)/2 = 175. Clearly S_AB has a lower estimated cost. We repeat the process by selecting the next available cheapest edge BC. This gives us two sets: S_AB,BC that includes both AB and BC, and S_AB,¬BC that includes AB but excludes BC. The cost of S_AB,BC remains the same:

cost(S_AB,BC) = cost(S_AB) = cost(S) = 87
Figure 6.5 The search space in B&B TSP after two expansions.
It can be seen that the best node is S_AB,BC with estimated cost 87, and our search will
proceed to expand that. It cannot add the next shortest edge AC because that would result in a
premature loop. It will instead select the edge DE and continue.
The reader is encouraged to verify that after DE the next edge chosen will be CD, and then
the only option would be to select AE. When the last edge is chosen, the cost of the tour will
shoot up, because AE is the most expensive edge, and search will have to abandon this node,
and shift to the next cheapest node.
One can see that the estimate for S_AB,BC was overly optimistic. This is because the two edges being considered for A in this step are AB and AC with costs 2 and 6 respectively. Looking at Figure 6.5 one can see that the edge AC cannot be part of a tour that already has AB and BC, because together they would form a subtour. Thus, for S_AB,BC we need to exclude edge AC.
Now that the cost of S_AB,BC has been revised upwards to 161, the algorithm will instead shift its attention to S_AB,¬BC with cost 153. The reader is encouraged to verify that this will
indeed lead to the optimal solution.
In selecting edges, the following rules are observed:
1. No edges are allowed for any city which already has two edges.
2. A city with one edge can only connect to a city with zero edges.
3. The only exception is the last step when the tour is completed. At this point, only one edge
will be available.
The above rules apply to selection of edges both for refinement and for estimation. Why are
higher estimates better? By excluding cheaper disallowed edges in the estimates, we get a more
accurate estimate, which is higher. That branch will not make an unrealistic claim for being
picked from OPEN.
On the other hand, computing higher estimates involves more work, where one has to do
additional reasoning. But the more accurate the heuristic function, the lower is the amount of
search one has to do, because it has a better sense of direction. Choosing between the amount
of effort on the heuristic function versus the effort doing search is often a delicate balance.
6.2 Algorithm A*
Algorithm B&B has no sense of direction. It proceeds to the next nearest candidate and checks
whether it is the goal node. While correct, this can be extremely inefficient. Consider the
problem of finding the route from IIT Madras in Chennai to the Shore Temple in neighbouring
Mahabalipuram. B&B would exhaust every nook and corner of Chennai city before heading
out and south. It is restrained by the desire to stay as close to source as possible. What one also
needs is a pull towards the goal, like in BestFirstSearch (Chapter 4).
goal node is found. We describe the algorithm informally here. The input to the algorithm is
the complete graph at the very outset.
• Dijkstra’s algorithm begins by assigning infinite cost estimates to all nodes except the start
node, which has cost zero.
• It assigns the colour white to all the nodes initially.
• It picks the cheapest white node and colours it black.
• Relaxation: Inspect all neighbours of the new black node and check if a cheaper path has
been found to any of them. If yes, then update the cost of that node, and mark its new
parent.
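The procedure described in the bullets above can be sketched in Python as follows, with the graph given as a dict of neighbour-cost dicts; the 'pick the cheapest white node' step is done by a linear scan here, whereas a real implementation would use a heap.

import math

def dijkstra(graph, start):
    """graph: dict of dicts, graph[u][v] = cost of edge (u, v).
    Returns (cost, parent) dictionaries for shortest paths from start."""
    cost = {v: math.inf for v in graph}
    parent = {v: None for v in graph}
    cost[start] = 0
    white = set(graph)                          # not yet finalized
    while white:
        u = min(white, key=lambda v: cost[v])   # cheapest white node
        white.remove(u)                         # colour it black
        for v, k in graph[u].items():           # relaxation
            if cost[u] + k < cost[v]:
                cost[v] = cost[u] + k
                parent[v] = u
    return cost, parent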
Figure 6.6 Dijkstra's algorithm first generates node A as a child of S and marks S as its parent. When it colours node B black (adds it to CLOSED) and relaxes its neighbours, it also finds a cheaper path to A, and B becomes the parent of A. It will next colour node A, which has the lowest cost on OPEN.
Dijkstra’s algorithm was devised to find shortest paths to all nodes given a graph as input.
In our search task we have a goal node or goal specification in mind, and a move generation
function to generate the graph on demand. The problem graph could in principle have infinite
nodes, as long as there is a goal at a finite distance.
Figure 6.7 shows the problem graph from Figure 4.7 at the moment Dijkstra’s algorithm
picks the goal node. Observe that the algorithm has ended up searching the entire graph, but
found the optimal path with cost 148, where BestFirstSearch had found a path with cost 195
after inspecting only eight nodes.
Figure 6.7 The graph at the moment Dijkstra’s algorithm picks the goal node. The values in
the nodes are the costs of the shortest path from the start node. Observe that it has visited all
23 nodes in this graph.
We now introduce the A* algorithm, which imparts a sense of direction to search. It is one
of the most important algorithms we study in this book.
6.2.2 A*
The algorithm A*, first described by Hart, Nilsson, and Raphael (Hart, Nilsson, and Raphael,
1968; Nilsson, 1971, 1980) is a heuristic search algorithm that has found wide-ranging
applications.
Algorithm A* combines several features of the algorithms we have studied so far.
parent node. Like Dijkstra’s algorithm it marks the parent which is on the shortest path to
n. The parent pointer is instrumental in reconstructing the path.
Every node n in A* has an estimated cost f(n) of the path from the start node to the goal node
passing through n.
That is, f(n) is the estimated cost of the solution containing node n. The algorithm always
refines or expands the node with the lowest f-value. The search space for A* is depicted in
Figure 6.8.
Figure 6.8 The search space for A* is the state space itself. The estimated cost f(n) for every
node on OPEN has two components. One, g(n), is the known cost from the start to n.
The second, h(n), is an estimated distance to the goal.
Like our earlier algorithms A* picks the best node n from OPEN and checks whether it is
the goal node. If it is, then it follows the pointers back to reconstruct the path. If it is not, it adds
the node to CLOSED and generates its neighbours.
For each neighbour m of n it does the following:
1. If m is a new node, it adds it to OPEN with parent n and g(m) = g(n) + k(n, m) where
k(n, m) is the cost of the edge connecting m and n.
2. If m is on OPEN with parent n′, then it checks if a better path to m has been found. If yes, then it updates g(m) and sets its parent to n.
3. If m is on CLOSED with parent n′, then it checks if a better path to m has been found. If yes, then it updates g(m) and sets its parent to n. This possibility exists since h(n) is an estimate and may be imperfect, choosing n′ before n. The algorithm will also need to
propagate this improved cost to other descendants of m.
The algorithm A* is described below. We have been deliberately vague about the lists OPEN
and CLOSED. For small problems, one could implement them as lists. For larger problems, it
would be prudent to implement OPEN as a priority queue and CLOSED as a hash table.
Algorithm 6.3. Algorithm A* maintains one copy of each node either in OPEN or in
CLOSED.
A*(S)
1.  parent(S) ← null
2.  g(S) ← 0
3.  f(S) ← g(S) + h(S)
4.  OPEN ← S : []
5.  CLOSED ← empty list
6.  while OPEN is not empty
7.      N ← remove node with lowest f-value from OPEN
8.      add N to CLOSED
9.      if GoalTest(N) = True then return ReconstructPath(N)
10.     for each neighbour M ∈ MoveGen(N)
11.         if (M ∉ OPEN and M ∉ CLOSED)
12.             parent(M) ← N
13.             g(M) ← g(N) + k(N,M)
14.             f(M) ← g(M) + h(M)
15.             add M to OPEN
16.         else
17.             if (g(N) + k(N,M)) < g(M)
18.                 parent(M) ← N
19.                 g(M) ← g(N) + k(N,M)
20.                 f(M) ← g(M) + h(M)
21.                 if M ∈ CLOSED
22.                     PropagateImprovement(M)
23. return empty list

PropagateImprovement(M)
1.  for each neighbour X ∈ MoveGen(M)
2.      if g(M) + k(M,X) < g(X)
3.          parent(X) ← M
4.          g(X) ← g(M) + k(M,X)
5.          f(X) ← g(X) + h(X)
6.          if X ∈ CLOSED
7.              PropagateImprovement(X)
We have used the ReconstructPath function from Chapter 3, though it may need minor
tweaks given the different representation. While implementing the algorithm one may want to
also return the cost of the solution found.
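A Python sketch along these lines is given below, with OPEN as a heap and the g-values and parents kept in dictionaries. For simplicity it does not implement PropagateImprovement; instead, when a cheaper path to a node is found, the node is simply pushed onto the heap again and stale entries are skipped when popped. This is a common implementation shortcut rather than the exact formulation of Algorithm 6.3.

import heapq

def reconstruct_path(node, parent):
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return list(reversed(path))

def a_star(start, goal_test, move_gen, k, h):
    """A* search.  move_gen(n) returns the neighbours of n, k(n, m) is the
    cost of edge (n, m), and h(n) is the heuristic estimate to the goal."""
    g = {start: 0}
    parent = {start: None}
    open_heap = [(h(start), start)]           # entries are (f-value, node)
    while open_heap:
        f, n = heapq.heappop(open_heap)
        if f > g[n] + h(n):
            continue                          # stale entry for an improved node
        if goal_test(n):
            return reconstruct_path(n, parent), g[n]
        for m in move_gen(n):
            new_g = g[n] + k(n, m)
            if m not in g or new_g < g[m]:    # new node, or a better path found
                g[m] = new_g
                parent[m] = n
                heapq.heappush(open_heap, (new_g + h(m), m))
    return None, float("inf")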
Figure 6.9 illustrates the algorithm on the problem graph from Figure 4.7. As done
with BestFirstSearch, we use the Manhattan distance as the heuristic function h(n). We do this for ease of manual computation (especially in exam papers). In an
implementation, one would use the Euclidean distance.
Figure 6.9 The graph generated by algorithm A* when it picks the goal node. Shaded nodes
are on CLOSED. The values in the nodes are the f-values, and the values outside the nodes
are the Manhattan distance h(n) values. A* has inspected 14 nodes and found the optimal path
with cost 148.
As can be observed, A* inspects much fewer nodes than Dijkstra’s algorithm, but still finds
the same optimal path. It did inspect more nodes than BestFirstSearch but found a much
better path. What is the secret of its success?
6.2.3 A* is admissible
The star in the name A* is indicative of the fact that the algorithm is admissible. This means
that A* always finds the optimal path when a path to the goal exists. This is true even if the
graph is infinite under the following conditions:
1. The MoveGen or neighbourhood function has a finite branching factor. Clearly, with
infinite branching, it would not be able to even generate the neighbours. It does generate
all neighbours, unlike SimulatedAnnealing which generates a random neighbour.
2. The cost of every edge must be greater than a small constant ε.1 This, as we will see, is to preclude the possibility of getting trapped in an infinite path with a finite total cost.
3. The heuristic function must underestimate the distance to the goal, that is, h(n) ≤ h*(n) for every node n. We look at this condition informally before moving on to a formal proof of admissibility.
Consider a tiny search problem with four nodes S, A, B, and G. Node S is the start node and
G is the goal node. The edge costs are as follows:
k(S, A) = k(S, B) = 80
k(A, G) = 150
k(B, G) = 160
There is no direct edge between S and G. Clearly, the shortest path from S to G is
S-A-G with cost 230. We consider two heuristic functions h1 and h2, where h1 overestimates all distances and h2 underestimates all distances. Also, both functions are mistaken about the distance to the goal, with both believing that node B is closer to the goal than node A is. Let

h1(S) = 250, h1(A) = 190, h1(B) = 180, h1(G) = 0
h2(S) = 210, h2(A) = 130, h2(B) = 120, h2(G) = 0
First we look at the performance of algorithm A1 using the function h1. With every node,
we display its f-value (f(n) = g(n) + h(n)) as a subscript of the node.
1. OPEN = [S0+250=250], CLOSED = []. A1 picks S. Since S is not the goal it adds its neighbours
A and B to OPEN. S itself is added to the CLOSED.
2. OPEN = [A80+190=270, B80+180=260], CLOSED = [S250]. A1 picks node B with f(B) = 260.
It adds its neighbour G with g(G) = 80+160 = 240 and f(G) = 240 + 0 = 240.
Parent(G) = B.
3. OPEN = [G240+0=240, A80+190=270], CLOSED = [S250, B260]. A1 picks node G with f(G) = 240
and terminates.
A1 has failed to find the optimal path and is therefore undeserving of being decorated with a
star. Next we look at the performance of A2* using the function h2.
1 This was observed by an alert student, Arvind Narayanan, during my class in the mid-1990s. Traditional wisdom
then was that the edge cost just be greater than zero.
1. OPEN = [S0+210=210], CLOSED = []. A2* picks S. Since S is not the goal it adds its neighbours A and B to OPEN. S itself is added to CLOSED.
2. OPEN = [A80+130=210, B80+120=200], CLOSED = [S210]. A2* picks node B with f(B) = 200.
It adds its neighbour G with g(G) = 80+160 = 240 and f(G) = 240 + 0 = 240.
Parent(G) = B.
3. OPEN = [A80+130=210, G240+0=240], CLOSED = [S210, B200]. A2* picks node A and finds a
better path to G with g(G) = 80+150 = 230 and f(G) = 230 + 0 = 230. It updates the value
of G and resets the parent of G to A.
4. OPEN = [G230+0=230], CLOSED = [S210, B200, A210]. A2* picks node G with cost 230 and
terminates.
As one can observe, both versions of the algorithm were misinformed and wrongly picked the
node B and found a longer path to G first. For A1 the cost 240 of S-B-G appeared better than the overestimated f-value 270 of node A and it terminated. For A2* the same cost 240 of S-B-G
appeared worse than the underestimated value 210 of node A and it expanded node A to find
the optimal path S-A-G with cost 230. The differentiating factor was the underestimation done
by h2.
The fact that A2* needed four steps to terminate against the three needed by A1 is directly
related to the fact that the estimate h2 is consistently lower than the estimate h1. This is similar
to what we saw in the TSP example in the previous section where the version with a higher
estimate terminated faster. The difference between this example and the TSP example is that
the latter found the optimal tour in both cases. This leads us to the conjecture that higher
estimates are better but only up to a certain point. That point is the threshold set by the actual
optimal cost. Within this threshold, the higher the estimate the better. We formalize this notion
in the next section.
Given the above, we prove a series of lemmas leading to the proof of admissibility of A*.
L1: The algorithm always terminates for finite graphs.
Proof: The algorithm keeps exactly one copy of every node generated, either in OPEN or
in CLOSED. In every cycle, it picks one node from OPEN and moves it to CLOSED if it is not
a goal node. It also adds some previously not seen nodes to OPEN. Since the total number of
nodes is finite, there will be eventually none left to add to OPEN. If the goal node is not in the
graph, OPEN will become empty and the algorithm will terminate.
L2: If a path exists from the start node to the goal node, then at all times before termination
OPEN always has a node n′ on an optimal path. Furthermore, the f-value of this node is optimal
or less.
Proof: Let (S, n1, n2, ..., G) be such an optimal path. The algorithm begins by adding S to OPEN. When S is removed from OPEN, its successor n1 is added to OPEN, and when that is removed then n2 is added, and so on. If G is removed, then the algorithm has terminated. Else let n′ be the first node on this path that is still on OPEN.
Then,
    f(n′) = g(n′) + h(n′)
          = g*(n′) + h(n′)        because n′ is on the optimal path, g(n′) = g*(n′)
          ≤ g*(n′) + h*(n′)       because h(n′) ≤ h*(n′) by C3
          = f*(n′)
          = f*(S)                 because n′ is on the optimal path, f*(n′) = f*(S)
L6: Let A1* and A2* be two admissible versions of A* and let h*(n) > h2(n) > h1(n) for
all n. We say h2 is more informed than h1, because it is closer to the h* value. Then the search
space explored by A2* is contained in the space explored by A1*. Every node visited by A2* is
also visited by A1*. Whenever a node is visited it is also added to CLOSED.
Proof (by induction): To show that if n ∈ CLOSED2, then n ∈ CLOSED1.
Base step: if S ∈ CLOSED2, then S ∈ CLOSED1.
Induction hypothesis: Let the statement be true for all nodes up to depth d. If n ∈ CLOSED2, then n ∈ CLOSED1.
Induction step (by contradiction):
Assumption: Let L be a node at depth (d + 1) such that L ∈ CLOSED2 and let A1* terminate without inspecting L.
Since A2* has picked node L, f2(L) ≤ f*(S), that is, g2(L) + h2(L) ≤ f*(S). Since A1* terminated without inspecting L, f1(L) ≥ f*(S), that is, g1(L) + h1(L) ≥ f*(S). Combining the two,
    g2(L) + h2(L) ≤ g1(L) + h1(L) ≤ g2(L) + h1(L)
because g1(L) ≤ g2(L), since A1* has seen all nodes up to depth d seen by A2*, and would have found an equal or better cost path to L.
We can rewrite the last inequality as
    h2(L) ≤ h1(L)
which contradicts the given fact that h2(n) > h1(n) for all nodes.
The assumption that A1* terminates without expanding L must be false, and therefore A1* must expand L. Since L was an arbitrary node picked at depth d + 1, the following is true at depth d + 1 as well: if n ∈ CLOSED2, then n ∈ CLOSED1.
Remember that the heuristic function is an estimate of the distance to the goal. As one
moves towards the goal, the h-value is expected to decrease. The monotone condition says that
this decline cannot be greater than the edge cost. One can think of h(m) − h(n) as the cost of the m-n edge as estimated by the heuristic function. This too must be an underestimate:
    h(m) − h(n) ≤ k(m, n)
Rearranging the above inequality, we get
    f(m) = g(m) + h(m) ≤ g(m) + k(m, n) + h(n) = g(n) + h(n) = f(n)
when n is reached via m. That is, as we move towards the goal, the estimated cost of the solution increases, becoming
closer to the optimal cost. A more significant consequence of the monotone condition is that
when A* picks a node n from OPEN, it has already found the optimal path from the start node
S to n. That is, g(n) = g*(n). We look at the proof.
Let A* be about to pick node n with a value g(n). Let there be an optimal path P from S to
n which is yet unexplored fully. On this path P let nL be the last node on CLOSED and let nL+1
be its successor on OPEN. Given the monotone condition,
    f(nL+1) ≤ f(nL+2)
because both are on the optimal path. By transitivity, this inequality extends to node n:
    f(nL+1) ≤ g*(n) + h(n)
This gives us, since A* is about to pick n while nL+1 is still on OPEN,
    f(n) = g(n) + h(n) ≤ f(nL+1) ≤ g*(n) + h(n), that is, g(n) ≤ g*(n)
But g(n) cannot be less than g*(n) which is the optimal cost from S to n. Therefore,
g(n) = g*(n)
A direct consequence of this is that A* does not have to update g(n) for nodes of CLOSED.
This allows us to implement some space saving versions of A* which prune CLOSED
drastically. We look at space saving versions later in this chapter.
6.2.7 Performance of A*
1. Completeness: The algorithm is complete. As shown earlier, it will even terminate on
infinite graphs if there is a path from the start node to the goal.
2. Quality of solution: As shown earlier, A* is admissible. Given an underestimating heuristic
function, it always finds the optimal path to the goal node.
3. Space complexity: The space required depends upon how good the heuristic function is.
Given a perfect heuristic function, the algorithm heads straight for the goal, and the OPEN
list will grow linearly. However, in practice, perfect heuristic functions are hard to come
by, and it has been experimentally observed that OPEN tends to grow exponentially.
4. Time complexity: The time complexity is also dependent on how good the heuristic
function is. With a perfect function, it would be linear. In practice, however, it does a fair
bit of exploration, reflected by the size of CLOSED, and generally needs exponential time
as well.
We now explore variations of A*. First we look at a variation that compromises on the quality
of the solution to gain on time and space complexity.
6.2.8 Weighted A*
We have observed that A* explores more of the space than BestFirstSearch, but finds an
optimal solution. We look at a variation of A* that allows us to choose the trade-off between
quality and complexity, of both time and space which go hand in hand for A*. We have also
observed earlier that heuristic functions with higher estimates result in more focussed search.
We have so far imposed a condition that the heuristic function must underestimate the distance
to the goal. We now explore how the algorithm behaves when we relax that condition.
Consider a weighted version of the estimated cost f(n) = αg(n) + βh(n). What would be the behaviour of A* with different values of α and β? When β = 0 we have the uninformed search algorithms BFS and B&B. We can even model DFS by defining g(n) = 1/depth. If α = 0 we have BestFirstSearch.
When α = 1 and β = 1 we have A*. Choosing a value of β < 1 would push the algorithm towards B&B, without any material advantage. Choosing β > 1 gives us the weighted A*
algorithm. Traditionally in the literature the algorithm is known as wA* where w is the weight
in f(n) = g(n) + w x h(n). As the value of w increases, the algorithm becomes more and more
like BestFirstSearch, finding the solution faster but possibly one with a higher cost.
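In terms of the A* sketch given earlier, wA* needs only the heuristic term to be inflated by the weight; a minimal wrapper reusing that a_star function could look as follows.

def weighted_a_star(start, goal_test, move_gen, k, h, w=2.0):
    """wA*: run the a_star sketch from the previous section with the heuristic
    inflated by w.  With w = 1.0 this is plain A*; larger values of w make the
    search behave more like BestFirstSearch, trading quality for speed."""
    return a_star(start, goal_test, move_gen, k, lambda n: w * h(n))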
Figure 6.10 shows wA* with w = 2 on our example graph of Figure 4.7 on which Dijkstra’s
algorithm and A* found solutions with cost 148. As can be seen, wA* expands fewer nodes
than A* but finds a more expensive solution with cost 153, which though is better than the one
found by BestFirstSearch with cost 195.
Figure 6.10 Weighted A* with w = 2 finds a path to the goal after inspecting nine nodes. But it
finds a more expensive path to the goal than A* with cost = 153. The values outside the nodes
are 2h(n) values.
1. Instead of depth the algorithm uses the f-value of a node to check that it is within the bound
to continue searching. This bound is initially set to f(S) = h(S), which we know is a lower
bound on optimal cost.
2. When the neighbours of a node are generated, one needs to check that their f-value is
within the bound before adding it to OPEN.
3. One has to keep track of the lowest value among the nodes that were not added to OPEN.
This lowest value will become the bound in the next cycle.
The above changes are left as an exercise for the reader. The high level IDA* algorithm is given
below.
Algorithm 6.4. IDA* does a series of DB-DFSs with increasing depth bounds.
IDA*(start: S)
1   depthBound ← f(S)
2   while True
3       DB-DFS(start, depthBound)
4       depthBound ← f(N) of cheapest unexpanded node on OPEN
The above algorithm suffers from the same drawback we observed in DFID. The algorithm
may loop forever even for a finite graph which does not contain the goal node. The reader is
encouraged to modify it to count the number of new nodes generated by IDA* and terminate
with failure, like we did for DFID, if the count does not increase in some cycle. If the graph is
infinite and there is no goal node, there is no way of reporting failure.
One problem with using IDA* without checking for CLOSED is that in many problems the
number of paths to any node grows combinatorially. The algorithm will end up exploring all
these paths. Another problem with IDA* is that of thrashing. When the f-values of most nodes
are distinct, extending the bound to the lowest one will only include that node in the next cycle.
This is a waste of effort. DFID worked because one expected the next layer to have more nodes
than all the visited ones put together. One can ameliorate this problem a little by compromising
on quality. This could be done by increasing the bound by a fixed increment δ, which then becomes the tolerance for the drop in quality.
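A recursive Python sketch of IDA* is given below. Rather than maintaining an explicit OPEN list, it tracks during each bounded depth first search the smallest f-value that exceeded the bound, and uses it as the bound for the next iteration; this is a standard formulation rather than a literal transcription of Algorithm 6.4, and it has the same weakness of looping forever on an infinite graph with no goal.

import math

def ida_star(start, goal_test, move_gen, k, h):
    """Iterative deepening A*: repeated depth first searches bounded by f."""

    def bounded_dfs(node, g, bound, path):
        f = g + h(node)
        if f > bound:
            return None, f                    # report the overshoot
        if goal_test(node):
            return list(path), f
        next_bound = math.inf
        for m in move_gen(node):
            if m in path:                     # avoid going around in circles
                continue
            path.append(m)
            found, value = bounded_dfs(m, g + k(node, m), bound, path)
            path.pop()
            if found is not None:
                return found, value
            next_bound = min(next_bound, value)
        return None, next_bound

    bound = h(start)
    while True:
        found, value = bounded_dfs(start, 0, bound, [start])
        if found is not None:
            return found
        if value == math.inf:                 # nothing exceeded the bound: no solution
            return None
        bound = value                         # lowest f-value that exceeded the bound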
Figure 6.11 RBFS maintains a linear number of nodes down one path indicated by the
heuristic function. When the path shown on the left looks less promising it rolls back to the next
good looking node, updating the f-values on the path as it rolls back.
In Figure 6.11, RBFS expanded node C at level one with the value 63, marking B as the
second best. It moved on to E and then marked F as the second best. It moved forward and had
finished inspecting node O with the value 72 while F remained the second best. None of the
successors of O is better than F. It starts the rollback, updating the f-value of each node to the
best value of its children. O gets the value 73 from S, which gets transmitted to K, and then on
to I. The parent E at this level gets its value 72 from H. At this point RBFS shifts to node F and
marks D as the second best node.
As one can imagine, going down the path from F could soon look less appealing and
it might roll back and try D. Experimentally this kind of frequent rolling back, known as
thrashing, has been observed in RBFS, and is a major negative property. One could alleviate
this problem by setting a level of tolerance, like we suggested in IDA*. That is, roll back from
the current path if it is worse than second best by a given margin. For example, if the margin
was 10 in the previous example, then RBFS would continue searching beyond O as long as the
value was better than 80, which is f(F) + 10.
Next we look at a problem in which CLOSED grows much faster than OPEN and is a
candidate for pruning.
Char    A     G     C     T
A       10    -1    -3    -4
G       -1     7    -5    -3
C       -3    -5     9     0
T       -4    -3     0     8
One may also insert a gap in one of the two sequences being aligned if the resulting match
score improves. We then impose an indel penalty on the alignment. If the indel penalty is -5, for
example, then the following alignment yields a total match score of 1.
A G A C T A G T T A C
C G A _ _ _ G A C G T
The task of sequence alignment is to find the alignment with the highest total score. The
highest possible score is when the two sequences are identical. Posed as graph search the two
sequences are laid out on two axes, and a grid created with diagonal edges added as shown in Figure 6.12.
Figure 6.12 Graph search for sequence alignment has arrived at node N and is looking at two
letters X and Y to match next. It could align them (move 1), or insert a blank before X (move 2),
or insert a blank before Y (move 3).
Let us say the next two characters to consider are X in the horizontal sequence and Y in
the vertical sequence. A diagonal move means the X is aligned with Y, along with the match
score. This is marked by the edge labelled 1. A horizontal move, edge marked 3, means that a
blank has been inserted before Y. This means X aligns with the blank, and the next character in
the sequence will come into play. A vertical move, marked 2, likewise means a blank has been
inserted before X.
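To make the scoring concrete, the short Python sketch below scores a given pair of aligned strings using the substitution matrix and the indel penalty of -5 from the example above; on the example alignment it returns 1.

# Substitution scores from the matrix above; '_' denotes an inserted gap.
SCORES = {
    ('A', 'A'): 10, ('A', 'G'): -1, ('A', 'C'): -3, ('A', 'T'): -4,
    ('G', 'G'): 7,  ('G', 'C'): -5, ('G', 'T'): -3,
    ('C', 'C'): 9,  ('C', 'T'): 0,
    ('T', 'T'): 8,
}
INDEL = -5

def pair_score(x, y):
    if x == '_' or y == '_':
        return INDEL
    return SCORES.get((x, y), SCORES.get((y, x)))

def alignment_score(s1, s2):
    """Total match score of two equal-length aligned strings."""
    return sum(pair_score(x, y) for x, y in zip(s1, s2))

print(alignment_score("AGACTAGTTAC", "CGA___GACGT"))   # prints 1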
The search space for an example problem is shown in Figure 6.13. The sequence along the
horizontal axis is GCATGCA and the one along the vertical direction is GATTACA. Observe
that they are of unequal length, which means that at least one blank will be inserted in the
former. As can be seen from the figure, the state space grows quadratically with depth. But the
number of distinct paths grows combinatorially. Consider two strings of length N and M being aligned. The grid size then is (N + 1) × (M + 1).
The number of ways that gaps can be inserted (moving only horizontally or vertically) is
    (N + M)! / (N! × M!)
Taking diagonal moves also into account the number of paths is
    Σ_R (N + M − R)! / (R! × (N − R)! × (M − R)!)
where R varies from 0 to min(M, N) and stands for the number of diagonal moves in the path.
Figure 6.13 The search space in sequence alignment. The shaded nodes are on CLOSED.
The unshaded nodes are on OPEN. The dashed nodes are yet to be generated. Observe that
the two sequences are of unequal length.
As can be observed in the figure, the size of OPEN grows linearly, while CLOSED
grows as a quadratic. In biology the sequences to be aligned may have hundreds of thousands
of characters. Quadratic is then a formidable growth rate. This gives us a motive to prune
CLOSED. But CLOSED serves the following two vital functions:
1. It prevents the search from ‘leaking back’ and going into loops.
2. It serves as a means for reconstructing the path. In the A* version of search we do this by
maintaining parent pointers, and the parents are in CLOSED.
If one can address these two requirements, then one could prune the CLOSED and, as a
consequence, solve much larger sequence alignment problems. The next section describes two
variations that achieve that.
node is found, it has a pointer to such a node R in the relay layer. This gives us two subproblems
to be solved recursively. The first from the start node S to R, and the second from R to G.
Proponents of the divide and conquer strategy would observe that it works best when the two
subproblems are of roughly equal size. To this end, DCFS recommends that a node R be made a relay
node when g(R) = h(R). When the g-value is approximately equal to the h-value, search should
be roughly at the halfway mark.
The fact that we recursively solve subproblems has an adverse impact on time complexity.
If T(d) is the time taken by A* to find the goal at depth d, then the time complexity of
DCFS is governed by a recurrence in which, in addition to T(d) for the top-level search, the two halves are themselves solved recursively.
Figure 6.14 DCFS is about to pick and expand node A. When A was generated as a child of
G and F, they were added as taboo neighbours. Only nodes B, C, D and E are generated.
Of these B and E are already on OPEN. C and D are new nodes.
In the figure, nodes on CLOSED have been deleted. The shaded nodes shown in dotted
circles only depict the nodes which are taboo for the corresponding nodes on OPEN. Algorithm
DCFS is about to pick node A from OPEN. Node A has six neighbours, B, C, D, E, F,
and G, which along with A have been magnified in the figure. Of the neighbours, F and G are
on CLOSED, G being the parent of A. Both had generated A earlier, and both were placed on
the taboo list of A, as depicted by crosses on the edges. They will not be generated now. Nodes
B and E are generated and already on OPEN, and A has not found a better path to them, so their
parent pointers will remain as they are. Nodes C and D are new nodes that will be added to
OPEN, with A as their parent. Now node A gets added to the taboo list of nodes B, C, D, and E,
so that when they are picked and expanded, they will not generate A as neighbour.
In this manner DCFS pushes into the search space with only the OPEN layer of nodes,
modified as described above to avoid leaking back. Each node on OPEN has a pointer to the
start node. Around the halfway mark, when g(R) = h(R) for node R, it is stored as a relay node.
Beyond the relay layer, all children will carry a pointer to their corresponding relay node, and
nodes that would have gone into CLOSED are pruned. This goes on till the goal node is picked,
and two recursive subproblems created.
Figure 6.15 SMGS keeps the nodes on CLOSED while it can. Nodes on CLOSED with at
least one neighbour on OPEN are BOUNDARY nodes. The rest form the KERNEL and are
expendable. When SMGS senses memory shortage it deletes the KERNEL and converts
the BOUNDARY into a RELAY layer. It continues searching, creating a new KERNEL and
BOUNDARY as OPEN pushes forward.
When the time comes to prune nodes, SMGS does the following. It deletes all nodes in
the KERNEL, and it converts the BOUNDARY layer into a RELAY layer, with pointers to the
previous RELAY layer, or the start node. On the one hand, if the problem being solved is small
enough to have been solved by A*, SMGS does not prune CLOSED at all. When it finds the
goal node, it simply traces the path back. This path is known as the dense path.
On the other hand, when the problem is really big, SMGS may prune CLOSED more than
once, creating multiple RELAY layers. It does this by being aware of the memory available to
it, and smartly deciding to prune nodes when it is running out of memory. When it finds the goal
node, it may have several relay layers left behind, creating a sparse path with back pointers to
the previous relay layer, and needs to find the path between them recursively.
Observe that the recursive calls in SMGS are not likely to be nested deep. This is because
the problem size may now be small enough to solve without further subdividing it. A corollary
of this awareness of available memory is that it may prune CLOSED less often when the
memory is abundant.
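A minimal sketch of the pruning step just described, with the layers held as plain sets (the encoding is ours, not the book's):

    def smgs_prune(closed, open_set, neighbours):
        # CLOSED nodes with at least one neighbour on OPEN form the BOUNDARY,
        # which becomes the new RELAY layer; the remaining KERNEL can be deleted.
        relay = {n for n in closed if any(m in open_set for m in neighbours(n))}
        kernel = closed - relay
        # Parent pointers of OPEN nodes already lead into this layer, so the path
        # can later be reconstructed recursively between successive relay layers.
        return relay, kernel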
Pruning CLOSED is beneficial in special situations like the sequence alignment problem.
In the more general case, it is the OPEN that grows faster, in an exponentially growing search
tree. We next look at approaches to prune OPEN.
In BSS backtracking has to be done explicitly by going back to the parent and regenerating the
next set of nodes that could not be accommodated in the beam earlier. To make this process
systematic, BSS maintains a beam stack which stores a pair of f-values [fmin, fmax) at each level.
The lower value fmin is the value of the cheapest node in the beam, and the higher value fmax is
the lowest value not in the beam. If one imagines that the search tree is sorted on increasing
f-values from left to right at each level, then one can imagine the search sweeping from left to
right keeping a constant number of nodes at each level. The role of the values in the beam stack
is depicted in Figure 6.16.
Figure 6.16 The beam stack stores pairs of values [fmin, fmax) at each level, where fmin is the
lowest f-value in the beam, and fmax is the lowest f-value not in the beam. When BSS backtracks
to level k, it knows which of the neighbours to add next to the beam at level k + 1. The old fmax
becomes the new fmin at level k + 1.
Figure 6.16 shows two layers of the beam, with shaded nodes. In the figure the nodes are
arranged from left to right with increasing f-values. The beam width is 3. The figure on top
shows two layers at levels k and k + 1 before backtracking happens. The leftmost node has the
value fmin, and the lowest value node excluded from the beam has value fmax.
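As a small sketch of this bookkeeping (function and variable names are ours): prune a level to the beam width and record its [fmin, fmax) entry for the beam stack.

    def prune_layer(layer, b, upper_bound=float('inf')):
        # layer: list of (f_value, node) pairs; b: the beam width.
        layer = sorted(layer, key=lambda p: p[0])   # increasing f-values, left to right
        beam, rest = layer[:b], layer[b:]
        fmin = beam[0][0] if beam else upper_bound
        fmax = rest[0][0] if rest else upper_bound  # the lowest f-value not in the beam
        return beam, (fmin, fmax)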
When BSS needs to backtrack from level k + 1, it goes up and generates the nodes with
the next higher values. These are shown as shaded nodes in the bottom part of the figure. The
old fmax has become the new fmin, and a new fmax value is identified at level k + 1. One must
remember that the upper bound U is the largest f-value one is willing to explore, beyond which
the beam will not be extended.
Algorithm BSS systematically explores the search space within the bound U and will
guarantee an optimal path. In fact, as and when it finds a new path to the goal with an f-value
lower than U, it updates the upper bound, and the space to be searched shrinks further.
Let the beam width be b. The size of OPEN is b. Overall, searching at depth d algorithm
BSS stores O(bd) nodes. It also keeps d pairs of [fmin, fmax) values in the beam stack. Thus, its
overall space requirement is linear.
We next briefly discuss the divide and conquer versions of the algorithms studied in this
section.
Figure 6.17 DCBSS maintains a constant number of nodes in three layers. The BOUNDARY
nodes prevent search leaking back, and the RELAY nodes are used to reconstruct the path
when the goal node is found. It also maintains a beam stack marking fmin and fmax at each
layer. The stack keeps track of the edge nodes in the beam and is instrumental in facilitating
backtracking.
Algorithm DCBSS is as close to a constant space version of A* as one can be. Strictly
speaking, it is not constant space because, though it stores a constant number of nodes, it does
have to maintain the beam stack which grows linearly with depth. The price that it must pay
for this space economy is twofold. One, like BSS, it has to backtrack and explore the unseen
part of the space that may yet yield a better solution. Second, in DCBSS each backtracking
move requires it to regenerate the entire beam from the start node. This is extra work. But
given that memory is becoming abundant, one hopes that with a large beam width, the need for
backtracking would be reduced.
Summary
In this chapter we have focussed exclusively on finding optimal solutions. We postulate the
problem as a graph search problem when edge weights are arbitrary. The goal is to find the
shortest or least cost path. This goal is different from that of Chapter 4 in which we employed
heuristic functions to find the goal node quickly. Finding the optimal path is important in
situations when costs are significant, and also when the solution found has to be executed
multiple times.
Starting with the uninformed B&B algorithm we added a heuristic function to guide search
and proved that the resulting algorithm, A*, is admissible given certain conditions. An important
one is that the heuristic function underestimates the distance to the goal node.
We then explored the effect of differently valued heuristic functions on the search complexity,
even devising a faster algorithm, wA*, though at the risk of losing optimality. Finally, we looked at
variations of A* that traded off space for time, culminating our study with an algorithm,
DCBSS, that is almost a constant space algorithm.
Our approach so far has been to define a search space and find a path to the goal state. We
did look at knowledge in the form of heuristic functions to guide search. The heuristic function
is defined for a node in the search space and is meant to make informed choices in choosing
neighbours.
The importance of knowledge to battle our adversary CombEx became evident as more
systems were implemented. In particular, it was recognized that human experts are the source
of such domain specific knowledge. This led to the approach in which knowledge is solicited
directly from the expert. The form in which such knowledge is articulated is rules, and systems
sprang up aimed at directly harnessing this knowledge. We begin our next chapter with a study
of such rule based systems.
Exercises
1. Recall the method used for computing the lower bound estimate of nodes in the B&B search
for an optimal solution for the TSP. What is the lower bound estimate for the following TSP
problem? Assume that the edges that have not been drawn have infinite cost.
2. [Baskaran] A TSP problem on seven cities is shown on the accompanying table and city
locations. Simulate the B&B algorithm on the problem.
     A    B    C    D    E    F    G
A    -    50   36   28   30   72   50
B    50   -    82   36   58   41   71
C    36   82   -    50   32   92   42
D    28   36   50   -    22   45   36
E    30   58   32   22   -    61   20
F    72   41   92   45   61   -    61
G    50   71   42   36   20   61   -
3. Algorithm A* is about to expand the node N in the graph below. Thick rectangles are on
CLOSED and dotted rectangles are on OPEN. One node is about to be added to OPEN.
Labels on edges are cost of moves. These are only shown for some edges where they
cannot be computed from the parent. Show the graph after A* has finished expanding the
node N. Clearly show the parent pointers and mark where the g-values have been updated.
4. State the conditions needed for the A* algorithm to be admissible, and give a proof of its
admissibility under those conditions.
5. [Baskaran] Consider the following city map drawn below. The cities are laid out on a
square grid of side 10 kilometres. S is the start node and G is the goal node. Labels on edges
are costs. Draw the subgraph generated by the algorithm A* till it terminates. Let the nodes
on OPEN be single circles. Mark each inspected node with
a. a double circle,
b. a sequence number,
c. a parent pointer on an edge,
d. its cost as used by the algorithm.
List the path found along with its cost.
6. The following graph represents a city map. The nodes are placed on a grid where each side
is 10 units. Node H is the start node and node T is the goal node. Use Manhattan distance
as the heuristic function where needed. The label on each edge represents the cost of the
edge. Label the nodes with the heuristic values.
List the nodes in the order inspected by
a. the B&B algorithm,
b. the A* algorithm, and
c. the wA* algorithm with w = 2, till each terminates.
Does each find a path to the goal node? If yes, list the path found along with its cost.
Draw the sub-graphs when the algorithms A* and wA* (with w = 2) terminate. Clearly
show the nodes that are on OPEN and those that are on CLOSED. Show the f-values and
the parent pointers of all nodes.
7. Repeat the exercise in the previous question for the following graph. Let S be the start node
and G be the goal node.
8. When is a heuristic function said to be more informed than another one? How is a more
informed heuristic function in A* better than a less informed one? Support your answer
with a proof.
9. Describe and compare the IDA* and RBFS algorithms.
10. Imagine a wA* algorithm implemented such that the weight w is an input parameter. How would
the performance of the algorithm change when, starting at w = 0, the weight is increased in
steps of 0.5? When would the algorithm terminate faster? When would it definitely return
the optimal solution?
11. State the monotone condition and explain why it is needed to implement DCFS. Support your
answer with a formal proof.
12. Algorithms like BFHS and BSS benefit from having a good upper bound estimate of the cost
of the solution. Devise a quick algorithm that will give us a reasonable (meaning not too
high) value for this bound.
13. Modify the algorithm DB-DFS-2 (Algorithm 3.8 in Section 3.4) to work with the IDA*
algorithm. Is it necessary to add nodes that are already on CLOSED to OPEN again as this
algorithm does, or can we prune them?
14. Programming exercise: Randomly create a TSP with N cities to be displayed on the
monitor. Let the graph be fully connected with edge weights being the Euclidean distance.
Do not display all the edges. Implement the B&B algorithm with the heuristic cheapest
available edge first. Maintain a list of allowed edges and taboo edges for estimation as well
as refinement. Display the best tour (a node in the search space) along with the estimated
cost on a click in each cycle.
15. Adapt the ReconstructPath algorithm (Algorithm 3.2) from Chapter 3 to work with the
node representation and a separate parent pointer. Modify this algorithm to return the cost
of the path found as well.
chapter 7
Problem Decomposition
So far our approach to solving problems has been characterized by state space search.
We are in a given state, and we have a desired or goal state. We have a set of moves
available to us which allow us to navigate from one state to another. We search through
the possible moves, and we employ a heuristic function to explore the space in an
informed manner. In this chapter we study two different approaches to problem solving.
One, with emphasis on knowledge that we can acquire from domain experts. We look
at mechanisms to harness and exploit such knowledge. In the last century in the 1980s,
an approach to express knowledge in the form of if-then rules gained momentum, and
many systems were developed under the umbrella of expert systems. Although only a
few lived up to expert level expectations, the technology matured into an approach to
allow human users to impart their knowledge into systems. The key to this approach
was the Rete algorithm that allowed an inference engine to efficiently match rules
with data.
The other looks at problem solving from a teleological perspective. That is, we look at
a goal based approach which investigates what needs to be done to achieve a goal. In
that sense, it is reasoning backwards from the goal. We look at how problems can be
formulated as goal trees, and an algorithm AO* to solve them.
The search algorithms we have studied so far take a holistic view of a state representing the
given situation. In practice, states are represented in some language in which the different
constituents are described. The state description is essentially a set of statements. As the
importance of knowledge for problem solving became evident, using rules to spot patterns in
the description and proposing actions emerged as a problem solving strategy.
and used directly. Rule based systems are programs that facilitate their use directly. We begin
with a few example domains.
Example 1: Consider the blocks world planning problem discussed in Figure 4.13. The
planning community has devised languages of varying expressivity to describe planning
domains. These languages are a series called the Planning Domain Definition Language (PDDL).
The simplest of these languages, called PDDL 1.0, is used to describe the blocks world domain
using the following set of predicates: On(x, y), OnTable(x), Clear(x), Holding(x), and ArmEmpty.
The domain has a table of unbounded extent, and a set of named cuboid blocks, all of equal
size. Exactly one block can be placed on another block, and the height of a tower of blocks is
unbounded. Given the above predicates, the start state in Figure 4.13 can be described by a set
of sentences or predicates: {OnTable(D), OnTable(F), On(C,D), On(B,C), On(A,B), On(E,F),
Clear(A), Clear(E), ArmEmpty}. A subset of the domain description is a pattern that can trigger
an action. Such pattern-action combinations define the moves, or operators as they are called
by the planning community. For example, an operator called Unstack(X,Y) is applicable if the
robot arm is empty, X is on Y, and X is clear. An action is a ground instance of an operator. In our
example problem, two actions are possible in the start state - Unstack(A,B) and Unstack(E,F).
In general, a collection of such applicable actions together defines the MoveGen function
from the search perspective. MoveGen is then not a uniformly described function, but depends
on the patterns that exist in the domain description. Each state has its own set of applicable
moves. The setup is depicted in Figure 7.1.
Figure 7.1 Beneath the hood of the MoveGen function, there are typically a set of pattern
action associations known as rules. MoveGen is simply a collection of all applicable actions.
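As a small sketch of this idea (the set-of-predicates encoding and the function names are ours, not the book's), with only the Unstack operator implemented for the start state above:

    from itertools import permutations

    start = {('OnTable', 'D'), ('OnTable', 'F'), ('On', 'C', 'D'), ('On', 'B', 'C'),
             ('On', 'A', 'B'), ('On', 'E', 'F'), ('Clear', 'A'), ('Clear', 'E'),
             ('ArmEmpty',)}

    def unstack_applicable(state, x, y):
        # Unstack(X, Y) needs ArmEmpty, On(X, Y), and Clear(X) to hold in the state.
        return {('ArmEmpty',), ('On', x, y), ('Clear', x)} <= state

    def move_gen(state, blocks=('A', 'B', 'C', 'D', 'E', 'F')):
        # MoveGen is just the collection of all applicable ground actions.
        return [('Unstack', x, y) for x, y in permutations(blocks, 2)
                if unstack_applicable(state, x, y)]

    # move_gen(start) == [('Unstack', 'A', 'B'), ('Unstack', 'E', 'F')]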
We shall look at planning in Chapter 10 where we will describe the planning operators and
the algorithms they are used in. In this chapter we will discuss how to compute the MoveGen
function efficiently as moves are made and the state changes as a consequence.
Example 2: Consider the game of tic-tac-toe or noughts and crosses, commonly played
by children. The game is played on a grid of nine squares typically drawn by two vertical
and two horizontal lines. The two players play alternately, one marking a X (or cross) and the
other marking a O (or nought). The first one to place three in a row, column, or diagonal wins.
Figure 7.2 shows the moves that Cross can make on her turn.
Figure 7.2 Given the tic-tac-toe board position in the centre, there are five moves where a X
can be played. The moves are identified by their location TM (top middle), TR (top right), MR
(middle right), BM (bottom middle), and BR (bottom right).
We will look at a standard game playing algorithm in Chapter 8. Here we focus on the
representation and move generation. Of the several representations that are possible we choose a
simple one amenable to move generation using rules. We identify each square by its coordinates
with T (top), M (middle), and B (bottom) being the three values along the vertical axis, and L
(left), M (middle), and R (right) on the horizontal axis. In Figure 7.2, the move TM says that
Cross can mark a X on the top-middle square which is empty. The MoveGen function, in the
style of planning operators, starts with a move name, followed by its preconditions, followed
by its actions.
Move Play-Cross(L)
Preconditions: It is the turn of Cross to play and
Location L is empty
Action: Mark X at L
Here we have treated L as a variable whose value is every location that Cross can play
on. In our example above, the moves are Play-Cross(TM), Play-Cross(TR), Play-Cross(MR),
Play-Cross(BM), and Play-Cross(BR). We could then hand the moves over to the game playing
program which would search the game tree and choose the best move.
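A sketch of this move generator in ordinary code (the board encoding is ours); the example board is one consistent with the position of Figure 7.2, where the five listed squares are empty:

    ROWS, COLS = ('T', 'M', 'B'), ('L', 'M', 'R')

    def play_cross_moves(board, turn):
        # Play-Cross(L) is applicable for every empty location L on Cross's turn.
        if turn != 'X':
            return []
        return ['Play-Cross(' + r + c + ')'
                for r in ROWS for c in COLS if board.get(r + c) is None]

    board = {'TL': 'X', 'ML': 'O', 'MM': 'O', 'BL': 'X'}   # assumed occupied squares
    # play_cross_moves(board, 'X') == ['Play-Cross(TM)', 'Play-Cross(TR)',
    #                                  'Play-Cross(MR)', 'Play-Cross(BM)', 'Play-Cross(BR)']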
The interesting thing about describing the moves as rules is that one can have additional
rules that capture our expert knowledge as well. For example, one could have a rule that defines
a forced move, as in the board position of Figure 7.2.
Move Forced(MR)
Preconditions: It is the turn of Cross to play and
Location MR is empty
Location ML has O
Location MM has O
Action: Mark X at MR
Now this is a more specific rule, and may coexist with the general rule above. Rule based
systems allow us to choose strategies that would prefer some rules over others, and we could
then choose such forced moves without resorting to lookahead using search. We will develop
this idea next.
Example 3: Consider a bank manager who has to make a decision on a loan application by
a candidate. Banks typically have loan disbursement policies for different categories of people.
These policies can be articulated as rules and may change with time. Typical rules could look
like the following.
Or,
There could be a multitude of such rules, many of them with exceptions, that could define
the policy the bank uses for deciding upon loans. The management would benefit if all they had
to do was to add and delete such rules, and the application software could classify a candidate
as eligible or not eligible.
We describe below the mechanism which makes such business rule management systems
possible, and they have applications in many areas.
Example 4: The Indian Railways has a plethora of rules for concessional ticket prices,
wherein various categories of people get different levels of concession.
Example 5: In these Covid times, governments of various countries made and revised rules
depending upon the status of the pandemic. For example, the Indian government first allowed
frontline health workers to get vaccinated, followed by senior citizens, other adults, and finally
teenagers and children. At any time, various exceptions like comorbidities and other health
conditions could influence the decisions. Rules allowing travel between different regions, and other
such rules, could have been efficiently handled by a rule management system.
Such rule based systems are very well suited to do classification tasks, in which the
knowledge is articulated by human users. The inputs to the rules are not uniform, and different
rules may look at different sets of preconditions. Many such application areas are not amenable
to machine learning approaches to classification, not least because there could be heavy costs
for misclassification, there may be insufficient training data, and the human users may demand
explanations which are possible only with articulated symbolic rules.
1. A working memory (WM). The WM is a set of statements that define a state of the problem
to be solved. These statements are known as working memory elements (WMEs). As the
world changes, some of these WMEs may get deleted and new ones may get added. Thus
the WM is transient knowledge of the world at the moment and is the short term memory
(STM) of the problem solver. In rule based systems, the WM is initialized in the beginning
and is modified by the application of rules.
2. Rules. The problem solving knowledge of the agent is captured in the form of rules, which
are pattern-action combinations. Rules reside in the long term memory (LTM) of the
problem solver and are not specific to a particular problem being solved. The rules have
a left hand side (LHS), which is a set of patterns. Patterns can have slots for variables
which make them general. Each pattern in the LHS must match a WME for the rule to
be triggered. If a rule is executed or fired, then it may add or delete WMEs among other
possible actions.
3. The inference engine (IE). The IE is the workhorse of the rule based system. It matches
the rules with the WMEs, decides upon which rule-data combination is to be selected, and
then executes the rule. It does this repeatedly till some termination criterion.
In the following sections we look at rule based systems in a little more detail. To make the
discussion more concrete, we ground the description of rule based systems in the language
called OPS5 mentioned earlier. OPS5 is said to expand to Official Production System language
version 5 and was specifically devised for implementing forward chaining with rules (Brownston
et al., 1985).
(class-name
    ^attribute1 value1
    ^attribute2 value2
    ...
    ^attributeN valueN)
The order in which the attributes are written is not important. The following two examples
represent the same WME describing a student:
(student ^name Shreya ^rollNo 1111 ^major CS ^year 3 ^age 20)
(student ^age 20 ^year 3 ^major CS ^rollNo 1111 ^name Shreya)
As we will see, the match algorithm does not require the attributes to be in a specific order.
Observe that white spaces are ignored. Also note that there is no notion of explicit types for
values, though implicitly the language recognizes numbers.
The WM is a collection of WMEs. The WMEs are indexed by a time stamp, indicating the
order in which they were added to the WM. The following is an example of what the data could
look like:
1. (next ^rank 1)
2. (student ^name Shreya ^rollNo 1111 ^major CS ^year 3 ^age 20)
3. (student ^name Aditi ^rollNo 1112 ^major CS ^year 3 ^age 20)
4. (student ^name Garv ^rollNo 1113 ^major CS ^year 3 ^age 20)
5. (student ^name Atish ^rollNo 1114 ^major CS ^year 3 ^age 20)
6. (marks ^subject AI ^rollNo 1111 ^midSem 41 ^endSem 35 ^total nil ^rank nil)
7. (marks ^subject AI ^rollNo 1112 ^midSem 40 ^endSem 40 ^total nil ^rank nil)
8. (marks ^subject AI ^rollNo 1113 ^midSem 43 ^endSem 36 ^total nil ^rank nil)
9. (marks ^subject AI ^rollNo 1114 ^midSem 42 ^endSem 35 ^total nil ^rank nil)
Note that there is no structure in the data except for the time stamp in the WM. If one wants to
implement an array, then one must represent it as a set of records or WMEs and explicitly add
an attribute for the index. WMEs with these index values will continue to be organized by the
time stamp. If we want to sort the records, then one may need to swap index values, rather than
keeping the index sorted and swapping the records as would traditionally be done.
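For concreteness, one might hold such a WM in a program as a mapping from time stamps to (class name, attributes) pairs; this encoding is ours, not part of OPS5:

    wm = {
        1: ('next',    {'rank': 1}),
        2: ('student', {'name': 'Shreya', 'rollNo': 1111, 'major': 'CS', 'year': 3, 'age': 20}),
        6: ('marks',   {'subject': 'AI', 'rollNo': 1111, 'midSem': 41, 'endSem': 35,
                        'total': None, 'rank': None}),
    }
    # An "array" would be simulated with an explicit index attribute, for example
    # ('entry', {'index': 0, 'value': 42}); sorting then means swapping index values,
    # since the WM itself stays ordered only by time stamp.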
(p rule-name
    LHS
    -->
    RHS)

(p rule-name
    pattern1
    pattern2
    ...
    - patternK
    -->
    action1
    action2
    ...
    actionM)
The patterns can be identified by their sequence number. Alternatively, each pattern can be
prefixed by a label which serves as an identifier. Each pattern must match some WME, unless
it is preceded by a negation symbol, in which case there must be no WME that matches it. In
OPS5 negated patterns are written after the positive ones.
Each pattern conforms to some WME. It begins with a class name and is followed by a
sequence of attribute names followed by a test for the attribute value. Each of the tests specified
in a pattern must be satisfied. The attributes being tested need only be a subset of the attributes
the corresponding WME has. The pattern has a set of attributes to be tested and ignores other
attributes in the WME. The tests on values of attributes are as follows.
If the attribute value in the rule is a constant, then it must match an identical constant in
the WME. For example, '^name Shreya' in the pattern matches '^name Shreya' in the WME.
If the attribute in the rule is a variable, then it can match any constant in the WME. For
example, '^name <n>' in the pattern matches '^name Shreya' as well as '^name Aditi'.
Variables are enclosed in angular brackets. In addition, if the same variable name occurs more
than once in the LHS of a rule, then all occurrences must match the same constant.
If an attribute in the rule contains a Boolean test, then the value in the WME must satisfy
that test. The following are examples:
^name <y> = Shreya: the value is a variable <y> and must be equal to Shreya
^name <y> = <x>: the value is a variable <y> which must be equal to <x> also in the LHS
^name <y> <> Shreya: the value is a variable <y> and must not be equal to Shreya
^name <y> <=> Shreya: the value is a variable <y> and must be the same type as Shreya
If the value in the WME is numerical, then these additional Boolean tests apply: > (greater
than), >= (greater than or equal to), < (less than), and <= (less than or equal to). Some
examples are
^age >= 19: the value in the WME can be any number greater than or equal to 19
^age <y> > 19: the value is a variable <y> and in the WME must be a number greater than 19
^age <y> > <x>: a variable <y> and in the WME must be greater than <x> from the LHS
In addition, one can combine two or more tests using logical connectives. Curly brackets
stand for the AND connective and double angular brackets stand for the OR connective. All
the test inside {} must be satisfied, and at least one test inside << >> must be satisfied. For
example,
^age {> 12 < 20} says that the matching value must be greater than 12 and less than 20
^day << Wednesday Friday >> says that the value must be Wednesday or Friday
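Putting these tests together, here is a minimal matcher for one pattern against one WME, written as a sketch in ordinary code (this is not the OPS5 implementation, and the encodings are ours):

    import operator
    OPS = {'=': operator.eq, '<>': operator.ne, '>': operator.gt,
           '>=': operator.ge, '<': operator.lt, '<=': operator.le}

    def match_pattern(pattern, wme, bindings=None):
        # Returns the extended variable bindings if every test succeeds, else None.
        bindings = dict(bindings or {})
        cls, tests = pattern
        if wme[0] != cls:
            return None
        attrs = wme[1]
        for attr, test in tests:
            if attr not in attrs:
                return None
            value = attrs[attr]
            if isinstance(test, str) and test.startswith('<') and test.endswith('>'):
                if test in bindings and bindings[test] != value:
                    return None                   # repeated variables must agree
                bindings[test] = value
            elif isinstance(test, tuple):         # a Boolean test such as ('>=', 19)
                op, operand = test
                operand = bindings.get(operand, operand)
                if not OPS[op](value, operand):
                    return None
            elif value != test:                   # a constant must match exactly
                return None
        return bindings

    wme = ('student', {'name': 'Shreya', 'rollNo': 1111, 'major': 'CS', 'year': 3, 'age': 20})
    pattern = ('student', [('major', 'CS'), ('age', ('>=', 19)), ('name', '<n>')])
    # match_pattern(pattern, wme) == {'<n>': 'Shreya'}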
(p sumTotal
    (marks ^subject <S> ^midSem {<m> <> nil} ^endSem {<e> <> nil} ^total nil)
    -->
    (Modify 1 ^total <m> + <e>))
(p ranking
    (next ^rank <r> ^subject <S>)
Rule sumTotal
IF
there is a WME marks for a given ^subject with ^midSem and ^endSem attribute
values non-nil and whose ^total attribute has value nil
THEN
modify the ^total attribute to the sum of the ^midSem and ^endSem values
Rule ranking
IF
the next rank to be assigned in subject <S> is <r>
and there is an entry for some student in <S> with total <m> and rank nil
and there is no entry for any student in <S> with total > <m> and rank nil
THEN
modify the next rank to be assigned as <r> + 1
and modify the rank of the student to <r>
Between the two rules, they add up the midsemester and end-semester marks for every
student and assign ranks to the students. The third pattern in rule ranking says that there must
be no matching WME which has a total greater than <m> and rank nil.
Observe that in the second rule, next ^rank is updated in the first action and the rank is
assigned to the student in the second action. These actions are not sequential. If the next rank to
be assigned was 4 to begin with, then that would get modified to 5 for the next round, while the
student in this rule firing would still get rank 4. One could also write the same rule as follows:
(p ranking
    {<rank> (next ^rank <r>)}
    {<total> (marks ^student <s> ^total <m> ^rank nil)}
    - (marks ^total > <m> ^rank nil)
    -->
    (Modify <total> ^rank <r>)
    (Modify <rank> ^rank (<r> + 1)))
Observe that the order of the actions is different. The reader would no doubt have started
viewing OPS5 as a programming language, which indeed it is. It is in fact a Turing complete
language, which means that whatever you can do in any other language you can also do
with OPS5. The other actions we have not mentioned are the ones needed to make it a complete
programming language. Actions like Read, Print, Load a file, and so on.
But having given up the flow of control to an inference engine (see the next section), one
still needs to be careful in how one writes the rules. The above two rules are a program meant
to first sum up the marks of the students and then assign ranks to them based on the total. But
this sequencing of tasks may not happen in practice. Clearly rank assignment must be done
only after completing the adding up of the total for all students. Initially, for our four students,
four instances of the sumTotal rule would match. One of them would get executed. As soon
as that happens, an instance of the ranking rule for that student would now also match. What
is to prevent that student from being assigned rank 1? In that sense, the 2-rule program has a
bug. One way to address that problem is to introduce a context. For example, to start with, we
can add the WME (current ^context totalling) and add that as a pattern in the sumTotal rule.
Likewise, one could add a pattern (current ^context ranking) to the ranking rule. No ranking
rule will match till that appears in the WM. Then we can have a rule to switch contexts as
follows. If the context is totalling and there is no WME with ^total nil, then modify the context
to ranking. In general, rule based programs may have many rules that match a context, and
thus they can all be bunched up to execute together. Within a context we still need a strategy to
select one from the set of matching rules. We will look at some strategies after describing the
inference engine, the third component in rule based systems.
1. Match. The task of the match phase is to determine the set of rules that are matching along
with the WMEs they are matching. Each pattern in each rule must be checked against each
element in the WM. Each rule may have multiple matching instances, and every WME
may match multiple rules. Thus the number of tests that have to be done is the product of
the number of patterns in all the rules, the number of attributes being tested in each rule,
and the number of WMEs. The output of the match phase is a set of tuples of the form
<rule-name t1 t2 ... tK> which specifies the rule name and the K time stamps of the K
WMEs matching the K patterns.
2. Resolve. The output of the match phase is a set called the conflict set, so named because
one can imagine a set of matching rules vying to be selected. The resolve phase selects one
such rule along with its matching data. This is called conflict resolution, since it resolves
the conflict between the matching rules. This selection is done by means of a conflict
resolution strategy. We describe the common conflict resolution strategies in a separate
section below.
3. Execute. The actions in the RHS of the selected rule are then executed. The execution is
parallel in principle, even though in practice the actions are executed sequentially. Our
interest is in the actions Make, Remove, and Modify. These make changes in the WM,
adding and deleting elements, and ostensibly requiring one to do the Match phase all over
again. As Charles Forgy showed, that is not necessary and we discuss his approach later.
The above cycle concludes either when there is no matching rule or one of the rules executes
an action called Halt. Figure 7.3 shows the match-resolve-execute cycle of a forward
chaining rule based IE. It is forward chaining, or data driven, because rules match data and add
(or delete) data to the WM, and again match rules going forward in this fashion. In contrast,
one can have a goal driven backward chaining approach that selects rules that match a goal to
be achieved. We look at this in Chapter 9 on logical reasoning.
Figure 7.3 The match-resolve-execute cycle of a forward chaining rule based IE.
Refractoriness
A rule-data combination may fire only once. If a rule along with a set of matching WMEs has
been selected to execute once, then it cannot be executed again. This is particularly relevant
when the selected rule does not modify the WMEs matching its precondition. Given that
the rule-data combination was selected to fire, it is possible that it would be selected again.
Selecting it would not contribute anything new, and could even result in the system going into
a loop. Observe that the same rule can still fire with different data.
The idea of refractoriness comes from the way neurons fire in animal brains. When a neuron
has received excitation that crosses its threshold, it fires once, and then waits for new signals
to come in. As defined in the Merriam-Webster dictionary, refractoriness is ‘the insensitivity to
further immediate stimulation that develops in irritable and especially nervous tissue as a result
of intense or prolonged stimulation’.
Lexical Order
By lexical order we mean the order in which the user writes the rules. And if a rule has multiple
instances with different data, then choose the instance that matches the earlier data. This strategy
places the onus of this choice on the user who becomes more of a programmer, deciding the
flow of control. This strategy is used in the programming language Prolog which, for efficiency
reasons, deviates from the idea of pure logic programming in which the user would only state
the relation between input and output (Sterling and Shapiro, 1994).
Specificity
This says that of all the rules that have matching instances, choose the instance of the rule that
is most specific. Specificity can be measured in terms of the number of tests that patterns in
rules need.
The intuition is that the more specific the conditions of a rule, the more appropriate the
rule is likely to be in the given situation. Remember that the working memory models the STM
of the problem solver and rules constitute the problem solver’s knowledge and reside in the
quasistatic LTM.
Specificity can facilitate default reasoning. When one has only a little information then
one can make some default inferences (Reiter, 1980; Lifschitz, 1994). But if you have more
information then you might make a different inference. The most popular example is as follows.
If one knows that Tweety is a bird, then one can make an inference that Tweety can fly. But if in
addition one knows that Tweety is also a penguin, then we infer that it cannot fly.
The following is another example of a default reasoning. If one were to be writing a
program to play contract bridge (Khemani, 1989, 1994), then the bidding module might have
many rules that cater to different hand holdings. Two examples are as follows:
Such default reasoning is common in many situations and can easily be implemented in rule
based systems.
Recency
Looking at a pattern and adding a new element to the working memory is akin to making an
inference. In logical reasoning one builds an argument step by step, from one lemma to the
next. The difference between the mechanism of a rule based system and deriving proofs in logic
is that the rules have to be sound for the proof to be valid. Rule based systems do more than
logical reasoning and can address tasks like planning and classification as well. Nevertheless,
the last lemma or WME added could provide the cue for the next step. This is behind the
conflict resolution strategy called recency.
Recency aims to maintain a chain of thought in reasoning, with rules matching newer
WMEs gaining preference over others. Of all the rules that have matching instances, choose
the instance that has the most recent WME. Recency can be implemented by looking at the
time stamps of the matching WMEs. The intuition is that when a problem solver adds a new
element to the WM, then any rule that matches that WME should get priority. Recency can be
implemented by maintaining the conflict set as a priority queue.
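One way to realize this (ours, for illustration) is to compare instances in the conflict set by their time stamps sorted in decreasing order, which also gives one possible tie-breaking policy over the remaining time stamps:

    def select_by_recency(conflict_set):
        # Each instance is (rule_name, tuple_of_time_stamps); prefer the instance
        # whose most recent matching WME is the newest.
        return max(conflict_set, key=lambda inst: sorted(inst[1], reverse=True))

    conflict_set = [('sumTotal', (6,)), ('sumTotal', (9,)), ('ranking', (1, 6))]
    # select_by_recency(conflict_set) == ('sumTotal', (9,))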
Means-Ends Analysis
In their pioneering work on human problem solving, Newell and Simon proposed a strategy
of breaking down problems into subproblems, addressing the most significant subproblem
first, and then attempting the rest (Newell and Simon, 1963; Newell, 1973). Each subproblem
reduces some difference between the given state and the desired state, and is solved by analysing
the means to achieve the ends. They named this strategy as means-ends analysis (MEA). We
describe MEA a little later in this chapter.
OPS5 has the option of choosing MEA as a conflict resolution strategy. The idea is to
partition the set of rules based on the context and focus on one partition at a time. One can think
of each partition as solving a specific subgoal or reducing a specific difference. The context is
set by the first pattern in a rule. All rules in the same partition have the same first pattern. The
MEA strategy applies recency to the first pattern in each rule, and specificity for the remaining
patterns.
Next, we look at how the IE can be implemented efficiently.
Each change to the WM is passed to the Rete network as a token. The key point is that WMEs are matched only once, when they are created, and not
in each match-resolve-execute cycle.
There are two kinds of tokens that are generated in each cycle. One <+WME> when a
WME is added, and the other <-WME> when one is deleted. The positive token will go and
reside in the network. The negative token will go and neutralize the existing positive token.
The origin of the name rete is based on the Middle English word riet, itself derived from
the Latin word rete, meaning 'an anatomical mesh or network, as of veins, arteries, or nerves'.
The Rete network is composed of two parts. The top half is a discrimination tree, in which
at each node a property is tested and the token sent down different paths. A bit like a decision
tree or a generalization of a binary search tree. The nodes in this part are called alpha nodes
and are the locations of the tokens that travel down the discrimination tree. Alpha nodes have
exactly one parent. Sitting below the alpha network is the beta network which assimilates the
different tokens that match the different patterns in the rules. Beta nodes may have more than
one parent, in which case the WMEs are joined based on equality of the values of variables.
Rules themselves are attached to beta nodes which have collected all the WMEs that match
their patterns. Figure 7.4 is a schematic of the Rete net.
Figure 7.4 A schematic of a Rete network. The top half is a discriminatory network of alpha
nodes shown in rectangles. The bottom half is made of beta nodes, drawn as ovals, which
assimilate WMEs that match the patterns of a rule. The joins are typically made on values
of variables. Rules themselves are attached to beta nodes, where WMEs matching all the
patterns arrive. Observe that Rule2 is the default version of Rule1.
In Figure 7.4, Rule1 needs four matching WMEs which arrive from the four paths from the
root to the corresponding beta node. Above Rule1 resides Rule2 which needs only two of those
four WMEs, and thus can be seen as a default version of Rule1.
The Rete network is a directed acyclic graph that is a compilation of the rules that it
embodies. The tokens generated by rule firing are inserted at the root and follow a distinct path
for each class name that begins a pattern. Subsequently, they are routed based on the values
of tests that each pattern makes. The bottom half is the beta network that knows the rules. The
WMEs needed by different patterns in each rule are identified. Tokens may need to have the
same values of a variable in different patterns for a join to happen. We look at an example to
illustrate this relation between a set of rules and the corresponding Rete net.
Consider a database in which there are a set of named figures, along with a set of properties
associated with each figure. We consider polygons with three or four sides. The properties are
the number of sides the figure has, the number of equal sides the figure has, the number of
parallel sides the figure has, and the number of right angles the figure has. These are defined
by four class names. The task is to classify the figure and identify what geometrical shape that
figure is. The following are the rules for the different shapes. The first rule simply states that if
a polygon has three sides, it is a triangle. This is a default 3-sided figure. More specific rules
identify different kinds of triangles.
(p triangle
    (sides ^in <name> ^are 3)
    -->
    (make (polygon <name> ^instance-of triangle))

(p isosceles-triangle
    (sides ^in <name> ^are 3)
    (equal-side ^in <name> ^are 2)
    -->
    (make (polygon <name> ^instance-of isosceles-triangle))

(p right-triangle
    (sides ^in <name> ^are 3)
    (right-angles ^in <name> ^count 1)
    -->
    (make (polygon <name> ^instance-of right-triangle))

(p right-isosceles-triangle
    (sides ^in <name> ^are 3)
    (equal-side ^in <name> ^are 2)
    (right-angles ^in <name> ^count 1)
    -->
    (make (polygon <name> ^instance-of right-isosceles-triangle))

(p trapezium
    (sides ^in <name> ^are 4)
    (parallel-sides ^in <name> ^pairs 1)
    -->
    (make (polygon <name> ^instance-of trapezium))

(p rhombus
    (sides ^in <name> ^are 4)
    (equal-side ^in <name> ^are 4)
    -->
    (make (polygon <name> ^instance-of rhombus))
Figure 7.5 shows the Rete net for the rules given above, along with some rules for other
shapes that are left as an exercise for the reader. The root node tests for the class name in the
input WME and sends the token down the corresponding path, where other alpha nodes apply
tests for some attributes. In this scheme of things, the tokens reside in the first beta node they
encounter and can be accessed by lower nodes.
Figure 7.5 A Rete net for geometric shapes. The patterns describe the number of sides,
number of equal sides, number of parallel sides, and the number of right angles. Rectangles
are alpha nodes. Circles are beta nodes, and joins are on <name> variable. The shaded beta
nodes have the rules described in the text attached.
If you only know that a shape has three sides, you can conclude that it is a triangle. The
corresponding beta node has label A in Figure 7.5. In addition, if we know that there is a right
angle in the figure, then we can classify it as right angled (node F). Instead, if we know that it
has two equal sides, it is an isosceles triangle (node D). If it is both right angled and isosceles,
then it is right isosceles (node I). Nodes D, E, F, and I are special cases of triangles. Node G
hosts the rule for a rhombus, and node L defines a trapezium. The reader is encouraged to write
the rules for other figures in the Rete net.
The above WMEs are converted into positive tokens and inserted one by one into the Rete net.
We show their locations in the alternate, but equivalent, depiction of the net in Figure 7.6. This
diagram views each alpha node as a placeholder with a test. Initially the token is placed in the
root node named token-in. The alpha nodes below that include tests that a WME must satisfy
to be accepted. This is different from the depiction in Figure 7.5 where each alpha node was
labelled with the attribute being tested, and the edges emanating below were labelled with the
accepted value of the test. Observe that, irrespective of the diagram schema, a token may be
replicated and may go down more than one path if more than one test is satisfied. Remember
that the tests come from patterns in different rules. For example, one rule may test for a value
being greater than 5 while another may test for a value being greater than 11. If an incoming
WME has a value 16 for that attribute, it will satisfy both tests.
Figure 7.6 An alternate depiction of the Rete net from Figure 7.5 in which each alpha node
contains the test that must be satisfied to accept a WME. The numbers in the shaded boxes
are the time stamp values for the eleven WMEs inserted into the net.
Each token travels down the network as long as it satisfies the tests. Figure 7.6 depicts the
location of the eleven WMEs described above. In this version of the figure, we assume that the
tokens reside in the alpha node whose test they satisfy. In the figure they are shown as numbers
in shaded boxes representing time stamps of the WMEs for brevity. In practice they should have
a positive sign as well, for example, <+ 1>, <+ 2>, and so on. In our simple rule base there
are no Remove actions, but if there were, then negative tokens like <-15> could have been
inserted. A negative token would follow the same path as its positive version. When it collides
with the positive token, both get destroyed. Any rules matching the positive WME must be
retracted from the conflict set as well.
The beta nodes below the alpha nodes accept tokens from one or more parents. Receiving
exactly one WME is rare. It typically happens with a rule with only one pattern, for example,
the triangle rule attached to node A which adds two instances <triangle 1> and <triangle 4>
for shape1 and shape4, the names of the figures, to the conflict set. Later, when the
WME 5 arrives, the beta node D adds the instance <isosceles-triangle 1 5> to the conflict set
for shape1. Observe that token 1 is still residing in the same alpha node, even though it has
triggered two rules for shape1. In fact, when token 8, also for shape1, arrives, two new rule
instances with data are triggered, <right-triangle 1 8> and <right-isosceles-triangle 1 5 8>.
Thus, there are four rules waiting to classify shape1.
Observe that when the token for WME 10 arrives which says that shape4 has zero parallel
sides, it stays in the alpha node as shown and does not trigger any rule. Tokens 2 and 3 talk of
4-sided figures and trigger quadrilateral rules <quadrilateral 2> and <quadrilateral 3> in node
B for shape2 and shape3. When token 7 for shape3 arrives then <rhombus 3 7> is added to the
conflict set, and when token 9 arrives <square 3 7 9> is also added. Finally, when token 11
arrives for shape2 <trapezium 2 11> is added.
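The incremental character of the match can be sketched in ordinary code (this is not the Rete data structures themselves, and the facts used are an assumed subset of the eleven WMEs): each incoming fact is tested once, stored, and joined on the figure's name with what has already arrived.

    from collections import defaultdict

    memory = defaultdict(dict)        # class name -> {figure name: attribute value}
    conflict_set = []

    RULES = [                         # (rule name, {class name: required value})
        ('triangle',                 {'sides': 3}),
        ('isosceles-triangle',       {'sides': 3, 'equal-side': 2}),
        ('right-triangle',           {'sides': 3, 'right-angles': 1}),
        ('right-isosceles-triangle', {'sides': 3, 'equal-side': 2, 'right-angles': 1}),
        ('quadrilateral',            {'sides': 4}),
        ('rhombus',                  {'sides': 4, 'equal-side': 4}),
        ('trapezium',                {'sides': 4, 'parallel-sides': 1}),
    ]

    def insert(cls, name, value):
        # Store the new fact, then add only the rule instances it completes.
        memory[cls][name] = value
        for rule, needs in RULES:
            if cls in needs and all(memory[c].get(name) == v for c, v in needs.items()):
                conflict_set.append((rule, name))

    insert('sides', 'shape1', 3)           # adds ('triangle', 'shape1')
    insert('equal-side', 'shape1', 2)      # adds ('isosceles-triangle', 'shape1')
    insert('right-angles', 'shape1', 1)    # adds the right and right-isosceles instances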
The initial data is uploaded into the WM before any rule is selected. In our example, we
have nine rules that enter the conflict set. Which rule to select for execution is determined by
the conflict resolution strategy. We quickly review how the ReteAlgorithm implements the
strategies described in Section 7.2.5.
Refractoriness
This happens naturally in the ReteAlgorithm. This is because the match happens when the
WMEs are inserted into the net, and that happens only once. Let us say three WMEs with time
stamps 10, 15, and 20 match a rule someRule. When 10 is inserted, it goes and sits in its place,
waiting for its partners that could trigger any rule. Likewise for WME 15. Now when WME
20 arrives <someRule 10 15 20> is added to the conflict set. Since WMEs 10, 15, and 20 have
already been generated and inserted, this rule-data combination can fire only once. When it
does fire, it is removed from the conflict set.
Lexical Order
Forward chaining rule based systems in general, and OPS5 in particular, ignore the lexical order
of rules that would be present in a text file. This is because the Rete net obliterates that order.
Hence lexical order cannot be realized.
Specificity
This says that of all the rules that have matching instances, choose the instance of the rule that
is most specific. Specificity can be measured in terms of the number of tests that patterns in
rules need. In the Rete net this can be measured by summing up the lengths of the paths for all
matching patterns.
Recency
Recency gives preferences to the rules that match the most recent WMEs. This can be
implemented by maintaining a priority queue. When a new WME goes in and triggers a set of
rules, then all these rules will go to the front of the priority queue.
Note that this implementation only takes care of one pattern in the rule, the one that matches
the most recent WME. Breaking ties by choosing a rule with the second most recent WME is a
harder task, and is left as an exercise for the reader to ponder over. Perhaps one could add the
time stamp values, but what when one rule has more patterns than another?
Means-Ends Analysis
The MEA strategy looks at the recency of the first pattern in every matching rule. Ties can be
broken by specificity.
When rule firing begins on the Rete net with WME tokens depicted in Figure 7.6 then, in
some order, the following inferences are made:
- shape1 is a triangle
- shape2 is a quadrilateral
- shape3 is a quadrilateral
- shape4 is a triangle
- shape1 is an isosceles triangle
- shape3 is a rhombus
- shape1 is a right triangle
- shape1 is a right isosceles triangle
- shape3 is a square
- shape2 is a trapezium
The reader is encouraged to explore the order in which the above conclusions are asserted with
the different conflict resolution strategies. In this example, all the ten instances of rule-data fire,
and then the program terminates.
Different conclusions are drawn about the different figures that are represented in the WM.
As an exercise, modify the rules so that exactly one conclusion is drawn for each figure. Hint:
you can include Remove actions in the RHS. What conclusion will be drawn for each of the four
figures - shape1, shape2, shape3, shape4 - under the different conflict resolution strategies?
Finally, we look again at how the match-resolve-execute cycle is handled by the
ReteAlgorithm. In the brute force approach, the MatchAlgorithm compares the patterns
in the rules with WMEs and produces the conflict set. In the ReteAlgorithm, the rules are
compiled into a network that accepts changes in the WM and produces changes in the conflict
set, as shown in Figure 7.7.
Figure 7.7 The Rete net is a compilation of the given rules and also hosts the WMEs. The
tokens generated when a rule is fired are the new input, resulting in changes in the conflict set.
This selective processing of the incoming data makes the rule based system an order of
magnitude faster. Even as the original goal of building expert systems was on the wane,
the business community adopted the technology with great enthusiasm, since it allowed them
to focus on the rules. Adopted as a business rule management system technology, the Rete
algorithm was refined many times, resulting in more speedups.
Carole-Ann Berlioz of Sparkling Logic wrote the following in her blog in 2011.
The best usage of the Rete network I have seen in a business environment was likely
Alarm Correlation and Monitoring. This implies a ‘stateful’ kind of execution where
alarms are received over time as they occur. When you consider that the actual
Telecom network is made of thousands of pieces of equipment to keep in ‘mind’ while
processing the alarms, there is no wonder that Rete outperforms brute force. When
one alarm is raised on one Router, the Rete network does not have to reprocess all
past events to realize that we reached the threshold of 5 major alerts on the same
piece of equipment and trigger the proper treatment, eventually providing a probable
diagnostic. Hours of processing time turned into seconds in Network Management
Systems. Fabulous.
At some point Charles Forgy implemented a version called Rete-NT that was another
order of magnitude faster. He, however, decided not to publish the algorithm and kept it as a
trade secret.
The AO graph is a directed graph rooted at a node representing the problem to be solved.
Figure 7.8 shows a contrived AO graph in which the root node G is the goal to be solved. The
And edges are shown by connecting the related edges with an arc. As can be seen, there are
three ways the goal G can be transformed. One can solve G by solving A, or solving B, or
solving both C and D. Whichever choice one makes, one can transform the resulting problem
into one or more subproblems. The shaded nodes in the graph are primitive problems or solved
nodes. Each solved node may have an associated cost, and each edge transforming a problem
may have an associated cost too.
Figure 7.8 An And-Or graph decomposes a problem into subproblems. Here goal G can
be transformed into problem A, or problem B, or problem C + D. Edges are labelled with
transformation costs. Shaded nodes are primitive problems whose solutions are known along
with their cost.
The reader should verify that the combinations of solved nodes for solving G are {E, F},
{G, H}, {H, I}, {I, J, K}, and {I, J, L}. Which combination has the lowest cost?
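The cheapest combination can be computed bottom-up: a primitive node costs itself, and any other node costs the minimum, over its options, of the edge cost plus the sum of the costs of the children. A sketch follows; the small graph and the numbers in it are hypothetical, not those of Figure 7.8.

    PRIMITIVE = {'E': 2, 'F': 3, 'H': 1}             # solved nodes and their costs
    OPTIONS = {                                       # node -> list of (edge cost, children)
        'G': [(1, ['A']), (2, ['C', 'D'])],           # solve G via A, or via C and D together
        'A': [(1, ['E', 'F'])],
        'C': [(1, ['H'])],
        'D': [(3, ['E'])],
    }

    def best_cost(node):
        if node in PRIMITIVE:
            return PRIMITIVE[node]
        return min(cost + sum(best_cost(c) for c in children)
                   for cost, children in OPTIONS[node])

    # best_cost('G') == min(1 + (1 + 2 + 3), 2 + (1 + 1) + (3 + 2)) == 7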
We first look at programs from literature that employ AO graphs, and then we look at an
algorithm to solve And-Or graphs.
7.3.1 DENDRAL
DENDRAL (for DENDRitic ALgorithm) was an early expert system, developed beginning
in 1965 by Edward Feigenbaum and the geneticist Joshua Lederberg, at Stanford University
(Lindsay et al., 1980, 1993). It was designed to help chemists in identifying unknown organic
molecules, by analysing their mass spectra and using knowledge of chemistry. DENDRAL
is often considered the first expert system, though the authors of MYCIN would have
contested that.
DENDRAL was a program designed to assist a chemist in the task of deciphering the
chemical structure of a compound whose molecular formula was known and some clues about
its structure were available from mass spectroscopy. The problem is important because the
physical and chemical properties of a substance depend upon how the atoms are arranged.
Figure 7.9 shows some ways in which the atoms of C6H13NO2 could be arranged. The reader
might recall that hydrogen atoms have valency 1, carbon has 4, oxygen 2 and nitrogen 3. This
is reflected as edges in the graph representation. The hydrogen atoms are shown only in the
structure on the right for brevity.
Figure 7.9 DENDRAL was designed to assist chemists in finding the structural formula of
compounds. The above are some candidate structures for C6H13NO2. The hydrogen atoms
are shown only for the arrangement on the right. Figure adapted from [Buchanan 82].
Earlier versions of DENDRAL could handle compounds up to a hundred atoms, but even
there millions of arrangements are possible, making the chemist’s task quite hard. The problem
has been addressed by handling substructures, called ‘superatoms’, in the search process.
To discover the structure, chemists use a mass spectrometer that breaks up the chemical
into some constituents and produces a spectrogram in which peaks correspond to the mass
of the constituent. Chemists have some knowledge of how fragments could be created when
bombarded by atoms, and the kind of spectrogram that might be produced. But looking at
the spectrogram and determining the structure is harder, because there are combinatorially
increasing possibilities. What spectroscopy reveals is the arrangement in some substructures,
which become constraints in exploring the possible structures. The DENDRAL program had a
planner that could decide which constraints to impose in the search for possible candidates. It
applied knowledge of spectral processes to infer constraints from instrument data. DENDRAL
extends the generate-and-test approach of search to plan-generate-and-search. The planner
strives to generate only meaningful candidates to narrow down search, while not ignoring
solutions.
A constrained generator (CONGEN) accepts constraints and generates complete chemical
structures. Some of the symbols manipulated by the structure generation algorithm stand for
‘superatoms’, which are collections of atoms arranged in a particular way, along with their
collective valency. The program has to expand these to get the final structure. The chemist may
specify the superatoms to be used. For example, such a substructure may have four carbon
atoms, two of which may have free valencies of 1 and 2. These will be linked by CONGEN to
other structures. Figure 7.10 illustrates the kind of search space DENDRAL explores. It is an
And-Or graph, with hyperedges connected by arcs.
Figure 7.10 The program DENDRAL explores an And-Or graph. It generates candidate
structures and a synthetic spectrogram. Formulas in vertical lines are superatoms. The
shaded nodes represent completely specified structures or substructures. Figure from
Mainzer (2003: ch. 6).
The following is a record of a session with CONGEN (Lindsay et al., 1993). The
constraints listed below illustrate the flexibility and power of the program to use information
about chemical structure that may be inferred (manually or automatically) from a variety of
analytical techniques.
DENDRAL is not a single program, but a collection of programs. The project consisted
of research on two main programs, Heuristic DENDRAL and Meta-DENDRAL, and several
subprograms. The first uses knowledge of chemistry and mass spectroscopy data to hypothesize
possible structures. The second accepts known mass spectrum/structure pairs as input and
attempts to infer the specific knowledge of mass spectrometry that can be used by Heuristic
DENDRAL (Lindsay et al., 1993).
Figure 7.11 Symbolic integration involves searching through a space of substitutions and
decompositions in the quest for primitive integration problems. The figure shows a part of the
search tree, along with a subtree that has three primitive problems. Figure from
Mainzer (2003: ch. 6).
windows, the kitchen, the roof, and so on. For each of the subproblems, there may be options
to choose from, and thus the search space is an And-Or tree (or a graph if solution components
can be reused). Such search spaces are also called goal trees, because they transform goals
into subgoals. More recently And-Or trees have been used in graphical models (Dechter and
Mateescu, 2007).
We make an assumption here that the subgoals can be solved independently. Sometimes
this may not be true, as when planning a thematic dinner. In Chapter 10 on planning, we will
encounter this problem of non-independent subgoals again, even for simple planning problems.
We say that the subgoals are non-serializable. This means that one cannot solve the subgoals
independently in a serial order and arrive at the whole solution. In Chapter 9 when we study
goal directed reasoning, we will encounter this dependency between subgoals. The solution of
a previous subgoal will constrain the solution of the current one.
Next, we study the well-known algorithm AO*, which as the name suggests, is an
admissible algorithm. This means that under certain conditions, including an underestimating
heuristic function, the algorithm finds an optimal solution. The assumption is that the subgoals
can be solved independently.
Figure 7.12 In the forward phase, algorithm AO* follows the markers and picks a node (here
node J with h(J) = 170) for refinement. It expands J into two options, {L, M} and {N, O}, and
marks the cheaper option {L, M}. Each edge has a cost 10. The estimate of node J changes to
200 which is h(L) + h(M) + 10 + 10.
The last step when h(J) was updated triggers the second phase, which is the backward
phase. In this phase the updated estimates are propagated upwards. In the above example, when
node J changes, the change must be propagated to all its parents.
The following is the outline of the algorithm AO*. It uses a value futility to set the limit of
an acceptable cost of the solution (Rich, 1983).
It should be emphasized that when a node changes in the backward propagation phase, all
its parents must consider the change, and not just the parent figured in the forward phase. This
is because the other parents may also get affected, and it might have an impact on the solution
found. Figure 7.13 illustrates the propagation in the backward phase.
Figure 7.13 In the backward phase, algorithm AO* propagates any changes seen back
towards the root. In this example, after node Q is expanded, it reaches a solved node S which
is cheaper than the live node R. Consequently, Q gets labelled solved as well, but its sibling
P gets the marker as it is cheaper. Node L is now solved too, but the hike in cost results in AO*
shifting its attention to the {A, B} option. Notice that the estimate of G has escalated as well.
Each edge in the figure has a cost 10. The shaded nodes in the figure are solved nodes and
have an associated known cost. In the forward phase, AO* lands up on the live nodes M and Q,
with estimates 80 and 90 respectively. It picks Q, which is an Or node, and refines it. Of the two
children R and S of Q, the latter is cheaper, and so it gets the marker from Q, whose estimate
goes up to 150 (= 140 + 10). S is also a solved node and, since it is the marked option, Q is
labelled solved as well. But its sibling P is cheaper and so gets the marker from L, which is now
labelled solved as well. Its parent J now has a revised estimate of 250. This is propagated up to
both the parents B and C, even though C is the partial solution being currently refined. Notice
that both B and C abandon J and shift to other options. With the other option K, the cost of C
has gone up to 250, so that the solution containing {C, D} now has an estimated cost 460. The
option {A, B} offers a solution with a lower cost 450, and AO* will go down that road in the
next forward move.
The cycle of forward and backward moves continues till the root is labelled solved, or
till the estimated cost of the root becomes greater than an acceptable limit, futility. When it
finds a solution, it must be the optimal solution, provided the heuristic function underestimates
the cost of each node. The argument is similar to the one made in the B&B for TSP and A*.
The markers serve the purpose of identifying the cheapest partial solution to be refined. If the
estimate of another option is higher, then it can be safely excluded because the actual cost will
be still higher. Thus, the AO* algorithm is an instance of the common theme described earlier.
Refine the candidate solution with the lowest estimated cost,
till the lowest cost candidate is fully refined.
Algorithm 7.1 presents the details of AO*. Futility is a large number that sets the limit for
the acceptable cost. Line 1 begins by adding the goal as the start node. A heuristic function
h(N) returns an estimate for solving N. As discussed earlier, this must underestimate the actual
cost for the algorithm to be admissible. The forward phase in Lines 5-16 follows the marked
nodes and refines one of the live nodes. It needs to check for loops because in some domains,
symbolic integration for example, the transformations could be reversible or have loops. The
heuristic values of the new nodes are computed, and primitive nodes if any are labelled solved.
Algorithm 7.1. The algorithm AO* operates in a cycle of two phases. In the forward
phase, the algorithm follows a set of markers that identify the cheapest partial solution
and extends it. In the backward phase, it propagates the revised costs up towards the
root. It terminates when the label solved is propagated up to the root, or if the cost
estimate is beyond an acceptable limit.
AO*(start, Futility)
1   add start to G
2   compute h(start)
3   solved(start) ← FALSE
4   while solved(start) = FALSE and h(start) < Futility
        ▷ FORWARD PHASE
5       U ← trace marked paths in G to a set of unexpanded nodes
6       N ← select a node from U
7       children ← Successors(N)
8       if children is empty
9           h(N) ← Futility
10      else check for looping in the members of children
11          remove any looping members from children
12          for each S ∈ children
13              add S to G
14              compute h(S)
15              if S is primitive
16                  solved(S) ← TRUE
        ▷ PROPAGATE BACK
17      M ← {N}                         ▷ set of modified nodes
18      while M is not empty
19          D ← remove deepest node from M
20          compute best cost of D from its children
21          mark best option at D as Marked
22          if all nodes connected through marked arcs are solved
23              solved(D) ← TRUE
24          if D has changed
25              add all parents of D to M
26  if solved(start) = TRUE
27      return the marked subgraph from start node
28  else return null
The backward phase begins by adding the newly refined node to a set of modified nodes
M. Choosing the lowest node D from M, one computes its estimated cost and marks the subtree
from where it is propagated. If the subtree is labelled solved, D is labelled solved as well, and
its cost becomes the actual cost. If D has changed, then all its parents are added to M, and they
will be updated in turn.
If the root is labelled solved, then the algorithm returns the subtree containing the marked
edges and the corresponding nodes.
As described above, the solution returned by AO* is a subtree or a subgraph of the problem
graph. The leaves of the solution are the primitive solved nodes and represent the solution parts
that together solve the original problem. In this way, solving And-Or graphs is different from
path finding. The fact that the solution is a subgraph also has a bearing on the contribution that
the heuristic value of an individual node makes during the search process. In A* the heuristic value
h(N) of a node directly represents an estimate of the distance to the goal, and f(N) the estimated
total cost. In AO* an individual node may only be a part of the solution, and the total estimated
cost is determined by other constituents of the solution as well. This explains the need for the
backward phase where the estimated cost of the entire solution is aggregated, and the best
options marked, which guide the forward movement in each cycle.
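To make the cost aggregation concrete, here is a minimal Python sketch that exhaustively evaluates an And-Or graph of this kind. The graph, the option lists, and the leaf costs below are hypothetical, loosely modelled on Figure 7.8, with every edge costing 10; this is an exhaustive evaluation, not AO* itself, which uses heuristic estimates to avoid expanding parts of the graph.

def optimal_cost(node, options, leaf_cost, edge_cost=10):
    # options[node] lists the Or choices; each choice is a list of subproblems
    # that must all be solved (an And edge). Primitive (solved) nodes carry a
    # known cost in leaf_cost. Every edge adds edge_cost to the solution.
    if node in leaf_cost:
        return leaf_cost[node]
    return min(sum(edge_cost + optimal_cost(child, options, leaf_cost, edge_cost)
                   for child in option)
               for option in options[node])

# A hypothetical graph in the spirit of Figure 7.8, with made-up leaf costs.
options = {
    "G": [["A"], ["B"], ["C", "D"]],
    "A": [["E", "F"]],
    "B": [["H"]],
    "C": [["I"]],
    "D": [["J"], ["K"]],
}
leaf_cost = {"E": 30, "F": 40, "H": 120, "I": 50, "J": 60, "K": 45}
print(optimal_cost("G", options, leaf_cost))    # prints 100, via the option {A}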
Figure 7.14 shows a solution on a plausible problem graph on which the progress of AO* is
shown in Figure 7.13. The labelled nodes are as explored by the algorithm, and the unlabelled
nodes are yet to be explored. The shaded nodes are the primitive nodes. The solution is identified
by solid edges, while the rest of the problem graph is drawn with dashed edges.
Figure 7.14 The partial solution of Figure 7.13 extended to a plausible complete problem graph.
The unlabelled nodes are yet to be explored. A solution is a subtree with solved nodes as
leaves. The subtree of solid edges is one solution with total cost 510. This is the sum of the
cost of the four leaves that are part of the solution, along with a cost 10 for every edge.
As can be seen, the solution shown here is a refinement of the option {C, D}. There are four
nodes in the solution with costs 140, 90, 60, and 130, of which only node P has been explored
in Figure 7.13. Along with the costs of the edges in the solution, the total cost is 510. The
question is whether this is the optimal solution. For that we need to know the costs of all the
primitive nodes. We pose this question again in the exercises where the complete graph is given.
Summary
In this chapter we have taken a different view of problems and problem solving, in which we
focus on subproblems while searching for a solution. We have looked at two approaches, each
requiring its own representation scheme.
The first is a continuation of the forward chaining approach, which moves from the start state
in search of the goal. The difference is that, instead of a monolithic state transition
system, we look at how patterns in the state description can trigger moves or rules. The idea
behind this modular rule or production representation arose from the desire to elicit problem
solving knowledge from human experts and use that to drive the problem solving process. The
evolution of this approach led to the development of a formalism that could not only capture
expert heuristic knowledge, but also serve as a complete programming language.
The second approach is the goal-directed backward reasoning approach which starts with
the goal to be achieved and explores means of breaking it down into parts that can be solved
independently. This becomes possible by capturing the choices as well as the part-whole
relations in an And-Or graph, which the algorithm AO* explores. While the independence of
subproblems cannot always be ensured, it is a good starting point. In later chapters on planning
and logical deduction, we will meet And-Or trees or goal trees again. We will also encounter
them in the next chapter when we look at how best to choose a move in two player board games.
Exercises
1. Write a set of rules so that a rule based system can play the game of tic-tac-toe without
resorting to search. Analyse the game and write specific rules to play a move in a specific
situation. For example, the opening move could be made in any of the corner squares. What
kind of representation will you use to express such rules succinctly?
2. Given an array defined by a set of WMEs of the form (array ^index value ^subject value
^marks value ^rollNumber value), write an OPS5 program to arrange the index values in
increasing order of roll numbers. This would amount to sorting the array on roll number.
Modify the above program to assign the index values in decreasing order of the marks
attribute.
3. Given an array of marks (array ^index value ^subject value ^marks value ^rollNumber
value) and a set of cutoffs for a set of grades {A, B, C, D, E, F} for each subject, write
a program to assign grades to each student for each course. How would you choose to
represent the cutoffs, and how many rules would you need for three subjects - economics,
history, and political science?
4. Six rules for geometric shapes have been defined in the chapter, and these are attached with
the shaded beta nodes in Figure 7.5. Define the rules for the unshaded beta nodes and give
the shapes their commonly accepted name. Does node C have a common name? [Hint: the
number of sides is not known.]
5. The six rules and the eleven WMEs together result in ten conclusions about shape 1,
shape 2, shape 3, and shape 4, as given in Section 7.2.6. Simulate the execution of these ten
rule-data instances using the specificity, recency, and MEA strategies.
6. Assume the following schema for WMEs:
a. (Person ^name ^age ^gender) gender: M/F/N
b. (Work ^person ^nature) nature: self-employed/government/corporate/NGO
c. (Habits ^person ^activity) activity: smoking/trekking/cricket
d. (Education ^person ^completed) completed: highSchool/bachelors
e. (Eligible ^person ^loan) loan: yes/no
A bank uses the following rules for deciding whether a person is eligible for a loan or not:
a. If the person has finished high school and smokes, then s/he is not eligible
b. If the person is in a corporate or government job and does not smoke, s/he is eligible
c. If the person’s name is Vijay or Nirav, he is eligible
d. If the person is a female graduate, then she is eligible
e. If the person is a self-employed female, she is eligible
Express the above rules in an OPS5-like language, and construct and draw a Rete net for
the above rules.
7. For the above problem, given the following WM, list the conflict set for the above set
of rules:
1. (Person ^name Sunil ^age 37 ^gender M)
2. (Person ^name Jill ^age 22 ^gender F)
3. (Person ^name Sneha ^age 27 ^gender F)
4. (Habits ^person Sneha ^activity cricket)
5. (Habits ^person Sunil ^activity smoking)
6. (Education ^person Sneha ^completed bachelors)
7. (Education ^person Sunil ^completed bachelors)
8. (Education ^person Jill ^completed highSchool)
9. (Work ^person Jill ^nature corporate)
10. (Work ^person Sunil ^nature self)
11. (Work ^person Sneha ^nature NGO)
Which element of the conflict set would be selected if the conflict resolution strategy is
specificity?
Which element of the conflict set would be selected if the conflict resolution strategy is
recency?
Who is/are eligible for a loan?
8. The rule based program as given in the text produces all possible category labels for the
input figures. Modify the program to assign only one category to each figure and simulate
the different conflict resolution strategies. [Hint: you can include Remove actions in the
RHS.] What conclusion will be drawn for each of the four figures - shape1, shape 2,
shape 3, shape 4 - under the different conflict resolution strategies?
The graph below is a problem graph similar to the one used in Figure 7.14.
9. Propagate the costs in the solved nodes upwards in the above graph and replace the heuristic
estimates with actual costs. What is the cost of the optimal solution? Does the heuristic
function underestimate the actual costs?
10. Hand simulate the AO* algorithm on the above graph and draw the solution found. What is
the cost of the solution found?
11. Divide the heuristic estimates of the internal nodes by 2 and repeat the hand simulation.
What is the impact of the lower heuristic estimates?
12. Multiply the heuristic estimates of the internal nodes by 2 and repeat the hand simulation.
What is the impact of the higher heuristic estimates?
13. [Baskaran] Show how the algorithm AO* will explore the following And-Or graph.
Assume that each edge has a cost 1. Draw the graph after each move. Repeat the process
after assuming that each edge has cost 10.
14. [Baskaran] Show how the algorithm AO* will explore the following And-Or graph.
Assume that each edge has a cost 2. Draw the graph after each move. Repeat the process
after assuming that each edge has cost 10.
15. Multiplying a sequence of matrices can be posed as an And-Or graph, though it is usually
seen as a dynamic programming problem because the costs are known. Given four matrices
A [5x15], B [15x2], C[2x20], and D [20x1] we need to find the order in which to pick two
matrices to multiply, so that the total number of multiplications is minimized. Pose this
problem as an And-Or search problem and identify the optimal solution.
chapter 8
Chess and Other Games
Acting rationally in a multi-agent scenario has long been studied under the umbrella of
games. Game theory is a study of decision making in the face of other players, usually
adversaries of the given player or agent. Economists study games to understand the
behaviour of governments and corporates when everyone has the goal of maximizing
their own payoffs. A stark example is the choice of NATO countries to refrain from acting
directly against the Russian invasion of Ukraine, given the threat of nuclear escalation.
In this chapter we turn our attention to the simplified situation in which the agent
has one adversary. Board games like chess exemplify this scenario and have received
considerable attention in the world of computing. In such games each player makes a
move on her turn, the information is complete since both players can see the board,
and the outcome is a win for one player and a loss for the other. We look at the most
popular algorithms for playing board games.
Chess has long fascinated humankind as a game of strategy and skill. It was probably
invented in India in the sixth century in the Gupta empire when it was known as chaturanga.
A comprehensive account of its history was penned in 1913 by H.J.R. Murray (2015). The name
refers to the four divisions an army may have. The infantry includes the pawns, the knights make
up the cavalry, the rooks correspond to the chariotry, and the bishops the elephantry (though the
Hindi word for the piece calls it a camel). In Persia the name was shortened to chatrang. This in
turn transformed to shatranj as exemplified in the 1924 story by Munshi Premchand (2020) and
the film of the same name by Satyajit Ray, Shatranj Ke Khiladi (The Chess Players). It became
customary to warn the king by uttering shah (the Persian word for king) which became check,
and the word mate came from mat which means defeated. Checkmate is derived from shah mat
which says that the king has been vanquished.
Table 8.1 lists the names of the chess pieces in Sanskrit, Persian, Arabic, and English
(Murray, 2015). In Hindi users often say oont (camel) for bishop and haathi (elephant) for rook.
From India the game spread to Persia, and then to Russia, Europe, and East Asia around the
ninth century. It was introduced to southwestern Europe by the Moors and finds mention in the
Libro de los Juegos (The Book of Games)1 commissioned by King Alphonso in the thirteenth
century. Buddhist pilgrims and traders on the Silk Road spread the game to the Far East, where
it sprouted newer variations like Chinese chess and shogi. Another game that was played on a
board was go which is still very popular today.
The moves that the pieces could make also changed over time as the game spread. The
rules used currently in tournaments correspond to international chess as compared to the
Indo-Arabic chess which had different rules. Castling, for example, was a new addition, as well
as the ability of the pawn to move either one step or two.
The game of chess is played on an 8 x 8 board of alternating black and white squares,
and each player has sixteen pieces at the start of the game. These are two rooks, two knights,
two bishops, a king, and a queen, along with eight pawns. Each piece has well defined rules
of movement and thus embodies different tactical abilities. The pieces of the two players are
initially lined up on the respective ends of the board, like the armies of yore, and then each
player in turn makes a move. A move can result in the capture of an opponent piece on the
destination square, which then goes out of play. The objective of the game is to threaten to
capture the opponent’s king, and the aggressor is obliged to utter the word check while doing so
as a warning. If there is no escape for the king being attacked, then the word used is checkmate,
and the game ends at that point. This game with simple rules on a small terrain gives rise
to virtually countless possibilities, and tomes have been written, and read, on the strategies
for playing the game. So much so that the discourse is split into three parts. The first is the
well documented opening game, with strategies like Ruy Lopez and the Queen’s Gambit.
After a few moves, when the board opens up and pieces become more mobile, comes the middle
game. This is not as well documented because of the humungous possibilities, and is driven by
heuristics like retaining control of the centre, maintaining connected pawn structures, lining up
doubled rooks, and so on. The real battle of wits is fought here. Then, as pieces are captured
and removed from the board, the game becomes more tractable, and seasoned chess players
recognize patterns with known outcomes well in advance. This phase is the end game, and more
easily documented. Beginners, for example, are often taught how to checkmate a lone opponent
king with two rooks.
The earliest pioneers of computing including Alan Turing and John von Neumann were
deeply interested in chess which, though being a game of well defined rules and outcomes, is
difficult to master for most of us, and they treated it as an alibi for intelligent behaviour (Larson,
2021). In 1950 Claude Shannon published a paper on computers and chess (Shannon, 1950).
At the Dartmouth conference in 1956, Alex Bernstein presented a chess program developed at
IBM, but the limelight was stolen by Arthur Samuel’s checkers playing program which learnt
from experience and is reputed to have beaten its creator. Chess was still considered to be a
difficult game, and in 1968 the British International Master David Levy wagered that no machine
could beat him in the next ten years. He did win his bet, but in 1997 the program Deep Blue
developed at IBM beat the reigning world champion Garry Kasparov in a legendary six game
match (Campbell, Hoane, and Hsu, 2002). Chess machines continued their march over humans
over the years. David Levy himself turned an artificial intelligence (AI) proponent and went
so far as to predict that robots will provide the romantic companionship that humans crave
(Levy, 2008).
From a game theoretic perspective, the objective would be to determine the outcome of the
game when both players are perfect, but that is not yet computable. Instead, beating the best
humans has been set as the benchmark. After conquering chess, attention shifted to the oriental
game of go which was considered to be much harder. The game is played with black and white
coins called stones, with each player free to place one on any grid location. Go is played on
a 19 x 19 grid, thus presenting a much larger set of choices. The first move can be played in
361 ways, the second in 360, and so on. In all, there are 10^170 possible board positions. This
is much larger than the estimated 10^120 chess games that are possible. Both these numbers
are practically incomprehensible for most of us. The reader is encouraged to do a back of the
envelope calculation to estimate how long it would take to inspect the 10^120 chess games even if
every one of the estimated 10^75 fundamental particles in the universe were to be a supercomputer
inspecting billions of games a second. Nevertheless, in 2016 the program AlphaGo developed
by DeepMind, then a company in the United Kingdom, beat the reigning go champion Lee
Sedol 4-1 in a much publicized match in Seoul (Silver et al., 2016). A gripping account of the
match, and a film made on it, can be seen on DeepMind’s website2 which says -
The game earned AlphaGo a 9 dan professional ranking, the highest certification. This
was the first time a computer Go player had ever received the accolade. During the
games, AlphaGo played several inventive winning moves, several of which - including
move 37 in game two - were so surprising that they upended hundreds of years
of wisdom.
John von Neumann and Oskar Morgenstern (1944) are generally credited with formalizing the idea of game theory. Game theory has also
been defined3 as ‘the study of the ways in which interacting choices of economic agents produce
outcomes with respect to the preferences (or utilities) of those agents, where the outcomes in
question might have been intended by none of the agents’.
When we say rational agents we mean selfish agents, whose only goal is to maximize
their own reward. This could mean that the outcome of the actions of all agents might not
be the best for all concerned. A stark example is the concerted actions required to mitigate
the effects of climate change, which are imperative if one is to save the world, but do not
happen because individual nations have their own short term goals, most often in energy
requirements. Many nations are critically dependent on fossil fuels, the very culprit behind
detrimental climate change. Energy requirements are also behind the conundrums that many
nations found themselves entangled in during the recent Russian invasion of Ukraine. The
fact is that self-preservation was the motive behind the reluctance of most powerful nations to
overtly intervene to prevent the massive destruction of people and property in Ukraine. When
peoples and nations act in ‘rational’ self interest, they can still push the entire world towards a
catastrophe. That is what we mean when we say that rational refers to being selfish, even though
it may be short sighted. This is illustrated by the well known prisoner's dilemma, in which two
suspects are being interrogated independently.
It has been shown that as long as T > R > P > S then it is rational for both players to betray
each other, irrespective of what the other player does.4 Consider the case when T = 0, meaning
that the prisoner is let off, and the other payoffs are R = -1, P = -2, and S = -3. Then each
prisoner reasons as follows.
Case 1: The other cooperates and does not confess (Column 1 for A, and Row 1 for B).
If I cooperate then I get R = -1, and if I betray him then I get T = 0. I am better off
betraying him.
Case 2: The other confesses and betrays me (Column 2 for A, and Row 2 for B). If I
confess and betray him I get P = -2, and if I do not confess then I get S = -3. I am better off
betraying him.
In both cases, the best action for each prisoner is to confess and betray the other. Clearly,
rational self-interest results in each player getting a payoff P = -2 which is lower than the
payoff R = -1 if both had cooperated. Therein lies the dilemma. This decision point is known
as the Nash Equilibrium after the mathematician John Nash. A Nash Equilibrium is an outcome
of the choices of all the players such that no player can gain by unilaterally making a different
choice. As shown by the above example, this is not necessarily the best possible
outcome.
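The case analysis above can be checked mechanically. The following Python sketch encodes the payoffs T = 0, R = -1, P = -2, S = -3 used in the text and computes each prisoner's best response; the action names are illustrative.

# Payoffs to one prisoner for (my action, other's action).
PAYOFF = {
    ("betray",    "cooperate"): 0,     # T: I confess, he stays silent, I go free
    ("cooperate", "cooperate"): -1,    # R: both stay silent
    ("betray",    "betray"):    -2,    # P: both confess
    ("cooperate", "betray"):    -3,    # S: I stay silent, he confesses
}

def best_response(other_action):
    # The action that maximizes my payoff, the other's action being fixed.
    return max(("cooperate", "betray"), key=lambda a: PAYOFF[(a, other_action)])

for other in ("cooperate", "betray"):
    print(other, "->", best_response(other))    # betray, in both cases

# Betraying is the best response to either choice, so (betray, betray) is the
# Nash Equilibrium, even though mutual cooperation (-1 each) beats it (-2 each).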
The above analysis holds for a one time game for the two rational players, and not for
seasoned criminals who may have developed trust in their partners in crime over a period
of time.
It is not always necessary that a stable decision point will be reached. Consider the task
of dividing the proceeds of a bank robbery between three robbers. Let the amount be 10
units, and let the decision be made by majority vote. Then, if A and B propose a 5-5 division
among themselves, C can offer 4-6 to, say, A. Now B can make a counteroffer to one of them,
and in this manner a decision will never be arrived at. This example is of a game between
three players.
Number of Moves
Many games have one decision or one move to be made, like in prisoner’s dilemma. We will
be interested in games where each player makes a sequence of moves, and where the payoff
is received after the game ends. As we will see, these can also be analysed to arrive at one
complex decision, which we will call a strategy. But in games like chess, it is not feasible to
compute the winning strategy, and we end up making decisions afresh at every turn.
Payoff
Games need not always be adversarial. They can be a basis of cooperation as well. The nature
of the game is characterized by the sum of the payoffs received by each player. Games can be
classified as follows:
• Zero sum games: Here the total payoff is zero. Some players may gain while others lose.
Board games like chess are zero sum games. One player’s win is the other player’s loss.
• Positive sum games: Here the total payoff is positive, and most or all players gain. Such
games are the basis for cooperation. For example, two researchers working jointly on a
project, two students studying together preparing for an exam, two nations indulging in
free trade benefitting the economies of both, and people of different religions participating
in each other’s festivals. Cartels between manufacturers result in increased profits for them,
though at the expense of the consumer.
• Negative sum games: Here the total sum is negative and most or all players lose. A price
war between two companies will result in reduced profits for both, though if we include
the consumer in the game then it will become zero sum. Malevolent neighbours disrupting
each other’s festivities results in negative payoffs for all. War is the ultimate negative sum
game, even when we count the profits made by arms manufacturers and dealers. Heads
of arms manufacturing countries are known to promote sales during official visits to
other countries.
Number of Players
The number of agents involved is another characteristic of the game.
• Two player games: Many games are treated as two player games, for example, a price
war between two companies is a negative sum two player game. Games like noughts and
crosses, checkers, and chess are two person games.
• Multiplayer games have more than two players. Price wars become a zero sum game when
the consumer is included.
• Team games: Multiplayer games are sometimes modelled as competition between two
teams - often zero sum. Each team may have team members collaborating with each
other. Examples are contract bridge, football, armies on a battlefield, teams of lawyers
in a courtroom, and members of a species in an ecosystem where survival of the fittest is
the theme.
Uncertainty
Most real world situations in which agents act are fraught with uncertainty. Uncertainty arises
from two sources. One, due to incomplete information. A bridge player does not know what
cards an opponent, or the partner, is holding. The executives of a company do not know what
their competitors are planning. Nations likewise do not know what other nations are planning.
In both cases, espionage is a move employed by the people in power. Authoritarian governments
are known to spy upon their own citizens. Most recently, the use of spyware, for example,
Pegasus, is testimony to that. The other source of uncertainty is due to one’s actions not being
deterministic.
• Incomplete information games lead to uncertainty. In card games like contract bridge and
poker one cannot see other players’ cards. In the corporate world we are not aware of what
others are planning, and hence corporate espionage and lobbying with governments. In
war, what the enemy is up to is not known, and we have spies reporting from across the
line. We may even have to contend with misinformation when double agents are active.
Generals often resort to deception, for example, Operation Fortitude, the plan that misled the
Germans about the Normandy landings in World War II.
• Stochasticity in the domain leads to uncertainty. The throw of the dice in backgammon
or in snakes and ladders cannot be predicted, as cannot be the draw of cards in poker or
rummy. Except for the best in the field, most of us can only have the intent of shooting a
basket on the basketball court.
As we can observe, the study of games involves a multitude of scenarios, with the common
theme being the choice of actions designed to maximize one’s payoff. We will confine our
attention to two-person zero-sum complete-information alternate-move games like chess. In
these games there is not just one move but many. We will call these board games, glossing over
for the moment the fact that there are games played on a board like backgammon that involve
the throwing of dice, and games like Chinese checkers and snakes and ladders that additionally
may have more than two players. The games we intend to write algorithms for can all be
abstracted into a game tree and have well studied algorithms to play them.
Figure 8.1 A generic game tree is a layered game tree with the two players choosing a move
at alternate levels. Max plays first and can choose one of the four moves at the root. Each path
represents a possible game, and the games end at leaf nodes. The leaf nodes are labelled
with the outcome - win, draw, or loss - from Max’s perspective.
Max chooses a move at the root, and then Min chooses one at the next level. This continues
till the game ends. Every path in the tree represents a game, with the leaf labelled with the
outcome. Observe that some games in the tree are shorter than others.
Associated with every game tree is a value that is the outcome when both players play
perfectly. This is known as the minimax value of the game, and is the Nash Equilibrium. The
minimax value can be computed by backing up the values from the leaves to the root in a
bottom-up fashion. The procedure is as follows.
Pick a node whose children are all already labelled with W, D, or L. If the node is a
Max node, then back up a W if some child is labelled W, else D if some child is labelled D,
else L. This reflects perfect play for Max. The outcome can also be labelled with values 1, 0,
and -1 respectively for W, D, and L. With these numeric labels one can see that Max prefers the
highest value, and hence the name Max. Min is the opposite, preferring the lowest value from
its children. Min’s first preference is L, because it is a loss for Max and a win for Min. In this
fashion we back up the values from the leaves to the root, choosing the maximum and minimum
of the children at alternate levels. The value of the root is the minimax value of the game. It is
the maximum over the values of its Min children, each of which in turn chooses the minimum of
the values offered by its Max children.
The reader is encouraged to compute the minimax value of the above game tree and verify
that the game is a draw. Observe that this is so even though most leaves are labelled with a W.
The reason is that at critical points it is Min that prefers another outcome over W.
8.2.1 Strategies
The game tree represents a game in which multiple moves are made by both players, and the
outcome is known only at the end of the game. Chess players, for example, ponder over every
move, as do children playing noughts and crosses. How can one relate such multiple move
games with the single choice games like the prisoner’s dilemma seen in the last section?
We can do that by defining the notion of a strategy for each player. A strategy is a statement
of intent, once and for all, by a player. In other words, a strategy for Max freezes the choices
for Max, while catering to all possible choices for Min. A strategy is a subtree of the game tree,
where we choose one move at every Max node and include all moves at every Min node. A strategy for Min likewise
freezes the choices for Min, while catering to all subsequent moves by Max. Figure 8.2 shows
two strategies for Max for the above game tree. One is the subtree on the left with bold edges,
and the other on the right with bold-dashed edges. The leaves in the two strategies are shaded.
We have used labels +1, 0, and -1 instead of W, D, and L.
Figure 8.2 Two strategies for Max. Each strategy chooses one move for Max and all moves for
Min. The shaded leaves are the leaves in the two strategies. The labels on the leaves are +1,
0, and -1, instead of W, D, and L.
Each strategy represents Max’s decisions over the entire game. The strategy on the left has
four leaves with labels +1, -1, -1, and +1. Since Max has frozen her moves, the game played,
which is a path in that subtree, will be determined by Min. Clearly, any rational Min will drive
the game towards a leaf with the lowest value, in this case -1. This is the case in general, and
we can make the following observation:
The value of a strategy for Max is the minimum of the values of the leaves in the
strategy. That is the best that Min can do, given the frozen choices for Max. If the value
of the strategy is +1, then it is called a winning strategy for Max.
The reader should verify that the value of the strategy with dashed thick edges on the right
is 0. Clearly, it is the better of the two strategies that Max can choose. In general, a rational Max
would choose the strategy with the highest value. That will also be the minimax value of the
game, since it represents the best choices both the players can make.
How many strategies does Max have? Figure 8.3 depicts an approach to counting strategies
given by Baskaran Sankaranarayanan in 2020.5 Like computing the minimax value, we count
the strategies from the leaf nodes up to the root. We begin with 1 for every leaf node. At the
Max level, we sum the values from its children, since each branch represents a choice for Max.
At the Min level, we compute the product of the children, since Max has to account for all
combinations of choices that Min has.
Figure 8.3 A method of counting strategies for Max credited to Baskaran Sankaranarayanan.
Any leaf node is assigned a count 1. A Max node sums up the values of its children, depicted
here by a summation sign. A Min node computes the product of the values of all its children.
The minimax value then represents the outcome of the game when both players choose
their best strategies. This can be computed by analysing the entire game tree. For games like
noughts and crosses, this can easily be computed and, as most children realize over a period of
time, one cannot win the game against an opponent who makes no mistakes. And it was only
in this century that the game of checkers was analysed (Schaeffer et al., 2007). Perfect play by
both sides leads to a draw. Games like chess and go are another matter. The game trees are just
too large to be analysed completely, and we still do not know whether White, the player who
moves first, can force a win in chess or not. The best we can do is to play better than our opponent.
Which is why chess still fascinates us.
Figure 8.4 An evaluation function looks at a board position and returns a value that signifies
how good the board position is for Max, with higher values being better. In this example, Max
would choose the move with the value +11.
The evaluation function is a static function like the heuristic function that inspects a board
position and returns a value. Let us consider the example of chess. How does a human expert
look at a board position and evaluate it? The consensus is that there are two components to an
evaluation function.
One computes material advantage, and the other evaluates positional advantage. In chess,
for example, one can assign a relative value to each piece. John von Neumann is said to have
given the following values. A king is valued at 200. This could be interpreted as Large = 200
because the absence of the king would mean the game is lost, irrespective of what other pieces
are present. The queen is valued at 9. This would mean that if Black had the queen and White
did not, then one could subtract 9 from eval(J). Thus, every piece advantage would add or
subtract something from the evaluation function. The rook has a value 5, the bishop and the
knight 3, and each pawn has a value 1. Since in the initial position both players have all the
pieces, the contribution of material to the evaluation function would be 0.
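A minimal sketch of the material component, using the piece values quoted above; the piece-list representation of a position is a simplifying assumption, not how a real chess program stores the board.

# Piece values as quoted in the text.
PIECE_VALUE = {"K": 200, "Q": 9, "R": 5, "B": 3, "N": 3, "P": 1}

def material(white_pieces, black_pieces):
    # Material component of the evaluation: White's piece values minus Black's.
    return (sum(PIECE_VALUE[p] for p in white_pieces)
            - sum(PIECE_VALUE[p] for p in black_pieces))

full_set = ["K", "Q", "R", "R", "B", "B", "N", "N"] + ["P"] * 8
print(material(full_set, full_set))                              # 0 at the start
print(material(full_set, [p for p in full_set if p != "Q"]))     # +9 if Black has lost the queen

A full evaluation function would add the positional terms described next, weighted and summed with the material term.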
The other contribution to the evaluation is positional advantage. Chess experts have
several features that they consider to be good, for example, connected pawns, doubled rooks, a
protected king, control of the centre, the mobility of the pieces, and so on. The presence of each
feature contributes to the evaluation function. Traditionally programmers have drawn upon
human experts to devise evaluation functions. The evaluation function can be as complex as the
detailed knowledge that an expert can articulate. The function used in the program Deep Blue
has about 8,000 components (Campbell et al., 2002).
Is there a first move advantage? Should the evaluation function have a small positive value
in favour of White? The jury is still out on this one.
If the evaluation function were perfect, then it would select the best move simply by
looking at the options. But such functions are hard to find. Consider that one of the moves
in chess captures an opponent bishop. Then it might look attractive because of the material
gained. But if the opponent could capture your queen on the next move, then your move
ends up looking worse than it appeared to be. A perfect evaluation function would take
into account the effect of such future piece exchanges but is difficult to devise. To counter
the imperfectness of the evaluation function most chess programmers implement a limited
lookahead. The idea is that the inaccuracy in the evaluation function will be compensated by
the look ahead. If the program is looking ahead k moves, we say that it is doing k ply search,
and the tree it explores is k plies deep. The evaluation function is applied to the nodes on the
horizon at k plies, and the values are backed up to the root using the Minimax rule bottom up
from the horizon nodes.
Minimax rule: For a Max node, back up the maximum value of its children.
For a Min node, back up the minimum value of its children.
Figure 8.5 shows a 4 ply game tree with the leaves on the horizon labelled with the values
of the evaluation function at each node or board position.
The reader is encouraged to apply the Minimax rule to the above tree and verify that the
minimax value of this tree is 32, and to achieve this outcome Max must choose the rightmost
move at the root.
Since we are not inspecting the entire tree, we do not know the final outcome of the game.
Instead, what we have is a move that appears to be the best based on the k ply lookahead. After
we make the move, the opponent responds, and we need to do another k ply search to choose
our next move. The algorithm for playing a game is shown below.
Figure 8.5 Most game playing programs look ahead to a certain ply depth and apply the
evaluation function on the horizon nodes. In this figure the tree is 4 ply deep, and the values in
the leaves on the horizon are the values returned by the evaluation function eval(J). Effectively
the program is looking 4 moves ahead to decide the move for Max at the root.
Algorithm 8.1. The algorithm GamePlay repeatedly calls k ply search for every move
that it has to make. With every new call, it looks 2 plies further in the game tree.
GamePlay(Max)
1   while game not over
2       call k ply search
3       make move
4       get Min's move
In this way one can imagine the game playing algorithm pushing into the game tree with a
limited lookahead at every step, as shown in Figure 8.6.
Remember our reason for not relying totally on the evaluation function with a 1 ply
lookahead: there may be unforeseen dangers beyond (Figure 8.4). The same can happen with a
deeper lookahead; the search may choose a poor move because it is oblivious of the danger
lurking beyond the horizon. This is known as the horizon effect. We illustrate this with an
example shown in Figure 8.7.
Let Max choose a move as shown with a minimax value backed up from the node Z on the
horizon. A little earlier in the path let an intermediate Max node choose a value from its two
successors marked X and Y. The node Y leads to a good position for Min in the node shown in
black, which is a bad position for Max. The backed-up value from this node reduces the value
of Y, which then is ignored. In the path via node X, Max has made an inconsequential move
Figure 8.6 For every move it makes, a game playing program calls a k ply search and waits for the
opponent’s move. The figure shows three such calls. Figure adapted from Khemani et al. (2013).
Figure 8.7 The horizon effect. Let the minimax value be from the node Z on the horizon. This
happens because the Max node chooses the higher value that comes from its child X. The
value from Y is poor because of the Min node shown in black. In the path from X there is an
inconsequential pair of moves where Max moves to the grey coloured node. This inconsequential
move pushed the black node beyond the horizon, making node X look better than it actually is.
shown in grey. One can always make such moves in a game, for example, pushing a nondescript
pawn, or making a knight move and reversing it later. The effect of such moves is that they may
push the black node beyond the horizon, as in the case of the chosen path, thus making X look
better than it is.
The above example shows how the horizon effect may make a path look better than it is.
One way to deal with this is to do a secondary search from the node Z before committing to
the move. This could uncover any hidden danger. The other way is to not worry about it now,
because the k ply lookahead for the next or a later move would anyway reveal the danger.
Algorithm 8.2. The algorithm Minimax does a depth first search of the game tree. This
version recursively calls itself to compute the value of each child. A node is a terminal
node if it is on the horizon.
Minimax(N)
1   if N is a terminal node
2       value ← eval(N)
3   else if N is a Max node
4       value ← -Large
5       for each child C of N
6           value ← max(value, Minimax(C))
7   else value ← +Large
8       for each child C of N
9           value ← min(value, Minimax(C))
10  return value
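The pseudocode translates almost line for line into Python. The sketch below assumes the game tree is given explicitly as a dictionary and that the evaluation function is a lookup of leaf values; a real program would generate children and evaluate positions on the fly.

def minimax(node, is_max, children_of, eval_fn):
    # Algorithm 8.2: a node with no children is treated as a node on the horizon.
    children = children_of(node)
    if not children:
        return eval_fn(node)
    values = [minimax(c, not is_max, children_of, eval_fn) for c in children]
    return max(values) if is_max else min(values)

# A toy 2 ply tree; leaves carry the values of the evaluation function.
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
leaf_value = {"a1": 3, "a2": 12, "b1": 2, "b2": 8}
print(minimax("root", True, lambda n: tree.get(n, []), leaf_value.get))
# Min holds a to 3 and b to 2; Max prefers a, so the value printed is 3.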
A critical question is how deep should the algorithm look ahead. The simple answer is
as deep as the computational resources and the allotted time allow. How efficiently can one
exploit the multicore processors that are available nowadays? Before the multicore era the
Hitech chess machine by Hans Berliner, a computer scientist and a chess player of some repute,
used a specialized architecture with sixty-four custom VLSI chips (Berliner, 1987). One must keep in
mind though that the tree grows exponentially, which is difficult to match even with increased
computing power.
One approach seeks to fine-tune the ply depth to tournament conditions. In an international
tournament one typically has to play 40 moves in 2 hours, and the next 20 moves in an extra
hour. After this if the game is still in progress, one needs to play faster. Within each block of
time, each player can decide how long each individual move takes, since the two players have
individual clocks, only one of which runs at a time. This allows us to follow a flexible time
schedule. The key is to decide the time one can devote to a move by the calling program, which
keeps track of time. This can be done by employing the strategy of depth first iterative deepening
(DFID) (Chapter 3), in which the calling program can call for the move anytime, and the DFID
Minimax returns the best move found. The opening game is often handled by deploying well
studied standard openings in which the moves can be made rapidly, leaving more time for
other moves. Some implementations rely on quasi-stability in the fluctuation of the evaluation
function with depth, usually during sequences of material exchanges. That particular call can
then be allowed more time, but at the expense of time available for future moves.
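One plausible way to combine DFID with the clock is sketched below: the search is repeated with an increasing ply depth, and the move from the last completed pass is kept when time runs out. The toy tree, the time budget, and the depth cap are illustrative assumptions; real programs also check the clock inside a pass.

import time

def depth_limited_minimax(node, depth, is_max, children_of, eval_fn):
    # Minimax cut off at a fixed ply depth; horizon nodes are scored statically.
    children = children_of(node)
    if depth == 0 or not children:
        return eval_fn(node)
    values = [depth_limited_minimax(c, depth - 1, not is_max, children_of, eval_fn)
              for c in children]
    return max(values) if is_max else min(values)

def anytime_move(root, children_of, eval_fn, seconds, max_depth=20):
    # Deepen ply by ply until the allotted time runs out, always keeping the
    # best move found in the last completed pass.
    deadline = time.monotonic() + seconds
    best_move = None
    for depth in range(1, max_depth + 1):
        if time.monotonic() >= deadline:
            break
        scored = [(depth_limited_minimax(child, depth - 1, False, children_of, eval_fn), child)
                  for child in children_of(root)]
        best_move = max(scored)[1]
    return best_move

tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
leaf_value = {"a1": 3, "a2": 12, "b1": 2, "b2": 8}
print(anytime_move("root", lambda n: tree.get(n, []),
                   lambda n: leaf_value.get(n, 0), seconds=0.1))    # prints a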
The Minimax algorithm, however, does some unnecessary work traversing the entire k
ply game tree. There are situations when a significant part of the tree can be pruned without
affecting the correctness of the minimax value calculation. We look at two such algorithms
below.
the value of nodes) is being explored. It has generated and evaluated its first child α₃. Should it
generate and evaluate its second child J?
The node β₃ has a value β₃ = α₃ received from its first child. This becomes an upper
bound for the node and it can only get lower if a smaller value is received from any of its other
children. Moreover, this value must be higher than that of its parent α₂ if it is to be selected, and
lower than β₂, higher than α₁, lower than β₁, and higher than α₀. Only then does it have a chance
to be the minimax value of the root node. If any of these conditions fails to hold, then β₃ need not
be refined any further and can be pruned.
Let α = max{α₀, α₁, α₂, α₃} and β = min{β₁, β₂, β₃}. Then the node J should be explored
only as long as β > V(J) > α, where β is the lowest β value among its ancestors, and α is the
highest α value among its ancestors. Else J can be pruned. A cutoff is said to have occurred at its
parent node.
If J is an Alpha node and its parent has a β value that is lower than the highest Alpha
ancestor value α, then J is not generated and its Beta parent is not evaluated any further. We say
that an α-cutoff has happened. An α-cutoff is an α-induced cutoff because some Alpha ancestor
has a value α, and the Beta node, which has a lower value, cannot contribute to changing it.
Likewise, a β-cutoff happens when an Alpha node ceases to be evaluated because it
already has an α higher than some β ancestor, which will block that value.
The algorithm AlphaBeta is in fact a technique of pruning added to the Minimax
algorithm and is said to have been introduced by many people at different times. The algorithm
is described in Algorithm 8.3.
Figure 8.8 Should AlphaBeta generate and evaluate node J? Only if β₃ is greater than all the α
ancestors and smaller than all the β ancestors.
As can be seen, the algorithm operates in a window with an upper bound β and a lower
bound α. Initially β = +Large and α = -Large. Gradually, as the tree is traversed, better values
are found and the α-bound increases and the β-bound decreases. One can see this as a shrinking
window. If at any point an α-value becomes higher than the β-bound, the window shuts and
a β-cutoff occurs. Likewise, if a β-value becomes lower than the α-bound, then an α-cutoff
occurs.
Algorithm 8.3. The AlphaBeta algorithm augments the Minimax algorithm with cutoffs,
pruning parts of the game tree that cannot contribute to the minimax value of the root.
The value α of a Max node cannot go higher than a β-bound received from an ancestor.
Likewise, the β of a Min node cannot go lower than an α-bound imposed by an ancestor.
Initially α = -Large and β = +Large.
AlphaBeta(N, α, β)
1   if N is a terminal node
2       return eval(N)
3   if N is a MAX node
4       for each child C of N
5           α ← max(α, AlphaBeta(C, α, β))
6           if α > β then return β
7       return α
8   else                                ▷ N is a MIN node
9       for each child C of N
10          β ← min(β, AlphaBeta(C, α, β))
11          if α > β then return α
12      return β
In Line 5 a higher α-value is selected and returned in Line 7, unless a β-cutoff occurs in
Line 6 because the α-value being computed has become higher than the β-bound. Likewise, in Line
10 a lower β-value is selected, and the α-cutoff occurs in Line 11 if it becomes lower than the
α-bound. As before, in Minimax, the value eval(N) is returned for a node on the horizon in the
base case in Lines 1 and 2.
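A Python rendering of Algorithm 8.3, on the same toy-tree representation as the Minimax sketch earlier; the cutoff tests mirror Lines 6 and 11.

LARGE = float("inf")

def alphabeta(node, alpha, beta, is_max, children_of, eval_fn):
    # Minimax with alpha-beta cutoffs, following the pseudocode above.
    children = children_of(node)
    if not children:
        return eval_fn(node)
    if is_max:
        for c in children:
            alpha = max(alpha, alphabeta(c, alpha, beta, False, children_of, eval_fn))
            if alpha > beta:            # beta-cutoff
                return beta
        return alpha
    for c in children:
        beta = min(beta, alphabeta(c, alpha, beta, True, children_of, eval_fn))
        if alpha > beta:                # alpha-cutoff
            return alpha
    return beta

tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
leaf_value = {"a1": 3, "a2": 12, "b1": 2, "b2": 8}
print(alphabeta("root", -LARGE, LARGE, True, lambda n: tree.get(n, []), leaf_value.get))
# The value is again 3, but the leaf b2 is never evaluated: once b1 yields 2,
# below the alpha of 3 established at a, node b suffers an alpha-cutoff.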
Figure 8.9 shows the subtree explored by AlphaBeta on the 4 ply game tree of Figure 8.5.
AlphaBeta explores 26 of the 36 leaf nodes, where Minimax would have inspected all the
36 nodes.
In general AlphaBeta inspects fewer leaves than does Minimax, and the deeper the ply
depth, the more the pruning happens as entire subtrees are discarded. The amount of pruning
that AlphaBeta does depends upon the order in which good nodes are encountered during
DFS. As an exercise, the reader is encouraged to construct a tree by filling in leaf node values
such that no cutoff takes place searching from left to right. Then on the same tree simulate the
algorithm searching from right to left and compare the result. One can then observe that the
earlier the best moves are found in search, the greater will be the pruning. The intuition here is
that the later moves are worse and hence cut off.
Figure 8.9 The AlphaBeta algorithm inspects 26 of the 36 nodes on the 4 ply game tree of
Figure 8.5. The cutoffs are indicated by double lines. As can be seen, there is one β-cutoff and
six α-cutoffs.
The algorithm can be further optimized as follows. While traversing the tree one also
computes the (minimax) value for each of the internal nodes. These values can be utilized to
order the moves in the next round, generating better children first, so that they appear earlier
in the tree. This is expected to result in greater pruning. The same can also be done during the
different passes if one is implementing a DFID version for a tournament. And finally, after
having made a move, one can also “think in opponent’s time” by doing another search to order
the moves.
But instead of ordering the move generation, can one explore the tree with a sense of
direction? It turns out the answer is yes, provided one does some preliminary work to get an
estimate of the value resulting from each choice. We look at a best first approach next.
Refine the best candidate or partial solution, till the best candidate is fully refined.
Unlike our earlier heuristic search algorithms, the notion of best in algorithm SSS* is
defined in a domain independent fashion, and not by a user defined heuristic function. To
understand that, we need to return to the definition of a strategy. Recall that a strategy is a
subtree that freezes the choices of one player. Figure 8.10 shows two strategies for Max in a
tiny 4 ply game tree. The first strategy is shown with solid arrows in which Max chooses the
Figure 8.10 A strategy for Max, shown with solid arrows, contains the leaves with values 14, 5, 12, and 8. The shaded leaf node represents a partial strategy and its value is an upper bound on the value of the strategy. It is also part of another strategy, shown in thick dashed lines, in which Max chooses leaves 9 and 13 instead of 12 and 8. The shaded leaf node represents a cluster of two strategies with an upper bound of 11 on both.
leftmost branch at each choice point. The four leaves in this strategy have values 14, 5, 12, and
8. The value of this strategy is 5, the minimum of the values of the leaves. If Max were to adopt
this strategy, Min would drive the game towards the node with value 5, since now only Min
gets to choose.
The shaded leaf node in the figure is a partial strategy, with only one leaf identified. It is
also part of the second strategy in which Max chooses the right branch at its level 3. This is
marked with thick dashed edges and contains nodes 9 and 13 instead of 12 and 8. This strategy
also has a value 5. The shaded node is part of both strategies and is itself an upper bound
on both the strategies. The node thus represents a cluster of two strategies. The following
observation is pertinent:
Any leaf node in a game tree represents a cluster of strategies, and its value is an upper
bound on the values of all the strategies in the cluster.
When we look at strategies for Max, we view the problem of playing the game essentially
as a one person or single agent problem, where the task is to find the optimal strategy, the one
with the maximum value. To guarantee finding the optimal strategy, we must consider all the
strategies available for Max. But the number of strategies is huge. The above tiny tree has eight
distinct strategies. Earlier in Figure 8.3 we had looked at a procedure to count strategies in an
arbitrary game tree. For a game tree with all leaves at ply depth k and a constant branching factor b, the count can be expressed as a formula. If k is even, then let n = k/2, else n = (k + 1)/2. Here
n represents the number of layers where Max has to choose. At level 1 there are b choices for
Max, at level 3 there are b^b choices, at level 5 there are b^(b·b) = b^(b^2) choices, and so on. Adopting the sum-product procedure from Figure 8.3, we get the following expression for the total number of strategies:

number of strategies = b^((b^n - 1)/(b - 1))
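As a quick sanity check on this expression, the short sketch below counts strategies directly by the sum-product rule on a uniform tree and compares the result with the closed form; the function names are ours.

    def count_strategies(is_max_level, remaining_plies, b):
        """Sum-product rule: sum over choices at a Max node, product over responses at a Min node."""
        if remaining_plies == 0:
            return 1
        child = count_strategies(not is_max_level, remaining_plies - 1, b)
        return b * child if is_max_level else child ** b

    def closed_form(b, k):
        n = k // 2 if k % 2 == 0 else (k + 1) // 2   # number of layers where Max has to choose
        return b ** ((b ** n - 1) // (b - 1))

    for b, k in [(2, 4), (2, 6), (3, 4)]:
        assert count_strategies(True, k, b) == closed_form(b, k)
    print(closed_form(2, 4))   # 8, the count for the tiny 4 ply tree of Figure 8.10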
Clearly, a brute force approach inspecting all strategies is not desirable. Instead, we follow
a strategy analogous to the B&B for TSP from Section 6.1.2. We identify an exhaustive set of
partial solutions, or partial strategies, each representing a cluster or set of strategies. We keep
track of the lowest upper bound on each cluster and refine the best (highest value) cluster till
the best cluster is a fully refined one. The following method identifies clusters of strategies that exhaustively cover the space of all possible strategies. Each cluster is represented by a single node. Starting with the root, we identify the clusters as follows: at each Max level we retain all the branches, and at each Min level we select exactly one branch.
How many clusters does Max need to consider in a game tree? If the branching factor is a constant b, then Max has b choices or clusters at the root, which is at level or ply 1. Since we select only one branch at the Min level 2, we still have b clusters for a 2 ply search. For each of these b choices, there are b further choices at level 3, giving us b^2 clusters. At level 5, each of these b^2 choices can be refined into b further choices, giving us b^3 clusters. In general, at level k, where k is odd, one has b^((k+1)/2) clusters. If k is even, the formula becomes b^(k/2).
The game tree in Figure 8.10 has eight strategies in four clusters as shown in Figure 8.11.
At the Max level we select all branches, and at the Min level we select one. Without loss of
generality, we choose the leftmost branch at the Min level. The four clusters are named A, B,
C, and D in the figure.
The value of each leaf is an upper bound on the strategies it represents and serves as the
heuristic function that guides search. Each node is a partial strategy and a candidate for refinement.
In the above tree, the node with value 21, representing cluster C, happens to be the most promising and is selected. On refinement, its sibling with value 12 is included in the cluster. But now the value of cluster C drops to 12, and SSS* shifts its attention to cluster A with the upper bound 14. This jump in the search space is characteristic of the best first search behaviour studied earlier in the book.
When cluster A is refined, its value drops to 5 and attention shifts to cluster D. The sibling
of 13 is 14, and the value of D does not change. Figure 8.12 depicts the game tree at this stage.
After refining clusters C and D, cluster C gets pruned because of the Max parent above them.
This is like an alpha cutoff. The shaded nodes depict the remaining three contenders, with
cluster D with value 13 still leading the pack. To refine cluster D, we have to shift to its Max
sibling, and solve it as long as it does not become greater than or equal to 13. If it were to
exceed 13 or become equal to it, then the Min parent would induce a beta cutoff.
Refining cluster D now entails recursively solving its sibling with an upper bound of 13. As
shown in the figure, this results in two clusters D1 and D2 with value 12 and 7 respectively. The
reader should verify that solving these will result in a minimax value of 12 for the sibling of D,
which now takes on the mantle of cluster D. What is more, there are no more Max siblings and
the Min parent is completely solved with a value 12.
Figure 8.11 The 4 ply game tree with branching factor 2 has eight strategies in the four
clusters named A, B, C, and D. The thick arrows show how these are identified. All choices at
the Max level, and one choice at the Min level. Each shaded node represents a cluster of two
strategies.
Figure 8.12 The game tree as seen by SSS* after inspecting three more leaf nodes. Cluster C has been pruned, its value 12 being dominated by the value 13 from cluster D. Cluster D is the next cluster to be refined, which involves recursively solving its Max sibling with an upper bound of 13. Cluster D now splits into two, D1 and D2.
The tree as seen by SSS* is depicted in Figure 8.13. At this point the best cluster D1 is
fully refined, and the other clusters have lower upper bounds. The algorithm SSS* terminates
with the minimax value 12. The reader should verify with Figure 8.10 that this is indeed the
minimax value.
Figure 8.13 Algorithm SSS* terminates when cluster D1 is fully refined with value 12, higher
than other partial clusters, which have upper bounds lower than 12.
As is evident from this example, the algorithm has two phases. In the forward phase, we
traverse the game tree up to the horizon and identify the clusters. Then as we refine the clusters,
the values are propagated upwards. Having solved an internal Max node, we again embark on
the forward phase to solve a sibling, whose value is again arrived at by propagation from the
terminal nodes.
In the following section we describe an iterative version of the SSS* algorithm. The reader
is encouraged to compare the algorithm with the AO* algorithm in Section 7.3.2 and observe the
similarities. The game tree as seen by SSS* is similar to an And-Or tree, with Max nodes being
Or nodes, and Min being And nodes, from the perspective of Max. Interleaving the forward and
backward phases happens because Max has to consider all responses from Min, which is like
an And node. We even use the same terminology of classifying each node as live or solved, and
also terminate when the root node is labelled solved.
The node structure is a triple <name, status, h> where name identifies the node, the status
is live or solved, and h is the estimated upper bound value. In addition, we assume that we have
a function that can identify when a node is terminal, or on the horizon of search. The algorithm
is described in Algorithm 8.4.
We begin in Lines 1 and 2 by inserting the root node <root, live, +Large> in the priority
queue as a live node with +Large as the upper bound. Remember that +Large stands for a win
for Max. The following are the action cases when we pop the top element N from the priority
queue:
If the node is the root node and is solved (Lines 4-6), the algorithm terminates and returns
the minimax value.
If the node N is a live terminal node (Lines 7-9), then we invoke the evaluation function
eval(N), choose the smaller of eval(N) and h, the existing value, and change the status to solved.
Note that in the initial forward phase, the h value is +Large, but this may not be the case when
we solve a sibling of a solved Max node. We then push the node back into the priority queue.
Algorithm 8.4. Algorithm SSS* maintains a priority queue of partial strategies sorted
on their estimated value. Each partial strategy represents a cluster of complete strategies.
SSS*(root)
1   OPEN ← empty priority queue
2   add (root, live, +Large) to OPEN
3   loop
4       (N, status, h) ← pop top element from OPEN
5       if N = root and status is solved
6           return h
7       if status is live
8           if N is a terminal node
9               add (N, solved, min(h, eval(N))) to OPEN
10          else if N is a max node
11              for each child C of N
12                  add (C, live, h) to OPEN
13          else if N is a min node
14              add (first child of N, live, h) to OPEN
15      if status is solved
16          P ← parent(N)
17          if N is a max node and N is the last child
18              add (P, solved, h) to OPEN
19          else if N is a max node
20              add (next child of P, live, h) to OPEN
21          else if N is a min node
22              add (P, solved, h) to OPEN
23              remove all successors of P from OPEN
If the node N is a live internal Max node, then we add all its children as live nodes with the
same h value. For a Min node we only add one child. This is in effect the process of identifying
the clusters in the forward phase (Lines 10-14).
If the popped node N is a solved node (Line 15), then three cases arise. If N is a Max node
and also the last child of its parent, then its parent P is solved as well, and is added to the priority
queue as a solved node (Lines 17-18), else the next sibling of N is added to the queue as a live
node with the h value copied from N (Lines 19-20). This h value serves as an upper bound for
solving the sibling recursively. Finally, if N is a solved Min node (Line 21), then, since it was at
the head of the priority queue, it is better than all its Min siblings. We prune the siblings (alpha
cutoff) and add the Max parent of N as a solved node with the same h value (Lines 22-23).
The algorithm SSS* thus identifies nodes on the horizon to form clusters of strategies and
uses the value of each node to serve as a heuristic value to guide search. Since the value of a
node is an upper bound on the strategies that it is a part of, the algorithm is guaranteed to have
found an optimal strategy when it terminates with a complete strategy being picked from the
priority queue.
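The following is a compact Python sketch of this procedure, under the assumption that the game tree is available explicitly so that a node knows its parent, its position among its siblings, and its children; the Node class and all names are our own scaffolding, not from the text. A max-priority queue is simulated by pushing negated h values into Python's min-heap.

    import heapq, itertools

    class Node:
        def __init__(self, spec, parent=None, index=0, is_max=True):
            self.parent, self.index, self.is_max = parent, index, is_max
            if isinstance(spec, list):
                self.value = None
                self.children = [Node(c, self, i, not is_max) for i, c in enumerate(spec)]
            else:
                self.value, self.children = spec, []

        def is_terminal(self):
            return not self.children

        def is_descendant_of(self, other):
            p = self.parent
            while p is not None:
                if p is other:
                    return True
                p = p.parent
            return False

    LARGE = float('inf')

    def sss_star(root):
        counter = itertools.count()                      # tie breaker for the heap
        OPEN = [(-LARGE, next(counter), root, 'live')]   # max-heap simulated with negated h
        while OPEN:
            neg_h, _, N, status = heapq.heappop(OPEN)
            h = -neg_h
            if N is root and status == 'solved':
                return h
            if status == 'live':
                if N.is_terminal():
                    heapq.heappush(OPEN, (-min(h, N.value), next(counter), N, 'solved'))
                elif N.is_max:                           # a Max node: add all its children
                    for C in N.children:
                        heapq.heappush(OPEN, (-h, next(counter), C, 'live'))
                else:                                    # a Min node: add only its first child
                    heapq.heappush(OPEN, (-h, next(counter), N.children[0], 'live'))
            else:                                        # status is solved
                P = N.parent
                if N.is_max and N.index == len(P.children) - 1:
                    heapq.heappush(OPEN, (-h, next(counter), P, 'solved'))
                elif N.is_max:                           # solve the next sibling with h as upper bound
                    heapq.heappush(OPEN, (-h, next(counter), P.children[N.index + 1], 'live'))
                else:                                    # solved Min node: solve the Max parent, prune the rest
                    OPEN = [e for e in OPEN if not e[2].is_descendant_of(P)]
                    heapq.heapify(OPEN)
                    heapq.heappush(OPEN, (-h, next(counter), P, 'solved'))

    tree = Node([[[3, 5], [6, 9]], [[1, 2], [0, -1]]])   # the same small tree used earlier
    print(sss_star(tree))                                # 5, the same minimax value AlphaBeta finds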
Figure 8.14 shows the subtree explored by SSS* for the game tree of Figure 8.9, whose
visited nodes are copied below for comparison. As can be seen, SSS* explores fewer nodes.
The reader is encouraged to simulate SSS* on the tree and verify that both AlphaBeta and
SSS* find the same minimax value. The AlphaBeta algorithm visits the leaf nodes from left
to right. What is the order in which SSS* visits the leaf nodes on this tree?
Figure 8.14 The game tree of Figure 8.9 as explored by Algorithm SSS*. The shaded leaves
are not inspected. The cutoffs shown are the alpha cutoffs by SSS*. The optimal strategy on
the right is marked by arrows. The leaves explored by AlphaBeta are also shown below the
tree. SSS* does more pruning than AlphaBeta.
infrequent letters like Q and Z. Two of the tiles are blank with a zero score. They can stand for
any letter. The game is played on a 15 x 15 board by placing letters to make legal words on the
board. The first player can make any word which occupies the centre square. Subsequent words
must be connected with existing words, like in a crossword puzzle. All sequences of letters on
the board must be valid words. In addition to the different letters having their own score, certain
squares on the board also amplify the score of the player, with double or triple letter squares
which multiply the letter value correspondingly, and double or triple word scores that multiply
the word score arrived at from the letter scores. Playing all seven tiles a player holds earns
an additional bonus. Figure 8.15 shows a typical combination of words on a Scrabble board.
Observe that there are no sequences of letters that are not words.
Figure 8.15 Word combinations on a Scrabble board. The next player must use the existing
letters to form a new word and at the same time not leave meaningless letter combinations.
For example, one can play HAD vertically on the bottom three squares on the left, but cannot
make FAD because FAS would not be a word horizontally.
Scrabble is of course a game in which one player wins, and is zero-sum in that sense.
But in another sense it is not adversarial because each player is trying to maximize her own
cumulative score. In that sense it is more like a competition. One could even imagine playing
it solo to try and maximize the score. But there is certainly a tactical element to the game,
bringing in adversarial decision making. An advanced player may decide to minimize the
number of openings available to the opponents, since new words have to be connected to
existing ones. Or she may use up or block a triple word opening, employing a dog in the
manger tactic.
The main objective though is to maximize one’s own score. A large vocabulary of words
here is advantageous, and clearly computers possess that. Selecting a subset of the seven tiles
you hold and placing them at an appropriate place on the board is the main task. According to
Brian Sheppard who implemented the program Maven, a typical rack and board position may
have 700 possible moves (Sheppard, 2002). So the branching factor can be very high. It is the
objective of maximizing one’s score that makes it harder. It is often easy to find a word to make,
but a little harder to exploit the letter scores and special squares on the board. A little bit like
the fact that finding some tour in a TSP problem is much easier than finding good or optimal
tours. Most humans adopt a greedy approach trying to maximize the points earned by making
the current word. But carefully leaving useful letters, called the rack leave, for future moves
requires more imagination. As the game proceeds and letters get used up, one can also make
educated guesses about what letters are still in the bag, and perhaps what opponents hold. This
is because we know the initial set of hundred tiles. These kinds of inferences are more common
in contract bridge where we know that the pack has fifty-two cards, but more on that later. In
their program called Inference Player, Richards and Amir (2002) do make such inferences
about the opponent’s rack. It infers that any letters that could have been used for high scoring
words are not on the rack, because otherwise they could have been used. Like Sherlock Holmes,
the fictional detective created by Arthur Conan Doyle, asking9 ‘why did the dog not bark?’
A trie data structure with common prefixes of different words combined together can be
used to look for words. Such a network is known as a directed acyclic word graph (DAWG)
and is a compact searchable structure containing the words in the dictionary (Aho et al., 1974).
Figure 8.16 shows a part of a DAWG for a small collection of words beginning with the letters
G and H. Shaded squares are those where a legal word ends.
Figure 8.16 A compact searchable dictionary can be represented as DAWG. Here we have a
few words beginning with the letters G and H. Shaded nodes represent the culmination of a
legal word.
The problems to solve for a Scrabble player are the following. Given the existing words on the board and the letters on one's rack, identify the words that can be made such that the score is as large as possible, while leaving useful letters on the rack for the future and preventing an opponent from making a high scoring word. The DAWG in Figure 8.16 is useful to look
for words starting with a given letter. It would be interesting to devise a representation that can
also look for words with a given letter somewhere in it, or at a specific location. How does one
retrieve the words - like had, him, gin, or go - that can be formed in the location in the bottom
left identified in Figure 8.15? We leave that as a point to ponder for the reader.
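A plain trie over the dictionary already supports the prefix walks needed here; a DAWG additionally merges common suffixes, which this sketch omits. The toy dictionary and the helper that checks whether a set of tiles (with '?' standing for a blank) can be arranged into a legal word are illustrative only.

    class TrieNode:
        def __init__(self):
            self.children = {}      # letter -> TrieNode
            self.is_word = False    # a shaded node: a legal word ends here

    def insert(root, word):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def can_make(node, letters):
        """Can this multiset of tiles ('?' is a blank) be arranged into a legal word?"""
        if not letters:
            return node.is_word
        for i, ch in enumerate(letters):
            rest = letters[:i] + letters[i + 1:]
            options = node.children if ch == '?' else ([ch] if ch in node.children else [])
            for c in options:
                if can_make(node.children[c], rest):
                    return True
        return False

    root = TrieNode()
    for w in ["go", "gin", "had", "him", "fas"]:   # a toy dictionary
        insert(root, w)
    print(can_make(root, "mhi"))    # True: the tiles can be arranged as 'him'
    print(can_make(root, "ig?"))    # True: the blank stands for 'n' to make 'gin'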
Next, we look at a game that is still a challenge for computer programs. Contract bridge is an incomplete-information team game which is adversarial, being ultimately a zero-sum game.
Figure 8.17 Two teams A and B play a bridge match. In the open room where kibitzers are
allowed, East-West are from team A and North-South from team B. The same deal is played
in the closed room with the opponents holding the same cards as their corresponding players
in the other room.
can play any card. A trick is won by the side whose player has played the highest card in the
suit led in that trick. That player gets to play the first card in the next trick.
In general, the more the tricks a side makes the better, but the actual score in a deal is
governed by a contract, giving the game its name contract bridge. The contract is arrived
at in a bidding phase preceding play, with the side bidding for the highest number of tricks
winning the contract. The rules are a little more elaborate with the suits also being named as
trumps during bidding, but are not of immediate interest to us here. In a team of four bridge
match, the first obligation of the contracting side is to fulfil, or make, the contract. The other
side aims to defeat the contract and if they succeed they would get a positive score. Contracts
can also be challenged during bidding and, in turn, counter-challenged, to multiply the stakes.
In addition, there are levels of contracts called a game, a small slam, and a grand slam which
have incremental hefty bonuses motivating the sides to bid higher, but only when they have the
combined strength in cards to make the contract. Elaborate bidding systems have been devised
to exchange information in pursuit of the par contract. More on that later.
In the bidding phase, each player can see only her 13 cards. When bidding ends, one side
wins the contract and has to make at least the contracted number of tricks. The proceedings
for the contracting side are conducted by only one player, called the declarer, who decides the
cards to be played from both hands, and her partner, called the dummy, has no further say in
the matter. The other pair tries to defeat the contract by winning enough of the 13 tricks so that
the declarer does not succeed. These two players are called defenders and are said to defend
the contract (but that really means they are trying to wreck it). Both the defenders have to make
independent decisions. The dummy’s cards are kept face up on the table and are visible to all.
Each player can thus see two hands, her own and the dummy’s.
The following features make contract bridge a much more complex game than board games
like chess:
• On a table there are four players, forming teams of two each. This requires cooperation
between partners, which entails exchange of information via legal codes and signals.
• A bridge match may have two or more tables playing the same deals.
• The number of different deals or starting positions is 52!/(13!)^4 = 53,644,737,765,488,792,839,237,440,000, which is about 5.36 × 10^28. Every deal is practically a new deal. This contrasts with board games where the starting position is always the same, as is the goal. In bridge the goal is to first bid, and then make, a contract maximizing the score, or payoff, on that deal. The par contract depends on the lay of the cards dealt, but bridge being an incomplete information game, par is not easy to achieve, especially when opponents interfere.
• Each player has incomplete information. In the bidding phase each can see only 13 cards. The number of possible worlds for a player during bidding is C(39, 13) × C(26, 13), which is about 8.4 × 10^16; these counts are recomputed in the short sketch after this list. This is the number of ways in which the hidden cards can be distributed among the other three players. Given that the contract is decided in this phase, each partnership needs to exchange information about their hands. This is done by means of an encoding known as a bidding system.
• When play begins after bidding, each player can see two hands: their own and the dummy's. The remaining 26 cards can be divided in C(26, 13) = 10,400,600 ways. In practice, one defender plays a card before the dummy is put down, called the opening lead, so for the other two players there are only 25 hidden cards when their turn comes, which can be divided in C(25, 13) = 5,200,300 ways.
• The declarer can see the combined assets of her side, but the defenders have incomplete
information of what cards their side holds. Thus the two sides face different kinds of
problems.
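The counts quoted in the list above follow directly from binomial coefficients, and the few lines below recompute them; the expression used for the total number of deals, 52!/(13!)^4, is the standard one.

    from math import comb, factorial

    deals = factorial(52) // factorial(13) ** 4        # distinct deals of 52 cards into four hands of 13
    worlds_bidding = comb(39, 13) * comb(26, 13)       # hidden-card distributions seen by one bidder
    worlds_play = comb(26, 13)                         # hidden distributions once the dummy is visible
    worlds_after_lead = comb(25, 13)                   # for the two players yet to play to trick one

    print(f"{deals:.3e}")            # about 5.364e+28
    print(f"{worlds_bidding:.3e}")   # about 8.448e+16
    print(worlds_play)               # 10400600
    print(worlds_after_lead)         # 5200300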
The key to the game is information exchange over an open channel, by means of public
announcements. This happens both in the bidding phase and in the play phase, and players
attempt to reconstruct and imagine the card holdings of other players. Since information is
exchanged over an open channel, the opponents are listening in too, and this in turn leads to the
possibility of deception. If one knew all four hands, one could analyse the par contract along
with the par line of play, determining the Nash Equilibrium. In practice, bridge players often
attempt to capitalize on the lack of information the opponents face, and try and beat the par. The
following is the kind of information exchange during the two phases of the game.
Bidding
On the face of it, a bid is a contract proposed. If no one bids higher, then that bid becomes the
contract. In bridge the contracts range from 7 tricks to 13 tricks, along with the specification
of a trump suit. The idea of trumps is that even the smallest trump card is higher than all cards
of other suits. Both sides naturally strive to name a suit they have more cards of as trump.
A contract of 7 tricks is the lowest you can bid for, because that leaves 6 for the other side,
and the opponents will have to make 7 tricks too to defeat your contract. For higher contracts
(of n tricks) they need fewer (14 - n) tricks to defeat the contract. A bid of 1S, read as ‘one
spade’, says that one proposes to make 1 + 6 = 7 tricks with spades nominated as trumps. The
four suits themselves are ranked. Clubs is the lowest suit, followed by diamonds, hearts, and
spades. One can even propose no trump (N), saying that there are no trumps. There are thus 35
possible bids for contracts, starting with 1C, 1D, 1H, 1S, 1N, 2C going up to 7D, 7H, 7S, and 7N.
Like in any auction, one can only make a higher bid than the last one, and as the bidding
proceeds, the contracts being bid for get higher and higher. A player can also pass on her turn.
A scoring system determines the payoff for each contract. If the contract is fulfilled the side
gets the payoff, else there is a penalty which is the gain for the other side. Thus the goal is to
bid as high as possible for high payoff, but only as high as being able to fulfil the contract. In
addition to these 35 bids, there is a bid called double which can only be made if the last bid
for a contract is by either opponent and it challenges the contract (though there have been
cases reported when furious players have felt like doubling their partner’s bid!). The double
also raises the stakes for both the payoff and the penalty. The contracting side can counter
the challenge with a redouble, which raises the stakes even further. Bidding ends when three
players pass in succession, except in the opening sequence when after three passes the fourth
player still has a bid.
The objective of bidding is to reach a makeable contract with the highest payoff. Since both
sides may have suits in which they have more cards, they strive to buy the contract with their
suit as trumps. The problem is that each player can see only her cards, whereas the makeable
contract depends upon the combined strength of the side. Hence they need to exchange
information about their holdings. But the only language they can use is the bids. In this way
the bids are also used to encode information to be conveyed. The coding scheme is known as a
bidding system, and there are numerous bidding systems that have been devised. These are not
secret codes, but have to be revealed to the opponents.
A bid, thus, has two facets. One is the encoded information that it seeks to convey to the
partner via a public announcement. The other is the literal meaning of the bid, which is the
contract it specifies. The latter comes into effect only for the final contract.
Bidding systems are designed to encode information to be conveyed. This encoding is
context sensitive. This means that the meaning of a bid depends upon the sequence of bids that
precede the bid, including the bids made by the opponents. Thus bidding systems are quite
elaborate. Even double and redouble have encoded meaning based on the context in which they
appear.
Most commonly a bid encodes two kinds of information. One, the length of a particular suit
held by a player. Consider the 1S bid. If it is the first bid made by a side, an opening bid, then
in most bidding systems it promises five or more cards in spades. But in the sequence when a
player is responding to the partner’s opening bid, say 1D, then it shows four or more spades.
If a player opens with, say, 1C and then bids 1S in the next turn, it usually shows exactly four
cards in spades. The more cards a partnership has in a suit, the better the suit is as a choice for
naming as trumps.
The other is information about high cards. High cards are important because they can win
tricks. Almost all bidding systems use the following measure of high card points (HCPs). An
ace is counted as 4 HCPs, a king 3, a queen 2, and a jack 1. These are known as honour cards.
Each suit has 10 HCPs, and the full pack has 40. The more a side has in excess of the expected
20, the more they are likely to bid and make a high contract. The main goal of bidding is for
each partner to create a picture of the holdings of the partner, and at the same time convey
information about their own holding. This includes information about shape (how many cards
of each suit) and HCPs. The catch is that as one makes more bids, the contract to be made gets
higher, and there is a danger that the contract may not be makeable. One would like to convey
as much as one can in the limited and shrinking bandwidth. A corollary of this constraint is that
a side which perceives itself to be weaker, but with a long suit, can make a tactically high bid,
known as a pre-empt, to consume the shared bandwidth and force the stronger side to guess
with less information.
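The high card point measure described above is trivial to mechanize. The following helper assumes a hand is represented simply as a list of card ranks, which is our own choice of representation.

    HCP = {'A': 4, 'K': 3, 'Q': 2, 'J': 1}   # the honour cards; every other rank counts zero

    def high_card_points(hand):
        """hand: a list of ranks, e.g. ['A', 'K', 'Q', '7', '4', ...]; suits do not matter for HCP."""
        return sum(HCP.get(rank, 0) for rank in hand)

    print(high_card_points(['A', 'K', '7', '4']))          # 7
    full_pack = ['A', 'K', 'Q', 'J'] * 4 + ['spot'] * 36   # four of each honour plus 36 spot cards
    print(high_card_points(full_pack))                     # 40, the HCP total of the full pack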
Bids do not just describe suit length and points as described above. The need for creating
an accurate picture of the combined hands has necessitated the invention of bidding languages
that are like dialogues, with questions being asked by asking bids, answers being given in
response bids, inviting partner to bid higher with invitational bids, showing a red flag with a
sign-off bid, or showing control in a suit by a cue bid, and other mechanisms too numerous to
be described here.
As bidding proceeds, all players are privy to the encoded meaning of bids. It is the ability
to imagine and reconstruct the hidden hands that is the hallmark of an expert. This involves
making inferences from the bids made by other players about their cards. As bidding proceeds,
the level gets higher and higher, each side being careful to not venture into a zone where the
penalty they could concede is prohibitive, and bidding ends when three players pass. It is then
the time to play the cards.
Play
The play of cards essentially entails planning, and counter-planning. The declarer makes
plans to make the contract, and the defenders plan to defeat it. Both sides have to operate with
incomplete information, and algorithms like AlphaBeta are not directly applicable.
In contrast, though, most bridge playing programs currently adopt a Monte Carlo approach
in which they generate a set of sample hands, and then treat each instance as a complete
information game (Ginsberg, 1999; Browne et al., 2012). And they do this for every card they
have to play. By generating a sufficient number of samples, they are able to choose an action
that often works. There have been cases, however, where, because each generated sample has
complete information, the programs can always choose the winning card for that particular
sample. In doing so, the program may choose an inferior line of play which may not work
on the actual hand, where the human expert would choose a different line that would work
whatever the lay of the opponent’s cards is. A typical example is when an opponent queen
needs to be finessed, and the sampling approach gets it right every time knowing all the cards,
but a human expert would adopt an end-play that obviates guessing where the queen is. The
Monte Carlo approach has achieved some success, but it is not clear whether it can perform at
the top human level. Also, being based on sampling and not on planning and reasoning, such
programs cannot explain their decisions.
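In outline, the Monte Carlo approach deals the unseen cards at random many times, solves each deal as a complete information problem, and plays the card that does best on average. The sketch below shows only this control loop; legal_cards, unseen_cards, and the double-dummy evaluator tricks_if_played are placeholders for machinery a real program would have to supply.

    import random
    from collections import defaultdict

    def monte_carlo_choice(legal_cards, unseen_cards, tricks_if_played, num_samples=100):
        """Pick the card with the best average result over randomly sampled layouts of the unseen cards."""
        totals = defaultdict(int)
        for _ in range(num_samples):
            layout = list(unseen_cards)
            random.shuffle(layout)                   # one possible lay of the hidden hands
            for card in legal_cards:
                # tricks_if_played(card, layout) is assumed to solve the resulting
                # complete-information game, for example with a double-dummy solver
                totals[card] += tricks_if_played(card, layout)
        return max(legal_cards, key=lambda card: totals[card])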
A human declarer would adopt a planning approach. First she would count her top tricks,
and determine how many more need to be generated for the contract to be fulfilled. Then she
would delve into her repertoire of thematic techniques, learnt over a period of time by reading
bridge columns and bridge books, and also learning from experts, and even from her own
mistakes. These techniques are conditional plan parts retrieved from memory, adapted for the
current problem, and strung together with other interleaved thematic acts (Khemani, 1989,
1994). A simple example is the finesse, say of an opponent held queen. This involves making
the opponent commit first, to play or not play the queen, because her turn is earlier. If she
does play it you can play a higher card, say the king, and if she does not, then you can play
the jack. The finesse will work if the targeted opponent does indeed hold the queen. Without
any other information, the a priori probability of this being true is 50 per cent. But with extra
information, the a posteriori probability could change, and it is the expert bridge player that
can draw inferences from bidding and play to choose the best line of play from the different
options that present themselves.
There are numerous books written on the subject describing play techniques, for example,
Kelsey (1995) and Love (2010), that describe techniques like safety play, trump control, cross
ruff, dummy reversal, elopement, double squeeze, and many others, and even more exotic
techniques like the backwash squeeze and entry shifting squeeze described in the bible of card
play by Geza Ottlik and Hugh Kelsey (1983).
Defence
The play phase in bridge is not symmetric. One side is trying to fulfil the contract and has
the advantage of fully knowing what is in their armoury, since the declarer can see her hand
as well as the dummy’s. The defenders too can see the dummy’s cards, but cannot see their
partner’s cards. The two of them also have to take independent decisions, and neither knows
the combined strength of their hands, whereas the declarer alone decides what cards the dummy
will play.
To facilitate meaningful cooperation they have to resort to signalling. But the only medium
available to them is the cards that they can play, even while obeying the rules of the game, and pursuing their goal of defeating the contract. Typically the signals involve the choice of otherwise equivalent cards. For example, holding the two small cards, the 2 and 3 of spades, the order in which one plays them when the suit is led can be used to encode a message. A high-low, for example, might signal an even number of cards. The card with which one starts a trick
a 7 or an 8 may deny one, and an honour card lead indicates a touching honour. Such signals
are not secret and have to be disclosed, and the declarer too can draw inferences from them.
Defenders also follow certain heuristics like second-man-low (for information concealment)
and third-man-high (to avoid giving away a trick cheaply). In addition, they also need to plan
how to defeat the contract, albeit individually, coordinated only by the signals. As with declarer
play, such planning draws upon the inferred knowledge that the players have, with their own set
of techniques like the uppercut and trump promotion. Defence is widely considered to be harder
than declarer play, and often requires greater imagination, as illustrated in the deal below.
The fact that both declarer and defence make plans means that their adversaries can try
and recognize their intent and adopt countermeasures. As mentioned briefly above, the ability
to make inferences contributes to choosing plans with higher probability of success. Likewise,
the ability to reason about what others know, known as the ‘Theory of Mind’, gives more
ammunition to an intelligent player to win the battle of wits (Leslie, 2001). Reasoning about
what other agents know and believe is studied in the field of epistemic logic (Fagin et al., 2004)
and is beyond the scope of this book. But we present below an example of such reasoning and
planning from the real world as a motivation for researchers to take up programming bridge
(Khemani and Singh, 2018).
Deception
Deception during bidding consists mainly of announcing some features that one does not
have in the hope of deflecting the opponents from their best contract or dissuading defenders
from the winning defence later in the play. Deception in play, likewise, is aimed to veer the
opponent away from a winning line of play. In either case, the goal is to do better than par.
The following deal is a real life example described in Truscott and Truscott (2004).
Maurice Gray (1899-1968) was a dispatch rider with the British Army during World War I and
a keen bridge player. He was sitting West on the following hand which we analyse from the
perspective of the declarer sitting South. The contract was to make 9 tricks with no trumps, as
shown in Figure 8.18. On the left we see the cards as seen by the declarer, while on the right is
the view that Gray sitting West had. Our discussion below focuses on the manner in which Gray
anticipated the declarer’s plan and spun a web of illusion. We hope that the discussion will be
accessible to even readers not familiar with the game.
The view for the declarer (South):
    North (dummy): ♠ 3   ♥ K J 5   ♦ K J 10 7 6 3   ♣ 8 6 3
    South (declarer): ♠ A 10 7   ♥ A Q   ♦ 9 8 2   ♣ A Q 9 7 4
    East (inferred from bidding): ♠ K Q J x x x x

The view for West:
    North (dummy): ♠ 3   ♥ K J 5   ♦ K J 10 7 6 3   ♣ 8 6 3
    West: ♠ 9 2   ♥ 9 8 7 3   ♦ A Q 4   ♣ K J 5 2
    East (inferred from bidding): ♠ K Q J x x x x
    South (inferred from bidding): a good club suit and good points
Figure 8.18 Maurice Gray, sitting West, led the ♠9 against 3N. The figure on the left shows the cards as seen by the declarer after the lead. On the right we have the cards as seen by Gray. Both have inferred the long spade suit with East.
There was no trump suit and spades were led. East had bid spades and was known to have
a stack of them that could win many tricks if he got the chance after the ace was out of the way.
The first goal of the declarer was to keep East away from getting the lead again. The declarer
adopted the standard play of ducking (letting East win) the first two rounds of spades so that
West would not have any left to play later. South identified the main source of tricks as the
diamond suit, and he had to hope that the ace would be with West. His plan was the following
- after winning the third round of spades with the ace, he would play a small diamond from the
South hand. If West played the ace, he would play small from dummy. And if West played small,
he would play the king from dummy, a kind of finesse, and then play another diamond which
West would be forced to take (either with the ace or the queen if he had that too) but would have
no spades left to play. South could then set up his diamond suit and make the contract.
This line of play is the best for the given diamond suit and the known information. Four
cards of diamonds are out, and the declarer does not know how they are distributed. He has
to hope and pray that West held the ace. There are eight possible ways the four outstanding
diamonds can be divided when West has the ace, as shown in Figure 8.19. Of these the declarer’s
play would succeed in seven, shown in unshaded rectangles on the left. It would fail only in the
possible world shown as a shaded rectangle at the bottom, because even though the ace is with
the West as hoped, East still has the queen along with two small cards and would command a
trick later. But the actual case (as in Figure 8.18) was the one shown in a solid line rectangle,
and the declarer was destined to succeed.
West            East
♦ A Q 5 4       (none)
♦ A 5 4         ♦ Q
♦ A Q 4         ♦ 5        (the actual case, shown in a solid line rectangle)
♦ A Q 5         ♦ 4
♦ A Q           ♦ 5 4
♦ A 4           ♦ Q 5      * (a brilliant play: discarding the ace before the declarer plays a diamond converts this into a failing case)
♦ A 5           ♦ Q 4
♦ A             ♦ Q 5 4    (the shaded case, the only one in which the declarer's plan fails)
Figure 8.19 There are eight possible ways the diamonds could be divided assuming West had the ace, as shown on the left. Of these, only the last one, in the shaded rectangle, was the case in which the declarer's plan would fail. It would succeed in the other seven, including the solid line rectangle. However, if the cards were in the possible world marked with an *, then West had a brilliant play available, of discarding the ace of diamonds earlier, converting that into a failing case.
Maurice Gray, however, had other plans. He imagined a possible world in which the queen
was actually with his partner. The imagined case is marked with an * in the above figure. Now,
if that were the real world, a brilliant player could jettison the ace of diamonds on the third
round of spades. Then he would be only left holding diamond 4 as shown on the right, and the
declarer could no longer set up his diamond suit without yielding a trick to the queen with East,
who would then run his spade winners. But only if Gray’s diamonds were the ace and the 4, as
in the marked case. Nevertheless, Gray pretended that that was indeed the case and discarded
his diamond ace on trick three when spades were continued.
This would have had no impact, except the loss of a diamond trick, had the declarer
embarked upon his plan in the diamond suit. But the declarer made the inference that Gray
wanted him to make - that the real world was as in the starred case, and that Gray had made the
brilliant discard. Otherwise, why would he discard the ace? The declarer then abandoned his
diamond suit plan and went after the club suit, which, to the discerning reader, should be clear
was destined to fail. The declarer fell into a trap because he was a thinking player, who could
make inferences. A lesser player would have failed to draw the intended inferences.
Contract bridge then is a complex multidimensional game that combines various kinds
of reasoning - communication, multi-agent reasoning with incomplete information, planning,
epistemic reasoning with possible worlds and probabilities, plan recognition, and counter-planning.
The most important is the ability to imagine possible worlds, the ability to imagine what the
opponent knows (the Theory of Mind), the ability to draw inferences and augment one’s imagined
construction of the hidden hands, and monitor plans dynamically as play proceeds. This is complex
reasoning. There are no simple or ‘neat’ ways of solving complex problems.
Summary
Games have been proposed as a platform for AI research because they can be implemented
easily with minimal interface sensing and acting with the real world. The rules in games are
well defined, and performance is easy to measure. And yet they require considerable skill.
For more than half a century chess and then go have given us a problem where the search
space is humungous. And we have learnt to surmount them with greater computing power and
the ability to learn better evaluation functions, most recently with deep reinforcement learning.
The ideal evaluation function would suffice to be used with one ply search. If we can achieve
that, it would encapsulate the impact of future possible moves into one function. Will we attain
that stage of knowledge where one look at the board position will reveal the best move to a
program? We wait with bated breath.
Meanwhile, games of incomplete knowledge confront us with a problem where the future
is covered with fog. In backgammon we do not know what moves the dice will offer the two
players. In Scrabble we do not know what tiles our opponent holds and what tiles are still in the
bag. And yet these games have been conquered with a combination of reinforcement learning
and efficient dictionary search.
It is perhaps time to move on to the next challenge: the multiplayer knowledge rich game of contract bridge, which requires players to communicate with their teammates, make inferences from the actions of others including what plans they have, and employ probability and deception in pursuit of a win. The state-of-the-art programs have relied on Monte Carlo techniques sampling the lay of cards, eschewing the human approach of bringing in tactical knowledge and reasoning. Will these approaches be sufficient for playing better than us? They have been in board games.
Edward Feigenbaum has been credited with a quote saying that just as aerodynamics enables us
to build aeroplanes without mimicking how birds fly, as Daedalus is reputed to have attempted,
artificial intelligence will find solutions that do not mimic human thought. Well, contract bridge
does pose a challenge for AI.
Exercises
1. Does the following game, styled like the prisoner's dilemma, have a stable equilibrium? If yes, show how. If no, why not? What would happen if the payoffs in the bottom right square were to be (-30, -30)? Can you design payoffs such that there is a stable equilibrium in which A always betrays and B always cooperates?
2. What is the size of the game tree for noughts and crosses?
3. Draw the game tree for noughts and crosses and compute its minimax value. You can
simplify your task by eliminating rotated and mirror positions. For example, instead of the
nine first moves, you can consider only three distinct moves.
4. Solve the game tree of Figure 8.1.
5. [Baskaran] The figure shows a 4 ply game tree with evaluation function values at the
horizon. The nodes in the horizon are assigned labels A, B, C, ...,P. Use these labels when
asked to enter a horizon node or a list of horizon nodes.
[Figure: a 4 ply game tree with the horizon nodes labelled, left to right, A B C D E F G H I J K L M N O P]
List the horizon nodes in the best strategy for Max. Enter the nodes in the ascending order
of node labels.
6. For the above game tree, show the subtree explored by the AlphaBeta algorithm.
7. Modify the Minimax algorithm in Algorithm 8.2 to explicitly keep track of depth. Hint:
The original call must have the depth k as a parameter, and recursive calls must reduce the
value of k. A terminal node will have k = 0.
8. Flip the game tree of Figure 8.5 and simulate the AlphaBeta algorithm. Alternatively,
simulate it from right to left on the same tree. How many leaf nodes does AlphaBeta
inspect now?
9. Take the game tree of Figure 8.5 and enter new values for the leaves such that AlphaBeta
does no pruning searching from left to right. For the same tree try simulating AlphaBeta
from right to left.
10. Flip the game tree of Figure 8.9 and simulate the AlphaBeta algorithm. Alternatively,
simulate it from right to left on the same tree. How many leaf nodes does AlphaBeta
inspect now?
11. Simulate the AlphaBeta algorithm from right to left on the game tree from Figure 8.5
with 36 leaf nodes below. The values in the boxes are the values returned by the evaluation
function. How many nodes does this traversal visit?
12. Take the game tree of Figure 8.9 to simulate the SSS* algorithm. List the order in which
SSS* visits the leaf nodes.
13. For the following game tree, show the order in which the leaf nodes are visited by the SSS*
algorithm. Assume that the algorithm chooses the leftmost child at a decision point. What
is the value of the game tree? Mark the move that Max will make.
[Figure: game tree with leaf values, left to right: 20 10 8 15 14 8 20 6 15 14 13 12 9 14 16 11 8 13 8 12 9 14 11 13 16]
14. Draw the above game tree after algorithm AlphaBeta explores it searching from left to
right. Clearly mark the alpha and beta cutoffs.
15. Fill in the leaf values in the following game tree such that there is maximum pruning done
by algorithm AlphaBeta searching from left to right. Choose your date of birth in DD
format as the value of the leftmost leaf. How many leaves does AlphaBeta inspect?
16. For the above tree fill in the leaves such that algorithm SSS* does maximum pruning.
17. For the above tree fill in the values such that algorithm AlphaBeta searching from left to
right does no pruning.
18. Is it possible to fill in values in the leaves such that algorithm SSS* does no pruning? If yes,
fill in the values such that that happens. If your answer is no, justify your answer.
19. Implement a program to play noughts and crosses with a user on the computer screen.
Chapter 9
Automated Planning
So far in this book we have not thought of plans as explicit representations. True, we
have referred to the path from the start node in the state space to the goal node as a
plan, but that has been represented as a sequence of states. When we looked at goal
trees we could also think of the solution subtree as a plan. Likewise, the strategy found
by the SSS* algorithm is also a plan. But even here the intent of the problem solving agent is captured in terms of what state or board position the player will move to.
In this chapter we see problem solving from the perspective of actions. We represent
plans explicitly, and the agent goes about the task of synthesizing a plan. At the simplest
level, a plan is a sequence of named actions designed to achieve a goal. We begin with
planning in the state space and move on to searching in the plan space. We also look at
a two stage approach to planning with the algorithms Graphplan and Satplan.
An intelligent agent acts in the world to achieve its goals. Given the state of the world it is in,
and given the goals it has, it has to choose an appropriate set of actions. The process of selecting
those actions is called planning. Planning is the reasoning side of acting (Ghallab, Nau, and
Traverso, 2004). Planning and acting do not happen in isolation. A third process is an integral
part of intelligent agency - perceiving. An agent senses the world it is operating in, deliberates
upon its goals to produce a plan, and executes the actions in the plan. This is often referred to
as the sense-deliberate-act cycle. The entire process may need to be monitored by the agent.
Since the world may be changing owing to other agencies, it may even have to modify its plans
on the fly.
There has been considerable work on autonomous agents that plan their activity. This
became necessary in space applications where communication with Earth takes too long,
necessitating autonomy. This was the case with the Mars rovers experiments by NASA, and
even ten years after landing on Mars the rover Curiosity is still active.1 Likewise, autonomous
submersible vehicles were designed to explore the deep oceans on Earth with a system called
1 https://mars.nasa.gov/news/9240/10-years-since-landing-nasas-curiosity-mars-rover-still-has-drive/, accessed 4
October 2022.
Teleo-Reactive Executive (T-REX) (McGann et al., 2008; Rajan et al., 2009). Controlling an
underwater autonomous vehicle (UAV) requires the system to follow the sense-deliberate-
act cycle of an autonomous agent. Autonomous robots are being sent into volcanoes to study
eruptions, in preparation for exploring Jupiter (Caltabiano and Muscato, 2005; Andrews, 2019).
Teams of tiny robots are being employed for search and rescue missions (Murphy et al., 2008).
More recently in 2022, driverless taxi services are being experimented with in San Francisco.
We focus on the deliberation part, planning, of this autonomous activity.
All the three (the start state, the goal description, and the actions) are expressed in a language suitably chosen from a family of languages for domain representation as described below. The languages in the family vary in expressiveness.
The goal of planning is to produce a plan which when applied in the start state will result
in a goal state. In the simplest case, a plan is a sequence of actions.
9.1 Representation
The goal in domain independent planning is to devise general algorithms that work in diverse
domains. The first step in this exercise is for the user to describe the domain, that is, the states
and the actions. The planning community has come up with a family of languages to do so.
These are called planning domain definition languages (PDDL), which can describe domains
with varying degrees of richness (McDermott, 1998; Fox and Long, 2003; Edelkamp and
Hoffmann, 2004; Gerevini and Long, 2005). The common theme in the family is to adopt the
use of predicates from logic to describe the world by a set of propositional sentences, each
describing some aspect of the state. This is like the working memory in rule based systems
from Chapter 7. Each sentence in the language is like a working memory element, except that
we adopt the syntax from first order logic. For example, the sentence Filled(cup12, tea) may be
used to express the fact that cup12 contains tea, and Holding(Mahsa, cup12) may express the
fact that Mahsa is holding cup12. These facts can change with activity, and we call them fluents,
sentences whose truth value can change with time.
one would have to describe the world at each time step. A variation is to extend first order logic
itself to handle time and change. In event calculus one does this with higher order predicates
that take a fluent as one of the arguments, the other being time (Shanahan, 1999). Then the fact
that Mahsa was holding the cup at time t1 would be expressed as HoldsAt(Holding(Mahsa,
cup12), t1) where HoldsAt(f,t) is an event calculus predicate that asserts that fluent f is true at
time t. Another event calculus predicate Initiates(e,f,t) asserts that event e happening at time t
results in fluent f becoming true after that. This establishes a causal relation between an action
and a resulting fluent. For example, one might say that if Mahsa is holding the cup and puts it down at time t, then it will be on the table, and she would not be holding it thereafter. This is done by two statements of the form consequent ← antecedents.
The above two statements capture the relationship between actions and their effects in the
domain. Another event calculus predicate Happens(e, t), which states that event e has happened
at time t, enables one to infer the statements that actually become true when actions happen.
HoldsAt(Ontable(cup12), t2) ←
    Happens(Putdown(Mahsa, cup12), t1)
    ∧ Initiates(Putdown(Mahsa, cup12), Ontable(cup12), t1)
    ∧ t1 < t2 ∧ ¬Clipped(t1, Ontable(cup12), t2)
This inference rule states that if it happens that Mahsa put the cup down at time t1, and
nothing happens to clip (undo) the resulting fluent, then at time t2 greater than t1 it would be
on the table.
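The flavour of this reasoning can be captured in a few lines: a fluent holds at time t2 if some earlier event initiated it and no intervening event clipped it. The dictionaries below are our own simplified stand-ins for the Happens, Initiates, and Terminates predicates, not a full event calculus.

    # events: list of (event, time); initiates/terminates map an event to the fluents it makes true/false
    def clipped(t1, fluent, t2, events, terminates):
        return any(t1 < t < t2 and fluent in terminates.get(e, set()) for e, t in events)

    def holds_at(fluent, t2, events, initiates, terminates):
        return any(t1 < t2 and fluent in initiates.get(e, set())
                   and not clipped(t1, fluent, t2, events, terminates)
                   for e, t1 in events)

    events = [("Putdown(Mahsa, cup12)", 1)]
    initiates = {"Putdown(Mahsa, cup12)": {"Ontable(cup12)"}}
    terminates = {"Putdown(Mahsa, cup12)": {"Holding(Mahsa, cup12)"}}
    print(holds_at("Ontable(cup12)", 2, events, initiates, terminates))          # True
    print(holds_at("Holding(Mahsa, cup12)", 2, events, initiates, terminates))   # False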
An early planner developed by Green (1969) was based on a logic based theorem prover.
The logical approach to planning allows an agent to know when fluents (statements) are true,
and what caused them to become true. However, it poses certain bookkeeping problems even
assuming that time is discrete. First, one may need to assert, or infer, the truth value of fluents
at all times resulting in the working memory becoming bloated. Second, and more important,
how does one conclude that a fluent which was true at time t remains true at the next time
instant and thereafter? This is known as the frame problem in literature (McCarthy and Hayes,
1969; Hayes, 1987; McDermott, 1987; Shanahan, 1997). In event calculus this is hidden in the
predicate Clipped(t1, f, t2). Beneath the hood of Clipped is the statement that there was some
event that happened in the intervening period that made the fluent false. As a consequence, one
has to add frame axioms to the knowledge base that say that fluents not affected by any action
continue to retain their truth value. We will look at frame axioms later in this chapter when we
encode a planning problem as SAT.
The planning community has circumvented these problems by doing away with time
itself, and describing only the current state in the working memory. The planning operators
are like the rules from Chapter 7, which add the fluents made true by an action, and also delete
fluents made false by the action. And like rules, they have preconditions for the actions to be
applicable. Actions are instances of operators and are also called ground operators. An operator
has the following format:
ActionName(arguments)
Preconditions of the action.
Effects of the action.
If the preconditions of the action are true in a given state, then the action is applicable in the
state. If applied, then its effects describe the changes in the state. Different versions of PDDL
allow one to describe both the preconditions and effects at different levels of expressivity.
The algorithms described here will work with the simplest domain PDDL 1.0, where both
the preconditions and effects will be a conjunction of simple relation free atomic formulas.
Towards the end we will comment upon richer domains.
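In this simplest setting an action can be held as three sets of ground atoms, and applying an action is just set manipulation. The sketch below uses our own names, and the illustrative robot action is only in the spirit of the operators described next; its predicate names are assumptions, not taken from the original paper.

    from dataclasses import dataclass

    @dataclass
    class Action:
        name: str
        preconditions: frozenset   # atoms that must hold in the state
        add_list: frozenset        # atoms made true by the action
        delete_list: frozenset     # atoms made false by the action

    def applicable(action, state):
        return action.preconditions <= state

    def apply(action, state):
        """Progress a state (a set of ground atoms) through an applicable action."""
        assert applicable(action, state)
        return (state - action.delete_list) | action.add_list

    # An illustrative ground action; the predicate names are ours, not from the text.
    goto = Action("GoTo(m, n)",
                  preconditions=frozenset({"AtRobot(m)", "InRoom(m, r1)", "InRoom(n, r1)"}),
                  add_list=frozenset({"AtRobot(n)"}),
                  delete_list=frozenset({"AtRobot(m)"}))

    state = frozenset({"AtRobot(m)", "InRoom(m, r1)", "InRoom(n, r1)"})
    print(apply(goto, state))   # AtRobot(m) is deleted and AtRobot(n) added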
9.2.1 STRIPS
It all began with the Stanford Research Institute Problem Solver (STRIPS), one of the earliest
planning programs developed at Stanford (Fikes and Nilsson, 1971). The authors say that their
work was motivated by the frame problem that Green’s program ran into. Their goal was to
separate the logical reasoning part from the search component which they associated with
means-ends analysis (see Chapter 7). The program was written for one of the first autonomous
robots, Shakey, that roamed the corridors of Stanford from 1968 and currently is on display in
the Computer History Museum at Mountain View, California. Shakey could perform tasks that
required planning, route-finding, and the rearranging of simple objects, and has been called the
2 A sentence is something which can be true or false. An atomic sentence has no logical connectives. A proposition is
an atomic sentence without variables.
‘great-grandfather of self-driving cars and military drones’ by John Markoff, historian at the
Computer History Museum.3
The paper by Fikes and Nilsson describes the domain in which a robot has access to some
operators that enable it to move around, climb on boxes, push boxes around, and toggle light
switches. The following operators are adapted from the paper. The meaning of the predicates
used is self-evident.
The above operator says that if the robot is at location m in room r, it can go to location n
in the same room using the operator GoTo(m,n). The operator is applicable if the precondition
AtRobot(m) is true, and when the source and destination are in the same room. If the operator
is applied, then the change is effected by deleting AtRobot(m) and adding AtRobot(n). The
following are some of the other operators:
Given these operators, it is possible to devise a plan in which the robot that is in room 3 goes
to room 1, in which a light needs to be switched on, pushes a box in room 1 to the location of the
light, climbs on to the box, and switches the light on. Observe that we have treated the light as
also being a location. The following might be the start state, described as a set of propositions:
This says that there are lights in rooms 1 and 3, there are boxes in rooms 1 and 2, and there
is a robot in room 3.
Clearly the operators allow a plan in which the robot goes to room 1, pushes the box to the
location of the light, and switches on the light by climbing on the box. But how can it switch
on the light in room 3 where there is no box? The reader is encouraged to add an operator or
operators that will enable the robot to fetch a box from another room.
Figure 9.1 The start state on the left has twenty blocks in some configuration on a boundless
table. There is one robot arm that is empty. The goal description on the right only says that
block A should be on block I, block M should be on E, and E should be on D. Nothing else.
In the figure the start state is shown on the left and is completely specified. The goal
description as illustrated on the right is a partial description of a state. This means that there
may be more than one state that satisfies the goal description. The start state and the goal state
are described by two sets of atomic sentences. Each set stands for a conjunct of propositions.
If a proposition is present in the set, it means that it is true. If not present, then it is false. The
following is the predicate schema for describing states and goals: on(X,Y), onTable(X), clear(X), holding(X), and AE, the last of which says that the robot arm is empty.
When variables in the above predicates are substituted with constants (block names), then
they become propositions. If a proposition is present in a given state description, then it is true.
If a proposition is present in a goal description, then it has to be made true. A state S satisfies a
goal G if the goal propositions are true in the state. That is, G ⊆ S.
The following is the start state description for the problem in Figure 9.1.
The goal description specifies that the following propositions must be true in a goal state: {on(A,I), on(M,E), on(E,D)}.
The operators are defined in STRIPS-like manner as follows. An operator where the
variables are replaced by constants (block names) is an action. Each operator has a set of
preconditions that need to be true in a given state for the action to be applicable in that state.
Each operator has an add list that has propositions to be added, and a delete list that specifies
the propositions to be deleted.
Pickup(X)
Precondition: onTable(X) ∧ clear(X) ∧ AE
Add list: holding(X)
Delete list: onTable(X), AE
Putdown(X)
Precondition: holding(X)
Add list: onTable(X), AE
Delete list: holding(X)
Unstack(X, Y)
Precondition: on(X,Y) ∧ clear(X) ∧ AE
Add list: holding(X), clear(Y)
Delete list: on(X,Y), AE
Stack(X, Y)
Precondition: holding(X) ∧ clear(Y)
Add list: on(X,Y), clear(X), AE
Delete list: holding(X), clear(Y)
The same operators are written again below in PDDL 1.2 which introduces types for objects.
The other changes are that variables are identified by a ‘?’ prefix, and the logical formulas are
written in a list notation introduced by Charniak and McDermott (1985). For example, on(X,Y)
is now written as (on ?x - block ?y - block). Further, the entire domain definition is assembled
in a structured form. Finally, the add list and the delete list are replaced by one effect formula
with deleted items prefixed with a negation (not).
(:action pickup
:parameters (?x - block)
:precondition (and (onTable ?x) (AE) (clear ?x))
:effect (and (not (AE)) (holding ?x) (not (onTable ?x))))
(:action putdown
:parameters (?x - block)
:precondition (and (holding ?x))
:effect (and (not (holding ?x)) (AE) (onTable ?x)))
(:action stack
:parameters (?x - block ?y - block)
:precondition (and (holding ?x) (clear ?y))
:effect (and (not (holding ?x)) (AE) (on ?x ?y) (not (clear ?y))))
(:action unstack
:parameters (?x - block ?y - block)
:precondition (and (on ?x ?y) (clear ?x) (AE))
:effect (and (not (on ?x ?y)) (holding ?x) (clear ?y) (not (AE))))
)
A planning problem is defined by choosing a domain, specifying the set of objects in
that problem, specifying the initial state, and specifying the goal conditions. Consider the
following planning problem, specified in PDDL:
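A PDDL problem definition consistent with the description of Figure 9.2 and with the relaxed plan discussed later in this chapter (blocks A, D, and F on the table, with O stacked on F) might look as follows; the problem and domain names here are illustrative.

(define (problem tiny-blocks)
  (:domain blocksworld)
  (:objects A D F O - block)
  (:init (onTable A) (onTable D) (onTable F) (on O F)
         (clear A) (clear D) (clear O) (AE))
  (:goal (and (on F A) (on A D))))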
The goal description only states that block F must be on block A, which in turn must be
on block D. Nothing is said about where the other blocks are and whether the arm is holding
anything.
Figure 9.2 A tiny planning problem. The given state is on the left, and the goal description is
on the right.
Some more domains are described in exercises, and the reader is encouraged to formulate
them in PDDL or in a STRIPS-like language.
We will look at various algorithms for planning, and we will always use the blocks world to illustrate the algorithms. We begin with state space planning.
S' = γ(S, a) = (S ∪ effects+(a)) \ effects-(a)
One adds the elements in effects+(a) to S and deletes the elements in effects-(a) from S to get S'.
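The progression step can be sketched as a small function. The sketch below is in Python and assumes, purely for illustration, that a ground action is an object with attribute sets pre, add, and delete, and that a state is a frozenset of propositions.

def progress(state, action):
    # S' = (S ∪ effects+(a)) \ effects-(a); None if a is not applicable in S.
    if not action.pre <= state:
        return None
    return frozenset((state | action.add) - action.delete)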
A plan π is a sequence of actions <a1, a2, ..., an>. A plan π is applicable in a state S0 if there are states S1, ..., Sn such that γ(Si-1, ai) = Si for i = 1, ..., n. The final state is Sn = γ(S0, π).
FSSP begins with the empty plan and incrementally constructs the plan by adding new
actions at the end. In the algorithm for FSSP an action is added to a plan by the assignment
π ← π ∘ a
It progresses over a to the state S' as described above and continues searching from there, till it finds a valid plan.
Let G be a goal description. Then a plan π is a valid plan in a state S0 if
G ⊆ γ(S0, π)
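The FSSP loop itself can be sketched as plain forward search. The version below, using the same illustrative action representation as above, performs breadth first search for simplicity; a heuristic planner would instead order the frontier by h(S).

from collections import deque

def fssp(start, goal, actions):
    # Forward state space planning: append applicable actions and progress
    # over them until G ⊆ γ(S0, π).
    frontier = deque([(frozenset(start), [])])
    seen = {frozenset(start)}
    while frontier:
        state, plan = frontier.popleft()
        if set(goal) <= state:
            return plan                            # a valid plan has been found
        for a in actions:
            if a.pre <= state:                     # a is applicable in state
                nxt = frozenset((state | a.add) - a.delete)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, plan + [a]))
    return None                                    # no plan exists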
Algorithm FSSP searches from the start state, finding applicable actions, applying one and
progressing to the resulting state. The main drawback of forward search is the high branching
factor, since the number of applicable actions increases in general with the number of objects in
the state. Figure 9.3 shows the set of actions applicable in the start state in Figure 9.1 and also
in a state resulting from progressing over the action Pickup(J).
Figure 9.3 The applicable moves in the start state of Figure 9.1, and in the state after
progressing over the action Pickup(J).
Even in the simplest domains, it has been shown that the complexity of planning is in
PSPACE (Bylander, 1994). That means the algorithm uses space bounded by a polynomial
function of the size of the input instance. It can, however, take time bounded by an exponential
function of the size of the input instance. This has spurred research in many approaches to
alleviate the complexity. One of the approaches has been to devise heuristic functions in a
domain independent manner.
gs(p) = 0 if p ∈ S
      = min a∈O(p) [1 + gs(precond(a))] otherwise
where O(p) stands for the set of actions that add p. The authors use a simple forward chaining procedure in which the measures gs(p) are initialized to 0 if p ∈ S and to ∞ otherwise. Then, every time an operator op is applicable in S, each proposition p ∈ effects+(op) is added to S and gs(p) is updated to
gs(p) ← min{gs(p), 1 + gs(precond(op))}
These updates continue until gs(p) does not change. The procedure is polynomial in the
number of propositions and actions, and corresponds to a version of Dijkstra’s algorithm
(Chapter 6).
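The sweep can be sketched as follows, again over the illustrative action objects used above. It ignores delete effects and uses the additive aggregation for the cost of a set of preconditions; the two aggregation functions discussed next are included as helpers.

import math

def hsp_costs(state, actions):
    # gs(p) is 0 for p in S and infinity otherwise; relax repeatedly until
    # no estimate changes (a Dijkstra-like forward sweep).
    g = {p: 0 for p in state}
    changed = True
    while changed:
        changed = False
        for a in actions:
            if all(p in g for p in a.pre):
                cost = 1 + sum(g[p] for p in a.pre)    # additive cost of precond(a)
                for q in a.add:
                    if cost < g.get(q, math.inf):
                        g[q] = cost
                        changed = True
    return g

def h_add(g, goal):
    # Additive heuristic: well informed but not admissible.
    return sum(g.get(p, math.inf) for p in goal)

def h_max(g, goal):
    # Max heuristic: admissible but weakly informed.
    return max(g.get(p, math.inf) for p in goal)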
The estimated cost of reaching a set of propositions, for example, gs(precond(a)), is
aggregated from the individual costs for each proposition. The following aggregation functions
defined for the set of goal propositions G are popular. The estimated cost is adopted as the
heuristic value of a state in FSSP.
h(S) = gs(G)
The additive heuristic g+s(G) computes the value as a sum of the costs of each proposition in the goal, g+s(G) = Σ p∈G gs(p).
This heuristic is known to be well informed but is not admissible. The additive heuristic
may overestimate the cost because it assumes that the individual propositions are independent
and achieved individually. That means that if it were used for the A* algorithm, then the plan
found may not be optimal.
The max heuristic instead picks the maximum of the estimates for the individual propositions, hmax(S) = max p∈G gs(p), and is clearly admissible, but not well informed. That would imply searching more of the space.
The original version of HSP used the additive heuristic with HillClimbing (see Chapter 4).
A later version, HSP2, uses the wA* algorithm (see Chapter 6) and can be made an admissible
algorithm with an appropriate choice of heuristic function and the weights. HSP2 also used a
variation on the heuristic function called h2(S). Instead of taking only the costliest proposition
as is done in hmax(S), the h2(S) function looks at the costliest pair of propositions that are
achieved at the same time. Let us say that the two propositions are called p and q. Then the
heuristic function h2(S) is defined as (Ghallab, Nau and Traverso, 2004)
g2s(p) = 0 if p ∈ S
       = min a∈O(p) [1 + g2s(precond(a))] otherwise
g2s({p, q}) = 0 if p, q ∈ S
and
h2(S) = g2s(G)
Another planner that deploys domain independent heuristics in FSSP is the algorithm
FastForward, abbreviated as FF, which also introduces another variation in forward search
(Hoffmann and Nebel, 2001). For the heuristic estimate, it relies on applying the algorithm
Graphplan (discussed later in Section 9.6) on the relaxed problem. Unlike the additive heuristic
of HSP, this does not assume that actions are independent and is more likely to be admissible.
What is pertinent is that Graphplan returns a plan which is a sequence of sets of actions <Set1, Set2, ..., Setk> where each Seti = {ai1, ai2, ..., aip} contains actions that could execute in
parallel in step i. Then, if the goal propositions first appear after Setm, the heuristic function is computed as the number of actions in the relaxed plan,
h(S) = |Set1| + |Set2| + ... + |Setm|
The following two heuristics can be used to prune the search space in any forward search
planning algorithm. The first is identifying the set of helpful actions for any state S. These are
the actions that are in Set1 of a relaxed plan from the state S. Recall that the relaxed plan is
the plan (found by Graphplan) for the relaxed planning problem P' = <S, G, O'>. For the
planning problem in Figure 9.2, the relaxed plan is <{Pickup(A), Unstack(O,F)}, {Stack(A,D),
Putdown(O)}, {Pickup(F)}, {Stack(F,A)}>. Observe that <{Pickup(A), Unstack(O,F)},
{Stack(A,D), Stack(O,A)}, {Pickup(F)}, {Stack(F,A)}> is a relaxed plan too because the
relaxed action Pickup(A) does not delete onTable(A) and clear(A). In both cases Set1 =
{Pickup(A), Unstack(O,F)} and these two actions are called helpful actions, and one of them
can be selected during search. Observe that this plan could be executed in four time steps by
a two armed robot in the blocks world. We say that the makespan of the plan is 4. How long
would a three armed robot take to solve this problem?
The second heuristic involves deleting actions that achieve a certain goal that is destined
to be undone later in the plan. In the same example from Figure 9.2, there are two goals to
be achieved, {on(F,A), on(A,D)}. Clearly achieving on(F,A) first will require it to be undone
when the robot needs clear(A) as a precondition for Pickup(A) needed for achieving on(A,D).
The added goal deletion heuristic works as follows. Let us say that a forward planning algorithm
has reached a state S in which on(F,A) is true. Then Graphplan produces the relaxed plan
<{Unstack(F,A)}, {Pickup(A)}, {Stack(A,D)}> which undoes the earlier goal on(F,A) in the
original unrelaxed problem. If that happens, one can prune the state S from the search space,
and instead find a plan in which on(A,D) is achieved before on(F,A). The authors, however,
point out that such pruning may result in incompleteness in some domains and a plan may not
be found.
When a relevant action a is applied to a goal G, the algorithm regresses to a subgoal G'.
G' = γ⁻¹(G, a) = (G \ effects+(a)) ∪ pre(a)
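Like progression, regression is a one-line set computation. The sketch below uses the same illustrative action representation and treats an action as relevant to a goal if it adds at least one goal proposition and deletes none, which is one common way of making the informal notion precise.

def regress(goal, action):
    # γ⁻¹(G, a) = (G \ effects+(a)) ∪ pre(a), defined only for relevant actions.
    if not (action.add & goal) or (action.delete & goal):
        return None                    # action is not relevant to this goal
    return frozenset((goal - action.add) | action.pre)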
As before, a plan π is a sequence of actions <a1, a2, ..., an>. A plan π is relevant to a goal Gn if there are goals G0, ..., Gn-1 such that Gi-1 = γ⁻¹(Gi, ai) for i = 1, ..., n.
BSSP begins with the empty plan and incrementally constructs the plan by adding new
actions at the front of the current plan. In the algorithm for BSSP, this is done by the assignment
π ← a ∘ π
It regresses over a to the goal G' as described above and continues searching from there, till it finds a plan. Let S0 be the start state. BSSP ends when G0 ⊆ S0. The test for a valid plan still needs to be done by progression over the actions in the plan. The plan π is a valid plan if Gn ⊆ γ(S0, π).
The BSSP algorithm may reach the condition G0 ⊆ S0 but the plan found may not be a
valid plan. This is because regression over operators is not sound as illustrated in Figure 9.4.
This, in turn, is because of the way relevant actions are defined. They only define what goal
propositions are required to be true in goal G'. And as illustrated in Figure 9.4, this can result in
actions that are not applicable in the state which satisfies G'. In fact, for such spurious actions,
the regressed goal is not a (valid) state at all!
Figure 9.4 An illustration of regression as done by backward state space planning, on the
goal description in Figure 9.1. The actions in the shaded rectangles would not be applicable,
even though they are relevant, because the goals regressed to are not valid states. The
regressed goal for Stack(E,D), for example, requires holding(E) and on(M,E) to be true in the
same state.
This is because regression is going against the arrow of time defined by the planning
operators. An action is relevant if it achieves a certain goal (every fluent is a goal). It simply
proposes what needs to be true in the regressed goal if that action is to be applicable. Adding
those conditions may lead to a description that is not a valid state. This is because goal
descriptions are incomplete and the regressed goals could be spurious. In progression, on the
other hand, an action is applicable if its preconditions are true in a given state, and states are
completely described. The resulting description is a valid state, and progression is a sound step.
Regression is thus not a sound step. BSSP is not aware of this danger, and could thus
propose the final three actions as <Stack(E,D), Stack(M,E), Stack(A,I)> which clearly are not
all feasible. It is for this reason that the termination criterion for BSSP cannot just be (γ⁻¹(Gn, π) ⊆ S0) but will need a validity check for the plan π as well.
A planner that searches in the backward direction from the goal is the heuristic regression
planner (HSPr) which is a variation on HSP (Bonet and Geffner, 2001a, 2001b). A major
advantage of HSPr is that it does not have to recompute the heuristic value repeatedly. In
HSP the value gs(p) for a goal proposition p has to be computed from every state S that the
planner is looking at. In HSPr gs0(p) is computed once and for all from the start state S0 for
every proposition p. Then, when the planner regresses to a goal G', the heuristic distance of achieving G' is simply the sum of the distances for each goal p ∈ G'. For any goal G the heuristic estimate is
h(G) = g+s0(G) = Σ p∈G gs0(p)
Figure 9.5 depicts the space for HSPr after it has found the relevant actions for goal G in
the same problem. There are three actions, Stack(A,I), Stack(M,E), and Stack(E,D), that the
algorithm can regress over producing the three subgoals G1, G2, and G3 it has to choose from.
It will use the above heuristic function to make the choice.
Figure 9.5 Given the goal G in the problem from Figure 9.1, HSPr has to choose between three actions it can regress over to reach one of G1, G2, and G3. It will use the heuristics g+s0(G1), g+s0(G2), and g+s0(G3), which sum up the precomputed heuristic estimates for the constituent propositions in the three subgoals, to make the choice.
As discussed above, not all goal sets that backward search regresses to are feasible. This has
been illustrated in Figure 9.4 which shows that the subgoal G2 is not feasible since it requires
both on(M,E) and holding(E) to be true in the same state. Taking a cue from Graphplan, Bonet and Geffner introduce the notion of mutex pairs of propositions that can never be achieved together starting from S0. This would mean that if any such pair of propositions occurs in any
goal, then that goal is not feasible, and that goal can be pruned. In the above example, goal G2
can be pruned because it contains on(M,E) and holding(E). Algorithm HSPr constructs a set M
of mutex pairs as follows. It begins by first constructing a set M0 which is the set of potentially mutex pairs (Bonet and Geffner, 2001a, 2001b). Instead of starting with all pairs of propositions, which would work too, HSPr starts with a smaller set defined as
M0 = MA ∪ MB where
- MA is the set of pairs P = {p, q} where some action a adds p and deletes q. That is, p ∈ effects+(a) and q ∈ effects-(a).
- MB is the set of pairs P = {r, q} such that for some pair P' = {p, q} in MA, there is an action a, such that r ∈ pre(a) and p ∈ effects+(a).
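The construction of M0 can be sketched directly over the same illustrative action objects used earlier; pairs are stored as two-element frozensets since {p, q} is unordered.

def potential_mutex_pairs(actions):
    # MA: pairs {p, q} where some action adds p and deletes q.
    ordered = [(p, q) for a in actions for p in a.add for q in a.delete if p != q]
    MA = {frozenset(pq) for pq in ordered}
    # MB: pairs {r, q} where {p, q} is in MA (p being the added element) and
    # some action has r as a precondition and p as a positive effect.
    MB = set()
    for p, q in ordered:
        for a in actions:
            if p in a.add:
                for r in a.pre:
                    if r != q:
                        MB.add(frozenset((r, q)))
    return MA | MB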
From this set M0 one can extract a subset M* by removing ‘bad pairs’ in M0 which may not in
fact be mutex. The bad pairs are those pairs that do not satisfy the following conditions. Given an initial state S0 and a set of ground operators A, a set M of pairs of propositions is a mutex set iff for all pairs R = {p, q} in M, p and q are not both true in S0, and every action that adds p either deletes q, or does not add q and has a precondition r such that {r, q} is also in M, and symmetrically for q.
holding(F)
clear(A)
Stack(F, A)
on(A,D)
GSP pops holding(F). This is not true in the state S0, and the action Pickup(F) is pushed
onto the stack. We have made this right choice non-deterministically, eschewing Unstack(F,?X),
but an implementation would have to rely on backtracking along with a heuristic that looks at
the current state S to choose between the two. In this illustration, we will, from now on, make
the choice of actions and the order of goals to be added non-deterministically. Action Pickup(F)
along with its preconditions are now pushed onto the stack, which now looks like,
onTable(F)
clear(F)
AE
Pickup(F)
clear(A)
Stack(F,A)
on(A,D)
Next, onTable(F) is popped. It is true in the current state, because onTable(F) ∈ S0, so
nothing needs to be done. Next, clear(F) is popped, but it is not true. The action Unstack(O,F)
is pushed along with its preconditions.
on(O,F)
clear(O)
AE
Unstack(O,F)
AE
Pickup(F)
clear(A)
Stack(F,A)
on(A,D)
The next three goals, on(O,F), clear(O), and AE, are popped one by one and are true in S0.
Now comes the first action to be popped and added to the plan. The state progresses over the
action Unstack(O,F) and we have
AE
Pickup(F)
clear(A)
Stack(F,A)
on(A,D)
The goal AE is popped next, and it is not true. GSP adds the action Putdown(O) to the stack,
along with holding(O). The latter is popped, turns out to be true, and the action Putdown(O) is
popped next and added to the plan. After this, Pickup(F) is popped and added to the plan. Note
that holding(F) is true in the resulting state.
The stack is
clear(A)
Stack(F,A)
on(A,D)
Then goal clear(A) is popped and is true in the revised S. The action Stack(F,A) is added to
the plan next, achieving the first goal on(F,A).
This state is shown in the centre in Figure 9.6. At this point, GSP is only concerned with
the remaining goal on(A,D) which is the only element left in the stack. The reader should work
out the details and verify that it could find the plan <Unstack(F,A), Putdown(F), Pickup(A),
Stack(A,D)>. The state S at this stage is drawn on the right in Figure 9.6. Here we selected
the action Putdown(F) as a relevant action for the goal AE which would be a precondition for
Pickup(A), but it could have stacked F onto O as well. Worse, it could have stacked it back on A and gone into a loop, or even put it on D, thus destroying the precondition clear(D) of Stack(A,D).
Figure 9.6 GSP started with two goals, on(F,A) and on(A,D), on the tiny planning problem
from Figure 9.2. After choosing on(F,A) and achieving it with the plan shown on the left, it
reaches the state in the centre. From there it solves for the goal on(A,D) with the plan shown
on the right ending in the state on the right. In the process it has undone the first goal on(F,A).
After solving the second goal on(A,D) with the plan <Unstack(F,A), Putdown(F),
Pickup(A), Stack(A,D)> we find that the first goal has been undone, as shown on the right in
Figure 9.6. The plan found is not a valid plan. This is a characteristic of planning problems in
which the sub-goals are not serializable. They cannot be solved independently one by one.
Other examples of such problems are the 8-puzzle and the Rubik’s cube, in which one cannot
set serial goals of solving one part and moving on to the next.
One way to address this is to add the compound goal that is to be solved as a conjunct before
adding the constituent goal propositions. This is done by the function PUSHSET(G, stack) in the
GSP algorithm described in Algorithm 9.1, and is also done when pushing the preconditions of
an action. Then, after solving the goal propositions, the compound goal remains on the stack.
When that is popped and if found not to be true in the current state S, it is simply added back to
the stack. There is a danger though of the program going into an infinite loop with this feature
if the compound goal does not have a solution. In Figure 9.6, when the algorithm again chooses
the goal on(F,A), it solves it by picking F and stacking it on to A. The second goal on(A,D) is
already true in the resulting state, and a valid plan has been found, albeit not an optimal one.
The reader should verify that if the algorithm had chosen the goals in a different order, solving
for on(A,D) first and then solving for on(F,A), this checkback step would not have been invoked.
Algorithm 9.1. Algorithm GSP starts by pushing the given compound goal G onto
the stack. It then pushes each constituent proposition g in G onto the stack. If a popped
goal g is not true, the algorithm pushes a relevant action onto the stack, along with its
preconditions using PushSet. When an action is popped, it is added to the plan, and the
state progresses over that action.
GSP(S0, G, A)
1 S ← S0; plan ← < >; stack ← [G]
2 while stack is not empty
3 x ← pop stack
4 if x ∈ A
5 then plan ← <plan ∘ x>
6 S ← Progress(S, x)
7 else if x is a conjunct and is not true
8 then stack ← PushSet(x, stack)
9 else if x is a goal proposition and x ∉ S
10 then CHOOSE a relevant action a that achieves x
11 if none then return FAILURE
12 stack ← Push(a, stack)
13 stack ← PushSet(pre(a), stack)
14 return plan
PushSet(G, stack)
1 Push(G, stack)
2 for each g ∈ G
3 Push(g, stack)
4 return stack
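A compact Python rendering of GSP for the propositional case is sketched below. Where the pseudocode uses the nondeterministic CHOOSE, this version simply takes the first relevant action it finds, so it may need backtracking on problems where a different choice is required; propositions are assumed to be strings and actions the illustrative objects used earlier.

def gsp(start, goal, actions, max_steps=10000):
    # The stack holds goal propositions, compound goals (frozensets) and
    # actions; actions popped off the stack are appended to the plan.
    state, plan = set(start), []
    stack = [frozenset(goal)] + list(goal)         # compound goal below its parts
    while stack and max_steps > 0:
        max_steps -= 1
        x = stack.pop()
        if hasattr(x, "pre"):                      # x is an action: apply it
            state = (state | x.add) - x.delete
            plan.append(x)
        elif isinstance(x, frozenset):             # compound goal: re-check it
            if not x <= state:
                stack.append(x)
                stack.extend(x)                    # re-solve the constituents
        elif x not in state:                       # unsatisfied goal proposition
            a = next((a for a in actions if x in a.add), None)
            if a is None:
                return None                        # FAILURE: no relevant action
            stack.append(a)
            stack.append(frozenset(a.pre))
            stack.extend(a.pre)
    return plan if set(goal) <= state else None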
When GSP is applied to the problem in Figure 9.2, the plan found depends upon the order
of tackling the two goals. Both orderings yield a plan, but the wrong order results in a longer
plan. A plan is found nevertheless because the blocks world domain has only reversible moves.
Contrast this with cooking, where often goals need to be tackled in a particular order. One
cannot, for example, grind the rice and lentil into a batter for making idlis first, and then soak
the rice and lentil mix. In the above example there was a right order for finding the shortest
plan. But even in the blocks world domain there exist planning problems for which there is no
right order of solving the goals.
In the Sussman anomaly, blocks A and B are on the table with block C stacked on A, and the goal is {on(A,B), on(B,C)}. When GSP picks up on(A,B) first, it finds the plan <unstack(C,A), putdown(C), pickup(A), stack(A,B)> to reach the state S = {onTable(C), onTable(B), on(A,B), clear(C), clear(A), AE}.
Next it solves for on(B,C) with the plan <unstack(A,B), putdown(A), pickup(B), stack(B,C)> and the current state becomes
{onTable(A), onTable(C), on(B,C), clear(A), clear(B), AE}
which, as one can see on the left branch in Figure 9.7, is not the goal state.
Figure 9.7 Neither choice of the goal to solve in the Sussman anomaly ends in a valid plan.
On the left GSP chooses to solve on(A, B) first, and on the right on(B, C) first. In both cases
the planner has more work to do.
In a similar manner, choosing the goal on(B,C) first leads to the non-goal state as shown in
the right branch.
We say that the goals are non-serializable. This is also the case in the context of the
8-puzzle (Korf, 1985b). Richard Korf attacked the problem by proposing a learning approach
in which macro-operators could be learnt to move from one achieved sub-goal to the next one,
where the first goal could be disrupted en route but would be restored subsequently. This was
also addressed as a problem of chunking in the cognitive architecture implemented in Soar
(Laird et al., 1985). Most of us learn to solve the 8-puzzle and the Rubik’s cube by learning
such macro-operators. They enable us to find a solution quickly, but it may not be the optimal
solution.
The Sussman anomaly is a problem that cannot be solved optimally by a linear planning
approach like GSP. It can be solved by other approaches though. We next look at searching in
the plan space.
- A is a set of actions from the set of operators O. The set simply identifies the actions that are
somewhere in the plan. The actions may be partially instantiated, for example, Stack(A,?X)
- stack block A onto some block.
- O is a set of ordering links between actions of the form (ai < ak) which says that action ai
happens before action ak in the plan n. The partial plan is thus a directed graph. It imposes
some order on all actions in the plan, but is not a linear order.
- L is a set of causal links of the form (ai, P, ak). Action ai produces a proposition P which is
consumed by action ak for which it is a precondition. Whenever a causal link (ai, P, ak) is
added to a plan, then so is a corresponding ordering link (ai < ak), because ai must happen
before ak.
- IB is a set of binding constraints that specify what values a variable can or cannot take. Thus
a partially instantiated action may be added first, and a binding constraint can be added
later. This conforms to the idea of least commitment.
Let P = <S0 = {s1, s2, ..., sn}, G = {g1, g2, ..., gk}, O> be a planning problem. The PSP algorithm always begins with an initial plan π0 = <{A0, A∞}, {(A0 < A∞)}, {}, {}> where A0 is the initial action with no preconditions and positive effects s1, s2, ..., sn, and A∞ is the final action with preconditions g1, g2, ..., gk and no effects. One can think of π0 as the set of all possible plans. One can refine a given partial plan by the following operators: adding an action to A, adding an ordering link to O, adding a causal link to L, and adding a binding constraint to IB.
We adopt a concise graphical notation as shown in Figure 9.8 to depict partial plans. The
diagrams have shortened names for actions and predicates. Preconditions of actions are drawn
above the action, and effects below. Negative effects are prefixed by the negation symbol ¬.
In the style of Hasse diagrams for partial orders, actions which occur earlier are drawn above
and later actions are drawn below them. Ordering links are drawn only when necessary. Causal
links are drawn explicitly with dashed arrows from the producer to the consumer.
Figure 9.8 A concise notation for actions in partial plans. The action names have been
shortened to Pk for Pickup, Pt for Putdown, Un for Unstack, and St for Stack. Predicate
names have likewise been shortened: ot for on Table, h for holding, and c for clear. The
preconditions are shown above the action and the effects below. Negative effects are marked by a negation sign ¬. For the sake of illustration two causal links are drawn as dashed arrows.
As far as possible ordering links are not drawn, and actions occurring earlier in the partial plan
are drawn above.
The refinement process continues till a solution plan is found. Unlike in state space
planning which tests if a given state is a goal state, POP identifies a solution plan as a plan
without any flaws.
A partial plan can have two kinds of flaws. The first is if a partial plan has an open goal.
An open goal is a precondition of some action in A which is not supported by a causal link. Let
action ak have an open goal P. This can be resolved in two ways.
- If there exists an action ai ∈ A which has a positive effect P, and it is consistent to add the ordering link (ai < ak) to the plan, then add the causal link (ai, P, ak) to L and the ordering link (ai < ak) to O.
- Add a new action an that has a positive effect P to A along with the causal link (an, P, ak) to L and the ordering link (an < ak) to O.
The second kind of flaw is a threat. An action at ∈ A is a threat to a link (ai, P, ak) in L if the following three conditions hold.
- at has a negative effect ¬Q such that Q can unify with P.
- The ordering link (ai < at) is consistent with O, that is, at can happen after ai.
- The ordering link (at < ak) is consistent with O, that is, at can happen before ak.
If all three conditions are met, then the threat will materialize. That is, the threat action happens after the producer of P and deletes P before the consumer can consume it. For example, at might be Stack(B, ?X) which deletes clear(?X). If there is a causal link (Unstack(M,N), clear(N), Pickup(N)), then if ?X were to be N the causal link would be destroyed. One may even treat an action at as a threat if it produces P, because it threatens to make action ai redundant (McAllester and Rosenblitt, 1991; Kambhampati, 1993).
To resolve the threat it is sufficient to negate any of the three conditions. The three threat
resolvers are
1. Demotion: Delay the threatening action to happen after goal P has been produced and
consumed. This can be done by adding an ordering link (ak < at) to O. Remember there is
already an ordering link (ai < ak) in the plan.
2. Promotion: Advance the threatening action to happen before goal P has been produced and
consumed. This can be done by adding an ordering link (at < ai) to O.
3. Separation: Add a binding constraint b to IB that ensures that Q cannot be unified with P.
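For a ground (propositional) partial plan the threat test and its resolvers reduce to a few set and ordering checks. The sketch below assumes a causal link is a triple (ai, P, ak) and that the ordering relation is kept as a transitively closed set of (before, after) pairs; separation is omitted because it applies only to partially instantiated actions.

def is_threat(a_t, link, ordering):
    # a_t threatens (a_i, P, a_k) if it deletes P and can still be ordered
    # after a_i and before a_k.
    a_i, P, a_k = link
    if P not in a_t.delete or a_t is a_i or a_t is a_k:
        return False
    return (a_t, a_i) not in ordering and (a_k, a_t) not in ordering

def threat_resolvers(a_t, link):
    # Ordering links that remove the threat: the text's demotion (a_t after
    # the consumer a_k) and promotion (a_t before the producer a_i).
    a_i, _, a_k = link
    return [(a_k, a_t), (a_t, a_i)]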
The PSP algorithm is described below (Ghallab, Nau, and Traverso, 2004).
Algorithm 9.2. The plan space planning (PSP) procedure attempts to resolve one flaw at a time. Function Resolve returns the set of resolvers for f in the plan π. CHOOSE is a non-deterministic operator that chooses the appropriate resolver r. The Refine procedure applies the chosen resolver, and the algorithm PSP is called recursively to address the next flaw.
PSP(π)
1 flaws ← OpenGoals(π) ∪ Threats(π)
2 if empty flaws
3 return π
4 else
5 select and remove some f ∈ flaws
6 resolvers ← Resolve(f, π)
7 if empty resolvers
8 return FAIL
9 else CHOOSE r ∈ resolvers
10 π' ← Refine(r, π)
11 return PSP(π')
Given a set of flaws to resolve, any one can be chosen since all flaws have to be resolved. But having chosen a flaw to resolve, one has to choose an appropriate resolver. We use a nondeterministic CHOOSE operator in the description. In practice, one may have to resort to search. Procedure Refine applies the chosen resolver to the plan. It might in turn introduce new flaws. This happens, for example, when one adds a new action an to A, resulting in its preconditions being added as its own open goals.
It may be possible that when a threat is resolved by promotion or demotion, another causal
link may in fact be broken. This could be if the two actions in the new ordering had a common
precondition which only one of them could consume. We illustrate the algorithm with the
example below, when one is forced to impose an ordering on two actions applicable in the
start state because they both consume the same proposition. Consider the following planning
problem:
Figure 9.9 Another tiny planning problem. The given state is on the left, and the goal
description is on the right.
The algorithm begins with π0 = <{A0, A∞}, {(A0 < A∞)}, {}, {}> where A0 has the propositions in the start state of Figure 9.9 as its positive effects, and A∞ has the goal propositions on(A,B) and on(B,C) as its preconditions.
Here is a description of how PSP might solve the problem. For the sake of illustration, we
have assumed that the planner will somehow make the right choice of which flaw to resolve to
minimize search. This is to illustrate the kind of reasoning that could happen. An augmented
planner doing such reasoning could perhaps generate an explanation of the process. In practice,
a wrong choice would lead to backtracking in search.
There are two open goals on(B,C) and on(A,B). Let us say that the planner chooses the
former, and adds the action Stack(B,C), represented as St(B,C) as described in Figure 9.8, to
resolve it. This action in turn introduces its own open goals, holding(B) and clear(C). Let us say
it selects the latter and adds the action Unstack(A,C) to resolve it. The open goals introduced by
Unstack(A,C) can all be resolved because they are produced by the A0 action. That is, they are
true in the start state S0. Assume that the planner next adds Unstack(B,D) to resolve the open
goal holding(B). That has three preconditions - on(B,D), clear(B), and AE. All three are true
in the start state as well, but there is a two way threat now because the actions Unstack(A,C)
and Unstack(B,D) both consume AE and both delete it. The partial plan with the two threats is
shown in Figure 9.10.
Figure 9.10 The partial plan after PSP has added three actions to solve one original goal on(B,C). However, in the process, it has introduced two threats. Both Un(A,C) and Un(B,D) have AE as a precondition supported by a causal link from A0, and both delete AE. One of them will have to be demoted.
An ordering has to be imposed on the two threatened actions to resolve the threat. Let us
say that the planner fortuitously demotes Unstack(B,D) to happen after Unstack(A,C). This
results in the causal link (A0, AE, Unstack(B,D)) being clobbered (Tate, Drabble, and Kirby,
1994) and it means that AE is again an open goal for Unstack(B,D). This threat resolving action
has undone the resolution of an earlier open goal. The next algorithm we look at, Graphplan,
defers this kind of potential interaction as constraints to be resolved later. The ordering link
(Unstack(A,C) < Unstack(B,D)) due to demotion is shown as an explicit arrow in Figure 9.11.
There are other ordering links too, but have not been drawn explicitly. Instead, one relies on the
convention that actions drawn above happen before actions drawn below.
Figure 9.11 The two threats in Figure 9.10 are both resolved by demoting the action
Unstack(B,D), by adding the ordering link (Unstack(A,C) < Unstack(B,D)) as shown with the
solid arrow.
At this point there are two open goals in the plan, on(A,B) and AE, as can be seen in
Figure 9.11. Both can in fact be achieved by one action Stack(A,B) added after Unstack(A,C)
but that would disrupt clear(B) which is a precondition for Unstack(B,D). Again let the planner
non-deterministically choose the action Putdown(A) which produces AE to be consumed by
Unstack(B,D). The situation is shown in Figure 9.12.
Figure 9.12 PSP next tackles the open goal AE in Figure 9.11. It non-deterministically chooses
the action Putdown(A) to be added after holding(A) is made true by Unstack(A, C). This
pushes the action Unstack(B, D) further down because it has to consume AE produced by
Putdown(A).
The planner is now in a happy situation. It has only one open goal to solve which is on(A,B).
It adds the action Stack(A,B) as a resolver for that, and in turn it adds the open goal holding(A)
which can be resolved by Pickup(A). Pickup(A) can be achieved because AE is available as
an effect of Stack(B,C), and the other preconditions clear(A) and onTable(A) were added by Putdown(A) added earlier. Since a causal link (Stack(B,C), AE, Pickup(A)) will be added, Pickup(A) must happen after Stack(B,C) and the last action in the plan would be Stack(A,B), at which point the partial plan would have no flaws.
The observant reader would have noticed that the solution plan is a linear plan. But what
else can one expect from a one armed robot solving the problem?
Figure 9.13 A two armed robot solving the problem from Figure 9.9. The two arms, arm1 and
arm2, have their own set of predicates, for example, holding1 and holding2, and actions, for
example, Pickup1 and Pickup2.
We assume the simple modification of adding distinct predicates and actions for each
arm, as described above. We continue to represent predicates and actions by their abbreviated
notation. The partial plan that PSP terminates with is shown in Figure 9.14. The first thing to
notice is that it has only four actions, which is fewer than the six actions it found for the one
armed robot. The second is that it is not a linear plan. We have not drawn the ordering links as
before and adopted the convention that actions higher up precede actions lower in the figure.
Figure 9.14 The partial plan returned by algorithm PSP for a two armed robot solving the
problem from Figure 9.13. There are no flaws in the partial plan, and it has four actions. Can
the plan be executed in two time steps?
The above plan has four actions. Given a plan which is a partial order, a linear plan of
length four that respects the given ordering can always be generated by a topological sort. The
plan in Figure 9.14 has ordering only as follows. Unstack(A,C) must precede Stack(A,B) and
Stack(B,C), and likewise Unstack(B,D) must precede Stack(A,B) and Stack(B,C). The first two
actions can clearly be done in parallel by the two arms, but what about the last two actions? The
solution plan suggests that they can be done in parallel too. Imagine one arm stacking A on B,
even while the other arm is stacking B on C. It needs a bit of dexterity perhaps, but there is no
reason why it cannot be done. The plan can therefore be executed in two time steps. We say that
the makespan of the plan is 2.
One would expect that while arm2 is holding B, it should not be possible for arm1 to stack A on B. But the preconditions for Stack(A,B) - holding(A) and clear(B) - are true after the two Unstack actions. The reason for this is that while clear(B) is a precondition for Unstack(B,D), it is not in the negative effects of the action. The rationale for not including
clear(X) in the negative effect of either Pickup(X) or Unstack(X,Y) was that the only thing that
a one armed robot could do next was to either put it on the table or stack it on another block. In
either case clear(X) would be a positive effect, so why delete it in the first place? Consequently,
even when arm2 is holding B, clear(B) remains true, and so another block can be stacked on
top of B. A little bit of thought should convince the reader that a multi-armed robot holding
N blocks in N hands can create a tower of N blocks in one time step!
If the user does not want such jugglery, then a simple modification of the Pickup(X) or
Unstack(X,Y) operators by adding clear(X) to the negative effects will do the trick. This is
left as an exercise for the reader. What is the makespan of the plan now for the problem in
Figure 9.13? Draw the solution partial plan.
We have assumed that there is a centralized planner. In real multiagent scenarios, for
example, in search and rescue teams, a certain amount of autonomy would be necessary.
Moreover, with more than one robot each acting independently but in a coordinated manner,
can we address tasks like two robots holding a large table at each end and moving it to a new
location? We discuss this briefly in the last section.
None of the algorithms described so far guarantee an optimal plan, one with the shortest
makespan. We now describe a couple of algorithms that do. Both algorithms adopt a two stage
approach, first converting the planning problem into an intermediate representation, and then
solving for the plan on that representation. We begin with algorithm Graphplan.
defined later, with the term standing for mutual exclusion. In the original implementation, they
are binary relations, eschewing the harder computation required for constraints between more
elements.
The initial or the zeroth layer P0 is a proposition layer containing all the propositions in
the start state.
Then comes A1, the first action layer, which is the union of all actions applicable in the
preceding proposition layer P0. An action in an action layer is applicable if its preconditions
are non-mutex in the preceding proposition layer. The presence of many actions in a layer does
not mean they can be executed in parallel. Only actions that are non-mutex can go together
into a plan. The solution returned by Graphplan is a sequence of sets of actions <Set1,
Set2, ..., Setk> where each Seti = {ai1, ai2, ..., aip} contains actions that could execute in
parallel in step i.
A proposition layer Pi follows every action layer Ai. Like an action layer, a proposition
layer is the union of all propositions that are the effects of all the actions in the action layer.
Any state Si reachable in i steps would be a subset of the proposition layer Pi. If the subset is
non-mutex, then the state can possibly be reached by actions in the preceding layers.
A goal G is reachable from the start state S0 if there is a plan that achieves G. However,
computing this by state space search is the planning problem itself. Algorithm Graphplan
employs a weaker notion of reachability, which is a lower bound approximation. If the goal
G occurs in a layer in the planning graph, it is said to be reachable. While this is a necessary
condition for the goal to be reachable, it is not a sufficient condition. Consider, for example,
achieving the goal propositions on(C,A) and on(D,B) from the start state in Figure 9.9. Both
goals would appear in proposition layer P4 since each tower can be inverted in four moves each,
but both are achievable by a one armed robot only in layer P8.
Between any two consecutive layers, the following sets of edges connect the nodes. These
are shown in Figure 9.15 for the problem from Figure 9.9. Observe that the planning graph is
constructed from S0 without paying any heed to the goal state, except for the signal to stop.
This process has also been called disjunctive refinement (Ghallab, Nau, and Traverso, 2004).
Figure 9.15 The proposition and actions layers in a planning graph, extended up to two
levels. Each layer is a union of actions or a union of propositions. An action is connected
by precondition links to its preconditions. It also has positive effects linking it to the next
proposition layer with dashed arrows, and negative effects drawn as dashed arrows with
rounded heads. The next action layer will be A3.
Layer P0 contains the start state. Layer A1 contains all the actions that are individually
applicable in the start state, and layer P1 includes all the effects of these actions. The reader
would have observed that there are propositions, like onTable(D) shown in abbreviated form as
ot(D), that are present in layers P1 and P2. The reason for that is that we are not implementing
the progression of states over actions, but only incorporating the effects of these actions. We do
this because we do not know at this stage which actions will be included in the plan. Consider,
for example, the proposition on(B,D). Given the one armed robot if Unstack(B,D) were to be
the first action, then this would be deleted, but if Unstack(A,C) were to be the first action, this
would be true in the state after the first action. Algorithm Graphplan includes a No-op action
which says that nothing happens to the proposition on(B,D) and it should be included in the
next layer as is. In fact, it does this for all propositions. Every proposition in a layer Pi is thus
carried forward to layer Pi+1. The set of No-op actions is shown in Figure 9.16. This allows for
the possibility that no robot action is executed in some layers of the plan. The STRIPS planning
domain does not have a notion of time. But in richer domains there could be, and then a No-op
action could be instrumental in meeting time constraints.
Figure 9.16 In each action layer of the planning graph, there is a No-op action for each
proposition, which has only one positive effect of copying the proposition into the next layer.
The set of No-op actions in layers A1 and A2 are shown here. Note that layer P1 also has new
propositions added by the two actions in A1 shown in Figure 9.15, and there are No-op actions
for these as well.
The No-op action takes one proposition p as an argument, which is the only precondition
for the action and is the only positive effect, for example, No-op(on(A,C)). The planning graph
includes the No-op actions for every proposition in a layer along with the other planning actions.
If a planning action has a negative effect, for example, on(A,C) for Unstack(A,C) in A1, then
on(A,C) has two effect arrows impinging upon it, one as a negative effect of Unstack(A,C) and
the other as a positive effect of No-op(on(A,C)). In the final plan, only one of them can exist,
and they are mutually exclusive or mutex. Since Unstack(A,C) and No-op(on(A,C)) are mutex,
so are their positive effects, for example, on(A,C) and holding(A). We define the set of mutex
relations below.
Two actions a ∈ Ai and b ∈ Ai are mutex if one of the following conditions holds. All the mutex relations between actions are stored in the set μAi as pairs.
- Strong interference: There exists a proposition p such that p ∈ pre(a), p ∈ effects-(a), p ∈ pre(b), and p ∈ effects-(b). In the blocks world p could be AE.
- Weak interference: There exists a proposition p such that p ∈ pre(a) and p ∈ effects-(b). Then only one linear order of the two would be possible.
- Competing needs: There exist propositions pa ∈ pre(a) and pb ∈ pre(b) such that pa and pb are mutex in Pi-1.
- Inconsistent effects: There exists a proposition p such that p ∈ effects+(a) and p ∈ effects-(b). Then the semantics of the two actions in parallel is not defined. If they are linearized, the semantics will depend upon the order.
Two propositions p ∈ Pi and q ∈ Pi are mutex if all pairs of actions a ∈ Ai and b ∈ Ai such that p ∈ effects+(a) and q ∈ effects+(b) are mutex. All the mutex relations between propositions are stored in the set μPi.
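The conditions above translate into two small tests, sketched below for the illustrative (hashable) action objects used earlier, with mutex sets stored as sets of two-element frozensets.

def actions_mutex(a, b, prop_mutex_prev):
    # Interference and inconsistent effects: one action deletes a precondition
    # or a positive effect of the other.
    if (a.pre | a.add) & b.delete or (b.pre | b.add) & a.delete:
        return True
    # Competing needs: some pair of preconditions is mutex in the previous layer.
    return any(frozenset((pa, pb)) in prop_mutex_prev
               for pa in a.pre for pb in b.pre if pa != pb)

def props_mutex(p, q, layer_actions, action_mutex):
    # p and q are mutex if every pair of actions producing them is mutex.
    for a in (x for x in layer_actions if p in x.add):
        for b in (y for y in layer_actions if q in y.add):
            if a is b or frozenset((a, b)) not in action_mutex:
                return False           # a non-mutex pair of producers exists
    return True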
Figure 9.17 shows some of the mutex pairs in the planning graph for the problem in
Figure 9.9.
Figure 9.17 Some of the mutex relations in the planning graph being constructed. The No-op
actions are shown in layer A1 along with their mutexes with the two actions in the same layer
A1. All actions in layer A2 are mutex with each other, though the links have not all been shown.
Some mutexes for the proposition layer P1 are shown as well.
Not all mutex relations have been shown in the figure to avoid cluttering. Also not shown
in the figure are the precondition and effect links. Once they are added, the planning graph will
be complete, represented by the following sets - proposition layers, action layers, precondition
links and effect links across consecutive layers, and mutex links within each layer.
A one armed robot allows for only linear plans. Hence all the actions that the robot can
do in each layer are mutex with each other, though not shown in the figure. This would not be
the case for domains where parallel actions are possible. The reader is encouraged to draw two
layers of the planning graph for the problem in Figure 9.13.
Two actions a and b are said to be independent if neither deletes a precondition or a positive effect of the other, that is, if effects-(a) ∩ (pre(b) ∪ effects+(b)) = {} and effects-(b) ∩ (pre(a) ∪ effects+(a)) = {}.
- Build a planning graph for the relaxed problem P' = <S0, G, O'> which ignores negative effects. Actions are no longer mutex and the goal propositions appear earliest in the planning graph.
- Extract a relaxed plan from the planning graph with the original operators as soon as each
goal proposition has appeared in some layer.
- Extract a relaxed plan from the planning graph with the original operators, when all the
goal propositions appear non-mutex in some layer. This could be a deeper level.
What is important is that since all actions are considered in parallel at each level, the heuristic
estimate is a more accurate one, being somewhere between the optimistic max heuristic, which
assumes positive goal interaction and may be grossly underestimating, and the conservative
sum heuristic, which assumes that actions are independent and is in most cases overestimating.
The backward search algorithm RelaxedPlanExtraction is described below (Bryce and
Kambhampati, 2007). The algorithm returns a layered plan <Set1, Set2, ..., Setk> where Seti = πi is the set of actions in the layer Ai.
RPE(PG(S), G, n)
1 π ← []
2 Gn ← G
3 for i ← n down to 1
4 πi ← []
5 Gi-1 ← []
6 for all p ∈ Gi
7 Pick some a ∈ Ai such that p ∈ effects+(a)
8 πi ← a : πi
9 Gi-1 ← pre(a) ∪ Gi-1
10 π ← πi : π
11 return π
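The same extraction can be written in a few lines of Python. Here layers[i] is assumed to hold the actions (including No-ops) of action layer Ai of the relaxed planning graph, so a supporter always exists for every proposition being regressed; mutexes are ignored, as they are in the relaxed problem.

def relaxed_plan_extraction(layers, goal, n):
    # Work backwards from layer n, picking a supporter for every goal
    # proposition and regressing to the supporters' preconditions.
    plan, G = [], set(goal)
    for i in range(n, 0, -1):
        step, G_prev = [], set()
        for p in G:
            a = next(a for a in layers[i] if p in a.add)   # some action adding p
            if a not in step:
                step.append(a)
            G_prev |= set(a.pre)
        plan.insert(0, step)
        G = G_prev
    return plan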
Observe that the above algorithm to extract the relaxed plan is silent on mutexes. Taking
mutexes into account is what transforms it into the backward phase of Graphplan described
below.
Thanks to the No-op actions, when a proposition p appears in a proposition layer, it will
appear in all succeeding layers. As a corollary, any action a that appears in an action layer
will also appear in all succeeding layers. So the number of propositions and actions grows
in a strictly non-decreasing manner. Mutex relations, on the other hand, can appear and then
disappear. In P0 there are none, but they appear quickly, and they can disappear later as well.
If two blocks A and B in Figure 9.9 are to be unstacked and placed on the table by a one armed
robot, then onTable(A) and onTable(B) will be mutex in P2 but will be non-mutex in P4. Given a
goal G = {g 1, g2, ..., gk}, the algorithm has to wait till the set of propositions appear non-mutex
in some layer, before the search for a plan even begins.
The algorithm begins with the initial layer P0 which contains the propositions in the start
state S0. Then from every proposition layer Pi-1 the following sets are constructed.
- the set of actions in layer Ai along with the precondition links to Pi-1 and effect links to Pi.
- the mutex relations between actions in layer Ai in μAi.
- the set of propositions in layer Pi. First, all the propositions are copied from layer Pi-1 due to the No-op actions, and then new propositions are added as the effects of the new actions in layer Ai.
- the mutex relations between propositions in layer Pi in μPi.
In Graphplan the forward process of building the planning graph happens till one of the
following two conditions is met:
1. All the goal propositions g ∈ G appear mutex free in some layer Pn.
2. The planning graph has levelled off. A planning graph is said to level off when two consecutive proposition layers and mutex layers do not change. That is, Pi = Pi+1 and μPi = μPi+1. This means that no new actions can appear, and the layers cannot change any more.
The second condition says that no plan exists. In the blocks world this could be if the goal
description is inconsistent, for example, on(A,B) and on(C,B). In other domains there could be
goals that are not reachable from the start state, for example, if there is no path for a robot to go
from one building to another.
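One forward step of this construction, together with the level-off test, can be sketched as below. The mutex tests are passed in as functions (they could be the actions_mutex and props_mutex sketches given earlier); all names are illustrative.

class NoOp:
    # A No-op action for proposition p: it simply copies p to the next layer.
    def __init__(self, p):
        self.pre, self.add, self.delete = {p}, {p}, set()

def expand_layer(props, prop_mutex, operators, act_mutex_test, prop_mutex_test):
    # Actions (and one No-op per proposition) whose preconditions appear
    # non-mutex in the current proposition layer form the next action layer;
    # their positive effects form the next proposition layer.
    candidates = list(operators) + [NoOp(p) for p in props]
    layer_actions = [a for a in candidates
                     if set(a.pre) <= props
                     and not any(frozenset((p, q)) in prop_mutex
                                 for p in a.pre for q in a.pre if p != q)]
    action_mutex = {frozenset((a, b))
                    for a in layer_actions for b in layer_actions
                    if a is not b and act_mutex_test(a, b, prop_mutex)}
    new_props = set()
    for a in layer_actions:
        new_props |= set(a.add)
    new_prop_mutex = {frozenset((p, q)) for p in new_props for q in new_props
                      if p != q and prop_mutex_test(p, q, layer_actions, action_mutex)}
    return layer_actions, action_mutex, new_props, new_prop_mutex

def levelled_off(props, new_props, prop_mutex, new_prop_mutex):
    # Stop extending when consecutive proposition and mutex layers are identical.
    return props == new_props and prop_mutex == new_prop_mutex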
As soon as all goal propositions appear non-mutex in a layer, the algorithm switches to
phase two and searches backwards for a plan. Observe that this can be later than a layer where
they occur together but where some are mutex, because mutexes can disappear. The algorithm
to find a plan is backward search in the spirit of Algorithm 9.3, except that it must take mutex
relations into account. Starting with the non-mutex goals, it must ensure that the actions chosen
in Line 7 are non-mutex, and that the combined set of preconditions of these actions which
form the goals at the preceding layer are non-mutex as well. In the absence of nondeterminism, the Graphplan algorithm employs depth first search in the backward direction to search for mutex free actions leading back to the start state. Subsequently, other researchers have proposed other
approaches to extract a plan, for example, treating the planning graph as a constraint satisfaction
problem (CSP) (Do and Kambhampati, 2001). We will study CSPs in Chapter 12.
When the algorithm finds a plan it must be the shortest makespan plan, because this was the
first occasion when the goal propositions appeared non-mutex and a plan could be extracted.
If a plan is not found, Graphplan extends the planning graph by one more level. But before
that it creates a memory of goal sets and the level at which they failed. This process is called
memoization. The next time it embarks upon backward search, it will know not to proceed
beyond the memoized goal sets. When it succeeds, the plan that algorithm Graphplan returns
has the following structure:
π = <{a11, ..., a1p}, {a21, ..., a2q}, {a31, ..., a3r}, ..., {an1, ..., ans}>
That is, it contains n ordered sets of actions. The first set {a11, ..., a1p} contains actions that can be executed in parallel in the start state, the second set {a21, ..., a2q} in the state after that,
and so on. A linear plan can always be extracted from it by topological sorting.
A key feature of Graphplan is that it is a two phase algorithm. In the first phase, which is
computationally inexpensive, the planning problem is converted to another problem which is
then solved for the solution. At around the same time, other planning algorithms were devised
that too adopted a two phase approach, most notably planning as CSP (van Beek and Chen,
1999) and planning as satisfiability. We take a brief look at the latter next.
The action variables likewise are all instantiations of the planning operators.
The task is then to create a SAT formula such that all models are valid plans. The SAT
formula is expressed in conjunctive normal form (CNF) with the following types of clauses:
Clauses from the initial state: A set of clauses derived from the initial state S0. This is a
conjunct of all the propositions in S0 and the negation of all propositions not in S0. For the
problem from Figure 9.9, the clauses are given below. The negated propositions are added to exclude models that do not represent valid states. These clauses will form the first part of the CNF. If only the (positive) propositions in
S0 were to be included in the formula, then an interpretation with, for example, on(A,B,0) =
true would be a model too even though it is not true in S0. The start state then contributes the
following sub-formula:
[on(A,C,0) ∧ on(B,D,0) ∧ onTable(C,0) ∧ onTable(D,0) ∧ clear(A,0) ∧ clear(B,0) ∧ AE(0) ∧
¬on(A,B,0) ∧ ¬on(B,A,0) ∧ ¬on(C,A,0) ∧ ¬on(A,D,0) ∧ ¬on(D,A,0) ∧ ¬on(B,C,0) ∧
¬on(C,B,0) ∧ ¬on(D,B,0) ∧ ¬on(C,D,0) ∧ ¬on(D,C,0) ∧ ¬onTable(A,0) ∧ ¬onTable(B,0) ∧
¬holding(A,0) ∧ ¬holding(B,0) ∧ ¬holding(C,0) ∧ ¬holding(D,0) ∧ ¬clear(C,0) ∧ ¬clear(D,0)]
Clauses from the goal description: The goal description is incomplete, and only introduces
the clauses explicit in G. For our example, the clauses are as follows where the formulation is
of a plan of n steps. One does not care what else is true at time n.
[on(A,B,n) ∧ on(B,C,n)]
Clauses relating actions and propositions: For all values of t between 1 and n, each
proposition p and each action a will be assigned a value in an interpretation. The interpretation
will be a model if all the clauses are true in the interpretation. For the model to be a valid plan,
these assignments must be consistent with the relation between actions and their preconditions
and effects. This is achieved by adding the following clauses to the SAT formula, in which each
action in the domain implies both its preconditions and its effects. If a is an action with pre(a) = {pa1, pa2, ..., pak} and effects+(a) = {qa1, qa2, ..., qal} and effects-(a) = {ra1, ra2, ..., ram}, then
(a D pre(a)) = (a D (pa 1 A pa2 ... A pak)) = (-a Vpa 1) A (-a V p 2) A... A (-a V Pak).
So the clauses added for the preconditions are (-a Vpa 1) A (-a Vpa2) A. A (-a Vpak).
In a similar manner, the clauses for the effects are (¬a ∨ qa1) ∧ (¬a ∨ qa2) ∧ ... ∧ (¬a ∨ qal) for the positive effects, and (¬a ∨ ¬ra1) ∧ (¬a ∨ ¬ra2) ∧ ... ∧ (¬a ∨ ¬ram) for the negative effects.
Such clauses are added for every action instance for every time point t between 1 and n. For
example, consider the action Stack(B,D,t) at time point t. It contributes the six clauses shown
below:
(¬Stack(B,D,t) ∨ holding(B,t-1)), (¬Stack(B,D,t) ∨ clear(D,t-1)), (¬Stack(B,D,t) ∨ on(B,D,t)),
(¬Stack(B,D,t) ∨ AE(t)), (¬Stack(B,D,t) ∨ ¬holding(B,t)), and (¬Stack(B,D,t) ∨ ¬clear(D,t))
Then, if, say, Stack(B,D,4) is assigned a value 1 or true, the propositions holding(B,3),
clear(D,3), on(B,D,4) and AE(4) must be all true as well, and holding(B,4) and clear(D,4) must
be false, for the six clauses to be satisfied (evaluate to true).
The frame axioms: When an action happens, it has a set of effects that change the value of
some propositions between two time steps. What about propositions not affected by actions in
the plan? One needs to ensure that only actions in a plan change the values of propositions. This
is done by adding frame axioms that assert that propositions that are not changed by an action
retain their truth value at the next time point.
There are two approaches for doing this in the literature. The classical frame axioms assert
that every action that does not affect a proposition leaves it unchanged for the next time step
(McCarthy and Hayes, 1969). For every action a and for every proposition p that is not in the
effects of a, one adds an axiom of the form (p(t-1) ∧ a(t)) ⊃ p(t). For example, the following
axiom, in CNF form, says that if block C is on the table when block B is stacked on to block D,
then block C continues to be on the table:
(¬onTable(C,t-1) ∨ ¬Stack(B,D,t) ∨ onTable(C,t))
An alternative set of axioms are the explanatory frame axioms as described in Haas (1987).
These axioms enumerate the actions that could have led to a change in the value of a proposition
p. There is one axiom for positive change, which lists all actions a that have p e effects+(a)
and likewise for negative change when p e effects-(a). For example, if at some time step the
arm is now holding block B, then the robot must have picked it up from the table or unstacked
it from another block. Note that one such action must be present for every block it could have
been unstacked from. In the Sussman anomaly example with three blocks A, B, and C, one has
a CNF clause,
(holding(B,t-1) ∨ ¬holding(B,t) ∨ Pickup(B,t) ∨ Unstack(B,A,t) ∨ Unstack(B,C,t))
Likewise, if the robot was earlier holding block B and is no longer holding it, then one of the
actions that deletes holding(B) must have occurred,
(¬holding(B,t-1) ∨ holding(B,t) ∨ Putdown(B,t) ∨ Stack(B,A,t) ∨ Stack(B,C,t))
Again, for every proposition in the planning problem, one will have to add an instance of
such axioms for every time point between 1 and n.
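A small sketch (not from the book, with the same assumed (sign, atom, time) literal representation and illustrative action names) of generating the explanatory frame axiom for a positive change as a single CNF clause.

def explanatory_axiom_positive(p, achievers, t):
    # (not p(t-1) and p(t)) implies (a1(t) or ... or ak(t)), written as one clause:
    # p(t-1) or not p(t) or a1(t) or ... or ak(t)
    clause = [("+", p, t - 1), ("-", p, t)]
    clause += [("+", a, t) for a in achievers]
    return clause

# holding(B) can only become true by picking B up or unstacking it from A or C
print(explanatory_axiom_positive("holding(B)",
                                 ["Pickup(B)", "Unstack(B,A)", "Unstack(B,C)"], 3))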
Given the above set of clauses, every model of the complete SAT formula contains a valid
plan of makespan n. The propositions in the initial and goal description are assigned the value
true. Every action that is part of the plan is assigned the value true at the time step t when it
happens, and false at other time points. The preconditions of the action at time t-1 are also
assigned true, as are the positive effects at time t. The negative effects are assigned the value false.
The remaining propositions are assigned truth values consistent with the frame axioms. If the
solver finds that the formula is unsatisfiable, then a new encoding reflecting a one step longer
plan length is generated.
- For each proposition p in the domain, if p ∈ P0, then the corresponding p0 = true else
p0 = false, where the subscript denotes the time point or layer number. The clauses here are
the same as the clauses from the initial state in the direct encoding.
- For the propositions from goal layer Pn, we similarly keep only the goal propositions with
time stamp n. Like in the direct encoding, the two clauses for the Sussman anomaly are
on(A,B,n) and on(B,C,n).
- Working backwards from the goal with t = n, every p ∈ Pt such that p ∈ Gn induces a
disjunction of actions a ∈ At in the planning graph that have p as a positive effect. The
actions in the disjunction must be from the planning graph. For example, looking at the
planning graph in Figure 9.15, we have only two actions in the planning graph that could
result in holding(B,2),
This is different from the clause that direct encoding for the problem from Figure 9.9
would have generated.
Observe that this is also similar to the explanatory frame axioms. This says that if holding(B)
is to be true at time t, then one of the actions in the planning graph that produced it must have
happened at time t as well. This translates to a smaller clause than in direct encoding,
- If an action a happens at time t, then its preconditions must have been true at time t - 1.
a(t) ⊃ pre(a)(t-1)
This is similar to encoding preconditions in the direct encoding, which for the same
example is
- Actions that are mutex in the planning graph introduce corresponding clauses in the SAT
encoding. For example, in Figure 9.15 we have
Observe that we do not need to encode mutex relations on propositions. This is because we
begin backwards from the mutex free goal propositions and only regress to subgoal propositions
that are preconditions of actions which are in the planning graph. And actions appear in the
planning graph only when their preconditions are non-mutex in the planning graph.
In summary, by the time Graphplan is ready to search for a plan in the planning graph, it
has already restricted the set of actions that are applicable and that lead to the goal eventually.
For example, the action Stack(C,D) does not appear in the SAT encoding for the problem in
Figure 9.15. Consequently, the SAT encoding is much smaller. This can be further reduced by
doing a limited amount of fast inferences as reported in van Gelder and Tsuji (1996) and Kautz
and Selman (1999).
The planning domains considered so far have actions that are instantaneous and deterministic
and that do not fail in the real world, and where the goals are propositions to be
satisfied in the goal state. We have restricted ourselves to this simple domain to focus on the
planning algorithms that have been developed. These algorithms have been extended to richer
domains as well but are beyond the scope of this book. We end by presenting a brief description
of the richer domains that are being addressed by the planning community.
This action optimistically assumes that the toddler has successfully transferred all the
water from the second cup to the first one. When planning with instantaneous actions, two
actions can either be concurrent or one happens before the other. When actions have durations,
then many more relations are possible. These are captured in Allen’s interval algebra (Allen,
1983, 1991) and shown in Figure 9.18.
equal(a, b)        equal(b, a)
starts(c, b)       isStartedBy(b, c)
contains(c, d)     during(d, c)
finishes(d, e)     isFinishedBy(e, d)
overlaps(e, f)     isOverlappedBy(f, e)
meets(f, g)        isMetBy(g, f)
before(h, g)       after(g, h)
Figure 9.18 The thirteen relations in Allen’s interval algebra, shown between eight intervals
{a, b, c, d, e, f, g, h}. In the figure, the relations are shown from b onwards, and are between the
interval and the one preceding it. The first relation equal is symmetric.
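As a quick illustration, and assuming intervals are given as (start, end) pairs with start < end, the relation between two intervals can be computed by comparing endpoints. The sketch below (not from the book) returns the seven relations in one direction; the remaining six are the inverses obtained by swapping the arguments.

def allen_relation(a, b):
    (a_start, a_end), (b_start, b_end) = a, b
    if (a_start, a_end) == (b_start, b_end):
        return "equal"
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start < b_start < a_end < b_end:
        return "overlaps"
    if a_start == b_start and a_end < b_end:
        return "starts"
    if a_start > b_start and a_end < b_end:
        return "during"
    if a_start > b_start and a_end == b_end:
        return "finishes"
    return "inverse"   # b relates to a by one of the seven relations above

print(allen_relation((1, 3), (3, 6)))   # meets
print(allen_relation((2, 5), (1, 8)))   # during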
The STRIPS domain has no notion of time, only sequencing of actions one after the
other. Parallel actions happen instantaneously at the same time point. Linear planning with
durative actions is similarly a task of sequencing actions. It is when actions can be done in
parallel that things become more interesting. A look at Allen’s interval relations reveals why
this is so. Having to deal with actions with different durations becomes tricky when there
is interdependence between their preconditions and effects. When should two friends start
walking from their homes to reach the park at the same time? Forward planning would have no
basis, except lookahead, to choose the starting times. There may be situations when actions are
required to be executed in parallel in one or more specific relations from the interval algebra.
The term required concurrency was introduced in Cushing, Subbarao Kambhampati, and
Weld (2007) and Cushing (2012). These requirements are not stated explicitly but are implied
by the preconditions and effects of durative actions. This is exemplified by the following
example from Cushing’s doctoral thesis. Consider the problem of repairing and inserting a
broken fuse in a dark cellar, where the only potential source of light is one last matchstick in
your matchbox. Assuming that you can repair the fuse in the dark, you will still need light while
finally inserting the fuse to avoid getting electrocuted. Let Fusestart and Fuseend be the start and
end times of the fuse repair action, and let Matchstart and Matchend be the start and end times of
the match lighting action. Then the plan to repair and insert the fuse would need to respect the
following constraints:
Matchstart < Fuseend < Matchend
That is, the match must be lit before inserting the fuse, and its light must last till after the
fuse has been inserted. The question is: when should the match be lit? A fielder on the boundary
on a cricket field may similarly be required to time her jump accurately in order to catch a ball
which would otherwise sail over the boundary.
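A tiny check of these constraints, with assumed durations (repairing takes 50 time units, the match burns for 10), shows why the choice of when to strike the match matters.

def plan_is_valid(match_start, match_end, fuse_end):
    # the match must be struck before the fuse insertion ends,
    # and must still be burning when the insertion ends
    return match_start < fuse_end < match_end

print(plan_is_valid(45, 55, 50))   # True: strike the match near the end of the repair
print(plan_is_valid(0, 10, 50))    # False: the match burns out long before insertion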
Two of the earliest planners that handled durative actions are Sapa (Do and Kambhampati,
2003) and Crikey3 (Coles et al., 2008). The former is a metric temporal planner that maintains
an event queue of durative actions and has two kinds of moves. The first kind selects and
adds a new action to the plan, and the second advances the time to the next event in the event
queue. The second is applicable when the event queue is not empty. Crikey3 splits the durative
actions into two instantaneous actions, one at the start of the durative action and the other at
the end, like in the fuse and matchbox example described above. The algorithm searches in
FF-like manner using relaxed planning graph heuristics. It also uses a simple temporal network
to record temporal relationships. An interesting feature of Crikey3 is that it separates the
decisions concerning which actions to choose and when to schedule those actions, like in plan
space planning.
(:action pour
:parameters (?jug1 ?jug2 - jug)
; the receiving jug must have enough free capacity for the entire contents of ?jug1
:precondition (>= (- (capacity ?jug2) (amount ?jug2)) (amount ?jug1))
; numeric expressions in the effect are evaluated in the state before the action,
; so the increase uses the original amount held by ?jug1
:effect (and (assign (amount ?jug1) 0)
(increase (amount ?jug2) (amount ?jug1)))
)
The above domain ignores the fact that pouring is a durative action. The reader is encouraged
to add the temporal aspect to the action. Both Sapa and Crikey3 mentioned in the previous
section are metric planners as well.
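To see the metric semantics concretely, here is a minimal sketch (not from the book) that applies the pour action to a state, assuming the usual PDDL convention that numeric expressions in an effect are evaluated against the values before the action is applied.

def pour(state, jug1, jug2):
    amount1 = state["amount"][jug1]                         # value before the action
    free = state["capacity"][jug2] - state["amount"][jug2]
    if free < amount1:                                      # precondition: jug2 can take it all
        return None
    new_state = {"capacity": dict(state["capacity"]), "amount": dict(state["amount"])}
    new_state["amount"][jug1] = 0
    new_state["amount"][jug2] = state["amount"][jug2] + amount1
    return new_state

s = {"capacity": {"j1": 4, "j2": 7}, "amount": {"j1": 3, "j2": 2}}
print(pour(s, "j1", "j2"))   # j1 becomes 0, j2 becomes 5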
Consider the example of a bus driving from location A to location B. What should be the
effects of this action, apart from the fact that the bus is at location B? Clearly, the driver and
the passengers should be at location B as well. How does one include that in the new state?
One way would be to add it as an effect of the alight action, in which when a passenger gets off
the bus her location is the same as the location of the bus. But what if she does not alight from
the bus? A similar example is of carrying books in a briefcase. Koehler defines the following
conditional action, expressed in the ADL style:
name: move-briefcase
par: L1:location, L2:location
pre: at-b(L1)
eff: ADD at-b(L2), DEL at-b(L1)
Vx:object [in(x) D ADD at(x,L2), DEL at(x,L1)]
The action says that when you move the briefcase from location 1 to location 2, then
anything that is in the briefcase also gets transported to location 2. Clearly, this is a generic
action that compresses all possible movement actions of individual objects into one. Further, if
you have carried the briefcase to office and a colleague asks you whether you have a particular
book in the office, you can reply in the affirmative without having to take the book out of the
briefcase.
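A minimal sketch (not the book's code) of applying this conditional effect to a state represented as a set of ground facts; the fact names follow the operator above.

def move_briefcase(state, l1, l2):
    new_state = set(state)
    new_state.discard(("at-b", l1))
    new_state.add(("at-b", l2))
    contents = [fact[1] for fact in state if fact[0] == "in"]
    for obj in contents:
        # conditional effect: in(x) implies ADD at(x, L2) and DEL at(x, L1)
        new_state.discard(("at", obj, l1))
        new_state.add(("at", obj, l2))
    return new_state

state = {("at-b", "home"), ("in", "book"),
         ("at", "book", "home"), ("at", "pen", "home")}
print(move_briefcase(state, "home", "office"))
# the book moves with the briefcase; the pen, not being in it, stays at home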
one can be accommodated at a time. When dunking one package is the chosen action in one
possible world (planning graph), it induces a mutex between the two dunking actions to delay
dunking in the other possible world. If in addition the package can possibly clog the toilet, an
unclog action is added after the first dunking, to clear the way for the second package.
PDDL 3.0 introduced state trajectory constraints, which place conditions on the states visited along the plan and not just on the final state. A goal description <GD> may be constrained in several ways (a small sketch checking a few of these follows the list):
- <GD> must be true at the end (like in the domains we have considered so far)
- <GD> must be true at all times during the plan
- <GD> must be true at some time during the plan
- <GD> must be true within N steps in the plan
- <GD> must be true at most once during the plan
- <GD1> must be true some time after <GD2> is true
- <GD1> must be true some time before <GD2> is true
- <GD1> must be true within N steps after <GD2> is true
- <GD> must be true from step N to step M in the plan
- <GD> must hold after N steps in the plan
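As referenced above, the following minimal sketch (not from the book) checks three of these constraint types over a finite trajectory of states, where each state is a set of true propositions and gd is a single proposition; the trajectory used is illustrative.

def always(trajectory, gd):
    return all(gd in state for state in trajectory)

def sometime(trajectory, gd):
    return any(gd in state for state in trajectory)

def at_most_once(trajectory, gd):
    # count maximal runs of consecutive states in which gd holds
    runs, previously = 0, False
    for state in trajectory:
        holds = gd in state
        if holds and not previously:
            runs += 1
        previously = holds
    return runs <= 1

trajectory = [{"at-home"}, {"driving"}, {"at-shop"}, {"driving"}, {"at-cinema"}]
print(sometime(trajectory, "at-shop"))      # True
print(always(trajectory, "at-home"))        # False
print(at_most_once(trajectory, "driving"))  # False: two separate driving phases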
The other feature introduced in PDDL 3.0 is preferences. Goals, whether on the end point
or during the trajectory, that must be satisfied are hard constraints. Preferences, on the other
hand, are soft constraints, which are desirable but not mandatory. For example, when going out
for a dinner and a movie, one might want to buy some groceries on the way, but only if it is
feasible. Soft constraints change the way we evaluate plans. There may be a penalty introduced
if a preference is not met, or a reward added if it is. This changes the notion of what is a valid
plan. Instead of looking for the stated goals to be necessarily satisfied, one thinks of it as an
optimization problem in which as many soft goals as possible are satisfied.
A surgeon engrossed in a delicate task has only to reach out her hand and a colleague magically hands her the right instrument.
Another place where coordination is visible is on the football field where, for example, a striker
may rise to the occasion in a coordinated manner to intercept a cross from a teammate and head
it into the opponent's goal. Coordination between agents may also be required when multiple
agents are working with shared resources. An example could be a children’s hobby club
working with a limited set of instruments that have to be shared between them. Another example
is when vehicles coming from different directions have to negotiate their movement across
a roundabout. In some countries drivers diligently observe the right of way conventions, in
others they may have to rely on a keen eye and quick reactions, in still others the civic
authorities may install traffic signals, and in some they even deploy police personnel to
ensure that the traffic obeys the signals. A comparison of coordinated planning methods for
rovers is given in Chien et al. (2000).
Coordination could be done by centralized planning in which the actions and schedules of
the different actors are specified completely. In this situation one agent would be responsible
for the planning and synchronization of actions.
Alternatively, the different agents may produce their own partial plans and send them to a
central agent for reconciliation and synchronization.
Finally, the different agents may have common goals but act independently. Here too some
amount of communication may be needed. This is clearly the case in team games like football
and hockey.
Some domains involving coordination may require the agents to reason about what other
agents know, and that would take us into the emerging field of epistemic planning.
If there are two post offices, then the father needs to try both of them to pick up the gift
packet. After visiting the first post office he will know whether he has the gift or not and can act
according to the following conditional plan, where PO1 and PO2 are the two post offices:
go to PO1; if the gift is there, collect it; otherwise go to PO2 and collect the gift.
One can also introduce another agent, an employee in PO1 whom the father can ask, and
who presumably knows if the post office has the present.
The following is a more complex problem involving epistemic coordinated actions from
Engesser et al. (2017):
Bob would like to borrow his friend Anne’s apartment while she is away. Anne is happy to
lend it to him, and the plan is that she will leave the key below the door mat for Bob, who will
use it to unlock her apartment.
The question is: what does she need to tell Bob (an epistemic action)? For this, Anne has
to be able to view things from Bob’s perspective. This is known as the ‘Theory of Mind’ which
enables an agent to reason about what other agents know (Premack and Woodruff, 1978). She
must realize that Bob needs to be told where the key is. She must also assume that Bob will
himself synthesize the plan to retrieve the key from below the mat and use it to unlock and enter
her apartment. The plan would then be:
Anne puts the key under the door mat; Anne calls Bob to let him know where the key is;
when Bob arrives, Bob takes the key from under the door mat; Bob opens the door with the key.
This does qualify as an implicitly coordinated plan. Anne now knows that Bob will know
that he can find the key under the door mat and hence will be able to reach the goal. Anne does
not have to request or even coordinate the sub-plan for Bob (which is: take key under door mat;
open door with key), as she knows he will himself be able to determine this sub-plan given the
information she provides.
Finally, multi-agent card games are a fertile ground for epistemic planning. The complete
pack is known to each player, but each can only see her own cards. Some inference can be
made from the cards other players play. Contract bridge is probably the most sophisticated
of all card games. This is due to two reasons. First, it is a partnership game requiring active
communication between players. Second, in the play phase that follows the bidding phase, the
cards of one player are exposed to all, which generates sufficient information to serve as fodder
for complex reasoning.
Communication between partners is done publicly, and the opponents listen eagerly. And
like in war and espionage, the communicators can target the opponent with false information
(or fake news). This leaves room for cloak and dagger operations. We have described one such
example analysed in Khemani and Singh (2018) in Section 8.4 in which a player weaves a web
of deception to score over his opponent. Logicians distinguish between knowledge and belief.
Logically, knowledge can only be about what is true in the world, but beliefs have no such
constraint, which leaves the door ajar for deception. Many computer programs employ Monte
Carlo methods to probabilistically choose a plan. For example, a program to play the simple
card game Hanabi is reported in Reifsteck et al. (2019).
Contract bridge, however, is still an open problem.
Summary
Planning is a critical activity for an intelligent agent. In this chapter we have studied different
algorithms for domain independent planning. Starting with state space search, we moved on to
searching in the plan space with POP, and algorithms Graphplan and Satplan. One reason
that different kinds of algorithms have been explored is that planning is a hard problem, even
for the simplest domains. Various approaches have been explored to mitigate complexity.
Backward state space planning was explored to exploit the fact that the goal description
is sparse, leading to low branching. But it has a problem of generating spurious states. GSP
was an attempt to combine the best features of forward and backward planning. The idea of
domain independent heuristics was explored in FF, HSP, and HSPr, and it was observed that
the planning graph is a source of such heuristics. PSP offers a separation of action selection
and action scheduling, and could be the foundation of combining planning with reasoning
about goals and actions. Both Graphplan and Satplan explore intermediate structures easy
to compute, and where standard solvers could be deployed.
Mutex relations introduced logic and reasoning into the process of search. This also happens
when planning in richer domains, which demand greater expressivity in describing domains
and the associated reasoning. We will look at the interplay between search and reasoning again
in Chapter 12. In the next chapter we look at logical reasoning and see how the process of
reasoning also has an underlying search component.
Exercises
1. Extend the STRIPS operators described in Section 9.2.1 to allow the robot to move a box
from one room to another room when the need arises.
2. The monkey and the banana is a popular variation of the STRIPS problem. In this variation
there are some bananas hanging from a ceiling and a monkey needs to push a box to the
location of the bananas, climb on the box, retrieve the bananas, climb down, and then eat
the bananas. Devise STRIPS-like operators to solve the problem.
3. Combine the previous two domains so that the monkey has to push the box from another
room to the room where the bananas are.
4. The Gripper domain is defined as follows. There are two rooms, four balls, and two robot
arms. The predicates are - X is a room, X is a ball, X is inside Y, robot arm X is empty.
The robot can move between rooms and pick up and drop one or two balls. In the initial
state all four balls and the robot are in one room, and both robot arms are empty. The goal
description is that all balls should be in the other room. Express the Gripper domain in
PDDL.
5. The logistics domain is defined as follows (Bart Selman, Henry Kautz). There are several
cities, each containing several locations, some of which are airports. There are also trucks,
which can drive within a single city, and airplanes, which can fly between airports. The
goal is to get some packages from various locations to various new locations. Express the
domain in PDDL.
6. The Mystery domain was introduced in the international planning competition (IPC) 1998
by Drew McDermott. There is a planar graph of nodes. At each node are vehicles, cargo
items, and some amount of fuel. Objects can be loaded onto vehicles (up to their capacity),
and the vehicles can move between nodes, but a vehicle can leave a node only if there is a
nonzero amount of fuel there, and the amount decreases by 1 unit. The goal is to get cargo
items from various nodes to various new nodes. To disguise the domain, the nodes are
called emotions, the cargo items are pains, the vehicles are pleasures, and fuel and capacity
numbers are encoded as geographical entities. Express the planning domain in PDDL.
7. Given the planning problem described below, show how goal stack planning with STRIPS
operators will achieve the goal. You may choose any order where a choice has to be made.
What is the plan found?
Start state = {onTable(A), onTable(C), on(B,C), on(D,E), onTable(F), on(E,F), AE,
clear(A), clear(B), clear(D)}
Goal description = {on(D,A), on(A,E)}
8. Given the planning problem described below, show how goal stack planning with STRIPS
operators will achieve the goal. You may choose any order where a choice has to be made.
What is the plan found?
Start state = {onTable(A), on(B,A), on(C,B), on(D,C), onTable(E), on(F,E), AE,
clear(D), clear(F)}
Goal description = {on(D,B)}
9. Define blocks world operators for multi-armed robots. Modify the STRIPS operators
to include another parameter for the arm number. For example, the Stack(X,Y) operator
is modified to Stack(N, X, Y) where N is the arm number. Likewise for predicates, for
example, AE(N). Modify the planning algorithms studied to allow for plans with actions
being done in parallel.
10. Modify the definition of the Pickup(N, X) and Unstack(N, X, Y) operators to add clear(X) to
the negative effects of these actions. Show the solution found by PSP on the problem from
Figure 9.13.
11. For the planning problem below but with a two armed robot, draw the partial plan found
that is an optimal solution plan finishing the earliest possible. What is the makespan of the
plan?
Start state = {onTable(A), onTable(C), on(B,C), on(D,E), onTable(F), on(E,F), AE(1),
AE(2), clear(A), clear(B), clear(D)}
Goal description = {on(C,B), on(A,C), on(B,D)}
12. For the planning problem below but with a two armed robot, draw the partial plan found
that is an optimal solution plan finishing the earliest possible. What is the makespan of the
plan? State any assumptions you have made.
Start state = {onTable(A), on(B,A), on(C,B), on(D,C), onTable(E), on(F,E), AE(1),
AE(2), clear(D), clear(F)}
Goal description = {on(C,D), onTable(D), on(F,C)}
13. Define the notion of a flaw in a partial plan represented in POP. How are the flaws addressed?
14. It was observed in Figure 9.17 that for the problem in Figure 9.9 all non-No-op actions are
mutex with each other. What is the situation when Graphplan attempts the problem in
Figure 9.13 which has two arms? Draw the first two action layers and identify the mutex
actions.
15. Simulate the algorithm Graphplan on the planning problems from Figures 9.9 and 9.13.
For each, take a large sheet of paper and draw the planning graph till the point when the
backward phase succeeds.
16. Define the notion of ‘mutex relations’ used in algorithm Graphplan. When are two
propositions mutex?
17. Define the mutex relations in Graphplan. Given the start state = {onTable(A), onTable(C),
on(B,C), clear(A), clear(B), AE}, draw one level of the planning graph and show the mutex
relations.
18. Consider the problem of inverting a stack of two blocks, A and B. Encode this as a planning
graph with four layers.
19. Use the planning graph for encoding the two block inverting problem into SAT. Also
generate the direct SAT encoding for the above problem, and compare the two.
20. Express the water jug problem from Section 2.4.3 as a metric planning problem. Extend it
to a temporal problem by assuming that the time taken for a pouring action is proportional
to the amount of water being poured.
21. Given that a durative action a can be written as two actions astart and aend, and likewise
action b as actions bstart and bend, express Allen’s thirteen relations in terms of constraints
between these four instantaneous actions.
chapter 10
Deduction as Search
An intelligent agent must be aware of the world it is operating in. This awareness
comes mainly via perception. Human beings use the senses of sight, sound, and touch
to update themselves. However, the entire world is not perceptible to any of us. Our
senses have limitations. We cannot hear the dog whistle, or see the bacteria living
on our skin or the mountain on the other side of the world. But through science and
communication we know about the worlds beyond our sensory reach. Telescopes
from Galileo to James Webb have delivered spectacular images of the universe, some
taken in the infrared band of the spectrum. We augment whatever we know by making
inferences. The conclusions we draw may be sound or they may be speculative yet
useful. Evolution has preserved in us both kinds of inference making capability.
The world is dynamic and has other agencies making changes in the world too. If
we observe something we may guess the cause or intention behind it. This kind of
speculation is called abduction. The conclusion is possibly true, maybe even likely. If
we see the local bully striding towards us, we may suspect ill intent on his part, and
take evasive action. Better safe than sorry. If we develop a cough and fever, we may
fear Covid and isolate ourselves from others. When we observe a few white swans, we
may conclude that all swans are white. This is called induction. Neither abduction nor
induction is always sound. Conclusions we draw may not always hold. But they are
eminently useful.
In this chapter we study deduction, a form of inference that is sound. The conclusions
that we draw using deduction are necessarily true. The machinery we use is the
language of logic and the ability to derive proofs. We highlight the fact that behind
deduction the fundamental activity is searching for a proof.
Logic and mathematics are often considered to be synonymous. Both are concerned with truth
of statements. In this chapter we confine ourselves to the family of classical logics, also known
as mathematical logics, in which every sentence has exactly two possible truth values - true
and false. Nothing in between. No fuzzy concepts like tall and dark. Is a person whose height
is 176 centimetres tall? What about 175 then? And 174? When does she become not tall? Or
modalities like maybe. It is possible she loves him. Does that mean she loves him or does
she not? Or values like don’t know. Is it raining in Delhi? Classical logics would either say a
sentence is true or it is false. Even a sentence which says that White always wins in chess is a
sentence in classical logic, because in principle it must be either true or false. Even though we
cannot find out given the size of the chess game tree.
The simplest classical logic is the logic of propositions, or propositional logic (PL). Every
logic has an associated formal language, with a well defined vocabulary and syntax that
completely define the language L as a set of sentences, and with well defined semantics.
There are two angles to the semantics of a language. One is truth functional semantics, which
assigns a truth value to every sentence in the language L. The other is denotational, which refers
to what the sentences mean. Logicians are largely concerned only with the former, while the
artificial intelligence (AI) community is also concerned with meaning. What does the sentence
represent?
• The set of atomic sentences A of L is a countable set of proposition symbols {P, Q, R, ...}.
Propositional variables are also called Boolean variables after George Boole who invented
Boolean algebra.
• The set of commonly used logical connectives of L is {¬, ∧, ⊃, ∨, ⊕, ≡, ↓, ↑}. Of these
the first is a unary connective, and the remaining are binary connectives.
• The constant symbols ‘⊥’ and ‘⊤’, called Bottom and Top respectively. These are atomic
sentences whose truth value is known and constant, being respectively false and true.
• The set of punctuation symbols includes the different kinds of brackets and parentheses.
The atomic sentence in PL is indivisible. We do not peer inside it but treat it as a unit. Logical
reasoning itself is only concerned with form and not with meaning, though the meaning is
defined by the user. Meaning lies in the mind of the beholder. We represent an atomic sentence
in PL as a propositional symbol from a countable set of symbols {P, Q, R, ..., P1, P2, ...}.
A propositional symbol can stand for or denote any sentence in a natural language, even a
complex sentence.
The PL language L is defined as follows, where α and β are propositional variables that
stand for any sentence of L:
- If α is an atomic sentence in A, then α is a sentence in L.
- If α is a sentence in L, then ¬α is a sentence in L.
- If α and β are sentences in L, then (α ∧ β), (α ∨ β), (α ⊃ β), (α ≡ β), (α ⊕ β), (α ↑ β), and (α ↓ β) are sentences in L.
The set of sentences in L thus includes all the atomic sentences in the vocabulary and all
sentences constructed using the logical connectives as defined above.
The truth values of atomic sentences are defined by a valuation function V which maps
every atomic sentence to a two element set usually denoted by {true, false} or {T, F} or {1, 0}.
For the rest of the formulas, the truth values are defined by structural induction. We as users
associate these two values to the sentences being true or false. The following cases apply for
the connectives described above:
NAND is a short form for NOT-AND, and NOR is a short form for NOT-OR. XOR is short for
EXCLUSIVE-OR and says that exactly one of its constituents is true, which is different from
the inclusive OR where at least one is true. As can be seen, the following pairs of connectives
are negations of each other: {∧, ↑}, {∨, ↓}, and {≡, ⊕}.
The truth value of combined statements hence depends upon the truth values of the
constituents as well as the logical connectives used. The properties of the connectives described
above are often illustrated by truth tables.
Figure 10.1 shows two truth tables for the two formulas ((α ∧ (α ⊃ β)) ⊃ β) and
((β ∧ (α ⊃ β)) ⊃ α). Since there are two propositional variables in each, we have four rows, and
since there are three connectives, three more columns are added.
Figure 10.1 A truth table for a sentence with N variables and C connectives has 2^N rows and
N+C columns. The last column indicates when the sentence is true. The truth table on top
shows that ((α ∧ (α ⊃ β)) ⊃ β) is always true, while the one below shows that ((β ∧ (α ⊃ β)) ⊃ α)
is false when α is false and β is true.
As can be seen, the first sentence is always true whatever the truth values of α and β. Such
sentences are called tautologies. Tautologies are of considerable importance since deduction is
based on tautological sentences, like the first one in Figure 10.1.
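Constructing a truth table can be mechanized. The following minimal sketch (not from the book) classifies a formula over N propositional variables as a tautology, a contradiction, or a contingency by enumerating all 2^N valuations; the two formulas of Figure 10.1 are used as test cases.

from itertools import product

def classify(formula, num_vars):
    values = [formula(*valuation)
              for valuation in product([True, False], repeat=num_vars)]
    if all(values):
        return "tautology"
    if not any(values):
        return "contradiction"
    return "contingency"

implies = lambda a, b: (not a) or b                                # a implies b
print(classify(lambda a, b: implies(a and implies(a, b), b), 2))   # tautology
print(classify(lambda a, b: implies(b and implies(a, b), a), 2))   # contingency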
If δ is a tautology, then ¬δ is a contradiction or is unsatisfiable. An unsatisfiable sentence
is false whatever the truth values of the constituents.
The second sentence in the figure is an example of a contingency. A contingency is a
sentence that is true for some valuations and false for others.
The set of all sentences of PL is partitioned into three sets - tautologies, contingencies, and
contradictions. The reader is encouraged to ponder over the fact that all three sets are infinite,
since a new sentence can always be constructed from old ones with any logical connective.
Moreover, an injection can be constructed from any one to any other. This is because given
a sentence from any set, a sentence in any other set can be produced. So, in this sense, the
cardinalities of the three sets are the same (if you can compare the sizes of infinite sets).
The set of satisfiable sentences is the union of the sets of tautologies and contingencies.
A sentence is said to be satisfiable if there is some valuation that makes the sentence true.
Constructing the truth table to determine the truth values of formulas is not practical. There
are several reasons for this. First, the size of the table grows exponentially with the number of
variables. Second, the method does not extend to richer logics, for example, the first order logic
we look at later. And third, the most important of all, the valuation of the atomic formulas is
rarely given to us. Instead, what we have is a set of formulas, not necessarily atomic, that are
given as true. We will refer to this set as the knowledge base (KB). The KB can be the set of
axioms that are always provably true or a set of premises that we accept as true.
The question then is: given a KB that is true, and given a query sentence α, is α true
as well?
Mahsa is a girl. Mahsa either likes to sing or she likes to fight. Mahsa does not like
to fight.
Assuming the above sentences are true, is the following also true?
Mahsa likes to sing.
Let us use the following propositional symbols:
G = Mahsa is a girl.
S = Mahsa likes to sing.
F = Mahsa likes to fight.
The KB is then {G, (S ∨ F), ¬F}, and the query is S.
We say that a KB is true if every sentence in the KB is true. The notion of entailment
embodies the connection between the truth values of the KB and the goal α. Given a true KB,
if a sentence α is necessarily true, we say that the sentence α is entailed by the KB. Informally
we also say that α is true. This is expressed as
KB ⊨ α
Entailment is a semantic notion. It looks at truth values but does not provide us with a
procedure to determine whether α is true. Instead, we turn to the syntactic notion of proof.
A proof procedure aims to derive the goal α from the KB via a sequence of derivation steps.
This process is also known as theorem proving. A theorem is a true statement, and the term is
used extensively in mathematics, where theorems are true once and for all. Each derivation step
employs a rule of inference that allows a new sentence to be added to the KB. For example, the
well known rule modus ponens (MP), which says that if the KB has sentences matching β and
(β ⊃ α), then one can add α to the KB. This can be written as
KB ⊢ α
We also say that α can be proved given the KB, or that we have deduced α from the
KB, and the process is known as deduction. The set of starting sentences and the sequence of
intermediate sentences leading up to the goal α constitute the proof of α. Finding the proof is
the stuff mathematicians engage in, sometimes for long periods of time. Remember Fermat’s
Last Theorem? In our modern times we can deploy the computing power available to us to
search for a proof. This is possible because deriving a proof is a purely syntactic activity. This
chapter gives us a glimpse into this process.
We now have two different notions. The notion of entailment, which talks of truth, the
subject of our interest, and the notion of proof, which is purely syntactic. Proofs, and programs
to find proofs, would be useful only if the derived sentences are also entailed. There are two
properties in logic that pertain to this issue.
10.2.1 Soundness
A logic is said to be sound if every sentence we can derive is a true sentence. Or every provable
sentence is true (given that the KB is true).
Soundness: If KB ⊢ α then KB ⊨ α
The soundness property of a logic system reflects on the rules of inference used in the logic.
A rule of inference is sound or valid if it is based on a tautological implication. For example,
modus ponens is sound because it is based on ((α ∧ (α ⊃ β)) ⊃ β), which is a tautology as
shown in Figure 10.1. In our earlier definition we had said that we derive α from β and
(β ⊃ α). This should not cause any confusion because α and β are variables and can stand for
any sentence. In fact, the following two are logically equivalent:
((α ∧ (α ⊃ β)) ⊃ β) ≡ ((β ∧ (β ⊃ α)) ⊃ α)
The reader is encouraged to construct a truth table for the above equivalence and verify that
it is indeed a tautology. In fact, tautological equivalences are the basis for rules of substitution,
where the left hand side can replace the right hand side and vice versa. This is not surprising
because the equivalence is essentially a biconditional, as the following sentence depicts:
(α ≡ β) ≡ ((α ⊃ β) ∧ (β ⊃ α))
The equivalence relation is really two implication statements and, if it is a tautology, the
corresponding rule of substitution is essentially two rules of inference. The following are some
common rules of inference. In addition, an arbitrary number of derived rules can be devised
based on tautological implications (Manna, 1974; Stoll, 1979; Smullyan, 2009).
Modus ponens: β, (β ⊃ α) ⊢ α
Modus tollens: ¬α, (β ⊃ α) ⊢ ¬β
Conjunction: α, β ⊢ (α ∧ β)
Addition: α ⊢ (α ∨ β)
Simplification: (α ∧ β) ⊢ α
Hypothetical syllogism: (α ⊃ β), (β ⊃ δ) ⊢ (α ⊃ δ)
Disjunctive syllogism: (α ∨ β), ¬α ⊢ β
Constructive dilemma: ((α ⊃ β) ∧ (γ ⊃ δ)), (α ∨ γ) ⊢ (β ∨ δ)
Destructive dilemma: ((α ⊃ β) ∧ (γ ⊃ δ)), (¬β ∨ ¬δ) ⊢ (¬α ∨ ¬γ)
The process of proving a given sentence is called theorem proving. Here theorem refers to
a statement that is true. And in classical or mathematical logics, the truth values of sentences do
not change with time. Remember the Pythagoras theorem?
In the most intuitive form of the algorithm the theorem prover picks a rule with matching
antecedents in the KB and adds the consequent to the KB. This process is entirely syntactic in
nature, based only on pattern matching. Let α = ‘The Earth is round’ and β = ‘Roses are red’;
the sentence (α ∧ β) stands for ‘The Earth is round and roses are red’. If it is given that this
(compound) sentence is true, then the rule simplification allows one to conclude α = ‘The Earth
is round’. But what about β = ‘Roses are red’? That should follow logically too. But simple
pattern matching allows only the first component α of (α ∧ β) to be inferred. Clearly one should
be able to infer β too. This is where the rules of substitution come into effect. One such rule
says that (α ∧ β) is equivalent to (β ∧ α) and either one can replace the other. This allows us to
add (β ∧ α) to the KB and then infer β using simplification. These kinds of situations abound
in formal proofs. The following are some commonly used rules of substitution (Manna, 1974;
Stoll, 1979; Smullyan, 2009):
The reader is encouraged to verify that each of these rules is sound, by verifying that these
are tautologies.
10.2.2 Completeness
A logic must be sound if the sentences it derives are to be believed. But to be useful it must also
be able to produce all sentences that are entailed by the KB. This property is called completeness.
Completeness: If KB ⊨ α then KB ⊢ α
Completeness proofs can be hard and are beyond the scope of this book. However, this
is a laudable quest and we will illustrate it to some extent when we talk of proof methods in
first order logic. Friedrich Ludwig Gottlob Frege (1848-1925) was a German mathematician,
logician, and philosopher who worked at the University of Jena.1 In 1879 he gave the first
axiomatization of propositional calculus that was both sound and complete (Frege, 1879). He
showed that all tautologies that can be expressed in the language of propositional logic can be
derived from the following set of six axioms and one rule of inference, MP:
1. THEN-1 α ⊃ (β ⊃ α)
2. THEN-2 (α ⊃ (β ⊃ γ)) ⊃ ((α ⊃ β) ⊃ (α ⊃ γ))
3. THEN-3 (α ⊃ (β ⊃ γ)) ⊃ (β ⊃ (α ⊃ γ))
4. FRG-1 (α ⊃ β) ⊃ (¬β ⊃ ¬α)
5. FRG-2 ¬¬α ⊃ α
6. FRG-3 α ⊃ ¬¬α
Remember that α, β, and γ are propositional variables and can match any sentence. In an
axiomatic system only the axioms are to be taken for granted. Any other true statement needs a
proof. Even the seemingly obvious sentence (P ⊃ P), where P is a propositional symbol.
First, we introduce a derived rule of inference. A derived rule of inference is like a macro
call that serves as a short cut. The rule we are interested in is hypothetical syllogism (HS) and
is derived as follows:
1. (β ⊃ γ)                                        Premise
2. (β ⊃ γ) ⊃ (α ⊃ (β ⊃ γ))                        THEN-1 (after appropriate substitution)
3. (α ⊃ (β ⊃ γ))                                  MP, 1, 2
4. (α ⊃ (β ⊃ γ)) ⊃ ((α ⊃ β) ⊃ (α ⊃ γ))            THEN-2
5. (α ⊃ β) ⊃ (α ⊃ γ)                              MP, 3, 4
6. (α ⊃ β)                                        Premise
7. (α ⊃ γ)                                        MP, 5, 6
We have shown that given the two premises (α ⊃ β) and (β ⊃ γ), we can derive (α ⊃ γ).
The deduction theorem says that a given set of premises A, B, C, and D entail a conclusion
E if and only if ((A ∧ B ∧ C ∧ D) ⊃ E) is a tautology.
The fact that Frege’s axiomatic system is complete means that if ((A ∧ B ∧ C ∧ D) ⊃ E) is
a tautology, then there is a proof for the formula. Thus, whenever the given premises A, B, C,
and D entail a conclusion E, a proof can be found.
Frege’s axiomatic system handles only two operators {¬, ⊃} and only one rule of inference.
Students of logic would recall that anything that can be expressed with any of the 16 binary
operators and negation can be expressed with the set {¬, ⊃}. Only that the representations can
tend to blow up in size. For example, we know that the connective ∧ can be replaced as follows:
(α ∧ β) is equivalent to ¬¬(α ∧ β), which is equivalent to ¬(¬α ∨ ¬β), which is equivalent to
¬(α ⊃ ¬β). Also, given the definition of equivalence as two implications, we can rewrite (α ≡ β)
as ((α ⊃ β) ∧ (β ⊃ α)).
In practice, we tend to use more logical connectives, which also require more rules of
inference to make inferences, some of which have been mentioned in Section 10.2.1. The
question of which combination of logical connectives yields a complete proof system has to
be considered before choosing one. We will look at one such system later in the chapter. For
now, we turn our attention to a more expressive language than PL, in which we can peer inside
atomic sentences and view them as relations between elements in a domain.
The Greek syllogism was close to natural language. Modern logic adopts a more formal
representation. As in the case of any formal logic, there are two parts to FOL. One, the syntax
of the language itself and, two, the semantics. The semantics itself has two facets. One,
denotational, which is concerned with the meaning of sentences. The other, truth functional,
which deals with truth values. Instead of first describing the formal language and then its
semantics, we will adopt a more informal approach weaving through both.
Every FOL has a domain independent part or logical part whose vocabulary is the following
(Fitting, 2013):
- Symbols that stand for connectives or operators: ‘∧’, ‘∨’, ‘¬’, and ‘⊃’.
- Brackets: ‘(’, ‘)’, ‘{’, ‘}’, ...
- The constant symbols: ‘⊥’ and ‘⊤’.
- A countable set of variable symbols: V = {v1, v2, v3, ...} or {x, y, z, x1, y1, z1, ...}
- Quantifiers: ‘V’ read as ‘for all’, and ‘3’ read as ‘there exists’. The former is the universal
quantifier and the latter the existential quantifier.
- The symbol ‘=’ read as ‘equals’. This is optional.
The domain specific part of the language L(R, F, C) is defined by three sets R, F, and C. R is
a set of relation or predicate symbols, F is a set of function symbols, and C is a set of constant
symbols. An interpretation I = <D, I> specifies a domain D for the language L(R, F, C) and
a mapping I from each of R, F, and C to elements of D. In addition, an assignment A maps
every variable in V to the domain D. Like in PL, the interpretation determines which sentences
are true. In FOL it also determines what the expressions in the language stand for, at least the
structural nature of the relation in set theoretic terms.
We first describe how the elements or entities in a domain are represented. The set of terms
T of L(R, F, C) is defined as follows:
- If t ∈ V, then t ∈ T.
- If t ∈ C, then t ∈ T.
- If t1, t2, ..., tN ∈ T and f ∈ F is an N-place function symbol, then f(t1, t2, ..., tN) ∈ T.
Terms are made up of variable symbols, constant symbols, or recursively defined using the
function symbols. Each term of L(R, F, C) refers to an element in the domain. The mapping
from the set of terms T to the elements of a chosen domain D is given by
- Every N-place function symbol f is mapped under I as follows: I(f) = f^I, where f^I is the
image of f and is an N-ary function f^I: D^N → D.
- If t ∈ V, then t^{I,A} = t^A. Every variable is mapped by the assignment A.
- If t ∈ C, then t^{I,A} = t^I. Every constant is mapped by the mapping I.
- If t1, t2, ..., tN ∈ T and f ∈ F, then f(t1, t2, ..., tN)^{I,A} = f^I(t1^{I,A}, t2^{I,A}, ..., tN^{I,A}).
Every n-place function symbol maps to an n-ary function in the domain. Variables are mapped
to some element by the assignment A, constant symbols are mapped to specific elements in the
domain as specified by I, and terms using function symbols use a combination of A and I as
needed.
Given the domain of natural numbers, a variable x may map to some number, say 11,
given an assignment A. A constant zero or sifar or 0 may map to the number 0. We generally
overload the use of a numeral to stand for both the constant in the language and the number in
the domain. Thus, the constant symbol 7 stands for the number 7. Then sum(7,11) will map to
the element 18, and successor(7) maps to the number 8.
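A minimal sketch (not from the book) of how such terms are evaluated: constants and function symbols are looked up in the interpretation, variables in the assignment. Representing a compound term as a nested tuple is an assumption made for illustration.

interpretation = {
    "7": 7, "11": 11,                        # constant symbols
    "sum": lambda m, n: m + n,               # function symbols
    "successor": lambda n: n + 1,
}
assignment = {"x": 11}                       # variables are mapped by A

def evaluate(term, I, A):
    if isinstance(term, str):                # a variable or a constant symbol
        return A[term] if term in A else I[term]
    functor, args = term[0], term[1:]        # a function symbol applied to terms
    return I[functor](*(evaluate(t, I, A) for t in args))

print(evaluate(("sum", "7", "x"), interpretation, assignment))    # 18
print(evaluate(("successor", "7"), interpretation, assignment))   # 8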
Likewise in the domain of people, we use the name of a person both as a constant symbol in
the language and as the person in the domain. For example, Mahsa, Zhina, Nika, Sarina, Hadis,
and Neda may stand for persons having the said names. Then father(Sarina) would stand for
the individual who is the father of Sarina. Remember that mathematically father is a 1-place
function. Where there is ambiguity, we may suffix a name with a number. For example, the
constants Nika16 and Nika21 refer to potentially two different individuals. The mapping could
still map them to the same person, in which case the two symbols become aliases.
When writing programs we often adopt some convention to distinguish the sets V and C.
The convention we will follow is to use lower case words as constant and function symbols,
and prefix a word with a ‘?’ if it is a variable. Thus, variable names need not be restricted to
the symbols x, y, and similar mnemonic, and a programmer may use variable names such as
?count, ?age, and ?girl also as variable names. The programming language Prolog adopts the
convention of using capitalized words for variables and lower case for constants.
In an interpretation I = <D, I>, a term of L(R, F, C) points to an element of the domain D.
The set of atomic formulas A of L(R, F, C) is defined as follows:
- If P ∈ R is an N-place predicate and t1, t2, ..., tN ∈ T are N terms, then P(t1, t2, ..., tN) ∈ A.
- If t1, t2 ∈ T are terms, then (t1 = t2) ∈ A.
- The constant symbols ⊥ and ⊤ are also atomic formulas.
An atomic formula is the smallest unit that can be assigned a truth value. The truth value
is determined by the interpretation I = <D, I>. For propositional symbols {P, Q, R...} the
interpretation is simply a truth assignment by a valuation function V. In the case of FOL, an
atomic formula is true if the corresponding relation holds in the domain D and interpretation I.
We can define a similar valuation function Val: A → {true, false} as follows:
The following examples use names of predicates and functions familiar to us:
When we include atomic formulas of the kind (t1 = t2), we say the language is FOL with
equality. Then, as we will see later, we need to add some additional axioms for the logic to be
complete.
When talking about truth values we often leave out the valuation function and make
assertions like Friend(Hadis, Neda) = true. This is somewhat informal and also incorrect
(because only terms can be equal), but since there is no ambiguity it is often used as a short
form. We can write this correctly as Friend(Hadis, Neda) ≡ ⊤, where two atomic formulas are
connected by a logical connective.
The set of formulas F of L(R, F, C) is defined as follows:
- If α ∈ A, then α ∈ F.
- If α ∈ F, then ¬(α) ∈ F. Where there is no ambiguity, we can write ¬α ∈ F.
- If α, β ∈ F, then (α ∧ β) ∈ F. Where there is no ambiguity, we can write α ∧ β ∈ F.
- If α, β ∈ F, then (α ∨ β) ∈ F. Where there is no ambiguity, we can write α ∨ β ∈ F.
- If α, β ∈ F, then (α ⊃ β) ∈ F. Where there is no ambiguity, we can write α ⊃ β ∈ F.
- If α, β ∈ F, then (α ≡ β) ∈ F. Where there is no ambiguity, we can write α ≡ β ∈ F.
- If α ∈ F and x ∈ V, then ∀x(α) ∈ F. We often use ∀x,y(α) as a short form for ∀x(∀y(α)).
- If α ∈ F and x ∈ V, then ∃x(α) ∈ F. We often use ∃x,y(α) as a short form for ∃x(∃y(α)).
Let us look at some examples of quantified formulas, with meaningful (to us) predicate,
function, and constant names.
The valuation for the formulas with the logical connectives is defined by the semantics of
the connectives and is similar to the valuation in PL. The valuation for quantified formulas is
defined below. An assignment B is said to be an x-variant of an assignment A if the two agree
on all variables except x (Fitting, 2013).
- Val((∃x(α))^{I,A}) = true iff α^{I,B} is true for some assignment B that is an x-variant of A. In other
words, the formula α is true for some value of x.
- Val((∀x(α))^{I,A}) = true iff α^{I,B} is true for all assignments B that are x-variants of A. In other
words, the formula α is true for all values of x.
Observe that on the right hand side of iff the formula is without the quantifier. Effectively,
we are saying that ∃x(α) is true if one can find some assignment to the variable x such that
the formula becomes true. For example, ∃x(Even(x)) is true because x can, for example, be
2. Likewise, ∀x(GreaterThan(successor(x), x)) is true because x+1 is greater than x for every
x. Where there is no ambiguity, we can drop the outer brackets and write ∃x Even(x) and
∀x GreaterThan(successor(x), x).
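Over a finite domain the definition can be applied directly by enumerating the x-variants of an assignment, as in the minimal sketch below (not from the book; the domain range(0, 20) is an assumption, and an infinite domain such as the natural numbers obviously cannot be enumerated this way).

domain = range(0, 20)

def val_exists(var, formula, assignment):
    # true iff the formula holds for some x-variant of the assignment
    return any(formula({**assignment, var: d}) for d in domain)

def val_forall(var, formula, assignment):
    # true iff the formula holds for every x-variant of the assignment
    return all(formula({**assignment, var: d}) for d in domain)

print(val_exists("x", lambda a: a["x"] % 2 == 0, {}))      # Exists x Even(x): True
print(val_forall("x", lambda a: a["x"] + 1 > a["x"], {}))  # Forall x GreaterThan(successor(x), x): True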
Not every formula α ∈ F can be assigned a truth value. In particular, formulas with free
variables may not have a defined truth value. Consider, for example, the sentences LessThan(x, y)
or ∃x LessThan(x, y) or ∀x LessThan(x, y). We cannot say anything about the truth values
of any of these. That is because y is a free variable in all three, and x is a free variable in the
first. A free variable is one that is not bound. A bound variable is one that occurs in the scope of
a quantifier. In the second and third sentences, x is bound and y is free. A formula without any
free variables is a sentence of L(R, F, C). A sentence can be assigned a truth value. The truth
value of a sentence is not dependent on an assignment but does depend upon the interpretation
I = <D, I>. For example, given the domain of natural numbers, the truth values of the following
four sentences are:
- Val((∃x(∃y LessThan(x, y)))^I) = true because we can always find two numbers x and y such
that x < y.
- Val((∃x(∀y LessThan(x, y)))^I) = false because there is no x such that for all y, x is less
than y. Not even 0, because it is not less than itself. If the relation was less than or equal to,
then the sentence would become true.
- Val((∀x(∀y LessThan(x, y)))^I) = false because we can find many counter-examples where
x > y.
- Val((∀x(∃y LessThan(x, y)))^I) = true because for every x one can pick a larger y, for example y = x + 1.
A point to note. The name of a variable does not affect the truth value of a formula. Thus, the
formula ∃x(∀y LessThan(x, y)) is logically equivalent to ∃y(∀x LessThan(y, x)), which is
equivalent to ∃z(∀y LessThan(z, y)). One can rename variables as long as the new name does
not occur elsewhere in the sentence. As we will see, renaming variables apart in two formulas
can sometimes be necessary.
We have introduced a new notation here. The formula α[x] is read as a formula α which
has the variable x somewhere in it, and the rule UI is read as: replace all instances of x in α[x]
with a, where a is a term in the language. The following description of the rule is equivalent.
G: α[a] ⊢ ∃x α[x]
The first sentence ∀x Brave(x) says that all of them are brave, and the second one ∃x Brave(x)
says that at least one of them is brave. Now one can relate rule G to addition (α ⊢ α ∨ β)
from PL, and UI to simplification (α ∧ β ⊢ α). This correspondence carries forward to
DeMorgan’s laws of substitution.
¬∀x α ≡ ∃x ¬α
¬∃x α ≡ ∀x ¬α
Logicians also introduce the following rules to facilitate reasoning, but that needs careful
treatment.
Universal generalization or V introduction produces a universally quantified formula from
a formula with an arbitrary constant. Care must be taken that the constant name is not one of
the existing constant names. When that is the case, and we start with a premise P(k), where k
is a new symbol, and can derive Q(k), then we can introduce a tautological implication ∀x(P(x)
⊃ Q(x)). The intuition is that since k was arbitrarily chosen, the connection between P(k) and
Q(k) must be universal.
Existential instantiation essentially creates a new name for an unnamed entity that is
known to exist.
Now this entity can participate in a rule of inference, as shown in the example below.
The police thug: A police person murdered Mahsa. Anyone who murders someone is
a murderer. All murderers must be prosecuted. Therefore, the police person must be
prosecuted.
The argument essentially goes like this. Some police person murdered Mahsa. Let that
person be called Bulli. Bulli is a murderer. All murderers must be prosecuted. Bulli must be
prosecuted. Therefore, some police person must be prosecuted.
This kind of reasoning can be handled naturally with Skolemization described later.
- Any name without a marker is a constant symbol, except Skolem constants discussed
below. For example, Neda, Mahua, Kailash, two, 2, Chennai, Narmada, Asia, and Jupiter.
In general, any unadorned string stands for a constant. This includes the ones commonly
used by logicians, such as a, b, c, but also x, y, z that logicians reserve for variables. In the
programming language Prolog, constants begin with lower case letters.
- A universal variable, or a universally quantified variable, is prefixed with a ‘?’. For example,
?Neda, ?Mahua, ?Kailash, ?two, ?2, ?Chennai, ?a, ?b, ?x, and ?y. In the Prolog language,
variables are capitalized.
- An existential variable, or an existentially quantified variable, begins with the prefix ‘sk’
or ‘sk-’. This is to ensure that it does not come from the set V of variables or the set C
of constants from (R, F, C). For example, sk-Neda and sk-2. These are called Skolem
constants after the logician Thoralf Skolem, and the process of 3 elimination is called
Skolemization. We saw an example of this in the previous section.
- An existential variable in the scope of a universal quantifier is represented as a Skolem
function of the universally quantified variable. For example, sk12(?x). This could have
been in a formula ∀x∃y Loves(x, y), which would be Skolemized as Loves(?x, sk12(?x)).
The intuition here is that y cannot be any arbitrary element and is dependent on the
value of x.
Identifying the real nature of a variable has to be done with care. If one is looking at a negated
quantified formula, then pushing the negation sign inside can switch the quantifier and reveal its
true colours, as per DeMorgan’s laws described earlier. Then if, talking about numbers, one has
to represent the sentence ‘no element is both odd and even’, one might begin by expressing it as
¬∃x(Odd(x) ∧ Even(x)), which is equivalent to ∀x ¬(Odd(x) ∧ Even(x))
The equivalent formula on the right says that ‘every element is not both odd and even’, where the
variable x is a universally quantified variable and the implicit quantifier representation should
be ¬(Odd(?x) ∧ Even(?x)).
One must also keep in mind that the antecedent in an implication statement contains a
negation because (P ⊃ Q) ≡ (¬P ∨ Q). Then a formula of the kind ∀x((∃y P(y, x)) ⊃ Q(x)) is
equivalent to ∀x,y (P(y, x) ⊃ Q(x)). The sentence 'every person who has a friend is happy' can
be represented as ∀x((∃y Friend(y, x)) ⊃ Happy(x)), or equivalently ∀x,y (Friend(y, x) ⊃ Happy(x)).
Interestingly, the converse treats the variable P differently. If A is an aunt of X then A must
be female and there must exist a related individual P who is the parent of X and a sibling of A.
In this sentence, the variable P plays the role of a Skolem function of the universal variables A and X,
because its value depends upon the values of A and X; in the Skolemized form P is replaced by a Skolem function of A and X.
An interesting formula, named the drinking formula (Smullyan, 2009), is ∃x(D(x) ⊃
∀y D(y)), which brings out the nature of a quantifier in the antecedent. The reader is encouraged
to show that this formula is a tautology, and is in fact logically different from (∃x D(x) ⊃ ∀y D(y)).
10.4.2 Unification
In PL a rule of inference is applicable when one can find matching patterns in the KB. For
example, the rule disjunctive syllogism (DS) is applicable when we have {((P⊃Q) ∨ (R⊃S)),
¬(P⊃Q)} in the KB. In the rule (α ∨ β), ¬α ⊢ β, the propositional variable α matches the
formula (P⊃Q) and β matches (R⊃S), to produce (R⊃S) as the conclusion. Such direct pattern
matching is not possible in FOL because variables have to match and be bound to constants, or
even other variables, or functions, all being terms of different hues. Matching is accomplished
by the Unification algorithm, which substitutes variables with terms to make the formulas
identical, enabling the rule to be applied (Charniak and McDermott, 1985). Here are some
definitions first.
A substitution θ is a set of <variable, value> pairs, each denoting the value to be substituted
for the variable. When we apply the substitution θ to a formula α, we replace every variable
from θ in α with the corresponding value from θ. The new formula is denoted by αθ.
A unifier for two formulas α and β is a substitution θ that makes the two formulas identical,
that is, αθ = βθ. We say that α unifies with β.
A unifier θ unifies a set of formulas {α1, α2, ..., αN} if α1θ = α2θ = ... = αNθ. A most general
unifier (MGU) is a unifier that commits variables to the least extent possible; every other unifier
can be obtained from it by applying a further substitution.
Algorithm 10.1. The Unification algorithm initializes the substitution to an empty list. It
then parses the formula with calls to SubUnify, AtomicUnify, and TermUnify. The synthesis
of the substitution is done in VarUnify.
Unify(arg1, arg2)
1. return SubUnify(arg1, arg2, ())
SubUnify(arg1, arg2, theta)
1. if arg1 and arg2 are compound and have same logical structure
2. for each atfi in arg1 and atfj in arg2 call AtomicUnify(atfi, atfj, theta)
3. else return AtomicUnify(arg1, arg2, theta)
AtomicUnify(functor1, functor2, theta)
1. if functor1 is P(t1, t2, ..., tn) and functor2 is P(s1, s2, ..., sn)
2. or functor1 is f(t1, t2, ..., tn) and functor2 is f(s1, s2, ..., sn)
3. for each ti and si call TermUnify(ti, si, theta)
TermUnify(term1, term2, theta)
1. if term1 and term2 are identical constants return theta
2. if term1 is a variable return VarUnify(term1, term2, theta)
3. if term2 is a variable return VarUnify(term2, term1, theta)
4. return AtomicUnify(term1, term2, theta)
VarUnify(var, term, theta)
1. if var occurs in term return NIL
2. if var has a value <var, alpha> in theta return TermUnify(term, alpha, theta)
3. return <var, term>: theta
The function SubUnify is initialized with the empty substitution, which is incrementally
populated inside the VarUnify function that takes a variable and a term that could possibly
be assigned as a value to the variable, and does so if it is consistent to do so. Observe that
terms can be arbitrarily nested and calls between AtomicUnify and TermUnify can handle that
case. The two formulas Friend(mother(mother(?x)), mother(mother(Zhina))) and
Friend(mother(?y), ?z) will make such cross function calls.
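The following minimal sketch in Python illustrates the same idea, assuming terms and atomic formulas are represented as nested tuples whose first element is the predicate or function name, and variables as strings beginning with '?'. It mirrors the incremental substitution and the occurs check of Algorithm 10.1, though not its exact decomposition into SubUnify, AtomicUnify, and TermUnify.

def is_var(t):
    return isinstance(t, str) and t.startswith('?')

def walk(t, theta):
    """Follow the substitution chain until t is not a bound variable."""
    while is_var(t) and t in theta:
        t = theta[t]
    return t

def occurs(var, term, theta):
    term = walk(term, theta)
    if term == var:
        return True
    if isinstance(term, tuple):
        return any(occurs(var, arg, theta) for arg in term[1:])
    return False

def unify(a, b, theta=None):
    """Return a substitution (dict) unifying a and b, or None on failure."""
    if theta is None:
        theta = {}
    a, b = walk(a, theta), walk(b, theta)
    if a == b:
        return theta
    if is_var(a):
        return None if occurs(a, b, theta) else {**theta, a: b}
    if is_var(b):
        return None if occurs(b, a, theta) else {**theta, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) \
            and a[0] == b[0] and len(a) == len(b):
        for x, y in zip(a[1:], b[1:]):
            theta = unify(x, y, theta)
            if theta is None:
                return None
        return theta
    return None   # mismatched constants or functors

# Friend(mother(mother(?x)), mother(mother(Zhina))) vs Friend(mother(?y), ?z)
f1 = ('Friend', ('mother', ('mother', '?x')), ('mother', ('mother', 'Zhina')))
f2 = ('Friend', ('mother', '?y'), '?z')
print(unify(f1, f2))
# {'?y': ('mother', '?x'), '?z': ('mother', ('mother', 'Zhina'))}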
Care must be taken to separate the variables in the two formulas apart. This needs to be
done since the variables in two quantified formulas should not have the same name. Consider
the two formulas Loves(?x, Sarina) and Loves(Nika, ?x). The former says that everyone loves
Sarina, and the latter says that Nika loves everyone. The two should be unifiable to yield the
formula Loves(Nika, Sarina). But when both use the variable name ?x, Algorithm 10.1 will fail
because the algorithm will first add <?x, Nika>. Then when trying to unify Sarina with ?x it
will run into Line 2 of VarUnify which says that since ?x already has a value Nika, unify that
value with Sarina. This cannot be done because they are different constants. Instead, if the
second formula were to be Loves(Nika, ?y), which asserts the same fact, then <?y, Sarina> can
be added to the substitution.
The occurs check in Line 1 of VarUnify prohibits the variable from being present in the term
that we are assigning to it as a value. If it were present, the variable inside the term would have to be
substituted as well, and that could lead to an infinite loop.
The operational difficulty here is that moving from the KB towards conclusions one has
to guess how to apply the rule UI. When working with the rule in the implicit quantifier form,
the use of unification makes guesswork unnecessary because Man(Socrates) is the one being
unified with the antecedent, using modified modus ponens (MMP).
The two versions of the proof are contrasted in Figure 10.2. Not only does the MMP rule
obviate the need for guesswork, it also results in a shorter proof.
Figure 10.2 Forward chaining in FOL is a two step process with UI followed by MP as shown
on the left. With the use of implicit quantifier notation, this collapses into a one step inference
using the rule MMP as shown on the right. Moreover, one does not have to guess the value
of x in the UI step.
The phrase ‘forward chaining’ alludes to the fact that a sequence of rules are instrumental
in chaining the facts to the goal. This is most easily visualized with rules with one antecedent.
For example, Man(Socrates) connects to Mortal(Socrates), which could in turn connect to
FiniteLife(Socrates). In practice though, rules may have more than one antecedent, in which
case the proof structure is a tree, as will be evident when we look at backward chaining.
The rule existential instantiation can naturally be adapted to the implicit quantifier form.
The problem from the earlier section repeated below is solved with the existential variable
being represented by a Skolem constant.
Some police person murdered Mahsa. Anyone who murders someone is a murderer. All
murderers must be prosecuted. Therefore, the police person must be prosecuted.
We will use the adaptation of the MP rules to MMP in our further discussion. However,
any rule of inference can be adapted to work with the implicit quantifier form. As long as the
antecedents can be unified, the conclusion can be added by the forward chaining algorithm.
The forward chaining algorithm is akin to forward state space search as described in
Chapter 7 and also Chapter 9. Theorem proving in FOL is a simpler process because sentences
are only added, and never deleted. Otherwise, the algorithm essentially searches the space of
theorems looking for the one it needs to derive. The MoveGen function in forward chaining is
depicted in Figure 10.3, which is a reproduction of Figure 7.1. It lists the immediate inferences
that can be made in a given state or KB.
Figure 10.3 The MoveGen function for forward chaining is composed from the set of matching
antecedent-consequent instances of rules of inference. The MoveGen is simply a collection of
all instances of all applicable rules of inference.
Forward chaining is associated typically with forward reasoning, when we move from the
given sentences towards the goals to be derived. It is also known as data driven, and has the
trait of being eager computation. A forward chaining system makes inferences when it can.
This contrasts with backward chaining or backward reasoning where inferences are made only
when they are needed.
Figure 10.4 Backward chaining matches the consequent and moves to the antecedent,
producing subgoals, till the goal is satisfied in the KB. In the figure the shaded boxes constitute
the KB, and unshaded ones the goals. A propositional query Mortal(Socrates) on the top
produces the subquery Man(Socrates) that has a matching fact in the KB. An existential query
Mortal(?z) produces the subquery Man(?z) which can match any of the three facts.
A sentence that is true may have more than one justification or a proof. This could be
because a goal may be the consequent in more than one implication. This introduces search
into the backward chaining process. Consider the following KB which defines some relations
between humans:
We often refer to the implication statements as rules and the rest as facts. Both are part of
the KB.
The backward chaining process explores an And-Or graph, similar to the ones described in
Section 7.3. The search space for the query Goal: Related(?x, ?y), which asks if anyone is related
to someone, is shown in Figure 10.5. Some edges are labelled with the rules that transform a
query into a subquery. The formulas in the unshaded boxes are the goals, and the ones in the
shaded boxes are the facts. As in Chapter 7, the solution is a subtree. The AND siblings are not
independent, however. The shared variables must be bound to the same values. One solution for
this query is marked by three solid edges connecting atomic queries to matching facts. In the
solution, Hadis is related to Nika because she is her aunt: Hadis is female and a sibling
of Sarina, who is Nika's parent.
[Figure: goal tree rooted at Related(?x, ?y), with the leaf facts S(Sarina, Hadis), P(Sarina, Nika), P(Hadis, Zhina), F(Sarina), F(Hadis), and P(Neda, Hadis)]
Figure 10.5 The goal tree for the query Goal: Related(?x, ?y). The shaded nodes depict one
solution Related(Hadis, Nika) via the aunt rule. The predicates Sibling, Parent, and Female are
represented by their first letters.
10.4.5 Prolog
The programming language Prolog does depth first search (DFS) on the goal tree or the
And-Or graph. This amounts to inspecting the sentences in the program KB1 from top to
bottom and left to right in the text notation. DFS would have returned the answer true with {<?x,
Zhina>, <?y, Nika>} using the cousin rule. One can follow the computation via the sequence
of goals that need to be solved. Starting with the query the sequence is as follows. On the left
is the pending set of goals, and on the right the substitution employed. Doing DFS the leftmost
goal is removed from the goal set, and replaced with the antecedents of a rule that the goal is a
consequent in. If the goal matches a fact, then no new goal is added. The search may backtrack
on the goal tree and terminates when the goal set is empty. In our example, the first path leads
to a solution.
{Related(?x, ?y)}
{Sibling(?q, ?p), Parent(?q, ?y), Parent(?p, ?x)} {}
{Parent(Sarina, ?y), Parent(Hadis, ?x)} {<?q, Sarina>, <?p, Hadis>}
{Parent(Hadis, ?x)} {<?q, Sarina>, <?p, Hadis>, <?y, Nika>}
{} {<?q, Sarina>, <?p, Hadis>, <?y, Nika>, <?x, Zhina>}
At this point there are no goals to solve and the original query Related(?x, ?y) becomes
true with ?x = Zhina and ?y = Nika. The goal is true and Zhina is related to Nika.
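A compact sketch of this backward chaining search in Python follows, assuming the same tuple-and-'?' representation as the unification sketch above. The occurs check is omitted for brevity, and the facts and single Related rule below are only the fragment of KB1 needed for this trace, not the full KB.

def is_var(t): return isinstance(t, str) and t.startswith('?')

def walk(t, s):
    while is_var(t) and t in s: t = s[t]
    return t

def unify(a, b, s):
    a, b = walk(a, s), walk(b, s)
    if a == b: return s
    if is_var(a): return {**s, a: b}
    if is_var(b): return {**s, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            s = unify(x, y, s)
            if s is None: return None
        return s
    return None

FACTS = [('Sibling', 'Sarina', 'Hadis'), ('Parent', 'Sarina', 'Nika'),
         ('Parent', 'Hadis', 'Zhina')]
RULES = [  # (consequent, [antecedents]) -- the cousin-style Related rule
    (('Related', '?x', '?y'),
     [('Sibling', '?q', '?p'), ('Parent', '?q', '?y'), ('Parent', '?p', '?x')]),
]

def rename(term, n):
    """Standardize rule variables apart by tagging them with a counter."""
    if is_var(term): return term + str(n)
    if isinstance(term, tuple): return tuple(rename(t, n) for t in term)
    return term

def solve(goals, s, depth=0):
    """Yield substitutions satisfying all goals, leftmost goal first (DFS)."""
    if not goals:
        yield s
        return
    goal, rest = goals[0], goals[1:]
    for fact in FACTS:
        s1 = unify(goal, fact, s)
        if s1 is not None:
            yield from solve(rest, s1, depth + 1)
    for head, body in RULES:
        head, body = rename(head, depth), [rename(b, depth) for b in body]
        s1 = unify(goal, head, s)
        if s1 is not None:
            yield from solve(body + rest, s1, depth + 1)

theta = next(solve([('Related', '?x', '?y')], {}))
print(walk('?x', theta), 'is related to', walk('?y', theta))
# Zhina is related to Nika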
The perceptive reader would have noticed that the order of writing the antecedents affects the
performance of backward chaining with DFS. With the cousin rule, Prolog would first look for
a pair of siblings and then check if they have children who would be cousins. This could lead
to wasted work looking at siblings when we should be looking at parents who are siblings.
Likewise, the aunt rule would look at every female and every unrelated parent-child pair before
unearthing the aunt relation. It would be much more efficient to start with the arguments in
the query. The reader should verify that the following two versions would answer the given
query faster:
A point to ponder is whether one can have variations of rules tailored to different queries.
Observe that if we had asked the query ‘Is Sarina related to anyone?’ backward chaining would
have returned false with the given KB. None of the three rules defining the related predicate
applies, even though a closer look reveals that Sarina is Zhina’s mother’s sister. This is because
the rule defining an aunt requires Sibling(Hadis, Sarina) to be true, which it is not. KB1 only
has Sibling(Sarina, Hadis). If we had a symmetry rule (Sibling(?x, ?y) ⊃ Sibling(?y, ?x)) then
we could have inferred Sibling(Hadis, Sarina) from Sibling(Sarina, Hadis), which would have
then matched the aunt rule.
The grandparent relation given earlier is defined in terms of the Parent relation.4 What
about great grandparents? Or great-great grandparents? FOL allows one to recursively define
the relation ancestor as follows, using two sentences:
4 We use the terms ‘relation’ and ‘predicate’ synonymously. Strictly speaking, relations are in the domain while
predicates are in the language of FOL.
The first clause is the base clause defining a parent as an ancestor. The second one says that
the ancestor of a parent is an ancestor. The reader is encouraged to investigate variations of the
second clause that may work for different queries when backward chaining does DFS.
Backward chaining goes beyond database retrieval. One can have recursive rules that can
allow connecting elements that are arbitrarily far from each other in the number of inferences
needed. This makes backward chaining a means of implementing a programming language.
Robert Kowalski introduced the idea that logical reasoning is a way of doing any computation, an
approach that is known as logic programming. In idealized logic programming the programmer
or the user only needs to specify the relation between the input and the output in logic. The
task of making the connection is left to the theorem prover. Backward chaining is the inference
engine in Prolog (Sterling and Shapiro, 1994).
We look at a tiny example to define addition on the domain of natural numbers
NN = {0, s(0), s(s(0)), s(s(s(0))), ...}, where 0 is a constant and s is a successor function. In
maths we have adopted a naming convention for the numbers wherein NN = {0, 1, 2, 3, ...}.
The following two statements define addition, and also serve as a program to add two numbers.
The meaning of the sentences in the language of mathematics is shown alongside.
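The two defining sentences did not reproduce here; going by the proof in Figure 10.6, they are presumably of the following form in the implicit quantifier notation, the base fact saying that 0 + x = x, and the rule saying that if x + y = z then s(x) + y = s(z):

Sum(0, ?x, ?x)                                 0 + x = x
(Sum(?x, ?y, ?z) ⊃ Sum(s(?x), ?y, s(?z)))      (x + 1) + y = (z + 1) whenever x + y = z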
As one can see, the two sentences give a recursive definition of addition. How can one
check whether 3 + 2 = 5 is true? We begin with the statement as a goal, and successively reduce
it to subgoals till we reach a fact in the KB, in which case the set of goals to be solved becomes
an empty set.
One does not in practice use the above program for addition, because its complexity
is proportional to the first argument, while we can get addition in constant time. We have
mentioned it only to show that logic subsumes arithmetic.
An interesting feature is that with the same program we can both add and subtract. The
query Goal: {Sum(s(s(s(0))), s(s(0)), ?sum)} asks for the sum of 3 + 2, while the query
Goal: {Sum(s(s(s(0))), ?diff, s(s(s(s(s(0))))))} asks for the difference between 5 and 3.
One can even pose the query Goal: {Sum(?x, ?y, s(s(s(s(s(0))))))} to find two numbers
that add up to 5.
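The relational reading of the same two clauses can be sketched in Python by enumerating solutions rather than computing a single sum. The representation of numerals as nested tuples and the generator below are illustrative choices, not the book's program.

ZERO = 0
def s(n): return ('s', n)

def peano(n):            # helper: Python int -> successor numeral
    return ZERO if n == 0 else s(peano(n - 1))

def to_int(p):           # helper: successor numeral -> Python int
    k = 0
    while p != ZERO:
        _, p = p
        k += 1
    return k

def sums(z):
    """Yield all pairs (x, y) of numerals with x + y = z, mirroring
    the base clause Sum(0, y, y) and the recursive clause that takes
    x + y = z to s(x) + y = s(z)."""
    yield (ZERO, z)                      # base clause: 0 + z = z
    if isinstance(z, tuple):             # z = s(z1)
        _, z1 = z
        for x, y in sums(z1):            # x + y = z1  =>  s(x) + y = s(z1)
            yield (s(x), y)

# All ways of writing 5 as a sum of two natural numbers:
for x, y in sums(peano(5)):
    print(to_int(x), '+', to_int(y), '= 5')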
Figure 10.6 shows the proof for the statement Sum(?n, s(s(0)), s(s(s(s(s(0)))))), which is
true when ?n = s(s(s(0))). Observe that the goal tree is linear because the implication has only
one antecedent. The nodes in the unshaded rectangles are goals and subgoals, and the contents
of the shaded rectangles are facts. The final subgoal {Sum(?x3, s(s(0)), s(s(0)))} is satisfied by
the fact Sum(0, ?x, ?x).
Figure 10.6 The Sum predicates for addition can be used for subtraction as well. The
existential goal 'There is some number n such that n + 2 = 5' is shown to be true with
?n = s(s(s(0))) via back substitution from ?x3 = 0. The unshaded nodes are the goals and the
subgoals. The shaded nodes are the KB, which is also a program for addition and subtraction.
The sequence of goals and subgoals for the above problem is shown below. The clauses
defining the Sum predicate are reproduced again. Only the relevant substitution is shown at
each step. The others can be read off from the proof tree in Figure 10.6.
The query is ∃x,y (P(x, y) ∧ Q(x) ∧ ¬Q(y)), which in the implicit quantifier form is
written as (P(?x, ?y) ∧ Q(?x) ∧ ¬Q(?y)).
Neither forward chaining nor backward chaining has a derivation, primarily because the
KB is just a set of propositions and has no logical connectives. Logic connectives are the basis
of rules of inference; for example, MP relies on ⊃. Yet the goal formula is entailed from the KB.
Choosing a familiar domain makes that easier to accept.
Let the interpretation ℑ = <D, I> have D as the set of blocks from the blocks world, and
let I be as follows:
and the existential query is ‘There is a blue block on a non-blue block.’ A little bit of
thought reveals that this is indeed a true sentence. We do not know the colour of block B, but it
is either blue or it is not. If it is blue, then then it is on C, and hence a blue block is on a non-blue
block. If it is not blue, then the statement is again true because A is on B and A is blue. Thus, the
existential statement is true, even though we cannot name the blocks that make it true. In fact,
we do not know which two blocks make it true.
The interpretation is in the mind of the user. Consider another interpretation where D is the
domain of women, and the mapping is as follows:
The query now is ‘Does a married woman like an unmarried one?’ and again the answer is
yes, even though we do not know the two individuals that make it true. This lack of knowledge
is characterized by disjunction, which has often been difficult to handle for computing, and
there is a disjunctive sentence implicitly lurking in the KB - Hadis is married, or Hadis is not
married.
Next, we look at a proof method that is complete for FOL. If a formula α is entailed by the
KB, then the resolution refutation method will have a proof for α. The method adds the negated goal
to the KB which, if the goal is entailed, will now become unsatisfiable. Robinson showed that
given an unsatisfiable KB the empty clause, which stands for a contradiction and is always
false, can always be derived.
The method works with the clause form, whose structure is described below.
The Norwegian logician Thoralf Skolem had shown a century ago that any FOL formula
can be converted into clause form (reproduced in Skolem (1970)). The procedure is as follows:
1. Standardize variables apart across quantifiers. Rename variables so that the same symbol
does not occur in different quantifiers.
2. Eliminate all occurrences of operators other than ∧, ∨, and ¬.
3. Move ¬ all the way in.
4. Push the quantifiers to the right. This ensures that their scope is as tight as possible.
5. Eliminate ∃ by Skolemization.
6. Move all ∀ to the left. They can be ignored henceforth.
7. Distribute ∨ over ∧.
8. Simplify.
9. Rename variables in each clause (disjunction).
The clause form is a conjunction of a set of clauses. A literal is an atomic formula or the
negation of an atomic formula. A clause is a disjunction of literals. The resolution rule takes
two clauses as input. One of the clauses must have a positive literal and the other a matching
negative literal. The inferred clause combines the remaining literals after removing the two
opposite literals. Let Ci and Ck be two clauses with the structure Ci = (L1 ∨ L2 ∨ ... ∨ Lk ∨ C′i)
and Ck = (¬R1 ∨ ¬R2 ∨ ... ∨ ¬Rs ∨ C′k).
If θ is the MGU for {L1, L2, ..., Lk, R1, R2, ..., Rs}, then we can resolve Ci and Ck to derive
the resolvent (C′i ∨ C′k)θ.
That is, we throw away all the positive literals Li and the negative literals ¬Ri, and combine
the remainder after applying the substitution θ (Charniak and McDermott, 1985). We can
illustrate the validity of the resolution rule by establishing the following equivalence for a
simpler formula: ((P ∨ Q) ∧ (¬Q ∨ R)) ≡ ((P ∨ Q) ∧ (¬Q ∨ R) ∧ (P ∨ R)).
The reader is encouraged to show that the above formula is a propositional tautology. In
the above formula each element in the disjunct - P, Q, R - is an atomic formula, and could be
any formula including an FOL formula. A consequence of the soundness of the resolution rule
is that adding the resolvent to the set of sentences does not change the truth value of the KB.
Robinson showed that the method is complete for an unsatisfiable set of clauses. We can
make our KB unsatisfiable by adding the negation of the goal clause to the KB. Then the
following holds. Given a set of premises {α1, α2, ..., αN} and the desired goal β, we want to
determine if the formula ((α1 ∧ α2 ∧ ... ∧ αN) ⊃ β)
is true. This follows from the well known Deduction theorem, which asserts that
{α1, α2, ..., αN} ⊨ β iff ((α1 ∧ α2 ∧ ... ∧ αN) ⊃ β) is a tautology.
Also ((α1 ∧ α2 ∧ ... ∧ αN) ⊃ β) is a tautology iff ¬((α1 ∧ α2 ∧ ... ∧ αN) ⊃ β) is unsatisfiable.
To show that ¬((α1 ∧ α2 ∧ ... ∧ αN) ⊃ β) is unsatisfiable, one can add the negation of the goal
to the set of premises, because ¬((α1 ∧ α2 ∧ ... ∧ αN) ⊃ β) ≡ (α1 ∧ α2 ∧ ... ∧ αN ∧ ¬β).
The following equivalences establish that if we can add the null or empty clause to a KB,
then the KB must have been unsatisfiable to start with. The null clause can be generated by
resolving a literal P with its negation ¬P. Let the original KB be {P1, P2, ..., PN}. Remember,
this stands for a conjunction of the N clauses. To this we add a sequence of resolvents R1, R2,
R3, ... culminating with the null clause ⊥. The databases at all stages are logically equivalent because the
resolution rule is sound.
Now since the last set of clauses evaluates to false (because it contains the empty clause),
the KB we started with, which is logically equivalent, also evaluates to false. Thus {P1, P2, ...,
PN} is unsatisfiable. The procedure for finding a proof by the resolution refutation method is
as follows:
The proof procedure uses only one rule of inference repeatedly till the null clause is derived.
We look at our familiar example, the Socratic argument.
The three clauses for the Socratic argument, with the negated goal included, are Man(Socrates),
(¬Man(?x) ∨ Mortal(?x)), and ¬Mortal(Socrates).
Remember that applying a substitution {?x = Socrates} is a kind of instantiation. And since
(¬Man(?x) ∨ Mortal(?x)) ≡ (Man(?x) ⊃ Mortal(?x)), the first step in the resolution derivation is
emulating the MMP derivation. What is more, it can also emulate backward chaining. Consider
the alternate derivation,
as ¬Mortal(?z), and the same derivation would apply. This gives us an intuition why in goals
or queries the Skolemization convention is inverted and existential query variables are prefixed
with a ‘?’ and not replaced with Skolem constants or functions.
In this way, the resolution proof subsumes both forward and backward chaining, but can
do more. The problem that both could not solve can be solved by the resolution method. The
given KB2 = {P(c1, c2), P(c2, c3), Q(c1), ¬Q(c3)} is already in clause form. The query is
∃x,y(P(x, y) ∧ Q(x) ∧ ¬Q(y)), which after negation is written as (¬P(?x, ?y) ∨ ¬Q(?x) ∨ Q(?y)).
The derivation of the null clause is shown in Figure 10.7 as a directed acyclic graph (DAG), as
resolution refutation proofs are often depicted. To make it more intelligible to us we have used
the first interpretation from the blocks world where P(X,Y) denotes X is on Y, and Q(X) denotes
X is blue.
Figure 10.7 Given the KB {On(A,B), On(B,C), Blue(A), ¬Blue(C)}, a resolution refutation proof
of the statement 'A blue block is on a non-blue block' after adding the negated goal to the KB.
The empty clause is derived by resolving Blue(B) with ¬Blue(B).
One can observe that the null clause or contradiction is derived from Blue(B) and ¬Blue(B),
which cannot both be true at the same time. Block B is either blue, or it is not blue. We do not
know whether it is blue or not. Reasoning by cases one can argue that if it is blue, then since it
is on block C, a blue block is on a non-blue block. And if B is not blue, then the blue block A is
on B, so the statement is again true. In either case the statement is true, even though we do not
know which two blocks fit the bill.
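The role of the null clause can be seen in a small Python sketch of resolution at the propositional level. Clauses are sets of string literals with '~' marking negation; the literal names below are only a ground, propositional rendering of the blocks example, with the negated goal instantiated for the two On facts, so unification plays no part here.

from itertools import combinations

def negate(lit):
    return lit[1:] if lit.startswith('~') else '~' + lit

def resolvents(c1, c2):
    """All clauses obtained by cancelling one complementary pair from c1, c2."""
    out = []
    for lit in c1:
        if negate(lit) in c2:
            out.append((c1 - {lit}) | (c2 - {negate(lit)}))
    return out

def refute(clauses):
    """Saturate under resolution; return True iff the empty clause is derived."""
    clauses = set(map(frozenset, clauses))
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for r in resolvents(c1, c2):
                if not r:
                    return True          # derived the null clause
                new.add(r)
        if new <= clauses:
            return False                 # nothing new: the set is satisfiable
        clauses |= new

# Ground clauses for the blocks example, with the negated goal
# ~On(x,y) v ~Blue(x) v Blue(y) written out for the two On facts.
kb = [
    {'OnAB'}, {'OnBC'}, {'BlueA'}, {'~BlueC'},
    {'~OnAB', '~BlueA', 'BlueB'},
    {'~OnBC', '~BlueB', 'BlueC'},
]
print(refute(kb))   # True: the KB plus negated goal is unsatisfiable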
When we work in FOL with equality, we implicitly state that the following equality axioms must be added to the
KB (Brachman and Levesque, 2004).
• Reflexivity: ∀x (x = x)
• Symmetry: ∀x,y ((x = y) ⊃ (y = x))
• Transitivity: ∀x,y,z (((x = y) ∧ (y = z)) ⊃ (x = z))
• Substitution for functions: for every function f of arity N
∀x1,y1, x2,y2, ..., xN,yN (((x1 = y1) ∧ (x2 = y2) ∧ ... ∧ (xN = yN)) ⊃ f(x1, x2, ..., xN) = f(y1, y2, ..., yN))
• Substitution for predicates: for every predicate P of arity N
∀x1,y1, x2,y2, ..., xN,yN (((x1 = y1) ∧ (x2 = y2) ∧ ... ∧ (xN = yN)) ⊃ (P(x1, x2, ..., xN) ⊃ P(y1, y2, ..., yN)))
We will use the implicit quantifier clause form of the equality axioms and demonstrate their
use with the resolution refutation method. In the following, a formula ¬(?x = ?y) is written in
the familiar form (?x ≠ ?y).
• Reflexivity: ?x = ?x
• Symmetry: ?x ≠ ?y ∨ ?y = ?x
• Transitivity: ?x ≠ ?y ∨ ?y ≠ ?z ∨ ?x = ?z
• Substitution for functions: for every function f of arity N
?x1 ≠ ?y1 ∨ ?x2 ≠ ?y2 ∨ ... ∨ ?xN ≠ ?yN ∨ (f(?x1, ?x2, ..., ?xN) = f(?y1, ?y2, ..., ?yN))
• Substitution for predicates: for every predicate P of arity N
?x1 ≠ ?y1 ∨ ?x2 ≠ ?y2 ∨ ... ∨ ?xN ≠ ?yN ∨ ¬P(?x1, ?x2, ..., ?xN) ∨ P(?y1, ?y2, ..., ?yN)
Given the above axioms and the KB = {?N = 5, ?M = ?N}, the goal ?M = 5 can be proved
as follows.
1. ?x = ?x                                    Reflexivity
2. ?x1 ≠ ?y1 ∨ ?y1 = ?x1                      Symmetry
3. ?x2 ≠ ?y2 ∨ ?y2 ≠ ?z2 ∨ ?x2 = ?z2          Transitivity
4. ?N = 5                                     Premise
5. ?M = ?N                                    Premise
6. ?M ≠ 5                                     Negated goal
7. 5 = ?N                                     2, 4, {<?x1, ?N>, <?y1, 5>}
8. ?N ≠ ?z2 ∨ 5 = ?z2                         3, 7, {<?x2, 5>, <?y2, ?N>}
9. ?N = ?M                                    2, 5, {<?x1, ?M>, <?y1, ?N>}
10. 5 = ?M                                    9, 8, {<?z2, ?M>}
11. ?M = 5                                    2, 10, {<?x1, 5>, <?y1, ?M>}
12. ⊥                                         6, 11
The following shorter proof illustrates the fact that theorem proving involves search, and
the choice of which two clauses to resolve can affect the length of the proof.
1. ?x = ?x                                    Reflexivity
2. ?x1 ≠ ?y1 ∨ ?y1 = ?x1                      Symmetry
3. ?x2 ≠ ?y2 ∨ ?y2 ≠ ?z2 ∨ ?x2 = ?z2          Transitivity
4. ?N = 5                                     Premise
5. ?M = ?N                                    Premise
6. ?M ≠ 5                                     Negated goal
7. ?x2 ≠ ?N ∨ ?x2 = 5                         3, 4, {<?y2, ?N>, <?z2, 5>}
8. ?M = 5                                     5, 7, {<?x2, ?M>}
9. ⊥                                          6, 8
The shorter proof used only one of the three axioms listed as premises, but made a
judicious choice as to which atomic formula in the clause to cancel. Clearly, search can run into
combinatorial explosion here as well. The following two strategies have been used effectively
to try and find shorter proofs:
- Unit clause strategy. Wherever possible use a unit clause (a clause with only one atomic
formula) as one of the clauses to resolve. This ensures that the resolvent will be smaller
than the other clause. In general if one is cancelling one literal each from two clauses of
length N and M, the resolvent will be of length (N - 1) + (M - 1). Then if N = 1, the length
of the resolvent will be M - 1. Smaller resolvents are desirable because the empty clause
is of length 0, and it can be derived from two clauses of length 1 each.
- Set of support strategy. Always choose a descendant of the negated goal as one of the
clauses. This is because the null clause represents a contradiction. Remember it is derived
from two unit clauses, say B and ¬B. And the augmented KB becomes unsatisfiable because
we add the negated goal. If we keep resolving only among the clauses in the original KB,
then a contradiction will never arise.
- SLD (selected literal, linear structure, definite clauses) resolution. This is described in
Section 10.5.1.
We illustrate the utility of equality axioms with another example. The following puzzle has
sometimes baffled some humans:
A father and his son were crossing the road when they met with an accident. The father
died on the spot, and the son was rushed to the hospital, and into the operation theatre.
But the surgeon had one look at the boy and said ‘I cannot operate on this boy. He is
my son.’
Who was the surgeon?
In the following we use upper case letters as variables in the style of Prolog. Let us start
with the existential statement ∃S Surgeon(S). The task is to find out more about who S is. Let
Son(X, Y) be read as ‘The son of X is Y’.
Let m(Y) and f(Y) be two functions read as 'mother of Y' and 'father of Y'. Then the
following statement is literally true: ∀X,Y ((m(Y) = X) ≡ ¬(f(Y) = X)). Here ¬(f(Y) = X) is another way
of writing f(Y) ≠ X. That is, X can be the mother of Y, or X can be the father of Y, but not both.
This statement reduces to the following three clauses:
Let the statement ‘He is my son’ be represented as follows, with the additional piece of
knowledge that a person who speaks must be alive. Observe that both the surgeon and the son
are represented as Skolem constants.
The last sentence is our knowledge of living beings and can be reduced to two clauses as
follows:
Reflexivity: X = X
Symmetry: X ≠ Y ∨ Y = X
Transitivity: X ≠ Y ∨ Y ≠ Z ∨ X = Z
Substitution for functions:
X ≠ Y ∨ (f(X) = f(Y))
X ≠ Y ∨ (m(X) = m(Y))
Substitution for predicates:
X1 ≠ X2 ∨ ¬Alive(X1) ∨ Alive(X2)
Finally, let us serendipitously choose the goal ‘The surgeon is the boy’s mother’.
Figure 10.8 shows the resolution refutation when we add the negated goal to a set of relevant
clauses from our KB. For the ease of reading we have maintained the variable names as upper
case letters.
Figure 10.8 The answer to the question ‘Who was the surgeon?’ by resolution refutation. When
the negated goal m(sk-son) ≠ sk-S is added, the KB becomes unsatisfiable, and the null clause
can be derived after adding the equality axioms.
Informally the argument is simple. If the surgeon said that the boy was the surgeon’s son,
then the surgeon must be the mother or the father of the boy. And it must be one of the two.
Then since the father is dead, it must be the mother. If we take that as the goal and add its
negation, that the surgeon is not the mother, then it will eventually lead to a contradiction.
All three variations represent the same sentence - If X is a man and X is a philosopher, then
X is mortal.
Seen from the perspective of resolution, a Prolog query is a negative clause, obtained by
negating the goal. Then, viewing the augmented KB as a set of (unsatisfiable) clauses, there
exists a strategy, SLD resolution, that is equivalent to backward chaining that we saw earlier
in this chapter. An SLD (selected literal, linear structure, definite clauses) derivation has the
following structure:
- the first resolvent has one parent from the goal and the other from the program
- each resolvent in the derivation is negative (this is equivalent to doing backward chaining
to generate a subgoal)
- the latest resolvent becomes one of the parents, and the other parent is a positive clause
from the program
Every SLD resolution has the structure shown in Figure 10.9 (Brachman and Levesque, 2004).
The shaded parent of every resolvent is a positive definite clause which generates the subgoal if
it is a rule, or matches one of the subgoals if it is a fact. When the goal is atomic, then matching
results in the null clause terminating the derivation. The reader is encouraged to verify that the
proof in Figure 10.7 cannot be cast as an SLD derivation.
Figure 10.9 An SLD derivation with Horn clauses is essentially backward chaining. Starting
with the (negated) goal on the top right shown as a thick lined rectangle, the resolvent at each
step is a subgoal that backward chaining generates. In the final step a unit goal matches a fact
in the program, resulting in the null clause.
The efficiency comes at a cost, though. Horn clauses cannot talk of disjunctions. One cannot
say, for example, that every human is a man or a woman. This inability to handle disjunctions
is also at the root of backward chaining being unable to solve the problem from Section 10.4.6.
The goal there is existential, which is essentially a disjunctive statement.
- The TBox, or terminological box, is a set of concept axioms that define new concepts from
existing ones. For example, the sentence (Teen ⊓ ∃owns.Apple) ⊑ Happy. Here Teen,
Apple, and Happy are concepts and 'owns' is a role. The expression (Teen ⊓ ∃owns.Apple)
is a concept that is formed by the intersection (⊓) of two sets. One is Teen and the other is
the set of individuals who own at least one Apple device. The sentence is equivalent to the
FOL sentence ∀x((Teen(x) ∧ ∃y(Apple(y) ∧ Owns(x, y))) ⊃ Happy(x)).
- The RBox is a set of role axioms that define new roles. For example, the expression
hasBrother ⊑ hasSibling says that the relation hasBrother is a subset of the relation
hasSibling.
- The ABox, or assertion box, is a set of concept and role assertions. For example,
happy(Zhina) and hasSister(Kian, Sarina).
A description logic is concerned with defining new concepts and roles given a set of primitive
roles and concepts. Different DLs allow varying levels of concept describing operators. One
common language is ALC (attributive language with complement) which allows the following,
where C and D are concepts and R is a role:
More expressive languages allow more precise descriptions of concepts and roles. For example,
one can define a concept of people who have exactly one daughter who is a lawyer.
There are just three kinds of sentences that ALC is primarily concerned with.
- C ⊑ D says that the concept C is subsumed by concept D. For example, CompetitionLawyer
⊑ Lawyer and TechCompany ⊑ Company.
- C ≡ D says that the concepts C and D are equivalent. Such statements are often used
to define new concepts from primitive ones. For example, ManAllDaughtersLawyers ≡
Father ⊓ ∀daughter.Lawyer.
- C(a) or R(a, b), for example, YoungWoman(Mahsa) or daughterOf(Mahsa, Amjad).
The first two are sentences in the TBox while the third kind are assertions in the ABox. The
following kinds of queries can be answered from a given KB:
The answers to the queries are determined structurally from the descriptions. In very simple
languages a procedure called structure matching can answer the queries by scanning the
concept descriptions once (Brachman and Levesque, 2004). In more expressive languages a
proof procedure called the tableau method is used (Baader and Sattler, 2001). The tableau
method has rules for each concept forming operator to break down complex formulas into
simpler ones in search of an interpretation that is a model. The method is often used to decide
that there is no model after adding the negation of the query to the KB, in the style of looking
for a proof by contradiction. The tableau method, which applies to FOL as well, is beyond the
scope of this text.
Once one has a procedure to compute subsumption, one can devise an algorithm to
organize all concepts into a taxonomy. This can further speed up answering queries.
DLs provide a foundation for the Semantic Web (Antoniou and Harmelen, 2008). The Web
Ontology Language (OWL) is based upon DL.
There are two birds, and four different interpretations organized in a partial order. Of these,
only ℑ1 and ℑ2 are models in which the KB is true. In the other two, Peppy is not abnormal and
hence can fly, which is not true. Of the two models, circumscription looks at only the smaller
one, because its strategy is to accept conclusions only from minimal models.
Circumscription has its own difficulties. If we were to add a statement that penguins cannot
fly, and are therefore abnormal, then reasoning with circumscription becomes entangled. Now
in minimal models there are no penguins, which is not asserted in the KB.
Initiates(Load, Loaded, t)
Initiates(Shoot, Dead, t) ← HoldsAt(Loaded, t)
Terminates(Shoot, Alive, t) ← HoldsAt(Loaded, t)
The Yale shooting scenario comprises a Load action followed by a Sneeze action followed
by a Shoot action.
The task is to show that Dead holds at a time after the Shoot action.
The problem here is that we are working under the open world assumption. How do we know
that everything that is relevant has been stated? What if the Sneeze action has the effect of
making Loaded false? And how do we know that someone did not fire the gun at time 20
rendering Loaded false?
The answer is to resort to circumscription. We must restrict the effects of actions, and the
happening of actions only to ones explicitly stated. Then we can say that HoldsAt(Dead, 70) is
entailed under circumscription.
Neda: Hello Kian and Nika! I have given you each a different natural
number {0,1,2,...}. Who of you has the larger number?
Kian: I don’t know.
Nika: I don’t know either.
Kian: Even though you say that, I still don’t know.
Nika: Aha! Now I know which of us has the larger number.
Kian: In that case, I know both our numbers.
Nika: And now I also know both numbers.
When Kian first says that he does not know, Nika can infer that his number is not 0. If her number were to be 1, she could conclude that Kian has the
higher number. But she does not know the answer. So now Kian knows that Nika’s number is
not 1. But he says he still does not know the answer. That means his number is higher than 2.
Now Nika says that she knows who has the higher number! This can only be if her number were
to be 2 or 3, because Kian can only have a higher number now. At this point we leave it to the
reader to deduce what the two numbers are.
Summary
Since ancient times, logic has been concerned with uncovering the truth. The goal has been to
identify valid forms of arguments. In modern times we look at the logic machine as operating
on a formal language. We define the notion of entailment, associated with logical consequence,
and proof as a means of arriving at conclusions. Our focus has been on illustrating the search
that lies behind logical reasoning and deduction with algorithms that can connect facts to
conclusions via rules of inference.
We have dwelt upon FOL because that is by far the most widely used language and includes
our programming languages as well. We have described algorithms that are both sound and
complete. But theorem proving in FOL is intractable. This leads to restricted languages like
Horn clause logic and DL, which are computationally very efficient. Logic has come to be not
just a machinery for deduction, but also a medium for knowledge representation. In the quest
for agents being able to deal with real world problems, we mention more expressive languages.
Default logic and its variations are designed to reason under the open world assumption when
information available to the reasoner may be incomplete. We also mentioned event calculus,
which models time and change, and epistemic logic, which allows an agent to reason about
what other agents know. As we progress towards more expressive languages, the computational
complexity increases. The search is on for efficient algorithms.
Doxastic logic is still in a nascent stage. But the need for being able to reason with beliefs
is ever growing in this era of disinformation and fake news. Agents will need to check the
consistency of their beliefs and perhaps weed out false beliefs. In strategic situations one may
need to detect deception, or even indulge in it as well. That is a long way to go.
Deduction in logic is instrumental in digging out implicit facts from the KB. It does not
generate any new knowledge. New knowledge comes from learning by induction. Machine
learning (ML) has made great strides in extracting models from data. A model is experience
distilled into knowledge. ML algorithms search in the space of models for ones that best fit the
available data. We take a brief look at search in ML in the next chapter.
Exercises
1. Draw truth tables to show that the following sentences are tautologies.
2. Show that the following sentences are tautologies. These are the basis of the common rules
of inference.
3. Show that the following sentences are not true or not tautologies, by constructing a truth
table or finding a falsifying assignment.
a. ((α ⊃ β) ∨ ¬δ)
b. ((β ∧ (α ⊃ β)) ⊃ α).
4. Show that there is an injection from the set of tautologies in PL to the set of contingencies
in PL, and vice versa.
5. Show that there is an injection from the set of tautologies in PL to the set of unsatisfiable
sentences in PL, and vice versa.
6. Show that there is an injection from the set of unsatisfiable sentences in PL to the set of
contingencies in PL, and vice versa.
7. State the rule of UI and show using the resolution refutation method that it is a sound rule.
8. Negate the sentence ∀x(Boy(x) ⊃ ∃y(Girl(y) ∧ Love(x, y))), use the substitution rules to
simplify it, and express the negation in English.
9. Negate the sentence ∃y(Girl(y) ∧ ∀x(Boy(x) ⊃ Love(x, y))), use the substitution rules to
simplify it, and express the negation in English.
10. Find the MGU for the following sets of clauses, where ‘a’ is a constant and ‘w, x, y, z’ are
variables. [Note. Each part requires an independent answer.]
a. P(a, x, f(g(y))) ∨ P(z, h(z, w), f(w))
b. P(a, x, f(g(x))) ∨ P(z, h(z, w), f(w))
11. Show that the following two formulas are logically different. What can you say about the
truth values of the two formulas? Try and prove both of them.
a. ∃x(D(x) ⊃ ∀y D(y))
b. (∃x D(x) ⊃ ∀y D(y))
12. Given the knowledge base KB1 in Section 10.4.3, the following facts cannot be deduced:
Parent(Neda, Sarina), Sibling(Hadis, Sarina), Cousin(Zhina, Nika), GrandParent(Neda,
Nika). Add rules to KB1 so that the above can be deduced. Do not add them as facts.
13. Given a KB {(Sibling(?x, ?y) ⊃ Sibling(?y, ?x)), Sibling(Sarina, Hadis)}, how would
backward chaining respond to the following queries:
a. Goal: Sibling(Hadis, ?z)
b. Goal: Sibling(Zhina, ?z)
14. An exercise in representation. Given the following primitive relations, define the other
relations that you can name.
a. Parent(X,Y): X is the parent of Y
Schema: Parent(Parent, Child) helps in specifying the semantics for us
when more than one element is involved.
b. Male(X): X is male
c. Female(X): X is a female
d. Married(X,Y): X is married to Y
Hindi speakers will have a plethora of named relations like dada, dadi, chacha, chachi, bua, tai,
mama, nana, and nani to choose from.
15. Consider the definition of ancestor given in the chapter, repeated below, and the alternatives
to clause 2 given below.
1. ∀p,x (Parent(p,x) ⊃ Ancestor(p,x))
2. ∀a,x [(∃p (Parent(p,x) ∧ Ancestor(a,p))) ⊃ Ancestor(a,x)]
≡ ∀a,x,p [(Parent(p,x) ∧ Ancestor(a,p)) ⊃ Ancestor(a,x)]
Which of the above would be more efficient, and for what kind of queries? What about the
following queries?
Goal: ∃x Ancestor(x, Mahsa)
Goal: ∃x Ancestor(Aishe, x)
16. Let the edges between two nodes Node1 and Node2 in a graph be stored in a KB with
predicates Edge(Node1, Node2). Define the predicate Path(Start, Destination), which is
true if there is a path in the graph from node Start to node Destination.
17. Construct a truth table to show that the following equivalence is a tautology:
20. Prove the following formula using the resolution refutation method:
[∀x∃y P(x, y) ∧ ∀y∃z Q(y, z)] ⊃ ∀x∃y∃z (P(x, y) ∧ Q(y, z))
21. Add the appropriate equality axioms and show using the resolution refutation method that
the following set of statements is inconsistent. Here nisha and t are constants, while P and
MPP are functions.
nisha = P(t)
∀x (MPP(x) = P(x))
nisha ≠ MPP(t)
22. Convert the Police Thug problem repeated here into clause form and derive a resolution
refutation proof for the query.
Some police person murdered Mahsa. Anyone who murders someone is a murderer. All
murderers must be prosecuted. Therefore, the police person must be prosecuted.
23. Express the following facts in FOL. Ignore tense information. Disha is a girl. Disha owns a
dog. Curiosity is a policeman. Dem is a cat. Every dog owner is an animal lover. No animal
lover kills an animal. Dogs and cats are animals. One of Disha and Curiosity killed Dem.
Use the resolution refutation method with the Answer Predicate to answer the query
‘Does there exist a policeman who killed Dem?’
24. What are the two numbers in the puzzle in Section 10.5.5?
Chapter 11
Search in Machine Learning
Sutanu Chakraborti
The earliest programs were entirely hand coded. Both the algorithm and the knowledge
that the algorithm embodied were created manually. Machines that learn were always
on the wish list though. One of the earliest reported programs was the checkers playing
program by Arthur Samuel that went on to beat its creator, evoking the spectre of
Frankenstein’s monster, a fear which still echoes today among some. Since then machine
learning (ML) has steadily advanced due to three factors. First, the availability of vast
amounts of data that the internet has made possible. Second, the tremendous increase
in computing power available. And third, a continuous evolution of algorithms. But
the core of ML is to process data using first principles and incrementally build models
about the domain that the data comes from. In this chapter we look at this process.
The computer is ideally suited to learning. It can never forget. The key is to incorporate
a ratchet mechanism a la natural selection - a mechanism to encapsulate the lessons
learnt into a usable form, a model. Robustness demands that one must build in the
ability to withstand occasional mistakes. Because the outlier must not become the
norm.
Children, doctors, and machines - they all learn. A toddler touches a piece of burning firewood
and is forced to withdraw her hand immediately. She learns to curb her curiosity and pay heed
to adult supervision. As she grows up, she picks up motor skills like cycling and learns new
languages. Doctors learn from their experience and become experts at their job - in fact, the
words ‘expert’ and ‘experience’ are derived from the same root. The smartphone you hold in
your hand learns to recognize your voice and handwriting and also tracks your preferences
for recommending books, movies, and food outlets in ways that often leave you pleasantly
surprised. This chapter is about how we can make machines learn. We also illustrate how such
learning is intimately related to the broader class of search methods explored in the rest of
this book.
Let us consider a simple example: the task of classifying an email as spam or non-spam.
Given the ill-defined nature of the problem, it is hard for us to arrive at a comprehensive set
of rules that can do this discrimination. Instead, we start out with a large number of emails,
each of which has been marked as spam or non-spam by humans - this constitutes our
set of observations. We hypothesize a set of three numeric features1 that may be useful in
discriminating spam emails from legitimate ones:
Next, we represent each email in our dataset in terms of these three features and attempt
to arrive at a hypothesis that successfully discriminates spam emails from non-spam ones. A
template for such a hypothesis is as follows:
if (C > p1) AND (NL > p2) AND (SW > p3)
then MAIL is SPAM
else MAIL is NON_SPAM
where p1, p2, and p3 are constants in the closed interval [0,1]. Given this template, it is evident
that each distinct combination of p1, p2, and p3 leads to a distinct hypothesis, and there are
infinitely many such hypotheses. Of these, we need to identify one that is expected to work the best not
only over the given dataset, but also in terms of classifying emails outside those that are present
in the dataset.
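A small Python sketch of this search over the template's hypothesis space follows. The three feature values and the handful of labelled emails are made up for illustration, since the feature definitions are not reproduced here; each (p1, p2, p3) triple is one hypothesis, and a coarse grid search picks the triple with the fewest mistakes on the labelled data.

import itertools

# (C, NL, SW) feature vectors, each scaled to [0, 1]; True = spam
emails = [
    ((0.9, 0.8, 0.7), True), ((0.2, 0.1, 0.0), False),
    ((0.8, 0.9, 0.6), True), ((0.3, 0.2, 0.1), False),
    ((0.7, 0.7, 0.8), True), ((0.1, 0.4, 0.2), False),
]

def hypothesis(p1, p2, p3):
    """One point (p1, p2, p3) in the hypothesis space, as a classifier."""
    return lambda x: x[0] > p1 and x[1] > p2 and x[2] > p3

def errors(h):
    return sum(h(x) != label for x, label in emails)

grid = [i / 10 for i in range(11)]
best = min(itertools.product(grid, grid, grid),
           key=lambda p: errors(hypothesis(*p)))
print('best thresholds:', best, 'errors:', errors(hypothesis(*best)))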
At the heart of machine learning (ML) is this ability to search over a space of candidate
hypotheses and narrow down on one or more of the most promising ones that explain our
observations. Approaches differ with respect to each other in terms of the representation of
the hypotheses and the specific search mechanism. For instance, the if-then rule template
above for spam classification could be generalized to allow both AND and OR connectives
(disjunctions over conjunctions) in the antecedent. In Section 11.1 we look at decision trees
that can discover such rules from data. Yet another possibility is to have a radically different
template for representing our hypotheses, such as the one shown below.
We have a distinct hypothesis for each distinct choice of values of w1, w2, w3, and
threshold, which are referred to as parameters. Again, we have an infinite set of hypotheses. In
Section 11.4, we turn our attention to neural networks, which use this template for representing
hypotheses.
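Assuming the alternative template is a weighted sum of the same three features compared against a threshold (the template itself is not reproduced here, so this is an illustrative guess), a hypothesis in this second space can be sketched as follows.

def linear_hypothesis(w1, w2, w3, threshold):
    """Classifier: predict spam when the weighted feature sum exceeds threshold."""
    return lambda x: w1 * x[0] + w2 * x[1] + w3 * x[2] > threshold

h = linear_hypothesis(0.5, 0.3, 0.2, 0.6)
print(h((0.9, 0.8, 0.7)))   # True: classified as spam
print(h((0.2, 0.1, 0.0)))   # False: classified as non-spam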
Irrespective of the diversity in the representation of hypotheses and the mechanisms used
for search over the hypothesis space, a bottom-up (data driven) process that relies on inducing
hypotheses from observations is central to ML, and this contrasts sharply with top-down (goal
driven) deductive reasoning.
1 The example is solely for illustrative purposes. In practice, we would need a wider pool of more involved features
for building an effective spam filter.
In order to solve a prediction task like this, we first need to build (induce) a theory of the
world - in this case, a theory of how R varies as a function of D. We will define a space H of
candidate hypotheses, say, the class of linear functions R = a + bD. Each hypothesis in this
space H is characterized by a unique choice of parameters, a and b, and can be pictured as a
straight line fit to the datapoints as shown by the dashed grey line in Figure 11.1(b). Since a and
b are real valued, it is evident that we have an infinite hypothesis space. We now search for a
hypothesis that best fits the given set of observations. This entails finding values of a and b that
minimize a well defined criterion called the objective function, the formulation of which closely
corresponds to our intuition of what constitutes a ‘good’ hypothesis. In the current context, the
objective function should be low if the estimates of R produced by a hypothesis are close to the
true values, given the values of D. For a datapoint <Di,Ri>, a hypothesis with parameters a and
b generates an estimate R̂i = a + b·Di. The following formulation called mean squared error
(MSE), which aggregates the deviations of R̂i with respect to Ri over all N datapoints while
remaining agnostic to whether such deviations are positive or negative, is a common choice for
objective function:
MSE = (1/N) Σ_{i=1}^{N} (R̂i − Ri)²
We can find values of a and b that minimize MSE by equating the partial derivatives
∂MSE/∂a and ∂MSE/∂b to 0, and these values will characterize the 'best' hypothesis in H. Later, we
will encounter harder problems where such analytic closed form solutions are not possible.
However, the essential idea of searching over a hypothesis space for finding parameters that
optimize an underlying objective function forms the essence of most ML approaches.
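A minimal Python sketch of this fitting process is shown below. The (D, R) pairs are invented stand-ins for the kind of data discussed in the text, and the closed-form least-squares fit is obtained via numpy rather than by differentiating MSE by hand.

import numpy as np

D = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
R = np.array([2.1, 2.9, 3.7, 5.2, 5.8])

# Setting the partial derivatives of MSE w.r.t. a and b to zero gives the
# normal equations, solved here via a degree-1 polynomial fit.
b, a = np.polyfit(D, R, 1)

R_hat = a + b * D
mse = np.mean((R_hat - R) ** 2)
print(f"a = {a:.3f}, b = {b:.3f}, MSE = {mse:.4f}")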
The best hypothesis in H as obtained above may fail to correspond to the true unknown
underlying functional relationship between D and R, which is shown pictorially as the bold
curve in Figure 11.1(b). In this case, the reason is simple: no straight line can model such a
curve. The inability of a simple model like linear regression to capture more complex underlying
relationships is called bias. An obvious solution is to use a richer space of hypotheses H' that
can accommodate higher order polynomials, involving a larger number of parameters that
need to be estimated. Figure 11.1(c) shows the result of such a nonlinear regression on the
same dataset. Clearly, the wiggly curve fits the given datapoints better and would result in a
lower MSE than any hypothesis in H. Not all is well about H’ though. An accurate fit over
the observations does not ensure that the wiggly curve would do better than the straight line
fit in terms of predicting the risk of death of a new patient. Why? Hypotheses in H’ have high
variance - a small change in any of the observations can result in a fit that looks very different.
This makes them vulnerable to noise or unrepresentative datapoints. In contrast, any hypothesis
in H based on linear regression tends to have higher errors on observed datapoints, but has
lesser variability and is hence more consistent in its predictions on unseen data. In ML, simple
hypotheses tend to have high bias but low variance, whereas complex hypotheses involving a
larger number of parameters have low bias but high variance. Ideally, we would like both bias
and variance to be low, so that our hypothesis captures the regularities in the observed data
and also leads to consistent predictions on unseen data. Finding such a sweet spot is hard in
practice. This is referred to as the problem of bias-variance tradeoff. A wiggly curve from H’
fits the observations accurately but fails to generalize well over unseen data - this is referred to
as the problem of overfitting. In contrast, straight line fits drawn from hypotheses in H suffer
from underfitting, because they lead to inaccurate estimates over observed data.
The datapoints used for inducing a hypothesis constitute the training data. In contrast, a
separate set of datapoints called the test data, which was not exposed to the learner, is used to
evaluate its ability to generalize and carry out prediction. In experiments, we are often just given a single
dataset, a portion (say, 80%) of which is reserved for training purposes and the rest (say, 20%)
used for testing. This is called a train-test split. In order to compare the hypotheses produced
by ML approaches M1 and M2, we can create 10 such randomized train-test splits, and record
the ten MSE values over training and test data for both hypotheses. A statistical hypothesis test,
such as a paired t-test, is then carried out over the 10 pairs of recorded values to arrive at conclusive
evidence of the superiority of one of these hypotheses over the other, at a chosen level of confidence.
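The comparison can be sketched as follows; the two lists of MSE values are placeholders standing in for the ten recorded values per approach, which in practice would come from the actual train-test runs.

from scipy import stats

mse_M1 = [3.1, 2.9, 3.4, 3.0, 3.2, 2.8, 3.3, 3.1, 3.0, 2.9]   # placeholder values
mse_M2 = [2.6, 2.5, 2.9, 2.7, 2.8, 2.4, 2.7, 2.6, 2.5, 2.6]   # placeholder values

t_stat, p_value = stats.ttest_rel(mse_M1, mse_M2)   # paired t-test over the splits
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A p-value below the chosen significance level supports the claim that
# M2's lower test MSE is not an accident of the particular splits.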
It is worth noting that no ML can happen unless we have some preconceived notion of
the nature of hypotheses we intend to induce from observations. This preconceived notion is
the learner’s bias. When restricted to the hypothesis space H, the learner is biased in favour
of its supposition that R linearly depends on D. No learner is bias-free. Since a learner relies
on induction, any prediction it makes outside the instances it has observed involves a ‘leap of
faith’. A learner that is empirically found to do well on classification or regression tasks can
be deliberately shown to perform very poorly, simply by presenting test instances that do not
abide by the regularities seen in the training data. This forms the key intuition behind the no
free lunch theorem.
We have looked into two kinds of learning problems: classification and regression. Both
fall under the broad category of supervised learning, where the training data is labelled.
In classification, these labels are category names (such as spam or non-spam in our earlier
example); in regression they are the values of dependent variables (such as risk of death). There
are learning problems where no labels are available, and the goal is to infer hidden patterns
or clustering tendencies in data. They fall under the category of unsupervised learning. There
is a third kind of learning, called reinforcement learning, where an agent interacts with an
environment, resulting in a change of state of the environment. In addition to having access to
the state information, the agent also receives feedback in the form of a reward in proportion
to the extent that its action positively contributed towards its goal - this determines its next
action. Unlike supervised learning, there is no pre-labelled dataset - rather, the learner has to
rely on intermediate reward values, and the action sequence it chooses by way of exploring the
environment determines the training examples. Reinforcement learning finds application in
tasks like autonomous driving or playing board games like go.
In the rest of this chapter we take a closer look at four supervised learning approaches
and one unsupervised clustering algorithm. The unifying thread is that they can all be viewed
as realizing search over hypothesis spaces with the goal of inducing models from raw
observations.
Figure 11.2 Illustration of a decision tree (b) induced from training examples in (a)
instance, the path corresponding to the grey shaded arrow in Figure 11.2(b) corresponds to the
rule
What is the hypothesis space of the decision tree classifier? It is the set of all candidate
trees that fit the training data. These candidate trees differ from each other across one or more
of the following criteria:
1. Choice of attributes: In a general classification setting, instances can have many attributes,
and not all may be equally well suited for classification purposes. In our simple example,
even if we had access to additional attributes like Name or Gender, they should ideally not
play a role in deciding the outcome of a job interview. Candidate trees could differ with
respect to the attributes they use for classification. Figure 11.3 shows an unwieldy tree
resulting from the test on the gender attribute at the root node. The same subtree rooted at
the node ‘Performance in Interview’ is duplicated for both branches, since gender plays no
role in deciding the class label (outcome).
2. Ordering of attributes: In the example in Figure 11.2(b), it is evident that as we read out a
rule along a path from the root to the leaf, the test on the attribute Performance in Interview
is done ahead of the test on CGPA. We could have obtained a different tree by doing this
differently: testing on CGPA at the root node, and testing on Performance in Interview
next. If the ordering is not done in a principled way, it can result in an unwieldy tree, where
redundant comparisons are needed to classify a test instance.
3. Attribute splits: If an attribute is real valued or has a large number of categorical or ordinal
values, the strategy used to split the attributes and the number of resulting splits will
determine the nature of the induced tree. In our example, we have used three splits (CGPA
< 8, 8 < CGPA < 9, CGPA > 9) on the real valued attribute CGPA; choosing other values
to split the CGPA attribute can result in trees that look very different.
The decision tree induction algorithm performs a search over the hypothesis space of all
candidate trees to find a tree that is compact, based on objective measures that determine the
attributes to be used, the order in which attributes have to be tested, and the splitting criterion.
Figure 11.3 Illustrating the influence of choice of attribute on decision tree induction
In order to decide on the attributes we need to test at each node, we use a criterion called
information gain. The intuition is as follows. Let us say, we have two datasets A and B, each
having 600 instances of job applicants. In dataset A, the number of applicants accepted,
waitlisted, and rejected are 20, 20, and 560 respectively. In contrast, the corresponding numbers
in dataset B are 200, 200, and 200. It is evident that in the absence of any additional information
about a candidate (her CGPA or Performance in Interview), we would be more uncertain about
the outcome of a candidate in dataset B compared to that of a candidate in dataset A. Entropy is
a measure of that uncertainty - the higher the uncertainty or randomness, the higher the entropy.
Let pa, pw, and pr be the probabilities of an applicant being accepted, waitlisted, and rejected respectively. The entropy is then given by
H = Σ_{i=1}^{C} pi log2(1/pi)
where C is the total number of classes and the pi values correspond to probabilities of each
class. In our example, C = 3, and p1, p2, and p3 correspond to pa, pw, and pr respectively. An
interesting aspect of the entropy formulation above is that it attains its maximum value when
all pi values are the same. Therefore, it is in line with our intuition of entropy as a measure of
randomness.
How does this help us in deciding the order in which attributes have to be tested in a
decision tree? Note that the entropy H formulated above quantifies uncertainty in absence of any
knowledge of attribute values. We would expect that knowledge of CGPA or of Performance in
Interview should result in a reduction in the value of this uncertainty or a corresponding gain in
information; this leads us naturally to the concept of information gain (IG). An attribute with
a higher IG score should be preferred over another that has a lower IG score. Let us consider
dataset B, which has 200 instances in each of the three classes. Let us split dataset B into
three partitions B1, B2, and B3 based on the attribute CGPA: Partition B1 has 100 instances, all
having CGPA > 9; Partition B2 has 350 instances, all with 8 < CGPA < 9; Partition B3 has 150
instances, all with CGPA < 8. Each of these partitions can be treated as a dataset in its own
right and has an entropy score of its own. Since Partition B1 has applicants with high CGPA,
we would expect it to have a higher proportion of accepted candidates when compared to
Partitions B2 and B3. Let pa, pw, and pr be 0.8, 0.15, and 0.05 respectively for Partition B1 - these are the relative proportions of the 100 instances in B1 who have been accepted, waitlisted, and rejected respectively. Then the entropy of B1 is given as HB1 = 0.8 log2(1/0.8) + 0.15 log2(1/0.15) + 0.05 log2(1/0.05). Similarly, HB2 and HB3, the entropy for partitions B2 and B3 respectively, can
be estimated as well. The uncertainty about the outcome that remains, once we know the value
of CGPA of an applicant, is expressed as follows:
HCGPA = Σ_{i=1}^{3} P(Bi) HBi
HCGPA is the weighted sum of HB1, HB2, and HB3, where the weights P(B1), P(B2), and P(B3) are the relative proportions of data in the three partitions. In our example, P(B1), P(B2), and P(B3) are (100/600), (350/600), and (150/600) respectively. HCGPA evaluates to 1.39 bits.
If the knowledge of CGPA alone is sufficient to classify all instances, the partitions B1, B2,
and B3 are pure, in that each of them contains instances of just one class. In that case, HCGPA
is minimum (0). On the other hand, knowledge of an attribute like Gender hardly helps in
predicting the outcome, since it leads to two partitions corresponding to two values (Male and
Female), each of which is expected to be impure. This is because each partition is expected
to contain a mix of instances drawn from all three classes. Thus, HGender, which quantifies
the uncertainty about the outcome that remains, once we know the gender of an applicant, is
expected to be conspicuously higher than HCGPA. Similarly, we can estimate HF for any attribute
F. The IG resulting due to F, denoted as IGF, is defined as the difference between the original
entropy H and the updated entropy HF:
IGF = H - HF = H - Σ_{i=1}^{n} P(Di) HDi
Note that the formula for HF is a generalization over the 3-partition case, where the different
values (or splits) of attribute F lead to n partitions D1 through Dn , and HDi is the entropy of the
ith partition.
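The entropy and information gain computations described above can be sketched in Python as follows. The class proportions assumed here for partitions B2 and B3 are purely illustrative, so the resulting numbers need not match the 1.39 bits quoted in the text.

    from math import log2

    def entropy(probs):
        # H = sum over classes of p * log2(1/p), skipping zero-probability classes
        return sum(p * log2(1.0 / p) for p in probs if p > 0)

    # Parent dataset B: 200 accepted, 200 waitlisted, 200 rejected
    H = entropy([200/600, 200/600, 200/600])       # log2(3), about 1.585 bits

    # Partitions on CGPA: (weight, class proportions); B2 and B3 proportions are assumed
    partitions = [
        (100/600, [0.80, 0.15, 0.05]),   # B1: CGPA > 9
        (350/600, [0.30, 0.45, 0.25]),   # B2: 8 < CGPA < 9
        (150/600, [0.05, 0.15, 0.80]),   # B3: CGPA < 8
    ]

    H_cgpa = sum(w * entropy(p) for w, p in partitions)
    print("H = %.3f, H_CGPA = %.3f, IG_CGPA = %.3f" % (H, H_cgpa, H - H_cgpa))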
The decision tree induction algorithm places nodes corresponding to attributes that have
higher IG closer to the root, since they are better in terms of discriminating between classes. In
the decision tree in Figure 11.2(b), the test on Performance in Interview is carried out at the root
node, since the IG score of the attribute Performance in Interview is higher than the IG score of
the attribute CGPA. Note that similar tests on IG are carried out recursively at each intermediate
level of the tree as well. This is a straightforward extension of the idea described earlier. Each
branch emanating out of a node represents a subset of data having certain values (or ranges of
values) of the attribute corresponding to that node. The attribute that has the highest IG score
with respect to that subset is the one that is tested against in the next node.
The process described above for the construction of decision trees is formalized in
Algorithm 11.1. Given a data set D with attributes {A1, A2, ..., Ap}, the algorithm successively partitions the data set till some termination criterion is met.
Algorithm 11.1. Sketch of a simple algorithm to construct a decision tree with training data D with attributes {A1, A2, ..., Ap}. The algorithm chooses an attribute F ∈ {A1, A2, ..., Ap} which gives maximum IG, and adds a branch for each value V of F. For each child with high entropy, a recursive call is made to further partition the data to create subtrees.
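Since only the caption of Algorithm 11.1 is reproduced here, the following Python sketch conveys the recursive structure it describes, under ID3-style assumptions (categorical attributes, the information gain criterion, stopping on pure partitions). The dataset and attribute names are illustrative.

    from math import log2
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return sum((c / n) * log2(n / c) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        # IG = H(parent) - weighted entropy of the partitions induced by attr
        n = len(rows)
        parts = {}
        for row, lab in zip(rows, labels):
            parts.setdefault(row[attr], []).append(lab)
        h_attr = sum(len(p) / n * entropy(p) for p in parts.values())
        return entropy(labels) - h_attr

    def build_tree(rows, labels, attrs):
        if len(set(labels)) == 1:              # pure partition: stop and emit a leaf
            return labels[0]
        if not attrs:                          # no attributes left: majority class
            return Counter(labels).most_common(1)[0][0]
        best = max(attrs, key=lambda a: info_gain(rows, labels, a))
        tree = {best: {}}
        for value in set(r[best] for r in rows):   # one branch per value of best
            idx = [i for i, r in enumerate(rows) if r[best] == value]
            tree[best][value] = build_tree([rows[i] for i in idx],
                                           [labels[i] for i in idx],
                                           [a for a in attrs if a != best])
        return tree

    # Tiny illustrative dataset in the spirit of Figure 11.2(a)
    data = [{'CGPA': 'High', 'Interview': 'Good'},
            {'CGPA': 'High', 'Interview': 'Average'},
            {'CGPA': 'Low', 'Interview': 'Good'},
            {'CGPA': 'Low', 'Interview': 'Average'}]
    outcomes = ['Accepted', 'Waitlisted', 'Waitlisted', 'Rejected']
    print(build_tree(data, outcomes, ['CGPA', 'Interview']))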
As shown in the algorithm, a stopping criterion is needed to terminate the recursive process
of growing the tree. The simplest criterion is to terminate when all instances in the subset of
data represented by a branch belong to the same class, that is, the corresponding partition is
pure. However, this criterion may lead to very deep trees that are likely to overfit the training
data. To prevent this, the decision tree may be pruned after fully growing it, or, alternatively,
the pruning can be done as the tree is being built. In the context of our earlier discussion on
bias-variance trade-off, pruned decision trees are more compact, and have higher bias and
lower variance, compared to fully grown trees. Yet another issue we have not discussed is the
procedure for splitting continuous attributes like CGPA. In this context, while it seems intuitive
to choose split points such that the gain in information is maximized, we also want to avoid too
many splits since this can lead to overfitting.
Once the decision tree is constructed from training data, classifying any given test instance
is straightforward: attribute tests are carried out at each node starting with the root all the way
down till we reach a leaf node, from which we can read out the class label. Unlike some of the
other classification approaches that we discuss next, decision trees are not black box in nature,
in that they naturally produce easily interpretable rules, as long as they are short and compact.
Figure 11.4 Illustrating the kNN algorithm for (a) k = 1 and (b) k = 3.
yield a ‘global’ measure of similarity between instances.2 Also, weights can be associated with
each attribute, such that attributes that play a more important role in the classification process
can be weighed relatively higher than others.
Another simple improvement over the basic kNN approach suggests itself. Taking a
majority vote over the near neighbours ignores their relative distances from the test instance.
Ideally, we would like the closer neighbours to have a higher say in classification than the
neighbours that are distant. This version is called the Distance-Weighted kNN algorithm
(Dudani, 1976).
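A minimal sketch of a distance-weighted kNN classifier with Euclidean distance and inverse-distance weights (one common weighting scheme, not necessarily Dudani's exact formulation); the training data below is made up.

    import math
    from collections import defaultdict

    def weighted_knn(train, query, k=3):
        # train is a list of (feature_vector, label) pairs; query is a feature vector
        nearest = sorted((math.dist(x, query), y) for x, y in train)[:k]
        votes = defaultdict(float)
        for d, label in nearest:
            votes[label] += 1.0 / (d + 1e-9)   # closer neighbours get a higher say
        return max(votes, key=votes.get)

    train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((4.0, 4.2), 'B'), ((3.8, 4.0), 'B')]
    print(weighted_knn(train, (1.5, 1.2), k=3))   # expected output: A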
What is the nature of the hypothesis space induced by the kNN algorithm? We can
answer this by investigating the nature of decision boundaries produced by the classifier for
different values of k. When k = 1, the decision boundary closely fits the individual training
instances - hence, a small change in the training instances has a pronounced influence on
the decision boundary. Thus a low value of k leads to overfitting, with high variance and low
bias. As we increase k, we have higher bias and lower variance, and the decision boundary
becomes smoother. The influence of k on the nature of decision boundaries is illustrated in
Figures 11.5(a) and (b).
Unlike many other classification techniques, the kNN algorithm does not explicitly
construct a model out of training data ahead of time. Rather, it waits till a test instance comes
in, and classifies the same based on the training data in its local neighbourhood alone. Hence,
kNN is referred to as a lazy learner. This may be contrasted with eager approaches like decision
trees, where the induced hypothesis is agnostic to the test instance being classified. Also, in
comparison to eager learners, lazy learners are faster to train but slow in classifying fresh
queries.
2 Such generalized notions of similarity find application in the field of case based reasoning (CBR), a model of
experiential reasoning where past problem solving episodes are recorded in the form of cases (refer to Section 1.5.1).
Each case can be represented in terms of attribute-value pairs. Given a new problem, the closest cases are retrieved
and reused to suggest a solution. CBR finds application in ill-defined domains in tasks such as diagnosis (helpdesks,
for example), design, configuration, and planning.
Figure 11.5 Implicit hypothesis spaces for k = 1 on the left and k = 3 on the right. Although
no model has been explicitly constructed, one can imagine the decision boundaries. As
the decision boundaries become smoother as shown by the dashed lines on the right, the
classifier shows more error on the training data.
P(D | S) = P(S | D) P(D) / P(S)
where S is the set of symptoms and D is the disease. The conditional probabilities P(D|S)
and P(S|D) are referred to as posterior probability and likelihood respectively. P(D) is the prior
probability of the disease D, and P(S) is referred to as the evidence. Given a test instance with
symptoms (or attribute values) S, diseases are ranked on the basis of their posterior probabilities,
and the disease with the highest posterior probability is the class to which the test instance is
assigned. It is clear that the denominator P(S) is independent of the disease and hence plays
no role in classification. Therefore, in order to rank classes based on their posterior
probability, we need to estimate two terms: the likelihood and the prior. The likelihood term is
central to the Bayesian formulation, as it answers the question: how likely is it for a disease to
generate a given set of symptoms? The prior probability P(D) is the probability of each disease
and is agnostic to the symptoms observed. For example, throat cancer, being a relatively rare
disease, has a much smaller prior probability compared to flu. Thus, even though both flu and
throat cancer have comparable P(S| D) terms and thus are almost equally likely to generate the
symptom of sore throat, a patient with a sore throat is more likely to be suffering from flu than
from throat cancer. This is because the much higher prior term associated with flu makes its
posterior P(D | S) higher than that of throat cancer.
When generalized to a setting where several symptoms S1 through Sn are used to diagnose
the disease, the likelihood term takes the form P(S1, S2, ..., Sn | D). The difficulty is that obtaining reliable estimates of this term requires really large training data
with the same instance repeated several times over. Hence, a simplifying assumption is often
made, where the symptoms (attribute values) are assumed to be conditionally independent given
the disease (class label). The likelihood term can be expressed as the product of individual
attribute-specific conditional probabilities as shown below:
P(S1, S2, ..., Sn | D) = Π_{i=1}^{n} P(Si | D)
Robust estimates can be obtained for each of the terms P(Si | D). A classifier based on the
simplifying assumption above is called the Naive Bayes classifier (Lewis, 1998).
Let us illustrate how the prior, likelihood, and posterior terms of a Naive Bayes classifier
are estimated from training data using an example. We use the same dataset as used in
Figure 11.2(a) except that the CGPA scores are represented using ordinal values High, Medium,
and Low. The form of the dataset is shown in Figure 11.7. The goal is to classify a test instance
having attribute values CGPA and InterviewPerf as belonging to one of the three categories
Accepted, Rejected, or Waitlisted.
Based on the training data, we need to estimate three quantities P(CGPA | Decision), P(InterviewPerf | Decision), and P(Decision); these are parameters of the model. Let us
say, the test data has attribute values CGPA = High, and InterviewPerf = Average. The
goal is to assign one of the three class labels (Decision) to this instance. The posteriors
P(Decision| CGPA = High, InterviewPerf = Average) are estimated for each class, and the test
instance is assigned to the class having the highest posterior probability. For example,
P(Decision = Accepted | CGPA = High, InterviewPerf = Average) ∝ P(CGPA = High | Decision = Accepted) × P(InterviewPerf = Average | Decision = Accepted) × P(Decision = Accepted)
The two terms constituting the likelihood are estimated from the training data as relative frequencies: P(CGPA = High | Decision = Accepted) = 40/50 and P(InterviewPerf = Average | Decision = Accepted) = 5/50, while the prior is P(Decision = Accepted) = 50/600. Thus
P(Decision = Accepted | CGPA = High, InterviewPerf = Average) ∝ (40/50) × (5/50) × (50/600)
Similarly, we can estimate P(Decision = Waitlisted| CGPA = High, InterviewPerf = Average)
and P(Decision = Rejected|CGPA = High, InterviewPerf = Average). The constant of
proportionality is the same in all the three cases - the test instance is assigned to the class
having the highest posterior value.
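The estimation and classification steps just described can be sketched in Python as follows; the tiny dataset here only mimics the structure of Figure 11.7 and is not the actual data, and no smoothing of zero counts is attempted.

    from collections import Counter, defaultdict

    def train_naive_bayes(rows, labels):
        n = len(labels)
        prior = {c: cnt / n for c, cnt in Counter(labels).items()}
        # likelihood[attr][c][val] will hold P(attr = val | class = c)
        likelihood = defaultdict(lambda: defaultdict(Counter))
        for row, c in zip(rows, labels):
            for attr, val in row.items():
                likelihood[attr][c][val] += 1
        for attr in likelihood:
            for c in likelihood[attr]:
                total = sum(likelihood[attr][c].values())
                for val in likelihood[attr][c]:
                    likelihood[attr][c][val] /= total
        return prior, likelihood

    def classify(prior, likelihood, instance):
        scores = {}
        for c in prior:
            score = prior[c]
            for attr, val in instance.items():
                score *= likelihood[attr][c].get(val, 0.0)   # no smoothing in this sketch
            scores[c] = score
        return max(scores, key=scores.get)

    rows = [{'CGPA': 'High', 'Interview': 'Good'},
            {'CGPA': 'High', 'Interview': 'Average'},
            {'CGPA': 'Low', 'Interview': 'Average'},
            {'CGPA': 'Medium', 'Interview': 'Good'}]
    labels = ['Accepted', 'Waitlisted', 'Rejected', 'Accepted']
    prior, likelihood = train_naive_bayes(rows, labels)
    print(classify(prior, likelihood, {'CGPA': 'High', 'Interview': 'Average'}))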
Bayesian reasoning can be used for two kinds of tasks: inference and prediction. As a
simple example, if we toss a biased coin 10 times and obtain 4 heads, the inference problem is
to estimate the parameter θH, the probability of heads. We can visualize a continuous hypothesis space, with an infinite set of hypotheses corresponding to θH values in the closed interval [0, 1]. Denoting hypothesis and data (observations) as h and d respectively, the posterior distribution P(h | d) is estimated using Bayes' rule as follows:
P(h | d) = P(d | h) P(h) / P(d)
Let us start with a prior which assumes that all hypotheses are equally likely, that is, the
prior corresponds to a uniform distribution over probability of heads, as shown in Figure 11.8(a).
After 10 flips with 4 heads, the posterior peaks around p = 0.4, as shown in Figure 11.8(b).
Having observed 100 flips with 40 heads, the peak around p = 0.4 becomes even more
pronounced (Figure 11.8(c)). The central idea here is: the posterior after each flip acts as the
prior for the next flip. As we observe more and more data, we become more and more certain
about the parameters that could have produced the data, given the underlying generative model
that is encoded in the likelihood term.
Figure 11.8 Illustrating the effect of number of observations on posteriors (figures are not
drawn to scale)
In the prediction task, we make use of this posterior distribution to predict whether the next toss is likely to yield a head or a tail: P(heads | d) = ∫ P(heads | h) P(h | d) dh. For discrete hypothesis spaces, the integral is replaced by a summation: P(outcome | d) = Σ_h P(outcome | h) P(h | d). This can be viewed as a weighted aggregation over the predictions
generated by each hypothesis h, and the weights of the hypotheses are their posterior
probabilities P(h| d). The hypotheses that are active (have non-zero posteriors) have a say in the
final prediction. This is the Bayes Optimal setting.
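A minimal sketch of both tasks for the coin example, discretizing the hypothesis space for θH into a grid of candidate values (an assumption made here just to keep the code short; the text treats the continuous case):

    # Grid approximation of the posterior over thetaH, the probability of heads
    grid = [i / 100 for i in range(101)]           # candidate values of thetaH
    prior = [1 / len(grid)] * len(grid)            # uniform prior over the hypotheses

    def posterior(prior, heads, tails):
        # posterior proportional to likelihood * prior, then normalized
        post = [p * (th ** heads) * ((1 - th) ** tails) for th, p in zip(grid, prior)]
        z = sum(post)
        return [p / z for p in post]

    post_10 = posterior(prior, heads=4, tails=6)     # peaks around 0.4
    post_100 = posterior(prior, heads=40, tails=60)  # a sharper peak around 0.4

    # Prediction: P(heads | d) = sum over hypotheses of P(heads | h) P(h | d)
    p_heads = sum(th * p for th, p in zip(grid, post_100))
    print("P(next toss is heads) = %.3f" % p_heads)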
If the weighted combination w1 I + w2 S + w3 L exceeds a threshold, then you decide to attend the class, else you decide to skip it. Let us call this model M1.
It is clear that if you attach a relatively high importance to your laid-back friend (L),
resulting in a w3 value that is much higher than w1 and w2, you will end up skipping the class.
Later, you may regret your decision when you find out that you missed out on the surprise test
conducted that day, which adversely affected your course grade eventually. You have learnt a
lesson. When you take a similar decision the next time, you would attach a higher importance
to your sincere friend; in other words, you would boost the weight w2 relative to w3. Learning
amounts to updating the weights and the threshold based on feedback from the environment so
that you get better at decision making over time.
The perceptron realizes the decision making process in M1. A schematic is shown in
Figure 11.9.
The perceptron has two kinds of nodes: input nodes, which represent the factors used for
arriving at the decision, and an output node, which produces the outcome. The figure shows
three inputs x1, x2, and x3 which correspond to I, S, and L respectively in the example above.
There is a directed edge from each input node to the output node. These edges carry weights,
which are shown as w1, w2, and w3 in the figure. The output node computes a weighted sum of
the inputs and subtracts the threshold t. The sign of the result determines the value y as shown
below:
y = 1, if w1 x1 + w2 x2 + w3 x3 - t > 0
y = -1, if w1 x1 + w2 x2 + w3 x3 - t ≤ 0
Note that the perceptron does exactly the same job as was formulated in model M1. The
idea can be extended to work over n input factors. Representing the threshold t as -w0, and treating this as a weight attached to an additional (constant) input x0 = 1, we obtain the more general formulation:
y = 1, if Σ_{i=0}^{n} wi xi > 0
y = -1, if Σ_{i=0}^{n} wi xi ≤ 0
Also, it is easy to see that the decision making problem we set out to solve can alternatively be viewed as a classification problem, where the two outcomes (attend class and skip class) correspond to class labels. Thus we have in a perceptron a general model of a classifier that can take in the n attribute values of a test instance as input and classify it into one of several classes. The weight values w0 through wn are referred to as model parameters.
How does such a perceptron learn from training data in a classification setting? In model M1,
we used feedback from the environment to guide the process of updating weights and thresholds.
The training data comes with this feedback about the desired outcomes corresponding to the
attribute values of each training instance. The perceptron makes use of this to learn weights that
achieve the desired classification.
The perceptron learning algorithm is simple. We start with random values of the weights w0 through wn. The perceptron is tried on each training instance; each time it fails to correctly classify an instance, the weights are updated. The process is repeated over all the instances over several passes till all instances are correctly classified. The update rule for the weight wi associated with input xi is as follows: wi ← wi + η(l - y)xi, where l is the class label (either +1 or -1) associated with the current training instance, y is the perceptron output, and η is a positive
constant. It is easy to see why this works. If l and y match, the weights stay the same. If l is +1
and y is -1, this rule will increase the weights associated with inputs (xi) that are positive, and
decrease the magnitude of the weights associated with inputs (xi) that are negative. Conversely,
if l is -1 and y is +1, this rule will decrease the weights associated with inputs that are positive,
and increase the weights associated with inputs (xi) that are negative. This is in line with the
intuitions we used in our earlier example to boost the weight w2 relative to w3 when model M1
failed to arrive at the correct decision.
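The learning rule described above can be sketched as follows; the learning rate and the training data (the Boolean OR function, which is linearly separable) are illustrative, and the threshold is folded in as w0 with a constant input x0 = 1.

    def perceptron_train(data, labels, eta=0.1, max_passes=100):
        n = len(data[0])
        w = [0.0] * (n + 1)                 # w[0] plays the role of -t
        for _ in range(max_passes):
            mistakes = 0
            for x, l in zip(data, labels):
                xs = [1.0] + list(x)        # prepend the constant input x0 = 1
                y = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else -1
                if y != l:                  # update only on a misclassification
                    w = [wi + eta * (l - y) * xi for wi, xi in zip(w, xs)]
                    mistakes += 1
            if mistakes == 0:               # all instances correctly classified
                break
        return w

    # Learning the Boolean OR function, which is linearly separable
    data = [(0, 0), (0, 1), (1, 0), (1, 1)]
    labels = [-1, 1, 1, 1]
    print(perceptron_train(data, labels))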
Is the perceptron learning algorithm flexible enough to handle any classification problem?
Unfortunately, the answer is no. The reason is that the perceptron algorithm can only model linear decision boundaries of the kind Σ_{i=0}^{n} wi xi = 0. Therefore, the perceptron can handle the
class of linearly separable problems, where such a linear decision boundary can discriminate
between classes. As shown in Figures 11.10(a) and (b), it can represent Boolean functions like
AND and OR. Recall that OR(x1, x2) is 1 when at least one of x1 and x2 is 1. The dashed line
in Figure 11.10(a) separates the three inputs that evaluate to 1 from the lone (x1 = 0, x2 = 0)
that evaluates to 0. Likewise, a line separates the three 0s from the lone 1 (x1 = 1, x2 = 1) for
AND (x1, x2) in Figure 11.10(b). The figure shows that the decision boundary induced by the
perceptron splits the space into two parts (half-planes) which are labelled as + and -. However,
it fails to represent the XOR problem where no single straight-line separator can be found that
separates the outcomes 0 and 1 (Figure 11.10(c)). Recall that XOR(x1, x2) is 1 when exactly one
of x1 and x2 is 1, and the other is 0.
In linearly non-separable problems, we come up with a notion of error aggregated over
all training instances and attempt to find weights that minimize this error. This leads to an
alternate formulation called the delta rule, which uses the idea of gradient descent to find the
best possible weights. We consider a perceptron model that is somewhat different from the one
shown in Figure 11.9 in that the output node produces a weighted sum of inputs but does no
thresholding. Let us call this perceptron a linear unit. We use the following formulation of error
aggregated over training instances:
E = (1/2) Σ_{d ∈ D} (ld - yd)^2
where d refers to an individual training instance in the training data D, ld refers to the class
label of d as recorded in the dataset, and yd is the corresponding output of the linear unit. The
error is a function of the weights, and learning should involve adjustment of these weights. We
start off with random initialization of weights, and iteratively update these weights with the
goal of arriving at the combination of weights that minimizes error. We picture the error as a
function of weight as shown in Figure 11.11. The gradient descent approach chooses a direction
that leads to the steepest descent along the error surface.
For weight wi the update formulation is as follows:
wi ← wi - η ∂E/∂wi
where η is a positive constant, and ∂E/∂wi is the slope of the error surface with respect to wi.
Why does this work? Refer to Figure 11.11 which illustrates the simplest setting where the error is a function of just one weight w. Let us start with a random guess for w, say, wA. Since the slope ∂E/∂w is positive at w = wA, the update formulation above will lead to a decrease in the weight value (note that η is positive), and we move to a weight value to the left of wA, as shown. On the other hand, consider an alternative starting weight wB. Since the slope ∂E/∂w is negative at w = wB, the update formulation above will lead to an increase in the weight value, which moves the weight value to the right of wB, as shown. In either case, we move closer to the weight value where the error is minimum.
Figure 11.11 Illustration of the intuition behind the gradient descent algorithm
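A minimal sketch of batch gradient descent for the linear unit: for E = (1/2) Σ(ld - yd)^2, the slope with respect to wi is -Σ(ld - yd) xd,i, and the weights are stepped against it. The data (a noiseless linear relation) and the learning rate are illustrative.

    def train_linear_unit(data, labels, eta=0.05, epochs=200):
        n = len(data[0])
        w = [0.0] * (n + 1)                                   # w[0] is the bias weight
        for _ in range(epochs):
            grad = [0.0] * (n + 1)
            for x, l in zip(data, labels):
                xs = [1.0] + list(x)
                y = sum(wi * xi for wi, xi in zip(w, xs))     # linear unit: no thresholding
                for i in range(n + 1):
                    grad[i] += -(l - y) * xs[i]               # dE/dw_i accumulated over data
            w = [wi - eta * gi for wi, gi in zip(w, grad)]    # step against the gradient
        return w

    data = [(0.0,), (1.0,), (2.0,), (3.0,)]
    labels = [1.0, 3.0, 5.0, 7.0]              # underlying relation: l = 2x + 1
    print(train_linear_unit(data, labels))     # weights approach [1.0, 2.0]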
Instead of a hard threshold function that rejects a value below the threshold completely and accepts a value above the threshold completely, it is more usual to use nodes with a sigmoid
activation function in multilayer networks. The sigmoid function is smooth and differentiable,
and this helps in working out the mathematical derivation of the learning mechanism. The
threshold and sigmoid activation functions are shown in Figure 11.12.
In order to see how using an additional layer can help, let us revisit the simplest instance
of linearly non-separable problems that we have seen so far, the XOR problem. Consider
the architecture in Figure 11.13(b), where nodes N1 and N2 feed their output to node N3 that
produces the final output. While it is impossible to draw a linear separator in the original space
shown in Figure 11.13(a), we observe that N1 and N2 can find out which of their half-planes
the input falls in, and feed their verdicts (+ or -) to N3, which can do the final job by learning
a criterion of the following kind:
if the input lies in the + half-plane of N1 AND in the + half-plane of N2
then XOR = 1
Note that the above criterion can be realized using a logical AND, and we have seen that
a perceptron can represent the AND Boolean function. It is therefore possible for node N3
to correctly solve the XOR problem on the basis of the intermediate information regarding
half-planes generated by nodes N1 and N2. This is why adding an extra layer can overcome
the limitation of the basic perceptron model, and help in solving a wide class of linearly
non-separable problems. Henceforth, we refer to this network with one extra layer as OALN
(for one additional layer network).
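The construction can be made concrete with hand-set weights (one possible choice among many): N1 computes OR, N2 computes NAND, and N3 performs the AND over their verdicts, which together realize XOR.

    def unit(weights, threshold, inputs):
        # A threshold unit: +1 if the weighted sum exceeds the threshold, else -1
        return 1 if sum(w * x for w, x in zip(weights, inputs)) - threshold > 0 else -1

    def xor_net(x1, x2):
        n1 = unit([1, 1], 0.5, [x1, x2])       # OR: - only when both inputs are 0
        n2 = unit([-1, -1], -1.5, [x1, x2])    # NAND: - only when both inputs are 1
        n3 = unit([1, 1], 1.5, [n1, n2])       # AND over the two verdicts
        return 1 if n3 > 0 else 0

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, '->', xor_net(a, b))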
Would we ever need more than one additional layer? Consider a multilayer network applied
to the problem of handwritten digit classification. Each handwritten digit is pre-processed and
represented in terms of n numeric feature values, constituting a feature vector. Each such vector
is in an n-dimensional feature space. The training data contains several thousands of instances,
along with the correct class labels. The multilayer network contains n input nodes and 11 output
nodes: one for each of the digits and an additional dummy node for all images that could not
be reliably assigned to any class. Feature vectors of images are expected to be close to each
other in the feature space if they belong to the same class (represent the same digit). Feature
vectors of different classes are expected to be far from each other. There are certain practical
challenges, though. The image in Figure 11.14(a) can be read as a 0 or a 6, and the one in
Figure 11.14(b) can be a 1 or a 7. In either case, feature vectors of instances belonging to
different classes are close to each other. There is yet another problem: the same digit 7 can be
written in two different ways as shown in Figures 11.14(c) and (d). Here, feature vectors of
instances that belong to the same class are far apart. The images shown in Figures 11.14(a) and
(b) are hard to classify for humans as well. Hence, we will not be overly concerned if a machine
goes wrong in such cases. However, we need to make sure that the neural network does well for
the setting in Figures 11.14(c) and (d).
As shown in Figure 11.15(b), OALN can extract regions in the feature space by realizing
a logical AND over the half-plane information obtained from nodes in the previous layer.
However, there are two distinct regions in the feature space corresponding to two different
ways of writing the digit 7. These regions, shown as region A and region B, are well separated
from each other, and yet represent the same class. In situations such as this where there is no
neat correspondence between regions and classes, OALN is not sufficient. The output nodes in
OALN need to feed the region information to yet another final layer which does the final job of
classification. The final layer node learns a criterion of the following kind:
if Region A OR Region B
then Digit = 7
The above criterion can be realized using a logical OR, and we have seen that a perceptron
can represent the OR Boolean function. We call this new architecture TALN (for two additional
layers network). It can be shown that TALN can, in principle, realize arbitrarily complex
decision boundaries. However, this assumes that there is no inherent limitation to the number
of nodes we use in each layer. This is clearly unrealistic. In practice, a multilayer network often
uses several additional layers to address such practical concerns.
Figure 11.15 Illustrating why one additional layer may not always suffice
In Figure 11.16(b), we show a multilayer network used for solving the classification
problem that we presented in Section 11.1. The training data is replicated in Figure 11.16(a) for
ease of reference. The input nodes take in two features: CGPA and Performance in Interview,
which are numeric and ordinal respectively. There are three nodes in the output layer based on
the three classes Accepted, Waitlisted, and Rejected. The nodes f3, f4, and f5 in the intermediate
layer realize linear separators and feed the half-plane information to f6, f7, and f8, which generate
the final class labels.
Figure 11.16 (a) An example dataset (b) A multilayer network trained on this data
How are the weights linking up nodes in successive layers learnt in such a multilayer
network? The general idea of gradient descent can be applied in this setting as well. There is
a key challenge, however. The measure of error is available only at the output nodes; for the
intermediate nodes, we have no direct access to error information. A clever popular algorithm
called Backpropagation that has revolutionized the field of ML solves precisely this problem
(Rumelhart, Hinton, and Williams, 1986). The idea is to propagate the errors back from the
output nodes to the intermediate nodes, to facilitate weight updates across layers. Note that the
signal travels forward; it is only the errors that propagate in the reverse direction.
Figure 11.17 The clustering problem
The space of hypotheses is the set of possible clusterings given a dataset. For instance,
given the data in Figure 11.18, three candidate hypotheses are shown in the same figure. Visual
inspection reveals that Clustering 3 is more representative of natural groups present in the data.
We need an objective function to characterize this intuition so that we can use this function to
guide the search over the space of hypotheses. Given a formulation of the objective function,
the K-Means algorithm can be viewed as solving an optimization problem.
(Figure: successive snapshots of the clustering produced by K-Means, Step 1 through Step 5.)
Let us consider an analogy to drive home the idea behind this algorithm. You have K
cameras pointing at an object, such that each camera can only observe a part of that object.
If you knew what the object looks like, you could have determined precisely where to fix
the K cameras. However, unless you position the cameras appropriately, you would not know
what the object looks like. Such a chicken-and-egg problem is at the heart of unsupervised
clustering. If we knew the class labels of the input instances, we could model each class; on
3 The terms ‘E step’ and ‘M step’ owe their origin to the words ‘expectation’ and ‘maximization’, which relate to the
underlying optimization performed by K-Means. While the details need not bother us here, it may be worth noting
that K-Means is a special case of a more general class of algorithms that goes by the name EM algorithm, which finds
a lot of application in various contexts in ML literature.
the other hand, if we knew the models, we could have inferred the class labels. The K-Means
objective function has two sets of parameters: the cluster centroids and affiliations of instances
to the clusters. In the M step, the affiliations are held fixed and the centroids updated; on the
other hand, the E step updates the affiliations given fixed centroids. It can be proved that each iteration never makes the objective worse - we will never be worse off than where we started. The K-Means algorithm starts off with a random
initial configuration of centroids. It may be noted that the clustering produced is sensitive to this
random initialization, and it is typical in practice to try multiple initializations and pick the best
result. This is somewhat like the iterated hill climbing algorithm described in Section 4.6.3.
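A minimal sketch of the two alternating steps with Euclidean distance and centroids initialized by sampling from the data; all names are illustrative.

    import random, math

    def kmeans(points, k, iterations=20):
        centroids = random.sample(points, k)               # random initial configuration
        for _ in range(iterations):
            # E step: affiliate each point with its nearest centroid (centroids fixed)
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
                clusters[j].append(p)
            # M step: recompute each centroid as the mean of its affiliated points
            for j, cluster in enumerate(clusters):
                if cluster:
                    centroids[j] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
        return centroids, clusters

    random.seed(0)
    data = [(1, 1), (1.2, 0.9), (0.8, 1.1), (5, 5), (5.2, 4.8), (4.9, 5.1)]
    centroids, clusters = kmeans(data, k=2)
    print(centroids)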
Summary
ML aims to capitalize on the experience gleaned from problem solving in the past and create
a representation that offers a quicker solution for a new problem. In this chapter we have seen a
few examples of approaches in ML to learn from the past. The algorithms described are applied
to solving a particular problem, which very often is classification. We have looked at four
supervised learning algorithms in which the labels provided by the user in the past are used to
generalize and create a hypothesis about how to describe and differentiate the different classes.
The approaches all learn by optimizing some parameters that minimize the classification error
in the training examples. If the training examples are sufficient in number and the model being
learnt does not overfit, then the model is expected to perform well on the unseen test examples.
We also looked at unsupervised learning approaches to identify clusters in a dataset. The
general idea behind clustering is that instances that are close to each other based on an appropriate
distance measure are assigned to the same cluster. The K-Means clustering algorithm accepts
K as a parameter and iterates through the data instances eventually forming K clusters.
In the next and last chapter we look at a unifying formalism of constraint satisfaction problems that allows a combination of search and constraint propagation to be integrated into one problem solver.
Exercises
1. The algorithm for constructing decision trees selects the attribute whose values are used to
partition the training data into subsets, on which a decision tree is constructed recursively.
Suppose a tree similar to the one in Figure 11.2 is constructed from a large dataset for a
company, and the attribute Gender shows up somewhere in the decision tree, what can one
conclude about the algorithm and the dataset?
2. [Adwait] Section 11.2 illustrates how the classification of a test instance can potentially
vary with the value of k. In practice, we typically restrict our attention to smaller values
of k so that instances very far from (or equivalently, dissimilar to) the test instance do not
influence its classification. Assuming that we have a finite set of candidate values of k, how
would you choose a suitable value for k in a principled way?
3. [Adwait] When a test instance is presented to a kNN classifier, the classifier is required
to compute the similarity of the test instance with all the labelled instances in the dataset.
Apparently, this computation is time-consuming, and a boost in its efficiency is highly
desirable. Consider the following dataset. Can you think of instances that can be deleted
without compromising the effectiveness of a 1-NN classifier? Also, argue that the time
taken to classify a new test instance is lower when using this reduced dataset.
4. [Adwait] The Euclidean distance is a conventional distance measure used when dealing
with numerical attributes. Consider a setting where a person is to be classified into one of
two classes - wealthy or not-wealthy - based on her savings (S) and the number of houses
(H) she owns. Notice that the ranges of these two attributes are drastically different, which
might make using plain Euclidean distance formulation rather misleading. In particular,
it is expected that S will dominate the value of the Euclidean distance and can very likely
mask the contribution of H. Can you think of a way to overcome such an issue?
5. We have seen in Section 11.3 that the Naive Bayes classifier assumes that the features are
conditionally independent given the class label. Can you identify a real world situation
where such an assumption makes sense and one where it does not?
6. [Adwait] The activation function in the intermediate nodes of an ANN plays a critical role
in allowing the network to learn complex functions. Interestingly, if no activation function
is used (equivalently, using a linear/identity activation function), the network is capable
of only learning functions that are linear in nature. Provide supporting arguments for this
claim.
7. [Adwait] We have seen that the clusters identified by the K-means algorithm are often
influenced by the choice of initial cluster centroids. Can you provide supporting evidence
using a synthetic dataset such that different initial cluster centres lead to visibly different
clusters?
chapter 12
Constraint Satisfaction
What is common between solving a sudoku or a crossword puzzle and placing eight
queens on a chessboard so that none attacks another? They are all problems where
each number or word or queen placed on the board is not independent of the others.
Each constrains some others. Like a piece in a jigsaw puzzle that must conform to
its neighbours. Interestingly, all these puzzles can be posed in a uniform formalism,
constraints. The constraints must be respected by the solution - the constraints must be
satisfied. And a unified representation admits general purpose solvers. This has given
rise to an entire community engaged in constraint processing. Constraint processing
goes beyond constraint satisfaction, with variations concerned with optimization. And
it is applicable to a plethora of problems, some of which have been tackled by
specialized algorithms like linear programming and integer programming.
A constraint satisfaction problem (CSP) is specified by a constraint network
R = <X, D, C>
where X is a set of variable names, D is a set of domains, one for each variable, and C is
a set of constraints on some subsets of variables (Dechter, 2003). We will use the names
X = {x1, x2, ..., xn} where convenient with the corresponding domains D = {D1, D2, ..., Dn}. The domains can be different for each variable and each domain has values that the variable can take, Di = {ai1, ai2, ..., aik}. Let C = {C1, C2, ..., Cm} be the constraints. Each constraint Ci has a scope Si ⊆ X and a relation Ri that is a subset of the cross product of the domains of the variables in Si. Based on the size of Si, we will refer to the constraints as unary, binary, ternary,
and so on. A CSP is often depicted by a constraint graph and a matching diagram, as described
in the examples to follow.
We will confine ourselves to finite domain CSPs, in which the domain of each variable is
discrete and finite. We will also specify the relations in extensional form well suited for our
algorithms. For example, given a common domain {1, 2, 3, 4} for each variable, if we have a
binary constraint between two variables xi and xk in which the value in xi is smaller, then we
represent it as
Rik = {<1, 2>, <1, 3>, <1, 4>, <2, 3>, <2, 4>, <3, 4>}
The pairs in the relation are the allowable combination of values for the two variables
respectively. For example, xi = 1 and xk = 4 are allowed. Note that we have adopted a naming
convention for the relation as well, with the subscripts in Rik referring to the subscripts of the
two related variables. We shall focus largely on binary constraint networks (BCNs) in this
chapter.
An assignment A is a set of variable-value pairs, for example, {x2 = a21, x4 = a45, x7 = a72}. We also say that the assignment is an instantiation of the set of variables. An assignment to a subset of the variables is a partial assignment. Wherever there is no confusion, we will represent the assignment as a tuple A = <a1, a2, ..., ap> where it is understood that these are the variables x1, x2, ..., xp instantiated.
An assignment A satisfies a constraint Ci if Si ⊆ {x1, x2, ..., xp} and A|Si ∈ Ri, where A|Si is the projection of A onto Si.
An assignment A is consistent if it satisfies all the constraints whose scope is covered by A.
A solution to a CSP is a consistent assignment over all the variables in X. The CSP thus expresses a relation over all the variables of X, also called sol(R), the solution relation.
The map colouring problem shown in Figure 12.1 can be written as the network
R = <X, D, C>
X = {A, B, C, D, E}, D = {DA, DB, DC, DD, DE}, C = {RAB, RBC, RBD, RCD, RDE}
DA = {b, g}, DB = {r, b, g}, DC = {b}, DD = {r, b, g}, DE = {r}
RAB = {<b, r>, <b, g>, <g, r>, <g, b>}
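The same network can be written down directly as data structures, as in the following Python sketch; only RAB is spelled out, and the remaining not-equal relations would be listed in the same style.

    # Variables, domains, and binary constraints of the map colouring example
    variables = ['A', 'B', 'C', 'D', 'E']
    domains = {'A': {'b', 'g'}, 'B': {'r', 'b', 'g'}, 'C': {'b'},
               'D': {'r', 'b', 'g'}, 'E': {'r'}}

    # Each constraint maps the pair of variables in its scope to the allowed value pairs
    constraints = {
        ('A', 'B'): {('b', 'r'), ('b', 'g'), ('g', 'r'), ('g', 'b')},
        # RBC, RBD, RCD, RDE would be listed in the same not-equal style
    }

    def consistent(assignment):
        # A satisfies a constraint if its projection onto the scope is an allowed pair
        for (x, y), allowed in constraints.items():
            if x in assignment and y in assignment:
                if (assignment[x], assignment[y]) not in allowed:
                    return False
        return True

    print(consistent({'A': 'b', 'B': 'r'}))   # True
    print(consistent({'A': 'b', 'B': 'b'}))   # False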
Every CSP can be depicted as a constraint graph. The nodes in the graph are the variables
in the CSP and an edge between two nodes says that the two variables participate in a constraint.
This is true even when the constraint is ternary or higher. Constraint graphs are consulted by
some algorithms in deciding the order of visiting variables.
Another diagram that is useful is the matching diagram. An edge in the matching diagram
connects two values in two variables that together participate in some constraint. Figure 12.1
shows three views of the map colouring problem. On the left is the map showing the regions
that share a boundary. In the centre is the constraint graph, where each region is represented
by a node or a variable with an edge between two nodes that share a boundary. In the figure
the nodes have the domains shown alongside, and the label on an edge represents the not-equal
relation. The two related variables are only allowed different values. On the right is the matching
diagram that makes the relation explicit, with every pair of allowed values being connected
with an edge. Implicit in the matching diagram is the universal relation between nodes not
connected in the constraint graph, for example, A and C. Any combination of values of such
pairs of nodes is allowed, though not shown explicitly in the matching diagram.
Figure 12.1 A map colouring problem on regions A, B, C, D, and E is on the left. The constraint
graph is in the centre and the matching diagram on the right. An edge in the matching diagram
stands for an allowable pair of colours. For regions that are not adjacent, the matching
diagram has an implicit universal relation where any combination of values is allowed.
The matching diagram shows pairs of values that can possibly occur together in a solution.
When the fog clears, only the pairs that are part of a solution are left. We illustrate this
phenomenon with the 6-queens problem in the next section.
Thinking of a physical chessboard, the first thought is to have N² variables for the squares with
each possibly having a queen. But we can exploit the knowledge that only one queen can be in
one row and one column. This suggests a compact representation that is commonly used. Each
row (or each column) can be a variable which will have one queen identified by the column (or
row) in which it is. Figure 12.2 shows the 6-queens problem in which a queen has to be placed
in each row. The row number becomes the variable, and the column number the value. In this
representation there are six variables X = {1, 2, 3, 4, 5, 6} and each Di = {a, b, c, d, e, f}.
Figure 12.2 The 6-queens problem is to place the six queens on a 6 x 6 chessboard such that
no queen attacks another. The six queens must be on six different rows. We name each row
as a variable, with the column names as values. The arrows show the squares attacked by a
queen on square c4. The figure on the right is the constraint graph, which is a complete graph
since each queen is constrained by every other queen.
As one can see, the constraint graph, shown on the right, is a complete graph. This is
because every queen can potentially be attacked by every other queen. The pairwise allowed
values are captured in the relations C = {R12, R13, R14, R15, R16, R23, R24, R25, R26, R34, R35, R36,
R45, R46, R56}. We describe R12 and leave the other relations for the reader to complete.
Figure 12.3 shows a part of the matching diagram. The relations covered in the diagram
are R12, R13, R14, R15, R16, R25, and R36. Even with this subset of relations, one can see that
there is a large number of combinations to choose from. As one can see, there is verily a fog of
connections for each variable.
Figure 12.3 The matching diagram for the 6-queens problem. Only edges for the relations
R12, R13, R14, R15, R16, R25, and R36 are drawn in the figure, giving rise to the higher density of
edges on the left. A close scrutiny will reveal that there are three or four edges from a value for one variable to values in another variable.
In the solution, one value must be selected from the domain of each variable. Further,
each value in each variable must have an edge connected to a value in every other variable that
must be in the solution. The task of solving the CSP is to clear the fog and reveal the solution.
Figure 12.4 shows one solution for the 6-queens problem.
Figure 12.4 A solution <b, d, f, a, c, e> for the 6-queens problem highlighted on the matching
diagram. The solution is also shown on the board on the right.
The solution <b, d, f, a, c, e> is also shown in the figure on the right.
Now we turn our gaze towards solving CSPs. The algorithms we are interested in are
domain independent in nature, exemplifying the spirit of this book. The idea is again that users
can pose their problems as a CSP, and then use a general off-the-shelf solver for solving the
CSP. There is a two-pronged strategy for solving a CSP. One is search. The idea here is that one
picks the variables one at a time and assigns a value to the variable, which is consistent with
earlier variables. The main problem faced by brute force search is combinatorial explosion,
and we look at methods to mitigate that. The second is consistency enforcement or constraint
propagation, which aims to prune the space being searched. Done to the extreme this can obviate
the need for search altogether, but at a considerable cost. In practice, a judicious combination of
the two works best. We begin with search.
Backtracking(X, D, C)
1. A ← []
2. i ← 1
3. D'i ← Di
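The listing of Backtracking is truncated here by the page break. The following Python sketch, with illustrative names, captures the same idea: variables are picked in a fixed order, and a value is accepted only if it is consistent with the assignment made to earlier variables.

    def backtracking(variables, domains, constraints, assignment=None):
        assignment = {} if assignment is None else assignment
        if len(assignment) == len(variables):
            return assignment                                   # all variables instantiated
        x = next(v for v in variables if v not in assignment)   # static variable ordering
        for value in domains[x]:
            assignment[x] = value
            consistent = all((assignment[a], assignment[b]) in allowed
                             for (a, b), allowed in constraints.items()
                             if a in assignment and b in assignment)
            if consistent:
                result = backtracking(variables, domains, constraints, assignment)
                if result is not None:
                    return result
            del assignment[x]                                    # undo and try the next value
        return None                                              # dead end: backtrack

    # Tiny usage example: three regions that must all get different colours
    doms = {'X': ['r', 'g'], 'Y': ['r', 'g'], 'Z': ['r', 'g', 'b']}
    cons = {(a, b): {(u, v) for u in doms[a] for v in doms[b] if u != v}
            for (a, b) in [('X', 'Y'), ('Y', 'Z'), ('X', 'Z')]}
    print(backtracking(['X', 'Y', 'Z'], doms, cons))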
Figure 12.5 shows the progress on the tiny map colouring problem from Figure 12.1. The
order of variables is alphabetic. The very first choices for the variables A, B, and C are accepted,
but when it comes to variable D only the third choice g works.
Figure 12.5 BACKTRACKING does depth first search on the problem from Figure 12.1 and finds
the solution <b, r, b, g, r>. On the way SelectValue has rejected the values D = r and D = b.
The constraint graph is shown on the right.
The order of processing variables will clearly impact the complexity of the search. There
are essentially two approaches to deciding this order. One is a static approach that looks at
the topology of the constraint graph to choose an order with fewer dead ends. We look at that
next. The other is to dynamically choose the next variable to try, in tandem with constraint
propagation. We will describe that after looking at the consistency enforcement algorithms.
Algorithm 12.2. Algorithm MinWidth accepts a graph <V, E> with N nodes and returns a min-width ordering of the graph.
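The body of the algorithm is not reproduced here; the greedy procedure it describes - repeatedly pluck a node of minimum degree from the graph and place it at the end of the ordering - can be sketched as follows (names are illustrative).

    def min_width_ordering(nodes, edges):
        # edges is a set of frozensets {u, v}; returns an ordering of the nodes
        remaining = set(nodes)
        live_edges = set(edges)
        ordering = []
        while remaining:
            # pluck a node of minimum degree in the remaining graph
            v = min(remaining, key=lambda n: sum(1 for e in live_edges if n in e))
            ordering.append(v)
            remaining.discard(v)
            live_edges = {e for e in live_edges if v not in e}
        return list(reversed(ordering))   # the node plucked first goes last in the ordering

    edges = {frozenset(p) for p in [('A', 'B'), ('A', 'E'), ('B', 'E'), ('B', 'D'), ('D', 'E')]}
    print(min_width_ordering(['A', 'B', 'D', 'E'], edges))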
When a node has multiple parents, perhaps constraints can be imposed between them.
Consider the example of variable U having variables X, Y, and Z as parents. Searching for
values in the given order one can see that if X, Y, and Z were originally unrelated, then adding
the constraints X = Z, X = Y, and Y = Z would have made finding a value for U easier. One
would not have to backtrack and try different values for X or Y or Z. As we shall see later,
adding such constraints with the goal of minimizing backtracking is a strategy in consistency
enforcement. It would be desirable to enforce enough consistency to make the search backtrack
free. But the cost of achieving that could outweigh the savings.
In this context one can introduce the notion of an induced graph with an induced width, in
which edges are added connecting parents of nodes in the order being imposed. Unfortunately,
finding a min induced width ordering is NP-complete, but the following greedy algorithm
often produces very good ones (Dechter, 2003). The greedy algorithm below is similar to
Algorithm 12.2 except that before removing the selected node v from the graph, all its parents
are connected pairwise.
A variation that often performs better is to choose the node to be plucked using a different
criterion. Instead of selecting the node with the lowest degree, one picks a node which has a
minimum number of unconnected parents. Then only a few new edges will need to be added.
This algorithm is called MinFill.
Figure 12.6 shows a few orderings for a small graph with seven nodes X = {A, B, C, D,
E, F, G} shown on the top. The first ordering in the figure is the alphabetic ordering (A, B, C,
D, E, F, G). With this ordering Backtracking would assign a value for variable A first and
variable G last. The alphabetic ordering has a width 3, because node E has three parents A, B,
and D. The second ordering is reverse alphabetic and has width 4 since A has degree 4. The
third ordering is the one produced by the MinWidth algorithm and has a width 2. The last one
is the one produced by the MinInducedWidth algorithm. It also has a width 2, but has an
additional edge connecting D and G, the parents of F which occurs later in the ordering.
Figure 12.6 A graph and some orderings. Both the min-width and min-induced-width orderings
have a width 2. The alphabetic ordering has a width 3, and the reverse alphabetic ordering has the maximum possible width of 4.
The reader should verify that if the given graph were to be a tree, then both the algorithms
will produce an ordering of width 1. When we have a CSP ordering of width 1, then it is
possible to do backtrack free search. This is because each node is constrained by only one
parent who already has a value.
If the graph has cycles, then the minimum width possible is 2. This is the case for the
example above.
When a variable has only a few values left to choose from, it makes sense to assign a value to this variable before considering the others. This is the approach taken in dynamic
variable ordering, where the order in which variables are processed is decided on the fly.
This becomes even more relevant when the domains of future variables are pruned by
the algorithm. We illustrate this with a cursory description of the algorithm ForwardChecking
discussed later in more detail. The crux of the algorithm is that when it considers a value for a
variable, it deletes values from future variables that would become inconsistent with the current
assignment. We illustrate this with the small map colouring example from Figure 12.1, whose domains and constraints were listed earlier.
We begin with the variable C which is one of the two with the smallest domains.
As seen here, dynamic variable ordering considers those variables first which have the fewest
values to choose from. And deleting values from future variables removes potentially conflicting
choices. In the process, if a future variable becomes empty, the search algorithm can backtrack
from the current variable itself. We will illustrate this in Section 12.4.
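A minimal Python sketch of this combination - forward checking of future domains plus dynamic variable ordering - is given below; it anticipates the fuller treatment in Section 12.4, reuses the dictionary representation of domains and constraints sketched earlier, and all names are illustrative.

    import copy

    def forward_checking(domains, constraints, assignment=None):
        assignment = {} if assignment is None else assignment
        if len(assignment) == len(domains):
            return assignment
        # dynamic variable ordering: pick the unassigned variable with the fewest values left
        x = min((v for v in domains if v not in assignment), key=lambda v: len(domains[v]))
        for value in list(domains[x]):
            pruned = copy.deepcopy(domains)
            pruned[x] = [value]
            wiped_out = False
            # delete values of future variables that are inconsistent with x = value
            for (a, b), allowed in constraints.items():
                if a == x and b not in assignment:
                    pruned[b] = [w for w in pruned[b] if (value, w) in allowed]
                    wiped_out = wiped_out or not pruned[b]
                elif b == x and a not in assignment:
                    pruned[a] = [w for w in pruned[a] if (w, value) in allowed]
                    wiped_out = wiped_out or not pruned[a]
            if not wiped_out:
                result = forward_checking(pruned, constraints, {**assignment, x: value})
                if result is not None:
                    return result
        return None   # every value of x wipes out some future domain: backtrack

    # Can be called with the domains/constraints dictionaries sketched earlier:
    # forward_checking(domains, constraints)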
Algorithm 12.4. Algorithm Revise prunes the domain of variable X, removing any value
that is not paired to a matching value in the domain of variable Y.
Revise((X), Y)
1. for every a ∈ DX
2.   if there is no b ∈ DY s.t. <a, b> ∈ RXY
3.   then delete a from DX
The worst case complexity of Revise is O(k²) where k is the size of each domain. The
worst case happens when no value in X has a matching value in Y. An edge (X, Y) in a constraint
graph is said to be arc consistent iff both X and Y are arc consistent with respect to each other.
A constraint network R is said to be arc consistent if all edges in the constraint graph are arc
consistent. A network is said to be 2-consistent if a consistent assignment to any variable can be extended to a consistent assignment to any other variable. Clearly, if a network is 2-consistent, it must
be arc consistent as well. A simple brute force algorithm AC-1 cycles through all edges in the
constraint graph until no domain changes (Mackworth, 1977; Mackworth and Freuder, 1985).
Algorithm 12.5. Algorithm AC-1 cycles through all edges repeatedly, going through another full cycle even if just one value is removed from one variable.
AC-1 (X, D, C)
1. repeat
2. for each edge (x, y) in the constraint graph
3. Revise((x), y)
4. Revise((y), x)
5. until no domain changes in the cycle
Let there be n variables, each with domain of size k. Let there be e edges in the constraint
graph. Every cycle then has complexity O(ek²). In the worst case, the network is not AC, and in every cycle exactly one element in one domain is removed. Then there will be nk cycles. The worst case complexity of AC-1 is therefore O(nek³).
Before improving upon the arc consistency algorithm, we look at how deduction with
modus ponens can be seen as constraint propagation. Let the knowledge base be {P, P ⊃ Q, Q ⊃ R, R ⊃ S}. Working with Boolean formulas, each propositional variable has two values in its domain, 1 (true) and 0 (false). The truth table of the binary relation X ⊃ Y can be represented by the constraint RXY = {<0, 0>, <0, 1>, <1, 1>}. The CSP can then be viewed as a network over the variables {P, Q, R, S} with the constraints RPQ, RQR, and RRS as defined above, along with the unary constraint restricting P to 1.
First achieving node consistency prunes the domain of P to {1}. Then achieving arc
consistency prunes the rest of the variables to also contain only 1. The process is illustrated in
Figure 12.7 with matching diagrams.
Figure 12.7 Logical deduction can be seen as consistency enforcement, given the variables P, Q, R, and S and the constraints defined by the KB = {P, P ⊃ Q, Q ⊃ R, R ⊃ S}. Node consistency followed by arc consistency results in the domains of all variables having only 1. This amounts to deducing that Q, R, and S are true.
The algorithm AC-1 is an overkill. It makes unnecessary calls to Revise. A better strategy
is as follows. If Revise((X), Y) removes some value v from the variable X, one need only check
that all edges connected to X are still arc consistent. It is possible that the value v was the only
support for some value w in a variable W. Then a call to Revise((W), X) is needed. This is done
by algorithm AC-3 that pushes all such connected pairs of variables into a queue. A change in
a variable is propagated to the connected variables. Only those are considered again for a call
to Revise.
Algorithm 12.6. Algorithm AC-3 begins by invoking Revise for all edges in the
constraint graph. After that, if the domain of a variable P has changed, then consistency
w.r.t. P is enforced for all neighbours of P.
AC-3(X, D, C)
1. Q ← []
2. for each edge (N, M) in the constraint graph
3.   Q ← Q ++ [(N, M), (M, N)]
4. while Q is not empty
5.   (P, T) ← head Q
6.   Q ← tail Q
7.   Revise((P), T)
8.   if DP has changed
9.     for each R ≠ T such that (R, P) is in the constraint graph
10.      Q ← Q ++ [(R, P)]
The complexity of AC-3 is O(ek³) where e is the number of edges and each domain is of size k. Of this, k² comes from Revise. For each of the e edges, it makes 2k calls to Revise in the worst case if it deletes all values from the two connected variables.
One can be more frugal if one realizes that the call to Revise can itself be an overkill.
Just because a value v has been deleted from a variable1 X why should one make a call Revise
((Y), X) to check if every value in Y is still supported by values in X? If one could keep track of
the values in Y that were being supported by v ∈ DX, then if v were the only support of a value
w ∈ DY, then one can go ahead and delete w from DY. Following this, we will have to check if w
in turn was the only support for some value in a connected variable. This is done by algorithm
AC-4 which, however, needs more bookkeeping to be done to keep track of individual support
from values. The following data structures are used. Let R = <X, D, C> be the network, and
let x and y be variables in X (Dechter, 2003).
- The support set S is a set of sets, one for each variable-value pair <x, a>, named S<x, a>.
For each variable-value pair <x, a> the support set contains a list of supporting pairs from
other variables. When a value a ∈ Dx is deleted, the set S<x, a> is instrumental in checking
which values in other variables might have lost a support.
1 We often say ‘from a variable X’ as a short form for ‘from the domain of a variable X’.
The algorithm AC-4 begins by inspecting the network R setting up the records of links from the
matching diagram in the set S, and a count of how many values from a variable y support a value
a ∈ Dx. This requires visiting both ends of all the e edges in the network and all k² combinations
of values for the two connected variables. This is done in O(ek²) time, and also requires O(ek²)
space. All this work is done upfront.
Algorithm 12.7. Algorithm AC-4 begins by inspecting all edges in the matching
diagram and identifying the list of all supports for all variable-value pairs, and the count of
number of supports for each value from another variable. It deletes a value with count 0
and then decrements the count of all connected values.
AC-4(X, D, C)
1. Q ← [ ]
2. initialize S<x,a> and counter(x, a, y) for each Rxy in C
3. for each counter
4.     if counter(x, a, y) = 0
5.         Q ← Q ++ [<x, a>]
6. while Q is not empty
7.     <x, a> ← head Q
8.     delete a from Dx
9.     for each <y, b> in S<x,a>
10.         counter(y, b, x) ← counter(y, b, x) - 1
11.         if counter(y, b, x) = 0
12.             Q ← Q ++ [<y, b>]
After the initialization, if there is a missing support for a value a for variable x (from some
variable y), then a is deleted from Dx and then the set S is inspected to decrement all counters
for all related variable-value pairs <y, b>. If any counter becomes 0, then that variable-value
pair is added to the queue Q of values destined for deletion. In this manner, the propagation
is extremely fine grained. Whenever a value is deleted, the algorithm pursues the links in the
matching diagram, effectively deleting each such link. Then if a value in some domain is left
without a link, that is added to the queue for deletion as well.
The initialization step that creates the counters and the support pointers requires, at most,
O(ek²) steps. The total number of elements across the sets S<x,a> is of the order of ek², and each is
accessed at most once. Therefore the time and space complexity of AC-4 is O(ek²).
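The bookkeeping of AC-4 can be sketched in Python as follows (an illustrative rendering; domains are assumed to be mutable sets, and, as before, both directions of every constraint are stored).

from collections import defaultdict, deque

def ac4(domains, constraints):
    # Phase 1: record, for every value, which values it supports and how many supports it has.
    support = defaultdict(set)    # support[(y, b)] = set of (x, a) pairs that b supports
    counter = {}                  # counter[(x, a, y)] = number of supports a has in Dy
    queue = deque()
    for (x, y), allowed in constraints.items():
        for a in set(domains[x]):
            count = 0
            for b in domains[y]:
                if (a, b) in allowed:
                    count += 1
                    support[(y, b)].add((x, a))      # deleting b threatens (x, a)
            counter[(x, a, y)] = count
            if count == 0 and a in domains[x]:       # a has no support at all in y
                domains[x].discard(a)
                queue.append((x, a))
    # Phase 2: propagate individual deletions through the recorded support links.
    while queue:
        y, b = queue.popleft()
        for (x, a) in support[(y, b)]:
            if a in domains[x]:
                counter[(x, a, y)] -= 1
                if counter[(x, a, y)] == 0:
                    domains[x].discard(a)
                    queue.append((x, a))
    return all(domains.values())                     # False if some domain became empty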
What can one say about a CSP on which arc consistency has been achieved and no domain
is empty? If the constraint graph is a tree, then the CSP has a solution. This is because each
variable is constrained by exactly one variable. Moreover, if one chooses the min-width
ordering of the variables, the search will be backtrack free. If the constraint graph is not a tree,
then it may be possible that the CSP has no solution. This is illustrated in Figure 12.8 where
the network on the left has no solution even though it is arc consistent. But after removing one
edge (BC) it becomes a tree, and this has two solutions.
Figure 12.8 The CSP on the left is arc consistent but does not have a solution. The network
on the right is similar to the one on the left except that it has one edge (B,C) less which makes
it a tree. This network has two solutions.
The reader is encouraged to try out different orderings of the network and verify that the
min-width ordering for the tree can be solved in a backtrack free manner.
Figure 12.9 The three kinds of edge labels and four kinds of vertices in trihedral objects.
Each edge in a line drawing can be labelled in one of four ways: +, -, →, and ←. Then a W
or a Y or a T vertex can have 4³ = 64 different combined labels and an L vertex can have 4² = 16.
The interesting thing is that for trihedral objects without cracks or shadows, there are only 18
kinds of edge label combinations that are physically possible. These are shown in Figure 12.10.
Figure 12.10 The 18 different kinds of vertices possible in line drawings for trihedral objects
without cracks or shadows. Some texts leave out the middle two T vertices because that
configuration comes from a non-normal viewpoint.
Every edge in a line drawing connects two vertices but it can have only one label. This lays
the foundation of constraint propagation. If one knows the label at one end, then that label must
be propagated to the other end as well. And at the other end, the other edges impinging on the
vertex will be constrained by possibilities shown in Figure 12.10.
The constraint propagation algorithm was written by David Waltz, who extended the scope
of objects handled manifold (Waltz, 1975). The Waltz algorithm, as it is now known, could handle
objects with vertices of more than three edges, objects with cracks, and images with light and shadows.
The number of edge labels shot up from 4 to 50-plus, and the number of valid vertices shot
up to thousands. The algorithm is somewhere between AC-1 and AC-3 and does propagation
from vertex to vertex. We illustrate the propagation with a trihedral object shown on the left in
Figure 12.11.
Figure 12.11 The Waltz algorithm begins by demarcating the two solid objects and marks
the external sequence of edges A-B-C-D-E-F with arrows surrounding the solid material.
In the diagram on the left the edge B-G in the W junction at B gets a label +. This + label
must be the same at the G end of the edge B-G as well. The other two edges on vertex G
can now only be +. This is the kind of propagation the Waltz algorithm does. A similar process
is followed in the line diagram on the right, but there one has two choices for the edge I-J
during propagation.
Both the objects in Figure 12.11 are solid objects, as careful observation will reveal. The
Waltz algorithm begins by isolating an object from the background. It does so by labelling
the outermost edges with arrows, going in the clockwise direction along the outermost lines.
In both the objects, these are A → B, B → C, C → D, D → E, and E → F. Now there are
only three kinds of W vertices as shown in Figure 12.10, and only one of them has two arrow
labels. Consequently, the third edge in these vertices must be a convex edge and can be labelled
with a ‘+’.
This is illustrated for the edge B-G for the object on the left. We have labelled it twice to
emphasize the fact that the label is propagated from the W vertex B to the Y vertex G. Now
there is only one kind of Y vertex that has ‘+’ labels, and all three of the edges impinging on
it must be labelled ‘+’. This label can now be propagated along the two edges emanating from
G to the connected W vertices. This process continues and the entire set of edges can be labelled
unambiguously.
For the object on the right, the labelling may involve backtracking. The reader should
verify that the edge H-I must be labelled with a ‘-’. Now there are two possibilities of labelling
the other two edges at vertex I. Either both must be ‘-’ or both must be arrows. If I-J is a ‘-’,
then the edge J-K can only be a ‘+’ given the constraints on W vertices. This results in K being
labelled with ‘+++’. If it is to be an arrow, then the direction must be I → J. But then there is
no label possible for the edge J-K. So if the algorithm were to select I → J, it would have to
backtrack and select the label ‘-’.
The reader is encouraged to complete the labelling process for both objects.
Figure 12.12 The network on the left is not path consistent because an assignment <B = r,
C = b> cannot be extended to the variable D. Making it path consistent adds a new constraint
RBC = {<r, r>, <b, b>} to the CSP. Now the variables B and C are related by the equality
relation. Earlier, it was implicitly the universal relation.
The astute reader would have noticed that in the process of making the network 3-consistent
we have introduced a new edge in the constraint graph for the relation B = C. The reader must
also keep in mind that when the vertices B and C were not connected in the constraint graph,
it meant that any value in B was locally consistent with any value in C. That is, no constraint
between B and C was specified, and RBC was a universal relation {<r, r>, <r, b>, <b, r>,
<b, b>}. After the propagation this was pruned to {<r, r>, <b, b>}, and then an edge B-C
was introduced in the constraint graph. This is done by the algorithm Revise-3 which takes
three variables X, Y, and Z, and removes any pair of values <X = a, Y = b> when a and b
are not connected to some value c ∈ DZ (Dechter, 2003). In other words, we are pruning the
relation RXY.
Algorithm 12.8. Algorithm Revise-3 prunes the relation RXY, removing any edge <a, b>
that does not have a matching value in the domain of variable Z.
Revise-3((X,Y), Z)
1. for every <a, b> ∈ RXY
2.     if there is no c ∈ DZ s.t. <a, c> ∈ RXZ and <b, c> ∈ RYZ
3.         then delete <a, b> from RXY
This is, in fact, an instance of the general case wherein making a network N-consistent
induces a relation of arity N - 1, which essentially prunes an existing relation that could have
been universal. This was the case also for arc consistency, because pruning the domain of a
variable X is equivalent to inducing a relation RX on the network. We will have more to say on
this later.
The simplest algorithm to achieve path consistency is analogous to AC-1. It repeatedly
considers all variable pairs X and Y and eliminates pairs of values a ∈ DX and b ∈ DY that
cannot be extended to a third variable Z. The algorithm is called PC-1.
Algorithm 12.9. Algorithm PC-1 repeatedly calls Revise-3 with every pair of variables
for path consistency with every other variable, until no relation RYZ is pruned. The
algorithm assumes that every pair of variables is related, even if by a universal relation
which does not show up in the constraint graph.
PC-1(X, D, C)
1. repeat
2.     for each x in X
3.         for each y and z in X
4.             Revise-3((y, z), x)
5. until no relation changes in the cycle
Let there be n variables, each with domain of size k. The complexity of Revise-3 is O(k³)
because the algorithm has to look at all values of the three variables. In each cycle the algorithm
PC-1 inspects (n - 1)² pairs for each of the n variables, requiring O(k³) computations for
each call to Revise-3. Therefore, in each cycle, the algorithm will do O(n³k³) computations.
In the worst case, PC-1 will remove only one pair of values <a, b> from some constraint Rxy per cycle.
Then the number of cycles is O(n²k²), because there are n² pairs of variables, each having
k² elements. Thus in the worst case, algorithm PC-1 will require O(n⁵k⁵) computations
(Dechter, 2003).
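In Python, Revise-3 and PC-1 might be sketched as follows (illustrative code; relations are assumed to be stored as a dictionary keyed by ordered variable pairs, with missing constraints filled in as universal relations and both orientations present).

from itertools import product

def revise3(x, y, z, relations, domains):
    # Prune Rxy: drop any pair (a, b) that cannot be extended to some value c in Dz.
    rxy, rxz, ryz = relations[(x, y)], relations[(x, z)], relations[(y, z)]
    keep = {(a, b) for (a, b) in rxy
            if any((a, c) in rxz and (b, c) in ryz for c in domains[z])}
    changed = keep != rxy
    relations[(x, y)] = keep
    relations[(y, x)] = {(b, a) for (a, b) in keep}    # keep both orientations in step
    return changed

def pc1(variables, domains, relations):
    # PC-1: sweep all ordered triples repeatedly until a full cycle changes nothing.
    changed = True
    while changed:
        changed = False
        for x in variables:
            for y, z in product(variables, repeat=2):
                if len({x, y, z}) == 3 and revise3(y, z, x, relations, domains):
                    changed = True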
Note that unlike AC-1, the algorithm PC-1 is not confined to working only with the edges
in the constraint graph but considers all pairs of variables. It might be pertinent to remember
that two variables in the constraint graph without an edge are related by the universal relation,
which means that any combination of values is allowed. Achieving path consistency may delete
some elements from the universal relation, as illustrated in Figure 12.13.
Figure 12.13 The two figures on the left are the constraint graph and the matching diagram
for a network with four variables W, X, Y, and Z. The dashed edges represent the implicit
universal relations Rwz and RXY. On the right are the corresponding figures after the network
is made path consistent. The relations Rwz and RXY are now non-universal and show up in the
constraint graph. The edge c-d is deleted along with eight edges from Rwz and RXY.
It can be observed that after achieving path consistency every edge in the matching diagram
is part of a triangle with all other variables. The edge c-d in Figure 12.13 on the left gets deleted
because it is not part of any triangle with values in variables Y and Z. For the same reason, four
edges from each of the two implicit universal relations Rwz and RXY are also deleted. In the
network on the right, every edge is a part of two triangles.
Algorithm PC-1 looks at all triples of variables in every cycle. A better approach is to look
only at variables where an edge deletion may have broken a triangle in the spirit of AC-3. Let
the variables be an ordered set X = {x1,x2,...,xN}. Algorithm PC-2 too tests every variable pair
<xi, xj> where i < j against all other variables. It begins by enqueuing all such triples for calls
to Revise-3. Each pair of variables is added only once. If an edge <a, b> ∈ Rxy is deleted by a
call to Revise-3, then PC-2 only checks for the triangles formed by all other variables with x
and y. In the following algorithm, the indices 1, 2, ..., N of the variables x1, x2, ..., xN are stored
in the queue Q to enable only one of <xi, xj> and <xj, xi> to be added.
PC-2(X, D, C)
1. Q ← [ ]
2. for i ← 1 to N-1
3.     for j ← i+1 to N
4.         for each k s.t. k ≠ i and k ≠ j
5.             Q ← Q ++ [((i, j), k)]
6. while Q is not empty
7.     ((i, j), k) ← head Q
8.     Q ← tail Q
9.     Revise-3((xi, xj), xk)
10.     if Rij has changed
11.         for k ← 1 to N
12.             if k ≠ i and k ≠ j
13.                 Q ← Q ++ (((i, k), j) : [((j, k), i)])
Each call to Revise-3 is O(k³). The minimum number of calls is O(n³), which is the number
of distinct calls that can be made initially. In the worst case, a single pair of values is deleted in
each call to Revise-3 in the while loop. Since each of the O(n²) relations has at most k² pairs, and
each deletion can put O(n) triples back on the queue, the while loop can be executed at most
O(n³k²) times; with Revise-3 being O(k³), the complexity of PC-2 is O(n³k⁵).
Like AC-1 and AC-3, both PC-1 and PC-2 rely on calls to Revise-3, which can be a little
bit of overkill. As in AC-4, one can instead work at the level of individual values and edges in
the matching diagram, but we will not pursue that here. Mohr and Henderson (1986) have
devised such an algorithm, PC-4, with complexity O(n³k³).
It is worth noting that path consistency does not automatically imply arc consistency. This
is evident in Figure 12.13. The CSP is path consistent, but it is not arc consistent. In general, if
a CSP is i-consistent it does not mean that it is (i - 1)-consistent as well.
12.3.4 i-Consistency
The concept of consistency can be applied to any number of variables. Without going into the
details we observe that the process involves defining a function Revise-i in which a tuple tS from
a set S of (i - 1) variables is checked with one variable X for consistency. If there is no value v in
X that is consistent with the tuple then tS is deleted. This is equivalent to introducing a relation
RS of arity (i - 1). In general, the complexity of Revise-i is O(kⁱ), and algorithms for enforcing
i-consistency have the worst case time complexity O((nk)²ⁱ·2ⁱ) and space complexity of O(nⁱkⁱ),
as described in (Dechter, 2003).
Figure 12.14 The above network is neither arc consistent nor path consistent, but is both
directionally arc consistent and directionally path consistent. Given the order X, Y, Z,
Backtracking finds a solution without backtracking.
Given an ordering X = (x1, x2, ..., xN), a network <X, D, C> is said to be directionally arc
consistent (DAC) if for every edge <xi, xj> in the constraint graph such that i < j, variable xi is
arc consistent with respect to variable xj. Directional arc consistency can be achieved in a single
pass, processing variables from the last to the first.
Algorithm 12.11. Algorithm DAC scans the variables from the last to the first, calling
Revise with all parents in the constraint graph.
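The body of Algorithm 12.11 is not reproduced here; a minimal Python sketch of the idea, assuming the same dictionary representation used in the earlier sketches and an ordering given as a list, could be the following.

def revise(x, y, domains, constraints):
    # Delete values of x with no supporting value in y.
    allowed = constraints.get((x, y))
    if allowed is not None:
        domains[x] = {a for a in domains[x]
                      if any((a, b) in allowed for b in domains[y])}

def dac(ordering, domains, constraints):
    # Directional arc consistency: one pass from the last variable to the first,
    # revising each earlier neighbour (parent) against the later variable.
    for j in range(len(ordering) - 1, 0, -1):
        xj = ordering[j]
        for i in range(j):
            xi = ordering[i]
            if (xi, xj) in constraints:
                revise(xi, xj, domains, constraints)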
Directional path consistency (DPC) is similar, except that it prunes binary relations and
looks at all triples without any heed to the constraint graph. When it prunes an edge in the
matching diagram, it adds a relation to the constraint graph. In the following algorithm, we
have included DAC as well.
Algorithm 12.12. Algorithm DPC does one pass from the last variable down to the first
one. For each variable, it calls Revise-3 with all the preceding variables, and then it also
calls Revise.
Figure 12.15 illustrates the DPC algorithm on a 4-variable 2-colour map colouring
problem. To begin with, the constraint graph has three relations RWY, RXZ, and RYZ. After the
call Revise-3((X, Y), Z) an induced relation RXY is added, and after Revise-3((W, X), Z) the
relation RWX is added.
The calls made, in order (processing Z first, then Y, and then X), are:
Revise-3((X,Y), Z), Revise-3((W,Y), Z), Revise-3((W,X), Z),
Revise((Y), Z), Revise((X), Z), Revise((W), Z),
Revise-3((W,X), Y), Revise((X), Y), Revise((W), Y),
Revise((W), X).
Figure 12.15 DPC and DAC process the network in one pass from the last to the first node.
The original matching diagram and the constraint graph are on the top. The dashed edges
are from the universal relations. The revised versions are shown progressively below. The final
network is strongly path consistent and backtrack free.
The resultant network has an induced width of 2. Observe that the edge <r, r> between
variables W and Z is a remnant of the universal relation, and not a member of an induced
relation. With induced width 2, DPC is sufficient for search to be backtrack free. If the induced
width were to be higher, then a higher level of consistency
would be required. This is neatly arrived at by the algorithm AdaptiveConsistency, which also
processes the variables from the last to the first, but for each variable the degree of consistency
is tailor-made based on the number of parents the node has. One must keep in mind that as
the algorithm achieves the requisite consistency for a variable, it induces new relations on the
parents, which may increase the width of some nodes. The algorithm is described below. In
the literature a variation, called bucket elimination, that focuses on the relations explicitly is
also popular.
The propagation techniques for combating combinatorial explosion seen so far are largely
static and precede the search for solutions. Now we turn our attention to how some of these
can be carried forward to the search phase itself. We have already mentioned dynamic variable
ordering earlier. In the next section we look at ways for constraint propagation during search.
Before picking a value for a variable, can we compute the impact on the domains of future
variables?
Look before you leap.
The basic Backtracking algorithm checks a candidate value for a variable only against the values
assigned to earlier variables. The algorithms in this section look ahead at future variables in
addition to the ones in the past. Of course, as before, there is a cost to be paid for the extra
reasoning one does.
Consider trying to solve an N-queens problem on a real chessboard or one drawn on a piece
of paper. Every queen one places rules out all the squares it attacks for the other queens to be
placed. Imagine marking those squares with a cross. In Figure 12.16 we illustrate how, while placing
six queens row by row, this marking process can help narrow down the search and backtrack even
before a dead end is reached.
Figure 12.16 After placing a queen in the corner of the top row, the crosses mark the squares
no longer available. By the time search places the third queen in board position A, many
squares in the bottom half are already marked. At this point, placing a queen in the fourth row
would block the entire sixth row. The algorithm backtracks and tries position B with similar
effect. It next goes back to trying a new value for the second queen in board position C.
Placing queens row by row in the first available position, one finds oneself in board position
A after placing three queens. There is one unmarked square in row 4, but if one were to place
a queen there, row 6 would be completely blocked. The next, and last, option, marked B, for
the third queen would have a similar impact, with row 5 being ruled out this time. Without even
placing the fourth queen, one can backtrack and try another square for queen number 2 in board
position C. This is in essence the algorithm forward checking (FC).
ForwardChecking(X, D, C)
1. A ← [ ]
2. for k ← 1 to N
3.     D'k ← Dk
4. i ← 1
5. while 1 ≤ i ≤ N
6.     ai ← SelectValue-FC(D'i, A, C)
7.     if ai = null
8.         then Undo lookahead pruning done while choosing ai-1
9.             i ← i - 1 /* look for new value */
10.             A ← tail A
11.         else A ← ai : A
12.             i ← i + 1
13. return Reverse(A)

SelectValue-FC(D'i, A, C)
1. while D'i is not empty
2.     ai ← head D'i
3.     D'i ← tail D'i
4.     for k ← i + 1 to N
5.         for each b in D'k
6.             if not Consistent(b : ai : A)
7.                 delete b from D'k
8.     if no D'k is empty
9.         then return ai
10.     else for k ← i + 1 to N
11.         undo deletes in D'k
12. return null
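The same forward checking search can be written compactly in Python in a recursive style (an illustrative sketch, not a transcription of the pseudocode above; the consistent predicate, which tests a partial assignment given as a dictionary, is assumed to be supplied by the caller, and values are assumed to be comparable so that they can be tried in sorted order).

def forward_checking(variables, domains, consistent):
    # Backtracking search with forward checking. Returns a complete assignment or None.
    order = list(variables)
    current = {v: set(domains[v]) for v in order}        # the pruned copies D'k

    def solve(i, assignment):
        if i == len(order):
            return dict(assignment)                      # every variable has a value
        x = order[i]
        for a in sorted(current[x]):
            assignment[x] = a
            pruned = []                                  # deletions made on behalf of a
            empty = False
            for y in order[i + 1:]:
                for b in list(current[y]):
                    if not consistent({**assignment, y: b}):
                        current[y].discard(b)
                        pruned.append((y, b))
                if not current[y]:
                    empty = True                         # a future domain has emptied
                    break
            if not empty:
                result = solve(i + 1, assignment)
                if result is not None:
                    return result
            for (y, b) in pruned:                        # undo the lookahead pruning
                current[y].add(b)
            del assignment[x]
        return None

    return solve(0, {})

For the tiny CSP of Figure 12.17, the supplied consistent predicate would simply check every constrained pair in the partial assignment against the edges of the matching diagram.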
Algorithm FC does one pass over the future variables deleting values that are not going to
be consistent with the current assignment. We illustrate the algorithm by following its progress
on the matching diagram of a tiny example with five variables x1, ..., x5 processed in the given
order. Figure 12.17 shows the constraint graph and the matching diagram at the start.
Figure 12.17 A tiny CSP with five variables processed in the order x1, x2, x3, x4, and x5. The
constraint graph is shown at the top and the matching diagram at the bottom. Each domain
has three values, selected in alphabetical order.
FC begins by calling for a value for variable x1. SelectValue-FC picks the first value a
from x1. This value is connected to values e and f in x2 but not connected to value d. Consequently,
SelectValue-FC removes d from Dx2. Values deleted by SelectValue-FC are shown with
shaded circles in the figures that follow. In a similar manner, it also deletes values m and n from
the domain of x4. These are the only two variables which are related to x1. SelectValue-FC
does no other pruning while considering value a. Then FC moves to x2 and SelectValue-FC
tries the next available value e. This in turn deletes l from x4 and p and r from x5. The situation
at this point is shown in Figure 12.18.
At this point, the domain of variable x4 has become empty and the algorithm undoes the
deletions done with respect to x2 = e and backtracks to try another value.
When SelectValue-FC tries the next value x2 = f, it deletes p from variable x5 but
that still has q and r. It returns x2 = f to FC, which calls it again looking for a value for x3.
SelectValue-FC tries x3 = g and x3 = h but both delete l from Dx4. The next value k does not,
but it deletes q and r from the domain of x5, which now becomes empty. The situation is shown
in Figure 12.19. It has assigned values to the first three variables but does not even get to try
a value for the fourth.
Figure 12.18 When FC tries x1 = a and the first available value x2 = e, it discovers that the
domain of variable x4 has become empty, because all three values are inconsistent with
x1 = a and x2 = e. It will now undo deletion of values l, p, and r done while assigning x2 = e
and will try the next value x2 = f.
Figure 12.19 When FC tries the next value x2 = f after undeleting values l, p, and r, it
deletes p from x5. It next tries values g and h for x3 but both delete l in x4. SelectValue-FC next
tries x3 = k but that deletes q and r from x5, which becomes empty. There are no more values
to backtrack to in x2 and x3 and it backtracks to x1 and tries the value b.
SelectValue-FC reports failure to find a value for x3 and backtracks to x2 but there is no
other value available. It will next try x1 = b. The reader is encouraged to verify that FC will
backtrack because Dx5 will again become empty after assigning the last possible value
to x3. The algorithm next tries x1 = c and eventually finds a solution with matching diagram as
shown in Figure 12.20.
Figure 12.20 The matching diagram at the point when FC finds the solution <c, d, k, n, p>.
Note that values a, b, g, and h were not deleted because variables x1 and x3 do not have any
parents in the given ordering.
In the diagram in Figure 12.20 there are still some unconnected values in the domains of x1
and x3 that have not been deleted. This is because there were no parents who could have done
so. The next algorithm does a little bit more pruning of future domains.
DAC-Lookahead(X, D, C)
1. A ← [ ]
2. for k ← 1 to N
3.     D'k ← Dk
4. i ← 1
5. while 1 ≤ i ≤ N
6.     ai ← SelectValue-DAC(D'i, A, C)
7.     if ai = null
8.         then Undo lookahead pruning done while choosing ai-1
9.             i ← i - 1 /* look for new value */
10.             A ← tail A
11.         else A ← ai : A
12.             i ← i + 1
13. return Reverse(A)
SelectValue-DAC(D'i, A, C)
1. while D'i is not empty
2.     ai ← head D'i
3.     D'i ← tail D'i
4.     for k ← i + 1 to N
5.         for each b in D'k
6.             if not Consistent(b : ai : A)
7.                 delete b from D'k
8.     DAC({xi+1, ..., xN}, D', C)
9.     if no domain is empty
10.         return ai
11.     else for k ← i + 1 to N
12.         undo deletes in D'k
13. return null
The extra work done in DAC-Lookahead is these calls to DAC. We illustrate the effect
of these on the tiny problem in Figure 12.17. DAC-Lookahead too begins by selecting
x1 = a and deleting d from x2 and m, n from x4. As in the diagrams previously discussed, we
show these as shaded circles in Figure 12.21, but we have deleted the edges emanating from
them for clarity. Now DAC is called with the future variables x2, x3, x4, and x5 shown inside the
dashed oval. Both x2 and x3 are arc consistent with respect to x5 and no deletions happen. But
values e in x2 and g, h in x3 do not have supporting values in x4 and are deleted. The deletions
by DAC are shown with cross marks, and the situation is as shown in Figure 12.21.
In the situation in Figure 12.21, DAC-Lookahead next tries the value x2 = f. The future
variables are now only x3, x4, and x5 as shown inside the dashed oval in Figure 12.22. Forward
checking deletes the value p from the domain of x5. This has a cascading effect when DAC kicks
in with the value k being deleted from x3. The domain of x3 is now empty and DAC-Lookahead
retreats to x1 and will try the value b.
Algorithm FC had looked at all values in the domain of x3 before backtracking to x1 to try
the next value. Algorithm DAC-Lookahead retreated because it could not find a consistent
value for x2 without going to x3. The next algorithm AC-Lookahead finds that it is unable to
even assign x1 = a.
Figure 12.22 When algorithm DAC-Lookahead picks value f in x2, forward checking deletes
the value p in x5. The DAC component in SelectValue-DAC kicks in for the remaining
three variables x3, x4, and x5, shown in the dashed oval, resulting in Dx3 becoming empty.
DAC-Lookahead retreats to variable x1 without looking at x3.
Figure 12.23 Algorithm AC-Lookahead begins like FC with x1 = a deleting values d, m, and n.
After this variables x2, x3, x4, and x5 are made arc consistent. Values e in x2 without support in
x4, g and h in x3 also without support in x4, and p in x5 without support in x2 are the first to go,
shown by cross marks.
The matching diagram at this stage is shown in Figure 12.24 where the pruning process
continues after we have removed the pruned nodes from the figure. At this point, there are only
five values remaining in the four future variables. Value k in variable x3 is deleted because it has
no support in x5, and this results in l in x4 and q, r in x5 also being deleted, after which f goes
from x2.
Figure 12.24 Continuing from Figure 12.23 values k in x3 without support in x5, q and r in x5
without support in x3 will go next. At this point, the domains of x3 and x5 have become empty,
and x2 and x4 follow suit. Algorithm AC-Lookahead abandons the value a for x1 and moves
on to b.
At this point, algorithm AC-Lookahead abandons the value a it was considering for x1 and
moves on to the next value b. In practice, while implementing the algorithm one might exit as
soon as one domain becomes empty. This is not reflected in our algorithm, where one blanket
call is made to the algorithm for arc consistency.
The reader might have felt that AC-Lookahead perhaps does too much work. An
algorithm we have not mentioned here is FullLookahead, which does a little bit less. This is
like AC-Lookahead except that it does only one pass of calling Revise for every pair of future
variables.
We now turn our attention to informed or intelligent backtracking.
When search hits a dead end at a variable xi, chronological backtracking retreats to xi-1 even
though the conflict may have been caused by a much earlier assignment, and the work done trying
different values for xi-1 may be futile. Informed backtracking aims to
reduce such unnecessary search and jump back to a variable where a different value may allow
some value in xi.
We say that the assignment A = <a1, a2, ..., ai-1> is a conflict set with respect to xi if
we cannot find a value b in Di such that <a1, a2, ..., ai-1, b> is consistent. If no subset of
A = <a1, a2, ..., ai-1> is a conflict set with respect to xi, we say that A is a minimal conflict set.
We say that <a1, a2, ..., ai-1> is a dead end with respect to xi, and xi is a leaf dead end variable.
If in addition <a1, a2, ..., ai-1> cannot appear in any solution, we say that it is a no-good. It
is possible for an assignment to be a no-good but not be a dead end for any single variable.
A minimal no-good is one which does not have any subset that is a no-good.
When <a1, a2, ., ai-1> is a conflict set with respect to xi, then search can jump back to
any variable xj such that j < i - 1 in the quest for a solution. This process is called backjumping.
We say that the jump back is safe if there is no k between j and i - 1 such that a new value
for xk leads to a solution. Jumping back to a safe variable will thus not preclude any solution
or affect the completeness of the algorithm. We say that a safe backjump to a variable xj is
maximal if there is no m < j such that a backjump to xm is safe.
The question is: given an assignment <a1, a2, ..., ai-1> that is a dead end for a variable, what
is a safe and maximal backjump? We look at three well known algorithms for backjumping.
Each collects differing kinds of data, based on which it decides the variable that is safe to jump
back to. Each of them, however, arrives at a different answer to what is a maximal backjump
that is safe.
Figure 12.25 SelectValue-GBJ is unable to find a value (column name) for the sixth queen.
The numbers in row 6 are the numbers of the queens attacking that square; for each square,
it is the earliest attacking queen that counts. The value of latest6 begins with 1 for square a6,
becomes 3, the earliest queen attacking b6, and so on. The largest value is 4, from the square
d6. GBJ would backtrack to the fourth queen.
The assignment <a, c, e, b> for the first four queens is a no-good. One of the queens must
be relocated. It can only be queen 4, because a solution obtained by relocating it may still be
possible. Skipping queen 4 and relocating an earlier queen might miss a solution that relocating
queen 4 might yield. So jumping back to queen 4 is both maximal and safe, and queen 4 is the culprit.
The algorithm GBJ is described below. Observe that the variable latesti is a global variable,
initialized in the main program, set in the call to SelectValue-GBJ, and used in the main
program for jumping back when a null value is returned.
GBJ(X, D, C)
1. A ← [ ]
2. i ← 1
3. D'i ← Di
4. while 1 ≤ i ≤ N
5.     latesti ← 0
6.     ai ← SelectValue-GBJ(D'i, A, C)
7.     if ai = null
8.         then
9.             while i > latesti
10.                 i ← i - 1
11.                 A ← tail A
12.         else
13.             A ← ai : A
14.             i ← i + 1
15.             D'i ← Di
16. return Reverse(A)

SelectValue-GBJ(D'i, A, C)
1. while D'i is not empty
2.     ai ← head D'i
3.     D'i ← tail D'i
4.     consistent ← true
5.     k ← 1
6.     while k < i and consistent
7.         Ak ← take k A
8.         if k > latesti
9.             latesti ← k
10.         if not Consistent(ai : Ak)
11.             consistent ← false
12.         else
13.             k ← k + 1
14.     if consistent
15.         return ai
16. return null
When SelectValue-GBJ does return a value for xi, the variable latesti has a value i - 1
and GBJ moves on to xi+1. If at a later point GBJ were to backtrack to xi, and if that had no
value left in its domain, where would it backtrack to? This is known as an internal dead end.
The value of latesti is i - 1, and hence GBJ would just move one step back. The algorithm GBJ
thus does a safe and maximal backjump from a leaf dead end, but just moves one step back
from an internal dead end.
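Written recursively, GBJ can be sketched in Python as follows (an illustration of the idea rather than a line-by-line rendering of the pseudocode; the consistent predicate, which tests a list of (variable, value) pairs, is assumed to be supplied by the caller, and levels are numbered from 0 here).

def gbj(order, domains, consistent):
    # Gaschnig's backjumping: on a dead end, jump to the deepest level that took part
    # in any conflict while values for the current variable were being tested.
    n = len(order)

    def solve(i, assignment):
        # Returns (True, solution) or (False, level to jump back to).
        if i == n:
            return True, dict(assignment)
        latest = -1
        for a in domains[order[i]]:
            ok = True
            for k in range(i):                       # test prefixes of growing length
                latest = max(latest, k)
                if not consistent(assignment[:k + 1] + [(order[i], a)]):
                    ok = False
                    break
            if ok:
                found, result = solve(i + 1, assignment + [(order[i], a)])
                if found:
                    return True, result
                if result < i:                       # the jump target lies above this level
                    return False, result
        return False, latest                         # dead end: jump to the culprit level

    found, result = solve(0, [])
    return result if found else None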
Figure 12.26 shows the graph from Figure 12.6 with the alphabetic ordering. Of the four
nodes connected to node E, three are its ancestors in the given ordering. Of A, B, and D, the last
one is the parent, and if node E were to be a leaf dead end, then it would try a new value of its
parent D. This happens to be the last node visited, but that is not the case for nodes C and F,
which have only one ancestor each, which is also the parent. If C were to be a leaf dead end,
GBBJ (graph-based backjumping) would try A next, and if F were to be a leaf dead end, GBBJ
would try D next.
Figure 12.26 On the alphabetic ordering (width 3) of the graph from Figure 12.6, node E has
three ancestors A, B, and D, of which D is the parent. Nodes C and F have only one ancestor
each, which is the parent.
When GBBJ jumps back from a leaf dead end, it may encounter an internal dead end,
which is the node it has jumped back to but which does not have any consistent value left.
Where does it go next? Consider the case when node G were to be a leaf dead end and GBBJ
jumped back to its parent F. If there is no value left in F, should it jump to its parent D? That
would not be safe, because the original conflict in G might have been caused by E, its other
ancestor. Bypassing E would not be safe. GBBJ handles this and similar cases as follows.
We say that GBBJ invisits a node x when it visits it in the forward direction, that is, after a
node preceding it in the ordering. Starting from there, the current session of x includes all the
nodes it has tried after invisiting x till the time it finds x to be an internal dead end.
We define the set of relevant dead ends r(x) of a node x as follows: if x is a leaf dead end,
then r(x) is just {x}; if x is an internal dead end, then r(x) contains x along with the relevant
dead ends of the nodes from which the algorithm jumped back to x in the current session.
The set of induced ancestors of a node x is defined as follows. Let Y be the set of relevant dead
ends in the current session of x. Then the set of induced ancestors Ix(Y) of node x is the union of
all ancestors of nodes in Y which precede x in the ordering. The induced parent Px(Y) of node x
is the latest amongst the induced ancestors of x. When x is a dead end, algorithm GBBJ jumps
to the induced parent of a node x.
In Figure 12.26, when G turns out to be a leaf dead end, GBBJ tries its induced parent F
(which is also its parent). The induced parent of F is E because E is the latest induced ancestor
of F, an ancestor of G, which is a relevant dead end for F. So if F is an internal dead end, GBBJ
will try E next. If E is an internal dead end too, the algorithm will try D. The current session
of D includes E, F, and G, and hence the induced parent of D is B, which is where GBBJ will
jump back to if D were to be a dead end too.
Contrast this with the case when F is a leaf dead end. GBBJ will jump back to D the parent
of F. The relevant dead ends of D are F and D itself. The only induced ancestor of D is A.
Remember that E is not a relevant dead end. If D were to be an internal dead end, the algorithm
will now jump back to A.
The algorithm GBBJ shown below begins by computing the set of ancestors anc(x)
for each variable, given the ordering x1, x2, ..., xn. Whenever it moves forward to a node
xi, it initializes the set of induced ancestors Ii of xi to anc(xi) in Lines 6, 7 and 22, 23 of
Algorithm 12.17. The index p of the induced parent of the current node is always the latest node
in the set of induced ancestors li (Lines 7, 17, and 23). When it encounters a dead end xi, which
is when there is no consistent value for the current variable, it jumps back to the induced parent xj.
It updates the induced ancestors of xj and identifies the induced parent to which it would jump
back from xj if needed (Lines 12-17).
Algorithm 12.17. Algorithm GBBJ begins by computing the ancestors of each node. It
keeps track of the induced ancestors when it jumps back to a node. It always jumps back
to the induced parent on reaching a dead-end. The SelectValue function is the simple
one used in Backtracking.
GBBJ(X, D, C)
1. A ← [ ]
2. for k ← 1 to N
3.     compute anc(k), the set of ancestors of xk
4. i ← 1
5. D'i ← Di
6. Ii ← anc(i)
7. p ← latest node in Ii
8. while 1 ≤ i ≤ N
9.     ai ← SelectValue(D'i, A, C)
10.     if ai = null
11.         then
12.             iprev ← i
13.             while i > p
14.                 i ← i - 1
15.                 A ← tail A
16.             Ii ← Ii ∪ (Iiprev - {xi})
17.             p ← latest node in Ii
18.         else
19.             A ← ai : A
20.             i ← i + 1
21.             D'i ← Di
22.             Ii ← anc(i)
23.             p ← latest node in Ii
24. return Reverse(A)
SelectValue(D'i, A, C)
1. while D'i is not empty
2.     ai ← head D'i
The jumping back behaviour of GBBJ is the same whether it does so from an internal dead
end or a leaf dead end. This is an improvement over GBJ, which can jump back only from leaf
dead ends. But GBBJ is conservative and assumes that if a node is a parent in the constraint
graph, it must be the cause of the dead end, and jumps to that as an insurance. It will also not
miss out on any solution and its jumps are safe. But it may jump back less far than it could
because it does not look at the values in the domains that lead to the conflict. The next algorithm
makes use of both kinds of information, based on the values that conflict, and also the graph
topology.
Conflict directed backjumping (CDBJ) maintains a jumpback set Ji for each variable
(Lines 5-13 of SelectValue-CDBJ). The moment the value ai from the domain of xi conflicts
with Ak, it adds the variables in the earliest conflict to the jumpback set Ji. Note that more than
one constraint may simultaneously conflict with ai for a particular value of k. That is why one
needs to select the earliest one amongst them (Lines 11-13). Having done that, it moves on to
the next value in the domain D'i to test for consistency. For every value in D'i for which it finds a
conflict, it adds the earliest conflict to the jumpback set Ji. If SelectValue-CDBJ cannot find a
consistent value, then xi would be a leaf dead end and the parent program would jump back to the
latest variable in the jumpback set Ji. Observe that, like in algorithm GBJ, we have assumed that Ji
is a global data structure.
Algorithm 12.18. Algorithm CDBJ looks at actual conflicts of values a little bit like
GBJ, but constructs the earliest minimal conflict set in SelectValue-CDBJ when it spots a
conflict based on the earliest constraint that conflicts with a value in xi. Like GBBJ it can
combine the data gleaned from relevant dead ends in the main algorithm to be able to
jump back from internal dead ends as well.
CDBJ(X, D, C)
1. A ← [ ]
2. i ← 1
3. D'i ← Di
4. Ji ← {}
5. while 1 ≤ i ≤ N
6.     ai ← SelectValue-CDBJ(D'i, A, C)
7.     if ai = null
8.         then
9.             iprev ← i
10.             p ← latest node in Ji
11.             while i > p
12.                 i ← i - 1
13.                 A ← tail A
14.             Ji ← Ji ∪ (Jiprev - {xi})
15.         else
16.             A ← ai : A
17.             i ← i + 1
18.             D'i ← Di
19.             Ji ← {}
20. return Reverse(A)

SelectValue-CDBJ(D'i, A, C)
1. while D'i is not empty
2.     ai ← head D'i
3.     D'i ← tail D'i
4.     consistent ← true
5.     k ← 1
Observe that when CDBJ jumps back to variable xi (Lines 10-14 of CDBJ), it is still in the
current session of the variable, not yet having retreated from there. It might find a value for this
variable and go forth to the next variable and onwards, till it strikes another dead end and again
jumps back to xi from a relevant dead end. The merging of jumpback sets in Line 14 of CDBJ is
similar to the process of computing the induced ancestors in GBBJ which enables the algorithm
to jump back safely and maximally from the internal dead end as well.
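A recursive Python sketch of the idea behind CDBJ is given below (our own illustration, with binary constraints assumed so that the earliest conflicting prefix identifies a single culprit level; the consistent predicate tests a list of (variable, value) pairs and is supplied by the caller).

def cdbj(order, domains, consistent):
    # Conflict-directed backjumping: collect, for every rejected value, the earliest
    # level it conflicts with; on a dead end, jump to the latest level in that set.
    n = len(order)

    def solve(i, assignment):
        # Returns (True, solution) or (False, jumpback set of levels).
        if i == n:
            return True, dict(assignment)
        jumpback = set()
        for a in domains[order[i]]:
            conflict = None
            for k in range(i):                        # find the earliest conflicting prefix
                if not consistent(assignment[:k + 1] + [(order[i], a)]):
                    conflict = k
                    break
            if conflict is not None:
                jumpback.add(conflict)
                continue
            found, result = solve(i + 1, assignment + [(order[i], a)])
            if found:
                return True, result
            if i not in result:                       # this level is not implicated
                return False, result                  # so pass the jump further back
            jumpback |= result - {i}                  # merge the deeper jumpback set
        return False, jumpback

    found, result = solve(0, [])
    return result if found else None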
Summary
Constraints offer a uniform formalism for representing what an agent knows about the world.
The world is represented as a set of variables each with its own domain of values, along with
local constraints between subsets of variables. The fact that constraints are local obfuscates the
world view. It is not clear what combination of values for each variable is globally consistent.
The task of solving the CSP is to elucidate these values, which can be thought of as unearthing
the solution relation that prescribes all consistent combinations of values.
There are two approaches to strive for this clarity, and constraint processing allows for
the interleaving of both. On the one hand, there is constraint propagation or reasoning that
eliminates infeasible combinations of values, in the process adding new constraints to the
network. On the other is search, the basic strategy of problem solving by first principles.
We explored various combinations of techniques. This includes various levels of consistency
that can be enforced. We looked at algorithms for arc consistency and path consistency. We
also looked at the advantages of directional consistency and a little bit on ordering variables
for search. Then we started with the basic search algorithm Backtracking which essentially
searches through combinations of values for variables. This can be augmented with look ahead
methods that prune future variables, and by look back methods that make an informed choice
of which variable to jump back to when a dead end is encountered. In all cases a dead end
forces the search algorithm to retreat and undo some instantiations to try new combinations.
One aspect we have not studied is no-good learning. Here, every time an algorithm jumps back
to a culprit variable, the combination of conflicting values can be marked to be avoided in the
future. Clearly, no-good learning would be meaningful in large problems with huge search trees
that can benefit from such pruning.
We have not looked at many applications despite having said that CSPs present a very
attractive formulation in which all kinds of problems can be posed, and then solved using
some of the methods we have studied. We did mention in the chapter on planning that planning
can be posed as a CSP, and illustrated the idea by posing it as satisfiability. Another frequent
application is classroom scheduling and timetable generation which has its own group of
interested researchers. Also, we have confined ourselves to finite domain CSPs with the general
methods that they admit. We have not looked at specialized constraint solving problems and
methods like linear and integer programming, which have evolved as areas of study in themselves
and are beyond the scope of this book.
Exercises
1. Which of the following statements are true regarding solving a CSP?
a. Values must be assigned to ALL variables such that ALL constraints are satisfied.
b. Values must be assigned to at least SOME variables such that ALL constraints are
satisfied.
c. Values must be assigned to ALL variables such that at least SOME constraints are
satisfied.
d. Values must be assigned to at least SOME variables such that at least SOME constraints
are satisfied.
2. Pose the following cryptarithmetic problems as CSP:
TWO + TWO = ONE
JAB + JAB = FREE
SEND + MORE = MONEY
3. Consider the following constraint network R = <{x1, x2, x3}, {D1, D2, D3}, {C}> where
D1 = D2 = D3 = {a, b, c} and C = <{x1, x2, x3}, {<a, a, b>, <a, b, b>, <b, a, c>,
<b, b, b>}>. How many solutions exist?
4. Given a constraint satisfaction problem with two variables x and y whose domains are
Dx = {1,2,3}, Dy = {1,2,3}, and constraint x < y, what happens to Dx and Dy after the
Revise(x,y) algorithm is called?
a. Both Dx and Dy remain the same as before
b. Dx = {1,2} and Dy = {1,2,3}
c. Dx = {2,3} and Dy = {1,2}
d. Dx = {} and Dy = {1,2,3}
5. Draw the search tree explored by algorithm Backtracking for the 5-queens problem till
it finds the first solution.
6. What is the best case complexity of Revise((X), Y) when the size of each domain is k?
7. Draw the matching diagram for the network in Figure 12.1 after it has been made arc
consistent. Has the network become backtrack free?
8. What does one conclude when the domain of some variable X while computing arc
consistency becomes empty?
9. Try out different orderings for the two networks in Figure 12.8 and investigate how
Backtracking performs.
10. Is the following object a trihedral object? Label the edges, and explain your answer.
11. Draw trihedral objects to illustrate all the vertex labels shown in Figure 12.10.
12. [Baskaran] Label the edges in the following figure and identify each vertex type.
13. Given the CSP on variables X = {x, y, z} and the relations Rxy, Rxz, Ryz depicted in the
matching diagram below, draw the matching diagram after the CSP has been made arc
consistent. State the resulting CSP.
14. Consider the following CSP for a map colouring problem. Answer the questions that
follow.
15. The following figure depicts the domains of the variables for the given 4 x 4 Sudoku
problem. Note that some cells have only one value in their domain. Show the order in
which the dynamic variable ordering with forward checking (DVFC) algorithm will fill in the
cells.
16. [Baskaran] The following figure shows a constraint graph of a binary CSP and a part of the
matching diagram. When a pair of variables (like X1 and X3) do not have a constraint in the
constraint graph then assume a universal relation in the matching diagram.
The FC algorithm begins by assigning X1 = a. What are the next four values assigned to
variables by the FC algorithm? List the values as a comma separated list in the order they
are assigned.
17. What is the first solution found by the FC algorithm for the above problem?
18. The following figure shows the constraint graph of a binary CSP on the left and a part of the
matching diagram on the right. Please assume a universal relation in the matching diagram
where there is no constraint between variables in the constraint graph. The variables, and
their values, are to be considered in alphabetical order.
Algorithm FC is about to begin by assigning X1 = a. What are the next six values assigned
to variables? Draw the matching diagram at this point.
What is the first solution found by the algorithm?
Algorithm and Pseudocode Conventions
The algorithms presented in this book assume eager evaluation. The values of primitive types
(integers, reals, strings) are passed by value, and tuples, lists, arrays, sets, stacks, queues, etc.,
are passed by reference, similar to how Java treats primitive values and objects.
The data structures (container types) like sets, arrays, stacks and queues, and the operations
on those structures carry their usual meaning, and their usages in the algorithms are self
explanatory.
Tuple
A tuple is an ordered collection of fixed number of elements, where each element may be of a
different type. A tuple is represented as a comma separated sequence of elements, surrounded
by parenthesis.
TUPLE ← (ELEMENT1, ELEMENT2, ..., ELEMENTk)
A tuple of two elements is called a pair, for example, (S, null), ((A, S), 1), (S, [A, B])
are pairs. And a tuple of three elements is called a triple, for example, (S, null, 0), (A, S, 1),
(S, A, B) are triples. A tuple of k elements is called a k-tuple, for example, (S, MAX, -∞, ∞),
(A, MIN, LIVE, ∞, 42).
Note: parentheses are also used to indicate precedence, like in (3+1) * 4 or in (1 : (4 : [ ]));
their usage will be clear from the context.
List
A list is an ordered collection of an arbitrary number of elements of the same type. A list is read
from left to right and new elements are added at the left end. Lists are constructed recursively
like in Haskell.
The ‘:’ operator is a list constructor; it takes an element (HEAD) and a list (TAIL) and
constructs a new list (HEAD : TAIL) similar to cons(HEAD, TAIL) in LISP. Using head:tail
notation, a list such as [3, 1, 4] is recursively constructed from (3 : (1 : (4 : [ ]))), similar to
cons(3, cons(1, cons(4, nil))) in LISP. The empty list [ ] has no head or tail.
In the head:tail representation, elements are always added to and removed from the head
(left end) of a list and in that respect a list behaves like a stack.
To reduce clutter, we allow the list (3 : (1 : (4 : [ ]))) to be expressed in any of the following
equivalent forms: 3 : 1 : 4 : [ ], 3 : 1 : [4], 3 : [1, 4], or [3, 1, 4].
Assignment
The assignment statements take the general form
PATTERN ^ EXPRESSION
where the values contained in the EXPRESSION are assigned to the variables contained in
the PATTERN. A PATTERN is an expression that is constructed out of variables and underscores.
An underscore is a placeholder for an unnamed variable, whose value is of no interest. Such
values can be called ‘don’t care values’. A few examples of assignment statements and the
resulting assignments are shown in the table below.
x : y ← 3 : 1 : 4 : [ ]          x ← 3; y ← 1 : 4 : [ ]
x : y : z ← [3, 1, 4]            x ← 3; y ← 1; z ← [4]
(x, y, _) ← (S, LIVE, 125)       x ← S; y ← LIVE and the value 125 is ignored.
(x, _, _) ← (S, LIVE, 125)       x ← S and the remaining values are ignored.
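For readers more comfortable with a mainstream language, the head:tail notation and pattern assignment map quite directly onto Python sequence unpacking; the snippet below is our own illustration and is not part of the conventions themselves.

xs = [3, 1, 4]                    # the list written 3 : 1 : 4 : [ ] in the pseudocode

head, *tail = xs                  # x : y ← [3, 1, 4] gives x = 3 and y = [1, 4]
x, y, *z = xs                     # x : y : z ← [3, 1, 4] gives x = 3, y = 1, z = [4]
s, live, _ = ('S', 'LIVE', 125)   # (x, y, _) ← (S, LIVE, 125); the value 125 is ignored
ys = [7] + xs                     # 7 : [3, 1, 4] constructs a new list with head 7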
Tests
The equality, inequality, and ‘is’ tests used in the algorithms are standard ones; for the most
part they are self explanatory. Here, we describe equality testing on structures and context
dependent tests.
The equality tests are of the form
EXPRESSION1 = EXPRESSION2
where two expressions are equal if their structures match and the corresponding values in
the respective structures also match, for example:
(A, S, _) = (A, S, 2) is true because the structures match, the elements match
and the underscore matches any value.
There is also a context dependent test of the form EXPRESSION1 better than EXPRESSION2,
which tests whether the value generated by one expression is better than the value
generated by another expression. The notion of what is better will be clear from the context. For
maximization problems, ‘better than’ means ‘greater than’ and for minimization problems,
it means ‘less than’.
Built-in Functions
The standard functions on sets, tuples, arrays, lists, stacks, and queues are treated as built-in
functions and are typeset like keywords and invoked like commands, for example:
In some cases, a chain of built-in functions may be applied to a value; such chains of calls
are evaluated from right to left. For example, “second head tail [(A, 1), (N, 3), (D, 4)]” is
evaluated in the following manner: tail yields [(N, 3), (D, 4)], head of that yields (N, 3), and
second of that yields 3.
Procedures
We use indentation to capture the scope (body, block, block structure) of procedures and control
statements.
A procedure has a name followed by zero or more parameters and a body, and, optionally,
one or more nested procedures. A procedure may accept zero or more inputs and may produce
an output.
ReconstructPath(nodePair, CLOSED)
1     SkipTo(parent, nodePairs)
2         if parent = first head nodePairs
3             then return head nodePairs
4             else return SkipTo(parent, tail nodePairs)
ReconstructPath is a procedure that takes two inputs and returns one output; its scope
spans lines 1 through 11; it contains a nested procedure (SkipTo, lines 1 through 4) and the body
(lines 5 through 11). ReconstructPath will start executing from line 5.
The level of abstraction (the amount of detail) in a procedure can vastly vary, for example,
the GamePlay procedure describes the general idea of game playing at a much higher
abstraction than what ReconstructPath does for path reconstruction.
GamePlay lacks detail; it does not tell us what a move is or how to make a move or when
the play ends or how the winner is decided, etc. GamePlay is a high-level algorithm.
GamePlay(max)
1 while game not over
2 call k-ply search
3 make move
4 get min’s move
Control Statements
The control statements such as if-then, if-then-else, for loop, for-each loop, while loop, repeat
loop, and repeat-until loop carry their usual meaning. And we use indentation to capture the
scope (body, block, block structure) of control statements.
1 if test
2 block
Execute the block only if the test succeeds, otherwise skip that block.
1 if test
2 then block1
3 else block2
Execute block1 if the test succeeds, otherwise execute block2. In each pass, only one of the
two blocks is executed.
A while loop represents a computation that essentially executes an arbitrarily long sequence
of identical if-then statements, and the computation (the loop) ends when the test fails. In a
while loop, the body may not execute, or execute one or more times.
1 repeat                    body; if test fails then execute body;
2     body                  if test fails then execute body; ...
3 until test
A repeat-until loop represents a computation that essentially executes the body followed
by an arbitrarily long sequence of identical if-then statements, and the computation (the loop)
ends when the test succeeds. In a repeat-until loop, the body will execute one or more times.
A for loop represents a sequence of computations that essentially sets the loop variable to
a number and executes the body, and does so for each integer from 1 to N.
1 for each x in list        x ← e1; body; x ← e2; body; ...; x ← ek; body;
2     body                  where list = [e1, e2, ..., ek]
A for-each loop represents a sequence of computations that essentially sets the loop variable
to an element in the list and executes the body, and does so for each element in the list.
The for-each loop can iterate over lists, sets, arrays, stacks, and queues.
A loop with a guarded body is a special case; it iterates like a regular loop, but in each
iteration the body is executed only if the test succeeds. The guarded body may occur in a
for-each loop, while loop, repeat loop, and repeat-until loop.
References
Agarwala, R, D. L. Applegate, D. Maglott, G. D. Schuler, and A. A. Schaffler. 2000. ‘A Fast and Scalable
Radiation Hybrid Map Construction and Integration Strategy’. Genome Research 10: 350-64.
Aho, Alfred V., John E. Hopcroft, and Jeffrey D. Ullman. 1974. The Design and Analysis of Computer
Algorithms, Addison-Wesley Series. Computer Science and Information Processing. Boston:
Addison-Wesley.
Allen, James F. 1983. ‘Maintaining Knowledge about Temporal Intervals’. Communications of the
ACM 26 (11): 832-43.
———. 1991. ‘Temporal Reasoning and Planning’. In Reasoning about Plans, edited by James F. Allen,
Henry A. Kautz, Richard N. Pelavin, and Josh D. Tenenberg. San Mateo: Morgan Kaufmann.
Andrews, Robin. 2019. ‘Volcano Space Robots Are Prepping for a Wild Mission to Jupiter’. The Wired,
Science. Accessed October 2022. https://www.wired.co.uk/article/nasa-submarines-searching.
Antoniou, Grigoris and Frank van Harmelen. 2008. A Semantic Web Primer. 2nd ed. Cambridge: MIT
Press.
Applegate, D., W. Cook, and A. Rohe. 2003. ‘Chained Lin-Kernighan for Large Traveling Salesman
Problems’. INFORMS Journal of Computing 15 (1): 82-92.
Applegate, David L., Robert E. Bixby, Vasek Chvatal, and William J. Cook. 2007. The Traveling Salesman
Problem: A Computational Study. Princeton Series in Applied Mathematics. Princeton: Princeton
University Press.
Arora, S. 1998. ‘Polynomial Time Approximation Schemes for Euclidean Traveling Salesman and Other
Geometric Problems’. Journal of the ACM 45: 753-82.
Baader, Franz and Ulrike Sattler. 2001. ‘An Overview of Tableau Algorithms for Description Logics’.
Studia Logica: An International Journal for Symbolic Logic 69 (1): 5-40. Analytic Tableaux and
Related Methods. Part 1: Modal Logics.
Belgum, Erik, Curtis Roads, Joel Chadabe, T. Emile Tobenfeld and Laurie Spiegel. 1988. ‘A Turing Test
for “Musical Intelligence”?’ Computer Music Journal 12 (4): 7-9. doi:10.2307/3680146.
Berliner, Hans J. 1987. ‘Pennsylvania Chess Championship Report-HITECH Wins Chess Tourney’.
AI Magazine 8 (4): 101-02.
Bland, R.G., and D.F. Shallcross. 1984. ‘Large Traveling Salesman Problems Arising from Experiments
in X-Ray Crystallography: A Preliminary Report on Computation’. Operations Research Letters 8:
125-28.
Blum, A., and Furst, M. 1997. ‘Fast Planning through Planning Graph Analysis’. Artificial
Intelligence 90 (1-2): 281-300.
Bolander, Thomas. 2017. ‘A Gentle Introduction to Epistemic Planning: The DEL Approach’. In
Proceedings of M4M@ICLA 2017, 1-22.
Bonet, Blai, and Hector Geffner. 2001a. ‘Planning as Heuristic Search’. Artificial Intelligence
129 (1-2): 5-33.
———. 2001b. ‘Heuristic Search Planner 2.0’. AI Magazine 22 (3): 77-80.
449
450 | References
Brachman, Ronald J. and Hector J. Levesque. 2004. Knowledge Representation and Reasoning.
Burlington: Morgan Kaufmann.
Browne, Cameron B., Edward Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp
Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. 2012.
‘A Survey of Monte Carlo Tree Search Methods’. IEEE Transactions on Computational Intelligence
and AI in Games 4 (1): 1-43. doi:10.1109/TCIAIG.2012.2186810.
Brownston, Lee, Robert Farrell, Elaine Kant, and Nancy Martin. 1985. Programming Expert Systems in
OPS5. Boston: Addison-Wesley.
Bryce, Daniel and Subbarao Kambhampati. 2007. ‘A Tutorial on Planning Graph Based Reachability
Heuristics’. AI Magazine 28 (1): 4.
Buchanan, Bruce G. and Edward H. Shortliffe. 1984. Rule Based Expert Systems: The Mycin Experiments
of the Stanford Heuristic Programming Project. Addison-Wesley Series in Artificial Intelligence.
Boston: Addison-Wesley.
Bylander, Tom. 1994. ‘The Computational Complexity of Propositional STRIPS Planning’. Artificial
Intelligence 69 (1-2): 165-204.
Cain, Stephanie. 2022. ‘A.I.-Driven Robots are Cooking your Dinner’. Fortune, October. Accessed
December 2022. https://fortune.com/2022/10/18/tech-forward-everyday-ai-robots-pizza/.
Caltabiano, Daniele and Giovanni Muscato. 2005. ‘A Robotic System for Volcano Exploration’. In
Cutting Edge Robotics, edited by Vedran Kordic, Aleksandar Lazinica and Munir Merdan. I-Tech.
Campbell, A. N., V. F. Hollister, R. O. Duda, and P. E. Hart. 1982. ‘Recognition of a Hidden Mineral
Deposit by an Artificial Intelligence Program’. Science 217: 927-29.
Campbell, Murray, A. Joseph Hoane Jr., and Feng-hsiung Hsu. 2002. ‘Deep Blue’. Artificial Intelligence
134 (1-2): 57-83.
Charniak, Eugene and Drew McDermott. 1985. Introduction to Artificial Intelligence. Boston.
Addison-Wesley.
Chien, Steve, Anthony Barrett, Tara Estlin, and Gregg Rabideau. 2000. ‘A Comparison of Coordinated
Planning Methods for Cooperating Rovers’. In Proceedings of the Fourth International Conference
on Autonomous Agents (Agents 2000), Barcelona, Spain, 100101, June 2000. doi:10.1145/336595.
337057.
Clowes, M. B. 1971. ‘On Seeing Things’. Artificial Intelligence 2: 79-116.
Cobb, William S. 1997. ‘The Game of Go: An Unexpected Path to Enlightenment’. The Eastern Buddhist
(New Series) 30 (2): 199-213. https://www.jstor.org/stable/44362178.
Cohen, Paul. 2016. ‘Harold Cohen and AARON’. AI Magazine 37 (4).
Coles, Andrew, Maria Fox, Derek Long, and Amanda Smith. 2008. ‘Planning with Problems Requiring
Temporal Coordination’. In Proceedings of the Twenty-Third AAAI Conference on Artificial
Intelligence, AAAI 2008, edited by Dieter Fox, Carla P. Gomes, 892-97. Palo Alto: AAAI Press.
Colorni, A., M. Dorigo, and V. Maniezzo. 1991. ‘Distributed Optimisation by Ant Colonies’. In
Proceedings of ECAL’91, European Conference on Artificial Life. Amsterdam: Elsevier Publishing.
Cook, S. and D. Mitchell. 1997. ‘Finding Hard Instances of the Satisfiability Problem: A Survey’. In
Proceedings of the DIMACS Workshop on Satisfiability Problems, 11-13. Providence: American
Mathematical Society.
Cope, David. 2004. Virtual Music: Computer Synthesis of Musical Style. Cambridge: MIT Press.
Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2001. Introduction to
Algorithms. 2nd ed. Cambridge: The MIT Press.
Cushing, William, Subbarao Kambhampati, Mausam, and Daniel S. Weld. 2007. ‘When Is Temporal
Planning Really Temporal?’ In IJCAI 2007, Proceedings of the 20th International Joint Conference
References | 451
on Artificial Intelligence, edited by Manuela M. Veloso, 1852-59. San Francisco, CA: Morgan
Kaufmann Publishers Inc.
Cushing, William Albemarle. 2012. ‘When Is Temporal Planning Really Temporal?’ PhD diss., Arizona
State University. Accessed October 2022. https://rakaposhi.eas.asu.edu/cushing-dissertation.pdf.
Dantzig, G. B., R. Fulkerson, and S. M. Johnson. 1954. ‘Solution of a Large-scale Traveling Salesman
Problem’. Operations Research 2: 393-410.
Darwiche, Adnan. 2018. ‘Human-Level Intelligence or Animal-Like Abilities’. CACM 61 (10): 56-57.
Davis, Ernest, Leora Morgenstern, and Charles L. Ortiz, Jr. 2017. ‘The First Winograd Schema Challenge
at IJCAI-16’. AI Magazine 38 (3). doi:10.1609/aimag.v38i4.2734.
Dawkins, Richard. 1996. The Blind Watchmaker: Why the Evidence of Evolution Reveals a Universe
Without Design. New York: W. W. Norton & Company.
Dawkins, Richard. 2006. Climbing Mount Improbable. London: Penguin UK.
Dechter, Rina. 2003. Constraint Processing. Burlington: Morgan Kaufmann.
Dechter, Rina and Robert Mateescu. 2007. ‘And/or Search Spaces for Graphical Models’. Artificial
Intelligence 171 (2): 73-106.
Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. ‘Maximum Likelihood from Incomplete Data via
the EM Algorithm’. Journal of the Royal Statistical Society, Series B 39 (1): 1-38. JSTOR 2984875.
MR 0501537.
Dijkstra, E. W. 1959. ‘A Note on Two Problems in Connexion with Graphs’. Numerische Mathematik
1: 269-71. https://link.springer.com/article/10.1007/BF01386390.
Do, Minh Binh, and Subbarao Kambhampati. 2001. ‘Planning as Constraint Satisfaction: Solving the
Planning Graph by Compiling it into CSP’. Artificial Intelligence 132 (2): 151-82.
———. 2003. ‘SAPA: A Multi-objective Metric Temporal Planner’. Journal of Artificial Intelligence
Research 20: 155-94.
Dorigo, Marco and Thomas Stutzle. 2004. Ant Colony Optimisation. Cambridge: Bradford Books,
MIT Press.
Dudani, S. A. 1976. ‘The Distance-weighted k-Nearest Neighbour Rule’. IEEE Transactions on System,
Man, and Cybernetics SMC-6: 325-27.
Edelkamp, S., and J. Hoffmann. 2004. ‘PDDL2.2: The Language for the Classical Part of IPC-4’. In
Competition Proceeding Hand-Out, 14th International Conference on Automated Planning and
Scheduling, June 3-7, 2004, Whistler, British Columbia, Canada. Menlo Park, CA: The AAAI Press.
Engesser, Thorsten, Thomas Bolander, Robert Mattmuller, and Bernhard Nebel. 2017. ‘Cooperative
Epistemic Multi-Agent Planning for Implicit Coordination’. In Proceedings of M4M@ICLA 2017,
75-90. https://icla.cse.iitk.ac.in/M4M/ .
Fagin, Ronald, Joseph Y. Halpern, Yoram Moses, and Moshe Vardi. 2004. Reasoning About Knowledge.
Cambridge: MIT Press.
Fikes, R. E., and N. Nilsson. 1971. ‘STRIPS: A New Approach to the Application of Theorem Proving to
Problem Solving’. Artificial Intelligence 5 (2): 189-208.
Fitting, Melvin. 2013. First-Order Logic and Automated Theorem Proving. 2nd ed. Springer.
Fix, Evelyn and J. L. Hodges, Jr. 1989. ‘Discriminatory Analysis. Nonparametric Discrimination:
Consistency Properties’. International Statistical Review/Revue Internationale de Statistique 57 (3):
238-47.
Forgy, Charles. 1981. OPS5 User’s Manual, Technical Report CMU-CS-81-135. Pittsburgh: Carnegie
Mellon University.
———. 1982. ‘Rete: A Fast Algorithm for the Many Patterns/Many Objects Match Problem’. Artificial
Intelligence 19 (1): 17-37.
452 | References
Fox, Douglas. 2016. ‘What Sparked the Cambrian Explosion?’ Nature 530: 268-70.
Fox, M. and D. Long. ‘PDDL2.1: An Extension to PDDL for Expressing Temporal Planning Domains’.
Journal of Artificial Intelligence Research (special issue on the 3rd International Planning
Competition) 20: 61-124.
Frege, Gottlob. 1967. Begriffsschrift, eine der arithmetischen nachgebildete Formelsprache des reinen
Denkens, Halle a. S.: Louis Nebert; translated as Concept Script, a Formal Language ofPure Thought
Modelled upon That ofArithmetic by S. Bauer-Mengelberg. In From Frege to Godel: A Source Book
in Mathematical Logic, 1879-1931, edited by J. van Heijenoort. Cambridge, MA: Harvard University
Press.
Freuder, E. C. 1997. ‘In Pursuit of the Holy Grail’. Constraints 2: 57-61. Accessed December 2022.
doi:10.1023/A:1009749006768.
Gardner, Martin. 1970. ‘The Fantastic Combinations of John Conway’s New Solitaire Game “Life”.’
Scientific American 223: 120-23.
Gamow, George. 1989. One, Two, Three ... Infinity: Facts and Speculations of Science (Dover Books on
Mathematics). New ed. Mineola: Dover Publications.
Gerevini, A. and D. Long. 2005. Plan Constraints and Preferences in PDDL3. Technical Report,
Department of Electronics for Automation. Brescia: University of Brescia.
Gerrig, Richard J. and Mahzarin R. Banaji. 1994. ‘Language and Thought’. In Thinking and Problem
Solving (2nd ed.), edited by Robert J. Sternberg, vol. 2 in Handbook of Perception and Cognition.
Academic Press.
Ghallab, M., D. Nau, and P. Traverso. 2004. Automated Planning, Theory and Practice. Amsterdam;
Burlington: Elsevier, Morgan Kaufmann Publishers.
Ginsberg, M. L. 1999. ‘GIB: Steps Toward an Expert-Level Bridge-Playing Program’. In Proceedings of
the Sixteenth International Joint Conference on Artificial Intelligence, 584-93. Stockholm: Morgan
Kaufmann.
Gleick, James. 1987. Chaos: Making a New Science. New York: Viking.
Gleiser, Marcelo. 2022. ‘Not Just Light: Everything Is a Wave, Including You’. Big Think 13 (8). Accessed
December 2022. https://bigthink.com/13-8/wave-particle-duality-matter/.
Glover, Fred. 1986. ‘Future Paths for Integer Programming and Links to Artificial Intelligence’.
Computers and Operations Research 13 (5): 533-49. doi:10.1016/0305- 0548(86)90048-1.
Goldberg, David E. 1989. Genetic Algorithms in Search, Optimisation, and Machine Learning. Reading,
MA: Addison-Wesley.
Gonzalez, Rafael C. and Michael G. Thomason. 1978. Syntactic Pattern Recognition. Reading, MA:
Addison-Wesley Pub. Co., Advanced Book Program.
Grand, Steve. 2001. Creation: Life and How to Make It. Cambridge, MA: Harvard University Press.
Green, Cordell. 1969. ‘Application of Theorem Proving to Problem Solving’. In IJCAI’69: Proceedings of
the 1st International Joint Conference on Artificial Intelligence, 219-39, May 7-9, 1969, Washington,
DC. San Francisco, CA: Morgan Kaufmann Publishers Inc.
Gruber, T. R. 1993. ‘A Translation Approach to Portable Ontologies’. Knowledge Acquisition 5 (2):
199-220.
Guarino, N., D. Oberle, and S. Staab. 2009. ‘What Is an Ontology?’ In Handbook on Ontologies.
International Handbooks on Information Systems, edited by International Handbooks on Information
Systems. Berlin, Heidelberg: Springer. doi:10.1007/978-3-540-92673-3_0.
Haas, A. 1987. ‘The Case for Domain-Specific Frame Axioms’. In The Frame Problem in Artificial
Intelligence, Proceedings of the 1987 Workshop, edited by F.M. Brown. Burlington: Morgan
Kaufmann Publishers.
References | 453
Hadamard, Jacques. 1945. The Mathematician’s Mind: The Psychology of Invention in the Mathematical
Field. Princeton, NJ: Princeton University Press. Reprinted, Dover Publications Inc., 2003.
Haken, Armin. 1985. ‘The Intractability of Resolution’. Theoretical Computer Science 39: 297-308.
Hart, P., N. Nilsson, and B. Raphael. 1968. ‘A Formal Basis of Heuristic Determination of Minimum Cost
Paths’. IEEE Trans. System Science and Cybernetics, SSC 4(2): 100107.
Haugeland, John. 1985. Artificial Intelligence: The Very Idea, A Bradford Book. Cambridge: The
MIT Press.
Hawking, Stephen. 2003. On the Shoulders of Giants: The Great Works of Physics and Astronomy.
London: Penguin UK.
Hayes, Patrick. ‘What the Frame Problem Is and Isn’t’. In The Robot’s Dilemma: The Frame Problem in
Artificial Intelligence, edited by Z. W. Pylyshyn. Norwood, NJ: Ablex Publ.
Hayes-Roth, Frederick, Donald A. Waterman, and Douglas B. Lenat. 1983. Building Expert Systems.
Boston, MA: Addison-Wesley Longman Publishing Co.
Hoffmann, Jorg and Bernhard Nebel. 2001. ‘The FF Planning System: Fast Plan Generation Through
Heuristic Search’. Journal of Artificial Intelligence Research (JAIR) 14: 253-302. doi:10.1613/
iair.855.
Hofstadter, Douglas. 1979. Godel, Escher, Bach: An Eternal Golden Braid. New York: Basic Books.
———. 1996. ‘Number Numbness.’ In Metamagical Themas: Questing for the Essence of Mind and
Pattern, ch. 6. Revised ed. New York: Basic Books.
———. 2009. ‘Essay in the Style of Douglas Hofstadter’. AI Magazine 30 (3): 82-88. doi:10.1609/
aimag.v30i3.2256.
Holland, John H. 1975. Adaptation in Natural and Artificial Systems: An Introductory Analysis with
Applications to Biology, Control and Artificial Intelligence. Ann Arbor: The University of Michigan
Press.
———. 1999. Emergence: From Chaos to Order. New York: Basic Books.
Hoos, Holger H. and Thomas Stutzle. 2004. Stochastic Local Search: Foundations and Applications. The
Morgan Kaufmann Series in Artificial Intelligence. Burlington: Morgan Kaufmann.
Horn, Alfred. 1951. ‘On Sentences Which Are True of Direct Unions of Algebras’. J. Symbolic Logic
16(1): 14-21.
Huffman D. A. 1971. ‘Impossible Object as Nonsense Sentences’. In Machine Intelligence, vol. 6, edited
by B. Meltzer and B. Michie, 295-323. New York: American Elsevier.
Ito, Joi and Jeff Howe. 2017. Emergent Systems Are Changing the Way We Think. Blogpost on the Aspen
Institute website, 30 January. Accessed November 2021. https://www.aspeninstitute.org/blog-posts
/emergent-systems-changing-way-think/.
Jackson, Peter. 1986. Introduction to Expert Systems. 1st ed. Boston: Addison-Wesley.
Jimenez, Sergio, Tomas De la Rosa, Susana Fernandez, Fernando Fernandez, and Daniel Borrajo. 2012.
‘A Review of Machine Learning for Automated Planning’. The Knowledge Engineering Review
27 (4): 433-67. doi:10.1017/S026988891200001X.
Johnson, David S. 1990. ‘Local Optimization and the Traveling Salesman Problem’. In Automata,
Languages and Programming, 17th International Colloquium, ICALP90, Proceedings, Lecture
Notes in Computer Science 443, edited by Mike Paterson, 446-61. Berlin: Springer.
Johnson, Steven. 2002. Emergence: The Connected Lives of Ants, Brains, Cities, and Software.
New York: Scribner. Reprint edition.
Kambhampati, S. 1993. ‘On the Utility of Systematicity: Understanding the Tradeoffs Between
Redundancy and Commitment in Partial-order Planning’. In Proceedings of IJCAI-93, 1380-85. San
Francisco, CA: Morgan Kaufmann Publishers Inc.
454 | References
Kautz, Henry and Bart Selman. 1992. ‘Planning as Satisfiability’. In Proceedings of the 10th European
Conference on Artificial Intelligence, 359-63. New York: Wiley.
Kautz, Henry, David McAllester, and Bart Selman. 1996. ‘Encoding Plans in Propositional Logic’.
In Proceedings of the 4th International Conference on Knowledge Representation and Reasoning
(KR-96), 374-85. Burlington: Morgan Kaufmann.
Kautz, Henry A. and Bart Selman. 1999. ‘Unifying SAT-based and Graph-based Planning’. In Proceedings
of the Sixteenth International Joint Conference on Artificial Intelligence, IJCAI99, edited by Thomas
Dean, vol. 2, 318-25. Burlington: Morgan Kaufmann.
Kelsey, Hugh. 1995. Logical Bridge Play. 2nd ed. London: Orion.
Khemani, Deepak. 1989. ‘Theme Based Planning in an Uncertain Environment’. PhD thesis, Department
of CS&E, IIT Bombay.
———. 1994. ‘Planning with Thematic Actions’. In Proceedings of the Second International Conference
on Artificial Intelligence Planning Systems, June 13-15, University of Chicago, Chicago, Illinois,
287-92. AAAI, AIPS.
———. 2013. A First Course in Artificial Intelligence. New York: McGraw-Hill Education India.
Khemani, Deepak, Radhika B. Selvamani, Ananda Rabi Dhar, and S. M. Michael. 2002. ‘InfoFrax:
CBR in Fused Cast Refractory Manufacture’. In Advances in Case-Based Reasoning, 6th European
Conference, ECCBR 2002, edited by S. Craw and A. Preece, 560. Aberdeen, Scotland, UK,
4-7 September. Proceedings, Springer LNAI 2416.
Khemani, D. and S. Singh. 2018. ‘Contract Bridge: Multi-Agent Adversarial Planning in an Uncertain
Environment’. In Sixth Annual Conference on Advances in Cognitive Systems, 161-80. Accessed
September 2018. www.cogsys.org/papers/ACSvol6/posters/Khemani.pdf.
Korf, Richard E. 1985. Learning to Solve Problems by Searching for Macro-Operators. Research Notes
in Artificial Intelligence, vol. 5. London: Pitman Publishing.
Koehler, Jana, Bernhard Nebel, Jorg Hoffmann, and Yannis Dimopoulos. 1997. ‘Extending Planning
Graphs to an ADL Subset’. In Recent Advances in AI Planning, 4th European Conference on Planning,
ECP’97, September 24-26, 1997, Toulouse, France, Proceedings, Lecture Notes in Computer Science
1348, edited by Sam Steel and Rachid Alam, 273-85. Springer.
Korf, Richard. 1985a. ‘Depth First Iterative Deepening: An Optimal Admissible Tree Search Algorithm’.
Artificial Intelligence 27: 97-109.
———. 1985b. Learning to Solve Problems by Searching for Macro-Operators. Research Notes in
Artificial Intelligence, vol. 5. Pitman Publishing, 1985.
———. 1993. ‘Linear-Space Best-First Search’. Artificial Intelligence 62: 41-78.
Korf, R. and W. Zhang. 2000. ‘Divide-and-Conquer Frontier Search Applied to Optimal Sequence
Alignment’. In Proceedings of the 17th National Conference on Artificial Intelligence (AAAI-00),
July 30-August 3, 2000, Austin, TX, 910-16. AAAI Press.
Korf, R., W. Zhang, I. Thayer, and H. Hohwald. 2005. ‘Frontier Search’. Journal of the ACM 52 (5):
715-48.
Korf, Richard E. and Peter Schultze. 2005. ‘Large-Scale Parallel Breadth-First Search’. In AAAI’05:
Proceedings of the 20th National Conference on Artificial Intelligence, vol. 3, July 9-13, 2005,
1380-85. AAAI Press.
Korte, B. 1990. ‘Applications of Combinatorial Optimization in the Design, Layout and Production of
Computers’. In Modelling the Innovation 1990, edited by H.-J. Sebastian and K. Tammer, 517-38.
Springer.
References | 455
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ‘ImageNet Classification with Deep
Convolutional Neural Networks’. Communications of the ACM 60 (6): 84-90. doi:10.1145/3065386.
ISSN 0001-0782. S2CID 195908774.
Kumashi, Praveen Kumar and Deepak Khemani. 2002. ‘State Space Regression Planning Using Forward
Heuristic Construction Mechanism’. In Artificial Intelligence: Theory and Practice, proceedings
of the Int. Conf. on Knowledge Based Computer Systems, KBCS 2002, edited by M. Sasikumar,
J. J. Hegde, and M. Kavitha, 489-99. Noida: Vikas Publishing House.
Laird, John E., Paul S. Rosenbloom, and Allen Newell. 1986. ‘Chunking in Soar: The Anatomy of a
General Learning Mechanism’. Machine Learning 1: 11-46.
Laird, John E., Allen Newell, and S. Paul. ‘Rosenbloom: SOAR: An Architecture for General Intelligence’.
Artificial Intelligence 33 (1): 1-64.
Larranaga, P., C. M. H. Kuijpers, R. H. Murga, I. Inza, and S. Dizdarevic. 1999. ‘Genetic Algorithms for
the Travelling Salesman Problem: A Review of Representations and Operators’. Artificial Intelligence
Review 13: 129-70.
Larson, Eric J. 2021. The Myth of Artificial Intelligence: Why Computers Can’t Think the Way We Do.
Cambridge: The Belknap Press.
Leslie, A. M. 2001. ‘Theory of Mind’. In International Encyclopedia of the Social and Behavioral
Sciences, edited by Neil J. Smelser and Paul B. Baltes. Elsevier.
Levesque, Hector J., Ernest Davis, and Leora Morgenstern. 2012. ‘The Winograd Schema Challenge’. In
Principles of Knowledge Representation and Reasoning: Proceedings of the Thirteenth International
Conference, KR 2012, edited by Gerhard Brewka, Thomas Eiter, and Sheila A. McIlraith, Rome,
Italy, 10-14 June. AAAI Press.
Levesque, Hector J. 2017. Common Sense, the Turing Test, and the Quest for Real AI. 1st ed. Cambridge:
MIT Press.
Levy, David. 2008. Love and Sex with Robots: The Evolution of Human-Robot Relationships. London:
Gerald Duckworth & Co Ltd.
Lewis, D. D. 1998. ‘Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval’.
In Machine Learning: ECML-98: 10th European Conference on Machine Learning, Chemnitz,
Germany, April, 21-23, 1998, edited by C. N’edellec and C. Rouveirol, 4-15. Berlin/Heidelberg:
Springer.
Lifschitz, V. 1994. ‘Circumscription’. In Handbook of Logic in Artificial Intelligence and Logic
Programming, vol. 3, 297-352. Oxford: Oxford University Press.
Lindsay, Robert K., Bruce G. Buchanan, Edward A. Feigenbaum, and Joshua Lederberg. 1980.
Applications of Artificial Intelligence for Organic Chemistry: The Dendral Project. New York:
McGraw-Hill Book Company.
Lindsay, Robert K., Bruce G. Buchanan, and Edward A. Feigenbaum. 1993. ‘DENDRAL: A Case Study
of the First Expert System for Scientific Hypothesis Formation’. Artificial Intelligence 61 (2): 209-61.
Litke, John D. 1984. ‘An Improved Solution to the Traveling Salesman Problem with Thousands of
Nodes’. Communications of the ACM 27 (12): 1227-36.
Long, Derek and Maria Fox. 2003. ‘The 3rd International Planning Competition: Results and Analysis’.
Journal of Artificial Intelligence Research. Special issue on the 3rd International Planning
Competition, 20: 1-59.
Lorenz, Edward N. 1993. The Essence of Chaos. Seattle: University of Washington Press.
Love, Clyde C. 2010. Bridge Squeezes Complete: Winning End Play. 2nd updated, revised ed. Toronto:
Master Point Press.
456 | References
Lucy, J. A. 2001. ‘Sapir-Whorf Hypothesis’. In International Encyclopedia of the Social and Behavioral
Sciences, edited by Neil J. Smelser and Paul B. Baltes. Elsevier.
Mackworth, Alan K. 1977. ‘Consistency in Networks of Relations’. Artificial Intelligence 8 (1): 99-118.
Mackworth, Alan K. and Eugene C. Freuder. 1985. ‘The Complexity of Some Polynomial Network
Consistency Algorithms for Constraint Satisfaction Problems’. Artificial Intelligence 25 (1): 65-74.
MacLean, P. D. 1990. The Triune Brain in Evolution: Role in Paleocerebral Functions. Berlin: Springer.
MacQueen, J. B. 1967. ‘Some Methods for classification and Analysis of Multivariate Observations’. In
Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 281-97.
Berkeley: University of California Press.
Mainzer, Klaus. 2003. Thinking in Complexity: The Computational Dynamics of Matter, Mind, and
Mankind. 4th ed. Berlin: Springer.
Manna, Zohar. 1974.Mathematical Theory of Computation. McGraw Hill.
Martelli, Alberto and Ugo Montanari. 1978. ‘Optimising Decision Trees Through Heuristically Guided
Search’. Communications of the ACM 21 (12): 1025-039.
McAllester, D. and D. Rosenblitt. 1991. ‘Systematic Nonlinear Planning’. In AAAI’91: Proceedings of
the Ninth National Conference on Artificial Intelligence, vol. 2, July 14-19, 1991, Anaheim, CA,
634-39. AAAI Press.
McCarthy, J. 1980. ‘Circumscription: A Form of Non-monotonic Reasoning’. Artificial Intelligence
13: 27-39.
———. 1986. ‘Applications of Circumscription to Formalizing Common-sense Knowledge’. Artificial
Intelligence 28: 89-116.
McCarthy, J. and P. J. Hayes. 1969. ‘Some Philosophical Problems from the Standpoint of Artificial
Intelligence’. In Machine Intelligence 4, edited by D. Michie and B. Meltzer, 463-502. Edinburgh:
Edinburgh University Press.
McCorduck, Pamela. 2004. Machines Who Think: A Personal Inquiry into the History and Prospects of
Artificial Intelligence. 2nd ed. Boca Raton: A K Peters/CRC Press.
McCulloch, Warrant. 1961. ‘What Is a Number, That a Man May Know It, and a Man, That He May
Know a Number’. General Semantics Bulletin Nos 26, 27: 7-18.
McCulloch, Warren S., and Walter Pitts. 1943. ‘A Logical Calculus of Ideas Immanent in Nervous
Activity’. Bulletin of Mathematical Biophysics 5: 115-33.
McDermott, John P. 1980. ‘RI: An Expert in the Computer Systems Domain’. In Proceedings of the First
National Conference on Artificial Intelligence (AAAI-80), August 18-21, 1980, Stanford University,
Stanford, CA, 267-71. Palo Alto, CA: AAAI Press.
———. 1980. ‘R1: The Formative Years’. AI Magazine 2 (2): 21-29.
McDermott, D. 1987. ‘We’ve Been Framed: Or Why AI Is Innocent of the Frame Problem’. In The
Robot’s Dilemma: The Frame Problem in Artificial Intelligence, edited by Z. W. Pylyshyn. Norwood,
NJ: Ablex Publ.
———. 1998. PDDL - The Planning Domain Definition Language Version 1.2. New Haven: Yale Center
for Computational Vision and Control, Yale, October.
McGann, Conor, Frederic Py, Kanna Rajan, Hans Thomas, Richard Henthorn, and Rob McEwen. 2008.
‘A Deliberative Architecture for AUV Control’. In Proceedings of IEEE International Conference on
Robotics and Automation, ICRA 2008, IEEE, Pasadena, CA, 1049-54.
McClelland, James L. and David E. Rumelhart. 1986. Parallel Distributed Processing: Explorations in
the Microstructure of Cognition. Vol. 1: Foundations. Cambridge, MA: MIT Press.
———. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 2:
Psychological and Biological Models. Cambridge, MA: MIT Press.
References | 457
Rajan, Kanna, Frederic Py, Conor Mcgann, John Ryan, and Thom Maughan. 2009. ‘Onboard Adaptive
Control of AUVs using Automated Planning and Execution’. In Proceedings of the 16th Annual
International Symposium on Unmanned Untethered Submersible Technology 2009 (UUST 09),
August, 23-26, 2009, Durham, NH, 1-13. Autonomous Undersea Systems Institute (AUSI).
Reifsteck, D., T. Engesser, R. Mattmuller, and B. Nebel. 2019. ‘Epistemic Multi-Agent Planning using
Monte-Carlo Tree Search’. In Joint German/Austrian Conference on Artificial Intelligence, 277-89.
Kunstliche Intelligenz.
Reinelt G. 1995. ‘TSPLIB: Discrete and Combinatorial Optimization’. Accessed August 7 2022. http://
comopt.ifi.uni-heidelberg.de/software/TSPLIB95/.
Reiter, Raymond. 1980. ‘A Logic for Default Reasoning’. Artificial Intelligence 13: 81132.
Rich, Elaine. 1983. Artificial Intelligence. New York: McGraw Hill.
Rich, Elaine and Kevin Knight. 1991. Artificial Intelligence. New York: Tata McGraw Hill.
Richards, Mark and Eyal Amir. 2007. ‘Opponent Modeling in Scrabble’. In Proceedings of the 20th
International Joint Conference on Artificial Intelligence, January, 6-12, 2007, Hyderabad, India,
edited by Manuela M. Veloso, 1482-87.
Robinson, J. Alan. 1965. ‘A Machine-Oriented Logic Based on the Resolution Principle’. Journal of the
ACM (JACM) 12 (1): 23-41.
Rosenblatt, Frank. 1958. ‘The Perceptron: A Probabilistic Model for Information Storage and Organization
in the Brain’. Psychological Review 65 (6): 386-408.
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. ‘Learning Representations by
Back-Propagating Errors’. Nature 323: 533-36. doi:10.1038/323533a0.
Schaeffer, J., N. Burch, Y. Bjornsson, A. Kishimoto, M. Muller, R. Lake, P. Lu, and S. Sutphen. 2007.
‘Checkers Is Solved’. Science 317: 1518-522.
Schank, Roger C. and Robert P. Abelson. 1977. Scripts, Plans, Goals, and Understanding: An Inquiry
into Human Knowledge Structures. Hillsdale, NJ: Lawrence Erlbaum.
Schank, R. C. and C. K. Riesbeck. 1981. Inside Computer Understanding: Five Programs Plus Miniatures.
Mahwah: Lawrence Erlbaum.
Schirber, Michael. 2005. Dancing Bees Speak in Code. Live Science, 27 May. Accessed November 2021.
https://www.livescience.com/3812-dancing-bees-speak-code.html.
Schrittwieser, Julian, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon
Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and
David Silver. 2020. ‘Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model’.
Nature 588: 604-09. doi:10.1038/s41586-020- 03051-4.
Shanahan, M.P. 1997. Solving the Frame Problem: A Mathematical Investigation of the Common Sense
Law of Inertia. Cambridge: MIT Press.
Shanahan, Murray. 1999. ‘The Event Calculus Explained’. Artificial Intelligence Today 1999: 409-30.
Shannon, Claude E. 1950. ‘Programming a Computer for Playing Chess’. Philosophical Magazine
(Ser.7) 41 (314): 256-75.
Sheppard, Brian. 2002. ‘World-championship-caliber Scrabble’. Artificial Intelligence 134: 241-275.
Shortliffe E. H. 1976. Computer Based Medical Consultation: MYCIN. New York: Elsevier.
Silver, David, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche,
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman,
Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach,
Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. 2016. ‘Mastering the Game of Go with
Deep Neural Networks and Tree Search’. Nature 529: 484-89. doi:10.1038/nature16961.
References | 459
Silver, David, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez,
Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui,
Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. 2017. ‘Mastering the
Game of Go without Human Knowledge’. Nature 550: 354-59. doi:10.1038/nature24270.
Silver, David, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc
Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and
Demis Hassabis. 2018. ‘A General Reinforcement Learning Algorithm That Masters Chess, Shogi,
and Go through Self-Play’. Science 362 (6419): 1140-44. doi:10.1126/science.aar640.
Singhal, Amit. 2012. ‘Introducing the Knowledge Graph: Things, Not Strings’. Google Official Blog,
May 2012. Accessed December 2022. https://blog.google/products/search/introducing-knowledge-
graph-things-not/.
Skolem, Thoralf A. 1970. Selected Works in Logic. Edited by J. E. Fenstad. Oslo: Scandinavian University
Books.
Slagle J. 1961. ‘A Heuristic Program that Solves Symbolic Integration Problems in Freshman Calculus’.
Ph.D. diss., MIT, May.
———. 1963. ‘A Heuristic Program That Solves Symbolic Integration Problems in Freshman Calculus’.
JACM 10 (4): 507-20.
Smith, David E. and Daniel S. Weld. 1998. ‘Conformant Graphplan’. In Proceedings of the Fifteenth
National Conference on Artificial Intelligence and Tenth Innovative Applications of Artificial
Intelligence Conference, AAAI 98, IAAI98, 889-96. AAAI Press.
Smullyan, Raymond M. 2009. Logical Labyrinths. Natick, MA: A K Peters.
Sterling, Leon and Ehud Shapiro. 1994. The Art of Prolog, Second Edition: Advanced Programming
Techniques (Logic Programming). Cambridge, MA: The MIT Press.
Stieger, Allison. 2014. ‘Myth and Creativity: Ariadne’s Thread and a Path through the Labyrinth’. The
Creativity Post, 16 June. Accessed 6 April 2020. https://www.creativitypost.com/article/myth_and_
creativity_ariadnes_thread_and_a_path_through_the_labyrinth.
Stockman, George C. 1979. ‘A Minimax Algorithm Better than Alpha-Beta?’ Artificial Intelligence
12 (2): 179-96.
Stoll, Robert R. 1979. Set Theory and Logic. New ed. Dover Publications Inc.
Sussman, G. 1975. A Computer Model of Skill Acquisition. Amsterdam: Elsevier/North Holland.
Sutton, R. S. 1988. ‘Learning to Predict by the Methods of Temporal Differences’. Machine Learning
3: 9-44.
Sutton, Richard S. and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. Cambridge,
MA: MIT Press.
Tate, A., B. Drabble, and R. Kirby. 1994. O-Plan2: An Architecture Command Planning and Control.
Burlington: Morgan Kaufmann.
Tesauro, Gerald and Terrence J. Sejnowski. 1989. ‘A Parallel Network That Learns to Play Backgammon’.
Artificial Intelligence 39 (3): 357-90.
Tessauro, G. 1989. ‘Neurogammon Wins Computer Olympiad’. Neural Computation 1: 321-23.
———. 1994. ‘TD-Gammon, A Self-Teaching Backgammon Program Achieves Master-level Play’.
Neural Computation 6 (2): 215-19.
———. 1995. ‘Temporal Difference Learning and TD-Gammon’. Communications of the ACM 38 (3):
58-68.
———. 2002. ‘Programming Backgammon using Self-teaching Neural Nets’. Artificial Intelligence
134 (1-2): 181-99.
460 | References
Truscott, A. and D. Truscott. 2004. The New York Times Bridge Book: An Anecdotal History of the
Development, Personalities, and Strategies of the World’s Most Popular Card Game. Basingstoke:
Macmillan.
Turing, A. M. 1950. ‘Computing Machinery and Intelligence’. Mind 59: 433-60. http://loebner.net/
Prizef/TuringArticle.html.
van Beek, Peter and Xinguang Chen. 1999. ‘CPlan: A Constraint Programming Approach to Planning’.
In Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh
innovative Applications of Artificial intelligence, American Association for Artificial Intelligence,
Menlo Park, CA, 585-90.
van Ditmarsch, H., W. van Der Hoek, and B. Kooi, 2007. Dynamic Epistemic Logic. Berlin: Springer.
van Gelder, A. and Y. K. Tsuji. 1996. ‘Satisfiability Testing with More Reasoning and Less Guessing’.
In Cliques, Coloring, and Satisfiability: Second DIMACS Implementation Challenge, edited by
D. S. Johnson and M. Trick, 559-86. DIMACS Series in Discrete Mathematics and Theoretical
Computer Science. Providence, RI: American Mathematical Society.
von Neumann, John and Oskar Morgenstern. 1944. Theory of Games and Economic Behavior. Princeton:
Princeton University Press.
Waltz, D. L. 1975. ‘Understanding Line Drawings of Scenes with Shadows’. In The Psychology of
Computer Vision, edited by P. H. Winston, 19-91. New York: McGraw-Hill.
Waterman, D. A. and Frederick Hayes-Roth. 1978. Pattern-Directed Inference Systems. New York:
Academic Press.
Watson, Ian. 1997. Applying Case-Based Reasoning: Techniques for Enterprise Systems. The Morgan
Kaufmann Series in Artificial Intelligence. Burlington: Morgan Kaufmann.
———. 2002. Applying Knowledge Management: Techniques for Building Corporate Memories. The
Morgan Kaufmann Series in Artificial Intelligence. Burlington: Morgan Kaufmann.
Weizenbaum, Joseph. 1966. ‘ELIZA: A Computer Program for the Study of Natural Language
Communication Between Man and Machine’. Communications of the ACM 9 (1): 36-45.
Weld, Daniel S. 1994. ‘An Introduction to Least Commitment Planning’. AI Magazine 15 (4): 27-61.
———. 1999. ‘Recent Advances in AI Planning’. AI Magazine 20 (2): 93. doi:10.1609/aimag.
v20i2.1459.
Winograd, T. 1972. Understanding Natural Language. New York: Academic Press.
Winston, Patrick Henry. 1977. Artificial Intelligence. Boston: Addison-Wesley Pub. Co.
Woolsey, Kit. 2000. Computers and Rollouts, GammOnline. http://gammonline.com/members/lan00/
articles/roll.htm.
Xie, Lingyun and Du Limin, Du. 2004. ‘Efficient Viterbi Beam Search Algorithm using Dynamic
Pruning’. In International Conference on Signal Processing Proceedings, ICSP. 1, vol. 1, 699-702.
doi:10.1109/ICOSP.2004.1452759.
Zhou, R. and E. A. Hansen. 2003. ‘Sparse-Memory Graph Search’. In Proceedings of the 18th
International Joint Conference on Artificial Intelligence (IJCAI-03), 1259-66. San Francisco, CA:
Morgan Kaufmann Publishers Inc.
———. 2004. ‘Breadth-First Heuristic Search’. In Proceedings of the 14th International Conference on
Automated Planning and Scheduling, June 3-7, 2004, Whistler, British Columbia, Canada, 92-100.
AAAI Press.
———. 2005. ‘Beam Stack Search: Integrating Backtracking with Beam Search’. In Proceedings of the
15th International Conference on Automated Planning and Scheduling, edited by Susanne Biundo,
Karen Myers, Kanna Rajan, 90-98. Palo Alto: AAAI Press.