
Artificial Intelligence

Dr Sean Holden
Computer Laboratory, Room FC06
Telephone extension 63725
Email: sbh11@cam.ac.uk
www.cl.cam.ac.uk/users/sbh11/

Copyright © Sean Holden 2002-2021.

1
Artificial Intelligence

Introduction: aims, history, rational action, and agents

Reading: AIMA chapters 1, 2, 26 and 27.


2
Introduction: what are our aims?

Artificial Intelligence (AI) is currently at the top of its periodic hype-cycle.

Much of this has been driven by philosophers and people with something to sell.

3
Introduction: what are our aims?

What is the purpose of Artificial Intelligence (AI)? If you’re a philosopher or a


psychologist then perhaps it’s:

• To understand intelligence.
• To understand ourselves.

Philosophers have worked on this for at least 2000 years. They’ve also wondered
about:

• Can we do AI? Should we do AI? What are the ethical implications?


• Is AI impossible? (Note: I didn’t write possible here, for a good reason…)

Despite 2000 years of work by philosophers, there’s essentially nothing in the


way of results.

4
Introduction: what are our aims?

Luckily, we were sensible enough not to pursue degrees in philosophy—we’re


scientists/engineers, so while we might have some interest in such pursuits, our
perspective is different:

• Brains are small (true) and apparently slow (not quite so clear-cut), but in-
credibly good at some tasks—we want to understand a specific form of com-
putation.
• It would be nice to be able to construct intelligent systems.
• It is also nice to make and sell cool stuff .

Historically speaking, this view seems to be the more successful. . .


AI has been entering our lives for decades, almost without us being aware of it.
But be careful: brains are much more complex than you think.

5
Introduction: now is a fantastic time to investigate AI

In many ways this is a young field, having only really got under way in 1956
with the Dartmouth Conference.

www-formal.stanford.edu/jmc/history/dartmouth/dartmouth.html

• This means we can actually do things. It’s as if we were physicists before


anyone thought about atoms, or gravity, or. . . .
• Also, we know what we’re trying to do is possible. (Unless we think humans
don’t exist. NOW STEP AWAY FROM THE PHILOSOPHY before SOMEONE
GETS HURT‼‼)

Perhaps I’m being too hard on them; there was some good groundwork: Socrates wanted an algorithm for “piety”,
leading to Syllogisms. Ramon Lull’s concept wheels and other attempts at mechanical calculators. René Descartes’
Dualism and the idea of mind as a physical system. Wilhelm Leibniz’s opposing position of Materialism. (The
intermediate position: mind is physical but unknowable.) The origin of knowledge: Francis Bacon’s Empiricism, John
Locke: “Nothing is in the understanding, which was not first in the senses”. David Hume: we obtain rules by repeated
exposure: Induction. Further developed by Bertrand Russell and in the Confirmation Theory of Carnap and Hempel.

More recently: the connection between knowledge and action? How are actions justified? If to achieve the end you
need to achieve something intermediate, consider how to achieve that, and so on. This approach was implemented
in Newell and Simon’s 1957 General Problem Solver (GPS).

6
What has been achieved?

Artificial Intelligence (AI) is currently at the top of its periodic hype-cycle.

As a result, it’s important to maintain some sense of perspective.


Notable successes:

• Perception: vision, speech processing, inference of emotion from video, scene


labelling, touch sensing, artificial noses…
• Logical reasoning: Prolog, expert systems, CYC, Bayesian reasoning, Wat-
son…
• Playing games: chess, backgammon, go, robot football…
• Diagnosis of illness in various contexts…
• Theorem proving: the Robbins conjecture, formalization of the Kepler conjec-
ture…
• Literature and music: automated writing and composition…
• And many more… (most of which don’t include the word ‘DEEP’!)

7
What has been achieved?

Artificial Intelligence (AI) is currently at the top of its periodic hype-cycle.

As a result, it’s important to maintain some sense of perspective.


There are equally many areas in which we currently can’t do things very well:

“Sleep that knits up the ravell’d sleave of care”

is a line from Shakespeare’s Macbeth.


On the other hand…
When AI has a success, the ideas in question tend to stop being called AI .
Do you consider the fact that your phone can do speech recognition to be a form
of AI?

8
The nature of the pursuit

What is AI? This is not necessarily a straightforward question.


It depends on who you ask…
We can find many definitions and a rough categorisation can be made depending
on whether we are interested in:

• The way in which a system acts or the way in which it thinks.


• Whether we want it to do this in a human way or a rational way.

Here, the word rational has a special meaning: it means doing the correct thing
in given circumstances.

9
What is AI, version one: acting like a human

Alan Turing proposed what is now known as the Turing Test.

• A human judge is allowed to interact with an AI program via a terminal.


• This is the only method of interaction.
• If the judge can’t decide whether the interaction is produced by a machine or
another human then the program passes the test.

In the unrestricted Turing test the AI program may also have a camera attached,
so that objects can be shown to it, and so on.
The Turing test is informative, and (very!) hard to pass. (See the Loebner Prize…)

• It requires many abilities that seem necessary for AI, such as learning. BUT :
a human child would probably not pass the test.
• Sometimes an AI system needs human-like acting abilities—for example ex-
pert systems often have to produce explanations—but not always.

10
What is AI, version two: thinking like a human

There is always the possibility that a machine acting like a human does not ac-
tually think. The cognitive modelling approach to AI has tried to:

• Deduce how humans think—for example by introspection or psychological ex-


periments.
• Copy the process by mimicking it within a program.

An early example of this approach is the General Problem Solver produced by


Newell and Simon in 1957. They were concerned with whether or not the pro-
gram reasoned in the same manner that a human did.

Computer Science + Psychology = Cognitive Science

11
What is AI, version three: thinking rationally and the “laws of thought”

The idea that intelligence reduces to rational thinking is a very old one, going at
least as far back as Aristotle as we’ve already seen.
The general field of logic made major progress in the 19th and 20th centuries,
allowing it to be applied to AI.

• We can represent and reason about many different things.


• The logicist approach to AI.

This is a very appealing idea, but there are obstacles. It is hard to:

• Represent commonsense knowledge.


• Deal with uncertainty.
• Reason without being tripped up by computational complexity.
• Sometimes it’s necessary to act when there’s no logical course of action.
• Sometimes inference is unnecessary (reflex actions).

These will be recurring themes in this course, and in Machine Learning and Bayesian
Inference next year.
12
What is AI, version four: acting rationally

Basing AI on the idea of acting rationally means attempting to design systems


that act to achieve their goals given their beliefs.

• Thinking about this in engineering terms, it seems almost inevitably to lead


us towards the usual subfields of AI. What might be needed?
• The concepts of action, goal and belief can be defined precisely, making the
field suitable for scientific study.
• This is important: if we try to model AI systems on humans, we can’t even
propose a sensible definition of what a belief or goal is.
• In addition, humans are systems that are still changing, and that are adapted
to a very specific environment.
• All of the things needed to pass a Turing test seem necessary for rational
acting, so this seems preferable to the acting like a human approach.
• The logicist approach can clearly form part of what’s required to act ratio-
nally, so this seems preferable to the thinking rationally approach alone.

As a result, we will focus on the idea of designing systems that act rationally.
13
Other fields that have contributed to AI

14
What’s in this course?

This course introduces some of the fundamental areas that make up AI:

• An outline of the background to the subject.


• An introduction to the idea of an agent.
• Solving problems in an intelligent way by search.
• Solving problems represented as constraint satisfaction problems.
• Playing games.
• Knowledge representation, and reasoning.
• Planning.
• Learning using neural networks.

Strictly speaking, this course covers what is often referred to as “Good Old-Fashioned
AI”. (Although “Old-Fashioned” is a misleading term.)
The nature of the subject changed when the importance of uncertainty was fully
appreciated. Machine Learning and Bayesian Inference covers this more recent
material.
15
What’s not in this course?

• The classical AI programming languages Prolog and Lisp.


• A great deal of all the areas on the last slide!
• Perception: vision, hearing and speech processing, touch (force sensing, know-
ing where your limbs are, knowing when something is bad), taste, smell.
• Natural language processing.
• Acting on and in the world: robotics (effectors, locomotion, manipulation),
control engineering, mechanical engineering, navigation.
• Areas such as genetic algorithms/programming, swarm intelligence, artificial
immune systems and fuzzy logic, for reasons that I will expand upon during
the lectures.
• Uncertainty and much further probabilistic material. (You’ll have to wait until
next year.)

16
Introductory reading that isn’t nonsense

• Francis Crick, “The recent excitement about neural networks”, Nature (1989) is
still entirely relevant:
www.nature.com/nature/journal/v337/n6203/abs/337129a0.html

• The Loebner Prize in Artificial Intelligence:


aisb.org.uk/aisb-events/

provides a good illustration of how far we are from passing the Turing test.
• Marvin Minsky, “Why people think computers can’t”, AI Magazine (1982) is
an excellent response to nay-saying philosophers.
http://web.media.mit.edu/~minsky/

• Go: www.nature.com/nature/journal/v529/n7587/full/nature16961.html

• The Cyc project: www.cyc.com

• AI at Nasa Ames:
www.nasa.gov/centers/ames/research/areas-of-ames-ingenuity-autonomy-and-robotics

17
Introductory reading that isn’t nonsense

• AI in the UK: ready, willing and able?


House of Lords, Select Committee on Artificial Intelligence

https://publications.parliament.uk/pa/ld201719/ldselect/ldai/100/100.pdf

• Machine learning: the power and promise of computers that learn by example
The Royal Society
https://royalsociety.org/topics-policy/projects/machine-learning/

• Building machines that learn and think like people


Brenden M. Lake et al, Behavioral and Brain Sciences, Cambridge University
Press, 2017.

18
Text book

The course is based on the relevant parts of:

Artificial Intelligence: A Modern Approach, Third Edition (2010). Stuart Russell


and Peter Norvig, Prentice Hall International Editions.

and an alternative source is:

Artificial Intelligence: Foundations of Computational Agents, Second Edition


(2017). David L. Poole and Alan K. Mackworth, Cambridge University Press.

For more depth on specific areas see:


Dechter, R. (2003). Constraint processing. Morgan Kaufmann.
Cawsey, A. (1998). The essence of artificial intelligence. Prentice Hall.
Ghallab, M., Nau, D. and Traverso, P. (2004). Automated planning: theory and
practice. Morgan Kaufmann.
Bishop, C.M. (2006). Pattern recognition and machine learning. Springer.
Brachman, R. J. and Levesque, H. J. (2004). Knowledge Representation and Reason-
ing. Morgan Kaufmann.
19
Prerequisites

The prerequisites for the course are: first order logic, some algorithms and data
structures, discrete and continuous mathematics, and basic computational com-
plexity.
DIRE WARNING:
No doubt you want to know something about machine learning, given the recent
peak in interest.
In the lectures on machine learning I will be talking about neural networks.
I will introduce the backpropagation algorithm, which is the foundation for both
classical neural networks and the more fashionable deep learning methods.
This means you will need to be able to differentiate and also handle vectors and
matrices.
If you’ve forgotten how to do this you WILL get lost—I guarantee it‼!

20
Prerequisites

Self test:

1. Let
\[ f(x_1, \ldots, x_n) = \sum_{i=1}^{n} a_i x_i^2 \]
where the ai are constants. Can you compute ∂f /∂xj where 1 ≤ j ≤ n?
2. Let f (x1, . . . , xn) be a function. Now assume xi = gi(y1, . . . , ym) for each xi
and some collection of functions gi. Assuming all requirements for differen-
tiability and so on are met, can you write down an expression for ∂f /∂yj
where 1 ≤ j ≤ m?

If the answer to either of these questions is “no” then it’s time for some revision.
(You have about three weeks’ notice, so I’ll assume you know it!)
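For reference only (the point of the self test is that you should be able to produce these yourself), the expected answers are the standard results
\[ \frac{\partial f}{\partial x_j} = 2 a_j x_j, \qquad \frac{\partial f}{\partial y_j} = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}\,\frac{\partial g_i}{\partial y_j}. \]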

21
And finally. . .

There are some important points to be made regarding computational complexity.


First, you might well hear the term AI-complete being used a lot. What does it
mean?

AI-complete: only solvable if you can solve AI in its entirety.

For example: high-quality automatic translation from one language to another.


To produce a genuinely good translation of Moby Dick from English to Cantonese
is likely to be AI-complete.

22
And finally. . .

More practically, you will often hear me make the claim that everything that’s at
all interesting in AI is at least NP-complete.
There are two ways to interpret this:

1. The wrong way: “It’s all a waste of time.1” OK, so it’s a partly understandable
interpretation. BUT the fact that Boolean satisfiability is intractable does not
mean we can’t solve large instances in practice. . .
2. The right way: “It’s an opportunity to design nice approximation algorithms.”
In reality, the algorithms that are good in practice are ones that often find a
good, though not necessarily optimal, solution in a reasonable amount of time
and memory.

1
In essence, a comment on a course assessment a couple of years back to the effect of: “Why do you teach us this stuff if it’s all futile?”

23
Agents

There are many different definitions for the term agent within AI.
Allow me to introduce EVIL ROBOT.

(Figure: EVIL ROBOT, shouting “MUST ENSLAVE EARTH!!! Dr Holden will be our GLORIOUS LEADER!!!”, senses its environment and acts upon it.)

We will use the following simple definition: an agent is any device that can sense
and act upon its environment.
24
Agents

This definition can be very widely applied: to humans, robots, pieces of software,
and so on.
We are taking quite an applied perspective. We want to make things rather than
copy humans. So:

1. How can we judge an agent’s performance?


2. How can an agent’s environment affect its design?
3. Are there sensible ways in which to think about the structure of an agent?

Recall that we are interested in devices that act rationally, where ‘rational’ means
doing the correct thing under given circumstances.

25
Measuring performance

Item 1: How can we judge an agent’s performance?

• Any measure of performance is likely to be problem-specific.


– Even a simple email filter is an agent—it can sense and act. Here the per-
formance measure is straightforward.
– For a self-driving car, it is more complicated!
• We’re usually interested in expected, long-term performance.
– Expected performance because usually agents are not omniscient—they
don’t infallibly know the outcome of their actions.
(It is rational for you to enter this lecture theatre even if the roof falls in
today. An agent capable of detecting and protecting itself from a falling
roof might be more successful than you, but not more rational.)
– Long-term performance because it tends to lead to better approximations
to what we’d consider rational behaviour.

26
Environments

Item 2: How can an agent’s environment affect its design?


Some common attributes of an environment have a considerable influence on
agent design.

• Accessible/inaccessible: do percepts tell you everything you need to know


about the world?
• Deterministic/non-deterministic: does the future depend predictably on the
present and your actions?
• Episodic/non-episodic: is the agent run in independent episodes?
• Static/dynamic: can the world change while the agent is deciding what to do?
• Discrete/continuous: an environment is discrete if the sets of allowable per-
cepts and actions are finite.
• For multiple agents: whether the situation is competitive or cooperative, and
whether communication is required.

27
Programming agents

Item 3: Are there sensible ways in which to think about the structure of an agent?
A basic agent can be thought of as working according to a straightforward un-
derlying process. To achieve some goal:
• Gather perceptions.
• Update working memory to take account of them.
• On the basis of what’s in the working memory, choose an action to perform.
• Update the working memory to take account of this action.
• Do the chosen action.
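As a minimal sketch only (the function names here are placeholders supplied by the caller, not part of the course material), the loop above might look like this in Python:

def run_agent(sense, update, choose_action, execute, goal_achieved, memory):
    # Repeatedly: gather a percept, fold it into working memory, choose an
    # action on the basis of the memory, record the choice, then act.
    while not goal_achieved(memory):
        percept = sense()
        memory = update(memory, percept)
        action = choose_action(memory)
        memory = update(memory, action)
        execute(action)
    return memory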
Obviously, this hides a great deal of complexity:
• A percept might arrive while an action is being chosen.
• The world may change while an action is being chosen.
• Actions may affect the world in unexpected ways.
• We might have multiple goals, which interact with each other.
• And so on…
28
Keeping track of the environment, and having a goal

It seems reasonable that an agent should maintain:

• A description of the current state of its environment.


• Knowledge of how the environment changes independently of the agent.
• Knowledge of how the agent’s actions affect its environment.

This requires us to do knowledge representation and reasoning .

It also seems reasonable that an agent should choose a rational course of action
depending on its goal.

• If an agent has knowledge of how its actions affect the environment, then it
has a basis for choosing actions to achieve goals.

• To obtain a sequence of actions we need to be able to search and to plan .

29
Goal-based agents

We now have a basic design that looks something like this:

(Diagram: a percept arrives and is used to update descriptions of the current environment, of the effects of actions, and of the behaviour of the environment; from these, together with a description of the goal, the agent infers an action or action sequence.)

30
Utility-based agents

Introducing goals is still not the end of the story.

• There may be many sequences of actions that lead to a given goal, and some
may be preferable to others.
• We might need to trade-off conflicting goals, for example speed and safety.
• An agent may have several goals, but not be certain of achieving any of them.
Can it trade-off the likelihood of reaching a goal against the desirability of
getting there?

A utility function maps a state to a number representing the desirability of that


state.
Maximising expected utility over time forms a fundamental model for the design
of agents.
Unfortunately, there is insufficient time in this course to properly explore agents
based on utility.

31
Learning agents

It seems reasonable that an agent should learn from experience :

(Diagram: as before, a percept updates the descriptions of the current environment, the effects of actions, and the behaviour of the environment, and an action or action sequence is inferred from them and the description of the goal; in addition, a learner, driven by feedback, updates these descriptions.)

What might this entail?


32
Learning agents

Learning mainly requires two additions:

1. The learner needs some form of feedback on the agent’s performance. This
can come in several different forms.
2. The learner needs a means of generating new behaviour in order to find out
about the world.

The second point leads to an important trade-off:

1. Should the agent spend time exploiting what it’s learned so far, if it’s achieving
a level of success, or…
2. …should the agent try new things, exploring the environment on the basis
that it might learn something really useful even if it performs worse in the
short term?

33
Artificial Intelligence

Problem solving by search

Reading: AIMA chapters 3 and 4.


34
Problem solving by search

We begin with what is perhaps the simplest collection of AI techniques: those al-
lowing an agent existing within an environment to search for a sequence of actions
that achieves a goal.
Search algorithms apply to a particularly simple class of problems—we need to
identify:

• An initial state s0 from a set S of possible states.


This models the agent’s situation before anything else happens.
• A set of actions, denoted A.
These are modelled by specifying what state will result on performing any
available action in any state.
We can model this using a function action : A × S → S: if the agent is in
state s and performs action a then its new state is action(a, s).
• A goal test: we can tell whether or not the state we’re in corresponds to a
goal.
We can model this using a function goal : S → {true, false}.

35
Problem solving by search

We also need the idea of path cost.


We need another function cost : A × S → R. This denotes the cost of perform-
ing an action a in state s.
If the agent starts in state s0 and takes a sequence of actions a0, a1, . . . , an then
it moves through a sequence of states
\[ s_0 \xrightarrow{\;\mathrm{cost}(a_0, s_0)\;} s_1 \xrightarrow{\;\mathrm{cost}(a_1, s_1)\;} s_2 \xrightarrow{\;\mathrm{cost}(a_2, s_2)\;} \cdots \xrightarrow{\;\mathrm{cost}(a_n, s_n)\;} s_{n+1} \]
with s_{i+1} = action(a_i, s_i). We then define the path cost of this path as
\[ p(s_{n+1}) = \sum_{i=0}^{n} \mathrm{cost}(a_i, s_i). \]

We generally want a path to a goal that has minimum path cost.
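As a hedged sketch (the class and function names are illustrative choices, not part of the course material), the ingredients above can be bundled as follows, with the path cost simply accumulating cost(a_i, s_i) along the path:

class SearchProblem:
    """Initial state, available actions, transition model, goal test, step cost."""
    def __init__(self, s0, actions, action, goal, cost):
        self.s0 = s0            # initial state
        self.actions = actions  # actions(s): actions available in state s
        self.action = action    # action(a, s): state reached by doing a in s
        self.goal = goal        # goal(s): True iff s is a goal state
        self.cost = cost        # cost(a, s): cost of doing a in s

def path_cost(problem, action_sequence):
    # p(s_{n+1}) = sum of cost(a_i, s_i) over the states actually visited.
    s, total = problem.s0, 0
    for a in action_sequence:
        total += problem.cost(a, s)
        s = problem.action(a, s)
    return total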


Note that you have already seen problems like this…

36
Problem solving by search

You have already seen problems like this…

• Foundations of Computer Science: talks about searching in trees.


It covers depth-first, breadth-first and iterative deepening search.
• Algorithms: talks about searching in graphs.
It also covers depth-first and breadth-first search, from a more formal per-
spective.

This is all important stuff, but there’s a problem: none of these methods works in
practice for typical AI problems!
Essentially, the problem is that they are too naïve in the way that they choose a
state to explore at each step.
I’m going to assume that you know this material and move on…

37
Problem solving by search

A simple example: the 8-puzzle.

(Figure: an 8-puzzle start state, a sequence of actions sliding tiles into the empty square, and the goal state with the tiles 1–8 in order and the bottom-right square empty.)

From the pre-PC dark ages. Christmas was grim…

38
Problem solving by search

Here we have:

• Start state: a randomly-selected configuration of the numbers 1 to 8 arranged


on a 3 × 3 square grid, with one square empty.
• Goal state: the numbers in ascending order with the bottom right square
empty.
• Actions: left, right, up, down. We can move any square adjacent to the
empty square into the empty square. (It’s not always possible to choose from
all four actions.)
• Path cost: 1 per move.

The 8-puzzle is very simple. However, general sliding-block puzzles are a good
test case. The general problem is NP-complete. The 5 × 5 version has about 10^25
states, and a random instance is in fact quite a challenge.
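As an illustrative sketch (the tuple encoding, with 0 for the empty square, is my own choice rather than anything fixed by the slides), the 8-puzzle fits this formulation directly:

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)        # 0 marks the empty square

def possible_moves(s):
    # Indices of the tiles adjacent to the empty square: the tiles that can be
    # slid into it.  Not all four directions are always available.
    i = s.index(0)
    row, col = divmod(i, 3)
    moves = []
    if row > 0: moves.append(i - 3)
    if row < 2: moves.append(i + 3)
    if col > 0: moves.append(i - 1)
    if col < 2: moves.append(i + 1)
    return moves

def action(j, s):
    # Slide the tile at index j into the empty square; path cost is 1 per move.
    i = s.index(0)
    t = list(s)
    t[i], t[j] = t[j], t[i]
    return tuple(t)

def goal(s):
    return s == GOAL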

39
Problem solving by search

Problems of this kind are very simple, but a surprisingly large number of appli-
cations have appeared:

• Route-finding/tour-finding.
• Layout of VLSI systems.
• Navigation systems for robots.
• Sequencing for automatic assembly.
• Searching the internet.
• Design of proteins.

and many others…


Problems of this kind continue to form an active research area.

40
Search trees versus search graphs

We need to make an important distinction between search trees and search graphs.

s s
as opposed to

• In a tree only one path can lead to a given node, but a state s can appear in
multiple nodes.
• In a graph a state can appear in only one node, but may be reached via multiple
paths.
• In a graph we may encounter cycles.
• In a graph we may encounter redundant paths, where multiple paths lead to
the same state.

41
Search trees versus search graphs

Graphs can lead to problems:

(Figure: a small graph containing states A, B, C and D, and the much larger search tree obtained by unfolding it: the same state appears in many tree nodes.)

The sliding blocks puzzle for example suffers this way.


So: we start by assuming the search is taking place on a tree.

42
The basic tree-search algorithm

We need to define one more function: expand takes any state s. It applies all
actions that can be applied in s and returns the set of the resulting states:
expand(s) = {s′ | s′ = action(a, s) where a is an action possible in s}.
The algorithm for searching in a tree then looks like this:

1 fringe = [s0 ];
2 while true do
3 if fringe.empty() then
4 return NONE;
5 s = fringe.remove();
6 if goal(s) then
7 return (SOME s);
8 fringe.addAll(expand(s));

The search strategy is set by using a priority queue to implement the fringe.
The definition of priority then sets the way in which the tree is searched.
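As a runnable illustration (assuming expand, goal and priority are supplied by the caller), the same loop can be written with Python’s heapq module standing in for the priority queue; changing the priority function changes the search strategy:

import heapq

def tree_search(s0, expand, goal, priority):
    fringe = [(priority(s0), 0, s0)]       # entries are (priority, order, state)
    counter = 1                            # insertion order breaks ties
    while fringe:                          # an empty fringe means failure (NONE)
        _, _, s = heapq.heappop(fringe)    # state with the smallest priority value
        if goal(s):
            return s                       # SOME s
        for s2 in expand(s):
            heapq.heappush(fringe, (priority(s2), counter, s2))
            counter += 1
    return None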
43
The basic tree-search algorithm

The process looks like this:

(Figure: the search tree partitioned into nodes already expanded, nodes in the fringe but not yet expanded, and nodes not yet investigated.)

At each iteration, one node from the fringe is expanded. In general, if the branch-
ing factor is b then the layer at depth d can have b^d states.
The entire tree to depth d can have \( \sum_{i=0}^{d} b^i = \frac{b^{d+1} - 1}{b - 1} \) states.

44
Graph search

To search in graphs we need a way to make sure no state gets visited more than
once.
We need to add a closed list, and add a state to it when the state is first seen:

1 closed = [];
2 fringe = [s0 ];
3 while true do
4 if fringe.empty() then
5 return NONE;
6 s = fringe.remove();
7 if goal(s) then
8 return (SOME s);
9 closed.add(s);
10 for s′ ∈ expand(s) do
11 if (!closed.contains(s′ ) && !fringe.contains(s′ )) then
12 fringe.add(s′ );

45
Graph search

There are several points to note regarding graph search:

1. The closed list contains all the expanded states.


2. The closed list can be implemented using a hash table. So the time taken to
add or check membership can be manageable.
3. Both worst case time and space are now proportional to the size of the state
space. (Which is BIG‼‼)
4. Memory: depth first and iterative deepening search are no longer linear space
as we need to store the closed list.
5. Optimality: when a repeat is found we are discarding the new possibility even
if it is better than the first one. We may need to check which solution is better
and if necessary modify path costs and depths for descendants of the repeated
state.

46
Graph search

Graph search builds a tree on the graph:

(Figure: the tree built by graph search overlaid on the graph, from Start to Goal; the explored region is separated from the unexplored region by the fringe.)

The graph becomes separated: any path from the start to an unexplored node has
to pass through a fringe node.

47
The performance of search techniques

How might we judge the performance of a search technique?


We are interested in:

• Whether a solution is found.


• Whether the solution found is a good one in terms of path cost.
• The cost of the search in terms of time and memory.

So
the total cost = path cost + search cost
If a problem is highly complex it may be worth settling for a sub-optimal solution
obtained in a short time.
And we are interested in:
Completeness: does the strategy guarantee a solution is found?
Optimality: does the strategy guarantee that the best solution is found?
Once we start to consider these, things get a lot more interesting…

48
Basic search algorithms

We can immediately define some familiar tree search algorithms:

• New nodes are added to the head of the queue. This is depth-first search.
• New nodes are added to the tail of the queue. This is breadth-first search. (You
can do the goal test earlier—why is this?)

We will not dwell on these, as they are both completely hopeless in practice.
Why is breadth-first search hopeless?

• The procedure is complete: it is guaranteed to find a solution if one exists.


(Provided b < ∞.)
• The procedure is optimal if the path cost is a non-decreasing function of node-
depth. (For example, when all actions have the same cost.)
• The procedure has exponential complexity for both memory and time.

In practice it is the memory requirement that is problematic.

49
Basic search algorithms

For depth-first search:

• With graph search: complete if the number of states is finite, but not other-
wise.
• Not complete for tree search because of loops.

Neither is optimal.

50
Basic search methods

With depth-first search: for a given branching factor b and depth d the memory
requirement is O(bd).


This is only for tree search because we need to store nodes on the current path and
the other unexpanded nodes.
The time complexity for tree search is still O(b^d) (if you know you only have to
go to depth d). For graph search it is the size of the state space.
The search is no longer optimal, and may not be complete.
Iterative-deepening combines the two, but we can do better.

51
Uniform-cost search

How might we change tree search to try to get to an optimal solution while lim-
iting the time and memory needed?
The key point: so far we only distinguish goal states from non-goal states!

None of the searches you’ve seen so far tries to prioritize the exploration of good
states‼!

What is a good state?

• Well, at any point in the search we can work out the path cost p(s) of whatever
state s we’ve got to.
• How about using the p(s) as the priority for the priority queue?

This is called Uniform-Cost Search when implemented as a graph search.

52
Uniform-cost search

It needs a slight modification:

1 closed = [];
2 fringe = [s0 ];
3 while true do
4 if fringe.empty() then
5 return NONE;
6 s = fringe.remove();
7 if goal(s) then
8 return (SOME s);
9 closed.add(s);
10 for s′ ∈ expand(s) do
11 if (!closed.contains(s′ ) && !fringe.contains(s′ )) then
12 fringe.add(s′ );
13 else if (fringe.contains(s′ ) with higher p(s′ )) then
14 replace the fringe node with s′ ;

This modification must also be used when implementing the A* search method,
which we will see in a moment.
53
Uniform-cost search

This is optimal because when we select a node it must have the shortest path to
that node.
It is complete, provided it is impossible to get stuck within an infinite path:

• Require all costs to have a minimal value of ε > 0.


• Require the branching factor to be finite, so b < ∞.

In practice it doesn’t work very well: we need something more subtle.


But it does suggest the idea of an evaluation function: a function that attempts to
measure the desirability of each state.

54
Heuristics

Why is path cost not a good evaluation function? It is not directed in any sense
toward the goal.
A heuristic function, usually denoted h(s), is one that estimates the cost of the
best path from any state s to a goal. If s is a goal then h(s) = 0.

p(s) is known when we get to s.

s0 s1 s2 ··· s

h(s) estimates cost to nearest goal.

sgoal

This is a problem-dependent measure. We are required either to design it using


our knowledge of the problem, or by some other means.
The last point is critical: AI is a long way from being independent of human inge-
nuity.
55
Example: route-finding

Example: for route finding a reasonable heuristic function is


h(s) = straight line distance from s to the nearest goal

s0 1 s1 1 s2 s0 s1 s2


h(s1 ) = 2
h(s2 ) = 1

h(s0 ) = 5

Goal Goal

Accuracy here obviously depends on what the roads are really like.
Can we use h(s) in choosing a state to explore? If it’s really good it can work
well, but we can still do better!

56
A* search

A* search is the classical AI-oriented search algorithm.


A* search combines the good points of:

• Using p(s) to know how far we’ve come.


• Using h(s) to estimate how far we have to go.

It does this in a very simple manner: it uses path cost p(s) and also the heuristic
function h(s) by forming
f (s) = p(s) + h(s).
So: f (s) is the estimated cost of a path through s.
By using this as a priority for exploring states we get a search algorithm that is
optimal and complete under simple conditions, and can be vastly superior to the
more naïve approaches.
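A hedged sketch of A* as a tree search (assuming, purely for illustration, that expand(s) yields (action, successor) pairs and that cost and h are supplied by the caller):

import heapq

def astar_tree_search(s0, expand, goal, cost, h):
    # Fringe entries are (f, insertion order, state, p) with f = p + h(state).
    fringe = [(h(s0), 0, s0, 0)]
    counter = 1
    while fringe:
        f, _, s, p = heapq.heappop(fringe)   # state with the smallest f
        if goal(s):
            return s, p                      # a goal state and its path cost
        for a, s2 in expand(s):
            p2 = p + cost(a, s)
            heapq.heappush(fringe, (p2 + h(s2), counter, s2, p2))
            counter += 1
    return None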

57
A* search

Definition: an admissible heuristic h(s) is one that never overestimates the cost of
the best path from s to a goal.

p(s) is known when we get to s.

s0 s1 s2 ··· s

h(s) estimates cost to nearest goal.

sgoal
Actual path to nearest goal.
h(s) must underestimate this.

So if h′(s) denotes the actual distance from s to the goal we have


∀s. h(s) ≤ h′(s).
If h(s) is admissible then tree-search A* is optimal.
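A standard example (not taken from these slides): for the 8-puzzle, summing each tile’s horizontal and vertical distance from its goal square gives an admissible heuristic, because every move slides exactly one tile one square and so can reduce that sum by at most 1:

def manhattan(s, goal=(1, 2, 3, 4, 5, 6, 7, 8, 0)):
    # h(s): total Manhattan distance of tiles 1-8 from their goal squares.
    # h(goal) = 0, and h never overestimates the number of moves remaining.
    total = 0
    for tile in range(1, 9):
        r1, c1 = divmod(s.index(tile), 3)
        r2, c2 = divmod(goal.index(tile), 3)
        total += abs(r1 - r2) + abs(c1 - c2)
    return total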

58
A* tree-search is optimal for admissible h(s)

To see that tree-search A* is optimal we reason as follows. Let Goalopt be an


optimal goal state with f (Goalopt) = p(Goalopt) = fopt (because h(Goalopt) = 0).

At some point Goal2 is in the fringe.


Can it be selected before s?
Goal2

Goalopt

Let Goal2 be a suboptimal goal state with f (Goal2) = p(Goal2) = f2 > fopt. We
need to demonstrate that the search can never select Goal2.

59
A* tree-search is optimal for admissible h(s)

Let s be a state in the fringe on an optimal path to Goalopt. So


fopt ≥ p(s) + h(s) = f (s)
because h is admissible.
Now say Goal2 is chosen for expansion before s. This means that
f (s) ≥ f2
so we’ve established that
fopt ≥ f2 = p(Goal2).
But this means that Goalopt is not optimal: a contradiction.
And that’s all that’s needed for trees. But for searching on graphs we need a little
more…

60
A* graph search

Unfortunately for graph search the situation is trickier…

• Graph search can discard an optimal route if that route is not the first one
generated.
• We could keep only the least expensive path. This means updating, which is
extra work, not to mention messy, but sufficient to ensure optimality.
• Alternatively, we can impose a further condition on h(s) which forces the best
path to a repeated state to be generated first.

The required condition is called monotonicity. As


monotonicity −→ admissibility
this is an important property.

61
Monotonicity

Assume h is admissible. Remember that f (s) = p(s) + h(s) so if s′ follows s


p(s′) ≥ p(s)
and we expect that h(s′) ≤ h(s) although this does not have to be the case.

(Figure: a state s with p(s) = 5 and h(s) = 4, followed by a successor s′ with p(s′) = 6 and h(s′) = 1.)

Here f (s) = 9 and f (s′) = 7 so f (s′) < f (s).

62
Monotonicity

Monotonicity:

• If it is always the case that f (s′) ≥ f (s) then h(s) is called monotonic or con-
sistent.
• h(s) is monotonic if and only if it obeys the triangle inequality
h(s) ≤ cost(a, s) + h(s′)
where a is the action moving us from s to s′.
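To see why this is the natural condition, note that if s′ follows s via action a then
\[ f(s') = p(s') + h(s') = p(s) + \mathrm{cost}(a, s) + h(s') \geq p(s) + h(s) = f(s), \]
using the triangle inequality in the last step. So with a monotonic heuristic the f values never decrease along a path, which is the fact used shortly to establish optimality.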

63
Monotonicity

Why does this make sense?

(Figure: as before, s with p(s) = 5 and h(s) = 4, and its successor s′ with p(s′) = 6 and h(s′) = 1.)

The fact that f (s) = 9 tells us the cost of a path through s is at least 9 (because
h(s) is admissible).
But s′ is on a path through s. So to say that f (s′) = 7 makes no sense.

64
A* graph search is optimal for monotonic heuristics

The crucial fact from which optimality follows is that if h(s) is monotonic then
the values of f (s) along any path are non-decreasing.
We therefore have the following situation:

(Figure: the states are arranged in bands of increasing f ; you can’t deal with s′ until everything with f (s″) < f (s′) has been dealt with.)

Consequently everything with f (s″) < fopt gets explored. Then one or more
things with fopt get found (not necessarily all goals).

65
A* search is complete

A* search is complete provided:

1. The graph has finite branching factor.


2. There is a finite, positive constant ε such that each action has cost at least ε.

Why is this? The search expands nodes according to increasing f (s). So: the
only way it can fail to find a goal is if there are infinitely many nodes with
f (s) < f (Goal).
There are two ways this can happen:

1. There is a node with an infinite number of descendants.


2. There is a path with an infinite number of nodes but a finite path cost.

66
Complexity

We won’t be proving the following, but they are good things to know:

• A* search has a further desirable property: it is optimally efficient.


• This means that no other optimal algorithm that works by constructing paths
from the root can guarantee to examine fewer nodes.
• BUT : despite its good properties we’re not done yet…
• …A* search unfortunately still has exponential time complexity in most cases.
• As A* search also stores all the nodes it generates: once again it is generally
memory that becomes a problem before time.

67
IDA* - iterative deepening A* search

How might we improve the way in which A* search uses memory?

• Iterative deepening search used depth-first search with a limit on depth that
is gradually increased.
• IDA* does the same thing with a limit on f cost.

68
IDA* - iterative deepening A* search

The function contour searches from a specified state s as far as a specified limit
fLimit on f .
It returns either a path from s to a goal, or the next biggest value to try for the
limit on f .

1 function contour(s, fLimit, path)


2 nextF = ∞;
3 if f (s) > fLimit then
4 return ([], f (s));
5 if goal(s) then
6 return (s :: path, fLimit)
7 for s′ ∈ expand(s) do
8 (newPath, newF) = contour(s′ , fLimit, s :: path);
9 if newPath ! = [] then
10 return (newPath, fLimit);
11 nextF = min(nextF, newF);
12 return ([], nextF);

69
IDA* - iterative deepening A* search

1 function iterativeDeepeningAStar()
2 fLimit = f (s0 );
3 while true do
4 (path, fLimit) = contour(s0 , fLimit, []);
5 if path ! = [] then
6 return path;
7 if fLimit == ∞ then
8 return [];

70
IDA* - iterative deepening A* search

This is a little tricky to unravel, so here is an example:

7 4 5

Initially, the algorithm looks ahead and finds the smallest f cost that is greater
than its current f cost limit. The new limit is 4.

71
IDA* - iterative deepening A* search

It now does the same again:

7 4 5

5 9 10

Anything with f cost at most equal to the current limit gets explored, and the
algorithm keeps track of the smallest f cost that is greater than its current limit.
The new limit is 5.

72
IDA* - iterative deepening A* search

And again:

7 4 5

5 9 10 19 12 7

8 12 7

The new limit is 7, so at the next iteration the three arrowed nodes will be ex-
plored.

73
IDA* - iterative deepening A* search

Properties of IDA*:

• It is complete and optimal under the same conditions as A*.


• It is often good if we have step costs equal to 1.
• It does not require us to maintain a sorted queue of nodes.
• It only requires space proportional to the longest path.
• The time taken depends on the number of values h can take.

If h takes enough values to be problematic we can increase the limit on f by a


fixed ε at each stage, guaranteeing a solution at most ε worse than the optimum.

74
Recursive best-first search (RBFS)

Another method by which we can attempt to overcome memory limitations is


the Recursive Best-First Search (RBFS).
Idea: try to use f , but only use linear space by doing a depth-first search with a
few modifications:

1. We remember the f (s′) for the best alternative state s′ we’ve seen so far on
the way to the state s we’re currently considering.
2. If s has f (s) > f (s′):
• We go back and explore the best alternative…
• …and as we retrace our steps we replace the f cost of every state we’ve
seen in the current path with f (s). (See red text in pseudo-code.)

The replacement of f values as we retrace our steps provides a means of remem-


bering how good a discarded path might be, so that we can easily return to it
later.

75
Recursive best-first search (RBFS)

1 function rbfs(s, fLimit)


2 if goal(s) then
3 return (SOME s, fLimit);
4 if expand(s) = ∅ then
5 return (NONE, ∞);
6 for each s′ ∈ expand(s) do
7 f (s′ ) = maximum(f (s′ ), f (s));
8 while true do
9 best = s′ ∈ expand(s) with smallest f (s′ );
10 if f (best) > fLimit then
11 return (NONE, f (best));
12 nextBest = s′ ∈ expand(s) with second smallest f (s′ );
13 (result, f ′ ) = rbfs(best, minimum(fLimit, f (nextBest)));
14 f (best) = f ′ ;
15 if result ! = NONE then
16 return (result, f ′ );

76
Recursive best-first search (RBFS): an example

This function is called using rbfs(s0, ∞) to begin the process.


Function call number 1:

3
fLimit1 = ∞

7 4 best1 5
nextBest1 = 5

Now perform the recursive function call (result2 , f ′ ) = rbfs(best1 , 5)


so f (best1 ) takes the returned value f ′

77
Recursive best-first search (RBFS): an example

Function call number 2:

3 fLimit1 = ∞
fLimit2 = 5

7 4 best1 5
nextBest1 = 5

nextBest2 = 9
5 9 10
best2

Now perform the recursive function call (result3 , f ′ ) = rbfs(best2 , 5)


so f (best2 ) takes the returned value f ′

78
Recursive best-first search (RBFS): an example

Function call number 3 :

fLimit1 = ∞
3 fLimit2 = 5
fLimit3 = 5

7 4 best1 5
nextBest1 = 5

5 replaced by 10
nextBest2 = 9
5 9 10
best2

11 12 10
nextBest3 = 11 best3

Now f (best3) > fLimit3 so the function call returns (NONE, 10) into (result3, f ′)
and f (best2) = 10.

79
Recursive best-first search (RBFS): an example

The while loop for function call 2 now repeats:

3 fLimit1 = ∞
fLimit2 = 5
4 replaced by 9
7 4 best1 5
nextBest1 = 5

5 replaced by 10
5 9 best2 10

11 12 10

Now f (best2) > fLimit2 so the function call returns (NONE, 9) into (result2, f ′)
and f (best1) = 9.

80
Recursive best-first search (RBFS): an example

The while loop for function call 1 now repeats:

3 fLimit1 = ∞

4 replaced by 9
7 4 5
nextBest1 = 7 best1

5 replaced by 10
5 9 10

11 12 10

We do a further function call to expand the new best node, and so on…

81
Recursive best-first search (RBFS)

Some nice properties:

• If h is admissible then RBFS is optimal.


• Memory requirement is O(bd)
• Generally more efficient than IDA*.

And some less nice ones:

• Time complexity is hard to analyse, but can be exponential.


• Can spend a lot of time re-generating nodes.

To some extent IDA* and RBFS throw the baby out with the bathwater.

• They limit memory too harshly, so…


• …we can try to use all available memory.

MA* and SMA* will not be covered in this course…

82
Local search

Sometimes, it’s only the goal that we’re interested in. The path needed to get
there is irrelevant.

• For example: VLSI layout, factory design, automatic programming…


• We are now simply searching for a state that is in some sense the best.
• This is also known as optimisation.

This leads to the remarkably simple concept of local search.

83
Local search

Instead of trying to find a path from start state to goal, we explore the local area
of the graph, meaning those states one edge away from the one we’re at:

(Figure: a state and the states one edge away from it in the graph, labelled with f values 52, 24, 1, 24 and 29.)

We assume that we have a function f (s) such that f (s′) > f (s) indicates s′ is
preferable to s.

84
The m-queens problem

You may be familiar with the m-queens problem.

Find an arrangement of m queens on an m by m board such that no queen is


attacking another.
In the Prolog course you may have been tempted to generate permutations of
row numbers and test for attacks.
This is a hopeless strategy for large m. (Imagine m ≈ 1,000,000.)

85
The m-queens problem

We might however consider the following:

• A state s for an m by m board is a sequence of m numbers drawn from the


set {1, . . . , m}, possibly including repeats.
• We move from one state to another by moving a single queen to any alterna-
tive row.
• We define f (s) to be the number of pairs of queens attacking one another in
the new position2. (Regardless of whether or not the attack is direct.)

2
Note that we actually want to minimize f here. This is equivalent to maximizing −f , and I will generally use whichever seems more appropriate.

86
The m-queens problem

Here, we have {4, 3, ?, 8, 6, 2, 4, 1} and the f values for the undecided queen are
shown.

As we can choose which queen to move, each state in fact has 56 neighbours in
the graph.
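A small illustrative sketch of this f in Python (state[i] holds the row of the queen in column i, and a pair is counted whether or not another queen lies between them):

def attacking_pairs(state):
    # f(s): the number of pairs of queens attacking one another.
    m, count = len(state), 0
    for i in range(m):
        for j in range(i + 1, m):
            same_row = state[i] == state[j]
            same_diagonal = abs(state[i] - state[j]) == j - i
            if same_row or same_diagonal:
                count += 1
    return count

Hill-climbing then tries to drive this value down to zero (equivalently, to maximise −f).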

87
Hill-climbing search

Hill-climbing search is remarkably simple:

1 Generate a start state s;


2 while true do
3 Generate the neighbours N = {s1 , . . . , sp } of s;
4 Nf = {f (si )|si ∈ N };
5 if max Nf ≤ f (s) then
6 return s;
7 s = si ∈ N with maximum f (si );

In fact, that looks so simple that it’s amazing the algorithm is at all useful.
In this version we stop when we get to a node with no better neighbour.

88
Hill-climbing search: the reality

We might alternatively allow sideways moves by changing the stopping condi-


tion:

1 if max Nf < f (s) then


2 return s;

Why would we consider doing this?

89
Hill-climbing search: the reality

In reality, nature has a number of ways of shaping f to complicate the search


process.

(Figure: a function f (s) with a global maximum, several local maxima and a plateau.)

Sideways moves allow us to move across plateaus.


However, should we ever find a local maximum then we’ll return it: we won’t
keep searching to find a global maximum.

90
Hill-climbing search: the reality

Of course, the fact that we’re dealing with a general graph means we need to think
of something like the preceding figure, but in a very large number of dimensions,
and this makes the problem much harder.
There is a body of techniques for trying to overcome such problems. For example:

• Stochastic hill-climbing: Choose a neighbour at random, perhaps with a prob-


ability depending on its f value. For example: let N (s) denote the neighbours
of s. Define
\[ N^+(s) = \{ s' \in N(s) \mid f(s') \geq f(s) \} \]
\[ N^-(s) = \{ s' \in N(s) \mid f(s') < f(s) \}. \]
Then
\[ \Pr(s') = \begin{cases} 0 & \text{if } s' \in N^-(s) \\ \frac{1}{Z}\,(f(s') - f(s)) & \text{otherwise} \end{cases} \]
where Z is a normalising constant.

91
Hill-climbing search: the reality

• First choice: Generate neighbours at random. Select the first one that is better
than the current one. (Particularly good if nodes have many neighbours.)
• Random restarts: Run a procedure k times with a limit on the time allowed
for each run.
Note: generating a start state at random may itself not be straightforward.
• Simulated annealing: Similar to stochastic hill-climbing, but start with lots of
random variation and reduce it over time.
Note: in some cases this is provably an effective procedure, although the time
taken may be excessive if we want the proof to hold.
• Beam search: Maintain k states at any given time. At each search step, find
the successors of each, and retain the best k from all the successors.
Note: this is not the same as random restarts.

92
Gradient ascent and related methods

For some problems3 we do not have a search graph, but a continuous search
space.


Typically, we have a function f (x) : R^n → R and we want to find
\[ x_{\mathrm{opt}} = \operatorname*{argmax}_{x} f(x). \]
3
For the purposes of this course, the training of neural networks is a notable example.

93
Gradient ascent and related methods

In a single dimension we can clearly try to solve


\[ \frac{df(x)}{dx} = 0 \]
to find the stationary points, and use
\[ \frac{d^2 f(x)}{dx^2} \]
to find a global maximum. In multiple dimensions the equivalent is to solve
\[ \nabla f(x) = \frac{\partial f(x)}{\partial x} = 0 \]
where
\[ \frac{\partial f(x)}{\partial x} = \begin{bmatrix} \frac{\partial f(x)}{\partial x_1} & \frac{\partial f(x)}{\partial x_2} & \cdots & \frac{\partial f(x)}{\partial x_n} \end{bmatrix}. \]

and the equivalent of the second derivative is the Hessian matrix
\[ H = \begin{bmatrix}
\frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \frac{\partial^2 f(x)}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2}
\end{bmatrix}. \]

94
Gradient ascent and related methods

However this approach is usually not analytically tractable regardless of dimen-


sionality.
The simplest way around this is to employ gradient ascent:

• Start with a randomly chosen point x0.


• Using a small step size ε, iterate using the equation
\[ x_{i+1} = x_i + \epsilon \nabla f(x_i). \]

This can be understood as follows:

• At the current point xi the gradient ∇f (xi) tells us the direction and magni-
tude of the slope at xi.
• Adding ε∇f (xi) therefore moves us a small distance upward.

This is perhaps more easily seen graphically. . .
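A minimal numerical sketch (the step size, number of iterations and example function are arbitrary choices made here for illustration):

import numpy as np

def gradient_ascent(grad_f, x0, epsilon=0.1, steps=100):
    # Iterate x_{i+1} = x_i + epsilon * grad_f(x_i).
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x + epsilon * grad_f(x)
    return x

# Example: f(x) = -(x_1^2 + x_2^2) has gradient -2x and its maximum at the origin.
print(gradient_ascent(lambda x: -2.0 * x, [3.0, -4.0]))   # converges towards [0, 0]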

95
Gradient ascent and related methods

Here we have a simple parabolic surface:

(Figure: gradient ascent steps on the parabolic surface.)

With ε = 0.1 the procedure is clearly effective at finding the maximum.


Note however that the steps are small, and in a more realistic problem it might
take some time. . .

96
Gradient ascent and related methods

Simply increasing the step size ε can lead to a different problem:


We can easily jump too far. . .

97
Gradient ascent and related methods

There is a large collection of more sophisticated methods. For example:

• Line search: increase ε until f decreases and maximise in the resulting interval.
Then choose a new direction to move in. Conjugate gradients, the Fletcher-
Reeves and Polak-Ribière methods etc.
• Use H to exploit knowledge of the local shape of f . For example the Newton-
Raphson and Broyden-Fletcher-Goldfarb-Shanno (BFGS) methods etc.

98
Artificial Intelligence

Games (adversarial search)

Reading: AIMA chapter 5.


99
Solving problems by search: playing games

How might an agent act when the outcomes of its actions are not known because
an adversary is trying to hinder it?

• This is essentially a more realistic kind of search problem because we do not


know the exact outcome of an action.
• This is a common situation when playing games: in chess, draughts, and so
on an opponent responds to our moves.

Game playing has been of interest in AI because it provides an idealisation of a


world in which two agents act to reduce each other’s well-being.
We now look at:

• How game-playing can be modelled as search.


• The minimax algorithm for game-playing.
• Some problems inherent in the use of minimax.
• The concept of α − β pruning.

100
Playing games: search against an adversary

Despite the fact that games are an idealisation, game playing can be an excellent
source of hard problems. For instance with chess:

• The average branching factor is roughly 35.


• Games can reach 50 moves per player.
• So a rough calculation gives the search tree 35^100 nodes.
• Even if only different, legal positions are considered it’s about 10^40.

So: in addition to the uncertainty due to the opponent:

• We can’t make a complete search to find the best move…


• … so we have to act even though we’re not sure about the best thing to do.

And chess isn’t even very hard: Go is much harder…


Note: yes, more advanced learning-based methods have conquered chess and Go, but that’s an entirely different approach with its own pros and cons.

101
Perfect decisions in a two-person game

Say we have two players. Traditionally, they are called Max and Min for reasons
that will become clear.

• We’ll use noughts and crosses as an initial example.


• Max moves first.
• The players alternate until the game ends.
• At the end of the game, prizes are awarded. (Or punishments administered—
EVIL ROBOT is starting up his favourite chainsaw…)

This is exactly the same game format as chess, Go, draughts and so on.

102
Perfect decisions in a two-person game

Games like this can be modelled as search problems as follows:

• There is an initial state.

Max to move

• There is a set of operators. Here, Max can place a cross in any empty square,
or Min a nought.
• There is a terminal test. Here, the game ends when three noughts or three
crosses are in a row, or there are no unused spaces.
• There is a utility or payoff function. This tells us, numerically, what the out-
come of the game is.

This is enough to model the entire game.

103
Perfect decisions in a two-person game

We can construct a tree to represent a game.


From the initial state Max can make nine possible moves:

. . .

Then it’s Min’s turn…

104
Perfect decisions in a two-person game

For each of Max’s opening moves Min has eight replies:

. . .

. . .

And so on…
This can be continued to represent all possibilities for the game.

105
Perfect decisions in a two-person game

(Figure: the game tree expanded down to its leaves, which are labelled with utilities −1, 0 and +1.)

At the leaves a player has won or there are no spaces. Leaves are labelled using
the utility function.

106
Perfect decisions in a two-person game

How can Max use this tree to decide on a first move?


Consider a much simpler tree:

Labels on the leaves denote utility.


High values are preferred by Max.
Low values are preferred by Min.

4 5 2 20 20 15 6 7 1 4 10 9 5 8 5 4

If Max is rational he will play to reach a position with the biggest utility possible.
But if Min is rational she will play to minimise the utility available to Max.

107
The minimax algorithm

There are two moves: Max then Min. Game theorists would call this one move,
or two ply deep.
The minimax algorithm allows us to infer the best move that the current player
can make, given the utility function, by working backward from the leaves.

2 6 1 4

4 5 2 20 20 15 6 7 1 4 10 9 5 8 5 4

As Min plays the last move, she minimises the utility available to Max.

108
The minimax algorithm

Moving one further step up the tree:

2 6 1 4

4 5 2 20 20 15 6 7 1 4 10 9 5 8 5 4

We can see that Max’s best opening move is move 2, as this leads to the node
with highest utility.

109
The minimax algorithm

In general:

• Generate the complete tree and label the leaves according to the utility func-
tion.
• Working from the leaves of the tree upward, label the nodes depending on
whether Max or Min is to move.
• If Min is to move label the current node with the minimum utility of any
descendant.
• If Max is to move label the current node with the maximum utility of any
descendant.

If the game is p ply and at each point there are q available moves then this process
has (surprise, surprise) O(q^p) time complexity and space complexity linear in p
and q.
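A hedged sketch of this procedure in Python (successors, terminal and utility are placeholders standing for the operators, terminal test and payoff function described earlier):

def minimax(state, to_move, successors, terminal, utility):
    # Value of 'state' assuming both players play optimally from here on.
    if terminal(state):
        return utility(state)
    next_player = 'MIN' if to_move == 'MAX' else 'MAX'
    values = [minimax(s, next_player, successors, terminal, utility)
              for s in successors(state)]
    # Max labels a node with the maximum value of any descendant,
    # Min with the minimum.
    return max(values) if to_move == 'MAX' else min(values)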

110
Making imperfect decisions

We need to avoid searching all the way to the end of the tree.
So:

• We generate only part of the tree: instead of testing whether a node is a leaf
we introduce a cut-off test telling us when to stop.
• Instead of a utility function we introduce an evaluation function for the eval-
uation of positions for an incomplete game.

The evaluation function attempts to measure the expected utility of the current
game position.

111
Making imperfect decisions

How can this be justified?

• This is a strategy that humans clearly sometimes make use of.


• For example, when using the concept of material value in chess.
• The effectiveness of the evaluation function is critical…
• … but it must be computable in a reasonable time.
• (In principle it could just be done using minimax.)

The importance of the evaluation function cannot be overstated—it is probably


the most important part of the design.

112
The evaluation function

Designing a good evaluation function can be extremely tricky:

• Let’s say we want to design one for chess by giving each piece its material
value: pawn = 1, knight/bishop = 3, rook = 5 and so on.
• Define the evaluation of a position to be the difference between the material
value of black’s and white’s pieces
\[ \mathrm{eval}(\text{position}) = \sum_{\text{black's pieces } p_i} \text{value of } p_i \;-\; \sum_{\text{white's pieces } q_i} \text{value of } q_i \]

This seems like a reasonable first attempt. Why might it go wrong?

• Until the first capture the evaluation function gives 0, so in fact we have
a category containing many different game positions with equal estimated
utility.
• For example, all positions where white is one pawn ahead.

So in fact this seems highly naïve …

113
The evaluation function

We can try to learn an evaluation function.

• For example, using material value, construct a weighted linear evaluation function
\[ \mathrm{eval}(\text{position}) = \sum_{i=1}^{n} w_i f_i \]
where the wi are weights and the fi represent features of the position—in this
case, the value of the ith piece.
• Weights can be chosen by allowing the game to play itself and using learning
techniques to adjust the weights to improve performance.

However in general

• Here we probably want to give different evaluations to individual positions.


• The design of an evaluation function can be highly problem dependent and
might require significant human input and creativity.

114
α − β pruning

Even with a good evaluation function and cut-off test, the time complexity of the
minimax algorithm makes it impossible to write a good chess program without
some further improvement.

• Assuming we have 150 seconds to make each move, for chess we would be
limited to a search of about 3 to 4 ply whereas…
• …even an average human player can manage 6 to 8.

Luckily, it is possible to prune the search tree without affecting the outcome and
without having to examine all of it.

115
α − β pruning

Returning for a moment to the earlier, simplified example:

[Figure: the two-ply game tree again, with leaf utilities 4 5 2 20 | 20 15 6 7 | 1 4 10 9 | 5 8 5 4.]

The search is depth-first and left to right.

116
α − β pruning

The search continues as previously for the first 8 leaves.

[Figure: after examining the ninth leaf (utility 1), the three Min nodes explored so far are labelled 2, 6 and ≤ 1. The remaining leaves 4, 10 and 9 under Max's third move are never examined.]

Then we note: if Max plays move 3 then Min can reach a leaf with utility at most
1.
So: we don’t need to search any further under Max’s opening move 3. This is be-
cause the search has already established that Max can do better by making open-
ing move 2.

117
α − β pruning in general

Remember that this search is depth-first. We’re only going to use knowledge of
nodes on the current path.

[Figure: a path in the search tree, alternating Player (maximising) and Opponent (minimising) nodes.]

α = m tells us that the value of an earlier Player node on the current path is ≥ m.
The value of α is updated as the search progresses.

Deeper on the same path there is another Player node: searching beneath it has
established that the opponent can force a score of m′ there, so its value is ≥ m′.

While searching under the current Opponent node we find that the opponent can
force a score of n.

If n < m we can stop: there is a better choice earlier in the game.

If n < m′ we can stop: the player maximises and will never move here.

So: once you’ve established that n is sufficiently small, you don’t need to explore
any more of the corresponding node’s children.

118
α − β pruning in general

The situation is exactly analogous if we swap player and opponent in the previous
diagram.
The search is depth-first, so we’re only ever looking at one path through the tree.
We need to keep track of the values α and β where
α = the highest utility seen so far on the path for Max
β = the lowest utility seen so far on the path for Min
Assume Max begins. Initial values for α and β are
α = −∞
and
β = +∞.

119
α − β pruning in general

So: we start with the function call


player(−∞, +∞, root)
The following function implements the procedure suggested by the previous di-
agram:

function player(α, β, n)
    if cutoff(n) then
        return eval(n);
    value = −∞;
    for each successor n′ of n do
        value = max(value, opponent(α, β, n′));
        if value ≥ β then
            return value;
        if value > α then
            α = value;
    return value;

120
α − β pruning in general

The function opponent is exactly analogous:

function opponent(α, β, n)
    if cutoff(n) then
        return eval(n);
    value = ∞;
    for each successor n′ of n do
        value = min(value, player(α, β, n′));
        if value ≤ α then
            return value;
        if value < β then
            β = value;
    return value;

Note: the semantics here is that parameters are passed to functions by value.

121
α − β pruning in general

Applying this to the earlier example and keeping track of the values for α and β
you should obtain:

[Figure: the trace of α−β search on the example tree. At the root α starts at −∞ and rises to 2 and then 6 as the first two Min nodes return 2 and 6; the third Min node is abandoned after its first leaf, returning ≤ 1. At the Min nodes β starts at +∞ and falls to 2, 6 and 1 respectively. Only the leaves 4, 5, 2, 20, 20, 15, 6, 7 and 1 are examined.]

122
How effective is α − β pruning?

(Warning: the theoretical results that follow are somewhat idealised.)


A quick inspection should convince you that the order in which moves are ar-
ranged in the tree is critical.
So, it seems sensible to try good moves first:

• If you were to have a perfect move-ordering technique then α − β pruning
  would be O(q^{p/2}) as opposed to O(q^p).
• Consequently the branching factor would effectively be √q instead of q.
• We would therefore expect to be able to search ahead twice as many moves as
  before.

However, this is not realistic: if you had such an ordering technique you’d be
able to play perfect games!

123
How effective is α − β pruning?

If moves are arranged at random then α − β pruning is:

• O((q/log q)^p) asymptotically when q > 1000 or…
• …about O(q^{3p/4}) for reasonable values of q.

In practice simple ordering techniques can get close to the best case. For example,
if we try captures, then threats, then moves forward etc.
Alternatively, we can implement an iterative deepening approach and use the
order obtained at one iteration to drive the next.

124
A further optimisation: the transposition table

Finally, note that many games correspond to graphs rather than trees because the
same state can be arrived at in different ways.

• This is essentially the same effect we saw in heuristic search: recall graph
search versus tree search.
• It can be addressed in a similar way: store a state with its evaluation in a hash
table—generally called a transposition table—the first time it is seen.

The transposition table is essentially equivalent to the closed list introduced as


part of graph search.
This can vastly increase the effectiveness of the search process, because we don’t
have to evaluate a single state multiple times.
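
A minimal sketch of the idea, assuming states are hashable and reusing the game helpers assumed earlier:

# Memoised minimax using a transposition table: a state reachable by several
# move orders is evaluated only once. A sketch; the helpers are assumptions.
def minimax_tt(state, max_to_move, is_leaf, utility, successors, table=None):
    table = {} if table is None else table
    key = (state, max_to_move)
    if key in table:
        return table[key]
    if is_leaf(state):
        value = utility(state)
    else:
        children = [minimax_tt(s, not max_to_move, is_leaf, utility, successors, table)
                    for s in successors(state)]
        value = max(children) if max_to_move else min(children)
    table[key] = value
    return value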

125
Artificial Intelligence

Constraint satisfaction problems (CSPs)

Reading: AIMA chapter 6.


126
Constraint satisfaction problems (CSPs)

The search scenarios examined so far seem in some ways unsatisfactory.

• States were represented using an arbitrary and problem-specific data struc-


ture.
• Heuristics were also problem-specific.
• It would be nice to be able to transform general search problems into a stan-
dard format.

CSPs standardise the manner in which states and goal tests are represented. By
standardising like this we benefit in several ways:

• We can devise general purpose algorithms and heuristics.


• We can look at general methods for exploring the structure of the problem.
• Consequently it is possible to introduce techniques for decomposing prob-
lems.
• We can try to understand the relationship between the structure of a problem
and the difficulty of solving it.
127
Introduction to constraint satisfaction problems

We now return to the idea of problem solving by search and examine it from this
new perspective.
Aims:

• To introduce the idea of a constraint satisfaction problem (CSP) as a general


means of representing and solving problems by search.
• To look at a backtracking algorithm for solving CSPs.
• To look at some general heuristics for solving CSPs.
• To look at more intelligent ways of backtracking.

Another method of interest in AI that allows us to do similar things involves


transforming to a propositional satisfiability problem.
We’ll see an example of this—and of the application of CSPs—when we discuss
planning.

128
Constraint satisfaction problems

We have:

• A set of n variables V1, V2, . . . , Vn.


• For each Vi a domain Di specifying the values that Vi can take.
• A set of m constraints C1, C2, . . . , Cm.

Each constraint Ci involves a set of variables and specifies an allowable collection


of values.

• A state is an assignment of specific values to some or all of the variables.


• An assignment is consistent if it violates no constraints.
• An assignment is complete if it gives a value to every variable.

A solution is a consistent and complete assignment.

129
Example

We will use the problem of colouring the nodes of a graph as a running example.

[Figure: a graph with nodes numbered 1 to 8, shown twice—once uncoloured and once coloured. Node 8 is not connected to any other node.]

Each node corresponds to a variable. We have three colours and directly con-
nected nodes should have different colours.

130
Example

This translates easily to a CSP formulation:

• The variables are the nodes


Vi = node i
• The domain for each variable contains the values black, red and cyan
Di = {B, R, C}

• The constraints enforce the idea that directly connected nodes must have dif-
ferent colours. For example, for variables V1 and V2 the constraints specify
(B, R), (B, C), (R, B), (R, C), (C, B), (C, R)

• Variable V8 is unconstrained.
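
As an illustration only, the formulation can be written down directly in Python. The particular edge set below is an assumption; substitute the edges of whatever graph is being coloured.

# A direct encoding of the graph-colouring CSP. The edge set is illustrative.
variables = list(range(1, 9))                       # V1 ... V8
domains = {v: {"B", "R", "C"} for v in variables}   # every Di = {B, R, C}
edges = {(1, 2), (1, 3), (2, 3), (2, 4), (3, 5), (4, 5), (4, 7), (5, 6), (5, 7)}

# Binary constraint: directly connected nodes must take different colours.
def consistent(var1, val1, var2, val2):
    if (var1, var2) in edges or (var2, var1) in edges:
        return val1 != val2
    return True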

131
Different kinds of CSP

This is an example of the simplest kind of CSP: it is discrete with finite domains.
We will concentrate on these.
We will also concentrate on binary constraints; that is, constraints between pairs
of variables.

• Constraints on single variables—unary constraints—can be handled by adjust-


ing the variable’s domain. For example, if we don’t want Vi to be red, then
we just remove that possibility from Di.
• Higher-order constraints applying to three or more variables can certainly be
considered, but…
• …when dealing with finite domains they can always be converted to sets of
binary constraints by introducing extra auxiliary variables.

How does that work?

132
Auxiliary variables

Example: three variables each with domain {B, R, C}.


A single constraint
(C, C, C), (R, B, B), (B, R, B), (B, B, R)

New, binary constraints:


(A = 1, V1 = C), (A = 1, V2 = C), (A = 1, V3 = C)
(A = 2, V1 = R), (A = 2, V2 = B), (A = 2, V3 = B)
(A = 3, V1 = B), (A = 3, V2 = R), (A = 3, V3 = B)
(A = 4, V1 = B), (A = 4, V2 = B), (A = 4, V3 = R)

[Figure: on the left the original constraint connects V1, V2 and V3 directly; on the right the auxiliary variable A is connected to each of V1, V2 and V3 by a binary constraint.]

Introducing auxiliary variable A with domain {1, 2, 3, 4} allows us to convert this


to a set of binary constraints.

133
Backtracking search

Backtracking search now takes on a very simple form: search depth-first, assign-
ing a single variable at a time, and backtrack if no valid assignment is available.
Using the graph colouring example, the search now looks something like this…

[Figure: the top of the backtracking search tree. The root branches on 1 = B, 1 = R and 1 = C; beneath 1 = B we branch on 2 = B, 2 = R and 2 = C; beneath 1 = B, 2 = R we branch on 3 = B, 3 = R and 3 = C; and so on.]

…and new possibilities appear.

134
Backtracking search

[Figure: the graph with the partial assignment 1 = B, 2 = R, 3 = C, 4 = B, 5 = R, 6 = B. Nothing is available for 7, so either assign 8 or backtrack.]

Rather than using problem-specific heuristics to try to improve searching, we


can now explore heuristics applicable to general CSPs.

135
Backtracking search

Starting with:
backtrack([], problemDescription)

function backTrack(assignmentList, problemDescription)
    if assignmentList is complete then
        return SOME assignmentList;
    nextVar = getNextVariable(assignmentList, problemDescription);
    for each v in orderValues(nextVar, assignmentList, problemDescription) do
        if v is consistent with assignmentList then
            add “nextVar = v” to assignmentList;
            solution = backTrack(assignmentList, problemDescription);
            if solution is not FAIL then
                return solution;
            remove “nextVar = v” from assignmentList;
    return FAIL;

136
Backtracking search: possible heuristics

There are several points we can examine in an attempt to obtain general CSP-
based heuristics:

• In what order should we try to assign variables?


• In what order should we try to assign possible values to a variable?

Or being a little more subtle:

• What effect might the values assigned so far have on later attempted assign-
ments?
• When forced to backtrack, is it possible to avoid the same failure later on?
• Can we try to force the search in a successful direction (remember the use of
heuristics)?
• Can we try to force failures/backtracks to occur quickly?

137
Heuristics I: Choosing the order of variable assignments and values

Say we have 1 = B and 2 = R

[Figure: the graph with 1 = B and 2 = R. At this point there is only one possible assignment for 3, whereas the other unassigned variables have more flexibility.]

Assigning such variables first is called the minimum remaining values (MRV)
heuristic.
(Alternatively, the most constrained variable or fail first heuristic.)

138
Heuristics I: Choosing the order of variable assignments and values

How do we choose a variable to begin with?


The degree heuristic chooses the variable involved in the most constraints on as
yet unassigned variables.

[Figure: the uncoloured graph. Start with 3, 5 or 7—each of these is involved in the most constraints on as yet unassigned variables.]

MRV is usually better but the degree heuristic is a good tie breaker.

139
Heuristics I: Choosing the order of variable assignments and values

Once a variable is chosen, in what order should values be assigned?

[Figure: choosing 1 = C is bad as it removes the final possibility for 3. The heuristic therefore prefers 1 = B.]

The least constraining value heuristic chooses first the value that leaves the max-
imum possible freedom in choosing assignments for the variable’s neighbours.
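
Sketches of these heuristics in Python, assuming a helper legal_values(var, assignment) and a neighbours map (both invented names, not part of the notes):

# Variable- and value-ordering heuristics (sketch; helper names are assumptions).
def mrv(unassigned, assignment, legal_values):
    # Minimum remaining values: the variable with the fewest legal values left.
    return min(unassigned, key=lambda v: len(legal_values(v, assignment)))

def degree(unassigned, neighbours):
    # Degree heuristic: the variable constraining the most unassigned variables.
    return max(unassigned, key=lambda v: sum(1 for n in neighbours[v] if n in unassigned))

def least_constraining_values(var, assignment, neighbours, legal_values):
    # Order values by how many options they remove from unassigned neighbours.
    def ruled_out(value):
        trial = dict(assignment)
        trial[var] = value
        return sum(len(legal_values(n, assignment)) - len(legal_values(n, trial))
                   for n in neighbours[var] if n not in assignment)
    return sorted(legal_values(var, assignment), key=ruled_out)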

140
Heuristics II: forward checking and constraint propagation

Continuing the previous slide’s progress, now add 1 = C.

[Figure: after assigning 1 = C, the value C is ruled out as an assignment to 2 and 3.]

Each time we assign a value to a variable, it makes sense to delete that value from
the collection of possible assignments to its neighbours.
This is called forward checking. It works nicely in conjunction with MRV.
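
A minimal sketch of forward checking, assuming domains are stored as sets and a pairwise consistent test like the one sketched earlier:

# Forward checking (sketch): after assigning var = value, delete inconsistent
# values from the domains of unassigned neighbours; fail early if one empties.
def forward_check(var, value, domains, neighbours, assignment, consistent):
    pruned = []                             # record deletions so they can be undone
    for n in neighbours[var]:
        if n in assignment:
            continue
        for v in set(domains[n]):
            if not consistent(var, value, n, v):
                domains[n].discard(v)
                pruned.append((n, v))
        if not domains[n]:
            return None                     # a neighbour has no values left
    return pruned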

141
Heuristics II: forward checking and constraint propagation

We can visualise this process as follows:


1 2 3 4 5 6 7 8
Start BRC BRC BRC BRC BRC BRC BRC BRC
2=B RC =B RC RC BRC BRC BRC BRC
3=R C =B =R RC BC BRC BC BRC
6=B C =B =R RC C =B C BRC
5=C C =B =R R =C =B ! BRC

At the fourth step 7 has no possible assignments left.


However, we could have detected a problem a little earlier…

142
Heuristics II: forward checking and constraint propagation

…by looking at step three.


1 2 3 4 5 6 7 8
Start BRC BRC BRC BRC BRC BRC BRC BRC
2=B RC =B RC RC BRC BRC BRC BRC
3=R C =B =R RC BC BRC BC BRC
6=B C =B =R RC C =B C BRC
5=C C =B =R R =C =B ! BRC

• At step three, 5 can be C only and 7 can be C only.


• But 5 and 7 are connected.
• So we can’t progress, but this hasn’t been detected.
• Ideally we want to do constraint propagation.

Trade-off: time to do the search, against time to explore constraints.

143
Constraint propagation

Arc consistency:
Consider a constraint as being directed. For example 4 → 5.
In general, say we have a constraint i → j and currently the domain of i is Di
and the domain of j is Dj .
i → j is consistent if
∀d ∈ Di . ∃d′ ∈ Dj such that assigning i = d and j = d′ satisfies the constraint.
Example:
In step three of the table, D4 = {R, C} and D5 = {C}.

• 5 → 4 in step three of the table is consistent.


• 4 → 5 in step three of the table is not consistent.

4 → 5 can be made consistent by deleting C from D4.


Or in other words, regardless of what you assign to i you’ll be able to find some-
thing valid to assign to j.
144
Enforcing arc consistency

We can enforce arc consistency each time a variable i is assigned.

• We need to maintain a collection of arcs to be checked.


• Each time we alter a domain, we may have to include further arcs in the
collection.

This is because if i → j is inconsistent resulting in a deletion from Di we may as


a consequence make some arc k → i inconsistent.
Why is this?

145
Enforcing arc consistency

[Figure: variables k1, k2, . . . , kK are each connected to i by a constraint, and i is connected to j. Initially Di = {R, B}, Dj = {B} and DkK = {R}.]

i → j is not consistent, so we delete B from the domain of i; i → j is now consistent.
Before the deletion kK → i was consistent, but only because kK = R could be paired with i = B.
After the deletion Di = {R}, and kK → i is no longer consistent, because kK = R cannot be paired with i = R.

• i → j inconsistent means removing a value from Di.
• ∃d ∈ Di such that there is no valid d′ ∈ Dj, so we delete d from Di.

However some d″ ∈ Dk may only have been pairable with d.


We need to continue until all consequences are taken care of.

146
The AC-3 algorithm

function AC-3(problemDescription)
    Queue toCheck = [ all arcs i → j ];
    while toCheck is not empty do
        i → j = next(toCheck);
        if removeInconsistencies(Di, Dj) then
            for each k that is a neighbour of i, where k ∉ {i, j} do
                add k → i to toCheck;

function removeInconsistencies(D1, D2)
    Bool result = FALSE;
    for each d ∈ D1 do
        if no d′ ∈ D2 valid with d then
            remove d from D1;
            result = TRUE;
    return result;

147
Enforcing arc consistency

Complexity:

• A binary CSP with n variables can have O(n2) directional constraints i → j.


• Any i → j can be considered at most d times, where d = max_k |Dk|, because it is only returned to the collection when something is removed from Dj, and only d things can be removed from Dj.
• Checking any single arc for consistency can be done in O(d2).

So the complexity is O(n²d³).

Note: deciding whether a CSP of this kind has a solution includes 3SAT as a special case, so we can't expect to check for consistency in polynomial time.
Consequence: a polynomial-time procedure such as this cannot be guaranteed to find all inconsistencies.

148
A more powerful form of consistency

We can define a stronger notion of consistency as follows:

• Given: any k − 1 variables and any consistent assignment to these.


• Then: We can find a consistent assignment to any kth variable.

This is known as k-consistency.


Strong k-consistency requires that we be k-consistent, (k − 1)-consistent and so on, as far
down as 1-consistent.
If we can demonstrate strong n-consistency (where as usual n is the number of
variables) then an assignment can be found in O(nd).
Unfortunately, demonstrating strong n-consistency will be worst-case exponen-
tial.

149
Backjumping

The basic backtracking algorithm backtracks to the most recent assignment. This
is known as chronological backtracking. It is not always the best policy:

[Figure: the graph alongside the current search path 1, 3, 5, 4, with no assignment available for 7.]

Say we’ve assigned 1 = B, 3 = R, 5 = C and 4 = B and now we want to as-


sign something to 7. This isn’t possible so we backtrack, however re-assigning 4
clearly doesn’t help.

150
Backjumping

With some careful bookkeeping it is often possible to jump back multiple levels
without sacrificing the ability to find a solution.
We need some definitions:

• When we set a variable Vi to some value d ∈ Di we refer to this as the assign-


ment Ai = (Vi ← d).
• A partial instantiation Ik = {A1, A2, . . . , Ak } is a consistent set of assignments
to the first k variables…
• … where consistent means that no constraints are violated.
• Conversely, Ik conflicts with some variable V if no value for V is consistent
with Ik .

Henceforth we shall assume that variables are assigned in the order V1, V2, . . . , Vn
when formally presenting algorithms.

151
Gaschnig’s algorithm

Gaschnig’s algorithm works as follows. Say we have a partial instantiation Ik :

• When choosing a value for Vk+1 we need to check that any candidate value
d ∈ Dk+1, is consistent with Ik .
• When testing potential values for d, we will generally discard one or more
possibilities, because they conflict with some member of Ik
• We keep track of the most recent assignment Aj for which this has happened.

Finally, if no value for Vk+1 is consistent with Ik then we backtrack to Vj .


More formally: if Ik conflicts with Vk+1 we backtrack to Vj where
j = min{j ≤ k|Ij conflicts with Vk+1}.
If there are no possible values left to try for Vj then we backtrack chronologically.
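
One way to realise this bookkeeping, as a sketch: earliest_conflict(prefix, var, value) is an assumed helper returning the index of the earliest assignment in the current partial instantiation that rules out var = value, or None if the value is consistent.

# Gaschnig-style culprit computation (sketch).
def culprit_for(prefix, var, candidate_values, earliest_conflict):
    deepest = 0
    for value in candidate_values:
        j = earliest_conflict(prefix, var, value)
        if j is None:
            return None               # some value is consistent: not a dead-end
        deepest = max(deepest, j)     # most recent of the earliest conflicts
    return deepest                    # the level to backjump to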

152
Gaschnig’s algorithm

Example:

[Figure: the search path 1, 3, 5, 4 with every candidate value for 7 failing. The most recent assignment responsible for discarding a value for 7 is the one to 5, so we backtrack to 5.]

If there’s no value left to try for 5 then backtrack to 3 and so on.

153
Graph-based backjumping

This allows us to jump back multiple levels when we initially detect a conflict.
Can we do better than chronological backtracking thereafter?
Some more definitions:

• We assume an ordering V1, V2, . . . , Vn for the variables.


• Given V′ = {V1, V2, . . . , Vk} where k < n, the ancestors of Vk+1 are the members of V′ connected to Vk+1 by a constraint.
• The parent P (Vk+1) of Vk+1 is its most recent ancestor.

The ancestors for each variable can be accumulated as assignments are made.
Graph-based backjumping backtracks to the parent of Vk+1.
Note: Gaschnig’s algorithm uses assignments whereas graph-based backjumping
uses constraints.

154
Graph-based backjumping

[Figure: the graph and the successive search stacks, with ancestor sets accumulated as assignments are made: 3 {1}, 5 {3}, 4 {5}, and finally 7 {1, 3, 5}, for which no assignment is available.]

At this point, backjump to the parent for 7, which is 5.

155
Backjumping and forward checking

If we use forward checking: say we’re assigning to Vk+1 by making Vk+1 = d:

• Forward checking removes d from the Di of all Vi connected to Vk+1 by a


constraint.
• When doing graph-based backjumping, we’d also add Vk+1 to the ancestors
of Vi.

In fact, use of forward checking can make some forms of backjumping redundant.
Note: there are in fact many ways of combining constraint propagation with back-
jumping, and we will not explore them in further detail here.

156
Backjumping and forward checking

[Figure: the graph and the search path 1, 3, 5, 4, with no assignment available for 7. The accumulated ancestor sets are: 1 − {}, 2 − {1, 3, 4}, 3 − {1}, 4 − {5}, 5 − {3}, 6 − {5}, 7 − {1, 3, 5}, 8 − {}.]
1 2 1

1 2 3 4 5 6 7 8
Start BRC BRC BRC BRC BRC BRC BRC BRC
1=B =B RC RC BRC BRC BRC RC BRC
3=R =B C = R BRC BC BRC C BRC
5=C =B C = R BR = C BR ! BRC
4=B =B C = R BR = C BR ! BRC

Forward checking finds the problem before backtracking does.

157
Graph-based backjumping

We’re not quite done yet though. What happens when there are no assignments
left for the parent we just backjumped to?

[Figure: two search stacks over V1, . . . , V7. On the left no assignment is available for V7; on the right, having backjumped to V4, no assignment is available for V4 either.]

Backjumping from V7 to V4 is fine. However we shouldn’t then just backjump to


V2, because changing V3 could fix the problem at V7.

158
Graph-based backjumping

To describe an algorithm in this case is a little involved.

[Figure: the same two stacks, annotated. On the left, V7 is a leaf dead-end variable and I6 is a leaf dead-end.]

Given an instantiation Ik and Vk+1, if there is no consistent d ∈ Dk+1 we call Ik


a leaf dead-end and Vk+1 a leaf dead-end variable.

159
Graph-based backjumping

Also

[Figure: the same two stacks, further annotated. On the right, V4 is an internal dead-end variable and I3 is an internal dead-end.]

If Vi was backtracked to from a later leaf dead-end and there are no more values
to try for Vi then we refer to it as an internal dead-end variable and call Ii−1 an
internal dead-end.

160
Graph-based backjumping

To keep track of exactly where to jump to we also need the definitions:

• The session of a variable V begins when the search algorithm visits it and
ends when it backtracks through it to an earlier variable.
• The current session of a variable V is the set of all variables visited during its
session.
• In particular, the current session for any V contains V .
• The relevant dead-ends for the current session R(V ) for a variable V are:
1. R(V ) is initialized to {V } when V is first visited.
2. If V is a leaf dead-end variable then R(V ) = {V }.
3. If V was backtracked to from a dead-end V′ then R(V) = R(V) ∪ R(V′).

And we’re not done yet…

161
Graph-based backjumping

Example:

[Figure: the session of V7 is {V7}, with R(V7) = {V7}; the session of V4 is {V4, V5, V6, V7}, with R(V4) = {V4, V7}.]

As expected, the relevant dead-ends for V4 are {V4} and {V7}.

162
Graph-based backjumping

One more bunch of definitions before the pain stops. Say Vk is a dead-end:

• The induced ancestors ind(Vk) of Vk are defined as
  ind(Vk) = {V1, V2, . . . , Vk−1} ∩ ( ⋃_{V ∈ R(Vk)} ancestors(V) )
• The culprit for Vk is the most recent V′ ∈ ind(Vk).

Note that these definitions depend on R(Vk ).


FINALLY: graph-based backjumping backjumps to the culprit.

163
Graph-based backjumping

Example:

[Figure: we backjump from V7 to V4, where nothing is left to try. The session of V4 is {V4, V5, V6, V7}, R(V4) = {V4, V7} and ind(V4) = {V2, V3}.]

As expected, we back jump to V3 instead of V2. Hooray!


Gaschnig’s algorithm and graph-based backjumping can be combined to produce
conflict-directed backjumping.
We will not explore conflict-directed backjumping in this course.

164
Varieties of CSP

We have only looked at discrete CSPs with finite domains. These are the simplest.
We could also consider:

1. Discrete CSPs with infinite domains:


• We need a constraint language. For example
V3 ≤ V10 + 5
• Algorithms are available for integer variables and linear constraints.
• There is no algorithm for integer variables and nonlinear constraints.
2. Continuous domains—using linear constraints defining convex regions we
have linear programming. This is solvable in polynomial time in n.
3. We can introduce preference constraints in addition to absolute constraints, and
in some cases an objective function.

165
Artificial Intelligence

Knowledge representation and reasoning

Reading: AIMA, chapters 7 to 10.


166
Knowledge representation and reasoning

We now look at how an agent might represent knowledge about its environment,
and reason with this knowledge to achieve its goals.
Initially we’ll represent and reason using first order logic (FOL). Aims:

• To show how FOL can be used to represent knowledge about an environment


in the form of both background knowledge and knowledge derived from per-
cepts.
• To show how this knowledge can be used to derive non-perceived knowledge
about the environment using a theorem prover.
• To introduce the situation calculus and demonstrate its application in a simple
environment as a means by which an agent can work out what to do next.

Using FOL in all its glory can be problematic.


Later we’ll look at how some of the problems can be addressed using semantic
networks, frames, inheritance and rules.

167
Knowledge representation and reasoning

Earlier in the course we looked at what an agent should be able to do.


It seems that all of us—and all intelligent agents—should use logical reasoning to
help us interact successfully with the world.
Any intelligent agent should:

• Possess knowledge about the environment and about how its actions affect the
environment.
• Use some form of logical reasoning to maintain its knowledge as percepts ar-
rive.
• Use some form of logical reasoning to deduce actions to perform in order to
achieve goals.

168
Knowledge representation and reasoning

This raises some important questions:

• How do we describe the current state of the world?


• How do we infer from our percepts, knowledge of unseen parts of the world?
• How does the world change as time passes?
• How does the world stay the same as time passes? (The frame problem.)
• How do we know the effects of our actions? (The qualification and ramifica-
tion problems.)

We’ll now look at one way of answering some of these questions.


FOL (arguably?) seems to provide a good way in which to represent the required
kinds of knowledge: it is expressive, concise, unambiguous, it can be adapted to
different contexts, and it has an inference procedure, although a semidecidable one.
In addition it has a well-defined syntax and semantics.

169
Logic for knowledge representation

Problem: it’s quite easy to talk about things like set theory using FOL. For exam-
ple, we can easily write axioms like
∀S . ∀S′ . ((∀x . (x ∈ S ⇔ x ∈ S′)) ⇒ S = S′)
But how would we go about representing the proposition that if you have a bucket
of water and throw it at your friend they will get wet, have a bump on their head
from being hit by a bucket, and the bucket will now be empty and dented?
More importantly, how could this be represented within a wider framework for
reasoning about the world?
It’s time to introduce The Wumpus…

170
Wumpus world

As a simple test scenario for a knowledge-based agent we will make use of the
Wumpus World.

[Figure: the 4 by 4 cave, showing the Wumpus and EVIL ROBOT.]

The Wumpus World is a 4 by 4 grid-based cave.


EVIL ROBOT wants to enter the cave, find some gold, and get out again un-
scathed.

171
Wumpus world

The rules of Wumpus World:

• Unfortunately the cave contains a number of pits, which EVIL ROBOT can
fall into. Eventually his batteries will fail, and that’s the end of him.
• The cave also contains the Wumpus, who is armed with state-of-the-art Evil
Robot Obliteration Technology.
• The Wumpus itself knows where the pits are and never falls into one.

172
Wumpus world

EVIL ROBOT can move around the cave at will and can perceive the following:

• In a position adjacent to the Wumpus, a stench is perceived. (Wumpuses are


famed for their lack of personal hygiene.)
• In a position adjacent to a pit, a breeze is perceived.
• In the position where the gold is, a glitter is perceived.
• On trying to move into a wall, a bump is perceived.
• On killing the Wumpus a scream is perceived.

In addition, EVIL ROBOT has a single arrow, with which to try to kill the Wum-
pus.
“Adjacent” in the following does not include diagonals.

173
Wumpus world

So we have:
Percepts: stench, breeze, glitter, bump, scream.
Actions: forward, turnLeft, turnRight, grab, release, shoot, climb.
Of course, our aim now is not just to design an agent that can perform well in a
single cave layout.
We want to design an agent that can usually perform well regardless of the layout
of the cave.

174
Logic for knowledge representation

The fundamental aim is to construct a knowledge base KB containing a collection


of statements about the world—expressed in FOL—such that useful things can be
derived from it.
Our central aim is to generate sentences that are true, if the sentences in the KB
are true.
This process is based on concepts familiar from your introductory logic courses:

• Entailment: KB |= α means that the KB entails α.


• Proof: KB `i α means that α is derived from the KB using inference procedure
i. If i is sound then we have a proof .
• i is sound if it can generate only entailed α.
• i is complete if it can find a proof for any entailed α.

175
Example: Prolog

You have by now learned a little about programming in Prolog. For example:
concat([], L, L).
concat([H|T ], L, [H|L2]) :- concat(T, L, L2).
is a program to concatenate two lists. The query
concat([1, 2, 3], [4, 5], X).
results in
X = [1, 2, 3, 4, 5].
What’s happening here? Well, Prolog is just a more limited form of FOL so…

176
Example: Prolog

… we are in fact doing inference from a KB:

• The Prolog programme itself is the KB. It expresses some knowledge about
lists.
• The query is expressed in such a way as to derive some new knowledge.

How does this relate to full FOL? First of all the list notation is nothing but syn-
tactic sugar. It can be removed: we define a constant called empty and a function
called cons.
Now [1, 2, 3] just means
cons(1, cons(2, cons(3, empty)))
which is a term in FOL.
I will assume the use of the syntactic sugar for lists from now on.

177
Prolog and FOL

The program when expressed in FOL, says


∀x . concat(empty, x, x) ∧
∀h, t, l1, l2 . concat(t, l1, l2) → concat(cons(h, t), l1, cons(h, l2))
The rule is simple—given a Prolog program:

• Universally quantify all the unbound variables in each line of the program and

• … form the conjunction of the results.

If the universally quantified lines are L1, L2, . . . , Ln then the Prolog programme
corresponds to the KB
KB = L1 ∧ L2 ∧ · · · ∧ Ln
Now, what does the query mean?

178
Prolog and FOL

When you give the query


concat([1, 2, 3], [4, 5], X).
to Prolog it responds by trying to prove the following statement
KB → ∃X . concat([1, 2, 3], [4, 5], X)
So: it tries to prove that the KB implies the query, and variables in the query are
existentially quantified.
When a proof is found, it supplies a value for X that makes the inference true.

179
Prolog and FOL

Prolog differs from FOL in that, amongst other things:

• It restricts you to using Horn clauses.


• Its inference procedure is not a full-blown proof procedure.
• It does not deal with negation correctly.

However the central idea also works for full-blown theorem provers.
If you want to experiment, you can obtain Prover9 from
https://www.cs.unm.edu/~mccune/mace4/
We’ll see a brief example now, and a more extensive example of its use later, time
permitting…

180
Prolog and FOL

Expressed in Prover9, the above Prolog program and query look like this:
set(prolog_style_variables).

% This is the translated Prolog program for list concatenation.


% Prover9 has its own syntactic sugar for lists.

formulas(assumptions).
concat([], L, L).
concat(T, L, L2) -> concat([H:T], L, [H:L2]).
end_of_list.

% This is the query.

formulas(goals).
exists X concat([1, 2, 3], [4, 5], X).
end_of_list.

Note: it is assumed that unbound variables are universally quantified.

181
Prolog and FOL

You can try to infer a proof using

prover9 -f file.in

and the result is (in addition to a lot of other information):


1 concat(T,L,L2) -> concat([H:T],L,[H:L2]) # label(non_clause). [assumption].
2 (exists X concat([1,2,3],[4,5],X)) # label(non_clause) # label(goal). [goal].
3 concat([],A,A). [assumption].
4 -concat(A,B,C) | concat([D:A],B,[D:C]). [clausify(1)].
5 -concat([1,2,3],[4,5],A). [deny(2)].
6 concat([A],B,[A:B]). [ur(4,a,3,a)].
7 -concat([2,3],[4,5],A). [resolve(5,a,4,b)].
8 concat([A,B],C,[A,B:C]). [ur(4,a,6,a)].
9 $F. [resolve(8,a,7,a)].

This shows that a proof is found but doesn’t explicitly give a value for X—we’ll
see how to extract that later…

182
The fundamental idea

So the basic idea is: build a KB that encodes knowledge about the world, the effects
of actions and so on.
The KB is a conjunction of pieces of knowledge, such that:

• A query regarding what our agent should do can be posed in the form
∃actionList . Goal(...actionList...)

• Proving that
KB → ∃actionList . Goal(...actionList...)
instantiates actionList to an actual list of actions that will achieve a goal
represented by the Goal predicate.

We sometimes use the notation ask and tell to refer to querying and adding to
the KB.

183
Using FOL in AI: the triumphant return of the Wumpus

We want to be able to speculate about the past and about possible futures. So:

[Figure: the Wumpus World cave again, showing the Wumpus and EVIL ROBOT.]

• We include situations in the logical language used by our KB.


• We include axioms in our KB that relate to situations.

This gives rise to situation calculus.

184
Situation calculus

In situation calculus:

• The world consists of sequences of situations.


• Over time, an agent moves from one situation to another.
• Situations are changed as a result of actions.

In Wumpus World the actions are: forward, shoot, grab, climb, release,
turnRight, turnLeft.

• A situation argument is added to items that can change over time. For example
At(location, s)
Items that can change over time are called fluents.
• A situation argument is not needed for things that don’t change. These are
sometimes referred to as eternal or atemporal.

185
Representing change as a result of actions

Situation calculus uses a function


result(action, s)
to denote the new situation arising as a result of performing the specified action
in the specified situation.
result(grab, s0) = s1
result(turnLeft, s1) = s2
result(shoot, s2) = s3
result(forward, s3) = s4
..

186
Axioms I: possibility axioms

The first kind of axiom we need in a KB specifies when particular actions are
possible.
We introduce a predicate
Poss(action, s)
denoting that an action can be performed in situation s.
We then need a possibility axiom for each action. For example:
At(l, s) ∧ Available(gold, l, s) → Poss(grab, s)
Remember that unbound variables are universally quantified.

187
Axioms II: effect axioms

Given that an action results in a new situation, we can introduce effect axioms to
specify the properties of the new situation.
For example, to keep track of whether EVIL ROBOT has the gold we need effect
axioms to describe the effect of picking it up:
Poss(grab, s) → Have(gold, result(grab, s))
Effect axioms describe the way in which the world changes.
We would probably also include
¬Have(gold, s0)
in the KB, where s0 is the starting situation.
Important: we are describing what is true in the situation that results from per-
forming an action in a given situation.

188
Axioms III: frame axioms

We need frame axioms to describe the way in which the world stays the same.
Example:
Have(o, s) ∧
¬(a = release ∧ o = gold) ∧ ¬(a = shoot ∧ o = arrow)
→ Have(o, result(a, s))
describes the effect of having something and not discarding it.
In a more general setting such an axiom might well look different. For example
¬Have(o, s) ∧
(a ≠ grab(o) ∨ ¬(Available(o, s) ∧ Portable(o)))
→ ¬Have(o, result(a, s))
describes the effect of not having something and not picking it up.

189
The frame problem

The frame problem has historically been a major issue.


Representational frame problem: a large number of frame axioms are required to
represent the many things in the world which will not change as the result of an
action.
We will see how to solve this in a moment.
Inferential frame problem: when reasoning about a sequence of situations, all the
unchanged properties still need to be carried through all the steps.
This can be alleviated using planning systems that allow us to reason efficiently
when actions change only a small part of the world. There are also other reme-
dies, which we will not cover.

190
Successor-state axioms

Effect axioms and frame axioms can be combined into successor-state axioms.
One is needed for each predicate that can change over time.
Action a is possible →
(true in new situation ⇐⇒
(you did something to make it true ∨
it was already true and you didn’t make it false))
For example
Poss(a, s) →
(Have(o, result(a, s)) ⇐⇒ ((a = grab ∧ Available(o, s)) ∨
(Have(o, s) ∧ ¬(a = release ∧ o = gold) ∧
¬(a = shoot ∧ o = arrow))))

191
Knowing where you are, and so on…

We now have considerable flexibility in adding further rules:

• If s0 is the initial situation we know that At((1, 1), s0).


• We need to keep track of what way we’re facing. Say north is 0, south is 2,
east is 1 and west is 3. We might assume facing(s0) = 0.
• We need to know how motion affects location
forwardResult((x, y), north) = (x, y + 1)
forwardResult((x, y), east) = (x + 1, y)
and so on.
• The concept of adjacency is very important in the Wumpus world
Adjacent(l1, l2) ⇐⇒ ∃d forwardResult(l1, d) = l2

• We also know that the cave is 4 by 4 and surrounded by walls


WallHere((x, y)) ⇐⇒ (x = 0 ∨ y = 0 ∨ x = 5 ∨ y = 5)

192
The qualification and ramification problems

Qualification problem: we are in general never completely certain what condi-


tions are required for an action to be effective.
Consider for example turning the key to start your car.
This will lead to problems if important conditions are omitted from axioms.
Ramification problem: actions tend to have implicit consequences that are large
in number.
For example, if I pick up a sandwich in a dodgy sandwich shop, I will also be
picking up all the bugs that live in it. I don’t want to model this explicitly.

193
Solving the ramification problem

The ramification problem can be solved by modifying successor-state axioms.


For example:
Poss(a, s) →
(At(o, l, result(a, s)) ⇐⇒
(∃l′ . a = go(l′, l) ∧
[o = robot ∨ Has(robot, o, s)]) ∨
(At(o, l, s) ∧
[¬∃l″ . a = go(l, l″) ∧ l ≠ l″ ∧
{o = robot ∨ Has(robot, o, s)}]))
describes the fact that anything EVIL ROBOT is carrying moves around with
him.

194
Deducing properties of the world: causal and diagnostic rules

If you know where you are, then you can think about places rather than just
situations. Synchronic rules relate properties shared by a single state of the world.
There are two kinds: causal and diagnostic.
Causal rules: some properties of the world will produce percepts.
WumpusAt(l1) ∧ Adjacent(l1, l2) → StenchAt(l2)
PitAt(l1) ∧ Adjacent(l1, l2) → BreezeAt(l2)
Systems reasoning with such rules are known as model-based reasoning systems.
Diagnostic rules: infer properties of the world from percepts. For example:
At(l, s) ∧ Breeze(s) → BreezeAt(l)
At(l, s) ∧ Stench(s) → StenchAt(l)
These may not be very strong.
The difference between model-based and diagnostic reasoning can be important.
For example, medical diagnosis can be done based on symptoms or based on a
model of disease.
195
General axioms for situations and objects

Note: in FOL, if we have two constants robot and gold then an interpretation is
free to assign them to be the same thing. This is not something we want to allow.
Unique names axioms state that each pair of distinct items in our model of the
world must be different
robot ≠ gold
robot ≠ arrow
robot ≠ wumpus
…
Unique actions axioms state that actions must share this property, so for each pair
of actions
go(l, l′) ≠ grab
go(l, l′) ≠ drop(o)
…
and in addition we need to define equality for actions, so for each action
go(l, l′) = go(l″, l′′′) ⇐⇒ l = l″ ∧ l′ = l′′′
drop(o) = drop(o′) ⇐⇒ o = o′
…

196
General axioms for situations and objects

The situations are ordered so


s0 ≠ result(a, s)
and situations are distinct so
result(a, s) = result(a′, s′) ⇐⇒ a = a′ ∧ s = s′
Strictly speaking we should be using a many-sorted version of FOL.
In such a system variables can be divided into sorts which are implicitly separate
from one another.
Finally, we’re going to need to specify what’s true in the start state.
For example
At(robot, [1, 1], s0)
At(wumpus, [3, 4], s0)
Has(robot, arrow, s0)
..
and so on.

197
Sequences of situations

We know that the function result tells us about the situation resulting from
performing an action in an earlier situation.
How can this help us find sequences of actions to get things done?
Define
Sequence([], s, s′) = (s′ = s)
Sequence([a], s, s′) = Poss(a, s) ∧ s′ = result(a, s)
Sequence(a :: as, s, s′) = ∃t . Sequence([a], s, t) ∧ Sequence(as, t, s′)
To obtain a sequence of actions that achieves Goal(s) we can use the query
∃a ∃s . Sequence(a, s0, s) ∧ Goal(s)

198
Interesting reading

Knowledge representation based on logic is a vast subject and can’t be covered


in full in the lectures.
In particular:

• Techniques for representing further kinds of knowledge.


• Techniques for moving beyond the idea of a situation.
• Reasoning systems based on categories.
• Reasoning systems using default information.
• Truth maintenance systems.

Happy reading…

199
Knowledge representation and reasoning

It should be clear that generating sequences of actions by inference in FOL is


highly non-trivial.
Ideally we’d like to maintain an expressive language while restricting it enough
to be able to do inference efficiently.
Further aims:

• To give a brief introduction to semantic networks and frames for knowledge


representation.
• To see how inheritance can be applied as a reasoning method.
• To look at the use of rules for knowledge representation, along with forward
chaining and backward chaining for reasoning.

Further reading: The Essence of Artificial Intelligence, Alison Cawsey. Prentice


Hall, 1998.

200
Frames and semantic networks

Frames and semantic networks represent knowledge in the form of classes of


objects and relationships between them:

• The subclass and instance relationships are emphasised.


• We form class hierarchies in which inheritance is supported and provides the
main inference mechanism.

As a result inference is quite limited.


We also need to be extremely careful about semantics.
The only major difference between the two ideas is notational.

201
Example of a semantic network

[Figure: a semantic network. Person has a Head, a Left arm and a Right arm. Musician is a subclass of Person and has an Instrument. Rock musician and Classical musician are subclasses of Musician: a Rock musician has Ear problems, volume Loud and hair_length Long; a Classical musician has Sheet music, volume Quiet and hair_length Any. Jake Mayhem is an instance of Rock musician and has an Axe; Violet Scroot is an instance of Classical musician and has an Oboe.]
202
Frames

Frames once again support inheritance through the subclass relationship.

[Figure: two frames.
Musician — subclass: Person; has: instrument.
Rock musician — subclass: Musician; has: ear problems; hairlength: long; volume: loud.]
has, hairlength, volume etc are slots.


long, loud, instrument etc are slot values.
These are a direct predecessor of object-oriented programming languages.
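
As an illustration only, a toy frame system with slot inheritance takes just a few lines of Python; the frames below mirror the example above.

# A toy frame system: look a slot up locally, then along the parent chain.
class Frame:
    def __init__(self, name, parent=None, **slots):
        self.name, self.parent, self.slots = name, parent, slots

    def get(self, slot):
        if slot in self.slots:
            return self.slots[slot]
        return self.parent.get(slot) if self.parent else None

person = Frame("Person", has="head, left arm, right arm")
musician = Frame("Musician", parent=person, has="instrument")
rock_musician = Frame("Rock musician", parent=musician, hairlength="long", volume="loud")
jake = Frame("Jake Mayhem", parent=rock_musician, has="axe")
print(jake.get("volume"))    # "loud", inherited from Rock musician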

203
Defaults

Both approaches to knowledge representation are able to incorporate defaults:

[Figure: two frames with starred (default) slots.
Rock musician — subclass: Musician; has: ear problems; * hairlength: long; * volume: loud.
Dementia Evilperson — subclass: Rock musician; hairlength: short; image: gothic.]

Starred slots are typical values associated with subclasses and instances, but can
be overridden.

204
Multiple inheritance

Both approaches can incorporate multiple inheritance, at a cost:

[Figure: Cornelius Cleverchap is an instance of both Rock musician and Classical musician.]

• What is hairlength for Cornelius if we’re trying to use inheritance to es-


tablish it?
• This can be overcome initially by specifying which class is inherited from in
preference when there’s a conflict.
• But the problem is still not entirely solved—what if we want to prefer inheri-
tance of some things from one class, but inheritance of others from a different
one?

205
Other issues

• Slots and slot values can themselves be frames. For example Dementia may
have an instrument slot with the value Electricharp, which itself may have
properties described in a frame.
• Slots can have specified attributes. For example, we might specify that:
– instrument can have multiple values
– Each value can only be an instance of Instrument
– Each value has a slot called owned by
and so on.
• Slots may contain arbitrary pieces of program. This is known as procedural
attachment. The fragment might be executed to return the slot’s value, or
update the values in other slots etc.

206
Rule-based systems

A rule-based system requires three things:

1. A set of if − then rules. These denote specific pieces of knowledge about


the world.
They should be interpreted similarly to logical implication.
Such rules denote what to do or what can be inferred under given circum-
stances.
2. A collection of facts denoting what the system regards as currently true about
the world.
3. An interpreter able to apply the current rules in the light of the current facts.

207
Forward chaining

The first of two basic kinds of interpreter begins with established facts and then
applies rules to them.
This is a data-driven process. It is appropriate if we know the initial facts but not
the required conclusion.
Example: XCON—used for configuring VAX computers.
In addition:

• We maintain a working memory, typically of what has been inferred so far.


• Rules are often condition-action rules, where the right-hand side specifies an
action such as adding or removing something from working memory, print-
ing a message etc.
• In some cases actions might be entire program fragments.

208
Forward chaining

The basic algorithm is:

1. Find all the rules that can fire, based on the current working memory.
2. Select a rule to fire. This requires a conflict resolution strategy.
3. Carry out the action specified, possibly updating the working memory.

Repeat this process until either no rules can be used or a halt appears in the
working memory.
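
A minimal sketch of such an interpreter in Python, handling ADD rules only and using "first applicable rule" as its conflict resolution strategy (one of many possibilities); the rules are those of the example on the next slide.

# A tiny forward-chaining interpreter (sketch). Each rule is (conditions, conclusion).
def forward_chain(rules, working_memory):
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= working_memory and conclusion not in working_memory:
                working_memory.add(conclusion)    # fire the rule
                changed = True
                break                             # re-scan from the top
    return working_memory

rules = [({"dry_mouth"}, "thirsty"),
         ({"thirsty"}, "get_drink"),
         ({"get_drink", "no_work"}, "go_bar"),
         ({"working"}, "no_work")]
print(forward_chain(rules, {"dry_mouth", "working"}))   # go_bar is inferred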

209
Condition−action rules

dry_mouth −> ADD thirsty


thirsty −> ADD get_drink
get_drink AND no_work −> ADD go_bar
working −> ADD no_work
no_work −> DELETE working

[Figure: the interpreter applies the condition−action rules to the working memory, which initially contains dry_mouth and working.]

210
Example

Progress is as follows:

1. The rule
dry_mouth → ADD thirsty
fires adding thirsty to working memory.
2. The rule
thirsty → ADD get_drink
fires adding get_drink to working memory.
3. The rule
working → ADD no_work
fires adding no_work to working memory.
4. The rule
get_drink AND no_work → ADD go_bar
fires, and we establish that it’s time to go to the bar.

211
Conflict resolution

Clearly in any more realistic system we expect to have to deal with a scenario
where two or more rules can be fired at any one time:
• Which rule we choose can clearly affect the outcome.
• We might also want to attempt to avoid inferring an abundance of useless
information.
We therefore need a means of resolving such conflicts. Common conflict resolution
strategies are:
• Prefer rules involving more recently added facts.
• Prefer rules that are more specific. For example
patient coughing → ADD lung problem
is more general than
patient coughing AND patient smoker → ADD lung cancer.
• Allow the designer of the rules to specify priorities.
• Fire all rules simultaneously—this essentially involves following all chains of
inference at once.
212
Reason maintenance

Some systems will allow information to be removed from the working memory
if it is no longer justified.
For example, we might find that
patient coughing
and
patient smoker
are in working memory, and hence fire
patient coughing AND patient smoker → ADD lung cancer
but later infer something that causes patient coughing to be withdrawn from
working memory.
The justification for lung cancer has been removed, and so it should perhaps
be removed also.

213
Pattern matching

In general rules may be expressed in a slightly more flexible form involving vari-
ables which can work in conjunction with pattern matching.
For example the rule
coughs(X) AND smoker(X) → ADD lung cancer(X)
contains the variable X.
If the working memory contains coughs(neddy) and smoker(neddy) then
X = neddy
provides a match and
lung cancer(neddy)
is added to the working memory.

214
Backward chaining

The second basic kind of interpreter begins with a goal and finds a rule that would
achieve it.
It then works backwards, trying to achieve the resulting earlier goals in the suc-
cession of inferences.
Example: MYCIN—medical diagnosis with a small number of conditions.
This is a goal-driven process. If you want to test a hypothesis or you have some
idea of a likely conclusion it can be more efficient than forward chaining.
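
A goal-driven counterpart, as a sketch for the same ADD-style rules (no loop detection is attempted):

# A tiny backward-chaining sketch: a goal holds if it is in working memory, or
# some rule concludes it and all of that rule's conditions can be established.
def backward_chain(goal, rules, working_memory):
    if goal in working_memory:
        return True
    for conditions, conclusion in rules:
        if conclusion == goal and all(backward_chain(c, rules, working_memory)
                                      for c in conditions):
            return True
    return False    # backtracking is implicit: try the next rule with this conclusion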

215
Example

The working memory contains dry_mouth and working; the goal is go_bar.

• To establish go_bar we have to establish get_drink and no_work. These are the new goals.
• Try first to establish get_drink. This can be done by establishing thirsty.
• thirsty can be established by establishing dry_mouth. This is in the working memory so we’re done.
• Finally, we can establish no_work by establishing working. This is in the working memory so the process has finished.

216
Example with backtracking

If at some point more than one rule has the required conclusion then we can
backtrack.
Example: Prolog backtracks, and incorporates pattern matching. It orders at-
tempts according to the order in which rules appear in the program.
Example: having added
up_early → ADD tired
and
tired AND lazy → ADD go_bar
to the rules, and up_early to the working memory:

217
Example with backtracking

The working memory contains dry_mouth, working and up_early; the goal is go_bar.

• Attempt to establish go_bar by establishing tired and lazy.
• tired can be established by establishing up_early. up_early is in the working memory so we’re done.
• We cannot establish lazy, and so we backtrack and try a different approach.
• The alternative rule for go_bar requires get_drink and no_work, and the process proceeds as before.
218
Artificial Intelligence I

Planning algorithms

Reading: AIMA, chapter 11.


219
Problem solving is different to planning

In search problems we:

• Represent states: and a state representation contains everything that’s relevant


about the environment.
• Represent actions: by describing a new state obtained from a current state.
• Represent goals: all we know is how to test a state either to see if it’s a goal,
or using a heuristic.
• A sequence of actions is a ‘plan’: but we only consider sequences of consecutive
actions.

Search algorithms are good for solving problems that fit this framework. How-
ever for more complex problems they may fail completely…

220
Problem solving is different to planning

Representing a problem such as: ‘go out and buy some pies’ is hopeless:

• There are too many possible actions at each step.


• A heuristic can only help you rank states. In particular it does not help you
ignore useless actions.
• We are forced to start at the initial state, but you have to work out how to get
the pies—that is, go to town and buy them, get online and find a web site that
sells pies etc—before you can start to do it.

Knowledge representation and reasoning might not help either: although we end
up with a sequence of actions—a plan—there is so much flexibility that complex-
ity might well become an issue.
Our aim now is to look at how an agent might construct a plan enabling it to
achieve a goal.

• We look at how we might update our concept of knowledge representation and


reasoning to apply more specifically to planning tasks.
• We look in detail at the partial-order planning algorithm.
221
Planning algorithms work differently

Difference 1:

• Planning algorithms use a special purpose language—often based on FOL or a


subset— to represent states, goals, and actions.
• States and goals are described by sentences, as might be expected, but…
• …actions are described by stating their preconditions and their effects.

So if you know the goal includes (maybe among other things)


Have(pie)
and action Buy(x) has an effect Have(x) then you know that a plan including
Buy(pie)
might be reasonable.

222
Planning algorithms work differently

Difference 2:

• Planners can add actions at any relevant point at all between the start and the
goal, not just at the end of a sequence starting at the start state.
• This makes sense: I may determine that Have(carKeys) is a good state to be
in without worrying about what happens before or after finding them.
• By making an important decision like requiring Have(carKeys) early on we
may reduce branching and backtracking.
• State descriptions are not complete—Have(carKeys) describes a class of states—
and this adds flexibility.

So: you have the potential to search both forwards and backwards within the
same problem.

223
Planning algorithms work differently

Difference 3:
It is assumed that most elements of the environment are independent of most other
elements.

• A goal including several requirements can be attacked with a divide-and-


conquer approach.
• Each individual requirement can be fulfilled using a subplan…
• …and the subplans then combined.

This works provided there is not significant interaction between the subplans.
Remember: the frame problem.

224
Running example: gorilla-based mischief

We will use a simple example, based on one from Russell and Norvig.

The intrepid little scamps in the Cambridge University Roof-Climbing Society wish
to attach an inflatable gorilla to the spire of a Famous College. To do this they need
to leave home and obtain:

• An inflatable gorilla: these can be purchased from all good joke shops.
• Some rope: available from a hardware store.
• A first-aid kit: also available from a hardware store.

They need to return home after they’ve finished their shopping. How do they go
about planning their jolly escapade?
225
The STRIPS language

STRIPS: “Stanford Research Institute Problem Solver” (1970).


States: are conjunctions of ground literals. They must not include function symbols.
At(home) ∧ ¬Have(gorilla)
∧ ¬Have(rope)
∧ ¬Have(kit)
Goals: are conjunctions of literals where variables are assumed existentially quan-
tified.
At(x) ∧ Sells(x, gorilla)
A planner finds a sequence of actions that when performed makes the goal true.
We are no longer employing a full theorem-prover.

226
The STRIPS language

STRIPS represents actions using operators. For example

[Figure: the operator Go(y) drawn as a box, with its precondition At(x) ∧ Path(x, y) above and its effect At(y) ∧ ¬At(x) below.]

Op(Action: Go(y), Pre: At(x) ∧ Path(x, y), Effect: At(y) ∧ ¬At(x))


All variables are implicitly universally quantified. An operator has:

• An action description: what the action does.


• A precondition: what must be true before the operator can be used. A con-
junction of positive literals.
• An effect: what is true after the operator has been used. A conjunction of
literals.
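
As an illustration only, such an operator might be written down as a simple data structure; the field names below are assumptions, not STRIPS syntax.

# A STRIPS-style operator as a plain data structure (sketch).
from dataclasses import dataclass

@dataclass
class Operator:
    action: str
    pre: tuple      # conjunction of positive literals
    effect: tuple   # conjunction of literals, possibly negated

go = Operator("Go(y)", pre=("At(x)", "Path(x, y)"), effect=("At(y)", "¬At(x)"))
buy = Operator("Buy(y)", pre=("At(x)", "Sells(x, y)"), effect=("Have(y)",))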

227
The space of plans

We now make a change in perspective—we search in plan space:

• Start with an empty plan.


• Operate on it to obtain new plans. Incomplete plans are called partial plans.
Refinement operators add constraints to a partial plan. All other operators are
called modification operators.
• Continue until we obtain a plan that solves the problem.

Operations on plans can be:

• Adding a step.
• Instantiating a variable.
• Imposing an ordering that places a step in front of another.
• and so on…

228
Representing a plan: partial order planners

When putting on your shoes and socks:

• It does not matter whether you deal with your left or right foot first.
• It does matter that you place a sock on before a shoe, for any given foot.

It makes sense in constructing a plan not to make any commitment to which side
is done first if you don’t have to.
Principle of least commitment: do not commit to any specific choices until you
have to. This can be applied both to ordering and to instantiation of variables.
A partial order planner allows plans to specify that some steps must come before
others but others have no ordering.
A linearisation of such a plan imposes a specific sequence on the actions therein.

229
Representing a plan: partial order planners

A plan consists of:

1. A set {S1, S2, . . . , Sn} of steps. Each of these is one of the available operators.
2. A set of ordering constraints. An ordering constraint Si < Sj denotes the fact
that step Si must happen before step Sj . Si < Sj < Sk and so on has the
obvious meaning. Si < Sj does not mean that Si must immediately precede
Sj .
3. A set of variable bindings v = x where v is a variable and x is either a variable
or a constant.
4. A set of causal links or protection intervals, written Si →c Sj. This denotes the fact that
the purpose of Si is to achieve the precondition c for Sj.

A causal link is always paired with an equivalent ordering constraint.
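
A sketch of the four components as a plain data structure (names are illustrative), initialised to the empty plan described on the next slide:

# A partial-order plan: steps, ordering constraints, bindings and causal links.
from dataclasses import dataclass

@dataclass
class PartialPlan:
    steps: set            # e.g. {"Start", "Finish"}
    orderings: set        # pairs (Si, Sj) meaning Si < Sj
    bindings: dict        # variable -> variable or constant
    causal_links: set     # triples (Si, c, Sj): Si achieves precondition c of Sj

initial = PartialPlan(steps={"Start", "Finish"},
                      orderings={("Start", "Finish")},
                      bindings={},
                      causal_links=set())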

230
Representing a plan: partial order planners

The initial plan has:

• Two steps, called Start and Finish.


• A single ordering constraint Start < Finish.
• No variable bindings.
• No causal links.

In addition to this:

• The step Start has no preconditions, and its effect is the start state for the
problem.
• The step Finish has no effect, and its precondition is the goal.
• Neither Start or Finish has an associated action.

We now need to consider what constitutes a solution…

231
Solutions to planning problems

A solution to a planning problem is any complete and consistent partially ordered


plan.
Complete: each precondition of each step is achieved by another step in the so-
lution.
A precondition c for S is achieved by a step S′ if:

1. The precondition is an effect of the step:
S′ < S and c ∈ Effects(S′)
and…
2. … there is no other step that could cancel the precondition. That is, no S″
exists where:
• The existing ordering constraints allow S″ to occur after S′ but before S.
• ¬c ∈ Effects(S″).

232
Solutions to planning problems

Consistent: no contradictions exist in the binding constraints or in the proposed


ordering. That is:

1. For binding constraints, we never have v = X and v = Y for distinct constants X and Y .
2. For the ordering, we never have S < S′ and S′ < S.

Returning to the roof-climbers’ shopping expedition, here is the basic approach:

• Begin with only the Start and Finish steps in the plan.
• At each stage add a new step.
• Always add a new step such that a currently non-achieved precondition is
achieved.
• Backtrack when necessary.

233
An example of partial-order planning

Here is the initial plan:

Start
(effects of Start) At(Home) ∧ Sells(JS,G) ∧ Sells(HS,R) ∧ Sells(HS,FA)

(preconditions of Finish) At(Home) ∧ Have(G) ∧ Have(R) ∧ Have(FA)
Finish

Thin arrows denote ordering.

234
An example of partial-order planning

There are two actions available:

Go(y): precondition At(x); effects At(y), ¬At(x).
Buy(y): preconditions At(x), Sells(x, y); effect Have(y).

A planner might begin, for example, by adding a Buy(G) action in order to achieve
the Have(G) precondition of Finish.
Note: the following order of events is by no means the only one available to a
planner.
It has been chosen for illustrative purposes.

235
An example of partial-order planning

Incorporating the suggested step into the plan:

Start

At(Home), Sells(JS,G), Sells(HS,R), Sells(HS,FA)


At(x), Sells(x, G)

Buy(G)

At(Home), Have(G), Have(R), Have(FA)

Finish

Thick arrows denote causal links. They always have a thin arrow underneath.
Here the new Buy step achieves the Have(G) precondition of Finish.

236
An example of partial-order planning

The planner can now introduce a second causal link from Start to achieve the
Sells(x, G) precondition of Buy(G).

Start

At(Home), Sells(JS,G), Sells(HS,R), Sells(HS,FA)

At(JS), Sells(JS,G)

Buy(G)

At(Home), Have(G), Have(R), Have(FA)

Finish

237
An example of partial-order planning

The planner’s next obvious move is to introduce a Go step to achieve the At(JS)
precondition of Buy(G).

Start

At(x) At(Home), Sells(JS,G), Sells(HS,R), Sells(HS,FA)

Go(JS)

At(JS), Sells(JS,G)

Buy(G)

At(Home), Have(G), Have(R), Have(FA)

Finish

And we continue…

238
An example of partial-order planning

Initially the planner can continue quite easily in this manner:

• Add a causal link from Start to Go(JS) to achieve the At(x) precondition.
• Add the step Buy(R) with an associated causal link to the Have(R) precondi-
tion of Finish.
• Add a causal link from Start to Buy(R) to achieve the Sells(HS, R) precon-
dition.

But then things get more interesting…

239
An example of partial-order planning

Start

At(Home) At(Home), Sells(JS,G), Sells(HS,R), Sells(HS,FA)

Go(JS)

At(JS), Sells(JS,G) At(HS), Sells(HS,R)

Buy(G) Buy(R)

At(Home), Have(G), Have(R), Have(FA)

Finish

At this point it starts to get tricky…


The At(HS) precondition in Buy(R) is not achieved.

240
An example of partial-order planning

Start
At(x)

At(Home) At(Home), Sells(JS,G), Sells(HS,R), Sells(HS,FA)


Go(HS)
Go(JS)
¬At(x)

At(JS), Sells(JS,G) Sells(HS,R), At(HS)

Buy(G) Buy(R)

At(Home), Have(G), Have(R), Have(FA)

Finish

The At(HS) precondition is easy to achieve.


But if we introduce a causal link from Start to Go(HS) then we risk invalidating the
precondition for Go(JS).

241
An example of partial-order planning

A step that might invalidate (sometimes the word clobber is employed) a previ-
ously achieved precondition is called a threat.

[Diagram: a causal link protects a condition c between two steps; a third step with effect ¬c threatens it. Demotion orders the threatening step before the link; promotion orders it after the link.]

A planner can try to fix a threat by introducing an ordering constraint.

242
An example of partial-order planning

The planner could backtrack and try to achieve the At(x) precondition using the
existing Go(JS) step.

Start
At(JS)
At(Home) At(Home), Sells(JS,G), Sells(HS,R), Sells(HS,FA)
Go(HS)
Go(JS)
¬At(JS)

At(JS), Sells(JS,G) Sells(HS,R), At(HS)

Buy(G) Buy(R)

At(Home), Have(G), Have(R), Have(FA)

Finish

This involves a threat, but one that can be fixed using promotion.

243
The algorithm

Simplifying slightly to the case where there are no variables.


Say we have a partially completed plan and a set of the preconditions that have
yet to be achieved.

• Select a precondition p that has not yet been achieved and is associated with
an action B.
• At each stage the partially complete plan is expanded into a new collection of
plans.
• To expand a plan, we can try to achieve p either by using an action that’s
already in the plan or by adding a new action to the plan. In either case, call
the action A.

We then try to construct consistent plans where A achieves p.

244
The algorithm

This works as follows:

• For each possible way of achieving p:


– Add Start < A, A < Finish, A < B and the causal link A →p B to the plan.
– If the resulting plan is consistent we’re done, otherwise generate all pos-
sible ways of removing inconsistencies by promotion or demotion and keep
any resulting consistent plans.

At this stage:

• If you have no further preconditions that haven’t been achieved then any plan
obtained is valid.

245
The algorithm

But how do we try to enforce consistency?


When you attempt to achieve p using A:
• Find all the existing causal links A′ →¬p B′ that are clobbered by A.
• For each of those you can try adding A < A′ or B′ < A to the plan.
• Find all existing actions C in the plan that clobber the new causal link A →p B.
• For each of those you can try adding C < A or B < C to the plan.
• Generate every possible combination in this way and retain any consistent
plans that result.
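
A sketch of the promotion/demotion repair, assuming plans hold ordering constraints as pairs and causal links as triples as in the earlier sketch. The consistency test simply checks that the ordering relation is acyclic; everything here is illustrative rather than an actual planner.

import itertools

def consistent(orderings):
    # A set of ordering constraints is consistent if the < relation is acyclic.
    # Simple (inefficient) check: build the transitive closure, look for a cycle.
    closure = set(orderings)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in itertools.product(closure, repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return not any(a == b for (a, b) in closure)

def resolve_threat(orderings, threat, link):
    # threat: a step whose effect could clobber the causal link (si, c, sj).
    # Try demotion (threat before si) and promotion (threat after sj),
    # keeping whichever extended orderings remain consistent.
    si, _c, sj = link
    candidates = [orderings | {(threat, si)}, orderings | {(sj, threat)}]
    return [o for o in candidates if consistent(o)]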

246
Possible threats

What about dealing with variables?


If at any stage an effect ¬At(x) appears, is it a threat to At(JS)?
Such an occurrence is called a possible threat and we can deal with it by intro-
ducing inequality constraints: in this case x ≠ JS.

• Each partially complete plan now has a set I of inequality constraints asso-
ciated with it.
• An inequality constraint has the form v ≠ X where v is a variable and X is
a variable or a constant.
• Whenever we try to make a substitution we check I to make sure we won’t
introduce a conflict.

If we would introduce a conflict then we discard the partially completed plan as


inconsistent.

247
Planning II

Unsurprisingly, this process can become complex.


How might we improve matters?
One way would be to introduce heuristics. We now consider:

• The way in which basic heuristics might be defined for use in planning prob-
lems.
• The construction of planning graphs and their use in obtaining more sensible
heuristics.
• Planning graphs as the basis of the GraphPlan algorithm.

Another is to translate into the language of a general-purpose algorithm exploit-


ing its own heuristics. We now consider:

• Planning using propositional logic.


• Planning using constraint satisfaction.

248
An example of partial-order planning

We left our example problem here:


The planner could backtrack and try to achieve the At(x) precondition using the
existing Go(JS) step.

Start
At(JS)

At(Home) At(Home), Sells(JS,G), Sells(HS,R), Sells(HS,FA)


Go(HS)
Go(JS)
¬At(JS)

At(JS), Sells(JS,G) Sells(HS,R), At(HS)

Buy(G) Buy(R)

At(Home), Have(G), Have(R), Have(FA)

Finish

This involves a threat, but one that can be fixed using promotion.
249
Using heuristics in planning

We found in looking at search problems that heuristics were a helpful thing to


have.
Note that now there is no simple representation of a state, and consequently it is
harder to measure the distance to a goal.
Defining heuristics for planning is therefore more difficult than it was for search
problems. Simple possibilities:
h = number of unsatisfied preconditions
or
h = number of unsatisfied preconditions − number satisfied by the start state
These can lead to underestimates or overestimates:

• Underestimates if actions can affect one another in undesirable ways.


• Overestimates if actions achieve many preconditions.

250
Using heuristics in planning

We can go a little further by learning from Constraint Satisfaction Problems and


adopting the most constrained variable heuristic:

• Prefer the precondition satisfiable in the smallest number of ways.

This can be computationally demanding but two special cases are helpful:

• Choose preconditions for which no action will satisfy them.


• Choose preconditions that can only be satisfied in one way.

But these still seem somewhat basic.


We can do better using Planning Graphs. These are easy to construct and can also
be used to generate entire plans.

251
Planning graphs

Planning Graphs apply when it is possible to work entirely using propositional


representations of plans. Luckily, STRIPS can always be propositionalized…

Predicate form:
Go(y): precondition At(x); effects At(y), ¬At(x).

Propositional (ground) instances:
Go(JS): precondition At(Home); effects At(JS), ¬At(Home)
Go(HS): precondition At(JS); effects At(HS), ¬At(JS)
Go(HS): precondition At(Home); effects At(HS), ¬At(Home)
Go(Home): precondition At(JS); effects At(Home), ¬At(JS)
and so on…

252
Planning graphs

A planning graph is constructed in levels:

• Level 0 corresponds to the start state.


• At each level we keep approximate track of all things that could be true at the
corresponding time.
• At each level we keep approximate track of what actions could be applicable
at the corresponding time.

The approximation is due to the fact that not all conflicts between actions are
tracked. So:

• The graph can underestimate how long it might take for a particular proposi-
tion to appear, and therefore . . .
• . . . a heuristic can be extracted.

For example: the triumphant return of the gorilla-purchasing roof-climbers…

253
Planning graphs: a simple example

Our intrepid student adventurers will of course need to inflate their gorilla before
attaching it to a distinguished roof . It has to be purchased before it can be inflated.
Start state: Empty.
We assume that anything not mentioned in a state is false. So the state is actually
¬Have(Gorilla) and ¬Inflated(Gorilla)
Actions:

Buy(Gorilla): precondition ¬Have(Gorilla); effect Have(Gorilla).
Inflate(Gorilla): precondition Have(Gorilla); effect Inflated(Gorilla).

Goal: Have(Gorilla) and Inflated(Gorilla).

254
Planning graphs

S0 A0 S1 A1 S2

¬H(G) ¬H(G) ¬H(G)

Buy(G)
H(G)

Buy(G) H(G)

I(G)
Inf(G)

¬I(G) ¬I(G) ¬I(G)

S0 describes the start state; A0 contains all actions available in the start state; S1 contains all possibilities for what might be the case at time 1; A1 contains all actions that might be available at time 1; S2 contains all possibilities for what might be the case at time 2.

The small unlabelled boxes in the graph denote persistence actions—what happens if no action is taken.


An action level Ai contains all actions that could happen given the propositions in Si .

255
Mutex links

We also record, using mutual exclusion (mutex) links which pairs of actions could
not occur together.
Mutex links 1: Effects are inconsistent.

S0 A0 S1

¬H(G) ¬H(G)

Buy(G) H(G)

The effect of one action negates the effect of another.

256
Mutex links

Mutex links 2: The actions interfere.

S1 A1 S2

I(G)
Inf(G)

¬I(G) ¬I(G)

The effect of an action negates the precondition of another.

257
Mutex links

Mutex links 3: Competing for preconditions.

S1 A1

¬H(G)

Buy(G)

H(G)

Inf(G)

The precondition for an action is mutually exclusive with the precondition for
another. (See next slide!)

258
Mutex links

A state level Si contains all propositions that could be true, given the possible
preceding actions.
We also use mutex links to record pairs that can not be true simultaneously:
Possibility 1: pair consists of a proposition and its negation.

S1

¬H(G)

H(G)

259
Mutex links

Possibility 2: all pairs of actions that could achieve the pair of propositions are
mutex.

A1 S2

¬H(G)

Buy(G)
H(G)

I(G)
Inf(G)

The construction of a planning graph is continued until two identical levels are
obtained.

260
Planning graphs

S0 A0 S1 A1 S2

¬H(G) ¬H(G) ¬H(G)

Buy(G)
H(G)

Buy(G) H(G)

I(G)
Inf(G)

¬I(G) ¬I(G) ¬I(G)

261
Obtaining heuristics from a planning graph

To estimate the cost of reaching a single proposition:

• Any proposition not appearing in the final level has infinite cost and can never
be reached.
• The level cost of a proposition is the level at which it first appears but this may
be inaccurate as several actions can apply at each level and this cost does not
count the number of actions. (It is however admissible.)
• A serial planning graph includes mutex links between all pairs of actions ex-
cept persistence actions.

Level cost in serial planning graphs can be quite a good measurement.

262
Obtaining heuristics from a planning graph

How about estimating the cost to achieve a collection of propositions?

• Max-level: use the maximum level in the graph of any proposition in the set.
Admissible but can be inaccurate.
• Level-sum: use the sum of the levels of the propositions. Inadmissible but
sometimes quite accurate if goals tend to be decomposable.
• Set-level: use the level at which all propositions appear with none being mu-
tex. Can be accurate if goals tend not to be decomposable.
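
Given the level cost of each proposition, the first two of these heuristics can be computed directly; set-level additionally needs the mutex information and is omitted from this sketch. The function names, and the example costs (taken from the gorilla graph, where H(G) first appears at level 1 and I(G) at level 2), are illustrative.

def max_level(goals, level_cost):
    # level_cost maps a proposition to the level at which it first appears.
    # Propositions that never appear have infinite cost.
    return max(level_cost.get(g, float("inf")) for g in goals)

def level_sum(goals, level_cost):
    return sum(level_cost.get(g, float("inf")) for g in goals)

cost = {"H(G)": 1, "I(G)": 2}
print(max_level({"H(G)", "I(G)"}, cost))   # 2
print(level_sum({"H(G)", "I(G)"}, cost))   # 3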

263
Other points about planning graphs

A planning graph guarantees that:

1. If a proposition appears at some level, there may be a way of achieving it.


2. If a proposition does not appear, it can not be achieved.

The first point here is a loose guarantee because only pairs of items are linked by
mutex links.
Looking at larger collections can strengthen the guarantee, but in practice the
gains are outweighed by the increased computation.

264
Graphplan

The GraphPlan algorithm goes beyond using the planning graph as a source of
heuristics.

1 function GraphPlan()
2 Start at level 0;
3 while true do
4 if All goal propositions appear in the current level AND no pair has a mutex link then
5 Attempt to extract a plan;
6 if A solution is obtained then
7 return SOME solution;
8 if Graph indicates there is no solution then
9 return NONE;
10 Expand the graph to the next level;

We extract a plan directly from the planning graph. Termination can be proved
but will not be covered here.

265
Graphplan in action

Here, at levels S0 and S1 we do not have both H(G) and I(G) available with no
mutex links, and so we expand first to S1 and then to S2.

S0 A0 S1 A1 S2

¬H(G) ¬H(G) ¬H(G)

Buy(G)
H(G)

Buy(G) H(G)

I(G)
Inf(G)

¬I(G) ¬I(G) ¬I(G)

At S2 we try to extract a solution (plan).

266
Extracting a plan from the graph

Extraction of a plan can be formalised as a search problem.


States contain a level, and a collection of unsatisfied goal propositions.
Start state: the current final level of the graph, along with the relevant goal propo-
sitions.
Goal: a state at level S0 containing the initial propositions.
Actions: For a state S with level Si, a valid action is to select any set X of actions
in Ai−1 such that:

1. no pair has a mutex link;


2. no pair of their preconditions has a mutex link;
3. the effects of the actions in X achieve the propositions in S.

The effect of such an action is a state having level Si−1, and containing the pre-
conditions for the actions in X.
Each action has a cost of 1.
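
Put as code, the key step of this extraction search (generating the valid moves at one level) looks roughly as follows. This is a hedged sketch: the check that pairs of preconditions are non-mutex is omitted, and the action and mutex structures are invented for the example.

from itertools import chain, combinations

def subsets(actions):
    # All non-empty subsets of a (small) collection of action names.
    a = list(actions)
    return chain.from_iterable(combinations(a, r) for r in range(1, len(a) + 1))

def extraction_moves(goals, layer_actions, action_mutex):
    # goals: propositions to achieve at level Si.
    # layer_actions: name -> {"pre": set, "eff": set} for actions in A(i-1),
    #                including persistence actions.
    # action_mutex: set of frozensets of mutually exclusive action names.
    # Returns pairs (chosen actions, their combined preconditions), i.e. the
    # states at level S(i-1) reachable in the extraction search.
    moves = []
    for X in subsets(layer_actions):
        if any(frozenset(pair) in action_mutex for pair in combinations(X, 2)):
            continue
        effects = set().union(*(layer_actions[a]["eff"] for a in X))
        if goals <= effects:
            preconditions = set().union(*(layer_actions[a]["pre"] for a in X))
            moves.append((set(X), preconditions))
    return moves

# The gorilla example at level S2: goals {H(G), I(G)}, actions in A1.
A1 = {
    "Inf(G)":      {"pre": {"H(G)"},  "eff": {"I(G)"}},
    "persist-HG":  {"pre": {"H(G)"},  "eff": {"H(G)"}},
    "persist-nIG": {"pre": {"-I(G)"}, "eff": {"-I(G)"}},
}
mutex = {frozenset({"Inf(G)", "persist-nIG"})}
print(extraction_moves({"H(G)", "I(G)"}, A1, mutex))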

267
Graphplan in action

S0 A0 S1 A1 S2

¬H(G) ¬H(G) ¬H(G)

Buy(G)
H(G)

Buy(G) H(G)

I(G)
Inf(G)

¬I(G) ¬I(G) ¬I(G)

[Diagram of the extraction search, working backwards through levels S2, S1, S0: at S2 the goals are H(G) and I(G); choosing Inf(G) together with the persistence action for H(G) leaves H(G) to achieve at S1; choosing Buy(G) then leads back to the start state at S0.]

268
Heuristics for plan extraction

We can of course also apply heuristics to this part of the process.


For example, when dealing with a set of propositions:

• Choose the proposition having maximum level cost first.


• For that proposition, attempt to achieve it using the action for which the
maximum/sum level cost of its preconditions is minimum.

269
Planning III: planning using propositional logic

We’ve seen that plans might be extracted from a knowledge base via theorem
proving, using first order logic (FOL) and situation calculus.
BUT : this might be computationally infeasible for realistic problems.
Sophisticated techniques are available for testing satisfiability in propositional
logic, and these have also been applied to planning.
The basic idea is to attempt to find a model of a sentence having the form
description of start state
∧ descriptions of the possible actions
∧ description of goal
We attempt to construct this sentence such that:

• If M is a model of the sentence then M assigns true to a proposition if and


only if it is in the plan.
• Any assignment denoting an incorrect plan will not be a model as the goal
description will not be true.
• The sentence is unsatisfiable if no plan exists.
270
Propositional logic for planning

Two roof-climbers want to swap places:


Start state:
S = At^0(a, spire) ∧ At^0(b, ground)
∧ ¬At^0(a, ground) ∧ ¬At^0(b, spire)

Remember that an expression such as At^0(a, spire) is a proposition. The superscripted number now denotes time.
271
Propositional logic for planning

Goal:
G = At^i(a, ground) ∧ At^i(b, spire)
∧ ¬At^i(a, spire) ∧ ¬At^i(b, ground)
Actions: can be introduced using the equivalent of successor-state axioms

At^1(a, ground) ↔ (At^0(a, ground) ∧ ¬Move^0(a, ground, spire))
                ∨ (At^0(a, spire) ∧ Move^0(a, spire, ground))     (1)

Denote by A the collection of all such axioms.

272
Propositional logic for planning

We will now find that S ∧ A ∧ G has a model in which Move^0(a, spire, ground)
and Move^0(b, ground, spire) are true while all remaining actions are false.
In more realistic planning problems we will clearly not know in advance the time at which
the goal might be expected to be achieved.
We therefore:

• Loop through possible final times T .


• Generate a goal for time T and actions up to time T .
• Try to find a model and extract a plan.
• Until a plan is obtained or we hit some maximum time.
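
The loop can be sketched as follows. Here encode and sat_solve are assumed placeholders, standing for a propositional encoder and any off-the-shelf satisfiability tester; neither is defined here.

def plan_by_satisfiability(start, actions, goal, t_max, encode, sat_solve):
    # encode builds the sentence (start state, action axioms up to T, goal at T);
    # sat_solve returns a model as a dict {proposition: bool}, or None if
    # the sentence is unsatisfiable.
    for T in range(1, t_max + 1):
        sentence = encode(start, actions, goal, T)
        model = sat_solve(sentence)
        if model is not None:
            # Extract the plan: the action propositions assigned true
            # (here we just pick out the Move propositions of the example).
            return sorted(p for p, value in model.items()
                          if value and p.startswith("Move"))
    return None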

273
Propositional logic for planning

Unfortunately there is a problem—we may, if considerable care is not applied,


also be able to obtain less sensible plans.
In the current example

Move^0(b, ground, spire) = true
Move^0(a, spire, ground) = true
Move^0(a, ground, spire) = true

is a model, because the successor-state axiom (1) does not in fact preclude the
application of Move^0(a, ground, spire).
We need a precondition axiom

Move^i(a, ground, spire) → At^i(a, ground)

and so on.

274
Propositional logic for planning

Life becomes more complicated still if a third location is added: hospital.


Move^0(a, spire, ground) ∧ Move^0(a, spire, hospital)
is perfectly valid and so we need to specify that he can't move to two places
simultaneously
¬(Move^i(a, spire, ground) ∧ Move^i(a, spire, hospital))
¬(Move^i(a, ground, spire) ∧ Move^i(a, ground, hospital))
…

and so on.
These are action-exclusion axioms.
Unfortunately they will tend to produce totally-ordered rather than partially-
ordered plans.

275
Propositional logic for planning

Alternatively:

1. Prevent actions occurring together if one negates the effect or precondition


of the other.
2. Or, specify that something can’t be in two places simultaneously
¬(At^i(x, l1) ∧ At^i(x, l2))

for all combinations of x, i and l1 ≠ l2.


This is an example of a state constraint.
Clearly this process can become very complex, but there are techniques to help
deal with this.

276
Review of constraint satisfaction problems (CSPs)

Recall that in a CSP we have:

• A set of n variables V1, V2, . . . , Vn.


• For each Vi a domain Di specifying the values that Vi can take.
• A set of m constraints C1, C2, . . . , Cm.

Each constraint Ci involves a set of variables and specifies an allowable collection


of values.

• A state is an assignment of specific values to some or all of the variables.


• An assignment is consistent if it violates no constraints.
• An assignment is complete if it gives a value to every variable.

A solution is a consistent and complete assignment.

277
The state-variable representation

Another planning language: the state-variable representation.


Things of interest such as people, places, objects etc are divided into domains:
D1 = {climber1, climber2}
D2 = {home, jokeShop, hardwareStore, pavement, spire, hospital}
D3 = {rope, gorilla}
Part of the specification of a planning problem involves stating which domain a
particular item is in. For example
D1(climber1)
and so on.
Relations and functions have arguments chosen from unions of these domains. For example
above ⊆ D1^above × D2^above
is a relation. The Di^above are unions of one or more Di.
Note: the domains Di of the state-variable representation should not be confused with the
domains of the CSP variables introduced later.
278
The state-variable representation

The relation above is in fact a rigid relation (RR), as it is unchanging: it does not
depend upon state. (Remember fluents in situation calculus?)
Similarly, we have functions
at(x1, s) : D1^at × S → D^at.
Here, at(x, s) is a state-variable. The domain D1^at and range D^at are unions of
one or more Di. In general these can have multiple parameters
sv(x1, . . . , xn, s) : D1^sv × · · · × Dn^sv × S → D^sv.
A state-variable denotes assertions such as
at(gorilla, s) = jokeShop
where s denotes a state and the set S of all states will be defined later.
The state variable allows things such as locations to change—again, much like
fluents in the situation calculus.
Variables appearing in relations and functions are considered to be typed.

279
The state-variable representation

Note:

• For properties such as a location a function might be considerably more suit-


able than a relation.
• For locations, everything has to be somewhere and it can only be in one place
at a time.

So a function is perfect and immediately solves some of the problems seen earlier.

280
The state-variable representation

Actions, as usual, have a name, a set of preconditions and a set of effects.

• Names are unique, and followed by a list of variables involved in the action.
• Preconditions are expressions involving state variables and relations.
• Effects are assignments to state variables.

For example:

buy(x, y, l)
Preconditions at(x, s) = l
sells(l, y)
has(y, s) = l
Effects has(y, s) = x

281
The state-variable representation

Goals are sets of expressions involving state variables.


For example:

Goal:
at(climber, s) = home
has(rope, s) = climber
at(gorilla, s) = spire

From now on we will generally suppress the state s when writing state variables.

282
The state-variable representation

A state is just a statement of what values the state variables take at a given time.

s={ has(gorilla) = jokeShop


has(firstAidKit) = climber2
has(rope) = climber2
..
.

at(climber1) = jokeShop
at(climber2) = spire
..
.
}

• For each state variable sv consider all ground instances, such as sv(climber, rope),
with arguments consistent with the rigid relations.
Define X to be the set of all such ground instances.
• A state s is then just a set
s = {(v = c)|v ∈ X}
where c is in the range of v.

This allows us to define the effect of an action.


A planning problem also needs a start state s0, which can be defined in this way.
283
The state-variable representation

Considering all the ground actions consistent with the rigid relations:

s={ has(gorilla) = jokeShop


has(firstAidKit) = climber2 buy(climber1, gorilla, jokeShop)
has(rope) = climber2
..
. In the definition of buy(x, y, l):
at(climber1) = jokeShop x = climber1
at(climber2) = spire y = gorilla
.. l = jokeShop
.
}

sells(jokeShop, gorilla)

• An action is applicable in s if all expressions v = c appearing in the set of


preconditions also appear in s.
• As there is no rigid relation sells(jokeShop, fruitBats) we would not con-
sider an action such as buy(climber1, fruitBats, jokeShop)—it is not con-
sistent with the rigid relations.

284
The state-variable representation

Finally, there is a function γ that maps a state and an action to a new state
γ(s, a) = s′

s={ has(gorilla) = jokeShop s0 = { has(gorilla) = climber1


has(firstAidKit) = climber2 has(firstAidKit) = climber2
has(rope) = climber2 has(rope) = climber2
.. ..
. γ(buy(climber1, gorilla, jokeShop), s) .

at(climber1) = jokeShop at(climber1) = jokeShop


at(climber2) = spire at(climber2) = spire
.. ..
. .
} }

Specifically, we have
γ(s, a) = {(v = c)|v ∈ X}
where either c is specified in an effect of a, or otherwise v = c is a member of s.
Note: the definition of γ implicitly solves the frame problem.
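
A one-line sketch of γ, with states held as dictionaries from ground state variables to values (an illustrative encoding, not part of the formal definition):

def gamma(state, action_effects):
    # state: dict mapping ground state variables to values, e.g.
    #   {"at(climber1)": "jokeShop", "has(gorilla)": "jokeShop", ...}
    # action_effects: the assignments made by the action's effects.
    # Anything not mentioned in the effects keeps its old value, which is how
    # this definition implicitly solves the frame problem.
    new_state = dict(state)
    new_state.update(action_effects)
    return new_state

s = {"at(climber1)": "jokeShop", "has(gorilla)": "jokeShop"}
s_next = gamma(s, {"has(gorilla)": "climber1"})  # effect of buy(climber1, gorilla, jokeShop)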

285
The state-variable representation

A solution to a planning problem is a sequence (a0, a1, . . . , an) of actions such


that…

• a0 is applicable in s0 and for each i, ai is applicable in si = γ(si−1, ai−1).


• For each goal g we have
g ∈ γ(sn, an).

What we need now is a method for transforming a problem described in this


language into a CSP.
We’ll once again do this for a fixed upper limit T on the number of steps in the
plan.

286
Converting to a CSP

Step 1: encode actions as CSP variables.


For each time step t where 0 ≤ t ≤ T − 1, the CSP has a variable action^t with domain

D^t_action = {a | a is a ground instance of an action} ∪ {none}

Example: at some point in searching for a plan we might attempt to find the
solution to the corresponding CSP involving

action^5 = attach(gorilla, spire)

WARNING: be careful in what follows to distinguish between state variables, ac-


tions etc in the planning problem and variables in the CSP.

287
Converting to a CSP

Step 2: encode ground state variables as CSP variables, with a complete copy of
all the state variables for each time step.
So, for each t where 0 ≤ t ≤ T we have a CSP variable
sv_i^t(c1, . . . , cn)
with domain D = D^{sv_i}. (That is, the domain of the CSP variable is the range of
the state variable.)
Example: at some point in searching for a plan we might attempt to find the
solution to the corresponding CSP involving
location^9(climber1) = hospital.

288
Converting to a CSP

Step 3: encode the preconditions for actions in the planning problem as constraints
in the CSP problem.
For each time step t and for each ground action a(c1, . . . , cn) with arguments
consistent with the rigid relations in its preconditions:
For a precondition of the form sv_i = v include constraint pairs
(action^t = a(c1, . . . , cn), sv_i^t = v)

Example: consider the action buy(x, y, l) introduced above, and having the pre-
conditions at(x) = l, sells(l, y) and has(y) = l.
Assume sells(l, y) is only true for
l = jokeShop
and
y = gorilla
so we only consider these values for l and y. Then for each time step t we have
the constraints…
289
Converting to a CSP

action^t = buy(climber1, gorilla, jokeShop) paired with at^t(climber1) = jokeShop
action^t = buy(climber1, gorilla, jokeShop) paired with has^t(gorilla) = jokeShop
action^t = buy(climber2, gorilla, jokeShop) paired with at^t(climber2) = jokeShop
action^t = buy(climber2, gorilla, jokeShop) paired with has^t(gorilla) = jokeShop
and so on…

290
Converting to a CSP

Step 4: encode the effects of actions in the planning problem as constraints in the
CSP problem.
For each time step t and for each ground action a(c1, . . . , cn) with arguments
consistent with the rigid relations in its preconditions:
For an effect of the form sv_i = v include constraint pairs
(action^t = a(c1, . . . , cn), sv_i^{t+1} = v)

Example: continuing with the previous example, we will include constraints

action^t = buy(climber1, gorilla, jokeShop) paired with has^{t+1}(gorilla) = climber1
action^t = buy(climber2, gorilla, jokeShop) paired with has^{t+1}(gorilla) = climber2
and so on…

291
Converting to a CSP

Step 5: encode the frame axioms as constraints in the CSP problem.


An action must not change things not appearing in its effects. So:
For:

1. Each time step t.


2. Each ground action a(c1, . . . , cn) with arguments consistent with the rigid re-
lations in its preconditions.
3. Each sv_i that does not appear in the effects of a, and each v ∈ D^{sv_i},

include in the CSP the ternary constraint

(action^t = a(c1, . . . , cn), sv_i^t = v, sv_i^{t+1} = v).
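
As an indication of how Steps 1 to 5 fit together, here is a much-simplified sketch that builds the CSP variables and the precondition, effect and frame constraints for a single time step. The data structures are invented for the example, and the start-state and goal constraints are omitted.

def encode_step(t, ground_actions, state_vars):
    # ground_actions: name -> {"pre": {sv: value}, "eff": {sv: value}}
    # state_vars: sv -> the set of values it can take (its range)
    variables = {f"action^{t}": set(ground_actions) | {"none"}}
    variables.update({f"{sv}^{t}": set(rng) for sv, rng in state_vars.items()})
    variables.update({f"{sv}^{t+1}": set(rng) for sv, rng in state_vars.items()})

    constraints = []
    for name, a in ground_actions.items():
        for sv, v in a["pre"].items():      # Step 3: preconditions
            constraints.append(((f"action^{t}", name), (f"{sv}^{t}", v)))
        for sv, v in a["eff"].items():      # Step 4: effects
            constraints.append(((f"action^{t}", name), (f"{sv}^{t+1}", v)))
        for sv, rng in state_vars.items():  # Step 5: frame axioms
            if sv not in a["eff"]:
                for v in rng:
                    constraints.append(((f"action^{t}", name),
                                        (f"{sv}^{t}", v),
                                        (f"{sv}^{t+1}", v)))
    return variables, constraints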

292
Finding a plan

Finally, having encoded a planning problem into a CSP, we solve the CSP.
The scheme has the following property:
A solution to the planning problem with at most T steps exists if and only if there
is a solution to the corresponding CSP.
Assume the CSP has a solution.
Then we can extract a plan simply by looking at the values assigned to the
actiont variables in the solution of the CSP.
It is also the case that:
There is a solution to the planning problem with at most T steps if and only if there
is a solution to the corresponding CSP from which the solution can be extracted in
this way.
For a proof see:
Automated Planning: Theory and Practice
Malik Ghallab, Dana Nau and Paolo Traverso. Morgan Kaufmann 2004.

293
Artificial Intelligence I

Machine learning using neural networks

Reading: AIMA, chapter 20.


294
Did you heed the DIRE WARNING?

At the beginning of the course I suggested making sure you can answer the fol-
lowing two questions:
1. Let
   f(x1, . . . , xn) = Σ_{i=1}^{n} ai xi^2
   where the ai are constants. Compute ∂f/∂xj where 1 ≤ j ≤ n?
   Answer: As only one term in the sum depends on xj, all the other terms differentiate to give 0 and
   ∂f/∂xj = 2 aj xj.
2. Let f(x1, . . . , xn) be a function. Now assume xi = gi(y1, . . . , ym) for each xi
   and some collection of functions gi. Assuming all requirements for differentiability and so on
   are met, can you write down an expression for ∂f/∂yj where 1 ≤ j ≤ m?
   Answer: this is just the chain rule for partial differentiation
   ∂f/∂yj = Σ_{i=1}^{n} (∂f/∂gi)(∂gi/∂yj).

295
Supervised learning with neural networks

We now consider how an agent might learn to solve a general problem by seeing
examples:

• I present an outline of supervised learning.


• I introduce the classical perceptron.
• I introduce multilayer perceptrons and the backpropagation algorithm for train-
ing them.

To begin, a common source of problems in AI is medical diagnosis.


Imagine that we want to automate the diagnosis of an Embarrassing Disease (call
it D) by constructing a machine:

[Diagram: measurements taken from the patient (heart rate, blood pressure, presence of green spots and so on) are fed into a Machine, which outputs 1 if the patient suffers from D and 0 otherwise.]

Could we do this by explicitly writing a program that examines the measurements


and outputs a diagnosis? Experience suggests that this is unlikely.
296
An example, continued…

An alternative approach: each collection of measurements can be written as a


vector,
xT = ( x1 x2 · · · xn )
where,
x1 = heart rate
x2 = blood pressure
x3 = 1 if the patient has green spots, and 0 otherwise
..
and so on.
(Note: it’s a common convention that vectors are column vectors by default. This
is why the above is written as a transpose.)

297
An example, continued…

A vector of this kind contains all the measurements for a single patient and is
called a feature vector or instance.
The measurements are attributes or features.
Attributes or features generally appear as one of three basic types:

• Continuous: xi ∈ [xmin, xmax] where xmin, xmax ∈ R.


• Binary: xi ∈ {0, 1} or xi ∈ {−1, +1}.
• Discrete: xi can take one of a finite number of values, say xi ∈ {X1, . . . , Xp}.

298
An example, continued…

Now imagine that we have a large collection of patient histories (m in total) and
for each of these we know whether or not the patient suffered from D.

• The ith patient history gives us an instance xi.


• This can be paired with a single bit—0 or 1—denoting whether or not the ith
patient suffers from D. The resulting pair is called an example or a labelled
example.
• Collecting all the examples together we obtain a training sequence
s = ((x1, 0), (x2, 1), . . . , (xm, 0)).

299
An example, continued…

In supervised machine learning we aim to design a learning algorithm which


takes s and produces a hypothesis h.

[Diagram: the training sequence s is fed into the Learning Algorithm, which outputs a hypothesis h.]

Intuitively, a hypothesis is something that lets us diagnose new patients.


This is IMPORTANT : we want to diagnose patients that the system has never seen.
The ability to do this successfully is called generalisation.

300
An example, continued…

In fact, a hypothesis is just a function that maps instances to labels.

[Diagram: the attribute vector x is fed into the classifier h, which outputs the label h(x).]

As h is a function it assigns a label to any x and not just the ones that were in the
training sequence.
What we mean by a label here depends on whether we’re doing classification or
regression.

301
Supervised learning: classification and regression

In classification we’re assigning x to one of a set {ω1, . . . , ωc} of c classes. For


example, if x contains measurements taken from a patient then there might be
three classes:
ω1 = patient has disease
ω2 = patient doesn’t have disease
ω3 = don’t ask me buddy, I’m just a computer!
The binary case above also fits into this framework, and we’ll often specialise to
the case of two classes, denoted C1 and C2.
In regression we’re assigning x to a real number h(x) ∈ R. For example, if x
contains measurements taken regarding today’s weather then we might have
h(x) = estimate of amount of rainfall expected tomorrow.
For the two-class classification problem we will also refer to a situation somewhat
between the two, where
h(x) = Pr(x is in C1)
and so we would typically assign x to class C1 if h(x) > 1/2.

302
Summary

We don’t want to design h explicitly.

[Diagram: the training sequence s is fed into the learner L, producing h = L(s); the classifier h then maps an attribute vector x to the label h(x).]

So we use a learner L to infer it on the basis of a sequence s of training examples.

303
Neural networks

There is generally a set H of hypotheses from which L is allowed to select h


L(s) = h ∈ H
H is called the hypothesis space.
The learner can output a hypothesis explicitly or—as in the case of a neural net-
work—it can output a vector
wT = ( w1 w2 · · · wW )
of weights which in turn specify h
h(x) = f (w; x)
where w = L(s).

304
Types of learning

The form of machine learning described is called supervised learning. The litera-
ture also discusses unsupervised learning, semisupervised learning, learning using
membership queries and equivalence queries, and reinforcement learning. (More
about some of this next year…)
Supervised learning has multiple applications:
• Speech recognition.
• Deciding whether or not to give credit.
• Detecting credit card fraud.
• Deciding whether to buy or sell a stock option.
• Deciding whether a tumour is benign.
• Data mining: extracting interesting but hidden knowledge from existing, large
databases. For example, databases containing financial transactions or loan
applications.
• Automatic driving. (See Pomerleau, 1989, in which a car is driven for 90 miles
at 70 miles per hour, on a public road with other cars present, but with no
assistance from humans.)
305
This is very similar to curve fitting

This process is in fact very similar to curve fitting. Think of the process as follows:

• Nature picks an h0 ∈ H but doesn’t reveal it to us.


• Nature then shows us a training sequence s where each xi is labelled as
h0(xi) + εi where εi is noise of some kind.

Our job is to try to infer what h0 is on the basis of s only. Example: if H is the set of
all polynomials of degree 3 then nature might pick h0(x) = (1/3)x^3 − (3/2)x^2 + 2x − 1/2.

[Plot: the target curve h0(x) over 0 ≤ x ≤ 3, drawn dashed.]

The line is dashed to emphasise the fact that we don’t get to see it.
306
Curve fitting

We can now use h0 to obtain a training sequence s in the manner suggested.

[Plot: the noisy training examples (xi, yi), scattered around the dashed target curve h0, over 0 ≤ x ≤ 3.]

Here we have,
sT = ((x1, y1), (x2, y2), . . . , (xm, ym))
where each xi and yi is a real number.

307
Curve fitting

We’ll use a learning algorithm L that operates in a reasonable-looking way: it


picks an h ∈ H minimising the following quantity,
E = Σ_{i=1}^{m} (h(xi) − yi)^2.

In other words

h = L(s) = argmin_{h∈H} Σ_{i=1}^{m} (h(xi) − yi)^2.
Why is this sensible?

1. Each term in the sum is 0 if h(xi) is exactly yi.


2. Each term increases as the difference between h(xi) and yi increases.
3. We add the terms for all examples.

308
Curve fitting

If we pick h using this method then we get:

[Plot: the fitted cubic h (solid) lies close to the dashed target h0 over 0 ≤ x ≤ 3.]

The chosen h is close to the target h0, even though it was chosen using only a
small number of noisy examples.
It is not quite identical to the target concept.
However if we were given a new point x0 and asked to guess the value h0(x0)
then guessing h(x0) might be expected to do quite well.

309
Curve fitting

Problem: we don’t know what H nature is using. What if the one we choose
doesn’t match? We can make our H ‘bigger’ by defining it as
H = {h : h is a polynomial of degree at most 5}.
If we use the same learning algorithm then we get:

[Plot: the degree-5 fit h (solid), again close to the dashed target h0 over 0 ≤ x ≤ 3.]

The result in this case is similar to the previous one: h is again quite close to h0,
but not quite identical.

310
Curve fitting

So what’s the problem? Repeating the process with,


H = {h : h is a polynomial of degree at most 1}
gives the following:

[Plot: the best straight-line fit, which cannot follow the shape of the dashed target curve.]

In effect, we have made our H too ‘small’. It does not in fact contain any hypoth-
esis similar to h0.

311
Curve fitting

So we have to make H huge, right? WRONG‼! With


H = {h : h is a polynomial of degree at most 25}
we get:

[Plot: the degree-25 fit passes very close to the noisy training points but oscillates away from the dashed target curve.]

BEWARE‼! This is known as overfitting.
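
The whole experiment is easy to reproduce. The sketch below uses numpy's least-squares polynomial fitting as the learner L; the target polynomial is the one reconstructed above, while the sample size, noise level and test point are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)

def h0(x):
    # The target hypothesis picked by "nature", as reconstructed above.
    return (1/3) * x**3 - (3/2) * x**2 + 2 * x - 1/2

# A small noisy training sequence s.
x = np.linspace(0.1, 3.0, 30)
y = h0(x) + rng.normal(scale=0.05, size=x.shape)

for degree in (1, 3, 25):
    # np.polyfit performs the least-squares minimisation described above.
    # (For degree 25, numpy may warn that the fit is poorly conditioned,
    # which is itself a symptom of the overfitting being illustrated.)
    coeffs = np.polyfit(x, y, degree)
    x_new = 1.7
    print(degree, np.polyval(coeffs, x_new), h0(x_new))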

312
The perceptron

The example just given illustrates much of what we want to do. However in
practice we deal with more than a single dimension, so
xT = ( x1 x2 · · · xn ).
The simplest form of hypothesis used is the linear discriminant, also known as
the perceptron. Here
h(w; x) = σ( w0 + Σ_{i=1}^{n} wi xi ) = σ(w0 + w1x1 + w2x2 + · · · + wnxn).

So: we have a linear function modified by the activation function σ.


The perceptron’s influence continues to be felt in the recent and ongoing devel-
opment of support vector machines, and forms the basis for most of the field of
supervised learning.

313
The perceptron activation function I

There are three standard forms for the activation function:

1. Linear: for regression problems we often use
   σ(z) = z.
2. Step: for two-class classification problems we often use
   σ(z) = C1 if z > 0, and C2 otherwise.
3. Sigmoid/Logistic: for probabilistic classification we often use
   Pr(x is in C1) = σ(z) = 1/(1 + exp(−z)).

The step function is important but the algorithms involved are somewhat different
to those we’ll be seeing. We won’t consider it further.
The sigmoid/logistic function plays a major role in what follows.
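
A perceptron with the sigmoid activation takes only a few lines; the following sketch is just to fix the notation (the weights shown are arbitrary).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(w, x):
    # w = (w0, w1, ..., wn); x is prepended with the constant 1 so that
    # the weighted sum w0 + w1*x1 + ... + wn*xn is just a dot product.
    x = np.concatenate(([1.0], x))
    return sigmoid(w @ x)

w = np.array([-1.0, 2.0, 0.5])
print(perceptron(w, np.array([0.3, 0.8])))   # interpreted as Pr(x is in C1)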

314
The sigmoid/logistic function

[Two plots: left, the logistic function σ(z) = 1/(1 + exp(−z)) plotted against z; right, the logistic function applied to the output of a linear function of two inputs x1 and x2, giving Pr(x is in C1).]

315
Gradient descent

A method for training a basic perceptron works as follows. Assume we’re dealing
with a regression problem and using σ(z) = z.
We define a measure of error for a given collection of weights. For example
E(w) = Σ_{i=1}^{m} (yi − h(w; xi))^2.

Modifying our notation slightly so that

xT = ( 1 x1 x2 · · · xn )
wT = ( w0 w1 w2 · · · wn )

lets us write

E(w) = Σ_{i=1}^{m} (yi − wT xi)^2.
We want to minimise E(w).

316
Gradient descent

One way to approach this is to start with a random w0 and update it as follows:

wt+1 = wt − η ∂E(w)/∂w |_{w=wt}

where

∂E(w)/∂w = ( ∂E(w)/∂w0  ∂E(w)/∂w1  · · ·  ∂E(w)/∂wn )^T

and η is some small positive number.
The vector −∂E(w)/∂w tells us the direction of the steepest decrease in E(w).

317
Gradient descent

With

E(w) = Σ_{i=1}^{m} (yi − wT xi)^2

we have

∂E(w)/∂wj = ∂/∂wj [ Σ_{i=1}^{m} (yi − wT xi)^2 ]
          = Σ_{i=1}^{m} ∂/∂wj (yi − wT xi)^2
          = Σ_{i=1}^{m} 2(yi − wT xi) ∂/∂wj (−wT xi)
          = −2 Σ_{i=1}^{m} xi^(j) (yi − wT xi)

where xi^(j) is the jth element of xi.

318
Gradient descent

The method therefore gives the algorithm

wt+1 = wt + 2η Σ_{i=1}^{m} (yi − wtT xi) xi

Some things to note:

• In this case E(w) is parabolic and has a unique global minimum and no local
minima so this works well.
• Gradient descent in some form is a very common approach to this kind of
problem.
• We can perform a similar calculation for other activation functions and for
other definitions for E(w).
• Such calculations lead to different algorithms.
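
In code, the batch update for the linear case is very short. The sketch below uses numpy; the learning rate, number of steps and the tiny data set are arbitrary choices.

import numpy as np

def train_linear(X, y, eta=0.01, steps=1000):
    # X: m-by-(n+1) matrix of inputs, each row ( 1 x1 ... xn ).
    # y: vector of m labels.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        residual = y - X @ w                 # (yi - wT xi) for every example
        w = w + 2 * eta * (X.T @ residual)   # the update derived above
    return w

# Tiny example: fit y = 1 + 2*x from three noiseless points.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(train_linear(X, y))                    # approaches [1, 2]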

319
Perceptrons aren’t very powerful: the parity problem

There are many problems a perceptron can’t solve.

[Two plots of the parity problem: left, the labelled data points in the (x1, x2) plane; right, the output of a single perceptron as a function of x1 and x2, which cannot separate the two classes.]

We need a network that computes more interesting functions.

320
The multilayer perceptron

Each node in the network is itself a perceptron:

[Diagram: node j receives inputs z0 = 1, z1, . . . , zn through weights w0, w1, . . . , wn, forms the activation aj = Σ_{i=0}^{n} wi zi, and outputs zj = σ(aj).]

Weights wi connect nodes together, and aj is the weighted sum or activation for
node j. σ is the activation function and the output is zj = σ(aj ).
Reminder: we’ll continue to use the notation
zT = ( 1 z1 z2 · · · zn )
wT = ( w0 w1 w2 · · · wn )
so that
Σ_{i=0}^{n} wi zi = w0 + Σ_{i=1}^{n} wi zi = wT z.

321
The multilayer perceptron

In the general case we have a completely unrestricted feedforward structure:

[Diagram: a feedforward network taking the feature vector x = (x1, . . . , xn) as input; a weight wi→j connects node i to node j; the final node produces the output y = h(w; x).]

Each node is a perceptron. No specific layering is assumed.


wi→j connects node i to node j. w0 for node j is denoted w0→j .

322
Backpropagation

As usual we have:

• Instances xT = (x1, . . . , xn).


• A training sequence s = ((x1, y1), . . . , (xm, ym)).

We also define a measure of training error


E(w) = measure of the error of the network on s
where w is the vector of all the weights in the network.
Our aim is to find a set of weights that minimises E(w) using gradient descent.

323
Backpropagation: the general case

The central task is therefore to calculate ∂E(w)/∂w.

To do that we need to calculate the individual quantities

∂E(w)/∂wi→j

for every weight wi→j in the network.
Often E(w) is the sum of separate components, one for each example in s:

E(w) = Σ_{p=1}^{m} Ep(w)

in which case

∂E(w)/∂w = Σ_{p=1}^{m} ∂Ep(w)/∂w.

We can therefore consider examples individually.

324
Backpropagation: the general case

Place example p at the input and calculate aj and zj for all nodes including the
output y. This is forward propagation.
We have

∂Ep(w)/∂wi→j = (∂Ep(w)/∂aj)(∂aj/∂wi→j)

where aj = Σ_k wk→j zk. Here the sum is over all the nodes connected to node j. As

∂aj/∂wi→j = ∂/∂wi→j ( Σ_k wk→j zk ) = zi

we can write

∂Ep(w)/∂wi→j = δj zi

where we’ve defined

δj = ∂Ep(w)/∂aj.

325
Backpropagation: the general case

So we now need to calculate the values for δj . When j is the output node—that is,
the one producing the output y = h(w; xp) of the network—this is easy as zj = y
and
δj = ∂Ep(w)/∂aj
   = (∂Ep(w)/∂y)(∂y/∂aj)
   = (∂Ep(w)/∂y) σ′(aj)

using the fact that y = σ(aj). The first term is in general easy to calculate for a
given E as the error is generally just a measure of the distance between y and
the label yp in the training sequence.
Example: when

Ep(w) = (y − yp)^2

we have

∂Ep(w)/∂y = 2(y − yp) = 2(h(w; xp) − yp).
326
Backpropagation: the general case

When j is not an output node we need something different:

[Diagram: node j, with activation aj and output σ(aj), sends connections to nodes k1, k2, . . . , kq, which have activations ak1, ak2, . . . , akq.]

We’re interested in

δj = ∂Ep(w)/∂aj.
Altering aj can affect several other nodes k1, k2, . . . , kq each of which can in turn
affect Ep(w).

327
Backpropagation: the general case

[The same diagram as above.]

We have

δj = ∂Ep(w)/∂aj = Σ_{k∈{k1,k2,...,kq}} (∂Ep(w)/∂ak)(∂ak/∂aj) = Σ_{k∈{k1,k2,...,kq}} δk (∂ak/∂aj)

where k1, k2, . . . , kq are the nodes to which node j sends a connection.

328
Backpropagation: the general case

[The same diagram as above.]

Because we know how to compute δj for the output node we can work backwards
computing further δ values.
We will always know all the values δk for nodes ahead of where we are.
Hence the term backpropagation.

329
Backpropagation: the general case

[The same diagram as above.]

∂ak/∂aj = ∂/∂aj ( Σ_i wi→k σ(ai) ) = wj→k σ′(aj)

and

δj = Σ_{k∈{k1,k2,...,kq}} δk wj→k σ′(aj) = σ′(aj) Σ_{k∈{k1,k2,...,kq}} δk wj→k.

330
Backpropagation: the general case

Summary: to calculate ∂Ep(w)/∂w for the pth pattern:

1. Forward propagation: apply xp and calculate outputs and so on for all the nodes in
the network.
2. Backpropagation 1: for the output node
   ∂Ep(w)/∂wi→j = zi δj = zi σ′(aj) ∂Ep(w)/∂y
   where y = h(w; xp).
3. Backpropagation 2: for other nodes
   ∂Ep(w)/∂wi→j = zi σ′(aj) Σ_k δk wj→k
   where the δk were calculated at an earlier step.

331
Backpropagation: a specific example

[Diagram: a network with one hidden layer. Each hidden node receives inputs from all features x1, . . . , xn; the single output node receives inputs from all hidden nodes and produces y = h(w; x).]

For the output: σ(a) = a. For the hidden nodes: σ(a) = 1/(1 + exp(−a)).

332
Backpropagation: a specific example

For the output: σ(a) = a so σ′(a) = 1.

For the hidden nodes:

σ(a) = 1/(1 + exp(−a))

so

σ′(a) = σ(a)[1 − σ(a)].

We’ll continue using the same definition for the error

E(w) = Σ_{p=1}^{m} (yp − h(w; xp))^2
Ep(w) = (yp − h(w; xp))^2.

333
Backpropagation: a specific example

For the output: the equation is

∂Ep(w)/∂wi→output = zi δoutput = zi σ′(aoutput) ∂Ep(w)/∂y

where y = h(w; xp). So as

∂Ep(w)/∂y = ∂/∂y (yp − y)^2 = 2(y − yp) = 2[h(w; xp) − yp]

and σ′(a) = 1, we have

δoutput = 2[h(w; xp) − yp]

and

∂Ep(w)/∂wi→output = 2 zi (h(w; xp) − yp).

334
Backpropagation: a specific example

For the hidden nodes: the equation is

∂Ep(w)/∂wi→j = zi σ′(aj) Σ_k δk wj→k.

However there is only one output, so

∂Ep(w)/∂wi→j = zi σ(aj)[1 − σ(aj)] δoutput wj→output

and we know that

δoutput = 2[h(w; xp) − yp]

so

∂Ep(w)/∂wi→j = 2 zi σ(aj)[1 − σ(aj)] [h(w; xp) − yp] wj→output
             = 2 xi zj(1 − zj) [h(w; xp) − yp] wj→output.

335
Putting it all together

We can then use the derivatives in one of two basic ways:

Batch: (as described previously)

∂E(w)/∂w = Σ_{p=1}^{m} ∂Ep(w)/∂w

then

wt+1 = wt − η ∂E(w)/∂w |_{w=wt}.

Sequential: using just one pattern at once,

wt+1 = wt − η ∂Ep(w)/∂w |_{w=wt}

selecting patterns in sequence or at random.
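
The network of this specific example fits comfortably into a few lines of numpy. The sketch below follows the derivatives above (five hidden sigmoid units, a linear output, sequential updates); the initialisation, learning rate and the particular noisy parity data are arbitrary choices rather than those used to produce the plots that follow.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class OneHiddenLayerNet:
    def __init__(self, n_in, n_hidden=5):
        # Small random initial weights (an arbitrary choice).
        self.V = rng.normal(scale=0.5, size=(n_hidden, n_in + 1))  # input -> hidden
        self.w = rng.normal(scale=0.5, size=n_hidden + 1)          # hidden -> output

    def forward(self, x):
        x1 = np.concatenate(([1.0], x))   # prepend the constant input
        z = sigmoid(self.V @ x1)          # hidden outputs
        z1 = np.concatenate(([1.0], z))
        y = self.w @ z1                   # linear output node
        return x1, z1, y

    def sequential_update(self, x, target, eta=0.01):
        x1, z1, y = self.forward(x)
        d_out = 2.0 * (y - target)                               # delta for the output node
        grad_w = d_out * z1                                      # dEp/dw, hidden -> output
        d_hidden = z1[1:] * (1.0 - z1[1:]) * d_out * self.w[1:]  # deltas for hidden nodes
        grad_V = np.outer(d_hidden, x1)                          # dEp/dw, input -> hidden
        self.w -= eta * grad_w
        self.V -= eta * grad_V

# 40 noisy parity examples, in the spirit of the slides.
corners = rng.integers(0, 2, size=(40, 2))
X = corners + rng.normal(scale=0.2, size=(40, 2))
Y = (corners[:, 0] ^ corners[:, 1]).astype(float)

net = OneHiddenLayerNet(n_in=2)
for _ in range(1000):                 # 1000 passes through the training sequence
    for xp, yp in zip(X, Y):
        net.sequential_update(xp, yp)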

336
Example: the parity problem revisited

As an example we show the result of training a network with:

• Two inputs.
• One output.
• One hidden layer containing 5 units.
• η = 0.01.
• All other details as above.

The problem is the parity problem. There are 40 noisy examples.


The sequential approach is used, with 1000 repetitions through the entire training
sequence.

337
Example: the parity problem revisited

[Two plots over the (x1, x2) plane, showing the network's behaviour before training (left) and after training (right).]

338
Example: the parity problem revisited

[Two surface plots of the network output as a function of x1 and x2, before training and after training.]

339
Example: the parity problem revisited

[Plot: the error during training, over the 1000 passes through the training sequence.]

340
