Dr Sean Holden
Computer Laboratory, Room FC06
Telephone extension 63725
Email: sbh11@cam.ac.uk
www.cl.cam.ac.uk/users/sbh11/
1
Artificial Intelligence
Much of this has been driven by philosophers and people with something to sell.
3
Introduction: what are our aims?
• To understand intelligence.
• To understand ourselves.
Philosophers have worked on this for at least 2000 years. They’ve also wondered
about:
4
Introduction: what are our aims?
• Brains are small (true) and apparently slow (not quite so clear-cut), but in-
credibly good at some tasks—we want to understand a specific form of com-
putation.
• It would be nice to be able to construct intelligent systems.
• It is also nice to make and sell cool stuff.
5
Introduction: now is a fantastic time to investigate AI
In many ways this is a young field, having only really got under way in 1956
with the Dartmouth Conference.
www-formal.stanford.edu/jmc/history/dartmouth/dartmouth.html
Perhaps I’m being too hard on them; there was some good groundwork: Socrates wanted an algorithm for “piety”,
leading to Syllogisms. Ramon Llull’s concept wheels and other attempts at mechanical calculators. René Descartes’
Dualism and the idea of mind as a physical system. Wilhelm Leibniz’s opposing position of Materialism. (The
intermediate position: mind is physical but unknowable.) The origin of knowledge: Francis Bacon’s Empiricism, John
Locke: “Nothing is in the understanding, which was not first in the senses”. David Hume: we obtain rules by repeated
exposure: Induction. Further developed by Bertrand Russell and in the Confirmation Theory of Carnap and Hempel.
More recently: the connection between knowledge and action? How are actions justified? If to achieve the end you
need to achieve something intermediate, consider how to achieve that, and so on. This approach was implemented
in Newell and Simon’s 1957 General Problem Solver (GPS).
6
What has been achieved?
7
What has been achieved?
8
The nature of the pursuit
Here, the word rational has a special meaning: it means doing the correct thing
in given circumstances.
9
What is AI, version one: acting like a human
In the unrestricted Turing test the AI program may also have a camera attached,
so that objects can be shown to it, and so on.
The Turing test is informative, and (very!) hard to pass. (See the Loebner Prize…)
• It requires many abilities that seem necessary for AI, such as learning. BUT :
a human child would probably not pass the test.
• Sometimes an AI system needs human-like acting abilities—for example ex-
pert systems often have to produce explanations—but not always.
10
What is AI, version two: thinking like a human
There is always the possibility that a machine acting like a human does not ac-
tually think. The cognitive modelling approach to AI has tried to:
11
What is AI, version three: thinking rationally and the “laws of thought”
The idea that intelligence reduces to rational thinking is a very old one, going at
least as far back as Aristotle as we’ve already seen.
The general field of logic made major progress in the 19th and 20th centuries,
allowing it to be applied to AI.
This is a very appealing idea, but there are obstacles. It is hard to:
These will be recurring themes in this course, and in Machine Learning and Bayesian
Inference next year.
12
What is AI, version four: acting rationally
As a result, we will focus on the idea of designing systems that act rationally.
13
Other fields that have contributed to AI
14
What’s in this course?
This course introduces some of the fundamental areas that make up AI:
Strictly speaking, this course covers what is often referred to as “Good Old-Fashioned
AI”. (Although “Old-Fashioned” is a misleading term.)
The nature of the subject changed when the importance of uncertainty was fully
appreciated. Machine Learning and Bayesian Inference covers this more recent
material.
15
What’s not in this course?
16
Introductory reading that isn’t nonsense
• Francis Crick, “The recent excitement about neural networks”, Nature (1989) is
still entirely relevant:
www.nature.com/nature/journal/v337/n6203/abs/337129a0.html
provides a good illustration of how far we are from passing the Turing test.
• Marvin Minsky, “Why people think computers can’t”, AI Magazine (1982) is
an excellent response to nay-saying philosophers.
http://web.media.mit.edu/~minsky/
• Go: www.nature.com/nature/journal/v529/n7587/full/nature16961.html
• AI at Nasa Ames:
www.nasa.gov/centers/ames/research/areas-of-ames-ingenuity-autonomy-and-robotics
17
Introductory reading that isn’t nonsense
• AI in the UK: ready, willing and able?
House of Lords Select Committee on Artificial Intelligence
https://publications.parliament.uk/pa/ld201719/ldselect/ldai/100/100.pdf
• Machine learning: the power and promise of computers that learn by example
The Royal Society
https://royalsociety.org/topics-policy/projects/machine-learning/
18
Text book
The prerequisites for the course are: first order logic, some algorithms and data
structures, discrete and continuous mathematics, and basic computational com-
plexity.
DIRE WARNING:
No doubt you want to know something about machine learning, given the recent
peak in interest.
In the lectures on machine learning I will be talking about neural networks.
I will introduce the backpropagation algorithm, which is the foundation for both
classical neural networks and the more fashionable deep learning methods.
This means you will need to be able to differentiate and also handle vectors and
matrices.
If you’ve forgotten how to do this you WILL get lost—I guarantee it!!!
20
Prerequisites
Self test:
1. Let
f(x1, . . . , xn) = Σ_{i=1}^{n} ai xi^2
where the ai are constants. Can you compute ∂f/∂xj where 1 ≤ j ≤ n?
2. Let f (x1, . . . , xn) be a function. Now assume xi = gi(y1, . . . , ym) for each xi
and some collection of functions gi. Assuming all requirements for differen-
tiability and so on are met, can you write down an expression for ∂f /∂yj
where 1 ≤ j ≤ m?
If the answer to either of these questions is “no” then it’s time for some revision.
(You have about three weeks’ notice, so I’ll assume you know it!)
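For reference, the answers you should be able to produce (assuming the usual conventions) are
∂f/∂xj = 2 aj xj
and, by the chain rule,
∂f/∂yj = Σ_{i=1}^{n} (∂f/∂xi)(∂gi/∂yj).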
21
And finally. . .
22
And finally. . .
More practically, you will often hear me make the claim that everything that’s at
all interesting in AI is at least NP-complete.
There are two ways to interpret this:
1. The wrong way: “It’s all a waste of time.1” OK, so it’s a partly understandable
interpretation. BUT the fact that Boolean satisfiability is intractable does not
mean we can’t solve large instances in practice. . .
2. The right way: “It’s an opportunity to design nice approximation algorithms.”
In reality, the algorithms that are good in practice are ones that try to find a
good but not necessarily optimal solution, in a reasonable amount of time and
memory.
1
In essence, a comment on a course assessment a couple of years back to the effect of: “Why do you teach us this stuff if it’s all futile?”
23
Agents
There are many different definitions for the term agent within AI.
Allow me to introduce EVIL ROBOT.
(Figure: EVIL ROBOT ("MUST ENSLAVE EARTH!!! Dr Holden will be our GLORIOUS LEADER!!!") sensing and acting upon its environment.)
We will use the following simple definition: an agent is any device that can sense
and act upon its environment.
24
Agents
This definition can be very widely applied: to humans, robots, pieces of software,
and so on.
We are taking quite an applied perspective. We want to make things rather than
copy humans. So:
Recall that we are interested in devices that act rationally, where ‘rational’ means
doing the correct thing under given circumstances.
25
Measuring performance
26
Environments
27
Programming agents
Item 3: Are there sensible ways in which to think about the structure of an agent?
A basic agent can be thought of as working according to a straightforward un-
derlying process. To achieve some goal:
• Gather perceptions.
• Update working memory to take account of them.
• On the basis of what’s in the working memory, choose an action to perform.
• Update the working memory to take account of this action.
• Do the chosen action.
Obviously, this hides a great deal of complexity:
• A percept might arrive while an action is being chosen.
• The world may change while an action is being chosen.
• Actions may affect the world in unexpected ways.
• We might have multiple goals, which interact with each other.
• And so on…
28
Keeping track of the environment, and having a goal
It also seems reasonable that an agent should choose a rational course of action
depending on its goal.
• If an agent has knowledge of how its actions affect the environment, then it
has a basis for choosing actions to achieve goals.
29
Goal-based agents
(Figure: a goal-based agent. Percepts update a description of the current environment; from this and a description of the goal, the agent infers an action or action sequence.)
30
Utility-based agents
• There may be many sequences of actions that lead to a given goal, and some
may be preferable to others.
• We might need to trade-off conflicting goals, for example speed and safety.
• An agent may have several goals, but not be certain of achieving any of them.
Can it trade-off the likelihood of reaching a goal against the desirability of
getting there?
31
Learning agents
(Figure: a learning agent. Percepts update a description of the current environment; feedback drives a learner, which updates a description of the behaviour of the environment; from these and a description of the goal, the agent infers an action or action sequence.)
1. The learner needs some form of feedback on the agent’s performance. This
can come in several different forms.
2. The learner needs a means of generating new behaviour in order to find out
about the world.
1. Should the agent spend time exploiting what it’s learned so far, if it’s achieving
a level of success, or…
2. …should the agent try new things, exploring the environment on the basis
that it might learn something really useful even if it performs worse in the
short term?
33
Artificial Intelligence
We begin with what is perhaps the simplest collection of AI techniques: those al-
lowing an agent existing within an environment to search for a sequence of actions
that achieves a goal.
Search algorithms apply to a particularly simple class of problems—we need to
identify:
35
Problem solving by search
36
Problem solving by search
This is all important stuff, but there’s a problem: none of these methods works in
practice for typical AI problems!
Essentially, the problem is that they are too naïve in the way that they choose a
state to explore at each step.
I’m going to assume that you know this material and move on…
37
Problem solving by search
(Figure: the 8-puzzle. Starting from a scrambled start state, each action slides a tile into the blank square; a sequence of further actions reaches the goal state 1 2 3 / 4 5 6 / 7 8.)
38
Problem solving by search
Here we have:
The 8-puzzle is very simple. However general sliding block puzzles are a good
test case. The general problem is NP-complete. The 5 × 5 version has about 10^25
states, and a random instance is in fact quite a challenge.
39
Problem solving by search
Problems of this kind are very simple, but a surprisingly large number of appli-
cations have appeared:
• Route-finding/tour-finding.
• Layout of VLSI systems.
• Navigation systems for robots.
• Sequencing for automatic assembly.
• Searching the internet.
• Design of proteins.
40
Search trees versus search graphs
We need to make an important distinction between search trees and search graphs.
(Figure: a state s appearing in several nodes of a search tree, as opposed to a single node of a graph.)
• In a tree only one path can lead to a given node, but a state s can appear in
multiple nodes.
• In a graph a state can appear in only one node, but may be reached via multiple
paths.
• In a graph we may encounter cycles.
• In a graph we may encounter redundant paths, where multiple paths lead to
the same state.
41
Search trees versus search graphs
(Figure: a small graph and the corresponding search tree, in which the same states appear repeatedly as the tree is expanded.)
42
The basic tree-search algorithm
We need to define one more function: expand takes any state s. It applies all
actions that can be applied in s and returns the set of the resulting states:
expand(s) = {s′ | s′ = action(a, s) where a is an action possible in s}.
The algorithm for searching in a tree then looks like this:
1 fringe = [s0 ];
2 while true do
3 if fringe.empty() then
4 return NONE;
5 s = fringe.remove();
6 if goal(s) then
7 return (SOME s);
8 fringe.addAll(expand(s));
The search strategy is set by using a priority queue to implement the fringe.
The definition of priority then sets the way in which the tree is searched.
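As a purely illustrative sketch (not part of the notes), the loop above can be written in Python with heapq supplying the priority queue; the goal test, expand function and priority function are assumed to be provided by the particular problem.

import heapq

def tree_search(s0, goal, expand, priority):
    counter = 0                                   # tie-breaker so states themselves are never compared
    fringe = [(priority(s0), counter, s0)]
    while fringe:
        _, _, s = heapq.heappop(fringe)           # remove the best state from the fringe
        if goal(s):
            return s                              # SOME s
        for s2 in expand(s):
            counter += 1
            heapq.heappush(fringe, (priority(s2), counter, s2))
    return None                                   # NONE: the fringe is empty

Different choices of the priority function then give the different search strategies discussed on the following slides.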
43
The basic tree-search algorithm
At each iteration, one node from the fringe is expanded. In general, if the branching
factor is b then the layer at depth d can have b^d states.
The entire tree to depth d can have Σ_{i=0}^{d} b^i = (b^{d+1} − 1)/(b − 1) states.
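As a worked instance (my numbers, not the notes'): with b = 10 and d = 5 the layer at depth 5 can have 10^5 = 100,000 states, and the entire tree can have (10^6 − 1)/9 = 111,111 states.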
44
Graph search
To search in graphs we need a way to make sure no state gets visited more than
once.
We need to add a closed list, and add a state to it when the state is first seen:
1 closed = [];
2 fringe = [s0 ];
3 while true do
4 if fringe.empty() then
5 return NONE;
6 s = fringe.remove();
7 if goal(s) then
8 return (SOME s);
9 closed.add(s);
10 for s′ ∈ expand(s) do
11 if (!closed.contains(s′) && !fringe.contains(s′)) then
12 fringe.add(s′);
45
Graph search
46
Graph search
(Figure: the explored region of the graph, separated from the unexplored region by the fringe.)
The graph becomes separated: any path from the start to an unexplored node has
to pass through a fringe node.
47
The performance of search techniques
So
the total cost = path cost + search cost
If a problem is highly complex it may be worth settling for a sub-optimal solution
obtained in a short time.
And we are interested in:
Completeness: does the strategy guarantee a solution is found?
Optimality: does the strategy guarantee that the best solution is found?
Once we start to consider these, things get a lot more interesting…
48
Basic search algorithms
• New nodes are added to the head of the queue. This is depth-first search.
• New nodes are added to the tail of the queue. This is breadth-first search. (You
can do the goal test earlier—why is this?)
We will not dwell on these, as they are both completely hopeless in practice.
Why is breadth-first search hopeless?
49
Basic search algorithms
• With graph search: complete if the number of states is finite, but not other-
wise.
• Not complete for tree search because of loops.
Neither is optimal.
50
Basic search methods
With depth-first search: for a given branching factor b and depth d the memory
requirement is O(bd).
This is only for tree search because we need to store nodes on the current path and
the other unexpanded nodes.
The time complexity for tree search is still O(b^d) (if you know you only have to
go to depth d). For graph search it is the size of the state space.
The search is no longer optimal, and may not be complete.
Iterative-deepening combines the two, but we can do better.
51
Uniform-cost search
How might we change tree search to try to get to an optimal solution while lim-
iting the time and memory needed?
The key point: so far we only distinguish goal states from non-goal states!
None of the searches you’ve seen so far tries to prioritize the exploration of good
states‼!
• Well, at any point in the search we can work out the path cost p(s) of whatever
state s we’ve got to.
• How about using the p(s) as the priority for the priority queue?
52
Uniform-cost search
1 closed = [];
2 fringe = [s0 ];
3 while true do
4 if fringe.empty() then
5 return NONE;
6 s = fringe.remove();
7 if goal(s) then
8 return (SOME s);
9 closed.add(s);
10 for s′ ∈ expand(s) do
11 if (!closed.contains(s′) && !fringe.contains(s′)) then
12 fringe.add(s′);
13 else if (fringe.contains(s′) with higher p(s′)) then
14 replace the fringe node with s′;
This modification must also be used when implementing the A* search method,
which we will see in a moment.
53
Uniform-cost search
This is optimal because when we select a node for expansion we must already
have found a shortest path to that node.
It is complete, provided it is impossible to get stuck within an infinite path:
54
Heuristics
Why is path cost not a good evaluation function? It is not directed in any sense
toward the goal.
A heuristic function, usually denoted h(s), is one that estimates the cost of the
best path from any state s to a goal. If s is a goal then h(s) = 0.
(Figure: states s0, s1, s2 on a route to the goal, with step costs of 1 and straight-line-distance heuristic values h(s0) = √5, h(s1) = √2, h(s2) = 1.)
Accuracy here obviously depends on what the roads are really like.
Can we use h(s) in choosing a state to explore? If it’s really good it can work
well, but we can still do better!
56
A* search
It does this in a very simple manner: it uses path cost p(s) and also the heuristic
function h(s) by forming
f (s) = p(s) + h(s).
So: f (s) is the estimated cost of a path through s.
By using this as a priority for exploring states we get a search algorithm that is
optimal and complete under simple conditions, and can be vastly superior to the
more naïve approaches.
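A hedged Python sketch of the idea (the names successors, heuristic and so on are assumptions for illustration, not part of the notes). It keeps the cheapest path cost found so far to each state, which implements the "keep only the least expensive path" rule mentioned later for graph search; with heuristic(s) = 0 it behaves as uniform-cost search.

import heapq

def a_star(s0, goal, successors, heuristic):
    # successors(s) yields (next_state, step_cost) pairs.
    best_p = {s0: 0}                            # cheapest path cost found so far
    counter = 0
    fringe = [(heuristic(s0), counter, 0, s0)]
    while fringe:
        f, _, p, s = heapq.heappop(fringe)
        if p > best_p.get(s, float("inf")):
            continue                            # stale entry: a cheaper path was found later
        if goal(s):
            return p                            # path cost of the solution found
        for s2, cost in successors(s):
            p2 = p + cost
            if p2 < best_p.get(s2, float("inf")):
                best_p[s2] = p2
                counter += 1
                heapq.heappush(fringe, (p2 + heuristic(s2), counter, p2, s2))
    return None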
57
A* search
Definition: an admissible heuristic h(s) is one that never overestimates the cost of
the best path from s to a goal.
(Figure: the actual path from s to the nearest goal; h(s) must underestimate the cost of this path.)
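A standard example (not taken from the notes): for the 8-puzzle, the sum of the Manhattan distances of the tiles from their goal positions is admissible, since every tile must move at least that many times. A sketch, assuming a state is a tuple of rows with 0 for the blank:

def manhattan(state, goal):
    # state and goal are tuples of rows, e.g. ((1, 2, 3), (4, 5, 6), (7, 8, 0)).
    where = {tile: (r, c)
             for r, row in enumerate(goal)
             for c, tile in enumerate(row)}
    total = 0
    for r, row in enumerate(state):
        for c, tile in enumerate(row):
            if tile != 0:                        # the blank square does not count
                gr, gc = where[tile]
                total += abs(r - gr) + abs(c - gc)
    return total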
58
A* tree-search is optimal for admissible h(s)
Let Goal2 be a suboptimal goal state with f(Goal2) = p(Goal2) = f2 > fopt, where fopt is the cost of an optimal goal Goalopt. We
need to demonstrate that the search can never select Goal2.
59
A* tree-search is optimal for admissible h(s)
60
A* graph search
• Graph search can discard an optimal route if that route is not the first one
generated.
• We could keep only the least expensive path. This means updating, which is
extra work, not to mention messy, but sufficient to ensure optimality.
• Alternatively, we can impose a further condition on h(s) which forces the best
path to a repeated state to be generated first.
61
Monotonicity
(Figure: a state s with p(s) = 5 and h(s) = 4, and a successor s′ with p(s′) = 6 and h(s′) = 1, so f drops from 9 to 7.)
62
Monotonicity
Monotonicity:
• If it is always the case that f(s′) ≥ f(s) then h(s) is called monotonic or consistent.
• h(s) is monotonic if and only if it obeys the triangle inequality
h(s) ≤ cost(a, s) + h(s′)
where a is the action moving us from s to s′.
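To see why monotonicity makes f non-decreasing along a path (a short derivation using only the definitions above): if s′ is obtained from s by action a, then
f(s′) = p(s′) + h(s′) = p(s) + cost(a, s) + h(s′) ≥ p(s) + h(s) = f(s),
where the inequality is exactly the triangle inequality.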
63
Monotonicity
(Figure: as before, s has p(s) = 5 and h(s) = 4, and its successor s′ has p(s′) = 6 and h(s′) = 1.)
The fact that f(s) = 9 tells us the cost of a path through s is at least 9 (because
h(s) is admissible).
But s′ is on a path through s. So to say that f(s′) = 7 makes no sense.
64
A* graph search is optimal for monotonic heuristics
The crucial fact from which optimality follows is that if h(s) is monotonic then
the values of f (s) along any path are non-decreasing.
We therefore have the following situation: along any path the f values are non-decreasing, so the search expands states in order of increasing f.
Consequently everything with f(s′′) < fopt gets explored. Then one or more
things with fopt get found (not necessarily all goals).
65
A* search is complete
Why is this? The search expands nodes according to increasing f (s). So: the
only way it can fail to find a goal is if there are infinitely many nodes with
f (s) < f (Goal).
There are two ways this can happen:
66
Complexity
We won’t be proving the following, but they are good things to know:
67
IDA* - iterative deepening A* search
• Iterative deepening search used depth-first search with a limit on depth that
is gradually increased.
• IDA* does the same thing with a limit on f cost.
68
IDA* - iterative deepening A* search
The function contour searches from a specified state s as far as a specified limit
fLimit on f .
It returns either a path from s to a goal, or the next biggest value to try for the
limit on f .
69
IDA* - iterative deepening A* search
1 function iterativeDeepeningAStar()
2 fLimit = f (s0 );
3 while true do
4 (path, fLimit) = contour(s0 , fLimit, []);
5 if path != [] then
6 return path;
7 if fLimit == ∞ then
8 return [];
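A minimal Python sketch of the same procedure (illustrative only; successors(s) is assumed to yield (state, cost) pairs and h is the heuristic). The inner function plays the role of contour, returning either a path or the next biggest value to try for the limit on f.

import math

def ida_star(s0, goal, successors, h):
    def contour(s, p, f_limit, path):
        f = p + h(s)
        if f > f_limit:
            return None, f                 # over the limit: report this f value
        if goal(s):
            return path + [s], f_limit     # found a path from s0 to a goal
        next_f = math.inf                  # smallest f seen beyond the limit
        for s2, cost in successors(s):
            result, f2 = contour(s2, p + cost, f_limit, path + [s])
            if result is not None:
                return result, f_limit
            next_f = min(next_f, f2)
        return None, next_f

    f_limit = h(s0)
    while True:
        result, f_limit = contour(s0, 0, f_limit, [])
        if result is not None:
            return result
        if f_limit == math.inf:
            return None                    # no goal is reachable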
70
IDA* - iterative deepening A* search
(Figure: the successors of the start state have f costs 7, 4 and 5.)
Initially, the algorithm looks ahead and finds the smallest f cost that is greater
than its current f cost limit. The new limit is 4.
71
IDA* - iterative deepening A* search
(Figure: successors with f costs 7, 4 and 5; expanding the node with f cost 4 reveals successors with f costs 5, 9 and 10.)
Anything with f cost at most equal to the current limit gets explored, and the
algorithm keeps track of the smallest f cost that is greater than its current limit.
The new limit is 5.
72
IDA* - iterative deepening A* search
And again:
(Figure: the explored tree, with f costs 7, 4, 5 at the first level, then 5, 9, 10 and 19, 12, 7, and 8, 12, 7 at the deepest level explored.)
The new limit is 7, so at the next iteration the three arrowed nodes will be ex-
plored.
73
IDA* - iterative deepening A* search
Properties of IDA*:
74
Recursive best-first search (RBFS)
1. We remember the f(s′) for the best alternative state s′ we’ve seen so far on
the way to the state s we’re currently considering.
2. If s has f(s) > f(s′):
• We go back and explore the best alternative…
• …and as we retrace our steps we replace the f cost of every state we’ve
seen in the current path with f (s). (See red text in pseudo-code.)
75
Recursive best-first search (RBFS)
76
Recursive best-first search (RBFS): an example
(Figure: the root has f cost 3, with fLimit1 = ∞; its successors have f costs 7, 4 and 5; best1 is the node with f cost 4 and nextBest1 = 5.)
77
Recursive best-first search (RBFS): an example
(Figure: best1 (f cost 4) is expanded with fLimit2 = 5; its successors have f costs 5, 9 and 10; best2 is the node with f cost 5 and nextBest2 = 9.)
78
Recursive best-first search (RBFS): an example
(Figure: best2 (f cost 5) is expanded with fLimit3 = 5; its successors have f costs 11, 12 and 10, so best3 has f cost 10 and nextBest3 = 11; the f cost 5 of best2 is replaced by 10.)
Now f(best3) > fLimit3 so the function call returns (NONE, 10) into (result3, f′)
and f(best2) = 10.
79
Recursive best-first search (RBFS): an example
(Figure: with fLimit2 = 5, the successors of best1 now have f costs 10, 9 and 10, so best2 is the node with f cost 9; the f cost 4 of best1 will be replaced by 9.)
Now f(best2) > fLimit2 so the function call returns (NONE, 9) into (result2, f′)
and f(best1) = 9.
80
Recursive best-first search (RBFS): an example
(Figure: at the top level the successors now have f costs 7, 9 and 5, so best1 becomes the node with f cost 5 and nextBest1 = 7.)
We do a further function call to expand the new best node, and so on…
81
Recursive best-first search (RBFS)
To some extent IDA* and RBFS throw the baby out with the bathwater.
82
Local search
Sometimes, it’s only the goal that we’re interested in. The path needed to get
there is irrelevant.
83
Local search
Instead of trying to find a path from start state to goal, we explore the local area
of the graph, meaning those states one edge away from the one we’re at:
(Figure: the current state and its neighbours, with f values 52, 24, 1, 24 and 29.)
We assume that we have a function f (s) such that f (s0) > f (s) indicates s0 is
preferable to s.
84
The m-queens problem
85
The m-queens problem
2
Note that we actually want to minimize f here. This is equivalent to maximizing −f , and I will generally use whichever seems more appropriate.
86
The m-queens problem
Here, we have {4, 3, ?, 8, 6, 2, 4, 1} and the f values for the undecided queen are
shown.
As we can choose which queen to move, each state in fact has 56 neighbours in
the graph.
87
Hill-climbing search
In fact, that looks so simple that it’s amazing the algorithm is at all useful.
In this version we stop when we get to a node with no better neighbour.
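A minimal Python sketch of this loop (illustrative; neighbours(s) returns the neighbouring states and f is the function being maximised):

def hill_climb(s, f, neighbours):
    while True:
        candidates = neighbours(s)
        if not candidates:
            return s
        best = max(candidates, key=f)
        if f(best) <= f(s):
            return s                       # no better neighbour: stop here
        s = best

For the m-queens problem with m = 8, neighbours(s) would be the 56 states obtained by moving a single queen within its column, and f(s) could be minus the number of attacking pairs.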
88
Hill-climbing search: the reality
89
Hill-climbing search: the reality
(Figure: a landscape with local maxima, a plateau, and the current state s.)
90
Hill-climbing search: the reality
Of course, the fact that we’re dealing with a general graph means we need to think
of something like the preceding figure, but in a very large number of dimensions,
and this makes the problem much harder.
There is a body of techniques for trying to overcome such problems. For example:
91
Hill-climbing search: the reality
• First choice: Generate neighbours at random. Select the first one that is better
than the current one. (Particularly good if nodes have many neighbours.)
• Random restarts: Run a procedure k times with a limit on the time allowed
for each run.
Note: generating a start state at random may itself not be straightforward.
• Simulated annealing: Similar to stochastic hill-climbing, but start with lots of
random variation and reduce it over time.
Note: in some cases this is provably an effective procedure, although the time
taken may be excessive if we want the proof to hold.
• Beam search: Maintain k states at any given time. At each search step, find
the successors of each, and retain the best k from all the successors.
Note: this is not the same as random restarts.
92
Gradient ascent and related methods
For some problems we do not have a search graph, but a continuous search
space.
(Figure: a function of a single variable, plotted on the interval from 0 to 6.)
93
Gradient ascent and related methods
94
Gradient ascent and related methods
• At the current point xi the gradient ∇f (xi) tells us the direction and magni-
tude of the slope at xi.
• Adding ∇f (xi) therefore moves us a small distance upward.
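A sketch of the corresponding update in Python (the step size ε and the number of steps are assumptions made for illustration; the notes say only that we move a small distance):

import numpy as np

def gradient_ascent(x0, grad_f, epsilon=0.01, steps=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x + epsilon * grad_f(x)        # step a small distance uphill
    return x

For example, gradient_ascent([3.0, -2.0], lambda x: -2 * x) moves towards the maximum of f(x) = -||x||^2 at the origin.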
95
Gradient ascent and related methods
(Figure: a surface plot of a function of two variables.)
96
Gradient ascent and related methods
(Figure: contour plots of the same function of two variables.)
97
Gradient ascent and related methods
• Line search: increase the step size until f decreases and maximise in the resulting
interval. Then choose a new direction to move in. Conjugate gradients, the
Fletcher-Reeves and Polak-Ribière methods etc.
• Use the Hessian matrix H to exploit knowledge of the local shape of f. For example the
Newton-Raphson and Broyden-Fletcher-Goldfarb-Shanno (BFGS) methods etc.
98
Artificial Intelligence
How might an agent act when the outcomes of its actions are not known because
an adversary is trying to hinder it?
100
Playing games: search against an adversary
Despite the fact that games are an idealisation, game playing can be an excellent
source of hard problems. For instance with chess:
101
Perfect decisions in a two-person game
Say we have two players. Traditionally, they are called Max and Min for reasons
that will become clear.
This is exactly the same game format as chess, Go, draughts and so on.
102
Perfect decisions in a two-person game
(Figure: a noughts and crosses position, with Max to move.)
• There is a set of operators. Here, Max can place a cross in any empty square,
or Min a nought.
• There is a terminal test. Here, the game ends when three noughts or three
crosses are in a row, or there are no unused spaces.
• There is a utility or payoff function. This tells us, numerically, what the out-
come of the game is.
103
Perfect decisions in a two-person game
(Figure: the first ply of the game tree.)
104
Perfect decisions in a two-person game
(Figure: the game tree extended by the replies at the next ply.)
And so on…
This can be continued to represent all possibilities for the game.
105
Perfect decisions in a two-person game
(Figure: continuation of the game tree down to the leaves, labelled with utilities −1, +1 and 0.)
At the leaves a player has won or there are no spaces. Leaves are labelled using
the utility function.
106
Perfect decisions in a two-person game
(Figure: a two-ply game tree whose sixteen leaves have utilities 4, 5, 2, 20; 20, 15, 6, 7; 1, 4, 10, 9; 5, 8, 5, 4.)
If Max is rational he will play to reach a position with the biggest utility possible.
But if Min is rational she will play to minimise the utility available to Max.
107
The minimax algorithm
There are two moves: Max then Min. Game theorists would call this one move,
or two ply deep.
The minimax algorithm allows us to infer the best move that the current player
can make, given the utility function, by working backward from the leaves.
(Figure: the same tree with the four Min nodes labelled 2, 6, 1 and 4, the minima of the utilities of their leaves.)
As Min plays the last move, she minimises the utility available to Max.
108
The minimax algorithm
(Figure: the Min nodes have values 2, 6, 1 and 4, so the root has value 6.)
We can see that Max’s best opening move is move 2, as this leads to the node
with highest utility.
109
The minimax algorithm
In general:
• Generate the complete tree and label the leaves according to the utility func-
tion.
• Working from the leaves of the tree upward, label the nodes depending on
whether Max or Min is to move.
• If Min is to move label the current node with the minimum utility of any
descendant.
• If Max is to move label the current node with the maximum utility of any
descendant.
If the game is p ply and at each point there are q available moves then this process
has (surprise, surprise) O(q^p) time complexity and space complexity linear in p
and q.
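A compact Python rendering of this procedure (all of the names are illustrative assumptions): moves lists the moves available to a player in a state, result applies a move, terminal is the terminal test and utility scores terminal states from Max's point of view.

def minimax(state, player, terminal, utility, moves, result):
    if terminal(state):
        return utility(state), None
    best_move = None
    if player == "MAX":
        best = float("-inf")
        for m in moves(state, player):
            v, _ = minimax(result(state, m), "MIN", terminal, utility, moves, result)
            if v > best:
                best, best_move = v, m
    else:
        best = float("inf")
        for m in moves(state, player):
            v, _ = minimax(result(state, m), "MAX", terminal, utility, moves, result)
            if v < best:
                best, best_move = v, m
    return best, best_move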
110
Making imperfect decisions
We need to avoid searching all the way to the end of the tree.
So:
• We generate only part of the tree: instead of testing whether a node is a leaf
we introduce a cut-off test telling us when to stop.
• Instead of a utility function we introduce an evaluation function for the eval-
uation of positions for an incomplete game.
The evaluation function attempts to measure the expected utility of the current
game position.
111
Making imperfect decisions
112
The evaluation function
• Let’s say we want to design one for chess by giving each piece its material
value: pawn = 1, knight/bishop = 3, rook = 5 and so on.
• Define the evaluation of a position to be the difference between the material
value of black’s and white’s pieces
eval(position) = Σ_{black’s pieces pi} value of pi − Σ_{white’s pieces qi} value of qi
• Until the first capture the evaluation function gives 0, so in fact we have
a category containing many different game positions with equal estimated
utility.
• For example, all positions where white is one pawn ahead.
113
The evaluation function
• For example, using material value, construct a weighted linear evaluation function
eval(position) = Σ_{i=1}^{n} wi fi
where the wi are weights and the fi represent features of the position—in this
case, the value of the ith piece.
• Weights can be chosen by allowing the game to play itself and using learning
techniques to adjust the weights to improve performance.
However in general
114
α − β pruning
Even with a good evaluation function and cut-off test, the time complexity of the
minimax algorithm makes it impossible to write a good chess program without
some further improvement.
• Assuming we have 150 seconds to make each move, for chess we would be
limited to a search of about 3 to 4 ply whereas…
• …even an average human player can manage 6 to 8.
Luckily, it is possible to prune the search tree without affecting the outcome and
without having to examine all of it.
115
α − β pruning
(Figure: the earlier two-ply game tree, with leaf utilities 4, 5, 2, 20; 20, 15, 6, 7; 1, 4, 10, 9; 5, 8, 5, 4.)
116
α − β pruning
(Figure: the first two Min nodes have values 2 and 6; after seeing the leaf with utility 1, the third Min node is known to have value ≤ 1.)
Then we note: if Max plays move 3 then Min can reach a leaf with utility at most
1.
So: we don’t need to search any further under Max’s opening move 3. This is be-
cause the search has already established that Max can do better by making open-
ing move 2.
117
α − β pruning in general
Remember that this search is depth-first. We’re only going to use knowledge of
nodes on the current path.
So: once you’ve established that n is sufficiently small, you don’t need to explore
any more of the corresponding node’s children.
118
α − β pruning in general
The situation is exactly analogous if we swap player and opponent in the previous
diagram.
The search is depth-first, so we’re only ever looking at one path through the tree.
We need to keep track of the values α and β where
α = the highest utility seen so far on the path for Max
β = the lowest utility seen so far on the path for Min
Assume Max begins. Initial values for α and β are
α = −∞
and
β = +∞.
119
α − β pruning in general
1 function player(α, β, n)
2 if cutoff(n) then
3 return eval(n);
4 value = −∞;
5 for each successor n′ of n do
6 value = max(value, opponent(α, β, n′ ));
7 if value ≥ β then
8 return value;
9 if value > α then
10 α = value;
11 return value;
120
α − β pruning in general
1 function opponent(α, β, n)
2 if cutoff(n) then
3 return eval(n);
4 value = ∞;
5 for each successor n′ of n do
6 value = min(value, player(α, β, n′ ));
7 if value ≤ α then
8 return value;
9 if value < β then
10 β = value;
11 return value;
Note: the semantics here is that parameters are passed to functions by value.
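The two functions translate directly into Python (a sketch under the assumption that cutoff, evaluate and successors are supplied by the game); α and β are rebound locally, so they behave exactly as parameters passed by value.

def player(alpha, beta, n, cutoff, evaluate, successors):
    if cutoff(n):
        return evaluate(n)
    value = float("-inf")
    for n2 in successors(n):
        value = max(value, opponent(alpha, beta, n2, cutoff, evaluate, successors))
        if value >= beta:
            return value                   # beta cut: Min will never allow this line
        alpha = max(alpha, value)
    return value

def opponent(alpha, beta, n, cutoff, evaluate, successors):
    if cutoff(n):
        return evaluate(n)
    value = float("inf")
    for n2 in successors(n):
        value = min(value, player(alpha, beta, n2, cutoff, evaluate, successors))
        if value <= alpha:
            return value                   # alpha cut: Max already has something better
        beta = min(beta, value)
    return value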
121
α − β pruning in general
Applying this to the earlier example and keeping track of the values for α and β
you should obtain:
(Figure: the trace on the earlier example. The first Min node returns 2 and α is updated from −∞ to 2; the second returns 6 and α becomes 6; under the third, the leaf with utility 1 gives a value ≤ α, so that node returns 1 and its remaining leaves are pruned. β remains +∞ at the top level.)
122
How effective is α − β pruning?
However, this is not realistic: if you had such an ordering technique you’d be
able to play perfect games!
123
How effective is α − β pruning?
In practice simple ordering techniques can get close to the best case. For example,
if we try captures, then threats, then moves forward etc.
Alternatively, we can implement an iterative deepening approach and use the
order obtained at one iteration to drive the next.
124
A further optimisation: the transposition table
Finally, note that many games correspond to graphs rather than trees because the
same state can be arrived at in different ways.
• This is essentially the same effect we saw in heuristic search: recall graph
search versus tree search.
• It can be addressed in a similar way: store a state with its evaluation in a hash
table—generally called a transposition table—the first time it is seen.
125
Artificial Intelligence
CSPs standardise the manner in which states and goal tests are represented. By
standardising like this we benefit in several ways:
We now return to the idea of problem solving by search and examine it from this
new perspective.
Aims:
128
Constraint satisfaction problems
We have:
129
Example
We will use the problem of colouring the nodes of a graph as a running example.
(Figure: two copies of a graph with nodes numbered 1 to 8, to be coloured.)
Each node corresponds to a variable. We have three colours and directly con-
nected nodes should have different colours.
130
Example
• The constraints enforce the idea that directly connected nodes must have dif-
ferent colours. For example, for variables V1 and V2 the constraints specify
(B, R), (B, C), (R, B), (R, C), (C, B), (C, R)
• Variable V8 is unconstrained.
131
Different kinds of CSP
This is an example of the simplest kind of CSP: it is discrete with finite domains.
We will concentrate on these.
We will also concentrate on binary constraints; that is, constraints between pairs
of variables.
132
Auxiliary variables
133
Backtracking search
Backtracking search now takes on a very simple form: search depth-first, assign-
ing a single variable at a time, and backtrack if no valid assignment is available.
Using the graph colouring example, the search now looks something like this…
(Figure: the root of the search tree branches on the assignments 1 = B, 1 = R and 1 = C.)
134
Backtracking search
(Figure: with assignments 1 = B, 2 = R, 3 = C, 4 = B, 5 = R, 6 = B, nothing is available for node 7, so either assign 8 or backtrack.)
135
Backtracking search
Starting with:
backtrack([], problemDescription)
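The slide shows only the initial call; a minimal Python version of the backtracking search (the representation is an assumption made for illustration: domains maps each variable to its possible values, and constraints maps each ordered pair of connected variables to the set of allowed value pairs, with both orderings present) might look like this.

def consistent(var, value, assignment, constraints):
    for other, val in assignment.items():
        allowed = constraints.get((var, other))
        if allowed is not None and (value, val) not in allowed:
            return False
    return True

def backtrack(assignment, variables, domains, constraints):
    if len(assignment) == len(variables):
        return assignment                  # every variable has a value
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        if consistent(var, value, assignment, constraints):
            assignment[var] = value
            result = backtrack(assignment, variables, domains, constraints)
            if result is not None:
                return result
            del assignment[var]            # undo and try the next value
    return None                            # no value works: backtrack

For the graph-colouring example, every domain would be {B, R, C} and constraints[(V1, V2)] would be the six pairs listed earlier.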
136
Backtracking search: possible heuristics
There are several points we can examine in an attempt to obtain general CSP-
based heuristics:
• What effect might the values assigned so far have on later attempted assign-
ments?
• When forced to backtrack, is it possible to avoid the same failure later on?
• Can we try to force the search in a successful direction (remember the use of
heuristics)?
• Can we try to force failures/backtracks to occur quickly?
137
Heuristics I: Choosing the order of variable assignments and values
(Figure: at this point there is only one possible assignment for node 3, whereas the others have more flexibility.)
Assigning such variables first is called the minimum remaining values (MRV)
heuristic.
(Alternatively, the most constrained variable or fail first heuristic.)
138
Heuristics I: Choosing the order of variable assignments and values
(Figure: before any assignments are made, the degree heuristic says to start with node 3, 5 or 7.)
MRV is usually better but the degree heuristic is a good tie breaker.
139
Heuristics I: Choosing the order of variable assignments and values
(Figure: choosing 1 = C is bad as it removes the final possibility for node 3; the heuristic prefers 1 = B.)
The least constraining value heuristic chooses first the value that leaves the max-
imum possible freedom in choosing assignments for the variable’s neighbours.
140
Heuristics II: forward checking and constraint propagation
(Figure: after the assignments made so far, C is ruled out as an assignment to nodes 2 and 3.)
Each time we assign a value to a variable, it makes sense to delete that value from
the collection of possible assignments to its neighbours.
This is called forward checking. It works nicely in conjunction with MRV.
141
Heuristics II: forward checking and constraint propagation
142
Heuristics II: forward checking and constraint propagation
143
Constraint propagation
Arc consistency:
Consider a constraint as being directed. For example 4 → 5.
In general, say we have a constraint i → j and currently the domain of i is Di
and the domain of j is Dj .
i → j is consistent if
∀d ∈ Di, ∃d′ ∈ Dj such that the assignment i = d, j = d′ satisfies the constraint i → j.
Example:
In step three of the table, D4 = {R, C} and D5 = {C}.
145
Enforcing arc consistency
(Figure: i has domain {R, B} and j has domain {B}. The arc i → j is not consistent, so B is deleted from the domain of i, making i → j consistent. However, a neighbour kK with domain {R} that was consistent with i, because kK = R could be paired with i = B, is no longer consistent, because kK = R cannot be paired with i = R.)
146
The AC-3 algorithm
1 function AC-3(problemDescription)
2 Queue toCheck = [ all arcs i → j ];
3 while toCheck is not empty do
4 i → j = next(toCheck);
5 if removeInconsistencies(Di , Dj ) then
6 for each k that is a neighbour of i, where k ∉ {i, j} do
7 add k → i to toCheck;
7 add k → i to toCheck;
1 function removeInconsistencies(D1 , D2 )
2 Bool result = FALSE;
3 for each d ∈ D1 do
4 if no d′ ∈ D2 valid with d then
5 remove d from D1 ;
6 result = TRUE;
7 return result;
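A direct Python transcription (illustrative only; domains maps variables to sets of values, constraints maps each directed arc (i, j) to its set of allowed value pairs, and neighbours[i] lists the variables connected to i):

from collections import deque

def ac3(domains, constraints, neighbours):
    to_check = deque(constraints)              # all arcs i -> j
    while to_check:
        i, j = to_check.popleft()
        if remove_inconsistencies(i, j, domains, constraints):
            for k in neighbours[i]:
                if k not in (i, j):
                    to_check.append((k, i))    # the domain of i shrank, so recheck arcs into i
    return domains

def remove_inconsistencies(i, j, domains, constraints):
    removed = False
    allowed = constraints[(i, j)]
    for d in set(domains[i]):                  # iterate over a copy while removing
        if not any((d, d2) in allowed for d2 in domains[j]):
            domains[i].remove(d)               # d has no support in the domain of j
            removed = True
    return removed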
147
Enforcing arc consistency
Complexity:
148
A more powerful form of consistency
149
Backjumping
The basic backtracking algorithm backtracks to the most recent assignment. This
is known as chronological backtracking. It is not always the best policy:
(Figure: nothing can be assigned to node 7; chronological backtracking returns to the most recent assignment even though it may not be the cause of the failure.)
150
Backjumping
With some careful bookkeeping it is often possible to jump back multiple levels
without sacrificing the ability to find a solution.
We need some definitions:
Henceforth we shall assume that variables are assigned in the order V1, V2, . . . , Vn
when formally presenting algorithms.
151
Gaschnig’s algorithm
• When choosing a value for Vk+1 we need to check that any candidate value
d ∈ Dk+1, is consistent with Ik .
• When testing potential values for d, we will generally discard one or more
possibilities, because they conflict with some member of Ik
• We keep track of the most recent assignment Aj for which this has happened.
152
Gaschnig’s algorithm
Example:
(Figure: every candidate value for node 7 fails, so Gaschnig's algorithm backtracks directly to node 5, the most recent assignment responsible for discarding a candidate.)
153
Graph-based backjumping
This allows us to jump back multiple levels when we initially detect a conflict.
Can we do better than chronological backtracking thereafter?
Some more definitions:
The ancestors for each variable can be accumulated as assignments are made.
Graph-based backjumping backtracks to the parent of Vk+1.
Note: Gaschnig’s algorithm uses assignments whereas graph-based backjumping
uses constraints.
154
Graph-based backjumping
(Figure: ancestor sets accumulated during the search: for example node 7 has ancestors {1, 3, 5} and node 2 has {1, 3, 4, 8}.)
155
Backjumping and forward checking
In fact, use of forward checking can make some forms of backjumping redundant.
Note: there are in fact many ways of combining constraint propagation with back-
jumping, and we will not explore them in further detail here.
156
Backjumping and forward checking
(Figure: no assignment is available for node 7. The ancestor sets are: 1: {}, 2: {1, 3, 4}, 3: {1}, 4: {5}, 5: {3}, 6: {5}, 7: {1, 3, 5}, 8: {}.)
          1    2    3    4    5    6    7    8
Start    BRC  BRC  BRC  BRC  BRC  BRC  BRC  BRC
1 = B    =B   RC   RC   BRC  BRC  BRC  RC   BRC
3 = R    =B   C    =R   BRC  BC   BRC  C    BRC
5 = C    =B   C    =R   BR   =C   BR   !    BRC
4 = B    =B   C    =R   BR   =C   BR   !    BRC
157
Graph-based backjumping
We’re not quite done yet though. What happens when there are no assignments
left for the parent we just backjumped to?
(Figure: no assignment is available for V7; after backjumping to its parent V4, no assignment is available there either.)
158
Graph-based backjumping
(Figure: V7 is a leaf dead-end and I6 is the corresponding leaf dead-end state; backjumping to V4, no values remain to try.)
159
Graph-based backjumping
Also:
(Figure: V7 is a leaf dead-end with leaf dead-end state I6; V4 then has no values left to try, so V4 is an internal dead-end variable and I3 is an internal dead-end.)
If Vi was backtracked to from a later leaf dead-end and there are no more values
to try for Vi then we refer to it as an internal dead-end variable and call Ii−1 an
internal dead-end.
160
Graph-based backjumping
• The session of a variable V begins when the search algorithm visits it and
ends when it backtracks through it to an earlier variable.
• The current session of a variable V is the set of all variables visited during its
session.
• In particular, the current session for any V contains V .
• The relevant dead-ends for the current session R(V ) for a variable V are:
1. R(V ) is initialized to {V } when V is first visited.
2. If V is a leaf dead-end variable then R(V ) = {V }.
3. If V was backtracked to from a dead-end V′ then R(V ) = R(V ) ∪ R(V′).
161
Graph-based backjumping
Example:
Session of V7 = {V7}, with R(V7) = {V7}.
Session of V4 = {V4, V5, V6, V7}, with R(V4) = {V4, V7}.
162
Graph-based backjumping
One more bunch of definitions before the pain stops. Say Vk is a dead-end:
163
Graph-based backjumping
Example: after backjumping from V7 to V4 there is nothing left to try at V4.
Session of V4 = {V4, V5, V6, V7}, R(V4) = {V4, V7} and ind(V4) = {V2, V3}.
164
Varieties of CSP
We have only looked at discrete CSPs with finite domains. These are the simplest.
We could also consider:
165
Artificial Intelligence
We now look at how an agent might represent knowledge about its environment,
and reason with this knowledge to achieve its goals.
Initially we’ll represent and reason using first order logic (FOL). Aims:
167
Knowledge representation and reasoning
• Possess knowledge about the environment and about how its actions affect the
environment.
• Use some form of logical reasoning to maintain its knowledge as percepts ar-
rive.
• Use some form of logical reasoning to deduce actions to perform in order to
achieve goals.
168
Knowledge representation and reasoning
169
Logic for knowledge representation
Problem: it’s quite easy to talk about things like set theory using FOL. For exam-
ple, we can easily write axioms like
∀S . ∀S′ . ((∀x . (x ∈ S ⇔ x ∈ S′)) ⇒ S = S′)
But how would we go about representing the proposition that if you have a bucket
of water and throw it at your friend they will get wet, have a bump on their head
from being hit by a bucket, and the bucket will now be empty and dented?
More importantly, how could this be represented within a wider framework for
reasoning about the world?
It’s time to introduce The Wumpus…
170
Wumpus world
As a simple test scenario for a knowledge-based agent we will make use of the
Wumpus World.
(Figure: the Wumpus World cave, a grid of squares containing EVIL ROBOT, the Wumpus, pits and the gold.)
171
Wumpus world
• Unfortunately the cave contains a number of pits, which EVIL ROBOT can
fall into. Eventually his batteries will fail, and that’s the end of him.
• The cave also contains the Wumpus, who is armed with state-of-the-art Evil
Robot Obliteration Technology.
• The Wumpus itself knows where the pits are and never falls into one.
172
Wumpus world
EVIL ROBOT can move around the cave at will and can perceive the following:
In addition, EVIL ROBOT has a single arrow, with which to try to kill the Wum-
pus.
“Adjacent” in the following does not include diagonals.
173
Wumpus world
So we have:
Percepts: stench, breeze, glitter, bump, scream.
Actions: forward, turnLeft, turnRight, grab, release, shoot, climb.
Of course, our aim now is not just to design an agent that can perform well in a
single cave layout.
We want to design an agent that can usually perform well regardless of the layout
of the cave.
174
Logic for knowledge representation
175
Example: Prolog
You have by now learned a little about programming in Prolog. For example:
concat([], L, L).
concat([H|T], L, [H|L2]) :- concat(T, L, L2).
is a program to concatenate two lists. The query
concat([1, 2, 3], [4, 5], X).
results in
X = [1, 2, 3, 4, 5].
What’s happening here? Well, Prolog is just a more limited form of FOL so…
176
Example: Prolog
• The Prolog program itself is the KB. It expresses some knowledge about
lists.
• The query is expressed in such a way as to derive some new knowledge.
How does this relate to full FOL? First of all the list notation is nothing but syn-
tactic sugar. It can be removed: we define a constant called empty and a function
called cons.
Now [1, 2, 3] just means
cons(1, cons(2, cons(3, empty)))
which is a term in FOL.
I will assume the use of the syntactic sugar for lists from now on.
177
Prolog and FOL
• Universally quantify all the unbound variables in each line of the program and
…
• … form the conjunction of the results.
If the universally quantified lines are L1, L2, . . . , Ln then the Prolog program
corresponds to the KB
KB = L1 ∧ L2 ∧ · · · ∧ Ln
Now, what does the query mean?
178
Prolog and FOL
179
Prolog and FOL
However the central idea also works for full-blown theorem provers.
If you want to experiment, you can obtain Prover9 from
https://www.cs.unm.edu/~mccune/mace4/
We’ll see a brief example now, and a more extensive example of its use later, time
permitting…
180
Prolog and FOL
Expressed in Prover9, the above Prolog program and query look like this:
set(prolog_style_variables).
formulas(assumptions).
concat([], L, L).
concat(T, L, L2) -> concat([H:T], L, [H:L2]).
end_of_list.
formulas(goals).
exists X concat([1, 2, 3], [4, 5], X).
end_of_list.
181
Prolog and FOL
prover9 -f file.in
This shows that a proof is found but doesn’t explicitly give a value for X—we’ll
see how to extract that later…
182
The fundamental idea
So the basic idea is: build a KB that encodes knowledge about the world, the effects
of actions and so on.
The KB is a conjunction of pieces of knowledge, such that:
• A query regarding what our agent should do can be posed in the form
∃actionList . Goal(...actionList...)
• Proving that
KB → ∃actionList . Goal(...actionList...)
instantiates actionList to an actual list of actions that will achieve a goal
represented by the Goal predicate.
We sometimes use the notation ask and tell to refer to querying and adding to
the KB.
183
Using FOL in AI: the triumphant return of the Wumpus
We want to be able to speculate about the past and about possible futures. So:
(Figure: the Wumpus World cave, as before.)
184
Situation calculus
In situation calculus:
In Wumpus World the actions are: forward, shoot, grab, climb, release,
turnRight, turnLeft.
• A situation argument is added to items that can change over time. For example
At(location, s)
Items that can change over time are called fluents.
• A situation argument is not needed for things that don’t change. These are
sometimes referred to as eternal or atemporal.
185
Representing change as a result of actions
186
Axioms I: possibility axioms
The first kind of axiom we need in a KB specifies when particular actions are
possible.
We introduce a predicate
Poss(action, s)
denoting that an action can be performed in situation s.
We then need a possibility axiom for each action. For example:
At(l, s) ∧ Available(gold, l, s) → Poss(grab, s)
Remember that unbound variables are universally quantified.
187
Axioms II: effect axioms
Given that an action results in a new situation, we can introduce effect axioms to
specify the properties of the new situation.
For example, to keep track of whether EVIL ROBOT has the gold we need effect
axioms to describe the effect of picking it up:
Poss(grab, s) → Have(gold, result(grab, s))
Effect axioms describe the way in which the world changes.
We would probably also include
¬Have(gold, s0)
in the KB, where s0 is the starting situation.
Important: we are describing what is true in the situation that results from per-
forming an action in a given situation.
188
Axioms III: frame axioms
We need frame axioms to describe the way in which the world stays the same.
Example:
Have(o, s) ∧
¬(a = release ∧ o = gold) ∧ ¬(a = shoot ∧ o = arrow)
→ Have(o, result(a, s))
describes the effect of having something and not discarding it.
In a more general setting such an axiom might well look different. For example
¬Have(o, s) ∧
(a ≠ grab(o) ∨ ¬(Available(o, s) ∧ Portable(o)))
→ ¬Have(o, result(a, s))
describes the effect of not having something and not picking it up.
189
The frame problem
190
Successor-state axioms
Effect axioms and frame axioms can be combined into successor-state axioms.
One is needed for each predicate that can change over time.
Action a is possible →
(true in new situation ⇐⇒
(you did something to make it true ∨
it was already true and you didn’t make it false))
For example
Poss(a, s) →
(Have(o, result(a, s)) ⇐⇒ ((a = grab ∧ Available(o, s)) ∨
(Have(o, s) ∧ ¬(a = release ∧ o = gold) ∧
¬(a = shoot ∧ o = arrow))))
191
Knowing where you are, and so on…
192
The qualification and ramification problems
193
Solving the ramification problem
194
Deducing properties of the world: causal and diagnostic rules
If you know where you are, then you can think about places rather than just
situations. Synchronic rules relate properties shared by a single state of the world.
There are two kinds: causal and diagnostic.
Causal rules: some properties of the world will produce percepts.
WumpusAt(l1) ∧ Adjacent(l1, l2) → StenchAt(l2)
PitAt(l1) ∧ Adjacent(l1, l2) → BreezeAt(l2)
Systems reasoning with such rules are known as model-based reasoning systems.
Diagnostic rules: infer properties of the world from percepts. For example:
At(l, s) ∧ Breeze(s) → BreezeAt(l)
At(l, s) ∧ Stench(s) → StenchAt(l)
These may not be very strong.
The difference between model-based and diagnostic reasoning can be important.
For example, medical diagnosis can be done based on symptoms or based on a
model of disease.
195
General axioms for situations and objects
Note: in FOL, if we have two constants robot and gold then an interpretation is
free to assign them to be the same thing. This is not something we want to allow.
Unique names axioms state that each pair of distinct items in our model of the
world must be different
robot ≠ gold
robot ≠ arrow
robot ≠ wumpus
...
Unique actions axioms state that actions must share this property, so for each pair
of actions
go(l, l′) ≠ grab
go(l, l′) ≠ drop(o)
...
and in addition we need to define equality for actions, so for each action
go(l, l′) = go(l″, l‴) ⇔ l = l″ ∧ l′ = l‴
drop(o) = drop(o′) ⇔ o = o′
...
196
General axioms for situations and objects
197
Sequences of situations
We know that the function result tells us about the situation resulting from
performing an action in an earlier situation.
How can this help us find sequences of actions to get things done?
Define
Sequence([], s, s′) ≡ s′ = s
Sequence([a], s, s′) ≡ Poss(a, s) ∧ s′ = result(a, s)
Sequence(a :: as, s, s′) ≡ ∃t . Sequence([a], s, t) ∧ Sequence(as, t, s′)
To obtain a sequence of actions that achieves Goal(s) we can use the query
∃a ∃s . Sequence(a, s0, s) ∧ Goal(s)
198
Interesting reading
Happy reading…
199
Knowledge representation and reasoning
200
Frames and semantic networks
201
Example of a semantic network
(Figure: a semantic network in which a Person has a Head and a Left arm, and Musician is connected to Person.)
202
Frames
(Figure: frames for Musician and Rock musician, with their slots.)
203
Defaults
(Figure: the Rock musician frame and an instance of it, Dementia Evilperson.)
Starred slots are typical values associated with subclasses and instances, but can
be overridden.
204
Multiple inheritance
(Figure: Cornelius Cleverchap is an instance of two different classes, illustrating multiple inheritance.)
205
Other issues
• Slots and slot values can themselves be frames. For example Dementia may
have an instrument slot with the value Electricharp, which itself may have
properties described in a frame.
• Slots can have specified attributes. For example, we might specify that:
– instrument can have multiple values
– Each value can only be an instance of Instrument
– Each value has a slot called owned by
and so on.
• Slots may contain arbitrary pieces of program. This is known as procedural
attachment. The fragment might be executed to return the slot’s value, or
update the values in other slots etc.
206
Rule-based systems
207
Forward chaining
The first of two basic kinds of interpreter begins with established facts and then
applies rules to them.
This is a data-driven process. It is appropriate if we know the initial facts but not
the required conclusion.
Example: XCON—used for configuring VAX computers.
In addition:
208
Forward chaining
1. Find all the rules that can fire, based on the current working memory.
2. Select a rule to fire. This requires a conflict resolution strategy.
3. Carry out the action specified, possibly updating the working memory.
Repeat this process until either no rules can be used or a halt appears in the
working memory.
209
(Figure: a rule-based system, with a set of condition-action rules and a working memory initially containing dry_mouth and working.)
210
Example
Progress is as follows:
1. The rule
dry_mouth → ADD thirsty
fires, adding thirsty to working memory.
2. The rule
thirsty → ADD get_drink
fires, adding get_drink to working memory.
3. The rule
working → ADD no_work
fires, adding no_work to working memory.
4. The rule
get_drink AND no_work → ADD go_bar
fires, and we establish that it’s time to go to the bar.
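A toy Python version of this forward-chaining loop (purely illustrative; each rule pairs a set of condition atoms with a single atom to add, and conflict resolution is simply "first applicable rule"):

def forward_chain(working_memory, rules):
    # rules is a list of (conditions, conclusion) pairs, for example
    # [({"dry_mouth"}, "thirsty"), ({"thirsty"}, "get_drink"),
    #  ({"working"}, "no_work"), ({"get_drink", "no_work"}, "go_bar")]
    memory = set(working_memory)
    fired = True
    while fired and "halt" not in memory:
        fired = False
        for conditions, conclusion in rules:
            if conditions <= memory and conclusion not in memory:
                memory.add(conclusion)         # fire the rule, updating working memory
                fired = True
                break                          # first-applicable-rule conflict resolution
    return memory

Starting from {"dry_mouth", "working"} with the four rules above, the rules fire in exactly the order shown on this slide, ending with go_bar in the working memory.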
211
Conflict resolution
Clearly in any more realistic system we expect to have to deal with a scenario
where two or more rules can be fired at any one time:
• Which rule we choose can clearly affect the outcome.
• We might also want to attempt to avoid inferring an abundance of useless
information.
We therefore need a means of resolving such conflicts. Common conflict resolution
strategies are:
• Prefer rules involving more recently added facts.
• Prefer rules that are more specific. For example
patient_coughing → ADD lung_problem
is more general than
patient_coughing AND patient_smoker → ADD lung_cancer.
• Allow the designer of the rules to specify priorities.
• Fire all rules simultaneously—this essentially involves following all chains of
inference at once.
212
Reason maintenance
Some systems will allow information to be removed from the working memory
if it is no longer justified.
For example, we might find that
patient_coughing
and
patient_smoker
are in working memory, and hence fire
patient_coughing AND patient_smoker → ADD lung_cancer
but later infer something that causes patient_coughing to be withdrawn from
working memory.
The justification for lung_cancer has been removed, and so it should perhaps
be removed also.
213
Pattern matching
In general rules may be expressed in a slightly more flexible form involving vari-
ables which can work in conjunction with pattern matching.
For example the rule
coughs(X) AND smoker(X) → ADD lung_cancer(X)
contains the variable X.
If the working memory contains coughs(neddy) and smoker(neddy) then
X = neddy
provides a match and
lung_cancer(neddy)
is added to the working memory.
214
Backward chaining
The second basic kind of interpreter begins with a goal and finds a rule that would
achieve it.
It then works backwards, trying to achieve the resulting earlier goals in the suc-
cession of inferences.
Example: MYCIN—medical diagnosis with a small number of conditions.
This is a goal-driven process. If you want to test a hypothesis or you have some
idea of a likely conclusion it can be more efficient than forward chaining.
215
Example
(Figure: the working memory contains dry_mouth and working; the goal is go_bar.)
216
Example with backtracking
If at some point more than one rule has the required conclusion then we can
backtrack.
Example: Prolog backtracks, and incorporates pattern matching. It orders at-
tempts according to the order in which rules appear in the program.
Example: having added
up_early → ADD tired
and
tired AND lazy → ADD go_bar
to the rules, and up_early to the working memory:
217
Example with backtracking
(Figure: the backward-chaining search for go_bar, with backtracking.)
218
Artificial Intelligence I
Planning algorithms
Search algorithms are good for solving problems that fit this framework. How-
ever for more complex problems they may fail completely…
220
Problem solving is different to planning
Representing a problem such as ‘go out and buy some pies’ is hopeless:
Knowledge representation and reasoning might not help either: although we end
up with a sequence of actions—a plan—there is so much flexibility that complex-
ity might well become an issue.
Our aim now is to look at how an agent might construct a plan enabling it to
achieve a goal.
Difference 1:
222
Planning algorithms work differently
Difference 2:
• Planners can add actions at any relevant point at all between the start and the
goal, not just at the end of a sequence starting at the start state.
• This makes sense: I may determine that Have(carKeys) is a good state to be
in without worrying about what happens before or after finding them.
• By making an important decision like requiring Have(carKeys) early on we
may reduce branching and backtracking.
• State descriptions are not complete—Have(carKeys) describes a class of states—
and this adds flexibility.
So: you have the potential to search both forwards and backwards within the
same problem.
223
Planning algorithms work differently
Difference 3:
It is assumed that most elements of the environment are independent of most other
elements.
This works provided there is not significant interaction between the subplans.
Remember: the frame problem.
224
Running example: gorilla-based mischief
We will use a simple example, based on one from Russell and Norvig.
The intrepid little scamps in the Cambridge University Roof-Climbing Society wish
to attach an inflatable gorilla to the spire of a Famous College. To do this they need
to leave home and obtain:
• An inflatable gorilla: these can be purchased from all good joke shops.
• Some rope: available from a hardware store.
• A first-aid kit: also available from a hardware store.
They need to return home after they’ve finished their shopping. How do they go
about planning their jolly escapade?
225
The STRIPS language
226
The STRIPS language
Action: Go(y)
Preconditions: At(x), Path(x, y)
Effects: At(y), ¬At(x)
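One way to make the operator format concrete is to treat an operator as a set of precondition literals together with add and delete lists; the following Python sketch is an illustrative encoding, not the official STRIPS syntax.

# A minimal sketch of a STRIPS-style operator and its application to a state.

from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset    # literals that must hold in the current state
    add_list: frozenset         # literals made true by the action
    delete_list: frozenset      # literals made false by the action

def applicable(action, state):
    return action.preconditions <= state

def apply_action(action, state):
    assert applicable(action, state)
    return (state - action.delete_list) | action.add_list

# Go(JS): move from Home to the joke shop.
go_js = Action(
    name="Go(JS)",
    preconditions=frozenset({"At(Home)", "Path(Home, JS)"}),
    add_list=frozenset({"At(JS)"}),
    delete_list=frozenset({"At(Home)"}),
)

state = frozenset({"At(Home)", "Path(Home, JS)"})
print(apply_action(go_js, state))   # frozenset({'At(JS)', 'Path(Home, JS)'})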
227
The space of plans
• Adding a step.
• Instantiating a variable.
• Imposing an ordering that places a step in front of another.
• and so on…
228
Representing a plan: partial order planners
• It does not matter whether you deal with your left or right foot first.
• It does matter that you place a sock on before a shoe, for any given foot.
It makes sense in constructing a plan not to make any commitment to which side
is done first if you don’t have to.
Principle of least commitment: do not commit to any specific choices until you
have to. This can be applied both to ordering and to instantiation of variables.
A partial order planner allows plans to specify that some steps must come before
others but others have no ordering.
A linearisation of such a plan imposes a specific sequence on the actions therein.
229
Representing a plan: partial order planners
1. A set {S1, S2, . . . , Sn} of steps. Each of these is one of the available operators.
2. A set of ordering constraints. An ordering constraint Si < Sj denotes the fact
that step Si must happen before step Sj . Si < Sj < Sk and so on has the
obvious meaning. Si < Sj does not mean that Si must immediately precede
Sj .
3. A set of variable bindings v = x where v is a variable and x is either a variable
or a constant.
4. A set of causal links or protection intervals Si →c Sj (the c is written above the
arrow). Such a link denotes the fact that the purpose of Si is to achieve the
precondition c for Sj.
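To see how these four components fit together, here is a small Python sketch of a partial-order plan together with one linearisation obtained by topological sorting of the ordering constraints; the encoding is illustrative only.

# A minimal sketch: a partial-order plan as steps, ordering constraints,
# variable bindings and causal links, plus one linearisation.

from dataclasses import dataclass, field

@dataclass
class PartialOrderPlan:
    steps: set                                       # e.g. {"Start", "Buy(G)", "Finish"}
    orderings: set                                   # pairs (Si, Sj) meaning Si < Sj
    bindings: dict = field(default_factory=dict)     # variable -> variable or constant
    causal_links: set = field(default_factory=set)   # triples (Si, c, Sj)

def linearise(plan):
    """Return one total order of the steps consistent with the ordering constraints."""
    remaining = set(plan.steps)
    order = []
    while remaining:
        # A step is ready if no remaining step is constrained to precede it.
        ready = [s for s in remaining
                 if not any(a in remaining for (a, b) in plan.orderings if b == s)]
        if not ready:
            raise ValueError("ordering constraints are cyclic")
        step = sorted(ready)[0]
        order.append(step)
        remaining.remove(step)
    return order

plan = PartialOrderPlan(
    steps={"Start", "Buy(G)", "Finish"},
    orderings={("Start", "Buy(G)"), ("Buy(G)", "Finish"), ("Start", "Finish")},
    causal_links={("Buy(G)", "Have(G)", "Finish")},
)
print(linearise(plan))   # ['Start', 'Buy(G)', 'Finish']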
230
Representing a plan: partial order planners
In addition to this:
• The step Start has no preconditions, and its effect is the start state for the
problem.
• The step Finish has no effect, and its precondition is the goal.
• Neither Start nor Finish has an associated action.
231
Solutions to planning problems
232
Solutions to planning problems
• Begin with only the Start and Finish steps in the plan.
• At each stage add a new step.
• Always add a new step such that a currently non-achieved precondition is
achieved.
• Backtrack when necessary.
233
An example of partial-order planning
[Diagram: the initial plan, containing only the Start and Finish steps.]
234
An example of partial-order planning
[Diagram: the Go(y) and Buy(y) operator schemas.]
A planner might begin, for example, by adding a Buy(G) action in order to achieve
the Have(G) precondition of Finish.
Note: the following order of events is by no means the only one available to a
planner.
It has been chosen for illustrative purposes.
235
An example of partial-order planning
[Diagram: the plan now contains Start, Buy(G) and Finish.]
Thick arrows denote causal links. They always have a thin arrow underneath.
Here the new Buy step achieves the Have(G) precondition of Finish.
236
An example of partial-order planning
The planner can now introduce a second causal link from Start to achieve the
Sells(x, G) precondition of Buy(G).
[Diagram: Start, Buy(G) and Finish, where Buy(G) has preconditions At(JS) and Sells(JS, G); a causal link from Start achieves Sells(JS, G).]
237
An example of partial-order planning
The planner’s next obvious move is to introduce a Go step to achieve the At(JS)
precondition of Buy(G).
[Diagram: Start, Go(JS), Buy(G) and Finish; Go(JS) achieves the At(JS) precondition of Buy(G).]
And we continue…
238
An example of partial-order planning
• Add a causal link from Start to Go(JS) to achieve the At(x) precondition.
• Add the step Buy(R) with an associated causal link to the Have(R) precondi-
tion of Finish.
• Add a causal link from Start to Buy(R) to achieve the Sells(HS, R) precon-
dition.
239
An example of partial-order planning
[Diagram: the plan now contains Start, Go(JS), Buy(G), Buy(R) and Finish.]
240
An example of partial-order planning
[Diagram: the plan so far, with an At(x) precondition still to be achieved for a Go step.]
241
An example of partial-order planning
A step that might invalidate (sometimes the word clobber is employed) a previ-
ously achieved precondition is called a threat.
[Diagram: a step with effect ¬c threatens a causal link protecting c; the threat can be resolved by demotion (ordering the threatening step before the link) or promotion (ordering it after).]
242
An example of partial-order planning
The planner could backtrack and try to achieve the At(x) precondition using the
existing Go(JS) step.
[Diagram: Start (effects At(Home), Sells(JS, G), Sells(HS, R), Sells(HS, FA)), Go(HS), Go(JS) (effect ¬At(JS)), Buy(G), Buy(R) and Finish.]
This involves a threat, but one that can be fixed using promotion.
243
The algorithm
• Select a precondition p that has not yet been achieved and is associated with
an action B.
• At each stage the partially complete plan is expanded into a new collection of
plans.
• To expand a plan, we can try to achieve p either by using an action that’s
already in the plan or by adding a new action to the plan. In either case, call
the action A.
244
The algorithm
At this stage:
• If you have no further preconditions that haven’t been achieved then any plan
obtained is valid.
245
The algorithm
246
Possible threats
• Each partially complete plan now has a set I of inequality constraints asso-
ciated with it.
• An inequality constraint has the form v ≠ X where v is a variable and X is
a variable or a constant.
• Whenever we try to make a substitution we check I to make sure we won’t
introduce a conflict.
247
Planning II
• The way in which basic heuristics might be defined for use in planning prob-
lems.
• The construction of planning graphs and their use in obtaining more sensible
heuristics.
• Planning graphs as the basis of the GraphPlan algorithm.
248
An example of partial-order planning
[Diagram: recap of the partial-order plan constructed in the previous lecture.]
This involves a threat, but one that can be fixed using promotion.
249
Using heuristics in planning
250
Using heuristics in planning
This can be computationally demanding but two special cases are helpful:
251
Planning graphs
[Table comparing the predicate and propositional representations of actions such as Go(Home), with preconditions and effects like At(Home), At(HS), ¬At(Home), At(JS), ¬At(JS).]
252
Planning graphs
The approximation is due to the fact that not all conflicts between actions are
tracked. So:
• The graph can underestimate how long it might take for a particular proposi-
tion to appear, and therefore . . .
• . . . a heuristic can be extracted.
253
Planning graphs: a simple example
Our intrepid student adventurers will of course need to inflate their gorilla before
attaching it to a distinguished roof . It has to be purchased before it can be inflated.
Start state: Empty.
We assume that anything not mentioned in a state is false. So the state is actually
¬Have(Gorilla) and ¬Inflated(Gorilla)
Actions:
Buy(Gorilla): precondition ¬Have(Gorilla); effect Have(Gorilla).
Inflate(Gorilla): precondition Have(Gorilla); effect Inflated(Gorilla).
254
Planning graphs
[Diagram: a planning graph with levels S0, A0, S1, A1, S2. S0 describes the start
state; A0 contains all actions available in the start state; S1 contains all
possibilities for what might be the case at time 1; A1 contains all actions that
might be available at time 1; S2 contains all possibilities for what might be the
case at time 2.]
255
Mutex links
We also record, using mutual exclusion (mutex) links which pairs of actions could
not occur together.
Mutex links 1: Effects are inconsistent.
[Diagram: at A0, Buy(G) (effect H(G)) and the persistence action for ¬H(G) have inconsistent effects, so they are joined by a mutex link.]
256
Mutex links
[Diagram: at A1, Inf(G) (achieving I(G)) conflicts with the persistence action for ¬I(G), giving another mutex link.]
257
Mutex links
[Diagram: at A1, Buy(G) (precondition ¬H(G)) and Inf(G) (precondition H(G)) are joined by a mutex link.]
The precondition for an action is mutually exclusive with the precondition for
another. (See next slide!)
258
Mutex links
A state level Si contains all propositions that could be true, given the possible
preceding actions.
We also use mutex links to record pairs that can not be true simultaneously:
Possibility 1: pair consists of a proposition and its negation.
[Diagram: at S1, the pair H(G) and ¬H(G) is joined by a mutex link.]
259
Mutex links
Possibility 2: all pairs of actions that could achieve the pair of propositions are
mutex.
[Diagram: a pair of propositions at S2 is marked mutex because all pairs of actions at A1 achieving them (among Buy(G), Inf(G) and the persistence actions) are themselves mutex.]
The construction of a planning graph is continued until two identical levels are
obtained.
260
Planning graphs
[Diagram: the complete planning graph for the gorilla example, with levels S0, A0, S1, A1, S2, actions Buy(G) and Inf(G), propositions H(G) and I(G), and the mutex links identified above.]
261
Obtaining heuristics from a planning graph
• Any proposition not appearing in the final level has infinite cost and can never
be reached.
• The level cost of a proposition is the level at which it first appears but this may
be inaccurate as several actions can apply at each level and this cost does not
count the number of actions. (It is however admissible.)
• A serial planning graph includes mutex links between all pairs of actions ex-
cept persistence actions.
262
Obtaining heuristics from a planning graph
• Max-level: use the maximum level in the graph of any proposition in the set.
Admissible but can be inaccurate.
• Level-sum: use the sum of the levels of the propositions. Inadmissible but
sometimes quite accurate if goals tend to be decomposable.
• Set-level: use the level at which all propositions appear with none being mu-
tex. Can be accurate if goals tend not to be decomposable.
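The following Python sketch computes these three heuristics from level information; the level costs, proposition sets and mutex pairs below are hand-written approximations to the gorilla example rather than the output of a real graph construction.

# A minimal sketch of heuristics extracted from a planning graph.
# level_cost maps each proposition to the level at which it first appears;
# levels[i] is a pair (propositions, mutex_pairs) describing state level Si.

def max_level(goals, level_cost):
    return max(level_cost.get(g, float("inf")) for g in goals)

def level_sum(goals, level_cost):
    return sum(level_cost.get(g, float("inf")) for g in goals)

def set_level(goals, levels):
    """First level at which all goals appear with no pair of them marked mutex."""
    goals = set(goals)
    for i, (props, mutex_pairs) in enumerate(levels):
        if goals <= props and not any({a, b} <= goals for (a, b) in mutex_pairs):
            return i
    return float("inf")

# Illustrative data loosely following the gorilla example.
level_cost = {"H(G)": 1, "I(G)": 2}
levels = [
    ({"¬H(G)", "¬I(G)"}, set()),
    ({"¬H(G)", "H(G)", "¬I(G)"}, {("H(G)", "¬H(G)")}),
    ({"¬H(G)", "H(G)", "¬I(G)", "I(G)"}, {("H(G)", "¬H(G)"), ("I(G)", "¬I(G)")}),
]
goals = ["H(G)", "I(G)"]
print(max_level(goals, level_cost))   # 2
print(level_sum(goals, level_cost))   # 3
print(set_level(goals, levels))       # 2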
263
Other points about planning graphs
The first point here is a loose guarantee because only pairs of items are linked by
mutex links.
Looking at larger collections can strengthen the guarantee, but in practice the
gains are outweighed by the increased computation.
264
Graphplan
The GraphPlan algorithm goes beyond using the planning graph as a source of
heuristics.
function GraphPlan()
    Start at level 0;
    while true do
        if all goal propositions appear in the current level AND no pair has a mutex link then
            Attempt to extract a plan;
            if a solution is obtained then
                return SOME solution;
        if the graph indicates there is no solution then
            return NONE;
        Expand the graph to the next level;
We extract a plan directly from the planning graph. Termination can be proved
but will not be covered here.
265
Graphplan in action
Here, at levels S0 and S1 we do not have both H(G) and I(G) available with no
mutex links, and so we expand first to S1 and then to S2.
[Diagram: the planning graph expanded to S2, where H(G) and I(G) both appear with no mutex link between them.]
266
Extracting a plan from the graph
The effect of such an action is a state having level Si−1, and containing the pre-
conditions for the actions in X.
Each action has a cost of 1.
267
Graphplan in action
[Diagram: the planning graph, with the extracted plan shown beneath it: from the start state, Buy(G) achieves H(G) at S1, then Inf(G) achieves I(G) at S2.]
268
Heuristics for plan extraction
269
Planning III: planning using propositional logic
We’ve seen that plans might be extracted from a knowledge base via theorem
proving, using first order logic (FOL) and situation calculus.
BUT : this might be computationally infeasible for realistic problems.
Sophisticated techniques are available for testing satisfiability in propositional
logic, and these have also been applied to planning.
The basic idea is to attempt to find a model of a sentence having the form
description of start state
∧ descriptions of the possible actions
∧ description of goal
We attempt to construct this sentence such that:
Goal:
G = Ati(a, ground) ∧ Ati(b, spire)
∧ ¬Ati(a, spire) ∧ ¬Ati(b, ground)
Actions: can be introduced using the equivalent of successor-state axioms
At1(a, ground) ↔ (At0(a, ground) ∧ ¬Move0(a, ground, spire))
                ∨ (At0(a, spire) ∧ Move0(a, spire, ground))      (1)
Denote by A the collection of all such axioms.
272
Propositional logic for planning
We will now find that S ∧ A ∧ G has a model in which Move0(a, spire, ground)
and Move0(b, ground, spire) are true while all remaining actions are false.
In more realistic planning problems we will clearly not know in advance the time
at which the goal might be expected to be achieved.
We therefore:
273
Propositional logic for planning
is a model, because the successor-state axiom (1) does not in fact preclude the
application of Move0(a, ground, spire).
We need a precondition axiom
Movei(a, ground, spire) → Ati(a, ground)
and so on.
274
Propositional logic for planning
and so on.
These are action-exclusion axioms.
Unfortunately they will tend to produce totally-ordered rather than partially-
ordered plans.
275
Propositional logic for planning
Alternatively:
276
Review of constraint satisfaction problems (CSPs)
277
The state-variable representation
The relation above is in fact a rigid relation (RR), as it is unchanging: it does not
depend upon state. (Remember fluents in situation calculus?)
Similarly, we have functions
at(x1, s) : D_1^at × S → D^at.
Here, at(x, s) is a state-variable. The domain D_1^at and range D^at are unions of
one or more Di. In general these can have multiple parameters:
sv(x1, . . . , xn, s) : D_1^sv × · · · × D_n^sv × S → D^sv.
A state-variable denotes assertions such as
at(gorilla, s) = jokeShop
where s denotes a state and the set S of all states will be defined later.
The state variable allows things such as locations to change—again, much like
fluents in the situation calculus.
Variables appearing in relations and functions are considered to be typed.
279
The state-variable representation
Note:
So a function is perfect and immediately solves some of the problems seen earlier.
280
The state-variable representation
• Names are unique, and followed by a list of variables involved in the action.
• Preconditions are expressions involving state variables and relations.
• Effects are assignments to state variables.
For example:
buy(x, y, l)
    Preconditions: at(x, s) = l, sells(l, y), has(y, s) = l
    Effects: has(y, s) = x
281
The state-variable representation
Goal:
at(climber, s) = home
has(rope, s) = climber
at(gorilla, s) = spire
From now on we will generally suppress the state s when writing state variables.
282
The state-variable representation
A state is just a statement of what values the state variables take at a given time.
{ at(climber1) = jokeShop,
  at(climber2) = spire,
  . . . }
• For each state variable sv consider all ground instances, such as sv(climber, rope),
with arguments consistent with the rigid relations.
Define X to be the set of all such ground instances.
• A state s is then just a set
s = {(v = c)|v ∈ X}
where c is in the range of v.
Considering all the ground actions consistent with the rigid relations:
sells(jokeShop, gorilla)
284
The state-variable representation
Finally, there is a function γ that maps a state and an action to a new state
γ(s, a) = s′.
Specifically, we have
γ(s, a) = {(v = c) | v ∈ X}
where either c is specified in an effect of a, or otherwise (v = c) is a member of s.
Note: the definition of γ implicitly solves the frame problem.
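Since γ just overwrites the state variables mentioned in the effects and copies everything else, it is easy to sketch; the dictionary encodings of states and ground actions below are illustrative only.

# A minimal sketch of the transition function gamma for the state-variable
# representation. A state is a dict mapping ground state variables to values.

def applicable(state, action):
    return all(state.get(var) == val for var, val in action["preconditions"].items())

def gamma(state, action):
    """Return the successor state: effects overwrite, all other variables persist."""
    assert applicable(state, action)
    new_state = dict(state)               # unchanged variables persist, so the frame
    new_state.update(action["effects"])   # problem is handled implicitly
    return new_state

buy_gorilla = {
    "name": "buy(climber, gorilla, jokeShop)",
    "preconditions": {"at(climber)": "jokeShop", "has(gorilla)": "jokeShop"},
    "effects": {"has(gorilla)": "climber"},
}

s = {"at(climber)": "jokeShop", "has(gorilla)": "jokeShop", "has(rope)": "hardwareStore"}
print(gamma(s, buy_gorilla))
# {'at(climber)': 'jokeShop', 'has(gorilla)': 'climber', 'has(rope)': 'hardwareStore'}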
285
The state-variable representation
286
Converting to a CSP
287
Converting to a CSP
Step 2: encode ground state variables as CSP variables, with a complete copy of
all the state variables for each time step.
So, for each t where 0 ≤ t ≤ T we have a CSP variable
sv_i^t(c1, . . . , cn)
with domain D = D^sv_i. (That is, the domain of the CSP variable is the range of
the state variable.)
Example: at some point in searching for a plan we might attempt to find the
solution to the corresponding CSP involving
location^9(climber1) = hospital.
288
Converting to a CSP
Step 3: encode the preconditions for actions in the planning problem as constraints
in the CSP problem.
For each time step t and for each ground action a(c1, . . . , cn) with arguments
consistent with the rigid relations in its preconditions:
For a precondition of the form sv_i = v include constraint pairs
(action^t = a(c1, . . . , cn), sv_i^t = v).
Example: consider the action buy(x, y, l) introduced above, and having the pre-
conditions at(x) = l, sells(l, y) and has(y) = l.
Assume sells(l, y) is only true for
l = jokeShop
and
y = gorilla
so we only consider these values for l and y. Then for each time step t we have
the constraints…
289
Converting to a CSP
290
Converting to a CSP
Step 4: encode the effects of actions in the planning problem as constraints in the
CSP problem.
For each time step t and for each ground action a(c1, . . . , cn) with arguments
consistent with the rigid relations in its preconditions:
For an effect of the form sv_i = v include constraint pairs
(action^t = a(c1, . . . , cn), sv_i^{t+1} = v).
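As a sketch of how steps 3 and 4 generate constraints, the following Python produces the allowed pairs for one ground action over T time steps; the tuple encoding of CSP variables and constraints is an illustrative choice only.

# A minimal sketch of encoding one ground action's preconditions and effects as
# CSP constraint pairs, for time steps t = 0 .. T-1.

def encode_action(name, preconditions, effects, T):
    """Return constraint pairs of the form ((csp_variable, value), (csp_variable, value))."""
    constraints = []
    for t in range(T):
        action_var = ("action", t)
        for sv, v in preconditions.items():
            # If action^t = name then sv^t = v.
            constraints.append(((action_var, name), ((sv, t), v)))
        for sv, v in effects.items():
            # If action^t = name then sv^(t+1) = v.
            constraints.append(((action_var, name), ((sv, t + 1), v)))
    return constraints

pairs = encode_action(
    name="buy(climber, gorilla, jokeShop)",
    preconditions={"at(climber)": "jokeShop", "has(gorilla)": "jokeShop"},
    effects={"has(gorilla)": "climber"},
    T=2,
)
for p in pairs:
    print(p)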
291
Converting to a CSP
292
Finding a plan
Finally, having encoded a planning problem into a CSP, we solve the CSP.
The scheme has the following property:
A solution to the planning problem with at most T steps exists if and only if there
is a solution to the corresponding CSP.
Assume the CSP has a solution.
Then we can extract a plan simply by looking at the values assigned to the
actiont variables in the solution of the CSP.
It is also the case that:
There is a solution to the planning problem with at most T steps if and only if there
is a solution to the corresponding CSP from which the solution can be extracted in
this way.
For a proof see:
Automated Planning: Theory and Practice
Malik Ghallab, Dana Nau and Paolo Traverso. Morgan Kaufmann 2004.
293
Artificial Intelligence I
At the beginning of the course I suggested making sure you can answer the fol-
lowing two questions:
1. Let
   f(x1, . . . , xn) = Σ_{i=1}^n ai xi^2
   where the ai are constants. Compute ∂f/∂xj where 1 ≤ j ≤ n.
   Answer: As only one term in the sum depends on xj, all the other terms
   differentiate to give 0 and
   ∂f/∂xj = 2 aj xj.
2. Let f (x1, . . . , xn) be a function. Now assume xi = gi(y1, . . . , ym) for each xi
and some collection of functions gi. Assuming all requirements for differen-
tiability and so on are met, can you write down an expression for ∂f /∂yj
where 1 ≤ j ≤ m?
   Answer: this is just the chain rule for partial differentiation:
   ∂f/∂yj = Σ_{i=1}^n (∂f/∂gi)(∂gi/∂yj).
295
Supervised learning with neural networks
We now consider how an agent might learn to solve a general problem by seeing
examples:
297
An example, continued…
A vector of this kind contains all the measurements for a single patient and is
called a feature vector or instance.
The measurements are attributes or features.
Attributes or features generally appear as one of three basic types:
298
An example, continued…
Now imagine that we have a large collection of patient histories (m in total) and
for each of these we know whether or not the patient suffered from D.
299
An example, continued…
[Diagram: the training sequence s is passed to the Learning Algorithm, which outputs a hypothesis h.]
300
An example, continued…
[Diagram: an attribute vector x is passed to the Classifier h, which outputs the label h(x).]
As h is a function it assigns a label to any x and not just the ones that were in the
training sequence.
What we mean by a label here depends on whether we’re doing classification or
regression.
301
Supervised learning: classification and regression
302
Summary
[Diagram: the Learner L maps the training sequence s to a hypothesis h = L(s); the Classifier h maps an attribute vector x to the label h(x).]
303
Neural networks
304
Types of learning
The form of machine learning described is called supervised learning. The litera-
ture also discusses unsupervised learning, semisupervised learning, learning using
membership queries and equivalence queries, and reinforcement learning. (More
about some of this next year…)
Supervised learning has multiple applications:
• Speech recognition.
• Deciding whether or not to give credit.
• Detecting credit card fraud.
• Deciding whether to buy or sell a stock option.
• Deciding whether a tumour is benign.
• Data mining: extracting interesting but hidden knowledge from existing, large
databases. For example, databases containing financial transactions or loan
applications.
• Automatic driving. (See Pomerleau, 1989, in which a car is driven for 90 miles
at 70 miles per hour, on a public road with other cars present, but with no
assistance from humans.)
305
This is very similar to curve fitting
This process is in fact very similar to curve fitting. Think of the process as follows:
Our job is to try to infer what h0 is on the basis of s only. Example: if H is the set of
all polynomials of degree 3 then nature might pick h0(x) = (1/3)x^3 − (3/2)x^2 + 2x − (1/2).
[Figure: the target hypothesis h0, plotted as a dashed curve.]
The line is dashed to emphasise the fact that we don’t get to see it.
306
Curve fitting
[Figure: the noisy training examples (xi, yi) obtained from h0.]
Here we have,
sT = ((x1, y1), (x2, y2), . . . , (xm, ym))
where each xi and yi is a real number.
307
Curve fitting
In other words
h = L(s) = argmin_{h∈H} Σ_{i=1}^m (h(xi) − yi)^2.
Why is this sensible?
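For the one-dimensional example above, this minimisation is exactly least-squares polynomial fitting, which numpy performs directly; the target cubic, noise level and sample size in this sketch are illustrative assumptions.

# A minimal sketch of fitting a degree-3 polynomial to noisy samples by least squares.

import numpy as np

rng = np.random.default_rng(0)

def h0(x):
    # A hypothetical target hypothesis chosen by "nature".
    return (1/3) * x**3 - (3/2) * x**2 + 2 * x - 1/2

m = 20
x = rng.uniform(0.0, 3.0, size=m)
y = h0(x) + rng.normal(scale=0.05, size=m)    # noisy labels

# Choose the h in H (polynomials of degree 3) minimising the sum of squared errors.
coefficients = np.polyfit(x, y, deg=3)
h = np.poly1d(coefficients)

x0 = 1.5
print("h(x0) =", h(x0), " h0(x0) =", h0(x0))  # typically close to one another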
308
Curve fitting
[Figure: the hypothesis h chosen by minimising the squared error, plotted with the target h0.]
The chosen h is close to the target h0, even though it was chosen using only a
small number of noisy examples.
It is not quite identical to the target concept.
However if we were given a new point x0 and asked to guess the value h0(x0)
then guessing h(x0) might be expected to do quite well.
309
Curve fitting
Problem: we don’t know what H nature is using. What if the one we choose
doesn’t match? We can make our H ‘bigger’ by defining it as
H = {h : h is a polynomial of degree at most 5}.
If we use the same learning algorithm then we get:
[Figure: the hypothesis obtained when H is the set of polynomials of degree at most 5.]
The result in this case is similar to the previous one: h is again quite close to h0,
but not quite identical.
310
Curve fitting
[Figure: the hypothesis obtained when H is too small to contain anything close to h0.]
In effect, we have made our H too ‘small’. It does not in fact contain any hypoth-
esis similar to h0.
311
Curve fitting
312
The perceptron
The example just given illustrates much of what we want to do. However in
practice we deal with more than a single dimension, so
x^T = ( x1 x2 · · · xn ).
The simplest form of hypothesis used is the linear discriminant, also known as
the perceptron. Here
h(w; x) = σ( w0 + Σ_{i=1}^n wi xi ) = σ(w0 + w1x1 + w2x2 + · · · + wnxn).
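Taking σ to be the logistic function introduced on the following slides, the perceptron is only a few lines of numpy; the weight values in this sketch are arbitrary and purely illustrative.

# A minimal sketch of a perceptron h(w; x) = sigma(w0 + w1 x1 + ... + wn xn).

import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(w, x):
    """w has n+1 elements (w[0] is the bias w0); x has n elements."""
    return sigma(w[0] + np.dot(w[1:], x))

w = np.array([-1.0, 2.0, 0.5])    # w0, w1, w2
x = np.array([1.0, -3.0])
print(perceptron(w, x))           # a value in (0, 1)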
313
The perceptron activation function I
The step function is important but the algorithms involved are somewhat different
to those we’ll be seeing. We won’t consider it further.
The sigmoid/logistic function plays a major role in what follows.
314
The sigmoid/logistic function
The logistic function σ(z) = 1/(1 + exp(−z)).
[Figure: σ(z) plotted against z, and logistic σ(z) applied to the output of a linear function of inputs x1 and x2, giving Pr(x is in C1).]
315
Gradient descent
A method for training a basic perceptron works as follows. Assume we’re dealing
with a regression problem and using σ(z) = z.
We define a measure of error for a given collection of weights. For example
E(w) = Σ_{i=1}^m (yi − h(w; xi))^2.
316
Gradient descent
One way to approach this is to start with a random w0 and update it as follows:
wt+1 = wt − η (∂E(w)/∂w)|wt
where
∂E(w)/∂w = ( ∂E(w)/∂w0  ∂E(w)/∂w1  · · ·  ∂E(w)/∂wn )^T
and η is some small positive number.
The vector
−∂E(w)/∂w
tells us the direction of the steepest decrease in E(w).
317
Gradient descent
With
E(w) = Σ_{i=1}^m (yi − w^T xi)^2
we have
∂E(w)/∂wj = ∂/∂wj Σ_{i=1}^m (yi − w^T xi)^2
          = Σ_{i=1}^m ∂/∂wj (yi − w^T xi)^2
          = Σ_{i=1}^m 2 (yi − w^T xi) ∂/∂wj (−w^T xi)
          = −2 Σ_{i=1}^m xi^(j) (yi − w^T xi)
where xi^(j) is the jth element of xi.
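This gradient translates directly into code. A minimal batch gradient descent sketch for the linear case, with made-up data and an illustrative learning rate, might look as follows.

# A minimal sketch of batch gradient descent for a linear model with squared
# error E(w) = sum_i (yi - w.xi)^2; the data and learning rate are illustrative.

import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 3
X = rng.normal(size=(m, n))                      # rows are the xi
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.01, size=m)

w = rng.normal(size=n)                           # random initial weights w0
eta = 0.001                                      # small positive learning rate

for t in range(2000):
    residuals = y - X @ w                        # yi - w.xi for all i
    grad = -2.0 * X.T @ residuals                # dE/dw, as derived above
    w = w - eta * grad                           # w_{t+1} = w_t - eta dE/dw

print(w)    # close to w_true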
318
Gradient descent
• In this case E(w) is parabolic and has a unique global minimum and no local
minima so this works well.
• Gradient descent in some form is a very common approach to this kind of
problem.
• We can perform a similar calculation for other activation functions and for
other definitions for E(w).
• Such calculations lead to different algorithms.
319
Perceptrons aren’t very powerful: the parity problem
[Figure: the parity (XOR) data in the (x1, x2) plane, and the output of a trained perceptron over the same plane; a single linear discriminant cannot separate the two classes.]
320
The multilayer perceptron
[Diagram: node j receives inputs z0 = 1, z1, . . . , zn through weights w0, w1, . . . , wn, forms the activation aj = Σ_{i=0}^n wi zi, and outputs zj = σ(aj).]
Weights wi connect nodes together, and aj is the weighted sum or activation for
node j. σ is the activation function and the output is zj = σ(aj ).
Reminder: we’ll continue to use the notation
z^T = ( 1 z1 z2 · · · zn )
w^T = ( w0 w1 w2 · · · wn )
so that
Σ_{i=0}^n wi zi = w0 + Σ_{i=1}^n wi zi = w^T z.
321
The multilayer perceptron
[Diagram: a multilayer perceptron with inputs x1, . . . , xn.]
322
Backpropagation
As usual we have:
323
Backpropagation: the general case
in which case
∂E(w)/∂w = Σ_{p=1}^m ∂Ep(w)/∂w.
We can therefore consider examples individually.
324
Backpropagation: the general case
Place example p at the input and calculate aj and zj for all nodes including the
output y. This is forward propagation.
We have
∂Ep(w) ∂Ep(w) ∂aj
=
∂wi→j ∂aj ∂wi→j
where aj = wk→j zk .
P
k
325
Backpropagation: the general case
So we now need to calculate the values for δj . When j is the output node—that is,
the one producing the output y = h(w; xp) of the network—this is easy as zj = y
and
δj = ∂Ep(w)/∂aj
   = (∂Ep(w)/∂y)(∂y/∂aj)
   = (∂Ep(w)/∂y) σ′(aj)
using the fact that y = σ(aj). The first term is in general easy to calculate for a
given E as the error is generally just a measure of the distance between y and
the label yp in the training sequence.
Example: when
Ep(w) = (y − yp)^2
we have
∂Ep(w)/∂y = 2(y − yp) = 2(h(w; xp) − yp).
326
Backpropagation: the general case
[Diagram: node j sends its output to nodes k1, k2, . . . , kq, with activations ak1, ak2, . . . , akq.]
We’re interested in
δj = ∂Ep(w)/∂aj.
Altering aj can affect several other nodes k1, k2, . . . , kq, each of which can in turn
affect Ep(w).
327
Backpropagation: the general case
We have
δj = ∂Ep(w)/∂aj = Σ_{k∈{k1,...,kq}} (∂Ep(w)/∂ak)(∂ak/∂aj) = Σ_{k∈{k1,...,kq}} δk (∂ak/∂aj)
where k1, k2, . . . , kq are the nodes to which node j sends a connection.
328
Backpropagation: the general case
Because we know how to compute δj for the output node we can work backwards
computing further δ values.
We will always know all the values δk for nodes ahead of where we are.
Hence the term backpropagation.
329
Backpropagation: the general case
∂ak/∂aj = ∂/∂aj ( Σ_i wi→k σ(ai) ) = wj→k σ′(aj)
and
δj = Σ_{k∈{k1,...,kq}} δk wj→k σ′(aj) = σ′(aj) Σ_{k∈{k1,...,kq}} δk wj→k.
330
Backpropagation: the general case
Summary: to calculate ∂Ep(w)/∂w for the pth pattern:
1. Forward propagation: apply xp and calculate outputs etc. for all the nodes in
the network.
2. Backpropagation 1: for the output node
∂Ep(w)/∂wi→j = zi δj = zi σ′(aj) ∂Ep(w)/∂y
where y = h(w; xp).
3. Backpropagation 2: for other nodes
∂Ep(w)/∂wi→j = zi σ′(aj) Σ_k δk wj→k.
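A minimal numpy sketch of these three steps for a single pattern, using one hidden layer of logistic units and a linear output with Ep(w) = (y − yp)^2; the architecture, sizes and initialisation are illustrative assumptions.

# A minimal sketch of forward propagation and backpropagation for one pattern.

import numpy as np

def sigma(a):
    return 1.0 / (1.0 + np.exp(-a))

def grads_for_pattern(W1, b1, w2, b2, xp, yp):
    """Return dEp/dW1, dEp/db1, dEp/dw2, dEp/db2 for Ep = (y - yp)^2."""
    # 1. Forward propagation.
    a1 = W1 @ xp + b1               # hidden activations a_j
    z1 = sigma(a1)                  # hidden outputs z_j
    y = w2 @ z1 + b2                # linear output node: y = h(w; xp)

    # 2. Output node: delta = dEp/da = (dEp/dy) sigma'(a), with sigma(a) = a here.
    delta_out = 2.0 * (y - yp)

    # 3. Other nodes: delta_j = sigma'(a_j) * sum_k delta_k w_{j->k}.
    delta_hidden = z1 * (1.0 - z1) * (w2 * delta_out)

    dW1 = np.outer(delta_hidden, xp)    # dEp/dw_{i->j} = z_i delta_j
    db1 = delta_hidden
    dw2 = delta_out * z1
    db2 = delta_out
    return dW1, db1, dw2, db2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 2)), np.zeros(5)    # 2 inputs, 5 hidden units
w2, b2 = rng.normal(size=5), 0.0                 # single linear output
print(grads_for_pattern(W1, b1, w2, b2, xp=np.array([1.0, 0.0]), yp=1.0))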
331
Backpropagation: a specific example
[Diagram: the network for this example, with inputs x1, . . . , xn.]
For the output: σ(a) = a. For the hidden nodes: σ(a) = 1/(1 + exp(−a)).
332
Backpropagation: a specific example
333
Backpropagation: a specific example
334
Backpropagation: a specific example
335
Putting it all together
then
wt+1 = wt − η (∂E(w)/∂w)|wt.
Sequential: using just one pattern at once,
wt+1 = wt − η (∂Ep(w)/∂w)|wt.
336
Example: the parity problem revisited
• Two inputs.
• One output.
• One hidden layer containing 5 units.
• η = 0.01.
• All other details as above.
337
Example: the parity problem revisited
[Figure: two plots of the parity problem data in the (x1, x2) plane.]
338
Example: the parity problem revisited
[Figure: the network output over the (x1, x2) plane, before and after training.]
339
Example: the parity problem revisited
[Figure: the error during training, plotted against iteration number from 0 to 1000.]
340