Lect28
• Recall that in Jason you write rules/plans that invoke actions that are
defined by the environment model.
• Imagine if these could be learnt.
https://www.youtube.com/watch?v=nUQsRPJ1dYw
How to decide what to do
• Consider being offered a bet in which you pay £2 if an odd number is rolled on a die, and win £3 if an even number appears.
• Is this a good bet?
• To analyse this, we need the expected value of the bet.
• We do this in terms of a random variable, which we will call X.
• X can take two values:
      +3 if the die rolls even
      −2 if the die rolls odd
• And we can also calculate the probability of these two values:
      Pr(X = 3) = 0.5
      Pr(X = −2) = 0.5
• The expected value is then the weighted sum of the values, where the weights are the probabilities.
• Formally, the expected value of X is defined by:
      E(X) = Σ_k k · Pr(X = k)
  where the summation is over all values of k for which Pr(X = k) ≠ 0.
• Here the expected value is:
      E(X) = 0.5 × 3 + 0.5 × (−2)
• Thus the expected value of X is £0.5, and we take this to be the value of the bet.
• As opposed to £0 if you don’t take the bet.
How to decide what to do
which is −0.33.
How an agent might decide what to do
• In other words, for the set of outcomes s_a of each action a, the agent should calculate:
      E(u(s_a)) = Σ_{s' ∈ s_a} u(s') · Pr(s_a = s')
  and pick the best (see the sketch below).
[Figure: states s1–s6 linked by actions a1 and a2.]

Sequential decision problems
• These approaches give us a battery of techniques to apply to individual decisions by agents.
• However, they aren’t really sufficient.
• Agents aren’t usually in the business of taking single decisions.
• Life is a series of decisions.
• The best overall result is not necessarily obtained by a greedy approach to a series of decisions.
• The current best option isn’t the best thing in the long run.
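A minimal Python sketch of the single-decision rule above (the utilities and probabilities here are made-up numbers, not from the lecture): for each action, weight the utility of each possible outcome state by its probability, and pick the action with the highest expected utility.

    # Minimal sketch: choosing the action with maximum expected utility.
    # Each action maps to a distribution over outcome states s'; u(s')
    # comes from a utility table. All names and numbers are illustrative.

    utilities = {"s2": 5.0, "s3": 1.0, "s4": 2.0, "s5": 8.0, "s6": 0.0}

    outcome_probs = {
        "a1": {"s2": 0.7, "s3": 0.3},
        "a2": {"s4": 0.2, "s5": 0.5, "s6": 0.3},
    }

    def expected_utility(action):
        return sum(p * utilities[s] for s, p in outcome_probs[action].items())

    best = max(outcome_probs, key=expected_utility)
    print(best, expected_utility(best))  # a2: 0.2*2 + 0.5*8 + 0.3*0 = 4.4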
[Figure: a grid world with a start cell S, a goal cell G, and shaded obstacle cells.]
• To get from the start point (S) to the goal (G), an agent needs to repeatedly make a decision about what to do.
[Figure: the grid world with start state S and terminal states labelled +1 and −1, together with the motion model: the agent moves in the intended direction with probability 0.8, and to either side with probability 0.1 each.]
[Figure: the grid world again, with start state S and terminal states +1 and −1.]
• It will get to the goal with probability 0.8^5 = 0.32768 doing what it expects/hopes to do.
• Arguably a more accurate approximation than assuming that it will always do what it is programmed to do.
Motion model
• It can also reach the goal going around the obstacle the other way, with probability 0.1^4 × 0.8.
[Figure: the grid world with start state S, terminal states +1 and −1, and the obstacle.]

Rewards
• To complete the description, we have to give a reward to every state.
• To give the agent an incentive to reach the goal quickly, we give each non-terminal state a reward of −0.04.
• Equivalent to a cost for each action.
• So if the goal is reached after 10 steps, the agent’s overall reward is 1 − (10 × 0.04) = 0.6.
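As a quick check of these numbers (an illustrative snippet, not from the lecture):

    # Arithmetic check of the figures quoted above.
    p_direct = 0.8 ** 5        # reach the goal doing exactly what was intended
    p_around = 0.1 ** 4 * 0.8  # slip sideways four times, around the obstacle
    print(p_direct)            # ~0.32768
    print(p_direct + p_around) # ~0.32776 when the two routes are combined

    reward = 1 - 10 * 0.04     # +1 at the goal, minus 0.04 for each of 10 steps
    print(reward)              # 0.6 (up to floating-point rounding)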
Policies
• We compute this using the Bellman equation:
      U(s) = R(s) + γ max_{a ∈ A(s)} Σ_{s'} Pr(s'|s,a) U(s')
• γ is a discount factor.
• Note that this is specific to the value of the reward R(s) for non-terminal states — different rewards will give different policies.
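The next slide is about value iteration, which applies this update repeatedly. As an illustration (not the lecture's own code), here is a compact Python sketch for a grid world like the one above; the step reward of −0.04, the ±1 terminals and the 0.8/0.1 motion model come from the slides, while the exact 4×3 layout and cell positions are assumptions, since the slides only show the grid as a picture, and γ is set to 1 for simplicity.

    # Value-iteration sketch for a grid world like the one in the slides.
    # Assumed: a 4x3 grid with an obstacle at (1, 1) and terminals at
    # (3, 2) [+1] and (3, 1) [-1]. Taken from the slides: step reward
    # -0.04, terminal rewards +1/-1, and the 0.8/0.1 motion model.

    GAMMA = 1.0          # discount factor; 1 treats all steps equally
    STEP_REWARD = -0.04  # reward of every non-terminal state

    COLS, ROWS = 4, 3
    OBSTACLE = (1, 1)
    TERMINALS = {(3, 2): 1.0, (3, 1): -1.0}

    STATES = [(x, y) for x in range(COLS) for y in range(ROWS) if (x, y) != OBSTACLE]
    ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    SIDEWAYS = {"up": ("left", "right"), "down": ("left", "right"),
                "left": ("up", "down"), "right": ("up", "down")}

    def move(state, action):
        """Deterministic effect of an action; hitting a wall or the obstacle stays put."""
        dx, dy = ACTIONS[action]
        nxt = (state[0] + dx, state[1] + dy)
        return nxt if nxt in STATES else state

    def transitions(state, action):
        """Pr(s'|s,a): 0.8 in the intended direction, 0.1 to each side."""
        probs = {}
        for a, p in [(action, 0.8), (SIDEWAYS[action][0], 0.1), (SIDEWAYS[action][1], 0.1)]:
            nxt = move(state, a)
            probs[nxt] = probs.get(nxt, 0.0) + p
        return probs

    def value_iteration(eps=1e-6):
        U = {s: 0.0 for s in STATES}
        while True:
            new_U, delta = {}, 0.0
            for s in STATES:
                if s in TERMINALS:
                    new_U[s] = TERMINALS[s]  # a terminal's utility is its reward
                else:
                    # Bellman update: R(s) + gamma * max_a sum_s' Pr(s'|s,a) U(s')
                    new_U[s] = STEP_REWARD + GAMMA * max(
                        sum(p * U[s2] for s2, p in transitions(s, a).items())
                        for a in ACTIONS)
                delta = max(delta, abs(new_U[s] - U[s]))
            U = new_U
            if delta < eps:
                return U

    for s, u in sorted(value_iteration().items()):
        print(s, round(u, 3))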
Value iteration

Applications
Reinforcement learning
• OK, now we have the notion of an MDP, imagine we don’t know what the model is.
• We don’t know R(s).
• We don’t know Pr(s'|s,a).
• But it is simple to learn them — the agent just moves around the environment.
http://vimeo.com/13387420
• Since the agent knows what state s' it gets to when it executes a in s, it can count how often particular transitions occur to estimate Pr(s'|s,a) as the proportion of times executing a in s takes the agent to s'.
• Similarly, the agent can see what reward it gets in s, giving it R(s).
• If the agent wanders randomly for long enough, it will learn the probability and reward values.
• (How would it know what “long enough” was?)
• With these values it can apply the Bellman equation(s) and start doing the right thing.
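A minimal sketch of this model-learning idea (illustrative code, not from the lecture): keep counts of observed transitions and observed rewards, and turn them into estimates of Pr(s'|s,a) and R(s).

    from collections import defaultdict

    # Sketch: learning the MDP model from experience tuples (s, a, r, s'),
    # meaning the agent executed a in s, saw reward r, and ended up in s'.

    transition_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
    reward_sums = defaultdict(float)                           # s -> total reward observed
    reward_counts = defaultdict(int)                           # s -> number of visits

    def record(s, a, r, s_next):
        transition_counts[(s, a)][s_next] += 1
        reward_sums[s] += r
        reward_counts[s] += 1

    def estimated_transition(s, a):
        """Pr(s'|s,a) as the proportion of times a in s led to each s'."""
        counts = transition_counts[(s, a)]
        total = sum(counts.values())
        return {s2: c / total for s2, c in counts.items()} if total else {}

    def estimated_reward(s):
        """R(s) as the average reward observed in s."""
        return reward_sums[s] / reward_counts[s] if reward_counts[s] else 0.0

    # Example with made-up observations:
    record("s1", "a1", -0.04, "s2")
    record("s1", "a1", -0.04, "s2")
    record("s1", "a1", -0.04, "s3")
    print(estimated_transition("s1", "a1"))  # roughly {'s2': 0.67, 's3': 0.33}
    print(estimated_reward("s1"))            # roughly -0.04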
Reinforcement learning

Q-learning
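As a reminder of the standard tabular Q-learning update (textbook form; the notation used in the lecture may differ), it can be sketched as:

    from collections import defaultdict

    # Standard tabular Q-learning update (textbook form):
    # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))

    Q = defaultdict(float)      # (state, action) -> estimated value
    ALPHA, GAMMA = 0.1, 0.9     # learning rate and discount factor (assumed values)

    def q_update(s, a, r, s_next, actions):
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

    # Example with made-up data:
    q_update("s1", "right", -0.04, "s2", ["up", "down", "left", "right"])
    print(Q[("s1", "right")])   # about -0.004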
Summary