16 RL
Chapter 21
Mausam
(some slides by Rajesh Rao)
MDPs
Agent–environment loop: percepts in, actions out. What action next?
Assumptions of the basic MDP setting:
• Static environment
• Fully observable
• Stochastic actions
• Instantaneous actions
• Perfect percepts
Reinforcement Learning
• S: a set of states
• A: a set of actions
• T(s,a,s’): transition model
• R(s,a): reward model
• γ: discount factor
• Still looking for a policy π(s)
Learning/Planning/Acting
Main Dimensions
Example: foraging
• Bees can learn a near-optimal foraging policy in a field of artificial flowers with controlled nectar supplies
Passive Learning (Policy Evaluation)
Remember
• We don’t know T
• We don’t know R
• But we can execute the policy π (and simulate it)
Key Idea
• Compute expectations by averaging over samples
Aside: Expected Age
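The slide body is not reproduced here; as a hedged illustration of "compute expectations by averaging over samples", the following Python sketch (with invented numbers) contrasts the exact expectation from a known distribution, a model-free sample average, and a model-based estimate that counts outcomes first:

import random

# Hypothetical distribution over ages, just to illustrate the idea.
age_dist = {20: 0.35, 21: 0.35, 22: 0.30}

# Known distribution: the expectation is a weighted sum.
exact = sum(p * a for a, p in age_dist.items())

# Only samples: estimate the expectation by averaging them (model-free).
samples = random.choices(list(age_dist), weights=list(age_dist.values()), k=10000)
model_free = sum(samples) / len(samples)

# Model-based alternative: estimate the distribution by counting,
# then take the weighted sum with the estimated probabilities.
counts = {a: samples.count(a) for a in age_dist}
model_based = sum((counts[a] / len(samples)) * a for a in age_dist)

print(exact, model_free, model_based)  # all three should be close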
Example
• 3×4 grid world: rows A–C, columns 1–4 (12 states, 4 actions)
• Reward(action) = -1
• Discount factor = 1
• A4 (reward +100) and C4 (reward -100) are absorbing states
Data on Executing π (two episodes in the grid above)
• Episode 1: (A1, D, -1), (B1, R, -1), (B2, R, -1), (B3, U, -1), (A3, R, -1), (A2, D, -1), (B2, R, -1), (B3, U, -1), (A3, R, -1), (A4, +100)
• Episode 2: (A1, D, -1), (B1, R, -1), (B2, R, -1), (B3, U, -1), (C3, U, -1), (C4, -100)
Estimated transition probabilities, as relative frequencies in the data (see the sketch below):
• T(A1, D, B1) = 1
• T(B3, U, A3) = 2/3
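A sketch (not from the slides) of the counting estimate behind these numbers, with the two episodes written as (state, action, reward) triples; terminal entries carry no action:

from collections import Counter, defaultdict

episodes = [
    [("A1", "D", -1), ("B1", "R", -1), ("B2", "R", -1), ("B3", "U", -1), ("A3", "R", -1),
     ("A2", "D", -1), ("B2", "R", -1), ("B3", "U", -1), ("A3", "R", -1), ("A4", None, 100)],
    [("A1", "D", -1), ("B1", "R", -1), ("B2", "R", -1), ("B3", "U", -1), ("C3", "U", -1),
     ("C4", None, -100)],
]

# Count observed transitions (s, a) -> s'.
trans_counts = defaultdict(Counter)
for ep in episodes:
    for (s, a, r), (s_next, _, _) in zip(ep, ep[1:]):
        trans_counts[(s, a)][s_next] += 1

def T_hat(s, a, s_next):
    """Estimated transition probability: relative frequency of s' after (s, a)."""
    c = trans_counts[(s, a)]
    return c[s_next] / sum(c.values()) if c else 0.0

print(T_hat("A1", "D", "B1"))  # 1.0
print(T_hat("B3", "U", "A3"))  # 0.666...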
Method 2: Empirical Estimation of V^π
Data on Executing π (same two episodes as above)
Estimate V^π(s) directly as the average of the returns observed from s (see the sketch below):
• V^π(B1) = ?
• V^π(B2) = ?
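A sketch of this direct (Monte Carlo) estimation, assuming γ = 1 and every-visit averaging; the tiny demo episode at the end is made up just to show the call, and the episodes list above can be plugged in instead:

from collections import defaultdict

def direct_estimate(episodes, gamma=1.0):
    """V^pi(s) ~ average of the observed discounted returns from each visit to s."""
    returns = defaultdict(list)
    for ep in episodes:
        rewards = [r for (*_, r) in ep]
        for t, (s, *_) in enumerate(ep):
            # Discounted return from time t to the end of the episode.
            G = sum(gamma ** k * r for k, r in enumerate(rewards[t:]))
            returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

demo = [[("B1", "R", -1), ("B2", "R", -1), ("A4", None, 100)]]
print(direct_estimate(demo))  # {'B1': 98.0, 'B2': 99.0, 'A4': 100.0}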
Properties
Direct estimation is wasteful (why?)
• Compare V^π(B1) and V^π(B2): each is estimated from its own returns, even though the data show B2 immediately follows B1 under π, so their values are closely related
Method 3: Temporal Difference Learning
V^π(s) = Σ_{s′} T(s, π(s), s′) [ R(s, π(s), s′) + γ V^π(s′) ]
The sum over successors s′ represents the relationship between s and s′.
TD learning: compute this expectation as an average over samples.
TD Learning
V^π(s) = Σ_{s′} T(s, π(s), s′) [ R(s, π(s), s′) + γ V^π(s′) ]
Say I know the correct values of V^π(s1) and V^π(s2).
Example (γ = 1): executing a0 = π(s) in s reaches s1 (reward 5, V^π(s1) = 5) with Pr = 0.6, and s2 (reward 2, V^π(s2) = 3) with Pr = 0.4.
• Exact expectation: V^π(s) = 0.6(5 + 5) + 0.4(2 + 3) = 6 + 2 = 8
• Average over samples (say 3 reach s1 and 2 reach s2): V^π(s) = (10 + 10 + 10 + 5 + 5)/5 = 8
TD Learning
V^π(s) = Σ_{s′} T(s, π(s), s′) [ R(s, π(s), s′) + γ V^π(s′) ]
The inner term is the sample value:
• (s, s′, r): reached s′ from s by executing π(s) and got immediate reward r
• sample = r + γ V^π(s′)
Compute V^π(s) = (1/N) Σ_i sample_i
Running average over the first n+1 samples, with learning rate α:
• V^π_{n+1}(s) ← V^π_n(s) + α (sample_{n+1} − V^π_n(s))
• Nudge the old estimate towards the new sample
TD Learning
Given an observed transition (s, s′, r):
• V^π(s) ← V^π(s) + α (sample − V^π(s))
• V^π(s) ← V^π(s) + α (r + γ V^π(s′) − V^π(s)), where r + γ V^π(s′) − V^π(s) is the TD error
• V^π(s) ← (1 − α) V^π(s) + α (r + γ V^π(s′))
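A sketch of tabular TD(0) policy evaluation implementing this update; step, policy, and start_states are placeholder interfaces, not from the slides:

import random
from collections import defaultdict

def td0_evaluate(step, policy, start_states, gamma=0.9, alpha=0.1, n_steps=10000):
    """Tabular TD(0): nudge V(s) towards the sample r + gamma * V(s').

    Assumes step(s, a) -> (s_next, r, done) and policy(s) -> action.
    """
    V = defaultdict(float)
    s = random.choice(start_states)
    for _ in range(n_steps):
        a = policy(s)
        s_next, r, done = step(s, a)
        sample = r if done else r + gamma * V[s_next]  # sample = r + gamma V(s')
        V[s] += alpha * (sample - V[s])                # move V(s) towards the sample
        s = random.choice(start_states) if done else s_next
    return V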
Classical (Pavlovian) conditioning experiments
• Training: Bell → Food
• After: Bell → Salivate
• Conditioned stimulus (bell) predicts future reward (food)
Predicting Delayed Rewards
• Before training vs. after training
• No error: [0 + v(t+1) − v(t)] = 0 once v(t) = r(t) + v(t+1)
Figure from Theoretical Neuroscience by Peter Dayan and Larry Abbott, MIT Press, 2001
More Evidence for Prediction Error Signals
• Negative error: r(t) = 0 and v(t+1) = 0, so [r(t) + v(t+1) − v(t)] = −v(t)
Figure from Theoretical Neuroscience by Peter Dayan and Larry Abbott, MIT Press, 2001
The Story So Far: MDPs and RL
Goal → Technique
• Compute V*, Q*, π* → value / policy iteration
• Evaluate a fixed policy π → policy evaluation
Key challenge?
Model-based RL Example
• Grid world (rows A–C, columns 1–4) with mostly unknown cells (?); A4 = +100, and a cell in row C has reward -2
• Say the world is deterministic and there is no wind
Key challenge
• Just executing the current policy π_i is not enough!
• It may miss important regions
• Needs to explore new regions
TD Learning → TD (V*) Learning
V*(s) = max_a Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ]
Q*(s, a) = Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ V*(s′) ]
Q*(s, a) = Σ_{s′} T(s, a, s′) [ R(s, a, s′) + γ max_{a′} Q*(s′, a′) ]
VI → Q-Value Iteration
TD Learning → Q-Learning
Q Learning
Q Learning Algorithm (see the Python sketch below)
Forall s, a
• Initialize Q(s, a) = 0
Repeat forever:
• Where are you? s
• Choose some action a
• Execute it in the real world: observe (s, a, r, s′)
• Do update: Q(s, a) ← (1 − α) Q(s, a) + α (r + γ max_{a′} Q(s′, a′))
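A sketch of this loop in Python; the environment interface (env.reset, env.step) is an assumption, and the action here is chosen uniformly at random (exploration strategies come later):

import random
from collections import defaultdict

def q_learning(env, actions, gamma=0.9, alpha=0.1, n_steps=100000):
    """Tabular Q-learning; assumes env.reset() -> s and env.step(a) -> (s_next, r, done)."""
    Q = defaultdict(float)             # Q[(s, a)], initialized to 0
    s = env.reset()
    for _ in range(n_steps):
        a = random.choice(actions)     # "choose some action a"
        s_next, r, done = env.step(a)
        best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
        s = env.reset() if done else s_next
    return Q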
Properties
Q-learning converges to the optimal Q-values as long as
• states and actions are finite, and all rewards are bounded
• no (s, a) pair is starved: each is visited infinitely often in the limit of infinite samples
• the learning rate decays with visits to each state-action pair
• but not too fast: Σ_i α_i(s, a) = ∞ and Σ_i α_i(s, a)² < ∞
Q Learning Algorithm
Forall s, a
• Initialize Q(s, a) = 0
Repeat forever:
• Where are you? s
• Choose some action a. How to choose?
  • try something new: exploration
  • pick the best-looking action: exploitation
• Execute it in the real world: observe (s, a, r, s′)
• Do update: Q(s, a) ← (1 − α) Q(s, a) + α (r + γ max_{a′} Q(s′, a′))
Exploration vs. Exploitation Tradeoff
A fundamental tradeoff in RL
Problem with ε-greedy exploration
• the exploration probability ε is constant, so random actions keep being taken even after a good policy has been learned
Solutions (see the sketch below)
• Lower ε over time
• Use an exploration function
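One way to realize "lower ε over time", as a sketch rather than the slides' own code: ε-greedy selection plus a decay schedule; the constants are arbitrary:

import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon explore (random action); otherwise exploit (greedy)."""
    if random.random() < epsilon:
        return random.choice(actions)                      # exploration
    return max(actions, key=lambda a: Q.get((s, a), 0.0))  # exploitation

def epsilon_schedule(t, eps_start=1.0, eps_min=0.05, decay=0.999):
    """Exponentially decaying exploration probability (one arbitrary choice of schedule)."""
    return max(eps_min, eps_start * decay ** t)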
Explore/Exploit Policies
Boltzmann Exploration
• Select action a with probability Pr(a|s) = exp(Q(s, a)/T) / Σ_{a′∈A} exp(Q(s, a′)/T)
T: temperature
• Similar to simulated annealing
• Large T: near-uniform; small T: near-greedy
• Start with large T and decrease it with time (see the sketch below)
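A sketch of Boltzmann (softmax) action selection as described above; subtracting the maximum before exponentiating is a numerical-stability choice of mine:

import math
import random

def boltzmann_action(Q, s, actions, temperature):
    """Pr(a|s) proportional to exp(Q(s,a)/T): near-uniform for large T, near-greedy for small T."""
    prefs = [Q.get((s, a), 0.0) / temperature for a in actions]
    m = max(prefs)                                # shift for numerical stability
    weights = [math.exp(p - m) for p in prefs]
    return random.choices(actions, weights=weights, k=1)[0]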
Exploration Functions
• Stop exploring actions whose badness is established
• Continue exploring other actions
Let Q(s, a) = q and #visits(s, a) = n
• E.g.: f(q, n) = q + k/n
• Unexplored state-action pairs have infinite f
• Heavily explored bad pairs have low f
Modified Q update (see the sketch below)
• Q(s, a) ← (1 − α) Q(s, a) + α (r + γ max_{a′} f(Q(s′, a′), N(s′, a′)))
• States leading to unexplored states are also preferred
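A sketch of the modified update with f(q, n) = q + k/n; the constant k, and the use of a large finite optimistic value instead of literal infinity for unvisited pairs, are my assumptions:

from collections import defaultdict

K = 10.0        # exploration bonus weight (arbitrary)
OPTIMISM = 1e6  # stand-in for the slide's "infinite f" on unvisited pairs

def f(q, n):
    """Exploration function f(q, n) = q + k/n; unvisited pairs look (almost) infinitely good."""
    return OPTIMISM if n == 0 else q + K / n

def optimistic_update(Q, N, s, a, r, s_next, actions, gamma=0.9, alpha=0.1):
    """Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma * max_a' f(Q(s',a'), N(s',a')))."""
    N[(s, a)] += 1
    best = max(f(Q[(s_next, a2)], N[(s_next, a2)]) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best)

# Q and N can be defaultdicts keyed by (state, action):
Q, N = defaultdict(float), defaultdict(int)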
Explore/Exploit Policies
UCB-style action selection:
• UCT(s) = argmax_a [ Q(s, a) + c √( ln n(s) / n(s, a) ) ]
Value term: favors actions that looked good historically
Exploration term: actions get an exploration bonus that grows with ln n(s) and shrinks as n(s, a) grows
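A sketch of this selection rule; the constant c and the "try unvisited actions first" handling are implementation choices, not from the slides:

import math

def ucb_action(Q, N_s, N_sa, s, actions, c=1.0):
    """Pick argmax_a Q(s,a) + c * sqrt(ln n(s) / n(s,a))."""
    def score(a):
        n_sa = N_sa.get((s, a), 0)
        if n_sa == 0:
            return float("inf")  # force every action to be tried at least once
        n_s = max(N_s.get(s, 1), 1)
        return Q.get((s, a), 0.0) + c * math.sqrt(math.log(n_s) / n_sa)
    return max(actions, key=score)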
Model-based
• estimates O(|S|²|A|) parameters
• requires relatively more data for learning
• can make use of background knowledge easily
Model-free
• estimates O(|S||A|) parameters
• requires relatively less data for learning
Generalizing Across States
Feature-based Representation
Approximate Q-Learning
Optimization: Least Squares
(figure: the error is the gap between each observation and the prediction of the fitted line)
Minimizing Error
Imagine we had only one point x, with features f(x), target value y, and weights w.
Approximate Q update explained:
• nudge each weight so that the “prediction” moves towards the “target”
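A sketch of the approximate Q update with a linear feature representation, assuming Q(s, a) = Σ_i w_i f_i(s, a) and the weight update w_i ← w_i + α (target − prediction) f_i(s, a); the feature function feat_fn is a placeholder:

def q_value(w, features):
    """Linear approximation: Q(s, a) = sum_i w_i * f_i(s, a)."""
    return sum(wi * fi for wi, fi in zip(w, features))

def approx_q_update(w, feat_fn, s, a, r, s_next, actions, gamma=0.9, alpha=0.01):
    """Return updated weights after one (s, a, r, s') sample.

    target     = r + gamma * max_a' Q(s', a')
    prediction = Q(s, a)
    w_i       <- w_i + alpha * (target - prediction) * f_i(s, a)
    Assumes feat_fn(s, a) returns the list of feature values f_i(s, a).
    """
    feats = feat_fn(s, a)
    prediction = q_value(w, feats)
    target = r + gamma * max(q_value(w, feat_fn(s_next, a2)) for a2 in actions)
    error = target - prediction
    return [wi + alpha * error * fi for wi, fi in zip(w, feats)]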
Overfitting and Limited Capacity Approximations