Mod2 Slides
$$p(s_{t+1} = s' \mid s_t = s, a_t = a)$$
Evaluation
Estimate/predict the expected rewards from following a given policy
Control
Optimization: find the best policy
Markov Processes
Markov Reward Processes (MRPs)
Markov Decision Processes (MDPs)
Evaluation and Control in MDPs
$$P = \begin{pmatrix}
0.6 & 0.4 & 0 & 0 & 0 & 0 & 0 \\
0.4 & 0.2 & 0.4 & 0 & 0 & 0 & 0 \\
0 & 0.4 & 0.2 & 0.4 & 0 & 0 & 0 \\
0 & 0 & 0.4 & 0.2 & 0.4 & 0 & 0 \\
0 & 0 & 0 & 0.4 & 0.2 & 0.4 & 0 \\
0 & 0 & 0 & 0 & 0.4 & 0.2 & 0.4 \\
0 & 0 & 0 & 0 & 0 & 0.4 & 0.6
\end{pmatrix}$$
Question for today’s lecture (not for poll): Can we construct algorithms
for computing decision policies so that we can guarantee that, with additional
computation / iterations, we monotonically improve the decision policy?
Do all algorithms satisfy this property?
Yes it is possible! We will see this today. Not all of them do.
Last Time:
Introduction
Components of an agent: model, value, policy
This Time:
Making good decisions given a Markov decision process
Next Time:
Policy evaluation when we don’t have a model of how the world works
For a finite state MRP, we can express V(s) using a matrix equation:
$$\begin{pmatrix} V(s_1) \\ \vdots \\ V(s_N) \end{pmatrix}
= \begin{pmatrix} R(s_1) \\ \vdots \\ R(s_N) \end{pmatrix}
+ \gamma \begin{pmatrix}
P(s_1|s_1) & \cdots & P(s_N|s_1) \\
P(s_1|s_2) & \cdots & P(s_N|s_2) \\
\vdots & \ddots & \vdots \\
P(s_1|s_N) & \cdots & P(s_N|s_N)
\end{pmatrix}
\begin{pmatrix} V(s_1) \\ \vdots \\ V(s_N) \end{pmatrix}$$
$$V = R + \gamma P V$$
$$V - \gamma P V = R$$
$$(I - \gamma P)V = R$$
$$V = (I - \gamma P)^{-1} R$$
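As a concrete check, here is a minimal NumPy sketch (my own illustrative code, not from the slides) that solves the 7-state Mars rover MRP above analytically via V = (I − γP)⁻¹R; the reward vector and γ = 0.5 are assumptions chosen for illustration:

```python
import numpy as np

# Transition matrix P of the 7-state Mars rover MRP (from the slide above).
P = np.array([
    [0.6, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.2, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.2, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.2, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.2, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.2, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.6],
])
R = np.array([1.0, 0, 0, 0, 0, 0, 10.0])   # assumed reward per state, as in the rover example
gamma = 0.5                                 # assumed discount factor

# Analytic solution: V = (I - gamma * P)^{-1} R, computed with a linear solve.
V = np.linalg.solve(np.eye(7) - gamma * P, R)
print(V)
```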
Dynamic programming
Initialize V0 (s) = 0 for all s
For k = 1 until convergence
For all s in S
$$V_k(s) = R(s) + \gamma \sum_{s' \in S} P(s'|s)\, V_{k-1}(s')$$
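For comparison, a sketch of the dynamic programming iteration on the same MRP (reusing P, R, gamma, and np from the previous snippet):

```python
# Iterative (dynamic programming) evaluation of the same MRP,
# reusing P, R, gamma from the previous snippet.
V = np.zeros(7)                       # V_0(s) = 0 for all s
for k in range(10000):                # "until convergence"
    V_new = R + gamma * P @ V         # V_k(s) = R(s) + gamma * sum_{s'} P(s'|s) V_{k-1}(s')
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V)   # matches the analytic solution above
```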
¹ Reward is sometimes defined as a function of the current state, or as a function of the (state, action, next state) tuple. Most frequently in this class, we will assume reward is a function of state and action.
Example: Mars Rover MDP
$$P(s'|s, a_1) = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}
\qquad
P(s'|s, a_2) = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}$$
2 deterministic actions
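One compact way to construct these two deterministic transition matrices (a sketch; indices 0–6 stand for s1–s7):

```python
import numpy as np

n = 7
# a1: move left deterministically; s1 stays at s1.
P_a1 = np.zeros((n, n))
P_a1[0, 0] = 1.0
for s in range(1, n):
    P_a1[s, s - 1] = 1.0

# a2: move right deterministically; s7 stays at s7.
P_a2 = np.zeros((n, n))
P_a2[n - 1, n - 1] = 1.0
for s in range(n - 1):
    P_a2[s, s + 1] = 1.0
```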
MDP Policies
Set i = 0
Initialize π0 (s) randomly for all states s
While i == 0 or ∥πi − πi−1 ∥1 > 0 (L1-norm, measures if the policy
changed for any state):
V πi ← MDP V function policy evaluation of πi
πi+1 ← Policy improvement
i = i + 1
$$Q^{\pi_i}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V^{\pi_i}(s')$$
Define πi+1(s) = arg max_a Q^{πi}(s, a). Then
$$\max_a Q^{\pi_i}(s, a) \geq R(s, \pi_i(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi_i(s))\, V^{\pi_i}(s') = V^{\pi_i}(s)$$
Suppose we take πi+1(s) for one step and then follow πi forever after.
Our expected sum of rewards is at least as good as if we had always followed πi.
But the new proposed policy is to always follow πi+1...
Definition
V π1 ≥ V π2 : V π1 (s) ≥ V π2 (s), ∀s ∈ S
Proposition: V πi+1 ≥ V πi with strict inequality if πi is suboptimal,
where πi+1 is the new policy we get from policy improvement on πi
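A compact policy iteration sketch for a tabular MDP (my own illustrative code; P with shape [A, S, S] and R with shape [S, A] are assumed conventions, not the course's reference implementation):

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Tabular policy iteration. P: transitions [A, S, S]; R: rewards [S, A]."""
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)              # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly.
        P_pi = P[pi, np.arange(S), :]        # [S, S], row s follows pi(s)
        R_pi = R[np.arange(S), pi]           # [S]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy improvement: greedy with respect to Q^{pi}.
        Q = R.T + gamma * P @ V              # [A, S]
        new_pi = Q.argmax(axis=0)
        if np.array_equal(new_pi, pi):       # policy unchanged => no further improvement
            return V, pi
        pi = new_pi
```

Because each improvement step is monotone and there are only finitely many deterministic policies, this loop terminates.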
Set k = 1
Initialize V0 (s) = 0 for all states s
Loop until convergence (e.g., ∥Vk+1 − Vk∥∞ ≤ ϵ):
For each state s
$$V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V_k(s') \right]$$
Vk+1 = BVk
" #
X
′ ′
πk+1 (s) = arg max R(s, a) + γ P(s |s, a)Vk (s )
a
s ′ ∈S
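A matching value iteration sketch under the same assumed shapes (P: [A, S, S], R: [S, A]):

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Tabular value iteration. P: [A, S, S], R: [S, A]."""
    V = np.zeros(P.shape[1])
    while True:
        # Bellman backup: V_{k+1}(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) V_k(s') ]
        Q = R.T + gamma * P @ V              # [A, S]
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new, Q.argmax(axis=0)   # value estimate + greedy policy extraction
        V = V_new
```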
To do policy improvement
" #
X
πk+1 (s) = arg max R(s, a) + γ P(s ′ |s, a)V πk (s ′ )
a
s ′ ∈S
Vk+1 = BVk
$$\begin{aligned}
\|BV_k - BV_j\| &= \left\| \max_a \Big[ R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)V_k(s') \Big] - \max_{a'} \Big[ R(s,a') + \gamma \sum_{s' \in S} P(s'|s,a')V_j(s') \Big] \right\| \\
&\leq \left\| \max_a \Big[ R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a)V_k(s') - R(s,a) - \gamma \sum_{s' \in S} P(s'|s,a)V_j(s') \Big] \right\| \\
&= \left\| \max_a \, \gamma \sum_{s' \in S} P(s'|s,a)\big(V_k(s') - V_j(s')\big) \right\| \\
&\leq \left\| \max_a \, \gamma \sum_{s' \in S} P(s'|s,a)\,\|V_k - V_j\| \right\| \\
&= \left\| \max_a \, \gamma \|V_k - V_j\| \sum_{s' \in S} P(s'|s,a) \right\| \\
&= \gamma \|V_k - V_j\|
\end{aligned}$$
Note: Even if all inequalities are equalities, this is still a contraction if γ < 1
Set k = 1
Initialize V0 (s) = 0 for all states s
Loop until k == H:
For each state s
$$V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V_k(s') \right]$$
$$\pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V_k(s') \right]$$
Value iteration:
Compute optimal value for horizon = k
Note this can be used to compute optimal policy if horizon = k
Increment k
Policy iteration
Compute infinite horizon value of a policy
Use to select another (better) policy
Closely related to a very popular method in RL: policy gradient
Can value iteration require more iterations than |A|^{|S|} to compute the
optimal value function? (Assume |A| and |S| are small enough that each
round of value iteration can be done exactly.)
Answer: True. As an example, consider a single-state, single-action MDP
where r(s, a) = 1, γ = 0.9, and initialize V0(s) = 0. Then V*(s) = 1/(1 − γ) = 10, but after
the first iteration of value iteration, V1(s) = 1.
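A quick numeric check of this example (a sketch of the one-state, one-action chain just described):

```python
gamma, V, k = 0.9, 0.0, 0
# Value iteration here reduces to V_{k+1} = 1 + gamma * V_k,
# which converges to V* = 1 / (1 - gamma) = 10.
while abs(1.0 / (1.0 - gamma) - V) > 1e-3:
    V = 1.0 + gamma * V
    k += 1
print(k, V)   # many more than |A|^|S| = 1 iterations are needed
```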
Last Time:
Markov reward / decision processes
Policy evaluation & control when we have a true model (of how the world works)
Today
Policy evaluation without known dynamics & reward models
Next Time:
Control when we don’t have a model of how the world works
¹ Assume today this experience comes from executing the policy π. Later we will consider how to do policy evaluation using data gathered from other policies.
This Lecture: Policy Evaluation
$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots + \gamma^{T_i - t} r_{T_i} \quad \text{in MDP } M \text{ under policy } \pi$$
$$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\,[G_t \mid s_t = s]$$
If trajectories are all finite, sample set of trajectories & average returns
Does not require MDP dynamics/rewards
Does not assume state is Markov
Can be applied to episodic MDPs
Averaging over returns from a complete episode
Requires each episode to terminate
Sample episode i = s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, ..., s_{i,T_i}, a_{i,T_i}, r_{i,T_i}
$$G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \cdots + \gamma^{T_i - 1} r_{i,T_i}$$
For each time step t until T_i (the end of episode i)
If this is the first time t that state s is visited in episode i (for first-visit MC)
Increment counter of total first visits: N(s) = N(s) + 1
Increment total return: G(s) = G(s) + G_{i,t}
Update estimate: V^π(s) = G(s)/N(s)
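A first-visit Monte Carlo sketch for the tabular case (illustrative; it assumes episodes are lists of (s, a, r) tuples collected by running π):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """First-visit MC policy evaluation; episodes: lists of (s, a, r) tuples from pi."""
    N = defaultdict(int)          # N(s): number of first visits
    G_total = defaultdict(float)  # G(s): accumulated first-visit returns
    V = defaultdict(float)
    for episode in episodes:
        # Compute the return G_{i,t} at every time step by scanning backwards.
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][2] + gamma * G
            returns[t] = G
        seen = set()
        for t, (s, a, r) in enumerate(episode):
            if s not in seen:              # first visit to s in this episode
                seen.add(s)
                N[s] += 1
                G_total[s] += returns[t]
                V[s] = G_total[s] / N[s]
    return V
```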
Sample episode i = s_{i,1}, a_{i,1}, r_{i,1}, s_{i,2}, a_{i,2}, r_{i,2}, ..., s_{i,T_i}, a_{i,T_i}, r_{i,T_i}
$$G_{i,t} = r_{i,t} + \gamma r_{i,t+1} + \gamma^2 r_{i,t+2} + \cdots + \gamma^{T_i - 1} r_{i,T_i}$$
For t = 1 : T_i, where T_i is the length of the i-th episode:
$$V^{\pi}(s_{i,t}) = V^{\pi}(s_{i,t}) + \alpha\big(G_{i,t} - V^{\pi}(s_{i,t})\big)$$
We will see many algorithms of this form with a learning rate, target, and
incremental update
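The incremental form as a self-contained sketch (again illustrative, with a constant learning rate α):

```python
def incremental_mc(V, episode, alpha, gamma):
    """Every-visit incremental MC update: V(s) <- V(s) + alpha * (G_t - V(s)).
    V: dict of state values (e.g., defaultdict(float)); episode: list of (s, a, r)."""
    G, returns = 0.0, [0.0] * len(episode)
    for t in reversed(range(len(episode))):   # returns G_{i,t}, computed backwards
        G = episode[t][2] + gamma * G
        returns[t] = G
    for t, (s, a, r) in enumerate(episode):
        V[s] = V[s] + alpha * (returns[t] - V[s])
    return V
```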
Consistency: with enough data, does the estimate converge to the true value
of the policy?
Computational complexity: as we get more data, the computational cost of
updating the estimate
Memory requirements
Statistical efficiency (intuitively, how does the accuracy of the estimate
change with the amount of data)
Empirical accuracy, often evaluated by mean squared error
Let n be the number of data points x used to estimate the parameter θ, and call the resulting estimate of θ using that data θ̂_n.
Then the estimator θ̂_n is consistent if, for all ε > 0,
$$\lim_{n \to \infty} \Pr\big(|\hat{\theta}_n - \theta| > \epsilon\big) = 0$$
Properties:
First-visit Monte Carlo
The V^π estimator is an unbiased estimator of the true E_π[G_t | s_t = s]
By the law of large numbers, as N(s) → ∞, V^π(s) → E_π[G_t | s_t = s]
Every-visit Monte Carlo
The V^π every-visit MC estimator is a biased estimator of V^π
But it is a consistent estimator and often has better MSE
Incremental Monte Carlo
Properties depend on the learning rate α
“If one had to identify one idea as central and novel to reinforcement
learning, it would undoubtedly be temporal-difference (TD) learning.” –
Sutton and Barto 2017
Combination of Monte Carlo & dynamic programming methods
Model-free
Can be used in episodic or infinite-horizon non-episodic settings
Immediately updates estimate of V after each (s, a, r, s′) tuple
TD(0) error:
$$\delta_t = r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t)$$
Input: α
Initialize V^π(s) = 0, ∀s ∈ S
Loop
Sample tuple (s_t, a_t, r_t, s_{t+1})
$$V^{\pi}(s_t) = V^{\pi}(s_t) + \alpha\big(\underbrace{[\,r_t + \gamma V^{\pi}(s_{t+1})\,]}_{\text{TD target}} - V^{\pi}(s_t)\big)$$
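A tabular TD(0) sketch matching this update (illustrative; the stream of (s, a, r, s′, done) tuples is assumed to come from running π):

```python
from collections import defaultdict

def td0(transitions, alpha, gamma):
    """Tabular TD(0) policy evaluation.
    transitions: iterable of (s, a, r, s_next, done) tuples gathered by running pi."""
    V = defaultdict(float)                              # V(s) initialized to 0
    for s, a, r, s_next, done in transitions:
        target = r if done else r + gamma * V[s_next]   # TD target
        V[s] = V[s] + alpha * (target - V[s])           # TD(0) update
    return V
```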
Example:
Mars rover: R = [1 0 0 0 0 0 +10] for any action
π(s) = a1 ∀s, γ = 1. Any action from s1 and s7 terminates the episode.
Trajectory = (s3, a1, 0, s2, a1, 0, s2, a1, 0, s1, a1, +1, terminal)
TD estimate of all states (init at 0) with α = 1, γ = 1 at end of this episode?
V = [1 0 0 0 0 0 0]
First-visit MC estimate of V of each state? [1 1 1 0 0 0 0]
Temporal Difference (TD) Policy Evaluation
$$V^{\pi}(s_t) = r(s_t, \pi(s_t)) + \gamma \sum_{s_{t+1}} P(s_{t+1}|s_t, \pi(s_t))\, V^{\pi}(s_{t+1})$$
Input: α
Initialize V^π(s) = 0, ∀s ∈ S
Loop
Sample tuple (s_t, a_t, r_t, s_{t+1})
$$V^{\pi}(s_t) = V^{\pi}(s_t) + \alpha\big(\underbrace{[\,r_t + \gamma V^{\pi}(s_{t+1})\,]}_{\text{TD target}} - V^{\pi}(s_t)\big)$$
$$\hat{r}(s, a) = \frac{1}{N(s, a)} \sum_{k=1}^{i} \mathbb{1}(s_k = s, a_k = a)\, r_k$$
Compute V^π using the MLE MDP (using any dynamic programming method from lecture 2)²
² Requires initializing for all (s, a) pairs
Certainty Equivalence V^π MLE MDP Model Estimates
Model-based option for policy evaluation without true models
After each (s, a, r, s′) tuple
Recompute maximum likelihood MDP model for (s, a)
$$\hat{P}(s'|s, a) = \frac{1}{N(s, a)} \sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \mathbb{1}(s_{k,t} = s, a_{k,t} = a, s_{k,t+1} = s')$$
$$\hat{r}(s, a) = \frac{1}{N(s, a)} \sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \mathbb{1}(s_{k,t} = s, a_{k,t} = a)\, r_{t,k}$$
Compute V^π using the MLE MDP
Cost: updating the MLE model and MDP planning at each update (O(|S|³) for the
analytic matrix solution, O(|S|²|A|) for iterative methods)
Very data efficient and very computationally expensive
Consistent (will converge to the right estimate for Markov models)
Can also easily be used for off-policy evaluation (which we will shortly define
and discuss)
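A sketch of the certainty-equivalence estimates from logged data (illustrative; episodes are lists of (s, a, r, s′) tuples with small integer state/action indices):

```python
import numpy as np

def mle_mdp(episodes, num_states, num_actions):
    """Maximum likelihood transition and reward estimates from logged episodes."""
    counts = np.zeros((num_states, num_actions, num_states))
    reward_sums = np.zeros((num_states, num_actions))
    for episode in episodes:
        for (s, a, r, s_next) in episode:
            counts[s, a, s_next] += 1
            reward_sums[s, a] += r
    N = counts.sum(axis=2)                    # N(s, a)
    N_safe = np.maximum(N, 1)                 # avoid division by zero for unvisited (s, a)
    P_hat = counts / N_safe[:, :, None]       # P_hat(s'|s, a)
    r_hat = reward_sums / N_safe              # r_hat(s, a)
    return P_hat, r_hat
```

V^π can then be computed on (P_hat, r_hat) with any of the dynamic programming methods from lecture 2.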
This Lecture: Policy Evaluation
Monte Carlo in batch setting converges to min MSE (mean squared error)
Minimize loss with respect to observed returns
In the AB example, V(A) = 0
$$\hat{r}(s, a) = \frac{1}{N(s, a)} \sum_{k=1}^{i} \mathbb{1}(s_k = s, a_k = a)\, r_k$$
Initialize policy π
Repeat:
Policy evaluation: compute Q π
Policy improvement: update π given Q π
May need to modify policy evaluation:
If π is deterministic, can’t compute Q(s, a) for any a ̸= π(s)
How to interleave policy evaluation and improvement?
Policy improvement is now using an estimated Q
Computational complexity?
Converge to optimal Q ∗ function?
Empirical performance?
Avoid explicitly storing or learning the following for every single state
and action
Dynamics or reward model
Value
State-action value
Policy
Want a more compact representation that generalizes across states, or across
states and actions
Reduce memory needed to store (P, R)/V /Q/π
Reduce computation needed to compute (P, R)/V /Q/π
Reduce experience needed to find a good (P, R)/V /Q/π
First assume we could query any state s and action a and an oracle
would return the true value for Q π (s, a)
Similar to supervised learning: assume given ((s, a), Q π (s, a)) pairs
The objective is to find the best approximate representation of Q π
given a particular parameterized function Q̂(s, a; w )
1: Initialize w, k = 1
2: loop
3: Sample k-th episode (sk,1 , ak,1 , rk,1 , sk,2 , . . . , sk,Lk ) given π
4: for t = 1, . . . , Lk do
5: if first visit to (s, a) in episode k then
6:   G_t(s, a) = Σ_{j=t}^{L_k} r_{k,j}
7:   ∇_w J(w) = −2[G_t(s, a) − Q̂(s_t, a_t; w)] ∇_w Q̂(s_t, a_t; w) (compute gradient)
8: Update weights ∆w
9: end if
10: end for
11: k =k +1
12: end loop
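A runnable linear function approximation version of this loop (my own sketch; feature_fn(s, a) and Q̂(s, a; w) = w·x(s, a) are assumptions, and the factor of 2 in the gradient is absorbed into α):

```python
import numpy as np

def mc_vfa(episodes, feature_fn, num_features, alpha, gamma=1.0):
    """First-visit Monte Carlo with a linear Q approximation: Q_hat(s, a; w) = w . x(s, a).
    episodes: list of episodes, each a list of (s, a, r) tuples collected with pi."""
    w = np.zeros(num_features)
    for episode in episodes:
        # Returns G_t for every step, computed backwards.
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][2] + gamma * G
            returns[t] = G
        seen = set()
        for t, (s, a, r) in enumerate(episode):
            if (s, a) in seen:
                continue                       # first-visit only
            seen.add((s, a))
            x = feature_fn(s, a)
            q_hat = w @ x
            # SGD step on [G_t - Q_hat(s, a; w)]^2:  w <- w + alpha * (G_t - q_hat) * x
            w = w + alpha * (returns[t] - q_hat) * x
    return w
```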
1: Initialize w, s
2: loop
3: Given s, sample a ∼ π(s), r(s, a), s′ ∼ p(s′|s, a)
4: ∇w J(w ) = −2[r + γ V̂ (s ′ ; w ) − V̂ (s; w )]∇w V̂ (s; w )
5: Update weights ∆w
6: if s ′ is not a terminal state then
7: Set s = s ′
8: else
9: Restart episode, sample initial state s
10: end if
11: end loop
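And the corresponding linear TD(0) sketch for V̂(s; w) = w·x(s) (again illustrative; feature_fn and the transition stream are assumptions, with the factor of 2 absorbed into α):

```python
import numpy as np

def td0_vfa(transitions, feature_fn, num_features, alpha, gamma):
    """Semi-gradient TD(0) with a linear value approximation V_hat(s; w) = w . x(s).
    transitions: iterable of (s, r, s_next, done) tuples from running pi."""
    w = np.zeros(num_features)
    for s, r, s_next, done in transitions:
        x = feature_fn(s)
        v = w @ x
        v_next = 0.0 if done else w @ feature_fn(s_next)
        # w <- w + alpha * (r + gamma * V_hat(s') - V_hat(s)) * grad_w V_hat(s)
        w = w + alpha * (r + gamma * v_next - v) * x
    return w
```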