Mod2 Slides

The document discusses the Markov Assumption, which states that the future state is independent of the past given the present state. It highlights the importance of state representation in reinforcement learning, particularly in Markov Decision Processes (MDPs), and provides examples such as the Mars Rover. Additionally, it covers concepts like policies, evaluation, control, and the value function in the context of decision-making processes.

Markov Assumption

Information state: sufficient statistic of history


State st is Markov if and only if:

p(st+1 | st, at) = p(st+1 | ht, at)

Future is independent of past given present



Why is Markov Assumption Popular?

Simple and often can be satisfied if we include some history as part of the state
In practice often assume most recent observation is sufficient statistic
of history: st = ot
State representation has big implications for:
Computational complexity
Data required
Resulting performance



Types of Sequential Decision Processes

Is state Markov? Is world partially observable? (POMDP)


Are dynamics deterministic or stochastic?
Do actions influence only the immediate reward (bandits) or both the reward and the next state?



Example: Mars Rover as a Markov Decision Process

Figure: Mars rover with seven locations s1, ..., s7. Image: NASA/JPL-Caltech

States: Location of rover (s1 , . . . , s7 )


Actions: TryLeft or TryRight
Rewards:
+1 in state s1
+10 in state s7
0 in all other states



MDP Model

Agent’s representation of how world changes given agent’s action


Transition / dynamics model predicts the next agent state

p(st+1 = s′ | st = s, at = a)

Reward model predicts immediate reward

r (st = s, at = a) = E[rt |st = s, at = a]



Example: Mars Rover Stochastic Markov Model

Figure: rover states s1, ..., s7, each annotated with the agent's reward model estimate r̂ = 0

The numbers above show the RL agent's reward model


Part of agent’s transition model:
0.5 = P(s1 | s1, TryRight) = P(s2 | s1, TryRight)
0.5 = P(s2 | s2, TryRight) = P(s3 | s2, TryRight), ...
Model may be wrong



Policy

Policy π determines how the agent chooses actions

π : S → A, mapping from states to actions
Deterministic policy:
π(s) = a
Stochastic policy:
π(a|s) = Pr(at = a | st = s)



Example: Mars Rover Policy

Figure: rover states s1, ..., s7

π(s1) = π(s2) = · · · = π(s7) = TryRight


Quick check your understanding: is this a deterministic policy or a
stochastic policy?



Evaluation and Control

Evaluation
Estimate/predict the expected rewards from following a given policy
Control
Optimization: find the best policy



Build Up in Complexity



Making Sequences of Good Decisions Given a Model of the
World

Assume finite set of states and actions


Given models of the world (dynamics and reward)
Evaluate the performance of a particular decision policy
Compute the best policy
This can be viewed as an AI planning problem



Making Sequences of Good Decisions Given a Model of the
World

Markov Processes
Markov Reward Processes (MRPs)
Markov Decision Processes (MDPs)
Evaluation and Control in MDPs



Markov Process or Markov Chain

Memoryless random process


Sequence of random states with Markov property
Definition of Markov Process
S is a (finite) set of states (s ∈ S)
P is the dynamics/transition model that specifies p(st+1 = s′ | st = s)
Note: no rewards, no actions
If finite number (N) of states, can express P as a matrix
        [ P(s1|s1)  P(s2|s1)  · · ·  P(sN|s1) ]
        [ P(s1|s2)  P(s2|s2)  · · ·  P(sN|s2) ]
    P = [    ...       ...     ...      ...   ]
        [ P(s1|sN)  P(s2|sN)  · · ·  P(sN|sN) ]



Example: Mars Rover Markov Chain Transition Matrix, P

Figure: Mars rover Markov chain. Each interior state stays put with probability 0.2 and moves to each neighbor with probability 0.4; the end states s1 and s7 self-loop with probability 0.6.

    [ 0.6  0.4  0    0    0    0    0   ]
    [ 0.4  0.2  0.4  0    0    0    0   ]
    [ 0    0.4  0.2  0.4  0    0    0   ]
P = [ 0    0    0.4  0.2  0.4  0    0   ]
    [ 0    0    0    0.4  0.2  0.4  0   ]
    [ 0    0    0    0    0.4  0.2  0.4 ]
    [ 0    0    0    0    0    0.4  0.6 ]

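To make the transition matrix concrete, here is a minimal sketch (assuming NumPy; the 0-indexed state encoding and the sampling helper are my own, not from the slides) that encodes P and samples episodes like the ones shown on the next slide.

```python
import numpy as np

# Mars rover Markov chain transition matrix from the slide (row = current state s1..s7).
P = np.array([
    [0.6, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.2, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.2, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.2, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.2, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.2, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.6],
])

def sample_episode(start, length, rng):
    """Sample a state sequence of the given length, starting from `start` (0-indexed)."""
    states = [start]
    for _ in range(length - 1):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

rng = np.random.default_rng(0)
print([f"s{s + 1}" for s in sample_episode(start=3, length=6, rng=rng)])  # an episode from s4
```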


Example: Mars Rover Markov Chain Episodes

Figure: Mars rover Markov chain (same transition probabilities as the matrix above)

Example: Sample episodes starting from S4


s4, s5, s6, s7, s7, s7, ...
s4, s4, s5, s4, s5, s6, ...
s4, s3, s2, s1, ...



Markov Reward Process (MRP)

Markov Reward Process is a Markov Chain + rewards


Definition of Markov Reward Process (MRP)
S is a (finite) set of states (s ∈ S)
P is the dynamics/transition model that specifies P(st+1 = s′ | st = s)
R is a reward function R(st = s) = E[rt | st = s]
Discount factor γ ∈ [0, 1]
Note: no actions
If finite number (N) of states, can express R as a vector



Example: Mars Rover Markov Reward Process

Figure: Mars rover Markov chain (same transition probabilities as the matrix above)

Reward: +1 in s1 , +10 in s7 , 0 in all other states



Return & Value Function

Definition of Horizon (H)


Number of time steps in each episode
Can be infinite
Otherwise called finite Markov reward process
Definition of Return, Gt (for a Markov Reward Process)
Discounted sum of rewards from time step t to horizon H

Gt = rt + γ rt+1 + γ^2 rt+2 + · · · + γ^(H−1) rt+H−1

Definition of State Value Function, V(s) (for a Markov Reward Process)
Expected return from starting in state s

V(s) = E[Gt | st = s] = E[rt + γ rt+1 + γ^2 rt+2 + · · · + γ^(H−1) rt+H−1 | st = s]



Discount Factor

Mathematically convenient (avoid infinite returns and values)


Humans often act as if there's a discount factor γ < 1
If episode lengths are always finite (H < ∞), can use γ = 1



Discount Factor

Mathematically convenient (avoid infinite returns and values)


Humans often act as if there's a discount factor γ < 1
γ = 0: Only care about immediate reward
γ = 1: Future reward is as beneficial as immediate reward
If episode lengths are always finite (H < ∞), can use γ = 1



Lecture 2: Making Sequences of Good Decisions Given
a Model of the World

Emma Brunskill

CS234 Reinforcement Learning

L2N1 Quick Check Your Understanding 1. Participation
Poll

In a Markov decision process, a large discount factor γ means that short-term rewards are much more influential than long-term rewards. [Enter your answer in the participation poll]
True
False
Don’t know
Question for today’s lecture (not for poll): Can we construct algorithms
for computing decision policies so that we can guarantee with additional
computation / iterations, we monotonically improve the decision policy?
Do all algorithms satisfy this property?

L2N1 Quick Check Your Understanding 1. Participation
Poll
In a Markov decision process, a large discount factor γ means that short-term rewards are much more influential than long-term rewards. [Enter your answer in the poll]
True
False
Don’t know
False. A large γ implies we weigh delayed / long term rewards more.
γ = 0 only values immediate rewards

Question for today’s lecture (not for poll): Can we construct algorithms
for computing decision policies so that we can guarantee with additional
computation / iterations, we monotonically improve the decision policy?
Do all algorithms satisfy this property?
Yes it is possible! We will see this today. Not all of them do.

Today’s Plan

Last Time:
Introduction
Components of an agent: model, value, policy
This Time:
Making good decisions given a Markov decision process
Next Time:
Policy evaluation when don’t have a model of how the world works

Today: Given a model of the world

Markov Processes (last time)


Markov Reward Processes (MRPs) (continue from last time)
Markov Decision Processes (MDPs)
Evaluation and Control in MDPs

Return & Value Function

Definition of Horizon (H)


Number of time steps in each episode
Can be infinite
Otherwise called finite Markov reward process
Definition of Return, Gt (for a MRP)
Discounted sum of rewards from time step t to horizon H

Gt = rt + γ rt+1 + γ^2 rt+2 + · · · + γ^(H−1) rt+H−1

Definition of State Value Function, V (s) (for a MRP)


Expected return from starting in state s

V(s) = E[Gt | st = s] = E[rt + γ rt+1 + γ^2 rt+2 + · · · + γ^(H−1) rt+H−1 | st = s]

Computing the Value of an Infinite Horizon Markov
Reward Process

Markov property provides structure


MRP value function satisfies

V(s) = R(s) + γ Σ_{s′ ∈ S} P(s′ | s) V(s′)

i.e., the immediate reward plus the discounted sum of future rewards

Matrix Form of Bellman Equation for MRP

For a finite state MRP, we can express V(s) using a matrix equation

[V(s1)]   [R(s1)]       [P(s1|s1)  · · ·  P(sN|s1)] [V(s1)]
[ ...  ] = [ ...  ] + γ [   ...     ...     ...   ] [ ...  ]
[V(sN)]   [R(sN)]       [P(s1|sN)  · · ·  P(sN|sN)] [V(sN)]

V = R + γPV

Analytic Solution for Value of MRP

For a finite state MRP, we can express V(s) using a matrix equation

[V(s1)]   [R(s1)]       [P(s1|s1)  · · ·  P(sN|s1)] [V(s1)]
[ ...  ] = [ ...  ] + γ [   ...     ...     ...   ] [ ...  ]
[V(sN)]   [R(sN)]       [P(s1|sN)  · · ·  P(sN|sN)] [V(sN)]

V = R + γPV
V − γPV = R
(I − γP)V = R
V = (I − γP)^(−1) R

Solving directly requires taking a matrix inverse, ∼ O(N^3)


Requires that (I − γP) is invertible
Iterative Algorithm for Computing Value of a MRP

Dynamic programming
Initialize V0 (s) = 0 for all s
For k = 1 until convergence
For all s in S
Vk(s) = R(s) + γ Σ_{s′ ∈ S} P(s′ | s) Vk−1(s′)

Computational complexity: O(|S|^2) for each iteration (|S| = N); a code sketch follows below

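As a concrete illustration of both the analytic solution and the iterative algorithm, here is a minimal sketch (assuming NumPy; the function names and the choice γ = 0.5 are mine, not from the slides), using the Mars rover MRP:

```python
import numpy as np

def mrp_value_analytic(P, R, gamma):
    """Solve V = R + gamma * P V exactly via (I - gamma P)^(-1) R, ~O(N^3)."""
    return np.linalg.solve(np.eye(len(R)) - gamma * P, R)

def mrp_value_iterative(P, R, gamma, tol=1e-8):
    """Dynamic programming: V_k = R + gamma * P V_{k-1}, O(N^2) per sweep."""
    V = np.zeros(len(R))
    while True:
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Mars rover MRP: chain transition matrix from the earlier slide, reward +1 in s1, +10 in s7.
P = np.array([
    [0.6, 0.4, 0, 0, 0, 0, 0], [0.4, 0.2, 0.4, 0, 0, 0, 0],
    [0, 0.4, 0.2, 0.4, 0, 0, 0], [0, 0, 0.4, 0.2, 0.4, 0, 0],
    [0, 0, 0, 0.4, 0.2, 0.4, 0], [0, 0, 0, 0, 0.4, 0.2, 0.4],
    [0, 0, 0, 0, 0, 0.4, 0.6],
])
R = np.array([1.0, 0, 0, 0, 0, 0, 10.0])
print(mrp_value_analytic(P, R, gamma=0.5))   # both methods agree
print(mrp_value_iterative(P, R, gamma=0.5))
```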
Markov Decision Process (MDP)

Markov Decision Process is Markov Reward Process + actions


Definition of MDP
S is a (finite) set of Markov states s ∈ S
A is a (finite) set of actions a ∈ A
P is dynamics/transition model for each action, that specifies
P(st+1 = s ′ |st = s, at = a)
R is a reward function¹

R(st = s, at = a) = E[rt |st = s, at = a]

Discount factor γ ∈ [0, 1]


MDP is a tuple: (S, A, P, R, γ)

¹ Reward is sometimes defined as a function of the current state, or as a function of the (state, action, next state) tuple. Most frequently in this class, we will assume reward is a function of state and action.
Example: Mars Rover MDP

Figure: rover states s1, ..., s7

              [ 1 0 0 0 0 0 0 ]                 [ 0 1 0 0 0 0 0 ]
              [ 1 0 0 0 0 0 0 ]                 [ 0 0 1 0 0 0 0 ]
              [ 0 1 0 0 0 0 0 ]                 [ 0 0 0 1 0 0 0 ]
P(s′|s, a1) = [ 0 0 1 0 0 0 0 ]   P(s′|s, a2) = [ 0 0 0 0 1 0 0 ]
              [ 0 0 0 1 0 0 0 ]                 [ 0 0 0 0 0 1 0 ]
              [ 0 0 0 0 1 0 0 ]                 [ 0 0 0 0 0 0 1 ]
              [ 0 0 0 0 0 1 0 ]                 [ 0 0 0 0 0 0 1 ]

2 deterministic actions
MDP Policies

Policy specifies what action to take in each state


Can be deterministic or stochastic
For generality, consider as a conditional distribution
Given a state, specifies a distribution over actions
Policy: π(a|s) = P(at = a|st = s)

MDP + Policy

MDP + π(a|s) = Markov Reward Process


Precisely, it is the MRP (S, R^π, P^π, γ), where

R^π(s) = Σ_{a ∈ A} π(a|s) R(s, a)

P^π(s′|s) = Σ_{a ∈ A} π(a|s) P(s′|s, a)

Implies we can use the same techniques to evaluate the value of a policy for an MDP as we could to compute the value of an MRP, by defining the MRP with R^π and P^π

MDP Policy Evaluation, Iterative Algorithm

Initialize V0 (s) = 0 for all s


For k = 1 until convergence
For all s in S

V_k^π(s) = Σ_a π(a|s) [ R(s, a) + γ Σ_{s′ ∈ S} p(s′|s, a) V_{k−1}^π(s′) ]

This is a Bellman backup for a particular policy

Note that if the policy is deterministic then the above update simplifies to

V_k^π(s) = R(s, π(s)) + γ Σ_{s′ ∈ S} p(s′|s, π(s)) V_{k−1}^π(s′)

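A minimal sketch of this iterative policy evaluation loop (assuming NumPy; the tiny example MDP and all names are illustrative, not from the slides):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterative Bellman backups for a fixed (possibly stochastic) policy.

    P[a] is an (n_states x n_states) dynamics matrix for action a,
    R[s, a] is the reward model, and pi[s, a] is the policy (rows sum to 1).
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # V_k(s) = sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) V_{k-1}(s') ]
        Q = R + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        V_new = np.sum(pi * Q, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Tiny 2-state, 2-action MDP (numbers are purely illustrative).
P = [np.array([[1.0, 0.0], [0.5, 0.5]]),   # dynamics under action 0
     np.array([[0.0, 1.0], [0.0, 1.0]])]   # dynamics under action 1
R = np.array([[0.0, 1.0], [2.0, 0.0]])
pi = np.array([[0.5, 0.5], [1.0, 0.0]])    # stochastic in state 0, deterministic in state 1
print(policy_evaluation(P, R, pi, gamma=0.9))
```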
Exercise L2E1: MDP 1 Iteration of Policy Evaluation, Mars
Rover Example

Dynamics: p(s6 |s6 , a1 ) = 0.5, p(s7 |s6 , a1 ) = 0.5, . . .


Reward: for all actions, +1 in state s1 , +10 in state s7 , 0 otherwise
Let π(s) = a1 ∀s, assume Vk =[1 0 0 0 0 0 10] and k = 1, γ = 0.5
Compute Vk+1 (s6 )
See answer at the end of the slide deck. If you’d like practice, work this
out and then check your answers.

Check Your Understanding Poll L2N2

Figure: rover states s1, ..., s7

We will shortly be interested in not just evaluating the value of a single policy, but finding an optimal policy. Given this, it is informative to think about properties of the potential policy space.
First for the Mars rover example [ 7 discrete states (location of
rover); 2 actions: Left or Right]
How many deterministic policies are there?
Select answer on the participation poll: 2 / 14 / 7^2 / 2^7 / Not sure
Is the optimal policy (one with highest value) for a MDP unique?
Select answer on the participation poll: Yes / No / Not sure
Check Your Understanding L2N2

Figure: rover states s1, ..., s7

7 discrete states (location of rover)


2 actions: Left or Right
How many deterministic policies are there?
2^7 = 128

Is the highest reward policy for a MDP always unique?


No, there may be two policies with the same (maximal) value
function.

MDP Control

Compute the optimal policy

π*(s) = arg max_π V^π(s)

There exists a unique optimal value function


Optimal policy for a MDP in an infinite horizon problem is
deterministic

MDP Control

Compute the optimal policy

π*(s) = arg max_π V^π(s)

There exists a unique optimal value function


Optimal policy for an MDP in an infinite horizon problem (agent acts forever) is:
Deterministic
Stationary (does not depend on time step)
Unique? Not necessarily, may have two policies with identical (optimal) values

Policy Search

One option is searching to compute best policy


Number of deterministic policies is |A|^|S|
Policy iteration is generally more efficient than enumeration

MDP Policy Iteration (PI)

Set i = 0
Initialize π0 (s) randomly for all states s
While i == 0 or ∥πi − πi−1 ∥1 > 0 (L1-norm, measures if the policy
changed for any state):
V πi ← MDP V function policy evaluation of πi
πi+1 ← Policy improvement
i =i +1

New Definition: State-Action Value Q

State-action value of a policy


Q^π(s, a) = R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) V^π(s′)

Take action a, then follow the policy π

Policy Improvement

Compute state-action value of a policy πi


For s in S and a in A:
Q^{πi}(s, a) = R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) V^{πi}(s′)

Compute new policy πi+1, for all s ∈ S

πi+1(s) = arg max_a Q^{πi}(s, a)   ∀s ∈ S

MDP Policy Iteration (PI)

Set i = 0
Initialize π0 (s) randomly for all states s
While i == 0 or ∥πi − πi−1 ∥1 > 0 (L1-norm, measures if the policy
changed for any state):
V πi ← MDP V function policy evaluation of πi
πi+1 ← Policy improvement
i =i +1

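Putting policy evaluation and policy improvement together, a minimal policy-iteration sketch (assuming NumPy; the function names and the γ = 0.5 used for the Mars rover example are my own choices, not from the slides):

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation and greedy policy improvement.

    P : (n_actions, n_states, n_states) dynamics, R : (n_states, n_actions) rewards.
    Returns the final deterministic policy (action index per state) and its value.
    """
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)              # arbitrary initial policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V for the current policy.
        R_pi = R[np.arange(n_states), pi]
        P_pi = P[pi, np.arange(n_states)]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: greedy with respect to Q^{pi_i}.
        Q = R + gamma * np.einsum("ask,k->sa", P, V)
        pi_new = np.argmax(Q, axis=1)
        if np.array_equal(pi_new, pi):              # policy unchanged => cannot change again
            return pi, V
        pi = pi_new

# Mars rover MDP with deterministic actions: a1 = TryLeft, a2 = TryRight.
P_left = np.eye(7, k=-1); P_left[0, 0] = 1.0
P_right = np.eye(7, k=1); P_right[6, 6] = 1.0
P = np.stack([P_left, P_right])
R = np.tile(np.array([1.0, 0, 0, 0, 0, 0, 10.0])[:, None], (1, 2))  # reward depends only on state
print(policy_iteration(P, R, gamma=0.5))
```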
Delving Deeper Into Policy Improvement Step

Q^{πi}(s, a) = R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) V^{πi}(s′)

Delving Deeper Into Policy Improvement Step

Q^{πi}(s, a) = R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) V^{πi}(s′)

max_a Q^{πi}(s, a) ≥ R(s, πi(s)) + γ Σ_{s′ ∈ S} P(s′|s, πi(s)) V^{πi}(s′) = V^{πi}(s)

πi+1(s) = arg max_a Q^{πi}(s, a)

Suppose we take πi+1 (s) for one action, then follow πi forever
Our expected sum of rewards is at least as good as if we had always
followed πi
But new proposed policy is to always follow πi+1 ...

Monotonic Improvement in Policy

Definition
V π1 ≥ V π2 : V π1 (s) ≥ V π2 (s), ∀s ∈ S
Proposition: V πi+1 ≥ V πi with strict inequality if πi is suboptimal,
where πi+1 is the new policy we get from policy improvement on πi

Proof: Monotonic Improvement in Policy

V^{πi}(s) ≤ max_a Q^{πi}(s, a)
         = max_a [ R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) V^{πi}(s′) ]

Proof: Monotonic Improvement in Policy

V^{πi}(s) ≤ max_a Q^{πi}(s, a)
         = max_a [ R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) V^{πi}(s′) ]
         = R(s, πi+1(s)) + γ Σ_{s′ ∈ S} P(s′|s, πi+1(s)) V^{πi}(s′)      // by the definition of πi+1
         ≤ R(s, πi+1(s)) + γ Σ_{s′ ∈ S} P(s′|s, πi+1(s)) [ max_{a′} Q^{πi}(s′, a′) ]
         = R(s, πi+1(s)) + γ Σ_{s′ ∈ S} P(s′|s, πi+1(s)) [ R(s′, πi+1(s′)) + γ Σ_{s′′ ∈ S} P(s′′|s′, πi+1(s′)) V^{πi}(s′′) ]
         ...
         = V^{πi+1}(s)

Check Your Understanding L2N3: Policy Iteration (PI)

Note: all the below is for finite state-action spaces


Set i = 0
Initialize π0 (s) randomly for all states s
While i == 0 or ∥πi − πi−1 ∥1 > 0 (L1-norm, measures if the policy
changed for any state):
V πi ← MDP V function policy evaluation of πi
πi+1 ← Policy improvement
i =i +1
If policy doesn’t change, can it ever change again?
Select on participation poll: Yes / No / Not sure
Is there a maximum number of iterations of policy iteration?
Select on participation poll: Yes / No / Not sure

Results for Check Your Understanding L2N3 Policy
Iteration
Note: all the below is for finite state-action spaces
Set i = 0
Initialize π0 (s) randomly for all states s
While i == 0 or ∥πi − πi−1 ∥1 > 0 (L1-norm, measures if the policy
changed for any state):
V πi ← MDP V function policy evaluation of πi
πi+1 ← Policy improvement
i =i +1
If policy doesn’t change, can it ever change again?
No

Is there a maximum number of iterations of policy iteration?


|A|^|S|, since that is the maximum number of policies, and as the policy improvement step is monotonically improving, each policy can only appear in one round of policy iteration unless it is an optimal policy.

Check Your Understanding Explanation of Policy Not
Changing

Suppose for all s ∈ S, πi+1(s) = πi(s)

Then for all s ∈ S, Q^{πi+1}(s, a) = Q^{πi}(s, a)
Recall the policy improvement step:

Q^{πi}(s, a) = R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) V^{πi}(s′)

πi+1(s) = arg max_a Q^{πi}(s, a)

πi+2(s) = arg max_a Q^{πi+1}(s, a) = arg max_a Q^{πi}(s, a)

Therefore policy cannot ever change again

MDP: Computing Optimal Policy and Optimal Value

Policy iteration computes the infinite horizon value of a policy and then improves that policy
Value iteration is another technique
Idea: Maintain the optimal value of starting in a state s if we have a finite number of steps k left in the episode
Iterate to consider longer and longer episodes

Bellman Equation and Bellman Backup Operators

Value function of a policy must satisfy the Bellman equation


V^π(s) = R^π(s) + γ Σ_{s′ ∈ S} P^π(s′|s) V^π(s′)

Bellman backup operator

Applied to a value function
Returns a new value function
Improves the value if possible

BV(s) = max_a [ R(s, a) + γ Σ_{s′ ∈ S} p(s′|s, a) V(s′) ]

BV yields a value function over all states s

Value Iteration (VI)

Set k = 1
Initialize V0 (s) = 0 for all states s
Loop until convergence: (for ex. ||Vk+1 − Vk ||∞ ≤ ϵ)
For each state s

Vk+1(s) = max_a [ R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) Vk(s′) ]

View as a Bellman backup on the value function

Vk+1 = BVk

πk+1(s) = arg max_a [ R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) Vk(s′) ]

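A minimal value-iteration sketch (assuming NumPy; the function names, the stopping rule, and the array layout are my own choices):

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-6):
    """Value iteration via repeated Bellman backups V_{k+1} = B V_k.

    P : (n_actions, n_states, n_states) dynamics, R : (n_states, n_actions) rewards.
    Stops when ||V_{k+1} - V_k||_inf <= eps, then extracts a greedy policy.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * np.einsum("ask,k->sa", P, V)   # one-step lookahead Q(s, a)
        V_new = Q.max(axis=1)                          # Bellman backup: max over actions
        if np.max(np.abs(V_new - V)) <= eps:
            return V_new, Q.argmax(axis=1)             # value estimate and greedy policy
        V = V_new
```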
Policy Iteration as Bellman Operations

Bellman backup operator B π for a particular policy is defined as


B^π V(s) = R^π(s) + γ Σ_{s′ ∈ S} P^π(s′|s) V(s′)

Policy evaluation amounts to computing the fixed point of B^π

To do policy evaluation, repeatedly apply the operator until V stops changing:
V^π = B^π B^π · · · B^π V

Policy Iteration as Bellman Operations

Bellman backup operator B π for a particular policy is defined as


B^π V(s) = R^π(s) + γ Σ_{s′ ∈ S} P^π(s′|s) V(s′)

To do policy improvement

πk+1(s) = arg max_a [ R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) V^{πk}(s′) ]

Going Back to Value Iteration (VI)
Set k = 1
Initialize V0 (s) = 0 for all states s
Loop until convergence: (for ex. ||Vk+1 − Vk ||∞ ≤ ϵ)
For each state s

Vk+1(s) = max_a [ R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) Vk(s′) ]

Equivalently, in Bellman backup notation

Vk+1 = BVk

To extract the optimal policy if we can act for k + 1 more steps,

π(s) = arg max_a [ R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) Vk+1(s′) ]

Contraction Operator

Let O be an operator, and |x| denote (any) norm of x


If |OV − OV ′ | ≤ |V − V ′ |, then O is a contraction operator

Will Value Iteration Converge?

Yes, if the discount factor γ < 1, or if we end up in a terminal state with probability 1
Bellman backup is a contraction if the discount factor γ < 1
If we apply it to two different value functions, the distance between the value functions shrinks after applying the Bellman backup to each

Proof: Bellman Backup is a Contraction on V for γ < 1

Let ∥V − V′∥ = max_s |V(s) − V′(s)| be the infinity norm

∥BVk − BVj∥ = ∥ max_a [ R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) Vk(s′) ] − max_{a′} [ R(s, a′) + γ Σ_{s′ ∈ S} P(s′|s, a′) Vj(s′) ] ∥

Proof: Bellman Backup is a Contraction on V for γ < 1

Let ∥V − V′∥ = max_s |V(s) − V′(s)| be the infinity norm

∥BVk − BVj∥ = ∥ max_a [ R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) Vk(s′) ] − max_{a′} [ R(s, a′) + γ Σ_{s′ ∈ S} P(s′|s, a′) Vj(s′) ] ∥
            ≤ ∥ max_a [ R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) Vk(s′) − R(s, a) − γ Σ_{s′ ∈ S} P(s′|s, a) Vj(s′) ] ∥
            = ∥ max_a [ γ Σ_{s′ ∈ S} P(s′|s, a) (Vk(s′) − Vj(s′)) ] ∥
            ≤ ∥ max_a [ γ Σ_{s′ ∈ S} P(s′|s, a) ∥Vk − Vj∥ ] ∥
            = ∥ max_a [ γ ∥Vk − Vj∥ Σ_{s′ ∈ S} P(s′|s, a) ] ∥
            = γ ∥Vk − Vj∥

Note: Even if all inequalities are equalities, this is still a contraction if γ < 1

Opportunities for Out-of-Class Practice

Prove value iteration converges to a unique solution for discrete state and action spaces with γ < 1
Does the initialization of values in value iteration impact anything?
Is the value of the policy extracted from value iteration at each round guaranteed to monotonically improve (if executed in the real infinite horizon problem), like policy iteration?

Value Iteration for Finite Horizon H

Vk = optimal value if making k more decisions


πk = optimal policy if making k more decisions
Initialize V0 (s) = 0 for all states s
For k = 1 : H
For each state s

Vk+1(s) = max_a [ R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) Vk(s′) ]

πk+1(s) = arg max_a [ R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) Vk(s′) ]

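A minimal sketch of this finite-horizon loop (assuming NumPy; the names are my own), which also records the per-step greedy policy — useful for the question a few slides below about whether the optimal policy is stationary:

```python
import numpy as np

def finite_horizon_value_iteration(P, R, gamma, H):
    """Optimal values and (possibly non-stationary) policies for k = 1..H decisions left.

    P : (n_actions, n_states, n_states) dynamics, R : (n_states, n_actions) rewards.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    values, policies = [], []
    for _ in range(H):
        Q = R + gamma * np.einsum("ask,k->sa", P, V)
        V = Q.max(axis=1)
        values.append(V)
        policies.append(Q.argmax(axis=1))   # policy to use when this many steps remain
    return values, policies
```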
Computing the Value of a Policy in a Finite Horizon

Alternatively can estimate by simulation


Generate a large number of episodes
Average returns
Concentration inequalities bound how quickly average concentrates to
expected value
Requires no assumption of Markov structure

Example: Mars Rover

Figure: Mars rover Markov chain (same transition probabilities as before)

Reward: +1 in s1 , +10 in s7 , 0 in all other states


Sample returns for sample 4-step (H = 4) episodes, γ = 1/2
s4, s5, s6, s7: 0 + (1/2)×0 + (1/4)×0 + (1/8)×10 = 1.25

Example: Mars Rover

Figure: Mars rover Markov chain (same transition probabilities as before)

Reward: +1 in s1 , +10 in s7 , 0 in all other states


Sample returns for sample 4-step (H = 4) episodes, start state s4, γ = 1/2
s4, s5, s6, s7: 0 + (1/2)×0 + (1/4)×0 + (1/8)×10 = 1.25
s4, s4, s5, s4: 0 + (1/2)×0 + (1/4)×0 + (1/8)×0 = 0
s4, s3, s2, s1: 0 + (1/2)×0 + (1/4)×0 + (1/8)×1 = 0.125

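These sample returns can be checked with a one-line helper (a sketch; the reward sequences below follow the episodes above):

```python
def discounted_return(rewards, gamma):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ... for one episode's reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([0, 0, 0, 10], gamma=0.5))  # s4, s5, s6, s7 -> 1.25
print(discounted_return([0, 0, 0, 0], gamma=0.5))   # s4, s4, s5, s4 -> 0.0
print(discounted_return([0, 0, 0, 1], gamma=0.5))   # s4, s3, s2, s1 -> 0.125
```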
Question: Finite Horizon Policies

Set k = 1
Initialize V0 (s) = 0 for all states s
Loop until k == H:
For each state s
Vk+1(s) = max_a [ R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) Vk(s′) ]

πk+1(s) = arg max_a [ R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) Vk(s′) ]

Is the optimal policy stationary (independent of time step) in finite horizon tasks?

Question: Finite Horizon Policies

Set k = 1
Initialize V0 (s) = 0 for all states s
Loop until k == H:
For each state s
Vk+1(s) = max_a [ R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) Vk(s′) ]

πk+1(s) = arg max_a [ R(s, a) + γ Σ_{s′ ∈ S} P(s′|s, a) Vk(s′) ]

Is the optimal policy stationary (independent of time step) in finite horizon tasks?
In general, no.

Value vs Policy Iteration

Value iteration:
Compute optimal value for horizon = k
Note this can be used to compute optimal policy if horizon = k
Increment k
Policy iteration
Compute infinite horizon value of a policy
Use to select another (better) policy
Closely related to a very popular method in RL: policy gradient

RL Terminology: Models, Policies, Values

Model: Mathematical models of dynamics and reward


Policy: Function mapping states to actions
Value function: future rewards from being in a state and/or action
when following a particular policy

What You Should Know

Define MP, MRP, MDP, Bellman operator, contraction, model,


Q-value, policy
Be able to implement
Value Iteration
Policy Iteration
Give pros and cons of different policy evaluation approaches
Be able to prove contraction properties
Limitations of presented approaches and Markov assumptions
Which policy evaluation methods require the Markov assumption?

Lecture 3: Model-Free Policy Evaluation: Policy
Evaluation Without Knowing How the World Works

Emma Brunskill

CS234 Reinforcement Learning

Material builds on structure from David Silver's Lecture 4: Model-Free Prediction. Other resources: Sutton and Barto Jan 1 2018 draft, Chapters/Sections: 5.1; 5.5; 6.1-6.3

L3N1 Refresh Your Knowledge [Polleverywhere Poll]

In a tabular MDP, asymptotically, value iteration will always yield a policy with the same value as the policy returned by policy iteration
1 True.
2 False
3 Not sure
Can value iteration require more iterations than |A|^|S| to compute the optimal value function? (Assume |A| and |S| are small enough that each round of value iteration can be done exactly.)
1 True.
2 False
3 Not sure

L3N1 Refresh Your Knowledge

In a tabular MDP, asymptotically, value iteration will always yield a policy with the same value as the policy returned by policy iteration
Answer. True. Both are guaranteed to converge to the optimal value function and a policy with an optimal value

Can value iteration require more iterations than |A|^|S| to compute the optimal value function? (Assume |A| and |S| are small enough that each round of value iteration can be done exactly.)
Answer: True. As an example, consider a single state, single action MDP where r(s, a) = 1, γ = 0.9 and initialize V0(s) = 0. V*(s) = 1/(1 − γ), but after the first iteration of value iteration, V1(s) = 1.

Today’s Plan

Last Time:
Markov reward / decision processes
Policy evaluation & control when have true model (of how the world works)
Today
Policy evaluation without known dynamics & reward models
Next Time:
Control when don’t have a model of how the world works

Evaluation through Direct Experience

Estimate expected return of policy π

Only using data from the environment¹ (direct experience)
Why is this important?
What properties do we want from policy evaluation algorithms?

¹ Assume today this experience comes from executing the policy π. Later we will consider how to do policy evaluation using data gathered from other policies.
This Lecture: Policy Evaluation

Estimating the expected return of a particular policy if we don't have access to the true MDP models
Monte Carlo policy evaluation
Policy evaluation when we don't have a model of how the world works
Given on-policy samples
Temporal Difference (TD)
Certainty Equivalence with dynamic programming
Batch policy evaluation

Recall

Definition of Return, Gt (for an MRP)

Discounted sum of rewards from time step t to horizon

Gt = rt + γ rt+1 + γ^2 rt+2 + γ^3 rt+3 + · · ·

Definition of State Value Function, V(s)
Expected return starting in state s under policy π

V^π(s) = E_π[Gt | st = s] = E_π[rt + γ rt+1 + γ^2 rt+2 + γ^3 rt+3 + · · · | st = s]

Definition of State-Action Value Function, Q^π(s, a)

Expected return starting in state s, taking action a, and then following policy π

Q^π(s, a) = E_π[Gt | st = s, at = a]
          = E_π[rt + γ rt+1 + γ^2 rt+2 + γ^3 rt+3 + · · · | st = s, at = a]

Recall: Dynamic Programming for Policy Evaluation
In a Markov decision process

V^π(s) = E_π[Gt | st = s]
       = E_π[rt + γ rt+1 + γ^2 rt+2 + γ^3 rt+3 + · · · | st = s]
       = R(s, π(s)) + γ Σ_{s′ ∈ S} P(s′ | s, π(s)) V^π(s′)

If given dynamics and reward models, can do policy evaluation through dynamic programming:

V_k^π(s) = r(s, π(s)) + γ Σ_{s′ ∈ S} p(s′ | s, π(s)) V_{k−1}^π(s′)      (1)

Note: before convergence, V_k^π is an estimate of V^π

In Equation 1 we are substituting Σ_{s′ ∈ S} p(s′ | s, π(s)) V_{k−1}^π(s′) for E_π[rt+1 + γ rt+2 + γ^2 rt+3 + · · · | st = s].
This substitution is an instance of bootstrapping
This Lecture: Policy Evaluation

Estimating the expected return of a particular policy if we don't have access to the true MDP models
Monte Carlo policy evaluation
Policy evaluation when we don't have a model of how the world works
Given on-policy samples
Temporal Difference (TD)
Certainty Equivalence with dynamic programming
Batch policy evaluation

Monte Carlo (MC) Policy Evaluation

Gt = rt + γ rt+1 + γ^2 rt+2 + γ^3 rt+3 + · · · + γ^(Ti − t) rTi in MDP M under policy π
V(s) = E_{τ∼π}[Gt | st = s]

Expectation over trajectories τ generated by following π

Simple idea: Value = mean return

If trajectories are all finite, sample a set of trajectories & average returns
Note: all trajectories may not be the same length (e.g. consider an MDP with terminal states)

Monte Carlo (MC) Policy Evaluation

If trajectories are all finite, sample set of trajectories & average returns
Does not require MDP dynamics/rewards
Does not assume state is Markov
Can be applied to episodic MDPs
Averaging over returns from a complete episode
Requires each episode to terminate

First-Visit Monte Carlo (MC) On Policy Evaluation

Initialize N(s) = 0, G(s) = 0 ∀s ∈ S

Loop
Sample episode i = si,1, ai,1, ri,1, si,2, ai,2, ri,2, . . . , si,Ti, ai,Ti, ri,Ti
Define Gi,t = ri,t + γ ri,t+1 + γ^2 ri,t+2 + · · · + γ^(Ti − 1) ri,Ti as the return from time step t onwards in the ith episode
For each time step t until Ti (the end of episode i)
If this is the first time t that state s is visited in episode i
Increment counter of total first visits: N(s) = N(s) + 1
Increment total return: G(s) = G(s) + Gi,t
Update estimate: V^π(s) = G(s)/N(s)

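A minimal sketch of first-visit and every-visit MC evaluation (plain Python; the episode encoding, the helper name, and the γ = 0.9 used in the example are my own choices — the worked example later only assumes γ < 1):

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma, first_visit=True):
    """Monte Carlo on-policy evaluation from a list of episodes.

    Each episode is a list of (state, reward) pairs in time order; actions are
    omitted since they are not needed to estimate V^pi. Set first_visit=False
    for the every-visit variant.
    """
    N = defaultdict(int)            # visit counts
    G_total = defaultdict(float)    # summed returns
    for episode in episodes:
        # Compute returns backwards: G_t = r_t + gamma * G_{t+1}.
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        first_time = {}
        for t, (s, _) in enumerate(episode):
            first_time.setdefault(s, t)
        for t, (s, _) in enumerate(episode):
            if first_visit and first_time[s] != t:
                continue
            N[s] += 1
            G_total[s] += returns[t]
    return {s: G_total[s] / N[s] for s in N}

# Mars rover trajectory from the optional worked example (reward recorded per visited state).
traj = [("s3", 0), ("s2", 0), ("s2", 0), ("s1", 1)]
print(mc_policy_evaluation([traj], gamma=0.9, first_visit=True))   # s2 -> gamma^2
print(mc_policy_evaluation([traj], gamma=0.9, first_visit=False))  # s2 -> (gamma^2 + gamma)/2
```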
Every-Visit Monte Carlo (MC) On Policy Evaluation

Initialize N(s) = 0, G(s) = 0 ∀s ∈ S

Loop
Sample episode i = si,1, ai,1, ri,1, si,2, ai,2, ri,2, . . . , si,Ti, ai,Ti, ri,Ti
Define Gi,t = ri,t + γ ri,t+1 + γ^2 ri,t+2 + · · · + γ^(Ti − 1) ri,Ti as the return from time step t onwards in the ith episode
For each time step t until Ti (the end of episode i)
Let s be the state visited at time step t in episode i
Increment counter of total visits: N(s) = N(s) + 1
Increment total return: G(s) = G(s) + Gi,t
Update estimate: V^π(s) = G(s)/N(s)

Optional Worked Example: MC On Policy Evaluation

Initialize N(s) = 0, G(s) = 0 ∀s ∈ S

Loop
Sample episode i = si,1, ai,1, ri,1, si,2, ai,2, ri,2, . . . , si,Ti, ai,Ti, ri,Ti
Gi,t = ri,t + γ ri,t+1 + γ^2 ri,t+2 + · · · + γ^(Ti − 1) ri,Ti
For each time step t until Ti (the end of episode i)
If this is the first time t that state s is visited in episode i (for first-visit MC)
Increment counter of total first visits: N(s) = N(s) + 1
Increment total return: G(s) = G(s) + Gi,t
Update estimate: V^π(s) = G(s)/N(s)

Mars rover: R(s) = [1 0 0 0 0 0 +10]

Trajectory = (s3, a1, 0, s2, a1, 0, s2, a1, 0, s1, a1, 1, terminal)
Let γ < 1. Compute the first-visit & every-visit MC estimates of s2.
See solutions at the end of the slides

Incremental Monte Carlo (MC) On Policy Evaluation

After each episode i = si,1, ai,1, ri,1, si,2, ai,2, ri,2, . . .

Define Gi,t = ri,t + γ ri,t+1 + γ^2 ri,t+2 + · · · as the return from time step t onwards in the ith episode
For state s visited at time step t in episode i
Increment counter of total visits: N(s) = N(s) + 1
Update estimate

V^π(s) = V^π(s) (N(s) − 1)/N(s) + Gi,t/N(s) = V^π(s) + (1/N(s)) (Gi,t − V^π(s))

Incremental Monte Carlo (MC) On Policy Evaluation

Sample episode i = si,1, ai,1, ri,1, si,2, ai,2, ri,2, . . . , si,Ti, ai,Ti, ri,Ti
Gi,t = ri,t + γ ri,t+1 + γ^2 ri,t+2 + · · · + γ^(Ti − 1) ri,Ti
for t = 1 : Ti, where Ti is the length of the i-th episode:
V^π(si,t) = V^π(si,t) + α(Gi,t − V^π(si,t))

We will see many algorithms of this form, with a learning rate, a target, and an incremental update

Policy Evaluation Diagram

(figure)
MC Policy Evaluation

V^π(s) = V^π(s) + α(Gi,t − V^π(s))

Evaluation of the Quality of a Policy Estimation Approach

Consistency: with enough data, does the estimate converge to the true value
of the policy?
Computational complexity: as get more data, computational cost of
updating estimate
Memory requirements
Statistical efficiency (intuitively, how does the accuracy of the estimate
change with the amount of data)
Empirical accuracy, often evaluated by mean squared error

Evaluation of the Quality of a Policy Estimation Approach:
Bias, Variance and MSE
Consider a statistical model that is parameterized by θ and that determines a probability distribution over observed data P(x|θ)
Consider a statistic θ̂ that provides an estimate of θ and is a function of observed data x
E.g. for a Gaussian distribution with known variance, the average of a set of i.i.d. data points is an estimate of the mean of the Gaussian
Definition: the bias of an estimator θ̂ is:

Bias_θ(θ̂) = E_{x|θ}[θ̂] − θ

Definition: the variance of an estimator θ̂ is:

Var(θ̂) = E_{x|θ}[(θ̂ − E[θ̂])^2]

Definition: the mean squared error (MSE) of an estimator θ̂ is:

MSE(θ̂) = Var(θ̂) + Bias_θ(θ̂)^2

Evaluation of the Quality of a Policy Estimation Approach:
Consistent Estimator
Consider a statistical model that is parameterized by θ and that determines a probability distribution over observed data P(x|θ)
Consider a statistic θ̂ that provides an estimate of θ and is a function of observed data x
Definition: the bias of an estimator θ̂ is:

Bias_θ(θ̂) = E_{x|θ}[θ̂] − θ

Let n be the number of data points x used to estimate the parameter θ, and call the resulting estimate of θ using that data θ̂_n
Then the estimator θ̂_n is consistent if, for all ε > 0,

lim_{n→∞} Pr(|θ̂_n − θ| > ε) = 0

If an estimator is unbiased (bias = 0), is it consistent?


Properties of Monte Carlo On Policy Evaluators

Properties:
First-visit Monte Carlo
The V^π estimator is an unbiased estimator of the true E_π[Gt | st = s]
By the law of large numbers, as N(s) → ∞, V^π(s) → E_π[Gt | st = s]
Every-visit Monte Carlo
The every-visit MC estimator of V^π is a biased estimator of V^π
But it is a consistent estimator and often has better MSE
Incremental Monte Carlo
Properties depend on the learning rate α

Properties of Monte Carlo On Policy Evaluators

Update is: V^π(si,t) = V^π(si,t) + α_k(sj)(Gi,t − V^π(si,t))

where we have allowed α to vary (let k be the total number of updates done so far for state si,t = sj)
If

Σ_{n=1}^{∞} α_n(sj) = ∞,   Σ_{n=1}^{∞} α_n^2(sj) < ∞

then the incremental MC estimate will converge to the true policy value V^π(sj)

Monte Carlo (MC) Policy Evaluation Key Limitations

Generally high variance estimator


Reducing variance can require a lot of data
In cases where data is very hard or expensive to acquire, or the stakes are
high, MC may be impractical
Requires episodic settings
Episode must end before data from episode can be used to update V

Monte Carlo (MC) Policy Evaluation Summary

Aim: estimate V^π(s) given episodes generated under policy π

s1, a1, r1, s2, a2, r2, . . . where the actions are sampled from π

Gt = rt + γ rt+1 + γ^2 rt+2 + γ^3 rt+3 + · · · under policy π
V^π(s) = E_π[Gt | st = s]

Simple: Estimates the expectation by an empirical average (given episodes sampled from the policy of interest)
Updates V estimate using sample of return to approximate the expectation
Does not assume Markov process
Converges to true value under some (generally mild) assumptions
Note: Sometimes is preferred over dynamic programming for policy
evaluation even if know the true dynamics model and reward

This Lecture: Policy Evaluation

Estimating the expected return of a particular policy if we don't have access to the true MDP models
Monte Carlo policy evaluation
Temporal Difference (TD)
Certainty Equivalence with dynamic programming
Batch policy evaluation

Temporal Difference Learning

"If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning." – Sutton and Barto 2017
Combination of Monte Carlo & dynamic programming methods
Model-free
Can be used in episodic or infinite-horizon non-episodic settings
Immediately updates estimate of V after each (s, a, r, s′) tuple

Temporal Difference Learning for Estimating V

Aim: estimate V^π(s) given episodes generated under policy π

Gt = rt + γ rt+1 + γ^2 rt+2 + γ^3 rt+3 + · · · in MDP M under policy π
V^π(s) = E_π[Gt | st = s]
Recall the Bellman operator (if we know the MDP models):

B^π V(s) = r(s, π(s)) + γ Σ_{s′ ∈ S} p(s′ | s, π(s)) V(s′)

In incremental every-visit MC, update the estimate using one sample of the return (for the current ith episode):

V^π(s) = V^π(s) + α(Gi,t − V^π(s))

Idea: have an estimate of V^π, use it to estimate the expected return:

V^π(s) = V^π(s) + α([rt + γ V^π(st+1)] − V^π(s))

Temporal Difference [TD(0)] Learning

Aim: estimate V^π(s) given episodes generated under policy π

s1, a1, r1, s2, a2, r2, . . . where the actions are sampled from π
TD(0) learning / 1-step TD learning: update the estimate towards the TD target

V^π(st) = V^π(st) + α([rt + γ V^π(st+1)] − V^π(st)),   where rt + γ V^π(st+1) is the TD target

TD(0) error:
δt = rt + γ V^π(st+1) − V^π(st)

Can immediately update the value estimate after each (s, a, r, s′) tuple

Don't need an episodic setting

Temporal Difference [TD(0)] Learning Algorithm

Input: α
Initialize V^π(s) = 0, ∀s ∈ S
Loop
Sample tuple (st, at, rt, st+1)
V^π(st) = V^π(st) + α([rt + γ V^π(st+1)] − V^π(st)),   where rt + γ V^π(st+1) is the TD target

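A minimal TD(0) sketch (plain Python; the function name and dictionary-based value table are my own), applied to the worked example on the following slides (γ = 1, α = 1):

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha, gamma):
    """One TD(0) update: move V(s) toward the TD target r + gamma * V(s')."""
    td_target = r + gamma * V[s_next]
    V[s] += alpha * (td_target - V[s])

# Worked example from the next slides: V initialized to 0, rewards recorded per
# visited state, and the terminal "state" keeps value 0.
V = defaultdict(float)
for s, r, s_next in [("s3", 0, "s2"), ("s2", 0, "s2"), ("s2", 0, "s1"), ("s1", 1, "terminal")]:
    td0_update(V, s, r, s_next, alpha=1.0, gamma=1.0)
print(V["s1"], V["s2"], V["s3"])  # 1.0 0.0 0.0, matching the slide's V = [1 0 0 0 0 0 0]
```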
Worked Example TD Learning

Input: α
Initialize V^π(s) = 0, ∀s ∈ S
Loop
Sample tuple (st, at, rt, st+1)
V^π(st) = V^π(st) + α([rt + γ V^π(st+1)] − V^π(st)),   where rt + γ V^π(st+1) is the TD target

Example Mars rover: R = [1 0 0 0 0 0 +10] for any action

π(s) = a1 ∀s, γ = 1. Any action from s1 and s7 terminates the episode
Trajectory = (s3, a1, 0, s2, a1, 0, s2, a1, 0, s1, a1, 1, terminal)

Worked Example TD Learning
Input: α
Initialize V^π(s) = 0, ∀s ∈ S
Loop
Sample tuple (st, at, rt, st+1)
V^π(st) = V^π(st) + α([rt + γ V^π(st+1)] − V^π(st)),   where rt + γ V^π(st+1) is the TD target

Example:
Mars rover: R = [1 0 0 0 0 0 +10] for any action
π(s) = a1 ∀s, γ = 1. Any action from s1 and s7 terminates the episode
Trajectory = (s3, a1, 0, s2, a1, 0, s2, a1, 0, s1, a1, 1, terminal)
TD estimate of all states (init at 0) with α = 1, γ < 1 at the end of this episode?
V = [1 0 0 0 0 0 0]
First-visit MC estimate of V of each state? [1 γ^2 γ^3 0 0 0 0]
Temporal Difference (TD) Policy Evaluation

V^π(st) = r(st, π(st)) + γ Σ_{st+1} P(st+1 | st, π(st)) V^π(st+1)

V^π(st) = V^π(st) + α([rt + γ V^π(st+1)] − V^π(st))

Check Your Understanding L3N2: Polleverywhere Poll
Temporal Difference [TD(0)] Learning Algorithm

Input: α
Initialize V^π(s) = 0, ∀s ∈ S
Loop
Sample tuple (st, at, rt, st+1)
V^π(st) = V^π(st) + α([rt + γ V^π(st+1)] − V^π(st)),   where rt + γ V^π(st+1) is the TD target

Select all that are true


1 If α = 0, TD will weigh the TD target more than the past V estimate
2 If α = 1, TD will update the V estimate to the TD target
3 If α = 1, in MDPs where the policy goes through states with multiple possible next states, V may oscillate forever
4 There exist deterministic MDPs where α = 1 TD will converge

Check Your Understanding L3N2: Polleverywhere Poll
Temporal Difference [TD(0)] Learning Algorithm

Input: α
Initialize V^π(s) = 0, ∀s ∈ S
Loop
Sample tuple (st, at, rt, st+1)
V^π(st) = V^π(st) + α([rt + γ V^π(st+1)] − V^π(st)),   where rt + γ V^π(st+1) is the TD target

Answers. If α = 1, TD will update to the TD target. If α = 1, in MDPs where the policy goes through states with multiple possible next states, V may oscillate forever. There exist deterministic MDPs where α = 1 TD will converge.

Summary: Temporal Difference Learning

Combination of Monte Carlo & dynamic programming methods

Model-free
Bootstraps and samples
Can be used in episodic or infinite-horizon non-episodic settings
Immediately updates estimate of V after each (s, a, r, s′) tuple
Biased estimator (early on it will be influenced by the initialization, and it won't be an unbiased estimator)
Generally lower variance than Monte Carlo policy evaluation
Consistent estimator if the learning rate α satisfies the same conditions specified for incremental MC policy evaluation to converge
Note: the algorithm introduced here is TD(0). In general there are approaches that interpolate between TD(0) and the Monte Carlo approach

This Lecture: Policy Evaluation

Estimating the expected return of a particular policy if we don't have access to the true MDP models
Monte Carlo policy evaluation
Temporal Difference (TD)
Certainty Equivalence with dynamic programming
Batch policy evaluation

Certainty Equivalence V^π MLE MDP Model Estimates

Model-based option for policy evaluation without true models

After each (si, ai, ri, si+1) tuple
Recompute the maximum likelihood MDP model for (s, a):

P̂(s′|s, a) = (1/N(s, a)) Σ_{k=1}^{i} 1(sk = s, ak = a, sk+1 = s′)

r̂(s, a) = (1/N(s, a)) Σ_{k=1}^{i} 1(sk = s, ak = a) rk

Compute V^π using the MLE MDP² (using any dynamic programming method from lecture 2)

Optional worked example at end of slides for Mars rover domain.

² Requires initializing for all (s, a) pairs
Certainty Equivalence V π MLE MDP Model Estimates
Model-based option for policy evaluation without true models
After each (s, a, r, s′) tuple
Recompute the maximum likelihood MDP model for (s, a):

  P̂(s′|s, a) = (1/N(s, a)) Σ_{k=1}^{K} Σ_{t=1}^{Tk−1} 1(sk,t = s, ak,t = a, sk,t+1 = s′)

  r̂(s, a) = (1/N(s, a)) Σ_{k=1}^{K} Σ_{t=1}^{Tk−1} 1(sk,t = s, ak,t = a) rk,t

Compute V π using the MLE MDP
Cost: Updating the MLE model and MDP planning at each update (O(|S|³) for the analytic matrix solution, O(|S|²|A|) for iterative methods)
Very data efficient and very computationally expensive
Consistent (will converge to the right estimate for Markov models)
Can also easily be used for off-policy evaluation (which we will shortly define and discuss)
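A minimal Python sketch of certainty-equivalence policy evaluation, assuming a small finite MDP with integer-indexed states and actions, episodes stored as lists of (s, a, r, s′) tuples, and a deterministic policy array; the iterative evaluation loop and variable names are illustrative assumptions, not the lecture's reference code.

    import numpy as np

    def certainty_equivalence_eval(episodes, policy, n_states, n_actions, gamma=0.9, iters=500):
        """Fit an MLE MDP from data, then evaluate the policy on that model by iterative DP."""
        counts = np.zeros((n_states, n_actions, n_states))
        reward_sums = np.zeros((n_states, n_actions))
        for episode in episodes:
            for (s, a, r, s_next) in episode:
                counts[s, a, s_next] += 1
                reward_sums[s, a] += r
        n_sa = counts.sum(axis=2)                          # N(s, a); zero for unvisited pairs
        P_hat = counts / np.maximum(n_sa[:, :, None], 1)   # MLE transition model
        r_hat = reward_sums / np.maximum(n_sa, 1)          # MLE reward model

        V = np.zeros(n_states)
        for _ in range(iters):   # iterative policy evaluation on the estimated MDP
            V_new = np.zeros(n_states)
            for s in range(n_states):
                a = policy[s]    # deterministic policy: maps state index -> action index
                V_new[s] = r_hat[s, a] + gamma * P_hat[s, a] @ V
            V = V_new
        return V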
This Lecture: Policy Evaluation

Estimating the expected return of a particular policy if we don't have access to
true MDP models
Monte Carlo policy evaluation
Policy evaluation when we don't have a model of how the world works
Given on-policy samples
Temporal Difference (TD)
Certainty Equivalence with dynamic programming
Batch policy evaluation

Batch MC and TD

Batch (offline) solution for a finite dataset

Given a set of K episodes
Repeatedly sample an episode from the set of K episodes
Apply MC or TD(0) to the sampled episode

What do MC and TD(0) converge to?

AB Example: (Ex. 6.4, Sutton & Barto, 2018)

Two states A, B, with γ = 1

Given 8 episodes of experience:
  A, 0, B, 0
  B, 1 (observed 6 times)
  B, 0
Imagine running TD updates over the data an infinite number of times
V(B) =

AB Example: (Ex. 6.4, Sutton & Barto, 2018)

TD Update: V π(st) = V π(st) + α([rt + γ V π(st+1)] − V π(st)), where rt + γ V π(st+1) is the TD target

Two states A, B, with γ = 1

Given 8 episodes of experience:
  A, 0, B, 0
  B, 1 (observed 6 times)
  B, 0

Imagine running TD updates over the data an infinite number of times

V(B) = 0.75 by TD or MC
What about V (A)?

Check Your Understanding L3N3: AB Example: (Ex. 6.4,
Sutton & Barto, 2018)

TD Update: V π(st) = V π(st) + α([rt + γ V π(st+1)] − V π(st)), where rt + γ V π(st+1) is the TD target

Two states A, B, with γ = 1

Given 8 episodes of experience:
  A, 0, B, 0
  B, 1 (observed 6 times)
  B, 0

Imagine running TD updates over the data an infinite number of times

V(B) = 0.75 by TD or MC
What about V(A)?
Respond in Poll

Check Your Understanding L3N3: AB Example: (Ex. 6.4,
Sutton & Barto, 2018)

TD Update: V π(st) = V π(st) + α([rt + γ V π(st+1)] − V π(st)), where rt + γ V π(st+1) is the TD target

Two states A, B, with γ = 1

Given 8 episodes of experience:
  A, 0, B, 0
  B, 1 (observed 6 times)
  B, 0

Imagine running TD updates over the data an infinite number of times

V(B) = 0.75 by TD or MC
What about V(A)?
V MC(A) = 0, V TD(A) = 0.75
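A brief worked calculation of these values (a reconstruction of the standard argument, consistent with the numbers on the slide), written in LaTeX:

    % Batch MC averages the observed returns: B yields reward 1 in 6 of 8 episodes,
    % and A is visited once with total return 0.
    V^{MC}(B) = \frac{6 \cdot 1 + 2 \cdot 0}{8} = 0.75, \qquad V^{MC}(A) = 0
    % Batch TD(0) converges to the value under the MLE MDP, in which A always
    % transitions to B with reward 0 (and \gamma = 1):
    V^{TD}(A) = 0 + \gamma \, V(B) = 0.75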

Batch MC and TD: Convergence

Monte Carlo in batch setting converges to min MSE (mean squared error)
Minimize loss with respect to observed returns
In AB example, V (A) = 0

TD(0) converges to the DP policy value V π for the MDP with the maximum likelihood model estimates
Aka the same as dynamic programming with certainty equivalence!
Maximum likelihood Markov decision process model:

  P̂(s′|s, a) = (1/N(s, a)) Σ_{k=1}^{i} 1(sk = s, ak = a, sk+1 = s′)

  r̂(s, a) = (1/N(s, a)) Σ_{k=1}^{i} 1(sk = s, ak = a) rk

Compute V π using this model

In AB example, V(A) = 0.75

Some Important Properties to Evaluate Model-free Policy
Evaluation Algorithms

Data efficiency & Computational efficiency


In simple TD(0), use (s, a, r, s′) once to update V(s)
O(1) operation per update
In an episode of length L, O(L)

In MC have to wait till episode finishes, then also O(L)


MC can be more data efficient than simple TD
But TD exploits Markov structure
If in Markov domain, leveraging this is helpful

Dynamic programming with certainty equivalence also uses Markov structure

Summary: Policy Evaluation
Estimating the expected return of a particular policy if we don't have access
to true MDP models. Ex. evaluating average purchases per session of a new
product recommendation system
Monte Carlo policy evaluation
Policy evaluation when we don't have a model of how the world works
Given on-policy samples
Given off-policy samples
Temporal Difference (TD)
Dynamic Programming with certainty equivalence
*Understand what MC vs TD methods compute in batch evaluations
Metrics / Qualities to evaluate and compare algorithms
Uses Markov assumption
Accuracy / MSE / bias / variance
Data efficiency
Computational efficiency

Lecture 4: Model Free Control and Function
Approximation

Emma Brunskill

CS234 Reinforcement Learning.

Structure and content drawn in part from David Silver’s Lecture 5


and Lecture 6. For additional reading please see SB Sections 5.2-5.4,
6.4, 6.5, 6.7

Check Your Understanding L4N1: Model-free Generalized
Policy Improvement

Consider policy iteration


Repeat:
Policy evaluation: compute Q π
Policy improvement πi+1 (s) = arg maxa Q πi (s, a)
Question: is this πi+1 deterministic or stochastic? Assume for each
state s there is a unique maxa Q πi (s, a).
Answer: Deterministic, Stochastic, Not Sure
Now consider evaluating the policy of this new πi+1 . Recall in
model-free policy evaluation, we estimated V π , using π to generate
new trajectories
Question: Can we compute Q πi+1 (s, a) ∀s, a by using this πi+1 to
generate new trajectories?
Answer: True, False, Not Sure

Check Your Understanding L4N1: Model-free Generalized
Policy Improvement
Consider policy iteration
Repeat:
Policy evaluation: compute Q π
Policy improvement πi+1 (s) = arg maxa Q πi (s, a)
Question: is this πi+1 deterministic or stochastic? Assume for each
state s there is a unique maxa Q πi (s, a).
Answer: Deterministic

Now consider evaluating the policy of this new πi+1 . Recall in


model-free policy evaluation, we estimated V π , using π to generate
new trajectories
Question: Can we compute Q πi+1 (s, a) ∀s, a by using this πi+1 to
generate new trajectories?
Answer: False. A deterministic πi+1 never selects actions a ̸= πi+1(s), so Q πi+1(s, a) cannot be estimated for those actions from its own trajectories.

Class Structure

Last time: Policy evaluation with no knowledge of how the world


works (MDP model not given)
Control (making decisions) without a model of how the world works
Generalization – Value function approximation

Today’s Lecture

Generalized Policy Improvement


Monte-Carlo Control with Tabular Representations
Greedy in the Limit of Infinite Exploration
Temporal Difference Methods for Control

1 Model Free Value Function Approximation


Policy Evaluation
Monte Carlo Policy Evaluation
Temporal Difference TD(0) Policy Evaluation

2 Control using Value Function Approximation


Control using General Value Function Approximators
Deep Q-Learning

Table of Contents

Generalized Policy Improvement


Monte-Carlo Control with Tabular Representations
Greedy in the Limit of Infinite Exploration
Temporal Difference Methods for Control
Policy Evaluation
Monte Carlo Policy Evaluation
Temporal Difference TD(0) Policy Evaluation
Control using General Value Function Approximators
Deep Q-Learning

Model-free Policy Iteration

Initialize policy π
Repeat:
Policy evaluation: compute Q π
Policy improvement: update π given Q π
May need to modify policy evaluation:
If π is deterministic, can’t compute Q(s, a) for any a ̸= π(s)
How to interleave policy evaluation and improvement?
Policy improvement is now using an estimated Q

The Problem of Exploration

Goal: Learn to select actions to maximize total expected future reward


Problem: Can't learn about actions without trying them (need to explore)
Problem: But if we try new actions, we spend less time taking actions that
our past experience suggests will yield high reward (need to exploit
knowledge of the domain to achieve high rewards)
ϵ-greedy Policies

Simple idea to balance exploration and achieving rewards


Let |A| be the number of actions
Then an ϵ-greedy policy w.r.t. a state-action value Q(s, a) is

  π(a|s) = arg maxa Q(s, a)                        w. prob 1 − ϵ + ϵ/|A|
  π(a|s) = a′, for each a′ ̸= arg maxa Q(s, a),    w. prob ϵ/|A|

In words: select the argmax action with probability 1 − ϵ, else select an
action uniformly at random
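A minimal Python sketch of ϵ-greedy action selection over a tabular Q, assuming Q is a 2-D array indexed by (state, action); the function name and interface are illustrative.

    import numpy as np

    def epsilon_greedy_action(Q, state, epsilon, rng=None):
        """Greedy action w.p. 1 - eps + eps/|A| overall; otherwise uniformly random."""
        if rng is None:
            rng = np.random.default_rng()
        n_actions = Q.shape[1]
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))   # explore: uniform over all actions
        return int(np.argmax(Q[state]))           # exploit: greedy w.r.t. current Q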

Policy Improvement with ϵ-greedy policies

Recall we proved that policy iteration, using given dynamics and reward
models, is guaranteed to monotonically improve
That proof assumed policy improvement outputs a deterministic policy
The same property holds for ϵ-greedy policies

Monotonic ϵ-greedy Policy Improvement
Theorem
For any ϵ-greedy policy πi , the ϵ-greedy policy w.r.t. Q πi , πi+1 is a
monotonic improvement V πi+1 ≥ V πi
  Q πi(s, πi+1(s)) = Σ_{a∈A} πi+1(a|s) Q πi(s, a)
                   = (ϵ/|A|) Σ_{a∈A} Q πi(s, a) + (1 − ϵ) max_a Q πi(s, a)
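The remaining steps of the standard argument are not shown on this slide; a hedged reconstruction in LaTeX follows, using that πi is itself ϵ-greedy, so the weights below are non-negative and sum to one.

    \max_a Q^{\pi_i}(s,a)
      \;\ge\; \sum_{a\in A} \frac{\pi_i(a|s) - \epsilon/|A|}{1-\epsilon}\, Q^{\pi_i}(s,a)
    \;\Rightarrow\;
    Q^{\pi_i}(s,\pi_{i+1}(s))
      \;\ge\; \frac{\epsilon}{|A|}\sum_{a\in A} Q^{\pi_i}(s,a)
        + (1-\epsilon)\sum_{a\in A} \frac{\pi_i(a|s)-\epsilon/|A|}{1-\epsilon}\, Q^{\pi_i}(s,a)
      \;=\; \sum_{a\in A} \pi_i(a|s)\, Q^{\pi_i}(s,a) \;=\; V^{\pi_i}(s)
    % Monotonic improvement V^{\pi_{i+1}} \ge V^{\pi_i} then follows from applying
    % the policy improvement argument state-wise.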

Today: Model-free Control

Generalized policy improvement


Importance of exploration
Monte Carlo control
Model-free control with temporal difference (SARSA, Q-learning)

Table of Contents

Generalized Policy Improvement


Monte-Carlo Control with Tabular Representations
Greedy in the Limit of Infinite Exploration
Temporal Difference Methods for Control
Policy Evaluation
Monte Carlo Policy Evaluation
Temporal Difference TD(0) Policy Evaluation
Control using General Value Function Approximators
Deep Q-Learning

Recall Monte Carlo Policy Evaluation, Now for Q

1: Initialize Q(s, a) = 0, N(s, a) = 0 ∀(s, a), k = 1, Input ϵ = 1, π
2: loop
3:   Sample k-th episode (sk,1, ak,1, rk,1, sk,2, . . . , sk,Tk) given π
3:   Compute Gk,t = rk,t + γ rk,t+1 + γ² rk,t+2 + · · · + γ^(Tk−1) rk,Tk, ∀t
4:   for t = 1, . . . , Tk do
5:     if First visit to (s, a) in episode k then
6:       N(s, a) = N(s, a) + 1
7:       Q(st, at) = Q(st, at) + (1/N(s, a)) (Gk,t − Q(st, at))
8:     end if
9:   end for
10:  k = k + 1
11: end loop

Monte Carlo Online Control / On Policy Improvement

1: Initialize Q(s, a) = 0, N(s, a) = 0 ∀(s, a), Set ϵ = 1, k = 1
2: πk = ϵ-greedy(Q) // Create initial ϵ-greedy policy
3: loop
4:   Sample k-th episode (sk,1, ak,1, rk,1, sk,2, . . . , sk,Tk) given πk
4:   Gk,t = rk,t + γ rk,t+1 + γ² rk,t+2 + · · · + γ^(Tk−1) rk,Tk
5:   for t = 1, . . . , Tk do
6:     if First visit to (s, a) in episode k then
7:       N(s, a) = N(s, a) + 1
8:       Q(st, at) = Q(st, at) + (1/N(s, a)) (Gk,t − Q(st, at))
9:     end if
10:   end for
11:  k = k + 1, ϵ = 1/k
12:  πk = ϵ-greedy(Q) // Policy improvement
13: end loop
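A minimal Python sketch of this on-policy first-visit Monte Carlo control loop with ϵ = 1/k; the environment interface (env.reset()/env.step()) and rollout code are illustrative assumptions, not part of the slides.

    import collections
    import numpy as np

    def mc_control(env, n_actions, num_episodes=10_000, gamma=1.0, rng=np.random.default_rng(0)):
        """On-policy first-visit Monte Carlo control with epsilon = 1/k."""
        Q = collections.defaultdict(lambda: np.zeros(n_actions))
        N = collections.defaultdict(lambda: np.zeros(n_actions))
        for k in range(1, num_episodes + 1):
            eps = 1.0 / k
            # Generate one episode with the current epsilon-greedy policy
            episode, s, done = [], env.reset(), False
            while not done:
                a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
                s_next, r, done = env.step(a)
                episode.append((s, a, r))
                s = s_next
            # First-visit incremental updates toward the observed return G
            first_visit = {}
            for t, (s, a, _) in enumerate(episode):
                first_visit.setdefault((s, a), t)
            G = 0.0
            for t in reversed(range(len(episode))):
                s, a, r = episode[t]
                G = r + gamma * G
                if first_visit[(s, a)] == t:
                    N[s][a] += 1
                    Q[s][a] += (G - Q[s][a]) / N[s][a]
        return Q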

Optional Worked Example: MC for On Policy Control
Mars rover with new actions:
r (−, a1 ) = [ 1 0 0 0 0 0 +10], r (−, a2 ) = [ 0 0 0 0 0 0 +5], γ = 1.
Assume current greedy π(s) = a1 ∀s, ϵ=.5. Q(s, a) = 0 for all (s, a)
Sample trajectory from ϵ-greedy policy
Trajectory = (s3 , a1 , 0, s2 , a2 , 0, s3 , a1 , 0, s2 , a2 , 0, s1 , a1 , 1, terminal)
First visit MC estimate of Q of each (s, a) pair?
Q ϵ−π (−, a1 ) = [1 0 1 0 0 0 0]
After this trajectory (Select all)
Q ϵ−π (−, a2 ) = [0 0 0 0 0 0 0]
The new greedy policy would be: π = [1 tie 1 tie tie tie tie]
The new greedy policy would be: π = [1 2 1 tie tie tie tie]
If ϵ = 1/3, prob of selecting a1 in s1 in the new ϵ-greedy policy is 1/9.
If ϵ = 1/3, prob of selecting a1 in s1 in the new ϵ-greedy policy is 2/3.
If ϵ = 1/3, prob of selecting a1 in s1 in the new ϵ-greedy policy is 5/6.
Not sure
Properties of MC control with ϵ-greedy policies

Computational complexity?
Converge to optimal Q ∗ function?
Empirical performance?

L4N2 Check Your Understanding: Monte Carlo Online
Control / On Policy Improvement

1: Initialize Q(s, a) = 0, N(s, a) = 0 ∀(s, a), Set ϵ = 1, k = 1
2: πk = ϵ-greedy(Q) // Create initial ϵ-greedy policy
3: loop
4:   Sample k-th episode (sk,1, ak,1, rk,1, sk,2, . . . , sk,Tk) given πk
4:   Gk,t = rk,t + γ rk,t+1 + γ² rk,t+2 + · · · + γ^(Tk−1) rk,Tk
5:   for t = 1, . . . , Tk do
6:     if First visit to (s, a) in episode k then
7:       N(s, a) = N(s, a) + 1
8:       Q(st, at) = Q(st, at) + (1/N(s, a)) (Gk,t − Q(st, at))
9:     end if
10:   end for
11:  k = k + 1, ϵ = 1/k
12:  πk = ϵ-greedy(Q) // Policy improvement
13: end loop

Is Q an estimate of Q πk ? When might this procedure fail to compute


the optimal Q ∗ ?
Table of Contents

Generalized Policy Improvement


Monte-Carlo Control with Tabular Representations
Greedy in the Limit of Infinite Exploration
Temporal Difference Methods for Control

1 Model Free Value Function Approximation


Policy Evaluation
Monte Carlo Policy Evaluation
Temporal Difference TD(0) Policy Evaluation

2 Control using Value Function Approximation


Control using General Value Function Approximators
Deep Q-Learning

Motivation for Function Approximation

Avoid explicitly storing or learning the following for every single state
and action
Dynamics or reward model
Value
State-action value
Policy
Want more compact representation that generalizes across state or
states and actions
Reduce memory needed to store (P, R)/V /Q/π
Reduce computation needed to compute (P, R)/V /Q/π
Reduce experience needed to find a good (P, R)/V /Q/π

State Action Value Function Approximation for Policy
Evaluation with an Oracle

First assume we could query any state s and action a and an oracle
would return the true value for Q π (s, a)
Similar to supervised learning: assume given ((s, a), Q π (s, a)) pairs
The objective is to find the best approximate representation of Q π
given a particular parameterized function Q̂(s, a; w )

Stochastic Gradient Descent
Goal: Find the parameter vector w that minimizes the loss between a
true value function Q π (s, a) and its approximation Q̂(s, a; w ) as
represented with a particular function class parameterized by w .
Generally use mean squared error and define the loss as

J(w) = Eπ[(Q π(s, a) − Q̂(s, a; w))²]

Can use gradient descent to find a local minimum

  Δw = −(1/2) α ∇w J(w)

Stochastic gradient descent (SGD) uses a finite number of (often one) samples to compute an approximate gradient:

In expectation, the SGD update equals the full gradient update


Stochastic Gradient Descent
Goal: Find the parameter vector w that minimizes the loss between a
true value function Q π (s, a) and its approximation Q̂(s, a; w ) as
represented with a particular function class parameterized by w .
Generally use mean squared error and define the loss as
J(w) = Eπ[(Q π(s, a) − Q̂(s, a; w))²]

Can use gradient descent to find a local minimum

  Δw = −(1/2) α ∇w J(w)

Stochastic gradient descent (SGD) uses a finite number of (often one) samples to compute an approximate gradient:

  ∇w J(w) = ∇w Eπ[(Q π(s, a) − Q̂(s, a; w))²]
          = −2 Eπ[(Q π(s, a) − Q̂(s, a; w)) ∇w Q̂(s, a; w)]

In expectation, the SGD update equals the full gradient update
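As a concrete instance, a single SGD step for a linear approximator Q̂(s, a; w) = w·x(s, a), assuming the target value is given (e.g., by the oracle from the previous slide); the function and argument names are illustrative assumptions.

    import numpy as np

    def sgd_step(w, x_sa, q_target, alpha=0.01):
        """One SGD step on the squared error (q_target - w.x)^2 for a linear Q approximator."""
        q_hat = w @ x_sa                          # current estimate Q_hat(s, a; w)
        grad = -2.0 * (q_target - q_hat) * x_sa   # gradient of the squared loss w.r.t. w
        return w - 0.5 * alpha * grad             # w <- w - (1/2) * alpha * grad, as on the slide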
Table of Contents

Generalized Policy Improvement


Monte-Carlo Control with Tabular Representations
Greedy in the Limit of Infinite Exploration
Temporal Difference Methods for Control

1 Model Free Value Function Approximation


Policy Evaluation
Monte Carlo Policy Evaluation
Temporal Difference TD(0) Policy Evaluation
Control using General Value Function Approximators
Deep Q-Learning

Model Free VFA Policy Evaluation

No oracle to tell true Q π (s, a) for any state s and action a


Use model-free state-action value function approximation

Model Free VFA Prediction / Policy Evaluation

Recall model-free policy evaluation (Lecture 3)


Following a fixed policy π (or had access to prior data)
Goal is to estimate V π and/or Q π
Maintained a lookup table to store estimates V π and/or Q π
Updated these estimates after each episode (Monte Carlo methods)
or after each step (TD methods)
Now: in value function approximation, change the estimate
update step to include fitting the function approximator

Table of Contents

Generalized Policy Improvement


Monte-Carlo Control with Tabular Representations
Greedy in the Limit of Infinite Exploration
Temporal Difference Methods for Control

1 Model Free Value Function Approximation


Policy Evaluation
Monte Carlo Policy Evaluation
Temporal Difference TD(0) Policy Evaluation
Control using General Value Function Approximators
Deep Q-Learning

Monte Carlo Value Function Approximation

Return Gt is an unbiased but noisy sample of the true expected return


Q π (st , at )
Therefore can reduce MC VFA to doing supervised learning on a set
of (state,action,return) pairs:
⟨(s1 , a1 ), G1 ⟩, ⟨(s2 , a2 ), G2 ⟩, . . . , ⟨(sT , aT ), GT ⟩
Substitute Gt for the true Q π (st , at ) when fit function approximator

MC Value Function Approximation for Policy Evaluation

1: Initialize w, k = 1
2: loop
3:   Sample k-th episode (sk,1, ak,1, rk,1, sk,2, . . . , sk,Lk) given π
4:   for t = 1, . . . , Lk do
5:     if First visit to (s, a) in episode k then
6:       Gt(s, a) = Σ_{j=t}^{Lk} rk,j
7:       ∇w J(w) = −2 [Gt(s, a) − Q̂(st, at; w)] ∇w Q̂(st, at; w)   (Compute gradient)
8:       Update weights ∆w
9:     end if
10:   end for
11:  k = k + 1
12: end loop
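A minimal Python sketch of this first-visit MC policy evaluation with a linear Q̂(s, a; w) = w·x(s, a), keeping the slide's undiscounted return; the features function, environment interface, and policy callable are illustrative assumptions.

    import numpy as np

    def mc_vfa_evaluation(env, policy, features, dim, num_episodes=1000, alpha=0.01):
        """First-visit MC policy evaluation with a linear function approximator."""
        w = np.zeros(dim)
        for _ in range(num_episodes):
            # Roll out one episode under the fixed policy
            episode, s, done = [], env.reset(), False
            while not done:
                a = policy(s)
                s_next, r, done = env.step(a)
                episode.append((s, a, r))
                s = s_next
            rewards = [r for (_, _, r) in episode]
            returns_to_go = np.cumsum(rewards[::-1])[::-1]   # undiscounted G_t, as on the slide
            seen = set()
            for t, (s, a, _) in enumerate(episode):
                if (s, a) in seen:
                    continue                                  # first-visit only
                seen.add((s, a))
                x = features(s, a)
                grad = -2.0 * (returns_to_go[t] - w @ x) * x  # gradient of the squared error
                w = w - 0.5 * alpha * grad                    # SGD step, matching the slide convention
        return w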

Table of Contents

Generalized Policy Improvement


Monte-Carlo Control with Tabular Representations
Greedy in the Limit of Infinite Exploration
Temporal Difference Methods for Control

1 Model Free Value Function Approximation


Policy Evaluation
Monte Carlo Policy Evaluation
Temporal Difference TD(0) Policy Evaluation
Control using General Value Function Approximators
Deep Q-Learning

Recall: Temporal Difference Learning w/ Lookup Table

Uses bootstrapping and sampling to approximate V π


Updates V π (s) after each transition (s, a, r , s ′ ):

V π (s) = V π (s) + α(r + γV π (s ′ ) − V π (s))

Target is r + γV π (s ′ ), a biased estimate of the true value V π (s)


Represent value for each state with a separate table entry
Note: Unlike MC we will focus on V instead of Q for policy
evaluation here, because there are more ways to create TD targets
from Q values than V values

Temporal Difference TD(0) Learning with Value Function
Approximation

Uses bootstrapping and sampling to approximate true V π


Updates estimate V π (s) after each transition (s, a, r , s ′ ):

V π (s) = V π (s) + α(r + γV π (s ′ ) − V π (s))

Target is r + γV π (s ′ ), a biased estimate of the true value V π (s)


In value function approximation, target is r + γ V̂ π (s ′ ; w ), a biased
and approximated estimate of the true value V π (s)
3 forms of approximation:
1 Sampling
2 Bootstrapping
3 Value function approximation

Temporal Difference TD(0) Learning with Value Function
Approximation

In value function approximation, target is r + γ V̂ π (s ′ ; w ), a biased


and approximated estimate of the true value V π (s)
Can reduce doing TD(0) learning with value function approximation
to supervised learning on a set of data pairs:
⟨s1 , r1 + γ V̂ π (s2 ; w )⟩, ⟨s2 , r2 + γ V̂ (s3 ; w )⟩, . . .
Find weights to minimize mean squared error

J(w) = Eπ[(rj + γ V̂ π(sj+1; w) − V̂ π(sj; w))²]

Use stochastic gradient descent, as in MC methods

TD(0) Value Function Approximation for Policy Evaluation

1: Initialize w, s
2: loop
3: Given s, sample a ∼ π(s), observe r(s, a) and s′ ∼ p(s′|s, a)
4: ∇w J(w) = −2 [r + γ V̂(s′; w) − V̂(s; w)] ∇w V̂(s; w)
5: Update weights ∆w
6: if s ′ is not a terminal state then
7: Set s = s ′
8: else
9: Restart episode, sample initial state s
10: end if
11: end loop
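A minimal Python sketch of this TD(0) loop with a linear V̂(s; w) = w·x(s), treating the bootstrapped target as a constant when differentiating, which matches the gradient on line 4; the features function, environment interface, and policy callable are illustrative assumptions.

    import numpy as np

    def td0_vfa_evaluation(env, policy, features, dim, num_steps=100_000, alpha=0.01, gamma=0.99):
        """TD(0) policy evaluation with a linear value function approximator."""
        w = np.zeros(dim)
        s = env.reset()
        for _ in range(num_steps):
            a = policy(s)
            s_next, r, done = env.step(a)
            x = features(s)
            v_next = 0.0 if done else w @ features(s_next)
            td_error = r + gamma * v_next - w @ x    # TD target minus current estimate
            w = w + alpha * td_error * x             # update toward the TD target
            s = env.reset() if done else s_next      # restart the episode at terminal states
        return w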

