
Lecture 2: Making Sequences of Good Decisions Given a Model of the World

Emma Brunskill

CS234 Reinforcement Learning

Spring 2024

L2N1 Quick Check Your Understanding 1. Participation Poll

In a Markov decision process, a large discount factor γ means that short-term rewards are much more influential than long-term rewards. [Enter your answer in the participation poll]
True
False
Don't know
Question for today's lecture (not for poll): Can we construct algorithms for computing decision policies such that, with additional computation / iterations, we are guaranteed to monotonically improve the decision policy? Do all algorithms satisfy this property?

L2N1 Quick Check Your Understanding 1. Participation Poll
In a Markov decision process, a large discount factor γ means that short-term rewards are much more influential than long-term rewards. [Enter your answer in the poll]
True
False
Don't know
False. A large γ implies we weight delayed / long-term rewards more; γ = 0 only values immediate rewards.

Question for today's lecture (not for poll): Can we construct algorithms for computing decision policies such that, with additional computation / iterations, we are guaranteed to monotonically improve the decision policy? Do all algorithms satisfy this property?
Yes, it is possible! We will see this today. Not all algorithms do.

Class Tasks and Updates

Homework 1 out shortly. Due Friday April 12 at 6pm.


Office hours will start Friday. See Ed for days, times of group and 1:1
office hours, and we will update the calendar on the website with
locations shortly.

Today’s Plan

Last Time:
Introduction
Components of an agent: model, value, policy
This Time:
Making good decisions given a Markov decision process
Next Time:
Policy evaluation when we don't have a model of how the world works

Today: Given a model of the world

Markov Processes (last time)


Markov Reward Processes (MRPs) (continue from last time)
Markov Decision Processes (MDPs)
Evaluation and Control in MDPs

Return & Value Function

Definition of Horizon (H)


Number of time steps in each episode
Can be infinite
Otherwise called finite Markov reward process
Definition of Return, Gt (for a MRP)
Discounted sum of rewards from time step t to horizon H

G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{H-1} r_{t+H-1}

Definition of State Value Function, V (s) (for a MRP)


Expected return from starting in state s

V(s) = E[G_t \mid s_t = s] = E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{H-1} r_{t+H-1} \mid s_t = s]

Computing the Value of an Infinite Horizon Markov Reward Process

Markov property provides structure


MRP value function satisfies

    V(s) = \underbrace{R(s)}_{\text{immediate reward}} + \underbrace{\gamma \sum_{s' \in S} P(s'|s) V(s')}_{\text{discounted sum of future rewards}}

Matrix Form of Bellman Equation for MRP

For finite state MRP, we can express V(s) using a matrix equation

    \begin{bmatrix} V(s_1) \\ \vdots \\ V(s_N) \end{bmatrix} =
    \begin{bmatrix} R(s_1) \\ \vdots \\ R(s_N) \end{bmatrix} + \gamma
    \begin{bmatrix} P(s_1|s_1) & \cdots & P(s_N|s_1) \\ P(s_1|s_2) & \cdots & P(s_N|s_2) \\ \vdots & \ddots & \vdots \\ P(s_1|s_N) & \cdots & P(s_N|s_N) \end{bmatrix}
    \begin{bmatrix} V(s_1) \\ \vdots \\ V(s_N) \end{bmatrix}

    V = R + \gamma P V

Analytic Solution for Value of MRP

For finite state MRP, we can express V(s) using a matrix equation

    \begin{bmatrix} V(s_1) \\ \vdots \\ V(s_N) \end{bmatrix} =
    \begin{bmatrix} R(s_1) \\ \vdots \\ R(s_N) \end{bmatrix} + \gamma
    \begin{bmatrix} P(s_1|s_1) & \cdots & P(s_N|s_1) \\ P(s_1|s_2) & \cdots & P(s_N|s_2) \\ \vdots & \ddots & \vdots \\ P(s_1|s_N) & \cdots & P(s_N|s_N) \end{bmatrix}
    \begin{bmatrix} V(s_1) \\ \vdots \\ V(s_N) \end{bmatrix}

    V = R + \gamma P V
    V - \gamma P V = R
    (I - \gamma P) V = R
    V = (I - \gamma P)^{-1} R

Solving directly requires taking a matrix inverse, ~ O(N^3)
Requires that (I − γP) is invertible
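As a concrete illustration, here is a minimal NumPy sketch (not from the slides) that solves V = (I − γP)^{-1} R directly for the Mars rover MRP that appears later in this deck (transition matrix P, rewards +1 in s1 and +10 in s7, γ = 1/2):

```python
import numpy as np

# Mars rover MRP (see the Markov chain slides later in this deck):
# interior states move left/right with prob 0.4 and stay with prob 0.2;
# the end states stay with prob 0.6. Reward: +1 in s1, +10 in s7, 0 elsewhere.
P = np.array([
    [0.6, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.2, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.2, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.2, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.2, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.2, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.6],
])
R = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0])
gamma = 0.5

# V = (I - gamma * P)^{-1} R; solve the linear system rather than forming the inverse
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(np.round(V, 3))
```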
Iterative Algorithm for Computing Value of a MRP

Dynamic programming
Initialize V0 (s) = 0 for all s
For k = 1 until convergence
For all s in S
    V_k(s) = R(s) + \gamma \sum_{s' \in S} P(s'|s) V_{k-1}(s')

Computational complexity: O(|S|^2) for each iteration (|S| = N)
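A sketch of this iterative dynamic-programming computation; the function is generic, and the commented usage assumes the P, R, and gamma arrays defined in the sketch above:

```python
import numpy as np

def mrp_value_iterative(P, R, gamma, tol=1e-8):
    """Iteratively compute the MRP value function by repeated Bellman backups."""
    V = np.zeros(len(R))            # V_0(s) = 0 for all s
    while True:
        V_new = R + gamma * P @ V   # V_k(s) = R(s) + gamma * sum_s' P(s'|s) V_{k-1}(s')
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Should agree with the analytic solution above (up to tol):
# print(mrp_value_iterative(P, R, gamma))
```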

Markov Decision Process (MDP)

Markov Decision Process is Markov Reward Process + actions


Definition of MDP
S is a (finite) set of Markov states s ∈ S
A is a (finite) set of actions a ∈ A
P is the dynamics/transition model for each action, which specifies P(s_{t+1} = s' | s_t = s, a_t = a)
R is a reward function[1]

    R(s_t = s, a_t = a) = E[r_t \mid s_t = s, a_t = a]

Discount factor γ ∈ [0, 1]


MDP is a tuple: (S, A, P, R, γ)

[1] Reward is sometimes defined as a function of the current state, or as a function of the (state, action, next state) tuple. Most frequently in this class, we will assume reward is a function of state and action.
Example: Mars Rover MDP

[Figure: Mars rover with seven states s1, ..., s7]

    P(s'|s, a_1) = \begin{bmatrix}
    1 & 0 & 0 & 0 & 0 & 0 & 0 \\
    1 & 0 & 0 & 0 & 0 & 0 & 0 \\
    0 & 1 & 0 & 0 & 0 & 0 & 0 \\
    0 & 0 & 1 & 0 & 0 & 0 & 0 \\
    0 & 0 & 0 & 1 & 0 & 0 & 0 \\
    0 & 0 & 0 & 0 & 1 & 0 & 0 \\
    0 & 0 & 0 & 0 & 0 & 1 & 0
    \end{bmatrix}
    \qquad
    P(s'|s, a_2) = \begin{bmatrix}
    0 & 1 & 0 & 0 & 0 & 0 & 0 \\
    0 & 0 & 1 & 0 & 0 & 0 & 0 \\
    0 & 0 & 0 & 1 & 0 & 0 & 0 \\
    0 & 0 & 0 & 0 & 1 & 0 & 0 \\
    0 & 0 & 0 & 0 & 0 & 1 & 0 \\
    0 & 0 & 0 & 0 & 0 & 0 & 1 \\
    0 & 0 & 0 & 0 & 0 & 0 & 1
    \end{bmatrix}
2 deterministic actions
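For reference, a small NumPy sketch of these two deterministic transition matrices, under the interpretation that a1 moves left and a2 moves right (variable names are illustrative, not from the slides):

```python
import numpy as np

n_states = 7
P_a1 = np.zeros((n_states, n_states))  # a1: move left deterministically (stay at s1)
P_a2 = np.zeros((n_states, n_states))  # a2: move right deterministically (stay at s7)
for s in range(n_states):
    P_a1[s, max(s - 1, 0)] = 1.0
    P_a2[s, min(s + 1, n_states - 1)] = 1.0
P_mdp = np.stack([P_a1, P_a2])  # shape (num_actions, num_states, num_states)
```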
MDP Policies

Policy specifies what action to take in each state


Can be deterministic or stochastic
For generality, consider as a conditional distribution
Given a state, specifies a distribution over actions
Policy: π(a|s) = P(a_t = a \mid s_t = s)

MDP + Policy

MDP + π(a|s) = Markov Reward Process


Precisely, it is the MRP (S, R^π, P^π, γ), where

    R^\pi(s) = \sum_{a \in A} \pi(a|s) R(s, a)

    P^\pi(s'|s) = \sum_{a \in A} \pi(a|s) P(s'|s, a)

Implies we can use the same techniques to evaluate the value of a policy for an MDP as we used to compute the value of an MRP, by defining the MRP with R^π and P^π (a code sketch follows below)
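A sketch of this construction in NumPy, assuming (as in the sketches above) a transition tensor P[a, s, s'], a reward array R[s, a], and a stochastic policy pi[s, a]; this array layout is an assumption for illustration, not something fixed by the slides:

```python
import numpy as np

def mdp_plus_policy_to_mrp(P, R, pi):
    """Collapse an MDP and a stochastic policy pi(a|s) into the induced MRP (R_pi, P_pi)."""
    R_pi = np.einsum('sa,sa->s', pi, R)    # R^pi(s)    = sum_a pi(a|s) R(s, a)
    P_pi = np.einsum('sa,asx->sx', pi, P)  # P^pi(s'|s) = sum_a pi(a|s) P(s'|s, a)
    return R_pi, P_pi
```

Any MRP evaluation routine (analytic or iterative, as sketched earlier) can then be run on (R_pi, P_pi).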

MDP Policy Evaluation, Iterative Algorithm

Initialize V_0(s) = 0 for all s
For k = 1 until convergence
  For all s in S

    V_k^\pi(s) = \sum_a \pi(a|s) \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_{k-1}^\pi(s') \right]

This is a Bellman backup for a particular policy

Note that if the policy is deterministic then the above update simplifies to

    V_k^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi(s)) V_{k-1}^\pi(s')
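A minimal sketch of this iterative policy evaluation under the same hypothetical array conventions (P[a, s, s'], R[s, a], pi[s, a]):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iteratively evaluate V^pi via repeated Bellman backups for the fixed policy pi."""
    V = np.zeros(R.shape[0])                           # V_0(s) = 0
    while True:
        # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        Q = R + gamma * np.einsum('asx,x->sa', P, V)
        V_new = np.sum(pi * Q, axis=1)                 # V_k(s) = sum_a pi(a|s) Q[s, a]
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```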

Exercise L2E1: MDP 1 Iteration of Policy Evaluation, Mars Rover Example

Dynamics: p(s6 |s6 , a1 ) = 0.5, p(s7 |s6 , a1 ) = 0.5, . . .


Reward: for all actions, +1 in state s1 , +10 in state s7 , 0 otherwise
Let π(s) = a1 ∀s, assume Vk =[1 0 0 0 0 0 10] and k = 1, γ = 0.5
Compute Vk+1 (s6 )
See answer at the end of the slide deck. If you’d like practice, work this
out and then check your answers.

Check Your Understanding Poll L2N2

[Figure: Mars rover with seven states s1, ..., s7]

We will shortly be interested in not just evaluating the value of a


single policy, but finding an optimal policy. Given this it is informative
to think about properties of the potential policy space.
First for the Mars rover example [ 7 discrete states (location of
rover); 2 actions: Left or Right]
How many deterministic policies are there?
Select answer on the participation poll: 2 / 14 / 7^2 / 2^7 / Not sure
Is the optimal policy (one with highest value) for a MDP unique?
Select answer on the participation poll: Yes / No / Not sure
Check Your Understanding L2N2

[Figure: Mars rover with seven states s1, ..., s7]

7 discrete states (location of rover)


2 actions: Left or Right
How many deterministic policies are there?
2^7 = 128

Is the highest reward policy for a MDP always unique?


No, there may be two policies with the same (maximal) value
function.

MDP Control

Compute the optimal policy

    \pi^*(s) = \arg\max_\pi V^\pi(s)

There exists a unique optimal value function


Optimal policy for a MDP in an infinite horizon problem is
deterministic

MDP Control

Compute the optimal policy

    \pi^*(s) = \arg\max_\pi V^\pi(s)

There exists a unique optimal value function


Optimal policy for an MDP in an infinite horizon problem (agent acts forever) is
Deterministic
Stationary (does not depend on time step)
Unique? Not necessarily, may have two policies with identical (optimal)
values

Policy Search

One option is searching to compute best policy


Number of deterministic policies is |A|^{|S|}
Policy iteration is generally more efficient than enumeration

MDP Policy Iteration (PI)

Set i = 0
Initialize π0 (s) randomly for all states s
While i == 0 or ∥πi − πi−1 ∥1 > 0 (L1-norm, measures if the policy
changed for any state):
V πi ← MDP V function policy evaluation of πi
πi+1 ← Policy improvement
i = i + 1

New Definition: State-Action Value Q

State-action value of a policy


    Q^\pi(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V^\pi(s')

Take action a, then follow the policy π

Policy Improvement

Compute state-action value of a policy πi


For s in S and a in A:

    Q^{\pi_i}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V^{\pi_i}(s')

Compute new policy πi+1, for all s ∈ S

    \pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s, a) \quad \forall s \in S

MDP Policy Iteration (PI)

Set i = 0
Initialize π0 (s) randomly for all states s
While i == 0 or ∥πi − πi−1 ∥1 > 0 (L1-norm, measures if the policy
changed for any state):
V πi ← MDP V function policy evaluation of πi
πi+1 ← Policy improvement
i = i + 1
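Putting evaluation and improvement together, a sketch of policy iteration for deterministic policies, again under the hypothetical P[a, s, s'] and R[s, a] array conventions used in the earlier sketches; here the evaluation step is solved exactly as a linear system rather than iteratively, which is a design choice rather than the only option:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration for a finite MDP; returns a deterministic policy and its value."""
    num_actions, num_states, _ = P.shape
    policy = np.zeros(num_states, dtype=int)             # arbitrary initial policy pi_0
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V for the current policy
        R_pi = R[np.arange(num_states), policy]
        P_pi = P[policy, np.arange(num_states), :]
        V = np.linalg.solve(np.eye(num_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to Q^{pi_i}
        Q = R + gamma * np.einsum('asx,x->sa', P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):            # policy unchanged => stop
            return policy, V
        policy = new_policy
```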

Delving Deeper Into Policy Improvement Step

    Q^{\pi_i}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V^{\pi_i}(s')

Delving Deeper Into Policy Improvement Step

    Q^{\pi_i}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V^{\pi_i}(s')

    \max_a Q^{\pi_i}(s, a) \geq R(s, \pi_i(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi_i(s)) V^{\pi_i}(s') = V^{\pi_i}(s)

    \pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s, a)

Suppose we take πi+1 (s) for one action, then follow πi forever
Our expected sum of rewards is at least as good as if we had always
followed πi
But new proposed policy is to always follow πi+1 ...

Monotonic Improvement in Policy

Definition
V^{\pi_1} \geq V^{\pi_2}: V^{\pi_1}(s) \geq V^{\pi_2}(s), \forall s \in S
Proposition: V^{\pi_{i+1}} \geq V^{\pi_i} with strict inequality if πi is suboptimal, where πi+1 is the new policy we get from policy improvement on πi

Proof: Monotonic Improvement in Policy

    V^{\pi_i}(s) \leq \max_a Q^{\pi_i}(s, a)
                 = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V^{\pi_i}(s') \right]

Proof: Monotonic Improvement in Policy

    V^{\pi_i}(s) \leq \max_a Q^{\pi_i}(s, a)
                 = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V^{\pi_i}(s') \right]
                 = R(s, \pi_{i+1}(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi_{i+1}(s)) V^{\pi_i}(s')    // by the definition of \pi_{i+1}
                 \leq R(s, \pi_{i+1}(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi_{i+1}(s)) \max_{a'} Q^{\pi_i}(s', a')
                 = R(s, \pi_{i+1}(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi_{i+1}(s)) \left[ R(s', \pi_{i+1}(s')) + \gamma \sum_{s'' \in S} P(s''|s', \pi_{i+1}(s')) V^{\pi_i}(s'') \right]
                 \vdots
                 = V^{\pi_{i+1}}(s)

Check Your Understanding L2N3: Policy Iteration (PI)

Note: all the below is for finite state-action spaces


Set i = 0
Initialize π0 (s) randomly for all states s
While i == 0 or ∥πi − πi−1 ∥1 > 0 (L1-norm, measures if the policy
changed for any state):
V πi ← MDP V function policy evaluation of πi
πi+1 ← Policy improvement
i = i + 1
If policy doesn’t change, can it ever change again?
Select on participation poll: Yes / No / Not sure
Is there a maximum number of iterations of policy iteration?
Select on participation poll: Yes / No / Not sure

Lecture Break after Policy Iteration

Results for Check Your Understanding L2N3 Policy Iteration
Note: all the below is for finite state-action spaces
Set i = 0
Initialize π0 (s) randomly for all states s
While i == 0 or ∥πi − πi−1 ∥1 > 0 (L1-norm, measures if the policy
changed for any state):
V πi ← MDP V function policy evaluation of πi
πi+1 ← Policy improvement
i = i + 1
If policy doesn’t change, can it ever change again?
No

Is there a maximum number of iterations of policy iteration?


|A|^{|S|}, since that is the maximum number of policies, and as the policy
improvement step is monotonically improving, each policy can only
appear in one round of policy iteration unless it is an optimal policy.

Check Your Understanding Explanation of Policy Not Changing

Suppose for all s ∈ S, πi+1(s) = πi(s)


Then for all s ∈ S and a ∈ A, Q^{\pi_{i+1}}(s, a) = Q^{\pi_i}(s, a)
Recall the policy improvement step

    Q^{\pi_i}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V^{\pi_i}(s')

    \pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s, a)

    \pi_{i+2}(s) = \arg\max_a Q^{\pi_{i+1}}(s, a) = \arg\max_a Q^{\pi_i}(s, a)

Therefore policy cannot ever change again

MDP: Computing Optimal Policy and Optimal Value

Policy iteration computes infinite horizon value of a policy and then


improves that policy
Value iteration is another technique
Idea: Maintain the optimal value of starting in a state s if we have a finite number of steps k left in the episode
Iterate to consider longer and longer episodes

Bellman Equation and Bellman Backup Operators

Value function of a policy must satisfy the Bellman equation


    V^\pi(s) = R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s'|s) V^\pi(s')

Bellman backup operator


Applied to a value function
Returns a new value function
Improves the value if possible

    BV(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V(s') \right]

BV yields a value function over all states s

Value Iteration (VI)

Set k = 1
Initialize V_0(s) = 0 for all states s
Loop until convergence: (for ex. ‖V_{k+1} − V_k‖_∞ ≤ ϵ)
  For each state s

    V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_k(s') \right]

View as Bellman backup on value function

    V_{k+1} = B V_k

    \pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_k(s') \right]
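A corresponding sketch of value iteration, under the same hypothetical array conventions as the earlier code (P[a, s, s'], R[s, a]):

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Value iteration: repeatedly apply the Bellman backup B until V stops changing."""
    num_actions, num_states, _ = P.shape
    V = np.zeros(num_states)                          # V_0(s) = 0
    while True:
        Q = R + gamma * np.einsum('asx,x->sa', P, V)  # Q_{k+1}[s, a]
        V_new = Q.max(axis=1)                         # V_{k+1} = B V_k
        if np.max(np.abs(V_new - V)) <= eps:
            return Q.argmax(axis=1), V_new            # greedy policy, value estimate
        V = V_new
```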

Policy Iteration as Bellman Operations

Bellman backup operator B^π for a particular policy is defined as

    B^\pi V(s) = R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s'|s) V(s')

Policy evaluation amounts to computing the fixed point of B^π
To do policy evaluation, repeatedly apply the operator until V stops changing

    V^\pi = B^\pi B^\pi \cdots B^\pi V

Policy Iteration as Bellman Operations

Bellman backup operator B^π for a particular policy is defined as

    B^\pi V(s) = R^\pi(s) + \gamma \sum_{s' \in S} P^\pi(s'|s) V(s')

To do policy improvement

    \pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V^{\pi_k}(s') \right]

Going Back to Value Iteration (VI)
Set k = 1
Initialize V_0(s) = 0 for all states s
Loop until convergence: (for ex. ‖V_{k+1} − V_k‖_∞ ≤ ϵ)
  For each state s

    V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_k(s') \right]

Equivalently, in Bellman backup notation

    V_{k+1} = B V_k

To extract the optimal policy if we can act for k + 1 more steps,

    \pi(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_{k+1}(s') \right]

Contraction Operator

Let O be an operator, and |x| denote (any) norm of x

If |OV − OV'| ≤ |V − V'|, then O is a contraction operator

Will Value Iteration Converge?

Yes, if the discount factor γ < 1, or if we end up in a terminal state with probability 1
Bellman backup is a contraction if the discount factor γ < 1
If we apply it to two different value functions, the distance between the value functions shrinks after applying the Bellman backup to each

Proof: Bellman Backup is a Contraction on V for γ < 1

Let ‖V − V'‖ = max_s |V(s) − V'(s)| be the infinity norm

    \|BV_k - BV_j\| = \left| \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_k(s') \right] - \max_{a'} \left[ R(s, a') + \gamma \sum_{s' \in S} P(s'|s, a') V_j(s') \right] \right|

Proof: Bellman Backup is a Contraction on V for γ < 1

Let ‖V − V'‖ = max_s |V(s) − V'(s)| be the infinity norm

    \|BV_k - BV_j\| = \left| \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_k(s') \right] - \max_{a'} \left[ R(s, a') + \gamma \sum_{s' \in S} P(s'|s, a') V_j(s') \right] \right|
                 \leq \left| \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_k(s') - R(s, a) - \gamma \sum_{s' \in S} P(s'|s, a) V_j(s') \right] \right|
                    = \left| \max_a \gamma \sum_{s' \in S} P(s'|s, a) \big( V_k(s') - V_j(s') \big) \right|
                 \leq \left| \max_a \gamma \sum_{s' \in S} P(s'|s, a) \|V_k - V_j\| \right|
                    = \left| \max_a \gamma \|V_k - V_j\| \sum_{s' \in S} P(s'|s, a) \right|
                    = \gamma \|V_k - V_j\|

Note: Even if all inequalities are equalities, this is still a contraction if γ < 1
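As an optional numerical sanity check (not part of the proof), one can apply the Bellman backup to two arbitrary value functions on a randomly generated MDP and confirm the gap shrinks by at least a factor of γ; this sketch uses the same hypothetical array layout as the earlier code:

```python
import numpy as np

def bellman_backup(P, R, V, gamma):
    """One Bellman optimality backup: (BV)(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]."""
    return (R + gamma * np.einsum('asx,x->sa', P, V)).max(axis=1)

rng = np.random.default_rng(0)
num_actions, num_states, gamma = 2, 7, 0.9
P = rng.dirichlet(np.ones(num_states), size=(num_actions, num_states))  # random dynamics
R = rng.uniform(0, 10, size=(num_states, num_actions))                  # random rewards
V1, V2 = rng.normal(size=num_states), rng.normal(size=num_states)
gap_before = np.max(np.abs(V1 - V2))
gap_after = np.max(np.abs(bellman_backup(P, R, V1, gamma) - bellman_backup(P, R, V2, gamma)))
print(gap_after <= gamma * gap_before + 1e-12)  # expected: True
```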

Opportunities for Out-of-Class Practice

Prove value iteration converges to a unique solution for discrete state and action spaces with γ < 1
Does the initialization of values in value iteration impact anything?
Is the value of the policy extracted from value iteration at each round guaranteed to monotonically improve (if executed in the real infinite horizon problem), like policy iteration?
[GPT]
- No, the value of the policy extracted from value iteration at each round is not guaranteed to monotonically improve in the real infinite-horizon problem. This contrasts with policy iteration, where the policy's value is guaranteed to improve or remain the same at each step.
- Reason: The value function Vk(s) in value iteration is only an intermediate estimate and does not necessarily represent the value of a valid policy.
- Greedily extracting a policy at intermediate steps can sometimes lead to policies that perform worse in the real infinite-horizon problem compared to those extracted at earlier iterations.

Value Iteration for Finite Horizon H

V_k = optimal value if making k more decisions
π_k = optimal policy if making k more decisions
Initialize V_0(s) = 0 for all states s
For k = 1 : H
  For each state s

    V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_k(s') \right]

    \pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_k(s') \right]

Computing the Value of a Policy in a Finite Horizon

Alternatively, can estimate by simulation (see the code sketch after the Mars rover example below)

Generate a large number of episodes
Average returns
Concentration inequalities (e.g. Hoeffding's inequality, Bernstein's inequality) bound how quickly the average concentrates to the expected value

Requires no assumption of Markov structure

Example: Mars Rover

[Figure: Mars rover Markov chain over states s1, ..., s7; each interior state moves left or right with probability 0.4 and stays with probability 0.2, and the end states s1 and s7 stay with probability 0.6]

Reward: +1 in s1, +10 in s7, 0 in all other states

Sample returns for sample 4-step (H = 4) episodes, γ = 1/2
  s4, s5, s6, s7: 0 + (1/2) × 0 + (1/4) × 0 + (1/8) × 10 = 1.25

Example: Mars Rover

[Figure: Mars rover Markov chain over states s1, ..., s7; each interior state moves left or right with probability 0.4 and stays with probability 0.2, and the end states s1 and s7 stay with probability 0.6]

Reward: +1 in s1, +10 in s7, 0 in all other states

Sample returns for sample 4-step (H = 4) episodes, start state s4, γ = 1/2
  s4, s5, s6, s7: 0 + (1/2) × 0 + (1/4) × 0 + (1/8) × 10 = 1.25
  s4, s4, s5, s4: 0 + (1/2) × 0 + (1/4) × 0 + (1/8) × 0 = 0
  s4, s3, s2, s1: 0 + (1/2) × 0 + (1/4) × 0 + (1/8) × 1 = 0.125
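A sketch of the simulation-based estimate described two slides back, using the Mars rover chain P and reward vector R from the analytic-solution sketch earlier (start state s4 is index 3, H = 4, γ = 1/2):

```python
import numpy as np

def mc_value_estimate(P, R, gamma, start, horizon, num_episodes=100_000, seed=0):
    """Monte Carlo estimate of V(start): average the discounted return over sampled episodes."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_episodes):
        s, ret, discount = start, 0.0, 1.0
        for _ in range(horizon):
            ret += discount * R[s]            # reward for the state visited at this step
            s = rng.choice(len(R), p=P[s])    # sample the next state from P(.|s)
            discount *= gamma
        total += ret
    return total / num_episodes

# Example: estimate V(s4) for the 4-step horizon with gamma = 0.5
# print(mc_value_estimate(P, R, gamma=0.5, start=3, horizon=4))
```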

Question: Finite Horizon Policies

Set k = 1
Initialize V_0(s) = 0 for all states s
Loop until k == H:
  For each state s

    V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_k(s') \right]

    \pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_k(s') \right]

Is the optimal policy stationary (independent of time step) in finite horizon tasks?

Question: Finite Horizon Policies

Set k = 1
Initialize V_0(s) = 0 for all states s
Loop until k == H:
  For each state s

    V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_k(s') \right]

    \pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_k(s') \right]

Is the optimal policy stationary (independent of time step) in finite horizon tasks?
In general, no.

Value vs Policy Iteration

Value iteration:
Compute optimal value for horizon = k
Note this can be used to compute optimal policy if horizon = k
Increment k
Policy iteration
Compute infinite horizon value of a policy
Use to select another (better) policy
Closely related to a very popular method in RL: policy gradient

RL Terminology: Models, Policies, Values

Model: Mathematical models of dynamics and reward


Policy: Function mapping states to actions
Value function: expected discounted sum of future rewards from being in a state (and/or taking an action) when following a particular policy

What You Should Know

Define MP, MRP, MDP, Bellman operator, contraction, model,


Q-value, policy
Be able to implement
Value Iteration
Policy Iteration
Give pros and cons of different policy evaluation approaches
Be able to prove contraction properties
Limitations of presented approaches and Markov assumptions
Which policy evaluation methods require the Markov assumption?

Where We Are

Last Time:
Introduction
Components of an agent: model, value, policy
This Time:
Making good decisions given a Markov decision process
Next Time:
Policy evaluation when we don't have a model of how the world works

Exercise L2E1: MDP 1 Iteration of Policy Evaluation, Mars Rover Example, Answer

Dynamics: p(s6 |s6 , a1 ) = 0.5, p(s7 |s6 , a1 ) = 0.5, . . .


Reward: for all actions, +1 in state s1 , +10 in state s7 , 0 otherwise
Let π(s) = a1 ∀s, assume Vk =[1 0 0 0 0 0 10] and k = 1, γ = 0.5
Compute Vk+1 (s6 )
    V_{k+1}(s_6) = r(s_6) + \gamma \sum_{s'} p(s'|s_6, a_1) V_k(s')
                 = 0 + 0.5 × (0.5 × 10 + 0.5 × 0)
                 = 2.5

Check Your Understanding L2N1: MDP 1 Iteration of Policy Evaluation, Mars Rover Example
Dynamics: p(s6|s6, a1) = 0.5, p(s7|s6, a1) = 0.5, . . .
Reward: for all actions, +1 in state s1, +10 in state s7, 0 otherwise
Let π(s) = a1 ∀s, assume V_k = [1 0 0 0 0 0 10] and k = 1, γ = 0.5

    V_k^\pi(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} p(s'|s, \pi(s)) V_{k-1}^\pi(s')

    V_{k+1}(s_6) = r(s_6, a_1) + \gamma \times 0.5 \times V_k(s_6) + \gamma \times 0.5 \times V_k(s_7)
                 = 0 + 0.5 × 0.5 × 0 + 0.5 × 0.5 × 10
                 = 2.5
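A tiny self-contained check of this arithmetic in Python:

```python
# gamma = 0.5, V_k = [1, 0, 0, 0, 0, 0, 10], reward 0 in s6,
# p(s6|s6, a1) = p(s7|s6, a1) = 0.5 (only the s6 row of the dynamics is needed)
gamma = 0.5
V_k = [1, 0, 0, 0, 0, 0, 10]
v_next_s6 = 0 + gamma * (0.5 * V_k[5] + 0.5 * V_k[6])
print(v_next_s6)  # 2.5
```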

Full Observability: Markov Decision Process (MDP)

MDPs can model a huge number of interesting problems and settings


Bandits: single state MDP
Optimal control mostly about continuous-state MDPs
Partially observable MDPs = MDP where state is history

Recall: Markov Property

Information state: sufficient statistic of history


State st is Markov if and only if:

    p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid h_t, a_t)

Future is independent of past given present

Markov Process or Markov Chain

Memoryless random process


Sequence of random states with Markov property
Definition of Markov Process
S is a (finite) set of states (s ∈ S)
P is a dynamics/transition model that specifies p(s_{t+1} = s' | s_t = s)
Note: no rewards, no actions
If finite number (N) of states, can express P as a matrix

    P = \begin{bmatrix}
    P(s_1|s_1) & P(s_2|s_1) & \cdots & P(s_N|s_1) \\
    P(s_1|s_2) & P(s_2|s_2) & \cdots & P(s_N|s_2) \\
    \vdots & \vdots & \ddots & \vdots \\
    P(s_1|s_N) & P(s_2|s_N) & \cdots & P(s_N|s_N)
    \end{bmatrix}

Example: Mars Rover Markov Chain Transition Matrix, P

[Figure: Mars rover Markov chain over states s1, ..., s7, with the transition probabilities shown in the matrix below]

    P = \begin{bmatrix}
    0.6 & 0.4 & 0   & 0   & 0   & 0   & 0   \\
    0.4 & 0.2 & 0.4 & 0   & 0   & 0   & 0   \\
    0   & 0.4 & 0.2 & 0.4 & 0   & 0   & 0   \\
    0   & 0   & 0.4 & 0.2 & 0.4 & 0   & 0   \\
    0   & 0   & 0   & 0.4 & 0.2 & 0.4 & 0   \\
    0   & 0   & 0   & 0   & 0.4 & 0.2 & 0.4 \\
    0   & 0   & 0   & 0   & 0   & 0.4 & 0.6
    \end{bmatrix}

Example: Mars Rover Markov Chain Episodes

[Figure: Mars rover Markov chain over states s1, ..., s7; each interior state moves left or right with probability 0.4 and stays with probability 0.2, and the end states s1 and s7 stay with probability 0.6]

Example: Sample episodes starting from S4


s4 , s5 , s6 , s7 , s7 , s7 , . . .
s4 , s4 , s5 , s4 , s5 , s6 , . . .
s4 , s3 , s2 , s1 , . . .
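A small sketch of how such episodes can be sampled from the transition matrix P above (assuming the NumPy P from the earlier Mars rover sketch; state s4 is index 3):

```python
import numpy as np

def sample_episode(P, start, length, seed=None):
    """Sample a state trajectory of the given length from a Markov chain with matrix P[s, s']."""
    rng = np.random.default_rng(seed)
    states = [start]
    for _ in range(length - 1):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

# e.g. sample_episode(P, start=3, length=6) might produce [3, 4, 5, 6, 6, 6],
# i.e. the episode s4, s5, s6, s7, s7, s7
```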

Markov Reward Process (MRP)

Markov Reward Process is a Markov Chain + rewards


Definition of Markov Reward Process (MRP)
S is a (finite) set of states (s ∈ S)
P is a dynamics/transition model that specifies P(s_{t+1} = s' | s_t = s)
R is a reward function R(st = s) = E[rt |st = s]
Discount factor γ ∈ [0, 1]
Note: no actions
If finite number (N) of states, can express R as a vector

Example: Mars Rover MRP

[Figure: Mars rover Markov chain over states s1, ..., s7; each interior state moves left or right with probability 0.4 and stays with probability 0.2, and the end states s1 and s7 stay with probability 0.6]

Reward: +1 in s1 , +10 in s7 , 0 in all other states

