17 - Markov Decision Processes
Rui Zhang
Fall 2024
1
Outline
Markov Decision Processes (MDP)
● Markov Processes
● Markov Reward Processes (MRP)
● Markov Decision Processes (MDP)
An MDP is an idealized form of the AI problem for which we have precise theoretical
results.
An MDP formally describes an environment for RL.
We introduce the key mathematical components: value functions, policies, and
Bellman equations
2
Sequential Decision Making
Agent and environment interact at discrete time steps t = 0, 1, 2, 3, ...: at each
step the agent observes a state, selects an action, and receives a scalar reward.
3
The agent-environment interface
The MDP and agent together thereby give rise to a sequence or trajectory that
begins like this:
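S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots

(standard notation: S_t are states, A_t actions, R_t rewards)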
4
Markov property
“The future is independent of the past given the present”
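Formally, a state S_t is Markov if and only if (standard definition):

\Pr[ S_{t+1} \mid S_t ] = \Pr[ S_{t+1} \mid S_1, \dots, S_t ]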
“Markov” generally means that given the present state, the future and the past are
independent
5
State Transition Matrix
For a Markov state s and successor state s′, the state transition probability is
defined by
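P_{ss'} = \Pr[ S_{t+1} = s' \mid S_t = s ]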
State transition matrix P defines transition probabilities from all states to all
successor states:
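P = \begin{bmatrix} P_{11} & \cdots & P_{1n} \\ \vdots & & \vdots \\ P_{n1} & \cdots & P_{nn} \end{bmatrix},
\quad \text{each row summing to } 1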
where n is the number of states
6
Markov Process (Markov Chain)
A Markov process is a memoryless random process, i.e. a sequence of random
states S_1, S_2, ... with the Markov property
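Formally (standard definition), a Markov process (or Markov chain) is a tuple ⟨S, P⟩:
S is a (finite) set of states and P is the state transition probability matrix.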
8
Example: student chain
9
Markov Reward Process
A Markov reward process is a Markov chain with values.
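Formally (standard definition), an MRP is a tuple ⟨S, P, R, γ⟩:
● S is a (finite) set of states
● P is the state transition probability matrix
● R is a reward function, R_s = \mathbb{E}[ R_{t+1} \mid S_t = s ]
● γ ∈ [0, 1] is a discount factor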
10
Example: student reward chain
11
The Reward Hypothesis
Agent goal: maximize the total amount of reward it receives. This means
maximizing not immediate reward, but cumulative reward in the long run.
Reward Hypothesis
That all of what we mean by goals and purposes can be well thought of as the
maximization of the cumulative sum of a received scalar signal (called reward)
Example:
● Chess: +/- for winning/losing the game
● Humanoid robot walk: + for forward motion, - for falling over
12
From Reward to Return
The objective in RL is to maximize long-term future reward
That is, to choose actions A_t so as to maximize the expected return
But what exactly should be maximized?
The discounted return at time t:
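G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}

where γ ∈ [0, 1] is the discount rate.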
Continuing tasks
● infinite number of time steps
14
Why discounted?
Most Markov reward and decision processes are discounted. Why?
● Mathematically convenient: keeps the infinite sum finite (for γ < 1) and avoids
infinite returns in cyclic Markov processes
● Uncertainty about the future may not be fully represented
● Immediate rewards may be worth more (e.g., financial interest); animal and human
behavior shows a preference for immediate reward
● If all reward sequences terminate, it is sometimes possible to use γ = 1
16
The value function
The value function v(s) in a Markov Reward Process is the expected return starting
from state s
The value function gives the long-term value of state s, estimating how
good it is for the agent to be in a given state
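v(s) = \mathbb{E}[ G_t \mid S_t = s ]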
17
How to compute value function?
The value function can be decomposed into two parts:
immediate reward + discounted value of successor state
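Unrolling the return (standard derivation):

v(s) = \mathbb{E}[ G_t \mid S_t = s ]
     = \mathbb{E}[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s ]
     = \mathbb{E}[ R_{t+1} + \gamma ( R_{t+2} + \gamma R_{t+3} + \cdots ) \mid S_t = s ]
     = \mathbb{E}[ R_{t+1} + \gamma G_{t+1} \mid S_t = s ]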
18
How to compute value function?
The value function can be decomposed into two parts:
immediate reward + discounted value of successor state
So
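v(s) = \mathbb{E}[ R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s ]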
19
Bellman Equation
We can write the Bellman Equation as:
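v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'} \, v(s')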
20
Bellman Equation in matrix form
We can write the Bellman Equation as:
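v = R + \gamma P v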
where v and R are column vectors with one entry per state and n is the
number of states
21
Bellman Equation in matrix form
The Bellman equation is a linear system.
It can be solved directly:
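v = (I - \gamma P)^{-1} R

A minimal numpy sketch of this direct solution (the 3-state chain below is a made-up
example, not the student chain from the slides):

```python
import numpy as np

def solve_mrp_value(P, R, gamma):
    """Solve the MRP Bellman equation v = R + gamma * P v directly,
    i.e. v = (I - gamma P)^{-1} R.

    P:     (n, n) state transition matrix, rows sum to 1
    R:     (n,)   expected immediate reward per state
    gamma: discount factor in [0, 1)
    """
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Made-up 3-state example; state 2 is absorbing with zero reward.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([-1.0, -2.0, 0.0])
print(solve_mrp_value(P, R, gamma=0.9))
```

Solving the linear system costs roughly O(n^3), so the direct solution is only
practical for small MRPs.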
22
Markov Decision Process
A Markov Decision Process (MDP) is a Markov Reward Process with decisions.
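Formally (standard definition), an MDP is a tuple ⟨S, A, P, R, γ⟩:
● S is a (finite) set of states and A is a (finite) set of actions
● P is the state transition probability function, P^a_{ss'} = \Pr[ S_{t+1} = s' \mid S_t = s, A_t = a ]
● R is a reward function, R^a_s = \mathbb{E}[ R_{t+1} \mid S_t = s, A_t = a ]
● γ ∈ [0, 1] is a discount factor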
23
Markov Decision Process
A Markov decision process (MDP) is a Markov Reward Process with decisions.
If state, action, reward sets are finite, it is a finite MDP.
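For a finite MDP, the one-step dynamics can be summarized (in Sutton & Barto's
notation) by a single function:

p(s', r \mid s, a) = \Pr[ S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a ]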
26
Expected Reward
We can also compute the expected reward for state-action pairs:
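r(s, a) = \mathbb{E}[ R_{t+1} \mid S_t = s, A_t = a ] = \sum_{r} r \sum_{s'} p(s', r \mid s, a)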
27
An example of finite MDP
Recycling Robot
● At each step, robot has to decide whether it should (1) actively search for a
can, (2) wait for someone to bring it a can, or (3) go to home base and
recharge.
● Searching is better but runs down the battery; if the robot runs out of power
while searching, it has to be rescued (which is bad).
● Decisions are made on the basis of the current energy level: high or low.
State Set?
Action Set?
One-Step Dynamics?
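A sketch of the answers, following Sutton & Barto's version of this example (assumes
the same setup as the textbook):
● State set: S = {high, low} (the battery level)
● Action sets: A(high) = {search, wait}, A(low) = {search, wait, recharge}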
28
Recycling robot MDP
29
Recycling robot MDP
30
Policy
A policy is a mapping from states to probabilities of selecting each possible action
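\pi(a \mid s) = \Pr[ A_t = a \mid S_t = s ]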
31
The agent learns a policy in RL
A policy is a mapping from states to probabilities of selecting each possible action
The notion of “how good” is defined in terms of future rewards that can be
expected (i.e., expected return)
33
State Value function
The value of a state s under a policy π is the expected return starting from that
state; it depends on the agent's policy:
The expected return starting from state s, and then following policy π
The value function v_π(s) gives the long-term value of state s under the policy π
So:
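v_\pi(s) = \mathbb{E}_\pi[ G_t \mid S_t = s ] = \mathbb{E}_\pi\Big[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s \Big]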
35
Bellman Equation for v_π
A key property of value functions is their recursive relationship:
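v_\pi(s) = \mathbb{E}_\pi[ R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s ]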
36
Bellman Equation and Backup Diagram for v_π
This is a set of equations (in fact, linear), one for each state. The value function
v_π is its unique solution.
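Written out with the one-step dynamics (standard form):

v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_\pi(s') \big]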
Backup diagram
37
Gridworld
38
Gridworld
SOME OBSERVATIONS:
● The values are obtained by solving the linear system given by the Bellman equation
● Notice the negative values near the lower edge!
● State A is the best state to be in but its expected return is less than 10 (immediate
reward)
● State B is valued more than 5 (immediate reward)
● The Bellman equation holds for all states. Check!
39
State-Action Value function
The value of an action a (in a state s) is the expected return starting after taking
that action from that state; it depends on the agent's policy:
The expected return starting from state s, taking action a, and then following
policy π (Note that this first action can be any action, not necessarily selected by π)
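q_\pi(s, a) = \mathbb{E}_\pi[ G_t \mid S_t = s, A_t = a ]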
41
Bellman Equation and Backup Diagram for q_π
42
Bellman Equation and Backup Diagram for q_π
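In standard form, the recursive relationship for q_π is:

q_\pi(s, a) = \mathbb{E}_\pi[ R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a ]
            = \sum_{s', r} p(s', r \mid s, a) \Big[ r + \gamma \sum_{a'} \pi(a' \mid s') \, q_\pi(s', a') \Big]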
43
Optimal Value Function and Optimal Policy
● For finite MDPs, policies can be partially ordered:
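\pi \ge \pi' \iff v_\pi(s) \ge v_{\pi'}(s) \ \text{for all } s \in S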
44
Optimal Value Function and Optimal Policy
● For finite MDPs, policies can be partially ordered:
● There are always one or more policies that are better than or equal to all the
others. These are the optimal policies. We denote them all by π_*
45
Optimal Value Function and Optimal Policy
● For finite MDPs, policies can be partially ordered:
● There are always one or more policies that are better than or equal to all the
others. These are the optimal policies. We denote them all by π_*
● Optimal policies share the same optimal state-value function:
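v_*(s) = \max_{\pi} v_\pi(s) \ \text{for all } s \in S

They also share the same optimal action-value function, q_*(s, a) = \max_{\pi} q_\pi(s, a).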
46
Some Useful Information
We give the following useful information without proofs:
● The optimal value function is unique. But there can be multiple optimal
policies with the same optimal value function.
● To search for an optimal policy, it is sufficient to consider only deterministic
policies (i.e., one action per state). We don't need to consider stochastic
policies (i.e., a distribution over actions for each state).
47
How to find Optimal Value Function and Optimal Policy?
Strategy 1 (a feasible but expensive approach):
● There are |A|^{|S|} deterministic policies (assuming each state has the same action
set), each with its own value function.
● Compute all of them by solving systems of linear equations based on the Bellman
Equations.
● Compare them: the policy with the best value function is an optimal policy, and its
value function is the optimal value function.
● Expensive because we need to find all of the value functions.
Backup diagram
Backup diagram
51
From Optimal Value Function to Optimal Policy
Given q_*, the agent does not even have to do a one-step-ahead search:
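\pi_*(s) = \arg\max_{a} q_*(s, a)

(With only v_*, a one-step-ahead search is still needed:
\pi_*(s) = \arg\max_{a} \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_*(s') \big].)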
52
Two Problems in MDP
Input: a perfect model of the environment as a finite MDP
Two problems:
1. evaluation (prediction): given a policy π, what is its value function v_π?
2. control: find the optimal policy or the optimal value functions, i.e., π_* or v_* (and q_*)
In fact, in order to solve problem 2, we must first know how to solve problem 1.
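As a concrete illustration of problem 1, here is a minimal numpy sketch that evaluates
a fixed policy by reducing the MDP to an MRP and solving the resulting linear Bellman
equation (the array shapes and names are illustrative assumptions, not from the slides):

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma):
    """Problem 1 (prediction): compute v_pi for a fixed policy pi.

    P:     (A, S, S) array, P[a, s, s2] = Pr[S_{t+1} = s2 | S_t = s, A_t = a]
    R:     (S, A)    array of expected immediate rewards r(s, a)
    pi:    (S, A)    array, pi[s, a] = probability of taking action a in state s
    gamma: discount factor in [0, 1)
    """
    # Under a fixed policy the MDP reduces to an MRP (see appendix):
    #   P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a),   R_pi[s] = sum_a pi(a|s) r(s, a)
    P_pi = np.einsum('sa,asp->sp', pi, P)
    R_pi = (pi * R).sum(axis=1)
    # Solve the linear Bellman equation v = R_pi + gamma * P_pi v.
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

# Tiny made-up MDP: 2 states, 2 actions, evaluated under a uniform random policy.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
              [[0.5, 0.5], [0.0, 1.0]]])   # action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)
print(evaluate_policy(P, R, pi, gamma=0.9))
```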
53
Solving the Bellman optimality equation
Finding an optimal policy by solving the Bellman Optimality Equation requires the following:
● accurate knowledge of the environment dynamics
● enough space and time to do the computation
● the Markov property
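For reference, the Bellman optimality equation for v_* (standard form) is:

v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma v_*(s') \big]

Unlike the Bellman expectation equation, the max makes this system nonlinear, so it
has no closed-form solution in general.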
54
Summary
Markov Decision Process
● State
● Action
● Reward
● Transition Probability
● Discount Factor
Policy: stochastic rule for selecting actions
Return: the total future reward to maximize
Two Value Functions and their Bellman Equations
● State-value function for a policy
● Action-value function for a policy
Optimal Value Function and Optimal Policy
● Bellman Optimality Equation
● Get Optimal Policy from Optimal Value Function
55
Appendix
56
The relationship between MDP and MRP
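Given a fixed policy π, an MDP ⟨S, A, P, R, γ⟩ reduces to an MRP ⟨S, P^π, R^π, γ⟩ with
(standard construction):

P^\pi_{ss'} = \sum_{a \in A} \pi(a \mid s) \, P^a_{ss'}, \qquad R^\pi_s = \sum_{a \in A} \pi(a \mid s) \, R^a_s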
http://web.stanford.edu/class/cs234/CS234Win2019/slides/lnotes2.pdf
57
Example
58
Summary of four value functions and their purposes
● v_π(s): state-value function for policy π (evaluation / prediction)
● q_π(s, a): action-value function for policy π (evaluation / prediction)
● v_*(s): optimal state-value function (control)
● q_*(s, a): optimal action-value function (control; it reads off an optimal policy directly)
59