
CMPSC 448: Machine Learning

Lecture 17. Markov Decision Processes

Rui Zhang
Fall 2024

1
Outline
Markov Decision Processes (MDP)
● Markov Processes
● Markov Reward Processes (MRP)
● Markov Decision Processes (MDP)

An MDP is a mathematically idealized form of the reinforcement learning problem for which we have precise theoretical results.
An MDP formally describes an environment for RL.
We introduce the key mathematical components: value functions, policies, and Bellman equations.

2
Sequential Decision Making
Agent and environment interact at discrete time steps t = 0, 1, 2, ...

Agent observes state S_t at time step t

Agent decides on an action A_t at time step t

Environment responds with a reward R_{t+1} and resulting next state S_{t+1}
3
The agent-environment interface

The MDP and agent together thereby give rise to a sequence or trajectory that begins like this:

S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, ...
4
Markov property
“The future is independent of the past given the present”

The state captures all relevant information from the history:

P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]

Once the state is known, the history may be thrown away,
i.e. the state is a sufficient statistic of the future.

“Markov” generally means that given the present state, the future and the past are independent.

5
State Transition Matrix
For a Markov state s and successor state s', the state transition probability is defined by

P_{ss'} = P[S_{t+1} = s' | S_t = s]

The state transition matrix P defines transition probabilities from all states to all successor states:

P =
[ P_{11} ... P_{1n}
  ...
  P_{n1} ... P_{nn} ]

where n is the # of states and each row of the matrix sums to 1.

6
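As a small illustration (not from the slides; the states and probabilities below are made up), a transition matrix can be stored as a NumPy array whose rows are probability distributions over successor states:

```python
import numpy as np

# Hypothetical 3-state Markov chain; row i holds P[S_{t+1} = j | S_t = i].
P = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
    [0.0, 0.0, 1.0],   # state 2 is absorbing
])

# Each row must sum to 1 for P to be a valid transition matrix.
assert np.allclose(P.sum(axis=1), 1.0)
```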
Markov Process (Markov Chain)
A Markov process is a memoryless random process, i.e. a sequence of random states S_1, S_2, ... with the Markov property.

There is no action or decision in a Markov Process.

The Markov property is not a limitation on the decision process, but on the state.
The state must include information about all aspects of the past agent-environment interaction that make a difference for the future.

7
Example: student chain

8
Example: student chain

Sample episodes for the Student Markov Chain starting from state Class:
● CFFCPS
● CFFFFCPS
● ...

9
Markov Reward Process
A Markov reward process is a Markov chain with values.

10
Example: student reward chain

11
The Reward Hypothesis
Agent goal: maximize the total amount of reward it receives. This means
maximizing not immediate reward, but cumulative reward in the long run.

Reward Hypothesis
That all of what we mean by goals and purposes can be well thought of as the
maximization of the cumulative sum of a received scalar signal (called reward)

Example:
● Chess: +/- for winning/losing the game
● Humanoid robot walk: + for forward motion, - for falling over

12
From Reward to Return
The objective in RL is to maximize long-term future reward.
That is, to choose actions A_t so as to maximize the expected return.
But what exactly should be maximized?
The discounted return at time t:

G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}

Discount factor γ ∈ [0, 1]:
● γ = 0: only care about immediate reward
● γ = 1: future reward is as beneficial as immediate reward

13
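A minimal sketch of computing the discounted return from a finite sequence of rewards (the reward values and γ below are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Made-up reward sequence with gamma = 0.5
print(discounted_return([2, -1, -1, 0], gamma=0.5))  # 2 - 0.5 - 0.25 + 0 = 1.25
```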
Episodic tasks vs Continuing tasks
Episodic tasks
● finite number of time steps
● such a sequence is called an episode

Continuing tasks
● infinite number of time steps

14
Why discounted?
Most Markov reward and decision processes are discounted. Why?

● Mathematically convenient to discount rewards


● Avoids infinite returns in cyclic Markov processes
● Uncertainty about the future may not be fully represented
● If the reward is financial, immediate rewards may earn more interest than
delayed rewards
● Animal/human behavior shows preference for immediate reward
● It is sometimes possible to use γ = 1 (undiscounted), e.g. if all sequences terminate.
15
Example: student reward chain
Example of episode: C F F S

Discount factor γ = 0.5

Total reward:

G_1 = 2 + 0.5 × (-1) + (0.5)^2 × (-1) + (0.5)^3 × 0

How about the episode C F F F F C P S?
16
The value function
The value function in a Markov Reward Process is the expected return starting from a state s:

v(s) = E[G_t | S_t = s]

It is a mapping from each state to a real number.

More precisely, this is called the state value function.

The value function gives the long-term value of state s, estimating how good it is for the agent to be in a given state.
17
How to compute value function?
The value function can be decomposed into two parts:
immediate reward + discounted value of successor state

G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = R_{t+1} + γ G_{t+1}

18
How to compute value function?
The value function can be decomposed into two parts:
immediate reward + discounted value of successor state

So

v(s) = E[G_t | S_t = s] = E[R_{t+1} + γ v(S_{t+1}) | S_t = s]
19
Bellman Equation
We can write the Bellman Equation as:

v(s) = R_s + γ Σ_{s' ∈ S} P_{ss'} v(s')

where R_s = E[R_{t+1} | S_t = s] is the expected immediate reward from state s.
20
Bellman Equation in matrix form
We can write the Bellman Equation as:

v(s) = R_s + γ Σ_{s' ∈ S} P_{ss'} v(s')

The Bellman equation can be expressed concisely using matrices,

v = R + γ P v

where v and R are column vectors with one entry per state and n is the number of states.

21
Bellman Equation in matrix form
The Bellman equation is a linear system.
It can be solved directly:

v = R + γ P v
(I − γP) v = R
v = (I − γP)^{-1} R

Computational complexity is O(n^3) for n states (for computing the inverse).

Note: This inverse always exists for γ < 1.

22
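A hedged sketch of the direct solution for a small, made-up MRP (the transition matrix and rewards are assumptions, not the student chain). Solving the linear system is preferred over forming the inverse explicitly, though both cost O(n^3):

```python
import numpy as np

gamma = 0.9
P = np.array([[0.5, 0.4, 0.1],     # made-up transition matrix
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
R = np.array([-2.0, -1.0, 0.0])    # made-up expected immediate reward per state

# v = (I - gamma * P)^{-1} R, computed by solving (I - gamma * P) v = R.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)
```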
Markov Decision Process
A Markov Decision Process (MDP) is a Markov Reward Process with decisions.

It is an environment in which all states are Markov.

MDP formally describes an environment for Reinforcement Learning

If a reinforcement learning task has the Markov property, it is basically a Markov Decision Process (MDP).

23
Markov Decision Process
A Markov decision process (MDP) is a Markov Reward Process with decisions.
If the state, action, and reward sets are finite, it is a finite MDP.

To define a finite MDP, you need to give:
● state set, action set, discount factor
● one-step "dynamics"

24
One step dynamics
One-step dynamics:

p(s', r | s, a) = Pr{S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a}

This function defines the dynamics of the MDP.
It completely characterizes the environment dynamics.
That is, the probability of each possible value for S_{t+1} and R_{t+1} depends only on the immediately preceding state and action.
We can compute anything we want to know about the environment from p(s', r | s, a).
25
State transition probabilities
From the one-step dynamics p(s', r | s, a), we can compute the state transition probabilities:

p(s' | s, a) = Pr{S_{t+1} = s' | S_t = s, A_t = a} = Σ_r p(s', r | s, a)
26
Expected Reward
We can also compute the expected reward for state-action pairs:

r(s, a) = E[R_{t+1} | S_t = s, A_t = a] = Σ_r r Σ_{s'} p(s', r | s, a)
27
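A hedged sketch of deriving both quantities from the one-step dynamics, using a made-up dictionary representation (the state and action names are placeholders):

```python
# Made-up one-step dynamics: dynamics[(s, a)] is a list of (s_next, reward, prob)
# triples representing p(s', r | s, a).
dynamics = {
    ("s0", "a0"): [("s0", 0.0, 0.7), ("s1", 1.0, 0.3)],
    ("s0", "a1"): [("s1", 5.0, 1.0)],
}

def transition_prob(s, a, s_next):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    return sum(p for sp, r, p in dynamics[(s, a)] if sp == s_next)

def expected_reward(s, a):
    """r(s, a) = sum over s', r of r * p(s', r | s, a)."""
    return sum(r * p for sp, r, p in dynamics[(s, a)])

print(transition_prob("s0", "a0", "s1"))  # 0.3
print(expected_reward("s0", "a0"))        # 0.3
```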
An example of finite MDP
Recycling Robot
● At each step, robot has to decide whether it should (1) actively search for a
can, (2) wait for someone to bring it a can, or (3) go to home base and
recharge.
● Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
● Decisions are made on the basis of the current energy level: high, low.

State Set?
Action Set?
One-Step Dynamics?
28
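The slide only names the state and action sets; the sketch below encodes them in the same dictionary format as above, with transition probabilities and rewards left as placeholder assumptions (the actual numbers come from the table on the following slides):

```python
STATES = ["high", "low"]
ACTIONS = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}

# Placeholder values; replace with the probabilities and rewards from the
# robot's dynamics table.
alpha, beta = 0.9, 0.6                        # assumed P(battery level unchanged after searching)
r_search, r_wait, r_rescue = 3.0, 1.0, -3.0   # assumed rewards

# dynamics[(s, a)] -> list of (next_state, reward, probability)
dynamics = {
    ("high", "search"):   [("high", r_search, alpha), ("low", r_search, 1 - alpha)],
    ("high", "wait"):     [("high", r_wait, 1.0)],
    ("low",  "search"):   [("low", r_search, beta), ("high", r_rescue, 1 - beta)],
    ("low",  "wait"):     [("low", r_wait, 1.0)],
    ("low",  "recharge"): [("high", 0.0, 1.0)],
}
```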
Recycling robot MDP

29
Recycling robot MDP

30
Policy
A policy is a mapping from states to probabilities of selecting each possible action:

π(a | s) = P[A_t = a | S_t = s]
31
The agent learns a policy in RL
A policy is a mapping from states to probabilities of selecting each possible action:

π(a | s) = P[A_t = a | S_t = s]

If the agent is following policy π at time t, then π(a | s) is the probability that A_t = a if S_t = s.
A policy fully defines the behavior of an agent.
Reinforcement learning methods specify how the agent changes its policy as a result of experience.
Stochastic Policy: multiple actions for one state
Deterministic Policy: one action for one state, i.e., a = π(s)
32
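A tiny illustration of the two kinds of policies as Python data (the state and action names are made up):

```python
# Stochastic policy: pi[s][a] = probability of selecting action a in state s.
stochastic_pi = {"s0": {"a0": 0.8, "a1": 0.2}, "s1": {"a0": 0.5, "a1": 0.5}}

# Deterministic policy: exactly one action per state, a = pi(s).
deterministic_pi = {"s0": "a0", "s1": "a1"}
```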
Two types of Value functions: v_π and q_π
1. v_π (state value function) quantifies how good it is for the agent to be in a given state s.

2. q_π (state-action value function, or action value function) quantifies how good it is for the agent to perform a given action a in a given state s.

The notion of “how good” is defined in terms of future rewards that can be expected (i.e., expected return).

Almost all RL algorithms involve estimating value functions (functions of states or of state-action pairs).

33
State Value function
The value of a state for a policy is the expected return starting from that state; it depends on the agent's policy:

v_π(s) = E_π[G_t | S_t = s]

The expected return starting from state s, and then following policy π.

The value function gives the long-term value of state s under the policy π.

v_π is called the state-value function or value function for policy π.

34


Bellman Equation for v_π
The value function can be decomposed into two parts:
immediate reward + discounted value of successor state

So:

v_π(s) = E_π[G_t | S_t = s] = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]
35
Bellman Equation for v_π
A key property of value functions is their recursive relationship:

v_π(s) = Σ_a π(a | s) Σ_{s', r} p(s', r | s, a) [r + γ v_π(s')]
36
Bellman Equation and Backup Diagram for v_π

v_π(s) = Σ_a π(a | s) Σ_{s', r} p(s', r | s, a) [r + γ v_π(s')]

This is a set of equations (in fact, linear), one for each state. The value function v_π is its unique solution.

Backup diagram

37
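Since the system is linear, v_π can be computed exactly for a small MDP. A hedged sketch, reusing the made-up dictionary format from the earlier dynamics example (all numbers are assumptions):

```python
import numpy as np

gamma = 0.9
states = ["s0", "s1"]

# Made-up dynamics: (s, a) -> list of (s_next, reward, prob).
dynamics = {
    ("s0", "a0"): [("s0", 0.0, 0.7), ("s1", 1.0, 0.3)],
    ("s0", "a1"): [("s1", 5.0, 1.0)],
    ("s1", "a0"): [("s1", 0.0, 1.0)],
    ("s1", "a1"): [("s0", 2.0, 1.0)],
}
# Made-up stochastic policy pi(a | s).
pi = {"s0": {"a0": 0.5, "a1": 0.5}, "s1": {"a0": 1.0, "a1": 0.0}}

# Rearrange v(s) - gamma * sum_a pi(a|s) sum_{s',r} p(s',r|s,a) v(s') = expected reward
# into the linear system A v = b and solve it.
idx = {s: i for i, s in enumerate(states)}
A = np.eye(len(states))
b = np.zeros(len(states))
for s in states:
    for a, p_a in pi[s].items():
        for s_next, r, p in dynamics[(s, a)]:
            b[idx[s]] += p_a * p * r
            A[idx[s], idx[s_next]] -= gamma * p_a * p

v_pi = dict(zip(states, np.linalg.solve(A, b)))
print(v_pi)
```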
Gridworld

38
Gridworld

SOME OBSERVATIONS:
● The values are obtained by solving the linear system from the Bellman equation
● Notice the negative values near the lower edge!
● State A is the best state to be in, but its expected return is less than 10 (its immediate reward)
● State B is valued at more than 5 (its immediate reward)
● The Bellman equation holds for all states. Check!

39
State-Action Value function
The value of an action (in a state) is the expected return starting after taking that action from that state; it depends on the agent's policy:

q_π(s, a) = E_π[G_t | S_t = s, A_t = a]

The expected return starting from state s, taking action a, and then following policy π. (Note that this action a can be any action, not necessarily one selected by π.)

q_π is called the state-action value function or action value function for policy π.

40


Bellman Equation for q_π
The value function can be decomposed into two parts:
immediate reward + discounted value of successor state

So for the state value function:

v_π(s) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]

Similarly, for the state-action value function:

q_π(s, a) = E_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]
41
Bellman Equation and Backup Diagram for

42
Bellman Equation and Backup Diagram for

43
Optimal Value Function and Optimal Policy
● For finite MDPs, policies can be partially ordered:
π ≥ π' if and only if v_π(s) ≥ v_{π'}(s) for all s
44
Optimal Value Function and Optimal Policy
● For finite MDPs, policies can be partially ordered:
π ≥ π' if and only if v_π(s) ≥ v_{π'}(s) for all s
● There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them all by π_*.

45
Optimal Value Function and Optimal Policy
● For finite MDPs, policies can be partially ordered:
π ≥ π' if and only if v_π(s) ≥ v_{π'}(s) for all s
● There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them all by π_*.
● Optimal policies share the same optimal state-value function:
v_*(s) = max_π v_π(s) for all s
● Optimal policies also share the same optimal action-value function:
q_*(s, a) = max_π q_π(s, a) for all s and a
46
Some Useful Information
We give the following useful information without proofs:

● The optimal value function is unique. But there can be multiple optimal
policies with the same optimal value function.

● To search for optimal policy, it is sufficient just to look for only deterministic
policies (i.e., one action for one state). We don't need to consider stochastic
policies (i.e., multiple actions for one state).

● In a finite MDP, the number of deterministic policies is finite, namely |A|^{|S|} (one action choice per state).
47
How to find Optimal Value Function and Optimal Policy?
Strategy 1 (a feasible but expensive approach):
● There are |A|^{|S|} deterministic policies and their value functions.
● Compute all of them by solving systems of linear equations based on Bellman Equations.
● Compare them to find the policy with the best value function; that is the optimal policy and its value function is the optimal value function.
● Expensive because we need to find all of the value functions.

Strategy 2 (a more attractive approach):
● Compute the (only one!) optimal value function directly by solving a system of nonlinear equations based on Bellman Optimality Equations.
● Use the optimal value function to induce the optimal policy.
● How to do these two steps exactly? See later slides.

48
Bellman Optimality Equation and Backup Diagram for v_*
The value of a state under an optimal policy must equal the expected return for the best action from that state:

v_*(s) = max_a q_*(s, a) = max_a Σ_{s', r} p(s', r | s, a) [r + γ v_*(s')]

Backup diagram

v_* is the unique solution of this system of nonlinear equations.
49
Bellman Optimality Equation and Backup Diagram for q_*

q_*(s, a) = Σ_{s', r} p(s', r | s, a) [r + γ max_{a'} q_*(s', a')]

Backup diagram

q_* is the unique solution of this system of nonlinear equations.
50
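The slides defer the "how" to later lectures; as one hedged illustration, value iteration (a dynamic-programming method) repeatedly applies the Bellman optimality backup until the values stop changing. Sketch, using the same made-up (s, a) -> [(s', r, p), ...] dynamics format as in the earlier examples:

```python
def value_iteration(states, actions, dynamics, gamma=0.9, tol=1e-8):
    """Iterate v(s) <- max_a sum_{s', r} p(s', r | s, a) * (r + gamma * v(s'))."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: best one-step lookahead over actions.
            best = max(
                sum(p * (r + gamma * v[s_next]) for s_next, r, p in dynamics[(s, a)])
                for a in actions if (s, a) in dynamics
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v
```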
From Optimal Value Function to Optimal Policy
Any policy that is greedy with respect to v_* is an optimal policy.

Therefore, given v_*, a one-step-ahead search produces the long-term optimal actions.

E.g., back to the gridworld:

51
From Optimal Value Function to Optimal Policy
Given q_*, the agent does not even have to do a one-step-ahead search:

π_*(s) = argmax_a q_*(s, a)
52
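A minimal sketch, assuming q_star is stored as a dictionary keyed by (state, action) pairs (a made-up representation):

```python
def greedy_policy(q_star, states, actions):
    """pi_*(s) = argmax_a q_*(s, a); no model or one-step lookahead needed."""
    return {s: max(actions, key=lambda a: q_star.get((s, a), float("-inf")))
            for s in states}
```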
Two Problems in MDP
Input: a perfect model of the RL problem as a finite MDP

Two problems:
1. evaluation (prediction): given a policy π, what is its value function v_π?
2. control: find the optimal policy or the optimal value functions, i.e., π_*, v_*, or q_*

In fact, in order to solve problem 2, we must first know how to solve problem 1.

53
Solving the Bellman optimality equation
Finding an optimal policy by solving the Bellman Optimality Equation requires the following:
● accurate knowledge of the environment dynamics
● enough space and time to do the computation
● the Markov property

How much space and time do we need?


● Polynomial in the number of states (via dynamic programming methods; Chapter 4)
● BUT the number of states is often huge (e.g., backgammon has about 10^20 states).

We usually have to settle for approximations.

Many RL methods can be understood as approximately solving the Bellman Optimality Equation.

54
Summary
Markov Decision Process
● State
● Action
● Reward
● Transition Probability
● Discount Factor

Policy: stochastic rule for selecting actions
Return: the total future reward to maximize

Two Value Functions and their Bellman Equations
● State-value function for a policy
● Action-value function for a policy

Optimal Value Function and Optimal Policy
● Bellman Optimality Equation
● Get Optimal Policy from Optimal Value Function

55
Appendix

56
The relationship between MDP and MRP

http://web.stanford.edu/class/cs234/CS234Win2019/slides/lnotes2.pdf

57
Example

58
Summary of four value functions and their purposes


All theoretical objects, mathematical ideals (expected values)

59
