
CMPSC 448: Machine Learning

Lecture 17. Markov Decision Processes

Rui Zhang
Fall 2024

1
Outline
Markov Decision Processes (MDP)
● Markov Processes
● Markov Reward Processes (MRP)
● Markov Decision Processes (MDP)

An MDP is a mathematically idealized form of the reinforcement learning problem for which we have precise theoretical results.
An MDP formally describes an environment for RL.
We introduce the key mathematical components: value functions, policies, and Bellman equations.

2
Sequential Decision Making
Agent and environment interact at discrete time steps t = 0, 1, 2, ...

Agent observes state S_t at time step t

Agent decides on an action A_t at time step t

Environment responds with a reward R_{t+1} and resulting next state S_{t+1}
3
The agent-environment interface

The MDP and agent together thereby give rise to a sequence or trajectory that begins like this:

S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, ...
4
Markov property
“The future is independent of the past given the present”

The state captures all relevant information from the history:

P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]

Once the state is known, the history may be thrown away,
i.e. the state is a sufficient statistic of the future.

“Markov” generally means that given the present state, the future and the past are independent.

5
State Transition Matrix
For a Markov state s and successor state s', the state transition probability is defined by

P_{ss'} = P[S_{t+1} = s' | S_t = s]

The state transition matrix P defines transition probabilities from all states to all successor states:

P =
[ P_{11} ... P_{1n}
  ...
  P_{n1} ... P_{nn} ]

where n is the # of states and each row of the matrix sums to 1.

6
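As a small illustration (not from the slides; the states and probabilities below are made up), a transition matrix can be stored as a NumPy array whose rows are probability distributions over successor states:

```python
import numpy as np

# Hypothetical 3-state Markov chain; row i holds P[S_{t+1} = j | S_t = i].
P = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.5, 0.3],
    [0.0, 0.0, 1.0],   # state 2 is absorbing
])

# Each row must sum to 1 for P to be a valid transition matrix.
assert np.allclose(P.sum(axis=1), 1.0)
```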
Markov Process (Markov Chain)
A Markov process is a memoryless random process, i.e. a sequence of random states S_1, S_2, ... with the Markov property.

There is no action or decision in a Markov Process.

The Markov property is not a limitation on the decision process, but on the state.
The state must include information about all aspects of the past agent-environment interaction that make a difference for the future.

7
Example: student chain

8
Example: student chain

Sample episodes for the Student Markov Chain starting from state Class:
● CFFCPS
● CFFFFCPS
● ...

9
Markov Reward Process
A Markov reward process is a Markov chain with values.

10
Example: student reward chain

11
The Reward Hypothesis
Agent goal: maximize the total amount of reward it receives. This means
maximizing not immediate reward, but cumulative reward in the long run.

Reward Hypothesis
That all of what we mean by goals and purposes can be well thought of as the
maximization of the cumulative sum of a received scalar signal (called reward)

Example:
● Chess: +/- for winning/losing the game
● Humanoid robot walk: + for forward motion, - for falling over

12
From Reward to Return
The objective in RL is to maximize long-term future reward.
That is, to choose actions A_t so as to maximize the expected return.
But what exactly should be maximized?
The discounted return at time t:

G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = Σ_{k=0}^∞ γ^k R_{t+k+1}

Discount factor γ ∈ [0, 1]:
● γ = 0: only care about immediate reward
● γ = 1: future reward is as beneficial as immediate reward

13
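A minimal sketch of computing the discounted return from a finite sequence of rewards (the reward values and γ below are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Made-up reward sequence with gamma = 0.5
print(discounted_return([2, -1, -1, 0], gamma=0.5))  # 2 - 0.5 - 0.25 + 0 = 1.25
```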
Episodic tasks vs Continuing tasks
Episodic tasks
● finite number of time steps
● such a sequence is called an episode

Continuing tasks
● infinite number of time steps

14
Why discounted?
Most Markov reward and decision processes are discounted. Why?

● Mathematically convenient to discount rewards


● Avoids infinite returns in cyclic Markov processes
● Uncertainty about the future may not be fully represented
● If the reward is financial, immediate rewards may earn more interest than
delayed rewards
● Animal/human behavior shows preference for immediate reward
● It is sometimes possible to use γ = 1 (undiscounted), e.g. if all sequences terminate.
15
Example: student reward chain
Example of episode: C F F S

Discount factor γ = 0.5

Total reward:

G_1 = 2 + 0.5 × (-1) + (0.5)^2 × (-1) + (0.5)^3 × 0

How about the episode C F F F F C P S?
16
The value function
The value function in a Markov Reward Process is the expected return starting from a state s:

v(s) = E[G_t | S_t = s]

It is a mapping from each state to a real number.

More precisely, this is called the state value function.

The value function gives the long-term value of state s, estimating how good it is for the agent to be in a given state.
17
How to compute value function?
The value function can be decomposed into two parts:
immediate reward + discounted value of successor state

G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = R_{t+1} + γ G_{t+1}

18
How to compute value function?
The value function can be decomposed into two parts:
immediate reward + discounted value of successor state

So

v(s) = E[G_t | S_t = s] = E[R_{t+1} + γ v(S_{t+1}) | S_t = s]
19
Bellman Equation
We can write the Bellman Equation as:

v(s) = R_s + γ Σ_{s' ∈ S} P_{ss'} v(s')

where R_s = E[R_{t+1} | S_t = s] is the expected immediate reward from state s.
20
Bellman Equation in matrix form
We can write the Bellman Equation as:

v(s) = R_s + γ Σ_{s' ∈ S} P_{ss'} v(s')

The Bellman equation can be expressed concisely using matrices,

v = R + γ P v

where v and R are column vectors with one entry per state and n is the number of states.

21
Bellman Equation in matrix form
The Bellman equation is a linear system.
It can be solved directly:

v = R + γ P v
(I − γP) v = R
v = (I − γP)^{-1} R

Computational complexity is O(n^3) for n states (for computing the inverse).

Note: This inverse always exists for γ < 1.

22
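A hedged sketch of the direct solution for a small, made-up MRP (the transition matrix and rewards are assumptions, not the student chain). Solving the linear system is preferred over forming the inverse explicitly, though both cost O(n^3):

```python
import numpy as np

gamma = 0.9
P = np.array([[0.5, 0.4, 0.1],     # made-up transition matrix
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])
R = np.array([-2.0, -1.0, 0.0])    # made-up expected immediate reward per state

# v = (I - gamma * P)^{-1} R, computed by solving (I - gamma * P) v = R.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)
```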
Markov Decision Process
A Markov Decision Process (MDP) is a Markov Reward Process with decisions.

It is an environment in which all states are Markov.

MDP formally describes an environment for Reinforcement Learning

If a reinforcement learning task has the Markov property, it is basically a Markov Decision Process (MDP).

23
Markov Decision Process
A Markov decision process (MDP) is a Markov Reward Process with decisions.
If the state, action, and reward sets are finite, it is a finite MDP.

To define a finite MDP, you need to give:
● state set, action set, discount factor
● one-step "dynamics"

24
One step dynamics
One-step dynamics:

p(s', r | s, a) = Pr{S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a}

This function defines the dynamics of the MDP.
It completely characterizes the environment dynamics.
That is, the probability of each possible value for S_{t+1} and R_{t+1} depends only on the immediately preceding state and action.
We can compute anything we want to know about the environment from p(s', r | s, a).
25
State transition probabilities
From the one-step dynamics p(s', r | s, a), we can compute the state transition probabilities:

p(s' | s, a) = Pr{S_{t+1} = s' | S_t = s, A_t = a} = Σ_r p(s', r | s, a)
26
Expected Reward
We can also compute the expected reward for state-action pairs:

r(s, a) = E[R_{t+1} | S_t = s, A_t = a] = Σ_r r Σ_{s'} p(s', r | s, a)
27
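A hedged sketch of deriving both quantities from the one-step dynamics, using a made-up dictionary representation (the state and action names are placeholders):

```python
# Made-up one-step dynamics: dynamics[(s, a)] is a list of (s_next, reward, prob)
# triples representing p(s', r | s, a).
dynamics = {
    ("s0", "a0"): [("s0", 0.0, 0.7), ("s1", 1.0, 0.3)],
    ("s0", "a1"): [("s1", 5.0, 1.0)],
}

def transition_prob(s, a, s_next):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    return sum(p for sp, r, p in dynamics[(s, a)] if sp == s_next)

def expected_reward(s, a):
    """r(s, a) = sum over s', r of r * p(s', r | s, a)."""
    return sum(r * p for sp, r, p in dynamics[(s, a)])

print(transition_prob("s0", "a0", "s1"))  # 0.3
print(expected_reward("s0", "a0"))        # 0.3
```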
An example of finite MDP
Recycling Robot
● At each step, robot has to decide whether it should (1) actively search for a
can, (2) wait for someone to bring it a can, or (3) go to home base and
recharge.
● Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
● Decisions are made on the basis of the current energy level: high, low.

State Set?
Action Set?
One-Step Dynamics?
28
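The slide only names the state and action sets; the sketch below encodes them in the same dictionary format as above, with transition probabilities and rewards left as placeholder assumptions (the actual numbers come from the table on the following slides):

```python
STATES = ["high", "low"]
ACTIONS = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}

# Placeholder values; replace with the probabilities and rewards from the
# robot's dynamics table.
alpha, beta = 0.9, 0.6                        # assumed P(battery level unchanged after searching)
r_search, r_wait, r_rescue = 3.0, 1.0, -3.0   # assumed rewards

# dynamics[(s, a)] -> list of (next_state, reward, probability)
dynamics = {
    ("high", "search"):   [("high", r_search, alpha), ("low", r_search, 1 - alpha)],
    ("high", "wait"):     [("high", r_wait, 1.0)],
    ("low",  "search"):   [("low", r_search, beta), ("high", r_rescue, 1 - beta)],
    ("low",  "wait"):     [("low", r_wait, 1.0)],
    ("low",  "recharge"): [("high", 0.0, 1.0)],
}
```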
Recycling robot MDP

29
Recycling robot MDP

30
Policy
A policy is a mapping from states to probabilities of selecting each possible action:

π(a | s) = P[A_t = a | S_t = s]
31
The agent learns a policy in RL
A policy is a mapping from states to probabilities of selecting each possible action:

π(a | s) = P[A_t = a | S_t = s]

If the agent is following policy π at time t, then π(a | s) is the probability that A_t = a if S_t = s.
A policy fully defines the behavior of an agent.
Reinforcement learning methods specify how the agent changes its policy as a result of experience.
Stochastic Policy: multiple actions for one state
Deterministic Policy: one action for one state, i.e., a = π(s)
32
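A tiny illustration of the two kinds of policies as Python data (the state and action names are made up):

```python
# Stochastic policy: pi[s][a] = probability of selecting action a in state s.
stochastic_pi = {"s0": {"a0": 0.8, "a1": 0.2}, "s1": {"a0": 0.5, "a1": 0.5}}

# Deterministic policy: exactly one action per state, a = pi(s).
deterministic_pi = {"s0": "a0", "s1": "a1"}
```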
Two types of Value functions: v_π and q_π
1. v_π (state value function) quantifies how good it is for the agent to be in a given state s.

2. q_π (state-action value function, or action value function) quantifies how good it is for the agent to perform a given action a in a given state s.

The notion of “how good” is defined in terms of future rewards that can be expected (i.e., expected return).

Almost all RL algorithms involve estimating value functions (functions of states or of state-action pairs).

33
State Value function
The value of a state for a policy is the expected return starting from that state; it depends on the agent's policy:

v_π(s) = E_π[G_t | S_t = s]

The expected return starting from state s, and then following policy π.

The value function gives the long-term value of state s under the policy π.

v_π is called the state-value function or value function for policy π.

34


Bellman Equation for v_π
The value function can be decomposed into two parts:
immediate reward + discounted value of successor state

So:

v_π(s) = E_π[G_t | S_t = s] = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]
35
Bellman Equation for v_π
A key property of value functions is their recursive relationship:

v_π(s) = Σ_a π(a | s) Σ_{s', r} p(s', r | s, a) [r + γ v_π(s')]
36
Bellman Equation and Backup Diagram for v_π

v_π(s) = Σ_a π(a | s) Σ_{s', r} p(s', r | s, a) [r + γ v_π(s')]

This is a set of equations (in fact, linear), one for each state. The value function v_π is its unique solution.

Backup diagram

37
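Since the system is linear, v_π can be computed exactly for a small MDP. A hedged sketch, reusing the made-up dictionary format from the earlier dynamics example (all numbers are assumptions):

```python
import numpy as np

gamma = 0.9
states = ["s0", "s1"]

# Made-up dynamics: (s, a) -> list of (s_next, reward, prob).
dynamics = {
    ("s0", "a0"): [("s0", 0.0, 0.7), ("s1", 1.0, 0.3)],
    ("s0", "a1"): [("s1", 5.0, 1.0)],
    ("s1", "a0"): [("s1", 0.0, 1.0)],
    ("s1", "a1"): [("s0", 2.0, 1.0)],
}
# Made-up stochastic policy pi(a | s).
pi = {"s0": {"a0": 0.5, "a1": 0.5}, "s1": {"a0": 1.0, "a1": 0.0}}

# Rearrange v(s) - gamma * sum_a pi(a|s) sum_{s',r} p(s',r|s,a) v(s') = expected reward
# into the linear system A v = b and solve it.
idx = {s: i for i, s in enumerate(states)}
A = np.eye(len(states))
b = np.zeros(len(states))
for s in states:
    for a, p_a in pi[s].items():
        for s_next, r, p in dynamics[(s, a)]:
            b[idx[s]] += p_a * p * r
            A[idx[s], idx[s_next]] -= gamma * p_a * p

v_pi = dict(zip(states, np.linalg.solve(A, b)))
print(v_pi)
```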
Gridworld

38
Gridworld

SOME OBSERVATIONS:
● The values are obtained by solving the linear system from the Bellman equation
● Notice the negative values near the lower edge!
● State A is the best state to be in, but its expected return is less than 10 (its immediate reward)
● State B is valued at more than 5 (its immediate reward)
● The Bellman equation holds for all states. Check!

39
State-Action Value function
The value of an action (in a state) is the expected return starting after taking that action from that state; it depends on the agent's policy:

q_π(s, a) = E_π[G_t | S_t = s, A_t = a]

The expected return starting from state s, taking action a, and then following policy π. (Note that this action a can be any action, not necessarily one selected by π.)

q_π is called the state-action value function or action value function for policy π.

40


Bellman Equation for q_π
The value function can be decomposed into two parts:
immediate reward + discounted value of successor state

So for the state value function:

v_π(s) = E_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]

Similarly, for the state-action value function:

q_π(s, a) = E_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]
41
Bellman Equation and Backup Diagram for

42
Bellman Equation and Backup Diagram for

43
Optimal Value Function and Optimal Policy
● For finite MDPs, policies can be partially ordered:
π ≥ π' if and only if v_π(s) ≥ v_{π'}(s) for all s
44
Optimal Value Function and Optimal Policy
● For finite MDPs, policies can be partially ordered:
π ≥ π' if and only if v_π(s) ≥ v_{π'}(s) for all s
● There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them all by π_*.

45
Optimal Value Function and Optimal Policy
● For finite MDPs, policies can be partially ordered:
π ≥ π' if and only if v_π(s) ≥ v_{π'}(s) for all s
● There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them all by π_*.
● Optimal policies share the same optimal state-value function:
v_*(s) = max_π v_π(s) for all s
● Optimal policies also share the same optimal action-value function:
q_*(s, a) = max_π q_π(s, a) for all s and a
46
Some Useful Information
We give the following useful information without proofs:

● The optimal value function is unique. But there can be multiple optimal
policies with the same optimal value function.

● To search for optimal policy, it is sufficient just to look for only deterministic
policies (i.e., one action for one state). We don't need to consider stochastic
policies (i.e., multiple actions for one state).

● In a finite MDP, the number of deterministic policies is finite, namely |A|^{|S|} (one action choice per state).
47
How to find Optimal Value Function and Optimal Policy?
Strategy 1 (a feasible but expensive approach):
● There are |A|^{|S|} deterministic policies and their value functions.
● Compute all of them by solving systems of linear equations based on Bellman Equations.
● Compare them to find the policy with the best value function; that is the optimal policy and its value function is the optimal value function.
● Expensive because we need to find all of the value functions.

Strategy 2 (a more attractive approach):
● Compute the (only one!) optimal value function directly by solving a system of nonlinear equations based on Bellman Optimality Equations.
● Use the optimal value function to induce the optimal policy.
● How to do these two steps exactly? See later slides.

48
Bellman Optimality Equation and Backup Diagram for v_*
The value of a state under an optimal policy must equal the expected return for the best action from that state:

v_*(s) = max_a q_*(s, a) = max_a Σ_{s', r} p(s', r | s, a) [r + γ v_*(s')]

Backup diagram

v_* is the unique solution of this system of nonlinear equations.
49
Bellman Optimality Equation and Backup Diagram for q_*

q_*(s, a) = Σ_{s', r} p(s', r | s, a) [r + γ max_{a'} q_*(s', a')]

Backup diagram

q_* is the unique solution of this system of nonlinear equations.
50
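The slides defer the "how" to later lectures; as one hedged illustration, value iteration (a dynamic-programming method) repeatedly applies the Bellman optimality backup until the values stop changing. Sketch, using the same made-up (s, a) -> [(s', r, p), ...] dynamics format as in the earlier examples:

```python
def value_iteration(states, actions, dynamics, gamma=0.9, tol=1e-8):
    """Iterate v(s) <- max_a sum_{s', r} p(s', r | s, a) * (r + gamma * v(s'))."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: best one-step lookahead over actions.
            best = max(
                sum(p * (r + gamma * v[s_next]) for s_next, r, p in dynamics[(s, a)])
                for a in actions if (s, a) in dynamics
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v
```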
From Optimal Value Function to Optimal Policy
Any policy that is greedy with respect to v_* is an optimal policy.

Therefore, given v_*, a one-step-ahead search produces the long-term optimal actions.

E.g., back to the gridworld:

51
From Optimal Value Function to Optimal Policy
Given q_*, the agent does not even have to do a one-step-ahead search:

π_*(s) = argmax_a q_*(s, a)
52
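A minimal sketch, assuming q_star is stored as a dictionary keyed by (state, action) pairs (a made-up representation):

```python
def greedy_policy(q_star, states, actions):
    """pi_*(s) = argmax_a q_*(s, a); no model or one-step lookahead needed."""
    return {s: max(actions, key=lambda a: q_star.get((s, a), float("-inf")))
            for s in states}
```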
Two Problems in MDP
Input: a perfect model of the RL problem as a finite MDP

Two problems:
1. evaluation (prediction): given a policy π, what is its value function v_π?
2. control: find the optimal policy or the optimal value functions, i.e., π_*, v_*, or q_*

In fact, in order to solve problem 2, we must first know how to solve problem 1.

53
Solving the Bellman optimality equation
Finding an optimal policy by solving the Bellman Optimality Equation requires the following:
● accurate knowledge of the environment dynamics
● enough space and time to do the computation
● the Markov property

How much space and time do we need?


● Polynomial in the number of states (via dynamic programming methods; Chapter 4)
● BUT the number of states is often huge (e.g., backgammon has about 10^20 states).

We usually have to settle for approximations.

Many RL methods can be understood as approximately solving the Bellman Optimality Equation.

54
Summary
Markov Decision Process
● State
● Action
● Reward
● Transition Probability
● Discount Factor

Policy: stochastic rule for selecting actions
Return: the total future reward to maximize

Two Value Functions and their Bellman Equations
● State-value function for a policy
● Action-value function for a policy

Optimal Value Function and Optimal Policy
● Bellman Optimality Equation
● Get Optimal Policy from Optimal Value Function

55
Appendix

56
The relationship between MDP and MRP

http://web.stanford.edu/class/cs234/CS234Win2019/slides/lnotes2.pdf

57
Example

58
Summary of four value functions and their purposes


All theoretical objects, mathematical ideals (expected values)

59
