03 Basic Model of An MDP
Consider a dynamical system that evolves as

$$X_{t+1} = f_t(X_t, U_t, W_t), \qquad t = 1, \dots, T,$$

where $X_t$ is the state, $U_t$ is the control input, and $W_t$ is the noise. An agent/controller observes the state $X_t$ and chooses the control input $U_t$.
The controller can be as sophisticated as we want. In principle, it can analyze the entire history of observations and control actions to choose the current control action. Thus, the control action can be written as

$$U_t = g_t(X_{1:t}, U_{1:t-1}),$$

where $X_{1:t}$ is a shorthand for $(X_1, \dots, X_t)$ and a similar interpretation holds for $U_{1:t-1}$. The function $g_t$ is called the control law at time $t$.
At each time, the system incurs a cost that may depend on the current state and control action. This cost is denoted by $c_t(X_t, U_t)$. The system operates for a time horizon $T$. During this time, it incurs a total cost

$$\sum_{t=1}^{T} c_t(X_t, U_t).$$
The initial state $X_1$ and the noise process $\{W_t\}_{t \ge 1}$ are random variables defined on a common probability space (these are called primitive random variables) and are mutually independent.
Suppose we have to design such a controller. We are told the probability distribution of the initial state and the noise. We are also told the system update functions $(f_1, \dots, f_T)$ and the cost functions $(c_1, \dots, c_T)$. We are asked to choose a control strategy $g = (g_1, \dots, g_T)$ to minimize the expected total cost

$$J(g) := \mathbb{E}\bigg[ \sum_{t=1}^{T} c_t(X_t, U_t) \bigg].$$
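To make the setup concrete, here is a small Monte Carlo sketch in Python (not part of the original notes) that estimates $J(g)$ for a history-dependent strategy on a made-up two-state system; the functions `f`, `c`, and `history_strategy` below are illustrative placeholders, not anything prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 3  # horizon

# Toy two-state, two-action system (all names and numbers are illustrative).
def f(t, x, u, w):
    """System update X_{t+1} = f_t(X_t, U_t, W_t)."""
    return (x + u + w) % 2

def c(t, x, u):
    """Per-stage cost c_t(X_t, U_t)."""
    return float(x) + 0.5 * u

def history_strategy(t, x_hist, u_hist):
    """A history-dependent control law U_t = g_t(X_{1:t}, U_{1:t-1})."""
    return int(sum(x_hist) % 2 == 0)

def expected_total_cost(strategy, n_samples=100_000):
    """Monte Carlo estimate of J(g) = E[ sum_t c_t(X_t, U_t) ]."""
    total = 0.0
    for _ in range(n_samples):
        x = rng.integers(2)          # initial state X_1
        x_hist, u_hist = [x], []
        for t in range(1, T + 1):
            u = strategy(t, x_hist, u_hist)
            total += c(t, x, u)
            w = rng.integers(2)      # noise W_t
            x = f(t, x, u, w)
            x_hist.append(x)
            u_hist.append(u)
    return total / n_samples

print(expected_total_cost(history_strategy))
```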
At first glance, the problem looks intimidating. It appears that we have to design a very sophisticated controller:
one that analyzes all past data to choose a control input. However, this is not the case. A remarkable result is
that the optimal control station can discard all past data and choose the control input based only on the current
state of the system. Formally, we have the following:
Optimality of Markov strategies. For the system model described above, there is no loss of optimality in choosing the control action according to

$$U_t = g_t(X_t), \qquad t = 1, \dots, T.$$
The above result claims that the cost incurred by the best Markovian strategy is the same as the cost incurred by the best history-dependent strategy. This appears to be a tall claim, so let's see how we can prove it. The main idea of the proof is to repeatedly apply Blackwell's principle of irrelevant information (1964).
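For ease of reference, one standard way of stating the principle (paraphrased here, with the technical measurability and integrability conditions omitted) is the following. Let $X$ and $Y$ be random variables and let $c(x, u)$ be a cost function that does not depend on $y$. Then, for any control law $u = g(x, y)$, there exists a control law $u = h(x)$ such that

$$\mathbb{E}\big[ c\big(X, h(X)\big) \big] \le \mathbb{E}\big[ c\big(X, g(X, Y)\big) \big].$$

In words, the "irrelevant" information $Y$ may be discarded without any loss in performance.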
Two-Step Lemma. Consider an MDP that operates for two steps ($T = 2$). Then there is no loss of optimality in restricting attention to a Markov control strategy at time $t = 2$.
Proof
Fix $g_1$ and look at the problem of optimizing $g_2$. The total cost is

$$\mathbb{E}\big[ c_1(X_1, U_1) + c_2(X_2, U_2) \big].$$

The choice of $g_2$ does not influence the first term. So, for a fixed $g_1$, minimizing the total cost is equivalent to minimizing the second term. Now, from Blackwell's principle of irrelevant information, there exists a Markov control law $\hat g_2$ such that for any $g_2$

$$\mathbb{E}\big[ c_2\big(X_2, \hat g_2(X_2)\big) \big] \le \mathbb{E}\big[ c_2\big(X_2, g_2(X_{1:2}, U_1)\big) \big].$$
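As an illustrative sanity check (not from the notes), the following Python sketch enumerates every deterministic strategy on a tiny two-step MDP with made-up dynamics and costs and confirms that the best Markov strategy at time $t = 2$ achieves the same cost as the best history-dependent one.

```python
from itertools import product

X, U, W = [0, 1], [0, 1], [0, 1]          # state, action, noise alphabets
p_x1 = {0: 0.5, 1: 0.5}                    # distribution of X_1
p_w  = {0: 0.5, 1: 0.5}                    # distribution of W_1

def f(x, u, w):                            # toy dynamics X_2 = f(X_1, U_1, W_1)
    return (x + u + w) % 2

def c1(x, u):                              # toy stage costs
    return x + 0.3 * u

def c2(x, u):
    return (x - u) ** 2

def cost(g1, g2, markov):
    """Exact expected total cost of the strategy (g1, g2)."""
    J = 0.0
    for x1, w1 in product(X, W):
        u1 = g1[x1]
        x2 = f(x1, u1, w1)
        u2 = g2[x2] if markov else g2[(x1, u1, x2)]
        J += p_x1[x1] * p_w[w1] * (c1(x1, u1) + c2(x2, u2))
    return J

# All deterministic control laws of each type.
g1s        = [dict(zip(X, us)) for us in product(U, repeat=len(X))]
g2_markov  = [dict(zip(X, us)) for us in product(U, repeat=len(X))]
hist_keys  = list(product(X, U, X))
g2_history = [dict(zip(hist_keys, us)) for us in product(U, repeat=len(hist_keys))]

best_markov  = min(cost(g1, g2, True)  for g1 in g1s for g2 in g2_markov)
best_history = min(cost(g1, g2, False) for g1 in g1s for g2 in g2_history)
print(best_markov, best_history)   # the two optimal costs coincide
```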
Three-Step Lemma. Consider an MDP that operates for three steps ($T = 3$). Assume that the control law at time $t = 3$ is Markov, i.e., $U_3 = g_3(X_3)$. Then, there is no loss of optimality in restricting attention to a Markov control law at time $t = 2$.
Proof
Fix $g_1$ and $g_3$ and look at optimizing $g_2$. The total cost is

$$\mathbb{E}\big[ c_1(X_1, U_1) + c_2(X_2, U_2) + c_3(X_3, U_3) \big].$$

The choice of $g_2$ does not affect the first term. So, for a fixed $g_1$ and $g_3$, minimizing the total cost is the same as minimizing the last two terms. Let us look at the last term carefully. By the law of iterated expectations, we have

$$\mathbb{E}\big[ c_3(X_3, U_3) \big] = \mathbb{E}\Big[ \mathbb{E}\big[ c_3\big(X_3, g_3(X_3)\big) \,\big|\, X_2, U_2 \big] \Big].$$

Now,

$$\mathbb{E}\big[ c_3\big(X_3, g_3(X_3)\big) \,\big|\, X_2, U_2 \big]
= \mathbb{E}\big[ c_3\big( f_2(X_2, U_2, W_2), g_3( f_2(X_2, U_2, W_2) ) \big) \,\big|\, X_2, U_2 \big]
=: h(X_2, U_2),$$

where the last step uses the fact that $W_2$ is independent of $(X_2, U_2)$.
Thus, the total expected cost affected by the choice of $g_2$ can be written as

$$\mathbb{E}\big[ c_2(X_2, U_2) + h(X_2, U_2) \big] =: \mathbb{E}\big[ \tilde c_2(X_2, U_2) \big].$$

Now, by Blackwell's principle of irrelevant information, there exists a Markov control law $\hat g_2$ such that for any $g_2$, we have

$$\mathbb{E}\big[ \tilde c_2\big(X_2, \hat g_2(X_2)\big) \big] \le \mathbb{E}\big[ \tilde c_2\big(X_2, g_2(X_{1:2}, U_1)\big) \big].$$
Now we have enough background to present the proof of optimality of Markov strategies.
Now consider a system where we are using a Markov strategy at time $T$ (by the two-step lemma, applied after grouping times $1, \dots, T-1$ into a single first step, there is no loss of optimality in doing so). This system can be thought of as a three-step system where $t \in \{1, \dots, T-2\}$ corresponds to step 1, $t = T-1$ corresponds to step 2, and $t = T$ corresponds to step 3. Since the controller at time $T$ is Markov, the assumption of the three-step lemma is satisfied. Thus, by that lemma, there is no loss of optimality in restricting attention to Markov controllers at step 2 (i.e., at time $T-1$), i.e.,

$$U_{T-1} = g_{T-1}(X_{T-1}).$$
Now consider a system where we are using a Markov strategy at times $T-1$ and $T$. This can be thought of as a three-step system where $t \in \{1, \dots, T-3\}$ corresponds to step 1, $t = T-2$ corresponds to step 2, and $t \in \{T-1, T\}$ corresponds to step 3. Since the controllers at times $T-1$ and $T$ are Markov, the assumption of the three-step lemma is satisfied. Thus, by that lemma, there is no loss of optimality in restricting attention to Markov controllers at step 2 (i.e., at time $T-2$), i.e.,

$$U_{T-2} = g_{T-2}(X_{T-2}).$$
Proceeding this way, we continue to think of the system as a three-step system by different relabelings of time. Once we have shown that the controllers at times $t, t+1, \dots, T$ are Markov, we relabel time as follows: $s \in \{1, \dots, t-2\}$ corresponds to step 1, $s = t-1$ corresponds to step 2, and $s \in \{t, \dots, T\}$ corresponds to step 3. Since the controllers at times $t, \dots, T$ are Markov, the assumption of the three-step lemma is satisfied. Thus, by that lemma, there is no loss of optimality in restricting attention to Markov controllers at step 2 (i.e., at time $t-1$), i.e.,

$$U_{t-1} = g_{t-1}(X_{t-1}).$$
We have shown that there is no loss of optimality in restricting attention to Markov strategies. One of the advantages of Markov strategies is that it is easy to recursively compute their performance. In particular, given any Markov strategy $g = (g_1, \dots, g_T)$, define the cost-to-go functions as follows:

$$J_t(x; g) = \mathbb{E}\bigg[ \sum_{s=t}^{T} c_s\big(X_s, g_s(X_s)\big) \,\bigg|\, X_t = x \bigg].$$

Note that $J_t(x; g)$ only depends on the future strategy $(g_t, \dots, g_T)$. These functions can be computed recursively as follows:

$$J_T(x; g) = c_T\big(x, g_T(x)\big),$$

and for $t = T-1, \dots, 1$:

$$J_t(x; g) = c_t\big(x, g_t(x)\big) + \mathbb{E}\big[ J_{t+1}(X_{t+1}; g) \,\big|\, X_t = x, U_t = g_t(x) \big].$$
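The recursion above is ordinary backward policy evaluation. A minimal Python sketch for a finite toy model (the transition array `P`, the cost array `cost`, and the strategy `g` below are arbitrary illustrative data, not part of the notes) might look as follows:

```python
import numpy as np

# Toy finite model (illustrative): n states, m actions, horizon T.
n, m, T = 3, 2, 5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n), size=(n, m))   # P[x, u, y] = Pr(X_{t+1}=y | X_t=x, U_t=u)
cost = rng.random((n, m))                     # cost[x, u] = c(x, u), time-invariant here
g = rng.integers(m, size=(T, n))              # g[t, x] = Markov control law g_t(x)

# Backward recursion for the cost-to-go functions J_t(x; g).
J = np.zeros((T + 1, n))                      # J[T] = 0 acts as the terminal condition
for t in range(T - 1, -1, -1):
    for x in range(n):
        u = g[t, x]
        J[t, x] = cost[x, u] + P[x, u] @ J[t + 1]

print(J[0])    # J_1(x; g) for every state x
```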
The comparison principle. Recursively define the value functions

$$V_T(x) = \min_{u} c_T(x, u)$$

and, for $t = T-1, \dots, 1$,

$$V_t(x) = \min_{u} \Big\{ c_t(x, u) + \mathbb{E}\big[ V_{t+1}(X_{t+1}) \,\big|\, X_t = x, U_t = u \big] \Big\}.$$

Then, for any Markov strategy $g$, any time $t$, and any state $x$,

$$J_t(x; g) \ge V_t(x),$$

with equality at $t$ if and only if the future strategy $(g_t, \dots, g_T)$ satisfies the verification step: for all $s \ge t$,

$$g_s(x) \in \arg\min_{u} \Big\{ c_s(x, u) + \mathbb{E}\big[ V_{s+1}(X_{s+1}) \,\big|\, X_s = x, U_s = u \big] \Big\},$$

where $V_{T+1}(\cdot) \equiv 0$.
Note that the comparison principle immediately implies that the strategy obtained using dynamic
programming is optimal.
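For concreteness, here is a sketch of that dynamic program in Python for the same kind of finite toy model (again, `P` and `cost` are made-up data): it computes the value functions by backward induction and records a control law that satisfies the verification step at every time.

```python
import numpy as np

# Toy finite model (illustrative), same shape conventions as the previous sketch.
n, m, T = 3, 2, 5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n), size=(n, m))   # P[x, u, y] = Pr(X_{t+1}=y | X_t=x, U_t=u)
cost = rng.random((n, m))                     # cost[x, u] = c(x, u)

# Backward dynamic programming: value functions plus a verification-step policy.
V = np.zeros((T + 1, n))                      # V[T] = 0 is the terminal condition
g_opt = np.zeros((T, n), dtype=int)
for t in range(T - 1, -1, -1):
    Q = cost + P @ V[t + 1]                   # Q[x, u] = c(x, u) + E[V_{t+1} | x, u]
    g_opt[t] = Q.argmin(axis=1)               # verification step: pick a minimizer
    V[t] = Q.min(axis=1)

print(V[0])        # optimal expected cost-to-go from each initial state
print(g_opt[0])    # an optimal Markov control law at time 1
```

By the comparison principle, the cost-to-go array `J` of any Markov strategy computed as in the previous sketch satisfies `J[t] >= V[t]` componentwise, with equality when the strategy coincides with `g_opt`.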
The comparison principle also allows us to interpret the value functions: the value function at time $t$ is the minimum of all the cost-to-go functions over all future strategies, i.e., $V_t(x) = \min_{g} J_t(x; g)$. The comparison principle also allows us to interpret the optimal policy (the interpretation is due to Bellman and is colloquially called Bellman's principle of optimality).
Bellman’s principle of optimality. An optimal policy has the property that whatever the initial state and the
initial decisions are, the remaining decisions must constitute an optimal policy with regard to
the state resulting from the first decision.
Proof of the comparison principle. We proceed by backward induction. First consider $t = T$. For any Markov strategy $g$,

$$J_T(x; g) \overset{(a)}{=} c_T\big(x, g_T(x)\big) \overset{(b)}{\ge} \min_{u} c_T(x, u) \overset{(c)}{=} V_T(x),$$

where $(a)$ follows from the definition of $J_T$, $(b)$ follows from the definition of minimization, and $(c)$ follows from the definition of $V_T$. Equality holds in $(b)$ iff the policy $g_T$ is optimal, i.e., satisfies the verification step at time $T$. This result forms the basis of induction.
Now assume that the statement of the theorem is true for $t+1$. Then, for time $t$,

$$\begin{aligned}
J_t(x; g) &\overset{(a)}{=} c_t\big(x, g_t(x)\big) + \mathbb{E}\big[ J_{t+1}(X_{t+1}; g) \,\big|\, X_t = x, U_t = g_t(x) \big] \\
&\overset{(b)}{\ge} \min_{u} \Big\{ c_t(x, u) + \mathbb{E}\big[ J_{t+1}(X_{t+1}; g) \,\big|\, X_t = x, U_t = u \big] \Big\} \\
&\overset{(c)}{\ge} \min_{u} \Big\{ c_t(x, u) + \mathbb{E}\big[ V_{t+1}(X_{t+1}) \,\big|\, X_t = x, U_t = u \big] \Big\} \\
&\overset{(d)}{=} V_t(x),
\end{aligned}$$

where $(a)$ follows from the definition of $J_t$, $(b)$ follows from the definition of minimization, $(c)$ follows from the induction hypothesis, and $(d)$ follows from the definition of $V_t$. We have equality in step $(b)$ iff $g_t$ satisfies the verification step and we have equality in step $(c)$ iff the future strategy $(g_{t+1}, \dots, g_T)$ is optimal (this is part of the induction hypothesis). Thus, the result is true for time $t$ and, by the principle of induction, is true for all time.
Variations of a theme
Cost depends on next state
Exponential cost function
Multiplicative cost
Discounted cost
There are two interpretations of the discount factor $\beta \in (0, 1)$. The first interpretation is an economic one, used to determine the present value of a utility that will be received in the future. For example, suppose a decision maker is indifferent between receiving 1 dollar today or $s$ dollars tomorrow. This means that the decision maker discounts the future at a rate $1/s$, so $\beta = 1/s$.
The second interpretation of the discount factor is as follows. Suppose we are operating a machine that generates a value of \$1 each day. However, there is a probability $p$ that the machine will break down at the end of the day. Thus, the expected return for today is \$1 while the expected return for tomorrow is $(1-p)$ (which is the probability that the machine is still working tomorrow). In this case, the discount factor is defined as $\beta = 1 - p$.
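A quick worked instance of both interpretations (the numbers are purely illustrative): if the decision maker is indifferent between \$1 today and \$1.05 tomorrow, then

$$\beta = \frac{1}{s} = \frac{1}{1.05} \approx 0.952,$$

and if the machine breaks down with probability $p = 0.05$ at the end of each day, then

$$\beta = 1 - p = 0.95, \qquad \sum_{t=0}^{\infty} \beta^{t} = \frac{1}{1-\beta} = 20,$$

so the expected (discounted) value generated by the machine is \$20.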
Optimal stopping
Let be a Markov chain. At each time , a decision maker observes the state of the Markov chain
and decides whether to continue or stop the process. If the decision maker decides to continue, he incurs a
continuation cost and the state evolves. If the DM decides to stop, he incurs a stopping cost of and
the problem is terminated. The objective is to determine an optimal stopping time to minimize
Then, it can be shown that the value functions satisfy the following recursion:
For more details on optimal stopping problems, see Ferguson (2008).
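A minimal backward-induction sketch in Python (the Markov chain `P`, continuation cost `c`, and stopping cost `d` below are made-up toy data, not taken from Ferguson or the notes):

```python
import numpy as np

# Toy finite-horizon optimal stopping problem.
n, T = 4, 6                                   # number of states, horizon
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n), size=n)         # P[x, y] = Pr(X_{t+1}=y | X_t=x)
c = rng.random(n)                             # c[x] = continuation cost (time-invariant)
d = 2.0 * rng.random(n)                       # d[x] = stopping cost (time-invariant)

V = np.zeros((T, n))
stop = np.zeros((T, n), dtype=bool)           # stop[t, x]: is stopping optimal?
V[T - 1] = d                                  # at the horizon we must stop: V_T(x) = d_T(x)
stop[T - 1] = True
for t in range(T - 2, -1, -1):
    continue_value = c + P @ V[t + 1]         # c_t(x) + E[V_{t+1}(X_{t+1}) | X_t = x]
    stop[t] = d <= continue_value             # stop whenever stopping is (weakly) cheaper
    V[t] = np.minimum(d, continue_value)

print(V[0])       # optimal expected cost from each initial state
print(stop[0])    # the optimal stopping region at time 1
```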
Minimax setup
References
The proof idea for the optimality of Markov strategies is based on a proof by Witsenhausen (1979) on the
structure of optimal coding strategies for real-time communication. Note that the proof does not require us to
find a dynamic programming decomposition of the problem. This is in contrast with the standard textbook
proof where the optimality of Markov strategies is proved as part of the dynamic programming decomposition.
Witsenhausen, H.S. 1979. On the structure of real-time source coders. Bell System Technical Journal 58, 6, 1437–1451.
This entry was last updated on 04 Oct 2019 and posted in MDP and tagged markov strategies, dynamic programming, comparison
principle, principle of irrelevant information.