ECSE 506: Stochastic Control and Decision Theory


Aditya Mahajan
Winter 2020
About | Lectures | Notes | Coursework

Theory: Basic model of an MDP


Markov decision processes (MDPs) are the simplest model of a stochastic control system. The dynamic behavior of an MDP is modeled by an equation of the form
$$X_{t+1} = f_t(X_t, U_t, W_t),$$

where $X_t$ is the state, $U_t$ is the control input, and $W_t$ is the noise. An agent/controller observes the state $X_t$ and chooses the control input $U_t$.

The controller can be as sophisticated as we want. In principle, it can analyze the entire history of observations and control actions to choose the current control action. Thus, the control action can be written as
$$U_t = g_t(X_{1:t}, U_{1:t-1}),$$

where $X_{1:t}$ is a shorthand for $(X_1, \dots, X_t)$ and a similar interpretation holds for $U_{1:t-1}$. The function $g_t$ is called the control law at time $t$.

At each time, the system incurs a cost that may depend on the current state and control action. This cost is denoted by $c_t(X_t, U_t)$. The system operates for a time horizon $T$. During this time, it incurs a total cost
$$\sum_{t=1}^{T} c_t(X_t, U_t).$$

The initial state $X_1$ and the noise process $\{W_t\}_{t \ge 1}$ are random variables defined on a common probability space (these are called primitive random variables) and are mutually independent.
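
To make the model concrete, here is a minimal Python sketch that simulates one trajectory of such a system and estimates the expected total cost of a given strategy by Monte Carlo. The particular dynamics, noise distribution, cost, and control law below are arbitrary illustrative choices, not part of the model above.

```python
import random

T = 5  # horizon

def f(t, x, u, w):
    # Illustrative state update X_{t+1} = f_t(X_t, U_t, W_t): a scalar linear system.
    return x + u + w

def c(t, x, u):
    # Illustrative per-step cost c_t(X_t, U_t).
    return x**2 + u**2

def g(t, history):
    # A history-dependent control law: here it happens to look only at the latest state,
    # but in general it may use the entire history of states and actions.
    x_t = history[-1][0]
    return -0.5 * x_t

def simulate():
    x = random.gauss(0, 1)           # initial state X_1 (primitive random variable)
    history = [(x, None)]
    total_cost = 0.0
    for t in range(1, T + 1):
        u = g(t, history)            # U_t = g_t(X_{1:t}, U_{1:t-1})
        total_cost += c(t, x, u)     # accumulate c_t(X_t, U_t)
        w = random.gauss(0, 1)       # noise W_t, independent of everything else
        x = f(t, x, u, w)            # X_{t+1} = f_t(X_t, U_t, W_t)
        history.append((x, u))
    return total_cost

# Monte Carlo estimate of the expected total cost of the strategy g.
print(sum(simulate() for _ in range(10_000)) / 10_000)
```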

Suppose we have to design such a controller. We are told the probability distribution of the initial state and the noise. We are also told the system update functions $(f_1, \dots, f_T)$ and the cost functions $(c_1, \dots, c_T)$. We are asked to choose a control strategy $g = (g_1, \dots, g_T)$ to minimize the expected total cost
$$\mathbb{E}\Bigl[\,\sum_{t=1}^{T} c_t(X_t, U_t)\Bigr].$$

How should we proceed?

At first glance, the problem looks intimidating. It appears that we have to design a very sophisticated controller:
one that analyzes all past data to choose a control input. However, this is not the case. A remarkable result is
that the optimal controller can discard all past data and choose the control input based only on the current
state of the system. Formally, we have the following:

Optimality of Markov strategies. For the system model described above, there is no loss of optimality in choosing the control action according to
$$U_t = g_t(X_t).$$

Such a control strategy is called a Markov strategy.

The above result claims that the cost incurred by the best Markovian strategy is the same as the cost incurred by
the best history-dependent strategy. This appears to be a tall claim, so let's see how we can prove it. The main
idea of the proof is to repeatedly apply Blackwell's principle of irrelevant information (Blackwell 1964).
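
For reference, the principle may be paraphrased as follows (measurability conditions omitted): given random variables $X$ and $Y$ and a cost function $c$,
$$\text{for every } g : \mathcal{X} \times \mathcal{Y} \to \mathcal{U} \text{ there exists } \hat g : \mathcal{X} \to \mathcal{U} \text{ such that } \mathbb{E}\bigl[c(X, \hat g(X))\bigr] \le \mathbb{E}\bigl[c(X, g(X, Y))\bigr].$$
In words, when the cost depends only on $X$ and the action, the additional information $Y$ is irrelevant for choosing the action.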

Two-Step Lemma. Consider an MDP that operates for two steps ($T = 2$). Then there is no loss of optimality in restricting attention to a Markov control strategy at time $t = 2$.

Note that $g_1$ is Markov because it can only depend on $X_1$.

Proof
Fix $g_1$ and look at the problem of optimizing $g_2$. The total cost is
$$\mathbb{E}\bigl[c_1(X_1, U_1) + c_2(X_2, U_2)\bigr].$$

The choice of $g_2$ does not influence the first term. So, for a fixed $g_1$, minimizing the total cost is equivalent to minimizing the second term. Now, from Blackwell's principle of irrelevant information, there exists a Markov control law $\hat g_2$ such that for any $g_2$,
$$\mathbb{E}\bigl[c_2(X_2, \hat g_2(X_2))\bigr] \le \mathbb{E}\bigl[c_2(X_2, g_2(X_{1:2}, U_1))\bigr].$$

Three-Step Lemma. Consider an MDP that operates for three steps ($T = 3$). Assume that the control law at time $t = 3$ is Markov, i.e., $U_3 = g_3(X_3)$. Then, there is no loss of optimality in restricting attention to a Markov control law at time $t = 2$.

Proof
Fix $g_1$ and $g_3$ and look at optimizing $g_2$. The total cost is
$$\mathbb{E}\bigl[c_1(X_1, U_1) + c_2(X_2, U_2) + c_3(X_3, U_3)\bigr].$$

The choice of $g_2$ does not affect the first term. So, for fixed $g_1$ and $g_3$, minimizing the total cost is the same as minimizing the last two terms. Let us look at the last term carefully. By the law of iterated expectations, we have
$$\mathbb{E}\bigl[c_3(X_3, U_3)\bigr] = \mathbb{E}\Bigl[\,\mathbb{E}\bigl[c_3(X_3, g_3(X_3)) \,\big|\, X_2, U_2\bigr]\Bigr].$$

Now,
$$\mathbb{E}\bigl[c_3(X_3, g_3(X_3)) \,\big|\, X_2 = x_2, U_2 = u_2\bigr]
= \mathbb{E}\bigl[c_3\bigl(f_2(x_2, u_2, W_2), g_3(f_2(x_2, u_2, W_2))\bigr)\bigr] =: h(x_2, u_2).$$
The key point is that $h$ does not depend on $g_1$ or $g_2$; it is determined only by $f_2$, $c_3$, $g_3$, and the distribution of $W_2$.

Thus, the total expected cost affected by the choice of $g_2$ can be written as
$$\mathbb{E}\bigl[c_2(X_2, U_2) + h(X_2, U_2)\bigr].$$

Now, by Blackwell's principle of irrelevant information, there exists a Markov control law $\hat g_2$ such that for any $g_2$, we have
$$\mathbb{E}\bigl[c_2(X_2, \hat g_2(X_2)) + h(X_2, \hat g_2(X_2))\bigr] \le \mathbb{E}\bigl[c_2(X_2, g_2(X_{1:2}, U_1)) + h(X_2, g_2(X_{1:2}, U_1))\bigr].$$

Now we have enough background to present the proof of optimality of Markov strategies.

Proof of optimality of Markov strategies


The main idea is that any system can be thought of as a two- or three-step system by aggregating time. Suppose that the system operates for $T$ steps. It can be thought of as a two-step system in which $t \in \{1, \dots, T-1\}$ corresponds to step 1 and $t = T$ corresponds to step 2. From the two-step lemma, there is no loss of optimality in restricting attention to a Markov control law at step 2 (i.e., at time $T$), i.e.,
$$U_T = g_T(X_T).$$

Now consider a system where we are using a Markov strategy at time $T$. This system can be thought of as a three-step system in which $t \in \{1, \dots, T-2\}$ corresponds to step 1, $t = T-1$ corresponds to step 2, and $t = T$ corresponds to step 3. Since the controller at time $T$ is Markov, the assumption of the three-step lemma is satisfied. Thus, by that lemma, there is no loss of optimality in restricting attention to Markov controllers at step 2 (i.e., at time $T-1$), i.e.,
$$U_{T-1} = g_{T-1}(X_{T-1}).$$

Now consider a system where we are using a Markov strategy at times $T-1$ and $T$. This can be thought of as a three-step system in which $t \in \{1, \dots, T-3\}$ corresponds to step 1, $t = T-2$ corresponds to step 2, and $t \in \{T-1, T\}$ corresponds to step 3. Since the controllers at times $T-1$ and $T$ are Markov, the assumption of the three-step lemma is satisfied. Thus, by that lemma, there is no loss of optimality in restricting attention to Markov controllers at step 2 (i.e., at time $T-2$), i.e.,
$$U_{T-2} = g_{T-2}(X_{T-2}).$$

Proceeding this way, we continue to think of the system as a three-step system under different relabelings of time. Once we have shown that the controllers at times $s+1, \dots, T$ are Markov, we relabel time as follows: $t \in \{1, \dots, s-1\}$ corresponds to step 1, $t = s$ corresponds to step 2, and $t \in \{s+1, \dots, T\}$ corresponds to step 3. Since the controllers at times $s+1, \dots, T$ are Markov, the assumption of the three-step lemma is satisfied. Thus, by that lemma, there is no loss of optimality in restricting attention to Markov controllers at step 2 (i.e., at time $s$), i.e.,
$$U_s = g_s(X_s).$$

Proceeding in this manner until time $t = 2$ completes the proof; the control law at time $t = 1$ is Markov by definition, since it can only depend on $X_1$.

Performance of Markov strategies

We have shown that there is no loss of optimality in restricting attention to Markov strategies. One of the advantages of Markov strategies is that it is easy to recursively compute their performance. In particular, given any Markov strategy $g = (g_1, \dots, g_T)$, define the cost-to-go functions as follows:
$$J_t(x; g) = \mathbb{E}^{g}\Bigl[\,\sum_{s=t}^{T} c_s(X_s, U_s) \,\Big|\, X_t = x\Bigr].$$

Note that $J_t(x; g)$ only depends on the future strategy $(g_t, \dots, g_T)$. These functions can be computed recursively as follows:
$$J_T(x; g) = c_T(x, g_T(x))
\quad\text{and}\quad
J_t(x; g) = c_t(x, g_t(x)) + \mathbb{E}\bigl[J_{t+1}(X_{t+1}; g) \,\big|\, X_t = x, U_t = g_t(x)\bigr].$$
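
To illustrate the recursion, here is a minimal Python sketch for a finite MDP. The data structures (a time-homogeneous transition matrix P[u], a cost matrix c, and a randomly generated Markov strategy g) are illustrative assumptions; time is 0-indexed in the code ($t = 0, \dots, T-1$) while the notes use $t = 1, \dots, T$.

```python
import numpy as np

# Illustrative finite MDP: n states, m actions, time-homogeneous data for brevity.
n, m, T = 3, 2, 5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n), size=(m, n))   # P[u][x, y] = Pr(X_{t+1}=y | X_t=x, U_t=u)
c = rng.random((n, m))                        # c[x, u] = per-step cost
g = rng.integers(0, m, size=(T, n))           # a Markov strategy: g[t, x] = action at time t

# Backward recursion for the cost-to-go J_t(x; g).
J = np.zeros((T + 1, n))                      # J[T] = 0 plays the role of J_{T+1}
for t in range(T - 1, -1, -1):
    for x in range(n):
        u = g[t, x]
        J[t, x] = c[x, u] + P[u][x] @ J[t + 1]

print(J[0])   # cost-to-go of the strategy g from each state at the initial time
```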

Dynamic Programming Decomposition


Now we are ready to state the main result for MDPs.

Theorem (Dynamic program) Recursively define value functions $V_t : \mathcal{X} \to \mathbb{R}$ as follows:
$$V_{T+1}(x) = 0$$

and for $t \in \{T, T-1, \dots, 1\}$:
$$Q_t(x, u) = c_t(x, u) + \mathbb{E}\bigl[V_{t+1}(X_{t+1}) \,\big|\, X_t = x, U_t = u\bigr]$$

and define
$$V_t(x) = \min_{u} Q_t(x, u)$$

and the verification step
$$g_t(x) \in \arg\min_{u} Q_t(x, u).$$

Then, a Markov policy $g = (g_1, \dots, g_T)$ is optimal if and only if it satisfies the verification step for all $t$ and all $x$.
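
The backward induction in the theorem can be carried out numerically. A minimal Python sketch for the same kind of finite MDP as above (again with illustrative, time-homogeneous data) is:

```python
import numpy as np

n, m, T = 3, 2, 5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n), size=(m, n))   # P[u][x, y] = Pr(X_{t+1}=y | X_t=x, U_t=u)
c = rng.random((n, m))                        # c[x, u] = per-step cost

V = np.zeros((T + 1, n))                      # V[T] plays the role of V_{T+1} = 0
g = np.zeros((T, n), dtype=int)               # optimal Markov strategy
for t in range(T - 1, -1, -1):
    # Q_t(x, u) = c(x, u) + E[V_{t+1}(X_{t+1}) | X_t = x, U_t = u]
    Q = c + np.stack([P[u] @ V[t + 1] for u in range(m)], axis=1)
    V[t] = Q.min(axis=1)                      # V_t(x) = min_u Q_t(x, u)
    g[t] = Q.argmin(axis=1)                   # verification step: g_t(x) in argmin_u Q_t(x, u)

print(V[0], g[0])
```

By the theorem, any policy whose rows pick a minimizer of Q at every state and time is optimal.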

Instead of proving the above result, we prove a related result.

Theorem (The comparison principle) For any Markov strategy $g$,
$$J_t(x; g) \ge V_t(x),$$

with equality at time $t$ if and only if the future strategy $(g_t, g_{t+1}, \dots, g_T)$ satisfies the verification step
$$g_s(x) \in \arg\min_{u} Q_s(x, u) \quad \text{for all } s \ge t.$$

Note that the comparison principle immediately implies that the strategy obtained using dynamic
programming is optimal.

The comparison principle allows us to interpret the value functions: the value function at time $t$ is the minimum of the cost-to-go functions over all future strategies. It also gives an interpretation of the optimal policy (the interpretation is due to Bellman and is colloquially called Bellman's principle of optimality).

Bellman’s principle of optimality. An optimal policy has the property that whatever the initial state and the
initial decisions are, the remaining decisions must constitute an optimal policy with regard to
the state resulting from the first decision.

Proof of the comparison principle


The proof proceeds by backward induction. Consider any Markov strategy $g = (g_1, \dots, g_T)$. For $t = T$,
$$J_T(x; g) \overset{(a)}{=} c_T(x, g_T(x)) \overset{(b)}{\ge} \min_{u} c_T(x, u) \overset{(c)}{=} V_T(x),$$

where $(a)$ follows from the definition of $J_T$, $(b)$ follows from the definition of minimization, and $(c)$ follows from the definition of $V_T$ (recall that $V_{T+1} \equiv 0$, so $Q_T = c_T$). Equality holds in $(b)$ iff the policy $g_T$ is optimal, i.e., satisfies the verification step at time $T$. This result forms the basis of induction.

Now assume that the statement of the theorem is true for $t+1$. Then, for time $t$,
$$J_t(x; g)
\overset{(a)}{=} c_t(x, g_t(x)) + \mathbb{E}\bigl[J_{t+1}(X_{t+1}; g) \,\big|\, X_t = x, U_t = g_t(x)\bigr]
\overset{(b)}{\ge} c_t(x, g_t(x)) + \mathbb{E}\bigl[V_{t+1}(X_{t+1}) \,\big|\, X_t = x, U_t = g_t(x)\bigr]
\overset{(c)}{\ge} \min_{u} Q_t(x, u)
\overset{(d)}{=} V_t(x),$$

where $(a)$ follows from the recursive definition of $J_t$, $(b)$ follows from the induction hypothesis, $(c)$ follows from the definition of minimization (the middle expression equals $Q_t(x, g_t(x))$), and $(d)$ follows from the definition of $V_t$. We have equality in step $(c)$ iff $g_t$ satisfies the verification step at time $t$, and equality in step $(b)$ iff the future strategy $(g_{t+1}, \dots, g_T)$ is optimal (this is part of the induction hypothesis). Thus, the result is true for time $t$ and, by the principle of induction, is true for all time.

Variations of a theme
Cost depends on next state
Exponential cost function
Multiplicative cost
Discounted cost
TODO: Explain this.

There are two interpretations of the discount factor $\beta \in (0, 1)$. The first interpretation is an economic one: it captures the present value of a utility that will be received in the future. For example, suppose a decision maker is indifferent between receiving 1 dollar today or $s$ dollars tomorrow. This means that the decision maker discounts the future at a rate $1/s$, so $\beta = 1/s$.

The second interpretation of the discount factor is as follows. Suppose we are operating a machine that generates a value of $1 each day. However, there is a probability $p$ that the machine will break down at the end of the day. Thus, the expected return for today is $1 while the expected return for tomorrow is $(1-p)$ (which is the probability that the machine is still working tomorrow). In this case, the discount factor is defined as $\beta = 1 - p$.
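
For instance, under the second interpretation the expected total value generated by the machine over its random lifetime is the geometric sum
$$\sum_{t=0}^{\infty} (1 - p)^t \cdot 1 = \frac{1}{p} = \frac{1}{1 - \beta},$$
which is exactly the infinite-horizon value of a reward of 1 per day discounted at rate $\beta = 1 - p$.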

Optimal stopping
Let $\{X_t\}_{t \ge 1}$ be a Markov chain. At each time $t$, a decision maker observes the state $X_t$ of the Markov chain and decides whether to continue or stop the process. If the decision maker decides to continue, he incurs a continuation cost $c_t(X_t)$ and the state evolves. If the DM decides to stop, he incurs a stopping cost of $s_t(X_t)$ and the problem is terminated. The objective is to determine an optimal stopping time $\tau$ to minimize
$$\mathbb{E}\Bigl[\,\sum_{t=1}^{\tau - 1} c_t(X_t) + s_\tau(X_\tau)\Bigr].$$

Such problems are called optimal stopping problems.

Define the cost-to-go function of any stopping rule $\tau$ as
$$J_t(x; \tau) = \mathbb{E}\Bigl[\,\sum_{\ell=t}^{\tau - 1} c_\ell(X_\ell) + s_\tau(X_\tau) \,\Big|\, X_t = x\Bigr]$$

and the value function as
$$V_t(x) = \min_{\tau} J_t(x; \tau).$$

Then, it can be shown that the value functions satisfy the following recursion:

Dynamic Program for optimal stopping
$$V_T(x) = s_T(x)
\quad\text{and}\quad
V_t(x) = \min\bigl\{\, s_t(x),\; c_t(x) + \mathbb{E}\bigl[V_{t+1}(X_{t+1}) \,\big|\, X_t = x\bigr] \,\bigr\}, \quad t < T,$$
where the first term inside the minimum corresponds to stopping and the second to continuing.
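
As an illustration, here is a minimal Python sketch of this recursion for a finite-state Markov chain. The transition matrix and the continuation and stopping costs are arbitrary illustrative data, and stopping is forced at the final time.

```python
import numpy as np

n, T = 4, 6                                   # number of states, horizon
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n), size=n)         # P[x, y] = Pr(X_{t+1} = y | X_t = x)
cont = rng.random(n)                          # c(x): continuation cost
stop = rng.random(n)                          # s(x): stopping cost

V = np.zeros((T + 1, n))
V[T] = stop                                   # stopping is forced at the final time
for t in range(T - 1, -1, -1):
    continue_value = cont + P @ V[t + 1]      # c(x) + E[V_{t+1}(X_{t+1}) | X_t = x]
    V[t] = np.minimum(stop, continue_value)   # stop whenever s(x) is the smaller of the two

print(V[0])                                   # value of each starting state at the initial time
```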

For more details on optimal stopping problems, see Ferguson (2008).

Minimax setup
References
The proof idea for the optimality of Markov strategies is based on a proof by Witsenhausen (1979) on the
structure of optimal coding strategies for real-time communication. Note that the proof does not require us to
find a dynamic programming decomposition of the problem. This is in contrast with the standard textbook
proof where the optimality of Markov strategies is proved as part of the dynamic programming decomposition.
Blackwell, D. 1964. Memoryless strategies in finite-stage dynamic programming. The Annals of Mathematical Statistics 35, 2, 863–865. DOI: 10.1214/aoms/1177703586.

Ferguson, T.S. 2008. Optimal stopping and applications. Available at: http://www.math.ucla.edu/~tom/Stopping/Contents.html.

Witsenhausen, H.S. 1979. On the structure of real-time source coders. Bell System Technical Journal 58, 6, 1437–1451.

This entry was last updated on 04 Oct 2019 and posted in MDP and tagged markov strategies, dynamic programming, comparison
principle, principle of irrelevant information.
