03 Basic Model of An MDP
Consider a dynamical system that evolves as

$$X_{t+1} = f_t(X_t, U_t, W_t), \qquad t = 1, \dots, T,$$

where $X_t$ is the state, $U_t$ is the control input, and $W_t$ is the noise. An agent/controller observes the state $X_t$ and chooses the control input $U_t$.
The controller can be as sophisticated as we want. In principle, it can analyze the entire history of observations and control actions to choose the current control action. Thus, the control action can be written as

$$U_t = g_t(X_{1:t}, U_{1:t-1}),$$

where $X_{1:t}$ is a shorthand for $(X_1, \dots, X_t)$ and a similar interpretation holds for $U_{1:t-1}$. The function $g_t$ is called the control law at time $t$.
At each time, the system incurs a cost that may depend on the current state and control action. This cost is denoted by $c_t(X_t, U_t)$. The system operates for a time horizon $T$. During this time, it incurs a total cost

$$\sum_{t=1}^{T} c_t(X_t, U_t).$$
The initial state $X_1$ and the noise process $\{W_t\}_{t \ge 1}$ are random variables defined on a common probability space (these are called primitive random variables) and are mutually independent.
Suppose we have to design such a controller. We are told the probability distribution of the initial state and the noise. We are also told the system update functions $(f_1, \dots, f_T)$ and the cost functions $(c_1, \dots, c_T)$. We are asked to choose a control strategy $g = (g_1, \dots, g_T)$ to minimize the expected total cost

$$J(g) := \mathbb{E}\bigg[ \sum_{t=1}^{T} c_t(X_t, U_t) \bigg].$$
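To make the setup concrete, here is a small Monte Carlo sketch in Python (not part of the original notes) that estimates $J(g)$ for a history-dependent strategy on a made-up two-state system; the functions `f`, `c`, and `history_strategy` below are illustrative placeholders, not anything prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 3  # horizon

# Toy two-state, two-action system (all names and numbers are illustrative).
def f(t, x, u, w):
    """System update X_{t+1} = f_t(X_t, U_t, W_t)."""
    return (x + u + w) % 2

def c(t, x, u):
    """Per-stage cost c_t(X_t, U_t)."""
    return float(x) + 0.5 * u

def history_strategy(t, x_hist, u_hist):
    """A history-dependent control law U_t = g_t(X_{1:t}, U_{1:t-1})."""
    return int(sum(x_hist) % 2 == 0)

def expected_total_cost(strategy, n_samples=100_000):
    """Monte Carlo estimate of J(g) = E[ sum_t c_t(X_t, U_t) ]."""
    total = 0.0
    for _ in range(n_samples):
        x = rng.integers(2)          # initial state X_1
        x_hist, u_hist = [x], []
        for t in range(1, T + 1):
            u = strategy(t, x_hist, u_hist)
            total += c(t, x, u)
            w = rng.integers(2)      # noise W_t
            x = f(t, x, u, w)
            x_hist.append(x)
            u_hist.append(u)
    return total / n_samples

print(expected_total_cost(history_strategy))
```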
At first glance, the problem looks intimidating. It appears that we have to design a very sophisticated controller:
one that analyzes all past data to choose a control input. However, this is not the case. A remarkable result is
that the optimal control station can discard all past data and choose the control input based only on the current
state of the system. Formally, we have the following:
Optimality of Markov strategies. For the system model described above, there is no loss of optimality in choosing the control action according to

$$U_t = g_t(X_t), \qquad t = 1, \dots, T.$$
The above result claims that the cost incurred by the best Markovian strategy is the same as the cost incurred by the best history-dependent strategy. This appears to be a tall claim, so let's see how we can prove it. The main idea of the proof is to repeatedly apply Blackwell's principle of irrelevant information (1964).
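For ease of reference, one standard way of stating the principle (paraphrased here, with the technical measurability and integrability conditions omitted) is the following. Let $X$ and $Y$ be random variables and let $c(x, u)$ be a cost function that does not depend on $y$. Then, for any control law $u = g(x, y)$, there exists a control law $u = h(x)$ such that

$$\mathbb{E}\big[ c\big(X, h(X)\big) \big] \le \mathbb{E}\big[ c\big(X, g(X, Y)\big) \big].$$

In words, the "irrelevant" information $Y$ may be discarded without any loss in performance.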
Two-Step Lemma. Consider an MDP that operates for two steps ($T = 2$). Then there is no loss of optimality in restricting attention to a Markov control strategy at time $t = 2$.
Proof
Fix $g_1$ and look at the problem of optimizing $g_2$. The total cost is

$$\mathbb{E}\big[ c_1(X_1, U_1) + c_2(X_2, U_2) \big].$$

The choice of $g_2$ does not influence the first term. So, for a fixed $g_1$, minimizing the total cost is equivalent to minimizing the second term. Now, from Blackwell's principle of irrelevant information, there exists a Markov control law $\hat g_2$ such that for any $g_2$

$$\mathbb{E}\big[ c_2\big(X_2, \hat g_2(X_2)\big) \big] \le \mathbb{E}\big[ c_2\big(X_2, g_2(X_{1:2}, U_1)\big) \big].$$
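As an illustrative sanity check (not from the notes), the following Python sketch enumerates every deterministic strategy on a tiny two-step MDP with made-up dynamics and costs and confirms that the best Markov strategy at time $t = 2$ achieves the same cost as the best history-dependent one.

```python
from itertools import product

X, U, W = [0, 1], [0, 1], [0, 1]          # state, action, noise alphabets
p_x1 = {0: 0.5, 1: 0.5}                    # distribution of X_1
p_w  = {0: 0.5, 1: 0.5}                    # distribution of W_1

def f(x, u, w):                            # toy dynamics X_2 = f(X_1, U_1, W_1)
    return (x + u + w) % 2

def c1(x, u):                              # toy stage costs
    return x + 0.3 * u

def c2(x, u):
    return (x - u) ** 2

def cost(g1, g2, markov):
    """Exact expected total cost of the strategy (g1, g2)."""
    J = 0.0
    for x1, w1 in product(X, W):
        u1 = g1[x1]
        x2 = f(x1, u1, w1)
        u2 = g2[x2] if markov else g2[(x1, u1, x2)]
        J += p_x1[x1] * p_w[w1] * (c1(x1, u1) + c2(x2, u2))
    return J

# All deterministic control laws of each type.
g1s        = [dict(zip(X, us)) for us in product(U, repeat=len(X))]
g2_markov  = [dict(zip(X, us)) for us in product(U, repeat=len(X))]
hist_keys  = list(product(X, U, X))
g2_history = [dict(zip(hist_keys, us)) for us in product(U, repeat=len(hist_keys))]

best_markov  = min(cost(g1, g2, True)  for g1 in g1s for g2 in g2_markov)
best_history = min(cost(g1, g2, False) for g1 in g1s for g2 in g2_history)
print(best_markov, best_history)   # the two optimal costs coincide
```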
Three-Step Lemma. Consider an MDP that operates for three steps ($T = 3$). Assume that the control law at time $t = 3$ is Markov, i.e., $U_3 = g_3(X_3)$. Then, there is no loss of optimality in restricting attention to a Markov control law at time $t = 2$.
Proof
Fix $g_1$ and $g_3$ and look at optimizing $g_2$. The total cost is

$$\mathbb{E}\big[ c_1(X_1, U_1) + c_2(X_2, U_2) + c_3(X_3, U_3) \big].$$

The choice of $g_2$ does not affect the first term. So, for a fixed $g_1$ and $g_3$, minimizing the total cost is the same as minimizing the last two terms. Let us look at the last term carefully. By the law of iterated expectations, we have

$$\mathbb{E}\big[ c_3(X_3, U_3) \big] = \mathbb{E}\Big[ \mathbb{E}\big[ c_3\big(X_3, g_3(X_3)\big) \,\big|\, X_2, U_2 \big] \Big].$$

Now,

$$\mathbb{E}\big[ c_3\big(X_3, g_3(X_3)\big) \,\big|\, X_2, U_2 \big]
= \mathbb{E}\big[ c_3\big( f_2(X_2, U_2, W_2), g_3( f_2(X_2, U_2, W_2) ) \big) \,\big|\, X_2, U_2 \big]
=: h(X_2, U_2),$$

where the last step uses the fact that $W_2$ is independent of $(X_2, U_2)$.
Thus, the total expected cost affected by the choice of $g_2$ can be written as

$$\mathbb{E}\big[ c_2(X_2, U_2) + h(X_2, U_2) \big] =: \mathbb{E}\big[ \tilde c_2(X_2, U_2) \big].$$

Now, by Blackwell's principle of irrelevant information, there exists a Markov control law $\hat g_2$ such that for any $g_2$, we have

$$\mathbb{E}\big[ \tilde c_2\big(X_2, \hat g_2(X_2)\big) \big] \le \mathbb{E}\big[ \tilde c_2\big(X_2, g_2(X_{1:2}, U_1)\big) \big].$$
Now we have enough background to present the proof of optimality of Markov strategies.
Now consider a system where we are using a Markov strategy at time $T$ (by the two-step lemma, applied after grouping times $1, \dots, T-1$ into a single first step, there is no loss of optimality in doing so). This system can be thought of as a three-step system where $t \in \{1, \dots, T-2\}$ corresponds to step 1, $t = T-1$ corresponds to step 2, and $t = T$ corresponds to step 3. Since the controller at time $T$ is Markov, the assumption of the three-step lemma is satisfied. Thus, by that lemma, there is no loss of optimality in restricting attention to Markov controllers at step 2 (i.e., at time $T-1$), i.e.,

$$U_{T-1} = g_{T-1}(X_{T-1}).$$
Now consider a system where we are using a Markov strategy at times $T-1$ and $T$. This can be thought of as a three-step system where $t \in \{1, \dots, T-3\}$ corresponds to step 1, $t = T-2$ corresponds to step 2, and $t \in \{T-1, T\}$ corresponds to step 3. Since the controllers at times $T-1$ and $T$ are Markov, the assumption of the three-step lemma is satisfied. Thus, by that lemma, there is no loss of optimality in restricting attention to Markov controllers at step 2 (i.e., at time $T-2$), i.e.,

$$U_{T-2} = g_{T-2}(X_{T-2}).$$
Proceeding this way, we continue to think of the system as a three-step system by different relabelings of time. Once we have shown that the controllers at times $t, t+1, \dots, T$ are Markov, we relabel time as follows: $s \in \{1, \dots, t-2\}$ corresponds to step 1, $s = t-1$ corresponds to step 2, and $s \in \{t, \dots, T\}$ corresponds to step 3. Since the controllers at times $t, \dots, T$ are Markov, the assumption of the three-step lemma is satisfied. Thus, by that lemma, there is no loss of optimality in restricting attention to Markov controllers at step 2 (i.e., at time $t-1$), i.e.,

$$U_{t-1} = g_{t-1}(X_{t-1}).$$
We have shown that there is no loss of optimality in restricting attention to Markov strategies. One of the advantages of Markov strategies is that it is easy to recursively compute their performance. In particular, given any Markov strategy $g = (g_1, \dots, g_T)$, define the cost-to-go functions as follows:

$$J_t(x; g) = \mathbb{E}\bigg[ \sum_{s=t}^{T} c_s\big(X_s, g_s(X_s)\big) \,\bigg|\, X_t = x \bigg].$$

Note that $J_t(x; g)$ only depends on the future strategy $(g_t, \dots, g_T)$. These functions can be computed recursively as follows:

$$J_T(x; g) = c_T\big(x, g_T(x)\big),$$

and for $t = T-1, \dots, 1$:

$$J_t(x; g) = c_t\big(x, g_t(x)\big) + \mathbb{E}\big[ J_{t+1}(X_{t+1}; g) \,\big|\, X_t = x, U_t = g_t(x) \big].$$
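The recursion above is ordinary backward policy evaluation. A minimal Python sketch for a finite toy model (the transition array `P`, the cost array `cost`, and the strategy `g` below are arbitrary illustrative data, not part of the notes) might look as follows:

```python
import numpy as np

# Toy finite model (illustrative): n states, m actions, horizon T.
n, m, T = 3, 2, 5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n), size=(n, m))   # P[x, u, y] = Pr(X_{t+1}=y | X_t=x, U_t=u)
cost = rng.random((n, m))                     # cost[x, u] = c(x, u), time-invariant here
g = rng.integers(m, size=(T, n))              # g[t, x] = Markov control law g_t(x)

# Backward recursion for the cost-to-go functions J_t(x; g).
J = np.zeros((T + 1, n))                      # J[T] = 0 acts as the terminal condition
for t in range(T - 1, -1, -1):
    for x in range(n):
        u = g[t, x]
        J[t, x] = cost[x, u] + P[x, u] @ J[t + 1]

print(J[0])    # J_1(x; g) for every state x
```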
The comparison principle. Recursively define the value functions

$$V_T(x) = \min_{u} c_T(x, u)$$

and, for $t = T-1, \dots, 1$,

$$V_t(x) = \min_{u} \Big\{ c_t(x, u) + \mathbb{E}\big[ V_{t+1}(X_{t+1}) \,\big|\, X_t = x, U_t = u \big] \Big\}.$$

Then, for any Markov strategy $g$, any time $t$, and any state $x$,

$$J_t(x; g) \ge V_t(x),$$

with equality at $t$ if and only if the future strategy $(g_t, \dots, g_T)$ satisfies the verification step: for all $s \ge t$,

$$g_s(x) \in \arg\min_{u} \Big\{ c_s(x, u) + \mathbb{E}\big[ V_{s+1}(X_{s+1}) \,\big|\, X_s = x, U_s = u \big] \Big\},$$

where $V_{T+1}(\cdot) \equiv 0$.
Note that the comparison principle immediately implies that the strategy obtained using dynamic
programming is optimal.
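For concreteness, here is a sketch of that dynamic program in Python for the same kind of finite toy model (again, `P` and `cost` are made-up data): it computes the value functions by backward induction and records a control law that satisfies the verification step at every time.

```python
import numpy as np

# Toy finite model (illustrative), same shape conventions as the previous sketch.
n, m, T = 3, 2, 5
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n), size=(n, m))   # P[x, u, y] = Pr(X_{t+1}=y | X_t=x, U_t=u)
cost = rng.random((n, m))                     # cost[x, u] = c(x, u)

# Backward dynamic programming: value functions plus a verification-step policy.
V = np.zeros((T + 1, n))                      # V[T] = 0 is the terminal condition
g_opt = np.zeros((T, n), dtype=int)
for t in range(T - 1, -1, -1):
    Q = cost + P @ V[t + 1]                   # Q[x, u] = c(x, u) + E[V_{t+1} | x, u]
    g_opt[t] = Q.argmin(axis=1)               # verification step: pick a minimizer
    V[t] = Q.min(axis=1)

print(V[0])        # optimal expected cost-to-go from each initial state
print(g_opt[0])    # an optimal Markov control law at time 1
```

By the comparison principle, the cost-to-go array `J` of any Markov strategy computed as in the previous sketch satisfies `J[t] >= V[t]` componentwise, with equality when the strategy coincides with `g_opt`.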
The comparison principle also allows us to interpret the value functions: the value function at time $t$ is the minimum of all the cost-to-go functions over all future strategies, i.e., $V_t(x) = \min_{g} J_t(x; g)$. The comparison principle also allows us to interpret the optimal policy (the interpretation is due to Bellman and is colloquially called Bellman's principle of optimality).
Bellman’s principle of optimality. An optimal policy has the property that whatever the initial state and the
initial decisions are, the remaining decisions must constitute an optimal policy with regard to
the state resulting from the first decision.
Proof of the comparison principle. We proceed by backward induction. First consider $t = T$. For any Markov strategy $g$,

$$J_T(x; g) \overset{(a)}{=} c_T\big(x, g_T(x)\big) \overset{(b)}{\ge} \min_{u} c_T(x, u) \overset{(c)}{=} V_T(x),$$

where $(a)$ follows from the definition of $J_T$, $(b)$ follows from the definition of minimization, and $(c)$ follows from the definition of $V_T$. Equality holds in $(b)$ iff the policy $g_T$ is optimal, i.e., satisfies the verification step at time $T$. This result forms the basis of induction.
Now assume that the statement of the theorem is true for $t+1$. Then, for time $t$,

$$\begin{aligned}
J_t(x; g) &\overset{(a)}{=} c_t\big(x, g_t(x)\big) + \mathbb{E}\big[ J_{t+1}(X_{t+1}; g) \,\big|\, X_t = x, U_t = g_t(x) \big] \\
&\overset{(b)}{\ge} \min_{u} \Big\{ c_t(x, u) + \mathbb{E}\big[ J_{t+1}(X_{t+1}; g) \,\big|\, X_t = x, U_t = u \big] \Big\} \\
&\overset{(c)}{\ge} \min_{u} \Big\{ c_t(x, u) + \mathbb{E}\big[ V_{t+1}(X_{t+1}) \,\big|\, X_t = x, U_t = u \big] \Big\} \\
&\overset{(d)}{=} V_t(x),
\end{aligned}$$

where $(a)$ follows from the definition of $J_t$, $(b)$ follows from the definition of minimization, $(c)$ follows from the induction hypothesis, and $(d)$ follows from the definition of $V_t$. We have equality in step $(b)$ iff $g_t$ satisfies the verification step and we have equality in step $(c)$ iff the future strategy $(g_{t+1}, \dots, g_T)$ is optimal (this is part of the induction hypothesis). Thus, the result is true for time $t$ and, by the principle of induction, is true for all time.
Variations of a theme
Cost depends on next state
Exponential cost function
Multiplicative cost
Discounted cost
There are two interpretations of the discount factor $\beta \in (0, 1)$. The first interpretation is an economic one, used to determine the present value of a utility that will be received in the future. For example, suppose a decision maker is indifferent between receiving 1 dollar today or $s$ dollars tomorrow. This means that the decision maker discounts the future at a rate $1/s$, so $\beta = 1/s$.
The second interpretation of the discount factor is as follows. Suppose we are operating a machine that generates a value of \$1 each day. However, there is a probability $p$ that the machine will break down at the end of the day. Thus, the expected return for today is \$1 while the expected return for tomorrow is $(1-p)$ (which is the probability that the machine is still working tomorrow). In this case, the discount factor is defined as $\beta = 1 - p$.
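A quick worked instance of both interpretations (the numbers are purely illustrative): if the decision maker is indifferent between \$1 today and \$1.05 tomorrow, then

$$\beta = \frac{1}{s} = \frac{1}{1.05} \approx 0.952,$$

and if the machine breaks down with probability $p = 0.05$ at the end of each day, then

$$\beta = 1 - p = 0.95, \qquad \sum_{t=0}^{\infty} \beta^{t} = \frac{1}{1-\beta} = 20,$$

so the expected (discounted) value generated by the machine is \$20.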
Optimal stopping
Let be a Markov chain. At each time , a decision maker observes the state of the Markov chain
and decides whether to continue or stop the process. If the decision maker decides to continue, he incurs a
continuation cost and the state evolves. If the DM decides to stop, he incurs a stopping cost of and
the problem is terminated. The objective is to determine an optimal stopping time to minimize
Then, it can be shown that the value functions satisfy the following recursion:
For more details on optimal stopping problems, see Ferguson (2008).
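A minimal backward-induction sketch in Python (the Markov chain `P`, continuation cost `c`, and stopping cost `d` below are made-up toy data, not taken from Ferguson or the notes):

```python
import numpy as np

# Toy finite-horizon optimal stopping problem.
n, T = 4, 6                                   # number of states, horizon
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n), size=n)         # P[x, y] = Pr(X_{t+1}=y | X_t=x)
c = rng.random(n)                             # c[x] = continuation cost (time-invariant)
d = 2.0 * rng.random(n)                       # d[x] = stopping cost (time-invariant)

V = np.zeros((T, n))
stop = np.zeros((T, n), dtype=bool)           # stop[t, x]: is stopping optimal?
V[T - 1] = d                                  # at the horizon we must stop: V_T(x) = d_T(x)
stop[T - 1] = True
for t in range(T - 2, -1, -1):
    continue_value = c + P @ V[t + 1]         # c_t(x) + E[V_{t+1}(X_{t+1}) | X_t = x]
    stop[t] = d <= continue_value             # stop whenever stopping is (weakly) cheaper
    V[t] = np.minimum(d, continue_value)

print(V[0])       # optimal expected cost from each initial state
print(stop[0])    # the optimal stopping region at time 1
```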
Minimax setup
References
The proof idea for the optimality of Markov strategies is based on a proof by Witsenhausen (1979) on the
structure of optimal coding strategies for real-time communication. Note that the proof does not require us to
find a dynamic programming decomposition of the problem. This is in contrast with the standard textbook
proof where the optimality of Markov strategies is proved as part of the dynamic programming decomposition.
Witsenhausen, H.S. 1979. On the structure of real-time source coders. Bell System Technical Journal 58, 6, 1437–1451.
This entry was last updated on 04 Oct 2019 and posted in MDP and tagged markov strategies, dynamic programming, comparison
principle, principle of irrelevant information.