16.410/413
Principles of Autonomy and Decision Making
Lecture 22: Markov Decision Processes I
Emilio Frazzoli
Readings
Lecture notes
[AIMA] Ch. 17.1-3.
Notice that the actual sequence of states, and hence the actual total
reward, is unknown a priori.
We could choose a plan, i.e., a sequence of actions A = (a_1, a_2, ...).
In this case the transition probabilities are fixed, and one can compute the
probability of being at any given state at each time step (in a similar way
to the forward algorithm for HMMs), and hence compute the expected reward:
E[R(s_t, a_t, s_{t+1}) \mid s_t, a_t] = \sum_{s \in S} T(s_t, a_t, s) \, R(s_t, a_t, s)
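As a concrete illustration (not part of the original notes), here is a minimal Python sketch of this idea: the state distribution is propagated forward under a fixed plan, as in the HMM forward algorithm, and the expected reward is accumulated along the way. The dictionary-based encoding of T and R and the function name expected_plan_reward are assumptions made for the example.

def expected_plan_reward(T, R, plan, init_dist, gamma=1.0):
    """E[sum_t gamma^t R(s_t, a_t, s_{t+1})] for a fixed action sequence."""
    dist = dict(init_dist)                        # P(s_0)
    total = 0.0
    for t, a in enumerate(plan):
        new_dist = {}
        for s, p_s in dist.items():
            for s2, p in T[(s, a)].items():       # forward step over successors
                new_dist[s2] = new_dist.get(s2, 0.0) + p_s * p
                total += gamma**t * p_s * p * R[(s, a, s2)]
        dist = new_dist                           # P(s_{t+1})
    return total

# Tiny illustrative example: action 'go' moves A -> B with probability 0.75;
# a reward of 1 is collected on reaching B.
T = {('A', 'go'): {'B': 0.75, 'A': 0.25},
     ('B', 'go'): {'B': 1.0}}
R = {('A', 'go', 'B'): 1.0, ('A', 'go', 'A'): 0.0, ('B', 'go', 'B'): 0.0}
print(expected_plan_reward(T, R, plan=['go', 'go'], init_dist={'A': 1.0}))   # 0.9375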
The border cells and some of the interior cells are "obstacles".
A reward of 1 is collected when reaching the bottom right feasible cell. The
discount factor is 0.9.
At each non-obstacle cell, the agent can attempt to move to any of the
neighboring cells. The move will be successful with probability 3/4.
Otherwise the agent will move to a different neighboring cell, with equal
probability.
The agent always has the option to stay put, which will succeed with
certainty.
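The following Python sketch shows one way the dynamics described above could be encoded as a transition model. The 4-connected notion of "neighboring cells", the handling of the case with no alternative neighbor, and the cell layout in the example are assumptions for illustration; they are not specified in the notes.

def neighbors(cell, free_cells):
    """Feasible 4-connected neighbors of a cell (assumed neighborhood)."""
    r, c = cell
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [n for n in candidates if n in free_cells]

def transition(cell, target, free_cells):
    """Distribution over next cells when attempting to move from cell to target."""
    if target == cell:                    # staying put always succeeds
        return {cell: 1.0}
    others = [n for n in neighbors(cell, free_cells) if n != target]
    dist = {target: 0.75}                 # intended move succeeds with probability 3/4
    if others:
        for n in others:                  # otherwise, a different neighbor, uniformly at random
            dist[n] = 0.25 / len(others)
    else:
        dist[cell] = 0.25                 # assumption: with no alternative neighbor, the agent stays
    return dist

# Example on a small block of free cells (layout is illustrative only):
free_cells = {(r, c) for r in range(1, 4) for c in range(1, 4)}
print(transition((2, 2), (2, 3), free_cells))
# {(2, 3): 0.75, (1, 2): 0.0833..., (3, 2): 0.0833..., (2, 1): 0.0833...}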
Initial condition:
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
After 1 iteration:
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0.75 0
0 0 0 0 0 0 0 0.75 1 0
0 0 0 0 0 0 0 0 0 0
After 2 iterations:
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0.51 0
0 0 0 0 0 0 0 0.56 1.43 0
0 0 0 0 0 0 0 1.43 1.9 0
0 0 0 0 0 0 0 0 0 0
After 50 iterations:
0 0 0 0 0 0 0 0 0 0
0 0.44 0.54 0.59 0.82 1.15 0.85 1.09 1.52 0
0 0.59 0.69 0 0 1.52 0 0 2.13 0
0 0.75 0.90 0 0 2.12 2.55 2.98 3.00 0
0 0.95 1.18 0 2.00 2.70 3.22 3.80 3.88 0
0 1.20 1.55 1.87 2.41 2.92 3.51 4.52 5.00 0
0 1.15 1.47 1.74 2.05 2.25 0 5.34 6.47 0
0 0.99 1.26 1.49 1.72 1.74 0 6.69 8.44 0
0 0.74 0.99 1.17 1.34 1.27 0 7.96 9.94 0
0 0 0 0 0 0 0 0 0 0
Under some technical conditions (e.g., finite state and action spaces,
and γ < 1), value iteration converges to the optimal value function V^*.
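A minimal, generic value-iteration sketch in Python is given below; the tables above were obtained by iterations of this kind on the grid world. The dictionary-based MDP representation, the actions(s) helper, and the tolerance-based stopping rule are assumptions for illustration.

def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Iterate the Bellman backup until the value function stops changing."""
    V = {s: 0.0 for s in states}                       # initial condition V_0 = 0
    while True:
        V_new = {}
        for s in states:
            V_new[s] = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2])
                    for s2, p in T[(s, a)].items())
                for a in actions(s)
            )
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

# Tiny illustrative example: two states, 'go' moves to the other state w.p. 0.75,
# and a reward of 1 is collected whenever the next state is 'B'.
states = ['A', 'B']
actions = lambda s: ['stay', 'go']
T = {('A', 'stay'): {'A': 1.0}, ('A', 'go'): {'B': 0.75, 'A': 0.25},
     ('B', 'stay'): {'B': 1.0}, ('B', 'go'): {'A': 0.75, 'B': 0.25}}
R = {(s, a, s2): (1.0 if s2 == 'B' else 0.0)
     for (s, a), dist in T.items() for s2 in dist}
print(value_iteration(states, actions, T, R))          # roughly {'A': 9.68, 'B': 10.0}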
The optimal value function V^* satisfies the following equation, called
Bellman's equation, a nice (perhaps the prime) example of the principle of
optimality:

V^*(s) = \max_{a} \sum_{s' \in S} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]
The optimal policy can be easily recovered from the optimal value function:

\pi^*(s) = \arg\max_{a} \sum_{s' \in S} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right]
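The corresponding greedy policy extraction can be sketched as follows (same illustrative MDP representation as in the value-iteration sketch above; the helper name extract_policy is an assumption).

def extract_policy(states, actions, T, R, V, gamma=0.9):
    """pi*(s) = argmax_a sum_{s'} T(s, a, s') [R(s, a, s') + gamma V(s')]."""
    policy = {}
    for s in states:
        policy[s] = max(
            actions(s),
            key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                              for s2, p in T[(s, a)].items())
        )
    return policy

# E.g., with the toy MDP from the value-iteration sketch above:
# extract_policy(states, actions, T, R, value_iteration(states, actions, T, R))
# -> {'A': 'go', 'B': 'stay'}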
A key advantage of computing a policy, rather than a fixed plan, is robustness to uncertainty: the policy prescribes an action for every state, so unexpected transitions do not require replanning.
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms .