
16.410/413
Principles of Autonomy and Decision Making
Lecture 22: Markov Decision Processes I

Emilio Frazzoli

Aeronautics and Astronautics


Massachusetts Institute of Technology

November 29, 2010



Assignments

Readings
Lecture notes
[AIMA] Ch. 17.1-3.



Outline

1 Markov Decision Processes



From deterministic to stochastic planning problems
A basic planning model for deterministic systems (e.g., graph/tree search
algorithms, etc.) is :

Planning Model (Transition system + goal)


A (discrete, deterministic) feasible planning model is defined by
A countable set of states S.
A countable set of actions A.
A transition relation → ⊆ S × A × S.
An initial state s1 ∈ S.
A set of goal states S_G ⊂ S.

We considered the case in which the transition relation is purely deterministic: if (s, a, s′) is in the relation, i.e., (s, a, s′) ∈ →, or, more concisely, s →ᵃ s′, then taking action a from state s will always take the state to s′.
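As a concrete aside (not part of the lecture), such a deterministic transition system could be stored in Python roughly as follows; the state and action names are made up for illustration.

```python
# A minimal sketch of a deterministic transition system (illustrative names only).
S = {"s1", "s2", "s3"}                                 # countable set of states
A = {"go", "stay"}                                     # countable set of actions
# Transition relation as a dict: each (state, action) pair maps to exactly one successor.
T = {("s1", "go"): "s2", ("s2", "go"): "s3", ("s1", "stay"): "s1"}
s_init = "s1"                                          # initial state
S_G = {"s3"}                                           # goal states

def step(s, a):
    """Apply action a in state s; the outcome is always the same successor."""
    return T[(s, a)]

print(step("s1", "go"))  # -> "s2", deterministically
```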
Can we extend this model to include (probabilistic) uncertainty in the
transitions?
Markov Decision Process
Instead of a (deterministic) transition relation, let us define transition
probabilities; also, let us introduce a reward (or cost) structure:
Markov Decision Process (Stoch. transition system + reward)
A Markov Decision Process (MDP) is defined by
A countable set of states S.
A countable set of actions A.
A transition probability function T : S × A × S → R+.
An initial state s0 ∈ S.
A reward function R : S × A × S → R+.

In other words: if action a is applied from state s, a transition to state s′ will occur with probability T(s, a, s′).

Furthermore, every time a transition is made from s to s′ using action a, a reward R(s, a, s′) is collected.
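For illustration only, an MDP with these ingredients can be held in plain Python dictionaries, with T mapping each (state, action) pair to a probability distribution over successor states and R assigning a reward to each transition; the two-state example below is invented, not taken from the lecture.

```python
# A sketch of an MDP as plain dictionaries (invented two-state example).
S = ["s0", "s1"]
A = ["a", "b"]
# T[(s, a)] maps each successor s' to the probability T(s, a, s').
T = {
    ("s0", "a"): {"s0": 0.2, "s1": 0.8},
    ("s0", "b"): {"s0": 1.0},
    ("s1", "a"): {"s1": 1.0},
    ("s1", "b"): {"s0": 0.5, "s1": 0.5},
}
# R[(s, a, s')] is the reward collected on that transition (here: 1 for reaching "s1").
R = {(s, a, sp): (1.0 if sp == "s1" else 0.0)
     for (s, a), dist in T.items() for sp in dist}

# Sanity check: every outgoing distribution sums to one.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in T.values())
```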
Some remarks

In a Markov Decision Process, both transition probabilities and rewards only depend on the present state, not on the history of the state. In other words, the future states and rewards are independent of the past, given the present.
A Markov Decision Process has many common features with Markov
Chains and Transition Systems.
In an MDP:
Transitions and rewards are stationary.
The state is known exactly. (Only transitions are stochastic.)

MDPs in which the state is not known exactly (HMM + Transition Systems) are called Partially Observable Markov Decision Processes (POMDPs): these are very hard problems.



Total reward in an MDP

Let us assume that it is desired to maximize the total reward collected over infinite time.
In other words, let us assume that the sequence of states is
S = (s1 , s2 , . . . , st , . . .), and the sequence of actions is
A = (a1 , a2 , . . . , at , . . .); then the total collected reward (also called
utility) is
V = ∑_{t=0}^∞ γ^t R(s_t, a_t, s_{t+1}),

where γ ∈ (0, 1] is a discount factor.


Philosophically: it models the fact that an immediate reward is better
than an uncertain reward in the future.
Mathematically: it ensures that the sum is always finite, if the rewards
are bounded (e.g., finitely many states/actions).
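As a quick numerical illustration (with made-up rewards), the discounted sum can be evaluated on a finite prefix of a trajectory:

```python
# Discounted total reward of a trajectory prefix: sum over t of gamma**t * r_t.
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 1.0]   # r_t = R(s_t, a_t, s_{t+1}); values are made up
V = sum(gamma**t * r for t, r in enumerate(rewards))
print(V)                         # 0.9**2 + 0.9**3 = 1.539
```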



Decision making in MDPs

Notice that the actual sequence of states, and hence the actual total
reward, is unknown a priori.
We could choose a plan, i.e., a sequence of actions: A = (a1 , a2 , . . .).
In this case, transition probabilities are fixed and one can compute the
probability of being at any given state at each time step—in a similar
way as the forward algorithm in HMMs—and hence compute the
expected reward:
E[R(s_t, a_t, s_{t+1}) | s_t, a_t] = ∑_{s∈S} T(s_t, a_t, s) R(s_t, a_t, s)

Such an approach is essentially open loop, i.e., it does not take advantage of the fact that at each time step the actual state reached is known, and a new feedback strategy can be computed based on this knowledge.
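A minimal sketch of the one-step expected-reward computation above, assuming the dictionary representation of T and R used in the earlier illustration:

```python
def expected_reward(T, R, s, a):
    """E[R(s, a, s') | s, a] = sum over s' of T(s, a, s') * R(s, a, s')."""
    return sum(p * R[(s, a, sp)] for sp, p in T[(s, a)].items())
```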
Introduction to value iteration

Let us assume we have a function V_i : S → R+ that associates to each state s a lower bound on the optimal (discounted) total reward V*(s) that can be collected starting from that state. Note the connection with admissible heuristics in informed search algorithms.

For example, we can start with V0 (s) = 0, for all s ∈ S.

As a feedback strategy, we can do the following: at each state, choose the action that maximizes the expected reward of the present action plus the estimated total reward from the next step onwards.

Using this strategy, we can get an update V_{i+1} of the function V_i.

Iterate until convergence...



Value iteration algorithm

A bit more formally:

Set V_0(s) ← 0, for all s ∈ S.

Iterate, for all s ∈ S:

V_{i+1}(s) ← max_a E[R(s, a, s′) + γ V_i(s′)]
           = max_a ∑_{s′∈S} T(s, a, s′) [R(s, a, s′) + γ V_i(s′)],

until max_s |V_{i+1}(s) − V_i(s)| < ε.
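The loop above can be written compactly in Python. The sketch below assumes the dictionary-based MDP representation from the earlier illustration (T[(s, a)] is a distribution over successor states, R[(s, a, s′)] a transition reward); it is an illustration, not the lecture's own code.

```python
def value_iteration(S, A, T, R, gamma=0.9, eps=1e-6):
    """Iterate V_{i+1}(s) = max_a sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V_i(s'))."""
    V = {s: 0.0 for s in S}                     # V_0(s) = 0 for all s
    while True:
        V_new = {}
        for s in S:
            V_new[s] = max(
                (sum(p * (R[(s, a, sp)] + gamma * V[sp])
                     for sp, p in T[(s, a)].items())
                 for a in A if (s, a) in T),
                default=0.0,                    # a state with no action keeps value 0
            )
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            return V_new
        V = V_new
```

On the small invented MDP sketched earlier, value_iteration(S, A, T, R) returns a dictionary with one converged value estimate per state.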



Value iteration example

Let us consider a simple MDP (a construction sketch in code follows this list):


The state space is a 10-by-10 grid.

The border cells and some of the interior cells are “obstacles” (marked in
gray).

The initial state is the top-left feasible cell.

A reward of 1 is collected when reaching the bottom right feasible cell. The
discount factor is 0.9.

At each non-obstacle cell, the agent can attempt to move to any of the
neighboring cells. The move will be successful with probability 3/4.
Otherwise the agent will move to a different neighboring cell, with equal
probability.

The agent always has the option to stay put, which will succeed with
certainty.
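Below is a sketch of how this grid MDP could be built with the same dictionary representation. The specific interior obstacle cells are not listed in the text, so they are left as an empty placeholder set; moves are assumed to be 4-connected, and a reward of 1 is assumed to be collected on any transition that ends in the goal cell (consistent with the value tables that follow).

```python
# Sketch of the 10-by-10 grid MDP; assumptions are noted in the comments.
N = 10
GRID = [(r, c) for r in range(N) for c in range(N)]
BORDER = {(r, c) for (r, c) in GRID if r in (0, N - 1) or c in (0, N - 1)}
INTERIOR_OBSTACLES = set()            # placeholder: the interior obstacles are not listed here
OBSTACLES = BORDER | INTERIOR_OBSTACLES
FREE = [cell for cell in GRID if cell not in OBSTACLES]
FREE_SET = set(FREE)
GOAL = (N - 2, N - 2)                 # bottom-right feasible cell
GAMMA = 0.9                           # discount factor from the slide

def neighbors(cell):
    """4-connected feasible neighbors (assumed; the slide only says 'neighboring cells')."""
    r, c = cell
    return [n for n in [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)] if n in FREE_SET]

T, R = {}, {}
for s in FREE:
    T[(s, "stay")] = {s: 1.0}                       # staying put always succeeds
    for tgt in neighbors(s):
        dist = {tgt: 0.75}                          # intended move succeeds with probability 3/4
        others = [n for n in neighbors(s) if n != tgt]
        if others:
            for o in others:                        # otherwise: a different neighbor, uniformly
                dist[o] = 0.25 / len(others)
        else:
            dist[s] = 0.25                          # assumption: with no other neighbor, stay put
        T[(s, ("move", tgt))] = dist
for (s, a), dist in T.items():                      # reward 1 on any transition into the goal
    for sp in dist:
        R[(s, a, sp)] = 1.0 if sp == GOAL else 0.0
```

Because the feasible moves differ from cell to cell, a value-iteration loop over this structure would enumerate the actions available at each state (the keys of T for that state) rather than a fixed global action set.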



Value iteration example

Initial condition:

0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0



Value iteration example

After 1 iteration:

0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0.75 0
0 0 0 0 0 0 0 0.75 1 0
0 0 0 0 0 0 0 0 0 0



Value iteration example

After 2 iterations:

0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0.51 0
0 0 0 0 0 0 0 0.56 1.43 0
0 0 0 0 0 0 0 1.43 1.9 0
0 0 0 0 0 0 0 0 0 0





Value iteration example

After 50 iterations:

0 0 0 0 0 0 0 0 0 0
0 0.44 0.54 0.59 0.82 1.15 0.85 1.09 1.52 0
0 0.59 0.69 0 0 1.52 0 0 2.13 0
0 0.75 0.90 0 0 2.12 2.55 2.98 3.00 0
0 0.95 1.18 0 2.00 2.70 3.22 3.80 3.88 0
0 1.20 1.55 1.87 2.41 2.92 3.51 4.52 5.00 0
0 1.15 1.47 1.74 2.05 2.25 0 5.34 6.47 0
0 0.99 1.26 1.49 1.72 1.74 0 6.69 8.44 0
0 0.74 0.99 1.17 1.34 1.27 0 7.96 9.94 0
0 0 0 0 0 0 0 0 0 0



Bellman’s equation

Under some technical conditions (e.g., finite state and action spaces, and γ < 1), value iteration converges to the optimal value function V*.

The optimal value function V* satisfies the following equation, called Bellman’s equation, a nice (perhaps the prime) example of the principle of optimality:

V*(s) = max_a E[R(s, a, s′) + γ V*(s′)]
      = max_a ∑_{s′∈S} T(s, a, s′) [R(s, a, s′) + γ V*(s′)],   ∀s ∈ S.

In other words, the optimal value function can be seen as a fixed point of value iteration.

Bellman’s equation can be proven by contradiction.
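One way to see the fixed-point property numerically is to compute the Bellman residual of a candidate value function; a sketch, again assuming the dictionary-based T and R from the earlier illustrations:

```python
def bellman_residual(S, A, T, R, V, gamma=0.9):
    """max over s of |V(s) - max_a sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V(s'))|;
    the residual is (near) zero exactly when V satisfies Bellman's equation."""
    def backup(s):
        return max(
            (sum(p * (R[(s, a, sp)] + gamma * V[sp]) for sp, p in T[(s, a)].items())
             for a in A if (s, a) in T),
            default=0.0,
        )
    return max(abs(V[s] - backup(s)) for s in S)
```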
Value iteration summary

Value iteration converges monotonically and in polynomial time to the optimal value function.

The optimal policy can be easily recovered from the optimal value function (see the sketch below):

π*(s) = arg max_a E[R(s, a, s′) + γ V*(s′)],   ∀s ∈ S.

Knowledge of the value function turns the optimal planning problem into a feedback problem:

Robust to uncertainty.

Minimal on-line computations.
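A sketch of this greedy policy extraction, once more assuming the dictionary-based T and R from the earlier illustrations:

```python
def extract_policy(S, A, T, R, V, gamma=0.9):
    """pi*(s) = argmax_a sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V(s'))."""
    policy = {}
    for s in S:
        actions = [a for a in A if (s, a) in T]
        if not actions:
            continue                                # no action available at s
        policy[s] = max(
            actions,
            key=lambda a: sum(p * (R[(s, a, sp)] + gamma * V[sp])
                              for sp, p in T[(s, a)].items()),
        )
    return policy
```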



MIT OpenCourseWare
http://ocw.mit.edu

16.410 / 16.413 Principles of Autonomy and Decision Making


Fall 2010

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
