
16.410/413
Principles of Autonomy and Decision Making
Lecture 22: Markov Decision Processes I

Emilio Frazzoli

Aeronautics and Astronautics


Massachusetts Institute of Technology

November 29, 2010



Assignments

Readings
Lecture notes
[AIMA] Ch. 17.1-3.



Outline

1 Markov Decision Processes



From deterministic to stochastic planning problems
A basic planning model for deterministic systems (e.g., graph/tree search
algorithms, etc.) is :

Planning Model (Transition system + goal)


A (discrete, deterministic) feasible planning model is defined by
A countable set of states S.
A countable set of actions A.
A transition relation → ⊆ S × A × S.
An initial state s1 ∈ S.
A set of goal states S_G ⊂ S.

We considered the case in which the transition relation is purely deterministic: if (s, a, s′) is in the relation, i.e., (s, a, s′) ∈ →, or, more concisely, s →ᵃ s′, then taking action a from state s will always take the state to s′.
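As a concrete aside (not part of the lecture), such a deterministic transition system could be stored in Python roughly as follows; the state and action names are made up for illustration.

```python
# A minimal sketch of a deterministic transition system (illustrative names only).
S = {"s1", "s2", "s3"}                                 # countable set of states
A = {"go", "stay"}                                     # countable set of actions
# Transition relation as a dict: each (state, action) pair maps to exactly one successor.
T = {("s1", "go"): "s2", ("s2", "go"): "s3", ("s1", "stay"): "s1"}
s_init = "s1"                                          # initial state
S_G = {"s3"}                                           # goal states

def step(s, a):
    """Apply action a in state s; the outcome is always the same successor."""
    return T[(s, a)]

print(step("s1", "go"))  # -> "s2", deterministically
```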
Can we extend this model to include (probabilistic) uncertainty in the
transitions?
Markov Decision Process
Instead of a (deterministic) transition relation, let us define transition
probabilities; also, let us introduce a reward (or cost) structure:
Markov Decision Process (Stoch. transition system + reward)
A Markov Decision Process (MDP) is defined by
A countable set of states S.
A countable set of actions A.
A transition probability function T : S × A × S → R+.
An initial state s0 ∈ S.
A reward function R : S × A × S → R+.

In other words: if action a is applied from state s, a transition to state s′ will occur with probability T(s, a, s′).

Furthermore, every time a transition is made from s to s′ using action a, a reward R(s, a, s′) is collected.
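For illustration only, an MDP with these ingredients can be held in plain Python dictionaries, with T mapping each (state, action) pair to a probability distribution over successor states and R assigning a reward to each transition; the two-state example below is invented, not taken from the lecture.

```python
# A sketch of an MDP as plain dictionaries (invented two-state example).
S = ["s0", "s1"]
A = ["a", "b"]
# T[(s, a)] maps each successor s' to the probability T(s, a, s').
T = {
    ("s0", "a"): {"s0": 0.2, "s1": 0.8},
    ("s0", "b"): {"s0": 1.0},
    ("s1", "a"): {"s1": 1.0},
    ("s1", "b"): {"s0": 0.5, "s1": 0.5},
}
# R[(s, a, s')] is the reward collected on that transition (here: 1 for reaching "s1").
R = {(s, a, sp): (1.0 if sp == "s1" else 0.0)
     for (s, a), dist in T.items() for sp in dist}

# Sanity check: every outgoing distribution sums to one.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in T.values())
```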
Some remarks

In a Markov Decision Process, both transition probabilities and rewards only depend on the present state, not on the history of the state. In other words, the future states and rewards are independent of the past, given the present.
A Markov Decision Process has many common features with Markov
Chains and Transition Systems.
In an MDP:
Transitions and rewards are stationary.
The state is known exactly. (Only transitions are stochastic.)

MDPs in which the state is not known exactly (HMM + Transition Systems) are called Partially Observable Markov Decision Processes (POMDPs): these are very hard problems.



Total reward in an MDP

Let us assume that it is desired to maximize the total reward collected over infinite time.
In other words, let us assume that the sequence of states is
S = (s1 , s2 , . . . , st , . . .), and the sequence of actions is
A = (a1 , a2 , . . . , at , . . .); then the total collected reward (also called
utility) is
V = ∑_{t=0}^∞ γ^t R(s_t, a_t, s_{t+1}),

where γ ∈ (0, 1] is a discount factor.


Philosophically: it models the fact that an immediate reward is better
than an uncertain reward in the future.
Mathematically: it ensures that the sum is always finite, if the rewards
are bounded (e.g., finitely many states/actions).
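As a quick numerical illustration (with made-up rewards), the discounted sum can be evaluated on a finite prefix of a trajectory:

```python
# Discounted total reward of a trajectory prefix: sum over t of gamma**t * r_t.
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 1.0]   # r_t = R(s_t, a_t, s_{t+1}); values are made up
V = sum(gamma**t * r for t, r in enumerate(rewards))
print(V)                         # 0.9**2 + 0.9**3 = 1.539
```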



Decision making in MDPs

Notice that the actual sequence of states, and hence the actual total
reward, is unknown a priori.
We could choose a plan, i.e., a sequence of actions: A = (a1 , a2 , . . .).
In this case, transition probabilities are fixed and one can compute the
probability of being at any given state at each time step—in a similar
way as the forward algorithm in HMMs—and hence compute the
expected reward:
E[R(s_t, a_t, s_{t+1}) | s_t, a_t] = ∑_{s∈S} T(s_t, a_t, s) R(s_t, a_t, s)

Such an approach is essentially open loop, i.e., it does not take advantage of the fact that at each time step the actual state reached is known, and a new feedback strategy can be computed based on this knowledge.
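A minimal sketch of the one-step expected-reward computation above, assuming the dictionary representation of T and R used in the earlier illustration:

```python
def expected_reward(T, R, s, a):
    """E[R(s, a, s') | s, a] = sum over s' of T(s, a, s') * R(s, a, s')."""
    return sum(p * R[(s, a, sp)] for sp, p in T[(s, a)].items())
```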
Introduction to value iteration

Let us assume we have a function V_i : S → R+ that associates to each state s a lower bound on the optimal (discounted) total reward V*(s) that can be collected starting from that state. Note the connection with admissible heuristics in informed search algorithms.

For example, we can start with V0 (s) = 0, for all s ∈ S.

As a feedback strategy, we can do the following: at each state, choose the action that maximizes the expected reward of the present action plus the estimated total reward from the next step onwards.

Using this strategy, we can get an update V_{i+1} of the function V_i.

Iterate until convergence...



Value iteration algorithm

A bit more formally:

Set V_0(s) ← 0, for all s ∈ S.

Iterate, for all s ∈ S:

V_{i+1}(s) ← max_a E[R(s, a, s′) + γ V_i(s′)]
           = max_a ∑_{s′∈S} T(s, a, s′) [R(s, a, s′) + γ V_i(s′)],

until max_s |V_{i+1}(s) − V_i(s)| < ε.
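The loop above can be written compactly in Python. The sketch below assumes the dictionary-based MDP representation from the earlier illustration (T[(s, a)] is a distribution over successor states, R[(s, a, s′)] a transition reward); it is an illustration, not the lecture's own code.

```python
def value_iteration(S, A, T, R, gamma=0.9, eps=1e-6):
    """Iterate V_{i+1}(s) = max_a sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V_i(s'))."""
    V = {s: 0.0 for s in S}                     # V_0(s) = 0 for all s
    while True:
        V_new = {}
        for s in S:
            V_new[s] = max(
                (sum(p * (R[(s, a, sp)] + gamma * V[sp])
                     for sp, p in T[(s, a)].items())
                 for a in A if (s, a) in T),
                default=0.0,                    # a state with no action keeps value 0
            )
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            return V_new
        V = V_new
```

On the small invented MDP sketched earlier, value_iteration(S, A, T, R) returns a dictionary with one converged value estimate per state.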



Value iteration example

Let us consider a simple MDP (a construction sketch in code follows this list):


The state space is a 10-by-10 grid.

The border cells and some of the interior cells are “obstacles” (marked in
gray).

The initial state is the top-left feasible cell.

A reward of 1 is collected when reaching the bottom right feasible cell. The
discount factor is 0.9.

At each non-obstacle cell, the agent can attempt to move to any of the
neighboring cells. The move will be successful with probability 3/4.
Otherwise the agent will move to a different neighboring cell, with equal
probability.

The agent always has the option to stay put, which will succeed with
certainty.
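Below is a sketch of how this grid MDP could be built with the same dictionary representation. The specific interior obstacle cells are not listed in the text, so they are left as an empty placeholder set; moves are assumed to be 4-connected, and a reward of 1 is assumed to be collected on any transition that ends in the goal cell (consistent with the value tables that follow).

```python
# Sketch of the 10-by-10 grid MDP; assumptions are noted in the comments.
N = 10
GRID = [(r, c) for r in range(N) for c in range(N)]
BORDER = {(r, c) for (r, c) in GRID if r in (0, N - 1) or c in (0, N - 1)}
INTERIOR_OBSTACLES = set()            # placeholder: the interior obstacles are not listed here
OBSTACLES = BORDER | INTERIOR_OBSTACLES
FREE = [cell for cell in GRID if cell not in OBSTACLES]
FREE_SET = set(FREE)
GOAL = (N - 2, N - 2)                 # bottom-right feasible cell
GAMMA = 0.9                           # discount factor from the slide

def neighbors(cell):
    """4-connected feasible neighbors (assumed; the slide only says 'neighboring cells')."""
    r, c = cell
    return [n for n in [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)] if n in FREE_SET]

T, R = {}, {}
for s in FREE:
    T[(s, "stay")] = {s: 1.0}                       # staying put always succeeds
    for tgt in neighbors(s):
        dist = {tgt: 0.75}                          # intended move succeeds with probability 3/4
        others = [n for n in neighbors(s) if n != tgt]
        if others:
            for o in others:                        # otherwise: a different neighbor, uniformly
                dist[o] = 0.25 / len(others)
        else:
            dist[s] = 0.25                          # assumption: with no other neighbor, stay put
        T[(s, ("move", tgt))] = dist
for (s, a), dist in T.items():                      # reward 1 on any transition into the goal
    for sp in dist:
        R[(s, a, sp)] = 1.0 if sp == GOAL else 0.0
```

Because the feasible moves differ from cell to cell, a value-iteration loop over this structure would enumerate the actions available at each state (the keys of T for that state) rather than a fixed global action set.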



Value iteration example

Initial condition:

0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0



Value iteration example

After 1 iteration:

0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0.75 0
0 0 0 0 0 0 0 0.75 1 0
0 0 0 0 0 0 0 0 0 0



Value iteration example

After 2 iterations:

0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0.51 0
0 0 0 0 0 0 0 0.56 1.43 0
0 0 0 0 0 0 0 1.43 1.9 0
0 0 0 0 0 0 0 0 0 0





Value iteration example

After 50 iterations:

0 0 0 0 0 0 0 0 0 0
0 0.44 0.54 0.59 0.82 1.15 0.85 1.09 1.52 0
0 0.59 0.69 0 0 1.52 0 0 2.13 0
0 0.75 0.90 0 0 2.12 2.55 2.98 3.00 0
0 0.95 1.18 0 2.00 2.70 3.22 3.80 3.88 0
0 1.20 1.55 1.87 2.41 2.92 3.51 4.52 5.00 0
0 1.15 1.47 1.74 2.05 2.25 0 5.34 6.47 0
0 0.99 1.26 1.49 1.72 1.74 0 6.69 8.44 0
0 0.74 0.99 1.17 1.34 1.27 0 7.96 9.94 0
0 0 0 0 0 0 0 0 0 0



Bellman’s equation

Under some technical conditions (e.g., finite state and action spaces, and γ < 1), value iteration converges to the optimal value function V*.

The optimal value function V* satisfies the following equation, called Bellman’s equation, a nice (perhaps the prime) example of the principle of optimality:

V*(s) = max_a E[R(s, a, s′) + γ V*(s′)]
      = max_a ∑_{s′∈S} T(s, a, s′) [R(s, a, s′) + γ V*(s′)],   ∀s ∈ S.

In other words, the optimal value function can be seen as a fixed point of value iteration.

Bellman’s equation can be proven by contradiction.
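One way to see the fixed-point property numerically is to compute the Bellman residual of a candidate value function; a sketch, again assuming the dictionary-based T and R from the earlier illustrations:

```python
def bellman_residual(S, A, T, R, V, gamma=0.9):
    """max over s of |V(s) - max_a sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V(s'))|;
    the residual is (near) zero exactly when V satisfies Bellman's equation."""
    def backup(s):
        return max(
            (sum(p * (R[(s, a, sp)] + gamma * V[sp]) for sp, p in T[(s, a)].items())
             for a in A if (s, a) in T),
            default=0.0,
        )
    return max(abs(V[s] - backup(s)) for s in S)
```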
Value iteration summary

Value iteration converges monotonically and in polynomial time to the optimal value function.

The optimal policy can be easily recovered from the optimal value function (see the sketch below):

π*(s) = arg max_a E[R(s, a, s′) + γ V*(s′)],   ∀s ∈ S.

Knowledge of the value function turns the optimal planning problem into a feedback problem:

Robust to uncertainty.

Minimal on-line computations.
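A sketch of this greedy policy extraction, once more assuming the dictionary-based T and R from the earlier illustrations:

```python
def extract_policy(S, A, T, R, V, gamma=0.9):
    """pi*(s) = argmax_a sum_{s'} T(s,a,s') * (R(s,a,s') + gamma * V(s'))."""
    policy = {}
    for s in S:
        actions = [a for a in A if (s, a) in T]
        if not actions:
            continue                                # no action available at s
        policy[s] = max(
            actions,
            key=lambda a: sum(p * (R[(s, a, sp)] + gamma * V[sp])
                              for sp, p in T[(s, a)].items()),
        )
    return policy
```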



MIT OpenCourseWare
http://ocw.mit.edu

16.410 / 16.413 Principles of Autonomy and Decision Making


Fall 2010

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
