L1: MDPs and Exact Solution Methods
Pieter Abbeel -- UC Berkeley | Gradescope | Covariant.AI
[Peng, Abbeel, Levine, van de Panne, 2018]
A Few Deep RL Highlights
- 2013: Atari (DQN) [DeepMind]
- 2014: 2D locomotion (TRPO) [Berkeley]
- 2015: AlphaGo [DeepMind]
- 2016: 3D locomotion (TRPO+GAE) [Berkeley]
- 2016: Real robot manipulation (GPS) [Berkeley, Google]
- 2017: Dota 2 (PPO) [OpenAI]
- 2018: DeepMimic [Berkeley]
- 2019: AlphaStar [DeepMind]
- 2019: Rubik's Cube (PPO+DR) [OpenAI]
Outline for This Lecture
- Motivation
- Markov Decision Processes (MDPs)
- Exact Solution Methods
  - Value Iteration
  - Policy Iteration
- Maximum Entropy Formulation

Note: For now we use small, discrete state-action spaces, as they are simpler for getting the main concepts across. We will consider large state spaces in the next lecture!
Markov Decision Process
- Contrast: if the environment were deterministic, we would just need an optimal plan, or sequence of actions, from start to a goal.
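The formal MDP definition on this slide did not survive extraction; as a reference, here is a standard definition consistent with the notation used in the rest of the lecture (this is a reconstruction, not the original slide text):

    \text{An MDP is a tuple } (S, A, P, R, \gamma, H) \text{ with:}
    S \text{ a set of states}, \quad A \text{ a set of actions},
    P(s' \mid s, a) \text{ the probability of reaching } s' \text{ when taking action } a \text{ in state } s,
    R(s, a, s') \text{ the reward for that transition}, \quad
    \gamma \in [0, 1] \text{ the discount factor}, \quad H \text{ the horizon}.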
Optimal Value Function V*

    V^*(s) = \max_\pi \mathbb{E}\Big[ \sum_{t=0}^{H} \gamma^t R(s_t, a_t, s_{t+1}) \;\Big|\; \pi, s_0 = s \Big]

= sum of discounted rewards when starting from state s and acting optimally
Optimal Value Function V*
Let’s assume:
actions deterministically successful, gamma = 1, H = 100
V*(4,3) = 1
V*(3,3) = 1
V*(2,3) = 1
V*(1,1) = 1
V*(4,2) = -1
Optimal Value Function V*
Let’s assume:
actions deterministically successful, gamma = 0.9, H = 100
V*(4,3) = 1
V*(3,3) = 0.9
V*(2,3) = 0.9*0.9 = 0.81
V*(1,1) = 0.9*0.9*0.9*0.9*0.9 = 0.59
V*(4,2) = -1
Optimal Value Function V*
Let’s assume:
actions successful w/probability 0.8, gamma = 0.9, H = 100
V*(4,3) = 1
V*(3,3) = 0.8 * 0.9 * V*(4,3) + 0.1 * 0.9 * V*(3,3) + 0.1 * 0.9 * V*(3,2)
V*(2,3) =
V*(1,1) =
V*(4,2) =
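Written in general form (an editorial restatement, matching the Bellman equation referenced later in this lecture), the recursion used for V*(3,3) above is an instance of:

    V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma\, V^*(s') \right]

In the gridworld example the +1/-1 rewards are already folded into the exit-state values V*(4,3) and V*(4,2), which is why only the discounted successor values appear explicitly in the V*(3,3) expression.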
Value Iteration
- V_0^*(s) = optimal value for state s when H = 0
    V_0^*(s) = 0 \quad \forall s
- V_1^*(s) = optimal value for state s when H = 1
    V_1^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\left( R(s, a, s') + \gamma V_0^*(s') \right)
- V_2^*(s) = optimal value for state s when H = 2
    V_2^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\left( R(s, a, s') + \gamma V_1^*(s') \right)
- V_k^*(s) = optimal value for state s when H = k
    V_k^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\left( R(s, a, s') + \gamma V_{k-1}^*(s') \right)
Value Iteration
Algorithm:
Start with V_0^*(s) = 0 for all s.
For k = 1, ..., H:
    For all states s in S:
        V_k^*(s) \leftarrow \max_a \sum_{s'} P(s' \mid s, a)\left( R(s, a, s') + \gamma V_{k-1}^*(s') \right)
        \pi_k^*(s) \leftarrow \arg\max_a \sum_{s'} P(s' \mid s, a)\left( R(s, a, s') + \gamma V_{k-1}^*(s') \right)
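Below is a minimal runnable sketch of this algorithm in Python/NumPy on a small, hypothetical 3-state MDP. The transition matrix, rewards, and variable names here are illustrative assumptions of this example, not the gridworld from the slides.

import numpy as np

# Hypothetical 3-state, 2-action MDP (illustrative numbers only).
# P[a, s, s'] = probability of landing in s' after taking action a in state s.
P = np.array([
    [[0.9, 0.1, 0.0],   # action 0
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.9]],
    [[0.8, 0.2, 0.0],   # action 1
     [0.0, 0.2, 0.8],
     [0.0, 0.0, 1.0]],
])
# R[a, s, s'] = reward for the transition (s, a, s').
R = np.zeros_like(P)
R[:, :, 2] = 1.0        # +1 for entering state 2
gamma, H = 0.9, 100

n_actions, n_states, _ = P.shape
V = np.zeros(n_states)              # V_0*(s) = 0 for all s
for k in range(1, H + 1):
    # Q[a, s] = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V_{k-1}(s'))
    Q = np.einsum('ast,ast->as', P, R + gamma * V[None, None, :])
    V = Q.max(axis=0)               # V_k*(s)   = max_a Q(s, a)
    policy = Q.argmax(axis=0)       # pi_k*(s)  = argmax_a Q(s, a)

print("V* approx:", V)
print("greedy policy:", policy)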
[Gridworld demo: value iteration updates shown for k = 0, 1, 2, ..., 12 and k = 100, with Noise = 0.2 and Discount = 0.9.]
Value Iteration Convergence

Theorem. Value iteration converges. At convergence, we have found the optimal value function V* for the discounted infinite-horizon problem, which satisfies the Bellman equations:

    V^*(s) = \max_a \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^*(s') \right]

- Now we know how to act for infinite horizon with discounted rewards!
- Run value iteration till convergence.
- This produces V*, which in turn tells us how to act, namely by following:

    \pi^*(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^*(s') \right]

- Note: the infinite-horizon optimal policy is stationary, i.e., the optimal action at a state s is the same action at all times. (Efficient to store!)
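As a small illustration of the last step, extracting the stationary greedy policy from a converged V* is a single argmax. The function below is a sketch that assumes the same hypothetical P, R, gamma arrays as the value-iteration example above (those arrays are assumptions of that example, not slide content).

import numpy as np

def greedy_policy(P, R, V, gamma):
    """pi*(s) = argmax_a sum_s' P(s'|s,a) * (R(s,a,s') + gamma * V(s'))."""
    Q = np.einsum('ast,ast->as', P, R + gamma * V[None, None, :])  # Q[a, s]
    return Q.argmax(axis=0)                                        # one action per state

# Usage (with the arrays from the value-iteration sketch): greedy_policy(P, R, V, gamma)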
Convergence: Intuition
- V^*(s) = expected sum of rewards accumulated starting from state s, acting optimally for infinitely many steps
- V_H^*(s) = expected sum of rewards accumulated starting from state s, acting optimally for H steps
- The additional reward collected over time steps H+1, H+2, ... goes to zero as H goes to infinity:

    \gamma^{H+1} R(s_{H+1}) + \gamma^{H+2} R(s_{H+2}) + \cdots \;\le\; \gamma^{H+1} R_{\max} + \gamma^{H+2} R_{\max} + \cdots \;=\; \frac{\gamma^{H+1}}{1-\gamma} R_{\max} \;\to\; 0 \text{ as } H \to \infty

  Hence V_H^* \to V^* as H \to \infty.

For simplicity of notation, the above assumed that rewards are always greater than or equal to zero. If rewards can be negative, a similar argument holds, using max |R| and bounding from both sides.
Convergence and Contractions
- Definition: max-norm: \| U \| = \max_s | U(s) |
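The rest of this slide did not survive extraction; a standard statement of the contraction property it refers to (reconstructed here as an assumption about the original content) is that the Bellman backup B is a gamma-contraction in the max-norm:

    \| B(V) - B(V') \| \le \gamma \, \| V - V' \|,
    \quad\text{where}\quad
    (B V)(s) = \max_a \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V(s') \right].

That is, each update brings any two value-function estimates closer by at least a factor gamma, so for gamma < 1 value iteration converges to a unique fixed point V*.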
Exercise 1
Match each desired behavior with the parameter setting that produces it:
(a) Prefer the close exit (+1), risking the cliff (-10)          (1) γ = 0.1,  noise = 0.5
(b) Prefer the close exit (+1), but avoiding the cliff (-10)     (2) γ = 0.99, noise = 0
(c) Prefer the distant exit (+10), risking the cliff (-10)       (3) γ = 0.99, noise = 0.5
(d) Prefer the distant exit (+10), avoiding the cliff (-10)      (4) γ = 0.1,  noise = 0
Exercise 1 Solution
(a) Prefer close exit (+1), risking the cliff (-10)      --- (4) γ = 0.1,  noise = 0
(b) Prefer close exit (+1), avoiding the cliff (-10)     --- (1) γ = 0.1,  noise = 0.5
(c) Prefer distant exit (+10), risking the cliff (-10)   --- (2) γ = 0.99, noise = 0
(d) Prefer distant exit (+10), avoiding the cliff (-10)  --- (3) γ = 0.99, noise = 0.5
Q-Values

Q*(s, a) = expected utility starting in s, taking action a, and (thereafter) acting optimally

Bellman Equation:

    Q^*(s, a) = \sum_{s'} P(s' \mid s, a)\left( R(s, a, s') + \gamma \max_{a'} Q^*(s', a') \right)

Q-Value Iteration:

    Q^*_{k+1}(s, a) \leftarrow \sum_{s'} P(s' \mid s, a)\left( R(s, a, s') + \gamma \max_{a'} Q^*_k(s', a') \right)
Q-Value Iteration

    Q^*_{k+1}(s, a) \leftarrow \sum_{s'} P(s' \mid s, a)\left( R(s, a, s') + \gamma \max_{a'} Q^*_k(s', a') \right)

[Gridworld demo: Q-values after k = 100 iterations, with Noise = 0.2 and Discount = 0.9.]
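A minimal Python/NumPy sketch of Q-value iteration, again on a hypothetical tabular MDP (the array names and shapes are assumptions of this example):

import numpy as np

def q_value_iteration(P, R, gamma=0.9, n_iters=100):
    """Tabular Q-value iteration.

    P[a, s, s'] : transition probabilities, R[a, s, s'] : rewards.
    Returns Q[a, s] after n_iters Bellman backups.
    """
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_actions, n_states))
    for _ in range(n_iters):
        V = Q.max(axis=0)                          # V_k(s') = max_a' Q_k(s', a')
        target = R + gamma * V[None, None, :]      # R(s,a,s') + gamma * V_k(s')
        Q = np.einsum('ast,ast->as', P, target)    # expectation over s'
    return Q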
Outline recap: Motivation | Markov Decision Processes (MDPs) | Exact Solution Methods: Value Iteration, Policy Iteration | Maximum Entropy Formulation
Policy Evaluation
- Recall value iteration:

    V_k^*(s) \leftarrow \max_a \sum_{s'} P(s' \mid s, a)\left( R(s, a, s') + \gamma V_{k-1}^*(s') \right)

- Policy evaluation for a given policy \pi(s):

    V_k^\pi(s) \leftarrow \sum_{s'} P(s' \mid s, \pi(s))\left( R(s, \pi(s), s') + \gamma V_{k-1}^\pi(s') \right)

- At convergence:

    \forall s: \quad V^\pi(s) = \sum_{s'} P(s' \mid s, \pi(s))\left( R(s, \pi(s), s') + \gamma V^\pi(s') \right)
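Since the fixed point above is linear in V^pi, it can also be solved exactly with linear algebra. A short Python/NumPy sketch, again on a hypothetical tabular MDP with assumed array shapes:

import numpy as np

def policy_evaluation_exact(P, R, policy, gamma=0.9):
    """Solve V^pi = R^pi + gamma * P^pi V^pi exactly.

    P[a, s, s'] : transition probabilities, R[a, s, s'] : rewards,
    policy[s]   : action taken in state s (deterministic policy).
    """
    n_states = P.shape[1]
    idx = np.arange(n_states)
    P_pi = P[policy, idx]                              # P_pi[s, s'] = P(s'|s, pi(s))
    R_pi = (P_pi * R[policy, idx]).sum(axis=1)         # expected immediate reward per state
    # (I - gamma * P_pi) V = R_pi   ->   V = (I - gamma * P_pi)^{-1} R_pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)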
- Policy iteration: at convergence we obtain an optimal policy, and it often converges faster than value iteration under some conditions.
Policy Iteration Guarantees

Policy iteration iterates over:
- Policy evaluation for the current policy \pi_k (repeat until convergence):

    V_{i+1}^{\pi_k}(s) \leftarrow \sum_{s'} P(s' \mid s, \pi_k(s))\left[ R(s, \pi_k(s), s') + \gamma V_i^{\pi_k}(s') \right]

- Policy improvement:

    \pi_{k+1}(s) \leftarrow \arg\max_a \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^{\pi_k}(s') \right]

Theorem. Policy iteration is guaranteed to converge, and at convergence the current policy and its value function are the optimal policy and the optimal value function!
Proof sketch:
(1) Guaranteed to converge: in every step the policy improves. This means that a given policy can be encountered at most once, so after we have iterated as many times as there are different policies, i.e., (number of actions)^(number of states), we must be done and hence have converged.
(2) Optimal at convergence: by definition of convergence, at convergence \pi_{k+1}(s) = \pi_k(s) for all states s. This means

    \forall s: \quad V^{\pi_k}(s) = \max_a \sum_{s'} P(s' \mid s, a)\left[ R(s, a, s') + \gamma V^{\pi_k}(s') \right]

Hence V^{\pi_k} satisfies the Bellman equation, which means V^{\pi_k} is equal to the optimal value function V*.
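A compact Python/NumPy sketch of the full policy iteration loop, combining exact policy evaluation with greedy improvement as above (the MDP arrays and function names are assumptions of this example):

import numpy as np

def policy_iteration(P, R, gamma=0.9, max_iters=1000):
    """Tabular policy iteration. P[a, s, s'] and R[a, s, s'] as before."""
    n_actions, n_states, _ = P.shape
    idx = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)           # arbitrary initial policy
    V = np.zeros(n_states)
    for _ in range(max_iters):
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[policy, idx]
        R_pi = (P_pi * R[policy, idx]).sum(axis=1)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: greedy one-step look-ahead against V^{pi_k}.
        Q = np.einsum('ast,ast->as', P, R + gamma * V[None, None, :])
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):        # converged: pi_{k+1} = pi_k
            return policy, V
        policy = new_policy
    return policy, V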
Outline recap: Motivation | Markov Decision Processes (MDPs) | Exact Solution Methods: Value Iteration, Policy Iteration | Maximum Entropy Formulation
What if we could find a distribution over near-optimal solutions?
- More robust policy: if the environment changes, the distribution over near-optimal solutions might still have some good ones for the new situation
- Max-ent formulation:
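The objective on this slide did not survive extraction; a standard way to write the maximum-entropy objective (reconstructed here as an assumption, with the entropy weight denoted β) is:

    \max_\pi \; \mathbb{E}\left[ \sum_{t=0}^{H} \gamma^t \big( R(s_t, a_t, s_{t+1}) + \beta\, \mathcal{H}(\pi(\cdot \mid s_t)) \big) \right],
    \qquad \mathcal{H}(\pi(\cdot \mid s)) = -\sum_a \pi(a \mid s) \log \pi(a \mid s).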
Max-ent Value Iteration
- Yes, we'll cover this soon, but first we need an intermezzo on constrained optimization...
Constrained Optimization
- Original problem:
- Lagrangian:
- At optimum:
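The three displays on this slide did not survive extraction; a generic version of the setup they refer to (an assumption about the original content, written for an equality-constrained maximization) is:

    \text{Original problem:}\quad \max_x f(x) \;\;\text{s.t.}\;\; g(x) = 0
    \text{Lagrangian:}\quad L(x, \lambda) = f(x) + \lambda\, g(x)
    \text{At optimum:}\quad \frac{\partial L}{\partial x} = 0, \qquad \frac{\partial L}{\partial \lambda} = g(x) = 0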
Max-ent for 1-step problem
= softmax
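The derivation on this slide did not survive extraction; a sketch of the standard 1-step result it points at (an assumption about the exact notation, using the entropy weight β from above) is below. Solving the entropy-regularized 1-step problem with a Lagrange multiplier for the normalization constraint gives a softmax policy, and the optimal value is the corresponding log-sum-exp ("soft max"):

    \max_{\pi} \; \sum_a \pi(a)\, r(a) + \beta\, \mathcal{H}(\pi)
    \quad\text{s.t.}\quad \sum_a \pi(a) = 1
    \;\;\Rightarrow\;\;
    \pi^*(a) = \frac{\exp\!\big(r(a)/\beta\big)}{\sum_{a'} \exp\!\big(r(a')/\beta\big)},
    \qquad
    \text{optimal value} = \beta \log \sum_a \exp\!\big(r(a)/\beta\big).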
Max-ent Value Iteration
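The content of this final slide did not survive extraction; a standard form of the soft (max-ent) value iteration backup it presumably shows, obtained by replacing the hard max over actions with the soft max from the 1-step result above (again an assumption, with entropy weight β), is:

    Q_k(s, a) \leftarrow \sum_{s'} P(s' \mid s, a)\left( R(s, a, s') + \gamma\, V_{k-1}(s') \right),
    \qquad
    V_k(s) \leftarrow \beta \log \sum_a \exp\!\big( Q_k(s, a) / \beta \big).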