
The Foundations of Deep RL in 6 Lectures

Lecture 1: MDPs and Exact Solution Methods


Pieter Abbeel
Lecture Series
n Lecture 1: MDPs Foundations and Exact Solution Methods
n Lecture 2: Deep Q-Learning
n Lecture 3: Policy Gradients, Advantage Estimation
n Lecture 4: TRPO, PPO
n Lecture 5: DDPG, SAC
n Lecture 6: Model-based RL
Outline for This Lecture
n Motivation
n Markov Decision Processes (MDPs)
n Exact Solution Methods
    n Value Iteration
    n Policy Iteration
n Maximum Entropy Formulation

(For now: small, discrete state-action spaces, as they are simpler to get the main concepts across.
We will consider large state spaces in the next lecture!)
A Few Deep RL Highlights
2013 Atari (DQN)
[Deepmind]

Pong Enduro Beamrider Q*bert


A Few Deep RL Highlights
2013 Atari (DQN)
[Deepmind]
2014 2D locomotion (TRPO)
[Berkeley]

Play 0:06 – 0:25


A Few Deep RL Highlights
2013 Atari (DQN)
[Deepmind]
2014 2D locomotion (TRPO)
[Berkeley]
2015 AlphaGo
[Deepmind]

AlphaGo Silver et al, Nature 2016


AlphaGoZero Silver et al, Nature 2017
AlphaZero Silver et al, 2017
Tian et al, 2016; Maddison et al, 2014; Clark et al, 2015
A Few Deep RL Highlights
2013 Atari (DQN)
[Deepmind]
2014 2D locomotion (TRPO)
[Berkeley]
2015 AlphaGo
[Deepmind]
2016 3D locomotion (TRPO+GAE)
[Berkeley]

[Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016]


A Few Deep RL Highlights
2013 Atari (DQN)
[Deepmind]
2014 2D locomotion (TRPO)
[Berkeley]
2015 AlphaGo
[Deepmind]
2016 3D locomotion (TRPO+GAE)
[Berkeley]
2016 Real Robot Manipulation
(GPS) [Berkeley]

[Levine*, Finn*, Darrell, Abbeel, JMLR 2016]


Unsupervised Learning for Interaction?

[Levine et al, 2016 (Google)]


A Few Deep RL Highlights
2013 Atari (DQN)
[Deepmind]
2014 2D locomotion (TRPO)
[Berkeley]
2015 AlphaGo
[Deepmind]
2016 3D locomotion (TRPO+GAE)
[Berkeley]
2016 Real Robot Manipulation
(GPS) [Berkeley, Google]
2017 Dota2
(PPO) [OpenAI]

OpenAI Dota bot beat top humans 1:1 (Aug 2017)


A Few Deep RL Highlights
2013 Atari (DQN)
[Deepmind]
2014 2D locomotion (TRPO)
[Berkeley]
2015 AlphaGo
[Deepmind]
2016 3D locomotion (TRPO+GAE)
[Berkeley]
2016 Real Robot Manipulation
(GPS) [Berkeley, Google]
2017 Dota2
(PPO) [OpenAI]
2018 DeepMimic
[Berkeley]
[Peng, Abbeel, Levine, van de Panne, 2018]
Deep RL: Dynamic Animation for Motion Picture

[Peng, Abbeel, Levine, van de Panne, 2018]
A Few Deep RL Highlights
2013 Atari (DQN)
[Deepmind]
2014 2D locomotion (TRPO)
[Berkeley]
2015 AlphaGo
[Deepmind]
2016 3D locomotion (TRPO+GAE)
[Berkeley]
2016 Real Robot Manipulation
(GPS) [Berkeley, Google]
2017 Dota2
(PPO) [OpenAI]
2018 DeepMimic
[Berkeley]
2019 AlphaStar
[Deepmind]
A Few Deep RL Highlights
2013 Atari (DQN)
[Deepmind]
2014 2D locomotion (TRPO)
[Berkeley]
2015 AlphaGo
[Deepmind]
2016 3D locomotion (TRPO+GAE)
[Berkeley]
2016 Real Robot Manipulation
(GPS) [Berkeley, Google]
2017 Dota2
(PPO) [OpenAI]
2018 DeepMimic
[Berkeley]
2019 AlphaStar
[Deepmind]
2019 Rubik’s Cube (PPO+DR)
[OpenAI]
Outline for This Lecture
n Motivation
n Markov Decision Processes (MDPs)
n Exact Solution Methods
    n Value Iteration
    n Policy Iteration
n Maximum Entropy Formulation

(For now: small, discrete state-action spaces, as they are simpler to get the main concepts across.
We will consider large state spaces in the next lecture!)
Markov Decision Process

Assumption: agent gets to observe the state

[Drawing from Sutton and Barto, Reinforcement Learning: An Introduction, 1998]


Markov Decision Process (MDP)
An MDP is defined by:
n Set of states S
n Set of actions A
n Transition function P(s’ | s, a)
n Reward function R(s, a, s’)
n Start state s0
n Discount factor γ
n Horizon H
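To make this definition concrete, here is a minimal sketch of how such a tabular MDP could be represented in Python. The container name, field names, and the tiny two-state example are illustrative assumptions for this note, not something from the lecture.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, List

# Minimal tabular MDP container (illustrative sketch).
# P[(s, a)] is a list of (next_state, probability) pairs;
# R[(s, a, s_next)] is the reward, defaulting to 0 for unlisted triples.
@dataclass
class TabularMDP:
    states: List[str]
    actions: List[str]
    P: Dict[Tuple[str, str], List[Tuple[str, float]]]
    R: Dict[Tuple[str, str, str], float]
    s0: str
    gamma: float = 0.9
    H: int = 100

# Tiny two-state example: from "A" the action "go" reaches "B" with prob 0.8.
mdp = TabularMDP(
    states=["A", "B"],
    actions=["go", "stay"],
    P={("A", "go"): [("B", 0.8), ("A", 0.2)],
       ("A", "stay"): [("A", 1.0)],
       ("B", "go"): [("B", 1.0)],
       ("B", "stay"): [("B", 1.0)]},
    R={("A", "go", "B"): 1.0},
    s0="A",
)
```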
Examples
MDP (S, A, T, R, γ, H), goal: act so as to maximize the expected (discounted) sum of rewards

q Cleaning robot
q Walking robot
q Pole balancing
q Games: Tetris, backgammon
q Server management
q Shortest path problems
q Model for animals, people


Example MDP: Gridworld
An MDP is defined by:
n Set of states S
n Set of actions A
n Transition function P(s’ | s, a)
n Reward function R(s, a, s’)
n Start state s0
n Discount factor γ
n Horizon H

[Figure: gridworld with its goal squares and an example policy π]
Solving MDPs
n In an MDP, we want to find an optimal policy π*: S × {0, …, H} → A
n A policy π gives an action for each state for each time t

[Figure: example gridworld policies for t = 0, 1, 2, 3, 4, 5 = H]

n An optimal policy maximizes the expected sum of rewards

n Contrast: if the environment were deterministic, we would just need an optimal plan, i.e., a
sequence of actions from start to a goal
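Written out as a formula (a restatement consistent with the V* definition on the following slides), the objective is:

```latex
\pi^* = \arg\max_{\pi} \; \mathbb{E}\!\left[ \sum_{t=0}^{H} \gamma^t \, R(s_t, a_t, s_{t+1}) \;\middle|\; \pi, \, s_0 \right]
```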
Outline for This Lecture
n Motivation
n Markov Decision Processes (MDPs)
n Exact Solution Methods
    n Value Iteration
    n Policy Iteration
n Maximum Entropy Formulation

(For now: small, discrete state-action spaces, as they are simpler to get the main concepts across.
We will consider large state spaces in the next lecture!)
Optimal Value Function V*
" H
#
X
V ⇤ (s) = max E t
R(st , at , st+1 ) | ⇡, s0 = s

t=0

= sum of discounted rewards when starting from state s and acting optimally
Optimal Value Function V*
" H
#
X
V ⇤ (s) = max E t
R(st , at , st+1 ) | ⇡, s0 = s

t=0

= sum of discounted rewards when starting from state s and acting optimally

Let’s assume:
actions deterministically successful, gamma = 1, H = 100
V*(4,3) = 1
V*(3,3) = 1
V*(2,3) = 1
V*(1,1) = 1
V*(4,2) = -1
Optimal Value Function V*
" H
#
X
V ⇤ (s) = max E t
R(st , at , st+1 ) | ⇡, s0 = s

t=0

= sum of discounted rewards when starting from state s and acting optimally

Let’s assume:
actions deterministically successful, gamma = 0.9, H = 100
V*(4,3) = 1
V*(3,3) = 0.9
V*(2,3) = 0.9*0.9 = 0.81
V*(1,1) = 0.9*0.9*0.9*0.9*0.9 = 0.59
V*(4,2) = -1
Optimal Value Function V*
" H
#
X
V ⇤ (s) = max E t
R(st , at , st+1 ) | ⇡, s0 = s

t=0

= sum of discounted rewards when starting from state s and acting optimally

Let’s assume:
actions successful w/probability 0.8, gamma = 0.9, H = 100
V*(4,3) = 1
V*(3,3) = 0.8 * 0.9 * V*(4,3) + 0.1 * 0.9 * V*(3,3) + 0.1 * 0.9 * V*(3,2)
V*(2,3) =
V*(1,1) =
V*(4,2) =
Value Iteration
n V_0^*(s) = optimal value for state s when H = 0
        V_0^*(s) = 0 \quad \forall s
n V_1^*(s) = optimal value for state s when H = 1
        V_1^*(s) = \max_a \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V_0^*(s') \big)
n V_2^*(s) = optimal value for state s when H = 2
        V_2^*(s) = \max_a \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V_1^*(s') \big)
n V_k^*(s) = optimal value for state s when H = k
        V_k^*(s) = \max_a \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V_{k-1}^*(s') \big)
Value Iteration
Algorithm:
Start with V_0^*(s) = 0 for all s.
For k = 1, …, H:
    For all states s in S:
        V_k^*(s) \leftarrow \max_a \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V_{k-1}^*(s') \big)
        \pi_k^*(s) \leftarrow \arg\max_a \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V_{k-1}^*(s') \big)

This is called a value update or Bellman update/back-up
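As a concrete sketch of this loop, here is finite-horizon value iteration in Python. It assumes the same tabular representation as the earlier MDP sketch (`P[(s, a)]` as a list of `(s', p)` pairs, `R` as a reward dict) — an assumption of these notes, not something specified in the slides.

```python
def value_iteration(states, actions, P, R, gamma=0.9, H=100):
    """Finite-horizon value iteration: returns V_H and the greedy policy pi_H.

    P[(s, a)] -> list of (s_next, prob); R.get((s, a, s_next), 0.0) -> reward.
    (Illustrative sketch; the representation is assumed, not from the lecture.)
    """
    V = {s: 0.0 for s in states}          # V_0*(s) = 0 for all s
    pi = {s: None for s in states}
    for k in range(1, H + 1):
        V_new = {}
        for s in states:
            best_value, best_action = float("-inf"), None
            for a in actions:
                # Bellman back-up: expected reward plus discounted future value
                q = sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                        for s2, p in P.get((s, a), []))
                if q > best_value:
                    best_value, best_action = q, a
            V_new[s] = best_value
            pi[s] = best_action
        V = V_new
    return V, pi
```

With the tiny two-state `mdp` from the earlier sketch, `value_iteration(mdp.states, mdp.actions, mdp.P, mdp.R, mdp.gamma, mdp.H)` would return the horizon-H values and a greedy policy.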


Value Iteration

k=0

Noise = 0.2
Discount = 0.9
Value Iteration

k=1

Noise = 0.2
Discount = 0.9
Value Iteration

k=2

Noise = 0.2
Discount = 0.9
Value Iteration

k=3

Noise = 0.2
Discount = 0.9
Value Iteration

k=4

Noise = 0.2
Discount = 0.9
Value Iteration

k=5

Noise = 0.2
Discount = 0.9
Value Iteration

k=6

Noise = 0.2
Discount = 0.9
Value Iteration

k=7

Noise = 0.2
Discount = 0.9
Value Iteration

k=8

Noise = 0.2
Discount = 0.9
Value Iteration

k=9

Noise = 0.2
Discount = 0.9
Value Iteration

k = 10

Noise = 0.2
Discount = 0.9
Value Iteration

k = 11

Noise = 0.2
Discount = 0.9
Value Iteration

k = 12

Noise = 0.2
Discount = 0.9
Value Iteration

k = 100

Noise = 0.2
Discount = 0.9
Value Iteration Convergence
Theorem. Value iteration converges. At convergence, we have found the
optimal value function V* for the discounted infinite-horizon problem, which
satisfies the Bellman equations:

    V^*(s) = \max_a \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V^*(s') \big)

§ Now we know how to act for the infinite horizon with discounted rewards!
§ Run value iteration till convergence.
§ This produces V*, which in turn tells us how to act, namely by following:

    \pi^*(s) = \arg\max_a \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V^*(s') \big)

§ Note: the infinite-horizon optimal policy is stationary, i.e., the optimal action at
a state s is the same action at all times. (Efficient to store!)
Convergence: Intuition
n V^*(s)   = expected sum of rewards accumulated starting from state s, acting optimally for ∞ steps
n V_H^*(s) = expected sum of rewards accumulated starting from state s, acting optimally for H steps

n Additional reward collected over time steps H+1, H+2, …

    \gamma^{H+1} R(s_{H+1}) + \gamma^{H+2} R(s_{H+2}) + \ldots
    \;\le\; \gamma^{H+1} R_{\max} + \gamma^{H+2} R_{\max} + \ldots
    \;=\; \frac{\gamma^{H+1}}{1-\gamma} R_{\max}

  goes to zero as H goes to infinity

Hence V_H^* \to V^* as H \to \infty

For simplicity of notation in the above it was assumed that rewards are
always greater than or equal to zero. If rewards can be negative,
a similar argument holds, using max |R| and bounding from both sides.
Convergence and Contractions
n Definition: max-norm:  \|U\| = \max_s |U(s)|

n Definition: An update operation B is a γ-contraction in max-norm if and only if,
  for all U_i, V_i:   \|B(U_i) - B(V_i)\| \le \gamma \, \|U_i - V_i\|

n Theorem: A contraction converges to a unique fixed point, no matter the initialization.

n Fact: the value iteration update is a γ-contraction in max-norm
n Corollary: value iteration converges to a unique fixed point
n Additional fact: if \|V_{i+1} - V_i\| < \epsilon, then \|V_{i+1} - V^*\| < 2\epsilon\gamma / (1-\gamma)
n I.e. once the update is small, it must also be close to converged
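In code, this fact is what justifies the usual stopping rule for infinite-horizon value iteration: stop once a Bellman back-up changes the values by less than some tolerance. A minimal sketch, under the same assumed tabular representation as before (the tolerance value is arbitrary):

```python
def value_iteration_to_convergence(states, actions, P, R, gamma=0.9, eps=1e-6):
    """Run Bellman back-ups until the max-norm change falls below eps."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: max(sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                            for s2, p in P.get((s, a), []))
                        for a in actions)
                 for s in states}
        # max-norm distance between successive iterates
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < eps:
            # By the contraction bound, V is within 2*eps*gamma/(1-gamma) of V* in max-norm.
            return V
```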
Exercise 1: Effect of Discount and Noise

Match each desired behavior with the parameter setting that produces it:

(a) Prefer the close exit (+1), risking the cliff (-10)
(b) Prefer the close exit (+1), but avoiding the cliff (-10)
(c) Prefer the distant exit (+10), risking the cliff (-10)
(d) Prefer the distant exit (+10), avoiding the cliff (-10)

(1) γ = 0.1, noise = 0.5
(2) γ = 0.99, noise = 0
(3) γ = 0.99, noise = 0.5
(4) γ = 0.1, noise = 0
Exercise 1 Solution

(a) Prefer close exit (+1), risking the cliff (-10) --- (4) γ = 0.1, noise = 0
Exercise 1 Solution

(b) Prefer close exit (+1), avoiding the cliff (-10) --- (1) γ = 0.1, noise = 0.5
Exercise 1 Solution

(c) Prefer distant exit (+10), risking the cliff (-10) --- (2) γ = 0.99, noise = 0
Exercise 1 Solution

(d) Prefer distant exit (+10), avoid the cliff (-10) --- (3) γ = 0.99, noise = 0.5
Q-Values
Q*(s, a) = expected utility starting in s, taking action a, and (thereafter)
acting optimally

Bellman Equation:

    Q^*(s,a) = \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \big)

Q-Value Iteration:

    Q^*_{k+1}(s,a) \leftarrow \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma \max_{a'} Q^*_k(s',a') \big)
Q-Value Iteration
Q^*_{k+1}(s,a) \leftarrow \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma \max_{a'} Q^*_k(s',a') \big)

k = 100

Noise = 0.2
Discount = 0.9
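Here is a tabular Q-value iteration sketch under the same assumed representation as the earlier value iteration example (names and defaults are illustrative):

```python
def q_value_iteration(states, actions, P, R, gamma=0.9, H=100):
    """Tabular Q-value iteration sketch: returns Q_H with Q[(s, a)] ~ Q*(s, a)."""
    Q = {(s, a): 0.0 for s in states for a in actions}   # Q_0* = 0
    for k in range(H):
        Q_new = {}
        for s in states:
            for a in actions:
                # back-up: expected reward plus discounted value of acting greedily next
                Q_new[(s, a)] = sum(
                    p * (R.get((s, a, s2), 0.0)
                         + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for s2, p in P.get((s, a), []))
        Q = Q_new
    return Q

# Greedy action at a state s:  max(actions, key=lambda a: Q[(s, a)])
```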
Outline for This Lecture
n Motivation
n Markov Decision Processes (MDPs)
n Exact Solution Methods
    n Value Iteration
    n Policy Iteration
n Maximum Entropy Formulation

(For now: small, discrete state-action spaces, as they are simpler to get the main concepts across.
We will consider large state spaces in the next lecture!)
Policy Evaluation
n Recall value iteration:
        V_k^*(s) \leftarrow \max_a \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V_{k-1}^*(s') \big)

n Policy evaluation for a given policy \pi(s):
        V_k^\pi(s) \leftarrow \sum_{s'} P(s'|s,\pi(s)) \big( R(s,\pi(s),s') + \gamma V_{k-1}^\pi(s') \big)

At convergence:
        \forall s: \; V^\pi(s) = \sum_{s'} P(s'|s,\pi(s)) \big( R(s,\pi(s),s') + \gamma V^\pi(s') \big)
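A sketch of iterative policy evaluation for a deterministic policy `pi: state -> action`, again under the assumed tabular representation used in the earlier examples:

```python
def policy_evaluation(states, pi, P, R, gamma=0.9, eps=1e-8):
    """Iteratively evaluate a deterministic policy pi until the values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        # one sweep of the fixed-policy Bellman back-up
        V_new = {s: sum(p * (R.get((s, pi[s], s2), 0.0) + gamma * V[s2])
                        for s2, p in P.get((s, pi[s]), []))
                 for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < eps:
            return V_new
        V = V_new
```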
Exercise 2

Consider a stochastic policy \pi(a|s), where \pi(a|s) is the probability of taking
action a when in state s. Which of the following is the correct update to perform
policy evaluation for this stochastic policy?

1.  V_{k+1}^{\pi}(s) \leftarrow \max_a \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V_k^{\pi}(s') \big)

2.  V_{k+1}^{\pi}(s) \leftarrow \sum_{s'} \sum_a \pi(a|s) \, P(s'|s,a) \big( R(s,a,s') + \gamma V_k^{\pi}(s') \big)

3.  V_{k+1}^{\pi}(s) \leftarrow \sum_a \pi(a|s) \max_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V_k^{\pi}(s') \big)
Policy Iteration
One iteration of policy iteration:
n Policy evaluation for current policy \pi_k: iterate until convergence
        V_{i+1}^{\pi_k}(s) \leftarrow \sum_{s'} P(s'|s,\pi_k(s)) \big[ R(s,\pi_k(s),s') + \gamma V_i^{\pi_k}(s') \big]

n Policy improvement: find the best action according to a one-step look-ahead
        \pi_{k+1}(s) \leftarrow \arg\max_a \sum_{s'} P(s'|s,a) \big[ R(s,a,s') + \gamma V^{\pi_k}(s') \big]

n Repeat until the policy converges

n At convergence: optimal policy; and it converges faster than value iteration under some conditions
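Putting the two steps together, here is a policy iteration sketch. It reuses the `policy_evaluation` sketch above and the same assumed tabular representation; all names are illustrative.

```python
def policy_iteration(states, actions, P, R, gamma=0.9):
    """Alternate policy evaluation and greedy policy improvement until stable."""
    pi = {s: actions[0] for s in states}          # arbitrary initial policy
    while True:
        V = policy_evaluation(states, pi, P, R, gamma)   # evaluate current policy
        stable = True
        for s in states:
            # one-step look-ahead: pick the action with the highest expected value
            best_a = max(actions,
                         key=lambda a: sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                                           for s2, p in P.get((s, a), [])))
            if best_a != pi[s]:
                pi[s], stable = best_a, False
        if stable:
            return pi, V
```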
Policy Iteration Guarantees
Policy Iteration iterates over:

        V_{i+1}^{\pi_k}(s) \leftarrow \sum_{s'} P(s'|s,\pi_k(s)) \big[ R(s,\pi_k(s),s') + \gamma V_i^{\pi_k}(s') \big]

        \pi_{k+1}(s) \leftarrow \arg\max_a \sum_{s'} P(s'|s,a) \big[ R(s,a,s') + \gamma V^{\pi_k}(s') \big]

Theorem. Policy iteration is guaranteed to converge, and at convergence the current policy
and its value function are the optimal policy and the optimal value function!

Proof sketch:
(1) Guaranteed to converge: in every step the policy improves. This means that a given policy can be
encountered at most once, so after we have iterated as many times as there are distinct
policies, i.e., (number of actions)^(number of states), we must be done and hence have converged.
(2) Optimal at convergence: by definition of convergence, at convergence \pi_{k+1}(s) = \pi_k(s) for all states s. This
means \forall s: \; V^{\pi_k}(s) = \max_a \sum_{s'} P(s'|s,a) \big[ R(s,a,s') + \gamma V^{\pi_k}(s') \big].
Hence V^{\pi_k} satisfies the Bellman equation, which means V^{\pi_k} is equal to the optimal value function V*.
Outline for This Lecture
n Motivation
n Markov Decision Processes (MDPs)
n Exact Solution Methods
    n Value Iteration
    n Policy Iteration
n Maximum Entropy Formulation

(For now: small, discrete state-action spaces, as they are simpler to get the main concepts across.
We will consider large state spaces in the next lecture!)
What if we could find a distribution over near-optimal solutions?
n More robust policy: if the environment changes, the distribution over near-optimal
solutions might still have some good ones for the new situation

n More robust learning: if we can retain a distribution over near-optimal solutions,
our agent will collect more interesting exploratory data during learning
Entropy
n Entropy = measure of uncertainty over random variable X
          = number of bits required to encode X (on average):  \mathcal{H}(X) = -\sum_x p(x) \log_2 p(x)
Entropy
E.g. for a binary random variable X with P(X = 1) = p:
    \mathcal{H}(X) = -p \log_2 p - (1-p) \log_2 (1-p), which is maximized at p = 0.5
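A quick numerical illustration of this definition (the example distributions are arbitrary):

```python
import math

def entropy(p):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin is maximally uncertain
print(entropy([0.9, 0.1]))   # ~0.47 bits: a biased coin is more predictable
print(entropy([1.0, 0.0]))   # 0.0 bits: no uncertainty at all
```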
Maximum Entropy MDP
n Regular formulation:  \max_\pi \; \mathbb{E}\big[ \sum_t \gamma^t r_t \big]

n Max-ent formulation:  \max_\pi \; \mathbb{E}\big[ \sum_t \gamma^t \big( r_t + \beta \, \mathcal{H}(\pi(\cdot \mid s_t)) \big) \big]
Max-ent Value Iteration
n Yes, we’ll cover soon, but we first need intermezzo on
constrained optimization…
Constrained Optimization
n Original problem:  \max_x f(x) \quad \text{s.t.} \quad g(x) = 0

n Lagrangian:  \mathcal{L}(x, \lambda) = f(x) + \lambda \, g(x)

n At optimum:  \nabla_x \mathcal{L} = \nabla_x f(x) + \lambda \nabla_x g(x) = 0 \quad \text{and} \quad \nabla_\lambda \mathcal{L} = g(x) = 0
Max-ent for 1-step problem
Max-ent for 1-step problem

= softmax
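The formulas on these slides did not survive extraction; the following is a hedged reconstruction of the standard 1-step max-ent derivation, with temperature β assumed:

```latex
\max_{\pi} \; \sum_a \pi(a)\, r(a) + \beta \,\mathcal{H}(\pi)
\quad \text{s.t.} \quad \sum_a \pi(a) = 1 .

\text{Lagrangian: } \mathcal{L}(\pi,\lambda)
   = \sum_a \pi(a)\, r(a) - \beta \sum_a \pi(a) \log \pi(a)
     + \lambda \Big( \sum_a \pi(a) - 1 \Big).

\frac{\partial \mathcal{L}}{\partial \pi(a)} = r(a) - \beta \log \pi(a) - \beta + \lambda = 0
\;\;\Rightarrow\;\;
\pi(a) = \frac{\exp\!\big(r(a)/\beta\big)}{\sum_{a'} \exp\!\big(r(a')/\beta\big)},
\qquad
\text{optimal value} = \beta \log \sum_a \exp\!\big(r(a)/\beta\big).
```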
Max-ent Value Iteration

= 1-step problem (with Q instead of r), so we can directly transcribe solution:
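The transcribed solution is likewise missing from the extracted text; below is a hedged Python sketch of what this transcription is generally taken to give — soft (max-ent) value iteration with temperature β — under the same assumed tabular representation as the earlier sketches.

```python
import math

def soft_value_iteration(states, actions, P, R, gamma=0.9, beta=1.0, H=100):
    """Max-ent value iteration sketch: a soft-max (log-sum-exp) replaces the hard max."""
    def q_backup(V):
        return {(s, a): sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                            for s2, p in P.get((s, a), []))
                for s in states for a in actions}

    V = {s: 0.0 for s in states}
    for _ in range(H):
        Q = q_backup(V)
        # V(s) = beta * log sum_a exp(Q(s, a) / beta): the "soft" maximum over actions
        V = {s: beta * math.log(sum(math.exp(Q[(s, a)] / beta) for a in actions))
             for s in states}

    # Max-ent policy: pi(a|s) proportional to exp(Q(s, a) / beta)
    Q = q_backup(V)
    pi = {}
    for s in states:
        z = sum(math.exp(Q[(s, a)] / beta) for a in actions)
        for a in actions:
            pi[(s, a)] = math.exp(Q[(s, a)] / beta) / z
    return V, pi
```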


Outline for This Lecture
n Motivation
n Markov Decision Processes (MDPs)
n Exact Solution Methods
    n Value Iteration
    n Policy Iteration
n Maximum Entropy Formulation

(For now: small, discrete state-action spaces, as they are simpler to get the main concepts across.
We will consider large state spaces in the next lecture!)
