
The Foundations of Deep RL in 6 Lectures

Lecture 1: MDPs and Exact Solution Methods


Pieter Abbeel
Lecture Series
n Lecture 1: MDPs Foundations and Exact Solution Methods
n Lecture 2: Deep Q-Learning
n Lecture 3: Policy Gradients, Advantage Estimation
n Lecture 4: TRPO, PPO
n Lecture 5: DDPG, SAC
n Lecture 6: Model-based RL
Outline for This Lecture
n Motivation
n Markov Decision Processes (MDPs)
n Exact Solution Methods
    n Value Iteration
    n Policy Iteration
n Maximum Entropy Formulation

(For now: small, discrete state-action spaces, as they are simpler to get the main concepts across.
We will consider large state spaces in the next lecture!)
A Few Deep RL Highlights
2013 Atari (DQN)
[Deepmind]

Pong Enduro Beamrider Q*bert


A Few Deep RL Highlights
2013 Atari (DQN)
[Deepmind]
2014 2D locomotion (TRPO)
[Berkeley]

Play 0:06 – 0:25


A Few Deep RL Highlights
2013 Atari (DQN)
[Deepmind]
2014 2D locomotion (TRPO)
[Berkeley]
2015 AlphaGo
[Deepmind]

AlphaGo Silver et al, Nature 2016


AlphaGoZero Silver et al, Nature 2017
AlphaZero Silver et al, 2017
Tian et al, 2016; Maddison et al, 2014; Clark et al, 2015
A Few Deep RL Highlights
2013 Atari (DQN)
[Deepmind]
2014 2D locomotion (TRPO)
[Berkeley]
2015 AlphaGo
[Deepmind]
2016 3D locomotion (TRPO+GAE)
[Berkeley]

[Schulman, Moritz, Levine, Jordan, Abbeel, ICLR 2016]


A Few Deep RL Highlights
2013 Atari (DQN)
[Deepmind]
2014 2D locomotion (TRPO)
[Berkeley]
2015 AlphaGo
[Deepmind]
2016 3D locomotion (TRPO+GAE)
[Berkeley]
2016 Real Robot Manipulation
(GPS) [Berkeley]

[Levine*, Finn*, Darrell, Abbeel, JMLR 2016]


Unsupervised Learning for Interaction?

[Levine et al, 2016 (Google)]


A Few Deep RL Highlights
2013 Atari (DQN)
[Deepmind]
2014 2D locomotion (TRPO)
[Berkeley]
2015 AlphaGo
[Deepmind]
2016 3D locomotion (TRPO+GAE)
[Berkeley]
2016 Real Robot Manipulation
(GPS) [Berkeley, Google]
2017 Dota2
(PPO) [OpenAI]

OpenAI Dota bot beat top humans 1:1 (Aug 2017)


A Few Deep RL Highlights
2013 Atari (DQN)
[Deepmind]
2014 2D locomotion (TRPO)
[Berkeley]
2015 AlphaGo
[Deepmind]
2016 3D locomotion (TRPO+GAE)
[Berkeley]
2016 Real Robot Manipulation
(GPS) [Berkeley, Google]
2017 Dota2
(PPO) [OpenAI]
2018 DeepMimic
[Berkeley]
[Peng, Abbeel, Levine, van de Panne, 2018]
Deep RL: Dynamic Animation for Motion Picture

[Peng, Abbeel, Levine, van de Panne, 2018]
A Few Deep RL Highlights
2013 Atari (DQN)
[Deepmind]
2014 2D locomotion (TRPO)
[Berkeley]
2015 AlphaGo
[Deepmind]
2016 3D locomotion (TRPO+GAE)
[Berkeley]
2016 Real Robot Manipulation
(GPS) [Berkeley, Google]
2017 Dota2
(PPO) [OpenAI]
2018 DeepMimic
[Berkeley]
2019 AlphaStar
[Deepmind]
A Few Deep RL Highlights
2013 Atari (DQN)
[Deepmind]
2014 2D locomotion (TRPO)
[Berkeley]
2015 AlphaGo
[Deepmind]
2016 3D locomotion (TRPO+GAE)
[Berkeley]
2016 Real Robot Manipulation
(GPS) [Berkeley, Google]
2017 Dota2
(PPO) [OpenAI]
2018 DeepMimic
[Berkeley]
2019 AlphaStar
[Deepmind]
2019 Rubik’s Cube (PPO+DR)
[OpenAI]
Outline for This Lecture
n Motivation
n Markov Decision Processes (MDPs)
n Exact Solution Methods
    n Value Iteration
    n Policy Iteration
n Maximum Entropy Formulation

(For now: small, discrete state-action spaces, as they are simpler to get the main concepts across.
We will consider large state spaces in the next lecture!)
Markov Decision Process

Assumption: agent gets to observe the state

[Drawing from Sutton and Barto, Reinforcement Learning: An Introduction, 1998]


Markov Decision Process (MDP)
An MDP is defined by:
n Set of states S
n Set of actions A
n Transition function P(s’ | s, a)
n Reward function R(s, a, s’)
n Start state s0
n Discount factor γ
n Horizon H
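To make this definition concrete, here is a minimal sketch of how such a tabular MDP could be represented in Python. The container name, field names, and the tiny two-state example are illustrative assumptions for this note, not something from the lecture.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, List

# Minimal tabular MDP container (illustrative sketch).
# P[(s, a)] is a list of (next_state, probability) pairs;
# R[(s, a, s_next)] is the reward, defaulting to 0 for unlisted triples.
@dataclass
class TabularMDP:
    states: List[str]
    actions: List[str]
    P: Dict[Tuple[str, str], List[Tuple[str, float]]]
    R: Dict[Tuple[str, str, str], float]
    s0: str
    gamma: float = 0.9
    H: int = 100

# Tiny two-state example: from "A" the action "go" reaches "B" with prob 0.8.
mdp = TabularMDP(
    states=["A", "B"],
    actions=["go", "stay"],
    P={("A", "go"): [("B", 0.8), ("A", 0.2)],
       ("A", "stay"): [("A", 1.0)],
       ("B", "go"): [("B", 1.0)],
       ("B", "stay"): [("B", 1.0)]},
    R={("A", "go", "B"): 1.0},
    s0="A",
)
```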
Examples
MDP (S, A, T, R, γ, H), goal: act so as to maximize the expected (discounted) sum of rewards

q Cleaning robot
q Walking robot
q Pole balancing
q Games: Tetris, backgammon
q Server management
q Shortest path problems
q Model for animals, people


Example MDP: Gridworld
An MDP is defined by:
n Set of states S
n Set of actions A
n Transition function P(s’ | s, a)
n Reward function R(s, a, s’)
n Start state s0
n Discount factor γ
n Horizon H

[Figure: gridworld with its goal squares and an example policy π]
Solving MDPs
n In an MDP, we want to find an optimal policy π*: S × {0, …, H} → A
n A policy π gives an action for each state for each time t

[Figure: example gridworld policies for t = 0, 1, 2, 3, 4, 5 = H]

n An optimal policy maximizes the expected sum of rewards

n Contrast: if the environment were deterministic, we would just need an optimal plan, i.e., a
sequence of actions from start to a goal
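Written out as a formula (a restatement consistent with the V* definition on the following slides), the objective is:

```latex
\pi^* = \arg\max_{\pi} \; \mathbb{E}\!\left[ \sum_{t=0}^{H} \gamma^t \, R(s_t, a_t, s_{t+1}) \;\middle|\; \pi, \, s_0 \right]
```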
Outline for This Lecture
n Motivation
n Markov Decision Processes (MDPs)
n Exact Solution Methods
    n Value Iteration
    n Policy Iteration
n Maximum Entropy Formulation

(For now: small, discrete state-action spaces, as they are simpler to get the main concepts across.
We will consider large state spaces in the next lecture!)
Optimal Value Function V*
" H
#
X
V ⇤ (s) = max E t
R(st , at , st+1 ) | ⇡, s0 = s

t=0

= sum of discounted rewards when starting from state s and acting optimally
Optimal Value Function V*
" H
#
X
V ⇤ (s) = max E t
R(st , at , st+1 ) | ⇡, s0 = s

t=0

= sum of discounted rewards when starting from state s and acting optimally

Let’s assume:
actions deterministically successful, gamma = 1, H = 100
V*(4,3) = 1
V*(3,3) = 1
V*(2,3) = 1
V*(1,1) = 1
V*(4,2) = -1
Optimal Value Function V*
" H
#
X
V ⇤ (s) = max E t
R(st , at , st+1 ) | ⇡, s0 = s

t=0

= sum of discounted rewards when starting from state s and acting optimally

Let’s assume:
actions deterministically successful, gamma = 0.9, H = 100
V*(4,3) = 1
V*(3,3) = 0.9
V*(2,3) = 0.9*0.9 = 0.81
V*(1,1) = 0.9*0.9*0.9*0.9*0.9 = 0.59
V*(4,2) = -1
Optimal Value Function V*
" H
#
X
V ⇤ (s) = max E t
R(st , at , st+1 ) | ⇡, s0 = s

t=0

= sum of discounted rewards when starting from state s and acting optimally

Let’s assume:
actions successful w/probability 0.8, gamma = 0.9, H = 100
V*(4,3) = 1
V*(3,3) = 0.8 * 0.9 * V*(4,3) + 0.1 * 0.9 * V*(3,3) + 0.1 * 0.9 * V*(3,2)
V*(2,3) =
V*(1,1) =
V*(4,2) =
Value Iteration
n V_0^*(s) = optimal value for state s when H = 0
        V_0^*(s) = 0 \quad \forall s
n V_1^*(s) = optimal value for state s when H = 1
        V_1^*(s) = \max_a \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V_0^*(s') \big)
n V_2^*(s) = optimal value for state s when H = 2
        V_2^*(s) = \max_a \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V_1^*(s') \big)
n V_k^*(s) = optimal value for state s when H = k
        V_k^*(s) = \max_a \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V_{k-1}^*(s') \big)
Value Iteration
Algorithm:
Start with V_0^*(s) = 0 for all s.
For k = 1, …, H:
    For all states s in S:
        V_k^*(s) \leftarrow \max_a \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V_{k-1}^*(s') \big)
        \pi_k^*(s) \leftarrow \arg\max_a \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V_{k-1}^*(s') \big)

This is called a value update or Bellman update/back-up
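As a concrete sketch of this loop, here is finite-horizon value iteration in Python. It assumes the same tabular representation as the earlier MDP sketch (`P[(s, a)]` as a list of `(s', p)` pairs, `R` as a reward dict) — an assumption of these notes, not something specified in the slides.

```python
def value_iteration(states, actions, P, R, gamma=0.9, H=100):
    """Finite-horizon value iteration: returns V_H and the greedy policy pi_H.

    P[(s, a)] -> list of (s_next, prob); R.get((s, a, s_next), 0.0) -> reward.
    (Illustrative sketch; the representation is assumed, not from the lecture.)
    """
    V = {s: 0.0 for s in states}          # V_0*(s) = 0 for all s
    pi = {s: None for s in states}
    for k in range(1, H + 1):
        V_new = {}
        for s in states:
            best_value, best_action = float("-inf"), None
            for a in actions:
                # Bellman back-up: expected reward plus discounted future value
                q = sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                        for s2, p in P.get((s, a), []))
                if q > best_value:
                    best_value, best_action = q, a
            V_new[s] = best_value
            pi[s] = best_action
        V = V_new
    return V, pi
```

With the tiny two-state `mdp` from the earlier sketch, `value_iteration(mdp.states, mdp.actions, mdp.P, mdp.R, mdp.gamma, mdp.H)` would return the horizon-H values and a greedy policy.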


Value Iteration

k=0

Noise = 0.2
Discount = 0.9
Value Iteration

k=1

Noise = 0.2
Discount = 0.9
Value Iteration

k=2

Noise = 0.2
Discount = 0.9
Value Iteration

k=3

Noise = 0.2
Discount = 0.9
Value Iteration

k=4

Noise = 0.2
Discount = 0.9
Value Iteration

k=5

Noise = 0.2
Discount = 0.9
Value Iteration

k=6

Noise = 0.2
Discount = 0.9
Value Iteration

k=7

Noise = 0.2
Discount = 0.9
Value Iteration

k=8

Noise = 0.2
Discount = 0.9
Value Iteration

k=9

Noise = 0.2
Discount = 0.9
Value Iteration

k = 10

Noise = 0.2
Discount = 0.9
Value Iteration

k = 11

Noise = 0.2
Discount = 0.9
Value Iteration

k = 12

Noise = 0.2
Discount = 0.9
Value Iteration

k = 100

Noise = 0.2
Discount = 0.9
Value Iteration Convergence
Theorem. Value iteration converges. At convergence, we have found the
optimal value function V* for the discounted infinite-horizon problem, which
satisfies the Bellman equations:

    V^*(s) = \max_a \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V^*(s') \big)

§ Now we know how to act for the infinite horizon with discounted rewards!
§ Run value iteration till convergence.
§ This produces V*, which in turn tells us how to act, namely by following:

    \pi^*(s) = \arg\max_a \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V^*(s') \big)

§ Note: the infinite-horizon optimal policy is stationary, i.e., the optimal action at
a state s is the same action at all times. (Efficient to store!)
Convergence: Intuition
n V^*(s)   = expected sum of rewards accumulated starting from state s, acting optimally for ∞ steps
n V_H^*(s) = expected sum of rewards accumulated starting from state s, acting optimally for H steps

n Additional reward collected over time steps H+1, H+2, …

    \gamma^{H+1} R(s_{H+1}) + \gamma^{H+2} R(s_{H+2}) + \ldots
    \;\le\; \gamma^{H+1} R_{\max} + \gamma^{H+2} R_{\max} + \ldots
    \;=\; \frac{\gamma^{H+1}}{1-\gamma} R_{\max}

  goes to zero as H goes to infinity

Hence V_H^* \to V^* as H \to \infty

For simplicity of notation in the above it was assumed that rewards are
always greater than or equal to zero. If rewards can be negative,
a similar argument holds, using max |R| and bounding from both sides.
Convergence and Contractions
n Definition: max-norm:  \|U\| = \max_s |U(s)|

n Definition: An update operation B is a γ-contraction in max-norm if and only if,
  for all U_i, V_i:   \|B(U_i) - B(V_i)\| \le \gamma \, \|U_i - V_i\|

n Theorem: A contraction converges to a unique fixed point, no matter the initialization.

n Fact: the value iteration update is a γ-contraction in max-norm
n Corollary: value iteration converges to a unique fixed point
n Additional fact: if \|V_{i+1} - V_i\| < \epsilon, then \|V_{i+1} - V^*\| < 2\epsilon\gamma / (1-\gamma)
n I.e. once the update is small, it must also be close to converged
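In code, this fact is what justifies the usual stopping rule for infinite-horizon value iteration: stop once a Bellman back-up changes the values by less than some tolerance. A minimal sketch, under the same assumed tabular representation as before (the tolerance value is arbitrary):

```python
def value_iteration_to_convergence(states, actions, P, R, gamma=0.9, eps=1e-6):
    """Run Bellman back-ups until the max-norm change falls below eps."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: max(sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                            for s2, p in P.get((s, a), []))
                        for a in actions)
                 for s in states}
        # max-norm distance between successive iterates
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < eps:
            # By the contraction bound, V is within 2*eps*gamma/(1-gamma) of V* in max-norm.
            return V
```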
Exercise 1: Effect of Discount and Noise

Match each desired behavior with the parameter setting that produces it:

(a) Prefer the close exit (+1), risking the cliff (-10)
(b) Prefer the close exit (+1), but avoiding the cliff (-10)
(c) Prefer the distant exit (+10), risking the cliff (-10)
(d) Prefer the distant exit (+10), avoiding the cliff (-10)

(1) γ = 0.1, noise = 0.5
(2) γ = 0.99, noise = 0
(3) γ = 0.99, noise = 0.5
(4) γ = 0.1, noise = 0
Exercise 1 Solution

(a) Prefer close exit (+1), risking the cliff (-10) --- (4) γ = 0.1, noise = 0
Exercise 1 Solution

(b) Prefer close exit (+1), avoiding the cliff (-10) --- (1) γ = 0.1, noise = 0.5
Exercise 1 Solution

(c) Prefer distant exit (+10), risking the cliff (-10) --- (2) γ = 0.99, noise = 0
Exercise 1 Solution

(d) Prefer distant exit (+10), avoid the cliff (-10) --- (3) γ = 0.99, noise = 0.5
Q-Values
Q*(s, a) = expected utility starting in s, taking action a, and (thereafter)
acting optimally

Bellman Equation:

    Q^*(s,a) = \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \big)

Q-Value Iteration:

    Q^*_{k+1}(s,a) \leftarrow \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma \max_{a'} Q^*_k(s',a') \big)
Q-Value Iteration
Q^*_{k+1}(s,a) \leftarrow \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma \max_{a'} Q^*_k(s',a') \big)

k = 100

Noise = 0.2
Discount = 0.9
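Here is a tabular Q-value iteration sketch under the same assumed representation as the earlier value iteration example (names and defaults are illustrative):

```python
def q_value_iteration(states, actions, P, R, gamma=0.9, H=100):
    """Tabular Q-value iteration sketch: returns Q_H with Q[(s, a)] ~ Q*(s, a)."""
    Q = {(s, a): 0.0 for s in states for a in actions}   # Q_0* = 0
    for k in range(H):
        Q_new = {}
        for s in states:
            for a in actions:
                # back-up: expected reward plus discounted value of acting greedily next
                Q_new[(s, a)] = sum(
                    p * (R.get((s, a, s2), 0.0)
                         + gamma * max(Q[(s2, a2)] for a2 in actions))
                    for s2, p in P.get((s, a), []))
        Q = Q_new
    return Q

# Greedy action at a state s:  max(actions, key=lambda a: Q[(s, a)])
```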
Outline for This Lecture
n Motivation
n Markov Decision Processes (MDPs)
n Exact Solution Methods
    n Value Iteration
    n Policy Iteration
n Maximum Entropy Formulation

(For now: small, discrete state-action spaces, as they are simpler to get the main concepts across.
We will consider large state spaces in the next lecture!)
Policy Evaluation
n Recall value iteration:
        V_k^*(s) \leftarrow \max_a \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V_{k-1}^*(s') \big)

n Policy evaluation for a given policy \pi(s):
        V_k^\pi(s) \leftarrow \sum_{s'} P(s'|s,\pi(s)) \big( R(s,\pi(s),s') + \gamma V_{k-1}^\pi(s') \big)

At convergence:
        \forall s: \; V^\pi(s) = \sum_{s'} P(s'|s,\pi(s)) \big( R(s,\pi(s),s') + \gamma V^\pi(s') \big)
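A sketch of iterative policy evaluation for a deterministic policy `pi: state -> action`, again under the assumed tabular representation used in the earlier examples:

```python
def policy_evaluation(states, pi, P, R, gamma=0.9, eps=1e-8):
    """Iteratively evaluate a deterministic policy pi until the values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        # one sweep of the fixed-policy Bellman back-up
        V_new = {s: sum(p * (R.get((s, pi[s], s2), 0.0) + gamma * V[s2])
                        for s2, p in P.get((s, pi[s]), []))
                 for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < eps:
            return V_new
        V = V_new
```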
Exercise 2

Consider a stochastic policy \pi(a|s), where \pi(a|s) is the probability of taking
action a when in state s. Which of the following is the correct update to perform
policy evaluation for this stochastic policy?

1.  V_{k+1}^{\pi}(s) \leftarrow \max_a \sum_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V_k^{\pi}(s') \big)

2.  V_{k+1}^{\pi}(s) \leftarrow \sum_{s'} \sum_a \pi(a|s) \, P(s'|s,a) \big( R(s,a,s') + \gamma V_k^{\pi}(s') \big)

3.  V_{k+1}^{\pi}(s) \leftarrow \sum_a \pi(a|s) \max_{s'} P(s'|s,a) \big( R(s,a,s') + \gamma V_k^{\pi}(s') \big)
Policy Iteration
One iteration of policy iteration:
n Policy evaluation for current policy \pi_k: iterate until convergence
        V_{i+1}^{\pi_k}(s) \leftarrow \sum_{s'} P(s'|s,\pi_k(s)) \big[ R(s,\pi_k(s),s') + \gamma V_i^{\pi_k}(s') \big]

n Policy improvement: find the best action according to a one-step look-ahead
        \pi_{k+1}(s) \leftarrow \arg\max_a \sum_{s'} P(s'|s,a) \big[ R(s,a,s') + \gamma V^{\pi_k}(s') \big]

n Repeat until the policy converges

n At convergence: optimal policy; and it converges faster than value iteration under some conditions
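Putting the two steps together, here is a policy iteration sketch. It reuses the `policy_evaluation` sketch above and the same assumed tabular representation; all names are illustrative.

```python
def policy_iteration(states, actions, P, R, gamma=0.9):
    """Alternate policy evaluation and greedy policy improvement until stable."""
    pi = {s: actions[0] for s in states}          # arbitrary initial policy
    while True:
        V = policy_evaluation(states, pi, P, R, gamma)   # evaluate current policy
        stable = True
        for s in states:
            # one-step look-ahead: pick the action with the highest expected value
            best_a = max(actions,
                         key=lambda a: sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                                           for s2, p in P.get((s, a), [])))
            if best_a != pi[s]:
                pi[s], stable = best_a, False
        if stable:
            return pi, V
```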
Policy Iteration Guarantees
Policy Iteration iterates over:

        V_{i+1}^{\pi_k}(s) \leftarrow \sum_{s'} P(s'|s,\pi_k(s)) \big[ R(s,\pi_k(s),s') + \gamma V_i^{\pi_k}(s') \big]

        \pi_{k+1}(s) \leftarrow \arg\max_a \sum_{s'} P(s'|s,a) \big[ R(s,a,s') + \gamma V^{\pi_k}(s') \big]

Theorem. Policy iteration is guaranteed to converge, and at convergence the current policy
and its value function are the optimal policy and the optimal value function!

Proof sketch:
(1) Guaranteed to converge: in every step the policy improves. This means that a given policy can be
encountered at most once, so after we have iterated as many times as there are distinct
policies, i.e., (number of actions)^(number of states), we must be done and hence have converged.
(2) Optimal at convergence: by definition of convergence, at convergence \pi_{k+1}(s) = \pi_k(s) for all states s. This
means \forall s: \; V^{\pi_k}(s) = \max_a \sum_{s'} P(s'|s,a) \big[ R(s,a,s') + \gamma V^{\pi_k}(s') \big].
Hence V^{\pi_k} satisfies the Bellman equation, which means V^{\pi_k} is equal to the optimal value function V*.
Outline for This Lecture
n Motivation
n Markov Decision Processes (MDPs)
n Exact Solution Methods
    n Value Iteration
    n Policy Iteration
n Maximum Entropy Formulation

(For now: small, discrete state-action spaces, as they are simpler to get the main concepts across.
We will consider large state spaces in the next lecture!)
What if we could find a distribution over near-optimal solutions?
n More robust policy: if the environment changes, the distribution over near-optimal
solutions might still have some good ones for the new situation

n More robust learning: if we can retain a distribution over near-optimal solutions,
our agent will collect more interesting exploratory data during learning
Entropy
n Entropy = measure of uncertainty over random variable X
          = number of bits required to encode X (on average):  \mathcal{H}(X) = -\sum_x p(x) \log_2 p(x)
Entropy
E.g. for a binary random variable X with P(X = 1) = p:
    \mathcal{H}(X) = -p \log_2 p - (1-p) \log_2 (1-p), which is maximized at p = 0.5
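A quick numerical illustration of this definition (the example distributions are arbitrary):

```python
import math

def entropy(p):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin is maximally uncertain
print(entropy([0.9, 0.1]))   # ~0.47 bits: a biased coin is more predictable
print(entropy([1.0, 0.0]))   # 0.0 bits: no uncertainty at all
```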
Maximum Entropy MDP
n Regular formulation:  \max_\pi \; \mathbb{E}\big[ \sum_t \gamma^t r_t \big]

n Max-ent formulation:  \max_\pi \; \mathbb{E}\big[ \sum_t \gamma^t \big( r_t + \beta \, \mathcal{H}(\pi(\cdot \mid s_t)) \big) \big]
Max-ent Value Iteration
n Yes, we’ll cover soon, but we first need intermezzo on
constrained optimization…
Constrained Optimization
n Original problem:  \max_x f(x) \quad \text{s.t.} \quad g(x) = 0

n Lagrangian:  \mathcal{L}(x, \lambda) = f(x) + \lambda \, g(x)

n At optimum:  \nabla_x \mathcal{L} = \nabla_x f(x) + \lambda \nabla_x g(x) = 0 \quad \text{and} \quad \nabla_\lambda \mathcal{L} = g(x) = 0
Max-ent for 1-step problem
Max-ent for 1-step problem

= softmax
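The formulas on these slides did not survive extraction; the following is a hedged reconstruction of the standard 1-step max-ent derivation, with temperature β assumed:

```latex
\max_{\pi} \; \sum_a \pi(a)\, r(a) + \beta \,\mathcal{H}(\pi)
\quad \text{s.t.} \quad \sum_a \pi(a) = 1 .

\text{Lagrangian: } \mathcal{L}(\pi,\lambda)
   = \sum_a \pi(a)\, r(a) - \beta \sum_a \pi(a) \log \pi(a)
     + \lambda \Big( \sum_a \pi(a) - 1 \Big).

\frac{\partial \mathcal{L}}{\partial \pi(a)} = r(a) - \beta \log \pi(a) - \beta + \lambda = 0
\;\;\Rightarrow\;\;
\pi(a) = \frac{\exp\!\big(r(a)/\beta\big)}{\sum_{a'} \exp\!\big(r(a')/\beta\big)},
\qquad
\text{optimal value} = \beta \log \sum_a \exp\!\big(r(a)/\beta\big).
```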
Max-ent Value Iteration

= 1-step problem (with Q instead of r), so we can directly transcribe solution:
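The transcribed solution is likewise missing from the extracted text; below is a hedged Python sketch of what this transcription is generally taken to give — soft (max-ent) value iteration with temperature β — under the same assumed tabular representation as the earlier sketches.

```python
import math

def soft_value_iteration(states, actions, P, R, gamma=0.9, beta=1.0, H=100):
    """Max-ent value iteration sketch: a soft-max (log-sum-exp) replaces the hard max."""
    def q_backup(V):
        return {(s, a): sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                            for s2, p in P.get((s, a), []))
                for s in states for a in actions}

    V = {s: 0.0 for s in states}
    for _ in range(H):
        Q = q_backup(V)
        # V(s) = beta * log sum_a exp(Q(s, a) / beta): the "soft" maximum over actions
        V = {s: beta * math.log(sum(math.exp(Q[(s, a)] / beta) for a in actions))
             for s in states}

    # Max-ent policy: pi(a|s) proportional to exp(Q(s, a) / beta)
    Q = q_backup(V)
    pi = {}
    for s in states:
        z = sum(math.exp(Q[(s, a)] / beta) for a in actions)
        for a in actions:
            pi[(s, a)] = math.exp(Q[(s, a)] / beta) / z
    return V, pi
```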


Outline for This Lecture
n Motivation
n Markov Decision Processes (MDPs)
n Exact Solution Methods
    n Value Iteration
    n Policy Iteration
n Maximum Entropy Formulation

(For now: small, discrete state-action spaces, as they are simpler to get the main concepts across.
We will consider large state spaces in the next lecture!)
