
Deep Reinforcement Learning

MIT 6.S191
Alexander Amini
January 30, 2019
AlphaGo video

Classes of Learning Problems

Supervised Learning
Data: (x, y) — x is data, y is label
Goal: Learn a function to map x → y
Apple example: "This thing is an apple."

Unsupervised Learning
Data: x — x is data, no labels!
Goal: Learn the underlying structure
Apple example: "This thing is like the other thing."

Reinforcement Learning (our focus today)
Data: state-action pairs
Goal: Maximize future rewards over many time steps
Apple example: "Eat this thing because it will keep you alive."
Reinforcement Learning (RL): Key Concepts

Agent: takes actions.
Environment: the world in which the agent exists and operates.
Action (a_t): a move the agent can make in the environment.
Observations: of the environment after taking actions.
State (s_t): a situation which the agent perceives; taking an action changes the state to s_{t+1}.
Reward (r_t): feedback that measures the success or failure of the agent's action.

Diagram: the agent sends actions a_t to the environment; the environment returns observations in the form of state changes s_{t+1} and rewards r_t.
Total Reward:

R_t = Σ_{i=t}^{∞} r_i = r_t + r_{t+1} + ⋯ + r_{t+n} + ⋯
Discounted Total Reward (γ: discount factor):

R_t = Σ_{i=t}^{∞} γ^i r_i = γ^t r_t + γ^{t+1} r_{t+1} + ⋯ + γ^{t+n} r_{t+n} + ⋯
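As a minimal sketch (plain Python; the reward list is a made-up example), the discounted sum can be accumulated backwards. Here the current time step is treated as index 0, so the common factor γ^t in front of the slide's sum is dropped:

    def discounted_return(rewards, gamma=0.99):
        # Computes sum_k gamma^k * rewards[k], i.e. the discounted sum of rewards
        # counted from the current time step (index 0).
        R = 0.0
        for r in reversed(rewards):   # fold backwards: R_t = r_t + gamma * R_{t+1}
            R = r + gamma * R
        return R

    print(discounted_return([1.0, 0.0, 0.0, 5.0]))  # later rewards are worth less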
Defining the Q-function

R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ⋯

Total reward, R_t, is the discounted sum of all rewards obtained from time t.

Q(s_t, a_t) = E[R_t]

The Q-function captures the expected total future reward an agent in state s_t can receive by executing a certain action a_t.

How to take actions given a Q-function?

Q(s, a) = E[R_t]    (state, action)

Ultimately, the agent needs a policy π(s) to infer the best action to take in its state s.

Strategy: the policy should choose an action that maximizes future reward:

π*(s) = argmax_a Q(s, a)
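As a sketch (assuming a small, discrete action space and a hypothetical Q-table, not anything from the lecture), the greedy policy π*(s) = argmax_a Q(s, a) can be written as:

    import numpy as np

    # Hypothetical Q-table: Q[s, a] holds the estimated Q-value of action a in state s.
    Q = np.array([[0.1, 0.5, 0.2],    # state 0
                  [0.7, 0.0, 0.3]])   # state 1

    def greedy_policy(state):
        # pi*(s) = argmax_a Q(s, a): pick the action with the highest Q-value.
        return int(np.argmax(Q[state]))

    print(greedy_policy(0))  # -> 1
    print(greedy_policy(1))  # -> 0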

Deep Reinforcement Learning Algorithms

Value Learning: find Q(s, a), then act via a = argmax_a Q(s, a)
Policy Learning: find π(s) directly, then sample a ~ π(s)
Digging deeper into the Q-function
Example: Atari Breakout (two candidate plays: Middle and Side)

It can be very difficult for humans to accurately estimate Q-values.

Which (s, a) pair, A or B, has a higher Q-value?
Deep Q Networks (DQN)
How can we use deep neural networks to model Q-functions?

Diagram (left): input the state s and an action a (e.g., "move right") into a deep NN that outputs a single value Q(s, a).
Diagram (right): input only the state s into a deep NN that outputs one Q-value per possible action: Q(s, a_1), Q(s, a_2), ..., Q(s, a_n).
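A minimal sketch of the second design in PyTorch (the framework, the 4-dimensional state, and the 2 actions are illustrative assumptions, not from the lecture):

    import torch
    import torch.nn as nn

    # Small network mapping a state vector to one Q-value per action.
    q_net = nn.Sequential(
        nn.Linear(4, 128), nn.ReLU(),
        nn.Linear(128, 2),           # outputs [Q(s, a_1), Q(s, a_2)]
    )

    state = torch.randn(1, 4)        # a dummy state
    with torch.no_grad():
        q_values = q_net(state)      # shape (1, 2)
        action = q_values.argmax(dim=1).item()   # greedy action
    print(q_values, action)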
Deep Q Networks (DQN): Training
How can we use deep neural networks to model Q-functions?

Train the network that maps the state s to Q(s, a_1), ..., Q(s, a_n) by minimizing the Q-loss

ℒ = E[ (r + γ max_{a'} Q(s', a') − Q(s, a))² ]

where r + γ max_{a'} Q(s', a') is the target and Q(s, a) is the predicted value.
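The loss above can be sketched in code. This is only a minimal illustration (no replay buffer or target network, which full DQN uses), and the batch of transitions here is made of dummy tensors:

    import torch
    import torch.nn as nn

    q_net = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    gamma = 0.99

    # Dummy batch of transitions (s, a, r, s'); in practice these come from experience.
    s      = torch.randn(32, 4)
    a      = torch.randint(0, 2, (32,))
    r      = torch.randn(32)
    s_next = torch.randn(32, 4)

    # Predicted Q(s, a): select the Q-value of the action actually taken.
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # Target r + gamma * max_a' Q(s', a'); detached so gradients flow only through the prediction.
    with torch.no_grad():
        q_target = r + gamma * q_net(s_next).max(dim=1).values

    loss = nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()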
DQN Atari Results

Chart: DQN performance across Atari games, separated into games where it surpasses human-level performance and games that remain below human-level.
Downsides of Q-learning
Complexity:
• Can only model scenarios where the action space is discrete and small
• Cannot handle continuous action spaces (important: imagine you want to predict the steering wheel angle of a car!)

Flexibility:
• Cannot learn stochastic policies, since the policy is computed deterministically from the Q-function

To overcome these limitations, consider a new class of RL training algorithms: policy gradient methods.

Policy Gradient (PG): Key Idea

DQN (before): approximate Q and infer the optimal policy from it.
Diagram: state s → Deep NN → Q(s, a_1), Q(s, a_2), ..., Q(s, a_n)

Policy Gradient: directly optimize the policy!
Diagram: state s → Deep NN → π(a_1|s), π(a_2|s), ..., π(a_n|s)

The outputs form a probability distribution over actions:

Σ_{a_i ∈ A} π(a_i|s) = 1,    π(a|s) = P(taking action a | state s)
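A minimal sketch of such a policy network (the PyTorch framework, the 4-dimensional state, and the 2 actions are illustrative assumptions); the softmax layer guarantees the outputs sum to 1:

    import torch
    import torch.nn as nn

    # Policy network mapping a state to a probability distribution over actions.
    policy_net = nn.Sequential(
        nn.Linear(4, 128), nn.ReLU(),
        nn.Linear(128, 2),
        nn.Softmax(dim=-1),          # outputs pi(a_1|s), pi(a_2|s); they sum to 1
    )

    state = torch.randn(1, 4)
    probs = policy_net(state)
    action = torch.multinomial(probs, num_samples=1).item()   # sample a ~ pi(.|s)
    print(probs, probs.sum().item(), action)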
Policy Gradient (PG): Training

function REINFORCE
    Initialize θ
    for episode ~ π_θ:
        {s_t, a_t, r_t}_{t=1}^{T-1} ← episode
        for t = 1 to T-1:
            ∇ ← ∇_θ log π_θ(a_t | s_t) R_t
            θ ← θ + α ∇
    return θ

1. Run a policy for a while
2. Increase the probability of actions that lead to high rewards
3. Decrease the probability of actions that lead to low/no rewards

In the update ∇_θ log π_θ(a_t | s_t) R_t, the first factor is the log-likelihood of the action and R_t is the reward.
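A hedged, runnable sketch of this loop in PyTorch (the toy one-state environment with a made-up reward, the network sizes, and the hyperparameters are all assumptions standing in for a real episode generator):

    import torch
    import torch.nn as nn

    # Toy setup: 2 actions, a fixed 4-dim "state", and a dummy reward that favors action 1.
    policy_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
    optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-2)
    gamma = 0.99

    def run_episode(T=10):
        # Roll out the current policy; return per-step log-probs and rewards.
        log_probs, rewards = [], []
        state = torch.zeros(1, 4)
        for _ in range(T):
            dist = torch.distributions.Categorical(logits=policy_net(state))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            rewards.append(1.0 if action.item() == 1 else 0.0)   # dummy reward
        return log_probs, rewards

    for episode in range(200):
        log_probs, rewards = run_episode()
        # Discounted return R_t for every step, computed backwards.
        returns, R = [], 0.0
        for r in reversed(rewards):
            R = r + gamma * R
            returns.insert(0, R)
        returns = torch.tensor(returns)
        # REINFORCE update: increase log pi(a_t|s_t) in proportion to R_t.
        loss = -(torch.stack(log_probs).squeeze() * returns).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()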
The Game of Go
Aim: Get more board territory than your opponent.

Board size n×n | Positions 3^(n²)    | % legal | Legal positions
1×1            | 3                    | 33.33%  | 1
2×2            | 81                   | 70.37%  | 57
3×3            | 19,683               | 64.40%  | 12,675
4×4            | 43,046,721           | 56.49%  | 24,318,165
5×5            | 847,288,609,443      | 48.90%  | 414,295,148,741
9×9            | 4.434264882×10^38    | 23.44%  | 1.03919148791×10^38
13×13          | 4.300233593×10^80    | 8.66%   | 3.72497923077×10^79
19×19          | 1.740896506×10^172   | 1.20%   | 2.08168199382×10^170

There are more legal board positions than atoms in the universe.


Source: Wikipedia.
AlphaGo Beats Top Human Player at Go (2016)

1) Initial training: human data
2) Self-play and reinforcement learning → superhuman performance
3) "Intuition" about board state

Silver et al., Nature 2016.
AlphaZero: RL from Self-Play (2018)

Silver et al., Science 2018.


Questions?
