
CSC321 Lecture 22: Q-Learning

Roger Grosse



Overview

Second of 3 lectures on reinforcement learning
Last time: policy gradient (e.g. REINFORCE)
Optimize a policy directly, don't represent anything about the environment
Today: Q-learning
Learn an action-value function that predicts future returns
Next time: AlphaGo uses both a policy network and a value network
This lecture is review if you've taken 411
This lecture has more new content than I'd intended. If there is an exam question about this lecture or the next one, it won't be a hard question.



Overview

Agent interacts with an environment, which we treat as a black box
Your RL code accesses it only through an API since it's external to the agent
I.e., you're not "allowed" to inspect the transition probabilities, reward distributions, etc.

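To make the black-box interaction concrete, here is a minimal Python sketch of an agent loop against a hypothetical environment object with reset() and step() methods; the ToyEnv class, its corridor dynamics, and the uniform-random action choice are all invented for illustration, not part of the lecture.

import random

class ToyEnv:
    """Hypothetical five-state corridor: reward -1 per step, episode ends at state 4."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):                     # action: 0 = left, 1 = right
        move = 1 if action == 1 else -1
        self.state = max(0, min(4, self.state + move))
        done = (self.state == 4)
        return self.state, -1.0, done           # (observation, reward, done flag)

# The agent only ever sees observations and rewards through this API;
# it never inspects the transition probabilities or reward function directly.
env = ToyEnv()
obs = env.reset()
done = False
while not done:
    action = random.choice([0, 1])              # placeholder policy: act uniformly at random
    obs, reward, done = env.step(action)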


Recap: Markov Decision Processes

The environment is represented as a Markov decision process (MDP) M.
Markov assumption: all relevant information is encapsulated in the current state
Components of an MDP:
initial state distribution p(s_0)
transition distribution p(s_{t+1} | s_t, a_t)
reward function r(s_t, a_t)
policy π_θ(a_t | s_t) parameterized by θ
Assume a fully observable environment, i.e. s_t can be observed directly



Finite and Infinite Horizon

Last time: finite horizon MDPs
Fixed number of steps T per episode
Maximize expected return R = E_{p(τ)}[r(τ)]
Now: more convenient to assume infinite horizon
We can't sum infinitely many rewards, so we need to discount them: $100 a year from now is worth less than $100 today
Discounted return:

G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ⋯

Want to choose an action to maximize expected discounted return
The parameter γ < 1 is called the discount factor
small γ = myopic
large γ = farsighted

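As a quick sanity check of the definition, here is a small Python sketch that computes a discounted return from a finite reward sequence; the reward values and γ below are made up for illustration.

def discounted_return(rewards, gamma):
    """G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):     # work backwards, using G_t = r_t + γ G_{t+1}
        g = r + gamma * g
    return g

# Made-up example: three steps of reward 1, then a final reward of 10.
print(discounted_return([1.0, 1.0, 1.0, 10.0], gamma=0.9))   # 1 + 0.9 + 0.81 + 7.29 = 10.0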


Value Function

Value function V^π(s) of a state s under policy π: the expected discounted return if we start in s and follow π

V^π(s) = E[G_t | s_t = s] = E[ Σ_{i=0}^∞ γ^i r_{t+i} | s_t = s ]

Computing the value function is generally impractical, but we can try to approximate (learn) it
The benefit is credit assignment: see directly how an action affects future returns rather than wait for rollouts



Value Function

Rewards: -1 per time step
Undiscounted (γ = 1)
Actions: N, E, S, W
State: current location
[Figure: value function for the gridworld example]


Action-Value Function
Can we use a value function to choose actions?

arg max_a r(s_t, a) + γ E_{p(s_{t+1} | s_t, a)}[ V^π(s_{t+1}) ]

Problem: this requires taking the expectation with respect to the environment's dynamics, which we don't have direct access to!
Instead learn an action-value function, or Q-function: expected returns if you take action a and then follow your policy

Q^π(s, a) = E[G_t | s_t = s, a_t = a]

Relationship:

V^π(s) = Σ_a π(a | s) Q^π(s, a)

Optimal action:

arg max_a Q^π(s, a)
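
As a tiny numerical illustration of the relationship above, the Python sketch below recovers V^π(s) and the greedy action for a single state with three actions; the Q-values and policy probabilities are invented for illustration.

import numpy as np

q_s = np.array([1.0, 2.5, 0.5])       # made-up Q^π(s, a) for actions a = 0, 1, 2
pi_s = np.array([0.2, 0.5, 0.3])      # made-up π(a | s), sums to 1

v_s = np.dot(pi_s, q_s)               # V^π(s) = Σ_a π(a|s) Q^π(s,a) = 1.6
greedy_action = int(np.argmax(q_s))   # arg max_a Q^π(s, a) = 1
print(v_s, greedy_action)
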
Bellman Equation

The Bellman Equation is a recursive formula for the action-value function:

Q^π(s, a) = r(s, a) + γ E_{p(s' | s, a) π(a' | s')}[ Q^π(s', a') ]

There are various Bellman equations, and most RL algorithms are based on repeatedly applying one of them.

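One way to see where this comes from (a short derivation sketch, not spelled out on the slide): the discounted return satisfies the recursion G_t = r_t + γ G_{t+1}, and taking expectations of both sides under the dynamics and the policy gives the Bellman equation.

Q^π(s, a) = E[G_t | s_t = s, a_t = a]
          = E[r_t + γ G_{t+1} | s_t = s, a_t = a]
          = r(s, a) + γ E_{s' ~ p(· | s, a), a' ~ π(· | s')}[ Q^π(s', a') ]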


Optimal Bellman Equation

The optimal policy π^* is the one that maximizes the expected discounted return, and the optimal action-value function Q^* is the action-value function for π^*.
The Optimal Bellman Equation gives a recursive formula for Q^*:

Q^*(s, a) = r(s, a) + γ E_{p(s_{t+1} | s_t, a_t)}[ max_{a'} Q^*(s_{t+1}, a') | s_t = s, a_t = a ]

This system of equations characterizes the optimal action-value function. So maybe we can approximate Q^* by trying to solve the optimal Bellman equation!

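To make "solving the optimal Bellman equation" concrete: when the dynamics are known, repeatedly applying the right-hand side as an update converges to Q^*. Below is a minimal Python sketch of that fixed-point iteration on a made-up two-state, two-action MDP; the transition and reward numbers are invented, and Q-learning (next) does the analogous thing from sampled transitions instead of known dynamics.

import numpy as np

# Made-up MDP: P[s, a, s'] = transition probability, R[s, a] = reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(1000):
    # Apply the optimal Bellman equation as an update:
    # Q(s, a) <- r(s, a) + γ Σ_s' p(s' | s, a) max_a' Q(s', a')
    Q = R + gamma * P @ Q.max(axis=1)
print(Q)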


Q-Learning

Let Q be an action-value function which hopefully approximates Q^*.
The Bellman error is the update to our expected return when we observe the next state s':

r(s_t, a_t) + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)

where r(s_t, a_t) + γ max_a Q(s_{t+1}, a) is the quantity inside the expectation on the RHS of the Bellman equation.
The Bellman equation says the Bellman error is 0 in expectation
Q-learning is an algorithm that repeatedly adjusts Q to minimize the Bellman error
Each time we sample consecutive states and actions (s_t, a_t, s_{t+1}):

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r(s_t, a_t) + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

where the term in brackets is the Bellman error.

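A minimal Python sketch of this update, assuming a tabular Q stored as a 2-D array indexed by (state, action); the array sizes, learning rate, and example transition are placeholders for illustration.

import numpy as np

n_states, n_actions = 10, 4           # placeholder sizes
Q = np.zeros((n_states, n_actions))   # tabular Q: one entry per (state, action) pair
alpha, gamma = 0.1, 0.99

def q_update(s, a, r, s_next, done):
    """One Q-learning update from a single transition (s, a, r, s_next)."""
    # Bootstrapped target r + γ max_a' Q(s', a'); no bootstrapping at terminal states.
    target = r if done else r + gamma * Q[s_next].max()
    bellman_error = target - Q[s, a]
    Q[s, a] += alpha * bellman_error

# Made-up example transition: from state 3 take action 1, get reward -1, land in state 4.
q_update(s=3, a=1, r=-1.0, s_next=4, done=False)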


Exploration-Exploitation Tradeoff

Notice: Q-learning only learns about the states and actions it visits.
Exploration-exploitation tradeoff: the agent should sometimes pick suboptimal actions in order to visit new states and actions.
Simple solution: ε-greedy policy
With probability 1 − ε, choose the optimal action according to Q
With probability ε, choose a random action
Believe it or not, ε-greedy is still used today!

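A sketch of ε-greedy action selection in Python, written against a tabular Q array like the one in the earlier sketch; the default value of ε is a placeholder.

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon=0.1):
    """With probability ε pick a random action, otherwise the greedy one under Q."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: uniform over actions
    return int(np.argmax(Q[s]))                # exploit: arg max_a Q(s, a)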


Exploration-Exploitation Tradeoff

You can't use an ε-greedy strategy with policy gradient because it's an on-policy algorithm: the agent can only learn about the policy it's actually following.
Q-learning is an off-policy algorithm: the agent can learn Q regardless of whether it's actually following the optimal policy
Hence, Q-learning is typically done with an ε-greedy policy, or some other policy that encourages exploration.




Function Approximation

So far, we've been assuming a tabular representation of Q: one entry for every state/action pair.
This is impractical to store for all but the simplest problems, and doesn't share structure between related states.
Solution: approximate Q using a parameterized function, e.g.
linear function approximation: Q(s, a) = w^T ψ(s, a)
compute Q with a neural net
Update Q using backprop:

t ← r(s_t, a_t) + γ max_a Q(s_{t+1}, a)
θ ← θ + α (t − Q(s, a)) ∂Q/∂θ

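A minimal Python sketch of this update for the linear case Q(s, a) = w^T ψ(s, a), where ∂Q/∂w = ψ(s, a). The feature dimension, step size, and the convention of passing feature vectors directly are assumptions made for illustration.

import numpy as np

d = 8                       # feature dimension (placeholder)
w = np.zeros(d)             # parameters of the linear Q-function: Q(s, a) = w^T ψ(s, a)
alpha, gamma = 0.01, 0.99

def q(phi_sa):
    return w @ phi_sa                     # Q(s, a) = w^T ψ(s, a)

def q_update(phi_sa, r, phi_next_all):
    """One Q-learning step with linear function approximation.
    phi_sa: ψ(s_t, a_t); phi_next_all: list of ψ(s_{t+1}, a) for every action a."""
    global w
    target = r + gamma * max(q(phi) for phi in phi_next_all)   # t = r + γ max_a Q(s_{t+1}, a)
    w = w + alpha * (target - q(phi_sa)) * phi_sa              # gradient ∂Q/∂w = ψ(s, a)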


Function Approximation
Approximating Q with a neural net is a decades-old idea, but DeepMind got it to work really well on Atari games in 2013 ("deep Q-learning")
They used a very small network by today's standards
Main technical innovation: store experience into a replay buffer, and perform Q-learning using stored experience
Gains sample efficiency by separating environment interaction from optimization: don't need new experience for every SGD update!
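
A minimal Python sketch of an experience replay buffer, assuming transitions are stored as (s, a, r, s', done) tuples; the capacity and batch size are placeholder values.

import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and returns random minibatches for Q-learning updates."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)    # oldest transitions are dropped automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

# Usage sketch: store every transition from environment interaction, and run Q-learning
# updates on sampled minibatches, decoupling data collection from optimization.
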
Atari

Mnih et al., Nature 2015. Human-level control through deep reinforcement learning
Network was given raw pixels as observations
Same architecture shared between all games
Assume fully observable environment, even though that's not the case
After about a day of training on a particular game, often beat "human-level" performance (number of points within 5 minutes of play)
Did very well on reactive games, poorly on ones that require planning (e.g. Montezuma's Revenge)
https://www.youtube.com/watch?v=V1eYniJ0Rnk
https://www.youtube.com/watch?v=4MlZncshy1Q



Wireheading

If rats have a lever that causes an electrode to stimulate certain "reward centers" in their brain, they'll keep pressing the lever at the expense of sleep, food, etc.
RL algorithms show this "wireheading" behavior if the reward function isn't designed carefully
https://blog.openai.com/faulty-reward-functions/



Policy Gradient vs. Q-Learning

Policy gradient and Q-learning use two very different choices of representation: policies and value functions
Advantage of both methods: don't need to model the environment
Pros/cons of policy gradient
Pro: unbiased estimate of gradient of expected return
Pro: can handle a large space of actions (since you only need to sample one)
Con: high variance updates (implies poor sample efficiency)
Con: doesn't do credit assignment
Pros/cons of Q-learning
Pro: lower variance updates, more sample efficient
Pro: does credit assignment
Con: biased updates since the Q function is approximate (drinks its own Kool-Aid)
Con: hard to handle many actions (since you need to take the max)



Actor-Critic (optional)

Actor-critic methods combine the best of both worlds
Fit both a policy network (the "actor") and a value network (the "critic")
Repeatedly update the value network to estimate V^π
Unroll for only a few steps, then compute the REINFORCE policy update using the expected returns estimated by the value network
The two networks adapt to each other, much like GAN training
Modern version: Asynchronous Advantage Actor-Critic (A3C)

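For concreteness, one common one-step version of this idea (a sketch of the update equations in my own notation, not necessarily the exact variant intended in the lecture): the critic V_φ provides an advantage estimate, the actor π_θ takes a REINFORCE-style step weighted by it, and the critic regresses toward the bootstrapped target.

A_t = r_t + γ V_φ(s_{t+1}) − V_φ(s_t)                      (advantage estimated by the critic)
θ ← θ + α A_t ∇_θ log π_θ(a_t | s_t)                        (actor update)
φ ← φ − β ∇_φ [ V_φ(s_t) − y_t ]^2, with y_t = r_t + γ V_φ(s_{t+1}) held fixed   (critic update)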
