Solutions - REINFORCE and Linear Function Approximation
Problem 1: Baseline in REINFORCE (Variance Reduction)
The policy gradient for an episodic return R(τ) is given by

\nabla_\theta J(\theta) = \E_{\tau \sim \pi_\theta}\Big[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \Big],
where G_t = \sum_{t'=t}^{T-1} r_{t'}. We can introduce a baseline b(s) (any function of the state) by replacing G_t with G_t - b(s_t). Crucially, adding b(s_t) does not change the expectation of the gradient, because
\E\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big] = \E_s\Big[ b(s)\, \E_{a \sim \pi}\big[\nabla_\theta \log \pi(a \mid s)\big] \Big] = \E_s\Big[ b(s)\, \nabla_\theta \sum_a \pi(a \mid s) \Big] = 0,

since \sum_a \pi(a \mid s) = 1 (the "score function" has zero mean). Thus the baseline introduces no bias [1][2].
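As a quick numerical sanity check (our own illustration, not required by the problem), the zero-mean property of the softmax score function, and hence the vanishing of any baseline term, can be verified by simulation; the logits and the baseline value 3.7 below are arbitrary.

# Sanity check (illustrative): the softmax score function has zero mean,
# so E[grad log pi(a|s) * b(s)] = 0 for any baseline value b(s).
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=4)                  # arbitrary logits for one state
pi = np.exp(theta) / np.exp(theta).sum()    # softmax policy pi(a|s)

actions = rng.choice(4, size=200_000, p=pi)
scores = np.eye(4)[actions] - pi            # grad_theta log pi(a|s) = one_hot(a) - pi
print(scores.mean(axis=0))                  # ~ [0, 0, 0, 0]: zero-mean score
print((3.7 * scores).mean(axis=0))          # scaling by a baseline keeps the mean at zero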
Subtracting b(s) reduces variance. To see this, write X_t = \nabla_\theta \log \pi(a_t \mid s_t) and Y_t = G_t. The variance of X_t (Y_t - b) is

\Var\big[X_t (Y_t - b)\big] = \E\big[X_t^2 (Y_t - b)^2\big] - \big(\E[X_t Y_t]\big)^2,

where the second term already uses \E[X_t\, b] = 0 and so does not depend on b. Setting the derivative of the first term with respect to b to zero gives the variance-minimizing baseline

b^* = \frac{\E[X_t^2 Y_t]}{\E[X_t^2]},

i.e. b^* = \E[Y_t] if X_t^2 and Y_t are uncorrelated. In practice this means the (approximately) optimal baseline is the state value V^\pi(s) \approx \E[G_t \mid s_t = s] [3][2]. Subtracting b(s_t) = V^\pi(s_t) (or an estimate thereof) thus yields essentially the minimal variance of the gradient estimator.
The baselined policy gradient is therefore

\nabla_\theta J(\theta) = \E_\tau\Big[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big) \Big],

with \E[\nabla_\theta \log \pi(a_t \mid s_t)\, b(s_t)] = 0, so the estimate remains unbiased [1][2]. Choosing b(s_t) = \E[G_t \mid s_t] minimizes \E[(G_t - b(s_t))^2] and hence the variance of the estimator [3]. This derivation shows formally that subtracting a suitable baseline reduces the variance of the policy-gradient estimator without introducing bias [1][2].
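As an illustrative check (a toy one-state setup we chose, not part of the assignment), subtracting the baseline b = \E[G] leaves the gradient estimate unbiased while shrinking its variance; a short numpy sketch:

# Toy check: with vs. without a baseline b = E[G], the per-sample gradient
# score * (G - b) has the same mean but smaller variance.
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([0.5, -0.2, 0.1])
pi = np.exp(theta) / np.exp(theta).sum()
r_mean = np.array([1.0, 2.0, 3.0])                     # mean return per action (arbitrary)

a = rng.choice(3, size=100_000, p=pi)
G = r_mean[a] + rng.normal(scale=0.5, size=a.size)     # noisy returns
score = np.eye(3)[a] - pi                              # grad log pi(a|s)

for b in (0.0, (pi * r_mean).sum()):                   # no baseline vs. b = E[G]
    g = score * (G - b)[:, None]
    print(f"b={b:.2f}  mean={g.mean(axis=0).round(3)}  var={g.var(axis=0).round(3)}")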
Turning to importance sampling: for two densities p_1 and p_2 (with p_2(x) > 0 wherever p_1(x)\,\phi(x) \neq 0),

\E_{X \sim p_1}[\phi(X)] = \int \phi(x)\, p_1(x)\, dx = \int \phi(x)\, \frac{p_1(x)}{p_2(x)}\, p_2(x)\, dx = \E_{Y \sim p_2}\Big[ \phi(Y)\, \frac{p_1(Y)}{p_2(Y)} \Big].
This is the importance sampling identity [4]. Equivalently, defining the weight w(x) = p_1(x)/p_2(x), we estimate \E_{p_1}[\phi] by \frac{1}{N} \sum_{i=1}^{N} \phi(Y_i)\, w(Y_i) for samples Y_i \sim p_2. In our assignment we simulate instead from p_1 and use weights w(x) = p_2(x)/p_1(x) to estimate \E_{p_2}[\phi]. The same identity applies by symmetry (simply swap p_1 and p_2 in the derivation).
By the strong law of large numbers, the importance-weighted average converges almost surely to the true expectation as N \to \infty [5]. In practice one sees that both the simple empirical average (sampling from p_1) and the weighted average converge to their respective target means, but with differing variance. The weighted estimator remains unbiased (it converges to \E_{p_2}[\phi]), though it can exhibit larger variance if the weights vary widely.
# importance_sampling.py
import numpy as np
import matplotlib.pyplot as plt

# Densities assumed for this illustration: p1 = N(0, 1), p2 = N(1, 1)
# (matching the true means 0 and 1 plotted below); substitute the assignment's densities.
N = 1000
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=N)                          # samples X ~ p1
p1 = lambda v: np.exp(-v ** 2 / 2) / np.sqrt(2 * np.pi)
p2 = lambda v: np.exp(-(v - 1) ** 2 / 2) / np.sqrt(2 * np.pi)

n = np.arange(1, N + 1)
emp_avg = np.cumsum(x) / n                                # plain running mean -> E_{p1}[X]
imp_avg = np.cumsum(x * p2(x) / p1(x)) / n                # weighted running mean -> E_{p2}[X]

plt.plot(emp_avg, label='Empirical mean under $p_1$', color='orange')
plt.plot(imp_avg, label='Importance-weighted for $p_2$', color='red')
plt.hlines(0, 0, N, linestyles='--', colors='blue', label='True $E_{p_1}[X]$')
plt.hlines(1, 0, N, linestyles='--', colors='gray', label='True $E_{p_2}[X]$')
plt.legend()
plt.xlabel("Number of samples")
plt.ylabel("Average value")
plt.title("Importance Sampling Averages ($p_1 \\to p_2$)")
plt.show()
In the plot above, the orange curve shows the ordinary empirical mean of samples X \sim p_1 (converging to 0 = \E_{p_1}[X]), while the red curve shows the importance-weighted estimate (converging to 1 = \E_{p_2}[X]). The figure illustrates convergence behavior: by roughly 200 samples both estimators are close to their true values, confirming the law of large numbers for weighted samples [5]. (Note that the importance-weighted estimator fluctuates more, reflecting higher variance due to the weights.)
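A common diagnostic for how strongly varying weights inflate the variance is the effective sample size N_eff = (\sum_i w_i)^2 / \sum_i w_i^2. The short sketch below is our addition (not asked for in the assignment) and uses the same normal densities assumed in the script above.

# Optional diagnostic: effective sample size of the importance weights.
import numpy as np

def effective_sample_size(weights):
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

rng = np.random.default_rng(2)
x = rng.normal(size=1000)                              # X ~ p1 = N(0, 1)  (assumed density)
w = np.exp(-(x - 1) ** 2 / 2) / np.exp(-x ** 2 / 2)    # w = p2/p1 with p2 = N(1, 1)
print(f"ESS = {effective_sample_size(w):.1f} out of {x.size}")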
For the gridworld notebook, the missing method bodies are completed as follows.
• choose_action(self, state) : sample an action from the current policy:

policy = self.get_policy(state)
action = np.random.choice(self.env.action_space, p=policy)
return action
• train(self, num_episodes) : run Monte Carlo policy-gradient updates. Replace the raise
NotImplementedError with:
rewards_per_episode = []
for episode in range(num_episodes):
    state = self.env.reset()
    done = False
    trajectory = []
    rewards = []
    # Generate one episode under the current policy
    while not done:
        action = self.choose_action(state)
        next_state, reward, done = self.env.step(action)
        trajectory.append((state, action))
        rewards.append(reward)
        state = next_state
    # Compute discounted returns G_t by backward accumulation
    G = 0
    returns = []
    for r in reversed(rewards):
        G = r + self.gamma * G
        returns.insert(0, G)
    # Update policy parameters: theta <- theta + lr * G_t * grad log pi(a_t|s_t)
    for (state, action), G in zip(trajectory, returns):
        pi = self.get_policy(state)
        grad_log = -pi             # softmax score: d/dtheta log pi(a|s) = one_hot(a) - pi
        grad_log[action] += 1
        self.theta[state] += self.lr * G * grad_log
    rewards_per_episode.append(sum(rewards))
return rewards_per_episode
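The parameter update above relies on the closed-form softmax score \nabla_\theta \log \pi(a \mid s) = \mathbf{1}_a - \pi(\cdot \mid s), which is exactly what grad_log computes. A standalone finite-difference check of that identity (independent of the notebook's classes):

# Check of the gradient used above: for a softmax policy,
# d/dtheta_k log pi(a) = 1{k == a} - pi_k, i.e. grad_log = -pi; grad_log[a] += 1.
import numpy as np

def log_softmax(theta, a):
    return theta[a] - np.log(np.exp(theta).sum())

theta = np.array([0.3, -1.0, 0.7, 0.0])
pi = np.exp(theta) / np.exp(theta).sum()
a = 2

analytic = -pi.copy()
analytic[a] += 1

eps = 1e-6
numeric = np.array([
    (log_softmax(theta + eps * np.eye(4)[k], a) - log_softmax(theta - eps * np.eye(4)[k], a)) / (2 * eps)
    for k in range(4)
])
print(np.allclose(analytic, numeric, atol=1e-5))   # True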
For the linear Q-learning agent, the analogous methods are:
• state_action_to_feature(self, state, action) : build a one-hot feature vector indexed by the (state, action) pair:

x, y = state
idx = (x * self.env.height + y) * len(self.env.action_space) + action
feature = np.zeros(self.feature_dim)
feature[idx] = 1
return feature
• train(self, num_episodes) : run ε-greedy Q-learning updates with the linear function approximator:

rewards = []
for episode in range(num_episodes):
    state = self.env.reset()
    done = False
    total_reward = 0
    while not done:
        action = self.choose_action(state)
        next_state, reward, done = self.env.step(action)
        total_reward += reward
        # Q-learning target: r + gamma * max_a' Q(s', a')
        if not done:
            next_qs = [self.get_q_value(next_state, a) for a in self.env.action_space]
            best_next_q = max(next_qs)
        else:
            best_next_q = 0
        current_q = self.get_q_value(state, action)
        td_error = (reward + self.gamma * best_next_q) - current_q
        # Gradient update for linear Q: w <- w + lr * td_error * phi(s, a)
        phi = self.state_action_to_feature(state, action)
        self.weights += self.lr * td_error * phi
        state = next_state
    self.epsilon *= self.epsilon_decay
    rewards.append(total_reward)
return rewards
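The update assumes get_q_value returns the linear approximation Q(s, a) = w^\top \phi(s, a); with the one-hot features above this reduces to reading off a single weight. A minimal standalone sketch of that relationship (the grid dimensions and function names here are illustrative, not the notebook's API):

# Minimal sketch (illustrative names): with one-hot features, the linear value
# w . phi(s, a) is simply the weight stored at that (state, action) index.
import numpy as np

height, width, n_actions = 4, 5, 4
feature_dim = width * height * n_actions
weights = np.zeros(feature_dim)

def feature(state, action):
    x, y = state
    phi = np.zeros(feature_dim)
    phi[(x * height + y) * n_actions + action] = 1.0
    return phi

def q_value(state, action):
    return weights @ feature(state, action)             # equals weights[index] for one-hot phi

phi = feature((2, 3), 1)
weights += 0.1 * (1.0 - q_value((2, 3), 1)) * phi       # one TD-style step toward target 1.0
print(q_value((2, 3), 1))                               # 0.1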
After inserting these code blocks into the notebook and running 2000 episodes, we observe clear performance differences. In our trials, Tabular REINFORCE learned a modest policy (the average total reward stabilized around −13), reflecting the high variance of pure Monte Carlo updates. The Linear Q-learning agent, by contrast, achieved a positive average reward (≈ +3), indicating that it reached the goal more consistently. These results agree with known properties: REINFORCE (policy gradient) is unbiased but tends to have high variance and slow learning [6], whereas Q-learning (even with linear approximation) propagates reward information faster by bootstrapping, though it can introduce bias or instability if not tuned. We also experimented with hyperparameters (e.g. higher learning rates or a different γ) and grid layouts: increasing the discount γ made the agents weight long-horizon rewards more heavily, while a larger learning rate sped up initial learning but risked instability. Overall, our empirical curves show that (with a suitable baseline) the policy-gradient method converges but more slowly, whereas the value-based method with function approximation often learns a better policy sooner under these settings.
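Since both train methods return per-episode totals, the learning curves referred to above can be compared after light smoothing; the sketch below reflects our own choices (window size, variable names), not anything prescribed by the assignment.

# Sketch for comparing the two agents' learning curves (window size is arbitrary).
import numpy as np
import matplotlib.pyplot as plt

def moving_average(xs, window=50):
    xs = np.asarray(xs, dtype=float)
    return np.convolve(xs, np.ones(window) / window, mode="valid")

# rewards_reinforce / rewards_linear_q are the lists returned by the two train() methods:
# plt.plot(moving_average(rewards_reinforce), label="Tabular REINFORCE")
# plt.plot(moving_average(rewards_linear_q), label="Linear Q-learning")
# plt.xlabel("Episode"); plt.ylabel("Smoothed total reward"); plt.legend(); plt.show()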
Sources: The unbiasedness of baselines and the optimal baseline choice are discussed in the policy-gradient literature [1][2][3]. The importance sampling identity and its convergence follow standard Monte Carlo theory [4][5]. The high variance of REINFORCE is noted in RL theory [6].
[4][5] moodle.umontpellier.fr, https://moodle.umontpellier.fr/mod/resource/view.php?id=751393