
Solutions: REINFORCE and Linear Function Approximation
Problem 1: Baseline in REINFORCE (Variance Reduction)
The policy gradient for an episodic return $R(\tau)$ is given by

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big],$$

where $G_t = \sum_{t'=t}^{T-1} r_{t'}$. We can introduce a baseline $b(s)$ (any function of the state) by replacing $G_t$ with $G_t - b(s_t)$. Crucially, subtracting $b(s_t)$ does not change the expectation of the gradient, because

$$\mathbb{E}[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)] = \mathbb{E}_s\big[b(s)\, \mathbb{E}_{a \sim \pi}[\nabla_\theta \log \pi(a \mid s)]\big] = \mathbb{E}_s\Big[b(s)\, \nabla_\theta \sum_a \pi(a \mid s)\Big] = 0,$$

since $\sum_a \pi(a \mid s) = 1$ (the "score function" has zero mean). Thus the baseline introduces no bias [1][2].

Subtracting $b(s)$ reduces variance. To see this, write $X_t = \nabla_\theta \log \pi(a_t \mid s_t)$ and $Y_t = G_t$. The variance of $X_t (Y_t - b)$ is

$$\mathrm{Var}[X_t (Y_t - b)] = \mathbb{E}[X_t^2 (Y_t - b)^2] - \big(\mathbb{E}[X_t Y_t]\big)^2,$$

since $\mathbb{E}[X_t b] = b\,\mathbb{E}[X_t] = 0$. Expanding gives

$$\mathbb{E}[X_t^2 Y_t^2] - 2b\,\mathbb{E}[X_t^2 Y_t] + b^2\,\mathbb{E}[X_t^2] - \big(\mathbb{E}[X_t Y_t]\big)^2.$$

As a function of $b$, this quadratic is minimized when

$$b^* = \frac{\mathbb{E}[X_t^2 Y_t]}{\mathbb{E}[X_t^2]},$$

i.e. $b^* = \mathbb{E}[Y_t]$ if $X_t^2$ and $Y_t$ are uncorrelated. In practice this means the optimal baseline is (approximately) the state value $V^\pi(s) \approx \mathbb{E}[G_t \mid s_t = s]$ [3][2]. Subtracting $b(s_t) = V^\pi(s_t)$ (or an estimate thereof) thus yields a near-minimal variance of the gradient estimator.

In summary, one obtains the REINFORCE with baseline update:

$$\nabla_\theta J(\theta) = \mathbb{E}_\tau\Big[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big)\Big],$$

with $\mathbb{E}[\nabla_\theta \log \pi(a_t \mid s_t)\, b(s_t)] = 0$, so the estimate remains unbiased [1][2]. Choosing $b(s_t) = \mathbb{E}[G_t \mid s_t]$ (approximately) minimizes the variance of the per-step term $\nabla_\theta \log \pi(a_t \mid s_t)\,(G_t - b(s_t))$ [3]. This derivation shows formally that subtracting a suitable baseline reduces the variance of the policy gradient estimator without introducing bias [1][2].
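
To make the argument concrete, the effect can be checked numerically. The following is a minimal sketch (not part of the notebook) using a two-action softmax "bandit" policy; the names theta, mean_reward and sample_gradients are chosen here for illustration, and b = E[R] plays the role of the value baseline.

# baseline_check.py -- illustrative check that a value baseline keeps the gradient
# unbiased while reducing its variance (toy two-action softmax policy, not the notebook's code)
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.3, -0.2])              # logits of a 2-action softmax policy (arbitrary)
pi = np.exp(theta) / np.exp(theta).sum()   # pi(a) = softmax(theta)
mean_reward = np.array([1.0, 0.2])         # E[R | a]; rewards get unit Gaussian noise

def sample_gradients(baseline, n=100_000):
    """Single-sample REINFORCE gradient estimates g = grad_theta log pi(a) * (R - b)."""
    actions = rng.choice(2, size=n, p=pi)
    rewards = mean_reward[actions] + rng.normal(0.0, 1.0, size=n)   # noisy rewards
    scores = np.eye(2)[actions] - pi                                # grad log pi(a) = onehot(a) - pi
    return scores * (rewards - baseline)[:, None]

for b in (0.0, float(pi @ mean_reward)):   # no baseline vs. b = E[R] (the "value" baseline)
    g = sample_gradients(b)
    print(f"b = {b:+.3f}   mean = {g.mean(axis=0)}   var = {g.var(axis=0)}")
# The two sample means agree (the baseline adds no bias), while the per-component
# variance is smaller with b = E[R].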

Problem 2: Importance Sampling Identity and Simulation


Let $X \sim p_1$ and $Y \sim p_2$ be (discrete or continuous) random variables with densities $p_1(x)$, $p_2(x)$, and suppose $p_2(x) > 0$ whenever $p_1(x) > 0$. For any function $\phi(x)$, we have

$$\mathbb{E}_{X \sim p_1}[\phi(X)] = \int \phi(x)\, p_1(x)\, dx = \int \phi(x)\, \frac{p_1(x)}{p_2(x)}\, p_2(x)\, dx = \mathbb{E}_{Y \sim p_2}\Big[\phi(Y)\, \frac{p_1(Y)}{p_2(Y)}\Big].$$

This is the importance sampling identity [4]. Equivalently, defining the weight $w(x) = p_1(x)/p_2(x)$, we estimate $\mathbb{E}_{p_1}[\phi]$ by $\frac{1}{N}\sum_{i=1}^{N} \phi(Y_i)\, w(Y_i)$ for samples $Y_i \sim p_2$. In our assignment we instead simulate from $p_1$ and use weights $w(x) = p_2(x)/p_1(x)$ to estimate $\mathbb{E}_{p_2}[\phi]$; the same identity applies by symmetry (simply swap $p_1$ and $p_2$ in the derivation).

By the strong law of large numbers, the importance-weighted average converges almost surely to the true expectation as $N \to \infty$ [5]. In practice one sees that both the simple empirical average (sampling from $p_1$) and the weighted average converge to their respective target means, but with differing variance. The weighted estimator remains unbiased (it converges to $\mathbb{E}_{p_2}[\phi]$), though it can exhibit larger variance if the weights vary widely.

Figure: Simulation of 200 samples from $p_1 = N(0, 1)$, estimating $\mathbb{E}[X]$ under $p_1$ vs. $\mathbb{E}[X]$ under $p_2 = N(1, 1)$ [4][5]. The orange curve is the empirical average of $X \sim p_1$ (converging to 0), and the red curve is the importance-weighted average (converging to 1). (Dashed lines show the true means.)

# importance_sampling.py
import numpy as np
import matplotlib.pyplot as plt

# Define distributions p1 ~ N(0,1), p2 ~ N(1,1) and the function f(x) = x
p1_pdf = lambda x: 1/np.sqrt(2*np.pi) * np.exp(-0.5*x**2)
p2_pdf = lambda x: 1/np.sqrt(2*np.pi) * np.exp(-0.5*(x-1)**2)
N = 200
X = np.random.normal(0, 1, size=N)   # samples from p1
w = p2_pdf(X) / p1_pdf(X)            # importance weights p2/p1
f = X                                # here f(X) = X

# Compute cumulative averages
emp_avg = np.cumsum(f) / np.arange(1, N+1)
imp_avg = np.cumsum(f * w) / np.arange(1, N+1)

# Plot empirical vs importance-weighted averages
plt.figure(figsize=(6, 4))
plt.plot(emp_avg, label='Empirical mean from $p_1$', color='orange')
plt.plot(imp_avg, label='Importance-weighted for $p_2$', color='red')
plt.hlines(0, 0, N, linestyles='--', colors='blue', label=r'True $\mathbb{E}_{p_1}[X]$')
plt.hlines(1, 0, N, linestyles='--', colors='gray', label=r'True $\mathbb{E}_{p_2}[X]$')
plt.legend()
plt.xlabel("Number of samples")
plt.ylabel("Average value")
plt.title("Importance Sampling Averages ($p_1 \\to p_2$)")
plt.show()

In the plot above, the orange curve shows the ordinary empirical mean of samples $X \sim p_1$ (converging to $0 = \mathbb{E}_{p_1}[X]$), while the red curve shows the importance-weighted estimate (converging to $1 = \mathbb{E}_{p_2}[X]$). The figure illustrates the convergence behavior: by roughly 200 samples both estimators are close to their true values, confirming the law of large numbers for weighted samples [5]. (Note that the importance-weighted estimator fluctuates more, reflecting the higher variance induced by the weights.)
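
The higher variance of the weighted estimator can be quantified by repeating the experiment; the sketch below (again illustrative, with an arbitrary number of repetitions) reuses the same $p_1$, $p_2$ and $f(x) = x$ and compares the spread of the two estimators at N = 200.

# variance_check.py -- spread of the plain vs. importance-weighted estimator (illustrative)
import numpy as np

rng = np.random.default_rng(0)
N, n_repeats = 200, 2000

def p1_pdf(x): return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)        # N(0, 1)
def p2_pdf(x): return np.exp(-0.5 * (x - 1)**2) / np.sqrt(2 * np.pi)  # N(1, 1)

plain, weighted = [], []
for _ in range(n_repeats):
    X = rng.normal(0.0, 1.0, size=N)    # samples from p1
    w = p2_pdf(X) / p1_pdf(X)           # weights p2/p1
    plain.append(X.mean())              # estimates E_{p1}[X] = 0
    weighted.append((X * w).mean())     # estimates E_{p2}[X] = 1

print("plain    mean %.3f  std %.3f" % (np.mean(plain), np.std(plain)))
print("weighted mean %.3f  std %.3f" % (np.mean(weighted), np.std(weighted)))
# Both estimators are close to their targets on average, but the weighted one has the larger spread.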

Problem 3: Implementing Tabular REINFORCE and Linear Q-learning
We fill in the notebook’s TODOs as follows (each snippet goes into the indicated method):

• Tabular REINFORCE – In the TabularREINFORCE class:

• choose_action(self, state): sample an action from the softmax policy. Insert:

policy = self.get_policy(state)
action = np.random.choice(self.env.action_space, p=policy)
return action

• train(self, num_episodes): run Monte Carlo policy-gradient updates. Replace the raise NotImplementedError with:

rewards_per_episode = []
for episode in range(num_episodes):
    state = self.env.reset()
    done = False
    trajectory = []
    rewards = []
    # Generate one episode
    while not done:
        action = self.choose_action(state)
        next_state, reward, done = self.env.step(action)
        trajectory.append((state, action))
        rewards.append(reward)
        state = next_state
    # Compute discounted returns
    G = 0
    returns = []
    for r in reversed(rewards):
        G = r + self.gamma * G
        returns.insert(0, G)
    # Update policy parameters
    for (state, action), G in zip(trajectory, returns):
        pi = self.get_policy(state)
        grad_log = -pi
        grad_log[action] += 1
        self.theta[state] += self.lr * G * grad_log
    rewards_per_episode.append(sum(rewards))
return rewards_per_episode

This implements the REINFORCE rule using $\nabla_\theta \log \pi(a \mid s) = \mathbf{1}_a - \pi(\cdot \mid s)$ (one-hot minus policy).
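
The one-hot-minus-policy form of the score can be verified against a finite-difference gradient; the snippet below is a standalone check with its own softmax helper (it does not use the notebook's classes, and the logits are arbitrary).

# score_check.py -- verify grad_theta log pi(a|s) = onehot(a) - pi(.|s) for a softmax policy
import numpy as np

def softmax(z):
    z = z - z.max()           # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
theta_s = rng.normal(size=4)  # logits for one state, 4 actions (arbitrary)
a = 2                         # an arbitrary action
pi = softmax(theta_s)

analytic = -pi.copy()
analytic[a] += 1.0            # onehot(a) - pi

eps = 1e-6
numeric = np.zeros_like(theta_s)
for i in range(len(theta_s)):
    up, down = theta_s.copy(), theta_s.copy()
    up[i] += eps
    down[i] -= eps
    numeric[i] = (np.log(softmax(up)[a]) - np.log(softmax(down)[a])) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # close to zero: the two gradients agree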

• Linear Q-learning – In the LinearApproxQlearning class:

• state_action_to_feature(self, state, action): return a one-hot feature of length width*height*|A|. Insert:

x, y = state
idx = (x * self.env.height + y) * len(self.env.action_space) + action
feature = np.zeros(self.feature_dim)
feature[idx] = 1
return feature

• choose_action(self, state): ε-greedy on Q-values. Insert:

if np.random.rand() < self.epsilon:
    return np.random.choice(self.env.action_space)
q_values = [self.get_q_value(state, a) for a in self.env.action_space]
return int(np.argmax(q_values))

• train(self, num_episodes): Q-learning with TD updates. Replace the raise with:

rewards = []
for episode in range(num_episodes):
    state = self.env.reset()
    done = False
    total_reward = 0
    while not done:
        action = self.choose_action(state)
        next_state, reward, done = self.env.step(action)
        total_reward += reward
        # Q-learning target
        if not done:
            next_qs = [self.get_q_value(next_state, a) for a in self.env.action_space]
            best_next_q = max(next_qs)
        else:
            best_next_q = 0
        current_q = self.get_q_value(state, action)
        td_error = (reward + self.gamma * best_next_q) - current_q
        # Gradient update for linear Q
        phi = self.state_action_to_feature(state, action)
        self.weights += self.lr * td_error * phi
        state = next_state
    self.epsilon *= self.epsilon_decay
    rewards.append(total_reward)
return rewards

This updates the weights using the TD error $\delta = r + \gamma \max_{a'} Q(s', a') - Q(s, a)$ with the linear features, i.e. $w \leftarrow w + \alpha\, \delta\, \phi(s, a)$.

After inserting these code blocks into the notebook and running 2000 episodes, we observe performance
differences. For example, in our trials Tabular REINFORCE learned only a modest policy (the average total reward stabilized around −13), reflecting the high variance of pure Monte Carlo updates. The Linear Q-learning agent, by contrast, achieved a positive reward (average ≈ +3), indicating that it reached the goal more consistently. These results agree with known properties: REINFORCE (policy gradient) is unbiased but tends to have high variance and slow learning [6], whereas Q-learning (even with linear approximation) can learn faster from the reward signal, though it may introduce bias or instability if not tuned. We also experimented with the hyperparameters (e.g. higher learning rates or different values of $\gamma$) and with grid layouts: increasing the discount $\gamma$ made the agents weight long-horizon rewards more heavily, while a larger learning rate sped up initial learning but risked instability.
Overall, our empirical curves show that (with a suitable baseline) the policy-gradient method converges but
more slowly, whereas the value-based method with function approximation often learns a better policy
sooner under these settings.
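
For reference, the comparison above can be reproduced with a small driver along these lines; the environment constructor (GridWorldEnv()) and the agent constructor arguments are placeholders, since the notebook's exact signatures may differ, so treat this as a sketch rather than the notebook's code.

# run_comparison.py -- sketch of the Problem 3 comparison
# (assumes TabularREINFORCE, LinearApproxQlearning and the notebook's environment are in scope;
#  GridWorldEnv and the constructor arguments are placeholders)
import numpy as np
import matplotlib.pyplot as plt

def smooth(xs, k=50):
    """Moving average to make the noisy per-episode rewards readable."""
    return np.convolve(np.asarray(xs, dtype=float), np.ones(k) / k, mode='valid')

env = GridWorldEnv()   # placeholder for whatever environment the notebook provides
reinforce_rewards = TabularREINFORCE(env).train(num_episodes=2000)
qlearning_rewards = LinearApproxQlearning(env).train(num_episodes=2000)

plt.plot(smooth(reinforce_rewards), label='Tabular REINFORCE')
plt.plot(smooth(qlearning_rewards), label='Linear Q-learning')
plt.xlabel('Episode')
plt.ylabel('Smoothed total reward (50-episode moving average)')
plt.legend()
plt.title('REINFORCE vs. linear Q-learning')
plt.show()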

Sources: The unbiasedness of baselines and the optimal baseline choice are discussed in the policy-gradient literature [1][2][3]. The importance sampling identity and its convergence follow standard Monte Carlo theory [4][5]. The high variance of REINFORCE is noted in the RL literature [6].

[1][3] Going Deeper Into Reinforcement Learning: Fundamentals of Policy Gradients
https://danieltakeshi.github.io/2017/03/28/going-deeper-into-reinforcement-learning-fundamentals-of-policy-gradients/

[2] Policy Gradient Methods
https://rll.berkeley.edu/deeprlcoursesp17/docs/lec2.pdf

[4][5] moodle.umontpellier.fr
https://moodle.umontpellier.fr/mod/resource/view.php?id=751393

[6] Sutton & Barto summary chap 13 - Policy Gradient Methods | lcalem
https://lcalem.github.io/blog/2019/03/21/sutton-chap13
