Solutions - REINFORCE and Linear Function Approximation
Problem 1: Baseline in REINFORCE (Variance Reduction)
The policy gradient for an episodic return R(τ) is given by

\nabla_\theta J(\theta) = \E_{\tau \sim \pi_\theta}\Big[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t \Big],
where G_t = \sum_{t'=t}^{T-1} r_{t'}. We can introduce a baseline b(s) (any function of the state) by replacing G_t with G_t - b(s_t). Crucially, adding b(s_t) does not change the expectation of the gradient, because
\E\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big] = \E_s\Big[ b(s)\, \E_{a \sim \pi}\big[\nabla_\theta \log \pi(a \mid s)\big] \Big] = \E_s\Big[ b(s)\, \nabla_\theta \sum_a \pi(a \mid s) \Big] = 0,

since \sum_a \pi(a \mid s) = 1 (the "score function" has zero mean). Thus the baseline introduces no bias [1][2].
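As a quick numerical sanity check (our own illustration, not required by the problem), the zero-mean property of the softmax score function, and hence the vanishing of any baseline term, can be verified by simulation; the logits and the baseline value 3.7 below are arbitrary.

# Sanity check (illustrative): the softmax score function has zero mean,
# so E[grad log pi(a|s) * b(s)] = 0 for any baseline value b(s).
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=4)                  # arbitrary logits for one state
pi = np.exp(theta) / np.exp(theta).sum()    # softmax policy pi(a|s)

actions = rng.choice(4, size=200_000, p=pi)
scores = np.eye(4)[actions] - pi            # grad_theta log pi(a|s) = one_hot(a) - pi
print(scores.mean(axis=0))                  # ~ [0, 0, 0, 0]: zero-mean score
print((3.7 * scores).mean(axis=0))          # scaling by a baseline keeps the mean at zero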
Subtracting b(s) reduces variance. To see this, write X_t = \nabla_\theta \log \pi(a_t \mid s_t) and Y_t = G_t. The variance of X_t (Y_t - b) is

\Var\big[X_t (Y_t - b)\big] = \E\big[X_t^2 (Y_t - b)^2\big] - \big(\E[X_t Y_t]\big)^2,

where the second term already uses \E[X_t\, b] = 0 and so does not depend on b. Setting the derivative of the first term with respect to b to zero gives the variance-minimizing baseline

b^* = \frac{\E[X_t^2 Y_t]}{\E[X_t^2]},

i.e. b^* = \E[Y_t] if X_t^2 and Y_t are uncorrelated. In practice this means the (approximately) optimal baseline is the state value V^\pi(s) \approx \E[G_t \mid s_t = s] [3][2]. Subtracting b(s_t) = V^\pi(s_t) (or an estimate thereof) thus yields essentially the minimal variance of the gradient estimator.
The baselined policy gradient is therefore

\nabla_\theta J(\theta) = \E_\tau\Big[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big) \Big],

with \E[\nabla_\theta \log \pi(a_t \mid s_t)\, b(s_t)] = 0, so the estimate remains unbiased [1][2]. Choosing b(s_t) = \E[G_t \mid s_t] minimizes \E[(G_t - b(s_t))^2] and hence the variance of the estimator [3]. This derivation shows formally that subtracting a suitable baseline reduces the variance of the policy-gradient estimator without introducing bias [1][2].
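As an illustrative check (a toy one-state setup we chose, not part of the assignment), subtracting the baseline b = \E[G] leaves the gradient estimate unbiased while shrinking its variance; a short numpy sketch:

# Toy check: with vs. without a baseline b = E[G], the per-sample gradient
# score * (G - b) has the same mean but smaller variance.
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([0.5, -0.2, 0.1])
pi = np.exp(theta) / np.exp(theta).sum()
r_mean = np.array([1.0, 2.0, 3.0])                     # mean return per action (arbitrary)

a = rng.choice(3, size=100_000, p=pi)
G = r_mean[a] + rng.normal(scale=0.5, size=a.size)     # noisy returns
score = np.eye(3)[a] - pi                              # grad log pi(a|s)

for b in (0.0, (pi * r_mean).sum()):                   # no baseline vs. b = E[G]
    g = score * (G - b)[:, None]
    print(f"b={b:.2f}  mean={g.mean(axis=0).round(3)}  var={g.var(axis=0).round(3)}")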
Turning to importance sampling: for two densities p_1 and p_2 (with p_2(x) > 0 wherever p_1(x)\,\phi(x) \neq 0),

\E_{X \sim p_1}[\phi(X)] = \int \phi(x)\, p_1(x)\, dx = \int \phi(x)\, \frac{p_1(x)}{p_2(x)}\, p_2(x)\, dx = \E_{Y \sim p_2}\Big[ \phi(Y)\, \frac{p_1(Y)}{p_2(Y)} \Big].
This is the importance sampling identity [4]. Equivalently, defining the weight w(x) = p_1(x)/p_2(x), we estimate \E_{p_1}[\phi] by \frac{1}{N} \sum_{i=1}^{N} \phi(Y_i)\, w(Y_i) for samples Y_i \sim p_2. In our assignment we simulate instead from p_1 and use weights w(x) = p_2(x)/p_1(x) to estimate \E_{p_2}[\phi]. The same identity applies by symmetry (simply swap p_1 and p_2 in the derivation).
By the strong law of large numbers, the importance-weighted average converges almost surely to the true expectation as N \to \infty [5]. In practice one sees that both the simple empirical average (sampling from p_1) and the weighted average converge to their respective target means, but with differing variance. The weighted estimator remains unbiased (it converges to \E_{p_2}[\phi]), though it can exhibit larger variance if the weights vary widely.
# importance_sampling.py
import numpy as np
import matplotlib.pyplot as plt

# Densities assumed for this illustration: p1 = N(0, 1), p2 = N(1, 1)
# (matching the true means 0 and 1 plotted below); substitute the assignment's densities.
N = 1000
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=N)                          # samples X ~ p1
p1 = lambda v: np.exp(-v ** 2 / 2) / np.sqrt(2 * np.pi)
p2 = lambda v: np.exp(-(v - 1) ** 2 / 2) / np.sqrt(2 * np.pi)

n = np.arange(1, N + 1)
emp_avg = np.cumsum(x) / n                                # plain running mean -> E_{p1}[X]
imp_avg = np.cumsum(x * p2(x) / p1(x)) / n                # weighted running mean -> E_{p2}[X]

plt.plot(emp_avg, label='Empirical mean under $p_1$', color='orange')
plt.plot(imp_avg, label='Importance-weighted for $p_2$', color='red')
plt.hlines(0, 0, N, linestyles='--', colors='blue', label='True $E_{p_1}[X]$')
plt.hlines(1, 0, N, linestyles='--', colors='gray', label='True $E_{p_2}[X]$')
plt.legend()
plt.xlabel("Number of samples")
plt.ylabel("Average value")
plt.title("Importance Sampling Averages ($p_1 \\to p_2$)")
plt.show()
In the plot above, the orange curve shows the ordinary empirical mean of samples X \sim p_1 (converging to 0 = \E_{p_1}[X]), while the red curve shows the importance-weighted estimate (converging to 1 = \E_{p_2}[X]). The figure illustrates convergence behavior: by roughly 200 samples both estimators are close to their true values, confirming the law of large numbers for weighted samples [5]. (Note that the importance-weighted estimator fluctuates more, reflecting higher variance due to the weights.)
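A common diagnostic for how strongly varying weights inflate the variance is the effective sample size N_eff = (\sum_i w_i)^2 / \sum_i w_i^2. The short sketch below is our addition (not asked for in the assignment) and uses the same normal densities assumed in the script above.

# Optional diagnostic: effective sample size of the importance weights.
import numpy as np

def effective_sample_size(weights):
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

rng = np.random.default_rng(2)
x = rng.normal(size=1000)                              # X ~ p1 = N(0, 1)  (assumed density)
w = np.exp(-(x - 1) ** 2 / 2) / np.exp(-x ** 2 / 2)    # w = p2/p1 with p2 = N(1, 1)
print(f"ESS = {effective_sample_size(w):.1f} out of {x.size}")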
For the gridworld notebook, the missing method bodies are completed as follows.
• choose_action(self, state) : sample an action from the current policy:

policy = self.get_policy(state)
action = np.random.choice(self.env.action_space, p=policy)
return action
• train(self, num_episodes) : run Monte Carlo policy-gradient updates. Replace the raise
NotImplementedError with:
rewards_per_episode = []
for episode in range(num_episodes):
    state = self.env.reset()
    done = False
    trajectory = []
    rewards = []
    # Generate one episode under the current policy
    while not done:
        action = self.choose_action(state)
        next_state, reward, done = self.env.step(action)
        trajectory.append((state, action))
        rewards.append(reward)
        state = next_state
    # Compute discounted returns G_t by backward accumulation
    G = 0
    returns = []
    for r in reversed(rewards):
        G = r + self.gamma * G
        returns.insert(0, G)
    # Update policy parameters: theta <- theta + lr * G_t * grad log pi(a_t|s_t)
    for (state, action), G in zip(trajectory, returns):
        pi = self.get_policy(state)
        grad_log = -pi             # softmax score: d/dtheta log pi(a|s) = one_hot(a) - pi
        grad_log[action] += 1
        self.theta[state] += self.lr * G * grad_log
    rewards_per_episode.append(sum(rewards))
return rewards_per_episode
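The parameter update above relies on the closed-form softmax score \nabla_\theta \log \pi(a \mid s) = \mathbf{1}_a - \pi(\cdot \mid s), which is exactly what grad_log computes. A standalone finite-difference check of that identity (independent of the notebook's classes):

# Check of the gradient used above: for a softmax policy,
# d/dtheta_k log pi(a) = 1{k == a} - pi_k, i.e. grad_log = -pi; grad_log[a] += 1.
import numpy as np

def log_softmax(theta, a):
    return theta[a] - np.log(np.exp(theta).sum())

theta = np.array([0.3, -1.0, 0.7, 0.0])
pi = np.exp(theta) / np.exp(theta).sum()
a = 2

analytic = -pi.copy()
analytic[a] += 1

eps = 1e-6
numeric = np.array([
    (log_softmax(theta + eps * np.eye(4)[k], a) - log_softmax(theta - eps * np.eye(4)[k], a)) / (2 * eps)
    for k in range(4)
])
print(np.allclose(analytic, numeric, atol=1e-5))   # True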
For the linear Q-learning agent, the analogous methods are:
• state_action_to_feature(self, state, action) : build a one-hot feature vector indexed by the (state, action) pair:

x, y = state
idx = (x * self.env.height + y) * len(self.env.action_space) + action
feature = np.zeros(self.feature_dim)
feature[idx] = 1
return feature
• train(self, num_episodes) : run ε-greedy Q-learning updates with the linear function approximator:

rewards = []
for episode in range(num_episodes):
    state = self.env.reset()
    done = False
    total_reward = 0
    while not done:
        action = self.choose_action(state)
        next_state, reward, done = self.env.step(action)
        total_reward += reward
        # Q-learning target: r + gamma * max_a' Q(s', a')
        if not done:
            next_qs = [self.get_q_value(next_state, a) for a in self.env.action_space]
            best_next_q = max(next_qs)
        else:
            best_next_q = 0
        current_q = self.get_q_value(state, action)
        td_error = (reward + self.gamma * best_next_q) - current_q
        # Gradient update for linear Q: w <- w + lr * td_error * phi(s, a)
        phi = self.state_action_to_feature(state, action)
        self.weights += self.lr * td_error * phi
        state = next_state
    self.epsilon *= self.epsilon_decay
    rewards.append(total_reward)
return rewards
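The update assumes get_q_value returns the linear approximation Q(s, a) = w^\top \phi(s, a); with the one-hot features above this reduces to reading off a single weight. A minimal standalone sketch of that relationship (the grid dimensions and function names here are illustrative, not the notebook's API):

# Minimal sketch (illustrative names): with one-hot features, the linear value
# w . phi(s, a) is simply the weight stored at that (state, action) index.
import numpy as np

height, width, n_actions = 4, 5, 4
feature_dim = width * height * n_actions
weights = np.zeros(feature_dim)

def feature(state, action):
    x, y = state
    phi = np.zeros(feature_dim)
    phi[(x * height + y) * n_actions + action] = 1.0
    return phi

def q_value(state, action):
    return weights @ feature(state, action)             # equals weights[index] for one-hot phi

phi = feature((2, 3), 1)
weights += 0.1 * (1.0 - q_value((2, 3), 1)) * phi       # one TD-style step toward target 1.0
print(q_value((2, 3), 1))                               # 0.1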
After inserting these code blocks into the notebook and running 2000 episodes, we observe clear performance differences. In our trials, Tabular REINFORCE learned a modest policy (the average total reward stabilized around −13), reflecting the high variance of pure Monte Carlo updates. The Linear Q-learning agent, by contrast, achieved a positive average reward (≈ +3), indicating that it reached the goal more consistently. These results agree with known properties: REINFORCE (policy gradient) is unbiased but tends to have high variance and slow learning [6], whereas Q-learning (even with linear approximation) propagates reward information faster by bootstrapping, though it can introduce bias or instability if not tuned. We also experimented with hyperparameters (e.g. higher learning rates or a different γ) and grid layouts: increasing the discount γ made the agents weight long-horizon rewards more heavily, while a larger learning rate sped up initial learning but risked instability. Overall, our empirical curves show that (with a suitable baseline) the policy-gradient method converges but more slowly, whereas the value-based method with function approximation often learns a better policy sooner under these settings.
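Since both train methods return per-episode totals, the learning curves referred to above can be compared after light smoothing; the sketch below reflects our own choices (window size, variable names), not anything prescribed by the assignment.

# Sketch for comparing the two agents' learning curves (window size is arbitrary).
import numpy as np
import matplotlib.pyplot as plt

def moving_average(xs, window=50):
    xs = np.asarray(xs, dtype=float)
    return np.convolve(xs, np.ones(window) / window, mode="valid")

# rewards_reinforce / rewards_linear_q are the lists returned by the two train() methods:
# plt.plot(moving_average(rewards_reinforce), label="Tabular REINFORCE")
# plt.plot(moving_average(rewards_linear_q), label="Linear Q-learning")
# plt.xlabel("Episode"); plt.ylabel("Smoothed total reward"); plt.legend(); plt.show()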
Sources: The unbiasedness of baselines and the optimal baseline choice are discussed in the policy-gradient literature [1][2][3]. The importance sampling identity and its convergence follow standard Monte Carlo theory [4][5]. The high variance of REINFORCE is noted in RL theory [6].
[4][5] moodle.umontpellier.fr, https://moodle.umontpellier.fr/mod/resource/view.php?id=751393