Chapter 12
Policy Gradient Optimization
We can use estimates of the policy gradient to drive the search of the parameter
space toward an optimal policy. The previous chapter outlined methods for
estimating this gradient. This chapter explains how to use these estimates to
guide the optimization. We begin with gradient ascent, which simply takes steps
in the direction of the gradient at each iteration. Determining the step size is a
major challenge. Large steps can lead to faster progress to the optimum, but they
can overshoot. The natural policy gradient modifies the direction of the gradient
to better handle variable levels of sensitivity across parameter components. We
conclude with the trust region method, which starts in exactly the same way as
the natural gradient method to obtain a candidate policy. It then searches along
the line segment in policy space connecting the original policy to this candidate
to find a better policy.
12.1 Gradient Ascent Update

We can use gradient ascent (reviewed in appendix A.11) to find a policy parameterized by θ that maximizes the expected utility U(θ). Gradient ascent is a type
of iterated ascent method, which involves taking steps in the parameter space at
each iteration in an attempt to improve the quality of the associated policy. All
the methods discussed in this chapter are iterated ascent methods, but they differ
in how they take steps. The gradient ascent method discussed in this section
takes steps in the direction of ∇U (θ), which may be estimated using one of the
methods discussed in the previous chapter. The update of θ is
θ ← θ + α∇U(θ)   (12.1)
where the step length is equal to a step factor α > 0 times the magnitude of the
gradient.
Algorithm 12.1 implements a method that takes such a step. This method can
be called for either a fixed number of iterations or until θ or U (θ) converges.
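Algorithm 12.1 itself is not reproduced in this excerpt. A minimal sketch of such an update, in which the struct fields are assumptions rather than the book's exact definitions, might look like this:

```julia
# Sketch of a basic gradient ascent update. The struct fields here are
# assumptions; the book's algorithm 12.1 may differ in its details.
struct PolicyGradientUpdate
    ∇U # callable gradient estimate, ∇U(θ)
    α  # step factor
end

# One step of equation (12.1): θ ← θ + α ∇U(θ)
update(M::PolicyGradientUpdate, θ) = θ + M.α * M.∇U(θ)

# One step on a toy objective U(θ) = -θ⋅θ with ∇U(θ) = -2θ:
M = PolicyGradientUpdate(θ -> -2θ, 0.1)
θ = update(M, [1.0, -1.0]) # moves toward the maximizer at the origin
```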
Gradient ascent, as well as the other algorithms discussed in this chapter, is not guaranteed to converge to the optimal policy. However, there are techniques to encourage convergence to a locally optimal policy, in which taking an infinitesimally small step in parameter space cannot result in a better policy. One approach is to decay the step factor with each step.¹

¹ This approach, as well as many others, is covered in detail by M. J. Kochenderfer and T. A. Wheeler, Algorithms for Optimization, MIT Press, 2019.
Very large gradients tend to overshoot the optimum and may occur for a variety of reasons. Rewards for some problems, such as the 2048 problem
(appendix F.2), can vary by orders of magnitude. One approach for keeping the
gradients manageable is to use gradient scaling, which limits the magnitude of a
gradient estimate before using it to update the policy parameterization. Gradients
are commonly limited to having an L2 -norm of 1. Another approach is gradient
clipping, which conducts elementwise clamping of the gradient before using it to
update the policy. Clipping commonly limits the entries to lie between ±1. Both
techniques are implemented in algorithm 12.2.
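Algorithm 12.2 is not shown in this excerpt; the following one-line sketches follow the text's description of the two techniques:

```julia
using LinearAlgebra

# Gradient scaling: cap the L2 norm of the gradient estimate at L2_max.
scale_gradient(∇, L2_max) = min(L2_max/norm(∇), 1)*∇

# Gradient clipping: clamp each entry of the gradient to lie in [a, b].
clip_gradient(∇, a, b) = clamp.(∇, a, b)
```

For example, `scale_gradient([3.0, 4.0], 1.0)` rescales a gradient of norm 5 to [0.6, 0.8], while `clip_gradient([-2.0, 0.5], -1.0, 1.0)` clamps only the out-of-range entry, yielding [-1.0, 0.5].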
© 2022 Massachusetts Institute of Technology, shared under a Creative Commons CC-BY-NC-ND license.
2022-05-07 21:55:16-07:00, comments to bugs@algorithmsbook.com
Figure 12.1. The effect of gradient scaling and clipping applied to the simple regulator problem, plotting expected reward by iteration for no modification, gradients scaled to norms of 2, 1, and 1/2, and gradients clipped to ±2, ±1, and ±1/2. Each gradient evaluation ran 10 rollouts to depth 10. Step updates were applied with a step size of 0.2. The optimal policy parameterization is shown in black.
12.2 Restricted Gradient Update
The restricted gradient update maximizes a first-order approximation of the objective,

U(θ′) ≈ U(θ) + ∇U(θ)⊤(θ′ − θ)   (12.2)

subject to a constraint that the step ½(θ′ − θ)⊤(θ′ − θ) ≤ ε remain small.
We can drop U (θ) from the objective since it does not depend on θ′ . In addition,
we can change the inequality to an equality in the constraint because the linear
objective forces the optimal solution to be on the boundary of the feasible region.
These changes result in an equivalent optimization problem:
θ′ = θ + u √(2ε/(u⊤u)) = θ + √(2ε) u/‖u‖   (12.6)
Algorithm 12.3 provides an implementation of this update:

function update(M::RestrictedPolicyUpdate, θ)
𝒫, b, d, m, ∇logπ, π, γ = M.𝒫, M.b, M.d, M.m, M.∇logπ, M.π, M.𝒫.γ
πθ(s) = π(θ, s)
R(τ) = sum(r*γ^(k-1) for (k, (s,a,r)) in enumerate(τ))
τs = [simulate(𝒫, rand(b), πθ, d) for i in 1:m]
∇log(τ) = sum(∇logπ(θ, a, s) for (s,a) in τ)
∇U(τ) = ∇log(τ)*R(τ)
u = mean(∇U(τ) for τ in τs)
return θ + u*sqrt(2*M.ϵ/dot(u,u))
end
12.3 Natural Gradient Update

The natural gradient method² is a variation of the restricted step method discussed in the previous section that better handles situations when some components of the parameter space are more sensitive than others. Sensitivity in this context refers to how much the utility of a policy varies with respect to small changes in one of the parameters. The sensitivity in gradient methods is largely determined by the choice of scaling of the policy parameters. The natural policy gradient method makes the search direction u invariant to parameter scaling. Figure 12.2 illustrates the differences between the true gradient and the natural gradient.

² S. Amari, "Natural Gradient Works Efficiently in Learning," Neural Computation, vol. 10, no. 2, pp. 251–276, 1998.
Figure 12.2. A comparison of gradient directions for a problem with the optimum (black dot) at [−1, 0]. A similar figure is presented in J. Peters and S. Schaal, "Reinforcement Learning of Motor Skills with Policy Gradients," Neural Networks, vol. 21, no. 4, pp. 682–697, 2008.
The natural policy gradient method uses the same first-order approximation of
the objective as in the previous section. The constraint, however, is different. The
intuition is that we want to restrict changes in θ that result in large changes in the
distribution over trajectories. A way to measure how much a distribution changes
is to use the Kullback-Leibler divergence, or KL divergence (appendix A.10). We
could impose the constraint
g(θ, θ′) = ½(θ′ − θ)⊤ Fθ (θ′ − θ) ≤ ε   (12.8)
Fθ = E_{τ}[∇ log p(τ | θ) ∇ log p(τ | θ)⊤]   (12.10)

which looks identical to equation (12.5), except that instead of the identity matrix I, we have the Fisher matrix Fθ. This difference results in an ellipsoid feasible set. Figure 12.3 shows an example in two dimensions.
Figure 12.3. The natural policy gradient places a constraint on the approximated Kullback-Leibler divergence. This constraint takes the form of an ellipse. The ellipse may be elongated in certain directions, allowing larger steps if the gradient is rotated.

This optimization problem can be solved analytically and has the same form as the update in the previous section:

θ′ = θ + u √(2ε/(∇U(θ)⊤ u))   (12.12)
except that we now have³

u = Fθ⁻¹ ∇U(θ)   (12.13)

We can use sampled trajectories to estimate Fθ and ∇U(θ). Algorithm 12.4 provides an implementation.

³ This computation can be done using conjugate gradient descent, which reduces computation when the dimension of θ is large. S. M. Kakade, "A Natural Policy Gradient," in Advances in Neural Information Processing Systems (NIPS), 2001.

12.4 Trust Region Update
This section discusses a method for searching within the trust region, defined by the elliptical feasible region from the previous section. This category of approach is referred to as trust region policy optimization (TRPO).⁴ It works by computing the next evaluation point θ′ that would be taken by the natural policy gradient and then conducting a line search along the line segment connecting θ to θ′. A key property of this line search phase is that evaluations of the approximate objective and constraint do not require additional rollout simulations.

⁴ J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel, "Trust Region Policy Optimization," in International Conference on Machine Learning (ICML), 2015.
function update(M::NaturalPolicyUpdate, θ)
𝒫, b, d, m, ∇logπ, π, γ = M.𝒫, M.b, M.d, M.m, M.∇logπ, M.π, M.𝒫.γ
πθ(s) = π(θ, s)
R(τ) = sum(r*γ^(k-1) for (k, (s,a,r)) in enumerate(τ))
∇log(τ) = sum(∇logπ(θ, a, s) for (s,a) in τ)
∇U(τ) = ∇log(τ)*R(τ)
F(τ) = ∇log(τ)*∇log(τ)'
τs = [simulate(𝒫, rand(b), πθ, d) for i in 1:m]
return natural_update(θ, ∇U, F, M.ϵ, τs)
end
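The `natural_update` helper called above is not included in this excerpt. A sketch consistent with equations (12.12) and (12.13), with the function name and argument order taken from the call site but the body an assumption, might be:

```julia
using LinearAlgebra, Statistics

# Sketch of natural_update: estimate ∇U(θ) and F_θ from sampled
# trajectories, then step along u = F_θ⁻¹ ∇U(θ), scaled so the quadratic
# divergence constraint of equation (12.8) holds with equality.
function natural_update(θ, ∇f, F, ϵ, τs)
    ∇fθ = mean(∇f(τ) for τ in τs)    # Monte Carlo estimate of ∇U(θ)
    u = mean(F(τ) for τ in τs) \ ∇fθ # u = F_θ⁻¹ ∇U(θ), equation (12.13)
    return θ + u*sqrt(2ϵ/(∇fθ⋅u))    # step length from equation (12.12)
end
```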
During the line search phase, we no longer use a first-order approximation. Instead, we use an approximation derived from an equality involving the advantage function⁵

U(θ′) = U(θ) + E_{τ∼πθ′}[ Σ_{k=1}^{d} Aθ(s⁽ᵏ⁾, a⁽ᵏ⁾) ]   (12.14)

Another way to write this is to use b_{γ,θ}, the discounted visitation distribution of state s under policy πθ.

⁵ A variation of this equality is proven in lemma 6.1 of S. M. Kakade and J. Langford, "Approximately Optimal Approximate Reinforcement Learning," in International Conference on Machine Learning (ICML), 2002.
We would like to pull our samples from our policy parameterized by θ instead of θ′ so that we do not have to run more simulations during the line search. The samples associated with the inner expectation can be replaced with samples from our original policy so long as we appropriately weight the advantage:⁶

U(θ′) = U(θ) + E_{s∼b_{γ,θ′}} E_{a∼πθ(·|s)}[ (πθ′(a | s)/πθ(a | s)) Aθ(s, a) ]   (12.17)

⁶ This weighting comes from importance sampling, which is reviewed in appendix A.14.
The next step involves replacing the state distribution with bγ,θ . The quality of
the approximation degrades as θ′ gets further from θ, but it is hypothesized that
it is acceptable within the trust region. Since U (θ) does not depend on θ′ , we can
drop it from the objective. We can also drop the state value function from the
advantage function, leaving us with the action value function. What remains is
referred to as the surrogate objective:
f(θ, θ′) = E_{s∼b_{γ,θ}} E_{a∼πθ(·|s)}[ (πθ′(a | s)/πθ(a | s)) Qθ(s, a) ]   (12.18)
This equation can be estimated from the same set of trajectories that was used to estimate the natural gradient update. We can estimate Qθ(s, a) using the reward-to-go in the sampled trajectories.⁷

⁷ Algorithm 12.5 instead uses Σ_{ℓ=k} r⁽ℓ⁾γ^{ℓ−1}, which effectively discounts the reward-to-go by γ^{k−1}. This discount is needed to weight each sample's contribution to match the discounted visitation distribution. The surrogate constraint is similarly discounted.

The surrogate constraint in the line search is given by
g(θ, θ′) = E_{s∼b_{γ,θ}}[ DKL(πθ(· | s) ‖ πθ′(· | s)) ] ≤ ε   (12.19)
Line search involves iteratively evaluating our surrogate objective f and sur-
rogate constraint g for different points in the policy space. We begin with the θ′
obtained from the same process as the natural gradient update. We then iteratively
apply
θ′ ← θ + α (θ′ − θ ) (12.20)
until we have an improvement in our objective with f (θ, θ′ ) > f (θ, θ) and our
constraint is met with g(θ, θ′ ) ≤ ǫ. The step factor 0 < α < 1 shrinks the distance
between θ and θ′ at each iteration, with α typically set to 0.5.
Algorithm 12.5 provides an implementation of this approach. Figure 12.4
illustrates the relationship between the feasible regions associated with the natural
gradient and the line search. Figure 12.5 demonstrates the approach on a regulator
problem, and example 12.1 shows an update for a simple problem.
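The backtracking line search described above can be sketched as follows; the function name mirrors the `linesearch` call in algorithm 12.5, but this standalone signature and the iteration cap are assumptions:

```julia
# Sketch of TRPO's backtracking line search. Starting from the natural
# gradient candidate θ′, shrink toward θ (equation (12.20)) until the
# surrogate objective improves and the divergence constraint is satisfied.
function linesearch(f, g, θ, θ′, ϵ; α=0.5, k_max=20)
    fθ = f(θ) # surrogate objective value at the current parameterization
    for k in 1:k_max
        if f(θ′) > fθ && g(θ′) ≤ ϵ
            return θ′ # both acceptance conditions hold
        end
        θ′ = θ + α*(θ′ - θ) # shrink the step toward θ
    end
    return θ # no acceptable point found; keep the current policy
end
```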
12.5 Clamped Surrogate Objective

The clamped surrogate objective replaces the trust region constraint with a pessimistic lower bound on the surrogate objective:

E_{s∼b_{γ,θ}} E_{a∼πθ(·|s)}[ min( (πθ′(a | s)/πθ(a | s)) Aθ(s, a), clamp(πθ′(a | s)/πθ(a | s), 1 − ε, 1 + ε) Aθ(s, a) ) ]   (12.22)
where ε is a small positive value⁹ and clamp(x, a, b) forces x to be between a and b. By definition, clamp(x, a, b) = min{max{x, a}, b}.

⁹ While this ε does not directly act as a threshold on divergence, as it did in previous algorithms, its role is similar. A typical value is 0.2.

Clamping the probability ratio alone does not produce a lower bound; we must also take the minimum of the clamped and original objectives.
is shown in figure 12.6, together with the original and clamped objectives. The
end result of the lower bound is that the change in probability ratio is ignored
when it would cause the objective to improve significantly. Using the lower bound
thus prevents large, often detrimental, updates in these situations and removes
the need for the trust region surrogate constraint equation (12.19). Without the
function update(M::TrustRegionUpdate, θ)
𝒫, b, d, m, ∇logπ, π, γ = M.𝒫, M.b, M.d, M.m, M.∇logπ, M.π, M.𝒫.γ
πθ(s) = π(θ, s)
R(τ) = sum(r*γ^(k-1) for (k, (s,a,r)) in enumerate(τ))
∇log(τ) = sum(∇logπ(θ, a, s) for (s,a) in τ)
∇U(τ) = ∇log(τ)*R(τ)
F(τ) = ∇log(τ)*∇log(τ)'
τs = [simulate(𝒫, rand(b), πθ, d) for i in 1:m]
θ′ = natural_update(θ, ∇U, F, M.ϵ, τs)
f(θ′) = surrogate_objective(M, θ, θ′, τs)
g(θ′) = surrogate_constraint(M, θ, θ′, τs)
return linesearch(M, f, g, θ, θ′)
end
Figure 12.4. Trust region policy optimization searches within the elliptical constraint generated by a second-order approximation of the Kullback-Leibler divergence, g(θ, θ′) = ½(θ′ − θ)⊤Fθ(θ′ − θ) = ε. After computing the natural policy gradient ascent direction Fθ⁻¹∇U(θ), a line search is conducted along the segment from θ to θ′ to ensure that the updated policy improves the policy reward and adheres to the divergence constraint. The line search starts from the estimated maximum step size and reduces the step size along the ascent direction until a satisfactory point is found.
Figure 12.5. Trust region policy optimization applied to the simple regulator problem. The optimal policy parameterization θ⋆ is shown in black.
Example 12.1. Consider a Gaussian policy πθ(a | s) = N(a | θ₁, θ₂²). The components of the gradient of the log policy are

∂/∂θ₁ log πθ(a | s) = (a − θ₁)/θ₂²

∂/∂θ₂ log πθ(a | s) = ((a − θ₁)² − θ₂²)/θ₂³
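These derivatives can be checked numerically with finite differences, assuming the policy is Gaussian with mean θ₁ and standard deviation θ₂, consistent with the derivatives above:

```julia
# Finite difference check of the Gaussian log-policy derivatives, with
# log π_θ(a | s) = -(a - θ₁)²/(2θ₂²) - log θ₂ - log(2π)/2.
logπ(θ, a) = -(a - θ[1])^2/(2*θ[2]^2) - log(θ[2]) - log(2π)/2
θ, a, δ = [0.0, 1.0], 0.5, 1e-6
fd1 = (logπ(θ + [δ, 0], a) - logπ(θ, a))/δ # ≈ (a - θ₁)/θ₂² = 0.5
fd2 = (logπ(θ + [0, δ], a) - logπ(θ, a))/δ # ≈ ((a - θ₁)² - θ₂²)/θ₂³ = -0.75
```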
Suppose that we run two rollouts with θ = [0, 1] (this problem only has one state). The objective function gradient is [2.030, 1.020]. The resulting ascent direction u is [1, 0]. Setting ε = 0.1, we compute our updated parameterization vector and obtain θ′ = [0.314, 1].
The surrogate objective function value at θ is 1.485. Line search begins at θ′, where the surrogate objective function value is 2.110 and the constraint yields 0.049. This satisfies our constraint (as 0.049 < ε), so we return the new parameterization.
constraint, we can also eliminate line search and use standard gradient ascent
methods.
The gradient of the unclamped objective equation (12.21) with action values is
∇θ′ f(θ, θ′) = E_{s∼b_{γ,θ}} E_{a∼πθ(·|s)}[ (∇θ′ πθ′(a | s)/πθ(a | s)) Qθ(s, a) ]   (12.23)
where Qθ(s, a) can be estimated from the reward-to-go. The gradient of the lower-bound objective, equation (12.22), with clamping is the same, except that there is no contribution from experience tuples for which the objective is actively clamped. That is, if either the reward-to-go is positive and the probability ratio is greater than 1 + ε, or the reward-to-go is negative and the probability ratio is less than 1 − ε, the gradient contribution is zero.
As in TRPO, the gradient can be computed for a parameterization θ′ from experience generated from θ. Hence, several gradient updates can be run in a row using the same set of sampled trajectories. Algorithm 12.6 provides an implementation of this approach.
The clamped surrogate objective is compared to several other surrogate ob-
jectives in figure 12.7, which includes a line plot for the effective objective for
TRPO:
E_{s∼b_{γ,θ}, a∼πθ(·|s)}[ (πθ′(a | s)/πθ(a | s)) Aθ(s, a) − β DKL(πθ(· | s) ‖ πθ′(· | s)) ]   (12.24)
function update(M::ClampedSurrogateUpdate, θ)
    𝒫, b, d, m, π, α, k_max = M.𝒫, M.b, M.d, M.m, M.π, M.α, M.k_max
πθ(s) = π(θ, s)
τs = [simulate(𝒫, rand(b), πθ, d) for i in 1:m]
θ′ = copy(θ)
for k in 1:k_max
θ′ += α*clamped_gradient(M, θ, θ′, τs)
end
return θ′
end
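The `clamped_gradient` helper is not shown in this excerpt. A sketch of its core logic follows; it assumes a probability accessor `p(θ, a, s)` and log-gradient `∇logp(θ, a, s)` (the excerpt's `π(θ, s)` samples actions, so these signatures are assumptions), and it uses the discounted reward-to-go in place of Qθ, as the text describes:

```julia
using Statistics

# Sketch of the clamped-gradient logic: per-sample gradients are zeroed
# wherever the clamped term of equation (12.22) is the active minimum.
# p(θ, a, s) returns the action probability and ∇logp its log-gradient;
# both signatures are assumptions for this sketch.
function clamped_gradient(p, ∇logp, γ, ϵ, θ, θ′, τs)
    function sample_grad(τ, k, s, a)
        Rtogo = sum(τ[ℓ][3]*γ^(ℓ-1) for ℓ in k:length(τ)) # discounted reward-to-go
        ratio = p(θ′, a, s)/p(θ, a, s)
        # no contribution where the objective is actively clamped
        if (Rtogo > 0 && ratio > 1 + ϵ) || (Rtogo < 0 && ratio < 1 - ϵ)
            return zero(θ)
        end
        return ∇logp(θ′, a, s)*ratio*Rtogo
    end
    return mean(sample_grad(τ, k, s, a) for τ in τs for (k, (s, a, r)) in enumerate(τ))
end
```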
which is the trust region policy objective where the constraint is implemented
as a penalty for some coefficient β. TRPO typically uses a hard constraint rather
than a penalty because it is difficult to choose a value of β that performs well
within a single problem, let alone across multiple problems.
Figure 12.7. A comparison of the surrogate objective, the surrogate constraint, the TRPO effective objective, and the clamped surrogate objective.
12.6 Summary

• Gradient ascent can be made more robust by scaling, clipping, or forcing the size of the improvement steps to be uniform.
• We can use a pessimistic lower bound of the TRPO objective to obtain a clamped
surrogate objective that performs similarly without the need for line search.
12.7 Exercises
Exercise 12.1. TRPO starts its line search from a new parameterization given by a natural
policy gradient update. However, TRPO conducts the line search using a different objec-
tive than the natural policy gradient. Show that the gradient of the surrogate objective
equation (12.18) used in TRPO is actually the same as the reward-to-go policy gradient
equation (11.26).
Solution: The gradient of the surrogate objective is

∇θ′ U_TRPO = E_{s∼b_{γ,θ}} E_{a∼πθ(·|s)}[ (∇θ′ πθ′(a | s)/πθ(a | s)) Qθ(s, a) ]
When conducting the initial natural policy gradient update, the search direction is
evaluated at θ′ = θ. Furthermore, the action value is approximated with the reward-to-go:
∇θ′ U_TRPO = E_{s∼b_{γ,θ}} E_{a∼πθ(·|s)}[ (∇θ πθ(a | s)/πθ(a | s)) r_to-go ]
which takes the same form as the reward-to-go policy gradient equation (11.26).
Exercise 12.2. Perform the calculations of example 12.1. First, compute the inverse of the Fisher information matrix Fθ⁻¹, then compute u, and compute the updated parameters θ′.
Exercise 12.3. Suppose we have the parameterized policies πθ and πθ′ given in the
following table:
                a1    a2    a3    a4
πθ(a | s1)     0.1   0.2   0.3   0.4
πθ′(a | s1)    0.4   0.3   0.2   0.1
πθ(a | s2)     0.1   0.1   0.6   0.2
πθ′(a | s2)    0.1   0.1   0.5   0.3
Given that we sample the following five states, s1, s2, s1, s1, s2, approximate E_s[DKL(πθ(· | s) ‖ πθ′(· | s))] using the definition

DKL(P ‖ Q) = Σ_x P(x) log(P(x)/Q(x))
Solution: First, we compute the KL divergence for a state sample s1:

DKL(πθ(· | s1) ‖ πθ′(· | s1)) = 0.1 log(0.1/0.4) + 0.2 log(0.2/0.3) + 0.3 log(0.3/0.2) + 0.4 log(0.4/0.1) ≈ 0.456

Now, we compute the KL divergence for a state sample s2:

DKL(πθ(· | s2) ‖ πθ′(· | s2)) = 0.1 log(0.1/0.1) + 0.1 log(0.1/0.1) + 0.6 log(0.6/0.5) + 0.2 log(0.2/0.3) ≈ 0.0283

Finally, we compute the approximation of the expectation, which is the average KL divergence of the parameterized policies over the n state samples:

E_s[DKL(πθ(· | s) ‖ πθ′(· | s))] ≈ (1/n) Σ_{i=1}^{n} DKL(πθ(· | s⁽ⁱ⁾) ‖ πθ′(· | s⁽ⁱ⁾))
≈ (1/5)(0.456 + 0.0283 + 0.456 + 0.456 + 0.0283)
≈ 0.285
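The arithmetic above can be verified in a few lines:

```julia
# Numerical check of the KL divergences and the sampled-state average.
kl(p, q) = sum(p_i*log(p_i/q_i) for (p_i, q_i) in zip(p, q))

d1 = kl([0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]) # ≈ 0.456
d2 = kl([0.1, 0.1, 0.6, 0.2], [0.1, 0.1, 0.5, 0.3]) # ≈ 0.0283
estimate = (3*d1 + 2*d2)/5 # ≈ 0.285, matching the samples s1, s2, s1, s1, s2
```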