
12 Policy Gradient Optimization

We can use estimates of the policy gradient to drive the search of the parameter
space toward an optimal policy. The previous chapter outlined methods for
estimating this gradient. This chapter explains how to use these estimates to
guide the optimization. We begin with gradient ascent, which simply takes steps
in the direction of the gradient at each iteration. Determining the step size is a
major challenge. Large steps can lead to faster progress to the optimum, but they
can overshoot. The natural policy gradient modifies the direction of the gradient
to better handle variable levels of sensitivity across parameter components. We
conclude with the trust region method, which starts in exactly the same way as
the natural gradient method to obtain a candidate policy. It then searches along
the line segment in policy space connecting the original policy to this candidate
to find a better policy.

12.1 Gradient Ascent Update

We can use gradient ascent (reviewed in appendix A.11) to find a policy parame-
terized by θ that maximizes the expected utility U (θ). Gradient ascent is a type
of iterated ascent method, which involves taking steps in the parameter space at
each iteration in an attempt to improve the quality of the associated policy. All
the methods discussed in this chapter are iterated ascent methods, but they differ
in how they take steps. The gradient ascent method discussed in this section
takes steps in the direction of ∇U (θ), which may be estimated using one of the
methods discussed in the previous chapter. The update of θ is

θ ← θ + α∇U(θ)        (12.1)

where the step length is equal to a step factor α > 0 times the magnitude of the
gradient.
Algorithm 12.1 implements a method that takes such a step. This method can
be called for either a fixed number of iterations or until θ or U (θ) converges.
Gradient ascent, as well as the other algorithms discussed in this chapter, is not
guaranteed to converge to the optimal policy. However, there are techniques to
encourage convergence to a locally optimal policy, in which taking an infinitesimally
small step in parameter space cannot result in a better policy. One approach is to
decay the step factor with each step.¹

¹ This approach, as well as many others, is covered in detail by M. J. Kochenderfer and
T. A. Wheeler, Algorithms for Optimization. MIT Press, 2019.

Algorithm 12.1. The gradient ascent method for policy optimization. It takes a step
from a point θ in the direction of the gradient ∇U with step factor α. We can use one
of the methods in the previous chapter to compute ∇U.

struct PolicyGradientUpdate
    ∇U # policy gradient estimate
    α  # step factor
end

function update(M::PolicyGradientUpdate, θ)
    return θ + M.α * M.∇U(θ)
end
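As a usage sketch (not one of the book's numbered algorithms), this update can be
applied for a fixed number of iterations. Here ∇U_estimate is a hypothetical function
that returns a gradient estimate for a given θ, such as one built from the methods of
the previous chapter.

# Minimal gradient ascent loop, assuming ∇U_estimate(θ) returns an estimate of ∇U(θ).
function ascend(∇U_estimate, θ, α, k_max)
    M = PolicyGradientUpdate(∇U_estimate, α)
    for k in 1:k_max
        θ = update(M, θ)   # θ ← θ + α∇U(θ), equation (12.1)
    end
    return θ
end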

Very large gradients tend to overshoot the optimum and may occur due to a
variety of reasons. Rewards for some problems, such as for the 2048 problem
(appendix F.2), can vary by orders of magnitude. One approach for keeping the
gradients manageable is to use gradient scaling, which limits the magnitude of a
gradient estimate before using it to update the policy parameterization. Gradients
are commonly limited to having an L2-norm of 1. Another approach is gradient
clipping, which conducts elementwise clamping of the gradient before using it to
update the policy. Clipping commonly limits the entries to lie between ±1. Both
techniques are implemented in algorithm 12.2.

Algorithm 12.2. Methods for gradient scaling and clipping. Gradient scaling limits the
magnitude of the provided gradient vector ∇ to L2_max. Gradient clipping provides
elementwise clamping of the provided gradient vector ∇ to between a and b.

scale_gradient(∇, L2_max) = min(L2_max/norm(∇), 1)*∇
clip_gradient(∇, a, b) = clamp.(∇, a, b)

Scaling and clipping differ in how they affect the final gradient direction, as
demonstrated in figure 12.1. Scaling will leave the direction unaffected, whereas
clipping affects each component individually. Whether this difference is advantageous
depends on the problem. For example, if a single component dominates the gradient
vector, scaling will zero out the other components.
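The following sketch (not one of the book's listings) illustrates the difference on a
gradient dominated by its first component; norm comes from the LinearAlgebra
standard library.

using LinearAlgebra  # provides norm, used by scale_gradient

∇ = [10.0, 0.1]                   # gradient dominated by its first component
∇s = scale_gradient(∇, 1.0)       # ≈ [1.0, 0.01]: direction preserved, L2-norm limited to 1
∇c = clip_gradient(∇, -1.0, 1.0)  # = [1.0, 0.1]: each entry clamped to lie in [-1, 1]

After scaling, the second component contributes almost nothing to the step, while
clipping leaves it untouched, which is the trade-off discussed above.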


Figure 12.1 (plots omitted; two panels show θ2 versus θ1 and expected reward versus
iteration for no modification, scaling the gradient to 2, 1, and 1/2, and clipping the
gradient to ±2, ±1, and ±1/2). The effect of gradient scaling and clipping applied to
the simple regulator problem. Each gradient evaluation ran 10 rollouts to depth 10.
Step updates were applied with a step size of 0.2. The optimal policy parameterization
is shown in black.
12.2 Restricted Gradient Update

The remaining algorithms in this chapter attempt to optimize an approximation
of the objective function U(θ), subject to a constraint that the policy parameters at
the next step θ′ are not too far from θ at the current step. The constraint takes the
form g(θ, θ′) ≤ ϵ, where ϵ > 0 is a free parameter in the algorithm. The methods
differ in their approximation of U(θ) and the form of g. This section describes a
simple restricted step method.
We use the first-order Taylor approximation (appendix A.12) obtained from
our gradient estimate at θ to approximate U:

U(θ′) ≈ U(θ) + ∇U(θ)⊤(θ′ − θ)        (12.2)

For the constraint, we use

g(θ, θ′) = ½(θ′ − θ)⊤I(θ′ − θ) = ½∥θ′ − θ∥₂²        (12.3)

We can view this constraint as limiting the step length to no more than √(2ϵ).
In other words, the feasible region in our optimization is a ball of radius √(2ϵ)
centered at θ.
The optimization problem is, then,

maximize (over θ′)   U(θ) + ∇U(θ)⊤(θ′ − θ)
subject to           ½(θ′ − θ)⊤I(θ′ − θ) ≤ ϵ        (12.4)


We can drop U (θ) from the objective since it does not depend on θ′ . In addition,
we can change the inequality to an equality in the constraint because the linear
objective forces the optimal solution to be on the boundary of the feasible region.
These changes result in an equivalent optimization problem:

maximize (over θ′)   ∇U(θ)⊤(θ′ − θ)
subject to           ½(θ′ − θ)⊤I(θ′ − θ) = ϵ        (12.5)

This optimization problem can be solved analytically:

θ′ = θ + u √(2ϵ / (u⊤u)) = θ + √(2ϵ) u/∥u∥        (12.6)

where the unnormalized search direction u is simply ∇U(θ). Of course, we do not
know ∇U(θ) exactly, but we can use any of the methods described in the previous
chapter to estimate it. Algorithm 12.3 provides an implementation.

Algorithm 12.3. The update function for the restricted policy gradient method at θ
for a problem 𝒫 with initial state distribution b. The gradient is estimated from the
initial state distribution b to depth d with m simulations of parameterized policy
π(θ, s) with log policy gradient ∇logπ.

struct RestrictedPolicyUpdate
    𝒫 # problem
    b # initial state distribution
    d # depth
    m # number of samples
    ∇logπ # gradient of log likelihood
    π # policy
    ϵ # divergence bound
end

function update(M::RestrictedPolicyUpdate, θ)
    𝒫, b, d, m, ∇logπ, π, γ = M.𝒫, M.b, M.d, M.m, M.∇logπ, M.π, M.𝒫.γ
    πθ(s) = π(θ, s)
    R(τ) = sum(r*γ^(k-1) for (k, (s,a,r)) in enumerate(τ))
    τs = [simulate(𝒫, rand(b), πθ, d) for i in 1:m]
    ∇log(τ) = sum(∇logπ(θ, a, s) for (s,a) in τ)
    ∇U(τ) = ∇log(τ)*R(τ)
    u = mean(∇U(τ) for τ in τs)
    return θ + u*sqrt(2*M.ϵ/dot(u,u))
end
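As a quick check (a sketch of this chapter's author's method by the editor, not from
the text), the restricted update of equation (12.6) always takes a step of length √(2ϵ),
regardless of the magnitude of the gradient estimate; the vector u below is a
hypothetical gradient estimate.

using LinearAlgebra

ϵ = 0.1
u = [3.0, 4.0]                  # hypothetical gradient estimate ∇U(θ)
θ = zeros(2)
θ′ = θ + u*sqrt(2ϵ/dot(u, u))   # equation (12.6)
norm(θ′ - θ) ≈ sqrt(2ϵ)         # true: the step length is always √(2ϵ) ≈ 0.447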


12.3 Natural Gradient Update

The natural gradient method² is a variation of the restricted step method discussed
in the previous section to better handle situations when some components of the
parameter space are more sensitive than others. Sensitivity in this context refers
to how much the utility of a policy varies with respect to small changes in one of
the parameters. The sensitivity in gradient methods is largely determined by the
choice of scaling of the policy parameters. The natural policy gradient method
makes the search direction u invariant to parameter scaling. Figure 12.2 illustrates
the differences between the true gradient and the natural gradient.

² S. Amari, "Natural Gradient Works Efficiently in Learning," Neural Computation,
vol. 10, no. 2, pp. 251–276, 1998.

Figure 12.2 (plots omitted; two panels titled "true gradient" and "natural gradient"
show gradient fields over θ1 and θ2). A comparison of the true gradient and the
natural gradient on the simple regulator problem (see appendix F.5). The true
gradient generally points strongly in the negative θ2 direction, whereas the natural
gradient generally points toward the optimum (black dot) at [−1, 0]. A similar figure
is presented in J. Peters and S. Schaal, "Reinforcement Learning of Motor Skills with
Policy Gradients," Neural Networks, vol. 21, no. 4, pp. 682–697, 2008.

The natural policy gradient method uses the same first-order approximation of
the objective as in the previous section. The constraint, however, is different. The
intuition is that we want to restrict changes in θ that result in large changes in the
distribution over trajectories. A way to measure how much a distribution changes
is to use the Kullback-Leibler divergence, or KL divergence (appendix A.10). We
could impose the constraint

g(θ, θ′) = DKL(p(· | θ) ∥ p(· | θ′)) ≤ ϵ        (12.7)

but instead we will use a second-order Taylor approximation:

g(θ, θ′) = ½(θ′ − θ)⊤Fθ(θ′ − θ) ≤ ϵ        (12.8)


where the Fisher information matrix has the following form:

Fθ = ∫ p(τ | θ) ∇log p(τ | θ) ∇log p(τ | θ)⊤ dτ        (12.9)
   = E_τ[ ∇log p(τ | θ) ∇log p(τ | θ)⊤ ]               (12.10)

The resulting optimization problem is

maximize (over θ′)   ∇U(θ)⊤(θ′ − θ)
subject to           ½(θ′ − θ)⊤Fθ(θ′ − θ) = ϵ        (12.11)

which looks identical to equation (12.5) except that instead of the identity matrix
I, we have the Fisher matrix Fθ. This difference results in an ellipsoid feasible set.
Figure 12.3 shows an example in two dimensions.

Figure 12.3 (sketch omitted; it shows the ellipse ½(θ′ − θ)⊤Fθ(θ′ − θ) = ϵ together
with the vectors ∇U(θ) and Fθ⁻¹∇U(θ)). The natural policy gradient places a
constraint on the approximated Kullback-Leibler divergence. This constraint takes
the form of an ellipse. The ellipse may be elongated in certain directions, allowing
larger steps if the gradient is rotated.

This optimization problem can be solved analytically and has the same form as
the update in the previous section:

θ′ = θ + u √(2ϵ / (∇U(θ)⊤u))        (12.12)

except that we now have³

u = Fθ⁻¹ ∇U(θ)        (12.13)

We can use sampled trajectories to estimate Fθ and ∇U(θ). Algorithm 12.4 provides
an implementation.

³ This computation can be done using conjugate gradient descent, which reduces
computation when the dimension of θ is large. S. M. Kakade, "A Natural Policy
Gradient," in Advances in Neural Information Processing Systems (NIPS), 2001.

12.4 Trust Region Update
This section discusses a method for searching within the trust region, defined by
the elliptical feasible region from the previous section. This category of approach
is referred to as trust region policy optimization (TRPO).⁴ It works by computing
the next evaluation point θ′ that would be taken by the natural policy gradient
and then conducting a line search along the line segment connecting θ to θ′. A key
property of this line search phase is that evaluations of the approximate objective
and constraint do not require any additional rollout simulations.

⁴ J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel, "Trust Region Policy
Optimization," in International Conference on Machine Learning (ICML), 2015.


Algorithm 12.4. The update function for the natural policy gradient, given policy
π(θ, s), for an MDP 𝒫 with initial state distribution b. The natural gradient with
respect to the parameter vector θ is estimated from m rollouts to depth d using the
log policy gradients ∇logπ. The natural_update helper method conducts an update
according to equation (12.12), given an objective gradient ∇f(τ) and a Fisher matrix
F(τ) for a list of trajectories.

struct NaturalPolicyUpdate
    𝒫 # problem
    b # initial state distribution
    d # depth
    m # number of samples
    ∇logπ # gradient of log likelihood
    π # policy
    ϵ # divergence bound
end

function natural_update(θ, ∇f, F, ϵ, τs)
    ∇fθ = mean(∇f(τ) for τ in τs)
    u = mean(F(τ) for τ in τs) \ ∇fθ
    return θ + u*sqrt(2ϵ/dot(∇fθ,u))
end

function update(M::NaturalPolicyUpdate, θ)
    𝒫, b, d, m, ∇logπ, π, γ = M.𝒫, M.b, M.d, M.m, M.∇logπ, M.π, M.𝒫.γ
    πθ(s) = π(θ, s)
    R(τ) = sum(r*γ^(k-1) for (k, (s,a,r)) in enumerate(τ))
    ∇log(τ) = sum(∇logπ(θ, a, s) for (s,a) in τ)
    ∇U(τ) = ∇log(τ)*R(τ)
    F(τ) = ∇log(τ)*∇log(τ)'
    τs = [simulate(𝒫, rand(b), πθ, d) for i in 1:m]
    return natural_update(θ, ∇U, F, M.ϵ, τs)
end


During the line search phase, we no longer use a first-order approximation. Instead,
we use an approximation derived from an equality involving the advantage
function⁵

U(θ′) = U(θ) + E_{τ∼πθ′}[ Σ_{k=1}^d Aθ(s^(k), a^(k)) ]        (12.14)

Another way to write this is to use bγ,θ, which is the discounted visitation distribution
of state s under policy πθ, where

bγ,θ(s) ∝ P(s^(1) = s) + γP(s^(2) = s) + γ²P(s^(3) = s) + · · ·        (12.15)

⁵ A variation of this equality is proven in lemma 6.1 of S. M. Kakade and J. Langford,
"Approximately Optimal Approximate Reinforcement Learning," in International
Conference on Machine Learning (ICML), 2002.

Using the discounted visitation distribution, the objective becomes

U(θ′) = U(θ) + E_{s∼bγ,θ′}[ E_{a∼πθ′(·|s)}[ Aθ(s, a) ] ]        (12.16)

We would like to pull our samples from our policy parameterized by θ instead
of θ′ so that we do not have to run more simulations during the line search. The
samples associated with the inner expectation can be replaced with samples from
our original policy so long as we appropriately weight the advantage:⁶

U(θ′) = U(θ) + E_{s∼bγ,θ′}[ E_{a∼πθ(·|s)}[ (πθ′(a | s)/πθ(a | s)) Aθ(s, a) ] ]        (12.17)

⁶ This weighting comes from importance sampling, which is reviewed in appendix A.14.
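To illustrate the reweighting in equation (12.17), here is a small sketch with made-up
numbers (not from the text): samples drawn from πθ are used to estimate an
expectation under πθ′ by multiplying each sample by the ratio πθ′(a | s)/πθ(a | s).

using Statistics

pθ  = Dict(:a1 => 0.5, :a2 => 0.5)    # hypothetical sampling policy πθ(· | s)
pθ′ = Dict(:a1 => 0.8, :a2 => 0.2)    # hypothetical target policy πθ′(· | s)
A   = Dict(:a1 => 1.0, :a2 => -1.0)   # hypothetical advantage estimates Aθ(s, a)

as = [:a1, :a2, :a1, :a2]             # actions sampled from πθ
est = mean(pθ′[a]/pθ[a] * A[a] for a in as)   # 0.6, matching the expectation under πθ′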

The next step involves replacing the state distribution with bγ,θ. The quality of
the approximation degrades as θ′ gets further from θ, but it is hypothesized that
it is acceptable within the trust region. Since U(θ) does not depend on θ′, we can
drop it from the objective. We can also drop the state value function from the
advantage function, leaving us with the action value function. What remains is
referred to as the surrogate objective:

f(θ, θ′) = E_{s∼bγ,θ}[ E_{a∼πθ(·|s)}[ (πθ′(a | s)/πθ(a | s)) Qθ(s, a) ] ]        (12.18)

This equation can be estimated from the same set of trajectories that was used to
estimate the natural gradient update. We can estimate Qθ(s, a) using the reward-to-go
in the sampled trajectories.⁷

The surrogate constraint in the line search is given by

g(θ, θ′) = E_{s∼bγ,θ}[ DKL(πθ(· | s) ∥ πθ′(· | s)) ] ≤ ϵ        (12.19)

⁷ Algorithm 12.5 instead uses Σ_{ℓ=k}^d r^(ℓ) γ^(ℓ−1), which effectively discounts the
reward-to-go by γ^(k−1). This discount is needed to weight each sample's contribution
to match the discounted visitation distribution. The surrogate constraint is similarly
discounted.


Line search involves iteratively evaluating our surrogate objective f and sur-
rogate constraint g for different points in the policy space. We begin with the θ′
obtained from the same process as the natural gradient update. We then iteratively
apply
θ′ ← θ + α(θ′ − θ)        (12.20)

until we have an improvement in our objective with f(θ, θ′) > f(θ, θ) and our
constraint is met with g(θ, θ′) ≤ ϵ. The step factor 0 < α < 1 shrinks the distance
between θ and θ′ at each iteration, with α typically set to 0.5.
Algorithm 12.5 provides an implementation of this approach. Figure 12.4
illustrates the relationship between the feasible regions associated with the natural
gradient and the line search. Figure 12.5 demonstrates the approach on a regulator
problem, and example 12.1 shows an update for a simple problem.

12.5 Clamped Surrogate Objective

We can avoid detrimental policy updates from overly optimistic estimates of
the trust region surrogate objective by clamping.⁸ The surrogate objective from
equation (12.18), after exchanging the action value for the advantage, is

E_{s∼bγ,θ}[ E_{a∼πθ(·|s)}[ (πθ′(a | s)/πθ(a | s)) Aθ(s, a) ] ]        (12.21)

The probability ratio πθ′(a | s)/πθ(a | s) can be overly optimistic. A pessimistic
lower bound on the objective can significantly improve performance:

E_{s∼bγ,θ}[ E_{a∼πθ(·|s)}[ min( (πθ′(a | s)/πθ(a | s)) Aθ(s, a),
    clamp(πθ′(a | s)/πθ(a | s), 1 − ϵ, 1 + ϵ) Aθ(s, a) ) ] ]        (12.22)

where ϵ is a small positive value⁹ and clamp(x, a, b) forces x to be between a and
b. By definition, clamp(x, a, b) = min{max{x, a}, b}.

Clamping the probability ratio alone does not produce a lower bound; we must
also take the minimum of the clamped and original objectives. The lower bound
is shown in figure 12.6, together with the original and clamped objectives. The
end result of the lower bound is that the change in probability ratio is ignored
when it would cause the objective to improve significantly. Using the lower bound
thus prevents large, often detrimental, updates in these situations and removes
the need for the trust region surrogate constraint equation (12.19).

⁸ Clamping is a key idea in what is known as proximal policy optimization (PPO), as
discussed by J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal
Policy Optimization Algorithms," 2017. arXiv: 1707.06347v2.

⁹ While this ϵ does not directly act as a threshold on divergence, as it did in previous
algorithms, its role is similar. A typical value is 0.2.

Algorithm 12.5. The update procedure for trust region policy optimization, which
augments the natural gradient with a line search. It generates m trajectories using
policy π in problem 𝒫 with initial state distribution b and depth d. To obtain the
starting point of the line search, we need the gradient of the log-probability of the
policy generating a particular action from the current state, which we denote as
∇logπ. For the surrogate objective, we need the probability function p, which gives
the probability that our policy generates a particular action from the current state.
For the surrogate constraint, we need the divergence between the action distributions
generated by πθ and πθ′. At each step of the line search, we shrink the distance
between the considered point θ′ and θ while maintaining the search direction.

struct TrustRegionUpdate
    𝒫 # problem
    b # initial state distribution
    d # depth
    m # number of samples
    π # policy π(s)
    p # policy likelihood p(θ, a, s)
    ∇logπ # log likelihood gradient
    KL # KL divergence KL(θ, θ′, s)
    ϵ # divergence bound
    α # line search reduction factor (e.g., 0.5)
end

function surrogate_objective(M::TrustRegionUpdate, θ, θ′, τs)
    d, p, γ = M.d, M.p, M.𝒫.γ
    R(τ, j) = sum(r*γ^(k-1) for (k,(s,a,r)) in zip(j:d, τ[j:end]))
    w(a,s) = p(θ′,a,s) / p(θ,a,s)
    f(τ) = mean(w(a,s)*R(τ,k) for (k,(s,a,r)) in enumerate(τ))
    return mean(f(τ) for τ in τs)
end

function surrogate_constraint(M::TrustRegionUpdate, θ, θ′, τs)
    γ = M.𝒫.γ
    KL(τ) = mean(M.KL(θ, θ′, s)*γ^(k-1) for (k,(s,a,r)) in enumerate(τ))
    return mean(KL(τ) for τ in τs)
end

function linesearch(M::TrustRegionUpdate, f, g, θ, θ′)
    fθ = f(θ)
    while g(θ′) > M.ϵ || f(θ′) ≤ fθ
        θ′ = θ + M.α*(θ′ - θ)
    end
    return θ′
end

function update(M::TrustRegionUpdate, θ)
    𝒫, b, d, m, ∇logπ, π, γ = M.𝒫, M.b, M.d, M.m, M.∇logπ, M.π, M.𝒫.γ
    πθ(s) = π(θ, s)
    R(τ) = sum(r*γ^(k-1) for (k, (s,a,r)) in enumerate(τ))
    ∇log(τ) = sum(∇logπ(θ, a, s) for (s,a) in τ)
    ∇U(τ) = ∇log(τ)*R(τ)
    F(τ) = ∇log(τ)*∇log(τ)'
    τs = [simulate(𝒫, rand(b), πθ, d) for i in 1:m]
    θ′ = natural_update(θ, ∇U, F, M.ϵ, τs)
    f(θ′) = surrogate_objective(M, θ, θ′, τs)
    g(θ′) = surrogate_constraint(M, θ, θ′, τs)
    return linesearch(M, f, g, θ, θ′)
end


Figure 12.4 (sketch omitted; it shows the ellipse ½(θ′ − θ)⊤Fθ(θ′ − θ) = ϵ, the contour
g(θ, θ′) = ϵ, the gradient ∇U at θ, the natural gradient direction Fθ⁻¹∇U(θ), and the
candidate point θ′). Trust region policy optimization searches within the elliptical
constraint generated by a second-order approximation of the Kullback-Leibler
divergence. After computing the natural policy gradient ascent direction, a line
search is conducted to ensure that the updated policy improves the policy reward
and adheres to the divergence constraint. The line search starts from the estimated
maximum step size and reduces the step size along the ascent direction until a
satisfactory point is found.

Figure 12.5 (plots omitted; two panels show θ2 versus θ1 and expected reward versus
iteration). Trust region policy optimization applied to the simple regulator problem
with rollouts to depth 10 with ϵ = 1 and c = 2. The optimal policy parameterization
is shown in black.


Example 12.1. An example of one iteration of trust region policy optimization.

Consider applying TRPO to the Gaussian policy N(θ1, θ2²) from example 11.3
to the single-state MDP from example 11.1 with γ = 1. Recall that the gradient
of the log policy likelihood is

∂/∂θ1 log πθ(a | s) = (a − θ1)/θ2²
∂/∂θ2 log πθ(a | s) = ((a − θ1)² − θ2²)/θ2³

Suppose that we run two rollouts with θ = [0, 1] (this problem only has one state):

τ1 = {(a = r = −0.532), (a = r = 0.597), (a = r = 1.947)}
τ2 = {(a = r = −0.263), (a = r = −2.212), (a = r = 2.364)}

The estimated Fisher information matrix is

Fθ = ½( ∇log p(τ^(1)) ∇log p(τ^(1))⊤ + ∇log p(τ^(2)) ∇log p(τ^(2))⊤ )
   = ½( [4.048 2.878; 2.878 2.046] + [0.012 −0.838; −0.838 57.012] )
   = [2.030 1.020; 1.020 29.529]

The objective function gradient is [2.030, 1.020]. The resulting search direction
u is [1, 0]. Setting ϵ = 0.1, we compute our updated parameterization vector
and obtain θ′ = [0.314, 1].

The surrogate objective function value at θ is 1.485. Line search begins at θ′,
where the surrogate objective function value is 2.110 and the constraint yields
0.049. This satisfies our constraint (as 0.049 < ϵ), so we return the new
parameterization.
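The natural gradient arithmetic in this example can be reproduced directly in Julia
(a verification sketch, not part of the printed example), using the quantities reported
above.

using LinearAlgebra

F  = [2.030 1.020; 1.020 29.529]   # estimated Fisher information matrix from the example
∇U = [2.030, 1.020]                # estimated objective gradient
ϵ  = 0.1
θ  = [0.0, 1.0]

u  = F \ ∇U                        # equation (12.13): u = Fθ⁻¹∇U(θ) ≈ [1.0, 0.0]
θ′ = θ + u*sqrt(2ϵ/dot(∇U, u))     # equation (12.12) ≈ [0.314, 1.0]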


Without the constraint, we can also eliminate line search and use standard gradient
ascent methods.
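To make the lower bound concrete, here is a small sketch (not one of the book's
listings) of the per-sample term inside equation (12.22), where w is the probability
ratio πθ′(a | s)/πθ(a | s) and A is an advantage estimate:

# Per-sample lower-bound (clamped) surrogate objective from equation (12.22).
clamped_objective(w, A, ϵ) = min(w*A, clamp(w, 1-ϵ, 1+ϵ)*A)

clamped_objective(1.5, 1.0, 0.2)   # 1.2: gains beyond ratio 1 + ϵ are not credited
clamped_objective(0.5, -1.0, 0.2)  # -0.8: gains from pushing the ratio below 1 − ϵ are not credited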

Figure 12.6 (plots omitted; two panels titled A > 0 and A < 0 show the objective
function versus the probability ratio, with ticks at 1 − ϵ, 1, and 1 + ϵ). A visualization
of the lower-bound objective for positive and negative advantages compared to the
original objective and the clamped objective. The black point shows the baseline
around which the optimization is performed, πθ′(a | s)/πθ(a | s) = 1. The three line
plots in each axis are vertically separated for clarity.

The gradient of the unclamped objective equation (12.21) with action values is

∇θ′ f(θ, θ′) = E_{s∼bγ,θ}[ E_{a∼πθ(·|s)}[ (∇θ′ πθ′(a | s) / πθ(a | s)) Qθ(s, a) ] ]        (12.23)

where Qθ(s, a) can be estimated from the reward-to-go. The gradient of the lower-
bound objective equation (12.22) (with clamping) is the same, except there is no
contribution from experience tuples for which the objective is actively clamped.
That is, if either the reward-to-go is positive and the probability ratio is greater
than 1 + ǫ, or if the reward-to-go is negative and the probability ratio is less than
1 − ǫ, the gradient contribution is zero.
Like TRPO, the gradient can be computed for a parameterization θ′ from
experience generated from θ. Hence, several gradient updates can be run in
a row using the same set of sampled trajectories. Algorithm 12.6 provides an
implementation of this.
The clamped surrogate objective is compared to several other surrogate ob-
jectives in figure 12.7, which includes a line plot for the effective objective for
TRPO:
E_{s∼bγ,θ, a∼πθ(·|s)}[ (πθ′(a | s)/πθ(a | s)) Aθ(s, a) − β DKL(πθ(· | s) ∥ πθ′(· | s)) ]        (12.24)


Algorithm 12.6. An implementation of clamped surrogate policy optimization, which
returns a new policy parameterization for policy π(s) of an MDP 𝒫 with initial state
distribution b. This implementation samples m trajectories to depth d, and then uses
them to estimate the policy gradient in k_max subsequent updates. The policy gradient
update using the clamped objective is constructed using the policy gradients ∇π with
clamping parameter ϵ.

struct ClampedSurrogateUpdate
    𝒫 # problem
    b # initial state distribution
    d # depth
    m # number of trajectories
    π # policy
    p # policy likelihood
    ∇π # policy likelihood gradient
    ϵ # divergence bound
    α # step size
    k_max # number of iterations per update
end

function clamped_gradient(M::ClampedSurrogateUpdate, θ, θ′, τs)
    d, p, ∇π, ϵ, γ = M.d, M.p, M.∇π, M.ϵ, M.𝒫.γ
    R(τ, j) = sum(r*γ^(k-1) for (k,(s,a,r)) in zip(j:d, τ[j:end]))
    ∇f(a,s,r_togo) = begin
        P = p(θ, a, s)
        w = p(θ′, a, s) / P
        if (r_togo > 0 && w > 1+ϵ) || (r_togo < 0 && w < 1-ϵ)
            return zeros(length(θ))
        end
        return ∇π(θ′, a, s) * r_togo / P
    end
    ∇f(τ) = mean(∇f(a,s,R(τ,k)) for (k,(s,a,r)) in enumerate(τ))
    return mean(∇f(τ) for τ in τs)
end

function update(M::ClampedSurrogateUpdate, θ)
    𝒫, b, d, m, π, α, k_max = M.𝒫, M.b, M.d, M.m, M.π, M.α, M.k_max
    πθ(s) = π(θ, s)
    τs = [simulate(𝒫, rand(b), πθ, d) for i in 1:m]
    θ′ = copy(θ)
    for k in 1:k_max
        θ′ += α*clamped_gradient(M, θ, θ′, τs)
    end
    return θ′
end


which is the trust region policy objective where the constraint is implemented
as a penalty for some coefficient β. TRPO typically uses a hard constraint rather
than a penalty because it is difficult to choose a value of β that performs well
within a single problem, let alone across multiple problems.

Figure 12.7 (plot omitted; it shows the surrogate objective, surrogate constraint,
TRPO effective objective, and clamped surrogate objective versus the linear
interpolation factor). A comparison of surrogate objectives related to clamped
surrogate policy optimization using the linear quadratic regulator problem. The
x-axis shows surrogate objectives as we travel from θ at 0 toward θ′, given a natural
policy update at 1. The surrogate objectives were centered at 0 by subtracting the
surrogate objective function value for θ. We see that the clamped surrogate objective
behaves very similarly to the effective TRPO objective without needing a constraint.
Note that ϵ and β can be adjusted for both algorithms, which would affect where
the maximum is in each case.

12.6 Summary

• The gradient ascent algorithm can use the gradient estimates obtained from the
  methods discussed in the previous chapter to iteratively improve our policy.

• Gradient ascent can be made more robust by scaling, clipping, or forcing the
size of the improvement steps to be uniform.

• The natural gradient approach uses a first-order approximation of the objective
  function with a constraint on the divergence between the trajectory distribution
  at each step, approximated using an estimate of the Fisher information matrix.

• Trust region policy optimization involves augmenting the natural gradient
  method with a line search to further improve the policy without additional
  trajectory simulations.

• We can use a pessimistic lower bound of the TRPO objective to obtain a clamped
surrogate objective that performs similarly without the need for line search.


12.7 Exercises
Exercise 12.1. TRPO starts its line search from a new parameterization given by a natural
policy gradient update. However, TRPO conducts the line search using a different objec-
tive than the natural policy gradient. Show that the gradient of the surrogate objective
equation (12.18) used in TRPO is actually the same as the reward-to-go policy gradient
equation (11.26).

Solution: The gradient of TRPO's surrogate objective is

∇θ′ U_TRPO = E_{s∼bγ,θ}[ E_{a∼πθ(·|s)}[ (∇θ′ πθ′(a | s) / πθ(a | s)) Qθ(s, a) ] ]

When conducting the initial natural policy gradient update, the search direction is
evaluated at θ′ = θ. Furthermore, the action value is approximated with the reward-to-go:

∇θ′ U_TRPO = E_{s∼bγ,θ}[ E_{a∼πθ(·|s)}[ (∇θ πθ(a | s) / πθ(a | s)) r_to-go ] ]

Recall that the derivative of log f(x) is f′(x)/f(x). It thus follows that

∇θ′ U_TRPO = E_{s∼bγ,θ}[ E_{a∼πθ(·|s)}[ ∇θ log πθ(a | s) r_to-go ] ]

which takes the same form as the reward-to-go policy gradient equation (11.26).

Exercise 12.2. Perform the calculations of example 12.1. First, compute the inverse of the
Fisher information matrix Fθ⁻¹, compute u, and compute the updated parameters θ′.

Solution: We start by computing the inverse of the Fisher information matrix:

Fθ⁻¹ = (1 / (2.030(29.529) − 1.020(1.020))) [29.529 −1.020; −1.020 2.030]
     ≈ [0.501 −0.017; −0.017 0.034]

Now, we compute u as follows:

u = Fθ⁻¹ ∇U(θ) ≈ [0.501 −0.017; −0.017 0.034] [2.030; 1.020] ≈ [1; 0]


Finally, we estimate the updated parameters θ′:

θ′ = θ + u √(2ϵ / (∇U(θ)⊤u))
   ≈ [0; 1] + [1; 0] √(2(0.1) / ([2.030 1.020][1; 0]))
   ≈ [0; 1] + [1; 0] √(0.2/2.030)
   ≈ [0.314; 1]

Exercise 12.3. Suppose we have the parameterized policies πθ and πθ′ given in the
following table:

a1 a2 a3 a4
πθ ( a | s1 ) 0.1 0.2 0.3 0.4
π θ′ ( a | s 1 ) 0.4 0.3 0.2 0.1
πθ ( a | s2 ) 0.1 0.1 0.6 0.2
π θ′ ( a | s 2 ) 0.1 0.1 0.5 0.3

Given that we sample the following five states, s1, s2, s1, s1, s2, approximate
E_s[DKL(πθ(· | s) ∥ πθ′(· | s))] using the definition

DKL(P ∥ Q) = Σ_x P(x) log(P(x)/Q(x))
Solution: First, we compute the KL divergence for a state sample s1:

DKL(πθ(· | s1) ∥ πθ′(· | s1)) = 0.1 log(0.1/0.4) + 0.2 log(0.2/0.3) + 0.3 log(0.3/0.2) + 0.4 log(0.4/0.1) ≈ 0.456

Now, we compute the KL divergence for a state sample s2:

DKL(πθ(· | s2) ∥ πθ′(· | s2)) = 0.1 log(0.1/0.1) + 0.1 log(0.1/0.1) + 0.6 log(0.6/0.5) + 0.2 log(0.2/0.3) ≈ 0.0283

Finally, we compute the approximation of the expectation, which is the average KL
divergence of the parameterized policies over the n state samples:

E_s[DKL(πθ(· | s) ∥ πθ′(· | s))] ≈ (1/n) Σ_{i=1}^n DKL(πθ(· | s^(i)) ∥ πθ′(· | s^(i)))
                                 ≈ (1/5)(0.456 + 0.0283 + 0.456 + 0.456 + 0.0283)
                                 ≈ 0.285
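A short sketch (not part of the printed solution) that reproduces this computation
numerically:

# Reproduce the average KL divergence from exercise 12.3.
kl(P, Q) = sum(p*log(p/q) for (p, q) in zip(P, Q))

πθ_s1, πθ′_s1 = [0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]
πθ_s2, πθ′_s2 = [0.1, 0.1, 0.6, 0.2], [0.1, 0.1, 0.5, 0.3]

kls = Dict(:s1 => kl(πθ_s1, πθ′_s1),   # ≈ 0.456
           :s2 => kl(πθ_s2, πθ′_s2))   # ≈ 0.0283

samples = [:s1, :s2, :s1, :s1, :s2]
mean_kl = sum(kls[s] for s in samples) / length(samples)   # ≈ 0.285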
