Chapter 12
Policy Gradient Optimization
We can use estimates of the policy gradient to drive the search of the parameter
space toward an optimal policy. The previous chapter outlined methods for
estimating this gradient. This chapter explains how to use these estimates to
guide the optimization. We begin with gradient ascent, which simply takes steps
in the direction of the gradient at each iteration. Determining the step size is a
major challenge. Large steps can lead to faster progress to the optimum, but they
can overshoot. The natural policy gradient modifies the direction of the gradient
to better handle variable levels of sensitivity across parameter components. We
conclude with the trust region method, which starts in exactly the same way as
the natural gradient method to obtain a candidate policy. It then searches along
the line segment in policy space connecting the original policy to this candidate
to find a better policy.
12.1 Gradient Ascent Update

We can use gradient ascent (reviewed in appendix A.11) to find a policy parameterized by θ that maximizes the expected utility U(θ). Gradient ascent is a type
of iterated ascent method, which involves taking steps in the parameter space at
each iteration in an attempt to improve the quality of the associated policy. All
the methods discussed in this chapter are iterated ascent methods, but they differ
in how they take steps. The gradient ascent method discussed in this section
takes steps in the direction of ∇U (θ), which may be estimated using one of the
methods discussed in the previous chapter. The update of θ is
θ ← θ + α∇U(θ)   (12.1)
where the step length is equal to a step factor α > 0 times the magnitude of the
gradient.
Algorithm 12.1 implements a method that takes such a step. This method can
be called for either a fixed number of iterations or until θ or U (θ) converges.
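Algorithm 12.1 itself is not reproduced in this excerpt. A minimal sketch of such an update, in which the struct fields are assumptions rather than the book's exact definitions, might look like this:

```julia
# Sketch of a basic gradient ascent update. The struct fields here are
# assumptions; the book's algorithm 12.1 may differ in its details.
struct PolicyGradientUpdate
    ∇U # callable gradient estimate, ∇U(θ)
    α  # step factor
end

# One step of equation (12.1): θ ← θ + α ∇U(θ)
update(M::PolicyGradientUpdate, θ) = θ + M.α * M.∇U(θ)

# One step on a toy objective U(θ) = -θ⋅θ with ∇U(θ) = -2θ:
M = PolicyGradientUpdate(θ -> -2θ, 0.1)
θ = update(M, [1.0, -1.0]) # moves toward the maximizer at the origin
```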
Gradient ascent, as well as the other algorithms discussed in this chapter, is not guaranteed to converge to the optimal policy. However, there are techniques to encourage convergence to a locally optimal policy, in which taking an infinitesimally small step in parameter space cannot result in a better policy. One approach is to decay the step factor with each step.¹

¹ This approach, as well as many others, is covered in detail by M. J. Kochenderfer and T. A. Wheeler, Algorithms for Optimization, MIT Press, 2019.
Very large gradients tend to overshoot the optimum and may occur for a variety of reasons. Rewards for some problems, such as the 2048 problem
(appendix F.2), can vary by orders of magnitude. One approach for keeping the
gradients manageable is to use gradient scaling, which limits the magnitude of a
gradient estimate before using it to update the policy parameterization. Gradients
are commonly limited to having an L2 -norm of 1. Another approach is gradient
clipping, which conducts elementwise clamping of the gradient before using it to
update the policy. Clipping commonly limits the entries to lie between ±1. Both
techniques are implemented in algorithm 12.2.
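Algorithm 12.2 is not shown in this excerpt; the following one-line sketches follow the text's description of the two techniques:

```julia
using LinearAlgebra

# Gradient scaling: cap the L2 norm of the gradient estimate at L2_max.
scale_gradient(∇, L2_max) = min(L2_max/norm(∇), 1)*∇

# Gradient clipping: clamp each entry of the gradient to lie in [a, b].
clip_gradient(∇, a, b) = clamp.(∇, a, b)
```

For example, `scale_gradient([3.0, 4.0], 1.0)` rescales a gradient of norm 5 to [0.6, 0.8], while `clip_gradient([-2.0, 0.5], -1.0, 1.0)` clamps only the out-of-range entry, yielding [-1.0, 0.5].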
© 2022 Massachusetts Institute of Technology, shared under a Creative Commons CC-BY-NC-ND license.
2022-05-07 21:55:16-07:00, comments to bugs@algorithmsbook.com
Figure 12.1. The effect of gradient scaling and clipping applied to the simple regulator problem, plotting expected reward by iteration for no modification, gradients scaled to norms of 2, 1, and 1/2, and gradients clipped to ±2, ±1, and ±1/2. Each gradient evaluation ran 10 rollouts to depth 10. Step updates were applied with a step size of 0.2. The optimal policy parameterization is shown in black.
12.2 Restricted Gradient Update
The restricted gradient update maximizes a first-order approximation of the objective,

U(θ′) ≈ U(θ) + ∇U(θ)⊤(θ′ − θ)   (12.2)

subject to a constraint that the step ½(θ′ − θ)⊤(θ′ − θ) ≤ ε remain small.
We can drop U (θ) from the objective since it does not depend on θ′ . In addition,
we can change the inequality to an equality in the constraint because the linear
objective forces the optimal solution to be on the boundary of the feasible region.
These changes result in an equivalent optimization problem:
θ′ = θ + u √(2ε/(u⊤u)) = θ + √(2ε) u/‖u‖   (12.6)
Algorithm 12.3 provides an implementation of this update:

function update(M::RestrictedPolicyUpdate, θ)
𝒫, b, d, m, ∇logπ, π, γ = M.𝒫, M.b, M.d, M.m, M.∇logπ, M.π, M.𝒫.γ
πθ(s) = π(θ, s)
R(τ) = sum(r*γ^(k-1) for (k, (s,a,r)) in enumerate(τ))
τs = [simulate(𝒫, rand(b), πθ, d) for i in 1:m]
∇log(τ) = sum(∇logπ(θ, a, s) for (s,a) in τ)
∇U(τ) = ∇log(τ)*R(τ)
u = mean(∇U(τ) for τ in τs)
return θ + u*sqrt(2*M.ϵ/dot(u,u))
end
12.3 Natural Gradient Update

The natural gradient method² is a variation of the restricted step method discussed in the previous section that better handles situations when some components of the parameter space are more sensitive than others. Sensitivity in this context refers to how much the utility of a policy varies with respect to small changes in one of the parameters. The sensitivity in gradient methods is largely determined by the choice of scaling of the policy parameters. The natural policy gradient method makes the search direction u invariant to parameter scaling. Figure 12.2 illustrates the differences between the true gradient and the natural gradient.

² S. Amari, "Natural Gradient Works Efficiently in Learning," Neural Computation, vol. 10, no. 2, pp. 251–276, 1998.
Figure 12.2. A comparison of gradient directions for a problem with the optimum (black dot) at [−1, 0]. A similar figure is presented in J. Peters and S. Schaal, "Reinforcement Learning of Motor Skills with Policy Gradients," Neural Networks, vol. 21, no. 4, pp. 682–697, 2008.
The natural policy gradient method uses the same first-order approximation of
the objective as in the previous section. The constraint, however, is different. The
intuition is that we want to restrict changes in θ that result in large changes in the
distribution over trajectories. A way to measure how much a distribution changes
is to use the Kullback-Leibler divergence, or KL divergence (appendix A.10). We
could impose the constraint
g(θ, θ′) = ½(θ′ − θ)⊤ Fθ (θ′ − θ) ≤ ε   (12.8)
Fθ = E_{τ}[∇ log p(τ | θ) ∇ log p(τ | θ)⊤]   (12.10)

which looks identical to equation (12.5), except that instead of the identity matrix I, we have the Fisher matrix Fθ. This difference results in an ellipsoid feasible set. Figure 12.3 shows an example in two dimensions.
Figure 12.3. The natural policy gradient places a constraint on the approximated Kullback-Leibler divergence. This constraint takes the form of an ellipse. The ellipse may be elongated in certain directions, allowing larger steps if the gradient is rotated.

This optimization problem can be solved analytically and has the same form as the update in the previous section:

θ′ = θ + u √(2ε/(∇U(θ)⊤ u))   (12.12)
except that we now have³

u = Fθ⁻¹ ∇U(θ)   (12.13)

We can use sampled trajectories to estimate Fθ and ∇U(θ). Algorithm 12.4 provides an implementation.

³ This computation can be done using conjugate gradient descent, which reduces computation when the dimension of θ is large. S. M. Kakade, "A Natural Policy Gradient," in Advances in Neural Information Processing Systems (NIPS), 2001.

12.4 Trust Region Update
This section discusses a method for searching within the trust region, defined by the elliptical feasible region from the previous section. This category of approach is referred to as trust region policy optimization (TRPO).⁴ It works by computing the next evaluation point θ′ that would be taken by the natural policy gradient and then conducting a line search along the line segment connecting θ to θ′. A key property of this line search phase is that evaluations of the approximate objective and constraint do not require additional rollout simulations.

⁴ J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel, "Trust Region Policy Optimization," in International Conference on Machine Learning (ICML), 2015.
function update(M::NaturalPolicyUpdate, θ)
𝒫, b, d, m, ∇logπ, π, γ = M.𝒫, M.b, M.d, M.m, M.∇logπ, M.π, M.𝒫.γ
πθ(s) = π(θ, s)
R(τ) = sum(r*γ^(k-1) for (k, (s,a,r)) in enumerate(τ))
∇log(τ) = sum(∇logπ(θ, a, s) for (s,a) in τ)
∇U(τ) = ∇log(τ)*R(τ)
F(τ) = ∇log(τ)*∇log(τ)'
τs = [simulate(𝒫, rand(b), πθ, d) for i in 1:m]
return natural_update(θ, ∇U, F, M.ϵ, τs)
end
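The `natural_update` helper called above is not included in this excerpt. A sketch consistent with equations (12.12) and (12.13), with the function name and argument order taken from the call site but the body an assumption, might be:

```julia
using LinearAlgebra, Statistics

# Sketch of natural_update: estimate ∇U(θ) and F_θ from sampled
# trajectories, then step along u = F_θ⁻¹ ∇U(θ), scaled so the quadratic
# divergence constraint of equation (12.8) holds with equality.
function natural_update(θ, ∇f, F, ϵ, τs)
    ∇fθ = mean(∇f(τ) for τ in τs)    # Monte Carlo estimate of ∇U(θ)
    u = mean(F(τ) for τ in τs) \ ∇fθ # u = F_θ⁻¹ ∇U(θ), equation (12.13)
    return θ + u*sqrt(2ϵ/(∇fθ⋅u))    # step length from equation (12.12)
end
```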
During the line search phase, we no longer use a first-order approximation. Instead, we use an approximation derived from an equality involving the advantage function⁵

U(θ′) = U(θ) + E_{τ∼πθ′}[ Σ_{k=1}^{d} Aθ(s⁽ᵏ⁾, a⁽ᵏ⁾) ]   (12.14)

Another way to write this is to use b_{γ,θ}, the discounted visitation distribution of state s under policy πθ.

⁵ A variation of this equality is proven in lemma 6.1 of S. M. Kakade and J. Langford, "Approximately Optimal Approximate Reinforcement Learning," in International Conference on Machine Learning (ICML), 2002.
We would like to pull our samples from our policy parameterized by θ instead of θ′ so that we do not have to run more simulations during the line search. The samples associated with the inner expectation can be replaced with samples from our original policy so long as we appropriately weight the advantage:⁶

U(θ′) = U(θ) + E_{s∼b_{γ,θ′}} E_{a∼πθ(·|s)}[ (πθ′(a | s)/πθ(a | s)) Aθ(s, a) ]   (12.17)

⁶ This weighting comes from importance sampling, which is reviewed in appendix A.14.
The next step involves replacing the state distribution with bγ,θ . The quality of
the approximation degrades as θ′ gets further from θ, but it is hypothesized that
it is acceptable within the trust region. Since U (θ) does not depend on θ′ , we can
drop it from the objective. We can also drop the state value function from the
advantage function, leaving us with the action value function. What remains is
referred to as the surrogate objective:
f(θ, θ′) = E_{s∼b_{γ,θ}} E_{a∼πθ(·|s)}[ (πθ′(a | s)/πθ(a | s)) Qθ(s, a) ]   (12.18)
This equation can be estimated from the same set of trajectories that was used to estimate the natural gradient update. We can estimate Qθ(s, a) using the reward-to-go in the sampled trajectories.⁷

⁷ Algorithm 12.5 instead uses Σ_{ℓ=k} r⁽ℓ⁾γ^{ℓ−1}, which effectively discounts the reward-to-go by γ^{k−1}. This discount is needed to weight each sample's contribution to match the discounted visitation distribution. The surrogate constraint is similarly discounted.

The surrogate constraint in the line search is given by
g(θ, θ′) = E_{s∼b_{γ,θ}}[ DKL(πθ(· | s) ‖ πθ′(· | s)) ] ≤ ε   (12.19)
Line search involves iteratively evaluating our surrogate objective f and sur-
rogate constraint g for different points in the policy space. We begin with the θ′
obtained from the same process as the natural gradient update. We then iteratively
apply
θ′ ← θ + α (θ′ − θ ) (12.20)
until we have an improvement in our objective with f (θ, θ′ ) > f (θ, θ) and our
constraint is met with g(θ, θ′ ) ≤ ǫ. The step factor 0 < α < 1 shrinks the distance
between θ and θ′ at each iteration, with α typically set to 0.5.
Algorithm 12.5 provides an implementation of this approach. Figure 12.4
illustrates the relationship between the feasible regions associated with the natural
gradient and the line search. Figure 12.5 demonstrates the approach on a regulator
problem, and example 12.1 shows an update for a simple problem.
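The backtracking line search described above can be sketched as follows; the function name mirrors the `linesearch` call in algorithm 12.5, but this standalone signature and the iteration cap are assumptions:

```julia
# Sketch of TRPO's backtracking line search. Starting from the natural
# gradient candidate θ′, shrink toward θ (equation (12.20)) until the
# surrogate objective improves and the divergence constraint is satisfied.
function linesearch(f, g, θ, θ′, ϵ; α=0.5, k_max=20)
    fθ = f(θ) # surrogate objective value at the current parameterization
    for k in 1:k_max
        if f(θ′) > fθ && g(θ′) ≤ ϵ
            return θ′ # both acceptance conditions hold
        end
        θ′ = θ + α*(θ′ - θ) # shrink the step toward θ
    end
    return θ # no acceptable point found; keep the current policy
end
```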
12.5 Clamped Surrogate Objective

The clamped surrogate objective replaces the trust region constraint with a pessimistic lower bound on the surrogate objective:

E_{s∼b_{γ,θ}} E_{a∼πθ(·|s)}[ min( (πθ′(a | s)/πθ(a | s)) Aθ(s, a), clamp(πθ′(a | s)/πθ(a | s), 1 − ε, 1 + ε) Aθ(s, a) ) ]   (12.22)
where ε is a small positive value⁹ and clamp(x, a, b) forces x to be between a and b. By definition, clamp(x, a, b) = min{max{x, a}, b}.

⁹ While this ε does not directly act as a threshold on divergence, as it did in previous algorithms, its role is similar. A typical value is 0.2.

Clamping the probability ratio alone does not produce a lower bound; we must also take the minimum of the clamped and original objectives.
is shown in figure 12.6, together with the original and clamped objectives. The
end result of the lower bound is that the change in probability ratio is ignored
when it would cause the objective to improve significantly. Using the lower bound
thus prevents large, often detrimental, updates in these situations and removes
the need for the trust region surrogate constraint equation (12.19). Without the
function update(M::TrustRegionUpdate, θ)
𝒫, b, d, m, ∇logπ, π, γ = M.𝒫, M.b, M.d, M.m, M.∇logπ, M.π, M.𝒫.γ
πθ(s) = π(θ, s)
R(τ) = sum(r*γ^(k-1) for (k, (s,a,r)) in enumerate(τ))
∇log(τ) = sum(∇logπ(θ, a, s) for (s,a) in τ)
∇U(τ) = ∇log(τ)*R(τ)
F(τ) = ∇log(τ)*∇log(τ)'
τs = [simulate(𝒫, rand(b), πθ, d) for i in 1:m]
θ′ = natural_update(θ, ∇U, F, M.ϵ, τs)
f(θ′) = surrogate_objective(M, θ, θ′, τs)
g(θ′) = surrogate_constraint(M, θ, θ′, τs)
return linesearch(M, f, g, θ, θ′)
end
Figure 12.4. Trust region policy optimization searches within the elliptical constraint generated by a second-order approximation of the Kullback-Leibler divergence, g(θ, θ′) = ½(θ′ − θ)⊤Fθ(θ′ − θ) = ε. After computing the natural policy gradient ascent direction Fθ⁻¹∇U(θ), a line search is conducted along the segment from θ to θ′ to ensure that the updated policy improves the policy reward and adheres to the divergence constraint. The line search starts from the estimated maximum step size and reduces the step size along the ascent direction until a satisfactory point is found.
Figure 12.5. Trust region policy optimization applied to the simple regulator problem. The optimal policy parameterization θ⋆ is shown in black.
Example 12.1. Consider a Gaussian policy πθ(a | s) = N(a | θ₁, θ₂²). The components of the gradient of the log policy are

∂/∂θ₁ log πθ(a | s) = (a − θ₁)/θ₂²

∂/∂θ₂ log πθ(a | s) = ((a − θ₁)² − θ₂²)/θ₂³
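These derivatives can be checked numerically with finite differences, assuming the policy is Gaussian with mean θ₁ and standard deviation θ₂, consistent with the derivatives above:

```julia
# Finite difference check of the Gaussian log-policy derivatives, with
# log π_θ(a | s) = -(a - θ₁)²/(2θ₂²) - log θ₂ - log(2π)/2.
logπ(θ, a) = -(a - θ[1])^2/(2*θ[2]^2) - log(θ[2]) - log(2π)/2
θ, a, δ = [0.0, 1.0], 0.5, 1e-6
fd1 = (logπ(θ + [δ, 0], a) - logπ(θ, a))/δ # ≈ (a - θ₁)/θ₂² = 0.5
fd2 = (logπ(θ + [0, δ], a) - logπ(θ, a))/δ # ≈ ((a - θ₁)² - θ₂²)/θ₂³ = -0.75
```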
Suppose that we run two rollouts with θ = [0, 1] (this problem only has one state). The objective function gradient is [2.030, 1.020]. The resulting ascent direction u is [1, 0]. Setting ε = 0.1, we compute our updated parameterization vector and obtain θ′ = [0.314, 1].
The surrogate objective function value at θ is 1.485. Line search begins at θ′, where the surrogate objective function value is 2.110 and the constraint yields 0.049. This satisfies our constraint (as 0.049 < ε), so we return the new parameterization.
constraint, we can also eliminate line search and use standard gradient ascent
methods.
The gradient of the unclamped objective equation (12.21) with action values is
∇θ′ f(θ, θ′) = E_{s∼b_{γ,θ}} E_{a∼πθ(·|s)}[ (∇θ′ πθ′(a | s)/πθ(a | s)) Qθ(s, a) ]   (12.23)
where Qθ(s, a) can be estimated from the reward-to-go. The gradient of the lower-bound objective, equation (12.22), with clamping is the same, except that there is no contribution from experience tuples for which the objective is actively clamped. That is, if either the reward-to-go is positive and the probability ratio is greater than 1 + ε, or the reward-to-go is negative and the probability ratio is less than 1 − ε, the gradient contribution is zero.
As in TRPO, the gradient can be computed for a parameterization θ′ from experience generated from θ. Hence, several gradient updates can be run in a row using the same set of sampled trajectories. Algorithm 12.6 provides an implementation of this approach.
The clamped surrogate objective is compared to several other surrogate ob-
jectives in figure 12.7, which includes a line plot for the effective objective for
TRPO:
E_{s∼b_{γ,θ}, a∼πθ(·|s)}[ (πθ′(a | s)/πθ(a | s)) Aθ(s, a) − β DKL(πθ(· | s) ‖ πθ′(· | s)) ]   (12.24)
function update(M::ClampedSurrogateUpdate, θ)
    𝒫, b, d, m, π, α, k_max = M.𝒫, M.b, M.d, M.m, M.π, M.α, M.k_max
πθ(s) = π(θ, s)
τs = [simulate(𝒫, rand(b), πθ, d) for i in 1:m]
θ′ = copy(θ)
for k in 1:k_max
θ′ += α*clamped_gradient(M, θ, θ′, τs)
end
return θ′
end
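The `clamped_gradient` helper is not shown in this excerpt. A sketch of its core logic follows; it assumes a probability accessor `p(θ, a, s)` and log-gradient `∇logp(θ, a, s)` (the excerpt's `π(θ, s)` samples actions, so these signatures are assumptions), and it uses the discounted reward-to-go in place of Qθ, as the text describes:

```julia
using Statistics

# Sketch of the clamped-gradient logic: per-sample gradients are zeroed
# wherever the clamped term of equation (12.22) is the active minimum.
# p(θ, a, s) returns the action probability and ∇logp its log-gradient;
# both signatures are assumptions for this sketch.
function clamped_gradient(p, ∇logp, γ, ϵ, θ, θ′, τs)
    function sample_grad(τ, k, s, a)
        Rtogo = sum(τ[ℓ][3]*γ^(ℓ-1) for ℓ in k:length(τ)) # discounted reward-to-go
        ratio = p(θ′, a, s)/p(θ, a, s)
        # no contribution where the objective is actively clamped
        if (Rtogo > 0 && ratio > 1 + ϵ) || (Rtogo < 0 && ratio < 1 - ϵ)
            return zero(θ)
        end
        return ∇logp(θ′, a, s)*ratio*Rtogo
    end
    return mean(sample_grad(τ, k, s, a) for τ in τs for (k, (s, a, r)) in enumerate(τ))
end
```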
which is the trust region policy objective where the constraint is implemented
as a penalty for some coefficient β. TRPO typically uses a hard constraint rather
than a penalty because it is difficult to choose a value of β that performs well
within a single problem, let alone across multiple problems.
Figure 12.7. A comparison of the surrogate objective, the surrogate constraint, the TRPO effective objective, and the clamped surrogate objective.
12.6 Summary

• Gradient ascent can be made more robust by scaling, clipping, or forcing the size of the improvement steps to be uniform.
• We can use a pessimistic lower bound of the TRPO objective to obtain a clamped
surrogate objective that performs similarly without the need for line search.
12.7 Exercises
Exercise 12.1. TRPO starts its line search from a new parameterization given by a natural
policy gradient update. However, TRPO conducts the line search using a different objec-
tive than the natural policy gradient. Show that the gradient of the surrogate objective
equation (12.18) used in TRPO is actually the same as the reward-to-go policy gradient
equation (11.26).
Solution: The gradient of the surrogate objective is

∇θ′ U_TRPO = E_{s∼b_{γ,θ}} E_{a∼πθ(·|s)}[ (∇θ′ πθ′(a | s)/πθ(a | s)) Qθ(s, a) ]
When conducting the initial natural policy gradient update, the search direction is
evaluated at θ′ = θ. Furthermore, the action value is approximated with the reward-to-go:
∇θ′ U_TRPO = E_{s∼b_{γ,θ}} E_{a∼πθ(·|s)}[ (∇θ πθ(a | s)/πθ(a | s)) r_to-go ]
which takes the same form as the reward-to-go policy gradient equation (11.26).
Exercise 12.2. Perform the calculations of example 12.1. First, compute the inverse of the Fisher information matrix Fθ⁻¹, then compute u, and compute the updated parameters θ′.
Exercise 12.3. Suppose we have the parameterized policies πθ and πθ′ given in the
following table:
                a1    a2    a3    a4
πθ(a | s1)     0.1   0.2   0.3   0.4
πθ′(a | s1)    0.4   0.3   0.2   0.1
πθ(a | s2)     0.1   0.1   0.6   0.2
πθ′(a | s2)    0.1   0.1   0.5   0.3
Given that we sample the following five states, s1, s2, s1, s1, s2, approximate E_s[DKL(πθ(· | s) ‖ πθ′(· | s))] using the definition

DKL(P ‖ Q) = Σ_x P(x) log(P(x)/Q(x))
Solution: First, we compute the KL divergence for a state sample s1:

DKL(πθ(· | s1) ‖ πθ′(· | s1)) = 0.1 log(0.1/0.4) + 0.2 log(0.2/0.3) + 0.3 log(0.3/0.2) + 0.4 log(0.4/0.1) ≈ 0.456

Now, we compute the KL divergence for a state sample s2:

DKL(πθ(· | s2) ‖ πθ′(· | s2)) = 0.1 log(0.1/0.1) + 0.1 log(0.1/0.1) + 0.6 log(0.6/0.5) + 0.2 log(0.2/0.3) ≈ 0.0283

Finally, we compute the approximation of the expectation, which is the average KL divergence of the parameterized policies over the n state samples:

E_s[DKL(πθ(· | s) ‖ πθ′(· | s))] ≈ (1/n) Σ_{i=1}^{n} DKL(πθ(· | s⁽ⁱ⁾) ‖ πθ′(· | s⁽ⁱ⁾))
≈ (1/5)(0.456 + 0.0283 + 0.456 + 0.456 + 0.0283)
≈ 0.285
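The arithmetic above can be verified in a few lines:

```julia
# Numerical check of the KL divergences and the sampled-state average.
kl(p, q) = sum(p_i*log(p_i/q_i) for (p_i, q_i) in zip(p, q))

d1 = kl([0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]) # ≈ 0.456
d2 = kl([0.1, 0.1, 0.6, 0.2], [0.1, 0.1, 0.5, 0.3]) # ≈ 0.0283
estimate = (3*d1 + 2*d2)/5 # ≈ 0.285, matching the samples s1, s2, s1, s1, s2
```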