0% found this document useful (0 votes)

84 views21 pages

Bridging The Gap Between Value and Policy Based Reinforcement Learning

This document summarizes a research paper that bridges the gap between value-based and policy-based reinforcement learning. It establishes a new connection between the two based on a relationship between softmax temporal value consistency and optimal policy probabilities under entropy regularization. This allows developing a new algorithm called Path Consistency Learning that minimizes soft consistency error along action sequences from both on-policy and off-policy data. Experimental results show it significantly outperforms actor-critic and Q-learning baselines on several benchmarks.

Uploaded by

nanana

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

84 views21 pages

Bridging The Gap Between Value and Policy Based Reinforcement Learning

Uploaded by

nanana

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 21

Bridging the Gap Between Value and Policy Based

Reinforcement Learning

Ofir Nachum Mohammad Norouzi Kelvin Xu Dale Schuurmans

{ofirnachum,mnorouzi,kelvinxx}@google.com, daes@ualberta.ca
Google Brain
arXiv:1702.08892v2 [cs.AI] 8 Jun 2017

Abstract

We establish a new connection between value and policy based reinforcement

learning (RL) based on a relationship between softmax temporal value consistency
and policy optimality under entropy regularization. Specifically, we show that
softmax consistent action values satisfy a strong consistency property with optimal
entropy regularized policy probabilities along any action sequence, regardless
of provenance. From this observation, we develop a new RL algorithm, Path
Consistency Learning (PCL), that minimizes a notion of soft consistency error
along multi-step action sequences extracted from both on- and off-policy traces.
We subsequently deepen the relationship by showing how a single model can be
used to represent both a policy and corresponding softmax state values. Beyond
eliminating the need for a separate critic, the unification demonstrates how policy
gradients can be stabilized via self-bootstrapping from both on- and off-policy
data. An experimental evaluation demonstrates that the PCL algorithm significantly
outperforms strong actor-critic and Q-learning baselines across several benchmarks.

1 Introduction
Model-free RL aims to acquire an effective behavior policy through trial and error interaction with a
black box environment. The goal is to optimize the quality of an agents behavior policy in terms of
the total expected discounted reward. Model-free RL has a myriad of applications in games [26, 41],
robotics [20, 21], and marketing [22, 42], to name a few. Recently, the impact of model-free RL has
been expanded through the use of deep neural networks, which promise to replace manual feature
engineering with end-to-end learning of value and policy representations. Unfortunately, a key
challenge remains how best to combine the advantages of value and policy based RL approaches in
the presence of deep function approximators, while mitigating their shortcomings. Although recent
progress has been made in combining value and policy based methods, this issue is not yet settled,
and the intricacies of each perspective are exacerbated by deep models.
The primary advantage of policy based approaches, such as REINFORCE [50], is that they directly
optimize the quantity of interest while remaining stable under function approximation (given a
sufficiently small learning rate). Their biggest drawback is inefficiency: since policy gradients are
estimated from rollouts the variance is often extreme. Although policy updates can be improved by
the use of appropriate geometry [17, 31, 36], the need for variance reduction remains paramount.
Actor-critic methods have thus become popular [37, 38, 40], because they use value approximators
to replace rollout estimates and reduce variance, at the cost of some bias. Nevertheless, on-policy
learning remains inherently data inefficient [13]: by estimating quantities defined by the current
policy, either on-policy data must be used, or updating must be sufficiently slow to avoid significant
bias. Naive importance correction is hardly able to overcome these shortcomings in practice [32, 33].
By contrast, value based methods, such as Q-learning [49, 26, 34, 47, 25], can learn from any
trajectory sampled from the same environment. Such off-policy methods are able to exploit data
1
Work done as a member of the Google Brain Residency program (g.co/brainresidency)
from other sources, such as experts, making them inherently more sample efficient than on-policy
methods [13]. Their key drawback is that off-policy learning does not stably interact with function
approximation [39, Chap.11]. The practical consequence is that extensive hyperparameter tuning can
be required to obtain stable behavior. Despite practical success [26], there is also little theoretical
understanding of how deep Q-learning might obtain near-optimal objective values.
Ideally, one would like to combine the unbiasedness and stability of on-policy training with the data
efficiency of off-policy approaches. This desire has motivated substantial recent work on off-policy
actor-critic methods, where the data efficiency of policy gradient is improved by training an off-
policy critic [23, 25, 13]. Although such methods have demonstrated improvements over on-policy
actor-critic approaches, they have not resolved the theoretical difficulty associated with off-policy
learning under function approximation. Hence, current methods remain potentially unstable and
require specialized algorithmic and theoretical development as well as delicate tuning to be effective
in practice [13, 46, 11].
In this paper, we exploit a relationship between policy optimization under entropy regularization and
softmax value consistency to obtain a new form of stable off-policy learning. Even though entropy
regularized policy optimization is a well studied topic in RL [51, 43, 44, 52, 5, 4, 6, 10]in fact, one
that has been attracting renewed recent interest [29, 14]we contribute new observations to this study
that are essential for the methods we propose: first, we identify a strong form of path consistency that
relates optimal policy probabilities under entropy regularization to softmax consistent state values
for any action sequence; second, we use this result to formulate a novel optimization objective that
allows for a stable form of off-policy actor-critic learning; finally, we observe that under this objective
the actor and critic can be unified in a single model that coherently fulfills both roles.

2 Notation & Background

We model an agents behavior by a parametric distribution (a | s) defined by a neural network over

a finite set of actions. At iteration t, the agent encounters a state st and performs an action at sampled
from (a | st ). The environment then returns a scalar reward rt and transitions to the next state st+1 .
Note: Our main results identify specific properties that hold for arbitrary action sequences. To keep
the presentation clear and focus attention on the key properties, we provide a simplified presentation
in the main body of this paper by assuming deterministic state dynamics. This restriction is not
necessary, and in Appendix C we provide a full treatment of the same concepts generalized to
stochastic state dynamics. All of the desired properties continue to hold in the general case and the
algorithms proposed remain unaffected.
For simplicity, we assume the per-step reward rt and the next state st+1 are given by functions
rt = r(st , at ) and st+1 = f (st , at ) specified by the environment. We begin the formulation
by reviewing the key elements of Q-learning [48, 49], which uses a notion of hard-max Bellman
backup to enable off-policy TD control. First, observe that the expected discounted reward objective,
OER (s, ), can be recursively expressed as,
X
OER (s, ) = (a | s) [r(s, a) + OER (s0 , )] , where s0 = f (s, a) . (1)
a

Let V (s) denote the optimal state value at a state s given by the maximum value of OER (s, ) over
policies, i.e., V (s) = max OER (s, ). Accordingly, let denote the optimal policy that results in
V (s), i.e., = argmax OER (s, ). Such an optimal policy is a one-hot distribution that assigns a
probability of 1 to an action with maximal return and 0 elsewhere. Thus we have
V (s) = OER (s, ) = max(r(s, a) + V (s0 )). (2)
a

This is the well-known hard-max Bellman temporal consistency. Instead of state values, one can
equivalently (and more commonly) express this consistency in terms of optimal action values, Q :
Q (s, a) = r(s, a) + max
0
Q (s0 , a0 ) . (3)
a

Q-learning relies on a value iteration algorithm based on (3), where Q(s, a) is bootstrapped based on
successor action values Q(s0 , a0 ).

2
3 Softmax Temporal Consistency
In this paper, we study the optimal state and action values for a softmax form of temporal consistency,
which arises by augmenting the standard expected reward objective with a discounted entropy
regularizer. Entropy regularization [51] encourages exploration and helps prevent early convergence
to sub-optimal policies, as has been confirmed in practice (e.g., [25, 28]). In this case, one can express
regularized expected reward as a sum of the expected reward and a discounted entropy term,
OENT (s, ) = OER (s, ) + H(s, ) , (4)
where 0 is a user-specified temperature parameter that controls the degree of entropy regulariza-
tion, and the discounted entropy H(s, ) is recursively defined as
X
H(s, ) = (a | s) [ log (a | s) + H(s0 , )] . (5)
a
The objective OENT (s, ) can then be re-expressed recursively as,
X
OENT (s, ) = (a | s) [r(s, a) log (a | s) + OENT (s0 , )] . (6)
a
Note that when = 1 this is equivalent to the entropy regularized objective proposed in [51].
Let V (s) = max OENT (s, ) denote the soft optimal state value at a state s and let (a | s) denote
the optimal policy at s that attains the maximum of OENT (s, ). When > 0, the optimal policy is no
longer a one-hot distribution, since the entropy term prefers the use of policies with more uncertainty.
We characterize the optimal policy (a | s) in terms of the OENT -optimal state values of successor
states V (s0 ) as a Boltzmann distribution of the form,
(a | s) exp{(r(s, a) + V (s0 ))/ } . (7)
It can be verified that this is the solution by noting that the OENT (s, ) objective is simply a -scaled
constant-shifted KL-divergence between and , hence the optimum is achieved when = .
To derive V (s) in terms of V (s0 ), the policy (a | s) can be substituted into (6), which after
some manipulation yields the intuitive definition of optimal state value in terms of a softmax (i.e.,
log-sum-exp) backup,
X
V (s) = OENT (s, ) = log exp{(r(s, a) + V (s0 ))/ } . (8)
a
Note that in the 0 limit one recovers the hard-max state values defined in (2). Therefore we can
equivalently state softmax temporal consistency in terms of optimal action values Q (s, a) as,
X
Q (s, a) = r(s, a) + V (s0 ) = r(s, a) + log 0
exp(Q (s0 , a0 )/ ) . (9)
a
Now, much like Q-learning, the consistency equation (9) can be used to perform one-step backups
to asynchronously bootstrap Q (s, a) based on Q (s0 , a0 ). In Appendix C we prove that such a
procedure, in the tabular case, converges to a unique fixed point representing the optimal values.
We point out that similar notions of softmax Q-values have been studied in previous work (e.g., [5, 3,
10]). Concurrently to our work, [14] has also proposed a soft Q-learning algorithm for continuous
control that is based on a similar notion of softmax temporal consistency. However, we contribute
new observations below that lead to the novel training principles we explore.

4 Consistency Between Optimal Value & Policy

We now describe the main technical contributions of this paper, which lead to the development of
two novel off-policy RL algorithms in Section 5. The first key observation is that, for the softmax
value function V in (8), the quantity exp{V (s)/ } also serves as the normalization factor of the
optimal policy (a | s) in (7); that is,
exp{(r(s, a) + V (s0 ))/ }
(a | s) = . (10)
exp{V (s)/ }
Manipulation of (10) by taking the log of both sides then reveals an important connection between
the optimal state value V (s), the value V (s0 ) of the successor state s0 reached from any action a
taken in s, and the corresponding action probability under the optimal log-policy, log (a | s).

3
Theorem 1. For > 0, the policy that maximizes OENT and state values V (s) = max OENT (s, )
satisfy the following temporal consistency property for any state s and action a (where s0 = f (s, a)),
V (s) V (s0 ) = r(s, a) log (a | s) . (11)

Proof. All theorems are established for the general case of a stochastic environment and discounted
infinite horizon problems in Appendix C. Theorem 1 follows as a special case.

Note that one can also characterize in terms of Q as

(a | s) = exp{(Q (s, a) V (s))/ } . (12)

An important property of the one-step softmax consistency established in (11) is that it can be
extended to a multi-step consistency defined on any action sequence from any given state. That is, the
softmax optimal state values at the beginning and end of any action sequence can be related to the
rewards and optimal log-probabilities observed along the trajectory.
Corollary 2. For > 0, the optimal policy and optimal state values V satisfy the following
extended temporal consistency property, for any state s1 and any action sequence a1 , ..., at1 (where
si+1 = f (si , ai )):
t1
X
V (s1 ) t1 V (st ) = i1 [r(si , ai ) log (ai | si )] . (13)
i=1

Proof. The proof in Appendix C applies (the generalized version of) Theorem 1 to any s1 and
sequence a1 , ..., at1 , summing the left and right hand sides of (the generalized version of) (11) to
induce telescopic cancellation of intermediate state values. Corollary 2 follows as a special case.

Importantly, the converse of Theorem 1 (and Corollary 2) also holds:

Theorem 3. If a policy (a | s) and state value function V (s) satisfy the consistency property (11)
for all states s and actions a (where s0 = f (s, a)), then = and V = V . (See Appendix C.)
Theorem 3 motivates the use of one-step and multi-step path-wise consistencies as the foundation
of RL algorithms that aim to learn parameterized policy and value estimates by minimizing the
discrepancy between the left and right hand sides of (11) and (13).

5 Path Consistency Learning (PCL)

The temporal consistency properties between the optimal policy and optimal state values devel-
oped above lead to a natural path-wise objective for training a policy , parameterized by ,
and a state value function V , parameterized by , via the minimization of a soft consistency
error. Based on (13), we first define a notion of soft consistency for a d-length sub-trajectory
si:i+d (si , ai , . . . , si+d1 , ai+d1 , si+d ) as a function of and :
Xd1
C(si:i+d , , ) = V (si ) + d V (si+d ) + j [r(si+j , ai+j ) log (ai+j | si+j )] . (14)
j=0

The goal of a learning algorithm can then be to find V and such that C(si:i+d , , ) is as close to
0 as possible for all sub-trajectories si:i+d . Accordingly, we propose a new learning algorithm, called
Path Consistency Learning (PCL), that attempts to minimize the squared soft consistency error over a
set of sub-trajectories E,
X 1
OPCL (, ) = C(si:i+d , , )2 . (15)
2
si:i+d E

The PCL update rules for and are derived by calculating the gradient of (15). For a given trajectory
si:i+d these take the form,
Xd1
= C(si:i+d , , ) j log (ai+j | si+j ) , (16)
j=0

= v C(si:i+d , , ) V (si ) d V (si+d ) ,

(17)

4
where v and denote the value and policy learning rates respectively. Given that the consistency
property must hold on any path, the PCL algorithm applies the updates (16) and (17) both to
trajectories sampled on-policy from as well as trajectories sampled from a replay buffer. The
union of these trajectories comprise the set E used in (15) to define OPCL .
Specifically, given a fixed rollout parameter d, at each iteration, PCL samples a batch of on-policy
trajectories and computes the corresponding parameter updates for each sub-trajectory of length d.
Then PCL exploits off-policy trajectories by maintaining a replay buffer and applying additional
updates based on a batch of episodes sampled from the buffer at each iteration. We have found it
beneficial to sample replay episodes proportionally to exponentiated reward, mixed with a uniform
distribution, although we did not exhaustively experiment with this sampling procedure. In particular,
we sample a full episode s0:T from the replay buffer of size B with probability 0.1/B + 0.9
PT 1
exp( i=0 r(si , ai ))/Z, where we use no discounting on the sum of rewards, Z is a normalization
factor, and is a hyper-parameter. Pseudocode of PCL is provided in the Appendix.

5.1 Unified Path Consistency Learning (Unified PCL)

The PCL algorithm maintains a separate model for the policy and the state value approximation.
However, given the soft consistency between the state and action value functions (e.g.,in (9)), one can
express the soft consistency errors strictly in terms of Q-values. Let Q denote a model of action
values parameterized by , based on which one can estimate both the state values and the policy as,
X
V (s) = log exp{Q (s, a)/ } , (18)
a
(a | s) = exp{(Q (s, a) V (s))/ } . (19)
Given this unified parameterization of policy and value, we can formulate an alternative algo-
rithm, called Unified Path Consistency Learning (Unified PCL), which optimizes the same objective
(i.e., (15)) as PCL but differs by combining the policy and value function into a single model. Merging
the policy and value function models in this way is significant because it presents a new actor-critic
paradigm where the policy (actor) is not distinct from the values (critic). We note that in practice,
we have found it beneficial to apply updates to from V and using different learning rates, very
much like PCL. Accordingly, the update rule for takes the form,
Xd1
= C(si:i+d , ) j log (ai+j | si+j ) + (20)
j=0

v C(si:i+d , ) V (si ) d V (si+d ) .

(21)

5.2 Connections to Actor-Critic and Q-learning

To those familiar with advantage-actor-critic methods [25] (A2C and its asynchronous analogue A3C)
PCLs update rules might appear to be similar. In particular, advantage-actor-critic is an on-policy
method that exploits the expected value function,
X
V (s) = (a | s) [r(s, a) + V (s0 )] , (22)
a

to reduce the variance of policy gradient, in service of maximizing the expected reward. As in PCL,
two models are trained concurrently: an actor that determines the policy, and a critic V that is
trained to estimate V . A fixed rollout parameter d is chosen, and the advantage of an on-policy
trajectory si:i+d is estimated by
Xd1
A(si:i+d , ) = V (si ) + d V (si+d ) + j r(si+j , ai+j ) . (23)
j=0

The advantage-actor-critic updates for and can then be written as,

= Esi:i+d | [A(si:i+d , ) log (ai |si )] , (24)
= v Esi:i+d | [A(si:i+d , ) V (si )] , (25)
where the expectation Esi:i+d | denotes sampling from the current policy . These updates exhibit a
striking similarity to the updates expressed in (16) and (17). In fact, if one takes PCL with 0
and omits the replay buffer, a slight variation of A2C is recovered. In this sense, one can interpret PCL

5
as a generalization of A2C. Moreover, while A2C is restricted to on-policy samples, PCL minimizes
an inconsistency measure that is defined on any path, hence it can exploit replay data to enhance its
efficiency via off-policy learning.
It is also important to note that for A2C, it is essential that V tracks the non-stationary target V
to ensure suitable variance reduction. In PCL, no such tracking is required. This difference is more
dramatic in Unified PCL, where a single model is trained both as an actor and a critic. That is, it is
not necessary to have a separate actor and critic; the actor itself can serve as its own critic.
One can also compare PCL to hard-max temporal consistency RL algorithms, such as Q-learning [48].
In fact, setting the rollout to d = 1 in Unified PCL leads to a form of soft Q-learning, with the degree
of softness determined by . We therefore conclude that the path consistency-based algorithms
developed in this paper also generalize Q-learning. Importantly, PCL and Unified PCL are not
restricted to single step consistencies, which is a major limitation of Q-learning. While some
have proposed using multi-step backups for hard-max Q-learning [30, 25], such an approach is not
theoretically sound, since the rewards received after a non-optimal action do not relate to the hard-max
Q-values Q . Therefore, one can interpret the notion of temporal consistency proposed in this paper
as a sound generalization of the one-step temporal consistency given by hard-max Q-values.

6 Related Work

Connections between softmax Q-values and optimal entropy-regularized policies have been previously
noted. In some cases entropy regularization is expressed in the form of relative entropy [4, 6, 10, 35],
and in other cases it is the standard entropy [52]. While these papers derive similar relationships to (7)
and (8), they stop short of stating the single- and multi-step consistencies over all action choices we
highlight. Moreover, the algorithms proposed in those works are essentially single-step Q-learning
variants, which suffer from the limitation of using single-step backups. Another recent work [29]
uses the softmax relationship in the limit of 0 and proposes to augment an actor-critic algorithm
with offline updates that minimize a set of single-step hard-max Bellman errors. Again, the methods
we propose are differentiated by the multi-step path-wise consistencies which allow the resulting
algorithms to utilize multi-step trajectories from off-policy samples in addition to on-policy samples.
The proposed PCL and Unified PCL algorithms bear some similarity to multi-step Q-learning [30],
which rather than minimizing one-step hard-max Bellman error, optimizes a Q-value function
approximator by unrolling the trajectory for some number of steps before using a hard-max backup.
While this method has shown some empirical success [25], its theoretical justification is lacking,
since rewards received after a non-optimal action no longer relate to the hard-max Q-values Q . In
contrast, the algorithms we propose incorporate the log-probabilities of the actions on a multi-step
rollout, which is crucial for the version of softmax consistency we consider.
Other notions of temporal consistency similar to softmax consistency have been discussed in the RL
literature. Previous work has used a Boltzmann weighted average operator [24, 5]. In particular, this
operator has been used by [5] to propose an iterative algorithm converging to the optimal maximum
reward policy inspired by the work of [18, 43]. While they use the Boltzmann weighted average,
they briefly mention that a softmax (log-sum-exp) operator would have similar theoretical properties.
More recently [3] proposed a mellowmax operator, defined as log-average-exp. These log-average-
exp operators share a similar non-expansion property, and the proofs of non-expansion are related.
Additionally it is possible to show that when restricted to an infinite horizon setting, the fixed point
of the mellowmax operator is a constant shift of the Q investigated here. In all these cases, the
suggested training algorithm optimizes a single-step consistency unlike PCL and Unified PCL, which
optimizes a multi-step consistency. Moreover, these papers do not present a clear relationship between
the action values at the fixed point and the entropy regularized expected reward objective, which was
key to the formulation and algorithmic development in this paper.
Finally, there has been a considerable amount of work in reinforcement learning using off-policy data
to design more sample efficient algorithms. Broadly speaking, these methods can be understood as
trading off bias [40, 38, 23, 12] and variance [32, 27]. Previous work that has considered multi-step
off-policy learning has typically used a correction (e.g., via importance-sampling [33] or truncated
importance sampling with bias correction [27], or eligibility traces [32]). By contrast, our method
defines an unbiased consistency for an entire trajectory applicable to on- and off-policy data. An
empirical comparison with all these methods remains however an interesting avenue for future work.

6
Synthetic Tree Copy DuplicatedInput RepeatCopy
35 16 100
20 30 14
80
12
25
15
10 60
20
8
10 15
6 40
10
4
5 20
5 2

0 0 0 0
0 50 100 0 1000 2000 0 1000 2000 3000 0 2000 4000

Reverse ReversedAddition ReversedAddition3 Hard ReversedAddition

35 20 30
30
30 25
25
15
25
20 20
20
15 10 15
15
10 10
10
5
5 5 5

0 0 0 0
0 5000 10000 0 5000 10000 0 20000 40000 60000 0 5000 10000

PCL A3C DQN

Figure 1: The results of PCL against A3C and DQN baselines. Each plot shows average reward
across 5 random training runs (10 for Synthetic Tree) after choosing best hyperparameters. We also
show a single standard deviation bar clipped at the min and max. The x-axis is number of training
iterations. PCL exhibits comparable performance to A3C in some tasks, but clearly outperforms A3C
on the more challenging tasks. Across all tasks, the performance of DQN is worse than PCL.

7 Experiments

We evaluate the proposed algorithms, namely PCL & Unified PCL, across several different tasks and
compare them to an A3C implementation, based on [25], and an implementation of double Q-learning
with prioritized experience replay, based on [34]. We find that PCL can consistently match or beat the
performance of these baselines. We also provide a comparison between PCL and Unified PCL and
find that the use of a single unified model for both values and policy can be competitive with PCL.
These new algorithms are easily amenable to incorporate expert trajectories. Thus, for the more
difficult tasks we also experiment with seeding the replay buffer with 10 randomly sampled expert
trajectories. During training we ensure that these trajectories are not removed from the replay buffer
and always have a maximal priority.
The details of the tasks and the experimental setup are provided in the Appendix.

7.1 Results

We present the results of each of the variants PCL, A3C, and DQN in Figure 1. After finding the
best hyperparameters (see Section B.3), we plot the average reward over training iterations for five
randomly seeded runs. For the Synthetic Tree environment, the same protocol is performed but with
ten seeds instead.
The gap between PCL and A3C is hard to discern in some of the more simple tasks such as Copy,
Reverse, and RepeatCopy. However, a noticeable gap is observed in the Synthetic Tree and Dupli-
catedInput results and more significant gaps are clear in the harder tasks, including ReversedAddition,
ReversedAddition3, and Hard ReversedAddition. Across all of the experiments, it is clear that the
prioritized DQN performs worse than PCL. These results suggest that PCL is a competitive RL
algorithm, which in some cases significantly outperforms strong baselines.
We compare PCL to Unified PCL in Figure 2. The same protocol is performed to find the best
hyperparameters and plot the average reward over several training iterations. We find that using a
single model for both values and policy in Unified PCL is slightly detrimental on the simpler tasks,
but on the more difficult tasks Unified PCL is competitive or even better than PCL.
We present the results of PCL along with PCL augmented with expert trajectories in Figure 3. We
observe that the incorporation of expert trajectories helps a considerable amount. Despite only
using a small number of expert trajectories (i.e., 10) as opposed to the mini-batch size of 400, the

7
Synthetic Tree Copy DuplicatedInput RepeatCopy
20 35 16 100

30 14
80
15 12
25
10 60
20
10 8
15 40
6
10
5 4
20
5 2
0 0 0 0
0 50 100 0 1000 2000 0 1000 2000 3000 0 2000 4000

Reverse ReversedAddition ReversedAddition3 Hard ReversedAddition

30 30 30 30

25 25 25 25

20 20 20 20

15 15 15 15

10 10 10 10

5 5 5 5

0 0 0 0
0 5000 10000 0 5000 10000 0 20000 40000 60000 0 5000 10000

PCL Unified PCL

Figure 2: The results of PCL vs. Unified PCL. Overall we find that using a single model for both
values and policy is not detrimental to training. Although in some of the simpler tasks PCL has an
edge over Unified PCL, on the more difficult tasks, Unified PCL preforms better.
Reverse ReversedAddition ReversedAddition3 Hard ReversedAddition
30 30 30 30

25 25 25 25

20 20 20 20

15 15 15 15

10 10 10 10

5 5 5 5

0 0 0 0
0 2000 4000 0 2000 4000 0 20000 40000 60000 0 5000 10000

PCL PCL + Expert

Figure 3: The results of PCL vs. PCL augmented with a small number of expert trajectories on the
hardest algorithmic tasks. We find that incorporating expert trajectories greatly improves performance.
inclusion of expert trajectories in the training process significantly improves the agents performance.
We performed similar experiments with Unified PCL and observed a similar lift from using expert
trajectories. Incorporating expert trajectories in PCL is relatively trivial compared to the specialized
methods developed for other policy based algorithms [2, 15]. While we did not compare to other
algorithms that take advantage of expert trajectories, this success shows the promise of using path-
wise consistencies. Importantly, the ability of PCL to incorporate expert trajectories without requiring
adjustment or correction is a desirable property in real-world applications such as robotics.

8 Conclusion

We study the characteristics of the optimal policy and state values for a maximum expected reward
objective in the presence of discounted entropy regularization. The introduction of an entropy
regularizer induces an interesting softmax consistency between the optimal policy and optimal state
values, which may be expressed as either a single-step or multi-step consistency. This softmax
consistency led us to develop Path Consistency Learning (PCL), an RL algorithm that resembles
actor-critic in that it maintains and jointly learns a model of the state values and a model of the policy,
and is similar to Q-learning in that it minimizes a measure of temporal consistency error. We also
propose the variant Unified PCL which maintains a single model for both the policy and the values,
thus upending the actor-critic paradigm of separating the actor from the critic. Unlike standard policy
based RL algorithms, PCL and Unified PCL apply to both on-policy and off-policy trajectory samples.
Further, unlike value based RL algorithms, PCL and Unified PCL can take advantage of multi-step
consistencies. Empirically, PCL and Unified PCL exhibit a significant improvement over baseline
methods across several algorithmic benchmarks.

8
9 Acknowledgment
We thank Rafael Cosman, Brendan ODonoghue, Volodymyr Mnih, George Tucker, Irwan Bello, and
the Google Brain team for insightful comments and discussions.

References
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving,
M. Isard, et al. Tensorflow: A system for large-scale machine learning. arXiv:1605.08695,
2016.
[2] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In
Proceedings of the twenty-first international conference on Machine learning, page 1. ACM,
2004.
[3] K. Asadi and M. L. Littman. A new softmax operator for reinforcement learning.
arXiv:1612.05628, 2016.
[4] M. G. Azar, V. Gmez, and H. J. Kappen. Dynamic policy programming with function
approximation. AISTATS, 2011.
[5] M. G. Azar, V. Gmez, and H. J. Kappen. Dynamic policy programming. JMLR, 13(Nov),
2012.
[6] M. G. Azar, V. Gmez, and H. J. Kappen. Optimal control as a graphical model inference
problem. Mach. Learn. J., 87, 2012.
[7] D. P. Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific,
1995.
[8] J. Borwein and A. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2000.
[9] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba.
OpenAI Gym. arXiv:1606.01540, 2016.
[10] R. Fox, A. Pakman, and N. Tishby. G-learning: Taming the noise in reinforcement learning via
soft updates. UAI, 2016.
[11] A. Gruslys, M. G. Azar, M. G. Bellemare, and R. Munos. The reactor: A sample-efficient
actor-critic architecture. arXiv preprint arXiv:1704.04651, 2017.
[12] S. Gu, E. Holly, T. Lillicrap, and S. Levine. Deep reinforcement learning for robotic manipula-
tion with asynchronous off-policy updates. ICRA, 2016.
[13] S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine. Q-Prop: Sample-efficient
policy gradient with an off-policy critic. ICLR, 2017.
[14] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based
policies. arXiv:1702.08165, 2017.
[15] J. Ho and S. Ermon. Generative adversarial imitation learning. In Advances in Neural Informa-
tion Processing Systems, pages 45654573, 2016.
[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 1997.
[17] S. Kakade. A natural policy gradient. NIPS, 2001.
[18] H. J. Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of
statistical mechanics: theory and experiment, 2005(11):P11011, 2005.
[19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[20] J. Kober, J. A. Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. IJRR, 2013.

9
[21] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies.
JMLR, 17(39), 2016.
[22] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized
news article recommendation. 2010.
[23] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra.
Continuous control with deep reinforcement learning. ICLR, 2016.
[24] M. L. Littman. Algorithms for sequential decision making. PhD thesis, Brown University, 1996.
[25] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and
K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. ICML, 2016.
[26] V. Mnih, K. Kavukcuoglu, D. Silver, et al. Human-level control through deep reinforcement
learning. Nature, 2015.
[27] R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient off-policy
reinforcement learning. NIPS, 2016.
[28] O. Nachum, M. Norouzi, and D. Schuurmans. Improving policy gradient by exploring under-
appreciated rewards. ICLR, 2017.
[29] B. ODonoghue, R. Munos, K. Kavukcuoglu, and V. Mnih. PGQ: Combining policy gradient
and Q-learning. ICLR, 2017.
[30] J. Peng and R. J. Williams. Incremental multi-step Q-learning. Machine learning, 22(1-3):283
290, 1996.
[31] J. Peters, K. Mling, and Y. Altun. Relative entropy policy search. AAAI, 2010.
[32] D. Precup. Eligibility traces for off-policy policy evaluation. Computer Science Department
Faculty Publication Series, page 80, 2000.
[33] D. Precup, R. S. Sutton, and S. Dasgupta. Off-policy temporal-difference learning with function
approximation. 2001.
[34] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. ICLR, 2016.
[35] J. Schulman, X. Chen, and P. Abbeel. Equivalence between policy gradients and soft Q-learning.
arXiv:1704.06440, 2017.
[36] J. Schulman, S. Levine, P. Moritz, M. Jordan, and P. Abbeel. Trust region policy optimization.
ICML, 2015.
[37] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous
control using generalized advantage estimation. ICLR, 2016.
[38] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy
gradient algorithms. ICML, 2014.
[39] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, 2nd edition,
2017. Preliminary Draft.
[40] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, et al. Policy gradient methods for
reinforcement learning with function approximation. NIPS, 1999.
[41] G. Tesauro. Temporal difference learning and TD-gammon. CACM, 1995.
[42] G. Theocharous, P. S. Thomas, and M. Ghavamzadeh. Personalized ad recommendation systems
for life-time value optimization with guarantees. IJCAI, 2015.
[43] E. Todorov. Linearly-solvable Markov decision problems. NIPS, 2006.
[44] E. Todorov. Policy gradients in linearly-solvable MDPs. NIPS, 2010.

10
[45] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function
approximation. IEEE Transactions on Automatic Control, 42(5), 1997.
[46] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas. Sample
efficient actor-critic with experience replay. ICLR, 2017.
[47] Z. Wang, N. de Freitas, and M. Lanctot. Dueling network architectures for deep reinforcement
learning. ICLR, 2016.
[48] C. J. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge England,
1989.
[49] C. J. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279292, 1992.
[50] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement
learning. Mach. Learn. J., 1992.
[51] R. J. Williams and J. Peng. Function optimization using connectionist reinforcement learning
algorithms. Connection Science, 1991.
[52] B. D. Ziebart. Modeling purposeful adaptive behavior with the principle of maximum causal
entropy. PhD thesis, CMU, 2010.

11
A Pseudocode

Algorithm 1 Path Consistency Learning

Input: Environment EN V , learning rates , v , discount factor , rollout d, number of steps N ,
replay buffer capacity B, prioritized replay hyperparameter .
function Gradients(s0:T )
// We use G(st:t+d , ) to denote a discounted sum of log-probabilities from st to st+d .
PT 1
Compute = t=0 C, (st:t+d ) G(st:t+d , ).
Compute =
PT 1 d

t=0 C, (st:t+d ) V (st ) V (st+d ) .
Return ,
end function
Initialize , .
Initialize empty replay buffer RB().
for i = 0 to N 1 do
Sample s0:T (s0: ) on EN V .
, = Gradients(s0:T ).
Update + .
Update + V .
Input s0:T into RB with priority R1 (s0:T ).
If |RB| > B, remove episodes uniformly at random.
Sample s0:T from RB.
, = Gradients(s0:T ).
Update + .
Update + v .
end for

B Experimental Details
We describe the tasks we experimented on as well as details of the experimental setup.

B.1 Synthetic Tree

As an initial testbed, we developed a simple synthetic environment. The environment is defined

by a binary decision tree of depth 20. For each training run, the reward on each edge is sampled
uniformly from [1, 1] and subsequently normalized so that the maximal reward trajectory has total
reward 20. We trained using a fully-parameterized model: for each node s in the decision tree there
are two parameters to determine the logits of (|s) and one parameter to determine V (s). In
the Q-learning and Unified PCL implementations only two parameters per node s are needed to
determine the Q-values.

B.2 Algorithmic Tasks

For more complex environments, we evaluated PCL, Unified PCL, and the two baselines on the
algorithmic tasks from the OpenAI Gym library [9]. This library provides six tasks, in rough order of
difficulty: Copy, DuplicatedInput, RepeatCopy, Reverse, ReversedAddition, and ReversedAddition3.
In each of these tasks, an agent operates on a grid of characters or digits, observing one character or
digit at a time. At each time step, the agent may move one step in any direction and optionally write
a character or digit to output. A reward is received on each correct emission. The agents goal for
each task is:

Copy: Copy a 1 n sequence of characters to output.

DuplicatedInput: Deduplicate a 1 n sequence of characters.
RepeatCopy: Copy a 1 n sequence of characters first in forward order, then reverse, and
finally forward again.

12
Reverse: Copy a 1 n sequence of characters in reverse order.
ReversedAddition: Observe two ternary numbers in little-endian order via a 2 n grid
and output their sum.
ReversedAddition3: Observe three ternary numbers in little-endian order via a 3 n grid
and output their sum.

These environments have an implicit curriculum associated with them. To observe the performance
of our algorithm without curriculum, we also include a task Hard ReversedAddition which has the
same goal as ReversedAddition but does not utilize curriculum.
For these environments, we parameterized the agent by a recurrent neural network with LSTM [16]
cells of hidden dimension 128.

B.3 Implementation Details

For our hyperparameter search, we found it simple to parameterize the critic learning rate in terms of
the actor learning rate as v = C , where C is the critic weight.
For the Synthetic Tree environment we used a batch size of 10, rollout of d = 3, discount of
= 1.0, and a replay buffer capacity of 10,000. We fixed the parameter for PCLs replay
buffer to 1 and used = 0.05 for DQN. To find the optimal hyperparameters, we performed an
extensive grid search over actor learning rate {0.01, 0.05, 0.1}; critic weight C {0.1, 0.5, 1};
entropy regularizer {0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0} for A3C, PCL, Unified PCL;
and {0.1, 0.3, 0.5, 0.7, 0.9}, {0.2, 0.4, 0.6, 0.8, 1.0} for DQN replay buffer parameters. We
used standard gradient descent for optimization.
For the algorithmic tasks we used a batch size of 400, rollout of d = 10, a replay buffer of capacity
100,000, ran using distributed training with 4 workers, and fixed the actor learning rate to
0.005, which we found to work well across all variants. To find the optimal hyperparameters, we
performed an extensive grid search over discount {0.9, 1.0}, {0.1, 0.5} for PCLs replay
buffer; critic weight C {0.1, 1}; entropy regularizer {0.005, 0.01, 0.025, 0.05, 0.1, 0.15};
{0.2, 0.4, 0.6, 0.8}, {0.06, 0.2, 0.4, 0.5, 0.8} for the prioritized DQN replay buffer; and
also experimented with exploration rates {0.05, 0.1} and copy frequencies for the target DQN,
{100, 200, 400, 600}. In these experiments, we used the Adam optimizer [19].
All experiments were implemented using Tensorflow [1].

C Proofs

In this section, we provide a general theoretical foundation for this work, including proofs of the main
path consistency results. We first establish the basic results for a simple one-shot decision making
setting. These initial results will be useful in the proof of the general infinite horizon setting.
Although the main paper expresses the main claims under an assumption of deterministic dynamics,
this assumption is not necessary: we restricted attention to the deterministic case in the main body
merely for clarity and ease of explanation. Given that in this appendix we provide the general
foundations for this work, we consider the more general stochastic setting throughout the later
sections.
In particular, for the general stochastic, infinite horizon setting, we introduce and discuss the entropy
regularized expected return OEN T and define a softmax operator B (analogous to the Bellman
operator for hard-max Q-values). We then show the existence of a unique fixed point V of B , by
establishing that the softmax Bellman operator (B ) is a contraction under the infinity norm. We
then relate V to the optimal value of the entropy regularized expected reward objective OENT , which
we term V . We are able to show that V = V , as expected. Subsequently, we present a policy
determined by V that satisfies V (s) = OENT (s, ). Then given the characterization of in terms
of V , we establish the consistency property stated in Theorem 1 of the main text. Finally, we show
that a consistent solution is optimal by satisfying the KKT conditions of the constrained optimization
problem (establishing Theorem 4 of the main text).

13
C.1 Basic results for one-shot entropy regularized optimization

For > 0 and any vector q Rn , n < , define the scalar valued function F (the softmax) by
n
!
X
F (q) = log eqa / (26)
a=1

and define the vector valued function f (the soft indmax) by

eq/
f (q) = Pn qa /
= e(qF (q))/ , (27)
a=1 e

where the exponentiation is component-wise. It is easy to verify that f = F . Note that

f maps any real valued vector to a probability vector. We denote the probability simplex by
= { : 0, 1 = 1}, and denote the entropy function by H() = log .
Lemma 4.
n o
F (q) = max q + H() (28)

= f (q) q + H(f (q)) (29)

Proof. First consider the constrained optimization problem on the right hand side of (28). The
Lagrangian is given by L = (q log ) + (1 1 ), hence L = q log .
The KKT conditions for this optimization problems are the following system of n + 1 equations
1 =1 (30)
log = q v (31)
for the n + 1 unknowns, and v, where v = + . Note that for any v, satisfying (31) requires the
unique assignment P = exp((q v)/ ), whichP also ensures > 0. To subsequently satisfy (30),
the equation 1 = a exp((qa v)/ ) = ev/ a exp(qa / ) must be solved for v; since the right
hand side is strictly decreasing in v, the solution is also unique and in this case given by v = F (q).
Therefore = f (q) and v = F (q) provide the unique solution to the KKT conditions (30)-(31).
Since the objective is strictly concave, must be the unique global maximizer, establishing (29). It is
then easy to show F (q) = f (q) q + H(f (q)) by algebraic manipulation, which establishes
(28).
n o
Corollary 5 (Optimality Implies Consistency). If v = max q + H() then

v = qa log a for all a, (32)

where = f (q).

Proof. From Lemma 4 we know v = F (q) = (q log ) where = f (q). From

the definition of f it also follows that log a = (qa F (q))/ for all a, hence v = F (q) =
qa log a for all a.
Corollary 6 (Consistency Implies Optimality). If v R and jointly satisfy
v = qa log a for all a, (33)
then v = F (q) and = f (q); that is, must be an optimizer for (28) and v is its corresponding
optimal value.

Proof. Any v and that jointly satisfy (33) must also satisfy the KKT conditions (30)-(31);
hence must be the unique maximizer for (28) and v its corresponding objective value.

Although these results are elementary, they reveal a strong connection between optimal state values
(v), optimal action values (q) and optimal policies () under the softmax operators. In particular,
Lemma 4 states that, if q is an optimal action value at some current state, the optimal state value must
be v = F (q), which is simply the entropy regularized value of the optimal policy, = f (q), at the
current state.

14
Corollaries 5 and 6 then make the stronger observation that this mutual consistency between the
optimal state value, optimal action values and optimal policy probabilities must hold for every
action, not just in expectation over actions sampled from ; and furthermore that achieving mutual
consistency in this form is equivalent to achieving optimality.
Below we will also need to make use of the following properties of F .
Lemma 7. For any vector q,
n o
F (q) = sup p q p log p . (34)
p

Proof. Let F denote the conjugate of F , which is given by

n o
F (p) = sup q p F (q) = p log p (35)
q

for p dom(F ) = . Since F is closed and convex, we also have that F = F [8, Section 4.2];
hence
n o
F (q) = sup q p F (p) . (36)
p

Lemma 8. For any two vectors q(1) and q(2) ,

n o
F (q(1) ) F (q(2) ) max qa(1) qa(2) . (37)
a

Proof. Observe that by Lemma 7

n o n o
F (q(1) ) F (q(2) ) = sup q(1) p(1) F (p(1) ) sup q(2) p(2) F (p(2) ) (38)
p(1) p(2)
n n oo
= sup inf q(1) p(1) q(2) p(2) (F (p(1) ) F (p(2) ))
(2)
p(1) p
(39)
n o
sup p(1) (q(1) q(2) ) by choosing p (2) (1)
=p (40)
p(1)
n o
max qa(1) qa(2) . (41)
a

Corollary 9. F is an -norm contraction; that is, for any two vectors q(1) and q(2) ,

F (q(1) ) F (q(2) ) kq(1) q(2) k (42)

Proof. Immediate from Lemma 8.

C.2 Background results for on-policy entropy regularized updates

Although the results in the main body of the paper are expressed in terms of deterministic problems,
we will prove that all the desired properties hold for the more general stochastic case, where there is
a stochastic transition s, a 7 s0 determined by the environment. Given the characterization for this
general case, the application to the deterministic case is immediate. We continue to assume that the
action space is finite, and that the state space is discrete.
For any policy , define the entropy regularized expected return by
" #
X
i

V (s` ) = OENT (s` , ) = Ea` s`+1 ...|s` r(s`+i , a`+i ) log (a`+i |s`+i ) , (43)
i=0

15
where the expectation is taken with respect to the policy and with respect to the stochastic state
transitions determined by the environment. We will find it convenient to also work with the on-policy
Bellman operator defined by
h i
(B V )(s) = Ea,s0 |s r(s, a) log (a|s) + V (s0 ) (44)
h i
= Ea|s r(s, a) log (a|s) + Es0 |s,a V (s0 )

(45)
= (: |s) (Q(s, :) log (: |s)), where (46)
0
Q(s, a) = r(s, a) + Es0 |s,a [V (s )] (47)
for each state s and action a. Note that in (46) we are using Q(s, :) to denote a vector values over
choices of a for a given s, and (: |s) to denote the vector of conditional action probabilities specified
by at state s.
Lemma 10. For any policy and state s, V (s) satisfies the recurrence
h i
V (s) = Ea|s r(s, a) + Es0 |s,a [V (s0 )] log (a|s) (48)

= (: |s) Q (s, :) log (: |s) where Q (s, a) = r(s, a) + Es0 |s,a [V (s0 )] (49)
= (B V )(s). (50)

Moreover, B is a contraction mapping.

Proof. Consider an arbitrary state s` . By the definition of V (s` ) in (43) we have

" #
X
i

V (s` ) = Ea` s`+1 ...|s` r(s`+i , a`+i ) log (a`+i |s`+i ) (51)
i=0
"
= Ea` s`+1 ...|s` r(s` , a` ) log (a` |s` ) (52)

#
X
j

+ r(s`+1+j , a`+1+j ) log (a`+1+j |s`+1+j )
j=0
"
= Ea` |s` r(s` , a` ) log (a` |s` ) (53)

Note that this lemma shows V is a fixed point of the corresponding on-policy Bellman operator B .
Next, we characterize how quickly convergence to a fixed point is achieved by repeated application
of ther B operator.
Lemma 11. For any and any V , for all states s` , and for all k 0 it holds that:
(B )k V (s` ) V (s` ) = k Ea` s`+1 ...s`+k |s` V (s`+k ) V (s`+k ) .

Proof. Consider an arbitrary state s` . We use an induction

on k. For the basecase, consider k = 0 and
observe that the claim follows trivially, since (B ) V (s` ) (B ) V (s` ) = V (s` ) V (s` ).
0 0

16
For the induction hypothesis, assume the result holds for k. Then consider:

(B )k+1 V (s` ) V (s` )

= (B )k+1 V (s` ) (B )k+1 V (s` ) (by Lemma 10)

(57)
= B (B )k V (s` ) B (B )k V (s` )

(58)
h i
= Ea` s`+1 |s` r(s` , a` ) log (a` |s` ) + (B )k V (s`+1 )
h i
Ea` s`+1 |s` r(s` , a` ) log (a` |s` ) + (B )k V (s`+1 ) (59)
h i
= Ea` s`+1 |s` (B )k V (s`+1 ) (B )k V (s`+1 ) (60)
h i
= Ea` s`+1 |s` (B )k V (s`+1 ) V (s`+1 ) (by Lemma 10) (61)
h i
= Ea` s`+1 |s` k Ea`+1 s`+2 ...s`+k+1 |s`+1 V (s`+k+1 ) V (s`+k+1 )

(by IH) (62)
h i
= k+1 Ea` s`+1 ...s`+k+1 |s` V (s`+k+1 ) V (s`+k+1 ) , (63)

establishing the claim.

Lemma 12. For any and any V , we have (B )k V V k V V .

Proof. Let p(k) (s`+k |s` ) denote the conditional distribution over the kth state, s`+k , visited in a
random walk starting from s` , which is induced by the environment and the policy . Consider
k
(B ) V V = k max Ea s ...s |s V (s`+k ) V (s`+k ) (by Lemma 11) (64)

s` ` `+1 `+k `
X
k
= max p(k) (s`+k |s` ) V (s`+k ) V (s`+k ) (65)

s`
s`+k

= k max p(k) (: |s` ) V V (66)

s`

k max kp(k) (: |s` )k1 kV V k (by Hlders inequality) (67)

= k kV V k . (68)

Corollary 13. For any bounded V and any > 0 there exists a k0 such that (B )k V V for
all k k0 .

Proof. By Lemma 12 we have (B )k V V k V V for all k 0. Therefore, for any

> 0 there exists a k0 such that k V V < for all k k0 , since V is assumed bounded.

Thus, any value function will converge to V via repeated application of on-policy backups B .
Below we will also need to make use of the following monotonicity property of the on-policy Bellman
operator.
Lemma 14. For any , if V (1) V (2) then B V (1) B V (2) .

Proof. Assume V (1) V (2) and note that for any state s`

(B V (2) )(s` ) (B V (1) )(s` ) = Ea` s`+1 |s` V (2) (s`+1 ) V (1) (s`+1 )

(69)
(2) (1)
0 since it was assumed that V V . (70)

17
C.3 Proof of main optimality claims for off-policy softmax updates

Define the optimal value function by

V (s) = max OENT (s, ) = max V (s) for all s. (71)

For > 0, define the softmax Bellman operator B by

X
(B V )(s) = log exp r(s, a) + Es0 |s,a [V (s0 )] / (72)
a
= F (Q(s, :)) where Q(s, a) = r(s, a) + Es0 |s,a [V (s0 )] for all a. (73)

Lemma 15. For < 1, the fixed point of the softmax Bellman operator, V = B V , exists and is
unique.

Proof. First observe that the softmax Bellman operator is a contraction in the infinity norm. That is,
consider two value functions, V (1) and V (2) , and let p(s0 |s, a) denote the state transition probability
function determined by the environment. We then have

(1)
B V B V (2) = max (B V (1) )(s) (B V (2) )(s) (74)

s

= max F Q(1) (s, :) F Q(2) (s, :)

(75)

s

max max Q(1) (s, a) Q(2) (s, a) (by Corollary 9) (76)

s a

= max max Es0 |s,a V (1) (s0 ) V (2) (s0 )

(77)

s a

= max max p(: |s, a) V (1) V (2)

(78)

s a

max max kp(: |s, a)k1 kV (1) V (2) k (Hlders inequality) (79)
s a

= kV (1) V (2) k < kV (1) V (2) k if < 1. (80)

The existence and uniqueness of V then follows from the contraction map fixed-point theorem
[7].
Lemma 16. For any , if V B V then V (B )k V for all k.

Proof. Observe for any s that the assumption implies

Proof. Consider an arbitrary policy . If V B V , then by Corollary 17 we have V (B )k V for

all k. Then by Corollary 13, for any > 0 there exists a k0 such that V (B )k V V for all
k k0 since V is bounded; hence V V for all > 0. We conclude that V V .

Next, given the existence of V , we define a specific policy as follows

(: |s) = f Q (s, :) , where

(85)
0
Q (s, a) = r(s, a) + Es0 |s,a [V (s )]. (86)

Note that we are simply defining at this stage and have not as yet proved it has any particular
properties; but we will see shortly that it is, in fact, an optimal policy.

18

Lemma 18. V = V ; that is, for defined in (85), V gives its entropy regularized expected
return from any state.

Proof. We establish the claim by showing B V = V . In particular, for an arbitrary state s
consider

(B V )(s) = F Q (s, :)

by (73) (87)

= (: |s) Q (s, :) log (: |s) by Lemma 4 (88)

= V (s) by Lemma 10. (89)

Theorem 19. The fixed point of the softmax Bellman operator is the optimal value function: V = V .

Proof. Since V B V (in fact, V = B V ) we have V V for all by Corollary 17, hence

V V . Next observe that by Lemma 18 we have V V = V . Finally, by Lemma 15, we
know that the fixed point V = B V is unique, hence V = V .

Corollary 20 (Optimality Implies Consistency). The optimal state value function V and optimal
policy satisfy V (s) = r(s, a) + Es0 |s,a [V (s0 )] log (a|s) for every state s and action a.

Proof. First note that

Q (s, a) = r(s, a) + Es0 |s,a [V (s0 )] by (86) (90)
0
= r(s, a) + Es0 |s,a [V (s )] by Lemma 18 (91)

= Q (s, a) by (47). (92)
Then observe that for any state s,
V (s) = F Q (s, :)

by (73) (93)

= F Q (s, :) from above (94)

= (: |s) Q (s, :) log (: |s) by Lemma 4 (95)

= Q (s, a) log (a|s) for all a by Corollary 5 (96)
= Q (s, a) log (a|s) for all a from above. (97)

Corollary 21 (Consistency Implies Optimality). If V and satisfy, for all s and a:

V (s) = r(s, a) + Es0 |s,a [V (s0 )] log (a|s); then V = V and = .

Proof. We will show that satisfying the constraint for every s and a implies B V = V ; it will then
immediately follow that V = V and = by Lemma 15. Let Q(s, a) = r(s, a) + Es0 |s,a [V (s0 )].
Consider an arbitrary state s, and observe that
(B V )(s) = F Q(s, :)

(by (73)) (98)
n o
= max Q(s, :) log (by Lemma 4) (99)

= Q(s, a) log (a|s) for all a (by Corollary 6) (100)
0
= r(s, a) + Es0 |s,a [V (s )] log (a|s) for all a (by definition of Q above) (101)
= V (s) (by the consistency assumption on V and ). (102)

19
C.4 Proof of Theorem 1 from Main Text

Note: Theorem 1 from the main body was stated under an assumption of deterministic dynamics. We
used this assumption in the main body merely to keep presentation simple and understandable. The
development given in this appendix considers the more general case of a stochastic environment. We
give the proof here for the more general setting; the result stated in Theorem 1 follows as a special
case.

Proof. Assuming a stochastic environment, as developed in this appendix, we will establish that the
optimal policy and state value function, and V respectively, satisfy
V (s) = log (a|s) + r(s, a) + Es0 |s,a [V (s0 )] (103)
for all s and a. Theorem 1 will then follow as a special case.

Consider the policy defined in (85). From Corollary 18 we know that V = V and from

Theorem 19 we know V = V , hence V = V ; that is, is the optimizer of OENT (s, ) for
any state s (including s0 ). Therefore, this must be the same as considered in the premise. The
assertion (103) then follows directly from Corollary 20.

C.5 Proof of Corollary 2 from Main Text

Note: We consider the more general case of a stochastic environment as developed in this appendix.
First note that the consistency property for the stochastic case (103) can be rewritten as
Es0 |s,a V (s) + V (s0 ) + r(s, a) log (a|s) = 0

(104)
for all s and a. For a stochastic environment, the generalized version of (13) in Corollary 2 can then
be expressed as
t1
X
Es2 ...st |s1 ,a1 ...at1 V (s1 ) + t1 V (st ) + i1 r(si , ai ) log (ai |si ) = 0

i=1
(105)
for all states s1 and action sequences a1 ...at1 . We now show that (104) implies (105).

Proof. Observe that by (104) we have

t1
X
i1
0 = Es2 ...st |s1 ,a1 ...at1 V (si ) + V (si+1 ) + r(si , ai ) log (ai |si )
i=1
(106)
t1
X
= Es2 ...st |s1 ,a1 ...at1 i1 V (si ) + V (si+1 )
i=1
t1
X
i1
+ r(si , ai ) log (ai |si )
i=1
(107)
t1
X
= Es2 ...st |s1 ,a1 ...at1 V (s1 ) + t1 V (st ) + i1
r(si , ai ) log (ai |si )
i=1
(108)
by a telescopic sum on the first term, which yields the result.

C.6 Proof of Theorem 3 from Main Text

Note: Again, we consider the more general case of a stochastic environment. The consistency
property in this setting is given by (103) above.

20
Proof. Consider a policy and value function V that satisfy the general consistency property for a
stochastic environment: V (s) = log (a|s) + r(s, a) + Es0 |s,a [V (s0 )] for all s and a. Then
by Corollary 21, we must have V = V and = . Theorem 3 follows as a special case when the
environment is deterministic.

Introduction To Reinforcement Learning: Instructor: Sergey Levine UC Berkeley
No ratings yet
Introduction To Reinforcement Learning: Instructor: Sergey Levine UC Berkeley
46 pages
Machine Learning Lab Viva
100% (1)
Machine Learning Lab Viva
9 pages
Mac OS X Hacks
No ratings yet
Mac OS X Hacks
504 pages
OHS-PR-02-07 Document Control
100% (2)
OHS-PR-02-07 Document Control
14 pages
M39TE Tecnical Manual Apr 20153
No ratings yet
M39TE Tecnical Manual Apr 20153
50 pages
Serge Levine Course Introduction To Reinforcement Learning 6 Value Function
No ratings yet
Serge Levine Course Introduction To Reinforcement Learning 6 Value Function
27 pages
RL With LCS
No ratings yet
RL With LCS
29 pages
PDF
100% (2)
PDF
39 pages
Soft Q-Learning With Mutual Information Regularization
No ratings yet
Soft Q-Learning With Mutual Information Regularization
19 pages
Mandatory Documentation and Records: Status Interpretation Notes
100% (1)
Mandatory Documentation and Records: Status Interpretation Notes
29 pages
07 Deep Reinforcement Learning (John)
No ratings yet
07 Deep Reinforcement Learning (John)
52 pages
Serge Levine Course Introduction To Reinforcement Learning 3: RL Introduction
No ratings yet
Serge Levine Course Introduction To Reinforcement Learning 3: RL Introduction
46 pages
Drive in Trafic PDF
No ratings yet
Drive in Trafic PDF
20 pages
SAP User Classification
100% (3)
SAP User Classification
4 pages
T Test Formula
100% (1)
T Test Formula
2 pages
User Manual: ATEQ Leak/Flow Calibrator (CDF)
No ratings yet
User Manual: ATEQ Leak/Flow Calibrator (CDF)
78 pages
8200 Non Delusional Q Learning and Value Iteration
No ratings yet
8200 Non Delusional Q Learning and Value Iteration
11 pages
NIPS 2016 Generative Adversarial Imitation Learning Paper
No ratings yet
NIPS 2016 Generative Adversarial Imitation Learning Paper
9 pages
Inomax Manual de Operacion PDF
No ratings yet
Inomax Manual de Operacion PDF
136 pages
4E1 4 10/100M Ethernet Integrated Optical Multiplexer: User Manual
No ratings yet
4E1 4 10/100M Ethernet Integrated Optical Multiplexer: User Manual
25 pages
Report On Reinforcement Learning
No ratings yet
Report On Reinforcement Learning
26 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
46 pages
Reinforcement Learning (Part 2) : Nguyen Do Van, PHD
No ratings yet
Reinforcement Learning (Part 2) : Nguyen Do Van, PHD
46 pages
Careers in Applied Psychology
No ratings yet
Careers in Applied Psychology
15 pages
Introduction To HVDC Architecture and Solutions For Control and Protection
No ratings yet
Introduction To HVDC Architecture and Solutions For Control and Protection
18 pages
Reinforcement Learning in A Nutshell
No ratings yet
Reinforcement Learning in A Nutshell
12 pages
Policy Gradient Methods
No ratings yet
Policy Gradient Methods
70 pages
26
No ratings yet
26
68 pages
Fundamentals of Reinforcement Learning
No ratings yet
Fundamentals of Reinforcement Learning
33 pages
4 Reinforcement Learning - Basic Algorithms: - S, A) ) and The Immediate Reward Function R (R (S, A, S
No ratings yet
4 Reinforcement Learning - Basic Algorithms: - S, A) ) and The Immediate Reward Function R (R (S, A, S
16 pages
Soft Actor-Critic Algorithms and Applications: UC Berkeley, Google Brain, Contributed Equally
No ratings yet
Soft Actor-Critic Algorithms and Applications: UC Berkeley, Google Brain, Contributed Equally
17 pages
Soft Actor-Critic:: Off-Policy Maximum Entropy Deep Reinforcement Learning With A Stochastic Actor
No ratings yet
Soft Actor-Critic:: Off-Policy Maximum Entropy Deep Reinforcement Learning With A Stochastic Actor
14 pages
Reinforcement Learning: Csci 5512: Artificial Intelligence Ii
No ratings yet
Reinforcement Learning: Csci 5512: Artificial Intelligence Ii
30 pages
B8300030511 Maa Eng
No ratings yet
B8300030511 Maa Eng
18 pages
SP14 CS188 Lecture 10 - Reinforcement Learning I
No ratings yet
SP14 CS188 Lecture 10 - Reinforcement Learning I
35 pages
Boeing-Stearman Kaydet PT13 - 17
100% (2)
Boeing-Stearman Kaydet PT13 - 17
12 pages
DD2431 Machine Learning Lab 4: Reinforcement Learning Python Version
No ratings yet
DD2431 Machine Learning Lab 4: Reinforcement Learning Python Version
9 pages
Manisha's Journey
No ratings yet
Manisha's Journey
6 pages
Explain Briefly The Different Building Blocks of Algorithms
No ratings yet
Explain Briefly The Different Building Blocks of Algorithms
19 pages
AdaGAN Boosting Generative Models
No ratings yet
AdaGAN Boosting Generative Models
31 pages
EE675A Lecture 16
No ratings yet
EE675A Lecture 16
6 pages
Az1084s PDF
No ratings yet
Az1084s PDF
17 pages
Offline Imitation Learning From Multiple Baselines With Applications To Compiler Optimization
No ratings yet
Offline Imitation Learning From Multiple Baselines With Applications To Compiler Optimization
10 pages
Stabilizing Off Policy QLearning
No ratings yet
Stabilizing Off Policy QLearning
19 pages
1965 STHS Yearbook
No ratings yet
1965 STHS Yearbook
122 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
45 pages
Reinforcement Learning: Instructor: Max Welling
No ratings yet
Reinforcement Learning: Instructor: Max Welling
18 pages
Reinforcement Learning With Deep Energy-Based Policies
No ratings yet
Reinforcement Learning With Deep Energy-Based Policies
16 pages
Low Power Square and Cube Architectures Using Vedic
100% (1)
Low Power Square and Cube Architectures Using Vedic
18 pages
Chapter 4
No ratings yet
Chapter 4
16 pages
Systems Support For Preemptive Disk Scheduling
No ratings yet
Systems Support For Preemptive Disk Scheduling
13 pages
Unit 5d - Deep Reinforcement Learning
No ratings yet
Unit 5d - Deep Reinforcement Learning
52 pages
Guetzli: Perceptually Guided JPEG Encoder
No ratings yet
Guetzli: Perceptually Guided JPEG Encoder
13 pages
Lecture 30 Reinforcement-Learning
No ratings yet
Lecture 30 Reinforcement-Learning
50 pages
A Brief Study of In-Domain Transfer and Learning From Fewer Samples Using A Few Simple Priors
No ratings yet
A Brief Study of In-Domain Transfer and Learning From Fewer Samples Using A Few Simple Priors
5 pages
B Li Interspeech 2017
No ratings yet
B Li Interspeech 2017
5 pages
RL 10 QUESTIONS FOR MID II Scheme of Evaluvation
No ratings yet
RL 10 QUESTIONS FOR MID II Scheme of Evaluvation
15 pages
Lecture Notes v1.0 687 F22
No ratings yet
Lecture Notes v1.0 687 F22
115 pages
Characteristics of New Media
No ratings yet
Characteristics of New Media
1 page
ML Unit-4 - RTU
No ratings yet
ML Unit-4 - RTU
18 pages
Lec 04 Reinforcement Learning
No ratings yet
Lec 04 Reinforcement Learning
57 pages
Sapthagiri College of Engineering Department of Computer Science and Engineering Internal Assessment Test - III
No ratings yet
Sapthagiri College of Engineering Department of Computer Science and Engineering Internal Assessment Test - III
2 pages
JAI MAHAKAAL! GOC Kohinoor Drive [Private] - _JAI MAHAKAAL! GOC Kohinoor Drive_[Educative.io] System Design_Grokking the System Design Interview_Course Contents_2.Glossary of System Design Basics_
No ratings yet
JAI MAHAKAAL! GOC Kohinoor Drive [Private] - _JAI MAHAKAAL! GOC Kohinoor Drive_[Educative.io] System Design_Grokking the System Design Interview_Course Contents_2.Glossary of System Design Basics_
139 pages
Unit 5 - Policy Based
No ratings yet
Unit 5 - Policy Based
30 pages
KOBRA 400 HS-6 Auto-Oiler
No ratings yet
KOBRA 400 HS-6 Auto-Oiler
2 pages
UNIT 1 Database Management System DBMS 2
No ratings yet
UNIT 1 Database Management System DBMS 2
20 pages
Imitation Learning Papers
No ratings yet
Imitation Learning Papers
10 pages
11-DL-Deep Learning For Reinforcement Learning
No ratings yet
11-DL-Deep Learning For Reinforcement Learning
47 pages
IPM Lab Manual - Exp - 1
No ratings yet
IPM Lab Manual - Exp - 1
9 pages
2019 Imlicitehuman
No ratings yet
2019 Imlicitehuman
16 pages
Paper RL
No ratings yet
Paper RL
61 pages
QP Ans
No ratings yet
QP Ans
40 pages
Abdolmaleki Et Al. - 2018 - Maximum A Posteriori Policy Optimisation
No ratings yet
Abdolmaleki Et Al. - 2018 - Maximum A Posteriori Policy Optimisation
23 pages
RL DP and Value and Policy
No ratings yet
RL DP and Value and Policy
4 pages
20ai903 - RL - Unit 4
No ratings yet
20ai903 - RL - Unit 4
49 pages
Math 5 Reviewer
100% (1)
Math 5 Reviewer
2 pages
Unit 3 Ai
No ratings yet
Unit 3 Ai
5 pages
Midterm Report Example3
No ratings yet
Midterm Report Example3
4 pages
10 - Reinforcement Learning
No ratings yet
10 - Reinforcement Learning
24 pages
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning With A Stochastic Actor
No ratings yet
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning With A Stochastic Actor
10 pages
Deep Reinforcement Learning: 1 Notation
No ratings yet
Deep Reinforcement Learning: 1 Notation
9 pages
03 04 Lessonarticle
No ratings yet
03 04 Lessonarticle
5 pages
Reinforcement Learning Cheatsheet
No ratings yet
Reinforcement Learning Cheatsheet
16 pages
Lecture 06
No ratings yet
Lecture 06
98 pages
Epfo Mis 312
No ratings yet
Epfo Mis 312
1 page
GRES Integrated Energy Storage Systgem User Manual V1.01-EN
No ratings yet
GRES Integrated Energy Storage Systgem User Manual V1.01-EN
112 pages
Reinforcement Learning
No ratings yet
Reinforcement Learning
38 pages
Lecture 12 Slides - After
No ratings yet
Lecture 12 Slides - After
50 pages
10 Deep Reinforcement
No ratings yet
10 Deep Reinforcement
40 pages
SP14 CS188 Lecture 10 - Reinforcement Learning I - Print
No ratings yet
SP14 CS188 Lecture 10 - Reinforcement Learning I - Print
25 pages
A Crash Course On Reinforcement Learning - Felix Wagner
No ratings yet
A Crash Course On Reinforcement Learning - Felix Wagner
84 pages
Sahil Khaja Huzoor AMS 517 Report
No ratings yet
Sahil Khaja Huzoor AMS 517 Report
11 pages
Theory and Practice of Artificial Intelligence
No ratings yet
Theory and Practice of Artificial Intelligence
7 pages
5SC28 L7 Machine Learning
No ratings yet
5SC28 L7 Machine Learning
61 pages
5SC28 Machine Learning For Systems and Control
No ratings yet
5SC28 Machine Learning For Systems and Control
68 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Bridging The Gap Between Value and Policy Based Reinforcement Learning

Uploaded by

Bridging The Gap Between Value and Policy Based Reinforcement Learning

Uploaded by

Bridging the Gap Between Value and Policy Based

Ofir Nachum Mohammad Norouzi Kelvin Xu Dale Schuurmans

We establish a new connection between value and policy based reinforcement

2 Notation & Background

We model an agents behavior by a parametric distribution (a | s) defined by a neural network over

4 Consistency Between Optimal Value & Policy

Note that one can also characterize in terms of Q as

Importantly, the converse of Theorem 1 (and Corollary 2) also holds:

5 Path Consistency Learning (PCL)

= v C(si:i+d , , ) V (si ) d V (si+d ) ,

5.1 Unified Path Consistency Learning (Unified PCL)

v C(si:i+d , ) V (si ) d V (si+d ) .

5.2 Connections to Actor-Critic and Q-learning

The advantage-actor-critic updates for and can then be written as,

Reverse ReversedAddition ReversedAddition3 Hard ReversedAddition

PCL A3C DQN

Reverse ReversedAddition ReversedAddition3 Hard ReversedAddition

PCL Unified PCL

PCL PCL + Expert

Algorithm 1 Path Consistency Learning

B.1 Synthetic Tree

As an initial testbed, we developed a simple synthetic environment. The environment is defined

B.2 Algorithmic Tasks

Copy: Copy a 1 n sequence of characters to output.

B.3 Implementation Details

and define the vector valued function f (the soft indmax) by

where the exponentiation is component-wise. It is easy to verify that f = F . Note that

Proof. From Lemma 4 we know v = F (q) = (q log ) where = f (q). From

Proof. Let F denote the conjugate of F , which is given by

Lemma 8. For any two vectors q(1) and q(2) ,

Proof. Observe that by Lemma 7

Proof. Immediate from Lemma 8.

C.2 Background results for on-policy entropy regularized updates

Proof. Consider an arbitrary state s` . By the definition of V (s` ) in (43) we have

Proof. Consider an arbitrary state s` . We use an induction

(B )k+1 V (s` ) V (s` )

= (B )k+1 V (s` ) (B )k+1 V (s` ) (by Lemma 10)

establishing the claim.

k max kp(k) (: |s` )k1 kV V k (by Hlders inequality) (67)

Define the optimal value function by

For > 0, define the softmax Bellman operator B by

= kV (1) V (2) k < kV (1) V (2) k if < 1. (80)

Proof. Observe for any s that the assumption implies

Proof. Consider an arbitrary policy . If V B V , then by Corollary 17 we have V (B )k V for

Next, given the existence of V , we define a specific policy as follows

Proof. First note that

Corollary 21 (Consistency Implies Optimality). If V and satisfy, for all s and a:

C.5 Proof of Corollary 2 from Main Text

Proof. Observe that by (104) we have

C.6 Proof of Theorem 3 from Main Text

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.