1.1 Discounted (Infinite-Horizon) Markov Decision Processes
In reinforcement learning, the interactions between the agent and the environment are often described by an infinite-horizon, discounted Markov Decision Process (MDP) M = (S, A, P, r, γ, µ), specified by:
• A state space S, which may be finite or infinite. For mathematical convenience, we will assume that S is finite
or countably infinite.
• An action space A, which also may be discrete or infinite. For mathematical convenience, we will assume that
A is finite.
• A transition function P : S × A → Δ(S), where Δ(S) is the space of probability distributions over S (i.e., the probability simplex). P(s′|s, a) is the probability of transitioning into state s′ upon taking action a in state s. We use P_{s,a} to denote the vector P(·|s, a).
• A reward function r : S × A → [0, 1]. r(s, a) is the immediate reward associated with taking action a in state s. More generally, r(s, a) could be a random variable (where the distribution depends on s, a). While we largely focus on the case where r(s, a) is deterministic, the extension to methods with stochastic rewards is often straightforward.
• A discount factor γ ∈ [0, 1), which defines a horizon for the problem.
• An initial state distribution µ ∈ Δ(S), which specifies how the initial state s_0 is generated.
In many cases, we will assume that the initial state is fixed at s0 , i.e. µ is a distribution supported only on s0 .
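To make these objects concrete, the following is a minimal sketch (not from the text) of how a tabular MDP might be stored as dense NumPy arrays; the container name TabularMDP and its field names are our own illustrative choices.

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class TabularMDP:
        """A finite MDP M = (S, A, P, r, gamma, mu) stored as dense arrays."""
        P: np.ndarray      # shape (S, A, S); P[s, a, s2] = P(s2 | s, a)
        r: np.ndarray      # shape (S, A); rewards in [0, 1]
        gamma: float       # discount factor in [0, 1)
        mu: np.ndarray     # shape (S,); initial state distribution

        def __post_init__(self):
            assert np.allclose(self.P.sum(axis=-1), 1.0), "each P[s, a] must be a distribution"
            assert 0.0 <= self.gamma < 1.0
            assert np.isclose(self.mu.sum(), 1.0)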
Policies. In a given MDP M = (S, A, P, r, γ, µ), the agent interacts with the environment according to the following protocol: the agent starts at some state s_0 ∼ µ; at each time step t = 0, 1, 2, . . ., the agent takes an action a_t ∈ A, obtains the immediate reward r_t = r(s_t, a_t), and observes the next state s_{t+1} sampled according to s_{t+1} ∼ P(·|s_t, a_t).
The interaction record at time t,

    τ_t = (s_0, a_0, r_0, s_1, . . . , s_t, a_t, r_t),

is called a trajectory, which includes the observed state at time t.
In the most general setting, a policy specifies a decision-making strategy in which the agent chooses actions adaptively based on the history of observations; precisely, a policy is a (possibly randomized) mapping from a trajectory to an action, i.e. π : H → Δ(A), where H is the set of all possible trajectories (of all lengths) and Δ(A) is the space of probability distributions over A. A stationary policy π : S → Δ(A) specifies a decision-making strategy in which the agent chooses actions based only on the current state, i.e. a_t ∼ π(·|s_t). A deterministic, stationary policy is of the form π : S → A.
Values. We now define values for (general) policies. For a fixed policy π and a starting state s_0 = s, we define the value function V^π_M : S → R as the discounted sum of future rewards

    V^π_M(s) = E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | π, s_0 = s ],

where the expectation is with respect to the randomness of the trajectory, that is, the randomness in state transitions and the stochasticity of π. Here, since r(s, a) is bounded between 0 and 1, we have 0 ≤ V^π_M(s) ≤ 1/(1 − γ).
Similarly, the action-value (or Q-value) function Q^π_M : S × A → R is defined as

    Q^π_M(s, a) = E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | π, s_0 = s, a_0 = a ].
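As an illustration of these definitions, here is a minimal sketch of estimating V^π(s) by truncated Monte Carlo rollouts, assuming the hypothetical TabularMDP container sketched above; the truncation horizon and rollout count are arbitrary choices, and truncating after H steps introduces a bias of at most γ^H/(1 − γ) since rewards lie in [0, 1].

    import numpy as np

    def rollout_value(mdp, pi, s0, horizon=200, n_rollouts=1000, seed=None):
        """Monte Carlo estimate of V^pi(s0); pi is an (S, A) array of action probabilities."""
        rng = np.random.default_rng(seed)
        S, A = mdp.r.shape
        total = 0.0
        for _ in range(n_rollouts):
            s, ret, discount = s0, 0.0, 1.0
            for _ in range(horizon):                      # truncated rollout
                a = rng.choice(A, p=pi[s])                # a_t ~ pi(.|s_t)
                ret += discount * mdp.r[s, a]
                discount *= mdp.gamma
                s = rng.choice(S, p=mdp.P[s, a])          # s_{t+1} ~ P(.|s_t, a_t)
            total += ret
        return total / n_rollouts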
Goal. Given a state s, the goal of the agent is to find a policy π that maximizes the value, i.e. the optimization problem the agent seeks to solve is:

    max_π V^π_M(s)        (0.1)

where the max is over all (possibly non-stationary and randomized) policies. As we shall see, there exists a deterministic and stationary policy which is simultaneously optimal for all starting states s.

We drop the dependence on M and write V^π when it is clear from context.
Example 1.1 (Navigation). Navigation is perhaps the simplest example of RL to visualize. The state of the agent is their current location. The four actions might be moving one step east, west, north, or south. The transitions in the simplest setting are deterministic: taking the north action moves the agent one step north of their location, assuming that the size of a step is standardized. The agent might have a goal state g they are trying to reach, and the reward is 0 until the agent reaches the goal and 1 upon reaching the goal state. Since the discount factor γ < 1, there is an incentive to reach the goal state earlier in the trajectory. As a result, the optimal behavior in this setting corresponds to finding the shortest path from the initial state to the goal state, and the value function of a state, given a policy, is γ^d, where d is the number of steps required by the policy to reach the goal state.
Example 1.2 (Conversational agent). This is another fairly natural RL problem. The state of the agent can be the current transcript of the conversation so far, along with any additional information about the world, such as the context for the conversation and characteristics of the other agents or humans in the conversation. Actions depend on the domain: in the most basic form, we can think of an action as the next statement to make in the conversation. Sometimes, conversational agents are designed for task completion, such as a travel assistant, tech support agent, or a virtual office receptionist. In these cases, there might be a predefined set of slots that the agent needs to fill before it can find a good solution. For instance, in the travel-assistant case, these might correspond to the dates, source, destination, and mode of travel. The actions might correspond to natural language queries to fill these slots.
In task completion settings, the reward is naturally defined as a binary outcome indicating whether the task was completed, such as whether the travel was successfully booked. Depending on the domain, we could further refine the reward based on the quality or the price of the travel package found. In more generic conversational settings, the ultimate reward is whether the conversation was satisfactory to the other agents or humans.
Example 1.3 (Strategic games). This is a popular category of RL applications, where RL has been successful in
achieving human level performance in Backgammon, Go, Chess, and various forms of Poker. The usual setting consists
of the state being the current game board, actions being the potential next moves and reward being the eventual win/loss
outcome or a more detailed score when one is defined by the game. Technically, these are multi-agent RL settings, and yet the algorithms used are often single-agent RL algorithms.
Proof: To see that I − γP^π is invertible, observe that for any non-zero vector x ∈ R^{|S||A|},

    ‖(I − γP^π) x‖_∞ = ‖x − γP^π x‖_∞
                     ≥ ‖x‖_∞ − γ ‖P^π x‖_∞        (triangle inequality for norms)
                     ≥ ‖x‖_∞ − γ ‖x‖_∞            (each element of P^π x is an average of entries of x)
                     = (1 − γ) ‖x‖_∞ > 0           (γ < 1, x ≠ 0)

which implies that I − γP^π is full rank.
The following is also a helpful lemma:
Lemma 1.6. We have that:

    [(1 − γ)(I − γP^π)^{-1}]_{(s,a),(s',a')} = (1 − γ) Σ_{t=0}^∞ γ^t Pr^π(s_t = s', a_t = a' | s_0 = s, a_0 = a),

so we can view the (s, a)-th row of this matrix as an induced distribution over states and actions when following π after starting with s_0 = s and a_0 = a.
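As a sanity check of Lemma 1.6, the sketch below forms the matrix (1 − γ)(I − γP^π)^{-1} explicitly for a small tabular MDP, reusing the hypothetical TabularMDP container from the earlier sketch; each row sums to one and can be read as the discounted state-action visitation distribution induced by starting from that (s, a).

    import numpy as np

    def state_action_visitation(mdp, pi):
        """Return the |S||A| x |S||A| matrix (1 - gamma) * (I - gamma * P^pi)^{-1}.

        pi is an (S, A) array of action probabilities, and
        P^pi[(s, a), (s2, a2)] = P(s2 | s, a) * pi(a2 | s2), with (s, a) flattened as s * A + a.
        """
        S, A = mdp.r.shape
        P_pi = (mdp.P.reshape(S * A, S)[:, :, None] * pi[None, :, :]).reshape(S * A, S * A)
        D = (1.0 - mdp.gamma) * np.linalg.inv(np.eye(S * A) - mdp.gamma * P_pi)
        return D   # row (s * A + a) is the visitation distribution when starting from (s, a)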
A remarkable and convenient property of MDPs is that there exists a stationary and deterministic policy that simultaneously maximizes V^π(s) for all s ∈ S. This is formalized in the following theorem:
Theorem 1.7. Let Π be the set of all non-stationary and randomized policies. Define:

    V*(s) := sup_{π∈Π} V^π(s),    Q*(s, a) := sup_{π∈Π} Q^π(s, a),

which are finite since V^π(s) and Q^π(s, a) are bounded between 0 and 1/(1 − γ).
There exists a stationary and deterministic policy π such that for all s ∈ S and a ∈ A,

    V^π(s) = V*(s),
    Q^π(s, a) = Q*(s, a).
Proof: For any π ∈ Π and for any time t, π specifies a distribution over actions conditioned on the history of observations; here, we write π(A_t = a | S_0 = s_0, A_0 = a_0, R_0 = r_0, . . . , S_{t−1} = s_{t−1}, A_{t−1} = a_{t−1}, R_{t−1} = r_{t−1}, S_t = s_t) for the probability that π selects action a at time t given an observed history s_0, a_0, r_0, . . . , s_{t−1}, a_{t−1}, r_{t−1}, s_t. For the purposes of this proof, it is helpful to formally let S_t, A_t, and R_t denote random variables, which distinguishes them from their outcomes, denoted by lower-case variables. First, let us show that, conditioned on (S_0, A_0, R_0, S_1) = (s, a, r, s'), the maximum future discounted value, from time 1 onwards, is not a function of s, a, r. More precisely, we seek to show that:

    sup_{π∈Π} E[ Σ_{t=1}^∞ γ^t r(s_t, a_t) | π, (S_0, A_0, R_0, S_1) = (s, a, r, s') ] = γ V*(s').        (0.3)
For any policy π, define an “offset” policy π_{(s,a,r)}, which is the policy that chooses actions on a trajectory τ according to the same distribution that π chooses actions on the trajectory (s, a, r, τ). Precisely, for all t and all trajectories τ_t, define

    π_{(s,a,r)}(a' | τ_t) := π(a' | (s, a, r, τ_t)).

A change of variables on the time index, together with the definition of the policy π_{(s,a,r)}, gives

    E[ Σ_{t=1}^∞ γ^t r(s_t, a_t) | π, (S_0, A_0, R_0, S_1) = (s, a, r, s') ] = γ V^{π_{(s,a,r)}}(s').

Also, we have that, for all (s, a, r), the set {π_{(s,a,r)} | π ∈ Π} is equal to Π itself, by the definition of Π and π_{(s,a,r)}. This implies:
    sup_{π∈Π} E[ Σ_{t=1}^∞ γ^t r(s_t, a_t) | π, (S_0, A_0, R_0, S_1) = (s, a, r, s') ] = γ · sup_{π∈Π} V^{π_{(s,a,r)}}(s') = γ · sup_{π∈Π} V^π(s') = γ V*(s'),
which establishes Equation 0.3. Using this,

    V*(s_0) = sup_{π∈Π} E[ r(s_0, a_0) + Σ_{t=1}^∞ γ^t r(s_t, a_t) ]
     (a)    = sup_{π∈Π} E[ r(s_0, a_0) + E[ Σ_{t=1}^∞ γ^t r(s_t, a_t) | π, (S_0, A_0, R_0, S_1) = (s_0, a_0, r_0, s_1) ] ]
            ≤ sup_{π∈Π} E[ r(s_0, a_0) + sup_{π'∈Π} E[ Σ_{t=1}^∞ γ^t r(s_t, a_t) | π', (S_0, A_0, R_0, S_1) = (s_0, a_0, r_0, s_1) ] ]
     (b)    = sup_{π∈Π} E[ r(s_0, a_0) + γ V*(s_1) ]
            = sup_{a_0∈A} E[ r(s_0, a_0) + γ V*(s_1) ]
     (c)    = E[ r(s_0, a_0) + γ V*(s_1) | π̃ ],

where step (a) uses the law of iterated expectations; step (b) uses Equation 0.3; and step (c) follows from the definition of π̃, the deterministic, stationary policy that selects the action maximizing r(s, a) + γ E_{s'∼P(·|s,a)}[V*(s')] at each state s. Applying the same argument recursively leads to:
    V*(s_0) ≤ E[ r(s_0, a_0) + γ V*(s_1) | π̃ ] ≤ E[ r(s_0, a_0) + γ r(s_1, a_1) + γ² V*(s_2) | π̃ ] ≤ . . . ≤ V^{π̃}(s_0).
Since V^{π̃}(s) ≤ sup_{π∈Π} V^π(s) = V*(s) for all s, we have that V^{π̃} = V*, which completes the proof of the first claim.
For the same policy π̃, an analogous argument can be used to prove the second claim.
This shows that we may restrict ourselves to using stationary and deterministic policies without any loss in perfor-
mance. The following theorem, also due to [Bellman, 1956], gives a precise characterization of the optimal value
function.
Theorem 1.8 (Bellman optimality equations). We say that a vector Q ∈ R^{|S||A|} satisfies the Bellman optimality equations if:

    Q(s, a) = r(s, a) + γ E_{s'∼P(·|s,a)}[ max_{a'∈A} Q(s', a') ].

For any Q ∈ R^{|S||A|}, we have that Q = Q* if and only if Q satisfies the Bellman optimality equations. Furthermore, the deterministic policy defined by π(s) ∈ argmax_{a∈A} Q*(s, a) is an optimal policy (where ties are broken in some arbitrary manner).
Before we prove this claim, we will provide a few definitions. Let π_Q denote the greedy policy with respect to a vector Q ∈ R^{|S||A|}, i.e.

    π_Q(s) := argmax_{a∈A} Q(s, a),

where ties are broken in some arbitrary manner. With this notation, by the above theorem, the optimal policy π* is given by:

    π* = π_{Q*}.
Let us also use the notation V_Q to turn a vector Q ∈ R^{|S||A|} into a vector of length |S|, where V_Q(s) := max_{a∈A} Q(s, a), and define the Bellman optimality operator T : R^{|S||A|} → R^{|S||A|} as

    T Q := r + γ P V_Q.        (0.4)

This allows us to rewrite the Bellman optimality equations in the concise form:

    Q = T Q,

and, so, the previous theorem states that Q = Q* if and only if Q is a fixed point of the operator T.
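For concreteness, here is a minimal sketch of the operator T for a tabular MDP, with Q stored as an (S, A) array and reusing the hypothetical TabularMDP container from the earlier sketches; the array conventions are our own, not the text's.

    import numpy as np

    def bellman_optimality_operator(mdp, Q):
        """Apply T: (T Q)(s, a) = r(s, a) + gamma * E_{s2 ~ P(.|s, a)}[ max_a2 Q(s2, a2) ]."""
        V_Q = Q.max(axis=1)                       # V_Q(s2) = max_{a2} Q(s2, a2)
        return mdp.r + mdp.gamma * (mdp.P @ V_Q)  # (S, A, S) @ (S,) -> (S, A)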
Proof: Let us begin by showing that:

    V*(s) = max_a Q*(s, a).        (0.5)
Let π* be an optimal stationary and deterministic policy, which exists by Theorem 1.7. Consider the (non-stationary) policy which takes action a and then follows π*. Since V*(s) is the maximum value over all non-stationary policies (as shown in Theorem 1.7),

    V*(s) ≥ Q^{π*}(s, a) = Q*(s, a),

which shows that V*(s) ≥ max_a Q*(s, a), since a is arbitrary in the above. Also, by Lemma 1.4 and Theorem 1.7,

    V*(s) = V^{π*}(s) = Q^{π*}(s, π*(s)) ≤ max_a Q^{π*}(s, a) = max_a Q*(s, a),

which proves our claim, since we have upper and lower bounded V*(s) by max_a Q*(s, a).
We first show sufficiency, i.e. that Q* (the state-action value of an optimal policy) satisfies Q* = T Q*. For all (s, a) ∈ S × A, we have:

    Q*(s, a) = r(s, a) + γ E_{s'∼P(·|s,a)}[ V^{π*}(s') ] = r(s, a) + γ E_{s'∼P(·|s,a)}[ V*(s') ] = r(s, a) + γ E_{s'∼P(·|s,a)}[ max_{a'∈A} Q*(s', a') ].

Here, the second equality follows from Theorem 1.7 and the final equality follows from Equation 0.5. This proves sufficiency.
For the converse, suppose Q = T Q for some Q. We now show that Q = Q*. Let π = π_Q. That Q = T Q implies that Q = r + γ P^π Q, and so:

    Q = (I − γ P^π)^{-1} r = Q^π,

using Corollary 1.5 in the last step. In other words, Q is the action value of the policy π_Q. Also, let us show that (P^π − P^{π'}) Q^π ≥ 0 for any deterministic and stationary policy π'. To see this, observe that:

    [(P^π − P^{π'}) Q^π]_{s,a} = E_{s'∼P(·|s,a)}[ Q^π(s', π(s')) − Q^π(s', π'(s')) ] ≥ 0,
where the last step uses that π = π_Q. Now observe, for any other deterministic and stationary policy π':

    Q − Q^{π'} = Q^π − Q^{π'}
              = Q^π − (I − γ P^{π'})^{-1} r
              = (I − γ P^{π'})^{-1} ( (I − γ P^{π'}) − (I − γ P^π) ) Q^π
              = γ (I − γ P^{π'})^{-1} ( P^π − P^{π'} ) Q^π
              ≥ 0,

where the last step follows since we have shown (P^π − P^{π'}) Q^π ≥ 0 and since (1 − γ)(I − γ P^{π'})^{-1} is a matrix with all non-negative entries (see Lemma 1.6). Thus, Q^π = Q ≥ Q^{π'} for all deterministic and stationary policies π', which shows that π is an optimal policy. Thus, Q = Q^π = Q*, using Theorem 1.7. This completes the proof.
In some cases, it is natural to work with finite-horizon (and time-dependent) Markov Decision Processes (see the discussion below). Here, a finite-horizon, time-dependent Markov Decision Process (MDP) M = (S, A, {P_h}, {r_h}, H, µ) is specified by a state space S, an action space A, time-dependent transition functions P_h : S × A → Δ(S) and reward functions r_h : S × A → [0, 1] for each h ∈ {0, . . . , H − 1}, an integer horizon H, and an initial state distribution µ ∈ Δ(S).
Here, for a policy π, a state s, and h ∈ {0, . . . , H − 1}, we define the value function V^π_h : S → R as

    V^π_h(s) = E[ Σ_{t=h}^{H−1} r_t(s_t, a_t) | π, s_h = s ],

where again the expectation is with respect to the randomness of the trajectory, that is, the randomness in state transitions and the stochasticity of π. Similarly, the state-action value (or Q-value) function Q^π_h : S × A → R is defined as

    Q^π_h(s, a) = E[ Σ_{t=h}^{H−1} r_t(s_t, a_t) | π, s_h = s, a_h = a ].

We also use the notation V^π(s) = V^π_0(s).
Again, given a state s, the goal of the agent is to find a policy π that maximizes the value, i.e. the optimization problem the agent seeks to solve is:

    max_π V^π(s).        (0.6)
Theorem 1.9 (Bellman optimality equations). Define

    Q*_h(s, a) = sup_{π∈Π} Q^π_h(s, a),

where the sup is over all non-stationary and randomized policies. Suppose that Q_H = 0. We have that Q_h = Q*_h for all h ∈ [H] if and only if, for all h ∈ [H],

    Q_h(s, a) = r_h(s, a) + E_{s'∼P_h(·|s,a)}[ max_{a'∈A} Q_{h+1}(s', a') ].        (0.7)
Discussion: Stationary MDPs vs Time-Dependent MDPs For the purposes of this book, it is natural for us to
study both of these models, where we typically assume stationary dynamics in the infinite horizon setting and time-
dependent dynamics in the finite-horizon setting. From a theoretical perspective, the finite-horizon, time-dependent setting is often more amenable to analysis, where achieving optimal statistical rates often requires simpler arguments. However,
we should note that from a practical perspective, time-dependent MDPs are rarely utilized because they lead to policies
and value functions that are O(H) larger (to store in memory) than those in the stationary setting. In practice, we often
incorporate temporal information directly into the definition of the state, which leads to more compact value functions
and policies (when coupled with function approximation methods, which attempt to represent both the values and
policies in a more compact form).
This section is concerned with computing an optimal policy when the MDP M = (S, A, P, r, γ) is known; this can be thought of as solving the planning problem. While much of this book is concerned with statistical limits, understanding the computational limits can be informative. We will consider algorithms which give both exact and approximately optimal policies. In particular, we will be interested in polynomial time (and strongly polynomial time) algorithms.
Suppose that (P, r, γ) in our MDP M is specified with rational entries. Let L(P, r, γ) denote the total bit-size required to specify M, and assume that basic arithmetic operations +, −, ×, ÷ take unit time. Here, we may hope for an algorithm which (exactly) returns an optimal policy whose runtime is polynomial in L(P, r, γ) and the number of states and actions.
More generally, it may also be helpful to understand which algorithms are strongly polynomial. Here, we do not want to explicitly restrict (P, r, γ) to be specified by rationals. An algorithm is said to be strongly polynomial if it returns an optimal policy with runtime that is polynomial in only the number of states and actions (with no dependence on L(P, r, γ)).
The first two subsections will cover classical iterative algorithms that compute Q*, and then we cover the linear programming approach.
Perhaps the simplest algorithm for discounted MDPs is to iteratively apply the fixed-point mapping: starting at some Q, we iteratively apply T:

    Q ← T Q.
                    Value Iteration                              Policy Iteration                                                               LP-Algorithms
 Poly?              |S|²|A| L(P,r,γ)/(1−γ) · log(1/(1−γ))        (|S|³ + |S|²|A|) L(P,r,γ)/(1−γ) · log(1/(1−γ))                                 |S|³|A| L(P,r,γ)
 Strongly Poly?     ✗                                            (|S|³ + |S|²|A|) · min{ |A|^{|S|}/|S| , |S|²|A|/(1−γ) · log(|S|²/(1−γ)) }      |S|⁴|A|⁴ log(1/(1−γ))

Table 0.1: Computational complexities of various approaches (we drop universal constants). Polynomial time algorithms depend on the bit complexity L(P, r, γ), while strongly polynomial algorithms do not. Note that only for a fixed value of γ are value and policy iteration polynomial time algorithms; otherwise, they are not polynomial time algorithms. Similarly, only for a fixed value of γ is policy iteration a strongly polynomial time algorithm. In contrast, the LP-approach leads to both polynomial time and strongly polynomial time algorithms; for the latter, the approach is an interior point algorithm. See text for further discussion, and Section 1.6 for references. Here, |S|²|A| is the assumed runtime per iteration of value iteration, and |S|³ + |S|²|A| is the assumed runtime per iteration of policy iteration (note that for this complexity we would directly update the values V rather than Q values, as described in the text); these runtimes are consistent with assuming cubic complexity for linear system solving.
Lemma 1.10 (Contraction). For any two vectors Q, Q' ∈ R^{|S||A|},

    ‖T Q − T Q'‖_∞ ≤ γ ‖Q − Q'‖_∞.
Proof: First, let us show that for all s ∈ S, |V_Q(s) − V_{Q'}(s)| ≤ max_{a∈A} |Q(s, a) − Q'(s, a)|. Assume V_Q(s) > V_{Q'}(s) (the other direction is symmetric), and let a be the greedy action for Q at s. Then

    |V_Q(s) − V_{Q'}(s)| = Q(s, a) − max_{a'∈A} Q'(s, a') ≤ Q(s, a) − Q'(s, a) ≤ max_{a∈A} |Q(s, a) − Q'(s, a)|.
Using this,

    ‖T Q − T Q'‖_∞ = γ ‖P V_Q − P V_{Q'}‖_∞
                   = γ ‖P (V_Q − V_{Q'})‖_∞
                   ≤ γ ‖V_Q − V_{Q'}‖_∞
                   = γ max_s |V_Q(s) − V_{Q'}(s)|
                   ≤ γ max_s max_a |Q(s, a) − Q'(s, a)|
                   = γ ‖Q − Q'‖_∞,

where the first inequality uses that each element of P(V_Q − V_{Q'}) is a convex combination of the entries of V_Q − V_{Q'}, and the second inequality uses our claim above.
The following result bounds the sub-optimality of the greedy policy itself, based on the error in the Q-value function.

Lemma 1.11 (Q-error amplification). We have that:

    V^{π_Q} ≥ V* − ( 2 ‖Q − Q*‖_∞ / (1 − γ) ) 1,

where 1 denotes the vector of all ones.
Proof: Fix a state s and let a = π_Q(s). We have:

    V*(s) − V^{π_Q}(s) = Q*(s, π*(s)) − Q^{π_Q}(s, a)
                       = Q*(s, π*(s)) − Q(s, π*(s)) + Q(s, π*(s)) − Q^{π_Q}(s, a)
                       ≤ Q*(s, π*(s)) − Q(s, π*(s)) + Q(s, a) − Q^{π_Q}(s, a)
                       = Q*(s, π*(s)) − Q(s, π*(s)) + Q(s, a) − Q*(s, a) + Q*(s, a) − Q^{π_Q}(s, a)
                       ≤ 2 ‖Q − Q*‖_∞ + γ E_{s'∼P(·|s,a)}[ V*(s') − V^{π_Q}(s') ]
                       ≤ 2 ‖Q − Q*‖_∞ + γ ‖V* − V^{π_Q}‖_∞,

where the first inequality uses Q(s, π*(s)) ≤ Q(s, π_Q(s)) = Q(s, a) due to the definition of π_Q. Since s is arbitrary, rearranging terms completes the proof.
Theorem 1.12 (Q-value iteration convergence). Set Q^(0) = 0 and iterate

    Q^(k+1) = T Q^(k).

Let π^(k) = π_{Q^(k)}. For k ≥ log( 2 / ((1 − γ)² ε) ) / (1 − γ), we have

    V^{π^(k)} ≥ V* − ε 1.
Proof: Since ‖Q*‖_∞ ≤ 1/(1 − γ), Q^(k) = T^k Q^(0), and Q* = T Q*, Lemma 1.10 gives

    ‖Q^(k) − Q*‖_∞ = ‖T^k Q^(0) − T^k Q*‖_∞ ≤ γ^k ‖Q^(0) − Q*‖_∞ = (1 − (1 − γ))^k ‖Q*‖_∞ ≤ exp(−(1 − γ) k) / (1 − γ).

The proof is completed with our choice of k and using Lemma 1.11.
Iteration complexity for an exact solution. With regards to computing an exact optimal policy, once the gap between the current objective value and the optimal objective value is smaller than 2^{−L(P,r,γ)}, the greedy policy will be optimal. This leads to the claimed complexity in Table 0.1. Value iteration is not a strongly polynomial algorithm because, in finite time, it may never return the optimal policy.
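Putting the contraction property and the greedy-policy bound together, here is a minimal value iteration sketch that reuses the bellman_optimality_operator from the earlier sketch; the stopping rule below, which certifies ‖Q − Q*‖_∞ ≤ ε via the contraction property, is one reasonable choice rather than a procedure prescribed in the text.

    import numpy as np

    def value_iteration(mdp, eps=1e-6):
        """Iterate Q <- T Q until the certified sup-norm error is at most eps."""
        S, A = mdp.r.shape
        Q = np.zeros((S, A))
        # If ||Q_next - Q|| <= eps * (1 - gamma) / gamma, contraction gives ||Q_next - Q*|| <= eps.
        tol = eps * (1 - mdp.gamma) / max(mdp.gamma, 1e-12)
        while True:
            Q_next = bellman_optimality_operator(mdp, Q)
            if np.max(np.abs(Q_next - Q)) <= tol:
                return Q_next, Q_next.argmax(axis=1)   # Q estimate and its greedy policy
            Q = Q_next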
The policy iteration algorithm, for discounted MDPs, starts from an arbitrary policy π_0 and repeats the following iterative procedure: for k = 0, 1, 2, . . .,

    π_{k+1} = π_{Q^{π_k}}.

In each iteration, we compute the Q-value function of π_k, using the analytical form given in Equation 0.2, and update the policy to be greedy with respect to this new Q-value function. The first step is often called policy evaluation, and the second step is often called policy improvement.
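Here is a minimal sketch of these two steps for a tabular MDP, with policy evaluation done via the exact linear solve Q^π = (I − γP^π)^{-1} r from Equation 0.2; the array conventions follow the earlier hypothetical sketches.

    import numpy as np

    def policy_evaluation(mdp, pi_det):
        """Exact Q^pi for a deterministic policy pi_det, an int array with one action per state."""
        S, A = mdp.r.shape
        pi_onehot = np.eye(A)[pi_det]    # (S, A); row s is the indicator of action pi_det[s]
        # P^pi[(s, a), (s2, a2)] = P(s2 | s, a) * 1{a2 = pi_det[s2]}
        P_pi = (mdp.P.reshape(S * A, S)[:, :, None] * pi_onehot[None, :, :]).reshape(S * A, S * A)
        Q = np.linalg.solve(np.eye(S * A) - mdp.gamma * P_pi, mdp.r.reshape(S * A))
        return Q.reshape(S, A)

    def policy_iteration(mdp):
        """Alternate exact policy evaluation with greedy policy improvement."""
        S, A = mdp.r.shape
        pi = np.zeros(S, dtype=int)
        while True:
            Q = policy_evaluation(mdp, pi)
            pi_next = Q.argmax(axis=1)
            if np.array_equal(pi_next, pi):   # greedy policy unchanged, so pi is optimal
                return Q, pi
            pi = pi_next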
Lemma 1.13. We have that:

    1. Q^{π_{k+1}} ≥ T Q^{π_k} ≥ Q^{π_k},
    2. ‖Q^{π_{k+1}} − Q*‖_∞ ≤ γ ‖Q^{π_k} − Q*‖_∞.
Proof: First let us show that T Q^{π_k} ≥ Q^{π_k}. Note that the policies produced by policy iteration are always deterministic, so V^{π_k}(s) = Q^{π_k}(s, π_k(s)) for all iterations k and states s. Hence,

    T Q^{π_k}(s, a) = r(s, a) + γ E_{s'∼P(·|s,a)}[ max_{a'∈A} Q^{π_k}(s', a') ] ≥ r(s, a) + γ E_{s'∼P(·|s,a)}[ Q^{π_k}(s', π_k(s')) ] = Q^{π_k}(s, a).
Now let us prove that Q^{π_{k+1}} ≥ T Q^{π_k}. First, let us see that Q^{π_{k+1}} ≥ Q^{π_k}:

    Q^{π_k} = r + γ P^{π_k} Q^{π_k} ≤ r + γ P^{π_{k+1}} Q^{π_k} ≤ Σ_{t=0}^∞ γ^t (P^{π_{k+1}})^t r = Q^{π_{k+1}},

where we have used that π_{k+1} is the greedy policy with respect to Q^{π_k} in the first inequality and recursion in the second inequality. Using this,
    Q^{π_{k+1}}(s, a) = r(s, a) + γ E_{s'∼P(·|s,a)}[ Q^{π_{k+1}}(s', π_{k+1}(s')) ]
                      ≥ r(s, a) + γ E_{s'∼P(·|s,a)}[ Q^{π_k}(s', π_{k+1}(s')) ]
                      = r(s, a) + γ E_{s'∼P(·|s,a)}[ max_{a'∈A} Q^{π_k}(s', a') ] = T Q^{π_k}(s, a),

which proves the first claim. For the second claim,

    ‖Q^{π_{k+1}} − Q*‖_∞ = ‖Q* − Q^{π_{k+1}}‖_∞ ≤ ‖Q* − T Q^{π_k}‖_∞ = ‖T Q* − T Q^{π_k}‖_∞ ≤ γ ‖Q^{π_k} − Q*‖_∞,

where we have used that Q* ≥ Q^{π_{k+1}} ≥ T Q^{π_k} in the second step and the contraction property of T (see Lemma 1.10) in the last step.
With this lemma, a convergence rate for the policy iteration algorithm immediately follows.
Theorem 1.14 (Policy iteration convergence). Let π_0 be any initial policy. For k ≥ log( 1 / ((1 − γ) ε) ) / (1 − γ), the k-th policy in policy iteration has the following performance bound:

    Q^{π_k} ≥ Q* − ε 1.
Iteration complexity for an exact solution. With regards to computing an exact optimal policy, it is clear from the previous results that policy iteration is no worse than value iteration. However, with regards to obtaining an exact solution with a runtime that is independent of the bit complexity L(P, r, γ) (where we assume basic arithmetic operations on real numbers have unit cost), improvements are possible. Naively, the number of iterations of policy iteration is bounded by the number of policies, namely |A|^{|S|}; here, a small improvement is possible, where the number of iterations of policy iteration can be bounded by |A|^{|S|} / |S|. Remarkably, for a fixed value of γ, policy iteration can be shown to be a strongly polynomial time algorithm, where policy iteration finds an exact optimal policy in at most ( |S|²|A| / (1 − γ) ) log( |S|² / (1 − γ) ) iterations. See Table 0.1 for a summary, and Section 1.6 for references.
1.3.3 Value Iteration for Finite Horizon MDPs

Let us now specify the value iteration algorithm for finite-horizon MDPs. For the finite-horizon setting, it turns out that the analogues of value iteration and policy iteration lead to identical algorithms. The value iteration algorithm is specified as follows:

    1. Set Q_{H−1}(s, a) = r_{H−1}(s, a).
    2. For h = H − 2, . . . , 0, set:

        Q_h(s, a) = r_h(s, a) + E_{s'∼P_h(·|s,a)}[ max_{a'∈A} Q_{h+1}(s', a') ].

By Theorem 1.9, it follows that Q_h(s, a) = Q*_h(s, a) and that π(s, h) = argmax_{a∈A} Q*_h(s, a) is an optimal policy.
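A minimal sketch of this backward induction is given below, assuming the time-dependent transitions and rewards are stacked along a leading horizon axis (these array conventions are our own).

    import numpy as np

    def finite_horizon_value_iteration(P_h, r_h):
        """Backward induction for a finite-horizon MDP.

        P_h: (H, S, A, S) time-dependent transitions; r_h: (H, S, A) rewards.
        Returns Q of shape (H, S, A) and the time-dependent greedy policy of shape (H, S).
        """
        H, S, A = r_h.shape
        Q = np.zeros((H, S, A))
        V_next = np.zeros(S)                       # V_H = 0 by convention
        for h in range(H - 1, -1, -1):
            Q[h] = r_h[h] + P_h[h] @ V_next        # (S, A, S) @ (S,) -> (S, A)
            V_next = Q[h].max(axis=1)
        return Q, Q.argmax(axis=2)                 # pi[h, s] = argmax_a Q_h(s, a)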
It is helpful to understand an alternative approach to finding an optimal policy for a known MDP. With regards to computation, consider the setting where our MDP M = (S, A, P, r, γ, µ) is known and P, r, and γ are all specified by rational numbers. Here, from a computational perspective, the previous iterative algorithms are, strictly speaking, not polynomial time algorithms, because they depend polynomially on 1/(1 − γ), which is not polynomial in the description length of the MDP. In particular, note that any rational value of 1 − γ may be specified with only O(log(1/(1 − γ))) bits of precision. In this context, we may hope for a fully polynomial time algorithm, when given knowledge of the MDP, whose computation time depends polynomially on the description length of the MDP M when the parameters are specified as rational numbers. We now see that the LP approach provides such a polynomial time algorithm.
Concretely, consider the linear program

    min_{V ∈ R^{|S|}}  Σ_s µ(s) V(s)   subject to   V(s) ≥ r(s, a) + γ Σ_{s'} P(s' | s, a) V(s')  for all (s, a) ∈ S × A.

Provided that µ has full support, the optimal value function V*(s) is the unique solution to this linear program. With regards to computation time, linear programming approaches only depend on the description length of the coefficients in the program, because this determines the computational complexity of basic additions and multiplications. Thus, this approach will only depend on the bit-length description of the MDP, when the MDP is specified by rational numbers.
Computational complexity for an exact solution. Table 0.1 shows the runtime complexity for the LP approach,
where we assume a standard runtime for solving a linear program. The strongly polynomial algorithm is an interior
point algorithm. See Section 1.6 for references.
Policy iteration and the simplex algorithm. It turns out that the policy iteration algorithm is actually the simplex
method with block pivot. While the simplex method, in general, is not a strongly polynomial time algorithm, the
policy iteration algorithm is a strongly polynomial time algorithm, provided we keep the discount factor fixed. See
[Ye, 2011].
For a fixed (possibly stochastic) policy π, let us define a visitation measure over states and actions induced by following π after starting at s_0. Precisely, define this distribution, d^π_{s_0}, as follows:

    d^π_{s_0}(s, a) := (1 − γ) Σ_{t=0}^∞ γ^t Pr^π(s_t = s, a_t = a | s_0),        (0.8)

where Pr^π(s_t = s, a_t = a | s_0) is the probability that s_t = s and a_t = a after starting at state s_0 and following π thereafter. It is straightforward to verify that d^π_{s_0} is a distribution over S × A. We also overload notation and write:

    d^π_µ(s, a) = E_{s_0∼µ}[ d^π_{s_0}(s, a) ]
for a distribution µ over S. Recall that Lemma 1.6 provides a way to easily compute d^π_µ(s, a) through an appropriate vector-matrix multiplication.
It is straightforward to verify that d^π_µ satisfies, for all states s ∈ S:

    Σ_a d^π_µ(s, a) = (1 − γ) µ(s) + γ Σ_{s',a'} P(s | s', a') d^π_µ(s', a').
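As a quick numerical check of this identity, the sketch below computes d^π_µ by reusing the hypothetical state_action_visitation and TabularMDP sketches given earlier and verifies the constraint directly.

    import numpy as np

    def check_flow_constraint(mdp, pi, tol=1e-8):
        """Check sum_a d(s, a) = (1 - gamma) mu(s) + gamma sum_{s2, a2} P(s | s2, a2) d(s2, a2)."""
        S, A = mdp.r.shape
        D = state_action_visitation(mdp, pi)            # rows are d^pi_{(s, a)}
        start = (mdp.mu[:, None] * pi).reshape(S * A)   # Pr(s_0 = s, a_0 = a) under mu and pi
        d = start @ D                                   # d^pi_mu, flattened over (s, a)
        lhs = d.reshape(S, A).sum(axis=1)
        rhs = (1 - mdp.gamma) * mdp.mu + mdp.gamma * (d @ mdp.P.reshape(S * A, S))
        return np.max(np.abs(lhs - rhs)) < tol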
Let K_µ denote the set of all vectors d ∈ R^{|S||A|} such that d ≥ 0 (componentwise) and, for all states s ∈ S, Σ_a d(s, a) = (1 − γ) µ(s) + γ Σ_{s',a'} P(s | s', a') d(s', a'). We now see that this set precisely characterizes all state-action visitation distributions.
Proposition 1.15. We have that K_µ is equal to the set of all feasible state-action distributions, i.e. d ∈ K_µ if and only if there exists a stationary (and possibly randomized) policy π such that d^π_µ = d.
Note that K_µ is itself a polytope, and one can verify that the linear program

    max_d Σ_{s,a} d(s, a) r(s, a)   subject to   d ∈ K_µ

is indeed the dual of the aforementioned LP. This provides an alternative approach to finding an optimal solution.
If d* is the solution to this LP, and provided that µ has full support, then we have that

    π*(a | s) = d*(s, a) / Σ_{a'} d*(s, a')

is an optimal policy. An alternative optimal policy is the deterministic policy that selects argmax_a d*(s, a) at each state s (and these policies are identical if the optimal policy is unique).
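Here is a minimal sketch of this extraction step, assuming d is an (S, A) array with strictly positive state marginals (as is guaranteed when µ has full support).

    import numpy as np

    def policy_from_visitation(d):
        """Return pi(a | s) = d(s, a) / sum_{a2} d(s, a2) from a state-action distribution d."""
        state_marginal = d.sum(axis=1, keepdims=True)   # assumed strictly positive
        return d / state_marginal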
1.4 Sample Complexity and Sampling Models
Much of reinforcement learning is concerned with finding a near optimal policy (or obtaining near optimal reward) in settings where the MDP is not known to the learner. We will study these questions in a few different models of how the agent obtains information about the unknown underlying MDP. In each of these settings, we are interested in understanding the number of samples required to find a near optimal policy, i.e. the sample complexity. Ultimately, we are interested in obtaining results which are applicable to cases where the number of states and actions is large (or, possibly, countably or uncountably infinite). This is in many ways analogous to the question of generalization in supervised learning, though, as we shall see, this question is fundamentally more challenging in the reinforcement learning setting.
The Episodic Setting. In the episodic setting, in every episode the learner acts for some finite number of steps, starting from a fixed starting state s_0 ∼ µ; the learner observes the trajectory, and the state then resets to s_0 ∼ µ. This episodic model of feedback is applicable to both the finite-horizon and infinite-horizon settings.
• (Finite Horizon MDPs) Here, each episode lasts for H-steps, and then the state is reset to s0 ⇠ µ.
• (Infinite Horizon MDPs) Even for infinite horizon MDPs it is natural to work in an episodic model for learning, where each episode terminates after a finite number of steps. Here, it is often natural to assume either that the agent can terminate the episode at will or that the episode terminates at each step with probability 1 − γ. After termination, we again assume that the state is reset to s_0 ∼ µ. Note that, if each step in an episode is terminated with probability 1 − γ, then the observed cumulative reward in an episode of a policy provides an unbiased estimate of the infinite-horizon, discounted value of that policy.
In this setting, we are often interested either in the number of episodes it takes to find a near optimal policy, which is a PAC (probably approximately correct) guarantee, or in a regret guarantee (which we will study in Chapter 7). Both of these questions concern the statistical complexity (i.e. the sample complexity) of learning.
The episodic setting is challenging in that the agent has to engage in some exploration in order to gain information at
the relevant state. As we shall see in Chapter 7, this exploration must be strategic, in the sense that simply behaving
randomly will not lead to information being gathered quickly enough. It is often helpful to study the statistical com-
plexity of learning in a more abstract sampling model, a generative model, which allows us to avoid having to directly address this exploration issue. Furthermore, this sampling model is natural in its own right.
The generative model setting. A generative model takes as input a state-action pair (s, a) and returns a sample s' ∼ P(·|s, a) and the reward r(s, a) (or a sample of the reward if the rewards are stochastic).
The offline RL setting. The offline RL setting is where the agent has access to an offline dataset, say generated under some policy (or a collection of policies). In the simplest of these settings, we may assume our dataset is of the form {(s, a, s', r)}, where r is the reward (corresponding to r(s, a) if the reward is deterministic) and s' ∼ P(·|s, a). Furthermore, for simplicity, it can be helpful to assume that the (s, a) pairs in this dataset were sampled i.i.d. from some fixed distribution ν over S × A.
The advantage A^π(s, a) of a policy π is defined as

    A^π(s, a) := Q^π(s, a) − V^π(s).

Note that

    A*(s, a) := A^{π*}(s, a) ≤ 0

for all state-action pairs.
Analogous to the state-action visitation distribution (see Equation 0.8), we can define a visitation measure over just the states. When clear from context, we will overload notation and also denote this distribution by d^π_{s_0}, where:

    d^π_{s_0}(s) = (1 − γ) Σ_{t=0}^∞ γ^t Pr^π(s_t = s | s_0).        (0.9)

Here, Pr^π(s_t = s | s_0) is the state visitation probability under π, starting at state s_0. Again, we write:

    d^π_µ(s) = E_{s_0∼µ}[ d^π_{s_0}(s) ]

for a distribution µ over S.
The following lemma is helpful in the analysis of RL algorithms.
Lemma 1.16 (The performance difference lemma). For all policies π, π' and distributions µ over S,

    V^π(µ) − V^{π'}(µ) = (1 / (1 − γ)) E_{s'∼d^π_µ}[ E_{a'∼π(·|s')}[ A^{π'}(s', a') ] ].
Proof: Let Pr^π(τ | s_0 = s) denote the probability of observing a trajectory τ when starting in state s and following the policy π. By definition of d^π_{s_0}, observe that for any function f : S × A → R,

    E_{τ∼Pr^π}[ Σ_{t=0}^∞ γ^t f(s_t, a_t) ] = (1 / (1 − γ)) E_{s∼d^π_{s_0}}[ E_{a∼π(·|s)}[ f(s, a) ] ].        (0.10)
1.6 Bibliographic Remarks and Further Reading
We refer the reader to [Puterman, 1994] for a more detailed treatment of dynamic programming and MDPs. [Puterman, 1994] also contains a thorough treatment of the dual LP, along with a proof of Proposition 1.15.
With regards to the computational complexity of policy iteration, [Ye, 2011] showed that policy iteration is a strongly polynomial time algorithm for a fixed discount factor γ.¹ Also, see [Ye, 2011] for a good summary of the computational complexities of various approaches. [Mansour and Singh, 1999] showed that the number of iterations of policy iteration can be bounded by |A|^{|S|} / |S|.
With regards to a strongly polynomial algorithm, the CIPA algorithm [Ye, 2005] is an interior point algorithm with the
claimed runtime in Table 0.1.
Lemma 1.11 is due to Singh and Yee [1994].
The performance difference lemma is due to [Kakade and Langford, 2002, Kakade, 2003], though the lemma was
implicit in the analysis of a number of prior works.
¹The stated strongly polynomial runtime in Table 0.1 for policy iteration differs from that in [Ye, 2011] because we assume that the runtime per iteration of policy iteration is |S|³ + |S|²|A| (see the caption of Table 0.1).
Chapter 2
This chapter begins our study of the sample complexity, where we focus on the (minmax) number of transitions we need to observe in order to accurately estimate Q* or in order to find a near optimal policy. We assume that we have access to a generative model (as defined in Section 1.4) and that the reward function is deterministic (the latter is often a mild assumption, since much of the difficulty in RL is due to the uncertainty in the transition model P).

This chapter follows the results due to [Azar et al., 2013], along with some improved rates due to [Agarwal et al., 2020c]. One of the key observations in this chapter is that we can find a near optimal policy using a number of observed transitions that is sublinear in the model size, i.e. using a number of samples that is smaller than O(|S|²|A|). In other words, we do not need to learn an accurate model of the world in order to learn to act near optimally.
Notation. We define M̂ to be the empirical MDP that is identical to the original M, except that it uses P̂ instead of P for the transition model. When clear from context, we drop the subscript M on the values and action values (and on the one-step variances and variances, which we define later). We let V̂^π, Q̂^π, Q̂*, and π̂* denote the value function, state-action value function, optimal state-action value function, and optimal policy in M̂, respectively.
A central question in this chapter is: do we require an accurate model of the world in order to find a near optimal policy? Recall that a generative model takes as input a state-action pair (s, a) and returns a sample s' ∼ P(·|s, a) and the reward r(s, a) (or a sample of the reward if the rewards are stochastic). Let us consider the most naive approach to learning (when we have access to a generative model): suppose we call our simulator N times at each state-action pair. Let P̂ be our empirical model, defined as follows:

    P̂(s' | s, a) = count(s', s, a) / N,

where count(s', s, a) is the number of times the state-action pair (s, a) transitions to state s'. As N is the number of calls for each state-action pair, the total number of calls to our generative model is |S||A|N. As before, we can view P̂ as a matrix of size |S||A| × |S|.
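Here is a minimal sketch of this naive estimator, assuming access to a hypothetical sampler sample_next_state(s, a) that returns an index s' drawn from P(·|s, a); the function name is our own, not the text's.

    import numpy as np

    def empirical_model(sample_next_state, S, A, N):
        """Estimate P_hat(s2 | s, a) = count(s2, s, a) / N from N generative-model calls per (s, a)."""
        counts = np.zeros((S, A, S))
        for s in range(S):
            for a in range(A):
                for _ in range(N):
                    counts[s, a, sample_next_state(s, a)] += 1
        return counts / N      # shape (S, A, S); can be viewed as an |S||A| x |S| matrix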
Note that since P has |S|²|A| parameters, we would expect that observing O(|S|²|A|) transitions is sufficient to provide us with an accurate model. The following proposition shows that this is the case.

Proposition 2.1. There exists an absolute constant c such that the following holds. Suppose ε ∈ (0, 1/(1 − γ)) and that we obtain

    # samples from generative model = |S||A|N ≥ |S|²|A| log(c|S||A|/δ) / ( (1 − γ)⁴ ε² ),

where we uniformly sample every state-action pair. Then, with probability greater than 1 − δ, we have:

    ‖Q^π − Q̂^π‖_∞ ≤ ε   for all policies π.
Before we provide the proof, the following lemmas will be helpful throughout:
Lemma 2.2 (Simulation Lemma). For all π we have that:

    Q^π − Q̂^π = γ (I − γ P̂^π)^{-1} (P − P̂) V^π.
Proof: Using our matrix equality for Q^π (see Equation 0.2), we have:

    Q^π − Q̂^π = (I − γ P^π)^{-1} r − (I − γ P̂^π)^{-1} r
              = (I − γ P̂^π)^{-1} ( (I − γ P̂^π) − (I − γ P^π) ) Q^π
              = γ (I − γ P̂^π)^{-1} ( P^π − P̂^π ) Q^π
              = γ (I − γ P̂^π)^{-1} (P − P̂) V^π,

where the last step uses that P^π Q^π = P V^π (and likewise for P̂).
We will also make use of the following simple bound.

Lemma 2.3. For any policy π and any vector v ∈ R^{|S||A|}, we have ‖(I − γ P̂^π)^{-1} v‖_∞ ≤ ‖v‖_∞ / (1 − γ) (and likewise with P in place of P̂).

Proof: Let w = (I − γ P̂^π)^{-1} v, so that v = (I − γ P̂^π) w. Then

    ‖v‖_∞ = ‖w − γ P̂^π w‖_∞ ≥ ‖w‖_∞ − γ ‖P̂^π w‖_∞ ≥ (1 − γ) ‖w‖_∞,

where the final inequality follows since each element of P̂^π w is an average of the elements of w, by the definition of P̂^π, so that ‖P̂^π w‖_∞ ≤ ‖w‖_∞. Rearranging terms completes the proof.
Now we are ready to complete the proof of our proposition.
Proof: Using the concentration of a distribution in the ℓ₁ norm (Lemma A.8), we have that, for a fixed (s, a), with probability greater than 1 − δ,

    ‖P̂(·|s, a) − P(·|s, a)‖₁ ≤ c √( |S| log(1/δ) / m ),

where m is the number of samples used to estimate P̂(·|s, a). The first claim now follows by the union bound (and redefining δ and c appropriately).
For the second claim, we have that:

    ‖Q^π − Q̂^π‖_∞ = γ ‖(I − γ P̂^π)^{-1} (P − P̂) V^π‖_∞ ≤ (γ / (1 − γ)) ‖(P − P̂) V^π‖_∞
                  ≤ (γ / (1 − γ)) max_{s,a} ‖P(·|s, a) − P̂(·|s, a)‖₁ ‖V^π‖_∞ ≤ (γ / (1 − γ)²) max_{s,a} ‖P(·|s, a) − P̂(·|s, a)‖₁,

where the penultimate step uses Hölder's inequality. The second claim now follows.
For the final claim, first observe that |sup_x f(x) − sup_x g(x)| ≤ sup_x |f(x) − g(x)|, where f and g are real-valued functions. This implies:

    |Q̂*(s, a) − Q*(s, a)| = | sup_π Q̂^π(s, a) − sup_π Q^π(s, a) | ≤ sup_π |Q̂^π(s, a) − Q^π(s, a)| ≤ ε,

which proves the first inequality. The second inequality is left as an exercise to the reader.
In the previous approach, we are able to accurately estimate the value of every policy in the unknown MDP M. However, with regards to planning, we only need an accurate estimate Q̂* of Q*, which we may hope requires fewer samples. Let us now see that the model based approach can be refined to obtain minmax optimal sample complexity, which we will see is sublinear in the model size.

We will state our results in terms of N; recall that N is the number of calls to the generative model per state-action pair, so that:

    # samples from generative model = |S||A|N.

Let us start with a crude bound on the optimal action-values, which provides a sublinear rate. In the next section, we will improve upon this to obtain the minmax optimal rate.
Proposition 2.4 (Crude Value Bounds). Let δ ≥ 0. With probability greater than 1 − δ,

    ‖Q* − Q̂*‖_∞ ≤ Δ_{δ,N},
    ‖Q* − Q̂^{π*}‖_∞ ≤ Δ_{δ,N},

where:

    Δ_{δ,N} := (γ / (1 − γ)²) √( 2 log(2|S||A|/δ) / N ).
Note that the first inequality above shows a sublinear rate on estimating the value function. Ultimately, we are interested in the value V^{π̂*} obtained when we execute π̂*, not just an estimate Q̂* of Q*. Here, by Lemma 1.11, we lose an additional horizon factor and have:

    ‖Q* − Q^{π̂*}‖_∞ ≤ (2 / (1 − γ)) Δ_{δ,N}.

As we see in Theorem 2.6, this is improvable.
Before we provide the proof, the following lemma will be helpful throughout.
Lemma 2.5 (Component-wise Bounds). We have that:

    Q* − Q̂* ≤ γ (I − γ P̂^{π*})^{-1} (P − P̂) V*,
    Q* − Q̂* ≥ γ (I − γ P̂^{π̂*})^{-1} (P − P̂) V*.

Proof: For the first claim,

    Q* − Q̂* = Q^{π*} − Q̂^{π̂*} ≤ Q^{π*} − Q̂^{π*} = γ (I − γ P̂^{π*})^{-1} (P − P̂) V*,

where the inequality uses Q̂^{π̂*} ≥ Q̂^{π*}, and where we have used Lemma 2.2 in the final step. This proves the first claim.
For the second claim,

    Q* − Q̂* = Q^{π*} − Q̂^{π̂*}
            = (I − γ P^{π*})^{-1} r − (I − γ P̂^{π̂*})^{-1} r
            = (I − γ P̂^{π̂*})^{-1} ( (I − γ P̂^{π̂*}) − (I − γ P^{π*}) ) Q^{π*}
            = γ (I − γ P̂^{π̂*})^{-1} ( P^{π*} − P̂^{π̂*} ) Q*
            ≥ γ (I − γ P̂^{π̂*})^{-1} ( P^{π*} − P̂^{π*} ) Q*
            = γ (I − γ P̂^{π̂*})^{-1} (P − P̂) V*,

where the inequality follows from P̂^{π̂*} Q* ≤ P̂^{π*} Q*, due to the optimality of π* (together with the entrywise non-negativity of (I − γ P̂^{π̂*})^{-1}). This proves the second claim.
Proof: Following from the simulation lemma (Lemma 2.2) and Lemma 2.3, we have:

    ‖Q* − Q̂^{π*}‖_∞ ≤ (γ / (1 − γ)) ‖(P − P̂) V*‖_∞.
Also, the previous lemma implies that:

    ‖Q* − Q̂*‖_∞ ≤ (γ / (1 − γ)) ‖(P − P̂) V*‖_∞.
By applying Hoeffding's inequality and the union bound,

    ‖(P − P̂) V*‖_∞ = max_{s,a} | E_{s'∼P(·|s,a)}[ V*(s') ] − E_{s'∼P̂(·|s,a)}[ V*(s') ] | ≤ (1 / (1 − γ)) √( 2 log(2|S||A|/δ) / N ),

which holds with probability greater than 1 − δ. This completes the proof.
2.3 Minmax Optimal Sample Complexity (and the Model Based Approach)
We now see that the model based approach is minmax optimal, for both the discounted case and the finite horizon
setting.
Theorem 2.6. There exists an absolute constant c such that, for δ ≥ 0, the following hold:

• (Value estimation) With probability greater than 1 − δ,

    ‖Q* − Q̂*‖_∞ ≤ c √( log(c|S||A|/δ) / ((1 − γ)³ N) ) + c log(c|S||A|/δ) / ((1 − γ)³ N).

• (Sub-optimality) If N ≥ 1 / (1 − γ)², then with probability greater than 1 − δ,

    ‖Q* − Q^{π̂*}‖_∞ ≤ c √( log(c|S||A|/δ) / ((1 − γ)³ N) ).
In particular, if

    # samples from generative model = |S||A|N ≥ c |S||A| log(c|S||A|/δ) / ( (1 − γ)³ ε² ),

then with probability greater than 1 − δ,

    ‖Q* − Q^{π̂*}‖_∞ ≤ ε.
We only prove the first claim in Theorem 2.6, on the estimation accuracy. With regards to the sub-optimality, note that Lemma 1.11 already implies a sub-optimality gap, though with an amplification of the estimation error by 2/(1 − γ). The argument for the improvement provided in the second claim is more involved (see Section 2.6 for further discussion).
Lower Bounds. Let us say that an estimation algorithm A, which is a map from samples to an estimate Q̂*, is (ε, δ)-good on MDP M if ‖Q* − Q̂*‖_∞ ≤ ε holds with probability greater than 1 − δ.
Theorem 2.8. There exist ε₀, δ₀, c and a set of MDPs M such that, for ε ∈ (0, ε₀) and δ ∈ (0, δ₀), if algorithm A is (ε, δ)-good on all M ∈ M, then A must use a number of samples that is lower bounded as follows:

    # samples from generative model ≥ c |S||A| log(c|S||A|/δ) / ( (1 − γ)³ ε² ).
In other words, this theorem shows that the model based approach is minmax optimal.
Recall the setting of finite horizon MDPs defined in Section 1.2. Again, we can consider the most naive approach to learning (when we have access to a generative model): suppose we call our simulator N times for every (s, a, h) ∈ S × A × [H], i.e. we obtain N i.i.d. samples s' ∼ P_h(·|s, a) for every (s, a, h) ∈ S × A × [H]. Note that the total number of observed transitions is H|S||A|N.
Upper bounds. The following theorem provides an upper bound for the model based approach.

Theorem 2.9. For δ ≥ 0, with probability greater than 1 − δ, we have:

• (Value estimation)

    ‖Q*_0 − Q̂*_0‖_∞ ≤ c H √( log(c|S||A|/δ) / N ) + c H log(c|S||A|/δ) / N,

• (Sub-optimality)

    ‖Q*_0 − Q^{π̂*}_0‖_∞ ≤ c H √( log(c|S||A|/δ) / N ) + c H log(c|S||A|/δ) / N,

where c is an absolute constant.
Note that the above bound requires N to be O(H²) in order to achieve an ε-optimal policy, while in the discounted case, we require N to be O(1/(1 − γ)³) for the same guarantee. While this may seem like an improvement by a horizon factor, recall that for the finite horizon case, N corresponds to observing O(H) more transitions than in the discounted case.
Lower Bounds. In the minmax sense of Theorem 2.8, the previous upper bound provided by the model based
approach for the finite horizon setting achieves the minmax optimal sample complexity.
2.4 Analysis

The key to the sharper analysis is to more precisely characterize the variance in our estimates. Denote the variance of any real-valued function f under a distribution D as:

    Var_D(f) := E_{x∼D}[ f(x)² ] − ( E_{x∼D}[ f(x) ] )².

Slightly abusing the notation, for V ∈ R^{|S|}, we define the vector Var_P(V) ∈ R^{|S||A|} as Var_P(V)(s, a) := Var_{P(·|s,a)}(V). Equivalently,

    Var_P(V) = P(V)² − (P V)²,

where (·)² denotes component-wise squaring.
Proof: The claims follow from Bernstein's inequality, along with a union bound over all state-action pairs.

The key ideas in the proof are in how we bound ‖(I − γ P̂^{π*})^{-1} √( Var_P(V*) )‖_∞ and ‖(I − γ P̂^{π̂*})^{-1} √( Var_P(V*) )‖_∞.
It is helpful to define Σ^π_M as the variance of the discounted reward, i.e.

    Σ^π_M(s, a) := E[ ( Σ_{t=0}^∞ γ^t r(s_t, a_t) − Q^π_M(s, a) )² | s_0 = s, a_0 = a ],

where the expectation is with respect to the trajectories induced by π in M. It is straightforward to verify that ‖Σ^π_M‖_∞ ≤ γ² / (1 − γ)².
The following lemma shows that Σ^π_M satisfies a Bellman consistency condition.

Lemma 2.11 (Bellman consistency of Σ). For any MDP M,

    Σ^π_M = γ² Var_P(V^π_M) + γ² P^π Σ^π_M,        (0.1)

where P is the transition model in MDP M.
Lemma 2.12. For any policy π in an MDP M with transition model P,

    ‖(I − γ P^π)^{-1} √( Var_P(V^π_M) )‖_∞ ≤ √( 2 / (1 − γ)³ ).

Proof: Note that (1 − γ)(I − γ P^π)^{-1} is a matrix whose rows are probability distributions. For a positive vector v and a distribution ν (where ν is a vector of the same dimension as v), Jensen's inequality implies that ν · √v ≤ √(ν · v). This implies:

    ‖(I − γ P^π)^{-1} √v‖_∞ = (1 / (1 − γ)) ‖(1 − γ)(I − γ P^π)^{-1} √v‖_∞
                            ≤ √( (1 / (1 − γ)) ‖(I − γ P^π)^{-1} v‖_∞ )
                            ≤ √( (2 / (1 − γ)) ‖(I − γ² P^π)^{-1} v‖_∞ ),

where we have used that ‖(I − γ P^π)^{-1} v‖_∞ ≤ 2 ‖(I − γ² P^π)^{-1} v‖_∞ (which we will prove shortly).
2.4.2 Completing the proof
where

    Δ'_{δ,N} := √( 18 log(6|S||A|/δ) / ((1 − γ)² N) ) + 4 log(6|S||A|/δ) / ((1 − γ)⁴ N).
Proof: By definition,

    Var_P(V*) = Var_P(V*) − Var_P̂(V*) + Var_P̂(V*)
              = P(V*)² − (P V*)² − P̂(V*)² + (P̂ V*)² + Var_P̂(V*)
              = (P − P̂)(V*)² − ( (P V*)² − (P̂ V*)² ) + Var_P̂(V*).

Now we bound each of these terms with Hoeffding's inequality and the union bound. For the first term, with probability greater than 1 − δ,

    ‖(P − P̂)(V*)²‖_∞ ≤ (1 / (1 − γ)²) √( 2 log(2|S||A|/δ) / N ).

For the second term, again with probability greater than 1 − δ,

    ‖(P V*)² − (P̂ V*)²‖_∞ ≤ ‖P V* + P̂ V*‖_∞ ‖P V* − P̂ V*‖_∞ ≤ (2 / (1 − γ)) ‖(P − P̂) V*‖_∞ ≤ (2 / (1 − γ)²) √( 2 log(2|S||A|/δ) / N ),

where we have used that (·)² is a component-wise operation in the second step. For the last term:

    Var_P̂(V*) = Var_P̂(V* − V̂^{π*} + V̂^{π*})
              ≤ 2 Var_P̂(V* − V̂^{π*}) + 2 Var_P̂(V̂^{π*})
              ≤ 2 ‖V* − V̂^{π*}‖²_∞ + 2 Var_P̂(V̂^{π*})
              ≤ 2 Δ²_{δ,N} + 2 Var_P̂(V̂^{π*}),

where Δ_{δ,N} is defined in Proposition 2.4. To obtain a cumulative probability of error less than δ, we replace δ in the above claims with δ/3. Combining these bounds completes the proof of the first claim. The argument in the above display also implies that Var_P̂(V*) ≤ 2 Δ²_{δ,N} + 2 Var_P̂(V̂*), which proves the second claim.
Using Lemmas 2.10 and 2.13, we have the following corollary.

Corollary 2.14. Let δ ≥ 0. With probability greater than 1 − δ, we have:

    |(P − P̂) V*| ≤ c √( Var_P̂(V̂^{π*}) log(c|S||A|/δ) / N ) + Δ''_{δ,N} 1,
    |(P − P̂) V*| ≤ c √( Var_P̂(V̂*) log(c|S||A|/δ) / N ) + Δ''_{δ,N} 1,

where

    Δ''_{δ,N} := c ( (1 / (1 − γ)) ( log(c|S||A|/δ) / N )^{3/4} + (1 / (1 − γ)²) log(c|S||A|/δ) / N ),

and where c is an absolute constant.
Proof: (of Theorem 2.6) The proof consists of bounding the terms in Lemma 2.5. We have:

    Q* − Q̂* ≤ γ ‖(I − γ P̂^{π*})^{-1} (P − P̂) V*‖_∞
            ≤ c √( log(c|S||A|/δ) / N ) ‖(I − γ P̂^{π*})^{-1} √( Var_P̂(V̂^{π*}) )‖_∞ + c ( log(c|S||A|/δ) / ((1 − γ)² N) )^{3/4} + c log(c|S||A|/δ) / ((1 − γ)³ N)
            ≤ c √( 2 log(c|S||A|/δ) / ((1 − γ)³ N) ) + c ( log(c|S||A|/δ) / ((1 − γ)² N) )^{3/4} + c log(c|S||A|/δ) / ((1 − γ)³ N)
            ≤ 3c √( log(c|S||A|/δ) / ((1 − γ)³ N) ) + 2c log(c|S||A|/δ) / ((1 − γ)³ N),

where the first step uses Corollary 2.14; the second uses Lemma 2.12; and the last step uses that 2ab ≤ a² + b² (choosing a and b appropriately). The proof of the lower bound is analogous. Taking a different absolute constant completes the proof.
It will be helpful to understand, from a dimensional analysis viewpoint, why 1/(1 − γ)³ is the effective horizon dependency one might hope to expect. Because Q* is a quantity that can be as large as 1/(1 − γ), it is natural, to account for this scaling, to look at obtaining relative accuracy. In particular, if

    N ≥ c |S||A| log(c|S||A|/δ) / ( (1 − γ) ε² ),

then with probability greater than 1 − δ,

    ‖Q* − Q^{π̂*}‖_∞ ≤ ε / (1 − γ)   and   ‖Q* − Q̂*‖_∞ ≤ ε / (1 − γ)

(provided that ε ≤ √(1 − γ), using Theorem 2.6). In other words, if we had normalized the value functions¹, then for additive accuracy ε (on our normalized value functions) our sample size would scale linearly with the effective horizon.
The notion of a generative model was first introduced in [Kearns and Singh, 1999], which made the argument that,
up to horizon factors and logarithmic factors, both model based methods and model free methods are comparable.
[Kakade, 2003] gave an improved version of this rate (analogous to the crude bounds seen here).
The first claim in Theorem 2.6 is due to [Azar et al., 2013], and the proof in this section largely follows this work.
Improvements are possible with regards to bounding the quality of π̂*; here, Theorem 2.6 shows that the model based approach is near optimal even for the policy itself, i.e. that the quality of π̂* does not suffer an amplification factor of 1/(1 − γ). [Sidford et al., 2018] provides the first proof of this improvement, using a variance reduction algorithm with value iteration. The second claim in Theorem 2.6 is due to [Agarwal et al., 2020c], which shows that the naive model based approach is sufficient. The lower bound in Theorem 2.8 is due to [Azar et al., 2013].
¹Rescaling the value functions by multiplying by (1 − γ), i.e. Q^π ← (1 − γ) Q^π, would keep the values bounded between 0 and 1. Throughout this book, it is helpful to understand sample size with regards to normalized quantities.
We also remark that we may hope for the sub-optimality bounds (on the value of the argmax policy) to hold for “large” ε, i.e. up to ε ≤ 1/(1 − γ) (see the second claim in Theorem 2.6). Here, the work in [Li et al., 2020] shows this limit is achievable, albeit with a slightly different algorithm in which they introduce perturbations. It is currently an open question whether the naive model based approach also achieves this non-asymptotic statistical limit.
This chapter also provided results, without proof, on the optimal sample complexity in the finite horizon setting (see
Section 2.3.2). The proof of this claim would also follow from the line of reasoning in [Azar et al., 2013], with
the added simplification that the sub-optimality analysis is simpler in the finite-horizon setting with time-dependent
transition matrices (e.g. see [Yin et al., 2021]).