1.1 Discounted (Infinite-Horizon) Markov Decision Processes
In reinforcement learning, the interactions between the agent and the environment are often described by an infinite-horizon, discounted Markov Decision Process (MDP) M = (S, A, P, r, γ, µ), specified by:
• A state space S, which may be finite or infinite. For mathematical convenience, we will assume that S is finite
or countably infinite.
• An action space A, which also may be discrete or infinite. For mathematical convenience, we will assume that
A is finite.
• A transition function P : S × A → Δ(S), where Δ(S) is the space of probability distributions over S (i.e., the probability simplex). P(s′|s, a) is the probability of transitioning into state s′ upon taking action a in state s. We use P_{s,a} to denote the vector P(·|s, a).
• A reward function r : S × A → [0, 1]. r(s, a) is the immediate reward associated with taking action a in state s. More generally, r(s, a) could be a random variable (where the distribution depends on s, a). While we largely focus on the case where r(s, a) is deterministic, the extension to methods with stochastic rewards is often straightforward.
• A discount factor γ ∈ [0, 1), which defines a horizon for the problem.
• An initial state distribution µ ∈ Δ(S), which specifies how the initial state s_0 is generated.
In many cases, we will assume that the initial state is fixed at s0 , i.e. µ is a distribution supported only on s0 .
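To make these objects concrete, the following is a minimal sketch (not from the text) of how a tabular MDP might be stored as dense NumPy arrays; the container name TabularMDP and its field names are our own illustrative choices.

    import numpy as np
    from dataclasses import dataclass

    @dataclass
    class TabularMDP:
        """A finite MDP M = (S, A, P, r, gamma, mu) stored as dense arrays."""
        P: np.ndarray      # shape (S, A, S); P[s, a, s2] = P(s2 | s, a)
        r: np.ndarray      # shape (S, A); rewards in [0, 1]
        gamma: float       # discount factor in [0, 1)
        mu: np.ndarray     # shape (S,); initial state distribution

        def __post_init__(self):
            assert np.allclose(self.P.sum(axis=-1), 1.0), "each P[s, a] must be a distribution"
            assert 0.0 <= self.gamma < 1.0
            assert np.isclose(self.mu.sum(), 1.0)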
Policies. In a given MDP M = (S, A, P, r, γ, µ), the agent interacts with the environment according to the following protocol: the agent starts at some state s_0 ∼ µ; at each time step t = 0, 1, 2, . . ., the agent takes an action a_t ∈ A, obtains the immediate reward r_t = r(s_t, a_t), and observes the next state s_{t+1} sampled according to s_{t+1} ∼ P(·|s_t, a_t).
The interaction record at time t,

    τ_t = (s_0, a_0, r_0, s_1, . . . , s_t, a_t, r_t),

is called a trajectory, which includes the observed state at time t.
In the most general setting, a policy specifies a decision-making strategy in which the agent chooses actions adaptively based on the history of observations; precisely, a policy is a (possibly randomized) mapping from a trajectory to an action, i.e. π : H → Δ(A), where H is the set of all possible trajectories (of all lengths) and Δ(A) is the space of probability distributions over A. A stationary policy π : S → Δ(A) specifies a decision-making strategy in which the agent chooses actions based only on the current state, i.e. a_t ∼ π(·|s_t). A deterministic, stationary policy is of the form π : S → A.
Values. We now define values for (general) policies. For a fixed policy π and a starting state s_0 = s, we define the value function V^π_M : S → R as the discounted sum of future rewards

    V^π_M(s) = E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | π, s_0 = s ],

where the expectation is with respect to the randomness of the trajectory, that is, the randomness in state transitions and the stochasticity of π. Here, since r(s, a) is bounded between 0 and 1, we have 0 ≤ V^π_M(s) ≤ 1/(1 − γ).
Similarly, the action-value (or Q-value) function Q^π_M : S × A → R is defined as

    Q^π_M(s, a) = E[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | π, s_0 = s, a_0 = a ].
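As an illustration of these definitions, here is a minimal sketch of estimating V^π(s) by truncated Monte Carlo rollouts, assuming the hypothetical TabularMDP container sketched above; the truncation horizon and rollout count are arbitrary choices, and truncating after H steps introduces a bias of at most γ^H/(1 − γ) since rewards lie in [0, 1].

    import numpy as np

    def rollout_value(mdp, pi, s0, horizon=200, n_rollouts=1000, seed=None):
        """Monte Carlo estimate of V^pi(s0); pi is an (S, A) array of action probabilities."""
        rng = np.random.default_rng(seed)
        S, A = mdp.r.shape
        total = 0.0
        for _ in range(n_rollouts):
            s, ret, discount = s0, 0.0, 1.0
            for _ in range(horizon):                      # truncated rollout
                a = rng.choice(A, p=pi[s])                # a_t ~ pi(.|s_t)
                ret += discount * mdp.r[s, a]
                discount *= mdp.gamma
                s = rng.choice(S, p=mdp.P[s, a])          # s_{t+1} ~ P(.|s_t, a_t)
            total += ret
        return total / n_rollouts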
Goal. Given a state s, the goal of the agent is to find a policy π that maximizes the value, i.e. the optimization problem the agent seeks to solve is:

    max_π V^π_M(s)        (0.1)

where the max is over all (possibly non-stationary and randomized) policies. As we shall see, there exists a deterministic and stationary policy which is simultaneously optimal for all starting states s.

We drop the dependence on M and write V^π when it is clear from context.
Example 1.1 (Navigation). Navigation is perhaps the simplest example of RL to visualize. The state of the agent is their current location. The four actions might be moving one step east, west, north, or south. The transitions in the simplest setting are deterministic: taking the north action moves the agent one step north of their location, assuming that the size of a step is standardized. The agent might have a goal state g they are trying to reach, and the reward is 0 until the agent reaches the goal and 1 upon reaching the goal state. Since the discount factor γ < 1, there is an incentive to reach the goal state earlier in the trajectory. As a result, the optimal behavior in this setting corresponds to finding the shortest path from the initial state to the goal state, and the value function of a state, given a policy, is γ^d, where d is the number of steps required by the policy to reach the goal state.
Example 1.2 (Conversational agent). This is another fairly natural RL problem. The state of the agent can be the current transcript of the conversation so far, along with any additional information about the world, such as the context for the conversation and characteristics of the other agents or humans in the conversation. Actions depend on the domain: in the most basic form, we can think of an action as the next statement to make in the conversation. Sometimes, conversational agents are designed for task completion, such as a travel assistant, tech support agent, or a virtual office receptionist. In these cases, there might be a predefined set of slots that the agent needs to fill before it can find a good solution. For instance, in the travel-assistant case, these might correspond to the dates, source, destination, and mode of travel. The actions might correspond to natural language queries to fill these slots.
In task completion settings, the reward is naturally defined as a binary outcome indicating whether the task was completed, such as whether the travel was successfully booked. Depending on the domain, we could further refine the reward based on the quality or the price of the travel package found. In more generic conversational settings, the ultimate reward is whether the conversation was satisfactory to the other agents or humans.
Example 1.3 (Strategic games). This is a popular category of RL applications, where RL has been successful in
achieving human level performance in Backgammon, Go, Chess, and various forms of Poker. The usual setting consists
of the state being the current game board, actions being the potential next moves and reward being the eventual win/loss
outcome or a more detailed score when one is defined by the game. Technically, these are multi-agent RL settings, and yet the algorithms used are often single-agent RL algorithms.
Proof: To see that I − γP^π is invertible, observe that for any non-zero vector x ∈ R^{|S||A|},

    ‖(I − γP^π) x‖_∞ = ‖x − γP^π x‖_∞
                     ≥ ‖x‖_∞ − γ ‖P^π x‖_∞        (triangle inequality for norms)
                     ≥ ‖x‖_∞ − γ ‖x‖_∞            (each element of P^π x is an average of entries of x)
                     = (1 − γ) ‖x‖_∞ > 0           (γ < 1, x ≠ 0)

which implies that I − γP^π is full rank.
The following is also a helpful lemma:
Lemma 1.6. We have that:

    [(1 − γ)(I − γP^π)^{-1}]_{(s,a),(s',a')} = (1 − γ) Σ_{t=0}^∞ γ^t Pr^π(s_t = s', a_t = a' | s_0 = s, a_0 = a),

so we can view the (s, a)-th row of this matrix as an induced distribution over states and actions when following π after starting with s_0 = s and a_0 = a.
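As a sanity check of Lemma 1.6, the sketch below forms the matrix (1 − γ)(I − γP^π)^{-1} explicitly for a small tabular MDP, reusing the hypothetical TabularMDP container from the earlier sketch; each row sums to one and can be read as the discounted state-action visitation distribution induced by starting from that (s, a).

    import numpy as np

    def state_action_visitation(mdp, pi):
        """Return the |S||A| x |S||A| matrix (1 - gamma) * (I - gamma * P^pi)^{-1}.

        pi is an (S, A) array of action probabilities, and
        P^pi[(s, a), (s2, a2)] = P(s2 | s, a) * pi(a2 | s2), with (s, a) flattened as s * A + a.
        """
        S, A = mdp.r.shape
        P_pi = (mdp.P.reshape(S * A, S)[:, :, None] * pi[None, :, :]).reshape(S * A, S * A)
        D = (1.0 - mdp.gamma) * np.linalg.inv(np.eye(S * A) - mdp.gamma * P_pi)
        return D   # row (s * A + a) is the visitation distribution when starting from (s, a)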
A remarkable and convenient property of MDPs is that there exists a stationary and deterministic policy that simultaneously maximizes V^π(s) for all s ∈ S. This is formalized in the following theorem:
Theorem 1.7. Let Π be the set of all non-stationary and randomized policies. Define:

    V*(s) := sup_{π∈Π} V^π(s),    Q*(s, a) := sup_{π∈Π} Q^π(s, a),

which are finite since V^π(s) and Q^π(s, a) are bounded between 0 and 1/(1 − γ).
There exists a stationary and deterministic policy π such that for all s ∈ S and a ∈ A,

    V^π(s) = V*(s),
    Q^π(s, a) = Q*(s, a).
Proof: For any π ∈ Π and for any time t, π specifies a distribution over actions conditioned on the history of observations; here, we write π(A_t = a | S_0 = s_0, A_0 = a_0, R_0 = r_0, . . . , S_{t−1} = s_{t−1}, A_{t−1} = a_{t−1}, R_{t−1} = r_{t−1}, S_t = s_t) for the probability that π selects action a at time t given an observed history s_0, a_0, r_0, . . . , s_{t−1}, a_{t−1}, r_{t−1}, s_t. For the purposes of this proof, it is helpful to formally let S_t, A_t, and R_t denote random variables, which distinguishes them from their outcomes, denoted by lower-case variables. First, let us show that, conditioned on (S_0, A_0, R_0, S_1) = (s, a, r, s'), the maximum future discounted value, from time 1 onwards, is not a function of s, a, r. More precisely, we seek to show that:

    sup_{π∈Π} E[ Σ_{t=1}^∞ γ^t r(s_t, a_t) | π, (S_0, A_0, R_0, S_1) = (s, a, r, s') ] = γ V*(s').        (0.3)
For any policy π, define an “offset” policy π_{(s,a,r)}, which is the policy that chooses actions on a trajectory τ according to the same distribution that π chooses actions on the trajectory (s, a, r, τ). Precisely, for all t and all trajectories τ_t, define

    π_{(s,a,r)}(a' | τ_t) := π(a' | (s, a, r, τ_t)).

A change of variables on the time index, together with the definition of the policy π_{(s,a,r)}, gives

    E[ Σ_{t=1}^∞ γ^t r(s_t, a_t) | π, (S_0, A_0, R_0, S_1) = (s, a, r, s') ] = γ V^{π_{(s,a,r)}}(s').

Also, we have that, for all (s, a, r), the set {π_{(s,a,r)} | π ∈ Π} is equal to Π itself, by the definition of Π and π_{(s,a,r)}. This implies:
    sup_{π∈Π} E[ Σ_{t=1}^∞ γ^t r(s_t, a_t) | π, (S_0, A_0, R_0, S_1) = (s, a, r, s') ] = γ · sup_{π∈Π} V^{π_{(s,a,r)}}(s') = γ · sup_{π∈Π} V^π(s') = γ V*(s'),
which establishes Equation 0.3. Using this,

    V*(s_0) = sup_{π∈Π} E[ r(s_0, a_0) + Σ_{t=1}^∞ γ^t r(s_t, a_t) ]
     (a)    = sup_{π∈Π} E[ r(s_0, a_0) + E[ Σ_{t=1}^∞ γ^t r(s_t, a_t) | π, (S_0, A_0, R_0, S_1) = (s_0, a_0, r_0, s_1) ] ]
            ≤ sup_{π∈Π} E[ r(s_0, a_0) + sup_{π'∈Π} E[ Σ_{t=1}^∞ γ^t r(s_t, a_t) | π', (S_0, A_0, R_0, S_1) = (s_0, a_0, r_0, s_1) ] ]
     (b)    = sup_{π∈Π} E[ r(s_0, a_0) + γ V*(s_1) ]
            = sup_{a_0∈A} E[ r(s_0, a_0) + γ V*(s_1) ]
     (c)    = E[ r(s_0, a_0) + γ V*(s_1) | π̃ ],

where step (a) uses the law of iterated expectations; step (b) uses Equation 0.3; and step (c) follows from the definition of π̃, the deterministic, stationary policy that selects the action maximizing r(s, a) + γ E_{s'∼P(·|s,a)}[V*(s')] at each state s. Applying the same argument recursively leads to:
    V*(s_0) ≤ E[ r(s_0, a_0) + γ V*(s_1) | π̃ ] ≤ E[ r(s_0, a_0) + γ r(s_1, a_1) + γ² V*(s_2) | π̃ ] ≤ . . . ≤ V^{π̃}(s_0).
Since V^{π̃}(s) ≤ sup_{π∈Π} V^π(s) = V*(s) for all s, we have that V^{π̃} = V*, which completes the proof of the first claim.
For the same policy π̃, an analogous argument can be used to prove the second claim.
This shows that we may restrict ourselves to using stationary and deterministic policies without any loss in perfor-
mance. The following theorem, also due to [Bellman, 1956], gives a precise characterization of the optimal value
function.
Theorem 1.8 (Bellman optimality equations). We say that a vector Q ∈ R^{|S||A|} satisfies the Bellman optimality equations if:

    Q(s, a) = r(s, a) + γ E_{s'∼P(·|s,a)}[ max_{a'∈A} Q(s', a') ].

For any Q ∈ R^{|S||A|}, we have that Q = Q* if and only if Q satisfies the Bellman optimality equations. Furthermore, the deterministic policy defined by π(s) ∈ argmax_{a∈A} Q*(s, a) is an optimal policy (where ties are broken in some arbitrary manner).
Before we prove this claim, we will provide a few definitions. Let π_Q denote the greedy policy with respect to a vector Q ∈ R^{|S||A|}, i.e.

    π_Q(s) := argmax_{a∈A} Q(s, a),

where ties are broken in some arbitrary manner. With this notation, by the above theorem, the optimal policy π* is given by:

    π* = π_{Q*}.
Let us also use the notation V_Q to turn a vector Q ∈ R^{|S||A|} into a vector of length |S|, where V_Q(s) := max_{a∈A} Q(s, a), and define the Bellman optimality operator T : R^{|S||A|} → R^{|S||A|} as

    T Q := r + γ P V_Q.        (0.4)

This allows us to rewrite the Bellman optimality equations in the concise form:

    Q = T Q,

and, so, the previous theorem states that Q = Q* if and only if Q is a fixed point of the operator T.
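For concreteness, here is a minimal sketch of the operator T for a tabular MDP, with Q stored as an (S, A) array and reusing the hypothetical TabularMDP container from the earlier sketches; the array conventions are our own, not the text's.

    import numpy as np

    def bellman_optimality_operator(mdp, Q):
        """Apply T: (T Q)(s, a) = r(s, a) + gamma * E_{s2 ~ P(.|s, a)}[ max_a2 Q(s2, a2) ]."""
        V_Q = Q.max(axis=1)                       # V_Q(s2) = max_{a2} Q(s2, a2)
        return mdp.r + mdp.gamma * (mdp.P @ V_Q)  # (S, A, S) @ (S,) -> (S, A)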
Proof: Let us begin by showing that:

    V*(s) = max_a Q*(s, a).        (0.5)
Let π* be an optimal stationary and deterministic policy, which exists by Theorem 1.7. Consider the (non-stationary) policy which takes action a and then follows π*. Since V*(s) is the maximum value over all non-stationary policies (as shown in Theorem 1.7),

    V*(s) ≥ Q^{π*}(s, a) = Q*(s, a),

which shows that V*(s) ≥ max_a Q*(s, a), since a is arbitrary in the above. Also, by Lemma 1.4 and Theorem 1.7,

    V*(s) = V^{π*}(s) = Q^{π*}(s, π*(s)) ≤ max_a Q^{π*}(s, a) = max_a Q*(s, a),

which proves our claim, since we have upper and lower bounded V*(s) by max_a Q*(s, a).
We first show sufficiency, i.e. that Q* (the state-action value of an optimal policy) satisfies Q* = T Q*. For all (s, a) ∈ S × A, we have:

    Q*(s, a) = r(s, a) + γ E_{s'∼P(·|s,a)}[ V^{π*}(s') ] = r(s, a) + γ E_{s'∼P(·|s,a)}[ V*(s') ] = r(s, a) + γ E_{s'∼P(·|s,a)}[ max_{a'∈A} Q*(s', a') ].

Here, the second equality follows from Theorem 1.7 and the final equality follows from Equation 0.5. This proves sufficiency.
For the converse, suppose Q = T Q for some Q. We now show that Q = Q*. Let π = π_Q. That Q = T Q implies that Q = r + γ P^π Q, and so:

    Q = (I − γ P^π)^{-1} r = Q^π,

using Corollary 1.5 in the last step. In other words, Q is the action value of the policy π_Q. Also, let us show that (P^π − P^{π'}) Q^π ≥ 0 for any deterministic and stationary policy π'. To see this, observe that:

    [(P^π − P^{π'}) Q^π]_{s,a} = E_{s'∼P(·|s,a)}[ Q^π(s', π(s')) − Q^π(s', π'(s')) ] ≥ 0,
where the last step uses that π = π_Q. Now observe, for any other deterministic and stationary policy π':

    Q − Q^{π'} = Q^π − Q^{π'}
              = Q^π − (I − γ P^{π'})^{-1} r
              = (I − γ P^{π'})^{-1} ( (I − γ P^{π'}) − (I − γ P^π) ) Q^π
              = γ (I − γ P^{π'})^{-1} ( P^π − P^{π'} ) Q^π
              ≥ 0,

where the last step follows since we have shown (P^π − P^{π'}) Q^π ≥ 0 and since (1 − γ)(I − γ P^{π'})^{-1} is a matrix with all non-negative entries (see Lemma 1.6). Thus, Q^π = Q ≥ Q^{π'} for all deterministic and stationary policies π', which shows that π is an optimal policy. Thus, Q = Q^π = Q*, using Theorem 1.7. This completes the proof.
In some cases, it is natural to work with finite-horizon (and time-dependent) Markov Decision Processes (see the discussion below). Here, a finite-horizon, time-dependent Markov Decision Process (MDP) M = (S, A, {P_h}, {r_h}, H, µ) is specified by a state space S, an action space A, time-dependent transition functions P_h : S × A → Δ(S) and reward functions r_h : S × A → [0, 1] for each h ∈ {0, . . . , H − 1}, an integer horizon H, and an initial state distribution µ ∈ Δ(S).
Here, for a policy π, a state s, and h ∈ {0, . . . , H − 1}, we define the value function V^π_h : S → R as

    V^π_h(s) = E[ Σ_{t=h}^{H−1} r_t(s_t, a_t) | π, s_h = s ],

where again the expectation is with respect to the randomness of the trajectory, that is, the randomness in state transitions and the stochasticity of π. Similarly, the state-action value (or Q-value) function Q^π_h : S × A → R is defined as

    Q^π_h(s, a) = E[ Σ_{t=h}^{H−1} r_t(s_t, a_t) | π, s_h = s, a_h = a ].

We also use the notation V^π(s) = V^π_0(s).
Again, given a state s, the goal of the agent is to find a policy π that maximizes the value, i.e. the optimization problem the agent seeks to solve is:

    max_π V^π(s).        (0.6)
Theorem 1.9 (Bellman optimality equations). Define

    Q*_h(s, a) = sup_{π∈Π} Q^π_h(s, a),

where the sup is over all non-stationary and randomized policies. Suppose that Q_H = 0. We have that Q_h = Q*_h for all h ∈ [H] if and only if, for all h ∈ [H],

    Q_h(s, a) = r_h(s, a) + E_{s'∼P_h(·|s,a)}[ max_{a'∈A} Q_{h+1}(s', a') ].        (0.7)
Discussion: Stationary MDPs vs Time-Dependent MDPs For the purposes of this book, it is natural for us to
study both of these models, where we typically assume stationary dynamics in the infinite horizon setting and time-
dependent dynamics in the finite-horizon setting. From a theoretical perspective, the finite-horizon, time-dependent setting is often more amenable to analysis, where achieving optimal statistical rates often requires simpler arguments. However,
we should note that from a practical perspective, time-dependent MDPs are rarely utilized because they lead to policies
and value functions that are O(H) larger (to store in memory) than those in the stationary setting. In practice, we often
incorporate temporal information directly into the definition of the state, which leads to more compact value functions
and policies (when coupled with function approximation methods, which attempt to represent both the values and
policies in a more compact form).
This section is concerned with computing an optimal policy when the MDP M = (S, A, P, r, γ) is known; this can be thought of as solving the planning problem. While much of this book is concerned with statistical limits, understanding the computational limits can be informative. We will consider algorithms which give both exact and approximately optimal policies. In particular, we will be interested in polynomial time (and strongly polynomial time) algorithms.
Suppose that (P, r, γ) in our MDP M is specified with rational entries. Let L(P, r, γ) denote the total bit-size required to specify M, and assume that basic arithmetic operations +, −, ×, ÷ take unit time. Here, we may hope for an algorithm which (exactly) returns an optimal policy whose runtime is polynomial in L(P, r, γ) and the number of states and actions.
More generally, it may also be helpful to understand which algorithms are strongly polynomial. Here, we do not want to explicitly restrict (P, r, γ) to be specified by rationals. An algorithm is said to be strongly polynomial if it returns an optimal policy with runtime that is polynomial in only the number of states and actions (with no dependence on L(P, r, γ)).
The first two subsections will cover classical iterative algorithms that compute Q*, and then we cover the linear programming approach.
Perhaps the simplest algorithm for discounted MDPs is to iteratively apply the fixed-point mapping: starting at some Q, we iteratively apply T:

    Q ← T Q.
                    Value Iteration                              Policy Iteration                                                               LP-Algorithms
 Poly?              |S|²|A| L(P,r,γ)/(1−γ) · log(1/(1−γ))        (|S|³ + |S|²|A|) L(P,r,γ)/(1−γ) · log(1/(1−γ))                                 |S|³|A| L(P,r,γ)
 Strongly Poly?     ✗                                            (|S|³ + |S|²|A|) · min{ |A|^{|S|}/|S| , |S|²|A|/(1−γ) · log(|S|²/(1−γ)) }      |S|⁴|A|⁴ log(1/(1−γ))

Table 0.1: Computational complexities of various approaches (we drop universal constants). Polynomial time algorithms depend on the bit complexity L(P, r, γ), while strongly polynomial algorithms do not. Note that only for a fixed value of γ are value and policy iteration polynomial time algorithms; otherwise, they are not polynomial time algorithms. Similarly, only for a fixed value of γ is policy iteration a strongly polynomial time algorithm. In contrast, the LP-approach leads to both polynomial time and strongly polynomial time algorithms; for the latter, the approach is an interior point algorithm. See text for further discussion, and Section 1.6 for references. Here, |S|²|A| is the assumed runtime per iteration of value iteration, and |S|³ + |S|²|A| is the assumed runtime per iteration of policy iteration (note that for this complexity we would directly update the values V rather than Q values, as described in the text); these runtimes are consistent with assuming cubic complexity for linear system solving.
Lemma 1.10 (Contraction). For any two vectors Q, Q' ∈ R^{|S||A|},

    ‖T Q − T Q'‖_∞ ≤ γ ‖Q − Q'‖_∞.
Proof: First, let us show that for all s ∈ S, |V_Q(s) − V_{Q'}(s)| ≤ max_{a∈A} |Q(s, a) − Q'(s, a)|. Assume V_Q(s) > V_{Q'}(s) (the other direction is symmetric), and let a be the greedy action for Q at s. Then

    |V_Q(s) − V_{Q'}(s)| = Q(s, a) − max_{a'∈A} Q'(s, a') ≤ Q(s, a) − Q'(s, a) ≤ max_{a∈A} |Q(s, a) − Q'(s, a)|.
Using this,

    ‖T Q − T Q'‖_∞ = γ ‖P V_Q − P V_{Q'}‖_∞
                   = γ ‖P (V_Q − V_{Q'})‖_∞
                   ≤ γ ‖V_Q − V_{Q'}‖_∞
                   = γ max_s |V_Q(s) − V_{Q'}(s)|
                   ≤ γ max_s max_a |Q(s, a) − Q'(s, a)|
                   = γ ‖Q − Q'‖_∞,

where the first inequality uses that each element of P(V_Q − V_{Q'}) is a convex combination of the entries of V_Q − V_{Q'}, and the second inequality uses our claim above.
The following result bounds the sub-optimality of the greedy policy itself, based on the error in the Q-value function.

Lemma 1.11 (Q-error amplification). We have that:

    V^{π_Q} ≥ V* − ( 2 ‖Q − Q*‖_∞ / (1 − γ) ) 1,

where 1 denotes the vector of all ones.
Proof: Fix a state s and let a = π_Q(s). We have:

    V*(s) − V^{π_Q}(s) = Q*(s, π*(s)) − Q^{π_Q}(s, a)
                       = Q*(s, π*(s)) − Q(s, π*(s)) + Q(s, π*(s)) − Q^{π_Q}(s, a)
                       ≤ Q*(s, π*(s)) − Q(s, π*(s)) + Q(s, a) − Q^{π_Q}(s, a)
                       = Q*(s, π*(s)) − Q(s, π*(s)) + Q(s, a) − Q*(s, a) + Q*(s, a) − Q^{π_Q}(s, a)
                       ≤ 2 ‖Q − Q*‖_∞ + γ E_{s'∼P(·|s,a)}[ V*(s') − V^{π_Q}(s') ]
                       ≤ 2 ‖Q − Q*‖_∞ + γ ‖V* − V^{π_Q}‖_∞,

where the first inequality uses Q(s, π*(s)) ≤ Q(s, π_Q(s)) = Q(s, a) due to the definition of π_Q. Since s is arbitrary, rearranging terms completes the proof.
Theorem 1.12 (Q-value iteration convergence). Set Q^(0) = 0 and iterate

    Q^(k+1) = T Q^(k).

Let π^(k) = π_{Q^(k)}. For k ≥ log( 2 / ((1 − γ)² ε) ) / (1 − γ), we have

    V^{π^(k)} ≥ V* − ε 1.
Proof: Since ‖Q*‖_∞ ≤ 1/(1 − γ), Q^(k) = T^k Q^(0), and Q* = T Q*, Lemma 1.10 gives

    ‖Q^(k) − Q*‖_∞ = ‖T^k Q^(0) − T^k Q*‖_∞ ≤ γ^k ‖Q^(0) − Q*‖_∞ = (1 − (1 − γ))^k ‖Q*‖_∞ ≤ exp(−(1 − γ) k) / (1 − γ).

The proof is completed with our choice of k and using Lemma 1.11.
Iteration complexity for an exact solution. With regards to computing an exact optimal policy, once the gap between the current objective value and the optimal objective value is smaller than 2^{−L(P,r,γ)}, the greedy policy will be optimal. This leads to the claimed complexity in Table 0.1. Value iteration is not a strongly polynomial algorithm because, in finite time, it may never return the optimal policy.
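Putting the contraction property and the greedy-policy bound together, here is a minimal value iteration sketch that reuses the bellman_optimality_operator from the earlier sketch; the stopping rule below, which certifies ‖Q − Q*‖_∞ ≤ ε via the contraction property, is one reasonable choice rather than a procedure prescribed in the text.

    import numpy as np

    def value_iteration(mdp, eps=1e-6):
        """Iterate Q <- T Q until the certified sup-norm error is at most eps."""
        S, A = mdp.r.shape
        Q = np.zeros((S, A))
        # If ||Q_next - Q|| <= eps * (1 - gamma) / gamma, contraction gives ||Q_next - Q*|| <= eps.
        tol = eps * (1 - mdp.gamma) / max(mdp.gamma, 1e-12)
        while True:
            Q_next = bellman_optimality_operator(mdp, Q)
            if np.max(np.abs(Q_next - Q)) <= tol:
                return Q_next, Q_next.argmax(axis=1)   # Q estimate and its greedy policy
            Q = Q_next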
The policy iteration algorithm, for discounted MDPs, starts from an arbitrary policy π_0 and repeats the following iterative procedure: for k = 0, 1, 2, . . .,

    π_{k+1} = π_{Q^{π_k}}.

In each iteration, we compute the Q-value function of π_k, using the analytical form given in Equation 0.2, and update the policy to be greedy with respect to this new Q-value function. The first step is often called policy evaluation, and the second step is often called policy improvement.
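Here is a minimal sketch of these two steps for a tabular MDP, with policy evaluation done via the exact linear solve Q^π = (I − γP^π)^{-1} r from Equation 0.2; the array conventions follow the earlier hypothetical sketches.

    import numpy as np

    def policy_evaluation(mdp, pi_det):
        """Exact Q^pi for a deterministic policy pi_det, an int array with one action per state."""
        S, A = mdp.r.shape
        pi_onehot = np.eye(A)[pi_det]    # (S, A); row s is the indicator of action pi_det[s]
        # P^pi[(s, a), (s2, a2)] = P(s2 | s, a) * 1{a2 = pi_det[s2]}
        P_pi = (mdp.P.reshape(S * A, S)[:, :, None] * pi_onehot[None, :, :]).reshape(S * A, S * A)
        Q = np.linalg.solve(np.eye(S * A) - mdp.gamma * P_pi, mdp.r.reshape(S * A))
        return Q.reshape(S, A)

    def policy_iteration(mdp):
        """Alternate exact policy evaluation with greedy policy improvement."""
        S, A = mdp.r.shape
        pi = np.zeros(S, dtype=int)
        while True:
            Q = policy_evaluation(mdp, pi)
            pi_next = Q.argmax(axis=1)
            if np.array_equal(pi_next, pi):   # greedy policy unchanged, so pi is optimal
                return Q, pi
            pi = pi_next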
Lemma 1.13. We have that:

    1. Q^{π_{k+1}} ≥ T Q^{π_k} ≥ Q^{π_k},
    2. ‖Q^{π_{k+1}} − Q*‖_∞ ≤ γ ‖Q^{π_k} − Q*‖_∞.
Proof: First let us show that T Q^{π_k} ≥ Q^{π_k}. Note that the policies produced by policy iteration are always deterministic, so V^{π_k}(s) = Q^{π_k}(s, π_k(s)) for all iterations k and states s. Hence,

    T Q^{π_k}(s, a) = r(s, a) + γ E_{s'∼P(·|s,a)}[ max_{a'∈A} Q^{π_k}(s', a') ] ≥ r(s, a) + γ E_{s'∼P(·|s,a)}[ Q^{π_k}(s', π_k(s')) ] = Q^{π_k}(s, a).
Now let us prove that Q^{π_{k+1}} ≥ T Q^{π_k}. First, let us see that Q^{π_{k+1}} ≥ Q^{π_k}:

    Q^{π_k} = r + γ P^{π_k} Q^{π_k} ≤ r + γ P^{π_{k+1}} Q^{π_k} ≤ Σ_{t=0}^∞ γ^t (P^{π_{k+1}})^t r = Q^{π_{k+1}},

where we have used that π_{k+1} is the greedy policy with respect to Q^{π_k} in the first inequality and recursion in the second inequality. Using this,
    Q^{π_{k+1}}(s, a) = r(s, a) + γ E_{s'∼P(·|s,a)}[ Q^{π_{k+1}}(s', π_{k+1}(s')) ]
                      ≥ r(s, a) + γ E_{s'∼P(·|s,a)}[ Q^{π_k}(s', π_{k+1}(s')) ]
                      = r(s, a) + γ E_{s'∼P(·|s,a)}[ max_{a'∈A} Q^{π_k}(s', a') ] = T Q^{π_k}(s, a),

which proves the first claim. For the second claim,

    ‖Q^{π_{k+1}} − Q*‖_∞ = ‖Q* − Q^{π_{k+1}}‖_∞ ≤ ‖Q* − T Q^{π_k}‖_∞ = ‖T Q* − T Q^{π_k}‖_∞ ≤ γ ‖Q^{π_k} − Q*‖_∞,

where we have used that Q* ≥ Q^{π_{k+1}} ≥ T Q^{π_k} in the second step and the contraction property of T (see Lemma 1.10) in the last step.
With this lemma, a convergence rate for the policy iteration algorithm immediately follows.
Theorem 1.14 (Policy iteration convergence). Let π_0 be any initial policy. For k ≥ log( 1 / ((1 − γ) ε) ) / (1 − γ), the k-th policy in policy iteration has the following performance bound:

    Q^{π_k} ≥ Q* − ε 1.
Iteration complexity for an exact solution. With regards to computing an exact optimal policy, it is clear from the previous results that policy iteration is no worse than value iteration. However, with regards to obtaining an exact solution with a runtime that is independent of the bit complexity L(P, r, γ) (where we assume basic arithmetic operations on real numbers have unit cost), improvements are possible. Naively, the number of iterations of policy iteration is bounded by the number of policies, namely |A|^{|S|}; here, a small improvement is possible, where the number of iterations of policy iteration can be bounded by |A|^{|S|} / |S|. Remarkably, for a fixed value of γ, policy iteration can be shown to be a strongly polynomial time algorithm, where policy iteration finds an exact optimal policy in at most ( |S|²|A| / (1 − γ) ) log( |S|² / (1 − γ) ) iterations. See Table 0.1 for a summary, and Section 1.6 for references.
1.3.3 Value Iteration for Finite Horizon MDPs

Let us now specify the value iteration algorithm for finite-horizon MDPs. For the finite-horizon setting, it turns out that the analogues of value iteration and policy iteration lead to identical algorithms. The value iteration algorithm is specified as follows:

    1. Set Q_{H−1}(s, a) = r_{H−1}(s, a).
    2. For h = H − 2, . . . , 0, set:

        Q_h(s, a) = r_h(s, a) + E_{s'∼P_h(·|s,a)}[ max_{a'∈A} Q_{h+1}(s', a') ].

By Theorem 1.9, it follows that Q_h(s, a) = Q*_h(s, a) and that π(s, h) = argmax_{a∈A} Q*_h(s, a) is an optimal policy.
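A minimal sketch of this backward induction is given below, assuming the time-dependent transitions and rewards are stacked along a leading horizon axis (these array conventions are our own).

    import numpy as np

    def finite_horizon_value_iteration(P_h, r_h):
        """Backward induction for a finite-horizon MDP.

        P_h: (H, S, A, S) time-dependent transitions; r_h: (H, S, A) rewards.
        Returns Q of shape (H, S, A) and the time-dependent greedy policy of shape (H, S).
        """
        H, S, A = r_h.shape
        Q = np.zeros((H, S, A))
        V_next = np.zeros(S)                       # V_H = 0 by convention
        for h in range(H - 1, -1, -1):
            Q[h] = r_h[h] + P_h[h] @ V_next        # (S, A, S) @ (S,) -> (S, A)
            V_next = Q[h].max(axis=1)
        return Q, Q.argmax(axis=2)                 # pi[h, s] = argmax_a Q_h(s, a)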
It is helpful to understand an alternative approach to finding an optimal policy for a known MDP. With regards to computation, consider the setting where our MDP M = (S, A, P, r, γ, µ) is known and P, r, and γ are all specified by rational numbers. Here, from a computational perspective, the previous iterative algorithms are, strictly speaking, not polynomial time algorithms, because they depend polynomially on 1/(1 − γ), which is not polynomial in the description length of the MDP. In particular, note that any rational value of 1 − γ may be specified with only O(log(1/(1 − γ))) bits of precision. In this context, we may hope for a fully polynomial time algorithm, when given knowledge of the MDP, whose computation time depends polynomially on the description length of the MDP M when the parameters are specified as rational numbers. We now see that the LP approach provides such a polynomial time algorithm.
Concretely, consider the linear program

    min_{V ∈ R^{|S|}}  Σ_s µ(s) V(s)   subject to   V(s) ≥ r(s, a) + γ Σ_{s'} P(s' | s, a) V(s')  for all (s, a) ∈ S × A.

Provided that µ has full support, the optimal value function V*(s) is the unique solution to this linear program. With regards to computation time, linear programming approaches only depend on the description length of the coefficients in the program, because this determines the computational complexity of basic additions and multiplications. Thus, this approach will only depend on the bit-length description of the MDP, when the MDP is specified by rational numbers.
Computational complexity for an exact solution. Table 0.1 shows the runtime complexity for the LP approach,
where we assume a standard runtime for solving a linear program. The strongly polynomial algorithm is an interior
point algorithm. See Section 1.6 for references.
Policy iteration and the simplex algorithm. It turns out that the policy iteration algorithm is actually the simplex
method with block pivot. While the simplex method, in general, is not a strongly polynomial time algorithm, the
policy iteration algorithm is a strongly polynomial time algorithm, provided we keep the discount factor fixed. See
[Ye, 2011].
For a fixed (possibly stochastic) policy π, let us define a visitation measure over states and actions induced by following π after starting at s_0. Precisely, define this distribution, d^π_{s_0}, as follows:

    d^π_{s_0}(s, a) := (1 − γ) Σ_{t=0}^∞ γ^t Pr^π(s_t = s, a_t = a | s_0),        (0.8)

where Pr^π(s_t = s, a_t = a | s_0) is the probability that s_t = s and a_t = a after starting at state s_0 and following π thereafter. It is straightforward to verify that d^π_{s_0} is a distribution over S × A. We also overload notation and write:

    d^π_µ(s, a) = E_{s_0∼µ}[ d^π_{s_0}(s, a) ]
for a distribution µ over S. Recall that Lemma 1.6 provides a way to easily compute d^π_µ(s, a) through an appropriate vector-matrix multiplication.
It is straightforward to verify that d^π_µ satisfies, for all states s ∈ S:

    Σ_a d^π_µ(s, a) = (1 − γ) µ(s) + γ Σ_{s',a'} P(s | s', a') d^π_µ(s', a').
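As a quick numerical check of this identity, the sketch below computes d^π_µ by reusing the hypothetical state_action_visitation and TabularMDP sketches given earlier and verifies the constraint directly.

    import numpy as np

    def check_flow_constraint(mdp, pi, tol=1e-8):
        """Check sum_a d(s, a) = (1 - gamma) mu(s) + gamma sum_{s2, a2} P(s | s2, a2) d(s2, a2)."""
        S, A = mdp.r.shape
        D = state_action_visitation(mdp, pi)            # rows are d^pi_{(s, a)}
        start = (mdp.mu[:, None] * pi).reshape(S * A)   # Pr(s_0 = s, a_0 = a) under mu and pi
        d = start @ D                                   # d^pi_mu, flattened over (s, a)
        lhs = d.reshape(S, A).sum(axis=1)
        rhs = (1 - mdp.gamma) * mdp.mu + mdp.gamma * (d @ mdp.P.reshape(S * A, S))
        return np.max(np.abs(lhs - rhs)) < tol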
Let K_µ denote the set of all vectors d ∈ R^{|S||A|} such that d ≥ 0 (componentwise) and, for all states s ∈ S, Σ_a d(s, a) = (1 − γ) µ(s) + γ Σ_{s',a'} P(s | s', a') d(s', a'). We now see that this set precisely characterizes all state-action visitation distributions.
Proposition 1.15. We have that K_µ is equal to the set of all feasible state-action distributions, i.e. d ∈ K_µ if and only if there exists a stationary (and possibly randomized) policy π such that d^π_µ = d.
Note that K_µ is itself a polytope, and one can verify that the linear program

    max_d Σ_{s,a} d(s, a) r(s, a)   subject to   d ∈ K_µ

is indeed the dual of the aforementioned LP. This provides an alternative approach to finding an optimal solution.
If d* is the solution to this LP, and provided that µ has full support, then we have that

    π*(a | s) = d*(s, a) / Σ_{a'} d*(s, a')

is an optimal policy. An alternative optimal policy is the deterministic policy that selects argmax_a d*(s, a) at each state s (and these policies are identical if the optimal policy is unique).
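Here is a minimal sketch of this extraction step, assuming d is an (S, A) array with strictly positive state marginals (as is guaranteed when µ has full support).

    import numpy as np

    def policy_from_visitation(d):
        """Return pi(a | s) = d(s, a) / sum_{a2} d(s, a2) from a state-action distribution d."""
        state_marginal = d.sum(axis=1, keepdims=True)   # assumed strictly positive
        return d / state_marginal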
1.4 Sample Complexity and Sampling Models
Much of reinforcement learning is concerned with finding a near optimal policy (or obtaining near optimal reward) in settings where the MDP is not known to the learner. We will study these questions in a few different models of how the agent obtains information about the unknown underlying MDP. In each of these settings, we are interested in understanding the number of samples required to find a near optimal policy, i.e. the sample complexity. Ultimately, we are interested in obtaining results which are applicable to cases where the number of states and actions is large (or, possibly, countably or uncountably infinite). This is in many ways analogous to the question of generalization in supervised learning, though, as we shall see, this question is fundamentally more challenging in the reinforcement learning setting.
The Episodic Setting. In the episodic setting, in every episode the learner acts for some finite number of steps, starting from a fixed starting state s_0 ∼ µ; the learner observes the trajectory, and the state then resets to s_0 ∼ µ. This episodic model of feedback is applicable to both the finite-horizon and infinite-horizon settings.
• (Finite Horizon MDPs) Here, each episode lasts for H-steps, and then the state is reset to s0 ⇠ µ.
• (Infinite Horizon MDPs) Even for infinite horizon MDPs it is natural to work in an episodic model for learning, where each episode terminates after a finite number of steps. Here, it is often natural to assume either that the agent can terminate the episode at will or that the episode terminates at each step with probability 1 − γ. After termination, we again assume that the state is reset to s_0 ∼ µ. Note that, if each step in an episode is terminated with probability 1 − γ, then the observed cumulative reward in an episode of a policy provides an unbiased estimate of the infinite-horizon, discounted value of that policy.
In this setting, we are often interested either in the number of episodes it takes to find a near optimal policy, which is a PAC (probably approximately correct) guarantee, or in a regret guarantee (which we will study in Chapter 7). Both of these questions concern the statistical complexity (i.e. the sample complexity) of learning.
The episodic setting is challenging in that the agent has to engage in some exploration in order to gain information at
the relevant state. As we shall see in Chapter 7, this exploration must be strategic, in the sense that simply behaving
randomly will not lead to information being gathered quickly enough. It is often helpful to study the statistical com-
plexity of learning in a more abstract sampling model, a generative model, which allows us to avoid having to directly address this exploration issue. Furthermore, this sampling model is natural in its own right.
The generative model setting. A generative model takes as input a state-action pair (s, a) and returns a sample s' ∼ P(·|s, a) and the reward r(s, a) (or a sample of the reward if the rewards are stochastic).
The offline RL setting. The offline RL setting is where the agent has access to an offline dataset, say generated under some policy (or a collection of policies). In the simplest of these settings, we may assume our dataset is of the form {(s, a, s', r)}, where r is the reward (corresponding to r(s, a) if the reward is deterministic) and s' ∼ P(·|s, a). Furthermore, for simplicity, it can be helpful to assume that the (s, a) pairs in this dataset were sampled i.i.d. from some fixed distribution ν over S × A.
The advantage A^π(s, a) of a policy π is defined as

    A^π(s, a) := Q^π(s, a) − V^π(s).

Note that

    A*(s, a) := A^{π*}(s, a) ≤ 0

for all state-action pairs.
Analogous to the state-action visitation distribution (see Equation 0.8), we can define a visitation measure over just the states. When clear from context, we will overload notation and also denote this distribution by d^π_{s_0}, where:

    d^π_{s_0}(s) = (1 − γ) Σ_{t=0}^∞ γ^t Pr^π(s_t = s | s_0).        (0.9)

Here, Pr^π(s_t = s | s_0) is the state visitation probability under π, starting at state s_0. Again, we write:

    d^π_µ(s) = E_{s_0∼µ}[ d^π_{s_0}(s) ]

for a distribution µ over S.
The following lemma is helpful in the analysis of RL algorithms.
Lemma 1.16 (The performance difference lemma). For all policies π, π' and distributions µ over S,

    V^π(µ) − V^{π'}(µ) = (1 / (1 − γ)) E_{s'∼d^π_µ}[ E_{a'∼π(·|s')}[ A^{π'}(s', a') ] ].
Proof: Let Pr^π(τ | s_0 = s) denote the probability of observing a trajectory τ when starting in state s and following the policy π. By definition of d^π_{s_0}, observe that for any function f : S × A → R,

    E_{τ∼Pr^π}[ Σ_{t=0}^∞ γ^t f(s_t, a_t) ] = (1 / (1 − γ)) E_{s∼d^π_{s_0}}[ E_{a∼π(·|s)}[ f(s, a) ] ].        (0.10)
1.6 Bibliographic Remarks and Further Reading
We refer the reader to [Puterman, 1994] for a more detailed treatment of dynamic programming and MDPs. [Puterman, 1994] also contains a thorough treatment of the dual LP, along with a proof of Proposition 1.15.
With regards to the computational complexity of policy iteration, [Ye, 2011] showed that policy iteration is a strongly polynomial time algorithm for a fixed discount factor γ.¹ Also, see [Ye, 2011] for a good summary of the computational complexities of various approaches. [Mansour and Singh, 1999] showed that the number of iterations of policy iteration can be bounded by |A|^{|S|} / |S|.
With regards to a strongly polynomial algorithm, the CIPA algorithm [Ye, 2005] is an interior point algorithm with the
claimed runtime in Table 0.1.
Lemma 1.11 is due to Singh and Yee [1994].
The performance difference lemma is due to [Kakade and Langford, 2002, Kakade, 2003], though the lemma was
implicit in the analysis of a number of prior works.
¹The stated strongly polynomial runtime in Table 0.1 for policy iteration differs from that in [Ye, 2011] because we assume that the runtime per iteration of policy iteration is |S|³ + |S|²|A| (see the caption of Table 0.1).
Chapter 2
This chapter begins our study of the sample complexity, where we focus on the (minmax) number of transitions we need to observe in order to accurately estimate Q* or in order to find a near optimal policy. We assume that we have access to a generative model (as defined in Section 1.4) and that the reward function is deterministic (the latter is often a mild assumption, since much of the difficulty in RL is due to the uncertainty in the transition model P).

This chapter follows the results due to [Azar et al., 2013], along with some improved rates due to [Agarwal et al., 2020c]. One of the key observations in this chapter is that we can find a near optimal policy using a number of observed transitions that is sublinear in the model size, i.e. using a number of samples that is smaller than O(|S|²|A|). In other words, we do not need to learn an accurate model of the world in order to learn to act near optimally.
Notation. We define M̂ to be the empirical MDP that is identical to the original M, except that it uses P̂ instead of P for the transition model. When clear from context, we drop the subscript M on the values and action values (and on the one-step variances and variances, which we define later). We let V̂^π, Q̂^π, Q̂*, and π̂* denote the value function, state-action value function, optimal state-action value function, and optimal policy in M̂, respectively.
A central question in this chapter is: do we require an accurate model of the world in order to find a near optimal policy? Recall that a generative model takes as input a state-action pair (s, a) and returns a sample s' ∼ P(·|s, a) and the reward r(s, a) (or a sample of the reward if the rewards are stochastic). Let us consider the most naive approach to learning (when we have access to a generative model): suppose we call our simulator N times at each state-action pair. Let P̂ be our empirical model, defined as follows:

    P̂(s' | s, a) = count(s', s, a) / N,

where count(s', s, a) is the number of times the state-action pair (s, a) transitions to state s'. As N is the number of calls for each state-action pair, the total number of calls to our generative model is |S||A|N. As before, we can view P̂ as a matrix of size |S||A| × |S|.
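Here is a minimal sketch of this naive estimator, assuming access to a hypothetical sampler sample_next_state(s, a) that returns an index s' drawn from P(·|s, a); the function name is our own, not the text's.

    import numpy as np

    def empirical_model(sample_next_state, S, A, N):
        """Estimate P_hat(s2 | s, a) = count(s2, s, a) / N from N generative-model calls per (s, a)."""
        counts = np.zeros((S, A, S))
        for s in range(S):
            for a in range(A):
                for _ in range(N):
                    counts[s, a, sample_next_state(s, a)] += 1
        return counts / N      # shape (S, A, S); can be viewed as an |S||A| x |S| matrix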
Note that since P has |S|²|A| parameters, we would expect that observing O(|S|²|A|) transitions is sufficient to provide us with an accurate model. The following proposition shows that this is the case.

Proposition 2.1. There exists an absolute constant c such that the following holds. Suppose ε ∈ (0, 1/(1 − γ)) and that we obtain

    # samples from generative model = |S||A|N ≥ |S|²|A| log(c|S||A|/δ) / ( (1 − γ)⁴ ε² ),

where we uniformly sample every state-action pair. Then, with probability greater than 1 − δ, we have:

    ‖Q^π − Q̂^π‖_∞ ≤ ε   for all policies π.
Before we provide the proof, the following lemmas will be helpful throughout:
Lemma 2.2 (Simulation Lemma). For all π we have that:

    Q^π − Q̂^π = γ (I − γ P̂^π)^{-1} (P − P̂) V^π.
Proof: Using our matrix equality for Q^π (see Equation 0.2), we have:

    Q^π − Q̂^π = (I − γ P^π)^{-1} r − (I − γ P̂^π)^{-1} r
              = (I − γ P̂^π)^{-1} ( (I − γ P̂^π) − (I − γ P^π) ) Q^π
              = γ (I − γ P̂^π)^{-1} ( P^π − P̂^π ) Q^π
              = γ (I − γ P̂^π)^{-1} (P − P̂) V^π,

where the last step uses that P^π Q^π = P V^π (and likewise for P̂).
We will also make use of the following simple bound.

Lemma 2.3. For any policy π and any vector v ∈ R^{|S||A|}, we have ‖(I − γ P̂^π)^{-1} v‖_∞ ≤ ‖v‖_∞ / (1 − γ) (and likewise with P in place of P̂).

Proof: Let w = (I − γ P̂^π)^{-1} v, so that v = (I − γ P̂^π) w. Then

    ‖v‖_∞ = ‖w − γ P̂^π w‖_∞ ≥ ‖w‖_∞ − γ ‖P̂^π w‖_∞ ≥ (1 − γ) ‖w‖_∞,

where the final inequality follows since each element of P̂^π w is an average of the elements of w, by the definition of P̂^π, so that ‖P̂^π w‖_∞ ≤ ‖w‖_∞. Rearranging terms completes the proof.
Now we are ready to complete the proof of our proposition.
Proof: Using the concentration of a distribution in the ℓ₁ norm (Lemma A.8), we have that, for a fixed (s, a), with probability greater than 1 − δ,

    ‖P̂(·|s, a) − P(·|s, a)‖₁ ≤ c √( |S| log(1/δ) / m ),

where m is the number of samples used to estimate P̂(·|s, a). The first claim now follows by the union bound (and redefining δ and c appropriately).
For the second claim, we have that:

    ‖Q^π − Q̂^π‖_∞ = γ ‖(I − γ P̂^π)^{-1} (P − P̂) V^π‖_∞ ≤ (γ / (1 − γ)) ‖(P − P̂) V^π‖_∞
                  ≤ (γ / (1 − γ)) max_{s,a} ‖P(·|s, a) − P̂(·|s, a)‖₁ ‖V^π‖_∞ ≤ (γ / (1 − γ)²) max_{s,a} ‖P(·|s, a) − P̂(·|s, a)‖₁,

where the penultimate step uses Hölder's inequality. The second claim now follows.
For the final claim, first observe that |sup_x f(x) − sup_x g(x)| ≤ sup_x |f(x) − g(x)|, where f and g are real-valued functions. This implies:

    |Q̂*(s, a) − Q*(s, a)| = | sup_π Q̂^π(s, a) − sup_π Q^π(s, a) | ≤ sup_π |Q̂^π(s, a) − Q^π(s, a)| ≤ ε,

which proves the first inequality. The second inequality is left as an exercise to the reader.
In the previous approach, we are able to accurately estimate the value of every policy in the unknown MDP M. However, with regards to planning, we only need an accurate estimate Q̂* of Q*, which we may hope requires fewer samples. Let us now see that the model based approach can be refined to obtain minmax optimal sample complexity, which we will see is sublinear in the model size.

We will state our results in terms of N; recall that N is the number of calls to the generative model per state-action pair, so that:

    # samples from generative model = |S||A|N.

Let us start with a crude bound on the optimal action-values, which provides a sublinear rate. In the next section, we will improve upon this to obtain the minmax optimal rate.
Proposition 2.4 (Crude Value Bounds). Let δ ≥ 0. With probability greater than 1 − δ,

    ‖Q* − Q̂*‖_∞ ≤ Δ_{δ,N},
    ‖Q* − Q̂^{π*}‖_∞ ≤ Δ_{δ,N},

where:

    Δ_{δ,N} := (γ / (1 − γ)²) √( 2 log(2|S||A|/δ) / N ).
Note that the first inequality above shows a sublinear rate on estimating the value function. Ultimately, we are interested in the value V^{π̂*} obtained when we execute π̂*, not just an estimate Q̂* of Q*. Here, by Lemma 1.11, we lose an additional horizon factor and have:

    ‖Q* − Q^{π̂*}‖_∞ ≤ (2 / (1 − γ)) Δ_{δ,N}.

As we see in Theorem 2.6, this is improvable.
Before we provide the proof, the following lemma will be helpful throughout.
Lemma 2.5 (Component-wise Bounds). We have that:

    Q* − Q̂* ≤ γ (I − γ P̂^{π*})^{-1} (P − P̂) V*,
    Q* − Q̂* ≥ γ (I − γ P̂^{π̂*})^{-1} (P − P̂) V*.

Proof: For the first claim,

    Q* − Q̂* = Q^{π*} − Q̂^{π̂*} ≤ Q^{π*} − Q̂^{π*} = γ (I − γ P̂^{π*})^{-1} (P − P̂) V*,

where the inequality uses Q̂^{π̂*} ≥ Q̂^{π*}, and where we have used Lemma 2.2 in the final step. This proves the first claim.
For the second claim,

    Q* − Q̂* = Q^{π*} − Q̂^{π̂*}
            = (I − γ P^{π*})^{-1} r − (I − γ P̂^{π̂*})^{-1} r
            = (I − γ P̂^{π̂*})^{-1} ( (I − γ P̂^{π̂*}) − (I − γ P^{π*}) ) Q^{π*}
            = γ (I − γ P̂^{π̂*})^{-1} ( P^{π*} − P̂^{π̂*} ) Q*
            ≥ γ (I − γ P̂^{π̂*})^{-1} ( P^{π*} − P̂^{π*} ) Q*
            = γ (I − γ P̂^{π̂*})^{-1} (P − P̂) V*,

where the inequality follows from P̂^{π̂*} Q* ≤ P̂^{π*} Q*, due to the optimality of π* (together with the entrywise non-negativity of (I − γ P̂^{π̂*})^{-1}). This proves the second claim.
Proof: Following from the simulation lemma (Lemma 2.2) and Lemma 2.3, we have:

    ‖Q* − Q̂^{π*}‖_∞ ≤ (γ / (1 − γ)) ‖(P − P̂) V*‖_∞.
Also, the previous lemma implies that:

    ‖Q* − Q̂*‖_∞ ≤ (γ / (1 − γ)) ‖(P − P̂) V*‖_∞.
By applying Hoeffding's inequality and the union bound,

    ‖(P − P̂) V*‖_∞ = max_{s,a} | E_{s'∼P(·|s,a)}[ V*(s') ] − E_{s'∼P̂(·|s,a)}[ V*(s') ] | ≤ (1 / (1 − γ)) √( 2 log(2|S||A|/δ) / N ),

which holds with probability greater than 1 − δ. This completes the proof.
2.3 Minmax Optimal Sample Complexity (and the Model Based Approach)
We now see that the model based approach is minmax optimal, for both the discounted case and the finite horizon
setting.
Theorem 2.6. There exists an absolute constant c such that, for δ ≥ 0, the following hold:

• (Value estimation) With probability greater than 1 − δ,

    ‖Q* − Q̂*‖_∞ ≤ c √( log(c|S||A|/δ) / ((1 − γ)³ N) ) + c log(c|S||A|/δ) / ((1 − γ)³ N).

• (Sub-optimality) If N ≥ 1 / (1 − γ)², then with probability greater than 1 − δ,

    ‖Q* − Q^{π̂*}‖_∞ ≤ c √( log(c|S||A|/δ) / ((1 − γ)³ N) ).
In particular, if

    # samples from generative model = |S||A|N ≥ c |S||A| log(c|S||A|/δ) / ( (1 − γ)³ ε² ),

then with probability greater than 1 − δ,

    ‖Q* − Q^{π̂*}‖_∞ ≤ ε.
We only prove the first claim in Theorem 2.6, on the estimation accuracy. With regards to the sub-optimality, note that Lemma 1.11 already implies a sub-optimality gap, though with an amplification of the estimation error by 2/(1 − γ). The argument for the improvement provided in the second claim is more involved (see Section 2.6 for further discussion).
Lower Bounds. Let us say that an estimation algorithm A, which is a map from samples to an estimate Q̂*, is (ε, δ)-good on MDP M if ‖Q* − Q̂*‖_∞ ≤ ε holds with probability greater than 1 − δ.
Theorem 2.8. There exist ε₀, δ₀, c and a set of MDPs M such that, for ε ∈ (0, ε₀) and δ ∈ (0, δ₀), if algorithm A is (ε, δ)-good on all M ∈ M, then A must use a number of samples that is lower bounded as follows:

    # samples from generative model ≥ c |S||A| log(c|S||A|/δ) / ( (1 − γ)³ ε² ).
In other words, this theorem shows that the model based approach is minmax optimal.
Recall the setting of finite horizon MDPs defined in Section 1.2. Again, we can consider the most naive approach to learning (when we have access to a generative model): suppose we call our simulator N times for every (s, a, h) ∈ S × A × [H], i.e. we obtain N i.i.d. samples s' ∼ P_h(·|s, a) for every (s, a, h) ∈ S × A × [H]. Note that the total number of observed transitions is H|S||A|N.
Upper bounds. The following theorem provides an upper bound for the model based approach.

Theorem 2.9. For δ ≥ 0, with probability greater than 1 − δ, we have:

• (Value estimation)

    ‖Q*_0 − Q̂*_0‖_∞ ≤ c H √( log(c|S||A|/δ) / N ) + c H log(c|S||A|/δ) / N,

• (Sub-optimality)

    ‖Q*_0 − Q^{π̂*}_0‖_∞ ≤ c H √( log(c|S||A|/δ) / N ) + c H log(c|S||A|/δ) / N,

where c is an absolute constant.
Note that the above bound requires N to be O(H²) in order to achieve an ε-optimal policy, while in the discounted case, we require N to be O(1/(1 − γ)³) for the same guarantee. While this may seem like an improvement by a horizon factor, recall that for the finite horizon case, N corresponds to observing O(H) more transitions than in the discounted case.
Lower Bounds. In the minmax sense of Theorem 2.8, the previous upper bound provided by the model based
approach for the finite horizon setting achieves the minmax optimal sample complexity.
2.4 Analysis

The key to the sharper analysis is to more precisely characterize the variance in our estimates. Denote the variance of any real-valued function f under a distribution D as:

    Var_D(f) := E_{x∼D}[ f(x)² ] − ( E_{x∼D}[ f(x) ] )².

Slightly abusing the notation, for V ∈ R^{|S|}, we define the vector Var_P(V) ∈ R^{|S||A|} as Var_P(V)(s, a) := Var_{P(·|s,a)}(V). Equivalently,

    Var_P(V) = P(V)² − (P V)²,

where (·)² denotes component-wise squaring.
Proof: The claims follow from Bernstein's inequality, along with a union bound over all state-action pairs.

The key ideas in the proof are in how we bound ‖(I − γ P̂^{π*})^{-1} √( Var_P(V*) )‖_∞ and ‖(I − γ P̂^{π̂*})^{-1} √( Var_P(V*) )‖_∞.
It is helpful to define Σ^π_M as the variance of the discounted reward, i.e.

    Σ^π_M(s, a) := E[ ( Σ_{t=0}^∞ γ^t r(s_t, a_t) − Q^π_M(s, a) )² | s_0 = s, a_0 = a ],

where the expectation is with respect to the trajectories induced by π in M. It is straightforward to verify that ‖Σ^π_M‖_∞ ≤ γ² / (1 − γ)².
The following lemma shows that Σ^π_M satisfies a Bellman consistency condition.

Lemma 2.11 (Bellman consistency of Σ). For any MDP M,

    Σ^π_M = γ² Var_P(V^π_M) + γ² P^π Σ^π_M,        (0.1)

where P is the transition model in MDP M.
Lemma 2.12. For any policy π in an MDP M with transition model P,

    ‖(I − γ P^π)^{-1} √( Var_P(V^π_M) )‖_∞ ≤ √( 2 / (1 − γ)³ ).

Proof: Note that (1 − γ)(I − γ P^π)^{-1} is a matrix whose rows are probability distributions. For a positive vector v and a distribution ν (where ν is a vector of the same dimension as v), Jensen's inequality implies that ν · √v ≤ √(ν · v). This implies:

    ‖(I − γ P^π)^{-1} √v‖_∞ = (1 / (1 − γ)) ‖(1 − γ)(I − γ P^π)^{-1} √v‖_∞
                            ≤ √( (1 / (1 − γ)) ‖(I − γ P^π)^{-1} v‖_∞ )
                            ≤ √( (2 / (1 − γ)) ‖(I − γ² P^π)^{-1} v‖_∞ ),

where we have used that ‖(I − γ P^π)^{-1} v‖_∞ ≤ 2 ‖(I − γ² P^π)^{-1} v‖_∞ (which we will prove shortly).
2.4.2 Completing the proof
where

    Δ'_{δ,N} := √( 18 log(6|S||A|/δ) / ((1 − γ)² N) ) + 4 log(6|S||A|/δ) / ((1 − γ)⁴ N).
Proof: By definition,

    Var_P(V*) = Var_P(V*) − Var_P̂(V*) + Var_P̂(V*)
              = P(V*)² − (P V*)² − P̂(V*)² + (P̂ V*)² + Var_P̂(V*)
              = (P − P̂)(V*)² − ( (P V*)² − (P̂ V*)² ) + Var_P̂(V*).

Now we bound each of these terms with Hoeffding's inequality and the union bound. For the first term, with probability greater than 1 − δ,

    ‖(P − P̂)(V*)²‖_∞ ≤ (1 / (1 − γ)²) √( 2 log(2|S||A|/δ) / N ).

For the second term, again with probability greater than 1 − δ,

    ‖(P V*)² − (P̂ V*)²‖_∞ ≤ ‖P V* + P̂ V*‖_∞ ‖P V* − P̂ V*‖_∞ ≤ (2 / (1 − γ)) ‖(P − P̂) V*‖_∞ ≤ (2 / (1 − γ)²) √( 2 log(2|S||A|/δ) / N ),

where we have used that (·)² is a component-wise operation in the second step. For the last term:

    Var_P̂(V*) = Var_P̂(V* − V̂^{π*} + V̂^{π*})
              ≤ 2 Var_P̂(V* − V̂^{π*}) + 2 Var_P̂(V̂^{π*})
              ≤ 2 ‖V* − V̂^{π*}‖²_∞ + 2 Var_P̂(V̂^{π*})
              ≤ 2 Δ²_{δ,N} + 2 Var_P̂(V̂^{π*}),

where Δ_{δ,N} is defined in Proposition 2.4. To obtain a cumulative probability of error less than δ, we replace δ in the above claims with δ/3. Combining these bounds completes the proof of the first claim. The argument in the above display also implies that Var_P̂(V*) ≤ 2 Δ²_{δ,N} + 2 Var_P̂(V̂*), which proves the second claim.
Using Lemmas 2.10 and 2.13, we have the following corollary.

Corollary 2.14. Let δ ≥ 0. With probability greater than 1 − δ, we have:

    |(P − P̂) V*| ≤ c √( Var_P̂(V̂^{π*}) log(c|S||A|/δ) / N ) + Δ''_{δ,N} 1,
    |(P − P̂) V*| ≤ c √( Var_P̂(V̂*) log(c|S||A|/δ) / N ) + Δ''_{δ,N} 1,

where

    Δ''_{δ,N} := c ( (1 / (1 − γ)) ( log(c|S||A|/δ) / N )^{3/4} + (1 / (1 − γ)²) log(c|S||A|/δ) / N ),

and where c is an absolute constant.
Proof: (of Theorem 2.6) The proof consists of bounding the terms in Lemma 2.5. We have:

    Q* − Q̂* ≤ γ ‖(I − γ P̂^{π*})^{-1} (P − P̂) V*‖_∞
            ≤ c √( log(c|S||A|/δ) / N ) ‖(I − γ P̂^{π*})^{-1} √( Var_P̂(V̂^{π*}) )‖_∞ + c ( log(c|S||A|/δ) / ((1 − γ)² N) )^{3/4} + c log(c|S||A|/δ) / ((1 − γ)³ N)
            ≤ c √( 2 log(c|S||A|/δ) / ((1 − γ)³ N) ) + c ( log(c|S||A|/δ) / ((1 − γ)² N) )^{3/4} + c log(c|S||A|/δ) / ((1 − γ)³ N)
            ≤ 3c √( log(c|S||A|/δ) / ((1 − γ)³ N) ) + 2c log(c|S||A|/δ) / ((1 − γ)³ N),

where the first step uses Corollary 2.14; the second uses Lemma 2.12; and the last step uses that 2ab ≤ a² + b² (choosing a and b appropriately). The proof of the lower bound is analogous. Taking a different absolute constant completes the proof.
It will be helpful to understand, from a dimensional analysis viewpoint, why 1/(1 − γ)³ is the effective horizon dependency one might hope to expect. Because Q* is a quantity that can be as large as 1/(1 − γ), it is natural, to account for this scaling, to look at obtaining relative accuracy. In particular, if

    N ≥ c |S||A| log(c|S||A|/δ) / ( (1 − γ) ε² ),

then with probability greater than 1 − δ,

    ‖Q* − Q^{π̂*}‖_∞ ≤ ε / (1 − γ)   and   ‖Q* − Q̂*‖_∞ ≤ ε / (1 − γ)

(provided that ε ≤ √(1 − γ), using Theorem 2.6). In other words, if we had normalized the value functions¹, then for additive accuracy ε (on our normalized value functions) our sample size would scale linearly with the effective horizon.
The notion of a generative model was first introduced in [Kearns and Singh, 1999], which made the argument that,
up to horizon factors and logarithmic factors, both model based methods and model free methods are comparable.
[Kakade, 2003] gave an improved version of this rate (analogous to the crude bounds seen here).
The first claim in Theorem 2.6 is due to [Azar et al., 2013], and the proof in this section largely follows this work.
Improvements are possible with regards to bounding the quality of π̂*; here, Theorem 2.6 shows that the model based approach is near optimal even for the policy itself, i.e. that the quality of π̂* does not suffer an amplification factor of 1/(1 − γ). [Sidford et al., 2018] provides the first proof of this improvement, using a variance reduction algorithm with value iteration. The second claim in Theorem 2.6 is due to [Agarwal et al., 2020c], which shows that the naive model based approach is sufficient. The lower bound in Theorem 2.8 is due to [Azar et al., 2013].
¹Rescaling the value functions by multiplying by (1 − γ), i.e. Q^π ← (1 − γ) Q^π, would keep the values bounded between 0 and 1. Throughout this book, it is helpful to understand sample size with regards to normalized quantities.
We also remark that we may hope for the sub-optimality bounds (on the value of the argmax policy) to hold for “large” ε, i.e. up to ε ≤ 1/(1 − γ) (see the second claim in Theorem 2.6). Here, the work in [Li et al., 2020] shows this limit is achievable, albeit with a slightly different algorithm in which they introduce perturbations. It is currently an open question whether the naive model based approach also achieves this non-asymptotic statistical limit.
This chapter also provided results, without proof, on the optimal sample complexity in the finite horizon setting (see
Section 2.3.2). The proof of this claim would also follow from the line of reasoning in [Azar et al., 2013], with
the added simplification that the sub-optimality analysis is simpler in the finite-horizon setting with time-dependent
transition matrices (e.g. see [Yin et al., 2021]).