
A (Long) Peek into Reinforcement Learning
Date: February 19, 2018 | Estimated Reading Time: 31 min | Author: Lilian Weng

[Updated on 2020-09-03: Updated the algorithm of SARSA and Q-learning so that the difference is more pronounced.]
[Updated on 2021-09-19: Thanks to 爱吃猫的鱼, we have this post in Chinese].
Some exciting news in Artificial Intelligence (AI) has happened in recent years. AlphaGo defeated the best professional human player in the game of Go. Very soon, the extended algorithm AlphaGo Zero beat AlphaGo 100-0 without any supervised learning on human knowledge. Top professional game players lost to the bot developed by OpenAI in the DOTA2 1v1 competition. Knowing this, it is pretty hard not to be curious about the magic behind these algorithms: Reinforcement Learning (RL). I'm writing this post to briefly go over the field. We will first introduce several fundamental concepts and then dive into classic approaches to solving RL problems. Hopefully, this post can be a good starting point for newcomers, bridging toward future study of cutting-edge research.
What is Reinforcement Learning?
Say, we have an agent in an unknown environment and this agent can obtain some rewards by
interacting with the environment. The agent ought to take actions so as to maximize cumulative
rewards. In reality, the scenario could be a bot playing a game to achieve high scores, or a robot trying to complete physical tasks with physical items; the possibilities are not limited to these.

Figure 1: An agent interacts with the environment, trying to take smart actions to maximize cumulative rewards.
The goal of Reinforcement Learning (RL) is to learn a good strategy for the agent from experimental trials and the relatively simple feedback received. With the optimal strategy, the agent is capable of actively adapting to the environment to maximize future rewards.
Key Concepts
Now let's formally define a set of key concepts in RL.

The agent is acting in an environment. How the environment reacts to certain actions is defined by a model which we may or may not know. The agent can stay in one of many states ($s \in \mathcal{S}$) of the environment, and choose to take one of many actions ($a \in \mathcal{A}$) to switch from one state to another. Which state the agent will arrive in is decided by transition probabilities between states ($P$). Once an action is taken, the environment delivers a reward ($r \in \mathcal{R}$) as feedback.

The model defines the reward function and transition probabilities. We may or may not know how the model works, and this differentiates two circumstances:
Know the model: planning with perfect information; do model-based RL. When we fully know the environment, we can find the optimal solution by Dynamic Programming (DP). Do you still remember "longest increasing subsequence" or "traveling salesman problem" from your Algorithms 101 class? LOL. This is not the focus of this post though.
Do not know the model: learning with incomplete information; do model-free RL or try to learn the model explicitly as part of the algorithm. Most of the following content addresses scenarios where the model is unknown.
The agent's policy $\pi(s)$ provides the guideline on the optimal action to take in a certain state, with the goal of maximizing the total rewards. Each state is associated with a value function $V(s)$ predicting the expected amount of future rewards we are able to receive in this state by acting according to the corresponding policy. In other words, the value function quantifies how good a state is. Both policy and value functions are what we try to learn in reinforcement learning.

Figure 2: Summary of approaches in RL based on whether we want to model the value, policy, or the environment. (Image source: reproduced from David Silver's RL course lecture 1.)
The interaction between the agent and the environment involves a sequence of actions and observed rewards in time, $t = 1, 2, \dots, T$. During the process, the agent accumulates knowledge about the environment, learns the optimal policy, and makes decisions on which action to take next so as to efficiently learn the best policy. Let's label the state, action, and reward at time step $t$ as $S_t$, $A_t$, and $R_t$, respectively. Thus the interaction sequence is fully described by one episode (also known as a "trial" or "trajectory") and the sequence ends at the terminal state $S_T$:

$$S_1, A_1, R_2, S_2, A_2, \dots, S_T$$
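To make this loop concrete, here is a minimal sketch (my own illustration, not from the original post) of the agent-environment interaction producing one episode; the toy environment and the random policy are made-up placeholders for a real task and a learned policy.

```python
import random

class ToyEnv:
    """A tiny episodic environment: walk right from state 0 until reaching terminal state 3."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = stay, 1 = move right; reward 1 only when reaching the terminal state.
        self.state = min(self.state + action, 3)
        reward = 1.0 if self.state == 3 else 0.0
        done = self.state == 3
        return self.state, reward, done

def random_policy(state):
    return random.choice([0, 1])

env = ToyEnv()
state = env.reset()
trajectory = []          # will hold (S_t, A_t, R_{t+1}) tuples, i.e. one episode
done = False
while not done:
    action = random_policy(state)
    next_state, reward, done = env.step(action)
    trajectory.append((state, action, reward))
    state = next_state

print(trajectory)        # e.g. [(0, 1, 0.0), (1, 0, 0.0), (1, 1, 0.0), (2, 1, 1.0)]
```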

Terms you will encounter a lot when diving into different categories of RL algorithms:
Model-based: Rely on the model of the environment; either the model is known or the algorithm
learns it explicitly.
Model-free: No dependency on the model during learning.
On-policy: Use the deterministic outcomes or samples from the target policy to train the
algorithm.
Off-policy: Training on a distribution of transitions or episodes produced by a different behavior
policy rather than that produced by the target policy.
Model: Transition and Reward
The model is a descriptor of the environment. With the model, we can learn or infer how the environment would interact with and provide feedback to the agent. The model has two major parts, the transition probability function $P$ and the reward function $R$.

Let’s say when we are in state s, we decide to take action a to arrive in the next state s’ and obtain
reward r. This is known as one transition step, represented by a tuple (s, a, s’, r).
The transition function $P$ records the probability of transitioning from state $s$ to $s'$ after taking action $a$ while obtaining reward $r$. We use $\mathbb{P}$ as a symbol of "probability".

$$P(s', r \mid s, a) = \mathbb{P}[S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a]$$

Thus the state-transition function can be defined as a function of $P(s', r \mid s, a)$:

$$P_{ss'}^a = P(s' \mid s, a) = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a] = \sum_{r \in \mathcal{R}} P(s', r \mid s, a)$$

The reward function $R$ predicts the next reward triggered by one action:

$$R(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} P(s', r \mid s, a)$$
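As a small illustration of what "knowing the model" means, the model of a tiny finite MDP can be written out explicitly as tables; the states, actions, rewards, and probabilities below are invented for the example.

```python
# A hypothetical two-state MDP written out as explicit tables.
# P[(s, a)] maps to a list of (next_state, reward, probability) triples,
# i.e. the joint model P(s', r | s, a).
P = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 0.0, 1.0)],
}

def transition_prob(s, a, s_next):
    """State-transition function P(s'|s,a) = sum over rewards of P(s', r | s, a)."""
    return sum(p for (ns, r, p) in P[(s, a)] if ns == s_next)

def expected_reward(s, a):
    """Reward function R(s,a) = E[R_{t+1} | S_t=s, A_t=a]."""
    return sum(r * p for (ns, r, p) in P[(s, a)])

print(transition_prob("s0", "go", "s1"))  # 0.8
print(expected_reward("s0", "go"))        # 0.8
```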

Policy
Policy, as the agent's behavior function $\pi$, tells us which action to take in state $s$. It is a mapping from state $s$ to action $a$ and can be either deterministic or stochastic:

Deterministic: $\pi(s) = a$.
Stochastic: $\pi(a \mid s) = \mathbb{P}_\pi[A = a \mid S = s]$.

Value Function
Value function measures the goodness of a state, or how rewarding a state or an action is, by a prediction of future reward. The future reward, also known as the return, is a total sum of discounted rewards going forward. Let's compute the return $G_t$ starting from time $t$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The discounting factor $\gamma \in [0, 1]$ penalizes future rewards, because:

The future rewards may have higher uncertainty; e.g. the stock market.
The future rewards do not provide immediate benefits; e.g. as human beings, we might prefer to have fun today rather than 5 years later ;).
Discounting provides mathematical convenience; i.e., we don't need to track future steps forever to compute the return.
We don't need to worry about infinite loops in the state transition graph.
The state-value of a state $s$ is the expected return if we are in this state at time $t$, $S_t = s$:

$$V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

Similarly, we define the action-value ("Q-value"; Q as "Quality", I believe?) of a state-action pair as:

$$Q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

Additionally, since we follow the target policy $\pi$, we can make use of the probability distribution over possible actions and the Q-values to recover the state-value:

$$V_\pi(s) = \sum_{a \in \mathcal{A}} Q_\pi(s, a) \pi(a \mid s)$$

The difference between action-value and state-value is the action advantage function ("A-value"):

$$A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$$
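As a quick sanity check on the return definition, here is a short snippet (my own) that computes the discounted return $G_t$ for every step of a finite episode by working backwards with $G_t = R_{t+1} + \gamma G_{t+1}$:

```python
def discounted_returns(rewards, gamma=0.9):
    """Given rewards [R_1, ..., R_T] of one episode, return [G_0, ..., G_{T-1}],
    where G_t = R_{t+1} + gamma * R_{t+2} + ... (0-based list indices)."""
    returns = []
    g = 0.0
    for r in reversed(rewards):       # work backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # ≈ [0.81, 0.9, 1.0]
```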

Optimal Value and Policy


The optimal value function produces the maximum return:
$$V_*(s) = \max_{\pi} V_\pi(s), \quad Q_*(s, a) = \max_{\pi} Q_\pi(s, a)$$

The optimal policy achieves the optimal value functions:

$$\pi_* = \arg\max_{\pi} V_\pi(s), \quad \pi_* = \arg\max_{\pi} Q_\pi(s, a)$$

And of course, we have $V_{\pi_*}(s) = V_*(s)$ and $Q_{\pi_*}(s, a) = Q_*(s, a)$.
Markov Decision Processes
In more formal terms, almost all RL problems can be framed as Markov Decision Processes (MDPs). All states in an MDP have the "Markov" property, referring to the fact that the future only depends on the current state, not the history:

$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]$$

Or in other words, the future and the past are conditionally independent given the present, as the current state encapsulates all the statistics we need to decide the future.

Figure 3: The agent-environment interaction in a Markov decision process. (Image source: Sec. 3.1 Sutton & Barto (2017).)
A Markov decision process consists of five elements $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, P, R, \gamma \rangle$, where the symbols carry the same meanings as the key concepts in the previous section, well aligned with RL problem settings:

$\mathcal{S}$ - a set of states;
$\mathcal{A}$ - a set of actions;
$P$ - transition probability function;
$R$ - reward function;
$\gamma$ - discounting factor for future rewards.

In an unknown environment, we do not have perfect knowledge about $P$ and $R$.


Figure 4: A fun example of Markov decision process: a typical work day. (Image
source: randomant.net/reinforcement-learning-concepts)

Bellman Equations
Bellman equations refer to a set of equations that decompose the value function into the immediate
reward plus the discounted future values.
$$\begin{aligned}
V(s) &= \mathbb{E}[G_t \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots) \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s]
\end{aligned}$$

Similarly for the Q-value,

$$\begin{aligned}
Q(s, a) &= \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s, A_t = a] \\
&= \mathbb{E}[R_{t+1} + \gamma \mathbb{E}_{a \sim \pi} Q(S_{t+1}, a) \mid S_t = s, A_t = a]
\end{aligned}$$

Bellman Expectation Equations


The recursive update process can be further decomposed into equations built on both state-value and action-value functions. As we go further into future action steps, we extend $V$ and $Q$ alternately by following the policy $\pi$.


Figure 5: Illustration of how Bellman expectation equations update state-value and action-value functions.
$$V_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) Q_\pi(s, a)$$

$$Q_\pi(s, a) = R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P_{ss'}^a V_\pi(s')$$

$$V_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big( R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P_{ss'}^a V_\pi(s') \Big)$$

$$Q_\pi(s, a) = R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P_{ss'}^a \sum_{a' \in \mathcal{A}} \pi(a' \mid s') Q_\pi(s', a')$$

Bellman Optimality Equations


If we are only interested in the optimal values, rather than computing the expectation following a policy, we can jump right into the maximum returns during the alternating updates without using a policy. RECAP: the optimal values $V_*$ and $Q_*$ are the best returns we can obtain, as defined earlier.

$$V_*(s) = \max_{a \in \mathcal{A}} Q_*(s, a)$$

$$Q_*(s, a) = R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P_{ss'}^a V_*(s')$$

$$V_*(s) = \max_{a \in \mathcal{A}} \Big( R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P_{ss'}^a V_*(s') \Big)$$

$$Q_*(s, a) = R(s, a) + \gamma \sum_{s' \in \mathcal{S}} P_{ss'}^a \max_{a' \in \mathcal{A}} Q_*(s', a')$$

Unsurprisingly, they look very similar to the Bellman expectation equations.

If we have complete information about the environment, this turns into a planning problem, solvable by DP. Unfortunately, in most scenarios we do not know $P_{ss'}^a$ or $R(s, a)$, so we cannot solve MDPs by directly applying the Bellman equations, but they lay the theoretical foundation for many RL algorithms.

Common Approaches

Now it is time to go through the major approaches and classic algorithms for solving RL problems. In future posts, I plan to dive into each approach further.
Dynamic Programming
When the model is fully known, following Bellman equations, we can use Dynamic Programming
(DP) to iteratively evaluate value functions and improve policy.
Policy Evaluation
Policy Evaluation is to compute the state-value $V_\pi$ for a given policy $\pi$:

$$V_{t+1}(s) = \mathbb{E}_\pi[r + \gamma V_t(s') \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s', r} P(s', r \mid s, a)(r + \gamma V_t(s'))$$
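A minimal sketch of iterative policy evaluation on a tabular MDP; the transition and reward tables and the uniform policy below are made up for illustration.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[s, a, s'] and R[s, a] as dense arrays.
P = np.array([[[0.2, 0.8],    # s0, a0
               [1.0, 0.0]],   # s0, a1
              [[0.0, 1.0],    # s1, a0
               [0.9, 0.1]]])  # s1, a1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
pi = np.full((2, 2), 0.5)     # uniform random policy pi(a|s)

def policy_evaluation(pi, P, R, gamma, tol=1e-8):
    """Iterate the Bellman expectation backup until V stops changing."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V                # Q(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')
        V_new = (pi * Q).sum(axis=1)         # V(s) = sum_a pi(a|s) Q(s,a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

print(policy_evaluation(pi, P, R, gamma))
```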

Policy Improvement
Based on the value functions, Policy Improvement generates a better policy $\pi' \geq \pi$ by acting greedily.

$$Q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma V_\pi(S_{t+1}) \mid S_t = s, A_t = a] = \sum_{s', r} P(s', r \mid s, a)(r + \gamma V_\pi(s'))$$

Policy Iteration
The Generalized Policy Iteration (GPI) algorithm refers to an iterative procedure to improve the
policy when combining policy evaluation and improvement.
$$\pi_0 \xrightarrow{\text{evaluation}} V_{\pi_0} \xrightarrow{\text{improve}} \pi_1 \xrightarrow{\text{evaluation}} V_{\pi_1} \xrightarrow{\text{improve}} \pi_2 \xrightarrow{\text{evaluation}} \dots \xrightarrow{\text{improve}} \pi_* \xrightarrow{\text{evaluation}} V_*$$

In GPI, the value function is approximated repeatedly to be closer to the true value of the current policy and, in the meantime, the policy is improved repeatedly to approach optimality. This policy iteration process works and always converges to optimality, but why is this the case?
Say, we have a policy $\pi$ and then generate an improved version $\pi'$ by greedily taking actions, $\pi'(s) = \arg\max_{a \in \mathcal{A}} Q_\pi(s, a)$. The value of this improved $\pi'$ is guaranteed to be better because:

$$Q_\pi(s, \pi'(s)) = Q_\pi(s, \arg\max_{a \in \mathcal{A}} Q_\pi(s, a)) = \max_{a \in \mathcal{A}} Q_\pi(s, a) \geq Q_\pi(s, \pi(s)) = V_\pi(s)$$
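Putting evaluation and greedy improvement together, a compact policy iteration loop on the same made-up tabular MDP format might look like this (a sketch, not a reference implementation):

```python
import numpy as np

# Same hypothetical tabular MDP format as before: P[s, a, s'], R[s, a].
P = np.array([[[0.2, 0.8], [1.0, 0.0]],
              [[0.0, 1.0], [0.9, 0.1]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def policy_iteration(P, R, gamma, tol=1e-8):
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)          # deterministic policy: s -> a
    while True:
        # Policy evaluation for the current deterministic policy.
        V = np.zeros(n_states)
        while True:
            Q = R + gamma * P @ V                   # Q(s,a) under the current V
            V_new = Q[np.arange(n_states), policy]
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        # Policy improvement: act greedily with respect to Q.
        new_policy = np.argmax(R + gamma * P @ V, axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

print(policy_iteration(P, R, gamma))
```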

Monte-Carlo Methods
First, let's recall that $V(s) = \mathbb{E}[G_t \mid S_t = s]$. Monte-Carlo (MC) methods use a simple idea: they learn from episodes of raw experience without modeling the environmental dynamics and compute the observed mean return as an approximation of the expected return. To compute the empirical return $G_t$, MC methods need to learn from complete episodes $S_1, A_1, R_2, \dots, S_T$ to compute $G_t = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$, and all the episodes must eventually terminate.

The empirical mean return for state $s$ is:

$$V(s) = \frac{\sum_{t=1}^T \mathbb{1}[S_t = s] G_t}{\sum_{t=1}^T \mathbb{1}[S_t = s]}$$

where $\mathbb{1}[S_t = s]$ is a binary indicator function. We may count the visit of state $s$ every time, so that there could exist multiple visits of one state in one episode ("every-visit"), or only count it the first time we encounter a state in one episode ("first-visit"). This way of approximation can be easily extended to action-value functions by counting the $(s, a)$ pair:

$$Q(s, a) = \frac{\sum_{t=1}^T \mathbb{1}[S_t = s, A_t = a] G_t}{\sum_{t=1}^T \mathbb{1}[S_t = s, A_t = a]}$$

To learn the optimal policy by MC, we iterate it following a similar idea to GPI:

1. Improve the policy greedily with respect to the current value function: $\pi(s) = \arg\max_{a \in \mathcal{A}} Q(s, a)$.
2. Generate a new episode with the new policy $\pi$ (i.e. using algorithms like ε-greedy helps us balance between exploitation and exploration).
3. Estimate Q using the new episode:

$$q_\pi(s, a) = \frac{\sum_{t=1}^T \big( \mathbb{1}[S_t = s, A_t = a] \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1} \big)}{\sum_{t=1}^T \mathbb{1}[S_t = s, A_t = a]}$$
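A rough sketch of first-visit MC control with an ε-greedy policy on a toy episodic environment (everything here, from the environment to the hyperparameters, is illustrative):

```python
import random
from collections import defaultdict

class ToyEnv:
    """Tiny episodic environment: states 0..3, state 3 is terminal."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):                      # a: 0 = stay, 1 = move right
        self.s = min(self.s + a, 3)
        return self.s, (1.0 if self.s == 3 else 0.0), self.s == 3

def epsilon_greedy(Q, s, actions, eps=0.1):
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def mc_control(env, actions=(0, 1), gamma=0.9, episodes=500, eps=0.1):
    Q = defaultdict(float)
    returns_sum, returns_cnt = defaultdict(float), defaultdict(int)
    for _ in range(episodes):
        # Generate one episode with the current epsilon-greedy policy.
        episode, s, done = [], env.reset(), False
        while not done:
            a = epsilon_greedy(Q, s, list(actions), eps)
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next
        # First-visit MC: average the return following the first occurrence of (s, a).
        first_visit = {}
        for t, (s, a, r) in enumerate(episode):
            if (s, a) not in first_visit:
                first_visit[(s, a)] = t
        g = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            g = r + gamma * g
            if first_visit[(s, a)] == t:
                returns_sum[(s, a)] += g
                returns_cnt[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_cnt[(s, a)]
    return Q

Q = mc_control(ToyEnv())
print(Q[(0, 1)], Q[(0, 0)])   # Q should prefer moving right from state 0
```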

Temporal-Difference Learning
Similar to Monte-Carlo methods, Temporal-Difference (TD) Learning is model-free and learns from episodes of experience. However, TD learning can learn from incomplete episodes, and hence we don't need to track the episode up to termination. TD learning is so important that Sutton & Barto (2017) describe it in their RL book as "one idea … central and novel to reinforcement learning".
Bootstrapping
TD learning methods update targets with regard to existing estimates rather than exclusively relying on actual rewards and complete returns as in MC methods. This approach is known as bootstrapping.
Value Estimation
The key idea in TD learning is to update the value function $V(S_t)$ towards an estimated return $R_{t+1} + \gamma V(S_{t+1})$ (known as the "TD target"). To what extent we want to update the value function is controlled by the learning rate hyperparameter $\alpha$:

$$V(S_t) \leftarrow (1 - \alpha) V(S_t) + \alpha G_t$$
$$V(S_t) \leftarrow V(S_t) + \alpha (G_t - V(S_t))$$
$$V(S_t) \leftarrow V(S_t) + \alpha (R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$$

Similarly, for action-value estimation:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t))$$
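As a tiny illustration (my own sketch), a single tabular TD(0) backup can be written as a one-off update function:

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.9):
    """One TD(0) backup: move V(s) toward the TD target r + gamma * V(s')."""
    td_target = r + (0.0 if done else gamma * V.get(s_next, 0.0))
    td_error = td_target - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error

V = {}
td0_update(V, s=0, r=0.0, s_next=1, done=False)
td0_update(V, s=1, r=1.0, s_next=2, done=True)
print(V)   # {0: 0.0, 1: 0.1}
```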

Next, let’s dig into the fun part on how to learn optimal policy in TD learning (aka “TD control”). Be
prepared, you are gonna see many famous names of classic algorithms in this section.
SARSA: On-Policy TD control
“SARSA” refers to the procedure of updating the Q-value by following a sequence of $\dots, S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, \dots$. The idea follows the same route as GPI. Within one episode, it works as follows:

1. Initialize $t = 0$.
2. Start with $S_0$ and choose action $A_0 = \arg\max_{a \in \mathcal{A}} Q(S_0, a)$, where $\epsilon$-greedy is commonly applied.
3. At time $t$, after applying action $A_t$, we observe reward $R_{t+1}$ and get into the next state $S_{t+1}$.
4. Then pick the next action in the same way as in step 2: $A_{t+1} = \arg\max_{a \in \mathcal{A}} Q(S_{t+1}, a)$.
5. Update the Q-value function: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t))$.
6. Set $t = t + 1$ and repeat from step 3.
In each step of SARSA, we need to choose the next action according to the current policy.
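A compact sketch of one SARSA episode on a generic tabular environment with reset/step methods; `SomeTabularEnv` in the usage comment is a hypothetical placeholder.

```python
import random
from collections import defaultdict

def sarsa_episode(env, Q, actions, alpha=0.1, gamma=0.9, eps=0.1):
    """Run one episode of SARSA (on-policy TD control) and update Q in place."""
    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.reset()
    a = eps_greedy(s)                      # A_0 chosen by the current policy
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = eps_greedy(s_next)        # A_{t+1} also chosen by the current policy
        target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s_next, a_next

Q = defaultdict(float)
# for _ in range(1000): sarsa_episode(SomeTabularEnv(), Q, actions=[0, 1])
```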
Q-Learning: Off-policy TD control
The development of Q-learning (Watkins & Dayan, 1992) was a big breakthrough in the early days of Reinforcement Learning. Within one episode, it works as follows:

1. Initialize $t = 0$.
2. Start with $S_0$.
3. At time step $t$, we pick the action according to the Q values, $A_t = \arg\max_{a \in \mathcal{A}} Q(S_t, a)$, and $\epsilon$-greedy is commonly applied.
4. After applying action $A_t$, we observe reward $R_{t+1}$ and get into the next state $S_{t+1}$.
5. Update the Q-value function: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (R_{t+1} + \gamma \max_{a \in \mathcal{A}} Q(S_{t+1}, a) - Q(S_t, A_t))$.
6. Set $t = t + 1$ and repeat from step 3.

The key difference from SARSA is that Q-learning does not follow the current policy to pick the second action $A_{t+1}$. It estimates $Q^*$ out of the best Q values, but which action (denoted as $a^*$) leads to this maximal Q does not matter, and in the next step Q-learning may not follow $a^*$.
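The only line that really changes relative to SARSA is the target, which bootstraps from the max over next actions regardless of the action actually taken next. A sketch under the same placeholder setup as above:

```python
import random
from collections import defaultdict

def q_learning_episode(env, Q, actions, alpha=0.1, gamma=0.9, eps=0.1):
    """One episode of Q-learning (off-policy TD control); Q is updated in place."""
    s = env.reset()
    done = False
    while not done:
        # Behave epsilon-greedily...
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next, r, done = env.step(a)
        # ...but bootstrap from the best next action, regardless of what we do next.
        best_next = max(Q[(s_next, a_next)] for a_next in actions)
        target = r + (0.0 if done else gamma * best_next)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next

Q = defaultdict(float)
# for _ in range(1000): q_learning_episode(SomeTabularEnv(), Q, actions=[0, 1])
```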

Figure 7: The backup diagrams for Q-learning and SARSA. (Image source:
Replotted based on Figure 6.5 in Sutton & Barto (2017))
Deep Q-Network
Theoretically, we can memorize $Q_*(\cdot)$ for all state-action pairs in Q-learning, like in a gigantic table. However, it quickly becomes computationally infeasible when the state and action spaces are large. Thus people use functions (i.e. a machine learning model) to approximate Q values, and this is called function approximation. For example, if we use a function with parameter $\theta$ to calculate Q values, we can label the Q value function as $Q(s, a; \theta)$.

Unfortunately, Q-learning may suffer from instability and divergence when combined with a nonlinear Q-value function approximation and bootstrapping (see the Deadly Triad issue below).
Deep Q-Network (“DQN”; Mnih et al. 2015) aims to greatly improve and stabilize the training procedure of Q-learning with two innovative mechanisms:

Experience Replay: All the episode steps $e_t = (S_t, A_t, R_t, S_{t+1})$ are stored in one replay memory $D_t = \{e_1, \dots, e_t\}$. $D_t$ has experience tuples over many episodes. During Q-learning updates, samples are drawn at random from the replay memory, and thus one sample could be used multiple times. Experience replay improves data efficiency, removes correlations in the observation sequences, and smooths over changes in the data distribution.
Periodically Updated Target: Q is optimized towards target values that are only periodically updated. The Q network is cloned and kept frozen as the optimization target every C steps (C is a hyperparameter). This modification makes the training more stable as it overcomes short-term oscillations.
The loss function looks like this:
$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim U(D)} \Big[ \big( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta) \big)^2 \Big]$$

where $U(D)$ is a uniform distribution over the replay memory $D$, and $\theta^-$ is the parameters of the frozen target Q-network.
In addition, it is also found to be helpful to clip the error term to be between [-1, 1]. (I always get mixed feelings about parameter clipping, as many studies have shown that it works empirically but it makes the math much less pretty. :/)
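For concreteness, here is a rough PyTorch-flavored sketch of the DQN update with a replay buffer and a frozen target network; the network sizes, buffer, and hyperparameters are illustrative rather than the paper's setup, and the Huber loss stands in for the error clipping mentioned above.

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, x):
        return self.net(x)

q_net, target_net = QNet(), QNet()
target_net.load_state_dict(q_net.state_dict())   # theta^-: a frozen copy, refreshed every C steps
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                     # stores (s, a, r, s_next, done) tuples
gamma = 0.99

def dqn_update(batch_size=32):
    batch = random.sample(replay, batch_size)
    s, a, r, s_next, done = map(torch.tensor, zip(*batch))
    s, s_next = s.float(), s_next.float()
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)             # Q(s, a; theta)
    with torch.no_grad():                                                # target built from theta^-
        target = r.float() + gamma * (1 - done.float()) * target_net(s_next).max(dim=1).values
    loss = F.smooth_l1_loss(q, target)    # Huber loss, a smooth stand-in for clipping the error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```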

Figure 8: Algorithm for DQN with experience replay and occasionally frozen optimization target. The preprocessed sequence is the output of some processes running on the input images of Atari games. Don't worry too much about it; just consider them as input feature vectors. (Image source: Mnih et al. 2015)
There are many extensions of DQN to improve the original design, such as DQN with dueling
architecture (Wang et al. 2016) which estimates state-value function V(s) and advantage function
A(s, a) with shared network parameters.
Combining TD and MC Learning
In the previous section on value estimation in TD learning, we only trace one step further down the
action chain when calculating the TD target. One can easily extend it to take multiple steps to
estimate the return.
Let's label the estimated return following $n$ steps as $G_t^{(n)}$, $n = 1, \dots, \infty$; then:

$n = 1$: $G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1})$, i.e. TD learning
$n = 2$: $G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$
...
$n = n$: $G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$
...
$n = \infty$: $G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T + \gamma^{T-t} V(S_T)$, i.e. MC estimation

The generalized n-step TD learning still has the same form for updating the value function:

$$V(S_t) \leftarrow V(S_t) + \alpha (G_t^{(n)} - V(S_t))$$

We are free to pick any $n$ in TD learning as we like. Now the question becomes: what is the best $n$? Which $G_t^{(n)}$ gives us the best return approximation? A common yet smart solution is to apply a weighted sum of all possible n-step TD targets rather than to pick a single best $n$. The weights decay by a factor $\lambda$ with $n$, $\lambda^{n-1}$; the intuition is similar to why we want to discount future rewards when computing the return: the further into the future we look, the less confident we are. To make all the weights ($n \to \infty$) sum up to 1, we multiply every weight by $(1 - \lambda)$, because:

$$\begin{aligned}
\text{let } S &= 1 + \lambda + \lambda^2 + \dots \\
S &= 1 + \lambda(1 + \lambda + \lambda^2 + \dots) \\
S &= 1 + \lambda S \\
S &= 1 / (1 - \lambda)
\end{aligned}$$

This weighted sum of many n-step returns is called the λ-return: $G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$. TD learning that adopts the λ-return for value updating is labeled TD(λ). The original version we introduced above is equivalent to TD(0).
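A small numerical sketch (my own) of n-step returns and the λ-return for a short episode, using the finite-episode convention that the weight of the final (MC) return absorbs the tail of the geometric series:

```python
def n_step_return(rewards, values, t, n, gamma=0.9):
    """G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}), truncated at episode end."""
    T = len(rewards)                     # rewards[k] is R_{k+1}; values[k] is V(S_k), values[T] = 0
    g, horizon = 0.0, min(n, T - t)
    for k in range(horizon):
        g += gamma ** k * rewards[t + k]
    g += gamma ** horizon * values[t + horizon]
    return g

def lambda_return(rewards, values, t, lam=0.8, gamma=0.9):
    """G_t^lambda = (1 - lam) * sum_n lam^{n-1} G_t^(n), with the MC return taking the tail weight."""
    T = len(rewards)
    g = sum((1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
            for n in range(1, T - t))
    g += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)  # MC tail
    return g

rewards = [0.0, 0.0, 1.0]        # R_1, R_2, R_3 of a 3-step episode
values = [0.5, 0.5, 0.5, 0.0]    # V(S_0..S_2) estimates, V(S_3) = 0 at the terminal state
print(n_step_return(rewards, values, t=0, n=1))   # 0.45, the TD(0) target
print(lambda_return(rewards, values, t=0))        # ≈ 0.673
```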


Figure 10: Comparison of the backup diagrams of Monte-Carlo, Temporal-Difference learning, and Dynamic Programming for state value functions. (Image source: David Silver's RL course lecture 4: "Model-Free Prediction")

Policy Gradient
All the methods we have introduced above aim to learn the state/action value function and then to select actions accordingly. Policy Gradient methods instead learn the policy directly with a parameterized function with respect to $\theta$, $\pi(a \mid s; \theta)$. Let's define the reward function (opposite of the loss function) as the expected return and train the algorithm with the goal of maximizing the reward function. My next post describes why the policy gradient theorem works (proof) and introduces a number of policy gradient algorithms.
In discrete space:

$$J(\theta) = V_{\pi_\theta}(S_1) = \mathbb{E}_{\pi_\theta}[V_1]$$

where $S_1$ is the initial starting state.

Or in continuous space:

$$J(\theta) = \sum_{s \in \mathcal{S}} d_{\pi_\theta}(s) V_{\pi_\theta}(s) = \sum_{s \in \mathcal{S}} \Big( d_{\pi_\theta}(s) \sum_{a \in \mathcal{A}} \pi(a \mid s, \theta) Q_\pi(s, a) \Big)$$

where $d_{\pi_\theta}(s)$ is the stationary distribution of the Markov chain for $\pi_\theta$. If you are unfamiliar with the definition of a "stationary distribution," please check this reference.

Using gradient ascent, we can find the best θ that produces the highest return. It is natural to expect policy-based methods to be more useful in continuous space, because there is an infinite number of actions and/or states to estimate values for in continuous space, and hence value-based approaches are computationally much more expensive.
Policy Gradient Theorem
Computing the gradient numerically can be done by perturbing $\theta$ by a small amount $\epsilon$ in the $k$-th dimension. It works even when $J(\theta)$ is not differentiable (nice!), but is unsurprisingly very slow.

$$\frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \epsilon u_k) - J(\theta)}{\epsilon}$$

Or analytically,

$$J(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_{s \in \mathcal{S}} d_{\pi_\theta}(s) \sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) R(s, a)$$

Actually we have nice theoretical support for (replacing $d(\cdot)$ with $d_\pi(\cdot)$):

$$J(\theta) = \sum_{s \in \mathcal{S}} d_{\pi_\theta}(s) \sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) Q_\pi(s, a) \propto \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) Q_\pi(s, a)$$

Check Sec 13.1 in Sutton & Barto (2017) for why this is the case.
Then,

$$\begin{aligned}
J(\theta) &= \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) Q_\pi(s, a) \\
\nabla J(\theta) &= \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \nabla \pi(a \mid s; \theta) Q_\pi(s, a) \\
&= \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) \frac{\nabla \pi(a \mid s; \theta)}{\pi(a \mid s; \theta)} Q_\pi(s, a) \\
&= \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi(a \mid s; \theta) \nabla \ln \pi(a \mid s; \theta) Q_\pi(s, a) \\
&= \mathbb{E}_{\pi_\theta}[\nabla \ln \pi(a \mid s; \theta) Q_\pi(s, a)]
\end{aligned}$$

This result is named the "Policy Gradient Theorem", which lays the theoretical foundation for various policy gradient algorithms:

$$\nabla J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla \ln \pi(a \mid s, \theta) Q_\pi(s, a)]$$

REINFORCE
REINFORCE, also known as Monte-Carlo policy gradient, relies on $Q_\pi(s, a)$, an estimated return obtained by MC methods using episode samples, to update the policy parameter $\theta$.

A commonly used variation of REINFORCE is to subtract a baseline value from the return $G_t$ to reduce the variance of the gradient estimation while keeping the bias unchanged. For example, a common baseline is the state-value, and if applied, we would use $A(s, a) = Q(s, a) - V(s)$ in the gradient ascent update.

1. Initialize $\theta$ at random.
2. Generate one episode $S_1, A_1, R_2, S_2, A_2, \dots, S_T$.
3. For $t = 1, 2, \dots, T$:
   1. Estimate the return $G_t$ from time step $t$.
   2. Update $\theta \leftarrow \theta + \alpha \gamma^t G_t \nabla \ln \pi(A_t \mid S_t, \theta)$.
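A rough PyTorch-flavored sketch of the REINFORCE update on one episode; the policy network, dimensions, and constants are illustrative, and the $\gamma^t$ factor in the update rule above is dropped here, as is common in practice.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))   # logits over 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def reinforce_update(states, actions, rewards):
    """One gradient ascent step from a single episode (three lists of equal length T)."""
    # Compute the discounted return G_t for every step, working backwards.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    states = torch.as_tensor(states, dtype=torch.float32)           # shape (T, 4)
    actions = torch.as_tensor(actions)                              # shape (T,)
    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)

    # Gradient ascent on E[G_t * grad log pi] == gradient descent on the negated objective.
    loss = -(returns * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example call with a made-up 3-step episode:
reinforce_update(states=[[0.0] * 4, [0.1] * 4, [0.2] * 4], actions=[0, 1, 0], rewards=[0.0, 0.0, 1.0])
```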

Actor-Critic
If the value function is learned in addition to the policy, we get the Actor-Critic algorithm.

Critic: updates value function parameters $w$; depending on the algorithm it could be the action-value $Q(a \mid s; w)$ or the state-value $V(s; w)$.
Actor: updates policy parameters $\theta$ of $\pi(a \mid s; \theta)$, in the direction suggested by the critic.

Let's see how it works in an action-value actor-critic algorithm.

1. Initialize $s$, $\theta$, $w$ at random; sample $a \sim \pi(a \mid s; \theta)$.
2. For $t = 1 \dots T$:
   1. Sample the reward $r_t \sim R(s, a)$ and the next state $s' \sim P(s' \mid s, a)$.
   2. Then sample the next action $a' \sim \pi(s', a'; \theta)$.
   3. Update the policy parameters: $\theta \leftarrow \theta + \alpha_\theta Q(s, a; w) \nabla_\theta \ln \pi(a \mid s; \theta)$.
   4. Compute the correction for the action-value at time $t$: $G_{t:t+1} = r_t + \gamma Q(s', a'; w) - Q(s, a; w)$, and use it to update the value function parameters: $w \leftarrow w + \alpha_w G_{t:t+1} \nabla_w Q(s, a; w)$.
   5. Update $a \leftarrow a'$ and $s \leftarrow s'$.

$\alpha_\theta$ and $\alpha_w$ are two learning rates for the policy and value function parameter updates, respectively.
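A sketch of an online actor-critic step in PyTorch. Note that this variant uses a state-value critic and the TD error as the learning signal (an advantage-actor-critic flavor) rather than the action-value critic in the pseudocode above; all sizes, constants, and the sample transition are made up.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))   # pi(a|s; theta) logits
critic = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, 1))          # V(s; w)
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

def actor_critic_step(s, a, r, s_next, done):
    """One online update from a single transition (s, a, r, s')."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    v = critic(s)
    with torch.no_grad():
        v_next = torch.zeros(1) if done else critic(s_next)
    td_error = r + gamma * v_next - v                 # the critic's learning signal

    critic_loss = td_error.pow(2).mean()              # move V(s) toward the TD target
    opt_critic.zero_grad()
    critic_loss.backward()
    opt_critic.step()

    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(torch.as_tensor(a))
    actor_loss = -(td_error.detach() * log_prob).mean()   # policy gradient scaled by the TD error
    opt_actor.zero_grad()
    actor_loss.backward()
    opt_actor.step()

actor_critic_step(s=[0.1, 0.0, -0.2, 0.3], a=1, r=1.0, s_next=[0.0, 0.1, -0.1, 0.2], done=False)
```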

A3C
Asynchronous Advantage Actor-Critic (Mnih et al., 2016), short for A3C, is a classic policy gradient method with a special focus on parallel training.

In A3C, the critics learn the state-value function, $V(s; w)$, while multiple actors are trained in parallel and get synced with the global parameters from time to time. Hence, A3C is well suited for parallel training by default, i.e. on one machine with a multi-core CPU.

The loss function for the state-value is the mean squared error, $J_v(w) = (G_t - V(s; w))^2$, and we use gradient descent to find the optimal $w$. This state-value function is used as the baseline in the policy gradient update.


Here is the algorithm outline:

1. We have global parameters, θ and w; similar thread-specific parameters, θ' and w'.
2. Initialize the time step t = 1.
3. While T <= T_MAX:
   1. Reset gradients: dθ = 0 and dw = 0.
   2. Synchronize thread-specific parameters with global ones: θ' = θ and w' = w.
   3. Set $t_{\text{start}} = t$ and get $s_t$.
   4. While ($s_t \neq$ TERMINAL) and ($t - t_{\text{start}} \leq t_{\max}$):
      1. Pick the action $a_t \sim \pi(a_t \mid s_t; \theta')$ and receive a new reward $r_t$ and a new state $s_{t+1}$.
      2. Update t = t + 1 and T = T + 1.
   5. Initialize the variable that holds the return estimation: $R = 0$ if $s_t$ is TERMINAL, otherwise $R = V(s_t; w')$.
   6. For $i = t - 1, \dots, t_{\text{start}}$:
      1. $R \leftarrow r_i + \gamma R$; here R is an MC measure of $G_i$.
      2. Accumulate gradients w.r.t. θ': $d\theta \leftarrow d\theta + \nabla_{\theta'} \log \pi(a_i \mid s_i; \theta')(R - V(s_i; w'))$; accumulate gradients w.r.t. w': $dw \leftarrow dw + \nabla_{w'} (R - V(s_i; w'))^2$.
   7. Update synchronously: θ using dθ, and w using dw.


A3C enables parallelism across multiple agents in training. The gradient accumulation step (6.2) can be considered as a reformation of the minibatch-based stochastic gradient update: the values of w or θ get corrected by a little bit in the direction of each training thread independently.
Evolution Strategies
Evolution Strategies (ES) is a type of model-agnostic optimization approach. It learns the optimal
solution by imitating Darwin’s theory of the evolution of species by natural selection. Two
prerequisites for applying ES: (1) our solutions can freely interact with the environment and see
whether they can solve the problem; (2) we are able to compute a fitness score of how good each
solution is. We don’t have to know the environment configuration to solve the problem.
Say, we start with a population of random solutions. All of them are capable of interacting with the
environment and only candidates with high fitness scores can survive (only the fittest can survive in
a competition for limited resources). A new generation is then created by recombining the settings
(gene mutation) of high-fitness survivors. This process is repeated until the new solutions are good
enough.
Very different from the popular MDP-based approaches introduced above, ES aims to learn the policy parameter $\theta$ without value approximation. Let's assume the distribution over the parameter $\theta$ is an isotropic multivariate Gaussian with mean $\mu$ and fixed covariance $\sigma^2 I$. The gradient of $F(\theta)$ is calculated:
$$\begin{aligned}
\nabla_\theta \mathbb{E}_{\theta \sim N(\mu, \sigma^2)} F(\theta)
&= \nabla_\theta \int_\theta F(\theta) \Pr(\theta) && \text{Pr(.) is the Gaussian density function.} \\
&= \int_\theta F(\theta) \Pr(\theta) \frac{\nabla_\theta \Pr(\theta)}{\Pr(\theta)} \\
&= \int_\theta F(\theta) \Pr(\theta) \nabla_\theta \log \Pr(\theta) \\
&= \mathbb{E}_{\theta \sim N(\mu, \sigma^2)}[F(\theta) \nabla_\theta \log \Pr(\theta)] && \text{Similar to how we do policy gradient updates.} \\
&= \mathbb{E}_{\theta \sim N(\mu, \sigma^2)}\Big[F(\theta) \nabla_\theta \log\Big(\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(\theta - \mu)^2}{2\sigma^2}}\Big)\Big] \\
&= \mathbb{E}_{\theta \sim N(\mu, \sigma^2)}\Big[F(\theta) \nabla_\theta \Big(-\log\sqrt{2\pi\sigma^2} - \frac{(\theta - \mu)^2}{2\sigma^2}\Big)\Big] \\
&= \mathbb{E}_{\theta \sim N(\mu, \sigma^2)}\Big[F(\theta) \frac{\theta - \mu}{\sigma^2}\Big]
\end{aligned}$$

We can rewrite this formula in terms of a "mean" parameter $\theta$ (different from the $\theta$ above; this $\theta$ is the base gene for further mutation), $\epsilon \sim N(0, I)$, and therefore $\theta + \epsilon\sigma \sim N(\theta, \sigma^2)$. $\epsilon$ controls how much Gaussian noise should be added to create a mutation:

$$\nabla_\theta \mathbb{E}_{\epsilon \sim N(0, I)} F(\theta + \sigma\epsilon) = \frac{1}{\sigma} \mathbb{E}_{\epsilon \sim N(0, I)} [F(\theta + \sigma\epsilon)\epsilon]$$
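A toy sketch of this estimator in plain NumPy; the fitness function below is a made-up quadratic standing in for an episodic return, and the hyperparameters are arbitrary.

```python
import numpy as np

def F(theta):
    """Toy fitness: maximized at theta = [3, -1]. Stand-in for an episodic return."""
    return -np.sum((theta - np.array([3.0, -1.0])) ** 2)

rng = np.random.default_rng(0)
theta = np.zeros(2)                     # the "mean" parameter / base gene
sigma, alpha, population = 0.1, 0.03, 50

for step in range(200):
    eps = rng.standard_normal((population, theta.size))     # one mutation per worker
    fitness = np.array([F(theta + sigma * e) for e in eps])
    fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)   # normalize the fitness scores
    # Gradient ascent on E[F(theta + sigma * eps)] using the estimator derived above.
    theta = theta + alpha / (population * sigma) * eps.T @ fitness

print(theta)    # should approach [3, -1]
```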

Figure 11: A simple parallel evolution-strategies-based RL algorithm. Parallel workers share the random seeds so that they can reconstruct the Gaussian noises with tiny communication bandwidth. (Image source: Salimans et al. 2017.)
ES, as a black-box optimization algorithm, is another approach to RL problems (In my original
writing, I used the phrase “a nice alternative”; Seita pointed me to this discussion and thus I
updated my wording.). It has a couple of good characteristics (Salimans et al., 2017) keeping it fast
and easy to train:
ES does not need value function approximation;

ES does not perform gradient back-propagation;
ES is invariant to delayed or long-term rewards;
ES is highly parallelizable with very little data communication.
Known Problems
Exploration-Exploitation Dilemma
The exploration vs exploitation dilemma has been discussed in my previous post. When the RL problem faces an unknown environment, this issue is especially key to finding a good solution: without enough exploration, we cannot learn the environment well enough; without enough exploitation, we cannot complete our reward optimization task.

Different RL algorithms balance between exploration and exploitation in different ways. In MC methods, Q-learning, and many on-policy algorithms, the exploration is commonly implemented by ε-greedy; in ES, the exploration is captured by the policy parameter perturbation. Please take this into consideration when developing a new RL algorithm.
Deadly Triad Issue
We do seek the efficiency and flexibility of TD methods that involve bootstrapping. However, when off-policy learning, nonlinear function approximation, and bootstrapping are combined in one RL algorithm, the training can be unstable and hard to converge. This issue is known as the deadly triad (Sutton & Barto, 2017). Many architectures using deep learning models were proposed to resolve the problem, including DQN, which stabilizes the training with experience replay and an occasionally frozen target network.
Case Study: AlphaGo Zero
The game of Go had been an extremely hard problem in the field of Artificial Intelligence for decades, until recent years. AlphaGo and AlphaGo Zero are two programs developed by a team at DeepMind. Both involve deep Convolutional Neural Networks (CNN) and Monte Carlo Tree Search (MCTS), and both have been proven to achieve the level of professional human Go players. Different from AlphaGo, which relied on supervised learning from expert human moves, AlphaGo Zero used only reinforcement learning and self-play without human knowledge beyond the basic rules.


Figure 12: The board of Go. Two players play black and white stones alternately on the vacant intersections of a board with 19 x 19 lines. A group of stones must have at least one open point (an intersection, called a "liberty") to remain on the board, and must have two or more enclosed liberties (called "eyes") to stay "alive". No stone shall repeat a previous position.
With all the knowledge of RL above, let's take a look at how AlphaGo Zero works. The main component is a deep CNN over the game board configuration (precisely, a ResNet with batch normalization and ReLU). This network outputs two values:

$$(p, v) = f_\theta(s)$$

s: the game board configuration, 19 x 19 x 17 stacked feature planes; 17 features for each position, 8 past configurations (including the current one) for the current player + 8 past configurations for the opponent + 1 feature indicating the color (1=black, 0=white). We need to encode the color specifically because the network is playing against itself and the colors of the current player and the opponent switch between steps.
p: the probability of selecting a move over 19^2 + 1 candidates (19^2 positions on the board, in addition to passing).
v: the winning probability given the current setting.
During self-play, MCTS further improves the action probability distribution $\pi \sim p(\cdot)$ and then the action $a_t$ is sampled from this improved policy. The reward $z_t$ is a binary value indicating whether the current player eventually wins the game. Each move generates an episode tuple $(s_t, \pi_t, z_t)$, and it is saved into the replay memory. The details of MCTS are skipped for the sake of space in this post; please read the original paper if you are interested.


Figure 13: AlphaGo Zero is trained by self-play while MCTS improves the
output policy further in every step. (Image source: Figure 1a in Silver et al.,
2017).
The network is trained with the samples in the replay memory to minimize the loss:

$$L = (z - v)^2 - \pi^\top \log p + c \|\theta\|^2$$

where $c$ is a hyperparameter controlling the intensity of the L2 penalty to avoid overfitting.
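For illustration only, the per-sample loss can be written out directly; the toy vectors below are invented, and in practice the L2 term is implemented as weight decay over the network parameters.

```python
import numpy as np

def alphago_zero_loss(z, v, pi, p, theta, c=1e-4):
    """L = (z - v)^2 - pi^T log(p) + c * ||theta||^2 for a single training sample."""
    value_loss = (z - v) ** 2
    policy_loss = -np.dot(pi, np.log(p + 1e-12))   # cross-entropy between MCTS policy and network policy
    l2_penalty = c * np.sum(theta ** 2)
    return value_loss + policy_loss + l2_penalty

pi = np.array([0.7, 0.2, 0.1])     # MCTS-improved move probabilities (toy, 3 moves)
p = np.array([0.5, 0.3, 0.2])      # network's predicted move probabilities
print(alphago_zero_loss(z=1.0, v=0.6, pi=pi, p=p, theta=np.ones(10)))
```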

AlphaGo Zero simplified AlphaGo by removing supervised learning and merging the separate policy and value networks into one. It turns out that AlphaGo Zero achieved largely improved performance with a much shorter training time! I strongly recommend reading these two papers side by side and comparing the differences; super fun.
I know this is a long read, but hopefully worth it. If you notice mistakes and errors in this post, don’t
hesitate to contact me at [lilian dot wengweng at gmail dot com]. See you in the next post! :)

Cited as:
@article{weng2018bandit,
title = "A (Long) Peek into Reinforcement Learning",
author = "Weng, Lilian",
journal = "lilianweng.github.io",
year = "2018",
url = "https://lilianweng.github.io/posts/2018-02-19-rl-overview/"
}

References
[1] Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274. 2017.


[2] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction; 2nd Edition. 2017.
[3] Volodymyr Mnih, et al. Asynchronous methods for deep reinforcement learning. ICML. 2016.
[4] Tim Salimans, et al. Evolution strategies as a scalable alternative to reinforcement learning.
arXiv preprint arXiv:1703.03864 (2017).
[5] David Silver, et al. Mastering the game of go without human knowledge. Nature 550.7676
(2017): 354.
[6] David Silver, et al. Mastering the game of Go with deep neural networks and tree search. Nature
529.7587 (2016): 484-489.
[7] Volodymyr Mnih, et al. Human-level control through deep reinforcement learning. Nature
518.7540 (2015): 529.
[8] Ziyu Wang, et al. Dueling network architectures for deep reinforcement learning. ICML. 2016.
[9] Reinforcement Learning lectures by David Silver on YouTube.
[10] OpenAI Blog: Evolution Strategies as a Scalable Alternative to Reinforcement Learning
[11] Frank Sehnke, et al. Parameter-exploring policy gradients. Neural Networks 23.4 (2010): 551-
559.
[12] Csaba Szepesvári. Algorithms for reinforcement learning. 1st Edition. Synthesis lectures on
artificial intelligence and machine learning 4.1 (2010): 1-103.

If you notice mistakes and errors in this post, please don’t hesitate to contact me at [lilian dot
wengweng at gmail dot com] and I would be super happy to correct them right away!

Reinforcement-Learning Long-Read Math-Heavy



