Risk and AI: Reinforcement Learning
www.peaks2tails.com (Satyapriya Ojha)
Reinforcement learning has led to a proliferation of AI bots in various areas. Some of the notable
developments include:
Autonomous Driving
AlphaZero ( Chess, Go ) https://www.youtube.com/watch?v=WXuK6gekU1Y
Algorithmic Trading
Dynamic Hedging
The Agent and the Environment together generate the (SARSA) sequence
$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, S_3, \dots$
The states $S \in \mathcal{S}$ and rewards $R \in \mathcal{R}$ are, in general, random quantities drawn from a distribution, which may be finite or infinite.
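To make the interaction concrete, here is a minimal sketch (not from the notes) of the agent-environment loop that produces such a sequence; the toy line-walk environment and the uniform random policy are illustrative assumptions.

```python
import random

# A toy sketch of the agent-environment loop that generates the
# sequence S0, A0, R1, S1, A1, R2, ...
# The line-walk environment and the uniform random policy are assumptions.

ACTIONS = ["left", "right"]

def toy_step(state, action):
    """Assumed dynamics: move on a line; reaching position 3 ends the episode."""
    next_state = state + (1 if action == "right" else -1)
    done = next_state >= 3
    reward = 1.0 if done else 0.0
    return reward, next_state, done

def run_episode(start_state=0, max_steps=20):
    trajectory = []                      # collects (S_t, A_t, R_{t+1}) triples
    state = start_state
    for _ in range(max_steps):
        action = random.choice(ACTIONS)  # uniform random policy
        reward, next_state, done = toy_step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory, state             # final state is the terminal S_T

if __name__ == "__main__":
    print(run_episode())
```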
An MDP (Markov Decision Process) is a process with the Markov property. The Markov property states that the distribution of the next state and reward depends only on the current state and action, not on the entire history of states (think of Chess!). Formally,

$p(s', r \mid s, a) = \mathbb{P}\left(S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\right)$
Goal
The goal of the agent is to maximize the total expected reward $G_t$ over the long run. We use the familiar discounting factor $\gamma$ (think Time Value of Money) to discount future rewards:

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

We can write this in recursive fashion, which is often useful in these types of problems:

$G_t = R_{t+1} + \gamma G_{t+1}$
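A small sketch (the reward sequence and the value of $\gamma$ are illustrative assumptions) checking that the recursive form reproduces the discounted sum:

```python
# Check that G_t = R_{t+1} + gamma * G_{t+1} reproduces the discounted sum.

def discounted_return(rewards, gamma):
    """Direct definition: G_0 = sum over k of gamma^k * R_{k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def discounted_return_recursive(rewards, gamma):
    """Recursive form: G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0 at the end."""
    g = 0.0
    for r in reversed(rewards):          # work backwards from the end of the episode
        g = r + gamma * g
    return g

rewards = [4.0, 4.0, 4.0, 10.0]          # an assumed reward sequence
gamma = 0.9
print(discounted_return(rewards, gamma))            # 18.13
print(discounted_return_recursive(rewards, gamma))  # 18.13 (same value)
```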
Policy Function
A Policy Function $\pi(a \mid s)$ assigns a probability to each action $A \in \mathcal{A}$ an agent can take when in state $S \in \mathcal{S}$. In other words, under a given policy, the actions taken in a state follow a distribution.
Example (chess): with $s_0$ = Opening, $a_1$ = e4 and $a_2$ = d4, a policy might assign

$\pi(a_1 \mid s_0) = 80\%, \quad \pi(a_2 \mid s_0) = 20\%$
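One natural way to represent such a policy in code is a dictionary of action probabilities per state. The numbers below mirror the chess-opening example above; the dictionary representation itself is an assumption, a minimal sketch rather than a prescribed implementation.

```python
import random

# A stochastic policy pi(a|s) stored as a dictionary of
# action probabilities per state.

policy = {
    "opening": {"e4": 0.80, "d4": 0.20},
}

def sample_action(policy, state):
    """Draw an action according to pi(. | state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "opening"))   # 'e4' roughly 80% of the time
```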
Value Function
A State-Value Function of state s under a policy 𝜋 is the total expected reward starting from state s
and following policy 𝜋 thereafter. It’s given as
$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$
An Action-Value Function of taking action 𝑎 in state s under a policy 𝜋 is the total expected reward
starting from state s, taking the action 𝑎 and following policy 𝜋 thereafter. It’s given as
$q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$
Bellman equation
The Bellman equation is the single most important equation in RL. It gives us a recursive way to
compute the value functions.
Action-value function (a one-step backup from $(s, a)$ through the chance node $(r, p)$ to $s'$):

$q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$

State-value function (a backup from $s$ over the actions weighted by $\pi(a \mid s)$):

$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right] = \sum_a \pi(a \mid s)\, q_\pi(s, a)$
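The Bellman expectation equation can be applied repeatedly to evaluate a given policy. Below is a minimal sketch of iterative policy evaluation; the tiny two-state MDP, the 50/50 policy and $\gamma = 0.9$ are illustrative assumptions, not examples from the notes.

```python
# Iterative policy evaluation: sweep over the states and repeatedly apply the
# Bellman expectation equation until v_pi stops changing.

p = {  # p[(s, a)] -> list of (probability, reward, next_state)
    ("A", "go"):   [(1.0, 1.0, "B")],
    ("A", "wait"): [(1.0, 0.0, "A")],
    ("B", "go"):   [(1.0, 2.0, "A")],
    ("B", "wait"): [(1.0, 0.0, "B")],
}
pi = {  # pi[s][a] = probability of taking action a in state s
    "A": {"go": 0.5, "wait": 0.5},
    "B": {"go": 0.5, "wait": 0.5},
}
gamma = 0.9

v = {s: 0.0 for s in pi}                   # initial guess v_pi(s) = 0
for _ in range(1000):
    delta = 0.0
    for s in v:
        new_v = sum(pi[s][a] * prob * (r + gamma * v[s2])
                    for a in pi[s]
                    for prob, r, s2 in p[(s, a)])
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-10:
        break

print(v)   # v_pi(A) and v_pi(B) under the assumed 50/50 policy
```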
A policy $\pi$ is better than a policy $\pi'$ if $v_\pi(s) \ge v_{\pi'}(s)$ for all $s \in \mathcal{S}$. Let's denote the optimal policy, the best among all policies, by $\pi^*$. The state-value function under the optimal policy is called the optimal state-value function, denoted $v_*(s)$:

$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right]$

The optimal action-value function, denoted $q_*(s, a)$, is given as

$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right]$
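The Bellman optimality equation likewise suggests an iterative scheme, value iteration (listed later alongside policy iteration). A minimal sketch on the same assumed two-state MDP, with $\gamma = 0.9$ again an illustrative choice:

```python
# Value iteration: repeatedly apply the Bellman optimality equation until v*
# stops changing, then read off the greedy policy.

p = {  # p[(s, a)] -> list of (probability, reward, next_state)
    ("A", "go"):   [(1.0, 1.0, "B")],
    ("A", "wait"): [(1.0, 0.0, "A")],
    ("B", "go"):   [(1.0, 2.0, "A")],
    ("B", "wait"): [(1.0, 0.0, "B")],
}
states, actions, gamma = ["A", "B"], ["go", "wait"], 0.9

def backup(v, s, a):
    """One-step lookahead: sum over (s', r) of p(s', r | s, a) [r + gamma v(s')]."""
    return sum(prob * (r + gamma * v[s2]) for prob, r, s2 in p[(s, a)])

v = {s: 0.0 for s in states}
for _ in range(1000):
    delta = 0.0
    for s in states:
        best = max(backup(v, s, a) for a in actions)
        delta = max(delta, abs(best - v[s]))
        v[s] = best
    if delta < 1e-10:
        break

greedy = {s: max(actions, key=lambda a: backup(v, s, a)) for s in states}
print(v, greedy)   # both states prefer "go" under these assumed dynamics
```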
Example (a simple stay-or-quit game): in state $S_1$ = In the agent can choose $a_1$ = Stay or $a_2$ = Quit, and each choice leads to a chance node.
(In, Stay): reward $r = \$4$ and remain In with probability 2/3, or reward $r = \$4$ and move to End with probability 1/3.
(In, Quit): reward $r = \$10$ and move to End with probability 1. End ($S_2$) is an absorbing state.
Note that $v(\text{End}) = 0$ as the game is over. We need to find $v(\text{In})$ under policy $\pi_1 :=$ (In, Stay) and under policy $\pi_2 :=$ (In, Quit).

$v_{\pi_1}(\text{In}) = \tfrac{2}{3}\left[4 + 1 \cdot v_{\pi_1}(\text{In})\right] + \tfrac{1}{3}\left[4 + 1 \cdot v_{\pi_1}(\text{End})\right]$

$v_{\pi_1}(\text{In}) = 4 + \tfrac{2}{3}\, v_{\pi_1}(\text{In})$

$v_{\pi_1}(\text{In}) = 12$

$v_{\pi_2}(\text{In}) = 1 \cdot \left[10 + 1 \cdot v_{\pi_2}(\text{End})\right] = 10$
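A quick numerical check of the worked example, treating the game dynamics above as given and iterating the Bellman expectation equation with $\gamma = 1$; the iteration count is an arbitrary choice.

```python
# Evaluate v(In) under the "always Stay" and "always Quit" policies by
# iterating the Bellman expectation equation with gamma = 1.

p = {  # (state, action) -> list of (probability, reward, next_state)
    ("In", "Stay"): [(2/3, 4.0, "In"), (1/3, 4.0, "End")],
    ("In", "Quit"): [(1.0, 10.0, "End")],
}

def evaluate(action, iters=2000):
    """Value of always taking `action` in state In, with v(End) = 0."""
    v_in, v_end = 0.0, 0.0
    for _ in range(iters):
        v_in = sum(prob * (r + (v_in if s2 == "In" else v_end))
                   for prob, r, s2 in p[("In", action)])
    return v_in

print(evaluate("Stay"))   # converges to 12 (policy pi_1)
print(evaluate("Quit"))   # 10.0 (policy pi_2)
```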
Model-Free learning
Model-based (dynamic programming): Value Iteration, Policy Iteration
Model-free: Monte Carlo, Temporal Difference (TD), Q-Learning
In Monte Carlo learning we estimate

$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$

by taking the cumulative reward per episode, $G_t$, and averaging it across all episodes.
We can establish an iterative scheme that converges to the true value by updating the old estimate towards each newly observed return:

$v_\pi(s)_{new} = v_\pi(s)_{old} + \tfrac{1}{n}\left[G_t - v_\pi(s)_{old}\right]$

$Q_\pi(s, a)_{new} = Q_\pi(s, a)_{old} + \tfrac{1}{n}\left[G_t - Q_\pi(s, a)_{old}\right]$
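A minimal sketch of every-visit Monte Carlo estimation using this incremental update; the two short episodes of (state, reward) pairs and $\gamma = 1$ are illustrative assumptions.

```python
from collections import defaultdict

# Every-visit Monte Carlo estimation using the incremental update
# v(s) <- v(s) + (1/n)[G_t - v(s)].

def mc_estimate(episodes, gamma=1.0):
    v = defaultdict(float)      # current estimate of v_pi(s)
    n = defaultdict(int)        # number of returns averaged into v[s]
    for episode in episodes:    # episode = [(S_0, R_1), (S_1, R_2), ...]
        g = 0.0
        for state, reward in reversed(episode):    # compute G_t backwards
            g = reward + gamma * g
            n[state] += 1
            v[state] += (g - v[state]) / n[state]  # incremental average
    return dict(v)

episodes = [
    [("In", 4.0), ("In", 4.0), ("In", 4.0)],   # an episode that stayed, then ended
    [("In", 10.0)],                            # an episode that quit immediately
]
print(mc_estimate(episodes))
```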
In Temporal Difference (TD) learning we do not have to wait until the end of the episode to compute the return. Instead, at every step of the episode we update the state value (or state-action value) using the current estimate of the state value, the immediate reward and the value estimate of the next state.

It is generally more efficient than Monte Carlo, and it can be applied to infinite-horizon problems, since we do not have to wait until the end of an episode.
$v_\pi(s) = \mathbb{E}_\pi\left[R + \gamma v_\pi(s')\right]$

where $R$ is the immediate reward and $\gamma v_\pi(s')$ is the discounted value of the next state. The TD update moves the old estimate towards the TD target $R + \gamma v_\pi(s')_{old}$:

$v_\pi(s)_{new} = v_\pi(s)_{old} + \alpha\left[R + \gamma v_\pi(s')_{old} - v_\pi(s)_{old}\right]$

The bracketed term is the TD error.
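A minimal sketch of this TD update on the stay-or-quit game above, assuming the agent always chooses Stay; the step size $\alpha$, $\gamma = 1$ and the episode count are illustrative choices.

```python
import random
from collections import defaultdict

# TD learning of state values under the "always Stay" policy.

def stay_step():
    """One step of the assumed 'Stay' dynamics: r = 4, episode ends with prob 1/3."""
    return 4.0, ("End" if random.random() < 1/3 else "In")

def td_learn(num_episodes=5000, alpha=0.05, gamma=1.0):
    v = defaultdict(float)                     # v("End") stays at 0
    for _ in range(num_episodes):
        state = "In"
        while state != "End":
            reward, next_state = stay_step()
            td_target = reward + gamma * v[next_state]
            v[state] += alpha * (td_target - v[state])   # alpha times TD error
            state = next_state
    return dict(v)

print(td_learn())   # v("In") should come out close to 12
```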
Q Learning
Q-Learning applies the same TD idea to the action-value function, using the best action in the next state inside the target:

$Q(s, a)_{new} = Q(s, a)_{old} + \alpha\left[R + \gamma \max_{a'} Q(s', a')_{old} - Q(s, a)_{old}\right]$

The bracketed term is again the TD error.
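A minimal sketch of tabular Q-Learning on the same stay-or-quit game; actions are chosen uniformly at random here (pure exploration), and the learning rate and episode count are illustrative assumptions. The $\epsilon$-greedy refinement is discussed next.

```python
import random
from collections import defaultdict

# Tabular Q-Learning on the stay-or-quit game, with purely random exploration.

P = {  # (state, action) -> list of (probability, reward, next_state)
    ("In", "Stay"): [(2/3, 4.0, "In"), (1/3, 4.0, "End")],
    ("In", "Quit"): [(1.0, 10.0, "End")],
}
ACTIONS = ["Stay", "Quit"]

def sample(state, action):
    outcomes = P[(state, action)]
    probs = [o[0] for o in outcomes]
    _, reward, next_state = random.choices(outcomes, weights=probs, k=1)[0]
    return reward, next_state

def q_learning(num_episodes=20000, alpha=0.05, gamma=1.0):
    q = defaultdict(float)
    for _ in range(num_episodes):
        state = "In"
        while state != "End":
            action = random.choice(ACTIONS)                  # explore
            reward, next_state = sample(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            td_error = reward + gamma * best_next - q[(state, action)]
            q[(state, action)] += alpha * td_error
            state = next_state
    return dict(q)

print(q_learning())   # Q(In, Stay) should approach 12 and Q(In, Quit) 10
```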
Exploration vs Exploitation
As we explore state, action and reward sequences to estimate the Q-functions or value functions, it is of practical importance to also exploit that knowledge to earn superior rewards along the way.

At one extreme we have random exploration, which assigns equal probability to all actions. The emphasis is not on exploitation but on exploring the entire space.

At the other extreme we have the greedy approach, which always follows the actions that currently have the highest expected future reward.

In between these two extremes we have the $\epsilon$-greedy approach: with probability $\epsilon$ we take a random action (explore) and with probability $1 - \epsilon$ we take the best action (exploit). Practically speaking, we usually start with a high value of $\epsilon$ to explore the space initially, and then gradually reduce it (perhaps exponentially) so that we exploit more than we explore, as in the sketch below.
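A minimal sketch of $\epsilon$-greedy action selection with an exponentially decaying $\epsilon$; the decay rate and the hard-coded Q-values are illustrative assumptions, and the environment step and Q-update are left as a placeholder.

```python
import random

# Epsilon-greedy action selection with exponential decay of epsilon.

def epsilon_greedy(q, state, actions, epsilon):
    """With prob. epsilon explore (random action); otherwise exploit (best action)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))

q = {("In", "Stay"): 12.0, ("In", "Quit"): 10.0}   # e.g. previously learned Q-values
epsilon, decay = 1.0, 0.995                        # start fully exploratory

for episode in range(1000):
    action = epsilon_greedy(q, "In", ["Stay", "Quit"], epsilon)
    # ... environment step and Q-update would go here ...
    epsilon *= decay                               # explore less over time
```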