
Reinforcement Learning

Risk and AI
www.peaks2tails.com Satyapriya Ojha

The Reinforcement Learning problem

In reinforcement learning we have the following entities (a minimal interaction loop is sketched after the list):

Agent: the entity that takes actions (e.g., a robot)
Environment: the outside world that updates its state based on the agent's actions and gives rewards
State: a snapshot of the environment at a given point in time (a random variable)
Reward: the immediate benefit the environment gives after the agent performs an action (a random variable)
Action: one of the set of possible steps the agent can take
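
A minimal sketch of how these entities interact in a loop; the ToyEnv class, its states, actions and rewards are invented purely for illustration:

```python
import random

class ToyEnv:
    """Hypothetical environment: the state is just a step counter."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # The environment updates based on the agent's action and returns
        # (next_state, reward, done).
        reward = 1.0 if action == "safe" else random.choice([-1.0, 3.0])
        self.state += 1
        done = self.state >= 5          # episode ends after 5 steps
        return self.state, reward, done

env = ToyEnv()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random.choice(["safe", "risky"])   # the agent chooses an action
    state, reward, done = env.step(action)      # the environment responds
    total_reward += reward
print("episode return:", total_reward)
```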


Reinforcement Learning – (Applications)

Reinforcement learning has led to a proliferation of AI agents in various areas. Some of the notable
applications include:
Autonomous Driving
AlphaZero (Chess, Go) https://www.youtube.com/watch?v=WXuK6gekU1Y
Algorithmic Trading
Dynamic Hedging


MDP (Markov Decision Process)

The agent and the environment together generate the state–action–reward (SARSA) sequence

$$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, S_3, \dots$$

The states $S \in \mathcal{S}$ and rewards $R \in \mathcal{R}$ are in general random quantities drawn from a distribution; the sets $\mathcal{S}$ and $\mathcal{R}$ may be finite or infinite.
An MDP (Markov Decision Process) is a process with the Markov property. The Markov property states that
the future distribution of states depends only on the current state (and action), not on the entire history of states.
(Think of chess!) Formally,

$$\mathbb{P}\big(S_{t+1} = s', R_{t+1} = r \mid S_t, A_t, R_t, S_{t-1}, A_{t-1}, R_{t-1}, \dots, S_0, A_0\big) = \mathbb{P}\big(S_{t+1} = s', R_{t+1} = r \mid S_t, A_t\big)$$


MDP (Markov Decision Process)

The dynamics of an MDP are therefore given by

$$p(s', r \mid s, a) = \mathbb{P}\big(S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\big)$$

The state-transition probabilities are given by

$$p(s' \mid s, a) = \mathbb{P}\big(S_t = s' \mid S_{t-1} = s, A_{t-1} = a\big) = \sum_{r \in \mathcal{R}} p(s', r \mid s, a)$$

Expected rewards can be calculated for a state-action pair (chance node) as

$$r(s, a) = \mathbb{E}\big[R_t \mid S_{t-1} = s, A_{t-1} = a\big] = \sum_{r \in \mathcal{R}} r \sum_{s' \in \mathcal{S}} p(s', r \mid s, a)$$
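
As a sketch (not from the slides), the dynamics $p(s', r \mid s, a)$ of a small finite MDP can be stored as a lookup table; the two-state MDP and its rewards below are invented for illustration:

```python
# dynamics[(s, a)] = list of (s_next, reward, probability); probabilities sum to 1 per (s, a)
dynamics = {
    ("s0", "a"): [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
    ("s1", "a"): [("s1", 2.0, 1.0)],
}

def transition_prob(s_next, s, a):
    # p(s' | s, a) = sum over r of p(s', r | s, a)
    return sum(p for sp, r, p in dynamics[(s, a)] if sp == s_next)

def expected_reward(s, a):
    # r(s, a) = sum over r of r * sum over s' of p(s', r | s, a)
    return sum(r * p for sp, r, p in dynamics[(s, a)])

print(transition_prob("s1", "s0", "a"))  # 0.5
print(expected_reward("s0", "a"))        # 0.5
```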


Goal

The goal of the agent is to maximize the total expected reward (the return) $G_t$ over the long run. We use the familiar
discount factor $\gamma$ (think time value of money) to discount future rewards

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

We can write the return recursively, which is often useful in these types of problems

$$G_t = R_{t+1} + \gamma G_{t+1}$$

How does $\gamma$ affect the total reward?
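
A quick check that the direct sum and the recursive form give the same return, using arbitrary rewards:

```python
gamma = 0.9
rewards = [1.0, 2.0, 3.0, 4.0]  # R_{t+1}, R_{t+2}, ...

# Direct sum: G_t = sum_k gamma^k * R_{t+k+1}
g_direct = sum(gamma**k * r for k, r in enumerate(rewards))

# Recursive form: G_t = R_{t+1} + gamma * G_{t+1}, evaluated backwards
g = 0.0
for r in reversed(rewards):
    g = r + gamma * g

print(g_direct, g)  # both ~ 8.146
```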


Policy Function

A policy function $\pi(a \mid s)$ assigns a probability to each action $a \in \mathcal{A}$ the agent may take when in
state $s \in \mathcal{S}$. In other words, under a given policy the actions taken in a state follow a distribution.

Example (chess opening): in state $s_0$ = opening position, the policy might choose $a_1 = e4$ with $\pi(a_1 \mid s_0) = 80\%$ and $a_2 = d4$ with $\pi(a_2 \mid s_0) = 20\%$.
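
A policy over a finite action set can be stored as a probability table and sampled from; the sketch below mirrors the chess-opening example (the move names are just labels):

```python
import random

# pi[state][action] = probability of taking that action in that state
pi = {"opening": {"e4": 0.8, "d4": 0.2}}

def sample_action(state):
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("opening"))  # "e4" about 80% of the time
```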


Value Function

A state-value function of state $s$ under a policy $\pi$ is the total expected reward starting from state $s$
and following policy $\pi$ thereafter. It is given by

$$v_\pi(s) = \mathbb{E}_\pi\big[G_t \mid S_t = s\big] = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s\Big]$$

An action-value function of taking action $a$ in state $s$ under a policy $\pi$ is the total expected reward
starting from state $s$, taking the action $a$ and following policy $\pi$ thereafter. It is given by

$$q_\pi(s, a) = \mathbb{E}_\pi\big[G_t \mid S_t = s, A_t = a\big] = \mathbb{E}_\pi\Big[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, S_t = s, A_t = a\Big]$$


Bellman equation

The Bellman equation is the single most important equation in RL. It gives us a recursive way to
calculate the value functions. (Backup diagram: from state $s$, the policy $\pi(a \mid s)$ selects an action $a$; the dynamics $p(s', r \mid s, a)$ then produce a reward $r$ and a next state $s'$.)

Action-value function

$$q_\pi(s, a) = \mathbb{E}_\pi\big[G_t \mid S_t = s, A_t = a\big] = \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_\pi(s')\big]$$

State-value function

$$v_\pi(s) = \mathbb{E}_\pi\big[G_t \mid S_t = s\big] = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v_\pi(s')\big] = \sum_{a} \pi(a \mid s)\, q_\pi(s, a)$$
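
A minimal sketch of iterative policy evaluation, i.e. repeatedly applying the state-value Bellman equation until the values stop changing. The two-state MDP and the policy below are made up for illustration:

```python
gamma = 0.9
states = ["s0", "s1"]
# dynamics[(s, a)] = list of (s_next, reward, probability)
dynamics = {
    ("s0", "stay"): [("s0", 1.0, 0.7), ("s1", 0.0, 0.3)],
    ("s0", "move"): [("s1", 5.0, 1.0)],
    ("s1", "stay"): [("s1", 2.0, 1.0)],
    ("s1", "move"): [("s0", 0.0, 1.0)],
}
pi = {"s0": {"stay": 0.5, "move": 0.5}, "s1": {"stay": 1.0, "move": 0.0}}

v = {s: 0.0 for s in states}
for _ in range(1000):  # sweep until (approximately) converged
    v = {
        s: sum(
            pi[s][a] * sum(p * (r + gamma * v[sp]) for sp, r, p in dynamics[(s, a)])
            for a in pi[s]
        )
        for s in states
    }
print(v)
```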


Optimal Policy and Optimal Value Functions


A policy $\pi$ is better than a policy $\pi'$ if $v_\pi(s) \ge v_{\pi'}(s)$ for all $s \in \mathcal{S}$. Let's denote the optimal policy,
the best among all policies, as $\pi^*$. The state-value function under the optimal policy is called the optimal
state-value function, denoted $v^*(s)$. It satisfies

$$v^*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma v^*(s')\big]$$

The optimal action-value function, denoted $q^*(s, a)$, is given by

$$q^*(s, a) = \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma \max_{a'} q^*(s', a')\big]$$
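
The Bellman optimality equation for $v^*(s)$ suggests value iteration: replace the policy-weighted sum with a max over actions and iterate. A sketch, reusing the same invented two-state MDP as before:

```python
gamma = 0.9
states = ["s0", "s1"]
actions = ["stay", "move"]
dynamics = {
    ("s0", "stay"): [("s0", 1.0, 0.7), ("s1", 0.0, 0.3)],
    ("s0", "move"): [("s1", 5.0, 1.0)],
    ("s1", "stay"): [("s1", 2.0, 1.0)],
    ("s1", "move"): [("s0", 0.0, 1.0)],
}

v = {s: 0.0 for s in states}
for _ in range(1000):
    # v*(s) = max_a sum_{s', r} p(s', r | s, a) [r + gamma * v*(s')]
    v = {
        s: max(
            sum(p * (r + gamma * v[sp]) for sp, r, p in dynamics[(s, a)])
            for a in actions
        )
        for s in states
    }

# greedy policy extracted from v*
policy = {
    s: max(actions, key=lambda a: sum(p * (r + gamma * v[sp]) for sp, r, p in dynamics[(s, a)]))
    for s in states
}
print(v, policy)
```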


Example – (dice game)

For each round

You choose to stay or quit
If you quit, you win $10, and the game is over
If you stay, you win $4 and roll the dice again
If the dice results in 1 or 2, the game is over
Else continue to the next round

Identify the states, actions and rewards

Draw the transition diagram
Perform iterative policy evaluation and find the optimal value and policy (assume $\gamma = 1$)


Solution – (dice game)

Transition diagram of the MDP: there are two states, In ($S_1$) and End ($S_2$, an absorbing state). From In there are two actions, each leading to a chance node:

(In, Stay): with probability 2/3 the reward is $4 and the game returns to In; with probability 1/3 the reward is $4 and the game moves to End.
(In, Quit): with probability 1 the reward is $10 and the game moves to End.


Solution – Value of policy (In, Stay)

Note that $v(\text{End}) = 0$ as the game is over. We need to find $v(\text{In})$ under policy $\pi_1 :=$ (In, stay) and under policy $\pi_2 :=$ (In, quit).

$$v_{\pi_1}(\text{In}) = \pi(\text{stay} \mid \text{In})\Big[p(\text{In}, 4 \mid \text{In}, \text{stay})\big(4 + 1 \cdot v_{\pi_1}(\text{In})\big) + p(\text{End}, 4 \mid \text{In}, \text{stay})\big(4 + 1 \cdot v_{\pi_1}(\text{End})\big)\Big]$$

$$v_{\pi_1}(\text{In}) = 1 \cdot \Big[\tfrac{2}{3}\big(4 + 1 \cdot v_{\pi_1}(\text{In})\big) + \tfrac{1}{3}\big(4 + 1 \cdot v_{\pi_1}(\text{End})\big)\Big]$$

$$v_{\pi_1}(\text{In}) = 4 + \tfrac{2}{3} v_{\pi_1}(\text{In}) \;\Rightarrow\; v_{\pi_1}(\text{In}) = 12$$

Also, because the policy is deterministic, $q_{\pi_1}(\text{In}, \text{stay}) = 12$.


Solution – Value of policy (In, Quit)

$$v_{\pi_2}(\text{In}) = \pi(\text{quit} \mid \text{In})\, p(\text{End}, 10 \mid \text{In}, \text{quit})\big(10 + 1 \cdot v_{\pi_2}(\text{End})\big)$$

$$v_{\pi_2}(\text{In}) = 1 \cdot 1 \cdot 10 = 10$$

Also, because the policy is deterministic, $q_{\pi_2}(\text{In}, \text{quit}) = 10$.

Since $v_{\pi_1}(\text{In}) = 12 > v_{\pi_2}(\text{In}) = 10$, staying is the optimal policy and $v^*(\text{In}) = 12$.
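
A quick numerical check of these results: iterating the Bellman equation for the stay policy converges to 12, while the quit policy gives 10 directly (a sketch, with $\gamma = 1$):

```python
# Evaluate v(In) for the two deterministic policies of the dice game.
def evaluate_stay(iterations=1000):
    v_in = 0.0  # v(End) = 0 throughout
    for _ in range(iterations):
        # stay: with prob 2/3 get $4 and return to In, with prob 1/3 get $4 and end
        v_in = (2 / 3) * (4 + v_in) + (1 / 3) * (4 + 0)
    return v_in

v_stay = evaluate_stay()   # converges to 12
v_quit = 1.0 * (10 + 0)    # quit: get $10 and end, so v = 10
print(v_stay, v_quit)      # stay is optimal: 12 > 10
```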


Model Based vs Model Free Learning

In model-based learning, we know the transition probabilities of the MDP.

In practice, the probabilities are not known, so we conduct experiments to learn about the states,
rewards and transitions. The probabilities can still be estimated to impose an MDP structure, but we
can also resort to model-free learning, where we estimate the value functions directly by trial and error.
Model-Based Learning: Value Iteration, Policy Iteration
Model-Free Learning: Monte Carlo, Temporal Difference (TD), Q-Learning


Monte Carlo Learning

Monte Carlo learning is episodic learning (each episode is a sequence of states, actions and rewards).

The total cumulative reward is calculated at the end of each episode. The average of this
cumulative reward across all episodes gives the estimated value of a state.

$$v_\pi(s) = \mathbb{E}_\pi\big[G_t \mid S_t = s\big]$$

where $G_t$ is the cumulative reward per episode and the expectation is estimated as the average across all episodes.


Monte Carlo Learning

We can establish an iterative scheme that converges to the true value by incrementally updating the old
estimate with each newly observed return:

$$v_\pi(s)_{new} = v_\pi(s)_{old} + \frac{1}{n}\big[G_t - v_\pi(s)_{old}\big]$$

$$Q_\pi(s, a)_{new} = Q_\pi(s, a)_{old} + \frac{1}{n}\big[G_t - Q_\pi(s, a)_{old}\big]$$

The two disadvantages of this approach are

It can only be used for finite horizon problems
The convergence to the true value may be infeasibly slow
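
A minimal sketch of the incremental update above for a single state, using made-up episode returns; it reproduces the ordinary sample mean:

```python
returns = [10.0, 14.0, 8.0, 12.0]  # G_t observed for state s in successive episodes

v, n = 0.0, 0
for g in returns:
    n += 1
    v = v + (1 / n) * (g - v)  # incremental mean of the observed returns

print(v, sum(returns) / len(returns))  # both 11.0
```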


Temporal Difference (TD) Learning

In temporal difference learning we do not have to wait until the end of the episode to compute the
reward. Rather, at every step of the episode we update the state value (or state-action value) by
combining the current estimate of the state value, the immediate reward and the value estimate
of the next state.
It is more efficient than Monte Carlo. Also, it can be applied to infinite horizon
problems since we do not have to wait until the end of the episode.

$$v_\pi(s) = \mathbb{E}_\pi\big[R + \gamma v_\pi(s')\big]$$

where $R$ is the immediate reward and $\gamma v_\pi(s')$ is the discounted value of the next state.


Temporal Difference (TD) Learning

The iterative procedure is given below

$$v_\pi(s)_{new} = v_\pi(s)_{old} + \alpha\big[\underbrace{R + \gamma v_\pi(s')_{old}}_{\text{TD target}} - v_\pi(s)_{old}\big]$$

The quantity in brackets, the TD target minus the old estimate, is the TD error.
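
A sketch of the TD(0) update rule above on a tiny invented chain (states A, B, End), where the step function is a stand-in for the environment:

```python
import random

gamma, alpha = 0.9, 0.1
v = {"A": 0.0, "B": 0.0, "End": 0.0}

def step(state):
    """Hypothetical dynamics: A -> B (r = 1), B -> End (r = 0 or 2)."""
    if state == "A":
        return "B", 1.0
    return "End", random.choice([0.0, 2.0])

for _ in range(5000):              # many episodes
    s = "A"
    while s != "End":
        s_next, r = step(s)
        td_target = r + gamma * v[s_next]
        td_error = td_target - v[s]
        v[s] += alpha * td_error   # v(s) <- v(s) + alpha * TD error
        s = s_next

print(v)  # v(B) ~ 1.0, v(A) ~ 1 + 0.9 * v(B) ~ 1.9
```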


Q Learning

Q-learning is an off-policy method: it is TD learning applied to the Q function.

The immediate reward comes from the action actually taken now, while the target assumes the best possible
action from the next state onwards.

$$q(s, a)_{new} = q(s, a)_{old} + \alpha\big[\underbrace{r + \gamma \max_{a'} q(s', a')}_{\text{TD target}} - q(s, a)_{old}\big]$$

Again, the bracketed difference between the TD target and the old estimate is the TD error.
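
A sketch of tabular Q-learning applied to the earlier dice game (state "In", actions "stay"/"quit"), with an ε-greedy behaviour policy; the step function is a small simulator written for illustration:

```python
import random

gamma, alpha, eps = 1.0, 0.1, 0.2
Q = {("In", "stay"): 0.0, ("In", "quit"): 0.0}

def step(state, action):
    """Dice game dynamics: quit -> $10 and end; stay -> $4, end with prob 1/3."""
    if action == "quit":
        return "End", 10.0
    return ("End" if random.random() < 1 / 3 else "In"), 4.0

for _ in range(20000):
    s = "In"
    while s != "End":
        # epsilon-greedy behaviour policy
        if random.random() < eps:
            a = random.choice(["stay", "quit"])
        else:
            a = max(["stay", "quit"], key=lambda act: Q[(s, act)])
        s_next, r = step(s, a)
        # off-policy target: best action value from the next state
        best_next = 0.0 if s_next == "End" else max(Q[(s_next, act)] for act in ["stay", "quit"])
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

print(Q)  # Q(In, stay) ~ 12, Q(In, quit) ~ 10
```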


Exploration vs Exploitation

As we explore state–action–reward sequences to estimate the Q functions or value functions,
it is of practical importance to also exploit that knowledge to earn superior rewards along the way.
On one extreme we have random exploration, which assigns equal probability to all actions; the
emphasis is not on exploitation but on exploring the entire space.
On the other extreme we have the greedy approach, which always follows the action with the highest
estimated future reward.
In between these two extremes we have the $\epsilon$-greedy approach: with probability $\epsilon$ we take a random
action (explore) and with probability $1 - \epsilon$ we take the best action (exploit). Practically speaking, we
usually start with a high value of $\epsilon$ to explore the space initially, and then gradually reduce it
(perhaps exponentially) to exploit more than explore.
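
A sketch of ε-greedy action selection with an exponentially decaying ε; the Q-value table is a placeholder:

```python
import random

def epsilon_greedy(q_values, eps):
    """With probability eps explore (random action), else exploit (best action)."""
    if random.random() < eps:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

q_values = {"left": 1.0, "right": 2.5}   # placeholder estimates
eps, decay = 1.0, 0.99                   # start fully exploratory, decay each step
for step in range(500):
    action = epsilon_greedy(q_values, eps)
    eps *= decay                         # gradually shift from exploring to exploiting
print(round(eps, 3), action)
```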

