Risk and AI: Reinforcement Learning
www.peaks2tails.com (Satyapriya Ojha)
Reinforcement learning has led to a proliferation of AI bots in various areas. Some of the notable
developments include:
Autonomous Driving
AlphaZero ( Chess, Go ) https://www.youtube.com/watch?v=WXuK6gekU1Y
Algorithmic Trading
Dynamic Hedging
The Agent and the Environment together generate the (SARSA) sequence
$S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, S_3, \dots$
The states $S \in \mathcal{S}$ and rewards $R \in \mathcal{R}$ are, in general, random quantities drawn from a distribution, which may be finite or infinite.
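To make the interaction concrete, here is a minimal sketch (not from the notes) of the agent-environment loop that produces such a sequence; the toy line-walk environment and the uniform random policy are illustrative assumptions.

```python
import random

# A toy sketch of the agent-environment loop that generates the
# sequence S0, A0, R1, S1, A1, R2, ...
# The line-walk environment and the uniform random policy are assumptions.

ACTIONS = ["left", "right"]

def toy_step(state, action):
    """Assumed dynamics: move on a line; reaching position 3 ends the episode."""
    next_state = state + (1 if action == "right" else -1)
    done = next_state >= 3
    reward = 1.0 if done else 0.0
    return reward, next_state, done

def run_episode(start_state=0, max_steps=20):
    trajectory = []                      # collects (S_t, A_t, R_{t+1}) triples
    state = start_state
    for _ in range(max_steps):
        action = random.choice(ACTIONS)  # uniform random policy
        reward, next_state, done = toy_step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory, state             # final state is the terminal S_T

if __name__ == "__main__":
    print(run_episode())
```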
An MDP (Markov Decision Process) is a process with the Markov property. The Markov property states that the distribution of the next state and reward depends only on the current state and action, not on the entire history of states (think of Chess!). Formally,

$p(s', r \mid s, a) = \mathbb{P}\left(S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\right)$
Goal
The goal of the agent is to maximize the total expected reward $G_t$ over the long run. We use the familiar discounting factor $\gamma$ (think Time Value of Money) to discount future rewards:

$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

We can write this in recursive fashion, which is often useful in these types of problems:

$G_t = R_{t+1} + \gamma G_{t+1}$
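A small sketch (the reward sequence and the value of $\gamma$ are illustrative assumptions) checking that the recursive form reproduces the discounted sum:

```python
# Check that G_t = R_{t+1} + gamma * G_{t+1} reproduces the discounted sum.

def discounted_return(rewards, gamma):
    """Direct definition: G_0 = sum over k of gamma^k * R_{k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def discounted_return_recursive(rewards, gamma):
    """Recursive form: G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0 at the end."""
    g = 0.0
    for r in reversed(rewards):          # work backwards from the end of the episode
        g = r + gamma * g
    return g

rewards = [4.0, 4.0, 4.0, 10.0]          # an assumed reward sequence
gamma = 0.9
print(discounted_return(rewards, gamma))            # 18.13
print(discounted_return_recursive(rewards, gamma))  # 18.13 (same value)
```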
Policy Function
A Policy Function $\pi(a \mid s)$ assigns a probability to each action $A \in \mathcal{A}$ an agent can take when in state $S \in \mathcal{S}$. In other words, under a given policy, the actions taken in a state follow a distribution.
Example (chess): with $s_0$ = Opening, $a_1$ = e4 and $a_2$ = d4, a policy might assign

$\pi(a_1 \mid s_0) = 80\%, \quad \pi(a_2 \mid s_0) = 20\%$
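One natural way to represent such a policy in code is a dictionary of action probabilities per state. The numbers below mirror the chess-opening example above; the dictionary representation itself is an assumption, a minimal sketch rather than a prescribed implementation.

```python
import random

# A stochastic policy pi(a|s) stored as a dictionary of
# action probabilities per state.

policy = {
    "opening": {"e4": 0.80, "d4": 0.20},
}

def sample_action(policy, state):
    """Draw an action according to pi(. | state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(policy, "opening"))   # 'e4' roughly 80% of the time
```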
Value Function
A State-Value Function of state s under a policy 𝜋 is the total expected reward starting from state s
and following policy 𝜋 thereafter. It’s given as
$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$
An Action-Value Function of taking action 𝑎 in state s under a policy 𝜋 is the total expected reward
starting from state s, taking the action 𝑎 and following policy 𝜋 thereafter. It’s given as
$q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$
Bellman equation
The Bellman equation is the single most important equation in RL. It gives us a recursive way to
compute the value functions.
Action-value function (a one-step backup from $(s, a)$ through the chance node $(r, p)$ to $s'$):

$q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right] = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$

State-value function (a backup from $s$ over the actions weighted by $\pi(a \mid s)$):

$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right] = \sum_a \pi(a \mid s)\, q_\pi(s, a)$
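The Bellman expectation equation can be applied repeatedly to evaluate a given policy. Below is a minimal sketch of iterative policy evaluation; the tiny two-state MDP, the 50/50 policy and $\gamma = 0.9$ are illustrative assumptions, not examples from the notes.

```python
# Iterative policy evaluation: sweep over the states and repeatedly apply the
# Bellman expectation equation until v_pi stops changing.

p = {  # p[(s, a)] -> list of (probability, reward, next_state)
    ("A", "go"):   [(1.0, 1.0, "B")],
    ("A", "wait"): [(1.0, 0.0, "A")],
    ("B", "go"):   [(1.0, 2.0, "A")],
    ("B", "wait"): [(1.0, 0.0, "B")],
}
pi = {  # pi[s][a] = probability of taking action a in state s
    "A": {"go": 0.5, "wait": 0.5},
    "B": {"go": 0.5, "wait": 0.5},
}
gamma = 0.9

v = {s: 0.0 for s in pi}                   # initial guess v_pi(s) = 0
for _ in range(1000):
    delta = 0.0
    for s in v:
        new_v = sum(pi[s][a] * prob * (r + gamma * v[s2])
                    for a in pi[s]
                    for prob, r, s2 in p[(s, a)])
        delta = max(delta, abs(new_v - v[s]))
        v[s] = new_v
    if delta < 1e-10:
        break

print(v)   # v_pi(A) and v_pi(B) under the assumed 50/50 policy
```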
A policy $\pi$ is better than a policy $\pi'$ if $v_\pi(s) \ge v_{\pi'}(s)$ for all $s \in \mathcal{S}$. Let's denote the optimal policy, the best among all policies, by $\pi^*$. The state-value function under the optimal policy is called the optimal state-value function, denoted $v_*(s)$:

$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right]$

The optimal action-value function, denoted $q_*(s, a)$, is given as

$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma \max_{a'} q_*(s', a')\right]$
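The Bellman optimality equation likewise suggests an iterative scheme, value iteration (listed later alongside policy iteration). A minimal sketch on the same assumed two-state MDP, with $\gamma = 0.9$ again an illustrative choice:

```python
# Value iteration: repeatedly apply the Bellman optimality equation until v*
# stops changing, then read off the greedy policy.

p = {  # p[(s, a)] -> list of (probability, reward, next_state)
    ("A", "go"):   [(1.0, 1.0, "B")],
    ("A", "wait"): [(1.0, 0.0, "A")],
    ("B", "go"):   [(1.0, 2.0, "A")],
    ("B", "wait"): [(1.0, 0.0, "B")],
}
states, actions, gamma = ["A", "B"], ["go", "wait"], 0.9

def backup(v, s, a):
    """One-step lookahead: sum over (s', r) of p(s', r | s, a) [r + gamma v(s')]."""
    return sum(prob * (r + gamma * v[s2]) for prob, r, s2 in p[(s, a)])

v = {s: 0.0 for s in states}
for _ in range(1000):
    delta = 0.0
    for s in states:
        best = max(backup(v, s, a) for a in actions)
        delta = max(delta, abs(best - v[s]))
        v[s] = best
    if delta < 1e-10:
        break

greedy = {s: max(actions, key=lambda a: backup(v, s, a)) for s in states}
print(v, greedy)   # both states prefer "go" under these assumed dynamics
```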
Example (a simple stay-or-quit game): in state $S_1$ = In the agent can choose $a_1$ = Stay or $a_2$ = Quit, and each choice leads to a chance node.
(In, Stay): reward $r = \$4$ and remain In with probability 2/3, or reward $r = \$4$ and move to End with probability 1/3.
(In, Quit): reward $r = \$10$ and move to End with probability 1. End ($S_2$) is an absorbing state.
Note that $v(\text{End}) = 0$ as the game is over. We need to find $v(\text{In})$ under policy $\pi_1 :=$ (In, Stay) and under policy $\pi_2 :=$ (In, Quit).

$v_{\pi_1}(\text{In}) = \tfrac{2}{3}\left[4 + 1 \cdot v_{\pi_1}(\text{In})\right] + \tfrac{1}{3}\left[4 + 1 \cdot v_{\pi_1}(\text{End})\right]$

$v_{\pi_1}(\text{In}) = 4 + \tfrac{2}{3}\, v_{\pi_1}(\text{In})$

$v_{\pi_1}(\text{In}) = 12$

$v_{\pi_2}(\text{In}) = 1 \cdot \left[10 + 1 \cdot v_{\pi_2}(\text{End})\right] = 10$
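A quick numerical check of the worked example, treating the game dynamics above as given and iterating the Bellman expectation equation with $\gamma = 1$; the iteration count is an arbitrary choice.

```python
# Evaluate v(In) under the "always Stay" and "always Quit" policies by
# iterating the Bellman expectation equation with gamma = 1.

p = {  # (state, action) -> list of (probability, reward, next_state)
    ("In", "Stay"): [(2/3, 4.0, "In"), (1/3, 4.0, "End")],
    ("In", "Quit"): [(1.0, 10.0, "End")],
}

def evaluate(action, iters=2000):
    """Value of always taking `action` in state In, with v(End) = 0."""
    v_in, v_end = 0.0, 0.0
    for _ in range(iters):
        v_in = sum(prob * (r + (v_in if s2 == "In" else v_end))
                   for prob, r, s2 in p[("In", action)])
    return v_in

print(evaluate("Stay"))   # converges to 12 (policy pi_1)
print(evaluate("Quit"))   # 10.0 (policy pi_2)
```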
Model-Free learning
Model-based (dynamic programming): Value Iteration, Policy Iteration
Model-free: Monte Carlo, Temporal Difference (TD), Q-Learning
In Monte Carlo learning we estimate

$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$

by taking the cumulative reward per episode, $G_t$, and averaging it across all episodes.
We can establish an iterative scheme that converges to the true value by updating the old estimate towards each newly observed return:

$v_\pi(s)_{new} = v_\pi(s)_{old} + \tfrac{1}{n}\left[G_t - v_\pi(s)_{old}\right]$

$Q_\pi(s, a)_{new} = Q_\pi(s, a)_{old} + \tfrac{1}{n}\left[G_t - Q_\pi(s, a)_{old}\right]$
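A minimal sketch of every-visit Monte Carlo estimation using this incremental update; the two short episodes of (state, reward) pairs and $\gamma = 1$ are illustrative assumptions.

```python
from collections import defaultdict

# Every-visit Monte Carlo estimation using the incremental update
# v(s) <- v(s) + (1/n)[G_t - v(s)].

def mc_estimate(episodes, gamma=1.0):
    v = defaultdict(float)      # current estimate of v_pi(s)
    n = defaultdict(int)        # number of returns averaged into v[s]
    for episode in episodes:    # episode = [(S_0, R_1), (S_1, R_2), ...]
        g = 0.0
        for state, reward in reversed(episode):    # compute G_t backwards
            g = reward + gamma * g
            n[state] += 1
            v[state] += (g - v[state]) / n[state]  # incremental average
    return dict(v)

episodes = [
    [("In", 4.0), ("In", 4.0), ("In", 4.0)],   # an episode that stayed, then ended
    [("In", 10.0)],                            # an episode that quit immediately
]
print(mc_estimate(episodes))
```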
In Temporal Difference (TD) learning we do not have to wait until the end of the episode to compute the return. Instead, at every step of the episode we update the state value (or state-action value) using the current estimate of the state value, the immediate reward and the value estimate of the next state.

It is generally more efficient than Monte Carlo, and it can be applied to infinite-horizon problems, since we do not have to wait until the end of an episode.
$v_\pi(s) = \mathbb{E}_\pi\left[R + \gamma v_\pi(s')\right]$

where $R$ is the immediate reward and $\gamma v_\pi(s')$ is the discounted value of the next state. The TD update moves the old estimate towards the TD target $R + \gamma v_\pi(s')_{old}$:

$v_\pi(s)_{new} = v_\pi(s)_{old} + \alpha\left[R + \gamma v_\pi(s')_{old} - v_\pi(s)_{old}\right]$

The bracketed term is the TD error.
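A minimal sketch of this TD update on the stay-or-quit game above, assuming the agent always chooses Stay; the step size $\alpha$, $\gamma = 1$ and the episode count are illustrative choices.

```python
import random
from collections import defaultdict

# TD learning of state values under the "always Stay" policy.

def stay_step():
    """One step of the assumed 'Stay' dynamics: r = 4, episode ends with prob 1/3."""
    return 4.0, ("End" if random.random() < 1/3 else "In")

def td_learn(num_episodes=5000, alpha=0.05, gamma=1.0):
    v = defaultdict(float)                     # v("End") stays at 0
    for _ in range(num_episodes):
        state = "In"
        while state != "End":
            reward, next_state = stay_step()
            td_target = reward + gamma * v[next_state]
            v[state] += alpha * (td_target - v[state])   # alpha times TD error
            state = next_state
    return dict(v)

print(td_learn())   # v("In") should come out close to 12
```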
Q Learning
Q-Learning applies the same TD idea to the action-value function, using the best action in the next state inside the target:

$Q(s, a)_{new} = Q(s, a)_{old} + \alpha\left[R + \gamma \max_{a'} Q(s', a')_{old} - Q(s, a)_{old}\right]$

The bracketed term is again the TD error.
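A minimal sketch of tabular Q-Learning on the same stay-or-quit game; actions are chosen uniformly at random here (pure exploration), and the learning rate and episode count are illustrative assumptions. The $\epsilon$-greedy refinement is discussed next.

```python
import random
from collections import defaultdict

# Tabular Q-Learning on the stay-or-quit game, with purely random exploration.

P = {  # (state, action) -> list of (probability, reward, next_state)
    ("In", "Stay"): [(2/3, 4.0, "In"), (1/3, 4.0, "End")],
    ("In", "Quit"): [(1.0, 10.0, "End")],
}
ACTIONS = ["Stay", "Quit"]

def sample(state, action):
    outcomes = P[(state, action)]
    probs = [o[0] for o in outcomes]
    _, reward, next_state = random.choices(outcomes, weights=probs, k=1)[0]
    return reward, next_state

def q_learning(num_episodes=20000, alpha=0.05, gamma=1.0):
    q = defaultdict(float)
    for _ in range(num_episodes):
        state = "In"
        while state != "End":
            action = random.choice(ACTIONS)                  # explore
            reward, next_state = sample(state, action)
            best_next = max(q[(next_state, a)] for a in ACTIONS)
            td_error = reward + gamma * best_next - q[(state, action)]
            q[(state, action)] += alpha * td_error
            state = next_state
    return dict(q)

print(q_learning())   # Q(In, Stay) should approach 12 and Q(In, Quit) 10
```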
Exploration vs Exploitation
As we explore state, action and reward sequences to estimate the Q-functions or value functions, it is of practical importance to also exploit that knowledge to earn superior rewards along the way.

At one extreme we have random exploration, which assigns equal probability to all actions. The emphasis is not on exploitation but on exploring the entire space.

At the other extreme we have the greedy approach, which always follows the actions that currently have the highest expected future reward.

In between these two extremes we have the $\epsilon$-greedy approach: with probability $\epsilon$ we take a random action (explore) and with probability $1 - \epsilon$ we take the best action (exploit). Practically speaking, we usually start with a high value of $\epsilon$ to explore the space initially, and then gradually reduce it (perhaps exponentially) so that we exploit more than we explore, as in the sketch below.
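A minimal sketch of $\epsilon$-greedy action selection with an exponentially decaying $\epsilon$; the decay rate and the hard-coded Q-values are illustrative assumptions, and the environment step and Q-update are left as a placeholder.

```python
import random

# Epsilon-greedy action selection with exponential decay of epsilon.

def epsilon_greedy(q, state, actions, epsilon):
    """With prob. epsilon explore (random action); otherwise exploit (best action)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))

q = {("In", "Stay"): 12.0, ("In", "Quit"): 10.0}   # e.g. previously learned Q-values
epsilon, decay = 1.0, 0.995                        # start fully exploratory

for episode in range(1000):
    action = epsilon_greedy(q, "In", ["Stay", "Quit"], epsilon)
    # ... environment step and Q-update would go here ...
    epsilon *= decay                               # explore less over time
```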