
Bellman Equation

According to the Bellman equation, the long-term reward for a given action equals the immediate reward from the current action plus the expected reward from the future actions taken at the following time steps.

Example: we have a maze, and the goal of our agent is to reach the trophy state (R = 1) and get the good reward, while avoiding the fire state (R = -1), which counts as a failure and gives a bad reward.

What happens without the Bellman equation?

We give the agent some time to explore the environment. As soon as it finds its goal, it backtracks its steps to its starting position and marks the value of every state that leads towards the goal as V = 1.

The agent will face no problem until we change its starting position; then it will not be able to find a path towards the trophy state, since the value of all those states is equal to 1. To solve this problem we use the Bellman equation:

V(s) = max_a ( R(s,a) + γ V(s') )

State(s): the current state

Next State(s'): the state the agent reaches after taking action (a) in state (s)

Value(V): a numeric representation of a state that helps the agent find its path. V(s) here means the value of state s.

Reward(R): the reward the agent gets after performing an action (a).

 R(s): reward for being in the state s
 R(s,a): reward for being in the state s and performing an action a
 R(s,a,s'): reward for being in the state s, taking an action a and ending up in s'

e.g. Good reward can be +1, Bad reward can be -1, No reward can be 0.

Action(a): set of possible actions that can be taken by the agent in the state(s). e.g. (LEFT, RIGHT, UP,
DOWN)

Discount factor(γ): determines how much the agent cares about rewards in the distant future relative to those in the immediate future. It has a value between 0 and 1. A lower value encourages short-term rewards, while a higher value promises long-term rewards.

The max denotes the optimal action among all the actions that the agent can take in a particular state, i.e. the one that leads towards the reward when this process is repeated at every consecutive step.

For example:

The state to the left of the fire state (V = 0.9) can go UP, DOWN or RIGHT, but NOT LEFT, because that is a wall (not accessible). Among all the available actions, the one giving the maximum value for that state is UP.

From its starting state, our agent can choose either UP or RIGHT at random, since both lead towards the reward in the same number of steps.

By using the Bellman equation, our agent will calculate the value of every state except for the trophy and the fire state (V = 0); these cannot have values since they are the end of the maze.

So, after making such a plan our agent can easily accomplish its goal by just following the increasing
values.
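
The value-propagation idea above can be sketched in a few lines of Python. The grid layout, wall position and reward placement below are assumptions made for illustration (they are not taken from the original figure); the loop simply repeats the update V(s) = max_a ( R(s,a) + γ V(s') ) until the values settle.

# Minimal value-iteration sketch for a grid maze (layout assumed for illustration).
# States are (row, col) cells; the trophy gives R = +1, the fire gives R = -1.

GAMMA = 0.9
ROWS, COLS = 3, 4
TROPHY, FIRE = (0, 3), (1, 3)      # terminal states (assumed positions)
WALL = (1, 1)                      # inaccessible cell (assumed position)
ACTIONS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}

def reward(state):
    if state == TROPHY:
        return 1.0                  # good reward
    if state == FIRE:
        return -1.0                 # bad reward
    return 0.0                      # no reward otherwise

def step(state, action):
    """Deterministic move; bumping into the wall or the border keeps the agent in place."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (r + dr, c + dc)
    if nxt == WALL or not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        return state
    return nxt

# Start with V = 0 everywhere; the terminal trophy/fire states are never updated.
V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS) if (r, c) != WALL}

for _ in range(50):                 # repeat the Bellman update until values settle
    for s in V:
        if s in (TROPHY, FIRE):
            continue                # terminal states keep V = 0, as in the text
        V[s] = max(reward(step(s, a)) + GAMMA * V[step(s, a)] for a in ACTIONS)

print(V)                            # following the increasing values leads to the trophy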
Markov Reward Process
Markov Process or Markov Chains

A Markov process is a memoryless random process, i.e. a sequence of random states S[1], S[2], …, S[n] with the Markov property. So, it is basically a sequence of states with the Markov property. It can be defined using a set of states (S) and a transition probability matrix (P). The dynamics of the environment can be fully defined using the states (S) and the transition probability matrix (P).

But what does "random process" mean?

To answer this question, let's look at an example:

The edges in the diagram denote transition probabilities. From this chain, let's take some samples. Now, suppose that we are sleeping; then, according to the probability distribution, there is a 0.6 chance that we will run, a 0.2 chance that we will sleep more, and again a 0.2 chance that we will eat ice-cream. Similarly, we can think of other sequences that we can sample from this chain.

Some samples from the chain :

Sleep — Run — Ice-cream — Sleep

Sleep — Ice-cream — Ice-cream — Run

In the above two sequences, we see that we get a random sequence of states (S) every time we run the chain. That is why the Markov process is called a random process.
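
As a concrete illustration, here is a minimal sampling sketch in Python. Only the probabilities from the "Sleep" state are given in the text above; the other rows of the transition table are assumed values for illustration.

import random

# States of the toy Markov chain from the example.
STATES = ["Sleep", "Run", "Ice-cream"]

# Transition probabilities P[s][s']. Only the "Sleep" row comes from the text;
# the "Run" and "Ice-cream" rows are assumed for illustration.
P = {
    "Sleep":     {"Sleep": 0.2, "Run": 0.6, "Ice-cream": 0.2},
    "Run":       {"Sleep": 0.1, "Run": 0.6, "Ice-cream": 0.3},   # assumed
    "Ice-cream": {"Sleep": 0.7, "Run": 0.1, "Ice-cream": 0.2},   # assumed
}

def sample_chain(start, length):
    """Sample a sequence of states; the next state depends only on the current one (Markov property)."""
    state, trajectory = start, [start]
    for _ in range(length):
        nxt = random.choices(STATES, weights=[P[state][s] for s in STATES])[0]
        trajectory.append(nxt)
        state = nxt
    return trajectory

print(sample_chain("Sleep", 3))   # e.g. ['Sleep', 'Run', 'Ice-cream', 'Sleep']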

Rewards and Returns: rewards are the numerical values that the agent receives on performing some action at some state (s) in the environment. The numerical value can be positive or negative based on the actions of the agent.

In reinforcement learning, we care about maximizing the cumulative reward (all the rewards the agent receives from the environment) instead of only the reward the agent receives from the current state (also called the immediate reward).

We can define the return as:

G[t] = r[t+1] + r[t+2] + … + r[T]

Here, r[t+1] is the reward received by the agent at time step t[0] while performing an action (a) to move from one state to another. Similarly, r[t+2] is the reward received by the agent at time step t[1] by performing an action to move to another state. And r[T] is the reward received by the agent at the final time step by performing an action to move to another state.

Discount Factor (γ): it determines how much importance is to be given to the immediate reward and to future rewards. This basically helps us to avoid infinity as a reward in continuous tasks. It has a value between 0 and 1. A value of 0 means that more importance is given to the immediate reward, and a value of 1 means that more importance is given to future rewards. In practice, a discount factor of 0 will never learn, as it only considers the immediate reward, and a discount factor of 1 will keep going for future rewards, which may lead to infinity. Therefore, the optimal value for the discount factor lies between 0.2 and 0.8.

Reinforcement learning is all about the goal of maximizing the reward. So, let's add rewards to our Markov chain. This gives us a Markov Reward Process (MRP), which means MRPs are Markov chains with value judgements: basically, we get a value from every state our agent is in.

The reward function R[s] = E[ r[t+1] | S[t] = s ] tells us how much reward we expect to get from a particular state S[t], i.e. the immediate reward from that particular state our agent is in. As we will see later, we maximize these rewards from each state our agent is in; in simple terms, we maximize the cumulative reward we get from each state.

We define an MRP as (S, P, R, γ), where: S is a set of states, P is the transition probability matrix, R is the reward function we saw earlier, and γ is the discount factor.
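
Putting these pieces together, the sketch below writes down an (S, P, R, γ) tuple for the Sleep / Run / Ice-cream chain and computes the value of every state by solving the Bellman expectation equation for an MRP, V = R + γPV. The reward numbers (and the transition rows not given in the text) are assumptions chosen only to illustrate the computation.

import numpy as np

# MRP = (S, P, R, gamma) for the Sleep / Run / Ice-cream chain.
S = ["Sleep", "Run", "Ice-cream"]
P = np.array([[0.2, 0.6, 0.2],     # from Sleep (probabilities from the example)
              [0.1, 0.6, 0.3],     # from Run (assumed)
              [0.7, 0.1, 0.2]])    # from Ice-cream (assumed)
R = np.array([-1.0, 2.0, 1.0])     # immediate reward of each state (assumed)
gamma = 0.8

# Bellman expectation equation for an MRP: V = R + gamma * P * V
# => (I - gamma * P) V = R, which we solve directly.
V = np.linalg.solve(np.eye(len(S)) - gamma * P, R)

for s, v in zip(S, V):
    print(f"V({s}) = {v:.2f}")
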
Function Approximation

 Function approximation in reinforcement learning involves approximating the value function or policy function using a parametric model, such as a neural network.
 Value function approximation estimates the expected return from a given state or state-action
pair using a parameterized function instead of a lookup table.
 Policy function approximation approximates the agent's behavior by mapping states to actions
using a parameterized model.

 Neural networks are commonly used as function approximators in reinforcement learning due to
their ability to model complex relationships.
 The neural network takes a state or state-action pair as input and outputs the estimated value or
action probabilities.
 Training the neural network involves adjusting its parameters using techniques like gradient descent to minimize the difference between predicted and actual values or actions (a minimal sketch follows this list).
 Function approximation allows the agent to generalize its knowledge and make informed
decisions in similar states.
 Challenges in function approximation include balancing the trade-off between underfitting and
overfitting, where the model is either too simple or too specific to the training data.
 Regularization techniques and careful selection of model architecture can help mitigate
underfitting and overfitting issues.
 Function approximation is used in various reinforcement learning algorithms, such as Deep Q-
Networks (DQN), Proximal Policy Optimization (PPO), and Advantage Actor-Critic (A2C), to learn
effective policies in complex environments.
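
A minimal sketch of value-function approximation, assuming PyTorch is available: a small neural network maps a state vector to an estimated value V(s) and is trained by gradient descent to reduce the gap between predicted and target values. The state dimension, network size and training targets are placeholders invented for illustration, not part of any specific algorithm from the list above.

import torch
import torch.nn as nn

# Small network that maps a state vector (4 features, assumed) to a scalar value V(s).
value_net = nn.Sequential(
    nn.Linear(4, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Placeholder batch of states and target returns (in practice these come from the agent's experience).
states = torch.randn(64, 4)
targets = torch.randn(64, 1)

for _ in range(100):                       # gradient-descent training loop
    predicted = value_net(states)          # estimated V(s) for each state in the batch
    loss = loss_fn(predicted, targets)     # difference between predicted and target values
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(loss.item())                         # the loss shrinks as the approximation improves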

Markov Game

A Markov game, also known as a stochastic game or multi-agent Markov decision process (MDP), is an
extension of the Markov decision process framework to include multiple interacting agents. In
reinforcement learning, a Markov game is used to model environments where multiple agents make
decisions simultaneously and their actions affect each other's rewards and transitions.
 Multiple Agents: In a Markov game, there are two or more agents that interact with each other and with the environment. Each agent selects actions based on its own policy and receives rewards based on the joint actions taken by all agents.
 State and Action Spaces: Similar to a Markov decision process, a Markov game consists of a state
space and an action space. The state space represents the possible configurations of the
environment, and the action space represents the available actions for each agent.
 Transitions and Rewards: The environment in a Markov game transitions from one state to
another based on the joint actions selected by the agents. The transition probabilities and
rewards depend on the joint action taken and the current state.
 Joint Policy: In a Markov game, each agent has its own policy that maps its observations to
actions. However, the agents need to coordinate their actions to achieve optimal outcomes. This
coordination can be achieved through a joint policy, which specifies how each agent's policy is
combined to determine the joint action.

 Nash Equilibrium: In Markov games, the notion of Nash equilibrium from game theory is often
used to analyze the optimal joint policy. A Nash equilibrium is a set of joint policies where no
agent can unilaterally improve its own reward by changing its policy while all other agents keep
their policies fixed.
 Learning in Markov Games: Reinforcement learning algorithms can be extended to learn in Markov games. This involves agents updating their policies based on their own observations and rewards, as well as the actions and rewards of other agents. Techniques like multi-agent Q-learning, policy gradient methods, and actor-critic algorithms can be applied in the context of Markov games (a minimal sketch follows this list).
 Complexity: Markov games can introduce additional challenges due to the increased complexity
of interactions among agents. The state and action spaces grow exponentially with the number
of agents, making it more difficult to learn optimal policies.
 Applications: Markov games find applications in various domains where multiple agents interact,
such as multi-robot systems, autonomous driving, and strategic games.
 By modeling environments as Markov games, reinforcement learning algorithms can handle
scenarios with multiple interacting agents and learn effective policies that take into account the
interdependencies among agents' actions and rewards.
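
To make the learning setup concrete, here is a minimal sketch of independent Q-learning in a repeated two-player matrix game, i.e. a Markov game with a single state, so each update reduces to a bandit-style Q-update. The coordination payoffs are assumed purely for illustration; practical multi-agent algorithms are considerably more involved.

import random

# Coordination game (assumed payoffs): both agents are rewarded when they pick the same action.
# REWARDS[(a1, a2)] = (reward for agent 1, reward for agent 2)
REWARDS = {
    (0, 0): (1.0, 1.0),
    (1, 1): (1.0, 1.0),
    (0, 1): (0.0, 0.0),
    (1, 0): (0.0, 0.0),
}

ALPHA, EPSILON = 0.1, 0.1          # learning rate and exploration rate
q1 = [0.0, 0.0]                    # agent 1's Q-values for its two actions
q2 = [0.0, 0.0]                    # agent 2's Q-values for its two actions

def choose(q):
    """Epsilon-greedy action selection over a single-state Q-table."""
    if random.random() < EPSILON:
        return random.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])

for _ in range(5000):
    a1, a2 = choose(q1), choose(q2)             # joint action selected simultaneously
    r1, r2 = REWARDS[(a1, a2)]                  # rewards depend on the joint action
    q1[a1] += ALPHA * (r1 - q1[a1])             # each agent updates only its own Q-value
    q2[a2] += ALPHA * (r2 - q2[a2])

print(q1, q2)   # both agents typically settle on the same action (a Nash equilibrium here)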
