
Bellman Equation

According to the Bellman equation, the long-term reward for a given action equals the immediate reward from the current action plus the expected reward from the future actions taken at the following time steps.

Example: we have a maze, and the goal of our agent is to reach the trophy state (R = 1) and get the good reward, while avoiding the fire state (R = -1), which counts as a failure and gives a bad reward.

What happens without the Bellman equation?

We give the agent some time to explore the environment. As soon as it finds its goal, it backtracks its steps to its starting position and marks the value of every state that leads towards the goal as V = 1.

The agent will face no problem until we change its starting position; then it will not be able to find a path towards the trophy state, since the value of all those states is equal to 1. To solve this problem we use the Bellman equation:

V(s) = max_a ( R(s,a) + γ V(s') )

State(s): the current state

Next State(s'): the state the agent reaches after taking action (a) in state (s)

Value(V): a numeric representation of a state that helps the agent find its path. V(s) here means the value of state s.

Reward(R): the reward the agent gets after performing an action (a).

 R(s): reward for being in the state s
 R(s,a): reward for being in the state s and performing an action a
 R(s,a,s'): reward for being in the state s, taking an action a and ending up in s'

e.g. Good reward can be +1, Bad reward can be -1, No reward can be 0.

Action(a): set of possible actions that can be taken by the agent in the state(s). e.g. (LEFT, RIGHT, UP,
DOWN)

Discount factor(γ): determines how much the agent cares about rewards in the distant future relative to those in the immediate future. It has a value between 0 and 1. A lower value encourages short-term rewards, while a higher value promises long-term rewards.

The max denotes the optimal action among all the actions that the agent can take in a particular state, i.e. the one that leads towards the reward when this process is repeated at every consecutive step.

For example:

The state to the left of the fire state (V = 0.9) can go UP, DOWN or RIGHT, but NOT LEFT, because that is a wall (not accessible). Among all the available actions, the one giving the maximum value for that state is UP.

From its starting state, our agent can choose either UP or RIGHT at random, since both lead towards the reward in the same number of steps.

By using the Bellman equation, our agent will calculate the value of every state except for the trophy and the fire state (V = 0); these cannot have values since they are the end of the maze.

So, after making such a plan our agent can easily accomplish its goal by just following the increasing
values.
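
The value-propagation idea above can be sketched in a few lines of Python. The grid layout, wall position and reward placement below are assumptions made for illustration (they are not taken from the original figure); the loop simply repeats the update V(s) = max_a ( R(s,a) + γ V(s') ) until the values settle.

# Minimal value-iteration sketch for a grid maze (layout assumed for illustration).
# States are (row, col) cells; the trophy gives R = +1, the fire gives R = -1.

GAMMA = 0.9
ROWS, COLS = 3, 4
TROPHY, FIRE = (0, 3), (1, 3)      # terminal states (assumed positions)
WALL = (1, 1)                      # inaccessible cell (assumed position)
ACTIONS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}

def reward(state):
    if state == TROPHY:
        return 1.0                  # good reward
    if state == FIRE:
        return -1.0                 # bad reward
    return 0.0                      # no reward otherwise

def step(state, action):
    """Deterministic move; bumping into the wall or the border keeps the agent in place."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (r + dr, c + dc)
    if nxt == WALL or not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        return state
    return nxt

# Start with V = 0 everywhere; the terminal trophy/fire states are never updated.
V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS) if (r, c) != WALL}

for _ in range(50):                 # repeat the Bellman update until values settle
    for s in V:
        if s in (TROPHY, FIRE):
            continue                # terminal states keep V = 0, as in the text
        V[s] = max(reward(step(s, a)) + GAMMA * V[step(s, a)] for a in ACTIONS)

print(V)                            # following the increasing values leads to the trophy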
Markov Reward Process
Markov Process or Markov Chains

A Markov process is a memoryless random process, i.e. a sequence of random states S[1], S[2], …, S[n] with the Markov property. So, it is basically a sequence of states with the Markov property. It can be defined using a set of states (S) and a transition probability matrix (P). The dynamics of the environment can be fully defined using the states (S) and the transition probability matrix (P).

But what does "random process" mean?

To answer this question, let's look at an example:

The edges in the diagram denote transition probabilities. From this chain, let's take some samples. Now, suppose that we are sleeping; then, according to the probability distribution, there is a 0.6 chance that we will run, a 0.2 chance that we will sleep more, and again a 0.2 chance that we will eat ice-cream. Similarly, we can think of other sequences that we can sample from this chain.

Some samples from the chain :

Sleep — Run — Ice-cream — Sleep

Sleep — Ice-cream — Ice-cream — Run

In the above two sequences, we see that we get a random sequence of states (S) every time we run the chain. That is why the Markov process is called a random process.
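
As a concrete illustration, here is a minimal sampling sketch in Python. Only the probabilities from the "Sleep" state are given in the text above; the other rows of the transition table are assumed values for illustration.

import random

# States of the toy Markov chain from the example.
STATES = ["Sleep", "Run", "Ice-cream"]

# Transition probabilities P[s][s']. Only the "Sleep" row comes from the text;
# the "Run" and "Ice-cream" rows are assumed for illustration.
P = {
    "Sleep":     {"Sleep": 0.2, "Run": 0.6, "Ice-cream": 0.2},
    "Run":       {"Sleep": 0.1, "Run": 0.6, "Ice-cream": 0.3},   # assumed
    "Ice-cream": {"Sleep": 0.7, "Run": 0.1, "Ice-cream": 0.2},   # assumed
}

def sample_chain(start, length):
    """Sample a sequence of states; the next state depends only on the current one (Markov property)."""
    state, trajectory = start, [start]
    for _ in range(length):
        nxt = random.choices(STATES, weights=[P[state][s] for s in STATES])[0]
        trajectory.append(nxt)
        state = nxt
    return trajectory

print(sample_chain("Sleep", 3))   # e.g. ['Sleep', 'Run', 'Ice-cream', 'Sleep']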

Rewards and Returns: rewards are the numerical values that the agent receives on performing some action at some state (s) in the environment. The numerical value can be positive or negative based on the actions of the agent.

In reinforcement learning, we care about maximizing the cumulative reward (all the rewards the agent receives from the environment) instead of only the reward the agent receives from the current state (also called the immediate reward).

We can define the return as:

G[t] = r[t+1] + r[t+2] + … + r[T]

Here, r[t+1] is the reward received by the agent at time step t[0] while performing an action (a) to move from one state to another. Similarly, r[t+2] is the reward received by the agent at time step t[1] by performing an action to move to another state. And r[T] is the reward received by the agent at the final time step by performing an action to move to another state.

Discount Factor (γ): it determines how much importance is to be given to the immediate reward and to future rewards. This basically helps us to avoid infinity as a reward in continuous tasks. It has a value between 0 and 1. A value of 0 means that more importance is given to the immediate reward, and a value of 1 means that more importance is given to future rewards. In practice, a discount factor of 0 will never learn, as it only considers the immediate reward, and a discount factor of 1 will keep going for future rewards, which may lead to infinity. Therefore, the optimal value for the discount factor lies between 0.2 and 0.8.

Reinforcement learning is all about the goal of maximizing the reward. So, let's add rewards to our Markov chain. This gives us a Markov Reward Process (MRP), which means MRPs are Markov chains with value judgements: basically, we get a value from every state our agent is in.

The reward function R[s] = E[ r[t+1] | S[t] = s ] tells us how much reward we expect to get from a particular state S[t], i.e. the immediate reward from that particular state our agent is in. As we will see later, we maximize these rewards from each state our agent is in; in simple terms, we maximize the cumulative reward we get from each state.

We define an MRP as (S, P, R, γ), where: S is a set of states, P is the transition probability matrix, R is the reward function we saw earlier, and γ is the discount factor.
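
Putting these pieces together, the sketch below writes down an (S, P, R, γ) tuple for the Sleep / Run / Ice-cream chain and computes the value of every state by solving the Bellman expectation equation for an MRP, V = R + γPV. The reward numbers (and the transition rows not given in the text) are assumptions chosen only to illustrate the computation.

import numpy as np

# MRP = (S, P, R, gamma) for the Sleep / Run / Ice-cream chain.
S = ["Sleep", "Run", "Ice-cream"]
P = np.array([[0.2, 0.6, 0.2],     # from Sleep (probabilities from the example)
              [0.1, 0.6, 0.3],     # from Run (assumed)
              [0.7, 0.1, 0.2]])    # from Ice-cream (assumed)
R = np.array([-1.0, 2.0, 1.0])     # immediate reward of each state (assumed)
gamma = 0.8

# Bellman expectation equation for an MRP: V = R + gamma * P * V
# => (I - gamma * P) V = R, which we solve directly.
V = np.linalg.solve(np.eye(len(S)) - gamma * P, R)

for s, v in zip(S, V):
    print(f"V({s}) = {v:.2f}")
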
Function Approximation

 Function approximation in reinforcement learning involves approximating the value function or policy function using a parametric model, such as a neural network.
 Value function approximation estimates the expected return from a given state or state-action
pair using a parameterized function instead of a lookup table.
 Policy function approximation approximates the agent's behavior by mapping states to actions
using a parameterized model.

 Neural networks are commonly used as function approximators in reinforcement learning due to
their ability to model complex relationships.
 The neural network takes a state or state-action pair as input and outputs the estimated value or
action probabilities.
 Training the neural network involves adjusting its parameters using techniques like gradient descent to minimize the difference between predicted and actual values or actions (a minimal sketch follows this list).
 Function approximation allows the agent to generalize its knowledge and make informed
decisions in similar states.
 Challenges in function approximation include balancing the trade-off between underfitting and
overfitting, where the model is either too simple or too specific to the training data.
 Regularization techniques and careful selection of model architecture can help mitigate
underfitting and overfitting issues.
 Function approximation is used in various reinforcement learning algorithms, such as Deep Q-
Networks (DQN), Proximal Policy Optimization (PPO), and Advantage Actor-Critic (A2C), to learn
effective policies in complex environments.
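
A minimal sketch of value-function approximation, assuming PyTorch is available: a small neural network maps a state vector to an estimated value V(s) and is trained by gradient descent to reduce the gap between predicted and target values. The state dimension, network size and training targets are placeholders invented for illustration, not part of any specific algorithm from the list above.

import torch
import torch.nn as nn

# Small network that maps a state vector (4 features, assumed) to a scalar value V(s).
value_net = nn.Sequential(
    nn.Linear(4, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Placeholder batch of states and target returns (in practice these come from the agent's experience).
states = torch.randn(64, 4)
targets = torch.randn(64, 1)

for _ in range(100):                       # gradient-descent training loop
    predicted = value_net(states)          # estimated V(s) for each state in the batch
    loss = loss_fn(predicted, targets)     # difference between predicted and target values
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(loss.item())                         # the loss shrinks as the approximation improves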

Markov Game

A Markov game, also known as a stochastic game or multi-agent Markov decision process (MDP), is an
extension of the Markov decision process framework to include multiple interacting agents. In
reinforcement learning, a Markov game is used to model environments where multiple agents make
decisions simultaneously and their actions affect each other's rewards and transitions.
 Multiple Agents: In a Markov game, there are two or more agents that interact with each other and with the environment. Each agent selects actions based on its own policy and receives rewards based on the joint actions taken by all agents.
 State and Action Spaces: Similar to a Markov decision process, a Markov game consists of a state
space and an action space. The state space represents the possible configurations of the
environment, and the action space represents the available actions for each agent.
 Transitions and Rewards: The environment in a Markov game transitions from one state to
another based on the joint actions selected by the agents. The transition probabilities and
rewards depend on the joint action taken and the current state.
 Joint Policy: In a Markov game, each agent has its own policy that maps its observations to
actions. However, the agents need to coordinate their actions to achieve optimal outcomes. This
coordination can be achieved through a joint policy, which specifies how each agent's policy is
combined to determine the joint action.

 Nash Equilibrium: In Markov games, the notion of Nash equilibrium from game theory is often
used to analyze the optimal joint policy. A Nash equilibrium is a set of joint policies where no
agent can unilaterally improve its own reward by changing its policy while all other agents keep
their policies fixed.
 Learning in Markov Games: Reinforcement learning algorithms can be extended to learn in Markov games. This involves agents updating their policies based on their own observations and rewards, as well as the actions and rewards of other agents. Techniques like multi-agent Q-learning, policy gradient methods, and actor-critic algorithms can be applied in the context of Markov games (a minimal sketch follows this list).
 Complexity: Markov games can introduce additional challenges due to the increased complexity
of interactions among agents. The state and action spaces grow exponentially with the number
of agents, making it more difficult to learn optimal policies.
 Applications: Markov games find applications in various domains where multiple agents interact,
such as multi-robot systems, autonomous driving, and strategic games.
 By modeling environments as Markov games, reinforcement learning algorithms can handle
scenarios with multiple interacting agents and learn effective policies that take into account the
interdependencies among agents' actions and rewards.
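
To make the learning setup concrete, here is a minimal sketch of independent Q-learning in a repeated two-player matrix game, i.e. a Markov game with a single state, so each update reduces to a bandit-style Q-update. The coordination payoffs are assumed purely for illustration; practical multi-agent algorithms are considerably more involved.

import random

# Coordination game (assumed payoffs): both agents are rewarded when they pick the same action.
# REWARDS[(a1, a2)] = (reward for agent 1, reward for agent 2)
REWARDS = {
    (0, 0): (1.0, 1.0),
    (1, 1): (1.0, 1.0),
    (0, 1): (0.0, 0.0),
    (1, 0): (0.0, 0.0),
}

ALPHA, EPSILON = 0.1, 0.1          # learning rate and exploration rate
q1 = [0.0, 0.0]                    # agent 1's Q-values for its two actions
q2 = [0.0, 0.0]                    # agent 2's Q-values for its two actions

def choose(q):
    """Epsilon-greedy action selection over a single-state Q-table."""
    if random.random() < EPSILON:
        return random.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])

for _ in range(5000):
    a1, a2 = choose(q1), choose(q2)             # joint action selected simultaneously
    r1, r2 = REWARDS[(a1, a2)]                  # rewards depend on the joint action
    q1[a1] += ALPHA * (r1 - q1[a1])             # each agent updates only its own Q-value
    q2[a2] += ALPHA * (r2 - q2[a2])

print(q1, q2)   # both agents typically settle on the same action (a Nash equilibrium here)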
