We will give the agent some time to explore the environment. As soon as it finds its goal, it will trace its steps back to its starting position and mark the value of every state that leads towards the goal as V = 1. The agent will face no problem until we change its starting position, because from a new start it will not be able to find a path towards the trophy state: the value of all the marked states is equal to 1. To solve this problem we use the Bellman equation:
V(s) = max_a ( R(s, a) + γ V(s′) )
Value (V): a numeric representation of a state that helps the agent find its path; V(s) here means the value of state s.
Reward (R): the feedback the agent receives for taking an action in a state, e.g. a good reward can be +1, a bad reward can be -1, and no reward can be 0.
Action (a): the set of possible actions that can be taken by the agent in the state (s), e.g. (LEFT, RIGHT, UP, DOWN)
For example:
The state to the left of the fire state (V = 0.9) can go UP, DOWN, or RIGHT, but NOT LEFT, because that side is a wall (not accessible). Among all the available actions, the maximum value for that state comes from the UP action.
The current starting state of our agent can choose either UP or RIGHT at random, since both lead towards the reward in the same number of steps.
By using the Bellman equation, our agent will calculate the value of every state except the trophy and the fire state (V = 0); they cannot have values since they are the ends of the maze.
So, after building such a plan, our agent can easily accomplish its goal by simply following the increasing values.
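To make this concrete, here is a minimal value-iteration sketch in Python that applies the equation above to a small grid maze. The grid size, wall position, trophy and fire cells, and γ = 0.9 are assumptions for illustration, not the exact maze from the figures.

```python
# Minimal value-iteration sketch for a small grid maze (illustrative layout).
# Reaching the trophy gives +1, reaching the fire gives -1; both are terminal.

GAMMA = 0.9                      # discount factor (assumed value)
ROWS, COLS = 3, 4                # assumed grid size
TROPHY, FIRE = (0, 3), (1, 3)    # assumed terminal cells
WALL = (1, 1)                    # assumed inaccessible cell
ACTIONS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}

def reward(state):
    """Reward for arriving in a state."""
    return 1.0 if state == TROPHY else (-1.0 if state == FIRE else 0.0)

def step(state, move):
    """Next state for a move; the agent stays put if it hits a wall or the border."""
    r, c = state[0] + ACTIONS[move][0], state[1] + ACTIONS[move][1]
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS and (r, c) != WALL else state

# V(s) = max_a ( R(s, a) + GAMMA * V(s') ), swept repeatedly until it settles.
V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
for _ in range(100):
    for s in V:
        if s in (TROPHY, FIRE, WALL):
            continue                      # terminal and wall cells keep V = 0
        V[s] = max(reward(step(s, a)) + GAMMA * V[step(s, a)] for a in ACTIONS)

print(V)  # the agent just follows increasing values towards the trophy
```

With γ = 0.9, the states next to the trophy settle at 1.0, the next ring at 0.9, then 0.81, and so on, matching the values described above.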
Markov Reward Process
Markov Process or Markov Chains
A Markov Process (or Markov Chain) is a memoryless random process, i.e. a sequence of random states S[1], S[2], ..., S[n] with the Markov Property: the probability of moving to the next state depends only on the current state, not on the states that came before it. So, it's basically a sequence of states with the Markov Property, and the dynamics of the environment can be fully defined using a set of states (S) and a transition probability matrix (P).
The edges of the diagram denote the transition probabilities. Let's take some samples from this chain. Suppose that we are sleeping; according to the probability distribution, there is a 0.6 chance that we will run, a 0.2 chance that we will sleep more, and a 0.2 chance that we will eat ice-cream. Similarly, we can think of other sequences that we can sample from this chain.
In the two sample sequences above, we get a different random set of states (S) every time we run the chain. That's why a Markov process is called a random process: each run produces a random sequence of states.
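The chain described above can be sampled with a few lines of Python. The three states (Sleep, Run, Ice-cream) and the 0.2 / 0.6 / 0.2 probabilities out of the Sleep state come from the text; the remaining rows of the transition matrix are made-up placeholders just to make the sketch runnable.

```python
import random

# States of the chain and the transition probability matrix P.
# The "Sleep" row (0.2 sleep, 0.6 run, 0.2 ice-cream) follows the text;
# the other two rows are illustrative placeholders.
STATES = ["Sleep", "Run", "Icecream"]
P = {
    "Sleep":    {"Sleep": 0.2, "Run": 0.6, "Icecream": 0.2},
    "Run":      {"Sleep": 0.1, "Run": 0.6, "Icecream": 0.3},
    "Icecream": {"Sleep": 0.7, "Run": 0.2, "Icecream": 0.1},
}

def sample_chain(start="Sleep", steps=5):
    """Sample a sequence of states; the next state depends only on the current one."""
    state, sequence = start, [start]
    for _ in range(steps):
        state = random.choices(STATES, weights=[P[state][s] for s in STATES])[0]
        sequence.append(state)
    return sequence

print(sample_chain())  # e.g. ['Sleep', 'Run', 'Run', 'Icecream', 'Sleep', 'Run']
print(sample_chain())  # a different random sequence on every run
```

Running the sampler twice illustrates the point above: each run produces a different random sequence of states, even though the chain itself never changes.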
Rewards and Returns: Rewards are the numerical values that the agent receives for performing some action at some state (s) in the environment. The numerical value can be positive or negative, based on the actions of the agent.
In Reinforcement Learning, we care about maximizing the cumulative reward (all the rewards the agent receives from the environment) rather than only the reward the agent receives from the current state (also called the immediate reward).
r[t+1] is the reward received by the agent at time step t[0] while performing an action (a) to move from one state to another. Similarly, r[t+2] is the reward received by the agent at time step t[1] by performing an action to move to another state. And r[T] is the reward received by the agent at the final time step by performing an action to move to another state. Adding all of these rewards together gives us the return, the cumulative reward the agent collects over the episode.
Discount Factor (γ): it determines how much importance is given to the immediate reward versus future rewards. This basically helps us avoid an infinite return in continuing tasks. Its value lies between 0 and 1. A value close to 0 means that more importance is given to the immediate reward, and a value close to 1 means that more importance is given to future rewards. In practice, an agent with a discount factor of 0 will never learn about the future, as it considers only the immediate reward, and a discount factor of 1 will keep adding up future rewards, which may lead to infinity. Therefore, the optimal value for the discount factor usually lies between 0.2 and 0.8.
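As a quick illustration of how the discount factor shapes the cumulative reward, the sketch below computes the discounted return G[t] = r[t+1] + γ·r[t+2] + γ²·r[t+3] + ... for a made-up reward sequence under a few values of γ.

```python
# Discounted return: G[t] = r[t+1] + gamma*r[t+2] + gamma^2*r[t+3] + ...
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1, 0, 0, 1, 0, -1, 1]   # made-up reward sequence for illustration

for gamma in (0.0, 0.5, 0.9, 1.0):
    print(gamma, discounted_return(rewards, gamma))
# gamma = 0.0 keeps only the first (immediate) reward;
# gamma = 1.0 weighs all future rewards equally, which can blow up
# to infinity if the task never ends.
```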
Reinforcement Learning is all about the goal of maximizing the reward. So, let's add rewards to our Markov Chain. This gives us the Markov Reward Process: MRPs are Markov chains with value judgements. Basically, we get a value from every state our agent is in.
The reward function R[s] = E[ r[t+1] | S[t] = s ] tells us how much immediate reward we expect to get from the particular state S[t] our agent is in. In the next story we will see how we maximize these rewards from each state our agent is in; in simple terms, how we maximize the cumulative reward we get from each state.
We define an MRP as (S, P, R, γ), where: S is a set of states, P is the transition probability matrix, R is the reward function we saw earlier, and γ is the discount factor.
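Given the (S, P, R, γ) definition above, the state values of an MRP satisfy V = R + γPV, which can be solved directly as a linear system. The sketch below does this for the same Sleep/Run/Ice-cream chain; the per-state rewards are made-up numbers for illustration.

```python
import numpy as np

# MRP = (S, P, R, gamma); the rewards R below are illustrative placeholders.
states = ["Sleep", "Run", "Icecream"]
P = np.array([[0.2, 0.6, 0.2],       # transitions out of Sleep
              [0.1, 0.6, 0.3],       # transitions out of Run
              [0.7, 0.2, 0.1]])      # transitions out of Icecream
R = np.array([-1.0, 2.0, 1.0])       # assumed immediate reward of each state
gamma = 0.9

# Bellman equation for an MRP: V = R + gamma * P @ V,
# i.e. (I - gamma * P) V = R, solved here as a linear system.
V = np.linalg.solve(np.eye(len(states)) - gamma * P, R)

for s, v in zip(states, V):
    print(f"V({s}) = {v:.2f}")
```

Each entry of V is the expected cumulative (discounted) reward starting from that state, which is exactly the value judgement attached to every state mentioned above.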
Function Approximation
Neural networks are commonly used as function approximators in reinforcement learning due to
their ability to model complex relationships.
The neural network takes a state or state-action pair as input and outputs the estimated value or
action probabilities.
Training the neural network involves adjusting its parameters using techniques like gradient
descent to minimize the difference between predicted and actual values or actions.
Function approximation allows the agent to generalize its knowledge and make informed
decisions in similar states.
Challenges in function approximation include balancing the trade-off between underfitting and
overfitting, where the model is either too simple or too specific to the training data.
Regularization techniques and careful selection of model architecture can help mitigate
underfitting and overfitting issues.
Function approximation is used in various reinforcement learning algorithms, such as Deep Q-
Networks (DQN), Proximal Policy Optimization (PPO), and Advantage Actor-Critic (A2C), to learn
effective policies in complex environments.
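As a toy illustration of the points above, the sketch below uses a small PyTorch network as a Q-value function approximator: it maps a state vector to one estimated Q-value per action and is trained by gradient descent towards target values. The state size, action count, and random training batch are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 2          # assumed sizes for a toy environment

# Q-network: state vector in, one estimated Q-value per action out.
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, N_ACTIONS),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# A made-up batch of states and target Q-values; in a real agent the targets
# would come from the Bellman equation, e.g. r + gamma * max_a' Q(s', a').
states = torch.randn(32, STATE_DIM)
targets = torch.randn(32, N_ACTIONS)

for _ in range(100):
    predictions = q_net(states)              # predicted Q-values for the batch
    loss = loss_fn(predictions, targets)     # gap between predicted and target values
    optimizer.zero_grad()
    loss.backward()                          # gradient descent on the network weights
    optimizer.step()

# Greedy action for a new, unseen state: pick the highest estimated Q-value.
print(q_net(torch.randn(1, STATE_DIM)).argmax(dim=1))
```

Because the network generalizes across inputs, states it has never seen exactly can still receive sensible value estimates, which is the whole point of function approximation.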
Markov Game
A Markov game, also known as a stochastic game or multi-agent Markov decision process (MDP), is an
extension of the Markov decision process framework to include multiple interacting agents. In
reinforcement learning, a Markov game is used to model environments where multiple agents make
decisions simultaneously and their actions affect each other's rewards and transitions.
Multiple Agents: In a Markov game, there are two or more agents that interact with each other
and the environment. Each agent selects actions based on its own policy and receives rewards
based on the joint actions taken by all agents.
State and Action Spaces: Similar to a Markov decision process, a Markov game consists of a state
space and an action space. The state space represents the possible configurations of the
environment, and the action space represents the available actions for each agent.
Transitions and Rewards: The environment in a Markov game transitions from one state to
another based on the joint actions selected by the agents. The transition probabilities and
rewards depend on the joint action taken and the current state.
Joint Policy: In a Markov game, each agent has its own policy that maps its observations to
actions. However, the agents need to coordinate their actions to achieve optimal outcomes. This
coordination can be achieved through a joint policy, which specifies how each agent's policy is
combined to determine the joint action.
Nash Equilibrium: In Markov games, the notion of Nash equilibrium from game theory is often
used to analyze the optimal joint policy. A Nash equilibrium is a set of joint policies where no
agent can unilaterally improve its own reward by changing its policy while all other agents keep
their policies fixed.
Learning in Markov Games: Reinforcement learning algorithms can be extended to learn in
Markov games. This involves agents updating their policies based on their own observations and
rewards, as well as the actions and rewards of other agents. Techniques like multi-agent Q-
learning, policy gradient methods, and actor-critic algorithms can be applied in the context of
Markov games.
Complexity: Markov games can introduce additional challenges due to the increased complexity
of interactions among agents. The state and action spaces grow exponentially with the number
of agents, making it more difficult to learn optimal policies.
Applications: Markov games find applications in various domains where multiple agents interact,
such as multi-robot systems, autonomous driving, and strategic games.
By modeling environments as Markov games, reinforcement learning algorithms can handle
scenarios with multiple interacting agents and learn effective policies that take into account the
interdependencies among agents' actions and rewards.
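To ground these ideas, here is a minimal sketch of independent Q-learning for two agents in the simplest possible Markov game: a single-state repeated matrix game. The payoff table and learning parameters are invented for illustration; a full Markov game would also include state transitions driven by the joint action.

```python
import random

# Two agents, two actions each; the reward of each agent depends on the JOINT action.
# The payoff table below is a made-up prisoner's-dilemma-style example.
ACTIONS = [0, 1]
PAYOFFS = {          # (a1, a2) -> (reward for agent 1, reward for agent 2)
    (0, 0): (3, 3),
    (0, 1): (0, 5),
    (1, 0): (5, 0),
    (1, 1): (1, 1),
}

ALPHA, EPSILON = 0.1, 0.2            # learning rate and exploration rate
Q1 = {a: 0.0 for a in ACTIONS}       # each agent keeps its own Q-table
Q2 = {a: 0.0 for a in ACTIONS}

def pick(q):
    """Epsilon-greedy action selection over a single-state Q-table."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[a])

for _ in range(5000):
    a1, a2 = pick(Q1), pick(Q2)
    r1, r2 = PAYOFFS[(a1, a2)]
    # Independent Q-learning: each agent treats the other agent as part of
    # the environment (there are no state transitions in this one-state game).
    Q1[a1] += ALPHA * (r1 - Q1[a1])
    Q2[a2] += ALPHA * (r2 - Q2[a2])

print(Q1, Q2)   # both agents settle near joint action (1, 1), the Nash equilibrium here
```

Even in this tiny example the coupling between agents is visible: each agent's learned values depend on how the other agent happens to be playing, which is what makes learning in Markov games harder than in a single-agent MDP.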