
Reinforcement Learning
Introduction
Example:

• Any problem where decision-making is sequential and the goal is long-term, such as:
• 8-puzzle
• Chess
• Tic-tac-toe
• Any state-action problem
Elements of RL
Contd.
Agent-environment interface
Types of Reinforcement Learning

• There are mainly two types of reinforcement learning:
• Positive Reinforcement
• Negative Reinforcement
Positive Reinforcement:

• Positive reinforcement means adding a stimulus to increase the likelihood that the expected behavior occurs again. It has a positive impact on the agent's behavior and increases the strength of that behavior.
• This type of reinforcement can sustain changes for a long time, but too much positive reinforcement may lead to an overload of states, which can diminish the results.
Negative Reinforcement:

• Negative reinforcement is the opposite of positive reinforcement: it increases the likelihood that a specific behavior occurs again by removing or avoiding a negative condition.
• Depending on the situation and behavior, it can be more effective than positive reinforcement, but it tends to reinforce only the minimum required behavior.
How Does Reinforcement Learning Work?
• To understand how RL works, we need to consider two main things:
• Environment: It can be anything, such as a room, a maze, a football ground, etc.
• Agent: An intelligent agent, such as an AI robot.
• Let's take the example of a maze environment that the agent needs to explore. Consider the image below:
Contd.

• In the image above, the agent starts at the very first block of the maze. The maze contains an S6 block, which is a wall, an S8 block, which is a fire pit, and an S4 block, which holds a diamond.
• The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward. It can take four actions: move up, move down, move left, and move right.
• The agent can take any path to reach the final point, but it needs to do so in as few steps as possible. Suppose the agent follows the path S9-S5-S1-S2-S3; it will then receive the +1 reward.
• The agent will try to remember the preceding steps it took to reach the final block. To memorize the steps, it assigns a value of 1 to each previous step. Consider the following step:
• Now the agent has stored the previous steps by assigning the value 1 to each previous block. But what will the agent do if it starts from a block that has a block of value 1 on both sides? Consider the diagram below:
Contd.

• It is difficult for the agent to decide whether to go up or down, because each block has the same value. So the above approach is not suitable for reaching the destination. To solve this problem, we will use the Bellman equation, which is the main concept behind reinforcement learning.
The Bellman Equation
• The Bellman equation was introduced by the mathematician Richard Ernest Bellman in 1953, and hence it is called the Bellman equation. It is associated with dynamic programming and is used to calculate the value of a decision problem at a certain point by including the values of previous states.
• It is a way of calculating value functions in dynamic programming, and it leads to modern reinforcement learning.
• The key elements used in the Bellman equation are:
• The action performed by the agent, "a"
• The state reached by performing the action, "s"
• The reward/feedback obtained for each good and bad action, "R"
• The discount factor, gamma "γ"
• The Bellman equation can be written as:
• V(s) = max[R(s, a) + γV(s')], where s' is the next state and the maximum is taken over the available actions a.
• We take the maximum over all actions because the agent always tries to find the optimal solution.
• Now, using the Bellman equation, we will find the value of each state of the given environment. We start from the block that is next to the target block.
• For the 1st block:
• V(s3) = max[R(s, a) + γV(s')], where V(s') = 0 because there is no further state to move to.
• V(s3) = max[R(s, a)] => V(s3) = max[1] => V(s3) = 1.
• For the 2nd block:
• V(s2) = max[R(s, a) + γV(s')], where γ = 0.9 (say), V(s') = 1, and R(s, a) = 0, because there is no reward at this state.
• V(s2) = max[0.9(1)] => V(s2) = max[0.9] => V(s2) = 0.9
• For the 3rd block:
• V(s1) = max[R(s, a) + γV(s')], where γ = 0.9, V(s') = 0.9, and R(s, a) = 0, because there is no reward at this state either.
• V(s1) = max[0.9(0.9)] => V(s1) = max[0.81] => V(s1) = 0.81
• For the 4th block:
• V(s5) = max[R(s, a) + γV(s')], where γ = 0.9, V(s') = 0.81, and R(s, a) = 0, because there is no reward at this state either.
• V(s5) = max[0.9(0.81)] => V(s5) = max[0.729] => V(s5) ≈ 0.73
• For the 5th block:
• V(s9) = max[R(s, a) + γV(s')], where γ = 0.9, V(s') = 0.73, and R(s, a) = 0, because there is no reward at this state either.
• V(s9) = max[0.9(0.73)] => V(s9) = max[0.657] => V(s9) ≈ 0.66
• Now we move on to the 6th block, where the agent may change its route because it always tries to find the optimal path. So let's now consider the block next to the fire pit.
• The agent has three options to move: if it moves towards the blue box, it will bump into the wall, and if it moves into the fire pit, it will get the -1 reward. But since we are considering only positive rewards, it will move upwards only. The values of the remaining blocks are calculated using the same formula. Consider the image below; a small code sketch of the same calculation follows.
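As a concrete check of the calculation above, here is a minimal Python sketch that backs the values up along the chosen path. The rewards and γ = 0.9 come from the example; the dictionary representation and variable names are my own.

# Back up state values along the path S3 -> S2 -> S1 -> S5 -> S9 using
# V(s) = max[R(s, a) + gamma * V(s')] with gamma = 0.9, as in the slides.
gamma = 0.9

# R(s, a) for the best action from each state: moving from S3 into the
# diamond block S4 pays +1; every other move pays 0.
reward = {"s3": 1.0, "s2": 0.0, "s1": 0.0, "s5": 0.0, "s9": 0.0}

V = {}
V["s3"] = reward["s3"]                      # 1.0 (no further state to move to)
V["s2"] = reward["s2"] + gamma * V["s3"]    # 0.9
V["s1"] = reward["s1"] + gamma * V["s2"]    # 0.81
V["s5"] = reward["s5"] + gamma * V["s1"]    # 0.729 (~0.73)
V["s9"] = reward["s9"] + gamma * V["s5"]    # 0.6561 (~0.66)

for s in ["s3", "s2", "s1", "s5", "s9"]:
    print(s, round(V[s], 2))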
Q-learning
• Q-learning is an off-policy RL algorithm used for temporal-difference learning. Temporal-difference learning methods compare temporally successive predictions.
• It learns the value function Q(s, a), which tells how good it is to take action "a" in a particular state "s".
• Q-learning is a popular model-free reinforcement learning algorithm based on the Bellman equation.
• The main objective of Q-learning is to learn a policy that tells the agent which actions to take, and under which circumstances, in order to maximize the reward.
• The Q in Q-learning stands for quality: it specifies the quality of an action taken by the agent.
• The goal of the agent in Q-learning is to maximize the value of Q.
• The value of Q can be derived from the Bellman equation.
Algorithm
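The algorithm figure from the slides is not reproduced here. As a minimal sketch, tabular Q-learning can be written as below; the env.reset()/env.step() interface, the epsilon-greedy exploration, and the hyperparameter values are assumptions of this sketch, not part of the original slides.

import random

def q_learning(env, n_states, n_actions,
               episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning. `env` is assumed to expose reset() -> state
    and step(action) -> (next_state, reward, done)."""
    # Initialize the Q-table with zeros: one row per state, one column per action.
    Q = [[0.0] * n_actions for _ in range(n_states)]

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection: explore with probability epsilon.
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])

            s_next, r, done = env.step(a)

            # Temporal-difference update derived from the Bellman equation.
            best_next = max(Q[s_next])
            Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
            s = s_next
    return Q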
Example

• The value of Q can be derived from the Bellman equation. Consider the Bellman equation given below:
• V(s) = max[R(s, a) + γ Σ P(s, a, s') V(s')], where the sum runs over the possible next states s'.
• In this equation we have various components: the reward R, the discount factor (γ), the transition probabilities P, and the next states s'.
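To make the expectation term concrete, here is a minimal Python sketch of a one-step backup for a single action; the transition probabilities and next-state values are invented for illustration and are not from the slides.

# One-step Bellman backup for a single action with stochastic outcomes.
# The probabilities and next-state values below are illustrative only.
gamma = 0.9
reward = 0.0

# (P(s' | s, a), V(s')) for each possible next state s'.
transitions = [(0.8, 1.0), (0.1, 0.0), (0.1, -1.0)]

expected_next_value = sum(p * v for p, v in transitions)   # 0.8*1.0 + 0.1*0.0 + 0.1*(-1.0) = 0.7
action_value = reward + gamma * expected_next_value        # 0.63
print(action_value)   # the state value V(s) is the max of this quantity over all actions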
Contd.
• No Q-values are given yet, so first consider the image below:
• In the image above, we can see an agent that has three value options: V(s1), V(s2), and V(s3). Because this is an MDP (Markov Decision Process), the agent only cares about the current state and the future state. The agent can go in any direction (up, left, or right), so it needs to decide where to go in order to follow the optimal path. Here the agent moves on a probability basis and changes its state accordingly. But if we want specific moves, we need to work in terms of Q-values.
Contd.
• Q represents the quality of the actions at each state. So instead of using a single value per state, we use a state-action pair, i.e., Q(s, a). The Q-value specifies which action is more lucrative than the others, and the agent takes its next move according to the best Q-value. The Bellman equation can be used to derive the Q-value.
• When performing an action, the agent receives a reward R(s, a) and ends up in a certain state, so the Q-value equation is:
• Q(s, a) = R(s, a) + γ Σ P(s, a, s') max Q(s', a'), where the sum runs over the possible next states s' and the max is over the next actions a'.
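As a quick numeric illustration (the numbers are made up, not from the slides): with R(s, a) = 0, γ = 0.9, a deterministic transition to s', and max Q(s', a') = 0.81, the equation gives Q(s, a) = 0 + 0.9 × 0.81 = 0.729, so the agent would prefer this action over any alternative with a lower Q-value.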
Q-table

• A Q-table (or matrix) is created while performing Q-learning. The table is indexed by state-action pairs [s, a], and its values are initialized to zero. After each action, the table is updated and the Q-values are stored in it.
• The RL agent uses this Q-table as a reference to select the best action based on the Q-values.
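A minimal sketch of such a table, assuming nine maze blocks S1-S9 and the four moves from the example; the NumPy representation, the indices, and the learning rate are choices made for this sketch and are not given in the slides.

import numpy as np

# Nine states (S1..S9) and four actions are assumed for illustration.
n_states, n_actions = 9, 4
actions = ["up", "down", "left", "right"]

# Q-table initialized to zero: one row per state, one column per action.
Q = np.zeros((n_states, n_actions))

# After each step the corresponding entry is updated with the TD rule,
# e.g. moving "up" from S9 (index 8) to S5 (index 4) with no reward:
alpha, gamma = 0.1, 0.9
s, a, r, s_next = 8, 0, 0.0, 4
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# To act, the agent looks up the row for its current state and takes
# the action with the highest Q-value.
best_action = actions[int(Q[s].argmax())]
print(best_action)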
