
Unit – 8 Reinforcement Learning

Reinforcement Learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and observing the results of those actions. For each good action, the agent gets positive feedback, and for each bad action, the agent gets negative feedback or a penalty.
In Reinforcement Learning, the agent learns automatically from feedback, without any labeled data, unlike supervised learning.
Since there is no labeled data, the agent is bound to learn from its own experience.
RL solves a specific type of problem where decision making is sequential and the goal is long-term, such as game playing, robotics, etc.
• The agent interacts with the environment and explores it by itself. The primary goal of the agent in reinforcement learning is to improve its performance. The agent keeps doing three things (take an action, change state or remain in the same state, and get feedback), and through these actions it learns and explores the environment while trying to collect the maximum positive reward.
▪ The agent learns by trial and error, and based on this experience, it learns to perform the task in a better way.

▪ Definition: Reinforcement learning is a type of AI method in which an intelligent agent (a computer program) interacts with an environment and learns how to act within it.
▪ Suppose an AI agent is placed inside a maze environment and its goal is to find the diamond. The agent interacts with the environment by performing actions; based on those actions, the state of the agent changes, and it also receives a reward or penalty as feedback.
▪ The agent keeps doing these three things (take an action, change state or remain in the same state, and get feedback), and by doing so it learns and explores the environment.
▪ The agent learns which actions lead to positive feedback (rewards) and which actions lead to negative feedback (penalties). As a reward the agent receives a positive point, and as a penalty it receives a negative point.
• Agent(): An entity that can perceive/explore the environment and act upon it.
• Environment(): The situation in which the agent is present or by which it is surrounded. In RL, we assume a stochastic environment, which means it is random in nature.
• Action(): Actions are the moves taken by the agent within the environment.
• State(): The situation returned by the environment after each action taken by the agent.
• Reward(): Feedback returned to the agent by the environment to evaluate the agent's action.
• Policy(): The strategy the agent applies to decide the next action based on the current state.
• Value(): The expected long-term return with the discount factor, as opposed to the short-term reward.
• Q-value(): Mostly the same as the value, but it takes the current action (a) as an additional parameter.
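These terms fit together in a single interaction loop. The following is a minimal Python sketch under assumed names (the toy one-dimensional environment, its step() function, and the reward values are made up for illustration):

import random

# Toy Environment: states 0..4 on a line, the goal state is 4 (an assumption for illustration).
def step(state, action):
    next_state = max(0, min(4, state + action))      # the Action moves the agent left or right
    reward = 1.0 if next_state == 4 else -0.1        # Reward: positive at the goal, small penalty otherwise
    return next_state, reward

state = 0                                            # initial State
for t in range(20):
    action = random.choice([-1, +1])                 # Agent chooses an Action (here, a random Policy)
    next_state, reward = step(state, action)         # Environment returns the next State and a Reward
    state = next_state
    if state == 4:                                   # stop once the goal State is reached
        break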
• In RL, the agent is not instructed about the environment or about which actions need to be taken.
• It is based on trial and error.
• The agent takes the next action and changes state according to the feedback from the previous action.
• The agent may get a delayed reward.
• The environment is stochastic, and the agent needs to explore it in order to obtain the maximum positive reward.
1. Value-based:
The value-based approach tries to find the optimal value function, which gives the maximum value achievable at a state under any policy. The agent then expects the long-term return at any state s under policy π.
2. Policy-based:
The policy-based approach tries to find the optimal policy for maximum future reward without using a value function. In this approach, the agent tries to apply a policy such that the action performed at each step helps to maximize the future reward.
The policy-based approach has two main types of policy:
1. Deterministic: the policy (π) produces the same action at any given state.
2. Stochastic: the produced action is determined by a probability distribution.
3. Model-based: In the model-based approach, a virtual model of the environment is created, and the agent explores that environment to learn it. There is no single solution or algorithm for this approach because the model representation differs for each environment.
▪ There are four main elements of Reinforcement Learning, which are
given below:
1. Policy
2. Reward Signal
3. Value Function
4. Model of the environment
▪ 1) Policy: A policy can be defined as the way an agent behaves at a given time. It maps the perceived states of the environment to the actions to be taken in those states. The policy is the core element of RL, as it alone can define the behavior of the agent. In some cases it may be a simple function or a lookup table, whereas in other cases it may involve general computation such as a search process. A policy can be deterministic or stochastic:

▪ For a deterministic policy: a = π(s)
▪ For a stochastic policy: π(a | s) = P[A_t = a | S_t = s]
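As a small Python sketch (the state values, action names, and probabilities below are made up for illustration), the two kinds of policy might look like:

import random

# Deterministic policy: always returns the same action for a given state, a = pi(s).
def deterministic_policy(state):
    return "right" if state < 3 else "up"

# Stochastic policy: returns an action drawn from a probability distribution pi(a | s).
def stochastic_policy(state):
    actions = ["up", "down", "left", "right"]
    probs = [0.1, 0.1, 0.2, 0.6]                     # assumed distribution for this state
    return random.choices(actions, weights=probs, k=1)[0]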
▪ 2) Reward Signal: The goal of reinforcement learning is defined by the reward signal. At each step, the environment sends an immediate signal to the learning agent; this signal is known as the reward signal. Rewards are given according to the good and bad actions taken by the agent. The agent's main objective is to maximize the total reward it receives for good actions. The reward signal can change the policy: for example, if an action selected by the agent leads to a low reward, the policy may change so that other actions are selected in the future.
▪ 3) Value Function: The value function gives information about how good a situation or action is and how much reward the agent can expect. The reward is the immediate signal for each good or bad action, whereas the value function specifies how good a state or action is in the long run. The value function depends on the reward because, without reward, there could be no value. The goal of estimating values is to obtain more reward.
▪ 4) Model: The last element of reinforcement learning is the model, which mimics the behavior of the environment. With the help of the model, one can make inferences about how the environment will behave. For example, given a state and an action, the model can predict the next state and reward.
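For a small discrete environment, such a model can be sketched in Python as a simple lookup from (state, action) to a predicted (next state, reward) pair (the state names and reward values below are made up for illustration):

# Hypothetical model: maps (state, action) -> (predicted next state, predicted reward)
model = {
    ("s0", "right"): ("s1", -1.0),
    ("s1", "right"): ("s2", +100.0),   # e.g. the action that reaches the goal
}

def predict(state, action):
    return model[(state, action)]

next_state, reward = predict("s0", "right")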
▪ Let us consider a problem where an agent can be in various states and can choose an action from a set of actions. Such problems are called sequential decision problems.

▪ A Markov Decision Process (MDP) is the mathematical framework that captures such a fully observable, non-deterministic environment with a Markovian transition model and additive rewards, in which the agent acts.

▪ The solution to an MDP is an optimal policy, which refers to the choice of action for every state that maximizes the overall cumulative reward. Thus, the transition model that represents the agent's environment (when the environment is known) and the optimal policy that decides what action the agent needs to perform in each state are the required elements for training the agent to learn a specific behavior.
▪ In almost all cases, however, we do not know anything about the environment model, i.e. the transition function T(s, a, s′).
▪ There are mainly two types of reinforcement learning, which are:

• model-based learning
• model-free learning
▪ In model-based learning, an agent generates an approximation of the transition function, T̂(s, a, s′), by keeping counts of the number of times it arrives in each state s′ after entering each q-state (s, a). The agent can then generate the approximate transition function T̂ upon request by normalizing the counts it has collected, i.e. dividing the count for each observed tuple (s, a, s′) by the sum of the counts for all instances where the agent was in q-state (s, a). Normalizing the counts scales them so that they sum to one, allowing them to be interpreted as probabilities.
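A minimal Python sketch of this counting-and-normalizing step (the record/T_hat helper names are made up for illustration), assuming transitions are observed one (s, a, s′) tuple at a time:

from collections import defaultdict

counts = defaultdict(int)          # counts[(s, a, s_next)] = times s_next followed q-state (s, a)

def record(s, a, s_next):
    counts[(s, a, s_next)] += 1    # keep a count for every observed transition

def T_hat(s, a, s_next):
    # Normalize: divide the count for (s, a, s_next) by the total count for q-state (s, a).
    total = sum(c for (s2, a2, _), c in counts.items() if s2 == s and a2 == a)
    return counts[(s, a, s_next)] / total if total else 0.0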
▪ There are several model-free learning algorithms, and we’ll
cover three of them: direct evaluation, temporal difference
learning, and Q-learning. Direct evaluation and temporal
difference learning fall under a class of algorithms known as
passive reinforcement learning.
▪ In passive reinforcement learning, an agent is given a policy to
follow and learns the value of states under that policy as it
experiences episodes, which is exactly what is done by policy
evaluation for MDPs when T and R are known.

▪ Q-learning falls under a second class of model-free learning


algorithms known as active reinforcement learning, during
which the learning agent can use the feedback it receives to
iteratively update its policy while learning until eventually
determining the optimal policy after sufficient exploration.
1) Passive RL
2) Active RL

▪ In the case of passive RL, the agent's policy is fixed, which means that it is told what to do. In other words, the agent simply watches the world going by and tries to learn the utilities of being in various states.
▪ In active RL, the agent needs to decide what to do, as there is no fixed policy that it can act on.
▪ Therefore, the goal of a passive RL agent is to execute a fixed policy (sequence of actions) and evaluate it, while that of an active RL agent is to act and learn an optimal policy.
▪ The agent sees the sequence of state transitions and the associated rewards.
▪ The environment generates state transitions and the agent perceives them.

▪ The utility values need to be updated using the given training sequences.
All direct evaluation does is fix some policy π and have the learning agent experience several episodes while following π. As the agent collects samples through these episodes, it maintains counts of the total utility obtained from each state and the number of times it visited each state. At any point, we can compute the estimated value of any state s by dividing the total utility obtained from s by the number of times s was visited.
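A minimal Python sketch of direct evaluation (the function names, discount factor, and the representation of an episode as a list of (state, reward) pairs are assumptions for illustration):

from collections import defaultdict

gamma = 0.9
total_utility = defaultdict(float)    # sum of the returns observed from each state
visits = defaultdict(int)             # number of times each state was visited

def process_episode(episode):
    # episode: list of (state, reward) pairs collected while following the fixed policy pi
    G = 0.0
    returns = []
    for state, reward in reversed(episode):
        G = reward + gamma * G        # discounted return obtained from this state onward
        returns.append((state, G))
    for state, G in returns:
        total_utility[state] += G
        visits[state] += 1

def V_estimate(state):
    # estimated value = total utility obtained from the state / number of visits
    return total_utility[state] / visits[state] if visits[state] else 0.0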
▪ The actual utility of a state is constrained to be the probability-weighted average of its successors' utilities.
▪ Direct evaluation converges very slowly to the correct utility values (it requires a lot of training sequences).
▪ Example: the number of epochs may exceed 1000!
TD learning does not require the agent to learn the transition model. The update occurs between successive states, and the agent only updates the states that are directly affected.
Temporal Difference Learning
• Temporal difference learning (TD learning) uses the idea of
learning from every experience, rather than simply keeping
track of total rewards and number of times states are visited
and learning at the end as direct evaluation does.
• In policy evaluation, we used the system of equations
generated by our fixed policy and the Bellman equation to
determine the values of states under that policy (or used
iterative updates like with value iteration).
• V^π(s) = ∑_s′ T(s, π(s), s′) [ R(s, π(s), s′) + γ V^π(s′) ]
• Each of these equations equates the value of one state to the
weighted average over the discounted values of that state’s
successors plus the rewards reaped in transitioning to them. TD
learning tries to answer the question of how to compute this
weighted average without the weights, cleverly doing so with
an exponential moving average.
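A minimal Python sketch of this exponential moving average, i.e. the TD update applied after each observed transition (the learning rate alpha and discount gamma values are assumptions for illustration):

from collections import defaultdict

alpha, gamma = 0.1, 0.9
V = defaultdict(float)                           # value estimates under the fixed policy pi

def td_update(s, reward, s_next):
    sample = reward + gamma * V[s_next]          # one sample of the Bellman right-hand side
    V[s] = (1 - alpha) * V[s] + alpha * sample   # exponential moving average toward the sample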
▪ Adaptive Dynamic Programming (ADP) is a smarter method than direct utility estimation, as it runs trials to learn the model of the environment, estimating the utility of a state as the sum of the reward for being in that state and the expected discounted reward of being in the next state.

▪ It is fast but can become quite costly to compute for large state spaces. ADP is a model-based approach and requires the transition model of the environment.
▪ Q-Learning is a reinforcement learning algorithm that finds the best next action, given the current state. While learning, it may choose actions at random (to explore), and its aim is to maximize the reward.
▪ Q-learning is a model-free, off-policy reinforcement learning algorithm that finds the best course of action given the current state of the agent. Depending on where the agent is in the environment, it decides the next action to be taken.
▪ The objective is to find the best course of action given the current state. To do this, the agent may come up with rules of its own, or it may operate outside the policy it was given to follow. Because it does not have to follow the policy it is acting with, we call it off-policy.
▪ Model-free means that the agent does not build or use a model of the environment's dynamics (the transition function); instead, it learns directly from the rewards it receives through trial and error.
▪ Both direct evaluation and TD learning will eventually learn the
true value of all states under the policy they follow.
▪ However, they both have a major inherent issue - we want to
find an optimal policy for our agent, which requires knowledge
of the q-values of states.
▪ To compute q-values from the values we have, we require a
transition function and reward function as dictated by the
Bellman equation.
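For reference, the relation dictated by the Bellman equation (written in the same notation as the policy-evaluation equation above) is:

Q^π(s, a) = ∑_s′ T(s, a, s′) [ R(s, a, s′) + γ V^π(s′) ]

so without T and R the q-values cannot be computed from the state values alone.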
▪ Example: an advertisement recommendation system.
▪ In a normal ad recommendation system, the ads you get are based on your previous purchases or the websites you may have visited. If you have bought a TV, you will get recommended TVs of different brands.

▪ Using Q-learning, we can optimize the ad recommendation system to recommend products that are frequently bought together. The reward is given when the user clicks on the suggested product.
▪ Let’s say that a robot has to cross a maze and reach the end
point. There are mines, and the robot can only move one tile at
a time. If the robot steps onto a mine, the robot is dead. The
robot has to reach the end point in the shortest time possible.
▪ The scoring/reward system is as below:
1. The robot loses 1 point at each step. This is done so that the robot
takes the shortest path and reaches the goal as fast as possible.
2. If the robot steps on a mine, the point loss is 100 and the game ends.
3. If the robot gets power ⚡️, it gains 1 point.
4. If the robot reaches the end goal, the robot gets 100 points.

How do we train a robot to reach the end goal with the shortest
path without stepping on a mine?
▪ Q-Table is just a fancy name for a simple lookup table in which we calculate the maximum expected future reward for each action at each state. Basically, this table will guide us to the best action at each state.

▪ There are four possible actions at each non-edge tile: when the robot is at a state, it can move up, down, right, or left.
▪ In the Q-Table, the columns are the actions and the rows are the states.

▪ Each Q-Table score is the maximum expected future reward that the robot will get if it takes that action at that state. This is an iterative process, as we need to improve the Q-Table at each iteration.
➢ The questions are:
• How do we calculate the values of the Q-table?
• Are the values available or predefined?
➢ Answer:
▪ Q-function
▪ The Q-function uses the Bellman equation and takes two inputs:
state (s) and action (a).
▪ Using the above function, we get the values of Q for the cells in
the table.
▪ When we start, all the values in the Q-table are zeros.
▪ There is an iterative process of updating the values. As we start to
explore the environment, the Q-function gives us better and
better approximations by continuously updating the Q-values in
the table.
▪ Now, let’s understand how the updating takes place.
▪ There are n columns, where n= number of actions. There are m
rows, where m= number of states. We will initialize the values at
0.
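A minimal Python sketch of this initialization, assuming the NumPy library and the four actions and five states of the example figure below:

import numpy as np

n_states, n_actions = 5, 4            # m = 5 states (rows), n = 4 actions (columns)
Q = np.zeros((n_states, n_actions))   # the Q-Table: every value starts at 0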

(Figure: an example Q-Table with four actions (a = 4) and five states (s = 5), all values initialized to 0.)


▪ Choose an action (a) in the state (s) based on the Q-Table. But, as
mentioned earlier, when the episode initially starts, every Q-value
is 0.
▪ So now the concept of exploration and exploitation trade-off
comes into play.
▪ In the beginning, the epsilon rates will be higher. The robot will
explore the environment and randomly choose actions. The logic
behind this is that the robot does not know anything about the
environment.
▪ As the robot explores the environment, the epsilon rate decreases
and the robot starts to exploit the environment.
▪ During the process of exploration, the robot progressively
becomes more confident in estimating the Q-values.
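A minimal Python sketch of this epsilon-greedy trade-off with a decaying epsilon (the decay schedule and the choose_action helper are assumptions for illustration; Q is the NumPy Q-Table from the sketch above):

import random
import numpy as np

epsilon, epsilon_min, decay = 1.0, 0.05, 0.995

def choose_action(Q, state):
    global epsilon
    if random.random() < epsilon:
        action = random.randrange(Q.shape[1])      # explore: pick a random action
    else:
        action = int(np.argmax(Q[state]))          # exploit: pick the best known action
    epsilon = max(epsilon_min, epsilon * decay)    # epsilon decreases as training progresses
    return action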
▪ For the robot example, there are four actions to choose from: up,
down, left, and right. We are starting the training now — our robot
knows nothing about the environment. So the robot chooses a
random action, say right.

We can now update the Q-values for being at the start and moving
right using the Bellman equation.
▪ Now we have taken an action and observed an outcome and reward.
We need to update the function Q(s,a).
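The standard Q-learning form of this update, with learning rate α and discount factor γ, is:

Q(s, a) ← Q(s, a) + α [ r + γ max_a′ Q(s′, a′) - Q(s, a) ]

As a one-line Python sketch over the NumPy Q-Table used above (alpha, gamma, reward, s, a, and s_next are assumed to be already defined):

Q[s, a] += alpha * (reward + gamma * np.max(Q[s_next]) - Q[s, a])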

➢ In the case of the robot game, to reiterate, the scoring/reward structure is:
• power = +1
• mine = -100
• end = +100
Repeat this again and again until the learning stops. In this way the Q-Table will be updated.
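Putting the pieces together, a minimal sketch of the whole Q-learning loop in Python (assuming a hypothetical env object with Gym-style reset() and step() methods that return the state, reward, and a done flag; all parameter values are illustrative):

import random
import numpy as np

def q_learning(env, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=0.9, epsilon=1.0, epsilon_min=0.05, decay=0.995):
    Q = np.zeros((n_states, n_actions))                  # Q-Table initialized to zeros
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy choice between exploration and exploitation
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # Q-learning (Bellman) update toward the observed sample
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
            state = next_state
        epsilon = max(epsilon_min, epsilon * decay)      # explore less as learning progresses
    return Q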
▪ Exploration is all about finding more information about the environment.
▪ The more we explore, the better we understand the world (e.g. T and R).
▪ Exploration is beneficial to the agent in the long term.
▪ Exploitation is using already known information to maximize the reward.
▪ Based on what we know about the world, we take actions with the aim of getting the highest reward.
▪ Exploitation helps the agent maximize its short- and medium-term reward.
▪ Say you go to the same restaurant every day. You are basically exploiting. But, on the other hand, if you search for a new restaurant every time before going to any one of them, then it is exploration. Exploration is very important in the search for future rewards, which might be higher than the nearby rewards.
▪ In the given picture, the robotic mouse can collect a good amount of small cheese (+0.5 each). But at the top of the maze there is a big pile of cheese (+100). So, if we focus only on the nearest reward, our robotic mouse will never reach the big pile of cheese: it will just exploit.
▪ But if the robotic mouse does a little bit of exploration, it can find the big reward, i.e. the big cheese.
