
Swami Keshvanand Institute of Technology, Management & Gramothan,
Ramnagaria, Jagatpura, Jaipur-302017, INDIA
Approved by AICTE, Ministry of HRD, Government of India; Recognized by UGC under Section 2(f) of the UGC Act, 1956
Tel.: +91-0141-5160400  Fax: +91-0141-2759555  E-mail: info@skit.ac.in  Web: www.skit.ac.in

Machine Learning- 6CS4-02

Unit-4

ML | Semi-Supervised Learning

Today’s Machine Learning algorithms can be broadly classified into three categories: Supervised Learning, Unsupervised Learning and Reinforcement Learning. Setting Reinforcement Learning aside, the primary two categories of Machine Learning problems are Supervised and Unsupervised Learning. The basic difference between the two is that Supervised Learning datasets have an output label associated with each tuple, while Unsupervised Learning datasets do not.

The most basic disadvantage of any Supervised Learning algorithm is that the dataset has to be hand-labeled, either by a Machine Learning Engineer or a Data Scientist. This is a very costly process, especially when dealing with large volumes of data. The most basic disadvantage of Unsupervised Learning is that its application spectrum is limited.

To counter these disadvantages, the concept of Semi-Supervised Learning was introduced. In this type of learning, the algorithm is trained on a combination of labeled and unlabeled data. Typically, this combination contains a very small amount of labeled data and a very large amount of unlabeled data. The basic procedure is that the programmer first clusters similar data using an unsupervised learning algorithm and then uses the existing labeled data to label the rest of the unlabeled data. The typical use cases of this type of algorithm share a common property: acquiring unlabeled data is relatively cheap, while labeling that data is very expensive.
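
As an illustration of this idea, here is a minimal sketch using scikit-learn's LabelPropagation, which spreads a handful of known labels to nearby unlabeled points (marked with the label -1). The toy dataset and parameter values are assumptions for demonstration only.

# Minimal semi-supervised sketch: few labels, many unlabeled points.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelPropagation

X, y_true = make_moons(n_samples=300, noise=0.1, random_state=0)

# Pretend labeling is expensive: keep only 10 labels, mark the rest as -1 (unlabeled).
y_train = np.full_like(y_true, -1)
labeled_idx = np.random.RandomState(0).choice(len(y_true), size=10, replace=False)
y_train[labeled_idx] = y_true[labeled_idx]

model = LabelPropagation(kernel="rbf", gamma=20)
model.fit(X, y_train)

# Labels have been propagated from the few labeled points to their neighbours,
# relying on the continuity and cluster assumptions described below.
accuracy = (model.transduction_ == y_true).mean()
print(f"Transductive accuracy with 10 labels: {accuracy:.2f}")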

Intuitively, one may picture the three types of learning as follows: Supervised learning is a student under the supervision of a teacher at both home and school; Unsupervised learning is a student who has to figure out a concept on their own; and Semi-Supervised learning is a teacher who teaches a few concepts in class and then gives homework questions based on similar concepts.

A Semi-Supervised algorithm assumes the following about the data –

1. Continuity Assumption: The algorithm assumes that the points which are closer to each
other are more likely to have the same output label.

2. Cluster Assumption: The data can be divided into discrete clusters and points in the same
cluster are more likely to share an output label.


3. Manifold Assumption: The data lie approximately on a manifold of much lower dimension
than the input space. This assumption allows the use of distances and densities which are
defined on a manifold.

Practical applications of Semi-Supervised Learning –

1. Speech Analysis: Since labeling of audio files is a very intensive task, Semi-Supervised
learning is a very natural approach to solve this problem.

2. Internet Content Classification: Labeling each webpage is an impractical and infeasible process, so Semi-Supervised learning algorithms are used instead. Even the Google search algorithm uses a variant of Semi-Supervised learning to rank the relevance of a webpage for a given query.

3. Protein Sequence Classification: Since DNA strands are typically very large, Semi-Supervised learning has become increasingly important in this field.

Reinforcement learning

Reinforcement learning is an area of Machine Learning. It is about taking suitable actions to maximize reward in a particular situation. It is employed by various software and machines to find the best possible behavior or path to take in a specific situation. Reinforcement learning differs from supervised learning in that supervised training data comes with the answer key, so the model is trained with the correct answer itself, whereas in reinforcement learning there is no answer: the reinforcement agent decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its own experience.

Example: We have an agent and a reward, with many hurdles in between. The agent is supposed to find the best possible path to reach the reward. The following scenario illustrates the problem.


Picture a grid containing a robot, a diamond and fire. The goal of the robot is to reach the reward, the diamond, while avoiding the hurdles, the fire. The robot learns by trying all the possible paths and then choosing the path that reaches the reward with the fewest hurdles. Each right step gives the robot a reward, and each wrong step subtracts from the robot's reward. The total reward is calculated when it reaches the final reward, the diamond.

Main points in Reinforcement learning –

• Input: The input should be an initial state from which the model will start.

• Output: There are many possible outputs, since there are a variety of solutions to a particular problem.

• Training: The training is based on the input; the model returns a state and the user decides whether to reward or punish the model based on its output (a small sketch of this loop is given after the list).

• The model continues to learn.

• The best solution is decided based on the maximum reward.
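
To make this loop concrete, here is a minimal hand-rolled sketch of the interaction on a tiny one-dimensional version of the robot/diamond/fire example; the grid size and reward numbers are illustrative assumptions, not values from the notes.

# Tiny agent-environment loop: states are cells 0..4, fire at 0, diamond at 4.
import random

GOAL, FIRE = 4, 0
TERMINAL_REWARDS = {GOAL: +10, FIRE: -10}

def step(state, action):
    """Apply an action (-1 = left, +1 = right); return next state, reward, done."""
    next_state = max(0, min(4, state + action))
    reward = TERMINAL_REWARDS.get(next_state, -1)   # -1 living cost per move
    done = next_state in TERMINAL_REWARDS
    return next_state, reward, done

# Random policy: the agent explores purely by trial and error.
state, total_reward, done = 2, 0, False
while not done:
    action = random.choice([-1, +1])
    state, reward, done = step(state, action)
    total_reward += reward

print("Episode ended in state", state, "with total reward", total_reward)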

Difference between Reinforcement learning and Supervised learning:

• Reinforcement learning is all about making decisions sequentially. In simple words, the output depends on the state of the current input, and the next input depends on the output of the previous input. In supervised learning, the decision is made on the initial input given at the start.

• In reinforcement learning decisions are dependent, so we give labels to sequences of dependent decisions. In supervised learning the decisions are independent of each other, so labels are given to each decision.

• Example: a chess game (reinforcement learning) versus object recognition (supervised learning).

Types of Reinforcement: There are two types of Reinforcement:

1. Positive –
Positive Reinforcement occurs when an event, occurring because of a particular behavior, increases the strength and frequency of that behavior. In other words, it has a positive effect on the behavior.

Advantages of positive reinforcement:

o Maximizes performance

o Sustains change for a long period of time

Disadvantages of positive reinforcement:

o Too much reinforcement can lead to an overload of states, which can diminish the results

2. Negative –
Negative Reinforcement is defined as the strengthening of a behavior because a negative condition is stopped or avoided.

Advantages of negative reinforcement:

o Increases behavior

o Provides defiance to a minimum standard of performance

Disadvantages of negative reinforcement:

o It only provides enough to meet the minimum behavior

Various Practical applications of Reinforcement Learning –


• RL can be used in robotics for industrial automation.

• RL can be used in machine learning and data processing.

• RL can be used to create training systems that provide custom instruction and materials according to the requirements of students.

RL can be used in large environments in the following situations:

1. A model of the environment is known, but an analytic solution is not available;

2. Only a simulation model of the environment is given (the subject of simulation-based optimization);

3. The only way to collect information about the environment is to interact with it.


Terms used in Reinforcement Learning

• Agent(): An entity that can perceive/explore the environment and act upon it.

• Environment(): The situation in which an agent is present or by which it is surrounded. In RL, we assume a stochastic environment, which means it is random in nature.

• Action(): Actions are the moves taken by an agent within the environment.

• State(): A state is the situation returned by the environment after each action taken by the agent.

• Reward(): Feedback returned to the agent from the environment to evaluate the agent's action.

• Policy(): A policy is the strategy applied by the agent to decide the next action based on the current state.

• Value(): The expected long-term return with discounting, as opposed to the short-term reward.

• Q-value(): Mostly similar to the value, but it takes one additional parameter, the current action (a).

Key Features of Reinforcement Learning

• In RL, the agent is not instructed about the environment or which actions need to be taken.

• It is based on a trial-and-error process.

• The agent takes the next action and changes state according to the feedback of the previous action.

• The agent may get a delayed reward.

• The environment is stochastic, and the agent needs to explore it to obtain the maximum positive reward.

Approaches to implement Reinforcement Learning

There are mainly three ways to implement reinforcement learning in ML:

1. Value-based:
The value-based approach tries to find the optimal value function, which gives the maximum value at a state under any policy. The agent can therefore expect the long-term return at any state s under policy π.

2. Policy-based:
The policy-based approach tries to find the optimal policy for the maximum future rewards without using the value function. In this approach, the agent tries to apply a policy such that the action performed at each step helps to maximize the future reward (a small sketch of both policy types follows this list).

The policy-based approach has mainly two types of policy:

o Deterministic: The same action is produced by the policy (π) at any state.

o Stochastic: In this policy, probability determines the produced action.

3. Model-based: In the model-based approach, a virtual model is created for the environment,
and the agent explores that environment to learn it. There is no particular solution or
algorithm for this approach because the model representation is different for each
environment.
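
As a tiny illustration of the two policy types mentioned under the policy-based approach, the following sketch uses made-up states and actions:

# Deterministic vs. stochastic policies (illustrative names and numbers).
import random

def deterministic_policy(state):
    # Always returns the same action for a given state.
    return "right" if state < 3 else "left"

def stochastic_policy(state):
    # Samples the action from a probability distribution over actions.
    return random.choices(["left", "right"], weights=[0.2, 0.8])[0]

print(deterministic_policy(1), stochastic_policy(1))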


Markov decision process (MDP)


A Markov Decision Process describes an environment for reinforcement learning. The environment is fully observable, and in MDPs the current state completely characterises the process.

What exactly is a Markov Decision Process?

To understand MDPs, we first have to understand the Markov property.

The Markov property

The Markov property states that,

“The future is independent of the past given the present.”

Once the current state is known, the history of information encountered so far may be thrown away; that state is a sufficient statistic that gives us the same characterization of the future as if we had all the history.

In mathematical terms, a state S_t has the Markov property if and only if

P[S_(t+1) | S_t] = P[S_(t+1) | S_1, …, S_t].

A Markov Process is a memoryless random process: a sequence of random states S_1, S_2, … with the Markov property.

A Markov decision process (MDP) is a discrete time stochastic control process. It provides a
mathematical framework for modeling decision making in situations where outcomes are partly
random and partly under the control of a decision maker. MDPs are useful for studying optimization
problems solved via dynamic programming and reinforcement learning. MDPs were known at least
as early as the 1950s; a core body of research on Markov decision processes resulted from Ronald
Howard's 1960 book, Dynamic Programming and Markov Processes. They are used in many
disciplines, including robotics, automatic control, economics and manufacturing. The name of
MDPs comes from the Russian mathematician Andrey Markov as they are an extension of the
Markov chains.

At each time step, the process is in some state s, and the decision maker may choose any action a that is available in state s. The process responds at the next time step by randomly moving into a new state s', giving the decision maker a corresponding reward Ra(s, s').


The probability that the process moves into its new state s’ is influenced by the chosen action.
Specifically, it is given by the state transition function Pa(s,s’) . Thus, the next state depends on the
current state and the decision maker's action a. But given s and a, it is conditionally independent of
all previous states and actions; in other words, the state transitions of an MDP satisfy the Markov
property.

Markov decision processes are an extension of Markov chains; the difference is the addition of actions (allowing choice) and rewards (giving motivation). Conversely, if only one action exists for each state (e.g. "wait") and all rewards are the same (e.g. "zero"), a Markov decision process reduces to a Markov chain.
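
For concreteness, here is a minimal sketch of how a small MDP can be encoded as transition and reward tables; the states, actions and numbers are made-up illustrations, not taken from the notes.

# P[(s, a)] -> list of (next_state, probability); R[(s, a, s')] -> reward Ra(s, s').
import random

P = {
    ("s0", "go"):   [("s1", 0.8), ("s0", 0.2)],
    ("s0", "wait"): [("s0", 1.0)],
    ("s1", "go"):   [("s0", 1.0)],
    ("s1", "wait"): [("s1", 1.0)],
}
R = {
    ("s0", "go", "s1"): 5.0, ("s0", "go", "s0"): 0.0, ("s0", "wait", "s0"): 1.0,
    ("s1", "go", "s0"): 0.0, ("s1", "wait", "s1"): 2.0,
}

def step(state, action):
    """Sample s' ~ Pa(s, .) and return (s', Ra(s, s')); by the Markov property
    the outcome depends only on the current state and the chosen action."""
    next_states, probs = zip(*P[(state, action)])
    next_state = random.choices(next_states, weights=probs)[0]
    return next_state, R[(state, action, next_state)]

print(step("s0", "go"))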


Monte-Carlo method
Remember that the state-value is nothing more than the expected discounted cumulative reward obtained by starting from state s and following a policy π to the end state at the last timestep T:

G_t = R_(t+1) + γ·R_(t+2) + … + γ^(T−t−1)·R_T,   v_π(s) = E[G_t | S_t = s]

An episode is a sequence of all states from an initial state s to a terminal state; it is not a continuous infinite sequence. Moreover, for each episode we get a different return G_t. An example of an episode for a video game could be the sequence of all states, the frames, from start to Game Over.

Now, let’s choose one of our states as an initial state s_i. We go through one complete episode by following the policy π and eventually reach the end state s_e.

We repeat this many, many times. On every iteration, the end state could be reached in a different way, depending on the stochastic policy π. This leads to a different return G_t for every episode. Averaging over those returns gives us a close representation of the real state-value for the initial state s_i.

If we do this process for all states, where we always pick a different state as the initial state, then we obtain a good approximation of the state-value for each state and therefore the state-value function of the environment model. This is the main idea behind Monte-Carlo.

Instead of using Dynamic Programming and the Bellman equation to calculate the exact value of the state s at timestep t by considering future rewards and values of the future states, we obtain an estimate of the true value function by running through many episodes, observing what return we got for each, and averaging over all of them. No model is needed.


There are two different approaches to evaluating the policy using Monte-Carlo: First-Visit and Every-Visit. For both, we keep a counter N(s) for each state s, and we save the sum of the different returns from each episode for each state in S(s).

N(s) and S(s) accumulate values until all K episodes are completed; they are not set to zero after each episode.

Every-Visit Monte-Carlo Policy Evaluation

We iterate through K episodes. In each episode, every time we reach a state s we increment the counter N(s) for that state by one. It is possible that the same state is reached multiple times in the same episode; for all those visits the counter is increased. For each visit, G_t can be different.

The algorithm looks like the following:
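
The pseudo-code figure is not reproduced here; what follows is a minimal sketch of Every-Visit evaluation, where sample_episode(policy) is a hypothetical helper that runs one episode under the policy and returns its (state, reward) pairs in order.

# Every-Visit Monte-Carlo policy evaluation (sketch).
from collections import defaultdict

def every_visit_mc(policy, sample_episode, num_episodes=1000, gamma=0.9):
    N = defaultdict(int)     # visit counters N(s)
    S = defaultdict(float)   # accumulated returns S(s)

    for _ in range(num_episodes):
        episode = sample_episode(policy)        # [(s_0, r_1), (s_1, r_2), ...]
        G = 0.0
        # Walk the episode backwards so G accumulates the discounted return G_t.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1                       # every visit of the state counts
            S[state] += G

    # State-value estimate: average return observed from each state.
    return {s: S[s] / N[s] for s in N}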


First-Visit Monte-Carlo Policy Evaluation

We do the same as for the Every-Visit Monte-Carlo policy evaluation. The only difference is that within the same episode we update N(s) and S(s) only on the first visit of the state s. In our algorithm, we check whether s was visited for the first time in the episode and only then update them.
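
A minimal sketch of the first-visit modification, reusing the N and S tables from the every-visit sketch above:

# First-Visit update for one episode: count a state only at its first occurrence.
def first_visit_updates(episode, N, S, gamma=0.9):
    # Compute the return G_t for every timestep of the episode.
    returns, G = [], 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G
        returns.append((state, G))
    returns.reverse()                 # back to chronological order

    seen = set()
    for state, G in returns:
        if state in seen:             # skip repeat visits within this episode
            continue
        seen.add(state)
        N[state] += 1
        S[state] += G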

Problems of Monte Carlo

Monte-Carlo works only for episodic tasks, not for continuing or infinite problems.

We also need to complete an episode before we can update any values. That is bad, for example, for autonomous cars: with Monte Carlo, the values of a state would be updated only after a crash, i.e., at the end of the episode, which is a bit too late.



Policy

Consider the following case: from a state S1, one action leads to state S2 and another leads to state S3.

a12 is the probability of taking the action that leads from S1 to S2.
a13 is the probability of taking the action that leads from S1 to S3.
The value functions at S2 and S3 are v2 = 10 and v3 = 20 respectively.
We also assume that the reward r = 1 and the discount factor γ = 0.9.
As a simplification, we also take the transition probabilities to be 1, which means that when we perform an action we are 100% sure that we will land in the intended state.
Under these criteria we can write the value function at S1, which we denote v1, as:
v1 = a12*(r + γ*v2) + a13*(r + γ*v3)
v1 = a12*(1 + 0.9*10) + a13*(1 + 0.9*20)
v1 = a12*10 + a13*19

Suppose the action probabilities are a12 = 0.9 and a13 = 0.1; then v1 will be 10.9. On the other hand, if a12 = 0.1 and a13 = 0.9, then v1 becomes 18.1!
Clearly, giving more probability to a13 gives a better result in terms of the value of state S1.
The act of selecting an action at each state is called the “policy” and is denoted as π.
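
A quick numeric check of these two cases, using only the numbers given above:

# Evaluate v1 = a12*(r + γ*v2) + a13*(r + γ*v3) for both action distributions.
r, gamma, v2, v3 = 1.0, 0.9, 10.0, 20.0

def v1(a12, a13):
    return a12 * (r + gamma * v2) + a13 * (r + gamma * v3)

print(v1(0.9, 0.1))   # ≈ 10.9
print(v1(0.1, 0.9))   # ≈ 18.1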


As seen in the example above, some policies are better than others due to the selection of some actions over others in one or more states.

It is important to note that a policy π is better than π' if all v(s) under π are greater than or equal to all v(s) under π'.

It follows that in order to maximize the collected rewards we have to find the best possible policy, called the optimal policy and denoted π*.

Value Iteration

Value iteration computes the optimal state-value function by iteratively improving the estimate of V(s). The algorithm initializes V(s) to arbitrary random values and repeatedly updates the Q(s, a) and V(s) values until they converge. Value iteration is guaranteed to converge to the optimal values. This algorithm is shown in the following pseudo-code:
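
The original pseudo-code figure is not reproduced here; below is a minimal tabular sketch, assuming the MDP is supplied as a nested table P[s][a] of (probability, next_state, reward) triples (an illustrative format, not one defined in the notes).

# Value iteration (sketch): V(s) <- max_a sum_s' Pa(s,s') * (Ra(s,s') + γ V(s')).
def value_iteration(P, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in P}                    # arbitrary initial values
    while True:
        delta = 0.0
        for s in P:
            q_values = [
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            ]
            new_v = max(q_values)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:                      # values have converged
            return V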

Policy Iteration

While the value-iteration algorithm keeps improving the value function at each iteration until it converges, the agent only cares about finding the optimal policy, and sometimes the optimal policy converges before the value function. Therefore, another algorithm called policy iteration, instead of repeatedly improving the value-function estimate, re-defines the policy at each step and computes the value according to this new policy until the policy converges. Policy iteration is also guaranteed to converge to the optimal policy, and it often takes fewer iterations to converge than the value-iteration algorithm.

The pseudo-code for policy iteration is shown below.
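
The original pseudo-code figure is likewise not reproduced; the following minimal sketch uses the same assumed P[s][a] representation as the value-iteration sketch above.

# Policy iteration (sketch): alternate policy evaluation and greedy improvement.
def policy_iteration(P, gamma=0.9, theta=1e-6):
    policy = {s: next(iter(P[s])) for s in P}   # arbitrary initial policy
    V = {s: 0.0 for s in P}

    def q(s, a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        # Policy evaluation: compute V for the current policy.
        while True:
            delta = 0.0
            for s in P:
                new_v = q(s, policy[s])
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to V.
        stable = True
        for s in P:
            best_a = max(P[s], key=lambda a: q(s, a))
            if best_a != policy[s]:
                policy[s], stable = best_a, False
        if stable:                              # the policy stopped changing
            return policy, V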


SARSA Reinforcement Learning

The SARSA algorithm is a slight variation of the popular Q-Learning algorithm. For a learning agent in any Reinforcement Learning algorithm, its policy can be of two types:

1. On-policy: the learning agent learns the value function according to the current action derived from the policy currently being used.

2. Off-policy: the learning agent learns the value function according to the action derived from another policy.

Q-Learning is an off-policy technique and uses the greedy approach to learn the Q-value. SARSA, on the other hand, is an on-policy technique and uses the action performed by the current policy to learn the Q-value.

This difference is visible in the update rules for each technique:

1. Q-Learning: Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]

2. SARSA: Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)]

Here, the update equation for SARSA depends on the current state, current action, reward obtained, next state and next action. This observation led to the naming of the technique: SARSA stands for State-Action-Reward-State-Action, which symbolizes the tuple (s, a, r, s', a').
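
As an illustration, here is a minimal sketch of the two tabular update rules, assuming Q is stored as a Python dict keyed by (state, action); alpha and gamma are the usual learning rate and discount factor.

# Off-policy Q-Learning update: bootstrap from the greedy action in s'.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

# On-policy SARSA update: bootstrap from the action a' actually taken by the policy.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))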

Model-free (reinforcement learning)

In reinforcement learning (RL), a model-free algorithm (as opposed to a model-based one) is an algorithm which does not use the transition probability distribution (and the reward function) associated with the Markov decision process (MDP), which, in RL, represents the problem to be solved. The transition probability distribution (or transition model) and the reward function are often collectively called the "model" of the environment (or MDP), hence the name "model-free". A model-free RL algorithm can be thought of as an "explicit" trial-and-error algorithm. An example of a model-free algorithm is Q-learning.
