Unit-4: ML - Reinforcement Learning
ML | Semi-Supervised Learning
Today’s Machine Learning algorithms can be broadly classified into three categories: Supervised
Learning, Unsupervised Learning and Reinforcement Learning. Setting Reinforcement Learning aside,
the primary two categories of Machine Learning problems are Supervised and Unsupervised
Learning. The basic difference between the two is that Supervised Learning datasets have an output
label associated with each tuple while Unsupervised Learning datasets do not.
The most basic disadvantage of any Supervised Learning algorithm is that the dataset has to be
hand-labeled, either by a Machine Learning Engineer or a Data Scientist. This is a very costly
process, especially when dealing with large volumes of data. The most basic disadvantage of
Unsupervised Learning is that its application spectrum is limited.
To counter these disadvantages, the concept of Semi-Supervised Learning was introduced. In this
type of learning, the algorithm is trained on a combination of labeled and unlabeled data.
Typically, this combination contains a very small amount of labeled data and a very large
amount of unlabeled data. The basic procedure is that the programmer first clusters
similar data using an unsupervised learning algorithm and then uses the existing labeled data to label
the rest of the unlabeled data. The typical use cases of this type of algorithm share a common
property: acquiring unlabeled data is relatively cheap, while labeling that data is very expensive.
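One common realization of this idea is graph-based label propagation. The sketch below is a minimal illustration using scikit-learn's LabelPropagation on the digits dataset; the choice of dataset, the roughly 90% unlabeled split and the default model settings are illustrative assumptions, not part of these notes.

# Minimal semi-supervised sketch: propagate a few known labels to many unlabeled points.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.semi_supervised import LabelPropagation

X, y = load_digits(return_X_y=True)

# Pretend only a small fraction of the data is labeled: mark the rest with -1,
# which is the scikit-learn convention for "unlabeled".
rng = np.random.RandomState(0)
unlabeled_mask = rng.rand(len(y)) < 0.9       # hide ~90% of the labels
y_partial = np.copy(y)
y_partial[unlabeled_mask] = -1

# The model propagates the few known labels to nearby unlabeled points.
model = LabelPropagation()
model.fit(X, y_partial)

# Compare the propagated labels against the true (hidden) labels.
accuracy = (model.transduction_[unlabeled_mask] == y[unlabeled_mask]).mean()
print(f"Accuracy on originally-unlabeled points: {accuracy:.3f}")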
Intuitively, one may picture the three types of learning as follows: Supervised learning is a
student under the supervision of a teacher both at home and at school; Unsupervised learning is a
student who has to figure out a concept by himself; and Semi-Supervised learning is a teacher who teaches
a few concepts in class and gives homework questions based on similar concepts.
A Semi-Supervised algorithm typically relies on one or more of the following assumptions about the data:
1. Continuity Assumption: The algorithm assumes that points which are closer to each
other are more likely to have the same output label.
2. Cluster Assumption: The data can be divided into discrete clusters and points in the same
cluster are more likely to share an output label.
3. Manifold Assumption: The data lie approximately on a manifold of much lower dimension
than the input space. This assumption allows the use of distances and densities which are
defined on a manifold.
Some practical applications of Semi-Supervised Learning:
1. Speech Analysis: Since labeling audio files is a very labor-intensive task, Semi-Supervised
learning is a very natural approach to this problem.
2. Protein Sequence Classification: Since DNA strands are typically very large in size, the
rise of Semi-Supervised learning has been imminent in this field.
Reinforcement learning
Reinforcement learning is an area of Machine Learning in which an agent learns, by trial and error,
to take suitable actions so as to maximize the reward it receives in a particular situation.
Example: We have an agent and a reward, with many hurdles in between. The agent is supposed to
find the best possible path to reach the reward. The following example illustrates the problem.
The image shows a robot, a diamond and fire. The goal of the robot is to get the reward, that is
the diamond, and avoid the hurdles, that is the fire. The robot learns by trying all the possible paths and
then choosing the path that reaches the reward with the fewest hurdles. Each right step gives
the robot a reward and each wrong step subtracts from the robot's reward. The total reward is
calculated when it reaches the final reward, that is the diamond.
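The robot-and-diamond setting can be written down as a tiny grid world. The grid layout, the reward values and the per-step penalty below are illustrative assumptions, not numbers from these notes.

# A minimal grid-world sketch of the robot/diamond/fire example (assumed layout and rewards).
GRID = [
    [" ", " ", " ", "D"],   # D = diamond (goal, reward +10)
    [" ", "F", " ", "F"],   # F = fire    (hurdle, reward -10)
    [" ", " ", " ", " "],
]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply an action; return (next_state, reward, done)."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr = max(0, min(len(GRID) - 1, r + dr))
    nc = max(0, min(len(GRID[0]) - 1, c + dc))
    cell = GRID[nr][nc]
    if cell == "D":
        return (nr, nc), +10, True      # reached the diamond
    if cell == "F":
        return (nr, nc), -10, True      # walked into the fire
    return (nr, nc), -1, False          # small penalty per step: wrong/extra steps cost reward

# One hand-picked path from the start (2, 0) to the diamond:
state, total = (2, 0), 0
for a in ["up", "up", "right", "right", "right"]:
    state, reward, done = step(state, a)
    total += reward
print("total reward:", total)           # four -1 steps plus +10 on the last step -> 6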
• Input: The input is an initial state from which the model will start.
• Output: There are many possible outputs, as there is a variety of solutions to a particular
problem.
• Training: The training is based on the input; the model returns a state and the user
decides whether to reward or punish the model based on its output.
Reinforcement learning is all about making decisions sequentially: in simple words, the output
depends on the state of the current input, and the next input depends on the output of the previous
input. In Supervised learning, by contrast, the decision is made on the initial input, i.e. the input
given at the start.
There are two types of Reinforcement:
1. Positive –
Positive Reinforcement occurs when an event, occurring due to a particular behavior,
increases the strength and the frequency of that behavior. In other words, it has a positive
effect on the behavior.
o Maximizes Performance
o Too much Reinforcement can lead to an overload of states, which can diminish the
results
2. Negative –
Negative Reinforcement is defined as the strengthening of a behavior because a negative
condition is stopped or avoided.
o Increases Behavior
• RL can be used to create training systems that provide custom instruction and materials
according to the requirements of students.
• The only way to collect information about the environment is to interact with it.
• Agent(): An entity that can perceive/explore the environment and act upon it.
• Action(): Actions are the moves taken by an agent within the environment.
• State(): A state is the situation returned by the environment after each action taken by the agent.
• Reward(): Feedback returned to the agent from the environment to evaluate the agent's action.
• Policy(): A policy is the strategy applied by the agent to choose the next action based on the
current state.
• Value(): The expected long-term return with the discount factor, as opposed to the short-term
reward.
• Q-value(): Mostly similar to the value, but it takes one additional parameter, the current
action (a).
• In RL, the agent is not instructed about the environment or about which actions need to be taken.
• The agent takes the next action and changes states according to the feedback of the
previous action.
• The environment is stochastic, and the agent needs to explore it to get the
maximum positive reward.
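These terms fit together in the standard agent-environment interaction loop. The sketch below is a generic illustration: the env object with reset()/step() methods and the purely random policy are assumptions, placeholders for a real environment and a learned policy.

import random

# Minimal agent-environment loop using the terms defined above.
# `env` is assumed to expose reset() -> state and step(action) -> (state, reward, done).
def run_episode(env, actions, max_steps=100):
    state = env.reset()                          # initial State
    total_reward = 0.0
    for _ in range(max_steps):
        action = random.choice(actions)          # Policy: here purely random (exploration only)
        state, reward, done = env.step(action)   # environment returns the next State and a Reward
        total_reward += reward                   # the agent tries to maximize this long-term return
        if done:
            break
    return total_reward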
There are mainly three ways to implement reinforcement learning in ML:
1. Value-based:
The value-based approach aims to find the optimal value function, which gives the
maximum value at a state under any policy. The agent thus expects the long-term
return at any state s under policy π.
2. Policy-based:
The policy-based approach aims to find the optimal policy for the maximum future reward without
using a value function. In this approach, the agent tries to apply a policy such that the
action performed at each step helps to maximize the future reward. The policy can be of two kinds:
o Deterministic: The same action is produced by the policy (π) at any given state.
o Stochastic: The policy defines a probability distribution over actions at each state, and the
produced action is sampled from it.
3. Model-based: In the model-based approach, a virtual model is created for the environment,
and the agent explores that environment to learn it. There is no particular solution or
algorithm for this approach because the model representation is different for each
environment.
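As a rough illustration of how the three approaches differ in what the agent maintains (a value table, a policy, or a model of the environment), consider the following sketch; the states, actions and all numbers are made up for illustration.

# Illustrative data structures for the three approaches (all values are assumed).

# Value-based: a table estimating the long-term return of each state.
V = {"s1": 0.0, "s2": 10.0, "s3": 20.0}

# Policy-based, deterministic: the same action is always produced at a given state.
policy_det = {"s1": "right", "s2": "up", "s3": "stay"}

# Policy-based, stochastic: a probability distribution over actions per state.
policy_stoch = {"s1": {"right": 0.9, "up": 0.1}}

# Model-based: an explicit model P(s' | s, a) and R(s, a, s') of the environment.
model = {("s1", "right"): {"s2": 1.0},      # transition probabilities
         ("s1", "up"):    {"s3": 1.0}}
rewards = {("s1", "right", "s2"): 1.0,
           ("s1", "up", "s3"): 1.0}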
Markov Decision Process (MDP)
Once the current state is known, the history of information encountered so far may be thrown away;
that state is a sufficient statistic that gives us the same characterization of the future as if we
had all the history.
In mathematical terms, a state St has the Markov property if and only if:
P[St+1 | St] = P[St+1 | S1, S2, …, St]
At each time step, the process is in some state s, and the decision maker may choose any action a
that is available in state s. The process responds at the next time step by randomly moving into a
new state s', giving the decision maker a corresponding reward Ra(s, s').
The probability that the process moves into its new state s’ is influenced by the chosen action.
Specifically, it is given by the state transition function Pa(s,s’) . Thus, the next state depends on the
current state and the decision maker's action a. But given s and a, it is conditionally independent of
all previous states and actions; in other words, the state transitions of an MDP satisfy the Markov
property.
Markov decision processes are an extension of Markov chains; the difference is the addition of
actions (allowing choice) and rewards (giving motivation). Conversely, if only one action exists for
each state (e.g. "wait") and all rewards are the same (e.g. "zero"), a Markov decision process
reduces to a Markov chain.
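A finite MDP can be written down directly as its transition function Pa(s, s') and reward function Ra(s, s'). The two-state example below is an assumption made purely for illustration.

import random

# A tiny hand-made MDP: states, actions, transition probabilities Pa(s, s') and rewards Ra(s, s').
# The numbers are illustrative assumptions.
states = ["s0", "s1"]
actions = ["stay", "move"]

P = {  # P[(s, a)] is a dict {s': probability}
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s1": 0.8, "s0": 0.2},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 1.0},
}
R = {  # R[(s, a, s')] is the reward for that transition
    ("s0", "stay", "s0"): 0.0,
    ("s0", "move", "s1"): 1.0,
    ("s0", "move", "s0"): 0.0,
    ("s1", "stay", "s1"): 2.0,
    ("s1", "move", "s0"): 0.0,
}

def sample_step(s, a):
    """Given s and a, the next state depends only on (s, a) -- the Markov property."""
    next_states = list(P[(s, a)])
    nxt = random.choices(next_states, weights=[P[(s, a)][s2] for s2 in next_states])[0]
    return nxt, R[(s, a, nxt)]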
Monte-Carlo method
Remember that the state-value is nothing more than the expected discounted cumulative reward obtained by
starting from state s and following a policy π to the end state at the last timestep T:
vπ(s) = E[ Gt | St = s ],  where  Gt = R(t+1) + γ·R(t+2) + γ²·R(t+3) + … + γ^(T−t−1)·R(T)
An episode is a sequence of all states from an initial state s to a terminal state; it is not an infinite,
continuing sequence. Moreover, for each episode we get a different return G_t. An example of an episode
for a video game would be the sequence of all states (the frames) from start to Game Over.
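For concreteness, the return G_t of one finished episode can be computed directly from its recorded rewards. The reward sequence and discount factor below are assumed values.

# Discounted return of one complete episode (rewards and gamma are assumed values).
rewards = [1, 0, -1, 0, 10]        # R(t+1), R(t+2), ..., R(T) for one episode
gamma = 0.9

G = 0.0
for k, r in enumerate(rewards):    # G_t = sum over k of gamma^k * R(t+k+1)
    G += (gamma ** k) * r
print(G)                           # 1 + 0 - 0.81 + 0 + 6.561 = 6.751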
Now, let’s choose one of our states as an initial state s_i. We go through one complete episode by
following the policy π and eventually reach the end state s_e.
We repeat this many, many times. On every iteration, the end state could be reached in a different way,
depending on the stochastic policy π. This leads to a different return G_t for every episode. Averaging
over those returns gives us a close representation of the real state-value for the initial state s_i.
If we do this process for all states, where we always pick a different state as the initial state, then we
obtain a good approximation of the state-value for each state and therefore of the state-value function vπ.
Instead of using Dynamic Programming and the Bellman Equation to calculate the exact value of
the state s at timestep t by considering future rewards and values of future states, we obtain an
estimate of the true value function by running through many episodes, observing what return we
got for each, and averaging over all of them. No model is needed.
There are two different approaches to evaluating a policy with Monte-Carlo: First-Visit and Every-
Visit. For both, we keep a counter N(s) for each state s and we save the sum of the returns observed
from that state in S(s). N(s) and S(s) accumulate values until all K episodes are completed; they are
not reset to zero after each episode.
Every-Visit Monte-Carlo Policy Evaluation
We iterate through K episodes. In each episode, every time we reach a state s we increment the counter
N(s) for that state by one and add the return from that point to S(s). It is possible that the same state
is reached multiple times in the same episode; with Every-Visit, N(s) and S(s) are updated on every such
visit. The value estimate is then V(s) = S(s) / N(s).
First-Visit Monte-Carlo Policy Evaluation
We do the same as for the Every-Visit Monte-Carlo Policy Evaluation. The only difference is that
within the same episode we update N(s) and S(s) only on the first visit to the state s. In the algorithm,
we check whether s is visited for the first time in the episode and only then update them. A sketch
follows below.
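The sketch below is a minimal Python version of First-Visit Monte-Carlo evaluation. The run_episode() helper, which plays one episode under the policy π and returns the visited (state, reward) pairs, is an assumed placeholder and not defined in these notes.

from collections import defaultdict

# First-Visit Monte-Carlo policy evaluation (sketch).
# `run_episode()` is an assumed helper that plays one episode under policy pi and
# returns a list of (state, reward) pairs in the order they were visited.
def first_visit_mc(run_episode, num_episodes=1000, gamma=0.9):
    N = defaultdict(int)      # visit counter per state
    S = defaultdict(float)    # sum of returns per state
    for _ in range(num_episodes):
        episode = run_episode()
        # Returns-to-go: G_t for each timestep, computed backwards through the episode.
        G, returns = 0.0, []
        for _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()
        seen = set()
        for (state, _), G_t in zip(episode, returns):
            if state in seen:          # first visit only
                continue
            seen.add(state)
            N[state] += 1
            S[state] += G_t
    return {s: S[s] / N[s] for s in N}   # V(s) = S(s) / N(s)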
A drawback of Monte-Carlo methods is that we need to complete an episode before we can update any
values. That is bad, e.g., for autonomous cars: with Monte-Carlo, the values of a state would only be
updated after a crash, i.e. at the end of the episode.
Policy
a12 is the probability of taking the action that leads from S1 to S2.
a13 is the probability of taking the action that leads from S1 to S3.
The values at S2 and S3 are v2 = 10 and v3 = 20 respectively.
We also assume that the reward r = 1 and the discount factor γ = 0.9.
As a simplification, we also consider that the transition probabilities are 1, which means that when we
perform an action we are 100% sure we will land in the intended state.
Under these criteria we can write the value function at S1, which we denote v1, as:
v1 = a12*(r + γ*v2) + a13*(r + γ*v3)
v1 = a12*(1 + 0.9*10) + a13*(1 + 0.9*20)
v1 = a12*10 + a13*19
Suppose the action probabilities are a12 = 0.9 and a13 = 0.1; then v1 will be 10.9. On the other hand,
if a12 = 0.1 and a13 = 0.9, then v1 becomes 18.1!
Clearly, giving more probability to a13 gives a better result in terms of the value of the
state S1.
The act of selecting an action at each state is called the "policy" and is denoted as π.
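The arithmetic above can be checked with a few lines of code, using exactly the numbers from the example.

# Value of S1 under two different policies, with the numbers from the example above.
r, gamma, v2, v3 = 1, 0.9, 10, 20

def v1(a12, a13):
    return a12 * (r + gamma * v2) + a13 * (r + gamma * v3)

print(round(v1(0.9, 0.1), 2))   # 10.9 -> mostly taking the action towards S2
print(round(v1(0.1, 0.9), 2))   # 18.1 -> mostly taking the action towards S3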
As seen in the example above, some policies are better than others due to the selection of certain
actions over others in one or more states.
A policy π is better than π′ if v(s) under π is greater than or equal to v(s) under π′ for all states s.
It follows that, in order to maximize the collected rewards, we have to find the best possible
policy, called the optimal policy and denoted π*.
Value Iteration
Value iteration computes the optimal state-value function by iteratively improving the estimate of
V(s). The algorithm initializes V(s) to arbitrary random values and repeatedly updates the Q(s, a) and
V(s) values until they converge. Value iteration is guaranteed to converge to the optimal values.
A sketch of the algorithm follows.
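The sketch below is an illustrative Python version, not the original pseudo-code; it assumes tabular dictionaries P[(s, a)] = {s': probability} and R[(s, a, s')] = reward like those in the MDP sketch earlier.

# Value iteration over a tabular MDP (sketch; P and R are assumed tables as described above).
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}                   # arbitrary initial values
    while True:
        delta = 0.0
        for s in states:
            # Q(s, a) = sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
            q_values = [
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)].items())
                for a in actions
            ]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best                            # V(s) <- max_a Q(s, a)
        if delta < theta:                          # stop when the values have converged
            return V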
Policy Iteration
The value-iteration algorithm keeps improving the value function at each iteration until the value
function converges. Since the agent only cares about finding the optimal policy, and the optimal
policy sometimes converges before the value function does, another algorithm called policy
iteration, instead of repeatedly improving the value-function estimate, re-defines the policy at
each step and computes the value according to this new policy until the policy converges. Policy
iteration is also guaranteed to converge to the optimal policy, and it often takes fewer iterations to
converge than the value-iteration algorithm.
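A comparable illustrative sketch of policy iteration follows, under the same assumed P and R tables; it alternates policy evaluation and greedy policy improvement.

# Policy iteration over a tabular MDP (sketch; P and R as in the MDP sketch above).
def policy_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    policy = {s: actions[0] for s in states}       # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # 1) Policy evaluation: compute V for the current policy.
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v_new = sum(p * (R[(s, a, s2)] + gamma * V[s2])
                            for s2, p in P[(s, a)].items())
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # 2) Policy improvement: act greedily with respect to V.
        stable = True
        for s in states:
            best_a = max(actions, key=lambda a: sum(
                p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)].items()))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:                                 # the policy no longer changes -> optimal
            return policy, V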
SARSA
The SARSA algorithm is a slight variation of the popular Q-Learning algorithm. For a learning agent in
any Reinforcement Learning algorithm, the policy can be of two types:
1. On Policy: The learning agent learns the value function according to the
current action derived from the policy currently being used.
2. Off Policy: The learning agent learns the value function according to the action
derived from another policy.
Q-Learning is an Off-Policy technique and uses the greedy approach to learn the Q-value.
SARSA, on the other hand, is an On-Policy technique and uses the action performed by the
current policy to learn the Q-value.
This difference is visible in the update rules of the two techniques (α is the learning rate and γ the
discount factor):
1. Q-Learning: Q(s, a) ← Q(s, a) + α·[ r + γ·max over a' of Q(s', a') − Q(s, a) ]
2. SARSA: Q(s, a) ← Q(s, a) + α·[ r + γ·Q(s', a') − Q(s, a) ], where a' is the action actually taken
in s' by the current policy.
Here, the update equation for SARSA depends on the current state, the current action, the reward
obtained, the next state and the next action. This observation led to the name of the technique:
SARSA stands for State Action Reward State Action, which symbolizes the tuple (s, a, r, s', a').
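The two updates can be placed side by side in code. The tabular Q dictionary, the learning rate α, the discount γ and the ε-greedy behaviour policy below are illustrative assumptions.

import random
from collections import defaultdict

# Q-learning vs SARSA updates on a tabular Q(s, a) (sketch; alpha, gamma, epsilon assumed).
Q = defaultdict(float)            # Q[(state, action)] defaults to 0.0
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def epsilon_greedy(state, actions):
    """Behaviour policy used by both methods: mostly greedy, sometimes random."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(s, a, r, s_next, actions):
    # Off-policy: bootstrap from the *greedy* action in the next state.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: bootstrap from the action the current policy actually takes next.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])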