Unit-4: ML - Reinforcement Learning
ML | Semi-Supervised Learning
Today’s Machine Learning algorithms can be broadly classified into three categories: Supervised
Learning, Unsupervised Learning and Reinforcement Learning. Setting Reinforcement Learning aside,
the primary two categories of Machine Learning problems are Supervised and Unsupervised
Learning. The basic difference between the two is that Supervised Learning datasets have an output
label associated with each tuple while Unsupervised Learning datasets do not.
The most basic disadvantage of any Supervised Learning algorithm is that the dataset has to be
hand-labeled, either by a Machine Learning Engineer or a Data Scientist. This is a very costly
process, especially when dealing with large volumes of data. The most basic disadvantage of
Unsupervised Learning is that its application spectrum is limited.
To counter these disadvantages, the concept of Semi-Supervised Learning was introduced. In this
type of learning, the algorithm is trained on a combination of labeled and unlabeled data.
Typically, this combination contains a very small amount of labeled data and a very large
amount of unlabeled data. The basic procedure is that the programmer first clusters
similar data using an unsupervised learning algorithm and then uses the existing labeled data to label
the rest of the unlabeled data. The typical use cases of this type of algorithm share a common
property: acquiring unlabeled data is relatively cheap, while labeling that data is very expensive.
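One common realization of this idea is graph-based label propagation. The sketch below is a minimal illustration using scikit-learn's LabelPropagation on the digits dataset; the choice of dataset, the roughly 90% unlabeled split and the default model settings are illustrative assumptions, not part of these notes.

# Minimal semi-supervised sketch: propagate a few known labels to many unlabeled points.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.semi_supervised import LabelPropagation

X, y = load_digits(return_X_y=True)

# Pretend only a small fraction of the data is labeled: mark the rest with -1,
# which is the scikit-learn convention for "unlabeled".
rng = np.random.RandomState(0)
unlabeled_mask = rng.rand(len(y)) < 0.9       # hide ~90% of the labels
y_partial = np.copy(y)
y_partial[unlabeled_mask] = -1

# The model propagates the few known labels to nearby unlabeled points.
model = LabelPropagation()
model.fit(X, y_partial)

# Compare the propagated labels against the true (hidden) labels.
accuracy = (model.transduction_[unlabeled_mask] == y[unlabeled_mask]).mean()
print(f"Accuracy on originally-unlabeled points: {accuracy:.3f}")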
Intuitively, one may picture the three types of learning as follows: Supervised learning is a
student under the supervision of a teacher both at home and at school; Unsupervised learning is a
student who has to figure out a concept by himself; and Semi-Supervised learning is a teacher who teaches
a few concepts in class and gives homework questions based on similar concepts.
A Semi-Supervised algorithm typically relies on one or more of the following assumptions about the data:
1. Continuity Assumption: The algorithm assumes that points which are closer to each
other are more likely to have the same output label.
2. Cluster Assumption: The data can be divided into discrete clusters and points in the same
cluster are more likely to share an output label.
3. Manifold Assumption: The data lie approximately on a manifold of much lower dimension
than the input space. This assumption allows the use of distances and densities which are
defined on a manifold.
Some practical applications of Semi-Supervised Learning:
1. Speech Analysis: Since labeling audio files is a very labor-intensive task, Semi-Supervised
learning is a very natural approach to this problem.
2. Protein Sequence Classification: Since DNA strands are typically very large in size, the
rise of Semi-Supervised learning has been imminent in this field.
Reinforcement learning
Reinforcement learning is an area of Machine Learning in which an agent learns, by trial and error,
to take suitable actions so as to maximize the reward it receives in a particular situation.
Example: We have an agent and a reward, with many hurdles in between. The agent is supposed to
find the best possible path to reach the reward. The following example illustrates the problem.
The image shows a robot, a diamond and fire. The goal of the robot is to get the reward, that is
the diamond, and avoid the hurdles, that is the fire. The robot learns by trying all the possible paths and
then choosing the path that reaches the reward with the fewest hurdles. Each right step gives
the robot a reward and each wrong step subtracts from the robot's reward. The total reward is
calculated when it reaches the final reward, that is the diamond.
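The robot-and-diamond setting can be written down as a tiny grid world. The grid layout, the reward values and the per-step penalty below are illustrative assumptions, not numbers from these notes.

# A minimal grid-world sketch of the robot/diamond/fire example (assumed layout and rewards).
GRID = [
    [" ", " ", " ", "D"],   # D = diamond (goal, reward +10)
    [" ", "F", " ", "F"],   # F = fire    (hurdle, reward -10)
    [" ", " ", " ", " "],
]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply an action; return (next_state, reward, done)."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr = max(0, min(len(GRID) - 1, r + dr))
    nc = max(0, min(len(GRID[0]) - 1, c + dc))
    cell = GRID[nr][nc]
    if cell == "D":
        return (nr, nc), +10, True      # reached the diamond
    if cell == "F":
        return (nr, nc), -10, True      # walked into the fire
    return (nr, nc), -1, False          # small penalty per step: wrong/extra steps cost reward

# One hand-picked path from the start (2, 0) to the diamond:
state, total = (2, 0), 0
for a in ["up", "up", "right", "right", "right"]:
    state, reward, done = step(state, a)
    total += reward
print("total reward:", total)           # four -1 steps plus +10 on the last step -> 6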
• Input: The input is an initial state from which the model will start.
• Output: There are many possible outputs, as there is a variety of solutions to a particular
problem.
• Training: The training is based on the input; the model returns a state and the user
decides whether to reward or punish the model based on its output.
Reinforcement learning is all about making decisions sequentially: in simple words, the output
depends on the state of the current input, and the next input depends on the output of the previous
input. In Supervised learning, by contrast, the decision is made on the initial input, i.e. the input
given at the start.
There are two types of Reinforcement:
1. Positive –
Positive Reinforcement occurs when an event, occurring due to a particular behavior,
increases the strength and the frequency of that behavior. In other words, it has a positive
effect on the behavior.
o Maximizes Performance
o Too much Reinforcement can lead to an overload of states, which can diminish the
results
2. Negative –
Negative Reinforcement is defined as the strengthening of a behavior because a negative
condition is stopped or avoided.
o Increases Behavior
• RL can be used to create training systems that provide custom instruction and materials
according to the requirements of students.
• The only way to collect information about the environment is to interact with it.
• Agent(): An entity that can perceive/explore the environment and act upon it.
• Action(): Actions are the moves taken by an agent within the environment.
• State(): A state is the situation returned by the environment after each action taken by the agent.
• Reward(): Feedback returned to the agent from the environment to evaluate the agent's action.
• Policy(): A policy is the strategy applied by the agent to choose the next action based on the
current state.
• Value(): The expected long-term return with the discount factor, as opposed to the short-term
reward.
• Q-value(): Mostly similar to the value, but it takes one additional parameter, the current
action (a).
• In RL, the agent is not instructed about the environment or about which actions need to be taken.
• The agent takes the next action and changes states according to the feedback of the
previous action.
• The environment is stochastic, and the agent needs to explore it to get the
maximum positive reward.
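These terms fit together in the standard agent-environment interaction loop. The sketch below is a generic illustration: the env object with reset()/step() methods and the purely random policy are assumptions, placeholders for a real environment and a learned policy.

import random

# Minimal agent-environment loop using the terms defined above.
# `env` is assumed to expose reset() -> state and step(action) -> (state, reward, done).
def run_episode(env, actions, max_steps=100):
    state = env.reset()                          # initial State
    total_reward = 0.0
    for _ in range(max_steps):
        action = random.choice(actions)          # Policy: here purely random (exploration only)
        state, reward, done = env.step(action)   # environment returns the next State and a Reward
        total_reward += reward                   # the agent tries to maximize this long-term return
        if done:
            break
    return total_reward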
There are mainly three ways to implement reinforcement learning in ML:
1. Value-based:
The value-based approach aims to find the optimal value function, which gives the
maximum value at a state under any policy. The agent thus expects the long-term
return at any state s under policy π.
2. Policy-based:
The policy-based approach aims to find the optimal policy for the maximum future reward without
using a value function. In this approach, the agent tries to apply a policy such that the
action performed at each step helps to maximize the future reward. The policy can be of two kinds:
o Deterministic: The same action is produced by the policy (π) at any given state.
o Stochastic: The policy defines a probability distribution over actions at each state, and the
produced action is sampled from it.
3. Model-based: In the model-based approach, a virtual model is created for the environment,
and the agent explores that environment to learn it. There is no particular solution or
algorithm for this approach because the model representation is different for each
environment.
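As a rough illustration of how the three approaches differ in what the agent maintains (a value table, a policy, or a model of the environment), consider the following sketch; the states, actions and all numbers are made up for illustration.

# Illustrative data structures for the three approaches (all values are assumed).

# Value-based: a table estimating the long-term return of each state.
V = {"s1": 0.0, "s2": 10.0, "s3": 20.0}

# Policy-based, deterministic: the same action is always produced at a given state.
policy_det = {"s1": "right", "s2": "up", "s3": "stay"}

# Policy-based, stochastic: a probability distribution over actions per state.
policy_stoch = {"s1": {"right": 0.9, "up": 0.1}}

# Model-based: an explicit model P(s' | s, a) and R(s, a, s') of the environment.
model = {("s1", "right"): {"s2": 1.0},      # transition probabilities
         ("s1", "up"):    {"s3": 1.0}}
rewards = {("s1", "right", "s2"): 1.0,
           ("s1", "up", "s3"): 1.0}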
Markov Decision Process (MDP)
Once the current state is known, the history of information encountered so far may be thrown away;
that state is a sufficient statistic that gives us the same characterization of the future as if we
had all the history.
In mathematical terms, a state St has the Markov property if and only if:
P[St+1 | St] = P[St+1 | S1, S2, …, St]
At each time step, the process is in some state s, and the decision maker may choose any action a
that is available in state s. The process responds at the next time step by randomly moving into a
new state s', giving the decision maker a corresponding reward Ra(s, s').
The probability that the process moves into its new state s’ is influenced by the chosen action.
Specifically, it is given by the state transition function Pa(s,s’) . Thus, the next state depends on the
current state and the decision maker's action a. But given s and a, it is conditionally independent of
all previous states and actions; in other words, the state transitions of an MDP satisfy the Markov
property.
Markov decision processes are an extension of Markov chains; the difference is the addition of
actions (allowing choice) and rewards (giving motivation). Conversely, if only one action exists for
each state (e.g. "wait") and all rewards are the same (e.g. "zero"), a Markov decision process
reduces to a Markov chain.
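A finite MDP can be written down directly as its transition function Pa(s, s') and reward function Ra(s, s'). The two-state example below is an assumption made purely for illustration.

import random

# A tiny hand-made MDP: states, actions, transition probabilities Pa(s, s') and rewards Ra(s, s').
# The numbers are illustrative assumptions.
states = ["s0", "s1"]
actions = ["stay", "move"]

P = {  # P[(s, a)] is a dict {s': probability}
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s1": 0.8, "s0": 0.2},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 1.0},
}
R = {  # R[(s, a, s')] is the reward for that transition
    ("s0", "stay", "s0"): 0.0,
    ("s0", "move", "s1"): 1.0,
    ("s0", "move", "s0"): 0.0,
    ("s1", "stay", "s1"): 2.0,
    ("s1", "move", "s0"): 0.0,
}

def sample_step(s, a):
    """Given s and a, the next state depends only on (s, a) -- the Markov property."""
    next_states = list(P[(s, a)])
    nxt = random.choices(next_states, weights=[P[(s, a)][s2] for s2 in next_states])[0]
    return nxt, R[(s, a, nxt)]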
Monte-Carlo method
Remember that the state-value is nothing more than the expected discounted cumulative reward obtained by
starting from state s and following a policy π to the end state at the last timestep T:
vπ(s) = E[ Gt | St = s ],  where  Gt = R(t+1) + γ·R(t+2) + γ²·R(t+3) + … + γ^(T−t−1)·R(T)
An episode is a sequence of all states from an initial state s to a terminal state; it is not an infinite,
continuing sequence. Moreover, for each episode we get a different return G_t. An example of an episode
for a video game would be the sequence of all states (the frames) from start to Game Over.
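For concreteness, the return G_t of one finished episode can be computed directly from its recorded rewards. The reward sequence and discount factor below are assumed values.

# Discounted return of one complete episode (rewards and gamma are assumed values).
rewards = [1, 0, -1, 0, 10]        # R(t+1), R(t+2), ..., R(T) for one episode
gamma = 0.9

G = 0.0
for k, r in enumerate(rewards):    # G_t = sum over k of gamma^k * R(t+k+1)
    G += (gamma ** k) * r
print(G)                           # 1 + 0 - 0.81 + 0 + 6.561 = 6.751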
Now, let’s choose one of our states as an initial state s_i. We go through one complete episode by
following the policy π and eventually reach the end state s_e.
We repeat this many, many times. On every iteration, the end state could be reached in a different way,
depending on the stochastic policy π. This leads to a different return G_t for every episode. Averaging
over those returns gives us a close representation of the real state-value for the initial state s_i.
If we do this process for all states, where we always pick a different state as the initial state, then we
obtain a good approximation of the state-value for each state and therefore of the state-value function vπ.
Instead of using Dynamic Programming and the Bellman Equation to calculate the exact value of
the state s at timestep t by considering future rewards and values of future states, we obtain an
estimate of the true value function by running through many episodes, observing what return we
got for each, and averaging over all of them. No model is needed.
There are two different approaches to evaluating a policy with Monte-Carlo: First-Visit and Every-
Visit. For both, we keep a counter N(s) for each state s and we save the sum of the returns observed
from that state in S(s). N(s) and S(s) accumulate values until all K episodes are completed; they are
not reset to zero after each episode.
Every-Visit Monte-Carlo Policy Evaluation
We iterate through K episodes. In each episode, every time we reach a state s we increment the counter
N(s) for that state by one and add the return from that point to S(s). It is possible that the same state
is reached multiple times in the same episode; with Every-Visit, N(s) and S(s) are updated on every such
visit. The value estimate is then V(s) = S(s) / N(s).
First-Visit Monte-Carlo Policy Evaluation
We do the same as for the Every-Visit Monte-Carlo Policy Evaluation. The only difference is that
within the same episode we update N(s) and S(s) only on the first visit to the state s. In the algorithm,
we check whether s is visited for the first time in the episode and only then update them. A sketch
follows below.
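The sketch below is a minimal Python version of First-Visit Monte-Carlo evaluation. The run_episode() helper, which plays one episode under the policy π and returns the visited (state, reward) pairs, is an assumed placeholder and not defined in these notes.

from collections import defaultdict

# First-Visit Monte-Carlo policy evaluation (sketch).
# `run_episode()` is an assumed helper that plays one episode under policy pi and
# returns a list of (state, reward) pairs in the order they were visited.
def first_visit_mc(run_episode, num_episodes=1000, gamma=0.9):
    N = defaultdict(int)      # visit counter per state
    S = defaultdict(float)    # sum of returns per state
    for _ in range(num_episodes):
        episode = run_episode()
        # Returns-to-go: G_t for each timestep, computed backwards through the episode.
        G, returns = 0.0, []
        for _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()
        seen = set()
        for (state, _), G_t in zip(episode, returns):
            if state in seen:          # first visit only
                continue
            seen.add(state)
            N[state] += 1
            S[state] += G_t
    return {s: S[s] / N[s] for s in N}   # V(s) = S(s) / N(s)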
A drawback of Monte-Carlo methods is that we need to complete an episode before we can update any
values. That is bad, e.g., for autonomous cars: with Monte-Carlo, the values of a state would only be
updated after a crash, i.e. at the end of the episode.
Policy
a12 is the probability of taking the action that leads from S1 to S2.
a13 is the probability of taking the action that leads from S1 to S3.
The values at S2 and S3 are v2 = 10 and v3 = 20 respectively.
We also assume that the reward r = 1 and the discount factor γ = 0.9.
As a simplification, we also consider that the transition probabilities are 1, which means that when we
perform an action we are 100% sure we will land in the intended state.
Under these criteria we can write the value function at S1, which we denote v1, as:
v1 = a12*(r + γ*v2) + a13*(r + γ*v3)
v1 = a12*(1 + 0.9*10) + a13*(1 + 0.9*20)
v1 = a12*10 + a13*19
Suppose the action probabilities are a12 = 0.9 and a13 = 0.1; then v1 will be 10.9. On the other hand,
if a12 = 0.1 and a13 = 0.9, then v1 becomes 18.1!
Clearly, giving more probability to a13 gives a better result in terms of the value of the
state S1.
The act of selecting an action at each state is called the "policy" and is denoted as π.
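The arithmetic above can be checked with a few lines of code, using exactly the numbers from the example.

# Value of S1 under two different policies, with the numbers from the example above.
r, gamma, v2, v3 = 1, 0.9, 10, 20

def v1(a12, a13):
    return a12 * (r + gamma * v2) + a13 * (r + gamma * v3)

print(round(v1(0.9, 0.1), 2))   # 10.9 -> mostly taking the action towards S2
print(round(v1(0.1, 0.9), 2))   # 18.1 -> mostly taking the action towards S3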
As seen in the example above, some policies are better than others due to the selection of certain
actions over others in one or more states.
A policy π is better than π′ if v(s) under π is greater than or equal to v(s) under π′ for all states s.
It follows that, in order to maximize the collected rewards, we have to find the best possible
policy, called the optimal policy and denoted π*.
Value Iteration
Value iteration computes the optimal state-value function by iteratively improving the estimate of
V(s). The algorithm initializes V(s) to arbitrary random values and repeatedly updates the Q(s, a) and
V(s) values until they converge. Value iteration is guaranteed to converge to the optimal values.
A sketch of the algorithm follows.
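The sketch below is an illustrative Python version, not the original pseudo-code; it assumes tabular dictionaries P[(s, a)] = {s': probability} and R[(s, a, s')] = reward like those in the MDP sketch earlier.

# Value iteration over a tabular MDP (sketch; P and R are assumed tables as described above).
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}                   # arbitrary initial values
    while True:
        delta = 0.0
        for s in states:
            # Q(s, a) = sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
            q_values = [
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)].items())
                for a in actions
            ]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best                            # V(s) <- max_a Q(s, a)
        if delta < theta:                          # stop when the values have converged
            return V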
Policy Iteration
The value-iteration algorithm keeps improving the value function at each iteration until the value
function converges. Since the agent only cares about finding the optimal policy, and the optimal
policy sometimes converges before the value function does, another algorithm called policy
iteration, instead of repeatedly improving the value-function estimate, re-defines the policy at
each step and computes the value according to this new policy until the policy converges. Policy
iteration is also guaranteed to converge to the optimal policy, and it often takes fewer iterations to
converge than the value-iteration algorithm.
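A comparable illustrative sketch of policy iteration follows, under the same assumed P and R tables; it alternates policy evaluation and greedy policy improvement.

# Policy iteration over a tabular MDP (sketch; P and R as in the MDP sketch above).
def policy_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    policy = {s: actions[0] for s in states}       # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # 1) Policy evaluation: compute V for the current policy.
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v_new = sum(p * (R[(s, a, s2)] + gamma * V[s2])
                            for s2, p in P[(s, a)].items())
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # 2) Policy improvement: act greedily with respect to V.
        stable = True
        for s in states:
            best_a = max(actions, key=lambda a: sum(
                p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in P[(s, a)].items()))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False
        if stable:                                 # the policy no longer changes -> optimal
            return policy, V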
SARSA
The SARSA algorithm is a slight variation of the popular Q-Learning algorithm. For a learning agent in
any Reinforcement Learning algorithm, the policy can be of two types:
1. On Policy: The learning agent learns the value function according to the
current action derived from the policy currently being used.
2. Off Policy: The learning agent learns the value function according to the action
derived from another policy.
Q-Learning is an Off-Policy technique and uses the greedy approach to learn the Q-value.
SARSA, on the other hand, is an On-Policy technique and uses the action performed by the
current policy to learn the Q-value.
This difference is visible in the update rules of the two techniques (α is the learning rate and γ the
discount factor):
1. Q-Learning: Q(s, a) ← Q(s, a) + α·[ r + γ·max over a' of Q(s', a') − Q(s, a) ]
2. SARSA: Q(s, a) ← Q(s, a) + α·[ r + γ·Q(s', a') − Q(s, a) ], where a' is the action actually taken
in s' by the current policy.
Here, the update equation for SARSA depends on the current state, the current action, the reward
obtained, the next state and the next action. This observation led to the name of the technique:
SARSA stands for State Action Reward State Action, which symbolizes the tuple (s, a, r, s', a').
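The two updates can be placed side by side in code. The tabular Q dictionary, the learning rate α, the discount γ and the ε-greedy behaviour policy below are illustrative assumptions.

import random
from collections import defaultdict

# Q-learning vs SARSA updates on a tabular Q(s, a) (sketch; alpha, gamma, epsilon assumed).
Q = defaultdict(float)            # Q[(state, action)] defaults to 0.0
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def epsilon_greedy(state, actions):
    """Behaviour policy used by both methods: mostly greedy, sometimes random."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(s, a, r, s_next, actions):
    # Off-policy: bootstrap from the *greedy* action in the next state.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: bootstrap from the action the current policy actually takes next.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])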