ML Unit-4 - RTU
Semi-supervised learning is a type of machine learning that falls in between supervised and
unsupervised learning. It is a method that uses a small amount of labelled data and a large
amount of unlabelled data to train a model. The goal of semi-supervised learning is to learn
a function that can accurately predict the output variable based on the input variables, similar
to supervised learning. However, unlike supervised learning, the algorithm is trained on a
dataset that contains both labelled and unlabelled data.
Semi-supervised learning is particularly useful when there is a large amount of unlabelled
data available, but it’s too expensive or difficult to label all of it.
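As a concrete sketch, self-training is one common semi-supervised approach: a base classifier is fit on the labelled portion and then pseudo-labels the unlabelled points it is most confident about. The snippet below assumes scikit-learn is available; the dataset is synthetic and purely illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Synthetic data; unlabelled samples are marked with -1, as scikit-learn expects.
X, y = make_classification(n_samples=1000, random_state=0)
rng = np.random.RandomState(0)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.9] = -1          # hide 90% of the labels

base = SVC(probability=True, gamma="auto")      # base learner must expose predict_proba
model = SelfTrainingClassifier(base).fit(X, y_partial)
print("accuracy on the fully labelled data:", model.score(X, y))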
Reinforcement Learning→
Reinforcement Learning is a feedback-based machine learning technique in which an agent
learns to behave in an environment by performing actions and observing the results of those
actions. For each good action, the agent gets positive feedback, and for each bad action, the
agent gets negative feedback or a penalty.
• In Reinforcement Learning, the agent learns automatically from feedback, without any
labeled data, unlike supervised learning.
• Since there is no labeled data, the agent is bound to learn from its experience alone.
• RL solves a specific type of problem where decision making is sequential, and the goal is
long-term, such as game-playing, robotics, etc.
• The agent interacts with the environment and explores it by itself. The primary goal of an
agent in reinforcement learning is to improve its performance by collecting the maximum
positive reward.
• The agent learns by trial and error, and based on that experience, it learns to
perform the task in a better way.
Example: Suppose there is an AI agent present within a maze environment, and its goal is to
find the diamond. The agent interacts with the environment by performing some actions, and
based on those actions, the state of the agent changes, and it also receives a reward or
penalty as feedback.
Figure: the agent-environment interaction loop, in which the agent takes actions and the
environment returns the resulting state and a reward.
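This loop can be written down in a few lines of code. The MazeEnv class below is a hypothetical stand-in for the maze example above (it is not from any library), and the random policy is purely illustrative.

import random

class MazeEnv:
    """Toy 4 x 4 maze with a reset/step interface (illustrative only)."""
    def reset(self):
        self.pos = 0                                # start in the top-left cell
        return self.pos
    def step(self, action):
        self.pos = (self.pos + action) % 16         # toy transition dynamics
        reward = 50 if self.pos == 15 else -1       # +50 for the diamond, -1 per step
        done = self.pos == 15
        return self.pos, reward, done

env = MazeEnv()
state = env.reset()
for _ in range(100):                                # cap the episode length
    action = random.choice([1, 4])                  # naive policy: move right or down
    state, reward, done = env.step(action)          # environment returns state and reward
    if done:
        break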
1. Positive Reinforcement-
Positive reinforcement means adding something desirable to increase the tendency that the
expected behaviour occurs again. It has a positive impact on the behaviour of the agent and
increases the strength of that behaviour.
This type of reinforcement can sustain the changes for a long time, but too much positive
reinforcement may lead to an overload of states, which can diminish the results.
2. Negative Reinforcement:
Negative reinforcement means strengthening a behaviour because a negative condition is
stopped or avoided. It increases the tendency that the expected behaviour occurs again by
removing an undesirable stimulus.
Monte Carlo Prediction→
Suppose we wish to estimate Vπ(s), the value of a state s under policy π, given a set of episodes
obtained by following π and passing through s. Each occurrence of state s in an episode is
called a visit to s. The every-visit MC method estimates Vπ(s) as the average of the returns
following all the visits to s in a set of episodes. Within a given episode, the first time s is visited
is called the first visit to s. The first-visit MC method averages just the returns following first
visits to s.
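A minimal sketch of first-visit MC prediction. The episode format (a list of (state, reward) pairs, where the reward is the one received on leaving that state) and the names used here are illustrative assumptions, not taken from the notes.

from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate Vpi(s) as the average return following the first visit to s in each episode."""
    returns = defaultdict(list)
    for episode in episodes:
        # Compute the discounted return G_t after every time step, working backwards.
        G = [0.0] * (len(episode) + 1)
        for t in range(len(episode) - 1, -1, -1):
            _, r = episode[t]
            G[t] = r + gamma * G[t + 1]
        # Record the return only for the first visit to each state in this episode.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(G[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}

# Usage: episodes = [[(0, -1), (1, -1), (2, 10)]]; V = first_visit_mc(episodes, gamma=0.9)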
Monte Carlo Backup Diagram: the backup for a state covers the entire sampled episode, from
that state all the way to the terminal state.
Value iteration→
• Value iteration is a method of computing an optimal MDP policy and its value.
• Value iteration starts at the "end" and then works backward, refining an estimate of
either Q* or V*.
• There is really no end, so it uses an arbitrary end point.
• Let Vk be the value function assuming there are k stages to go, and let Qk be the Q-function
assuming there are k stages to go.
• These can be defined recursively.
• Value iteration starts with an arbitrary function V0 and uses the following equations to get
the functions for k+1 stages to go from the functions for k stages to go:
Qk+1(s,a) = Σs' P(s' | s,a) [ R(s,a,s') + γVk(s') ]
Vk(s) = maxa Qk(s,a)
• It can either save the V[S] array or the Q[S,A] array. Saving the V array results in less
storage, but it is more difficult to determine an optimal action, and one more iteration is
needed to determine which action results in the greatest value.
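A compact sketch of value iteration over a small MDP. The transition model P[s][a], given here as a list of (probability, next_state, reward) triples, together with the discount factor and the stopping threshold, are all illustrative assumptions.

def value_iteration(states, actions, P, gamma=0.9, theta=1e-6):
    """Repeatedly apply the Bellman optimality update until V stops changing."""
    V = {s: 0.0 for s in states}                     # V0: arbitrary starting value function
    while True:
        delta = 0.0
        for s in states:
            # Qk+1(s,a) = sum over s' of P(s'|s,a) * (R(s,a,s') + gamma * Vk(s'))
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in actions]
            new_v = max(q)                           # Vk+1(s) = max over a of Qk+1(s,a)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            break
    # One extra pass to recover a greedy policy from V, as the notes mention.
    policy = {s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2])
                                                for p, s2, r in P[s][a]))
              for s in states}
    return V, policy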
Policy Iteration→
• Policy iteration starts with a policy and iteratively improves it.
• It starts with an arbitrary policy π0 (an approximation to the optimal policy
works best) and carries out the following steps starting from i=0.
• Policy evaluation: Determine Vπi(S). The definition of Vπ is a set of |S| linear
equations in |S| unknowns. The unknowns are the values of Vπi(S). There is an
equation for each state. These equations can be solved by a linear equation
solution method (such as Gaussian elimination) or they can be solved
iteratively.
• Policy improvement: choose πi+1(s)= argmaxa Qπi(s,a), where the Q-value can
be obtained from V. To detect when the algorithm has converged, it should
only change the policy if the new action for some state improves the expected
value; that is, it should set πi+1(s) to be πi(s) if πi(s) is one of the actions that
maximizes Qπi(s,a).
• Stop if there is no change in the policy - that is, if πi+1=πi - otherwise
increment i and repeat.
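The two steps can be sketched as follows, over the same illustrative dictionary-based MDP representation used in the value iteration sketch; policy evaluation is done iteratively here rather than by Gaussian elimination.

def policy_iteration(states, actions, P, gamma=0.9, theta=1e-6):
    """Alternate policy evaluation and policy improvement until the policy is stable."""
    policy = {s: actions[0] for s in states}         # arbitrary initial policy pi0
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: solve for Vpi iteratively instead of by Gaussian elimination.
        while True:
            delta = 0.0
            for s in states:
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: pi_{i+1}(s) = argmax_a Qpi(s,a),
        # changing the policy only when the new action strictly improves the value.
        stable = True
        for s in states:
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in actions}
            best = max(q, key=q.get)
            if q[best] > q[policy[s]]:
                policy[s] = best
                stable = False
        if stable:                                   # pi_{i+1} = pi_i: converged
            return policy, V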
Q-Learning→
• Q-Learning comes under value-based learning algorithms.
• The objective is to optimize a value function suited to a given problem/environment.
• The ‘Q’ stands for quality; it helps in finding the next action resulting in a state of the
highest quality.
• This approach is rather simple and intuitive.
• The values are stored in a table, called a Q-table.
• A Q-table or matrix is created while performing Q-learning. The table is indexed by state-
action pairs, i.e., [s, a], with all values initialized to zero. After each action, the table is
updated, and the Q-values are stored within the table.
• The RL agent uses this Q-table as a reference table to select the best action based on the Q-
values.
• The working of Q-learning follows a simple loop: initialize the Q-table, choose an action
(explore or exploit), perform it, measure the reward, and update the Q-table, repeating
until the episode ends.
Example:
Let us devise a simple 2D game environment of size 4 x 4 and understand how Q-Learning
can be used to arrive at the best solution.
Goal: Guide the kid to the Park
Reward System:
• Get candy = +10 points
• Encounter Dog = -50 points
• Reach Park = +50 points
End of an Episode:
• Encounter Dog
• Reach Park
Now let us see how a typical Q-learning agent will play this game.
First, let us create a Q-table where we will keep track of all values associated with each state.
The Q-table will have rows equal to the number of states in the problem, i.e. 16 in our case,
and the number of columns will be equal to the number of actions an agent can take, which
happens to be 4 (Up, Down, Left & Right).
Step 1: Initialization
When the agent plays the game for the first time, it has no prior knowledge so let’s initialize
the table with zeroes.
Step 2: Exploitation OR Exploration
Now the agent can interact with the environment in two ways: either it can use already
gained info from the Q-table i.e. exploit, or it can venture to uncharted territories i.e.
explore.
Exploitation becomes very useful when the agent has worked out a high number of episodes
and has information about the environment.
Exploration, on the other hand, becomes important when the agent is naïve and does not
have much experience.
This tradeoff between exploitation and exploration is commonly handled with an epsilon-
greedy strategy: with probability epsilon the agent explores a random action, otherwise it
exploits the best-known action, as sketched below.
Ideally, at initial stages, we would like to give more preference to exploration, while in the
later stages exploitation would be more useful.
In Step 2, the agent takes an action (exploit or explore).
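A minimal epsilon-greedy selection sketch, assuming a NumPy Q-table indexed by [state, action]; the decay schedule and values are illustrative choices.

import numpy as np

def epsilon_greedy(Q, state, epsilon, n_actions=4):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)    # explore: random action
    return int(np.argmax(Q[state]))            # exploit: action with the highest Q-value

Q = np.zeros((16, 4))                          # 16 states x 4 actions, initialized to zero
epsilon = 1.0                                  # start with pure exploration...
epsilon_decay = 0.99                           # ...and shift toward exploitation over episodes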
Step 3: Measure Reward
After the agent performs the action decided in step 2, it reaches the next state, say s'. Now
again at state s' the four actions can be performed, each one leading to a different reward
score.
For example, the boy moves from 1 to 5; now either 6 or 9 can be selected. To find
the reward value for state 5, we work out the reward values of all the future
states, i.e. 6 & 9, and select the maximum value.
At 5, there are two options (For simplicity retracing steps is not performed)–
Go to 9 : End of Episode
Go to 6 : At state 6 there are again 3 options –
Go to 7 – End of Episode
Go to 2 – Continue this step until the end of the episode is reached and find out the reward
Go to 10 – Continue this step and find out the reward
Sample Calculation-
Path A reward = 10 + 50 = 60
Path B reward = 50
Max Reward = 60 (Path A)
Total Rewards at State 5: -50 (Faced dog at 9), 10 + 60 (Max reward from State 6
onwards)
Value of reward at 5 = Max (-50 , 10+60 ) = 70
Step 4: Update the Q table
The reward value calculated in step 3 is then used to update the value at state 5 using the
Bellman equation:
Q_new(s,a) = Q(s,a) + Learning rate × [ Reward + Discount rate × maxa' Q(s',a') - Q(s,a) ]
Here, Learning rate = a constant that determines how much weightage you want to give to the
new value vs the old value.
Discount rate = a constant that discounts the effect of future rewards (typically 0.8 to 0.99),
i.e., it balances the effect of future rewards in the new values.
The agent will iterate over these steps and arrive at a Q-Table with updated values.
Using this Q-Table is then as simple as using a map: for each state, select the action that
leads to the state with the maximum Q-value.
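Putting the four steps together, a tabular Q-learning loop over a grid like this might look like the sketch below; the env object (with reset() and step() returning the next state, reward, and a done flag) and all hyperparameter values are illustrative assumptions.

import numpy as np

def q_learning(env, n_states=16, n_actions=4, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=1.0, epsilon_decay=0.99):
    """Tabular Q-learning: initialize, act (explore/exploit), observe reward, update."""
    Q = np.zeros((n_states, n_actions))                    # Step 1: initialization
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Step 2: exploitation or exploration (epsilon-greedy)
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            # Step 3: act and measure the reward
            next_state, reward, done = env.step(action)
            # Step 4: Bellman update of the Q-table
            target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
        epsilon *= epsilon_decay                           # explore less as we learn more
    return Q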
State-Action-Reward-State-Action (SARSA)→
• The SARSA algorithm is a slight variation of the well-known Q-Learning algorithm.
• For a learning agent in any Reinforcement Learning algorithm, its policy can be of two
types:
• On Policy: In this, the learning agent learns the value function as indicated by the
current action derived from the policy currently being used.
• Off Policy: In this, the learning agent learns the value function according to the action
derived from another policy.
• The Q-Learning technique is an Off Policy method and uses the greedy action to learn the Q-
value.
• The SARSA technique, on the other hand, is On Policy and uses the action actually performed
by the current policy to learn the Q-value.
• SARSA is an on-policy algorithm where, in the current state S, an action A is taken, the
agent gets a reward R, ends up in the next state S', and takes action A' in S'.
Therefore, the tuple (S, A, R, S', A') stands for the acronym SARSA.
• It is called an on-policy algorithm because it updates the policy based on actions taken.
An experience in SARSA is of the form ⟨s,a,r,s',a'⟩, which means that the agent was in
state s, did action a, received reward r, and ended up in state s', from which it decided to
do action a'.
This provides a new experience with which to update Q(s,a). The new value that this experience
provides is r + γQ(s',a'), so Q(s,a) is moved toward it:
Q(s,a) ← Q(s,a) + α [ r + γQ(s',a') - Q(s,a) ]
Both the current action At and the next action At+1 are chosen using the same policy. Thus,
the action taken in state St+1 is At+1, and it is this action that is used while updating the
action-state value of St.
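As a short sketch of that update (using the same illustrative NumPy Q-table indexed by [state, action] as in the Q-learning example; alpha and gamma values are placeholders):

import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: the target uses the action a_next actually chosen by the policy,
    not the greedy maximum that Q-learning would use."""
    target = r + gamma * Q[s_next, a_next]       # r + gamma * Q(s', a')
    Q[s, a] += alpha * (target - Q[s, a])

# Usage: Q = np.zeros((16, 4)); sarsa_update(Q, s=0, a=1, r=-1, s_next=4, a_next=2)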
SARSA vs Q-learning→
• The significant distinction between SARSA and Q-Learning is that the maximum
reward for the following state is not necessarily used for updating the Q-values. Instead, a
new action, and therefore a new reward, is selected using the same policy that determined
the original action.
• In SARSA, the agent begins in state 1, performs action 1, and gets a reward (reward
1). Now it is in state 2, performs another action (action 2) and gets the reward
from this state (reward 2), before it goes back and updates the value of action 1
performed in state 1.
• In contrast, in Q-learning the agent begins in state 1, performs action 1 and gets a
reward (reward 1), and then looks at what the maximum possible reward for an
action in state 2 is, and uses that to update the action value of performing action 1 in
state 1. So the difference is in the way the future reward is found. In Q-learning it is
simply the highest action value obtainable from state 2, and in SARSA it is
the value of the action that was actually taken.
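The difference shows up directly in the two update targets. A tiny sketch, with placeholder values for the table and transition:

import numpy as np

# Same illustrative Q-table shape as before; the values here are placeholders.
Q = np.zeros((16, 4))
s_next, a_next, r, gamma = 5, 2, 10.0, 0.9
q_learning_target = r + gamma * np.max(Q[s_next])    # off-policy: best action in the next state
sarsa_target = r + gamma * Q[s_next, a_next]         # on-policy: action actually taken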