Unit-5 ML Notes
In the above image, the agent is at the very first block of the maze. The maze consists of an S6 block, which is a wall, an S8 block, which is a fire pit, and an S4 block, which is a diamond block.
The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the S4 block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward. It can take four actions: move up, move down, move left, and move right.
The agent can take any path to reach the final point, but it needs to do so in the fewest possible steps. Suppose the agent follows the path S9-S5-S1-S2-S3; it will then get the +1 reward.
The agent will try to remember the preceding steps it has taken to reach the final state. To memorize the steps, it assigns the value 1 to each previous step. Consider the below step:
Now, the agent has successfully stored the previous steps by assigning the value 1 to each previous block. But what will the agent do if it starts moving from a block that has a value-1 block on both sides? Consider the below diagram:
It will be difficult for the agent to decide whether to go up or down, as each block has the same value. So the above approach is not suitable for the agent to reach the destination.
Hence, to solve this problem, we will use the Bellman equation, which is the main concept behind reinforcement learning.
Now, we will move further to the 6th block, and here the agent may change its route because it always tries to find the optimal path. So now, let's consider the block next to the fire pit.
Now, the agent has three options to move: if it moves to the blue box, it will bump into the wall, and if it moves to the fire pit, it will get the -1 reward. But here we are considering only positive rewards, so it will move upwards only. The complete block values will be calculated using this formula. Consider the below image:
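The referenced image is not reproduced in these notes; for reference, the standard form of the Bellman equation used to assign these state values is:

V(s) = max_a [ R(s, a) + γ Σ_s' P(s'|s, a) V(s') ]

where R(s, a) is the reward for taking action a in state s, γ is the discount factor, P(s'|s, a) is the transition probability, and s' is the next state.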
In the equation, we have various components, including the reward, the discount factor (γ), the transition probability, and the end state s'. But no Q-value is given yet, so first consider the below image:
In the above image, we can see there is an agent who has three value options: V(s1), V(s2), and V(s3). As this is an MDP, the agent only cares about the current state and the future state. The agent can go in any direction (up, left, or right), so it needs to decide where to go for the optimal path. Here the agent will move on a probabilistic basis and change its state. But if we want some exact moves, then we need to make some changes in terms of the Q-value. Consider the below image:
Q represents the quality of the actions at each state. So instead of using a value at each state, we will use a pair of state and action, i.e., Q(s, a). The Q-value specifies which action is more lucrative than the others, and according to the best Q-value, the agent takes its next move. The Bellman equation can be used for deriving the Q-value.
To perform any action, the agent will get a reward R(s, a), and it will also end up in a certain state, so the Q-value equation will be:
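The equation image is omitted in these notes; the standard Q-value form of the Bellman equation is:

Q(s, a) = R(s, a) + γ Σ_s' P(s'|s, a) max_a' Q(s', a')

i.e., the immediate reward plus the discounted value of the best action available from the resulting state.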
Difference between Reinforcement Learning and Supervised Learning:
• Reinforcement Learning: The RL algorithm works like the human brain works when making some decisions.
• Supervised Learning: Supervised learning works as when a human learns things under the supervision of a guide.
Genetic Algorithms are widely used in different real-world applications, for example, designing electronic circuits, code-breaking, image processing, and artificial creativity.
In this topic, we will explain the Genetic Algorithm in detail, including the basic terminologies used in genetic algorithms, how it works, the advantages and limitations of genetic algorithms, etc.
Before understanding the Genetic Algorithm, let's first go through the basic terminologies needed to better understand this algorithm:
• Population: The population is the subset of all possible or probable solutions that can solve the given problem.
• Chromosomes: A chromosome is one of the solutions in the population for the given problem, and a collection of genes makes up a chromosome.
• Gene: A gene is an element of the chromosome; a chromosome is divided into different genes.
• Allele: An allele is the value given to a gene within a particular chromosome.
• Fitness Function: The fitness function is used to determine the individual's fitness
level in the population. It means the ability of an individual to compete with other
individuals. In every iteration, individuals are evaluated based on their fitness function.
• Genetic Operators: In a genetic algorithm, the best individuals mate to produce offspring better than the parents. Here genetic operators play a role in changing the genetic composition of the next generation.
• Selection: After calculating the fitness of every individual in the population, a selection process determines which individuals in the population will get to reproduce and create the offspring that will form the next generation.
So, now we can define a genetic algorithm as a heuristic search algorithm to solve optimization
problems. It is a subset of evolutionary algorithms, which is used in computing. A genetic
algorithm uses genetic and natural selection concepts to solve optimization problems.
The genetic algorithm works on the evolutionary generational cycle to generate high-quality solutions. These algorithms use different operations that either enhance or replace the population to produce an improved, fitter solution.
It basically involves five phases to solve the complex optimization problems, which are given
as below:
o Initialization
o Fitness Assignment
o Selection
o Reproduction
o Termination
1. Initialization
The process of a genetic algorithm starts by generating a set of individuals, which is called the population. Here each individual is a candidate solution for the given problem. An individual is characterized by a set of parameters called genes. Genes are combined into a string to generate a chromosome, which represents the solution to the problem. One of the most popular techniques for initialization is the use of random binary strings, as sketched below.
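As a minimal sketch (assuming individuals are encoded as binary strings; the population size and chromosome length are illustrative), initialization might look like this in Python:

import random

def init_population(pop_size, chromosome_length):
    # Each individual is a random binary string (a list of 0/1 genes)
    return [[random.randint(0, 1) for _ in range(chromosome_length)]
            for _ in range(pop_size)]

population = init_population(pop_size=10, chromosome_length=8)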
2. Fitness Assignment
The fitness function is used to determine how fit an individual is, i.e., the ability of an individual to compete with other individuals. In every iteration, individuals are evaluated based on their fitness function. The fitness function provides a fitness score to each individual. This score further determines the probability of being selected for reproduction: the higher the fitness score, the greater the chances of getting selected for reproduction.
3. Selection
The selection phase involves selecting individuals for the reproduction of offspring. All the selected individuals are then arranged in pairs of two for reproduction. These individuals transfer their genes to the next generation.
4. Reproduction
After the selection process, the creation of a child occurs in the reproduction step. In this step,
the genetic algorithm uses two variation operators that are applied to the parent population. The
two operators involved in the reproduction phase are given below:
Crossover: Crossover plays the most significant role in the reproduction phase of the genetic algorithm. In this process, a crossover point is selected at random within the genes. Then the crossover operator swaps the genetic information of two parents from the current generation to produce a new individual representing the offspring.
The genes of the parents are exchanged among themselves until the crossover point is reached. These newly generated offspring are added to the population. This process is also called recombination.
Types of crossover styles available include one-point crossover, two-point crossover, and uniform crossover.
Mutation
The mutation operator inserts random genes in the offspring (new child) to maintain the
diversity in the population. It can be done by flipping some bits in the chromosomes.
Mutation helps in solving the issue of premature convergence and enhances diversification.
The below image shows the mutation process:
Types of mutation styles available include flip-bit mutation, swap mutation, and scramble mutation.
5. Termination
After the reproduction phase, a stopping criterion is applied as the basis for termination. The algorithm terminates once the threshold fitness solution is reached, and it identifies the final solution as the best solution in the population. A compact end-to-end sketch follows.
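Putting the five phases together, here is a minimal, self-contained Python sketch. The OneMax objective (maximize the number of 1-bits), the tournament selection, and all parameter values are assumptions made purely for illustration:

import random

POP_SIZE, CHROM_LEN, MUT_RATE, GENERATIONS = 20, 16, 0.05, 100

def fitness(chrom):
    # OneMax: fitness is simply the number of 1-genes (illustrative objective)
    return sum(chrom)

def select(population):
    # Tournament selection: the fitter of two random individuals wins
    a, b = random.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    # One-point crossover: child takes p1's genes up to a random point, then p2's
    point = random.randint(1, CHROM_LEN - 1)
    return p1[:point] + p2[point:]

def mutate(chrom):
    # Flip each bit with a small probability to maintain diversity
    return [1 - g if random.random() < MUT_RATE else g for g in chrom]

# 1. Initialization: random binary strings
population = [[random.randint(0, 1) for _ in range(CHROM_LEN)]
              for _ in range(POP_SIZE)]

for gen in range(GENERATIONS):
    # 2.-4. Fitness assignment, selection, and reproduction
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]
    best = max(population, key=fitness)
    # 5. Termination: stop once the threshold fitness is reached
    if fitness(best) == CHROM_LEN:
        break

print("Best solution:", best, "fitness:", fitness(best))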
Limitations of Genetic Algorithms
o Genetic algorithms are not efficient for solving simple problems.
o It does not guarantee the quality of the final solution to a problem.
o Repetitive calculation of fitness values may generate some computational challenges.
Difference between Genetic Algorithms and Traditional Algorithms
o A search space is the set of all possible solutions to the problem. A traditional algorithm maintains only one set of solutions, whereas a genetic algorithm can use several sets of solutions in the search space.
o Traditional algorithms need more information in order to perform a search, whereas
genetic algorithms need only one objective function to calculate the fitness of an
individual.
o Traditional algorithms cannot work in parallel, whereas genetic algorithms can (calculating the fitness of the individuals is independent).
o One big difference is that genetic algorithms do not operate directly on candidate solutions; instead, they operate on their representations (or encodings), frequently referred to as chromosomes.
o Traditional Algorithms can only generate one result in the end, whereas Genetic
Algorithms can generate multiple optimal results from different generations.
o A traditional algorithm is not likely to generate optimal results, whereas genetic algorithms, although they do not guarantee a globally optimal result, have a good chance of finding a near-optimal result for a problem because they use genetic operators such as crossover and mutation.
o Traditional algorithms are deterministic in nature, whereas Genetic algorithms are
probabilistic and stochastic in nature.
Some other Topics
Principal Component Analysis (PCA)
PCA generally tries to find the lower-dimensional surface onto which to project the high-dimensional data. PCA works by considering the variance of each attribute, because a high-variance attribute shows a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels. It is a feature extraction technique, so it retains the important variables and drops the least important ones.
o Dimensionality: It is the number of features or variables present in the given dataset. Put simply, it is the number of columns present in the dataset.
o Correlation: It signifies how strongly two variables are related to each other, such that if one changes, the other variable also changes. The correlation value ranges from -1 to +1: -1 occurs if the variables are inversely proportional to each other, and +1 indicates that the variables are directly proportional to each other.
o Orthogonal: It means that the variables are not correlated with each other, and hence the correlation between a pair of variables is zero.
o Eigenvectors: If there is a square matrix M and a non-zero vector v, then v is an eigenvector of M if Mv is a scalar multiple of v.
o Covariance Matrix: A matrix containing the covariance between the pair of variables is
called the Covariance Matrix.
Principal Components in PCA
As described above, the transformed new features, i.e., the output of PCA, are the Principal Components. The number of these PCs is either equal to or less than the number of original features present in the dataset. Some properties of these principal components are given below:
• The principal component must be a linear combination of the original features.
• These components are orthogonal, i.e., the correlation between a pair of variables is zero.
• The importance of each component decreases when going from 1 to n; it means the 1st PC has the most importance, and the nth PC will have the least importance.
1. Getting the dataset Firstly, we need to take the input dataset and divide it into two
subparts X and Y, where X is the training set, and Y is the validation set.
2. Representing data into a structure Now we will represent our dataset in a structure, for example, as the two-dimensional matrix of the independent variable X. Here each row corresponds to a data item, and each column corresponds to a feature. The number of columns is the dimensionality of the dataset.
3. Standardizing the data In this step, we will standardize our dataset. For instance, in a particular column, the features with high variance are more important compared to the features with lower variance. If the importance of features is to be independent of the variance of the feature, then we divide each data item in a column by the standard deviation of the column. Here we will name the resulting matrix Z.
4. Calculating the Covariance of Z To calculate the covariance of Z, we will take the matrix Z and transpose it; after transposing, we will multiply it by Z. The output matrix will be the covariance matrix of Z.
5. Calculating the Eigen Values and Eigen Vectors Now we need to calculate the eigenvalues and eigenvectors for the resultant covariance matrix of Z. Eigenvectors of the covariance matrix are the directions of the axes carrying the most information, and the corresponding eigenvalues give the amount of variance (information) along each of those directions.
6. Sorting the Eigen Vectors In this step, we will take all the eigenvalues and sort them in decreasing order, i.e., from largest to smallest, and simultaneously sort the eigenvectors accordingly in a matrix P. The resultant matrix will be named P*.
7. Calculating the new features or Principal Components Here we will calculate the new features. To do this, we multiply the Z matrix by the P* matrix. In the resultant matrix Z*, each observation is a linear combination of the original features, and each column of Z* is independent of the others.
8. Remove less important features from the new dataset The new feature set has been obtained, so we decide here what to keep and what to remove: we only keep the relevant or important features in the new dataset, and the unimportant features are removed. A NumPy sketch of the whole procedure follows.
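A minimal NumPy sketch of steps 2-8 (the small data matrix and the choice of keeping two components are illustrative assumptions):

import numpy as np

# Illustrative data: 5 observations (rows) and 3 features (columns)
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.1],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.3]])

# Step 3: standardize each column (mean-center, then divide by the std. deviation)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 4: covariance matrix of Z (Z transposed, multiplied by Z)
cov = Z.T @ Z / (len(Z) - 1)

# Step 5: eigenvalues and eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 6: sort eigenvectors by decreasing eigenvalue to form P*
order = np.argsort(eigvals)[::-1]
P_star = eigvecs[:, order]

# Step 7: new features (principal components): Z* = Z @ P*
Z_star = Z @ P_star

# Step 8: keep only the most important components (here, the first two)
Z_reduced = Z_star[:, :2]
print(Z_reduced)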
Ant Colony Optimization (ACO)
After finding food, the ant carries some of it back to the colony, and while tracking the returning path, it deposits pheromone on the ground. The ant following the shorter path will reach the colony earlier.
When a third ant wants to go out searching for food, it will follow the path with the shorter distance based on the pheromone level on the ground. As the shorter path has more pheromone than the longer one, the third ant will follow the path with more pheromone.
By the time the ant following the longer path has returned to the colony, more ants have already followed the path with the higher pheromone level. When another ant tries to reach the destination (food) from the colony and finds that each path has the same pheromone level, it randomly chooses one; let's consider that it chooses the upper one (in the picture located below).
Repeating this process again and again, after some time the shorter path has a higher pheromone level than the others and a higher probability of being followed, and eventually all ants will follow the shorter path.
For solving different problems with ACO, there are three different proposed versions of the Ant System:
Ant Density and Ant Quantity: Pheromone is updated on each movement of an ant from one location to another.
Ant Cycle: Pheromone is updated after all ants have completed their tour.
Let's see the pseudocode for applying the ant colony optimization algorithm. An artificial ant is made for finding the optimal solution. In the first step of solving a problem, each ant generates a solution. In the second step, the paths found by different ants are compared. And in the third step, the path values, i.e., the pheromone, are updated.
procedure ACO_MetaHeuristic is
    while not terminated do
        generateSolutions()
        daemonActions()
        pheromoneUpdate()
    end while
end procedure
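To make these three steps concrete, here is a minimal Python sketch of ACO applied to a tiny traveling salesman instance. The distance matrix, colony size, and all parameter values are assumptions chosen purely for illustration:

import random

# Tiny symmetric TSP instance (illustrative distances between 4 cities)
DIST = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0]]
N = len(DIST)
# Pheromone weight, heuristic weight, evaporation rate, deposit constant
ALPHA, BETA, RHO, Q = 1.0, 2.0, 0.5, 100.0
pheromone = [[1.0] * N for _ in range(N)]

def tour_length(tour):
    return sum(DIST[tour[i]][tour[(i + 1) % N]] for i in range(N))

def build_tour():
    # Each ant builds a tour city by city; the next city is chosen with
    # probability proportional to pheromone^ALPHA * (1/distance)^BETA
    tour = [random.randrange(N)]
    while len(tour) < N:
        cur = tour[-1]
        choices = [c for c in range(N) if c not in tour]
        weights = [pheromone[cur][c] ** ALPHA * (1.0 / DIST[cur][c]) ** BETA
                   for c in choices]
        tour.append(random.choices(choices, weights)[0])
    return tour

best = None
for iteration in range(50):
    tours = [build_tour() for _ in range(10)]          # generateSolutions()
    cand = min(tours, key=tour_length)                 # daemonActions(): track best-so-far
    if best is None or tour_length(cand) < tour_length(best):
        best = cand
    for i in range(N):                                 # pheromoneUpdate(): evaporation
        for j in range(N):
            pheromone[i][j] *= (1 - RHO)
    for tour in tours:                                 # pheromoneUpdate(): deposit
        deposit = Q / tour_length(tour)
        for i in range(N):
            a, b = tour[i], tour[(i + 1) % N]
            pheromone[a][b] += deposit
            pheromone[b][a] += deposit

print("Best tour:", best, "length:", tour_length(best))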
There are many optimization problems where you can use ACO for finding the optimal solution.
Some of them are:
1. Capacitated vehicle routing problem
2. Stochastic vehicle routing problem (SVRP)
3. Vehicle routing problem with pick-up and delivery (VRPPD)
4. Group-shop scheduling problem (GSP)
5. Nursing time distribution scheduling problem
6. Permutation flow shop problem (PFSP)
7. Frequency assignment problem
8. Redundancy allocation problem
9. Traveling salesman problem (TSP)
Let's see the mathematical terms of ACO (typically for a TSP problem).
Pheromone update
The pheromone update equation (shown as an image in the original notes) is the standard Ant System rule:

τ_xy ← (1 − ρ) τ_xy + Σ_k Δτ_xy^k

The left side of the equation, τ_xy, indicates the amount of pheromone on the given edge (x, y); ρ is the rate of pheromone evaporation; and the last term on the right side, Σ_k Δτ_xy^k, indicates the amount of pheromone deposited by the ants that traversed the edge.
Particle Swarm Optimization (PSO)
I am sure each one of us has at some point heard from our well-wishers, "Be with good company. It helps you to cultivate good qualities." When we speak about a 'good company,' we discuss the unequal distribution of good qualities among group members to achieve a better common goal. It is the reason we always say 'work as a team.' The Particle Swarm Optimization (PSO) algorithm is based on that. In 1995, Kennedy and Eberhart wrote a research paper on the social behavior of animal groups, where they stated that sharing information within the group increases the survival advantage. For example, a bird searching for food randomly can optimize its search if it works with the flock; the advantage of working together is the mutual sharing of the best information, which can help the flock discover the best place to hunt.
Group optimization and Ensemble Learning
Many of you have heard about the 'No Free Lunch' (NFL) theorem in machine learning. It says that no single model works best for all possible situations; we can also say that all optimization algorithms perform equally well when averaged across all potential problems. That last statement isn't self-explanatory with the example of a flock of birds. Why do we need optimization in machine learning or deep learning? To train a model, we must define a loss function to measure the difference between our model's predictions and the actual values. Our objective is to minimize or optimize this loss function so that it gets closer to 0. Maybe you have heard about a term called 'Ensemble Learning.' If you have not, then let me explain. 'Ensemble' is a French word meaning 'assembly.' It speaks about learning in a group or crowd; it is as if you are trying to train a model with the help of multiple algorithms. So, what type of benefit are we going to get here? A single base learner is a weak learner, but when we combine all these weak learners, they become a strong learner, because their predictive power, accuracy, and precision are high and the error rate is low. We call this type of combined model 'meta-learning' in machine learning: it refers to learning algorithms that can learn from other learning algorithms. It decreases variance, decreases bias, and improves prediction. When you achieve that, that's your ultimate 'Nirvana' moment as a data analyst.
Now let's come back to our PSO model. The concept of swarm intelligence inspired PSO. Here we are speaking about finding the optimal solution in a high-dimensional solution space: maximizing gains or minimizing losses. So, we are looking to maximize or minimize a function to find the optimum solution. A function can have multiple local maxima and minima, but there can be only one global maximum and one global minimum. If your function is very complex, then finding the global maximum can be a very daunting task. PSO tries to capture the global maximum or minimum; even though it cannot capture the exact global maximum/minimum, it gets very close to it. That is why we call PSO a heuristic model.
Let me give you an example of why finding the global maximum/minimum is problematic. Check the function below:
y = f(x) = sin(x) + sin(x²) + sin(x)cos(x)
We can see that we have one global maximum and one global minimum. If we consider the function on an interval of X-axis values from -4 to 6, we will have a maximum that is not our global maximum; it is a local maximum. So we can say that finding the global maximum may depend upon the interval; it is something like observing only a portion of a continuous function. Also, one thing to note: while describing a dynamic system or entity, you cannot have a static function. The function that I have defined here is fixed. Data analytics is data-hungry: to train a model or to find a suitable mathematical function, you must have enormous amounts of data, and it is impossible to have all the data. That means it is challenging to get the exact global minimum or maximum; for me, it is a limitation of mathematics. Fortunately, we have statistics, which advocates sampling; from a sample, it can approximate a value like the global maximum or minimum of the original function. But again, you won't get the exact global maximum or minimum; you will get values that are close to the actual global maximum or minimum.
Also, when we describe a mathematical function based on some real-life scenario, we must
explain it with multiple variables or higher-dimensional vector space. The growth of bacteria
in a jar may depend upon temperature, humidity, the container, the solvent, etc. For this type
of function, it is even more challenging to get the exact global maximum and minimum. Check the function below, and see how much more difficult it becomes to find the global maximum and minimum as more variables are added.
• Each particle adjusts its traveling velocity dynamically, according to its own flying experience and that of its colleagues in the group.
• Each particle tries to keep track of its personal best position and the global best position found by the swarm.
Let us assume a few parameters first. You will find some new parameters, which I will describe later:
• f: objective function
• Vi: velocity of the particle or agent
• A: population of agents
• W: inertia weight
• C1: cognitive constant
• U1, U2: random numbers
• C2: social constant
• Xi: position of the particle or agent
• Pb: personal best
• gb: global best
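The update equations that the following paragraphs dissect (given here in their standard form, since the original image is omitted) are:

V_i^(t+1) = W V_i^t + C1 U1^t (Pb_i^t − P_i^t) + C2 U2^t (gb^t − P_i^t)
P_i^(t+1) = P_i^t + V_i^(t+1)

The first term is the inertia component, the second is the cognitive (personal) component, and the third is the social (global) component.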
If W = 1, the particle's motion is entirely influenced by its previous motion, so the particle may keep going in the same direction. On the other hand, if 0 ≤ W < 1, such influence is reduced, which means that the particle instead explores other regions of the search domain.
The second term involves the particle's personal best position Pb_i^t and its current position P_i^t. The idea behind this term is that as the particle gets more distant from the Pb_i^t (personal best) position, the difference (Pb_i^t − P_i^t) must increase; hence, this term increases, attracting the particle back to its own best position. The parameter C1, appearing in this product, is a positive constant, and it is an individual-cognition parameter: it weighs the importance of the particle's own previous experiences.
The other hyperparameter composing the product of the second term is U1^t. It is a random parameter in the [0, 1] range. This random parameter plays an essential role in avoiding premature convergence, increasing the likelihood of reaching the global optimum.
The difference (gb^t − P_i^t) works as an attraction for the particles towards the best point found up to iteration t. Likewise, C2 is a social-learning parameter, and it weighs the importance of the global learning of the swarm. And U2^t plays precisely the same role as U1^t.
In the case of C1 = C2 = 0, all particles continue flying at their current speed until they hit the search space's boundary.
In the case of C1 > 0 and C2 = 0, all particles are independent.
In the case of C1 = 0 and C2 > 0, all particles are attracted to a single point in the entire swarm.
In the case of C1 = C2 ≠ 0, all particles are attracted towards the average of pbest and gbest.
Neighbourhood Topologies
A neighborhood must be defined for each particle. This neighborhood determines the extent of
social interaction within the swarm and influences a particular particle’s movement. Less
interaction occurs when the neighborhoods in the swarm are small. For small neighborhoods,
the convergence will be slower, but it may improve the quality of solutions. The convergence will be faster for larger neighborhoods, but there is a greater risk of premature convergence.
For the star topology, each particle is connected with every other particle. It leads to faster convergence than other topologies, and it is easy to find gbest, but it can be biased toward the pbest.
For the wheel topology, only one particle connects to the others, and all information is communicated through this focal particle. The focal particle compares the best performance of all particles in the swarm, adjusts its position towards the best performer, and then informs all the particles of its new position.
For the ring topology, when one particle finds the best result, it passes it to its immediate neighbors, and these two immediate neighbors pass it to their immediate neighbors, until it reaches the last particle. Here the best result found spreads very slowly.
Types of Particle Swarm Optimization
Contour plot
Let's draw a graph of the circle z = x² + y² at fixed heights z = 1, 2, 3, etc. To give you intuition, let's plot the function z = x² + y² in a contour plot; its actual plot and the contour plot will look like below. A sketch of how such a plot could be produced follows.
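A minimal matplotlib sketch of such a contour plot (the plotting range is an illustrative assumption):

import numpy as np
import matplotlib.pyplot as plt

# Grid over an illustrative region of the (x, y) plane
x, y = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
z = x**2 + y**2

# Contour lines of z = x^2 + y^2 at the fixed heights z = 1, 2, 3
plt.contour(x, y, z, levels=[1, 2, 3])
plt.gca().set_aspect("equal")
plt.show()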
Here we can see the function in the region of f(x, y). We can create ten particles at random locations in this region, together with random velocities sampled from a normal distribution with mean 0 and standard deviation 0.1, as follows:
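The code referenced here is omitted in the notes; a sketch consistent with the description (the factor of 5 scaling positions into the plotted region is an assumption) would be:

import numpy as np

n_particles = 10

# Random positions of 10 particles inside the plotted region of f(x, y)
X = np.random.rand(2, n_particles) * 5

# Random velocities sampled from a normal distribution (mean 0, std 0.1)
V = np.random.randn(2, n_particles) * 0.1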
The actual outcome will look like this:
Genetic Algorithms (GAs) and PSO both work with cost functions, they are both iterative, and they both have a random element, so they can be used on similar kinds of problems. The difference between PSO and GAs is that GAs do not traverse the search space like flocking birds, covering the spaces in between. The operation of GAs is more like a Monte Carlo method, where the candidate solutions are randomized and the best solutions are picked to compete with a new set of randomized solutions. Also, PSO algorithms require normalization of the input vectors to reach faster "convergence" (as heuristic algorithms, neither truly converges), while GAs can work with features that are continuous or discrete.