Machine Learning Notes: Concepts, Algorithms
LEARNING
Definition of learning
A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks T, as measured by P, improves with
experience E.
Examples
i) Handwriting recognition learning problem
• Task T: Recognising and classifying handwritten words within images
• Performance P: Percent of words correctly classified
• Training experience E: A dataset of handwritten words with given classifications
Therefore, a computer program which learns from experience is called a machine learning
program or simply a learning program. Such a program is sometimes also referred to as a learner.
Machine Learning
Machine learning enables a machine to automatically learn from data, improve its performance from
experience, and predict things without being explicitly programmed.
A Machine Learning system learns from historical data, builds the prediction models, and
whenever it receives new data, predicts the output for it. The accuracy of predicted output
depends upon the amount of data, as the huge amount of data helps to build a better model which
predicts the output more accurately. Suppose we have a complex problem where we need to
perform some predictions; instead of writing code for it, we just need to feed the data to a
machine learning algorithm, which builds the logic from the data and predicts the output.
Arthur Samuel, an early American leader in the field of computer gaming and artificial
intelligence, coined the term “Machine Learning” in 1959 while at IBM. He defined machine
learning as “the field of study that gives computers the ability to learn without being explicitly
programmed.” However, there is no universally accepted definition for machine learning.
Different authors define the term differently.
Benefit | Application | Description
Enhanced Accuracy and Precision | Medical image analysis for disease diagnosis | ML models detect subtle patterns in medical images with high accuracy, aiding in early diagnosis.
Improved Efficiency and Scalability | Facial recognition in security systems | ML processes large volumes of images quickly, making it ideal for real-time surveillance.
Adaptability and Continuous Learning | Image classification for product categorization in e-commerce | ML models improve over time by adapting to new data, ensuring accurate categorization of new products.
TYPES OF LEARNING
Here are brief definitions for different types of machine learning:
1. Supervised Learning: A type of machine learning where the model is trained on labeled
data, meaning both input and output are provided. Example: Spam email detection.
2. Unsupervised Learning: The model learns patterns and structures from unlabeled data
without explicit outputs. Example: Customer segmentation.
3. Semi-Supervised Learning: Combines aspects of both supervised and unsupervised
learning by using a small amount of labeled data along with a large amount of unlabeled
data. Example: Medical diagnosis with limited labeled samples.
4. Reinforcement Learning: The model learns through trial and error by interacting with
an environment and receiving rewards or penalties. Example: Training an AI to play
chess.
5. Evolutionary Learning: A type of machine learning inspired by natural selection, where
algorithms evolve over generations by selecting the best solutions and applying mutations
or crossovers. Example: Genetic algorithms used for optimization problems.
SUPERVISED LEARNING
Supervised learning is the type of machine learning in which machines are trained using well
"labelled" training data, and on the basis of that data, machines predict the output. Labelled data
means that some input data is already tagged with the correct output.
In supervised learning, the training data provided to the machines works as the supervisor that
teaches the machines to predict the output correctly. It applies the same concept as a student
learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping
function to map the input variable (x) with the output variable (y). In the real world, supervised
learning is applied to tasks such as risk assessment, image classification, fraud detection, and spam filtering.
In supervised learning, models are trained using a labelled dataset, where the model learns about
each type of data. Once the training process is completed, the model is tested on the basis of test
data (data held out from training), and then it predicts the output.
The working of Supervised learning can be easily understood by the below example and
diagram:
Suppose we have a dataset of different types of shapes which includes square, rectangle, triangle,
and Polygon. Now the first step is that we need to train the model for each shape.
• If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
• If the given shape has three sides, then it will be labelled as a triangle.
• If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify
the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies
the shape on the basis of the number of sides and predicts the output.
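To make the train-then-predict flow concrete, here is a minimal sketch (assuming scikit-learn is available) that encodes each shape by two illustrative, made-up features, the number of sides and whether all sides are equal:

```python
# Minimal sketch of the shape example; the features and labels are illustrative.
from sklearn.tree import DecisionTreeClassifier

# Features: [number of sides, all sides equal? (1 = yes, 0 = no)]
X_train = [[4, 1], [4, 0], [3, 0], [6, 1]]
y_train = ["square", "rectangle", "triangle", "hexagon"]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)        # training phase: learn from labelled shapes

# Test phase: a new, unseen shape with four equal sides
print(model.predict([[4, 1]]))     # ['square']
```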
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the
output variable. It is used for the prediction of continuous variables, such as Weather forecasting,
Market Trends, etc. Below are some popular Regression algorithms which come under
supervised learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
2. Classification
Classification algorithms are used when the output variable is categorical, meaning the output
falls into discrete classes such as Yes-No, Male-Female, True-False, etc. Below are some popular
Classification algorithms which come under supervised learning:
• Random Forest
• Decision Trees
• Logistic Regression
• Support vector Machines
Advantages of Supervised Learning
• With the help of supervised learning, the model can predict the output on the basis of
prior experiences.
• In supervised learning, we can have an exact idea about the classes of objects.
• Supervised learning model helps us to solve various real-world problems such as fraud
detection, spam filtering, etc.
Disadvantages of Supervised Learning
• Supervised learning models are not suitable for handling complex tasks.
• Supervised learning cannot predict the correct output if the test data is different from the
training dataset.
• Training requires a lot of computation time.
• In supervised learning, we need enough knowledge about the classes of objects.
The brain is an amazing system that can handle messy and complicated information (like
pictures) and give quick and accurate answers. It’s made up of simple building blocks called
neurons, which send signals when activated. These signals travel through connections called
synapses, creating a huge network of about 100 trillion links. Even as we age and lose neurons,
the brain keeps working well.
Each neuron acts like a tiny decision-maker in a massive network of 100 billion neurons. This
has inspired scientists to create AI systems that try to copy how the brain learns. The brain learns
through plasticity: its ability to change and adapt by modifying the strength of the connections
(synapses) between neurons or by forming new connections altogether. This is how the brain
learns and remembers things. One famous idea, suggested by Donald Hebb in 1949, is that
learning happens when neurons that frequently work together strengthen their connection.
Hebb’s Rule
Hebb's rule is a simple idea: if two neurons fire at the same time repeatedly, their connection
becomes stronger. On the other hand, if they never fire together, their connection weakens and
might disappear. This is how the brain learns to associate things.
Here’s an example: Imagine you always see your grandmother when she gives you chocolate.
Neurons in your brain that recognize your grandmother and neurons that make you happy about
chocolate will fire at the same time. Over time, their connection strengthens. Eventually, just
seeing your grandmother (even in a photo) makes you think of chocolate. This is similar to
classical conditioning, where Pavlov trained dogs to associate a bell with food. When the bell
rang, the dogs eventually salivated even before any food appeared.
This idea is called long-term potentiation or neural plasticity, and it’s a real process in our
brains that helps us learn and form memories.
Scientists have studied neurons and created a mathematical model of them to simplify
understanding. Real neurons are tiny and hard to study, but Hodgkin and Huxley studied large
neurons in squids to measure how they work, earning them a Nobel Prize. Later, McCulloch and
Pitts created a simplified model of a neuron in 1943 that focused on the essential parts.
Imagine the neuron model as a simple flowchart with three main parts:
• Inputs (x₁, x₂, x₃, ...): These are signals coming into the neuron from other neurons.
Think of them as messages or pieces of information.
• Weights (w₁, w₂, w₃, ...): Each input has a weight that represents the strength or
importance of that input. A higher weight means the input has a stronger influence on the
neuron's decision to fire.
Example:
• x1=1 (active)
• x2=0 (inactive)
• x3=0.5 (partially active)
• Weights: w1=1, w2=−0.5, w3=−1
• The neuron adds up all the inputs after they’ve been multiplied by their respective
weights.
• Formula: h = w1x1 + w2x2 + w3x3 + …
• h = (1×1) + (0×(−0.5)) + (0.5×(−1)) = 1 + 0 + (−0.5) = 0.5
• After summing the inputs, the neuron decides whether to "fire" (send a signal) or not
based on a threshold value (θ).
• Decision Rule:
o If h > θ, the neuron fires (output = 1)
o If h ≤ θ, the neuron does not fire (output = 0)
Output
• The result of the activation function is the neuron's output, which can be sent to other
neurons.
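A minimal sketch of this neuron in Python, reusing the worked example above; the threshold θ = 0 is an assumed value, since the text does not fix one:

```python
# McCulloch-Pitts neuron: weighted sum followed by a threshold decision.
def mp_neuron(inputs, weights, theta=0):
    h = sum(x * w for x, w in zip(inputs, weights))  # h = w1*x1 + w2*x2 + ...
    return 1 if h > theta else 0                     # fire only if h > theta

# Values from the example: x = (1, 0, 0.5), w = (1, -0.5, -1) gives h = 0.5
print(mp_neuron([1, 0, 0.5], [1, -0.5, -1]))         # 1, since 0.5 > 0
```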
Key Features:
• Simple Decision-Maker: Despite its simplicity, this model can perform basic decisions
based on input signals.
• Foundation for Neural Networks: Multiple such neurons can be connected to form
complex networks capable of more advanced computations.
• Adjusting Weights: Learning in neural networks involves adjusting these weights to
improve decision-making based on data.
As a real-world analogy, imagine a smart light system:
• Inputs (x₁, x₂, x₃): Different sensors detecting things (like motion, light, sound).
• Weights (w₁, w₂, w₃): The importance of each sensor in deciding whether to turn on the
light.
• Summation (h): Adding up the signals from all sensors.
• Threshold (θ): The level of combined signals needed to decide to turn the light on.
By adjusting the weights, you can make the system more or less sensitive to certain sensors, just
like training a neural network to recognize patterns.
The McCulloch and Pitts (M&P) neuron model is a simplified version of how real neurons work.
While it has been influential in early neural network models, it has several limitations when
compared to actual biological neurons.
1. Simplified Summing: In the McCulloch and Pitts model, inputs to the neuron are simply
added together in a linear fashion. Real neurons, however, may have non-linear
interactions, meaning their inputs don’t just add up but interact in more complex ways.
2. Single Output vs. Spike Train: The M&P neuron produces just one output, either firing
or not firing, based on a threshold. Real neurons, however, send out a series of pulses,
called a "spike train," to represent information. So, real neurons don't just decide whether
to fire or not—they generate a sequence of signals that encode data.
3. Changing Thresholds: In the M&P model, the threshold for firing is constant. In real
neurons, the threshold can change depending on the current state of the organism, like
how much neurotransmitter is available, which influences the neuron’s sensitivity.
4. Asynchronous vs. Synchronous Updates: The M&P model updates neurons in a
regular, clocked sequence (synchronously). Real neurons don't work this way; they
update asynchronously, meaning they fire at different times, influenced by random
factors, not just a regular time cycle.
5. Excitatory and Inhibitory Weights: The M&P model allows weights (connections
between neurons) to change from positive to negative, which isn’t seen in real neurons. In
the brain, synaptic connections are either excitatory (increase the likelihood of firing) or
inhibitory (decrease the likelihood of firing), and they don't switch from one type to the
other.
6. Feedback Loops: Real neurons can have feedback connections where a neuron connects
back to itself. The M&P model typically doesn't include this, although it’s a feature in
some more advanced models.
7. Biological Complexity Ignored: The M&P model focuses on the basic idea of deciding
whether a neuron fires or not, leaving out more complex biological factors, such as
chemical concentrations or refractory periods (the time it takes for a neuron to reset
before firing again).
According to Tom Mitchell, "A computer program is said to learn from experience E with
respect to some class of tasks T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E."
Example: In Spam E-Mail detection,
• Task, T: To classify mails into Spam or Not Spam.
• Performance measure, P: The percentage of mails correctly classified as Spam or Not Spam.
• Training experience, E: A set of mails with given labels (Spam / Not Spam).
Step 1- Choosing the Training Experience: The very important and first task is to choose the
training data or training experience which will be fed to the Machine Learning Algorithm. It is
important to note that the data or experience that we fed to the algorithm must have a
significant impact on the Success or Failure of the Model. So Training data or experience
should be chosen wisely.
Below are the attributes that impact the success or failure of the model:
• Whether the training experience provides direct or indirect feedback regarding the
choices made. For example, while playing chess, the training experience can indicate
that choosing a different move would have increased the chances of success.
• The second important attribute is the degree to which the learner controls the
sequence of training examples. For example, when training data is first fed to the
machine its accuracy is very low, but as it gains experience by playing again and
again with itself or an opponent, the algorithm receives feedback and controls the
chess game accordingly.
• The third important attribute is how well the training experience represents the
distribution of examples over which the final system's performance will be measured.
A machine learning algorithm gains experience by going through a number of
different cases and examples; the more examples it passes through, the more
experience it gains and the more its performance will increase.
Step 2- Choosing the target function: The next important step is choosing the target function. It
means that, according to the knowledge fed to the algorithm, the machine learning system will
choose a NextMove function which describes what type of legal moves should be taken. For
example, while playing chess with the opponent, after the opponent plays, the algorithm
decides which of the possible legal moves to take in order to get success.
Step 5- Final Design: The final design is created at last, when the system has gone through a
number of examples, failures and successes, correct and incorrect decisions, and what the next
step should be. Example: Deep Blue, an intelligent ML-based computer, won a chess game
against the chess expert Garry Kasparov, and it became the first computer to beat a human
chess expert.
One useful perspective on machine learning is that it involves searching a very large space of
possible hypotheses to determine one that best fits the observed data and any prior knowledge
held by the learner. For example, consider the space of hypotheses that could in principle be
output by the above checkers learner. This hypothesis space consists of all evaluation functions
that can be represented by some choice of values for the weights w0 through w6. The learner's
task is thus to search through this vast space to locate the hypothesis that is most consistent with
the available training examples. The LMS algorithm for fitting weights achieves this goal by
iteratively tuning the weights, adding a correction to each weight each time the hypothesized
evaluation function predicts a value that differs from the training value. This algorithm works
well when the hypothesis representation considered by the learner defines a continuously
parameterized space of potential hypotheses.
• What algorithms exist for learning general target functions from specific training
examples? In what settings will particular algorithms converge to the desired function,
given sufficient training data? Which algorithms perform best for which types of
problems and representations?
• How much training data is sufficient? What general bounds can be found to relate the
confidence in learned hypotheses to the amount of training experience and the character
of the learner's hypothesis space?
• When and how can prior knowledge held by the learner guide the process of generalizing
from examples? Can prior knowledge be helpful even when it is only approximately
correct?
• What is the best strategy for choosing a useful next training experience, and how does the
choice of this strategy alter the complexity of the learning problem?
• What is the best way to reduce the learning task to one or more function approximation
problems? Put another way, what specific functions should the system attempt to learn?
Can this process itself be automated?
• How can the learner automatically alter its representation to improve its ability to
represent and learn the target function?
Key Concepts
• Target Concept: The underlying rule or pattern that the model aims to learn.
• Training Data: A set of labeled examples used to train the model. Each example consists
of an input and its corresponding output (label).
• Hypothesis: A proposed rule or function that the model learns from the training data.
• Generalization: The ability of the model to accurately classify new, unseen data based on
the learned concept.
Concept learning involves exploring a hypothesis space to identify the hypothesis that best
explains the training examples. This hypothesis space is implicitly defined by the hypothesis
representation chosen by the learning algorithm designer. By selecting a specific representation,
the designer determines the space of all hypotheses the program can represent and learn.
In the EnjoySport learning task, we aim to find a hypothesis (rule) that determines whether the
weather conditions are favorable for enjoying sports. Let's break it down step by step.
This represents all possible combinations of weather attributes. The attributes and their possible
values are (as in Mitchell's EnjoySport task):
• Sky: Sunny, Cloudy, Rainy (3 values)
• AirTemp: Warm, Cold (2 values)
• Humidity: Normal, High (2 values)
• Wind: Strong, Weak (2 values)
• Water: Warm, Cool (2 values)
• Forecast: Same, Change (2 values)
To find the total number of possible weather conditions (instances in X), multiply the
number of possible values for each attribute:
|X| = 3 × 2 × 2 × 2 × 2 × 2 = 3 × 32 = 96
A hypothesis is a rule that classifies instances as positive or negative. Hypotheses can use
specific values (e.g., "Sunny") or wildcards (?), which mean "any value is fine."
Syntactically distinct hypotheses: in addition to its actual values, each attribute can take 2 more
values: "?" (accepts any value, the most general choice) and "Ø" (rejects every value, the most
specific choice). So Sky has 3 + 2 = 5 options and each of the other five attributes has 2 + 2 = 4.
The total number of syntactically distinct hypotheses is:
|H| = 5 × 4 × 4 × 4 × 4 × 4 = 5120
Some hypotheses, like those containing one or more "Ø" values, classify every instance as
negative and are therefore semantically identical. Counting all of them as a single hypothesis,
and otherwise allowing each attribute its actual values plus "?", the number of semantically
distinct hypotheses becomes:
1 + (4 × 3 × 3 × 3 × 3 × 3)
= 1 + (4 × 243)
= 1 + 972 = 973
After finding all syntactically and semantically distinct hypotheses, we search among them for
the hypothesis that best matches our training examples.
The FIND-S algorithm is a simple way to find a rule (or hypothesis) that matches all the positive
examples in a dataset while ignoring the negative ones. It works step by step, starting with a
very specific rule and gradually making it more general to include all positive examples. Here's
how it works in an easy way:
The FIND-S algorithm is like starting with the most specific guess and slowly relaxing it until it
fits all the examples.
1. Start small: Begin with the most specific rule (e.g., "Only this exact weather works").
2. Fix the rule: For each good (positive) example, check if your rule matches it:
o If it does, great—do nothing!
o If it doesn’t, make the rule a bit more general (e.g., "Okay, maybe it works if the wind
isn’t strong").
3. Finish: When you’re done, you have a rule that matches all the good examples.
Example:
Imagine you're trying to figure out what kind of weather makes you enjoy playing a sport, using
this data (the standard EnjoySport examples):
Sky Temp Humidity Wind Water Forecast Enjoy?
Sunny Warm Normal Strong Warm Same Yes
Sunny Warm High Strong Warm Same Yes
Rainy Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes
• Start with the most specific rule: h = (Ø, Ø, Ø, Ø, Ø, Ø), which means "nothing is accepted
yet."
• Look at the first positive example: (Sunny, Warm, Normal, Strong, Warm, Same)
o Rule becomes: h=(Sunny, Warm, Normal, Strong, Warm, Same)
• Look at the second positive example: (Sunny, Warm, High, Strong, Warm, Same)
o Update the rule to match both examples:
h=(Sunny, Warm, ?, Strong, Warm, Same)
• Ignore the negative example.
• Look at the fourth positive example: (Sunny, Warm, High, Strong, Cool, Change)
o Update the rule again: h=(Sunny, Warm, ?, Strong, ?, ?)
Final rule: (Sunny, Warm, ?, Strong, ?, ?). This means you enjoy playing sports if it’s sunny,
warm, and windy, regardless of the other conditions.
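A minimal sketch of FIND-S in Python on the same four examples; initializing from the first positive example is equivalent to starting from the all-Ø hypothesis:

```python
# FIND-S: generalize the most specific hypothesis over positive examples only.
def find_s(examples):
    h = None                               # stands for (Ø, Ø, Ø, Ø, Ø, Ø)
    for x, label in examples:
        if label != "Yes":
            continue                       # negative examples are ignored
        if h is None:
            h = list(x)                    # first positive example: copy it
        else:                              # replace mismatched values with '?'
            h = [a if a == b else "?" for a, b in zip(h, x)]
    return h

data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), "Yes"),
]
print(find_s(data))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```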
Limitations of FIND-S
• Good for Clean Data: Works well if the data is perfect (no mistakes or noise).
• Ignores Negatives: It doesn't use negative examples to refine the rule.
• May Miss Other Rules: If there are multiple valid rules, it picks the most specific one
but doesn’t explore other options.
In short, FIND-S is like a detective who focuses only on positive clues and tries to make the
simplest case for what’s true!
VERSION SPACES
A version space is a set of all hypotheses (rules) that are consistent with the given training data.
It represents everything the learner currently knows about the target concept.
The "version" refers to different possibilities or hypotheses that might explain the data. The
space includes:
How It Works:
A version space has two boundaries:
1. Specific boundary (S): The most specific hypotheses consistent with the data.
2. General boundary (G): The most general hypotheses consistent with the data.
Why version spaces are useful:
• Efficient Representation: Instead of listing all hypotheses, it tracks only the boundaries
S and G.
• Keeps Track of Knowledge: Helps understand what the learner knows and doesn’t know
yet.
• Flexible Search: Allows for adding or removing examples to refine the boundaries.
A version space is the range of hypotheses consistent with the training data, bounded by the
most specific (S) and the most general (G) hypotheses. It narrows down as you process more
examples, zeroing in on the true concept.
It’s a way to learn rules for when something happens (like "Play Sport = Yes") by narrowing
down possibilities. The algorithm works by keeping two boundaries:
1. Specific Boundary (S): The most specific rule that only fits positive examples.
2. General Boundary (G): The most general rule that excludes negative examples.
The Data (the same EnjoySport examples):
Sky Temp Humidity Wind Water Forecast Play Sport?
Sunny Warm Normal Strong Warm Same Yes
Sunny Warm High Strong Warm Same Yes
Rainy Cold High Strong Warm Change No
Sunny Warm High Strong Cool Change Yes
Our goal is to find the rule for when "Play Sport" is Yes.
Step-by-Step Execution
Start with Initial S and G:
S0 = ⟨Ø, Ø, Ø, Ø, Ø, Ø⟩ (most specific) and G0 = ⟨?, ?, ?, ?, ?, ?⟩ (most general).
Example 1 is positive, so S generalizes just enough to cover it: S1 = ⟨Sunny, Warm, Normal, Strong, Warm, Same⟩.
Example 2 is another positive example. S is too specific, so we generalize it to fit both positive
examples: S2 = ⟨Sunny, Warm, ?, Strong, Warm, Same⟩.
Example 3 is a negative example. We update G to exclude this negative example while staying as
general as possible.
Example 4 is a positive example. S needs to generalize further to fit this example: S4 = ⟨Sunny, Warm, ?, Strong, ?, ?⟩.
Why does G become ⟨Sunny, ?, ?, ?, ?, ?⟩?
1. General Boundary (G) starts very broad (because it initially includes everything).
2. After processing positive examples, G gets refined to include only the conditions that
must be true for playing sports ("Yes").
3. The Sky condition (Sunny) is the only attribute that must always be true in the general
rule.
4. The other attributes (Temperature, Humidity, Wind, Water, Forecast) can be anything
since the general rule still covers all the positive examples we’ve seen so far.
When we look at all the positive examples (the ones where Play Sport = Yes), we find that they
all have Sky = Sunny. Since G should cover all positive examples, we make Sky = Sunny a
condition in the rule. But the other attributes (Temperature, Humidity, Wind, etc.) can vary, so
we leave them as wildcards ( ? ).
LINEAR DISCRIMINANTS
Machine learning models are often used to solve supervised learning tasks,
particularly classification problems, where the goal is to assign data points to specific categories
or classes. However, as datasets grow larger with more features, it becomes challenging for
models to process the data effectively. This is where dimensionality reduction techniques like
Linear Discriminant Analysis (LDA) come into play.
LDA not only helps to reduce the number of features but also ensures that the important class-
related information is retained, making it easier for models to differentiate between classes.
Linear Discriminant Analysis (LDA) is a supervised learning technique used for classification
tasks. It helps distinguish between different classes by projecting data points onto a lower-
dimensional space, maximizing the separation between those classes.
The core idea of Linear Discriminant Analysis (LDA) is to find a new axis that best separates
different classes by maximizing the distance between them. LDA achieves this by reducing the
dimensionality of the data while retaining the class-discriminative information.
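A minimal sketch of LDA as a dimensionality-reduction step, assuming scikit-learn is available; the Iris dataset stands in here for any labelled dataset:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)        # 150 samples, 4 features, 3 classes

# With 3 classes, LDA can project onto at most 2 discriminant axes.
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)      # supervised: uses the labels y
print(X_reduced.shape)                   # (150, 2)
```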
Applications of LDA:
1. Face Recognition: LDA helps extract features from facial images, classifying them based on
individuals. It is commonly used in biometric systems to identify or verify users.
2. Disease Diagnosis in Healthcare: LDA is used to analyze medical data for classifying diseases,
such as distinguishing between different stages of cancer or predicting the presence of heart
disease.
3. Credit Risk Assessment in Finance: Financial institutions use LDA to assess credit risk by
analyzing customer data to predict the likelihood of loan defaults or creditworthiness.
Perceptrons are based on biological neurons. Originally proposed in 1957, perceptrons were
one of the earliest components of artificial neural networks. The structure of the perceptron is
based on the anatomy of neurons. Neurons have several parts, but for our purposes, the most
important parts are the dendrites, which receive inputs from other neurons, and the axon, which
produces outputs.
Neuron Activation
Neurons “fire” – that is, produce an output – in an all or nothing way. The outputs of a neuron
are essentially 0 or 1. On or off. A neuron will “fire” if the input signals at the dendrites are
sufficiently large, collectively. If the amount of input signal at the dendrites is high enough, the
neuron will “fire” and produce an output. But if the amount of input signal is insufficient, the
neuron will not produce an output. Put simply, the neuron sums up the inputs, and if the collective
input signals meet a certain threshold, then it will produce an output. If the collective input
signals are under the threshold, it will not produce an output. This is, of course, a very simple
explanation of how a neuron works (because they are very complex at the chemical level), but
it’s roughly accurate.
What is Perceptron?
The Perceptron Learning is a fundamental concept in machine learning and serves as one of the
simplest types of artificial neural networks. It is primarily used for binary classification tasks
and is based on the idea of learning a linear decision boundary to separate data points into two
classes. The perceptron algorithm was introduced by Frank Rosenblatt in 1958. It operates on a
set of input features and produces an output that is either 1 or −1 (or 0 depending on the
implementation). The model is trained iteratively, adjusting its weights based on the error
between predicted and actual labels.
1. Input Features: Take a vector of input features (x1, x2, …, xn) from the dataset.
2. Compute Weighted Sum: Calculate z = w1x1 + w2x2 + … + wnxn + b.
3. Apply Activation Function: Use the step function to decide the output (1 or 0).
4. Update Weights (During Training):
• If the predicted output is incorrect, adjust the weights and bias using the Perceptron
Learning Algorithm.
A perceptron is like a very basic "brain" for a machine. It looks at input data (numbers) and
makes a decision: Class A or Class B (e.g., "yes" or "no").
• Imagine you have some input features (e.g., x1, x2) like:
• Each input has a weight (w1, w2) that tells the perceptron how important that input is.
• Multiply each input by its weight, and add them all together. Then, add a bias (b), which is like a
nudge to adjust the sum.
• Use a simple rule: If the weighted sum (z) is positive, output 1 (e.g., "yes").
• If it’s negative or zero, output −1 (e.g., "no").
• Compare the perceptron’s guess (ŷ) with the actual answer (y).
• If it’s correct, you’re good! If it’s wrong, adjust the weights and bias.
Step 6: Repeat
• Go through the dataset multiple times, adjusting weights and bias each time the perceptron makes
a mistake.
Key Features
• Linear Model: The perceptron can only separate data that is linearly separable.
• Supervised Learning: It requires labeled data for training.
• Binary Classification: It predicts one of two possible classes (1 or −1).
Strengths
Limitations
• Cannot Handle Non-linear Data: It fails when data is not linearly separable.
• Binary Outputs: Limited to binary classification tasks.
• Sensitive to Feature Scaling: Requires normalization or scaling for effective learning.
EXAMPLE: Here's an example of a perceptron learning the logical AND function, whose truth
table is:
x1 x2 AND Output y
0 0 0
0 1 0
1 0 0
1 1 1
For each input, compute the weighted sum z = w1x1 + w2x2 + b, apply the step function, and
update the weights and bias whenever the prediction is wrong; inputs that are already classified
correctly need no update. Repeating this over the truth table until every row is classified
correctly yields the final weights (see the sketch below).
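A minimal sketch of this training loop in Python; since the original initial parameters are not shown above, zero weights, zero bias, and a learning rate of 1 are assumed:

```python
# Perceptron learning of the AND function (assumed initialization).
def step(z):
    return 1 if z > 0 else 0

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w1, w2, b, lr = 0.0, 0.0, 0.0, 1.0

for epoch in range(10):                  # a few passes over the truth table
    for (x1, x2), y in data:
        y_hat = step(w1 * x1 + w2 * x2 + b)
        error = y - y_hat
        if error:                        # update only on mistakes
            w1 += lr * error * x1
            w2 += lr * error * x2
            b += lr * error

print(w1, w2, b)   # converges to 2.0 1.0 -2.0, which realizes AND
```

You can check the result against the truth table: only x1 = x2 = 1 gives 2 + 1 + (−2) = 1 > 0, so only that input fires.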
LINEAR SEPARABILITY
Linear separability means that you can draw a straight line (or a flat surface, or a hyperplane)
that separates two groups of data points perfectly without any overlap.
You can draw a straight line between the two classes, and the points on one side belong to Class
1, while the points on the other side belong to Class 2.
This line is the decision boundary, and the data is linearly separable because the line separates
the two classes without overlap.
Real-World Applications
1. Image Classification:
o Linear separability is rare; deep learning handles non-linear boundaries.
2. Medical Diagnosis:
o Linearly separable cases may involve straightforward conditions; complex
diseases often require advanced methods.
3. Spam Detection:
o Simple keyword-based filters assume linear separability, while modern techniques
use non-linear models.
The concept of linear separability helps us decide which machine learning algorithms to use.
Some algorithms work well when the data is linearly separable, while others are better for more
complex, non-linearly separable data.
• Linear Models (e.g., Perceptron, SVM): These work best when the data is linearly
separable. They try to find the straightest line or plane to divide the data.
• Non-Linear Models (e.g., Neural Networks, Decision Trees): These are more flexible
and can handle non-linearly separable data. They can create complex decision
boundaries.
For example, the OR function is linearly separable; a single straight line can separate its outputs:
x1 x2 OR Output y
0 0 0
0 1 1
1 0 1
1 1 1
The XOR function, in contrast, is not linearly separable; no single straight line separates its two classes:
x1 x2 XOR Output y
0 0 0
0 1 1
1 0 1
1 1 0
Advantages of Linear Separability
1. Simplicity:
o Linear separability allows using simple models with fewer parameters.
2. Faster Training:
o Models converge quickly during training due to straightforward optimization.
3. Interpretability:
o Easy to visualize and understand the decision boundary.
4. Optimal Solution:
o Algorithms like SVM find the maximum margin boundary, ensuring optimal
performance for separable data.
5. Good Generalization:
o Models are less likely to overfit due to their simplicity.
Limitations of Linear Separability
1. Limited Applicability:
o Many real-world datasets are not linearly separable.
2. Lack of Flexibility:
o Cannot capture complex patterns in the data.
3. Over-Simplification:
o May miss subtle relationships or nuances.
4. Sensitive to Noise:
o Outliers or noisy data near the boundary can disrupt the model.
5. Feature Dependence:
o Requires feature transformations for non-linearly separable data.
6. Failure for Non-Linearly Separable Data:
o Cannot separate inherently non-linear datasets without additional techniques.
LINEAR REGRESSION
Linear regression is a fundamental supervised learning algorithm used in machine learning for
modeling the relationship between one or more independent variables (features) and a dependent
variable (target). The goal is to find the best-fit line (or hyperplane in higher dimensions) that
minimizes the error in predicting the dependent variable.
For example, in a dataset of house prices, the target House Price is the dependent variable
represented by Y, and the feature, Square Feet, is the independent variable represented by X. The
input features (X) are used to predict the target label (Y). So, the independent variables are also
known as predictor variables, and the dependent variable is known as the response variable.
The main goal of the linear regression model is to find the best-fitting straight line (often called a
regression line) through a set of data points.
A straight line that shows a relation between the dependent variable and independent variables is
known as the line of regression or regression line.
Simple linear regression is a type of regression analysis in which a single independent variable
(also known as a predictor variable) is used to predict the dependent variable. In other words, it
models the linear relationship between the dependent variable and a single independent variable.
The straight line of simple linear regression gives a predicted value Ŷ for each input value X.
The model is:
Y = w0 + w1X + ϵ
Where,
• w0 is the intercept (bias),
• w1 is the slope (the weight of the input feature), and
• ϵ is the random error term.
Multiple linear regression is basically the extension of simple linear regression that predicts a
response using two or more features. The model is expressed as:
Y = w0 + w1X1 + w2X2 + ⋯ + wpXp + ϵ
Where,
• X1, X2, …, Xp are the p independent variables,
• w0 is the intercept and w1, …, wp are the coefficients (weights), and
• ϵ is the random error term.
The main goal of linear regression is to find the best-fit line through a set of data points that
minimizes the difference between the actual values and predicted values. How is this done? By
estimating the parameters w0, w1, etc.
The working of linear regression in machine learning can be broken down into many steps as
follows −
• Hypothesis− We assume that there is a linear relation between input and output.
• Cost Function − Define a loss or cost function. The cost function quantifies the model's
prediction error. The cost function takes the model's predicted values and actual values
and returns a single scalar value that represents the cost of the model's prediction.
• Optimization − Optimize (minimize) the model's cost function by updating the model's
parameters.
In linear regression problems, we assume that there is a linear relationship between input features
(X) and predicted value (Ŷ). The hypothesis function returns the predicted value for a given
input value. Generally we represent a hypothesis by hw(X) and it is equal to Ŷ.
For different values of parameters (weights), we can find many regression lines. The main goal is
to find the best-fit lines.
A regression line is said to be the best fit if the error between actual and predicted values is
minimal.
Consider a regression line with error (ε) at an input data point X. The error is calculated
for all data points, and our goal is to minimize the average error/loss. We can use different types
of loss functions such as mean squared error (MSE), mean absolute error (MAE), L1 loss, L2 loss,
etc.
The error between actual and predicted values can be quantified using a loss or cost function.
The cost function takes the model's predicted values and actual values and returns a
single scalar value that represents the cost of the model's prediction. Our main goal is to
minimize the cost function.
The most commonly used cost function is the mean squared error (MSE) function:
MSE = (1/n) Σ (Yi − Ŷi)²
Where,
• n is the number of training examples,
• Yi is the actual value of the i-th example, and
• Ŷi is the model's predicted value for it.
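A minimal sketch tying the hypothesis, MSE cost, and optimization steps together via gradient descent in NumPy; the data points are made up and satisfy Y = 1 + 2X:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])     # feature (e.g. scaled square feet)
Y = np.array([3.0, 5.0, 7.0, 9.0])     # target (true line: Y = 1 + 2X)

w0, w1, lr = 0.0, 0.0, 0.01            # parameters and learning rate

for _ in range(5000):
    Y_hat = w0 + w1 * X                 # hypothesis h_w(X)
    error = Y_hat - Y
    w0 -= lr * 2 * error.mean()         # gradient of MSE w.r.t. w0
    w1 -= lr * 2 * (error * X).mean()   # gradient of MSE w.r.t. w1

print(round(w0, 2), round(w1, 2))       # approximately 1.0 2.0
```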
Applications of Linear Regression
1. Predictive Modeling: Linear regression is widely used for predictive modeling. For instance,
in real estate, predicting house prices based on features such as size, location, and number of
bedrooms can help buyers, sellers, and real estate agents make informed decisions.
2. Feature Selection: In multiple linear regression, analyzing the coefficients can help in feature
selection. Features with small or zero coefficients might be considered less important and can be
dropped to simplify the model.
3. Financial Forecasting: In finance, linear regression models predict stock prices, economic
indicators, and market trends. Accurate forecasts can guide investment strategies and financial
planning.
4. Risk Management: Linear regression helps in risk assessment by modeling the relationship
between risk factors and financial metrics. For example, in insurance, it can model the
relationship between policyholder characteristics and claim amounts.
Challenges in Linear Regression
1. Overfitting: Overfitting occurs when the regression model performs well on training data but
lacks generalization on test data. Overfitting leads to poor prediction on new, unseen data.
2. Multicollinearity: When the independent variables (predictor or feature variables) are highly
correlated with one another, the situation is known as multicollinearity. In this case, the
estimates of the parameters (coefficients) can be unstable.
3. Outliers and Their Impact: Outliers can cause the regression line to be a poor fit for the
majority of data points.
MULTI-LAYER PERCEPTRON
A Multi-Layer Perceptron (MLP) consists of fully connected dense layers that transform
input data from one dimension to another. It is called “multi-layer” because it contains an input
layer, one or more hidden layers, and an output layer. The purpose of an MLP is to model
complex relationships between inputs and outputs, making it a powerful tool for various
machine learning tasks. MLP (Multi-Layer Perceptron) is primarily used for supervised
learning, as it is a type of artificial neural network that requires labeled data to train and learn
relationships between input features and target outputs, making it suitable for tasks like
classification and regression.
Every connection in the diagram is a representation of the fully connected nature of an MLP.
This means that every node in one layer connects to every node in the next layer. As the data
moves through the network, each layer transforms it until the final output is generated in the
output layer.
1. Weighted Sum: The neuron computes the weighted sum of the inputs: z = w1x1 + w2x2 + … + wnxn + b.
The activation function decides whether a neuron should be activated by calculating the
weighted sum of inputs and adding a bias term. This helps the model make complex
decisions and predictions by introducing non-linearities to the output of each neuron.
Neural networks consist of neurons that operate using weights, biases, and activation
functions.
Without non-linearity, even deep networks would be limited to solving only simple,
linearly separable problems. Activation functions empower neural networks to model
highly complex data distributions and solve advanced deep learning tasks. Adding non-
linear activation functions introduces flexibility and enables the network to learn more
complex and abstract patterns from data.
A common choice is the sigmoid function, f(x) = 1 / (1 + e^(−x)). Here, e is the base of
the natural logarithm (approximately equal to 2.71828), and x is the input to the function.
Another common choice is ReLU (Rectified Linear Unit), which in mathematical terms can be
written as:
f(x) = max(0, x)
Where:
• x is the input to the neuron.
• The function returns x if x is greater than 0.
• If x is less than or equal to 0, the function returns 0.
For a classification problem, the commonly used binary cross-entropy (BCE) loss function is:
BCE = −(1/N) Σ [yi log(ŷi) + (1 − yi) log(1 − ŷi)]
Step 3: Backpropagation
The goal of training an MLP is to minimize the loss function by adjusting the network’s
weights and biases. This is achieved through backpropagation. Both MSE and BCE can be
used in backpropagation. Backpropagation computes gradients of the chosen loss function
(MSE or BCE) and updates the network’s weights using gradient descent.
1. Gradient Calculation: The gradients of the loss function with respect to each
weight and bias are calculated using the chain rule of calculus.
2. Error Propagation: The error is propagated back through the network, layer by
layer.
3. Gradient Descent: The network updates the weights and biases by moving in the
opposite direction of the gradient to reduce the loss
For both regression (MSE loss) and classification (BCE loss), the weights are updated using
the gradient descent formula:
w ← w − η · (∂L/∂w), where η is the learning rate and L is the loss.
• Forward and backward propagation repeat over multiple epochs until the model
converges (i.e., achieves an acceptable error rate).
MLP ALGORITHM:
The Multi-Layer Perceptron (MLP) Algorithm is like training a digital brain to learn patterns
and make predictions.
This section explores practical considerations for using Multi-Layer Perceptrons (MLPs) to solve
real-world problems, focusing on three critical aspects: the amount of training data, the number
of hidden layers, and when to stop learning.
• For the MLP with one hidden layer there are (L + 1) ×M + (M + 1) × N weights, where L,M,N
are the number of nodes in the input, hidden, and output layers, respectively.
• The extra +1s come from the bias nodes, which also have adjustable weights
• This is a potentially huge number of adjustable parameters that we need to set during the
training phase.
• Setting the values of these weights is the job of the back-propagation algorithm, which is
driven by the errors coming from the training data.
• Unfortunately, there is no way to compute what the minimum amount of data required is, since
it depends on the problem.
• A rule of thumb is to use a number of training examples that is at least 10 times the
number of weights.
• This is probably going to be a very large number of examples, so neural network training is a
fairly computationally expensive operation, because we need to show the network all of these
inputs lots of times.
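• As an illustration (with arbitrarily chosen sizes): a network with L = 4 inputs, M = 8 hidden
nodes, and N = 3 outputs has (4 + 1) × 8 + (8 + 1) × 3 = 40 + 27 = 67 weights, so the rule of
thumb suggests roughly 670 training examples.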
• For the number of hidden layers, there are really only two sensible choices: one or two.
• It is possible to show mathematically that one hidden layer with lots of hidden nodes is
sufficient. This is known as the Universal Approximation Theorem.
• We will never normally need more than two layers (that is, one hidden layer and the output
layer).
• The training of the MLP requires that the algorithm runs over the entire dataset many times,
with the weights changing as the network makes errors in each iteration.
• Combining a predetermined amount of training with monitoring of the error can help, as can
terminating the learning once the error stops decreasing.
• We train the network for some predetermined amount of time, and then use the validation set to
estimate how well the network is generalising.
• We then carry on training for a few more iterations, and repeat the whole process.
• At some stage the error on the validation set will start increasing again, because the network
has stopped learning about the function that generated the data, and started to learn about the
noise that is in the data itself.
• We will then apply MLP to find solutions to four different types of problem: Regression,
Classification, Time-series prediction, and Data compression.
Regression:
• If you want to predict a single value, you only need a single output neuron and if you want to
predict multiple values, you can add multiple output neurons.
• In general, we don't apply any activation function to the output layer of MLP, when dealing
with regression tasks, It just does the weighted sum and sends the output.
• But, in case you want your value between a given range, for example, -1 or +1 you can use
activation like Tanh(Hyperbolic Tangent) function.
• The loss functions that can be used in Regression MLP include Mean Squared Error(MSE) and
Mean Absolute Error(MAE).
• MSE can be used in datasets with fewer outliers, while MAE is a good measure in datasets
which has more outliers.
• If the output variable is categorical, then we have to use classification for prediction.
• The aim is to classify iris flowers among three species (Setosa, Versicolor, or Virginica) from
the sepals’ and petals’ length and width measurements.
• The network for this task has one input layer, two hidden layers and one output layer.
• In the hidden layers we use sigmoid as an activation function for all neurons.
• In the output layer, we use softmax as an activation function for the three output neurons.
• In this regard, all outputs are between 0 and 1, and their sum is 1.
• The neural network has three outputs since the target variable contains three classes (Setosa,
Versicolor, and Virginica).
• There is a common data analysis task known as time-series prediction, where we have a set of
data that show how something varies over time, and we want to predict how the data will vary in
the future.
• The problem is that even if there is some regularity in the time-series, it can appear over many
different scales. For example, there is often seasonal variation in temperatures.
• In data compression, we train the network to reproduce the inputs at the output layer; this
is called auto-associative learning.
• The network is trained so that whatever you give as the input is reproduced at the output, which
doesn’t seem very useful at first, but suppose that we use a hidden layer that has fewer neurons
than the input layer.
• This bottleneck hidden layer has to represent all of the information in the input, so that it can
be reproduced at the output.
• It therefore performs some compression of the data, representing it using fewer dimensions
than were used in the input.
• They are finding a different representation of the input data that extracts important components
of the data, and ignores the noise.
• This auto-associative network can be used to compress images and other data.
DERIVING BACK-PROPAGATION
Backpropagation is an algorithm used in artificial intelligence and machine learning to train
artificial neural networks through error correction. The computer learns by calculating the loss
function, the difference between the desired output and the output the network actually produced.
When you apply backpropagation, you work backward from output nodes to input nodes to
reduce the loss function and produce the desired result.
Backpropagation is the process of adjusting a neural network’s weights and biases to reduce
error. It does this by propagating the error backward through the network and updating each
parameter in proportion to its contribution to that error.
We use Mean Squared Error (MSE) loss, which is used when predicting continuous values
(e.g., predicting house prices).
Repeating these steps reduces error over time. By repeating this process, the model gradually
improves and learns the correct weight and bias to minimize the error.
Step 1: Network architecture and Define Input Values and given weights
Step 2: Forward Propagation: We calculate the hidden layer activation, then the output
layer activation.
Note: Hidden layers do have their own weights and biases. The hidden layer does have an
input value, but it comes from the previous layer
Each neuron in a layer is connected to neurons in the previous layer via weights. Every layer
(except the input layer) has:
For a neural network with 1 input, 1 hidden layer, and 1 output layer, the parameters are a
weight and bias for each hidden neuron, plus a weight and bias for the output neuron.
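A minimal numeric sketch of one forward and one backward pass for such a 1-1-1 network with sigmoid activations and MSE loss; all starting values are assumed for illustration:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x, y = 0.5, 1.0            # input and target (assumed)
w1, b1 = 0.4, 0.1          # hidden-layer weight and bias (assumed)
w2, b2 = 0.3, 0.2          # output-layer weight and bias (assumed)
lr = 0.1                   # learning rate

# Forward propagation
h = sigmoid(w1 * x + b1)           # hidden activation
y_hat = sigmoid(w2 * h + b2)       # output activation
loss = (y - y_hat) ** 2            # MSE for one example

# Backward propagation (chain rule)
d_z2 = -2 * (y - y_hat) * y_hat * (1 - y_hat)   # dL/dz2
d_w2, d_b2 = d_z2 * h, d_z2
d_z1 = d_z2 * w2 * h * (1 - h)                  # error sent back to hidden layer
d_w1, d_b1 = d_z1 * x, d_z1

# Gradient descent updates
w2, b2 = w2 - lr * d_w2, b2 - lr * d_b2
w1, b1 = w1 - lr * d_w1, b1 - lr * d_b1
print(round(loss, 4))
```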
A radial basis function (RBF) neural network is a type of artificial neural network that uses radial
basis functions as activation functions. It typically consists of three layers: an input layer, only
one hidden layer, and an output layer. The hidden layer applies a radial basis function, usually
a Gaussian function. RBF neural networks are highly versatile and are extensively used in
pattern classification tasks, function approximation, and a variety of machine learning
applications. They are especially known for their ability to handle non-linear problems
effectively.
• Input layer: This layer simply transmits the inputs to the neurons in the hidden layer.
• Hidden layer: Each neuron in this layer applies a radial basis function to the inputs it
receives. RBF has strictly one hidden layer.
• Output layer: Each neuron in this layer computes a weighted sum of the outputs from
the hidden layer, resulting in the final output.
Working of RBF
• When dealing with non-linear data, we aim to convert it into linearly separable data.
• To achieve this, every hidden layer neuron uses a non-linear radial basis function as the
activation function, transforming the data into a higher-dimensional space.
1. Gaussian RBF: φ(x) = exp(−‖x − c‖² / (2r²))
Where,
x = Input
c = Center
r = Radius
2. Multiquadric RBF: φ(x) = √(‖x − c‖² + r²)
Algorithm of RBF
• Assign weights for each connection from hidden layer to output layer.
• Initially, weights are randomly assigned in the range [-1,1].
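A minimal sketch of an RBF network in NumPy; the centers, radius, and 1-D target function are assumed for illustration, and the output weights are fitted in one shot by least squares rather than by iterative updates:

```python
import numpy as np

def gaussian_rbf(x, c, r):
    return np.exp(-((x - c) ** 2) / (2 * r ** 2))   # Gaussian basis function

X = np.linspace(0, 1, 20)              # 1-D inputs
Y = np.sin(2 * np.pi * X)              # non-linear target

centers = np.array([0.0, 0.25, 0.5, 0.75, 1.0])    # assumed centers
r = 0.2                                             # assumed radius

# Hidden layer: map each input to 5 RBF activations (a higher-dim space)
Phi = np.array([[gaussian_rbf(x, c, r) for c in centers] for x in X])

# Output layer: weighted sum, with weights fitted by least squares
w, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
print(np.abs(Phi @ w - Y).max())       # small residual: good fit
```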
The curse of dimensionality is a common machine learning problem that occurs when a dataset
has many dimensions. This can make it difficult to analyze, organize, and model the data. The
Curse of Dimensionality refers to the various challenges and complications that arise when
analyzing and organizing data in high-dimensional spaces (often hundreds or thousands of
dimensions). In the realm of machine learning, it's crucial to understand this concept because as
the number of features or dimensions in a dataset increases, the amount of data we need to
generalize accurately grows exponentially.
1. Data sparsity: As mentioned, data becomes sparse, meaning that most of the high-
dimensional space is empty. This makes clustering and classification tasks challenging.
2. Increased computation: More dimensions mean more computational resources and time
to process the data.
It occurs mainly because as we add more features or dimensions, we're increasing the complexity
of our data without necessarily increasing the amount of useful information. Moreover, in high-
dimensional spaces, most data points are at the "edges" or "corners," making the data sparse.
The primary solution to the curse of dimensionality is "dimensionality reduction." It's a process
that reduces the number of random variables under consideration by obtaining a set of principal
variables. By reducing the dimensionality, we can retain the most important information in the
data while discarding the redundant or less important features.
PCA is a statistical method that transforms the original variables into a new set of variables,
which are linear combinations of the original variables. These new variables are called principal
components.
Let's say we have a dataset containing information about different aspects of cars, such as
horsepower, torque, acceleration, and top speed. We want to reduce the dimensionality of this
dataset using PCA.
Using PCA, we can create a new set of variables called principal components. The first principal
component would capture the most variance in the data, which could be a combination of
horsepower and torque. The second principal component might represent acceleration and top
speed. By reducing the dimensionality of the data using PCA, we can visualize and analyze the
dataset more effectively.
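A minimal sketch of the car example with scikit-learn's PCA; the numbers are made up, with columns horsepower, torque, acceleration (0-100 time), and top speed:

```python
import numpy as np
from sklearn.decomposition import PCA

cars = np.array([
    [130, 180,  9.0, 190],
    [300, 400,  5.5, 250],
    [ 95, 140, 11.0, 170],
    [220, 310,  6.8, 230],
])

pca = PCA(n_components=2)
reduced = pca.fit_transform(cars)          # 4 features -> 2 components
print(reduced.shape)                       # (4, 2)
print(pca.explained_variance_ratio_)       # variance captured per component
```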
LDA aims to identify attributes that account for the most variance between classes. It's
particularly useful for classification tasks. Suppose we have a dataset with various features of
flowers, such as petal length, petal width, sepal length, and sepal width. Additionally, each
flower in the dataset is labeled as either a rose or a lily. We can use LDA to identify the
attributes that account for the most variance between these two classes.
LDA might find that petal length and petal width are the most discriminative attributes between
roses and lilies. It would create a linear combination of these attributes to form a new variable,
which can then be used for classification tasks. By reducing the dimensionality using LDA, we
can improve the accuracy of flower classification models.
t-SNE is a non-linear dimensionality reduction technique that's particularly useful for visualizing
high-dimensional datasets. Let's consider a dataset with images of different types of animals,
such as cats, dogs, and birds. Each image is represented by a high-dimensional feature vector
extracted from a deep neural network.
Using t-SNE, we can reduce the dimensionality of these feature vectors to two dimensions,
allowing us to visualize the dataset. The t-SNE algorithm would map similar animals closer
together in the reduced space, enabling us to observe clusters of similar animals. This
visualization can help us understand the relationships and similarities between different animal
types in a more intuitive way.
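A minimal sketch with scikit-learn's t-SNE; the digits dataset stands in for the animal-image feature vectors described above:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)              # 64-dimensional vectors
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)
print(X_2d.shape)                                # (1797, 2): ready to plot
```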
Autoencoders
These are neural networks used for dimensionality reduction. They work by compressing the
input into a compact representation and then reconstructing the original input from this
representation. Suppose we have a dataset of images of handwritten digits, such as the MNIST
dataset. Each image is represented by a high-dimensional pixel vector.
We can use an autoencoder, which is a type of neural network, for dimensionality reduction.
The autoencoder would learn to compress the input images into a lower-dimensional
representation, often called the latent space. This latent space would capture the most important
features of the images. We can then use the autoencoder to reconstruct the original images from
the latent space representation. By reducing the dimensionality using autoencoders, we can
effectively capture the essential information from the images while discarding unnecessary
details.
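A minimal autoencoder-style sketch using scikit-learn's MLPRegressor trained to reproduce its input; the 8-unit hidden layer acts as the bottleneck (latent space). In practice a dedicated deep-learning framework would normally be used instead:

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPRegressor

X, _ = load_digits(return_X_y=True)
X = X / 16.0                                   # scale pixel values to [0, 1]

# 64 -> 8 -> 64: input is compressed to 8 latent features, then reconstructed
autoencoder = MLPRegressor(hidden_layer_sizes=(8,), max_iter=1000,
                           random_state=0)
autoencoder.fit(X, X)                          # the target is the input itself
print(autoencoder.score(X, X))                 # reconstruction quality (R^2)
```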
INTERPOLATION
Interpolation is the process of estimating unknown values that fall between known data points.
Some applications:
• Geodesy: Interpolation is used to map out features on Earth's surface, such as mountains
or ocean currents, using satellite imagery.
• Statistical analysis: Interpolation can be used to smooth out data sets so that they
become more evenly distributed. For example, if you have a spike in sales one day, you
can use interpolation to smooth out the rest of your sales data for that month so that the
overall trend looks smooth instead of erratic.
TYPES OF INTERPOLATION:
• Linear interpolation: Linear interpolation is a simple method for estimating unknown
values between two known points. It assumes that the data points can be connected by a
straight line.
Formula for Linear Interpolation: y = y1 + (x − x1) × (y2 − y1) / (x2 − x1)
• Polynomial interpolation: What if we have more than two points? Instead of a straight
line, we can fit a curve using a polynomial. This works like connecting the dots
smoothly so the estimated values follow the trend of the data. A common method for this
is Lagrange interpolation.
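A minimal sketch of both methods; NumPy's interp handles the linear case and SciPy's lagrange the polynomial case, with made-up data points:

```python
import numpy as np
from scipy.interpolate import lagrange

# Linear: estimate y at x = 2.5 between the known points (2, 4) and (3, 9)
print(np.interp(2.5, [2, 3], [4, 9]))      # 6.5, from the straight-line formula

# Polynomial (Lagrange): fit a curve through three points from y = x^2
poly = lagrange([1, 2, 3], [1, 4, 9])
print(poly(2.5))                           # 6.25, following the curve's trend
```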
Basis function
Instead of using a single equation to represent a function, we combine multiple small functions
(called basis functions) to form the final function. It means a function breaks into small parts
using basis functions so that a machine learning model can learn patterns better.
Think of it like building a house with Lego blocks—each basis function is a Lego piece.
A cubic spline is a smooth curve made up of cubic polynomials that are joined together at
specific points called knotpoints.
Once you have knotpoints, you need to choose how the function behaves in each section.
• Constant (step) functions. Problem: the function is not smooth; it jumps from one level to
another without a transition.
• Straight lines. Problem: they may not connect smoothly at knotpoints, meaning there might
be sharp corners.
Best Choice for Smoothness: Cubic splines! They create smooth curves that don’t have sharp
edges or abrupt changes.
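A minimal sketch with SciPy's CubicSpline through assumed knotpoints; the resulting curve is smooth across every knot:

```python
import numpy as np
from scipy.interpolate import CubicSpline

x_knots = np.array([0.0, 1.0, 2.0, 3.0])   # assumed knotpoints
y_knots = np.array([0.0, 2.0, 1.0, 3.0])

spline = CubicSpline(x_knots, y_knots)      # cubic pieces joined at the knots
print(spline(1.5))                          # smooth estimate between knots
```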
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct
category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed Support Vector Machine.
For example, two different categories of points can be classified using such a decision boundary
or hyperplane.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset
can be classified into two classes by using a single straight line, then such data is termed
linearly separable data, and the classifier used is called the Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means
if a dataset cannot be classified by using a straight line, then such data is termed non-
linear data and the classifier used is called the Non-linear SVM classifier.
Support Vectors: The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
Since this is a 2-D space, we can separate these two classes with a straight line. But there can be multiple lines that separate the classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with maximum margin is called the optimal hyperplane.
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-D space, it looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, it becomes:
SVM Algorithm
1. Goal:
o Find the best line (or hyperplane in higher dimensions) that separates two classes
of data points.
2. Steps:
o Step 1: Collect Data:
▪ Gather your data with features (e.g., height, weight) and labels (e.g., cat or
dog).
o Step 2: Plot Data:
▪ Visualize the data points on a graph (if possible).
o Step 3: Find the Best Line:
▪ Draw a line that separates the two classes.
▪ Make sure the line is as far as possible from the closest data points of both
classes (these closest points are called support vectors).
o Step 4: Handle Non-Linear Data:
▪ If the data isn’t linearly separable (you can’t draw a straight line), use a
trick called the kernel trick to transform the data into a higher dimension
where a line can separate the classes.
o Step 5: Make Predictions:
▪ For a new data point, check which side of the hyperplane it falls on and assign the corresponding class.
In summary, the algorithm works like this:
1. Plot the Data: Each data point is represented in n-dimensional space (n = number of
features). For example, if you have two features, you can plot the data on a 2D graph.
2. Find the Hyperplane: SVM finds the hyperplane (a straight line in 2D, a flat plane in
3D, or more generally, an n-dimensional plane) that separates the two classes of data
points with the maximum margin.
3. Separate the Classes: The hyperplane divides the data into two regions, each
representing one class. For example:
o One side of the line = Class A.
o Other side = Class B.
4. Non-Linearly Separable Data: If the data cannot be separated with a straight line (e.g.,
spiral data), SVM uses something called a kernel trick to transform the data into a higher
dimension where it becomes linearly separable.
o Kernel Functions: Mathematical functions like polynomial, RBF (Radial Basis
Function), etc., are used to transform the data.
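A minimal scikit-learn sketch of both SVM types; the toy data points and labels are assumptions for illustration:

# Linear vs. RBF-kernel SVM on a toy dataset (data values are assumptions).
from sklearn import svm

X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]]  # two features
y = [0, 0, 0, 1, 1, 1]                                # two classes

linear_clf = svm.SVC(kernel="linear").fit(X, y)   # straight-line boundary
rbf_clf = svm.SVC(kernel="rbf").fit(X, y)         # kernel trick for non-linear data

print(linear_clf.support_vectors_)   # the extreme points defining the margin
print(linear_clf.predict([[4, 4]]))  # which side of the hyperplane?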
DECISION TREE
• Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
• In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
• Decision nodes are used to make any decision and have multiple branches, whereas
Leaf nodes are the output of those decisions and do not contain any further branches.
• The decisions or the tests are performed on the basis of features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
• One of the reasons decision trees are popular is that they can be turned into a set of logical disjunctions (if ... then rules) that translate into program code very simply.
The ID3 (Iterative Dichotomiser 3) algorithm is a decision tree learning algorithm used in
machine learning and data mining for classification tasks. It was developed by Ross Quinlan in
the 1980s and is the predecessor of more advanced decision tree algorithms like C4.5 and
CART.
ID3 builds a decision tree by selecting attributes that maximize information gain (or minimize
entropy). The process follows these steps:
1. Calculate Entropy
Entropy measures the impurity or disorder in a dataset. It is calculated using the formula:
Entropy(S) = − Σᵢ pᵢ log₂(pᵢ)
where pᵢ is the proportion of examples in S that belong to class i.
2. Calculate Information Gain for each attribute:
Gain(S, A) = Entropy(S) − Σᵥ (|Sᵥ| / |S|) · Entropy(Sᵥ)
3. Choose the attribute with the highest information gain as the decision node, split the dataset on it, and repeat recursively on each branch until the entropy of a branch is 0 (or no attributes remain).
Disadvantages of ID3
• Overfits on noisy or small datasets
• Cannot handle continuous numerical values directly (they must be discretized)
• Prefers attributes with many values (can be biased toward high-cardinality attributes)
Example of ID3
Suppose the target attribute (Play) takes the following values over 5 training examples:
• Yes = 3 times
• No = 2 times
Entropy = −(3/5) log₂(3/5) − (2/5) log₂(2/5) ≈ 0.971
After splitting on Weather, the entropy of every branch is 0, so we stop here. The final tree is:
Weather
/ | \
Sunny Overcast Rainy
Yes Yes No
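A small Python sketch of the entropy and information-gain calculations above; the Weather column is an assumed layout consistent with the final tree:

# Entropy and information gain for the Yes=3 / No=2 example above.
from math import log2

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

play = ["Yes", "Yes", "Yes", "No", "No"]
print(entropy(play))  # ~0.971

# Information gain of splitting on an attribute (an assumed Weather column).
weather = ["Sunny", "Sunny", "Overcast", "Rainy", "Rainy"]

def information_gain(attribute, labels):
    total = len(labels)
    remainder = 0.0
    for v in set(attribute):
        subset = [l for a, l in zip(attribute, labels) if a == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

print(information_gain(weather, play))  # 0.971: every branch is pure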
The C4.5 algorithm is an improved version of the ID3 decision tree algorithm developed by
Ross Quinlan. It overcomes some limitations of ID3 and is widely used in classification
problems.
• ID3 uses Information Gain, which can favor attributes with many values.
• C4.5 solves this issue by introducing Gain Ratio, which normalizes Information Gain.
• Formula for Gain Ratio:
Gain Ratio(S, A) = Information Gain(S, A) / Split Information(S, A)
where Split Information(S, A) = − Σᵥ (|Sᵥ| / |S|) log₂(|Sᵥ| / |S|).
Imagine a dataset where we decide whether to play outside based on weather conditions, and suppose one row has a missing Temperature value. C4.5 can handle missing values: the missing row is split between "Hot" and "Mild" in proportion to how often those values occur, and entropy and information gain are calculated by weighting the contributions accordingly.
Unlike ID3, which requires categorical attributes, C4.5 can split numerical data dynamically
by finding the best threshold.
How It Works:
Step 1: Sort the values of the numerical attribute and consider candidate thresholds (typically the midpoints between consecutive values).
Step 2: Compute Information Gain (or Gain Ratio) for each threshold and pick the best one.
• Suppose 20°C gives the highest Gain Ratio; C4.5 then splits the data:
o ≤ 20°C → "No"
o > 20°C → "Yes"
The CART algorithm (Classification and Regression Trees) is a decision tree learning
technique used for classification and regression tasks. It was introduced by Breiman et al.
(1984) and is widely used in machine learning for predictive modelling.
CART constructs binary decision trees by recursively splitting the dataset into two subsets based
on feature values. The algorithm selects the best split at each step using Gini impurity (for
classification) or mean squared error (for regression).
• For classification problems, the split is chosen based on the Gini Index (the default in CART).
• For regression problems, the split is chosen based on Mean Squared Error (MSE).
The Gini Index measures the impurity of a node. The formula for the Gini Index is:
Gini = 1 − Σᵢ pᵢ²
where pᵢ is the probability of each class. A lower Gini Index means purer nodes. After the tree is grown, pruning (optional) can remove branches that do not improve generalization.
Example dataset (classification):
ID | Feature: Age | Target: Class
1 | 25 | Yes (1)
2 | 30 | Yes (1)
3 | 35 | No (0)
4 | 40 | No (0)
5 | 45 | No (0)
Weighted Gini: the Gini impurities of the two child nodes are combined, weighted by the fraction of samples falling into each node; the split with the lowest weighted Gini is chosen.
For regression, CART splits the data based on Mean Squared Error (MSE):
MSE = (1/n) Σᵢ (yᵢ − ȳ)²
where ȳ is the mean target value in the node.
Example Dataset
ID Feature: Age Target: Salary (in $1000s)
1 25 50
2 30 55
3 35 60
4 40 70
5 45 80
• Left Node (Age ≤ 35): { (25, 50), (30, 55), (35, 60) } → predicted value = mean = 55
• Right Node (Age > 35): { (40, 70), (45, 80) } → predicted value = mean = 75
Age ≤ 35?
/ \
55 75
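A minimal sketch of this regression tree with scikit-learn, using the Age/Salary table above:

# CART regression on the Age/Salary example above, using scikit-learn.
from sklearn.tree import DecisionTreeRegressor

X = [[25], [30], [35], [40], [45]]       # Feature: Age
y = [50, 55, 60, 70, 80]                 # Target: Salary (in $1000s)

tree = DecisionTreeRegressor(max_depth=1)  # a single split, as in the example
tree.fit(X, y)

print(tree.predict([[28]]))  # falls in the Age <= 35 node -> 55
print(tree.predict([[42]]))  # falls in the Age > 35 node  -> 75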
ENSEMBLE LEARNING
Ensemble learning is a technique in machine learning where multiple models (often called weak
learners or base models) are combined to create a stronger, more accurate model. The main idea
is that multiple models working together can reduce errors and improve predictions compared to
a single model.
1. Boosting
• Models are trained sequentially, where each new model corrects the mistakes of the previous
ones.
• Helps reduce bias and improve weak models.
Example: AdaBoost, Gradient Boosting (XGBoost, LightGBM, CatBoost).
Steps in Boosting:
• Train a weak learner on the training data.
• Identify the samples the weak learner misclassified.
• Assign more weight to misclassified samples and train the next weak learner on the re-weighted data.
• Combine all weak learners (weighted by their accuracy) into the final strong model.
AdaBoost (Adaptive Boosting) is an ensemble learning method that combines multiple weak
learners (usually decision stumps) to create a strong classifier. It adjusts the weights of
misclassified samples to focus more on difficult cases in each iteration.
1. Initialize Weights:
o Assign equal weights to all training samples.
2. Train a Weak Learner:
o Fit a weak classifier (typically a decision stump) on the weighted data.
3. Calculate Error:
o The error is measured as the total weight of misclassified samples.
4. Update Weights:
o Increase the weights of misclassified samples and give the weak learner a vote αt based on its accuracy, so the next learner focuses on the difficult cases.
5. Repeat and Combine:
o Repeat steps 2–4 and combine all weak learners, weighted by their αt, into the final strong classifier.
Step 1: Dataset
Suppose we have 4 samples with a single feature x and labels in {+1, −1}. Initially, all samples are given equal importance. Since we have 4 samples, their initial weight is w = 1/4 = 0.25.
Step 2: Train a weak classifier (a decision stump):
• If x < 2.5, predict +1
• If x ≥ 2.5, predict −1
Step 3: The error of the weak classifier is the sum of the weights of misclassified samples:
ε = Σᵢ (wᵢ for each misclassified sample i)
After several iterations, we get multiple weak classifiers h1,h2,h3,... with different weights αt.
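A minimal AdaBoost sketch with scikit-learn; the four samples and labels are assumptions matching the x < 2.5 stump above (note that recent scikit-learn versions name the base-learner argument estimator):

# AdaBoost with decision stumps on the 1-D toy data sketched above.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X = [[1], [2], [3], [4]]          # four samples, one feature (assumption)
y = [1, 1, -1, -1]                # +1 below the threshold, -1 above

stump = DecisionTreeClassifier(max_depth=1)   # a weak learner
model = AdaBoostClassifier(estimator=stump, n_estimators=10).fit(X, y)

print(model.predict([[1.5], [3.5]]))   # -> [ 1 -1]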
2. Bagging (Bootstrap Aggregating)
• Multiple models (usually the same algorithm) are trained on random subsets of data.
• Predictions are averaged (for regression) or voted (for classification).
• Helps reduce variance and prevents overfitting.
Example: Random Forest (combines multiple decision trees).
Steps in Bagging (example: Random Forest):
• Instead of a single decision tree, Random Forest builds multiple trees and combines their predictions.
• The Random Forest algorithm is a machine learning technique that uses multiple decision trees to make predictions. It can be used for classification and regression tasks.
1. Create multiple datasets → Randomly pick data with replacement (some data may be
repeated).
2. Train multiple decision trees → Each tree learns from a different dataset.
3. Make predictions → Each tree makes its own prediction.
4. Combine the results →
o For classification → Take the majority vote (most common prediction).
o For regression → Take the average of all predictions.
Key points:
• More trees = better accuracy & less overfitting.
• Every tree in the forest makes its own predictions without relying on others.
• Each tree is built using random samples and features to reduce mistakes.
• Sufficient data ensures the trees are different and learn unique patterns.
• Combining the predictions from different trees leads to a more accurate final result.
• Random Forest provides very accurate predictions even with large datasets.
• Random Forest can handle missing data well without compromising accuracy.
• It doesn't require normalization or standardization of the dataset.
• When we combine multiple decision trees it reduces the risk of overfitting of the
model.
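A minimal Random Forest sketch in scikit-learn; the data values are made up for illustration:

# Random Forest sketch: many trees on bootstrap samples, majority vote.
from sklearn.ensemble import RandomForestClassifier

X = [[160, 55], [170, 60], [175, 70], [180, 80], [165, 58], [185, 85]]
y = [0, 0, 1, 1, 0, 1]   # made-up two-class data

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[172, 65]]))          # majority vote across the 100 trees
print(forest.predict_proba([[172, 65]]))    # fraction of trees voting each way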
Combining multiple classifiers can improve machine learning model performance by leveraging
the strengths of different algorithms. There are various ways to combine classifiers:
• Stacking: Train several models, then use another model (a meta-model) to combine their predictions for a better result.
• Bagging (Bootstrap Aggregating)
Trains multiple instances of the same classifier on different subsets of data. Reduces
variance and prevents overfitting. Example: Random Forest (uses bagging with decision
trees).
• Boosting
Sequentially trains classifiers, where each new model focuses on the mistakes of the previous ones. Reduces bias and increases accuracy. Examples: AdaBoost, Gradient Boosting (XGBoost, LightGBM, CatBoost).
MIXTURE OF EXPERTS (MoE)
The Mixture of Experts (MoE) is an ensemble learning technique that divides a complex
problem into subproblems and assigns specialized models (called experts) to solve each
subproblem. A gating network learns to combine the outputs of these experts to make a final
prediction.
It is widely used in deep learning and large-scale AI models, such as Google's Switch Transformers, which use MoE to efficiently allocate computational resources.
Experts: These are smaller models, each trained to be really good at a specific part of the overall
problem. Think of them like the different specialists on your team.
Gating network: This is like a manager who decides which expert is best suited for each part of
the problem. It looks at the input and figures out who should work on what.
Output: This is the final answer or solution that the model produces after the experts have done
their work.
Advantages of MoE
Scalability – MoE can handle large-scale problems by distributing tasks across specialized models.
Improved Accuracy – Experts specialize in different areas, leading to better generalization.
Parallel Computation – Experts can run independently, making MoE efficient for distributed
computing.
Reduced Overfitting – Specialization prevents overfitting to general patterns.
Disadvantages of MoE
Training Complexity – Training the experts and the gating network jointly is more complex than training a single model.
Resource Overhead – Maintaining many experts increases memory requirements.
Data Fragmentation – Each expert sees only part of the data, which can hurt performance when data is limited.
BASIC STATISTICS
Mean: The arithmetic average of a dataset, computed as the sum of all values divided by the number of values.
Median: The middle value when the data are sorted; half the values lie below it and half above.
Mode: The value that occurs most frequently in the dataset.
Variance, in statistics, is a measure of how spread out or dispersed data points are from their
average (mean), calculated by averaging the squared differences from the mean.
Covariance:
Covariance is a measure of the relationship between two variables that is scale dependent, i.e., how much one variable changes when the other variable changes.
Standard Deviation: The square root of the variance is known as the standard deviation.
Interquartile Range: The range between the first and third quartiles, measuring data spread
around the median.
• Leptokurtic: A leptokurtic curve has a higher peak than the normal distribution. In this curve, there is too much concentration of items near the central value.
• Mesokurtic: A mesokurtic curve has the same peak as the normal curve. In this curve, there is an even distribution of items around the central value.
• Platykurtic: A platykurtic curve has a lower peak than the normal curve. In this curve, there is less concentration of items around the central value.
Mahalanobis Distance: The Mahalanobis distance measures how far a point is from the mean of a distribution, in units of that distribution's spread. It is a powerful technique that considers the correlations between variables in a dataset, making it a valuable tool in various applications such as outlier detection, clustering, and classification.
D² = (x-μ)ᵀΣ⁻¹(x-μ)
Where D² is the squared Mahalanobis Distance, x is the point in question, μ is the mean vector of
the distribution, Σ is the covariance matrix of the distribution, and ᵀ denotes the transpose of a
matrix.
The Gaussian / Normal Distribution: Normal distribution, also known as the Gaussian
distribution, is a continuous probability distribution that is symmetric about the mean, depicting
that data near the mean are more frequent in occurrence than data far from the mean.
GAUSSIAN MIXTURE MODEL (GMM)
Model Training
Expectation-Maximization:
• During the E step, the model calculates the probability of each data point belonging to
each Gaussian component.
• The M step then adjusts the model’s parameters based on these probabilities.
1. Mixture of Gaussians
o Instead of assuming all points belong to just one cluster (like in k-means), GMM
assumes data is a mix of several Gaussian distributions.
o Each distribution represents one hidden group (e.g., different flavors of candy).
2. Soft Clustering (Probabilities Instead of Hard Labels)
o Instead of saying, “This point is in Cluster A,” GMM says, “This point is 70%
likely to be in Cluster A and 30% likely to be in Cluster B.”
3. Expectation-Maximization (EM) Algorithm
o Since we don’t know which Gaussian a point belongs to, we start with a guess.
o We then refine this guess using the E-step (Expectation) and M-step
(Maximization) until the clusters make sense.
Let’s say we measure the heights of students in a school. If we plot the heights, we might see
three peaks in the data.
GMM assumes that each peak represents a Gaussian distribution, and the overall height
distribution is just a mix of these three groups.
If we give a new student’s height, GMM can tell us the probability that the student belongs to
each group.
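A minimal sketch of fitting a GMM with EM via scikit-learn; the three height groups are simulated assumptions:

# GMM on 1-D heights with three assumed groups, fitted with EM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
heights = np.concatenate([            # three made-up Gaussian groups
    rng.normal(120, 5, 100),
    rng.normal(150, 6, 100),
    rng.normal(175, 7, 100),
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(heights)
print(gmm.means_.ravel())                   # the three peaks it found
print(gmm.predict_proba([[160.0]]))         # soft clustering: probability per group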
K-NEAREST NEIGHBORS (KNN)
K-Nearest Neighbors (KNN) is a simple way to classify things by looking at what's nearby. The KNN algorithm is a supervised machine learning method employed to tackle classification and regression problems.
K-Nearest Neighbors is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at the time of classification.
As an example, consider the following table of data points containing two features:
The new point is classified as Category 2 because most of its closest neighbors are blue
squares. KNN assigns the category based on the majority of nearby points.
The image shows how KNN predicts the category of a new data point based on its closest
neighbours.
• The red diamonds represent Category 1 and the blue squares represent Category
2.
• K represents the number of nearest neighbors that need to be considered while making a prediction.
• To measure the similarity between target and training data points, Euclidean distance is used.
Distance is calculated between each of the data points in the dataset and target point.
• The k data points with the smallest distances to the target point are the nearest neighbors.
• In the classification problem, the class labels of K-nearest neighbors are determined by
performing majority voting. The class with the most occurrences among the neighbors becomes
the predicted class for the target data point.
• In the regression problem, the prediction is calculated by taking the average of the target values of the K nearest neighbors. The calculated average becomes the predicted output for the target data point.
Example:
Given Query:
X= (Maths=6, CS=8) → Find the class?
Maths | CS | Result
4 | 3 | Fail
6 | 7 | Pass
6 | 8 | Pass
5 | 5 | Fail
8 | 8 | Pass
Step 1: Choose K = 3.
Step 2: Compute the Euclidean distance from the query point (6, 8) to each data point:
• (4, 3): √((6−4)² + (8−3)²) = √29 ≈ 5.39 (Fail)
• (6, 7): √(0 + 1) = 1 (Pass)
• (6, 8): 0 (Pass)
• (5, 5): √(1 + 9) = √10 ≈ 3.16 (Fail)
• (8, 8): √(4 + 0) = 2 (Pass)
Step 3: Since K = 3, we consider the 3 smallest distances from the new data point: 0 (Pass), 1 (Pass), and 2 (Pass).
Thus, we assign the new data point to the Pass category.
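The same example as a scikit-learn sketch:

# KNN (K=3) reproducing the Maths/CS example above.
from sklearn.neighbors import KNeighborsClassifier

X = [[4, 3], [6, 7], [6, 8], [5, 5], [8, 8]]
y = ["Fail", "Pass", "Pass", "Fail", "Pass"]

knn = KNeighborsClassifier(n_neighbors=3)   # Euclidean distance by default
knn.fit(X, y)
print(knn.predict([[6, 8]]))                # -> ['Pass']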
Advantages:
• Easy to implement: The KNN algorithm is easy to implement because its
complexity is relatively low as compared to other machine learning algorithms.
Disadvantages:
• Doesn't scale well: KNN is considered a "lazy" algorithm; it is very slow, especially with large datasets, because all computation happens at prediction time.
• Curse of Dimensionality: When the number of features increases, KNN struggles to classify data accurately, a problem known as the curse of dimensionality.
• Prone to Overfitting: Because the algorithm is affected by the curse of dimensionality, it is also prone to overfitting.
K-dimensional tree
A k-d tree is a special kind of binary search tree that helps organize points in multiple
dimensions (like 2D or 3D space).
Imagine you have a list of locations on a map (like stores or houses), and you want to quickly
find the one closest to you. Instead of checking every single location one by one, a k-d tree
organizes them in a way that makes searching much faster.
1. It starts by dividing the space based on one coordinate (like splitting a map along a vertical line).
2. Then, it keeps dividing the smaller sections using other coordinates (like splitting horizontally
next).
3. This process continues, making it easier to search for nearby points.
The purpose of a k-d tree is to efficiently organize and search points in multiple dimensions
(2D, 3D, or higher).
• Faster searches than checking every point one by one (especially in large datasets).
• Organizes multi-dimensional data in a structured way.
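A small sketch using SciPy's KDTree; the store coordinates are made up:

# Nearest-neighbour lookup with a k-d tree (made-up 2-D store locations).
from scipy.spatial import KDTree

stores = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = KDTree(stores)                 # recursively splits space by coordinates

dist, idx = tree.query((6, 2.5))      # closest store to your position
print(stores[idx], dist)              # found without checking every point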
UNSUPERVISED LEARNING
Unsupervised learning is a type of machine learning that works with data that has no labels or categories. The main goal is to find patterns and relationships in the data without any guidance. In this approach, the machine analyzes unorganized information and groups it based on similarities, patterns, or differences. Unlike supervised learning, there is no teacher or training involved. The machine must uncover hidden structures in the data on its own.
Example
Imagine you have a machine learning model trained on a large dataset of unlabeled images containing both dogs and cats. The model has never seen an image of a dog or cat before, and it has no pre-existing labels or categories for these animals. Your task is to use unsupervised learning to identify the dogs and cats in a new, unseen image. The machine has no idea about the features of dogs and cats, so it cannot categorize them as "dogs" and "cats". But it can categorize them according to their similarities, patterns, and differences, for example by grouping the images into two clusters.
Unsupervised learning allows the model to work on its own to discover patterns and information that were previously undetected. It mainly deals with unlabeled data.
The unsupervised learning algorithm can be further categorized into two types of problems:
• Clustering: Clustering is a method of grouping the objects into clusters such that objects
with most similarities remains into a group and has less or no similarities with the objects
of another group. Cluster analysis finds the commonalities between the data objects and
categorizes them as per the presence and absence of those commonalities.
• Association: An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database. It determines the set of items that occur together in the dataset. Association rules make marketing strategy more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter/jam). A typical example of an association rule is Market Basket Analysis.
K MEANS ALGORITHM
• K-means clustering is a popular unsupervised machine learning algorithm used for
partitioning a dataset into a pre-defined number of clusters. The goal is to group similar
data points together and discover underlying patterns or structures within the data.
• The first property of clusters states that the points within a cluster should be similar to
each other. So, our aim here is to minimize the distance between the points within a
cluster.
• There is an algorithm that tries to minimize the distance of the points in a cluster with
their centroid – the k-means clustering technique.
• Initialization: Start by randomly selecting K points from the dataset. These points will
act as the initial cluster centroids.
• Assignment: For each data point in the dataset, calculate the distance between that point
and each of the K centroids. Assign the data point to the cluster whose centroid is closest
to it. This step effectively forms K clusters.
• Update centroids: Once all data points have been assigned to clusters, recalculate the
centroids of the clusters by taking the mean of all data points assigned to each cluster.
• Repeat: Repeat steps 2 and 3 until convergence. Convergence occurs when the centroids
no longer change significantly or when a specified number of iterations is reached.
• Final Result: Once convergence is achieved, the algorithm outputs the final cluster
centroids and the assignment of each data point to a cluster.
Mathematical Representation
The objective of K-Means is to minimize the sum of squared differences between each point and its assigned cluster centroid:
J = Σⱼ₌₁ᴷ Σ_{xᵢ ∈ Cⱼ} ‖xᵢ − μⱼ‖²
where μⱼ is the centroid of cluster Cⱼ.
The main objective of k-means clustering is to partition your data into a specific number (k) of
groups, where data points within each group are similar and dissimilar to points in other groups.
It achieves this by minimizing the distance between data points and their assigned cluster’s
center, called the centroid.
• Grouping similar data points: K-means aims to identify patterns in your data by
grouping data points that share similar characteristics together. This allows you to
discover underlying structures within the data.
• Minimizing within-cluster distance: The algorithm strives to make sure data points
within a cluster are as close as possible to each other, as measured by a distance metric
(usually Euclidean distance). This ensures tight-knit clusters with high cohesiveness.
• Maximizing between-cluster distance: Conversely, k-means also tries to maximize the
separation between clusters. Ideally, data points from different clusters should be far
apart, making the clusters distinct from each other.
Advantages of K-means
1. Simple and easy to implement: The k-means algorithm is easy to understand and
implement, making it a popular choice for clustering tasks.
2. Fast and efficient: K-means is computationally efficient and can handle large datasets
with high dimensionality.
3. Scalability: K-means can handle large datasets with many data points and can be
easily scaled to handle even larger datasets.
4. Flexibility: K-means can be easily adapted to different applications and can be used
with varying metrics of distance and initialization methods.
Example:
(Note: Keep point 1 and 2 as centroids and label them as K1 & K2)
Step 1:
Decide the centroid. So let's consider that point ① & ② are the centroids of the cluster K1 & K2.
K1 = (185, 72)
K2 = (170, 56)
Step 2: Assign each remaining point to its nearest centroid. For example, point 3 (168, 60) is ≈ 4.5 from K2 but ≈ 20.8 from K1, so it joins K2.
Point | Height | Weight | Assigned Cluster
1 | 185 | 72 | K1
2 | 170 | 56 | K2
3 | 168 | 60 | K2
4 | 179 | 68 | K1
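A sketch of the same example with scikit-learn, seeding the centroids at points 1 and 2 as the note above suggests:

# K-means on the height/weight example above, with K=2.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[185, 72], [170, 56], [168, 60], [179, 68]])
init_centroids = points[:2].astype(float)    # points 1 and 2 as K1, K2

kmeans = KMeans(n_clusters=2, init=init_centroids, n_init=1).fit(points)
print(kmeans.labels_)            # cluster assignment of each point
print(kmeans.cluster_centers_)   # final centroids after updating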
DIMENSIONALITY REDUCTION
• Dimensionality reduction is the process of reducing the number of features (or
dimensions) in a dataset while retaining as much information as possible.
• In other words, it is a process of transforming high-dimensional data into a lower
dimensional space that still preserves the essence of the original data.
• Dimensionality reduction can be done in two different ways:
1. By only keeping the most relevant variables from the original dataset (this technique
is called feature selection)
2. By finding a smaller set of new variables, each being a combination of the input variables, containing essentially the same information as the input variables (this technique is called feature extraction)
LINEAR DISCRIMINANT ANALYSIS (LDA)
Linear Discriminant Analysis (LDA) is a supervised technique that finds the linear combinations of features that best separate two or more classes.
Applications:
• Classification: LDA is commonly used for classification tasks in various domains, such as
image recognition, medical diagnosis, and customer segmentation.
• Feature Selection: LDA can be used to select the most relevant features for classification
by identifying the linear combinations that best separate the classes.
• Dimensionality Reduction: LDA can be used to reduce the dimensionality of data while
preserving the information that is most important for classification.
Advantages:
• Simplicity: LDA is a relatively simple algorithm that is easy to implement and
understand.
• Computational Efficiency: LDA is computationally efficient, making it suitable for
large datasets.
• Interpretability: The linear combinations of features learned by LDA are easy to
interpret, providing insights into the relationships between features and classes.
Limitations:
• Assumptions: LDA relies on the assumption that the data within each class follows a
normal distribution, which may not always be true in real-world datasets.
• Linearity: LDA assumes that the class boundaries are linear, which may not be suitable for
datasets with complex, non-linear relationships.
• Class Imbalance: LDA may not perform well on datasets with imbalanced classes, where
one class has significantly more data points than the other.
PRINCIPAL COMPONENT ANALYSIS (PCA)
Principal Component Analysis (PCA) is an unsupervised technique that transforms the data into a new set of uncorrelated variables (principal components), ordered by how much of the variance they capture.
Applications:
o Data Visualization: PCA can help visualize high-dimensional data in a lower-
dimensional space (e.g., 2D or 3D).
o Feature Extraction: It can identify the most important features or variables that
contribute most to the overall variance in the data.
o Data Compression: PCA can be used to compress data by representing it with a
smaller number of principal components.
o Noise Reduction: By focusing on the principal components that capture the most
variance, PCA can help remove noise or irrelevant information.
o Anomaly Detection: PCA can be used to identify outliers or anomalies in the data
by measuring the distance of data points from the principal components.
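A minimal PCA sketch; the data, with one nearly redundant feature, is simulated for illustration:

# PCA sketch: project made-up 3-D data down to 2 principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)   # third feature nearly redundant

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                 # 100 x 2 representation

print(pca.explained_variance_ratio_)  # variance captured by each component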
FACTOR ANALYSIS
Factor analysis models the observed variables X as linear combinations of a small number of latent factors F, plus noise:
X = ΛF + ϵ
• Factor loadings (Λ) tell us how much each observed variable is influenced by a latent factor.
• Noise (ϵ) accounts for variability not explained by the factors.
Example
Observed Variables are the actual data points we measure. Suppose we have a survey with questions about the waiting time, cleanliness, and staff behavior of a restaurant. Latent Factors (Hidden Variables) are unobserved underlying causes that explain patterns in the observed data.
In the example, there might be two latent factors influencing the responses: for instance, overall service quality (staff behavior, waiting time) and hygiene (cleanliness).
Factor Analysis is mainly classified into two types based on the purpose and approach used:
1. Exploratory Factor Analysis (EFA)
• Used when the number and structure of factors are unknown; the factors are discovered from the data.
2. Confirmatory Factor Analysis (CFA)
• Used when the number and structure of factors are already known or hypothesized.
• Confirms whether the data fits the assumed factor structure.
• Common in validating questionnaires, psychological tests, and scientific research.
• Example: In education, CFA is used to confirm that an IQ test correctly measures verbal,
logical, and spatial intelligence.
How it works:
1. Data Collection:
Gather data on a set of variables.
2. Correlation/Covariance Matrix:
Calculate the correlation or covariance matrix to understand the relationships between the
variables.
3. Factor Extraction:
Determine the number of factors to extract and extract them using methods like principal
component analysis (PCA) or maximum likelihood estimation.
4. Factor Rotation (Optional):
Rotate the factors to simplify interpretation and make the relationship between factors and
variables clearer.
5. Factor Loadings:
Examine the factor loadings, which indicate how much each original variable contributes to
each factor.
6. Interpretation:
Interpret the factors based on the factor loadings and understand the underlying structure of
the data.
INDEPENDENT COMPONENT ANALYSIS (ICA)
You're in a room with two people talking at the same time, and you have two microphones
recording the sounds. Each microphone picks up a different mixture of both people’s voices.
You want to separate the two voices from the recordings using ICA.
Mathematical Model
Given observed mixtures X, ICA assumes X = AS, where S contains the statistically independent source signals and A is an unknown mixing matrix. The goal is to estimate an unmixing matrix W (approximately A⁻¹) so that S = WX recovers the sources.
Applications:
• Blind Source Separation (BSS) – Classic example: Cocktail party problem, separating
different voices from a recording.
• EEG/MEG Signal Processing – Separate brain signals from noise.
• Image Processing – Feature extraction and noise removal.
• Financial Data Analysis – Uncovering underlying independent factors in stock prices.
• You collect the mixed signals. For example, two microphones recording different mixes
of two people speaking.
• ICA assumes:
o The original sources are statistically independent
o They are non-Gaussian
• ICA algorithm (like FastICA) tries to find a matrix (W) that transforms the mixed
data into independent sources:
S=WX
Where:
o W is the unmixing matrix learned by ICA
o X is the matrix of observed (mixed) signals
• The result S gives you the independent components – your original signals!
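A cocktail-party sketch using scikit-learn's FastICA; the two source signals and the mixing matrix are assumptions:

# Cocktail-party sketch: unmix two made-up source signals with FastICA.
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 1, 2000)
s1 = np.sin(2 * np.pi * 5 * t)               # voice 1 (a sine wave)
s2 = np.sign(np.sin(2 * np.pi * 3 * t))      # voice 2 (a square wave)
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5], [0.4, 1.0]])       # unknown mixing matrix
X = S @ A.T                                  # what the two microphones record

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                 # recovered independent components
print(S_est.shape)   # (2000, 2) -- up to ordering and scaling of the sources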
Advantages of Independent Component Analysis (ICA):
• Capability of separating mixed signals into their components: ICA is a useful method for breaking down blended signals into their component parts. This is useful for several applications, including signal processing, image analysis, and data compression.
• Non-parametric technique: ICA is non-parametric; it does not assume a particular underlying probability distribution for the data.
• Unsupervised learning: ICA can be applied to data without the need for labeled samples. As a result, it may be helpful when access to labeled data is limited.
• Feature extraction: Using ICA, significant characteristics in the data that are useful for other tasks, like classification, can be found. This process is known as feature extraction.
Disadvantages of Independent Component Analysis (ICA):
• Non-Gaussian assumption: Although this may not always be the case, ICA assumes that
the underlying sources are non-Gaussian. ICA might not work if the underlying sources
are Gaussian.
• Assumption of linear mixing: Although this may not always be the case, ICA assumes
that the sources are mixed linearly. ICA might not work if the sources are blended
nonlinearly.
• Costly to compute: ICA can be costly to compute, particularly for big datasets. This can
make using ICA to solve practical issues challenging.
LOCALLY LINEAR EMBEDDING (LLE)
Locally Linear Embedding (LLE) is a non-linear dimensionality reduction technique that preserves the local neighborhood structure of the data. Its main steps are:
• Find Neighbors: For each data point, find its k nearest neighbors using Euclidean distance.
• Compute Weights: For each point, compute weights that best reconstruct the point from
its neighbors using linear combinations (i.e., minimize reconstruction error). In short,
each point is reconstructed as a linear combination of its neighbors. LLE calculates the
weights that best reconstruct the point from its neighbors while minimizing
reconstruction-error.
This results in weights W such that each point is approximated by its neighbors: xᵢ ≈ Σⱼ Wᵢⱼ xⱼ, with Σⱼ Wᵢⱼ = 1.
• Embed in Low Dimensions: Find low-dimensional points Y that preserve the same reconstruction weights from the high-dimensional space. That is, LLE finds a low-dimensional representation of the data in which those same weights still reconstruct each point from its neighbors. This preserves the local structure of the manifold.
Applications: visualization of high-dimensional data, manifold learning, and preprocessing before clustering or classification.
Advantages of LLE
The dimensionality reduction method known as locally linear embedding (LLE) has many
benefits for data processing and visualization. The following are LLE's main benefits:
• Preservation of Local Structures: LLE is excellent at maintaining the in-data local
relationships or structures. It successfully captures the inherent geometry of
nonlinear manifolds by maintaining pairwise distances between nearby data points.
• Handling Non-Linearity: LLE has the ability to capture nonlinear patterns and
structures in the data, in contrast to linear techniques like Principal Component
Analysis (PCA). When working with complicated, curved, or twisted datasets, it is
especially helpful.
• Dimensionality Reduction: LLE lowers the dimensionality of the data while
preserving its fundamental properties. Particularly when working with high-
dimensional datasets, this reduction makes data presentation, exploration, and
analysis simpler.
Disadvantages of LLE
• Curse of Dimensionality: LLE can experience the "curse of dimensionality" when
used with extremely high-dimensional data, just like many other dimensionality
reduction approaches. The number of neighbors required to capture local
interactions rises as dimensionality does, potentially increasing the computational
cost of the approach.
• Memory and computational Requirements: For big datasets, creating a weighted
adjacency matrix as part of LLE might be memory-intensive. The eigenvalue
decomposition stage can also be computationally taxing for big datasets.
• Outliers and Noisy Data: LLE is sensitive to outliers and noisy data points. Outliers can distort the local linear relationships and degrade the quality of the embedding.
ISOMAP
ISOMAP is used to reduce the number of dimensions in high-dimensional data while preserving
the intrinsic geometry (shape) of the data — especially when the data lies on a non-linear
manifold.
Applications
• Manifold learning
• Visualization of high-dimensional data
• Preprocessing before classification/clustering
Least Squares Optimization: Least Squares Optimization is a method used to minimize the
difference between predicted values and actual data.
Example: Imagine you're trying to train a robot to walk. You don’t know the perfect way to do it,
but you let it try randomly, keep the ones that perform better, and let them “reproduce” to
create a new generation of robots with small improvements. Repeat this over and over, and
eventually, some of them will walk well.
This method doesn't require gradient-based optimization (like backpropagation), so it’s useful in
tricky cases where derivatives are hard to calculate.
• You want to find a model that minimizes the least squares error (i.e., best fits the data), but the error surface is hard to optimize with gradients.
The process:
1. Generate a population of random candidate models.
2. Evaluate each candidate's fitness as (the inverse of) its least squares error.
3. Keep the best candidates, and create new ones through mutation and crossover.
4. Repeat for many generations.
Over time, the models evolve to have a better fit (lower least squares error).
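A minimal sketch of this evolutionary least-squares idea in Python; the data points and mutation settings are made-up assumptions:

# Evolutionary least-squares sketch: evolve (a, b) for y = a*x + b without gradients.
import random

xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]   # made-up data roughly on y = 2x + 1

def sq_error(model):
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))

population = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(50)]

for generation in range(200):
    population.sort(key=sq_error)           # fitness = low least-squares error
    survivors = population[:10]             # keep the best models
    population = survivors + [
        (a + random.gauss(0, 0.1), b + random.gauss(0, 0.1))   # mutated offspring
        for a, b in random.choices(survivors, k=40)
    ]

best = min(population, key=sq_error)
print(best, sq_error(best))   # close to a=2, b=1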
GENETIC ALGORITHMS
Genetic Algorithms (GAs) are a type of search heuristic inspired by Darwin’s theory of natural
selection, mimicking the process of biological evolution. These algorithms are designed to find
optimal or near-optimal solutions to complex problems by iteratively improving candidate
solutions based on survival of the fittest.
The primary purpose of Genetic Algorithms is to tackle optimization and search problems. By
leveraging evolutionary principles such as selection, crossover, and mutation, GAs explore large
solution spaces efficiently, even for problems where traditional methods struggle.
Genetic Algorithm in machine learning plays a significant role in tasks like hyperparameter
tuning, feature selection, and model optimization. For instance, they can optimize the
architecture of a neural network or select the most relevant features for improving prediction
accuracy.
Real-World Examples: route optimization (e.g., the Traveling Salesman Problem), timetable and job-shop scheduling, feature selection, and evolving neural network architectures.
Genetic Algorithms (GAs) operate through an iterative process inspired by natural evolution.
This process involves generating, evaluating, and evolving populations of candidate solutions to
find the optimal outcome. The workflow can be broken down into several key stages:
1. Initialization
The algorithm begins with a population of candidate solutions (chromosomes), usually generated at random. Each chromosome encodes one possible solution to the problem (for example, a binary string or a permutation of cities).
2. Fitness Evaluation
Each candidate solution is evaluated using a fitness function that measures its quality or
suitability for solving the problem. The fitness function is problem-specific and determines how
well a solution meets the objective.
Example: In the Traveling Salesman Problem (TSP), the fitness is calculated as the inverse of
the total distance traveled. Shorter routes yield higher fitness scores.
3. Selection
To create the next generation, GAs select the fittest solutions from the current population. Various methods ensure that better solutions have a higher probability of being chosen, such as roulette-wheel selection, tournament selection, and rank-based selection.
4. Crossover
Crossover, or recombination, involves combining the genetic material of two parent solutions to
produce offspring. This process introduces variability and explores new areas of the search
space.
Types of Crossover: single-point (swap segments after one cut point), two-point (swap the segment between two cut points), and uniform (swap individual genes at random).
5. Mutation
Mutation introduces random changes to the chromosomes to maintain diversity and avoid
premature convergence. It helps the algorithm explore unexplored areas of the search space.
Example: In a binary chromosome, mutation might involve flipping a 0 to 1 or vice versa (e.g.,
101010 becomes 101110).
6. Termination
The algorithm terminates when a specific termination criterion is met, such as reaching a maximum number of generations, achieving a solution of acceptable fitness, or seeing no improvement over several consecutive generations.
Genetic Algorithms (GAs) rely on several core components that work together to solve
optimization and search problems effectively.
Search Space
The search space represents the range of all possible solutions for a given problem. It is
essentially the domain within which the algorithm operates to identify the optimal or near-
optimal solution.
GAs excel at exploring this space efficiently by balancing exploitation (focusing on promising
areas) and exploration (investigating new areas), ensuring a higher chance of finding the best
solution.
Example: For the Traveling Salesman Problem, the search space includes all possible
permutations of cities in the route.
Fitness Function
The fitness function evaluates how well a candidate solution performs relative to the problem’s
objectives. A well-designed fitness function is crucial because it directly influences the
algorithm’s ability to converge on the optimal solution.
Example: In a scheduling problem, the fitness function might evaluate the minimization of
resource conflicts or task completion times.
Genetic Operators
Selection, crossover, and mutation are the primary genetic operators that drive the evolutionary process: selection decides which solutions reproduce, crossover recombines parent solutions into offspring, and mutation introduces random variation to maintain diversity.
Genetic Offspring: In the context of machine learning, especially with genetic algorithms,
"genetic offspring" refers to new individuals or solutions generated by combining the
characteristics of parent solutions through crossover and mutation. These offspring inherit
features from their parents but also introduce new variations, allowing the algorithm to explore
the solution space and potentially find better solutions over generations.
Genetic Algorithms (GAs) have a broad range of applications in machine learning, where they
enhance model performance, reduce complexity, and tackle optimization challenges effectively.
1. Hyperparameter Optimization
GAs are frequently used to automate the process of hyperparameter tuning, which is critical
for improving machine learning model performance. Instead of relying on grid or random search,
GAs explore combinations of hyperparameters more efficiently by leveraging evolutionary
principles.
Example: In neural networks, GAs can optimize learning rates, layer configurations, and
dropout rates to achieve better accuracy. Similarly, for Support Vector Machines (SVMs), GAs
can fine-tune kernel parameters to enhance classification performance.
2. Feature Selection
Selecting the most relevant features from a dataset is crucial for reducing model complexity and
improving accuracy. GAs identify optimal subsets of features by evaluating their impact on
model performance through a fitness function. This helps reduce overfitting and computational
costs.
Example: In a classification task, GAs can identify the most informative features from a high-
dimensional dataset, improving the classifier’s accuracy.
3. Neural Network Optimization
GAs are employed to optimize neural network architectures and weights, making them highly
effective in designing robust models. By evolving network parameters over generations, GAs
help discover architectures that balance accuracy and computational efficiency.
Example: GAs can optimize the number of neurons, hidden layers, and activation functions in a
deep learning model to enhance predictive accuracy.
4. Other Applications
GAs extend beyond traditional machine learning tasks and find utility in diverse areas such as robotics, game playing, scheduling, and engineering design.
While Genetic Algorithms (GAs) are powerful tools, they come with certain limitations that can impact their effectiveness: they can be computationally expensive, they provide no guarantee of finding the global optimum, their performance is sensitive to parameter choices (population size, mutation rate), and they can suffer from premature convergence to suboptimal solutions.
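A minimal genetic algorithm sketch on the classic one-max problem (evolve a bit string toward all ones); all parameters are illustrative assumptions:

# Minimal genetic algorithm: selection, crossover, and mutation on bit strings.
import random

LENGTH, POP, GENS = 20, 30, 100

def fitness(chrom):                       # number of 1s in the chromosome
    return sum(chrom)

def crossover(p1, p2):                    # single-point crossover
    cut = random.randrange(1, LENGTH)
    return p1[:cut] + p2[cut:]

def mutate(chrom, rate=0.01):             # randomly flip bits
    return [1 - g if random.random() < rate else g for g in chrom]

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]

for gen in range(GENS):
    def select():                         # tournament: the fitter of two random picks
        a, b = random.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b
    population = [mutate(crossover(select(), select())) for _ in range(POP)]

best = max(population, key=fitness)
print(fitness(best), best)                # close to all ones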
Reinforcement Learning – Overview – Getting Lost Example Markov Chain Monte Carlo
Methods – Sampling – Proposal Distribution – Markov Chain Monte Carlo – Graphical
Models – Bayesian Networks – Markov Random Fields – Hidden Markov Models –
Tracking Methods.
REINFORCEMENT LEARNING
Reinforcement Learning (RL) is a branch of machine learning that focuses on how agents
can learn to make decisions through trial and error to maximize cumulative rewards. RL allows
machines to learn by interacting with an environment and receiving feedback based on their
actions. This feedback comes in the form of rewards or penalties.
Reinforcement Learning revolves around the idea that an agent (the learner or decision-maker)
interacts with an environment to achieve a goal. The agent performs actions and receives
feedback to optimize its decision-making over time.
• Agent: The decision-maker that performs actions.
• Environment: The world or system in which the agent operates.
• State: The situation or condition the agent is currently in.
• Action: The possible moves or decisions the agent can make.
• Reward: The feedback or result from the environment based on the agent’s action.
SAMPLING
• Suppose we have a big dataset and are excited to get started analyzing it and building a machine learning model, but our machine gives an "out of memory" error while trying to load the dataset.
• This happens often with big datasets. Dealing with massive amounts of data on computationally limited machines is one of the biggest hurdles in data science (of course, it can also be resolved with additional computing resources).
• So how can we overcome this problem? Is there a way to pick a subset of the data and analyze that, one that is a good representation of the entire dataset? Here comes the statistical approach to dealing with bigger datasets, called "Sampling".
• “Sampling is a method that allows us to get information about the population based
on the statistics from a subset of the population (sample), without having to
investigate every individual”
• Example: When you conduct research about a group of people, it’s rarely possible to
collect data from every person in that group. Instead, you select a sample. The sample is
the group of individuals who will actually participate in the research.
Step 1: The first stage in the sampling process is to clearly define the target population.
Step 2: Sampling Frame — It is a list of items or people forming a population from which the
sample is taken.
Step 3: Choose a sampling technique (see the list below).
Step 4: Determine the sample size.
Step 5: Once the target population, sampling frame, sampling technique, and sample size have been established, the next step is to collect data from the sample.
• Simple Random Sampling: Everyone has an equal chance (like picking names from a
hat).
• Systematic Sampling: Pick every 5th, 10th, or 20th person from a list.
• Stratified Sampling: Divide people into groups (like by age or gender) and randomly
pick from each group.
• Cluster Sampling: Divide the population into clusters (like cities) and randomly select
whole clusters.
• Convenience Sampling: Pick whoever is easiest to reach (like asking your friends).
• Judgmental Sampling: You choose who you think is best to include.
• Snowball Sampling: Existing participants refer new participants (good for finding rare
groups).
• Quota Sampling: You pick people to meet a set number for each group (like 50 men and
50 women).
PROPOSAL DISTRIBUTION
A proposal distribution (denoted q(x′∣x)) is a probability distribution used to propose a new
state x′ given the current state x in the Markov chain. The choice of proposal distribution directly
impacts the efficiency and convergence of MCMC algorithms.
Example Robot Exploring a Maze
Imagine you have a robot in a maze. The robot is trying to find the best path to the goal, but it
doesn’t know the full layout of the maze. It can only try different moves and learn if they’re
good or bad over time.
The Goal: The robot wants to find the most likely paths (according to some hidden rules, like
shortest path or least danger). But it can’t sample these good paths directly — the maze is too
complex.
How the robot proceeds (this mirrors MCMC sampling):
1. The robot starts at some position in the maze.
2. At each step it thinks: "Let me randomly propose a move, like 'turn left' or 'move forward 2 steps', based on my current location." (This is the proposal distribution.)
3. If the move looks promising (based on some probability), it accepts it. If not, it rejects it and stays in place (or tries another).
• The maze is like the complex target probability distribution — we don’t know its full
shape.
• The robot’s random move is like sampling from the proposal distribution — a simple
way to suggest new positions (or parameter values).
• The accept/reject step helps the robot eventually explore the most important areas of
the maze — just like in MCMC, we sample more from the “important” regions of the
target distribution.
MARKOV CHAIN MONTE CARLO (MCMC)
Markov Chain Monte Carlo (MCMC) is a powerful technique used in statistics and various
scientific fields to sample from complex probability distributions. It is particularly useful when
directly sampling from the distribution is difficult or impossible. Here is a breakdown of the
name:
• Monte Carlo: This refers to a general approach using randomness to solve problems,
drawing inspiration from the element of chance involved in casino games.
• Markov Chain: This is a sequence of random events where the probability of the next
event depends only on the current event, not the history leading up to it.
MCMC combines constructing a Markov chain and recording samples from the chain. The chain
is designed to spend more time in regions with higher probability according to the target
distribution. Then, by recording states from the chain after it has ‘warmed up’ and reached a
stable state, you effectively get samples from the target distribution.
In the Metropolis-Hastings method, a proposed move from x to x′ is accepted with probability
α = min(1, [p(x′) · q(x∣x′)] / [p(x) · q(x′∣x)])
The min(1, ...) part ensures that the acceptance probability α is always between 0 and 1, which is required because probabilities cannot exceed 1.
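A minimal Metropolis sampler sketch in Python; the proposal is a symmetric Gaussian (so the q terms cancel), and the two-bump target density is an assumption for illustration:

# Metropolis sketch: sample a 1-D target using a symmetric Gaussian proposal.
import math, random

def target(x):                      # unnormalized target density (assumption)
    return math.exp(-0.5 * x * x) + 0.5 * math.exp(-0.5 * (x - 4) ** 2)

x = 0.0
samples = []
for step in range(10000):
    x_new = x + random.gauss(0, 1.0)            # propose from q(x'|x)
    alpha = min(1.0, target(x_new) / target(x)) # acceptance probability
    if random.random() < alpha:
        x = x_new                               # accept the move
    samples.append(x)                           # (else keep the current state)

kept = samples[1000:]              # discard the warm-up period
print(sum(kept) / len(kept))       # chain mean over the target distribution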
MCMC plays a crucial role in various aspects of machine learning, particularly when dealing
with complex probabilistic models or situations where direct sampling is difficult. Here are some
key ways it’s utilised:
What Are Some Common Applications of Markov Chain Monte Carlo In AI?
Markov Chain Monte Carlo finds several applications in various aspects of artificial intelligence
(AI), particularly when dealing with complex probabilistic models or situations where direct
sampling is impractical. Here are some common areas where MCMC plays a significant role:
Advantages Of MCMC: works for complex, high-dimensional distributions where direct sampling is impossible; very general and flexible; the samples can be used to estimate almost any property of the target distribution.
Disadvantages Of MCMC: can be slow to converge; successive samples are correlated; it is hard to diagnose whether the chain has reached its stable ("warmed up") state.
GIBBS SAMPLING
Gibbs sampling is a Markov Chain Monte Carlo (MCMC) method used in machine learning to
generate samples from a joint probability distribution, especially when direct sampling is
difficult. It works by iteratively sampling one variable at a time, given the current values of all
other variables, and repeating this process until a stable distribution of samples is achieved.
Example Imagine you and your friend are sharing a cake, but the size each of you takes
depends on the other person’s slice:
Over time, this back-and-forth settles into a stable pattern — that’s the balance, or the true
joint distribution of cake slices.
How it Works:
1. Start with an initial value for each variable.
2. Iterate: For each variable, sample a new value from its conditional distribution, given the current values of all the other variables.
3. Repeat: After many iterations (discarding an initial burn-in period), the collected samples approximate the joint distribution.
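A minimal Gibbs sampling sketch in Python for a two-variable (bivariate normal) distribution; the correlation value 0.8 is an assumption:

# Gibbs sampling sketch for a standard bivariate normal with correlation rho.
import math, random

rho = 0.8
x, y = 0.0, 0.0
samples = []
for step in range(10000):
    # Sample each variable from its conditional, given the other:
    x = random.gauss(rho * y, math.sqrt(1 - rho ** 2))
    y = random.gauss(rho * x, math.sqrt(1 - rho ** 2))
    samples.append((x, y))

kept = samples[1000:]            # discard burn-in
mean_x = sum(s[0] for s in kept) / len(kept)
print(mean_x)                    # close to 0, the true mean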
GRAPHICAL MODELS
Graphical models in machine learning use graphs to represent the relationships between variables
and their dependencies, providing a visual and structured way to model complex systems. These
models are broadly categorized into Bayesian networks (directed) and Markov random fields
(undirected). They are useful for tasks like prediction, inference, and decision-making by
capturing probabilistic dependencies and allowing efficient computation.
Example: Imagine you're modeling the likelihood of someone getting the flu.
• Fever (F)
• Cough (C)
• Fatigue (T)
• Flu (U)
F
/ \
U — C
\ /
T
This is an undirected graph, meaning the edges have no direction: connected variables directly influence each other, without one being designated as the cause of the other.
Applications: Markov random fields like this are commonly used in image denoising, image segmentation, and modelling other spatial data.
Advantages: They naturally capture symmetric, local dependencies between neighboring variables.
HIDDEN MARKOV MODELS
A Hidden Markov Model (HMM) describes a system whose true state is hidden, but which produces observations that depend on that state.
Example: If you're trying to guess the weather:
• Weather (Sunny or Rainy) is the hidden variable — you can't see it directly.
• What you can see is an observation: for example, whether your friend is carrying an umbrella.
The relationship between the hidden states and the observations is modeled using a probability distribution. The Hidden Markov Model (HMM) captures the relationship between the hidden states and the observations using two sets of probabilities: the transition probabilities and the emission probabilities.
1. Transition Probabilities
These describe the probability of moving from one hidden state to another.
Example:
Let’s say the hidden states are:
• Sunny
• Rainy
Then, for example:
• P(Sunny∣Sunny) = 0.8, P(Rainy∣Sunny) = 0.2
• P(Rainy∣Rainy) = 0.6, P(Sunny∣Rainy) = 0.4
These tell you how likely the weather is to stay the same or change from one day to the next.
2. Emission Probabilities
These describe the probability of seeing a particular observation, given a hidden state.
Example:
You don’t see the weather, but you see your friend:
• Carrying an umbrella
• Not carrying an umbrella
Then:
• P(Umbrella∣Rainy)=0.9
• P(Umbrella∣Sunny)=0.2
These tell you how likely each observation is, depending on the hidden state.
ML Applications of HMM: speech recognition, part-of-speech tagging, handwriting recognition, and bioinformatics (e.g., gene prediction).
FORWARD ALGORITHM
The Forward Algorithm is used to compute the probability of an observed sequence given an
HMM. Instead of checking all possible hidden state sequences (which is computationally
expensive), it efficiently sums over them using dynamic programming.
Example:
You’re a detective.
Each day, someone tells you what they did (like “walk”, “shop”, or “clean”) — but you don’t
know the weather that day (Sunny or Rainy).
You want to figure out:
How likely is it that this person did those things, based on what you know about the
weather?
You know: the probability of the weather changing from day to day (transition probabilities) and the probability of each activity given the weather (emission probabilities).
The Observations
Day 1: walk
Day 2: shop
Day 3: clean
How likely is it that this sequence (walk, shop, clean) could happen?
But you don’t know the weather on any day. That’s what’s “hidden.”
• Starting with Day 1: “If it was Sunny, how likely was 'walk'? If it was Rainy, how likely
was 'walk'?”
• Then moving to Day 2: “If yesterday was Sunny, how likely is today Sunny or Rainy, and
how likely is 'shop'?”
• It builds up the total probability step by step for each day.
• At the end, it adds everything up to find the total chance of the full observation (walk,
shop, clean).
In Short: The Forward Algorithm is a step-by-step way to add up all the possible hidden
weather paths that could explain what you saw, without checking every single path one by
one.
Let:
• α_t(i) = probability of seeing observations o₁, ..., o_t and being in state i at time t
• a_ij = transition probability from state i to state j
• b_j(o) = emission probability of observation o in state j
• π_i = initial probability of state i
Step 1: Initialization (t = 1): α₁(i) = π_i · b_i(o₁)
Step 2: Recursion (t = 2 to T): α_t(j) = [Σ_i α_{t−1}(i) · a_ij] · b_j(o_t)
Step 3: Termination: P(O) = Σ_i α_T(i)
In Simple Words: we move forward one day at a time, keeping a running probability for each possible hidden state, and sum over all states at the end.
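A small Python sketch of the forward algorithm for this example; all the probability values here are assumptions for illustration:

# Forward algorithm for the Sunny/Rainy example (probabilities are assumptions).
states = ["Sunny", "Rainy"]
pi = {"Sunny": 0.6, "Rainy": 0.4}                               # initial
a = {"Sunny": {"Sunny": 0.7, "Rainy": 0.3},                     # transition
     "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
b = {"Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},         # emission
     "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5}}

obs = ["walk", "shop", "clean"]

# Step 1: initialization
alpha = {s: pi[s] * b[s][obs[0]] for s in states}
# Step 2: recursion
for o in obs[1:]:
    alpha = {j: sum(alpha[i] * a[i][j] for i in states) * b[j][o] for j in states}
# Step 3: termination
print(sum(alpha.values()))   # probability of observing (walk, shop, clean)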
VITERBI ALGORITHM
The Viterbi Algorithm is a dynamic programming algorithm used to find the most probable
sequence of hidden states (called the Viterbi path) in a Hidden Markov Model (HMM), given
a sequence of observations.
The Viterbi Algorithm is used in Hidden Markov Models (HMMs) to solve the decoding
problem, which means:
Given a sequence of observations, what is the most likely sequence of hidden states that
generated it?
In an HMM, multiple hidden states can produce the same observation. This creates ambiguity
about which state produced the observation. So, the problem becomes:
Given a sequence of observations, which hidden state sequence is the most likely sequence
that generated these observations?
Example of Ambiguity:
• Sunny could produce "Walk, Walk, Shop" because people tend to walk or shop on sunny
days.
• Rainy could also produce "Walk, Walk, Shop" because people might walk quickly or
shop to avoid the rain.
But the probability of being in Sunny or Rainy at each time step, and the transition
probabilities between states, are different.
So, even though both states can result in similar observations, we need to figure out which state
sequence (Sunny-Rainy or Rainy-Sunny, etc.) is more probable over the entire observation
sequence.
• The Viterbi Algorithm helps by finding the most probable sequence of hidden states
(even if some states can produce similar observations) by considering:
o Transition probabilities (probability of moving from one state to another)
o Emission probabilities (probability of an observation occurring given a state)
Thus, it takes into account both the likelihood of observations given states and the likelihood
of state transitions.
What It Does:
• The algorithm efficiently searches for the most probable path (sequence of states)
through a trellis (a time vs. state graph), using dynamic programming.
• It avoids recalculating the same probabilities repeatedly by:
1. Storing the maximum probability of reaching each state at each time.
2. Keeping track of the path that led to that max probability.
Step 1: Initialization. For each state:
• Multiply:
o The probability of starting in that state
o By the probability of that state producing the first observation
Save that value — it tells us how likely it is to start in that state and see the first observation.
Step 2: Recursion. For each later time step and each state, consider every possible previous state: multiply the previous state's best probability by the transition probability and the emission probability, keep the maximum, and remember which previous state produced it.
Step 3: Termination and backtracking:
• Look at the last step and pick the state with the highest probability
• That state is the last step in the best path
• Use the remembered “best previous states” from each step to trace back and find the full
sequence of states
The output is:
• The most likely sequence of hidden states (like: Sunny → Sunny → Rainy)
• The probability of that sequence
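A small Python sketch of the Viterbi algorithm for the umbrella example; the probability values are assumptions for illustration:

# Viterbi for the umbrella example (the numbers are assumptions).
states = ["Sunny", "Rainy"]
pi = {"Sunny": 0.6, "Rainy": 0.4}
a = {"Sunny": {"Sunny": 0.7, "Rainy": 0.3},
     "Rainy": {"Sunny": 0.4, "Rainy": 0.6}}
b = {"Sunny": {"U": 0.2, "N": 0.8},     # P(Umbrella | state), P(No umbrella | state)
     "Rainy": {"U": 0.9, "N": 0.1}}

obs = ["U", "U", "N"]

# Initialization: best probability of each state at t=1
best = {s: pi[s] * b[s][obs[0]] for s in states}
paths = {s: [s] for s in states}

# Recursion: keep only the best way to reach each state
for o in obs[1:]:
    new_best, new_paths = {}, {}
    for j in states:
        prev = max(states, key=lambda i: best[i] * a[i][j])
        new_best[j] = best[prev] * a[prev][j] * b[j][o]
        new_paths[j] = paths[prev] + [j]
    best, paths = new_best, new_paths

# Termination: pick the state with the highest final probability and backtrack
last = max(states, key=lambda s: best[s])
print(paths[last], best[last])   # most likely weather sequence and its probability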
BAUM-WELCH ALGORITHM
The Baum-Welch algorithm learns the HMM parameters (π, A, B) from observed data; it is an instance of the Expectation-Maximization procedure.
Example Scenario: Imagine a simple HMM where the weather (Sunny or Rainy) is hidden, and we can
only observe whether someone is carrying an umbrella or not.
Model:
• Hidden States: Sunny (S), Rainy (R)
• Observations: Umbrella (U), No Umbrella (N)
• Initial State Probabilities: π = (π_S, π_R)
• Transition Matrix: A = [[p(S|S), p(R|S)], [p(S|R), p(R|R)]]
• Emission Matrix: B = [[p(U|S), p(N|S)], [p(U|R), p(N|R)]]
Baum-Welch Algorithm Steps:
1. Initialization:
Start with random guesses for π, A, and B.
2. E-Step (Expectation):
o Use the forward and backward algorithms (a part of the forward-backward
algorithm) to calculate the probability of each hidden state sequence given the
observed umbrella/no umbrella sequence.
o Specifically, the forward algorithm computes the probability of observing the data
up to a given time step, given a hidden state at that time. The backward algorithm
calculates the probability of observing the data from a given time step to the end,
given a hidden state at that time.
3. M-Step (Maximization):
o Update the HMM parameters (π, A, and B) based on the calculated probabilities
from the E-step.
o For example, the new initial state probabilities are calculated as the sum of the
forward and backward probabilities at time t=0, normalized by the sum of all
forward-backward probabilities.
TRACKING METHODS
In machine learning and robotics, tracking methods are used to estimate the state of a system
over time, especially when that state is partially observed and noisy. Common applications
include robot localization, object tracking in video, navigation, and sensor fusion.
Two popular tracking techniques are the Kalman Filter and the Particle Filter.
KALMAN FILTER
The Kalman Filter is a mathematical algorithm used to track or estimate the state of something over time, especially when the data is noisy or uncertain. It works best for linear systems with Gaussian noise.
How It Works:
The Kalman Filter estimates the current state of a system using a two-step process:
1. Prediction Step:
o Predict the current state from the previous state using a motion model.
o Predict the current uncertainty (covariance) as well.
2. Update Step:
o Get a new observation (measurement).
o Combine prediction and observation using a weighted average, giving more
weight to the more certain information.
o Update the estimate and reduce uncertainty.
Example: Tracking a car with noisy GPS readings:
1. Predict where the car should be based on its last known position and speed.
2. Update that guess using the new (noisy) GPS reading.
3. Combine both in a smart way to get a better estimate of the car's actual location.
Goal of Kalman Filter: To predict the next state of a system (like the position of a car), and
then update the prediction using noisy measurements (like GPS data), in the smartest possible
way.
STEP 1: Predict the state: x̂ₖ⁻ = A x̂ₖ₋₁ + B uₖ
STEP 2: Predict the uncertainty: Pₖ⁻ = A Pₖ₋₁ Aᵀ + Q
STEP 3: Compute the Kalman gain: Kₖ = Pₖ⁻ Hᵀ (H Pₖ⁻ Hᵀ + R)⁻¹
STEP 4: Update the estimate with the measurement zₖ: x̂ₖ = x̂ₖ⁻ + Kₖ (zₖ − H x̂ₖ⁻)
STEP 5: Update the uncertainty: Pₖ = (I − Kₖ H) Pₖ⁻
Pros: Computationally efficient; gives the statistically optimal estimate for linear systems with Gaussian noise.
Cons: Assumes linear dynamics and Gaussian noise, so it performs poorly on strongly non-linear or non-Gaussian systems.
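A minimal 1-D Kalman filter sketch in Python; the motion model, noise values, and GPS readings are assumptions for illustration:

# 1-D Kalman filter sketch: track a position from noisy measurements.
measurements = [5.1, 5.9, 7.2, 7.8, 9.1]   # made-up noisy readings
velocity = 1.0                             # assumed motion: +1 unit per step
q, r = 0.01, 0.5                           # process and measurement noise

x, p = 4.0, 1.0                            # initial estimate and uncertainty
for z in measurements:
    # Prediction step
    x = x + velocity                       # motion model
    p = p + q
    # Update step
    k = p / (p + r)                        # Kalman gain: how much to trust z
    x = x + k * (z - x)                    # blend prediction and measurement
    p = (1 - k) * p
    print(round(x, 2), round(p, 3))        # estimate and shrinking uncertainty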
PARTICLE FILTER
• A Particle Filter is a probabilistic algorithm used to estimate the state of a system over
time by representing it with a set of many random samples (called particles) and
updating them based on new data.
• Each particle is like a possible guess of the true state, and the algorithm uses a weighting
and resampling process to keep the best guesses and discard the bad ones.
• The Particle Filter is a method that uses lots of random guesses (called particles) to
figure out where the robot might be — and then keeps the best guesses. Designed for
non-linear and non-Gaussian systems.
Example: You're tracking where a robot is in a room. But you don’t know exactly where it is —
you only have a noisy sensor (like a blurry camera or weak GPS).
How it works:
1. Initialization – Start with 1000 random guesses of where the robot might be (particles).
2. Prediction – Move each guess based on the robot’s movement (e.g., it moved forward).
3. Update (Weighting) – Check how well each guess matches the new sensor reading.
o If it matches well, it gets a high weight.
o If not, low weight.
4. Resample – Keep only the best guesses (high weights) and throw away the bad ones.
o Make new guesses based on the best ones.
Pros:
• Handles non-linear and non-Gaussian systems.
• Flexible: the particle cloud can represent arbitrary, even multi-modal, beliefs about the state.
Cons:
• Computationally expensive.
• Requires many particles for accurate estimates.
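A minimal particle filter sketch in Python; the sensor model, motion model, and readings are assumptions for illustration:

# Particle filter sketch: estimate a robot's 1-D position from noisy readings.
import random, math

def likelihood(z, x, sigma=1.0):           # how well a particle explains reading z
    return math.exp(-0.5 * ((z - x) / sigma) ** 2)

particles = [random.uniform(0, 10) for _ in range(1000)]   # 1. initialization

for z in [3.2, 4.1, 5.0]:                  # noisy sensor readings (assumption)
    # 2. prediction: the robot moved forward ~1 unit, with motion noise
    particles = [x + 1.0 + random.gauss(0, 0.2) for x in particles]
    # 3. update: weight each particle by how well it matches the reading
    weights = [likelihood(z, x) for x in particles]
    # 4. resample: keep good guesses, drop bad ones
    particles = random.choices(particles, weights=weights, k=len(particles))

estimate = sum(particles) / len(particles)
print(estimate)                            # best guess of the robot's position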