Machine Learning Notes_ Concepts, Algorithms

The document outlines the syllabus for a Machine Learning course at JNTU Hyderabad, detailing course objectives, outcomes, and various learning units covering supervised, unsupervised, and reinforcement learning techniques. It includes descriptions of different algorithms, applications of machine learning, and the advantages and disadvantages of the field. Additionally, it discusses the foundational concepts of learning, the role of neurons in the brain, and key principles such as Hebb's rule.

Uploaded by

Mohammed Khaleel
Copyright: © All Rights Reserved

R22 B.Tech. CSE Syllabus, JNTU Hyderabad

CS601PC: MACHINE LEARNING
B.Tech. III Year II Sem.
L T P C
3 0 0 3
Course Objectives:
• To introduce students to the basic concepts and techniques of Machine Learning.
• To have a thorough understanding of the Supervised and Unsupervised learning techniques.
• To study the various probability-based learning techniques.
Course Outcomes:
• Distinguish between supervised, unsupervised and semi-supervised learning
• Understand algorithms for building classifiers applied to datasets with non-linearly separable
classes
• Understand the principles of evolutionary computing algorithms
• Design an ensemble to increase the classification accuracy
UNIT - I
Learning – Types of Machine Learning – Supervised Learning – The Brain and the Neuron – Design a
Learning System – Perspectives and Issues in Machine Learning – Concept Learning Task – Concept
Learning as Search – Finding a Maximally Specific Hypothesis – Version Spaces and the Candidate
Elimination Algorithm – Linear Discriminants: – Perceptron – Linear Separability – Linear Regression.
UNIT - II
Multi-layer Perceptron– Going Forwards – Going Backwards: Back Propagation Error – Multi-layer
Perceptron in Practice – Examples of using the MLP – Overview – Deriving Back-Propagation – Radial
Basis Functions and Splines – Concepts – RBF Network – Curse of Dimensionality – Interpolations and
Basis Functions – Support Vector Machines
UNIT - III
Learning with Trees – Decision Trees – Constructing Decision Trees – Classification and Regression
Trees – Ensemble Learning – Boosting – Bagging – Different ways to Combine Classifiers – Basic
Statistics – Gaussian Mixture Models – Nearest Neighbor Methods – Unsupervised Learning – K means
Algorithms
UNIT - IV
Dimensionality Reduction – Linear Discriminant Analysis – Principal Component Analysis – Factor
Analysis – Independent Component Analysis – Locally Linear Embedding – Isomap – Least Squares
Optimization – Evolutionary Learning – Genetic Algorithms – Genetic Offspring – Genetic Operators –
Using Genetic Algorithms
UNIT - V
Reinforcement Learning – Overview – Getting Lost Example – Markov Chain Monte Carlo Methods –
Sampling – Proposal Distribution – Markov Chain Monte Carlo – Graphical Models – Bayesian
Networks – Markov Random Fields – Hidden Markov Models – Tracking Methods
TEXT BOOKS:
1. Stephen Marsland, "Machine Learning – An Algorithmic Perspective", Second Edition,
Chapman and Hall/CRC Machine Learning and Pattern Recognition Series, 2014.

Prepared by Dr. Syeda Husna Mehanoor


UNIT - I
Learning – Types of Machine Learning – Supervised Learning – The Brain and the Neuron
– Design a Learning System – Perspectives and Issues in Machine Learning – Concept
Learning Task – Concept Learning as Search – Finding a Maximally Specific Hypothesis –
Version Spaces and the Candidate Elimination Algorithm – Linear Discriminants: –
Perceptron – Linear Separability – Linear Regression.

LEARNING
Definition of learning
A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E.

Examples
i) Handwriting recognition learning problem
• Task T: Recognising and classifying handwritten words within images
• Performance P: Percent of words correctly classified
• Training experience E: A dataset of handwritten words with given classifications

ii) A robot driving learning problem


• Task T: Driving on highways using vision sensors
• Performance measure P: Average distance traveled before an error
• Training experience E: A sequence of images and steering commands recorded while observing a
human driver

iii) A chess learning problem


• Task T: Playing chess
• Performance measure P: Percent of games won against opponents
• Training experience E: Playing practice games against itself

Therefore, a computer program which learns from experience is called a machine learning
program or simply a learning program. Such a program is sometimes also referred to as a learner.

Machine Learning
Machine learning enables a machine to learn automatically from data, improve its performance with
experience, and make predictions without being explicitly programmed.

A machine learning system learns from historical data, builds a prediction model, and predicts
the output whenever it receives new data. The accuracy of the predicted output depends in part on
the amount of data, since a larger amount of data generally helps to build a model that predicts
more accurately. Suppose we have a complex problem that requires predictions. Instead of writing
code for it by hand, we feed the data to generic algorithms, and with the help of these algorithms
the machine builds the logic from the data and predicts the output.

Arthur Samuel, an early American leader in the field of computer gaming and artificial
intelligence, coined the term “Machine Learning” in 1959 while at IBM. He defined machine
learning as “the field of study that gives computers the ability to learn without being explicitly
programmed.” However, there is no universally accepted definition for machine learning.
Different authors define the term differently.

How Does Machine Learning Work?

A machine learning system learns from previous data, builds a prediction model, and predicts the
output for new data whenever it receives it. The accuracy of the predicted output depends on the
amount of data, because more data generally helps to build a better model. Suppose we have a
complex problem in which we need to make predictions. Instead of writing code for it, we feed the
data to generic algorithms, which build the logic from the data and predict the output. Machine
learning has changed the way we approach such problems.

[Figure: block diagram of the operation of a machine learning algorithm]

Features of Machine Learning:


• Machine learning uses data to detect various patterns in a given dataset.
• It can learn from past data and improve automatically.
• It is a data-driven technology.
• Machine learning is closely related to data mining, as it also deals with huge amounts of
data.

NEED FOR MACHINE LEARNING

The following key points show the importance of Machine Learning:
• Rapid growth in the production of data
• Solving complex problems that are difficult for humans
• Decision making in various sectors, including finance
• Finding hidden patterns and extracting useful information from data

Applications of Machine Learning

Application: Description

• Facial Recognition: Image recognition used for face detection, pattern recognition, and personal identification.
• Image Recognition: Identifies objects, people, and places in digital images (e.g., Facebook's auto-tagging feature).
• Speech Recognition: Converts voice instructions into text (e.g., Google Assistant, Siri, Alexa, Cortana).
• Sentiment Analysis: Determines emotions or opinions in text, used in reviews, emails, and decision-making applications.
• Self-Driving Cars: Uses machine learning models to detect people and objects and to navigate autonomously (e.g., Tesla).
• Product Recommendations: Tracks user behavior to suggest products on e-commerce platforms based on past purchases and browsing history.
• Fraud Detection: Detects fraudulent transactions using neural networks to identify fake accounts and unauthorized activities.
• Chatbots: AI-driven chat assistants provide customer support, collect insights, and enhance user engagement.
• Medical Image Analysis: Assists in diagnosing diseases (e.g., cancer detection in X-rays, MRIs, and CT scans).
• Natural Language Processing (NLP): Enables machines to understand and generate human language (e.g., chatbots, translation, text summarization).
• Predictive Maintenance: Analyzes sensor data to predict equipment failures, reducing downtime and optimizing maintenance.
• Financial Trading: Uses ML models to analyze market data, predict stock trends, and automate trading strategies.
• Cybersecurity: Detects anomalies, identifies threats, and prevents cyberattacks such as malware and phishing.

Advantages & Disadvantages of Machine Learning

Advantages

• Automation of Complex Tasks
  Example: Object detection in self-driving cars.
  Explanation: ML algorithms can identify objects like pedestrians, traffic lights, and other vehicles in real time, enabling autonomous navigation.
• Enhanced Accuracy and Precision
  Example: Medical image analysis for disease diagnosis.
  Explanation: ML models detect subtle patterns in medical images with high accuracy, aiding in early diagnosis.
• Improved Efficiency and Scalability
  Example: Facial recognition in security systems.
  Explanation: ML processes large volumes of images quickly, making it ideal for real-time surveillance.
• Adaptability and Continuous Learning
  Example: Image classification for product categorization in e-commerce.
  Explanation: ML models improve over time by adapting to new data, ensuring accurate categorization of new products.
• Uncovering Hidden Insights
  Example: Image analysis for market research.
  Explanation: ML analyzes large datasets to identify trends and patterns that might not be apparent to human analysts.

Disadvantages

• Data Dependency
  Example: Facial recognition system trained mainly on Caucasian faces.
  Explanation: If training data is biased, incomplete, or noisy, model performance is negatively impacted.
• High Computational Costs
  Example: Training large-scale image classification models.
  Explanation: Deep learning models require significant processing power and time, sometimes taking days or weeks to train.
• Lack of Transparency (Black Box Problem)
  Example: Unexpected decisions by self-driving cars.
  Explanation: Complex ML models make it difficult to interpret decision-making processes, leading to trust issues.
• Potential for Bias
  Example: AI-driven job recruitment system.
  Explanation: Models trained on biased historical data may reinforce existing discrimination in hiring practices.
• Security Vulnerabilities
  Example: Adversarial attacks on self-driving cars.
  Explanation: Malicious actors can manipulate input data (e.g., road signs) to deceive ML models, causing incorrect predictions.

TYPES OF LEARNING
Here are brief definitions for different types of machine learning:

1. Supervised Learning: A type of machine learning where the model is trained on labeled
data, meaning both input and output are provided. Example: Spam email detection.
2. Unsupervised Learning: The model learns patterns and structures from unlabeled data
without explicit outputs. Example: Customer segmentation.
3. Semi-Supervised Learning: Combines aspects of both supervised and unsupervised
learning by using a small amount of labeled data along with a large amount of unlabeled
data. Example: Medical diagnosis with limited labeled samples.
4. Reinforcement Learning: The model learns through trial and error by interacting with
an environment and receiving rewards or penalties. Example: Training an AI to play
chess.
5. Evolutionary Learning: A type of machine learning inspired by natural selection, where
algorithms evolve over generations by selecting the best solutions and applying mutations
or crossovers. Example: Genetic algorithms used for optimization problems.

SUPERVISED LEARNING

Supervised learning is a type of machine learning in which machines are trained using well
"labelled" training data and, on the basis of that data, predict the output. Labelled data
means that the input data is already tagged with the correct output.

In supervised learning, the training data provided to the machine works as a supervisor that
teaches the machine to predict the output correctly. It applies the same concept as a student
learning under the supervision of a teacher.

Supervised learning is the process of providing input data as well as the correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping
function that maps the input variable (x) to the output variable (y). In the real world, supervised
learning can be used for risk assessment, image classification, fraud detection, spam filtering,
etc.

How Supervised Learning Works

In supervised learning, models are trained using a labelled dataset, from which the model learns
about each type of data. Once the training process is completed, the model is tested on test
data (a held-out part of the labelled dataset that was not used for training), for which it predicts
the output.

The working of supervised learning can be understood through the following example:

Suppose we have a dataset of different types of shapes, which includes squares, rectangles,
triangles, and other polygons such as hexagons. The first step is to train the model on each shape.

• If the given shape has four sides, and all the sides are equal, then it is labelled as
a square.
• If the given shape has three sides, then it is labelled as a triangle.
• If the given shape has six equal sides, then it is labelled as a hexagon.

After training, we test our model using the test set, and the task of the model is to identify
the shape.

The machine has been trained on all these types of shapes, so when it encounters a new shape, it
classifies the shape on the basis of its number of sides and predicts the output.
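
The shape example above can be sketched in Python as follows. This is only an illustrative sketch: the feature encoding (number of sides, whether all sides are equal) and the function names `train` and `predict` are assumptions, and the "learner" simply memorises the most frequent label for each feature combination, which is much simpler than a real classifier.

```python
from collections import Counter, defaultdict

def train(labelled_shapes):
    """Learn the most frequent label for each (sides, all_equal) feature pair."""
    votes = defaultdict(Counter)
    for features, label in labelled_shapes:
        votes[features][label] += 1
    return {features: counts.most_common(1)[0][0] for features, counts in votes.items()}

def predict(model, features):
    """Return the learned label, or 'unknown' for a feature pair never seen in training."""
    return model.get(features, "unknown")

# Labelled training data: ((number_of_sides, all_sides_equal), label)
training_set = [
    ((4, True), "square"),
    ((4, True), "square"),
    ((3, False), "triangle"),
    ((3, True), "triangle"),
    ((6, True), "hexagon"),
]

model = train(training_set)
print(predict(model, (4, True)))   # square
print(predict(model, (6, True)))   # hexagon
print(predict(model, (5, False)))  # unknown
```

The last call shows the limitation of pure memorisation: a shape whose features never appeared in the training data cannot be labelled.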

Steps Involved in Supervised Learning:

• First, determine the type of training dataset.
• Collect the labelled training data.
• Split the labelled dataset into a training set, a validation set, and a test set.
• Determine the input features of the training dataset, which should carry enough
information for the model to predict the output accurately.
• Determine a suitable algorithm for the model, such as a support vector machine, a decision
tree, etc.
• Execute the algorithm on the training set. Sometimes a validation set, held out from the
training data, is needed to tune control parameters.
• Evaluate the accuracy of the model on the test set. If the model predicts the correct
outputs for the test set, the model is accurate.
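
The steps above can be sketched end to end in a few lines of Python. Everything here is an assumption made for illustration: the toy two-cluster dataset, the 60/20/20 split, and the choice of a nearest-centroid classifier (picked only because it is short; the notes themselves suggest algorithms such as support vector machines or decision trees).

```python
import random

def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the labelled rows and cut them into train/validation/test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]

def fit_centroids(train_rows):
    """Training step: compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for features, label in train_rows:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Predict the class whose centroid is closest to the given features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

def accuracy(centroids, rows):
    """Fraction of rows whose predicted label equals the true label."""
    return sum(predict(centroids, f) == y for f, y in rows) / len(rows)

# Toy labelled data (assumed): two well-separated clusters of 10 points each.
data = [((0.1 * i, 0.1 * i), "low") for i in range(10)]
data += [((5 + 0.1 * i, 5 + 0.1 * i), "high") for i in range(10)]

train_rows, val_rows, test_rows = split_dataset(data)
model = fit_centroids(train_rows)
print("validation accuracy:", accuracy(model, val_rows))
print("test accuracy:", accuracy(model, test_rows))
```

The validation set would normally be used to compare settings (for example different algorithms or parameters) before the test set is touched once at the end.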

Types of supervised Machine learning Algorithms:

Supervised learning can be further divided into two types of problems:

1. Regression

Regression algorithms are used when the output variable is continuous and is related to the
input variables. They are used for the prediction of continuous quantities, as in weather
forecasting, market trends, etc. Below are some popular regression algorithms which come under
supervised learning:

• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
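
As an illustration of the first algorithm in the list, here is a minimal sketch of simple linear regression (one input variable) fitted by ordinary least squares. The function name `fit_line` and the toy data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data generated from y = 2x + 1 (assumed, noise-free)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)          # 2.0 1.0
print(slope * 5.0 + intercept)   # prediction for x = 5 -> 11.0
```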

2. Classification

Classification algorithms are used when the output variable is categorical, which means it takes
one of a fixed set of classes, such as Yes-No, Male-Female, True-False, etc. Below are some popular
classification algorithms which come under supervised learning:

• Random Forest
• Decision Trees
• Logistic Regression
• Support vector Machines

Advantages of Supervised learning:

• With the help of supervised learning, the model can predict the output on the basis of
prior experiences.
• In supervised learning, we can have an exact idea about the classes of objects.
• Supervised learning model helps us to solve various real-world problems such as fraud
detection, spam filtering, etc.

Disadvantages of supervised learning:

• Supervised learning models are not well suited to some complex tasks.
• Supervised learning may not predict the correct output if the test data differs substantially
from the training data.
• Training can require a lot of computation time.
• In supervised learning, we need enough knowledge about the classes of objects.

THE BRAIN AND THE NEURON

The brain is an amazing system that can handle messy and complicated information (like
pictures) and give quick and accurate answers. It’s made up of simple building blocks called
neurons, which send signals when activated. These signals travel through connections called
synapses, creating a huge network of about 100 trillion links. Even as we age and lose neurons,
the brain keeps working well.

Each neuron acts like a tiny decision-maker in a massive network of about 100 billion neurons. This
has inspired scientists to create AI systems that try to copy how the brain learns. The brain learns
through plasticity, i.e. its ability to change and adapt by modifying the strength of the
connections (called synapses) between neurons or by forming new connections altogether. This is
how the brain learns and remembers things. One famous idea, suggested by Donald
Hebb in 1949, is that learning happens when neurons that frequently work together strengthen
their connection.

Hebb’s Rule

Hebb's rule is a simple idea: if two neurons fire at the same time repeatedly, their connection
becomes stronger. On the other hand, if they never fire together, their connection weakens and
might disappear. This is how the brain learns to associate things.

Here’s an example: Imagine that every time you see your grandmother, she gives you chocolate.
Neurons in your brain that recognize your grandmother and neurons that make you happy about
chocolate will fire at the same time. Over time, their connection strengthens. Eventually, just
seeing your grandmother (even in a photo) makes you think of chocolate. This is similar to
classical conditioning, where Pavlov trained dogs to associate a bell with food. When the bell
and food were paired repeatedly, the dogs began to salivate at the sound of the bell alone because
the "bell" neurons and "salivation" neurons became strongly connected.

This strengthening of connections is called long-term potentiation, a form of neural plasticity,
and it’s a real process in our brains that helps us learn and form memories.
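
Hebb's rule can be written as a small numeric update. The sketch below is one common textbook-style formalisation, not something stated in these notes: the learning rate, the decay term used to model weakening, and the function name `hebbian_update` are all assumptions.

```python
def hebbian_update(weight, pre, post, learning_rate=0.1, decay=0.02):
    """Return the new connection strength after one time step.

    pre and post are 1 if the corresponding neuron fired, 0 otherwise.
    """
    strengthened = weight + learning_rate * pre * post  # grows only when both fire
    return strengthened - decay * weight                # slow weakening (assumed term)

w = 0.0
for _ in range(20):   # "grandmother" and "chocolate" neurons fire together
    w = hebbian_update(w, pre=1, post=1)
paired = w

for _ in range(20):   # the two neurons stop firing together
    w = hebbian_update(w, pre=1, post=0)

print(round(paired, 3), round(w, 3))
```

The connection grows while the two neurons fire together and shrinks again once they stop, mirroring the strengthening and weakening described above.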

McCulloch and Pitts Neurons

Scientists have studied neurons and created a mathematical model of them to simplify
understanding. Real neurons are tiny and hard to study, but Hodgkin and Huxley studied large
neurons in squids to measure how they work, earning them a Nobel Prize. Later, McCulloch and
Pitts created a simplified model of a neuron in 1943 that focused on the essential parts.

Understanding the McCulloch and Pitts Neuron Model

Imagine the neuron model as a simple flowchart with four main parts:

1. Inputs with Weights
2. Summation (Adder)
3. Activation Function (Threshold)
4. Output

Here's a breakdown of each part:

1. Inputs with Weights (w₁, w₂, w₃, ...)

• Inputs (x₁, x₂, x₃, ...): These are signals coming into the neuron from other neurons.
Think of them as messages or pieces of information.
• Weights (w₁, w₂, w₃, ...): Each input has a weight that represents the strength or
importance of that input. A higher weight means the input has a stronger influence on the
neuron's decision to fire.

Example:

• x1=1 (active)
• x2=0 (inactive)
• x3=0.5 (partially active)
• Weights: w1=1, w2=−0.5, w3=−1

2. Summation (Adder)

• The neuron adds up all the inputs after they’ve been multiplied by their respective
weights.
• Formula: h=w1x1+w2x2+w3x3+…

Using the Example:

• h = (1×1)+(0×−0.5)+(0.5×−1)=1+0+(−0.5)=0.5
• h=0.5

3. Activation Function (Threshold θ)

• After summing the inputs, the neuron decides whether to "fire" (send a signal) or not
based on a threshold value (θ).
• Decision Rule:
o If h > θ, the neuron fires (output = 1)
o If h ≤ θ, the neuron does not fire (output = 0)

Using the Example:

• Let's set θ=0
• Since h=0.5>0, the neuron fires (output = 1)

4. Output

• The result of the activation function is the neuron's output, which can be sent to other
neurons.
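
The four parts above fit in a few lines of Python. This is a minimal sketch using the same inputs, weights, and threshold as the worked example; the function name `mcculloch_pitts` is an assumption.

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Weighted sum followed by a hard threshold; returns (h, output)."""
    h = sum(w * x for w, x in zip(weights, inputs))  # summation (adder)
    output = 1 if h > threshold else 0               # fire only if h exceeds the threshold
    return h, output

# Values from the worked example: x = (1, 0, 0.5), w = (1, -0.5, -1), threshold 0
h, output = mcculloch_pitts([1, 0, 0.5], [1, -0.5, -1], threshold=0)
print(h, output)  # 0.5 1
```

Raising the threshold to 0.5 or above would stop this neuron from firing on the same inputs, because the rule requires h to be strictly greater than θ.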

Key Features:

• Simple Decision-Maker: Despite its simplicity, this model can perform basic decisions
based on input signals.
• Foundation for Neural Networks: Multiple such neurons can be connected to form
complex networks capable of more advanced computations.
• Adjusting Weights: Learning in neural networks involves adjusting these weights to
improve decision-making based on data.

Real-World Analogy: Think of the neuron as a light switch system

• Inputs (x₁, x₂, x₃): Different sensors detecting things (like motion, light, sound).
• Weights (w₁, w₂, w₃): The importance of each sensor in deciding whether to turn on the
light.
• Summation (h): Adding up the signals from all sensors.
• Threshold (θ): The level of combined signals needed to decide to turn the light on.

• Output: The light is either on (1) or off (0).

By adjusting the weights, you can make the system more or less sensitive to certain sensors, just
like training a neural network to recognize patterns.

Limitations of the McCulloch and Pitts Neuronal Model

The McCulloch and Pitts (M&P) neuron model is a simplified version of how real neurons work.
While it has been influential in early neural network models, it has several limitations when
compared to actual biological neurons.

1. Simplified Summing: In the McCulloch and Pitts model, inputs to the neuron are simply
added together in a linear fashion. Real neurons, however, may have non-linear
interactions, meaning their inputs don’t just add up but interact in more complex ways.
2. Single Output vs. Spike Train: The M&P neuron produces just one output, either firing
or not firing, based on a threshold. Real neurons, however, send out a series of pulses,
called a "spike train," to represent information. So, real neurons don't just decide whether
to fire or not—they generate a sequence of signals that encode data.
3. Changing Thresholds: In the M&P model, the threshold for firing is constant. In real
neurons, the threshold can change depending on the current state of the organism, like
how much neurotransmitter is available, which influences the neuron’s sensitivity.
4. Asynchronous vs. Synchronous Updates: The M&P model updates neurons in a
regular, clocked sequence (synchronously). Real neurons don't work this way; they
update asynchronously, meaning they fire at different times, influenced by random
factors, not just a regular time cycle.
5. Excitatory and Inhibitory Weights: The M&P model allows weights (connections
between neurons) to change from positive to negative, which isn’t seen in real neurons. In
the brain, synaptic connections are either excitatory (increase the likelihood of firing) or
inhibitory (decrease the likelihood of firing), and they don't switch from one type to the
other.
6. Feedback Loops: Real neurons can have feedback connections where a neuron connects
back to itself. The M&P model typically doesn't include this, although it’s a feature in
some more advanced models.
7. Biological Complexity Ignored: The M&P model focuses on the basic idea of deciding
whether a neuron fires or not, leaving out more complex biological factors, such as
chemical concentrations or refractory periods (the time it takes for a neuron to reset
before firing again).

DESIGN A LEARNING SYSTEM

According to Tom Mitchell, “A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its performance at tasks in T, as measured
by P, improves with experience E.”
Example: In spam e-mail detection,
• Task, T: To classify mails as “Spam” or “Not Spam”.
• Performance measure, P: Total percent of mails being correctly classified as being
“Spam” or “Not Spam”.
• Experience, E: A set of mails labelled “Spam” or “Not Spam”.

Steps for Designing a Learning System:

Step 1- Choosing the Training Experience: The first and very important task is to choose the
training data or training experience that will be fed to the machine learning algorithm. The
data or experience fed to the algorithm has a significant impact on the success or failure of
the model, so it should be chosen wisely.
The following attributes of the training experience affect the success or failure of the learner:
• Whether the training experience provides direct or indirect feedback regarding the
choices made. For example, while playing chess, the training data may provide feedback
such as: had a different move been chosen here, the chances of success would have increased.
• The degree to which the learner controls the sequence of training examples. For
example, when training data is first fed to the machine its accuracy is low, but as it
gains experience by playing again and again against itself or an opponent, the algorithm
receives feedback and adjusts its play accordingly.
• How well the training experience represents the distribution of examples over
which performance will be measured. For example, a machine learning algorithm
gains experience by going through many different cases and examples, and its
performance increases as those examples cover more of the situations on which it
will be measured.

Step 2- Choosing the Target Function: The next important step is choosing the target function.
Based on the knowledge fed to the algorithm, the learner uses a NextMove function that
describes which legal move should be taken. For example, while playing chess, after the
opponent has played, the machine learning algorithm must determine which of the possible
legal moves to make in order to succeed.

Step 3- Choosing a Representation for the Target Function: Once the algorithm knows all the
possible legal moves, the next step is to choose the best move using some representation,
e.g. linear equations, a hierarchical graph representation, or a tabular form. The NextMove
function selects, from the candidate moves, the one expected to give the highest success
rate. For example, if the machine has 4 possible moves while playing chess, it chooses the
move most likely to lead to success.
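
Steps 2 and 3 can be sketched with a linear representation, one of the options named above. Everything specific in this sketch is hypothetical: the two board features (material gained, own pieces left under attack), the weight values, the move names, and the helper `next_move` standing in for the NextMove function.

```python
def evaluate(features, weights):
    """Linear representation: w0 + w1*x1 + w2*x2 + ..."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def next_move(candidates, weights):
    """Pick the legal move whose resulting position scores highest."""
    return max(candidates, key=lambda move: evaluate(candidates[move], weights))

# Hypothetical weights: constant term, material gained, own pieces left under attack
weights = [0.0, 1.0, -0.5]

# Four possible legal moves, each described by (material gained, pieces under attack)
candidates = {
    "capture_pawn": [1, 0],
    "advance_rook": [0, 2],
    "trade_knights": [1, 1],
    "quiet_move": [0, 0],
}

print(next_move(candidates, weights))  # capture_pawn
```

With these weights the scores are 1.0, -1.0, 0.5 and 0.0, so the first move is selected; different weights would change the choice, which is why learning the weights matters.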

Step 4- Choosing a Function Approximation Algorithm: An optimized move cannot be
chosen from the training data alone. The learner has to work through a set of examples;
from these examples it approximates which moves should be chosen, and it then receives
feedback on them. For example, when training data for playing chess is fed to the algorithm,
the machine will sometimes fail and sometimes succeed, and from each failure or success it
estimates which move to choose next and what that move's success rate is.

Step 5- Final Design: The final design emerges after the system has gone through a number of
examples, failures and successes, and correct and incorrect decisions, and has learned what
the next step should be. Example: Deep Blue, IBM's chess-playing computer, won a match against
world chess champion Garry Kasparov in 1997, becoming the first computer to defeat a reigning
world chess champion in a match.

PERSPECTIVES AND ISSUES IN MACHINE LEARNING

One useful perspective on machine learning is that it involves searching a very large space of
possible hypotheses to determine one that best fits the observed data and any prior knowledge
held by the learner. For example, consider the space of hypotheses that could in principle be
output by a checkers learner whose evaluation function is a weighted sum of board features. This
hypothesis space consists of all evaluation functions that can be represented by some choice of
values for the weights w0 through w6. The learner's task is thus to search through this vast
space to locate the hypothesis that is most consistent with the available training examples.
The LMS algorithm for fitting weights achieves this goal by iteratively tuning the weights,
adding a correction to each weight each time the hypothesized evaluation function predicts a
value that differs from the training value. This algorithm works well when the hypothesis
representation considered by the learner defines a continuously parameterized space of
potential hypotheses.
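
The LMS correction described above can be sketched as follows. This is a minimal illustration under stated assumptions: a single feature instead of six, a learning rate of 0.1, and toy training values generated from the assumed target 2 + 3·x1; none of these numbers come from the notes.

```python
def predict(weights, features):
    """Linear evaluation function; features[0] is fixed at 1 so weights[0] plays the role of w0."""
    return sum(w * x for w, x in zip(weights, features))

def lms_update(weights, features, target, learning_rate=0.1):
    """Add a correction to each weight, proportional to the prediction error."""
    error = target - predict(weights, features)
    return [w + learning_rate * error * x for w, x in zip(weights, features)]

# Toy training values generated from the assumed target 2 + 3*x1
examples = [([1.0, 0.0], 2.0), ([1.0, 1.0], 5.0), ([1.0, 2.0], 8.0)]

weights = [0.0, 0.0]
for _ in range(500):                      # repeated passes over the training examples
    for features, target in examples:
        weights = lms_update(weights, features, target)

print([round(w, 3) for w in weights])
```

Each update moves the weights a small step in the direction that reduces the error on the current example, so the search through the continuous weight space is gradual rather than exhaustive.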

Issues in Machine Learning

The checkers example raises a number of generic questions about machine learning. The field of
machine learning is concerned with answering questions such as the following:

• What algorithms exist for learning general target functions from specific training
examples? In what settings will particular algorithms converge to the desired function,
given sufficient training data? Which algorithms perform best for which types of
problems and representations?
• How much training data is sufficient? What general bounds can be found to relate the
confidence in learned hypotheses to the amount of training experience and the character
of the learner's hypothesis space?
• When and how can prior knowledge held by the learner guide the process of generalizing
from examples? Can prior knowledge be helpful even when it is only approximately
correct?
• What is the best strategy for choosing a useful next training experience, and how does the
choice of this strategy alter the complexity of the learning problem?
• What is the best way to reduce the learning task to one or more function approximation
problems? Put another way, what specific functions should the system attempt to learn?
Can this process itself be automated?
• How can the learner automatically alter its representation to improve its ability to
represent and learn the target function?

CONCEPT LEARNING TASK
Concept learning is a fundamental task in machine learning that involves training a model to
recognize and categorize patterns or concepts from a set of examples or data points. It's like
teaching a machine to understand the underlying rules of a specific concept, such as identifying a
cat in an image or predicting whether a customer will make a purchase.

Key Concepts
• Target Concept: The underlying rule or pattern that the model aims to learn.
• Training Data: A set of labeled examples used to train the model. Each example consists
of an input and its corresponding output (label).
• Hypothesis: A proposed rule or function that the model learns from the training data.
• Generalization: The ability of the model to accurately classify new, unseen data based on
the learned concept.

How Concept Learning Works


• Data Preparation: The training data is carefully prepared, ensuring it is representative of
the target concept and free from biases.
• Model Selection: An appropriate machine learning algorithm is chosen based on the
nature of the data and the desired outcome. Common algorithms include decision trees,
support vector machines, and neural networks.
• Training: The model is trained on the labeled data, iteratively adjusting its parameters to
minimize errors and improve its ability to predict the target concept.
• Evaluation: The trained model is evaluated on a separate test dataset to assess its
generalization performance. Metrics such as accuracy, precision, and recall are used to
measure the model's effectiveness.
• Refinement: Based on the evaluation results, the model may be further refined or
retrained to improve its performance.

Applications of Concept Learning
• Image Recognition: Identifying objects, faces, and scenes in images.
• Natural Language Processing: Understanding and generating human language, such as
sentiment analysis and machine translation.
• Fraud Detection: Identifying unusual patterns in financial transactions that may indicate
fraudulent activity.
• Medical Diagnosis: Assisting doctors in diagnosing diseases based on patient data.
• Customer Segmentation: Grouping customers based on their behavior and preferences for
targeted marketing.

CONCEPT LEARNING AS SEARCH

Concept learning involves exploring a hypothesis space to identify the hypothesis that best
explains the training examples. This hypothesis space is implicitly defined by the hypothesis
representation chosen by the learning algorithm designer. By selecting a specific representation,
the designer determines the space of all hypotheses the program can represent and learn.



Example: EnjoySport Learning Task

In the EnjoySport learning task, we aim to find a hypothesis (rule) that determines whether the
weather conditions are favorable for enjoying sports. Let's break it down step by step.

The instance space X is the set of all possible combinations of weather attribute values. The
attributes and their possible values are:

1. Sky: {Sunny, Cloudy, Rainy} → 3 values
2. AirTemp: {Warm, Cold} → 2 values
3. Humidity: {High, Normal} → 2 values
4. Wind: {Strong, Light} → 2 values
5. Water: {Warm, Cool} → 2 values
6. Forecast: {Same, Change} → 2 values

Number of possible instances:

To find the total number of possible weather conditions (instances in X), multiply the
number of possible values for each attribute:

∣X∣ = 3 × 2 × 2 × 2 × 2 × 2 = 3 × 32 = 96

So, there are 96 distinct weather instances.

Hypothesis Space (H)

A hypothesis is a rule that classifies instances as positive or negative. Each hypothesis is a
conjunction of constraints, one per attribute.

• The constraint on an attribute takes one of three forms:
o A specific value (e.g., "Sunny")
o A wildcard (?), meaning any value is acceptable
o "Ø", meaning no value is acceptable, so the hypothesis matches no instances

Syntactically distinct hypotheses: each attribute gains two options in addition to its own
values: ? and Ø. The hypothesis made entirely of ? is the most general (it accepts every
instance), and the hypothesis made entirely of Ø is the most specific (it rejects every instance).

Sky therefore has 5 options, and each of the other five attributes has 4:

1. Sky: {Ø, Sunny, Cloudy, Rainy, ?} → 5 values
2. AirTemp: {Ø, Warm, Cold, ?} → 4 values
3. Humidity: {Ø, High, Normal, ?} → 4 values
4. Wind: {Ø, Strong, Light, ?} → 4 values
5. Water: {Ø, Warm, Cool, ?} → 4 values
6. Forecast: {Ø, Same, Change, ?} → 4 values

Multiplying these counts, which include the hypotheses containing "Ø", we get:

∣H∣ = 5 × 4 × 4 × 4 × 4 × 4 = 5120

So, there are 5120 syntactically distinct hypotheses.

Semantically distinct hypotheses:

Any hypothesis containing one or more "Ø" classifies every instance as negative, so all such
hypotheses are equivalent and count as a single hypothesis. Setting Ø aside, the options for each
attribute are:

1. Sky: {Sunny, Cloudy, Rainy, ?} → 4 values
2. AirTemp: {Warm, Cold, ?} → 3 values
3. Humidity: {High, Normal, ?} → 3 values
4. Wind: {Strong, Light, ?} → 3 values
5. Water: {Warm, Cool, ?} → 3 values
6. Forecast: {Same, Change, ?} → 3 values

Adding 1 for the single all-negative hypothesis, the number of semantically distinct hypotheses is:

1 + (4 × 3 × 3 × 3 × 3 × 3) = 1 + (4 × 243) = 1 + 972 = 973

Concept learning then searches this space of hypotheses for the one that best fits the training
examples.
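The three counts above can be checked with a few lines of Python. This is a minimal sketch; the dictionary and variable names are illustrative and not part of the original notes.

```python
from math import prod

# Number of values each attribute can take (from the attribute lists above).
values_per_attribute = {
    "Sky": 3, "AirTemp": 2, "Humidity": 2,
    "Wind": 2, "Water": 2, "Forecast": 2,
}
counts = list(values_per_attribute.values())

# Instances: one concrete value per attribute.
num_instances = prod(counts)

# Syntactically distinct hypotheses: each attribute may also be "?" or "Ø".
num_syntactic = prod(c + 2 for c in counts)

# Semantically distinct hypotheses: each attribute may also be "?",
# plus one hypothesis standing for every tuple that contains "Ø".
num_semantic = 1 + prod(c + 1 for c in counts)

print(num_instances, num_syntactic, num_semantic)  # 96 5120 973
```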

FINDING A MAXIMALLY SPECIFIC HYPOTHESIS

The FIND-S algorithm is a simple way to find a rule (or hypothesis) that matches all the positive
examples in a dataset while ignoring the negative ones. It works step by step, starting with a
very specific rule and gradually making it more general to include all positive examples. Here's
how it works in an easy way:



FIND-S Algorithm

The FIND-S algorithm is like starting with the most specific guess and slowly relaxing it until it
fits all the examples.

1. Start small: Begin with the most specific rule (e.g., "Only this exact weather works").
2. Fix the rule: For each good (positive) example, check if your rule matches it:
o If it does, great—do nothing!
o If it doesn’t, make the rule a bit more general (e.g., "Okay, maybe it works if the wind
isn’t strong").
3. Finish: When you’re done, you have a rule that matches all the good examples.

Example:

Imagine you're trying to figure out what kind of weather makes you enjoy playing a sport, using
this data:

Sky     Temperature   Humidity   Wind     Water   Forecast   Play Sport?
Sunny   Warm          Normal     Strong   Warm    Same       Yes
Sunny   Warm          High       Strong   Warm    Same       Yes
Rainy   Cold          High       Strong   Warm    Change     No
Sunny   Warm          High       Strong   Cool    Change     Yes

• Start with the most specific rule: h=(Ø, Ø, Ø, Ø, Ø, Ø), which means "no instance is
accepted yet."
• Look at the first positive example: (Sunny, Warm, Normal, Strong, Warm, Same)
o Rule becomes: h=(Sunny, Warm, Normal, Strong, Warm, Same)
• Look at the second positive example: (Sunny, Warm, High, Strong, Warm, Same)
o Update the rule to match both examples:
h=(Sunny, Warm, ?, Strong, Warm, Same)
• Ignore the negative example.
• Look at the fourth positive example: (Sunny, Warm, High, Strong, Cool, Change)
o Update the rule again: h=(Sunny, Warm, ?, Strong, ?, ?)

Final rule: (Sunny, Warm, ?, Strong, ?, ?). This means you enjoy playing sports if the sky is
sunny, the temperature is warm, and the wind is strong, regardless of the other conditions.
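The walkthrough above can be sketched in Python as follows. This is a minimal illustration for conjunctive hypotheses over discrete attributes; the function name find_s and the tuple encoding of examples are assumptions made for this sketch.

```python
def find_s(examples):
    """Return the maximally specific conjunctive hypothesis for the positives."""
    n = len(examples[0][0])
    h = ["Ø"] * n                      # most specific hypothesis: matches nothing
    for instance, label in examples:
        if label != "Yes":
            continue                   # FIND-S ignores negative examples
        for i, value in enumerate(instance):
            if h[i] == "Ø":
                h[i] = value           # first positive example: copy its value
            elif h[i] != value:
                h[i] = "?"             # disagreement: generalize to the wildcard
    return tuple(h)


training_data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), "Yes"),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), "No"),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), "Yes"),
]

print(find_s(training_data))
# ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```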



Properties of FIND-S

1. Guarantees a Maximally Specific Hypothesis:


o In conjunction-based hypothesis spaces like EnjoySport, FIND-S always outputs
the most specific consistent hypothesis.
2. Focus on Positive Examples:
o Negative examples are ignored. As long as the hypothesis space contains the true
target concept and the training data is error-free, the current hypothesis never
misclassifies a negative example, so no revision is needed.

Limitations of FIND-S

1. Uncertainty About the Target Concept:


o FIND-S cannot determine if the final hypothesis is the only consistent hypothesis
or if others exist.
2. Preference for Specific Hypotheses:
o There's no inherent reason to prefer the most specific hypothesis over more
general ones, especially in cases of multiple consistent hypotheses.
3. Sensitivity to Errors:
o Inconsistent or noisy training examples can severely mislead FIND-S, as it relies
solely on positive examples and assumes they are error-free.
4. Multiple Maximally Specific Hypotheses:
o In some hypothesis spaces, there may be multiple maximally specific hypotheses,
requiring modifications to FIND-S to handle backtracking or branching.
5. No Accommodation for Non-Conjunctive Hypothesis Spaces:
o FIND-S works well only in hypothesis spaces where a single maximally specific
consistent hypothesis exists.

Key Points to Remember

• FIND-S focuses only on positive examples and ignores negatives.


• It starts very specific and becomes more general as needed.
• It finds one rule that works for the given data but doesn’t guarantee it's the only possible
rule.

Why FIND-S Is Simple But Limited

• Good for Clean Data: Works well if the data is perfect (no mistakes or noise).
• Ignores Negatives: It doesn't use negative examples to refine the rule.
• May Miss Other Rules: If there are multiple valid rules, it picks the most specific one
but doesn’t explore other options.

In short, FIND-S is like a detective who focuses only on positive clues and builds the most
specific case that fits all of them!



VERSION SPACES AND THE CANDIDATE ELIMINATION ALGORITHM

VERSION SPACES

A version space is the set of all hypotheses (rules) that are consistent with the given training data.
It represents everything the learner currently knows about the target concept.

Why "Version Space"?

The "version" refers to the different possibilities or hypotheses that might explain the data. The
space includes only hypotheses that:

• Classify every positive example as positive.
• Classify every negative example as negative.

How It Works:
A version space has two boundaries:

1. Specific boundary (S): The most specific hypotheses consistent with the data.
2. General boundary (G): The most general hypotheses consistent with the data.

The true target concept lies between these boundaries.

1. Initial Version Space:
o S = The most specific hypothesis: matches no instance.
o G = The most general hypothesis: matches everything.
2. Update Version Space:
o Each positive example makes S more general to include it.
o Each negative example makes G more specific to exclude it.

Why Use Version Spaces?

• Efficient Representation: Instead of listing all hypotheses, it tracks only the boundaries
S and G.
• Keeps Track of Knowledge: Helps understand what the learner knows and doesn’t know
yet.
• Flexible Search: Allows for adding or removing examples to refine the boundaries.

A version space is the range of hypotheses consistent with the training data, bounded by the
most specific (S) and the most general (G) hypotheses. It narrows down as you process more
examples, zeroing in on the true concept.
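To make the definition concrete, the sketch below lists the version space by brute force: it enumerates every hypothesis without Ø and keeps the ones consistent with the four EnjoySport training examples used in the FIND-S section. The names DOMAINS, covers, and version_space are illustrative assumptions; this is not the boundary-based method, only a direct check of the definition.

```python
from itertools import product

DOMAINS = [
    ("Sunny", "Cloudy", "Rainy"),   # Sky
    ("Warm", "Cold"),               # AirTemp
    ("Normal", "High"),             # Humidity
    ("Strong", "Light"),            # Wind
    ("Warm", "Cool"),               # Water
    ("Same", "Change"),             # Forecast
]

# (instance, is_positive) pairs taken from the training table.
EXAMPLES = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]


def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(c == "?" or c == v for c, v in zip(h, x))


def version_space(examples):
    # Hypotheses containing Ø are skipped: they reject everything, so they
    # cannot be consistent once a positive example has been seen.
    candidates = product(*[d + ("?",) for d in DOMAINS])
    return [h for h in candidates
            if all(covers(h, x) == label for x, label in examples)]


vs = version_space(EXAMPLES)
for h in vs:
    print(h)
print(len(vs))  # 6
```

Six hypotheses remain. The most specific of them is (Sunny, Warm, ?, Strong, ?, ?), and the most general are (Sunny, ?, ?, ?, ?, ?) and (?, Warm, ?, ?, ?, ?).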



CANDIDATE ELIMINATION ALGORITHM

It’s a way to learn rules for when something happens (like "Play Sport = Yes") by narrowing
down possibilities. The algorithm works by keeping two boundaries:

1. Specific Boundary (S): The most specific rule that only fits positive examples.
2. General Boundary (G): The most general rule that excludes negative examples.

As we process each example, we update S and G to make them more accurate.

The Data
Sky     Temp   Humidity   Wind     Water   Forecast   Play Sport?
Sunny   Warm   Normal     Strong   Warm    Same       Yes
Sunny   Warm   High       Strong   Warm    Same       Yes
Rainy   Cold   High       Strong   Warm    Change     No
Sunny   Warm   High       Strong   Cool    Change     Yes

Our goal is to find the rule for when "Play Sport" is Yes.

Step-by-Step Execution
Start with Initial S and G:

• S=⟨Ø,Ø,Ø,Ø,Ø,Ø⟩ "I know nothing; no instance fits."
• G=⟨?,?,?,?,?,?⟩ "Everything works."

Example 1: Sunny, Warm, Normal, Strong, Warm, Same (Yes)

It’s a positive example. We update S and G:

• S: Start from nothing and match this example exactly:


S=⟨Sunny,Warm,Normal,Strong,Warm,Same⟩
• G: Stays the same because it already includes everything.
G=⟨?,?,?,?,?,?⟩

Example 2: Sunny, Warm, High, Strong, Warm, Same (Yes)

It’s another positive example. S is too specific, so we generalize it to fit both positive examples:



• Compare each attribute in S to the new example:
o Sky = Sunny: Matches, no change.
o Temp = Warm: Matches, no change.
o Humidity = Normal vs High: Doesn’t match, so generalize to ?.
o Wind = Strong: Matches, no change.
o Water = Warm: Matches, no change.
o Forecast = Same: Matches, no change.
• Updated S:
S=⟨Sunny,Warm,?,Strong,Warm,Same⟩
(This means Humidity can be anything now.)

G=⟨?,?,?,?,?,?⟩: Still no change.

Example 3: Rainy, Cold, High, Strong, Warm, Change (No)

It’s a negative example. We update G to exclude this negative example while staying as general
as possible.

• S=⟨Sunny,Warm,?,Strong,Warm,Same⟩: S does not change on a negative example, so copy the
previous S.
• Step 1: Look at G: ⟨?,?,?,?,?,?⟩
o G is too general: it allows this negative example. We specialize G to exclude it.
• Step 2: Create new rules for G by replacing one ? at a time with a value that differs from the
negative example, keeping only the rules that are still more general than S:

G={⟨Sunny,?,?,?,?,?⟩, ⟨?,Warm,?,?,?,?⟩, ⟨?,?,?,?,?,Same⟩}

(Humidity, Wind, and Water give no usable rule: S already has ? for Humidity, and the negative
example has the same Wind and Water values as S.)

Example 4: Sunny, Warm, High, Strong, Cool, Change (YES)

It’s a positive example. S needs to generalize further to fit this example. Compare with previous
S.

• Compare S=⟨Sunny,Warm,?,Strong,Warm,Same⟩ to the new example:


o Sky = Sunny: Matches, no change.
o Temp = Warm: Matches, no change.
o Humidity = ?: Already generalized, no change.
o Wind = Strong: Matches, no change.
o Water = Warm vs Cool: Doesn’t match, so generalize to ?.
o Forecast = Same vs Change: Doesn’t match, so generalize to ?.
• Updated S:
S=⟨Sunny,Warm,?,Strong,?,?⟩
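• G: The rule ⟨?,?,?,?,?,Same⟩ does not match this positive example (Forecast = Change), so it
is removed from G, leaving G={⟨Sunny,?,?,?,?,?⟩, ⟨?,Warm,?,?,?,?⟩}.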



Final Results:

1. Specific Boundary (S):
S=⟨Sunny,Warm,?,Strong,?,?⟩
(This means: Sky must be Sunny, Temp must be Warm, Wind must be Strong. Humidity,
Water, and Forecast can be anything.)
2. General Boundary (G):
G={⟨Sunny,?,?,?,?,?⟩, ⟨?,Warm,?,?,?,?⟩}
(This means: one rule requires only Sky = Sunny and the other requires only Temp = Warm;
everything else can be anything. The rule requiring Forecast = Same was dropped after
Example 4.)

Why is G={⟨Sunny,?,?,?,?,?⟩, ⟨?,Warm,?,?,?,?⟩}?

1. General Boundary (G) starts very broad (because it initially includes everything).
2. Negative examples make G more specific: after Example 3, each rule in G fixes exactly one
attribute to a value that rules out the negative example.
3. Positive examples remove rules from G that fail to match them: Example 4 has Forecast =
Change, so ⟨?,?,?,?,?,Same⟩ is removed.
4. The remaining rules fix only Sky = Sunny or Temp = Warm. Every other attribute stays a
wildcard ( ? ), and each rule still covers all the positive examples we’ve seen so far.

Why Sunny and Warm in G?

All the positive examples (the ones where Play Sport = Yes) have Sky = Sunny and Temp = Warm,
while the negative example has Sky = Rainy and Temp = Cold. Requiring either value on its own is
enough to exclude the negative example and still cover every positive example, so each gives one
maximally general rule. Every hypothesis in the version space lies between S and these two rules.
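The trace above can be reproduced with a short Python sketch. It is a simplified version of candidate elimination for conjunctive hypotheses only: S is kept as a single hypothesis (which is sufficient in this hypothesis space), G is a set, and the check that a negative example may collapse the version space is left out. All function and variable names are assumptions for this illustration.

```python
def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(c == "?" or c == v for c, v in zip(h, x))


def more_general_or_equal(h1, h2):
    """True if h1 covers every instance that h2 covers."""
    if "Ø" in h2:
        return True                      # h2 covers nothing
    return all(a == "?" or a == b for a, b in zip(h1, h2))


def candidate_elimination(examples, domains):
    n = len(domains)
    S = ("Ø",) * n
    G = {("?",) * n}
    for x, positive in examples:
        if positive:
            # Drop general hypotheses that reject the positive example.
            G = {g for g in G if covers(g, x)}
            # Minimally generalize S so that it covers x.
            S = tuple(v if s in ("Ø", v) else "?" for s, v in zip(S, x))
        else:
            new_G = set()
            for g in G:
                if not covers(g, x):
                    new_G.add(g)         # already excludes the negative example
                    continue
                # Minimal specializations: fix one "?" to a value other than x's.
                for i, c in enumerate(g):
                    if c != "?":
                        continue
                    for v in domains[i]:
                        spec = g[:i] + (v,) + g[i + 1:]
                        if v != x[i] and more_general_or_equal(spec, S):
                            new_G.add(spec)
            # Keep only the maximally general members.
            G = {g for g in new_G
                 if not any(h != g and more_general_or_equal(h, g) for h in new_G)}
    return S, G


DOMAINS = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Light"), ("Warm", "Cool"), ("Same", "Change")]
EXAMPLES = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]

S, G = candidate_elimination(EXAMPLES, DOMAINS)
print("S =", S)
print("G =", sorted(G))
```

Running it on the first three examples only returns three members in G, including the Forecast = Same rule that the fourth example later removes.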

Advantages of CEA over Find-S:


1. Uses all the evidence: CEA considers both positive and negative examples to generate
the hypothesis, so its result is constrained by every training example rather than
by the positives alone. Like Find-S, it still assumes error-free training data; a
mislabeled example can remove the correct hypothesis from the version space.
2. Shows what is still uncertain: CEA outputs the whole version space (bounded by S
and G) rather than a single rule, so it is clear whether the data has pinned down
one hypothesis or several remain.
3. Compact representation: CEA does not list every consistent hypothesis. It stores
only the boundary sets S and G and eliminates hypotheses by moving these
boundaries.
4. Better handling of continuous attributes: CEA can handle continuous attributes by
creating boundaries for each attribute, which makes it more suitable for a wider
range of datasets.
Disadvantages of CEA in comparison with Find-S:
1. More complex: CEA is a more complex algorithm than Find-S, which may make it
more difficult for beginners or those without a strong background in machine
learning to use and understand.



2. Higher memory requirements: CEA requires more memory to store the set of
hypotheses and boundaries, which may make it less suitable for memory-
constrained environments.
3. Slower processing for large datasets: CEA may become slower for larger datasets
due to the increased number of hypotheses generated.
4. Higher potential for overfitting: The increased complexity of CEA may make it
more prone to overfitting on the training data, especially if the dataset is small or
has a high degree of noise.

LINEAR DISCRIMINANTS
Machine learning models are often used to solve supervised learning tasks,
particularly classification problems, where the goal is to assign data points to specific categories
or classes. However, as datasets grow larger with more features, it becomes challenging for
models to process the data effectively. This is where dimensionality reduction techniques like
Linear Discriminant Analysis (LDA) come into play.

LDA not only helps to reduce the number of features but also ensures that the important class-
related information is retained, making it easier for models to differentiate between classes.

What is Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a supervised learning technique used for classification
tasks. It helps distinguish between different classes by projecting data points onto a lower-
dimensional space, maximizing the separation between those classes.



LDA performs two key roles:

• Classification: It finds a linear combination of features that best separates multiple
classes.
• Dimensionality Reduction: It reduces the number of input features while preserving the
information necessary for classification.
For example, in a dataset where each data point belongs to one of three classes, LDA transforms
the data into a space where the classes are well-separated, making it easier for models to classify
them correctly.

How Does LDA Work?

The core idea of Linear Discriminant Analysis (LDA) is to find a new axis that best separates
different classes by maximizing the distance between them. LDA achieves this by reducing the
dimensionality of the data while retaining the class-discriminative information.

Key Concepts:

1. Maximizing Between-Class Variance: LDA maximizes the separation between
the means of different classes.
2. Minimizing Within-Class Variance: It minimizes the spread (variance) within
each class, ensuring that data points from the same class remain close together.
3. Projection to Lower-Dimensional Space: LDA projects data onto a new axis or
subspace that best separates the classes. For example, in a 3-class problem, LDA
can reduce the dimensionality to 2 or even 1 while preserving class-related
information.
Working Mechanism:

• Step 1: Compute the mean vectors for each class.


• Step 2: Calculate the within-class and between-class scatter matrices.
• Step 3: Solve for the linear discriminants that maximize the ratio of between-class
variance to within-class variance.
• Step 4: Project the data onto the new lower-dimensional space.
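The four steps can be sketched in plain Python for the simplest case: two classes and two features. In that case the direction that maximizes the ratio in Step 3 is proportional to Sw^-1 (m1 - m0), where Sw is the within-class scatter matrix and m0, m1 are the class means. The data points and function names below are made up for illustration; a general multi-class implementation would instead solve an eigenvalue problem.

```python
def mean(points):
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(2)]


def scatter(points, m):
    """2x2 scatter matrix: sum of (x - m)(x - m)^T over one class."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for r in range(2):
            for c in range(2):
                s[r][c] += d[r] * d[c]
    return s


def fisher_direction(class0, class1):
    m0, m1 = mean(class0), mean(class1)                    # Step 1: class means
    s0, s1 = scatter(class0, m0), scatter(class1, m1)      # Step 2: scatter
    sw = [[s0[r][c] + s1[r][c] for c in range(2)] for r in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # Step 3 (two-class case): w is proportional to Sw^-1 (m1 - m0).
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def project(points, w):
    """Step 4: map each 2-D point to a single number along direction w."""
    return [p[0] * w[0] + p[1] * w[1] for p in points]


# Hypothetical toy data: two well-separated groups of 2-D points.
class0 = [(1, 2), (2, 3), (3, 3), (2, 1)]
class1 = [(6, 5), (7, 8), (8, 6), (7, 7)]

w = fisher_direction(class0, class1)
print(project(class0, w))
print(project(class1, w))
```

After projection, every class0 value is smaller than every class1 value, so a single threshold on the 1-D projection separates the two groups.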

Advantages & Disadvantages of Using LDA.

Advantages:

1. Simplicity: LDA is easy to implement and understand, making it suitable for
beginners.
2. Interpretability: It provides clear insight into how features contribute to the
classification task.
3. Computational Efficiency: LDA is computationally less intensive, making it
useful for large datasets.



4. Works Well with Linearly Separable Data: It performs effectively when the
classes are linearly separable.
Disadvantages:

1. Sensitive to Assumptions: LDA assumes that features follow a Gaussian
distribution, which may not always hold.
2. Struggles with Non-linear Relationships: It may not perform well if the data
contains complex, non-linear relationships.
3. Affected by Class Imbalance: LDA can be biased toward the majority class if
the class distribution is imbalanced.
4. Impact of Outliers: It is sensitive to outliers, which can affect the model’s
performance.

Applications of Linear Discriminant Analysis (LDA)

1. Face Recognition: LDA helps extract features from facial images, classifying them based on
individuals. It is commonly used in biometric systems to identify or verify users.

2. Disease Diagnosis in Healthcare: LDA is used to analyze medical data for classifying diseases,
such as distinguishing between different stages of cancer or predicting the presence of heart
disease.

3. Customer Identification in Marketing: In marketing, LDA aids in customer segmentation by
grouping customers based on their profiles, enabling targeted campaigns and personalized
services.

4. Credit Risk Assessment in Finance: Financial institutions use LDA to assess credit risk by
analyzing customer data to predict the likelihood of loan defaults or creditworthiness.

5. Quality Control in Manufacturing: LDA identifies defects in products by analyzing sensor
data, ensuring that faulty products are detected early in the production process.

6. Campaign Optimization in Marketing: LDA helps optimize marketing campaigns by analyzing
customer interactions and identifying which strategies yield the best results.



PERCEPTRON

Perceptrons Are Based on Biological Neurons

Originally proposed in 1957, perceptrons were one of the earliest components of artificial neural
networks. The structure of the perceptron is based on the anatomy of neurons. Neurons have
several parts, but for our purposes, the most important parts are the dendrites, which receive
inputs from other neurons, and the axon, which produces outputs.

Neuron Activation

Neurons “fire” – that is, produce an output – in an all or nothing way. The outputs of a neuron
are essentially 0 or 1. On or off. A neuron will “fire” if the input signals at the dendrites are
sufficiently large, collectively. If the amount of input signal at the dendrites is high enough, the
neuron will “fire” and produce an output. But if the amount of input signal is insufficient, the
neuron will not produce an output. Put simply, the neuron sums up the inputs, and if the collective
input signals meet a certain threshold, then it will produce an output. If the collective input
signals are under the threshold, it will not produce an output. This is, of course, a very simple
explanation of how a neuron works (because they are very complex at the chemical level), but
it’s roughly accurate.

What is Perceptron?

The perceptron is a fundamental concept in machine learning and serves as one of the
simplest types of artificial neural networks. It is primarily used for binary classification tasks
and is based on the idea of learning a linear decision boundary to separate data points into two
classes. The perceptron algorithm was introduced by Frank Rosenblatt in 1958. It operates on a
set of input features and produces an output that is either 1 or −1 (or 0 depending on the
implementation). The model is trained iteratively, adjusting its weights based on the error
between predicted and actual labels.

Components of a Perceptron : A perceptron consists of following

1. Inputs (x1,x2,…,xn): Features of the data (numerical values).
2. Weights (w1,w2,…,wn): Each input has a corresponding weight that determines its
influence on the output.
3. Bias (b): A constant term that is added to the weighted sum of inputs and shifts the
decision boundary. It is an adjustable parameter that lets the perceptron make
adjustments independent of the input, improving its flexibility in learning.
4. Weighted Sum (z): The perceptron multiplies each input by its weight, adds the products, and
adds the bias:
z = w1⋅x1 + w2⋅x2 + … + wn⋅xn + b
5. Activation Function (f(z)): The weighted sum is passed to the activation function to obtain
the output. For the perceptron this is the step function: f(z) outputs 1 when z is above the
threshold and 0 (or −1, depending on the implementation) otherwise.

Working of the Perceptron

1. Input Features: Take a vector of input features (x1,x2,…,xn) from the dataset.
2. Compute Weighted Sum: Calculate z = w1⋅x1 + w2⋅x2 + … + wn⋅xn + b.
3. Apply Activation Function: Use the step function to decide the output (1 or 0).
4. Update Weights (During Training):

• If the predicted output is incorrect, adjust the weights and bias using the Perceptron
Learning Algorithm.

Mathematical Model: The perceptron can be represented as:

y^ = f(w1⋅x1 + w2⋅x2 + … + wn⋅xn + b), where f is the step function.



Types of Perceptron
1. Single-Layer Perceptron: This type of perceptron is limited to learning linearly
separable patterns. It is effective for tasks where the data can be divided into
distinct categories through a straight line. While powerful in its simplicity, it
struggles with more complex problems where the relationship between inputs and
outputs is non-linear.
2. Multi-Layer Perceptron: These possess enhanced processing capabilities, as they
consist of two or more layers and are adept at handling more complex patterns and
relationships within the data.

Perceptron Learning Algorithm:

A perceptron is like a very basic "brain" for a machine. It looks at input data (numbers) and
makes a decision: Class A or Class B (e.g., "yes" or "no").

Step 1: Start with inputs and weights

• Imagine you have some input features (e.g., x1, x2).
• Each input has a weight (w1, w2) that tells the perceptron how important that input is.

Step 2: Compute the "weighted sum"

• Multiply each input by its weight, and add them all together. Then, add a bias (b), which is like a
nudge to adjust the sum.

Weighted sum (z)=w1⋅x1+w2⋅x2+b

Step 3: Decide the output

• Use a simple rule: If the weighted sum (z) is positive, output 1 (e.g., "yes").
• If it’s negative or zero, output −1 (e.g., "no").

This rule is called the activation function.



Step 4: Check if the output is correct

• Compare the perceptron’s guess (y^) with the actual answer (y).
• If it’s correct, you’re good! If it’s wrong, adjust the weights and bias.

Step 5: Update the weights and bias

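• Adjust the weights and bias to make the perceptron “learn”. With learning rate η, the standard
perceptron update is:
wi ← wi + η⋅(y − y^)⋅xi
b ← b + η⋅(y − y^)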

Step 6: Repeat

• Go through the dataset multiple times, adjusting weights and bias each time the perceptron makes
a mistake.
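Steps 1 to 6 can be sketched as a short training loop. This is a minimal illustration that uses the +1/−1 output convention from Step 3 and the standard perceptron update w ← w + η⋅(y − y^)⋅x; the toy data, the learning rate, and the function names are assumptions, not values from the notes.

```python
def predict(weights, bias, x):
    # Steps 2 and 3: weighted sum followed by the step activation.
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else -1


def train_perceptron(data, learning_rate=0.1, max_epochs=100):
    weights = [0.0] * len(data[0][0])   # Step 1: start with zero weights
    bias = 0.0
    for _ in range(max_epochs):         # Step 6: repeat over the dataset
        mistakes = 0
        for x, y in data:
            y_hat = predict(weights, bias, x)
            if y_hat != y:              # Step 4: compare guess and answer
                error = y - y_hat       # +2 or -2 with the +1/-1 labels
                # Step 5: move the boundary toward the misclassified point.
                weights = [w + learning_rate * error * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * error
                mistakes += 1
        if mistakes == 0:               # every point classified correctly
            break
    return weights, bias


# Hypothetical, linearly separable toy data: (features, label).
data = [((2.0, 1.0), 1), ((3.0, 2.0), 1),
        ((-1.0, -1.5), -1), ((-2.0, -1.0), -1)]

weights, bias = train_perceptron(data)
print(weights, bias)
print([predict(weights, bias, x) for x, _ in data])  # [1, 1, -1, -1]
```

The loop stops early once a full pass makes no mistakes, which is guaranteed to happen eventually only when the data is linearly separable.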

Key Features

• Linear Model: The perceptron can only separate data that is linearly separable.
• Supervised Learning: It requires labeled data for training.
• Binary Classification: It predicts one of two possible classes (1 or −1).

Strengths

• Simple and easy to understand.


• Efficient for linearly separable data.
• Forms the foundation for more complex neural networks.

Limitations

• Cannot Handle Non-linear Data: It fails when data is not linearly separable.
• Binary Outputs: Limited to binary classification tasks.
• Sensitive to Feature Scaling: Requires normalization or scaling for effective learning.



Applications

• Image recognition (basic tasks).


• Spam filtering (binary classification).
• Sentiment analysis (positive vs. negative sentiment).

EXAMPLE: Here's an example of a perceptron for the logical AND function with the given
parameters:

• w1=1.2, w2=0.6 (weights)
• θ=1 (threshold)
• η=0.5 (learning rate)

Step 1: AND Function Truth Table


x1   x2   AND Output (Target t)
0    0    0
0    1    0
1    0    0
1    1    1

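Step 2: Activation Function

The perceptron output is calculated as:

output = 1 if w1⋅x1 + w2⋅x2 ≥ θ, otherwise 0

Step 3: Training with Initial Weights

Iteration 1

For each input, compute the weighted sum and update weights if necessary, using:

wi ← wi + η⋅(t − output)⋅xi

Testing Initial Weights

1. For (0,0): w1⋅x1 + w2⋅x2 = 1.2⋅0 + 0.6⋅0 = 0 < 1, so output = 0 = t.

✅ No update needed.

2. For (0,1): 1.2⋅0 + 0.6⋅1 = 0.6 < 1, so output = 0 = t.

✅ No update needed.

3. For (1,0): 1.2⋅1 + 0.6⋅0 = 1.2 ≥ 1, so output = 1 but t = 0. Update the weights:
w1 = 1.2 + 0.5⋅(0 − 1)⋅1 = 0.7
w2 = 0.6 + 0.5⋅(0 − 1)⋅0 = 0.6

4. For (1,1): 0.7⋅1 + 0.6⋅1 = 1.3 ≥ 1, so output = 1 = t.

✅ Correct.

Final Weights

After training, the perceptron correctly classifies AND function with: w1 = 0.7, w2 = 0.6, θ = 1.
A second pass over the four inputs produces no further updates.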
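The AND example can be replayed in code. The sketch below assumes the output is 1 when the weighted sum is greater than or equal to θ and uses the update wi ← wi + η⋅(t − output)⋅xi with a fixed threshold and no bias; the notes do not state the comparison explicitly, so that choice is an assumption (no weighted sum equals θ in this example, so a strict comparison gives the same trace). The function names are illustrative.

```python
def step(z, theta):
    # Assumed convention: fire when the weighted sum reaches the threshold.
    return 1 if z >= theta else 0


def train_and(initial_weights, theta=1.0, eta=0.5, max_epochs=10):
    data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    w = list(initial_weights)
    for _ in range(max_epochs):
        updated = False
        for x, t in data:
            o = step(w[0] * x[0] + w[1] * x[1], theta)
            if o != t:
                # Perceptron rule: shift each weight by eta * (t - o) * x_i.
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
                updated = True
        if not updated:     # a full pass with no mistakes: training is done
            break
    return w


final_w = train_and([1.2, 0.6])
print(final_w)  # approximately [0.7, 0.6]
```

Only the input (1,0) triggers an update, which lowers w1 from 1.2 to 0.7; the second pass then finds no errors.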

LINEAR SEPARABILITY

Linear separability is an important concept in machine learning, particularly in the field of
supervised learning. It refers to the ability of a set of data points to be separated into distinct
categories using a linear decision boundary. In other words, if there exists a straight line that can
cleanly divide the data into two classes, then the data is said to be linearly separable.



A dataset is linearly separable if there exists a hyperplane that can separate the data points into
distinct classes without any misclassification.

• In 2D: A straight line separates two classes.


• In 3D: A plane separates two classes.

Linear separability means that you can draw a straight line (or a flat surface, or a hyperplane)
that separates two groups of data points perfectly without any overlap.

In other words, there is a clear boundary where:

• One group of data points lies on one side of the boundary.


• The other group of data points lies on the other side.

Imagine you have two types of points:

• Class 1 (Positive): Points like (1, 2), (2, 3)


• Class 2 (Negative): Points like (-1, -2), (-2, -3)

You can draw a straight line between the two classes, and the points on one side belong to Class
1, while the points on the other side belong to Class 2.

This line is the decision boundary, and the data is linearly separable because the line separates
the two classes without overlap.
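As a quick check of this example, the sketch below tests one possible boundary, the line x1 + x2 = 0. That particular line is an assumption chosen for illustration; many other lines separate these four points equally well.

```python
def side(point, w=(1.0, 1.0), b=0.0):
    """Return +1 or -1 depending on which side of w.x + b = 0 the point lies."""
    z = w[0] * point[0] + w[1] * point[1] + b
    return 1 if z > 0 else -1


class_1 = [(1, 2), (2, 3)]        # positive points from the text
class_2 = [(-1, -2), (-2, -3)]    # negative points from the text

print([side(p) for p in class_1])  # [1, 1]
print([side(p) for p in class_2])  # [-1, -1]
```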

Real-World Applications

1. Image Classification:
o Linear separability is rare; deep learning handles non-linear boundaries.
2. Medical Diagnosis:
o Linearly separable cases may involve straightforward conditions; complex
diseases often require advanced methods.
3. Spam Detection:
o Simple keyword-based filters assume linear separability, while modern techniques
use non-linear models.



Why is Linear Separability Important?

The concept of linear separability helps us decide which machine learning algorithms to use.
Some algorithms work well when the data is linearly separable, while others are better for more
complex, non-linearly separable data.

• Linear Models (e.g., Perceptron, linear SVM): These work best when the data is linearly
separable. They try to find a straight line or plane that divides the data.
• Non-Linear Models (e.g., Neural Networks, Decision Trees): These are more flexible
and can handle non-linearly separable data. They can create complex decision
boundaries.

Example: Logical OR Function is linearly separable

The logical OR function follows this truth table:

x1   x2   OR Output y
0    0    0
0    1    1
1    0    1
1    1    1

For example, the line x1 + x2 = 0.5 separates the single 0 output from the three 1 outputs.

Example: Logical XOR Function is not linearly separable

x1   x2   XOR Output y
0    0    0
0    1    1
1    0    1
1    1    0



1. Linearly Separable Data:
o Blue points (+1) are separated from red points (−1) by the dashed green line
(x2=x1+1).
o The data is perfectly separable by this straight line.
2. Non-Linearly Separable Data (XOR):
o Blue points (+1) and red points (−1) are arranged in such a way that no single
straight line can separate the classes.
o This is a classic XOR problem where the decision boundary requires a more
complex, non-linear solution.
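The contrast between OR and XOR can be observed by running the perceptron rule on both truth tables. The sketch below assumes 0/1 outputs, a learning rate of 1, zero initial weights, and a cap of 100 passes, all chosen for illustration. A pass with no mistakes means the current line separates the data; for XOR no such line exists, so the loop can never finish early.

```python
def converges(data, max_epochs=100):
    """Train a two-input perceptron; return True if some pass makes no mistakes."""
    w1 = w2 = b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in data:
            y_hat = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
            if y_hat != y:
                delta = y - y_hat        # +1 or -1
                w1 += delta * x1
                w2 += delta * x2
                b += delta
                mistakes += 1
        if mistakes == 0:
            return True                  # a separating line was found
    return False                         # gave up: no error-free pass


OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print(converges(OR_DATA))   # True
print(converges(XOR_DATA))  # False
```

Raising max_epochs does not change the XOR result: the four constraints it needs (b ≤ 0, w1 + b > 0, w2 + b > 0, w1 + w2 + b ≤ 0) contradict each other.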

Advantages of Linear Separability

1. Simplicity:
o Linear separability allows using simple models with fewer parameters.
2. Faster Training:
o Models converge quickly during training due to straightforward optimization.
3. Interpretability:
o Easy to visualize and understand the decision boundary.
4. Optimal Solution:
o Algorithms like SVM find the maximum margin boundary, ensuring optimal
performance for separable data.
5. Good Generalization:
o Models are less likely to overfit due to their simplicity.

Disadvantages of Linear Separability

1. Limited Applicability:
o Many real-world datasets are not linearly separable.
2. Lack of Flexibility:
o Cannot capture complex patterns in the data.
3. Over-Simplification:
o May miss subtle relationships or nuances.
4. Sensitive to Noise:
o Outliers or noisy data near the boundary can disrupt the model.
5. Feature Dependence:
o Requires feature transformations for non-linearly separable data.
6. Failure for Non-Linearly Separable Data:
o Cannot separate inherently non-linear datasets without additional techniques.

LINEAR REGRESSION

Linear regression is a fundamental supervised learning algorithm used in machine learning for
modeling the relationship between one or more independent variables (features) and a dependent
variable (target). The goal is to find the best-fit line (or hyperplane in higher dimensions) that
minimizes the error in predicting the dependent variable.

In machine learning, labeled datasets contain input data (features) and output labels (target
values). For linear regression, we treat the features as independent variables and the target
values as the dependent variable. The model predicts a continuous output from the independent
input variables, for example the price of a house from parameters such as house age, distance
from the main road, location, and area.

In the above data, the target House Price is the dependent variable represented by Y, and the
feature, Square Feet, is the independent variable represented by X. The input features (X) are
used to predict the target label (Y). So, the independent variables are also known as predictor
variables, and the dependent variable is known as the response variable.

The main goal of the linear regression model is to find the best-fitting straight line (often called a
regression line) through a set of data points.

Line of Regression

A straight line that shows a relation between the dependent variable and independent variables is
known as the line of regression or regression line.

Types of Linear Regression

Linear regression is of the following two types −

• Simple Linear Regression


• Multiple Linear Regression

1. Simple Linear Regression

Simple linear regression is a type of regression analysis in which a single independent variable
(also known as a predictor variable) is used to predict the dependent variable. In other words, it
models the linear relationship between the dependent variable and a single independent variable.

In the above image, the straight line represents the simple linear regression line where Ŷ is the
predicted value, and X is the input value.

Mathematically, the relationship can be modelled as a linear equation −

Y=w0+w1X+ϵ

Where,

• Y is the dependent variable (target).


• X is the independent variable (feature).
• w0 is the y-intercept of the line.
• w1 is the slope of the line, representing the effect of X on Y.
• ε is the error term, capturing the variability in Y not explained by X.

2. Multiple Linear Regression

Multiple linear regression extends simple linear regression to predict a response using two or
more independent variables (features). The model is expressed as:

Y=w0+w1X1+w2X2+⋯+wpXp+ϵ

Where,

• X1, X2, ..., Xp are the independent variables (features).


• w0, w1, ..., wp are the coefficients for these variables.
• ε is the error term.

How Does Linear Regression Work?

The main goal of linear regression is to find the best-fit line through a set of data points, that
is, the line that minimizes the difference between the actual values and the predicted values.
This is done by estimating the parameters w0, w1, etc.

The working of linear regression in machine learning can be broken down into the following
steps −

• Hypothesis − We assume that there is a linear relation between input and output.
• Cost Function − Define a loss or cost function. The cost function quantifies the model's
prediction error. It takes the model's predicted values and the actual values
and returns a single scalar value that represents the cost of the model's predictions.
• Optimization − Optimize (minimize) the model's cost function by updating the model's
parameters.

It continues updating the model's parameters until the cost or error of the model's prediction is
optimized (minimized).

Hypothesis Function for Linear Regression

In linear regression problems, we assume that there is a linear relationship between input features
(X) and predicted value (Ŷ). The hypothesis function returns the predicted value for a given
input value. Generally we represent a hypothesis by hw(X) and it is equal to Ŷ.

Hypothesis function for simple linear regression −

Ŷ = hw(X) = w0 + w1X

Hypothesis function for multiple linear regression −

Ŷ = hw(X) = w0 + w1X1 + w2X2 + ⋯ + wpXp

For different values of parameters (weights), we can find many regression lines. The main goal is
to find the best-fit line.

Finding the Best Fit Line

A regression line is said to be the best fit if the error between actual and predicted values is
minimal.

The image below shows a regression line with error (ε) at input data point X. The error is calculated
for all data points and our goal is to minimize the average error (loss). We can use different types
of loss functions such as mean squared error (MSE), mean absolute error (MAE), L1 loss, L2 loss,
etc.

Loss Function for Linear Regression

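The error between actual and predicted values can be quantified using a loss function, also called
a cost function. The cost function takes the model's predicted values and the actual values and
returns a single scalar value that represents the cost of the model's predictions. Our main goal is
to minimize the cost function.

The most commonly used cost function is the mean squared error (MSE) function:

J(w) = (1/n) Σ (Yi − Ŷi)², summed over i = 1, ..., n

Where,

• n is the number of data points.
• Yi is the actual value for the i-th data point.
• Ŷi is the predicted value for the i-th data point.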
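
The three steps (hypothesis, cost function, optimization) can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the mean squared error cost averaged over n points and plain gradient descent; the data are synthetic and noise-free (generated from Y = 2 + 3X), and the function name, learning rate and epoch count are assumptions chosen for the example rather than values taken from these notes.

```python
import numpy as np

def mse(Y, Y_hat):
    """Mean squared error between actual and predicted values."""
    return np.mean((Y - Y_hat) ** 2)

def fit_simple_linear_regression(X, Y, lr=0.02, epochs=5000):
    """Estimate w0 (intercept) and w1 (slope) by gradient descent on the MSE."""
    w0, w1 = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        Y_hat = w0 + w1 * X                      # hypothesis h_w(X)
        error = Y_hat - Y
        grad_w0 = (2.0 / n) * error.sum()        # dJ/dw0
        grad_w1 = (2.0 / n) * (error * X).sum()  # dJ/dw1
        w0 -= lr * grad_w0                       # move against the gradient
        w1 -= lr * grad_w1
    return w0, w1

# Synthetic, noise-free data (an assumption made for this example).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = 2.0 + 3.0 * X

w0, w1 = fit_simple_linear_regression(X, Y)
final_cost = mse(Y, w0 + w1 * X)
print("w0 =", w0, "w1 =", w1, "MSE =", final_cost)
```

Because the data lie exactly on a line, the fitted parameters approach the generating intercept and slope and the cost approaches zero. With noisy data the cost would settle at a positive minimum instead.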

Applications of Linear Regression

1. Predictive Modeling: Linear regression is widely used for predictive modeling. For instance,
in real estate, predicting house prices based on features such as size, location, and number of
bedrooms can help buyers, sellers, and real estate agents make informed decisions.
2. Feature Selection: In multiple linear regression, analyzing the coefficients can help in feature
selection. Features with small or zero coefficients might be considered less important and can be
dropped to simplify the model.
3. Financial Forecasting: In finance, linear regression models predict stock prices, economic
indicators, and market trends. Accurate forecasts can guide investment strategies and financial
planning.
4. Risk Management: Linear regression helps in risk assessment by modeling the relationship
between risk factors and financial metrics. For example, in insurance, it can model the
relationship between policyholder characteristics and claim amounts.

Advantages of Linear Regression

• Interpretability − Linear regression is easy to understand, which is useful when
explaining how a model makes decisions.
• Speed − Linear regression is faster to train than many other machine learning algorithms.
• Predictive analytics − Linear regression is a fundamental building block for predictive
analytics.
• Linear relationships − Linear regression is a powerful statistical method for finding
linear relationships between variables.

• Simplicity − Linear regression is simple to implement and interpret.
• Efficiency − Linear regression is efficient to compute.

Common Challenges with Linear Regression

1. Overfitting: Overfitting occurs when the regression model performs well on training data but
lacks generalization on test data. Overfitting leads to poor prediction on new, unseen data.
2. Multicollinearity: When the independent variables (predictor or feature variables) are correlated
with each other, the situation is known as multicollinearity. In this case, the estimates of the
parameters (coefficients) can be unstable.
3. Outliers and Their Impact: Outliers can cause the regression line to be a poor fit for the
majority of data points.

UNIT - II
Multi-layer Perceptron– Going Forwards – Going Backwards: Back Propagation Error –
Multi-layer Perceptron in Practice – Examples of using the MLP – Overview – Deriving
Back-Propagation – Radial Basis Functions and Splines – Concepts – RBF Network –
Curse of Dimensionality – Interpolations and Basis Functions – Support Vector Machines

MULTI-LAYER PERCEPTRON
A Multi-Layer Perceptron (MLP) consists of fully connected dense layers that transform
input data from one dimension to another. It is called “multi-layer” because it contains an input
layer, one or more hidden layers, and an output layer. The purpose of an MLP is to model
complex relationships between inputs and outputs, making it a powerful tool for various
machine learning tasks. MLP (Multi-Layer Perceptron) is primarily used for supervised
learning, as it is a type of artificial neural network that requires labeled data to train and learn
relationships between input features and target outputs, making it suitable for tasks like
classification and regression.

The key components of Multi-Layer Perceptron include:


• Input Layer: Each neuron (or node) in this layer corresponds to an input feature.
For instance, if you have three input features, the input layer will have three
neurons.
• Hidden Layers: An MLP can have any number of hidden layers, with each layer
containing any number of nodes. These layers process the information received
from the input layer.
• Output Layer: The output layer generates the final prediction or result. If there are
multiple outputs, the output layer will have a corresponding number of neurons.

Every connection in the diagram is a representation of the fully connected nature of an MLP.
This means that every node in one layer connects to every node in the next layer. As the data
moves through the network, each layer transforms it until the final output is generated in the
output layer.

WORKING OF MULTI-LAYER PERCEPTRON

Step 1: Forward Propagation


In forward propagation, the data flows from the input layer to the output layer, passing
through any hidden layers. The MLP computes its predictions in the same way regardless of
which loss function is applied afterwards; the choice between mean squared error (MSE) and
binary cross-entropy (BCE) depends on whether the task is regression or classification. Each
neuron in the hidden layers processes the input as follows:

1. Weighted Sum: The neuron computes the weighted sum of the inputs and adds a bias term:

z = w1x1 + w2x2 + ⋯ + wnxn + b

where x1, ..., xn are the inputs, w1, ..., wn are the corresponding weights, and b is the bias.

2. Activation Function: An activation function is a mathematical function applied to the
output of a neuron. It introduces non-linearity into the model, allowing the network to learn
and represent complex patterns in the data. Without this non-linearity, a neural
network would behave like a linear regression model, no matter how many layers it has.

The activation function decides whether, and how strongly, a neuron is activated, based on
the weighted sum of its inputs plus the bias term. This helps the model make complex
decisions and predictions by introducing non-linearities to the output of each neuron.
Neural networks consist of neurons that operate using weights, biases, and activation
functions.

Without non-linearity, even deep networks would be limited to solving only simple,
linearly separable problems. Activation functions empower neural networks to model
highly complex data distributions and solve advanced deep learning tasks. Adding
non-linear activation functions introduces flexibility and enables the network to learn more
complex and abstract patterns from data.

The weighted sum z is passed through an activation function to introduce non-linearity.


Common activation functions include:

1. Sigmoid: The sigmoid function is a mathematical function that has an “S”-shaped curve
(sigmoid curve). It is one of the most commonly used activation
functions in machine learning and deep learning. It is particularly useful in neural
networks, where it introduces non-linearity, allowing the model to handle complex patterns
in the data.

The sigmoid function is also known as a squashing function, as it takes the input from the
previous layer and squeezes it between 0 and 1. A value fed to the sigmoid
function therefore always produces an output between 0 and 1, no matter how large or small
the input is.

The formula of the sigmoid activation function is:

σ(x) = 1 / (1 + e^(−x))

Here, e is the base of the natural logarithm (approximately equal to 2.71828), and x is
the input to the function.

2. ReLU (Rectified Linear Unit):


The Rectified Linear Unit (ReLU) is one of the most popular activation functions
used in neural networks, especially in deep learning models. It has become the
default choice in many architectures due to its simplicity and efficiency. The ReLU
function is a piecewise linear function that outputs the input directly if it is positive;
otherwise, it outputs zero.
In simpler terms, ReLU allows positive values to pass through unchanged while
setting all negative values to zero.

The ReLU function can be described mathematically as follows:

f(x)=max(0,x)
Where:
• x is the input to the neuron.
• The function returns x if x is greater than 0.
• If x is less than or equal to 0, the function returns 0.

In mathematical terms, the ReLU function can be written as:

f(x) = x, if x > 0
f(x) = 0, if x ≤ 0

The graph of ReLU activation looks like:

3. Tanh (Hyperbolic Tangent): The hyperbolic tangent (tanh) activation function is a
mathematical function used in artificial neural networks to transform input values into output
values in the range of -1 to +1. Unlike the sigmoid function, which has a range of 0 to 1, its
output is zero-centered and can be negative. Tanh is often preferred over sigmoid in the hidden
layers of a neural network, as its zero-centered property often results in faster training.
The tanh function is mathematically similar to the sigmoid function but differs in its
output range. It is defined as:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
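
The weighted sum and the three activation functions described above can be written directly in NumPy. This is a small sketch for illustration; the helper name neuron_output and the input values are assumptions, not part of these notes.

```python
import numpy as np

def sigmoid(x):
    """Squash any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Pass positive values through unchanged and set negative values to 0."""
    return np.maximum(0.0, x)

def tanh(x):
    """Squash any real input into the range (-1, 1), centered at zero."""
    return np.tanh(x)

def neuron_output(x, w, b, activation):
    """Weighted sum of the inputs plus bias, followed by an activation."""
    z = np.dot(w, x) + b
    return activation(z)

z = np.array([-2.0, 0.0, 2.0])
print("sigmoid:", sigmoid(z))
print("relu:   ", relu(z))
print("tanh:   ", tanh(z))

# One neuron with two inputs (illustrative values).
out = neuron_output(np.array([1.0, 2.0]), np.array([0.5, -0.25]), 0.1, sigmoid)
print("neuron output:", out)
```

The printout also shows the relationship between the two S-shaped functions: tanh(x) equals 2·σ(2x) − 1, which is why tanh is described as similar to sigmoid but with a different output range.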

Step 2: Loss Function


A loss function is a mathematical function that measures how well a model's predictions
match the true outcomes. It provides a quantitative measure of the error in the model's
predictions, which guides optimization algorithms in adjusting the model parameters to reduce
the loss over time. Once the network generates an output, the next step is to calculate the loss
using a loss function. In supervised learning, this compares the predicted output to the actual
label.

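For a binary classification problem, the commonly used binary cross-entropy (BCE) loss function is:

L = −(1/N) Σ [ yi log(ŷi) + (1 − yi) log(1 − ŷi) ]

For regression problems, the mean squared error (MSE) is often used:

L = (1/N) Σ (yi − ŷi)²

In both formulas, N is the number of training examples, yi is the actual value (label) of the
i-th example, ŷi is the predicted value, and the sum runs over i = 1, ..., N.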
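
Both loss functions are short to write down in code. The sketch below assumes the standard definitions averaged over the examples; the small constant eps that keeps the logarithm finite, and the example labels and predictions, are illustrative assumptions.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """BCE averaged over examples; y holds 0/1 labels, y_hat holds probabilities."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

def mean_squared_error(y, y_hat):
    """MSE averaged over examples."""
    return np.mean((y - y_hat) ** 2)

y = np.array([1.0, 0.0, 1.0])
good = np.array([0.9, 0.1, 0.8])   # predictions close to the labels
bad = np.array([0.2, 0.9, 0.3])    # predictions far from the labels

print("BCE good:", binary_cross_entropy(y, good), "bad:", binary_cross_entropy(y, bad))
print("MSE good:", mean_squared_error(y, good), "bad:", mean_squared_error(y, bad))
```

In both cases the predictions that are closer to the labels receive the smaller loss, which is the property the optimizer relies on.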

Step 3: Backpropagation
The goal of training an MLP is to minimize the loss function by adjusting the network’s
weights and biases. This is achieved through backpropagation. Both MSE and BCE can be
used in backpropagation. Backpropagation computes gradients of the chosen loss function
(MSE or BCE) and updates the network’s weights using gradient descent.

1. Gradient Calculation: The gradients of the loss function with respect to each
weight and bias are calculated using the chain rule of calculus.
2. Error Propagation: The error is propagated back through the network, layer by
layer.
3. Gradient Descent: The network updates the weights and biases by moving in the
opposite direction of the gradient to reduce the loss

For both regression (MSE loss) and classification (BCE loss), the weights and biases are updated
using the gradient descent formula:

w ← w − η (∂L/∂w)
b ← b − η (∂L/∂b)

where η is the learning rate and L is the loss.

Step 4: Iteration

• Forward and backward propagation repeat over multiple epochs until the model
converges (i.e., achieves an acceptable error rate).

MLP ALGORITHM:

The Multi-Layer Perceptron (MLP) Algorithm is like training a digital brain to learn patterns
and make predictions.

1. Start with Inputs:
o Give the MLP some data (like an image or numbers).
2. Forward Propagation:
o Pass the data through each layer.
o Each layer weighs its inputs using its weights and biases and applies an activation function.
3. Calculate the Loss:
o Compare the model's guess (output) to the correct answer (target).
o Measure how wrong the guess is (the "error") using a loss function.
4. Backward Propagation:
o Work backward through the layers to find how much each weight and bias contributed to the error.
5. Update Weights:
o Adjust the weights and biases in the direction that reduces the error, so that the next prediction improves.
6. Repeat:
o Do this many times (epochs) until the system gets good at predicting.
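
The six steps above can be sketched as a small NumPy training loop. This is a minimal illustration under stated assumptions, not a definitive implementation: one hidden layer, sigmoid activations in both layers, MSE loss, full-batch gradient descent, and the XOR truth table as training data. The function name train_mlp, the number of hidden nodes, the learning rate, the seed and the epoch count are all illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, y, n_hidden=4, lr=0.5, epochs=5000, seed=0):
    """Train a one-hidden-layer MLP with full-batch gradient descent on MSE."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0, size=(X.shape[1], n_hidden))  # input -> hidden
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 1.0, size=(n_hidden, 1))           # hidden -> output
    b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        # Step 2: forward propagation
        H = sigmoid(X @ W1 + b1)
        Y_hat = sigmoid(H @ W2 + b2)
        # Step 3: loss
        losses.append(np.mean((Y_hat - y) ** 2))
        # Step 4: backward propagation (chain rule, layer by layer)
        d_out = 2.0 * (Y_hat - y) / len(X) * Y_hat * (1.0 - Y_hat)
        d_hid = (d_out @ W2.T) * H * (1.0 - H)
        # Step 5: gradient descent update
        W2 -= lr * H.T @ d_out
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_hid
        b1 -= lr * d_hid.sum(axis=0)
    return (W1, b1, W2, b2), losses

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])   # XOR targets

params, losses = train_mlp(X, y)
print("loss at start:", losses[0])
print("loss at end:  ", losses[-1])
```

How close the final outputs come to the XOR targets depends on the random initialization and the number of epochs, so the sketch only reports how the loss changes during training.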

THE MULTI-LAYER PERCEPTRON IN PRACTICE

This section explores practical considerations for using Multi-Layer Perceptrons (MLPs) to solve
real-world problems, focusing on three critical aspects: the amount of training data, the number
of hidden layers, and when to stop learning.

Amount of Training Data:

• For the MLP with one hidden layer there are (L + 1) × M + (M + 1) × N weights, where L, M,
and N are the number of nodes in the input, hidden, and output layers, respectively.

• The extra +1s come from the bias nodes, which also have adjustable weights.

• This is a potentially huge number of adjustable parameters that we need to set during the
training phase.

• Setting the values of these weights is the job of the back-propagation algorithm, which is
driven by the errors coming from the training data.

• Clearly, the more training data there is, the better for learning, although the time that the
algorithm takes to learn increases.

• Unfortunately, there is no way to compute what the minimum amount of data required is, since
it depends on the problem.

• A rule of thumb is that you should use a number of training examples that is at least 10 times
the number of weights.

• This is probably going to be a very large number of examples, so neural network training is a
fairly computationally expensive operation, because we need to show the network all of these
inputs lots of times.
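
The weight count and the rule of thumb are easy to evaluate for a concrete network. In this small sketch the layer sizes (4 inputs, 5 hidden nodes, 3 outputs) are hypothetical and the function name is an illustrative choice.

```python
def mlp_weight_count(L, M, N):
    """Number of adjustable weights in an MLP with one hidden layer.

    L, M, N are the numbers of input, hidden and output nodes;
    the +1 terms account for the bias nodes.
    """
    return (L + 1) * M + (M + 1) * N

n_weights = mlp_weight_count(4, 5, 3)   # (4 + 1) * 5 + (5 + 1) * 3
min_examples = 10 * n_weights           # rule of thumb: at least 10 examples per weight
print("weights:", n_weights, "suggested minimum training examples:", min_examples)
```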

Number of Hidden Layers:

• Two choices have to be made:

o The number of hidden nodes
o The number of hidden layers

• It is possible to show mathematically that one hidden layer with lots of hidden nodes is
sufficient. This is known as the Universal Approximation Theorem.

• We will never normally need more than two layers (that is, one hidden layer and the output
layer).

When to stop Learning:

• The training of the MLP requires that the algorithm runs over the entire dataset many times,
with the weights changing as the network makes errors in each iteration.

• Two options:

o Predefined number of iterations
o Predefined minimum error reached

• Using both of these options together can help, as can terminating the learning once the error
stops decreasing.

• We train the network for some predetermined amount of time, and then use the validation set to
estimate how well the network is generalising.

• We then carry on training for a few more iterations, and repeat the whole process.

• At some stage the error on the validation set will start increasing again, because the network
has stopped learning about the function that generated the data, and started to learn about the
noise that is in the data itself.

• At this stage we stop the training. This technique is called early stopping.
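
The stopping rule can be sketched independently of any particular network. The function below takes a sequence of validation errors, one per training round, and reports when training would stop. The patience parameter (how many rounds without improvement are tolerated) is an assumption added for the example, and the error values are made up for illustration.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of validation errors.

    Training stops once the validation error has not improved for
    `patience` consecutive rounds; the weights from best_epoch would be kept.
    """
    best_error = float("inf")
    best_epoch = -1
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_error, best_epoch = err, epoch    # still improving
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch               # error has started rising
    return len(val_errors) - 1, best_epoch         # never triggered

# Hypothetical validation errors: they fall, then start increasing again.
val_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.43, 0.47, 0.55]
stop_epoch, best_epoch = early_stopping_epoch(val_errors, patience=3)
print("stop at round", stop_epoch, "and keep the weights from round", best_epoch)
```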

EXAMPLES OF USING MLP

• We now apply the MLP to four different types of problem: regression, classification,
time-series prediction, and data compression.

Regression:

• Regression is a statistical technique that is used for predicting continuous outcomes.

• If you want to predict a single value, you only need a single output neuron and if you want to
predict multiple values, you can add multiple output neurons.

• In general, we don't apply any activation function to the output layer of an MLP when dealing
with regression tasks; it just computes the weighted sum and sends it to the output.

• But if you want the output to lie within a given range, for example between -1 and +1, you can
use an activation function such as tanh (hyperbolic tangent).

• The loss functions that can be used in a regression MLP include mean squared error (MSE) and
mean absolute error (MAE).

• MSE can be used for datasets with few outliers, while MAE is a better measure for datasets
that have more outliers.

• Example: Rainfall prediction, Stock price prediction

Classification:

• If the output variable is categorical, then we have to use classification for prediction.

Example: Iris Flower classification

• The aim is to classify iris flowers among three species (Setosa, Versicolor, or Virginica) from
the sepals’ and petals’ length and width measurements.

• The above neural network has one input layer, two hidden layers and one output layer.

• In the hidden layers we use sigmoid as an activation function for all neurons.

• In the output layer, we use softmax as an activation function for the three output neurons.

• In this regard, all outputs are between 0 and 1, and their sum is 1.

• The neural network has three outputs since the target variable contains three classes (Setosa,
Versicolor, and Virginica).
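
The behaviour of the softmax output layer can be illustrated with a short sketch. The three raw scores stand for the weighted sums arriving at the three output neurons; their values are hypothetical, and subtracting the maximum is a common numerical-stability step that does not change the result.

```python
import numpy as np

def softmax(z):
    """Turn raw output scores into probabilities that sum to 1."""
    z = z - np.max(z)        # numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

classes = ["Setosa", "Versicolor", "Virginica"]
scores = np.array([2.0, 1.0, 0.1])   # hypothetical weighted sums of the output neurons

probs = softmax(scores)
predicted = classes[int(np.argmax(probs))]
print("probabilities:", probs)
print("predicted species:", predicted)
```

The predicted class is simply the output neuron with the largest probability.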

Time series Prediction:

• There is a common data analysis task known as time-series prediction, where we have a set of
data that show how something varies over time, and we want to predict how the data will vary in
the future.

• The problem is that even if there is some regularity in the time-series, it can appear over many
different scales. For example, there is often seasonal variation in temperatures.

• Example: A typical time-series problem is to predict the ozone levels into the future and see if
you can detect an overall drop in the mean ozone level.

Data Compression / Data denoising:

• We train the network to reproduce the inputs at the output layer. This is called
auto-associative learning.

• These networks are known as autoencoders.

• The network is trained so that whatever you give as the input is reproduced at the output, which
doesn’t seem very useful at first, but suppose that we use a hidden layer that has fewer neurons
than the input layer.

• This bottleneck hidden layer has to represent all of the information in the input, so that it can
be reproduced at the output.

• It therefore performs some compression of the data, representing it using fewer dimensions
than were used in the input.

• They are finding a different representation of the input data that extracts important components
of the data, and ignores the noise.

• This auto-associative network can be used to compress images and other data.
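
The bottleneck idea can be sketched with a very small autoencoder. To keep the example short it makes simplifying assumptions that are not in these notes: the network is linear (no activation functions), it has no bias terms, and the data are synthetic 4-dimensional points that really vary along only 2 directions. The names W_enc and W_dec, the learning rate and the number of steps are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 4-D inputs generated from a 2-D latent variable.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 4))
X = latent @ mixing

# Encoder (4 -> 2) and decoder (2 -> 4); the 2-node hidden layer is the bottleneck.
W_enc = rng.normal(scale=0.1, size=(4, 2))
W_dec = rng.normal(scale=0.1, size=(2, 4))

lr = 0.01
losses = []
for _ in range(3000):
    code = X @ W_enc                 # compressed representation
    X_rec = code @ W_dec             # reconstruction of the input
    err = X_rec - X
    losses.append(np.mean(err ** 2))
    d_rec = 2.0 * err / err.size     # gradient of the MSE w.r.t. the reconstruction
    grad_dec = code.T @ d_rec
    grad_enc = X.T @ (d_rec @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print("reconstruction error at start:", losses[0])
print("reconstruction error at end:  ", losses[-1])
print("shape of compressed data:", (X @ W_enc).shape)
```

The target of training is the input itself, so no labels are needed; the hidden code is the compressed, lower-dimensional representation.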

Advantages of Multi-Layer Perceptron Neural Network


• Multi-Layer Perceptron Neural Networks can easily work with non-linear problems.
• It can handle complex problems while dealing with large datasets.
• Developers use this model to deal with the fitness problem of Neural Networks.
• It has a higher accuracy rate and reduces prediction error by using backpropagation.
• After training the model, the Multilayer Perceptron Neural Network quickly predicts
the output.

Disadvantages of Multi-Layer Perceptron Neural Network
• This neural network involves a large amount of computation, which sometimes increases the
overall cost of the model.
• The model performs well only when it is trained well.
• Because the layers are fully connected, the number of parameters and the redundancy among
nodes increase.

DERIVING BACK-PROPAGATION
Backpropagation is an algorithm used in artificial intelligence and machine learning to train
artificial neural networks through error correction. The network learns by calculating the loss
function, that is, the difference between the output it produced and the desired (target) output.
When you apply backpropagation, you work backward from the output nodes to the input nodes to
reduce the loss and produce the desired result.

How Does the Backward Pass Work?


In the backward pass, the error (the difference between the predicted and actual output) is
propagated back through the network to adjust the weights and biases.
Once the error is calculated, the network adjusts weights using gradients, which are computed
with the chain rule. These gradients indicate how much each weight and bias should be
adjusted to minimize the error in the next iteration. The backward pass continues layer by
layer, ensuring that the network learns and improves its performance. The activation function,
through its derivative, plays a crucial role in computing these gradients during
backpropagation.

Back Propagation Algorithm


The backpropagation algorithm is used in a multilayer perceptron neural network to increase the
accuracy of the output by reducing the error between the predicted output and the actual output.
According to this algorithm,
• Calculate the error after calculating the output from the multilayer perceptron neural
network.
• This error is the difference between the output generated by the neural network and the
actual output. The calculated error is fed back into the network, from the output layer to
the hidden layer.
• In this backward pass, the error at the output serves as the input that is propagated back
through the network.
• The model reduces the error by adjusting the weights in the hidden layer.
• Calculate the predicted output with the adjusted weights and check the error. The process is
repeated until the error is minimal or zero.
• This algorithm helps in increasing the accuracy of the neural network.

Deriving the Backpropagation

Backpropagation is the process of adjusting a neural network’s weights and biases to reduce
error. It does this by:

1. Calculating the error (how wrong the model's prediction is).
2. Finding gradients (how much each weight contributes to the error).
3. Updating the weights using gradient descent to make better predictions.

Backpropagation has 4 main steps:

1. Forward Propagation (Make a prediction).
2. Calculate Loss (Measure how wrong the prediction is).
3. Compute Gradients (Find how much each weight contributed to the error).
4. Update Weights (Adjust weights to minimize the error).

Backpropagation for Regression (MSE Loss)

We use Mean Squared Error (MSE) loss, which is used when predicting continuous values
(e.g., predicting house prices).

Step 1: Forward Propagation (Compute Prediction)

Each neuron performs:

z = Wx + b
ŷ = f(z)

where x is the input, W is the weight, b is the bias, and f is the activation function. For a
regression output neuron, no activation is usually applied, so ŷ = z.

Step 2: Compute MSE Loss

L = (1/N) Σ (yi − ŷi)²

Step 3: Compute Gradients (Find Errors)

With ŷ = Wx + b, the chain rule gives:

∂L/∂W = −(2/N) Σ (yi − ŷi) xi

This formula calculates how much the weight W should be adjusted.

∂L/∂b = −(2/N) Σ (yi − ŷi)

This formula calculates how much the bias b should be adjusted.

Step 4: Update Weights

W ← W − η (∂L/∂W)
b ← b − η (∂L/∂b)

where η is the learning rate.

By repeating these steps, the model gradually improves and learns the weight and bias that
minimize the error.

Backpropagation for Classification (BCE Loss)


We use Binary Cross-Entropy (BCE) loss, which is used for binary classification (e.g., Spam vs. Not
Spam).

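Step 1: Forward Propagation (Compute Prediction)

Each neuron performs:

z = Wx + b
ŷ = σ(z) = 1 / (1 + e^(−z))

Step 2: Compute Binary Cross-Entropy (BCE) Loss

L = −(1/N) Σ [ yi log(ŷi) + (1 − yi) log(1 − ŷi) ]

Step 3: Compute Gradients (Find Errors)

For a sigmoid output neuron, the gradients simplify to:

∂L/∂W = −(1/N) Σ (yi − ŷi) xi
∂L/∂b = −(1/N) Σ (yi − ŷi)

Step 4: Update Weights

W ← W − η (∂L/∂W)
b ← b − η (∂L/∂b)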
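
The four steps for the BCE case can be sketched for a single sigmoid neuron. This is a minimal illustration under stated assumptions: the training data are the logical OR truth table (chosen because it is linearly separable), the weights start at zero, and the learning rate and number of iterations are illustrative. The gradient used is the simplified form for a sigmoid output with BCE loss, namely the mean of (ŷ − y)·x.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Logical OR as a tiny binary classification problem.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])

W = np.zeros(2)
b = 0.0
eta = 1.0          # learning rate (illustrative)
losses = []
for _ in range(2000):
    y_hat = sigmoid(X @ W + b)               # Step 1: forward propagation
    losses.append(bce(y, y_hat))             # Step 2: BCE loss
    grad_W = X.T @ (y_hat - y) / len(y)      # Step 3: gradients
    grad_b = np.mean(y_hat - y)
    W -= eta * grad_W                        # Step 4: update
    b -= eta * grad_b

pred = (sigmoid(X @ W + b) > 0.5).astype(int)
print("final loss:", losses[-1])
print("predicted classes:", pred)
```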

Example:

Step 1: Network architecture and Define Input Values and given weights

We assume the network has:

Step 2: Forward Propagation: We calculate the hidden layer activation, then the output
layer activation.

1. Compute Hidden Layer Activation:


The hidden layer neuron receives input:

2. Compute Output Layer Activation

The output neuron receives input:

3. Compute Loss (Error)

Using Mean Squared Error (MSE):

Step 3: Backward Propagation

Now, we compute gradients and update weights/biases.

1. Compute Error Signal for Output Layer

2. Compute Gradient w.r.t. W2

3. Compute Gradient w.r.t. b2

4. Compute Error Signal for Hidden Layer

5. Compute Gradient w.r.t. W1

6. Compute Gradient w.r.t. b1

Step 4: Update Weights and Biases

Final Updated Values

Note: Hidden layers have their own weights and biases. A hidden layer also has an input value,
which comes from the previous layer.

Each neuron in a layer is connected to neurons in the previous layer via weights. Every layer
(except the input layer) has:

• Weights for each connection from the previous layer.
• Biases added to the weighted sum before applying the activation function.

For a Neural Network with 1 Input, 1 Hidden Layer, and 1 Output Layer:

• Input Layer → Hidden Layer:
o Weight: W1 (connects input to hidden neuron)
o Bias: b1 (bias for hidden neuron)
• Hidden Layer → Output Layer:
o Weight: W2 (connects hidden neuron to output neuron)
o Bias: b2 (bias for output neuron)

Each layer learns its own set of weights and biases.
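
One complete forward pass, backward pass and update for such a network (one input, one hidden neuron, one output neuron) can be traced in code. The numeric values of the input, target, weights, biases and learning rate below are illustrative assumptions, as is the use of sigmoid activations in both layers with a squared-error loss. The sketch also compares one analytic gradient with a finite-difference estimate, which is a common way to check a hand-derived backpropagation formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    h = sigmoid(W1 * x + b1)        # hidden layer activation
    y_hat = sigmoid(W2 * h + b2)    # output layer activation
    return h, y_hat

def loss_fn(y, y_hat):
    return (y_hat - y) ** 2         # squared error for a single example

# Illustrative values (assumptions made for this sketch).
x, y = 0.5, 1.0
W1, b1, W2, b2 = 0.4, 0.1, 0.6, 0.2
eta = 0.5

# Forward propagation
h, y_hat = forward(x, W1, b1, W2, b2)
loss_before = loss_fn(y, y_hat)

# Backward propagation
delta2 = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)   # error signal, output layer
grad_W2 = delta2 * h
grad_b2 = delta2
delta1 = delta2 * W2 * h * (1.0 - h)                 # error signal, hidden layer
grad_W1 = delta1 * x
grad_b1 = delta1

# Finite-difference check of the gradient w.r.t. W1
eps = 1e-6
_, y_plus = forward(x, W1 + eps, b1, W2, b2)
_, y_minus = forward(x, W1 - eps, b1, W2, b2)
numeric_grad_W1 = (loss_fn(y, y_plus) - loss_fn(y, y_minus)) / (2.0 * eps)

# Update weights and biases
W1_new, b1_new = W1 - eta * grad_W1, b1 - eta * grad_b1
W2_new, b2_new = W2 - eta * grad_W2, b2 - eta * grad_b2
_, y_hat_new = forward(x, W1_new, b1_new, W2_new, b2_new)
loss_after = loss_fn(y, y_hat_new)

print("analytic dL/dW1:", grad_W1, "numeric dL/dW1:", numeric_grad_W1)
print("loss before:", loss_before, "loss after one update:", loss_after)
```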

Advantages of Backpropagation for Neural Network Training

The key benefits of using the backpropagation algorithm are:


• Ease of Implementation: Backpropagation is conceptually simple; once the error
derivatives are available, programming it amounts to adjusting the weights along
those derivatives.
• Simplicity and Flexibility: Its straightforward design suits a range of tasks, from
basic feedforward to complex convolutional or recurrent networks.
• Efficiency: Backpropagation computes the gradients for all weights in a single
backward pass, which makes training practical even for deep networks.

• Generalization: It helps models generalize well to new data, improving prediction
accuracy on unseen examples.
• Scalability: The algorithm scales efficiently with larger datasets and more complex
networks, making it ideal for large-scale tasks.

Challenges with Backpropagation


While backpropagation is powerful, it does face some challenges:
1. Vanishing Gradient Problem: In deep networks, the gradients can become very
small during backpropagation, making it difficult for the network to learn. This is
common when using activation functions like sigmoid or tanh.
2. Exploding Gradients: The gradients can also become excessively large, causing
the network to diverge during training.
3. Overfitting: If the network is too complex, it might memorize the training data
instead of learning general patterns.
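
The vanishing gradient problem can be made concrete with a small calculation. The derivative of the sigmoid is at most 0.25 (reached at input 0), and backpropagation multiplies one such derivative per layer. The sketch below looks only at this activation-derivative factor and ignores the weights, which is a simplifying assumption; the layer counts are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

max_grad = sigmoid_grad(0.0)   # largest possible sigmoid derivative
for n_layers in (1, 5, 10, 20):
    # Upper bound on the product of sigmoid derivatives across n_layers layers.
    print(n_layers, "layers:", max_grad ** n_layers)
```

Even in this best case the factor shrinks geometrically with depth, which is why the earliest layers of a deep sigmoid network can receive very small updates.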

RADIAL BASIS FUNCTIONS AND SPLINES


THE RADIAL BASIS FUNCTION (RBF) NETWORK

A radial basis function (RBF) neural network is a type of artificial neural network that uses radial
basis functions as activation functions. It typically consists of three layers: an input layer, only
one hidden layer, and an output layer. The hidden layer applies a radial basis function, usually
a Gaussian function. RBF neural networks are highly versatile and are extensively used in
pattern classification tasks, function approximation, and a variety of machine learning
applications. They are especially known for their ability to handle non-linear problems
effectively.

Structure of RBF neural networks

An RBF neural network typically comprises three layers:

• Input layer: This layer simply transmits the inputs to the neurons in the hidden layer.
• Hidden layer: Each neuron in this layer applies a radial basis function to the inputs it
receives. RBF has strictly one hidden layer.
• Output layer: Each neuron in this layer computes a weighted sum of the outputs from
the hidden layer, resulting in the final output.

Working of RBF

• When dealing with non-linear data, we aim to convert it into linearly separable data.
• To achieve this, every hidden layer neuron uses a non-linear radial basis function as the
activation function, transforming the data into a higher-dimensional space.

Types of Radial Basis Functions:

1. Gaussian RBF (most common). A common form is:

φ(x) = exp(−‖x − c‖² / (2r²))

x = Input
c = Center
r = Radius

2. Multiquadric RBF. A common form is:

φ(x) = √(‖x − c‖² + r²)

Algorithm of RBF

Input & Output

• Input: A set of input vectors (x1,x2,...,xn)


• Output: y_n

Step 1: Initialize Weights

• Assign weights for each connection from hidden layer to output layer.
• Initially, weights are randomly assigned in the range [-1,1].

Forward Phase

Step 1: Input Layer Computation

• Each node in the input layer directly passes its input:

Step 2: Hidden Layer Computation

• The hidden layer applies the Radial Basis Function:

• The distance between input x and center c determines the activation.

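Step 3: Output Layer Computation

• Each output node computes a weighted sum of the hidden-layer activations: $y = \sum_j w_j \varphi_j(x)$, where $\varphi_j(x)$ is the output of hidden neuron $j$ and $w_j$ is its weight to the output node.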

Backward Phase (Training)

1. Train the hidden layer using backpropagation.

2. Update weights between hidden layer and output layer.
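
The forward phase and the output-weight update can be sketched as follows. This is a simplified illustration, not the full procedure above: the hidden layer is kept fixed (one Gaussian neuron centered on each training point, with an assumed radius of 0.5), and only the hidden-to-output weights are trained with a simple error-correction rule. The XOR data, learning rate, and function names are assumptions made for the example.

```python
import math
import random

def gaussian_rbf(x, c, r):
    """Gaussian activation of one hidden neuron with center c and radius r."""
    dist_sq = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-dist_sq / (2 * r ** 2))

def hidden_activations(x, centers, r):
    """Forward phase, hidden layer: one activation per center."""
    return [gaussian_rbf(x, c, r) for c in centers]

def predict(x, centers, r, weights):
    """Forward phase, output layer: weighted sum of hidden activations."""
    phi = hidden_activations(x, centers, r)
    return sum(w * p for w, p in zip(weights, phi))

def train_output_weights(inputs, targets, centers, r, lr=0.5, epochs=500, seed=0):
    """Train only the hidden-to-output weights; the hidden layer stays fixed."""
    rng = random.Random(seed)
    # Weights start as random values in [-1, 1], as in Step 1 of the algorithm.
    weights = [rng.uniform(-1.0, 1.0) for _ in centers]
    for _ in range(epochs):
        for x, t in zip(inputs, targets):
            phi = hidden_activations(x, centers, r)
            error = t - sum(w * p for w, p in zip(weights, phi))
            weights = [w + lr * error * p for w, p in zip(weights, phi)]
    return weights

# XOR is not linearly separable in the input space.
inputs = [[0, 0], [0, 1], [1, 0], [1, 1]]
targets = [0, 1, 1, 0]
centers = inputs          # assumption: one hidden neuron per training point
radius = 0.5              # assumption: fixed spread for every neuron
weights = train_output_weights(inputs, targets, centers, radius)
predictions = [predict(x, centers, radius, weights) for x in inputs]
print([round(p, 3) for p in predictions])
```

Because the hidden layer maps the four XOR points to nearly orthogonal activation vectors, a plain weighted sum in the output layer is enough to reproduce the targets.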

Key Characteristics of RBFs


• Radial Basis Functions: These are real-valued functions dependent solely on the
distance from a central point. The Gaussian function is the most commonly used
type.
• Dimensionality: The network's dimensions correspond to the number of predictor
variables.
• Center and Radius: Each RBF neuron has a center and a radius (spread). The
radius affects how broadly each neuron influences the input space.

Advantages of RBF Networks


1. Universal Approximation: RBF Networks can approximate any continuous
function with arbitrary accuracy given enough neurons.
2. Faster Learning: The training process is generally faster compared to other neural
network architectures.
3. Simple Architecture: The straightforward, three-layer architecture makes RBF
Networks easier to implement and understand.

Applications of RBF Networks


• Classification: RBF Networks are used in pattern recognition
and classification tasks, such as speech recognition and image classification.
• Regression: These networks can model complex relationships in data for prediction
tasks.
• Function Approximation: RBF Networks are effective in approximating non-
linear functions.

THE CURSE OF DIMENSIONALITY

The curse of dimensionality refers to the challenges that arise when analyzing, organizing,
and modeling data in high-dimensional spaces (often hundreds or thousands of dimensions).
It matters in machine learning because, as the number of features or dimensions in a dataset
increases, the amount of data we need to generalize accurately grows exponentially.

What problems does it cause?

1. Data sparsity: Data becomes sparse, meaning that most of the high-dimensional space
is empty. This makes clustering and classification tasks challenging.
2. Increased computation: More dimensions mean more computational resources and time
to process the data.

3. Overfitting: With higher dimensions, models can become overly complex, fitting to the
noise rather than the underlying pattern. This reduces the model's ability to generalize to
new data.
4. Distances lose meaning: In high dimensions, the difference in distances between data
points tends to become negligible, making measures like Euclidean distance less
meaningful.
5. Performance degradation: Algorithms, especially those relying on distance measurements
like k-nearest neighbors, can see a drop in performance.
6. Visualization challenges: High-dimensional data is hard to visualize, making exploratory
data analysis more difficult.
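
Problem 4 (distances lose meaning) can be observed with a small experiment. The sketch below draws random points in the unit hypercube and measures the relative contrast, (farthest distance − nearest distance) / nearest distance, from one query point. The point count, seed, and function name are arbitrary choices for illustration.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point))))
    return (max(distances) - min(distances)) / min(distances)

for dim in (2, 10, 100, 1000):
    # The contrast shrinks as the dimension grows: near and far neighbors look alike.
    print(dim, round(relative_contrast(dim), 3))
```

When the contrast is close to zero, the "nearest" neighbor is barely nearer than the farthest one, which is why distance-based methods such as k-nearest neighbors degrade.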

Why does the curse of dimensionality occur?

It occurs mainly because as we add more features or dimensions, we're increasing the complexity
of our data without necessarily increasing the amount of useful information. Moreover, in high-
dimensional spaces, most data points are at the "edges" or "corners," making the data sparse.

How to Solve the Curse of Dimensionality

The primary solution to the curse of dimensionality is "dimensionality reduction." It's a process
that reduces the number of random variables under consideration by obtaining a set of principal
variables. By reducing the dimensionality, we can retain the most important information in the
data while discarding the redundant or less important features.

Dimensionality Reduction Methods

Principal Component Analysis (PCA)

PCA is a statistical method that transforms the original variables into a new set of variables,
which are linear combinations of the original variables. These new variables are called principal
components.

Let's say we have a dataset containing information about different aspects of cars, such as
horsepower, torque, acceleration, and top speed. We want to reduce the dimensionality of this
dataset using PCA.

Using PCA, we can create a new set of variables called principal components. The first principal
component would capture the most variance in the data, which could be a combination of
horsepower and torque. The second principal component might represent acceleration and top
speed. By reducing the dimensionality of the data using PCA, we can visualize and analyze the
dataset more effectively.
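
As a rough illustration, the first principal component can be found by power iteration on the covariance matrix. This is a minimal sketch in plain Python, not a production implementation: the car figures below are made-up numbers, the function name is illustrative, and in practice features on different scales would usually be standardized first.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return the unit direction of maximum variance and each row's score along it."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    v = [1.0] * dim
    for _ in range(iterations):
        # Repeated multiplication by the covariance matrix converges to its top eigenvector.
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(dim)) for c in centered]
    return v, scores

# Hypothetical cars: (horsepower, torque).
cars = [[130, 180], [150, 210], [200, 280], [250, 350], [300, 400]]
direction, scores = first_principal_component(cars)
print(direction)   # a single combined "engine power" axis
print(scores)      # each car reduced from two numbers to one
```

Keeping only the score along this direction reduces each car from two features to one while preserving most of the variance.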

Linear Discriminant Analysis (LDA)

LDA finds linear combinations of attributes that best separate the classes: it maximizes the
variance between classes relative to the variance within each class. It's particularly useful
for classification tasks. Suppose we have a dataset with various features of flowers, such as
petal length, petal width, sepal length, and sepal width. Additionally, each flower in the
dataset is labeled as either a rose or a lily. We can use LDA to identify the attributes that
best separate these two classes.

LDA might find that petal length and petal width are the most discriminative attributes between
roses and lilies. It would create a linear combination of these attributes to form a new variable,
which can then be used for classification tasks. By reducing the dimensionality using LDA, we
can improve the accuracy of flower classification models.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a non-linear dimensionality reduction technique that's particularly useful for visualizing
high-dimensional datasets. Let's consider a dataset with images of different types of animals,
such as cats, dogs, and birds. Each image is represented by a high-dimensional feature vector
extracted from a deep neural network.

Using t-SNE, we can reduce the dimensionality of these feature vectors to two dimensions,
allowing us to visualize the dataset. The t-SNE algorithm would map similar animals closer
together in the reduced space, enabling us to observe clusters of similar animals. This
visualization can help us understand the relationships and similarities between different animal
types in a more intuitive way.

Autoencoders

These are neural networks used for dimensionality reduction. They work by compressing the
input into a compact representation and then reconstructing the original input from this
representation. Suppose we have a dataset of images of handwritten digits, such as the MNIST
dataset. Each image is represented by a high-dimensional pixel vector.

We can use an autoencoder, which is a type of neural network, for dimensionality reduction.
The autoencoder would learn to compress the input images into a lower-dimensional
representation, often called the latent space. This latent space would capture the most important
features of the images. We can then use the autoencoder to reconstruct the original images from
the latent space representation. By reducing the dimensionality using autoencoders, we can
effectively capture the essential information from the images while discarding unnecessary
details.

INTERPOLATION AND BASIS FUNCTIONS
INTERPOLATION:
In machine learning, interpolation refers to the process of estimating unknown values that
fall between known data points. This can be useful in various scenarios, such as filling in
missing values in a dataset or generating new data points to smooth out a curve.

It can be used in a variety of industries, including

• Geodesy: Interpolation is used to map out features on Earth's surface, such as mountains
or ocean currents, using satellite imagery.

• Engineering: Interpolation predicts how materials behave in extreme conditions, such as
high temperatures or pressure.

• Statistical analysis: Interpolation can be used to smooth out data sets so that they
become more evenly distributed. For example, if you have a spike in sales one day, you
can use interpolation to smooth out the rest of your sales data for that month so that the
overall trend looks smooth instead of erratic.

TYPES OF INTERPOLATION:
• Linear interpolation: Linear interpolation is a simple method for estimating unknown
values between two known points. It assumes that the data points can be connected by a
straight line.
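Formula for Linear Interpolation: for known points $(x_0, y_0)$ and $(x_1, y_1)$,
$y = y_0 + \dfrac{(x - x_0)(y_1 - y_0)}{x_1 - x_0}$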

• Polynomial interpolation: What if we have more than two points? Instead of a straight
line, we can fit a curve using a polynomial. This works like connecting the dots
smoothly so the estimated values follow the trend of the data. A common method for this
is Lagrange interpolation.

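If we have three known points $(x_0, y_0)$, $(x_1, y_1)$, $(x_2, y_2)$, we can construct a
polynomial $P(x)$ that passes through these points:

$P(x) = y_0 \dfrac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)} + y_1 \dfrac{(x - x_0)(x - x_2)}{(x_1 - x_0)(x_1 - x_2)} + y_2 \dfrac{(x - x_0)(x - x_1)}{(x_2 - x_0)(x_2 - x_1)}$

• Spline Interpolation (Smooth Curves): Spline interpolation is used when we need
smooth curves instead of sharp turns. The most common type is cubic spline
interpolation, which fits a cubic polynomial between each pair of points.

The general form of a cubic spline on the interval $[x_i, x_{i+1}]$ is:

$S_i(x) = a_i + b_i (x - x_i) + c_i (x - x_i)^2 + d_i (x - x_i)^3$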
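
The first two interpolation types can be sketched in a few lines. The function names and sample points below are illustrative choices, not part of the notes.

```python
def linear_interpolate(x, x0, y0, x1, y1):
    """Estimate y at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

def lagrange_interpolate(x, points):
    """Evaluate the Lagrange polynomial through all (xi, yi) points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        basis = 1.0
        for j, (xj, _) in enumerate(points):
            if j != i:
                # Each basis polynomial is 1 at its own xi and 0 at every other xj.
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total

# Straight line between (2, 4) and (3, 9), evaluated halfway.
print(linear_interpolate(2.5, 2, 4, 3, 9))
# Three points taken from y = x^2; the polynomial through them recovers the curve.
print(lagrange_interpolate(2.5, [(1, 1), (2, 4), (3, 9)]))
```

The two estimates at x = 2.5 differ because the straight line ignores the curvature that the three-point polynomial captures.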

Basis function

Instead of using a single equation to represent a function, we combine multiple small functions
(called basis functions) to form the final function. In other words, the function is broken into
small parts, each described by a basis function, so that a machine learning model can learn
patterns better.

Think of it like building a house with Lego blocks: each basis function is a Lego piece.

This method is used in splines and radial basis functions (RBFs) to make models that can fit
complex patterns.

THE CUBIC SPLINE

A cubic spline is a smooth curve made up of cubic polynomials that are joined together at
specific points called knotpoints.

The key idea is:

• The function is made up of different cubic equations for different sections.


• These cubic equations connect smoothly at the knotpoints.
• The function and its first two derivatives (slope & curvature) must match at each
knotpoint.
• A basis function is like a small building block that helps us construct the final curve.

Once you have knotpoints, you need to choose how the function behaves in each section.

1. Constant Basis Function (Blocky Steps)

• Imagine a staircase: each step is flat and has a fixed height.


• In this case, each section (between knotpoints) has a constant value.
• This is a piecewise constant function (it looks like a blocky step graph).

Problem: The function is not smooth—it jumps from one level to another without a transition.

2. Linear Basis Function (Straight Line Segments)

• Instead of keeping each section flat, we allow it to increase or decrease linearly.


• This creates a piecewise linear function (like a zigzag pattern).
• The function is now continuous: it no longer jumps between sections.

Problem: Straight-line segments meet at an angle at the knotpoints, so the curve has sharp
corners where the slope changes abruptly.

3. Cubic Basis Function (Smooth Curves)

• To avoid sharp corners, we use cubic splines.


• Instead of straight lines, each section is a cubic equation.
• This ensures that the function, slope, and curvature match at knotpoints.
• The result is a smooth, flowing curve.

Best Choice for Smoothness: Cubic splines! They create smooth curves that don’t have sharp
edges or abrupt changes.
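
A natural cubic spline can be computed with the standard tridiagonal algorithm. The sketch below is one common formulation, assuming "natural" end conditions (zero curvature at the first and last knotpoint); the function names and sample knotpoints are illustrative.

```python
def natural_cubic_spline(xs, ys):
    """Return coefficients (b, c, d) so that on [xs[j], xs[j+1]]
    S_j(x) = ys[j] + b[j]*dx + c[j]*dx**2 + d[j]*dx**3, with dx = x - xs[j]."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3 * (ys[i + 1] - ys[i]) / h[i] - 3 * (ys[i] - ys[i - 1]) / h[i - 1]
    # Forward sweep of the tridiagonal system for the curvature coefficients c.
    l = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    # Back substitution; c[0] = c[n] = 0 is the natural end condition.
    b, d, c = [0.0] * n, [0.0] * n, [0.0] * (n + 1)
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])
    return b, c, d

def evaluate_spline(x, xs, ys, b, c, d):
    """Evaluate the spline at x using the cubic of the section that contains x."""
    j = len(xs) - 2
    for i in range(len(xs) - 1):
        if x <= xs[i + 1]:
            j = i
            break
    dx = x - xs[j]
    return ys[j] + b[j] * dx + c[j] * dx ** 2 + d[j] * dx ** 3

knots_x = [0.0, 1.0, 2.0, 3.0]
knots_y = [0.0, 1.0, 0.0, 1.0]
b, c, d = natural_cubic_spline(knots_x, knots_y)
print([round(evaluate_spline(x / 2, knots_x, knots_y, b, c, d), 3) for x in range(7)])
```

Each section gets its own cubic, and the coefficients are chosen so that value, slope, and curvature agree where neighboring sections meet.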

SUPPORT VECTOR MACHINE


Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate
n-dimensional space into classes so that we can easily put the new data point in the correct
category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed Support Vector Machine.
Consider the diagram below, in which two different categories are separated by a
decision boundary or hyperplane:

SVM algorithm can be used for Face detection, image classification, text categorization, etc.

Types of SVM

SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be
classified into two classes by using a single straight line, the data is termed linearly
separable, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset
cannot be classified by using a straight line, the data is termed non-linear, and the
classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find the best decision boundary for classifying the
data points. This best boundary is known as the hyperplane of SVM.

Support Vectors: The data points or vectors that are closest to the hyperplane and affect
its position are termed support vectors. They are so called because these vectors support
the hyperplane.

Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a
dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a
classifier that can classify the pair(x1, x2) of coordinates in either green or blue. Consider the
below image:

Since this is a 2-D space, we can separate the two classes with a straight line. But
there can be multiple lines that separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or
region is called a hyperplane. The SVM algorithm finds the points from both classes that lie
closest to the boundary. These points are called support vectors. The distance between the
support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize
this margin. The hyperplane with maximum margin is called the optimal hyperplane.

Kernel or Non-Linear SVM:


If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:

So, to separate these data points, we need to add one more dimension. For linear data, we have
used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be
calculated as:
$z = x^2 + y^2$
By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below
image:

Since we are in 3-D space, the boundary looks like a plane parallel to the x-axis. If we convert
it back to 2-D space with z = 1, it becomes:

Hence, we get a circle of radius 1 ($x^2 + y^2 = 1$) as the boundary in the case of non-linear data.
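
The effect of the extra dimension z = x² + y² can be checked on a toy dataset. In the sketch below, the two rings of points (radius 0.5 and radius 2) and the threshold z = 1 are assumptions chosen to mirror the example; no straight line separates the rings in 2-D, but a single threshold on z does.

```python
import math

def lift(x, y):
    """Third dimension added to each point: z = x^2 + y^2."""
    return x * x + y * y

angles = (0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
inner_ring = [(0.5 * math.cos(a), 0.5 * math.sin(a)) for a in angles]  # class "inner"
outer_ring = [(2.0 * math.cos(a), 2.0 * math.sin(a)) for a in angles]  # class "outer"

def classify(x, y, threshold=1.0):
    """In the lifted space the classes are split by the plane z = threshold."""
    return "inner" if lift(x, y) < threshold else "outer"

print([classify(x, y) for x, y in inner_ring])
print([classify(x, y) for x, y in outer_ring])
```

Projected back to two dimensions, the plane z = 1 is exactly the circle of radius 1 described above.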

SVM Algorithm
1. Goal:
o Find the best line (or hyperplane in higher dimensions) that separates two classes
of data points.
2. Steps:
o Step 1: Collect Data:
▪ Gather your data with features (e.g., height, weight) and labels (e.g., cat or
dog).
o Step 2: Plot Data:
▪ Visualize the data points on a graph (if possible).
o Step 3: Find the Best Line:
▪ Draw a line that separates the two classes.
▪ Make sure the line is as far as possible from the closest data points of both
classes (these closest points are called support vectors).
o Step 4: Handle Non-Linear Data:
▪ If the data isn’t linearly separable (you can’t draw a straight line), use a
trick called the kernel trick to transform the data into a higher dimension
where a line can separate the classes.
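o Step 5: Make Predictions:
▪ Classify a new data point according to which side of the separating line
(or hyperplane) it falls on.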

How does Support Vector Machine Algorithm Work?

1. Plot the Data: Each data point is represented in n-dimensional space (n = number of
features). For example, if you have two features, you can plot the data on a 2D graph.

2. Find the Hyperplane: SVM finds the hyperplane (a straight line in 2D, a flat plane in
3D, or more generally, an (n-1)-dimensional flat surface in n-dimensional space) that
separates the two classes of data points with the maximum margin.
o Maximum Margin: This is the largest possible distance between the hyperplane
and the nearest data points from both classes.
o These closest points are called support vectors because they “support” the
hyperplane.

3. Separate the Classes: The hyperplane divides the data into two regions, each
representing one class. For example:
o One side of the line = Class A.
o Other side = Class B.

4. Non-Linearly Separable Data: If the data cannot be separated with a straight line (e.g.,
spiral data), SVM uses something called a kernel trick to transform the data into a higher
dimension where it becomes linearly separable.
o Kernel Functions: Mathematical functions like polynomial, RBF (Radial Basis
Function), etc., are used to transform the data.
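
For linearly separable data, the steps above can be approximated with a small linear SVM trained by sub-gradient descent on the hinge loss. This is one simple training scheme among several, not necessarily the solver used in practice (libraries typically use quadratic programming or specialized solvers). The toy points, regularization strength, learning rate, and function names are assumptions for illustration.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.01, epochs=5000):
    """Minimize lam/2 * ||w||^2 + mean hinge loss with full-batch sub-gradient steps."""
    dim, n = len(points[0]), len(points)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:
                # Only points inside the margin (or misclassified) push the boundary.
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(x, w, b):
    """Class is decided by the side of the hyperplane w.x + b = 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

points = [[2, 2], [3, 3], [2, 3], [0, 0], [1, 0], [0, 1]]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
predictions = [predict(x, w, b) for x in points]
print(w, b, predictions)
```

Points whose margin stays close to 1 after training play the role of support vectors: they are the only ones that still influence the boundary.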

Advantages and Disadvantages of Support Vector Machine (SVM)

Advantages of Support Vector Machine (SVM)

1. High-Dimensional Performance: SVM excels in high-dimensional spaces, making
it suitable for image classification and gene expression analysis.
2. Nonlinear Capability: Utilizing kernel functions like RBF and polynomial, SVM
effectively handles nonlinear relationships.
3. Outlier Resilience: The soft margin feature allows SVM to ignore outliers,
enhancing robustness in spam detection and anomaly detection.
4. Binary and Multiclass Support: SVM is effective for both binary
classification and multiclass classification, suitable for applications in text
classification.
5. Memory Efficiency: SVM focuses on support vectors, making it memory efficient
compared to other algorithms.

Disadvantages of Support Vector Machine (SVM)


1. Slow Training: SVM can be slow to train on large datasets, which limits its use in
large-scale data mining tasks.
2. Parameter Tuning Difficulty: Selecting the right kernel and adjusting parameters
like C requires careful tuning.
3. Noise Sensitivity: SVM struggles with noisy datasets and overlapping classes,
limiting effectiveness in real-world scenarios.
4. Limited Interpretability: The complexity of the hyperplane in higher dimensions
makes SVM less interpretable than other models.
5. Feature Scaling Sensitivity: Proper feature scaling is essential; otherwise, SVM
models may perform poorly.

UNIT - III
Learning with Trees – Decision Trees – Constructing Decision Trees – Classification and
Regression Trees – Ensemble Learning – Boosting – Bagging – Different ways to Combine
Classifiers – Basic Statistics – Gaussian Mixture Models – Nearest Neighbor Methods –
Unsupervised Learning – K means Algorithms

LEARNING WITH TREES


DECISION TREES

• Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
• In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
• Decision nodes are used to make any decision and have multiple branches, whereas
Leaf nodes are the output of those decisions and do not contain any further branches.
• The decisions or the tests are performed on the basis of features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.

Below diagram explains the general structure of a decision tree:

Example:

• One of the reasons that decision trees are popular is that we can turn them into a set of
logical disjunctions (if ... then rules) that then go into program code very simply.

Ex:

• if there is a party then go to it


• if there is not a party and you have an urgent deadline then study
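
The two rules above translate almost directly into code. In this sketch the function name, argument names, and the fallback value for the case the rules do not cover are assumptions.

```python
def choose_activity(party, urgent_deadline):
    """Apply the decision-tree rules from top to bottom."""
    if party:
        return "go to the party"
    if urgent_deadline:
        return "study"
    # The two example rules do not say what happens here; this value is a placeholder.
    return "undecided"

print(choose_activity(party=True, urgent_deadline=True))
print(choose_activity(party=False, urgent_deadline=True))
```

Each path from the root of a decision tree to a leaf becomes one such if-then rule.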

CONSTRUCTING DECISION TREE

Types of Decision Tree Algorithms:


• ID3: This algorithm measures how mixed up the data is at a node using something called
entropy. It then chooses the feature that helps to clarify the data the most.
• C4.5: This is an improved version of ID3 that can handle missing data and continuous
attributes.
• CART: This algorithm uses a different measure called Gini impurity to decide how to
split the data. It can be used for both classification (sorting data into categories) and
regression (predicting continuous values) tasks.

ID3 (Iterative Dichotomiser 3)

The ID3 (Iterative Dichotomiser 3) algorithm is a decision tree learning algorithm used in
machine learning and data mining for classification tasks. It was developed by Ross Quinlan in
the 1980s and is the predecessor of C4.5.

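How the ID3 Algorithm Works

ID3 builds a decision tree by selecting attributes that maximize information gain (or minimize
entropy). The process follows these steps:

1. Calculate Entropy
Entropy measures the impurity or disorder in a dataset. It is calculated using the formula:
$H(S) = -\sum_{i} p_i \log_2 p_i$
where $p_i$ is the proportion of instances in $S$ that belong to class $i$.

2. Compute Information Gain
Information Gain measures the reduction in entropy achieved by splitting the dataset
based on an attribute. It is calculated as:
$IG(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \dfrac{|S_v|}{|S|} H(S_v)$
where $S_v$ is the subset of $S$ in which attribute $A$ takes the value $v$.

3. Select the Best Attribute
The attribute with the highest Information Gain is chosen as the decision node (the root
node on the first pass).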
4. Split the Data
The dataset is split based on the selected attribute, and the process is repeated recursively
for each subset until one of the stopping conditions is met:
o All instances in a subset belong to the same class.
o There are no remaining attributes to split on.
o The dataset is empty.
5. Assign a Leaf Node
If further splitting is not possible, the most common class label in the subset is assigned
to the leaf node.

Advantages of ID3
• Simple and easy to implement
• Provides a human-readable decision tree
• Works well with categorical data

Disadvantages of ID3
• Overfits on noisy or small datasets
• Cannot handle continuous numerical values directly (must be discretized)
• Prefers attributes with many values (can be biased toward high-cardinality attributes)

Example of ID3

Step 1: Calculate Entropy of the Entire Dataset

We first calculate the entropy of the dataset before any splits.

Weather    Play Outside?
Sunny      Yes
Rainy      No
Overcast   Yes
Rainy      No
Sunny      Yes

• Yes = 3 times
• No = 2 times

The entropy formula is:
$H(S) = -\sum_{i} p_i \log_2 p_i$

Let's compute it.
$H(S) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5} \approx 0.442 + 0.529 = 0.971$

The entropy of the dataset is 0.971 (rounded).

Step 2: Compute Information Gain for "Weather"

The Weather attribute has three unique values:

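• Sunny → (2 instances: ["Yes", "Yes"])
• Rainy → (2 instances: ["No", "No"])
• Overcast → (1 instance: ["Yes"])

Step 2.1: Compute Entropy for Each Subset

Every subset contains a single class, so each has zero entropy:
$H(\text{Sunny}) = 0$, $H(\text{Rainy}) = 0$, $H(\text{Overcast}) = 0$

Step 2.2: Compute Information Gain

$IG(S, \text{Weather}) = 0.971 - \left(\frac{2}{5}\cdot 0 + \frac{2}{5}\cdot 0 + \frac{1}{5}\cdot 0\right) = 0.971$

Step 3: Decision Tree

Since the entropy after splitting is 0, we stop here. The final tree is:

            Weather
         /     |      \
     Sunny  Overcast  Rainy
      Yes     Yes      No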
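
The entropy and information gain of this example can be reproduced with a short script. The helper names are illustrative; the rows are the five-row Weather table from the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute_index, label_index=-1):
    """Entropy of the full set minus the weighted entropy of each attribute-value subset."""
    labels = [row[label_index] for row in rows]
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute_index], []).append(row[label_index])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

rows = [
    ("Sunny", "Yes"),
    ("Rainy", "No"),
    ("Overcast", "Yes"),
    ("Rainy", "No"),
    ("Sunny", "Yes"),
]
dataset_entropy = entropy([label for _, label in rows])
weather_gain = information_gain(rows, attribute_index=0)
print(round(dataset_entropy, 3))  # 0.971
print(round(weather_gain, 3))     # 0.971
```

The gain equals the full dataset entropy because splitting on Weather leaves every subset pure.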

C4.5 ALGORITHM (IMPROVEMENT OF ID3)

The C4.5 algorithm is an improved version of the ID3 decision tree algorithm developed by
Ross Quinlan. It overcomes some limitations of ID3 and is widely used in classification
problems.

How C4.5 Works
1. Start with the entire dataset and calculate the entropy.
2. Select the best attribute to split the data using the Gain Ratio (instead of just Information
Gain).
3. Create branches based on attribute values.
4. Handle missing values and continuous data (ID3 cannot handle these well).
5. Recursively repeat the process for each subset until all data is classified.
6. Use pruning to remove unnecessary branches and prevent overfitting.

Key Improvements Over ID3

Feature | ID3 | C4.5
Splitting Criterion | Information Gain (IG) | Gain Ratio (GR)
Handles Continuous Data | ❌ No | ✅ Yes
Handles Missing Values | ❌ No | ✅ Yes
Pruning (to prevent overfitting) | ❌ No | ✅ Yes
Handles Multiple Classes | ✅ Yes | ✅ Yes

1. Gain Ratio (Better than Information Gain)

• ID3 uses Information Gain, which can favor attributes with many values.
• C4.5 solves this issue by introducing Gain Ratio, which normalizes Information Gain.
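• Formula for Gain Ratio:
$\text{Gain Ratio}(A) = \dfrac{\text{Information Gain}(A)}{\text{Split Information}(A)}$, where $\text{Split Information}(A) = -\sum_{v} \dfrac{|S_v|}{|S|} \log_2 \dfrac{|S_v|}{|S|}$

Dividing by the split information penalizes attributes with many unique values, which reduces
the bias toward them.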

2. Handling Continuous Data


• C4.5 splits numerical attributes into two groups:
"Less than threshold" and "Greater than threshold"
• It finds the best threshold dynamically instead of requiring pre-defined categories.

3. Handling Missing Values
• Instead of removing rows with missing values, C4.5 estimates probabilities based on available
data.
• If an attribute is missing, it distributes the instance across possible values proportionally.

4. Pruning (Reduces Overfitting)


• C4.5 performs "post-pruning" by removing less significant branches.
• This results in a simpler tree that generalizes better to new data.

Example: Deciding Whether to Play Outside

Imagine a dataset where we decide whether to play outside based on weather conditions.

Weather Temperature Wind Play Outside?


Sunny Hot Weak No
Sunny Hot Strong No
Overcast Hot Weak Yes
Rainy Mild Weak Yes
Rainy Cool Weak Yes
Rainy Cool Strong No
Overcast Cool Strong Yes
Sunny Mild Weak No
Sunny Cool Weak Yes
Rainy Mild Strong No
Sunny Mild Strong No
Overcast Mild Strong Yes
Overcast Hot Weak Yes
Rainy Mild Weak Yes
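
As a sketch of how C4.5 would pick the first split on this table, the script below computes the gain ratio (information gain divided by split information) for each attribute. The helper names are illustrative, and pruning, missing values, and continuous attributes are left out.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())

def gain_ratio(rows, attribute_index, label_index=-1):
    """Information gain of the attribute divided by its split information."""
    labels = [row[label_index] for row in rows]
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute_index], []).append(row[label_index])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    gain = entropy(labels) - remainder
    # Split information is the entropy of the attribute's own value distribution.
    split_info = entropy([row[attribute_index] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

attributes = ["Weather", "Temperature", "Wind"]
rows = [
    ("Sunny", "Hot", "Weak", "No"), ("Sunny", "Hot", "Strong", "No"),
    ("Overcast", "Hot", "Weak", "Yes"), ("Rainy", "Mild", "Weak", "Yes"),
    ("Rainy", "Cool", "Weak", "Yes"), ("Rainy", "Cool", "Strong", "No"),
    ("Overcast", "Cool", "Strong", "Yes"), ("Sunny", "Mild", "Weak", "No"),
    ("Sunny", "Cool", "Weak", "Yes"), ("Rainy", "Mild", "Strong", "No"),
    ("Sunny", "Mild", "Strong", "No"), ("Overcast", "Mild", "Strong", "Yes"),
    ("Overcast", "Hot", "Weak", "Yes"), ("Rainy", "Mild", "Weak", "Yes"),
]
ratios = {name: gain_ratio(rows, i) for i, name in enumerate(attributes)}
best_attribute = max(ratios, key=ratios.get)
for name, value in ratios.items():
    print(name, round(value, 3))
print("Best first split:", best_attribute)
```

On this table Weather scores highest, largely because every Overcast row has the same label.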

Handling Missing Values
C4.5 handles missing values. Instead of replacing the missing value with a single most common value, it
distributes the missing instance proportionally across the possible values.

Example:

Weather Temperature Play Tennis?


Sunny Hot No
Rainy ? (Missing) Yes
Rainy Mild No
Overcast Hot Yes
Rainy Hot Yes

Step 1: Find Other "Rainy" Rows:


We look at all rows where Weather = Rainy and check their Temperature values:
Weather Temperature
Rainy Mild
Rainy Hot
Rainy ? (Missing)

We see that, among the Rainy rows with a known Temperature:

• Hot appears 50% of the time (1 of 2 rows)
• Mild appears 50% of the time (1 of 2 rows)

Step 2: Distribute the Missing Value

Instead of assuming Hot or Mild, C4.5 splits the missing row:

• 50% of the row is treated as "Hot"
• 50% of the row is treated as "Mild"

Now, the dataset conceptually looks like:

Weather Temperature Play Tennis? Weight
Sunny Hot No 1
Rainy Hot (50%) Yes 0.5
Rainy Mild (50%) Yes 0.5
Rainy Mild No 1
Overcast Hot Yes 1
Rainy Hot Yes 1

Step 3: Compute Entropy (Using Weighted Contributions)

Since the missing row is split between "Hot" and "Mild", entropy and information gain are
calculated by weighting the contributions accordingly.

How to handle continuous data?

Handling Continuous (Numerical) Data

Unlike ID3, which requires categorical attributes, C4.5 can split numerical data dynamically
by finding the best threshold.

How It Works:

1. Sort the numerical values in ascending order.


2. Find the best split point by testing different thresholds.
3. Convert the numerical attribute into two categories:
o ≤ threshold
o > threshold

Example:
Temperature Play Outside?
15°C No
18°C No
22°C Yes
24°C Yes
30°C Yes

Step 1: Find possible split points between consecutive values:

• Between 15°C & 18°C → Threshold = 16.5°C
• Between 18°C & 22°C → Threshold = 20°C
• Between 22°C & 24°C → Threshold = 23°C
• Between 24°C & 30°C → Threshold = 27°C

Step 2: Compute the Gain Ratio for each threshold and pick the best one.

• The threshold 20°C separates the classes perfectly, so it gives the highest Gain Ratio
and C4.5 splits the data:
o ≤ 20°C → "No"
o > 20°C → "Yes"

Now, "Temperature" is treated like a categorical variable without predefining ranges!
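
The threshold search can be sketched as follows. For brevity this version scores each candidate by information gain rather than gain ratio (on this data both pick the same threshold); the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Try the midpoint between each pair of consecutive sorted values."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_t, best_gain = None, -1.0
    for (v0, _), (v1, _) in zip(pairs, pairs[1:]):
        if v0 == v1:
            continue
        t = (v0 + v1) / 2
        left = [label for v, label in pairs if v <= t]
        right = [label for v, label in pairs if v > t]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - remainder > best_gain:
            best_t, best_gain = t, base - remainder
    return best_t, best_gain

temperatures = [15, 18, 22, 24, 30]
play = ["No", "No", "Yes", "Yes", "Yes"]
threshold, gain = best_threshold(temperatures, play)
print(threshold, round(gain, 3))  # 20.0 0.971
```

Only midpoints between consecutive sorted values need to be tested, because any other threshold produces one of the same partitions.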

CLASSIFICATION AND REGRESSION TREES (CART)

The CART algorithm (Classification and Regression Trees) is a decision tree learning
technique used for classification and regression tasks. It was introduced by Breiman et al.
(1984) and is widely used in machine learning for predictive modelling.

How CART Works

CART constructs binary decision trees by recursively splitting the dataset into two subsets based
on feature values. The algorithm selects the best split at each step using Gini impurity (for
classification) or mean squared error (for regression).

Steps in the CART Algorithm

1. Start with the Entire Dataset (Root Node)

• The root node contains all the training samples.


• The goal is to find the best feature and value to split the dataset into two groups.

2. Choose the Best Split

• For classification problems, the split is chosen based on Gini Index (default in CART).
• For regression problems, the split is chosen based on Mean Squared Error (MSE).

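Splitting Criteria:

• Gini Index (for classification)
$Gini = 1 - \sum_{i} p_i^2$
$p_i$ is the probability of each class. A lower Gini Index means purer nodes.

• Mean Squared Error (MSE) (for regression)
$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2$
Measures the variance within the group, where $\bar{y}$ is the mean target value of the $n$
samples in the group.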

3. Recursively Split the Dataset

• The dataset is split into two subsets at each step.


• The process continues until a stopping condition is met.

4. Pruning (Optional):

• Reduce tree complexity by pruning unnecessary nodes to prevent overfitting.

5. Define Stopping Criteria

• The tree stops growing if:


o A node contains only one class (classification).
o The number of samples in a node is less than a threshold.
o The maximum depth is reached.
o The information gain is too small.

6. Assign Leaf Node Values

• For classification, assign the most common class in that node.


• For regression, assign the average target value.

1. CART for Classification (Using Gini Index)

The Gini Index measures the impurity of a node. The formula for Gini Index is:
$Gini = 1 - \sum_{i} p_i^2$

Example Dataset
ID   Feature: Age   Label: Play Tennis (Yes=1, No=0)
1    25             Yes (1)
2    30             Yes (1)
3    35             No (0)
4    40             No (0)
5    45             No (0)

Step 1: Calculate Gini Index for Root Node

There are 2 "Yes" and 3 "No":
$Gini_{root} = 1 - \left(\frac{2}{5}\right)^2 - \left(\frac{3}{5}\right)^2 = 1 - 0.16 - 0.36 = 0.48$

Step 2: Find the Best Split

Let's split the dataset at Age = 30:

• Left Node (Age ≤ 30): { (25, Yes), (30, Yes) } → 2 Yes
• Right Node (Age > 30): { (35, No), (40, No), (45, No) } → 3 No

Gini for Left Node
$Gini_{left} = 1 - 1^2 = 0$

Gini for Right Node
$Gini_{right} = 1 - 1^2 = 0$

Weighted Gini
$Gini_{split} = \frac{2}{5}\cdot 0 + \frac{3}{5}\cdot 0 = 0$

Since the split at Age = 30 results in Gini = 0, this is the best split.

Final Decision Tree

   Age <= 30?
    /     \
  Yes      No
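
The Gini calculation and the search for the best split can be sketched as below. Candidate thresholds are taken to be the observed Age values, tested as "Age ≤ t", which is an assumption made to match the example; the function names are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_gini_split(values, labels):
    """Return the threshold t (split 'value <= t') with the lowest weighted Gini."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(values))[:-1]:
        left = [label for v, label in zip(values, labels) if v <= t]
        right = [label for v, label in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(values)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

ages = [25, 30, 35, 40, 45]
play_tennis = ["Yes", "Yes", "No", "No", "No"]
root_gini = gini(play_tennis)
split_age, split_gini = best_gini_split(ages, play_tennis)
print(round(root_gini, 2))    # 0.48
print(split_age, split_gini)  # 30 0.0
```

CART repeats this search for every feature at every node and keeps the split with the lowest weighted impurity.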

2. CART for Regression (Using Mean Squared Error)

For regression, CART splits the data based on Mean Squared Error (MSE):
$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2$

Example Dataset
ID   Feature: Age   Target: Salary (in $1000s)
1    25             50
2    30             55
3    35             60
4    40             70
5    45             80

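Step 1: Calculate MSE for Root Node

Mean salary: $\bar{y} = (50 + 55 + 60 + 70 + 80)/5 = 63$
$MSE_{root} = \dfrac{(50-63)^2 + (55-63)^2 + (60-63)^2 + (70-63)^2 + (80-63)^2}{5} = \dfrac{580}{5} = 116$

Step 2: Find the Best Split

Let’s split at Age = 35:

• Left Node (Age ≤ 35): { (25, 50), (30, 55), (35, 60) }, mean = 55, MSE = (25 + 0 + 25)/3 ≈ 16.67
• Right Node (Age > 35): { (40, 70), (45, 80) }, mean = 75, MSE = (25 + 25)/2 = 25

Weighted MSE = (3 × 16.67 + 2 × 25)/5 = 20, well below the root MSE of 116. Each leaf
predicts the mean salary of its node.

Final Decision Tree

   Age ≤ 35?
    /     \
   55      75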
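
The regression split can be checked the same way. The sketch below tests every observed Age as a threshold "Age ≤ t" and reports the weighted MSE and the two leaf predictions; the function names are illustrative.

```python
def mse(values):
    """Mean squared deviation of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_mse_split(xs, ys):
    """Return (threshold, weighted MSE, left leaf mean, right leaf mean)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * mse(left) + len(right) * mse(right)) / len(ys)
        if best is None or score < best[1]:
            best = (t, score, sum(left) / len(left), sum(right) / len(right))
    return best

ages = [25, 30, 35, 40, 45]
salaries = [50, 55, 60, 70, 80]
root_mse = mse(salaries)
threshold, weighted_mse, left_leaf, right_leaf = best_mse_split(ages, salaries)
print(root_mse)                                           # 116.0
print(threshold, round(weighted_mse, 2), left_leaf, right_leaf)
```

Among the four possible thresholds, Age ≤ 35 gives the lowest weighted MSE, and the leaf values 55 and 75 are simply the mean salaries on each side.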

Advantages and Disadvantages of the CART (Classification and Regression Trees) algorithm:

Aspect | Advantages | Disadvantages
Interpretability | Easy to understand and interpret (visualizable as a tree). | Large trees become complex and hard to interpret.
Handling Data | Works well with numerical and categorical data. | Sensitive to small changes in data (can lead to different trees).
Feature Selection | Automatically selects the most important features. | Can be biased towards features with more levels (e.g., continuous data).
Non-Linearity | Captures non-linear relationships well. | Struggles with smooth functions (can create a step-like boundary).
Overfitting | Can be controlled using pruning or max depth constraints. | Without pruning, it tends to overfit the training data.
Preprocessing | Requires little to no data preprocessing (no scaling needed). | Splits can be biased if data is imbalanced.
Computational Cost | Faster than many complex models (e.g., neural networks). | Can become computationally expensive for very deep trees.
Missing Values | Can handle missing values through surrogate splits. | Implementations without surrogate splits need missing data to be imputed first.

ENSEMBLE LEARNING

Ensemble learning is a technique in machine learning where multiple models (often called weak
learners or base models) are combined to create a stronger, more accurate model. The main idea
is that multiple models working together can reduce errors and improve predictions compared to
a single model.

Types of Ensemble Learning

1. Boosting

• Models are trained sequentially, where each new model corrects the mistakes of the previous
ones.
• Helps reduce bias and improve weak models.
Example: AdaBoost, Gradient Boosting (XGBoost, LightGBM, CatBoost).

Steps in Boosting:

1. Train a model on the dataset.


2. Identify the incorrectly predicted samples and assign higher weights.
3. Train a new model that focuses on correcting these mistakes.
4. Repeat the process for several iterations.

Example: AdaBoost (Adaptive Boosting)

• Assigns more weight to misclassified samples and improves the next weak learner.



AdaBoost (Adaptive Boosting) Algorithm

AdaBoost (Adaptive Boosting) is an ensemble learning method that combines multiple weak
learners (usually decision stumps) to create a strong classifier. It adjusts the weights of
misclassified samples to focus more on difficult cases in each iteration.

How AdaBoost Works

1. Initialize Weights:
o Assign equal weights to all training samples.

2. Train a Weak Learner:


o A weak model (e.g., Decision Stump) is trained on the dataset.

3. Calculate Error:
o The error is measured as the total weight of misclassified samples.

4. Update Model Weight:


o The model is given a weight based on its accuracy.
o More accurate models get higher weights.



5. Update Sample Weights:
o Misclassified samples get higher weights so the next model focuses more on
them.

6. Repeat Steps 2-5:


o Train multiple weak models iteratively.
o Combine them using a weighted majority vote (for classification) or weighted
sum (for regression).



AdaBoost Algorithm:

• Initialize Weights: Assign equal weights to all training samples.


• Train a Weak Learner: Train a simple model (like a Decision Stump, a one-level
decision tree).
• Calculate Error: Add up the weights of the misclassified samples. This weighted error determines how much say the learner gets in the final vote.
• Update Weights: Increase the importance (weight) of misclassified samples and reduce the weight of correctly classified samples, so the next model focuses on the hard cases.
• Repeat: Train multiple weak learners in this way, each correcting the mistakes of the previous one.
• Final Prediction: Combine all weak learners using a weighted majority vote (for
classification) or weighted sum (for regression).

Example: AdaBoost (Adaptive Boosting) is an ensemble learning algorithm that combines multiple weak classifiers to form a strong classifier. It iteratively trains weak models and adjusts sample weights to focus on difficult cases.

Step 1: Dataset

Consider a small binary classification dataset:

Sample Feature x Class y


1 1.0 +1
2 2.0 -1
3 3.0 +1
4 4.0 -1

Each sample belongs to either Class +1 or Class -1.

Step 2: Initialize Sample Weights

Initially, all samples are given equal importance. Since we have N = 4 samples, the initial weight of each sample is wi = 1/N = 1/4 = 0.25:



Sample Weight wi
1 0.25
2 0.25
3 0.25
4 0.25

Step 3: Train the First Weak Classifier

We use a Decision Stump (one-level decision tree).


A decision stump chooses a single feature threshold to split the data.

Let's say our first weak classifier predicts:

• If x<2.5, predict +1
• If x≥2.5, predict -1

Sample Feature x True y Prediction h1(x) Correct?


1 1.0 +1 +1 ✅
2 2.0 -1 +1 ❌
3 3.0 +1 -1 ❌
4 4.0 -1 -1 ✅

Misclassified samples: x=2.0 and x=3.0

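Step 4: Compute Weighted Error ϵ1

The error of the weak classifier is the sum of weights of misclassified samples:

ϵ1 = w2 + w3 = 0.25 + 0.25 = 0.5

Step 5: Compute Classifier Weight α1

α1 = ½ ln((1 − ϵ1) / ϵ1) = ½ ln(0.5 / 0.5) = 0

An error of 0.5 is no better than random guessing, so this stump receives zero weight in the final vote. A stump with ϵ < 0.5 gets α > 0, and the lower its error, the larger its α.

Step 6: Update Sample Weights

We update weights using:

wi ← wi × exp(−α1 · yi · h1(xi)) / Z

where Z is the sum of the updated weights, so that they again add up to 1. Misclassified samples (yi · h1(xi) = −1) are multiplied by exp(α1) and gain weight; correctly classified samples are multiplied by exp(−α1) and lose weight. With α1 = 0 here, all weights stay at 0.25.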

Step 7: Repeat for More Weak Learners

After several iterations, we get multiple weak classifiers h1,h2,h3,... with different weights αt.

Each classifier focuses more on previously misclassified samples.

Step 8: Final Prediction

The final prediction is made by combining all weak classifiers with a weighted vote:

H(x) = sign(Σt αt · ht(x))
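One boosting round on the four-sample dataset can be sketched as follows. This assumes the standard AdaBoost formulas (α = ½ ln((1 − ϵ)/ϵ) and exponential re-weighting followed by normalization); stump_predict and adaboost_round are illustrative names, and the second threshold (1.5) is an extra stump added here for comparison, not one from the notes.

```python
import math


def stump_predict(x, threshold):
    """Decision stump: +1 below the threshold, -1 otherwise."""
    return 1 if x < threshold else -1


def adaboost_round(xs, ys, weights, threshold):
    """Return (weighted error, classifier weight, updated sample weights)."""
    preds = [stump_predict(x, threshold) for x in xs]
    eps = sum(w for w, y, p in zip(weights, ys, preds) if y != p)
    eps = min(max(eps, 1e-10), 1 - 1e-10)        # avoid log(0) and division by zero
    alpha = 0.5 * math.log((1 - eps) / eps)
    raw = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
    total = sum(raw)                             # normalizer Z
    return eps, alpha, [r / total for r in raw]


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1, -1, 1, -1]
w0 = [0.25] * 4

# Stump from the worked example (threshold 2.5): error 0.5, so alpha is 0.
eps_a, alpha_a, w_a = adaboost_round(xs, ys, w0, threshold=2.5)

# A stump at 1.5 misclassifies only x = 3.0: error 0.25, alpha = 0.5 * ln(3).
eps_b, alpha_b, w_b = adaboost_round(xs, ys, w0, threshold=1.5)
print(eps_a, alpha_a, w_a)
print(eps_b, alpha_b, w_b)
```

With the threshold at 1.5, the single misclassified sample ends up carrying half of the total weight, which is what forces the next weak learner to concentrate on it.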

2. Bagging (Bootstrap Aggregating)

• Multiple models (usually the same algorithm) are trained on random subsets of data.
• Predictions are averaged (for regression) or voted (for classification).
• Helps reduce variance and prevents overfitting.
Example: Random Forest (combines multiple decision trees).

Steps in Bagging:

1. Create multiple training datasets using random sampling with replacement.


2. Train a separate model on each dataset.



3. Combine the predictions using majority voting (for classification) or averaging (for
regression).

Example: Random Forest (Bagging on Decision Trees)

• Instead of relying on a single Decision Tree, Random Forest builds multiple trees and combines their predictions.
• It can be used for both classification and regression tasks.

How the Random Forest algorithm works:

1. Create multiple datasets → Randomly pick data with replacement (some data may be
repeated).
2. Train multiple decision trees → Each tree learns from a different dataset.
3. Make predictions → Each tree makes its own prediction.
4. Combine the results →
o For classification → Take the majority vote (most common prediction).
o For regression → Take the average of all predictions.
5. Adding more trees generally makes predictions more stable and reduces overfitting, although the gains level off beyond a certain number of trees.
6. Every tree in the forest makes its own predictions without relying on others.
7. Each tree is built using random samples and a random subset of features, which keeps the trees from all making the same mistakes.
8. With sufficient data, the trees differ from one another and learn distinct patterns.
9. Combining the predictions from different trees leads to a more accurate final result.
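The bagging procedure behind these steps can be sketched in a few lines. To keep the example short, the base learner here is a hypothetical one-feature threshold rule (train_stump) rather than a full decision tree, and there is no random feature selection; the dataset and all function names are made up for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (some rows may repeat)."""
    return [rng.choice(rows) for _ in rows]


def train_stump(rows):
    """Toy base learner: threshold halfway between the two class means."""
    xs0 = [x for x, y in rows if y == 0]
    xs1 = [x for x, y in rows if y == 1]
    if not xs0 or not xs1:                 # bootstrap sample held one class only
        only = 1 if xs1 else 0
        return lambda x: only
    m0, m1 = sum(xs0) / len(xs0), sum(xs1) / len(xs1)
    threshold = (m0 + m1) / 2
    low, high = (0, 1) if m0 < m1 else (1, 0)
    return lambda x: low if x < threshold else high


def bagging_fit(rows, n_models, seed=0):
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(rows, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Majority vote over the individual model predictions."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


rows = [(1.0, 0), (1.5, 0), (2.0, 0), (6.0, 1), (6.5, 1), (7.0, 1)]
models = bagging_fit(rows, n_models=25)
print(bagging_predict(models, 1.2), bagging_predict(models, 6.8))
```

For regression, the majority vote in bagging_predict would be replaced by the average of the model outputs.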



Advantages of Random Forest

• Random Forest usually gives accurate predictions, even on large datasets.
• Random Forest can cope with missing data reasonably well without a large loss of accuracy.
• It doesn’t require normalization or standardization of the dataset.
• Combining multiple decision trees reduces the risk of overfitting.

Limitations of Random Forest


• It can be computationally expensive especially with a large number of trees.
• It’s harder to interpret the model compared to simpler models like decision trees.



DIFFERENT WAYS TO COMBINE CLASSIFIERS

Combining multiple classifiers can improve machine learning model performance by leveraging
the strengths of different algorithms. There are various ways to combine classifiers:

• Voting: A method to combine predictions from multiple models in ensemble learning. There are two types of voting:
1. Majority Voting (Hard voting): The most common approach, where each classifier casts a vote for a class, and the class with the most votes is chosen as the final prediction.
2. Averaging (Soft voting): Each model outputs class probabilities, and the class with the highest average probability is chosen.

• Stacking: Train several base models, then train another model (the meta-model) on their predictions to combine them for a better result.
• Bagging (Bootstrap Aggregating)
Trains multiple instances of the same classifier on different subsets of data. Reduces
variance and prevents overfitting. Example: Random Forest (uses bagging with decision
trees).
• Boosting
Sequentially trains classifiers, where each new model focuses on the mistakes of the
previous ones. Reduces bias and increases accuracy. Examples:

o AdaBoost: Assigns higher weights to misclassified instances.


o Gradient Boosting: Uses gradient descent to minimize loss.
o XGBoost: Optimized version of gradient boosting.
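The difference between hard and soft voting is easiest to see on a case where they disagree. The sketch below uses made-up probability outputs from three hypothetical classifiers; hard_vote and soft_vote are illustrative helper names.

```python
from collections import Counter


def hard_vote(predictions):
    """Majority voting: the most frequent predicted label wins."""
    return Counter(predictions).most_common(1)[0][0]


def soft_vote(probabilities):
    """Soft voting: average the class probabilities, pick the highest."""
    classes = probabilities[0].keys()
    avg = {c: sum(p[c] for p in probabilities) / len(probabilities) for c in classes}
    return max(avg, key=avg.get)


# Probability outputs of three classifiers for one email (illustrative values).
probs = [
    {"spam": 0.45, "ham": 0.55},
    {"spam": 0.40, "ham": 0.60},
    {"spam": 0.95, "ham": 0.05},
]

labels = [max(p, key=p.get) for p in probs]   # each model's own label
hard = hard_vote(labels)                      # "ham": two of three votes
soft = soft_vote(probs)                       # "spam": average 0.60 vs 0.40
print(hard, soft)
```

Two models lean slightly towards "ham" while one is very confident about "spam", so the majority vote and the averaged probabilities give different answers.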

MIXTURE OF EXPERTS (MOE) ALGORITHM IN MACHINE LEARNING

The Mixture of Experts (MoE) is an ensemble learning technique that divides a complex
problem into subproblems and assigns specialized models (called experts) to solve each
subproblem. A gating network learns to combine the outputs of these experts to make a final
prediction.

MoE is inspired by divide-and-conquer strategies in problem-solving. Instead of training a single model to handle all cases, MoE allows different models to specialize in different regions of the input space.

It is widely used in deep learning and large-scale AI models, such as Google’s Switch Transformers, which use MoE to efficiently allocate computational resources.



Input: This is the problem or data you want to handle.

Experts: These are smaller models, each trained to be really good at a specific part of the overall
problem. Think of them like the different specialists on your team.

Gating network: This is like a manager who decides which expert is best suited for each part of
the problem. It looks at the input and figures out who should work on what.

Output: This is the final answer or solution that the model produces after the experts have done
their work.
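How the four parts fit together can be sketched with two hand-written experts and a fixed gate. Everything here is hypothetical: in a real MoE both the experts and the gating network are learned from data, and the functions expert_low, expert_high and gate are stand-ins chosen only to make the arithmetic easy to follow.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expert_low(x):
    """Hypothetical expert that is good for small inputs."""
    return 2.0 * x


def expert_high(x):
    """Hypothetical expert that is good for large inputs."""
    return 10.0 + 0.5 * x


def gate(x):
    """Hypothetical gate: favors expert_low below x = 5, expert_high above."""
    return softmax([-(x - 5.0), x - 5.0])


def moe_predict(x):
    """Output = gate-weighted sum of the expert outputs."""
    weights = gate(x)
    outputs = [expert_low(x), expert_high(x)]
    return sum(w * o for w, o in zip(weights, outputs))


print(gate(0.0), moe_predict(0.0))   # almost all weight on expert_low
print(gate(5.0), moe_predict(5.0))   # equal weights: (10.0 + 12.5) / 2 = 11.25
```

Large sparse MoE models typically send each input to only the top-scoring expert or experts instead of blending all of them, which is where the computational savings come from.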



Advantages of MoE

Scalability – MoE can handle large-scale problems by distributing tasks across specialized
models.
Improved Accuracy – Experts specialize in different areas, leading to better generalization.
Parallel Computation – Experts can run independently, making MoE efficient for distributed
computing.
Reduced Overfitting – Because each expert only has to fit its own region of the input space, specialization can reduce overfitting.

Disadvantages of MoE

Complexity – Requires careful tuning of experts and the gating function.


Training Instability – If the gating network overfits, it may favor only a single expert.
Computational Cost – Large MoE models require more memory and computation.

BASIC STATISTICS

Mean:

• The "mean" is the average value of a dataset.
• It is calculated by adding up all the values in the dataset and dividing by the number of observations.
• The mean is a useful measure of central tendency, but it is sensitive to outliers, meaning that extreme values can significantly affect the value of the mean.

Median:

• The "median" is the middle value in a dataset.


• It is calculated by arranging the values in the dataset in order and finding the value that
lies in the middle.
• If there are an even number of values in the dataset, the median is the average of the two
middle values.
• The median is a useful measure of central tendency because it is not affected by outliers,
meaning that extreme values do not significantly affect the value of the median.

Mode:

• The "mode" is the most common value in a dataset.


• It is calculated by finding the value that occurs most frequently in the dataset.
• If several values share the highest frequency, the dataset is said to be bimodal (two modes), trimodal (three modes), or multimodal.
• The mode is a useful measure of central tendency because it can identify the most
common value in a dataset.
• However, it is not a good measure of central tendency for datasets with a wide range of
values or datasets with no repeating values.



Variance:

Variance, in statistics, is a measure of how spread out or dispersed data points are from their
average (mean), calculated by averaging the squared differences from the mean.

Covariance:

Covariance is a scale-dependent measure of the relationship between two variables, i.e. how the two variables change together: it is positive when they tend to increase together and negative when one tends to decrease as the other increases.

Standard Deviation: The square root of the variance is known as the standard deviation.

Interquartile Range: The range between the first and third quartiles, measuring data spread
around the median.
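These measures can be computed with Python's built-in statistics module. The sketch below uses made-up numbers and the population versions of variance and standard deviation (dividing by n), which matches the "average of squared differences" definition above; the covariance helper is written by hand for the same reason.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)            # 5
median = statistics.median(data)        # 4.5 (average of the two middle values)
mode = statistics.mode(data)            # 4
variance = statistics.pvariance(data)   # 4 (population variance)
std_dev = statistics.pstdev(data)       # 2.0 (square root of the variance)

# Quartiles: the exact values depend on the interpolation method used.
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1


def covariance(xs, ys):
    """Population covariance: mean product of deviations from the means."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


cov = covariance([1, 2, 3, 4], [2, 4, 6, 8])   # 2.5: the variables rise together
print(mean, median, mode, variance, std_dev, iqr, cov)
```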

Skewness: Indicates data asymmetry.



Positive Skewness (Right Skew): In a positively skewed distribution, the tail on the right side (the larger values) is longer than the tail on the left side (the smaller values). In a positively skewed dataset, typically
Mean > Median > Mode
Negative Skewness (Left Skew): In a negatively skewed distribution, the tail on the left side (the smaller values) is longer than the tail on the right side (the larger values). In a negatively skewed dataset, typically
Mean < Median < Mode
Zero Skewness (Symmetrical Distribution): Zero skewness indicates a perfectly symmetrical distribution, where the mean, median, and mode are equal.

Kurtosis: It is also a characteristic of the frequency distribution. It gives an idea about the shape of a frequency distribution. Basically, the measure of kurtosis is the extent to which a frequency distribution is peaked in comparison with a normal curve.

Types of Kurtosis: Kurtosis is classified into three types:

• Leptokurtic: A leptokurtic curve has a higher peak than the normal curve. In this curve, items are heavily concentrated near the central value.
• Mesokurtic: A mesokurtic curve has the same peak as the normal curve. In this curve, items are distributed around the central value as in a normal distribution.
• Platykurtic: A platykurtic curve has a lower peak than the normal curve. In this curve, there is less concentration of items around the central value.



Mahalanobis Distance: The Mahalanobis distance is a statistical measurement that determines
how far a point is from a distribution. It's used in many fields, including computer science,
chemometrics, and cluster analysis.

It is a powerful technique that considers the correlations between variables in a dataset, making it
a valuable tool in various applications such as outlier detection, clustering, and classification.

D² = (x-μ)ᵀΣ⁻¹(x-μ)

Where D² is the squared Mahalanobis Distance, x is the point in question, μ is the mean vector of the distribution, Σ is the covariance matrix of the distribution (Σ⁻¹ is its inverse), and ᵀ denotes the transpose.
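For two features the formula can be evaluated by hand, since a 2×2 matrix has a simple closed-form inverse. The sketch below is restricted to that two-dimensional case; mahalanobis_sq_2d is an illustrative name and the covariance matrices are made-up examples.

```python
def mahalanobis_sq_2d(x, mu, cov):
    """Squared Mahalanobis distance D^2 = (x - mu)^T * inverse(cov) * (x - mu) in 2-D."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))


identity = ((1.0, 0.0), (0.0, 1.0))
correlated = ((1.0, 0.5), (0.5, 1.0))

# With an identity covariance, D^2 is the squared Euclidean distance.
d_plain = mahalanobis_sq_2d((3.0, 4.0), (0.0, 0.0), identity)        # 25.0

# With positively correlated features, a point along the correlation
# direction is "closer" than an equally distant point against it.
d_along = mahalanobis_sq_2d((1.0, 1.0), (0.0, 0.0), correlated)      # about 1.33
d_against = mahalanobis_sq_2d((1.0, -1.0), (0.0, 0.0), correlated)   # about 4.0
print(d_plain, d_along, d_against)
```

The last two points are at the same Euclidean distance from the mean, yet the second is three times farther in Mahalanobis terms, which is why the measure is useful for outlier detection.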

The Gaussian / Normal Distribution: Normal distribution, also known as the Gaussian
distribution, is a continuous probability distribution that is symmetric about the mean, depicting
that data near the mean are more frequent in occurrence than data far from the mean.

GAUSSIAN MIXTURE MODELS

A Gaussian mixture model (GMM) is a soft clustering technique used in unsupervised learning to determine the probability that a given data point belongs to each cluster. It’s composed of several Gaussians, each identified by k ∈ {1,…, K}, where K is the number of clusters in a data set.

Each Gaussian k in the mixture is described by the following parameters:

• A mean μ that defines its center.
• A covariance Σ that defines its width, i.e. the shape and spread of the component.
• A mixing probability π (weight) that defines the probability of selecting the component. The mixing probabilities of all K components sum to 1.

Model Training

• Training a GMM involves setting the parameters using available data.


• The Expectation-Maximization (EM) technique is often employed, alternating between
the Expectation (E) and Maximization (M) steps until convergence.

Expectation-Maximization:

• During the E step, the model calculates the probability of each data point belonging to
each Gaussian component.
• The M step then adjusts the model’s parameters based on these probabilities.

Key Ideas Behind GMM

1. Mixture of Gaussians
o Instead of assuming all points belong to just one cluster (like in k-means), GMM
assumes data is a mix of several Gaussian distributions.
o Each distribution represents one hidden group (e.g., different flavors of candy).
2. Soft Clustering (Probabilities Instead of Hard Labels)
o Instead of saying, “This point is in Cluster A,” GMM says, “This point is 70%
likely to be in Cluster A and 30% likely to be in Cluster B.”
3. Expectation-Maximization (EM) Algorithm
o Since we don’t know which Gaussian a point belongs to, we start with a guess.
o We then refine this guess using the E-step (Expectation) and M-step
(Maximization) until the clusters make sense.
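The E-step and M-step can be written out for the simplest case: one-dimensional data and two Gaussian components. This is a minimal sketch under those assumptions, with a deterministic starting guess (the smallest and largest data values as initial means) and a small variance floor for numerical safety; the data and function names are illustrative.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def e_step(data, weights, means, variances):
    """Responsibility of each component for each point (rows sum to 1)."""
    resp = []
    for x in data:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    return resp


def m_step(data, resp, min_var=1e-6):
    """Re-estimate weights, means and variances from the responsibilities."""
    n, k = len(data), len(resp[0])
    weights, means, variances = [], [], []
    for j in range(k):
        nk = sum(r[j] for r in resp)
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nk
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nk
        weights.append(nk / n)
        means.append(mean)
        variances.append(max(var, min_var))
    return weights, means, variances


def fit_two_gaussians(data, n_iter=50):
    """EM for a two-component 1-D mixture, starting from a simple guess."""
    weights, means, variances = [0.5, 0.5], [min(data), max(data)], [1.0, 1.0]
    for _ in range(n_iter):
        resp = e_step(data, weights, means, variances)
        weights, means, variances = m_step(data, resp)
    return weights, means, variances


data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = fit_two_gaussians(data)
print(weights, means, variances)

# Soft assignment of a new point: probability of belonging to each component.
print(e_step([1.3], weights, means, variances)[0])
```

A fixed number of iterations is used here for brevity; in practice EM is usually stopped when the log-likelihood stops improving.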



Example: Imagine a Class of Students

Let’s say we measure the heights of students in a school. If we plot the heights, we might see
three peaks in the data.

• One peak for elementary students (shorter kids).


• Another peak for middle school students (medium height).
• A final peak for high school students (taller kids).

GMM assumes that each peak represents a Gaussian distribution, and the overall height
distribution is just a mix of these three groups.

If we give a new student’s height, GMM can tell us the probability that the student belongs to
each group.

NEAREST NEIGHBOR METHODS

K-Nearest Neighbors Algorithm

K-Nearest Neighbors (KNN) is a simple way to classify things by looking at what’s nearby. It is a supervised machine learning method used for both classification and regression problems.

KNN is also called a lazy learner algorithm because it does not build a model from the training set up front; instead it stores the dataset and performs the computation at prediction time.

As an example, consider a set of data points described by two features and belonging to two categories, where KNN predicts the category of a new data point based on its closest neighbours:
• The red diamonds represent Category 1 and the blue squares represent Category 2.
• The new data point checks its closest neighbours (circled points).
• Since the majority of its closest neighbours are blue squares (Category 2), KNN predicts the new data point belongs to Category 2.

How algorithm works:

Step 1: Selecting the optimal value of K

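• K represents the number of nearest neighbors that need to be considered while making a prediction. K is usually chosen by trying several values and keeping the one that performs best on validation data; an odd K avoids ties in two-class problems.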

Step 2: Calculating distance

• To measure the similarity between the target and the training data points, a distance metric is used, most commonly the Euclidean distance d(p, q) = √(Σᵢ (pᵢ − qᵢ)²). The distance is calculated between the target point and each data point in the dataset.

Step 3: Finding Nearest Neighbors

• The k data points with the smallest distances to the target point are the nearest neighbors.

Step 4: Voting for Classification or Taking Average for Regression

• In the classification problem, the class labels of K-nearest neighbors are determined by
performing majority voting. The class with the most occurrences among the neighbors becomes
the predicted class for the target data point.

• In the regression problem, the predicted value is calculated by taking the average of the target values of the K nearest neighbors. The calculated average becomes the predicted output for the target data point.

Example:

Given Query:
X = (Maths=6, CS=8) → Find the class?

Step 1: Select K neighbors

Given K=3.

Step 2: Calculate the Euclidean distance from the query to every data point

Maths   CS   Result   Distance to (6, 8)
4       3    Fail     √((6−4)² + (8−3)²) = √29 ≈ 5.39
6       7    Pass     √((6−6)² + (8−7)²) = 1
6       8    Pass     √((6−6)² + (8−8)²) = 0
5       5    Fail     √((6−5)² + (8−5)²) = √10 ≈ 3.16
8       8    Pass     √((6−8)² + (8−8)²) = 2

Step 3:

Since K=3, we consider the 3 smallest distances from the new data point to the actual data points: 0, 1 and 2.

• All three of these nearest neighbors are Pass, so the majority class is Pass.

Thus, we assign the new data point to the Pass category.

Therefore: Maths=6, CS=8 ⇒ Result is Pass
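The same example can be run as code. This is a minimal sketch using Euclidean distance and a simple majority vote, with ties broken by whichever label is counted first; knn_classify is an illustrative name.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Return the majority label among the k training points nearest to query."""
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    top_k = [label for _, label in by_distance[:k]]
    return Counter(top_k).most_common(1)[0][0]


# (Maths, CS) marks with the Result label, as in the table above.
train = [
    ((4, 3), "Fail"),
    ((6, 7), "Pass"),
    ((6, 8), "Pass"),
    ((5, 5), "Fail"),
    ((8, 8), "Pass"),
]

query = (6, 8)
distances = [round(math.dist(point, query), 2) for point, _ in train]
result = knn_classify(train, query, k=3)
print(distances, result)
```

For regression, the vote in knn_classify would be replaced by the average of the neighbors' target values.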

Advantages and Disadvantages of the KNN Algorithm

Advantages:
• Easy to implement: The KNN algorithm is easy to implement because its
complexity is relatively low as compared to other machine learning algorithms.



• No training required: KNN stores all data in memory and doesn’t require any training, so when new data points are added they are used automatically in future predictions.
• Few Hyperparameters: The only hyperparameters a KNN algorithm needs are the value of k and the choice of distance metric.
• Flexible: It works for classification problems (e.g., is this email spam or not?) and also for regression tasks (e.g., predicting house prices based on nearby similar houses).

Disadvantages:
• Doesn’t scale well: Because KNN is a “lazy” algorithm, every prediction has to compute the distance to all stored training points, which makes it slow on large datasets.
• Curse of Dimensionality: When the number of features increases, KNN struggles to classify data accurately, a problem known as the curse of dimensionality.
• Prone to Overfitting: Because of the curse of dimensionality, and especially for small values of k, KNN is also prone to overfitting.

K-dimensional tree

A k-d tree is a special kind of binary search tree that helps organize points in multiple
dimensions (like 2D or 3D space).

Imagine you have a list of locations on a map (like stores or houses), and you want to quickly
find the one closest to you. Instead of checking every single location one by one, a k-d tree
organizes them in a way that makes searching much faster.

How does it work?

1. It starts by dividing the space based on one coordinate (like splitting a map along a vertical line).
2. Then, it keeps dividing the smaller sections using other coordinates (like splitting horizontally
next).
3. This process continues, making it easier to search for nearby points.
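The splitting procedure and the resulting search can be sketched as follows. This is a simplified illustration, assuming the splitting axis simply cycles through the coordinates and the median point becomes the node at each level; the point set and function names are made up.

```python
from collections import namedtuple

Node = namedtuple("Node", ["point", "axis", "left", "right"])


def build_kdtree(points, depth=0):
    """Split on one coordinate per level, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                      # median point becomes this node
    return Node(pts[mid], axis,
                build_kdtree(pts[:mid], depth + 1),
                build_kdtree(pts[mid + 1:], depth + 1))


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(node, target, best=None):
    """Nearest-neighbor search that skips branches which cannot hold a closer point."""
    if node is None:
        return best
    if best is None or sq_dist(node.point, target) < sq_dist(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Only cross the splitting line if a closer point could lie beyond it.
    if diff ** 2 < sq_dist(best, target):
        best = nearest(far, target, best)
    return best


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nearest(tree, (9, 2)))   # (8, 1)
```

The speed-up comes from the pruning test in nearest: whole branches are skipped when the splitting line is already farther away than the best point found so far.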

The purpose of a k-d tree is to efficiently organize and search points in multiple dimensions
(2D, 3D, or higher).



1. Fast Nearest Neighbor Search
o Example: Finding the closest gas station or restaurant to your location.
2. Range Search
o Example: Finding all delivery addresses within a certain distance from a warehouse.
3. Efficient Spatial Partitioning
o Example: Used in 3D graphics and gaming to speed up rendering by organizing objects in
space.
4. Machine Learning (KNN Algorithm)
o Helps speed up the k-Nearest Neighbors (KNN) classifier by reducing search time.
5. Robotics & Pathfinding
o Used in motion planning for robots to navigate around obstacles efficiently.

Why use a k-d tree?

• Faster searches than checking every point one by one (especially in large datasets).
• Organizes multi-dimensional data in a structured way.

UNSUPERVISED LEARNING

Unsupervised learning is a type of machine learning that works with data that has no labels or categories. The main goal is to find patterns and relationships in the data without any guidance. In this approach, the machine analyzes unorganized information and groups it based on similarities, patterns, or differences. Unlike supervised learning, there is no teacher and no labeled training data. The machine must uncover hidden structures in the data on its own.

Example

Imagine a machine learning model is given a large collection of unlabeled images containing both dogs and cats. The model has no pre-existing labels or categories for these animals, so it has no idea what a ‘dog’ or a ‘cat’ is and cannot name them. But it can categorize the images according to their similarities, patterns, and differences, i.e., it can split the collection into two groups. The first may contain all pictures having dogs in them and the second all pictures having cats in them. Nothing was learned beforehand, which means no labeled training data or examples were used.

Unsupervised learning allows the model to work on its own to discover patterns and information that was previously undetected. It mainly deals with unlabeled data.

Types of Unsupervised Learning Algorithm:

Unsupervised learning algorithms can be further categorized into two types of problems:

• Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them according to the presence and absence of those commonalities.
• Association: Association rule learning is an unsupervised learning method used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective: for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rule learning is Market Basket Analysis.

Advantages of Unsupervised Learning

• Unsupervised learning can be used for more complex tasks than supervised learning because it does not need labeled input data.
• Unsupervised learning is often preferable because unlabeled data is easier to obtain than labeled data.

Disadvantages of Unsupervised Learning

• Unsupervised learning is intrinsically more difficult than supervised learning because there is no corresponding output to learn from.
• The result of an unsupervised learning algorithm might be less accurate because the input data is not labeled and the algorithm does not know the exact output in advance.

K MEANS ALGORITHM
• K-means clustering is a popular unsupervised machine learning algorithm used for partitioning a dataset into a pre-defined number of clusters. The goal is to group similar data points together and discover underlying patterns or structures within the data.
• The first property of clusters states that the points within a cluster should be similar to each other. So, our aim here is to minimize the distance between the points within a cluster.
• The k-means clustering technique does this by minimizing the distance between the points in a cluster and their centroid.
• K-means is a centroid-based (distance-based) algorithm, where we calculate the distances to assign a point to a cluster. In K-Means, each cluster is associated with a centroid.
• The main objective of the K-Means algorithm is to find the set of centroids that minimizes the sum of squared distances between each data point and its closest centroid.

How K-Means Clustering Works?

• Initialization: Start by randomly selecting K points from the dataset. These points will
act as the initial cluster centroids.
• Assignment: For each data point in the dataset, calculate the distance between that point
and each of the K centroids. Assign the data point to the cluster whose centroid is closest
to it. This step effectively forms K clusters.
• Update centroids: Once all data points have been assigned to clusters, recalculate the
centroids of the clusters by taking the mean of all data points assigned to each cluster.
• Repeat: Repeat the assignment and centroid-update steps until convergence. Convergence occurs when the centroids no longer change significantly or when a specified number of iterations is reached.
• Final Result: Once convergence is achieved, the algorithm outputs the final cluster
centroids and the assignment of each data point to a cluster.

Mathematical Representation

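The objective of K-Means is to minimize the sum of squared differences between each point and its assigned cluster centroid:

J = Σₖ Σ (x ∈ Cₖ) ‖x − μₖ‖²,  k = 1, …, K

where Cₖ is the set of points assigned to cluster k and μₖ is the centroid of that cluster.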

Choosing the Right K (Elbow Method)

• Plot the Within-Cluster Sum of Squares (WCSS) for different values of K.


• Look for an "elbow point," where the WCSS decrease slows down.



Objective of k means Clustering

The main objective of k-means clustering is to partition your data into a specific number (k) of groups, where data points within each group are similar to each other and dissimilar to points in other groups. It achieves this by minimizing the distance between data points and their assigned cluster’s center, called the centroid.

• Grouping similar data points: K-means aims to identify patterns in your data by
grouping data points that share similar characteristics together. This allows you to
discover underlying structures within the data.
• Minimizing within-cluster distance: The algorithm strives to make sure data points
within a cluster are as close as possible to each other, as measured by a distance metric
(usually Euclidean distance). This ensures tight-knit clusters with high cohesiveness.
• Maximizing between-cluster distance: Conversely, k-means also tries to maximize the
separation between clusters. Ideally, data points from different clusters should be far
apart, making the clusters distinct from each other.

Advantages of K-means

1. Simple and easy to implement: The k-means algorithm is easy to understand and
implement, making it a popular choice for clustering tasks.

2. Fast and efficient: K-means is computationally efficient and can handle large datasets
with high dimensionality.

3. Scalability: K-means can handle large datasets with many data points and can be
easily scaled to handle even larger datasets.

4. Flexibility: K-means can be easily adapted to different applications and can be used
with varying metrics of distance and initialization methods.



Disadvantages of K-Means

1. Sensitivity to initial centroids: K-means is sensitive to the initial selection of centroids and can converge to a suboptimal solution.

2. Requires specifying the number of clusters: The number of clusters k needs to be specified before running the algorithm, which can be challenging in some applications.

3. Sensitive to outliers: K-means is sensitive to outliers, which can have a significant impact on the resulting clusters.

Example:

No. Height Weight Cluster


1 185 72 K1
2 170 56 K2
3 168 60 ?
4 179 68 ?

(Note: Keep point 1 and 2 as centroids and label them as K1 & K2)

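Step 1:
Decide the centroids. Let's take points ① and ② as the initial centroids of clusters K1 and K2.
K1 = (185, 72)
K2 = (170, 56)

Steps 2 to 5:
Calculate the Euclidean distance from each remaining point to both initial centroids and assign the point to the nearer one.
Point ③ (168, 60): distance to K1 = √(17² + 12²) = √433 ≈ 20.81, distance to K2 = √(2² + 4²) = √20 ≈ 4.47, so it joins K2.
Point ④ (179, 68): distance to K1 = √(6² + 4²) = √52 ≈ 7.21, distance to K2 = √(9² + 12²) = √225 = 15, so it joins K1.

Step 6:
Total clusters are K = 2.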



K1 = {1, 4}
K2 = {2, 3}

No. Height Weight Cluster

1 185 72 K1

2 170 56 K2

3 168 60 K2

4 179 68 K1
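The clustering in this example can be reproduced with a short batch K-means loop. This is a minimal sketch, assuming points ① and ② are used as the initial centroids (as in the note above) and that all points are reassigned together before the centroids are recomputed; the function names are illustrative.

```python
import math


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points]


def update(points, labels, centroids):
    """New centroid = mean of the points assigned to the cluster."""
    new = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:                      # keep the old centroid if a cluster is empty
            new.append(old)
        else:
            new.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new


def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:       # converged: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids


def wcss(points, labels, centroids):
    """Within-cluster sum of squares, the quantity K-means minimizes."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


points = [(185, 72), (170, 56), (168, 60), (179, 68)]   # (Height, Weight)
labels, centroids = kmeans(points, [(185, 72), (170, 56)])
print(labels)      # [0, 1, 1, 0] -> K1 = {1, 4}, K2 = {2, 3}
print(centroids)   # [(182.0, 70.0), (169.0, 58.0)]
print(wcss(points, labels, centroids))
```

Running kmeans for several values of K and plotting the resulting wcss values gives the curve used by the elbow method.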



UNIT - IV
Dimensionality Reduction – Linear Discriminant Analysis – Principal Component Analysis
– Factor Analysis – Independent Component Analysis – Locally Linear Embedding –
Isomap – Least Squares Optimization Evolutionary Learning – Genetic algorithms –
Genetic Offspring: - Genetic Operators – Using Genetic Algorithms

DIMENSIONALITY REDUCTION
• Dimensionality reduction is the process of reducing the number of features (or
dimensions) in a dataset while retaining as much information as possible.
• In other words, it is a process of transforming high-dimensional data into a lower
dimensional space that still preserves the essence of the original data.
• Dimensionality reduction can be done in two different ways:
1. By only keeping the most relevant variables from the original dataset (this technique
is called feature selection)
2. By finding a smaller set of new variables, each being a combination of the input
variables, containing essentially the same information as the input variables (this
technique is called feature extraction)

LINEAR DISCRIMINANT ANALYSIS (LDA)


In machine learning, Linear Discriminant Analysis (LDA) is a supervised dimensionality
reduction technique used for classification, aiming to find the linear combination of features that
best separates different classes. It's also known as Normal Discriminant Analysis or Discriminant
Function Analysis.
• LDA uses both input data and class labels to learn how to classify data points.
• LDA projects data into a lower-dimensional space while preserving the information that is
most relevant for separating classes.
• The goal of LDA is to find a linear combination of features that maximizes the distance
between the means of different classes, while minimizing the variance within each class.
• LDA can be used to create a linear classifier that predicts the class of new data points based
on the learned linear combination of features.
• LDA assumes that the data within each class follows a normal distribution.
• While both LDA and Principal Component Analysis (PCA) are used for dimensionality
reduction, LDA is supervised and considers class labels, while PCA is unsupervised and
focuses on maximizing variance.

How it Works:
1. Input Data: LDA takes labelled data as input, where each data point has features and a
corresponding class label.
2. Finding Linear Combinations: LDA calculates a set of linear combinations of the original
features (or "linear discriminants") that best separate the classes.
3. Projection: The data is then projected onto these linear discriminants, reducing the
dimensionality of the data while preserving the information needed for classification.
4. Classification: The projected data can then be used to train a linear classifier that predicts
the class of new data points.
Example: To classify images of cats and dogs, LDA would find the combination of measured
features (such as ear shape and tail length) that best separates the two classes.
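
For two classes, the "finding linear combinations" and "projection" steps can be sketched with NumPy as follows. This is a minimal illustration of Fisher's two-class discriminant, not a full LDA implementation; the toy data values, the midpoint threshold, and the helper name `predict` are assumptions made for the example.

```python
import numpy as np

# Toy labelled data (hypothetical values): two features, two classes.
X0 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]])  # class 0
X1 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 6.0], [7.0, 6.0]])  # class 1

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)

# Within-class scatter: scatter of each class around its own mean, summed.
S_w = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)

# Fisher's direction is proportional to S_w^{-1} (m1 - m0): it pushes the
# class means apart while keeping the within-class spread small.
w = np.linalg.solve(S_w, m1 - m0)
w /= np.linalg.norm(w)

# Projection: every 2-D point becomes a single number on the discriminant.
z0, z1 = X0 @ w, X1 @ w

# A simple classifier (an assumption of this sketch): threshold halfway
# between the projected class means.
threshold = (m0 + m1) @ w / 2

def predict(x):
    """Return 1 if x projects above the midpoint threshold, else 0."""
    return int(np.asarray(x) @ w > threshold)

print(z0, z1, predict([2, 2]), predict([7, 7]))
```

On this toy data the two classes do not overlap after projection, so a single threshold on the one-dimensional projection is enough to separate them.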

Applications:
• Classification: LDA is commonly used for classification tasks in various domains, such as
image recognition, medical diagnosis, and customer segmentation.
• Feature Selection: LDA can be used to select the most relevant features for classification
by identifying the linear combinations that best separate the classes.
• Dimensionality Reduction: LDA can be used to reduce the dimensionality of data while
preserving the information that is most important for classification.

Advantages:
• Simplicity: LDA is a relatively simple algorithm that is easy to implement and
understand.
• Computational Efficiency: LDA is computationally efficient, making it suitable for
large datasets.
• Interpretability: The linear combinations of features learned by LDA are easy to
interpret, providing insights into the relationships between features and classes.

Limitations:
• Assumptions: LDA relies on the assumption that the data within each class follows a
normal distribution and that the classes share a common covariance matrix, which may
not always be true in real-world datasets.
• Linearity: LDA assumes that the class boundaries are linear, which may not be suitable for
datasets with complex, non-linear relationships.
• Class Imbalance: LDA may not perform well on datasets with imbalanced classes, where
one class has significantly more data points than the other.

PRINCIPAL COMPONENT ANALYSIS
Principal Component Analysis (PCA) is a machine learning technique used for dimensionality
reduction, data compression, and noise reduction by transforming high-dimensional data into a
lower-dimensional space while preserving the most important information.

• Dimensionality Reduction: PCA aims to reduce the number of variables (features)
in a dataset while retaining as much variance (information) as possible.
• Unsupervised Learning: It's an unsupervised learning technique, meaning it doesn't
require labelled data for training.
• Linear Transformation: PCA performs a linear transformation of the original data
to a new coordinate system, where the axes are called principal components.
• Principal Components: These components are orthogonal (perpendicular) to each
other and capture the directions of maximum variance in the data.
• Example: If a dataset has 10 features, PCA can identify a smaller set of 3 or 4
principal components that capture most of the variance in the data.

How PCA Works:


1. Standardize the Data: The data is typically standardized (mean-centered and scaled) to
have zero mean and unit variance.
2. Calculate the Covariance Matrix: The covariance matrix describes the relationships
between the variables in the dataset.
3. Compute Eigenvectors and Eigenvalues: The eigenvectors of the covariance matrix
represent the principal components, and the corresponding eigenvalues indicate the amount
of variance explained by each component.

4. Select Principal Components: The principal components are ranked by their eigenvalues,
and the most important ones (those with the largest eigenvalues) are selected to represent
the data in a lower-dimensional space.
5. Project Data: The original data is projected onto the selected principal components,
resulting in a new dataset with reduced dimensionality.
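
The five steps above can be sketched directly with NumPy. This is a minimal illustration under stated assumptions: the function name `pca`, the synthetic data, and the choice of two components are made up for the example, and `numpy.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

def pca(X, n_components):
    # 1. Standardize: zero mean and unit variance per feature.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the standardized features.
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigen decomposition (eigh is meant for symmetric matrices).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Rank components by eigenvalue, largest first, and keep the top ones.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    # 5. Project the data onto the selected components.
    return X_std @ components, components, explained_ratio

# Synthetic data (hypothetical): two strongly correlated features plus one
# independent feature.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.1 * rng.normal(size=200), rng.normal(size=200)])

Z, components, ratio = pca(X, n_components=2)
print(Z.shape, ratio)
```

Because the first two features carry almost the same information, the first principal component accounts for well over half of the total variance in this example.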

Applications:
o Data Visualization: PCA can help visualize high-dimensional data in a lower-
dimensional space (e.g., 2D or 3D).
o Feature Extraction: It can identify the most important features or variables that
contribute most to the overall variance in the data.
o Data Compression: PCA can be used to compress data by representing it with a
smaller number of principal components.
o Noise Reduction: By focusing on the principal components that capture the most
variance, PCA can help remove noise or irrelevant information.
o Anomaly Detection: PCA can be used to identify outliers or anomalies in the data
by measuring the distance of data points from the principal components.

Advantages of Principal Component Analysis


1. Multicollinearity Handling: Creates new, uncorrelated variables to address issues
when original features are highly correlated.
2. Noise Reduction: Eliminates components with low variance (assumed to be noise),
enhancing data clarity.
3. Data Compression: Represents data with fewer components, reducing storage
needs and speeding up processing.
4. Outlier Detection: Identifies unusual data points by showing which ones deviate
significantly in the reduced space.

Disadvantages of Principal Component Analysis


1. Interpretation Challenges: The new components are combinations of original
variables, which can be hard to explain.
2. Data Scaling Sensitivity: Requires proper scaling of data before application, or
results may be misleading.
3. Information Loss: Reducing dimensions may lose some important information if
too few components are kept.
4. Assumption of Linearity: Works best when relationships between variables are
linear, and may struggle with non-linear data.
5. Computational Complexity: Can be slow and resource-intensive on very large
datasets.
6. Risk of Overfitting: Using too many components or working with a small dataset
might lead to models that don’t generalize well.

DIFFERENCE BETWEEN PCA AND LDA

Feature | PCA (Principal Component Analysis) | LDA (Linear Discriminant Analysis)
Purpose | Dimensionality reduction | Dimensionality reduction + Classification
Type | Unsupervised learning | Supervised learning
Works on | Maximizing variance in data | Maximizing class separability
Input | Only features (independent variables) | Features + Class labels (dependent variable)
Optimization Goal | Finds new axes maximizing variance | Finds new axes maximizing class separation
Computation | Uses eigen decomposition of covariance matrix | Uses eigen decomposition of scatter matrices
Outcome | Principal components (PCs) ordered by variance | Linear discriminants maximizing class separation
Use Case | Feature extraction, noise reduction | Classification, pattern recognition
Better for | Unlabeled data, general data compression | Labeled data, classification tasks
Example Use | Image compression, topic modeling | Face recognition, medical diagnosis

FACTOR ANALYSIS IN MACHINE LEARNING

• Factor Analysis (FA) is a dimensionality reduction and feature extraction technique
used in machine learning and statistics.
• It models the relationships between observed variables by identifying latent factors
(hidden variables), reducing a large number of observed variables to a smaller set of
underlying factors.
• The observed variables are assumed to be influenced by hidden factors. FA groups
similar variables together based on these factors. Each variable is represented as a
combination of latent factors plus some error.
• It is used to simplify data by reducing the number of variables, to find hidden
relationships between variables, and to extract important features for machine learning
models.

Mathematical Representation

• X = Observed variables (data matrix)
• F = Latent factors
• Λ = Factor loadings matrix (weights showing how factors influence variables)
• ϵ = Noise (error)

Then, the model is:

X = ΛF + ϵ

• Factor loadings (Λ) tell us how much each observed variable is influenced by a latent
factor.
• Noise (ϵ) accounts for variability not explained by the factors.
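
To make the model concrete, the sketch below generates data from X = ΛF + ϵ and compares the sample covariance with the covariance the model implies. Everything numeric here is hypothetical: the loadings, the noise variances (named `psi` for the example), and the sample size. It also assumes the factors are uncorrelated with unit variance and independent of the noise, in which case the implied covariance is ΛΛᵀ plus a diagonal noise term.

```python
import numpy as np

# Hypothetical loadings: 4 observed variables (rows), 2 latent factors (columns).
Lambda = np.array([[0.9, 0.0],
                   [0.8, 0.1],
                   [0.1, 0.7],
                   [0.0, 0.6]])
# Hypothetical noise variance for each observed variable.
psi = np.array([0.19, 0.35, 0.50, 0.64])

rng = np.random.default_rng(0)
n = 5000
F = rng.normal(size=(2, n))                             # latent factors
eps = rng.normal(size=(4, n)) * np.sqrt(psi)[:, None]   # noise term
X = Lambda @ F + eps                                    # the model X = ΛF + ϵ

# Covariance implied by the model under the assumptions stated above.
implied_cov = Lambda @ Lambda.T + np.diag(psi)
sample_cov = np.cov(X)

# Communality: the share of each variable's variance explained by the factors.
communalities = (Lambda ** 2).sum(axis=1)
print(np.round(sample_cov, 2))
print(communalities)
```

Variables 1 and 2 load mainly on the first factor and variables 3 and 4 on the second, so the sample covariance shows two blocks of correlated variables, which is the kind of grouping factor analysis tries to recover.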

Example

Observed Variables are the actual data points we measure. Suppose we have a restaurant
survey with questions about waiting time, cleanliness, and staff behavior, as well as the
taste, temperature, and freshness of the food. Latent Factors (Hidden Variables) are
unobserved underlying causes that explain patterns in the observed data.

In the example, there might be two latent factors influencing the responses:

• Service quality (affecting waiting time, staff behavior and cleanliness)
• Food quality (affecting the taste of the food, its temperature and freshness)

Types of Factor Analysis (FA)

Factor Analysis is mainly classified into two types based on the purpose and approach used:

1. Exploratory Factor Analysis (EFA)

• Used when the number and nature of factors are unknown.


• Helps discover hidden relationships and patterns in data.
• Commonly used in research when trying to understand the underlying structure of a
dataset.
• Example: In psychology, EFA is used to identify possible personality traits from survey
responses.

2. Confirmatory Factor Analysis (CFA)

• Used when the number and structure of factors are already known or hypothesized.
• Confirms whether the data fits the assumed factor structure.
• Common in validating questionnaires, psychological tests, and scientific research.
• Example: In education, CFA is used to confirm that an IQ test correctly measures verbal,
logical, and spatial intelligence.

How it works:
1. Data Collection:
Gather data on a set of variables.
2. Correlation/Covariance Matrix:
Calculate the correlation or covariance matrix to understand the relationships between the
variables.
3. Factor Extraction:
Determine the number of factors to extract and extract them using methods like principal
component analysis (PCA) or maximum likelihood estimation.
4. Factor Rotation (Optional):
Rotate the factors to simplify interpretation and make the relationship between factors and
variables clearer.
5. Factor Loadings:
Examine the factor loadings, which indicate how much each original variable contributes to
each factor.
6. Interpretation:
Interpret the factors based on the factor loadings and understand the underlying structure of
the data.

Applications:
• Data Reduction: Reduce the number of variables for easier analysis and modeling.
• Feature Extraction: Identify key features or factors that drive the data.
• Identifying Underlying Structures: Discover latent structures or dimensions in the
data.
• Psychometrics: Used in personality assessment, attitude measurement, and other
psychological research.
• Marketing: Used to identify customer segments or product preferences.
• Finance: Used to identify market factors or investment strategies.

INDEPENDENT COMPONENT ANALYSIS (ICA) IN MACHINE LEARNING

Independent Component Analysis (ICA) is a powerful statistical technique used in machine
learning and signal processing to separate a multivariate signal into additive, independent
non-Gaussian components. It’s particularly useful when the observed data is a mixture of
several underlying sources, and the goal is to recover the original source signals.

ICA assumes that:

• The observed signals are linear mixtures of independent source signals.


• The original source signals are statistically independent and non-Gaussian.

Example: Cocktail Party Problem

You're in a room with two people talking at the same time, and you have two microphones
recording the sounds. Each microphone picks up a different mixture of both people’s voices.

You want to separate the two voices from the recordings using ICA.

Mathematical Model
Given:

• X = The observed signals (like the recordings from microphones)
• S = The original, independent source signals (like actual people’s voices)
• A = An unknown mixing matrix (how each source contributes to each microphone)

Then, the mixing model is:

X = AS

Applications of ICA

• Blind Source Separation (BSS) – Classic example: Cocktail party problem, separating
different voices from a recording.
• EEG/MEG Signal Processing – Separate brain signals from noise.
• Image Processing – Feature extraction and noise removal.
• Financial Data Analysis – Uncovering underlying independent factors in stock prices.

How ICA Works:

1. Data Collection (X)

• You collect the mixed signals. For example, two microphones recording different mixes
of two people speaking.

2. Centering and Whitening

• Centering: Make the data have a mean of 0.


• Whitening: Transform the data so that it becomes uncorrelated and has equal variance.

Why? This makes the data easier to separate.

3. Find Independent Components

• ICA assumes:
o The original sources are statistically independent
o They are non-Gaussian
• An ICA algorithm (such as FastICA) tries to find a matrix W that transforms the mixed
data into independent sources:

S = WX

Where:

• X = observed (mixed) signals


• W = unmixing matrix
• S = estimated independent sources

4. Get the Separated Sources

• The result S gives you the independent components – your original signals!
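
The steps above can be sketched for the two-microphone case with NumPy. This is not FastICA: after centering and whitening, the sketch simply tries a grid of rotation angles and keeps the one whose outputs are most non-Gaussian (largest sum of squared excess kurtosis). That brute-force search is a stand-in chosen for readability and only works for two signals. The sources, the mixing matrix `A`, and all helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Two hypothetical non-Gaussian sources: one uniform, one Laplace.
S_true = np.vstack([rng.uniform(-1, 1, n), rng.laplace(size=n)])
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])            # hypothetical mixing matrix
X = A @ S_true                        # observed mixtures, X = AS

# Step 2a. Centering: subtract the mean of each observed signal.
Xc = X - X.mean(axis=1, keepdims=True)

# Step 2b. Whitening: decorrelate and rescale to unit variance.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc))
whitener = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
Xw = whitener @ Xc

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def excess_kurtosis(y):
    y = (y - y.mean()) / y.std()
    return np.mean(y ** 4) - 3.0      # zero for a Gaussian signal

# Step 3. After whitening only a rotation is unknown; pick the angle whose
# outputs are least Gaussian (grid search used here instead of FastICA).
angles = np.linspace(0.0, np.pi / 2, 180, endpoint=False)
best_angle = max(
    angles,
    key=lambda a: sum(excess_kurtosis(row) ** 2 for row in rotation(a) @ Xw),
)

# Step 4. Unmixing matrix W and the estimated sources S = WX (centered data).
W = rotation(best_angle) @ whitener
S_est = W @ Xc
```

The recovered signals match the original sources only up to order, sign, and scale, which is a general property of ICA rather than a limitation of this sketch.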

Advantages of Independent Component Analysis (ICA):

• Separation of mixed signals: ICA is a useful method for breaking down mixed signals
into their independent component parts. This is useful in several applications, including
signal processing, image analysis, and data compression.
• Non-parametric technique: ICA does not assume a particular parametric form for the
probability distribution of the sources, beyond requiring that they be non-Gaussian.
• Unsupervised learning: ICA is a learning approach that can be applied to data without
the need for labelled samples. As a result, it is helpful when access to labelled data is
limited.
• Feature extraction: Using ICA, significant characteristics in the data that are useful for
other tasks, like classification, can be found. This process is known as feature extraction.
Disadvantages of Independent Component Analysis (ICA):

• Non-Gaussian assumption: ICA assumes that the underlying sources are non-Gaussian,
which is not always the case. ICA may fail to separate the sources if they are Gaussian.
• Assumption of linear mixing: ICA assumes that the sources are mixed linearly, which is
not always the case. ICA may not work if the sources are mixed nonlinearly.
• Costly to compute: ICA can be costly to compute, particularly for big datasets. This can
make it challenging to apply ICA to practical problems.

• Convergence problems: ICA algorithms are iterative and may fail to converge, so a
solution is not always found. This can be an issue for complex datasets with numerous
sources.

LOCALLY LINEAR EMBEDDING

• Locally Linear Embedding (LLE) is a non-linear dimensionality reduction technique
used in machine learning to discover the low-dimensional structure of high-dimensional
data. It’s especially useful for manifold learning, where data lies on a non-linear
subspace (or "manifold") of a higher-dimensional space.
• LLE assumes that each data point and its neighbors lie on or close to a locally linear
patch of the manifold. So, it preserves local neighborhood structure while mapping high-
dimensional data to a lower dimension.
• Think about a winding mountain road. If you were to look at it from far away, it might
seem like a confusing mess of curves. But up close, any small portion of the road looks
almost straight. LLE works in much the same way.
• The idea is that while your dataset might be a complex, nonlinear mess globally, it’s still
locally linear, like that small portion of the road.

How LLE Works

• Find Neighbors: For each data point, find its k nearest neighbors using Euclidean
distance.
• Compute Weights: For each point, compute the weights that best reconstruct the point
as a linear combination of its neighbors, i.e., the weights that minimize the
reconstruction error. This results in weights W such that:

x_i ≈ Σ_j W_ij x_j, with Σ_j W_ij = 1 (the sum runs over the neighbors j of point i)

• Embed in Low Dimensions: Find low-dimensional points Y for which the same weights
still reconstruct each point from its neighbors (y_i ≈ Σ_j W_ij y_j). This preserves the
local structure of the manifold.

• Unlike ISOMAP, LLE does not compute global shortest paths — it only preserves the local
relationships captured via KNN.
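
The weight computation for a single point can be sketched as below. This is a minimal illustration, assuming the usual formulation in which the weights sum to 1 and a small regularization term keeps the local system solvable; the function name `lle_weights`, the `reg` value, and the toy coordinates are all assumptions of the example. The final embedding step is not shown.

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Weights that best reconstruct x from its neighbors, constrained to sum to 1."""
    Z = neighbors - x                          # shift so that x is the origin
    C = Z @ Z.T                                # local Gram matrix (k x k)
    C += reg * np.trace(C) * np.eye(len(C))    # regularize in case C is singular
    w = np.linalg.solve(C, np.ones(len(C)))    # solve C w = 1
    return w / w.sum()                         # enforce the sum-to-one constraint

# Toy example (hypothetical): x is the midpoint of its first two neighbors.
x = np.array([1.0, 1.0])
neighbors = np.array([[0.0, 0.0], [2.0, 2.0], [1.0, 3.0]])

w = lle_weights(x, neighbors)
reconstruction = w @ neighbors
print(np.round(w, 3), np.round(reconstruction, 3))
```

Here almost all of the weight goes to the two neighbors that lie on a straight line through x, and the third neighbor receives a weight close to zero.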

Applications:

• Visualizing complex datasets in 2D or 3D (like facial images, word embeddings)


• Preprocessing for classification or clustering
• Nonlinear feature extraction

Advantages of LLE
The dimensionality reduction method known as locally linear embedding (LLE) has many
benefits for data processing and visualization. The following are LLE's main benefits:
• Preservation of Local Structures: LLE is good at maintaining the local relationships
or structures in the data. It captures the inherent geometry of nonlinear manifolds by
preserving how each point is reconstructed from its nearby data points.
• Handling Non-Linearity: LLE has the ability to capture nonlinear patterns and
structures in the data, in contrast to linear techniques like Principal Component
Analysis (PCA). When working with complicated, curved, or twisted datasets, it is
especially helpful.
• Dimensionality Reduction: LLE lowers the dimensionality of the data while
preserving its fundamental properties. Particularly when working with high-
dimensional datasets, this reduction makes data presentation, exploration, and
analysis simpler.
Disadvantages of LLE
• Curse of Dimensionality: LLE can experience the "curse of dimensionality" when
used with extremely high-dimensional data, just like many other dimensionality
reduction approaches. The number of neighbors required to capture local
interactions rises as dimensionality does, potentially increasing the computational
cost of the approach.
• Memory and computational Requirements: For big datasets, creating a weighted
adjacency matrix as part of LLE might be memory-intensive. The eigenvalue
decomposition stage can also be computationally taxing for big datasets.
• Outliers and Noisy data: LLE is sensitive to outliers and noisy data points.
Outliers can distort the local linear relationships and reduce the quality of the
embedding.

ISOMAP

ISOMAP is used to reduce the number of dimensions in high-dimensional data while preserving
the intrinsic geometry (shape) of the data — especially when the data lies on a non-linear
manifold.

How ISOMAP Works:

1. Construct a neighborhood graph:


o Connect each point to its k nearest neighbors (using Euclidean distance).

o Build a graph where each node is a data point and edges connect neighbors.
2. Compute shortest paths (geodesic distances):
o Use Dijkstra’s or Floyd-Warshall algorithm to compute the shortest path between all
pairs of points on the graph.
o This approximates the geodesic distance (i.e., distance along the manifold).
3. Apply Classical MDS (Multidimensional Scaling):
o Use MDS on the geodesic distance matrix to embed the data into a lower-dimensional
space.
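
Steps 1 and 2 can be sketched in plain Python as follows. The example builds a k-nearest-neighbor graph and runs Floyd-Warshall on it; the MDS step is not shown. The function name `geodesic_distances`, the half-circle data, and k = 2 are assumptions made for the illustration.

```python
import math

def geodesic_distances(points, k):
    """Approximate geodesic distances via shortest paths on a k-NN graph."""
    n = len(points)
    dist = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    # Step 1: connect each point to its k nearest neighbors (Euclidean distance).
    for i in range(n):
        nearest = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )[:k]
        for j in nearest:
            d = math.dist(points[i], points[j])
            dist[i][j] = dist[j][i] = d        # undirected edge
    # Step 2: Floyd-Warshall all-pairs shortest paths on the graph.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][m] + dist[m][j] < dist[i][j]:
                    dist[i][j] = dist[i][m] + dist[m][j]
    return dist

# Hypothetical data: 21 points on a half circle of radius 1.
n_points = 21
points = [
    (math.cos(math.pi * i / (n_points - 1)), math.sin(math.pi * i / (n_points - 1)))
    for i in range(n_points)
]

D = geodesic_distances(points, k=2)
straight_line = math.dist(points[0], points[-1])   # cuts across the half circle
along_curve = D[0][-1]                             # follows the curve
print(straight_line, along_curve)
```

The two end points are 2 units apart in a straight line, but the graph distance follows the curve and comes out close to the arc length π, which is the distance ISOMAP tries to preserve.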

Applications

• Manifold learning
• Visualization of high-dimensional data
• Preprocessing before classification/clustering

Feature | ISOMAP | LLE (Locally Linear Embedding)
Core Idea | Preserves global geodesic distances | Preserves local neighborhood geometry
Distance Used | Uses geodesic distances (shortest paths on a manifold) | Uses linear reconstruction weights within local neighborhoods
Graph Construction | Builds a neighborhood graph and calculates shortest path distances | Builds a neighborhood graph and reconstructs each point using neighbors
Sensitivity | Sensitive to short-circuiting in the graph (bad neighbor choices) | Sensitive to noise and manifold curvature
Computational Cost | More expensive due to shortest path computation | Less computationally heavy
Captures | Global structure of the data | Local structure of the data
Suitable For | When data lies on a globally smooth manifold | When local linearity is a good assumption
Algorithm Type | Global | Local

LEAST SQUARES OPTIMIZATION EVOLUTIONARY LEARNING

Least Squares Optimization: Least Squares Optimization is a method that fits a model by
minimizing the sum of the squared differences between predicted values and actual data.

What is Evolutionary Learning?

Evolutionary Learning is a machine learning technique inspired by biological evolution, like
how living things evolve and get better over generations. It tries to evolve solutions to problems
instead of using traditional methods like gradient descent or backpropagation.

Example: Imagine you're trying to train a robot to walk. You don’t know the perfect way to do it,
so you let many candidate robots try random movements, keep the ones that perform better, and
let them “reproduce” to create a new generation of robots with small improvements. Repeat this
over and over, and eventually, some of them will walk well.

It works like this:

• Start with a population of random solutions (e.g., models, weights, or functions)


• Evaluate their performance (how good they are)
• Select the best performers
• Mutate and crossover to create a new generation
• Repeat until you get a good enough solution

This method doesn't require gradient-based optimization (like backpropagation), so it’s useful in
tricky cases where derivatives are hard to calculate.

Least Squares Optimization + Evolutionary Learning

Now imagine combining the two:

• You want to find a model that minimizes the least squares error (i.e., best fits the data)

• But instead of using traditional gradient methods, you use evolutionary learning to
evolve the model parameters

The process:

1. Initialize a population of models with random parameters


2. For each model:
o Compute predictions
o Calculate least squares error
3. Select models with the lowest error
4. Perform genetic operations:
o Crossover: Combine parts of two models
o Mutation: Randomly tweak model parameters
5. Generate a new population and repeat

Over time, the models evolve to have a better fit (lower least squares error).
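
The process above can be sketched for fitting a straight line. This is only an illustration: the data, the population size, the number of generations, the mutation scale, and the particular crossover (slope from one parent, intercept from the other) are all assumptions chosen to keep the example short.

```python
import random

random.seed(0)

# Hypothetical data generated from the line y = 2x + 1.
xs = [0, 1, 2, 3, 4, 5]
ys = [2 * x + 1 for x in xs]

def squared_error(model):
    """Least squares error of a (slope, intercept) model on the data."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))

def crossover(a, b):
    """Combine parts of two models: slope from a, intercept from b."""
    return (a[0], b[1])

def mutate(model, scale=0.1):
    """Randomly tweak each parameter with small Gaussian noise."""
    return tuple(p + random.gauss(0, scale) for p in model)

# 1. Initialize a population of models with random parameters.
population = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(30)]
initial_best_error = min(squared_error(m) for m in population)

for generation in range(200):
    # 2-3. Evaluate every model and keep the ones with the lowest error.
    population.sort(key=squared_error)
    parents = population[:10]
    # 4. Genetic operations: crossover followed by mutation.
    children = [
        mutate(crossover(random.choice(parents), random.choice(parents)))
        for _ in range(20)
    ]
    # 5. New population; the parents are carried over unchanged.
    population = parents + children

best = min(population, key=squared_error)
best_error = squared_error(best)
print(best, best_error)
```

Because the selected parents are always carried over, the best error in the population never gets worse from one generation to the next, and no gradient is computed at any point.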

GENETIC ALGORITHMS

Genetic Algorithms (GAs) are a type of search heuristic inspired by Darwin’s theory of natural
selection, mimicking the process of biological evolution. These algorithms are designed to find
optimal or near-optimal solutions to complex problems by iteratively improving candidate
solutions based on survival of the fittest.

The primary purpose of Genetic Algorithms is to tackle optimization and search problems. By
leveraging evolutionary principles such as selection, crossover, and mutation, GAs explore large
solution spaces efficiently, even for problems where traditional methods struggle.

Genetic Algorithm in machine learning plays a significant role in tasks like hyperparameter
tuning, feature selection, and model optimization. For instance, they can optimize the
architecture of a neural network or select the most relevant features for improving prediction
accuracy.

Real-World Examples:

• Neural Network Optimization: Using GAs to identify the best combination of
hyperparameters (e.g., learning rate, number of layers).
• Logistics: Solving routing problems, such as optimizing delivery routes for cost and time
efficiency.
Genetic Algorithms offer a versatile and powerful approach to solving complex, multi-
dimensional problems, making them indispensable in various fields, including machine learning,
robotics, and operations research.

How Genetic Algorithms Work?

Genetic Algorithms (GAs) operate through an iterative process inspired by natural evolution.
This process involves generating, evaluating, and evolving populations of candidate solutions to
find the optimal outcome. The workflow can be broken down into several key stages:

1. Initialization

The process begins by generating a population of candidate solutions, often represented as
chromosomes. These solutions can be generated randomly or using predefined methods to ensure
diversity in the search space.
Example: For a binary optimization problem, chromosomes might be initialized as binary
strings (e.g., 101010 or 110011).

2. Fitness Evaluation

Each candidate solution is evaluated using a fitness function that measures its quality or
suitability for solving the problem. The fitness function is problem-specific and determines how
well a solution meets the objective.
Example: In the Traveling Salesman Problem (TSP), the fitness is calculated as the inverse of
the total distance traveled. Shorter routes yield higher fitness scores.
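
A fitness function of this kind can be sketched as follows. The city names and coordinates are hypothetical, and the route is assumed to return to its starting city.

```python
import math

# Hypothetical city coordinates.
cities = {"A": (0, 0), "B": (3, 0), "C": (3, 4), "D": (0, 4)}

def route_length(route):
    """Total distance of a closed tour that returns to the first city."""
    return sum(
        math.dist(cities[a], cities[b])
        for a, b in zip(route, route[1:] + route[:1])
    )

def fitness(route):
    """Inverse of the total distance: shorter routes score higher."""
    return 1.0 / route_length(route)

short_route = ["A", "B", "C", "D"]   # goes around the rectangle
long_route = ["A", "C", "B", "D"]    # crosses both diagonals

print(route_length(short_route), fitness(short_route))
print(route_length(long_route), fitness(long_route))
```

The tour around the rectangle has length 14 and the tour that crosses the diagonals has length 18, so the first one receives the higher fitness score.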

3. Selection

To create the next generation, GAs select the fittest solutions from the current population.
Various methods ensure that better solutions have a higher probability of being chosen:

• Roulette Wheel Selection: Solutions are selected with probability proportional
to their fitness.
• Tournament Selection: Randomly selects a subset of candidates, and the fittest
among them is chosen.

• Rank Selection: Ranks solutions by fitness and selects based on their position.
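
The three selection methods can be sketched with Python's standard `random` module. The function names, the tournament size, and the choice of rank as the selection weight are assumptions of this example; the toy fitness simply counts the 1s in each chromosome.

```python
import random

def roulette_wheel_selection(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    # Assumes all fitness values are non-negative and not all zero.
    return rng.choices(population, weights=fitnesses, k=1)[0]

def tournament_selection(population, fitnesses, size=3, rng=random):
    """Pick `size` random contenders and return the fittest of them."""
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: fitnesses[i])]

def rank_selection(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its rank."""
    order = sorted(range(len(population)), key=lambda i: fitnesses[i])
    ranks = [0] * len(population)
    for rank, i in enumerate(order, start=1):   # worst gets 1, best gets n
        ranks[i] = rank
    return rng.choices(population, weights=ranks, k=1)[0]

population = ["101010", "110011", "111111", "000001"]
fitnesses = [chrom.count("1") for chrom in population]  # toy fitness: number of 1s

print(roulette_wheel_selection(population, fitnesses))
print(tournament_selection(population, fitnesses))
print(rank_selection(population, fitnesses))
```

Rank selection uses only the ordering of the fitness values, so one individual with a very large fitness does not dominate the draw as it can with the roulette wheel.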
4. Crossover (Recombination)

Crossover, or recombination, involves combining the genetic material of two parent solutions to
produce offspring. This process introduces variability and explores new areas of the search
space.
Types of Crossover:

• Single-Point Crossover: Splits chromosomes at one point, exchanging segments.


• Two-Point Crossover: Splits chromosomes at two points for more diverse
offspring.
• Uniform Crossover: Randomly exchanges genes between parents.
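
The three crossover types can be sketched on binary strings as follows. The function names and the fixed cut points are illustrative; in practice the cut points are usually drawn at random.

```python
import random

def single_point_crossover(p1, p2, point):
    """Swap the tails of the two parents after position `point`."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def two_point_crossover(p1, p2, a, b):
    """Swap the middle segment between positions a and b."""
    return p1[:a] + p2[a:b] + p1[b:], p2[:a] + p1[a:b] + p2[b:]

def uniform_crossover(p1, p2, rng=random):
    """For each gene, swap between the parents with probability 0.5."""
    c1, c2 = [], []
    for g1, g2 in zip(p1, p2):
        if rng.random() < 0.5:
            g1, g2 = g2, g1
        c1.append(g1)
        c2.append(g2)
    return "".join(c1), "".join(c2)

parent1, parent2 = "101010", "110011"

print(single_point_crossover(parent1, parent2, 3))
print(two_point_crossover(parent1, parent2, 2, 4))
print(uniform_crossover(parent1, parent2))
```

With a cut after the third gene, the parents 101010 and 110011 produce the offspring 101011 and 110010.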

5. Mutation

Mutation introduces random changes to the chromosomes to maintain diversity and avoid
premature convergence. It helps the algorithm explore unexplored areas of the search space.
Example: In a binary chromosome, mutation might involve flipping a 0 to 1 or vice versa (e.g.,
101010 becomes 101110).
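
Bit-flip mutation can be sketched as below. The function names and the mutation rate are assumptions of the example; `flip_bit` reproduces the single flip described above.

```python
import random

def flip_bit(chromosome, index):
    """Flip the gene at one chosen position."""
    flipped = "1" if chromosome[index] == "0" else "0"
    return chromosome[:index] + flipped + chromosome[index + 1:]

def bit_flip_mutation(chromosome, rate, rng=random):
    """Flip each gene independently with probability `rate`."""
    return "".join(
        ("1" if gene == "0" else "0") if rng.random() < rate else gene
        for gene in chromosome
    )

print(flip_bit("101010", 3))              # flips the fourth gene
print(bit_flip_mutation("101010", 0.1))   # random result
```

Flipping the fourth gene turns 101010 into 101110. Mutation rates are usually kept small so that most of each chromosome survives into the next generation.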

6. Termination

The algorithm terminates when a specific termination criterion is met, such as:

• Achieving the desired fitness score.


• Reaching the maximum number of generations.
Through these iterative steps, Genetic Algorithms efficiently evolve populations to converge on
optimal or near-optimal solutions for complex problems.

Key Components of Genetic Algorithms

Genetic Algorithms (GAs) rely on several core components that work together to solve
optimization and search problems effectively.

Search Space

The search space represents the range of all possible solutions for a given problem. It is
essentially the domain within which the algorithm operates to identify the optimal or near-
optimal solution.
GAs excel at exploring this space efficiently by balancing exploitation (focusing on promising
areas) and exploration (investigating new areas), ensuring a higher chance of finding the best
solution.
Example: For the Traveling Salesman Problem, the search space includes all possible
permutations of cities in the route.

Fitness Function

The fitness function evaluates how well a candidate solution performs relative to the problem’s
objectives. A well-designed fitness function is crucial because it directly influences the
algorithm’s ability to converge on the optimal solution.
Example: In a scheduling problem, the fitness function might evaluate the minimization of
resource conflicts or task completion times.

Genetic Operators

Selection, crossover, and mutation are the primary genetic operators that drive the evolutionary
process:

• Selection: Chooses the fittest individuals to contribute to the next generation.


• Crossover: Combines genetic material from selected parents to generate diverse
offspring.
• Mutation: Introduces random changes to maintain diversity and avoid local
optima.
Together, these components enable GAs to iteratively improve solutions, making them highly
effective for complex problem-solving tasks.

Genetic Offspring: In the context of machine learning, especially with genetic algorithms,
"genetic offspring" refers to new individuals or solutions generated by combining the
characteristics of parent solutions through crossover and mutation. These offspring inherit
features from their parents but also introduce new variations, allowing the algorithm to explore
the solution space and potentially find better solutions over generations.

Applications of Genetic Algorithms in Machine Learning

Genetic Algorithms (GAs) have a broad range of applications in machine learning, where they
enhance model performance, reduce complexity, and tackle optimization challenges effectively.

1. Hyperparameter Optimization

GAs are frequently used to automate the process of hyperparameter tuning, which is critical
for improving machine learning model performance. Instead of relying on grid or random search,
GAs use evolutionary principles to concentrate the search on promising combinations of
hyperparameters, which can be more efficient when the search space is large.
Example: In neural networks, GAs can optimize learning rates, layer configurations, and
dropout rates to achieve better accuracy. Similarly, for Support Vector Machines (SVMs), GAs
can fine-tune kernel parameters to enhance classification performance.

2. Feature Selection

Selecting the most relevant features from a dataset is crucial for reducing model complexity and
improving accuracy. GAs identify optimal subsets of features by evaluating their impact on
model performance through a fitness function. This helps reduce overfitting and computational
costs.
Example: In a classification task, GAs can identify the most informative features from a high-
dimensional dataset, improving the classifier’s accuracy.

3. Neural Network Optimization

GAs are employed to optimize neural network architectures and weights, making them highly
effective in designing robust models. By evolving network parameters over generations, GAs
help discover architectures that balance accuracy and computational efficiency.
Example: GAs can optimize the number of neurons, hidden layers, and activation functions in a
deep learning model to enhance predictive accuracy.

4. Other Applications

GAs extend beyond traditional machine learning tasks and find utility in diverse areas:

• Optimizing Supply Chain Routes: GAs minimize transportation costs and
delivery times by solving complex routing problems.
• Evolving Strategies in Gaming AI: GAs enable AI agents to learn and adapt
strategies in dynamic gaming environments.
• Automated Code Generation: GAs are used to evolve and generate code
snippets that solve specific programming tasks.

Advantages of Genetic Algorithms

Genetic Algorithms (GAs) offer several unique advantages, making them highly effective for
solving complex optimization problems:

1. Global Optimization: Because GAs search with a population spread across the
search space, they are less likely than purely local methods to get trapped in
local optima in complex, nonlinear, and high-dimensional problems, although
finding the global optimum is not guaranteed.
2. Adaptability: They can be applied to a wide range of problems, including
combinatorial optimization, continuous optimization, and machine learning tasks,
showcasing their versatility across domains.
3. No Gradient Requirement: Unlike gradient-based optimization methods, GAs
do not rely on differentiable functions. This makes them suitable for problems
with non-differentiable or discontinuous fitness landscapes, where gradient-based
approaches cannot be applied directly.

Limitations of Genetic Algorithms

While Genetic Algorithms (GAs) are powerful tools, they come with certain limitations that can
impact their effectiveness:

1. Computational Cost: GAs often require significant computational resources due
to the evaluation of large populations over multiple generations, especially for
complex problems.
2. Premature Convergence: There is a risk of the algorithm converging to local
optima, particularly if diversity within the population is not maintained.
3. Dependence on Fitness Function: The performance of GAs heavily relies on the
quality and design of the fitness function. Poorly defined fitness functions can
lead to suboptimal solutions or slow convergence.

UNIT – V

Reinforcement Learning – Overview – Getting Lost Example – Markov Chain Monte Carlo
Methods – Sampling – Proposal Distribution – Markov Chain Monte Carlo – Graphical
Models – Bayesian Networks – Markov Random Fields – Hidden Markov Models –
Tracking Methods.

REINFORCEMENT LEARNING
Reinforcement Learning (RL) is a branch of machine learning that focuses on how agents
can learn to make decisions through trial and error to maximize cumulative rewards. RL allows
machines to learn by interacting with an environment and receiving feedback based on their
actions. This feedback comes in the form of rewards or penalties.

Reinforcement Learning revolves around the idea that an agent (the learner or decision-maker)
interacts with an environment to achieve a goal. The agent performs actions and receives
feedback to optimize its decision-making over time.
• Agent: The decision-maker that performs actions.
• Environment: The world or system in which the agent operates.
• State: The situation or condition the agent is currently in.
• Action: The possible moves or decisions the agent can make.
• Reward: The feedback or result from the environment based on the agent’s action.

How Does Reinforcement Learning Work?


The RL process involves an agent performing actions in an environment, receiving rewards or
penalties based on those actions, and adjusting its behavior accordingly. This loop helps the
agent improve its decision-making over time to maximize the cumulative reward.
Here’s a breakdown of RL components:
• Policy: A strategy that the agent uses to determine the next action based on the
current state.
• Reward Function: A function that provides feedback on the actions taken, guiding
the agent towards its goal.

• Value Function: Estimates the future cumulative rewards the agent will receive
from a given state.
• Model of the Environment: A representation of the environment that predicts
future states and rewards, aiding in planning.
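
The "cumulative reward" and the value function in the list above are commonly written as follows. The discount factor γ and the symbols r, G, V, and π are standard notation assumed here for illustration; they do not appear in these notes.

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}, \qquad 0 \le \gamma \le 1

V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \, G_t \mid s_t = s \, \right]
```

Here G_t is the discounted sum of the rewards received after time t, and V^π(s) is the expected value of that sum when the agent starts in state s and follows policy π.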

Reinforcement Learning Example: Navigating a Maze


Imagine a robot navigating a maze to reach a diamond while avoiding fire hazards. The goal is
to find the optimal path with the least number of hazards while maximizing the reward:
• Each time the robot moves correctly, it receives a reward.
• If the robot takes the wrong path, it loses points.
The robot learns by exploring different paths in the maze. By trying various moves, it evaluates
the rewards and penalties for each path. Over time, the robot determines the best route by
selecting the actions that lead to the highest cumulative reward.

The robot’s learning process can be summarized as follows:


1. Exploration: The robot starts by exploring all possible paths in the maze, taking
different actions at each step (e.g., move left, right, up, or down).
2. Feedback: After each move, the robot receives feedback from the environment:
• A positive reward for moving closer to the diamond.
• A penalty for moving into a fire hazard.
3. Adjusting Behavior: Based on this feedback, the robot adjusts its behavior to
maximize the cumulative reward, favoring paths that avoid hazards and bring it
closer to the diamond.
4. Optimal Path: Eventually, the robot discovers the optimal path with the least
number of hazards and the highest reward by selecting the right actions based on
past experiences.
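
The explore, receive feedback, adjust loop can be sketched with tabular Q-learning, one common RL algorithm (the notes do not name a specific one, so this choice is an assumption). The 4x4 grid, the positions of the diamond and the fire cells, the reward values, and the learning parameters are all hypothetical.

```python
import random

ROWS, COLS = 4, 4
START, DIAMOND = (0, 0), (3, 3)
FIRE = {(1, 1), (2, 3)}                     # hypothetical hazard cells
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Environment: return (next_state, reward, done) for one move."""
    dr, dc = ACTIONS[action]
    r = min(max(state[0] + dr, 0), ROWS - 1)    # walls keep the robot inside
    c = min(max(state[1] + dc, 0), COLS - 1)
    nxt = (r, c)
    if nxt == DIAMOND:
        return nxt, 10.0, True                  # reward for reaching the goal
    if nxt in FIRE:
        return nxt, -10.0, True                 # penalty for a fire hazard
    return nxt, -1.0, False                     # small cost for every step

def train(episodes=3000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {((r, c), a): 0.0
         for r in range(ROWS) for c in range(COLS) for a in ACTIONS}
    for _ in range(episodes):
        state = START
        for _ in range(100):                    # cap the episode length
            if rng.random() < epsilon:          # exploration: random move
                action = rng.choice(list(ACTIONS))
            else:                               # exploitation: best known move
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            # Adjust behaviour: move the estimate toward reward + future value
            q[(state, action)] += alpha * (reward + gamma * best_next
                                           - q[(state, action)])
            state = nxt
            if done:
                break
    return q

def greedy_path(q, max_steps=20):
    """Follow the learned policy from START without exploring."""
    state, path = START, [START]
    for _ in range(max_steps):
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        state, _, done = step(state, action)
        path.append(state)
        if done:
            break
    return path

q = train()
print(greedy_path(q))
```

After training, following the highest-valued action in each cell traces the route the robot has learned from its past experience.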

Applications of Reinforcement Learning


1. Robotics: RL is used to automate tasks in structured environments such as
manufacturing, where robots learn to optimize movements and improve efficiency.

2. Game Playing: Advanced RL algorithms have been used to develop strategies for
complex games like chess, Go, and video games, outperforming human players in
many instances.
3. Industrial Control: RL helps in real-time adjustments and optimization of
industrial operations, such as refining processes in the oil and gas industry.
4. Personalized Training Systems: RL enables the customization of instructional
content based on an individual’s learning patterns, improving engagement and
effectiveness.
Advantages of Reinforcement Learning
• Solving Complex Problems: RL can tackle highly complex sequential decision problems
that are difficult to address with conventional techniques.
• Error Correction: The model continuously learns from its environment and can
correct errors that occur during the training process.
• Direct Interaction with the Environment: RL agents learn from real-time
interactions with their environment, allowing adaptive learning.
• Handling Non-Deterministic Environments: RL is effective in environments
where outcomes are uncertain or change over time, making it highly useful for real-
world applications.

Disadvantages of Reinforcement Learning


• Not Suitable for Simple Problems: RL is often overkill for straightforward
tasks where simpler algorithms would be more efficient.
• High Computational Requirements: Training RL models requires a significant
amount of data and computational power, making it resource-intensive.
• Dependency on Reward Function: The effectiveness of RL depends heavily on the
design of the reward function. Poorly designed rewards can lead to suboptimal or
undesired behaviors.
• Difficulty in Debugging and Interpretation: Understanding why an RL agent
makes certain decisions can be challenging, which makes debugging and
troubleshooting complex.

GETTING LOST EXAMPLE – MARKOV CHAIN MONTE CARLO METHODS
A classic example of a Markov Chain Monte Carlo (MCMC) application is simulating the
movement of a "getting lost" person in a city. Imagine a person walking randomly, with a certain
probability of moving in each direction, representing a Markov chain. MCMC algorithms, like
Metropolis-Hastings, can then be used to explore this chain and understand the likely path of the
person, or even estimate the probability of them eventually reaching a specific location.

1. The Markov Chain:
• States:
Each location in the city (e.g., streets, intersections) can be considered a state in the Markov
chain.

• Transitions:
The probability of moving from one location to another (e.g., walking to the next street)
defines the transition probabilities of the Markov chain.
• Independence:
The next location depends only on the current location, not the entire previous path (the
Markov property).
2. MCMC Algorithms:
Metropolis-Hastings:
This is a common MCMC algorithm that can be used to sample from the probability
distribution of the person's location.
• Steps:
o Start with an initial location.
o Propose a new location based on the current location (e.g., by moving randomly in
one of the four cardinal directions).
o Calculate the acceptance probability of the proposed new location. This probability
depends on how probable the proposed location is relative to the current one.
o Move to the new location with that probability (a move to a more probable location
is always accepted). Otherwise, stay in the current location.
o Repeat these steps many times to explore the probability distribution of the person's
location.
3. Applying MCMC to "Getting Lost":
• Exploring the City:
By running the MCMC algorithm, you can simulate the person's movement over time,
essentially creating a "walk" through the city.
• Estimating Likelihoods:
You can use the MCMC samples to estimate the probability of the person being at any
particular location after a certain number of steps.
• Finding Paths:
You can analyze the MCMC samples to understand the most likely paths the person might
take, or even the probability of them reaching a specific destination (e.g., a store, their
home).
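
The steps above can be sketched for a small city grid. Everything specific here is an assumption made for illustration: the 5x5 grid, the position of "home", and a target distribution in which locations closer to home are more probable. The proposal is a random move in one of the four directions; because it is symmetric, the acceptance probability only needs the ratio of target weights.

```python
import random
from collections import Counter

SIZE = 5                        # hypothetical 5x5 grid of intersections
HOME = (4, 4)

def weight(loc):
    # Unnormalised target probability: halves with each block away from HOME
    dist = abs(loc[0] - HOME[0]) + abs(loc[1] - HOME[1])
    return 0.5 ** dist

def metropolis_walk(steps=200_000, seed=0):
    rng = random.Random(seed)
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    current = (0, 0)                               # initial location
    visits = Counter()
    for _ in range(steps):
        dx, dy = rng.choice(moves)                 # propose a neighbouring location
        proposed = (current[0] + dx, current[1] + dy)
        inside = 0 <= proposed[0] < SIZE and 0 <= proposed[1] < SIZE
        if inside:
            accept_prob = min(1.0, weight(proposed) / weight(current))
            if rng.random() < accept_prob:
                current = proposed
        # A rejected or off-grid proposal means the walker stays where it is
        visits[current] += 1
    return {loc: n / steps for loc, n in visits.items()}

estimate = metropolis_walk()
total = sum(weight((x, y)) for x in range(SIZE) for y in range(SIZE))
print("estimated P(home):", round(estimate[HOME], 3))
print("exact P(home):    ", round(weight(HOME) / total, 3))
```

The fraction of steps spent at each location estimates the probability of finding the person there, which can be compared with the exact value on a grid this small.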

In essence, MCMC methods allow you to explore and understand the probabilistic behavior of a
system (like a "getting lost" person) that can be modeled as a Markov chain. This is useful for
various applications, including understanding complex systems, optimizing paths, and predicting
future states.

SAMPLING

• Suppose we have a big dataset and are excited to start analyzing it and building a
machine learning model, but the machine gives an “out of memory” error while trying to load
the dataset.
• This happens often with big datasets. Dealing with massive amounts of data on
computationally limited machines is one of the biggest hurdles in data science (of course,
it can be resolved with additional computing resources).
• So how can we overcome this problem? Is there a way to pick a subset of the data that is a
good representation of the entire dataset and analyze that instead? This is where the
statistical approach called “Sampling” comes in.
• “Sampling is a method that allows us to get information about the population based
on the statistics from a subset of the population (sample), without having to
investigate every individual”
• Example: When you conduct research about a group of people, it’s rarely possible to
collect data from every person in that group. Instead, you select a sample. The sample is
the group of individuals who will actually participate in the research.

Steps involved in sampling framework:

Step 1: The first stage in the sampling process is to clearly define the target population.

Step 2: Sampling Frame — It is a list of items or people forming a population from which the
sample is taken.

Step 3: Sampling Technique – Choose how the sample will be drawn. Generally, probability
sampling methods are used.

Step 4: Sample Size – It is the number of individuals or items to be taken in a sample that
would be enough to make inferences about the population with the desired level of accuracy and
precision. The larger the sample size, the more accurate our inference about the population
tends to be.

Step 5: Once the target population, sampling frame, sampling technique, and sample size have
been established, the next step is to collect data from the sample.

• Probability Sampling: In probability sampling, every element of the population has
a known, non-zero chance of being selected (an equal chance in simple random sampling).
Probability sampling gives us the best chance to create a sample that is truly
representative of the population.

• Non-Probability Sampling: In non-probability sampling, selection is not random, so some
elements have an unknown or zero chance of being selected. Consequently, there is a
significant risk of ending up with a non-representative sample which does not produce
generalizable results.

1. Probability Sampling (Everyone has a known chance of being selected)

• Simple Random Sampling: Everyone has an equal chance (like picking names from a
hat).
• Systematic Sampling: Pick every 5th, 10th, or 20th person from a list.
• Stratified Sampling: Divide people into groups (like by age or gender) and randomly
pick from each group.
• Cluster Sampling: Divide the population into clusters (like cities) and randomly select
whole clusters.

2. Non-Probability Sampling (Not everyone has a chance; selection is based on other factors)

• Convenience Sampling: Pick whoever is easiest to reach (like asking your friends).
• Judgmental Sampling: You choose who you think is best to include.
• Snowball Sampling: Existing participants refer new participants (good for finding rare
groups).
• Quota Sampling: You pick people to meet a set number for each group (like 50 men and
50 women).
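
The four probability sampling methods listed earlier can be sketched with Python's standard `random` module. The population of 100 people, the group labels, and the helper function names are hypothetical and exist only for this example.

```python
import random

rng = random.Random(42)
# Hypothetical population: 100 people, 60 in group "A" and 40 in group "B"
population = [{"id": i, "group": "A" if i < 60 else "B"} for i in range(100)]

def simple_random(pop, n):
    # Every person has the same chance of being picked
    return rng.sample(pop, n)

def systematic(pop, step):
    # Random starting point, then every step-th person on the list
    start = rng.randrange(step)
    return pop[start::step]

def stratified(pop, fraction):
    # Split into groups (strata), then sample the same fraction from each
    strata = {}
    for person in pop:
        strata.setdefault(person["group"], []).append(person)
    sample = []
    for members in strata.values():
        sample += rng.sample(members, round(len(members) * fraction))
    return sample

def cluster(pop, cluster_size, n_clusters):
    # Split into clusters, then keep every member of a few random clusters
    clusters = [pop[i:i + cluster_size] for i in range(0, len(pop), cluster_size)]
    chosen = rng.sample(clusters, n_clusters)
    return [person for c in chosen for person in c]

print(len(simple_random(population, 10)))
print([p["id"] for p in systematic(population, 10)])
print(len(stratified(population, 0.1)))
print(len(cluster(population, 10, 2)))
```

Stratified sampling keeps the 60/40 group proportions in the sample, whereas simple random sampling only matches them on average.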

PROPOSAL DISTRIBUTION
A proposal distribution (denoted q(x′∣x)) is a probability distribution used to propose a new
state x′ given the current state x in the Markov chain. The choice of proposal distribution directly
impacts the efficiency and convergence of MCMC algorithms.
Example: Robot Exploring a Maze

Imagine you have a robot in a maze. The robot is trying to find the best path to the goal, but it
doesn’t know the full layout of the maze. It can only try different moves and learn if they’re
good or bad over time.

The Goal: The robot wants to find the most likely paths (according to some hidden rules, like
shortest path or least danger). But it can’t sample these good paths directly — the maze is too
complex.

What Does the Robot Do?

Instead of guessing perfectly:

1. It uses a proposal distribution — this is like the robot saying:

“Let me randomly propose a move, like ‘turn left’ or ‘move forward 2 steps’ based on my
current location.”

2. It tries the move and checks:

“Is this move likely to help me get to the goal?”

3. If the move looks promising (based on some probability), it accepts it. If not, it rejects it
and stays in place (or tries another).

How This Connects to ML:

• The maze is like the complex target probability distribution — we don’t know its full
shape.
• The robot’s random move is like sampling from the proposal distribution — a simple
way to suggest new positions (or parameter values).
• The accept/reject step helps the robot eventually explore the most important areas of
the maze — just like in MCMC, we sample more from the “important” regions of the
target distribution.
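
The effect of the proposal distribution on efficiency can be seen in a small experiment. This sketch assumes a one-dimensional standard normal target and a Gaussian random-walk proposal q(x′∣x) = N(x, step_size²); both are choices made here for illustration, as are the three step sizes.

```python
import math
import random

def log_target(x):
    # Unnormalised log-density of a standard normal (the assumed target)
    return -0.5 * x * x

def random_walk_mh(step_size, n_samples=20_000, seed=0):
    rng = random.Random(seed)
    x, accepted, samples = 0.0, 0, []
    for _ in range(n_samples):
        proposal = rng.gauss(x, step_size)      # draw x' from q(x'|x)
        log_ratio = log_target(proposal) - log_target(x)
        # Symmetric proposal, so accept with probability min(1, p(x')/p(x))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
            accepted += 1
        samples.append(x)
    return samples, accepted / n_samples

for step in (0.05, 2.5, 50.0):
    samples, rate = random_walk_mh(step)
    mean = sum(samples) / len(samples)
    print(f"step={step:<5} acceptance={rate:.2f} mean={mean:+.2f}")
```

A very small step is accepted almost every time but moves slowly, while a very large step is rejected most of the time; a moderate step explores the target more efficiently.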

MARKOV CHAIN MONTE CARLO

Markov Chain Monte Carlo (MCMC) is a powerful technique used in statistics and various
scientific fields to sample from complex probability distributions. It is particularly useful when
directly sampling from the distribution is difficult or impossible. Here is a breakdown of the
name:

• Monte Carlo: This refers to a general approach that uses randomness to solve problems.
The name comes from the Monte Carlo casino and the element of chance in its games.
• Markov Chain: This is a sequence of random events where the probability of the next
event depends only on the current event, not the history leading up to it.

MCMC combines constructing a Markov chain and recording samples from the chain. The chain
is designed to spend more time in regions with higher probability according to the target
distribution. Then, by recording states from the chain after it has ‘warmed up’ and reached a
stable state, you effectively get samples from the target distribution.
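
One standard way to build such a chain is the Metropolis-Hastings rule. A proposed state x′ is drawn from the proposal distribution q(x′∣x) and accepted with the probability sketched below; the symbol p for the target density is notation assumed here.

```latex
\alpha(x \to x') = \min\left( 1,\; \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)} \right)
```

For a symmetric proposal, where q(x∣x′) = q(x′∣x), the ratio reduces to p(x′)/p(x), so a proposal that moves to a more probable state has α = 1.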

When the proposed state is less probable than the current one, we accept it only with some
probability α < 1.

Why Accept Worse States at All?

Accepting worse states occasionally helps the algorithm:

• Escape local maxima
• Explore the entire space
• Ensure convergence to the true target distribution

The goal is to sample from the entire target distribution, not just find its maximum.
Accepting worse states occasionally allows the chain to explore low-probability regions
and later return to higher ones, which is important for accurate sampling.

The min(1, ...) part ensures that the acceptance probability α is always between 0 and 1,
which is required because:

• A probability cannot exceed 1.

• We want to accept better states with full certainty (i.e., probability = 1).
• We want to accept worse states only with a chance less than 1.
• If the min(1, ...) part of the Metropolis-Hastings acceptance formula were removed, the
ratio could exceed 1 and would no longer be a valid probability. Capping it at 1 keeps the
acceptance rule well defined and preserves the balance condition under which the
resulting samples reflect the target distribution. Without it, the MCMC sampling
process could give incorrect, biased results.

How Is Markov Chain Monte Carlo Used In Machine Learning?

MCMC plays a crucial role in various aspects of machine learning, particularly when dealing
with complex probabilistic models or situations where direct sampling is difficult. Here are some
key ways it’s utilised:

• Bayesian Inference: Machine learning often involves estimating unknown parameters in
models based on observed data. In the Bayesian framework, these parameters are treated
as random variables with prior probability distributions. MCMC helps sample from the
posterior distribution, which combines the prior information with the likelihood of the
data, allowing for a better understanding of the parameter uncertainties and making
predictions with appropriate credible intervals.
• Model Selection: When choosing between different models, MCMC can be used to
compare their posterior probabilities by integrating over the parameter space. This helps
identify the model that best fits the data and accounts for the model’s complexity.
• Latent Variable Models: These models involve hidden variables that are not directly
observed but influence the observed data. MCMC is used to infer the posterior
distribution of these latent variables, providing insights into the underlying structure of
the data. This is crucial in techniques like dimensionality reduction and topic modelling.
• Variational Inference (VI): VI is an alternative to Markov Chain Monte Carlo rather
than a variant of it. Instead of drawing samples, VI approximates the posterior
distribution with a simpler distribution that is fitted by optimisation, which makes it
applicable when exact MCMC sampling might be computationally expensive.
• Deep Learning: Markov Chain Monte Carlo can be integrated with deep learning
techniques, particularly in Bayesian deep learning, where MCMC helps sample from the
posterior distribution of the network weights, enabling learning and uncertainty
quantification.

What Are Some Common Applications of Markov Chain Monte Carlo In AI?

Markov Chain Monte Carlo finds several applications in various aspects of artificial intelligence
(AI), particularly when dealing with complex probabilistic models or situations where direct
sampling is impractical. Here are some common areas where MCMC plays a significant role:

• Uncertainty Quantification: MCMC allows AI models, especially Bayesian neural
networks, to capture the uncertainty associated with their predictions. By generating
samples from the posterior distribution of model parameters, MCMC provides confidence
intervals and probabilistic forecasts. This enhances the reliability and decision-making
capabilities of AI systems in crucial areas like finance, healthcare and autonomous
systems.
• Generative Modelling: MCMC algorithms like Gibbs sampling are used in generative
models such as Restricted Boltzmann Machines and other energy-based models, which
aim to learn the underlying distribution of data and generate new data samples.
(Variational Autoencoders and Generative Adversarial Networks are normally trained
without MCMC.) Sampling from the model with MCMC can produce realistic and diverse
data for applications like image generation, text synthesis and drug discovery.

Advantages Of MCMC:

• Handles Complex Distributions: MCMC excels at sampling from intricate probability
distributions, even when direct sampling is impossible or inefficient. This makes it
invaluable for various applications in statistics, machine learning, and scientific
simulations.
• No Analytical Solutions Required: Unlike some methods that require deriving
analytical solutions, MCMC can operate even when such solutions are unavailable. This
provides a flexible and robust approach when dealing with challenging problems.
• Provides Uncertainty Quantification: MCMC enables the generation of samples from
the posterior distribution, allowing for the estimation of uncertainty associated with
parameters or predictions. This is crucial for building reliable and interpretable models in
various AI applications.
• Widely Applicable: MCMC finds use in diverse fields like Bayesian inference, machine
learning, physics, economics, and finance. Its versatility makes it a powerful tool for
tackling problems across various domains.
• Relatively Easy Implementation: Compared to some other advanced statistical
techniques, MCMC algorithms can be relatively straightforward to implement, especially
with readily available software libraries.

Disadvantages Of MCMC:

• Computational Cost: MCMC simulations can be computationally expensive, especially
when dealing with high-dimensional distributions or requiring high accuracy. This can
limit its applicability in situations with limited computational resources.
• Convergence Issues: Ensuring proper convergence of the Markov chain to the target
distribution is crucial. This can be challenging and requires careful monitoring and
diagnostics to avoid obtaining biased results.
• Sensitivity To Starting Point: The initial state of the Markov chain can impact the
convergence process. Choosing an inappropriate starting point can lead to slow
convergence or even getting stuck in irrelevant regions of the distribution.

• Difficulties In Assessing Convergence: Evaluating the convergence of the Markov
chain can be complex and subjective. Different tests and diagnostics are available but
they might not always provide a definitive answer and require careful interpretation.
• Not Always The Best Option: Depending on the specific problem and available
resources, other methods like gradient-based optimisation might be more efficient or
suitable alternatives to MCMC.

GIBBS SAMPLING
Gibbs sampling is a Markov Chain Monte Carlo (MCMC) method used in machine learning to
generate samples from a joint probability distribution, especially when direct sampling is
difficult. It works by iteratively sampling one variable at a time, given the current values of all
other variables, and repeating this process until a stable distribution of samples is achieved.

Example: Imagine you and your friend are sharing a cake, but the size each of you takes
depends on the other person’s slice:

• If your friend takes a big piece, you take a small one.
• If your friend takes a small piece, you take a bigger one.

But you don't decide together. Instead:

1. You guess a slice size for your friend.
2. Based on that, you choose your own slice.
3. Then your friend updates their guess based on your slice.
4. You go back and forth, adjusting your slices.

Over time, this back-and-forth settles into a stable pattern: that’s the balance, or the true
joint distribution of cake slices.

Real Life                       Gibbs Sampling

Cake slices                     Variables
Taking turns                    Sampling conditionals
Adjusting based on the other    Sampling one variable at a time
Balanced cake sharing           Converged joint distribution

How it Works:
1. Start with an initial value for each variable.
2. Iterate: For each variable:

o Sample a new value for the variable, given the current values of all other
variables.
o This sampling is done from the conditional distribution of that variable, given the
others.
3. Repeat: Repeat step 2 until a stable distribution of samples is achieved.
4. Burn-in and Thinning: Initial samples are often discarded (burn-in) so that the retained
samples are representative of the target distribution. Samples may then be thinned (only
every n-th sample is kept) to reduce autocorrelation.
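
The four steps can be sketched for a case where the conditional distributions are known exactly. The target assumed here is a pair of standard normal variables with correlation rho, for which x given y is normal with mean rho·y and variance 1 − rho²; the starting point, burn-in length, and thinning interval are arbitrary illustrative values.

```python
import random

def gibbs_bivariate_normal(rho, n_iter=20_000, burn_in=1_000, thin=5, seed=0):
    rng = random.Random(seed)
    cond_sd = (1.0 - rho * rho) ** 0.5
    x, y = 5.0, -5.0                        # step 1: (deliberately poor) start
    samples = []
    for i in range(n_iter):
        x = rng.gauss(rho * y, cond_sd)     # step 2: sample x given current y
        y = rng.gauss(rho * x, cond_sd)     #         sample y given the new x
        # step 4: drop the burn-in, then keep only every `thin`-th sample
        if i >= burn_in and (i - burn_in) % thin == 0:
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8)
n = len(samples)
mean_x = sum(x for x, _ in samples) / n
mean_y = sum(y for _, y in samples) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in samples) / n
var_x = sum((x - mean_x) ** 2 for x, _ in samples) / n
var_y = sum((y - mean_y) ** 2 for _, y in samples) / n
print("samples kept:", n)
print("estimated correlation:", round(cov / (var_x * var_y) ** 0.5, 2))
```

Although each update only looks at one variable at a time, the retained samples recover the correlation of the joint distribution.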
Why is it Useful?
• Handles Complex Distributions:
Gibbs sampling is particularly useful when dealing with high-dimensional or complex
probability distributions where direct sampling is computationally infeasible.
• Bayesian Inference:
It's a powerful tool for Bayesian inference, allowing us to draw samples from the posterior
distribution of parameters.
• Latent Variable Models:
Gibbs sampling is used in training models with latent variables, like Restricted Boltzmann
Machines (RBMs).

Examples in Machine Learning:


• Restricted Boltzmann Machines (RBMs):
Gibbs sampling is used to train RBMs by iteratively sampling between visible and hidden
layers.
• Bayesian Networks:
Gibbs sampling is well-suited for sampling from the posterior distribution of Bayesian
networks.
• Topic Modeling:
It's used in topic modeling algorithms to infer the underlying topics in a collection of
documents.

GRAPHICAL MODELS – BAYESIAN NETWORKS – MARKOV
RANDOM FIELDS

Graphical models in machine learning use graphs to represent the relationships between variables
and their dependencies, providing a visual and structured way to model complex systems. These
models are broadly categorized into Bayesian networks (directed) and Markov random fields
(undirected). They are useful for tasks like prediction, inference, and decision-making by
capturing probabilistic dependencies and allowing efficient computation.

• Nodes: Represent random variables or hypotheses in the model.
• Edges: Represent direct relationships between variables; the absence of an edge
encodes a conditional independence.
• Directed vs. Undirected: Directed models (like Bayesian networks) use arrows to
show the direction of influence, while undirected models (like Markov networks) use
lines to indicate pairwise dependencies.

Types of Graphical Models

There are two main types:

a. Bayesian Networks (Directed Graphical Models)

• Use directed edges (arrows).
• Represent causal or sequential relationships.
• Each node is conditionally independent of its non-descendants, given its parents.

Example: Imagine you're modeling the likelihood of someone getting the flu.

Weather → Has Flu → Misses Work

• If it’s cold (Weather), you might get the flu.
• If you have the flu, you’re more likely to miss work.

This network tells us:

• Flu depends on Weather.
• Missing Work depends on Flu.
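
For this chain the joint distribution factorises as P(Weather, Flu, Misses Work) = P(Weather) · P(Flu∣Weather) · P(Misses Work∣Flu). The sketch below uses that factorisation to answer two queries by enumeration. All probability values are hypothetical, chosen only to make the arithmetic concrete.

```python
# Hypothetical probabilities, chosen only for illustration
p_cold = 0.3
p_flu_given_weather = {"cold": 0.20, "warm": 0.05}
p_miss_given_flu = {True: 0.80, False: 0.10}

def joint(weather, flu, miss):
    """P(Weather, Flu, Miss) = P(Weather) * P(Flu | Weather) * P(Miss | Flu)."""
    p_w = p_cold if weather == "cold" else 1 - p_cold
    p_f = p_flu_given_weather[weather] if flu else 1 - p_flu_given_weather[weather]
    p_m = p_miss_given_flu[flu] if miss else 1 - p_miss_given_flu[flu]
    return p_w * p_f * p_m

states = [(w, f, m)
          for w in ("cold", "warm")
          for f in (True, False)
          for m in (True, False)]

# Marginal: sum the joint over every combination in which work is missed
p_miss = sum(joint(w, f, m) for w, f, m in states if m)

# Inference: probability of flu given that the person missed work
p_flu_given_miss = sum(joint(w, True, True) for w in ("cold", "warm")) / p_miss

print("P(misses work)       =", round(p_miss, 4))
print("P(flu | misses work) =", round(p_flu_given_miss, 4))
```

Only three small tables are needed instead of a full table over all eight combinations, which is the saving the graph structure provides.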

b. Markov Random Fields (Undirected Graphical Models)

• Use undirected edges.
• Represent mutual dependencies, not necessarily causal.
• Focus on the concept of cliques (fully connected subsets) and Markov properties.

Example:
Variables:

• Fever (F)
• Cough (C)
• Fatigue (T)
• Flu (U)

We’ll assume these variables are connected like this:

    F
   / \
  U - C
   \ /
    T
This is an undirected graph, meaning:

• Fever (F) is related to Flu (U) and Cough (C)
• Flu (U) is related to Cough (C) and Fatigue (T)
• Cough and Fatigue are also related
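
In a Markov random field the joint distribution is a normalised product of potential functions over cliques. Assuming the diagram has the edges F–U, F–C, U–C, U–T, and C–T, its maximal cliques are {F, U, C} and {U, C, T}, which gives the sketch below. The potential functions ψ₁ and ψ₂ and the constant Z are notation introduced here for illustration.

```latex
P(F, U, C, T) = \frac{1}{Z} \, \psi_1(F, U, C) \, \psi_2(U, C, T),
\qquad
Z = \sum_{F, U, C, T} \psi_1(F, U, C) \, \psi_2(U, C, T)
```

Unlike the conditional probabilities of a Bayesian network, the potentials need not sum to one, which is why the normalising constant Z is required.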

Why Use Graphical Models?

• Efficient inference in large systems (e.g., variable elimination, belief propagation).
• Modular representation: easier to design and debug.
• Interpretability: visualize dependencies clearly.
• Combines domain knowledge with data-driven learning.

Applications

• Natural language processing (e.g., parsing, POS tagging).
• Computer vision (e.g., object recognition, segmentation).
• Bioinformatics (e.g., gene expression analysis).
• Robotics (e.g., sensor fusion, localization).

Advantages:

• Intuitive Visualization: Graphical models provide a clear and visual representation of
complex relationships.

• Efficient Inference: They allow for efficient probabilistic inference, making it possible to
compute probabilities of events given observed data.
• Modularity: Complex systems can be built by combining simpler parts, and the graph
structure allows for efficient implementation of algorithms.

HIDDEN MARKOV MODEL


• When working with sequences of data, we often face situations where we can’t directly
see the important factors that influence the data.
• Hidden Markov Models (HMMs) help solve this problem by inferring these hidden
factors from the observable data.
• An HMM is a statistical model that describes the probabilistic relationship between a
sequence of observations and a sequence of hidden states. It is used in situations
where the underlying system that generates the observations is unknown or hidden,
hence the name “Hidden Markov Model.”
• HMMs are often used in machine learning, natural language processing (NLP), speech
recognition, and bioinformatics, especially when dealing with sequential or time-series
data.
• The relationship between the hidden states and the observations is modeled using a
probability distribution.
• An HMM consists of two types of variables: hidden states and observations.

1. Hidden Variable (a.k.a. Hidden State)

• This is the real cause or situation you're trying to figure out.
• It’s not directly visible or measurable.
• You can only guess it based on what you observe.

Example:
If you’re trying to guess the weather:

• Weather (Sunny or Rainy) is the hidden variable — you can’t see it directly.

2. Observed Variable (a.k.a. Observation)

• This is what you can see or measure.
• It's influenced or caused by the hidden variable.

Example:
If you see your friend:

• Carrying an umbrella or not is the observed variable — it's what you actually see

Concept              Example

Hidden Variable      Weather (Rainy/Sunny)
Observed Variable    Umbrella (Yes/No)

An HMM models the relationship between the hidden states and the observations using two
sets of probabilities: the transition probabilities and the emission probabilities.

1. Transition Probabilities

These describe the probability of moving from one hidden state to another.

Example:
Let’s say the hidden states are:

• Sunny
• Rainy

Then:

• P(Rainy tomorrow∣Sunny today)=0.3
• P(Sunny tomorrow∣Sunny today)=0.7

2. Emission Probabilities

These describe the probability of seeing an observation given a hidden state.

Example:
You don’t see the weather, but you see your friend:

• Carrying an umbrella
• Not carrying an umbrella

Then:

• P(Umbrella | Rainy) = 0.9
• P(Umbrella | Sunny) = 0.2

These tell you how likely each observation is, depending on the hidden state.
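
As a small sketch of how the two tables work together, the Python snippet below stores them as dictionaries and computes the chance of seeing an umbrella tomorrow given that today is Sunny. The Rainy row of the transition table (0.4 / 0.6) and the function name `prob_observation_tomorrow` are assumptions made for illustration; the notes only give the Sunny row and the two umbrella probabilities.

```python
# Transition table: P(tomorrow's weather | today's weather).
transition = {
    "Sunny": {"Sunny": 0.7, "Rainy": 0.3},  # values from the notes
    "Rainy": {"Sunny": 0.4, "Rainy": 0.6},  # assumed values, not in the notes
}

# Emission table: P(observation | weather).
# The "NoUmbrella" entries are simply 1 minus the umbrella probabilities.
emission = {
    "Sunny": {"Umbrella": 0.2, "NoUmbrella": 0.8},
    "Rainy": {"Umbrella": 0.9, "NoUmbrella": 0.1},
}


def prob_observation_tomorrow(today, observation):
    """P(observation tomorrow | hidden state today), summed over tomorrow's state."""
    return sum(
        transition[today][nxt] * emission[nxt][observation]
        for nxt in transition[today]
    )


p = prob_observation_tomorrow("Sunny", "Umbrella")
print(round(p, 2))  # 0.7 * 0.2 + 0.3 * 0.9 = 0.41
```

Each row of both tables sums to 1, because from any state the model must move to some state and must emit some observation.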

Hidden Markov Model Algorithm
The Hidden Markov Model (HMM) algorithm can be implemented using the following steps:
• Step 1: Define the state space and observation space: The state space is the set of
all possible hidden states, and the observation space is the set of all possible
observations.
• Step 2: Define the initial state distribution: This is the probability distribution
over the initial state.
• Step 3: Define the state transition probabilities: These are the probabilities of
transitioning from one state to another. This forms the transition matrix, which
describes the probability of moving from one state to another.
• Step 4: Define the observation likelihoods: These are the probabilities of
generating each observation from each state. This forms the emission matrix, which
describes the probability of generating each observation from each state.
• Step 5: Train the model: The initial, transition, and emission probabilities are
estimated using the Baum-Welch algorithm, which relies on the forward-backward
procedure. The parameters are updated iteratively until convergence.
• Step 6: Decode the most likely sequence of hidden states: Given the observed
data, the Viterbi algorithm is used to compute the most likely sequence of hidden
states. The decoded states can be used to classify sequences or detect patterns in
sequential data and, together with the model parameters, to predict future observations.
• Step 7: Evaluate the model: When the true hidden states are known, the decoded
states can be compared against them using metrics such as accuracy, precision,
recall, or F1 score.

ML Applications of HMM

Field | How HMM Helps
Speech Recognition | Models how phonemes (hidden) produce sounds (observed)
NLP | Tags parts of speech (hidden) from words (observed)
Bioinformatics | Finds genes in DNA sequences
Finance | Predicts market trends from past price movements
Activity Recognition | Determines what someone is doing from motion data

FORWARD ALGORITHM

The Forward Algorithm is used to compute the probability of an observed sequence given an
HMM. Instead of checking all possible hidden state sequences (which is computationally
expensive), it efficiently sums over them using dynamic programming.

Example:

You’re a detective.
Each day, someone tells you what they did (like “walk”, “shop”, or “clean”) — but you don’t
know the weather that day (Sunny or Rainy).
You want to figure out:
How likely is it that this person did those things, based on what you know about the
weather?

You Have a Model (HMM)

You know:

• How likely it is to start with Sunny or Rainy
• How likely the weather changes each day (Sunny → Rainy, Rainy → Sunny, etc.)
• What people usually do when it’s Sunny or Rainy (like walking more on Sunny days)

This information is your Hidden Markov Model (HMM).

The Observations

Let’s say you observe:

Day 1: walk
Day 2: shop
Day 3: clean

You want to know:

How likely is it that this sequence (walk, shop, clean) could happen?

But you don’t know the weather on any day. That’s what’s “hidden.”

What’s the Problem?

There are many possible weather combinations:

• Sunny → Sunny → Rainy
• Rainy → Sunny → Sunny
• … and more

With 2 weather states and 3 days there are 2 × 2 × 2 = 8 combinations, and the count
doubles with every extra day, so checking all of them takes too long if the sequence is long.

What Does the Forward Algorithm Do?

It solves this by:

• Starting with Day 1: “If it was Sunny, how likely was 'walk'? If it was Rainy, how likely
was 'walk'?”
• Then moving to Day 2: “If yesterday was Sunny, how likely is today Sunny or Rainy, and
how likely is 'shop'?”
• It builds up the total probability step by step for each day.
• At the end, it adds everything up to find the total chance of the full observation (walk,
shop, clean).

In Short: The Forward Algorithm is a step-by-step way to add up all the possible hidden
weather paths that could explain what you saw, without checking every single path one by
one.

Steps of the Forward Algorithm :

Let:

• N = number of hidden states (like Sunny, Rainy)
• T = number of observations (like walk, shop, clean)
• A[i][j] = probability of moving from state i to state j
• B[j][O_t] = probability of observing O_t in state j
• π[j] = initial probability of starting in state j
• α[t][j] = joint probability of seeing the first t observations and being in state j at time t

Step 1: Initialization (t = 1)

For each state j:

α[1][j] = π[j] × B[j][O₁]

(Start in state j and observe the first symbol)

Step 2: Recursion (t = 2 to T)

For each time t and each state j:

α[t][j] = (Σ_i α[t-1][i] × A[i][j]) × B[j][O_t]

(Sum over all previous states i that can lead to state j, then multiply by the chance of
observing O_t in state j)

Step 3: Termination

At the final time step T, add up all the final probabilities:

P(O | model) = Σ_j α[T][j]

(The total probability of the observed sequence)

In Simple Words:

• Start with: "How likely is the first observation in each state?"
• Then, for each day:
o "How likely is it that I got to this state from previous states?"
o "How likely is today’s observation in this state?"
• Finally, add up the results from all final states.
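
The three steps above can be sketched in plain Python as follows. All numbers are hypothetical: the notes give no probabilities for the walk/shop/clean example, so `pi`, `A`, and `B` below are assumed values (only the Sunny transition row matches the earlier 0.7 / 0.3 example). The `brute_force` helper is included only to show that summing over every weather path gives the same answer.

```python
from itertools import product

states = ["Sunny", "Rainy"]

# Assumed model parameters (illustrative only).
pi = {"Sunny": 0.6, "Rainy": 0.4}
A = {
    "Sunny": {"Sunny": 0.7, "Rainy": 0.3},
    "Rainy": {"Sunny": 0.4, "Rainy": 0.6},
}
B = {
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
}


def forward(observations):
    """Return P(observations | model) and the table of alpha values."""
    # Step 1: initialization with the first observation.
    alpha = [{j: pi[j] * B[j][observations[0]] for j in states}]
    # Step 2: recursion over the remaining observations.
    for obs in observations[1:]:
        prev = alpha[-1]
        alpha.append({
            j: sum(prev[i] * A[i][j] for i in states) * B[j][obs]
            for j in states
        })
    # Step 3: termination, summing over the final states.
    return sum(alpha[-1].values()), alpha


def brute_force(observations):
    """Same probability, computed by enumerating every hidden path (slow)."""
    total = 0.0
    for path in product(states, repeat=len(observations)):
        p = pi[path[0]] * B[path[0]][observations[0]]
        for t in range(1, len(observations)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][observations[t]]
        total += p
    return total


likelihood, alpha = forward(["walk", "shop", "clean"])
print(round(likelihood, 5))  # 0.03564
```

The forward version does about N × N multiplications per day, while the brute-force version visits all N^T paths, which is why the dynamic-programming form scales to long sequences.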

VITERBI ALGORITHM

The Viterbi Algorithm is a dynamic programming algorithm used to find the most probable
sequence of hidden states (called the Viterbi path) in a Hidden Markov Model (HMM), given
a sequence of observations.

The Viterbi Algorithm is used in Hidden Markov Models (HMMs) to solve the decoding
problem, which means:

Given a sequence of observations, what is the most likely sequence of hidden states that
generated it?

Why Do We Need It?

In an HMM, multiple hidden states can produce the same observation. This creates ambiguity
about which state produced the observation. So, the problem becomes:

Given a sequence of observations, which hidden state sequence is the most likely sequence
that generated these observations?

Example of Ambiguity:

Imagine a weather model with two hidden states:

• Sunny
• Rainy

Now, consider the observation sequence:

• Walk, Walk, Shop

Both Sunny and Rainy could produce similar sequences of observations:

• Sunny could produce "Walk, Walk, Shop" because people tend to walk or shop on sunny
days.
• Rainy could also produce "Walk, Walk, Shop" because people might walk quickly or
shop to avoid the rain.

But the probability of being in Sunny or Rainy at each time step, and the transition
probabilities between states, are different.

So, even though both states can result in similar observations, we need to figure out which state
sequence (Sunny → Sunny → Rainy, Rainy → Sunny → Sunny, etc.) is more probable over the
entire observation sequence.

How the Viterbi Algorithm Helps:

• The Viterbi Algorithm helps by finding the most probable sequence of hidden states
(even if some states can produce similar observations) by considering:
o Transition probabilities (probability of moving from one state to another)
o Emission probabilities (probability of an observation occurring given a state)

Thus, it takes into account both the likelihood of observations given states and the likelihood
of state transitions.

What It Does:

• The algorithm efficiently searches for the most probable path (sequence of states)
through a trellis (a time vs. state graph), using dynamic programming.
• It avoids recalculating the same probabilities repeatedly by:
1. Storing the maximum probability of reaching each state at each time.
2. Keeping track of the path that led to that max probability.

Steps of the Viterbi Algorithm

Let’s say:

• You observe: Walk, Shop, Clean
• The hidden states could be: Sunny and Rainy
• You have some data:
o Probability of starting in each state (Sunny or Rainy)
o Probability of going from one state to another
o Probability of each activity happening in each state

Step 1: Start with the first observation

For each state:

• Multiply:
o The probability of starting in that state
o By the probability of that state producing the first observation

Save that value — it tells us how likely it is to start in that state and see the first observation.

Step 2: Move through the rest of the observations

For each new observation and each state:

• For each possible previous state:
o Multiply:
▪ The best-path probability saved for the previous state
▪ By the probability of moving to the current state
▪ By the probability of the current observation in the current state
• Choose the highest of those possibilities; that is the best path to the current state at this
time step
• Remember which previous state gave that max value (this is used to trace back the best
path)

Step 3: Finish (last observation)

• Look at the last step and pick the state with the highest probability
• That state is the last step in the best path

Step 4: Trace back the best path

• Use the remembered “best previous states” from each step to trace back and find the full
sequence of states

You now have:

• The most likely sequence of hidden states (like: Sunny → Sunny → Rainy)
• The probability of that sequence
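
A minimal Python sketch of the four steps is shown below. The probabilities are hypothetical values chosen for illustration (the notes do not give numbers for this example), and the function name `viterbi` and the `backpointers` list are naming choices made here.

```python
states = ["Sunny", "Rainy"]

# Assumed model parameters (illustrative only).
pi = {"Sunny": 0.6, "Rainy": 0.4}
A = {
    "Sunny": {"Sunny": 0.7, "Rainy": 0.3},
    "Rainy": {"Sunny": 0.4, "Rainy": 0.6},
}
B = {
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
}


def viterbi(observations):
    """Return the most likely hidden state path and its probability."""
    # Step 1: best-path probability of each state for the first observation.
    delta = [{j: pi[j] * B[j][observations[0]] for j in states}]
    backpointers = []
    # Step 2: for each later observation keep only the best way into each state.
    for obs in observations[1:]:
        prev = delta[-1]
        scores, pointers = {}, {}
        for j in states:
            best_prev = max(states, key=lambda i: prev[i] * A[i][j])
            pointers[j] = best_prev
            scores[j] = prev[best_prev] * A[best_prev][j] * B[j][obs]
        delta.append(scores)
        backpointers.append(pointers)
    # Step 3: pick the best final state.
    last = max(states, key=lambda j: delta[-1][j])
    # Step 4: trace back through the remembered best previous states.
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[-1][last]


path, prob = viterbi(["walk", "shop", "clean"])
print(path, round(prob, 5))  # ['Sunny', 'Rainy', 'Rainy'] 0.01296
```

The only difference from the forward recursion is that `max` replaces the sum, plus the bookkeeping needed for the trace-back.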

BAUM–WELCH ALGORITHM
The Baum-Welch algorithm is a special case of the Expectation-Maximization (EM) algorithm used
to train Hidden Markov Models (HMMs). Because its E-step relies on the forward-backward
procedure, it is often referred to by that name as well. It iteratively refines the HMM parameters
to increase the likelihood of the observed data given the model. The algorithm alternates between
an E-step (expectation) and an M-step (maximization) to estimate the model's parameters, such as
the transition and emission probabilities.

Example:
Scenario: Imagine a simple HMM where the weather (Sunny or Rainy) is hidden, and we can
only observe whether someone is carrying an umbrella or not.

Model:
• Hidden States: Sunny (S), Rainy (R)
• Observations: Umbrella (U), No Umbrella (N)
• Initial State Probabilities: π = (π_S, π_R)
• Transition Matrix: A = [[p(S|S), p(R|S)], [p(S|R), p(R|R)]]
• Emission Matrix: B = [[p(U|S), p(N|S)], [p(U|R), p(N|R)]]
Baum-Welch Algorithm Steps:
1. Initialization:
Start with random guesses for π, A, and B.
2. E-Step (Expectation):
o Use the forward and backward algorithms (together called the forward-backward
procedure) to calculate, for the observed umbrella/no umbrella sequence, the
probability of being in each hidden state at each time step and of each
state-to-state transition between consecutive time steps.
o Specifically, the forward algorithm computes the joint probability of the observations
up to a given time step and a hidden state at that time. The backward algorithm
calculates the probability of the observations after a given time step up to the end,
given the hidden state at that time.
3. M-Step (Maximization):
o Update the HMM parameters (π, A, and B) based on the calculated probabilities
from the E-step.
o For example, the new initial probability of a state is the product of its forward
and backward probabilities at the first time step (t = 1), normalized by the sum of
these products over all states.
4. Iteration:
Repeat the E-step and M-step until the model parameters converge (the changes in
parameter values become negligible).
Example with Data:

Let's say we observe the sequence: Umbrella, No Umbrella, Umbrella.


• E-step:
The forward and backward algorithms compute, for each day, how probable Sunny and Rainy
are given the whole observation sequence. This implicitly accounts for every hidden
state sequence (e.g., S, S, S; S, R, S; etc.) without listing them one by one.
• M-step:
Based on these probabilities, the algorithm re-estimates π, A, and B. For example, if the
current emission probabilities say an umbrella is much more likely on rainy days, the first
observation (Umbrella) tends to raise the estimate of π_R and lower π_S.
In essence, the Baum-Welch algorithm iteratively refines the HMM parameters so that the model
assigns a higher likelihood to the observed data. It converges to a local maximum of the
likelihood, so the result can depend on the initial guesses.
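
The sketch below runs a few Baum-Welch iterations on the sequence Umbrella, No Umbrella, Umbrella using only the Python standard library. It is a minimal single-sequence version under stated assumptions: states and observations are encoded as indices (0 = Sunny, 1 = Rainy; 0 = Umbrella, 1 = No Umbrella), the starting values of `pi`, `A`, and `B` are arbitrary guesses, and the names `gamma` and `xi` follow common textbook notation rather than anything in these notes. No scaling is applied, so it is only suitable for short sequences.

```python
def forward(obs, pi, A, B):
    """alpha[t][i] = P(first t+1 observations, state i at time t)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha


def backward(obs, A, B):
    """beta[t][i] = P(observations after time t | state i at time t)."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(obs, pi, A, B):
    """One E-step plus M-step; also returns the likelihood under the old parameters."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    likelihood = sum(alpha[-1])
    # E-step: gamma[t][i] = P(state i at t | obs); xi[t][i][j] = P(i at t, j at t+1 | obs).
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    # M-step: re-estimate the parameters from the expected counts.
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, likelihood


obs = [0, 1, 0]                      # Umbrella, No Umbrella, Umbrella
pi = [0.5, 0.5]                      # assumed starting guess
A = [[0.7, 0.3], [0.4, 0.6]]         # assumed starting guess
B = [[0.2, 0.8], [0.9, 0.1]]         # assumed starting guess

history = []
for _ in range(10):
    pi, A, B, likelihood = baum_welch_step(obs, pi, A, B)
    history.append(likelihood)

print([round(x, 4) for x in history])
```

The recorded likelihoods never decrease from one iteration to the next, which is the defining property of EM. With a single three-symbol sequence the model simply memorizes the data, so in practice many or longer sequences are used.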

COMPARISON OF THE FORWARD, VITERBI, AND BAUM-WELCH ALGORITHMS IN THE CONTEXT OF HIDDEN MARKOV MODELS:

Aspect | Forward Algorithm | Viterbi Algorithm | Baum-Welch Algorithm
Purpose | Compute the probability of an observation sequence given the model | Find the most likely sequence of hidden states (decoding) | Train HMM parameters to maximize the likelihood of the observations
Stage | Evaluation | Decoding | Learning (Training)
Output | Probability of the observed sequence | Most probable hidden state sequence | Updated initial, transition, and emission probabilities
Method | Dynamic programming (summing over paths) | Dynamic programming (taking max over paths) | Expectation-Maximization (EM) using Forward-Backward steps
Dependency | Uses only model parameters and observations | Uses model parameters and observations; its recursion mirrors the forward recursion with max in place of sum | Uses both Forward and Backward algorithms for probability estimates
Example Use | "What is the likelihood of seeing the sequence ‘walk, shop, clean’?" | "What is the most likely weather pattern behind ‘walk, shop, clean’?" | "Given many sequences like ‘walk, shop, clean’, what are the most likely transition/emission probabilities?"

TRACKING METHODS

In machine learning and robotics, tracking methods are used to estimate the state of a system
over time, especially when that state is partially observed and noisy. Common applications
include:

• Object tracking in videos
• Navigation systems (e.g., GPS)
• Sensor fusion
• Autonomous vehicles

Two popular tracking techniques are the Kalman Filter and the Particle Filter.

KALMAN FILTER

The Kalman Filter is a recursive algorithm used to track or estimate the state of
something over time, especially when the data is noisy or uncertain. It works best for linear
systems with Gaussian noise, where it gives the optimal (minimum mean-squared-error) estimate.

How It Works:

The Kalman Filter estimates the current state of a system using a two-step process:

1. Prediction Step:
o Predict the current state from the previous state using a motion model.
o Predict the current uncertainty (covariance) as well.
2. Update Step:
o Get a new observation (measurement).
o Combine prediction and observation using a weighted average, giving more
weight to the more certain information.
o Update the estimate and reduce uncertainty.

Example: Let’s say you're tracking a car using a GPS. The GPS gives noisy data. The Kalman
Filter helps you:

1. Predict where the car should be based on its last known position and speed.
2. Update that guess using the new (noisy) GPS reading.
3. Combine both in a smart way to get a better estimate of the car's actual location.

It repeats this process every time new data comes in.

Goal of Kalman Filter: To predict the next state of a system (like the position of a car), and
then update the prediction using noisy measurements (like GPS data), in the smartest possible
way.

STEP 1 to STEP 5: [equation figures not recoverable from the source; each step is summarized in the table below]

Step | What It Does | Why It Matters
1. Predict state | Guess where the system is going | Keeps track of motion or change
2. Predict uncertainty | Guess how confident we are | Helps know how reliable the guess is
3. Kalman Gain | Balance between prediction and data | Chooses what to trust more
4. Update state | Correct the guess with real data | Improves accuracy
5. Update uncertainty | Recalculate confidence after update | Gets more certain over time (ideally)
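
For reference, the five steps in the table above are usually written as the equations below. This is the standard textbook form of the linear Kalman filter, not a transcription of the original figures. The symbols are notation assumed here: F is the state-transition (motion) matrix, B the control matrix, u_k the control input, H the measurement matrix, Q the process-noise covariance, R the measurement-noise covariance, and z_k the measurement at step k.

```latex
\begin{aligned}
\text{1. Predict state:}        \quad & \hat{x}_{k|k-1} = F\,\hat{x}_{k-1|k-1} + B\,u_k \\
\text{2. Predict uncertainty:}  \quad & P_{k|k-1} = F\,P_{k-1|k-1}\,F^{\top} + Q \\
\text{3. Kalman gain:}          \quad & K_k = P_{k|k-1}\,H^{\top}\left(H\,P_{k|k-1}\,H^{\top} + R\right)^{-1} \\
\text{4. Update state:}         \quad & \hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k\left(z_k - H\,\hat{x}_{k|k-1}\right) \\
\text{5. Update uncertainty:}   \quad & P_{k|k} = \left(I - K_k\,H\right)P_{k|k-1}
\end{aligned}
```

When the measurement noise R is large the gain K_k becomes small and the filter trusts its prediction; when the predicted uncertainty is large the gain grows and the filter trusts the measurement.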

Pros:

• Computationally efficient (only uses matrix operations).
• Works well for Gaussian, linear problems.

Cons:

• Not suitable for non-linear systems or non-Gaussian noise.
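
To make the predict/update cycle concrete, here is a minimal one-dimensional sketch in Python for the car example. It assumes the car moves at a known constant speed, so the state is just a position and every matrix collapses to a single number. The GPS readings, the noise settings `q` and `r`, the starting guess, and the function name `kalman_1d` are all made up for illustration.

```python
def kalman_1d(measurements, velocity, dt, q, r, x0, p0):
    """Scalar Kalman filter: returns a list of (estimate, variance) per measurement.

    q is the process-noise variance, r the measurement-noise variance,
    x0 / p0 the initial position guess and its variance.
    """
    x, p = x0, p0
    results = []
    for z in measurements:
        # 1. Predict state: move the car forward at its known speed.
        x = x + velocity * dt
        # 2. Predict uncertainty: motion adds process noise.
        p = p + q
        # 3. Kalman gain: how much to trust the measurement vs. the prediction.
        k = p / (p + r)
        # 4. Update state: correct the prediction with the GPS reading.
        x = x + k * (z - x)
        # 5. Update uncertainty: the estimate is now more certain.
        p = (1.0 - k) * p
        results.append((x, p))
    return results


# Hypothetical data: true positions are 10, 20, 30, 40, 50 m (speed 10 m/s, 1 s steps).
gps = [11.2, 18.9, 31.5, 39.0, 50.8]
results = kalman_1d(gps, velocity=10.0, dt=1.0, q=0.1, r=4.0, x0=0.0, p0=100.0)

for step, (x, p) in enumerate(results, start=1):
    print(step, round(x, 2), round(p, 3))
```

Starting with a very uncertain guess (`p0 = 100`), the first gain is close to 1, so the filter almost copies the first GPS reading. As the variance shrinks the gain drops and later readings only nudge the estimate.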

PARTICLE FILTER

• A Particle Filter is a probabilistic algorithm used to estimate the state of a system over
time by representing it with a set of many random samples (called particles) and
updating them based on new data.
• Each particle is a possible guess of the true state, and the algorithm uses a weighting
and resampling process to keep the best guesses and discard the bad ones.
• It is designed for non-linear and non-Gaussian systems.

Example: You're tracking where a robot is in a room. But you don’t know exactly where it is —
you only have a noisy sensor (like a blurry camera or weak GPS).

You want to figure out:


“Where is the robot right now?”

How it works:

1. Initialization – Start with 1000 random guesses of where the robot might be (particles).
2. Prediction – Move each guess based on the robot’s movement (e.g., it moved forward).
3. Update (Weighting) – Check how well each guess matches the new sensor reading.
o If it matches well, it gets a high weight.
o If not, low weight.
4. Resample – Draw a new set of guesses from the current ones, picking each with probability
proportional to its weight, so high-weight guesses are copied and low-weight ones tend to disappear.
o The new set of guesses is concentrated around the best ones.

Repeat this every time the robot moves and sends a new sensor reading.
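
The loop above can be sketched for a robot moving along a line (a one-dimensional simplification of the room) using only the Python standard library. The sensor readings, the noise levels, the 10-metre corridor, and the function name `particle_filter_step` are assumptions for illustration; the sensor is modelled with a Gaussian likelihood, which is one common choice and not the only one.

```python
import math
import random


def particle_filter_step(particles, move, measurement, motion_noise, sensor_noise, rng):
    """One predict / weight / resample cycle for a 1-D position."""
    # Prediction: shift every particle by the commanded move plus random motion noise.
    predicted = [p + move + rng.gauss(0.0, motion_noise) for p in particles]
    # Update (weighting): particles close to the sensor reading get high weights.
    weights = [math.exp(-0.5 * ((measurement - p) / sensor_noise) ** 2)
               for p in predicted]
    if sum(weights) == 0.0:
        # Every particle is far from the reading: fall back to equal weights.
        weights = [1.0] * len(predicted)
    # Resample: draw particles with probability proportional to their weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Initialization: 1000 random guesses spread over a 10 m corridor.
particles = [rng.uniform(0.0, 10.0) for _ in range(1000)]

# Hypothetical run: the robot moves 1 m per step and ends near 7 m.
moves = [1.0, 1.0, 1.0, 1.0, 1.0]
readings = [3.2, 3.9, 5.1, 5.8, 7.1]  # noisy position readings

for move, reading in zip(moves, readings):
    particles = particle_filter_step(particles, move, reading,
                                     motion_noise=0.1, sensor_noise=0.5, rng=rng)

estimate = sum(particles) / len(particles)
print(round(estimate, 2))
```

The final estimate is the average of the surviving particles. Using more particles gives a smoother estimate at the cost of more computation per step.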

Term | What it means in particle filter
Particle | A possible guess of the true state
Weight | Confidence in that guess (based on sensor)
Resampling | Keeping better guesses, dropping bad ones
Iteration | Repeat for every new observation

Pros:

• Works well with complex, non-linear systems.
• Can approximate any distribution, given enough particles.

Cons:

• Computationally expensive.
• Requires many particles for accurate estimates.

KALMAN FILTER VS PARTICLE FILTER


Feature | Kalman Filter | Particle Filter
Basic idea | One smart prediction + correction | Many random guesses weighted and updated
Model assumption | Linear motion and Gaussian noise | Can handle non-linear motion and any noise
State representation | A single mean and uncertainty (matrix) | A set of particles (many possible states)
Noise handling | Assumes noise is normal (bell curve) | Handles any type of noise
Accuracy | Very accurate for simple systems | Very accurate for complex and noisy systems
Computation | Fast, efficient | Slower, needs more processing (many particles)
Flexibility | Low (works best when assumptions are true) | High (can work in messy, real-world problems)
Memory use | Low (just matrices) | High (stores lots of particles)
Resilience to non-linearity | Poor | Excellent
Example use | GPS + speed tracking in cars, radar | Robot navigation, visual tracking in video
