PR Unit 1 2
Example: consider a face; the eyes, ears, nose, etc. are features of the face.
A set of features taken together forms a feature vector.
Applications:
Mathematics plays a crucial role in pattern recognition, providing the theoretical foundation
and practical tools for identifying patterns in data. Here are some key mathematical concepts
and techniques used in pattern recognition:
1. Linear Algebra
Vectors and Matrices: Represent data and transformations. Data points are often represented
as vectors, and transformations like rotations and scaling are represented by matrices.
Eigenvalues and Eigenvectors: Used in Principal Component Analysis (PCA) for
dimensionality reduction and feature extraction.
Singular Value Decomposition (SVD): Another method for dimensionality reduction and
data compression.
2. Probability and Statistics
Probability Distributions: Used to model uncertainties in data. Common distributions include
Gaussian (Normal), Poisson, and Binomial.
Bayesian Inference: Incorporates prior knowledge with observed data to make predictions.
Bayes' theorem is fundamental in probabilistic approaches to pattern recognition.
Hypothesis Testing and Confidence Intervals: Used to make inferences about populations
based on sample data.
3. Optimization
Gradient Descent: An iterative method for finding the minimum of a function, widely used in
training machine learning models (a short sketch follows this list).
Convex Optimization: Techniques for optimizing convex functions, ensuring global optima.
Important in Support Vector Machines (SVMs) and logistic regression.
Non-Convex Optimization: Used in training deep neural networks, where the loss function is
often non-convex.
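As a concrete illustration of the gradient-descent item above, here is a minimal NumPy sketch (hypothetical quadratic objective, not from the text) that steps against the gradient until the updates become negligible:

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, tol=1e-8, max_iter=1000):
    """Minimise a function given its gradient by repeated small steps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = lr * grad(x)
        x = x - step                      # move against the gradient
        if np.linalg.norm(step) < tol:    # stop when updates become tiny
            break
    return x

# Example: minimise f(x, y) = (x - 3)^2 + (y + 1)^2, whose gradient is (2(x - 3), 2(y + 1)).
grad_f = lambda v: np.array([2 * (v[0] - 3), 2 * (v[1] + 1)])
print(gradient_descent(grad_f, x0=[0.0, 0.0]))   # approaches [3, -1]
```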
4. Transformations and Feature Extraction
Fourier Transform: Converts data from the time domain to the frequency domain, useful in
signal processing.
Wavelet Transform: Decomposes data into different frequency components, maintaining both
spatial and frequency information.
Principal Component Analysis (PCA): Reduces dimensionality by transforming data into a
set of orthogonal components that capture the most variance.
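To make the PCA step concrete, here is a minimal NumPy sketch (random toy data, assumed for illustration) that centres the data, eigendecomposes the covariance matrix, and projects onto the top components:

```python
import numpy as np

def pca(X, n_components=2):
    """Project data onto the directions of maximum variance."""
    Xc = X - X.mean(axis=0)                    # centre each feature
    cov = np.cov(Xc, rowvar=False)             # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigen-decomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]          # sort components by variance explained
    components = eigvecs[:, order[:n_components]]
    return Xc @ components                     # coordinates in the reduced space

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # toy 5-dimensional data
print(pca(X, n_components=2).shape)            # (100, 2)
```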
Classification
Classification is a process of categorizing data or objects into predefined classes or
categories based on their features or attributes.
Machine Learning classification is a type of supervised learning technique where an
algorithm is trained on a labeled dataset to predict the class or category of new, unseen data.
The main objective of classification machine learning is to build a model that can accurately
assign a label or category to a new observation based on its features.
For example, a classification model might be trained on a dataset of images labeled as either
dogs or cats and then used to predict the class of new, unseen images of dogs or cats based
on their features such as color, texture, and shape.
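A hedged sketch of this train-then-predict workflow, assuming scikit-learn is available and substituting its bundled iris measurements for the dog/cat images:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labelled data: features (measurements) and class labels (species).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train a classifier on the labelled examples ...
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ... then predict the class of new, unseen observations.
print(clf.predict(X_test[:5]))
print("accuracy:", clf.score(X_test, y_test))
```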
Classification Types
There are two main classification types in machine learning:
Binary Classification
In binary classification, the goal is to classify the input into one of two classes or categories.
Example – On the basis of the given health conditions of a person, we have to determine
whether the person has a certain disease or not.
Multiclass Classification
In multi-class classification, the goal is to classify the input into one of several classes or
categories. For Example – On the basis of data about different species of flowers, we have to
determine which species our observation belongs to.
Classification Algorithms
There are various types of classification algorithms. Some of them are:
Linear Classifiers
Linear models create a linear decision boundary between classes. They are simple and
computationally efficient. Some of the linear classification models are as follows:
Logistic Regression
Support Vector Machines having kernel = ‘linear’
Single-layer Perceptron
Stochastic Gradient Descent (SGD) Classifier
Non-linear Classifiers
Non-linear models create a non-linear decision boundary between classes. They can capture
more complex relationships between the input features and the target variable. Some of the
non-linear classification models are as follows:
K-Nearest Neighbours
Kernel SVM
Naive Bayes
Decision Tree Classification
Random Forests
Bayes Rules
Bayes' Rule is the most important rule in data science. It is the mathematical rule that
describes how to update a belief, given some evidence. In other words – it describes the act
of learning.
The equation: Posterior = Prior x (Likelihood / Marginal probability), i.e. P(H|E) = P(H) x P(E|H) / P(E).
There are four parts: the posterior probability, the prior probability, the likelihood, and the marginal probability of the evidence.
It is named after Thomas Bayes, an 18th century English theologian and mathematician.
Bayes originally wrote about the concept, but it did not receive much attention during his
lifetime.
French mathematician Pierre-Simon Laplace independently published the rule in his 1814
work Essai philosophique sur les probabilités.
Today, Bayes' Rule has numerous applications, from statistical analysis to machine learning.
Conditional probability
The first concept to understand is conditional probability.
You may already be familiar with probability in general. It lets you reason about uncertain
events with the precision and rigour of mathematics.
Conditional probability is the bridge that lets you talk about how multiple uncertain events
are related. It lets you talk about how the probability of an event can vary under different
conditions.
For example, consider the probability of winning a race, given the condition you didn't sleep
the night before. You might expect this probability to be lower than the probability you'd win
if you'd had a full night's sleep.
Or, consider the probability that a suspect committed a crime, given that their fingerprints are
found at the scene. You'd expect the probability they are guilty to be greater, compared with
had their fingerprints not been found.
P(A|B)
Which is read as "the probability of event A occurring, given event B occurs".
An important thing to remember is that conditional probabilities are not the same as their
inverses.
That is, the "probability of event A given event B" is not the same thing as the "probability
of event B, given event A".
The probability of clouds, given it is raining (100%) is not the same as the probability it is
raining, given there are clouds.
(Insert joke about British weather).
Bayes' Rule in detail
Bayes' Rule tells you how to calculate a conditional probability with information you already
have.
It is helpful to think in terms of two events – a hypothesis (which can be true or false) and
evidence (which can be present or absent).
However, it can be applied to any type of events, with any number of discrete or
continuous outcomes.
Bayes' Rule lets you calculate the posterior (or "updated") probability. This is a conditional
probability. It is the probability of the hypothesis being true, if the evidence is present.
Think of the prior (or "previous") probability as your belief in the hypothesis before seeing
the new evidence. If you had a strong belief in the hypothesis already, the prior probability
will be large.
The prior is multiplied by a fraction. Think of this as the "strength" of the evidence. The
posterior probability is greater when the top part (numerator) is big, and the bottom part
(denominator) is small.
The numerator is the likelihood. This is another conditional probability. It is the probability
of the evidence being present, given the hypothesis is true.
This is not the same as the posterior!
Remember, the "probability of the evidence being present given the hypothesis is true" is not
the same as the "probability of the hypothesis being true given the evidence is present".
Now look at the denominator. This is the marginal probability of the evidence. That is, it is
the probability of the evidence being present, whether the hypothesis is true or false. The
smaller the denominator, the more "convincing" the evidence.
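A small numerical sketch of these four parts, using made-up numbers for a disease-test scenario (every probability below is an assumption chosen for illustration):

```python
# Hypothesis H: the patient has the disease.  Evidence E: the test is positive.
prior = 0.01            # P(H): assumed base rate of the disease
likelihood = 0.95       # P(E|H): chance of a positive test when the disease is present
false_positive = 0.05   # P(E|not H): chance of a positive test when the disease is absent

# Marginal probability of the evidence: positive tests from both true and false cases.
marginal = likelihood * prior + false_positive * (1 - prior)

# Bayes' Rule: posterior = prior * likelihood / marginal.
posterior = prior * likelihood / marginal
print(round(posterior, 3))   # roughly 0.161 with these assumed numbers
```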
Clustering
Clustering is the task of dividing unlabeled data points into different clusters such that data
points in the same cluster are more similar to each other than to those in other clusters. In
simple words, the aim of the clustering process is to segregate groups with similar traits and
assign them into clusters.
Let’s understand this with an example. Suppose you are the head of a rental store and
wish to understand the preferences of your customers to scale up your business. Is it
possible for you to look at the details of each customer and devise a unique business
strategy for each one of them? Definitely not. But what you can do is cluster all of your
customers into, say, 10 groups based on their purchasing habits and use a separate
strategy for customers in each of these 10 groups. This is what clustering does.
Now that we understand what clustering is, let’s take a look at its different types.
• Hard Clustering: Each input data point either fully belongs to a cluster or not. For
instance, in the example above, every customer is assigned to one group out of the
ten.
• Soft Clustering: Rather than assigning each input data point to a distinct cluster, it
assigns a probability or likelihood of the data point being in those clusters. For
example, in the given scenario, each customer receives a probability of being in
any of the ten retail store clusters.
Since the task of clustering is subjective, there are many means that can be used for
achieving this goal. Every methodology follows a different set of rules for defining the
‘similarity’ among data points. In fact, more than 100 clustering algorithms are known,
but only a few are popularly used. Let’s look at them in detail:
Connectivity Models
As the name suggests, these models are based on the notion that the data points closer in
data space exhibit more similarity to each other than the data points lying farther away.
These models can follow two approaches. In the first approach, they start by classifying
all data points into separate clusters & then aggregating them as the distance decreases.
In the second approach, all data points are classified as a single cluster and then
partitioned as the distance increases. Also, the choice of distance function is subjective.
These models are very easy to interpret but lack scalability for handling big datasets.
Examples of these models are the hierarchical clustering algorithms and their variants.
Centroid Models
These clustering algorithms iterate, deriving similarity from the proximity of a data point
to the centroid or cluster center. The k-Means clustering algorithm, a popular example,
falls into this category. These models necessitate specifying the number of clusters
beforehand, requiring prior knowledge of the dataset. They iteratively run to discover
local optima.
Distribution Models
These clustering models are based on the notion of how probable it is that all data points
in the cluster belong to the same distribution (For example: Normal, Gaussian). These
models often suffer from overfitting. A popular example of these models is the
Expectation-maximization algorithm which uses multivariate normal distributions.
Density Models
These models search the data space for areas of varying density of data points. They
isolate different dense regions and assign the data points within these
regions to the same cluster. Popular examples of density models are DBSCAN and
OPTICS. These models are particularly useful for identifying clusters of arbitrary shape
and detecting outliers, as they can detect and separate points that are located in sparse
regions of the data space, as well as points that belong to dense regions.
Now I will be taking you through two of the most popular clustering algorithms in detail
– K Means and Hierarchical. Let’s begin.
K Means Clustering
K-means is an iterative clustering algorithm that refines the cluster assignment in each
iteration and converges to a local optimum. The algorithm works in these 5 steps:
Step 1:
Specify the desired number of clusters K: Let us choose k=2 for these 5 data points in
2-D space.
Step 2:
Randomly assign each data point to a cluster: Let’s assign three points in cluster 1,
shown using red color, and two points in cluster 2, shown using grey color.
Step 3:
Compute cluster centroids: The centroid of data points in the red cluster is shown using
the red cross, and those in the grey cluster using a grey cross.
Step 4:
Re-assign each point to the closest cluster centroid: Note that only the data point at the
bottom is assigned to the red cluster, even though it’s closer to the centroid of the grey
cluster. Thus, we assign that data point to the grey cluster.
Step 5:
Re-compute cluster centroids: Now, re-computing the centroids for both clusters.
Repeat steps 4 and 5 until no improvements are possible: Similarly, we’ll repeat the 4th
and 5th steps until the algorithm converges to a local optimum, i.e., when there is no
further switching of data points between the two clusters for two successive repeats. This
marks the termination of the algorithm if a stopping criterion is not explicitly mentioned.
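A minimal NumPy sketch of these five steps (toy 2-D points chosen for illustration; in practice a library implementation such as scikit-learn's KMeans would normally be used):

```python
import numpy as np

def k_means(X, k=2, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # Steps 1-3: initial centroids
    for _ in range(max_iter):
        # Step 4: assign each point to its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 5: recompute centroids; stop when they no longer change.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8]])
labels, centroids = k_means(X, k=2)
print(labels)
print(centroids)
```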
In machine learning, linear algebra operations are used extensively in various stages, from
data preprocessing to model training and evaluation. For instance, operations such as matrix
multiplication, eigenvalue decomposition, and singular value decomposition are pivotal in
dimensionality reduction techniques like Principal Component Analysis (PCA). Similarly,
the concepts of vector spaces and linear transformations are integral to understanding neural
networks and optimization algorithms.
Vector spaces, eigenvalues, and eigenvectors play significant roles in pattern recognition,
providing mathematical tools to analyze and understand patterns in data. Here's how they
relate to pattern recognition along with some mathematical rules:
Definition: A vector space is a set of vectors over a field (such as real numbers) that satisfies
certain properties, including closure under addition and scalar multiplication.
Mathematical Rules:
1. Closure under Addition and Scalar Multiplication: For any vectors u and v in the
vector space and any scalar k, u + v and k·u are also in the vector space.
2. Vector Addition: Addition of vectors is commutative and associative.
3. Scalar Multiplication: Scalar multiplication distributes over vector addition.
4. Zero Vector: Every vector space contains a zero vector, denoted as 0, which acts as
an additive identity.
Eigenvalues and Eigenvectors
Definition: For a square matrix A, a non-zero vector v is an eigenvector of A with eigenvalue λ
if Av = λv. In pattern recognition, eigenvalues and eigenvectors are used as follows:
• Feature Extraction: Eigenvalues and eigenvectors are used to extract key features from
data and reduce dimensionality.
• Representation Learning: They help in learning compact representations of data with
minimal loss of information.
• Pattern Analysis: Eigenvalues provide insights into the intrinsic properties of patterns,
aiding in classification and clustering.
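A tiny NumPy check of the defining relation Av = λv, on a small symmetric matrix assumed for illustration:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                     # small symmetric toy matrix

eigvals, eigvecs = np.linalg.eigh(A)           # eigenvalues and eigenvectors
for lam, v in zip(eigvals, eigvecs.T):
    # Verify the defining property A @ v == lam * v (up to floating-point error).
    print(lam, np.allclose(A @ v, lam * v))
```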
The rank of a matrix is a fundamental concept in linear algebra that describes the dimension
of the vector space spanned by its columns or rows. It provides valuable insights into the
properties of the matrix and its solutions in various applications, including pattern
recognition. Here's a concise explanation:
Definition:
• The rank of a matrix A, denoted as rank(A), is the maximum number of linearly
independent columns (or rows) in the matrix.
Role in Pattern Recognition:
• Data Analysis: Determining the rank of data matrices helps in understanding the
effective dimensionality of the data and selecting appropriate dimensionality reduction
techniques.
• Model Training: In machine learning, matrices representing features or parameters
may have ranks that affect the complexity and behavior of learning algorithms.
Understanding the rank of matrices is essential for effectively analyzing data, solving linear
systems, and building robust pattern recognition systems.
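A quick NumPy check of this definition, on a toy matrix whose third column is the sum of the first two:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 9.0],
              [7.0, 8.0, 15.0]])   # third column = first column + second column

print(np.linalg.matrix_rank(A))   # 2: only two linearly independent columns
```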
Singular Value Decomposition (SVD) is a powerful technique in linear algebra used for
decomposing a matrix into three simpler matrices. It has various applications in pattern
recognition, data analysis, and machine learning. Here's a brief overview:
Definition:
• SVD decomposes a matrix A of size m × n into three matrices: U, Σ, and V^T, where:
o U is an m × m orthogonal matrix (i.e., U^T U = I).
o Σ is an m × n diagonal matrix with non-negative real numbers on the diagonal
(singular values), arranged in descending order.
o V^T is an n × n orthogonal matrix.
Mathematical Insight:
• SVD allows for the representation of a matrix as a sum of rank-one matrices, making
it a powerful tool for understanding the structure and properties of data.
• It provides a compact representation of the original matrix by retaining only the most
significant singular values and their corresponding columns in U and V^T.
SVD provides a versatile and efficient tool for analyzing and processing data in pattern
recognition tasks, offering insights into the underlying structure of data and facilitating
various applications in machine learning and data analysis.
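A minimal NumPy sketch of the decomposition and of a truncated (rank-2) reconstruction, using a small random matrix assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))                        # toy 6x4 matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt
print(s)                                           # singular values in descending order

# Keep only the k largest singular values for a low-rank approximation.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - A_k))                     # error of the rank-2 approximation
```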
Unit -2
If P(w1|x) > P(w2|x) we would decide that the object belongs to class w1, or else class w2.
Probability of Error
Since the classes are exhaustive, if we decide for one class with posterior probability P,
the leftover probability (1 − P) is the probability that the decision is wrong, i.e., that the
object does not belong to the decided class.
We can minimize the probability of error by deciding the class with the greater posterior,
so that the probability of error is as small as possible. So we finally get,
P(error | x) = min[ P(w1|x), P(w2|x) ]
This type of decision rule highlights the role of the posterior probabilities. With the help
of Bayes' theorem, we can express the rule in terms of conditional and prior probabilities.
Eliminating the scale factor p(x), which does not affect the decision, gives the equivalent rule:
Decide w1 if p(x|w1)P(w1) > p(x|w2)P(w2); otherwise decide w2.
• Case-1: If the class conditionals are equal, i.e., p(x|ω1) = p(x|ω2), then the decision
depends only on the prior probabilities.
• Case-2: On the other hand, if the priors are equal, i.e., P(ω1) = P(ω2), then the decision is
entirely based on the class conditionals p(x|ωj).
We now discuss those cases which have multiple features as well as multiple classes.
Let the multiple features be X = (X1, X2, … Xn) and the multiple classes be w1, w2, … wc; then
P(wi | X) = p(X | wi) P(wi) / p(X)
Where,
Prior = P(wi)
Likelihood = p(X | wi)
Evidence = p(X)
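A small numerical sketch of the two-class rule above, with assumed one-dimensional Gaussian class conditionals and assumed priors (SciPy is used only to evaluate the densities):

```python
from scipy.stats import norm

# Assumed class-conditional densities p(x|w1), p(x|w2) and priors P(w1), P(w2).
p_x_given_w1 = norm(loc=0.0, scale=1.0)
p_x_given_w2 = norm(loc=2.0, scale=1.0)
prior_w1, prior_w2 = 0.6, 0.4

def decide(x):
    # Decide w1 if p(x|w1)P(w1) > p(x|w2)P(w2), otherwise w2.
    g1 = p_x_given_w1.pdf(x) * prior_w1
    g2 = p_x_given_w2.pdf(x) * prior_w2
    return "w1" if g1 > g2 else "w2"

for x in (-1.0, 1.0, 3.0):
    print(x, decide(x))
```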
In cases of the same incoming patterns, we might need to use a drastically different cost
function, which will lead to different actions altogether. Generally, different decision
tasks may require features and yield boundaries quite different from those useful for our
original categorization problem.
Classifiers
What is a Classifier?
A classifier is an algorithm that automatically assigns data to one of a set of classes. One of
the most prominent instances is an email classifier, which examines emails and filters
them according to whether they are spam or not.
The job of estimating a mapping function (f) from input variables (X) to discrete output
variables (y) is known as classification predictive modelling.
Machine learning algorithms are useful for automating operations that were previously done
by hand. They may save a lot of time and money while also increasing the efficiency of
enterprises.
Classification is a type of supervised learning in which the targets (class labels) are provided
along with the input data. Classification has several uses in a variety of fields, including
credit approval, medical diagnosis, and target marketing.
Machine learning classifiers are used to assess consumer comments from social media,
emails, online reviews, and other sources to determine what people are saying about your
company.
Subject categorization, for example, may automatically filter through customer support
complaints or NPS surveys, label them by topic, and send them to the appropriate
department or individual.
Both supervised and unsupervised classifiers are available. Unsupervised machine learning
classifiers are fed just unlabeled datasets, which they sort into categories based on pattern
recognition, data structures, and anomalies. Training datasets are provided to supervised and
semi-supervised classifiers, which teach them how to categorize data into specified
categories.
There are six different classifiers in machine learning that we are going to discuss below:
1. Perceptron:
For binary classification problems, the Perceptron is a linear machine learning
technique. It is one of the original and most basic forms of artificial neural networks.
As a result, it's best for issues where the classes can be easily separated using a line or
linear model, sometimes known as linearly separable problems. The stochastic gradient
descent optimization procedure is used to train the model's coefficients, which are
referred to as input weights.
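A minimal from-scratch sketch of this idea on toy linearly separable data; the update shown is the classic perceptron rule rather than any particular library's implementation:

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """y must contain labels -1 or +1; returns weights and bias of a linear boundary."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified: nudge the boundary
                w += lr * yi * xi
                b += lr * yi
    return w, b

X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))   # reproduces the training labels for this toy data
```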
2. Logistic Regression:
Under the Supervised Learning approach, one of the most prominent Machine Learning
algorithms is logistic regression. It's a method for predicting a categorical dependent
variable from a set of independent factors.
Except for how they are employed, Logistic Regression is quite similar to Linear
Regression. For regression issues, Linear Regression is employed, whereas, for
classification difficulties, Logistic Regression is used.
The algorithm's main drawbacks are that it only works when the predicted variable is
binary, it requires that all predictors be independent of one another, and it expects the
data to be free of missing values.
3. Naive Bayes:
The Naive Bayes family of probabilistic algorithms calculates the likelihood that every
given data point falls into one or more of a set of categories (or not). It is a supervised
learning approach for addressing classification issues that are based on the Bayes
theorem. It's a probabilistic classifier, which means it makes predictions based on an
object's likelihood.
In text analysis, Naive Bayes is used to classify customer comments, news
articles, emails, and other types of content into categories, themes, or "tags" in order to
organise them according to specified criteria.
The likelihood of each tag for a given text is calculated using Naive Bayes algorithms,
and the highest probability is output:
In other words, the probability of A being true given that B is true equals the probability of
B being true given that A is true, multiplied by the probability of A being true and divided
by the probability of B being true. As you move from tag to tag, this estimates the likelihood
that a data piece belongs in a certain category.
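A hedged sketch of such a tag classifier, assuming scikit-learn is available and using a handful of made-up training sentences; the tag with the highest posterior probability is output, as the text describes:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus: each text is labelled with a tag.
texts = ["great match and final score", "team wins the championship",
         "new phone released today", "latest laptop review and specs"]
tags = ["sports", "sports", "tech", "tech"]

# Bag-of-words features followed by multinomial Naive Bayes.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, tags)

# The tag with the highest posterior probability is output.
print(model.predict(["the team played a great final"]))
print(model.predict_proba(["the team played a great final"]))
```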
4. K-Nearest Neighbours:
KNN has been utilized as a non-parametric approach in statistical estimates and pattern
recognition since the early 1970s. It's a form of lazy learning since it doesn't try to build
a generic internal model; instead, it only saves instances of the training data. The
classification is determined by a simple majority vote of each point's k closest
neighbours.
A case is categorized by a majority vote of its neighbours, with the case being allocated
to the class having the most members among its K closest neighbours as determined by
a distance function. If K = 1, the case is simply allocated to the nearest neighbour's
class.
5. Support Vector Machine (SVM):
The SVM algorithm's purpose is to find the optimum line or decision boundary for
categorizing n-dimensional space into classes so that additional data points may be
readily placed in the proper category in the future. A hyperplane is a name for this
optimal decision boundary.
SVM techniques categorize data and train models by searching for the boundary with the
widest possible margin between the classes; with kernels, the resulting classification model
extends beyond simple X/Y predictive axes. The extreme points/vectors that assist in creating
the hyperplane are chosen via SVM.
Support vectors are extreme instances, and the method is called a Support Vector
Machine. Consider the picture below, which shows how a decision boundary or
hyperplane is used to classify two separate categories:
6. Random Forest:
It's also known as a meta-estimator since it fits a number of decision trees on different
sub-samples of datasets and utilizes the average to enhance the model's forecast
accuracy and prevent over-fitting. The size of the sub-sample is always the same as the
size of the original input sample, but the samples are generated using replacement.
It produces a "forest" out of a collection of decision trees that are frequently trained
using the "bagging" method. The main idea of the bagging approach is that combining
many learning models enhances the final result. Rather than relying on a single decision
tree, the random forest gathers forecasts from each tree and predicts the ultimate output
based on the majority of votes.
1. Definition:
• Linear Discriminant Functions: These functions assume that the decision boundaries
between classes are linear and can be represented by linear combinations of input
features. Examples include linear discriminant analysis (LDA) and logistic regression.
• Non-linear Discriminant Functions: In cases where the decision boundaries are non-
linear, more complex discriminant functions such as polynomial functions, neural
networks, or support vector machines (SVMs) may be used.
• Discriminant functions are typically trained using labeled training data, where each
data point is associated with a known class label.
• Training involves estimating the parameters of the discriminant function, such as the
weights in linear discriminant analysis or the coefficients in logistic regression, to
optimize the classification performance on the training data.
5. Applications:
• Pattern Recognition: Discriminant functions are used in various pattern recognition
tasks, including image classification, speech recognition, and natural language
processing.
• Biometrics: They are employed in biometric systems for recognizing individuals based
on physiological or behavioral characteristics.
• Medical Diagnosis: Discriminant functions play a role in medical diagnosis by
classifying patients into different disease categories based on diagnostic tests or
medical images.
Pattern classifiers can be represented in many different ways. The most widely used
representation is a set of discriminant functions gi(x), i = 1, . . . , c. The classifier assigns a
feature vector x to class wi if
gi(x) > gj(x) for all j != i.
Hence this classifier can be viewed as a network that computes the c discriminant functions
and selects the category corresponding to the largest discriminant value.
Fig. The functional structure of a general statistical pattern classifier includes d inputs and
discriminant functions gi(x). A subsequent step determines which of the discriminant values
is the maximum and categorizes the input pattern accordingly. The arrows show the direction
of the flow of information, though frequently the arrows are omitted when the direction of
flow is self-evident.
Generally we can take gi(x) = -R(ai | x), so that the maximum discriminant function
corresponds to the minimum conditional risk.
Things can be further simplified by taking gi(x) = P(wi | x), so the maximum discriminant
function corresponds to the maximum posterior probability.
Thus the choice of a discriminant function is not unique. We can multiply all of the
discriminant functions by the same positive constant or shift them by the same additive
constant without any influence on the decision. These observations eventually lead to
significant computational and analytical simplification. An example of a modified
discriminant function that does not alter the output decision is:
gi(x) = p(x|ωi)P(ωi)
The aim of any decision rule is to divide the feature space into c decision regions, which are
R1, R2, R3, . . . , Rc. As discussed earlier, if gi(x) > gj(x) for all j != i, then x is in Ri, and the
decision rule leads us to assign the feature vector x to the state of nature wi. The regions are
separated by decision boundaries.
Fig. In this two-dimensional two-category classifier, the probability densities are Gaussian,
the decision boundary consists of two hyperbolas, and thus the decision region R2 is not
simply connected. The ellipses mark where the density is 1/e times that at the peak of the
distribution.
For two categories, a single discriminant function g(x) = g1(x) - g2(x) is used, and the
decision rule decides w1 if g(x) > 0; otherwise it decides w2.
Hence a dichotomizer can be seen as a system that computes a single discriminant function
g(x) and classifies x according to the sign of the output. The discriminant above can be
written as
g(x) = P(w1|x) - P(w2|x), or equivalently g(x) = ln[ p(x|w1) / p(x|w2) ] + ln[ P(w1) / P(w2) ].
1. Definition:
• Linear Decision Surfaces: These surfaces are linear in the feature space and can be
represented by linear equations. Examples include straight lines in two dimensions and
hyperplanes in higher dimensions.
• Non-linear Decision Surfaces: In cases where the relationship between input features
and classes is non-linear, the decision surface may be curved or irregular. Non-linear
decision surfaces can be represented by more complex mathematical functions or
surfaces.
• The shape and position of the decision surface are determined by the classification
algorithm used and the parameters or coefficients learned during the training phase.
• Linear classifiers such as linear discriminant analysis (LDA) and logistic regression
produce decision surfaces that are linear in the feature space, while non-linear
classifiers like support vector machines (SVMs) and decision trees can generate more
complex decision surfaces.
4. Visualization:
• Decision surfaces are often visualized to understand the behavior of classification
algorithms and the boundaries they create.
• In two-dimensional feature spaces, decision surfaces can be plotted directly as curves
or lines, while in higher-dimensional spaces, they are visualized using contour plots or
by projecting onto lower-dimensional subspaces.
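A brief sketch of such a two-dimensional visualization, assuming scikit-learn and matplotlib are available (toy blob data; the shaded regions show the linear decision surface learned by logistic regression):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Toy two-class data in a 2-D feature space.
X, y = make_blobs(n_samples=200, centers=2, random_state=0)
clf = LogisticRegression().fit(X, y)

# Evaluate the classifier on a grid covering the feature space.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)          # coloured decision regions
plt.scatter(X[:, 0], X[:, 1], c=y, s=15)     # training points
plt.title("Decision surface of a linear classifier")
plt.show()
```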
Parameter estimation methods are techniques used to determine the values of unknown
parameters in statistical models or mathematical functions based on observed data. These
methods play a crucial role in various fields, including statistics, machine learning, and
signal processing. Here are some common parameter estimation methods:
• MoM is a method for estimating parameters by equating sample moments (e.g., mean,
variance) to population moments.
• It involves setting equations based on moments and solving for the parameters that
satisfy these equations.
4. Bayesian Estimation:
• MAP estimation is a Bayesian approach that seeks to find the parameter values that
maximize the posterior probability distribution.
• It combines prior beliefs about the parameters with the likelihood of the observed data
to infer the most probable parameter values.
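Two small numerical sketches of these ideas, with made-up data: method-of-moments estimates for a Gaussian sample, and a MAP estimate of a coin's bias under an assumed Beta prior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Method of moments for a Gaussian: equate sample moments to population moments.
x = rng.normal(loc=5.0, scale=2.0, size=1000)
mu_hat = x.mean()     # first sample moment -> estimate of the mean
var_hat = x.var()     # second central sample moment -> estimate of the variance
print(mu_hat, var_hat)

# MAP estimate of a coin's bias theta with an assumed Beta(alpha, beta) prior.
alpha, beta = 2.0, 2.0            # assumed prior beliefs
heads, n = 7, 10                  # observed data
theta_map = (heads + alpha - 1) / (n + alpha + beta - 2)   # mode of the Beta posterior
print(theta_map)                  # about 0.667, pulled slightly towards the prior mean of 0.5
```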
A Hidden Markov Model (HMM) is used to predict future observations or classify sequences,
based on the underlying hidden process that generates the data.
The hidden states are the underlying variables that generate the observed data, but they are
not directly observable.
The observations are the variables that are measured and observed.
The relationship between the hidden states and the observations is modeled using a
probability distribution. The Hidden Markov Model (HMM) captures this relationship
between the hidden states and the observations using two sets of probabilities: the transition
probabilities and the emission probabilities.
The transition probabilities describe the probability of transitioning from one hidden state to
another.
The emission probabilities describe the probability of observing an output given a hidden
state.
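To make these two sets of probabilities concrete, here is a minimal forward-algorithm sketch for a made-up two-state model (all numbers are assumptions chosen for illustration); it computes the likelihood of an observation sequence from the initial, transition, and emission probabilities:

```python
import numpy as np

# Made-up HMM: hidden states 0 = "rainy", 1 = "sunny";
# observations 0 = "umbrella", 1 = "no umbrella".
initial = np.array([0.5, 0.5])              # P(first hidden state)
transition = np.array([[0.7, 0.3],          # P(next state | current state)
                       [0.4, 0.6]])
emission = np.array([[0.9, 0.1],            # P(observation | hidden state)
                     [0.2, 0.8]])

def sequence_likelihood(observations):
    """Forward algorithm: probability of the observation sequence under the model."""
    alpha = initial * emission[:, observations[0]]
    for obs in observations[1:]:
        alpha = (alpha @ transition) * emission[:, obs]
    return alpha.sum()

print(sequence_likelihood([0, 0, 1]))   # umbrella, umbrella, no umbrella
```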
Hidden Markov Model Algorithm
The Hidden Markov Model (HMM) algorithm can be implemented using the following
steps:
What is Dimensionality?
In machine learning, data is often represented in rows and columns, similar to a spreadsheet.
Each column represents a feature, and each row represents a data point. For example, in a
dataset of houses, features could include the number of bedrooms, size of the house, and
location. If there are many features, the data exists in a high-dimensional space, which can
make it challenging for machine learning algorithms to find patterns and make accurate
predictions.
When there are too many features, the volume of the feature space increases dramatically.
This makes the data points sparse and less representative, leading to poorer performance of
machine learning models. Imagine searching for a lost quarter: finding it is easy in a straight
line, harder in a 2D square area, and nearly impossible in a 3D cube. More dimensions make
it much harder to find patterns.
• Simplifying Models: Fewer features make models easier to understand and work with.
• Reducing Storage and Computation Needs: Less data means faster processing and less
storage required.
• Improving Model Accuracy: Removing irrelevant or redundant features can lead to
better predictions.
• Speeding Up Training: With fewer features, algorithms can train faster.
• Enhancing Visualization: Reduced dimensions make it easier to visualize the data.
1. Feature Selection: Choosing the most important features and removing the rest.
o Filter Methods: Automatically select relevant features.
o Wrapper Methods: Use a machine learning model to test which features work
best together.
o Embedded Methods: Select features during the model training process.
2. Feature Extraction: Transforming data into a lower-dimensional space while retaining
important information.
o Principal Component Analysis (PCA): Projects data onto fewer dimensions
while keeping as much variance (information) as possible.
o Linear Discriminant Analysis (LDA): Projects data to maximize class
separability.
o Kernel PCA: A nonlinear version of PCA for more complex data structures.
Linear Discriminant Analysis (LDA) is a statistical technique for categorizing data into
groups. It identifies patterns in features to distinguish between different classes. For
instance, it may analyze characteristics like size and color to classify fruits as apples or
oranges. LDA aims to find a straight line or plane that best separates these groups while
minimizing overlap within each class. By maximizing the separation between classes, it
enables accurate classification of new data points. In simpler terms, LDA helps make
sense of data by finding the most effective way to separate different categories, aiding
tasks like pattern recognition and classification.
Assumptions:
Linear Discriminant Analysis (LDA) makes some assumptions about the data:
• It assumes that the data follows a normal or Gaussian distribution, meaning each
feature forms a bell-shaped curve when plotted.
• Each of the classes has identical covariance matrices.
However, it is worth mentioning that LDA performs quite well even if the assumptions
are violated.
The basic idea of Fisher's Linear Discriminant (FLD) is to project data points onto a line so
as to maximize the between-class scatter and minimize the within-class scatter.
This might sound a bit cryptic but it is quite straightforward. So, before delving deep
into the derivation part we need to get familiarized with certain terms and expressions.
Fig. Data points X before and after projection onto the discriminant line.
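A minimal NumPy sketch of this projection for two classes, computing the within-class scatter matrix and the Fisher direction w proportional to S_W^-1 (m1 - m2) on toy data (all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))   # class 1 samples (toy data)
X2 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(50, 2))   # class 2 samples (toy data)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter: sum of the scatter of each class around its own mean.
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher's direction maximises between-class scatter relative to within-class scatter.
w = np.linalg.solve(S_W, m1 - m2)
w /= np.linalg.norm(w)

# Projections of the two classes onto w are well separated in one dimension.
print((X1 @ w).mean(), (X2 @ w).mean())
```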
Non-parametric techniques for density estimation are methods used to estimate the
probability distribution of a dataset without assuming that the data follows a specific
parametric distribution (like a normal or binomial distribution). These techniques rely on the
data itself to construct the density function. Here are some common non-parametric density
estimation techniques explained in simpler terms:
1. Kernel Density Estimation (KDE)
Kernel Density Estimation (KDE) is like placing a small, smooth hill (called a kernel) on
each data point and then adding up all these hills to create a smooth curve that represents the
data density.
• How It Works: Imagine you have a bunch of dots on a line. KDE places a little bump
on each dot and sums up the bumps to form a smooth curve that shows where the dots
are concentrated.
• Kernels: These are the shapes of the hills. Common shapes include Gaussian (bell-
shaped) and Epanechnikov (parabolic).
• Bandwidth (h): This controls the width of the hills. A smaller bandwidth makes the
curve bumpier, while a larger bandwidth makes it smoother.
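A brief sketch using SciPy's gaussian_kde on toy one-dimensional data (the bandwidth is chosen automatically here, but can be set by hand):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 200),
                       rng.normal(2, 1.0, 300)])   # toy bimodal sample

kde = gaussian_kde(data)    # one Gaussian "hill" per point, summed into a smooth curve
grid = np.linspace(-5, 5, 11)
print(kde(grid))            # estimated density at a few evaluation points
# A smaller bandwidth gives a bumpier estimate, e.g. gaussian_kde(data, bw_method=0.1)
```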
2. Histograms
Histograms divide the data range into equal-sized bins and count how many data points fall
into each bin. The height of each bar in the histogram represents the density of data points in
that bin.
• How It Works: Think of counting how many people are in different sections of a park.
Each section is a bin, and the number of people in each section determines the height
of the bar in the histogram.
• Bin Width: The choice of bin width affects the appearance of the histogram. Smaller
bins capture more detail but may look noisy, while larger bins provide a smoother but
less detailed view.
3. k-Nearest Neighbors (k-NN) Density Estimation
This method estimates the density at a point based on the distance to its k-nearest neighbors.
It adapts to local data density by changing the size of the neighborhood around each point.
• How It Works: Imagine standing on a street and looking at the nearest 10 people
around you. The density is higher if those 10 people are close to you and lower if they
are spread out.
• Choice of k: The number of neighbors (k) affects the density estimate. A larger k
smooths out the density estimate, while a smaller k captures more local detail.
4. Adaptive Kernel Density Estimation
Adaptive KDE adjusts the width of the kernels based on the local density of data points. In
dense areas, it uses narrower kernels; in sparse areas, it uses wider kernels.
• How It Works: If you are in a crowded place, you look closer to you (narrower view),
but if you are in a sparse area, you look further (wider view).
• Benefit: This method can better handle areas of varying density, giving a more
accurate overall picture.
5. Spline Density Estimation
Splines are smooth, flexible curves fitted to the data. Spline density estimation fits a smooth
curve to the cumulative distribution function (CDF) of the data and then differentiates it to
get the density function.
• How It Works: Think of drawing a smooth curve through the middle of a set of data
points. This curve represents the distribution, and by looking at how steep the curve is,
you can estimate the density.
• Application: This method is useful when you want a smooth estimate that adapts well
to the shape of the data.
6. Mean Shift
Mean shift is an iterative method that moves each data point towards the densest area of data
points, effectively finding clusters and estimating density.
• How It Works: Imagine each person in a park moving towards the most crowded area
nearby. Over time, clusters form where people gather, indicating high-density regions.
• Benefit: This method is good for identifying clusters in the data without assuming a
specific number of clusters.
Nonmetric methods for pattern classification are approaches that do not rely on traditional
metric distances (like Euclidean distance) between data points to classify patterns. Instead,
they often use alternative strategies such as logical operations, proximity, or other criteria to
determine class membership. Here are some common nonmetric methods for pattern
classification:
1. Decision Trees
Decision trees classify data by asking a series of questions about the features of the data
points. Each question splits the data into subsets, leading to a tree structure where each leaf
node represents a class.
• How It Works: At each node in the tree, a feature is selected to split the data based on
a criterion (like information gain or Gini impurity). This process is repeated
recursively until the tree is fully grown or another stopping criterion is met.
• Example: A decision tree might first ask whether a fruit is red. If yes, it might next ask
if it's round, helping classify it as an apple or cherry.
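A short sketch of such a tree, assuming scikit-learn is available, trained on its bundled iris data and printed as a series of feature questions:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0).fit(X, y)

# Each split asks a question about one feature; each leaf assigns a class.
print(export_text(tree, feature_names=load_iris().feature_names))
```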
2. Random Forest
Random forests are an ensemble method that combines multiple decision trees to improve
classification performance. Each tree in the forest is trained on a random subset of the data
and features.
• How It Works: Each tree makes a classification, and the final class is determined by
majority voting among all trees.
• Benefit: Random forests reduce the risk of overfitting and improve robustness
compared to a single decision tree.
3. k-Nearest Neighbours (k-NN)
k-NN classifies a new data point according to the classes of the training points closest to it.
• How It Works: To classify a new data point, the algorithm finds the k closest training
data points and assigns the class most common among them.
• Nonmetric Variation: While k-NN typically uses a metric distance, it can be adapted to
use nonmetric measures, such as Hamming distance for categorical data.
4. Rule-Based Classification
Rule-based classifiers use a set of if-then rules derived from the training data to classify new
instances. These rules are often extracted using methods like association rule mining or
expert knowledge.
• How It Works: Each rule is a logical statement that assigns a class label if certain
conditions are met. The system checks which rules apply to a new instance and assigns
the corresponding class.
• Example: An email spam filter might have a rule stating, "If the email contains the
word 'free' and 'winner,' then classify it as spam."
5. Support Vector Machines (SVM) with Nonmetric Kernels
SVMs find the hyperplane that best separates the classes in the feature space. While SVMs
traditionally use metric distances, nonmetric kernels (like graph kernels) can be used to
classify data based on more complex relationships.
• How It Works: SVM constructs a decision boundary that maximizes the margin
between classes. Nonmetric kernels allow the SVM to operate in a higher-dimensional
space where the data may be more easily separable.
• Example: Graph kernels can measure similarity between structured data like graphs or
sequences without relying on metric distances.
6. Neural Networks
Neural networks are a class of models that use layers of interconnected nodes (neurons) to
learn complex patterns in the data. While they can use metric-based input, the internal
processing and transformations are nonmetric.
• How It Works: Data is passed through multiple layers of neurons, each applying
nonlinear transformations. The network learns to map input features to output classes
through training.
• Example: Convolutional Neural Networks (CNNs) are used for image recognition by
learning hierarchical patterns in pixel data.
7. Fuzzy Logic Classifiers
Fuzzy logic classifiers use degrees of membership rather than crisp class labels, making them
suitable for handling uncertainty and imprecision.
• How It Works: Each data point has a degree of membership in each class, defined by
fuzzy sets and membership functions. Classification is based on the highest
membership value or a combination of memberships.
• Example: In medical diagnosis, a symptom might partially indicate multiple diseases,
and fuzzy logic can handle such overlap.
8. Bayesian Networks
Bayesian networks represent the probabilistic relationships among variables using a directed
acyclic graph. They use these relationships to compute the probability of each class given the
features.
• How It Works: Nodes in the graph represent features and classes, while edges
represent conditional dependencies. The network calculates the posterior probabilities
of the classes given the evidence (features).
• Example: In diagnosing a patient, a Bayesian network might combine probabilities
from various symptoms and test results to determine the likelihood of different
diseases.
Unsupervised Learning
• Unlike supervised learning, where algorithms are trained on labeled data with known
outcomes, unsupervised learning operates on unlabeled data. This means there are no
predefined target variables or categories to guide the learning process.
• Clustering: Clustering algorithms group similar data points together into clusters based
on some similarity metric. Common clustering algorithms include K-means clustering,
hierarchical clustering, and DBSCAN.
• Dimensionality Reduction: Dimensionality reduction techniques aim to reduce the
number of features in a dataset while preserving its essential structure or relationships.
Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding
(t-SNE), and autoencoders are examples of dimensionality reduction methods.
• Anomaly Detection: Anomaly detection algorithms identify data points that deviate
significantly from the norm or exhibit unusual behavior. One-class SVM, Isolation
Forest, and Gaussian Mixture Models (GMMs) are often used for anomaly detection.
4. Applications:
• Market Segmentation: Clustering algorithms can be used to segment customers based
on their purchasing behavior or demographic characteristics.
• Image and Text Analysis: Dimensionality reduction techniques are employed for
visualizing high-dimensional data such as images or text documents. Clustering
algorithms can also group similar images or documents together.
• Anomaly Detection: Unsupervised learning algorithms are used for fraud detection,
network intrusion detection, and identifying abnormal behavior in various domains.
• Recommendation Systems: Unsupervised learning techniques can be applied to
recommend products, movies, or articles to users based on their preferences and
behavior.