ML Supervised Full Notes
Instructions:
• Kindly go through the lectures/videos on our website www.piyushwairale.com
• Read this study material carefully and make your own handwritten short notes. (Short notes must not be more than 5-6 pages.)
• Attempt the questions available on the portal.
• Revise this material at least 5 times, and once you have prepared your short notes, revise your short notes twice a week.
• If you are not able to understand any topic or require a detailed explanation, please mention it in our discussion forum on the website.
• Let me know if there are any typos or mistakes in the study material. Mail me at piyushwairale100@gmail.com

Introduction
• Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed. ML is one of the most exciting technologies one would have ever come across. As is evident from the name, it gives the computer the ability that makes it more similar to humans: the ability to learn. Machine learning is actively being used today, perhaps in many more places than one would expect.
• Machine learning is programming computers to optimize a performance criterion using example data or past experience. We have a model defined up to some parameters, and learning is the execution of a computer program to optimize the parameters of the model using the training data or past experience. The model may be predictive to make predictions in the future, or descriptive to gain knowledge from data, or both.
• The field of study known as machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.
Definition of Learning
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks T, as measured by P, improves with experience E.
A computer program which learns from experience is called a machine learning program or simply a learning program. Such a program is sometimes also referred to as a learner.
• The quality and quantity of data available for training and testing play a significant
role in determining the performance of a machine-learning model.
• Data can be in various forms such as numerical, categorical, or time-series data, and
can come from various sources such as databases, spreadsheets, or APIs.
• Machine learning algorithms use data to learn patterns and relationships between input
variables and target outputs, which can then be used for prediction or classification
tasks.
Understanding data
Since an important component of the machine learning process is data storage, we briefly
consider in this section the different types and forms of data that are encountered in the
machine learning process.
Unit of observation
By a unit of observation we mean the smallest entity with measured properties of interest
for a study.
Examples
• A person, an object or a thing
• A time point
• A geographic region
• A measurement
Sometimes, units of observation are combined to form units such as person-years.
Examples and features
Datasets that store the units of observation and their properties can be imagined as collections of data consisting of the following:
Examples
An “example” is an instance of the unit of observation for which properties have been
recorded.
An “example” is also referred to as an “instance”, or “case” or “record.” (It may be noted
that the word “example” has been used here in a technical sense.)
Features
A “feature” is a recorded property or a characteristic of examples. It is also referred to as an “attribute” or a “variable.”
Examples for “examples” and “features”
1. Cancer detection
Consider the problem of developing an algorithm for detecting cancer. In this study we note the following.
(a) The units of observation are the patients.
(b) The examples are members of a sample of cancer patients.
(c) The following attributes of the patients may be chosen as the features:
• gender
• age
• blood pressure
• the findings of the pathology report after a biopsy
2. Pet selection
Suppose we want to predict the type of pet a person will choose.
(a) The units are the persons.
(b) The examples are members of a sample of persons who own pets.
(c) The features might include age, home region, family income, etc. of persons who own pets.
Figure 1: Example for “examples” and “features” collected in a matrix format (data relates to automobiles and their features)

Labeled data includes a label or target variable that the model is trying to predict, whereas unlabeled data does not include a label or target variable. The data used in machine learning is typically numerical or categorical. Numerical data includes values that can be ordered and measured, such as age or income. Categorical data includes values that represent categories, such as gender or type of fruit.
• Data can be divided into training and testing sets.
• The training set is used to train the model, and the testing set is used to evaluate the performance of the model.
• It is important to ensure that the data is split in a random and representative way.
• Data preprocessing is an important step in the machine learning pipeline. This step can include cleaning and normalizing the data, handling missing values, and feature selection or engineering.
How do we split data in Machine Learning?
1. Training Data: The part of the data we use to train our model. This is the data that your model actually sees (both input and output) and learns from.
2. Validation Data: The part of the data that is used for frequent evaluation of the model fit on the training dataset, and for tuning the hyperparameters involved (parameters set before the model begins learning). This data plays its part while the model is actually training.
3. Testing Data: Once our model is completely trained, testing data provides an unbiased evaluation. When we feed in the inputs of the testing data, our model predicts some values (without seeing the actual outputs). After prediction, we evaluate our model by comparing its predictions with the actual outputs present in the testing data. This is how we evaluate how much our model has learned from the experience fed in as training data at the time of training.
Different forms of data
3. Ordinal data: This denotes a nominal variable with categories falling in an ordered list. Examples include clothing sizes such as small, medium, and large, or a measurement of customer satisfaction on a scale from “not at all happy” to “very happy.”
Examples: In the data given in Fig. 1, the features “year”, “price” and “mileage” are numeric and the features “model”, “color” and “transmission” are categorical.
Properties of Data
• Volume: Scale of data. With the growing world population and exposure to technology, huge amounts of data are being generated every millisecond.
• Variety: Different forms of data – healthcare records, images, videos, audio clips.
• Velocity: Rate of data streaming and generation.
• Value: Meaningfulness of data in terms of the information that researchers can infer from it.
• Veracity: Certainty and correctness of the data we are working on.
• Viability: The ability of data to be used and integrated into different systems and processes.
• Security: The measures taken to protect data from unauthorized access or manipulation.
• Accessibility: The ease of obtaining and utilizing data for decision-making purposes.
• Integrity: The accuracy and completeness of data over its entire lifecycle.
1.2 Different types of learning
In general, machine learning algorithms can be classified into three types.
1. Supervised learning
• Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs.
• In supervised learning, each example in the training set is a pair consisting of an input object (typically a vector) and an output value.
• A supervised learning algorithm analyzes the training data and produces a function, which can be used for mapping new examples.
• In the optimal case, the function will correctly determine the class labels for unseen instances.
• Both classification and regression problems are supervised learning problems.
• A wide range of supervised learning algorithms are available, each with its strengths and weaknesses. There is no single learning algorithm that works best on all supervised learning problems.
• Important point: “Supervised learning” is so called because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers (that is, the correct outputs); the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.
Example:
Consider the following data regarding patients entering a clinic. The data consists of the gender and age of the patients, and each patient is labeled as “healthy” or “sick”.
gender age label
M 48 sick
M 67 sick
F 53 healthy
M 49 healthy
F 34 sick
M 21 healthy
Based on this data, when a new patient enters the clinic, how can one predict whether he/she is healthy or sick?
2. Unsupervised learning
• Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses.
• In unsupervised learning algorithms, a classification or categorization is not included in the observations.
• There are no output values and so there is no estimation of functions. Since the examples given to the learner are unlabeled, the accuracy of the structure that is output by the algorithm cannot be evaluated.
• The most common unsupervised learning method is cluster analysis, which is used for exploratory data analysis to find hidden patterns or groupings in data.
Example: Consider the following data regarding patients entering a clinic. The data consists of the gender and age of the patients.
gender age
M 48
M 67
F 53
M 49
F 34
M 21
Based on this data, can we infer anything regarding the patients entering the clinic?
3. Reinforcement learning
• A learner (the program) is not told what actions to take as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them.
• In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situations and, through that, all subsequent rewards.
• For example, consider teaching a dog a new trick: we cannot tell it what to do, but we can reward/punish it if it does the right/wrong thing. It has to find out what it did that made it get the reward/punishment. We can use a similar method to train computers to do many tasks, such as playing backgammon or chess, scheduling jobs, and controlling robot limbs.
• Reinforcement learning is different from supervised learning. Supervised learning is learning from examples provided by a knowledgeable expert.
• Supervised learning is a machine learning paradigm where algorithms aim to optimize parameters to minimize the difference between target and computed outputs, commonly used in tasks like classification and regression.
• In supervised learning, training examples are associated with target outputs (initially labeled) and computed outputs (generated by the learning algorithm), and the goal is to minimize misclassification or error.
Regression
• Regression algorithms are used if there is a relationship between the input variable and the output variable.
• Regression is used for the prediction of continuous variables, such as weather forecasting, market trends, etc.
General Approach
Let x denote the set of input variables and y the output variable. In machine learning, the general approach to regression is to assume a model, that is, some mathematical relation between x and y, involving some parameters, say θ, in the following form:
y = f(x, θ)
The function f(x, θ) is called the regression function. The machine learning algorithm optimizes the parameters in the set θ such that the approximation error is minimized; that is, the estimates of the values of the dependent variable y are as close as possible to the correct values given in the training set.
Example
For example, if the input variables are “Age”, “Distance” and “Weight” and the output variable is “Price”, the model may be of the form Price = f(Age, Distance, Weight, θ).
Some common types of regression models are the following:
1. Simple linear regression: There is only one continuous independent variable x and the assumed relation between the independent variable and the dependent variable y is
y = a + bx
2. Multivariate linear regression: There is more than one independent variable, say x1, ..., xn, and the assumed relation between the independent variables and the dependent variable is y = a0 + a1x1 + ... + anxn
3. Polynomial regression: There is only one continuous independent variable x and the assumed model is y = a0 + a1x + ... + anx^n. It is a variant of the multiple linear regression model, except that the best-fit line is curved rather than straight.
4. Ridge regression: Ridge regression is one of the types of linear regression in which a small amount of bias is introduced so that we can get better long-term predictions. Ridge regression is a regularization technique, which is used to reduce the complexity of the model. It is also called L2 regularization.
5. Logistic regression: The dependent variable is binary, that is, a variable which takes only the values 0 and 1. The assumed model involves certain probability distributions.
3.0.1 Linear Regression
Simple linear regression is a basic machine learning technique used for modeling the relationship between a single independent variable (often denoted as “x”) and a dependent variable (often denoted as “y”). It assumes a linear relationship between the variables and aims to find the best-fitting line (typically represented by the equation y = mx + b) that minimizes the sum of squared differences between the observed data points and the values predicted by the model.
For the model y = a + bx fitted to n observations (x1, y1), ..., (xn, yn), denote the means of x and y by x̄ and ȳ, and note also that the variance of x is given by Var(x) = (1/n) Σ (xi − x̄)². It can be shown that the values of a and b can be computed using the following formulas:
b = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
a = ȳ − b x̄
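As a quick illustration of these formulas, here is a minimal Python sketch that computes the least-squares estimates of a and b directly; the small dataset is made up purely for illustration:

```python
import numpy as np

# Hypothetical data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_mean, y_mean = x.mean(), y.mean()

# b = sum((xi - x_mean)(yi - y_mean)) / sum((xi - x_mean)^2)
b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
# a = y_mean - b * x_mean
a = y_mean - b * x_mean

print(f"fitted line: y = {a:.3f} + {b:.3f} x")
```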
Example
Obtain a linear regression for the data in the table below, assuming that y is the independent variable.
Multiple Linear Regression
Let there also be N observed values of these variables. As in simple linear regression, here also we use the ordinary least squares method to obtain the optimal estimates of β0, β1, ..., βn. The method yields a procedure for the computation of these optimal estimates.
• The output of the multivariate regression model is difficult to analyse.
• Multivariate regression yields better results when used with larger datasets rather than small ones.
Example:
Fit a multiple linear regression model to the following data:
In this problem, there are two independent variables and four sets of values of the variables. Thus, in the notations used above, we have n = 2 and N = 4. The multiple linear regression model for this problem has the form y = β0 + β1x1 + β2x2
The required model is y = 2.0625 − 2.3750x1 + 3.2500x2
Ridge Regression
• The ridge regression technique is applied when the data exhibits multicollinearity, that is, when the independent variables are highly correlated. While least squares estimates are unbiased under multicollinearity, their variances are large enough to cause the observed value to diverge from the actual value. Ridge regression reduces standard errors by biasing the regression estimates.
• The lambda (λ) variable in the ridge regression equation resolves the multicollinearity problem.
• Lambda (λ) is the penalty term. So, by changing the value of λ, we control the penalty term. The higher the value of λ, the bigger the penalty and therefore the more the magnitudes of the coefficients are reduced.
• In this technique, the cost function is altered by adding the penalty term to it. The amount of bias added to the model is called the ridge regression penalty. We can calculate it by multiplying lambda by the squared weight of each individual feature, so the cost function takes the form
Cost = Σ (yi − ŷi)² + λ Σ wj²
• In the above equation, the penalty term regularizes the coefficients of the model, and hence ridge regression reduces the magnitudes of the coefficients, which decreases the complexity of the model.
• As we can see from the above equation, if the value of λ tends to zero, the equation becomes the cost function of the linear regression model. Hence, for the minimum value of λ, the model will resemble the linear regression model.
• A general linear or polynomial regression will fail if there is high collinearity between the independent variables, so to solve such problems, ridge regression can be used.
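A minimal sketch of ridge regression in Python, using the closed-form solution w = (XᵀX + λI)⁻¹Xᵀy; the data and the λ value below are made up for illustration:

```python
import numpy as np

# Hypothetical data: 6 samples, 2 features (for illustration only)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0], [6.0, 4.0]])
y = np.array([3.1, 2.9, 7.2, 6.8, 10.1, 9.7])

# Add a column of ones for the intercept term
X1 = np.hstack([np.ones((X.shape[0], 1)), X])

lam = 1.0                      # regularization strength (lambda)
I = np.eye(X1.shape[1])
I[0, 0] = 0.0                  # conventionally, do not penalize the intercept

# Closed-form ridge solution: w = (X^T X + lambda * I)^(-1) X^T y
w = np.linalg.solve(X1.T @ X1 + lam * I, X1.T @ y)
print("intercept and coefficients:", w)
```

With λ = 0 this reduces to ordinary least squares, which matches the point made above about small λ resembling plain linear regression.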
Bias and variance trade-off
The bias and variance trade-off is generally complicated when it comes to building ridge regression models on an actual dataset. However, the general trend which one needs to remember is:
• The bias increases as λ increases.
Assumptions of Ridge Regression:
The assumptions of ridge regression are the same as those of linear regression: linearity, constant variance, and independence. However, as ridge regression does not provide confidence limits, the distribution of errors need not be assumed to be normal.
• Linear Relationship: Ridge Regression assumes that there is a linear relationship between the independent variables and the dependent variable.
• Homoscedasticity: Ridge Regression assumes that the variance of the errors is constant across all levels of the independent variables.
• Independence of errors: Ridge Regression assumes that the errors are independent of each other, i.e., the errors are not correlated.
• Normality of errors: Ridge Regression assumes that the errors follow a normal distribution.
Key points about Ridge Regression in machine learning:
• Regularization: Ridge regression adds a penalty term that discourages the magnitude of the coefficients from becoming too large. This helps prevent overfitting.
• Multicollinearity Mitigation: It is particularly effective when you have highly correlated independent variables (multicollinearity), shrinking the coefficients and making the model more stable.
• Lambda Parameter: The choice of the lambda parameter (λ) is essential. A small λ is close to standard linear regression, while a large λ results in stronger regularization.
• Balancing Act: Ridge regression performs a balancing act between fitting the data well and preventing overfitting. It maintains all predictors but assigns smaller coefficients to less important ones.
• Model Stability: It makes the model more stable, especially when you have a high-dimensional dataset with many predictors. This can lead to better generalization to new, unseen data.
• Interpretability: Ridge regression may make the model less interpretable because it shrinks coefficients toward zero. It can be challenging to discern the individual importance of predictors.
• Tuning Lambda: Cross-validation is often used to tune the lambda parameter and find the optimal trade-off between fitting the data and regularization.
• Sensitivity to Lambda: Proper selection of the regularization parameter (λ) is crucial; an incorrect choice can result in underfitting or ineffective regularization.
• Loss of Interpretability: Ridge regression can make the model less interpretable since it shrinks coefficients towards zero, potentially making it harder to discern each individual predictor’s importance.
• Ineffective for Feature Selection: Ridge regression does not perform feature selection. It retains all predictors in the model but assigns smaller coefficients to less important ones.
• Less Effective for Sparse Data: In cases where many predictors are irrelevant or unimportant, ridge may not eliminate them from the model effectively.
Extra info to understand more about Ridge Regression
What is Regularization?
• Regularization is one of the most important concepts of machine learning. It is a technique to prevent the model from overfitting by adding extra information to it.
• Sometimes the machine learning model performs well with the training data but does not perform well with the test data. This means the model is not able to predict the output when dealing with unseen data, because noise has been introduced into the output, and hence the model is called overfitted. This problem can be dealt with with the help of a regularization technique.
• This technique can be used in such a way that it allows all variables or features to be maintained in the model while reducing their magnitudes. Hence, it maintains accuracy as well as the generalization of the model.
• It mainly regularizes or reduces the coefficients of features toward zero. In simple words, “In the regularization technique, we reduce the magnitude of the features while keeping the same number of features.”
• Multicollinearity may be introduced during the data collection process if the data were gathered using an inappropriate sampling method. It can also occur if the sample size is smaller than expected.
• Multicollinearity will also appear if the model is overspecified, that is, if there are more variables than data points.
4.1 Logistic Function (Sigmoid Function)
The logistic (sigmoid) function maps any real-valued input z to a value between 0 and 1:
σ(z) = 1 / (1 + e^(−z))
Key points about Logistic Regression include:
• Binary Classification: It is primarily used for two-class classification problems, where the output is either 0 or 1, indicating the absence or presence of an event.
• Logistic Function: It utilizes the logistic (sigmoid) function to convert a linear combination of input features into a probability value between 0 and 1.
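A minimal sketch of this idea in Python; the weights, bias, and input values below are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned parameters and a single input example
w = np.array([0.8, -0.4])      # weights for two features
b = -0.1                       # bias (intercept)
x = np.array([1.5, 2.0])       # input features

# Linear combination of inputs, squashed into a probability
p = sigmoid(np.dot(w, x) + b)
print(f"P(class = 1 | x) = {p:.3f}")   # predict class 1 if p >= 0.5
```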
5 K-Nearest Neighbors
• K-Nearest Neighbors (KNN) is a simple and intuitive machine-learning algorithm used
for both classification and regression tasks.
• The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
• K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
• KNN tries to predict the correct class for the test data by calculating the distance between the test data and all the training points, and then selecting the K points that are closest to the test data. The KNN algorithm calculates the probability of the test data belonging to the classes of the ‘K’ training points, and the class that holds the highest probability is selected. In the case of regression, the predicted value is the mean of the ‘K’ selected training points.
Choosing the Value of K
• A larger k may lead to better performance. But if we set k too large, we may end up looking at samples that are not neighbors (are far away from the query).
• We can use cross-validation to find k.
• A rule of thumb is k < sqrt(n), where n is the number of training examples.
• A larger k produces a smoother boundary effect.
• When k = N, we always predict the majority class.
5.1 Working
The working of K-NN can be explained on the basis of the algorithm below:
1. Select the number K of neighbors.
2. Calculate the Euclidean distance from the new point to each training point.
3. Take the K nearest neighbors as per the calculated Euclidean distances.
4. Among these K neighbors, count the number of data points in each category.
5. Assign the new data point to the category for which the number of neighbors is maximum.
6. Our model is ready.
5.2 K-Nearest Neighbors (KNN) Classification Example
Suppose we have a dataset with the following points:
Data Point Feature 1 (X1) Feature 2 (X2) Class
A 1 2 Blue
B 2 3 Blue
C 2 1 Red
D 3 3 Red
E 4 2 Blue
Now, let’s say we want to classify a new data point with features X1 = 2.5 and X2 = 2.5 using a KNN algorithm with k = 3 (i.e., considering the three nearest neighbors).
1. Calculate Euclidean Distances:
Distance to A: √((2.5 − 1)² + (2.5 − 2)²) = √2.5 ≈ 1.58
Distance to B: √((2.5 − 2)² + (2.5 − 3)²) = √0.5 ≈ 0.71
Distance to C: √((2.5 − 2)² + (2.5 − 1)²) = √2.5 ≈ 1.58
Distance to D: √((2.5 − 3)² + (2.5 − 3)²) = √0.5 ≈ 0.71
Distance to E: √((2.5 − 4)² + (2.5 − 2)²) = √2.5 ≈ 1.58
2. Find K Nearest Neighbors: Identify the three nearest neighbors based on the calculated distances. B and D are the closest (≈ 0.71); A, C and E are tied at ≈ 1.58, so one of them (say A) is taken as the third neighbor.
3. Majority Voting: Determine the majority class among the three nearest neighbors. Since B and A are Blue and D is Red, the majority class is Blue.
4. Prediction: Predict that the new point X1 = 2.5, X2 = 2.5 belongs to the majority class, which is Blue.
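The same example can be reproduced with a short Python sketch of the K-NN procedure, written from scratch here rather than with any particular library:

```python
import numpy as np
from collections import Counter

# Dataset from the example above: (features, class label)
points = {"A": ([1, 2], "Blue"), "B": ([2, 3], "Blue"), "C": ([2, 1], "Red"),
          "D": ([3, 3], "Red"), "E": ([4, 2], "Blue")}
query = np.array([2.5, 2.5])
k = 3

# 1. Compute the Euclidean distance from the query to every training point
distances = [(np.linalg.norm(np.array(xy) - query), label) for xy, label in points.values()]

# 2. Sort by distance and keep the k nearest neighbors
neighbors = sorted(distances, key=lambda d: d[0])[:k]

# 3. Majority vote among the k neighbors (for regression, take the mean of their targets)
prediction = Counter(label for _, label in neighbors).most_common(1)[0][0]
print("k nearest:", neighbors)
print("predicted class:", prediction)
```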
• Adaptability: KNN can be used for both classification and regression tasks, and it can handle multi-class problems without modification.
• Interpretability: The algorithm provides human-interpretable results, as predictions are based on the majority class or the average of the nearest neighbors.
6.1 Bayes’ Theorem
Bayes’ theorem is also known as Bayes’ Rule or Bayes’ law, and is used to determine the probability of a hypothesis with prior knowledge. It depends on the conditional probability. The formula for Bayes’ theorem is given as:
P(A|B) = P(B|A) P(A) / P(B)
where P(A|B) is the posterior probability of hypothesis A given the observed evidence B, P(B|A) is the likelihood of the evidence given the hypothesis, P(A) is the prior probability of the hypothesis, and P(B) is the marginal probability of the evidence.
• Bayes’ Theorem: The classifier is based on Bayes’ theorem, which calculates the probability of a hypothesis (in this case, a class label) given the evidence (features or attributes). Mathematically, it is expressed as P(class|evidence) = [P(evidence|class) * P(class)] / P(evidence).
1. Multinomial Naive Bayes: Typically used for text classification where features
represent word counts.
2. Gaussian Naive Bayes: Suitable for continuous data and assumes a Gaussian
distribution of features.
3. Bernoulli Naive Bayes: Applicable when features are binary, such as presence or
absence.
• Classification: To classify a new data point, the classifier calculates the posterior
probabilities for each class and selects the class with the highest probability.
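As an illustration of the Gaussian variant and of the classification step just described, here is a minimal from-scratch Python sketch; the tiny single-feature dataset is invented purely for illustration, and the class-conditional feature distribution is assumed to be Gaussian:

```python
import numpy as np

# Hypothetical training data: one feature (e.g., age), two classes
X = {"healthy": np.array([21.0, 34.0, 49.0, 53.0]),
     "sick":    np.array([48.0, 62.0, 67.0, 71.0])}

def gaussian_pdf(x, mean, var):
    # Likelihood of x under a normal distribution with the given mean and variance
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x_new = 45.0
total = sum(len(v) for v in X.values())
posteriors = {}
for label, values in X.items():
    prior = len(values) / total                                    # P(class)
    likelihood = gaussian_pdf(x_new, values.mean(), values.var())  # P(evidence | class)
    posteriors[label] = prior * likelihood                         # proportional to P(class | evidence)

# Select the class with the highest (unnormalized) posterior probability
print(max(posteriors, key=posteriors.get))
```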
• Works Well with Small Datasets: It can perform reasonably well even with limited
training data.
• Interpretable: The results are easy to interpret, as it provides the probability of belonging to each class.
• Sensitivity to Feature Distribution: It may not perform well when features have com-
plex, non-Gaussian distributions.
• Requires Sufficient Data: For some cases, Naive Bayes might not perform well when
there is a scarcity of data.
• Zero Probability Problem: If a feature-class combination does not exist in the training
data, the probability will be zero, causing issues. Smoothing techniques are often used
to address this.
7 Decision Trees
• A decision tree is a simple model for supervised classification. It is used for classifying
a single discrete target feature.
• Each internal node performs a Boolean test on an input feature (in general, a test may
have more than two options, but these can be converted to a series of Boolean tests).
The edges are labeled with the values of that input feature.
• Classifying an example using a decision tree is very intuitive. We traverse down the
tree, evaluating each test and following the corresponding edge. When a leaf is reached,
we return the classification on that leaf.
• Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
• In a decision tree, there are two kinds of nodes: decision nodes and leaf nodes. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
• The decisions or tests are performed on the basis of the features of the given dataset. It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which expands on further branches and constructs a tree-like structure.
• In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree algorithm.
• A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.
7.1 Terminologies
• Root Node: A decision tree’s root node, which represents the original choice or feature from which the tree branches, is the highest node.
• Internal Nodes (Decision Nodes): Nodes in the tree whose choices are determined by the values of particular attributes. These nodes have branches that go to other nodes.
• Leaf Nodes (Terminal Nodes): The branches’ termini, where choices or forecasts are decided upon. There are no further branches on leaf nodes.
• Branches (Edges): Links between nodes that show how decisions are made in response to particular circumstances.
• Splitting: The process of dividing a node into two or more sub-nodes based on a decision criterion. It involves selecting a feature and a threshold to create subsets of data.
• Parent Node: A node that is split into child nodes; the original node from which a split originates.
• Decision Criterion: The rule or condition used to determine how the data should be split at a decision node. It involves comparing feature values against a threshold.
• Pruning: The process of removing branches or nodes from a decision tree to improve its generalization and prevent overfitting.
7.2 Measures of impurity
While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure (ASM). Using this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM:
1. Information Gain
• Information gain is the measurement of the change in entropy after the segmentation of a dataset based on an attribute.
• It calculates how much information a feature provides us about a class.
• According to the value of information gain, we split the node and build the decision tree.
• A decision tree algorithm always tries to maximize the value of information gain, and a node/attribute having the highest information gain is split first.
It can be calculated using the formula below:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in data. Entropy can be calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
2. Gini Index
• The Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
• An attribute with a low Gini index should be preferred to one with a high Gini index.
• It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
The Gini index can be calculated using the formula below:
Gini Index = 1 − Σj (Pj)²
7.3 Advantages of Decision Trees:
• Feature Selection: They can automatically select the most important features, reducing the need for feature engineering.
• Versatility: Decision Trees can handle both categorical and numerical data.
• Efficiency: They are relatively efficient during prediction, with time complexity logarithmic in the number of data points.
7.4 Disadvantages of Decision Trees:
• Overfitting: Decision Trees can be prone to overfitting, creating complex models that don’t generalize well to new data. Pruning and setting appropriate parameters can help mitigate this.
• Bias Toward Dominant Classes: In classification tasks, Decision Trees can be biased toward dominant classes, leading to imbalanced predictions.
• Instability: Small variations in the data can lead to different tree structures, making them unstable models.
• Greedy Algorithm: Decision Trees use a greedy algorithm, making locally optimal decisions at each node, which may not lead to the globally optimal tree structure.
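To make the two impurity measures concrete, here is a small Python sketch that computes entropy and the Gini index for a set of class labels; the example labels are invented for illustration:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in S
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gini(labels):
    # Gini Index = 1 - sum(p_i^2)
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

labels = ["yes", "yes", "yes", "no", "no"]    # hypothetical node with 3 "yes" and 2 "no"
print("entropy:", round(entropy(labels), 3))  # 0.971
print("gini:", round(gini(labels), 3))        # 0.48
```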
8 Support Vector Machine
• ‘Support Vector Machine is a system for efficiently training linear learning machines in kernel-induced feature spaces, while respecting the insights of generalisation theory and exploiting optimisation theory.’
• The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put a new data point into the correct category in the future. This best decision boundary is called a hyperplane.
• SVMs pick the best separating hyperplane according to some criterion, e.g. maximum margin.
• The training process is an optimisation.
• The training set is effectively reduced to a relatively small number of support vectors.
• SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed the Support Vector Machine.
Consider the diagram below, in which there are two different categories that are classified using a decision boundary or hyperplane.
Here are the key concepts and characteristics of Support Vector Machines:
• Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find out the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of the SVM.
• In a binary classification problem, an SVM finds a hyperplane that best separates the data points of the different classes. This hyperplane is the decision boundary.
• The dimensions of the hyperplane depend on the features present in the dataset: if there are 2 features (as shown in the image), then the hyperplane will be a straight line, and if there are 3 features, then the hyperplane will be a 2-dimensional plane. We always create a hyperplane that has a maximum margin, which means the maximum distance between the data points.
• Support Vectors: The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors. They are critical for defining the margin and determining the location of the hyperplane.
• Margin: The margin is the distance between the support vectors and the decision boundary. SVM aims to maximize this margin because a larger margin often leads to better generalization.
• C Parameter: The regularization parameter “C” controls the trade-off between maximizing the margin and minimizing the classification error. A smaller “C” value results in a larger margin but may allow some misclassifications, while a larger “C” value allows for fewer misclassifications but a smaller margin.
• Multi-Class Classification: SVMs are inherently binary classifiers, but they can be extended to handle multi-class classification using techniques like one-vs-one (OvO) or one-vs-all (OvA) classification.
• The Scalar Product: The scalar or dot product is, in some sense, a measure of similarity: a · b = |a| |b| cos(θ)
8.1 Kernels
We may use kernel functions to implicitly map the data to a new feature space.
• Kernel function: K(x1, x2) ∈ R
• A kernel must be equivalent to an inner product in some feature space.
8.2 Classification Margin
• Kernel Trick: SVM can handle non-linearly separable data by using a kernel function to map the data into a higher-dimensional space where it becomes linearly separable. Common kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid kernels.
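As a small illustration of a kernel acting as a similarity measure, here is a sketch of the RBF (Gaussian) kernel in Python; the points and the gamma value are arbitrary examples:

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    # K(x1, x2) = exp(-gamma * ||x1 - x2||^2): close points give values near 1,
    # distant points give values near 0, so the kernel acts as a similarity score.
    return float(np.exp(-gamma * np.sum((np.asarray(x1) - np.asarray(x2)) ** 2)))

a = [1.0, 2.0]
b = [1.2, 1.9]   # close to a
c = [5.0, 7.0]   # far from a

print(rbf_kernel(a, b))  # close to 1
print(rbf_kernel(a, c))  # close to 0
```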
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. As it is a 2-D space, by just using a straight line we can easily separate these two classes, but there can be multiple lines that separate these classes.
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
2. Non-linear SVM: Non-linear SVM is used for non-linearly separated data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data and the classifier used is called a non-linear SVM classifier.
8.4 Advantages of Support Vector Machines:
• Effective in High-Dimensional Spaces: SVMs perform well even in high-dimensional feature spaces.
• Robust to Overfitting: SVMs are less prone to overfitting, especially when the margin is maximized.
• Accurate for Non-Linear Data: The kernel trick allows SVMs to work effectively on non-linear data by transforming it into higher dimensions.
• Wide Applicability: SVMs can be applied to various tasks, including classification, regression, and outlier detection.
Disadvantages of Support Vector Machines:
• Sensitivity to Kernel Choice: The choice of the kernel function and kernel parameters can significantly impact the SVM’s performance.
• Challenging for Large Datasets: SVMs may not be suitable for very large datasets because of their computational complexity.
9 Bias-Variance Trade-Off
• The goal of supervised machine learning is to learn or derive a target function that can best determine the target variable from the set of input variables.
• A key consideration in learning the target function from the training data is the extent of generalization. This is because the input data is just a limited, specific view, and the new, unknown data in the test data set may differ quite a bit from the training data.
• The fitness of a target function approximated by a learning algorithm determines how correctly it is able to classify a set of data it has never seen.
9.1 Underfitting
• If the target function is kept too simple, it may not be able to capture the essential nuances and represent the underlying data well.
• A typical case of underfitting may occur when trying to represent non-linear data with a linear model, as demonstrated by both cases of underfitting shown in Figure 1.1.
• Many times underfitting happens due to the unavailability of sufficient training data.
• Underfitting results in both poor performance with the training data as well as poor generalization to the test data. Underfitting can be avoided by
1. using more training data
2. reducing features by effective feature selection
9.2 Overfitting
• Overfitting refers to a situation where the model has been designed in such a way that it emulates the training data too closely. In such a case, any specific deviation in the training data, like noise or outliers, gets embedded in the model. It adversely impacts the performance of the model on the test data.
• Overfitting, in many cases, occurs as a result of trying to fit an excessively complex model to closely match the training data. This is represented with a sample data set in Figure 1.1. The target function, in these cases, tries to make sure all training data points are correctly partitioned by the decision boundary. However, more often than not, this exact nature is not replicated in the unknown test data set. Hence, the target function results in wrong classifications in the test data set.
• Overfitting results in good performance with the training data set, but poor generalization and hence poor performance with the test data set. Overfitting can be avoided by
1. using re-sampling techniques like k-fold cross-validation
2. holding back a validation data set
3. removing the nodes which have little or no predictive power for the given machine learning problem.
• Both underfitting and overfitting result in poor classification quality, which is reflected by low classification accuracy.
Errors due to ‘Bias’:
• Errors due to bias arise from simplifying assumptions made by the model to make the target function less complex or easier to learn. In short, they are due to underfitting of the model.
• Parametric models generally have high bias, making them easier to understand/interpret and faster to learn.
• These algorithms have poor performance on data sets which are complex in nature and do not align with the simplifying assumptions made by the algorithm.
• Underfitting results in high bias.
Errors due to ‘Variance’:
• Errors due to variance occur from differences in the training data sets used to train the model.
• Different training data sets (randomly sampled from the input data set) are used to train the model. Ideally the difference between the data sets should not be significant, and the models trained using different training data sets should not be too different.
• However, in the case of overfitting, since the model closely matches the training data, even a small difference in the training data gets magnified in the model.
So, the problems in training a model can happen because either
(a) the model is too simple and hence fails to interpret the data grossly, or
(b) the model is extremely complex and magnifies even small differences in the training data.
• Complex Models vs. Simple Models: Complex models (e.g., deep neural networks) tend to have low bias but high variance, whereas simple models (e.g., linear regression) tend to have high bias but low variance.
• Balancing Act: Machine learning practitioners aim to strike a balance between bias and variance to achieve a model with good generalization, one that performs well on both the training data and new, unseen data.
• Underfitting and Overfitting: The trade-off helps address the problems of underfitting (high bias) and overfitting (high variance). Underfit models don’t capture enough of the data’s complexity, while overfit models fit noise in the data.
• Model Complexity: Adjusting model complexity, such as the number of features, the choice of hyperparameters, and regularization techniques, is a way to manage the bias-variance trade-off.
Important Note
Increasing the bias will decrease the variance, and increasing the variance will decrease the bias. On one hand, parametric algorithms are generally seen to demonstrate high bias but low variance. On the other hand, non-parametric algorithms demonstrate low bias and high variance.
Figure 1.3.1
As can be observed in Figure 1.3.1, the best solution is to have a model with low bias as well as low variance. However, that may not be possible in reality. Hence, the goal of supervised machine learning is to achieve a balance between bias and variance. The learning algorithm chosen and the user parameters which can be configured help in striking a trade-off between bias and variance.
For example, in the popular supervised algorithm k-Nearest Neighbors (kNN), the user-configurable parameter ‘k’ can be used to trade off between bias and variance. On one hand, when the value of ‘k’ is increased, the model becomes simpler and the bias increases. On the other hand, when the value of ‘k’ is decreased, the variance increases.
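A rough way to see this trade-off empirically is to compare the training and test accuracy of a kNN classifier for several values of k. The sketch below assumes scikit-learn is available and uses a synthetic dataset purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class dataset (for illustration only)
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 5, 25, 101):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Small k: low bias / high variance (train accuracy very high, test accuracy may drop)
    # Large k: high bias / low variance (a smoother, simpler decision boundary)
    print(f"k={k:3d}  train acc={model.score(X_tr, y_tr):.2f}  test acc={model.score(X_te, y_te):.2f}")
```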
10 Cross-validation methods
• When the dataset is small, the simple hold-out method is prone to high variance. Due to the random partition, the results can be entirely different for different test sets. To deal with this issue, we use cross-validation to evaluate the performance of a machine-learning model.
• In cross-validation, we don’t divide the dataset into training and test sets only once. Instead, we repeatedly partition the dataset into smaller groups and then average the performance over the groups. That way, we reduce the impact of partition randomness on the results.
• Many cross-validation techniques define different ways to divide the dataset at hand. We’ll focus on the two most frequently used: the k-fold and the leave-one-out methods.
10.1 K-Fold Cross-Validation
K-Fold Cross-Validation is a widely used technique in machine learning for assessing the performance and generalization ability of a model. It involves dividing the dataset into ‘k’ subsets of approximately equal size, where one of these subsets is used as the test set and the remaining ‘k−1’ subsets are used as the training set. This process is repeated ‘k’ times, each time using a different subset as the test set.
Key points about K-Fold Cross-Validation:
• Data Splitting: The dataset is divided into ‘k’ subsets or folds, where each fold is used as the test set exactly once, and the rest are used for training.
• Bias-Variance Trade-Off: It helps in managing the bias-variance trade-off. The model’s performance is assessed under different training and test subsets, helping you detect issues like overfitting or underfitting.
• Hyperparameter Tuning: K-Fold Cross-Validation is often used for hyperparameter tuning. By trying different hyperparameters on different folds, you can choose the set of hyperparameters that yields the best average performance.
• K-Fold Variations: Variations include stratified K-Fold, which ensures that each fold has a similar class distribution, and repeated K-Fold, where the process is repeated multiple times with different random splits.
• Performance: The final model performance is typically determined by averaging the results of all ‘k’ iterations, such as mean accuracy or root mean squared error.
• Trade-Off: There is a trade-off between computational cost and model assessment quality. Larger ‘k’ values lead to a more accurate assessment but require more computation.
• Usage: K-Fold Cross-Validation is widely used in various machine learning tasks, including model selection, hyperparameter tuning, and performance estimation.
• Validation Set: In practice, a separate validation set might be used to validate the final model after hyperparameter tuning, while K-Fold Cross-Validation helps in assessing the overall performance of the model.
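A minimal sketch of k-fold evaluation, assuming scikit-learn is available; the dataset and the choice of model are arbitrary examples:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each fold serves as the test set exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print("fold accuracies:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))   # final performance = average over folds
```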
Comparison
An important factor when choosing between the k-fold and the LOO cross-validation methods
is the size of the dataset.
When the size is small, LOO is more appropriate since it will use more training samples in
each iteration. That will enable our model to learn better representations.
Conversely, we use k-fold cross-validation to train a model on a large dataset since LOO trains
n models, one per sample in the data. When our dataset contains a lot of samples, training
so many models will take too long. So, the k-fold cross-validation is more appropriate.
Also, in a large dataset, it is sufficient to use less than n folds since the test folds are large
enough for the estimates to be sufficiently precise.
Leave-One-Out Cross-Validation
• Bias and Variance: It tends to produce a more reliable, less biased estimate of a model’s performance than other cross-validation methods like k-fold cross-validation. However, LOO estimates can have high variance, and because it requires training one model per sample it is computationally expensive.
• Model Evaluation: LOO cross-validation allows you to assess how well the model
generalizes to unseen data and identify potential issues like overfitting or data leakage.
11 Feedforward Neural Networks
• Input Layer: This layer consists of neurons that receive inputs and pass them on to the next layer. The number of neurons in the input layer is determined by the dimensions of the input data.
• Hidden Layers: These layers are not exposed to the input or output and can be considered the computational engine of the neural network. Each hidden layer’s neurons take the weighted sum of the outputs from the previous layer, apply an activation function, and pass the result to the next layer. The network can have zero or more hidden layers.
• Output Layer: The final layer that produces the output for the given inputs. The number of neurons in the output layer depends on the number of possible outputs the network is designed to produce.
• Each neuron in one layer is connected to every neuron in the next layer, making this a fully connected network. The strength of the connection between neurons is represented by weights, and learning in a neural network involves updating these weights based on the error of the output.
The input and hidden layers use sigmoid and linear activation functions, whereas the output layer uses a Heaviside step activation function at its nodes, because this two-valued step function helps in predicting results as per requirements. All units, also known as neurons, have weights, and the calculation at a hidden layer is the summation of the dot product of all weights and their signals, followed by the sigmoid function of the calculated sum. Multiple hidden and output layers increase the accuracy of the output.
11.2 Neurons, Activation Functions, Weights and Biases
• Neurons
Nodes in the network that receive inputs, perform a weighted sum, and pass the result through an activation function.
11.3 Feedforward Process
1. The input data is fed into the input layer.
2. Each neuron in the hidden layers processes the input using weights, biases, and activation functions.
3. The output from each hidden layer is passed to the next layer.
4. This process continues until the output layer produces the final prediction.
11.4 How Feedforward Neural Networks Work
The working of a feedforward neural network involves two phases: the feedforward phase and the backpropagation phase.
• Feedforward Phase: In this phase, the input data is fed into the network, and it propagates forward through the network. At each hidden layer, the weighted sum of the inputs is calculated and passed through an activation function, which introduces non-linearity into the model. This process continues until the output layer is reached, and a prediction is made.
• Backpropagation Phase: Once a prediction is made, the error (difference between the predicted output and the actual output) is calculated. This error is then propagated back through the network, and the weights are adjusted to minimize this error. The process of adjusting weights is typically done using a gradient descent optimization algorithm.
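A minimal sketch of a single feedforward pass in Python with NumPy; the layer sizes, weights, and input are arbitrary values chosen only to illustrate steps 1-4 above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

x = rng.normal(size=3)                           # input layer: 3 features
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # hidden layer: 4 neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # output layer: 1 neuron

# Hidden layer: weighted sum of inputs, then activation function
h = sigmoid(W1 @ x + b1)
# Output layer: weighted sum of hidden activations, then activation
y_hat = sigmoid(W2 @ h + b2)

print("prediction:", y_hat)
```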
Tanh (Hyperbolic Tangent)
Similar to the sigmoid but outputs values between -1 and 1; it is often used in hidden layers.
Overfitting
When a model performs well on the training data but poorly on new, unseen data.
Regularization Techniques
Methods like dropout and L2 regularization are employed to prevent overfitting by penalizing overly complex models.
Multi-Layer Perceptron
• Hidden Layers: Between the input and output layers, there can be one or more hidden layers. These layers contain neurons, also known as units or nodes, which are responsible for learning complex patterns and relationships in the data. Hidden layers add the capacity to model non-linear functions. An MLP can have a varying number of hidden layers and units, depending on the problem’s complexity.
• Output Layer: The output layer is responsible for producing the final results or predictions. The number of output neurons depends on the nature of the task. For instance, in binary classification, there might be a single output neuron that outputs the probability of belonging to one class, while in multi-class classification, there could be multiple output neurons, each corresponding to a class.
12.2 Backpropagation
• Backpropagation is a technique used to optimize the weights of an MLP using the outputs as inputs.
• In a conventional MLP, random weights are assigned to all the connections. These random weights propagate values through the network to produce the actual output. Naturally, this output would differ from the expected output. The difference between the two values is called the error.
• Backpropagation refers to the process of sending this error back through the network, readjusting the weights automatically so that eventually the error between the actual and expected outputs is minimized.
• In this way, the output of the current iteration becomes the input for the next iteration and affects the next output. This is repeated until the correct output is produced. The weights at the end of the process would be the ones on which the neural network works correctly.
References
• Taeho Jo, Machine Learning Foundations: Supervised, Unsupervised, and Advanced Learning, Springer.
• IIT Madras BS Degree Lectures and Notes
• NPTEL Lectures and Slides
• www.medium.com
• geeksforgeeks.org
• javatpoint.com