Unit 3
Learning with Trees – Decision Trees – Constructing Decision Trees – Classification and Regression Trees – Ensemble Learning – Boosting – Bagging – Different Ways to Combine Classifiers – Basic Statistics – Gaussian Mixture Models – Nearest Neighbor Methods – Unsupervised Learning – K-Means Algorithms
Machine Learning algorithms are programs that can learn hidden patterns from data, predict outputs, and improve their performance from experience on their own. Different algorithms are used for different tasks: for example, simple linear regression can be used for prediction problems such as stock market prediction, while the KNN algorithm can be used for classification problems.
In this topic, we will see an overview of some popular and commonly used machine learning algorithms, along with their use cases and categories.
The diagram below illustrates the different ML algorithms along with their categories:
1) Supervised Learning Algorithm
Supervised learning is a type of machine learning in which the machine needs external supervision to learn. Supervised learning models are trained using a labeled dataset. Once training and processing are done, the model is tested with sample test data to check whether it predicts the correct output.
The goal of supervised learning is to map input data to output data.
Supervised learning is based on supervision, much as a student learns things under a teacher's supervision. An example of supervised learning is spam filtering.
Supervised learning can be further divided into two categories of problem:
○ Classification
○ Regression
2) Unsupervised Learning Algorithm
In unsupervised learning, the machine is trained on unlabeled data and learns the hidden patterns in it without any external supervision. Unsupervised learning can be further divided into two categories of problem:
○ Clustering
○ Association
3) Reinforcement Learning
In reinforcement learning, an agent interacts with its environment by producing actions and learns with the help of feedback. The feedback is given to the agent in the form of rewards: for each good action, it gets a positive reward, and for each bad action, it gets a negative reward. There is no supervision provided to the agent. The Q-Learning algorithm is commonly used in reinforcement learning.
1. Linear Regression
Linear regression is one of the most popular and simplest machine learning algorithms used for predictive analysis. Predictive analysis here means predicting something, and linear regression makes predictions for continuous numbers such as salary, age, etc.
It tries to fit the best line between the dependent and independent variables, and this best-fit line is known as the regression line.
y = a0 + a1·x + ε
Here, y = dependent variable
x = independent variable
a0 = intercept of the line
a1 = linear regression coefficient (slope)
ε = random error
The below diagram shows the linear regression for prediction of weight
according to height:
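As a minimal illustrative sketch (the height/weight numbers below are made up, not taken from the text), the same kind of fit can be produced with scikit-learn's LinearRegression:

```python
# Minimal sketch: fitting y = a0 + a1*x with scikit-learn (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression

heights = np.array([[150], [160], [170], [180], [190]])   # independent variable x (cm)
weights = np.array([55, 62, 68, 77, 85])                  # dependent variable y (kg)

model = LinearRegression().fit(heights, weights)
print("intercept a0:", model.intercept_)    # where the regression line crosses the y-axis
print("slope a1:", model.coef_[0])          # change in weight per cm of height
print("predicted weight at 175 cm:", model.predict([[175]])[0])
```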
2. Logistic Regression
Logistic regression is a supervised learning algorithm used to predict categorical variables or discrete values. It can be used for classification problems in machine learning, and the output of the logistic regression algorithm can be either Yes or No, 0 or 1, Red or Blue, etc.
Logistic regression is similar to linear regression except in how they are used: linear regression is used to solve regression problems and predict continuous values, whereas logistic regression is used to solve classification problems and predict discrete values.
Instead of fitting a best-fit line, it forms an S-shaped curve that lies between 0 and 1. The S-shaped curve is also known as the logistic function, which uses the concept of a threshold: any value above the threshold tends to 1, and any value below the threshold tends to 0.
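A small sketch of the logistic (sigmoid) function and the 0.5 threshold described above; the input scores are arbitrary illustrative values:

```python
# Minimal sketch of the logistic (sigmoid) function and a 0.5 threshold.
import numpy as np

def sigmoid(z):
    """Maps any real value into the (0, 1) range: the S-shaped logistic curve."""
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-3.0, -0.5, 0.0, 0.8, 2.5])     # arbitrary example inputs
probabilities = sigmoid(scores)
predictions = (probabilities >= 0.5).astype(int)   # values above the threshold tend to 1
print(probabilities)
print(predictions)
```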
4. Support Vector Machine (SVM) Algorithm
A support vector machine finds the hyperplane that best separates the classes. The data points that help to define this hyperplane are known as support vectors, and hence the algorithm is named the support vector machine algorithm.
Some real-life applications of SVM are face detection, image classification, drug discovery, etc. Consider the below diagram:
As we can see in the above diagram, the hyperplane has classified the dataset into two different classes.
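As a hedged sketch, assuming scikit-learn and a toy two-class dataset generated with make_blobs, a linear SVM exposes the support vectors that define its hyperplane:

```python
# Minimal sketch: a linear SVM with scikit-learn on a toy two-class dataset (assumed data).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)   # two separable clusters
clf = SVC(kernel="linear").fit(X, y)

print("number of support vectors per class:", clf.n_support_)
print("support vectors (points that define the hyperplane):")
print(clf.support_vectors_[:3])
```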
5. Naïve Bayes Algorithm:
The Naïve Bayes classifier is a supervised learning algorithm used to make predictions based on the probability of an object. The algorithm is named Naïve Bayes because it is based on Bayes' theorem and follows the naïve assumption that the variables are independent of each other.
Bayes' theorem is based on conditional probability; it gives the likelihood that event A will happen given that event B has already happened. The equation for Bayes' theorem is:
P(A|B) = [P(B|A) · P(A)] / P(B)
The Naïve Bayes classifier is one of the best classifiers that provides good results for a given problem. It is easy to build a naïve Bayesian model, and it is well suited to huge datasets. It is mostly used for text classification.
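A minimal text-classification sketch with scikit-learn's MultinomialNB; the four example messages and their spam/ham labels are invented for illustration:

```python
# Minimal sketch: Naive Bayes for text classification (tiny made-up corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "meeting at noon tomorrow",
         "free lottery win", "project report attached"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)        # word-count features
clf = MultinomialNB().fit(X, labels)

test = vectorizer.transform(["free prize meeting"])
print(clf.predict(test))                   # predicted class
print(clf.predict_proba(test))             # P(class | words), from Bayes' theorem
```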
7. K-Means Clustering
K-means clustering is one of the simplest unsupervised learning algorithms, used to solve clustering problems. The dataset is grouped into K different clusters based on similarities and dissimilarities: data points with the most commonalities remain in one cluster, which has few or no commonalities with the other clusters. In K-means, K refers to the number of clusters, and "means" refers to averaging the data points in order to find the centroid.
The algorithm starts with a group of randomly selected centroids that form the initial clusters and then performs an iterative process to optimize these centroids' positions.
It can be used for spam detection and filtering, identification of fake news, etc.
8. Random Forest Algorithm
Random forest contains multiple decision trees built on subsets of the given dataset and averages their results to improve the predictive accuracy of the model. A random forest should contain 64-128 trees; a greater number of trees leads to higher accuracy.
To classify a new dataset or object, each tree gives its classification result, and based on the majority of votes, the algorithm predicts the final output.
Random forest is a fast algorithm and can deal efficiently with missing and incorrect data.
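A minimal sketch of a random forest with majority voting, assuming scikit-learn and the iris dataset (the tree count of 100 is just an illustrative choice within the range mentioned above):

```python
# Minimal sketch: a Random Forest of 100 trees with majority voting (iris data).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print("accuracy:", forest.score(X_test, y_test))   # each tree votes; the majority wins
```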
9. Apriori Algorithm
The Apriori algorithm is an unsupervised learning algorithm used to solve association problems. It uses frequent itemsets to generate association rules and is designed to work on databases that contain transactions. With the help of these association rules, it determines how strongly or how weakly two objects are connected to each other. The algorithm uses a breadth-first search and a hash tree to compute the itemsets efficiently.
The algorithm works iteratively to find the frequent itemsets in a large dataset.
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994. It is mainly used for market basket analysis and helps to understand which products can be bought together. It can also be used in the healthcare field to find drug reactions in patients.
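The following is a simplified, pure-Python illustration of the support and confidence measures behind association rules, not a full Apriori implementation; the transactions and item names are made up:

```python
# Simplified illustration of support and confidence on toy transactions
# (not a full Apriori implementation; the candidate itemsets are hard-coded).
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Frequent pairs with support >= 0.5
for pair in combinations({"bread", "milk", "butter"}, 2):
    s = support(set(pair))
    if s >= 0.5:
        print(pair, "support:", s)

# Confidence of the rule {bread} -> {milk}: support(bread, milk) / support(bread)
print("confidence(bread -> milk):", support({"bread", "milk"}) / support({"bread"}))
```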
10. Principal Component Analysis (PCA)
PCA works by considering the variance of each attribute, because high variance shows a good split between the classes, and hence it reduces the dimensionality.
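A minimal sketch of PCA with scikit-learn, reducing the four iris features to the two directions of highest variance:

```python
# Minimal sketch: reducing 4-dimensional data to 2 principal components (iris data).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                 # keep the two highest-variance directions
print("explained variance ratio:", pca.explained_variance_ratio_)
print("reduced shape:", X_reduced.shape)
```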
CLASSIFICATION VS REGRESSION
Classification Algorithms
Classification is the process of finding or discovering a model or function that
helps in separating the data into multiple categorical classes i.e. discrete values.
In classification, data is categorized under different labels according to some
parameters given in the input and then the labels are predicted for the data.
Decision Tree
Random Forest Classifier
K – Nearest Neighbors
Support Vector Machine
Regression Algorithms
Regression is the process of finding a model or function that maps the data to continuous real values instead of classes or discrete values. It can also identify the distribution movement depending on the historical data.
Because a regression predictive model predicts a quantity, the skill of the model must be reported as an error in those predictions.
Lasso Regression
Ridge Regression
XGBoost Regressor
LGBM Regressor
Comparison between Classification and Regression

Classification: In this problem statement, the target variables are discrete.
Regression: In this problem statement, the target variables are continuous.

Classification: Problems like spam email classification and disease prediction are solved using classification algorithms.
Regression: Problems like house price prediction and rainfall prediction are solved using regression algorithms.

Classification: The algorithm tries to find the best possible decision boundary, which separates the two classes with the maximum possible separation.
Regression: The algorithm tries to find the best-fit line, which represents the overall trend in the data.

Classification: Evaluation metrics like Precision, Recall, and F1-Score are used to evaluate the performance of classification algorithms.
Regression: Evaluation metrics like Mean Squared Error, R2-Score, and MAPE are used to evaluate the performance of regression algorithms.

Classification: Here we face problems like binary classification or multi-class classification.
Regression: Here we face linear regression models as well as non-linear models.

Classification: The input data are independent variables and a categorical dependent variable.
Regression: The input data are independent variables and a continuous dependent variable.

Classification: The classification algorithm's task is mapping the input value x to a discrete output variable y.
Regression: The regression algorithm's task is mapping the input value x to a continuous output variable y.
So to separate these data points, we need to add one more dimension. For linear
data, we have used two dimensions x and y, so for non-linear data, we will add a
third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider
the below image:
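A small sketch of this idea, assuming scikit-learn's make_circles for the non-linear data: adding the third dimension z = x² + y² lets a linear SVM separate points that are not linearly separable in two dimensions:

```python
# Minimal sketch: making circular (non-linear) data linearly separable by adding z = x^2 + y^2.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)    # the extra third dimension
X3d = np.hstack([X, z])

linear_2d = SVC(kernel="linear").fit(X, y).score(X, y)       # struggles in 2-D
linear_3d = SVC(kernel="linear").fit(X3d, y).score(X3d, y)   # separable with the z axis
print("linear SVM in 2-D:", linear_2d)
print("linear SVM with z = x^2 + y^2:", linear_3d)
```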
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
In a decision tree, to predict the class of a given dataset, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record (the real dataset) and, based on the comparison, follows the branch and jumps to the next node.
At the next node, the algorithm again compares the attribute value with the sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm:
○ Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
○ Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
○ Step-3: Divide S into subsets that contain the possible values for the best attribute.
○ Step-4: Generate the decision tree node, which contains the best attribute.
○ Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot classify the nodes any further; the final node is called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer). Consider the below diagram:
○ Information Gain
○ Gini Index
1. Information Gain:
○ Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
○ It calculates how much information a feature provides us about a class.
○ According to the value of information gain, we split the node and build
the decision tree.
○ A decision tree algorithm always tries to maximize the value of
information gain, and a node/attribute having the highest information
gain is split first. It can be calculated using the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. Entropy can be calculated as:
Entropy(S) = −P(yes)·log₂ P(yes) − P(no)·log₂ P(no)
where S is the total number of samples, P(yes) is the probability of "yes", and P(no) is the probability of "no".
2. Gini Index:
○ The Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
○ An attribute with a low Gini index should be preferred over one with a high Gini index.
○ It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
○ The Gini index can be calculated using the below formula:
Gini Index = 1 − ∑j Pj²
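A small sketch that computes entropy, information gain, and the Gini index for a toy split; the yes/no counts are invented for illustration:

```python
# Minimal sketch: entropy, information gain, and Gini index for a toy split
# (the counts are made up for illustration).
from math import log2

def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * log2(p)
    return result

def gini(pos, neg):
    total = pos + neg
    return 1 - (pos / total) ** 2 - (neg / total) ** 2

# Parent node: 9 "yes", 5 "no"; a candidate attribute splits it into (6 yes, 2 no) and (3 yes, 3 no).
parent = entropy(9, 5)
weighted_children = (8 / 14) * entropy(6, 2) + (6 / 14) * entropy(3, 3)
print("information gain:", parent - weighted_children)
print("Gini of parent node:", gini(9, 5))
```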
A tree that is too large increases the risk of overfitting, while a small tree may not capture all the important features of the dataset. A technique that decreases the size of the learning tree without reducing accuracy is therefore known as pruning. There are mainly two types of tree pruning techniques used:
○ Cost Complexity Pruning
○ Reduced Error Pruning
Ensemble Learning
Several individual base models (experts) are fitted to learn from the same data, and an aggregation of their outputs is produced, based on which a final decision is taken. These base models can be machine learning algorithms such as decision trees (most commonly used), linear models, support vector machines (SVMs), neural networks, or any other model that is capable of making predictions.
Bagging Algorithm
Bootstrap Sampling: 'N' subsets of the original training data are created by randomly selecting rows with replacement, so some rows may appear in several subsets and others may be left out. This step ensures that the base models are trained on diverse subsets of the data.
Base Model Training: For each bootstrapped sample, a base model is trained independently on that subset of the data. These weak models are trained in parallel to increase computational efficiency and reduce time consumption.
Final Prediction: After aggregating the predictions from all the base models, bagging produces a final prediction for each instance.
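A minimal sketch of bagging with scikit-learn's BaggingClassifier on the iris dataset (its default base model is a decision tree, and the choice of 50 estimators is illustrative):

```python
# Minimal sketch: bagging with scikit-learn's BaggingClassifier (iris data).
# The default base model is a decision tree, trained on bootstrapped subsets.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bagger = BaggingClassifier(
    n_estimators=50,      # number of bootstrapped subsets / base models
    bootstrap=True,       # sample the training rows with replacement
    random_state=0,
).fit(X_train, y_train)

print("bagged accuracy:", bagger.score(X_test, y_test))   # majority vote over the 50 models
```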
Bagging Vs Boosting
We all use the decision tree technique in day-to-day life to make decisions. Organizations use supervised machine learning techniques like decision trees to make better decisions and to generate more surplus and profit.
There are two techniques, given below, that are used to build an ensemble of decision trees.
Bagging
Bagging is used when our objective is to reduce the variance of a decision tree. The idea is to create several subsets of data from the training sample, chosen randomly with replacement. Each subset of data is then used to train its own decision tree, so we end up with an ensemble of different models. The average of all the predictions from the numerous trees is used, which is more powerful than a single decision tree.
This is how a Random forest is implemented: since the final prediction depends on the mean of the predictions from the subset trees, it may not give a precise value for a regression model.
Boosting:
Boosting is another ensemble procedure for making a collection of predictors. Here we fit consecutive trees, usually on random samples, and at each step the objective is to reduce the net error from the prior trees.
If a given input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly. Combining the whole set at the end converts weak learners into better-performing models.
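A minimal sketch of boosting with scikit-learn's AdaBoostClassifier on the iris dataset; each new weak learner gives more weight to the samples the previous ones misclassified:

```python
# Minimal sketch: boosting with scikit-learn's AdaBoostClassifier (iris data).
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 sequential weak learners; each focuses on previously misclassified samples.
booster = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print("boosted accuracy:", booster.score(X_test, y_test))
```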
Bagging: Various training data subsets are randomly drawn with replacement from the whole training dataset.
Boosting: Each new subset contains the components that were misclassified by previous models.
The greater number of trees in the forest leads to higher accuracy and prevents
the problem of overfitting.
The below diagram explains the working of the Random Forest algorithm:
Assumptions for Random Forest
Since the random forest combines multiple trees to predict the class of the dataset, it is possible that some decision trees may predict the correct output while others may not. But together, all the trees predict the correct output. Therefore, below are two assumptions for a better Random forest classifier:
○ There should be some actual values in the feature variable of the dataset so that the classifier can predict accurate results rather than guessed results.
○ The predictions from each tree must have very low correlations.
The working process can be explained in the below steps and diagram:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data point to the category that wins the majority of votes.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this
dataset is given to the Random forest classifier. The dataset is divided into
subsets and given to each decision tree. During the training phase, each decision
tree produces a prediction result, and when a new data point occurs, then based
on the majority of results, the Random Forest classifier predicts the final decision.
Consider the below image:
BASIC STATISTICS
Gathering Data
Gathering data is the first step in statistical analysis.
Say for example that you want to know something about all the people in
France.
A representative sample
The sample needs to be similar to the whole population of France. It should
have the same characteristics as the population. If you only include people
named Jacques living in Paris who are 48 years old, the sample will not be
similar to the whole population.
So for a good sample, you will need people from all over France, with different
ages, professions, and so on.
If the members of the sample have similar characteristics (like age, profession,
etc.) to the whole population of France, we say that the sample is
representative of the population.
Descriptive Statistics
The information (data) from your sample or population can be visualized with
graphs or summarized by numbers. This will show key information in a simpler
way than just looking at raw data. It can help us understand how the data is
distributed.
Graphs can visually show the data distribution.
● Histograms
● Pie charts
● Bar graphs
● Box plots
For example, a box plot visually shows the quartiles of a data distribution.
Quartiles are the data split into four equal size parts, or quarters. A quartile is
one type of summary statistics.
Summary statistics
Summary statistics take a large amount of information and sum it up in a few key values.
Numbers calculated from the data also describe the shape of the distribution. These are individual 'statistics'.
Statistical Inference
Statistics from the data in the sample are used to make conclusions about the whole population. This is a type of statistical inference.
Probability theory is used to calculate the certainty that those statistics also
apply to the population.
When using a sample, there will always be some uncertainty about what the
data looks like for the population.
Confidence intervals are numerical ways of showing how likely it is that the true
value of this statistic is within a certain range for the population.
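As a rough sketch, assuming a made-up sample and the normal approximation (about 1.96 standard errors for a 95% interval), a confidence interval for a population mean can be computed like this:

```python
# Minimal sketch: a rough 95% confidence interval for a population mean,
# estimated from a made-up sample using the normal approximation.
from statistics import mean, stdev
from math import sqrt

sample = [48, 52, 45, 50, 47, 53, 49, 51, 46, 50]    # e.g. ages of sampled people
n = len(sample)
m = mean(sample)
standard_error = stdev(sample) / sqrt(n)

# Roughly 95% of estimates fall within about 1.96 standard errors of the true mean.
low, high = m - 1.96 * standard_error, m + 1.96 * standard_error
print(f"sample mean = {m:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```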
Causal Inference
Causal inference is used to investigate if something causes another thing.
If we think two things are related we can investigate to see if they correlate.
Statistics can be used to find out how strong this relation is.
Even if things are correlated, finding out whether one thing is caused by another can be difficult. It can be done with good experimental design or other special statistical techniques.
The terms 'population' and 'sample' are important in statistics and refer to
key concepts that are closely related.
Population and Sample
The sample is used to make conclusions about the whole population. If the
sample is not similar enough to the whole population, the conclusions could be
useless.
The parameters are the key things we want to learn about. The parameters are
usually unknown.
There will always be some uncertainty about how accurate estimates are. More
certainty gives us more useful knowledge.
For every parameter we want to learn about we can get a sample and calculate
a sample statistic, which gives us an estimate of the parameter.
Mean, median and mode are different types of averages (typical values in a population).
For example: the mean is the sum of all values divided by the number of values, the median is the middle value when the values are sorted, and the mode is the most frequently occurring value.
Variance and standard deviation are two types of values describing how spread
out the values are.
A single class of students in a school would usually be about the same age. The
age of the students will have low variance and standard deviation.
A whole country will have people of all kinds of different ages. The variance and
standard deviation of age in the whole country would then be bigger than in a
single school grade.
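A small sketch with Python's statistics module, using invented ages for a single class and for a whole country to show the difference in spread:

```python
# Minimal sketch: summary statistics with Python's statistics module (illustrative ages).
import statistics

class_ages = [15, 15, 16, 15, 16, 15, 16]        # one school class: ages are similar
country_ages = [3, 15, 15, 27, 41, 56, 68, 80]   # a whole country: ages vary widely

for name, ages in [("class", class_ages), ("country", country_ages)]:
    print(name,
          "mean:", statistics.mean(ages),
          "median:", statistics.median(ages),
          "mode:", statistics.mode(ages),
          "variance:", round(statistics.variance(ages), 1),
          "std dev:", round(statistics.stdev(ages), 1))
```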
Note: Every other sampling method is compared to how close it is to a random sample -
the closer, the better.
Convenience Sampling
A convenience sample is where the participants that are the easiest to reach are
chosen.
In many cases this sample will not be similar enough to the population, and the
conclusions can potentially be useless.
Systematic Sampling
A systematic sample is where the participants are chosen by some regular system.
For example: choosing every 10th person on a list of the population.
Stratified Sampling
A stratified sample is where the population is split into smaller groups called 'strata'. The sample is then drawn by randomly selecting participants from each stratum.
Clustered Sampling
A clustered sample is where the population is split into smaller groups called 'clusters'. A set of clusters is then randomly selected.
All members of the selected clusters can participate in the sample, or members can be chosen randomly from the clusters in a third step.
Qualitative Data
Information about something that can be sorted into different categories that
can't be described directly by numbers.
Examples:
● Brands
● Nationality
● Professions
With categorical data we can calculate statistics like proportions. For example,
the proportion of Indian people in the world, or the percent of people who prefer
one brand to another.
Quantitative Data
Information about something that is described by numbers.
Examples:
● Income
● Age
● Height
With numerical data we can calculate statistics like the average income in a
country, or the range of heights of players in a football team.
Measurement Levels
The main types of data are Qualitative (categories) and Quantitative
(numerical). These are further split into the following measurement levels.
Nominal Level
Categories (qualitative data) without any order.
Examples:
● Brand names
● Countries
● Colors
Ordinal level
Categories that can be ordered (from low to high), but the precise "distance" between them is not meaningful.
Examples:
● Letter grades (A, B, C, D, F)
Exactly how much distance there is between grades is not clear or precise. If the grades are based on the number of points on a test, you can say that there is a precise "distance" on the point scale, but not between the grades themselves.
Interval Level
Data that can be ordered and the distance between them is objectively
meaningful. But there is no natural 0-value where the scale originates.
Examples:
● Years in a calendar
● Temperature measured in Fahrenheit
Note: Interval scales are usually invented by people, like degrees of temperature.
Ratio Level
Data that can be ordered and there is a consistent and meaningful distance
between them. And it also has a natural 0-value.
Examples:
● Money
● Age
● Time
Data that is on the ratio level (or "ratio scale") gives us the most detailed
information. Crucially, we can compare precisely how big one value is compared
to another. This would be the ratio between these values, like twice as big, or
ten times as small.
Gaussian Mixture
○ The GaussianMixture object implements the expectation-maximization (EM) algorithm for fitting a mixture of Gaussian models.
○ It can also compute the Bayesian Information Criterion (BIC) to determine how many clusters there are in the data, and it can draw confidence ellipsoids for multivariate models.
○ A Gaussian mixture model can be learned from training data using the GaussianMixture.fit method.
○ Given test data, the GaussianMixture.predict method assigns each sample the Gaussian it most likely belongs to.
GaussianMixture offers several options to constrain the covariance of the estimated components: spherical, diagonal, tied, and full covariance.
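A minimal sketch, assuming scikit-learn and the iris dataset, of fitting a GaussianMixture, choosing a covariance_type, and assigning each sample to its most likely Gaussian:

```python
# Minimal sketch: fitting a Gaussian mixture and assigning samples to their
# most likely Gaussian (iris data, as in the description above).
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X, y = load_iris(return_X_y=True)

gmm = GaussianMixture(n_components=3,          # number of Gaussians / clusters
                      covariance_type="full",  # also: "spherical", "diag", "tied"
                      random_state=0)
gmm.fit(X)                                     # EM runs inside fit()
labels = gmm.predict(X)                        # most likely component for each sample
print("BIC:", gmm.bic(X))                      # Bayesian Information Criterion
print("first ten cluster labels:", labels[:10])
```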
GMM covariances
○ A variety of covariance types are demonstrated for Gaussian mixture models.
○ For further details on the estimator, refer to the Gaussian mixture models documentation.
○ Although GMM is frequently employed for clustering, we can compare the resulting clusters with the dataset's real classes.
○ To make this comparison valid, we initialize the Gaussian means using the means of the classes in the training set.
○ On the iris dataset, we use several GMM covariance types and plot the predicted labels on both the training data and held-out test data.
○ We compare GMMs with spherical, diagonal, full, and tied covariance matrices.
○ Full covariance is expected to perform best overall.
○ The plots display the test data as crosses and the training data as dots. The iris dataset is four-dimensional.
○ Only the first two dimensions are shown here, so some points that look close in the plot are separated by the other dimensions.
Pros:
Speed: It is one of the fastest algorithms for learning mixture models.
Cons:
Singularities: When there are too few points per mixture component, estimating the covariance matrices becomes difficult, and the algorithm can diverge towards solutions with infinite likelihood unless the covariances are regularized artificially.
Number of components: In the absence of external cues, this algorithm will always use all the components it has access to, needing held-out data or information-theoretical criteria to decide how many components to use.
Algorithm of Expectation-Maximization
The EM algorithm calculates maximum-likelihood estimates for model parameters in cases where the data is incomplete, has missing data points, or contains hidden (latent) variables. EM starts by assigning random values to the missing data points and estimating the parameters from them; these new parameter estimates are then used recursively to produce better guesses for the missing values, and the process repeats until it converges.
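To make the E-step and M-step concrete, here is an illustrative hand-written EM loop for a one-dimensional mixture of two Gaussians; the data and starting guesses are synthetic, and in practice a library implementation such as GaussianMixture would be used:

```python
# Illustrative sketch: EM for a 1-D mixture of two Gaussians, written out by hand
# so the E-step (responsibilities) and M-step (parameter updates) are visible.
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])

# Initial guesses for the hidden parameters.
mu = np.array([1.0, 4.0])
sigma = np.array([1.0, 1.0])
weight = np.array([0.5, 0.5])

def gaussian_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: responsibility of each component for each point.
    dens = np.stack([w * gaussian_pdf(data, m, s)
                     for w, m, s in zip(weight, mu, sigma)])   # shape (2, n)
    resp = dens / dens.sum(axis=0)

    # M-step: re-estimate weights, means, and standard deviations.
    nk = resp.sum(axis=1)
    weight = nk / len(data)
    mu = (resp * data).sum(axis=1) / nk
    sigma = np.sqrt((resp * (data - mu[:, None]) ** 2).sum(axis=1) / nk)

print("means:", mu, "std devs:", sigma, "weights:", weight)
```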
Nearest Neighbor Methods: K-Nearest Neighbors (KNN)
Suppose we have a new data point and we need to put it in the required category. Consider the below image:
○ Firstly, we will choose the number of neighbors, so we will choose k = 5.
○ Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. Between points (x1, y1) and (x2, y2) it can be calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
○ By calculating the Euclidean distance, we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:
○ As we can see, the 3 nearest neighbors are from category A; hence this new data point must belong to category A.
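A minimal sketch of the same procedure with scikit-learn's KNeighborsClassifier, using k = 5 and Euclidean distance; the two-category data and the new point are invented stand-ins for categories A and B:

```python
# Minimal sketch: k-nearest neighbours with k=5 and Euclidean distance
# (toy two-category data standing in for "category A" and "category B").
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=60, centers=2, random_state=1)   # categories A (0) and B (1)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X, y)
new_point = [[0.0, 4.0]]                                      # assumed new data point
print("predicted category:", knn.predict(new_point))
print("share of the 5 neighbours in each category:", knn.predict_proba(new_point))
```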
Unsupervised Learning: K-Means Clustering
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
Hence each cluster has data points with some commonalities, and it is far away from the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, which means reassigning each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise the model is ready.
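A minimal sketch of these steps using scikit-learn's KMeans with K = 2; the points are a made-up stand-in for the M1/M2 scatter plot discussed below:

```python
# Minimal sketch: K-means with scikit-learn on a toy version of the M1/M2 scatter plot
# (the points are made up; K=2 as in the walkthrough below).
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5], [4.5, 5], [3.5, 4.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster labels:", kmeans.labels_)           # which of the K clusters each point joined
print("final centroids:", kmeans.cluster_centers_)
```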
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:
○ Let's take number k of clusters, i.e., K=2, to identify the dataset and to
put them into different clusters. It means here we will try to group these
datasets into two different clusters.
○ We need to choose some random k points or centroid to form the
cluster. These points can be either the points from the dataset or any
other point. So, here we are selecting the below two points as k points,
which are not the part of our dataset. Consider the below image:
○ Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute this by applying the mathematics that we have studied for calculating the distance between two points. So, we will draw a median (perpendicular bisector) between both centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid, and the points on the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.
○ As we need to find the closest cluster, so we will repeat the process by
choosing a new centroid. To choose the new centroids, we will compute
the center of gravity of these centroids, and will find new centroids as
below:
○ Next, we will reassign each datapoint to the new centroid. For this, we
will repeat the same process of finding a median line. The median will
be like below image:
From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, so we will again go to the step-4, which is
finding new centroids or K-points.
○ We will repeat the process by finding the center of gravity of centroids,
so the new centroids will be as shown in the below image:
○ As we got the new centroids so again will draw the median line and
reassign the data points. So, the image will be:
○ We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and the
two final clusters will be as shown in the below image:
How to choose the value of "K number of clusters" in
K-means Clustering?
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number
of clusters. This method uses the concept of WCSS value. WCSS stands for Within
Cluster Sum of Squares, which defines the total variations within a cluster. The
formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²
∑Pi in Cluster1 distance(Pi, C1)² is the sum of the squares of the distances between each data point and its centroid within cluster 1, and the same applies to the other two terms.
To measure the distance between data points and centroid, we can use any
method such as Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
○ It executes K-means clustering on the given dataset for different K values (for example, ranging from 1 to 10).
○ For each value of K, it calculates the WCSS value.
○ It plots a curve between the calculated WCSS values and the number of clusters K.
○ The sharp point of bend, where the plot looks like an arm, is considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, it is known as the elbow method. The graph for the elbow method looks like the below image:
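A small sketch of the elbow method, assuming scikit-learn (whose inertia_ attribute is the WCSS value) and toy blob data; in practice the printed values would be plotted against K to find the bend:

```python
# Minimal sketch: the elbow method, using scikit-learn's inertia_ as the WCSS value
# for K = 1..10 (toy blob data).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(kmeans.inertia_, 1))   # WCSS drops sharply until the "elbow", then flattens
```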
Note: We can choose the number of clusters equal to the given data points. If we
choose the number of clusters equal to the data points, then the value of WCSS
becomes zero, and that will be the endpoint of the plot.