
COURSE- B.TECH.

(CSE)- V SEM
SUBJECT: MACHINE LEARNING
MODULE-3 (EVALUATING ML ALGO AND STATISTICAL LEARNING)

 Evaluating ML algorithm and Model Selection:


Model selection is an essential phase in the development of powerful and precise predictive
models in the field of machine learning. Model selection is the process of deciding which
algorithm and model architecture is best suited for a particular task or dataset.
A machine learning model does not rely on hard-coded rules: we feed a large amount of data to the model, and the model tries to figure out the relevant patterns on its own to make future predictions. We must therefore also use techniques to determine the predictive power of the model.

 Machine Learning Model Evaluation:


Model evaluation is the process of using metrics to analyze the performance of a model. Model development is a multi-step process, and a check should be kept on how well the model generalizes to future predictions. Evaluating a model therefore plays a vital role: it lets us judge the model's performance and also helps to analyze the model's key weaknesses.
There are many metrics, such as Accuracy, Precision, Recall, F1 score, Area Under the Curve (AUC), the Confusion Matrix, and Mean Squared Error.
Cross Validation is a technique followed during the training phase, and it serves as a model evaluation technique as well.
 Resampling methods:
As the name implies, resampling methods are straightforward methods of rearranging data samples to see how well the model performs on samples of data it hasn't been trained on. Resampling, in other words, enables us to determine the model's generalizability.

There are two main types of re-sampling techniques:

a. Cross-validation:
It is a resampling procedure to evaluate models by splitting the data. Consider a situation where you
have two models and want to determine which one is the most appropriate for a certain issue. In this
case, we can use a cross-validation process.
So, let’s say you are working on an SVM model. We divide the dataset into five groups (folds) and iterate over them: in each iteration, one group out of the five is used as test data, and the model is evaluated on that test data after being trained on the remaining groups.

Let's say you calculated the accuracy of each iteration; the figure below illustrates the iteration and
accuracy of that iteration.

Fig: Cross-validation example (accuracy per iteration)

Now, let's calculate the mean accuracy of all the iterations, which comes to around 84.4%. You
now use the same procedure once again for the logistic regression model.
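The comparison above can be sketched with scikit-learn as follows; the dataset here is synthetic and merely stands in for whatever data you are working with, so the exact accuracies will differ from the 84.4% figure above.

```python
# A minimal sketch of comparing two models with 5-fold cross-validation (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic data standing in for the example dataset above.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

svm_scores = cross_val_score(SVC(), X, y, cv=5)                                # accuracy of each iteration
logreg_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("SVM mean accuracy:", svm_scores.mean())
print("Logistic regression mean accuracy:", logreg_scores.mean())
```

Whichever model gives the higher mean cross-validated accuracy would be the one selected for this problem.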

 Cross Validation and Holdout:


Cross Validation is a method in which we do not use the whole dataset for training. In this
technique, some part of the dataset is reserved for testing the model. There are many types of
Cross-Validation out of which K Fold Cross Validation is mostly used. In K Fold Cross
Validation the original dataset is divided into k subsets, known as folds. The process is repeated k times, where in each repetition one fold is used for testing and the remaining k-1 folds are used for training the model. So each data point acts as a test subject for the model as well as a training subject. This technique generalizes the model well and reduces the error rate.
Holdout is the simplest approach. It is used with neural networks as well as many other classifiers. In
this technique, the dataset is divided into train and test datasets. The dataset is usually divided
into ratios like 70:30 or 80:20. Normally a large percentage of data is used for training the
model and a small portion of the dataset is used for testing the model.
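A minimal sketch of both techniques, assuming scikit-learn and a hypothetical feature matrix X with labels y:

```python
# Holdout split and K-fold splitting (scikit-learn assumed; X and y are hypothetical arrays).
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Holdout: a single 80:20 split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# K-fold: every data point is used once for testing and k-1 times for training.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
    print(f"Fold {fold}: {len(train_idx)} training rows, {len(test_idx)} test rows")
```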

b. Bootstrap:
Another resampling technique is called Bootstrap, and it involves drawing random samples with replacement. It is used to sample a dataset with replacement in order to estimate statistics on a population.

 Used with smaller datasets


 The number of samples must be chosen.
 Size of all samples and test data should be the same.
 The scores obtained across the samples are then aggregated to give the final estimate.

In simple terms, you start by:

 Randomly selecting an observation.


 You note that value.
 You put that value back.

Now, you repeat the steps N times, where N is the number of observations in the initial dataset. So
the final result is the one bootstrap sample with N observations.
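A minimal sketch of this procedure with NumPy; the data array below is made up purely for illustration:

```python
# Bootstrap resampling: estimate a statistic (here, the mean) by repeated sampling with replacement.
import numpy as np

rng = np.random.default_rng(seed=0)
data = rng.normal(loc=50, scale=10, size=200)   # stand-in for the original N observations

n_bootstrap = 1000
means = []
for _ in range(n_bootstrap):
    # Draw N observations with replacement (same size as the original dataset).
    sample = rng.choice(data, size=len(data), replace=True)
    means.append(sample.mean())

print("Bootstrap estimate of the mean:", np.mean(means))
print("Approximate 95% interval:", np.percentile(means, [2.5, 97.5]))
```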

 Probabilistic measures:

Information Criterion is a kind of probabilistic measure that can be used to evaluate the effectiveness
of statistical procedures. Its methods include a scoring system that selects the most effective
candidate models using a log-likelihood framework of Maximum Likelihood Estimation (MLE).

Resampling only focuses on model performance, whereas probabilistic modeling concentrates on both model performance and complexity.

 IC is a statistical metric that yields a score. The model with the lowest score is the most effective.
 Performance is calculated using in-sample data; therefore a test set is unnecessary. Instead, the score is calculated using all of the training data.
 Less complexity entails a straightforward model with fewer parameters that is simple to learn and
maintain but unable to detect fluctuations that affect a model's performance.

There are three statistical methods for calculating the degree of complexity and how well a particular
model fits a dataset:

a. Akaike Information Criterion (AIC):

AIC is a single numerical score that can be used to determine, among several candidate models, the one that is most likely to be the best fit for a given dataset. AIC scores are only meaningful when compared to other scores for the same dataset.

Lower AIC ratings are preferable.

AIC measures how well the model fits the training data set and adds a penalty term for model complexity:

AIC = 2K - 2 ln(L)

For small datasets, a corrected score AICc = AIC + (2K(K + 1)) / (N - K - 1) is often used, where

K = the number of distinct variables or parameters in the model

L = the model's maximum likelihood

N = the number of data points in the training set (the correction is especially helpful in the case of small datasets)

The drawback of AIC is that it struggles to generalize, since it tends to favor intricate models that fit more of the training data. This implies that all tested models might still have a poor fit.

b. Minimum Description Length (MDL)

According to the MDL concept, the explanation that allows for the most data compression is the best
given a small collection of observed data. Simply put, it is a technique that forms the cornerstone of
statistical modeling, pattern recognition, and machine learning.

MDL = L(h) + L(D | h)

where

h = the model, D = the model's predictions

L(h) = the number of bits needed to express the model

L(D | h) = the number of bits needed to describe the model's predictions given the model

c. Bayesian Information Criterion (BIC)

BIC was derived using the Bayesian probability idea and is appropriate for models that use
maximum likelihood estimation during training.

BIC is more commonly employed in time series and linear regression models. However, it may be applied broadly to any model fitted by maximum likelihood.
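A minimal sketch of computing AIC and BIC for a simple Gaussian linear regression, using the standard definitions AIC = 2K - 2ln(L) and BIC = K ln(N) - 2ln(L); the dataset is synthetic and only NumPy is assumed:

```python
# Compute AIC and BIC for an ordinary least-squares fit (synthetic data, NumPy only).
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.uniform(0, 10, N)
y = 3.0 * x + 5.0 + rng.normal(0, 2.0, N)           # noisy linear data

# Fit y = b0 + b1*x by least squares.
X = np.column_stack([np.ones(N), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = np.sum((y - X @ coef) ** 2)

K = X.shape[1] + 1                                   # coefficients plus the noise variance
log_likelihood = -N / 2 * (np.log(2 * np.pi) + np.log(rss / N) + 1)

aic = 2 * K - 2 * log_likelihood
bic = K * np.log(N) - 2 * log_likelihood
print("AIC:", aic, "BIC:", bic)
```

Lower scores are better in both cases; BIC penalizes extra parameters more heavily than AIC once N is reasonably large.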

d. Structural Risk Minimization (SRM)

There are instances of overfitting when the model becomes biased toward the training data, which is
its primary source of learning.

A generalized model must frequently be chosen from a limited data set in machine learning, which
leads to the issue of overfitting when the model becomes too fitted to the specifics of the training set
and performs poorly on new data. By weighing the model's complexity against how well it fits the
training data, the SRM principle solves this issue.
The SRM objective balances the empirical risk on the training data against a complexity penalty:

R_srm(f) = (1/N) ∑ L(yi, f(xi)) + λ J(f)

Here, J(f) is the complexity of the model, the first term is the average loss on the training data, and λ controls the trade-off between the two.

 Evaluation Metrics for Classification Task:

Model evaluation is the process of assessing how well a machine learning model performs on
unseen data. It helps you determine if the model generalizes well and can make accurate predictions
on new data points. There are various metrics and techniques to evaluate models, depending on the
type of problem you’re solving (classification, regression, etc.).
Here are some common evaluation metrics for different types of problems:

1. Classification Metrics:

 Accuracy: The proportion of correctly classified instances out of the total instances.
Accuracy is defined as the ratio of the number of correct predictions to the total number of predictions.
This is the most fundamental metric used to evaluate the model. The formula is given by
Accuracy = (TP+TN)/(TP+TN+FP+FN)
However, Accuracy has a drawback: it does not perform well on an imbalanced dataset. Suppose a model classifies most of the data as belonging to the majority class label; it yields high accuracy, but in general such a model cannot classify the minority class labels and performs poorly on them.

 Precision: The proportion of true positives out of the total predicted positives.
Precision is the ratio of true positives to the summation of true positives and false positives. It basically
analyses the positive predictions.
Precision = TP/(TP+FP)
The drawback of Precision is that it does not consider the True Negatives and False Negatives.

 Recall (Sensitivity): The proportion of true positives out of the total actual positives.
Recall is the ratio of true positives to the summation of true positives and false negatives. It basically
analyses the number of correct positive samples.
Recall = TP/(TP+FN)
The drawback of Recall is that optimizing for it alone often leads to a higher false positive rate.

 F1 Score: The harmonic mean of precision and recall, providing a balance between the two.
The F1 score is the harmonic mean of precision and recall. It is seen that during the precision-recall
trade-off if we increase the precision, recall decreases and vice versa. The goal of the F1 score is to
combine precision and recall.
F1 score = (2×Precision×Recall)/(Precision+Recall)

 Confusion Matrix: A confusion matrix is an N x N matrix where N is the number of target classes. It tabulates the actual outputs against the predicted outputs. Some terminologies in the matrix are as follows:
 True Positives: It is also known as TP. It is the output in which the actual and the predicted values are
YES.
 True Negatives: It is also known as TN. It is the output in which the actual and the predicted values
are NO.
 False Positives: It is also known as FP. It is the output in which the actual value is NO but the
predicted value is YES.
 False Negatives: It is also known as FN. It is the output in which the actual value is YES but the
predicted value is NO.

 AUC-ROC Curve: AUC (Area Under Curve) is an evaluation metric that is used to analyze
the classification model at different threshold values. The Receiver Operating
Characteristic(ROC) curve is a probabilistic curve used to highlight the model’s performance.

The curve has two parameters:


 TPR: It stands for True positive rate. It basically follows the formula of Recall.
 FPR: It stands for False Positive rate. It is defined as the ratio of False positives to the summation
of false positives and True negatives.
This curve is useful as it helps us to determine the model’s capacity to distinguish between different
classes.
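The classification metrics above can all be computed with scikit-learn; the following sketch uses a synthetic binary-classification dataset and a logistic regression model purely for illustration:

```python
# Classification metrics on a held-out test set (scikit-learn assumed; dataset is synthetic).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]          # probability of the positive class, for AUC-ROC

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("AUC-ROC  :", roc_auc_score(y_test, y_prob))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```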

2. Regression Metrics:
Regression is used to determine continuous values. It is mostly used to find a relation between a
dependent and an independent variable. For classification, we use a confusion matrix, accuracy, f1
score, etc. But for regression analysis, since we are predicting a numerical value it may differ from
the actual output. So we consider the error calculation as it helps to summarize how close the
prediction is to the actual value.
There are many metrics available for evaluating the regression model.

 Mean Absolute Error (MAE): The average of the absolute differences between the predicted and
actual values. This is the simplest metric used to analyze the loss over the whole dataset. As we all
know the error is basically the difference between the predicted and actual values.
Therefore MAE is defined as the average of the errors calculated. Here we calculate the modulus of
the error, perform the summation and then divide the result by the number of data points. It is a
positive quantity and is not concerned about the direction. The formula of MAE is given by

MAE = ∑|ypred-yactual| / N

 Mean Squared Error (MSE): The average of the squared differences between the predicted and
actual values. The most commonly used metric is Mean Square error or MSE. It is a function used
to calculate the loss. We find the difference between the predicted values and the truth variable,
square the result and then find the average over the whole dataset. MSE is always positive as we
square the values. The smaller the MSE, the better the performance of our model. The formula of MSE
is given:

MSE = ∑(ypred - yactual)2 / N

 Root Mean Squared Error (RMSE): The square root of the mean squared error, which is more
sensitive to large errors than MAE. RMSE is a popular method and is the extended version of
MSE(Mean Squared Error). This method is basically used to evaluate the performance of our
model. It indicates how much the data points are spread around the best-fit line. It is the square root of the mean squared error, which makes it roughly the standard deviation of the prediction errors. A lower value means that the data points lie closer to the best-fit line.

RMSE=√(∑(ypred - yactual)2 / N)

 R-squared: The proportion of the variance in the dependent variable that is predictable from the
independent variables, ranging from 0 to 1.

 Mean Absolute Percentage Error (MAPE): MAPE is basically used to express the error in terms of a percentage. It is defined as the absolute difference between the actual and predicted value, divided by the actual value. These ratios are then summed up and, finally, we calculate the average. The smaller the percentage, the better the performance of the model. The formula is given by

MAPE = (∑ |ypred - yactual| / |yactual|) / N × 100 %
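These regression metrics follow directly from their formulas; a minimal sketch with NumPy, using made-up prediction and target arrays:

```python
# Regression metrics computed directly from their definitions (NumPy only; arrays are made up).
import numpy as np

y_actual = np.array([3.0, 5.5, 7.2, 9.8, 12.1])
y_pred = np.array([2.8, 6.0, 7.0, 10.5, 11.5])

errors = y_pred - y_actual
mae = np.mean(np.abs(errors))
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)
mape = np.mean(np.abs(errors) / np.abs(y_actual)) * 100
r2 = 1 - np.sum(errors ** 2) / np.sum((y_actual - y_actual.mean()) ** 2)

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  MAPE={mape:.2f}%  R^2={r2:.3f}")
```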

 Model Selection:

Model selection is the process of choosing the best model from a set of candidates based on their
performance on a validation set. It’s essential because different models may perform better on different
types of data, and there’s often a trade-off between model complexity and performance.

In machine learning, the process of selecting the top model or algorithm from a list of potential models to
address a certain issue is referred to as model selection. It entails assessing and contrasting various
models according to how well they function and choosing the one that reaches the highest level of
accuracy or prediction power.

The following steps are frequently included in the model selection process:

 Problem formulation: Clearly express the issue at hand, including the kind of predictions or
task that you'd like the model to carry out (for example, classification, regression, or clustering).
 Candidate model selection: Pick a group of models that are appropriate for the issue at hand.
These models can include straightforward methods like decision trees or linear regression as well
as more sophisticated ones like deep neural networks, random forests, or support vector
machines.
 Performance evaluation: Establish measures for assessing how well each model performs. Common measures include accuracy, precision, recall, F1-score, mean squared error, and the area under the receiver operating characteristic curve (AUC-ROC). The type of problem and the particular requirements will determine which metrics are used.
 Training and evaluation: Each candidate model should be trained using a subset of the
available data (the training set), and its performance should be assessed using a different subset
(the validation set or via cross-validation). The established evaluation measures are used to
gauge the model's effectiveness.
 Model comparison: Evaluate the performance of various models and determine which one
performs best on the validation set. Take into account elements like data handling capabilities,
interpretability, computational difficulty, and accuracy.
 Hyperparameter tuning: Before training, many models require that certain hyperparameters, such as the learning rate, regularisation strength, or the number of hidden layers in a neural network, be configured. Use methods like grid search, random search, or Bayesian optimization to identify the ideal values of these hyperparameters.
 Final model selection: After the models have been analyzed and fine-tuned, pick the model that
performs the best. Then, this model can be used to make predictions based on fresh, unforeseen
data.

 Model Selection Techniques:


Here are some common techniques for model selection:

1. Train-Test Split: The simplest method is to split your data into a training set and a testing set.
Train each candidate model on the training set and evaluate their performance on the testing set.
Choose the model with the best performance on the testing set.

2. K-Fold Cross-Validation: K-Fold Cross-Validation is a more robust method that divides your data
into k equal-sized folds. For each fold, use it as the testing set while using the remaining k-1 folds
as the training set. Train and evaluate each candidate model k times and calculate the average
performance across all folds. Choose the model with the best average performance.

3. Grid Search and Randomized Search: These methods are used to optimize the hyperparameters of a model. In Grid Search, you define a set of possible values for each hyperparameter, and the algorithm tries every possible combination. In Randomized Search, the algorithm samples random combinations of hyperparameters from a specified distribution. Both methods can be combined with cross-validation for more accurate results, as in the sketch below.
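A minimal sketch of grid search combined with 5-fold cross-validation, assuming scikit-learn; the parameter grid below is illustrative rather than prescriptive:

```python
# Hyperparameter tuning with grid search plus cross-validation (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=3)

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}   # illustrative grid
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

RandomizedSearchCV has a very similar interface but samples a fixed number of random parameter combinations instead of trying every one.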

Here’s an outline of the model evaluation and selection process:

1. Split your dataset into training and validation sets (or use cross-validation).
2. Train each candidate model on the training set (or k-1 folds in cross-validation).
3. Evaluate each model’s performance on the validation set (or the kth fold in cross-validation) using
appropriate evaluation metrics.
4. Compare the models’ performance and select the best one for your problem.
5. Train the chosen model on the entire dataset and use it to make predictions on new data.
 Bayesian optimization: A more sophisticated method of hyperparameter tuning. It models the relationship between the hyperparameters and the model's performance using a probabilistic model.
 Introduction to Statistical Learning Theory:
As intuitive as it sounds from its name, statistical machine learning involves using statistical techniques
to develop models that can learn from data and make predictions or decisions.

You might have heard technical terms such as supervised, unsupervised, and semi-supervised learning–
they all rely on a solid statistical foundation.

In essence, statistical machine learning merges the computational efficiency and adaptability of machine
learning algorithms with statistical inference and modeling capabilities.

By employing statistical methods, we can extract significant patterns, relationships, and insights from
intricate datasets, thereby promoting the effectiveness of machine learning algorithms.

The Role of Statistics in Machine Learning:

Statistics constitutes the backbone of machine learning, providing the tools and techniques to analyze
and interpret data. Essentially, statistics provides the theoretical framework upon which machine
learning algorithms are built.

Statistics is the science that allows us to collect, analyze, interpret, present, and organize data. It
provides a robust set of tools for understanding patterns and trends, and making inferences and
predictions based on data. When we're dealing with large datasets, statistics helps us understand and
summarize the data, allowing us to make sense of complex phenomena.

Machine learning, on the other hand, is a powerful tool that allows computers to learn from and make
decisions or predictions based on data. The ultimate goal of machine learning is to create models that
can adapt and improve over time, as well as generalize from specific examples to broader cases.

The principles of statistics are the very pillars that uphold the structure of machine learning.

 Constructing machine learning models. Statistics provides the methodologies and principles
for creating models in machine learning. For instance, the linear regression model leverages the
statistical method of least squares to estimate the coefficients.
 Interpreting results. Statistical concepts allow us to interpret the results generated by machine
learning models. Measures such as p-value, confidence intervals, R-squared, and others provide us with
a statistical perspective on the machine learning model’s performance.
 Validating models. Statistical techniques are essential for validating and refining the machine
learning models. For instance, techniques like hypothesis testing, cross-validation, and bootstrapping
help us quantify the performance of models and avoid problems like overfitting.
 Underpinning advanced techniques. Even some of the more complex machine learning
algorithms, such as Neural Networks, have statistical principles at their core. The optimization
techniques, like gradient descent, used to train these models are based on statistical theory.

Types of Data in Statistical Learning:


With statistical learning theory, there are two main types of data:
 Dependent Variable — a variable (y) whose values depend on the values of other variables (a
dependent variable is sometimes also referred to as a target variable)
 Independent Variables — a variable (x) whose value does not depend on the values of other
variables (independent variables are sometimes also referred to as predictor variables, input
variables, explanatory variables, or features)
In statistical learning, the independent variable(s) are the variables that affect the dependent variable.

A common example of an Independent Variable is Age. There is nothing that one can do to increase or decrease age; this variable is independent.

Some common examples of Dependent Variables are:


 Weight — a person’s weight is dependent on his or her age, diet, and activity levels (as well as
other factors)
 Temperature — temperature is impacted by altitude, distance from equator (latitude) and distance
from the sea.

In graphs, the independent variable is often plotted along the x-axis while the dependent variable is
plotted along the y-axis.

For example, in a graph showing how the price of a home is affected by the size of the home, square footage (sq. ft) is the independent variable while the price of the home is the dependent variable.

 Statistical Model:

A statistical model defines the relationship between a dependent and independent variable. In the home-price example above, the relationship between the size of the home and the price of the home can be illustrated by a straight line. We can define this relationship using y = mx + c, where m represents the gradient and c is the intercept. Another way this equation can be expressed is with coefficient notation, which would look something like:

price = β0 + β1 × size

This model would describe the price of a home as having a linear relationship with the size of a home. This would represent a simple model for the relationship.
If we suppose that the size of the home is not the only independent variable when determining the price and that the number of bathrooms is also an independent variable, the equation would look like:

price = β0 + β1 × size + β2 × bathrooms
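A minimal sketch of fitting such a linear model, assuming scikit-learn; the home sizes, bathroom counts, and prices below are made up for illustration:

```python
# Fit price = b0 + b1*size + b2*bathrooms by least squares (scikit-learn assumed; numbers are made up).
import numpy as np
from sklearn.linear_model import LinearRegression

size = np.array([800, 1000, 1200, 1500, 1800, 2200])      # sq. ft
bathrooms = np.array([1, 1, 2, 2, 3, 3])
price = np.array([120, 150, 185, 220, 270, 320])          # in thousands

X = np.column_stack([size, bathrooms])
model = LinearRegression().fit(X, price)

print("Intercept (b0):", model.intercept_)
print("Coefficients (b1, b2):", model.coef_)
print("Predicted price for 1600 sq. ft, 2 baths:", model.predict([[1600, 2]])[0])
```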

 Model Generalization:

In order to build an effective model, the available data needs to be used in a way that makes the model generalizable to unseen situations. Common problems that occur when building models are that the model under-fits or over-fits the data.

 Under-fitting — when a statistical model does not adequately capture the underlying structure of the data and, therefore, omits some parameters that would appear in a correctly specified model.

 Over-fitting — when a statistical model contains more parameters than can be justified by the data and treats the residual variation (“noise”) as if that variation represented underlying model structure.

As you can see, it is important to create a model that can generalize to the data that it is given so that it

can make the most accurate predictions when given new data.
 Model Validation:

Model Validation is used to assess over-fitting and under-fitting of the model. The steps to perform model validation are:
1. Split the data into two parts, training data and testing data (anywhere between 80/20 and 70/30 is
ideal).
2. Use the larger portion (training data) to train the model.
3. Use the smaller portion (testing data) to test the model. This data is not being used to train the
model, so it will be new data for the model to build predictions from.
4. If the model has learned well from the training data, it will perform well with both the training
data and testing data. To determine how well the model is performing on both sets of data, you
can calculate the accuracy score for each. The model is over-fitting if the training data has a
significantly higher accuracy score than the testing data.
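A minimal sketch of this procedure, assuming scikit-learn; a deliberately unpruned decision tree is used here because it tends to over-fit, which makes the gap between the two scores easy to see:

```python
# Compare training vs. testing accuracy to spot over-fitting (scikit-learn assumed; data is synthetic).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

model = DecisionTreeClassifier(random_state=5).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"Training accuracy: {train_acc:.3f}, Testing accuracy: {test_acc:.3f}")
# A training accuracy much higher than the testing accuracy suggests the model is over-fitting.
```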

 Difference between Statistical Model and Machine Learning:

 Statistical Model:

A mathematical process that attempts to describe the population from which a sample came, which
allows us to make predictions of future samples from that population.

Examples: Hypothesis testing, Correlation, etc.

Some problem statements solved by statistical modeling:


 employing inferential statistics to calculate the average income of a population from a random sample
 estimating a stock’s future price using previous data, and time series analysis.

Objectives of Statistical Model:


 used for proving a result, such as hypothesis testing and p-values.
 searching data for interesting information (exploratory analysis), such as generating hypotheses.
 building a predictive model.

Assumptions in Statistical Model:


 Independence, states that there shouldn’t be any relationships between the observations in the
collection.
 Normality requires that the response variable’s distribution is approximately normal, with data
symmetric around the mean.
 Linearity indicates that the relationship between the response variable and predictor variable(s)
should be linear.
 No multicollinearity, suggesting the independence of predictor variables from each other.
 No outliers: the dataset should not contain any outliers that may unduly influence the results.
1. The group of probability distributions that have a finite number of parameters is known
as parametric.
2. Nonparametric models are those where the kind and quantity of parameters are adjustable and not
predetermined.
3. Semiparametric models have both a parametric component and a non-parametric component.

 Machine Learning:

Machine Learning is the science that allows computers to learn and improve their learning over time, by
feeding them data and information in the form of observations and real-world interactions.
According to Arthur Samuel, machine learning is “the field of study that gives computers the ability to learn without being explicitly programmed”.
OR
According to Tom Mitchell, “Machine learning is the study of computer algorithms that allow computer
programs to improve through experience automatically”.
Example: Predicting a house price with a machine learning model on the basis of attributes such as location and area. With the help of machine learning we can find the relationship between the dependent variable (i.e. house price) and the independent features (i.e. location, area, year of construction), and we can then predict the price for a new input from the resulting relationship.

Some problem statements for machine learning :


 Recommendation: Utilize collaborative filtering to suggest movies to viewers based on their prior
viewing habits and ratings.
 Disease Prediction: employing a support vector machine to make a prediction about a patient’s
propensity to develop a specific disease based on their medical history and genetic information.

Assumptions in Machine Learning:


 Data is independent and identically distributed (IID), which means that every data point is
independent of the others and has the same distribution.
 The assumption that there is a linear relationship between the input variables and the output variable
underlies some models, such as linear regression.
 Normality, Some models presuppose that the model’s input variables and/or error terms are
distributed normally.
 No multicollinearity, Linear models presuppose that the input variables are not highly associated
with one another and do not exhibit multicollinearity.
 High Sample Size, Certain models rely on the sample size being sufficiently big to guarantee precise
parameter estimates.

 Difference between Statistical Models and Machine Learning:

1. Statistical Model: The relationship between variables is found in the form of mathematical equations.
   Machine Learning: The relationship between variables is found by a self-learning algorithm that learns from the data without relying on rule-based programming.

2. Statistical Model: The purpose of statistical modeling is to find the relationship between variables and to test hypotheses.
   Machine Learning: Machine learning is focused on making accurate predictions.

3. Statistical Model: Statistical modeling takes a lot of assumptions to identify the underlying distributions and relationships.
   Machine Learning: Machine learning does not rely on such assumptions.

4. Statistical Model: More interpretable as compared to machine learning.
   Machine Learning: Less interpretable and more complex.

5. Statistical Model: The model is developed on training data and tested on testing data.
   Machine Learning: The model is developed on training data, hyperparameters are sometimes tuned on validation data, and it is finally evaluated/tested on testing data.

6. Statistical Model: Mostly used for research purposes.
   Machine Learning: ML is implemented in a production environment.

7. Statistical Model: It is not best suited to large amounts of data.
   Machine Learning: It can handle data sets ranging from small to large.

8. Statistical Model: Requires considerable human effort, since the model must be specified explicitly.
   Machine Learning: Requires less human effort, since the relationship is learned implicitly from the data.

9. Statistical Model: Gives the best estimate of the relationship between variables.
   Machine Learning: Has strong predictive ability due to its capacity to learn from past data.
 Ensemble Methods:

Ensemble methods are techniques that aim at improving the accuracy of results in models by
combining multiple models instead of using a single model. The combined models increase the
accuracy of the results significantly. This has boosted the popularity of ensemble methods in machine
learning.

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain
better predictive performance than could be obtained from any of the constituent learning algorithms
alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine
learning ensemble consists of only a concrete finite set of alternative models, but typically allows for
much more flexible structure to exist among those alternatives.

Ensemble learning helps improve machine learning results by combining several models. This
approach allows the production of better predictive performance compared to a single model. Basic
idea is to learn a set of classifiers (experts) and to allow them to vote.
Advantage: Improvement in predictive accuracy.
Disadvantage: It is difficult to understand an ensemble of classifiers.

Why do ensembles work?

Dietterich (2002) showed that ensembles overcome three problems –


 Statistical Problem –
The Statistical Problem arises when the hypothesis space is too large for the amount of available
data. Hence, there are many hypotheses with the same accuracy on the data and the learning
algorithm chooses only one of them! There is a risk that the accuracy of the chosen hypothesis is
low on unseen data!

 Computational Problem –
The Computational Problem arises when the learning algorithm cannot guarantee finding the best hypothesis.
 Representational Problem –
The Representational Problem arises when the hypothesis space does not contain any good
approximation of the target class(es).

Main Challenge for Developing Ensemble Models?

The main challenge is not to obtain highly accurate base models, but rather to obtain base models
which make different kinds of errors. For example, if ensembles are used for classification, high
accuracies can be accomplished if different base models misclassify different training examples, even
if the base classifier accuracy is low.
Methods for Independently Constructing Ensembles –

 Majority Vote
 Bagging and Random Forest
 Randomness Injection
 Feature-Selection Ensembles
 Error-Correcting Output Coding

Methods for Coordinated Construction of Ensembles –


 Boosting
 Stacking


 Types of Ensemble Classifier–

1. Bagging:

Bootstrap Aggregating, also known as bagging, is a machine learning ensemble meta-algorithm


designed to improve the stability and accuracy of machine learning algorithms used in statistical
classification and regression. It decreases the variance and helps to avoid overfitting. It is usually
applied to decision tree methods. Bagging is a special case of the model averaging approach. Bagging
(Bootstrap Aggregation) is used to reduce the variance of a decision tree.
Suppose a set D of d tuples: at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample). A classifier model Mi is then learned from each training set Di. Each classifier Mi returns its class prediction. The bagged classifier M* counts the votes and assigns the class with the most votes to X (an unknown sample).

Implementation steps of Bagging –

1. Multiple subsets are created from the original data set with equal tuples, selecting observations
with replacement.
2. A base model is created on each of these subsets.
3. Each model is learned in parallel from each training set and independent of each other.
4. The final predictions are determined by combining the predictions from all the models.

Overview: Bagging involves training multiple models on different random subsets of the training data
and then aggregating their predictions. The most well-known example is the Random Forest
algorithm.

Pros:
 Reduces Variance: By training models on different subsets, bagging reduces the variance, leading
to more stable and robust models.
 Parallel Training: Models can be trained independently, making the process parallelizable and
faster.

Cons:
 Limited Reduction of Bias: Bagging primarily reduces variance but may not significantly reduce
bias if the base models are biased.

Use Cases:

 Random Forests for classification and regression tasks.


2. Boosting:

Boosting is an ensemble modeling technique designed to create a strong classifier by combining


multiple weak classifiers. The process involves building models sequentially, where each new model
aims to correct the errors made by the previous ones.

o Initially, a model is built using the training data.


o Subsequent models are then trained to address the mistakes of their predecessors.
o Boosting assigns weights to the data points in the original dataset.
o Higher weights: Instances that were misclassified by the previous model receive higher
weights.
o Lower weights: Instances that were correctly classified receive lower weights.
o Training on weighted data: The subsequent model learns from the weighted dataset, focusing
its attention on harder-to-learn examples (those with higher weights).
o This iterative process continues until:
 The entire training dataset is accurately predicted, or
 A predefined maximum number of models is reached.

Boosting Algorithms:
There are several boosting algorithms. The original ones, proposed by Robert Schapire and Yoav
Freund were not adaptive and could not take full advantage of the weak learners. Schapire and
Freund then developed AdaBoost, an adaptive boosting algorithm that won the prestigious Gödel
Prize. AdaBoost was the first really successful boosting algorithm developed for the purpose of binary
classification. AdaBoost is short for Adaptive Boosting and is a very popular boosting technique that
combines multiple “weak classifiers” into a single “strong classifier”.

Algorithm:

1. Initialise the dataset and assign equal weight to each of the data points.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weights of the wrongly classified data points and decrease the weights of the correctly classified data points. Then normalize the weights of all data points.
4. if (got required results)
Goto step 5
else
Goto step 2
5. End
Overview: Boosting trains models sequentially, where each new model attempts to correct the errors
made by the previous ones. Examples include AdaBoost, Gradient Boosting Machines (GBM), and
XGBoost.
Pros:
 Reduces Bias and Variance: Boosting can reduce both bias and variance, leading to high predictive
accuracy.
 Handles Complex Relationships: Capable of capturing complex data patterns.
Cons:
 Risk of Overfitting: If not properly regularized, boosting can overfit the training data.
 Sequential Training: Models are trained sequentially, which can be slower compared to
parallelizable methods like bagging.
Use Cases:
 AdaBoost for binary classification.
 XGBoost and LightGBM for structured data and competitions like Kaggle.
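A minimal sketch of AdaBoost with depth-1 decision trees ("stumps") as the weak classifiers, assuming scikit-learn and a synthetic dataset:

```python
# AdaBoost: weak learners are trained sequentially on re-weighted data (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

# Depth-1 trees are the classic weak classifiers for AdaBoost.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=2)
ada.fit(X_train, y_train)
print("AdaBoost test accuracy:", ada.score(X_test, y_test))
```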

3. Random Forest:

Random Forest is an extension over bagging. Each classifier in the ensemble is a decision tree classifier
and is generated using a random selection of attributes at each node to determine the split. During
classification, each tree votes and the most popular class is returned.

Implementation steps of Random Forest –

1. Multiple subsets are created from the original data set, selecting observations with replacement.
2. A subset of features is selected randomly and whichever feature gives the best split is used to split
the node iteratively.
3. The tree is grown to its largest possible size.
4. Repeat the above steps; the final prediction is given by aggregating the predictions from the n trees.
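A minimal sketch of a random forest, which combines bagging with a random selection of features at each split, assuming scikit-learn and a synthetic dataset:

```python
# Random forest: bagged decision trees with random feature selection at each node (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)

# max_features="sqrt" means only a random subset of features is considered at each split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=4)
forest.fit(X_train, y_train)
print("Random forest test accuracy:", forest.score(X_test, y_test))
```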

Similarities between Bagging and Boosting:

Bagging and Boosting are both commonly used methods and share the universal similarity of being classified as ensemble methods. Here we will explain the similarities between them.

1. Both are ensemble methods to get N learners from 1 learner.


2. Both generate several training data sets by random sampling.
3. Both make the final decision by averaging the N learners (or taking the majority of them i.e Majority
Voting).
4. Both are good at reducing variance and provide higher stability.
Differences Between Bagging and Boosting:

1. In Bagging, models are trained independently and in parallel on bootstrap samples; in Boosting, models are trained sequentially, with each new model focusing on the errors of the previous ones.
2. In Bagging, every observation carries equal weight; in Boosting, misclassified observations receive higher weights in later rounds.
3. Bagging mainly reduces variance, whereas Boosting reduces both bias and variance but carries a higher risk of over-fitting if not properly regularized.

4. Stacking:

Overview: Stacking involves training multiple base models and then using their predictions as input
features for a higher-level meta-model, which makes the final prediction.

Pros:
 Combines Strengths: Leverages the strengths of diverse models by combining their predictions.
 Flexibility: Can use any type of model for both base learners and the meta-model.

Cons:
 Complexity: Requires careful selection and tuning of both base models and the meta-model.
 Computationally Intensive: Can be resource-intensive due to the training of multiple models and
the meta-model.

Use Cases:
 Complex tasks requiring high accuracy, such as competition datasets and research projects.
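A minimal sketch of stacking, assuming scikit-learn: two base models feed their predictions into a logistic-regression meta-model, and the dataset is synthetic:

```python
# Stacking: base-model predictions become input features for a meta-model (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=6)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=6)

base_models = [("rf", RandomForestClassifier(n_estimators=50, random_state=6)),
               ("svm", SVC(probability=True, random_state=6))]
stack = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train, y_train)
print("Stacking test accuracy:", stack.score(X_test, y_test))
```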

 Two other types of Ensemble methods are:

1. Voting:
Overview: Voting ensembles combine the predictions of multiple models by taking a majority vote (for
classification) or averaging (for regression). It can be hard (majority vote) or soft (weighted
probabilities).
Pros:
 Simple and Effective: Easy to implement and often provides a performance boost.
 Reduces Overfitting: Aggregates multiple models to smooth out individual biases.
Cons:
 Equal Weighting: Simple voting treats all models equally, which may not always be optimal.
Use Cases:
 Initial ensemble method for improving model robustness in classification tasks.

2. Weighted Ensemble:
Overview: A weighted ensemble assigns different weights to models based on their performance,
combining their predictions accordingly.
Pros:
 Performance-Based Weighting: Gives more importance to better-performing models, potentially
improving accuracy.
 Flexibility: Can combine various types of models with custom weights.
Cons:
 Weight Tuning: Requires careful tuning of weights, which can be challenging and time-consuming.
 Computational Overhead: Maintaining and updating multiple models can be resource-intensive.
Use Cases:
 Scenarios where certain models consistently perform better and need higher influence on final
predictions.
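A minimal sketch covering both of these methods, assuming scikit-learn; the three base models and the weights are illustrative only:

```python
# Voting ensembles: hard voting, soft voting, and a weighted soft-voting variant (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=8)

models = [("lr", LogisticRegression(max_iter=1000)),
          ("rf", RandomForestClassifier(n_estimators=50, random_state=8)),
          ("nb", GaussianNB())]

hard_vote = VotingClassifier(estimators=models, voting="hard").fit(X_train, y_train)
soft_vote = VotingClassifier(estimators=models, voting="soft").fit(X_train, y_train)
# Weighted ensemble: give better-performing models more influence on the final prediction.
weighted = VotingClassifier(estimators=models, voting="soft", weights=[1, 2, 1]).fit(X_train, y_train)

print("Hard voting accuracy:    ", hard_vote.score(X_test, y_test))
print("Soft voting accuracy:    ", soft_vote.score(X_test, y_test))
print("Weighted voting accuracy:", weighted.score(X_test, y_test))
```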
