Module 3: Advanced ML Algorithms and Hardware Design Optimization
Syllabus
Ensemble Methods: Random Forest, Gradient Boosting. Dimensionality
Reduction Techniques: PCA, t-SNE. Model Evaluation and
Hyperparameter Tuning.
➢ Bagging (Bootstrap Aggregating): Trains multiple independent models on random subsets of the dataset
and averages (for regression) or takes a majority vote (for classification). Example: Random Forest, which
builds multiple decision trees and averages their outputs.
➢ Boosting: Sequentially trains weak models, where each new model corrects the errors of the previous
ones, gradually improving performance. Examples: Gradient Boosting, AdaBoost, XGBoost.
➢ Stacking: Combines predictions from multiple models using a meta-model that learns how to best
combine their outputs.
• Ensemble methods are widely used in real-world applications like fraud detection, medical
diagnosis, and recommendation systems due to their ability to enhance predictive power
and reduce variability.
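The sketch below, assuming scikit-learn is installed and using a synthetic dataset from make_classification, contrasts bagging, boosting, and stacking side by side; the estimators and parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    # Bagging: independent trees on bootstrap samples, combined by majority vote
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    # Boosting: trees trained sequentially, each correcting the previous errors
    "boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
    # Stacking: a logistic-regression meta-model combines base-model predictions
    "stacking": StackingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("gb", GradientBoostingClassifier(random_state=0))],
        final_estimator=LogisticRegression(max_iter=1000)),
}

for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```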
Random Forest Algorithm
• Random Forest is a well-known machine learning algorithm based on supervised learning.
• In machine learning, it can be used for both classification and regression problems.
• It is based on ensemble learning, which is a method of combining multiple classifiers
to solve a complex problem and improve the model's performance.
• Random Forest is a classifier that combines a number of decision trees built on different
subsets of a dataset and averages their results to improve the model's predictive
accuracy.
• Instead of relying on a single decision tree, the random forest takes the predictions
from each tree and predicts the final output based on the majority votes of
predictions.
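A minimal Random Forest classification sketch, assuming scikit-learn and a synthetic dataset; the split and parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# n_estimators trees, each trained on a bootstrap sample with a random feature subset;
# the final class is the majority vote across trees.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```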
Basic Structure of RF
Working of RF algorithm
Decision Tree vs Random Forest
Decision tree | Random forest
An algorithm that generates a tree-like set of rules for classification or regression. | An algorithm that combines many decision trees to produce a more accurate outcome.
When a dataset with certain features is ingested into a decision tree, it generates a set of rules for prediction. | Builds decision trees on random samples of the data and averages the results.
Prone to overfitting because of the possibility of adapting too much to the initial data set. | The use of many trees allows the algorithm to avoid and/or prevent overfitting.
Gradient Boosting
• Gradient Boosting is a popular boosting algorithm in machine learning used
for classification and regression tasks.
• Boosting is one kind of ensemble Learning method which trains the model
sequentially and each new model tries to correct the previous model.
• Gradient Boosting is a powerful boosting algorithm that combines several
weak learners into a strong learner, in which each new model is trained with
gradient descent to reduce a loss function, such as mean squared error or
cross-entropy, evaluated on the previous model's errors.
• In each iteration, the algorithm computes the gradient of the loss function
with respect to the predictions of the current ensemble and then trains a
new weak model to minimize this gradient.
• The predictions of the new model are then added to the ensemble, and the
process is repeated until a stopping criterion is met.
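The following sketch, again assuming scikit-learn, shows gradient boosting on a synthetic regression problem; n_estimators, learning_rate, and max_depth are illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the n_estimators shallow trees is fit to the gradient of the squared-error
# loss (the residuals) of the current ensemble; learning_rate scales each contribution.
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3,
                                random_state=0)
gbr.fit(X_train, y_train)
print("test MSE:", mean_squared_error(y_test, gbr.predict(X_test)))
```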
Working of gradient boosting
• The ensemble consists of M trees. Tree1 is trained using the
feature matrix X and the labels y. The predictions
labeled y1(hat) are used to determine the training set residual
errors r1.
• Tree2 is then trained using the feature matrix X and the
residual errors r1 of Tree1 as labels.
• The predicted results r1(hat) are then used to determine the
residual r2.
• The process is repeated until all the M trees forming the
ensemble are trained. There is an important parameter used in
this technique known as Shrinkage.
• Shrinkage refers to the fact that the prediction of each tree in
the ensemble is shrunk by multiplying it by the learning rate
(eta), which ranges between 0 and 1.
• There is a trade-off between eta and the number of estimators:
decreasing the learning rate must be compensated by increasing the
number of estimators in order to reach a certain level of model
performance. Once all the trees are trained, predictions can be made as:

$Y_{pred} = y_1 + \eta \, r_1 + \eta \, r_2 + \dots + \eta \, r_N$
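To make the residual-fitting loop and the shrinkage formula concrete, here is a from-scratch sketch under simplifying assumptions: squared-error loss, a constant initial prediction, and shallow scikit-learn regression trees as the weak learners.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)

eta, M = 0.1, 100                   # learning rate (shrinkage) and number of trees
pred = np.full_like(y, y.mean())    # initial prediction y1: the mean of the labels
trees = []
for m in range(M):
    residuals = y - pred                    # r_m: errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=m)
    tree.fit(X, residuals)                  # Tree_m is fit to the residuals
    pred = pred + eta * tree.predict(X)     # Y_pred = y1 + eta*r1 + eta*r2 + ...
    trees.append(tree)

print("final training MSE:", np.mean((y - pred) ** 2))
```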
Gradient Boosting
Key Features
• Boosting: Unlike bagging, boosting focuses on learning from previous mistakes by giving more weight to misclassified instances.
• Weak Learners: Uses shallow decision trees (stumps) to gradually refine predictions.
• Learning Rate: Controls the contribution of each weak learner, preventing overfitting.
Advantages
• Handles missing values well
• Works well with structured/tabular data
• Provides feature importance for interpretation
• Can handle both regression and classification tasks
Disadvantages
• Computationally expensive
• Sensitive to hyperparameter tuning
• Leads to overfitting if not properly regularized
Dimensionality Reduction Techniques
• While working with machine learning models, we often encounter
datasets with a large number of features. These datasets can lead to
problems such as increased computation time and overfitting.
• To address these issues, we use dimensionality reduction techniques.
Principal Component Analysis (PCA) - Linear Method
• PC₁ (First Principal Component): The direction along which the data has the maximum variance. It captures the most
important information.
• PC₂ (Second Principal Component): The direction orthogonal (perpendicular) to PC₁. It captures the remaining
variance but is less significant.
• In the accompanying figure, the red dashed lines indicate the spread (variance) of the data along different directions. The variance along PC₁ is
greater than along PC₂, which means that PC₁ carries more useful information about the dataset.
• The data points (blue dots) are projected onto PC₁, effectively reducing the dataset from two dimensions (Radius,
Area) to one dimension (PC₁).
• This transformation simplifies the dataset while retaining most of the original variability.
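A short PCA sketch, assuming scikit-learn and NumPy; the correlated Radius/Area data is synthetic and only meant to mirror the two-feature example above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic two-feature data: Area is strongly correlated with Radius
rng = np.random.default_rng(0)
radius = rng.normal(10.0, 2.0, size=200)
area = np.pi * radius ** 2 + rng.normal(0.0, 5.0, size=200)
X = np.column_stack([radius, area])

pca = PCA(n_components=1)        # keep only PC1, the direction of maximum variance
X_pc1 = pca.fit_transform(X)     # (200, 1): each 2-D point projected onto PC1
print("variance explained by PC1:", pca.explained_variance_ratio_[0])
```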
t-Distributed Stochastic Neighbor Embedding (t-SNE) - Nonlinear Method
In t-Distributed Stochastic Neighbor Embedding (t-SNE), the "t" stands for the Student’s t-
distribution, which is used to model the pairwise similarities in the lower-dimensional
space.
Why Use the t-Distribution?
• In the high-dimensional space, t-SNE models pairwise similarities using a Gaussian
(normal) distribution.
• However, when mapping to a lower-dimensional space, using a Gaussian would cause
crowding because high-dimensional distances don’t directly translate well to lower
dimensions.
• The t-distribution allows distant points to stay apart, preventing crowding in the low-
dimensional space.
• Ensures that well-separated clusters remain visually distinct. This choice of distribution is
what makes t-SNE more effective than PCA for visualizing high-dimensional data.
t-Distributed Stochastic Neighbor Embedding (t-SNE) - Nonlinear Method
• t-Distributed Stochastic Neighbor Embedding (t-SNE) is
a nonlinear dimensionality reduction technique suited for
visualizing high-dimensional data in a lower-dimensional
space, typically 2D or 3D. It is a widely used
dimensionality reduction technique.
• Dimensionality reduction is a process that simplifies complex
datasets by combining similar or correlated features. It helps
improve analysis and computational efficiency.
• t-SNE is a dimensionality reduction technique that uses a
randomized, non-linear approach to reduce the dimensionality of
data. Unlike linear methods such as Principal Component
Analysis (PCA), t-SNE focuses on preserving the local structure
and pattern of the data.
• It is especially effective for visualizing high-dimensional datasets
as it keeps similar data points close to each other in the lower-
dimensional space making it easier to see patterns and clusters.
• This ability to retain the local structure of the dataset helps in
exploring and understanding complex, high-dimensional data.
Visualizing the data in 2D or 3D can provide us valuable insights
into the relationships between different data points.
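A minimal t-SNE sketch, assuming scikit-learn; the digits dataset and the perplexity value are illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 1797 samples with 64 features each

# perplexity sets the effective neighbourhood size for the Gaussian similarities
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)          # 2-D coordinates suitable for a scatter plot
print(X_2d.shape)                     # (1797, 2)
```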
t-SNE Working
• t-SNE works by looking at the similarity between data points in the high-dimensional space. The
similarity is computed as a conditional probability. It calculates how likely it is that one data point
would be near another.
• Once the similarities are calculated, t-SNE tries to keep similar points close when it reduces the data to
lower dimensions (like 2D or 3D). The goal is to make sure that points that are close in the original
space stay close in the lower-dimensional space, preserving the structure of the data.
Step 1: Compute Pairwise Similarities in High-Dimensional Space
For each data point $x_i$, compute the probability that another point $x_j$ is its neighbor using a Gaussian (normal) distribution:

$$P_{ij} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right)}{\sum_{k \neq l} \exp\left(-\|x_k - x_l\|^2 / 2\sigma^2\right)}$$
where:
• $P_{ij}$ represents how similar $x_i$ and $x_j$ are in high-dimensional space.
• $\sigma$ (sigma) is the bandwidth parameter, set via the perplexity, which controls the neighbourhood size.
• The sum in the denominator ensures that the probabilities are normalized.
• The probability $P_{ij}$ is high for similar points and low for dissimilar points.
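A NumPy sketch of these high-dimensional similarities, under the simplifying assumption of a single global sigma (real t-SNE chooses a per-point sigma to match the target perplexity).

```python
import numpy as np

def high_dim_similarities(X, sigma=1.0):
    # squared Euclidean distances between all pairs of points
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)          # a point is not its own neighbour
    return P / P.sum()                # normalize over all pairs (k != l)

X = np.random.default_rng(0).normal(size=(5, 3))
P = high_dim_similarities(X)
print(P.sum())                        # 1.0: a probability distribution over pairs
```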
t-SNE Working
Step 2: Compute Pairwise Similarities in Low-Dimensional Space
• In the low-dimensional space (2D or 3D), we compute a similar probability, but this time using a
Student’s t-distribution with 1 degree of freedom (also called a Cauchy distribution):

$$Q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}$$
where:
• 𝑄𝑖𝑗 represents how similar points are in the lower-dimensional space.
• The t-distribution has heavier tails than a Gaussian, preventing points from crowding together.
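A matching NumPy sketch of the low-dimensional similarities using the heavy-tailed kernel above; the random 2-D embedding is only a placeholder for illustration.

```python
import numpy as np

def low_dim_similarities(Y):
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Q = 1.0 / (1.0 + sq_dists)        # heavy-tailed kernel: (1 + ||y_i - y_j||^2)^-1
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()                # normalize over all pairs (k != l)

Y = np.random.default_rng(0).normal(size=(5, 2))   # a random 2-D embedding
Q = low_dim_similarities(Y)
print(Q.sum())                        # 1.0
```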
t-SNE Working
Step 3: Minimize the Kullback-Leibler (KL) Divergence
• The goal is to match the similarity distributions 𝑃𝑖𝑗 and 𝑄𝑖𝑗 .
• This is done by minimizing the KL divergence, which measures how different the two distributions are:
$$KL(P \parallel Q) = \sum_{i \neq j} P_{ij} \log \frac{P_{ij}}{Q_{ij}}$$
• The optimization is performed using gradient descent, where points are adjusted iteratively to make 𝑄𝑖𝑗 as
close to 𝑃𝑖𝑗 as possible.
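A simplified sketch of this optimization step, using the same NumPy similarity computations as above; real t-SNE implementations add momentum and early exaggeration, which are omitted here.

```python
import numpy as np

def tsne_step(P, Y, learning_rate=100.0):
    # Student-t similarities Q in the low-dimensional space (as in Step 2)
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()
    # gradient of KL(P || Q) with respect to y_i:
    #   4 * sum_j (P_ij - Q_ij) * (y_i - y_j) * (1 + ||y_i - y_j||^2)^-1
    diff = Y[:, None, :] - Y[None, :, :]
    grad = 4.0 * np.sum(((P - Q) * inv)[:, :, None] * diff, axis=1)
    return Y - learning_rate * grad

# usage: Gaussian similarities P for random high-dimensional data (Step 1, sigma = 1)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
P = np.exp(-sq / 2.0)
np.fill_diagonal(P, 0.0)
P /= P.sum()

Y = rng.normal(scale=1e-2, size=(20, 2))   # random initial 2-D embedding
for _ in range(500):
    Y = tsne_step(P, Y)                    # adjust points to make Q match P
```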
PCA vs t-SNE
Feature | PCA | t-SNE
Works By | Finding orthogonal axes that maximize variance | Preserving local neighborhood similarities
Hyperparameter Tuning
1. GridSearchCV
• Grid search can be considered a “brute force” approach to hyperparameter optimization. After
creating a grid of potential discrete hyperparameter values, we fit the model using every possible
combination, log each set’s model performance, and then choose the combination that produces
the best results. This approach is called Grid Search CV because it searches for the best set of
hyperparameters from a grid of hyperparameter values.
• Grid search is an exhaustive approach that can identify the ideal hyperparameter combination,
but its slowness is a disadvantage: fitting the model with every potential combination often takes
a lot of processing power and time, which might not be available.
• For example: if we want to set two hyperparameters C and Alpha of the Logistic Regression
Classifier model, with different sets of values. The grid search technique will construct many
versions of the model with all possible combinations of hyperparameters and will return the best
one.
• As in the accompanying image, for C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4], the combination
C = 0.3 and Alpha = 0.2 yields the highest performance score of 0.726 and is therefore selected.
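A minimal GridSearchCV sketch, assuming scikit-learn; since Alpha is not a LogisticRegression parameter, the second tuned setting here is the penalty type, which is an illustrative substitution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {
    "C": [0.1, 0.2, 0.3, 0.4, 0.5],        # same C grid as the slide example
    "penalty": ["l1", "l2"],               # second axis of the grid
}
grid = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
grid.fit(X, y)                             # fits one model per (C, penalty) pair per fold
print(grid.best_params_, grid.best_score_)
```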
2. Randomized Search CV
• As the name suggests, the random search method selects values at random as
opposed to the grid search method’s use of a predetermined set of numbers.
• Every iteration, random search attempts a different set of hyperparameters and
logs the model’s performance.
• It returns the combination that provided the best outcome after
several iterations. This approach reduces unnecessary computation.
• Randomized Search CV solves the drawbacks of Grid Search CV, as it goes through
only a fixed number of hyperparameter settings.
• It moves within the grid in a random fashion to find the best set of
hyperparameters.
• The advantage is that, in most cases, a random search will produce a comparable
result faster than a grid search.
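A RandomizedSearchCV sketch, assuming scikit-learn and SciPy; C is drawn from a log-uniform distribution rather than a fixed grid, and n_iter caps the number of sampled settings.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# only n_iter random draws are evaluated instead of every grid combination
param_distributions = {"C": loguniform(1e-3, 1e2), "penalty": ["l1", "l2"]}
search = RandomizedSearchCV(LogisticRegression(solver="liblinear"),
                            param_distributions, n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```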
3. Bayesian Optimization
• Grid search and random search are often inefficient because they evaluate
many unsuitable hyperparameter combinations without considering the
previous iterations’ results.
• Bayesian optimization, on the other hand, treats the search for optimal
hyperparameters as an optimization problem.
• It considers the previous evaluation results when selecting the next
hyperparameter combination and applies a probabilistic function to choose
the combination that will likely yield the best results.
• This method discovers a good hyperparameter combination in relatively
few iterations.
• Data scientists use a probabilistic model when the objective function is
unknown. The probabilistic model estimates the probability of a
hyperparameter combination’s objective function result based on past
evaluation results.
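A sketch of Bayesian-style hyperparameter search, assuming the third-party Optuna library is installed; its default TPE sampler proposes each new C value based on the results of previous trials.

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

def objective(trial):
    # the next C is suggested by a probabilistic model fit to previous trials
    C = trial.suggest_float("C", 1e-3, 1e2, log=True)
    model = LogisticRegression(C=C, solver="liblinear")
    return cross_val_score(model, X, y, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=25)
print(study.best_params, study.best_value)
```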
Key Differences Between GridSearchCV, RandomizedSearchCV, and Bayesian Optimization

Feature | GridSearchCV | RandomizedSearchCV | Bayesian Optimization
Search Method | Exhaustive search over all possible combinations | Randomly selects parameter combinations | Uses probabilistic models to find the best parameters
Time Complexity | High (slow for large parameter grids) | Medium (faster, but may miss optimal values) | Low (fast, finds optimal values efficiently)
Works Well For | Small parameter grids, exact tuning | Large search spaces, approximate tuning | Complex, high-dimensional spaces
Scalability | Poor for large search spaces | Better scalability | Excellent scalability