A. Model Selection
Choice of Algorithm:
Definition: Selecting an appropriate algorithm for building the machine learning model based
on the nature of the problem and the data characteristics.
Factors Influencing Algorithm Choice:
Nature of the Problem:
Classification: If the target variable is categorical (e.g., spam detection, image recognition).
Regression: If the target variable is continuous (e.g., house price prediction, stock price
forecasting).
Data Characteristics:
Size of the dataset: Some algorithms work better with large datasets (e.g., deep learning),
while others are better for smaller datasets (e.g., decision trees).
Data complexity: Linear models may work for simpler problems, while more complex
relationships require non-linear models like support vector machines or neural networks.
Feature types: For categorical data, tree-based algorithms or Naive Bayes might be preferred.
For numerical data, linear regression or neural networks might be better.
B. Model Training
1. Splitting Data:
Definition: The dataset is divided into multiple subsets to train, validate, and test the model.
Training Set: A portion of the data used to train the model.
Testing Set: A separate portion used to evaluate the model’s performance after training.
Validation Set: Sometimes used during training to tune the hyperparameters of the model.
Common Split Ratios: 80/20 or 70/30 for training/testing, or roughly 70/15/15 when a separate validation set is used.
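As an illustration, a minimal Python sketch of these splits using scikit-learn (the library choice, array contents, and ratios are assumptions for demonstration, not part of the notes):

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples, 3 features, standing in for a real dataset.
X = np.random.rand(100, 3)
y = np.random.rand(100)

# Hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Optionally carve a validation set out of the training data
# (0.25 of the remaining 80% = 20% of the original data).
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)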
2. Loss Function:
Definition: A function that measures the difference between the model's predicted output and
the actual output (ground truth).
Purpose: The goal during training is to minimize the loss function, which means the model’s
predictions are as close as possible to the true values.
Common Loss Functions:
For Classification: Cross-entropy loss (used for binary and multi-class classification).
For Regression: Mean Squared Error (MSE) or Mean Absolute Error (MAE).
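A minimal NumPy sketch of these losses on illustrative arrays (the numbers are made up; binary cross-entropy is shown for the classification case):

import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.4])

mse = np.mean((y_true - y_pred) ** 2)          # Mean Squared Error
mae = np.mean(np.abs(y_true - y_pred))         # Mean Absolute Error

# Binary cross-entropy: labels in {0, 1}, predicted probabilities in (0, 1).
labels = np.array([1, 0, 1])
probs = np.array([0.9, 0.2, 0.7])
bce = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))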
3. Optimization:
Definition: The process of adjusting the model's parameters (weights in the case of neural
networks, or coefficients in linear models) to minimize the loss function.
Goal: To find the set of parameters that result in the lowest possible value for the loss
function.
Optimization Algorithms:
Gradient Descent: A widely used optimization algorithm that updates the model parameters
iteratively by calculating the gradient (or slope) of the loss function.
Variants: Stochastic Gradient Descent (SGD), Mini-Batch Gradient Descent, Adam, etc.
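A minimal sketch of batch gradient descent for a one-feature linear model, minimizing MSE; the data, learning rate, and iteration count are illustrative assumptions, not a production optimizer:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])   # single feature
y = np.array([3.0, 5.0, 7.0, 9.0])   # target (roughly y = 2x + 1)

w, b = 0.0, 0.0                       # parameters to learn
lr = 0.01                             # learning rate

for _ in range(2000):
    y_pred = w * X + b
    error = y_pred - y
    grad_w = 2 * np.mean(error * X)   # gradient of MSE with respect to w
    grad_b = 2 * np.mean(error)       # gradient of MSE with respect to b
    w -= lr * grad_w                  # step against the gradient
    b -= lr * grad_b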
C. Model Evaluation
1. Metrics:
For Classification:
Accuracy: The proportion of correct predictions over the total predictions.
Precision: The proportion of true positives out of all predicted positives (useful in
imbalanced datasets).
Recall: The proportion of true positives out of all actual positives (useful when false
negatives are critical).
F1 Score: The harmonic mean of precision and recall, providing a balance between the two.
For Regression:
Mean Squared Error (MSE): Measures the average squared difference between predicted
and actual values.
R-squared (R²): Represents the proportion of variance in the dependent variable that is
predictable from the independent variables.
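A sketch of computing these metrics with scikit-learn, assuming small illustrative arrays of true and predicted values:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, r2_score

# Classification metrics (binary labels shown for illustration).
y_true_cls = [1, 0, 1, 1, 0]
y_pred_cls = [1, 0, 0, 1, 0]
print(accuracy_score(y_true_cls, y_pred_cls))
print(precision_score(y_true_cls, y_pred_cls))
print(recall_score(y_true_cls, y_pred_cls))
print(f1_score(y_true_cls, y_pred_cls))

# Regression metrics.
y_true_reg = [2.0, 3.5, 4.0]
y_pred_reg = [2.2, 3.4, 3.6]
print(mean_squared_error(y_true_reg, y_pred_reg))
print(r2_score(y_true_reg, y_pred_reg))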
2. Cross-Validation:
Definition: A technique used to assess how the model performs on different subsets of the
data to ensure that the model generalizes well and is not overfitting.
K-Fold Cross-Validation: The dataset is split into K equal-sized folds. The model is trained
on K-1 folds and tested on the remaining fold. This process is repeated K times, with each
fold serving as the test set once.
Advantages:
Provides a more reliable estimate of model performance by using multiple data splits.
Helps in reducing bias in the evaluation process.
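A minimal K-fold cross-validation sketch with scikit-learn; the estimator, synthetic dataset, and K=5 are illustrative assumptions:

from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

cv = KFold(n_splits=5, shuffle=True, random_state=0)   # 5 folds, each used once as the test set
scores = cross_val_score(model, X, y, cv=cv)           # one score per fold
print(scores.mean(), scores.std())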
The modeling and evaluation steps ensure that a machine learning model not only performs
well on the training data but also generalizes well to unseen data. Proper evaluation using
relevant metrics and cross-validation ensures that the model is both accurate and robust for
real-world deployment.
V. Assumptions in Regression Analysis
A. Linearity
Assumption: The relationship between the independent and dependent variables is linear.
Explanation: Linear regression assumes that the dependent variable (y) has a linear relationship
with the independent variables (x1, x2, ..., xn). This means that for each independent variable, a
unit change produces a constant change in y.
Implication: If the relationship is not linear, the model may not be suitable, and the predictions
or inferences could be misleading. In such cases, transformations (e.g., log, square root) or non-
linear models may be more appropriate.
Visualization: A scatter plot of the independent variable vs. the dependent variable should show
an approximately linear pattern.
B. Independence
Assumption: The residuals (errors) are independent of one another.
Explanation: Residuals, the differences between the observed and predicted values, should not
show any patterns or correlations. Each data point should be independent of others, and there
should be no systematic relationship between the residuals for one observation and the residuals
for another.
Implication: If residuals are not independent (i.e., autocorrelation exists), it can indicate that the
model is misspecified or that important variables or time dependencies are not accounted for.
Test: The Durbin-Watson test can be used to check for autocorrelation in residuals.
C. Homoscedasticity
Assumption: Residuals have constant variance across all levels of the independent variable(s).
Explanation: Homoscedasticity means that the spread or variability of the residuals is consistent
across all values of the independent variable(s). In other words, the variance of the prediction
errors should not change as the value of the independent variable(s) changes.
Implication: If the residuals are heteroscedastic (non-constant variance), this can lead to
inefficient estimates and biased statistical tests. For example, large values of the independent
variable could result in large prediction errors, and small values in small errors.
Visualization: A scatter plot of residuals versus predicted values should show a random scatter
with no discernible pattern. If there is a "fan-shaped" or "cone-shaped" pattern, heteroscedasticity
is likely present.
Test: The Breusch-Pagan test or White test can be used to detect heteroscedasticity.
D. Normality of Residuals
Assumption: The residuals are normally distributed.
Explanation: The residuals should follow a normal distribution for valid hypothesis testing and
reliable confidence intervals. While linear regression can still produce unbiased estimates even
when the normality assumption is violated, statistical tests (e.g., t-tests, F-tests) rely on this
assumption.
Implication: If the residuals are not normally distributed, it may affect the validity of confidence
intervals and significance tests for the regression coefficients. Non-normality can also indicate
model misspecification or the presence of outliers.
Test: The Shapiro-Wilk test or Kolmogorov-Smirnov test can be used to test for normality.
Additionally, a Q-Q plot (quantile-quantile plot) can visually assess normality by comparing the
quantiles of the residuals against those of a theoretical normal distribution.
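A sketch of these diagnostic checks (Durbin-Watson, Breusch-Pagan, Shapiro-Wilk) using statsmodels and SciPy; the libraries and the synthetic data are assumptions made for illustration:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)

X_const = sm.add_constant(X)               # add intercept term
model = sm.OLS(y, X_const).fit()
resid = model.resid

print(durbin_watson(resid))                # values near 2 suggest no autocorrelation
print(het_breuschpagan(resid, X_const))    # Breusch-Pagan test for heteroscedasticity
print(stats.shapiro(resid))                # Shapiro-Wilk test for normality of residuals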
Linearity: The relationship between independent and dependent variables should be linear.
Independence: Residuals should be independent of one another (no autocorrelation).
Homoscedasticity: Residuals should have constant variance across all levels of the independent
variables.
Normality of Residuals: Residuals must be normally distributed for valid hypothesis testing.
These assumptions are critical to the validity of the regression model. Violations of these
assumptions can lead to inaccurate predictions and unreliable inferences, highlighting the
importance of checking them before relying on the model's results.
A. Feature Scaling
Explanation: Feature scaling involves transforming the independent variables so they have
similar ranges, typically between 0 and 1 or with zero mean and unit variance. This is especially
important for algorithms that rely on distance or gradient-based optimization (like linear
regression trained with gradient descent).
1. Min-Max Scaling: Rescales each feature to the [0, 1] range using x' = (x - min) / (max - min).
Why it's important: Linear regression can be sensitive to the scale of the features. If features
are not scaled properly, the algorithm may give disproportionate importance to variables with
larger magnitudes. Scaling ensures all features contribute equally to the model.
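A minimal sketch of both scaling approaches mentioned above (min-max to [0, 1], and standardization to zero mean and unit variance) using scikit-learn; the array values are illustrative:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 5000.0]])

X_minmax = MinMaxScaler().fit_transform(X)   # each column rescaled to [0, 1]
X_std = StandardScaler().fit_transform(X)    # each column to zero mean, unit variance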
B. Feature Engineering
Explanation: Feature engineering involves transforming or creating new features that may
provide more useful information to the model. Well-crafted features can significantly improve
model performance.
Examples:
1. Polynomial Features:
Example: If the relationship between the size of a house and its price is quadratic, adding a
feature for the square of size (e.g., size²) can improve the model.
2. Interaction Features:
Create features that represent the interaction between two or more variables. For example, if the
house price depends on both the size and the number of bedrooms, adding a feature for
size x bedrooms can capture that combined effect.
3. Log Transformation:
For features with skewed distributions (e.g., income, population), applying a log transformation
can reduce the impact of extreme values and normalize the data.
Why it's important: Properly engineered features can improve the accuracy of the model by
making the relationship between predictors and the target variable more understandable. It can
also help the model learn better patterns and handle non-linearities in the data.
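A sketch of the three transformations above (polynomial, interaction, and log features) using NumPy and scikit-learn; the size, bedrooms, and income arrays are made-up illustrative values:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

size = np.array([1000.0, 1500.0, 2400.0])
bedrooms = np.array([2.0, 3.0, 4.0])
income = np.array([30000.0, 55000.0, 250000.0])

size_sq = size ** 2                   # polynomial feature: square of size
size_x_bedrooms = size * bedrooms     # interaction feature: size x bedrooms
log_income = np.log1p(income)         # log transformation for a skewed feature

# Polynomial and interaction terms can also be generated automatically.
X = np.column_stack([size, bedrooms])
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)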
C. Outlier Removal
Explanation: Outliers are data points that are significantly different from the rest of the data and
can negatively impact the performance of the model. They may disproportionately influence the
estimated coefficients and predictions.
Detecting Outliers:
1. Box Plot:
A box plot can help identify outliers by displaying the interquartile range (IQR). Data points
lying beyond 1.5 x IQR from the quartiles are typically flagged as outliers.
2. Z-Score:
The Z-score measures how many standard deviations a data point is from the mean. Data points
with a Z-score greater than 3 or less than -3 are often considered outliers.
3. Scatter Plot:
In some cases, visualizing the data using a scatter plot can help identify extreme values that
deviate clearly from the overall pattern.
Handling Outliers:
1. Removing Outliers:
Simply removing outliers can sometimes improve model performance, but this should be done
with caution, as dropping too many points can discard valuable information.
2. Capping or Winsorization:
This technique involves replacing extreme outliers with the maximum or minimum values within
an acceptable range.
3. Transformation:
Applying transformations like log or square root to the feature can reduce the influence of
outliers.
Why it's important: Outliers can distort the regression model, affecting its ability to generalize.
By identifying and appropriately handling outliers, the model becomes more robust and accurate.
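A sketch of the IQR and Z-score detection rules and of winsorization using NumPy; the data vector and thresholds are illustrative assumptions:

import numpy as np

x = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 95.0])   # 95.0 is an obvious outlier

# IQR rule (box-plot criterion): flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (x - x.mean()) / x.std()
z_outliers = np.abs(z) > 3

# Winsorization: cap extreme values at chosen percentiles instead of dropping them.
capped = np.clip(x, np.percentile(x, 5), np.percentile(x, 95))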
Feature Scaling ensures that the model treats all features equally by standardizing their ranges.
Feature Engineering helps create new, more informative features that can improve model
performance.
Outlier Removal reduces the impact of extreme values, making the model more robust and
accurate.
These techniques collectively enhance the performance of a linear regression model by making
the data more suitable for analysis, improving the model's generalization, and ensuring the most
relevant information is captured.
Ridge Regression
A. Principle
Objective: Ridge regression introduces a regularization term to the linear regression model to
prevent overfitting.
Overfitting occurs when a model becomes too complex, capturing not only the underlying
patterns in the data but also the noise, leading to poor generalization to new, unseen data. Ridge
regression addresses this by adding a penalty proportional to the sum of the squared coefficients
of the features to the cost function.
Effect of Regularization:
When λ=0, Ridge regression becomes equivalent to ordinary linear regression (no
regularization).
When λ is large, the regularization term dominates, and the coefficients are forced to be close to
zero, which makes the model simpler and less likely to overfit.
If λ is too large, the model may become underfit and fail to capture important patterns in the
data.
1. Prevents Overfitting: The regularization term reduces the model's complexity and helps
prevent overfitting, especially when dealing with a large number of features.
2. Improves Stability: Ridge regression helps stabilize the estimation of the coefficients,
especially when the independent variables are highly correlated (multicollinearity), which can
otherwise make ordinary least squares estimates unreliable.
3. Works Well for High-Dimensional Data: In datasets with many features (high-
dimensional data), Ridge regression can improve model performance by keeping the coefficients
small and controlled.
The value of λ controls the strength of regularization. It can be tuned using methods such as
cross-validation or grid search.
A common approach is to use cross-validation to evaluate the model performance for different
values of λ and select the one that minimizes the validation error.
Summary:
Ridge regression adds an L2 regularization term to the linear regression cost function to
penalize large coefficients.
The regularization parameter λ controls the strength of the penalty, helping to balance the
trade-off between underfitting and overfitting.
Benefits include better generalization, prevention of overfitting, and improved stability in high-
dimensional or multicollinear settings.
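A sketch of Ridge regression with cross-validated selection of the regularization strength in scikit-learn (where λ is exposed as the alpha parameter); the synthetic data and candidate alphas are illustrative assumptions:

from sklearn.linear_model import Ridge, RidgeCV
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

# Try several regularization strengths and pick the best via cross-validation.
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5).fit(X, y)
print(ridge_cv.alpha_)                 # selected λ (alpha)

model = Ridge(alpha=ridge_cv.alpha_).fit(X, y)
print(model.coef_)                     # shrunken, but generally nonzero, coefficients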
Lasso Regression
A. Principle
Objective: Lasso regression (Least Absolute Shrinkage and Selection Operator) introduces
a regularization term that not only penalizes the magnitude of the coefficients but also
performs feature selection.
Sparsity: The Lasso technique encourages some of the model's coefficients to become exactly
zero, effectively performing feature selection by removing irrelevant or less important features.
L1 Regularization: Lasso regression uses L1 regularization, which adds the sum of the
absolute values of the coefficients to the cost function. This is in contrast to Ridge regression,
which penalizes the sum of the squared coefficients (L2).
Effect of Regularization:
When λ=0, Lasso regression becomes equivalent to ordinary linear regression (no
regularization).
As λ increases, the penalty on the coefficients becomes stronger, leading to more coefficients
being shrunk toward zero.
When λ is large, many coefficients are driven to exactly zero, resulting in a sparse model with
fewer features.
1. Feature Selection: Lasso has the unique ability to set some of the coefficients exactly to
zero, effectively removing the corresponding features from the model. This is particularly useful
when dealing with datasets with a large number of features, as it helps in identifying the most
important features.
2. Sparsity: The L1 penalty encourages a sparse solution, where only the most important
features are kept, and the less important ones are discarded (set to zero).
3. Better for High-Dimensional Data: Lasso regression is highly effective when there are
many features, especially when some of them may not be informative. It can help prevent
overfitting by discarding uninformative features.
Like Ridge regression, the regularization parameter λ needs to be tuned to achieve the best
performance.
A common approach to determine the optimal λ is cross-validation, where the model is trained
on different values of λ, and the value that minimizes the validation error is selected.
Ridge Regression: The L2 regularization used in Ridge regression reduces the size of the
coefficients but does not set them exactly to zero.
Lasso Regression: The L1 regularization used in Lasso regression not only reduces the size of
coefficients but also eliminates some coefficients entirely by setting them to zero, thus
performing feature selection.
Elastic Net: Combines both L1 and L2 regularization (Ridge + Lasso), allowing for both
coefficient shrinkage and feature selection.
Summary:
Lasso regression adds an L1 regularization term to the linear regression cost function, which
penalizes the absolute values of the coefficients, encouraging sparsity and performing feature
selection.
The regularization parameter λ controls the strength of the penalty, and as λ increases, more
coefficients are driven to zero.
Benefits: Lasso regression is particularly useful when dealing with high-dimensional datasets
with many irrelevant features.
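A sketch of Lasso with cross-validated λ selection in scikit-learn, showing how some coefficients are driven exactly to zero; the synthetic dataset (only 5 of 20 features informative) is an illustrative assumption:

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = LassoCV(cv=5, random_state=0).fit(X, y)
print(lasso.alpha_)                   # selected regularization strength
print(np.sum(lasso.coef_ != 0))       # number of features kept (nonzero coefficients)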
Confusion Matrix Terms:
True positives: The number of positive observations the model correctly
predicted as positive.
False-positive: The number of negative observations the model
incorrectly predicted as positive.
True negative: The number of negative observations the model correctly
predicted as negative.
False-negative: The number of positive observations the model
incorrectly predicted as negative.
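These four counts can be read directly off a confusion matrix; a minimal scikit-learn sketch with illustrative binary labels:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, tn, fn)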
Support Vector Machine (SVM)
Hyperplane
A hyperplane is the decision boundary that separates the classes in the feature space; SVM
selects the hyperplane that segregates the classes best.
Margin
A margin is the gap between the two parallel lines drawn through the closest points of the
two classes.
This is calculated as the perpendicular distance from the line to support vectors or
closest points.
If the margin between the classes is larger, it is considered a good margin; a smaller margin
is a bad margin.
The main objective is to segregate the given dataset in the best possible way.
The distance between the nearest points of either class is known as the margin.
The objective is to select a hyperplane with the maximum possible margin between the support
vectors in the given dataset. SVM searches for this maximum-margin hyperplane in the following
steps:
1. Generate hyperplanes that segregate the classes in the best way. The left-hand
figure shows three hyperplanes: black, blue, and orange. Here, the blue
and orange have higher classification error, but the black separates the two
classes correctly.
2. Select the right hyperplane with the maximum segregation from the nearest data points of
either class.
SVM uses a kernel trick to transform the input space into a higher-dimensional space.
The data points are plotted on the x-axis and z-axis (z is the squared sum of both x
and y: z = x² + y²).
SVM Kernels
Here, the kernel takes a low-dimensional input space and transforms it into a higher
dimensional space.
In other words, you can say that it converts a non-separable problem into a separable
problem by adding more dimensions to it.
It is most useful in non-linear separation problems. The kernel trick helps you to build a
more accurate classifier.
Linear Kernel
A linear kernel can be used as a normal dot product of any two given observations.
The product between two vectors is the sum of the multiplication of each pair of
input values.
Polynomial Kernel
A polynomial kernel is a more generalized form of the linear kernel: K(x, y) = (x · y + 1)^d,
where d is the degree of the polynomial. d = 1 is similar to the linear transformation.
The Radial Basis Function (RBF) kernel is a popular kernel function commonly used in SVM
classification; it can map the input space into an infinite-dimensional space.
A higher value of gamma will perfectly fit the training dataset, which causes over-
fitting.
Advantages
SVM classifiers offer good accuracy and perform faster prediction compared to the Naive
Bayes algorithm.
They also use less memory because they use a subset of training points in the
decision phase.
SVM works well with a clear margin of separation and with high dimensional space.
Disadvantages
SVM is not suitable for large datasets because of its high training time, and it also takes more
time to train compared to Naive Bayes.
It works poorly with overlapping classes and is also sensitive to the type of kernel
used.
SVM hyperplane: for all positive points, w·x + b ≥ +1; for all negative points, w·x + b ≤ -1.
Dimensionality Reduction
The number of independent features or variables in a dataset is known as its dimension.
Dimensionality Reduction refers to the techniques that reduce the number of input
features/variables in a dataset.
Feature selection – In this, we are interested in finding k of the d dimensions that give us
the most information, and we discard the other (d − k) dimensions.
Feature extraction – In this, we are interested in finding a new set of k dimensions that are
combinations of the original d dimensions
Principal Component Analysis (PCA)
PCA is a dimensionality reduction technique that enables you to identify correlations and
patterns in a dataset so that it can be transformed into a dataset of significantly lower
dimension without losing any important information.
Example applications:
Face recognition
Image compression
Steps
Original Data
Normalize the original data (mean = 0, variance = 1)
Calculating the covariance matrix
Calculating eigenvalues, eigenvectors, and normalized eigenvectors
Calculating Principal Component (PC)
Plot the graph for orthogonality between PCs
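A sketch of these steps with NumPy and scikit-learn; the data is synthetic and purely illustrative, and sklearn's PCA performs the covariance/eigen decomposition internally:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

X_std = StandardScaler().fit_transform(X)        # normalize: mean = 0, variance = 1

# Manual covariance and eigen decomposition (steps 3-4 above).
cov = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Equivalent, more convenient route: sklearn PCA (steps 3-5).
pca = PCA(n_components=2)
X_pc = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)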
Advantages
Used for Dimensionality Reduction
PCA helps eliminate correlated features, a problem sometimes referred to as multi-
collinearity.
The time required to train the model is substantially shorter because of PCA’s
reduction in the number of features.
PCA aids in overcoming overfitting by eliminating the extraneous features from your
dataset.
Disadvantages
Useful for quantitative data but not effective with qualitative data.
Principal components are difficult to interpret in terms of the original features.
Hierarchical Clustering
Hierarchical clustering is an unsupervised machine learning technique that divides the population
into several clusters such that data points in the same cluster are more similar and data points in
different clusters are dissimilar.
The agglomerative (bottom-up) method starts with each object as its own cluster and merges the
closest clusters iteratively. On the other hand, the divisive method starts with one cluster with all
given objects and then splits it iteratively to form smaller clusters.
Pros
No assumption of a particular number of clusters (unlike k-means)
May correspond to meaningful taxonomies
Cons
Once a decision is made to combine two clusters, it can’t be undone
Too slow for large data sets, O(n² log(n))
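A sketch of agglomerative (bottom-up) clustering with scikit-learn on toy data; the dataset, number of clusters, and linkage choice are illustrative assumptions:

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Bottom-up clustering: repeatedly merge the closest pair of clusters.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)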
DBSCAN
There are different approaches and algorithms to perform clustering tasks
which can be divided into three sub-categories:
Partition-based clustering: E.g. k-means, k-median
Hierarchical clustering: E.g. Agglomerative, Divisive
Density-based clustering: E.g. DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It
is able to find arbitrary shaped clusters and clusters with noise (i.e. outliers).
In DBSCAN, instead of guessing the number of clusters, we define two hyperparameters,
epsilon and minPoints, to arrive at clusters.
Epsilon (ε): The distance that specifies the neighborhoods. Two points are considered
to be neighbors if the distance between them is less than or equal to epsilon.
minPoints(n): Minimum number of data points to define a cluster.
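A minimal DBSCAN sketch with scikit-learn, where eps and min_samples correspond to epsilon and minPoints above; the dataset and hyperparameter values are illustrative assumptions:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps = epsilon, min_samples = minPoints
labels = db.labels_                          # label -1 marks noise points (outliers)
print(set(labels))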
Q-Learning:
Q-learning is an off-policy RL algorithm used for temporal difference
learning. Temporal difference learning methods are a way of comparing
temporally successive predictions.
It learns the value function Q(s, a), which measures how good it is to take action "a" in a
particular state "s".
SARSA stands for State-Action-Reward-State-Action, which is an on-policy temporal
difference learning method. The on-policy control method selects the action for each
state while learning using a specific policy.
The goal of SARSA is to calculate Qπ(s, a) for the selected current policy π and
all state-action pairs (s, a).
The main difference between the Q-learning and SARSA algorithms is that, unlike in Q-
learning, the maximum Q-value of the next state is not required for updating the
Q-value in the table.
In SARSA, new action and reward are selected using the same policy, which has
determined the original action.
SARSA is so named because it uses the quintuple Q(s, a, r, s', a'), where:
s: original state
a: Original action
r: reward observed while following the states
s' and a': New state, action pair.
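A sketch of the two update rules side by side, assuming a tabular Q array, a discount factor gamma, and a learning rate alpha (all names and values are illustrative, not from the notes):

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9

def q_learning_update(s, a, r, s_next):
    # Off-policy: bootstrap with the maximum Q-value of the next state.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: bootstrap with the Q-value of the action actually chosen next
    # by the same policy that chose the original action.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])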