ML Asst.-01
JALANDHAR
ASSIGNMENT - 01
Random Forest Algorithm
Imagine asking a group of friends for advice on where to go for vacation. Each friend gives
their recommendation based on their unique perspective and preferences (decision trees
trained on different subsets of data). You then make your final decision by considering the
majority opinion or averaging their suggestions (ensemble prediction).
The process starts with a training dataset in which each row is an example and the columns hold its features and corresponding class label.
Multiple decision trees are then created from the training data: each tree is trained
on a random subset of the rows (sampled with replacement) and a random subset of the features.
This process is known as bagging, or bootstrap aggregating.
Each Decision Tree in the ensemble learns to make predictions independently.
When presented with a new, unseen instance, each Decision Tree in the ensemble
makes a prediction.
The final prediction is made by combining the predictions of all the Decision Trees. This is
typically done through a majority vote (for classification) or averaging (for regression).
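To make the bagging step concrete, here is a small illustrative sketch (the data size and random seed are assumptions) of drawing one bootstrap sample, i.e., sampling rows with replacement as each tree in the forest would see them:
import numpy as np
# Illustrative only: 10 row indices and a fixed seed are assumptions
rng = np.random.default_rng(42)
n_rows = 10
indices = np.arange(n_rows)
# One bootstrap sample: same size as the data, drawn with replacement,
# so some rows repeat and others are left out ("out-of-bag")
bootstrap = rng.choice(indices, size=n_rows, replace=True)
print("Bootstrap sample:", sorted(bootstrap.tolist()))
print("Out-of-bag rows: ", sorted(set(indices.tolist()) - set(bootstrap.tolist())))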
Key Features of Random Forest
Handles Missing Data: Automatically handles missing values during training,
eliminating the need for manual imputation.
Feature Importance: The algorithm ranks features by their importance in making
predictions, offering valuable insights for feature selection and interpretability.
Scalability: Scales well to large and complex datasets without significant
performance degradation.
Versatility: The algorithm can be applied to both classification tasks (e.g., predicting
categories) and regression tasks (e.g., predicting continuous values).
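The classification code that produced the output below is not reproduced in this section. The following is a minimal sketch of how it could look, assuming the Titanic dataset loaded via seaborn; the column names, preprocessing, and train/test split here are assumptions, so the exact numbers may differ.
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# Load the Titanic data (assumed source) and keep the columns used in the sample passenger
titanic = sns.load_dataset('titanic').dropna(subset=['age'])
titanic['sex'] = titanic['sex'].map({'male': 0, 'female': 1})
features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare']
X = titanic[features]
y = titanic['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
# Evaluate on the test set
y_pred = rf_classifier.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print("Classification Report:")
print(classification_report(y_test, y_pred))
# Predict survival for a single sample passenger
sample = pd.DataFrame([{'pclass': 3, 'sex': 1, 'age': 28.0,
                        'sibsp': 1, 'parch': 1, 'fare': 15.2458}])
prediction = rf_classifier.predict(sample)
print("Predicted Survival:", "Survived" if prediction[0] == 1 else "Did Not Survive")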
Output:
Accuracy: 0.80
Classification Report:
precision recall f1-score support
Sample Passenger: {'Pclass': 3, 'Sex': 1, 'Age': 28.0, 'SibSp': 1, 'Parch': 1, 'Fare': 15.2458}
Predicted Survival: Did Not Survive
Implementing Random Forest for Regression Tasks
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
# Load the California housing dataset
california_housing = fetch_california_housing()
california_data = pd.DataFrame(california_housing.data,
columns=california_housing.feature_names)
california_data['MEDV'] = california_housing.target
# Features and target variable
X = california_data.drop('MEDV', axis=1)
y = california_data['MEDV']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
# Train the regressor
rf_regressor.fit(X_train, y_train)
# Make predictions
y_pred = rf_regressor.predict(X_test)
# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Sample Prediction
single_data = X_test.iloc[0].values.reshape(1, -1)
predicted_value = rf_regressor.predict(single_data)
print(f"Predicted Value: {predicted_value[0]:.2f}")
print(f"Actual Value: {y_test.iloc[0]:.2f}")
# Print results
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")
Output:
Predicted Value: 0.51
Actual Value: 0.48
Mean Squared Error: 0.26
R-squared Score: 0.80
Random Forest learns from the training data much like a real estate expert. After training, it predicts
house prices on the test set. We evaluate the model's performance using the Mean Squared Error
and R-squared score, which show how accurate the predictions are, and use a sample from the test set
to check an individual prediction.
Advantages of Random Forest
Random Forest provides very accurate predictions even with large datasets.
Random Forest can handle missing data well without compromising accuracy.
It does not require normalization or standardization of the dataset.
Combining multiple decision trees reduces the risk of overfitting the model.
Limitations of Random Forest
It can be computationally expensive, especially with a large number of trees.
The model is harder to interpret than simpler models such as a single decision tree.
Naïve Bayes Classifier
This algorithm is called naïve because it works on the naïve assumption that the features are
independent. The Naïve Bayes classifier is based on Bayes' Theorem, one of the most fundamental
concepts in analytics, with a wide range of applications; it often plays a crucial role in
decision-making processes. Let's consider two events A and B. The conditional probability of B given A is:
P(B|A) = P(A|B) · P(B) / P(A)
This equation is Bayes' Theorem: for an event B, we can update the associated probability when
additional information is provided (here A is the additional information).
Key terms in Bayes’ Theorem
1. Prior probability (P(B), P(A)): the probability value without any additional
information
2. Posterior probability (P(B|A)): the probability of event B given the additional
information A
3. P(A|B): the likelihood of observing A if B is true
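As a quick worked example (with assumed, illustrative numbers): suppose a disease affects 1% of a population (P(B) = 0.01), a test is positive in 95% of diseased cases (P(A|B) = 0.95), and the test comes back positive for 10% of all people (P(A) = 0.10). Then P(B|A) = 0.95 × 0.01 / 0.10 = 0.095, so the positive test updates the probability of disease from the 1% prior to a posterior of about 9.5%.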
There is an interesting game related to Bayes theorem. Interested in games? Go ahead and
read about the famous Monty Hall Problem.
When it comes to a classification problem, Bayes' theorem can be reinterpreted as:
P(c | y1, y2, …, yn) = [P(y1|c) · P(y2|c) · … · P(yn|c) · P(c)] / P(y1, y2, …, yn)
where c represents the class the data belongs to and y1, y2, …, yn represent the predictor
features. Looking at the denominator terms, there can be cases where the probability of the
evidence (or of a feature value within a class) is 0, which creates a problem for the division.
To tackle this issue, the counts are increased by a small value (typically 1) so that no
probability becomes zero. This adjustment is called the Laplace correction.
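In scikit-learn, this correction corresponds to the alpha smoothing parameter of the categorical and multinomial Naïve Bayes estimators (alpha = 1 gives the Laplace correction). A tiny sketch with made-up data:
import numpy as np
from sklearn.naive_bayes import CategoricalNB
# Made-up example: two categorical features encoded as integers, binary labels
X = np.array([[0, 1], [1, 1], [0, 0], [1, 0], [0, 1]])
y = np.array([1, 1, 0, 0, 1])
# With alpha=1 (Laplace correction), category counts of zero no longer
# produce zero probabilities in the likelihood estimates
model = CategoricalNB(alpha=1.0)
model.fit(X, y)
print(model.predict_proba([[1, 1]]))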
Steps for Naïve Bayes Classification
1. Calculate the Prior probabilities of the classes involved
2. Calculate the likelihood of evidence with each feature for each class
3. Calculate the Posterior probability using Bayes rule
4. The class with the highest posterior probability is selected as the prediction for the input
When a feature x is categorical, it is easy to calculate the associated probabilities from the
category counts. When the feature x is continuous, we assume that x is normally distributed
within each class (Gaussian Naïve Bayes). The likelihood is then given by
P(x | c) = (1 / √(2π σ_c²)) · exp(−(x − μ_c)² / (2σ_c²))
where μ_c and σ_c² are the mean and variance of the feature within class c.
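As a concrete sketch of these steps, scikit-learn's GaussianNB estimates the class priors and the per-feature Gaussian likelihoods from the training data and applies Bayes' rule; the Iris dataset below is an assumed example, not part of the assignment:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# Assumed example dataset with continuous features
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit class priors and per-feature Gaussian likelihoods, then predict via Bayes' rule
nb = GaussianNB()
nb.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, nb.predict(X_test)):.2f}")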
Pros
1. Easy to implement
2. Performs reasonably well with noisy data
Cons
1. Poor performance with continuous features
2. Assumption that features are independent is risky
KNN Classifier
The K-Nearest Neighbors algorithm can be used to solve both classification and regression
problems. While algorithms such as the Naïve Bayes classifier use probabilities estimated from
the training samples to make predictions, KNN is a lazy learner that does not build a model in
advance: it simply finds the closest training samples based on feature similarity.
Similarity Measures
The most popular similarity metrics are distance measures, and several are available.
1. Euclidean Distance
This is the most commonly used distance measure. For two points (x1, x2) and (y1, y2), the
Euclidean distance is given by:
d = √((x1 − y1)² + (x2 − y2)²)
2. Manhattan Distance
Also known as the city block or absolute distance, it is inspired by the grid structure of
Manhattan. For two points (x1, x2) and (y1, y2), the Manhattan distance is given by:
d = |x1 − y1| + |x2 − y2|
3. Chebyshev Distance
Also known as the chessboard or maximum value distance. For two points (x1, x2) and (y1, y2),
the Chebyshev distance is given by:
d = max(|x1 − y1|, |x2 − y2|)
4. Minkowski Distance
This is a generalized distance measure; all the distances mentioned above can be obtained
from the generalized formula:
d = (|x1 − y1|^p + |x2 − y2|^p)^(1/p)
With p = 1 it reduces to the Manhattan distance, with p = 2 to the Euclidean distance, and as
p → ∞ to the Chebyshev distance.
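A minimal sketch of a KNN classifier with scikit-learn, where the Minkowski metric with p = 2 corresponds to the Euclidean distance above and p = 1 to the Manhattan distance (the wine dataset, k = 5, and the feature scaling are assumptions for illustration):
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# Assumed example dataset; features are scaled because distance measures
# are sensitive to feature magnitudes
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
# metric="minkowski" with p=2 is the Euclidean distance; p=1 gives Manhattan
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)
knn.fit(X_train, y_train)
print(f"Accuracy: {knn.score(X_test, y_test):.2f}")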
Support Vector Machine (SVM) Algorithm
Support Vector Machine (SVM) is a supervised machine learning algorithm used for
classification and regression tasks. While it can handle regression problems, SVM is
particularly well-suited for classification tasks.
SVM aims to find the optimal hyperplane in an N-dimensional space to separate data
points into different classes. The algorithm maximizes the margin between the closest
points of different classes.
Support Vector Machine (SVM) Terminology
Hyperplane: A decision boundary separating different classes in feature space,
represented by the equation wx + b = 0 in linear classification.
Support Vectors: The closest data points to the hyperplane, crucial for determining
the hyperplane and margin in SVM.
Margin: The distance between the hyperplane and the support vectors. SVM aims to
maximize this margin for better classification performance.
Kernel: A function that maps data to a higher-dimensional space, enabling SVM to
handle non-linearly separable data.
Hard Margin: A maximum-margin hyperplane that perfectly separates the data
without misclassifications.
Soft Margin: Allows some misclassifications by introducing slack variables,
balancing margin maximization and misclassification penalties when data is not
perfectly separable.
C: A regularization term balancing margin maximization and misclassification
penalties. A higher C value enforces a stricter penalty for misclassifications.
Hinge Loss: A loss function penalizing misclassified points or margin violations,
combined with regularization in SVM.
Dual Problem: Involves solving for Lagrange multipliers associated with support
vectors, facilitating the kernel trick and efficient computation.
How does Support Vector Machine Algorithm Work?
The key idea behind the SVM algorithm is to find the hyperplane that best separates
two classes by maximizing the margin between them. This margin is the distance
from the hyperplane to the nearest data points (support vectors) on each side.
(Figure: multiple candidate hyperplanes separating the data from the two classes.)
The best hyperplane, also known as the "hard margin," is the one that maximizes the
distance between the hyperplane and the nearest data points from both classes. This
ensures a clear separation between the classes; in the figure, the hyperplane labelled
L2 would be chosen as the hard margin.
Consider a hyperplane defined by the equation w^T x + b = 0, where:
w is the normal vector to the hyperplane (the direction perpendicular to it).
b is the offset or bias term, representing the distance of the hyperplane from the
origin along the normal vector w.
Distance from a Data Point to the Hyperplane
The distance between a data point x_i and the decision boundary can be calculated as:
d_i = (w^T x_i + b) / ||w||
where ||w|| is the Euclidean norm of the normal vector w.
Linear SVM Classifier
The prediction rule for a data point x is:
ŷ = 1 if w^T x + b ≥ 0
ŷ = 0 if w^T x + b < 0
where ŷ is the predicted label of the data point.
Optimization Problem for SVM
For a linearly separable dataset, the goal is to find the hyperplane that maximizes the
margin between the two classes while ensuring that all data points are correctly
classified. This leads to the following optimization problem:
minimize_(w,b) (1/2)||w||²
Subject to the constraint:
y_i (w^T x_i + b) ≥ 1  for i = 1, 2, 3, …, m
Where:
y_i is the class label (+1 or −1) for each training instance.
x_i is the feature vector for the i-th training instance.
m is the total number of training instances.
The condition y_i (w^T x_i + b) ≥ 1 ensures that each data point is correctly
classified and lies outside the margin.
Soft Margin Linear SVM Classifier
In the presence of outliers or non-separable data, the SVM allows some
misclassification by introducing slack variables ζ_i. The optimization problem is
modified as:
minimize_(w,b) (1/2)||w||² + C Σ_{i=1}^{m} ζ_i
Subject to the constraints:
y_i (w^T x_i + b) ≥ 1 − ζ_i  and  ζ_i ≥ 0  for i = 1, 2, …, m
Where:
C is a regularization parameter that controls the trade-off between margin
maximization and the penalty for misclassifications.
ζ_i are slack variables that represent the degree of violation of the margin by each
data point.
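To see the effect of C in practice, the brief sketch below (the two-blob toy data is an assumption for illustration) fits a linear SVC with increasing values of C; a larger C penalizes margin violations more strictly:
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
# Two slightly overlapping clusters, so a hard margin is not achievable
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=42)
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Typically, fewer support vectors and a narrower margin as C grows
    print(f"C={C:>6}: support vectors = {clf.n_support_.sum()}, "
          f"training accuracy = {clf.score(X, y):.2f}")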
Dual Problem for SVM
The dual problem involves maximizing the Lagrange multipliers associated with the
support vectors. This transformation allows solving the SVM optimization using
kernel functions for non-linear classification.
The dual objective function is given by:
maximize_α  Σ_{i=1}^{m} α_i − (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j t_i t_j K(x_i, x_j)
Where:
α_i are the Lagrange multipliers associated with the i-th training sample.
t_i is the class label for the i-th training sample (+1 or −1).
K(x_i, x_j) is the kernel function that computes the similarity between data
points x_i and x_j. The kernel allows SVM to handle non-linear classification
problems by mapping the data into a higher-dimensional space.
The dual formulation optimizes the Lagrange multipliers α_i, and the support vectors
are those training samples where α_i > 0.
SVM Decision Boundary
Once the dual problem is solved, the decision function for a test data point x is given by:
f(x) = Σ_{i=1}^{m} α_i t_i K(x_i, x) + b
Where b is the bias term.
Finally, the bias term b is determined from the support vectors, which satisfy:
t_i (w^T x_i + b) = 1  ⇒  b = t_i − w^T x_i
Where x_i is any support vector.
This completes the mathematical framework of the Support Vector Machine
algorithm, which allows for both linear and non-linear classification using the dual
problem and kernel trick.
Types of Support Vector Machine
Based on the nature of the decision boundary, Support Vector Machines (SVM) can
be divided into two main parts:
Linear SVM: Linear SVMs use a linear decision boundary to separate the data points
of different classes. When the data can be precisely linearly separated, linear SVMs
are very suitable. This means that a single straight line (in 2D) or a hyperplane (in
higher dimensions) can entirely divide the data points into their respective classes. A
hyperplane that maximizes the margin between the classes is the decision boundary.
Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot be
separated into two classes by a straight line (in the case of 2D). By using kernel
functions, nonlinear SVMs can handle nonlinearly separable data. The original input
data is transformed by these kernel functions into a higher-dimensional feature space,
where the data points can be linearly separated. A linear SVM is used to locate a
nonlinear decision boundary in this modified space.
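As a brief illustration of this difference, the sketch below (scikit-learn's make_moons toy data is an assumed example) compares a linear-kernel SVC with an RBF-kernel SVC on data that a straight line cannot separate:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
# Two interleaving half-moons: not linearly separable
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=1.0).fit(X_train, y_train)
    print(f"{kernel} kernel accuracy: {clf.score(X_test, y_test):.2f}")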
Implementing SVM Algorithm in Python
Predict whether a tumour is benign or malignant. Historical data about patients
diagnosed with cancer, described by a set of independent attributes, enables doctors
to differentiate malignant cases from benign ones.
Load the breast cancer dataset from sklearn.datasets
Separate input features and target variables.
Build and train the SVM classifiers using RBF kernel.
Plot the scatter plot of the input features.
# Load the important packages
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.svm import SVC
# Load the datasets
cancer = load_breast_cancer()
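# Keep only the first two features so the decision boundary can be plotted in 2D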
X = cancer.data[:, :2]
y = cancer.target
# Build the SVM model with an RBF kernel
svm = SVC(kernel="rbf", gamma=0.5, C=1.0)
# Train the model
svm.fit(X, y)
# Plot Decision Boundary
DecisionBoundaryDisplay.from_estimator(
svm,
X,
response_method="predict",
cmap=plt.cm.Spectral,
alpha=0.8,
xlabel=cancer.feature_names[0],
ylabel=cancer.feature_names[1],
)
# Scatter plot
plt.scatter(X[:, 0], X[:, 1],
c=y,
s=20, edgecolors="k")
plt.show()
Output: a scatter plot of the two input features, coloured by class, with the fitted RBF decision boundary overlaid.
Evaluating Regression Models
An essential step in any machine learning project is to evaluate the accuracy of the model.
In regression analysis, the Mean Squared Error, Mean Absolute Error, Root Mean Squared Error,
and R-squared (coefficient of determination) metrics are used to evaluate the performance of
the model.
The Mean Absolute Error represents the average of the absolute differences between the
actual and predicted values in the dataset. It measures the average magnitude of the
residuals.
Mean Squared Error represents the average of the squared differences between the
original and predicted values in the dataset. It measures the variance of the residuals.
Root Mean Squared Error is the square root of the Mean Squared Error. It measures the
standard deviation of the residuals.
The coefficient of determination, or R-squared, represents the proportion of the
variance in the dependent variable that is explained by the regression model.
It is a scale-free score, i.e., irrespective of whether the values are small or large,
R-squared will not exceed one.
Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) penalize large
prediction errors more heavily than Mean Absolute Error (MAE). However, RMSE is
more widely used than MSE when comparing a regression model against other models,
as it has the same units as the dependent variable (Y-axis).
MSE is a differentiable function, which makes it easier to perform mathematical
operations on than a non-differentiable function like MAE. Therefore, in
many models, RMSE is used as the default metric for the loss function, despite
being harder to interpret than MAE.
Lower values of MAE, MSE, and RMSE imply higher accuracy of a regression model,
whereas a higher value of R-squared is considered desirable.
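A minimal sketch of computing these four metrics with scikit-learn (the y_true and y_pred values are made-up numbers for illustration):
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Made-up actual and predicted values
y_true = np.array([3.0, 2.5, 4.0, 5.5])
y_pred = np.array([2.8, 2.9, 4.2, 5.0])
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)          # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)
print(f"MAE: {mae:.3f}, MSE: {mse:.3f}, RMSE: {rmse:.3f}, R2: {r2:.3f}")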