0% found this document useful (0 votes)
5 views6 pages

Data_Mining_Lab-2

The document provides an overview of ensemble learning, focusing on bagging and boosting techniques, including their strengths and limitations. It details the implementation of Random Forest, AdaBoost, and Gradient Boosting classifiers using the Iris dataset, along with performance evaluation through classification reports and cross-validation. The conclusion emphasizes the importance of selecting the appropriate ensemble method based on the dataset and problem characteristics.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views6 pages

Data_Mining_Lab-2

The document provides an overview of ensemble learning, focusing on bagging and boosting techniques, including their strengths and limitations. It details the implementation of Random Forest, AdaBoost, and Gradient Boosting classifiers using the Iris dataset, along with performance evaluation through classification reports and cross-validation. The conclusion emphasizes the importance of selecting the appropriate ensemble method based on the dataset and problem characteristics.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 6

11/2/24, 8:01 PM Untitled13

Sindhuja_Suresh_0817685_Data_Mining_Lab-2

1.Introduction to Ensemble Learning

Ensemble learning is a powerful machine learning approach where multiple models


(often referred to as "weak learners") are combined to produce a more accurate, robust
prediction. This technique leverages the strengths of individual models and minimizes
their weaknesses, which can result in higher accuracy and improved generalization on
unseen data.

The two main types of ensemble learning methods are Bagging and Boosting:

Bagging (Bootstrap Aggregating): Bagging aims to reduce variance and avoid overfitting
by training multiple instances of the same model on different subsets of the dataset. The
Random Forest algorithm is a popular example of bagging.

Boosting: Boosting focuses on reducing bias by sequentially training models, each


attempting to correct the errors of its predecessor. AdaBoost and Gradient Boosting are
commonly used boosting algorithms.

2. Loading and Preprocessing Data:

In [1]: # Import necessary libraries


import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Load and prepare the dataset
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

3. Data Splitting and Scaling

In [3]: #Split the data into training and testing sets and standardize the features.
# Splitting data into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']
# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_
# Standardizing features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

4. Applying Random Forest (Bagging Technique):

file:///C:/Users/sindhuja/Downloads/Untitled13.html 1/6
11/2/24, 8:01 PM Untitled13

In [4]: #Train a Random Forest classifier and evaluate its performance.

from sklearn.ensemble import RandomForestClassifier


rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)
# Evaluation
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))
sns.heatmap(confusion_matrix(y_test, y_pred_rf), annot=True, fmt='d')
plt.show()

Random Forest Classification Report:


precision recall f1-score support

0 1.00 1.00 1.00 10


1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30

5. Applying AdaBoost (Boosting Technique)

In [12]: #Train an AdaBoost classifier and evaluate its performance. Use AdaBoostClassifi

from sklearn.ensemble import AdaBoostClassifier


model = AdaBoostClassifier(algorithm='SAMME')
ada_model.fit(X_train, y_train)
y_pred_ada = ada_model.predict(X_test)
# Evaluation
print("AdaBoost Classification Report:")

file:///C:/Users/sindhuja/Downloads/Untitled13.html 2/6
11/2/24, 8:01 PM Untitled13

print(classification_report(y_test, y_pred_ada))
sns.heatmap(confusion_matrix(y_test, y_pred_ada), annot=True, fmt='d')
plt.show()

C:\Users\sindhuja\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.p
y:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will
be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
warnings.warn(
AdaBoost Classification Report:
precision recall f1-score support

0 1.00 1.00 1.00 10


1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30

6. Applying Gradient Boosting (Boosting Technique)

In [13]: #Train a Gradient Boosting classifier and evaluate its performance. Use Gradient
from sklearn.ensemble import GradientBoostingClassifier
gb_model = GradientBoostingClassifier()
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)
# Evaluation
print("Gradient Boosting Classification Report:")
print(classification_report(y_test, y_pred_gb))
sns.heatmap(confusion_matrix(y_test, y_pred_gb), annot=True, fmt='d')
plt.show()

file:///C:/Users/sindhuja/Downloads/Untitled13.html 3/6
11/2/24, 8:01 PM Untitled13

Gradient Boosting Classification Report:


precision recall f1-score support

0 1.00 1.00 1.00 10


1 1.00 1.00 1.00 9
2 1.00 1.00 1.00 11

accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30

7. Cross-Validation

In [15]: import warnings


# Ignore FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Perform cross-validation on each of the ensemble models to evaluate their robu


# Cross-validation for Random Forest
cv_scores_rf = cross_val_score(rf_model, X, y, cv=5)
print(f"Random Forest CV Scores: {cv_scores_rf}")
print(f"Average CV Score: {np.mean(cv_scores_rf)}")

# Cross-validation for AdaBoost


cv_scores_ada = cross_val_score(ada_model, X, y, cv=5)
print(f"AdaBoost CV Scores: {cv_scores_ada}")
print(f"Average CV Score: {np.mean(cv_scores_ada)}")

# Cross-validation for Gradient Boosting


cv_scores_gb = cross_val_score(gb_model, X, y, cv=5)
print(f"Gradient Boosting CV Scores: {cv_scores_gb}")
print(f"Average CV Score: {np.mean(cv_scores_gb)}")

file:///C:/Users/sindhuja/Downloads/Untitled13.html 4/6
11/2/24, 8:01 PM Untitled13

Random Forest CV Scores: [0.96666667 0.96666667 0.93333333 0.96666667 1. ]


Average CV Score: 0.9666666666666668
AdaBoost CV Scores: [0.96666667 0.93333333 0.9 0.93333333 1. ]
Average CV Score: 0.9466666666666667
Gradient Boosting CV Scores: [0.96666667 0.96666667 0.9 0.96666667 1.
]
Average CV Score: 0.9600000000000002

8. Analysis and Conclusion:

1)Bagging (Bootstrap Aggregating)

Strengths: Variance Reduction: By training models on different subsets of data, bagging


mitigates overfitting. Parallelization: Models can be trained independently, making it
easier to scale and reduce training time. Robustness: More resilient to noise and outliers
compared to individual models.

Limitations: Bias Retention: While bagging reduces variance, it may not address bias,
potentially leading to suboptimal performance if the base learner is biased. Model
Interpretability: The ensemble nature can make it harder to interpret results compared to
single models.

2)Boosting

Strengths: Bias Reduction: By sequentially correcting errors, boosting can significantly


lower bias and improve model accuracy. Focus on Difficult Cases: The approach
emphasizes instances that are harder to classify, which can lead to better performance on
challenging datasets. Flexibility: Different loss functions can be applied, making boosting
suitable for various tasks.

Limitations: Overfitting Risk: Boosting can overfit the training data, particularly if the
model complexity is not controlled. Training Time: The sequential nature of boosting may
result in longer training times, especially with large datasets.

Comparison of Ensemble Models

Cross-Validation Scores:

Bagging Models (e.g., Random Forest): Typically show stable cross-validation scores with
lower variance. For example, cross-validation may yield an average score of 0.85 with a
standard deviation of 0.02. Boosting Models (e.g., Gradient Boosting): Often achieve
higher average cross-validation scores (e.g., 0.88) but may exhibit higher variability
depending on the dataset.

Final Test Results:

Bagging Model Final Test Score: Generally, bagging models maintain consistent test
performance, often around 0.83-0.87. Boosting Model Final Test Score: Boosting models
may achieve higher test scores, such as 0.90, but are more sensitive to data distribution
and model hyperparameters.

file:///C:/Users/sindhuja/Downloads/Untitled13.html 5/6
11/2/24, 8:01 PM Untitled13

Conclusion:

Both bagging and boosting are powerful techniques in ensemble learning, each with its
strengths and weaknesses. Bagging is effective for variance reduction and provides
robustness against noise, making it suitable for unstable models. In contrast, boosting
excels in reducing bias and achieving high accuracy, though it requires careful tuning to
avoid overfitting.

When selecting an ensemble method, it’s crucial to consider the nature of the dataset
and the specific problem at hand.

file:///C:/Users/sindhuja/Downloads/Untitled13.html 6/6

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy