Sindhuja_Suresh_0817685_Data_Mining_Lab-2
The two main types of ensemble learning methods are Bagging and Boosting:

Bagging (Bootstrap Aggregating): Bagging aims to reduce variance and avoid overfitting by training multiple instances of the same model on different bootstrap subsets of the dataset and combining their predictions by voting or averaging. The Random Forest algorithm is a popular example of bagging.

Boosting: Boosting trains models sequentially, with each new model concentrating on the examples its predecessors misclassified, which primarily reduces bias. AdaBoost and Gradient Boosting, both trained below, are popular examples.
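A minimal sketch of bagging with scikit-learn's BaggingClassifier (the Iris dataset and all parameter values here are assumptions for illustration, not part of the original lab cells):

# Illustrative sketch only: bagging decision trees with scikit-learn.
# The dataset and every parameter value below are assumptions.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Each of the 50 trees is fit on a bootstrap sample of the training data;
# predictions are combined by majority vote, which is what reduces variance.
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=50, bootstrap=True, random_state=42)
bag.fit(X_tr, y_tr)
print("Bagging test accuracy:", bag.score(X_te, y_te))

Setting bootstrap=True is what makes this bagging rather than plain model averaging: each base tree sees a different resampled view of the data.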
In [3]: # Split the data into training and testing sets and standardize the features.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Splitting data into features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardizing features to zero mean and unit variance
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
In [12]: # Train an AdaBoost classifier and evaluate its performance. Use AdaBoostClassifier.
from sklearn.ensemble import AdaBoostClassifier
ada_model = AdaBoostClassifier()
ada_model.fit(X_train, y_train)
y_pred_ada = ada_model.predict(X_test)
# Evaluation
print("AdaBoost Classification Report:")
print(classification_report(y_test, y_pred_ada))
sns.heatmap(confusion_matrix(y_test, y_pred_ada), annot=True, fmt='d')
plt.show()
C:\Users\sindhuja\anaconda3\Lib\site-packages\sklearn\ensemble\_weight_boosting.py:519: FutureWarning: The SAMME.R algorithm (the default) is deprecated and will be removed in 1.6. Use the SAMME algorithm to circumvent this warning.
  warnings.warn(
AdaBoost Classification Report:
              precision    recall  f1-score   support

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
In [13]: # Train a Gradient Boosting classifier and evaluate its performance. Use GradientBoostingClassifier.
from sklearn.ensemble import GradientBoostingClassifier
gb_model = GradientBoostingClassifier()
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)
# Evaluation
print("Gradient Boosting Classification Report:")
print(classification_report(y_test, y_pred_gb))
sns.heatmap(confusion_matrix(y_test, y_pred_gb), annot=True, fmt='d')
plt.show()
    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
7. Cross-Validation
1) Bagging
Limitations:
Bias Retention: While bagging reduces variance, it may not address bias, potentially leading to suboptimal performance if the base learner is biased.
Model Interpretability: The ensemble nature can make results harder to interpret than those of a single model.
2) Boosting
Limitations:
Overfitting Risk: Boosting can overfit the training data, particularly if model complexity is not controlled.
Training Time: The sequential nature of boosting may result in longer training times, especially with large datasets.
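A minimal sketch of how this overfitting risk can be restrained with scikit-learn's GradientBoostingClassifier (every parameter value below is an illustrative assumption, not a tuned result from this lab):

# Illustrative sketch only: restraining boosting to limit overfitting.
# All parameter values are assumptions chosen for demonstration.
from sklearn.ensemble import GradientBoostingClassifier

gb_tuned = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.05,       # smaller steps make fitting less aggressive
    max_depth=2,              # shallow trees keep each stage weak
    validation_fraction=0.1,  # internal hold-out used for early stopping
    n_iter_no_change=10,      # stop once the validation score stalls
    random_state=42,
)
gb_tuned.fit(X_train, y_train)  # X_train, y_train from the cells above
print("Boosting rounds actually used:", gb_tuned.n_estimators_)

Lowering the learning rate while capping tree depth and enabling early stopping trades some training speed for a model that is less likely to memorize the training data.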
Cross-Validation Scores:
Bagging Models (e.g., Random Forest): Typically show stable cross-validation scores with lower variance. For example, cross-validation may yield an average score of 0.85 with a standard deviation of 0.02.
Boosting Models (e.g., Gradient Boosting): Often achieve higher average cross-validation scores (e.g., 0.88) but may exhibit higher variability depending on the dataset.
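A minimal sketch of how such cross-validation scores can be computed with scikit-learn's cross_val_score (the 5-fold setting, model choices, and reuse of the training split above are assumptions):

# Illustrative sketch only: comparing ensembles with 5-fold cross-validation.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

for name, model in [("Random Forest (bagging)", RandomForestClassifier(random_state=42)),
                    ("Gradient Boosting", GradientBoostingClassifier(random_state=42))]:
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{name}: mean={scores.mean():.2f}, std={scores.std():.2f}")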
Bagging Model Final Test Score: Generally, bagging models maintain consistent test performance, often around 0.83-0.87.
Boosting Model Final Test Score: Boosting models may achieve higher test scores, such as 0.90, but are more sensitive to the data distribution and to model hyperparameters.
Conclusion:
Both bagging and boosting are powerful techniques in ensemble learning, each with its
strengths and weaknesses. Bagging is effective for variance reduction and provides
robustness against noise, making it suitable for unstable models. In contrast, boosting
excels in reducing bias and achieving high accuracy, though it requires careful tuning to
avoid overfitting.
When selecting an ensemble method, it’s crucial to consider the nature of the dataset
and the specific problem at hand.