The document outlines an assessment for a B.Tech (CSE) Data Mining Lab course, focusing on implementing and evaluating classification algorithms (ID3 Decision Tree, CART, and Naïve Bayes) and customer segmentation using clustering algorithms (K-Means, K-Medoids, Hierarchical Clustering). Students are required to use real-world datasets, preprocess the data, and evaluate model performance using various metrics and visualizations. Additionally, a comparative analysis of the clustering algorithms' performance is to be conducted.
VL2024250504566 Ast03
SLOT – L43+L44
SCHOOL OF COMPUTER SCIENCE AND ENGINEERING
ASSESSMENT – III – WINTER SEMESTER 2024-2025
Programme Name & Branch: B.Tech (CSE)
Course Name: Data Mining Lab
Course Code: BCSE208P
Objective: To Implement and Evaluate Classification Algorithms (Decision Tree and Naïve Bayes)

1. Implement and Evaluate Classification Algorithms

Dataset: Use an external dataset of real-world data from Kaggle or the UCI Machine Learning Repository (e.g., customer churn prediction, medical diagnosis, or credit risk classification). The dataset should include both categorical and numerical attributes.

Load and preprocess the dataset. Split the dataset into training and testing sets (80% training, 20% testing).

Implement the following classification algorithms:
• ID3 Decision Tree: Implement the Iterative Dichotomiser 3 (ID3) algorithm using an existing Python library, or from scratch if desired.
• CART Decision Tree: Implement the Classification and Regression Tree (CART) algorithm, which uses the Gini Index for splitting.
• Naïve Bayes Classifier: Implement Gaussian Naïve Bayes or Multinomial Naïve Bayes, depending on the dataset's characteristics.

Evaluate Model Performance:
• Compute accuracy, precision, recall, F1-score, and the confusion matrix for each classifier.
• Visualize the decision trees (ID3 and CART) to analyze how the models make decisions.
• Use ROC-AUC curves to compare classifier performance.

2. Customer Segmentation using Clustering Algorithms

Dataset: Use a combination of datasets to create a comprehensive customer profile. You can source datasets from:
• Kaggle (e.g., customer transaction data, customer demographics)
• UCI Machine Learning Repository (e.g., online retail data)
Combine at least two datasets to enrich the feature set for each customer. Ensure that the combined dataset has a minimum of 10,000 data points.

Objective: Perform customer segmentation to identify distinct groups of customers based on their purchasing behavior, demographics, and other relevant features.

Instructions:

Data Preparation:
• Load and merge the datasets into a single Pandas DataFrame.
• Handle missing values appropriately (e.g., imputation or removal).
• Encode categorical variables using techniques like one-hot encoding or label encoding.
• Scale the numerical features using StandardScaler or MinMaxScaler.

Implement the following clustering algorithms:
• K-Means
• K-Medoids
• Hierarchical Clustering

For each algorithm:
• Determine the optimal number of clusters using an appropriate method, such as the elbow method, the silhouette score, or dendrograms.
• Train the model on the preprocessed dataset using the determined optimal number of clusters.
• Assign each data point to a cluster.
• Visualize the clusters using dimensionality reduction techniques (e.g., PCA or t-SNE) for higher-dimensional data.
• Evaluate the clustering performance using appropriate metrics, such as the silhouette score.

Comparative Analysis: Compare the performance of the three algorithms and visualize the results.
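As an illustration of the K-Medoids and silhouette-score steps listed above, here is a minimal sketch in plain Python. It is not a prescribed solution: the function names `k_medoids` and `silhouette_score` and the four-point `toy_points` data are hypothetical, and the update rule is the simple alternating (assign, then re-pick medoid) variant rather than the full PAM swap search. It precomputes all pairwise distances, so it is only suitable for small samples, not the full 10,000-point dataset.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_medoids(points, k, max_iter=100, seed=0):
    """Alternating K-Medoids. Returns (medoid indices, cluster label per point)."""
    rng = random.Random(seed)
    n = len(points)
    # Pairwise distance matrix: O(n^2) memory, acceptable only for small n.
    dist = [[euclidean(p, q) for q in points] for p in points]

    def assign(medoids):
        # Each point takes the label of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]

    medoids = rng.sample(range(n), k)  # random distinct starting medoids
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # empty cluster: keep the previous medoid
                new_medoids.append(medoids[c])
                continue
            # New medoid: the member with the smallest total distance to the rest.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:  # converged
            break
        medoids = new_medoids
    return medoids, assign(medoids)


def silhouette_score(points, labels):
    """Mean silhouette coefficient; points in singleton clusters contribute 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette score needs at least two clusters")
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(euclidean(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(p, points[j]) for j in other) / len(other)
            for lab, other in clusters.items()
            if lab != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical toy data: two tight, well-separated groups of two points each.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
medoids, labels = k_medoids(toy_points, k=2)
score = silhouette_score(toy_points, labels)
print("medoids:", medoids, "labels:", labels, "silhouette:", round(score, 3))
```

For the actual assignment, the same two functions could be run for several candidate values of k on a scaled sample of the merged dataset, keeping the k with the highest silhouette score. A library implementation would normally replace this sketch once the logic is understood.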