AIML Record Batch 9
Ex. No: 1(a)
Aim
To understand and apply various data curation techniques to clean and prepare medical datasets
for machine learning applications.
Tools Required
• Python 3.x
• Anaconda distribution
• Jupyter Notebook
• Pandas library
• NumPy library
• Scikit-learn library
• Matplotlib/Seaborn for visualization
Algorithm/Procedure
Program:
1. Load the Dataset:
3. Handle Missing Values:
4. Remove Duplicate Records:
5. Convert Categorical Data:
7. Split the Dataset:
8. Performing one hot encoding:
9. Performing interpolation:
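The program output above was recorded as screenshots; a minimal sketch that follows the same steps is given below for reference. The filename medical_data.csv and the target column name 'class' are assumptions.

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# 1. Load the dataset (the filename is an assumption)
df = pd.read_csv('medical_data.csv')

# 3. Handle missing values: treat '?' placeholders as NaN
df = df.replace('?', np.nan)

# 4. Remove duplicate records
df = df.drop_duplicates()

# 5. Convert categorical data to numeric codes (NaN restored after coding)
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype('category').cat.codes.replace(-1, np.nan)

# 8. One-hot encoding is an alternative for nominal columns, e.g.:
# df = pd.get_dummies(df, columns=['appetite'])

# 9. Fill the remaining gaps with linear interpolation
df = df.interpolate(method='linear')

# 7. Split the dataset (assumes the target column is named 'class')
X = df.drop('class', axis=1)
y = df['class']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```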
Result:
The medical dataset was successfully cleaned and prepared for machine learning, resulting in a
dataset with no missing values, no duplicates, converted categorical data, normalized features,
and a proper training-testing split. This curated dataset is now ready for building and evaluating
machine learning models.
Ex. No: 1(b) Date: 09.08.24
Aim
To understand and apply various outlier detection and removal methods to prepare healthcare
data for machine learning applications.
Tools Required
• Python 3.x
• Anaconda distribution
• Jupyter Notebook
• Pandas library
• NumPy library
• Scikit-learn library
• Matplotlib/Seaborn for visualization
Algorithm/Procedure
Load the Dataset
i. Import the necessary libraries.
ii. Load the healthcare dataset into a Pandas DataFrame.
Understand the Dataset
iii. Display the first few rows of the dataset.
iv. Get a summary of the dataset using the info() and describe() methods.
Identify Outliers
v. Use visualization techniques such as box plots and scatter plots to identify outliers.
vi. Use statistical methods such as the Z-score or the IQR (Interquartile Range) method to detect outliers.
Remove Outliers
vii. Based on the identified outliers, decide on a strategy to remove them:
1. Remove outliers using the IQR method.
2. Remove outliers using the Z-score method.
Verify the Data
viii. Re-visualize the data to ensure outliers have been effectively removed.
ix. Summarize the dataset again to confirm the absence of outliers.
Handle Missing Values (if any)
x. Identify and handle any missing values in the dataset using appropriate methods (if applicable).
Normalize/Standardize the Data
xi. Apply normalization or standardization techniques to scale numerical data.
Feature Engineering
xii. Create new features based on existing data.
xiii. Select relevant features for the machine learning model.
Split the Dataset
xiv. Split the dataset into training and testing sets using train_test_split() from
Scikit-learn.
Data Curation Methods
Outlier Detection
xv. Visualization: df.boxplot(column='column_name'), sns.scatterplot(x='column1', y='column2', data=df)
xvi. Z-score Method: df[(np.abs(stats.zscore(df['column_name'])) < 3)]
Outlier Removal
• Z-score Method: Remove data points with a Z-score greater than 3 or less than -3.
• IQR Method: Remove data points outside the range Q1 - 1.5*IQR to Q3 + 1.5*IQR.
Normalization/Standardization
• Normalization: (df - df.min()) / (df.max() - df.min())
• Standardization: (df - df.mean()) / df.std()
Feature Engineering
• Creation: df['new_feature'] = df['feature1'] * df['feature2']
• Selection: Using methods such as correlation matrix or feature importance from
models like RandomForest.
PROGRAM:
1. Load the Dataset:
3. Visualizing using boxplot:
4. Using scatter plot:
5. Calculating Z score:
7. Remove outliers:
8. Creating new features:
9. Normalization and standardization:
Result
Outliers were successfully detected and removed from the healthcare dataset using visualization
techniques, the Z-score method, and the IQR method. The resulting dataset is free of outliers,
normalized, and ready for further machine learning model training and evaluation.
Ex. No: 2 Date: 30.08.24
CHRONIC KIDNEY DISEASE CLASSIFICATION USING SVM
CLASSIFIER
Aim:
To classify Chronic Kidney Disease (CKD) using the Support Vector Machine (SVM) algorithm with Python in a Jupyter notebook.
Tools Required:
• Python 3.x
• Anaconda distribution
• Jupyter Notebook
• Pandas library
• NumPy library
• Scikit-learn library
Program:
Importing libraries:
Loading and preprocessing the datasets:
Identify the missing values:
Convert non-numeric columns to numeric for filling the missing values using interpolation:
Before interpolation:
Performing Interpolation:
After interpolation:
Normalization:
Splitting, Training, and Testing:
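A minimal sketch of the SVM pipeline described by these steps is given below; the filename kidney_disease.csv and the target column name 'class' are assumptions.

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset and treat '?' placeholders as missing values
df = pd.read_csv('kidney_disease.csv').replace('?', np.nan)

# Convert non-numeric columns to numeric codes so interpolation can fill gaps
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype('category').cat.codes.replace(-1, np.nan)

# Fill missing values with linear interpolation (back-fill any leading gaps)
df = df.interpolate(method='linear').bfill()

# Normalize the features; the target column name 'class' is an assumption
X = MinMaxScaler().fit_transform(df.drop('class', axis=1))
y = df['class']

# Split, train, and test the SVM classifier
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = SVC(kernel='rbf')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```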
Result:
Thus, the machine learning model has been trained using the Support Vector Machine algorithm and
Chronic Kidney Disease classification is done.
Ex. No: 3 Date: 06.09.24
CHRONIC KIDNEY DISEASE CLASSIFICATION USING DECISION
TREE CLASSIFIER
Aim:
To classify Chronic Kidney Disease (CKD) using the Decision Tree Classifier algorithm with Python in a Jupyter notebook.
Tools Required:
• Python 3.x
• Anaconda distribution
• Jupyter Notebook
• Pandas library
• NumPy library
• Scikit-learn library
Program:
Importing libraries:
Preprocessing:
Missing values:
Convert non-numeric columns to numeric if possible:
Before interpolation:
Perform interpolation:
After interpolation:
Normalization:
Splitting, training, and testing:
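The pipeline is identical to Experiment 2 except for the classifier; a minimal sketch of the classifier-specific part is shown below, assuming the preprocessed split (X_train, X_test, y_train, y_test) from the Experiment 2 sketch.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# X_train, X_test, y_train, y_test are assumed to come from the same
# preprocessing and split shown in the Experiment 2 sketch
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

y_pred = dt.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```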
Result:
Thus, the machine learning model has been trained using the Decision Tree Classifier algorithm and Chronic Kidney Disease classification is done.
Ex. No: 4 Date: 20.09.24
CHRONIC KIDNEY DISEASE CLASSIFICATION USING RANDOM
FOREST CLASSIFIER
Aim:
To classify Chronic Kidney Disease (CKD) using the Random Forest Classifier algorithm with Python in a Jupyter notebook.
Tools Required:
• Python 3.x
• Anaconda distribution
• Jupyter Notebook
• Pandas library
• NumPy library
• Scikit-learn library
Program:
Importing libraries:
Preprocessing:
Missing values:
Convert non-numeric columns to numeric:
Before interpolation:
Perform interpolation:
After interpolation:
Normalization:
Splitting, training, and testing:
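As in Experiment 3, only the classifier changes; a minimal sketch is given below, again assuming the preprocessed split from the Experiment 2 sketch.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# X_train, X_test, y_train, y_test are assumed to come from the same
# preprocessing and split shown in the Experiment 2 sketch
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```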
Result:
Thus, the machine learning model has been trained using the Random Forest Classifier algorithm and Chronic Kidney Disease classification is done.
Ex. No: 5(a) Date: 18.10.2024
Aim:
To predict the risk of chronic kidney disease in patients based on various health indicators using
Logistic regression.
Tools Required:
• Python 3.x
• Jupyter Notebook or any Python IDE
• Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn libraries
Methodology:
1. Loading and Previewing the Dataset: The code begins by loading the dataset into a
DataFrame and displaying the first few rows. This step helps in understanding the initial
structure and content of the data.
2. Data Preprocessing: Unknown values represented by '?' are replaced with NaN to handle
missing data more effectively. This ensures that the dataset is clean and ready for further
processing.
3. Data Encoding: Categorical variables, such as 'normal' and 'abnormal', are converted into
numeric values. This conversion simplifies the analysis, allowing the use of mathematical and
statistical operations on these features.
4. Display DataFrame Information: The code prints out basic information about the
DataFrame, including the data types of each column and the count of non-null values. This step
is crucial for understanding the composition of the dataset.
7. Identifying Missing Values: The code identifies and counts the number of missing values
in each column, providing an overview of the data's completeness.
9. Display DataFrame Before Interpolation: The current state of the DataFrame, including
the count of missing values, is displayed before performing interpolation. This helps in
comparing the data before and after filling gaps.
10. Interpolation: Linear interpolation is performed to fill in the missing values. This method
estimates missing data points based on surrounding values, creating a continuous dataset.
11. Display DataFrame After Interpolation: The DataFrame and the count of missing values
are displayed again after interpolation to verify that the missing data has been handled.
12. Visualize Final Dataset: A final heatmap is generated to visualize the dataset after
interpolation, ensuring that the data is ready for further analysis or modeling.
Program:
2. Preprocessing
3. Visualize Correlation between features using a heatmap
5. Convert non-numeric columns to numeric if possible
6. Before interpolation
7. Perform interpolation
8. After interpolating
10. Predict on the test set
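Since the Result below discusses Ridge regression alongside linear regression, the sketch uses Ridge; df is assumed to be the interpolated, encoded DataFrame from the steps above, and the target column name 'target' is hypothetical.

```python
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# df is assumed to be the interpolated, encoded DataFrame from the steps
# above; 'target' is a hypothetical name for the outcome column
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit Ridge regression and predict on the test set
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)
print('MSE:', mean_squared_error(y_test, y_pred))
print('R2:', r2_score(y_test, y_pred))

# Cross-validated scores indicate generalizability better than a single split
print('Ridge CV R2:', cross_val_score(Ridge(alpha=1.0), X, y, cv=5,
                                      scoring='r2').mean())
print('Linear CV R2:', cross_val_score(LinearRegression(), X, y, cv=5,
                                       scoring='r2').mean())
```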
Result:
Ridge regression typically performs better in cases where linear regression may overfit,
especially when dealing with multicollinearity. The cross-validated scores give a better
indication of model generalizability.
5b. Predicting Hospital Readmission Rates Using Lasso Regression
Aim:
To predict hospital readmission rates using Lasso regression.
Tools Required:
• Python 3.x
• Jupyter Notebook or any Python IDE
• Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn libraries
Methodology:
1. Data Collection:
2. Data Preprocessing:
• The string pattern representing missing values (`?`) is replaced with `NaN` using
`df.replace()`.
Label Encoding:
• Categorical variables (e.g., `rbc`, `pc`, `pcc`, etc.) are mapped to numerical values using
`map()` to convert categorical data into a numeric format required for model training.
Summary Statistics:
• `df.describe()` is used to calculate basic statistical measures (mean, std, min, max, etc.)
for the numeric columns in the dataset.
Missing Values:
• `df.isnull().sum()` is used to count the number of missing values per column, providing
insight into the extent of missing data.
Interpolation:
• Missing values are filled using linear interpolation with `df.interpolate()`, which estimates gaps from surrounding values.
5. Model Training and Evaluation:
• While not explicitly shown in the provided code, the final step typically involves
splitting the data into training and testing sets, standardizing the data, and applying
machine learning models (e.g., Lasso regression) to predict outcomes and evaluate the
model’s performance using metrics like `mean_squared_error` and `r2_score`.
Program:
2. Preprocessing
3. Visualize correlations between features using a heatmap
6. Before interpolation
7. Perform interpolation
8. After interpolating
9. Convert categorical variables using Label Encoding or One-Hot Encoding
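A minimal sketch of the Lasso step is given below; df is assumed to be the interpolated, encoded DataFrame from the steps above, and the target column name 'readmission_rate' is hypothetical.

```python
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# df is assumed to be the interpolated, encoded DataFrame from the steps
# above; 'readmission_rate' is a hypothetical name for the target column
X = df.drop('readmission_rate', axis=1)
y = df['readmission_rate']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize the features, since Lasso's L1 penalty is scale-sensitive
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Fit Lasso; the L1 penalty drives some coefficients exactly to zero,
# performing feature selection as noted in the Result
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)
print('MSE:', mean_squared_error(y_test, y_pred))
print('R2:', r2_score(y_test, y_pred))

# Coefficients driven to zero mark features the model discarded
print(dict(zip(X.columns, lasso.coef_)))
```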
Result:
Lasso regression not only predicts the readmission rates but also helps in feature selection by
driving some coefficients to zero, which simplifies the model and may enhance interpretability.
MINI PROJECT
Enhancing Early Detection of Chronic Kidney Disease
Using Machine Learning Algorithms: A Comparative
Analysis
ABSTRACT
Chronic Kidney Disease (CKD) affects 12-14% of the global population, with around 30 million cases
in the US contributing to over $32 billion in healthcare costs annually. Early detection is crucial as
CKD can lead to end-stage renal disease (ESRD) without timely treatment. This study analyses and compares seven machine learning (ML) algorithms—Decision Tree, Random Forest, AdaBoost, K-Nearest Neighbors (KNN), Gradient Boosting, XGBoost, and Bootstrap Aggregating—on a dataset of 400 patients with 25 attributes. Performance was evaluated based on accuracy, precision, recall, and F1 score. The top three models—Random Forest, KNN, and Bagging—were combined to form a new ensemble model. The ensemble model achieved 100% accuracy, outperforming individual classifiers.
It demonstrated superior robustness, eliminating both false positives and false negatives across the test
set, as confirmed by the confusion matrix. While Random Forest alone achieved high performance, the
ensemble approach improved overall consistency and reduced the risk of misclassification, making it
a more reliable choice for clinical implementation. The study shows that ML can improve early CKD
diagnosis, offering personalized treatment and reducing misdiagnosis risks, making these models
promising for clinical integration.
Keywords: Chronic Kidney Disease, Machine Learning, Random Forest, Decision Tree, AdaBoost,
K-Nearest Neighbors (KNN), Gradient Boosting, XGBoost, and Bootstrap Aggregating, CKD
Diagnosis.
INTRODUCTION
Chronic Kidney Disease (CKD) refers to the progressive and irreversible damage to the kidneys that
prevents them from effectively filtering blood. CKD is usually classified into five stages, based on the rate at which kidney function is diminishing, ranging from mild kidney impairment to end-stage renal disease (ESRD), which requires dialysis or a kidney transplant. The United States alone has seen almost 30 million adult diagnoses, making it a global health risk that affects 12%–14% of the population. CKD accounts for a significant portion of healthcare costs in the US, with an annual national expenditure of about $32 billion. Primary and secondary forms of chronic kidney disease exist. While primary CKD directly affects the kidneys, secondary CKD is brought on by other conditions, primarily diabetes and hypertension. Fig 1 shows the causes of CKD and their respective percentages; beyond these main causes, smoking, obesity, high alcohol consumption, and a family history of renal illness are other risk factors. Renal failure can be the ultimate outcome of untreated high blood
pressure and diabetes, which gradually impair the kidneys' filtration system.
Fig 1. Causes of chronic kidney disease (CKD) and their respective percentages
CKD is sometimes referred to as a "silent" disease because it progresses gradually and may not exhibit symptoms in its initial stages. When symptoms do appear, they may consist of changed urination patterns, leg oedema, fatigue, elevated blood pressure, and breathing difficulties. With worsening CKD, patients may experience increasingly severe symptoms such as nausea, vomiting, loss of appetite, and cardiovascular difficulties. Both invasive and non-invasive diagnostic techniques are available. An
early diagnosis is essential for the treatment of CKD.
Non-invasive methods: Serum creatinine levels in the blood and proteinuria in the urine are measured, and these tests are commonly used to detect CKD early. CT and ultrasound examinations are two imaging modalities that can be used to assess the shape and function of the kidneys.
Invasive method: In advanced cases of CKD, an invasive kidney biopsy procedure may be required to determine the underlying cause of the condition.
Treatment for CKD aims to manage symptoms, prevent complications, and limit the progression
of the disease. Blood pressure medication, blood sugar management, lifestyle changes, and other risk
factor management strategies are all part of the treatment for early-stage CKD. Treatment options for
CKD that advances to ESRD include dialysis and kidney transplantation. AI greatly enhances CKD
diagnosis and treatment, particularly when combined with machine learning (ML) techniques. By
analysing vast amounts of patient data, AI can help with the early detection of CKD, improve the
accuracy of diagnoses, and predict the course of the disease. Medical datasets have yielded insightful
information through the deployment of machine learning (ML) models, improving the capacity to detect
patients who are at risk and give customised treatment plans. AI approaches are also being developed
to predict the onset of CKD and the effectiveness of different treatments, based on the patient's medical
history.
This study uses an empirical analysis of seven different machine learning techniques on this
dataset to determine the most effective method for early detection of CKD: Decision Tree, Random
Forest, AdaBoost, K-Nearest Neighbours (KNN), Gradient Boosting, XGBoost, and Bootstrap
Aggregating. Accuracy, precision, recall, and F1 score are used to evaluate the procedures. This work uses the CKD dataset from the UCI Machine Learning Repository, consisting of 400 patient records with 24 attributes. The dataset, which is divided into numerical and categorical variables, includes 250 entries classified as "ckd" and 150 entries classified as "notckd". Age, blood pressure, blood glucose levels, and other important clinical parameters are included. It should be noted that the dataset includes missing values, which complicates the research and calls for careful handling methods.
LITERATURE SURVEY
Chittora et al.'s goal in 2021 was to use deep learning (DL) and machine learning (ML) approaches
to predict chronic kidney disease (CKD). They made use of a 400-instance dataset with 24 attributes
that was taken from the UCI repository. Artificial Neural Network (ANN), C5.0, Logistic Regression,
Chi-square Automatic Interaction Detector (CHAID), Linear Support Vector Machine (LSVM),
Random Tree, and K-Nearest Neighbours (KNN) are the seven machine learning techniques that were
used. To improve performance, feature selection methods like LASSO, Wrapper approach, and
Correlation-based Feature Selection (CFS) were used. SMOTE was used to balance the dataset.
According to the results, the accuracy of 98.86% was achieved by LSVM with penalty L2 and SMOTE
on full features. However, a deep neural network model produced the best accuracy of 99.6%. The
results of evaluation metrics such as area under the curve (AUC), precision, recall, and F-measure
showed that deep learning models were superior to machine learning (ML) models in CKD prediction.
[7]
Ventrella et al. (2021) intended to facilitate individualised care and strategic treatment planning
by forecasting the period that a patient with CKD will need dialysis. A supervised machine learning
strategy was used to build a computational model, and the efficacy of several techniques was examined.
The data that was taken from Vimercate Hospital's Electronic Medical Records was used to train the
model, which included red blood cell count, urea, creatinine level, and eGFR trends. With a 94% test
accuracy, 91% specificity, and 96% sensitivity, the ultimate model that was suggested was built on
Extremely Randomised Trees classifiers. Predictions with a granularity of up to six months were
possible thanks to the model's stable performance even at shorter time intervals. With the help of this
method, nephrologists may now forecast a patient's clinical course with a great deal of help from
predictive modelling. The study enables enhanced resource management and personalised care in
healthcare settings by fusing the model's promising outcomes with the experience of doctors. These
developments highlight the potential of machine learning to improve CKD management decision-
making procedures.[4]
In the article by Emon et al. (2021), the aim was to investigate the role of data mining and
machine learning in predicting chronic kidney disease (CKD). According to the report, machine
learning methods are being used more often to identify serious health hazards including brain tumours
and diabetes. Because CKD compromised the kidneys' capacity to efficiently filter waste, it posed a
major risk to health. Confusion matrix analysis was used by the authors to assess the effectiveness of
different classifiers, highlighting the significance of precisely projecting both positive and negative
outcomes. 99% accuracy was the greatest achieved by the Random Forest classifier, outperforming
other approaches. Comparatively, classifiers using Multilayer Perceptron (MLP), Stochastic Gradient
Descent (SGD), and Decision Tree achieved 98% accuracy. With accuracies of 95% and 96%,
respectively, Naive Bayes and Logistic Regression did less well. The research emphasised that the risks of cardiovascular problems and end-stage renal disease (ESRD) were markedly increased by chronic kidney disease (CKD). Early detection and monitoring were therefore considered essential. All things
considered, of the models that were assessed, the Random Forest classifier yielded the highest ROC
value in addition to the best accuracy.[6]
In the study by Khan et al. (2020), CKD was defined as a condition where kidneys gradually lose
their ability to filter blood, leading to waste accumulation. The authors discussed different machine
learning (ML) methods with an emphasis on the importance of early detection. The study employed
seven machine learning algorithms: Naïve Bayes, J48, NBTree, Support Vector Machine (SVM),
Logistic Regression (LR), Multi-layer Perceptron (MLP), and Composite Hypercube on Iterated
Random Projection (CHIRP). Metrics for evaluation included accuracy and mean absolute error
(MAE). Their efficacy was demonstrated by the MAE values of 0.015 for SVM and 0.0025 for CHIRP,
according to the data. Accuracy values of 99.75% for CHIRP and 98.25% for SVM were obtained. The study found that CHIRP greatly lowered error rates while raising CKD diagnosis precision.[2]
Comparative analyses by Qin et al. (2019) showed that Random Forest outperformed classifiers
like logistic regression and support vector machines in the diagnosis of CKD. This finding demonstrates
the efficiency of ensemble approaches in managing complicated datasets. To enhance CKD diagnosis,
machine learning (ML) approaches have been used more and more in recent studies. For example,
Hussain et al. (2021) used Random Forest to obtain 99.75% accuracy after imputation of missing data
using K-Nearest Neighbours (KNN). Patel et al. (2020) showed that KNN successfully preserved
variable associations, underscoring the significance of robust imputation techniques. Furthermore, Lee
et al. (2023) presented an integrated model with an accuracy of 99.83% that combined Random Forest
with logistic regression. These developments show that ML could help nephrologists make wise
decisions. Still, there are issues including poor data quality and the requirement for thorough validation.
Enhancing these algorithms for wider clinical use should be the main goal of future research.[1]
In the systematic review by Sanmarchi et al. (2023), the authors aimed to assess the deployment of
artificial intelligence (AI) and machine learning (ML) techniques in predicting, diagnosing, and treating
CKD. Using the PRISMA technique, 16 variables in total were extracted, including demographics,
study goals, sample size, and performance indicators. 68 of the 648 studies met the requirements for
inclusion. While the models under consideration showed encouraging performance, direct comparison
was difficult due to the disparities in the metrics. Prognosis prediction was the focus, with diagnosis
receiving less attention. Six of the investigations were conducted in clinical settings, and the authors
stated that the majority lacked varied population testing and generalisability. The study's
conclusion was that although machine learning has potential for managing chronic kidney disease
(CKD), more research is needed to improve the interpretability, generalisability, and fairness of the
models before they can be used in routine clinical settings.
METHODOLOGY
DATASET SELECTION:
This study uses the CKD dataset, which was obtained from the UCI Machine Learning Repository, collected from a hospital and donated by Soundarapandian et al. on 3rd July, 2015. The dataset contains 400 samples. In this CKD dataset, each sample has 24 predictive variables or features (11 numerical variables and 13 categorical (nominal) variables) and a categorical response variable (class). The class variable has two values, namely, ckd (sample with CKD) and notckd (sample without CKD). Of the 400 samples, 250 belong to the category of ckd, whereas 150 belong to the category of notckd. It is worth mentioning that the dataset contained a large number of missing values.
FLOW DIAGRAM:
DATA PREPROCESSING:
The dataset has missing values in several columns, such as age, blood_pressure, and
specific_gravity. To handle these, interpolation is applied to replace NaN values, smoothing the data
by estimating the missing entries based on other values in the dataset.
Encoding Categorical Values:
For categorical variables, label encoding was applied to convert them into binary numeric
representations. Specifically, values for red blood cells, pus cell, pus cell clumps, bacteria,
hypertension, coronary artery disease, pedal edema, anaemia, diabetes mellitus, appetite, and the target variable class were mapped to 0 and 1, with 0 representing the negative class (e.g., no, normal, notpresent, poor) and 1 representing the positive class (e.g., yes, abnormal, present, good). The class
column, indicating the presence of chronic kidney disease (CKD), was also encoded, where 0
corresponded to notckd and 1 to ckd.
Scaling Features:
The data set features have been scaled to a uniform range to ensure that each feature contributes
equally to the performance of the machine learning algorithms used. This process is particularly critical
for algorithms sensitive to the scale of the input data, such as K-Nearest Neighbors and Gradient
Boosting. To begin, the target variable was separated from the feature set. This separation allows for
the normalization process to focus exclusively on the feature data, which is essential for effective model
training. Normalization was achieved using the MinMaxScaler, a widely used method that transforms
features to a specified range, typically between 0 and 1. This technique adjusts the values of each
feature based on their minimum and maximum values, ensuring that all features are scaled
proportionally.
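A minimal sketch of the encoding and scaling described above is given below, assuming df holds the loaded CKD dataset; the column names and value spellings follow the UCI CKD dataset and are assumptions.

```python
from sklearn.preprocessing import MinMaxScaler

# Map the binary categorical columns to 0/1 as described above; df is
# assumed to hold the loaded CKD dataset (column names are assumptions)
binary_maps = {
    'rbc': {'normal': 0, 'abnormal': 1},          # red blood cells
    'htn': {'no': 0, 'yes': 1},                   # hypertension
    'appet': {'poor': 0, 'good': 1},              # appetite
    'classification': {'notckd': 0, 'ckd': 1},    # target class
}
for col, mapping in binary_maps.items():
    df[col] = df[col].map(mapping)

# Separate the target, then scale the features to [0, 1] with MinMaxScaler
X = df.drop('classification', axis=1)
y = df['classification']
X_scaled = MinMaxScaler().fit_transform(X)
```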
The number of non-null entries for each feature is determined by looking at the structure and data types
of each column. Several columns have missing values, according to this preliminary examination,
which calls for additional cleaning. Missing information is detected in columns such as red_blood_cell_count, specific_gravity, and blood_pressure. Interpolation and possibly additional
imputation techniques are used to address them in order to preserve the dataset's integrity for analysis
and model training. Measures such as mean, median, minimum, and maximum are among the
numerical columns for which summary statistics are computed. These statistics aid in determining
outliers and possible abnormalities within specific features as well as in comprehending the distribution
of the data. The distribution of features like blood_glucose_random, hemoglobin, and blood_urea may
be seen using Kernel Density Estimation (KDE) charts. These plots aid in assessing normality, skewness, or the existence of several modes by illuminating the distribution of data points across
values.
Model Selection:
Seven distinctive machine learning algorithms were employed in this study: Random Forest, K-
Nearest Neighbors (KNN), AdaBoost, Gradient Boosting, XGBoost, Bootstrap Aggregating (Bagging),
and Decision Tree. The classification models are evaluated based on metrics such as accuracy, precision, recall, and F1 score.
CHOICE OF MODELS:
K-Nearest Neighbors (KNN): Specifically designed for classification jobs where comparable data
points cluster closely together in feature space, KNN is a straightforward, non-parametric technique
that performs well in smaller datasets. Its strength is in its capacity to predict based on the distance
(similarity) of nearby data points, which makes it very easy to use and interpret. However, because it
computes the distance to each point during prediction, it is computationally costly on large datasets.
Nevertheless, because of its ease of use and straightforward categorization methodology, KNN offers
a useful baseline against which to compare.
Random Forest Classifier: Several decision trees are combined in the Random Forest ensemble
approach to increase accuracy and decrease overfitting. High resilience and generalization are achieved
by this approach, which generates each tree on a random portion of the input and bases final predictions
on the majority vote from all trees. Moreover, Random Forest has the benefit of feature importance measures, which draw attention to the features that have the biggest impact on the model's predictions.
Large forests can still be resource-intensive even though they are computationally efficient when
compared to other ensemble approaches. However, Random Forest is a strong option for challenging
classification problems due to its capacity to capture non-linear correlations.
AdaBoost Classifier: AdaBoost is a boosting method that generates weak models one after the other,
fixing mistakes in each model as it goes. AdaBoost's performance is enhanced by this iterative method,
which helps it concentrate on situations that are challenging to categorize. The accuracy of AdaBoost
on balanced datasets and its ability to use weak learners like shallow decision trees to create a powerful
classifier are its main advantages. However, because the method emphasizes incorrectly classified
samples, it may be sensitive to noisy data and outliers. AdaBoost is a desirable option when seeking a
high level of classification accuracy because of its adaptability and emphasis on difficult cases.
Gradient Boosting Classifier: Gradient Boosting produces a series of models one after the other, fixing
the mistakes of the models that came before it. Gradient Boosting is well-known for its excellent
accuracy and works especially well when dealing with intricate, non-linear relationships in the data.
This model is adaptable and uses gradient descent to optimize each learning step, capturing complex
patterns. Gradient Boosting can have high computing requirements, particularly when using deeper
trees and more boosting rounds, which could result in overfitting if not properly adjusted. It works well
with datasets that have intricate, subtle interactions because it can improve on past mistakes.
XGBoost: Regularization is built into XGBoost, an enhanced variant of gradient boosting that improves
accuracy and robustness. Because of its scalability and optimization methods, XGBoost, which is
designed for high efficiency, is frequently utilized in competitions. In order to improve generalization
on unknown data and avoid overfitting, it permits L1 and L2 regularization. However, XGBoost is
computationally demanding for large datasets and its wide range of hyperparameters might complicate
tweaking. Because of its effective learning and regularization, XGBoost is a great option for attaining
high accuracy, particularly in intricate, high-dimensional datasets.
Decision Tree Classifier: Decision trees are straightforward, comprehensible models that divide data
into branches according to feature values in order to classify the data. This model offers a clear, visual
depiction of decision-making and performs well with both continuous and categorical variables.
Decision trees are a useful baseline because of their simplicity and interpretability, despite their
propensity for overfitting. In ensemble approaches like Random Forest and AdaBoost, where their
drawbacks (such as large variance) can be lessened, they are frequently utilized as base models. Decision
trees provide a basic model for this topic that can be compared to more intricate algorithms.
Bootstrap Aggregation (Bagging): By sampling with replacement, training a model (often a Decision
Tree) on each subset, and aggregating the results, the ensemble technique known as bagging generates
several subsets of the dataset. By stabilizing predictions and lowering variance, this method enhances
model generalization. Decision trees and other high-variance models benefit greatly from bagging,
which reduces their sensitivity to changes in the data. However, if the base model is fundamentally
flawed or there is a lack of data, bagging might not yield a discernible improvement in performance.
This model contributes to accuracy by utilizing the stability that bagging offers, which makes it an
important component of the ensemble selection for this project.
PARAMETER TUNING:
AdaBoost will employ up to 50 weak learners in the ensemble in a sequential manner since the
number of estimators (n_estimators) is set to 50 in addition to the base estimator. The model's ability
to learn depends on this value: too few estimators can result in underfitting, where the model is unable
to capture the complexity of the data, while too many estimators can lengthen computation times and
cause overfitting, where the model is overfit to the training data. Setting it to 50 strikes a balance
between computational economy and performance, enabling the model to make good corrections on
its own without requiring an excessive amount of training time.
To improve performance, a number of crucial parameters of the Random Forest classifier were
adjusted. "Entropy," which is used to assess each split according to information gain, was chosen as
the criterion parameter. This criterion is chosen based on which produces the optimal splits based on
data distribution and is frequently compared to "gini" in tree-based models. The model learns to select
splits that best distinguish across classes by optimizing information acquisition, which could increase
accuracy.
The maximum depth that each tree can reach is determined by the max_depth option, which is
set to 11. By limiting each tree's depth, overfitting is less likely to occur because each tree isn't exposed to too much information in the training set. With the max_features parameter set to "sqrt," each tree
will take into account a subset of features at each split, which is equal to the square root of the total
number of features. By increasing the ensemble's variety and randomness, this lowers tree correlations
and strengthens the model's resistance to overfitting.
Additionally, to guarantee that every leaf node has a minimum of two samples, the
min_samples_leaf parameter is set to 2. By keeping leaf nodes from having too few samples, which
could result in overfitting, this constraint smoothes predictions. In a similar manner, each node must
contain at least three samples before it may split because the min_samples_split parameter is set to 3.
By doing this, extremely tiny nodes that can cause noise and instability in forecasts are avoided. Last
but not least, the forest has 130 trees since the n_estimators parameter is set to 130. By averaging across
more models, adding trees typically improves performance, but it also requires more processing power.
Setting it to 130 achieves a balance between maintaining reasonable processing costs and offering
enough trees for a reliable and accurate prediction.
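A sketch of the classifier configurations implied by the parameters above is given below; the random_state values are assumptions added for reproducibility.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# AdaBoost with up to 50 weak learners, as described above
ada = AdaBoostClassifier(n_estimators=50, random_state=42)

# Random Forest with the tuned parameters described above
rf = RandomForestClassifier(
    criterion='entropy',   # evaluate splits by information gain
    max_depth=11,          # cap tree depth to curb overfitting
    max_features='sqrt',   # consider sqrt(n_features) at each split
    min_samples_leaf=2,    # every leaf keeps at least two samples
    min_samples_split=3,   # a node needs three samples before it may split
    n_estimators=130,      # 130 trees in the forest
    random_state=42,       # assumed seed for reproducibility
)
```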
IMPLEMENTATION
PYTHON ENVIRONMENT SETUP:
All the algorithms were implemented in Jupyter Notebook (version 7.2.1). The packages used included
NumPy (version 1.24.3), Pandas (version 2.0.3), Matplotlib (version 3.7.1), Seaborn (version 0.12.2),
Plotly (version 5.15.0), scikit-learn (version 1.2.2), and XGBoost (version 1.7.5). These libraries
enabled data manipulation, visualization, preprocessing, and the application of various machine
learning algorithms for the analysis of CKD.
The data is loaded from a CSV file, a popular format for storing structured data. In order to facilitate manipulation and exploration, this step imports the dataset into a DataFrame. Certain
undesirable characters, such as spaces and question marks, are eliminated after the data has been
loaded. Eliminating these characters helps clean up the data because they occasionally serve as stand-ins for errors or missing information. This guarantees that columns are appropriately interpreted as numerical or categorical rather than as strings containing unexpected symbols.
The dataset was divided into training and testing subsets to assess how well the machine
learning models perform. This division allows for training on one portion of the data while testing on
a separate, unseen portion, providing an unbiased assessment of predictive capabilities. The normalized
feature set was separated from the target variable, labelled as "Outcome," which indicates the presence
or absence of CKD. The dataset was split using a 90/10 ratio, with 90% for training and 10% for testing.
A random seed was applied to ensure reproducibility, allowing consistent comparisons of model
performance across different runs. The classification of chronic renal disease was then accomplished by
training a range of machine learning algorithms. KNN, Random Forest, AdaBoost, Gradient Boosting,
XGBoost, LightGBM, Decision Tree, and Bagging classifiers were among these models. To capitalize
on the model's advantages for this specific classification task, each model was started with a set of
hyperparameters. The models' predictions were then guided by the patterns they discovered in the
training data. Each algorithm is able to anticipate whether fresh instances belong to the target class
through this training process, which enables it to modify its internal parameters based on the labeled
examples that are provided.
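A minimal sketch of the split described above is given below; df_scaled is a hypothetical name for the normalized DataFrame, and the seed value is an assumption.

```python
from sklearn.model_selection import train_test_split

# Separate the normalized features from the target column, labelled 'Outcome';
# df_scaled is a hypothetical name for the normalized DataFrame
X = df_scaled.drop('Outcome', axis=1)
y = df_scaled['Outcome']

# 90/10 split with a fixed seed (seed value assumed) for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42)
```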
The testing phase evaluates each model's ability to generalize to the test data that hasn't been
seen yet after it has been trained. Before conducting more analysis, testing on this separate dataset
enables a preliminary understanding of each model's performance. This methodical approach to model training and testing prepared each algorithm for a thorough performance study.
MODEL EVALUATION:
Model evaluation was a thorough procedure that used a number of indicators to assess how well each
algorithm classified chronic renal disease. The main criterion employed was accuracy, which provides
a clear indication of each model's capacity for accurate prediction. Accuracy scores, which were
computed on both the training and test sets, gave information about how effectively each model
generalized and assisted in identifying possible overfitting in cases when the model's performance on
the training data was notably superior to that on the test data. To further explore the kinds of predictions
given, confusion matrices were created for some models, such as Bagging and LightGBM. A more
detailed examination of errors was made possible by these matrices' true positive, true negative, false
positive, and false negative counts. Understanding the advantages and disadvantages of each model
across many classes was made easier by this breakdown, which also served as the basis for computing
other measures including precision, recall, and F1-score. Recall assessed each model's capacity to
detect every real positive case, whereas precision evaluated how well it detected true positive instances
out of all predicted positive cases. A balanced evaluation of each model's performance was provided
by the F1-score, which is a harmonic mean of precision and recall and is particularly helpful in
addressing class imbalances.
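A sketch of the evaluation procedure described above, assuming clf is any of the fitted classifiers and the split from the previous section:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Compare accuracy on both splits to spot overfitting
print('train accuracy:', accuracy_score(y_train, clf.predict(X_train)))
print('test accuracy:', accuracy_score(y_test, clf.predict(X_test)))

# Confusion matrix: true/false positive and negative counts on the test set
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))

# Per-class precision, recall, and F1-score
print(classification_report(y_test, y_pred))
```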
RESULTS AND DISCUSSION
MODEL PERFORMANCE:
A. KNN
The KNN model showed impressive performance, with a testing accuracy of 97.5% and a training accuracy of 99.17%. The excellent accuracy on both the test and training datasets indicates that the model has successfully learned from the data and does not overfit.
Fig. 3.1 depicts the confusion matrix which demonstrates that KNN accurately anticipated almost all
instances in terms of classification with only one false negative (classifying a CKD case as non-CKD).
While the recall for the same class (0.96) reveals that only one CKD case was missed, the precision for
the CKD class (1.0) suggests that every CKD prediction was accurate.
The model's precision for the non-CKD class was 0.93, and its recall was perfect (1.00), indicating that all non-CKD cases were correctly identified. The F1-scores demonstrate the model's balanced
performance in both classes, with 0.97 for non-CKD and 0.98 for CKD as shown in Fig. 3.2. These are
remarkably high values. In general, KNN seems to do quite well in handling this classification task,
which makes it a dependable option for CKD prediction with low misclassification rates.
B. Decision Tree
The Decision Tree Classifier fully retained the training data, as evidenced by its flawless training
accuracy of 100%. However, the testing accuracy drops to 95%, suggesting some overfitting as the
model struggles to generalize as well as it memorized the training data.
Fig. 4.1 reveals that the model made two misclassifications: 1 false positive (incorrectly predicting
a non-CKD case as CKD) and 1 false negative (failing to predict a CKD case).
The precision for CKD (1.0) indicates that all predicted CKD cases were correct, while recall (0.96)
shows that one CKD case was missed. The precision and recall for the non-CKD class are both 0.93,
showing that while the model performed well, it made one mistake in identifying non-CKD cases as
shown in Fig. 4.2. Both classes have strong F1-scores (0.93 for non-CKD and 0.96 for CKD), which
demonstrate the model's respectable trade-off between recall and precision. While the Decision Tree
classifier performs well on this dataset, the perfect training accuracy paired with lower test accuracy
indicates a tendency to overfit, making it less appropriate for more complicated or variable datasets.
C. Bagging
The Bagging Classifier obtained a flawless training accuracy of 100%, suggesting that it learnt the
training data thoroughly. However, the testing accuracy of 97.5% indicates that it does not suffer from
overfitting and has good generalization capabilities despite the perfect training accuracy.
Fig. 5.1 reveals that there was just one mistake made by the model—a single false negative in which a
case with CKD was incorrectly identified as non-CKD. This suggests that the model predicts both
classes quite well. The recall of 0.96 indicates that one CKD instance was missed, while the precision
of 1.00 for the CKD class indicates that all projected CKD cases were accurate.
Fig. 5.2: Classification Report for Bagging Classifier
With a precision of 0.93 and a flawless recall of 1.00 for the non-CKD class, all non-CKD cases
were properly identified. F1-scores of 0.97 for non-CKD and 0.98 for CKD demonstrate the model's
stable and well-balanced performance in both classes as shown in Fig. 5.2. Because it is an ensemble
model, the Bagging Classifier combines excellent accuracy and stability with minimum mistakes and
strong prediction power, making it an extremely dependable model for CKD classification.
D. Random Forest
The Random Forest Classifier achieved a flawless training accuracy of 100%, which is matched by
a flawless testing accuracy of 100%. This suggests that the model not only learnt the training set
flawlessly, but also showed no evidence of overfitting when it generalized to the test set.
Fig. 6.1 verifies that all CKD and non-CKD instances were accurately identified by the model,
indicating that no misclassifications were made.
Fig. 6.2: Classification Report for Random Forest Classifier
For both the Non-CKD and CKD classes, precision and recall are 1.00 as shown in Fig. 6.2,
indicating that the model was able to accurately predict every case with no false positives or false
negatives. The model performs flawlessly, as seen by the F1-scores, which are all 1.00.
E. Adaboost
The AdaBoost Classifier demonstrated complete learning of the training dataset by achieving a
flawless training accuracy of 100%. Even though it generalizes well, its testing accuracy of 97.5%
indicates that there might be some space for improvement in terms of how well it predicts data that
hasn't yet been observed.
Fig. 7.1 shows that the model made only one misclassification, with one CKD case incorrectly
predicted as non-CKD. This highlights the model's overall effectiveness in distinguishing between the
two classes.
Fig: 7.2. Classification Report for Adaboost Classifier
The model's results for the CKD class show a precision of 1.00, which means that all CKD
predictions were accurate, and a recall of 0.96 as shown in Fig. 7.2, which means that one CKD case
was missed. The precision for the non-CKD class is 0.93, and the recall is perfect at 1.00, meaning that
every non-CKD example was accurately identified. With scores of 0.97 for non-CKD and 0.98 for
CKD, the F1-scores show a good balance in performance.
F. Gradient Boosting
The Gradient Boosting Classifier achieved a perfect training accuracy of 100%, with a testing
accuracy of 97.5%. The model has successfully learned the underlying patterns in the training data, as
demonstrated by its high training accuracy, and its strong testing accuracy shows that it generalizes
well to new data without severe overfitting.
Fig. 8.1 shows that the model made only one misclassification, with one CKD case incorrectly
predicted as non-CKD. Performance measures show that all projected CKD instances were correct,
with a precision of 1.00 for the CKD class.
With a recall of 0.96 for this class, one case of CKD was overlooked. The model has a precision of
0.93 and a perfect recall of 1.00 for the non-CKD class, meaning that every non-CKD example was
properly classified. With scores of 0.97 for non-CKD and 0.98 for CKD, the F1-scores shown in Fig.
8.2 demonstrate a well-balanced performance and show that the model can manage both classes.
G. XGBoost
The XGBoost Classifier achieved a perfect training accuracy of 100%, indicating that the model has
thoroughly learned the training dataset. Its test accuracy is 97.5%, indicating good generalization
capabilities without discernible overfitting.
Fig. 9.1 shows that there was just one misclassification by the model, which misclassified one case
of CKD as non-CKD. The precision for the CKD class is 1.00, meaning that all instances predicted as
CKD were indeed correct.
The recall for this class is 0.96, showing that one CKD case was missed. For the non-CKD class,
the model maintains a precision of 0.93, with perfect recall (1.00), indicating that all non-CKD cases
were accurately identified. The F1-scores reflect a strong balance in performance, with scores of 0.97
for non-CKD and 0.98 for CKD as shown in Fig. 9.2, showcasing the model's effectiveness in managing
both classes.
H. Ensemble Model
The Ensemble Classifier achieved a perfect training accuracy of 100%, indicating that it has fully
learned the training dataset.
Fig. 10.1: Confusion Matrix for Ensemble Model
Its testing accuracy also stands at 100%, demonstrating the model's exceptional ability to generalize
effectively to unseen data without any signs of overfitting. The classification report shown in Fig. 10.2,
indicates a perfect precision and recall of 1.00 for both classes, reflecting that all predictions made by
the model were accurate.
This means the model successfully identified all CKD cases without any false positives or false
negatives as shown in Fig. 10.1. The F1-scores for both classes are also perfect at 1.00, showcasing a
flawless balance between precision and recall. Overall, the Ensemble Classifier demonstrates
outstanding predictive capabilities, effectively combining the strengths of its individual base models.
This exemplary performance underscores the power of ensemble learning in achieving high accuracy
and reliability in the classification of CKD.
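The text does not state how the three top models were combined; a hard-voting combination is one plausible sketch, reusing the tuned Random Forest parameters from the methodology (other settings are assumptions).

```python
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.neighbors import KNeighborsClassifier

# Combine the top three models; hard voting predicts the majority class label
ensemble = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(criterion='entropy', max_depth=11,
                                      max_features='sqrt', min_samples_leaf=2,
                                      min_samples_split=3, n_estimators=130,
                                      random_state=42)),
        ('knn', KNeighborsClassifier()),
        ('bag', BaggingClassifier(random_state=42)),
    ],
    voting='hard',
)
ensemble.fit(X_train, y_train)
print('test accuracy:', ensemble.score(X_test, y_test))
```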
Comparative Model Analysis:
With a 100% accuracy rate, the ensemble model outperforms individual classifiers by guaranteeing a high degree of reliability, highlighting its potential for clinical use. The ensemble model exhibits robustness, which is essential in a clinical situation, by eliminating both false positives and false negatives on the test set, as indicated by the confusion matrix. While the Random Forest model also performed well, the ensemble technique improved consistency and significantly reduced the likelihood of misclassification, making it more reliable for use in medicine.
This study demonstrates how machine learning might improve CKD early detection, reducing
the chance of diagnostic errors and enabling prompt, individualized treatment. For clinical integration,
where precise detection might have a direct influence on patient outcomes, such accuracy and
dependability are crucial. These models have the potential to change CKD management and patient
care in clinical practice by lowering the likelihood of misdiagnosis and assisting clinicians in making
better decisions.
CONCLUSION
The study demonstrates the effectiveness of machine learning models in diagnosing CKD. Among the models tested, the Ensemble Model performed exceptionally well, achieving 100% accuracy and demonstrating its ability to minimize false positives and negatives. K-Nearest Neighbors (KNN) and Bagging also
showed strong results, with a 97.5% accuracy rate. These models effectively handle both categorical
and numerical data, which is critical when working with medical datasets that contain diverse patient
information. While the models performed well on the current dataset, future research should
focus on incorporating larger and more diverse datasets to enhance model robustness and reliability.
The use of real-time patient data, such as from wearables, could further improve early detection and
treatment outcomes. This study demonstrates the potential of machine learning algorithms, particularly
Random Forest, in the early diagnosis of CKD. With high accuracy rates, these models can support
clinical decision-making, improving the accuracy of diagnoses and enabling more personalized patient
care. The ability to predict CKD progress through machine learning provides a promising tool to reduce
the burden of late kidney disease and optimize treatment strategies in health care settings.
REFERENCES
[1] J. Qin, L. Chen, Y. Liu, C. Liu, C. Feng and B. Chen, "A Machine Learning Methodology for
Diagnosing Chronic Kidney Disease," in IEEE Access, vol. 8, pp. 20991-21002, 2020, doi:
10.1109/ACCESS.2019.2963053.
[2] B. Khan, R. Naseem, F. Muhammad, G. Abbas and S. Kim, "An Empirical Evaluation of Machine
Learning Techniques for Chronic Kidney Disease Prophecy," in IEEE Access, vol. 8, pp. 55012-
55022, 2020, doi: 10.1109/ACCESS.2020.2981689.
[3] Sanmarchi, F., Fanconi, C., Golinelli, D., Gori, D., Hernandez-Boussard, T., & Capodici, A.
(2023). Predict, diagnose, and treat chronic kidney disease with machine learning: a systematic
literature review. Journal of nephrology, 36(4), 1101–1117. https://doi.org/10.1007/s40620-023-
01573-4.
[4] Ventrella, P., Delgrossi, G., Ferrario, G., Righetti, M., & Masseroli, M. (2021). Supervised machine
learning for the assessment of chronic kidney disease advancement. Computer methods and
programs in biomedicine, 209, 106329. https://doi.org/10.1016/j.cmpb.2021.106329.
[5] Jaber Qezelbash-Chamak, Saeid Badamchizadeh, Kourosh Eshghi, Yasaman Asadi, A survey of
machine learning in kidney disease diagnosis, Machine Learning with Applications, Volume 10,
2022, 100418, ISSN 2666-8270, https://doi.org/10.1016/j.mlwa.2022.100418.
[6] M. U. Emon, A. M. Imran, R. Islam, M. S. Keya, R. Zannat and Ohidujjaman, "Performance
Analysis of Chronic Kidney Disease through Machine Learning Approaches," 2021 6th
International Conference on Inventive Computation Technologies (ICICT), Coimbatore, India,
2021, pp. 713-719, doi: 10.1109/ICICT50816.2021.9358491.
[7] P. Chittora et al., "Prediction of Chronic Kidney Disease - A Machine Learning Perspective," in
IEEE Access, vol. 9, pp. 17312-17334, 2021, doi: 10.1109/ACCESS.2021.3053763.
[8] Gangani Dharmarathne, Madhusha Bogahawaththa, Marion McAfee, Upaka Rathnayake, D.P.P.
Meddage, On the diagnosis of chronic kidney disease using a machine learning-based interface
with explainable artificial intelligence, Intelligent Systems with Applications, Volume 22, 2024,
200397, ISSN 2667-3053, https://doi.org/10.1016/j.iswa.2024.200397.
[9] Tekale S, Shingavi P, Wandhekar S, Chatorikar A. Prediction of chronic kidney disease using
machine learning algorithm. International Journal of Advanced Research in Computer and
Communication Engineering. 2018 Oct;7(10):92-6.
[10] Almustafa KM. Prediction of chronic kidney disease using different classification algorithms.
Informatics in Medicine Unlocked. 2021 Jan 1;24:100631.
https://doi.org/10.1016/j.imu.2021.100631