AIML Record Batch 9
Ex. No: 1(a)
Aim
To understand and apply various data curation techniques to clean and prepare medical datasets
for machine learning applications.
Tools Required
• Python 3.x
• Anaconda distribution
• Jupyter Notebook
• Pandas library
• NumPy library
• Scikit-learn library
• Matplotlib/Seaborn for visualization
Algorithm/Procedure
Program:
1. Load the Dataset:
3. Handle Missing Values:
4. Remove Duplicate Records:
5. Convert Categorical Data:
7. Split the Dataset:
8. Performing one hot encoding:
9. Performing interpolation:
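The program output above was recorded as screenshots; a minimal sketch that follows the same steps is given below for reference. The filename medical_data.csv and the target column name 'class' are assumptions.

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# 1. Load the dataset (the filename is an assumption)
df = pd.read_csv('medical_data.csv')

# 3. Handle missing values: treat '?' placeholders as NaN
df = df.replace('?', np.nan)

# 4. Remove duplicate records
df = df.drop_duplicates()

# 5. Convert categorical data to numeric codes (NaN restored after coding)
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype('category').cat.codes.replace(-1, np.nan)

# 8. One-hot encoding is an alternative for nominal columns, e.g.:
# df = pd.get_dummies(df, columns=['appetite'])

# 9. Fill the remaining gaps with linear interpolation
df = df.interpolate(method='linear')

# 7. Split the dataset (assumes the target column is named 'class')
X = df.drop('class', axis=1)
y = df['class']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```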
Result:
The medical dataset was successfully cleaned and prepared for machine learning, resulting in a
dataset with no missing values, no duplicates, converted categorical data, normalized features,
and a proper training-testing split. This curated dataset is now ready for building and evaluating
machine learning models.
Ex. No: 1(b) Date: 09.08.24
Aim
To understand and apply various outlier detection and removal methods to prepare healthcare
data for machine learning applications.
Tools Required
• Python 3.x
• Anaconda distribution
• Jupyter Notebook
• Pandas library
• NumPy library
• Scikit-learn library
• Matplotlib/Seaborn for visualization
Algorithm/Procedure
Load the Dataset
i. Import the necessary libraries.
ii. Load the healthcare dataset into a Pandas DataFrame.
Understand the Dataset
iii. Display the first few rows of the dataset.
iv. Get a summary of the dataset using the info() and describe() methods.
Identify Outliers
v. Use visualization techniques such as box plots and scatter plots to identify outliers.
vi. Use statistical methods such as the Z-score or the IQR (Interquartile Range) method to detect outliers.
Remove Outliers
vii. Based on the identified outliers, decide on a strategy to remove them:
1. Remove outliers using the IQR method.
2. Remove outliers using the Z-score method.
Verify the Data
viii. Re-visualize the data to ensure outliers have been effectively removed.
ix. Summarize the dataset again to confirm the absence of outliers.
Handle Missing Values (if any)
x. Identify and handle any missing values in the dataset using appropriate methods (if applicable).
Normalize/Standardize the Data
xi. Apply normalization or standardization techniques to scale numerical data.
Feature Engineering
xii. Create new features based on existing data.
xiii. Select relevant features for the machine learning model.
Split the Dataset
xiv. Split the dataset into training and testing sets using train_test_split() from
Scikit-learn.
Data Curation Methods
Outlier Detection
xv. Visualization: df.boxplot(column='column_name'), sns.scatterplot(x='column1', y='column2', data=df)
xvi. Z-score Method: df[(np.abs(stats.zscore(df['column_name'])) < 3)]
Outlier Removal
• Z-score Method: Remove data points with a Z-score greater than 3 or less than -3.
• IQR Method: Remove data points outside the range Q1 - 1.5*IQR to Q3 + 1.5*IQR.
Normalization/Standardization
• Normalization: (df - df.min()) / (df.max() - df.min())
• Standardization: (df - df.mean()) / df.std()
Feature Engineering
• Creation: df['new_feature'] = df['feature1'] * df['feature2']
• Selection: Using methods such as correlation matrix or feature importance from
models like RandomForest.
PROGRAM:
1. Load the Dataset:
3. Visualizing using boxplot:
4. Using scatter plot:
5. Calculating Z score:
7. Remove outliers:
8. Creating new features:
9. Normalization and standardization:
Result
Outliers were successfully detected and removed from the healthcare dataset using visualization
techniques, the Z-score method, and the IQR method. The resulting dataset is free of outliers,
normalized, and ready for further machine learning model training and evaluation.
Ex. No: 2 Date: 30.08.24
CHRONIC KIDNEY DISEASE CLASSIFICATION USING SVM
CLASSIFIER
Aim:
To classify Chronic Kidney Disease (CKD) using the Support Vector Machine (SVM) algorithm with Python in a Jupyter notebook.
Tools Required:
• Python 3.x
• Anaconda distribution
• Jupyter Notebook
• Pandas library
• NumPy library
• Scikit-learn library
Program:
Importing libraries:
Loading and preprocessing the datasets:
Identify the missing values:
Convert non-numeric columns to numeric for filling the missing values using interpolation:
Before interpolation:
Performing Interpolation:
After interpolation:
Normalization:
Splitting, Training, and Testing:
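A minimal sketch of the SVM pipeline described by these steps is given below; the filename kidney_disease.csv and the target column name 'class' are assumptions.

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset and treat '?' placeholders as missing values
df = pd.read_csv('kidney_disease.csv').replace('?', np.nan)

# Convert non-numeric columns to numeric codes so interpolation can fill gaps
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].astype('category').cat.codes.replace(-1, np.nan)

# Fill missing values with linear interpolation (back-fill any leading gaps)
df = df.interpolate(method='linear').bfill()

# Normalize the features; the target column name 'class' is an assumption
X = MinMaxScaler().fit_transform(df.drop('class', axis=1))
y = df['class']

# Split, train, and test the SVM classifier
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = SVC(kernel='rbf')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```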
Result:
Thus, the machine learning model has been trained using the Support Vector Machine algorithm and
Chronic Kidney Disease classification is done.
Ex. No: 3 Date: 06.09.24
CHRONIC KIDNEY DISEASE CLASSIFICATION USING DECISION
TREE CLASSIFIER
Aim:
To classify Chronic Kidney Disease (CKD) using the Decision Tree Classifier algorithm with Python in a Jupyter notebook.
Tools Required:
• Python 3.x
• Anaconda distribution
• Jupyter Notebook
• Pandas library
• NumPy library
• Scikit-learn library
Program:
Importing libraries:
Preprocessing:
Missing values:
Convert non-numeric columns to numeric if possible:
Before interpolation:
Perform interpolation:
After interpolation:
Normalization:
Splitting, training, and testing:
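The pipeline is identical to Experiment 2 except for the classifier; a minimal sketch of the classifier-specific part is shown below, assuming the preprocessed split (X_train, X_test, y_train, y_test) from the Experiment 2 sketch.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# X_train, X_test, y_train, y_test are assumed to come from the same
# preprocessing and split shown in the Experiment 2 sketch
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

y_pred = dt.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```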
Result:
Thus, the machine learning model has been trained using the Decision Tree Classifier algorithm and Chronic Kidney Disease classification is done.
Ex. No: 4 Date: 20.09.24
CHRONIC KIDNEY DISEASE CLASSIFICATION USING RANDOM
FOREST CLASSIFIER
Aim:
To classify Chronic Kidney Disease (CKD) using the Random Forest Classifier algorithm with Python in a Jupyter notebook.
Tools Required:
• Python 3.x
• Anaconda distribution
• Jupyter Notebook
• Pandas library
• NumPy library
• Scikit-learn library
Program:
Importing libraries:
Preprocessing:
Missing values:
Convert non-numeric columns to numeric:
Before interpolation:
Perform interpolation:
After interpolation:
Normalization:
Splitting, training, and testing:
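As in Experiment 3, only the classifier changes; a minimal sketch is given below, again assuming the preprocessed split from the Experiment 2 sketch.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# X_train, X_test, y_train, y_test are assumed to come from the same
# preprocessing and split shown in the Experiment 2 sketch
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```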
Result:
Thus, the machine learning model has been trained using the Random Forest Classifier algorithm and Chronic Kidney Disease classification is done.
Ex. No: 5(a) Date: 18.10.2024
Aim:
To predict the risk of chronic kidney disease in patients based on various health indicators using
Logistic regression.
Tools Required:
• Python 3.x
• Jupyter Notebook or any Python IDE
• Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn libraries
Methodology:
1. Loading and Previewing the Dataset: The code begins by loading the dataset into a
DataFrame and displaying the first few rows. This step helps in understanding the initial
structure and content of the data.
2. Data Preprocessing: Unknown values represented by '?' are replaced with NaN to handle
missing data more effectively. This ensures that the dataset is clean and ready for further
processing.
3. Data Encoding: Categorical variables, such as 'normal' and 'abnormal', are converted into
numeric values. This conversion simplifies the analysis, allowing the use of mathematical and
statistical operations on these features.
4. Display DataFrame Information: The code prints out basic information about the
DataFrame, including the data types of each column and the count of non-null values. This step
is crucial for understanding the composition of the dataset.
7. Identifying Missing Values: The code identifies and counts the number of missing values
in each column, providing an overview of the data's completeness.
9. Display DataFrame Before Interpolation: The current state of the DataFrame, including
the count of missing values, is displayed before performing interpolation. This helps in
comparing the data before and after filling gaps.
10. Interpolation: Linear interpolation is performed to fill in the missing values. This method
estimates missing data points based on surrounding values, creating a continuous dataset.
11. Display DataFrame After Interpolation: The DataFrame and the count of missing values
are displayed again after interpolation to verify that the missing data has been handled.
12. Visualize Final Dataset: A final heatmap is generated to visualize the dataset after
interpolation, ensuring that the data is ready for further analysis or modeling.
Program:
2. Preprocessing
3. Visualize Correlation between features using a heatmap
5. Convert non-numeric columns to numeric if possible
6. Before interpolation
7. Perform interpolation
8. After interpolating
10. Predict on the test set
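Since the Result below discusses Ridge regression alongside linear regression, the sketch uses Ridge; df is assumed to be the interpolated, encoded DataFrame from the steps above, and the target column name 'target' is hypothetical.

```python
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# df is assumed to be the interpolated, encoded DataFrame from the steps
# above; 'target' is a hypothetical name for the outcome column
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit Ridge regression and predict on the test set
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)
print('MSE:', mean_squared_error(y_test, y_pred))
print('R2:', r2_score(y_test, y_pred))

# Cross-validated scores indicate generalizability better than a single split
print('Ridge CV R2:', cross_val_score(Ridge(alpha=1.0), X, y, cv=5,
                                      scoring='r2').mean())
print('Linear CV R2:', cross_val_score(LinearRegression(), X, y, cv=5,
                                       scoring='r2').mean())
```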
Result:
Ridge regression typically performs better in cases where linear regression may overfit,
especially when dealing with multicollinearity. The cross-validated scores give a better
indication of model generalizability.
5b. Predicting Hospital Readmission Rates Using Lasso Regression
Aim:
To predict hospital readmission rates using Lasso regression.
Tools Required:
• Python 3.x
• Jupyter Notebook or any Python IDE
• Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn libraries
Methodology:
1. Data Collection:
2. Data Preprocessing:
• The string pattern representing missing values (`?`) is replaced with `NaN` using
`df.replace()`.
Label Encoding:
• Categorical variables (e.g., `rbc`, `pc`, `pcc`, etc.) are mapped to numerical values using
`map()` to convert categorical data into a numeric format required for model training.
Summary Statistics:
• `df.describe()` is used to calculate basic statistical measures (mean, std, min, max, etc.)
for the numeric columns in the dataset.
Missing Values:
• `df.isnull().sum()` is used to count the number of missing values per column, providing
insight into the extent of missing data.
Interpolation:
• Missing values are filled using linear interpolation with `df.interpolate()`, which estimates gaps from surrounding values.
5. Model Training and Evaluation:
• While not explicitly shown in the provided code, the final step typically involves
splitting the data into training and testing sets, standardizing the data, and applying
machine learning models (e.g., Lasso regression) to predict outcomes and evaluate the
model’s performance using metrics like `mean_squared_error` and `r2_score`.
Program:
2. Preprocessing
3. Visualize correlations between features using a heatmap
6. Before interpolation
7. Perform interpolation
8. After interpolating
9. Convert categorical variables using Label Encoding or One-Hot Encoding
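A minimal sketch of the Lasso step is given below; df is assumed to be the interpolated, encoded DataFrame from the steps above, and the target column name 'readmission_rate' is hypothetical.

```python
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# df is assumed to be the interpolated, encoded DataFrame from the steps
# above; 'readmission_rate' is a hypothetical name for the target column
X = df.drop('readmission_rate', axis=1)
y = df['readmission_rate']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize the features, since Lasso's L1 penalty is scale-sensitive
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Fit Lasso; the L1 penalty drives some coefficients exactly to zero,
# performing feature selection as noted in the Result
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)
print('MSE:', mean_squared_error(y_test, y_pred))
print('R2:', r2_score(y_test, y_pred))

# Coefficients driven to zero mark features the model discarded
print(dict(zip(X.columns, lasso.coef_)))
```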
Result:
Lasso regression not only predicts the readmission rates but also helps in feature selection by
driving some coefficients to zero, which simplifies the model and may enhance interpretability.
MINI PROJECT
Enhancing Early Detection of Chronic Kidney Disease
Using Machine Learning Algorithms: A Comparative
Analysis
ABSTRACT
Chronic Kidney Disease (CKD) affects 12-14% of the global population, with around 30 million cases
in the US contributing to over $32 billion in healthcare costs annually. Early detection is crucial as
CKD can lead to end-stage renal disease (ESRD) without timely treatment. This study analyses and compares seven machine learning (ML) algorithms—Decision Tree, Random Forest, AdaBoost, K-Nearest Neighbors (KNN), Gradient Boosting, XGBoost, and Bootstrap Aggregating—on a dataset of 400 patients with 25 attributes. Performance was evaluated based on accuracy, precision, recall, and F1 score. The top three models—Random Forest, KNN, and Bagging—were combined to form a new ensemble model. The ensemble model achieved 100% accuracy, outperforming individual classifiers.
It demonstrated superior robustness, eliminating both false positives and false negatives across the test
set, as confirmed by the confusion matrix. While Random Forest alone achieved high performance, the
ensemble approach improved overall consistency and reduced the risk of misclassification, making it
a more reliable choice for clinical implementation. The study shows that ML can improve early CKD
diagnosis, offering personalized treatment and reducing misdiagnosis risks, making these models
promising for clinical integration.
Keywords: Chronic Kidney Disease, Machine Learning, Random Forest, Decision Tree, AdaBoost,
K-Nearest Neighbors (KNN), Gradient Boosting, XGBoost, and Bootstrap Aggregating, CKD
Diagnosis.
INTRODUCTION
Chronic Kidney Disease (CKD) refers to the progressive and irreversible damage to the kidneys that
prevents them from effectively filtering blood. CKD is usually classified into five stages, based on the rate at which kidney function is diminishing, ranging from mild kidney impairment to end-stage renal disease (ESRD), which requires dialysis or a kidney transplant. The United States alone has seen almost 30 million adult diagnoses, making it a global health risk that affects 12%–14% of the population. CKD accounts for a significant portion of healthcare costs in the US, with an annual national expenditure of about $32 billion. Primary and secondary forms of chronic kidney disease exist. While primary CKD directly affects the kidneys, secondary CKD is brought on by other conditions, primarily diabetes and hypertension. Fig 1 shows the causes of CKD and their respective percentages; beyond these main causes, smoking, obesity, high alcohol consumption, and a family history of renal illness are other risk factors. Renal failure can be the ultimate outcome of untreated high blood
pressure and diabetes, which gradually impair the kidneys' filtration system.
Fig 1. Causes of chronic kidney disease (CKD) and their respective percentages
CKD is sometimes referred to as a "silent" disease because it progresses gradually and may not exhibit symptoms in its initial stages. When symptoms do appear, they may consist of changed urination patterns, leg oedema, fatigue, elevated blood pressure, and breathing difficulties. With worsening CKD, patients may experience increasingly severe symptoms such as nausea, vomiting, loss of appetite, and cardiovascular difficulties. Both invasive and non-invasive diagnostic techniques are available. An
early diagnosis is essential for the treatment of CKD.
Non-invasive methods: Serum creatinine levels in the blood and proteinuria in the urine are measured, and these tests are commonly used to detect CKD early. CT and ultrasound examinations are two imaging modalities that can be used to assess the shape and function of the kidneys.
Invasive method: In advanced cases of CKD, an invasive kidney biopsy procedure may be required to determine the underlying cause of the condition.
Treatment for CKD aims to manage symptoms, prevent complications, and limit the progression
of the disease. Blood pressure medication, blood sugar management, lifestyle changes, and other risk
factor management strategies are all part of the treatment for early-stage CKD. Treatment options for
CKD that advances to ESRD include dialysis and kidney transplantation. AI greatly enhances CKD
diagnosis and treatment, particularly when combined with machine learning (ML) techniques. By
analysing vast amounts of patient data, AI can help with the early detection of CKD, improve the
accuracy of diagnoses, and predict the course of the disease. Medical datasets have yielded insightful
information through the deployment of machine learning (ML) models, improving the capacity to detect
patients who are at risk and give customised treatment plans. AI approaches are also being developed
to predict the onset of CKD and the effectiveness of different treatments, based on the patient's medical
history.
This study uses an empirical analysis of seven different machine learning techniques on this
dataset to determine the most effective method for early detection of CKD: Decision Tree, Random
Forest, AdaBoost, K-Nearest Neighbours (KNN), Gradient Boosting, XGBoost, and Bootstrap
Aggregating. Accuracy, precision, recall, and F1 score are used to evaluate the procedures. This work uses the CKD dataset from the UCI Machine Learning Repository, consisting of 400 patient records with 24 attributes. The dataset, which is divided into numerical and categorical variables, includes 250 entries classified as "ckd" and 150 entries classified as "notckd". Age, blood pressure, blood glucose levels, and other important clinical parameters are included. It should be noted that the dataset includes missing values, which complicates the research and calls for careful handling methods.
LITERATURE SURVEY
Chittora et al.'s goal in 2021 was to use deep learning (DL) and machine learning (ML) approaches
to predict chronic kidney disease (CKD). They made use of a 400-instance dataset with 24 attributes
that was taken from the UCI repository. Artificial Neural Network (ANN), C5.0, Logistic Regression,
Chi-square Automatic Interaction Detector (CHAID), Linear Support Vector Machine (LSVM),
Random Tree, and K-Nearest Neighbours (KNN) are the seven machine learning techniques that were
used. To improve performance, feature selection methods like LASSO, Wrapper approach, and
Correlation-based Feature Selection (CFS) were used. SMOTE was used to balance the dataset.
According to the results, the accuracy of 98.86% was achieved by LSVM with penalty L2 and SMOTE
on full features. However, a deep neural network model produced the best accuracy of 99.6%. The
results of evaluation metrics such as area under the curve (AUC), precision, recall, and F-measure
showed that deep learning models were superior to machine learning (ML) models in CKD prediction.
[7]
Ventrella et al. (2021) intended to facilitate individualised care and strategic treatment planning
by forecasting the period that a patient with CKD will need dialysis. A supervised machine learning
strategy was used to build a computational model, and the efficacy of several techniques was examined.
The data that was taken from Vimercate Hospital's Electronic Medical Records was used to train the
model, which included red blood cell count, urea, creatinine level, and eGFR trends. With a 94% test
accuracy, 91% specificity, and 96% sensitivity, the ultimate model that was suggested was built on
Extremely Randomised Trees classifiers. Predictions with a granularity of up to six months were
possible thanks to the model's stable performance even at shorter time intervals. With the help of this
method, nephrologists may now forecast a patient's clinical course with a great deal of help from
predictive modelling. The study enables enhanced resource management and personalised care in
healthcare settings by fusing the model's promising outcomes with the experience of doctors. These
developments highlight the potential of machine learning to improve CKD management decision-
making procedures.[4]
In the article by Emon et al. (2021), the aim was to investigate the role of data mining and
machine learning in predicting chronic kidney disease (CKD). According to the report, machine
learning methods are being used more often to identify serious health hazards including brain tumours
and diabetes. Because CKD compromised the kidneys' capacity to efficiently filter waste, it posed a
major risk to health. Confusion matrix analysis was used by the authors to assess the effectiveness of
different classifiers, highlighting the significance of precisely projecting both positive and negative
outcomes. 99% accuracy was the greatest achieved by the Random Forest classifier, outperforming
other approaches. Comparatively, classifiers using Multilayer Perceptron (MLP), Stochastic Gradient
Descent (SGD), and Decision Tree achieved 98% accuracy. With accuracies of 95% and 96%,
respectively, Naive Bayes and Logistic Regression did less well. The research emphasised that the risks of cardiovascular problems and end-stage renal disease (ESRD) were markedly increased by chronic kidney disease (CKD). Early detection and monitoring were therefore considered essential. All things
considered, of the models that were assessed, the Random Forest classifier yielded the highest ROC
value in addition to the best accuracy.[6]
In the study by Khan et al. (2020), CKD was defined as a condition where kidneys gradually lose
their ability to filter blood, leading to waste accumulation. The authors discussed different machine
learning (ML) methods with an emphasis on the importance of early detection. The study employed
seven machine learning algorithms: Naïve Bayes, J48, NBTree, Support Vector Machine (SVM),
Logistic Regression (LR), Multi-layer Perceptron (MLP), and Composite Hypercube on Iterated
Random Projection (CHIRP). Metrics for evaluation included accuracy and mean absolute error
(MAE). Their efficacy was demonstrated by the MAE values of 0.015 for SVM and 0.0025 for CHIRP,
according to the data. Accuracy values of 99.75% for CHIRP and 98.25% for SVM were obtained. The study found that CHIRP greatly lowered error rates while raising CKD diagnosis precision.[2]
Comparative analyses by Qin et al. (2019) showed that Random Forest outperformed classifiers
like logistic regression and support vector machines in the diagnosis of CKD. This finding demonstrates
the efficiency of ensemble approaches in managing complicated datasets. To enhance CKD diagnosis,
machine learning (ML) approaches have been used more and more in recent studies. For example,
Hussain et al. (2021) used Random Forest to obtain 99.75% accuracy after imputation of missing data
using K-Nearest Neighbours (KNN). Patel et al. (2020) showed that KNN successfully preserved
variable associations, underscoring the significance of robust imputation techniques. Furthermore, Lee
et al. (2023) presented an integrated model with an accuracy of 99.83% that combined Random Forest
with logistic regression. These developments show that ML could help nephrologists make wise
decisions. Still, there are issues including poor data quality and the requirement for thorough validation.
Enhancing these algorithms for wider clinical use should be the main goal of future research.[1]
In the systematic review by Sanmarchi et al. (2023), the authors aimed to assess the deployment of
artificial intelligence (AI) and machine learning (ML) techniques in predicting, diagnosing, and treating
CKD. Using the PRISMA technique, 16 variables in total were extracted, including demographics,
study goals, sample size, and performance indicators. 68 of the 648 studies met the requirements for
inclusion. While the models under consideration showed encouraging performance, direct comparison
was difficult due to the disparities in the metrics. Prognosis prediction was the focus, with diagnosis
receiving less attention. Six of the investigations were conducted in clinical settings, and the authors
stated that the majority lacked varied population testing and generalisability. The study's
conclusion was that although machine learning has potential for managing chronic kidney disease
(CKD), more research is needed to improve the interpretability, generalisability, and fairness of the
models before they can be used in routine clinical settings.
METHODOLOGY
DATASET SELECTION:
This study uses the CKD dataset, which was obtained from the UCI Machine Learning Repository, collected from a hospital and donated by Soundarapandian et al. on 3rd July, 2015. The dataset contains 400 samples. In this CKD dataset, each sample has 24 predictive variables or features (11 numerical variables and 13 categorical (nominal) variables) and a categorical response variable (class). The class variable has two values, namely, ckd (sample with CKD) and notckd (sample without CKD). Of the 400 samples, 250 belong to the category of ckd, whereas 150 belong to the category of notckd. It is worth mentioning that the dataset contained a large number of missing values.
FLOW DIAGRAM:
DATA PREPROCESSING:
The dataset has missing values in several columns, such as age, blood_pressure, and
specific_gravity. To handle these, interpolation is applied to replace NaN values, smoothing the data
by estimating the missing entries based on other values in the dataset.
Encoding Categorical Values:
For categorical variables, label encoding was applied to convert them into binary numeric
representations. Specifically, values for red blood cells, pus cell, pus cell clumps, bacteria,
hypertension, coronary artery disease, pedal edema, anaemia, diabetes mellitus, appetite, and the target variable class were mapped to 0 and 1, with 0 representing the negative class (e.g., no, normal, notpresent, poor) and 1 representing the positive class (e.g., yes, abnormal, present, good). The class
column, indicating the presence of chronic kidney disease (CKD), was also encoded, where 0
corresponded to notckd and 1 to ckd.
Scaling Features:
The data set features have been scaled to a uniform range to ensure that each feature contributes
equally to the performance of the machine learning algorithms used. This process is particularly critical
for algorithms sensitive to the scale of the input data, such as K-Nearest Neighbors and Gradient
Boosting. To begin, the target variable was separated from the feature set. This separation allows for
the normalization process to focus exclusively on the feature data, which is essential for effective model
training. Normalization was achieved using the MinMaxScaler, a widely used method that transforms
features to a specified range, typically between 0 and 1. This technique adjusts the values of each
feature based on their minimum and maximum values, ensuring that all features are scaled
proportionally.
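A minimal sketch of the encoding and scaling described above is given below, assuming df holds the loaded CKD dataset; the column names and value spellings follow the UCI CKD dataset and are assumptions.

```python
from sklearn.preprocessing import MinMaxScaler

# Map the binary categorical columns to 0/1 as described above; df is
# assumed to hold the loaded CKD dataset (column names are assumptions)
binary_maps = {
    'rbc': {'normal': 0, 'abnormal': 1},          # red blood cells
    'htn': {'no': 0, 'yes': 1},                   # hypertension
    'appet': {'poor': 0, 'good': 1},              # appetite
    'classification': {'notckd': 0, 'ckd': 1},    # target class
}
for col, mapping in binary_maps.items():
    df[col] = df[col].map(mapping)

# Separate the target, then scale the features to [0, 1] with MinMaxScaler
X = df.drop('classification', axis=1)
y = df['classification']
X_scaled = MinMaxScaler().fit_transform(X)
```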
The number of non-null entries for each feature is determined by looking at the structure and data types
of each column. Several columns have missing values, according to this preliminary examination,
which calls for additional cleaning. Missing information is detected in columns such as red_blood_cell_count, specific_gravity, and blood_pressure. Interpolation and possibly additional
imputation techniques are used to address them in order to preserve the dataset's integrity for analysis
and model training. Measures such as mean, median, minimum, and maximum are among the
numerical columns for which summary statistics are computed. These statistics aid in determining
outliers and possible abnormalities within specific features as well as in comprehending the distribution
of the data. The distribution of features like blood_glucose_random, hemoglobin, and blood_urea may
be seen using Kernel Density Estimation (KDE) charts. These plots aid in assessing normality, skewness, or the existence of several modes by illuminating the distribution of data points across
values.
Model Selection:
Seven distinctive machine learning algorithms were employed in this study: Random Forest, K-
Nearest Neighbors (KNN), AdaBoost, Gradient Boosting, XGBoost, Bootstrap Aggregating (Bagging),
and Decision Tree. The classification models are evaluated based on metrics such as accuracy, precision, recall, and F1 score.
CHOICE OF MODELS:
K-Nearest Neighbors (KNN): Specifically designed for classification jobs where comparable data
points cluster closely together in feature space, KNN is a straightforward, non-parametric technique
that performs well in smaller datasets. Its strength is in its capacity to predict based on the distance
(similarity) of nearby data points, which makes it very easy to use and interpret. However, because it
computes the distance to each point during prediction, it is computationally costly on large datasets.
Nevertheless, because of its ease of use and straightforward categorization methodology, KNN offers
a useful baseline against which to compare.
Random Forest Classifier: Several decision trees are combined in the Random Forest ensemble
approach to increase accuracy and decrease overfitting. High resilience and generalization are achieved
by this approach, which generates each tree on a random portion of the input and bases final predictions
on the majority vote from all trees. Moreover, Random Forest has the benefit of feature importance measures, which draw attention to the features that have the biggest impact on the model's predictions.
Large forests can still be resource-intensive even though they are computationally efficient when
compared to other ensemble approaches. However, Random Forest is a strong option for challenging
classification problems due to its capacity to capture non-linear correlations.
AdaBoost Classifier: AdaBoost is a boosting method that generates weak models one after the other,
fixing mistakes in each model as it goes. AdaBoost's performance is enhanced by this iterative method,
which helps it concentrate on situations that are challenging to categorize. The accuracy of AdaBoost
on balanced datasets and its ability to use weak learners like shallow decision trees to create a powerful
classifier are its main advantages. However, because the method emphasizes incorrectly classified
samples, it may be sensitive to noisy data and outliers. AdaBoost is a desirable option when seeking a
high level of classification accuracy because of its adaptability and emphasis on difficult cases.
Gradient Boosting Classifier: Gradient Boosting produces a series of models one after the other, fixing
the mistakes of the models that came before it. Gradient Boosting is well-known for its excellent
accuracy and works especially well when dealing with intricate, non-linear relationships in the data.
This model is adaptable and uses gradient descent to optimize each learning step, capturing complex
patterns. Gradient Boosting can have high computing requirements, particularly when using deeper
trees and more boosting rounds, which could result in overfitting if not properly adjusted. It works well
with datasets that have intricate, subtle interactions because it can improve on past mistakes.
XGBoost: Regularization is built into XGBoost, an enhanced variant of gradient boosting that improves
accuracy and robustness. Because of its scalability and optimization methods, XGBoost, which is
designed for high efficiency, is frequently utilized in competitions. In order to improve generalization
on unknown data and avoid overfitting, it permits L1 and L2 regularization. However, XGBoost is
computationally demanding for large datasets and its wide range of hyperparameters might complicate
tweaking. Because of its effective learning and regularization, XGBoost is a great option for attaining
high accuracy, particularly in intricate, high-dimensional datasets.
Decision Tree Classifier: Decision trees are straightforward, comprehensible models that divide data
into branches according to feature values in order to classify the data. This model offers a clear, visual
depiction of decision-making and performs well with both continuous and categorical variables.
Decision trees are a useful baseline because of their simplicity and interpretability, despite their
propensity for overfitting. In ensemble approaches like Random Forest and AdaBoost, where their
drawbacks (such as large variance) can be lessened, they are frequently utilized as base models. Decision
trees provide a basic model for this topic that can be compared to more intricate algorithms.
Bootstrap Aggregation (Bagging): By sampling with replacement, training a model (often a Decision
Tree) on each subset, and aggregating the results, the ensemble technique known as bagging generates
several subsets of the dataset. By stabilizing predictions and lowering variance, this method enhances
model generalization. Decision trees and other high-variance models benefit greatly from bagging,
which reduces their sensitivity to changes in the data. However, if the base model is fundamentally
flawed or there is a lack of data, bagging might not yield a discernible improvement in performance.
This model contributes to accuracy by utilizing the stability that bagging offers, which makes it an
important component of the ensemble selection for this project.
PARAMETER TUNING:
AdaBoost will employ up to 50 weak learners in the ensemble in a sequential manner since the
number of estimators (n_estimators) is set to 50 in addition to the base estimator. The model's ability
to learn depends on this value: too few estimators can result in underfitting, where the model is unable
to capture the complexity of the data, while too many estimators can lengthen computation times and
cause overfitting, where the model is overfit to the training data. Setting it to 50 strikes a balance
between computational economy and performance, enabling the model to make good corrections on
its own without requiring an excessive amount of training time.
To improve performance, a number of crucial parameters of the Random Forest classifier were
adjusted. "Entropy," which is used to assess each split according to information gain, was chosen as
the criterion parameter. This criterion is chosen based on which produces the optimal splits based on
data distribution and is frequently compared to "gini" in tree-based models. The model learns to select
splits that best distinguish across classes by optimizing information acquisition, which could increase
accuracy.
The maximum depth that each tree can reach is determined by the max_depth option, which is
set to 11. By limiting each tree's depth, overfitting is less likely to occur because each tree isn't exposed to too much information in the training set. With the max_features parameter set to "sqrt," each tree
will take into account a subset of features at each split, which is equal to the square root of the total
number of features. By increasing the ensemble's variety and randomness, this lowers tree correlations
and strengthens the model's resistance to overfitting.
Additionally, to guarantee that every leaf node has a minimum of two samples, the
min_samples_leaf parameter is set to 2. By keeping leaf nodes from having too few samples, which
could result in overfitting, this constraint smoothes predictions. In a similar manner, each node must
contain at least three samples before it may split because the min_samples_split parameter is set to 3.
By doing this, extremely tiny nodes that can cause noise and instability in forecasts are avoided. Last
but not least, the forest has 130 trees since the n_estimators parameter is set to 130. By averaging across
more models, adding trees typically improves performance, but it also requires more processing power.
Setting it to 130 achieves a balance between maintaining reasonable processing costs and offering
enough trees for a reliable and accurate prediction.
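A sketch of the classifier configurations implied by the parameters above is given below; the random_state values are assumptions added for reproducibility.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# AdaBoost with up to 50 weak learners, as described above
ada = AdaBoostClassifier(n_estimators=50, random_state=42)

# Random Forest with the tuned parameters described above
rf = RandomForestClassifier(
    criterion='entropy',   # evaluate splits by information gain
    max_depth=11,          # cap tree depth to curb overfitting
    max_features='sqrt',   # consider sqrt(n_features) at each split
    min_samples_leaf=2,    # every leaf keeps at least two samples
    min_samples_split=3,   # a node needs three samples before it may split
    n_estimators=130,      # 130 trees in the forest
    random_state=42,       # assumed seed for reproducibility
)
```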
IMPLEMENTATION
PYTHON ENVIRONMENT SETUP:
All the algorithms were implemented in Jupyter Notebook (version 7.2.1). The packages used included
NumPy (version 1.24.3), Pandas (version 2.0.3), Matplotlib (version 3.7.1), Seaborn (version 0.12.2),
Plotly (version 5.15.0), scikit-learn (version 1.2.2), and XGBoost (version 1.7.5). These libraries
enabled data manipulation, visualization, preprocessing, and the application of various machine
learning algorithms for the analysis of CKD.
The data is loaded from a CSV file, a popular format for storing structured data. In order to facilitate manipulation and exploration, this step imports the dataset into a DataFrame. Certain
undesirable characters, such as spaces and question marks, are eliminated after the data has been
loaded. Eliminating these characters helps clean up the data because they occasionally serve as stand-ins for errors or missing information. This guarantees that columns are appropriately interpreted as numerical or categorical rather than as strings containing unexpected symbols.
The dataset was divided into training and testing subsets to assess how well the machine
learning models perform. This division allows for training on one portion of the data while testing on
a separate, unseen portion, providing an unbiased assessment of predictive capabilities. The normalized
feature set was separated from the target variable, labelled as "Outcome," which indicates the presence
or absence of CKD. The dataset was split using a 90/10 ratio, with 90% for training and 10% for testing.
A random seed was applied to ensure reproducibility, allowing consistent comparisons of model
performance across different runs. The classification of chronic renal disease was then accomplished by
training a range of machine learning algorithms. KNN, Random Forest, AdaBoost, Gradient Boosting,
XGBoost, LightGBM, Decision Tree, and Bagging classifiers were among these models. To capitalize
on the model's advantages for this specific classification task, each model was started with a set of
hyperparameters. The models' predictions were then guided by the patterns they discovered in the
training data. Each algorithm is able to anticipate whether fresh instances belong to the target class
through this training process, which enables it to modify its internal parameters based on the labeled
examples that are provided.
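A minimal sketch of the split described above is given below; df_scaled is a hypothetical name for the normalized DataFrame, and the seed value is an assumption.

```python
from sklearn.model_selection import train_test_split

# Separate the normalized features from the target column, labelled 'Outcome';
# df_scaled is a hypothetical name for the normalized DataFrame
X = df_scaled.drop('Outcome', axis=1)
y = df_scaled['Outcome']

# 90/10 split with a fixed seed (seed value assumed) for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42)
```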
The testing phase evaluates each model's ability to generalize to the test data that hasn't been
seen yet after it has been trained. Before conducting more analysis, testing on this separate dataset
enables a preliminary understanding of each model's performance. This methodical approach to model training and testing prepared each algorithm for a thorough performance study.
MODEL EVALUATION:
Model evaluation was a thorough procedure that used a number of indicators to assess how well each
algorithm classified chronic renal disease. The main criterion employed was accuracy, which provides
a clear indication of each model's capacity for accurate prediction. Accuracy scores, which were
computed on both the training and test sets, gave information about how effectively each model
generalized and assisted in identifying possible overfitting in cases when the model's performance on
the training data was notably superior to that on the test data. To further explore the kinds of predictions
given, confusion matrices were created for some models, such as Bagging and LightGBM. A more
detailed examination of errors was made possible by these matrices' true positive, true negative, false
positive, and false negative counts. Understanding the advantages and disadvantages of each model
across many classes was made easier by this breakdown, which also served as the basis for computing
other measures including precision, recall, and F1-score. Recall assessed each model's capacity to
detect every real positive case, whereas precision evaluated how well it detected true positive instances
out of all predicted positive cases. A balanced evaluation of each model's performance was provided
by the F1-score, which is a harmonic mean of precision and recall and is particularly helpful in
addressing class imbalances.
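A sketch of the evaluation procedure described above, assuming clf is any of the fitted classifiers and the split from the previous section:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Compare accuracy on both splits to spot overfitting
print('train accuracy:', accuracy_score(y_train, clf.predict(X_train)))
print('test accuracy:', accuracy_score(y_test, clf.predict(X_test)))

# Confusion matrix: true/false positive and negative counts on the test set
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))

# Per-class precision, recall, and F1-score
print(classification_report(y_test, y_pred))
```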
RESULTS AND DISCUSSION
MODEL PERFORMANCE:
A. KNN
The KNN model showed impressive performance, with a testing accuracy of 97.5% and a training accuracy of 99.17%. The excellent accuracy on both the test and training datasets indicates that the model has successfully learned from the data and does not overfit.
Fig. 3.1 depicts the confusion matrix which demonstrates that KNN accurately anticipated almost all
instances in terms of classification with only one false negative (classifying a CKD case as non-CKD).
While the recall for the same class (0.96) reveals that only one CKD case was missed, the precision for
the CKD class (1.0) suggests that every CKD prediction was accurate.
The model's precision for the non-CKD class was 0.93, and its recall was perfect (1.00), indicating that all non-CKD cases were correctly identified. The F1-scores demonstrate the model's balanced
performance in both classes, with 0.97 for non-CKD and 0.98 for CKD as shown in Fig. 3.2. These are
remarkably high values. In general, KNN seems to do quite well in handling this classification task,
which makes it a dependable option for CKD prediction with low misclassification rates.
B. Decision Tree
The Decision Tree Classifier fully retained the training data, as evidenced by its flawless training
accuracy of 100%. However, the testing accuracy drops to 95%, suggesting some overfitting as the
model struggles to generalize as well as it memorized the training data.
Fig. 4.1 reveals that the model made two misclassifications: 1 false positive (incorrectly predicting
a non-CKD case as CKD) and 1 false negative (failing to predict a CKD case).
The precision for CKD (1.0) indicates that all predicted CKD cases were correct, while recall (0.96)
shows that one CKD case was missed. The precision and recall for the non-CKD class are both 0.93,
showing that while the model performed well, it made one mistake in identifying non-CKD cases as
shown in Fig. 4.2. Both classes have strong F1-scores (0.93 for non-CKD and 0.96 for CKD), which
demonstrate the model's respectable trade-off between recall and precision. While the Decision Tree
classifier performs well on this dataset, the perfect training accuracy paired with lower test accuracy
indicates a tendency to overfit, making it less appropriate for more complicated or variable datasets.
C. Bagging
The Bagging Classifier obtained a flawless training accuracy of 100%, suggesting that it learnt the
training data thoroughly. However, the testing accuracy of 97.5% indicates that it does not suffer from
overfitting and has good generalization capabilities despite the perfect training accuracy.
Fig. 5.1 reveals that there was just one mistake made by the model—a single false negative in which a
case with CKD was incorrectly identified as non-CKD. This suggests that the model predicts both
classes quite well. The recall of 0.96 indicates that one CKD instance was missed, while the precision
of 1.00 for the CKD class indicates that all projected CKD cases were accurate.
Fig. 5.2: Classification Report for Bagging Classifier
With a precision of 0.93 and a flawless recall of 1.00 for the non-CKD class, all non-CKD cases
were properly identified. F1-scores of 0.97 for non-CKD and 0.98 for CKD demonstrate the model's
stable and well-balanced performance in both classes as shown in Fig. 5.2. Because it is an ensemble
model, the Bagging Classifier combines excellent accuracy and stability with minimum mistakes and
strong prediction power, making it an extremely dependable model for CKD classification.
D. Random Forest
The Random Forest Classifier achieved a flawless training accuracy of 100%, which is matched by
a flawless testing accuracy of 100%. This suggests that the model not only learnt the training set
flawlessly, but also showed no evidence of overfitting when it generalized to the test set.
Fig. 6.1 verifies that all CKD and non-CKD instances were accurately identified by the model,
indicating that no misclassifications were made.
Fig. 6.2: Classification Report for Random Forest Classifier
For both the Non-CKD and CKD classes, precision and recall are 1.00 as shown in Fig. 6.2,
indicating that the model was able to accurately predict every case with no false positives or false
negatives. The model performs flawlessly, as seen by the F1-scores, which are all 1.00.
E. Adaboost
The AdaBoost Classifier demonstrated complete learning of the training dataset by achieving a
flawless training accuracy of 100%. Even though it generalizes well, its testing accuracy of 97.5%
indicates that there might be some space for improvement in terms of how well it predicts data that
hasn't yet been observed.
Fig. 7.1 shows that the model made only one misclassification, with one CKD case incorrectly
predicted as non-CKD. This highlights the model's overall effectiveness in distinguishing between the
two classes.
Fig: 7.2. Classification Report for Adaboost Classifier
The model's results for the CKD class show a precision of 1.00, which means that all CKD
predictions were accurate, and a recall of 0.96 as shown in Fig. 7.2, which means that one CKD case
was missed. The precision for the non-CKD class is 0.93, and the recall is perfect at 1.00, meaning that
every non-CKD example was accurately identified. With scores of 0.97 for non-CKD and 0.98 for
CKD, the F1-scores show a good balance in performance.
F. Gradient Boosting
The Gradient Boosting Classifier achieved a perfect training accuracy of 100%, with a testing
accuracy of 97.5%. The model has successfully learned the underlying patterns in the training data, as
demonstrated by its high training accuracy, and its strong testing accuracy shows that it generalizes
well to new data without severe overfitting.
Fig. 8.1 shows that the model made only one misclassification, with one CKD case incorrectly
predicted as non-CKD. Performance measures show that all projected CKD instances were correct,
with a precision of 1.00 for the CKD class.
With a recall of 0.96 for this class, one case of CKD was overlooked. The model has a precision of
0.93 and a perfect recall of 1.00 for the non-CKD class, meaning that every non-CKD example was
properly classified. With scores of 0.97 for non-CKD and 0.98 for CKD, the F1-scores shown in Fig.
8.2 demonstrate a well-balanced performance and show that the model can manage both classes.
G. XGBoost
The XGBoost Classifier achieved a perfect training accuracy of 100%, indicating that the model has
thoroughly learned the training dataset. Its test accuracy is 97.5%, indicating good generalization
capabilities without discernible overfitting.
Fig. 9.1 shows that there was just one misclassification by the model, which misclassified one case
of CKD as non-CKD. The precision for the CKD class is 1.00, meaning that all instances predicted as
CKD were indeed correct.
The recall for this class is 0.96, showing that one CKD case was missed. For the non-CKD class,
the model maintains a precision of 0.93, with perfect recall (1.00), indicating that all non-CKD cases
were accurately identified. The F1-scores reflect a strong balance in performance, with scores of 0.97
for non-CKD and 0.98 for CKD as shown in Fig. 9.2, showcasing the model's effectiveness in managing
both classes.
H. Ensemble Model
The Ensemble Classifier achieved a perfect training accuracy of 100%, indicating that it has fully
learned the training dataset.
Fig. 10.1: Confusion Matrix for Ensemble Model
Its testing accuracy also stands at 100%, demonstrating the model's exceptional ability to generalize
effectively to unseen data without any signs of overfitting. The classification report shown in Fig. 10.2,
indicates a perfect precision and recall of 1.00 for both classes, reflecting that all predictions made by
the model were accurate.
This means the model successfully identified all CKD cases without any false positives or false
negatives as shown in Fig. 10.1. The F1-scores for both classes are also perfect at 1.00, showcasing a
flawless balance between precision and recall. Overall, the Ensemble Classifier demonstrates
outstanding predictive capabilities, effectively combining the strengths of its individual base models.
This exemplary performance underscores the power of ensemble learning in achieving high accuracy
and reliability in the classification of CKD.
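The text does not state how the three top models were combined; a hard-voting combination is one plausible sketch, reusing the tuned Random Forest parameters from the methodology (other settings are assumptions).

```python
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.neighbors import KNeighborsClassifier

# Combine the top three models; hard voting predicts the majority class label
ensemble = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(criterion='entropy', max_depth=11,
                                      max_features='sqrt', min_samples_leaf=2,
                                      min_samples_split=3, n_estimators=130,
                                      random_state=42)),
        ('knn', KNeighborsClassifier()),
        ('bag', BaggingClassifier(random_state=42)),
    ],
    voting='hard',
)
ensemble.fit(X_train, y_train)
print('test accuracy:', ensemble.score(X_test, y_test))
```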
Comparative Model Analysis:
With a 100% accuracy rate, the ensemble model outperforms individual classifiers by guaranteeing a high degree of reliability, highlighting its potential for clinical use. The ensemble model exhibits robustness, which is essential in a clinical situation, by eliminating both false positives and false negatives on the test set, as indicated by the confusion matrix. While the Random Forest model also performed well, the ensemble technique improved consistency and significantly reduced the likelihood of misclassification, making it more reliable for use in medicine.
This study demonstrates how machine learning might improve CKD early detection, reducing
the chance of diagnostic errors and enabling prompt, individualized treatment. For clinical integration,
where precise detection might have a direct influence on patient outcomes, such accuracy and
dependability are crucial. These models have the potential to change CKD management and patient
care in clinical practice by lowering the likelihood of misdiagnosis and assisting clinicians in making
better decisions.
CONCLUSION
The study demonstrates the effectiveness of machine learning models in diagnosing CKD. Among the models tested, the Ensemble Model performed exceptionally well, achieving 100% accuracy and demonstrating its ability to minimize false positives and negatives. K-Nearest Neighbors (KNN) and Bagging also
showed strong results, with a 97.5% accuracy rate. These models effectively handle both categorical
and numerical data, which is critical when working with medical datasets that contain diverse patient
information. While the models performed well on the current dataset, future research should
focus on incorporating larger and more diverse datasets to enhance model robustness and reliability.
The use of real-time patient data, such as from wearables, could further improve early detection and
treatment outcomes. This study demonstrates the potential of machine learning algorithms, particularly
Random Forest, in the early diagnosis of CKD. With high accuracy rates, these models can support
clinical decision-making, improving the accuracy of diagnoses and enabling more personalized patient
care. The ability to predict CKD progress through machine learning provides a promising tool to reduce
the burden of late kidney disease and optimize treatment strategies in health care settings.
REFERENCES
[1] J. Qin, L. Chen, Y. Liu, C. Liu, C. Feng and B. Chen, "A Machine Learning Methodology for
Diagnosing Chronic Kidney Disease," in IEEE Access, vol. 8, pp. 20991-21002, 2020, doi:
10.1109/ACCESS.2019.2963053.
[2] B. Khan, R. Naseem, F. Muhammad, G. Abbas and S. Kim, "An Empirical Evaluation of Machine
Learning Techniques for Chronic Kidney Disease Prophecy," in IEEE Access, vol. 8, pp. 55012-
55022, 2020, doi: 10.1109/ACCESS.2020.2981689.
[3] Sanmarchi, F., Fanconi, C., Golinelli, D., Gori, D., Hernandez-Boussard, T., & Capodici, A.
(2023). Predict, diagnose, and treat chronic kidney disease with machine learning: a systematic
literature review. Journal of nephrology, 36(4), 1101–1117. https://doi.org/10.1007/s40620-023-
01573-4.
[4] Ventrella, P., Delgrossi, G., Ferrario, G., Righetti, M., & Masseroli, M. (2021). Supervised machine
learning for the assessment of chronic kidney disease advancement. Computer methods and
programs in biomedicine, 209, 106329. https://doi.org/10.1016/j.cmpb.2021.106329.
[5] Jaber Qezelbash-Chamak, Saeid Badamchizadeh, Kourosh Eshghi, Yasaman Asadi, A survey of
machine learning in kidney disease diagnosis, Machine Learning with Applications, Volume 10,
2022, 100418, ISSN 2666-8270, https://doi.org/10.1016/j.mlwa.2022.100418.
[6] M. U. Emon, A. M. Imran, R. Islam, M. S. Keya, R. Zannat and Ohidujjaman, "Performance
Analysis of Chronic Kidney Disease through Machine Learning Approaches," 2021 6th
International Conference on Inventive Computation Technologies (ICICT), Coimbatore, India,
2021, pp. 713-719, doi: 10.1109/ICICT50816.2021.9358491.
[7] P. Chittora et al., "Prediction of Chronic Kidney Disease - A Machine Learning Perspective," in
IEEE Access, vol. 9, pp. 17312-17334, 2021, doi: 10.1109/ACCESS.2021.3053763.
[8] Gangani Dharmarathne, Madhusha Bogahawaththa, Marion McAfee, Upaka Rathnayake, D.P.P.
Meddage, On the diagnosis of chronic kidney disease using a machine learning-based interface
with explainable artificial intelligence, Intelligent Systems with Applications, Volume 22, 2024,
200397, ISSN 2667-3053, https://doi.org/10.1016/j.iswa.2024.200397.
[9] Tekale S, Shingavi P, Wandhekar S, Chatorikar A. Prediction of chronic kidney disease using
machine learning algorithm. International Journal of Advanced Research in Computer and
Communication Engineering. 2018 Oct;7(10):92-6.
[10] Almustafa KM. Prediction of chronic kidney disease using different classification algorithms.
Informatics in Medicine Unlocked. 2021 Jan 1;24:100631.
https://doi.org/10.1016/j.imu.2021.100631