Implementation of Logistic Regression On Diabetic Dataset Using Train-Test-Split, K-Fold and Stratified K-Fold Approach
SHORT COMMUNICATION
Received: 28 October 2021 / Revised: 10 May 2022 / Accepted: 17 May 2022 / Published online: 8 July 2022
The Author(s), under exclusive licence to The National Academy of Sciences, India 2022
subsets. In a particular iteration of K-Fold cross-validation, k-1 folds are used for training the classifier and the remaining fold is used for testing. Stratified cross-validation is an extended form of cross-validation [6]. In it, each class is distributed uniformly among the n folds, so that the class distribution in each fold is the same as in the original dataset. Regular cross-validation, on the other hand, arbitrarily partitions S into n folds without taking class distributions into account; K-Fold cross-validation can therefore leave a class distributed unevenly, with some folds containing more cases of that class than others. Kohavi [7] compared many accuracy estimation techniques and found that cross-validation performs better than the other techniques, and that stratification further improves performance by lowering bias and variance.
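As a minimal illustration of this difference (a sketch only; scikit-learn, the synthetic label array and the 5-split setting below are assumptions, not details from the paper), the per-fold class counts produced by plain K-Fold and Stratified K-Fold can be compared directly:

```python
# Sketch: compare per-fold class balance under plain K-Fold vs Stratified K-Fold.
# The data are synthetic and imbalanced (15 samples of class 0, 5 of class 1).
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)

for name, splitter in [("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
                       ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    print(name)
    for fold, (_, test_idx) in enumerate(splitter.split(X, y)):
        # StratifiedKFold keeps the original 3:1 class ratio in every test fold;
        # plain KFold may concentrate the minority class in a few folds.
        print(f"  fold {fold}: test class counts = {np.bincount(y[test_idx], minlength=2)}")
```

With stratification, each test fold reproduces the original class proportion; without it, individual folds can be noticeably unbalanced.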
Weifeng Xu et al. [8] used a variety of machine learning algorithms to predict diabetes; of these, Random Forest was found to be more accurate than the other data mining techniques. Kavakiotis et al. [9] used tenfold cross-validation as the evaluation method for three different algorithms, i.e. Logistic Regression, Support Vector Machines and Naive Bayes; in terms of accuracy and performance, Support Vector Machines outperformed the other two. We have taken our dataset (Table 1) from Kaggle [10]. On PIDD, Sisodia et al. [11] found that, among the SVM, NB and DT machine learning algorithms, the NB classifier performs best, with an accuracy of 76.30%. In Amour Diwani et al.'s study [12], all patients' data were trained and tested using tenfold cross-validation with Naive Bayes and Decision Trees; the best algorithm, according to their results, was Naive Bayes with 76.3021% accuracy. Using different classifiers such as Decision Tree, SVM, KNN, RF and NB, Sneha and Gangil [13] proposed a model for the early detection of diabetes; SVM ranks first among these classifiers with 77.33% accuracy. Aishwarya Jakka and Vakula Rani [14] suggested a performance evaluation approach based on decision-making classifiers; LR, SVM, KNN, RF and NB are some of the algorithms used, and LR has the highest accuracy of 77.60% among them.

The database contains 768 samples, of which 500 are negative class instances, i.e. "0", and 268 are positive class instances, i.e. "1". The features of this dataset are listed in Table 1.

Table 1 Features of PIMA dataset

Sr. No  Feature
1       Number of Pregnancies (NOP)
2       Plasma Glucose Concentration Within 2 Hours (PGC)
3       Diastolic Blood Pressure (DBP)
4       Triceps Skin Fold Thickness
5       Serum Insulin Within 2 Hours
6       Body Mass Index
7       Diabetes Pedigree Function
8       Age
9       Outcome
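For concreteness, a minimal sketch of loading the dataset and checking the figures just quoted; the file name diabetes.csv, the use of pandas, and the column name Outcome are assumptions based on the common Kaggle copy of the data, not details given in the paper:

```python
# Sketch: load the PIMA Indians Diabetes data and check sample count and class balance.
import pandas as pd

df = pd.read_csv("diabetes.csv")        # assumed local copy of the Kaggle file
print(df.shape)                         # expected (768, 9): 768 samples, 8 features + Outcome
print(df["Outcome"].value_counts())     # expected: 500 samples of class 0, 268 of class 1
```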
Figure 1 shows the general steps to be followed.

Fig. 1 General Process: Dataset → Preprocessing of dataset (1. check null values; 2. impute data for missing values) → Interpretation (Logistic Regression) → Performance Evaluation (Precision, F1 score, Recall, Accuracy) → Comparative analysis (Logistic Regression with Train-Test Split, K-Fold Cross-Validation, Stratified K-Fold) → Result
1. We have checked the database for null values.
2. We have imputed the columns that contain zero values with the mean or median of those columns (see the sketch after this list).
3. For validation, we used Train-Test Split, K-Fold cross-validation, and Stratified K-Fold.
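The following sketch illustrates steps 1 and 2 of the list above; the specific columns treated as having missing values and the choice of the median are assumptions for illustration, since the paper only states that zero-valued columns were imputed with the mean or median:

```python
# Sketch: check for nulls, then replace physiologically impossible zeros with column medians.
import numpy as np
import pandas as pd

df = pd.read_csv("diabetes.csv")        # assumed local copy of the Kaggle file

# Step 1: check the database for null values.
print(df.isnull().sum())

# Step 2: treat zeros in these columns as missing and impute with the column median
# (the column list is the usual PIMA choice, assumed here for illustration).
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)
df[zero_as_missing] = df[zero_as_missing].fillna(df[zero_as_missing].median())
```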
We have trained our model with Logistic Regression using Train-Test Split, K-Fold, and Stratified K-Fold. Table 2 shows the Precision, Recall and F1-score values, taking all eight input parameters, using the Train-Test Split method.
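A sketch of the train-test-split evaluation described here; the 30% test size (which matches the 231 test samples, 151 + 80, reported in Table 2), the random seed and the solver settings are assumptions, since the paper does not report them:

```python
# Sketch: Logistic Regression evaluated with a single train/test split on all eight inputs.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.csv")                      # assumed local copy of the Kaggle file
X, y = df.drop(columns="Outcome"), df["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)             # raised max_iter to ensure convergence
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))          # per-class precision, recall, F1, support
print("accuracy:", accuracy_score(y_test, y_pred))
```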
Figure 2 depicts the accuracy values obtained using the Stratified K-Fold method. It has been noticed that the accuracy in the case of Stratified K-Fold at n_splits = 10 is higher than with the Train-Test Split method. The model has been tested for different values of K.

Fig. 2 Accuracy in different folds in Stratified K-Fold method (accuracy plotted against the number of splits, K = 2 to 20)
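A sketch of the sweep behind Fig. 2; the exact range of K values and the use of cross_val_score are assumptions, and the resulting accuracies will differ from those reported in the paper:

```python
# Sketch: mean accuracy of Logistic Regression under Stratified K-Fold for several K,
# mirroring the comparison shown in Fig. 2 (K = 2 to 20).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

df = pd.read_csv("diabetes.csv")                      # assumed local copy of the Kaggle file
X, y = df.drop(columns="Outcome"), df["Outcome"]

for k in range(2, 21, 2):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy")
    print(f"K = {k:2d}: mean accuracy = {scores.mean():.4f}")
```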
Table 2 Precision, Recall, F1-score, Support and Accuracy values using the Train-Test Split method and Stratified K-Fold, considering all parameters

Method used          Precision     Recall        F1 Score      Support       Accuracy
                     0      1      0      1      0      1      0      1
Train Test Split     79%    66%    84%    82%    62%    59%    151    80     75.32%
Stratified K-Fold    82%    84%    94%    88%    71%    62%    50     26     76.3%
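For reference, the per-class metrics reported in Table 2 follow the standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN); Support is simply the number of test samples in each class:

Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1 Score  = 2 x Precision x Recall / (Precision + Recall)
Accuracy  = (TP + TN) / (TP + TN + FP + FN)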
References
1. Anna V, van der Ploeg HP, Cheung NW, Huxley RR, Bauman AE (2008) Socio-demographic correlates of the increasing trend in prevalence of gestational diabetes mellitus in a large population of women between 1995 and 2005. Diabetes Care 31(12):2288–2293. https://doi.org/10.2337/dc08-1038
2. Després JP, Lemieux I (2006) Abdominal obesity and metabolic syndrome. Nature 444(7121):881–887. https://doi.org/10.1038/nature05488
3. Sudharsan B, Peeples M, Shomali M (2015) Hypoglycemia prediction using machine learning models for patients with type 2 diabetes. J Diabetes Sci Technol 9(1):86–90. https://doi.org/10.1177/1932296814554260
4. Georga EI, Protopappas VC, Ardigò D, Polyzos D, Fotiadis DI (2013) A glucose model based on support vector regression for the prediction of hypoglycemic events under free-living conditions. Diabetes Technol Ther 15(8):634–643. https://doi.org/10.1089/dia.2012.0285
5. Zeng X, Martinez TR (2000) Distribution-balanced stratified cross-validation for accuracy estimation. J Exp Theor Artif Intell 12(1):1–12. https://doi.org/10.1080/095281300146272
6. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth International Group
7. Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI)
13. Sneha, Gangil (2019) Analysis of diabetes mellitus for early prediction using optimal features selection. J Big Data 6:13. https://doi.org/10.1186/s40537-019-0175-6
14. Jakka A, Vakula-Rani J (2019) Performance evaluation of machine learning models for diabetes prediction. IJITEE. https://doi.org/10.35940/ijitee.K2155.0981119

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.