Implementation of Logistic Regression On Diabetic Dataset Using Train-Test-Split, K-Fold and Stratified K-Fold Approach
SHORT COMMUNICATION
Received: 28 October 2021 / Revised: 10 May 2022 / Accepted: 17 May 2022 / Published online: 8 July 2022
The Author(s), under exclusive licence to The National Academy of Sciences, India 2022
subsets. In a particular iteration of K-Fold cross-validation, k-1 folds are used for training the classifier and the remaining fold is used for testing. Stratified cross-validation is an extended form of cross-validation [6]. In it, each class is distributed uniformly among the n folds, so that the class distribution in each fold is the same as in the original dataset. Regular cross-validation, on the other hand, arbitrarily partitions S into n folds without taking class distributions into account; K-Fold cross-validation can therefore leave a class distributed unevenly, with some folds containing more cases of that class than others. Kohavi [7] compared many accuracy estimation techniques and found that cross-validation performs better than the other techniques, and that stratification further improves performance by lowering bias and variance.
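As a minimal illustration of this difference (a sketch only; scikit-learn, the synthetic label array and the 5-split setting below are assumptions, not details from the paper), the per-fold class counts produced by plain K-Fold and Stratified K-Fold can be compared directly:

```python
# Sketch: compare per-fold class balance under plain K-Fold vs Stratified K-Fold.
# The data are synthetic and imbalanced (15 samples of class 0, 5 of class 1).
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)

for name, splitter in [("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
                       ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    print(name)
    for fold, (_, test_idx) in enumerate(splitter.split(X, y)):
        # StratifiedKFold keeps the original 3:1 class ratio in every test fold;
        # plain KFold may concentrate the minority class in a few folds.
        print(f"  fold {fold}: test class counts = {np.bincount(y[test_idx], minlength=2)}")
```

With stratification, each test fold reproduces the original class proportion; without it, individual folds can be noticeably unbalanced.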
Weifeng Xu et al. [8] used a variety of machine learning algorithms to predict diabetes; of these, Random Forest was found to be more accurate than the other data mining techniques. Kavakiotis et al. [9] used tenfold cross-validation as the evaluation method for three different algorithms, i.e. Logistic Regression, Support Vector Machines and Naive Bayes; in terms of accuracy and performance, Support Vector Machines outperformed the other two. We have taken our dataset (Table 1) from Kaggle [10]. On PIDD, Sisodia et al. [11] found that, among the SVM, NB and DT machine learning algorithms, the NB classifier performs best, with an accuracy of 76.30%. In Amour Diwani et al.'s study [12], all patients' data were trained and tested using tenfold cross-validation with Naive Bayes and Decision Trees; the best algorithm, according to their results, was Naive Bayes with 76.3021% accuracy. Using different classifiers such as Decision Tree, SVM, KNN, RF and NB, Sneha and Gangil [13] proposed a model for the early detection of diabetes; SVM ranks first among these classifiers with 77.33% accuracy. Aishwarya Jakka and Vakula Rani [14] suggested a performance evaluation approach based on decision-making classifiers; LR, SVM, KNN, RF and NB are some of the algorithms used, and LR has the highest accuracy of 77.60% among them.

The database contains 768 samples, of which 500 are negative class instances, i.e. "0", and 268 are positive class instances, i.e. "1". The features of this dataset are listed in Table 1.

Table 1 Features of PIMA dataset

Sr. No  Feature
1       Number of Pregnancies (NOP)
2       Plasma Glucose Concentration Within 2 Hours (PGC)
3       Diastolic Blood Pressure (DBP)
4       Triceps Skin Fold Thickness
5       Serum Insulin Within 2 Hours
6       Body Mass Index
7       Diabetes Pedigree Function
8       Age
9       Outcome
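For concreteness, a minimal sketch of loading the dataset and checking the figures just quoted; the file name diabetes.csv, the use of pandas, and the column name Outcome are assumptions based on the common Kaggle copy of the data, not details given in the paper:

```python
# Sketch: load the PIMA Indians Diabetes data and check sample count and class balance.
import pandas as pd

df = pd.read_csv("diabetes.csv")        # assumed local copy of the Kaggle file
print(df.shape)                         # expected (768, 9): 768 samples, 8 features + Outcome
print(df["Outcome"].value_counts())     # expected: 500 samples of class 0, 268 of class 1
```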
Figure 1 shows the general steps to be followed.

Fig. 1 General Process: Dataset → Preprocessing of dataset (1. check null values; 2. impute data for missing values) → Interpretation (Logistic Regression) → Performance Evaluation (Precision, F1 score, Recall, Accuracy) → Comparative analysis (Logistic Regression with Train-Test Split, K-Fold Cross-Validation, Stratified K-Fold) → Result
1. We have checked the database for null values.
2. We have imputed the columns that contain zero values with the mean or median of those columns (see the sketch after this list).
3. For validation, we used Train-Test Split, K-Fold cross-validation, and Stratified K-Fold.
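The following sketch illustrates steps 1 and 2 of the list above; the specific columns treated as having missing values and the choice of the median are assumptions for illustration, since the paper only states that zero-valued columns were imputed with the mean or median:

```python
# Sketch: check for nulls, then replace physiologically impossible zeros with column medians.
import numpy as np
import pandas as pd

df = pd.read_csv("diabetes.csv")        # assumed local copy of the Kaggle file

# Step 1: check the database for null values.
print(df.isnull().sum())

# Step 2: treat zeros in these columns as missing and impute with the column median
# (the column list is the usual PIMA choice, assumed here for illustration).
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)
df[zero_as_missing] = df[zero_as_missing].fillna(df[zero_as_missing].median())
```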
We have trained our model with Logistic Regression using Train-Test Split, K-Fold, and Stratified K-Fold. Table 2 shows the Precision, Recall and F1-score values, taking all eight input parameters, using the Train-Test Split method.
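A sketch of the train-test-split evaluation described here; the 30% test size (which matches the 231 test samples, 151 + 80, reported in Table 2), the random seed and the solver settings are assumptions, since the paper does not report them:

```python
# Sketch: Logistic Regression evaluated with a single train/test split on all eight inputs.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("diabetes.csv")                      # assumed local copy of the Kaggle file
X, y = df.drop(columns="Outcome"), df["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)             # raised max_iter to ensure convergence
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))          # per-class precision, recall, F1, support
print("accuracy:", accuracy_score(y_test, y_pred))
```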
Figure 2 depicts the accuracy values obtained using the Stratified K-Fold method. It has been noticed that the accuracy in the case of Stratified K-Fold at n_splits = 10 is higher than with the Train-Test Split method. The model has been tested for different values of K.

Fig. 2 Accuracy in different folds in Stratified K-Fold method (accuracy plotted against the number of splits, K = 2 to 20)
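A sketch of the sweep behind Fig. 2; the exact range of K values and the use of cross_val_score are assumptions, and the resulting accuracies will differ from those reported in the paper:

```python
# Sketch: mean accuracy of Logistic Regression under Stratified K-Fold for several K,
# mirroring the comparison shown in Fig. 2 (K = 2 to 20).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

df = pd.read_csv("diabetes.csv")                      # assumed local copy of the Kaggle file
X, y = df.drop(columns="Outcome"), df["Outcome"]

for k in range(2, 21, 2):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy")
    print(f"K = {k:2d}: mean accuracy = {scores.mean():.4f}")
```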
Table 2 Precision, Recall, F1-score, Support and Accuracy values using the Train-Test Split method and Stratified K-Fold, considering all parameters

Method used          Precision     Recall        F1 Score      Support       Accuracy
                     0      1      0      1      0      1      0      1
Train Test Split     79%    66%    84%    82%    62%    59%    151    80     75.32%
Stratified K-Fold    82%    84%    94%    88%    71%    62%    50     26     76.3%
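For reference, the per-class metrics reported in Table 2 follow the standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN); Support is simply the number of test samples in each class:

Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1 Score  = 2 x Precision x Recall / (Precision + Recall)
Accuracy  = (TP + TN) / (TP + TN + FP + FN)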
References
1. Anna V, van der Ploeg HP, Cheung NW, Huxley RR, Bauman AE (2008) Socio-demographic correlates of the increasing trend in prevalence of gestational diabetes mellitus in a large population of women between 1995 and 2005. Diabetes Care 31(12):2288–2293. https://doi.org/10.2337/dc08-1038
2. Després JP, Lemieux I (2006) Abdominal obesity and metabolic syndrome. Nature 444(7121):881–887. https://doi.org/10.1038/nature05488
3. Sudharsan B, Peeples M, Shomali M (2015) Hypoglycemia prediction using machine learning models for patients with type 2 diabetes. J Diabetes Sci Technol 9(1):86–90. https://doi.org/10.1177/1932296814554260
4. Georga EI, Protopappas VC, Ardigò D, Polyzos D, Fotiadis DI (2013) A glucose model based on support vector regression for the prediction of hypoglycemic events under free-living conditions. Diabetes Technol Ther 15(8):634–643. https://doi.org/10.1089/dia.2012.0285
5. Zeng X, Martinez TR (2000) Distribution-balanced stratified cross-validation for accuracy estimation. J Exp Theor Artif Intell 12(1):1–12. https://doi.org/10.1080/095281300146272
6. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth International Group
7. Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI)
13. Sneha, Gangil (2019) Analysis of diabetes mellitus for early prediction using optimal features selection. J Big Data 6:13. https://doi.org/10.1186/s40537-019-0175-6
14. Jakka A, Vakula-Rani J (2019) Performance evaluation of machine learning models for diabetes prediction. IJITEE. https://doi.org/10.35940/ijitee.K2155.0981119

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.