
Implementation of Logistic Regression On Diabetic Dataset Using Train-Test-Split, K-Fold and Stratified K-Fold Approach

The document discusses implementing logistic regression, k-fold cross validation, and stratified k-fold cross validation techniques on a diabetic dataset. It analyzes these methods and compares their performance on the dataset which contains 768 samples to predict diabetes.


Natl. Acad. Sci. Lett. (September–October 2022) 45(5):401–404
https://doi.org/10.1007/s40009-022-01131-9

SHORT COMMUNICATION

Implementation of Logistic Regression on Diabetic Dataset using Train-Test-Split, K-Fold and Stratified K-Fold Approach

Meenu Bhagat¹ • Brijesh Bakariya¹

Received: 28 October 2021 / Revised: 10 May 2022 / Accepted: 17 May 2022 / Published online: 8 July 2022
© The Author(s), under exclusive licence to The National Academy of Sciences, India 2022

Abstract Diabetes is a chronic metabolic disorder that causes high blood sugar, which in turn severely affects organs such as the heart, liver, kidneys, lungs, eyes, nerves and blood vessels. There are three types of diabetes: Type-1 diabetes, Type-2 diabetes, and gestational diabetes. In Type-1 diabetes, the body of the patient fails to produce insulin. In Type-2 diabetes, the cells of the body fail to respond to insulin effectively. Gestational diabetes occurs during pregnancy. Many approaches are used to analyse this disease; we have used a machine learning approach. We have used 768 records from the "Pima diabetes dataset". In this paper, we have used logistic regression with train-test split, K-Fold cross-validation and stratified K-Fold approaches.

Keywords Diabetes · Logistic Regression · Machine Learning · Train-test split · K-Fold · Stratified K-Fold

Significance Statement: In this paper, the proposed approach analyses the implementation of train-test split, K-Fold, and stratified K-Fold cross-validation techniques while using logistic regression on a diabetic database.

Meenu Bhagat
meenubhagat@yahoo.com
Brijesh Bakariya
dr.brijeshbakariya@ptu.ac.in
¹ Department of Computer Science and Engineering, I.K. Gujral Punjab Technical University, Kapurthala, Punjab, India

Introduction

Diabetes can be broadly categorized into three types: Type 1 diabetes, Type 2 diabetes, and gestational diabetes. In Type 1 diabetes, the immune system destroys the beta cells in the pancreas, which are the cells that make insulin. Due to the lack of insulin, glucose from food is not transferred to the cells, leading to many short-term and long-term problems. In Type 2 diabetes, the body becomes insulin resistant, so that the cells are starved while excess glucose remains in the bloodstream. Gestational diabetes is a condition experienced during pregnancy. High blood glucose levels can be caused by a combination of hormones and increased insulin content during pregnancy. The chance of the newborn baby developing diabetes is also high [1]. A variety of factors are believed to play a role in the onset and progression of diabetes. Given the clear causal association between obesity and the onset of diabetes [2], obesity is a major risk factor, especially in Type 2 diabetes. Sudharsan et al. [3] used machine learning methods such as Random Forest, support vector machines (SVM), K-nearest neighbour, and Naive Bayes to predict hypoglycaemia among Type 2 diabetes patients, while Georga et al. [4] used support vector regression for the same purpose. Train-Test Split [5] is a typical strategy in which the original dataset is divided into two parts, a train set and a test set. The train set is used for training the classifier and the test set is used to estimate the accuracy of the classifier. The drawback of this method is that a large share of the dataset is set aside for testing and that, in certain situations, the dataset may represent only a specific kind of data, for example a certain age group, a certain city, or a certain income group.

Cross-Validation: In a typical (K-Fold) cross-validation method, a dataset D is equally partitioned into k disjoint
subsets (folds). In each of the k rounds, k − 1 folds are used for training the classifier and the remaining fold is used for testing, so that every fold serves as the test set once. Stratified cross-validation is an extended form of cross-validation [6]. In it, each class is distributed uniformly among the k folds, so that the distribution of a class in each fold is the same as in the original dataset. Regular cross-validation, on the other hand, arbitrarily partitions D into k folds without taking class distributions into account. K-Fold cross-validation could therefore result in a certain class being distributed unevenly, with some folds containing more cases of the class than others. Kohavi [7] compared many accuracy estimation techniques and found that cross-validation performs better than the other techniques, and that stratification further improves the performance by lowering the bias and variance. Xu et al. [8] used a variety of machine learning algorithms to predict diabetes; among these algorithms, Random Forest was found to be more accurate than the other data mining techniques. According to Kavakiotis et al. [9], tenfold cross-validation was used as the evaluation method for three different algorithms, i.e. logistic regression, support vector machines and Naive Bayes, and in terms of accuracy and performance, support vector machines outperformed the other two algorithms. We have taken our dataset (Table 1) from Kaggle [10]. On the Pima Indians Diabetes Database (PIDD), Sisodia et al. [11] found that the NB classifier outperforms the SVM and DT machine learning algorithms, with an accuracy of 76.30 percent. In the study by Diwani et al. [12], all patients' data were trained and tested using tenfold cross-validation with Naive Bayes and decision trees; the best algorithm, according to their results, was Naive Bayes with 76.3021% accuracy. Using different classifiers such as Decision Tree, SVM, KNN, RF, and NB, Sneha and Gangil [13] proposed a model for the early detection of diabetes; SVM ranked first among these classifiers with 77.33% accuracy. Jakka and Vakula Rani [14] suggested a performance evaluation approach based on decision-making classifiers, using algorithms that include LR, SVM, KNN, RF, and NB; LR had the highest accuracy, 77.60%, among these classifiers.

The database contains 768 samples, of which 268 are positive class instances (outcome "1") and 500 are negative class instances (outcome "0"). The features of this dataset are as follows:

Table 1 Features of PIMA dataset
Sr. No   Features
1)       Number of Pregnancies (NOP)
2)       Plasma Glucose Concentration Within 2 Hours (PGC)
3)       Diastolic Blood Pressure (DBP)
4)       Triceps Skin Fold Thickness
5)       Serum Insulin Within 2 h
6)       Body Mass Index
7)       Diabetes Pedigree Function
8)       Age
9)       Outcome
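The difference between plain and stratified K-Fold described above can be illustrated with the class counts of this dataset. The following is a minimal sketch that uses only the Python standard library. It is not the authors' code: the helper names (`kfold_indices`, `stratified_kfold_indices`, `positives_per_fold`) are hypothetical, and it assumes 500 samples of class 0 and 268 of class 1 with k = 10.

```python
import random

def kfold_indices(n_samples, k, seed=0):
    """Plain K-Fold: shuffle all indices, then deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def stratified_kfold_indices(labels, k, seed=0):
    """Stratified K-Fold: deal the indices of each class separately into the k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds

# Assumed class counts: 500 samples of class 0 and 268 of class 1 (768 in total).
labels = [0] * 500 + [1] * 268
k = 10

plain_folds = kfold_indices(len(labels), k)
strat_folds = stratified_kfold_indices(labels, k)

def positives_per_fold(folds):
    """Number of class-1 samples that ended up in each fold."""
    return [sum(labels[i] for i in fold) for fold in folds]

plain_pos = positives_per_fold(plain_folds)
strat_pos = positives_per_fold(strat_folds)
print("plain K-Fold, class-1 count per fold:     ", plain_pos)   # varies with the shuffle
print("stratified K-Fold, class-1 count per fold:", strat_pos)   # → [27, 27, 27, 27, 27, 27, 27, 27, 26, 26]

# In each round one fold is the test set and the other k - 1 folds form the training set.
for test_fold in strat_folds:
    train = [i for fold in strat_folds if fold is not test_fold for i in fold]
    assert len(train) + len(test_fold) == len(labels)
```

With stratification, every fold holds exactly 50 samples of class 0 and 26 or 27 of class 1, whereas the plain split lets the class-1 count drift from fold to fold. The support values of 50 and 26 listed for stratified K-Fold in Table 2 appear consistent with one such fold.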
Figure 1 shows the general steps followed:
1. We checked the database for null values.
2. We imputed the zero values in the affected columns with the mean or median of the respective column.
3. For validation, we used train-test split, K-Fold cross-validation, and stratified K-Fold.
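Steps 1 and 2 above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation. The function names and the sample glucose readings are hypothetical, and it assumes that the fill value is computed from the non-zero entries of the column, which the text does not state explicitly.

```python
from statistics import mean, median

def count_missing(values):
    """Step 1: count entries that are None (null) and entries that are zero."""
    nulls = sum(v is None for v in values)
    zeros = sum(v == 0 for v in values if v is not None)
    return nulls, zeros

def impute_zeros(values, strategy="median"):
    """Step 2: replace zeros by the mean or median of the non-zero entries.

    Assumes the null check of step 1 found no None values in the column.
    """
    non_zero = [v for v in values if v != 0]
    if not non_zero:
        return list(values)  # nothing to estimate the fill value from
    fill = median(non_zero) if strategy == "median" else mean(non_zero)
    return [fill if v == 0 else v for v in values]

# Hypothetical plasma glucose readings; a zero marks a missing measurement.
glucose = [0, 90, 120, 0, 150, 200]

print(count_missing(glucose))            # → (0, 2)
glucose_median = impute_zeros(glucose, "median")
glucose_mean = impute_zeros(glucose, "mean")
print(glucose_median)
print(glucose_mean)
```

The median is less sensitive to extreme readings than the mean, which is one reason to prefer it for skewed columns; which of the two was applied to which column is not specified in the text.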

We trained our model with logistic regression using train-test split, K-Fold, and stratified K-Fold. Table 2 shows the precision, recall and F1-score values obtained with all eight input parameters using the train-test split method.
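The hold-out split and the per-class measures reported in Table 2 can be sketched as below, again with the Python standard library only. Two things here are assumptions rather than facts from the paper. First, the 30% test fraction is inferred from the support counts in Table 2 (151 + 80 = 231 test samples out of 768). Second, the confusion-matrix counts are hypothetical, chosen only so that the class totals and the overall accuracy agree with the train-test row of Table 2; the paper does not report a confusion matrix.

```python
import math
import random

def train_test_indices(n_samples, test_fraction, seed=0):
    """Shuffle the sample indices and hold out a fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_fraction * n_samples)
    return indices[n_test:], indices[:n_test]

def per_class_metrics(tn, fp, fn, tp):
    """Precision, recall, F1 and support for classes 0 and 1, plus overall accuracy."""
    def prf(hits, false_alarms, misses):
        precision = hits / (hits + false_alarms)
        recall = hits / (hits + misses)
        f1 = 2 * precision * recall / (precision + recall)
        return {"precision": precision, "recall": recall, "f1": f1}

    # For class 0 the correct predictions are the true negatives.
    class0 = prf(tn, fn, fp)
    class0["support"] = tn + fp
    class1 = prf(tp, fp, fn)
    class1["support"] = fn + tp
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    return {0: class0, 1: class1, "accuracy": accuracy}

train_idx, test_idx = train_test_indices(768, 0.3)
print(len(train_idx), len(test_idx))  # → 537 231

# Hypothetical counts (not from the paper): 151 actual class-0 and 80 actual class-1 samples.
metrics = per_class_metrics(tn=127, fp=24, fn=33, tp=47)
for cls in (0, 1):
    m = metrics[cls]
    print(f"class {cls}: precision={m['precision']:.2%} recall={m['recall']:.2%} "
          f"f1={m['f1']:.2%} support={m['support']}")
print(f"accuracy={metrics['accuracy']:.2%}")
```

Because accuracy only counts correct predictions overall, it can look acceptable even when recall for the minority class is low, which is why the per-class values are worth reporting alongside it.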
Fig. 1 General process (flowchart): Dataset → Preprocessing of dataset (1. check null values; 2. impute data for missing values) → Interpretation (logistic regression) → Performance evaluation (precision, F1 score, recall, accuracy) → Comparative analysis (logistic regression with train-test split, K-Fold cross-validation and stratified K-Fold) → Result

Table 2 Precision, Recall, F1-score, Support and Accuracy values using the Train-Test Split method and Stratified K-Fold, considering all parameters
Method used         Precision     Recall        F1 Score      Support     Accuracy
                    0      1      0      1      0      1      0     1
Train Test Split    79%    66%    84%    82%    62%    59%    151   80    75.32%
Stratified K-Fold   82%    84%    94%    88%    71%    62%    50    26    76.3%

Figure 2 depicts the accuracy values obtained with the stratified K-Fold method. The accuracy of stratified K-Fold at n-splits = 10 is higher than that of the train-test split method. The model was also tested for different numbers of folds (K = 2, 4, 6, 8, 10, 12, 14, 16, 18, 20) (Fig. 3), and the mean accuracy was highest (76.71%) at K = 16.

Fig. 2 Accuracy in different folds in Stratified K-Fold method

Fig. 3 Accuracy values using K-Fold cross-validation for K = 2 to K = 20 (x-axis: values of K-Fold; y-axis: accuracy, ticks from 75.6 to 77)

In this paper, we have worked on the diabetes dataset. This work can also be extended to the prediction of other diseases. We have used only logistic regression in this study; other machine learning classifiers such as Naïve Bayes, Random Forest, and KNN can also be used for research purposes. Train-test split, K-Fold cross-validation and stratified K-Fold methods are used in this research. This work can be extended using different types of datasets with different machine learning algorithms.

References

1. Anna V, van der Ploeg HP, Cheung NW, Huxley RR, Bauman AE (2008) Socio-demographic correlates of the increasing trend in prevalence of gestational diabetes mellitus in a large population of women between 1995 and 2005. Diabetes Care 31(12):2288–2293. https://doi.org/10.2337/dc08-1038
2. Després JP, Lemieux I (2006) Abdominal obesity and metabolic syndrome. Nature 444(7121):881–887. https://doi.org/10.1038/nature05488
3. Sudharsan B, Peeples M, Shomali M (2015) Hypoglycemia prediction using machine learning models for patients with type 2 diabetes. J Diabetes Sci Technol 9(1):86–90. https://doi.org/10.1177/1932296814554260
4. Georga EI, Protopappas VC, Ardigò D, Polyzos D, Fotiadis DI (2013) A glucose model based on support vector regression for the prediction of hypoglycemic events under free-living conditions. Diabetes Technol Ther 15(8):634–643. https://doi.org/10.1089/dia.2012.0285
5. Zeng X, Martinez TR (2000) Distribution-balanced stratified cross-validation for accuracy estimation. J Exp Theor Artif Intell 12(1):1–12. https://doi.org/10.1080/095281300146272
6. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth International Group
7. Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the international joint conference on artificial intelligence (IJCAI), 1137–1143
8. Xu W, Zhang J, Zhang Q, Wei X (2017) Risk prediction of type II diabetes based on random forest model. 2017 Third International Conference on Advances in Electrical Electronics Information Communication and Bio-Informatics (AEEICB). https://doi.org/10.1109/AEEICB.2017.7972337
9. Kavakiotis I, Tsave O, Salifoglou A, Maglaveras N, Vlahavas I, Chouvarda I (2017) Machine learning and data mining methods in diabetes research. Comput Struct Biotechnol J 8(15):104–116. https://doi.org/10.1016/j.csbj.2016.12.005
10. Kaggle.com. 'Pima Indians diabetes data set' (Online). https://www.kaggle.com/uciml/pima-indians-diabetes-database. Accessed 7 June 2020
11. Sisodia D, Sisodia DS (2018) Prediction of diabetes using classification algorithms. Procedia Comput Sci 132:1578–1585
12. Diwani SA, Sam AE (2014) Diabetes forecasting using supervised learning techniques. Adv Comput Sci: Int J 3:10–18

13. Sneha and Gangil (2019) Analysis of diabetes mellitus for early prediction using optimal features selection. J Big Data 6:13. https://doi.org/10.1186/s40537-019-0175-6
14. Jakka A, Vakula-Rani J (2019) Performance evaluation of machine learning models for diabetes prediction. IJITEE. https://doi.org/10.35940/ijitee.K2155.0981119

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
