20.k1.0038 Proposal Project Report Kelar
ABSTRACT
Stroke is one of the most serious medical conditions and has a significant impact on public health. Accurate prediction of stroke risk matters because it allows appropriate treatment and intervention to be given to individuals at risk of developing the disease. In recent years, machine learning methods have become popular for improving stroke disease prediction. This research applies the Adaboost method to the C4.5 and K-Nearest Neighbor (KNN) algorithms with the aim of improving stroke prediction performance. Using a relevant dataset, the C4.5 and KNN algorithms were first used separately to predict stroke disease; the Adaboost method was then combined with each of the two algorithms. The results show that applying the Adaboost method to the C4.5 and KNN algorithms successfully improved the performance of stroke disease prediction, providing more accurate and reliable predictions to assist in the diagnosis and treatment of stroke. The combination of KNN with Adaboost reached 91% and the combination of C4.5 with Adaboost reached 95%, a difference of 4%. Therefore, C4.5 is more effective in improving the performance of stroke disease prediction.
TABLE OF CONTENTS
COVER
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
1.1. Background
1.2. Problem Formulation
1.3. Scope
1.4. Objective
CHAPTER 2 LITERATURE STUDY
CHAPTER 3 RESEARCH METHODOLOGY
3.1. Research Methodology
3.2. Dataset Collection
3.3. Data Preprocessing
3.4. Data Splitting
3.5. C4.5 Algorithm
3.6. K-Nearest Neighbor Algorithm
3.7. Adaptive Boosting Method
3.8. Evaluation
CHAPTER 4 IMPLEMENTATION AND RESULTS
4.1. Environment
4.2. Implementation
4.3. Result
4.4. Discussion
CHAPTER 5 CONCLUSION
REFERENCES
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1
INTRODUCTION
1.1. Background
Stroke is a significant global health issue, ranking as the second leading cause of death worldwide and a major contributor to disability. Indonesia, in particular, faces a pressing challenge with increasing stroke cases and high mortality rates [1]. According to data from the 2018 Riskesdas, North Sulawesi Province has the highest prevalence of stroke (14.2%), while Papua Province has the lowest (4.1%) [2]. In addition, according to the Centers for Disease Control and Prevention (CDC), stroke is one of the leading causes of death in the United States: it is a non-communicable disease that accounts for about 11% of all deaths, and more than 795,000 people in the United States experience its adverse effects [3]. The C4.5 algorithm can be used to predict or classify an event by forming a decision tree [4]. K-Nearest Neighbor performs classification based on the closest distance between new data and old data, beginning with the choice of the value of K [5]. The Adaboost method is a supervised data mining algorithm that is widely applied to build classification models.
With the advancement of technology in the medical field, machine learning can be used to predict stroke. Machine learning algorithms are helpful for making accurate predictions and providing accurate analyses. Machine learning has been widely applied to classification and optimisation problems when building intelligent systems that support healthcare providers. Selecting the right method for stroke symptom detection is essential because it affects the results that will be produced [6].
This research applies the Adaboost method to the C4.5 and K-Nearest Neighbor algorithms for stroke disease classification in order to obtain accurate prediction results. In this context, the C4.5 algorithm builds a decision tree model that classifies stroke symptoms as stroke or non-stroke. The K-Nearest Neighbor algorithm calculates the distance between each old data point and the new data and then performs classification based on a predetermined K value. The Adaboost method improves the accuracy of the classification model by combining several weak classification models into one strong classification model. Accuracy can be interpreted as the level of closeness between the predicted value and the actual value [7]. In addition, the test results should be analysed to see how effective each algorithm is.
1.2. Problem Formulation
1. Is the combination of the Adaboost method with the C4.5 algorithm effective in predicting stroke disease?
2. Is the combination of the Adaboost method with the K-Nearest Neighbor algorithm effective in predicting stroke disease?
3. Of the two combinations above, which one is more effective at prediction?
1.3. Scope
The dataset used is Stroke Dataset | kaggle.com, which contains patient data such as id, gender, age, hypertension, heart disease, ever married, work type, residence type, average glucose level, bmi, smoking status, and patient status (stroke or non-stroke). The classification models were built by applying the Adaboost method to the C4.5 and KNN algorithms. This research does not discuss risk factors or causes of stroke; it focuses only on the classification of stroke symptoms to obtain accurate prediction results.
1.4. Objective
The main objective of this research is to show that the Adaboost method applied to the C4.5 and KNN algorithms can provide higher performance for stroke disease classification, because the Adaboost method is considered capable of improving the accuracy of several algorithms across various datasets. The results of this research can then be applied in the health sector to assist health workers in classifying stroke symptoms and producing accurate predictions.
CHAPTER 2
LITERATURE STUDY
In research on applying Adaboost to improve the performance of data mining classification for diabetes, Novianti et al. [5] used the K-Nearest Neighbor algorithm to measure classification performance. The test was carried out five times with K values of 7, 13, 19, 25, and 31. For the KNN algorithm alone, the highest result came from the second test, with 92.90% accuracy. For the KNN algorithm with Adaboost, the highest results came from the first and second tests, both with 95.40% accuracy. The Adaboost method thus increased accuracy by 2.50%.
Research on diagnosing stroke risk levels by Puspitawuri et al. [6] used a dataset with both numerical and categorical attributes, so the researchers used the K-Nearest Neighbor method for the numerical data and Naïve Bayes for the categorical data. The first test examined the effect of data distribution on balanced training classes using 30, 45, and 60 records; for example, 30 training records contained 10 low-risk, 10 medium-risk, and 10 high-risk records. The second test examined the effect of data distribution on unbalanced training classes using 30, 45, and 60 records; for example, 30 training records contained 8 low-risk, 8 medium-risk, and 14 high-risk records. The highest accuracy on balanced class data was 96.67%, with 45 training records and K = 15-22, while on unbalanced classes the highest accuracy was 100%, with 60 training records and K = 20-30. The combined KNN and Naïve Bayes method is therefore suitable for diagnosis because it produces accurate results.
The C4.5 Decision Tree algorithm can optimise classification results to obtain good accuracy. In their research, Pambudi et al. [8] explained that the C4.5 Decision Tree model uses 23 rules, consisting of 14 non-stroke rules and 9 stroke rules. The researchers used two main approaches, qualitative and quantitative, and tested the C4.5 Decision Tree algorithm with confusion matrix measurements and AUC values; the tests produced a prediction accuracy of 96.05%. Meanwhile, Rohman et al. [9] stated that an Adaboost-based C4.5 algorithm, using looping and attribute weighting, was able to improve accuracy in predicting heart disease. The dataset contained 867 records, in which 364 patients were detected as sick and 503 as healthy; after preprocessing, 567 records remained, with 257 detected as sick and 310 as healthy. In testing the C4.5 algorithm and the Adaboost-based C4.5 algorithm with K-Fold Cross Validation, the researchers conducted 10 trials with stratified sampling and local random seeds, yielding higher accuracy results. The tests proved that the Adaboost-based algorithm has a higher accuracy than the plain C4.5 algorithm: 86.59% for the C4.5 model versus 92.24% for the Adaboost-based C4.5 model, a difference of 5.65%. In the ROC-curve evaluation, the AUC value was 0.957 for the C4.5 model and 0.982 for the Adaboost-based C4.5 model. Thus, from the model testing above it can be concluded that the Adaboost-based C4.5 method is better than C4.5 alone for heart disease models.
Based on experiments using three data-split scenarios, Hermawan et al. [10] stated that early prediction of stroke disease from medical records using the Classification and Regression Tree (CART) algorithm produced the highest accuracy, 89.83%, in the split scenario with 80% training data and 20% test data. Their analysis showed that the larger the training data, the greater the accuracy obtained, because in the confusion-matrix evaluation the true positive and true negative counts grow with the larger dataset scenario. This affects the accuracy value, since a true positive is a positive prediction that is correct and a true negative is a negative prediction that is correct; therefore the greatest accuracy is found in the largest dataset scenario.
In research on hepatitis disease prediction, Buani [11] tested the prediction results of the Naïve Bayes algorithm with genetic-algorithm feature selection and obtained an accuracy of 96.77%. This improved on previous research that used the same data and the same Naïve Bayes algorithm, which achieved 83.71%. The 13.06% difference proves that the accuracy of the Naïve Bayes algorithm is better after feature selection with a genetic algorithm.
Handayani et al. [12] explained that the Decision Tree algorithm has a higher true positive count than the Neural Network algorithm. The C4.5 model takes the form of a decision tree; to build it, the first step is to count, from the training data, the number of liver-positive and liver-negative cases in each class for the predetermined attributes, and then to calculate the total entropy using the corresponding equation. The training data used for the Neural Network model is the same, except that the attribute values are converted into numerical values. The network consists of three layers: an input layer of ten neurons (nine attribute neurons and one bias neuron), one hidden layer of eight neurons, and two output neurons representing the Positive Liver and Negative Liver predictions, built with the RapidMiner Framework version 5.2.001. In testing, the C4.5 model produced an accuracy of 75.56% and an AUC of 0.898, while the Neural Network model produced an accuracy of 74.1% and an AUC of 0.671. From these results it can be concluded that the Decision Tree model is more accurate than the Neural Network model.
In a research journal on a coronary heart disease prediction system, Larassati et al. [13] used the Naïve Bayes method on 303 data records consisting of 13 variables and 1 class. Data processing involved cleaning, selection, and transformation, and the Naïve Bayes algorithm was implemented to predict coronary artery disease. Performance was evaluated by measuring the prediction ability against the training data to obtain the accuracy of the applied method. The first experiment used a 60% split, obtaining 177 training records and 119 test records, with an accuracy of 83.1%. The second experiment, with a 70%/30% split, produced an accuracy of 82.02%, while the third, with an 80%/20% split, produced 81.6%. The three experiments show that the amount of data significantly affects the accuracy and that the Naïve Bayes algorithm can be applied to predict coronary artery disease from initial patient examination data.
Based on the literature study above, the C4.5 and KNN algorithms are able to provide high accuracy in classifying diseases. However, research [5][9] proved that applying the Adaboost method to the KNN and C4.5 algorithms can provide higher performance than the KNN and C4.5 algorithms alone. Therefore, in this research the author will show that applying the Adaboost method to the C4.5 and KNN algorithms provides higher classification performance for predicting stroke disease.
CHAPTER 3
RESEARCH METHODOLOGY
3.1. Research Methodology
To achieve good results in this research, a structured research method is essential; the problem-solving steps are described in the following subsections.
3.2. Dataset Collection
The dataset used is Stroke Prediction, taken from Kaggle. The data consists of 43,401 observations with 12 attributes. The data attributes used in this study are presented in Table 3.1.
3.3. Data Preprocessing
3.3.1. Data Cleaning
Data cleaning was carried out to remove duplicate records and empty values, since duplicates and missing data can hinder data processing. Therefore, this research needs to perform data cleaning.
3.3.3. SMOTE Oversampling
The last preprocessing step is oversampling using SMOTE. This changes the amount of data with the label "stroke": the stroke attribute has two values, stroke and non-stroke, and the number of non-stroke records exceeds the number of stroke records. Oversampling is therefore needed so that the class counts become equal and the model produces good accuracy.
3.4. Data Splitting
The data is split into two parts: training and testing. The training set is the part of the dataset used to fit the machine learning algorithm's predictive function, while the testing set is the part used to measure its accuracy. In this research, the module used is sklearn.model_selection.
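As a hedged illustration of the preprocessing and splitting steps above, the following minimal sketch cleans the data, balances the classes with SMOTE, and performs a 70/30 split. The file name is an assumption for illustration only; the label column "stroke" follows the dataset description:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

df = pd.read_csv('stroke.csv')      # hypothetical file name
df.dropna(inplace=True)             # data cleaning: drop empty values
df.drop_duplicates(inplace=True)    # data cleaning: drop identical rows

# Encode categorical columns to numeric values.
for col in df.select_dtypes(include='object').columns:
    df[col] = LabelEncoder().fit_transform(df[col])

X, y = df.drop(columns=['stroke']), df['stroke']
print(y.value_counts())             # imbalanced class counts

# SMOTE synthesizes minority-class samples until the classes are equal.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(y_res.value_counts())         # balanced class counts

# 70% training data and 30% test data, as used in this research.
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.3, random_state=42)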
3.5. C4.5 Algorithm
This research uses the C4.5 algorithm as the classification method for stroke disease. The attribute selection process chooses attributes as nodes, either root nodes or internal nodes, based on the highest gain value among the existing attributes. The data processing steps of the C4.5 algorithm are calculating the entropy value, computing the gain value, and constructing the decision tree and rules accordingly. The formulas for calculating entropy and gain are given in equations (1) and (2), respectively [8]:
\mathrm{Entropy}(S) = \sum_{i=1}^{n} -p_i \log_2(p_i) \qquad (1)

Description:
S : set of cases
n : number of partitions of S
p_i : proportion of S_i to S

\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{i=1}^{n} \frac{|S_i|}{|S|}\,\mathrm{Entropy}(S_i) \qquad (2)

Description:
S : set of cases
A : attribute
n : number of partitions of attribute A
|S_i| : number of cases in partition i
|S| : number of cases in S
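To make equations (1) and (2) concrete, here is a small self-contained sketch that computes entropy and information gain for one candidate attribute; the ten-patient toy sample is invented for illustration:

import numpy as np

def entropy(labels):
    # Equation (1): Entropy(S) = sum_i -p_i * log2(p_i)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, attribute):
    # Equation (2): Gain(S, A) = Entropy(S) - sum_i |S_i|/|S| * Entropy(S_i)
    gain = entropy(labels)
    for value in np.unique(attribute):
        subset = labels[attribute == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# Toy sample: 10 patients, candidate attribute "hypertension", class "stroke".
stroke       = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
hypertension = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
print(entropy(stroke))                         # ~0.971
print(information_gain(stroke, hypertension))  # ~0.256

C4.5 would compute this gain for every attribute and place the attribute with the highest gain at the node.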
3.6. K-Nearest Neighbor Algorithm
The K-Nearest Neighbor algorithm classifies new data based on its distance to several nearest neighbours. The number of nearest neighbours is determined by the user and is expressed as K. K-Nearest Neighbor works from the minimum distance between the new data and the nearest neighbours that have been selected. The goal of this algorithm is to classify new objects based on attributes and training samples. The proximity of neighbours is usually calculated with the Euclidean distance, presented in equation (3) [5]:
E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (3)
Description:
x_i : sample data
y_i : testing data
n : data dimension
i : data index
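A minimal sketch of equation (3) and the K-Nearest Neighbor majority vote follows; the two-feature toy data and K = 3 are invented for illustration:

import numpy as np
from collections import Counter

def euclidean(x, y):
    # Equation (3): E(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return float(np.sqrt(((np.asarray(x) - np.asarray(y)) ** 2).sum()))

def knn_predict(X_train, y_train, x_new, k=3):
    # Distance from the new sample to every training sample.
    distances = [euclidean(x, x_new) for x in X_train]
    # Take the K closest neighbours and let the majority label decide.
    nearest = np.argsort(distances)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Toy data: (age, average glucose level) with a binary stroke label.
X_train = [[60, 200], [65, 210], [30, 90], [25, 85], [70, 190]]
y_train = [1, 1, 0, 0, 1]
print(knn_predict(X_train, y_train, [62, 195], k=3))  # -> 1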
3.7. Adaptive Boosting Method
Adaboost is used to classify data into their respective classes. Adaboost assigns class categories based on the weight values the classes hold, and this process is repeated so that the weights are updated: at each iteration, the weights of misclassified samples are increased. Adaboost is a typical ensemble learning algorithm, and the results it obtains have a strong level of accuracy. An Adaboost ensemble can be formed with the following formula [1][5]:
Y_M(x) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m\, y_m(x)\right) \qquad (4)
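As a hedged illustration of equation (4), the sketch below combines invented weak learners with invented weights α_m; the ensemble output is the sign of the α-weighted sum of their ±1 votes:

import numpy as np

def adaboost_predict(x, weak_learners, alphas):
    # Equation (4): Y_M(x) = sign(sum_{m=1}^{M} alpha_m * y_m(x)),
    # where each weak learner y_m returns +1 or -1.
    votes = np.array([learner(x) for learner in weak_learners])
    return int(np.sign(np.dot(alphas, votes)))

# Three illustrative decision stumps on (age, glucose) inputs.
weak_learners = [
    lambda x: 1 if x[0] > 55 else -1,   # older age suggests stroke
    lambda x: 1 if x[1] > 150 else -1,  # high glucose suggests stroke
    lambda x: 1 if x[0] > 40 else -1,
]
alphas = np.array([0.9, 0.6, 0.3])      # weights learned during boosting

print(adaboost_predict([62, 120], weak_learners, alphas))  # 0.9 - 0.6 + 0.3 > 0 -> 1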
3.8. Evaluation
The data that has been processed and tested is then compared. The main metrics used to evaluate classification models are accuracy, precision, and recall. In this research, model evaluation uses the confusion matrix; based on the confusion-matrix results, the values of accuracy, recall, precision, and F1 score can be determined.
1. Accuracy
Accuracy is the ratio of correct predictions to the overall data.
\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \times 100\% \qquad (5)
2. Precision
Precision is the ratio of correct positive predictions to all positive predictions.
\mathrm{Precision} = \frac{TP}{TP + FP} \times 100\% \qquad (6)
3. Recall
Recall is the ratio of correct positive predictions to all actual positive data.
\mathrm{Recall} = \frac{TP}{TP + FN} \times 100\% \qquad (7)
4. F1 Score
The F1 score is a weighted comparison of average precision and recall.
\mathrm{F1\;Score} = \frac{2 \times (\mathrm{Recall} \times \mathrm{Precision})}{\mathrm{Recall} + \mathrm{Precision}} \qquad (8)
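The following sketch evaluates equations (5)-(8) directly from confusion-matrix counts; the counts are invented for illustration:

def classification_metrics(tp, tn, fp, fn):
    # Equations (5)-(8), expressed in terms of the confusion matrix.
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f1        = 2 * (precision * recall) / (precision + recall)
    return accuracy, precision, recall, f1

# Invented counts, roughly the shape of a balanced 30% test split.
acc, prec, rec, f1 = classification_metrics(tp=10700, tn=10900, fp=500, fn=720)
print(f'accuracy={acc:.2%} precision={prec:.2%} recall={rec:.2%} f1={f1:.2%}')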
CHAPTER 4
IMPLEMENTATION AND RESULTS
4.1. Environment
This research was conducted on an Asus VivoBook 14/15 laptop running the Windows 10 operating system with an Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz (2.30 GHz) processor and 8 GB of RAM. The programming language used is Python 3, run online on Google Colaboratory.
4.2. Implementation
This research combines the Adaboost algorithm with C4.5 and with K-Nearest Neighbors to compare how much each combination improves stroke disease prediction performance. Before making the comparison, this research uses several libraries in the process.
1. import numpy as np
2. import pandas as pd
3. from sklearn.preprocessing import LabelEncoder
4. from sklearn.preprocessing import MinMaxScaler
5. from sklearn.neighbors import KNeighborsClassifier
6. from sklearn.tree import DecisionTreeClassifier
7. from sklearn.ensemble import AdaBoostClassifier
8. from sklearn.ensemble import VotingClassifier
9. from sklearn.model_selection import train_test_split
10. from imblearn.over_sampling import RandomOverSampler
11. from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, f1_score, roc_auc_score, classification_report, accuracy_score
12. from google.colab import drive
13. import warnings
14. warnings.filterwarnings('ignore')
Line 1 imports numpy for numerical computation, and line 2 imports pandas to convert CSV data to and from numerical form. Lines 3, 4, and 10 import the libraries used in data preprocessing. Lines 5-8 import the libraries for data modeling with C4.5, KNN, and Adaboost, while line 11 is used to display the accuracy, precision, recall, and f1-score results. Line 9 is used to divide the data into training and testing sets, and the library on line 12 is used to access and manage datasets stored on Google Drive. The libraries on lines 13 and 14 are used to suppress the warnings generated by the program.
13
15. drive.mount('/content/drive/')
16. dataframe = pd.read_csv("/content/drive/MyDrive/Kuliah smt 7/project_strokes.csv")
17. dataframe
Lines 15-17 mount Google Drive in Google Colab so that the dataset file can be accessed and its existing data structure read.
18. dataframe.dropna(inplace=True)
19. dataframe.drop_duplicates(inplace=True)
20. dataframe.isnull().sum()
21. dataframe.duplicated().sum()
22. labelencoder = LabelEncoder()
Lines 18 and 19 remove rows that contain null or empty values and rows with duplicate content. Lines 20 and 21 verify the cleaning by counting any remaining null or duplicate rows. Line 22 creates the LabelEncoder needed to perform the encoding process.
23. !pip install -U imbalanced-learn
24. from imblearn.over_sampling import SMOTE
25. smote = SMOTE(sampling_strategy='auto', random_state=42)
26. X_resampled, y_resampled = smote.fit_resample(X, y)
27. X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.3, random_state=42)
Lines 23 and 24 install and import the library used for handling class imbalance. Lines 25 and 26 apply SMOTE to the dataset, producing a new dataset in which minority-class samples have been synthetically added until the classes are balanced; the variables X_resampled and y_resampled hold this new dataset (the feature matrix X and label vector y are assumed to have been derived from the encoded dataframe). Line 27 uses the train_test_split function to split the SMOTE-resampled dataset into training data (X_train, y_train) and testing data (X_test, y_test), allocating 30% of the data as test data.
28. c45 = DecisionTreeClassifier(criterion='gini', splitter='random', max_depth=5)
29. c45.fit(X_train, y_train)
30. y_pred_c45 = c45.predict(X_test)
31. y_pred_train_c45 = c45.predict(X_train)
32. knn = KNeighborsClassifier(n_neighbors=5)
33. knn.fit(X_train, y_train)
34. y_pred_knn = knn.predict(X_test)
35. y_pred_train_knn = knn.predict(X_train)
In lines 28 to 47, the program creates and trains the machine learning models: C4.5, KNN, Adaboost, and an ensemble built with the Voting Classifier technique. These models are then used to make predictions on the test (`X_test`) and training (`X_train`) data. The C4.5 model is also combined with Adaboost to get the best prediction results, and an ensemble model is built by combining the votes of the Adaboost and KNN models with the 'hard voting' method.
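Lines 36-47 of the listing are not reproduced in this report. The sketch below is a hedged reconstruction of what that modeling code plausibly looks like, based on the imports above and the parameters reported later in this chapter (max_depth 5, 5 neighbours, 20 estimators); it is not the authors' exact code, and `estimator` is the scikit-learn >= 1.2 keyword (older versions use `base_estimator`):

from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, VotingClassifier

# Assumes X_train, X_test, y_train, y_test from lines 23-27 above.
# C4.5-style tree boosted directly: the tree is AdaBoost's base estimator.
ada_c45 = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=5),
    n_estimators=20, random_state=42)
ada_c45.fit(X_train, y_train)
y_pred_ada_c45 = ada_c45.predict(X_test)

# KNN does not support sample weights, so it cannot serve as an AdaBoost
# base estimator; instead it is combined with AdaBoost by hard voting.
vote_knn_ada = VotingClassifier(
    estimators=[('adaboost', AdaBoostClassifier(n_estimators=20, random_state=42)),
                ('knn', KNeighborsClassifier(n_neighbors=5))],
    voting='hard')
vote_knn_ada.fit(X_train, y_train)
y_pred_vote = vote_knn_ada.predict(X_test)

Hard voting takes the majority of the two models' predicted labels, which matches the report's statement that KNN needs ensemble assistance to be combined with Adaboost.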
4.3. Result
The results are presented starting from preprocessing; the data is then divided into training and testing sets, and accuracy is calculated using the C4.5 algorithm. The optimal experiment uses 70% training data and 30% test data; for C4.5 the best max_depth is 5. The following table shows the calculated results.
[Chart: Confusion Matrix of C4.5 — metric values between 0.75 and 0.95 for test sizes of 20%, 30%, and 40%, classes (0) and (1)]
This is the result of the C4.5 algorithm in predicting stroke disease with the data divided into 20%, 30%, and 40% testing sets. Precision is the percentage of correct positive predictions relative to the total positive predictions. Recall is the percentage of correct positive predictions relative to the actual total of positive data. F1-Score is the weighted harmonic mean of precision and recall; the closer to 1, the better the model. Accuracy is the overall score of the prediction process, calculated from the three values above.
The results are presented starting from preprocessing; the data is then divided into training and testing sets, and accuracy is calculated using the KNN algorithm. The optimal experiment uses 70% training data and 30% test data; for KNN the best number of neighbours is 5. The following table shows the calculated results.
[Table: KNN classification report (surviving row) — class 1: precision 0.80, recall 0.94, f1-score 0.87, support 11,422]
This is the result of the KNN algorithm in predicting stroke disease with the data divided into 20%, 30%, and 40% testing sets; precision, recall, F1-score, and accuracy are defined as above.
The results are presented starting from preprocessing; the data is then divided into training and testing sets, and accuracy is calculated using the Adaboost method. The optimal experiment uses 70% training data and 30% test data; for Adaboost the best number of estimators is 20. The following table shows the calculated results.
[Table: Adaboost classification report (surviving row) — class 1: precision 0.92, recall 0.91, f1-score 0.91, support 11,422]
This is the result of the Adaboost method in predicting stroke disease with the data divided into 20%, 30%, and 40% testing sets; precision, recall, F1-score, and accuracy are defined as above.
The results are presented starting from preprocessing; the data is then divided into training and testing sets, C4.5 is combined with Adaboost, and the accuracy is calculated using the Adaboost method. The following table shows the calculated results.
[Table: C4.5 + Adaboost classification report (surviving row) — class 1: precision 0.96, recall 0.94, f1-score 0.95, support 11,422]
This is the result of the C4.5 and Adaboost combination in predicting stroke disease with the data divided into 20%, 30%, and 40% testing sets; precision, recall, F1-score, and accuracy are defined as above.
The results are presented starting from preprocessing; the data is then divided into training and testing sets, KNN is combined with Adaboost, and the accuracy is calculated using the Adaboost method. The following table shows the calculated results.
[Table: KNN + Adaboost classification report (surviving rows, test size 40%) — class 0: precision 0.87, recall 0.96, f1-score 0.91, support 11,398, accuracy 0.91; class 1: precision 0.96, recall 0.85, f1-score 0.90, support 11,422]
This is the result of the KNN and Adaboost combination in predicting stroke disease with the data divided into 20%, 30%, and 40% testing sets; precision, recall, F1-score, and accuracy are defined as above.
Based on the algorithm testing above, which used a max_depth of 5, 5 neighbours, and 20 estimators, processing with a 30% test size gave good results, although the algorithms differ only slightly in their precision, recall, and f1-score values. For more detail, see the chart below.
[Chart: Confusion Matrix of C4.5 + Adaboost — metric values between 0.90 and 0.96 for test size 30%, classes (0) and (1)]
Based on Figure 4.6 and Figure 4.7, the C4.5 combination achieves higher results than the KNN combination: 95% versus 91%, a difference of 4%.
4.4. Discussion
The tests above use test sizes of 20%, 30%, and 40%. Max depth and the number of neighbours were each tested 20 times, with the optimal value at 5; the number of estimators was tested 10 times, with the optimal value at 20. The best results were not obtained immediately: the researchers applied oversampling so that the classes had the same amount of data, because before oversampling the counts of labels 0 and 1 were highly imbalanced. In addition, the KNN algorithm cannot be combined directly with Adaboost the way C4.5 can; the KNN and Adaboost combination requires ensemble assistance, because the two algorithms have incompatible parameters. After carrying out all the testing steps, good results were finally obtained.
The combination of C4.5 and Adaboost scores above both the plain C4.5 and plain Adaboost algorithms, whereas the combination of KNN and Adaboost scores above plain KNN but below plain Adaboost. Therefore, in this test, the combination of C4.5 and Adaboost performs better than the combination of KNN and Adaboost.
CHAPTER 5
CONCLUSION
Based on the test results of the two combinations, it can be concluded that both combinations help improve the performance of stroke disease prediction, although the performance of the C4.5 and KNN combinations differs. In the C4.5 algorithm, the higher the max depth value, the higher the resulting score, while in KNN the neighbour value makes no significant difference, as is also the case for the Adaboost algorithm. Testing with a max_depth of 5, 20 estimators, and a 30% test size produced a performance of 95% for the combination of C4.5 with Adaboost, while 5 neighbours and 20 estimators with the same 30% test size produced a performance of 91% for the combination of KNN with Adaboost.
The precision, recall, and f1-score results of each combination show no significant difference. In terms of processing time, the C4.5 algorithm processes faster than the KNN algorithm, due to the parameters of each algorithm. It can be concluded that performance results are influenced by the amount of data and the parameters used.
A suggestion for future research is to try combining KNN with Adaboost without the ensemble method, and to try combining other algorithms to find better prediction performance.
REFERENCES