Cardiovascular Diseases Prediction Article

Abstract

Cardiovascular diseases remain the leading cause of death worldwide, and their prevalence, together with the potential for misdiagnosis, underscores the critical importance of early diagnosis. In this research, we aim to replicate and extend the work of "Classification models combined with Boruta feature selection for heart disease prediction" by G. Manikandan et al. (2024). Our study applies the Decision Tree, Logistic Regression, Support Vector Machine (SVM), Random Forest, and XGBoost algorithms with Boruta feature selection to the Cleveland Clinic Heart Disease Dataset from the UCI Machine Learning Repository [1]. Contrary to the findings reported in article [2], no superior results were observed when Boruta feature selection was employed in this investigation. Among all classification methods examined, Logistic Regression without Boruta feature selection emerged as the most effective, achieving an accuracy of 89%.

Introduction
Cardiovascular diseases are the second leading cause of death in France [3]. A recent study revealed that 40% of the participating middle-aged adults were suffering from coronary heart disease without being aware of it, making early diagnosis extremely important. Worldwide, cardiovascular diseases are the leading cause of death: more people die each year from cardiovascular diseases than from any other cause [4]. Additionally, heart disease stands as the leading cause of mortality in the United States, accounting for one out of every four deaths. Furthermore, research has revealed that 5% of patients in the USA are misdiagnosed each year, with potentially fatal consequences.

To combat human error, machine learning is playing an increasingly significant role in the medical diagnosis process [5]; a standard ML algorithm has outperformed 72% of doctors in diagnostics [6]. Machine learning holds the promise to revolutionize clinical decision-making and diagnosis in medicine. Traditional medical diagnosis involves doctors attempting to explain a patient's symptoms by identifying the diseases responsible for them. However, current machine learning approaches to diagnosis rely primarily on associative patterns, identifying diseases strongly correlated with a patient's symptoms. This limitation, the failure to distinguish between correlation and causation, can lead to suboptimal or even hazardous diagnoses.
A comparative analysis was conducted, pitting counterfactual algorithms against the standard associative algorithm and the expertise of 44 doctors on a clinical vignette test set [7]. The results were striking: the associative algorithm achieved an accuracy on par with the top 48% of doctors in the cohort, while the counterfactual algorithm ranked among the top 25% of doctors, demonstrating expert clinical accuracy. These findings underscore the critical importance of incorporating causal reasoning into machine learning applications for medical diagnosis, as it represents a crucial missing element in enhancing diagnostic accuracy. Recent advancements in ML have demonstrated significant potential in enhancing the accuracy of medical diagnoses, challenging the traditional reliance on human expertise alone. Heart disease, notorious for its complex symptomatology and the need for timely intervention, stands at the forefront of this revolution. In the context of cardiovascular health, the precision of diagnosis is not just a technical concern but a critical determinant of patient outcomes. Thus, the role of ML models, such as Support Vector Machines (SVM), Decision Trees, Random Forests, and their ensembles, becomes pivotal in reducing false negatives and refining diagnostic processes.

This paper delves into the comparative effectiveness of these models in predicting heart
disease, drawing insights from previous research which highlights the superior accuracy of
SVM and its variations in comparison to traditional diagnostic methods. It further investigates
the novel SVM Ensemble approach, seeking to understand how its unique construction of
individual SVMs contributes to its diagnostic efficiency. We also explore the intricate role of
feature selection methods, particularly the integration of the ReliefF algorithm and Rough Set
(RS) theory, in refining these ML models for enhanced performance.

As we navigate through this exploration, the paper will present a comprehensive analysis of
each ML model in terms of heart disease prediction, underlining the specific features that
render some models more effective than others. Additionally, we will examine the broader
implications of ensemble learning in medical diagnosis, hypothesizing its potential to mitigate
individual model weaknesses and elevate overall diagnostic accuracy.

Structured to provide a holistic view, the article begins with an in-depth analysis of individual
models, followed by an examination of the dataset in use. Subsequent sections will test our
hypotheses, study the results, and discuss the impact of parameter tuning on model
performance. Finally, we propose a 5-year prediction model, culminating in a conclusion that
not only summarizes our findings but also opens avenues for future research in this vital field
of medical science.

The critical importance of predicting cardiovascular diseases is further emphasized by data from the World Health Organization [4], indicating that these diseases remain the
leading cause of death worldwide. This stark reality underscores the urgency for improved
diagnostic tools and early detection methods. The advancement of machine learning in
healthcare, particularly in cardiovascular diagnostics, offers a promising avenue for
addressing these challenges. By employing sophisticated algorithms such as SVM, Random
Forest, and various ensemble methods, researchers are able to identify patterns and
correlations in patient data that may not be immediately apparent to human practitioners.

The potential of machine learning extends beyond merely supplementing traditional diagnostic methods; it offers a paradigm shift towards more data-driven, precise, and
potentially life-saving medical assessments. The development and refinement of these
algorithms and techniques are critical not only for improving diagnostic accuracy but also for
minimizing the risks associated with misdiagnosis. Inaccurate diagnoses can lead to
improper treatments, exacerbating patient conditions, and increasing healthcare costs.

As the field of machine learning continues to evolve, its integration into clinical practice
becomes increasingly vital. Researchers and healthcare professionals must work
collaboratively to harness the full potential of these technological advancements. The future
of cardiovascular disease diagnosis and treatment lies in the balance between human
medical expertise and the precision of machine learning algorithms. This paper aims to
contribute to this burgeoning field by offering a comprehensive analysis of various machine
learning models, their effectiveness in heart disease prediction, and their potential impact on
patient outcomes and healthcare systems.

Related works and critiques


In the field of heart disease prediction using machine learning, substantial advancements in
methodologies and accuracies have been made, as evidenced by various studies conducted
from 2021 to 2024. These studies have utilized diverse techniques and achieved different
levels of accuracy, each contributing to the evolving landscape of this research area.

The 2021 study by Ashna Jain and Dhruv Roongta, [8] "A Comparative Overview of Machine
Learning Models for Heart Disease Diagnosis," evaluated SVM, Decision Trees, Random
Forests, and SVM Ensembles. The study found that the SVM Ensemble was the most
accurate, achieving an 84.44% accuracy rate. This study highlighted the potential of
ensemble methods in improving predictive accuracy in medical diagnosis but also identified
the need for broader exploration of machine learning models and more detailed error
analysis. The accuracy of 84.44% for the SVM Ensemble is notable, but the study would have benefited from a deeper analysis of the types of errors (false positives and false negatives) made by the model, which is critical for understanding the clinical implications of its predictions. Moreover, the focus on a relatively small set of models may limit the exploration of other potentially effective algorithms.

In 2024, [2] G. Manikandan et al. conducted a study titled "Classification Models Combined
with Boruta Feature Selection for Heart Disease Prediction," which used Decision Tree,
Logistic Regression, and SVM algorithms combined with Boruta feature selection. This study
achieved an accuracy of 88.52% with Logistic Regression, demonstrating the effectiveness
of feature selection in enhancing model performance. However, it suggested the potential for
including a wider range of machine learning models and the utilization of larger and more
diverse datasets.

Vardhan Shorewala's 2021 research, [9] "Early Detection of Coronary Heart Disease Using
Ensemble Techniques," compared K-Nearest Neighbors, Binary Logistic Classification,
Naive Bayes, and ensemble models. The research showed that a stacked model involving
KNN, Random Forest, and SVM with logistic regression as the meta-classifier achieved an
accuracy of 75.1%. The study emphasized the value of ensemble models in reducing bias
and overfitting, though it highlighted the need for further exploration of machine learning
algorithms and a deeper analysis of feature importance. This would provide insights into the
most relevant risk factors and potentially guide feature engineering efforts in future studies.

In 2022, [10] Victor Chang et al. published "An Artificial Intelligence Model for Heart Disease
Detection Using Machine Learning Algorithms." This study utilized Random Forest,
K-Neighbors Classifier, Support Vector Machines, and Decision Trees, with the K-Neighbors
Classifier showing an accuracy of 87%. The study underscored the effectiveness of popular
algorithms like Random Forest and K-Neighbors Classifier but also pointed to the potential
benefits of exploring newer or less common models and conducting a deeper analysis of the
model's predictive performance and error types. Future work could include exploring the
impact of different feature selection techniques and data preprocessing methods to further
enhance the accuracy and reliability of the models.

Gufran Ahmad Ansari et al.'s 2023 article [11], "Performance Evaluation of Machine Learning
Techniques for Heart Disease Prediction," evaluated Logistic Regression, KNN, SVM, Naive
Bayes, Random Forest, and Decision Tree. Both Random Forest and KNN achieved an
impressive accuracy of 99.04%. While this high accuracy is notable, the study would benefit from a more detailed breakdown of performance metrics such as sensitivity, specificity, and the area under the curve (AUC). The comprehensive evaluation of multiple algorithms is a strong aspect of this study, but it could be improved by a deeper analysis of the specific features that contributed most significantly to the predictive accuracy.

Subasish Mohapatra et al.'s 2023 study, [12] "A Stacking Classifiers Model for Detecting
Heart Irregularities and Predicting Cardiovascular Disease," implemented a stacking
classifiers approach and achieved an accuracy of 92% with a precision score of 92.6%. The
study validated the effectiveness of combining multiple models to improve predictive
accuracy, yet it suggested the need for further comparisons with other advanced machine
learning techniques and the application of the model on larger datasets. Future work could
explore the integration of this model with real-world clinical workflows to assess its practical
utility in healthcare settings.

Rusyda Tsaniya Eka Putri et al.'s 2024 research, [13] "GridSearch and Data Splitting for
Effectiveness Heart Disease Classification," utilized algorithms like SVM, Random Forest
Classifier, Logistic Regression, Naïve Bayes, Decision Tree Classifier, KNN, and XGBoost
Classifier. The study emphasized the importance of hyperparameter optimization using
GridSearch and explored different data splitting methods for model validation. While the
specific accuracies were not detailed, the research underscored the potential of machine
learning techniques in heart disease prediction when combined with appropriate data
preprocessing and hyperparameter tuning. While the study provides a thorough evaluation, it
could benefit from a deeper analysis of the impact of different data splitting ratios on model
performance.

Ramanathan G. and Jagadeesha S. N.'s 2023 study [14], "Prediction of Coronary Artery
Disease using Machine Learning – A Comparative Study of Algorithms," compared several
machine learning algorithms, including Logistic Regression, Decision Tree, Random Forest,
Adaptive Boosting, Gradient Boosting, Extreme Gradient Boosting, Light Gradient Boosting,
Support Vector Machine, and K-Nearest Neighbour. Although specific accuracies were not
detailed, the study offered valuable insights into the effectiveness of various algorithms in
predicting coronary artery disease, highlighting the need for deeper exploration of feature
importance and the use of deep learning techniques for further enhancement.

Joseph Kassab et al.'s 2023 article [15], "Comparative Analysis of Chat-Based Artificial
Intelligence Models in Addressing Common and Challenging Valvular Heart Disease Clinical
Scenarios," compared the accuracy of ChatGPT-4.0 and Google Bard in providing
information on valvular heart disease. ChatGPT-4.0 achieved 100% accuracy after
discussion and discrepancy resolution. This study represented an innovative approach to
using chat-based AI models in medical contexts, although its clinical applicability requires
further assessment. The reliance on subjective assessment by cardiologists may introduce bias, and the study could benefit from more objective measures of accuracy. Moreover, the study is limited by its specific focus on valvular heart disease and may not generalize to other medical conditions. Finally, future research could explore the application of these AI tools in broader medical contexts and assess their potential for supporting clinical decision-making.

Qisthi Alhazmi Hidayaturrohman et al.'s 2023 study "A Comparative Study of Machine
Learning Approaches to Heart Disease Prediction: An Empirical Analysis" [16] investigated
SVM, KNN, Decision Tree, Random Forest, and AdaBoost. AdaBoost exhibited the highest accuracy, 91.85%, demonstrating the effectiveness of boosting for heart disease prediction and the significance of normalization and GridSearch hyper-parameter tuning in improving model performance.

Finally, Arpit Gupta et al.'s 2023 study "Heart Disease Classification Using Random Forest"
[17] focused on the Random Forest algorithm, achieving an accuracy of 86.9% with a
diagnosis rate of 93.3%. This study highlights the potential of machine learning, particularly
the Random Forest algorithm, in medical diagnosis. The utilization of Random Forest, known
for its high accuracy and robustness against overfitting, is a strong aspect of the study.
However, comparing the performance of Random Forest with other advanced machine
learning or deep learning algorithms could provide more comprehensive insights.

Overall, these studies reveal an ongoing evolution in heart disease prediction using machine
learning. While many have achieved high accuracies, there is a common need for broader
algorithm exploration, larger and more diverse datasets, deeper error analysis, and
integration of clinical insights. The incorporation of advanced techniques like deep learning
and ensemble methods shows promise, and there is potential for further improvement in
predictive performance and clinical applicability.

The progression from 2021 to 2024 illustrates a trend towards sophisticated methods like
ensemble learning and stacking classifiers, enhancing predictive performance. Feature
selection methods, particularly Boruta, and the exploration of ensemble methods indicate the
value of combining models. The use of chat-based AI models represents an innovative
approach, highlighting the need for continued research in AI tool integration in healthcare.

The collective studies emphasize model accuracy, with many achieving over 80%, and some
reaching above 90%. However, comprehensive analysis beyond accuracy, including
sensitivity, specificity, and error analysis, is needed. The field faces challenges like the need
for larger datasets, deeper clinical integration, and exploration of new algorithms and deep
learning techniques. Future research should focus on these areas to enhance the predictive
power and clinical relevance of machine learning models in heart disease prediction.
Proposal
In my research proposal, I plan to replicate and extend the work presented in "Classification models combined with Boruta feature selection for heart disease prediction" by G. Manikandan et al., published in 2024 [2]. This study demonstrated the effectiveness
combining machine learning algorithms with Boruta feature selection for heart disease
prediction using the Cleveland Clinic Heart Disease Dataset.

My research will focus on replicating the methodology used in the study, applying the same
machine learning algorithms: Decision Tree, Logistic Regression, and Support Vector
Machine (SVM), alongside the Boruta feature selection technique. I will utilize the Cleveland
Clinic Heart Disease Dataset from Kaggle, ensuring consistency with the original study.

The primary objective is to validate the findings of Manikandan et al. by assessing whether
similar results are achievable with an independent implementation. This replication will serve
as a foundation for further exploration. I will investigate the impact of integrating additional
machine learning models, such as ensemble methods and neural networks, to examine if
they can enhance the predictive performance beyond what was achieved in the original
study.

Another critical aspect of my research will be to address the dataset size limitation noted in
the original study. I plan to explore the feasibility of augmenting the dataset with additional
records, either through data synthesis or by integrating similar datasets, to increase its size
and diversity. This will help assess the generalizability and robustness of the model in a
more comprehensive setting.

My research will also include a thorough analysis of the types of misclassifications (false
positives and false negatives) made by the models. Understanding these error types is
crucial for assessing the clinical applicability of the models. This analysis will also help in
identifying areas where the models might need improvement, particularly in terms of
sensitivity and specificity.

Finally, I aim to explore the potential of integrating the findings of this study into clinical
practice. This will involve assessing how the predictive models could be utilized in healthcare
settings and the implications of their deployment in real-world scenarios. The goal is to
contribute to the development of reliable and clinically relevant tools for heart disease
prediction.

In summary, my research will replicate and extend the study by Manikandan et al., focusing
on validating their findings, exploring additional machine learning models, augmenting the
dataset, experimenting with different feature selection techniques, and conducting a detailed
analysis of model performance and errors. The ultimate aim is to enhance the predictive
accuracy and clinical applicability of machine learning models in heart disease prediction.
Proposed system

Dataset
The Cleveland Heart Disease Dataset from the UCI Machine Learning Repository is a pivotal
resource for research in medical informatics and machine learning. It includes 303 instances,
with 14 attributes that offer insights into clinical factors, routine test data, and results from
exercise electrocardiography tests.

Feature | Description | Type
age | Patient's age in years | Numeric
sex | Gender (1 = male; 0 = female) | Nominal
cp | Type of chest pain experienced by the patient, in 4 categories: 0 typical angina, 1 atypical angina, 2 non-anginal pain, 3 asymptomatic | Nominal
trestbps | Patient's resting blood pressure in mm Hg | Numeric
chol | Serum cholesterol in mg/dl | Numeric
fbs | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) | Nominal
restecg | Resting electrocardiogram result, in 3 distinct values: 0 normal, 1 ST-T wave abnormality, 2 probable or definite left ventricular hypertrophy by Estes' criteria | Nominal
thalach | Maximum heart rate achieved | Numeric
exang | Exercise-induced angina (1 = yes; 0 = no) | Nominal
oldpeak | Exercise-induced ST depression relative to rest | Numeric
slope | Slope of the ST segment during peak exercise: 0 upsloping, 1 flat, 2 downsloping | Nominal
ca | Number of major vessels (0-3) | Nominal
thal | Thalassemia, a blood disorder: 0 NULL, 1 normal blood flow, 2 fixed defect, 3 reversible defect | Nominal
target | Target variable: 1 = patient has heart disease, 0 = patient is normal | Nominal
Attributes include age, sex, chest pain type, resting blood pressure, serum cholesterol,
fasting blood sugar, resting electrocardiographic results, and parameters from exercise
electrocardiography tests such as maximum heart rate achieved, exercise-induced angina,
slope of the peak exercise ST segment, ST depression induced by exercise relative to rest,
and presence of thalassemia. The target attribute indicates the presence of heart disease,
with '1' denoting presence and '0' denoting absence.

The dataset is roughly balanced with respect to the target variable, with 54.1% of the instances without disease and 45.9% with disease, which helps avoid bias during machine learning model training. Differences in the mean values of the numerical features between patients with and without disease may be significant for heart disease prediction; this supports the relevance of the study, since a link between the disease and these factors can be found.

Overall, the Cleveland dataset is a critical tool in the realm of heart disease prediction and
research, providing comprehensive data that can be effectively leveraged through various
machine learning algorithms.

Feature selection
In order to enhance the accuracy of predictive models in machine learning, feature selection is a crucial step to identify the most relevant features or variables. Among the various feature selection algorithms, Boruta has gained attention due to its capability to handle high-dimensional datasets and identify statistically significant features. The Boruta method, built around the Random Forest algorithm, assesses the importance of each feature through statistical testing against shadow features: it distinguishes between essential and irrelevant features by evaluating the significance of each original feature in comparison to its corresponding shadow feature.

In this study, the top 6 features were utilized to predict heart disease: age, type of chest pain (cp), maximum heart rate achieved (thalach), ST depression induced by exercise relative to rest (oldpeak), the number of major vessels colored by fluoroscopy (ca), and thalassemia (thal). By applying a selection algorithm such as Boruta to these datasets, it becomes possible to identify important parameters associated with heart disease and improve the precision of predictive models, enhancing their efficiency, interpretability, and generalization.

Step 1: Initialization
The process begins with the initialization of essential variables. The number of features in
the dataset is stored in n_features. An array, feature_importances, is created to accumulate
the importances of each feature across iterations. Additionally, a boolean array,
accepted_features, is initialized to keep track of which features are considered significant.

Step 2: Random Forest Model
A Random Forest classifier, referred to as model, is created with specified parameters. These parameters include the number of estimators, the use of balanced class weights, a maximum depth for the trees, and a specified random state for reproducibility.

Steps 3-6: Iterative Process
The algorithm proceeds through a predefined number of iterations (max_iter). During each iteration, the following steps are carried out:
- Create shadow features: shadow features are generated by randomly permuting the values of each feature. These shadow features, stored in X_shadow, are combined with the original features to form an augmented dataset (X_boruta).
- Train the model: the Random Forest model is trained on the augmented dataset (X_boruta).
- Evaluate feature importance: the importance of each feature is assessed, and the maximum importance among the shadow features is identified.
- Update the feature importance history: the importance of each feature is accumulated over the iterations to form a comprehensive history.
- Decide feature significance: each feature's importance is compared to the maximum importance observed among the shadow features. Features that are consistently more important than the shadow features are marked as significant.
- Update accepted features: features deemed significant more frequently than a predefined threshold are marked as accepted in the accepted_features array.

Step 7: Confirmation
After completing the iterations, the feature selection is finalized: features that have been accepted frequently enough, surpassing the predefined acceptance threshold, are considered significant.

Application and Display
The Boruta feature selection is applied to the dataset. The boolean mask of accepted features is converted to indices, representing the selected features. These indices are used to select features from the original DataFrame (X). Finally, the selected features are displayed, showcasing the outcome of the Boruta feature selection process.

In summary, the Boruta algorithm leverages the Random Forest model and the concept of
shadow features to robustly identify and retain important features, enhancing the model
training and prediction process.
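To make the procedure concrete, the following is a minimal Python sketch of the loop described above. Variable names (X_shadow, X_boruta) follow the text; the Random Forest settings, iteration count, and acceptance threshold are illustrative assumptions, not the exact values used in this study.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def boruta_selection(X: pd.DataFrame, y, max_iter=50, threshold=0.5, random_state=42):
    """Simplified Boruta: accept features that beat the best shadow feature
    in more than `threshold` of the iterations. Settings are illustrative."""
    rng = np.random.default_rng(random_state)
    n_features = X.shape[1]
    hits = np.zeros(n_features)  # how often each feature beat all shadows

    # Step 2: Random Forest with balanced class weights and a depth limit
    model = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                   max_depth=5, random_state=random_state)

    for _ in range(max_iter):  # Steps 3-6: iterative process
        # Shadow features: column-wise random permutations of the originals
        X_shadow = X.apply(lambda col: rng.permutation(col.values))
        X_shadow.columns = ["shadow_" + c for c in X.columns]
        X_boruta = pd.concat([X, X_shadow], axis=1)

        model.fit(X_boruta, y)
        importances = model.feature_importances_
        shadow_max = importances[n_features:].max()    # best shadow importance
        hits += importances[:n_features] > shadow_max  # significance decision

    # Step 7: confirm features accepted more often than the threshold
    accepted = (hits / max_iter) > threshold
    return list(X.columns[accepted])

# Usage (X: DataFrame of the 13 predictors, y: target):
# selected = boruta_selection(X, y)
# Expected here: ['age', 'cp', 'thalach', 'oldpeak', 'ca', 'thal']
```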
Following the Boruta feature selection process, only six features remain. The features are:
'age', 'cp' (chest pain type), 'thalach' (maximum heart rate achieved), 'oldpeak' (ST
depression induced by exercise relative to rest), 'ca' (number of major vessels colored by
fluoroscopy), and 'thal' (a blood disorder called thalassemia).

Models/Classifiers
The Logistic Regression, SVM, Decision Tree, XGBoost, and Random Forest methods
presented in the article titled 'Classification models combined with Boruta feature selection
for heart disease prediction' have been revisited. Experiments were conducted both with and
without the Boruta feature selection to assess its effectiveness. Furthermore, my own
method was implemented, and the results were compared with those obtained using the
mentioned models.

Logistic Regression

Attributes such as the learning rate, the number of iterations, and the regularization strength were assigned to the Logistic Regression model. A sigmoid function, sigma(z) = 1 / (1 + e^(-z)), is used to transform a given input z; it is a key component of logistic regression, as it maps linear combinations of features to probabilities between 0 and 1.

The fit method is responsible for training the logistic regression model. It initializes the
model's weights and bias to zero and then performs gradient descent to optimize the model's
parameters. During each iteration, it calculates model predictions, computes the gradient,
and updates the weights and bias accordingly. Regularization is applied to the gradient to
prevent overfitting.

The probabilities of the positive class for a given dataset 'X' are calculated. It does so by
computing the dot product of the input features with the model's weights and adding the bias,
followed by applying the sigmoid function. Then, binary outcomes are predicted based on a
threshold of 0.5. It converts the probabilities into binary predictions, with values of 1 for
probabilities greater than the threshold and 0 otherwise.
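A minimal sketch of such a from-scratch implementation is shown below, assuming NumPy arrays and binary labels in {0, 1}; the default hyperparameters mirror the tuned values reported in the hyperparameter table in the Results section, but this is an illustration, not the exact code used here.

```python
import numpy as np

class LogisticRegressionGD:
    """From-scratch logistic regression with L2 regularization (a sketch)."""

    def __init__(self, learning_rate=1e-5, max_iter=1000, regularization_strength=1e-4):
        self.lr = learning_rate
        self.max_iter = max_iter
        self.reg = regularization_strength

    @staticmethod
    def _sigmoid(z):
        # Maps any real input to a probability in (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)  # weights initialized to zero
        self.b = 0.0                   # bias initialized to zero
        for _ in range(self.max_iter):
            p = self._sigmoid(X @ self.w + self.b)  # current predictions
            # Gradient of the cross-entropy loss, plus the L2 penalty on w
            grad_w = X.T @ (p - y) / n_samples + self.reg * self.w
            grad_b = np.mean(p - y)
            self.w -= self.lr * grad_w
            self.b -= self.lr * grad_b
        return self

    def predict_proba(self, X):
        return self._sigmoid(X @ self.w + self.b)

    def predict(self, X, threshold=0.5):
        # 1 where the probability exceeds the threshold, 0 otherwise
        return (self.predict_proba(X) > threshold).astype(int)
```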

Decision tree
Attributes such as the Gini impurity, the number of samples, the number of samples per
class, and the predicted class were assigned to the Decision Tree model. Additionally, it
includes attributes for the feature index, threshold, left child node, and right child node.

The Gini impurity serves as a metric to quantify the level of impurity or disorder within a set of samples, commonly utilized as a splitting criterion in decision tree algorithms. The model aims to minimize it. It is computed using the following formula:

Gini(S) = 1 - Σ_k p_k^2

where p_k is the proportion of samples in S belonging to class k.

Then, the best feature index and threshold for splitting the dataset based on Gini impurity are identified. The minimum number of samples per leaf is taken into consideration, and the algorithm then iterates through potential splits to determine the optimal one.

The decision tree is constructed recursively. It evaluates the Gini impurity and identifies the
predicted class for the current node. If the depth of the tree is less than the specified
maximum depth, the function proceeds with the best split based on Gini impurity. It splits the
dataset into left and right subsets and continues building the tree for each subset. Thus, the
decision tree can make predictions for a given dataset. It traverses the tree from the root
node to the leaf nodes, determining the predicted class for each data point.

Furthermore, the maximum depth and minimum samples per leaf are defined to control the
tree's growth. A decision tree is constructed by fitting it to the filtered training data, using the
specified parameters.
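As an illustration, the recursive construction just described can be sketched as follows. This is a simplified version, assuming a NumPy feature matrix and integer class labels; the defaults max_depth=5 and min_samples_leaf=5 follow the best values in the hyperparameter table, and the exact implementation used in the experiments may differ.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array: 1 - sum_k p_k^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, min_samples_leaf=5):
    """Find the (feature, threshold) pair minimizing weighted Gini impurity."""
    best_j, best_t, best_g = None, None, gini(y)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) < min_samples_leaf or len(right) < min_samples_leaf:
                continue  # respect the minimum samples per leaf
            g = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if g < best_g:
                best_j, best_t, best_g = j, t, g
    return best_j, best_t

def build_tree(X, y, depth=0, max_depth=5, min_samples_leaf=5):
    """Recursively grow the tree; a node is a dict, a leaf is a class label."""
    majority = np.bincount(y).argmax()
    if depth >= max_depth:
        return majority
    j, t = best_split(X, y, min_samples_leaf)
    if j is None:  # no split improves impurity: make a leaf
        return majority
    mask = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth, min_samples_leaf),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples_leaf)}

def predict_one(node, x):
    """Traverse from the root to a leaf to classify a single data point."""
    while isinstance(node, dict):
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node
```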
SVM

An optimal hyperplane that maximizes the separation between data points belonging to two
distinct classes is identified through the training of a Support Vector Machine (SVM) on
labeled data. This process serves to widen the margin between these classes. Attributes such as the learning rate, the number of epochs, the penalty parameter C, the choice of kernel function, the degree for the polynomial kernel, gamma for the RBF kernel, alpha, and class weights were assigned to the Support Vector Machine model. These hyperparameters can be adjusted to influence the SVM's behavior.

Weights and bias are initialized to zero, and slack variables are used to handle
non-separable cases. A hinge loss function is employed, which includes L2 regularization
when class weights are specified.

Systematically, a predefined range of hyperparameters is explored using nested loops. For each combination of hyperparameters, an SVM is created and trained using the provided configuration. This process involves adjusting the SVM's parameters, such as the learning rate, the number of epochs, C, and the kernel type.

To implement an SVM with a regularization term C, the loss function in the gradient descent step is modified to incorporate the soft margin, and slack variables are introduced to allow some margin of error or misclassification. The regularization term C controls the trade-off between maximizing the margin and minimizing the classification error. This overall approach allows comprehensive hyperparameter tuning and model evaluation, contributing to a deeper understanding of the SVM's behavior in different scenarios.
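As an illustration, here is a compact sketch of a soft-margin linear SVM trained by subgradient descent on the hinge loss (the linear kernel was ultimately selected, per the hyperparameter table). Labels are assumed to be in {-1, +1}; kernel functions and class weights are omitted for brevity, so this is a simplified stand-in for the full implementation.

```python
import numpy as np

def train_linear_svm(X, y, learning_rate=1e-5, num_epochs=500, C=0.1):
    """Soft-margin linear SVM via subgradient descent on the hinge loss.
    Objective: 0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)  # weights initialized to zero
    b = 0.0                   # bias initialized to zero
    for _ in range(num_epochs):
        margins = y * (X @ w + b)
        violating = margins < 1  # points inside the margin or misclassified
        # Subgradient: the L2 term pulls w toward 0; violators push the hyperplane
        grad_w = w - C * (y[violating, None] * X[violating]).sum(axis=0)
        grad_b = -C * y[violating].sum()
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def predict_svm(X, w, b):
    # Sign of the decision function gives the predicted class (-1 or +1)
    return np.sign(X @ w + b)
```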

XGBOOST

Ensemble techniques seek to diminish variance within a single model by amalgamating a variety of dissimilar or similar models. In the context of boosting, a robust classifier is constructed through an iterative process that combines multiple distinct weak classifiers. XGBoost is distinguished for its use of gradient boosting frameworks that are efficient, flexible, and associated with a high degree of predictive accuracy. The essence of gradient boosting involves iteratively adding models to an ensemble, specifically aiming to minimize a loss function L, which quantifies the difference between the predicted and actual values. The general update equation for the model at iteration t can be represented as:

F_t(x) = F_(t-1)(x) + η h_t(x)

where F_t is the ensemble after t iterations, h_t is the weak learner added at iteration t, and η is the learning rate.

The core of the methodology involves a comprehensive grid search aimed at fine-tuning the hyperparameters of the XGBoost classifier, with specific attention to the learning rate, maximum tree depth, number of estimators, and subsample ratio. The learning rate scales the contribution of each tree to the final outcome, the maximum depth controls tree complexity, the number of estimators determines how many trees are constructed before predictions are made, and the subsample ratio dictates the fraction of the sample used to train each tree, guarding against overfitting.

Multiple combinations of parameter values are explored systematically. Cross-validation is employed to ascertain which parameter configuration yields the best performance, as measured by a predefined scoring metric, with accuracy being our chosen metric. Using a 5-fold cross-validation strategy, the grid search identifies the best-performing model over the specified parameter grid.
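A sketch of this grid search, assuming the scikit-learn-compatible XGBClassifier wrapper from the xgboost package, could look as follows; the grid mirrors the search space listed in the hyperparameter table in the Results section.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Search space mirroring the hyperparameter table
param_grid = {
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 4, 5],
    "n_estimators": [100, 200, 300],
    "subsample": [0.8, 0.9, 1.0],
}

grid = GridSearchCV(
    estimator=XGBClassifier(random_state=42),
    param_grid=param_grid,
    scoring="accuracy",  # the predefined scoring metric
    cv=5,                # 5-fold cross-validation
)
# grid.fit(X_train, y_train)
# best_xgb = grid.best_estimator_
```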

Random Forest
Random Forest operates on the principle of ensemble learning, where multiple decision
trees (T) are combined to improve the overall model's accuracy and robustness. The
ensemble's prediction (F(x)) for a classification problem can be described as the mode of the
predictions made by individual trees:

Each decision tree in a Random Forest is built using a subset of the training data, selected
randomly with replacement (bootstrap sampling). The split at each node in the tree is
determined based on a subset of features chosen randomly, optimizing some criterion,
typically the Gini impurity or entropy for classification tasks.

The Gini impurity (I(indice)G ) for a set is defined as:

A Random Forest classification model is optimized using a grid search technique. The
parameter grid encompasses various hyperparameters, including the number of estimators,
maximum depth, minimum samples required for splitting and leaf nodes, and the option of
bootstrapping.

The Random Forest model is instantiated with a random state of 42 to ensure reproducibility.
Subsequently, a grid search is conducted with cross-validation (cv=5) to explore different
combinations of hyperparameters, aiming to maximize the accuracy score. The grid search
process is executed in parallel using multiple CPU cores (n_jobs=-1) and is set to provide
verbose output for tracking its progress.

Upon completion of the grid search, the best-performing model is extracted from the search
results, capturing the optimal set of hyperparameters.
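A corresponding sketch of the Random Forest grid search, following the settings described above (random_state=42, cv=5, n_jobs=-1, verbose output); the grid itself mirrors the hyperparameter table.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 4, 5, 20],
    "min_samples_split": [2, 5, 10, 30],
    "min_samples_leaf": [1, 2, 4, 20],
    "bootstrap": [True, False],
}

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),  # fixed seed for reproducibility
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,        # 5-fold cross-validation
    n_jobs=-1,   # run in parallel on all CPU cores
    verbose=2,   # progress output while searching
)
# grid.fit(X_train, y_train)
# best_rf = grid.best_estimator_  # model with the optimal hyperparameters
```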
Results and discussion

Evaluation process

Evaluation metrics

A confusion matrix, also referred to as an error matrix, visually represents a model's performance by tabulating the counts of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) outcomes. This matrix serves as a comprehensive summary of a classification model's effectiveness. Several performance metrics, including the F1 score, recall (sensitivity), specificity, accuracy, and precision, are computed from the confusion matrix. These metrics are instrumental in assessing the model's ability to classify instances correctly across the different classes.

Accuracy assesses the overall performance of the model by calculating the proportion of
correctly predicted instances:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision represents the relative number of accurately predicted positive instances among
all the instances that were predicted as positive:

Precision = TP / (TP + FP)

Recall measures the fraction of correctly predicted positive instances out of all the actual
positive instances:

Recall = TP / (TP + FN)

The F1 score serves as a balanced statistic that combines precision and recall into a single metric, offering a concise evaluation of the model's correctness:

F1 = 2 * (precision * recall) / (precision + recall)
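These formulas translate directly into code. The following helper computes all four metrics from confusion-matrix counts, with a hypothetical set of counts as a worked example:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Worked example with illustrative counts: 50 TP, 45 TN, 5 FP, 10 FN
# -> accuracy ~ 0.864, precision ~ 0.909, recall ~ 0.833, F1 ~ 0.870
print(classification_metrics(50, 45, 5, 10))
```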

Average

In the context of evaluating the performance of classification models, two commonly employed methods for aggregating precision, recall, F1-score, and related metrics are the Macro Average ("Macro Avg") and the Weighted Average ("Weighted Avg"). These techniques summarize the model's performance across multiple classes and provide insights into its overall effectiveness.
Macro Avg, or Macro Average, is a method that computes the average of each metric (such
as precision, recall, and F1-score) independently for every class within the dataset. These
class-specific metrics are then aggregated, resulting in an overall performance assessment.
Notably, this approach treats each class with equal importance, irrespective of its
representation in the dataset. Macro Avg is particularly relevant when all classes are
considered equally significant in the task at hand, and the aim is to evaluate the model's
overall proficiency without taking into account potential class imbalances.
On the other hand, Weighted Avg, or Weighted Average, calculates the average of metrics
while assigning weights proportional to the size or frequency of each class. Consequently,
larger classes exert a more pronounced influence on the final metric values compared to
smaller classes. This weighting mechanism is especially beneficial in scenarios where class
distributions are imbalanced, as it accounts for the relative importance of each class based
on its occurrence in the dataset.

In conclusion, given that the classes in this dataset are balanced, the Macro Average approach is the appropriate choice. It ensures that each class contributes equally to the overall evaluation, thereby providing a fair and unbiased assessment of model performance.
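The difference between the two averaging schemes can be checked with scikit-learn's precision_recall_fscore_support on a small toy example (the labels below are illustrative):

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 0, 1, 1, 1]  # toy labels with perfectly balanced classes
y_pred = [0, 1, 0, 1, 1, 0]

# Macro: unweighted mean over classes; Weighted: mean weighted by class support
macro = precision_recall_fscore_support(y_true, y_pred, average="macro")
weighted = precision_recall_fscore_support(y_true, y_pred, average="weighted")
print(macro, weighted)  # identical here, because both classes have equal support
```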

Hyperparameters selection

To identify the most effective set of parameters for each algorithm, a comprehensive
examination was conducted using the GridSearch technique coupled with Cross-Validation
(often referred to as GridSearchCV). This method meticulously explores every possible
combination of hyper-parameters to pinpoint the most advantageous one, thereby enhancing
the model's performance. The specific hyper-parameters that were considered during this
process for each algorithm are detailed in the table below.

Prediction Model | Hyper-parameter Name | Parameter Selection | Best Hyper-parameter
Logistic Regression | learning_rate | [1e-06, 1e-05, 0.0001, 0.001] | 1e-05
Logistic Regression | regularization_strength | [0.0001, 0.001, 0.01] | 0.0001
Logistic Regression | max_iter | [500, 1000, 2000, 3000] | 1000
Decision Tree | max_depth | [2, 3, 4, 5, 10, 15, 20, 30] | 5
Decision Tree | min_samples_leaf | [2, 3, 4, 5, 10, 15, 20, 30] | 5
SVM | learning_rate | [1e-05, 0.0001, 0.001, 0.01] | 1e-05
SVM | num_epochs | [500, 1000, 2000] | 500
SVM | C | [0.1, 1, 10, 100] | 0.1
SVM | kernel | ['linear', 'poly', 'sigmoid'] | linear
XGBoost | learning_rate | [0.01, 0.1, 0.2] | 0.01
XGBoost | max_depth | [3, 4, 5] | 5
XGBoost | n_estimators | [100, 200, 300] | 100
XGBoost | subsample | [0.8, 0.9, 1.0] | 1.0
Random Forest | n_estimators | [100, 200, 300] | 100
Random Forest | max_depth | [3, 4, 5, 20] | 3
Random Forest | min_samples_split | [2, 5, 10, 30] | 2
Random Forest | min_samples_leaf | [1, 2, 4, 20] | 20
Random Forest | bootstrap | [True, False] | False

Feature analysis (SHAP)

SHAP values quantify the impact of each feature on the prediction made by the model, with higher absolute values indicating a greater influence. Here, we use the data with Boruta feature selection.
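For tree-based models, SHAP values and summary plots like those discussed below can be produced with the shap library roughly as follows; model and X_selected are placeholders for a fitted classifier and the Boruta-selected feature matrix (linear or kernel models would use a different explainer, e.g. shap.KernelExplainer).

```python
import shap

# Assuming `model` is a fitted tree-based classifier (e.g. the Random Forest
# or XGBoost model above) and `X_selected` holds the six Boruta-selected features.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_selected)

# Bar chart of mean |SHAP value| per feature, as in the summary plots below
shap.summary_plot(shap_values, X_selected, plot_type="bar")
```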

The SHAP summary plot for the Logistic Regression model displays the mean absolute
impact of each feature. The most influential feature appears to be 'thalach' (maximum heart
rate achieved), followed by 'thal' (a blood disorder called thalassemia), 'ca' (number of major
vessels), and 'oldpeak' (ST-depression induced by exercise relative to rest). Notably, 'age'
and 'cp' (chest pain type) also contribute to the model output, albeit to a lesser extent.

For the Decision Tree model, the SHAP summary plot presents the mean impact of features
for two classes (Class 0 and Class 1). Features such as 'thal', 'cp', and 'ca' show contrasting
effects on the two classes. 'thal' and 'cp' have a more pronounced impact on Class 1, while
'ca' influences Class 0 more. 'oldpeak' and 'age' affect both classes, but with less magnitude
compared to the aforementioned features.

In the SVM model's SHAP summary plot, 'thalach' is the most impactful feature by a
significant margin. Other features like 'age', 'ca', 'thal', 'cp', and 'oldpeak' also contribute to
the model's predictions, with 'age' and 'ca' having a substantial impact. The visualization
suggests that 'thalach' is the dominant feature in the SVM model's decision-making process.
The SHAP summary plot for the XGBoost model indicates that 'ca' is the feature with the
highest mean absolute SHAP value, implying a strong influence on the model's predictions.
'thal' and 'cp' are also important, with 'oldpeak', 'thalach', and 'age' following in terms of
impact.

For the Random Forest model, we observe trends similar to the Decision Tree model. The features 'ca', 'cp', 'thal', 'oldpeak',
and 'age' show distinct impacts on each class, highlighting their significance in the predictive
process.
Across all models, features related to cardiac performance ('thalach', 'oldpeak') and medical
conditions ('thal', 'ca') are consistently important in predicting heart disease. Variations in the
impact of these features across models indicate that each model processes the features
differently due to their inherent algorithmic characteristics. Moreover, the presence of 'cp'
(chest pain type) among the top features across different models emphasizes its clinical
relevance in heart disease diagnosis. The insights from these SHAP value visualizations can
guide medical professionals in focusing on specific clinical assessments that are most
indicative of heart disease, potentially leading to more accurate diagnoses and personalized
treatment strategies.

Discussion

It is worth noting that, to replicate the methodology outlined in the research article, the models had to be constructed manually. As a result, it proved challenging to achieve levels of accuracy comparable to those obtained with prebuilt models, such as those available through Scikit-learn. Nonetheless, the same pre-processing, in particular the Boruta feature selection, was applied to all models, allowing meaningful comparisons to be made.

In this section, the results of our experiments comparing the performance of the various machine learning models are presented. The accuracy, precision, recall, and F1-score of each model, without and with Boruta feature selection, are examined and juxtaposed against the results reported in the reference article. Additionally, our results are compared with those of several other relevant studies.
Accuracy (%)

Model | Article (without) | Article (with Boruta) | Ours (without) | Ours (with Boruta) | My method
Logistic Regression | 88.52 | 88.52 | 89.00 | 67.00 | 84.00
Decision Tree | 75.41 | 80.33 | 48.00 | 53.00 | 84.00
SVM | 81.97 | 83.61 | 52.00 | 47.00 | 87.00
XGBoost | 81.97 | 77.05 | 89.00 | 82.00 | 84.00
Random Forest | 83.61 | 80.33 | 89.00 | 87.00 | *

Precision (%)

Model | Article (without) | Article (with Boruta) | Ours (without) | Ours (with Boruta) | My method
Logistic Regression | 87.88 | 87.88 | 88.00 | 72.00 | 84.00
Decision Tree | 84.00 | 85.71 | 24.00 | 26.00 | 81.00
SVM | 81.82 | 82.35 | 26.00 | 24.00 | 87.00
XGBoost | 86.21 | 76.47 | 89.00 | 83.00 | 84.00
Random Forest | 84.38 | 77.78 | 89.00 | 87.00 | *

Recall (%)

Model | Article (without) | Article (with Boruta) | Ours (without) | Ours (with Boruta) | My method
Logistic Regression | 90.62 | 90.62 | 89.00 | 66.00 | 84.00
Decision Tree | 65.62 | 75.00 | 50.00 | 50.00 | 81.00
SVM | 84.38 | 87.50 | 50.00 | 50.00 | 87.00
XGBoost | 78.12 | 81.25 | 89.00 | 82.00 | 83.00
Random Forest | 84.38 | 87.50 | 89.00 | 87.00 | *

F1-score (%)

Model | Article (without) | Article (with Boruta) | Ours (without) | Ours (with Boruta) | My method
Logistic Regression | 89.23 | 89.23 | 89.00 | 64.00 | 84.00
Decision Tree | 73.68 | 80.00 | 32.00 | 35.00 | 81.00
SVM | 83.08 | 84.85 | 34.00 | 32.00 | 87.00
XGBoost | 81.97 | 78.79 | 89.00 | 82.00 | 83.00
Random Forest | 84.38 | 82.35 | 89.00 | 87.00 | *

Our results demonstrate varied levels of performance compared to the reference article, particularly with the inclusion of Boruta feature selection. Notably, logistic regression achieved comparable accuracy and precision, while other models such as decision trees and SVM exhibited significant differences.

Our results were further compared with those reported in ten other relevant articles, situating our proposed approach within existing research. As depicted in the table below, our approach with Boruta feature selection yielded competitive results, outperforming or closely matching the accuracy of several methods.

Reference | Method | Accuracy
[2] | Logistic Regression with Boruta feature selection | 88.52%
[8] | SVM Ensembles | 84.44%
[9] | Stacked model involving KNN, Random Forest, and SVM | 75.1%
[10] | K-Neighbors Classifier | 87%
[11] | Random Forest and KNN | 99.04%
[12] | Stacking classifiers model | 92%
[13] | Random Forest Classifier, SVM, and Logistic Regression | 90%
[14] | Decision Tree | 98%
[15] | SVM Normalized | 92.08%
[16] | Random Forest | 86.9%
Proposed approach with Boruta feature selection | Decision Tree | 53%
Proposed approach with Boruta feature selection | Logistic Regression | 67%
Proposed approach with Boruta feature selection | SVM | 47%
Proposed approach with Boruta feature selection | XGBoost | 82%
Proposed approach with Boruta feature selection | Random Forest | 87%

In conclusion, our results demonstrated varied levels of performance compared to the reference article, particularly with the inclusion of Boruta feature selection. While logistic regression achieved comparable accuracy and precision, significant differences were observed for other models such as decision trees and SVM. Compared with the methodologies employed in ten other relevant articles, our approach with Boruta feature selection yielded competitive results, outperforming or closely matching the accuracy of several methods. These findings underscore the efficacy of the approach in diverse contexts and warrant further exploration in future research.

Limitations
In the conducted research, several limitations were encountered that must be
acknowledged. Firstly, the complexity of the models and the feature selection process,
particularly the Boruta method, may have introduced challenges in achieving optimal model
performance. The reliance on manual construction of models, rather than utilizing prebuilt solutions such as those available in libraries like Scikit-learn, potentially affected the consistency and comparability of results. This approach may have limited the ability to fully replicate and extend the findings of the referenced study by G. Manikandan et al. (2024) [2], particularly in terms of achieving comparable levels of accuracy.

Moreover, the dataset size was constrained, as the research was based on the Cleveland
Heart Disease Dataset from the UCI Machine Learning Repository, which consists of 303
instances. This limitation may have affected the generalizability of the findings, as a larger
dataset could potentially offer a more comprehensive understanding of the predictive
capabilities of the ML models employed.

Another limitation relates to the fact that the study's scope was limited to specific ML models,
including Logistic Regression, SVM, Decision Tree, XGBoost, and Random Forest, with a
focus on the Boruta feature selection method. While these models were chosen based on
their relevance and performance in similar studies, the exclusion of other potential models or
feature selection techniques could have led to missed opportunities for discovering more
effective predictive approaches.

Finally, the integration of the predictive models into clinical practice was discussed as a
potential area of exploration. However, this aspect of the research remains theoretical, with
practical challenges related to deployment, usability, and acceptance in healthcare settings
not being directly addressed in the study.

Overall, these limitations highlight the need for further research in this area, including the
exploration of larger and more diverse datasets, the examination of additional ML models
and feature selection techniques, and the integration of a wider range of clinical and
non-clinical factors to enhance the predictive accuracy and applicability of cardiovascular
disease prediction models.

Conclusion
Overall, the results underscore the potential of Boruta feature selection and its application in
improving the performance of machine learning models, warranting further investigation and
validation in various domains.

The research aimed to replicate and extend the work of "Classification models combined with Boruta feature selection for heart disease prediction" by G. Manikandan et al. (2024) [2], focusing on the application of Decision Tree, Logistic Regression, and SVM algorithms alongside Boruta feature selection on the Cleveland Clinic Heart Disease Dataset. Despite
encountering limitations related to model complexity, dataset size, and the manual
construction of models, the study provided valuable insights into the potential of machine
learning in heart disease prediction.
The findings highlight the critical role of feature selection, particularly the Boruta method, in
enhancing model performance. Moreover, the investigation into additional machine learning
models and the analysis of misclassifications underscore the importance of continuous
exploration and improvement in the field of medical machine learning.

Although the integration of predictive models into clinical settings remains a theoretical
exploration, this research contributes to the ongoing dialogue on the potential of machine
learning to revolutionize heart disease diagnosis and treatment planning. Future studies
should focus on overcoming the limitations identified, particularly in terms of data diversity,
model implementation, and clinical integration, to further enhance the predictive accuracy
and clinical applicability of machine learning models in the context of heart disease
prediction.

References
[1]“Heart Disease Cleveland,” www.kaggle.com.
https://www.kaggle.com/datasets/ritwikb3/heart-disease-cleveland

[2] G. Manikandan, B. Pragadeesh, V. Manojkumar, A. L. Karthikeyan, R. Manikandan, and A. H. Gandomi, “Classification models combined with Boruta feature selection for heart disease prediction,” Informatics in Medicine Unlocked, 2024. Available: https://www.sciencedirect.com/science/article/pii/S2352914823002885

[3] “5. Principales causes de décès et de morbidité.” Available: https://drees.solidarites-sante.gouv.fr/sites/default/files/2021-01/Principales%20causes%20de%20d%C3%A9c%C3%A8s%20et%20de%20morbidite.pdf

[4] “Cardiovascular diseases (CVDs),” www.who.int, May 17, 2017. https://www.who.int/fr/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)

[5]“Early heart disease detection saves lives,” www.providence.org, Jan. 14, 2021.
https://www.providence.org/news/uf/644387305

[6]J. G. Richens, C. M. Lee, and S. Johri, “Improving the accuracy of medical diagnosis with
causal machine learning,” Nature Communications, vol. 11, no. 1, Aug. 2020, doi:
https://doi.org/10.1038/s41467-020-17419-7.

[7]J. G. Richens, C. M. Lee, and S. Johri, “Improving the accuracy of medical diagnosis with
causal machine learning,” Nature Communications, vol. 11, no. 1, Aug. 2020, doi:
https://doi.org/10.1038/s41467-020-17419-7.

[8] A. Jain and D. Roongta, “A comparative overview of Machine Learning models for heart disease diagnosis,” Aug. 2021. Available: https://www.researchgate.net/profile/Dhruv-Roongta-2/publication/353863296_A_comparative_overview_of_Machine_Learning_models_for_heart_disease_diagnosis/links/6115ec20169a1a0103f9578a/A-comparative-overview-of-Machine-Learning-models-for-heart-disease-diagnosis.pdf

[9] V. Shorewala, “Early detection of coronary heart disease using ensemble techniques,”
Informatics in Medicine Unlocked, p. 100655, Jul. 2021, doi:
https://doi.org/10.1016/j.imu.2021.100655.

[10] V. Chang, V. R. Bhavani, A. Q. Xu, and A. Hossain, “An artificial intelligence model for
heart disease detection using machine learning algorithms,” Healthcare Analytics, vol. 2, p.
100016, Jan. 2022, doi: https://doi.org/10.1016/j.health.2022.100016.

[11] G. A. Ansari, S. S. Bhat, M. D. Ansari, S. Ahmad, J. Nazeer, and A. E. M. Eljialy, “Performance Evaluation of Machine Learning Techniques (MLT) for Heart Disease Prediction,” Computational and Mathematical Methods in Medicine, vol. 2023, p. e8191261, May 2023, doi: https://doi.org/10.1155/2023/8191261.

[12] S. Mohapatra et al., “A stacking classifiers model for detecting heart irregularities and
predicting Cardiovascular Disease,” Healthcare Analytics, vol. 3, p. 100133, Nov. 2023, doi:
https://doi.org/10.1016/j.health.2022.100133.

[13] Rusyda Tsaniya Eka Putri, Junta Zeniarja, Sri Winarno, Ailsa Nurina Cahyani, and
Ahmad Alaik Maulani, “GridSearch and Data Splitting for Effectiveness Heart Disease
Classification,” Sinkron : jurnal dan penelitian teknik informatika, vol. 9, no. 1, pp. 317–331,
Jan. 2024, doi: https://doi.org/10.33395/sinkron.v9i1.13198.

[14] Ramanathan G. and Jagadeesha S. N., “Prediction of Coronary Artery Disease using Machine Learning – A Comparative Study of Algorithms,” International Journal of Health Sciences and Pharmacy, pp. 180–209, Dec. 2023, doi: https://doi.org/10.47992/ijhsp.2581.6411.0116.

[15] J. Kassab et al., “Comparative Analysis of Chat-Based Artificial Intelligence Models in Addressing Common and Challenging Valvular Heart Disease Clinical Scenarios,” Journal of the American Heart Association, vol. 12, no. 22, Nov. 2023, doi: https://doi.org/10.1161/jaha.123.031787.

[16] Qisthi Alhazmi Hidayaturrohman, Hulya Gokalp Clarke, Gaye Yesim Taflan, and Idris
Sancaktar, “A comparative study of machine learning approaches to heart disease
prediction: an empirical analysis,” Research Square (Research Square), Jun. 2023, doi:
https://doi.org/10.21203/rs.3.rs-3098962/v1.

[17] S. Mohan, C. Thirumalai, and G. Srivastava, “Effective Heart Disease Prediction Using
Hybrid Machine Learning Techniques,” IEEE Access, vol. 7, pp. 81542–81554, 2019, doi:
https://doi.org/10.1109/access.2019.2923707.

The whole code can be found here: https://github.com/Flolav/Heart-Desease-Prediction/tree/main
