Heart Disease Prediction System Report
AUTONOMOUS
BASAVANAGUDI BANGALORE – 560004
PROJECT REPORT ON
HEART DISEASE PREDICTION SYSTEM
PROJECT GUIDE:
PROF. VIJAY RAGHVAN
SUBMITTED BY
ADITYA KUMAR ROY (U18EZS0175)
SIDPARA HARSH LALITBHAI (U18EZS0105)
THE NATIONAL COLLEGE
AUTONOMOUS
BASAVANAGUDI, BANGALORE – 560004
CERTIFICATE
This is to certify that the Project entitled HEART DISEASE PREDICTION SYSTEM
was carried out by ADITYA KUMAR ROY (U18EZS0175) and SIDPARA HARSH
LALITBHAI (U18EZS0105) in fulfilment of the Sixth Semester Bachelor of
Computer Application project lab prescribed by The National Degree College,
Autonomous, Basavanagudi, Bangalore, during the year 2021-2024.
We are wholeheartedly thankful to Prof. Ravi Hegde, Head of the Department
of Computer Science, The National College, Basavanagudi, for allowing us to
carry out this project.
Our sincere thanks to the lecturers of the Computer Science department who
contributed directly to the successful completion of this project.
ABSTRACT
Machine learning and artificial intelligence have proved useful across many disciplines, especially as the volume of available data has grown enormously in recent years. They can support better and faster decisions in disease prediction, so machine learning algorithms are increasingly applied to predict various diseases. Constructing a model can also help us visualize and analyze diseases to improve reporting consistency and accuracy. This article investigates how to detect heart disease by applying various machine learning algorithms, following a two-step process. First, the heart disease dataset is prepared in the format required by the machine learning algorithms. Medical records and other information about patients are gathered from the UCI repository. The heart disease dataset is then used to determine whether or not the patients have heart disease. Second, the accuracy rates of five machine learning algorithms, namely Logistic Regression, Support Vector Machine, K-Nearest Neighbors, Random Forest, and Gradient Boosting Classifier, are validated through the confusion matrix. Current findings suggest that the Logistic Regression algorithm gives the highest accuracy rate, 95%, compared with the other algorithms. It also shows higher f1-score, recall, and precision than the other four algorithms. However, increasing the accuracy rates of the machine learning algorithms to approximately 97% to 100% remains a challenge and is left for future study.
Keywords: Machine Learning, Artificial Intelligence, Heart Disease, Logistic Regression, Support
Vector Machine, K-Nearest-Neighbors, Random Forest, Decision Tree, Gradient Boosting
INTRODUCTION
Machine learning (ML) is a part of artificial intelligence (AI) that allows a software application to improve its prediction accuracy without being explicitly programmed. To forecast new output values, machine learning algorithms use historical data as input [1]. Machine learning is a significant and diversified field, and its scope and applications are expanding daily. For this reason, machine learning has become a crucial competitive differentiator in many organizations. Machine learning includes supervised, unsupervised, and ensemble learning classifiers that are used to make predictions on a dataset and to measure their accuracy. ML algorithms build a model from sample data, called training data, in order to make a decision or prediction [1, 2]. The current study concerns the use of machine learning methods in the medical industry, where they mainly focus on mimicking some human activities or mental processes and recognizing diseases from a variety of inputs [3].

The term “heart disease” refers to a group of conditions that affect the heart. According to World Health Organization reports, cardiovascular diseases are now the leading cause of death worldwide, accounting for approximately 17.9 million deaths each year [4, 5]. Many studies have applied various machine learning algorithms to diagnose heart disease. Ghumbre et al. applied machine learning and deep learning algorithms to predict heart disease in the UCI dataset and concluded that machine learning algorithms performed better for this analysis [3]. Rohit Bharti et al. published machine learning techniques for heart disease prediction and concluded that different data mining and neural network techniques should be used to find the seriousness of heart disease (HD) among patients [4]. Other work has examined the implementation of a predictive data mining strategy on the same dataset [5]. Jee S H et al. studied the prediction of heart disease using machine learning, with training and testing on the dataset performed by a neural network algorithm [6]. Mai Shouman et al. reviewed the K-Nearest Neighbor algorithm for diagnosing heart disease [7]. Some efficient algorithms have been used to detect HD, with results showing that each algorithm has its own strengths in meeting the defined objectives [8]. Raihan M et al. studied the application of a supervised network to HD diagnosis [9]. This research idea has since been broadened in many articles published worldwide, which inspired the present work [10-15].

This article constructs an ML predictive model to help analyze heart disease from a patient's medical history. Data is collected from the UCI repository with patients' medical records and attributes. This dataset is used to predict whether or not the patients have heart disease. To diagnose the HD dataset, this article considers 14 attributes of a patient. The model classifies whether the disease is present or not and can help us diagnose diseases with fewer medical treatments [1, 5]. The attributes considered include age, sex, serum cholesterol, blood pressure, exang, etc. Five different ML algorithms, namely Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest-Neighbors (KNN), Random Forest (RF), and Gradient Boosting Classifier (GBC), are applied for the classification and prediction of heart disease. The attributes of the given dataset are trained under these algorithms. Based on the characteristics of the HD dataset, a comparative analysis of the algorithms is carried out with respect to accuracy rate. All the selected ML algorithms perform well, each with an accuracy greater than 80%. The most accurate algorithm is Logistic Regression (LR), with an accuracy rate of approximately 95%. Logistic Regression is therefore used to predict and diagnose heart disease in a patient.

The rest of this article is organized as follows. Section 2 discusses the methodology. Section 3 briefly reviews the ML algorithms. Section 4 presents the results and analysis, comparing the algorithms by means of the confusion matrix. Finally, Section 5 draws conclusions and outlines the future scope.
Methodology
This section describes the method and analysis performed in this
research work. The initial steps are the collection of data and the selection of
relevant attributes. After that, the relevant data is pre-processed into the required
format. The data is then separated into two parts: training and testing datasets. The
algorithms are then applied, and the model is trained on the training data. The accuracy of this model is
obtained by using the testing data. The procedures of this study are loaded by using several
Attributes are properties of a dataset that are important for analysis and
prediction. Various attributes of the patient, like gender,
chest pain, serum cholesterol, fasting blood pressure, exang, etc., are considered for
predicting diseases. In addition, the correlation matrix can be used for attribute selection when
constructing a model.
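As a rough illustration of correlation-based attribute selection, the sketch below ranks attributes by the absolute value of their correlation with the target. The data here is synthetic and generated only for the example; the column names follow attributes named in this report, but the values, the sample size, and the choice of Pearson correlation are assumptions rather than the procedure actually used in this study.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): values are random, not patient records.
rng = np.random.default_rng(0)
n = 300
target = rng.integers(0, 2, size=n)

df = pd.DataFrame({
    "age": rng.normal(54, 9, size=n),      # unrelated to target by construction
    "chol": rng.normal(245, 50, size=n),   # unrelated to target by construction
    # exang is made to depend on the target so that it stands out in the ranking
    "exang": (rng.random(n) < np.where(target == 1, 0.1, 0.8)).astype(int),
    "target": target,
})

# Pearson correlation of every attribute with the target column
corr_with_target = df.corr()["target"].drop("target")

# Rank attributes by absolute correlation, strongest first
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

Attributes near the top of such a ranking are candidates to keep; a threshold for dropping weak attributes would have to be chosen for the real dataset.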
Data Collection
In this article, the dataset is collected from the UCI repository, which has been used for
research analysis by many authors [4, 7]. The first step is to organize the dataset
from the UCI repository for heart disease prediction and then divide it into two
sections: training and testing. In this article, 80% of the data is used as the training
dataset, and 20% is used for testing.
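The 80/20 division can be sketched with scikit-learn's train_test_split. The feature matrix below is random placeholder data with 1025 rows and 13 feature columns; the use of stratification and the random_state value are assumptions added for reproducibility, not settings confirmed by this report.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): random features and labels, not the real dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(1025, 13))
y = rng.integers(0, 2, size=1025)

# Hold out 20% of the rows for testing; stratify keeps the class ratio similar
# in both parts (an optional choice, assumed here)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```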
Pre-processing of Data
We need to clean the dataset by removing missing or noisy values to obtain
accurate results; this is known as data cleaning. Using some standard techniques in
Python 3.8, we can fill missing and noisy values, see [16]. Then we need to transform our
dataset through normalization, smoothing, generalization, and
aggregation. Integration is one of the crucial phases in data pre-processing, and various issues are
considered here to integrate. Sometimes the dataset is too complex or difficult to
understand. In this case, the dataset needs to be reduced to the required format, which
gives better results.
American Journal of Computer Science and Technology 2022; 5(3): 146-154
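A minimal sketch of the cleaning and normalization steps is given below. The tiny table is made up for the example, and the specific choices (filling missing values with the column median, min-max scaling to the range 0 to 1) are assumptions; this report does not state which techniques were actually applied.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up rows (assumption) with two missing values
df = pd.DataFrame({
    "age": [63, 37, np.nan, 56],
    "chol": [233.0, 250.0, 204.0, np.nan],
    "target": [1, 1, 0, 0],
})
features = ["age", "chol"]

# Data cleaning: fill each missing value with the median of its column
df[features] = df[features].fillna(df[features].median())

# Normalization: rescale every feature to the range [0, 1]
scaler = MinMaxScaler()
df[features] = scaler.fit_transform(df[features])

print(df)
```

In a real pipeline the scaler would be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into training.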
Balancing of Data
Prediction of Disease
In this article, five different machine learning algorithms are implemented for
classification. A comparative analysis of the algorithms has been studied. Finally, this article
considers an ML algorithm that gives the highest accuracy rate for heart disease prediction,
see Figure 1.
Where z is a linear function of the inputs x1 and x2, the weights w1 and w2, and the bias b. This linear value z is passed to a sigmoid function to
predict the output. We calculate the loss to evaluate the performance of this model. In this case, we
use the cross-entropy loss function.
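To make the description concrete, the sketch below evaluates the sigmoid output and the cross-entropy loss for a single sample. It assumes the linear form z = w1*x1 + w2*x2 + b, which matches the description of z but is not quoted from an equation in this report; the numeric values are arbitrary.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x1, x2, w1, w2, b):
    """Predicted probability of the positive class (assumed linear form of z)."""
    z = w1 * x1 + w2 * x2 + b
    return sigmoid(z)

def cross_entropy(y_true, p, eps=1e-12):
    """Binary cross-entropy loss for one sample; p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Arbitrary example values (assumption): here z = 0.5 - 0.5 + 0 = 0
p = predict_proba(x1=1.0, x2=2.0, w1=0.5, w2=-0.25, b=0.0)
loss = cross_entropy(1, p)
print(p, loss)
```

With z = 0 the sigmoid returns 0.5, and the loss for a positive label equals ln 2, which is the loss of an uninformed prediction.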
and to make him familiar with it. His level of confidence must be raised so that he is
also able to make some constructive criticism, which is welcomed, as he is the final user of
the system.
Here, the horizontal x-axis and the vertical y-axis represent the independent and dependent variables of a
function, respectively. Figure 4 is a simple example of the K-NN classification algorithm.
The test sample (the yellow square with a question mark) should be classified as either a green
triangle or a red star. When k=3 is considered (the small dashed circle), the
yellow square is classified as a green triangle because the majority in this region is green
triangles, not red stars. If we instead consider k=7 (the large dashed circle), the
yellow square is classified as a red star because the region contains four red stars and three green
triangles. So, we can conclude that the majority vote in a specific region is what matters
here, see Figure 5.
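The majority-vote behaviour described above can be reproduced with a few lines of plain Python. The coordinates below are invented so that the three nearest neighbours contain two green triangles and one red star, while the seven nearest contain three green triangles and four red stars; they are not read off the figure.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Invented coordinates (assumption); the test sample sits at the origin
points = [
    (0.5, 0.0), (0.0, 0.6), (1.5, 1.5),                 # green triangles
    (-0.7, 0.0), (2.0, 0.0), (0.0, -2.2), (-1.8, 1.0),  # red stars
]
labels = ["green triangle"] * 3 + ["red star"] * 4
query = (0.0, 0.0)

print(knn_predict(points, labels, query, k=3))
print(knn_predict(points, labels, query, k=7))
```

The same test sample receives a different label for k=3 and k=7, which is why the choice of k is usually tuned on held-out data.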
The existing system is a manual system in which users have to perform their work
manually. The whole procedure is tedious and takes a lot of
time.
Random Forest
Random Forest (RF) is a popular supervised machine learning algorithm used for
both classification and regression. However, it is mainly used in classification problems. The RF
algorithm is based on the concept of ensemble learning, a general
machine learning procedure that combines multiple learning algorithms to seek better
predictive performance [2, 19]. The RF technique creates several decision trees on the data
samples, obtains the prediction from each tree, and finally selects the best solution by
majority voting. The ensemble method is better than a single
decision tree because it mitigates over-fitting by averaging results. The large number of
decision trees in RF helps to improve accuracy and prevent over-fitting. The
RF algorithm follows the procedure below, see also figure 6:
Step 1: First, n numbers of the random sample are selected from a given dataset.
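The ensemble idea described above can be tried out with scikit-learn's RandomForestClassifier. This is only a sketch on synthetic data produced by make_classification; the number of trees, the sample size, and the random seeds are assumptions and not the settings used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption), 13 features per sample
X, y = make_classification(
    n_samples=500, n_features=13, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each of the 100 trees is trained on a bootstrap sample of the training data
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)
print(len(forest.estimators_), accuracy)
```

Note that scikit-learn combines the trees by averaging their predicted class probabilities rather than counting one hard vote per tree, which is a soft variant of the majority voting described in the text.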
Gradient Boosting
Gradient Boosting (GB) is a machine learning technique used in both classification and
regression problems. It is a powerful algorithm in the field of machine learning
[21]. Errors in machine learning algorithms are commonly classified into two categories:
bias error and variance error. GBC helps us to minimize the bias error of the model
sequentially, see Figure 6. The diagram is described below.
The ensemble consists of N trees; see Figure 6. First, the
feature matrix X and the labels y are used to train Tree 1. Its predictions are used to calculate
the training set residual error r1. Then, Tree 2 is trained using the feature matrix
X and the residual errors r1 of Tree 1 as labels. The residual error r2 is then calculated from
the predictions of Tree 2, see Figure 6.
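The first two boosting stages described above can be written out by hand. The sketch below uses regression trees on a synthetic numeric target with squared-error residuals and a learning rate of 1, which is the simplest setting; the Gradient Boosting Classifier used in this report works on the same principle but fits residuals of a classification loss. The data, tree depth, and seeds are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression data (assumption)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Tree 1 is trained on the feature matrix X and the labels y
tree1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
r1 = y - tree1.predict(X)          # residual errors of Tree 1

# Tree 2 is trained on X with the residuals r1 as its labels
tree2 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, r1)
r2 = r1 - tree2.predict(X)         # residual errors after Tree 2

# The ensemble prediction is the sum of the individual tree predictions
ensemble_prediction = tree1.predict(X) + tree2.predict(X)

print(np.mean(r1 ** 2), np.mean(r2 ** 2))
```

Each additional tree is fitted to what the previous trees still get wrong, so the training error does not increase from one stage to the next.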
Result Analysis
Before studying the performance of the machine learning algorithms considered in this
research, we analyze the features of the heart disease dataset. The
total number of observations in the target attribute is 1025, of which 499 do not have heart disease
(denoted by 0) and 526 have heart disease (denoted by 1), see Figure 7. So, the
percentage of not having heart disease is 48.7%, and the percentage of having heart disease is
51.3%, see Figure 8(a). The rate of heart disease is thus higher than the rate of no
heart disease. In Figure 8(b), the sex feature of the HD dataset is observed against the target
feature. In the sex attribute, the female and male counts are 312 and 713, respectively, so the
number of males is more than double the number of females. Figure 8(b) shows that the
number of heart disease cases is higher in males than in females. Similarly, the number without heart disease
is higher among males than among females. Figure 8(b) therefore indicates that males are affected more than
females; for more information, see figure 8(b).
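Class shares and a sex-by-target breakdown of this kind can be computed with pandas. The eight rows below are a toy table invented for the example (1 denotes male and disease present, following the coding used in this report); the real dataset would be loaded in its place.

```python
import pandas as pd

# Toy rows (assumption), not the real dataset
df = pd.DataFrame({
    "sex":    [1, 1, 1, 1, 1, 0, 0, 0],
    "target": [1, 1, 0, 0, 1, 1, 0, 1],
})

# Share of each class in the target column
class_share = df["target"].value_counts(normalize=True)

# Number of patients for every combination of sex and target
by_sex = pd.crosstab(df["sex"], df["target"])

print(class_share)
print(by_sex)
```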
The correlation of the features is drawn in figure 10. The main purpose of the correlation plot
is to show the positive and negative correlations between the features. However,
figure 10 is too complex for reading off the strong and weak correlations. For this reason,
figure 11 is added to show these correlations more clearly. In figure 11, we can
see that three features, cp, thalach, and slope, correlate positively with the target feature.
Figure: (a) Age v/s cholesterol with the target feature, (b) Kernel density estimate (kde) plot of age v/s cholesterol.
Two strong correlations, those of cp and slope with the target feature, are studied statistically. As we
can see in figure 12(a), there is no heart disease when the cp level is more than 350; however,
heart disease occurs more often when the cp is between 200 and 250. In addition, when
300 < slope-1 < 350, there is no disease, see figure 12(a). In contrast,
for slope-2, there is heart disease when 300 < slope-2 < 350.
Performance Analysis
In this article, various machine learning algorithms, namely Logistic Regression (LR), Support
Vector Machine (SVM), K-Nearest-Neighbors (KNN), Random Forest Classifier (RF), and
Gradient Boosting Classifier (GBC), are studied to predict heart disease. The
accuracy rate of each algorithm is measured, and the algorithm with the highest
accuracy is selected. The accuracy rate is the ratio of correct predictions to the total number of samples in the given dataset.
It can be written as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative. After training
and testing the machine learning algorithms on the dataset, we can find the better algorithm by comparing accuracy rates. The
accuracy rate is calculated with the support of a confusion matrix. As shown in Table 2,
the Logistic Regression algorithm gives the best accuracy compared with the other ML
algorithms.
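The LR machine learning algorithm has been studied further through the confusion matrix
and f1-score. The confusion matrix shows that the proportion of correct predictions is 95%, see figure
14. The f1-score, shown in figure 15, is calculated by

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

where Precision = TP / (TP + FP) and Recall = TP / (TP + FN).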
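As a worked example of these measures, the sketch below derives accuracy, precision, recall, and f1-score from the four cells of a confusion matrix. The counts are hypothetical values chosen for illustration; they are not the confusion matrix obtained in this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard metrics from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical counts (assumption), not results from this report
metrics = classification_metrics(tp=90, tn=100, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```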
References
[1] Wikipedia contributors. (2022, June 22). Machine learning. In Wikipedia, The Free Encyclopedia. Retrieved 06:31, June 26, 2022, from https://en.wikipedia.org/w/index.php?title=Machine_learning&oldid=1094363111.
[2] Victor Chang, Vallabhanent Rupa Bhavani, Ariel Qianwen Xu, MA Hossain. An artificial intelligence model for heart disease detection using machine learning. Healthcare Analytics, Volume 2, November 2022, 100016. https://doi.org/10.1016/j.health.2022.100016.
[3] Ghumbre, S. U., & Ghatol, A. A. (2012). Heart disease diagnosis using machine learning algorithm. In Proceedings of the International Conference on Information Systems Design and Intelligent Applications 2012 (INDIA 2012) held in Visakhapatnam, India, January 2012 (pp. 217-225). Springer, Berlin, Heidelberg.
[4] Rohit Bharti, Aditya Khamparia, Mohammed Shabaz, Gaurav Dhiman, Sagar Pande, and Parneet Singh. Prediction of Heart Disease Using a Combination of Machine Learning and Deep Learning. Hindawi Computational Intelligence and Neuroscience, Volume 2021, Article ID 8387680, 11 pages. https://doi.org/10.1155/2021/8387680.
[5] Khaled Mohamed Almustafa. Prediction of heart disease and classifiers sensitivity analysis. BMC Bioinformatics (2020) 21: 278. https://doi.org/10.1186/s12859-020-03626-y.
[6] Jee S H, Jang Y, Oh D J, Oh B H, Lee S H, Park S W & Yun Y D (2014). A coronary heart disease prediction model: The Korean Heart Study. BMJ Open, 4 (5), e005025.
[7] Mai Shouman, Tim Turner, and Rob Stocker. Applying k-Nearest Neighbour in diagnosing heart disease patients. International Journal of Information and Education Technology, vol. 2, No. 3, June 2012.
[8] Ganna A, Magnusson P K, Pedersen N L, de Faire U, Reilly M, Arnlov J & Ingelsson E (2013). Multilocus genetic risk scores for coronary heart disease prediction. Arteriosclerosis, Thrombosis, and Vascular Biology, 33 (9), 2267-72.
[9] Raihan M, Mondal S, More A, Sagor M O F, Sikder G, Majumder M A & Ghosh K (2016, December). Smartphone based ischemic heart disease (heart attack) risk prediction using clinical data and data mining
Using Machine Learning: Machine learning can play an essential role in predicting the presence or absence of locomotor disorders, heart diseases, and more. Such information, if predicted well in advance, can provide important insights to doctors, who can then adapt their diagnosis and treatment on a per-patient basis.
Supervised Learning: In this study, an effective heart disease prediction system (EHDPS) is developed using a neural network for predicting the risk level of heart disease. The system uses 15 medical parameters such as age, sex, blood pressure, cholesterol, and obesity for prediction.
Data insight: As mentioned here, we will be working with the heart disease detection dataset and drawing interesting inferences from the data to derive some meaningful results.
EDA: Exploratory data analysis is the key step for getting meaningful results.
Feature engineering: After getting the insights from the data, we have to alter the features so that they can move forward to the model building phase.
Model building: In this phase, we will be building our machine learning model for heart disease detection.
Conclusion: The conclusion which we found is that machine learning algorithms performed better in this analysis. Many researchers have previously suggested that we should use ML where the dataset is not that large, which is borne out in this work. In this paper, we proposed three methods in which comparative analysis was done and promising results were achieved. The methods which are used for comparison are confusion matrix, precision, specificity, sensitivity, and F1 score. For the 13 features which were in the dataset, the KNeighbors classifier performed better in the ML approach when data preprocessing is applied. The computational time was also reduced, which is helpful when deploying a model. It was also found that the dataset should be normalized; otherwise, the training model sometimes overfits and the accuracy achieved is not sufficient when the model is evaluated on real-world data, which can vary drastically from the dataset on which the model
Department of Computer Science
The National College, Basavanagudi-560004 Page 18
HEART DISEASE PREDICTION SYSTEM
was trained. It was also found that statistical analysis is important when a dataset is analyzed: the data should have a Gaussian distribution, and outlier detection is also important, for which a technique known as Isolation Forest is used. The difficulty encountered here is that the sample size of the dataset is not large. If a large dataset is available, the results can improve considerably with deep learning and ML as well. The algorithm applied by us in the ANN architecture increased the accuracy, which we compared with that of different researchers. The dataset size can be increased, and then deep learning with various other optimizations can be used to achieve more promising results. Machine learning and various other optimization techniques can also be used so that the evaluation results improve further. Other ways of normalizing the data can be used and the results compared. More ways could also be found to integrate heart-disease-trained ML and DL models with certain multimedia for the ease of patients and doctors.
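The outlier handling mentioned above can be sketched with scikit-learn's IsolationForest. The data is synthetic, with one extreme point added on purpose, and the contamination value and random seed are assumptions; this report does not give the settings that were used.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic two-feature data (assumption) plus one deliberately extreme point
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X = np.vstack([X, [[50.0, 50.0]]])

# contamination is the assumed fraction of outliers in the data
detector = IsolationForest(contamination=0.05, random_state=0)
labels = detector.fit_predict(X)   # -1 marks an outlier, 1 marks an inlier

# Keep only the rows judged to be inliers
X_clean = X[labels == 1]
print(len(X), len(X_clean))
```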
Healthy Heart:
• Atrium.
• Plural atria.
Unhealthy heart:
• Coronary artery Disease.
• Heart Arrhythmias.
• Heart Failure.
• Heart Valve Disease.
• Cardiomyopathy.
Eating a diet high in saturated fats, trans fat, and cholesterol has been linked
to heart disease and related conditions, such as atherosclerosis. Also, too much salt (sodium)
in the diet can raise blood pressure. Not getting enough physical activity can lead to heart
disease.
REPORTS
SCREEN SHOTS:
SOURCE CODES :
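# Exploratory analysis of the heart disease dataset
import numpy as np
import pandas as pd
import streamlit as st
import pickle as pk
import matplotlib.pyplot as plt
import seaborn as sns

# Raw string so the backslash in the Windows path is not read as an escape
df = pd.read_csv(r"D:\heart_disease_data (1).csv")
df.head()
df.describe()

# Separate the feature columns from the target column
X = df.drop(columns='target')
Y = df['target']
print(Y)

# Correlation heatmap of all attributes
corrmat = df.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20, 20))
g = sns.heatmap(df[top_corr_features].corr(), annot=True, cmap="RdYlGn")
plt.show()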
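# Streamlit app: train the model and predict from user input
import numpy as np
import pandas as pd
import streamlit as st
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv(r"D:\heart_disease_data (1).csv")
X = df.drop(columns='target')
y = df['target']

# 80/20 train/test split, as described under Data Collection
# (random_state is an assumed value, used only for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize the features; the scaler is fitted on the training data only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# The input widgets that define `sex` and build `input_data`
# are not shown in this listing
if sex == 'Male':
    sex = 1
else:
    sex = 0

# Predict button
if st.button('Predict'):
    input_data = scaler.transform(input_data)
    output = model.predict(input_data)
    if output[0] == 0:
        pass  # result message for this branch is not shown in the listing
    else:
        st.write(stn)  # stn is defined in a part of the app that is not shown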