HEART DISEASE PREDICTION RESEARCH USING MACHINE LEARNING ALGORITHMS
A Minor Project report submitted in partial fulfilment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
IN
INFORMATION TECHNOLOGY
Submitted by
SONIKESH - 08615003120
DISHANT KANGOTRA - 75315003120
ACKNOWLEDGEMENT
We would like to express our deep gratitude to our project guide Dr. Tripti Sharma,
Head of the Department, Information Technology, for her guidance, unsurpassed
knowledge, and immense encouragement, and for providing us with the required
facilities for the completion of the project work.
We thank all the teaching faculty of the Department of IT, whose suggestions during
reviews helped us in the accomplishment of our project. We would like to thank our
parents, friends, and classmates for their encouragement throughout our project period.
Last but not least, we thank everyone who supported us directly or indirectly in
completing this project successfully.
DEPARTMENT OF INFORMATION TECHNOLOGY
MAHARAJA SURAJMAL INSTITUTE OF TECHNOLOGY
NEW DELHI – 110058
CERTIFICATE
Mentor:
Dr. Tripti Sharma
(Head of The Department)
ABSTRACT
Machine learning techniques have revolutionized the field of healthcare by enabling accurate and
timely disease prediction. The ability to predict multiple diseases simultaneously can significantly
improve early diagnosis and treatment, leading to better patient outcomes and reduced healthcare
costs.
This research paper explores the application of machine learning algorithms in predicting multiple
diseases, focusing on their benefits, challenges, and future directions. We present an overview of
various machine learning models and data sources commonly used for disease prediction.
Additionally, we discuss the importance of feature selection, model evaluation, and the integration of multiple data
modalities for enhanced disease prediction.
The research findings highlight the potential of machine learning in multi-disease prediction and its
potential impact on public health. In this work, we apply a machine learning model to identify whether a
person is affected by a given disease; the model is trained on sample data and then used to
predict disease.
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
1.1 Introduction
1.2 Motivation of the work
1.3 Problem Statement
CHAPTER 2 LITERATURE SURVEY
CHAPTER 3 METHODOLOGY
3.1 Existing System
3.2 Proposed System
3.2.1 Collection of dataset
3.2.2 Selection of attributes
3.2.3 Pre-processing of Data
3.2.4 Balancing of Data
3.2.5 Prediction of Disease
APPENDIX
REFERENCES
LIST OF FIGURES
4 System Architecture
5.6.1 Input
5.6.2 Output
LIST OF TABLES
1 Attribute Table
2 Accuracy Table
LIST OF ABBREVIATIONS
ML Machine Learning
AI Artificial Intelligence
INTRODUCTION
According to the World Health Organization, 12 million deaths occur worldwide
every year due to heart disease. Heart disease is one of the biggest causes of
morbidity and mortality among the world's population. Prediction of
cardiovascular disease is regarded as one of the most important subjects in data
analysis. The burden of cardiovascular disease has been rapidly increasing all over
the world over the past few years. Many studies have been conducted in an attempt to
pinpoint the most influential factors of heart disease as well as to accurately predict the
overall risk. Heart disease is even described as a silent killer, leading to the death
of the person without obvious symptoms. Early diagnosis of heart disease plays a
vital role in making decisions on lifestyle changes in high-risk patients and in turn
reduces complications.
The main motivation for this research is to present a model for predicting the
occurrence of heart disease. Further, this work aims to identify the best classification
algorithm for detecting the possibility of heart disease in a patient. This is done by
performing a comparative study and analysis using three classification algorithms,
namely Naïve Bayes, Decision Tree, and Random Forest, at different levels of
evaluation. Although these are commonly used machine learning algorithms, heart
disease prediction is a vital task requiring the highest possible accuracy. Hence, the
three algorithms are evaluated with numerous levels and types of evaluation
strategies. This will help researchers and medical practitioners establish a better
prediction framework.
This research highlights a significant challenge in the early detection of cardiovascular diseases,
Parkinson's, and diabetes, emphasizing the limitations of current detection instruments in terms of
cost and efficiency. Early detection is crucial to reduce mortality rates and complications associated
with these diseases. However, continuous patient monitoring is impractical, and 24-hour doctor
consultations are not readily available. Leveraging the wealth of data available today, this paper
explores the application of various machine learning algorithms to analyze data and uncover hidden
patterns for predictive health diagnostics across cardiovascular diseases, Parkinson's, and diabetes.
CHAPTER 2
LITERATURE SURVEY
With the growing development in the field of medical science alongside machine learning,
various experiments and studies have been carried out in recent years, producing
relevant and significant papers.
[1] Liang H, Tsui BY, Ni H, et al. Evaluation and accurate diagnoses of pediatric diseases using artificial
intelligence. Nat Med. 2019;25(3):433- 438.
[3] Rajendra Acharya U, Fujita H, Oh SL, et al. Application of deep convolutional neural network for
automated detection of myocardial infarction using ECG signals. Inf Sci (Ny). 2017;415-416:190-198.
[4] Paniagua JA, Molina-Antonio JD, LopezMartinez F, et al. Heart disease prediction using random forests. J
Med Syst. 2019;43(10):329.
[5] Poudel RP, Lamichhane S, Kumar A, et al. Predicting the risk of type 2 diabetes mellitus using data
mining techniques. J Diabetes Res. 2018;2018:1686023.
[6] Al-Mallah MH, Aljizeeri A, Ahmed AM, et al. Prediction of diabetes mellitus type-II using machine
learning techniques. Int J Med Inform. 2014;83(8):596-604.
[7] Tsanas A, Little MA, McSharry PE, Ramig LO. Nonlinear speech analysis algorithms mapped to a standard
metric achieve clinically useful quantification of average Parkinson's disease symptom severity. J R Soc
Interface. 2012;9(65):2756-2764.
[8] Arora S, Aggarwal P, Sivaswamy J. Automated diagnosis of Parkinson's disease using ensemble machine
learning. IEEE Trans Inf Technol Biomed. 2017;21(1):289-299.
[9] Ahmad F, Hussain M, Khan MK, et al. Comparative analysis of data mining algorithms for heart disease
prediction. J Med Syst. 2019;43(4):101.
[10] Parashar A, Gupta A, Gupta A. Machine learning techniques for diabetes prediction. Int J Emerg Technol
Adv Eng. 2014;4(3):672-675.
[12] Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and
Prediction. 2nd ed. Springer; 2009.
[13] Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: Machine learning in Python. J Mach Learn
Res. 2011;12:2825-2830.
[14] McKinney W, van der Walt S, Lamoureux C, et al. Data structures for statistical computing in Python. In:
Proceedings of the 9th Python in Science Conference; 2010.
[16] Huang ML, Hung CC, Hsu CY, et al. Predicting ischemic stroke using the Framingham Stroke Risk
Score and a simple decision rule in Chinese patients with type 2 diabetes. Diabetes Care. 2010;33(2):427-429.
CHAPTER 3 METHODOLOGY
Diseases like heart disease, diabetes, and Parkinson's are even highlighted as silent
killers that lead to the death of a person without obvious symptoms. The nature of
these diseases is the cause of growing anxiety about them and their consequences.
Hence, continued efforts are being made to predict the possibility of these deadly
diseases in advance, and various tools and techniques are regularly experimented with
to suit present-day health needs. Machine learning techniques can be a boon in this
regard. Even though these diseases can occur in different forms, there is a common set
of core risk factors that influence whether someone will ultimately be at risk for these
diseases or not. By collecting data from various sources, classifying it under
suitable headings, and finally analyzing it to extract the desired information, we can
draw conclusions. This technique can be well adapted to predicting these diseases. As
the well-known quote says, "Prevention is better than cure": early prediction and
control can help prevent and decrease the death rates due to these diseases.
The working of the system starts with the collection of data and selecting the
important attributes. Then the required data is preprocessed into the required format.
The data is then divided into two parts training and testing data. The algorithms are
applied and the model is trained using the training data. The accuracy of the system is
obtained by testing the system using the testing data. This system is implemented
using the following modules.
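The workflow above can be sketched with scikit-learn. A built-in dataset stands in for the project's patient data, and the three classifiers mirror the comparative study described in this report; all names and numbers here are illustrative, not the project's actual configuration.

```python
# Sketch of the workflow: collect data, split into training/testing,
# train each model, then compare test accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Stand-in dataset (illustrative only)
X, y = load_breast_cancer(return_X_y=True)

# Divide the data into training and testing parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)                     # train on training data
    accuracies[name] = model.score(X_test, y_test)  # accuracy on testing data
    print(f"{name}: {accuracies[name]:.3f}")
```

The same loop structure applies once the project's own preprocessed dataset replaces the stand-in data.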
Dataset collection gathers data containing patient details. Attribute selection
chooses the attributes useful for the prediction of heart disease. After identifying
the available data resources, the data is further selected, cleaned, and transformed
into the desired form. The stated classification techniques are then applied to the
preprocessed data to predict heart disease, and an accuracy measure compares the
accuracy of the different classifiers.
Supervised learning is the type of machine learning in which machines are trained
using well-labelled training data, and on the basis of that data, machines predict the
output. Labelled data means that some input data is already tagged with the correct
output.
In supervised learning, the training data provided to the machine works as the
supervisor that teaches the machine to predict the output correctly. It applies the same
concept as a student learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data
to the machine learning model. The aim of a supervised learning algorithm is to find a
mapping function to map the input variable(x) with the output variable(y).
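The mapping from input x to output y can be illustrated with a hand-written one-nearest-neighbour rule: the labelled pairs act as the supervisor. The data and labels below are made up for illustration.

```python
# Minimal illustration of supervised learning: labelled pairs (x, y)
# teach the model; prediction maps a new x to a label y.
labelled_data = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]

def predict(x):
    # Pick the label of the closest labelled example (1-nearest-neighbour)
    nearest = min(labelled_data, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

print(predict(1.5))  # close to the "low" examples -> "low"
print(predict(8.5))  # close to the "high" examples -> "high"
```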
Unsupervised learning
● Unsupervised learning is helpful for finding useful insights from the data.
● In the real world, we do not always have input data with the corresponding output, so
to solve such cases we need unsupervised learning.
Reinforcement learning
4.2 ALGORITHMS
Support Vector Machine (SVM)
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point
in the correct category in the future. This best decision boundary is called a
hyperplane. SVM chooses the extreme points/vectors that help in creating the
hyperplane. These extreme cases are called support vectors, and hence the algorithm is
termed as Support Vector Machine.
Support vector machines (SVMs) are powerful yet flexible supervised machine
learning algorithms which are used both for classification and regression. But
generally, they are used in classification problems. SVMs were first introduced in the
1960s and later refined in the 1990s. SVMs have their own unique way of
implementation as compared to other machine learning algorithms. Lately, they are
extremely popular because of their ability to handle multiple continuous and
categorical variables.
Support Vectors - Data Points that are closest to the hyperplane are called support
vectors. Separating line will be defined with the help of these data points.
Hyperplane - A hyperplane is the decision plane or boundary that separates a set of
objects belonging to different classes.
Margin - It may be defined as the gap between two lines on the closest data points of
different classes. It can be calculated as the perpendicular distance from the line to the
support vectors. Large margin is considered as a good margin and small margin is
considered as a bad margin.
Types of SVM:
● Linear SVM: Linear SVM is used for linearly separable data, which means that if
a dataset can be classified into two classes using a single straight line, the data is
termed linearly separable, and the classifier used is called a Linear SVM
classifier.
● Non-linear SVM: Non-linear SVM is used for non-linearly separable data; a kernel
function maps such data into a higher-dimensional space where it becomes separable.
Advantages:
● Still effective in cases where the number of dimensions is greater than the
number of samples.
● Versatile: different kernel functions can be specified for the decision function.
Common kernels are provided, but it is also possible to specify custom kernels.
Disadvantages:
● If the number of features is much greater than the number of samples, avoiding
over-fitting in the choice of kernel function and regularization term is crucial.
● SVMs do not directly provide probability estimates; these are calculated using an
expensive five-fold cross-validation.
Figure: Support Vector Machine
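A minimal linear SVM sketch with scikit-learn's SVC on toy two-dimensional data; the two clusters are invented for illustration, and the support vectors are the extreme points described above.

```python
# Linear SVM on two linearly separable toy clusters.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [1, 2], [2, 1],    # class 0 cluster
              [6, 6], [6, 7], [7, 6]])   # class 1 cluster
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")  # Linear SVM
clf.fit(X, y)

# The support vectors are the points closest to the hyperplane
print(clf.support_vectors_)
# New points fall on the correct side of the learned hyperplane
print(clf.predict([[2, 2], [6, 5]]))
```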
Naive Bayes
Naive Bayes Classifier is one of the simplest and most effective classification
algorithms; it helps in building fast machine learning models that can make
quick predictions.
The Naive Bayes model is easy to build and particularly useful for very large data sets.
Along with simplicity, Naive Bayes is known to outperform even highly sophisticated
classification methods.
The Naive Bayes algorithm comprises two words, Naive and Bayes, which can
be described as:
● Naive: It is called Naive because it assumes that the occurrence of a certain
feature is independent of the occurrence of other features. For example, if a fruit is
identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is
recognized as an apple. Hence each feature individually contributes to identifying it as
an apple, without depending on the others.
● Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.
Bayes' theorem:
Bayes' theorem, also known as Bayes' Rule or Bayes' law, is used to determine
the probability of a hypothesis given prior knowledge. It depends on conditional
probability:

P(A|B) = P(B|A) * P(A) / P(B)

Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is the likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
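A worked numeric example of the theorem in a diagnostic setting; all probabilities below are made-up values for illustration.

```python
# Bayes' theorem by hand: probability of disease given a positive test.
p_disease = 0.01             # prior P(A): 1% of patients have the disease
p_pos_given_disease = 0.95   # likelihood P(B|A): test sensitivity
p_pos_given_healthy = 0.05   # false-positive rate

# Marginal probability of the evidence P(B): total chance of a positive test
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A|B) = P(B|A) * P(A) / P(B)
posterior = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {posterior:.3f}")  # 0.161
```

Despite the accurate test, the low prior keeps the posterior small, which is exactly the effect the theorem captures.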
There are three types of Naive Bayes model, which are given below:
● Gaussian: The Gaussian model assumes that features follow a normal distribution;
it is used when the predictors take continuous values.
● Multinomial: The Multinomial Naive Bayes classifier is used when the data is
multinomially distributed, for example word counts in a document. This model is
popular for document classification tasks.
● Bernoulli: The Bernoulli classifier works similarly to the Multinomial one, but
the predictor variables are independent Boolean variables, such as whether a
particular word is present or not in a document.
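Since patient attributes are typically continuous, the Gaussian variant is the natural fit here; a minimal sketch with scikit-learn, using invented feature values that only loosely evoke patient attributes.

```python
# Gaussian Naive Bayes on toy continuous features (values are made up).
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[25, 180], [30, 190], [28, 175],   # class 0: "healthy"
              [60, 260], [65, 270], [58, 255]])  # class 1: "at risk"
y = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB()
model.fit(X, y)

print(model.predict([[27, 185], [62, 265]]))      # predicted class labels
print(model.predict_proba([[27, 185]]).round(3))  # per-class probabilities
```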
Decision Tree
The Decision Tree Algorithm belongs to the family of supervised machine learning
algorithms. It can be used for both classification and regression
problems.
The goal of this algorithm is to create a model that predicts the value of a target
variable. The decision tree uses a tree representation in which each leaf node
corresponds to a class label and the attributes are represented at the internal nodes
of the tree.
There are various algorithms in Machine learning, so choosing the best algorithm for
the given dataset and problem is the main point to remember while creating a machine
learning model. Below are the two reasons for using the Decision Tree:
● Decision Trees usually mimic human thinking ability while making a decision,
so it is easy to understand.
● The logic behind the decision tree can be easily understood because it shows a
tree-like structure.
In a Decision Tree, the major challenge is to identify the attribute for the root node at
each level.
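Root-attribute selection is usually done with a measure such as information gain, i.e. the reduction in entropy achieved by splitting on an attribute. A small hand-computed sketch on toy data with illustrative values:

```python
# Information gain of one candidate split, computed by hand.
import math

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

# Each row: (attribute value, class label) - toy data
data = [("high", "yes"), ("high", "yes"), ("low", "no"),
        ("low", "no"), ("high", "yes"), ("low", "yes")]

labels = [label for _, label in data]
base = entropy(labels)  # entropy before the split

# Subtract the weighted entropy of each branch after the split
gain = base
for value in {"high", "low"}:
    subset = [label for v, label in data if v == value]
    gain -= len(subset) / len(data) * entropy(subset)

print(f"Information gain of this split: {gain:.3f}")
```

The attribute with the highest gain is chosen for the root, and the process repeats for each internal node.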
Random Forest
It can be used for both classification and regression and is among the most flexible and
easy-to-use algorithms. A forest consists of trees, and it is said that the more trees it
has, the more robust the forest is. Random Forests create decision trees on randomly
selected data samples, get a prediction from each tree, and select the best solution by
means of voting. Random Forest also provides a pretty good indicator of feature
importance.
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in
ML. It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of the
model.
A greater number of trees in the forest leads to higher accuracy and helps prevent the
problem of overfitting.
Assumptions:
Since the random forest combines multiple trees to predict the class of the dataset, it is
possible that some decision trees may predict the correct output, while others may not.
But together, all the trees predict the correct output. Therefore, below are two
assumptions for a better Random forest classifier:
● There should be some actual values in the feature variable of the dataset so that
the classifier can predict accurate results rather than a guessed result.
● The predictions from each tree must have very low correlations.
Algorithm Steps:
● Select random samples from the given dataset.
● Construct a decision tree for each sample and get a prediction from every tree.
● Perform a vote over the predictions.
● Select the prediction result with the most votes as the final
prediction.
Advantages:
● It enhances the accuracy of the model and prevents the overfitting issue.
Disadvantages:
Although Random Forest can be used for both classification and regression tasks, it is
not as suitable for regression tasks.
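A sketch of the forest-of-trees idea with scikit-learn on synthetic data. The manual majority vote below reproduces what the library does internally: scikit-learn averages per-tree probabilities, but with fully grown (pure-leaf) trees this coincides with a hard majority vote.

```python
# Random Forest: many trees on random samples, final class by vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 500 samples, 8 features
X, y = make_classification(n_samples=500, n_features=8,
                           n_informative=2, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Manual majority vote across the individual trees for 5 samples
votes = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
majority = (votes.mean(axis=0) > 0.5).astype(int)
print(majority)
print(forest.predict(X[:5]))  # the forest's own (equivalent) vote

# Built-in feature-importance indicator mentioned above
print(forest.feature_importances_.round(2))
```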
Logistic Regression
Logistic Regression is quite similar to Linear Regression, except in how each is
used: Linear Regression is used for solving regression problems, whereas logistic
regression is used for solving classification problems.
In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.
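The "S"-shaped logistic (sigmoid) function itself can be computed by hand; it squashes any real number into a value between 0 and 1, which is read as a probability.

```python
# The logistic (sigmoid) function that gives the "S"-shaped curve.
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

for z in (-4, 0, 4):
    print(f"sigmoid({z}) = {sigmoid(z):.3f}")
# Large negative z -> near 0, z = 0 -> 0.5, large positive z -> near 1
```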
Advantages:
Logistic Regression is one of the simplest machine learning algorithms; it is easy to
implement yet provides great training efficiency in some cases. For these
reasons, training a model with this algorithm doesn't require high computation power.
The predicted parameters (trained weights) give inference about the importance of
each feature. The direction of association i.e. positive or negative is also given. So we
can use Logistic Regression to find out the relationship between the features.
This algorithm allows models to be updated easily to reflect new data, unlike Decision
Tree or Support Vector Machine. The update can be done using stochastic gradient
descent.
Disadvantages:
Logistic Regression is a statistical analysis model that attempts to predict precise
probabilistic outcomes based on independent features. On high dimensional datasets,
this may lead to the model being over-fit on the training set, which means overstating
the accuracy of predictions on the training set and thus the model may not be able to
predict accurate results on the test set. This usually happens in the case when the
model is trained on little training data with lots of features. So on high dimensional
datasets, regularization techniques should be considered to avoid over-fitting (but this
makes the model complex). Very high regularization factors may even lead to the
model being under-fit on the training data.
Non-linear problems can't be solved with logistic regression since it has a linear
decision surface. Linearly separable data is rarely found in real-world scenarios, so
transformation of non-linear features is required. This can be done by increasing the
number of features such that the data becomes linearly separable in higher dimensions.
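A small sketch of this idea, assuming scikit-learn and toy data: a one-dimensional labelling that no single threshold can separate becomes linearly separable once a squared feature is added.

```python
# Feature transformation: adding x**2 makes non-linear data separable.
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])  # class 1 lies on BOTH outer sides

# Plain 1-D logistic regression has a single linear threshold and fails
flat = LogisticRegression().fit(x.reshape(-1, 1), y)
print("1-D accuracy:", flat.score(x.reshape(-1, 1), y))

# With the squared feature, the classes split cleanly on x**2
X2 = np.column_stack([x, x ** 2])
curved = LogisticRegression().fit(X2, y)
print("2-D accuracy:", curved.score(X2, y))
```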
For Heart Disease
For Parkinson's
Web App using Streamlit:
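A typical Streamlit prediction app loads a model that was trained offline. A minimal sketch of saving and reloading such a model with pickle; the file name `disease_model.sav` and the stand-in dataset are placeholders, not the project's actual artifacts.

```python
# Save a trained model to disk, then reload it as the web app would.
import pickle
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

# Offline training step (stand-in data)
X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=10000).fit(X, y)

# Persist the trained model for the app to load at startup
with open("disease_model.sav", "wb") as f:
    pickle.dump(model, f)

# Inside the Streamlit app: load once, then predict on user input
with open("disease_model.sav", "rb") as f:
    loaded = pickle.load(f)

print(loaded.predict(X[:1]))  # same prediction as the original model
```

In the app itself, the user-entered attribute values would be collected with Streamlit input widgets and passed to `loaded.predict`.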
CHAPTER 6
CONCLUSION AND FUTURE WORK
APPENDIX
Python
Sklearn
Scikit-learn (Sklearn) is the most useful and robust library for machine learning in
Python. It provides a selection of efficient tools for machine learning and statistical
modeling including classification, regression, clustering and dimensionality reduction
via a consistent interface in Python. This library, which is largely written in Python, is
built upon NumPy, SciPy and Matplotlib.
Numpy
NumPy is a library for the python programming language, adding support for large,
multi- dimensional arrays and matrices, along with a large collection of high level
mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric,
was originally created by Jim Hugunin with contributions from several other developers.
In 2005, Travis Oliphant created NumPy by incorporating features of the competing
Numarray into Numeric, with extensive modifications. NumPy is open-source software
and has
many contributors.
Librosa
Librosa provides the building blocks necessary to create music information retrieval
systems. It helps to visualize audio signals and also to perform feature
extraction on them using different signal processing techniques.
Matplotlib
Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension NumPy. It provides an object-oriented API for embedding
plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt,
or GTK. There is also a procedural "pylab" interface based on a state machine (like
OpenGL), designed to closely resemble that of MATLAB, though its use is
discouraged.
Seaborn
Seaborn is a Python data visualization library based on Matplotlib. It provides a
high-level interface for drawing attractive and informative statistical graphics.
SciPy
SciPy is a free and open-source Python library used for scientific and technical
computing. It builds on NumPy and provides modules for optimization, linear algebra,
integration, and statistics.
Streamlit
Streamlit is a free and open-source framework to rapidly build and share beautiful
machine learning and data science web apps. It is a Python-based library specifically
designed for machine learning engineers.