Heart Disease Prediction Using Weka Tools on Machine Learning
Anshu Garg, Jasleen Kaur
II. INTRODUCTION
The total number of deaths due to cardiovascular diseases stands at 17.3 million a year according to the WHO's causes-of-death statistics. Predicting cardiac arrhythmia in real life is therefore of great significance. In this project, we plan to develop a machine learning system that can classify a patient into different cardiac arrhythmic classes. The diagnosis of cardiac arrhythmia can be divided into various classes based on the electrocardiogram (ECG) readings and other attributes. The first class refers to the normal patient, while the other classes represent different types of cardiac arrhythmia such as tachycardia, bradycardia and coronary artery disease. This is a supervised learning problem.
IV. SURVEY
The aim is to distinguish between the presence and absence of cardiac arrhythmia and to classify it into one of the 13 groups. For the time being, there exists a computer program that makes such a classification. However, there are differences between the cardiologist's and the program's classification. Taking the cardiologist's as a gold standard, we aim to minimize this difference by means of machine learning tools.
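Measuring the gap to this gold standard can be made concrete as a misclassification (disagreement) rate; the sketch below uses made-up labels purely for illustration:

```python
# Minimal sketch: agreement between a program's classification and the
# cardiologist's gold standard, measured as a misclassification rate.
# All labels below are made up for illustration.

def misclassification_rate(gold, predicted):
    """Fraction of records where the program disagrees with the cardiologist."""
    assert len(gold) == len(predicted)
    disagreements = sum(1 for g, p in zip(gold, predicted) if g != p)
    return disagreements / len(gold)

# Class 1 = normal; classes 2 and up = hypothetical arrhythmia groups.
cardiologist = [1, 1, 2, 5, 1, 10, 2, 1]
program      = [1, 2, 2, 5, 1, 1, 2, 1]

print(misclassification_rate(cardiologist, program))  # 2 of 8 disagree -> 0.25
```

Minimizing this quantity over a learned classifier is exactly the objective the survey describes.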
V. SCOPE
These machine learning techniques can be deployed in hospitals where a large dataset is available, and can help doctors make more precise decisions and cut down the number of casualties due to heart diseases in the future.

VI. METHODOLOGY

A. Feature Selection:
Firstly, we removed some of the categorical features that were, 95 % of the time, indicating either all 0's or all 1's. If any training instance has a missing value for a given attribute, we set it to the mean plus or minus the standard deviation of that attribute, computed over the class the instance belongs to. If, for a given attribute, the majority of values are missing, then we discard that attribute and remove it from our training set.
The features can be grouped into 5 blocks –
- features concerning biographical characteristics, i.e., age, sex, height, weight and heart rate.
- features concerning average wave durations of each interval (PR interval, QRS complex, and ST intervals).
- features concerning vector angles of each wave.
- features concerning widths of each wave.

B. Random Forests and Decision Trees:
We implement a Random Forest classifier. The model works by repeatedly sampling, with replacement, a portion of the training dataset and fitting a decision tree to it. The number of trees refers to the number of times the dataset is randomly sampled. Moreover, in each sampling iteration, a random subset of the features is selected. In a decision tree, each internal node refers to one of the input variables and has edges to children for all possible values that the input can take; each leaf corresponds to a value of the class label given the values of the input variables represented by the path from the root node to that leaf. The number of trees and the number of leaves are learned via cross-validation.

C. Principal Component Analysis:
PCA is used to identify patterns in the data and then express the data in a way that highlights their similarities and differences. Primarily, we use PCA to reduce the number of dimensions by identifying the more important features, i.e., the principal components. The number of principal components is less than or equal to the smaller of the number of original variables and the number of observations. The first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
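The PCA computation described above can be sketched with a plain NumPy SVD on mean-centred data; the matrix and sizes below are illustrative stand-ins, not the actual arrhythmia feature matrix or the exact Weka implementation:

```python
import numpy as np

# Minimal PCA sketch: project data onto the directions of largest variance.
# X is a made-up (n_samples, n_features) matrix standing in for the
# arrhythmia feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 0] *= 10.0            # give one feature much larger variance

Xc = X - X.mean(axis=0)    # centre each feature
# SVD of the centred data: the rows of Vt are the principal directions,
# ordered by decreasing singular value.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Variance explained by each component, in decreasing order; each
# component is orthogonal to the preceding ones.
explained_variance = S**2 / (len(X) - 1)

k = 2
X_reduced = Xc @ Vt[:k].T  # keep only the first k principal components

print(explained_variance)
print(X_reduced.shape)     # (100, 2)
```

Keeping only the leading components is what reduces the dimensionality before a classifier is trained.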
VII. MODELS
A. KNN (K-Nearest Neighbours)
B. Logistic Regression

VIII. RESULTS
We evaluated our models using two different methodologies. We show results for each algorithm, as well as vary other parameters for better results. Accuracy was improved by the careful feature selection described previously. The results are summarized below –

Training-Testing Size   K Neighbours   Training Accuracy   Test Accuracy
80%-20%                 6              99 %                70 %
70%-30%                 6              99 %                66 %
70%-30%                 6              67 %                62 %

IX. ANALYSIS
It is clear from the above data that the SVM and Logistic Regression algorithms are capable of automatically detecting arrhythmias with reliable accuracy (training accuracy = 98 % and test accuracy = 73 %). Furthermore, Random Forests consistently perform better than PCA in terms of feature selection. Our general approach in this project was as follows. We started with KNN and tried to obtain maximum accuracy for different values of K ranging from 3 to 13. Then we used Logistic Regression, which uses the sigmoid function, and ran it using gradient descent and Newton's method. Logistic Regression gave comparatively better results, with average accuracy around 73 %. The Naïve-Bayes classifier gave poor results due to the lack of enough training examples (452) and an excessive number of features. SVM using linear kernels gave the best results, with an average classification accuracy of around 99 % on the training set and 73 % on the test set.

X. ACKNOWLEDGEMENT
We are highly grateful to our professors ER. JYOTI ARORA and ER. DIPTI SHARMA for their continued guidance and support throughout the course of this project.
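As a closing illustration, the logistic-regression step described in the analysis (a sigmoid hypothesis fitted by batch gradient descent) could be sketched as follows; the toy data, learning rate and iteration count are assumptions for the sketch, not the authors' actual Weka setup:

```python
import numpy as np

# Sketch of binary logistic regression trained by batch gradient descent.
# The sigmoid maps a linear score to a probability in (0, 1).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """X: (n, d) features, y: (n,) labels in {0, 1}. Returns weights (d+1,)."""
    Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = sigmoid(Xb @ w)
        grad = Xb.T @ (p - y) / len(y)          # gradient of the log-loss
        w -= lr * grad
    return w

def predict(w, X):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return (sigmoid(Xb @ w) >= 0.5).astype(int)

# Toy, nearly separable data: label is 1 when the single feature exceeds 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(int)

w = fit_logistic(X, y)
acc = (predict(w, X) == y).mean()
print(acc)
```

Newton's method, also mentioned above, replaces the fixed-step gradient update with a step scaled by the inverse Hessian of the log-loss, which typically converges in far fewer iterations.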