
HEART DISEASE PREDICTION USING WEKA TOOLS

ON MACHINE LEARNING

Anshu Garg, Jasleen Kaur


Dept. of Computer Science, CHANDIGARH UNIVERSITY
Project Report

I. ABSTRACT

Heart disease comprises a group of conditions in which the electrical activity of the heart is irregular, or faster or slower than normal. It is the leading cause of death for both men and women in the world. In this project, we plan to predict cardiac arrhythmia from a patient's medical record. Our objective is to classify a patient into one of the arrhythmia classes, such as Tachycardia and Bradycardia, based on his ECG measurements, and to understand the application of machine learning in the medical domain. After appropriate feature selection, we solve this problem using the machine learning algorithms K-Nearest Neighbour, Logistic Regression, Naïve Bayes and SVM.

II. INTRODUCTION

The total number of deaths due to cardiovascular diseases is 17.3 million a year according to the WHO causes-of-death statistics. Predicting cardiac arrhythmia in real life is therefore of great significance. In this project, we develop a machine learning system that can classify a patient into different cardiac arrhythmia classes based on Electrocardiogram (ECG) readings and other attributes. The first class refers to the normal patient, while the other classes represent different kinds of cardiac arrhythmia such as Tachycardia, Bradycardia and coronary artery disease. This is a supervised learning problem.

III. DATA SET

The dataset for the project is taken from the UCI Repository: https://archive.ics.uci.edu/ml/datasets/Arrhythmia. There are 452 rows, each representing the medical record of a different patient, and 279 attributes such as age, weight and the patient's ECG-related data. General attributes like age and weight take discrete integer values, while ECG features like QRS duration take real values. The variable Class is our target variable; there are 13 classes in total.
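The dataset is distributed as a comma-separated text file with "?" marking missing values. A minimal parsing sketch follows; the file name and the sample line are illustrative assumptions, not taken from the project:

```python
# Sketch: parsing one line of the UCI arrhythmia file (assumed to be
# "arrhythmia.data": comma-separated, "?" for missing, class label last).

def parse_row(line):
    """Split one CSV line into feature values plus the integer class label.
    '?' becomes None; everything else is parsed as a float."""
    fields = line.strip().split(",")
    values = [None if f == "?" else float(f) for f in fields[:-1]]
    return values, int(fields[-1])

# Made-up toy line with 6 features and class 1, for illustration only.
features, cls = parse_row("75,0,190,80,?,163,1")
print(features, cls)
```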
IV. SURVEY

The aim is to distinguish between the presence and absence of cardiac arrhythmia and to classify it into one of the 13 groups. For the time being, there exists a computer program that makes such a classification. However, there are differences between the cardiologist's and the program's classification. Taking the cardiologist's as a gold standard, we aim to minimize this difference by means of machine learning tools.

V. SCOPE

These machine learning techniques can be deployed in hospitals where a large dataset is available. They can help doctors make more precise decisions and cut down the number of casualties due to heart disease in the future.

VI. METHODOLOGY

A. Feature Selection:
Firstly, we removed the categorical features that indicated either all 0's or all 1's 95% of the time. If a training instance has a missing value for a given attribute, we set it to the mean of that attribute for the class the instance belongs to, plus or minus the standard deviation. If the majority of values for a given attribute are missing, we discard the attribute and remove it from our training set.
The features can be grouped into 5 blocks:
• features concerning biographical characteristics, i.e., age, sex, height, weight and heart rate.
• features concerning average wave durations of each interval (PR interval, QRS complex, and ST intervals).
• features concerning vector angles of each wave.
• features concerning widths of each wave.

B. Random Forests and Decision Trees:
We implement a Random Forest classifier. The model works by repeatedly sampling, with replacement, a portion of the training dataset and fitting a decision tree to it. The number of trees refers to the number of times the dataset is randomly sampled; in each sampling iteration, a random subset of features is also selected. In a decision tree, each internal node refers to one of the input variables and has edges to children for all possible values that the variable can take. Each leaf corresponds to a value of the class label given the values of the input variables on the path from the root to that leaf. The number of trees and the number of leaves are tuned via cross-validation.

C. Principal Component Analysis:
PCA is used to identify patterns in the data and then express the data in a way that highlights similarities and differences. Primarily we use PCA to reduce the number of dimensions by identifying the more important directions, i.e. the principal components. The number of principal components is at most the number of original variables. The first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
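The preprocessing steps above can be sketched in NumPy on a small made-up matrix. This is an illustration of the technique (near-constant filtering, per-class mean imputation, PCA via SVD), not the project's actual code; the 95% threshold is the one stated above:

```python
import numpy as np

# (1) drop features that take a single value >= 95% of the time,
# (2) fill missing values with the per-class mean of that feature,
# (3) project onto the leading principal components.

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
X[:, 4] = 0.0                      # near-constant column, should be dropped
X[3, 1] = np.nan                   # one missing value
y = np.array([1] * 10 + [2] * 10)  # two toy classes

def mostly_constant(col, thresh=0.95):
    vals, counts = np.unique(col[~np.isnan(col)], return_counts=True)
    return counts.max() / counts.sum() >= thresh

keep = [j for j in range(X.shape[1]) if not mostly_constant(X[:, j])]
X = X[:, keep]

for c in np.unique(y):             # per-class mean imputation
    rows = y == c
    means = np.nanmean(X[rows], axis=0)
    idx = np.where(np.isnan(X) & rows[:, None])
    X[idx] = means[idx[1]]

Xc = X - X.mean(axis=0)            # PCA: SVD of the centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_pca = Xc @ Vt[:k].T              # scores on the first k components
print(X_pca.shape)
```

The rows of `Vt` are ordered by decreasing singular value, so the first component captures the largest variance, matching the description above.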
VII. MODELS

A. KNN (K-Nearest Neighbours):
We used KNN because it is simple to implement and very straightforward. An object is classified by a majority vote of its neighbours, with the object being assigned to the class most common among its k nearest neighbours. This is done by measuring the distance between the object and each of its neighbours, using the simple Euclidean distance, where 'a' and 'b' are the respective positions of the object and one of its neighbours. KNN is very sensitive to irrelevant or redundant features, because all features contribute to the similarity and thus to the classification. This was improved by the careful feature selection described previously. The results are summarized below:

Training-Testing Size | K Neighbours | Training Accuracy | Test Accuracy
80%-20% | 6 | 67% | 61%
70%-30% | 6 | 65% | 62%
Table 1: KNN Classification with PCA

Training-Testing Size | K Neighbours | Training Accuracy | Test Accuracy
80%-20% | 6 | 65% | 64%
70%-30% | 6 | 67% | 62%
Table 2: KNN Classification with RF

Image 1: KNN accuracy training vs testing
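The distance-and-vote procedure described above can be sketched as follows. The points are made up for illustration; this is not the project's implementation:

```python
import numpy as np
from collections import Counter

# Minimal KNN: Euclidean distance plus a majority vote over the k
# nearest training points.

def knn_predict(X_train, y_train, x, k=3):
    d = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(d)[:k]               # indices of k closest points
    vote = Counter(y_train[nearest])          # majority vote over labels
    return vote.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([1, 1, 2, 2])
print(knn_predict(X_train, y_train, np.array([0.05, 0.1]), k=3))   # → 1
```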

B. Logistic Regression:

Since logistic regression is used for binary classification of datasets with a categorical dependent variable, in order to apply it to our multi-class dataset we first grouped our instances into two major classes: class 1 (containing all the instances labeled "class 01") and class NOT-1 (containing the instances of all the other classes). We grouped the data this way because about half of our instances were labeled as class 01. The results of our implementation are summarized below:

Training-Testing Size | Training Accuracy | Test Accuracy
80%-20% | 90% | 74%
70%-30% | 89% | 71%
Table 3: Logistic Regression with PCA

Training-Testing Size | Training Accuracy | Test Accuracy
80%-20% | 96% | 73%
70%-30% | 98% | 72%
Table 4: Logistic Regression with RF

Image 2: Logistic regression classification
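The binary reformulation above (class 1 vs NOT-1) and a logistic regression fit can be sketched as follows. The data, learning rate and iteration count are made-up illustrations; gradient descent stands in for whichever optimizer the project actually used:

```python
import numpy as np

# Relabel class 1 as positive and every other arrhythmia class as
# negative, then fit logistic regression by plain gradient descent.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
raw = np.array([1] * 30 + [7] * 30)     # 7 stands in for "some other class"
y = (raw == 1).astype(float)            # class 1 vs NOT-1

w = np.zeros(2); b = 0.0
lr = 0.1
for _ in range(500):                    # gradient descent on the log-loss
    p = sigmoid(X @ w + b)
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * (p - y).mean()

acc = ((sigmoid(X @ w + b) > 0.5) == y).mean()
print("training accuracy:", acc)
```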
Image 3: Logistic regression training vs testing

C. Naïve-Bayes Classifier:

This is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. We implemented our own Naive Bayes binomial and multinomial classifiers in Python. We use the Naive Bayes equation to calculate the posterior probability of each class; the class with the highest posterior probability is the predicted outcome. In the first case the training-testing data was split 80%-20%, and in the second case 70%-30%. The results are summarised below:

Image 4: Naïve Bayes Classification

Training-Testing Size | Training Accuracy | Test Accuracy
80%-20% | 79% | 62%
70%-30% | 78% | 65%
Table 5: Naïve-Bayes with PCA

Training-Testing Size | Training Accuracy | Test Accuracy
80%-20% | 75% | 63%
70%-30% | 74% | 62%
Table 6: Naïve-Bayes with RF
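The posterior rule above (score each class by prior times per-feature likelihoods, predict the argmax) can be sketched as follows. The project implemented binomial and multinomial variants; for a self-contained illustration on made-up continuous data, a Gaussian variant of the same rule is shown:

```python
import numpy as np

# Gaussian Naive Bayes: features are assumed independent given the class.

def fit_nb(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),        # class prior
                     Xc.mean(axis=0),         # per-feature means
                     Xc.var(axis=0) + 1e-9)   # per-feature variances, smoothed
    return params

def predict_nb(params, x):
    def log_post(prior, mu, var):
        ll = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        return np.log(prior) + ll.sum()       # log posterior (up to a constant)
    return max(params, key=lambda c: log_post(*params[c]))

X = np.array([[1.0, 2.0], [1.2, 1.9], [4.0, 5.0], [4.1, 5.2]])
y = np.array([1, 1, 2, 2])
print(predict_nb(fit_nb(X, y), np.array([1.1, 2.0])))   # → 1
```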
D. SVM (Support Vector Machines):
In SVM, a hyperplane is selected to best separate the points in the input variable space by their class, either class 0 or class 1. In two dimensions you can visualize this as a line and make classifications using it: by plugging input values into the line equation, we calculate whether a new point is above or below the line. We tried both the polynomial and the linear kernels for the SVM and found that the linear kernel outperformed the polynomial kernel.

Image 5: SVM Classification

Training-Testing Size | Training Accuracy | Test Accuracy
80%-20% | 99% | 75%
70%-30% | 99% | 71%
Table 7: SVM with PCA

Training-Testing Size | Training Accuracy | Test Accuracy
80%-20% | 100% | 71%
70%-30% | 99% | 70%
Table 8: SVM with RF

E. Weighted KNN:
A refinement of the KNN classification algorithm is to weight the contribution of each of the k neighbours according to its distance to the query point, giving greater weight to closer neighbours. The results are summarized below:

Image 6: Weighted KNN accuracy training vs testing

Training-Testing Size | K Neighbours | Training Accuracy | Test Accuracy
80%-20% | 6 | 99% | 65%
70%-30% | 6 | 99% | 64%
Table 9: Weighted KNN Classification with PCA
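The distance-weighted refinement above can be sketched as follows, again on made-up points; the 1/distance weighting scheme is one common choice and is an assumption here, as the report does not specify the weighting function:

```python
import numpy as np
from collections import defaultdict

# Weighted KNN: each of the k neighbours votes with weight 1/distance,
# so closer points count more than distant ones.

def weighted_knn(X_train, y_train, x, k=3):
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (d[i] + 1e-9)   # closer => bigger vote
    return max(votes, key=votes.get)

X_train = np.array([[0.0, 0.0], [2.0, 2.0], [2.1, 2.1]])
y_train = np.array([1, 2, 2])
# A plain majority vote with k=3 would pick class 2 here; the single very
# close class-1 neighbour wins once votes are weighted by 1/distance.
print(weighted_knn(X_train, y_train, np.array([0.1, 0.1]), k=3))   # → 1
```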


Training-Testing Size | K Neighbours | Training Accuracy | Test Accuracy
80%-20% | 6 | 99% | 70%
70%-30% | 6 | 99% | 66%
Table 10: Weighted KNN Classification with RF

VIII. RESULTS

The main objective of this project was to develop a system that could robustly detect an arrhythmia. The second objective was to develop a method to robustly classify an ECG trace into one of 13 broad arrhythmia classes. We report our performance for each of the five methods using two different methodologies, showing results for each algorithm as well as varying other parameters for better results.

IX. ANALYSIS

It is clear from the above data that the SVM and Logistic Regression algorithms are capable of automatically detecting arrhythmias with reliable accuracy (training accuracy = 98%, testing accuracy = 73%). Furthermore, Random Forests consistently performs better than PCA in terms of feature selection. Our general approach in this project was as follows. We started with KNN and tried to obtain maximum accuracy for different values of K ranging from 3 to 13. Then we used Logistic Regression, which uses the sigmoid function, and ran it with gradient descent and Newton's method. Logistic regression gave comparatively better results, with average accuracy around 73%. The Naïve-Bayes classifier gave poor results due to the lack of enough training examples (452) and the excessive number of features. SVM with a linear kernel gave the best results, with average classification accuracy around 99% on the training set and 73% on the test set.

X. ACKNOWLEDGEMENT

We are highly grateful to our professors Er. Jyoti Arora and Er. Dipti Sharma for their continued guidance and support throughout the course of this project.
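The analysis above mentions fitting logistic regression with Newton's method as well as gradient descent. A sketch of the Newton-Raphson update on made-up data follows (the small L2 penalty is an assumption added to keep the Hessian invertible, not something the report specifies):

```python
import numpy as np

# Each Newton step solves H @ delta = g, where g is the regularized
# log-loss gradient and H = X^T diag(p(1-p)) X + lam*I its Hessian.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (25, 2)), rng.normal(1, 1, (25, 2))])
X = np.hstack([X, np.ones((50, 1))])      # append a bias column
y = np.array([0.0] * 25 + [1.0] * 25)

lam = 0.1                                 # L2 penalty keeps H invertible
w = np.zeros(3)
for _ in range(10):                       # Newton-Raphson iterations
    p = sigmoid(X @ w)
    g = X.T @ (p - y) + lam * w                               # gradient
    H = X.T @ (X * (p * (1 - p))[:, None]) + lam * np.eye(3)  # Hessian
    w -= np.linalg.solve(H, g)

acc = ((sigmoid(X @ w) > 0.5) == y).mean()
print("training accuracy:", acc)
```

Newton's method typically converges in far fewer iterations than gradient descent on this loss, at the cost of solving a linear system per step.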
