
IBM- Supervised ML-Classification-PeerEval

Main Objective, Brief Description about dataset and its attributes


For this project, I am using the Heart Disease UCI dataset from Kaggle. Many factors influence the
development of heart disease in a patient, and the dataset captures 14 explanatory variables describing
aspects of each patient. The OBJECTIVE of this report is to build a machine learning model capable of
predicting whether or not someone has heart disease based on their medical attributes. Since the target
variable is categorical, this is a classification problem. The data contains the following columns:

• age
• sex
• chest pain type (4 values)
• resting blood pressure
• serum cholesterol in mg/dl
• fasting blood sugar > 120 mg/dl
• resting electrocardiographic results (values 0,1,2)
• maximum heart rate achieved
• exercise induced angina
• oldpeak = ST depression induced by exercise relative to rest
• the slope of the peak exercise ST segment
• number of major vessels (0-3) colored by fluoroscopy
• thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

Figure 1: First 5 rows of the DataFrame
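Loading and previewing the data might look like the sketch below. The file name `heart.csv` and the short column names (`cp`, `trestbps`, etc.) are assumptions based on the common Kaggle version of this dataset; here a tiny stand-in DataFrame with those columns replaces the real `pd.read_csv` call so the snippet is self-contained.

```python
import pandas as pd

# In practice the Kaggle file would be loaded with pd.read_csv("heart.csv").
# This tiny stand-in frame only illustrates the 14-column layout.
df = pd.DataFrame({
    "age": [63, 37, 41], "sex": [1, 1, 0], "cp": [3, 2, 1],
    "trestbps": [145, 130, 130], "chol": [233, 250, 204],
    "fbs": [1, 0, 0], "restecg": [0, 1, 0], "thalach": [150, 187, 172],
    "exang": [0, 0, 0], "oldpeak": [2.3, 3.5, 1.4], "slope": [0, 0, 2],
    "ca": [0, 0, 0], "thal": [1, 2, 2], "target": [1, 1, 1],
})
print(df.head())        # first rows, as in Figure 1
print(df.shape)         # 14 columns: 13 features plus the target
```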


Data Exploration, Data Cleaning, & Feature Engineering
The data was checked for duplicates, null values, shape, and data types.
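These checks are one-liners in pandas; a minimal sketch (using a small stand-in frame in place of the real dataset):

```python
import pandas as pd

# Stand-in for the loaded heart-disease DataFrame.
df = pd.DataFrame({"age": [63, 37, 41], "chol": [233, 250, 204], "target": [1, 1, 0]})

print(df.shape)               # (rows, columns)
print(df.dtypes)              # data type of each column
print(df.isnull().sum())      # null count per column
print(df.duplicated().sum())  # number of duplicated rows
```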

Figure 2: Checking for Null,Datatypes,Duplicates

Figure 3: Distribution of Dataset

Figure 4 : Visualization of data for more in depth analysis


Figure 5: Correlation Table

Figure 6: Feature importances used to identify the most influential variables


Since the data is already clean, no further data cleaning was needed. Since no single variable has an
overwhelming influence on scale, no scaling was applied to the data. The data was then split into training and
test sets using the train_test_split function from the scikit-learn library. The next step is to apply each
classification algorithm to the training set and calculate the accuracy of its predictions.
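The split step can be sketched as follows; the 80/20 split ratio, the random seed, and the synthetic feature matrix are illustrative assumptions, not values taken from the report.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and binary target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)),
                 columns=["age", "trestbps", "chol", "thalach"])
y = rng.integers(0, 2, size=100)

# Hold out 20% for testing; stratify keeps class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)
```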
Classification Algorithms

Three models are used in this assignment namely:

• Support Vector Machines (SVM)


Support vector machines (SVMs) are a set of supervised learning methods used for classification,
regression, and outlier detection. Their advantages include: effectiveness in high-dimensional spaces,
even when the number of dimensions exceeds the number of samples; memory efficiency, since the decision
function uses only a subset of the training points (the support vectors); and flexibility, since different
kernel functions can be specified for the decision function. Common kernels are provided, and it is also
possible to specify custom kernels.
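A minimal sketch of fitting an SVM with scikit-learn, using a synthetic dataset in place of the real one; the RBF kernel and C=1.0 are default choices, not the values used in the report.

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 13 features, like the heart dataset.
X, y = make_classification(n_samples=200, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

svm = SVC(kernel="rbf", C=1.0)   # untuned defaults
svm.fit(X_tr, y_tr)
print("SVM test accuracy:", svm.score(X_te, y_te))
print("support vectors used:", svm.n_support_.sum())
```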
• K-Nearest Neighbours
k-NN is a type of classification where the function is only approximated locally and all computation is
deferred until function evaluation. Since this algorithm relies on distance for classification, if the
features represent different physical units or come in vastly different scales then normalizing the
training data can improve its accuracy dramatically. A peculiarity of the k-NN algorithm is that it is
sensitive to the local structure of the data.
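Because of the distance-scaling issue noted above, k-NN is often wrapped in a scaling pipeline; a sketch under the same synthetic-data assumption, with k=5 as an illustrative (not reported) choice:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=13, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

# Standardizing first keeps any one feature from dominating the distances.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_tr, y_tr)
print("k-NN test accuracy:", knn.score(X_te, y_te))
```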
• Random Forest
Random forest is a flexible, easy-to-use machine learning algorithm that produces a good result most of
the time, even without hyperparameter tuning. It is also one of the most widely used algorithms because
of its simplicity and versatility. Random forest is a supervised learning algorithm: the "forest" it builds
is an ensemble of decision trees, usually trained with the "bagging" method. The general idea of bagging
is that combining multiple learning models improves the overall result.
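Fitting the ensemble follows the same pattern as the other two models; the tree count of 100 is scikit-learn's default, not a tuned value from the report.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=13, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=2)

# 100 bagged decision trees; no hyperparameter tuning, as the text notes.
rf = RandomForestClassifier(n_estimators=100, random_state=2)
rf.fit(X_tr, y_tr)
print("Random Forest test accuracy:", rf.score(X_te, y_te))
```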

Key Findings

From the accuracy table we can see that Random Forest is the best model for our data, as it achieves the
highest accuracy. Thus, since our objective is to best predict the presence of heart disease, we use the
third model, the Random Forest classifier.

To make predictions about a patient's heart condition with this Random Forest model, we input the 13
medical attributes described above. From the feature importances, we observe that chest pain type has a
high importance value, meaning that chest pain type plays a very important role in explaining whether or
not a patient suffers from a heart condition.

Figure 7: Accuracy scores for 3 Classifiers
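The chest-pain finding can be illustrated by reading a fitted forest's `feature_importances_` (Random Forest exposes importances rather than linear coefficients). The data below is a synthetic stand-in deliberately built so the `cp` column drives the target; the real importances are those in Figure 6.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
# One informative column ("cp") plus three noise columns.
cp = rng.integers(0, 4, size=300)
noise = rng.normal(size=(300, 3))
X = np.column_stack([cp, noise])
y = (cp >= 2).astype(int)  # target driven entirely by chest pain type

rf = RandomForestClassifier(n_estimators=100, random_state=3).fit(X, y)
names = ["cp", "age", "chol", "thalach"]
for name, imp in sorted(zip(names, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")   # "cp" should rank first
```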


Proposed Future Work

• Hyperparameter Tuning
• Cross-Validation
• Classification report
• Try different encoders
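The first two items above could be combined in a single step with scikit-learn's GridSearchCV; a sketch on synthetic data, where the parameter grid values are illustrative placeholders rather than a recommended search space.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=13, random_state=4)

# Grid of illustrative values only; a real search would be broader.
grid = GridSearchCV(
    RandomForestClassifier(random_state=4),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,                 # 5-fold cross-validation
    scoring="accuracy",
)
grid.fit(X, y)
print("best params:", grid.best_params_)
print("best CV accuracy:", round(grid.best_score_, 3))
```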
