
Practical- 12

Aim: To predict heart disease based on factors such as age, sex, resting blood pressure (trestbps), etc.
In [2]: # Importing essential libraries
import numpy as np
import pandas as pd

In [3]: # Loading the dataset


df = pd.read_csv('heart.csv')

Exploring the dataset


In [4]: # Returns number of rows and columns of the dataset
df.shape

Out[4]: (303, 14)

In [5]: # Returns an object with all of the column headers


df.columns

Out[5]: Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',


'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
dtype='object')

In [6]: # Returns the datatype of each column (float, int, string, bool, etc.)
df.dtypes

Out[6]: age int64


sex int64
cp int64
trestbps int64
chol int64
fbs int64
restecg int64
thalach int64
exang int64
oldpeak float64
slope int64
ca int64
thal int64
target int64
dtype: object

In [7]: # Returns the first x rows when called as head(x); without an argument it returns the first 5
df.head()

Out[7]: age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target

0 63 1 3 145 233 1 0 150 0 2.3 0 0 1 1

1 37 1 2 130 250 0 1 187 0 3.5 0 0 2 1

2 41 0 1 130 204 0 0 172 0 1.4 2 0 2 1

3 56 1 1 120 236 0 1 178 0 0.8 2 0 2 1

4 57 0 0 120 354 0 1 163 1 0.6 2 0 2 1

In [8]: # Returns the last x rows when called as tail(x); without an argument it returns the last 5
df.tail()

Out[8]: age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal target

298 57 0 0 140 241 0 1 123 1 0.2 1 0 3 0

299 45 1 3 110 264 0 1 132 0 1.2 1 0 3 0

300 68 1 0 144 193 1 1 141 0 3.4 1 2 3 0

301 57 1 0 130 131 0 1 115 1 1.2 1 1 3 0

302 57 0 1 130 236 0 0 174 0 0.0 1 1 2 0

In [9]: # Returns true for a column having null values, else false
df.isnull().any()

Out[9]: age False


sex False
cp False
trestbps False
chol False
fbs False
restecg False
thalach False
exang False
oldpeak False
slope False
ca False
thal False
target False
dtype: bool

In [10]: # Returns basic information on all columns


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 303 non-null int64
1 sex 303 non-null int64
2 cp 303 non-null int64
3 trestbps 303 non-null int64
4 chol 303 non-null int64
5 fbs 303 non-null int64
6 restecg 303 non-null int64
7 thalach 303 non-null int64
8 exang 303 non-null int64
9 oldpeak 303 non-null float64
10 slope 303 non-null int64
11 ca 303 non-null int64
12 thal 303 non-null int64
13 target 303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB

In [11]: # Returns basic statistics on numeric columns


df.describe().T

Out[11]: count mean std min 25% 50% 75% max

age 303.0 54.366337 9.082101 29.0 47.5 55.0 61.0 77.0

sex 303.0 0.683168 0.466011 0.0 0.0 1.0 1.0 1.0

cp 303.0 0.966997 1.032052 0.0 0.0 1.0 2.0 3.0

trestbps 303.0 131.623762 17.538143 94.0 120.0 130.0 140.0 200.0

chol 303.0 246.264026 51.830751 126.0 211.0 240.0 274.5 564.0

fbs 303.0 0.148515 0.356198 0.0 0.0 0.0 0.0 1.0

restecg 303.0 0.528053 0.525860 0.0 0.0 1.0 1.0 2.0

thalach 303.0 149.646865 22.905161 71.0 133.5 153.0 166.0 202.0

exang 303.0 0.326733 0.469794 0.0 0.0 0.0 1.0 1.0

oldpeak 303.0 1.039604 1.161075 0.0 0.0 0.8 1.6 6.2

slope 303.0 1.399340 0.616226 0.0 1.0 1.0 2.0 2.0

ca 303.0 0.729373 1.022606 0.0 0.0 0.0 1.0 4.0

thal 303.0 2.313531 0.612277 0.0 2.0 2.0 3.0 3.0

target 303.0 0.544554 0.498835 0.0 0.0 1.0 1.0 1.0

Data Visualization
In [12]: # Importing essential libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [13]: # Plotting histogram for the entire dataset


fig = plt.figure(figsize = (15,15))
ax = fig.gca()
g = df.hist(ax=ax)

In [14]: # Visualization to check if the dataset is balanced or not


g = sns.countplot(x='target', data=df)
plt.xlabel('Target')
plt.ylabel('Count')

Out[14]: Text(0, 0.5, 'Count')

Feature Engineering
Feature Selection
In [15]: # Selecting correlated features using Heatmap

# Get correlation of all the features of the dataset


corr_matrix = df.corr()
top_corr_features = corr_matrix.index

# Plotting the heatmap


plt.figure(figsize=(20,20))
sns.heatmap(data=df[top_corr_features].corr(), annot=True, cmap='RdYlGn')

Out[15]: <Axes: >

Data Preprocessing

Handling categorical features


After exploring the dataset, I observed that the categorical variables should be converted into dummy variables using 'get_dummies()'. Although the dataset contains no string columns, it is still necessary to encode the features 'sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca' and 'thal' this way.

Example: Consider the 'sex' column. It is a binary feature whose values are 0 and 1. Keeping it as-is would lead the algorithm to treat 0 as a lower value and 1 as a higher value, which should not be the case, since gender is not an ordinal feature.

In [16]: dataset = pd.get_dummies(df, columns=['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal'])
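To illustrate what this encoding does, here is a small toy sketch (not part of the original notebook; the toy DataFrame below is hypothetical):

import pandas as pd

# Hypothetical toy frame with a binary 'sex' column
toy = pd.DataFrame({'sex': [0, 1, 1, 0]})

# get_dummies() replaces 'sex' with indicator columns sex_0 and sex_1
# (booleans in recent pandas versions, 0/1 integers in older ones)
print(pd.get_dummies(toy, columns=['sex']))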

Feature Scaling
In [17]: dataset.columns

Out[17]: Index(['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'target', 'sex_0',


'sex_1', 'cp_0', 'cp_1', 'cp_2', 'cp_3', 'fbs_0', 'fbs_1', 'restecg_0',
'restecg_1', 'restecg_2', 'exang_0', 'exang_1', 'slope_0', 'slope_1',
'slope_2', 'ca_0', 'ca_1', 'ca_2', 'ca_3', 'ca_4', 'thal_0', 'thal_1',
'thal_2', 'thal_3'],
dtype='object')

In [18]: from sklearn.preprocessing import StandardScaler


standScaler = StandardScaler()
columns_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
dataset[columns_to_scale] = standScaler.fit_transform(dataset[columns_to_scale])

In [19]: dataset.head()

Out[19]: age trestbps chol thalach oldpeak target sex_0 sex_1 cp_0 cp_1 ... slope_2 ca_0 ca_1 ca_2 ca_3 ca_4 thal_0 thal_1 thal_2 thal_3

0 0.952197 0.763956 -0.256334 0.015443 1.087338 1 False True False False ... False True False False False False False True False False

1 -1.915313 -0.092738 0.072199 1.633471 2.122573 1 False True False False ... False True False False False False False False True False

2 -1.474158 -0.092738 -0.816773 0.977514 0.310912 1 True False False True ... True True False False False False False False True False

3 0.180175 -0.663867 -0.198357 1.239897 -0.206705 1 False True False True ... True True False False False False False False True False

4 0.290464 -0.663867 2.082050 0.583939 -0.379244 1 True False True False ... True True False False False False False False True False

5 rows × 31 columns

In [20]: # Splitting the dataset into dependent and independent features


X = dataset.drop('target', axis=1)
y = dataset['target']

Model Building
I will be experimenting with 3 algorithms:

1. KNeighbors Classifier
2. Decision Tree Classifier
3. Random Forest Classifier

KNeighbors Classifier Model


In [21]: # Importing essential libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

In [22]: # Finding the best accuracy for knn algorithm using cross_val_score
knn_scores = []
for i in range(1, 21):
    knn_classifier = KNeighborsClassifier(n_neighbors=i)
    cvs_scores = cross_val_score(knn_classifier, X, y, cv=10)
    knn_scores.append(round(cvs_scores.mean(), 3))

In [23]: # Plotting the results of knn_scores


plt.figure(figsize=(20,15))
plt.plot([k for k in range(1, 21)], knn_scores, color='red')
for i in range(1, 21):
    plt.text(i, knn_scores[i-1], (i, knn_scores[i-1]))
plt.xticks([i for i in range(1, 21)])
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Scores')
plt.title('K Neighbors Classifier scores for different K values')

Out[23]: Text(0.5, 1.0, 'K Neighbors Classifier scores for different K values')

In [24]: # Training the knn classifier model with k value as 12


knn_classifier = KNeighborsClassifier(n_neighbors=12)
cvs_scores = cross_val_score(knn_classifier, X, y, cv=10)
print("KNeighbours Classifier Accuracy with K=12 is: {}%".format(round(cvs_scores.mean(), 4)*100))

KNeighbours Classifier Accuracy with K=12 is: 84.48%

Decision Tree Classifier


In [25]: # Importing essential libraries
from sklearn.tree import DecisionTreeClassifier

In [26]: # Finding the best accuracy for decision tree algorithm using cross_val_score
decision_scores = []
for i in range(1, 11):
    decision_classifier = DecisionTreeClassifier(max_depth=i)
    cvs_scores = cross_val_score(decision_classifier, X, y, cv=10)
    decision_scores.append(round(cvs_scores.mean(), 3))

In [27]: # Plotting the results of decision_scores


plt.figure(figsize=(20,15))
plt.plot([i for i in range(1, 11)], decision_scores, color='red')
for i in range(1, 11):
    plt.text(i, decision_scores[i-1], (i, decision_scores[i-1]))
plt.xticks([i for i in range(1, 11)])
plt.xlabel('Depth of Decision Tree (N)')
plt.ylabel('Scores')
plt.title('Decision Tree Classifier scores for different depth values')

Out[27]: Text(0.5, 1.0, 'Decision Tree Classifier scores for different depth values')

In [28]: # Training the decision tree classifier model with max_depth value as 3
decision_classifier = DecisionTreeClassifier(max_depth=3)
cvs_scores = cross_val_score(decision_classifier, X, y, cv=10)
print("Decision Tree Classifier Accuracy with max_depth=3 is: {}%".format(round(cvs_scores.mean(), 4)*100))

Decision Tree Classifier Accuracy with max_depth=3 is: 78.51%

Random Forest Classifier


In [29]: # Importing essential libraries
from sklearn.ensemble import RandomForestClassifier

In [30]: # Finding the best accuracy for random forest algorithm using cross_val_score
forest_scores = []
for i in range(10, 101, 10):
    forest_classifier = RandomForestClassifier(n_estimators=i)
    cvs_scores = cross_val_score(forest_classifier, X, y, cv=5)
    forest_scores.append(round(cvs_scores.mean(), 3))

In [31]: # Plotting the results of forest_scores


plt.figure(figsize=(20,15))
plt.plot([n for n in range(10, 101, 10)], forest_scores, color='red')
for i in range(1, 11):
    plt.text(i*10, forest_scores[i-1], (i*10, forest_scores[i-1]))
plt.xticks([i for i in range(10, 101, 10)])
plt.xlabel('Number of Estimators (N)')
plt.ylabel('Scores')
plt.title('Random Forest Classifier scores for different N values')

Out[31]: Text(0.5, 1.0, 'Random Forest Classifier scores for different N values')

In [32]: # Training the random forest classifier model with n value as 90


forest_classifier = RandomForestClassifier(n_estimators=90)
cvs_scores = cross_val_score(forest_classifier, X, y, cv=5)
print("Random Forest Classifier Accuracy with n_estimators=90 is: {}%".format(round(cvs_scores.mean(), 4)*100))

Random Forest Classifier Accuracy with n_estimators=90 is: 82.80999999999999%
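Since the notebook reports only cross-validation scores, here is a minimal follow-up sketch (not part of the original run) of how the best-scoring configuration could be fitted and evaluated on a held-out split; the 80/20 split and random_state are assumptions made purely for illustration:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical hold-out split (80/20); the original notebook uses only cross-validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the best-scoring configuration found above and score it on the unseen split
final_model = RandomForestClassifier(n_estimators=90, random_state=42)
final_model.fit(X_train, y_train)
y_pred = final_model.predict(X_test)
print("Hold-out accuracy:", round(accuracy_score(y_test, y_pred), 4))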


GitHub link: https://github.com/Shubam85/Heart-disease-prediction.git
