Heart Disease Prediction
In [6]: # Returns the datatype of each column (int64, float64, object, bool, etc.)
df.dtypes
In [7]: # Returns the first x rows when called as head(x); without an argument it returns 5
df.head()
Out[7]: [first 5 rows; columns: age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, target; data rows lost in extraction]
In [8]: # Returns the last x rows when called as tail(x); without an argument it returns 5
df.tail()
Out[8]: [last 5 rows; same 14 columns; data rows lost in extraction]
In [9]: # Returns True for each column containing null values, else False
df.isnull().any()
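The DataFrame summary that follows is the output of df.info(); the cell itself did not survive extraction, so here is its one-line body:
df.info()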
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 303 non-null int64
1 sex 303 non-null int64
2 cp 303 non-null int64
3 trestbps 303 non-null int64
4 chol 303 non-null int64
5 fbs 303 non-null int64
6 restecg 303 non-null int64
7 thalach 303 non-null int64
8 exang 303 non-null int64
9 oldpeak 303 non-null float64
10 slope 303 non-null int64
11 ca 303 non-null int64
12 thal 303 non-null int64
13 target 303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
Data Visualization
In [12]: # Importing essential libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
Feature Engineering
Feature Selection
In [15]: # Selecting correlated features using Heatmap
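# (The body of this cell was lost in extraction; below is a minimal sketch of
# a correlation heatmap over all 14 columns, assuming seaborn and df.corr())
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()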
Data Preprocessing
Example: Consider the 'sex' column: it is a binary feature whose values are 0 and 1. Keeping it as-is would lead the algorithm to treat 0 as a lower value and 1 as a higher value, which should not be the case, since gender is not an ordinal feature.
In [16]: dataset = pd.get_dummies(df, columns=['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal'])
Feature Scaling
In [17]: dataset.columns
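The scaling cell between In [17] and In [19] did not survive extraction. Judging by the standardized values in Out[19] below, it most likely applied a standard scaler to the continuous columns; a minimal sketch, assuming sklearn's StandardScaler:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
columns_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
dataset[columns_to_scale] = scaler.fit_transform(dataset[columns_to_scale])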
In [19]: dataset.head()
Out[19]: age trestbps chol thalach oldpeak target sex_0 sex_1 cp_0 cp_1 ... slope_2 ca_0 ca_1 ca_2 ca_3 ca_4 thal_0 thal_1 thal_2 thal_3
0 0.952197 0.763956 -0.256334 0.015443 1.087338 1 False True False False ... False True False False False False False True False False
1 -1.915313 -0.092738 0.072199 1.633471 2.122573 1 False True False False ... False True False False False False False False True False
2 -1.474158 -0.092738 -0.816773 0.977514 0.310912 1 True False False True ... True True False False False False False False True False
3 0.180175 -0.663867 -0.198357 1.239897 -0.206705 1 False True False True ... True True False False False False False False True False
4 0.290464 -0.663867 2.082050 0.583939 -0.379244 1 True False True False ... True True False False False False False False True False
5 rows × 31 columns
Model Building
I will be experimenting with 3 algorithms (the shared setup of imports and the X/y split is sketched after this list):
1. KNeighbors Classifier
2. Decision Tree Classifier
3. Random Forest Classifier
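The cells that import the classifiers and split the data into features and target did not survive extraction; a minimal sketch of that shared setup, assuming the encoded and scaled dataset built above:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X = dataset.drop('target', axis=1)
y = dataset['target']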
In [22]: # Finding the best accuracy for the KNN algorithm using cross_val_score
knn_scores = []
for i in range(1, 21):
    knn_classifier = KNeighborsClassifier(n_neighbors=i)
    cvs_scores = cross_val_score(knn_classifier, X, y, cv=10)
    knn_scores.append(round(cvs_scores.mean(), 3))
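The plotting cell behind Out[23] was lost in extraction; a sketch that ends with the plt.title call whose return value appears below (the axis labels are assumptions):
plt.plot(range(1, 21), knn_scores, color='red')
plt.xticks(range(1, 21))
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Scores')
plt.title('K Neighbors Classifier scores for different K values')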
Out[23]: Text(0.5, 1.0, 'K Neighbors Classifier scores for different K values')
In [26]: # Finding the best accuracy for the decision tree algorithm using cross_val_score
decision_scores = []
for i in range(1, 11):
    decision_classifier = DecisionTreeClassifier(max_depth=i)
    cvs_scores = cross_val_score(decision_classifier, X, y, cv=10)
    decision_scores.append(round(cvs_scores.mean(), 3))
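The corresponding plotting cell is again missing; a sketch matching the title recorded in Out[27] (axis labels assumed):
plt.plot(range(1, 11), decision_scores, color='green')
plt.xticks(range(1, 11))
plt.xlabel('Depth of Decision Tree')
plt.ylabel('Scores')
plt.title('Decision Tree Classifier scores for different depth values')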
Out[27]: Text(0.5, 1.0, 'Decision Tree Classifier scores for different depth values')
In [28]: # Training the decision tree classifier model with max_depth value as 3
decision_classifier = DecisionTreeClassifier(max_depth=3)
cvs_scores = cross_val_score(decision_classifier, X, y, cv=10)
print("Decision Tree Classifier Accuracy with max_depth=3 is: {}%".format(round(cvs_scores.mean(), 4)*100))
In [30]: # Finding the best accuracy for the random forest algorithm using cross_val_score
forest_scores = []
for i in range(10, 101, 10):
    forest_classifier = RandomForestClassifier(n_estimators=i)
    cvs_scores = cross_val_score(forest_classifier, X, y, cv=5)
    forest_scores.append(round(cvs_scores.mean(), 3))
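The plotting cell behind Out[31] is likewise missing; a sketch matching its title (axis labels assumed):
plt.plot(range(10, 101, 10), forest_scores, color='blue')
plt.xticks(range(10, 101, 10))
plt.xlabel('Number of Estimators (N)')
plt.ylabel('Scores')
plt.title('Random Forest Classifier scores for different N values')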
Out[31]: Text(0.5, 1.0, 'Random Forest Classifier scores for different N values')