Heart Disease Indicator Prediction Model
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import scipy.stats as stats
import statsmodels.api as sm

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, roc_auc_score  # last name truncated in source; roc_auc_score assumed

import warnings
warnings.filterwarnings('ignore')
In [2]:
df = pd.read_csv("/content/heart_2022_Key_indicators.csv")
In [3]:
df.head()
Out[3]:
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79649 entries, 0 to 79648
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 HeartDisease 79649 non-null object
1 BMI 79649 non-null float64
2 Smoking 79649 non-null object
3 AlcoholDrinking 79649 non-null object
4 Stroke 79649 non-null object
5 PhysicalHealth 79649 non-null float64
6 MentalHealth 79649 non-null float64
7 DiffWalking 79649 non-null object
8 Sex 79649 non-null object
9 AgeCategory 79649 non-null object
10 Race 79648 non-null object
11 Diabetic 79648 non-null object
12 PhysicalActivity 79648 non-null object
13 GenHealth 79648 non-null object
14 SleepTime 79648 non-null float64
15 Asthma 79648 non-null object
16 KidneyDisease 79648 non-null object
17 SkinCancer 79648 non-null object
dtypes: float64(4), object(14)
memory usage: 10.9+ MB
In [5]:
df.isnull().sum()
Out[5]:
HeartDisease 0
BMI 0
Smoking 0
AlcoholDrinking 0
Stroke 0
PhysicalHealth 0
MentalHealth 0
DiffWalking 0
Sex 0
AgeCategory 0
Race 1
Diabetic 1
PhysicalActivity 1
GenHealth 1
SleepTime 1
Asthma 1
KidneyDisease 1
SkinCancer 1
dtype: int64
In [6]:
df = df.dropna(axis=0, how='any')
In [7]:
df.isnull().sum()
Out[7]:
HeartDisease 0
BMI 0
Smoking 0
AlcoholDrinking 0
Stroke 0
PhysicalHealth 0
MentalHealth 0
DiffWalking 0
Sex 0
AgeCategory 0
Race 0
Diabetic 0
PhysicalActivity 0
GenHealth 0
SleepTime 0
Asthma 0
KidneyDisease 0
SkinCancer 0
dtype: int64
In [8]:
df.describe()
Out[8]:
In [9]:
df.shape
Out[9]:
(79648, 18)
🔷Basic EDA
There are 17 predictor variables and 1 target variable.
Out of the 17 predictor variables, 4 are numeric.
There are 79648 observations after dropping the single row with missing values.
After that step, no variables have missing values.
The numeric variables are all right skewed.
This means that their distributions are not normal and have a longer tail on the right side. This can affect
the accuracy of some statistical models, such as linear regression, which assumes normally distributed
errors.
In [10]:
df['AgeCategory'].value_counts()
Out[10]:
65-69 8328
60-64 8084
70-74 7984
55-59 7145
80 or older 6349
50-54 6268
75-79 5770
45-49 5381
35-39 5090
40-44 5085
18-24 5051
30-34 4841
25-29 4272
Name: AgeCategory, dtype: int64
In [11]:
In [12]:
summary_stats(df)
Out[12]:
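The cell In [11] defining the summary_stats helper was not captured in this export. A minimal, hypothetical sketch of a helper with this shape (the name matches the call above, but the exact statistics computed are assumptions):
def summary_stats(data):
    # Hypothetical reconstruction of the missing helper: one row per
    # numeric column with a few common summary statistics.
    num = data.select_dtypes(include='number')
    return pd.DataFrame({
        'mean': num.mean(),
        'median': num.median(),
        'std': num.std(),
        'min': num.min(),
        'max': num.max(),
    })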
In [13]:
skewness = df.skew(numeric_only=True)  # numeric_only avoids errors on the object columns

skew_df = pd.DataFrame({'Variable': skewness.index, 'Skewness': skewness.values})
skew_df
Out[13]:
Variable Skewness
0 BMI 1.343807
1 PhysicalHealth 2.531653
2 MentalHealth 2.323737
3 SleepTime 0.994238
Since the skewness values are all positive, the numeric variables are all right skewed.
👀Visualizations
In [14]:
corr_matrix = df.corr(numeric_only=True)  # correlations are defined only for the numeric columns
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()
In [16]:
plt.figure(figsize=(13, 8))
sns.histplot(data=df, x='AgeCategory', hue='HeartDisease', multiple='stack', shrink=0.8)  # shrink value truncated in source; 0.8 assumed
plt.title('Age Distribution by Heart Disease')
plt.xlabel('Age Category')
plt.ylabel('Count')
plt.show()
◻Individuals who smoke have a higher incidence of heart disease compared to those who do not
smoke.
◻Individuals who drink alcohol regularly have a slightly higher incidence of heart disease compared
to those who do not drink alcohol.
🔸Overall, this graph suggests that smoking status may be a stronger predictor of heart disease
than alcohol drinking status (see the sketch of the plotting cell below).
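The cell that produced this graph was not captured in the export. A minimal sketch that would produce a comparable figure, assuming simple count plots over the Smoking and AlcoholDrinking columns listed in df.info() above:
# Sketch of the missing plotting cell: heart disease counts split by
# smoking status and by alcohol drinking status.
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
sns.countplot(data=df, x='Smoking', hue='HeartDisease', ax=axes[0])
axes[0].set_title('Heart Disease by Smoking Status')
sns.countplot(data=df, x='AlcoholDrinking', hue='HeartDisease', ax=axes[1])
axes[1].set_title('Heart Disease by Alcohol Drinking Status')
plt.tight_layout()
plt.show()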
In [18]:
◻Individuals with Heart Disease have a lower median Physical Health score compared to those
without Heart Disease.
◻Individuals with Heart Disease have a lower median Mental Health score compared to those
without Heart Disease.
🔸Overall, the boxplots suggest that individuals with Heart Disease tend to have lower Physical and
Mental Health scores compared to those without Heart Disease (see the sketch below).
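The boxplot code for In [18] was not captured. A minimal sketch, assuming the two health scores were plotted side by side against HeartDisease:
# Sketch of the missing In [18] cell: boxplots of the two health scores
# split by heart disease status.
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
sns.boxplot(data=df, x='HeartDisease', y='PhysicalHealth', ax=axes[0])
axes[0].set_title('Physical Health by Heart Disease')
sns.boxplot(data=df, x='HeartDisease', y='MentalHealth', ax=axes[1])
axes[1].set_title('Mental Health by Heart Disease')
plt.tight_layout()
plt.show()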
In [19]:
num_cols = df.select_dtypes(include=['float64', 'int64']).columns.tolist()

figsize = (16, 10)
fig = plt.figure(figsize=figsize)
for idx, col in enumerate(num_cols):
    ax = plt.subplot(2, 2, idx + 1)
    sns.kdeplot(
        data=df, hue='HeartDisease', fill=True,
        x=col, palette=['blue', 'red'], legend=False
    )

    ax.set_ylabel('')
    ax.spines['top'].set_visible(False)
    ax.set_xlabel('')
    ax.spines['right'].set_visible(False)
    ax.set_title(f'{col}', loc='right',
                 weight='bold', fontsize=20)

fig.suptitle('Features vs Target\n\n\n', ha='center',
             fontweight='bold', fontsize=25)
fig.legend([1, 0], loc='upper center', bbox_to_anchor=(0.5, 0.96), fontsize=25, ncol=2)  # ncol value truncated in source; 2 assumed
plt.tight_layout()
plt.show()
🔷Statistical Analysis
🔸ANOVA
➖This test can be used to determine whether there is a significant difference between the means of
three or more groups.
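For reference, a minimal sketch of how such a test could be run with scipy; comparing BMI across the AgeCategory groups is an assumption chosen for illustration. The notebook first checks the normality assumption below.
# Illustrative one-way ANOVA: BMI compared across AgeCategory groups
# (grouping chosen for illustration; ANOVA assumes normality within groups).
groups = [g['BMI'].values for _, g in df.groupby('AgeCategory')]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")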
In [21]:
stats.shapiro(df['BMI'])
Out[21]:
ShapiroResult(statistic=0.9270806908607483, pvalue=0.0)
- At a significance level of 0.05, BMI is not normally distributed, and hence ANOVA cannot be
used.
🔸Chi-square test
➖This test can be used to determine whether there is a significant association between two
categorical variables.
In [22]:
◻This suggests that there is a significant association between the variables being tested.
◻Therefore, we can reject the null hypothesis of no association and conclude that there is a
statistically significant relationship between the variables (see the sketch below).
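The code for In [22] was not captured. A minimal sketch of a chi-square test of independence; the Smoking vs HeartDisease pairing is an assumption for illustration:
# Sketch of the missing In [22] cell: chi-square test on a contingency
# table (the variable pair is assumed; the original choice was not captured).
contingency = pd.crosstab(df['Smoking'], df['HeartDisease'])
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}, dof = {dof}")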
🔸Kruskal-Wallis test
➖This test is a rank-based, non-parametric alternative to ANOVA that does not assume normality.
In [23]:
◻We can reject the null hypothesis of no difference and conclude that there is a statistically
significant difference between the groups based on their ranks (see the sketch below).
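The code for In [23] was not captured. A minimal sketch; grouping BMI by HeartDisease status is an assumption for illustration:
# Sketch of the missing In [23] cell: Kruskal-Wallis H-test comparing the
# rank distribution of BMI between the two HeartDisease groups (assumed).
groups = [g['BMI'].values for _, g in df.groupby('HeartDisease')]
h_stat, p_value = stats.kruskal(*groups)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")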
In [25]:
df
Out[25]:
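The encoding step (presumably a cell around In [24]) was not captured, but the df displayed above must already be numeric for the models below to train. A minimal sketch using the LabelEncoder imported at the top; the exact approach in the original notebook is unknown:
# Hypothetical reconstruction of the missing encoding step: map every
# object (categorical) column to integer codes so the classifiers can
# consume the data.
le = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
    df[col] = le.fit_transform(df[col])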
🔸Model Training
In [26]:
X = df.drop(['HeartDisease'], axis=1)
y = df['HeartDisease']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # random_state value truncated in source; 42 assumed
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Out[26]:
In [29]:
➖The high training score of the Random Forest Classifier suggests that the model may be
overfitting the training data.
➖However, the test score of 0.9063 is still a decent score, indicating that the model can predict the
target variable with 90.63% accuracy on unseen data (see the sketch below).
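The Random Forest cell (In [29]) was not captured, and RandomForestClassifier is not among the imports at the top. A minimal sketch consistent with the scores described above; the import and hyperparameters are assumptions:
# Sketch of the missing In [29] cell (import and settings assumed).
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(random_state=42)
model_rf.fit(X_train, y_train)
print("train score", model_rf.score(X_train, y_train))
print("test score", model_rf.score(X_test, y_test))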
In [34]:
➖The relatively high accuracy score suggests that the model is capturing the underlying patterns
in the data well.
➖The regularization technique used in the model helps to prevent overfitting and improve the
model's performance on new data (see the sketch below).
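The code for In [34] was not captured. Given the XGBClassifier import at the top and the mention of regularization, this cell plausibly trained an XGBoost model; a minimal sketch with assumed hyperparameters:
# Sketch of the missing In [34] cell: XGBoost applies L1/L2 regularization
# and shrinkage by default (hyperparameters here are assumptions).
model_xgb = XGBClassifier(eval_metric='logloss', random_state=42)
model_xgb.fit(X_train, y_train)
print("train score", model_xgb.score(X_train, y_train))
print("test score", model_xgb.score(X_test, y_test))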
In [30]:
model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)
print("train score", model_lr.score(X_train, y_train))
print("test score", model_lr.score(X_test, y_test))
➖The relatively high test score indicates that the model can predict the target variable with
91.45% accuracy on unseen data.
➖However, the train score is slightly higher than the test score, indicating that the model may be
slightly overfitting the training data.
➖This means that the model has learned the patterns in the training data too well and may not
generalize well to new, unseen data.
In [35]:
Precision: 0.835623224488493
Recall: 0.9141242937853107
F1 score: 0.8731128142530299
Precision:
➖The proportion of predicted positive cases that are actually positive. In this case, the precision
score of 0.8356 indicates that the model is quite accurate in predicting positive cases.
➖However, there is still room for improvement in correctly identifying all the positive cases.
Recall:
➖The proportion of actual positive cases that are correctly identified as positive by the model. The
recall score of 0.9141 suggests that the model is able to correctly identify most of the positive
cases.
F1 score:
➖The harmonic mean of precision and recall, providing a balanced measure of the two
metrics.
➖The F1 score of 0.8731 indicates that the model has achieved a reasonable balance between
precision and recall (see the sketch below for how these metrics are computed).
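The code for In [35] was not captured. A minimal sketch of how scikit-learn would produce the three numbers above, assuming the logistic regression model's predictions on the test set:
# Sketch of the missing In [35] cell (model choice assumed to be model_lr).
from sklearn.metrics import precision_score, recall_score, f1_score

y_pred = model_lr.predict(X_test)
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))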