
CSE437
DATA SCIENCE: CODING WITH REAL WORLD DATA

Project Title

Heart Attack Prediction Using Machine Learning: A Data-Driven Approach to Identifying Risks

Group No. 04
Semester: Spring 2025
Submitted Date: 14-05-2025

Group members:
Name ID Section
Shahed Abdullah 21301128 01
Md. Samiel Islam Sami 21301002 02
Iffat Hoque Mithila 21301143 01
MD. Farhan Islam 21301254 02
Table of Contents:

Introduction

Dataset Description

Imbalanced Dataset

Exploratory Data Analysis

Dataset Pre-processing

Dataset Splitting

Model Training and Testing

Model Selection / Comparison Analysis

Challenges

Future Improvements

Conclusion

Introduction

In the world of advanced medical technologies, early detection of critical conditions like heart
disease can significantly improve patient outcomes and reduce healthcare costs. Our research,
"Heart Attack Prediction Using Machine Learning: A Data-Driven Approach to
Identifying Risks," focuses on developing predictive models that can evaluate a
person's risk of experiencing a heart attack based on their lifestyle and medical data.

Using a real-world medical dataset, our research applies a full machine learning pipeline:
data cleaning, exploratory analysis, feature encoding, scaling, and training of multiple
classification models. Our goal is to help the healthcare sector make faster, more
consistent, and data-driven judgements of heart attack risk.

Our approach not only aims to improve heart attack diagnostic accuracy but also supports
preventive care by identifying individuals who are at very high risk. Ultimately, the system
demonstrates the potential of machine learning to enhance medical decision making and patient
care.

Dataset Description

●​ Total Features: There are 16 features in total. Of them, there are 15 input features
(gender, age, education, currentSmoker, cigsPerDay, BPMeds, prevalentStroke,
prevalentHyp, diabetes, totChol, sysBP, diaBP, BMI, heartRate, glucose) and 1 output
feature (Heart Disease (in next 10 years)).​

●​ Total Data Points: 4240​

●​ Problem Type: This is a Classification problem.​


It is a classification problem because the target variable Heart Disease (in the next 10
years) is categorical with output discrete classes (0, 1). The goal is to predict which
output class a patient belongs to based on their medical and lifestyle attributes.​

●​ Feature Types:
○​ Quantitative Features (Numerical): There are 14 numerical features. They are:​
age, education, currentSmoker, cigsPerDay, BPMeds, prevalentStroke,
prevalentHyp, diabetes, totChol, sysBP, diaBP, BMI, heartRate, glucose.​

○​ Categorical Features: There is only one categorical feature. It is: gender



●​ Correlation Insights:​

○​ The output feature has a negative correlation with education (-0.05).


○​ prevalentHyp has a strong correlation with sysBP (0.70) and diaBP (0.62).
○​ age has a negative correlation with cigsPerDay (-0.19) and
currentSmoker (-0.21).
○​ Strong inter-feature correlations (multicollinearity) are present between sysBP
and diaBP (0.78) and between cigsPerDay and currentSmoker (0.76).

●​ Interpretation of Correlation Test:​


The correlation heatmap suggests that strong inter-feature correlations are present
between sysBP & diaBP and between cigsPerDay & currentSmoker. Such relationships can
inflate variance in linear models like logistic regression and make coefficient
interpretation unstable. We also saw that the feature education has a negative
correlation with the output feature. Dropping these negatively and strongly correlated
features improved our performance.
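The correlation-based pruning described above can be sketched as follows. This is a minimal illustration using pandas: the column names follow the report, but the data here is synthetic and the 0.7 threshold is an assumption (the project does not state which cutoff it used).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset: column names follow the report,
# but the values are random and for illustration only.
rng = np.random.default_rng(0)
n = 200
sysBP = rng.normal(130, 20, n)
smoker = rng.integers(0, 2, n)
df = pd.DataFrame({
    "sysBP": sysBP,
    "diaBP": 0.6 * sysBP + rng.normal(0, 5, n),       # strongly tied to sysBP
    "currentSmoker": smoker.astype(float),
    "cigsPerDay": smoker * rng.integers(5, 40, n),    # strongly tied to currentSmoker
    "education": rng.integers(1, 5, n).astype(float),
})

corr = df.corr()

# Flag the second feature of every pair whose |correlation| exceeds the threshold.
threshold = 0.7
to_drop = set()
cols = list(corr.columns)
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        if abs(corr.iloc[i, j]) > threshold:
            to_drop.add(cols[j])

reduced = df.drop(columns=sorted(to_drop))
print(sorted(to_drop))   # the redundant half of each correlated pair
```

Keeping one feature from each highly correlated pair preserves most of the information while reducing multicollinearity.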

Imbalanced Dataset
The dataset is imbalanced. The output feature Heart Disease (in next 10 years) has:

●​ 3596 instances of class 0


●​ 644 instances of class 1

This means class 0 significantly outweighs class 1, indicating an imbalance that may
affect classification model performance unless handled properly.
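The imbalance can be quantified directly from the reported counts; a short sketch with pandas:

```python
import pandas as pd

# Reconstructing the reported label distribution (3596 negatives, 644 positives).
y = pd.Series([0] * 3596 + [1] * 644, name="HeartDisease10y")

counts = y.value_counts()
ratio = counts[0] / counts[1]
print(counts.to_dict())                   # {0: 3596, 1: 644}
print(f"imbalance ratio = {ratio:.1f} : 1")   # about 5.6 : 1
```

A majority class roughly 5.6 times larger than the minority class is enough to bias accuracy-driven models toward always predicting "no heart disease".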

Exploratory Data Analysis

Distribution of Numerical Features:​

Distribution of Categorical Feature (Gender):



Boxplots to Detect Outliers:​



Correlation Between Features:

Dataset Pre-processing
Null / Missing values
We found some null values, but could not find any duplicate rows.
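The null and duplicate checks can be sketched as below. The data is a toy frame, and median imputation is one common choice; the report does not state which fill strategy the project actually used.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps similar to those found in the real data (values illustrative).
df = pd.DataFrame({
    "glucose": [77.0, np.nan, 85.0, 99.0],
    "BPMeds": [0.0, 0.0, np.nan, 1.0],
    "totChol": [195.0, 250.0, 245.0, np.nan],
})

print(df.isnull().sum().to_dict())   # null count per column
print(df.duplicated().sum())         # 0 duplicate rows, as in the report

# One common choice: fill numeric gaps with the column median.
df_filled = df.fillna(df.median())
```

Median imputation is robust to the outliers visible in the boxplots, which is why it is often preferred over the mean for clinical measurements.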

Categorical Encoding​
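Since gender is the only categorical feature, encoding reduces to a single binary mapping. A minimal sketch (the exact labels and 0/1 assignment used in the project are not stated, so this mapping is an assumption):

```python
import pandas as pd

# gender is the report's single categorical feature; map it to 0/1.
df = pd.DataFrame({"gender": ["M", "F", "F", "M"]})
df["gender"] = df["gender"].map({"F": 0, "M": 1})
print(df["gender"].tolist())   # [1, 0, 0, 1]
```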

Feature Scaling
We used the Min-Max scaling technique to scale the data.


Fig: First few rows of the dataset after Min-Max scaling.
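Min-Max scaling maps each feature to the [0, 1] range via (x - min) / (max - min). A small sketch with scikit-learn's MinMaxScaler on two illustrative columns:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two illustrative columns (e.g. age, sysBP); values are made up.
X = np.array([[40.0, 120.0],
              [60.0, 180.0],
              [50.0, 150.0]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)   # each column independently rescaled to [0, 1]
print(X_scaled)
```

Scaling matters most for distance-based models like KNN, where unscaled features with large ranges (e.g. sysBP) would otherwise dominate the distance computation.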

Dataset Splitting

We split the dataset into 70% for training and 30% for testing.

Total Samples: 4240 (100%)

Training Set: 2968 (70%)

Testing Set: 1272 (30%)
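The split can be reproduced with scikit-learn's train_test_split. The stratify option and random seed here are assumptions (the report does not say whether the project stratified), but they keep the class ratio equal in both splits, which is advisable for imbalanced data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays matching the reported 4240 samples and 3596/644 class counts.
X = np.arange(4240 * 2, dtype=float).reshape(4240, 2)
y = np.array([0] * 3596 + [1] * 644)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
print(len(X_train), len(X_test))   # 2968 1272
```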



Model Training and Testing


In our project, we implemented a supervised machine learning structure to predict the likelihood
of heart disease within 10 years. We trained and tested the following models on a cleaned and
preprocessed medical dataset:

●​ Logistic Regression: A linear classification model used for binary or multiclass


problems. It estimates probabilities using the sigmoid function and predicts the class
based on a threshold (usually 0.5). Simple, interpretable, and works well with linearly
separable data.​

●​ Decision Tree: A tree-based model that splits data into branches based on feature values
to make decisions. It handles both classification and regression tasks. Easy to interpret
but prone to overfitting unless pruned or ensembled (e.g., Random Forest).​

●​ Naive Bayes: A probabilistic classifier based on Bayes' theorem, assuming feature


independence. Fast and efficient for high-dimensional data (e.g. text classification).
Common variants: Gaussian, Multinomial, and Bernoulli Naive Bayes.​

●​ Neural Network (MLPClassifier): A Multi Layer Perceptron (MLP) is a feedforward


artificial neural network with hidden layers. It learns complex patterns through
backpropagation and activation functions (ReLU, sigmoid). Powerful but requires large
data and tuning to avoid overfitting.
●​ KNN: A distance-based algorithm that classifies a sample based on the majority class
among its 'K' nearest neighbors. KNN is simple and intuitive, but sensitive to feature
scaling and may perform poorly on imbalanced and high-dimensional datasets. It is
particularly useful when decision boundaries are irregular.
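The five-model training loop sketched above can be written compactly with scikit-learn. The hyperparameters below (max_iter values, n_neighbors=5, random seeds) are illustrative assumptions, and synthetic imbalanced data stands in for the preprocessed dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 85% negatives) standing in for the real dataset.
X, y = make_classification(n_samples=1000, n_features=15, weights=[0.85],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "Neural Network": MLPClassifier(max_iter=500, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")
```

Using a dict keeps the evaluation loop uniform, so the same metric code runs against every model.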

Model Selection / Comparison Analysis


Accuracy, precision, and recall comparison of all models (for classification):

No Model Accuracy Precision Recall F1 Score AUC

0 Logistic Regression 0.856132 0.700000 0.107692 0.186667 0.715492

1 KNN 0.838836 0.413793 0.123077 0.189723 0.598436

2 Naive Bayes 0.829403 0.401786 0.230796 0.293160 0.707378

3 Neural Network 0.848270 0.555556 0.051282 0.093897 0.646563

4 Decision Tree 0.757862 0.227053 0.241026 0.233831 0.546232
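The metrics in the table above come from standard scikit-learn functions. A tiny worked example with hand-checkable labels (these are not the project's predictions, just an illustration of how each score is computed):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Tiny worked example: 6 negatives, 4 positives.
y_true  = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])   # 1 FP, 2 TP, 2 FN
y_score = np.array([0.1, 0.2, 0.15, 0.3, 0.25, 0.6, 0.7, 0.8, 0.4, 0.35])

print("accuracy :", accuracy_score(y_true, y_pred))    # 7 correct of 10 -> 0.7
print("precision:", precision_score(y_true, y_pred))   # TP/(TP+FP) = 2/3
print("recall   :", recall_score(y_true, y_pred))      # TP/(TP+FN) = 2/4 = 0.5
print("f1       :", f1_score(y_true, y_pred))          # harmonic mean = 4/7
print("auc      :", roc_auc_score(y_true, y_score))    # uses scores, not labels
```

Note that AUC is computed from predicted probabilities (or decision scores), while the other four metrics use the thresholded 0/1 predictions.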

Bar chart showcasing the prediction accuracy of all models (for classification)

Bar Chart showcasing the comparison of each models F1 Score, Precision and Recall

Confusion Matrix (for classification)​
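A confusion matrix summarizes the four outcome counts behind precision and recall. A minimal sketch with illustrative labels (not the project's actual predictions):

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels: 5 negatives, 3 positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)   # 4 1 1 2
```

In this medical setting the fn count (actual heart-disease cases predicted as healthy) is the cell to watch, since it directly drives the low recall discussed below.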



AUC score, ROC curve (for classification)

Result analysis: From the evaluated models, Naive Bayes emerged as the most effective in
identifying heart disease cases.

●​ Naive Bayes had the highest F1 score and one of the highest recalls, making it the most
suitable model in a medical context, where identifying risky cases is more critical than just
overall accuracy.​

●​ Logistic Regression (85.61%) and Neural Network (84.83%) achieved the highest
accuracy, but both models had low recall, meaning they often failed to detect actual
positive heart disease cases, which is highly risky for medical diagnostics.​

●​ Decision Tree had a lower overall accuracy but maintained a reasonable balance between
recall and F1 score, suggesting it can be a viable option after tuning.​

●​ KNN underperformed across all major metrics, particularly in recall and F1-score,
making it the least effective model in this context.​

●​ AUC Scores indicated that Logistic Regression and Neural Network provided better
separation between classes, but again, their low recall limits their utility in high stakes
diagnosis.

Challenges
●​ Class Imbalance: The dataset was highly skewed toward "no heart disease" cases, which
made learning minority class patterns difficult.​

●​ Low Recall: Most models failed to detect the actual positive cases effectively.​

●​ Overfitting Risk: Especially in complex models like MLP, which require tuning.​

●​ Model Tuning: Especially time-consuming for Neural Networks due to multiple
hyperparameters.

Future Improvements
To improve performance and reliability, the following enhancements are recommended:

●​ Address Class Imbalance: Use SMOTE, undersampling, or class weights to help models
focus on the minority class.​

●​ Feature Engineering: Remove redundant features (highly correlated) and introduce


domain-informed features.​

●​ Cross-validation: Implement k-fold cross-validation for more robust and generalized


model evaluation.​

●​ Explainable AI: Incorporate SHAP or LIME to explain predictions of black-box models.​

●​ Consult with Medical Experts: Validate predictions and insights with medical
professionals.
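Two of these improvements, class weighting and k-fold cross-validation, can be combined in a few lines with scikit-learn. SMOTE itself requires the separate imbalanced-learn package, so this sketch uses the class_weight="balanced" alternative instead; the data is synthetic and the cv=5 choice is an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data (about 85% negatives) standing in for the real dataset.
X, y = make_classification(n_samples=1000, n_features=15, weights=[0.85],
                           random_state=0)

# class_weight="balanced" reweights the loss inversely to class frequency,
# pushing the model to pay attention to the minority (heart-disease) class.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")

# 5-fold cross-validation on recall gives a more robust estimate than one split.
scores = cross_val_score(clf, X, y, cv=5, scoring="recall")
print(f"mean CV recall = {scores.mean():.3f}")
```

Scoring on recall rather than accuracy directly targets the weakness identified in the results: models that look accurate while missing actual positive cases.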

Conclusion
Our project successfully demonstrates the application of machine learning in predicting the risk
of heart disease using real-world medical and lifestyle data. While Logistic Regression and
Neural Networks provided high accuracy, they failed to capture positive cases effectively.
Despite a lower accuracy, Naive Bayes emerged as the best model in terms of recall and
F1-score, which are crucial in the medical domain, where missing high-risk patients can be
life-threatening.

Machine learning has the potential to enhance early diagnosis, which assists doctors in decision
making, and enables preventive healthcare interventions. However, the effectiveness of such
models depends on balanced data, careful validation, and continuous refinement. With further
tuning and integration of medical expertise, such predictive tools could become valuable assets
in modern healthcare systems.
