Cse437 4
CSE437
DATA SCIENCE: CODING WITH REAL WORLD DATA
Project Title
Group No. 04
Semester: Spring_2025
Submitted Date: 14-05-2025
Group members:
Name ID Section
Shahed Abdullah 21301128 01
Md. Samiel Islam Sami 21301002 02
Iffat Hoque Mithila 21301143 01
MD. Farhan Islam 21301254 02
Table of Contents:
Introduction
Dataset-Description
Imbalance Dataset
Dataset pre-processing
Dataset Splitting
Challenges
Future Improvements
Conclusion
Introduction
In the world of advanced medical technologies, early detection of critical conditions like heart
disease can significantly improve patient outcomes and reduce healthcare costs. Our research,
"Heart Attack Prediction Using Machine Learning: A Data-Driven Approach to
Identifying Risks," focuses on developing predictive models that can evaluate a
person's risk of experiencing a heart attack based on their lifestyle and medical data.
Using a real-world medical dataset, our research applies a full machine learning pipeline:
data cleaning, exploratory analysis, feature encoding, scaling, and training of multiple
classification models. Our goal is to help the healthcare sector make faster, more consistent,
and data-driven judgements of heart attack risk.
Our approach not only aims to improve heart attack diagnostic accuracy but also supports
preventive care by identifying individuals who are at very high risk. Ultimately, the system
demonstrates the potential of machine learning to enhance medical decision making and patient
care.
Dataset-Description
● Total Features: There are 16 features in total. Of them, there are 15 input features
(gender, age, education, currentSmoker, cigsPerDay, BPMeds, prevalentStroke,
prevalentHyp, diabetes, totChol, sysBP, diaBP, BMI, heartRate, glucose) and 1 output
feature (Heart Disease (in next 10 years)).
● Feature Types:
○ Quantitative Features (Numerical): There are 14 numerical features. They are:
age, education, currentSmoker, cigsPerDay, BPMeds, prevalentStroke,
prevalentHyp, diabetes, totChol, sysBP, diaBP, BMI, heartRate, glucose.
● Correlation Insights:
Imbalance Dataset
The dataset is imbalanced. In the output feature Heart Disease (in next 10 years),
class 0 significantly outweighs class 1, an imbalance that may
affect classification model performance unless handled properly.
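The imbalance can be quantified by counting each class of the output feature and computing its share. The sketch below is a minimal, dependency-free illustration and not our actual notebook code; the label list is a hypothetical toy example, not the real class counts of our dataset.

```python
from collections import Counter

def class_distribution(labels):
    """Return {class: (count, fraction)} for a sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: (n, n / total) for cls, n in sorted(counts.items())}

def imbalance_ratio(labels):
    """Ratio of the majority class count to the minority class count."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)

# Hypothetical toy labels (NOT the real dataset):
# 0 = no heart disease in 10 years, 1 = heart disease in 10 years.
toy_labels = [0] * 8 + [1] * 2

dist = class_distribution(toy_labels)
ratio = imbalance_ratio(toy_labels)
for cls, (n, frac) in dist.items():
    print(f"class {cls}: {n} rows ({frac:.0%})")
print(f"majority/minority ratio: {ratio:.1f}")
```

A ratio far above 1 signals that accuracy alone will be a misleading metric for the minority class.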
Dataset pre-processing
Null / Missing values
We found some null values, but could not find any duplicate rows.
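A check of this kind can be sketched as follows. This is a dependency-free illustration under the assumption that each record is a dictionary and a missing value is stored as None; the rows are hypothetical toy values, and the helper names count_missing and count_duplicates are ours for illustration only.

```python
def count_missing(rows):
    """Count None values per column across a list of row dictionaries."""
    missing = {}
    for row in rows:
        for col, value in row.items():
            missing[col] = missing.get(col, 0) + (value is None)
    return missing

def count_duplicates(rows):
    """Count rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        # Sort by column name so the key does not depend on insertion order.
        key = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates

# Hypothetical toy rows (NOT taken from the real dataset).
toy_rows = [
    {"age": 39, "totChol": 195.0, "glucose": 77.0},
    {"age": 46, "totChol": None, "glucose": 76.0},
    {"age": 48, "totChol": 245.0, "glucose": None},
]

missing = count_missing(toy_rows)
duplicates = count_duplicates(toy_rows)
print("missing per column:", missing)
print("duplicate rows:", duplicates)
```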
Categorical Encoding
Feature Scaling
We used the Min-Max scaling technique to scale the data.
Fig: First few rows of the dataset after feature scaling.
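Min-Max scaling maps each feature to the range [0, 1] using x' = (x - min) / (max - min). The sketch below illustrates the formula on a hypothetical toy column; it is not our actual preprocessing code. The fit/apply separation is an assumption on our part, shown so that the minimum and maximum can be learned from training data only.

```python
def fit_min_max(values):
    """Learn the minimum and maximum of one feature column."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Scale values with x' = (x - lo) / (hi - lo); a constant column maps to 0.0."""
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical toy ages (NOT taken from the real dataset).
toy_ages = [30, 40, 50, 70]
lo, hi = fit_min_max(toy_ages)
scaled = apply_min_max(toy_ages, lo, hi)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

Values outside the fitted range (for example in a test split) fall below 0 or above 1, which is expected behaviour for this formula.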
Dataset Splitting
We split the dataset into 70% for training and 30% for testing.
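A 70/30 split can be sketched as below. Stratifying by class is an assumption on our part (it keeps the share of each class similar in both parts, which matters for an imbalanced target); the seed value and the toy labels are hypothetical.

```python
import random

def stratified_split(labels, test_fraction=0.3, seed=42):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical toy labels (NOT the real dataset): 80 negatives, 20 positives.
toy_labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```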
● Decision Tree: A tree-based model that splits data into branches based on feature values
to make decisions. It handles both classification and regression tasks. Easy to interpret
but prone to overfitting unless pruned or ensembled (e.g., Random Forest).
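To illustrate how a tree chooses a split on a feature value, the sketch below scores every candidate threshold of one feature by weighted Gini impurity and keeps the best one. Gini is assumed here as the split criterion (a common choice); the feature values and labels are hypothetical toy data.

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best 'value <= threshold' split."""
    best = (None, float("inf"))
    n = len(values)
    # Skip the largest value: it would leave the right branch empty.
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data (NOT from the real dataset): systolic BP and target.
toy_sysbp = [110, 120, 125, 150, 160, 170]
toy_target = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(toy_sysbp, toy_target)
print(threshold, impurity)  # 125 0.0
```

A full tree repeats this search over all features and recurses into each branch, which is why an unpruned tree can fit noise in the training data.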
Bar chart showcasing the prediction accuracy of all models (for classification)
Bar chart showcasing the comparison of each model's F1 Score, Precision and Recall
Result analysis: Among the evaluated models, Naive Bayes emerged as the most effective in
identifying heart disease cases.
● Naive Bayes had the highest recall and highest F1 Score, making it the most suitable
model in a medical context, where identifying risky cases is more critical than just overall
accuracy.
● Logistic Regression (85.61%) and Neural Network (84.51%) achieved the highest
accuracy, but both models had low recall, meaning they often failed to detect actual
positive heart disease cases, which is highly risky for medical diagnostics.
● Decision Tree had a lower overall accuracy but maintained a reasonable balance between
recall and F1 score, suggesting it can be a viable option after tuning.
● KNN underperformed across all major metrics, particularly in recall and F1-score, making
it the least effective model in this context.
● AUC scores indicated that Logistic Regression and Neural Network provided better
separation between classes, but again, their low recall limits their utility in high-stakes
diagnosis.
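The gap between accuracy and recall follows directly from how the metrics are defined on a confusion matrix. The sketch below computes them from raw counts; the counts are hypothetical and chosen only to show how a model can score high accuracy on an imbalanced test set while missing most positive cases. They are not results from our experiments.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts (NOT our results): 50 true positive cases, 250 negatives.
# The model finds only 5 of the 50 positives but gets almost all negatives right.
metrics = classification_metrics(tp=5, fp=5, fn=45, tn=245)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this toy case accuracy is about 0.83 while recall is only 0.10, which is the pattern described above for the high-accuracy models.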
Challenges
● Class Imbalance: The dataset was highly skewed toward "no heart disease" cases, which
made learning minority class patterns difficult.
● Low Recall: Most models failed to detect the actual positive cases effectively.
● Overfitting Risk: Especially in complex models like MLP, which require tuning.
● Model Tuning: Especially time consuming for Neural Networks due to multiple
hyperparameters.
Future Improvements
To improve performance and reliability, the following enhancements are recommended:
● Address Class Imbalance: Use SMOTE, undersampling, or class weights to help models
focus on the minority class.
● Consult with Medical Experts: Validate predictions and insights with medical
professionals.
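As an illustration of the class-weight option, the sketch below computes weights with the common "balanced" heuristic, n_samples / (n_classes * class_count), so that errors on the rarer class cost more during training. This is a proposal sketch, not something we have run on our data; the labels are hypothetical toy values.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Hypothetical toy labels (NOT the real dataset): 85 negatives, 15 positives.
toy_labels = [0] * 85 + [1] * 15
weights = balanced_class_weights(toy_labels)
for cls, w in sorted(weights.items()):
    print(f"class {cls}: weight {w:.3f}")
```

With these weights each class contributes the same total weight, so the minority class is no longer dominated by the majority class in the loss.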
Conclusion
Our project successfully demonstrates the application of machine learning in predicting the risk
of heart disease using real world medical and lifestyle data. While Logistic Regression and
Neural Networks provided high accuracy, they failed to capture positive cases effectively.
Despite a lower accuracy, Naive Bayes emerged as the best model in terms of recall and
F1-score, which are crucial in the medical domain where missing high-risk patients can be life
threatening.
Machine learning has the potential to enhance early diagnosis, assist doctors in decision
making, and enable preventive healthcare interventions. However, the effectiveness of such
models depends on balanced data, careful validation, and continuous refinement. With further
tuning and integration of medical expertise, such predictive tools could become valuable assets
in modern healthcare systems.