Cse437 4

The project focuses on predicting heart attack risks using machine learning, employing a dataset of 4240 instances with 16 features. Various models were trained, with Naive Bayes emerging as the most effective in identifying heart disease cases despite challenges like class imbalance and low recall in other models. The study highlights the potential of machine learning in enhancing medical decision-making and emphasizes the need for further improvements and validation with medical experts.

Uploaded by

SHAHED ABDULLAH

CSE437
DATA SCIENCE: CODING WITH REAL WORLD DATA

Project Title

Heart Attack Prediction Using Machine Learning: A Data-Driven Approach to Identifying Risks

Group No. 04
Semester: Spring_2025
Submitted Date: 14-05-2025

Group members :
Name ID Section
Shahed Abdullah 21301128 01
Md. Samiel Islam Sami 21301002 02
Iffat Hoque Mithila 21301143 01
MD. Farhan Islam 21301254 02

Table of Contents:

Introduction
Dataset-Description
Imbalance Dataset
Exploratory data analysis
Dataset pre-processing
Dataset Splitting
Model Training and Testing
Model selection/Comparison analysis
Challenges
Future Improvements
Conclusion

Introduction

In an era of advanced medical technology, early detection of critical conditions such as heart
disease can significantly improve patient outcomes and reduce healthcare costs. Our research,
"Heart Attack Prediction Using Machine Learning: A Data-Driven Approach to
Identifying Risks," focuses on developing predictive models that evaluate a
person's risk of experiencing a heart attack based on their lifestyle and medical data.

Using a real-world medical dataset, our research applies a full machine learning pipeline:
data cleaning, exploratory analysis, feature encoding, scaling, and training of multiple
classification models. Our goal is to help the healthcare sector make faster, more consistent,
and data-driven judgements of heart attack risk.

Our approach not only aims to improve heart attack diagnostic accuracy but also supports
preventive care by identifying individuals who are at very high risk. Ultimately, the system
demonstrates the potential of machine learning to enhance medical decision making and patient
care.

Dataset-Description

● Total Features: There are 16 features in total. Of them, there are 15 input features
(gender, age, education, currentSmoker, cigsPerDay, BPMeds, prevalentStroke,
prevalentHyp, diabetes, totChol, sysBP, diaBP, BMI, heartRate, glucose) and 1 output
feature (Heart Disease (in next 10 years)).

● Total Data Points: 4240

● Problem Type: This is a Classification problem.

It is a classification problem because the target variable, Heart Disease (in the next 10
years), is categorical with two discrete classes (0 and 1). The goal is to predict which
class a patient belongs to based on their medical and lifestyle attributes.

● Feature Types:
○ Quantitative Features (Numerical): There are 14 numerical features. They are:
age, education, currentSmoker, cigsPerDay, BPMeds, prevalentStroke,
prevalentHyp, diabetes, totChol, sysBP, diaBP, BMI, heartRate,
glucose.

○ Categorical Features: There is only one categorical feature. It is: gender

● Correlation Insights:

○ The output feature has a weak negative correlation with education (-0.05).

○ prevalentHyp has a strong correlation with sysBP (0.70) and diaBP (0.62).
○ age is negatively correlated with cigsPerDay (-0.19) and currentSmoker (-0.21).
○ Strong inter-feature correlations (multicollinearity) are present between sysBP
and diaBP (0.78) and between cigsPerDay and currentSmoker (0.76).

● Interpretation of Correlation Test:

The correlation heatmap shows strong inter-feature correlations between sysBP and diaBP
and between cigsPerDay and currentSmoker. Such multicollinearity can inflate
variance in linear models like logistic regression and make coefficient interpretation
unstable. We also saw that the education feature has a weak negative correlation (-0.05)
with the output feature. Dropping the negatively correlated and strongly inter-correlated
features improved model performance.
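As an illustration of how strongly correlated feature pairs can be flagged, here is a minimal sketch. It runs on synthetic stand-in data, not the project's dataset. The helper name `correlated_pairs` and the 0.7 threshold are assumptions made for this example, not details taken from the report.

```python
import numpy as np

def correlated_pairs(data, names, threshold=0.7):
    """Return (name_a, name_b, r) for feature pairs with |Pearson r| >= threshold."""
    corr = np.corrcoef(data, rowvar=False)  # each column is one feature
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], round(float(corr[i, j]), 2)))
    return pairs

# Synthetic stand-in data (NOT the project's dataset).
rng = np.random.default_rng(0)
sys_bp = rng.normal(130, 20, size=500)
dia_bp = 0.5 * sys_bp + rng.normal(0, 5, size=500)  # constructed to track sys_bp
bmi = rng.normal(26, 4, size=500)                    # independent of both
data = np.column_stack([sys_bp, dia_bp, bmi])

pairs = correlated_pairs(data, ["sysBP", "diaBP", "BMI"])
print(pairs)
```

One feature from each flagged pair is then a candidate for removal before fitting a linear model.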

Imbalance Dataset
The dataset is imbalanced. The output feature Heart Disease (in next 10 years) has:

● 3596 instances of class 0

● 644 instances of class 1

This means class 0 significantly outweighs class 1, indicating an imbalance that may
affect classification model performance unless handled properly.
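To make the imbalance concrete, the sketch below computes the class proportions from the counts reported above, together with "balanced" class weights. The weighting formula, total / (number of classes × class count), is a common heuristic shown here for illustration. The report does not state that class weights were used.

```python
# Class counts as reported above.
counts = {0: 3596, 1: 644}
total = sum(counts.values())

# Share of each class in the dataset.
proportions = {c: n / total for c, n in counts.items()}

# "Balanced" heuristic: weight = total / (n_classes * class_count).
weights = {c: total / (len(counts) * n) for c, n in counts.items()}

# How many class-0 instances exist per class-1 instance.
imbalance_ratio = counts[0] / counts[1]

print(proportions, weights, imbalance_ratio)
```

Class 1 makes up about 15.2% of the data, a ratio of roughly 5.6 to 1 in favour of class 0.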

Exploratory Data Analysis

Distribution of Numerical Features:

Distribution of Categorical Feature (Gender):

Boxplots to Detect Outliers:

Correlation Between Features:

Dataset pre-processing
Null / Missing values
We found some null values, but could not find any duplicate rows.

Categorical Encoding

Feature Scaling
We used the Min-Max scaling technique to scale the data.

Fig : Feature scaling completed. Here are the first few rows after scaling
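Min-Max scaling maps each feature to the range [0, 1] using x' = (x - min) / (max - min). The sketch below is a minimal illustration on made-up numbers. Taking the minimum and maximum from the training rows only is an assumption of this example, since the report does not say at which stage scaling was applied. The helper name `min_max_scale` is hypothetical.

```python
import numpy as np

def min_max_scale(train, test):
    """Scale columns to [0, 1] using min and max taken from the training rows only."""
    lo = train.min(axis=0)
    hi = train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (train - lo) / span, (test - lo) / span

# Made-up values for two features (NOT the project's data).
train = np.array([[30.0, 100.0], [50.0, 200.0], [70.0, 150.0]])
test = np.array([[40.0, 250.0]])

train_scaled, test_scaled = min_max_scale(train, test)
print(train_scaled)
print(test_scaled)
```

Test values can fall outside [0, 1] when they lie beyond the training range, as the second test feature does here.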

Dataset Splitting

We split the dataset into 70% for training and 30% for testing.

Total Samples   4240 (100%)
Training Set    2968 (70%)
Testing Set     1272 (30%)
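The 70/30 split sizes above can be reproduced with a simple shuffled index split, sketched below. The seed, the helper name `split_indices`, and the use of a plain (non-stratified) shuffle are assumptions of this example. The report does not state how the split was drawn.

```python
import numpy as np

def split_indices(n_samples, test_fraction=0.3, seed=42):
    """Shuffle row indices and return (train_idx, test_idx)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]

train_idx, test_idx = split_indices(4240)
print(len(train_idx), len(test_idx))
```

With an imbalanced target, a stratified split is often preferred so that both parts keep a similar share of positive cases.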

Model Training and Testing

In our project, we implemented a supervised machine learning workflow to predict the likelihood
of heart disease within 10 years. We trained and tested the following models on a cleaned and
preprocessed medical dataset:

● Logistic Regression: A linear classification model used for binary or multiclass
problems. It estimates probabilities using the sigmoid function and predicts the class
based on a threshold (usually 0.5). Simple, interpretable, and works well with linearly
separable data.

● Decision Tree: A tree-based model that splits data into branches based on feature values
to make decisions. It handles both classification and regression tasks. Easy to interpret
but prone to overfitting unless pruned or ensembled (e.g., Random Forest).

● Naive Bayes: A probabilistic classifier based on Bayes' theorem, assuming feature
independence. Fast and efficient for high-dimensional data (e.g., text classification).
Common variants: Gaussian, Multinomial, and Bernoulli Naive Bayes.

● Neural Network (MLPClassifier): A Multi-Layer Perceptron (MLP) is a feedforward
artificial neural network with hidden layers. It learns complex patterns through
backpropagation and activation functions (ReLU, sigmoid). Powerful but requires large
data and tuning to avoid overfitting.

● KNN: A distance-based algorithm that classifies a sample based on the majority class
among its 'K' nearest neighbors. KNN is simple and intuitive, but sensitive to feature
scaling and may perform poorly on imbalanced and high-dimensional datasets. It is
particularly useful when decision boundaries are irregular.
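A training and evaluation loop over these five model families could look like the sketch below, which uses scikit-learn on synthetic, imbalanced stand-in data. The hyperparameters (`max_iter`, `n_neighbors`, `hidden_layer_sizes`), the choice of `GaussianNB`, the stratified split, and the random seeds are all assumptions for illustration. The report does not list the settings that were actually used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (NOT the project's dataset).
X, y = make_classification(n_samples=1000, n_features=15,
                           weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Fit the scaler on the training part only, then apply it to both parts.
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Hyperparameters below are illustrative guesses, not the report's settings.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                                    random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }

for name, scores in results.items():
    print(name, scores)
```

The numbers this prints describe the synthetic data only and are not comparable to the results reported for the project.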

Model selection/Comparison analysis

Accuracy, precision, recall, F1 score, and AUC comparison of all models (for classification):

No  Model                Accuracy  Precision  Recall    F1 Score  AUC
0   Logistic Regression  0.856132  0.700000   0.107692  0.186667  0.715492
1   KNN                  0.838836  0.413793   0.123077  0.189723  0.598436
2   Naive Bayes          0.829403  0.401786   0.230769  0.293160  0.707378
3   Neural Network       0.848270  0.555556   0.051282  0.093897  0.646563
4   Decision Tree        0.757862  0.227053   0.241026  0.233831  0.546232
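The metrics in the table follow from confusion-matrix counts. The sketch below shows the arithmetic for the Logistic Regression row. The four counts are back-calculated from the reported metrics and the 1272-sample test set, so they are an inference made for this example, not values read from the report's confusion matrix.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Counts inferred from the Logistic Regression row (an assumption, see above).
metrics = classification_metrics(tp=21, fp=9, fn=174, tn=1068)
print([round(m, 6) for m in metrics])
```

With these counts, only 21 of the 195 positive test cases are caught, which is why recall stays near 0.108 even though accuracy is about 0.856.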

Bar chart showcasing the prediction accuracy of all models (for classification)

Bar chart showcasing the comparison of each model's F1 Score, Precision and Recall

Confusion Matrix (for classification)

AUC score, ROC curve (for classification)

Result analysis: Among the evaluated models, Naive Bayes emerged as the most effective in
identifying heart disease cases.

● Naive Bayes had the highest F1 score (0.293) and the second-highest recall (0.231, just
behind Decision Tree at 0.241), making it the most suitable model in a medical context,
where identifying risky cases is more critical than overall accuracy.

● Logistic Regression (85.61%) and Neural Network (84.83%) achieved the highest
accuracy, but both models had low recall (0.108 and 0.051), meaning they often failed to
detect actual positive heart disease cases, which is highly risky for medical diagnostics.

● Decision Tree had the lowest accuracy (75.79%) but the highest recall (0.241) with
comparable precision (0.227), suggesting it can be a viable option after tuning.

● KNN had low recall (0.123) and F1 score (0.190) together with the second-lowest AUC
(0.598), so it offered no clear advantage in this context.

● AUC scores indicated that Logistic Regression (0.715) and Naive Bayes (0.707) separated
the classes best, with Neural Network third (0.647). The low recall of Logistic Regression
and Neural Network still limits their utility in high-stakes diagnosis.

Challenges
● Class Imbalance: The dataset was highly skewed toward "no heart disease" cases, which
made learning minority class patterns difficult.

● Low Recall: Most models failed to detect the actual positive cases effectively.

● Overfitting Risk: Especially in complex models like MLP, which require tuning.

● Model Tuning: Especially time consuming for Neural Networks due to multiple
hyperparameters.

Future Improvements
To improve performance and reliability, the following enhancements are recommended:

● Address Class Imbalance: Use SMOTE, undersampling, or class weights to help models
focus on the minority class.

● Feature Engineering: Remove redundant (highly correlated) features and introduce
domain-informed features.

● Cross-validation: Implement k-fold cross-validation for more robust and generalizable
model evaluation.

● Explainable AI: Incorporate SHAP or LIME to explain predictions of black-box models.

● Consult with Medical Experts: Validate predictions and insights with medical
professionals.
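As a sketch of the k-fold cross-validation suggested above, the code below builds stratified folds so that every fold keeps a similar share of positive cases. The helper name `stratified_kfold`, k = 5, and the seed are assumptions of this example. The label array only reuses the class counts from the report, and the row order is made up.

```python
import numpy as np

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs with similar class proportions per fold."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(labels):
        # Shuffle the rows of this class and deal them out across the k folds.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        for i, chunk in enumerate(np.array_split(idx, k)):
            folds[i].extend(chunk.tolist())
    for i in range(k):
        val = np.array(sorted(folds[i]))
        train = np.array(sorted(
            j for f, fold in enumerate(folds) if f != i for j in fold))
        yield train, val

# Class counts from the report; the row order is made up for this example.
labels = np.array([0] * 3596 + [1] * 644)
splits = list(stratified_kfold(labels, k=5))

for train, val in splits:
    print(len(train), len(val), int(labels[val].sum()))
```

Each model would then be trained on `train` and scored on `val` for every fold, and the fold scores averaged.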

Conclusion
Our project demonstrates the application of machine learning to predicting the risk
of heart disease using real-world medical and lifestyle data. While Logistic Regression and
Neural Networks provided high accuracy, they failed to capture positive cases effectively.
Despite a lower accuracy, Naive Bayes emerged as the best model overall, with the highest
F1-score and near-highest recall, which are crucial in the medical domain where missing
high-risk patients can be life threatening.

Machine learning has the potential to enhance early diagnosis, assist doctors in decision
making, and enable preventive healthcare interventions. However, the effectiveness of such
models depends on balanced data, careful validation, and continuous refinement. With further
tuning and integration of medical expertise, such predictive tools could become valuable assets
in modern healthcare systems.
