BDA Final
MACHINE LEARNING
A PROJECT REPORT
Submitted by
Pavethra M (621522205037)
Rubini N (621522205044)
Dhushara S (621522205015)
of
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
MAY 2025
ANNA UNIVERSITY: CHENNAI 600 025
BONAFIDE CERTIFICATE
Pavethra M (621522205037)
Rubini N (621522205044)
Dhushara S (621522205015)
SIGNATURE SIGNATURE
Dr. T. AKILA, M.E, Ph.D., Dr. T. AKILA, M.E, Ph.D.,
ASSOCIATE PROFESSOR, ASSOCIATE PROFESSOR,
HEAD OF THE DEPARTMENT, SUPERVISOR,
Department of Information Technology, Department of Information Technology,
Mahendra College of Engineering, Mahendra College of Engineering,
Minnampalli, Salem-636106. Minnampalli, Salem-636106.
ACKNOWLEDGEMENT
The success and final outcome of this project required a lot of guidance and assistance from
many people, and we are extremely fortunate to have received it throughout the completion of
our project work.
We owe our profound gratitude to our guide, Dr. T. AKILA, Head of the Department of
Information Technology, who took an interest in our project work and provided all the necessary
information for developing the project successfully. We also thank all the staff members of our
college and the technicians for their help in making this project a successful one.
Lastly, we would like to thank the Almighty and our parents for their moral support, and our
friends, with whom we shared our day-to-day experiences and from whom we received many
suggestions that improved the quality of our work.
ABSTRACT
Spam refers to unsolicited email, such as advertisements and unrelated or excessively
frequent messages. The volume of such email is increasing day by day. Studies show that
around 55 percent of all emails are some kind of spam. Service providers put a lot of
effort into filtering it, but spam keeps evolving by changing the obvious markers used
for detection. Moreover, a service provider's spam detection can never be too aggressive,
because a misclassification may cause the loss of legitimate information.
To tackle this problem, we present a new and efficient method to detect spam
using machine learning and natural language processing: a tool that can detect and
classify spam. In addition, it presents information about the supplied text in a
quick-view format for user convenience.
TABLE OF CONTENTS
ABSTRACT
1 INTRODUCTION
1.1 OVERVIEW
2 LITERATURE REVIEW
3 SYSTEM ANALYSIS
3.2 PROPOSED SYSTEM
4 SYSTEM REQUIREMENTS
5.6.1 ALGORITHM USED
6 UML DIAGRAMS
7 PERFORMANCE ANALYSIS
7.3 METHODOLOGY
8 CONCLUSION
9 FUTURE ENHANCEMENT
10 APPENDIX
11 BIBLIOGRAPHY
CHAPTER-1
INTRODUCTION
1.1 OVERVIEW
The scope of the Heart Disease Prediction Using Machine Learning project is to
develop an intelligent and data-driven system capable of analyzing patient health data to
predict the likelihood of heart disease. This system is designed to assist healthcare
professionals, researchers, and medical institutions in making more informed, timely, and
accurate diagnostic decisions. By utilizing machine learning algorithms and statistical
modeling techniques, the application can process multiple health indicators—such as age,
cholesterol level, blood pressure, heart rate, and chest pain type—to classify patients as
either at risk or not at risk of developing heart disease.
The project primarily focuses on building a predictive model that demonstrates the
potential of machine learning in the healthcare sector, specifically in the early diagnosis and
prevention of cardiovascular conditions. It does not include integration with electronic
health record (EHR) systems, real-time monitoring through wearable devices, or any form
of medical intervention.
In this study, the Heart Disease Prediction Using Machine Learning project aims
to empower healthcare providers and individuals by offering a reliable, intelligent system
capable of predicting the risk of heart disease based on clinical and physiological data. By
integrating core components such as data preprocessing, feature selection, predictive
modeling, and advanced machine learning algorithms into a cohesive and user-friendly
application, the project seeks to streamline the diagnostic process and support timely
medical intervention.
This system is designed to enhance the accuracy and efficiency of heart disease risk
assessment by delivering data-driven predictions, actionable insights, and real-time risk
evaluation. Through the use of predictive models and continual algorithmic learning, it helps
healthcare professionals identify high-risk individuals, prioritize medical care, and
potentially prevent critical cardiac events. Ultimately, the platform bridges the gap between
traditional diagnostic approaches and modern AI-powered healthcare, enabling smarter,
faster, and more proactive heart disease management for clinics, hospitals, and personal
health monitoring.
CHAPTER-2
LITERATURE REVIEW
2.3 TITLE: FEATURE SELECTION METHODS
Authors: Dr. Sanjay Desai
This literature review highlights key feature selection methods applied in heart disease
prediction tasks. Techniques such as correlation analysis, Recursive Feature Elimination
(RFE), Chi-square testing, and Principal Component Analysis (PCA) are discussed for their
effectiveness in identifying the most influential health attributes. Proper feature selection
improves model performance by reducing overfitting, simplifying models, and increasing
interpretability. The study concludes that optimized feature selection significantly contributes
to building robust and accurate predictive models.
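As a rough illustration of the correlation analysis mentioned above, the following sketch ranks features by the absolute value of their Pearson correlation with the target. The toy data and the helper names `pearson` and `rank_features` are assumptions made for this example; they are not taken from the reviewed study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, target):
    """Return (name, |r|) pairs sorted from most to least correlated."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: six patients, three features, binary target.
columns = {
    "age": [45, 61, 52, 70, 38, 66],
    "chol": [210, 290, 240, 310, 200, 280],
    "const": [1, 1, 1, 1, 1, 1],
}
target = [0, 1, 0, 1, 0, 1]

ranking = rank_features(columns, target)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Features at the bottom of such a ranking are candidates for removal. Correlation only captures linear, one-feature-at-a-time relationships, which is one reason the review also lists RFE, Chi-square testing, and PCA.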
CHAPTER-3
SYSTEM ANALYSIS
3.2 PROPOSED SYSTEM
The proposed system introduces a machine learning-based predictive model that uses
patient data to assess the risk of heart disease. This intelligent system is designed to automate
the diagnostic process by learning from historical medical records and identifying complex
patterns that may not be evident through traditional analysis.
By incorporating advanced machine learning algorithms and techniques like data
preprocessing, feature selection, and model optimization, the system offers a faster, more
accurate, and scalable solution for heart disease risk prediction. It supports clinicians by
providing data-driven insights and serves as a decision-support tool for early diagnosis and
treatment planning.
CHAPTER-4
SYSTEM REQUIREMENTS
CHAPTER-5
Data for heart disease prediction was collected from well-known and publicly
available medical datasets, including the UCI Heart Disease Dataset, Cleveland Heart
Disease Dataset, and additional sources from Kaggle. These datasets consist of
anonymized patient records containing various clinical attributes such as age, sex, chest
pain type, resting blood pressure, cholesterol levels, fasting blood sugar, ECG results,
maximum heart rate, and exercise-induced angina. The data was cleaned and checked for
missing values and inconsistencies to ensure reliability for model training.
Pre-processing involved several critical steps to prepare the data for machine learning
models:
Handling Missing Values: Missing or null values were imputed using mean,
median, or mode strategies depending on the feature type.
Encoding Categorical Data: Features such as chest pain type and thalassemia were
encoded using one-hot or label encoding.
Feature Scaling: Continuous variables were normalized using standardization (Z-
score normalization) to bring all attributes to a similar scale.
Data Splitting: The dataset was split into training and testing subsets, typically using
an 80:20 or 70:30 ratio, ensuring random and unbiased distribution.
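The pre-processing steps listed above can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not the project's actual pipeline: the helper names (`impute_mean`, `one_hot`, `fit_scaler`, `apply_scaler`, `train_test_split`), the sample values, and the fixed random seed are all hypothetical.

```python
import random
import statistics


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over the known categories."""
    return [1 if value == category else 0 for category in categories]


def fit_scaler(values):
    """Compute the mean and (population) standard deviation of a feature."""
    return statistics.fmean(values), statistics.pstdev(values)


def apply_scaler(values, mean, std):
    """Z-score normalization: (x - mean) / std."""
    return [(v - mean) / std for v in values]


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and hold out the last test_ratio share."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical cholesterol readings with one missing value.
chol = [210, None, 240, 310, 200]
chol_filled = impute_mean(chol)

# Hypothetical chest pain categories.
cp_categories = ["typical", "atypical", "non-anginal", "asymptomatic"]
cp_encoded = one_hot("atypical", cp_categories)

mean, std = fit_scaler(chol_filled)
chol_scaled = apply_scaler(chol_filled, mean, std)

train, test = train_test_split(list(range(10)), test_ratio=0.2)
```

Keeping `fit_scaler` separate from `apply_scaler` makes it possible to compute the mean and standard deviation on the training subset only and then reuse them for the test subset and for new inputs.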
The testing dataset, extracted from the main dataset, underwent the same
preprocessing steps as the training data. It was reserved exclusively for evaluating the
generalization performance of the machine learning models. Evaluation metrics such as
accuracy, precision, recall, F1-score, and ROC-AUC were calculated on the test set to
validate the model's ability to correctly predict the presence or absence of heart disease.
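To make the evaluation metrics concrete, the sketch below computes accuracy, precision, recall, F1-score, and ROC-AUC for a small set of predictions. The labels and scores are invented for illustration and are not results from this project; the function names are likewise assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = disease)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 derived from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities for eight test patients.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Recall is usually the metric to watch in a screening setting, since a false negative means a patient at risk is missed.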
Each model was evaluated on the same test dataset using standard classification metrics:
Naive Bayes: Quick and interpretable but slightly lower accuracy.
Random Forest: Provided strong accuracy and good interpretability via feature importance.
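For reference, a Gaussian Naive Bayes classifier is small enough to sketch from scratch. The class below is an illustrative implementation under assumptions (continuous features, a small variance floor, made-up training data); the report does not state which Naive Bayes variant or library was actually used.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Naive Bayes with one Gaussian per feature and class."""

    def fit(self, X, y):
        by_class = defaultdict(list)
        for row, label in zip(X, y):
            by_class[label].append(row)
        self.priors = {}
        self.stats = {}
        for label, rows in by_class.items():
            self.priors[label] = len(rows) / len(X)
            self.stats[label] = []
            for column in zip(*rows):
                mean = sum(column) / len(column)
                var = sum((v - mean) ** 2 for v in column) / len(column)
                # Small floor avoids division by zero for constant features.
                self.stats[label].append((mean, var + 1e-9))
        return self

    def _log_posterior(self, row, label):
        """Unnormalized log posterior: log prior plus per-feature log likelihoods."""
        total = math.log(self.priors[label])
        for value, (mean, var) in zip(row, self.stats[label]):
            total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return total

    def predict(self, X):
        return [
            max(self.priors, key=lambda label: self._log_posterior(row, label))
            for row in X
        ]


# Hypothetical toy data: [age, maximum heart rate]; 1 = heart disease.
X_train = [[63, 120], [67, 110], [70, 105], [41, 172], [37, 187], [45, 165]]
y_train = [1, 1, 1, 0, 0, 0]

classifier = GaussianNaiveBayes().fit(X_train, y_train)
predictions = classifier.predict([[65, 115], [40, 175]])
```

The per-class means and variances stored in `stats` are the whole model, which is why Naive Bayes is quick to train and easy to inspect.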
CHAPTER-6
UML DIAGRAMS
6.2 USE CASE DIAGRAM
A use case diagram is a diagram that shows a set of use cases and actors and
their relationships. A use case diagram is just a special kind of diagram and shares the
same common properties as do all other diagrams, i.e., a name and graphical contents
that are a projection into a model. What distinguishes a use case diagram from all other
kinds of diagrams is its particular content.
6.3 ACTIVITY DIAGRAM
An activity diagram shows the flow from activity to activity. An activity is an
ongoing non-atomic execution within a state machine. An activity diagram is basically a
projection of the elements found in an activity graph, a special case of a state machine in
which all or most states are activity states and in which all or most transitions are triggered
by completion of activities in the source.
CHAPTER-7
PERFORMANCE ANALYSIS
7.3 METHODOLOGY
The heart disease prediction system was developed using the following systematic approach:
1. Data Collection & Preprocessing
o Collected labeled patient data from public datasets.
o Cleaned data and handled missing values.
o Encoded categorical variables and normalized features.
o Addressed class imbalance using SMOTE.
2. Model Training
o Split the dataset into 70% training and 30% testing sets.
o Trained multiple machine learning models including Logistic Regression,
Naive Bayes, Random Forest, SVM, and Neural Networks.
o Performed cross-validation to ensure generalization.
3. Evaluation Metrics
o Measured model performance using Confusion Matrix, Accuracy, Precision,
Recall, F1-Score, and ROC-AUC.
4. Deployment (Optional)
o The best-performing model can be deployed as a web application using
Streamlit or Flask to allow real-time risk assessment.
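The cross-validation step in the methodology above can be sketched as follows. This is a simplified illustration under assumptions: the data is taken to be already shuffled, the folds are contiguous, and the majority-class baseline stands in for a real model. The function names and labels are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train_idx, test_idx) over k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Hypothetical labels for ten patients (1 = heart disease).
labels = [0, 1, 0, 1, 1, 0, 1, 1, 0, 1]


def majority_baseline(train_idx, test_idx):
    """Stand-in model: predict the most frequent training label for every test row."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    correct = sum(1 for i in test_idx if labels[i] == majority)
    return correct / len(test_idx)


folds = list(k_fold_indices(len(labels), 3))
mean_accuracy = cross_validate(len(labels), 3, majority_baseline)
```

When oversampling such as SMOTE is combined with cross-validation, it is generally applied to the training indices of each fold only, so that synthetic samples do not leak into the fold used for testing.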
Conclusion:
Best Performing Model: Neural Network with 92.1% accuracy, offering robust
predictions for complex feature interactions.
Best Trade-off Model: Random Forest with 90.5% accuracy and high
interpretability.
Future Improvements: Integration of deep learning models such as CNN or
transformer-based architectures, and incorporation of real-time clinical data for
dynamic risk prediction.
CHAPTER-8
CONCLUSION
The primary objective of this project was to develop an intelligent and automated system
capable of predicting the likelihood of heart disease using machine learning techniques. By
leveraging clinical data and applying supervised learning algorithms, the system provides a
reliable tool to support early diagnosis and preventive healthcare.
The integration of data preprocessing techniques, relevant feature selection, and
advanced machine learning models—such as Logistic Regression, Random Forest, Support
Vector Machine (SVM), and Neural Networks—enabled the system to deliver high accuracy
in identifying potential heart disease cases. Neural Networks showed the best performance in
terms of accuracy, while Random Forest offered a good balance between precision and
interpretability.
This project demonstrates the potential of machine learning in the medical field,
particularly for early diagnosis and decision support. The ability to analyze patient data and
provide accurate predictions can assist healthcare professionals in making timely and informed
decisions, ultimately improving patient outcomes and reducing the risk of complications.
In conclusion, the machine learning-based heart disease prediction system offers a
scalable, efficient, and user-friendly solution for medical diagnosis. With continuous training,
access to updated medical data, and integration into clinical environments, this system can
significantly enhance preventive healthcare and contribute to better cardiovascular disease
management.
CHAPTER-9
FUTURE ENHANCEMENT
The current implementation of the Heart Disease Prediction system demonstrates the
practical use of machine learning in medical diagnosis. However, to further enhance its
accuracy, usability, and impact in clinical settings, several future enhancements can be
considered:
1. Integration of Advanced Deep Learning Models
Future versions can incorporate deep learning architectures such as Recurrent Neural
Networks (RNNs), Long Short-Term Memory (LSTM), and transformer-based
models (like BERT for clinical text) to capture complex patterns and temporal health
data trends more effectively.
2. Real-Time Risk Monitoring
Enabling real-time heart disease risk prediction using continuous health monitoring
data (e.g., from wearable devices) can facilitate early intervention and personalized
care.
3. Incorporation of Electronic Health Records (EHRs)
Integrating patient history from EHRs, including medications, previous diagnoses,
and family history, can improve prediction accuracy and model comprehensiveness.
4. Mobile and Web-Based Health Applications
Developing mobile and web apps will allow easy access to heart disease risk
assessments, enabling users and healthcare providers to interact with the model
remotely and conveniently.
5. Adaptive Learning and Model Updates
Building a self-improving system that incorporates new medical data and user
feedback can help maintain accuracy and adapt to evolving clinical practices.
CHAPTER-10
APPENDIX
App.py
import pickle

import pandas as pd
import streamlit as st

model_filename = './model/model.pkl'
# Assumed path: the original listing uses mean_std_values without showing
# where the saved training mean and standard deviation are loaded from.
mean_std_filename = './model/mean_std_values.pkl'

# Load the trained classifier and the saved scaling statistics.
with open(model_filename, 'rb') as file:
    model = pickle.load(file)
with open(mean_std_filename, 'rb') as file:
    mean_std_values = pickle.load(file)


def main():
    st.title('Heart Disease Prediction')

    # Collect the clinical attributes. Categorical choices are mapped to
    # integer codes by their position in the option list.
    age = st.slider('Age', 18, 100, 50)
    sex_options = ['Male', 'Female']
    sex = st.selectbox('Sex', sex_options)
    sex_num = 1 if sex == 'Male' else 0
    cp_options = ['Typical Angina', 'Atypical Angina', 'Non-anginal Pain', 'Asymptomatic']
    cp = st.selectbox('Chest Pain Type', cp_options)
    cp_num = cp_options.index(cp)
    trestbps = st.slider('Resting Blood Pressure', 90, 200, 120)
    chol = st.slider('Cholesterol', 100, 600, 250)
    fbs_options = ['False', 'True']
    fbs = st.selectbox('Fasting Blood Sugar > 120 mg/dl', fbs_options)
    fbs_num = fbs_options.index(fbs)
    restecg_options = ['Normal', 'ST-T Abnormality', 'Left Ventricular Hypertrophy']
    restecg = st.selectbox('Resting Electrocardiographic Results', restecg_options)
    restecg_num = restecg_options.index(restecg)
    thalach = st.slider('Maximum Heart Rate Achieved', 70, 220, 150)
    exang_options = ['No', 'Yes']
    exang = st.selectbox('Exercise Induced Angina', exang_options)
    exang_num = exang_options.index(exang)
    oldpeak = st.slider('ST Depression Induced by Exercise Relative to Rest', 0.0, 6.2, 1.0)
    slope_options = ['Upsloping', 'Flat', 'Downsloping']
    slope = st.selectbox('Slope of the Peak Exercise ST Segment', slope_options)
    slope_num = slope_options.index(slope)
    ca = st.slider('Number of Major Vessels Colored by Fluoroscopy', 0, 4, 1)
    thal_options = ['Normal', 'Fixed Defect', 'Reversible Defect']
    thal = st.selectbox('Thalassemia', thal_options)
    thal_num = thal_options.index(thal)

    if st.button('Predict'):
        user_input = pd.DataFrame(data={
            'age': [age],
            'sex': [sex_num],
            'cp': [cp_num],
            'trestbps': [trestbps],
            'chol': [chol],
            'fbs': [fbs_num],
            'restecg': [restecg_num],
            'thalach': [thalach],
            'exang': [exang_num],
            'oldpeak': [oldpeak],
            'slope': [slope_num],
            'ca': [ca],
            'thal': [thal_num]
        })
        # Apply saved transformation to new data
        user_input = (user_input - mean_std_values['mean']) / mean_std_values['std']
        prediction = model.predict(user_input)
        prediction_proba = model.predict_proba(user_input)
        # Confidence is the predicted probability of the chosen class.
        confidence = prediction_proba[0].max()
        if prediction[0] == 1:
            bg_color = 'red'
            prediction_result = 'Positive'
        else:
            bg_color = 'green'
            prediction_result = 'Negative'
        st.markdown(
            f"<p style='background-color:{bg_color}; color:white; padding:10px;'>"
            f"Prediction: {prediction_result}<br>"
            f"Confidence: {confidence * 100:.2f}%</p>",
            unsafe_allow_html=True,
        )


if __name__ == '__main__':
    main()
CHAPTER-11
BIBLIOGRAPHY
REFERENCES
1. P. K. Anooj, "Clinical decision support system: Risk level prediction of heart disease
using weighted fuzzy rules", Journal of King Saud University – Computer and
Information Sciences (2012) 24, 27–40.
2. Nidhi Bhatla, Kiran Jyoti, "An Analysis of Heart Disease Prediction using Different
Data Mining Techniques", International Journal of Engineering Research &
Technology.
3. Jyoti Soni, Ujma Ansari, Dipesh Sharma, Sunita Soni, "Predictive Data Mining for
Medical Diagnosis: An Overview of Heart Disease Prediction".
6. M. Anbarasi, E. Anupriya, N. Ch. S. N. Iyengar, "Enhanced Prediction of Heart Disease
with Feature Subset Selection using Genetic Algorithm", International Journal of
Engineering Science and Technology, Vol. 2(10), 2010.
10. Shadab Adam Pattekari and Asma Parveen, "PREDICTION SYSTEM FOR HEART
DISEASE USING NAIVE BAYES", International Journal of Advanced Computer and
Mathematical Sciences, ISSN 2230-9624, Vol 3, Issue 3, 2012, pp 290-294.