0% found this document useful (0 votes)
19 views10 pages

Lab Report Content - 15marks

Uploaded by

absulazizdns999
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
19 views10 pages

Lab Report Content - 15marks

Uploaded by

absulazizdns999
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 10

COMP 453 Data Science 1st semester 2024/2025

Jazan University
Lab Project Report
KSA

Title of the Project: Heart Disease Prediction

Made by
NAME ID
Abdulaziz Abdulrab 202109135
Mohammed aldarbi 202202418
COMP 453 Data Science 1st semester 2024/2025
Jazan University
Lab Project Report
KSA

I. Abstract

Heart disease is one of the leading causes of death worldwide, and its early detection can
significantly improve patient outcomes. This study utilizes machine learning techniques to
predict the presence of heart disease based on a dataset of health metrics. The dataset, heart.csv,
contains 303 patient records with 14 attributes, including age, sex, chest pain type, blood
pressure, cholesterol levels, and others. The primary objective is to build a machine learning
model to classify patients into two categories: presence or absence of heart disease. By
leveraging data preprocessing, exploratory analysis, and machine learning algorithms, this study
aims to achieve accurate predictions and provide valuable insights into important features
influencing heart disease.

II. Introduction

A. The Problem and Its Impacts


Heart disease is a significant global health challenge, accounting for millions of deaths annually.
Early detection and intervention are critical for reducing mortality rates. However, traditional
diagnostic methods are time-consuming, expensive, and sometimes subjective. Therefore, there
is a pressing need for automated, data-driven solutions that can assist healthcare professionals in
identifying high-risk patients quickly and accurately.

B. Benefits of Using Machine Learning and Data Science


Machine learning (ML) and data science provide powerful tools to analyze large datasets and
uncover hidden patterns that may not be immediately apparent through traditional analysis. ML
algorithms can process complex, multidimensional data and make accurate predictions,
improving diagnostic accuracy. Additionally, ML models can be deployed in real-time systems,
enabling faster decision-making and resource optimization in healthcare systems.
C. Methodology and Contribution
This study uses a systematic approach to predict heart disease:
1. Data Preprocessing: Handle missing values, normalize/scale numerical features, and
encode categorical data.
2. Exploratory Data Analysis (EDA): Visualize the relationships between features and the
target variable.
3. Model Development: Build and evaluate multiple machine learning models, including
Logistic Regression, Decision Tree, and Random Forest.
4. Contribution: This project demonstrates how machine learning can be applied to real-
world healthcare problems, helping to improve diagnostic systems and guide preventive
measures.

III. Description of Dataset and Preprocessing


A. Dataset Description
COMP 453 Data Science 1st semester 2024/2025
Jazan University
Lab Project Report
KSA

The dataset heart.csv contains 303 samples with 14 attributes. The features and their descriptions
are as follows:
Feature Description
age Age of the patient
sex Gender (1 = male, 0 = female)
cp Chest pain type (0 = typical angina, 3 = asymptomatic)
trestbps Resting blood pressure (mm Hg)
chol Serum cholesterol (mg/dl)
fbs Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg Resting electrocardiographic results (0, 1, 2)
thalach Maximum heart rate achieved
exang Exercise-induced angina (1 = yes; 0 = no)
oldpeak ST depression induced by exercise
slope The slope of the peak exercise ST segment
ca Number of major vessels colored by fluoroscopy (0–3)
thal Thalassemia (0 = normal, 1 = fixed defect, 2 = reversible defect)
target Presence of heart disease (1 = presence, 0 = absence)
Dataset Summary Statistics

age trestbps chol thalach oldpeak target


Count 303 303 303 303 303 303
Mean 54.37 131.62 246.26 149.65 1.04 0.544
Std Dev 9.08 17.54 51.83 22.90 1.16 0.498
Min 29 94 126 71 0.0 0
Max 77 200 564 202 6.2 1
Visualizations
1. Age Distribution: The majority of patients are between 45 and 60 years old.
![Age Distribution][]2. Correlation Heatmap: A heatmap reveals key correlations between
features, such as the negative correlation between thalach (maximum heart rate) and the presence
of heart disease.3. Target Class Distribution: The dataset is balanced, with 54.4% of samples
indicating the presence of heart disease.---### B. Preprocessing Methods1. Handling Missing
Values: - No missing values were found in the dataset. 2. Encoding Categorical Variables: -
Features like sex, cp, and thal were encoded using one-hot encoding.3.
Normalization/Standardization: - Numerical features such as trestbps, chol, and thalach were
scaled to standardize their range for better model performance.4. Feature Selection: -
Correlation analysis and domain knowledge were used to identify key features influencing heart
disease.---

## IV. Proposed Model and Evaluation###

A. Model DescriptionThe following machine learning algorithms were used:1. Logistic


Regression: - A linear classification algorithm that predicts probabilities based on a
COMP 453 Data Science 1st semester 2024/2025
Jazan University
Lab Project Report
KSA

sigmoid function.2. Decision Tree: - A tree-based model that splits the data into branches
based on feature thresholds.3. Random Forest: - An ensemble method that builds
multiple decision trees and averages their predictions for robustness.#### Pseudocode
for Random Forest:1. For each tree in the forest: a. Select a random subset of data and
features. b. Build a decision tree using the subset.2. Average the predictions from all trees
to make the final prediction.###

B. Evaluation Metrics
and ResultsThe models were evaluated using the following metrics:- Accuracy:
Percent of correct predictions.- Precision: True positives divided by all predicted
positives.- Recall: True positives divided by all actual positives.- F1 Score:
Harmonic mean of precision and recall.#### Results:| Model | Accuracy | Precision |
Recall | F1 Score ||---------------------|----------|-----------|--------|----------|| Logistic
Regression | 85% | 84% | 86% | 85% || Decision Tree | 83% | 82% | 84% | 83% ||
Random Forest | 88% | 87% | 89% | 88% |---##

V. Results Analysis and Discussion###

A. Discussion- The Random Forest model performed the best, achieving an accuracy
of 88%, followed by Logistic Regression at 85%. - Features such as thalach, cp, and
oldpeak were identified as the most important predictors.- Logistic Regression
provided a simpler model, but the Random Forest outperformed it due to its ability to
handle non-linear relationships.###

B. Optimal Decision

The Random Forest model is recommended for deployment due to its high accuracy
and robustness. However, Logistic Regression could be used in low-resource
environments due to its simplicity.---##

VI. Conclusion

This study successfully developed a machine learning model to predict heart


disease using a dataset of health metrics. The Random Forest model achieved
the best performance, with an accuracy of 88%. The results demonstrate the
potential of machine learning in healthcare for early disease detection. Future
work could involve expanding the dataset, incorporating additional features,
and exploring deep learning methods.
COMP 453 Data Science 1st semester 2024/2025
Jazan University
Lab Project Report
KSA

The code
I. Importing essential libraries

[]
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os
print(os.listdir())
import warnings
warnings.filterwarnings('ignore')

II. Importing and understanding our dataset


dataset = pd.read_csv("heart.csv")

First, analysing the target variable:


y = dataset["target"]

sns.countplot(y)

target_temp = dataset.target.value_counts()

print(target_temp)
COMP 453 Data Science 1st semester 2024/2025
Jazan University
Lab Project Report
KSA

print("Percentage of patience without heart problems: "+str(round(target_temp[0]*100/303,2)))


print("Percentage of patience with heart problems: "+str(round(target_temp[1]*100/303,2)))

#Alternatively,
# print("Percentage of patience with heart problems: "+str(y.where(y==1).count()*100/303))
# print("Percentage of patience with heart problems: "+str(y.where(y==0).count()*100/303))

# #Or,
# countNoDisease = len(df[df.target == 0])
# countHaveDisease = len(df[df.target == 1])

Analysing the 'Sex' feature


dataset["sex"].unique()
sns.barplot(x=dataset["sex"], y=y) # Specify x and y as keyword arguments
sns.barplot(x=dataset["sex"], y=y) # Specify x and y as keyword arguments

Analysing the 'Chest Pain Type' feature


dataset["cp"].unique()
sns.barplot(x=dataset["cp"], y=y)

Analysing the FBS feature


dataset["fbs"].describe()
dataset["fbs"].unique()
sns.barplot(x=dataset["fbs"], y=y)

Analysing the restecg feature


dataset["restecg"].unique()
sns.barplot(x=dataset["restecg"], y=y)

Analysing the 'exang' feature


dataset["exang"].unique()
sns.barplot(x="exang", y=y, data=dataset)

Analysing the Slope feature


dataset["slope"].unique()
sns.barplot(x=dataset["slope"], y=y, data=dataset)

Analysing the 'ca' feature


dataset["ca"].unique()
sns.countplot(x=dataset["ca"])

sns.barplot(x="ca", y=y.name, data=dataset, palette=custom_palette)

dataset["thal"].unique()
sns.distplot(dataset["thal"])
COMP 453 Data Science 1st semester 2024/2025
Jazan University
Lab Project Report
KSA

IV. Train Test split


from sklearn.model_selection import train_test_split

predictors = dataset.drop("target",axis=1)
target = dataset["target"]

X_train,X_test,Y_train,Y_test = train_test_split(predictors,target,test_size=0.20,random_state=0)

V. Model Fitting
from sklearn.metrics import accuracy_score

Logistic Regression
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(X_train,Y_train)

Y_pred_lr = lr.predict(X_test)
Y_pred_lr.shape
score_lr = round(accuracy_score(Y_pred_lr,Y_test)*100,2)

print("The accuracy score achieved using Logistic Regression is: "+str(score_lr)+" %")

Naive Bayes
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()

nb.fit(X_train,Y_train)

Y_pred_nb = nb.predict(X_test)
Y_pred_nb.shape
score_nb = round(accuracy_score(Y_pred_nb,Y_test)*100,2)

print("The accuracy score achieved using Naive Bayes is: "+str(score_nb)+" %")

SVM
from sklearn import svm

sv = svm.SVC(kernel='linear')

sv.fit(X_train, Y_train)
COMP 453 Data Science 1st semester 2024/2025
Jazan University
Lab Project Report
KSA

Y_pred_svm = sv.predict(X_test)
Y_pred_svm.shape
score_svm = round(accuracy_score(Y_pred_svm,Y_test)*100,2)

print("The accuracy score achieved using Linear SVM is: "+str(score_svm)+" %")

K Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train,Y_train)
Y_pred_knn=knn.predict(X_test)
Y_pred_knn.shape

score_knn = round(accuracy_score(Y_pred_knn,Y_test)*100,2)

print("The accuracy score achieved using KNN is: "+str(score_knn)+" %")

Decision Tree
from sklearn.tree import DecisionTreeClassifier

max_accuracy = 0

for x in range(200):
dt = DecisionTreeClassifier(random_state=x)
dt.fit(X_train,Y_train)
Y_pred_dt = dt.predict(X_test)
current_accuracy = round(accuracy_score(Y_pred_dt,Y_test)*100,2)
if(current_accuracy>max_accuracy):
max_accuracy = current_accuracy
best_x = x

#print(max_accuracy)
#print(best_x)

dt = DecisionTreeClassifier(random_state=best_x)
dt.fit(X_train,Y_train)
Y_pred_dt = dt.predict(X_test)

print(Y_pred_dt.shape)
score_dt = round(accuracy_score(Y_pred_dt,Y_test)*100,2)

print("The accuracy score achieved using Decision Tree is: "+str(score_dt)+" %")

Random Forest
COMP 453 Data Science 1st semester 2024/2025
Jazan University
Lab Project Report
KSA

from sklearn.ensemble import RandomForestClassifier

max_accuracy = 0

for x in range(2000):
rf = RandomForestClassifier(random_state=x)
rf.fit(X_train,Y_train)
Y_pred_rf = rf.predict(X_test)
current_accuracy = round(accuracy_score(Y_pred_rf,Y_test)*100,2)
if(current_accuracy>max_accuracy):
max_accuracy = current_accuracy
best_x = x

#print(max_accuracy)
#print(best_x)

rf = RandomForestClassifier(random_state=best_x)
rf.fit(X_train,Y_train)
Y_pred_rf = rf.predict(X_test)
Y_pred_rf.shape
score_rf = round(accuracy_score(Y_pred_rf,Y_test)*100,2)

print("The accuracy score achieved using Decision Tree is: "+str(score_rf)+" %")

XGBoost
import xgboost as xgb

xgb_model = xgb.XGBClassifier(objective="binary:logistic", random_state=42)


xgb_model.fit(X_train, Y_train)

Y_pred_xgb = xgb_model.predict(X_test)

Y_pred_xgb.shape
score_xgb = round(accuracy_score(Y_pred_xgb,Y_test)*100,2)

print("The accuracy score achieved using XGBoost is: "+str(score_xgb)+" %")

VI. Output final score


scores = [score_lr,score_nb,score_svm,score_knn,score_dt,score_rf,score_xgb,score_nn]
algorithms = ["Logistic Regression","Naive Bayes","Support Vector Machine","K-Nearest
Neighbors","Decision Tree","Random Forest","XGBoost","Neural Network"]

for i in range(len(algorithms)):
print("The accuracy score achieved using "+algorithms[i]+" is: "+str(scores[i])+" %")
COMP 453 Data Science 1st semester 2024/2025
Jazan University
Lab Project Report
KSA

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy