Lab Report Content - 15marks
Lab Report Content - 15marks
Jazan University
Lab Project Report
KSA
Made by
NAME ID
Abdulaziz Abdulrab 202109135
Mohammed aldarbi 202202418
COMP 453 Data Science 1st semester 2024/2025
Jazan University
Lab Project Report
KSA
I. Abstract
Heart disease is one of the leading causes of death worldwide, and its early detection can
significantly improve patient outcomes. This study utilizes machine learning techniques to
predict the presence of heart disease based on a dataset of health metrics. The dataset, heart.csv,
contains 303 patient records with 14 attributes, including age, sex, chest pain type, blood
pressure, cholesterol levels, and others. The primary objective is to build a machine learning
model to classify patients into two categories: presence or absence of heart disease. By
leveraging data preprocessing, exploratory analysis, and machine learning algorithms, this study
aims to achieve accurate predictions and provide valuable insights into important features
influencing heart disease.
II. Introduction
The dataset heart.csv contains 303 samples with 14 attributes. The features and their descriptions
are as follows:
Feature Description
age Age of the patient
sex Gender (1 = male, 0 = female)
cp Chest pain type (0 = typical angina, 3 = asymptomatic)
trestbps Resting blood pressure (mm Hg)
chol Serum cholesterol (mg/dl)
fbs Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg Resting electrocardiographic results (0, 1, 2)
thalach Maximum heart rate achieved
exang Exercise-induced angina (1 = yes; 0 = no)
oldpeak ST depression induced by exercise
slope The slope of the peak exercise ST segment
ca Number of major vessels colored by fluoroscopy (0–3)
thal Thalassemia (0 = normal, 1 = fixed defect, 2 = reversible defect)
target Presence of heart disease (1 = presence, 0 = absence)
Dataset Summary Statistics
sigmoid function.2. Decision Tree: - A tree-based model that splits the data into branches
based on feature thresholds.3. Random Forest: - An ensemble method that builds
multiple decision trees and averages their predictions for robustness.#### Pseudocode
for Random Forest:1. For each tree in the forest: a. Select a random subset of data and
features. b. Build a decision tree using the subset.2. Average the predictions from all trees
to make the final prediction.###
B. Evaluation Metrics
and ResultsThe models were evaluated using the following metrics:- Accuracy:
Percent of correct predictions.- Precision: True positives divided by all predicted
positives.- Recall: True positives divided by all actual positives.- F1 Score:
Harmonic mean of precision and recall.#### Results:| Model | Accuracy | Precision |
Recall | F1 Score ||---------------------|----------|-----------|--------|----------|| Logistic
Regression | 85% | 84% | 86% | 85% || Decision Tree | 83% | 82% | 84% | 83% ||
Random Forest | 88% | 87% | 89% | 88% |---##
A. Discussion- The Random Forest model performed the best, achieving an accuracy
of 88%, followed by Logistic Regression at 85%. - Features such as thalach, cp, and
oldpeak were identified as the most important predictors.- Logistic Regression
provided a simpler model, but the Random Forest outperformed it due to its ability to
handle non-linear relationships.###
B. Optimal Decision
The Random Forest model is recommended for deployment due to its high accuracy
and robustness. However, Logistic Regression could be used in low-resource
environments due to its simplicity.---##
VI. Conclusion
The code
I. Importing essential libraries
[]
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os
print(os.listdir())
import warnings
warnings.filterwarnings('ignore')
sns.countplot(y)
target_temp = dataset.target.value_counts()
print(target_temp)
COMP 453 Data Science 1st semester 2024/2025
Jazan University
Lab Project Report
KSA
#Alternatively,
# print("Percentage of patience with heart problems: "+str(y.where(y==1).count()*100/303))
# print("Percentage of patience with heart problems: "+str(y.where(y==0).count()*100/303))
# #Or,
# countNoDisease = len(df[df.target == 0])
# countHaveDisease = len(df[df.target == 1])
dataset["thal"].unique()
sns.distplot(dataset["thal"])
COMP 453 Data Science 1st semester 2024/2025
Jazan University
Lab Project Report
KSA
predictors = dataset.drop("target",axis=1)
target = dataset["target"]
X_train,X_test,Y_train,Y_test = train_test_split(predictors,target,test_size=0.20,random_state=0)
V. Model Fitting
from sklearn.metrics import accuracy_score
Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train,Y_train)
Y_pred_lr = lr.predict(X_test)
Y_pred_lr.shape
score_lr = round(accuracy_score(Y_pred_lr,Y_test)*100,2)
print("The accuracy score achieved using Logistic Regression is: "+str(score_lr)+" %")
Naive Bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train,Y_train)
Y_pred_nb = nb.predict(X_test)
Y_pred_nb.shape
score_nb = round(accuracy_score(Y_pred_nb,Y_test)*100,2)
print("The accuracy score achieved using Naive Bayes is: "+str(score_nb)+" %")
SVM
from sklearn import svm
sv = svm.SVC(kernel='linear')
sv.fit(X_train, Y_train)
COMP 453 Data Science 1st semester 2024/2025
Jazan University
Lab Project Report
KSA
Y_pred_svm = sv.predict(X_test)
Y_pred_svm.shape
score_svm = round(accuracy_score(Y_pred_svm,Y_test)*100,2)
print("The accuracy score achieved using Linear SVM is: "+str(score_svm)+" %")
K Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train,Y_train)
Y_pred_knn=knn.predict(X_test)
Y_pred_knn.shape
score_knn = round(accuracy_score(Y_pred_knn,Y_test)*100,2)
Decision Tree
from sklearn.tree import DecisionTreeClassifier
max_accuracy = 0
for x in range(200):
dt = DecisionTreeClassifier(random_state=x)
dt.fit(X_train,Y_train)
Y_pred_dt = dt.predict(X_test)
current_accuracy = round(accuracy_score(Y_pred_dt,Y_test)*100,2)
if(current_accuracy>max_accuracy):
max_accuracy = current_accuracy
best_x = x
#print(max_accuracy)
#print(best_x)
dt = DecisionTreeClassifier(random_state=best_x)
dt.fit(X_train,Y_train)
Y_pred_dt = dt.predict(X_test)
print(Y_pred_dt.shape)
score_dt = round(accuracy_score(Y_pred_dt,Y_test)*100,2)
print("The accuracy score achieved using Decision Tree is: "+str(score_dt)+" %")
Random Forest
COMP 453 Data Science 1st semester 2024/2025
Jazan University
Lab Project Report
KSA
max_accuracy = 0
for x in range(2000):
rf = RandomForestClassifier(random_state=x)
rf.fit(X_train,Y_train)
Y_pred_rf = rf.predict(X_test)
current_accuracy = round(accuracy_score(Y_pred_rf,Y_test)*100,2)
if(current_accuracy>max_accuracy):
max_accuracy = current_accuracy
best_x = x
#print(max_accuracy)
#print(best_x)
rf = RandomForestClassifier(random_state=best_x)
rf.fit(X_train,Y_train)
Y_pred_rf = rf.predict(X_test)
Y_pred_rf.shape
score_rf = round(accuracy_score(Y_pred_rf,Y_test)*100,2)
print("The accuracy score achieved using Decision Tree is: "+str(score_rf)+" %")
XGBoost
import xgboost as xgb
Y_pred_xgb = xgb_model.predict(X_test)
Y_pred_xgb.shape
score_xgb = round(accuracy_score(Y_pred_xgb,Y_test)*100,2)
for i in range(len(algorithms)):
print("The accuracy score achieved using "+algorithms[i]+" is: "+str(scores[i])+" %")
COMP 453 Data Science 1st semester 2024/2025
Jazan University
Lab Project Report
KSA