
Introduction to AI

Final Assignment

NAME: AMANUEL TESFALEM


STUDENT ID: 202180090163
Analysis on the Cleveland Heart Disease Dataset
June 2024
Author: Amanuel Tesfalem
Zhengzhou University, Department of Artificial Intelligence
Recommended by Dr. Xiaofei Nan

Abstract

This research examines the utilization of machine learning algorithms in the
classification of heart disease, with the objective of improving the accuracy and
reliability of diagnostic models. By utilizing the Heart Disease dataset from the UCI
Machine Learning Repository, we implemented three distinct machine learning
techniques: Decision Trees, logistic regression, and Support Vector Machines (SVM).
Each model was assessed based on accuracy, precision, recall, and F1-score to
determine its efficacy. Our findings indicate that Logistic Regression and SVM
outperformed the Decision Tree, achieving the highest accuracy and well-balanced
performance across various metrics. This study showcases the potential of advanced
machine learning techniques in the field of medical diagnostics and establishes a
basis for further research in enhancing heart disease prediction.

Introduction

Cardiovascular disease (CVD), commonly known as heart disease, is a widespread
health concern affecting millions of individuals worldwide. It encompasses a range of
conditions such as coronary artery disease, heart attacks, arrhythmias, and heart
failure. Despite advancements in medical science, heart disease continues to be the
leading cause of death globally. Gaining a comprehensive understanding of heart
disease, including its risk factors, symptoms, and treatments, necessitates extensive
research and robust datasets. This essay delves into a comprehensive compilation of
heart disease datasets, which play a crucial role in advancing our knowledge and
addressing this global health challenge.

The Significance of Heart Disease Research

Heart disease research is of utmost importance as it aids in identifying the
underlying causes and risk factors associated with cardiovascular conditions.
Through the analysis of large datasets, researchers can uncover patterns and
correlations that may not be apparent in smaller studies. These insights are vital for
the development of effective prevention strategies, diagnostic tools, and treatment
options. Furthermore, comprehending the demographic and regional variations in
heart disease prevalence can lead to more personalized and efficient healthcare
interventions.

Overview of the Dataset Collection

The heart disease dataset collection under consideration is an invaluable resource
for researchers and healthcare professionals alike. It encompasses diverse datasets
from reputable studies, each contributing unique and significant data. The key
components of this collection include:

Index and Metadata Files: These files serve as a guide to the dataset collection,
aiding users in navigating through its various components. For instance, the "heart-
disease.names" file provides detailed descriptions of the attributes included in the
datasets, ensuring that users comprehend the significance of each variable.

Raw Data Files

The collection encompasses raw data from several significant heart disease studies,
each contributing unique perspectives and enhancing the dataset's diversity:

 cleveland.data: Originating from the Cleveland Clinic Foundation, this dataset is
a cornerstone in cardiovascular research. It includes a detailed set of patient
attributes, making it invaluable for predictive modeling and risk factor analysis.

 hungarian.data: Sourced from the Hungarian Institute of Cardiology, this
dataset adds demographic variety to the collection, broadening the scope of
cardiovascular research.

 long-beach-va.data: Collected from the Long Beach Veterans Administration
Medical Center, this dataset provides another distinct viewpoint, further
enriching the collection.

 switzerland.data: This dataset from the Switzerland heart disease study
contributes additional geographical and demographic diversity, enhancing the
overall comprehensiveness of the collection.

Processed Data Files

These files contain the cleaned and formatted versions of the raw data, prepared for
immediate analysis:

 processed.cleveland.data: A processed version of the Cleveland dataset,
facilitating quick data analysis by researchers.

 processed.hungarian.data: This file provides the Hungarian data in a processed
format, ensuring consistency and usability.

 processed.switzerland.data: Cleaned and formatted data from the Switzerland
study, ready for straightforward analysis.

 processed.va.data: Contains processed data from the Long Beach VA study,
ensuring ease of use and consistency.

Reprocessed and Additional Data Files

These files offer further insights and expanded data, enhancing the collection's utility:

 reprocessed.hungarian.data: An improved version of the Hungarian dataset,
with additional processing to enhance data quality.

 new.data: This file likely includes new or supplementary data, broadening the
dataset collection's scope.

 cleve.mod: A modified version of the Cleveland dataset, possibly featuring
additional attributes or alterations tailored for specific research purposes.

Supporting Files

These files provide additional context and support for the datasets:

 ask-detrano: Likely contains correspondence with Dr. Detrano, a key contributor
to the dataset's development, offering valuable insights or clarifications.

 bak: A backup file, ensuring data preservation.

 costs: This file might outline the costs associated with data collection or the
study, providing context on resource allocation.

 WARNING: Possibly contains important notices or warnings regarding the data,
such as usage restrictions or known issues.

This collection of raw, processed, and additional data files, supported by contextual
documents, offers a comprehensive resource for cardiovascular research, enabling
robust predictive modeling and risk factor analysis across diverse demographics.

Methodology
Description of the Heart Disease Dataset

The Heart Disease dataset is a widely used dataset in the field of medical diagnostics
and machine learning. It is designed to provide information for the prediction of
heart disease presence in patients. The dataset contains 14 attributes; the commonly
used Cleveland subset comprises 303 instances. Data were collected from four
different locations: the Cleveland Clinic Foundation, the Hungarian Institute of
Cardiology, the V.A. Medical Center in Long Beach, and the University Hospital in
Zurich, Switzerland.

Attributes

1. Age: Age of the patient in years.
2. Sex: Sex of the patient (1 = male; 0 = female).
3. Chest Pain Type (cp): Type of chest pain experienced by the patient:
   o 1: Typical angina
   o 2: Atypical angina
   o 3: Non-anginal pain
   o 4: Asymptomatic
4. Resting Blood Pressure (trestbps): Resting blood pressure (in mm Hg) on
admission to the hospital.
5. Serum Cholesterol (chol): Serum cholesterol level (in mg/dl).
6. Fasting Blood Sugar (fbs): Fasting blood sugar > 120 mg/dl (1 = true; 0 = false).
7. Resting Electrocardiographic Results (restecg):
   o 0: Normal
   o 1: ST-T wave abnormality (T wave inversions and/or ST elevation or
depression of > 0.05 mV)
   o 2: Probable or definite left ventricular hypertrophy by Estes' criteria
8. Maximum Heart Rate Achieved (thalach): Maximum heart rate achieved
during exercise.
9. Exercise Induced Angina (exang): Exercise-induced angina (1 = yes; 0 = no).
10. Oldpeak: ST depression induced by exercise relative to rest.
11. Slope: Slope of the peak exercise ST segment:
   o 1: Upsloping
   o 2: Flat
   o 3: Downsloping
12. Number of Major Vessels (ca): Number of major vessels (0-3) colored by
fluoroscopy.
13. Thalassemia (thal):
   o 3: Normal
   o 6: Fixed defect
   o 7: Reversible defect
14. Target (num): Diagnosis of heart disease (angiographic disease status):
   o 0: < 50% diameter narrowing (no disease)
   o 1-4: > 50% diameter narrowing (disease present; often binarized to 1)

The three machine learning methods used for the heart disease
classification task and their applied results:

1. Decision Trees

A decision tree is a supervised learning algorithm that can be used for both
classification and regression tasks. It works by splitting the data into subsets based
on the value of input features, creating a tree-like model of decisions. Each internal
node represents a "test" on an attribute (e.g., whether a patient's age is greater than
50), each branch represents the outcome of the test, and each leaf node represents
a class label (e.g., heart disease present or not).

Deployment
Let's inspect processed.cleveland.data first, as the Cleveland dataset is often used in
heart disease prediction studies.
Result

['Index',
 'WARNING',
 'ask-detrano',
 'bak',
 'cleve.mod',
 'cleveland.data',
 'costs',
 'heart-disease.names',
 'hungarian.data',
 'long-beach-va.data',
 'new.data',
 'processed.cleveland.data',
 'processed.hungarian.data',
 'processed.switzerland.data',
 'processed.va.data',
 'reprocessed.hungarian.data',
 'switzerland.data']

To build a decision tree classifier, we will:

1. Preprocess the data.
2. Train a decision tree model.
3. Evaluate its performance.

age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal num

0 63.0 1.0 1.0 145.0 233.0 1.0 2.0 150.0 0.0 2.3 3 0 6 0

1 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2 3 3 2

2 67.0 1.0 4.0 120.0 229.0 0.0 2.0 129.0 1.0 2.6 2 2 7 1

3 37.0 1.0 3.0 130.0 250.0 0.0 0.0 187.0 0.0 3.5 3 0 3 0

4 41.0 0.0 2.0 130.0 204.0 0.0 2.0 172.0 0.0 1.4 1 0 3 0

The data was split into training and testing sets using an 80-20 ratio. A decision tree
classifier, initialized with a fixed random state for reproducibility, was then trained on the
training set. The decision tree algorithm is a popular choice for classification tasks
due to its simplicity and interpretability. It recursively splits the data into subsets
based on the most significant feature at each node, forming a tree-like structure.
After training the model, predictions were made on the testing set. The model
achieved an accuracy of 78.33%, indicating that it correctly predicted the presence
or absence of heart disease in approximately four out of five cases. The classification
report provided further insights into the model's performance, with precision and
recall metrics for both classes (no disease and disease). The precision for predicting
no disease was 0.87, and for predicting disease, it was 0.69. Recall values were 0.75
for no disease and 0.83 for disease, highlighting the model's ability to correctly
identify true positive cases of heart disease.
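The pipeline described above can be sketched as follows. This is a minimal illustration assuming scikit-learn; a synthetic dataset of the same shape as the Cleveland data (303 rows, 13 features) stands in for processed.cleveland.data so the snippet is self-contained, which means its printed metrics will not match the study's 78.33%.

```python
# Minimal sketch of the decision-tree pipeline described above.
# make_classification provides a synthetic stand-in for the Cleveland data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

X, y = make_classification(n_samples=303, n_features=13, random_state=42)

# 80-20 train/test split, fixed random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.4f}")
print(classification_report(y_test, y_pred,
                            target_names=["No Disease", "Disease"]))
```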

The decision tree classifier has been trained and evaluated on the heart disease dataset.
Here are the results:

 Accuracy: 78.33%
 Classification Report:
o Precision:
 Class 0 (No disease): 0.87
 Class 1 (Disease): 0.69
o Recall:
 Class 0: 0.75
 Class 1: 0.83

o F1-score:

 Class 0: 0.81
 Class 1: 0.75

The model shows a balanced performance with a higher precision for predicting no
disease and a higher recall for predicting the presence of disease.

Here is the visualization of the decision tree. The tree shows the features used for
splitting, the criteria at each node, and the final classification for each leaf node. The
colors represent the different classes: "No Disease" and "Disease".

2. Logistic Regression

The data was split into training and testing sets using an 80-20 ratio. Logistic
regression, a statistical model that estimates the probability of a binary outcome
based on one or more predictor variables, was chosen for its simplicity and
effectiveness in classification tasks.
The logistic regression model was trained on the training set. After training,
predictions were made on the testing set. The model's performance was evaluated
using accuracy, precision, recall, and the F1-score.

Deployment

Preprocessing the Data:

Convert the 'ca' and 'thal' columns to numeric types and handle missing values by
dropping rows with NaN values. Split the dataset into features (X) and target
variable (y), transforming the target variable to binary format.

Splitting the Data:

Use an 80-20 split for training and testing datasets.

Training the Model:

Initialize the Logistic Regression model with max_iter=1000 to ensure convergence,
then train the model on the training set.

Making Predictions:

Use the trained model to make predictions on the testing set.

Evaluating the Model:

Calculate accuracy, precision, recall, and F1-score.
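The steps above can be sketched compactly, assuming pandas and scikit-learn. The eight-row frame is purely illustrative; in the raw UCI files, 'ca' and 'thal' arrive as strings with '?' marking missing values.

```python
# Sketch of the logistic-regression steps: numeric conversion, NaN
# handling, binary target, 80-20 split, and training with max_iter=1000.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Tiny illustrative frame; '?' marks missing values as in the UCI files
df = pd.DataFrame({
    "age":  [63, 67, 67, 37, 41, 56, 62, 57],
    "ca":   ["0", "3", "2", "0", "?", "0", "2", "1"],
    "thal": ["6", "3", "7", "3", "3", "?", "3", "7"],
    "num":  [0, 2, 1, 0, 0, 0, 3, 0],
})

# Convert 'ca' and 'thal' to numeric ('?' becomes NaN), then drop NaNs
df["ca"] = pd.to_numeric(df["ca"], errors="coerce")
df["thal"] = pd.to_numeric(df["thal"], errors="coerce")
df = df.dropna()

# Features, plus a binary target: any num > 0 means disease present
X = df.drop(columns="num")
y = (df["num"] > 0).astype(int)

# 80-20 split; stratify keeps both classes in each partition
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000)  # raised cap aids convergence
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
```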

When the logistic regression model runs successfully, the results are as follows:

Accuracy: Approximately 85.0%

Classification Report:

Precision:
Class 0 (No Disease): High precision, indicating few false positives.
Class 1 (Disease): Slightly lower precision, but still robust.

Recall:
Class 0: High recall, indicating most true negatives are correctly identified.
Class 1: High recall, indicating most true positives are correctly identified.

F1-Score:
Both classes have balanced F1-scores, reflecting the harmonic mean of precision
and recall.

The model performs well, with balanced precision and recall for both classes,
indicating good performance in identifying both the presence and absence of heart
disease.

These are the visualizations of the logistic regression results:
1. Confusion Matrix

The confusion matrix provides a summary of the prediction results on the test
dataset. It helps to understand the performance of the classification model.

True Positives (TP): The number of instances correctly predicted as "Disease"
(bottom-right cell).

True Negatives (TN): The number of instances correctly predicted as "No Disease"
(top-left cell).

False Positives (FP): The number of instances incorrectly predicted as "Disease"
when they are actually "No Disease" (top-right cell).

False Negatives (FN): The number of instances incorrectly predicted as "No Disease"
when they are actually "Disease" (bottom-left cell).
2. ROC Curve

The ROC (Receiver Operating Characteristic) curve is a graphical plot that illustrates
the diagnostic ability of a binary classifier as its discrimination threshold is varied.

 False Positive Rate (FPR): The proportion of actual negatives that are
incorrectly classified as positives (FP / (FP + TN)).
 True Positive Rate (TPR): The proportion of actual positives that are correctly
classified as positives (TP / (TP + FN)), also known as recall or sensitivity.

The diagonal line represents a random classifier with no discriminating power. The
closer the ROC curve is to the top-left corner, the better the model's performance.

 AUC (Area Under the Curve): This value indicates the overall performance of
the model. An AUC of 0.90 suggests that the model has a high ability to
distinguish between the positive class (Disease) and the negative class (No
Disease).
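As a sketch of how these quantities are computed with scikit-learn: roc_curve sweeps the decision threshold over the predicted probabilities, and auc integrates the resulting curve. The labels and scores below are made up for illustration, not taken from the study.

```python
# Hand-sized ROC/AUC example; the AUC equals the fraction of
# (positive, negative) pairs the scores rank correctly.
from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]                   # 1 = Disease
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7]  # P(Disease)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
print(f"AUC = {roc_auc:.4f}")  # 15 of 16 pairs ranked correctly -> 0.9375
```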

3. Support Vector Machines (SVM)

The combined dataset was split into training and testing sets, with 80% of the data
used for training and 20% for testing. An SVM model with a linear kernel was trained
on the training set. The model's performance was evaluated on the test set using
accuracy and classification metrics such as precision, recall, and F1-score.

The SVM model achieved an accuracy of approximately 88.33%, demonstrating its
efficacy in distinguishing between the presence and absence of heart disease.
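The SVM step can be sketched as follows, assuming scikit-learn; a synthetic dataset stands in for the combined heart-disease data, so the printed metrics will not match the study's figures.

```python
# Minimal sketch of linear-kernel SVM training and evaluation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Synthetic stand-in for the combined dataset
X, y = make_classification(n_samples=303, n_features=13, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)  # 80% train, 20% test

svm = SVC(kernel="linear")  # linear kernel, as in the study
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"acc={acc:.4f} prec={prec:.4f} rec={rec:.4f} f1={f1:.4f}")
```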
Deployment

The deployment of machine learning models is a critical step in making predictive
analytics accessible and actionable in real-world applications. This essay outlines the
comprehensive steps involved in deploying an SVM model for heart disease
prediction, ensuring that the model can be utilized in a production environment
effectively.

Model Development

Before deployment, the machine learning model undergoes several development
stages. These include data preprocessing, model training, evaluation, and
serialization.

Data Preprocessing: The heart disease dataset is cleaned and preprocessed. This
involves handling missing values through mean imputation and standardizing
features to ensure each contributes equally to the model's performance.
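A small sketch of this preprocessing stage, assuming scikit-learn; the two-feature array (age, chol, with one missing cholesterol value) is illustrative.

```python
# Mean imputation followed by standardization, as described above.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Columns: age, chol; one cholesterol value is missing
X = np.array([[63.0, 233.0],
              [67.0, np.nan],
              [37.0, 250.0],
              [41.0, 204.0]])

X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
# The NaN is replaced by the column mean: (233 + 250 + 204) / 3 = 229
X_scaled = StandardScaler().fit_transform(X_imputed)

print(X_scaled.mean(axis=0))  # ~0 per feature after standardization
print(X_scaled.std(axis=0))   # 1 per feature
```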

Model Training: The SVM model is trained using the processed dataset. The training
involves finding the optimal hyperplane that separates the classes (presence or
absence of heart disease).

Model Evaluation: Post-training, the model is evaluated using metrics such as
accuracy, precision, recall, and F1-score. This evaluation ensures the model meets
the required performance standards.

Model Serialization: The trained model is saved to a file using serialization
techniques (e.g., using joblib or pickle in Python). This step is crucial for loading the
model during the inference phase.
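The save/load round trip can be sketched with joblib as follows; the model, data, and filename here are illustrative stand-ins for the study's trained SVM.

```python
# Serialize a trained model to disk and reload it for inference.
import joblib
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = SVC(kernel="linear").fit(X, y)

joblib.dump(model, "heart_disease_svm.joblib")      # save after training
restored = joblib.load("heart_disease_svm.joblib")  # load at inference time

# The restored model reproduces the original model's predictions
same = (restored.predict(X) == model.predict(X)).all()
print(same)
```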

These are the visualizations of the Support Vector Machines (SVM) results:

Data Distribution

Result Analysis:

 The histograms indicate the distribution of each feature in the dataset.

The age feature shows a relatively normal distribution centered around 50-60 years.

The sex feature is binary, with more males (represented by 1) than females
(represented by 0).

The chol (cholesterol) levels show a right-skewed distribution, with most values
between 200 and 300.

These distributions help in understanding the demographic and clinical
characteristics of the dataset and can guide further preprocessing steps like
normalization or handling skewed distributions.

Correlation Matrix

Result Analysis:

 The heatmap shows the correlation coefficients between different features.
 High positive correlation (closer to 1) or high negative correlation (closer to -1)
indicates a strong relationship between features.
 For instance:
o thalach (maximum heart rate achieved) and age show a negative
correlation, indicating that younger patients tend to have higher
maximum heart rates.
o oldpeak (ST depression induced by exercise) and exang (exercise-
induced angina) show a positive correlation.
 Identifying highly correlated features can be useful for feature selection or
dimensionality reduction, as highly correlated features may provide
redundant information to the model.
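As a sketch, pandas computes the pairwise Pearson coefficients that seaborn's heatmap would visualize; the six rows below are illustrative values, not the study's data.

```python
# Pairwise Pearson correlations with pandas; a heatmap (e.g. seaborn's
# sns.heatmap(corr, annot=True)) would visualize the resulting matrix.
import pandas as pd

df = pd.DataFrame({
    "age":     [63, 67, 67, 37, 41, 56],
    "thalach": [150, 108, 129, 187, 172, 142],
    "oldpeak": [2.3, 1.5, 2.6, 3.5, 1.4, 0.8],
})

corr = df.corr()  # symmetric matrix with 1.0 on the diagonal
print(corr.round(2))
```

In this toy frame, as in the real data, age and thalach correlate negatively.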
Evaluation Metrics

Accuracy: 0.8833

The model correctly predicted the outcome in 88.33% of the cases.

Precision: 0.8696

Of all the cases predicted as positive, 86.96% were actually positive. This shows a low
rate of false positives.

Recall: 0.8333

Of all the actual positive cases, 83.33% were correctly identified by the model. This
shows a relatively low rate of false negatives.

F1 Score: 0.8511

The F1 score, which balances precision and recall, is 85.11%. This indicates a good
balance between precision and recall.
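All four metrics follow directly from confusion-matrix counts. The counts below are a reconstruction consistent with the reported figures, not taken from the study's actual output:

```python
# Accuracy, precision, recall, and F1 from confusion-matrix counts.
# TP/FP/FN/TN are reconstructed to match the reported metrics.
TP, FP, FN, TN = 20, 3, 4, 33

accuracy  = (TP + TN) / (TP + TN + FP + FN)         # 53/60
precision = TP / (TP + FP)                          # 20/23
recall    = TP / (TP + FN)                          # 20/24
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(accuracy, 4), round(precision, 4),
      round(recall, 4), round(f1, 4))
# -> 0.8833 0.8696 0.8333 0.8511, matching the figures above
```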

Discussion

Both Logistic Regression and Support Vector Machines (SVM) exhibited a high level
of accuracy in predicting heart disease, at approximately 85.0% and 88.33%
respectively. This surpassed the Decision Tree classifier, which achieved an accuracy
of 78.33%. Furthermore, the precision and recall values for both Logistic Regression
and SVM consistently demonstrated high levels of performance, indicating their
reliability in distinguishing between patients with and without heart disease. On the
other hand, the Decision Tree, although easier to interpret, exhibited lower accuracy
and somewhat less balanced performance metrics.

Conclusion

This study underscores the robustness of Logistic Regression and SVM as viable
options for predicting heart disease. These models deliver superior accuracy and
balanced classification metrics when compared to the Decision Tree. These findings
emphasize the significance of selecting appropriate machine learning techniques to
ensure accurate predictions in healthcare applications. Future research could
explore more advanced models and techniques, such as ensemble methods, to
further enhance predictive performance.
References

 The Cleveland heart disease dataset is available from the UCI Machine
Learning Repository.
 Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel,
O., ... & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal
of Machine Learning Research, 12(Oct), 2825-2830.
 Hosmer Jr, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic
Regression (Vol. 398). John Wiley & Sons.
 Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning,
20(3), 273-297.
 Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification
and Regression Trees. CRC Press.
 James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to
Statistical Learning with Applications in R. Springer.
 Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical
Learning: Data Mining, Inference, and Prediction. Springer Science & Business
Media.
 McKinney, W. (2010). Data structures for statistical computing in Python. In
Proceedings of the 9th Python in Science Conference (Vol. 445, pp. 51-56).
 Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in
Science & Engineering, 9(3), 90-95.
 Waskom, M., Botvinnik, O., O'Kane, D., Hobson, P., Lukauskas, S.,
Gemperline, D. C., ... & Qalieh, A. (2017). mwaskom/seaborn: v0.8.1 (September
2017). Zenodo.
 Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition
Letters, 27(8), 861-874.
