Final AI Homework Amanuel Tesfalem
Final Assignment
Abstract
Introduction
Index and Metadata Files: These files serve as a guide to the dataset collection,
aiding users in navigating through its various components. For instance, the "heart-
disease.names" file provides detailed descriptions of the attributes included in the
datasets, ensuring that users comprehend the significance of each variable.
The collection encompasses raw data from several significant heart disease studies,
each contributing unique perspectives and enhancing the dataset's diversity:
These files contain the cleaned and formatted versions of the raw data, prepared for
immediate analysis:
These files offer further insights and expanded data, enhancing the collection's utility:
new.data: This file likely includes new or supplementary data, broadening the
dataset collection's scope.
Supporting Files
These files provide additional context and support for the datasets:
Costs: This file might outline the costs associated with data collection or study,
providing context on resource allocation.
This collection of raw, processed, and additional data files, supported by contextual
documents, offers a comprehensive resource for cardiovascular research, enabling
robust predictive modeling and risk factor analysis across diverse demographics.
Methodology
Description of the Heart Disease Dataset
The Heart Disease dataset is widely used in medical diagnostics and machine
learning. It is designed to support the prediction of heart disease presence in
patients. The dataset contains 14 attributes; the commonly used Cleveland subset
comprises 303 instances. Data were collected from four locations: the Cleveland
Clinic Foundation, the Hungarian Institute of Cardiology (Budapest), the V.A.
Medical Center in Long Beach, and the University Hospital in Zurich, Switzerland.
Attributes
7. Resting Electrocardiographic Results (restecg):
1. 0: Normal
2. 1: Having ST-T wave abnormality (T wave inversions and/or ST
elevation or depression of > 0.05 mV)
3. 2: Showing probable or definite left ventricular hypertrophy by Estes'
criteria
11. Slope of the Peak Exercise ST Segment (slope):
1. 1: Upsloping
2. 2: Flat
3. 3: Downsloping
12. Number of Major Vessels (ca): Number of major vessels (0-3) colored by
fluoroscopy.
13. Thalassemia (thal):
1. 3: Normal
2. 6: Fixed defect
3. 7: Reversible defect
Three machine learning methods were applied to the heart disease
classification task; each method and its results are described below:
1. Decision Trees
A decision tree is a supervised learning algorithm that can be used for both
classification and regression tasks. It works by splitting the data into subsets based
on the value of input features, creating a tree-like model of decisions. Each internal
node represents a "test" on an attribute (e.g., whether a patient's age is greater than
50), each branch represents the outcome of the test, and each leaf node represents
a class label (e.g., heart disease present or not).
Deployment
Let's inspect processed.cleveland.data first, as the Cleveland dataset is the one
most often used in heart disease prediction studies.
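The inspection step can be sketched as follows. In practice the rows would come from processed.cleveland.data via pd.read_csv; here a few rows from the printed output are inlined so the snippet is self-contained, and the column names follow the attribute list above.

```python
from io import StringIO
import pandas as pd

# Attribute names from the dataset documentation (heart-disease.names).
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

# A few rows inlined in place of processed.cleveland.data.
sample = StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2\n"
    "67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1\n"
)

# The raw file has no header row and encodes missing values as '?'.
df = pd.read_csv(sample, header=None, names=columns, na_values="?")
print(df.head())
```

The same `pd.read_csv` call, pointed at the real file, produces the table shown below.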
Result
['Index',
'WARNING',
'ask-detrano',
'bak',
'cleve.mod',
'cleveland.data',
'costs',
'heart-disease.names',
'hungarian.data',
'long-beach-va.data',
'new.data',
'processed.cleveland.data',
'processed.hungarian.data',
'processed.switzerland.data',
'processed.va.data',
'reprocessed.hungarian.data',
'switzerland.data']
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal Num
0 63.0 1.0 1.0 145.0 233.0 1.0 2.0 150.0 0.0 2.3 3 0 6 0
1 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2 3 3 2
2 67.0 1.0 4.0 120.0 229.0 0.0 2.0 129.0 1.0 2.6 2 2 7 1
3 37.0 1.0 3.0 130.0 250.0 0.0 0.0 187.0 0.0 3.5 3 0 3 0
4 41.0 0.0 2.0 130.0 204.0 0.0 2.0 172.0 0.0 1.4 1 0 3 0
The data was split into training and testing sets using an 80-20 ratio. A decision tree
classifier, initialized with a fixed random state for reproducibility, was then trained on
the training set. The decision tree algorithm is a popular choice for classification tasks
due to its simplicity and interpretability: it recursively splits the data into subsets
based on the most significant feature at each node, forming a tree-like structure.
After training the model, predictions were made on the testing set. The model
achieved an accuracy of 78.33%, indicating that it correctly predicted the presence
or absence of heart disease in approximately four out of five cases. The classification
report provided further insights into the model's performance, with precision and
recall metrics for both classes (no disease and disease). The precision for predicting
no disease was 0.87, and for predicting disease, it was 0.69. Recall values were 0.75
for no disease and 0.83 for disease, highlighting the model's ability to correctly
identify true positive cases of heart disease.
The decision tree classifier has been trained and evaluated on the heart disease dataset.
Here are the results:
Accuracy: 78.33%
Classification Report:
Precision:
- Class 0 (No disease): 0.87
- Class 1 (Disease): 0.69
Recall:
- Class 0: 0.75
- Class 1: 0.83
F1-score:
- Class 0: 0.81
- Class 1: 0.75
The model shows a balanced performance with a higher precision for predicting no
disease and a higher recall for predicting the presence of disease.
Here is the visualization of the decision tree. The tree shows the features used for
splitting, the criteria at each node, and the final classification for each leaf node. The
colors represent the different classes: "No Disease" and "Disease".
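The pipeline described above can be sketched with scikit-learn. The 80-20 split and fixed random_state follow the text, but a synthetic dataset stands in for the Cleveland data, so the printed metrics will not match the 78.33% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Synthetic stand-in for the 13-feature heart disease data.
X, y = make_classification(n_samples=300, n_features=13, random_state=42)

# 80-20 train/test split, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fixed random_state for reproducibility.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(classification_report(y_test, y_pred,
                            target_names=["No Disease", "Disease"]))
```

The visualization mentioned above can be produced from the fitted model with `sklearn.tree.plot_tree(clf)`.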
2. Logistic Regression
The data was split into training and testing sets using an 80-20 ratio. Logistic
regression, a statistical model that estimates the probability of a binary outcome
based on one or more predictor variables, was chosen for its simplicity and
effectiveness in classification tasks.
The logistic regression model was trained on the training set. After training,
predictions were made on the testing set. The model's performance was evaluated
using accuracy, precision, recall, and the F1-score.
Deployment
1. Convert the 'ca' and 'thal' columns to numeric types and handle missing values
by dropping rows with NaN values.
2. Split the dataset into features (X) and target variable (y), transforming the
target variable to binary format.
3. Make predictions: use the trained model to predict on the testing set.
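These deployment steps can be sketched as follows. A tiny inline frame stands in for the Cleveland file (in practice the whole dataset would be used and split 80-20 before fitting); the coercion, dropping, and binarisation steps are the ones described above.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-in rows; 'ca' and 'thal' arrive as strings with '?' placeholders,
# as in the raw file.
df = pd.DataFrame({
    "age":  [63, 67, 67, 37, 41, 56],
    "ca":   ["0", "3", "2", "?", "0", "0"],
    "thal": ["6", "3", "7", "3", "?", "3"],
    "num":  [0, 2, 1, 0, 0, 1],
})

# Step 1: coerce to numeric ('?' becomes NaN) and drop incomplete rows.
df["ca"] = pd.to_numeric(df["ca"], errors="coerce")
df["thal"] = pd.to_numeric(df["thal"], errors="coerce")
df = df.dropna()

# Step 2: features vs. target, with the target binarised (num > 0 -> disease).
X = df.drop(columns="num")
y = (df["num"] > 0).astype(int)

# Train, then (step 3) predict.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
pred = model.predict(X)
```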
When the logistic regression model runs successfully, the results are as follows:
Recall:
- Class 0: high recall, indicating that most true negatives are correctly identified.
- Class 1: high recall, indicating that most true positives are correctly identified.
F1-score:
- Both classes have balanced F1-scores, reflecting the harmonic mean of precision
and recall.
The model performs well, with balanced precision and recall for both classes,
indicating good performance in identifying both the presence and absence of heart
disease
1. Confusion Matrix
True Negatives (TN): The number of instances correctly predicted as "No Disease"
(top-left cell).
False Positives (FP): The number of instances incorrectly predicted as "Disease"
when they are actually "No Disease" (top-right cell).
False Negatives (FN): The number of instances incorrectly predicted as "No Disease"
when they are actually "Disease" (bottom-left cell).
True Positives (TP): The number of instances correctly predicted as "Disease"
(bottom-right cell).
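The cell layout described above follows scikit-learn's convention (rows = actual class, columns = predicted class). A minimal sketch with toy labels:

```python
from sklearn.metrics import confusion_matrix

# Toy labels standing in for the model's test-set predictions.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
# ravel() returns the cells in the order TN, FP, FN, TP:
# TN top-left, FP top-right, FN bottom-left, TP bottom-right.
tn, fp, fn, tp = cm.ravel()
print(cm)
```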
2. ROC Curve
The ROC (Receiver Operating Characteristic) curve is a graphical plot that illustrates
the diagnostic ability of a binary classifier as its discrimination threshold is varied.
False Positive Rate (FPR): The proportion of actual negatives that are
incorrectly classified as positives (FP / (FP + TN)).
True Positive Rate (TPR): The proportion of actual positives that are correctly
classified as positives (TP / (TP + FN)), also known as recall or sensitivity.
The diagonal line represents a random classifier with no discriminating power. The
closer the ROC curve is to the top-left corner, the better the model's performance.
AUC (Area Under the Curve): This value indicates the overall performance of
the model. An AUC of 0.90 suggests that the model has a high ability to
distinguish between the positive class (Disease) and the negative class (No
Disease).
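The FPR, TPR, and AUC defined above can be computed from the model's predicted probabilities; a sketch with toy scores follows (the 0.90 AUC in the text comes from the real model, so the toy value here is illustrative only):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Toy labels and predicted probabilities P(Disease).
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]

# FPR = FP / (FP + TN) and TPR = TP / (TP + FN) at each threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.2f}")
```

Plotting `fpr` against `tpr` (e.g. with matplotlib) gives the ROC curve; the diagonal from (0, 0) to (1, 1) is the random-classifier baseline.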
3. Support Vector Machine (SVM)
The combined dataset was split into training and testing sets, with 80% of the data
used for training and 20% for testing. An SVM model with a linear kernel was trained
on the training set. The model's performance was evaluated on the test set using
accuracy and classification metrics such as precision, recall, and F1-score.
Model Development
Data Preprocessing: The heart disease dataset is cleaned and preprocessed. This
involves handling missing values through mean imputation and standardizing
features to ensure each contributes equally to the model's performance.
Model Training: The SVM model is trained using the processed dataset. The training
involves finding the optimal hyperplane that separates the classes (presence or
absence of heart disease).
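The preprocessing and training steps above can be chained in a single scikit-learn Pipeline: mean imputation, standardization, then a linear-kernel SVM. Synthetic data stands in for the combined heart disease dataset, so the printed accuracy is illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the combined 13-feature dataset,
# with some missing values introduced for the imputer to fill.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X[::20, 0] = np.nan

# 80-20 train/test split, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = make_pipeline(
    SimpleImputer(strategy="mean"),  # handle missing values
    StandardScaler(),                # equalise feature scales
    SVC(kernel="linear"),            # find the separating hyperplane
)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy: {acc:.2%}")
```

Fitting the imputer and scaler inside the pipeline ensures their statistics come from the training set only, avoiding leakage into the test set.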
Result Analysis:
The age feature shows a relatively normal distribution centered around 50-60 years.
The sex feature is binary, with more males (represented by 1) than females
(represented by 0).
The chol (cholesterol) levels show a right-skewed distribution, with most values
between 200 and 300.
Correlation Matrix
Result Analysis:
Accuracy: 0.8833
Precision: 0.8696
Of all the cases predicted as positive, 86.96% were actually positive. This shows a low
rate of false positives.
Recall: 0.8333
Of all the actual positive cases, 83.33% were correctly identified by the model. This
shows a relatively low rate of false negatives.
F1 Score: 0.8511
The F1 score, which balances precision and recall, is 85.11%. This indicates a good
balance between precision and recall.
Discussion
Both Logistic Regression and Support Vector Machines (SVM) exhibited a high level
of accuracy, reaching roughly 88% in predicting heart disease. This surpassed the
Decision Tree classifier, which achieved an accuracy of 78.33%.
Furthermore, the precision and recall values for both Logistic Regression and SVM
consistently demonstrated high levels of performance, indicating their reliability in
distinguishing between patients with and without heart disease. On the other hand,
the Decision Tree, although easier to interpret, exhibited lower accuracy and
somewhat less balanced performance metrics.
Conclusion
This study underscores the robustness of Logistic Regression and SVM as viable
options for predicting heart disease. These models deliver superior accuracy and
balanced classification metrics when compared to the Decision Tree. These findings
emphasize the significance of selecting appropriate machine learning techniques to
ensure accurate predictions in healthcare applications. Future research could
explore more advanced models and techniques, such as ensemble methods, to
further enhance predictive performance.
References
The Cleveland heart disease dataset is available from the UCI Machine
Learning Repository (Heart Disease Data Set).
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel,
O., ... & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal
of Machine Learning Research, 12(Oct), 2825-2830.
Hosmer Jr, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic
Regression (Vol. 398). John Wiley & Sons.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine learning,
20(3), 273-297.
Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification
and Regression Trees. CRC press.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction
to Statistical Learning with Applications in R. Springer.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical
Learning: Data Mining, Inference, and Prediction. Springer Science & Business
Media.
Pandas: McKinney, W. (2010). Data structures for statistical computing in python.
In Proceedings of the 9th Python in Science Conference (Vol. 445, pp. 51-56).
Matplotlib: Hunter, J. D. (2007). Matplotlib: A 2D graphics environment.
Computing in science & engineering, 9(3), 90-95.
Seaborn: Waskom, M., Botvinnik, O., O'Kane, D., Hobson, P., Lukauskas, S.,
Gemperline, D. C., ... & Qalieh, A. (2017). mwaskom/seaborn: v0.8.1 (September
2017). Zenodo.
Confusion Matrix and ROC Curve: Fawcett, T. (2006). An introduction to ROC
analysis. Pattern Recognition Letters, 27(8), 861-874.