Cardiovascular Disease Slides
PROJECT OVERVIEW:
• The aim of the project is to detect the presence or absence of cardiovascular disease in a person based on the given features.
• Features available are:
o Age
o Height
o Weight
o Gender
o Smoking
o Alcohol intake
o Physical activity
o Systolic blood pressure
o Diastolic blood pressure
o Cholesterol
o Glucose
• Cholesterol notes:
o Cholesterol is a waxy material found in human blood.
o A normal cholesterol level is necessary for healthy body cells, but as levels increase, the risk of heart disease rises.
o This waxy material can block the arteries and could result in strokes and heart attacks.
o A healthy lifestyle and regular exercise can reduce the risk of high cholesterol levels.
o More information: https://www.mayoclinic.org/diseases-conditions/high-blood-cholesterol/symptoms-causes/syc-20350800
• Glucose notes:
o Glucose is the sugar that the human body receives when food is consumed.
o Glucose means “sweet” in Greek.
o The insulin hormone plays a key role in moving glucose from the blood into the body cells for energy.
o Diabetic patients have high glucose in their bloodstream, which could be due to two reasons:
  o They don’t have enough insulin.
  o Their body cells do not react to insulin the proper way.
o Read more: https://www.webmd.com/diabetes/glucose-diabetes
• PCA is an unsupervised machine learning algorithm that performs dimensionality reduction while attempting to keep as much of the original information as possible.
• PCA works by trying to find a new set of features called components.
• Components are uncorrelated composites of the given input features.
• In Amazon SageMaker, PCA operates in two modes:
o Regular: works well with sparse data and a small (manageable) number of observations/features.
o Randomized: works well with large number of observations/features.
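The idea of projecting data onto a new set of uncorrelated components can be sketched with a few lines of NumPy. This is an illustrative SVD-based implementation, not SageMaker's built-in algorithm, and the toy patient feature values are made up:

```python
# Minimal PCA sketch using NumPy's SVD (illustrative only; SageMaker's
# built-in PCA exposes the same idea through its regular/randomized modes).
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components."""
    X_centered = X - X.mean(axis=0)          # PCA works on centered data
    # SVD of the centered data: rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]           # new, uncorrelated feature axes
    return X_centered @ components.T         # data expressed in component space

# Toy example: 5 patients, 3 correlated features (e.g. height, weight, age)
X = np.array([[170., 70., 40.],
              [180., 85., 50.],
              [160., 60., 35.],
              [175., 80., 45.],
              [165., 65., 38.]])
X_reduced = pca(X, n_components=2)           # 3 features -> 2 components
```

The resulting component columns are uncorrelated with each other, which is the property the slides describe.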
[Figure: example decision trees, each splitting on “Age > 45?” to predict Class #0]
• XGBoost has recently become the go-to algorithm for most developers and has won several Kaggle competitions.
• Why does XGBoost work really well?
o Since the technique is an ensemble algorithm, it is very robust and can work well with many data types and complex distributions.
o XGBoost has many tunable hyperparameters that can improve model fitting.
• What are the applications of XGBoost?
o XGBoost could be used for fraud detection, estimating the probability that a transaction is fraudulent based on transaction features.
REMEMBER THAT XGBOOST IS AN
EXAMPLE OF ENSEMBLE LEARNING
• Ensemble techniques such as bagging and boosting can offer an
extremely powerful algorithm by combining a group of
relatively weak/average ones.
• For example, you can combine several decision trees to create a powerful random forest algorithm.
• By combining votes from a pool of experts, each bringing their own experience and background to the problem, we get a better outcome.
• Bagging and boosting can reduce variance and overfitting and increase model robustness.
• Example: Blind men and the elephant!
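The voting idea can be sketched in a few lines of plain Python. This is a toy majority vote over class predictions, not a full bagging or boosting implementation:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine class votes from a pool of models into one final prediction."""
    return Counter(predictions).most_common(1)[0][0]

# Three weak "experts" disagree on one patient; the ensemble takes the majority.
model_votes = ["disease", "healthy", "disease"]
print(majority_vote(model_votes))  # -> disease
```

A random forest applies this same principle, with each tree trained on a different bootstrap sample of the data.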
CONFUSION MATRIX

                              TRUE CLASS
                              +                          -
PREDICTIONS    +              TRUE +                     FALSE + (Type I error)
               -              FALSE - (Type II error)    TRUE -
o True positives (TP): cases where the classifier predicted TRUE (patient has the disease) and the correct class was TRUE (patient has the disease).
o True negatives (TN): cases where the model predicted FALSE (no disease) and the correct class was FALSE (patient does not have the disease).
o False positives (FP) (Type I error): the classifier predicted TRUE, but the correct class was FALSE (patient did not have the disease).
o False negatives (FN) (Type II error): the classifier predicted FALSE (patient does not have the disease), but they actually do have the disease.
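The four cases can be counted directly by comparing predictions against true labels. The helper below is a hypothetical illustration, with made-up labels (1 = has disease, 0 = healthy):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN by comparing predictions against true labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)  # Type I error
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)  # Type II error
    return tp, tn, fp, fn

# 1 = has disease, 0 = healthy
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # -> (2, 2, 1, 1)
```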
KEY PERFORMANCE INDICATORS (KPI)
o Precision = TP / Total TRUE predictions = TP / (TP + FP) (when the model predicted the TRUE class, how often was it right?)
o Recall = TP / Actual TRUE = TP / (TP + FN) (when the class was actually TRUE, how often did the classifier get it right?)
MODEL PERFORMANCE ASSESSMENT – PRECISION, RECALL AND F1-SCORE
PRECISION Vs. RECALL EXAMPLE
FACTS:
o 100 patients total
o 91 patients are healthy
o 9 patients have cancer

                              TRUE CLASS
                              +            -
PREDICTIONS    +              TP = 1       FP = 1
               -              FN = 8       TN = 90
NOTES:
o Precision is a measure of correct positives. In this example, the model predicted that two patients were in the positive class (have cancer), and only one of the two was correct.
o Precision is an important metric when false positives matter (e.g., how many times did the model say a pedestrian was detected when there was nothing there?).
o Examples include drug testing.
RECALL DEEP DIVE

                              TRUE CLASS
                              +            -
PREDICTIONS    +              TP = 1       FP = 1
               -              FN = 8       TN = 90
NOTES:
o Recall is also called the true positive rate or sensitivity.
o In this example, there were 9 cancer patients, but the model only detected 1 of them.
o Recall is an important metric when we care about false negatives.
o Examples: self-driving cars and fraud detection.
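Plugging the cancer example's counts (TP = 1, FP = 1, FN = 8, TN = 90) into the two formulas makes the difference concrete:

```python
# The cancer example's confusion-matrix counts
TP, FP, FN, TN = 1, 1, 8, 90

precision = TP / (TP + FP)   # of the 2 positive predictions, 1 was right
recall = TP / (TP + FN)      # of the 9 actual cancer patients, only 1 was found

print(round(precision, 3))   # -> 0.5
print(round(recall, 3))      # -> 0.111
```

The model looks acceptable on precision but is clearly failing on recall, which is the metric that matters for cancer screening.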
EX1: BANK FRAUD DETECTION

                              TRUE CLASS
                              +                                   -
PREDICTIONS    +              There was fraud and the model       There was no fraud but the model
                              predicted fraud                     predicted fraud (pissed-off customer)
               -              There was fraud but the model       There was no fraud and the model
                              predicted no fraud (bank loses      predicted no fraud (bank is OK!)
                              money)

“The false negative is the only case where the bank loses money, so the bank cares about recall.”
EX2: SPAM EMAIL DETECTION

                              TRUE CLASS
                              +                                   -
PREDICTIONS    +              There was a spam email and the      There was no spam email but the
                              model predicted spam (blocked it)   model predicted spam (blocked
                                                                  important emails: the dream job!)
               -              There was a spam email and the      There was no spam email and the
                              model predicted no spam (went to    model predicted no spam (went to
                              inbox; not a big deal!)             inbox)

“This is a case where we care about precision, and it’s OK if we mess up recall a little bit.”
F1 SCORE
• F1 score is an overall measure of a model's accuracy that combines precision and recall.
• F1 score is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall).
• What is the difference between F1 score and accuracy?
• In unbalanced datasets, if we have a large number of true negatives (healthy patients), accuracy can be misleading. Therefore, the F1 score may be a better KPI, since it balances recall and precision in the presence of unbalanced datasets.
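The cancer example (TP = 1, FP = 1, FN = 8, TN = 90) shows exactly why accuracy can mislead on an unbalanced dataset:

```python
# F1 vs. plain accuracy on the unbalanced cancer example
TP, FP, FN, TN = 1, 1, 8, 90

accuracy = (TP + TN) / (TP + TN + FP + FN)          # looks great thanks to the 90 TNs
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(accuracy, 2), round(f1, 2))  # -> 0.91 0.18
```

Accuracy of 91% hides the fact that the model misses 8 of the 9 cancer patients; the F1 score of 0.18 exposes it.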
F1-SCORE PER CLASS: CANCER CLASSIFICATION DATASET
[Figure: per-class F1-scores and the average F1-score for the dataset]
MULTICLASS CLASSIFICATION
https://www.cs.toronto.edu/~kriz/cifar.html
MODEL PERFORMANCE ASSESSMENT – ROC, AUC
ROC (RECEIVER OPERATING CHARACTERISTIC CURVE)
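An ROC curve is built by sweeping the classifier's decision threshold and recording the true positive rate against the false positive rate at each step. A rough sketch, using made-up classifier scores for illustration:

```python
def roc_points(y_true, scores, thresholds):
    """(FPR, TPR) pairs at each decision threshold; plotting these traces the ROC curve."""
    points = []
    for thr in thresholds:
        y_pred = [1 if s >= thr else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
        points.append((fp / (fp + tn), tp / (tp + fn)))   # (FPR, TPR)
    return points

# Made-up classifier scores for 4 patients (1 = disease, 0 = healthy)
y_true = [1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.2]
print(roc_points(y_true, scores, thresholds=[0.1, 0.5, 1.0]))
# -> [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
```

The AUC is the area under this curve; a perfect classifier passes through the (0.0, 1.0) corner, as the middle threshold does here.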
• A model is underfitting if it is too simple to reflect the complexity of the training dataset.
• We can overcome underfitting by:
o increasing the complexity of the model.
o training the model for a longer period of time (more epochs) to reduce the error.
[Figure: underfitting model, too simple for this complex dataset (axes X1, X2)]
[Figure: overfitting model (axes X1, X2)]
[Figure: best model, generalized (axes X1, X2)]