
Cardiovascular Disease Slides

The document provides an overview of a machine learning project to detect cardiovascular disease based on patient features. It describes the features available, data sources, and gives notes on important medical values like blood pressure, cholesterol, and glucose. It also introduces principal component analysis and XGBoost classification algorithms.

Uploaded by

pedromaia

PROJECT OVERVIEW:

• The aim is to detect the presence or absence of cardiovascular disease in a person based on the given features.
• Features available are:
o Age
o Height
o Weight
o Gender
o Smoking
o Alcohol intake
o Physical activity
o Systolic blood pressure
o Diastolic blood pressure
o Cholesterol
o Glucose

• Data Source: https://www.kaggle.com/sulianova/cardiovascular-disease-dataset


• Image Source: https://commons.wikimedia.org/wiki/File:Human_Heart_and_Circulatory_System.png
PROJECT OVERVIEW: NOTES ON BLOOD PRESSURE

• Blood pressure notes:
o Blood pressure is represented by 2 numbers, systolic and diastolic (ideally 120/80 mm Hg).
o These two numbers are critical in assessing heart health.
o The top number is the systolic pressure and the bottom number is the diastolic pressure.
o Systolic pressure indicates the blood pressure in the arteries when blood is pumped out of the heart.
o Diastolic pressure indicates the blood pressure between beats (at rest, filling up and ready to pump again).
o If these numbers are high, the heart is exerting more effort to pump blood into the arteries of the body.

Photo Source: https://commons.wikimedia.org/wiki/File:Hypertension_ranges_chart.png


PROJECT OVERVIEW: NOTES ON CHOLESTEROL

• Cholesterol notes:
o Cholesterol is a waxy material found in human blood.
o A normal level of cholesterol is necessary to ensure healthy body cells, but as these levels increase, heart disease risk is elevated.
o This waxy material can block the arteries and could result in strokes and heart attacks.
o A healthy lifestyle and regular exercise can reduce the risk of having high cholesterol levels.
o More information: https://www.mayoclinic.org/diseases-conditions/high-blood-cholesterol/symptoms-causes/syc-20350800

Photo Credit: https://commons.wikimedia.org/wiki/File:Clogged_Heart_Artery.jpg


PROJECT OVERVIEW: NOTES ON GLUCOSE

• Glucose notes:
o Glucose is the sugar the human body receives when food is consumed.
o The word glucose comes from the Greek for "sweet."
o The hormone insulin plays a key role in moving glucose from the blood into the body's cells for energy.
o Diabetic patients have high glucose in their bloodstream, which could be due to two reasons:
o They don't have enough insulin.
o Their body cells do not react to insulin the proper way.
o Read more: https://www.webmd.com/diabetes/glucose-diabetes

Photo Credit: https://commons.wikimedia.org/wiki/File:Clogged_Heart_Artery.jpg


PRINCIPAL COMPONENT ANALYSIS (PCA)
PRINCIPAL COMPONENT ANALYSIS: OVERVIEW

• PCA is an unsupervised machine learning algorithm that performs dimensionality reduction while attempting to keep as much of the original information as possible.
• PCA works by finding a new set of features called components.
• Components are uncorrelated composites of the given input features.
• In Amazon SageMaker, PCA operates in two modes:
o Regular: works well with sparse data and a small (manageable) number of observations/features.
o Randomized: works well with a large number of observations/features.

Photo Credit: http://phdthesis-bioinformatics-maxplanckinstitute-molecularplantphys.matthias-scholz.de/
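To make the idea concrete, here is a minimal NumPy sketch of what PCA computes (centering plus SVD). This is an illustration of the technique only, not the SageMaker implementation; the function name and toy data are hypothetical.

```python
import numpy as np

def pca(X, num_components):
    """Minimal PCA sketch: center the data, take the SVD,
    and project onto the directions of largest variance."""
    X_centered = X - X.mean(axis=0)            # PCA requires centered data
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:num_components]           # top principal directions
    return X_centered @ components.T           # reduced representation

# Toy data: 2 informative dimensions embedded in 5-dimensional space.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 5))
X_reduced = pca(X, num_components=2)
print(X_reduced.shape)  # (100, 2)
```

By SVD ordering, the first returned component always carries at least as much variance as the second, which is the sense in which PCA "keeps the original information."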


PRINCIPAL COMPONENT ANALYSIS: HYPERPARAMETERS

• Full set of hyperparameters:
https://docs.aws.amazon.com/sagemaker/latest/dg/PCA-reference.html
• feature_dim: number of features in the input data.
• num_components: number of principal components to compute.
• algorithm_mode: mode for computing the principal components; choose between regular and randomized.
• extra_components: as extra components go up, more accurate results are achieved at the cost of increased memory/computation consumption.
PRINCIPAL COMPONENT ANALYSIS: INPUT/OUTPUT

• The SageMaker PCA algorithm supports recordIO-protobuf and CSV formats.
• PCA can be used in both File and Pipe mode.
• Remember that if Pipe mode is activated, training data can be streamed directly into the training instance instead of being downloaded from S3 first.
• Pipe mode can speed up the process and requires less disk space.
• For more information on Pipe mode, check this out: https://aws.amazon.com/blogs/machine-learning/using-pipe-input-mode-for-amazon-sagemaker-algorithms/
PRINCIPAL COMPONENT ANALYSIS: INSTANCE TYPES

• For PCA training:
o CPU or GPU instances are recommended.
XGBOOST (CLASSIFICATION)
SAGEMAKER XGBOOST: OVERVIEW
• XGBoost, or Extreme Gradient Boosting, is one of the most popular and powerful algorithms for both regression and classification tasks.
• XGBoost is a supervised learning algorithm that implements the gradient boosted trees algorithm.
• The algorithm works by combining an ensemble of predictions from several weak models.
• Note that XGBoost can be used for both regression and classification (our case study).

[Diagram: an ensemble of N decision trees. Each tree first splits on "Savings > $1M?" (No → Class #0), then on "Age > 45?" (Yes → Class #1, No → Class #0). Tree #1 and Tree #2 output Class #1, Tree #N outputs Class #0; the majority vote gives Class #1.]
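The tree diagram above can be sketched in plain Python: several weak tree learners each cast a vote, and the ensemble returns the majority class. This is illustrative only (XGBoost itself sums weighted tree outputs rather than taking a simple majority vote), and the thresholds are hypothetical.

```python
from collections import Counter

def tree_predict(savings, age, age_threshold=45):
    """One weak learner from the diagram: split on savings, then age."""
    if savings > 1_000_000:                   # Savings > $1M?
        return 1 if age > age_threshold else 0
    return 0                                  # No -> Class #0

def ensemble_predict(savings, age, age_thresholds=(45, 45, 50)):
    """Majority vote over slightly different trees (hypothetical thresholds)."""
    votes = [tree_predict(savings, age, t) for t in age_thresholds]
    return Counter(votes).most_common(1)[0][0]

print(ensemble_predict(2_000_000, 48))  # trees vote 1, 1, 0 -> majority = 1
```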


SAGEMAKER XGBOOST: OVERVIEW

• XGBoost has become the go-to algorithm for many developers and has won several Kaggle competitions.
• Why does XGBoost work so well?
o Since the technique is an ensemble algorithm, it is very robust and can work well with several data types and complex distributions.
o XGBoost has many tunable hyperparameters that can improve model fitting.
• What are the applications of XGBoost?
o XGBoost can be used for fraud detection, predicting the probability of a fraudulent transaction based on transaction features.
REMEMBER THAT XGBOOST IS AN EXAMPLE OF ENSEMBLE LEARNING

• Ensemble techniques such as bagging and boosting can offer an extremely powerful algorithm by combining a group of relatively weak/average ones.
• For example, you can combine several decision trees to create a powerful random forest algorithm.
• By combining votes from a pool of experts, each bringing their own experience and background to the problem, a better outcome results.
• Bagging and boosting can reduce variance and overfitting and increase model robustness.
• Example: the blind men and the elephant!

[Diagram: Model #1, Model #2, and Model #3 feed into a voting step.]

Photo Credit: https://commons.wikimedia.org/wiki/File:Blind_men_and_elephant.png


MODEL PERFORMANCE ASSESSMENT – CONFUSION MATRIX
CONFUSION MATRIX

                        TRUE CLASS
                        +              -
PREDICTIONS   +    TRUE + (TP)    FALSE + (FP)  [Type I error]
              -    FALSE - (FN)   TRUE - (TN)
                   [Type II error]
CONFUSION MATRIX

• A confusion matrix is used to describe the performance of a classification model:

o True positives (TP): cases when the classifier predicted TRUE (they have the disease), and the correct class was TRUE (the patient has the disease).

o True negatives (TN): cases when the model predicted FALSE (no disease), and the correct class was FALSE (the patient does not have the disease).

o False positives (FP) (Type I error): the classifier predicted TRUE, but the correct class was FALSE (the patient did not have the disease).

o False negatives (FN) (Type II error): the classifier predicted FALSE (the patient does not have the disease), but they actually do have the disease.
KEY PERFORMANCE INDICATORS (KPI)

o Classification Accuracy = (TP + TN) / (TP + TN + FP + FN)

o Misclassification Rate (Error Rate) = (FP + FN) / (TP + TN + FP + FN)

o Precision = TP / Total TRUE Predictions = TP / (TP + FP) (When the model predicted the TRUE class, how often was it right?)

o Recall = TP / Actual TRUE = TP / (TP + FN) (When the class was actually TRUE, how often did the classifier get it right?)
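The four KPIs above translate directly into code. A minimal sketch (the function name and example counts are hypothetical):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the KPIs above from raw confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": tp / (tp + fp),   # of predicted TRUE, how many were right
        "recall": tp / (tp + fn),      # of actual TRUE, how many were found
    }

m = classification_metrics(tp=50, tn=35, fp=5, fn=10)
print(m["accuracy"])  # 0.85
```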
MODEL PERFORMANCE ASSESSMENT – PRECISION, RECALL AND F1-SCORE
PRECISION Vs. RECALL EXAMPLE

FACTS:
• 100 patients total
• 91 patients are healthy
• 9 patients have cancer

                        TRUE CLASS
                        +          -
PREDICTIONS   +     TP = 1     FP = 1
              -     FN = 8     TN = 90

• Accuracy alone is generally misleading and is not enough to assess the performance of a classifier.
• Recall is an important KPI in situations where the dataset is highly imbalanced, e.g. few cancer patients compared to healthy ones.

o Classification Accuracy = (TP + TN) / (TP + TN + FP + FN) = 91%
o Precision = TP / Total TRUE Predictions = TP / (TP + FP) = 1/2 = 50%
o Recall = TP / Actual TRUE = TP / (TP + FN) = 1/9 = 11%
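Plugging the counts from this example (TP = 1, FP = 1, FN = 8, TN = 90) into the formulas reproduces the slide's numbers:

```python
tp, fp, fn, tn = 1, 1, 8, 90

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 91/100
precision = tp / (tp + fp)                   # 1/2
recall = tp / (tp + fn)                      # 1/9

print(f"accuracy={accuracy:.0%} precision={precision:.0%} recall={recall:.0%}")
# accuracy=91% precision=50% recall=11%
```

The 91% accuracy looks good only because 90 of the 100 predictions are easy true negatives; the 11% recall exposes how few cancer patients the model actually caught.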
PRECISION DEEP DIVE

                        TRUE CLASS
                        +          -
PREDICTIONS   +     TP = 1     FP = 1
              -     FN = 8     TN = 90

NOTES:
o Precision is a measure of correct positives; in this example, the model predicted the positive class (has cancer) for two patients, and only one of the two was correct.
o Precision is an important metric when false positives matter (e.g., how many times a model says a pedestrian was detected when there was nothing there!).
o Examples include drug testing.
RECALL DEEP DIVE

                        TRUE CLASS
                        +          -
PREDICTIONS   +     TP = 1     FP = 1
              -     FN = 8     TN = 90

NOTES:
o Recall is also called the true positive rate or sensitivity.
o In this example, there were 9 cancer patients but the model only detected 1 of them.
o Recall is an important metric when we care about false negatives.
o Examples: self-driving cars and fraud detection.
EX1: BANK FRAUD DETECTION

o TP: there was fraud and the model predicted fraud.
o FP: there was no fraud but the model predicted fraud (pissed-off customer).
o FN: there was fraud but the model predicted no fraud, and the bank loses money. "This is the only case the bank loses money, so the bank cares about recall."
o TN: there was no fraud and the model predicted no fraud (the bank is OK!).
EX2: SPAM EMAIL DETECTION

o TP: there was a spam email and the model predicted spam (blocked it).
o FP: there was no spam email but the model predicted spam and blocked important emails (dream job!).
o FN: there was a spam email and the model predicted no spam (it went to the inbox). Not a big deal!
o TN: there was no spam email and the model predicted no spam (it went to the inbox).

"This is a case when we care about precision, and it's OK if we mess up recall a little bit."
F1 SCORE

• F1 score is an overall measure of a model's accuracy that combines precision and recall.
• F1 score is the harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
• What is the difference between F1 score and accuracy?
• In unbalanced datasets, if we have a large number of true negatives (healthy patients), accuracy can be misleading. Therefore, F1 score may be a better KPI to use, since it provides a balance between recall and precision in the presence of unbalanced datasets.
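Applied to the earlier cancer example (precision = 1/2, recall = 1/9), the harmonic mean exposes the poor recall that the 91% accuracy hides. A minimal sketch:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Cancer example from earlier: precision = 1/2, recall = 1/9.
print(round(f1_score(0.5, 1 / 9), 3))  # 0.182, far below the misleading 91% accuracy
```

The harmonic mean is dominated by the smaller of the two inputs, which is exactly why F1 punishes a classifier that is precise but misses most positives.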
F1-SCORE PER CLASS: CANCER CLASSIFICATION DATASET

[Figure: F1-score per class and the average F1-score for a multiclass classification example.]

MULTICLASS CLASSIFICATION

https://www.cs.toronto.edu/~kriz/cifar.html
MODEL PERFORMANCE ASSESSMENT – ROC, AUC
ROC (RECEIVER OPERATING CHARACTERISTIC) CURVE

• The ROC curve is a metric that assesses the model's ability to distinguish between binary (0 or 1) classes.
• The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
• The true positive rate is also known as sensitivity, recall, or probability of detection in machine learning.
• The false positive rate is also known as the probability of false alarm and can be calculated as (1 − specificity).
• Points above the diagonal line represent good classification (better than random).
• Model performance improves as the curve skews toward the upper-left corner.

Photo Credit: https://commons.wikimedia.org/wiki/File:Roccurves.png
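A sketch of how the ROC points are formed: sweep a decision threshold over the model's scores and compute TPR and FPR at each setting. The scores and labels below are hypothetical.

```python
def roc_points(scores, labels, thresholds):
    """(FPR, TPR) pairs at each threshold: predict 1 when score >= threshold."""
    p = sum(labels)                # actual positives
    n = len(labels) - p            # actual negatives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n, tp / p))   # (FPR, TPR)
    return points

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # hypothetical model scores
labels = [1,   1,   0,   1,   0,   0]     # ground-truth classes
print(roc_points(scores, labels, thresholds=[0.5]))  # FPR = 1/3, TPR = 2/3
```

Lowering the threshold moves the point up and to the right (more detections, more false alarms), which traces out the curve.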


AUC (AREA UNDER CURVE)

[Figure: ROC curves for Predictor #1 and Predictor #2, plotting true positive rate against false positive rate, with a diagonal dashed red line for a random predictor.]

• The light-blue area represents the Area Under the Curve of the Receiver Operating Characteristic (AUROC).
• The diagonal dashed red line represents the ROC curve of a random predictor, with an AUROC of 0.5.
• If ROC AUC = 1, the classifier is perfect.
• Predictor #1 is better than Predictor #2.
• The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
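The AUC itself can be approximated from the (FPR, TPR) points with the trapezoidal rule, a minimal sketch; note that the random predictor's diagonal integrates to exactly 0.5.

```python
def auc(points):
    """Trapezoidal area under an ROC curve given (FPR, TPR) points
    sorted by FPR, including the (0, 0) and (1, 1) endpoints."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

random_predictor = [(0.0, 0.0), (1.0, 1.0)]      # the diagonal line
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # hugs the upper-left corner
print(auc(random_predictor), auc(perfect))  # 0.5 1.0
```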


OVERFITTING Vs. UNDERFITTING MODELS
UNDERFITTING MODEL

• A model is underfitting if it is too simple to reflect the complexity of the training dataset.
• We can overcome underfitting by:
o Increasing the complexity of the model.
o Training the model for a longer period of time (more epochs) to reduce error.

[Figure: a simple linear boundary on X1/X2 axes; the model is too simple for this complex dataset.]
OVERFITTING MODEL

• A model is overfitting the data when it memorizes all the specific details of the training data and fails to generalize.
• Overfitting models tend to perform very well on the training dataset but poorly on any new dataset (testing dataset).
• Machine learning is the art of creating models that are able to generalize and avoid memorization.

[Figure: an overly complex boundary on X1/X2 axes; the model is overfitting the data.]
BEST MODEL (GENERALIZED)

• A model that performs well during both training and testing (on a new dataset it has never seen before) is considered the best model (the goal).

[Figure: a smooth boundary on X1/X2 axes; a generalized model.]
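The underfitting/overfitting trade-off can be sketched numerically: fit polynomials of increasing degree to noisy toy data and compare training vs. testing error. Raising the degree always lowers training error, while test error typically worsens once the model starts memorizing noise (the data and degrees here are hypothetical, chosen only to illustrate the pattern).

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=40)
y = np.sin(3 * x) + rng.normal(scale=0.2, size=40)   # noisy underlying curve
x_train, y_train = x[:30], y[:30]                    # training split
x_test, y_test = x[30:], y[30:]                      # held-out split

def mse(degree, x_eval, y_eval):
    """Fit a polynomial of the given degree on the training split,
    then return mean squared error on (x_eval, y_eval)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return float(np.mean((np.polyval(coeffs, x_eval) - y_eval) ** 2))

for degree in (1, 3, 9):   # too simple, about right, overly flexible
    train_err = mse(degree, x_train, y_train)
    test_err = mse(degree, x_test, y_test)
    print(f"degree={degree}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")
```

Because each higher-degree model contains the lower-degree ones as a special case, its least-squares training error can never be worse; only the gap between train and test error reveals overfitting.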
