Evaluation Metrics
CS229
Yining Chen
(Adapted from slides by Anand Avati)
May 1, 2020
Topics
● Why are metrics important?
● Binary classifiers
○ Rank view, Thresholding
● Metrics
○ Confusion Matrix
○ Point metrics: Accuracy, Precision, Recall / Sensitivity, Specificity, F-score
○ Summary metrics: AUROC, AUPRC, Log-loss
● Choosing Metrics
● Class Imbalance
○ Failure scenarios for each metric
● Multi-class
Why are metrics important?
- Training objective (cost function) is only a proxy for real world objectives.
- Metrics help translate a business goal into a quantitative target (not all errors are equal).
- Helps organize ML team effort towards that target.
- Generally in the form of improving that metric on the dev set.
- Useful to quantify the “gap” between:
- Desired performance and baseline (estimate effort initially).
- Desired performance and current performance.
- Measure progress over time.
- Useful for lower level tasks and debugging (e.g. diagnosing bias vs variance).
- Ideally training objective should be the metric, but not always possible. Still,
metrics are useful and important for evaluation.
Binary Classification
● x is input
● y is binary output (0/1)
● Model is ŷ = h(x)
● Two types of models
○ Models that output a class label directly (K-nearest neighbors, decision trees)
○ Models that output a real-valued score (SVM, logistic regression)
■ Score could be a margin (SVM) or a probability (LR, NN)
■ Need to pick a threshold to turn scores into class predictions
■ We focus on this type (the first type can be treated as a special case)
Score based models
[Figure: examples ranked by model score, from Score = 1 (top) down to Score = 0 (bottom), each labeled positive or negative.]

Prevalence = # positive examples / (# positive examples + # negative examples)
Threshold -> Classifier -> Point Metrics
Picking a threshold on the score (e.g. Th = 0.5) turns the ranking into a classifier: examples scoring above the threshold are predicted positive, those below are predicted negative. Each threshold gives one set of point metrics.
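As a minimal sketch (assuming numpy; the scores and labels are illustrative, not the slide's data), thresholding a score-based model looks like this:

```python
import numpy as np

# Hypothetical scores in [0, 1] from a score-based model, with true labels.
scores = np.array([0.95, 0.80, 0.62, 0.45, 0.30, 0.10])
labels = np.array([1,    1,    0,    1,    0,    0])

threshold = 0.5
preds = (scores >= threshold).astype(int)   # 1 = predict positive, 0 = predict negative

prevalence = labels.mean()                  # fraction of positive examples
print(preds, prevalence)
```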
Point metrics: Confusion Matrix
At threshold Th = 0.5:

                   Label positive   Label negative
Predict positive         9                2
Predict negative         1                8

Point metrics: True Positives
TP = labeled positive and predicted positive. At Th = 0.5: TP = 9.

Point metrics: True Negatives
TN = labeled negative and predicted negative. At Th = 0.5: TN = 8.

Point metrics: False Positives
FP = labeled negative but predicted positive. At Th = 0.5: FP = 2.

Point metrics: False Negatives
FN = labeled positive but predicted negative. At Th = 0.5: FN = 1.

FP and FN are also called Type-I and Type-II errors.
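As a small sketch (assuming 0/1 numpy arrays; not the slide's exact data), the four counts can be computed directly from labels and thresholded predictions:

```python
import numpy as np

def confusion_counts(labels, preds):
    """Return (TP, TN, FP, FN) for binary labels and predictions in {0, 1}."""
    tp = int(np.sum((preds == 1) & (labels == 1)))
    tn = int(np.sum((preds == 0) & (labels == 0)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    return tp, tn, fp, fn
```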
Point metrics: Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
At Th = 0.5: Accuracy = (9 + 8) / 20 = 0.85.

Point metrics: Precision
Precision = TP / (TP + FP)
At Th = 0.5: Precision = 9 / 11 ≈ 0.818.
Point metrics: Positive Recall (Sensitivity)
Recall = TP / (TP + FN)
At Th = 0.5: Recall = 9 / 10 = 0.9.

Trivial 100% recall: pull everybody above the threshold.
Trivial 100% precision: push everybody below the threshold except one positive (green) at the top.
Striving for good precision with 100% recall = pulling the lowest-ranked positive (green) as high as possible in the ranking.
Striving for good recall with 100% precision = pushing the top-ranked negative (gray) as low as possible in the ranking.
Point metrics: Negative Recall (Specificity)
Specificity = TN / (TN + FP)
At Th = 0.5: Specificity = 8 / 10 = 0.8.
Point metrics: F1-score
F1 = 2 * Precision * Recall / (Precision + Recall)
At Th = 0.5: F1 = 2 * 0.818 * 0.9 / (0.818 + 0.9) ≈ 0.857.
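A short sketch that reproduces the point metrics above from the slide's counts at Th = 0.5 (TP = 9, TN = 8, FP = 2, FN = 1):

```python
tp, tn, fp, fn = 9, 8, 2, 1   # counts at threshold 0.5 from the confusion matrix above

accuracy    = (tp + tn) / (tp + tn + fp + fn)                 # 0.85
precision   = tp / (tp + fp)                                  # ~0.818
recall      = tp / (tp + fn)                                  # 0.9 (sensitivity)
specificity = tn / (tn + fp)                                  # 0.8
f1          = 2 * precision * recall / (precision + recall)   # ~0.857

print(accuracy, precision, recall, specificity, f1)
```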
Point metrics: Changing threshold
At threshold Th = 0.6:

                   Label positive   Label negative
Predict positive         7                2
Predict negative         3                8
Threshold Scanning
Scanning the threshold from Score = 1 down to Score = 0:

Threshold  TP  TN  FP  FN  Accuracy  Precision  Recall  Specificity  F1
1.00        0  10   0  10    0.50      1.000      0.0       1.0     0.000
0.95        1  10   0   9    0.55      1.000      0.1       1.0     0.182
0.90        2  10   0   8    0.60      1.000      0.2       1.0     0.333
0.85        2   9   1   8    0.55      0.667      0.2       0.9     0.308
0.80        3   9   1   7    0.60      0.750      0.3       0.9     0.429
0.75        4   9   1   6    0.65      0.800      0.4       0.9     0.533
0.70        5   9   1   5    0.70      0.833      0.5       0.9     0.625
0.65        5   8   2   5    0.65      0.714      0.5       0.8     0.588
0.60        6   8   2   4    0.70      0.750      0.6       0.8     0.667
0.55        7   8   2   3    0.75      0.778      0.7       0.8     0.737
0.50        8   8   2   2    0.80      0.800      0.8       0.8     0.800
0.45        9   8   2   1    0.85      0.818      0.9       0.8     0.857
0.40        9   7   3   1    0.80      0.750      0.9       0.7     0.818
0.35        9   6   4   1    0.75      0.692      0.9       0.6     0.783
0.30        9   5   5   1    0.70      0.643      0.9       0.5     0.750
0.25        9   4   6   1    0.65      0.600      0.9       0.4     0.720
0.20        9   3   7   1    0.60      0.562      0.9       0.3     0.692
0.15        9   2   8   1    0.55      0.529      0.9       0.2     0.667
0.10        9   1   9   1    0.50      0.500      0.9       0.1     0.643
0.05       10   1   9   0    0.55      0.526      1.0       0.1     0.690
0.00       10   0  10   0    0.50      0.500      1.0       0.0     0.667
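A sketch of how such a scan could be generated (the labels and scores are placeholders, not the slide's 20 examples):

```python
import numpy as np

def scan_thresholds(labels, scores, thresholds):
    """Compute point metrics at each threshold (labels in {0, 1}, scores in [0, 1])."""
    rows = []
    for th in thresholds:
        preds = (scores >= th).astype(int)
        tp = np.sum((preds == 1) & (labels == 1))
        tn = np.sum((preds == 0) & (labels == 0))
        fp = np.sum((preds == 1) & (labels == 0))
        fn = np.sum((preds == 0) & (labels == 1))
        acc  = (tp + tn) / len(labels)
        prec = tp / (tp + fp) if (tp + fp) > 0 else 1.0   # convention: precision = 1 when nothing is predicted positive
        rec  = tp / (tp + fn)
        spec = tn / (tn + fp)
        f1   = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0
        rows.append((th, tp, tn, fp, fn, acc, prec, rec, spec, f1))
    return rows
```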
Summary metrics: Rotated ROC (Sen vs. Spec)
[Figure: positive and negative examples ranked from Score = 1 to Score = 0; sweeping the threshold traces out the ROC curve, drawn here as Sensitivity vs. Specificity.]
Agnostic to prevalence!

Summary metrics: AUPRC (Area Under the Precision-Recall Curve)
Precision = True Pos / Predicted Pos, plotted against Recall as the threshold is swept.
AUPRC = Area Under PRC = expected precision for a randomly chosen threshold.

Two models scoring the same data set (Model A vs. Model B): is one of them better than the other?
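A sketch comparing two hypothetical models on the same labels with scikit-learn (assumed available); average_precision_score is used as the usual approximation of AUPRC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

labels   = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
scores_a = np.array([0.90, 0.80, 0.70, 0.65, 0.60, 0.50, 0.40, 0.35, 0.20, 0.10])  # hypothetical Model A
scores_b = np.array([0.99, 0.60, 0.55, 0.50, 0.52, 0.45, 0.30, 0.40, 0.20, 0.05])  # hypothetical Model B

for name, s in [("Model A", scores_a), ("Model B", scores_b)]:
    print(name,
          "AUROC:", round(roc_auc_score(labels, s), 3),
          "AUPRC:", round(average_precision_score(labels, s), 3))
```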
Summary metrics: Log-Loss vs Brier Score
● Two models can induce the same ranking of examples, and therefore have the same AUROC, AUPRC, and accuracy, yet output very different probabilities.
● Log-loss and the Brier score evaluate the predicted probabilities themselves (calibration), not just the ranking.
[Example from the slides: output histogram of an SVC at th = 0.5 with Precision 0.872, Recall 0.852, F1 0.862, Brier 0.163.]
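A sketch (hypothetical probabilities) of two models with the same ranking but different calibration, scored with scikit-learn's log_loss and brier_score_loss:

```python
import numpy as np
from sklearn.metrics import log_loss, brier_score_loss

labels = np.array([1, 0, 1, 1, 0, 0])
# Same ranking of examples, different calibration of the probabilities.
probs_confident = np.array([0.95, 0.05, 0.90, 0.85, 0.10, 0.02])
probs_hedged    = np.array([0.60, 0.40, 0.58, 0.55, 0.45, 0.35])

for name, p in [("confident", probs_confident), ("hedged", probs_hedged)]:
    print(name,
          "log-loss:", round(log_loss(labels, p), 3),
          "Brier:", round(brier_score_loss(labels, p), 3))
```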
Unsupervised Learning
● Log P(x) is a measure of fit in probabilistic models (GMM, Factor Analysis)
○ High log P(x) on the training set but low log P(x) on the test set is a sign of overfitting
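A sketch of this check with scikit-learn's GaussianMixture on synthetic data (score() returns the average log-likelihood per sample):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x_train = rng.normal(size=(200, 2))   # synthetic training data
x_test  = rng.normal(size=(200, 2))   # held-out data from the same distribution

gmm = GaussianMixture(n_components=3, random_state=0).fit(x_train)
print("train avg log P(x):", gmm.score(x_train))
print("test  avg log P(x):", gmm.score(x_test))  # much lower than train would suggest overfitting
```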
Class Imbalance: failure scenario for AUROC
● It is easy to keep AUC high by scoring most negatives very low.
● Example: 1% of examples are "fraudulent" (positive). With 1 positive and 99 negatives, ranking the positive above 98 of the 99 negatives already gives AUC = 98/99 ≈ 0.99.
● Specificity = True Neg / Neg is also easy to keep high, since negatives dominate.
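A tiny simulation (hypothetical scores) of the scenario above: one positive among 99 negatives, most negatives scored very low, so AUROC ≈ 98/99 even though precision at a 0.5 threshold is only 0.5:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score

# 1 "fraudulent" positive among 99 negatives (1% prevalence).
labels = np.array([1] + [0] * 99)
scores = np.zeros(100)
scores[0] = 0.90                         # the single positive
scores[1] = 0.95                         # one negative outranks it
scores[2:] = np.linspace(0.0, 0.10, 98)  # the other 98 negatives scored very low

print("AUROC:", roc_auc_score(labels, scores))                 # 98/99 ~= 0.99
preds = (scores >= 0.5).astype(int)
print("Precision at th=0.5:", precision_score(labels, preds))  # 1 TP, 1 FP -> 0.5
```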