Evaluation Metrics
CS229
Yining Chen
(Adapted from slides by Anand Avati)
May 1, 2020
Topics
● Why are metrics important?
● Binary classifiers
○ Rank view, Thresholding
● Metrics
○ Confusion Matrix
○ Point metrics: Accuracy, Precision, Recall / Sensitivity, Specificity, F-score
○ Summary metrics: AUROC, AUPRC, Log-loss
● Choosing Metrics
● Class Imbalance
○ Failure scenarios for each metric
● Multi-class
Why are metrics important?
- Training objective (cost function) is only a proxy for real world objectives.
- Metrics help translate a business goal into a quantitative target (not all errors are equal).
- Helps organize ML team effort towards that target.
- Generally in the form of improving that metric on the dev set.
- Useful to quantify the “gap” between:
- Desired performance and baseline (estimate effort initially).
- Desired performance and current performance.
- Measure progress over time.
- Useful for lower level tasks and debugging (e.g. diagnosing bias vs variance).
- Ideally training objective should be the metric, but not always possible. Still,
metrics are useful and important for evaluation.
Binary Classification
● x is input
● y is binary output (0/1)
● Model is ŷ = h(x)
● Two types of models
○ Models that output a class label directly (K-nearest neighbors, decision trees)
○ Models that output a real-valued score (SVM, logistic regression)
■ Score could be a margin (SVM) or a probability (LR, NN)
■ Need to pick a threshold to turn scores into class predictions
■ We focus on this type (the first type can be treated as a special case)
Score based models
[Figure: examples ranked by model score, from Score = 1 (top) down to Score = 0 (bottom), each labeled positive or negative.]

Prevalence = # positive examples / (# positive examples + # negative examples)
Threshold -> Classifier -> Point Metrics
Picking a threshold on the score (e.g. Th = 0.5) turns the ranking into a classifier: examples scoring above the threshold are predicted positive, those below are predicted negative. Each threshold gives one set of point metrics.
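As a minimal sketch (assuming numpy; the scores and labels are illustrative, not the slide's data), thresholding a score-based model looks like this:

```python
import numpy as np

# Hypothetical scores in [0, 1] from a score-based model, with true labels.
scores = np.array([0.95, 0.80, 0.62, 0.45, 0.30, 0.10])
labels = np.array([1,    1,    0,    1,    0,    0])

threshold = 0.5
preds = (scores >= threshold).astype(int)   # 1 = predict positive, 0 = predict negative

prevalence = labels.mean()                  # fraction of positive examples
print(preds, prevalence)
```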
Point metrics: Confusion Matrix
At threshold Th = 0.5:

                   Label positive   Label negative
Predict positive         9                2
Predict negative         1                8

Point metrics: True Positives
TP = labeled positive and predicted positive. At Th = 0.5: TP = 9.

Point metrics: True Negatives
TN = labeled negative and predicted negative. At Th = 0.5: TN = 8.

Point metrics: False Positives
FP = labeled negative but predicted positive. At Th = 0.5: FP = 2.

Point metrics: False Negatives
FN = labeled positive but predicted negative. At Th = 0.5: FN = 1.

FP and FN are also called Type-I and Type-II errors.
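As a small sketch (assuming 0/1 numpy arrays; not the slide's exact data), the four counts can be computed directly from labels and thresholded predictions:

```python
import numpy as np

def confusion_counts(labels, preds):
    """Return (TP, TN, FP, FN) for binary labels and predictions in {0, 1}."""
    tp = int(np.sum((preds == 1) & (labels == 1)))
    tn = int(np.sum((preds == 0) & (labels == 0)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    return tp, tn, fp, fn
```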
Point metrics: Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
At Th = 0.5: Accuracy = (9 + 8) / 20 = 0.85.

Point metrics: Precision
Precision = TP / (TP + FP)
At Th = 0.5: Precision = 9 / 11 ≈ 0.818.
Point metrics: Positive Recall (Sensitivity)
Recall = TP / (TP + FN)
At Th = 0.5: Recall = 9 / 10 = 0.9.

Trivial 100% recall: pull everybody above the threshold.
Trivial 100% precision: push everybody below the threshold except one positive (green) at the top.
Striving for good precision with 100% recall = pulling the lowest-ranked positive (green) as high as possible in the ranking.
Striving for good recall with 100% precision = pushing the top-ranked negative (gray) as low as possible in the ranking.
Point metrics: Negative Recall (Specificity)
Specificity = TN / (TN + FP)
At Th = 0.5: Specificity = 8 / 10 = 0.8.
Point metrics: F1-score
F1 = 2 * Precision * Recall / (Precision + Recall)
At Th = 0.5: F1 = 2 * 0.818 * 0.9 / (0.818 + 0.9) ≈ 0.857.
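A short sketch that reproduces the point metrics above from the slide's counts at Th = 0.5 (TP = 9, TN = 8, FP = 2, FN = 1):

```python
tp, tn, fp, fn = 9, 8, 2, 1   # counts at threshold 0.5 from the confusion matrix above

accuracy    = (tp + tn) / (tp + tn + fp + fn)                 # 0.85
precision   = tp / (tp + fp)                                  # ~0.818
recall      = tp / (tp + fn)                                  # 0.9 (sensitivity)
specificity = tn / (tn + fp)                                  # 0.8
f1          = 2 * precision * recall / (precision + recall)   # ~0.857

print(accuracy, precision, recall, specificity, f1)
```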
Point metrics: Changing threshold
At threshold Th = 0.6:

                   Label positive   Label negative
Predict positive         7                2
Predict negative         3                8
Threshold Scanning
Scanning the threshold from Score = 1 down to Score = 0:

Threshold  TP  TN  FP  FN  Accuracy  Precision  Recall  Specificity  F1
1.00        0  10   0  10    0.50      1.000      0.0       1.0     0.000
0.95        1  10   0   9    0.55      1.000      0.1       1.0     0.182
0.90        2  10   0   8    0.60      1.000      0.2       1.0     0.333
0.85        2   9   1   8    0.55      0.667      0.2       0.9     0.308
0.80        3   9   1   7    0.60      0.750      0.3       0.9     0.429
0.75        4   9   1   6    0.65      0.800      0.4       0.9     0.533
0.70        5   9   1   5    0.70      0.833      0.5       0.9     0.625
0.65        5   8   2   5    0.65      0.714      0.5       0.8     0.588
0.60        6   8   2   4    0.70      0.750      0.6       0.8     0.667
0.55        7   8   2   3    0.75      0.778      0.7       0.8     0.737
0.50        8   8   2   2    0.80      0.800      0.8       0.8     0.800
0.45        9   8   2   1    0.85      0.818      0.9       0.8     0.857
0.40        9   7   3   1    0.80      0.750      0.9       0.7     0.818
0.35        9   6   4   1    0.75      0.692      0.9       0.6     0.783
0.30        9   5   5   1    0.70      0.643      0.9       0.5     0.750
0.25        9   4   6   1    0.65      0.600      0.9       0.4     0.720
0.20        9   3   7   1    0.60      0.562      0.9       0.3     0.692
0.15        9   2   8   1    0.55      0.529      0.9       0.2     0.667
0.10        9   1   9   1    0.50      0.500      0.9       0.1     0.643
0.05       10   1   9   0    0.55      0.526      1.0       0.1     0.690
0.00       10   0  10   0    0.50      0.500      1.0       0.0     0.667
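A sketch of how such a scan could be generated (the labels and scores are placeholders, not the slide's 20 examples):

```python
import numpy as np

def scan_thresholds(labels, scores, thresholds):
    """Compute point metrics at each threshold (labels in {0, 1}, scores in [0, 1])."""
    rows = []
    for th in thresholds:
        preds = (scores >= th).astype(int)
        tp = np.sum((preds == 1) & (labels == 1))
        tn = np.sum((preds == 0) & (labels == 0))
        fp = np.sum((preds == 1) & (labels == 0))
        fn = np.sum((preds == 0) & (labels == 1))
        acc  = (tp + tn) / len(labels)
        prec = tp / (tp + fp) if (tp + fp) > 0 else 1.0   # convention: precision = 1 when nothing is predicted positive
        rec  = tp / (tp + fn)
        spec = tn / (tn + fp)
        f1   = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0
        rows.append((th, tp, tn, fp, fn, acc, prec, rec, spec, f1))
    return rows
```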
Summary metrics: Rotated ROC (Sen vs. Spec)
[Figure: positive and negative examples ranked from Score = 1 to Score = 0; sweeping the threshold traces out the ROC curve, drawn here as Sensitivity vs. Specificity.]
Agnostic to prevalence!

Summary metrics: AUPRC (Area Under the Precision-Recall Curve)
Precision = True Pos / Predicted Pos, plotted against Recall as the threshold is swept.
AUPRC = Area Under PRC = expected precision for a randomly chosen threshold.

Two models scoring the same data set (Model A vs. Model B): is one of them better than the other?
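A sketch comparing two hypothetical models on the same labels with scikit-learn (assumed available); average_precision_score is used as the usual approximation of AUPRC:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

labels   = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
scores_a = np.array([0.90, 0.80, 0.70, 0.65, 0.60, 0.50, 0.40, 0.35, 0.20, 0.10])  # hypothetical Model A
scores_b = np.array([0.99, 0.60, 0.55, 0.50, 0.52, 0.45, 0.30, 0.40, 0.20, 0.05])  # hypothetical Model B

for name, s in [("Model A", scores_a), ("Model B", scores_b)]:
    print(name,
          "AUROC:", round(roc_auc_score(labels, s), 3),
          "AUPRC:", round(average_precision_score(labels, s), 3))
```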
Summary metrics: Log-Loss vs Brier Score
● Two models can induce the same ranking of examples, and therefore have the same AUROC, AUPRC, and accuracy, yet output very different probabilities.
● Log-loss and the Brier score evaluate the predicted probabilities themselves (calibration), not just the ranking.
[Example from the slides: output histogram of an SVC at th = 0.5 with Precision 0.872, Recall 0.852, F1 0.862, Brier 0.163.]
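A sketch (hypothetical probabilities) of two models with the same ranking but different calibration, scored with scikit-learn's log_loss and brier_score_loss:

```python
import numpy as np
from sklearn.metrics import log_loss, brier_score_loss

labels = np.array([1, 0, 1, 1, 0, 0])
# Same ranking of examples, different calibration of the probabilities.
probs_confident = np.array([0.95, 0.05, 0.90, 0.85, 0.10, 0.02])
probs_hedged    = np.array([0.60, 0.40, 0.58, 0.55, 0.45, 0.35])

for name, p in [("confident", probs_confident), ("hedged", probs_hedged)]:
    print(name,
          "log-loss:", round(log_loss(labels, p), 3),
          "Brier:", round(brier_score_loss(labels, p), 3))
```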
Unsupervised Learning
● Log P(x) is a measure of fit in probabilistic models (GMM, Factor Analysis)
○ High log P(x) on the training set but low log P(x) on the test set is a sign of overfitting
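A sketch of this check with scikit-learn's GaussianMixture on synthetic data (score() returns the average log-likelihood per sample):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x_train = rng.normal(size=(200, 2))   # synthetic training data
x_test  = rng.normal(size=(200, 2))   # held-out data from the same distribution

gmm = GaussianMixture(n_components=3, random_state=0).fit(x_train)
print("train avg log P(x):", gmm.score(x_train))
print("test  avg log P(x):", gmm.score(x_test))  # much lower than train would suggest overfitting
```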
Class Imbalance: failure scenario for AUROC
● It is easy to keep AUC high by scoring most negatives very low.
● Example: 1% of examples are "fraudulent" (positive). With 1 positive and 99 negatives, ranking the positive above 98 of the 99 negatives already gives AUC = 98/99 ≈ 0.99.
● Specificity = True Neg / Neg is also easy to keep high, since negatives dominate.
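A tiny simulation (hypothetical scores) of the scenario above: one positive among 99 negatives, most negatives scored very low, so AUROC ≈ 98/99 even though precision at a 0.5 threshold is only 0.5:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score

# 1 "fraudulent" positive among 99 negatives (1% prevalence).
labels = np.array([1] + [0] * 99)
scores = np.zeros(100)
scores[0] = 0.90                         # the single positive
scores[1] = 0.95                         # one negative outranks it
scores[2:] = np.linspace(0.0, 0.10, 98)  # the other 98 negatives scored very low

print("AUROC:", roc_auc_score(labels, scores))                 # 98/99 ~= 0.99
preds = (scores >= 0.5).astype(int)
print("Precision at th=0.5:", precision_score(labels, preds))  # 1 TP, 1 FP -> 0.5
```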