l09_machine_learning

The document discusses model evaluation in machine learning, focusing on metrics for binary classification and the challenges posed by class imbalance. It covers various evaluation metrics such as precision, recall, F1 score, and ROC curves, emphasizing the importance of selecting appropriate metrics based on the specific goals of the analysis. Additionally, it addresses multi-class classification and the use of custom scoring in cross-validation.


6012B0419Y Machine Learning

Model Evaluation and Class Imbalance
27-11-2023

Guido van Capelleveen
(Prepared by: Stevan Rudinac)
Slide Credit
● Andreas Müller, lecturer at the Data Science Institute at Columbia University
● Author of the book we will be using for this course, “Introduction to Machine Learning with Python”
● Great materials available at:
  ● https://github.com/amueller/applied_ml_spring_2017/
  ● https://amueller.github.io/applied_ml_spring_2017/
Reading

Pages: 277 – 305


Metrics for Binary Classification
Kinds of Errors
● Example: Early cancer detection screening
  – The test is negative: patient is assumed healthy
  – The test is positive: patient undergoes additional test
● Possible mistakes:
  – Healthy patient is classified as positive: false positive or type I error
  – Sick patient is classified as negative: false negative or type II error
Review: confusion matrix

Accuracy: the diagonal of the confusion matrix divided by everything (the sum of all entries).
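To make “diagonal divided by everything” concrete, here is a minimal sketch (not from the slides; the dataset and classifier are placeholder choices) using scikit-learn:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)
# accuracy = diagonal (correct predictions) divided by the sum of all entries
print(np.trace(cm) / cm.sum())
print(accuracy_score(y_test, y_pred))  # same number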


Problems with accuracy
● Imbalanced classes lead to hard-to-interpret accuracy:
  with data that is 90% negative, always predicting the negative class already gives 90% accuracy. Is this OK?
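A hedged sketch of the problem (the dataset below is synthetic and only illustrative): a classifier that always predicts the majority class already reaches roughly 90% accuracy without learning anything.

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# roughly 90% negative, 10% positive; these variables are reused in later sketches
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("dummy accuracy:", dummy.score(X_test, y_test))  # about 0.9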
Precision, Recall, f-score
● Precision (Positive Predictive Value, PPV):
  Precision = TP / (TP + FP)
● Recall (Sensitivity, coverage, True Positive Rate):
  Recall = TP / (TP + FN)
● F1 score (harmonic mean of precision and recall):
  F1 = 2 · Precision · Recall / (Precision + Recall)

All depend on the definition of positive and negative!
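A small illustration with scikit-learn, assuming the imbalanced train/test split from the sketch above (the logistic regression is an arbitrary stand-in):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("f1:       ", f1_score(y_test, y_pred))

# classification_report prints precision, recall and f1 for both classes at once
print(classification_report(y_test, y_pred))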
The zoo
Many more metrics can be derived from the confusion matrix:
https://en.wikipedia.org/wiki/Precision_and_recall
Goal setting!
● What do I want? What do I care about? (precision, recall, something else)
● Can I assign costs to the confusion matrix?
  (e.g. a false positive costs me $10, a false negative $100)
● What guarantees do we want to give?
Changing Thresholds
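A hedged sketch of what changing the threshold means in scikit-learn (not the lecture's exact code; the SVC settings are illustrative), again assuming the imbalanced split from above:

from sklearn.metrics import classification_report
from sklearn.svm import SVC

svc = SVC(gamma=0.05).fit(X_train, y_train)

# default: predict positive where decision_function(X) > 0
print(classification_report(y_test, svc.predict(X_test)))

# lower the threshold: more samples are predicted positive,
# so recall of the positive class goes up and precision goes down
y_pred_lower = (svc.decision_function(X_test) > -0.8).astype(int)
print(classification_report(y_test, y_pred_lower))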
Precision-Recall Curve
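A minimal sketch for computing the curve, assuming the svc fitted in the threshold sketch above:

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(
    y_test, svc.decision_function(X_test))

plt.plot(recall, precision, label="SVC")
plt.xlabel("recall")
plt.ylabel("precision")
plt.legend()
plt.show()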
Comparing RF and SVC
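A sketch of such a comparison (continuing the assumed setup): random forests have no decision_function, so the predicted probability of the positive class is used instead.

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
precision_rf, recall_rf, _ = precision_recall_curve(
    y_test, rf.predict_proba(X_test)[:, 1])

plt.plot(recall, precision, label="SVC")
plt.plot(recall_rf, precision_rf, label="Random Forest")
plt.xlabel("recall")
plt.ylabel("precision")
plt.legend()
plt.show()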
Average Precision

AP = Σ_k P(k) · (R(k) − R(k−1))

● P(k): precision at threshold k
● R(k) − R(k−1): change in recall between k and k−1
● Sum over data points, ranked by the decision function
● Same as the area under the precision-recall curve (depending on how you treat edge cases)
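In scikit-learn this is average_precision_score; a minimal sketch reusing the two models assumed above:

from sklearn.metrics import average_precision_score

print("AP of SVC:          ", average_precision_score(y_test, svc.decision_function(X_test)))
print("AP of random forest:", average_precision_score(y_test, rf.predict_proba(X_test)[:, 1]))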
F1 vs average precision
● F1 evaluates only the default threshold, while average precision summarizes the whole curve.
Receiver Operating Characteristics (ROC) Curve
The ROC curve plots the true positive rate against the false positive rate over all thresholds:
● TPR = TP / (TP + FN) (= recall)
● FPR = FP / (FP + TN)
ROC

AUC
Area under ROC Curve (AUC)
● Always 0.5 for random / constant prediction
● Evaluation of the ranking: the probability that a randomly picked positive sample will have a higher score than a randomly picked negative sample
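A minimal sketch of the ROC curve and ROC AUC in scikit-learn, reusing the models assumed above:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

fpr, tpr, thresholds = roc_curve(y_test, svc.decision_function(X_test))
plt.plot(fpr, tpr, label="SVC")
plt.xlabel("false positive rate")
plt.ylabel("true positive rate (recall)")
plt.legend()
plt.show()

print("ROC AUC (SVC):", roc_auc_score(y_test, svc.decision_function(X_test)))
print("ROC AUC (RF): ", roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))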

The Relationship Between Precision-Recall and ROC Curves
(Davis & Goadrich, 2006)
https://www.biostat.wisc.edu/~page/rocpr.pdf
Multi-class classification
Confusion Matrix

Normalizing the confusion matrix (by rows) can be helpful.
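A hedged multi-class sketch (the digits dataset and classifier are placeholder choices; normalize="true" requires a reasonably recent scikit-learn, otherwise divide each row by its sum manually):

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# normalize="true" divides each row by its sum, so the diagonal is per-class recall
print(confusion_matrix(y_test, y_pred, normalize="true").round(2))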
Micro and Macro F1
● Macro-average f1: average the f1 scores over classes
● Micro-average f1: compute the total number of FP, FN and TP over all classes, then compute P, R and f1 using these counts
● Weighted: mean of the per-class f1 scores, weighted by support

Macro: “all classes are equally important”
Micro: “all samples are equally important” (the same holds for other metric averages)
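A minimal sketch, assuming y_test and y_pred from the multi-class example above; the average parameter of f1_score selects the variant:

from sklearn.metrics import f1_score

print("macro f1:   ", f1_score(y_test, y_pred, average="macro"))
print("micro f1:   ", f1_score(y_test, y_pred, average="micro"))
print("weighted f1:", f1_score(y_test, y_pred, average="weighted"))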
Multi-class ROC AUC
● Hand & Till, 2001 one vs one

● Provost & Domingo, 2000 one vs


rest

● https://github.com/scikit-learn/scikit-learn/pull/7663
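A hedged sketch: recent scikit-learn versions expose this through the multi_class parameter of roc_auc_score (the feature discussed in the pull request linked above); it needs per-class probability estimates.

from sklearn.metrics import roc_auc_score

proba = clf.predict_proba(X_test)   # shape (n_samples, n_classes), from the digits example above
print("one-vs-one ROC AUC: ", roc_auc_score(y_test, proba, multi_class="ovo"))
print("one-vs-rest ROC AUC:", roc_auc_score(y_test, proba, multi_class="ovr"))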
Picking metrics?
● Accuracy rarely what you want
● Problems are rarely balanced
● Find the right criterion for the task
● OR pick one arbitrarily, but at least think about it
● Emphasis on recall or precision?
● Which classes are the important ones?
Using metrics in cross-validation
● Pass the “scoring” argument to cross_val_score; the same works for GridSearchCV
● This will also make GridSearchCV.score (and the model selection itself) use your metric!
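A minimal sketch (dataset and parameter grid are illustrative): passing a scoring string to cross_val_score and GridSearchCV.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# cross-validation scored with ROC AUC instead of the default accuracy
print(cross_val_score(SVC(), X_train, y_train, scoring="roc_auc", cv=5))

# GridSearchCV now selects parameters by ROC AUC, and .score() reports ROC AUC
param_grid = {"gamma": [0.0001, 0.001, 0.01, 0.1]}
grid = GridSearchCV(SVC(), param_grid=param_grid, scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))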
Built-in scoring
● “scoring” can be a string or a callable.
● Strings: e.g. "accuracy", "roc_auc", "average_precision", "f1", "r2", "neg_mean_squared_error", ...

Providing your own callable
● Takes estimator, X, y
● Returns score – higher is better (always!)

def accuracy_scoring(est, X, y):
    return (est.predict(X) == y).mean()

You can access the model!
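A short usage sketch: the callable above can be passed directly as scoring (assuming the train/test split from the previous sketch); because it receives the fitted estimator, it can also inspect learned attributes, not just predictions.

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def accuracy_scoring(est, X, y):
    return (est.predict(X) == y).mean()

print(cross_val_score(SVC(), X_train, y_train, scoring=accuracy_scoring, cv=5))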
Metrics for regression models
Built-in standard metrics
● R^2: easy to understand scale
● MSE: easy to relate to input
● Mean absolute error, median absolute error: more robust
● When using “scoring”, use “neg_mean_squared_error” etc.
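A hedged regression sketch (synthetic data chosen for illustration); note that error-based scorers are negated so that “higher is better” still holds everywhere:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X_reg, y_reg = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)

print(cross_val_score(Ridge(), X_reg, y_reg, scoring="r2", cv=5))
print(cross_val_score(Ridge(), X_reg, y_reg, scoring="neg_mean_squared_error", cv=5))
print(cross_val_score(Ridge(), X_reg, y_reg, scoring="neg_mean_absolute_error", cv=5))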
Prediction plots
Residual Plots
Target vs Feature
Residual vs Feature
Absolute vs relative error:
Mean Absolute Percentage Error (MAPE)
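For reference, the standard definition (not spelled out on the slide) is MAPE = (100 / n) · Σ |y_i − ŷ_i| / |y_i|, i.e. each absolute error taken relative to the true value. Recent scikit-learn versions also provide mean_absolute_percentage_error in sklearn.metrics.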
Over vs under
● Overprediction and underprediction can have different costs.
● Try to create a cost matrix: how much does overprediction cost, and how much does underprediction cost?
● Is it linear?
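As an illustration of such a cost (the 2:1 weighting is an arbitrary assumption, not from the lecture), an asymmetric-cost scorer could look like this and be passed as scoring to cross_val_score or GridSearchCV:

import numpy as np

def asymmetric_cost(est, X, y):
    residuals = est.predict(X) - y
    over = np.clip(residuals, 0, None).sum()     # total overprediction
    under = np.clip(-residuals, 0, None).sum()   # total underprediction
    return -(2.0 * over + 1.0 * under)           # negated cost: higher is better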
