Supervised Learning With Scikit-Learn: How Good Is Your Model?

This document discusses supervised machine learning techniques using scikit-learn including classification metrics, class imbalance, confusion matrices, logistic regression, ROC curves, and area under the ROC curve. It also covers hyperparameter tuning using grid search cross-validation and evaluating models on a hold-out test set not used for training or validation.

Uploaded by

Victor Ng

SUPERVISED LEARNING WITH SCIKIT-LEARN

How good is your model?

Classification metrics
● Measuring model performance with accuracy:
● Fraction of correctly classified samples
● Not always a useful metric

Class imbalance example: Emails


● Spam classification
● 99% of emails are real; 1% of emails are spam
● Could build a classifier that predicts ALL emails as real
● 99% accurate!
● But horrible at actually classifying spam
● Fails at its original purpose
● Need more nuanced metrics
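
The spam example above can be reproduced in a few lines. This is a minimal sketch on made-up labels (1,000 emails, 10 of them spam); the array names are illustrative and do not come from the slides.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 1 = spam, 0 = real; 1% of 1,000 emails are spam.
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A "classifier" that predicts every email as real.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
spam_recall = recall_score(y_true, y_pred)  # fraction of spam emails caught

print(f"accuracy:    {accuracy:.2f}")
print(f"spam recall: {spam_recall:.2f}")
```

Accuracy looks excellent while not a single spam email is caught, which is why accuracy alone is a poor guide on imbalanced classes.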

Diagnosing classification predictions


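● Confusion matrix: counts of true positives, false negatives, false positives, and true negatives
● Accuracy: (true positives + true negatives) / (all samples)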

Metrics from the confusion matrix


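● Precision: true positives / (true positives + false positives)
● Recall: true positives / (true positives + false negatives)
● F1 score: 2 * precision * recall / (precision + recall)
● High precision: Not many real emails predicted as spam
● High recall: Predicted most spam emails correctly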
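
The three metrics named above can be computed directly from confusion-matrix counts. The sketch below uses the standard definitions on a small set of hypothetical labels (1 = spam, 0 = real) and compares the hand calculation with scikit-learn's own functions.

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 1 = spam (positive class), 0 = real.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels ordered [0, 1], ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of emails flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam emails, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```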

Confusion matrix in scikit-learn


In [1]: from sklearn.metrics import classification_report

In [2]: from sklearn.metrics import confusion_matrix

In [3]: from sklearn.model_selection import train_test_split

In [4]: from sklearn.neighbors import KNeighborsClassifier

In [5]: knn = KNeighborsClassifier(n_neighbors=8)

In [6]: X_train, X_test, y_train, y_test = train_test_split(X, y,
   ...:     test_size=0.4, random_state=42)

In [7]: knn.fit(X_train, y_train)

In [8]: y_pred = knn.predict(X_test)

In [9]: print(confusion_matrix(y_test, y_pred))
[[ 52   7]
 [  3 112]]

In [10]: print(classification_report(y_test, y_pred))
             precision    recall  f1-score   support

          0       0.95      0.88      0.91        59
          1       0.94      0.97      0.96       115

avg / total       0.94      0.94      0.94       174
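
As a check, the class-1 row of the report can be recomputed by hand from the confusion matrix printed above. This sketch assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and treats class 1 as the positive class.

```python
import numpy as np

# Confusion matrix as printed above: rows = true class, columns = predicted class.
cm = np.array([[52, 7],
               [3, 112]])
tn, fp, fn, tp = cm.ravel()  # class 1 treated as the positive class

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  {accuracy:.2f}")
print(f"precision {precision:.2f}")
print(f"recall    {recall:.2f}")
print(f"f1-score  {f1:.2f}")
```

Rounded to two decimals, precision, recall, and F1 agree with the row for class 1 in the report.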



Logistic regression and the ROC curve

Logistic regression for binary classification


● Logistic regression outputs probabilities
● If the probability p is greater than 0.5, the data is labeled 1
● If the probability p is less than 0.5, the data is labeled 0
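
The labeling rule above can be checked against scikit-learn directly. This is a minimal sketch on synthetic data from `make_classification`, which stands in for the course's `X` and `y`; it applies the 0.5 cutoff by hand and compares the result with `predict`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data; an assumption standing in for the course dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

logreg = LogisticRegression()
logreg.fit(X, y)

# Column 1 of predict_proba holds the predicted probability of class 1.
p = logreg.predict_proba(X)[:, 1]

# Apply the 0.5 threshold by hand and compare with predict().
manual_labels = (p > 0.5).astype(int)
same = np.array_equal(manual_labels, logreg.predict(X))
print(same)
```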

Linear decision boundary

Source: Andreas Müller & Sarah Guido, Introduction to Machine Learning with Python

Logistic regression in scikit-learn


In [1]: from sklearn.linear_model import LogisticRegression

In [2]: from sklearn.model_selection import train_test_split

In [3]: logreg = LogisticRegression()

In [4]: X_train, X_test, y_train, y_test = train_test_split(X, y,
   ...:     test_size=0.4, random_state=42)

In [5]: logreg.fit(X_train, y_train)

In [6]: y_pred = logreg.predict(X_test)



Probability thresholds
● By default, logistic regression threshold = 0.5
● Not specific to logistic regression
● k-NN classifiers also have thresholds
● What happens if we vary the threshold?
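
One way to answer that question is to sweep the threshold and watch the true positive rate and false positive rate move. The sketch below is illustrative only: the data is synthetic, and `rates` is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data; an assumption standing in for the course dataset.
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

logreg = LogisticRegression().fit(X_train, y_train)
p = logreg.predict_proba(X_test)[:, 1]


def rates(threshold):
    """Return (true positive rate, false positive rate) at a threshold."""
    y_hat = p > threshold
    tpr = np.sum(y_hat & (y_test == 1)) / np.sum(y_test == 1)
    fpr = np.sum(y_hat & (y_test == 0)) / np.sum(y_test == 0)
    return tpr, fpr


results = {t: rates(t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)}
for t, (tpr, fpr) in results.items():
    print(f"threshold={t:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

At a threshold of 0 every sample is labeled 1, so both rates are 1; at a threshold of 1 no sample is labeled 1, so both rates are 0. The thresholds in between trace out the ROC curve.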

The ROC curve


[Figure: ROC curve, with the points reached at thresholds p = 0, p = 0.5, and p = 1 marked]

Plotting the ROC curve

In [1]: import matplotlib.pyplot as plt

In [2]: from sklearn.metrics import roc_curve

In [3]: y_pred_prob = logreg.predict_proba(X_test)[:,1]

In [4]: fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

In [5]: plt.plot([0, 1], [0, 1], 'k--')

In [6]: plt.plot(fpr, tpr, label='Logistic Regression')

In [7]: plt.xlabel('False Positive Rate')

In [8]: plt.ylabel('True Positive Rate')

In [9]: plt.title('Logistic Regression ROC Curve')

In [10]: plt.show()

Plotting the ROC curve

logreg.predict_proba(X_test)[:,1]
● predict_proba returns one column per class; [:,1] selects the predicted probability of class 1

Area under the ROC curve

Area under the ROC curve (AUC)


● Larger area under the ROC curve = better model
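
To make the link between the curve and the number concrete, the sketch below integrates an ROC curve with `sklearn.metrics.auc` and compares the result with `roc_auc_score`. The labels and scores are randomly generated for illustration: one set of scores carries signal, the other ignores the label.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)

# Hypothetical scores that carry signal: shifted upward for the positive class.
informative = rng.normal(loc=y_true * 2.0, scale=1.0)
# Hypothetical scores that ignore the label entirely.
random_scores = rng.normal(size=1000)

# Area under the ROC curve, computed from the curve itself.
fpr, tpr, _ = roc_curve(y_true, informative)
area_from_curve = auc(fpr, tpr)

auc_informative = roc_auc_score(y_true, informative)
auc_random = roc_auc_score(y_true, random_scores)

print(f"area from curve:    {area_from_curve:.3f}")
print(f"informative scores: {auc_informative:.3f}")
print(f"random scores:      {auc_random:.3f}")
```

Scores unrelated to the label give an AUC near 0.5, the area under the diagonal, while informative scores push the AUC toward 1.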

AUC in scikit-learn
In [1]: from sklearn.metrics import roc_auc_score

In [2]: logreg = LogisticRegression()

In [3]: X_train, X_test, y_train, y_test = train_test_split(X, y,
   ...:     test_size=0.4, random_state=42)

In [4]: logreg.fit(X_train, y_train)

In [5]: y_pred_prob = logreg.predict_proba(X_test)[:,1]

In [6]: roc_auc_score(y_test, y_pred_prob)
Out[6]: 0.997466216216

AUC using cross-validation


In [7]: from sklearn.model_selection import cross_val_score

In [8]: cv_scores = cross_val_score(logreg, X, y, cv=5,
   ...:     scoring='roc_auc')

In [9]: print(cv_scores)
[ 0.99673203  0.99183007  0.99583796  1.          0.96140652]

Hyperparameter tuning

Hyperparameter tuning
● Linear regression: Choosing parameters
● Ridge/lasso regression: Choosing alpha
● k-Nearest Neighbors: Choosing n_neighbors
● Parameters like alpha and k: Hyperparameters
● Hyperparameters cannot be learned by fitting the model
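
The distinction between hyperparameters and learned parameters shows up directly in code. In this sketch on synthetic regression data (the coefficients and noise level are made up), `alpha` is set before fitting, while `coef_` and `intercept_` only exist after `fit` has learned them.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data with known coefficients (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# alpha is a hyperparameter: it is chosen before fitting.
ridge = Ridge(alpha=0.1)
ridge.fit(X, y)

# coef_ and intercept_ are parameters: they are learned by fitting.
print("learned coefficients:", ridge.coef_)
print("learned intercept:   ", ridge.intercept_)

# Fitting does not change alpha; it stays at the value that was chosen.
print("alpha after fitting: ", ridge.alpha)
```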

Choosing the correct hyperparameter


● Try a bunch of different hyperparameter values
● Fit all of them separately
● See how well each performs
● Choose the best performing one
● It is essential to use cross-validation
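
The procedure in the list above can be written out as a plain loop. This is a minimal sketch, assuming synthetic data and an arbitrary list of candidate values for `n_neighbors`; each candidate is scored with 5-fold cross-validation and the one with the best mean score is kept.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data; an assumption standing in for the course dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=42)

# Candidate hyperparameter values to try (chosen arbitrarily here).
candidates = [1, 3, 5, 7, 9, 11]

mean_scores = {}
for k in candidates:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Fit and score each candidate separately with 5-fold cross-validation.
    mean_scores[k] = cross_val_score(knn, X, y, cv=5).mean()

# Choose the best-performing value.
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best n_neighbors:", best_k)
```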

Grid search cross-validation

Cross-validation score for each combination of C (rows) and Alpha (columns):

C \ Alpha     0.1      0.2      0.3      0.4
0.5          0.701    0.703    0.697    0.696
0.4          0.699    0.702    0.698    0.702
0.3          0.721    0.726    0.713    0.703
0.2          0.706    0.705    0.704    0.701
0.1          0.698    0.692    0.688    0.675
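
Grid search ends by picking the cell with the highest score. The sketch below does that for the grid of scores shown above, reading C along the rows and Alpha along the columns (an interpretation of the figure's layout).

```python
import numpy as np

c_values = [0.5, 0.4, 0.3, 0.2, 0.1]   # rows, top to bottom
alpha_values = [0.1, 0.2, 0.3, 0.4]    # columns, left to right

# Scores copied from the grid above.
scores = np.array([
    [0.701, 0.703, 0.697, 0.696],
    [0.699, 0.702, 0.698, 0.702],
    [0.721, 0.726, 0.713, 0.703],
    [0.706, 0.705, 0.704, 0.701],
    [0.698, 0.692, 0.688, 0.675],
])

# Locate the best-scoring cell and map it back to hyperparameter values.
row, col = np.unravel_index(np.argmax(scores), scores.shape)
best_c, best_alpha = c_values[row], alpha_values[col]

print(f"best score {scores[row, col]} at C={best_c}, Alpha={best_alpha}")
```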

GridSearchCV in scikit-learn
In [1]: import numpy as np

In [2]: from sklearn.model_selection import GridSearchCV

In [3]: from sklearn.neighbors import KNeighborsClassifier

In [4]: param_grid = {'n_neighbors': np.arange(1, 50)}

In [5]: knn = KNeighborsClassifier()

In [6]: knn_cv = GridSearchCV(knn, param_grid, cv=5)

In [7]: knn_cv.fit(X, y)

In [8]: knn_cv.best_params_
Out[8]: {'n_neighbors': 12}

In [9]: knn_cv.best_score_
Out[9]: 0.933216168717

Hold-out set for final evaluation

Hold-out set reasoning


● How well can the model perform on never before seen data?
● Using ALL data for cross-validation is not ideal
● Split data into training and hold-out set at the beginning
● Perform grid search cross-validation on training set
● Choose best hyperparameters and evaluate on hold-out set
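
The workflow in the list above can be sketched end to end. The data here is synthetic and the split size and parameter grid are arbitrary choices for illustration; the point is that the hold-out set is split off first and only touched once, after grid search has finished on the training set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data; an assumption standing in for the course dataset.
X, y = make_classification(n_samples=400, n_features=6, random_state=42)

# Split off the hold-out set at the very beginning.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Grid search with cross-validation on the training set only.
param_grid = {"n_neighbors": np.arange(1, 21)}
knn_cv = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
knn_cv.fit(X_train, y_train)

# Evaluate the refit best model once on the untouched hold-out set.
holdout_score = knn_cv.score(X_holdout, y_holdout)

print("best params:      ", knn_cv.best_params_)
print("best CV score:    ", knn_cv.best_score_)
print("hold-out accuracy:", holdout_score)
```

Because `GridSearchCV` refits the best model on the whole training set by default, `knn_cv.score` reports how the tuned model performs on data that played no part in choosing the hyperparameters.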