Crash course in using PyTorch to train models
Terence Parr
MSDS program
University of San Francisco
Supervised machine learning
summary
The goal of (supervised) machine learning
• Obtain a model that predicts numeric values (regressor) or
categories (classifier) from information about an entity:
Regression versus classification
• Regressors fit curves to data and classifiers draw decision
boundaries between data points in different categories
• Typically two sides of the same coin in implementation
The abstraction
• Training a model means capturing the
relationship between feature vectors and a target
variable, given a training data set
• Feature vector: set of features or attributes
characterizing an entity, such as square footage,
number of bedrooms, number of bathrooms
(Row vectors represent instances)
• Target: either a scalar value like rent price
(regressor), or an integer indicating
“creditworthy” or “it's not cancer” (classifier)
• Model captures the relationship in its specific parameter values
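For instance, a tiny apartment-rent training set might look like this as tensors; the numbers are made up purely for illustration:

import torch

# each row vector is one instance: [square footage, bedrooms, bathrooms]
X = torch.tensor([[ 900., 2., 1.],
                  [1200., 3., 2.],
                  [ 600., 1., 1.]])
y = torch.tensor([2100., 2800., 1500.])  # target: rent price (regression)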
Models
• Models are composed of parameters; predictions are a
computation based upon these parameters
• Models have architecture; e.g., number of layers, number of
neurons per layer, which nonlinearity to use, etc…
• Models have hyper-parameters that govern architecture and the
training process; e.g., learning rate, weight decay, dropout rate
• Hyper-parameters are specified by the programmer, not
computed from the training data; must be tuned
• Deep learning training is greatly affected by the learning rate, and
even by things like random parameter initialization
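As a sketch of where these choices live in PyTorch: the layer sizes and nonlinearity below are architecture decisions, while the learning rate and weight decay govern training; every specific number is an arbitrary choice by the programmer, not a value computed from data:

import torch
import torch.nn as nn

n_hidden = 10                   # architecture hyper-parameter: neurons per layer
model = nn.Sequential(
    nn.Linear(3, n_hidden),     # 3 input features, as in the apartment example
    nn.ReLU(),                  # which nonlinearity to use
    nn.Linear(n_hidden, 1)
)
# hyper-parameters governing the training process
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-4)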
Training
• Training a model means finding optimal (or good enough) model
parameters as measured by a loss (cost or error) function
• Loss function measures the difference between model
predictions and known targets
• Underfitting (biased): model unable to capture the relationship
(assuming there is a relationship to be had)
• Overfitting: model is too specific to the training data (fixates on
irrelevant fluctuations in training data) and doesn't generalize well
• To generalize means we get accurate predictions for test feature
vectors not found in the training set
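A minimal PyTorch training loop, assuming the model, optimizer, X, and y sketched on the previous slides; each iteration computes the loss (difference between predictions and known targets) and nudges the parameters to reduce it:

loss_fn = nn.MSELoss()
for epoch in range(200):
    y_pred = model(X).squeeze()   # predictions from current parameters
    loss = loss_fn(y_pred, y)     # how far off are we?
    optimizer.zero_grad()
    loss.backward()               # gradient of loss w.r.t. each parameter
    optimizer.step()              # adjust parameters downhill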
Terminology: Loss function vs metric
• Loss function: minimized to train a model;
e.g., gradient descent uses the loss to train a regularized linear model
• Metric: evaluate accuracy of predictions compared to known results (the
business perspective)
• Both are functions of $y$ and $\hat{y}$, but the loss is also possibly a function of the
model parameters (e.g., a regularized linear model's loss includes the parameters)
• Examples:
• Train: MSE loss & Metric: MSE metric
• Train: MSE loss & Metric: MAE metric
• Train: Log loss & Metric: misclassification rate or FP/FN metric
• If the loss/metric is applied to a validation or test set, it informs us about the
generality and quality of the model
See also the stackoverflow post by Christos Tsatsoulis: https://goo.gl/T5AmrT
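For example, we might train with MSE loss but report MAE as the business-facing metric; a self-contained sketch with made-up predictions and targets:

import torch

y_valid = torch.tensor([2100., 2800., 1500.])  # known targets (made up)
y_pred  = torch.tensor([2000., 2900., 1400.])  # model predictions (made up)
mse = torch.mean((y_pred - y_valid)**2)        # loss we'd minimize during training
mae = torch.mean((y_pred - y_valid).abs())     # metric we'd report for evaluation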
Train, validate, test
• We always need 3 data sets with known answers:
• training
• validation (as shorthand you’ll hear me & others call this test set)
• testing (put in a vault and don’t peek!!)
• Validation set: used to evaluate, tune models and features
• Any changes you make to the model tailor it to this specific validation set
• Test set: used exactly once after you think you have best model
• The only true measure of model’s generality, how it’ll perform in production
• Never use test set to tune model
• Production: recombine all sets back into one big training set,
retrain the model, but don't change it according to test-set metrics
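One common way to carve out the three sets, sketched with scikit-learn's train_test_split (assuming X and y are NumPy arrays; the 60/20/20 proportions are just an example):

from sklearn.model_selection import train_test_split

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5)
# tune models/features on (X_valid, y_valid); touch (X_test, y_test) exactly once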
Preparing data (Cars dataset)
• Each layer transforms the raw feature vector into better and better features;
the final linear layer can then make an excellent prediction
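A sketch of getting tabular data into tensors PyTorch can consume; the file name and column names here are hypothetical, not the actual Cars dataset layout:

import pandas as pd
import torch

df = pd.read_csv("cars.csv")  # hypothetical file
X = torch.tensor(df[["ENG", "WGT"]].values, dtype=torch.float32)   # features
y = torch.tensor(df["MPG"].values, dtype=torch.float32)            # target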
DL Building blocks
• Three basic units: linear, sigmoid, ReLU (rectified linear unit):
$\text{linear}(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$, $\sigma(z) = \frac{1}{1+e^{-z}}$, $\text{ReLU}(z) = \max(0, z)$
for $\mathbf{x}$ of dimension $m$
• Linear/logistic regression equivalents (one instance):
Regressor: $\hat{y} = \mathbf{w}^T \mathbf{x} + b$
Two-class classifier: $\hat{y} = \sigma(\mathbf{w}^T \mathbf{x} + b)$
(Underfitting a bit here; need more of a quadratic)
• Assume we magically know $\mathbf{w}$ and $b$
• (For simplicity, I'm using the proper $\mathbf{w}^T$ in math but omitting the transpose in diagrams)
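The same building blocks in PyTorch, applied to a sample input as a quick sketch:

import torch
import torch.nn as nn

x = torch.tensor([[1.0, -2.0, 3.0]])        # one instance, m=3 features
linear = nn.Linear(3, 1)                     # computes w^T x + b
print(linear(x))                             # regressor output
print(torch.sigmoid(linear(x)))              # squashed to (0,1): two-class classifier
print(torch.relu(torch.tensor([-1., 2.])))   # ReLU: max(0,z) -> tensor([0., 2.])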
Try adding layers to get more power
• But a sequence of linear models is just a linear model, so we insert
nonlinearities such as ReLU between the linear layers (demonstrated after the code below)
model = nn.Sequential(
    nn.Linear(m, 5),    # m = number of input features
    nn.ReLU(),
    nn.Linear(5, 1)
)
model = nn.Sequential(
    nn.Linear(1, 5),
    nn.ReLU(),
    nn.Linear(5, 2),
    nn.ReLU(),
    nn.Linear(2, 1)
)
(Network diagrams courtesy of TensorSensor: https://explained.ai/tensor-sensor/index.html)
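We can check the "stacked linears collapse" claim numerically: two linear layers with no nonlinearity between them compose into a single equivalent linear layer (a small sketch):

import torch
import torch.nn as nn

stacked = nn.Sequential(nn.Linear(1, 5), nn.Linear(5, 1))  # no ReLU in between
W1, b1 = stacked[0].weight, stacked[0].bias
W2, b2 = stacked[1].weight, stacked[1].bias
single = nn.Linear(1, 1)
with torch.no_grad():
    single.weight.copy_(W2 @ W1)        # composed weight
    single.bias.copy_(W2 @ b1 + b2)     # composed bias
x = torch.tensor([[3.0]])
print(stacked(x), single(x))            # identical outputs: still just a linear model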
Too much strength can lead to overfitting
• Models with too many parameters will overfit easily
if we train for a long time
• We'll look at regularization later
model = nn.Sequential(
    nn.Linear(1, 1000),
    nn.ReLU(),
    nn.Linear(1000, 1)
)
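As a preview of regularization, one simple counterweight is the optimizer's weight_decay argument, which penalizes large parameter values; just a sketch of the knob for the model above, not the full story:

optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=1e-3)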
Classifiers
Binary classifiers
• Add sigmoid to regressor and we get
a two-class classifier
• Prediction is probability of class 1
• A one-layer network (no hidden layers)
with a sigmoid activation function is
just a logistic regression model
• Provides hyperplane decision
surfaces
# 2 input vars: proline, alcohol
model = nn.Sequential(
    nn.Linear(2, 1),
    nn.Sigmoid(),
)
See https://github.com/parrt/fundamentals-of-deep-learning/blob/main/notebooks/5.binary-classifier-wine.ipynb
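Since the model outputs the probability of class 1, training pairs it with binary cross-entropy loss; a sketch continuing the earlier imports and assuming X is an (n,2) float tensor and y holds 0.0/1.0 labels:

loss_fn = nn.BCELoss()            # expects probabilities, so keep the final Sigmoid
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(300):
    p = model(X).squeeze()        # predicted probability of class 1
    loss = loss_fn(p, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()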
model = nn.Sequential(
    nn.Linear(2, 3),
    nn.Sigmoid(),
    nn.Linear(3, 1),
    nn.Sigmoid()
)
model = nn.Sequential(
    nn.Linear(2, 10),
    nn.Sigmoid(),
    nn.Linear(10, 1),
    nn.Sigmoid()
)
(Likely overfit)
model = nn.Sequential(
    nn.Linear(2, 10),
    nn.ReLU(),
    nn.Linear(10, 10),
    nn.ReLU(),
    nn.Linear(10, 1),
    nn.Sigmoid()
)
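Whatever the architecture, we turn the sigmoid output into a hard class prediction by thresholding at 0.5; a sketch assuming held-out tensors X_valid and y_valid:

with torch.no_grad():
    p = model(X_valid).squeeze()                  # probabilities of class 1
    y_pred = (p > 0.5).float()                    # hard 0/1 class predictions
    accuracy = (y_pred == y_valid).float().mean() # a misclassification-style metric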