
Fundamentals of deep learning
Crash course in using PyTorch to train models
Terence Parr
MSDS program
University of San Francisco
Supervised machine learning summary
The goal of (supervised) machine learning
• Obtain a model that predicts numeric values (regressor) or
categories (classifier) from information about an entity:
Regression versus classification
• Regressors fit curves to data; classifiers draw decision
boundaries between data points in different categories
• Typically two sides of the same coin in implementation
The abstraction
• Training a model means capturing the relationship between
feature vectors and a target variable, given a training data set
• Feature vector x: set of features or attributes characterizing an
entity, such as square footage, num of bedrooms, bathrooms
• Target y: either a scalar value like rent price (regressor), or an
integer indicating a category like "creditworthy" or "it's not cancer" (classifier)
• Model captures that relationship in specific parameter values
(Row vectors represent instances)
Models
• Models are composed of parameters; predictions are a
computation based upon these parameters
• Models have architecture; e.g., number of layers, number of
neurons per layer, which nonlinearity to use, etc…
• Models have hyper-parameters that govern architecture and the
training process; e.g., learning rate, weight decay, dropout rate
• Hyper-parameters are specified by the programmer, not
computed from the training data; must be tuned
• Deep learning training is greatly affected by the learning rate, and
even by things like random parameter initialization
Training
• Training a model means finding optimal (or good enough) model
parameters as measured by a loss (cost or error) function
• Loss function measures the difference between model
predictions and known targets
• Underfitting (biased): model unable to capture the relationship
(assuming there is a relationship to be had)
• Overfitting: model is too specific to the training data (fixates on
irrelevant fluctuations in training data) and doesn't generalize well
• To generalize means we get accurate predictions for test feature
vectors not found in the training set
Terminology: Loss function vs metric
• Loss function: minimized to train a model
E.g., gradient descent uses the loss to train a regularized linear model
• Metric: evaluates accuracy of predictions compared to known results (the
business perspective)
• Both are functions of y and the prediction ŷ, but the loss is possibly also a function of
the model parameters (e.g., a regularized linear model's loss penalizes the parameters)
• Examples:
• Train: MSE loss & Metric: MSE metric
• Train: MSE loss & Metric: MAE metric
• Train: Log loss & Metric: misclassification rate or FP/FN metric
• If a loss/metric is applied to a validation or test set, it informs on the generality and
quality of your model
See also the stackoverflow post by Christos Tsatsoulis: https://goo.gl/T5AmrT
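A minimal sketch of the loss-vs-metric distinction in PyTorch (the y_true/y_pred values are made up, purely illustrative): train on MSE loss, report an MAE metric.

import torch
import torch.nn.functional as F

y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])   # known targets (made up)
y_pred = torch.tensor([2.5,  0.0, 2.0, 8.0])   # model predictions (made up)

# Loss: what gradient descent minimizes during training
mse_loss = F.mse_loss(y_pred, y_true)

# Metric: what we report from the business perspective
mae_metric = (y_pred - y_true).abs().mean()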
Train, validate, test
• We always need 3 data sets with known answers:
• training
• validation (as shorthand you’ll hear me & others call this test set)
• testing (put in a vault and don’t peek!!)
• Validation set: used to evaluate, tune models and features
• Any changes you make to the model tailor it to this specific validation set
• Test set: used exactly once after you think you have best model
• The only true measure of model’s generality, how it’ll perform in production
• Never use test set to tune model
• Production: recombine all sets back into a big training set again,
retrain model but don’t change it according to test set metrics
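As a rough sketch of carving out the three sets with scikit-learn (the 70/15/15 proportions and the random placeholder data are assumptions, not part of the slides):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)   # placeholder feature matrix
y = np.random.rand(1000)      # placeholder targets

# 70% train, then split the remaining 30% evenly into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)
# Tune on (X_valid, y_valid); touch (X_test, y_test) exactly once at the end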
Preparing data (Cars dataset)
• Everything must be numeric
• No missing values
• Dummy encode categoricals (pd.get_dummies())
• Should normalize numeric features to zero mean, variance one ("whitening")
• Speeds up training
• Compare regression equations, loss function surfaces
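A minimal data-prep sketch in pandas; the tiny dataframe and its column names (mpg, weight, origin) are made up to stand in for the Cars dataset:

import pandas as pd

df = pd.DataFrame({
    'mpg':    [18.0, 15.0, 36.1, 25.4],           # numeric target (made-up values)
    'weight': [3504, 3693, 1800, 2950],           # numeric feature
    'origin': ['USA', 'USA', 'Japan', 'Europe'],  # categorical feature
})

# Dummy-encode categoricals
df = pd.get_dummies(df, columns=['origin'])

# Normalize numeric features to zero mean, variance one ("whitening")
df['weight'] = (df['weight'] - df['weight'].mean()) / df['weight'].std()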
Deep learning regressors
What's a neural network?
• Ignore the neural network metaphor, but know the terminology
• A combination of linear and nonlinear transformations
• Linear: z = w·x + b
• Nonlinear: activation(z), e.g., sigmoid(z) or ReLU(z); called the activation function
• Networks have multiple layers; a layer is a stack of neurons
• Layers transform the raw feature vector into better and better features; the final linear
layer can then make an excellent prediction
DL Building blocks
• The building blocks: linear unit, sigmoid, ReLU (rectified linear unit)
• Linear unit: y = w·x + b for feature vector x
• Linear/logistic regression equivalents (one instance, one neuron):
• Regressor: y = w·x + b
• Two-class classifier: y = sigmoid(w·x + b)
• Underfitting a bit here (need more of a quadratic)
• Assume we magically know w and b
(For simplicity, I'm using the proper w^T x in math but omitting the transpose in diagrams)
Try adding layers to get more power
• But, a sequence of linear models is just a linear model
(the second layer's weight is a scalar since the first layer's output is scalar)

• PyTorch code:
model = nn.Sequential(
    nn.Linear(m, 1),   # m features
    nn.Linear(1, 1)
)
Still just a line

Must introduce nonlinearity
model = nn.Sequential(
    nn.Linear(m, 1),
    nn.ReLU(),
    nn.Linear(1, 1)
)
ReLU idea here: draw two lines, then clip at the intersection
Stack linear models (neurons) for more power
• Stacking neurons gives a layer: weight matrix W and bias vector b

model = nn.Sequential(
    nn.Linear(m, 5),
    nn.ReLU(),
    nn.Linear(5, 1)
)

All those w's and b's are different; the superscript (1) means layer 1

Math for the 1D dataset weight → MPG (leaving out the b's):

model = nn.Sequential(
    nn.Linear(1, 5),
    nn.ReLU(),
    nn.Linear(5, 2),
    nn.ReLU(),
    nn.Linear(2, 1)
)
(Figure courtesy of TensorSensor: https://explained.ai/tensor-sensor/index.html)
Too much strength can lead to overfitting
• Models with too many parameters will overfit easily if we train for a long time
• We'll look at regularization later
model = nn.Sequential(
nn.Linear(1, 1000),
nn.ReLU(),
nn.Linear(1000, 1)
)
Classifiers
Binary classifiers
• Add sigmoid to regressor and we get
a two-class classifier
• Prediction is probability of class 1
• A one-layer network (no hidden layers) with a
sigmoid activation function is just a
logistic regression model
• Provides hyper-plane decision
surfaces
# 2 input vars: proline, alcohol
model = nn.Sequential(
nn.Linear(2, 1),
nn.Sigmoid(),
)
See https://github.com/parrt/fundamentals-of-deep-learning/blob/main/notebooks/5.binary-classifier-wine.ipynb

Stack neurons, add layer


• We get a nonlinear decision surface

model = nn.Sequential(
nn.Linear(2, 3),
nn.Sigmoid(),
nn.Linear(3, 1),
nn.Sigmoid()
)

All those w's and b's are different


More neurons:
more complex decision surface

model = nn.Sequential(
nn.Linear(2, 10),
nn.Sigmoid(),
nn.Linear(10, 1),
nn.Sigmoid()
)

Likely overfit

Not only more complex than a hyperplane, but also non-contiguous regions!


Even ReLUs can get "curvy" surfaces

model = nn.Sequential(
nn.Linear(2, 10),
nn.ReLU(),
nn.Linear(10, 10),
nn.ReLU(),
nn.Linear(10, 1),
nn.Sigmoid()
)

(Last activation function still must be sigmoid)


k-class classifiers
• 2-class problems: final 1-neuron linear layer +
sigmoid layer; the output is the confidence in class 1
• k-class problems: final k-neuron linear layer +
softmax; output i is the confidence in class i
k-class classifiers
• Instead of one neuron in the last layer, we use k neurons for k classes
• Last layer has vector output: z = [z1, z2, ..., zk]
• Instead of sigmoid, we use the softmax function
• Vector of probabilities as activation: softmax(z)i = exp(zi) / sum_j exp(zj)
• Normalized probabilities of the k classes
Sample softmax computation
• For layer output vector z, each class probability is exp(zi) divided by the sum of exp(zj) over all classes
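A concrete softmax computation in PyTorch; the three scores in z are made-up values:

import torch

z = torch.tensor([1.0, 2.0, 0.1])      # example final-layer outputs for 3 classes
p = torch.softmax(z, dim=0)            # exp(z) / exp(z).sum()
print(p)                               # approximately tensor([0.2424, 0.6590, 0.0986])
print(p.sum())                         # tensor(1.) -- probabilities sum to 1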
Training deep learning networks
What does training mean?
• Making a prediction means running a feature vector through the
network
• i.e., computing a value using the model parameters; the same architecture
with different parameter values is a different model
• Training: find optimal (or good enough) model parameters as
measured by a loss (cost) function
• Loss function measures the difference between model
predictions and known targets
• We have a huge search space (of parameters) and it is
challenging to find parameters giving low loss
Loss functions
• Regression: typically mean squared error (MSE); the loss should have a
smooth derivative, though mean absolute error works despite the
discontinuity in its derivative (MAE itself is V-shaped)
• Classification: log loss (also called cross entropy)
• Penalizes very confident misclassifications strongly
• Function of actual and estimated probabilities, not predicted class
• Perfect score is 0 log loss; imperfection gives unbounded scores
• PyTorch log loss: loss = F.cross_entropy(y_scores, y_true)
(F.cross_entropy expects the raw final-layer scores; it applies log-softmax internally)
• Predictions: y_pred = y_scores.argmax(dim=1)
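A sketch of computing log loss and predictions in PyTorch; the scores and labels are made up. Note that F.cross_entropy takes the raw final-layer scores and applies log-softmax itself:

import torch
import torch.nn.functional as F

# Raw final-layer scores for 3 instances and 4 classes (made-up values)
scores = torch.tensor([[ 2.0, 0.5, -1.0, 0.1],
                       [-0.3, 1.7,  0.2, 0.0],
                       [ 0.1, 0.1,  3.0, 0.2]])
y_true = torch.tensor([0, 1, 2])            # true class indices

loss   = F.cross_entropy(scores, y_true)    # log loss, averaged over instances
y_pred = scores.argmax(dim=1)               # predicted classes: tensor([0, 1, 2])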
Log loss penalty(p)
• Let p be the predicted probability that y = 1
• loss = penalty(p) if y = 1 else penalty(1 - p)
• Let penalty(p) = -log(p)
• Two-class log loss (averaged over N instances):
  log loss = -(1/N) * sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]

So log loss is the average penalty, where the penalty is very high
for confidence in the wrong answer
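A small numeric check of the two-class formula in PyTorch; the probabilities and labels are made up:

import torch
import torch.nn.functional as F

p      = torch.tensor([0.9, 0.2, 0.7])    # predicted P(y=1) for 3 instances (made up)
y_true = torch.tensor([1.0, 0.0, 1.0])    # true labels

# Built-in: mean of -[y*log(p) + (1-y)*log(1-p)]
loss = F.binary_cross_entropy(p, y_true)

# Same thing by hand
manual = -(y_true * torch.log(p) + (1 - y_true) * torch.log(1 - p)).mean()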
Gradient descent
• We use information about the loss function in the
neighborhood of the current parameters (here called w) to
decide which direction shifts w towards smaller loss
• Tweak parameters in that direction, amplified by a
learning rate
• Go in the opposite direction of the slope

Minimize loss with:
while not_converged:
    w = w - rate * gradient(w)
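A toy gradient-descent sketch using PyTorch autograd; the quadratic loss and the starting value are made up just to illustrate the update rule:

import torch

w = torch.tensor(5.0, requires_grad=True)   # initial parameter (arbitrary)
rate = 0.1                                  # learning rate

for step in range(100):
    loss = (w - 3.0) ** 2        # toy loss with its minimum at w = 3
    loss.backward()              # compute d(loss)/dw
    with torch.no_grad():
        w -= rate * w.grad       # step opposite the slope
    w.grad.zero_()               # reset gradient for the next step
# w is now close to 3.0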
If the learning rate is too high?
• We oscillate across
valleys
• It can even diverge
(loss explodes)
• If too small, we don't
make progress toward the minimum
Training process
1. Prepare data
• normalize numeric variables
• dummy vars for categoricals
• conjure up values for missing values
2. Split out at least a validation set from training set
3. Choose network architecture, appropriate loss function
4. Choose hyper-parameters, such as dropout rate
5. Choose a learning rate, number of epochs (passes through data)
6. Run training loop (until validation error goes up or num iterations)
7. Goto 3, 4, or 5 to tweak; iterate until good enough
Training loop
Regression (vectorized forward network pass: send in all instances at once):
for epoch in range(nepochs):
    y_train_pred = model(X_train)
    loss = MSE(y_train_pred, y_train)
    # update model parameters in direction of lower loss

Classification:
for epoch in range(nepochs):
    y_train_pred = model(X_train)   # assume softmax final layer
    loss = cross_entropy(y_train_pred, y_train)
    # update model parameters in direction of lower loss
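A runnable version of the regression loop above on made-up data; the architecture, the Adam optimizer, and the learning rate are illustrative choices, not prescribed by the slides:

import torch
import torch.nn as nn

torch.manual_seed(0)
X_train = torch.randn(100, 3)                 # 100 instances, 3 features (made up)
y_train = torch.randn(100, 1)                 # targets (made up)

model = nn.Sequential(nn.Linear(3, 5), nn.ReLU(), nn.Linear(5, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(200):
    y_train_pred = model(X_train)             # vectorized forward pass
    loss = loss_fn(y_train_pred, y_train)
    optimizer.zero_grad()
    loss.backward()                           # gradients of loss w.r.t. parameters
    optimizer.step()                          # update parameters toward lower loss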
Common train vs validation loss behavior
• DL networks have so many
parameters, we can often get
training error down to zero!
• But, we care about generalization
• Unfortunately, validation error often
tracks away from training error as
the number of epochs increases
• This model is clearly overfitting
• Need to use regularization to
improve validation loss
Regularization techniques
• Get more training data; can try augmentation techniques
(more data is likely to represent population distribution better)
• Reduce number of model parameters (i.e., simplify it)
(reduce power/ability to fit the noise)
• Add dropout layers (randomly kill some neurons during training)
• Weight decay (L2 regularization on model parameters,
restrict model parameter search space)
• Early stopping, when validation error starts to go up
(generally we choose model that yields the best validation error)
• Batch normalization has some small regularization effect
(Force layer activation distributions to be 0-mean, variance 1)
• Stochastic gradient descent tends to land on better generalizations
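A sketch of two of these techniques in PyTorch; the layer sizes, dropout probability, and weight-decay amount are illustrative:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Dropout(p=0.3),     # randomly zero 30% of activations during training
    nn.Linear(50, 1)
)

# Weight decay = L2 regularization on the model parameters
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-4)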
Aside: What is vectorization?
• Use vectors, not loops:
for i in range(len(a)):
    c[i] = a[i] + b[i]
• For torch/numpy arrays, we can use vector math instead of a loop:
c = a + b
• Gives an opportunity to execute the vector addition in parallel
Vectorization in training loop
• Running one instance through the network is how we think about it
• In practice, we send a subset or all instances through the
network in one go and compare all predictions to all targets
• Instead of looping through instances, we pass the whole matrix X through to use
matrix-matrix multiplies instead of matrix-vector multiplies

# One instance at a time:
for epoch in range(nepochs):
    for i in range(n):
        x = X[i]
        y[i] = model(x)

# All instances at once (get 100 answers in one call):
for epoch in range(nepochs):
    Y = model(X)
    ...

Assume n=100, m=3, n_neurons=1, giving a 1x3 weight matrix W
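A sketch of the all-at-once version; the sizes match the n=100, m=3, single-neuron example, and the data is random:

import torch
import torch.nn as nn

n, m = 100, 3
X = torch.randn(n, m)                   # 100 instances, 3 features each (random)
model = nn.Sequential(nn.Linear(m, 1))  # one neuron: 1x3 weight matrix W plus bias

Y = model(X)                            # one matrix-matrix multiply; Y has shape (100, 1)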


Summary
• Vanilla deep learning models are layers of linear regression models glued
together with nonlinear functions such as sigmoid/ReLUs
• Regressor: final layer transforms previous layer to single output
• Classifier: add sigmoid to the last regressor layer (2-class) or add softmax to a
last layer of k neurons (k-class)
• Training a model means finding optimal (or good enough) model
parameters as measured by a loss (cost or error) function; hyper-parameters
describe the architecture, learning rate, amount of
regularization, etc.
• We train using (stochastic) gradient descent; tuning the model and hyper-parameters
is more or less trial and error, but experience helps a lot
