
Fundamentals of deep learning
Crash course in using PyTorch to train models
Terence Parr
MSDS program
University of San Francisco
Supervised machine learning summary
The goal of (supervised) machine learning
• Obtain a model that predicts numeric values (regressor) or
categories (classifier) from information about an entity:
Regression versus classification
• Regressors fit curves to data; classifiers draw decision
boundaries between data points in different categories
• Typically two sides of the same coin in implementation
The abstraction
• Training a model means capturing the relationship between
feature vectors and a target variable, given a training data set
• Feature vector x: set of features or attributes characterizing an
entity, such as square footage, num of bedrooms, bathrooms
• Target y: either a scalar value like rent price (regressor), or an
integer indicating a category like "creditworthy" or "it's not cancer" (classifier)
• Model captures that relationship in specific parameter values
(Row vectors represent instances)
Models
• Models are composed of parameters; predictions are a
computation based upon these parameters
• Models have architecture; e.g., number of layers, number of
neurons per layer, which nonlinearity to use, etc…
• Models have hyper-parameters that govern architecture and the
training process; e.g., learning rate, weight decay, dropout rate
• Hyper-parameters are specified by the programmer, not
computed from the training data; must be tuned
• Deep learning training is greatly affected by the learning rate, and
even by things like random parameter initialization
Training
• Training a model means finding optimal (or good enough) model
parameters as measured by a loss (cost or error) function
• Loss function measures the difference between model
predictions and known targets
• Underfitting (biased): model unable to capture the relationship
(assuming there is a relationship to be had)
• Overfitting: model is too specific to the training data (fixates on
irrelevant fluctuations in training data) and doesn't generalize well
• To generalize means we get accurate predictions for test feature
vectors not found in the training set
Terminology: Loss function vs metric
• Loss function: minimized to train a model
E.g., gradient descent uses the loss to train a regularized linear model
• Metric: evaluates accuracy of predictions compared to known results (the
business perspective)
• Both are functions of y and the prediction ŷ, but the loss is possibly also a function of
the model parameters (e.g., a regularized linear model's loss penalizes the parameters)
• Examples:
• Train: MSE loss & Metric: MSE metric
• Train: MSE loss & Metric: MAE metric
• Train: Log loss & Metric: misclassification rate or FP/FN metric
• If a loss/metric is applied to a validation or test set, it informs on the generality and
quality of your model
See also the stackoverflow post by Christos Tsatsoulis: https://goo.gl/T5AmrT
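A minimal sketch of the loss-vs-metric distinction in PyTorch (the y_true/y_pred values are made up, purely illustrative): train on MSE loss, report an MAE metric.

import torch
import torch.nn.functional as F

y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])   # known targets (made up)
y_pred = torch.tensor([2.5,  0.0, 2.0, 8.0])   # model predictions (made up)

# Loss: what gradient descent minimizes during training
mse_loss = F.mse_loss(y_pred, y_true)

# Metric: what we report from the business perspective
mae_metric = (y_pred - y_true).abs().mean()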
Train, validate, test
• We always need 3 data sets with known answers:
• training
• validation (as shorthand you’ll hear me & others call this test set)
• testing (put in a vault and don’t peek!!)
• Validation set: used to evaluate, tune models and features
• Any changes you make to the model tailor it to this specific validation set
• Test set: used exactly once after you think you have best model
• The only true measure of model’s generality, how it’ll perform in production
• Never use test set to tune model
• Production: recombine all sets back into a big training set again,
retrain model but don’t change it according to test set metrics
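As a rough sketch of carving out the three sets with scikit-learn (the 70/15/15 proportions and the random placeholder data are assumptions, not part of the slides):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)   # placeholder feature matrix
y = np.random.rand(1000)      # placeholder targets

# 70% train, then split the remaining 30% evenly into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)
# Tune on (X_valid, y_valid); touch (X_test, y_test) exactly once at the end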
Preparing data (Cars dataset)
• Everything must be numeric
• No missing values
• Dummy encode categoricals (pd.get_dummies())
• Should normalize numeric features to zero mean, variance one ("whitening")
• Speeds up training
• Compare regression equations, loss function surfaces
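A minimal data-prep sketch in pandas; the tiny dataframe and its column names (mpg, weight, origin) are made up to stand in for the Cars dataset:

import pandas as pd

df = pd.DataFrame({
    'mpg':    [18.0, 15.0, 36.1, 25.4],           # numeric target (made-up values)
    'weight': [3504, 3693, 1800, 2950],           # numeric feature
    'origin': ['USA', 'USA', 'Japan', 'Europe'],  # categorical feature
})

# Dummy-encode categoricals
df = pd.get_dummies(df, columns=['origin'])

# Normalize numeric features to zero mean, variance one ("whitening")
df['weight'] = (df['weight'] - df['weight'].mean()) / df['weight'].std()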
Deep learning regressors
What's a neural network?
• Ignore the neural network metaphor, but know the terminology
• A combination of linear and nonlinear transformations
• Linear: z = w·x + b
• Nonlinear: activation(z), e.g., sigmoid(z) or ReLU(z); called the activation function
• Networks have multiple layers; a layer is a stack of neurons
• Layers transform the raw feature vector into better and better features; the final linear
layer can then make an excellent prediction
DL Building blocks
• The building blocks: linear unit, sigmoid, ReLU (rectified linear unit)
• Linear unit: y = w·x + b for feature vector x
• Linear/logistic regression equivalents (one instance, one neuron):
• Regressor: y = w·x + b
• Two-class classifier: y = sigmoid(w·x + b)
• Underfitting a bit here (need more of a quadratic)
• Assume we magically know w and b
(For simplicity, I'm using the proper w^T x in math but omitting the transpose in diagrams)
Try adding layers to get more power
• But, a sequence of linear models is just a linear model
(the second layer's weight is a scalar since the first layer's output is scalar)

• PyTorch code:
model = nn.Sequential(
    nn.Linear(m, 1),   # m features
    nn.Linear(1, 1)
)
Still just a line

Must introduce nonlinearity
model = nn.Sequential(
    nn.Linear(m, 1),
    nn.ReLU(),
    nn.Linear(1, 1)
)
ReLU idea here: draw two lines, then clip at the intersection
Stack linear models (neurons) for more power
• Stacking neurons gives a layer: weight matrix W and bias vector b

model = nn.Sequential(
    nn.Linear(m, 5),
    nn.ReLU(),
    nn.Linear(5, 1)
)

All those w's and b's are different; the superscript (1) means layer 1

Math for the 1D dataset weight → MPG (leaving out the b's):

model = nn.Sequential(
    nn.Linear(1, 5),
    nn.ReLU(),
    nn.Linear(5, 2),
    nn.ReLU(),
    nn.Linear(2, 1)
)
(Figure courtesy of TensorSensor: https://explained.ai/tensor-sensor/index.html)
Too much strength can lead to overfitting
• Models with too many parameters will overfit easily if we train for a long time
• We'll look at regularization later
model = nn.Sequential(
nn.Linear(1, 1000),
nn.ReLU(),
nn.Linear(1000, 1)
)
Classifiers
Binary classifiers
• Add sigmoid to regressor and we get
a two-class classifier
• Prediction is probability of class 1
• A one-layer network (no hidden layers) with a
sigmoid activation function is just a
logistic regression model
• Provides hyper-plane decision
surfaces
# 2 input vars: proline, alcohol
model = nn.Sequential(
nn.Linear(2, 1),
nn.Sigmoid(),
)
See https://github.com/parrt/fundamentals-of-deep-learning/blob/main/notebooks/5.binary-classifier-wine.ipynb

Stack neurons, add layer


• We get a nonlinear decision surface

model = nn.Sequential(
nn.Linear(2, 3),
nn.Sigmoid(),
nn.Linear(3, 1),
nn.Sigmoid()
)

All those w's and b's are different


More neurons:
more complex decision surface

model = nn.Sequential(
nn.Linear(2, 10),
nn.Sigmoid(),
nn.Linear(10, 1),
nn.Sigmoid()
)

Likely overfit

Not only more complex than a hyperplane, but also non-contiguous regions!


Even ReLUs can get "curvy" surfaces

model = nn.Sequential(
nn.Linear(2, 10),
nn.ReLU(),
nn.Linear(10, 10),
nn.ReLU(),
nn.Linear(10, 1),
nn.Sigmoid()
)

(Last activation function still must be sigmoid)


k-class classifiers
• 2-class problems: final 1-neuron linear layer +
sigmoid layer; the output is the confidence in class 1
• k-class problems: final k-neuron linear layer +
softmax; output i is the confidence in class i
k-class classifiers
• Instead of one neuron in the last layer, we use k neurons for k classes
• Last layer has vector output: z = [z1, z2, ..., zk]
• Instead of sigmoid, we use the softmax function
• Vector of probabilities as activation: softmax(z)i = exp(zi) / sum_j exp(zj)
• Normalized probabilities of the k classes
Sample softmax computation
• For layer output vector z, each class probability is exp(zi) divided by the sum of exp(zj) over all classes
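A concrete softmax computation in PyTorch; the three scores in z are made-up values:

import torch

z = torch.tensor([1.0, 2.0, 0.1])      # example final-layer outputs for 3 classes
p = torch.softmax(z, dim=0)            # exp(z) / exp(z).sum()
print(p)                               # approximately tensor([0.2424, 0.6590, 0.0986])
print(p.sum())                         # tensor(1.) -- probabilities sum to 1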
Training deep learning networks
What does training mean?
• Making a prediction means running a feature vector through the
network
• i.e., computing a value using the model parameters; the same architecture
with different parameter values is a different model
• Training: find optimal (or good enough) model parameters as
measured by a loss (cost) function
• Loss function measures the difference between model
predictions and known targets
• We have a huge search space (of parameters) and it is
challenging to find parameters giving low loss
Loss functions
• Regression: typically mean squared error (MSE); the loss should have a
smooth derivative, though mean absolute error works despite the
discontinuity in its derivative (MAE itself is V-shaped)
• Classification: log loss (also called cross entropy)
• Penalizes very confident misclassifications strongly
• Function of actual and estimated probabilities, not predicted class
• Perfect score is 0 log loss; imperfection gives unbounded scores
• PyTorch log loss: loss = F.cross_entropy(y_scores, y_true)
(F.cross_entropy expects the raw final-layer scores; it applies log-softmax internally)
• Predictions: y_pred = y_scores.argmax(dim=1)
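A sketch of computing log loss and predictions in PyTorch; the scores and labels are made up. Note that F.cross_entropy takes the raw final-layer scores and applies log-softmax itself:

import torch
import torch.nn.functional as F

# Raw final-layer scores for 3 instances and 4 classes (made-up values)
scores = torch.tensor([[ 2.0, 0.5, -1.0, 0.1],
                       [-0.3, 1.7,  0.2, 0.0],
                       [ 0.1, 0.1,  3.0, 0.2]])
y_true = torch.tensor([0, 1, 2])            # true class indices

loss   = F.cross_entropy(scores, y_true)    # log loss, averaged over instances
y_pred = scores.argmax(dim=1)               # predicted classes: tensor([0, 1, 2])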
Log loss penalty(p)
• Let p be the predicted probability that y = 1
• loss = penalty(p) if y = 1 else penalty(1 - p)
• Let penalty(p) = -log(p)
• Two-class log loss (averaged over N instances):
  log loss = -(1/N) * sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]

So log loss is the average penalty, where the penalty is very high
for confidence in the wrong answer
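A small numeric check of the two-class formula in PyTorch; the probabilities and labels are made up:

import torch
import torch.nn.functional as F

p      = torch.tensor([0.9, 0.2, 0.7])    # predicted P(y=1) for 3 instances (made up)
y_true = torch.tensor([1.0, 0.0, 1.0])    # true labels

# Built-in: mean of -[y*log(p) + (1-y)*log(1-p)]
loss = F.binary_cross_entropy(p, y_true)

# Same thing by hand
manual = -(y_true * torch.log(p) + (1 - y_true) * torch.log(1 - p)).mean()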
Gradient descent
• We use information about the loss function in the
neighborhood of the current parameters (here called w) to
decide which direction shifts w towards smaller loss
• Tweak parameters in that direction, amplified by a
learning rate
• Go in the opposite direction of the slope

Minimize loss with:
while not_converged:
    w = w - rate * gradient(w)
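A toy gradient-descent sketch using PyTorch autograd; the quadratic loss and the starting value are made up just to illustrate the update rule:

import torch

w = torch.tensor(5.0, requires_grad=True)   # initial parameter (arbitrary)
rate = 0.1                                  # learning rate

for step in range(100):
    loss = (w - 3.0) ** 2        # toy loss with its minimum at w = 3
    loss.backward()              # compute d(loss)/dw
    with torch.no_grad():
        w -= rate * w.grad       # step opposite the slope
    w.grad.zero_()               # reset gradient for the next step
# w is now close to 3.0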
If the learning rate is too high?
• We oscillate across
valleys
• It can even diverge
(loss explodes)
• If too small, we don't
make progress toward the minimum
Training process
1. Prepare data
• normalize numeric variables
• dummy vars for categoricals
• conjure up values for missing values
2. Split out at least a validation set from training set
3. Choose network architecture, appropriate loss function
4. Choose hyper-parameters, such as dropout rate
5. Choose a learning rate, number of epochs (passes through data)
6. Run training loop (until validation error goes up or num iterations)
7. Goto 3, 4, or 5 to tweak; iterate until good enough
Training loop
Regression (vectorized forward network pass: send in all instances at once):
for epoch in range(nepochs):
    y_train_pred = model(X_train)
    loss = MSE(y_train_pred, y_train)
    # update model parameters in direction of lower loss

Classification:
for epoch in range(nepochs):
    y_train_pred = model(X_train)   # assume softmax final layer
    loss = cross_entropy(y_train_pred, y_train)
    # update model parameters in direction of lower loss
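A runnable version of the regression loop above on made-up data; the architecture, the Adam optimizer, and the learning rate are illustrative choices, not prescribed by the slides:

import torch
import torch.nn as nn

torch.manual_seed(0)
X_train = torch.randn(100, 3)                 # 100 instances, 3 features (made up)
y_train = torch.randn(100, 1)                 # targets (made up)

model = nn.Sequential(nn.Linear(3, 5), nn.ReLU(), nn.Linear(5, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(200):
    y_train_pred = model(X_train)             # vectorized forward pass
    loss = loss_fn(y_train_pred, y_train)
    optimizer.zero_grad()
    loss.backward()                           # gradients of loss w.r.t. parameters
    optimizer.step()                          # update parameters toward lower loss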
Common train vs validation loss behavior
• DL networks have so many
parameters, we can often get
training error down to zero!
• But, we care about generalization
• Unfortunately, validation error often
tracks away from training error as
the number of epochs increases
• This model is clearly overfitting
• Need to use regularization to
improve validation loss
Regularization techniques
• Get more training data; can try augmentation techniques
(more data is likely to represent population distribution better)
• Reduce number of model parameters (i.e., simplify it)
(reduce power/ability to fit the noise)
• Add dropout layers (randomly kill some neurons during training)
• Weight decay (L2 regularization on model parameters,
restrict model parameter search space)
• Early stopping, when validation error starts to go up
(generally we choose model that yields the best validation error)
• Batch normalization has some small regularization effect
(Force layer activation distributions to be 0-mean, variance 1)
• Stochastic gradient descent tends to land on better generalizations
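A sketch of two of these techniques in PyTorch; the layer sizes, dropout probability, and weight-decay amount are illustrative:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 50),
    nn.ReLU(),
    nn.Dropout(p=0.3),     # randomly zero 30% of activations during training
    nn.Linear(50, 1)
)

# Weight decay = L2 regularization on the model parameters
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=1e-4)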
Aside: What is vectorization?
• Use vectors, not loops:
for i in range(len(a)):
    c[i] = a[i] + b[i]
• For torch/numpy arrays, we can use vector math instead of a loop:
c = a + b
• Gives an opportunity to execute the vector addition in parallel
Vectorization in training loop
• Running one instance through the network is how we think about it
• In practice, we send a subset or all instances through the
network in one go and compare all predictions to all targets
• Instead of looping through instances, we pass the whole matrix X through to use
matrix-matrix multiplies instead of matrix-vector multiplies

# One instance at a time:
for epoch in range(nepochs):
    for i in range(n):
        x = X[i]
        y[i] = model(x)

# All instances at once (get 100 answers in one call):
for epoch in range(nepochs):
    Y = model(X)
    ...

Assume n=100, m=3, n_neurons=1, giving a 1x3 weight matrix W
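A sketch of the all-at-once version; the sizes match the n=100, m=3, single-neuron example, and the data is random:

import torch
import torch.nn as nn

n, m = 100, 3
X = torch.randn(n, m)                   # 100 instances, 3 features each (random)
model = nn.Sequential(nn.Linear(m, 1))  # one neuron: 1x3 weight matrix W plus bias

Y = model(X)                            # one matrix-matrix multiply; Y has shape (100, 1)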


Summary
• Vanilla deep learning models are layers of linear regression models glued
together with nonlinear functions such as sigmoid/ReLUs
• Regressor: final layer transforms previous layer to single output
• Classifier: add sigmoid to the last regressor layer (2-class) or add softmax to a
last layer of k neurons (k-class)
• Training a model means finding optimal (or good enough) model
parameters as measured by a loss (cost or error) function; hyper-parameters
describe the architecture, learning rate, amount of
regularization, etc.
• We train using (stochastic) gradient descent; tuning the model and hyper-parameters
is more or less trial and error, but experience helps a lot
