Lecture_09_slides_-_after
Why deep learning?
Logistic regression review
Inputs x1, …, xd with weights w1, …, wd:
ŷ = σ(w^T x + b)
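As a concrete refresher, a minimal NumPy sketch of this computation, ŷ = σ(w^T x + b), for a single example (the dimensions and random values are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

d = 5                          # number of input features (illustrative)
rng = np.random.default_rng(0)
x = rng.normal(size=(d, 1))    # input example, shape (d, 1)
w = rng.normal(size=(d, 1))    # weight vector, shape (d, 1)
b = 0.0                        # bias

y_hat = sigmoid(w.T @ x + b)   # predicted probability, shape (1, 1)
print(y_hat.item())
```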
Why deep learning?
Limitations of logistic regression
“Fully-connected” layers
Each neuron of a layer is connected to all neurons of the following layer.
Neural networks
Inside a neuron
g = activation function. A neuron computes a = g(w^T x + b).
Connections to biological neurons
Source: towardsdatascience.com
Applications - nowadays everywhere!
Applications - examples in IGM
Neural networks
Representation
Notation: a_i^[l] denotes the activation of node i in layer l.
Input: x = (x1, x2, x3)^T    Shape (3, 1)
Weight vector for the first node of the first layer:
w_1^[1] = (w_{1,1}^[1], w_{1,2}^[1], w_{1,3}^[1])^T    Shape (3, 1)
Apply activation:
z_3^[1] = w_3^[1]T x + b_3^[1],   a_3^[1] = g^[1](z_3^[1])
z_4^[1] = w_4^[1]T x + b_4^[1],   a_4^[1] = g^[1](z_4^[1])
Vector notation:
W^[1] = [ w_1^[1]  w_2^[1]  w_3^[1]  w_4^[1] ]    Shape (3, 4)
b^[1] = (b_1^[1], b_2^[1], b_3^[1], b_4^[1])^T    Shape (4, 1)
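A minimal NumPy sketch of this first layer in vector notation, with W^[1] of shape (3, 4) whose columns are the per-node weight vectors (the choice of tanh for g^[1] and the random values are assumptions for illustration):

```python
import numpy as np

def g1(z):
    """Activation of layer 1; the slides keep g^[1] generic, tanh is assumed here."""
    return np.tanh(z)

rng = np.random.default_rng(0)
x  = rng.normal(size=(3, 1))   # input x, shape (3, 1)
W1 = rng.normal(size=(3, 4))   # W^[1]: columns w_1^[1], ..., w_4^[1], shape (3, 4)
b1 = rng.normal(size=(4, 1))   # b^[1], shape (4, 1)

z1 = W1.T @ x + b1             # z^[1] = W^[1]T x + b^[1], shape (4, 1)
a1 = g1(z1)                    # a^[1] = g^[1](z^[1]), one activation per node
print(a1.shape)                # (4, 1)
```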
NN - Activation functions
Rectified Linear Unit (ReLU):
▪ Saturates in only one direction, so it suffers less from the vanishing gradient problem
▪ Commonly used in practice
▪ Leaky ReLU and other ReLU variants (ELU, SELU, GELU, Swish, etc.) are sometimes used over ReLU
NN - Activation functions
Derivatives
Sigmoid:
σ(x) = 1 / (1 + e^(−x))
d/dx σ(x) = σ(x)(1 − σ(x))

Tanh:
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
d/dx tanh(x) = 1 − tanh^2(x)

Rectified Linear Unit (ReLU):
ReLU(x) = {0 if x ≤ 0; x if x > 0} = max(0, x)
d/dx ReLU(x) = {0 if x < 0; 1 if x > 0}
Note: The derivative of ReLU is undefined at x = 0. By convention, it is set to 0.
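These functions and their derivatives translate directly into NumPy; a short sketch (ReLU'(0) is set to 0, following the convention above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # sigma'(x) = sigma(x)(1 - sigma(x))

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2  # tanh'(x) = 1 - tanh(x)^2

def relu(x):
    return np.maximum(0.0, x)     # ReLU(x) = max(0, x)

def d_relu(x):
    # 0 for x < 0, 1 for x > 0; the undefined point x = 0 is set to 0 by convention.
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), d_relu(x))         # [0. 0. 3.] [0. 0. 1.]
print(d_sigmoid(0.0))             # 0.25
```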
Training neural nets
Neural networks
Training
Forward pass of 2 layer NN (for a single example):
z^[1] = W^[1]T x + b^[1]
a^[1] = g^[1](z^[1])
z^[2] = W^[2]T a^[1] + b^[2]
ŷ = a^[2] = g^[2](z^[2])

Combined:
ŷ = g^[2](W^[2]T g^[1](W^[1]T x + b^[1]) + b^[2])
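A direct NumPy transcription of this 2-layer forward pass for a single example (the layer sizes, the choice of ReLU and sigmoid for g^[1] and g^[2], and the random parameters are assumptions for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x  = rng.normal(size=(3, 1))   # single input example, shape (3, 1)
W1 = rng.normal(size=(3, 4))   # W^[1], shape (3, 4)
b1 = np.zeros((4, 1))          # b^[1], shape (4, 1)
W2 = rng.normal(size=(4, 1))   # W^[2], shape (4, 1)
b2 = np.zeros((1, 1))          # b^[2], shape (1, 1)

z1 = W1.T @ x + b1             # z^[1] = W^[1]T x + b^[1]
a1 = relu(z1)                  # a^[1] = g^[1](z^[1])
z2 = W2.T @ a1 + b2            # z^[2] = W^[2]T a^[1] + b^[2]
y_hat = sigmoid(z2)            # y_hat = a^[2] = g^[2](z^[2])
print(y_hat.item())
```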
Loss function
Gradient descent
Need to compute ∂L/∂W^[i] and ∂L/∂b^[i]
=> Gradient of the loss with respect to the weights
Forward pass: Compute the output of a neural network for a given input
Backward pass: Compute the derivatives of the loss with respect to the network parameters
During training, you need both the forward pass and the backward pass.
Inference: the process of using a trained machine learning model for prediction
Computing gradients
Back propagation
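As an illustration of backpropagation, a sketch that applies the chain rule by hand to the 2-layer network above, assuming a binary cross-entropy loss, a sigmoid output and a ReLU hidden activation (none of these concrete choices come from the slides):

```python
import numpy as np

def relu(z):    return np.maximum(0.0, z)
def d_relu(z):  return (z > 0).astype(float)     # ReLU'(0) := 0 by convention
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))    # single example
y = 1.0                        # its label

W1 = rng.normal(size=(3, 4)); b1 = np.zeros((4, 1))
W2 = rng.normal(size=(4, 1)); b2 = np.zeros((1, 1))

# Forward pass: cache the intermediate values needed by the backward pass.
z1 = W1.T @ x + b1
a1 = relu(z1)
z2 = W2.T @ a1 + b2
y_hat = sigmoid(z2)
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # binary cross-entropy

# Backward pass: chain rule from the output back towards the input.
dz2 = y_hat - y                # dL/dz^[2] for sigmoid + cross-entropy
dW2 = a1 @ dz2.T               # dL/dW^[2], shape (4, 1)
db2 = dz2                      # dL/db^[2]
da1 = W2 @ dz2                 # dL/da^[1], shape (4, 1)
dz1 = da1 * d_relu(z1)         # dL/dz^[1]
dW1 = x @ dz1.T                # dL/dW^[1], shape (3, 4)
db1 = dz1                      # dL/db^[1]

print(loss.item(), dW1.shape, dW2.shape)
```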
Neural networks
Forward pass
Loop:
1. Sample a batch of data
2. Forward pass to get the loss
3. Backward pass to calculate gradient
4. Update parameters using the gradient
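A minimal NumPy version of this loop for the logistic regression model from the start of the lecture (the toy dataset, learning rate and batch size are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))            # toy inputs
y = (X[:, :1] > 0).astype(float)       # toy binary labels, shape (n, 1)
w = np.zeros((d, 1)); b = 0.0          # parameters
lr, batch_size = 0.1, 32

for step in range(500):
    # 1. Sample a batch of data
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    # 2. Forward pass to get the loss (binary cross-entropy)
    y_hat = np.clip(sigmoid(Xb @ w + b), 1e-7, 1 - 1e-7)
    loss = -np.mean(yb * np.log(y_hat) + (1 - yb) * np.log(1 - y_hat))
    # 3. Backward pass to calculate the gradient
    dz = (y_hat - yb) / batch_size
    dw = Xb.T @ dz
    db = dz.sum()
    # 4. Update parameters using the gradient
    w -= lr * dw
    b -= lr * db

print("final loss:", float(loss))
```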
Deep learning frameworks are used to efficiently define and train neural networks
• Support for many types of layers, activations, loss functions, optimizers, …
• Backpropagation computed automatically (e.g. loss.backward() in PyTorch)
• GPU support for faster training
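For comparison, a sketch of the same training loop in PyTorch, where backpropagation is handled by loss.backward() and the parameter update by an optimizer (the architecture, data and hyperparameters are arbitrary choices for illustration):

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(200, 5)                # toy inputs
y = (X[:, :1] > 0).float()             # toy binary labels, shape (200, 1)

model = nn.Sequential(                 # a small 2-layer fully-connected network
    nn.Linear(5, 4), nn.ReLU(),
    nn.Linear(4, 1), nn.Sigmoid(),
)
loss_fn = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(500):
    # 1. Sample a batch of data
    idx = torch.randint(0, X.shape[0], (32,))
    xb, yb = X[idx], y[idx]
    # 2. Forward pass to get the loss
    loss = loss_fn(model(xb), yb)
    # 3. Backward pass to calculate the gradient (backpropagation)
    optimizer.zero_grad()
    loss.backward()
    # 4. Update parameters using the gradient
    optimizer.step()

print("final loss:", loss.item())
```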
k-means
Questions from problem set
Data covariance matrix