Lecture_09_slides_-_after

The document provides an overview of neural networks, focusing on their application in classification and regression tasks. It covers key concepts such as activation functions, training methods including backpropagation and stochastic gradient descent, and the advantages of deep learning over traditional methods like logistic regression. Additionally, it highlights the importance of deep learning frameworks like PyTorch and TensorFlow for efficient neural network implementation and training.

Neural networks for classification and regression
Outline

▪ Why deep learning?


▪ Neural networks
• activation functions
▪ Training
• Backpropagation
• Stochastic gradient descent

▪ Quiz review and answers to some questions from the exercises/problem sets


Introduction · Linear regression · Logistic regression
Feature engineering · Data statistics · Naive Bayes
KNN · Clustering · Dimensionality reduction
Neural networks · Convolutional neural networks · Decision trees
Background for Neural networks
Review of supervised learning
Feature vectors, independent variables: x^i ∈ ℝ^d, i = 1, 2, …, N

Labels, dependent variables, target, outcome: y^i ∈ ℝ^m or y^i ∈ {1, 2, …, k}

Training data/set/example: S = {(x^i, y^i)} for supervised learning, S = {x^i} for unsupervised learning

Sample, sample point, data point: x^i, (x^i, y^i)
Why deep learning?
Logistic regression review

Logistic regression for d-dimensional data:

Input: x = (x1, …, xd)
Weights: w = (w1, …, wd)
Bias: b
Logistic (sigmoid) activation: σ(z) = 1 / (1 + e^(-z))
Output: ŷ = σ(wT x + b)

[Diagram: each input x1, …, xd is multiplied by its weight w1, …, wd and summed with the bias to give wT x + b, which is passed through the activation function σ to produce the output ŷ.]
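As a quick illustration, a minimal NumPy sketch of this computation (the feature values, weights, and bias below are made-up numbers):

import numpy as np

def sigmoid(z):
    # Logistic activation: squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative example with d = 3 features
x = np.array([0.5, -1.2, 3.0])   # input features
w = np.array([0.1, 0.4, -0.3])   # weights
b = 0.2                          # bias

y_hat = sigmoid(w @ x + b)       # ŷ = σ(wT x + b)
print(y_hat)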
Why deep learning?
Limitations of logistic regression

Issue: Logistic regression performs badly on non-linearly separable data

Potential fix: Use feature engineering to make the data linearly separable, then use logistic regression

However:
▪ Features that linearly separate the data can be hard to find manually, especially in high dimensions
Why deep learning?
From logistic regression to neural networks
Neural networks have been successful in learning complex, non-linear functions
Why deep learning?
New way to approach ML

Before deep learning:
Data → Features → Model
(hand-design the features)

Deep learning:
Data → Features → Model
(deep neural networks derive useful features from the data!)
Neural networks
Neural networks
Representation
[Diagram: two fully-connected networks with their weights labeled]
Left: "2-layer Neural Net", or "1-hidden-layer Neural Net"
Right: "3-layer Neural Net", or "2-hidden-layer Neural Net"

"Fully-connected" layers: each neuron of a layer is connected to all neurons of the following layer
Neural networks
Inside a neuron

g = Activation function
Connections to biological neurons

Source: towardsdatascience.com
Applications - nowadays everywhere!
Applications - examples in IGM
Neural networks
Representation
Notation: ai[l] denotes node i in layer l.

Input (shape (3, 1)):
x = [x1, x2, x3]T

Weight vector for the first node of the first layer (shape (3, 1)):
w1[1] = [w1,1[1], w1,2[1], w1,3[1]]T

Pre-activations of the first layer (one per node):
z1[1] = w1[1]T x + b1[1]
z2[1] = w2[1]T x + b2[1]
z3[1] = w3[1]T x + b3[1]
z4[1] = w4[1]T x + b4[1]

Apply the activation (a nonlinear transformation of x):
a1[1] = g[1](z1[1])
a2[1] = g[1](z2[1])
a3[1] = g[1](z3[1])
a4[1] = g[1](z4[1])

Vector notation:
W[1] = [w1[1]  w2[1]  w3[1]  w4[1]]   (shape (3, 4))
b[1] = [b1[1], b2[1], b3[1], b4[1]]T

z[1] = W[1]T x + b[1]
a[1] = g[1](z[1])   (apply the activation element-wise)
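As an illustration, a minimal NumPy sketch of this vectorized first-layer computation (the shapes match the example above; the random values and the choice of tanh for g[1] are assumptions for the sketch):

import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 3 inputs, 4 nodes in the first layer
W1 = rng.standard_normal((3, 4))   # W[1], shape (3, 4) — one column per node
b1 = np.zeros((4, 1))              # b[1]
x = rng.standard_normal((3, 1))    # input x, shape (3, 1)

z1 = W1.T @ x + b1                 # z[1] = W[1]T x + b[1], shape (4, 1)
a1 = np.tanh(z1)                   # a[1] = g[1](z[1]), element-wise nonlinearity
print(a1.shape)                    # (4, 1)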
Activation functions
NN - Activation Function
Introduction
ŷ = g[2](W[2]T g[1](W[1]T x + b[1]) + b[2])

Q: What happens if we remove the activations?

ŷ = W[2]T (W[1]T x + b[1]) + b[2]
ŷ = W[2]T W[1]T x + W[2]T b[1] + b[2]

Define W′T = W[2]T W[1]T and b′ = W[2]T b[1] + b[2]:

ŷ = W′T x + b′

A: We end up with a linear classifier!
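A quick numerical check of this collapse, with randomly generated (illustrative) weights:

import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((4, 2)), rng.standard_normal((2, 1))
x = rng.standard_normal((3, 1))

# Two linear layers without activations...
y_two_layers = W2.T @ (W1.T @ x + b1) + b2

# ...equal a single linear layer with W'T = W[2]T W[1]T and b' = W[2]T b[1] + b[2]
W_prime_T = W2.T @ W1.T
b_prime = W2.T @ b1 + b2
y_one_layer = W_prime_T @ x + b_prime

print(np.allclose(y_two_layers, y_one_layer))  # True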


NN - Activation functions
Introduction

To model a nonlinear problem:


▪ Pass the output of each neuron through a nonlinear function, called an activation function
▪ Connection to neuron firing in the brain

Some well-known activation functions:


▪ Sigmoid
▪ Tanh
▪ ReLU
NN - Activation functions
Overview

Sigmoid (σ):
▪ Squashes input into the [0, 1] range
▪ Approximately nullifies the gradient for "large" positive or negative inputs -> vanishing gradient problem
▪ Rarely used except for the final layer of a binary classification network

Tanh:
▪ Squashes input into the [-1, 1] range
▪ Like sigmoid, nullifies the gradient for "large" positive or negative inputs
▪ Zero-centered, preferable over sigmoid as an activation
▪ Rarely used in practice (ReLU is more popular)
NN - Activation functions
Overview

Rectified Linear Unit (ReLU):
▪ Easily computed, simple gradient
▪ Greatly accelerates convergence of gradient descent
▪ Saturates in only one direction, so it suffers less from the vanishing gradient problem
▪ Commonly used in practice

Leaky ReLU:
▪ Attempts to fix the "dying ReLU" problem by having a small negative slope for x < 0
▪ Leaky ReLU and other ReLU variants (ELU, SELU, GELU, Swish, etc.) are sometimes used over ReLU
NN - Activation functions
Derivatives
Sigmoid:
σ(x) = 1 / (1 + e^(-x))
d/dx σ(x) = σ(x)(1 − σ(x))

Tanh:
tanh(x) = (e^x − e^(-x)) / (e^x + e^(-x))
d/dx tanh(x) = 1 − tanh^2(x)

Rectified Linear Unit (ReLU):
ReLU(x) = max(0, x), i.e. 0 if x ≤ 0 and x if x > 0
d/dx ReLU(x) = 0 if x < 0, 1 if x > 0

Note: The derivative of ReLU is undefined at x = 0. By convention, it is set to 0.
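These functions and their derivatives are short enough to write down directly; a NumPy sketch following the conventions above:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)                 # σ(x)(1 − σ(x))

def tanh(x):
    return np.tanh(x)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2         # 1 − tanh^2(x)

def relu(x):
    return np.maximum(0.0, x)            # max(0, x)

def d_relu(x):
    return np.where(x > 0, 1.0, 0.0)     # 0 for x <= 0 (including x = 0, by convention)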
Training neural nets
Neural networks
Training
Forward pass of a 2-layer NN (for a single example):
z[1] = W[1]T x + b[1]
a[1] = g[1](z[1])
z[2] = W[2]T a[1] + b[2]
ŷ = a[2] = g[2](z[2])

In one equation:
ŷ = g[2](W[2]T g[1](W[1]T x + b[1]) + b[2])
Neural networks
Training
Forward pass of a 2-layer NN (for a single sample):
ŷ = g[2](W[2]T g[1](W[1]T x + b[1]) + b[2])

To train, we need a loss function: L(ŷ, y)

Using that loss function, we want to update W[1], b[1], W[2], b[2] using gradient descent.
Loss function
Gradient descent

Need to compute ∂L/∂W[i] and ∂L/∂b[i], the gradients of the loss with respect to the weights (and biases).

Once the gradients are computed, update the parameters with:
▪ W[i] := W[i] − α ∂L/∂W[i]
▪ b[i] := b[i] − α ∂L/∂b[i]
where α is the learning rate (see the sketch below).
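A minimal sketch of this update rule, assuming the gradients have already been computed by backpropagation (the parameter names and dictionary layout are illustrative):

import numpy as np

def gradient_descent_step(params, grads, alpha=0.01):
    """One gradient descent update: θ := θ − α ∂L/∂θ for every parameter array."""
    for name, theta in params.items():
        theta -= alpha * grads[name]    # in-place update using the matching gradient
    return params

# Illustrative usage with random parameters and gradients
rng = np.random.default_rng(0)
params = {"W1": rng.standard_normal((3, 4)), "b1": np.zeros((4, 1))}
grads = {"W1": rng.standard_normal((3, 4)), "b1": rng.standard_normal((4, 1))}
params = gradient_descent_step(params, grads, alpha=0.1)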
Neural networks
Forward / Backward pass

Forward pass: Compute the output of a neural network for a given input
Backward pass: Compute the derivatives (gradients) of the loss with respect to the network parameters

During training, you need both the forward pass and the backward pass.

During inference, you only need the forward pass.

Inference: the process of using a trained machine learning model for prediction
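In PyTorch terms, for example, inference can skip gradient tracking entirely (the model below is a stand-in, not the course's network):

import torch

# Stand-in trained model and input, purely for illustration
model = torch.nn.Linear(784, 10)
x = torch.randn(1, 784)

model.eval()               # switch layers like dropout/batch-norm to inference behavior
with torch.no_grad():      # no intermediates saved for gradients: forward pass only
    y_hat = model(x)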
Computing gradients
Backpropagation
Neural networks
Forward pass

Forward pass of this 2-layer NN:
z[1] = W[1]T x + b[1]
a[1] = g[1](z[1])
z[2] = W[2]T a[1] + b[2]
a[2] = g[2](z[2])
ŷ = a[2]

Rewriting it in one equation:
ŷ = g[2](W[2]T g[1](W[1]T x + b[1]) + b[2])
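A minimal NumPy sketch of this full forward pass (the shapes, and the choice of ReLU for g[1] and sigmoid for g[2], are illustrative assumptions):

import numpy as np

def forward(x, W1, b1, W2, b2):
    """Forward pass of the 2-layer network described above."""
    z1 = W1.T @ x + b1              # z[1] = W[1]T x + b[1]
    a1 = np.maximum(0.0, z1)        # a[1] = g[1](z[1]), here g[1] = ReLU (assumption)
    z2 = W2.T @ a1 + b2             # z[2] = W[2]T a[1] + b[2]
    a2 = 1.0 / (1.0 + np.exp(-z2))  # a[2] = g[2](z[2]), here g[2] = sigmoid (assumption)
    return a2                       # ŷ = a[2]

# Illustrative usage: 3 inputs, 4 hidden nodes, 1 output
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))
W1, b1 = rng.standard_normal((3, 4)), np.zeros((4, 1))
W2, b2 = rng.standard_normal((4, 1)), np.zeros((1, 1))
y_hat = forward(x, W1, b1, W2, b2)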
Stochastic gradient descent
Mini-batch stochastic gradient descent
Problems with training
Recap on training a neural network

Loop:
1. Sample a batch of data
2. Forward pass to get the loss
3. Backward pass to calculate gradient
4. Update parameters using the gradient

▪ The forward pass computes the result of an operation and saves any intermediates needed for gradient computation in memory
▪ The backward pass applies the chain rule to compute the gradient of the loss function with respect to the inputs (see the PyTorch sketch below)
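A minimal PyTorch sketch of this training loop (the model, random data, and hyperparameters are illustrative stand-ins, not the course's exact setup):

import torch
from torch import nn

# Illustrative stand-ins: a tiny fully-connected model and random "data"
model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):
    # 1. Sample a batch of data (random tensors here, just for illustration)
    x = torch.randn(32, 784)
    y = torch.randint(0, 10, (32,))

    # 2. Forward pass to get the loss
    loss = loss_fn(model(x), y)

    # 3. Backward pass to calculate gradients
    optimizer.zero_grad()
    loss.backward()

    # 4. Update parameters using the gradients
    optimizer.step()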
Deep learning frameworks
Overview

Deep learning frameworks are used to efficiently define and train neural networks
• Support for many types of layers, activations, loss functions, optimizers, …
• Backpropagation computed automatically (e.g. loss.backward() in PyTorch)
• GPU support for faster training

Most popular frameworks today:


• PyTorch (https://pytorch.org)
• TensorFlow (https://www.tensorflow.org/)
Deep learning frameworks
Implementing a simple neural network in PyTorch
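The slide's code is not reproduced in this extract; as a hedged sketch, a simple fully-connected network for 28 × 28 digit images might be defined in PyTorch like this (the layer sizes and names are illustrative choices):

import torch
from torch import nn

class SimpleNet(nn.Module):
    """A small fully-connected network for 28x28 = 784-pixel inputs (illustrative)."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),            # (batch, 1, 28, 28) -> (batch, 784)
            nn.Linear(784, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.layers(x)        # logits, one per class

model = SimpleNet()
print(model(torch.randn(4, 1, 28, 28)).shape)  # torch.Size([4, 10])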
Extensions of feedforward neural networks

▪ Convolutional neural networks (next lecture)

▪ Recurrent neural networks (relevant for control systems)


Python exercises
▪ You will create a neural network for hand-written digit classification
▪ Training data is based on the MNIST dataset: an online dataset of 70,000 images of hand-written digits (each image is 28 × 28 = 784 pixels; see the loading sketch below)
▪ Training neural networks is time (and energy) consuming
▪ We will use Google Colab as it has access to faster processors:
• GPUs (graphics processing units)
• TPUs (tensor processing units)
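As an illustration of obtaining this dataset in PyTorch (torchvision is assumed to be available, e.g. in Colab; the exercise may load the data differently):

import torch
from torchvision import datasets, transforms

# Download MNIST and convert each 28x28 image to a tensor in [0, 1]
train_set = datasets.MNIST(root="data", train=True, download=True,
                           transform=transforms.ToTensor())
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

images, labels = next(iter(train_loader))
print(images.shape)   # torch.Size([32, 1, 28, 28]) — i.e. 28 x 28 = 784 pixels per image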
Brief review of last lecture
PCA

k-means
Questions from problem set
Data covariance matrix
