
Neural Networks with TensorFlow

Overview

- Neural networks are representation-based machine learning algorithms
- Neural networks are made up of building blocks called neurons
- Each neuron is made up of a linear function and an activation function
- Performance is very sensitive to details such as the proper choice of activation
- Overfitting in neural networks is mitigated using techniques such as dropout
Understanding Machine Learning
Whales: Fish or Mammals?

Mammals: Members of the infraorder Cetacea
Fish: Look like fish, swim like fish, move with fish
Whales: Fish or Mammals?

ML-based Classifier

Training: Feed in a large corpus of correctly classified data
Prediction: Use the classifier to classify new instances it has not seen before
Training the ML-based Classifier

Corpus → Classification → ML-based Classifier

Feedback (a loss function or cost function) improves the model parameters
ML-based Binary Classifier

Corpus → Classification Algorithm (e.g. Naive Bayes, Support Vector Machines, Decision Trees) → ML-based Classifier

- Input: a feature vector, e.g. "Breathes like a mammal", "Gives birth like a mammal"
- Output: a label, e.g. "Mammal"
- During training, the predicted label is compared with the actual label
- Correct prediction: "Breathes like a mammal", "Gives birth like a mammal" → Mammal (predicted label = actual label)
- Incorrect prediction: "Moves like a fish, looks like a fish" → Fish (predicted label ≠ actual label)
Understanding Deep Learning
"Traditional" ML-based Binary Classifier

Corpus → Feature Selection by Experts → Classification Algorithm → ML-based Classifier

"Traditional" ML-based systems still rely on experts to decide what features to pay attention to (e.g. "breathes like a mammal", "gives birth like a mammal" → Mammal).

"Representation" ML-based Binary Classifier

Corpus → Feature Selection Algorithm → Classification Algorithm → ML-based Classifier

"Representation" ML-based systems figure out by themselves what features to pay attention to (e.g. a picture or video of a whale → Mammal).

"Deep Learning" systems are one type of representation system.
Deep Learning and Neural Networks

- Deep Learning: algorithms that learn what features matter
- Neural Networks: the most common class of deep learning algorithms
- Neurons: simple building blocks that actually "learn"
"Deep Learning"-based Binary Classifier

Corpus of Images → Feature Selection & Classification Algorithm → ML-based Classifier

- Successive layers extract increasingly abstract features: Pixels → Edges → Corners → Object Parts
- The input (pixels) and the output are the "visible layers"
- The layers in between are the "hidden layers"
Neural Networks Introduced

Corpus of Images → layers in a neural network (Layer 1, Layer 2, …, Layer N) → ML-based Classifier

- Each layer consists of individual interconnected neurons
- Early layers work on pixels; later layers work on processed groups of pixels

The Computational Graph

- A neural network is a computational graph: operations (nodes) applied to data (edges)
- The nodes in the computation graph are neurons (simple building blocks)
- The edges in the computation graph are data items called tensors
Neuron as a Learning Unit
A Neural Network

Each layer of the network consists of individual interconnected neurons.
Operation of a Single Neuron

[Diagram: inputs X1 … Xn feed into a neuron, which applies a mathematical function to produce an output Y]

- For an active neuron, a change in inputs should trigger a corresponding change in the outputs
- The outputs of neurons feed into the neurons of the next layer
- Each connection is associated with a weight W
- If the second neuron is sensitive to the output of the first neuron, the connection between them gets stronger (W increases)
- A single neuron is generally connected to multiple neurons in the next layer; each connection has its own weight (W1, W2, W3, …)
A Neural Network

Once a neural network is trained, all edges have weights which help it make predictions.
Operation of a Single Neuron

[Diagram: inputs X1 … Xn, each multiplied by a weight W1 … Wn, plus a bias b, feed an affine transformation Wx + b followed by an activation function max(Wx + b, 0)]

- Each neuron applies only two simple functions to its inputs
- The affine transformation alone can only learn linear relationships between the inputs and the output
- The combination of the affine transformation and the activation function can learn any arbitrary relationship
- The values W1, W2, …, Wn are called the weights
- The value b is called the bias
- The affine transformation is just a weighted sum with a bias added: W1x1 + W2x2 + … + Wnxn + b
- Where do the values of W and b come from? The weights and biases of a neuron are determined by the training process
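
To make the two functions concrete, here is a minimal sketch (assuming TensorFlow 2.x; the input, weight and bias values are illustrative, not from the slides) of one neuron computing its affine transformation followed by a ReLU activation:

```python
# Minimal sketch (TensorFlow 2.x assumed): one neuron = affine transformation + ReLU.
import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])        # inputs x1..xn (illustrative values)
W = tf.constant([0.5, -1.0, 0.25])      # weights W1..Wn (would come from training)
b = tf.constant(0.1)                    # bias b (would come from training)

affine = tf.reduce_sum(W * x) + b       # W1*x1 + W2*x2 + ... + Wn*xn + b
output = tf.nn.relu(affine)             # activation: max(Wx + b, 0)

print(float(affine), float(output))     # -0.65 -> 0.0 after ReLU
```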
Activation Functions for Non-Linear Relationships
Regression: The Simplest Neural Network

A single neuron turns a set of points into a regression line.

[Diagram: the same single neuron, but with the identity function as its activation — the output is simply Wx + b]

- activation=None simply passes the output of the linear affine transformation to the output of the neuron
- Such a neuron is also called a linear neuron
- Linear regression with one neuron uses exactly this linear neuron
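
As a sketch of this idea (assuming TensorFlow 2.x / Keras; the toy data and training settings are illustrative), a Dense layer with one unit and activation=None is exactly such a linear neuron:

```python
# Minimal sketch (TensorFlow 2.x / Keras assumed): linear regression with a single
# linear neuron -- Dense(1) with activation=None (the identity function).
import numpy as np
import tensorflow as tf

# Illustrative data: y ~ 3x + 2 plus noise (not from the slides)
x = np.linspace(-1.0, 1.0, 200).reshape(-1, 1).astype("float32")
y = 3.0 * x + 2.0 + 0.1 * np.random.randn(*x.shape).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(1, activation=None)   # one linear neuron
])
model.compile(optimizer="sgd", loss="mse")       # minimize mean squared error
model.fit(x, y, epochs=200, verbose=0)

W, b = model.layers[0].get_weights()
print(W, b)   # should end up close to [[3.0]] and [2.0]
```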
Logistic Regression with One Neuron

[Diagram: the affine transformation W1x + b1 produces an intermediate value x', which a softmax function with parameters W2, b2 turns into the two probabilities P(Y = True) and P(Y = False)]

- P(Y = True) = 1 / (1 + e^-(W2x' + b2))
- P(Y = False) = 1 - P(Y = True)

Logistic Regression

The regression curve fitted through the points (x1, y1), (x2, y2), …, (xn, yn) is the S-shaped curve p(y) = 1 / (1 + e^-(A + Bx)).

SoftMax for True/False Classification

The softmax function maps an input x to P(Y = True) = 1 / (1 + e^-(Wx + B)) and P(Y = False) = 1 - P(Y = True).
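
A small sketch (assuming TensorFlow 2.x; the value standing in for W2x' + b2 is illustrative) showing that a two-class softmax reproduces exactly these sigmoid probabilities:

```python
# Minimal sketch (TensorFlow 2.x assumed): a softmax over two scores reproduces the
# probabilities above, P(Y = True) = 1 / (1 + e^-(W2*x' + b2)).
import tensorflow as tf

z = tf.constant(0.8)                       # illustrative value of W2*x' + b2
probs = tf.nn.softmax(tf.stack([z, 0.0]))  # two-class softmax over [z, 0]
sigmoid = tf.math.sigmoid(z)               # 1 / (1 + e^-z)

print(probs.numpy())     # [P(Y=True), P(Y=False)], sums to 1
print(float(sigmoid))    # equals P(Y=True) above
```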
Linear Regression with One Neuron

- 1-dimensional feature vector
- Shape(W) = [1, 1], Shape(b) = [1]
- Output: a regression line

Logistic Regression with One Neuron

- 1-dimensional feature vector
- Shape(W) = [1, 2], Shape(b) = [2]
- Output: an S-curve
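
These shapes can be checked directly; a minimal sketch assuming TensorFlow 2.x / Keras:

```python
# Minimal sketch (TensorFlow 2.x / Keras assumed): weight and bias shapes for a
# 1-dimensional feature vector -- [1, 1]/[1] for linear regression and
# [1, 2]/[2] for two-class logistic regression.
import tensorflow as tf

linear = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(1)                          # linear neuron
])
logistic = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(2, activation="softmax")    # two-class softmax output
])

print(linear.layers[0].kernel.shape, linear.layers[0].bias.shape)      # (1, 1) (1,)
print(logistic.layers[0].kernel.shape, logistic.layers[0].bias.shape)  # (1, 2) (2,)
```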
SoftMax N-category Classification

The softmax function outputs one probability per category: P(Y = Y1), P(Y = Y2), …, P(Y = YN).
Multiclass Digit Classification

One-versus-all: train 10 binary classifiers

- A 0-detector, a 1-detector, …
- Predicted label = output of the detector with the highest score

One-versus-one: train 45 binary classifiers

- One detector for each pair of digits
- For N labels, N(N-1)/2 classifiers are needed
- Predicted label = the digit that wins the most duels

The logistic or softmax function is just one of many functions that can be used for activation.
Activation Functions

ReLU, logit, tanh, step

Various choices of activation function exist and drive the design of your neural network.
ReLU Activation

- ReLU: Rectified Linear Unit
- ReLU(x) = max(0, x)
- The most common activation function is the ReLU; a neuron using it outputs max(Wx + b, 0)
Tanh Activation

- S-shaped, continuous and differentiable
- Output ranges from -1 to 1
- Keeps each layer's output normalized (centered around 0)
SoftMax Activation

- Another very common activation function is the SoftMax
- SoftMax(x) outputs a number between 0 and 1
- This output can be interpreted as a probability
- The resulting S-shaped curve is the logistic (sigmoid) curve
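
For comparison, a minimal sketch (assuming TensorFlow 2.x; the input values are illustrative) of how ReLU, tanh and the sigmoid transform the same inputs:

```python
# Minimal sketch (TensorFlow 2.x assumed): ReLU, tanh and sigmoid on the same inputs.
import tensorflow as tf

x = tf.constant([-2.0, -0.5, 0.0, 0.5, 2.0])

print(tf.nn.relu(x).numpy())       # 0 for negative inputs, identity for positive
print(tf.nn.tanh(x).numpy())       # S-shaped, output in (-1, 1), centred on 0
print(tf.math.sigmoid(x).numpy())  # S-shaped, output in (0, 1), usable as a probability
```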


Importance of Activation

- The choice of activation function is crucial in determining performance
- To see why, we must understand the training process of a neural network
Neuron as a Learning Unit

[Diagram: the single neuron again — inputs, weights, bias, affine transformation Wx + b, activation max(Wx + b, 0)]

Many of these simple neurons, arranged in layers, can do magical stuff.
Training a Neural Network: Optimization and Back Propagation

Example: Training for Linear Regression

A set of points → a single neuron with no activation function → a regression line
The activation function used to learn linear regression is simply the identity function.
Training as an Optimization Problem

- Objective function: minimize the variance of the residuals (MSE)
- Constraints: express the relationship as a straight line, y = Wx + b
- Decision variables: the values of W and b
The "Best" Regression Line

The "best fit" line is called the regression line. The actual training of a neural network happens via gradient descent optimization.
Minimizing MSE

[Plot: MSE as a function of W and b — training seeks the smallest possible value of MSE]
"Gradient Descent"

[Plot: MSE vs. W — starting from an initial value of MSE and converging on the smallest value]

- "Training process" = finding the best values of W and b, i.e. those that give the smallest MSE
- Start somewhere: initial values of W and b give an initial value of MSE
- Converge on the "best" values using an optimization algorithm (gradient descent)
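
A minimal sketch of this optimization (assuming TensorFlow 2.x; the data and learning rate are illustrative) using gradient descent to find the W and b that minimize the MSE:

```python
# Minimal sketch (TensorFlow 2.x assumed): gradient descent on W and b to minimise
# the MSE of y = Wx + b. The data is illustrative, generated from y = 3x + 2.
import tensorflow as tf

x = tf.constant([0.0, 1.0, 2.0, 3.0])
y = tf.constant([2.0, 5.0, 8.0, 11.0])

W = tf.Variable(0.0)                   # initial values -- have to start somewhere
b = tf.Variable(0.0)
learning_rate = 0.05

for step in range(500):
    with tf.GradientTape() as tape:
        y_pred = W * x + b
        mse = tf.reduce_mean(tf.square(y_pred - y))   # objective function
    dW, db = tape.gradient(mse, [W, b])
    W.assign_sub(learning_rate * dW)   # move against the gradient
    b.assign_sub(learning_rate * db)

print(float(W), float(b), float(mse))  # W -> ~3.0, b -> ~2.0, MSE -> ~0
```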


Training via Back Propagation

[Diagram: the error at the classifier's output feeds an optimiser, which works backwards through the layers (object parts → corners → edges → pixels), adjusting the weights W and biases b of the neurons]

Back propagation allows the weights and biases of the neurons to converge to their final values.
Hyperparameters

Decisions in Traditional ML Models

- Initial values
- Type of optimizer
- Number of epochs
- Batch size

More Decisions in Neural Networks

- Number of layers
- Network topology, i.e. neuron connections
- Number of neurons in each layer
- Activation function

How well the model performs is sensitive to these decisions. These are the hyperparameters of our model.
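
As a sketch (assuming TensorFlow 2.x / Keras; all values are illustrative choices, not recommendations), these hyperparameters appear as explicit decisions when the model is built and trained:

```python
# Minimal sketch (TensorFlow 2.x / Keras assumed): hyperparameters as explicit choices.
import tensorflow as tf

neurons_per_layer = [64, 32]   # number of layers and neurons per layer
activation = "relu"            # activation function
optimizer = "adam"             # type of optimizer
epochs = 20                    # number of epochs
batch_size = 32                # batch size

model = tf.keras.Sequential([tf.keras.Input(shape=(10,))])   # 10 features, illustrative
for units in neurons_per_layer:
    model.add(tf.keras.layers.Dense(units, activation=activation))
model.add(tf.keras.layers.Dense(1))

model.compile(optimizer=optimizer, loss="mse")
# model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size)  # with real data
```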


Model Parameters vs. Hyperparameters

Model parameters

- The weights and biases determined during the training process
- The result of the training process
- Used to make predictions
- Measured using validation datasets to find the best possible model

Hyperparameters

- The design of the actual model, determined before the training process begins
- An input to the training process
- Used to generate the best possible model
- Tuned (hyperparameter tuning) to generate the model, which is then evaluated using validation datasets
Vanishing, Exploding Gradients, Dying Neurons
Back Propagation

Back propagation is an iterative process. It fails if either:

- the gradients don't change at all, or
- the gradients change too fast
"Vanishing Gradient Problem"

[Plot: MSE vs. W — the gradient becomes zero and stops changing before reaching the smallest value of MSE]

- The weights of the earlier layers remain unchanged
- The algorithm never converges to a good solution
"Exploding Gradient Problem"

[Plot: MSE vs. W — the gradient changes abruptly and "explodes" instead of descending to the smallest value of MSE]

- The weights of the layers become larger and meaningless
- The algorithm diverges and never reaches a good solution
Vanishing and Exploding Gradients

Back propagation fails if:

- gradients are vanishing
- gradients are exploding

This was an important reason why DNNs were mostly abandoned for a long time.
Coping with Vanishing/Exploding Gradients

- Proper initialisation
- Batch normalisation
- Gradient clipping
- Non-saturating activation function
Xavier and He Initialization

Proper initialisation:

- Connection weights must be initialized randomly
- The variance of the outputs in each direction is equal to the variance of the inputs
- Normal distribution: mean 0, standard deviation based on num_inputs and num_outputs for that layer
- Uniform distribution: between -r and +r, with r based on num_inputs and num_outputs for that layer
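
In Keras (assumed here, TensorFlow 2.x) these schemes can be selected per layer via kernel_initializer; a minimal sketch:

```python
# Minimal sketch (TensorFlow 2.x / Keras assumed): Xavier (Glorot) and He weight
# initialisation, selected per layer via kernel_initializer.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                                   # illustrative input size
    tf.keras.layers.Dense(64, activation="tanh",
                          kernel_initializer="glorot_uniform"),    # Xavier initialisation
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_initializer="he_normal"),         # He initialisation
    tf.keras.layers.Dense(1),
])
```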
Gradient Clipping

- Limit the gradients to under a threshold during back propagation
- Most often used with recurrent neural networks
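
A minimal sketch (assuming TensorFlow 2.x / Keras) of gradient clipping via the optimizer's clipvalue or clipnorm arguments:

```python
# Minimal sketch (TensorFlow 2.x / Keras assumed): gradient clipping via the optimizer.
import tensorflow as tf

# Clip each gradient component to [-1.0, 1.0] during back propagation
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)

# Alternatively, rescale whole gradient vectors whose norm exceeds a threshold
# optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
```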


Batch Normalisation

- Zero-center the inputs before passing them to the activation functions
- Subtract the mean and divide by the standard deviation
- Allows the use of saturating activation functions as well
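
A minimal sketch (assuming TensorFlow 2.x / Keras; the layer sizes are illustrative) of batch normalisation layers inserted between dense layers:

```python
# Minimal sketch (TensorFlow 2.x / Keras assumed): batch normalisation between layers.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                   # illustrative input size
    tf.keras.layers.Dense(64, activation="tanh"),  # a saturating activation is usable here
    tf.keras.layers.BatchNormalization(),          # subtract batch mean, divide by std dev
    tf.keras.layers.Dense(64, activation="tanh"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1),
])
```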


Unresponsive Neurons

- What if a neuron's output does not change in response to changing inputs?
- Such neurons may be dead
Activation Functions

ReLU, logit, tanh, step

Various choices of activation function exist and drive the design of your neural network.
Saturating Activation Functions

- Consider an S-shaped (sigmoid) activation function
- The middle of the S-curve is the active or responsive region of the function
- The activation function saturates at either end
- If a neuron operates within these saturation regions throughout training, it might become unresponsive
Dying Neurons

- A neuron might become unresponsive: its output won't change as its input changes
- If this continues throughout training, the neuron is "dead"
- Saturation of the neuron occurs at both ends of the S-curve, for instance
Saturating Activation Functions

- Logit (sigmoid) activation: saturates for very large and very small values of the input
- ReLU activation: saturates for negative values of the input
ELU Activation

- Mitigates the dying-neuron problem of ReLU
- Linear for positive values
- Exponential for negative values
- ELU is the new favorite activation function

Dying and unresponsive neuron issues can be mitigated by using activation functions such as ELU.
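
A minimal sketch (assuming TensorFlow 2.x; the input values are illustrative) contrasting ReLU and ELU on negative inputs:

```python
# Minimal sketch (TensorFlow 2.x assumed): ELU keeps a small, non-zero response for
# negative inputs where ReLU is flat, which mitigates dying neurons.
import tensorflow as tf

x = tf.constant([-3.0, -1.0, 0.0, 1.0, 3.0])

print(tf.nn.relu(x).numpy())   # [ 0.    0.    0.    1.    3.  ] -- flat for x < 0
print(tf.nn.elu(x).numpy())    # [-0.95 -0.63  0.    1.    3.  ] -- exponential for x < 0

# In a layer: tf.keras.layers.Dense(64, activation="elu")
```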
Overfitting and the Bias-Variance Trade-off
Connecting the Dots

[Plot: a scatter of points in the X-Y plane]

- Challenge: fit the "best" curve through these points
- A curve has a "good fit" if the distances of the points from the curve are small
- We could draw a pretty complex curve, and even make it pass through every single point
- But given a new set of points, this curve might perform quite poorly
- The original points were "training data"; the new points are "test data"
- A simple straight line performs worse in training, but better with test data

Overfitting

- Low training error: the model does very well in training…
- High test error: …but poorly with real data
- Great performance in training, poor performance in real usage
Cause of Overfitting

A sub-optimal choice in the bias-variance trade-off. An overfitted model has:

- high variance error
- low bias error
Bias

- Low bias: few assumptions about the underlying data; model too complex; the training data is all-important, the model parameters count for little
- High bias: more assumptions about the underlying data; model too simple; the model parameters are all-important, the training data counts for little

Variance

- High variance: the model changes significantly when the training data changes; model too complex; the model varies too much with changing training data
- Low variance: the model doesn't change much when the training data changes; model too simple; the model is not very sensitive to the training data
Bias-Variance Trade-off

- Model too complex → high variance error
- Model too simple → high bias error
- High-bias algorithms have simple parameters: regression
- High-variance algorithms have complex parameters: decision trees, dense neural networks
Mitigating Overfitting

Preventing Overfitting

- Regularisation: penalise complex models
- Cross-validation: distinct training and validation phases
- Ensemble learning
- Dropout: intentionally turn off some neurons during training
Regularisation

Penalise complex models:

- Simple to do in gradient descent
- Add a penalty to the objective function
- The penalty is a function of the neuron weights
- This forces the optimiser to keep the model simple
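
A minimal sketch (assuming TensorFlow 2.x / Keras; the penalty strength 0.01 is illustrative) of an L2 weight penalty added to the objective function via kernel_regularizer:

```python
# Minimal sketch (TensorFlow 2.x / Keras assumed): L2 regularisation adds a penalty
# on the neuron weights to the objective function, pushing towards simpler models.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                   # illustrative input size
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")   # loss = MSE + 0.01 * sum of squared weights
```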


Cross-Validation

- Distinct training and validation phases
- Train different models (with training data only)
- Select the model that does best on validation data
- This is "hyperparameter tuning"
Dropout

- Specify a fraction of neurons that will stay off in each training step
- "Dropout" neurons are chosen at random
- Different neurons are off in each training step
- In effect, each training step builds a different network configuration
Densely Connected Neural Network

- High risk of overfitting during training due to the dense, complex network

Dropout = 50%

- Randomly switch off, say, 50% of neurons in each training step
- Training is forced to rely on a much simpler neural network
- Each training step builds a different configuration

Dropout During Training Only

- During actual usage in test mode, the full dense neural network is used
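
A minimal sketch (assuming TensorFlow 2.x / Keras; the layer sizes and the 50% rate are illustrative) of dropout layers that are active during training and switched off at prediction time:

```python
# Minimal sketch (TensorFlow 2.x / Keras assumed): 50% dropout between dense layers.
# Dropout is applied only during training; at prediction time the full network is used.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),                  # e.g. flattened 28x28 images (illustrative)
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),                  # randomly switch off 50% of activations
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])
# model.fit(...) applies dropout; model.predict(...) uses the full dense network.
```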
