
Fundamentals of Deep Learning

Uploaded by

Debak Roy

FUNDAMENTALS OF DEEP LEARNING

Meticulously curated information on the base concepts of deep learning, presented in diagrams and visualizations.

[Figure: nested circles showing Deep Learning inside Machine Learning, inside Artificial Intelligence, inside Computer Science]

Mısra Turp

What is deep learning?


Deep learning is a discipline of Artificial Intelligence that is based on
Neural Networks. It has the power to learn complex patterns directly
from the data.

Deep Learning is a branch of Machine Learning.

Machine Learning is one of the approaches to Artificial Intelligence.

Artificial Intelligence is a research branch of Computer Science.

Comparison of Traditional Machine Learning and Deep Learning

                        Traditional Machine Learning   Deep Learning
Feature extraction      Needed                         Not needed
Computational power     A laptop would work fine       High computational power needed
Amount of data needed   Small datasets are fine        Big datasets needed


Most Common Deep Learning Techniques

Deep Neural Networks
Not very complex problems, structured data

Convolutional NNs
Works well with image data, used in computer vision

Recurrent NNs
Works well with sequential data such as language, audio

Generative Adversarial Networks
Can generate real-looking image, video or audio

Reinforcement Learning
Playing games, robotics

Autoencoders
Learns representation of data, feature/anomaly detection

Transformers
Used a lot in NLP, machine translation

Deep Boltzmann Machine
Object/speech recognition

Deep Belief Networks
Recognize and generate images, video sequences


Types of layers

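[Figure: a neural network with an input layer, a hidden layer, and an output layer]

A single neuron

[Figure: a single neuron with a 1st input feature X1 (weight w1), a 2nd input feature X2 (weight w2), a bias b1, and z1 as the output of the neuron]

Calculating the output of a single neuron

[Figure: the neuron's output calculated without the activation function and with the activation function]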
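As a minimal sketch of the single-neuron calculation, assuming the usual weighted-sum form: the symbols x1, x2, w1, w2, b1 and z1 follow the figure labels, while the name a1, the choice of sigmoid as the activation, and the input values are illustrative assumptions.

```python
import math


def neuron_output(x1, x2, w1, w2, b1):
    """Output without the activation function: weighted sum plus bias."""
    return w1 * x1 + w2 * x2 + b1


def sigmoid(z):
    """One possible activation function (an assumption for this example)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical input features, weights and bias.
z1 = neuron_output(x1=1.0, x2=2.0, w1=0.5, w2=-0.25, b1=0.1)

# Output with the activation function applied on top of z1.
a1 = sigmoid(z1)
```

Here z1 is the raw output and a1 is the same output after the activation function has transformed it.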


Vectorization for quick calculations

[Figure: the vectorized equation, with labels for the weights of the network, the biases of the network, and the inputs]
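A minimal NumPy sketch of what vectorization buys: the outputs of every neuron in a layer, for every example, come from one matrix product instead of nested loops. The array layout (one row of W per neuron, one column of X per example) and all names are assumptions made for this example.

```python
import numpy as np


def layer_output(W, X, b):
    """Compute W @ X + b for a whole layer and a whole batch at once.

    W: (n_neurons, n_features) weights
    X: (n_features, n_examples) inputs
    b: (n_neurons, 1) biases, broadcast across the examples
    """
    return W @ X + b


# Hypothetical layer with 2 neurons and 2 input features, 3 examples.
W = np.array([[0.5, -0.25],
              [1.0, 2.0]])
b = np.array([[0.1],
              [0.0]])
X = np.array([[1.0, 0.0, 2.0],
              [2.0, 1.0, 0.0]])

Z = layer_output(W, X, b)

# The same result computed one neuron and one example at a time.
Z_loop = np.zeros((2, 3))
for neuron in range(2):
    for example in range(3):
        Z_loop[neuron, example] = W[neuron] @ X[:, example] + b[neuron, 0]
```

Both versions give the same numbers; the vectorized one leaves the looping to optimized library code.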


Activation functions

What is an activation function?

An activation function is a transformation on the output of a neuron.

Each layer (except the input layer) has its own activation function.

There are many different types of activation functions.

Some examples are:
Linear
Sigmoid
Tanh
ReLU
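For reference, the four example activation functions can be sketched in NumPy as follows. These are the standard textbook definitions; the function names are just illustrative.

```python
import numpy as np


def linear(z):
    """Identity: leaves the neuron output unchanged."""
    return z


def sigmoid(z):
    """Squashes values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def tanh(z):
    """Squashes values into the range (-1, 1)."""
    return np.tanh(z)


def relu(z):
    """Keeps positive values and replaces negative values with 0."""
    return np.maximum(0.0, z)


z = np.array([-2.0, 0.0, 2.0])
activations = {f.__name__: f(z) for f in (linear, sigmoid, tanh, relu)}
```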

Why do we need a non-linear activation function?


A non-linear activation function is what makes the network learn complex patterns that need something more than linear functions to be represented.

If linear functions are used in the hidden layers, the network can only do linear transformations of its inputs. No matter how big the network gets, it will be only as good as a single linear layer because of this linearity. This will cause the network to not be able to fit complex problems.
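The collapse of stacked linear layers can be checked numerically. The sketch below, with randomly generated and purely hypothetical weights, shows that two layers with linear activations compute exactly the same function as one layer with weights W2 @ W1 and bias W2 @ b1 + b2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical layers with linear (identity) activations.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
X = rng.normal(size=(3, 5))  # 3 features, 5 examples

# Forward pass through both layers.
two_layers = W2 @ (W1 @ X + b1) + b2

# A single equivalent layer built from the same parameters.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ X + b

same = np.allclose(two_layers, one_layer)
```

The variable `same` comes out true for any choice of weights, which is why depth adds nothing without a non-linearity in between.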

Activation function rules of thumb


For hidden layers:
Hidden layer activation should never be linear.

For output layer:
Sigmoid for binary classification and multi-label classification.
Softmax for multi-class classification.
ReLU if you need non-negative outputs.
Linear activation for regression.


Backpropagation
Backpropagation pseudo-algorithm
Initialize the network with randomly generated weights
Do forward propagation step and calculate the output
Calculate the error/loss/cost
Find out how much each parameter contributes to the error
Calculate this ratio using all the data points in this batch

At the end, what you have is a list of ratios of how each parameter affects the error. It is called the gradient vector.

You can think of it as a many-dimensional graph that shows the relationship between each parameter and the cost. If we had only two parameters, the graph would look like this.

[Figure: the cost plotted as a surface over two parameters, with the gradient vector marked]

Gradient Descent
Now that we have this many-dimensional graph showing us how the relationship between all the parameters and the cost works, we can decide in what way to update these parameters to lower the cost.

Because the graph would be many-dimensional, in real life we cannot actually look at it and decide where to go. That's why we use the gradient vector.

Gradient descent is the act of changing the parameters based on the gradient vector. Here is how it works.

new values for parameters = old values of parameters - learning rate x gradient vector
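A minimal sketch of this update rule in NumPy. The cost function used here, the sum of squared parameters, is a hypothetical stand-in chosen because its gradient is easy to write down; in a real network the gradient comes from backpropagation.

```python
import numpy as np


def gradient_descent_step(params, gradient, learning_rate):
    """new values = old values - learning rate x gradient vector."""
    return params - learning_rate * gradient


def cost(params):
    """Hypothetical cost: sum of squared parameters (minimum at zero)."""
    return float(np.sum(params ** 2))


def cost_gradient(params):
    """Gradient of the hypothetical cost above."""
    return 2.0 * params


params = np.array([3.0, -4.0])
first_step = gradient_descent_step(params, cost_gradient(params), 0.1)

# Repeating the step moves the parameters towards the minimum.
for _ in range(50):
    params = gradient_descent_step(params, cost_gradient(params), 0.1)
final_cost = cost(params)
```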

[Figure: overview of training a network, annotated with its hyperparameters: # of neurons in the input layer, # of hidden layers, # of neurons in the hidden layers, # of neurons in the output layer, weight initialization technique, activation function, loss function, optimization algorithm, learning rate, batch size, # of iterations/epochs, and regularization. The figure also shows the dataset, the weights (w) and biases (b), the calculation of layer outputs, and the loss/error as the difference between the actual value and the predictions.]

Pre-determined hyperparameters

Number of neurons in the input layer

This will depend on the data you are training with.

If you have n features in your dataset, you will have n input neurons. If your input is a photo with n x n pixels, you will have n x n input neurons per color channel.

Number of neurons in the output layer

This will again depend on the data you are training with.

If you are doing binary classification or regression, 1 output neuron will do. If you are doing multi-class or multi-label classification, you need as many output neurons as classes/labels.

Hyperparameters that need tuning

Number of hidden layers
The more hidden layers you have, the deeper your network is going to be. Generally, having more layers will help your network more than increasing the number of neurons in the hidden layers.

Number of neurons in the hidden layers
Each hidden layer can have a different number of neurons. This value needs to be adjusted based on how the network is performing.

Activation function
Each layer has its own activation function except the input layer. The
activation function of the hidden layers cannot be linear.

Optimization algorithm
The optimization algorithm determines how the network updates its weights.

Loss function
The loss (cost) function is what we're trying to minimize during training. The cost function you use will depend on the type of problem you have.

Batch size
Batch size is the number of data points you put in each batch while training your network.

If you use a batch size of 1, you will be doing stochastic gradient descent (GD). If the batch size equals the number of data points in your whole dataset, that is batch GD. If the batch size is anywhere from 2 up to, but not including, the number of data points in your whole dataset, you are doing mini-batch gradient descent.
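The three cases differ only in the batch size, which a small batching helper makes visible. This is a sketch with a hypothetical 10-example dataset; the helper name `iterate_batches` and the shuffling step are assumptions for the example.

```python
import numpy as np


def iterate_batches(X, y, batch_size, rng):
    """Yield shuffled (inputs, targets) batches covering the dataset once."""
    n_examples = X.shape[0]
    order = rng.permutation(n_examples)
    for start in range(0, n_examples, batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]


rng = np.random.default_rng(0)
X = np.arange(20.0).reshape(10, 2)  # 10 examples, 2 features
y = np.arange(10)

# Batch size 1: stochastic GD, one parameter update per example.
sizes_sgd = [len(yb) for _, yb in iterate_batches(X, y, 1, rng)]

# Batch size 4: mini-batch GD (the last batch may be smaller).
sizes_mini = [len(yb) for _, yb in iterate_batches(X, y, 4, rng)]

# Batch size 10 (the whole dataset): batch GD, one update per epoch.
sizes_batch = [len(yb) for _, yb in iterate_batches(X, y, 10, rng)]
```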
Number of epochs
This is how many times you run the whole dataset through your network. The network learns the dataset a little bit better after each epoch.

Learning rate
Learning rate determines how big of a step to take in the direction set by gradient descent. Too small a learning rate might slow down learning, and too big a learning rate might cause the network to keep missing the optimum.

Learning rate scheduling techniques are used to dynamically decide on the learning rate value.

Weight initialization technique
This is how the weights of the network are initialized. Bias values can be (and often are) initialized to zero. But weights cannot all be initialized to zero because if they are, they will all be updated the same way and the model won't be able to learn.

At the same time, just using a normal distribution for the random initialization of the weights can cause problems with the network. That's why we sometimes need a more advanced approach.

Regularization
Regularization is a way to deal with overfitting. There are many different ways to do regularization, and often these techniques come with their own hyperparameters.

Some examples are:
L1
L2
Dropout

L1 and L2 regularization have a hyperparameter called alpha, and dropout has a hyperparameter called the dropout rate.


What is overfitting?
Overfitting is when the model is
not able to generalize well. The
model performs well on the
training set but fails to capture
the same performance for the
validation set. We see this in
the comparison of training and
validation loss. As we train the
network more, the training loss
keeps getting lower whereas
after a point validation loss
starts increasing.

What is underfitting?
Underfitting is when the model is not able to fit the data at all. We see this when the performance of the network is very low already on the training set.

[Figure: three fits of the same dataset: the ideal way to fit it, what it would look like when overfitted, and when underfitted]


Bias and Variance


Bias: The number of assumptions a model makes about the data. The more assumptions it makes, the simpler the model will be.

Variance: The dependence of the model on the particular training set that was used to train it.

If a model has overfit, it means its bias is low and its variance is high.

If a model has underfit, it means its bias is high and its variance is low.

If the midpoint of these circles signifies where all the correct values lie, the predictions of high bias and high variance models will look like the yellow marks.

[Figure: target circles with yellow marks showing the predictions of high-bias and high-variance models]

Solutions to high bias/variance


Regularization

Constraining a model to simplify it
L1 / L2 regularization
Dropout
Early stopping

Adding more information
Data augmentation

L1 / L2 regularization
Works by lowering the weights of the network. Achieves this by adding the weight values
to the cost function.

L1 regularization: Add the sum of the absolute values of the weights to the loss.

L2 regularization: Add the sum of the squared values of the weights to the loss.

The alpha parameter: how much attention to pay to this addition to the cost function.
Value between 0 (no penalty) and 1 (full penalty).
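A minimal sketch of the two penalties as described above: the sum of absolute weights (L1) or of squared weights (L2), scaled by alpha and added to the loss. The weight values and the data loss are hypothetical numbers picked for the example.

```python
import numpy as np


def l1_penalty(weights, alpha):
    """alpha times the sum of the absolute values of the weights."""
    return alpha * np.sum(np.abs(weights))


def l2_penalty(weights, alpha):
    """alpha times the sum of the squared values of the weights."""
    return alpha * np.sum(weights ** 2)


weights = np.array([0.5, -2.0, 1.5])  # hypothetical network weights
data_loss = 0.8                       # hypothetical unregularized loss
alpha = 0.01

loss_with_l1 = data_loss + l1_penalty(weights, alpha)
loss_with_l2 = data_loss + l2_penalty(weights, alpha)
```

Because the squared term grows faster for large weights, L2 punishes a single big weight (here -2.0) more heavily than L1 does.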

Tips
You can use them with all network types
Normalize input for best results
Use L1+L2 together

Note!
L1 and L2 regularization lower the weights of the network to combat overfitting, but what does overfitting have to do with high weights?

When you have high weights, that means you are exaggerating the importance of a certain input. When you overfit, this is what the model looks like, right?

[Figure: an overfitted curve, plotted as output against input]

It looks like the importance of this input is so exaggerated that the model follows its pattern to the full.


Dropout regularization
Works by making some neurons inactive in every training step. Every neuron has a probability p of being inactive. This is called the dropout rate.

At training time, on average a fraction p of the neurons is inactive. At test time all neurons are active. That's why, at test time, the input to each neuron is multiplied by the keep probability, which is (1-p).
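A minimal sketch of this scheme, assuming a layer whose inputs are all 1.0 so the effect is easy to read off. The function names are illustrative; many libraries instead rescale during training rather than at test time, which is a different but equivalent convention.

```python
import numpy as np


def dropout_train(inputs, p, rng):
    """Training time: each input is made inactive (zeroed) with probability p."""
    keep_mask = rng.random(inputs.shape) >= p
    return inputs * keep_mask


def dropout_test(inputs, p):
    """Test time: everything stays active, scaled by the keep probability."""
    return inputs * (1.0 - p)


rng = np.random.default_rng(0)
inputs = np.ones(10_000)  # hypothetical layer inputs
p = 0.2                   # dropout rate

train_out = dropout_train(inputs, p, rng)
test_out = dropout_test(inputs, p)

inactive_fraction = np.mean(train_out == 0.0)
```

The scaling keeps the average signal at test time in line with what the next layer saw during training.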

Early-stopping
Works by stopping training at the point the validation loss starts getting higher. This is
the same place we mentioned before where overfitting starts happening.

But early-stopping is not the best approach to use for overfitting, because training and mitigating overfitting should be separate processes where separate approaches and techniques are used. Early-stopping combines these two.
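The stopping rule itself can be sketched in a few lines. The `patience` argument (how many non-improving epochs to tolerate before stopping) is an assumption added for this example, and the validation losses are made-up numbers.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return the epoch with the best validation loss seen before stopping.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs.
    """
    best_epoch = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # validation loss has been getting higher: stop here
    return best_epoch


# Hypothetical validation loss per epoch.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.49]
stop_at = early_stopping_epoch(val_losses, patience=2)
```

In this made-up run, training stops before the last epoch is ever reached, so the lower loss there is never seen; that is one concrete cost of tying the end of training to the validation curve.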

Data augmentation
Transforming your data in multiple ways before feeding it to your network. This way, you
can generate more data points from the same one. These transformations help the
model tolerate any of the changes made (e.g. flipping, different orientation, color
changes, RGB or black and white).

Images from http://datahacker.rs/tf-data-augmentation/ and Buslaev, Alexander & Parinov, Alex & Khvedchenya, Eugene &
Iglovikov, Vladimir & Kalinin, Alexandr. (2018). Albumentations: fast and flexible image augmentations.


Unstable gradients
Gradients are what we use to update the parameters of the network.

Sometimes they get extremely small or extremely big.

This is because while calculating the gradient of a parameter (that is, how much this parameter affects the cost), we need to multiply a lot of values from other neurons.

In the example figure, to find out the gradient of the parameters of the red neuron, we need to work our way all the way to the output neuron.

[Figure: a network with one neuron highlighted in red and the chain of calculations from it to the output neuron]

In this calculation a lot of small numbers are multiplied with each other, causing a very small number for the gradient to be calculated.

The reverse could happen if the parameter values happened to be big. That way we would get an extremely big gradient.

Solutions
Changing weight initializers
The way the weights are initialized might contribute to the unstable gradients problem. There is a need for a strategy that is better than just initializing them from a normal distribution with a mean of zero. Here are some of the initializers:

Glorot (Xavier) Initialization
He Initialization
LeCun Initialization

[Figure: the formula for each initializer, expressed in terms of the number of inputs and the number of outputs of a neuron]
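As a sketch, the commonly cited normal-distribution variants of these initializers use a mean of zero and a variance of 2 / (fan_in + fan_out) for Glorot, 2 / fan_in for He, and 1 / fan_in for LeCun. That these are the exact variants intended here is an assumption, as are the names fan_in (number of inputs) and fan_out (number of outputs).

```python
import numpy as np


def init_std(fan_in, fan_out, scheme):
    """Standard deviation of the zero-mean normal used by each scheme."""
    if scheme == "glorot":
        return np.sqrt(2.0 / (fan_in + fan_out))
    if scheme == "he":
        return np.sqrt(2.0 / fan_in)
    if scheme == "lecun":
        return np.sqrt(1.0 / fan_in)
    raise ValueError(f"unknown scheme: {scheme}")


def init_weights(fan_in, fan_out, scheme, rng):
    """Draw a (fan_out, fan_in) weight matrix for one layer."""
    std = init_std(fan_in, fan_out, scheme)
    return rng.normal(loc=0.0, scale=std, size=(fan_out, fan_in))


rng = np.random.default_rng(0)
# Hypothetical layer: 300 inputs, 100 outputs.
weights = init_weights(fan_in=300, fan_out=100, scheme="he", rng=rng)
```

All three scale the spread of the weights down as a layer gets more connections, which is what keeps signals from shrinking or blowing up as they pass through many layers.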


Using a non-saturating activation function


Some activation functions cause saturation on the extremes, as seen in the sigmoid function. In combination with using the wrong initialization technique, this causes the unstable gradients problem.

[Figure: the sigmoid function, flattening out at both extremes]

The ReLU (Rectified Linear Unit) activation function does not cause saturation for positive values. But it still causes saturation for negative values. So there are a bunch of variations on it.

Rectified Linear Unit (ReLU)
Leaky ReLU (with parameter alpha)
Randomized Leaky ReLU (RReLU)
Parametric Leaky ReLU (PReLU)
Exponential Linear Unit (ELU) (with parameter alpha)
Scaled ELU (SELU)
Gaussian Error Linear Units (GELU)

Which activation function to use when?

Try the activation functions in this order:
SELU
ELU
Leaky ReLU (and/or variants)
ReLU
Tanh
Logistic (sigmoid)

If speed is a priority, ReLU is a good choice because most libraries have adopted fast ReLU implementations.

You should choose the correct initializer for your activation function

Initializer   Activation function
Glorot        Linear, tanh, softmax
He            ReLU and variants of ReLU
LeCun         SELU

To use SELU, make sure:
your input is standardized (mean=0, std=1),
you are using the LeCun initializer,
you have a sequential network.


Batch normalization

Centers the inputs to all layers at zero and sets their standard deviation to 1. After that, scales and offsets them by 2 trainable values.

The scale and offset values that are best for this network are learned during training.

[Figure: the normalized values multiplied by the scale and shifted by the offset]
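A minimal training-time sketch of these two steps in NumPy. The small constant `eps` (added to avoid dividing by zero) and the input values are assumptions for the example; real implementations also track running statistics for use at inference time, which is left out here.

```python
import numpy as np


def batch_norm(inputs, scale, offset, eps=1e-5):
    """Normalize each feature over the batch, then scale and offset it.

    inputs: (batch_size, n_features)
    scale, offset: (n_features,) trainable values
    """
    mean = inputs.mean(axis=0)
    variance = inputs.var(axis=0)
    normalized = (inputs - mean) / np.sqrt(variance + eps)
    return scale * normalized + offset


# Hypothetical batch whose two features live on very different scales.
inputs = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 600.0]])

normalized_only = batch_norm(inputs, scale=np.ones(2), offset=np.zeros(2))
scaled_and_offset = batch_norm(inputs, scale=np.full(2, 2.0), offset=np.full(2, 3.0))
```

With a scale of 1 and an offset of 0 the output is just the normalized values; other settings let the network move each feature to whatever mean and spread works best.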

Advantages of batch normalization

1. Even after setting a good activation function and initializer, unstable gradients might occur. Batch normalization greatly reduces the risk of unstable gradients.
2. Eliminates the need for manually adding a standardization layer.
3. Makes the network converge to the optimum faster.
4. Reduces the need for regularization.

Gradient clipping

Batch normalization is very tricky to use with RNNs. That's why gradient clipping is used instead to deal with exploding gradients.

Gradient clipping is, like the name suggests, clipping the values of the gradients when they get too big.

[Figure: a gradient vector clipped to the range [-1, 1] in two ways: clipping by absolute values and clipping by norm]
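The two clipping styles behave differently, which a small NumPy sketch makes concrete. The gradient vector and the limit of 1.0 are hypothetical values chosen for illustration.

```python
import numpy as np


def clip_by_value(gradient, limit=1.0):
    """Clip every component independently to the range [-limit, limit]."""
    return np.clip(gradient, -limit, limit)


def clip_by_norm(gradient, max_norm=1.0):
    """Rescale the whole vector if its length exceeds max_norm."""
    norm = np.linalg.norm(gradient)
    if norm > max_norm:
        return gradient * (max_norm / norm)
    return gradient


gradient = np.array([0.9, 100.0])  # hypothetical exploding gradient

by_value = clip_by_value(gradient)
by_norm = clip_by_norm(gradient)
```

Clipping by value turns the vector into [0.9, 1.0], which points in a quite different direction from the original. Clipping by norm shrinks both components by the same factor, so the direction is preserved but the small component becomes nearly zero.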


Techniques to speed up training


Applying good initialization
Using a good activation function
Using batch normalization
Reusing parts of a pretrained network
Normalizing the data
Using mini-batches
Learning rate scheduling
A faster optimization algorithm
Pruning the network

Normalizing the data

When you train your network with unnormalized data, the parameters of the network will have a more complicated relationship with the cost function.

Specifically, the graph of the relationship between the parameters and the cost will look like a very wide bowl. The edges of this bowl are steep, so gradient descent will make quick progress at first, but as we get closer to the minimum (the point where the cost is the lowest) the progress will get smaller and smaller because the slope of the graph is very small.

Whereas with normalized data, the relationship graph will look more like a smooth bowl. Thus, the progress will be faster since the direction of the minimum is very clear from anywhere on the bowl.
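One common way to normalize is standardization: shift each feature to a mean of 0 and rescale it to a standard deviation of 1. The sketch below assumes that form of normalization; the feature values are hypothetical, and the statistics are taken from the training data only so the same transformation can be reused on new data.

```python
import numpy as np


def standardize(train, other):
    """Standardize both arrays using the training set's mean and std."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    return (train - mean) / std, (other - mean) / std


# Hypothetical features on very different scales.
train = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 5000.0]])
validation = np.array([[2.0, 3000.0]])

train_std, validation_std = standardize(train, validation)
```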

Images from www.coursera.org/learn/deep-neural-network/



Using mini-batches

Batch Gradient Descent (GD): Run the whole data through the network
Mini-batch GD: Run 2 or more but less than the whole data through the network
Stochastic GD: Run examples one by one

Batching rules of thumb
Small dataset (<2000 data points): use Batch GD
Otherwise use mini-batch GD
Use mini-batch sizes that are powers of 2.
Research suggests you can use as much as 8192*.
2 to 512 is typical.
Find a number that fits in your CPU/GPU memory.
Start big and lower if performance is bad.

*Elad Hoffer et al. and Priya Goyal et al., given that you use learning rate warm-up.

Using a faster optimization algorithm

Gradient Descent
Gradient Descent with Momentum
Nesterov Accelerated Gradient
Other algorithms: AdaGrad, RMSProp, Adam, Nadam

Here is a comparison of optimizers based on Aurélien Géron's book*

Optimization algorithm          Convergence speed   Convergence quality
Gradient descent (GD)           Bad                 Good
GD with momentum                Average             Good
GD with momentum and Nesterov   Average             Good
AdaGrad                         Good                Bad (stops too early)
RMSProp                         Good                Average to Good
Adam                            Good                Average to Good
Nadam                           Good                Average to Good
Adamax                          Good                Average to Good

*Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems

Pruning the network


Pruning the network gets rid of parameters with very small values, creating a sparse network. This network will take up less space in memory and the prediction times of the network will be faster. Ways to prune a network are:

Using L1 regularization at training time
Removing very small weights manually (will probably lead to worse performance)
Using tools like the TensorFlow Model Optimization Toolkit

Learning rate scheduling


Dynamically determining what the learning rate should be, in order to overcome the limitations of having a pre-determined static learning rate.

Manual approach
Train many models, give them increasing values for the learning rate, and choose a value a bit before the point where the loss starts shooting up.

Piecewise constant scheduling
Use a constant learning rate for a certain number of epochs. Very similar to manual.

Power scheduling
The learning rate is a function of the epoch number. The rate of decrease decreases.

Performance scheduling
Drop the learning rate when the validation error stops dropping.

Can't hurt to try

Exponential scheduling
The learning rate is a function of the epoch number. It drops by a factor of 10 every s epochs.

1cycle scheduling
First increase the learning rate linearly from its initial value towards a maximum value during the first half of training. Then lower it back to the initial value during the second half of training.

Recommended
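Three of these schedules are simple functions of the epoch number and can be sketched directly. The exponential form follows the "factor of 10 every s epochs" description; the power form, lr0 / (1 + epoch / s) ** c, is one common formulation and is an assumption here, as are all function and parameter names.

```python
def exponential_schedule(epoch, initial_lr, s):
    """Learning rate drops by a factor of 10 every s epochs."""
    return initial_lr * 0.1 ** (epoch / s)


def power_schedule(epoch, initial_lr, s, c=1.0):
    """Learning rate decays quickly at first, then more and more slowly."""
    return initial_lr / (1.0 + epoch / s) ** c


def piecewise_constant_schedule(epoch, boundaries, values):
    """Constant learning rate within each range of epochs.

    values must have one more entry than boundaries.
    """
    for boundary, value in zip(boundaries, values):
        if epoch < boundary:
            return value
    return values[-1]


# Hypothetical settings.
exp_lrs = [exponential_schedule(e, initial_lr=0.01, s=20) for e in (0, 20, 40)]
power_lrs = [power_schedule(e, initial_lr=0.1, s=10) for e in (0, 10, 30)]
piecewise_lrs = [
    piecewise_constant_schedule(e, boundaries=[5, 15], values=[0.01, 0.005, 0.001])
    for e in (0, 5, 20)
]
```

With these settings the power schedule halves the learning rate in the first 10 epochs but needs 20 more epochs to halve it again, which is the "rate of decrease decreases" behavior.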


Model lifecycle

Evaluating the model


Have only one metric to compare models
You can combine multiple metrics to create one metric
E.g.: combine precision and recall to have the F-score or
average multiple accuracy values
If necessary, you can add a helper metric
E.g.: model speed
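As one concrete way of combining precision and recall into a single metric, here is a minimal sketch of the F-score. The `beta` parameter (which weights recall relative to precision) and the function name are illustrative; beta = 1 gives the familiar F1 score.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta ** 2
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical models: the F-score punishes a large gap between the two metrics.
balanced_model = f_score(precision=0.8, recall=0.8)
lopsided_model = f_score(precision=0.5, recall=1.0)
```

A plain average would score the lopsided model at 0.75; the F-score puts it lower, closer to the weaker of its two metrics.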

Levels of performance
Training set performance
Validation set performance
Test set performance
Real-life performance

The goal of diagnosis is to find out where in these levels the problem occurs and fix it.

How do we know what training set performance is good?
We compare it to human-level performance.

How do we know what real-life performance is good?
We evaluate user satisfaction or whatever other real-life evaluation metrics apply (this could still be accuracy or precision in real-world tasks).


Bayes Optimal Performance

This is the best way a task can be done. It is a hypothetical limit. Many times we don't know what it is. But most of the time, Bayes Optimal Performance is thought to be the best performance achieved on this task by humans.

Bayes Optimal Performance is always either equal to or better than human-level performance. Machine learning algorithms (shown as the orange line in the figure) progress fairly fast to the point of achieving human-level performance. But after they surpass that level, it takes much more effort to make progress towards Bayes Optimal Performance.

Human-level performance
Human-level performance is normally what models aspire to. But it can be defined in many different ways. Let's say we observed these error percentages in a given task by these groups of people:

[Figure: error percentages by group]

In this case, the best performance is by the group of pharmacists. If we have no other information on how well this task can be done, we can accept 0.3% as the Bayes Optimal Error.

But what we should accept as human-level performance will depend on our task.


Improving the model

Deciding which difference is bigger will help us decide what actions to take to make the model better.

That is why choosing the correct human-level performance for each specific case matters.

As we talked about before, there are things we can do to address high bias or high variance.

But on top of that, here is what we should do to make the model better when we want to address a specific performance gap.

Human-level performance
    Train more
    Increase model complexity
    Try more advanced architecture
    Find better hyperparameters
Performance on training set
    More data
    Regularization
    Try different architecture
    Find better hyperparameters
Performance on validation set
    Have a bigger validation set
Performance on test set
    Change validation set
    Change cost function
Performance in real-life


Hyperparameter tuning
There are many hyperparameters, and many options for each hyperparameter, in neural networks. That's why, many times, trying out all the combinations of hyperparameter settings is not feasible. Instead, we have some tactics we use to make life easier.

Grid search
Grid search is trying each combination of options one by one. This might be possible for some traditional machine learning models, but it is not really an option for neural networks.

Random search
Random search is trying out a random subset of settings in the whole space of possibilities. It does not guarantee finding the best possible combination of hyperparameter settings, but it works well enough most of the time.
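A minimal sketch of random search using only the standard library. The search space, the number of trials, and especially the `evaluate` function are hypothetical: in practice `evaluate` would train a network with the given settings and return its validation loss.

```python
import random


def random_search(search_space, evaluate, n_trials, seed=0):
    """Try n_trials randomly drawn settings and keep the best one."""
    rng = random.Random(seed)
    best_settings, best_score = None, float("inf")
    for _ in range(n_trials):
        # Draw one option per hyperparameter, independently at random.
        settings = {name: rng.choice(options)
                    for name, options in search_space.items()}
        score = evaluate(settings)
        if score < best_score:
            best_settings, best_score = settings, score
    return best_settings, best_score


# Hypothetical search space: 27 combinations in total.
search_space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [32, 64, 128],
    "hidden_layers": [1, 2, 3],
}


def evaluate(settings):
    """Stand-in for training a model and returning its validation loss."""
    return abs(settings["learning_rate"] - 0.01) + abs(settings["hidden_layers"] - 2)


best_settings, best_score = random_search(search_space, evaluate, n_trials=10)
```

Only 10 of the 27 combinations are evaluated, so the result is the best of those tried, not necessarily the best overall.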

Manual zooming in
This is a manual way of looking for the best settings. The idea is to do iterative random
search. In each iteration, the search is focused around the settings that performed the
best in the previous run.

Bayesian search
Build a probabilistic model of the relationship between the hyperparameters and the
cost function.

Gradient-based search
Approaches the hyperparameter tuning problem like the learning problem.

Evolutionary computing based search
Uses processes like randomization, natural selection and survival of the fittest to find the best hyperparameter settings.

Early-stopping based search
The main approach of these search algorithms is to focus resources on settings that are promising. There are different types of algorithms (SHA, ASHA, Hyperband) that fall into this category.


Which approach to use?

The best way to do hyperparameter tuning, especially for your first couple of projects, is to use random search. Once you have a good grasp on it, you can use different approaches like Bayesian search or even evolutionary computing based search.

You do not need to implement these search algorithms yourself. Here are some libraries that can help you:

Bayesian search: Spearmint, Scikit-Optimize
Gradient-based search: Adatune
Evolutionary computing based search: Sklearn-Deap
Early-stopping based search: Hyperband
Other libraries and resources: Hyperopt, Kopt, Talos, Hyperas, Keras Tuner, SHERPA, Google Cloud AI Platform HP tuning service, SigOpt, Oscar
