Fundamentals of Deep Learning

Meticulously curated information on the base concepts of deep learning, presented in diagrams and visualizations.

Mısra Turp
[Figure: nested circles showing Deep Learning inside Machine Learning, inside Artificial Intelligence, inside Computer Science]

Artificial Intelligence is a research branch of Computer Science. Machine Learning is one of the approaches to Artificial Intelligence.
Comparison of Traditional Machine Learning and Deep Learning
[Figure: side-by-side comparison of Traditional Machine Learning and Deep Learning]
Types of layers
[Figure: a network with an input layer, hidden layers, and an output layer]

A single neuron
[Figure: a single neuron with 1st and 2nd input features x1 and x2, weights w1 and w2, bias b1, an activation function, and z1 as the output of the neuron]
The neuron multiplies each input feature by its weight, adds the bias, and passes the result through the activation function; z1 is the output of the neuron.
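A minimal sketch of what such a neuron computes, with made-up values and a sigmoid chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, x2 = 0.5, -1.2   # 1st and 2nd input features
w1, w2 = 0.8, -0.3   # weights
b1 = 0.1             # bias

weighted_sum = w1 * x1 + w2 * x2 + b1   # inputs times weights, plus the bias
z1 = sigmoid(weighted_sum)              # output of the neuron after the activation function
```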
Activation functions
Backpropagation
Backpropagation pseudo-algorithm
1. Initialize the network with randomly generated weights.
2. Do a forward propagation step and calculate the output.
3. Calculate the error/loss/cost.
4. Find out how much each parameter contributes to the error.
5. Calculate this ratio using all the data points in the batch.
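A hedged NumPy sketch of these steps for a single sigmoid neuron with a squared-error loss; all names and values are illustrative, not from the book:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 1. Initialize the network with randomly generated weights
w = np.random.randn(2) * 0.1   # two weights
b = 0.0                        # bias
x = np.array([0.5, -1.2])      # input features
y = 1.0                        # actual value

# 2. Forward propagation: calculate the output
output = sigmoid(w @ x + b)

# 3. Calculate the error/loss/cost
loss = 0.5 * (output - y) ** 2

# 4.-5. Find out how much each parameter contributes to the error (chain rule)
d_output = output - y                      # loss gradient w.r.t. the output
d_sum = d_output * output * (1 - output)   # through the sigmoid
dw = d_sum * x                             # gradients w.r.t. the weights
db = d_sum                                 # gradient w.r.t. the bias
```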
Gradient Descent
Now that we have this many-dimensional graph showing us how the relationship between all the parameters and the cost works, we can decide in which way to update these parameters to lower the cost.

Because the graph would be very many-dimensional, in real life we cannot actually look at it and decide where to go. That's why we use the gradient vector. Gradient descent is the act of changing the parameters based on the gradient vector. Here is how it works.
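As a self-contained toy sketch (the cost function and values below are made up purely for illustration): every update moves the parameter a small step against its gradient, and the cost keeps dropping.

```python
def cost(w):
    return (w - 2.0) ** 2       # toy cost, lowest at w = 2

def gradient(w):
    return 2.0 * (w - 2.0)      # derivative (gradient) of the toy cost

w = 5.0                         # some parameter of the network
learning_rate = 0.1

for step in range(50):
    w = w - learning_rate * gradient(w)   # the gradient descent update

print(round(w, 4), round(cost(w), 6))     # w ends up very close to 2, cost close to 0
```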
[Figure: the training loop annotated with its hyperparameters. The dataset flows through the layers (each calculating its outputs as Wx + b followed by an activation function), the predictions are compared with the actual values through a loss function to get the loss/error, and the weights are updated using the learning rate. Highlighted hyperparameters: # of iterations/epochs, activation function, regularization, loss function, batch size, learning rate, weight initialization technique, optimization algorithm, # of neurons in the input layer, # of neurons in the output layer, # of neurons in the hidden layers, # of hidden layers]

Pre-determined hyperparameters
# of neurons in the input layer: this will depend on the data you are training with.
# of neurons in the output layer: this will again depend on the data you are training with.
Hyperparameters that need tuning
Activation function
Each layer has its own activation function except the input layer. The
activation function of the hidden layers cannot be linear.
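A hedged Keras sketch (Keras appears in the book's references; the layer sizes here are arbitrary): each layer after the input gets its own activation, the hidden layers use a non-linear one such as ReLU, and the output layer's activation depends on the task.

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(20,)),              # input layer: no activation
    keras.layers.Dense(64, activation="relu"),    # hidden layer: non-linear activation
    keras.layers.Dense(64, activation="relu"),    # hidden layer: non-linear activation
    keras.layers.Dense(1, activation="sigmoid"),  # output layer: depends on the task
])
```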
What is overfitting?
Overfitting is when the model is not able to generalize well. The model performs well on the training set but fails to reach the same performance on the validation set. We see this in the comparison of training and validation loss: as we train the network more, the training loss keeps getting lower, whereas after a point the validation loss starts increasing.
Variance: dependence of the model on the particular training set that was
used to train it.
If a model has overfit it means its bias is low and variance is high.
If a model has underfit it means its bias is high and variance is low.
Regularization
L1 / L2 regularization
Works by lowering the weights of the network. Achieves this by adding the weight values
to the cost function.
L1 regularization: Add the sum of the absolute values of the weights to the loss.
L2 regularization: Add the sum of the squared values of the weights to the loss.
The alpha parameter: how much attention to pay to this addition to the cost function.
Value between 0 (no penalty) and 1 (full penalty).
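A minimal Keras sketch, assuming the penalty is added per layer; the alpha value of 0.01 is arbitrary.

```python
from tensorflow import keras

layer = keras.layers.Dense(
    64,
    activation="relu",
    kernel_regularizer=keras.regularizers.l2(0.01),    # L2: add sum of squared weights to the loss
    # kernel_regularizer=keras.regularizers.l1(0.01),  # L1: add sum of absolute weights instead
)
```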
Note!
L1 and L2 regularization lower the weights of the network to combat overfitting, but what does overfitting have to do with high weights?
Dropout regularization
Works by making some neurons inactive in every training step. Every neuron has a probability p of being inactive; this is called the dropout rate.

During training, on average a fraction p of the neurons are inactive. During test time all neurons are active. That's why the input to each neuron is multiplied with the keep probability, which is (1-p).
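A minimal Keras sketch: the Dropout layer makes each neuron of the previous layer inactive with probability p = 0.2 during training, and Keras handles the test-time scaling for you (the layer sizes are arbitrary).

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.2),                     # dropout rate p = 0.2
    keras.layers.Dense(10, activation="softmax"),
])
```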
Early-stopping
Works by stopping training at the point where the validation loss starts getting higher. This is the same point we mentioned before, where overfitting starts happening.
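A hedged Keras sketch: training stops once the validation loss has not improved for a few epochs, and the best weights are restored. X_train and y_train are placeholders for your own data, and `model` is a compiled Keras model.

```python
from tensorflow import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation loss
    patience=5,                  # allow 5 epochs without improvement
    restore_best_weights=True,   # roll back to the best epoch
)
model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])
```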
Data augmentation
Transforming your data in multiple ways before feeding it to your network. This way, you
can generate more data points from the same one. These transformations help the
model tolerate any of the changes made (e.g. flipping, different orientation, color
changes, RGB or black and white).
Images from http://datahacker.rs/tf-data-augmentation/ and Buslaev, A., Parinov, A., Khvedchenya, E., Iglovikov, V., & Kalinin, A. (2018). Albumentations: fast and flexible image augmentations.
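A hedged sketch using the Keras preprocessing layers available in recent TensorFlow versions; which transformations you pick depends on which variations you want the model to tolerate, and `images` is a batch of images you provide.

```python
from tensorflow import keras

augment = keras.Sequential([
    keras.layers.RandomFlip("horizontal"),   # flipping
    keras.layers.RandomRotation(0.1),        # different orientation
    keras.layers.RandomContrast(0.2),        # color/contrast changes
])

augmented_images = augment(images, training=True)  # new data points from the same images
```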
Unstable gradients
Gradients are what we use to update the parameters of the network. When the gradients become very small (vanishing) or very large (exploding) as they flow back through the layers, the network either learns very slowly or fails to converge.
Solutions
Changing weight initializers
The way the weights are initialized might contribute to the unstable gradients problem. There is a need for a strategy that is better than just initializing them from a normal distribution with a mean of zero. Here are some of the initializers:
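A hedged Keras sketch of picking an initializer per layer; He initialization is a common choice for ReLU layers, Glorot (Xavier) for tanh/sigmoid layers, and LeCun for SELU layers.

```python
from tensorflow import keras

layer = keras.layers.Dense(
    64,
    activation="relu",
    kernel_initializer="he_normal",          # He initialization, often paired with ReLU
    # kernel_initializer="glorot_uniform",   # Glorot/Xavier, often paired with tanh/sigmoid
    # kernel_initializer="lecun_normal",     # LeCun, often paired with SELU
)
```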
Batch normalization
[Figure: the layer's inputs are turned into normalized values, then rescaled with a learned scale and offset]
The scale and offset values that are best for this network are learned during training.
Gradient clipping
Batch normalization is very tricky to use in RNNs. That's why gradient clipping is used instead to deal with exploding gradients.

Gradient clipping is, like the name suggests, clipping the values of the gradients when they get too big.
[Figure: gradient clipping restricts each value of the gradient vector to the range [-1, 1]]
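A minimal Keras sketch, assuming the clipping is set on the optimizer: clipvalue clips every component of the gradient vector to [-1.0, 1.0], while clipnorm instead rescales the whole vector whenever its norm gets too big. `model` is a Keras model you have already built.

```python
from tensorflow import keras

optimizer = keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)   # clip each value to [-1, 1]
# optimizer = keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)  # or rescale by the norm
model.compile(optimizer=optimizer, loss="mse")
```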
Normalizing the input data
When you train your network with unnormalized data, the parameters of the network will have a more complicated relationship with the cost function. Specifically, the graph of the relationship between the parameters and the cost will look like a very wide bowl. The edges of this bowl are steep, so gradient descent will make quick progress at first, but as we get closer to the minimum (the point where the cost is the lowest) the progress will get smaller and smaller because the slope of the graph is very small.

Whereas with normalized data, the relationship graph will look more like a smooth bowl. Thus, the progress will be faster since the direction of the minimum is very clear from anywhere on the bowl.
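A minimal sketch of normalizing the inputs with scikit-learn's StandardScaler, assuming tabular data: each feature is rescaled to zero mean and unit variance using statistics computed on the training set only. X_train and X_valid are placeholders for your own data.

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training data
X_valid_scaled = scaler.transform(X_valid)      # reuse the same statistics
```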
Using mini-batches
Batch Gradient Descent (GD): run the whole dataset through the network at once.
Mini-batch GD: run more than one example, but less than the whole dataset, through the network at a time.
Stochastic GD: run the examples through one by one.
*Elad Hoffer et al. and Priya Goyal et al., given that you use learning rate warm-up.
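In Keras the choice between these three comes down to the batch_size argument of fit(); a hedged sketch, where 32 is just a common default and X_train, y_train and `model` are placeholders:

```python
model.fit(X_train, y_train, epochs=20, batch_size=32)              # mini-batch GD
# model.fit(X_train, y_train, epochs=20, batch_size=1)             # stochastic GD
# model.fit(X_train, y_train, epochs=20, batch_size=len(X_train))  # batch GD
```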
GD with momentum and Nesterov
[Figure: comparison table of optimizers, adapted from Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*]
*Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems
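A minimal Keras sketch of gradient descent with momentum; setting nesterov=True switches to the Nesterov variant. The learning rate and momentum values are illustrative, and `model` is a Keras model you have already built.

```python
from tensorflow import keras

optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
model.compile(optimizer=optimizer, loss="mse")
```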
Learning rate scheduling
- The learning rate is a function of the epoch number; the rate of decrease, decreases.
- Drop the learning rate when the validation error stops dropping (can't hurt to try).
- 1cycle scheduling.
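A hedged Keras sketch of the first two options: a schedule where the learning rate is a function of the epoch number, and a callback that drops the rate when the validation loss stops improving. The decay factor, patience and other values are illustrative, and X_train, y_train and `model` are placeholders.

```python
import math
from tensorflow import keras

def exponential_decay(epoch, lr):
    return lr * math.exp(-0.1)           # the learning rate shrinks a little every epoch

schedule_cb = keras.callbacks.LearningRateScheduler(exponential_decay)

plateau_cb = keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.5, patience=3   # halve the rate when validation loss plateaus
)

# Pass one of the two callbacks to training, for example:
model.fit(X_train, y_train, validation_split=0.2, epochs=50, callbacks=[plateau_cb])
```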
Model lifecycle

Levels of performance
- Training set performance
- Validation set performance
- Test set performance
- Real-life performance
The goal of diagnosis is to find out where in these levels the problem occurs and fix it.
Bayes optimal performance
This is the best way a task can be done. It is a hypothetical limit, and many times we don't know what it is. But most of the time, Bayes optimal performance is taken to be the best performance achieved on this task by humans.
Human-level performance
Human-level performance is normally what models aspire to. But it can be defined in
many different ways. Let's say we observed these error percentages in a given task
by these groups of people:
As we talked about before, there are things we can do to address high bias or high variance. But on top of that, here is what we should do to make the model better when we want to address a specific performance gap.
Hyperparameter tuning
There are many hyperparameters and many options for each hyperparameter in neural networks. That's why, many times, trying out all the combinations of hyperparameter settings is not feasible. Instead, we have some tactics we use to make life easier.
Grid search
Grid search is trying every combination of options one by one. This might be possible for some traditional machine learning models, but it is not really an option for neural networks.
Random search
Random search is trying out a subset of settings in the whole space of possibilities. It does not guarantee finding the best possible combination of hyperparameter settings, but it works well enough most of the time.
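A minimal, framework-agnostic sketch of the idea: sample a handful of combinations from the whole space instead of trying every one. The search space values and the build_and_evaluate() helper are hypothetical placeholders for your own model-building code.

```python
import random

search_space = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "n_hidden_layers": [1, 2, 3],
    "n_neurons": [32, 64, 128],
    "dropout_rate": [0.0, 0.2, 0.5],
}

best_loss, best_params = float("inf"), None
for _ in range(20):                                    # try 20 random combinations
    params = {name: random.choice(options) for name, options in search_space.items()}
    val_loss = build_and_evaluate(**params)            # hypothetical: trains a model, returns validation loss
    if val_loss < best_loss:
        best_loss, best_params = val_loss, params
```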
Manual zooming in
This is a manual way of looking for the best settings. The idea is to do iterative random
search. In each iteration, the search is focused around the settings that performed the
best in the previous run.
Bayesian search
Build a probabilistic model of the relationship between the hyperparameters and the
cost function.
Gradient-based search
Approaches the hyperparameter tuning problem like the learning problem.
The best way to do hyperparameter tuning, especially for your first couple of projects, is to use random search. Once you have a good grasp on it, you can use different approaches like Bayesian search or even evolutionary-computing-based search.

You do not need to implement these search algorithms yourself. Here are some libraries that can help you: