DL Regularization
BITS Pilani
Pilani Campus
Deep Neural Network
• The content for these slides has been obtained from books and various other sources on the Internet
• I hereby acknowledge all the contributors for their material and inputs.
• I have provided source information wherever necessary.
• I have added and modified the content to suit the requirements of the course.
• Optimization Algorithms
• Gradient Descent
• Stochastic Gradient Descent
• Mini-batch Gradient Descent
• Gradient Descent with momentum
• AdaGrad
• RMSProp
• Adam
First-derivative test at a critical point x (where f′(x) = 0):
• If f′(x) = 0 just before and just after x, then x may be a saddle point.
• If f′(x) > 0 just before x and f′(x) < 0 just after x, then x is a maximum.
• If f′(x) < 0 just before x and f′(x) > 0 just after x, then x is a minimum.
Second-derivative test:
• If f′(x) = 0 and f″(x) < 0, then x is a maximum.
• If f′(x) = 0 and f″(x) > 0, then x is a minimum.
• If f′(x) = 0 and f″(x) = 0, then x may be a saddle point.
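For example, f(x) = x² has f′(0) = 0 and f″(0) = 2 > 0, so x = 0 is a minimum, while f(x) = x³ has f′(0) = f″(0) = 0 and x = 0 is a saddle point.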
Reminder: The error surface for a linear neuron
• Update rule: w ← w − η · ∇L(w)
• In vanilla GD, the computational cost of each iteration is O(n), growing linearly with the number of training examples n.
• The larger the training dataset, the higher the cost of each GD iteration (a code sketch follows below).
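A minimal sketch of this per-iteration O(n) cost, assuming a linear least-squares model (the function and names below are illustrative, not from the slides):

```python
import numpy as np

# Minimal sketch of vanilla (batch) gradient descent for linear least squares.
def batch_gd(X, y, lr=0.1, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Every iteration touches all n training examples: O(n) cost per step.
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad  # update rule: w <- w - eta * gradient
    return w
```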
Gradient descent
Learning rate (figure):
• If we pick the learning rate too small, we make little progress (slow learning).
• If we pick it too large, the solution oscillates and in the worst case it might diverge.
• At an optimal setting, GD converges; a suitable learning rate is often found only after multiple experiments.
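A toy illustration of these three regimes on f(x) = x², whose gradient is 2x (the function and step sizes are assumptions chosen for the demonstration):

```python
# Run gradient descent on f(x) = x^2 with different learning rates.
def gd(lr, x=1.0, steps=10):
    for _ in range(steps):
        x -= lr * 2 * x
    return x

print(gd(0.01))  # too small: x ~ 0.82 after 10 steps -- slow progress
print(gd(0.40))  # suitable: x ~ 1e-7 -- rapid convergence toward 0
print(gd(1.10))  # too large: x ~ 6.2, sign flips each step -- oscillation, divergence
```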
Local minima vs. global minimum for DNNs
• Tweak the learning rate
• Gradually reduce the learning rate, then increase it and slowly reduce it again, several times.
• Increasing the learning rate reduces the stability of the algorithm, but gives it the ability to jump out of a local optimum.
• Another takeaway
• Finding the global minimum is probably not desirable, as it would likely represent extreme overfitting on the training set.
• Empirical evidence shows that generalization performance is similar for local-minima solutions and the global-minimum solution.
Stochastic gradient descent
1. Piecewise constant
• Decrease the learning rate stepwise, e.g., whenever progress in optimization stalls.
• This is a common strategy for training deep networks.
2. Exponential decay
• Variance in the parameters is significantly reduced, but
• the learning rate shrinks so quickly that it can lead to premature stopping before the algorithm has converged; in the worst case it fails to converge at all.
3. Polynomial decay
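Minimal sketches of the three schedules, with all constants (initial rate, boundaries, decay factors) chosen as illustrative assumptions:

```python
import math

def piecewise_constant(t, lr0=0.1, boundaries=(30, 60), factor=0.1):
    # Drop the rate by `factor` at each boundary epoch, e.g., when progress stalls.
    return lr0 * factor ** sum(t >= b for b in boundaries)

def exponential_decay(t, lr0=0.1, k=0.1):
    # Shrinks quickly -- risks stopping before convergence.
    return lr0 * math.exp(-k * t)

def polynomial_decay(t, lr0=0.1, alpha=0.5):
    # Decays more gently than exponential decay.
    return lr0 * (1 + t) ** (-alpha)
```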
• Pick a mini-batch size (a hyperparameter) that is large enough to offer good computational efficiency while still fitting into the memory of a GPU.
• For small training sets (e.g., fewer than 2000 examples), use batch GD.
• For larger sets, sizes between 64 and 512 (preferably powers of 2) are typical.
Steps for mini-batch GD
1. Split the data into mini-batches, e.g., X(1) through X(1000), X(1001)
through X(2000), and so on.
2. For each mini-batch, perform forward propagation using only the
data in that mini-batch.
3. Compute the cost function for that mini-batch.
4. Implement backpropagation to compute gradients.
5. Update the weights and biases using the gradients (see the code sketch below).
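A sketch of one epoch implementing steps 1–5, assuming a linear model with squared-error loss as a stand-in (names are illustrative, not from the slides):

```python
import numpy as np

def minibatch_gd_epoch(X, y, w, lr=0.01, batch_size=256):
    idx = np.random.permutation(len(y))          # shuffle, then split (step 1)
    for start in range(0, len(y), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        pred = Xb @ w                            # forward propagation (step 2)
        err = pred - yb                          # cost on this mini-batch (step 3)
        grad = Xb.T @ err / len(yb)              # gradient via backprop (step 4)
        w = w - lr * grad                        # parameter update (step 5)
    return w
```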
1. Enables progress in gradient descent even when only partially through the training set.
2. With batch GD, the cost should decrease every iteration.
3. For mini-batch GD, the cost might not decrease every iteration, since each iteration sees a different training batch.
4. The cost function J should generally trend downwards but may oscillate due to the varying difficulty of mini-batches.
GD vs. SGD vs. Mini-batch GD
GD is computationally heavy; it converges to a "flat minimum" and performs well on the test data.
Drawbacks of gradient-based methods
• The most critical challenge in optimizing deep networks is finding the correct trajectory to move in.
• The gradient is usually not a very good indicator of that trajectory.
• When the contours are perfectly circular, the gradient always points in the direction of the local minimum.
• However, if the contours are extremely elliptical (as is usually the case for the error surfaces of deep networks), the gradient can be as much as 90 degrees away from the correct direction! (A worked example follows.)
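A quick check with an assumed quadratic: for f(w₁, w₂) = w₁² + 100 · w₂² (strongly elliptical contours), at the point (1, 0.1) the gradient is (2, 20), dominated by the w₂ direction, while the direction to the minimum at the origin is (−1, −0.1), almost entirely along w₁. The angle between the negative gradient and the direction to the minimum works out to about 79°, i.e., nearly orthogonal.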
Error surface of DNN
V_t = β · V_{t−1} + (1 − β) · NewSample
Here 1/(1 − β) ≈ n, the number of recent observations that effectively shape the EWA.
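A minimal sketch of an exponentially weighted average (the function name and sample stream are assumptions for illustration):

```python
def ewa(samples, beta=0.9):
    v = 0.0
    out = []
    for x in samples:
        v = beta * v + (1 - beta) * x  # V_t = beta*V_{t-1} + (1-beta)*NewSample
        out.append(v)
    return out

# With beta = 0.9, roughly n = 1/(1 - beta) = 10 recent samples dominate each V_t.
print(ewa([1.0] * 20, beta=0.9)[-1])  # approaches 1.0 as samples accumulate
```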
Gradient Descent with Momentum
V_t = β · V_{t−1} + (1 − β) · NewSample
v_t ← β · v_{t−1} + g_{t,t−1}
w_t ← w_{t−1} − η · v_t
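As a sketch, one momentum step in code (the function name and interface are assumptions):

```python
def momentum_step(w, v, grad, lr=0.05, beta=0.7):
    v = beta * v + grad   # v_t <- beta * v_{t-1} + g_t
    w = w - lr * v        # w_t <- w_{t-1} - eta * v_t
    return w, v
```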
Momentum update vs. plain gradient update (figure):
• Plain gradient update: w_t ← w_{t−1} − η · g_{t,t−1}
• With momentum: v_t ← β · v_{t−1} + g_{t,t−1}; w_t ← w_{t−1} − η · v_t
Gradient Descent with Momentum
(Figure: loss surface over parameter θ with a point where the gradient = 0; momentum can carry the update through such flat regions.)
Slide credit: Hung-yi Lee – Deep Learning Tutorial
Solved problem: SGD and SGD + momentum
Consider the following loss function, with initial value w⁽⁰⁾ = −2.8, learning rate η = 0.05, and momentum β = 0.7. Use SGD and SGD + momentum to find the updated value w⁽¹⁾ after the first iteration.
L(w) = 0.3w⁴ − 0.1w³ − 2w² − 0.8w
Answer (table: per-iteration gᵢ and wᵢ for SGD; gᵢ, vᵢ, and wᵢ for SGD + momentum)
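A worked first iteration, assuming v₀ = 0 and the update form above (v_t ← β · v_{t−1} + g_t, w_t ← w_{t−1} − η · v_t):

L′(w) = 1.2w³ − 0.3w² − 4w − 0.8
g₀ = L′(−2.8) = −26.3424 − 2.352 + 11.2 − 0.8 = −18.2944
SGD: w⁽¹⁾ = −2.8 − 0.05 × (−18.2944) ≈ −1.8853
SGD + momentum: v₁ = 0.7 × 0 + (−18.2944) = −18.2944, so w⁽¹⁾ = −2.8 − 0.05 × (−18.2944) ≈ −1.8853

With v₀ = 0 the two methods coincide on the first iteration; the momentum term only changes the trajectory from the second iteration onward.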
AdaGrad
• Decays the learning rate for each parameter in proportion to its update history
• An individual learning rate per parameter (feature)
• Accumulates past squared gradients in s_t
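A sketch of the standard AdaGrad update, as found in common references (an assumption, since the slide's own equations are not shown here):

s_t = s_{t−1} + g_t²
w_t = w_{t−1} − (η / √(s_t + ε)) · g_t

All operations are applied coordinate-wise, so each parameter receives its own effective learning rate.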
RMSProp
• Uses a leaky average of past squared gradients, in the same way we used a leaky average in the momentum method: s_t ← γ · s_{t−1} + (1 − γ) · g_t², with parameter γ > 0.
• The constant ε > 0 is typically set to 10⁻⁶.
• Faster convergence compared to AdaGrad.
• Works well on big and redundant datasets.
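A small sketch contrasting the two accumulators (the tiny interface below is an assumption, not from the slides):

```python
import numpy as np

def adagrad_state(s, g):
    return s + g**2                        # sum of squares keeps growing

def rmsprop_state(s, g, gamma=0.9):
    return gamma * s + (1 - gamma) * g**2  # leaky average forgets old gradients

def scaled_update(w, s, g, lr=0.01, eps=1e-6):
    return w - lr * g / np.sqrt(s + eps)   # per-coordinate learning rate
```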
Review of techniques learned so far
• Compute updates (sketched below)
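The agenda closes with Adam, which combines the techniques reviewed above. As a sketch of the standard formulation (an assumption, not the slide's own equations): Adam keeps leaky averages of both the gradient (momentum) and its square (RMSProp-style scaling), with bias correction:

v_t ← β₁ · v_{t−1} + (1 − β₁) · g_t,  s_t ← β₂ · s_{t−1} + (1 − β₂) · g_t²
v̂_t = v_t / (1 − β₁ᵗ),  ŝ_t = s_t / (1 − β₂ᵗ)
w_t ← w_{t−1} − η · v̂_t / (√ŝ_t + ε)

Typical values are β₁ = 0.9 and β₂ = 0.999.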
Optimization algorithm comparison