Deep Learning: Course Code: Unit 1

This document discusses gradient descent, an algorithm used to find the minimum of a loss function. It begins by defining partial derivatives and providing examples. It then introduces gradient descent, noting that it finds local optima by taking steps in the direction of the negative gradient. The gradient descent algorithm is outlined as initializing weights, computing the gradient at each step, and updating the weights based on the learning rate until convergence. Challenges like learning rates, local minima, and computation costs are also mentioned.


Deep Learning

• Course Code:
• Unit 1: Introduction to Deep Learning
• Lecture 5: Loss Optimization (Gradient Descent)
Partial differentiation
• We write f’x to mean “the partial derivative of f with respect to x”.
• The symbol ∂ is also called “del”, “dee”, or “curly dee”.
• Example with explanation
• Take a function of one variable x:
• f(x) = x²
• Its derivative, using the power rule:
• First-order derivative: f’(x) = 2x
• Now take a function of two variables x and y:
• f(x, y) = x² + y³
• To find its partial derivative with respect to x, we treat y as a constant:
• Partial derivative wrt x: f’x = 2x + 0 = 2x
• Now, to find the partial derivative with respect to y, we treat x as a constant:
• Partial derivative wrt y: f’y = 0 + 3y² = 3y²
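The two partial derivatives above can be checked numerically with central finite differences. A minimal sketch (the evaluation point and step size h are illustrative choices, not from the slides):

```python
def f(x, y):
    # f(x, y) = x^2 + y^3, the two-variable example from the slides
    return x**2 + y**3

def partial_x(f, x, y, h=1e-6):
    # central-difference approximation of df/dx, holding y constant
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-6):
    # central-difference approximation of df/dy, holding x constant
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

# At (x, y) = (3, 2): f'x = 2x = 6 and f'y = 3y^2 = 12
print(partial_x(f, 3.0, 2.0))  # ≈ 6.0
print(partial_y(f, 3.0, 2.0))  # ≈ 12.0
```

Holding the other variable fixed in the difference quotient is exactly the "treat y as a constant" step from the slide.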
Gradient Descent Algorithm

https://www.kdnuggets.com/2020/05/5-concepts-gradient-descent-cost-function.html
Gradient Descent
• Method to find local optima of a differentiable function
• Intuition: the gradient tells us the direction of greatest increase; the negative gradient gives us the direction of greatest decrease
• Take steps in directions that reduce the function value
• The definition of the derivative guarantees that if we take a small enough step in the direction of the negative gradient, the function value will decrease
• How small is small enough?
Gradient Descent

Gradient Descent Algorithm:

• Pick an initial point x^(0)

• Iterate until convergence: x^(t+1) = x^(t) − α ∇f(x^(t))

where α is the step size (sometimes called the learning rate)

When do we stop?
Gradient Descent

Possible stopping criterion: iterate until ‖∇f(x^(t))‖ ≤ ε for some small ε > 0

How small should ε be?
Gradient Descent

Worked example: f(x) = x², step size α = 0.8, starting point x^(0) = −4

x^(1) = −4 − 0.8 · 2 · (−4) = 2.4
x^(2) = 2.4 − 0.8 · 2 · 2.4 = −1.44
x^(3) = 0.864
x^(4) = −0.5184
x^(5) = 0.31104
…
x^(30) = −8.84296e−07
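The sequence of iterates above can be reproduced with a few lines of Python, a direct transcription of the update rule x ← x − α · f'(x):

```python
def grad_descent(grad, x0, alpha, steps):
    """Plain gradient descent: repeatedly apply x <- x - alpha * grad(x)."""
    x = x0
    trace = [x]
    for _ in range(steps):
        x = x - alpha * grad(x)
        trace.append(x)
    return trace

# f(x) = x^2, so f'(x) = 2x; step size 0.8, starting at x = -4
trace = grad_descent(lambda x: 2 * x, x0=-4.0, alpha=0.8, steps=30)
print(trace[1])   # 2.4
print(trace[30])  # about -8.84e-07, matching the slide
```

Each step multiplies x by (1 − 2α) = −0.6, which is why the iterates alternate sign while shrinking toward the minimum at 0.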
Gradient Descent

Step size 0.9: the iterates overshoot and oscillate around the minimum.

Step size 0.2: the iterates converge smoothly but more slowly per step.

Step size matters!
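The effect of the step size is easy to see numerically on the same f(x) = x² example (0.9 and 0.2 are the step sizes from the plots; the divergent 1.1 case is an added illustration):

```python
def run(alpha, x0=-4.0, steps=30):
    # gradient descent on f(x) = x^2, whose gradient is 2x
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x
    return x

# Each step multiplies x by (1 - 2*alpha):
print(run(0.2))   # multiplier  0.6: smooth, monotone decay toward 0
print(run(0.9))   # multiplier -0.8: oscillates, shrinks more slowly
print(run(1.1))   # multiplier -1.2: diverges, |x| grows every step
```

A step size above 1 (for this function) makes each step overshoot by more than the distance to the minimum, so the iterates blow up instead of converging.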
Overview
• Gradient descent is the standard algorithm
Variants of gradient descent
• Stochastic gradient descent (SGD)
• Mini-batch SGD
Challenges
• Learning rates
• Local minima
• We will look at methods to deal with the above issues
• https://towardsdatascience.com/batch-mini-batch-stochastic-gradient-descent-7a62ecba642a
Loss Optimization
W* = argmin_W J(W)

• The weights are on the x and y axes, whereas the loss is marked on the z axis.
• For any value of W, we can see the loss at that point.
• We need to find the point on this landscape with minimum loss.

Amity Centre for Artificial Intelligence, Amity University, Noida, India


Loss Optimization
W* = argmin_W J(W)
• Randomly pick a place on this landscape to start searching for the minimum-loss weights.
• From this random place, we find how the landscape is changing (how its slope changes) using the gradient of the loss with respect to each of the weights.
• The gradient is a vector that gives us the direction in which the loss function has the steepest ascent.


Loss Optimization
W* = argmin_W J(W)
• The gradient tells us which way the landscape rises most steeply.
• Here the landscape is higher around the selected point, so we need to take a step in a direction that is lower than the selected point.


Loss Optimization
W* = argmin_W J(W)

• Take a small step in the opposite direction of the gradient.
• On reaching the lower point, the process is repeated over and over again until we converge to a local minimum.


Gradient Descent
• Repeat until convergence: W ← W − η · ∂J(W)/∂W


Gradient Descent

Algorithm for gradient descent:

1. Initialize the weights randomly ~N(0, σ²)        weights = tf.random_normal( )
2. Loop until convergence:
3.     Compute gradient, ∂J(W)/∂W                   grads = tf.gradients(loss, weights)
4.     Update weight, W ← W − η · ∂J(W)/∂W          weights_new = weights.assign(weights – lr * grads)
5. Return weights


Gradient Descent
• The amount by which the weights are updated during training is referred to as the step size or the learning rate.
• The learning rate is a configurable hyperparameter used in the training of neural networks; it has a small positive value, often in the range between 0.0 and 1.0.
• The learning rate controls how quickly the model adapts to the problem.
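The tf.random_normal / tf.gradients calls shown in the slides are TensorFlow 1.x API. As a framework-free illustration, the same five steps can be sketched with NumPy on a toy least-squares loss; the data, dimensions, learning rate, and iteration budget here are invented for the sketch, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: J(w) = mean squared error of a linear model (illustrative only)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = rng.normal(size=3)   # 1. initialize the weights randomly ~ N(0, 1)
lr = 0.1

for _ in range(200):     # 2. loop (fixed budget instead of a convergence test)
    # 3. compute gradient dJ/dw of the MSE loss, here derived analytically
    grads = 2 * X.T @ (X @ w - y) / len(y)
    w = w - lr * grads   # 4. update weights: w <- w - lr * grads
# 5. return / use the learned weights
print(np.round(w, 3))
```

Because this toy loss is convex, the loop recovers true_w; on a real network the gradient in step 3 comes from backpropagation rather than a hand-derived formula.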
Gradient Descent

In the algorithm above, the gradient ∂J(W)/∂W is computed over the entire dataset at every iteration. This can be very computationally intensive to compute!


Stochastic Gradient Descent

Algorithm for stochastic gradient descent:

1. Initialize the weights randomly ~N(0, σ²)
2. Loop until convergence:
3.     Pick a single data point i
4.     Compute gradient, ∂J_i(W)/∂W
5.     Update weight, W ← W − η · ∂J_i(W)/∂W
6. Return weights

Easy to compute but very noisy (stochastic)!


Stochastic Gradient Descent with Momentum
• SGD is noisy and requires more iterations to reach the minimum. Adding a momentum term to regular SGD gives faster convergence of the loss function.
• SGD oscillates in either direction of the gradient and updates the weights accordingly. Adding a fraction of the previous update to the current update makes the process a bit faster.
• Velocity update: Vt = β Vt−1 + η · ∂J(W)/∂W
• Updated weight: Wt+1 = Wt − Vt
• The velocity V accumulates past gradients and denotes the change applied to the weights on the way to the minimum.
• The learning rate should be decreased when a momentum term is used.
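The two momentum equations can be sketched on the same quadratic toy problem used earlier; the β and η values below are illustrative defaults, not prescribed by the slides:

```python
def sgd_momentum(grad, w0, lr=0.1, beta=0.9, steps=200):
    """Momentum update: v <- beta*v + lr*grad(w); w <- w - v."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v + lr * grad(w)  # velocity: fraction of previous update + new gradient step
        w = w - v                    # weight update uses the velocity, not the raw gradient
    return w

# Minimize f(w) = w^2 (gradient 2w), starting from w = -4
w = sgd_momentum(lambda w: 2 * w, w0=-4.0)
print(w)  # approaches the minimum at 0
```

With β = 0 this reduces exactly to plain gradient descent, which is one way to see that momentum only adds a decaying memory of past updates.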


Mini-batch Gradient Descent

Algorithm for mini-batch gradient descent:

1. Initialize the weights randomly ~N(0, σ²)
2. Loop until convergence:
3.     Pick a batch of B data points
4.     Compute gradient, ∂J(W)/∂W = (1/B) Σₖ ∂Jₖ(W)/∂W
5.     Update weight, W ← W − η · ∂J(W)/∂W
6. Return weights

Fast to compute and a much better estimate of the true gradient!
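Steps 3 and 4 are the only parts that differ from full-batch gradient descent: sample B indices and average the per-example gradients. A NumPy sketch on a toy least-squares problem (batch size, learning rate, and data are invented for the illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: J_k(w) = (x_k . w - y_k)^2 for each example k (illustrative only)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = rng.normal(size=3)   # 1. random initialization
lr, B = 0.1, 32

for _ in range(500):                       # 2. fixed iteration budget
    idx = rng.integers(0, len(y), size=B)  # 3. pick a batch of B data points
    Xb, yb = X[idx], y[idx]
    grads = 2 * Xb.T @ (Xb @ w - yb) / B   # 4. average the per-example gradients
    w = w - lr * grads                     # 5. update weights
print(np.round(w, 2))
```

Setting B = 1 recovers SGD and B = len(y) recovers batch gradient descent, which is why mini-batching sits between the two extremes described in the summary.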


Mini-batches while training
• Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches that are used to calculate model error and update model coefficients.
• Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent.
• More accurate estimation of the gradient
• Smoother convergence
• Allows for larger learning rates


Mini-batches while training

• More accurate estimation of the gradient
• Smoother convergence
• Allows for larger learning rates
• Mini-batches lead to fast training: computation over a batch can be parallelized, achieving increased speed on GPUs


Summary
• Batch Gradient Descent (BGD):
It uses the entire dataset at every step, making it slow and computationally expensive for large datasets. However, it produces a stable error gradient and stable convergence.
• Stochastic Gradient Descent (SGD):
At the other extreme, it uses a single example (a batch of 1) per learning step. Much faster, but it may return noisy gradients, which can cause the error rate to jump around.
• Mini-batch Gradient Descent:
Computes the gradients on small random sets of instances called mini-batches. Reduces the noise of SGD while still being more efficient than BGD.
