Deep Learning: Course Code: Unit 1
• Course Code:
• Unit 1
Introduction to Deep learning
• Lecture 5
Loss Optimization (Gradient Descent)
Partial differentiation
• We write f’x (or ∂f/∂x) to mean "the partial derivative of f with respect to x".
• The symbol ∂ is also called "del", "dee", or "curly dee".
• Example with explanation
• Take a function of one variable x:
• f(x) = x²
• Its derivative, using the power rule:
• First-order derivative: f’(x) = 2x
• Now take a function of two variables x and y:
• f(x, y) = x² + y³
• To find its partial derivative with respect to x, we treat y as a constant:
• Partial derivative wrt x: f’x = 2x + 0 = 2x
• Now, to find the partial derivative with respect to y, we treat x as a constant:
• Partial derivative wrt y: f’y = 0 + 3y² = 3y² (both results are checked in the short sketch below)
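A brief symbolic check of the two partial derivatives above, sketched in Python with SymPy (the use of SymPy is an illustrative assumption, not something the slides require):

# Verify the worked example: f(x, y) = x**2 + y**3
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + y**3

df_dx = sp.diff(f, x)   # y treated as a constant
df_dy = sp.diff(f, y)   # x treated as a constant

print(df_dx)            # 2*x
print(df_dy)            # 3*y**2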
Gradient Descent Algorithm
https://www.kdnuggets.com/2020/05/5-concepts-gradient-descent-cost-function.html
Gradient Descent
• Method to find local optima of a differentiable function
• Intuition: the gradient tells us the direction of greatest increase, so the negative gradient gives us the direction of greatest decrease
• Take steps in directions that reduce the function value (see the update rule below)
• The definition of the derivative guarantees that if we take a small enough step in the direction of the negative gradient, the function value will decrease
• How small is small enough?
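Written as a single rule (the step-size symbol η is introduced here for readability; the slides only show numeric step sizes):

x^(t+1) = x^(t) − η · ∇f(x^(t))

That is, from the current point x^(t), take a step of size η against the gradient.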
Gradient Descent (worked example)
f(x) = x², step size 0.8
• x^(0) = −4
• x^(1) = −4 − 0.8 ⋅ 2 ⋅ (−4) = 2.4
• x^(2) = 2.4 − 0.8 ⋅ 2 ⋅ 2.4 = −1.44
• x^(3) = 0.864
• x^(4) = −0.5184
• x^(5) = 0.31104
• …
• x^(30) ≈ −8.84296e−07
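A small Python sketch (not part of the slides) that reproduces these iterates; note that each update multiplies x by (1 − 0.8 ⋅ 2) = −0.6, which is why the sign alternates while the magnitude shrinks toward the minimum at x = 0:

# Gradient descent on f(x) = x**2 with step size 0.8, starting at x^(0) = -4.
def grad(x):
    return 2 * x                      # f'(x) = 2x

x = -4.0                              # x^(0)
step_size = 0.8

for t in range(1, 31):
    x = x - step_size * grad(x)       # equivalent to x = -0.6 * x
    if t <= 5 or t == 30:
        print(f"x^({t}) = {x}")       # matches the values listed above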
Gradient Descent
Step size: 0.9  [figure: iterates with this step size]
Gradient Descent
Step size: 0.2  [figure: iterates with this step size]
Overview
• Gradient descent is the standard algorithm
Variants of gradient descent
• Stochastic gradient descent (SGD)
• Mini-batch SGD (a short sketch of how the variants differ follows below)
Challenges
• Learning rates
• Local minima
• We will look at methods to deal with the above issues
• https://towardsdatascience.com/batch-mini-batch-stochastic-gradient-descent-7a62ecba642a
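As a rough sketch (not from the slides) of how the variants differ: only the set of examples used to estimate the gradient changes. The linear-model loss, shapes, and batch size below are illustrative assumptions:

import numpy as np

def loss_gradient(W, X, y):
    # Gradient of a mean-squared-error loss for a linear model (illustrative choice).
    return 2 * X.T @ (X @ W - y) / len(y)

rng = np.random.default_rng(0)
X, y, W = rng.normal(size=(100, 3)), rng.normal(size=100), np.zeros(3)

g_batch = loss_gradient(W, X, y)                     # batch gradient descent: all 100 examples
i = rng.integers(len(y))
g_sgd = loss_gradient(W, X[i:i+1], y[i:i+1])         # stochastic gradient descent: one example
idx = rng.choice(len(y), size=16, replace=False)
g_mini = loss_gradient(W, X[idx], y[idx])            # mini-batch SGD: a batch of 16 examples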
Loss Optimization
W* = argmin_W J(W)
• The amount by which the weights are updated during training is referred to as the step size or the learning rate.
• The learning rate is a configurable hyperparameter used in the training of neural networks; it has a small positive value, often in the range between 0.0 and 1.0.
• The learning rate controls how quickly the model adapts to the problem.
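For concreteness, a minimal sketch (not from the slides) of the learning rate appearing as an optimizer hyperparameter in PyTorch; the toy model and the value 0.01 are illustrative assumptions:

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)                              # toy model, purely illustrative
optimizer = optim.SGD(model.parameters(), lr=0.01)    # lr is the learning rate hyperparameter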
Gradient Descent
4. Compute gradient, ∂J(W)/∂W
5. Update weights, W ← W − η ⋅ ∂J(W)/∂W
6. Return weights
Mini-batch gradients are fast to compute and a much better estimate of the true gradient!
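A compact sketch (not the slides' code) of the loop that these numbered steps describe, here for mini-batch SGD on a toy linear-regression loss; the data, batch size, epoch count, and learning rate are illustrative assumptions:

import numpy as np

# Toy data for a mean-squared-error loss J(W) (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

def gradient(W, X_batch, y_batch):
    # dJ/dW for J(W) = mean((X W - y)^2) on the current mini-batch
    return 2 * X_batch.T @ (X_batch @ W - y_batch) / len(y_batch)

W = rng.normal(size=5)              # initialize weights (random initialization is an assumption)
learning_rate = 0.1
batch_size = 32

for epoch in range(50):             # loop over the data in mini-batches
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = order[start:start + batch_size]
        g = gradient(W, X[batch], y[batch])    # 4. compute gradient, dJ/dW
        W = W - learning_rate * g              # 5. update weights: W <- W - lr * dJ/dW

print(W)                            # 6. return weights (here printed; should be close to true_w)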