GRADIENT DESCENT

Arshia Anjum and Sibabrata Biswal

National Institute of Science Education and Research, Bhubaneswar, Odisha

January 31, 2023

PART I: WHAT IS GRADIENT DESCENT?

1 Introduction
2 The Formula

PART II: GRADIENT DESCENT - DETAILED WORKING

1 Algorithm
2 Gradient Descent of Simple Linear Regression Model (Example)
3 Requirements of Gradient Descent
4 Function Requirements
5 All the Requirements of Gradient Descent: Listed

PART III: VARIOUS TYPES OF GRADIENT DESCENT

1 Types of Gradient Descent
2 Stochastic Gradient Descent (SGD)
3 Batch Gradient Descent (BGD)
4 Mini-Batch Gradient Descent (MBGD)

PART IV: PSEUDO-CODE FOR GRADIENT DESCENT

1 Pseudo-Code for Gradient Descent
2 Pseudo-Code for Stochastic Gradient Descent
3 Pseudo-Code for Batch Gradient Descent
4 Pseudo-Code for Mini-Batch Gradient Descent
5 Python Code for Gradient Descent

PART V: APPLICATIONS OF GRADIENT DESCENT

1 Gradient Descent Example 1
2 Gradient Descent Example 2

PART VI: IS GRADIENT DESCENT A GOOD ALGORITHM?

1 Advantages
2 Disadvantages

Part I

WHAT IS GRADIENT DESCENT?

INTRODUCTION

Gradient Descent is:
▶ An optimisation technique/algorithm.
▶ Mostly used in supervised machine learning models and deep learning.
▶ Also called a first-order optimisation algorithm, because it uses only first derivatives.
▶ One of the most widely used algorithms for optimising the parameters of ML models.
The meaning of Gradient Descent:
▶ Gradient: the first-order derivative, i.e. the slope of a curve.
▶ Descent: movement to a lower point.
▶ The algorithm thus uses the gradient/slope to move towards the minimum (lowest point) of a loss function, for example the Mean Squared Error (MSE).

THE FORMULA

At each iteration of gradient descent, the machine calculates the next point it has to reach by taking the gradient at the current position, scaling it by a learning rate, and subtracting the result from the current parameter value. The formula is:

$$p_{n+1} = p_n - l\,\nabla f(p_n) \tag{1}$$

where $l$ is the learning rate, $p_n$ is the value at iteration $n$ of the parameter to be optimised, and $\nabla f(p_n)$ is the gradient of the expected loss function $f$ evaluated at $p_n$.
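
As a minimal sketch of Eq. (1), the snippet below applies the update rule repeatedly to a one-parameter loss. The loss f(p) = (p - 3)^2, the starting value and the learning rate are illustrative assumptions, not taken from the slides.

```python
def gradient(p):
    """Gradient of the illustrative loss f(p) = (p - 3)**2."""
    return 2.0 * (p - 3.0)


def update(p, learning_rate):
    """One gradient descent step: p_next = p - l * grad f(p)."""
    return p - learning_rate * gradient(p)


p = 0.0              # assumed starting value
learning_rate = 0.1  # assumed learning rate l
history = [p]
for _ in range(50):
    p = update(p, learning_rate)
    history.append(p)

print(history[:3])   # the first few iterates move towards 3
print(p)             # close to the minimiser p = 3
```

Each step here shrinks the distance to the minimiser by the constant factor 1 - 2l = 0.8, which is why the iterates approach p = 3 steadily.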

Part II

GRADIENT DESCENT - DETAILED WORKING

ALGORITHM

▶ Suppose we have two unknown parameters, $p_1$ and $p_2$.
▶ Assume initial values for the parameters $p_1$ and $p_2$.
▶ Holding $p_1$ fixed, plot the expected loss function for various values of the other parameter, $p_2$.
▶ The point on the curve where the value of the function is minimum gives the required value of $p_2$ for the given $p_1$. (Note that this value of $p_2$ might change with a change in $p_1$.)
▶ Now we change the values of $p_1$ and $p_2$ such that the expected loss reduces after each iteration.
▶ To define the iterations, we have the formula:

$$p_{n+1} = p_n - l\,\nabla f(p_n) \tag{2}$$

where $l$ is the learning parameter (learning rate) and $n$ is the iteration number.

GRADIENT DESCENT OF SIMPLE LINEAR REGRESSION MODEL (EXAMPLE)

▶ The Simple Linear Regression Model predicts the outcome through the equation [“Normal Equations” 2008]

$$y_{\text{predicted}} = y^{p} = w \times x + b \tag{3}$$

where $w$ is the slope of the line and $b$ is the intercept; $x$ is the independent variable (the data given to us) and $y$ is the dependent variable (the label we need to find).
▶ The Loss function for this model becomes

$$L(w, b) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i(\text{predicted}) - y_i(\text{actual})\bigr)^2 = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i^{p} - y_i^{a}\bigr)^2 \tag{4}$$

▶ Gradient Descent now has to minimize this Loss function by choosing $w$ and $b$ appropriately.
▶ We begin by choosing initial values $w_0$ and $b_0$. The prediction then becomes:

$$y_i^{p} = w_0 x_i + b_0 \tag{5}$$

▶ Now we have to update $w$ and $b$ iteratively to minimize the Loss function $L(w, b)$. The update equations are:

$$w = w - l\,\Delta w \tag{6}$$
$$b = b - l\,\Delta b \tag{7}$$

where $l$ is the learning parameter.
▶ To calculate $\Delta w$ and $\Delta b$ we use the relations:

$$\Delta w = \frac{\partial L(w, b)}{\partial w} \tag{8}$$
$$\Delta b = \frac{\partial L(w, b)}{\partial b} \tag{9}$$

▶ The Loss function was defined in Eq. (4). Substituting the Simple Linear Regression Model line, Eq. (3), gives

$$L(w, b) = \frac{1}{n}\sum_{i=1}^{n}\bigl(w x_i + b - y_i^{a}\bigr)^2 \tag{10}$$

▶ Differentiating Eq. (10) gives the Delta functions, i.e. the gradient functions [Gradient Descent n.d.]:

$$\frac{\partial L(w, b)}{\partial w} = \frac{2}{n}\sum_{i=1}^{n}\bigl(w x_i + b - y_i^{a}\bigr)\,x_i \tag{11}$$
$$\frac{\partial L(w, b)}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}\bigl(w x_i + b - y_i^{a}\bigr) \tag{12}$$
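
One way to sanity-check the gradients of the mean squared error is to compare them with numerical derivatives. The sketch below assumes the gradients ∂L/∂w = (2/n) Σ (w·x_i + b − y_i)·x_i and ∂L/∂b = (2/n) Σ (w·x_i + b − y_i), obtained by differentiating the loss of Eq. (10). The function names and the small data set are hypothetical and chosen only for illustration.

```python
def mse_loss(w, b, xs, ys):
    """Mean squared error of the line y = w * x + b on the data."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


def mse_gradients(w, b, xs, ys):
    """Analytic partial derivatives of the MSE with respect to w and b."""
    n = len(xs)
    residuals = [w * x + b - y for x, y in zip(xs, ys)]
    dw = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
    db = 2.0 / n * sum(residuals)
    return dw, db


# Hypothetical data and parameter values
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
w, b, h = 0.5, -1.0, 1e-6

dw, db = mse_gradients(w, b, xs, ys)
# Central finite differences approximate the same derivatives numerically
dw_numeric = (mse_loss(w + h, b, xs, ys) - mse_loss(w - h, b, xs, ys)) / (2 * h)
db_numeric = (mse_loss(w, b + h, xs, ys) - mse_loss(w, b - h, xs, ys)) / (2 * h)
print(dw, dw_numeric)
print(db, db_numeric)
```

If the analytic and numerical values disagree noticeably, the derivative formula (or its implementation) is likely wrong, which makes this a cheap check before running the full algorithm.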


▶ After getting the Delta function values (the gradients) from Eqs. (11) and (12), we substitute them into Eqs. (6) and (7).
▶ This counts as one iteration. A single iteration rarely gives accurate values, so more such iterations are done until we reach the best fit.

Figure 1. Linear Regression Example for Gradient Descent. [Visualizing the gradient descent method n.d.]

LEARNING PARAMETER

▶ The learning rate $l$ used in Eqs. (6) and (7) is an important parameter.
▶ It decides the rate at which the algorithm learns/improves its unknown parameters [Jordan 2018].
▶ The gradients were defined in Eqs. (8) and (9); however, used alone they are not enough to make sure we reach the best fit in a low computational time.
▶ The learning parameter $l$ decides the step size of each learning iteration.
▶ If the raw gradient is used as the step, without a learning parameter, three cases can arise:
1. The gradient is too small, and we reach the best-fit point only after too many iterations.
2. The gradient is too large, and we jump over the best-fit point and land farther away from it.
3. The gradient is just right, and we reach the best fit just fine.
▶ To make sure that the step size is just right, we use the learning parameter, which scales the gradient before it is subtracted from the previous value of the parameter to get the new value.
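
The effect of the step size can be seen on a toy problem. The sketch below runs the same number of iterations on f(x) = x^2 with three learning rates; the function and the specific rates are assumptions picked for illustration.

```python
def run(learning_rate, start=1.0, steps=20):
    """Run gradient descent on f(x) = x**2 and return the final x."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * 2.0 * x   # the gradient of x**2 is 2x
    return x


too_small = run(0.001)   # barely moves towards the minimum at 0
just_right = run(0.3)    # shrinks quickly towards 0
too_large = run(1.1)     # overshoots on every step and grows in magnitude

print(too_small, just_right, too_large)
```

With a rate of 1.1 every step multiplies x by 1 - 2.2 = -1.2, so the iterate flips sign and moves farther from the minimum each time, which is the overshooting case described above.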

FUNCTION REQUIREMENTS

▶ The gradient descent algorithm does not work for all types of functions.
▶ For Gradient Descent to be applicable, and to reliably find the global minimum, the function has to satisfy two conditions:
1. It has to be differentiable.
2. It has to be a convex function.

DIFFERENTIABILITY

▶ Being differentiable means the function should have a derivative at each point in its domain (cf. Fig. 2).

Figure 2. Examples of differentiable functions. [Kwiatkowski 2022]


▶ Not all functions satisfy this condition. Some examples of functions that do not satisfy it are given below in Fig. 3.
▶ If the function is not differentiable at all points, a situation may arise where the gradient descent algorithm is not able to iterate the values, as there is no gradient to work with.

Figure 3. Examples of non-differentiable functions. [Kwiatkowski 2022]

CONVEX FUNCTION

▶ A convex function is one in which the line segment connecting any two points on the curve lies on or above the curve.
▶ If a function is non-convex, there is no surety that the gradient descent algorithm gives us the global minimum; it may return a local minimum instead.

Figure 4. Example of convex and non-convex functions. [Kwiatkowski 2022]
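
To illustrate why non-convexity matters, the sketch below runs gradient descent on a non-convex function from two different starting points. The function f(x) = x^4 - 3x^2 + x, the starting points and the learning rate are assumptions chosen for illustration; they are not taken from the cited figure.

```python
def f(x):
    """A non-convex function with two separate minima."""
    return x ** 4 - 3.0 * x ** 2 + x


def f_prime(x):
    return 4.0 * x ** 3 - 6.0 * x + 1.0


def descend(x, learning_rate=0.01, steps=500):
    for _ in range(steps):
        x = x - learning_rate * f_prime(x)
    return x


x_left = descend(-2.0)    # ends near x = -1.30, the global minimum
x_right = descend(2.0)    # ends near x = 1.13, only a local minimum

print(x_left, f(x_left))
print(x_right, f(x_right))
```

Both runs stop at a point where the gradient vanishes, but only the run started on the left reaches the lower of the two minima, so the result depends on the initial point.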


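▶ A special case is where the function has a saddle point. At a saddle point the gradient is zero, yet the point is neither a minimum nor a maximum. In one dimension this typically shows up as a stationary point where the second derivative is also zero (for example $f(x) = x^3$ at $x = 0$). Gradient descent can stall at such a point even though it is not a global minimum.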

Figure 5. Example of saddle point in a function. [Kwiatkowski 2022]
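
A small sketch of stalling near a saddle point, assuming the one-dimensional function f(x) = x^3, which has zero gradient at x = 0 but no minimum there. The starting point and learning rate are illustrative.

```python
def f(x):
    return x ** 3


x = 1.0
for _ in range(1000):
    x = x - 0.1 * 3.0 * x ** 2   # the gradient of x**3 is 3 * x**2

print(x, f(x))   # x creeps towards 0 and the updates become tiny
print(f(-1.0))   # yet lower values of f exist on the other side of 0
```

The gradient shrinks as x approaches 0, so the updates shrink with it and the iterate never crosses to the region where f keeps decreasing.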

ALL THE REQUIREMENTS OF GRADIENT DESCENT: LISTED

Therefore, the gradient descent algorithm takes five inputs as its basic requirements:
▶ Initial Point ($w_0$, $b_0$)
▶ Gradient Function
▶ Learning Rate ($l$)
▶ Number of iterations ($n$)
▶ Tolerance: a threshold that gives the algorithm an end point, so that it stops once the updates become small enough.
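
The five requirements map naturally onto the arguments of a function. The following is a sketch under stated assumptions: the function name, the stopping rule (every component of the step smaller than the tolerance) and the test problem are illustrative choices, not the authors' implementation.

```python
def gradient_descent(start, gradient, learning_rate, max_iterations, tolerance):
    """Generic gradient descent over a list of parameters.

    start          -- initial point, e.g. [w0, b0]
    gradient       -- function returning the gradient at a point
    learning_rate  -- step-size scale l
    max_iterations -- upper cap n on the number of iterations
    tolerance      -- stop once every component of the step is below this
    """
    point = list(start)
    for iteration in range(max_iterations):
        step = [learning_rate * g for g in gradient(point)]
        if all(abs(s) < tolerance for s in step):
            return point, iteration
        point = [p - s for p, s in zip(point, step)]
    return point, max_iterations


# Illustrative use on the quadratic (x - 2)**2 + (y - 3)**2
minimum, iterations_used = gradient_descent(
    start=[0.0, 0.0],
    gradient=lambda p: [2.0 * (p[0] - 2.0), 2.0 * (p[1] - 3.0)],
    learning_rate=0.1,
    max_iterations=1000,
    tolerance=1e-8,
)
print(minimum, iterations_used)
```

Returning the iteration count alongside the result makes it easy to tell whether the run stopped because of the tolerance or because it hit the iteration cap.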

Part III

TYPES OF GRADIENT DESCENT

There are three types of Gradient Descent Algorithms:

1. Stochastic Gradient Descent (SGD)
2. Batch Gradient Descent (BGD)
3. Mini-Batch Gradient Descent (MBGD)

STOCHASTIC GRADIENT DESCENT (SGD)

▶ SGD computes the gradient for only one random sample at each iteration.
▶ This makes SGD faster and more efficient per iteration, as it does not have to process all the data in each of its iterations.
▶ However, the randomness of SGD means that in some cases it can give a suboptimal solution (a local minimum) as the result rather than the global minimum.
▶ One of the techniques to overcome this fault is to decrease the learning rate of the model over time, which reduces the size of the parameter updates with each iteration.
▶ SGD also has its variants, like Mini-Batch SGD, where the gradient descent is done for a random subset of the data, and Momentum SGD, where a term is added to the gradient update to help with the optimisation and to avoid getting stuck at a local minimum.
▶ SGD is widely used in Deep Learning and has found applications in classification, regression, and neural machine translation.

BATCH GRADIENT DESCENT (BGD)

▶ BGD computes the gradient as the average of the gradients of all data samples in the training set.
▶ Therefore, for each iteration, the gradient has to be calculated for the entire dataset, making BGD a computationally expensive process for huge datasets.
▶ However, it also gives BGD the advantage of being more stable than SGD, since its updates are not affected by sampling noise.
▶ The learning parameter has to be chosen carefully, so that the iterates do not overshoot or diverge from the minimum.
▶ BGD is usually used in simpler models with smaller datasets.
▶ It is widely used in linear regression and logistic regression, and not so much in deep learning, where the training set is typically large and the models have many parameters.

MINI-BATCH GRADIENT DESCENT (MBGD)

▶ MBGD is a combination of SGD and BGD.
▶ MBGD, in contrast to BGD, computes the gradient over a subset of the data, called the mini-batch.
▶ This makes each update faster to compute than in BGD and less noisy than in SGD.
▶ The mini-batch size is a trade-off between speed and stability, with smaller sizes leading to faster updates but increased variability in the optimization, and larger sizes leading to more stable convergence but slower processing times.
▶ The learning rate, which determines the size of the parameter update at each iteration, must be carefully tuned to ensure that the optimization converges to the minimum and does not oscillate or diverge.
▶ Mini-Batch Gradient Descent is widely used in deep learning and has been applied to a variety of tasks, including classification, regression, and neural machine translation.

Part IV

PSEUDO-CODE FOR GRADIENT DESCENT

Let us look at the pseudo-code of the gradient descent algorithm:

    Input: parameters (θ), gradient of the loss function with respect to
           the parameters (dθ), learning rate (α)
    Update parameters: θ = θ − α × dθ
    Output: updated parameters (θ)

PSEUDO-CODE FOR STOCHASTIC GRADIENT DESCENT

Let us look at the pseudo-code for stochastic gradient descent:

    INPUT: cost function J(θ), learning rate α, number of iterations N
    INITIALIZE: random θ
    FOR i = 1 to N DO
        Shuffle the training examples
        FOR j = 1 to number of training examples m DO
            Compute the gradient of J with respect to θ for a single training example:
                gradient = ∇_θ J(θ, x_j, y_j)
            Update the parameters θ:
                θ = θ − α × gradient
        END FOR
    END FOR
    OUTPUT: θ
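
The pseudo-code above can be sketched in Python for the simple linear regression model. This is a minimal illustration, assuming a squared-error loss per example, noise-free synthetic data, and hypothetical names and hyperparameters; it shuffles the examples at the start of each pass so that they are visited in random order.

```python
import random


def sgd_linear_regression(xs, ys, learning_rate=0.02, epochs=500, seed=0):
    """Fit y ≈ w * x + b with one-example-at-a-time updates."""
    rng = random.Random(seed)      # fixed seed so that runs are repeatable
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)       # visit the examples in random order
        for j in indices:
            error = w * xs[j] + b - ys[j]
            # Gradient of the single-example squared error, error**2
            grad_w = 2.0 * error * xs[j]
            grad_b = 2.0 * error
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]   # synthetic data generated with w = 2, b = 1
w, b = sgd_linear_regression(xs, ys)
print(w, b)
```

Because the synthetic data lie exactly on a line, the single-example updates settle on that line; with noisy data the parameters would keep fluctuating around the best fit unless the learning rate is decreased over time.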

PSEUDO-CODE FOR BATCH GRADIENT DESCENT

Let us look at the pseudo-code for Batch Gradient Descent:

    INPUT: cost function J(θ), learning rate α, number of iterations N
    INITIALIZE: random θ
    FOR i = 1 to N DO
        Compute the gradient of J with respect to θ for all m training examples:
            gradient = (1/m) × ∇_θ Σ_j J(θ, x_j, y_j)
        Update the parameters θ:
            θ = θ − α × gradient
    END FOR
    OUTPUT: θ

PSEUDO-CODE FOR MINI-BATCH GRADIENT DESCENT

Let us look at the pseudo-code for mini-batch gradient descent:

    INPUT: cost function J(θ), learning rate α, number of iterations N,
           batch size b
    INITIALIZE: random θ
    FOR i = 1 to N DO
        Split the m training examples into B = ⌈m / b⌉ mini-batches of size b
        FOR k = 1 to number of mini-batches B DO
            Compute the gradient of J with respect to θ for the k-th mini-batch
            of training examples (the sum runs over the examples j in that mini-batch):
                gradient = (1/b) × ∇_θ Σ_j J(θ, x_j, y_j)
            Update the parameters θ:
                θ = θ − α × gradient
        END FOR
    END FOR
    OUTPUT: θ
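
A possible Python rendering of the mini-batch pseudo-code, again for the simple linear regression model. The names, the shuffling step, the synthetic data and the hyperparameters are assumptions made for illustration. Setting the batch size equal to the number of examples reduces it to batch gradient descent.

```python
import random


def minibatch_gd(xs, ys, batch_size, learning_rate=0.02, epochs=2000, seed=0):
    """Fit y ≈ w * x + b using gradients averaged over mini-batches."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        # Consecutive slices of the shuffled indices form the mini-batches;
        # the last one may be smaller if batch_size does not divide m.
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [w * xs[j] + b - ys[j] for j in batch]
            grad_w = 2.0 / len(batch) * sum(e * xs[j] for e, j in zip(errors, batch))
            grad_b = 2.0 / len(batch) * sum(errors)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]   # synthetic data generated with w = 2, b = 1
w_mini, b_mini = minibatch_gd(xs, ys, batch_size=2)
w_full, b_full = minibatch_gd(xs, ys, batch_size=len(xs))   # batch gradient descent
print(w_mini, b_mini)
print(w_full, b_full)
```

With a batch size of 2 the parameters are updated three times per pass over this data set, whereas the full-batch run updates them only once per pass.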

PYTHON CODE FOR GRADIENT DESCENT

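import numpy as np

def GradientDescent(X, y, w, b, learningRate, MaxIterations):
    """Fit y ≈ w * X + b by minimising the mean squared error."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    for i in range(MaxIterations):
        # Compute the error/cost function J(w, b); it is not used in the
        # update itself, but is useful for monitoring convergence
        error = w * X + b - y
        J = np.mean(error ** 2)
        # Compute the gradient of the cost function with respect to w and b
        dw = 2 * np.mean(error * X)
        db = 2 * np.mean(error)
        # Update the parameters:
        w = w - learningRate * dw
        b = b - learningRate * db
    return w, b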

Part V

APPLICATIONS OF GRADIENT DESCENT

GRADIENT DESCENT EXAMPLE 1

One common example of gradient descent is training a linear regression model. The model tries to fit a line to a set of data points by minimizing the mean squared error between the predicted values and the actual target values. The model adjusts the parameters (slope and intercept) in the direction of the negative gradient of the error function, until the error reaches a minimum. See Part II, Section 2 (Gradient Descent of Simple Linear Regression Model), which works through this example.

GRADIENT DESCENT EXAMPLE 2

▶ Let’s say we want to minimize the following quadratic function:

$$Q(x, y) = (x - 2)^2 + (y - 3)^2 \tag{13}$$

▶ The gradient of $Q(x, y)$ with respect to $x$ and $y$ is given by:

$$\nabla Q(x, y) = [\,2(x - 2),\; 2(y - 3)\,] \tag{14}$$

▶ The gradient descent algorithm updates $x$ and $y$ in each iteration according to the formulas:

$$x = x - \alpha \times 2(x - 2) \tag{15}$$
$$y = y - \alpha \times 2(y - 3) \tag{16}$$

where $\alpha$ is the learning rate.

▶ By repeating the above updates, the algorithm converges to the minimum of $Q(x, y)$, which is at $(2, 3)$.
▶ The rate of convergence depends on the choice of $\alpha$: as $\alpha$ approaches 0 the convergence becomes slower, while if $\alpha$ is too large the updates overshoot and the algorithm may not converge at all. For this $Q$, each update multiplies the distance to the minimum by $|1 - 2\alpha|$, so the iterates converge only for $0 < \alpha < 1$.
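
The updates of Eqs. (15) and (16) can be tried out directly. The sketch below runs them from an assumed starting point (0, 0) with three illustrative learning rates; subtracting the minimiser from both sides of each update shows that the distance to (2, 3) is multiplied by (1 - 2α) per step, which the three runs reflect.

```python
def minimize_q(x, y, alpha, iterations):
    """Gradient descent on Q(x, y) = (x - 2)**2 + (y - 3)**2."""
    for _ in range(iterations):
        grad_x, grad_y = 2.0 * (x - 2.0), 2.0 * (y - 3.0)
        x, y = x - alpha * grad_x, y - alpha * grad_y
    return x, y


x_good, y_good = minimize_q(0.0, 0.0, alpha=0.1, iterations=200)    # converges
x_slow, y_slow = minimize_q(0.0, 0.0, alpha=0.001, iterations=200)  # still far away
x_bad, y_bad = minimize_q(0.0, 0.0, alpha=1.1, iterations=200)      # diverges

print(x_good, y_good)
print(x_slow, y_slow)
print(x_bad, y_bad)
```

The choice α = 0.5 is a special case for this function: the factor 1 - 2α is zero, so a single update lands exactly on (2, 3).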

Part VI

IS GRADIENT DESCENT A GOOD ALGORITHM?

ADVANTAGES

Gradient descent is a widely used optimization algorithm in fields such as machine learning and deep learning, and in optimization problems generally. It has several advantages, including:
1. Ease of implementation: Gradient descent is relatively easy to implement, as it only requires the calculation of gradients and updates to the parameters. It does not require heavier mathematical machinery such as eigenvalue decomposition or matrix inversion.
2. Convergence guarantee: Gradient descent is guaranteed to converge to a minimum under certain conditions, such as the cost function being differentiable and convex with a unique global minimum, and the learning rate being suitably small. The convergence speed can be controlled by the learning rate, which can be set to achieve a good balance between convergence speed and accuracy.
3. Scalability: Gradient descent can be used for high-dimensional problems and can scale well with large amounts of data. In practice, training with gradient descent can be sped up through techniques such as batch normalization, early stopping, or momentum.
4. Versatility: Gradient descent can be applied to a wide range of optimization problems, including linear regression, logistic regression, and neural networks. It can also be used for non-convex optimization problems, where multiple local minima may exist, although convergence to a good solution is less certain in such cases.

DISADVANTAGES

However, gradient descent has some limitations as well, including:

1. Sensitivity to learning rate: The choice of the learning rate can have a significant impact on the convergence of the algorithm. If the learning rate is too high, the algorithm may oscillate or diverge, while if it is too low, the convergence may be slow.
2. Sensitivity to initialization: The initial values of the parameters can also have an impact on the convergence of the algorithm. If the parameters are initialized too far from the minimum, the algorithm may converge slowly, or get stuck in a suboptimal solution.
3. Convergence speed: The convergence speed of gradient descent can be slow for some problems, especially for high-dimensional or non-convex optimization problems. This can be mitigated to some extent by using techniques such as mini-batch gradient descent, momentum, or adaptive learning rates.
4. Termination of Algorithm: The termination of this algorithm isn’t as easy as it seems. It is done either by giving a tolerance limit or by putting an upper cap on the number of iterations.
5. Correctness: Although the algorithm moves towards a point where the gradient vanishes, it does not guarantee that this point is the global minimum, so the correctness of the solution is not guaranteed.

REFERENCES

Gradient Descent (n.d.). https://www.niser.ac.in/ smishra/teach/cs460/2020/lectures/lec8/. [Online; accessed 2023-01-31].
Jordan, Jeremy (Mar. 2018). Setting the learning rate of your neural network. https://www.jeremyjordan.me/nn-learning-rate/. [Online; accessed 2023-01-31].
Kwiatkowski, Robert (July 2022). Gradient Descent Algorithm — a deep dive. https://towardsdatascience.com/gradient-descent-algorithm-a-deep-dive-cf04e8115f21. [Online; accessed 2023-01-31].
“Normal Equations” (2008). In: The Concise Encyclopedia of Statistics. New York, NY: Springer New York, pp. 380–382. ISBN: 978-0-387-32833-1. DOI: 10.1007/978-0-387-32833-1_286. URL: https://doi.org/10.1007/978-0-387-32833-1_286.
Visualizing the gradient descent method (n.d.). https://scipython.com/blog/visualizing-the-gradient-descent-method/. [Online; accessed 2023-01-31].
