Gradient Descent
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 The Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1 / 39
PART II: GRADIENT DESCENT - DETAILED WORKING
1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
4 Function Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2 / 39
PART III: VARIOUS TYPES OF GRADIENT DESCENT
3 / 39
PART IV: PSEUDO-CODE FOR GRADIENT DESCENT
4 / 39
PART V: APPLICATION OF GRADIENT DESCENT
5 / 39
PART VI: IS GRADIENT DESCENT A GOOD ALGORITHM?
1 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2 Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6 / 39
Part I
7 / 39
INTRODUCTION
Gradient Descent is
▶ An optimisation technique/algorithm.
▶ Mostly used in supervised machine learning models and deep learning.
▶ Also called a first-order optimisation algorithm.
▶ One of the most used algorithms for optimising the parameters of ML models.
The meaning of Gradient Descent:
▶ Gradient - the first-order derivative/slope of a curve.
▶ Descent - movement to a lower point.
▶ The algorithm thus makes use of the gradient/slope to reach the minimum/lowest point of a loss function, such as the Mean Squared Error (MSE).
8 / 39
THE FORMULA
While performing the algorithm of gradient descent, the machine iteratively calculates the next point it has to reach by using the gradient at the current position, scaling it with a learning rate, and subtracting the result from the current parameter value. The formula for this update is
p_{n+1} = p_n − l ∇f(p_n)   (1)
where l is the learning rate, p_n is the parameter to be optimised, and ∇f(p_n) denotes the gradient of the expected loss function.
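As a minimal illustration of this update rule (not taken from the slides), the short Python sketch below applies it to the toy convex function f(p) = p², whose gradient is ∇f(p) = 2p; the function, learning rate, and starting value are chosen purely for demonstration.

def grad_f(p):
    return 2 * p            # gradient of the toy function f(p) = p**2

l = 0.1                     # learning rate
p = 5.0                     # initial parameter value
for n in range(50):         # fixed number of iterations
    p = p - l * grad_f(p)   # the update p_{n+1} = p_n - l * grad f(p_n)

print(p)                    # ends up close to 0, the minimum of f(p) = p**2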
9 / 39
Part II
10 / 39
ALGORITHM
At every iteration, the parameter value is updated as p_{n+1} = p_n − l ∇f(p_n), where l defines the learning parameter and n defines the iteration number.
11 / 39
GRADIENT DESCENT OF SIMPLE LINEAR REGRESSION MODEL (EXAMPLE)
▶ The Simple Linear Regression Model predicts the outcome through the equation [“Normal Equations” 2008]
y_{predicted} = y_p = w × x + b   (3)
where w is the slope of the linear curve and b is the intercept value; x is the independent variable (the data given to us) and y is the dependent variable (the label we need to find).
▶ The Loss function for the same becomes
L(w, b) = (1/n) Σ_{i=0}^{n} (y_i(predicted) − y_i(actual))² = (1/n) Σ_{i=0}^{n} (y_i^p − y_i^a)²   (4)
▶ Now, Gradient Descent has to minimize this Loss function by choosing a set of w and b
appropriately.
▶ We begin by choosing an initial set of values w_0 and b_0. Then the equation becomes:
y_i^p = w_0 x_i + b_0   (5)
▶ Now we have to update w and b iteratively to minimize the Loss function L(w, b). The update equations, sketched in code below, become:
w = w − l∆w (6)
b = b − l∆b (7)
where l is the learning parameter.
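The gradient expressions used below are the standard mean-squared-error gradients for ∆w and ∆b (the slide deriving Eqs. (8)-(12) is not reproduced here), so this Python sketch is only an illustrative implementation of Eqs. (5)-(7) under that assumption; the data and hyperparameters are made up.

import numpy as np

# Illustrative data: x is the independent variable, y the labels.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

w, b = 0.0, 0.0   # initial values w_0, b_0
l = 0.02          # learning parameter
n = len(x)

for _ in range(5000):
    y_pred = w * x + b                        # Eq. (5): predictions with current w, b
    error = y_pred - y
    delta_w = (2.0 / n) * np.sum(error * x)   # assumed standard MSE gradient w.r.t. w
    delta_b = (2.0 / n) * np.sum(error)       # assumed standard MSE gradient w.r.t. b
    w = w - l * delta_w                       # Eq. (6)
    b = b - l * delta_b                       # Eq. (7)

print(w, b)  # approaches the best-fit slope and intercept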
12 / 39
GRADIENT DESCENT OF SIMPLE LINEAR REGRESSION MODEL (EXAMPLE)
13 / 39
GRADIENT DESCENT OF SIMPLE LINEAR REGRESSION MODEL (EXAMPLE)
▶ After getting the Delta values/gradients through Eqs. (11) and (12), we substitute them into Eqs. (6) and (7).
▶ This counts as one iteration; however, it is possible to get more accurate values, so more such iterations are done until we reach the best fit.
Figure. Linear Regression Example for Gradient Descent.[Visualizing the gradient descent method n.d.]
14 / 39
LEARNING PARAMETER
15 / 39
FUNCTION REQUIREMENTS
▶ The gradient descent algorithm does not work for all types of functions.
▶ The function has to satisfy two conditions for Gradient Descent to be applicable to it:
1. It has to be differentiable
2. It has to be a convex function
16 / 39
FUNCTION REQUIREMENTS
DIFFERENTIABILITY
▶ Being differentiable means the function should have a derivative at each point in its domain.
(cf. 2)
17 / 39
FUNCTION REQUIREMENTS
DIFFERENTIABILITY
▶ Not all functions satisfy this condition. Some examples of functions that do not satisfy it are given below in Fig. 3.
▶ If the function is not differentiable at all points, a circumstance may arise where the gradient descent algorithm is not able to iterate the values, as there is no gradient to work with.
18 / 39
FUNCTION REQUIREMENTS
CONVEX FUNCTION
▶ A convex function is one in which, if any two points on the curve are connected, the line segment between them lies on or above the curve.
▶ If a function is non-convex, we have no guarantee that the gradient descent algorithm will give us the global minimum as the result; it may return only a local minimum.
19 / 39
FUNCTION REQUIREMENTS
CONVEX FUNCTION
▶ A special case is where the function has a saddle point. At the saddle point the gradient is zero, but the second derivative is also zero at that point. This identifies the point as a saddle and hence not a global minimum.
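For instance, a standard one-dimensional illustration (not from the slides) is f(x) = x³ at x = 0: the gradient f′(0) = 0 and the second derivative f″(0) = 0, yet x = 0 is a saddle/inflection point rather than a minimum, so gradient descent can slow down or stall near it.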
20 / 39
ALL THE REQUIREMENTS OF GRADIENT DESCENT: LISTED
Therefore, the gradient descent algorithm takes 5 parameters as its basic requirements (a short sketch of how they fit together follows this list):
▶ Initial Point (w0 , b0 )
▶ Gradient Function
▶ Learning Rate (l)
▶ Number of iterations (n)
▶ Tolerance - provides a stopping criterion / end point for the algorithm.
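A minimal Python sketch of how these five inputs could fit together is given below; the function name, signature, and example values are hypothetical, not taken from the slides.

import numpy as np

def gradient_descent(start, gradient, learn_rate, n_iter, tolerance):
    """Generic gradient descent built from the five listed inputs (illustrative sketch)."""
    point = np.asarray(start, dtype=float)      # initial point, e.g. (w_0, b_0)
    for _ in range(n_iter):                     # number of iterations
        step = learn_rate * np.asarray(gradient(point))
        if np.all(np.abs(step) <= tolerance):   # tolerance: stop once the steps get tiny
            break
        point = point - step
    return point

# Example: minimise f(w, b) = w**2 + b**2, whose gradient is (2w, 2b).
minimum = gradient_descent(start=[3.0, -4.0],
                           gradient=lambda p: 2 * p,
                           learn_rate=0.1, n_iter=100, tolerance=1e-6)
print(minimum)  # close to (0, 0)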
21 / 39
Part III
22 / 39
TYPES OF GRADIENT DESCENT
23 / 39
STOCHASTIC GRADIENT DESCENT (SGD)
▶ SGD computes the gradient for only one random sample at each iteration.
▶ This property helps SGD be faster and more efficient, as it does not have to process all the data in each of its iterations.
▶ However, the randomness of SGD means that it can, in some cases, give a suboptimal solution/local minimum as the result rather than the global minimum.
▶ One of the techniques to overcome this fault is to decrease the learning rate of the model over time, which helps in reducing the parameter updates with each iteration (illustrated in the sketch below).
▶ SGD also has its variants, like Mini-Batch SGD, where the gradient descent step is done for a random subset of the data, and Momentum SGD, where a term is added to the gradient update to help with the optimisation and to avoid getting stuck at a local minimum.
▶ SGD is majorly used in Deep Learning and has found applications in classification, regression, and neural machine translation.
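As a rough sketch of these ideas (one random sample per update, plus a learning rate that decays over time), the following Python snippet applies SGD to a simple linear model; the data, decay schedule, and constants are assumptions for illustration only.

import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

w, b = 0.0, 0.0
base_rate = 0.01

for epoch in range(500):
    l = base_rate / (1 + 0.01 * epoch)        # decrease the learning rate over time
    for j in rng.permutation(len(x)):         # visit samples in random order
        error = (w * x[j] + b) - y[j]
        w = w - l * 2 * error * x[j]          # gradient from this single sample only
        b = b - l * 2 * error

print(w, b)  # close to the full-batch solution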
24 / 39
BATCH GRADIENT DESCENT (BGD)
▶ BGD computes the gradient based on the average of the gradients of all data samples in the
training set.
▶ Therefore, for each iteration, the gradient has to be calculated for the entire dataset, making
BGD a computationally expensive process for huge datasets.
▶ However, it also gives BGD the advantage of being more stable and avoiding overfitting
compared to SGD.
▶ The learning parameter has to be chosen carefully, so that with each iteration the step size does not grow or diverge away from the minimum.
▶ BGD is usually used in simpler models with smaller datasets.
▶ It is widely used in linear regression and logistic regression, but not so much in deep learning, where the training set is typically large and the models have many parameters (a compact code sketch follows below).
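A compact sketch of the batch update, in which the gradient is averaged over the whole training set before each parameter update, could look as follows in Python; the generic linear model, data, and settings are illustrative assumptions.

import numpy as np

# Design matrix X (with a bias column of ones) and targets y, purely illustrative.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

theta = np.zeros(2)   # parameters (intercept, slope)
l = 0.05              # learning rate, kept small so the steps do not diverge

for _ in range(2000):
    predictions = X @ theta
    # Average the gradient over ALL samples in the training set (one full pass per update).
    grad = (2.0 / len(y)) * X.T @ (predictions - y)
    theta = theta - l * grad

print(theta)  # approaches intercept 1 and slope 2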
25 / 39
MINI-BATCH GRADIENT DESCENT (MBGD)
26 / 39
Part IV
27 / 39
PSEUDO-CODE FOR GRADIENT DESCENT
28 / 39
PSEUDO-CODE FOR STOCHASTIC GRADIENT DESCENT
INPUT: training data {(xj, yj)}, learning rate α, number of epochs E
Initialise the parameters θ
FOR each epoch
    FOR each training sample (xj, yj)
        Compute the gradient for the single sample: gradient = ∇θ J(θ, xj, yj)
        Update the parameters θ: θ = θ − α × gradient
    END FOR
END FOR
OUTPUT: θ
29 / 39
PSEUDO-CODE FOR BATCH GRADIENT DESCENT
30 / 39
PSEUDO-CODE FOR MINI-BATCH GRADIENT DESCENT
31 / 39
PYTHON CODE FOR GRADIENT DESCENT
32 / 39
Part V
33 / 39
GRADIENT DESCENT EXAMPLE 1
One common example of gradient descent is training a linear regression model. The model tries to
fit a line to a set of data points by minimizing the mean squared error between the predicted values
and the actual target values. The model adjusts the parameters (slope and intercept) in the direction
of the negative gradient of the error function, until the error reaches a minimum. See section (2),
which clearly explains this example.
34 / 39
GRADIENT DESCENT EXAMPLE 2
▶ The gradient descent algorithm updates x and y in each iteration according to the formulas below (a small code sketch follows):
x = x − α × 2(x − 2) (15)
y = y − α × 2(y − 3) (16)
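These two updates correspond to minimising what appears to be f(x, y) = (x − 2)² + (y − 3)²; the objective itself is not visible on this slide, so that form is inferred from the gradients. A small Python sketch, with an assumed starting point and learning rate:

alpha = 0.1        # learning rate (assumed value)
x, y = 0.0, 0.0    # assumed starting point

for _ in range(100):
    x = x - alpha * 2 * (x - 2)   # Eq. (15)
    y = y - alpha * 2 * (y - 3)   # Eq. (16)

print(x, y)  # converges towards the minimum at (2, 3)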
35 / 39
Part VI
36 / 39
ADVANTAGES
Gradient descent is a widely used optimization algorithm in fields such as machine learning and deep learning, as well as in general optimization problems. It has several advantages, including:
1. Ease of implementation: Gradient descent is relatively easy to implement, as it only requires the calculation of gradients and updates to the parameters. It does not require complex matrix techniques such as eigenvalue decomposition or matrix inversion.
2. Convergence guarantee: Gradient descent is guaranteed to converge to a minimum, under
certain conditions such as the cost function being differentiable and having a unique global
minimum. The convergence speed can be controlled by the learning rate, which can be set
appropriately to achieve a good balance between convergence speed and accuracy.
3. Scalability: Gradient descent can be used for high-dimensional problems and can scale well
with large amounts of data. In practice, the convergence of gradient descent can be accelerated
through techniques such as batch normalization, early stopping, or momentum.
4. Versatility: Gradient descent can be applied to a wide range of optimization problems,
including linear regression, logistic regression, and neural networks. It can also be used for
non-convex optimization problems, where multiple local minima may exist, although
convergence to a good solution may be less certain in such cases.
37 / 39
DISADVANTAGES
38 / 39
REFERENCES I
39 / 39