
Linear Regression
(Gradient Descent Optimization)

Dr. JASMEET SINGH
ASSISTANT PROFESSOR, CSED
TIET, PATIALA
Simple Linear Regression- Cost Function Overview

• We know, the linear function that binds the input variable x with the corresponding predicted value ŷ is given by:

  $\hat{y} = \beta_0 + \beta_1 x$

• The cost function (mean square error function) is given by:

  $J(\beta_0, \beta_1) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$

  (a small computational sketch of this cost is given after Table 1 below)

• The cost function is a function of β₀ and β₁. Let us plot the cost function as a function of β₁, considering β₀ = 0 (for the sake of simplicity, i.e., a 2D view). Consider the dataset shown in Table 1.

Table 1: Dataset for SLR containing 3 instances

Independent Variable (xᵢ) | Dependent Variable (yᵢ)
1 | 1
2 | 2
3 | 3
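To make the cost function concrete, here is a small Python sketch (not part of the original slides) that evaluates J(β₁) on the Table 1 dataset with β₀ fixed to 0; the printed values are the ones tabulated in Table 2 on the next slide.

```python
import numpy as np

# Table 1 dataset
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def cost(beta1, beta0=0.0):
    """Mean square error J(beta0, beta1) = (1/n) * sum (y_i - beta0 - beta1*x_i)^2."""
    return np.mean((y - beta0 - beta1 * x) ** 2)

for b1 in [1.0, 0.5, 0.0, 1.5, 2.0]:
    print(f"J(beta1={b1}) = {cost(b1):.2f}")   # 0.00, 1.17, 4.67, 1.17, 4.67
```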
Plot of Cost Function of SLR

Table 2: Value of the cost function J(β₁) for different values of β₁ (using the dataset shown in Table 1), where $J(\beta_1) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta_1 x_i)^2$

S.No | β₁  | J(β₁)
1    | 1   | (1/3)[(1 − 1)² + (2 − 2)² + (3 − 3)²] = 0
2    | 0.5 | (1/3)[(1 − 0.5)² + (2 − 1)² + (3 − 1.5)²] = 1.17
3    | 0   | (1/3)[(1 − 0)² + (2 − 0)² + (3 − 0)²] = 4.67
4    | 1.5 | (1/3)[(1 − 1.5)² + (2 − 3)² + (3 − 4.5)²] = 1.17
5    | 2   | (1/3)[(1 − 2)² + (2 − 4)² + (3 − 6)²] = 4.67

[Figure: Plot of J(β₁) vs. β₁ for the values in Table 2]

It is clear from the above that the cost function is parabolic in shape (bowl-shaped) with one point of minimum, where the mean square error is zero.
Plot of Cost Function of SLR
(Cost Function as a function of β₀ and β₁)

SURFACE PLOT: a bowl-shaped curve with only one point of minimum.

CONTOUR PLOT: lines of the same color mean the same value of the cost function at different points of β₀ and β₁.

A contour plot represents a 3-dimensional surface by plotting constant z slices, called contours, on a 2-dimensional format.
Gradient Descent Optimization- Introduction
• Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function.
• Gradient descent is simply used to find the values of a function's parameters (coefficients) that minimize a cost function as far as possible, i.e.,

  $\arg\min_{\beta_0, \beta_1, \dots, \beta_k} J(\beta_0, \beta_1, \beta_2, \dots, \beta_k)$

• It is based on minimizing a convex cost function and tweaks its parameters iteratively to drive the given function to its local minimum.
• It uses the gradient of the cost function to tune the parameters.
• The gradient can be considered as the slope of a function. The higher the gradient, the steeper the slope and the faster a model can learn. But if the slope is zero, the model stops learning.
• In mathematical terms, the gradient is the vector of partial derivatives of the function with respect to its inputs (here, the parameters).
Gradient Descent Optimization- Introduction
• The image illustrates the cost function from a top-down view; the black arrows are the steps of the gradient descent algorithm.
• The algorithm may reach a different local or global minimum depending upon the initial values of β₀ and β₁.
• The gradient in this context is a vector that contains the direction of the steepest step the algorithm can take and also how long that step should be.
Steps of Gradient Descent Optimization

• In order to minimize any differentiable cost function $J(\beta_0, \beta_1, \beta_2, \dots, \beta_k)$ containing parameters β₀, β₁, β₂, ..., βₖ, the following steps are followed in gradient descent optimization:

1. Initialize the parameters β₀, β₁, β₂, ..., βₖ to arbitrary values. Usually, they are all set to 0 as the initial value.

2. Update the values of the parameters β₀, β₁, β₂, ..., βₖ using the following equation (until convergence or for a fixed number of iterations):

  $\beta_j := \beta_j - \alpha \frac{\partial J(\beta_0, \beta_1, \dots, \beta_k)}{\partial \beta_j} \quad \text{for } j = 0, 1, 2, \dots, k$

• This update must be simultaneous, i.e., the RHS of the above equation must be stored in a temporary variable for each value of j and only then assigned to all parameters at once (a minimal sketch of this is given below).

• Here α is called the learning rate, a fixed step size that controls how large each update step is, and $\frac{\partial J(\beta_0, \dots, \beta_k)}{\partial \beta_j}$ is called the gradient of the cost function.

• Convergence of the βⱼ's means that there is no further change in the value of βⱼ, which happens only when $\frac{\partial J(\beta_0, \dots, \beta_k)}{\partial \beta_j} = 0$.
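To make the simultaneous-update rule concrete, here is a minimal Python sketch (not part of the original slides); it assumes a user-supplied function grad_J that returns the vector of partial derivatives ∂J/∂βⱼ at the current parameter values.

```python
import numpy as np

def gradient_descent(grad_J, beta_init, alpha=0.01, n_iters=1000, tol=1e-8):
    """Generic gradient descent with simultaneous parameter updates.

    grad_J    : callable returning the gradient vector [dJ/d_beta_j] at beta
    beta_init : initial parameter values (usually all zeros)
    alpha     : learning rate (fixed step size)
    """
    beta = np.asarray(beta_init, dtype=float)
    for _ in range(n_iters):
        grad = grad_J(beta)               # all partial derivatives at the CURRENT beta
        new_beta = beta - alpha * grad    # simultaneous update of every beta_j
        if np.max(np.abs(new_beta - beta)) < tol:   # convergence: beta no longer changes
            return new_beta
        beta = new_beta
    return beta
```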
Gradient Descent Optimization- Intuition

• The intuition behind gradient descent optimization is that, although it may start from any arbitrary point, it converges at some local or global minimum.

• For instance, consider a cost function with only one parameter (θ₁). The shape of the cost function is shown in figures 1 and 2.

• If we start from the θ₁ shown in figure 1, then the gradient $\frac{\partial J(\theta_1)}{\partial \theta_1}$ is positive. Therefore, $\theta_1 := \theta_1 - (\text{positive quantity})$. So, it will slowly move towards the minimum point.

• If we start from the θ₁ shown in figure 2, then the gradient $\frac{\partial J(\theta_1)}{\partial \theta_1}$ is negative. Therefore, $\theta_1 := \theta_1 + (\text{positive quantity})$. So, it will again slowly move towards the minimum point.
Learning Rate in Gradient Descent

Why is the learning rate fixed?

• The gradient descent algorithm can converge to a local minimum even with a fixed learning rate.

• As we approach a local minimum, gradient descent will automatically take smaller steps, because the gradient itself becomes smaller.

• So, there is no need to decrease the learning rate over time.
Learning Rate in Gradient Descent Contd…

What if the learning rate is too small?
• If the learning rate is too small, then gradient descent will take a lot of time to converge (as shown in the figure).

What if the learning rate is too large?
• If the learning rate is too large, then gradient descent can overshoot the minimum. It may fail to converge or even diverge (as shown in the figure below; a small numerical sketch contrasting the two cases also follows).
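As an illustrative sketch (not from the slides), the snippet below runs gradient descent on the one-parameter cost J(β₁) from Table 2; the learning rates 0.01 and 0.65 are arbitrary values chosen only to show slow convergence versus overshooting and divergence.

```python
import numpy as np

# Table 1 dataset and the one-parameter cost J(beta1) = (1/n) * sum (y_i - beta1*x_i)^2
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def grad(beta1):
    # dJ/d_beta1 = -(2/n) * sum (y_i - beta1*x_i) * x_i
    return -2.0 / len(x) * np.sum((y - beta1 * x) * x)

def run(alpha, steps=5, beta1=0.0):
    trace = [beta1]
    for _ in range(steps):
        beta1 = beta1 - alpha * grad(beta1)
        trace.append(beta1)
    return [round(b, 3) for b in trace]

print("alpha = 0.01 (too small, creeps towards the minimum at 1):", run(0.01))
print("alpha = 0.65 (too large, overshoots and diverges):", run(0.65))
```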
Gradient Descent Optimization for Multiple Linear Regression (MLR)

• A multiple linear regression model with k independent predictor variables x₁, x₂, ..., xₖ predicts the output variable as:

  $\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_k x_k$

• The cost function (mean square error function) is given by:

  $J = \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \beta_3 x_{i3} - \dots - \beta_k x_{ik})^2$

• The gradient of the cost function with respect to the parameters is given by:

  $\frac{\partial J}{\partial \beta_0} = \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_k x_{ik} - y_i)$
Gradient Descent Optimization for MLR

Similarly,

  $\frac{\partial J}{\partial \beta_1} = \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_k x_{ik} - y_i) \times x_{i1}$

  $\frac{\partial J}{\partial \beta_2} = \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_k x_{ik} - y_i) \times x_{i2}$

In general,

  $\frac{\partial J}{\partial \beta_j} = \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_k x_{ik} - y_i) \times x_{ij}$
Gradient Descent Optimization for MLR

The gradient descent optimization for Multiple Linear Regression is summarized as below:

1. Initialize β₀ = 0, β₁ = 0, β₂ = 0, ..., βₖ = 0.

2. Update the parameters until convergence or for a fixed number of iterations using the following equation:

  $\beta_j := \beta_j - \frac{\alpha}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_k x_{ik} - y_i) \times x_{ij}$

  for j = 0, 1, 2, 3, ..., k, where xᵢ₀ = 1 and k is the number of predictor variables. A minimal implementation sketch of these two steps is given below.
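The following Python/NumPy sketch implements the two steps above; the function name and argument defaults are illustrative, not from the slides.

```python
import numpy as np

def mlr_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Gradient descent for multiple linear regression.

    X : (n, k) matrix of predictor values x_i1 ... x_ik
    y : (n,) vector of observed outputs
    Returns the fitted coefficients [beta_0, beta_1, ..., beta_k].
    """
    n, k = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])   # prepend x_i0 = 1 for the intercept beta_0
    beta = np.zeros(k + 1)                 # step 1: initialize every beta_j to 0
    for _ in range(n_iters):
        errors = Xb @ beta - y             # (beta_0 + beta_1*x_i1 + ... + beta_k*x_ik - y_i)
        grad = (Xb.T @ errors) / n         # dJ/d_beta_j = (1/n) * sum(errors * x_ij)
        beta = beta - alpha * grad         # step 2: simultaneous update of all beta_j
    return beta
```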
Gradient Descent Optimization for MLR- Example

Consider the following dataset that shows the Stock Index Price as a function of Interest Rate and Unemployment Rate. For the given dataset, show the first iteration of gradient descent optimization for linear regression. Initialize β₀ = 0, β₁ = 0, β₂ = 0 and consider the learning rate as 0.01.

Interest Rate (xᵢ₁) | Unemployment Rate (xᵢ₂) | Stock Index Price (yᵢ)
2.75 | 5.3 | 1464
2.5  | 5.3 | 1394
2.25 | 5.5 | 1159
2    | 5.7 | 1130
2    | 5.9 | 1075
2    | 6   | 1047
1.75 | 5.9 | 965
1.75 | 6.1 | 719
Gradient Descent Optimization for MLR- Example

Initially, β₀ = 0, β₁ = 0, β₂ = 0 and n = 8. The sums needed for the update are tabulated below:

S.No  | xᵢ₁  | xᵢ₂  | yᵢ   | xᵢ₁xᵢ₂ | xᵢ₁yᵢ    | xᵢ₂yᵢ   | (xᵢ₁)² | (xᵢ₂)²
1     | 2.75 | 5.3  | 1464 | 14.575 | 4026     | 7759.2  | 7.5625 | 28.09
2     | 2.5  | 5.3  | 1394 | 13.25  | 3485     | 7388.2  | 6.25   | 28.09
3     | 2.25 | 5.5  | 1159 | 12.375 | 2607.75  | 6374.5  | 5.0625 | 30.25
4     | 2    | 5.7  | 1130 | 11.4   | 2260     | 6441    | 4      | 32.49
5     | 2    | 5.9  | 1075 | 11.8   | 2150     | 6342.5  | 4      | 34.81
6     | 2    | 6    | 1047 | 12     | 2094     | 6282    | 4      | 36
7     | 1.75 | 5.9  | 965  | 10.325 | 1688.75  | 5693.5  | 3.0625 | 34.81
8     | 1.75 | 6.1  | 719  | 10.675 | 1258.25  | 4385.9  | 3.0625 | 37.21
Total | 17   | 45.7 | 8953 | 96.4   | 19569.75 | 50666.8 | 37     | 261.75

Iteration 1:

$temp_0 := \beta_0 - \frac{\alpha}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} - y_i)$

$temp_0 := 0 - \frac{0.01}{8} (8 \times 0 + 17 \times 0 + 45.7 \times 0 - 8953) = 11.19$

$temp_1 := \beta_1 - \frac{\alpha}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} - y_i) \times x_{i1}$

$temp_1 := 0 - \frac{0.01}{8} (17 \times 0 + 37 \times 0 + 96.4 \times 0 - 19569.75) = 24.46$

$temp_2 := \beta_2 - \frac{\alpha}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} - y_i) \times x_{i2}$

$temp_2 := 0 - \frac{0.01}{8} (45.7 \times 0 + 96.4 \times 0 + 261.75 \times 0 - 50666.8) = 63.33$

After the first iteration: β₀ = 11.19, β₁ = 24.46, β₂ = 63.33. (A quick computational check follows below.)
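As a quick check (not part of the slides), the first-iteration values can be reproduced with a single simultaneous update in NumPy; the snippet below hard-codes the example dataset.

```python
import numpy as np

# Example dataset: Interest Rate, Unemployment Rate -> Stock Index Price
X = np.array([[2.75, 5.3], [2.5, 5.3], [2.25, 5.5], [2.0, 5.7],
              [2.0, 5.9], [2.0, 6.0], [1.75, 5.9], [1.75, 6.1]])
y = np.array([1464, 1394, 1159, 1130, 1075, 1047, 965, 719], dtype=float)

n = len(y)
Xb = np.hstack([np.ones((n, 1)), X])          # add x_i0 = 1 for beta_0
beta = np.zeros(3)                            # beta_0 = beta_1 = beta_2 = 0
alpha = 0.01

errors = Xb @ beta - y                        # with all betas 0, errors are simply -y_i
beta = beta - (alpha / n) * (Xb.T @ errors)   # one simultaneous update of all three betas
print(np.round(beta, 2))                      # expected: [11.19 24.46 63.33]
```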