Stochastic Gradient Descent

Suppose we have a data set with input 𝑥 and output 𝑦. We want to train a model that can predict the output 𝑦 given a new input 𝑥. We can do this by minimizing a cost function 𝐽 that measures the difference between the predicted output and the true output:

J = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2

where 𝑚 is the number of training examples, 𝑦̂𝑖 is the predicted output and 𝑦𝑖 is the actual output.

We want to find the parameters of the model that minimize the cost function 𝐽. Stochastic Gradient Descent (SGD) is an algorithm that updates the parameters of the model in small steps based on the gradient of the cost function with respect to those parameters. We do this for each training example in the data set rather than for all of them at once.
The update rule is given as

\theta = \theta - \alpha \frac{dJ}{d\theta}

where 𝜃 is the model parameter, 𝛼 is the learning rate, and dJ/d𝜃 is the gradient of the cost function 𝐽 with respect to the parameter 𝜃.

We compute the gradient for each training example and update the parameters accordingly.

Let us analyze an example of stochastic gradient descent applied to a simple linear regression problem.

The data set is 𝑥 = [1, 2, 3, 4, 5] and 𝑦 = [2, 4, 5, 4, 5]. Our objective is to find the line that best predicts 𝑦 for this data set, using the mean squared error as our loss function:

MSE(J) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2
Sometimes we use 2 × batch size, i.e., 2𝑁 instead of 𝑁, in the denominator for mathematical simplicity when the derivative is taken:

MSE(J) = \frac{1}{2N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2
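The simplification is that, when a squared term is differentiated, the 2 from the power rule cancels the extra 2 in the denominator. For a single term, with the linear model 𝑦̂𝑖 = 𝑤𝑥𝑖 + 𝑏 introduced just below, differentiating with respect to 𝑤 gives, for example:

\frac{d}{dw}\left[\frac{1}{2N}\left((w x_i + b) - y_i\right)^2\right] = \frac{1}{N}\left((w x_i + b) - y_i\right) x_i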

We know that

\hat{y}_i = w x_i + b

where 𝑤 is the weight, 𝑥𝑖 is the input and 𝑏 is the bias.
Therefore
MSE(J) = \frac{1}{2N} \sum_{i=1}^{N} \left((w x_i + b) - y_i\right)^2
To use stochastic gradient descent, we will minimize the mean squared error loss function by updating our weight and bias iteratively, based on a randomly sampled subset of the data known as a 'mini-batch'.

The update rule is given as

\theta = \theta - \alpha \frac{dJ}{d\theta}

For estimating parameter 𝑤, the update rule becomes

w = w - \alpha \frac{d}{dw}\{J\}
Or
w = w - \alpha \frac{d}{dw}\left\{\frac{1}{2m} \sum_{i=1}^{m} \left((w x_i + b) - y_i\right)^2\right\}
Or
w = w - \alpha \times \frac{1}{m} \sum_{i=1}^{m} \left((w x_i + b) - y_i\right) \times x_i

Similarly for parameter 𝑏, the update rule becomes,

b = b - \alpha \times \frac{1}{m} \sum_{i=1}^{m} \left((w x_i + b) - y_i\right)

where 𝛼 is a hyperparameter called the 'learning rate' (which controls the step size), and 𝑚 is the mini-batch size, i.e., the number of data points in a single batch.
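As a minimal sketch of these two update rules in Python (the function name sgd_step and the use of plain Python lists are choices made for this illustration, not something from the text):

def sgd_step(w, b, xs, ys, lr):
    """Perform one SGD update of w and b on a mini-batch (xs, ys)."""
    m = len(xs)
    # Gradients of (1/(2m)) * sum(((w*x + b) - y)^2) with respect to w and b
    grad_w = sum(((w * x + b) - y) * x for x, y in zip(xs, ys)) / m
    grad_b = sum(((w * x + b) - y) for x, y in zip(xs, ys)) / m
    # Update rule: parameter = parameter - learning_rate * gradient
    return w - lr * grad_w, b - lr * grad_b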
Let us initialize the parameters.
Suppose 𝑤 = 0, 𝑏 = 0, 𝛼 = 0.01, 𝑚 = 1.
We take values of 𝑥 and 𝑦 at random from the data set and update the weight and bias parameters using the update rule.
Let us take the first element from each, i.e., 𝑥 = [1], 𝑦 = [2].
Therefore
w = w - \alpha \times \frac{1}{m} \sum_{i=1}^{m} \left((w x_i + b) - y_i\right) \times x_i = 0 - 0.01 \times \frac{1}{1} \times \left(((0 \times 1) + 0) - 2\right) \times 1 = -0.01 \times (-2) = 0.02
Hence the new value of 𝑤 after the first iteration is 0.02
b = b - \alpha \times \frac{1}{m} \sum_{i=1}^{m} \left((w x_i + b) - y_i\right) = 0 - 0.01 \times \frac{1}{1} \times \left(((0 \times 1) + 0) - 2\right) = -0.01 \times (-2) = 0.02
Hence the new value of 𝑏 after the first iteration is 0.02
We take these values and go to the next iteration.
If we perform a large number of iterations (say 100 or 1,000) in Python, we can see the 𝑤 and 𝑏 values converge to values where the loss (MSE) is minimized, as the sketch below illustrates.
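A short Python sketch of this training loop, reusing the sgd_step function sketched above (the iteration count and random seed are arbitrary choices for this illustration):

import random

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

w, b, lr, m = 0.0, 0.0, 0.01, 1
random.seed(0)

for _ in range(10000):
    # Sample a mini-batch of size m at random from the data set
    idx = random.sample(range(len(x)), m)
    xs = [x[i] for i in idx]
    ys = [y[i] for i in idx]
    w, b = sgd_step(w, b, xs, ys, lr)

mse = sum(((w * xi + b) - yi) ** 2 for xi, yi in zip(x, y)) / len(x)
# With enough iterations, w and b hover near the least-squares fit (w ≈ 0.6, b ≈ 2.2)
print(w, b, mse)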
Multiple Linear Regression in Vector Notation

Assume that we have a response variable y that depends on 5 different variables as follows:

Changing the parameters to their estimates, we get:

The MSE of the predictions from the above equation is as follows:
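The equations referred to in this passage (the regression, its estimated form, and the MSE) presumably take the following standard form; the predictor names x_1, …, x_5 and the sample size n are assumed for this sketch:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4 + \hat{\beta}_5 x_5

MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_5 x_{i5}) \right)^2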

The above equation is bulky and clunky to work with, and taking its least-squares derivative is a real pain. So let's use a few tricks from linear algebra to simplify the notation and organize the information.
We can write all the y values in a single vector (a column of numbers) as follows:

The observations of x can be summarised in a matrix, which we can for now think of as a collection of vectors:
The parameters can also be summarised in the following vector:

The original regression can now be expressed in the vector form as follows:
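A sketch of these objects in standard notation, assuming n observations and the five predictors from the opening (so X carries a leading column of ones for the intercept):

y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{15} \\ 1 & x_{21} & \cdots & x_{25} \\ \vdots & & & \vdots \\ 1 & x_{n1} & \cdots & x_{n5} \end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_5 \end{bmatrix}, \qquad
\hat{y} = X \beta

Equivalently, each prediction is the dot product \hat{y}_i = \beta^T x_i, which is the reading described below.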

The above equation represents the dot product between the beta vector and the x matrix. For those who are not familiar with the dot product operation, here is a brief summary of what it is.

The transpose operation (T) converts the vertical (column) beta vector into a horizontal (row) vector. This row vector is then multiplied with the x observation matrix. For a dot product to work, the number of columns of the first factor must match the number of rows of the second. In the x matrix used here there are 4 columns: one for each of the 3 predictors, and the first column for the intercept term. For each column, there is a corresponding entry in the beta vector.

After the dot product, we get the following final vector:

Each entry of the beta vector is multiplied by the corresponding entry in the x matrix, and the products are added row-wise to give one final number for each row. This number is the predicted y value for the corresponding observation in the x matrix.
The MSE equation can now be simplified to the following form:
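Informally (with the 'square' of the error vector made precise just below), this is:

MSE = \frac{1}{n} \left( y - X\beta \right)^2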

In the above equation, we take the dot product obtained from the original regression equation in matrix form and subtract it from the vector of y values. This operation gives us a vector of errors, one per observation; we then square that vector and divide by n.
Squaring a vector corresponds to a more formal operation in linear algebra: multiplying the vector by its own transpose. So the above MSE equation converts from its squared form to the following transposed dot product form:
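In standard notation, this is:

MSE = \frac{1}{n} \left( y - X\beta \right)^T \left( y - X\beta \right)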
Opening the brackets gives us 4 terms, corresponding to all the cross-products of the two brackets:
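A sketch of the expansion:

MSE = \frac{1}{n} \left( y^T y - y^T X \beta - \beta^T X^T y + \beta^T X^T X \beta \right)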

The second and third terms in the above equation represent the same scalar quantity; one is simply the transpose of the other. So they can be added together:
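Since y^T X \beta = \beta^T X^T y, the expression becomes:

MSE = \frac{1}{n} \left( y^T y - 2\, \beta^T X^T y + \beta^T X^T X \beta \right)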

The last term is also written in its squared form, since it was originally in the transposed dot product form.
Now we can find the least-squares derivative of the MSE function with respect to the beta vector:
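A sketch of this step in standard notation:

\nabla_{\beta}\, MSE = \nabla_{\beta} \left[ \frac{1}{n} \left( y^T y - 2\, \beta^T X^T y + \beta^T X^T X \beta \right) \right] = 0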

We use the nabla symbol to represent the derivative, and the derivative is set to 0 to find the minimum of the function. If you expand the vectors into their original form, work out the resulting terms, and then apply the derivative, you will find that the above equation simplifies to the following form:
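A sketch of the simplified form:

\nabla_{\beta}\, MSE = \frac{1}{n} \left( -2\, X^T y + 2\, X^T X \beta \right) = 0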
The first term goes to 0 since it has no dependence on the beta vector. In the second term, the 2, the y vector and the x matrix are constants; the only factor affected by the derivative is the beta vector itself, which follows the power rule and goes to 1. In the third term, the x squared matrix is a constant and hence unaffected by the derivative, while the squared beta vector follows the power rule and comes down to a single power of the beta vector.
Now taking the second term to the other side of the equation gives the following:
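That is, roughly:

2\, X^T X \beta = 2\, X^T y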

We have again changed the x squared term to the transposed dot product. Dividing by 2
on both sides and taking the x squared term to the other side gives us the following:

Doing the transposition operation on both sides, we get the final equation for the beta
vector:
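In standard notation, this final equation is the familiar least-squares solution:

\beta = (X^T X)^{-1} X^T y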

We now have an equation that will give us the least-squares parameter vector for a given data set x and y. Solving this equation gives us the answer we are looking for.
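As a quick sanity check of this closed form, here is a small Python sketch; NumPy is assumed, and the data set below is made up purely for illustration:

import numpy as np

# Made-up data: 6 observations, 2 predictors, plus a leading column of ones for the intercept
X = np.array([
    [1.0, 1.0, 2.0],
    [1.0, 2.0, 1.0],
    [1.0, 3.0, 4.0],
    [1.0, 4.0, 3.0],
    [1.0, 5.0, 6.0],
    [1.0, 6.0, 5.0],
])
y = np.array([4.0, 4.0, 8.0, 8.0, 12.0, 12.0])

# Normal equations: beta = (X^T X)^(-1) X^T y
beta = np.linalg.inv(X.T @ X) @ X.T @ y

# In practice a least-squares solver is preferred over an explicit inverse
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Both print approximately [1. 1. 1.] for this made-up data
print(beta)
print(beta_lstsq)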
