Stochastic Gradient Descent

Suppose we have a data set with input 𝑥 and output 𝑦. We want to train a model that can predict the output 𝑦 given a new input 𝑥. We can do this by minimizing a cost function 𝐽 that measures the difference between the predicted output and the true output:

J = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2

where 𝑚 is the number of training examples, 𝑦̂𝑖 is the predicted output and 𝑦𝑖 is the actual output.

We want to find the parameters of the model that minimize the cost function 𝐽. Stochastic Gradient Descent (SGD) is an algorithm that updates the parameters of the model in small steps based on the gradient of the cost function with respect to those parameters. We do this for each training example in the data set rather than for all of them at once.
The update rule is given as

\theta = \theta - \alpha \frac{dJ}{d\theta}

where 𝜃 is the model parameter, 𝛼 is the learning rate, and dJ/d𝜃 is the gradient of the cost function 𝐽 with respect to the parameter 𝜃.

We compute the gradient for each training example and update the parameters accordingly.

Let us analyze an example of stochastic gradient descent applied to a simple linear regression problem.

The data set is 𝑥 = [1, 2, 3, 4, 5] and 𝑦 = [2, 4, 5, 4, 5]. Our objective is to find the line that best predicts 𝑦 for this data set, using the mean squared error as our loss function:

MSE(J) = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2
Sometimes we use 2 × batch size, i.e., 2𝑁 instead of 𝑁, in the denominator for mathematical simplicity when the derivative is taken:

MSE(J) = \frac{1}{2N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2
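The simplification is that, when a squared term is differentiated, the 2 from the power rule cancels the extra 2 in the denominator. For a single term, with the linear model 𝑦̂𝑖 = 𝑤𝑥𝑖 + 𝑏 introduced just below, differentiating with respect to 𝑤 gives, for example:

\frac{d}{dw}\left[\frac{1}{2N}\left((w x_i + b) - y_i\right)^2\right] = \frac{1}{N}\left((w x_i + b) - y_i\right) x_i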

We know that

\hat{y}_i = w x_i + b

where 𝑤 is the weight, 𝑥𝑖 is the input and 𝑏 is the bias.
Therefore
MSE(J) = \frac{1}{2N} \sum_{i=1}^{N} \left((w x_i + b) - y_i\right)^2
To use stochastic gradient descent, we will minimize the mean squared error loss function by updating our weight and bias iteratively, based on a randomly sampled subset of the data known as a 'mini-batch'.

The update rule is given as

\theta = \theta - \alpha \frac{dJ}{d\theta}

For estimating parameter 𝑤, the update rule becomes

w = w - \alpha \frac{d}{dw}\{J\}
Or
w = w - \alpha \frac{d}{dw}\left\{\frac{1}{2m} \sum_{i=1}^{m} \left((w x_i + b) - y_i\right)^2\right\}
Or
w = w - \alpha \times \frac{1}{m} \sum_{i=1}^{m} \left((w x_i + b) - y_i\right) \times x_i

Similarly for parameter 𝑏, the update rule becomes,

b = b - \alpha \times \frac{1}{m} \sum_{i=1}^{m} \left((w x_i + b) - y_i\right)

where 𝛼 is a hyperparameter called the 'learning rate' (which controls the step size), and 𝑚 is the mini-batch size, i.e., the number of data points in a single batch.
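As a minimal sketch of these two update rules in Python (the function name sgd_step and the use of plain Python lists are choices made for this illustration, not something from the text):

def sgd_step(w, b, xs, ys, lr):
    """Perform one SGD update of w and b on a mini-batch (xs, ys)."""
    m = len(xs)
    # Gradients of (1/(2m)) * sum(((w*x + b) - y)^2) with respect to w and b
    grad_w = sum(((w * x + b) - y) * x for x, y in zip(xs, ys)) / m
    grad_b = sum(((w * x + b) - y) for x, y in zip(xs, ys)) / m
    # Update rule: parameter = parameter - learning_rate * gradient
    return w - lr * grad_w, b - lr * grad_b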
Let us initialize the parameters.
Suppose 𝑤 = 0, 𝑏 = 0, 𝛼 = 0.01, 𝑚 = 1.
We take values of 𝑥 and 𝑦 at random from the data set and update the weight and bias parameters using the update rule.
Let us take the first element from each, i.e., 𝑥 = [1], 𝑦 = [2].
Therefore
w = w - \alpha \times \frac{1}{m} \sum_{i=1}^{m} \left((w x_i + b) - y_i\right) \times x_i = 0 - 0.01 \times \frac{1}{1} \times \left(((0 \times 1) + 0) - 2\right) \times 1 = -0.01 \times (-2) = 0.02
Hence the new value of 𝑤 after the first iteration is 0.02
b = b - \alpha \times \frac{1}{m} \sum_{i=1}^{m} \left((w x_i + b) - y_i\right) = 0 - 0.01 \times \frac{1}{1} \times \left(((0 \times 1) + 0) - 2\right) = -0.01 \times (-2) = 0.02
Hence the new value of 𝑏 after the first iteration is 0.02
We take these values and go to the next iteration.
If we perform a large number of iterations (say 100 or 1,000) in Python, we can see the 𝑤 and 𝑏 values converge to values where the loss (MSE) is minimized, as the sketch below illustrates.
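A short Python sketch of this training loop, reusing the sgd_step function sketched above (the iteration count and random seed are arbitrary choices for this illustration):

import random

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

w, b, lr, m = 0.0, 0.0, 0.01, 1
random.seed(0)

for _ in range(10000):
    # Sample a mini-batch of size m at random from the data set
    idx = random.sample(range(len(x)), m)
    xs = [x[i] for i in idx]
    ys = [y[i] for i in idx]
    w, b = sgd_step(w, b, xs, ys, lr)

mse = sum(((w * xi + b) - yi) ** 2 for xi, yi in zip(x, y)) / len(x)
# With enough iterations, w and b hover near the least-squares fit (w ≈ 0.6, b ≈ 2.2)
print(w, b, mse)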
Multiple Linear Regression in Vector Notation

Assume that we have a response variable y that depends on 5 different variables as follows:

Changing the parameters to their estimates, we get:

The MSE of the predictions from the above equation is as follows:
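The equations referred to in this passage (the regression, its estimated form, and the MSE) presumably take the following standard form; the predictor names x_1, …, x_5 and the sample size n are assumed for this sketch:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5

\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3 + \hat{\beta}_4 x_4 + \hat{\beta}_5 x_5

MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_5 x_{i5}) \right)^2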

The above equation is bulky and clunky to work with, and taking its least-squares derivative is a real pain. So let's use a few tricks from linear algebra to simplify the notation and organize the information.
We can write all the y values in a single vector (a column of numbers) as follows:

The observations of x can be summarised in a matrix, which we can for now think of as a collection of vectors:
The parameters can also be summarised in the following vector:

The original regression can now be expressed in the vector form as follows:
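A sketch of these objects in standard notation, assuming n observations and the five predictors from the opening (so X carries a leading column of ones for the intercept):

y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad
X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{15} \\ 1 & x_{21} & \cdots & x_{25} \\ \vdots & & & \vdots \\ 1 & x_{n1} & \cdots & x_{n5} \end{bmatrix}, \quad
\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_5 \end{bmatrix}, \qquad
\hat{y} = X \beta

Equivalently, each prediction is the dot product \hat{y}_i = \beta^T x_i, which is the reading described below.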

The above equation represents the dot product between the beta vector and the x matrix. For those who are not familiar with the dot product operation, here is a brief summary of what it is.

The transpose operation (T) converts the vertical (column) beta vector into a horizontal (row) vector. This row vector is then multiplied with the x observation matrix. For a dot product to work, the number of columns of the first factor must match the number of rows of the second. In the x matrix used here there are 4 columns: one for each of the 3 predictors, and the first column for the intercept term. For each column, there is a corresponding entry in the beta vector.

After the dot product, we get the following final vector:

Each entry of the beta vector is multiplied by the corresponding entry in the x matrix, and the products are added row-wise to give one final number for each row. This number is the predicted y value for the corresponding observation in the x matrix.
The MSE equation can now be simplified to the following form:
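Informally (with the 'square' of the error vector made precise just below), this is:

MSE = \frac{1}{n} \left( y - X\beta \right)^2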

In the above equation, we take the dot product obtained from the original regression equation in matrix form and subtract it from the vector of y values. This operation gives us a vector of errors, one per observation; we then square that vector and divide by n.
Squaring a vector corresponds to a more formal operation in linear algebra: multiplying the vector by its own transpose. So the above MSE equation converts from its squared form to the following transposed dot product form:
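In standard notation, this is:

MSE = \frac{1}{n} \left( y - X\beta \right)^T \left( y - X\beta \right)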
Opening the brackets gives us 4 terms, corresponding to all the cross-products of the two brackets:
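A sketch of the expansion:

MSE = \frac{1}{n} \left( y^T y - y^T X \beta - \beta^T X^T y + \beta^T X^T X \beta \right)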

The second and third terms in the above equation represent the same scalar quantity; one is simply the transpose of the other. So they can be added together:
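Since y^T X \beta = \beta^T X^T y, the expression becomes:

MSE = \frac{1}{n} \left( y^T y - 2\, \beta^T X^T y + \beta^T X^T X \beta \right)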

The last term is also written in its squared form, since it was originally in the transposed dot product form.
Now we can find the least-squares derivative of the MSE function with respect to the beta vector:
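A sketch of this step in standard notation:

\nabla_{\beta}\, MSE = \nabla_{\beta} \left[ \frac{1}{n} \left( y^T y - 2\, \beta^T X^T y + \beta^T X^T X \beta \right) \right] = 0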

We use the nabla symbol to represent the derivative, and the derivative is set to 0 to find the minimum of the function. If you expand the vectors into their original form, work out the resulting terms, and then apply the derivative, you will find that the above equation simplifies to the following form:
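A sketch of the simplified form:

\nabla_{\beta}\, MSE = \frac{1}{n} \left( -2\, X^T y + 2\, X^T X \beta \right) = 0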
The first term goes to 0 since it has no dependence on the beta vector. In the second term, the 2, the y vector and the x matrix are constants; the only factor affected by the derivative is the beta vector itself, which follows the power rule and goes to 1. In the third term, the x squared matrix is a constant and hence unaffected by the derivative, while the squared beta vector follows the power rule and comes down to a single power of the beta vector.
Now taking the second term to the other side of the equation gives the following:
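That is, roughly:

2\, X^T X \beta = 2\, X^T y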

We have again changed the x squared term to the transposed dot product. Dividing by 2
on both sides and taking the x squared term to the other side gives us the following:

Doing the transposition operation on both sides, we get the final equation for the beta
vector:
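In standard notation, this final equation is the familiar least-squares solution:

\beta = (X^T X)^{-1} X^T y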

We now have an equation that will give us the least-squares parameter vector for a given data set x and y. Solving this equation gives us the answer we are looking for.
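As a quick sanity check of this closed form, here is a small Python sketch; NumPy is assumed, and the data set below is made up purely for illustration:

import numpy as np

# Made-up data: 6 observations, 2 predictors, plus a leading column of ones for the intercept
X = np.array([
    [1.0, 1.0, 2.0],
    [1.0, 2.0, 1.0],
    [1.0, 3.0, 4.0],
    [1.0, 4.0, 3.0],
    [1.0, 5.0, 6.0],
    [1.0, 6.0, 5.0],
])
y = np.array([4.0, 4.0, 8.0, 8.0, 12.0, 12.0])

# Normal equations: beta = (X^T X)^(-1) X^T y
beta = np.linalg.inv(X.T @ X) @ X.T @ y

# In practice a least-squares solver is preferred over an explicit inverse
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Both print approximately [1. 1. 1.] for this made-up data
print(beta)
print(beta_lstsq)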
