ML | Stochastic Gradient Descent (SGD)


Last Updated : 03 Mar, 2025

Stochastic Gradient Descent (SGD) is an optimization algorithm used in machine learning, particularly when dealing with large datasets. It is a variant of the traditional gradient descent algorithm but offers several advantages in terms of efficiency and scalability, making it the go-to method for many deep-learning tasks.

To understand SGD, it’s essential to first comprehend the concept of gradient descent.

What is Gradient Descent?

Gradient descent is an iterative optimization algorithm used to minimize a loss function, which represents how far the model’s predictions are from the actual values. The main goal is to adjust the parameters of a model (weights, biases, etc.) so that the error is minimized.

The update rule for the traditional gradient descent algorithm is:

θ = θ − η ∇θ J(θ)

where θ denotes the model parameters, η is the learning rate, and ∇θ J(θ) is the gradient of the loss function J with respect to θ.

In traditional gradient descent, the gradients are computed based on the entire dataset, which can be computationally expensive for large datasets.
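To make the full-dataset update concrete, here is a minimal NumPy sketch (not part of the original article) of batch gradient descent for a simple linear-regression problem; the data, learning rate, and iteration count are illustrative choices:

import numpy as np

# Toy data: y is roughly 4 + 3x plus noise (illustrative only)
rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))

X_b = np.c_[np.ones((100, 1)), X]   # add a bias column
theta = rng.standard_normal((2, 1))
eta = 0.1                           # learning rate

for _ in range(1000):
    # Gradient of the MSE loss computed over the ENTIRE dataset
    gradients = 2 / len(X_b) * X_b.T.dot(X_b.dot(theta) - y)
    theta -= eta * gradients

print(theta)   # should approach the true parameters (about 4 and 3)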

Need for Stochastic Gradient Descent


For large datasets, computing the gradient using all data points
can be slow and memory-intensive. This is where SGD comes into
play. Instead of using the full dataset to compute the gradient at
each step, SGD uses only one random data point (or a small batch
of data points) at each iteration. This makes the computation much
faster.


Path followed by batch gradient descent vs. path followed by SGD:

[Figure: optimization path followed by gradient descent; optimization path followed by SGD]

Working of Stochastic Gradient Descent


In Stochastic Gradient Descent, the gradient is calculated for each
training example (or a small subset of training examples) rather than
the entire dataset.

The update rule becomes:

θ = θ − η ∇θ J(θ; xᵢ, yᵢ)

Where:

xᵢ and yᵢ represent the features and target of the i-th training example.


The gradient ∇θ J(θ; xᵢ, yᵢ) is now calculated for a single data point or a small batch.

The key difference from traditional gradient descent is that, in SGD, the parameter updates are made based on a single data point, not the entire dataset. The random selection of data points introduces stochasticity, which can be both an advantage and a challenge.

Implementing Stochastic Gradient Descent from Scratch

Step 1: Data Generation

In this step, we generate synthetic data for the linear regression problem. The data consists of the feature X and the target y, where the relationship is linear, i.e., y = 4 + 3 * X + noise.

X is a random array of 100 samples between 0 and 2.

y is the target, calculated using a linear equation with a little random noise to make it more realistic.

import numpy as np

# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

For a linear regression with one feature, the model is described by the equation:

y = θ₀ + θ₁ · X

Where:

θ₀ is the intercept (the bias term),

θ₁ is the slope or coefficient associated with the input feature X.


Step 2: Define the SGD Function

Here we define the core function for Stochastic Gradient Descent (SGD).
The function takes the input data X and y. It initializes the model
parameters, performs stochastic updates for a specified number of
epochs, and records the cost at each step.

theta is the parameter vector (intercept and slope) initialized randomly.

X_bias is the augmented X with a column of ones added for the bias term (intercept).

In each epoch, the data is shuffled, and for each mini-batch (or single
sample), the gradient is calculated, and the parameters are updated.
The cost is calculated as the mean squared error, and the history of the
cost is recorded to monitor convergence.

def sgd(X, y, learning_rate=0.1, epochs=1000, batch_size=1):
    m = len(X)
    theta = np.random.randn(2, 1)

    # Add a bias term to X (X_0 = 1)
    X_bias = np.c_[np.ones((m, 1)), X]

    cost_history = []

    for epoch in range(epochs):
        # Shuffle the data at the beginning of each epoch
        indices = np.random.permutation(m)
        X_shuffled = X_bias[indices]
        y_shuffled = y[indices]

        for i in range(0, m, batch_size):
            # Select a mini-batch or a single sample
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]

            # Compute the gradient of the MSE loss on the batch
            gradients = 2 / batch_size * X_batch.T.dot(X_batch.dot(theta) - y_batch)

            # Update the parameters (theta)
            theta -= learning_rate * gradients

        # Calculate and record the cost (Mean Squared Error)
        predictions = X_bias.dot(theta)
        cost = np.mean((predictions - y) ** 2)
        cost_history.append(cost)

        # Print progress every 100 epochs
        if epoch % 100 == 0:
            print(f"Epoch {epoch}, Cost: {cost}")

    return theta, cost_history

Step 3: Train the Model Using SGD

In this step, we call the sgd() function to train the model. We specify
the learning rate, number of epochs, and batch size for SGD.

# Train the model using SGD
theta_final, cost_history = sgd(X, y, learning_rate=0.1, epochs=1000, batch_size=1)

Output:

[Training log: the cost printed every 100 epochs]

Step 4: Visualize the Cost Function

After training, we visualize how the cost function evolves over epochs.
This helps us understand if the algorithm is converging properly.

import matplotlib.pyplot as plt

# Plot the cost history
plt.plot(cost_history)
plt.xlabel('Epochs')
plt.ylabel('Cost (MSE)')
plt.title('Cost Function during Training')
plt.show()

Output:

[Plot: cost (MSE) versus epochs during training]

Step 5: Plot the Data and Regression Line

In this step, we visualize the data points and the fitted regression line
after training. We plot the data points as blue dots and the predicted
line (from the final theta) as a red line.

# Plot the data and the regression line
plt.scatter(X, y, color='blue', label='Data points')
plt.plot(X, np.c_[np.ones((X.shape[0], 1)), X].dot(theta_final), color='red', label='SGD fit line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression using Stochastic Gradient Descent')
plt.legend()
plt.show()

Output:

[Plot: data points (blue) with the fitted SGD regression line (red)]


Step 6: Print the Final Model Parameters

After training, we print the final parameters of the model, which include
the slope and intercept. These values are the result of optimizing the
model using SGD.

print(f"Final parameters: {theta_final}")

Output:

Final parameters: [[4.35097872] [3.45754277]]

The final parameters returned by the model are approximately:

θ₀ ≈ 4.35, θ₁ ≈ 3.46

Then the fitted linear regression model is:

y ≈ 4.35 + 3.46 · X

This means:

When X = 0, y ≈ 4.35 (the intercept or bias term).

For each unit increase in X, y increases by about 3.46 units (the slope or coefficient).
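As a quick, illustrative check (not in the original article), the fitted parameters can be plugged back in to predict a few inputs; the approximate numbers follow directly from the printed theta_final:

# Sanity-check the fitted model on a few illustrative inputs
X_new = np.array([[0.0], [1.0], [2.0]])
X_new_bias = np.c_[np.ones((X_new.shape[0], 1)), X_new]   # add the bias column
print(X_new_bias.dot(theta_final))
# With theta_final ≈ [[4.35], [3.46]], this prints roughly [[4.35], [7.81], [11.27]]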

Advantages of Stochastic Gradient Descent


1. Efficiency: Because it uses only one or a few data points to calculate the gradient, SGD can be much faster, especially for large datasets. Each step requires fewer computations, leading to quicker convergence.

2. Memory Efficiency: Since it does not require storing the entire dataset in memory for each iteration, SGD can handle much larger datasets than traditional gradient descent.

3. Escaping Local Minima: The noisy updates in SGD, caused by the stochastic nature of the algorithm, can help the model escape local minima or saddle points, potentially leading to better solutions in non-convex optimization problems (common in deep learning).

4. Online Learning: SGD is well-suited for online learning, where the model is trained incrementally as new data comes in, rather than on a static dataset (see the sketch after this list).
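To illustrate the online-learning point, here is a minimal sketch (not from the article) using scikit-learn's SGDRegressor, whose partial_fit method updates the model incrementally as new mini-batches arrive; the simulated data stream and hyperparameters are placeholders:

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(42)
model = SGDRegressor(learning_rate='constant', eta0=0.01)

# Simulate a stream of small batches arriving over time
for _ in range(200):
    X_batch = 2 * rng.random((10, 1))
    y_batch = (4 + 3 * X_batch + rng.standard_normal((10, 1))).ravel()
    model.partial_fit(X_batch, y_batch)   # incremental update on the new batch only

print(model.intercept_, model.coef_)      # should end up near 4 and 3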

Challenges of Stochastic Gradient Descent


1. Noisy Convergence: Since the gradient is estimated based on a
single data point (or a small batch), the updates can be noisy, causing
the cost function to fluctuate rather than steadily decrease. This
makes convergence slower and more erratic than in batch gradient
descent.
2. Learning Rate Tuning: SGD is highly sensitive to the choice of learning rate. A learning rate that is too large may cause the algorithm to diverge, while one that is too small can slow down convergence. Adaptive methods like Adam and RMSprop address this by adjusting the learning rate dynamically during training; a simple decay schedule, sketched after this list, can also help.
3. Long Training Times: While each individual update is fast, the
convergence might take a longer time overall since the steps are
more erratic compared to batch gradient descent.
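One simple way to tame learning-rate sensitivity, short of a fully adaptive method, is to decay the rate over time. Below is a minimal, illustrative step-decay schedule (the function name and constants are assumptions, not from the article) that could be used to recompute the learning rate at the start of each epoch in the sgd() function above:

def step_decay_lr(initial_lr, epoch, drop=0.5, epochs_per_drop=200):
    # Halve the learning rate every `epochs_per_drop` epochs
    return initial_lr * (drop ** (epoch // epochs_per_drop))

# Effective learning rate at a few epochs, starting from 0.1
for epoch in (0, 199, 200, 400, 999):
    print(epoch, step_decay_lr(0.1, epoch))
# 0 -> 0.1, 199 -> 0.1, 200 -> 0.05, 400 -> 0.025, 999 -> 0.00625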

Variants of Stochastic Gradient Descent


While traditional SGD is a powerful method, there are several refinements and variants designed to improve convergence and stability:

Mini-batch SGD: Instead of using a single data point, mini-batch SGD uses a small batch of data points to calculate the gradient. This strikes a balance between the efficiency of SGD and the stability of batch gradient descent. It reduces the noise in the updates while maintaining the computational efficiency.
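Since the sgd() function defined earlier already accepts a batch_size argument, mini-batch SGD can be tried simply by passing a larger batch; the batch size of 16 below is an arbitrary, illustrative choice:

# Mini-batch SGD: each update averages the gradient over 16 samples
theta_mb, cost_history_mb = sgd(X, y, learning_rate=0.1, epochs=1000, batch_size=16)
print(theta_mb)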

Momentum: Momentum helps accelerate SGD by adding a fraction of the previous update to the current one. This allows the algorithm to keep moving in the same direction and can help overcome oscillations in the cost function.
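A minimal sketch of SGD with momentum for the same linear-regression setup is shown below; it is illustrative only, and the function name, momentum coefficient (0.9), and learning rate are assumptions rather than part of the article:

import numpy as np

def sgd_momentum(X, y, learning_rate=0.01, epochs=200, beta=0.9):
    m = len(X)
    X_bias = np.c_[np.ones((m, 1)), X]
    theta = np.random.randn(2, 1)
    velocity = np.zeros_like(theta)

    for epoch in range(epochs):
        for i in np.random.permutation(m):
            xi, yi = X_bias[i:i+1], y[i:i+1]
            gradient = 2 * xi.T.dot(xi.dot(theta) - yi)
            # Accumulate a fraction of the previous update, then apply it
            velocity = beta * velocity - learning_rate * gradient
            theta += velocity
    return theta

# Example call, reusing the synthetic X and y from Step 1:
# theta_momentum = sgd_momentum(X, y)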

Adaptive Methods (e.g., Adam, RMSprop): These methods dynamically adjust the learning rate for each parameter. Adam, for example, uses both the average of the gradients (first moment) and the average of the squared gradients (second moment) to compute an adaptive learning rate, improving convergence and stability.
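For intuition, here is a minimal NumPy sketch of a single Adam update (the hyperparameters are the commonly used defaults; this is an illustrative sketch, not the article's code):

import numpy as np

def adam_step(theta, gradient, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * gradient          # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * gradient ** 2     # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                    # bias correction (moments start at zero)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step
    return theta, m, v

# Usage sketch: initialise m and v to np.zeros_like(theta) and call once per
# iteration with t = 1, 2, 3, ...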

Applications of Stochastic Gradient Descent


SGD and its variants are widely used across various domains of machine
learning:

Deep Learning: In training deep neural networks, SGD is the default optimizer due to its efficiency with large datasets and its ability to work with large models. Deep learning frameworks like TensorFlow and PyTorch typically use variants like Adam or RMSprop, which are based on SGD (a minimal PyTorch sketch follows this list).
Natural Language Processing (NLP): Models like Word2Vec and
transformers are trained using SGD variants to optimize large models
on vast text corpora.
Computer Vision: For tasks such as image classification, object
detection, and segmentation, SGD has been fundamental in training
convolutional neural networks (CNNs).
Reinforcement Learning: SGD is also used to optimize the
parameters of models used in reinforcement learning, such as deep
Q-networks (DQNs) and policy gradient methods.
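As a concrete illustration of how SGD is typically invoked in a deep-learning framework, here is a minimal PyTorch sketch; the tiny model, data, and hyperparameters are placeholders, not from the article:

import torch
import torch.nn as nn

# Placeholder model and synthetic data (y is roughly 4 + 3x)
model = nn.Linear(1, 1)
X = 2 * torch.rand(100, 1)
y = 4 + 3 * X + torch.randn(100, 1)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.MSELoss()

for epoch in range(100):
    perm = torch.randperm(100)
    for i in range(0, 100, 10):                 # mini-batches of 10
        idx = perm[i:i + 10]
        optimizer.zero_grad()                   # clear accumulated gradients
        loss = loss_fn(model(X[idx]), y[idx])   # forward pass + loss
        loss.backward()                         # backpropagate
        optimizer.step()                        # SGD (with momentum) parameter update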

