ML | Stochastic Gradient Descent (SGD)


Last Updated : 03 Mar, 2025

Stochastic Gradient Descent (SGD) is an optimization algorithm used in machine learning, particularly when dealing with large datasets. It is a variant of the traditional gradient descent algorithm but offers several advantages in terms of efficiency and scalability, making it the go-to method for many deep-learning tasks.

To understand SGD, it’s essential to first comprehend the concept of gradient descent.

What is Gradient Descent?

Gradient descent is an iterative optimization algorithm used to minimize a loss function, which represents how far the model’s predictions are from the actual values. The main goal is to adjust the parameters of a model (weights, biases, etc.) so that the error is minimized.

The update rule for the traditional gradient descent algorithm is:

θ = θ − η ∇θ J(θ)

where θ denotes the model parameters, η is the learning rate, and ∇θ J(θ) is the gradient of the loss function J with respect to θ.

In traditional gradient descent, the gradients are computed based on the entire dataset, which can be computationally expensive for large datasets.
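To make the full-dataset update concrete, here is a minimal NumPy sketch (not part of the original article) of batch gradient descent for a simple linear-regression problem; the data, learning rate, and iteration count are illustrative choices:

import numpy as np

# Toy data: y is roughly 4 + 3x plus noise (illustrative only)
rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.standard_normal((100, 1))

X_b = np.c_[np.ones((100, 1)), X]   # add a bias column
theta = rng.standard_normal((2, 1))
eta = 0.1                           # learning rate

for _ in range(1000):
    # Gradient of the MSE loss computed over the ENTIRE dataset
    gradients = 2 / len(X_b) * X_b.T.dot(X_b.dot(theta) - y)
    theta -= eta * gradients

print(theta)   # should approach the true parameters (about 4 and 3)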

Need for Stochastic Gradient Descent


For large datasets, computing the gradient using all data points
can be slow and memory-intensive. This is where SGD comes into
play. Instead of using the full dataset to compute the gradient at
each step, SGD uses only one random data point (or a small batch
of data points) at each iteration. This makes the computation much
faster.


Path followed by batch gradient descent vs. path followed by SGD:

[Figure: optimization path followed by gradient descent; optimization path followed by SGD]

Working of Stochastic Gradient Descent


In Stochastic Gradient Descent, the gradient is calculated for each
training example (or a small subset of training examples) rather than
the entire dataset.

The update rule becomes:

θ = θ − η ∇θ J(θ; xᵢ, yᵢ)

Where:

xᵢ and yᵢ represent the features and target of the i-th training example.


The gradient ∇θ J(θ; xᵢ, yᵢ) is now calculated for a single data point or a small batch.

The key difference from traditional gradient descent is that, in SGD, the parameter updates are made based on a single data point, not the entire dataset. The random selection of data points introduces stochasticity, which can be both an advantage and a challenge.

Implementing Stochastic Gradient Descent from Scratch

Step 1: Data Generation

In this step, we generate synthetic data for the linear regression problem. The data consists of the feature X and the target y, where the relationship is linear, i.e., y = 4 + 3 * X + noise.

X is a random array of 100 samples between 0 and 2.

y is the target, calculated using a linear equation with a little random noise to make it more realistic.

import numpy as np

# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

For a linear regression with one feature, the model is described by the equation:

y = θ₀ + θ₁ · X

Where:

θ₀ is the intercept (the bias term),

θ₁ is the slope or coefficient associated with the input feature X.


Step 2: Define the SGD Function

Here we define the core function for Stochastic Gradient Descent (SGD).
The function takes the input data X and y. It initializes the model
parameters, performs stochastic updates for a specified number of
epochs, and records the cost at each step.

theta is the parameter vector (intercept and slope) initialized randomly.

X_bias is the augmented X with a column of ones added for the bias term (intercept).

In each epoch, the data is shuffled, and for each mini-batch (or single
sample), the gradient is calculated, and the parameters are updated.
The cost is calculated as the mean squared error, and the history of the
cost is recorded to monitor convergence.

def sgd(X, y, learning_rate=0.1, epochs=1000, batch_size=1):
    m = len(X)
    theta = np.random.randn(2, 1)

    # Add a bias term to X (X_0 = 1)
    X_bias = np.c_[np.ones((m, 1)), X]

    cost_history = []

    for epoch in range(epochs):
        # Shuffle the data at the beginning of each epoch
        indices = np.random.permutation(m)
        X_shuffled = X_bias[indices]
        y_shuffled = y[indices]

        for i in range(0, m, batch_size):
            # Select a mini-batch or a single sample
            X_batch = X_shuffled[i:i+batch_size]
            y_batch = y_shuffled[i:i+batch_size]

            # Compute the gradient of the MSE loss on the batch
            gradients = 2 / batch_size * X_batch.T.dot(X_batch.dot(theta) - y_batch)

            # Update the parameters (theta)
            theta -= learning_rate * gradients

        # Calculate and record the cost (Mean Squared Error)
        predictions = X_bias.dot(theta)
        cost = np.mean((predictions - y) ** 2)
        cost_history.append(cost)

        # Print progress every 100 epochs
        if epoch % 100 == 0:
            print(f"Epoch {epoch}, Cost: {cost}")

    return theta, cost_history

Step 3: Train the Model Using SGD

In this step, we call the sgd() function to train the model. We specify
the learning rate, number of epochs, and batch size for SGD.

# Train the model using SGD
theta_final, cost_history = sgd(X, y, learning_rate=0.1, epochs=1000, batch_size=1)

Output:

[Training log: the cost printed every 100 epochs]

Step 4: Visualize the Cost Function

After training, we visualize how the cost function evolves over epochs.
This helps us understand if the algorithm is converging properly.

import matplotlib.pyplot as plt

# Plot the cost history
plt.plot(cost_history)
plt.xlabel('Epochs')
plt.ylabel('Cost (MSE)')
plt.title('Cost Function during Training')
plt.show()

Output:

[Plot: cost (MSE) versus epochs during training]

Step 5: Plot the Data and Regression Line

In this step, we visualize the data points and the fitted regression line
after training. We plot the data points as blue dots and the predicted
line (from the final theta) as a red line.

# Plot the data and the regression line
plt.scatter(X, y, color='blue', label='Data points')
plt.plot(X, np.c_[np.ones((X.shape[0], 1)), X].dot(theta_final), color='red', label='SGD fit line')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Linear Regression using Stochastic Gradient Descent')
plt.legend()
plt.show()

Output:

[Plot: data points (blue) with the fitted SGD regression line (red)]


Step 6: Print the Final Model Parameters

After training, we print the final parameters of the model, which include
the slope and intercept. These values are the result of optimizing the
model using SGD.

print(f"Final parameters: {theta_final}")

Output:

Final parameters: [[4.35097872] [3.45754277]]

The final parameters returned by the model are approximately:

θ₀ ≈ 4.35, θ₁ ≈ 3.46

Then the fitted linear regression model is:

y ≈ 4.35 + 3.46 · X

This means:

When X = 0, y ≈ 4.35 (the intercept or bias term).

For each unit increase in X, y increases by about 3.46 units (the slope or coefficient).
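As a quick, illustrative check (not in the original article), the fitted parameters can be plugged back in to predict a few inputs; the approximate numbers follow directly from the printed theta_final:

# Sanity-check the fitted model on a few illustrative inputs
X_new = np.array([[0.0], [1.0], [2.0]])
X_new_bias = np.c_[np.ones((X_new.shape[0], 1)), X_new]   # add the bias column
print(X_new_bias.dot(theta_final))
# With theta_final ≈ [[4.35], [3.46]], this prints roughly [[4.35], [7.81], [11.27]]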

Advantages of Stochastic Gradient Descent


1. Efficiency: Because it uses only one or a few data points to calculate the gradient, SGD can be much faster, especially for large datasets. Each step requires fewer computations, leading to quicker convergence.

2. Memory Efficiency: Since it does not require storing the entire dataset in memory for each iteration, SGD can handle much larger datasets than traditional gradient descent.

3. Escaping Local Minima: The noisy updates in SGD, caused by the stochastic nature of the algorithm, can help the model escape local minima or saddle points, potentially leading to better solutions in non-convex optimization problems (common in deep learning).

4. Online Learning: SGD is well-suited for online learning, where the model is trained incrementally as new data comes in, rather than on a static dataset (see the sketch after this list).
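To illustrate the online-learning point, here is a minimal sketch (not from the article) using scikit-learn's SGDRegressor, whose partial_fit method updates the model incrementally as new mini-batches arrive; the simulated data stream and hyperparameters are placeholders:

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(42)
model = SGDRegressor(learning_rate='constant', eta0=0.01)

# Simulate a stream of small batches arriving over time
for _ in range(200):
    X_batch = 2 * rng.random((10, 1))
    y_batch = (4 + 3 * X_batch + rng.standard_normal((10, 1))).ravel()
    model.partial_fit(X_batch, y_batch)   # incremental update on the new batch only

print(model.intercept_, model.coef_)      # should end up near 4 and 3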

Challenges of Stochastic Gradient Descent


1. Noisy Convergence: Since the gradient is estimated based on a
single data point (or a small batch), the updates can be noisy, causing
the cost function to fluctuate rather than steadily decrease. This
makes convergence slower and more erratic than in batch gradient
descent.
2. Learning Rate Tuning: SGD is highly sensitive to the choice of learning rate. A learning rate that is too large may cause the algorithm to diverge, while one that is too small can slow down convergence. Adaptive methods like Adam and RMSprop address this by adjusting the learning rate dynamically during training; a simple decay schedule, sketched after this list, can also help.
3. Long Training Times: While each individual update is fast, the
convergence might take a longer time overall since the steps are
more erratic compared to batch gradient descent.
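One simple way to tame learning-rate sensitivity, short of a fully adaptive method, is to decay the rate over time. Below is a minimal, illustrative step-decay schedule (the function name and constants are assumptions, not from the article) that could be used to recompute the learning rate at the start of each epoch in the sgd() function above:

def step_decay_lr(initial_lr, epoch, drop=0.5, epochs_per_drop=200):
    # Halve the learning rate every `epochs_per_drop` epochs
    return initial_lr * (drop ** (epoch // epochs_per_drop))

# Effective learning rate at a few epochs, starting from 0.1
for epoch in (0, 199, 200, 400, 999):
    print(epoch, step_decay_lr(0.1, epoch))
# 0 -> 0.1, 199 -> 0.1, 200 -> 0.05, 400 -> 0.025, 999 -> 0.00625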

Variants of Stochastic Gradient Descent


While traditional SGD is a powerful method, there are several refinements and variants designed to improve convergence and stability:

Mini-batch SGD: Instead of using a single data point, mini-batch SGD uses a small batch of data points to calculate the gradient. This strikes a balance between the efficiency of SGD and the stability of batch gradient descent. It reduces the noise in the updates while maintaining the computational efficiency.
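Since the sgd() function defined earlier already accepts a batch_size argument, mini-batch SGD can be tried simply by passing a larger batch; the batch size of 16 below is an arbitrary, illustrative choice:

# Mini-batch SGD: each update averages the gradient over 16 samples
theta_mb, cost_history_mb = sgd(X, y, learning_rate=0.1, epochs=1000, batch_size=16)
print(theta_mb)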

Momentum: Momentum helps accelerate SGD by adding a fraction of the previous update to the current one. This allows the algorithm to keep moving in the same direction and can help overcome oscillations in the cost function.
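A minimal sketch of SGD with momentum for the same linear-regression setup is shown below; it is illustrative only, and the function name, momentum coefficient (0.9), and learning rate are assumptions rather than part of the article:

import numpy as np

def sgd_momentum(X, y, learning_rate=0.01, epochs=200, beta=0.9):
    m = len(X)
    X_bias = np.c_[np.ones((m, 1)), X]
    theta = np.random.randn(2, 1)
    velocity = np.zeros_like(theta)

    for epoch in range(epochs):
        for i in np.random.permutation(m):
            xi, yi = X_bias[i:i+1], y[i:i+1]
            gradient = 2 * xi.T.dot(xi.dot(theta) - yi)
            # Accumulate a fraction of the previous update, then apply it
            velocity = beta * velocity - learning_rate * gradient
            theta += velocity
    return theta

# Example call, reusing the synthetic X and y from Step 1:
# theta_momentum = sgd_momentum(X, y)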

Adaptive Methods (e.g., Adam, RMSprop): These methods dynamically adjust the learning rate for each parameter. Adam, for example, uses both the average of the gradients (first moment) and the average of the squared gradients (second moment) to compute an adaptive learning rate, improving convergence and stability.
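For intuition, here is a minimal NumPy sketch of a single Adam update (the hyperparameters are the commonly used defaults; this is an illustrative sketch, not the article's code):

import numpy as np

def adam_step(theta, gradient, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * gradient          # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * gradient ** 2     # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                    # bias correction (moments start at zero)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step
    return theta, m, v

# Usage sketch: initialise m and v to np.zeros_like(theta) and call once per
# iteration with t = 1, 2, 3, ...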

Applications of Stochastic Gradient Descent


SGD and its variants are widely used across various domains of machine
learning:

Deep Learning: In training deep neural networks, SGD is the default optimizer due to its efficiency with large datasets and its ability to work with large models. Deep learning frameworks like TensorFlow and PyTorch typically use variants like Adam or RMSprop, which are based on SGD (a minimal PyTorch sketch follows this list).
Natural Language Processing (NLP): Models like Word2Vec and
transformers are trained using SGD variants to optimize large models
on vast text corpora.
Computer Vision: For tasks such as image classification, object
detection, and segmentation, SGD has been fundamental in training
convolutional neural networks (CNNs).
Reinforcement Learning: SGD is also used to optimize the
parameters of models used in reinforcement learning, such as deep
Q-networks (DQNs) and policy gradient methods.
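As a concrete illustration of how SGD is typically invoked in a deep-learning framework, here is a minimal PyTorch sketch; the tiny model, data, and hyperparameters are placeholders, not from the article:

import torch
import torch.nn as nn

# Placeholder model and synthetic data (y is roughly 4 + 3x)
model = nn.Linear(1, 1)
X = 2 * torch.rand(100, 1)
y = 4 + 3 * X + torch.randn(100, 1)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.MSELoss()

for epoch in range(100):
    perm = torch.randperm(100)
    for i in range(0, 100, 10):                 # mini-batches of 10
        idx = perm[i:i + 10]
        optimizer.zero_grad()                   # clear accumulated gradients
        loss = loss_fn(model(X[idx]), y[idx])   # forward pass + loss
        loss.backward()                         # backpropagate
        optimizer.step()                        # SGD (with momentum) parameter update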

