Adam Optimizer in Machine Learning
Oles Hospodarskyy1,*,†, Vasyl Martsenyuk2,†, Nataliia Kukharska3,†, Andriy
Hospodarskyy4,†, Sofiia Sverstiuk5,†
1 Lviv Polytechnic National University, Bandery St. 12, Lviv, 79001, Ukraine
2 University of Bielsko-Biala, Willowa St. 2, Bielsko-Biala, 43-300, Poland
3 Ternopil Ivan Puluj National Technical University, Rus'ka St. 56, Ternopil, 46001, Ukraine
4 I. Horbachevsky Ternopil National Medical University, Maidan Voli, 1, Ternopil, 46002, Ukraine
5 Ternopil National Pedagogical University, 2 Maxyma Kryvonosa St., Ternopil, 46027, Ukraine
Abstract
Machine learning and artificial intelligence are significant areas of interest in both contemporary
science and society. Various optimization algorithms are used for this purpose. An algorithm's speed
depends on the size of the dataset, the number of model parameters, and the number of iterations.
Standard gradient descent requires computing the gradient of the cost function over the entire
dataset, which can be resource-intensive, especially with large datasets. In Adam, a separate
learning rate is maintained for each parameter weight, which is adapted and updated
individually. The algorithm selects a smaller learning rate for frequently updated parameters and
a larger one for parameters corresponding to rare features. To measure the effectiveness and
universality of Adam, we compared it with other optimization algorithms. Analysis of the
experimental results obtained on various datasets indicates a significant advantage of the Adam
optimization algorithm. To make sure our model works well for our specific needs, we made a
small dataset ourselves. The famous MNIST dataset, created by American researchers, might not
match our handwritten numbers perfectly. The results appear promising, with the model
achieving an accuracy of 97%, meaning it correctly predicted 97 out of 100 images. This level of
accuracy suggests that the model is performing well on our custom dataset, demonstrating its
effectiveness in recognizing and classifying our handwritten numbers. Experiments on various
datasets showed that the Adam algorithm is capable of achieving good results across a wide range
of machine learning tasks.
Keywords
Adam algorithm, machine learning, artificial intelligence, loss function, gradient descent
1. Introduction
Machine learning and artificial intelligence are significant areas of interest in both
contemporary science and society [1]. They represent some of the most advanced
CITI'2024: 2nd International Workshop on Computer Information Technologies in Industry 4.0, June 12-14, 2024,
Ternopil, Ukraine. ∗ Corresponding author. † These authors contributed equally.
oles.hospodarskyi.kb.2021@lpnu.ua (O. Hospodarskyy); vmartsenyuk@ath.bielsko.pl (V. Martsenyuk);
nataliia.p.kukharska@lpnu.ua (N. Kukharska); hospodarskyy@tdmu.edu.ua (A. Hospodarskyy);
khrystynasofia@gmail.com (S. Sverstiuk)
0009-0005-9088-3015 (O. Hospodarskyy); 0000-0001-5622-1038 (V. Martsenyuk); 0000-0002-0896-8361
(N. Kukharska); 0000-0002-9394-2675 (A. Hospodarskyy); 0000-0001-5595-4918 (S. Sverstiuk)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
technologies applicable across various industries, including healthcare, finance,
transportation and entertainment. The theoretical foundations of machine learning have
been explored and expanded upon by some of the greatest figures in the field. Geoffrey
Hinton, often revered as the "Godfather of Deep Learning," has laid the groundwork for
many modern machine learning techniques with his groundbreaking research on neural
networks.
Yann LeCun's work on convolutional neural networks (CNNs) has revolutionized
computer vision and pattern recognition, while Yoshua Bengio's contributions to neural
network models have greatly advanced natural language processing and unsupervised
learning. Ian Goodfellow's work on generative adversarial networks (GANs) has opened up
new avenues in unsupervised learning and generative modeling, while Juergen
Schmidhuber's contributions to recurrent neural networks (RNNs) and long short-term
memory (LSTM) networks have propelled advancements in sequential learning and AI [2].
Each year the number of scientific publications dedicated to algorithms and machine
learning methods continues to increase. However, despite significant progress made by
researchers in this field, a range of unresolved and insufficiently studied issues persists.
These include challenges related to optimization, which entail the search for optimal model
parameters to achieve maximum prediction accuracy.
The main objective of this article is to investigate the Adam optimization algorithm,
compare its effectiveness with other well-known algorithms on standard datasets such as
MNIST and FashionMNIST, and assess its accuracy in recognizing and classifying
handwritten characters [3].
Now we need an algorithm that will adjust the parameters of a model (w1, w2, ..., wn)
to minimize (or maximize) the loss function. That is the basic concept of optimization in
machine learning.
There are various optimization algorithms used for this task. These algorithms are
iterative, meaning they update the model parameters at each epoch of the training
process.
3. The description of the gradient descent algorithm as a fundamental
optimization technique
The idea of gradient descent is to update the parameters of the model (weights) by moving
in the direction of the steepest descent of the loss function [5].
At first, the algorithm initializes the weights randomly (or with zeros). Then it calculates
the loss function over the whole dataset, for example, the mean squared error (MSE).
Then gradient descent calculates the derivative of the loss function with respect to each
model parameter (weight) to determine the direction of the update.
After computing the gradient of the loss function with respect to each weight, the algorithm
updates the weights by subtracting a fraction of the gradient from the current value of each
weight. This fraction is known as the learning rate, denoted by α, and it controls the size of
the step taken in the direction of the negative gradient.
\[ w = w - \alpha \frac{\partial \mathrm{Loss}}{\partial w}, \qquad (2) \]
This process is repeated iteratively, and with each iteration, the algorithm progressively
approaches a local minimum of the loss function, where the weight values are optimal
(Figure 1).
Figure 1: The relationship between the loss function and the weight value w
Source: created by the author
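A minimal sketch of these steps on a made-up one-dimensional regression problem (the data, learning rate, and number of iterations below are illustrative assumptions, not values from our experiments):

import numpy as np

# made-up training data for y ≈ 3x
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * X

w = 0.0        # initialize the weight (here with zero)
alpha = 0.01   # learning rate

for step in range(200):
    predictions = w * X
    loss = np.mean((predictions - y) ** 2)        # MSE over the whole dataset
    grad = np.mean(2.0 * (predictions - y) * X)   # dLoss/dw
    w = w - alpha * grad                          # update rule (2)

print(w, loss)   # w approaches 3.0 and the loss approaches 0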
The algorithm's speed depends on the size of the dataset, the number of model
parameters, and the number of iterations. Typically, larger datasets and more complex
models require more time and resources for training. Additionally, the speed of the
algorithm can be influenced by the choice of learning rate. Selecting a learning rate that is too
large may cause the algorithm to move too quickly, overshoot, and fail to find the minimum of the
loss function (Figure 2), while choosing one that is too small may prolong the training
process.
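This sensitivity is easy to reproduce on a toy loss (a sketch assuming the loss L(w) = w², whose gradient is 2w): with a small learning rate the iterates approach the minimum at zero, while with one that is too large they move further away at every step.

def gradient(w):
    return 2.0 * w                    # gradient of the toy loss L(w) = w**2

for alpha in (0.1, 1.1):              # a workable and a too-large learning rate
    w = 1.0
    for _ in range(10):
        w = w - alpha * gradient(w)   # plain gradient descent step
    print(alpha, w)                   # 0.1 ends near 0; 1.1 ends far from the minimum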
That’s where Adam comes in handy. It uses the history of previous gradients to
adaptively adjust the learning rates for each parameter, helping to overcome the limitations
of stochastic gradient descent (SGD).
5. Adam
Adam was presented by Diederik Kingma from OpenAI and Jimmy Ba from the University
of Toronto in their 2015 ICLR paper (poster) titled “Adam: A Method for Stochastic
Optimization” [1]. The name Adam is derived from adaptive moment estimation.
Adam differs from classical stochastic gradient descent. Standard stochastic gradient
descent uses a single learning rate (alpha) for updating weights, and this learning rate
remains constant throughout training [6].
In Adam, a separate learning rate is maintained for each parameter weight, which is
adapted and updated individually. The algorithm selects a smaller learning rate for
frequently updated parameters and a larger one for parameters corresponding to rare
features.
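This behaviour can be illustrated numerically (a sketch with made-up gradients, not data from our experiments): a weight that receives a large gradient at every step accumulates a large second moment and therefore a small effective step, while a weight tied to a rare feature keeps its second moment small and receives a comparatively large step when its gradient finally appears.

import numpy as np

beta2, alpha, eps = 0.999, 0.001, 1e-8
v = np.zeros(2)   # running average of squared gradients for two parameters
for t in range(1, 101):
    # parameter 0 gets a gradient of 1.0 every step; parameter 1 only on the last step
    g = np.array([1.0, 1.0 if t == 100 else 0.0])
    v = beta2 * v + (1.0 - beta2) * g ** 2
step = alpha * g / (np.sqrt(v) + eps)   # effective steps (first moment and bias correction omitted)
print(step)                             # the rarely updated parameter receives the larger step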
The authors describe Adam as combining the advantages of two other extensions of
stochastic gradient descent, namely AdaGrad and RMSProp. Its behaviour is controlled by a
small set of configuration parameters:
1. Learning rate: Determines the size of the step by which the model weights will
change during each iteration. A large learning rate can lead to unstable model
training, while too small a value can slow down the learning process. Typically, an
initial learning rate is chosen, but Adam automatically adapts it over time.
2. Beta1 and Beta2: These parameters control the exponential smoothing of previous
gradients and their squares, respectively. Beta1 is responsible for smoothing
gradients, while Beta2 handles the smoothing of gradient squares. Typically, values
such as Beta1 = 0.9 and Beta2 = 0.999 work well, but they can be manually adjusted
if needed.
3. Epsilon: A small numerical value added to the denominator in the Adam formula to
avoid division by zero.
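For illustration, these three settings map directly onto the arguments of common library implementations of Adam, for example in PyTorch (the linear model below is only a placeholder; the paper's experiments do not depend on this snippet):

import torch

model = torch.nn.Linear(10, 1)   # any model with trainable parameters
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,             # initial learning rate, adapted per parameter during training
    betas=(0.9, 0.999),  # beta1 and beta2
    eps=1e-8,            # epsilon added to the denominator
)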
The Adam algorithm computes the exponential moving average of the gradient (first
moment) and the squared gradient (second moment) of the weights, where the parameters
beta1 and beta2 control the smoothing rates of these moving averages [7].
In the context of exponential moving average (Figure 4), smoothing occurs by assigning
more weight to newer data. Thus, the model responds more to the recent changes in data
than to older values, allowing it to adapt more quickly to any new data trends.
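Concretely, with g_t denoting the gradient at step t, the moving averages are updated as in the original Adam paper:

\[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \]
\[ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2. \]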
As m_t and v_t are initialized as vectors of zeros, they tend to be biased towards zero,
especially during the initial time steps, and especially when the decay rates are small (i.e. β1
and β2 are close to 1). In the Adam algorithm, a bias correction is done by adjusting the
estimates m_t and v_t, dividing them by (1 − β1^t) and (1 − β2^t), respectively, where t is the current step. This
reduces the bias towards zero, ensuring that the initial parameter updates are more
accurate.
\[ \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad (4) \]
\[ \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad (5) \]
Taking this correction into account, the parameter update rule takes the following form:
\[ w_{t+1} = w_t - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}, \qquad (6) \]
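As a simple test problem we can take a two-parameter objective; the snippet below is a minimal sketch that plots its contours (the bowl-shaped function f(x, y) = x² + y², the bounds, and the grid resolution are assumptions chosen for illustration):

import numpy as np
import matplotlib.pyplot as plt

def loss_function(x, y):
    return x ** 2.0 + y ** 2.0   # assumed two-parameter objective

bounds = np.array([[-1.0, 1.0], [-1.0, 1.0]])   # search range for each parameter
xaxis = np.arange(bounds[0, 0], bounds[0, 1], 0.1)
yaxis = np.arange(bounds[1, 0], bounds[1, 1], 0.1)
x, y = np.meshgrid(xaxis, yaxis)
results = loss_function(x, y)

plt.contourf(x, y, results, levels=50, cmap='jet')
plt.colorbar()
plt.show()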
Executing this code snippet generates a two-dimensional contour plot of the objective
loss function (Figure 5). This plot will serve as a visual representation of the points
investigated throughout the search for the local minimum of the function.
Let's move on to the Adam algorithm. First, we initialize the first and second moments
as zeros:
m = [0.0 for _ in range(bounds.shape[0])]  # first moment vector, one entry per parameter
v = [0.0 for _ in range(bounds.shape[0])]  # second moment vector, one entry per parameter
Now we need to apply the Adam parameter update rule. While in practice, a matrix
method is typically utilized for computation, for the sake of clarity in this example, we'll
employ an iterative approach. Given we have two parameters, we'll use a loop to update
both of them:
for i in range(x.shape[0]):
    # exponential moving averages of the gradient and of the squared gradient
    m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
    v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
Then we apply the bias correction:
    mhat = m[i] / (1.0 - beta1**(t+1))  # bias-corrected first moment
    vhat = v[i] / (1.0 - beta2**(t+1))  # bias-corrected second moment
In the end we update the parameters of the model and calculate the loss:
    w[i] = w[i] - alpha * mhat / (sqrt(vhat) + eps)  # per-parameter Adam update
score = loss_function(w[0], w[1])  # evaluate the objective at the updated point
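For completeness, the fragments above can be combined into a single runnable script (a sketch: the starting point, gradient function, iteration count, and hyperparameter values are illustrative assumptions, not necessarily those used to produce the figures):

import numpy as np

def loss_function(x, y):
    return x ** 2.0 + y ** 2.0

def gradient(x, y):
    return np.array([2.0 * x, 2.0 * y])   # analytical gradient of the objective

def adam(bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
    # start from a random point inside the bounds
    w = bounds[:, 0] + np.random.rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    m = [0.0 for _ in range(bounds.shape[0])]   # first moment
    v = [0.0 for _ in range(bounds.shape[0])]   # second moment
    for t in range(n_iter):
        g = gradient(w[0], w[1])
        for i in range(w.shape[0]):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            mhat = m[i] / (1.0 - beta1 ** (t + 1))
            vhat = v[i] / (1.0 - beta2 ** (t + 1))
            w[i] = w[i] - alpha * mhat / (np.sqrt(vhat) + eps)
        score = loss_function(w[0], w[1])
        print('>%d f(%s) = %.5f' % (t, w, score))
    return w, score

np.random.seed(1)
bounds = np.array([[-1.0, 1.0], [-1.0, 1.0]])
best, score = adam(bounds, n_iter=60, alpha=0.02, beta1=0.8, beta2=0.999)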
Figure 6 illustrates the outcome of executing the code. The "Score" indicates the
value of the loss function.
For comparison, the gradient descent algorithm, with the same function and the same
number of iterations, achieved significantly worse results (Figure 7).
Figure 7: Graph illustrating the performance of the gradient descent algorithm
Source: created by the author
Figure 10 illustrates the training process of models on the FashionMNIST dataset, which
is slightly more complex. Despite this complexity, Adam managed to outperform other
algorithms, demonstrating its effectiveness even for more challenging tasks. This is further
supported by Figure 11, where it is shown that the test accuracy of Adam exceeds that of all
other algorithms.
Figure 10: Graph illustrating the performance of different algorithms on the FashionMNIST
dataset
Source: created by the author
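A minimal sketch of how such a comparison can be set up in PyTorch is shown below (the network architecture, batch size, learning rates, and single training epoch are assumptions for illustration and are not the exact settings used in our experiments):

import torch
from torch import nn, optim
from torchvision import datasets, transforms

train_data = datasets.FashionMNIST(root='data', train=True, download=True,
                                   transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True)

def make_model():
    # small fully connected classifier for 28x28 grayscale images, 10 classes
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))

optimizers = {
    'SGD': lambda params: optim.SGD(params, lr=0.01),
    'RMSprop': lambda params: optim.RMSprop(params, lr=0.001),
    'Adam': lambda params: optim.Adam(params, lr=0.001),
}

criterion = nn.CrossEntropyLoss()
for name, make_optimizer in optimizers.items():
    model = make_model()
    opt = make_optimizer(model.parameters())
    for images, labels in loader:   # one epoch per optimizer, for brevity
        opt.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        opt.step()
    print(name, loss.item())        # loss on the last batch after one epoch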
Analysis of the experimental results obtained on various datasets, including MNIST and
FashionMNIST, indicates a significant advantage of the Adam optimization algorithm. Its
effectiveness was demonstrated regardless of the complexity of object structures and the
diversity of classes in the datasets. Interestingly, while some algorithms may have shown
slightly better results on datasets with simpler structures and fewer classes, Adam proved
to be more efficient in all modeled scenarios. Overall, Adam provided faster and higher-
quality solutions to classification tasks compared to most other algorithms, confirming its
advantages in machine learning.
To make sure our model works well for our specific needs, we made a small dataset
ourselves. The famous MNIST dataset, created by American researchers, might not match
our handwritten numbers perfectly. So, we wanted to see if our model could still understand
and categorize our handwritten characters correctly. This way, we could check if our model
is flexible and reliable for our purposes, not just for standard datasets.
The dataset consists of 100 handwritten numbers from 0 to 9 (Figure 12).
The results appear promising, with the model achieving an accuracy of 97%, meaning it
correctly predicted 97 out of 100 images. This level of accuracy suggests that the model is
performing well on our custom dataset, demonstrating its effectiveness in recognizing and
classifying our handwritten numbers.
In further research, it is planned to use the Adam algorithm to analyze data from
cyber-physical systems [8, 9], biosensors [10], and the results of cardiac signal processing
[11].
8. Conclusions
In this study, the Adam algorithm was investigated in the context of optimization in machine
learning. The main conclusions and results of the study are as follows:
1. The Adam algorithm is an effective optimization method that combines ideas from
other algorithms such as RMSProp and AdaGrad.
2. Experiments on various datasets, such as MNIST and FashionMNIST, showed that the
Adam algorithm is capable of achieving good results across a wide range of machine
learning tasks.
3. The Adam algorithm is effective for optimizing tasks involving both large and small
datasets, as demonstrated by experimental results.
9. References
[1] D. P. Kingma and J. L. Ba, Adam: a method for stochastic optimization,
arXiv:1412.6980v9 [cs.LG], 2015.
[2] R. Zaheer and H. Shaziya, A Study of the Optimization Algorithms in Deep Learning,
March 2020.
[3] H. Xiao, K. Rasul, and R. Vollgraf, Fashion-mnist: a novel image dataset for
benchmarking machine learning algorithms, arXiv preprint arXiv:1708.07747, 2017.
[4] S. Ruder, An overview of gradient descent optimization algorithms, arXiv preprint
arXiv:1609.04747, 2016.
[5] T. Ben-Nun, T. Hoefler, Demystifying Parallel and Distributed Deep Learning: An In-
Depth Concurrency Analysis, arXiv:1802.09941v2 [cs.LG], 2018.
[6] J. Brownlee, Gentle Introduction to the Adam Optimization Algorithm for Deep
Learning, 2017, https://machinelearningmastery.com/adam-optimization-algorithm-
for-deep-learning/
[7] J. Brownlee, Code Adam Optimization Algorithm From Scratch, 2021,
https://machinelearningmastery.com/adam-optimization-from-scratch/
[8] V. Martsenyuk, A. Sverstiuk, A. Klos-Witkowska, N. Kozodii, O. Bagriy-Zayats, I.
Zubenko, Numerical analysis of results simulation of cyber-physical biosensor systems.
CEUR Workshop Proceedings, 2019, 2516, pp. 149–164.
[9] V. Martsenyuk, A. Sverstiuk, O. Bahrii-Zaiats, A. Kłos-Witkowska, Qualitative and
Quantitative Comparative Analysis of Results of Numerical Simulation of Cyber-
Physical Biosensor Systems. CEUR Workshop Proceedings, 2022, 3309, pp. 134–149.
[10] V. Martsenyuk, A. Klos-Witkowska, S. Dzyadevych, A. Sverstiuk, Nonlinear Analytics for
Electrochemical Biosensor Design Using Enzyme Aggregates and Delayed Mass Action.
Sensors, 2022, 22(3), 980.
[11] V. Trysnyuk, A. Zozulia, S. Lupenko, I. Lytvynenko, A. Sverstiuk, Methods of rhythm-
cardio signals processing based on a mathematical model in the form of a vector of
stationary and stationary connected random sequences. CEUR Workshop
Proceedings, 2021, 3021, pp. 197–205.