Stochastic Gradient Descent
Stochastic gradient descent (often abbreviated SGD) is an iterative method for optimizing an objective
function with suitable smoothness properties (e.g. differentiable or subdifferentiable). It can be regarded as
a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated
from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data).
Especially in high-dimensional optimization problems this reduces the very high computational burden,
achieving faster iterations in exchange for a lower convergence rate.[1]
While the basic idea behind stochastic approximation can be traced back to the Robbins–Monro algorithm
of the 1950s, stochastic gradient descent has become an important optimization method in machine
learning.[2]
Background
Both statistical estimation and machine learning consider the problem of minimizing an objective function that has the form of a sum:

$$Q(w) = \frac{1}{n} \sum_{i=1}^{n} Q_i(w),$$

where the parameter $w$ that minimizes $Q(w)$ is to be estimated. Each summand function $Q_i$ is typically associated with the $i$-th observation in the data set (used for training).
The sum-minimization problem also arises for empirical risk minimization. In this case, $Q_i(w)$ is the value of the loss function at the $i$-th example, and $Q(w)$ is the empirical risk.
When used to minimize the above function, a standard (or "batch") gradient descent method would perform the following iterations:

$$w := w - \eta \, \nabla Q(w) = w - \frac{\eta}{n} \sum_{i=1}^{n} \nabla Q_i(w),$$

where $\eta$ is a step size (sometimes called the learning rate in machine learning).
In many cases, the summand functions have a simple form that enables inexpensive evaluations of the sum-
function and the sum gradient. For example, in statistics, one-parameter exponential families allow
economical function-evaluations and gradient-evaluations.
However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients
from all summand functions. When the training set is enormous and no simple formulas exist, evaluating
the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the
summand functions' gradients. To economize on the computational cost at every iteration, stochastic
gradient descent samples a subset of summand functions at every step. This is very effective in the case of
large-scale machine learning problems.[4]
Iterative method
In stochastic (or "on-line") gradient descent, the true gradient of
is approximated by a gradient at a single sample:
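In code, the basic procedure amounts to repeatedly shuffling the training set and applying the single-sample update above. The following is a minimal illustrative Python sketch (the per-sample gradient function grad_Qi and the data layout are hypothetical placeholders, not part of any particular library):

import random

def sgd(grad_Qi, w, n, eta=0.01, epochs=10):
    """Plain stochastic gradient descent: one sample per update.

    grad_Qi(w, i) -- gradient of the i-th summand Q_i at w (user-supplied)
    w             -- initial parameter vector (list of floats)
    n             -- number of training samples
    """
    indices = list(range(n))
    for _ in range(epochs):
        random.shuffle(indices)          # shuffle each pass to avoid cycles
        for i in indices:
            g = grad_Qi(w, i)            # gradient at a single sample
            w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w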
A compromise between computing the true gradient and the gradient at a single sample is to compute the gradient against more than one training sample (called a "mini-batch") at each step. This can perform significantly better than "true" stochastic gradient descent as described above, because the code can make use of vectorization libraries rather than computing each step separately, as was first shown in [6], where it was called "the bunch-mode back-propagation algorithm". It may also result in smoother convergence, as the gradient computed at each step is averaged over more training samples.
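A minimal sketch of the mini-batch gradient estimate, assuming NumPy arrays and a user-supplied vectorized gradient function (the name grad_Qi_vec is a placeholder):

import numpy as np

def minibatch_grad(grad_Qi_vec, w, X, y, batch_size=32, rng=None):
    """Average the gradient over a randomly drawn mini-batch.

    grad_Qi_vec(w, Xb, yb) -- vectorized gradient over a batch (user-supplied)
    """
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(X), size=batch_size, replace=False)
    # one vectorized call instead of batch_size separate gradient evaluations
    return grad_Qi_vec(w, X[idx], y[idx])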
The convergence of stochastic gradient descent has been analyzed using the theories of convex
minimization and of stochastic approximation. Briefly, when the learning rates decrease with an
appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost
surely to a global minimum when the objective function is convex or pseudoconvex, and otherwise
converges almost surely to a local minimum.[7][8] This is in fact a consequence of the Robbins–Siegmund
theorem.[9]
Example
Suppose we want to fit a straight line $\hat{y} = w_1 + w_2 x$ to a training set with observations $(x_1, x_2, \ldots, x_n)$, corresponding responses $(y_1, y_2, \ldots, y_n)$, and fitted values $\hat{y}_i = w_1 + w_2 x_i$, using least squares. The objective function to be minimized is:

$$Q(w) = \sum_{i=1}^{n} Q_i(w) = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \sum_{i=1}^{n} (w_1 + w_2 x_i - y_i)^2.$$

For this specific problem, the single-sample update step above becomes:

$$\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} := \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} - \eta \begin{bmatrix} 2(w_1 + w_2 x_i - y_i) \\ 2 x_i (w_1 + w_2 x_i - y_i) \end{bmatrix}.$$

Note that in each iteration (also called an update), the gradient is only evaluated at a single point $x_i$ instead of at the set of all samples.
The key difference compared to standard (batch) gradient descent is that only one piece of data from the dataset is used to calculate the step, and that piece of data is picked randomly at each step.
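The update above can be tried directly on synthetic data. The following illustrative Python sketch (the data, step size, and number of passes are arbitrary choices) fits the line by applying the single-sample update:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 3.0 + 2.0 * x + rng.normal(0, 0.1, 200)   # synthetic line with noise

w1, w2, eta = 0.0, 0.0, 0.05
for _ in range(50):                            # passes over the data
    for i in rng.permutation(len(x)):
        r = w1 + w2 * x[i] - y[i]              # residual at a single sample
        w1 -= eta * 2 * r                      # gradient with respect to w1
        w2 -= eta * 2 * r * x[i]               # gradient with respect to w2
print(w1, w2)                                  # approximately 3.0 and 2.0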
Notable applications
Stochastic gradient descent is a popular algorithm for training a wide range of models in machine learning,
including (linear) support vector machines, logistic regression (see, e.g., Vowpal Wabbit) and graphical
models.[10] When combined with the backpropagation algorithm, it is the de facto standard algorithm for
training artificial neural networks.[11] Its use has also been reported in the Geophysics community, specifically for applications of Full Waveform Inversion (FWI).[12]
Stochastic gradient descent competes with the L-BFGS algorithm, which is also widely used. Stochastic
gradient descent has been used since at least 1960 for training linear regression models, originally under the
name ADALINE.[13]
Another stochastic gradient descent algorithm is the least mean squares (LMS) adaptive filter.
Implicit updates (ISGD)
As mentioned earlier, classical stochastic gradient descent is generally sensitive to the learning rate η. Fast convergence requires large learning rates, but this may induce numerical instability. The problem can be largely solved[17] by considering implicit updates, whereby the stochastic gradient is evaluated at the next iterate rather than the current one:

$$w^{new} := w^{old} - \eta \, \nabla Q_i(w^{new}).$$

This equation is implicit since $w^{new}$ appears on both sides of the equation. It is a stochastic form of the proximal gradient method since the update can also be written as:

$$w^{new} := \arg\min_{w} \left\{ Q_i(w) + \frac{1}{2\eta} \left\| w - w^{old} \right\|^2 \right\}.$$

As an example, consider least squares with features $x_1, \ldots, x_n \in \mathbb{R}^p$ and observations $y_1, \ldots, y_n \in \mathbb{R}$. We wish to solve:

$$\min_{w} \sum_{j=1}^{n} (y_j - x_j' w)^2,$$

where $x_j' w$ indicates the inner product. Note that $x_j$ could have "1" as the first element to include an intercept. Classical stochastic gradient descent proceeds as follows:

$$w^{new} = w^{old} + \eta \, (y_i - x_i' w^{old}) \, x_i,$$

where $i$ is uniformly sampled between 1 and $n$. Although theoretical convergence of this procedure happens under relatively mild assumptions, in practice the procedure can be quite unstable. In particular, when $\eta$ is misspecified so that $I - \eta x_i x_i'$ has large absolute eigenvalues with high probability, the procedure may diverge numerically within a few iterations. In contrast, implicit stochastic gradient descent (shortened as ISGD) can be solved in closed form as:

$$w^{new} = w^{old} + \frac{\eta}{1 + \eta \|x_i\|^2} (y_i - x_i' w^{old}) \, x_i.$$

This procedure will remain numerically stable for virtually all $\eta$, as the learning rate is now normalized.
Such comparison between classical and implicit stochastic gradient descent in the least squares problem is
very similar to the comparison between least mean squares (LMS) and normalized least mean squares filter
(NLMS).
Even though a closed-form solution for ISGD is only possible in least squares, the procedure can be efficiently implemented in a wide range of models. Specifically, suppose that $Q_i(w)$ depends on $w$ only through a linear combination with features $x_i$, so that we can write $\nabla_w Q_i(w) = -q(x_i' w)\, x_i$, where $q(\cdot) \in \mathbb{R}$ may depend on $x_i, y_i$ as well, but not on $w$ except through $x_i' w$. Least squares obeys this rule, and so do logistic regression and most generalized linear models. For instance, in least squares, $q(x_i' w) = y_i - x_i' w$, and in logistic regression $q(x_i' w) = y_i - S(x_i' w)$, where $S(u) = e^u / (1 + e^u)$ is the logistic function. In Poisson regression, $q(x_i' w) = y_i - e^{x_i' w}$, and so on.
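A minimal sketch of ISGD for the least-squares case, using the closed-form normalized update given above (NumPy assumed; the function and argument names are illustrative, not from any particular library):

import numpy as np

def isgd_least_squares(X, y, eta=0.5, epochs=5, rng=None):
    """Implicit SGD for least squares via the closed-form update
    w <- w + eta / (1 + eta * ||x_i||^2) * (y_i - x_i . w) * x_i."""
    rng = rng or np.random.default_rng(0)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(epochs):
        for i in rng.permutation(n):
            xi, yi = X[i], y[i]
            scale = eta / (1.0 + eta * (xi @ xi))   # normalized learning rate
            w = w + scale * (yi - xi @ w) * xi
    return w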
Momentum
Further proposals include the momentum method or the heavy ball method, which in ML context appeared
in Rumelhart, Hinton and Williams' paper on backpropagation learning[18] and borrowed the idea from
Soviet mathematician Boris Polyak's 1964 article on solving functional equations.[19] Stochastic gradient
descent with momentum remembers the update Δw at each iteration, and determines the next update as a
linear combination of the gradient and the previous update:[20][21]
where the parameter which minimizes is to be estimated, is a step size (sometimes called the
learning rate in machine learning) and is an exponential decay factor between 0 and 1 that determines
the relative contribution of the current gradient and earlier gradients to the weight change.
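A minimal sketch of the momentum update in Python, assuming NumPy arrays and a user-supplied per-sample gradient grad_Qi (a placeholder):

import numpy as np

def sgd_momentum(grad_Qi, w, n, eta=0.01, alpha=0.9, epochs=10, rng=None):
    """SGD with (heavy-ball) momentum:
    delta <- alpha * delta - eta * grad;  w <- w + delta."""
    rng = rng or np.random.default_rng(0)
    w = np.asarray(w, dtype=float)
    delta = np.zeros_like(w)
    for _ in range(epochs):
        for i in rng.permutation(n):
            delta = alpha * delta - eta * grad_Qi(w, i)   # remember previous update
            w = w + delta
    return w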
The name momentum stems from an analogy to momentum in physics: the weight vector $w$, thought of as a particle traveling through parameter space,[18] incurs acceleration from the gradient of the loss ("force"). Unlike in classical stochastic gradient descent, it tends to keep traveling in the same direction, preventing oscillations. Momentum has been used successfully by computer scientists in the training of artificial neural networks for several decades.[22] The momentum method is closely related to underdamped Langevin dynamics, and may be combined with simulated annealing.[23]
In the mid-1980s the method was modified by Yurii Nesterov to use the gradient predicted at the next point, and the resulting so-called Nesterov Accelerated Gradient was sometimes used in ML in the 2010s.[24]
Averaging
Averaged stochastic gradient descent, invented independently by Ruppert and Polyak in the late 1980s, is ordinary stochastic gradient descent that records an average of its parameter vector over time. That is, the update is the same as for ordinary stochastic gradient descent, but the algorithm also keeps track of[25]

$$\bar{w} = \frac{1}{t} \sum_{i=0}^{t-1} w_i.$$

When optimization is done, this averaged parameter vector takes the place of $w$.
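A minimal sketch of averaged SGD, maintaining an incremental mean of the iterates alongside the ordinary updates (NumPy assumed; grad_Qi is a placeholder):

import numpy as np

def averaged_sgd(grad_Qi, w, n, eta=0.01, epochs=10, rng=None):
    """Ordinary SGD that also maintains the running average of its iterates."""
    rng = rng or np.random.default_rng(0)
    w = np.asarray(w, dtype=float)
    w_bar, t = w.copy(), 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            w = w - eta * grad_Qi(w, i)          # ordinary SGD step
            t += 1
            w_bar += (w - w_bar) / t             # incremental mean of the iterates
    return w_bar                                 # the averaged vector replaces w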
AdaGrad
AdaGrad (for adaptive gradient algorithm) is a modified stochastic gradient descent algorithm with per-
parameter learning rate, first published in 2011.[26] Informally, this increases the learning rate for sparser
parameters and decreases the learning rate for ones that are less sparse. This strategy often improves
convergence performance over standard stochastic gradient descent in settings where data is sparse and
sparse parameters are more informative. Examples of such applications include natural language processing
and image recognition.[26]
It still has a base learning rate η, but this is multiplied with the elements of a vector $\{G_{j,j}\}$, which is the diagonal of the outer product matrix

$$G = \sum_{\tau=1}^{t} g_\tau g_\tau^{\mathsf{T}},$$

where $g_\tau = \nabla Q_i(w)$ is the gradient at iteration $\tau$. This vector essentially stores a historical sum of gradient squares by dimension and is updated after every iteration. The formula for an update is now

$$w := w - \eta \, \mathrm{diag}(G)^{-1/2} \odot g,$$ [a]

or, written as per-parameter updates,

$$w_j := w_j - \frac{\eta}{\sqrt{G_{j,j}}} g_j.$$

Each $\{G_{(i,i)}\}$ gives rise to a scaling factor for the learning rate that applies to a single parameter $w_i$. Since the denominator in this factor, $\sqrt{G_{i,i}} = \sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}$, is the ℓ2 norm of previous derivatives, extreme parameter updates get dampened, while parameters that get few or small updates receive higher learning rates.[22]
While designed for convex problems, AdaGrad has been successfully applied to non-convex
optimization.[27]
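A minimal sketch of the AdaGrad update, keeping only the diagonal of the accumulated outer-product matrix (NumPy assumed; grad_Qi is a placeholder; the small eps term is an implementation convenience to avoid division by zero):

import numpy as np

def adagrad(grad_Qi, w, n, eta=0.1, eps=1e-8, epochs=10, rng=None):
    """AdaGrad: per-parameter learning rate eta / sqrt(sum of squared gradients)."""
    rng = rng or np.random.default_rng(0)
    w = np.asarray(w, dtype=float)
    G = np.zeros_like(w)                          # diagonal of the outer-product matrix
    for _ in range(epochs):
        for i in rng.permutation(n):
            g = grad_Qi(w, i)
            G += g * g                            # accumulate squared gradients per dimension
            w = w - eta * g / (np.sqrt(G) + eps)  # dampens frequently updated parameters
    return w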
RMSProp
RMSProp (for Root Mean Square Propagation) is a method invented by Geoffrey Hinton in 2012 in which
the learning rate is, like in Adagrad, adapted for each of the parameters. The idea is to divide the learning
rate for a weight by a running average of the magnitudes of recent gradients for that weight.[28] Unusually,
it was not published in an article but merely described in a Coursera lecture.
First, the running average is calculated in terms of the mean square:

$$v(w, t) := \gamma \, v(w, t-1) + (1 - \gamma) \left( \nabla Q_i(w) \right)^2,$$

where $\gamma$ is the forgetting factor. The concept of storing the historical gradient as a sum of squares is borrowed from Adagrad, but "forgetting" is introduced to solve Adagrad's diminishing learning rates in non-convex problems by gradually decreasing the influence of old data.[29]
The parameters are then updated as

$$w := w - \frac{\eta}{\sqrt{v(w, t)}} \nabla Q_i(w).$$
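A minimal sketch of the RMSProp update (NumPy assumed; grad_Qi is a placeholder; the eps term is an implementation convenience to avoid division by zero):

import numpy as np

def rmsprop(grad_Qi, w, n, eta=0.001, gamma=0.9, eps=1e-8, epochs=10, rng=None):
    """RMSProp: divide the step by a running RMS of recent gradients."""
    rng = rng or np.random.default_rng(0)
    w = np.asarray(w, dtype=float)
    v = np.zeros_like(w)
    for _ in range(epochs):
        for i in rng.permutation(n):
            g = grad_Qi(w, i)
            v = gamma * v + (1 - gamma) * g * g   # exponentially forgetting mean square
            w = w - eta * g / (np.sqrt(v) + eps)
    return w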
Adam
Adam[30] (short for Adaptive Moment Estimation) is a 2014 update to the RMSProp optimizer combining it
with the main feature of the Momentum method.[31] In this optimization algorithm, running averages with
exponential forgetting of both the gradients and the second moments of the gradients are used. Given
parameters $w^{(t)}$ and a loss function $L^{(t)}$, where $t$ indexes the current training iteration (indexed at 0), Adam's parameter update is given by:

$$m_w^{(t+1)} \leftarrow \beta_1 m_w^{(t)} + (1 - \beta_1) \nabla_w L^{(t)}$$
$$v_w^{(t+1)} \leftarrow \beta_2 v_w^{(t)} + (1 - \beta_2) \left( \nabla_w L^{(t)} \right)^2$$
$$\hat{m}_w = \frac{m_w^{(t+1)}}{1 - \beta_1^{t+1}}, \qquad \hat{v}_w = \frac{v_w^{(t+1)}}{1 - \beta_2^{t+1}}$$
$$w^{(t+1)} \leftarrow w^{(t)} - \eta \, \frac{\hat{m}_w}{\sqrt{\hat{v}_w} + \epsilon}$$

where $\epsilon$ is a small scalar (e.g. $10^{-8}$) used to prevent division by 0, and $\beta_1$ (e.g. 0.9) and $\beta_2$ (e.g. 0.999) are the forgetting factors for gradients and second moments of gradients, respectively. Squaring and square-rooting is done element-wise. The profound influence of this algorithm inspired multiple newer, less well-known momentum-based optimization schemes using Nesterov-enhanced gradients (e.g. NAdam[32] and FASFA[33]) and varying interpretations of second-order information (e.g. Powerpropagation[34] and AdaSqrt[35]). However, the most commonly used variants are AdaMax,[30] which generalizes Adam using the infinity norm, and AMSGrad,[36] which addresses convergence problems from Adam by using the maximum of past squared gradients instead of the exponential average.[37]
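A minimal sketch of the Adam update with bias correction (NumPy assumed; the stochastic-gradient callback grad_L is a placeholder for whatever loss gradient is used at iteration t):

import numpy as np

def adam(grad_L, w, steps=1000, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: bias-corrected running averages of the gradient and its square."""
    w = np.asarray(w, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_L(w, t)                          # stochastic gradient at iteration t
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)              # bias correction of first moment
        v_hat = v / (1 - beta2 ** t)              # bias correction of second moment
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w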
AdamW[38] is a later update which mitigates a suboptimal choice of the weight decay algorithm in Adam.
Backtracking line search
Backtracking line search is another variant of gradient descent. All of the below is sourced from the mentioned link. It is based on a condition known as the Armijo–Goldstein condition. Both methods allow learning rates to change at each iteration; however, the manner of the change is different. Backtracking line search uses function evaluations to check Armijo's condition, and in principle the loop in the algorithm for determining the learning rates can be long and unknown in advance. Adaptive SGD does not need a loop in determining learning rates. On the other hand, adaptive SGD does not guarantee the "descent property" – which backtracking line search enjoys – which is that $f(x_{n+1}) \leq f(x_n)$ for all $n$. If the gradient of the cost function is globally Lipschitz continuous, with Lipschitz constant $L$, and the learning rate is chosen of the order $1/L$, then the standard version of SGD is a special case of backtracking line search.
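A minimal sketch of a backtracking (Armijo) step, shown here for the deterministic full-gradient case for simplicity: the learning rate is shrunk until a sufficient-decrease condition holds (NumPy assumed; f and grad_f are placeholders, and the constants are conventional illustrative choices):

import numpy as np

def backtracking_step(f, grad_f, x, eta0=1.0, beta=0.5, c=1e-4, max_halvings=50):
    """Shrink the step until the Armijo condition
    f(x - eta*g) <= f(x) - c*eta*||g||^2 holds, then take the step."""
    g = grad_f(x)
    eta = eta0
    for _ in range(max_halvings):
        if f(x - eta * g) <= f(x) - c * eta * (g @ g):
            break                                 # sufficient decrease achieved
        eta *= beta                               # backtrack: reduce the learning rate
    return x - eta * g                            # descent step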
Second-order methods
Stochastic analogues of Newton's method and quasi-Newton methods, which use approximate second-order (curvature) information rather than the gradient alone, have also been developed for this setting.[39][40][41][42][43][44]
History
SGD was gradually developed by several research communities during the 1950s, building on the Robbins–Monro stochastic approximation method mentioned above.
Notes
a. $\odot$ denotes the element-wise product.
See also
Backtracking line search
Coordinate descent – changes one coordinate at a time, rather than one example
Linear classifier
Online machine learning
Stochastic hill climbing
Stochastic variance reduction
References
1. Bottou, Léon; Bousquet, Olivier (2012). "The Tradeoffs of Large Scale Learning" (https://boo
ks.google.com/books?id=JPQx7s2L1A8C&pg=PA351). In Sra, Suvrit; Nowozin, Sebastian;
Wright, Stephen J. (eds.). Optimization for Machine Learning. Cambridge: MIT Press.
pp. 351–368. ISBN 978-0-262-01646-9.
2. Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations". Online Learning
and Neural Networks (https://archive.org/details/onlinelearningin0000unse). Cambridge
University Press. ISBN 978-0-521-65263-6.
3. Ferguson, Thomas S. (1982). "An inconsistent maximum likelihood estimate". Journal of the
American Statistical Association. 77 (380): 831–834. doi:10.1080/01621459.1982.10477894
(https://doi.org/10.1080%2F01621459.1982.10477894). JSTOR 2287314 (https://www.jstor.
org/stable/2287314).
4. Bottou, Léon; Bousquet, Olivier (2008). The Tradeoffs of Large Scale Learning (http://leon.bo
ttou.org/papers/bottou-bousquet-2008). Advances in Neural Information Processing
Systems. Vol. 20. pp. 161–168.
5. Murphy, Kevin (2021). Probabilistic Machine Learning: An Introduction (https://probml.github.
io/pml-book/book1.html). Probabilistic Machine Learning: An Introduction. MIT Press.
Retrieved 10 April 2021.
6. Bilmes, Jeff; Asanovic, Krste; Chin, Chee-Whye; Demmel, James (April 1997). "Using
PHiPAC to speed error back-propagation learning" (https://ieeexplore.ieee.org/document/60
4861). 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.
ICASSP. Munich, Germany: IEEE. pp. 4153-4156 vol.5. doi:10.1109/ICASSP.1997.604861
(https://doi.org/10.1109%2FICASSP.1997.604861).
7. Bottou, Léon (1998). "Online Algorithms and Stochastic Approximations". Online Learning
and Neural Networks (https://archive.org/details/onlinelearningin0000unse). Cambridge
University Press. ISBN 978-0-521-65263-6.
8. Kiwiel, Krzysztof C. (2001). "Convergence and efficiency of subgradient methods for
quasiconvex minimization". Mathematical Programming, Series A. Berlin, Heidelberg:
Springer. 90 (1): 1–25. doi:10.1007/PL00011414 (https://doi.org/10.1007%2FPL00011414).
ISSN 0025-5610 (https://www.worldcat.org/issn/0025-5610). MR 1819784 (https://mathscine
t.ams.org/mathscinet-getitem?mr=1819784). S2CID 10043417 (https://api.semanticscholar.o
rg/CorpusID:10043417).
9. Robbins, Herbert; Siegmund, David O. (1971). "A convergence theorem for non negative
almost supermartingales and some applications". In Rustagi, Jagdish S. (ed.). Optimizing
Methods in Statistics. Academic Press. ISBN 0-12-604550-X.
10. Jenny Rose Finkel, Alex Kleeman, Christopher D. Manning (2008). Efficient, Feature-based,
Conditional Random Field Parsing (http://www.aclweb.org/anthology/P08-1109). Proc.
Annual Meeting of the ACL.
11. LeCun, Yann A., et al. "Efficient backprop." Neural networks: Tricks of the trade. Springer
Berlin Heidelberg, 2012. 9-48 (http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)
12. Jerome R. Krebs, John E. Anderson, David Hinkley, Ramesh Neelamani, Sunwoong Lee,
Anatoly Baumstein, and Martin-Daniel Lacasse, (2009), "Fast full-wavefield seismic
inversion using encoded sources," GEOPHYSICS 74: WCC177-WCC188. (https://library.se
g.org/doi/abs/10.1190/1.3230502)
13. Avi Pfeffer. "CS181 Lecture 5 — Perceptrons" (http://www.seas.harvard.edu/courses/cs181/fi
les/lecture05-notes.pdf) (PDF). Harvard University.
14. Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). Deep Learning (https://www.deepl
earningbook.org). MIT Press. p. 291. ISBN 978-0262035613.
15. Cited by Darken, Christian; Moody, John (1990). Fast adaptive k-means clustering: some
empirical results. Int'l Joint Conf. on Neural Networks (IJCNN). IEEE.
doi:10.1109/IJCNN.1990.137720 (https://doi.org/10.1109%2FIJCNN.1990.137720).
16. Spall, J. C. (2003). Introduction to Stochastic Search and Optimization: Estimation,
Simulation, and Control. Hoboken, NJ: Wiley. pp. Sections 4.4, 6.6, and 7.5. ISBN 0-471-
33052-3.
17. Toulis, Panos; Airoldi, Edoardo (2017). "Asymptotic and finite-sample properties of
estimators based on stochastic gradients". Annals of Statistics. 45 (4): 1694–1727.
arXiv:1408.2923 (https://arxiv.org/abs/1408.2923). doi:10.1214/16-AOS1506 (https://doi.org/
10.1214%2F16-AOS1506). S2CID 10279395 (https://api.semanticscholar.org/CorpusID:102
79395).
18. Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning
representations by back-propagating errors". Nature. 323 (6088): 533–536.
Bibcode:1986Natur.323..533R (https://ui.adsabs.harvard.edu/abs/1986Natur.323..533R).
doi:10.1038/323533a0 (https://doi.org/10.1038%2F323533a0). S2CID 205001834 (https://ap
i.semanticscholar.org/CorpusID:205001834).
19. "Gradient Descent and Momentum: The Heavy Ball Method" (https://boostedml.com/2020/0
7/gradient-descent-and-momentum-the-heavy-ball-method.html). 13 July 2020.
20. Sutskever, Ilya; Martens, James; Dahl, George; Hinton, Geoffrey E. (June 2013). Sanjoy
Dasgupta and David Mcallester (ed.). On the importance of initialization and momentum in
deep learning (http://www.cs.utoronto.ca/~ilya/pubs/2013/1051_2.pdf) (PDF). In Proceedings
of the 30th international conference on machine learning (ICML-13). Vol. 28. Atlanta, GA.
pp. 1139–1147. Retrieved 14 January 2016.
21. Sutskever, Ilya (2013). Training recurrent neural networks (http://www.cs.utoronto.ca/~ilya/pu
bs/ilya_sutskever_phd_thesis.pdf) (PDF) (Ph.D.). University of Toronto. p. 74.
22. Zeiler, Matthew D. (2012). "ADADELTA: An adaptive learning rate method". arXiv:1212.5701
(https://arxiv.org/abs/1212.5701) [cs.LG (https://arxiv.org/archive/cs.LG)].
23. Borysenko, Oleksandr; Byshkin, Maksym (2021). "CoolMomentum: A Method for Stochastic
Optimization by Langevin Dynamics with Simulated Annealing" (https://www.ncbi.nlm.nih.go
v/pmc/articles/PMC8139967). Scientific Reports. 11 (1): 10705. arXiv:2005.14605 (https://arx
iv.org/abs/2005.14605). Bibcode:2021NatSR..1110705B (https://ui.adsabs.harvard.edu/abs/
2021NatSR..1110705B). doi:10.1038/s41598-021-90144-3 (https://doi.org/10.1038%2Fs415
98-021-90144-3). PMC 8139967 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8139967).
PMID 34021212 (https://pubmed.ncbi.nlm.nih.gov/34021212).
24. "Papers with Code - Nesterov Accelerated Gradient Explained" (https://paperswithcode.com/
method/nesterov-accelerated-gradient).
25. Polyak, Boris T.; Juditsky, Anatoli B. (1992). "Acceleration of stochastic approximation by
averaging" (http://www.meyn.ece.ufl.edu/archive/spm_files/Courses/ECE555-2011/555medi
a/poljud92.pdf) (PDF). SIAM J. Control Optim. 30 (4): 838–855. doi:10.1137/0330046 (http
s://doi.org/10.1137%2F0330046).
26. Duchi, John; Hazan, Elad; Singer, Yoram (2011). "Adaptive subgradient methods for online
learning and stochastic optimization" (http://jmlr.org/papers/volume12/duchi11a/duchi11a.pd
f) (PDF). JMLR. 12: 2121–2159.
27. Gupta, Maya R.; Bengio, Samy; Weston, Jason (2014). "Training highly multiclass
classifiers" (http://jmlr.org/papers/volume15/gupta14a/gupta14a.pdf) (PDF). JMLR. 15 (1):
1461–1492.
28. Hinton, Geoffrey. "Lecture 6e rmsprop: Divide the gradient by a running average of its recent
magnitude" (http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) (PDF).
p. 26. Retrieved 19 March 2020.
29. "Understanding RMSprop — faster neural network learning" (https://towardsdatascience.co
m/understanding-rmsprop-faster-neural-network-learning-62e116fcf29a). 2 September 2018.
30. Kingma, Diederik; Ba, Jimmy (2014). "Adam: A Method for Stochastic Optimization".
arXiv:1412.6980 (https://arxiv.org/abs/1412.6980) [cs.LG (https://arxiv.org/archive/cs.LG)].
31. "4. Beyond Gradient Descent - Fundamentals of Deep Learning [Book]" (https://www.oreilly.c
om/library/view/fundamentals-of-deep/9781491925607/ch04.html).
32. Dozat, T. (2016). "Incorporating Nesterov Momentum into Adam" (https://www.semanticschol
ar.org/paper/Incorporating-Nesterov-Momentum-into-Adam-Dozat/d44efdc542f2cc5e196f04
bc76bc783bfd7084af). S2CID 70293087 (https://api.semanticscholar.org/CorpusID:7029308
7).
33. Naveen, Philip (2022-08-09). "FASFA: A Novel Next-Generation Backpropagation
Optimizer" (https://dx.doi.org/10.36227/techrxiv.20427852.v1). dx.doi.org.
doi:10.36227/techrxiv.20427852.v1 (https://doi.org/10.36227%2Ftechrxiv.20427852.v1).
Retrieved 2022-11-19.
34. Schwarz, Jonathan; Jayakumar, Siddhant M.; Pascanu, Razvan; Latham, Peter E.; Teh, Yee Whye (2021-10-01). Powerpropagation: A sparsity inducing weight reparameterisation (http://worldcat.org/oclc/1333722169). OCLC 1333722169 (https://www.worldcat.org/oclc/1333722169).
35. Hu, Yuzheng; Lin, Licong; Tang, Shange (2019-12-20). "Second-order Information in First-
order Optimization Methods". arXiv:1912.09926 (https://arxiv.org/abs/1912.09926).
36. Reddi, Sashank J.; Kale, Satyen; Kumar, Sanjiv (2018). "On the Convergence of Adam and
Beyond". arXiv:1904.09237 (https://arxiv.org/abs/1904.09237).
37. "An overview of gradient descent optimization algorithms" (https://www.ruder.io/optimizing-gr
adient-descent/#amsgrad). 19 January 2016.
38. Loshchilov, Ilya; Hutter, Frank (4 January 2019). "Decoupled Weight Decay Regularization".
arXiv:1711.05101 (https://arxiv.org/abs/1711.05101).
39. Byrd, R. H.; Hansen, S. L.; Nocedal, J.; Singer, Y. (2016). "A Stochastic Quasi-Newton
method for Large-Scale Optimization". SIAM Journal on Optimization. 26 (2): 1008–1031.
arXiv:1401.7020 (https://arxiv.org/abs/1401.7020). doi:10.1137/140954362 (https://doi.org/1
0.1137%2F140954362). S2CID 12396034 (https://api.semanticscholar.org/CorpusID:12396
034).
40. Spall, J. C. (2000). "Adaptive Stochastic Approximation by the Simultaneous Perturbation
Method". IEEE Transactions on Automatic Control. 45 (10): 1839−1853.
doi:10.1109/TAC.2000.880982 (https://doi.org/10.1109%2FTAC.2000.880982).
41. Spall, J. C. (2009). "Feedback and Weighting Mechanisms for Improving Jacobian
Estimates in the Adaptive Simultaneous Perturbation Algorithm". IEEE Transactions on
Automatic Control. 54 (6): 1216–1229. doi:10.1109/TAC.2009.2019793 (https://doi.org/10.11
09%2FTAC.2009.2019793). S2CID 3564529 (https://api.semanticscholar.org/CorpusID:356
4529).
42. Bhatnagar, S.; Prasad, H. L.; Prashanth, L. A. (2013). Stochastic Recursive Algorithms for
Optimization: Simultaneous Perturbation Methods. London: Springer. ISBN 978-1-4471-
4284-3.
43. Ruppert, D. (1985). "A Newton-Raphson Version of the Multivariate Robbins-Monro
Procedure" (https://doi.org/10.1214%2Faos%2F1176346589). Annals of Statistics. 13 (1):
236–245. doi:10.1214/aos/1176346589 (https://doi.org/10.1214%2Faos%2F1176346589).
44. Abdulkadirov, R. I.; Lyakhov, P.A.; Nagornov, N.N. (2022). "Accelerating Extreme Search of
Multidimensional Functions Based on Natural Gradient Descent with Dirichlet Distributions"
(https://doi.org/10.3390%2Fmath10193556). Mathematics. 10 (19): 3556.
doi:10.3390/math10193556 (https://doi.org/10.3390%2Fmath10193556).
Further reading
Bottou, Léon (2004), "Stochastic Learning" (http://leon.bottou.org/papers/bottou-mlss-2004),
Advanced Lectures on Machine Learning, LNAI, vol. 3176, Springer, pp. 146–168,
ISBN 978-3-540-23122-6
Buduma, Nikhil; Locascio, Nicholas (2017), "Beyond Gradient Descent" (https://books.googl
e.com/books?id=80glDwAAQBAJ&pg=PA63), Fundamentals of Deep Learning : Designing
Next-Generation Machine Intelligence Algorithms, O'Reilly, ISBN 9781491925584
LeCun, Yann A.; Bottou, Léon; Orr, Genevieve B.; Müller, Klaus-Robert (2012), "Efficient
BackProp" (https://books.google.com/books?id=VCKqCAAAQBAJ&pg=PA9), Neural
Networks: Tricks of the Trade, Springer, pp. 9–48, ISBN 978-3-642-35288-1
Spall, James C. (2003), Introduction to Stochastic Search and Optimization, Wiley,
ISBN 978-0-471-33052-3
External links
Using stochastic gradient descent in C++, Boost, Ublas for linear regression (http://codingpla
yground.blogspot.it/2013/05/stocastic-gradient-descent.html)
Machine Learning Algorithms (http://studyofai.com/machine-learning-algorithms/)
"Gradient Descent, How Neural Networks Learn" (https://www.youtube.com/watch?v=IHZw
WFHWa-w&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&index=2). 3Blue1Brown.
October 16, 2017. Archived (https://ghostarchive.org/varchive/youtube/20211222/IHZwWFH
Wa-w) from the original on 2021-12-22 – via YouTube.
Goh (April 4, 2017). "Why Momentum Really Works" (https://distill.pub/2017/momentum/).
Distill. 2 (4). doi:10.23915/distill.00006 (https://doi.org/10.23915%2Fdistill.00006). Interactive
paper explaining momentum.