Review
Recent Advances in Stochastic Gradient Descent in Deep Learning
Yingjie Tian 1,2,3, *, Yuqi Zhang 4 and Haibin Zhang 5
1 School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
2 Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing 100190, China
3 Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences,
Beijing 100190, China
4 School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
5 Beijing Institute for Scientific and Engineering Computing, Faculty of Science, Beijing University of
Technology, Beijing 100124, China
* Correspondence: tyj@ucas.ac.cn
Abstract: In the age of artificial intelligence, how best to handle huge amounts of data is a tremendously motivating and challenging problem. Among machine learning approaches, stochastic gradient descent (SGD) is not only simple but also very effective. This study provides a detailed analysis of contemporary state-of-the-art deep learning applications, such as natural language processing (NLP), visual data processing, and speech and audio processing. Following that, this study introduces several versions of SGD and its variants, which are already available in the PyTorch optimizer, including SGD, Adagrad, Adadelta, RMSprop, Adam, AdamW, and so on. Finally, we propose theoretical conditions under which these methods are applicable and find that there is still a gap between the theoretical conditions under which the algorithms converge and practical applications; how to bridge this gap is a question for the future.
MSC: 68W27
(1) Introducing the optimization problem for large-scale data and reviewing the related
models of deep learning.
(2) Providing a review and discussion on recent advances in SGD.
(3) Providing some applications, open issues, and research trends.
To the best of our knowledge, there are only a few SGD-related surveys and preprints [1,4–6]. Compared with them, our contribution does the following:
(1) Summarizes SGD-related algorithms from both machine learning and optimization perspectives. Worth mentioning are the proximal stochastic methods, which had not been summarized hitherto.
(2) Provides the mathematical properties of models and methods.
(3) Discusses the gap between theory and applications.
The remainder of this work describes several common deep-learning applications
of SGD and then presents the many types of stochastic gradient descent. Finally, we
summarize this article and discuss various unresolved concerns and research directions.
2. Application
The SGD algorithm has been applied to tasks such as natural language processing
(NLP), visual data processing, and speech and audio processing in deep learning. We
introduce the different tasks of deep learning and list some practical applications of the
SGD algorithm in these significant fields.
2.4. Others
For crisis-response tasks, the authors of [28] offered a novel approach paired with online learning capabilities. It effectively identifies future disasters at the phrase level from tweets and determines the observed disaster types for crisis-response activities. The optimizer for this model is Adadelta. For disease prediction, the authors of [29] utilized a CNN-based autoencoder to predict cellular breast cancer, whose optimizer is SHB.
Table 1 shows a summary of the applications of the deep-learning networks, their
architectures, and the stochastic methods used.
where w represents the weights of the model, h is the prediction function, and ℓ_i is the loss function.
We use some simplified notation. A random seed ξ is used to represent a sample (or a group of samples). Furthermore, f denotes the composition of the loss function ℓ and the prediction function h, so the loss is rewritten as f(w; ξ) for any given pair (w, ξ).
Some fundamental optimization algorithms for minimizing risk are given in the rest
of this section.
w⁺ = w − (η/n) ∑_{i=1}^{n} ∇f(w; ξ_i),
3.2. SGD
A variant of the gradient descent algorithm is stochastic gradient descent (SGD). Instead of computing the gradient of E[f(w; ξ_i)] or (1/n) ∑_{i=1}^{n} f_i(w; ξ_i) exactly, it uses only one random sample to estimate the gradient of f_i(w; ξ_i) in each iteration. Batch gradient descent performs redundant calculations for large-scale problems because it recomputes the gradients of related samples before each parameter update, whereas SGD performs only one gradient computation per iteration. Algorithm 1 contains the whole SGD algorithm. It is simple to use, saves memory, and is also suited to online learning.
Stochastic gradients are unbiased estimates of the true gradient, so SGD can be viewed as gradient descent with zero-mean random noise added to the gradient. Figure 1 compares the noiseless and noisy cases: training with a single example causes fluctuations, making the iterative trajectory tortuous and slow to converge.
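To make Algorithm 1 concrete, the following NumPy sketch runs plain SGD on an illustrative least-squares loss f_i(w; ξ_i) = ½(x_iᵀw − y_i)²; the synthetic data, constant step size, and iteration count are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                              # features, one row per sample
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)    # noisy targets

w = np.zeros(10)        # model weights
eta = 0.01              # step size (assumed constant)

for t in range(5000):
    i = rng.integers(len(y))              # draw one random sample (the seed xi_i)
    grad_i = (X[i] @ w - y[i]) * X[i]     # gradient of f_i(w; xi_i) for the squared loss
    w -= eta * grad_i                     # SGD step: w <- w - eta * grad f_i(w)
```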
It is easy to see that gradient descent and SGD are two special cases of mini-batch SGD [42]. SGD trains fast but introduces noise, whereas gradient descent trains slowly but computes gradients accurately. The authors of [43] proposed a dynamic algorithm that converges linearly to a neighborhood of the optimal point under strongly convex conditions. The authors of [44] presented a data selection rule under which a linear convergence rate can be achieved when updating with mini-batch SGD, assuming strong convexity or strong quasi-convexity.
The work [42] proposes an approximate optimization strategy by adding an ℓ2-norm penalty term to each mini-batch subproblem. The method solves the subproblem

min_w (1/|I_t|) ∑_{i∈I_t} f_i(w; ξ_i) + (γ_t/2) ‖w − w_{t−1}‖², (2)

where I_t is a subset of {1, 2, . . . , n} and γ_t > 0. We provide a subtle change which uses

z_t ← (1/|I_t|) ∑_{i∈I_t} ∇f_i(w_t; ξ_i) + γ_t (w_t − w_{t−1}).
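For illustration, the sketch below forms the direction z_t defined above for a squared loss on a random mini-batch I_t; since the corresponding update rule is not reproduced here, the step w_{t+1} = w_t − η z_t used in the loop is our own assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10)

eta, gamma, batch = 0.05, 1.0, 32
w_prev = np.zeros(10)
w = np.zeros(10)

for t in range(2000):
    I_t = rng.choice(len(y), size=batch, replace=False)   # mini-batch I_t
    resid = X[I_t] @ w - y[I_t]
    grad_batch = X[I_t].T @ resid / batch                 # (1/|I_t|) sum of grad f_i(w_t)
    z_t = grad_batch + gamma * (w - w_prev)               # proximity term gamma_t (w_t - w_{t-1})
    w_prev, w = w, w - eta * z_t                          # assumed step: w_{t+1} = w_t - eta * z_t
```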
4. Variance-Reducing Methods
Due to the significant variance of its stochastic gradients, SGD converges slowly. To alleviate this problem, many methods periodically compute the full gradient to reduce the variance. The stochastic average gradient (SAG) method [48], shown in Algorithm 3, makes the convergence more stable [49].
The SAG method needs to store the gradient of each sample. Additionally, this method only applies to smooth and convex functions, not to non-convex neural networks. The authors of [36] proposed the stochastic variance reduced gradient (SVRG) method, shown in Algorithm 4. SVRG computes the full gradient only once every m iterations and uses it to correct the stochastic gradients in between, instead of computing the full gradient at every step.
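A minimal NumPy sketch of the SVRG scheme just described, assuming a least-squares objective: the full gradient is recomputed once per outer iteration (every m inner steps) and used to correct each stochastic gradient. The data, step size, and loop lengths are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10)

def grad_i(w, i):                           # gradient of f_i for the squared loss
    return (X[i] @ w - y[i]) * X[i]

eta, m = 0.02, 500
w_tilde = np.zeros(10)                      # snapshot point

for epoch in range(20):
    mu = X.T @ (X @ w_tilde - y) / len(y)   # full gradient at the snapshot
    w = w_tilde.copy()
    for t in range(m):                      # inner loop: variance-reduced steps
        i = rng.integers(len(y))
        v = grad_i(w, i) - grad_i(w_tilde, i) + mu
        w -= eta * v
    w_tilde = w                             # use the last iterate as the new snapshot
```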
Many improved variants of variance-reduction methods have emerged, such as SDCA [50], MISO [51], and S2GD [52]. A novel method, SARAH, combines some good properties of existing algorithms [53]. Many studies improved SARAH in different respects, resulting in algorithms such as SARAH+ and SARAH++; sometimes a careful choice of m is not even necessary [54]. In general, the fundamental SGD variants attain a sublinear convergence rate. In contrast, all of the preceding approaches, except S2GD, can achieve a linear convergence rate on strongly convex problems. S2GD establishes linear convergence of the expected values and achieves linear convergence in practice. The approaches described above can provide fast convergence to high accuracy while maintaining a large constant step size. To guarantee a linear convergence rate, they are usually applied to ℓ2-regularized logistic regression problems [53,55].
Even without theoretical results, the authors of [56] showed SVRG's effectiveness at addressing various convex and non-convex problems. Moreover, experiments [36,57] showed the superior performance of SVRG on non-convex neural networks. However, applying these algorithms to higher-cost and deeper neural nets is a critical issue that needs to be investigated further.
Table 3 shows these variance-reducing methods. Subsequent sections provide more
strategies for improving the stochastic algorithms.
5. Accelerate SGD
SGD methods can quickly get stuck in saddle points on highly non-convex loss surfaces, where SGD oscillates and may not escape for a long time. The momentum method [58] and Nesterov accelerated gradient (NAG) descent [59] are standard acceleration techniques that replace the traditional gradient method. This section introduces these accelerated technologies and their improvements.
5.1. Momentum
The concept of momentum comes from physical mechanics; it simulates the inertia
of an object. As shown in Figure 2, the gradient oscillates vertically in a certain direction.
A simple way to overcome the weakness is to maintain the gradient in the horizontal
direction and suppress the gradient in the vertical direction.
It is similar to pushing a ball down in hill: the heavy ball accumulates momentum
along the main direction of downhill, getting to the bottom of the hill faster and faster. This
method is therefore also called the heavy ball method (HB) and is formulated as follows:
where θ, α > 0.
The momentum method can speed up the convergence when the learning rate is
small, as it is when dealing with high curvature, small gradients, and noisy gradients [58].
Moreover, the search process can converge more quickly [60].
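A minimal NumPy sketch of a stochastic heavy ball step of the form w_{t+1} = w_t − α ∇f_i(w_t) + θ (w_t − w_{t−1}) on an illustrative least-squares loss; the momentum factor θ = 0.9 follows the common setting mentioned below, and all other choices are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10)

alpha, theta = 0.01, 0.9        # step size and momentum factor
w_prev = np.zeros(10)
w = np.zeros(10)

for t in range(5000):
    i = rng.integers(len(y))
    g = (X[i] @ w - y[i]) * X[i]                     # stochastic gradient of f_i
    w_next = w - alpha * g + theta * (w - w_prev)    # stochastic heavy ball step
    w_prev, w = w, w_next
```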
The stochastic heavy ball method (SHB) converges sublinearly without the convexity assumption [61]. For the specialized setting of minimizing quadratics, SHB converges linearly at an accelerated rate in expectation [62]. When the objective function is strongly convex, the HB method achieves a linear rate of convergence toward the unique optimum, and when the objective function is smooth and convex, its Cesàro average achieves a linear rate of convergence [63]. A sublinear convergence rate for weakly convex functions is established in [64]. The authors of [65] minimized the expected loss instead of the finite-sum loss and proved that the objective function converges linearly; these theoretical results support robust phase retrieval problems, regression problems, and some CNNs. For the class of quasi-strongly convex functions with a unique minimizer, non-asymptotic optimal rates and global convergence were obtained [66]. The work of [67] demonstrates that SHB can improve the stability and generalization performance of the model. However, in the neighborhood of a strict saddle point of a nonconvex quadratic function, HB can diverge from this point faster than GD [68]. There are still some shortcomings, such as the non-acceleration of SHB over SGD on some synthetic data [69].
An issue is determining the value of the momentum factor; some work suggests that a common setting for this value is 0.9 [41].
where η has a default value of 0.01 and g_t is the gradient. G_t here is a diagonal matrix in which each diagonal element is the sum of the squares of the past gradients for the corresponding coordinate. We take an example to explain how to compute G_t:
Given g_1 = (1, 0, 2)^T, g_2 = (3, 4, 0)^T, and g_3 = (0, 5, 6)^T, we have

√(G_t + ε) = diag( √(1² + 3² + ε), √(4² + 5² + ε), √(2² + 6² + ε) )
           = diag( √(10 + ε), √(41 + ε), √(40 + ε) ).
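The short NumPy snippet below reproduces the diagonal of G_t from the three gradients above and then applies a standard AdaGrad step; the values of η and ε are illustrative assumptions.

```python
import numpy as np

g = [np.array([1., 0., 2.]), np.array([3., 4., 0.]), np.array([0., 5., 6.])]
eps, eta = 1e-8, 0.01

G_diag = sum(gi ** 2 for gi in g)          # per-coordinate sum of squared gradients
print(G_diag)                              # [10. 41. 40.], matching the example above

w = np.zeros(3)
w -= eta / np.sqrt(G_diag + eps) * g[-1]   # AdaGrad step using the latest gradient g_t
```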
One significant advantage of AdaGrad is that it eliminates the need to adjust the learning rate manually, and the approach may be used to solve sparse-gradient problems. The parameters update only slightly once the algorithm no longer needs to learn anything new. Adagrad attains sublinear convergence for convex optimization [75] and is almost surely asymptotically convergent for non-convex problems [76]. The authors of [77] provided robust linear convergence guarantees, for most least-squares problems, for functions that are either strongly convex or non-convex but satisfy the Polyak–Łojasiewicz (PL) assumption.
This algorithm has a small accumulated gradient and a large learning rate in the first few rounds of training. However, as training proceeds, the accumulated gradient becomes larger and larger, so the learning rate tends to zero and the weights of the model may no longer be updated. Adadelta [78] and RMSProp [79] are two extensions of Adagrad that overcome this drawback. They focus on the gradients in a window over a period instead of accumulating all historical gradients. Inspired by the momentum method, they use an exponential moving average to calculate the second-order cumulative momentum. Their updates are as follows:
Adadelta:
E[g²]_t = γ E[g²]_{t−1} + (1 − γ) g_t², (6)
The update should have the same hypothetical units as the parameter.
RMSProp:
E[g²]_t = γ E[g²]_{t−1} + (1 − γ) g_t², (9)
Δw_t = −( η / √(E[g²]_t + ε) ) g_t, (10)
w_{t+1} = w_t + Δw_t,
where γ is always set to 0.9, and a good default value for the learning rate η is 0.001.
They are suitable for non-convex problems and end up oscillating around local minima.
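A minimal NumPy sketch of the RMSProp update in Equations (9) and (10) on an illustrative least-squares loss; the data and iteration count are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10)

gamma, eta, eps = 0.9, 0.001, 1e-8
Eg2 = np.zeros(10)                  # running average E[g^2]_t
w = np.zeros(10)

for t in range(5000):
    i = rng.integers(len(y))
    g = (X[i] @ w - y[i]) * X[i]                  # stochastic gradient g_t
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2      # Eq. (9)
    w += -eta / np.sqrt(Eg2 + eps) * g            # Eq. (10): w_{t+1} = w_t + Delta w_t
```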
Adaptive moment estimation (Adam) [2] is not only computationally efficient, but also
memory-efficient, and it is suitable for large-scale data problems. Adam is a combination
of adaptive learning rate and momentum methods:
convex cases [80]. Considering that the original schemes are convergent when using a full-batch gradient, the authors of [81] adopted a large batch size. Sublinear convergence rates of RMSProp and Adam are obtained in the non-convex stochastic setting by choosing a specific batch size. These theoretical results are helpful for training machine learning models and deep neural networks such as LeNet and ResNet [82].
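Because the Adam update equations are not reproduced above, the following NumPy sketch shows the standard bias-corrected update of [2] on an illustrative least-squares loss, using the commonly quoted default hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 10))
y = X @ rng.normal(size=10)

eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
m = np.zeros(10)    # first moment estimate (momentum part)
v = np.zeros(10)    # second moment estimate (adaptive learning rate part)
w = np.zeros(10)

for t in range(1, 5001):
    i = rng.integers(len(y))
    g = (X[i] @ w - y[i]) * X[i]          # stochastic gradient g_t
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (np.sqrt(v_hat) + eps)
```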
Nadam combines Adam and NAG [83]. AdaMax gives a general momentum term, which replaces √(v̂_t) + ε (which can be seen as an ℓ2 norm) with

u_t = β_∞ v_{t−1} + (1 − β_∞) |g_t|^∞ = max( β · v_{t−1}, |g_t| ).
where η, λ > 0.
In addition, many modified versions of the above methods have been proposed, namely,
AdaFTRL [85], AdaBatch [86], SC-MSProp [87], AMSGRAD [80], and Padam [88].
Table 5 shows these adaptive learning rate methods.
7. Proximal SGD
The previous sections dealt with smooth regularizers, such as the squared ℓ2 norm. For a nonsmooth regularizer, one option is the subgradient method. The other is the proximal gradient method (generalized gradient descent), which has a better convergence rate than the subgradient method and, in a generalized sense, can be considered a projection. This section addresses the case in which p(w) is nonzero and nonsmooth and the proximal operator of p(w) is easy to compute.
min_{w∈ℝ^d} ℓ(w) := (1/n) ∑_{i=1}^{n} ℓ_i(h(x_i; w, θ), y_i) + λ p(w), (14)
where λ > 0.
The proximal operator of p is defined as
prox_{ηp}(u) := argmin_{w∈ℝ^d} p(w) + (1/(2η)) ‖w − u‖², for η > 0, (15)
where u is known.
Consider, as an example, the indicator function of a convex set C,
I_C(w) = 0 if w ∈ C, and I_C(w) = ∞ if w ∉ C.
In this case, the proximal operator is the projection operator onto C, i.e., prox_t(·) = P_C(·). Thus, the step is w⁺ = P_C(w − η ∇f(w)); that is, perform the usual gradient update and then project back onto C. This method is called projected gradient descent. Table 6 shows regularization terms and the corresponding proximal operators. These regularizers always induce sparsity in the optimal solution. For machine learning, a dataset contains a lot of redundant information, and increasing the feature sparsity of the dataset can be seen as feature selection that makes the model simpler. Unlike principal component analysis (PCA), feature selection projects onto a subset of the original space rather than onto a new space with no interpretability.
MCP [93]: P_{λ,γ}(w) = λ|w| − w²/(2γ) if |w| ≤ γλ; (1/2)γλ² if |w| > γλ.
  prox_{P_{λ,γ}}(u) = 0 if |u| ≤ λ; sign(u)(|u| − λ)/(1 − 1/γ) if λ < |u| ≤ γλ; u if |u| > γλ.
Firm thresholding [94] (γ > λ): P_{λ,γ}(w) = λ(|w| − w²/(2γ)) if |w| ≤ γ; λγ/2 if |w| ≥ γ.
  prox_{P_{λ,γ}}(u) = 0 if |u| ≤ λ; sign(u) γ(|u| − λ)/(γ − λ) if λ ≤ |u| ≤ γ; u if |u| ≥ γ.
L0.5 [95] (λ > 0): P_λ(w) = λ‖w‖_{1/2}^{1/2}.
  prox_{P_λ}(u) = (2/3) u ( 1 + cos( 2π/3 − (2/3) arccos( (λ/4)(|u|/3)^{−3/2} ) ) ).
Capped-ℓ1 [96]: P_{λ,γ}(w) = λ|w| if |w| ≤ γλ; γλ² if |w| > γλ.
  prox_{P_{λ,γ}}(u) = 0 if |u| ≤ λ; sign(u)(|u| − λ) if λ ≤ |u| ≤ (γ + 1/2)λ; sign(u)(2γ − 1)λ/2 if |u| = (γ + 1/2)λ; u if |u| > (γ + 1/2)λ.
where λ, α_t > 0. It differs from the gradient method only in the regularization term (the last one). It is equivalent to

w_{t+1} ← argmin_{w∈ℝ^d} { (1/2) ‖w − (w_t − α_t ∇f(w_t))‖² + λ α_t p(w) },

that is,

w_{t+1} = prox_{λα_t p}( w_t − α_t ∇f(w_t) ).
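As an example of this update in the stochastic setting, the sketch below applies a stochastic proximal gradient step with p(w) = ‖w‖₁, whose proximal operator is the soft-thresholding operator; the sparse synthetic data and step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(1000, 20))
w_true = np.zeros(20)
w_true[:3] = [2., -1., 0.5]                      # sparse ground truth
y = X @ w_true + 0.01 * rng.normal(size=1000)

lam, alpha = 0.1, 0.01

def soft_threshold(u, thr):                      # proximal operator of thr * ||.||_1
    return np.sign(u) * np.maximum(np.abs(u) - thr, 0.0)

w = np.zeros(20)
for t in range(10000):
    i = rng.integers(len(y))
    g = (X[i] @ w - y[i]) * X[i]                 # stochastic gradient of the smooth part
    w = soft_threshold(w - alpha * g, lam * alpha)   # w <- prox_{lam*alpha*p}(w - alpha*g)
```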
It relies on the combination of a gradient step and a proximal operator. Ref. [97] proposed a randomized stochastic projected gradient (RSPG) algorithm, which also employs a general distance function to replace the Euclidean distance in Equation (15).
The stochastic proximal gradient (SPG) algorithm has been proved to have a sublinear convergence rate for the expected function values in the strongly convex case, and almost sure convergence holds under weaker assumptions [98]. For the objective function
min_{w∈ℝ^d} ℓ(w) := (1/n) ∑_{i=1}^{n} f_i(w, ξ_i) + λ p(w),
where η > 0 and 0 < γ < 1, SPG can be applied to regression problems and deconvolution problems. The authors of [99] derived sublinear convergence rates of SPG with mini-batches (SPG-M) under a strong convexity assumption. The stochastic splitting proximal gradient (SSPG) method was presented for the case where p(w) is itself a finite-sum function and was proved to attain a linear convergence rate in the convex setting [100]. A short proof establishes the linear convergence of a proximal decentralized algorithm (PDA) for a strongly convex objective function [101]. A novel method called PROXQUANT [102] has been proven to converge to stationary points under a mild smoothness assumption and can be applied to deep-learning problems such as ResNets and LSTMs. PROXGEN achieves the same convergence rate as SGD without preconditioners and can be applied to DNNs [103].
Many further methods can be obtained by combining other variants of SGD. The authors of [104] proposed the Prox-SVRG algorithm based on the well-known variance-reduction technique SVRG; it achieves a linear convergence rate under the strong convexity assumption. For the nonsmooth nonconvex case, the authors of [105] provided ProxSVRG and ProxSAGA, which achieve linear convergence under the PL inequality. Additionally, in a broader framework of robust optimization, the paper [106] analyzes the setting where p may be nonconvex. They both compute a stochastic step and then take a proximal step. ProxSVRG+ [107] does not need to compute the full gradient at each iteration and uses only the proximal projection in the inner loop. The mS2GD method combines S2GD with the proximal method [108]. The authors of [109] combined the HB method with a proximal method, called iPiano, which can be used to solve image-denoising problems. In addition, Acc-Prox-SVRG [110] and the averaged implicit SGD (AI-SGD) [111] can achieve linear convergence under the strong convexity assumption.
This holds for many applications in machine learning and statistics, including ℓ1 regularization, box constraints, and simplex constraints, among others [105]. On the one hand, one can combine proximal methods with deep learning; on the other hand, one can use deep learning to learn the proximal operators.
In the former case, ProxProp closely resembles classical backpropagation (BP) but replaces the parameter update with a proximal mapping instead of an explicit gradient descent step. The weight w comprises all parameters {W^(1), W^(2), W^(3), . . . , W^(J)}. The last-layer update is

W^(J)_{t+1} = W^(J)_t − τ ∇_{W^(J)} f(W^(1), W^(2), W^(3), . . . , W^(J)),

and the remaining layers take proximal steps,

W^(j)_{t+1} = argmin_W f(W^(1), W^(2), W^(3), . . . , W^(J)) + (1/(2τ)) ‖W − W^(j)_t‖², τ > 0.

While in principle one can also take a proximal step on the loss function, for efficiency reasons an explicit gradient step is chosen there.
In the latter case, for a denoising network, the work [112] replaces the proximal operator with a neural network G:

u_{t+1} = G( u_t − τ A* ∇H_f(A u_t) ).
8. High Order
The stochastic gradient descent algorithms above utilize only the first-order gradient in each iteration, ignoring second-order information. However, second-order approaches must handle highly non-linear and ill-conditioned problems while calculating the inverse of the Hessian matrix [113,114], which makes them impractical or ineffective for neural network training. This section presents the sampling technique, which brings a stochastic idea to second-order methods.
To avoid the difficulty of computing the inverse matrix, many approaches calculate an approximation of the Hessian matrix and apply it to large-scale data problems [114–116].
The Hessian-free (HF) Newton method [114] employs second-order information and performs a sub-optimization, avoiding the expensive cost of inverting the Hessian matrix. However, HF is not suitable for large-scale problems. The authors of [117] proposed using a subsampled Hessian computed on a subset S_t^H of the samples:

∇²F_{S_t^H}(w_t) = (1/|S_t^H|) ∑_{i∈S_t^H} ∇²f_i(w_t; ξ_i).

Replacing B_t with ∇²F_{S_t^H}(w_t) gives the approximate direction d_t, i.e., it is obtained by solving the linear system by CG:

∇²F_{S_t^H}(w_t) d_t = −∇F_{S_t}(w_t).
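A minimal NumPy sketch of such a subsampled Newton-CG step, assuming a least-squares objective so that the Hessian of each f_i is x_i x_iᵀ; the hand-written CG routine, sample sizes, and unit step length are illustrative assumptions rather than the exact scheme of [117].

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(2000, 10))
y = X @ rng.normal(size=10)

def cg(hvp, b, iters=20):
    """Conjugate gradient for H d = b, given only Hessian-vector products."""
    d, r = np.zeros_like(b), b.copy()
    p, rs = r.copy(), r @ r
    for _ in range(iters):
        Hp = hvp(p)
        a = rs / (p @ Hp)
        d += a * p
        r -= a * Hp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

w, eta = np.zeros(10), 1.0
for t in range(20):
    S_g = rng.choice(len(y), size=256, replace=False)          # gradient sample S_t
    S_h = rng.choice(len(y), size=64, replace=False)           # Hessian sample S_t^H
    grad = X[S_g].T @ (X[S_g] @ w - y[S_g]) / len(S_g)
    hvp = lambda v: X[S_h].T @ (X[S_h] @ v) / len(S_h)         # subsampled Hessian-vector product
    d = cg(hvp, -grad)                                          # solve the Newton system by CG
    w += eta * d
```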
9. Discussion
We propose several open questions and some future research trends:
(1) In the past ten years, more and more researchers have paid attention to non-convex
problems. In terms of theory, many researchers have explored the convergence of
stochastic algorithms under non-convex conditions. Usually, researchers will explore
the convergence of stochastic gradient methods based on some particular non-convex
assumptions (such as quasi-convex and weakly convex). Still, for more general non-
convex problems, more progress has yet to be made. In terms of application, the SGD
method has long been widely used in deep-learning problems. A convincing example
is the integration of many SGD packages in PyTorch. However, the deep-learning prob-
lem is highly non-convex and nonlinear. Using a stochastic algorithm, it is easy to fall into a local minimum, and its convergence cannot be guaranteed. Constructing methods more suitable for non-convex applications, from both theoretical and practical perspectives, is a current research dilemma and a direction worthy of research. In addition, we observe that one of the possible breakthrough directions comes from a statement in the literature [132]: over-parameterization brings some good assumptions, which are conducive to proving the convergence of algorithms in deep learning.
(2) For problem-specific learning, such as for imbalanced data problems, traditional
stochastic gradient methods can produce biases, leading to poor optimization results.
In addition to constructing an unbalanced loss function for the model, it is possible
to build an adaptive weighted gradient based on the data distribution for gradient
update or to replace the uniform sampling in the SGD algorithm with non-uniform
sampling based on the data distribution p.
(4) The stochastic gradient descent algorithm uses only the first-order gradient in each iteration; second-order information is ignored to reduce the computational cost, while second-order stochastic algorithms have a high computational cost. Can we combine first- and second-order information, keeping the low cost of first-order methods while still exploiting second-order information? This is a worthwhile research direction.
10. Conclusions
This paper initially overviewed the different applications of deep learning. There-
after, several variants of gradient descent methods were introduced for improving SGD.
The variance-reduction methods alleviate the oscillation caused by the significant variance of SGD. Accelerated SGD helps SGD escape saddle points. The adaptive learning rate methods choose the learning rate adaptively to avoid the algorithmic instability caused by the different sensitivities across gradient dimensions. The proximal methods are suitable for optimization problems with regularization terms. The second-order methods use second-order information. Each algorithm achieves a different rate of convergence. Algorithms already in the PyTorch optimizer include SGD, Adagrad, Adadelta, RMSprop, Adam, AdamW, and so on. Although the above algorithms can be used in highly nonlinear and non-convex deep-learning problems, they cannot be proved to converge to a stationary point in theory. There is still a gap between theory and application, and closing it is a question worth thinking about in the future.
Author Contributions: Conceptualization, Y.T. and Y.Z.; methodology, Y.Z.; formal analysis, Y.T.
and H.Z.; investigation, Y.Z.; writing—original draft preparation, Y.Z.; writing—review and editing,
Y.T. and H.Z.; visualization, Y.Z.; funding acquisition, Y.T. All authors have read and agreed to the
published version of the manuscript.
Funding: This research was funded by National Natural Science Foundation of China (No. 12071458,
71731009).
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
References
1. Bottou, L.; Bousquet, O. The tradeoffs of large scale learning. Adv. Neural Inf. Process. Syst. 2008, 20, 161–168.
2. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
3. Benzing, F.; Schug, S.; Meier, R.; Oswald, J.; Akram, Y.; Zucchet, N.; Aitchison, L.; Steger, A. Random initialisations performing above chance and how to find them. arXiv 2022, arXiv:2209.07509.
4. Bottou, L.; Curtis, F.E.; Nocedal, J. Optimization methods for large-scale machine learning. Siam Rev. 2018, 60, 223–311.
5. Sun, R. Optimization for deep learning: Theory and algorithms. arXiv 2019, arXiv:1912.08957.
6. Sun, S.; Cao, Z.; Zhu, H.; Zhao, J. A survey of optimization methods from a machine learning perspective. IEEE Trans. Cybern.
2019, 50, 3668–3681.
7. Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.Y.; Potts, C. Recursive deep models for semantic composition-
ality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,
Seattle, WA, USA, 18–21 October 2013; pp. 1631–1642.
8. Fan, A.; Lewis, M.; Dauphin, Y. Hierarchical neural story generation. arXiv 2018, arXiv:1805.04833.
9. Meng, F.; Zhang, J. DTMT: A novel deep transition architecture for neural machine translation. Proc. Aaai Conf. Artif. Intell. 2019,
33, 224–231.
10. Socher, R.; Huang, E.H.; Pennin, J.; Manning, C.D.; Ng, A. Dynamic pooling and unfolding recursive autoencoders for para-phrase
detection. Adv. Neural Inf. Process. Syst. 2011, 24, 801–809.
11. Yin, W.; Schetze, H.; Xiang, B.; Zhou, B. Abcnn: Attention-based convolutional neural network for modeling sentence pairs.
Trans. Assoc. Comput. Linguist. 2016, 4, 259–272.
12. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding.
arXiv 2018, arXiv:1810.04805.
13. Ha, H.Y.; Yang, Y.; Pouyanfar, S.; Tian, H.; Chen, S.C. Correlation-based deep learning for multimedia semantic concept detection.
In International Conference on Web Information Systems Engineering; Springer: Cham, Switzerland, 2015; pp. 473–487.
14. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86,
2278–2324.
15. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf.
Process. Syst. 2012, 25, 1097–1105.
16. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014;
pp. 580–587.
17. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788.
18. Dai, J.; Li, Y.; He, K.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst.
2016, 29, 379–387.
19. Habibian, A.; Abati, D.; Cohen, T.S.; Bejnordi, B. E. Skip-convolutions for efficient video processing. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2695–2704.
20. Ballas, N.; Yao, L.; Pal, C.; Courville, A. Delving deeper into convolutional networks for learning video representations. arXiv 2015, arXiv:1511.06432.
21. Saharia, C.; Chan, W.; Chang, H.; Lee, C.; Ho, J.; Salimans, T.; Fleet, D.; Norouzi, M. Palette: Image-to-image diffusion models. In
Proceedings of the ACM SIGGRAPH 2022 Conference Proceedings, Vancouver, BC, Canada, August 2022; pp. 1–10.
22. Saharia, C.; Ho, J.; Chan, W.; Salimans, T.; Fleet, D.J.; Norouzi, M. Image super-resolution via iterative refinement. IEEE Trans.
Pattern Anal. Mach. Intell. 2022, doi:10.1109/TPAMI.2022.3204461.
23. Junqua, J.C.; Haton, J.P. Robustness in Automatic Speech Recognition: Fundamentals and Applications; Springer Science and Business
Media: Berlin, Germany, 2012.
24. Han, K.; Yu, D.; Tashev, I. Speech emotion recognition using deep neural network and extreme learning machine. In Proceedings
of the Fifteenth Annual Conference of the International Speech Communication Association, Singapore, 14–18 September 2014.
25. Neumann, M.; Vu, N.T. Attentive convolutional neural network based speech emotion recognition: A study on the impact of
input features, signal length, and acted speech. arXiv 2017, arXiv:1706.00612.
26. Zhang, S.; Chen, M.; Chen, J.; Li, Y. F.; Wu, Y.; Li, M.; Zhu, C. Combining cross-modal knowledge transfer and semi-supervised
learning for speech emotion recognition. Knowl.-Based Syst. 2021, 229, 107340.
27. Hu, Y.; Liu, Y.; Lv, S.; Xing, M.; Zhang, S.; Fu, Y.; Wu, J.; Zhang, B.; Xie, L. DCCRN: Deep complex convolution recurrent network
for phase-aware speech enhancement. arXiv 2020, arXiv:2008.00264.
28. Nguyen, D.T.; Joty, S.; Imran, M.; Sajjad, H.; Mitra, P. Applications of online deep learning for crisis response using social media
information. arXiv 2016, arXiv:1610.01030.
29. Litjens, G.; Sánchez, C.I.; Timofeeva, N.; Hermsen, M.; Nagtegaal, I.; Kovacs, I.; Hulsbergen-van De Kaa, C.; Bult, P.; Ginneken, B.;
Van Der Laak, J. Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Sci. Rep. 2016, 6,
26286.
30. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July2016; pp. 770–778.
32. Xie, S.; Girshick, R.; Dollor, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. arXiv 2016,
arXiv:1611.05431.
33. Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat. 1951, 22, 400–407.
34. Cauchy, A. Méthode générale pour la résolution des systemes d’équations simultanées. Comp. Rend. Sci. Paris 1847, 25, 536–538.
35. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015,
arXiv:1502.03167.
36. Johnson, R.; Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst.
2013, 26, 315–323.
37. Nemirovski, A.; Juditsky, A.; Lan, G.; Shapiro, A. Robust stochastic approximation approach to stochastic programming. Siam J.
Optim. 2009, 19, 1574–1609.
38. Liu, J.; Wright, S.; Ré, C.; Bittorf, V.; Sridhar, S. An asynchronous parallel stochastic coordinate descent algorithm. In Proceedings
of the International Conference on Machine Learning, Beijing China, 21–26 June 2014; pp. 469–477.
39. Sankararaman, K.A.; De, S.; Xu, Z.; Huang, W.R.; Goldstein, T. The impact of neural network overparameterization on gradient
confusion and stochastic gradient descent. arXiv 2019, arXiv:1904.06963.
40. Khaled, A.; Richtárik, P. Better Theory for SGD in the Nonconvex World. arXiv 2020, arXiv:2002.03329.
41. Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747.
42. Li, M.; Zhang, T.; Chen, Y.; Smola, A.J. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014;
pp. 661–670.
43. Alfarra, M.; Hanzely, S.; Albasyoni, A.; Ghanem, B.; Richtárik, P. Adaptive Learning of the Optimal Mini-Batch Size of SGD. arXiv
2020, arXiv:2005.01097.
44. Gower, R.M.; Loizou, N.; Qian, X.; Sailanbayev, A.; Shulgin, E.; Richtárik, P. SGD: General analysis and improved rates. arXiv
2019, arXiv:1901.09401.
45. Hoffer, E.; Hubara, I.; Soudry, D. Train longer, generalize better: Closing the generalization gap in large batch training of neural
networks. Adv. Neural Inf. Process. Syst. 2017, 30, 1731–1741.
46. Masters, D.; Luschi, C. Revisiting small batch training for deep neural networks. arXiv 2018, arXiv:1804.07612.
47. Wang, W.; Srebro, N. Stochastic nonconvex optimization with large minibatches. In Algorithmic Learning Theory; PMLR: Chicago,
IL, USA, 2019; pp. 857–882.
48. Le Roux, N.; Schmidt, M.; Bach, F. A stochastic gradient method with an exponential convergence rate for finite training sets. Adv. Neural Inf. Process. Syst. 2012, 25.
49. Reddi, S.J.; Hefny, A.; Sra, S.; Poczos, B.; Smola, A.J. On variance reduction in stochastic gradient descent and its asynchronous
variants. Adv. Neural Inf. Process. Syst. 2015, 28, 2647–2655.
50. Shalev-Shwartz, S.; Zhang, T. Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res.
2013, 14, 567–599.
51. Mairal, J. Optimization with first-order surrogate functions. Int. Conf. Mach. Learn. 2013, 28, 783–791.
52. Konečný, J.; Richtárik, P. Semi-stochastic gradient descent methods. arXiv 2013, arXiv:1312.1666.
53. Nguyen, L. M.; Liu, J.; Scheinberg, K.; Takáč, M. SARAH: A novel method for machine learning problems using stochastic
recursive gradient. arXiv 2017, arXiv:1703.00102.
54. Nguyen, L.M.; van Dijk, M.; Phan, D.T.; Nguyen, P.H.; Weng, T.W.; Kalagnanam, J.R. Finite-sum smooth optimization with sarah.
arXiv 2019, arXiv:1901.07648.
55. Xu, X.; Luo, X. Can speed up the convergence rate of stochastic gradient methods to O(1/k2 ) by a gradient averaging strategy.
arXiv 2020, arXiv:2002.10769.
56. Shang, F.; Zhou, K.; Liu, H.; Cheng, J.; Tsang, I. W.; Zhang, L.; Tao, D.; Jiao, L. VR-SGD: A simple stochastic variance reduction
method for machine learning. IEEE Trans. Knowl. Data Eng. 2018, 32, 188–202.
57. Reddi, S.J.; Hefny, A.; Sra, S.; Poczos, B.; Smola, A. Stochastic variance reduction for nonconvex optimization. Int. Conf. Mach.
Learn. 2016, 48, 314–323.
58. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
59. Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k2 ). Dokl. Ussr 1983,
269, 543–547.
60. Sutskever, I.; Martens, J.; Dahl, G.; Hinton, G. On the importance of initialization and momentum in deep learning. Int. Conf.
Mach. Learn. 2013, 28, 1139–1147.
61. Yang, T.; Lin, Q.; Li, Z. Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization.
arXiv 2016, arXiv:1604.03257.
62. Sebbouh, O.; Gower, R.M.; Defazio, A. On the convergence of the Stochastic Heavy Ball Method. arXiv 2020, arXiv:2006.07867.
63. Ghadimi, E.; Feyzmahdavian, H.R.; Johansson, M. Global convergence of the heavy-ball method for convex optimization. In
Proceedings of the 2015 European Control Conference (ECC), Linz, Austria, 15–17 July 2015; pp. 310–315.
64. Mai, V.V.; Johansson, M. Convergence of a Stochastic Gradient Method with Momentum for Nonsmooth Nonconvex Optimization.
arXiv 2020, arXiv:2002.05466.
65. Loizou, N.; Richtárik, P. Linearly convergent stochastic heavy ball method for minimizing generalization error. arXiv 2017,
arXiv:1710.10737.
66. Aujol, J.F.; Dossal, C.; Rondepierre, A. Convergence rates of the Heavy-Ball method for quasi-strongly convex optimization.
SIAM J. Optim. 2022, 32, 1817–1842.
67. Yan, Y.; Yang, T.; Li, Z.; Lin, Q.; Yang, Y. A unified analysis of stochastic momentum methods for deep learning. arXiv 2018,
arXiv:1808.10396.
68. O’Neill, M.; Wright, S.J. Behavior of accelerated gradient methods near critical points of nonconvex functions. Math. Program.
2019, 176, 403–427.
69. Kidambi, R.; Netrapalli, P.; Jain, P.; Kakade, S. On the insufficiency of existing momentum schemes for stochastic optimization.
In Proceedings of the 2018 Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 11–16 February 2018;
pp. 1–9.
70. Roulet, V.; d’Aspremont, A. Sharpness, restart and acceleration. Adv. Neural Inf. Process. Syst. 2017, 30, 1119–1129.
71. Assran, M.; Rabbat, M. On the Convergence of Nesterov’s Accelerated Gradient Method in Stochastic Settings. arXiv 2020,
arXiv:2002.12414.
72. Bengio, Y.; Boulanger-Lewandowski, N.; Pascanu, R. Advances in optimizing recurrent networks. In Proceedings of the 2013 IEEE
International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 8624–8628.
73. Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods. J. Mach. Learn. Res. 2017, 18, 8194–8244.
74. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res.
2011, 12, 2121–2159.
75. Agarwal, A.; Wainwright, M.J.; Bartlett, P.L.; Ravikumar, P. Information-theoretic lower bounds on the oracle complexity of
convex optimization. Adv. Neural Inf. Process. Syst. 2009, 22, 1–9.
76. Li, X.; Orabona, F. On the convergence of stochastic gradient descent with adaptive stepsizes. In Proceedings of the 22nd
International Conference on Artificial Intelligence and Statistics, Naha, Japan, 16–18 April 2019; pp. 983–992.
77. Xie, Y.; Wu, X.; Ward, R. Linear convergence of adaptive stochastic gradient descent. In Proceedings of the International
Conference on Artificial Intelligence and Statistics, Online, 26–28 August, 2020; pp. 1475–1485.
78. Zeiler, M.D. Adadelta: An adaptive learning rate method. arXiv 2012, arXiv:1212.5701.
79. Tieleman, T.; Hinton, G. Rmsprop: Divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning. In International Conference on Machine Learning; PMLR: San Diego, CA, USA, 2017.
80. Reddi, S.J.; Kale, S.; Kumar, S. On the convergence of Adam and beyond. In Proceedings of the International Conference on
Learning Representations, Vancouver Convention Center, Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–8.
81. Bernstein, J.; Wang, Y.X.; Azizzadenesheli, K.; Anandkumar, A. signSGD: Compressed optimisation for non-convex problems.
arXiv 2018, arXiv:1802.04434.
82. Zou, F.; Shen, L.; Jie, Z.; Zhang, W.; Liu, W. A sufficient condition for convergences of adam and rmsprop. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11127–11135.
83. Dozat, T. Incorporating Nesterov Momentum into Adam. In Proceedings of the 4th International Conference on Learning
Representations, Workshop Track, San Juan, Puerto Rico, 2–4 May 2016.
84. Zhuang, Z.; Liu, M.; Cutkosky, A.; Orabona, F. Understanding AdamW through Proximal Methods and Scale-Freeness. arXiv
2022, arXiv:2202.00089.
85. Orabona, F.; Pál, D. Scale-free algorithms for online linear optimization. In International Conference on Algorithmic Learning Theory;
Springer: Cham, Switzerland, 2015; pp. 287–301.
86. Défossez, A.; Bach, F. Adabatch: Efficient gradient aggregation rules for sequential and parallel stochastic gradient methods.
arXiv 2017, arXiv:1711.01761.
87. Mukkamala, M.C.; Hein, M. Variants of rmsprop and adagrad with logarithmic regret bounds. arXiv 2017, arXiv:1706.05507.
88. Chen, J.; Zhou, D.; Tang, Y.; Yang, Z.; Cao, Y.; Gu, Q. Closing the generalization gap of adaptive gradient methods in training
deep neural networks. arXiv 2018, arXiv:1806.06763.
89. Condat, L. A generic proximal algorithm for convex optimization in application to total variation minimization. IEEE Signal
Process. Lett. 2014, 21, 985–989.
90. Donoho, D.L.; Johnstone, I.M.; Hoch, J.C.; Stern, A.S. Maximum entropy and the nearly black object. J. R. Stat. Soc. Ser. (Methodol.)
1992, 54, 41–67.
91. Donoho, D.L.; Johnstone, J.M. Ideal spatial adaptation by wavelet shrinkage. Biometrika 1994, 81, 425–455.
92. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001,
96, 1348–1360.
93. Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010, 38, 894–942.
94. Gao, H.Y.; Bruce, A.G. Waveshrink with firm shrinkage. Stat. Sin. 1997, 7, 855–874.
95. Xu, Z.; Chang, X.; Xu, F.; Zhang, H. L1/2 regularization: A thresholding representation theory and a fast solver. IEEE Trans. Neural
Netw. Learn. Syst. 2012, 23, 1013–1027.
96. Zhang, T. Analysis of multi-stage convex relaxation for sparse regularization. J. Mach. Learn. Res. 2010, 11, 1081–1107.
97. Ghadimi, S.; Lan, G.; Zhang, H. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization.
Math. Program. 2016, 155, 267–305.
98. Rosasco, L.; Villa, S.; Vû, B.C. Convergence of stochastic proximal gradient algorithm. Appl. Math. Optim. 2019, 82, 1–27.
99. Patrascu, A.; Paduraru, C.; Irofti, P. Stochastic Proximal Gradient Algorithm with Minibatches. Application to Large Scale
Learning Models. arXiv 2020, arXiv:2003.13332.
100. Patrascu, A.; Irofti, P. Stochastic proximal splitting algorithm for composite minimization. arXiv 2019, arXiv:1912.02039.
101. Alghunaim, S.; Yuan, K.; Sayed, A.H. A linearly convergent proximal gradient algorithm for decentralized optimization. Adv.
Neural Inf. Process. Syst. 2019, 32, 2848–2858.
102. Bai, Y.; Wang, Y.X.; Liberty, E. Proxquant: Quantized neural networks via proximal operators. arXiv 2018, arXiv:1810.00861.
103. Yun, J.; Lozano, A.C.; Yang, E. A general family of stochastic proximal gradient methods for deep learning. arXiv 2020,
arXiv:2007.07484.
104. Xiao, L.; Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 2014, 24, 2057–2075.
105. Reddi, S.J.; Sra, S.; Poczos, B.; Smola, A. J. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. Adv.
Neural Inf. Process. Syst. 2016, 29, 1145–1153.
106. Aravkin, A.; Davis, D. A smart stochastic algorithm for nonconvex optimization with applications to robust machine learning.
arXiv 2016, arXiv:1610.01101.
107. Li, Z.; Li, J. A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. Adv. Neural Inf. Process. Syst.
2018, 31, 5564–5574.
108. Konečný, J.; Liu, J.; Richtárik, P.; Takáč, M. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J. Sel. Top.
Signal Process. 2015, 10, 242–255.
109. Ochs, P.; Chen, Y.; Brox, T.; Pock, T. iPiano: Inertial proximal algorithm for nonconvex optimization. SIAM J. Imaging Sci. 2014,
7, 1388–1419.
110. Nitanda, A. Stochastic proximal gradient descent with acceleration techniques. Adv. Neural Inf. Process. Syst. 2014, 27, 1574–1582.
111. Toulis, P.; Tran, D.; Airoldi, E. Towards stability and optimality in stochastic gradient descent. Artif. Intell. Stat. 2016, 1290–1298.
112. Meinhardt, T.; Moller, M.; Hazirbas, C.; Cremers, D. Learning proximal operators: Using denoising networks for regularizing
inverse imaging problems. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October
2017; pp. 1781–1790.
113. Martens, J.; Sutskever, I. Learning recurrent neural networks with hessian-free optimization. In Proceedings of the 28th
International Conference on Machine Learning (ICML-11), Bellevue, WA, USA, 28 June–2 July 2011; pp. 1033–1040.
114. Martens, J. Deep learning via hessian-free optimization. ICML 2010, 27, 735–742.
115. Roosta-Khorasani, F.; Mahoney, M.W. Sub-sampled Newton methods II: Local convergence rates. arXiv 2016, arXiv:1601.04738.
116. Bollapragada, R.; Byrd, R.H.; Nocedal, J. Exact and inexact subsampled Newton methods for optimization. IMA J. Numer. Anal.
2019, 39, 545–578.
117. Byrd, R.H.; Chin, G.M.; Neveitt, W.; Nocedal, J. On the use of stochastic hessian information in optimization methods for machine
learning. SIAM J. Optim. 2011, 21, 977–995.
118. Goldfarb, D. A family of variable-metric methods derived by variational means. Math. Comput. 1970, 24, 23–26.
119. Shanno, D.F. Conditioning of quasi-Newton methods for function minimization. Math. Comput. 1970, 24, 647–656.
120. Nocedal, J. Updating quasi-Newton matrices with limited storage. Math. Comput. 1980, 35, 773–782.
121. Liu, D.C.; Nocedal, J. On the limited memory BFGS method for large scale optimization. Math. Program. 1989, 45, 503–528.
122. Schraudolph, N.N.; Yu, J.; Ginter, S. A stochastic quasi-Newton method for online convex optimization. Artif. Intell. Stat. 2007,
2, 436–443.
123. Mokhtari, A.; Ribeiro, A. RES: Regularized stochastic BFGS algorithm. IEEE Trans. Signal Process. 2014, 62, 6089–6104.
124. Chen, H.; Wu, H.C.; Chan, S.C.; Lam, W.H. A stochastic quasi-Newton method for large-scale nonconvex optimization with
applications. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 4776–4790.
125. Byrd, R.H.; Hansen, S.L.; Nocedal, J.; Singer, Y. A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim.
2016, 26, 1008–1031.
126. Nocedal, J.; Wright, S.J. Numerical Optimization; Springer: Berlin/Heidelberg, Germany, 2006.
127. Moritz, P.; Nishihara, R.; Jordan, M. A linearly-convergent stochastic L-BFGS algorithm. Artif. Intell. Stat. 2016, 51, 249–258.
128. Gower, R.; Goldfarb, D.; Richtárik, P. Stochastic block BFGS: Squeezing more curvature out of data. Int. Conf. Mach. Learn. 2016,
48, 1869–1878.
129. Wang, X.; Ma, S.; Goldfarb, D.; Liu, W. Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM J. Optim.
2017, 27, 927–956.
130. Yang, M.; Xu, D.; Li, Y.; Singer, Y. Structured Stochastic Quasi–Newton Methods for Large-Scale Optimization Problems. arXiv
2020, arXiv:2006.09606.
131. Goldfarb, D.; Ren, Y.; Bahamou, A. Practical Quasi–Newton Methods for Training Deep Neural Networks. arXiv 2020,
arXiv:2006.08877.
132. Du, S.S.; Zhai, X.; Poczos, B.; Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv 2018,
arXiv:1810.02054.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.