arXiv:1902.00908v3 [cs.LG] 13 Dec 2019

Abstract—Stochastic gradient descent (SGD) is a popular and efficient method with wide applications in training deep neural nets and other nonconvex models. While the behavior of SGD is well understood in the convex learning setting, the existing theoretical results for SGD applied to nonconvex objective functions are far from mature. For example, existing results require imposing a nontrivial assumption on the uniform boundedness of gradients for all iterates encountered in the learning process, which is hard to verify in practical implementations. In this paper, we establish a rigorous theoretical foundation for SGD in nonconvex learning by showing that this boundedness assumption can be removed without affecting convergence rates. In particular, we establish sufficient conditions for almost sure convergence as well as optimal convergence rates for SGD applied to both general nonconvex objective functions and gradient-dominated objective functions. A linear convergence is further derived in the case with zero variances.

Index Terms—Stochastic Gradient Descent, Nonconvex Optimization, Learning Theory, Polyak-Łojasiewicz Condition

Y. Lei, G. Li and K. Tang are with the Shenzhen Key Laboratory of Computational Intelligence, Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China (e-mail: leiyw@sustc.edu.cn; lgy807720302@gmail.com; tangk3@sustc.edu.cn). T. Hu is with the School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China (e-mail: tinghu@whu.edu.cn). Accepted by IEEE Transactions on Neural Networks and Learning Systems. DOI: 10.1109/TNNLS.2019.2952219

I. INTRODUCTION

Stochastic gradient descent (SGD) is an efficient iterative method suitable for tackling large-scale datasets due to its low computational complexity per iteration and its promising practical behavior, and it has found wide application in solving optimization problems in a variety of areas including machine learning and signal processing. At each iteration, SGD first computes a gradient based on a randomly selected example and then updates the model parameter along the negative gradient direction at the current iterate. This strategy of processing a single training example makes SGD very popular in the big data era, as it enjoys a great computational advantage over its batch counterpart.

Theoretical properties of SGD are well understood for optimizing both convex and strongly convex objectives, the latter of which can be relaxed to other assumptions on objective functions, e.g., error bound conditions and Polyak-Łojasiewicz conditions [1, 2]. In comparison, SGD applied to nonconvex objective functions is much less studied. Indeed, there is a huge gap between the theoretical understanding of SGD and its very promising practical behavior in the nonconvex learning setting, as exemplified by the training of highly nonconvex deep neural networks. For example, while theoretical analysis can only guarantee that SGD may get stuck in local minima, in practice it often converges to special ones with good generalization ability even in the absence of early stopping or explicit regularization.

Motivated by the popularity of SGD in training deep neural networks and nonconvex models, as well as the huge gap between the theoretical understanding and its practical success, theoretical analysis of SGD has received increasing attention recently. The first nonasymptotic convergence rates of nonconvex SGD were established in [3], which were extended to stochastic variance reduction [4] and stochastic proximal gradient descent [5]. However, these results require imposing a nontrivial boundedness assumption on the gradients at all iterates encountered in the learning process, which depends on the realization of the optimization process and is hard to check in practice. It still remains unclear whether this assumption holds when learning takes place in an unbounded domain, in which scenario the existing analysis is not rigorous.

In this paper, we aim to build a sound theoretical foundation for SGD by showing that the same convergence rates can be achieved without any boundedness assumption on gradients in the nonconvex learning setting. We also relax the standard smoothness assumption to a milder Hölder continuity of gradients. As a further step, we consider objective functions satisfying a Polyak-Łojasiewicz (PL) condition, which is widely adopted in the literature on nonconvex optimization. In this case, we derive convergence rates $O(1/t)$ for SGD with $t$ iterations, which also removes the boundedness assumption on gradients imposed in [1] to derive similar convergence rates. We introduce a zero-variance condition which allows us to derive linear convergence of SGD. Sufficient conditions in terms of step sizes are also established for almost sure convergence measured by both function values and gradient norms.

II. PROBLEM FORMULATION AND MAIN RESULTS

Let $\rho$ be a probability measure defined on the sample space $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$, with $\mathcal{X} \subset \mathbb{R}^d$ being the input space and $\mathcal{Y}$ being the output space. We are interested in building a prediction rule $h : \mathcal{X} \mapsto \mathcal{Y}$ based on a sequence of examples $\{z_t\}_{t\in\mathbb{N}}$ independently drawn from $\rho$. We consider learning in a reproducing kernel Hilbert space (RKHS) $H_K$ associated to a Mercer kernel $K : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$. The RKHS $H_K$ is defined as the completion of the linear span of the function set $\{K_x(\cdot) := K(x, \cdot) : x \in \mathcal{X}\}$ satisfying the reproducing property $w(x) = \langle w, K_x\rangle$ for any $x \in \mathcal{X}$ and $w \in H_K$, where $\langle\cdot,\cdot\rangle$ denotes the inner product. The quality of a prediction rule $h$ at an example $z$ is measured
by $\ell(h(x), y)$, where $\ell : \mathbb{R} \times \mathbb{R} \mapsto \mathbb{R}_+$ is a differentiable loss function, with which we define the objective function as
$$\mathcal{E}(h) = \mathbb{E}_z[\ell(h(x), y)] = \int_{\mathcal{Z}} \ell(h(x), y)\, d\rho. \tag{1}$$
We consider nonconvex loss functions in this paper. We implement the learning process by SGD to minimize the objective function over $H_K$. Let $w_1 = 0$ and let $z_t = (x_t, y_t)$ be the example sampled according to $\rho$ at the $t$-th iteration. We update the model sequence $\{w_t\}_{t\in\mathbb{N}}$ in $H_K$ by
$$w_{t+1} = w_t - \eta_t \nabla\ell\big(\langle w_t, K_{x_t}\rangle, y_t\big) K_{x_t} = w_t - \eta_t \nabla f(w_t, z_t), \tag{2}$$
where $\nabla\ell$ denotes the gradient of $\ell$ with respect to the first argument, $\{\eta_t\}_{t\in\mathbb{N}}$ is a sequence of positive step sizes and we introduce $f(w, z) = \ell(\langle w, K_x\rangle, y)$ for brevity. We denote by $\|\cdot\|_2$ the RKHS norm in $H_K$.
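Since $w_1 = 0$, every iterate of (2) lies in the span of $\{K_{x_1}, \dots, K_{x_{t-1}}\}$, so SGD in $H_K$ can be implemented by storing the coefficients of this kernel expansion. The following is a minimal sketch of such an implementation; the Gaussian kernel, the least-squares loss and the particular step size values are illustrative choices (the analysis here allows any differentiable, possibly nonconvex, loss), and the function names are ours, not from the paper.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    # One concrete choice of Mercer kernel K(x, x').
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def kernel_sgd(examples, loss_grad, step_sizes, kernel=gaussian_kernel):
    """Update (2): w_{t+1} = w_t - eta_t * l'(w_t(x_t), y_t) * K_{x_t}.

    Since w_1 = 0, each iterate has the form w_t = sum_i a_i K_{x_i},
    so it is stored through its expansion centers and coefficients.
    """
    centers, coeffs = [], []
    for (x_t, y_t), eta_t in zip(examples, step_sizes):
        # Reproducing property: w_t(x_t) = <w_t, K_{x_t}> = sum_i a_i K(x_i, x_t).
        pred = sum(a * kernel(c, x_t) for c, a in zip(centers, coeffs))
        g = loss_grad(pred, y_t)       # gradient of l in its first argument
        centers.append(x_t)            # the step adds the term -eta_t * g * K_{x_t}
        coeffs.append(-eta_t * g)
    return centers, coeffs

# Illustration with the least-squares loss l(u, y) = (u - y)^2 / 2, l'(u, y) = u - y,
# and decaying step sizes eta_t = 0.5 * t^(-3/4).
rng = np.random.default_rng(0)
examples = [(rng.normal(size=3), rng.normal()) for _ in range(100)]
etas = [0.5 * t ** -0.75 for t in range(1, 101)]
centers, coeffs = kernel_sgd(examples, lambda u, y: u - y, etas)
```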
Our theoretical analysis is based on a fundamental assumption on the regularity of loss functions. Assumption 1 with $\alpha = 1$ corresponds to a smoothness assumption standard in nonconvex learning, which is extended here to a general Hölder continuity assumption on the gradient of loss functions.

Assumption 1. Let $\alpha \in (0, 1]$ and $L > 0$. We assume that the gradient of $f(\cdot, z)$ is $\alpha$-Hölder continuous in the sense that
$$\|\nabla f(w, z) - \nabla f(\tilde{w}, z)\|_2 \le L\|w - \tilde{w}\|_2^{\alpha}, \quad \forall w, \tilde{w} \in H_K,\ z \in \mathcal{Z}.$$

For any function $\varphi : H_K \mapsto \mathbb{R}$ with Hölder continuous gradients, we have the following lemma, which plays an important role in our analysis. Eq. (4) provides a quantitative measure of the accuracy of approximating $\varphi$ by its first-order approximation, while (5) provides a self-bounding property, meaning that the norm of gradients can be controlled by function values.

Lemma 1. Let $\varphi : H_K \mapsto \mathbb{R}$ be a differentiable function. Let $\alpha \in (0, 1]$ and $L > 0$. If for all $w, \tilde{w} \in H_K$
$$\|\nabla\varphi(w) - \nabla\varphi(\tilde{w})\|_2 \le L\|w - \tilde{w}\|_2^{\alpha}, \tag{3}$$
then we have
$$\varphi(\tilde{w}) \le \varphi(w) + \langle \tilde{w} - w, \nabla\varphi(w)\rangle + \frac{L}{1+\alpha}\|w - \tilde{w}\|_2^{1+\alpha}. \tag{4}$$
Furthermore, if $\varphi(w) \ge 0$ for all $w \in H_K$, then
$$\|\nabla\varphi(w)\|_2^{\frac{1+\alpha}{\alpha}} \le \frac{(1+\alpha)L^{\frac{1}{\alpha}}}{\alpha}\,\varphi(w), \quad \forall w \in H_K. \tag{5}$$

Lemma 1, to be proved in Section IV-A, is an extension of Proposition 1 in [6] from univariate functions to multivariate functions. It should be noted that (5) improves Proposition 1(d) in [6] by removing a factor of $(1+\alpha)^{\frac{1}{\alpha}}$.
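As a numerical sanity check of (4) and (5) in the smooth case $\alpha = 1$, one can take the nonnegative quadratic $\varphi(w) = \frac{1}{2}w^{\top}Qw$ with $Q$ positive semidefinite, for which the gradient is Lipschitz with $L = \|Q\|_2$. The sketch below (a finite-dimensional stand-in for $H_K$, with our own variable names) verifies both inequalities at random points:

```python
import numpy as np

rng = np.random.default_rng(1)
d, alpha = 5, 1.0
A = rng.normal(size=(d, d))
Q = A @ A.T                       # positive semidefinite, so phi(w) >= 0
L = np.linalg.norm(Q, 2)          # spectral norm: Lipschitz constant of the gradient
phi = lambda w: 0.5 * w @ Q @ w
grad = lambda w: Q @ w

for _ in range(1000):
    w, wt = rng.normal(size=d), rng.normal(size=d)
    # Eq. (4): the first-order approximation error is at most L/(1+alpha) * ||w - wt||^(1+alpha).
    rhs4 = phi(w) + (wt - w) @ grad(w) + L / (1 + alpha) * np.linalg.norm(w - wt) ** (1 + alpha)
    assert phi(wt) <= rhs4 + 1e-7
    # Eq. (5): self-bounding property of nonnegative functions with Hoelder gradients.
    lhs5 = np.linalg.norm(grad(w)) ** ((1 + alpha) / alpha)
    assert lhs5 <= (1 + alpha) * L ** (1 / alpha) / alpha * phi(w) + 1e-7
```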
A. General nonconvex objective functions

We now present theoretical results for SGD with general nonconvex loss functions. In this case we measure the progress of SGD in terms of gradients. Part (a) gives a nonasymptotic convergence rate determined by the step sizes, while Parts (b) and (c) provide sufficient conditions for asymptotic convergence measured by function values and gradient norms, respectively.

Theorem 2. Suppose that Assumption 1 holds. Let $\{w_t\}_{t\in\mathbb{N}}$ be produced by (2) with the step sizes satisfying $C_1 := \sum_{t=1}^{\infty}\eta_t^{1+\alpha} < \infty$. Then, the following three statements hold.

(a) There is a constant $C$ independent of $T$ such that
$$\min_{t=1,\dots,T} \mathbb{E}\big[\|\nabla\mathcal{E}(w_t)\|_2^2\big] \le C\Big(\sum_{t=1}^{T}\eta_t\Big)^{-1}. \tag{6}$$
(b) $\{\mathcal{E}(w_t)\}_t$ converges to an almost surely (a.s.) bounded random variable.
(c) If Assumption 1 holds with $\alpha = 1$ and $\sum_{t=1}^{\infty}\eta_t = \infty$, then $\lim_{t\to\infty}\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2] = 0$.
Remark 1. Part (a) was derived in [3] under the boundedness assumption $\mathbb{E}_z\big[\|\nabla f(w_t, z) - \nabla\mathcal{E}(w_t)\|_2^2\big] \le \sigma^2$ for a constant $\sigma > 0$ and all $t \in \mathbb{N}$. This boundedness assumption depends on the realization of the optimization process and is therefore difficult to check in practice. It is removed in our analysis. Although Parts (b) and (c) do not give convergence rates, an appealing property is that they consider individual iterates. As a comparison, the convergence rates in (6) only hold for the minimum over the first $T$ iterates. The analysis for individual iterates is much more challenging than that for the minimum over all iterates. Indeed, Part (c) is based on a careful analysis with a contradiction strategy.

We can derive explicit convergence rates by instantiating the step sizes in Theorem 2. If $\alpha = 1$, the convergence rate in Part (b) of Corollary 3 becomes $O(T^{-\frac{1}{2}}\log^{\frac{\beta}{2}} T)$, which is minimax optimal up to a logarithmic factor.

Corollary 3. Suppose that Assumption 1 holds. Let $\{w_t\}_{t\in\mathbb{N}}$ be the sequence produced by (2). Then,

(a) If $\eta_t = \eta_1 t^{-\theta}$ with $\theta \in (1/(1+\alpha), 1)$, then $\min_{t=1,\dots,T}\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2^2] = O(T^{\theta-1})$.
(b) If $\eta_t = \eta_1 \big(t\log^{\beta}(t+1)\big)^{-\frac{1}{1+\alpha}}$ with $\beta > 1$, then $\min_{t=1,\dots,T}\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2^2] = O\big(T^{-\frac{\alpha}{\alpha+1}}\log^{\frac{\beta}{1+\alpha}} T\big)$.
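Both step size schedules in Corollary 3 are straightforward to implement. A small sketch, with illustrative values of $\eta_1$, $\theta$ and $\beta$ (the corollary only constrains their ranges):

```python
import numpy as np

def polynomial_steps(T, eta1=0.1, theta=0.75):
    # Corollary 3(a): eta_t = eta1 * t^(-theta) with theta in (1/(1+alpha), 1).
    t = np.arange(1, T + 1)
    return eta1 * t ** -theta

def log_corrected_steps(T, eta1=0.1, alpha=1.0, beta=1.1):
    # Corollary 3(b): eta_t = eta1 * (t * log^beta(t+1))^(-1/(1+alpha)) with beta > 1.
    t = np.arange(1, T + 1)
    return eta1 * (t * np.log(t + 1) ** beta) ** (-1.0 / (1 + alpha))

# Both schedules satisfy the conditions of Theorem 2: sum eta_t^(1+alpha) stays finite
# while sum eta_t diverges (so Part (c) applies as well when alpha = 1).
etas = log_corrected_steps(10 ** 5)
print(etas.sum(), (etas ** 2).sum())  # the first sum keeps growing, the second stays bounded
```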
B. Objective functions with Polyak-Łojasiewicz inequality

We now proceed with our convergence analysis by imposing an assumption referred to as the PL inequality, named after Polyak and Łojasiewicz [2]. Intuitively, this inequality means that the suboptimality of iterates measured by function values can be bounded by gradient norms. The PL condition is also referred to as the gradient dominance condition in the literature [4], and is widely adopted in the analysis of both convex and nonconvex optimization [1, 7, 8]. Examples of functions satisfying the PL condition include neural networks with one hidden layer, ResNets with linear activations and objective functions in matrix factorization [8]. It should be noted that functions satisfying the PL condition are not necessarily convex.

Assumption 2. We assume that the function $\mathcal{E}$ satisfies the PL inequality with parameter $\mu > 0$, i.e.,
$$\mathcal{E}(w) - \mathcal{E}(w^*) \le (2\mu)^{-1}\|\nabla\mathcal{E}(w)\|_2^2, \quad \forall w \in H_K,$$
where $w^* = \arg\min_{w\in H_K}\mathcal{E}(w)$.
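A standard example in which Assumption 2 can be verified directly (noted, e.g., in [1]) is least squares: for $\mathcal{E}(w) = \frac{1}{2n}\|Xw - y\|_2^2$ with $X$ of full column rank, the PL inequality holds with $\mu$ equal to the smallest eigenvalue of the Hessian $X^{\top}X/n$. This instance happens to be convex; it is used here only to make the inequality concrete. A minimal numerical check under these assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

E = lambda w: 0.5 / n * np.sum((X @ w - y) ** 2)
grad = lambda w: X.T @ (X @ w - y) / n

w_star = np.linalg.lstsq(X, y, rcond=None)[0]   # minimizer of E
mu = np.linalg.eigvalsh(X.T @ X / n)[0]         # smallest Hessian eigenvalue (PL constant)

for _ in range(1000):
    w = rng.normal(size=d, scale=10.0)
    # Assumption 2: E(w) - E(w*) <= (2 mu)^{-1} * ||grad E(w)||_2^2.
    assert E(w) - E(w_star) <= np.linalg.norm(grad(w)) ** 2 / (2 * mu) + 1e-6
```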
Under Assumption 2, we can state convergence results measured by the suboptimality of function values. Part (a) provides a sufficient condition for almost sure convergence measured by function values and gradient norms, while Part (b) establishes explicit convergence rates for step sizes reciprocal to the iteration number. If $\alpha = 1$, we derive convergence rates $O(t^{-1})$ after $t$ iterations, which is minimax optimal even when the objective function is strongly convex. Part (c) shows that a linear convergence can be achieved if $\mathbb{E}[\|\nabla f(w^*, z)\|_2^2] = 0$, which extends the linear convergence of gradient descent [1] to the stochastic setting. The assumption $\mathbb{E}[\|\nabla f(w^*, z)\|_2^2] = 0$ means that the variance of the stochastic gradient vanishes at $w = w^*$, since $\mathrm{Var}(\nabla f(w^*, z)) = \mathbb{E}\big[\|\nabla f(w^*, z) - \nabla\mathcal{E}(w^*)\|_2^2\big] = 0$.

Theorem 4. Let Assumptions 1 and 2 hold. Let $\{w_t\}_{t\in\mathbb{N}}$ be produced by (2). Then the following statements hold.

(a) If $\sum_{t=1}^{\infty}\eta_t^{1+\alpha} < \infty$ and $\sum_{t=1}^{\infty}\eta_t = \infty$, then a.s. $\lim_{t\to\infty}\mathcal{E}(w_t) = \mathcal{E}(w^*)$ and $\lim_{t\to\infty}\|\nabla\mathcal{E}(w_t)\|_2 = 0$.
(b) If $\eta_t = 2/((t+1)\mu)$, then for any $t \ge t_0 := 2L^{\frac{2}{\alpha}}\mu^{-\frac{1+\alpha}{\alpha}}$ we have $\mathbb{E}[\mathcal{E}(w_{t+1})] - \mathcal{E}(w^*) \le \widetilde{C}t^{-\alpha}$, where $\widetilde{C}$ is a constant independent of $t$ (explicitly given in the proof).
(c) If $\mathbb{E}[\|\nabla f(w^*, z)\|_2^2] = 0$, Assumption 1 holds with $\alpha = 1$ and $\eta_t = \eta \le \mu/L^2$, then
$$\mathbb{E}[\mathcal{E}(w_{t+1})] - \mathcal{E}(w^*) \le (1 - \mu\eta)^t\big(\mathcal{E}(w_1) - \mathcal{E}(w^*)\big).$$

Remark 2. Conditions of the form $\sum_{t=1}^{\infty}\eta_t^2 < \infty$ and $\sum_{t=1}^{\infty}\eta_t = \infty$ are established for almost sure convergence with strongly convex objectives, which are extended here to nonconvex learning under PL conditions. Convergence rates $O(t^{-1})$ were established for nonconvex optimization under PL conditions, a bounded gradient assumption of the form $\mathbb{E}[\|\nabla f(w_t, z)\|_2^2] \le \sigma^2$ and smoothness assumptions [1]. We derive the same convergence rates without the bounded gradient assumption and relax the smoothness assumption to Hölder continuity of $\nabla f(w, z)$.
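The zero-variance condition of Theorem 4(c) holds, for instance, in a realizable (interpolation) regime where a single $w^*$ fits every example exactly, so that $\nabla f(w^*, z) = 0$ almost surely. A small simulation under such a hypothetical setup, finite-sum least squares with noiseless labels, illustrates the geometric decay of the optimality gap; the theorem bounds the gap in expectation, and a single run tracks it only roughly:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star                             # noiseless labels: grad f(w*, z) = 0 for every z

E = lambda w: 0.5 / n * np.sum((X @ w - y) ** 2)   # here E(w*) = 0
L = np.max(np.sum(X ** 2, axis=1))         # per-example smoothness constant (alpha = 1)
mu = np.linalg.eigvalsh(X.T @ X / n)[0]    # PL constant of E
eta = mu / L ** 2                          # constant step size allowed by Theorem 4(c)

w = np.zeros(d)
for t in range(2000):
    i = rng.integers(n)                    # sample one example
    w -= eta * (X[i] @ w - y[i]) * X[i]    # SGD step on f(w, z_i) = (<w, x_i> - y_i)^2 / 2
# Compare the final gap with the geometric bound (1 - mu * eta)^t * (E(w_1) - E(w*)).
print(E(w), (1 - mu * eta) ** 2000 * E(np.zeros(d)))
```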
III. RELATED WORK AND DISCUSSIONS

SGD has been comprehensively studied in the literature, mainly in the convex setting. For generally convex objective functions, regret bounds $O(\sqrt{T})$ were established for SGD with $T$ iterates [9], which directly imply convergence rates $O(1/\sqrt{T})$ [10]. For strongly convex objective functions, regret bounds can be improved to $O(\log T)$ [11], which imply convergence rates $O(\log T/T)$. These results were extended to online learning in RKHSs [12-14] and to learning with a mirror map to capture the geometry of problems [15, 16].

As compared to the maturity of understanding in convex optimization, the convergence analysis of SGD in the nonconvex setting is far from satisfactory. Asymptotic convergence of SGD was established under the assumption $\mathbb{E}_z\big[\|\nabla f(w_t, z) - \nabla\mathcal{E}(w_t)\|_2^2\big] \le A\big(1 + \|\nabla\mathcal{E}(w_t)\|_2^2\big)$ for $A > 0$ and all $t \in \mathbb{N}$ [17]. Nonasymptotic convergence rates similar to (6) were established in [3] under the boundedness assumption $\mathbb{E}[\|\nabla f(w_t, z_t)\|_2^2] \le \sigma^2$ for all $t \in \mathbb{N}$. For objective functions satisfying PL conditions, convergence rates $O(1/T)$ were established for SGD under the boundedness assumption $\mathbb{E}[\|\nabla f(w_t, z_t)\|_2^2] \le \sigma^2$ for all $t \in \mathbb{N}$ [1]. This boundedness assumption in the literature depends on the realization of the optimization process, which is hard to check in practical implementations. In this paper we show that the same convergence rates can be established without any boundedness assumptions. This establishes a rigorous foundation to safeguard SGD. Existing discussions also require imposing an assumption on the smoothness of $f(w, z)$, which is relaxed here to a Hölder continuity of $\nabla f(w, z)$. Both the PL condition and the Hölder continuity condition do not depend on the iterates and can be checked from the objective functions themselves; they are standard in the literature and satisfied by many nonconvex models [1, 4, 8]. It should be noted that convergence analysis was also performed when $f(w, z)$ is convex [18] and nonconvex [19] without bounded gradient assumptions, both of which, however, require $\mathcal{E}(w)$ to be strongly convex and $f(w, z)$ to be smooth. Furthermore, we establish a linear convergence of SGD in the case with zero variances, while this linear convergence was previously derived only for batch gradient descent applied to gradient-dominated objective functions [1]. Necessary and sufficient conditions of the form $\sum_{t=1}^{\infty}\eta_t = \infty$, $\sum_{t=1}^{\infty}\eta_t^2 < \infty$ were established for the convergence of online mirror descent in a strongly convex setting [18], which are partially extended here to the convergence of SGD for gradient-dominated objective functions, measured by both function values and gradient norms.
IV. PROOFS

A. Proof of Theorem 2

In this section, we present the proofs of Theorem 2 and Corollary 3 on the convergence of SGD applied to general nonconvex loss functions. To this aim, we first prove Lemma 1 and introduce Doob's forward convergence theorem on almost sure convergence (see, e.g., [20] on page 195).

Proof of Lemma 1. Eq. (4) can be proved in the same way as Part (a) of Proposition 1 in [6]. We now prove (5) for non-negative $\varphi$. We only need to consider the case $\nabla\varphi(w) \ne 0$. In this case, set
$$\tilde{w} = w - L^{-\frac{1}{\alpha}}\|\nabla\varphi(w)\|_2^{\frac{1}{\alpha}}\,\|\nabla\varphi(w)\|_2^{-1}\nabla\varphi(w)$$
in (4). We derive
$$0 \le \varphi(\tilde{w}) \le \varphi(w) - L^{-\frac{1}{\alpha}}\|\nabla\varphi(w)\|_2^{\frac{1}{\alpha}}\Big\langle \frac{\nabla\varphi(w)}{\|\nabla\varphi(w)\|_2}, \nabla\varphi(w)\Big\rangle + \frac{L}{1+\alpha}L^{-\frac{1+\alpha}{\alpha}}\|\nabla\varphi(w)\|_2^{\frac{1+\alpha}{\alpha}}$$
$$= \varphi(w) - L^{-\frac{1}{\alpha}}\|\nabla\varphi(w)\|_2^{\frac{1+\alpha}{\alpha}} + L^{-\frac{1}{\alpha}}(1+\alpha)^{-1}\|\nabla\varphi(w)\|_2^{\frac{1+\alpha}{\alpha}} = \varphi(w) - \frac{\alpha L^{-\frac{1}{\alpha}}}{1+\alpha}\|\nabla\varphi(w)\|_2^{\frac{1+\alpha}{\alpha}},$$
from which the stated bound (5) follows.

Lemma 5. Let $\{\widetilde{X}_t\}_{t\in\mathbb{N}}$ be a sequence of non-negative random variables with $\mathbb{E}[\widetilde{X}_1] < \infty$ and let $\{\mathcal{F}_t\}_{t\in\mathbb{N}}$ be a nested sequence of sets of random variables with $\mathcal{F}_t \subset \mathcal{F}_{t+1}$ for all $t \in \mathbb{N}$. If $\mathbb{E}[\widetilde{X}_{t+1}\,|\,\mathcal{F}_t] \le \widetilde{X}_t$ for all $t \in \mathbb{N}$, then $\widetilde{X}_t$ converges to a non-negative random variable $\widetilde{X}$ a.s. and $\widetilde{X} < \infty$ a.s.

Proof of Theorem 2. We first prove Part (a). According to Assumption 1, we know
$$\|\nabla\mathcal{E}(w) - \nabla\mathcal{E}(\tilde{w})\|_2 = \big\|\mathbb{E}[\nabla f(w, z)] - \mathbb{E}[\nabla f(\tilde{w}, z)]\big\|_2 \le \mathbb{E}\big[\|\nabla f(w, z) - \nabla f(\tilde{w}, z)\|_2\big] \le L\|w - \tilde{w}\|_2^{\alpha}.$$
Therefore, $\nabla\mathcal{E}(w)$ is $\alpha$-Hölder continuous. According to (4) with $\varphi = \mathcal{E}$ and (2), we know
$$\mathcal{E}(w_{t+1}) \le \mathcal{E}(w_t) + \langle w_{t+1} - w_t, \nabla\mathcal{E}(w_t)\rangle + \frac{L}{1+\alpha}\|w_{t+1} - w_t\|_2^{1+\alpha}$$
$$= \mathcal{E}(w_t) - \eta_t\langle\nabla f(w_t, z_t), \nabla\mathcal{E}(w_t)\rangle + \frac{L\eta_t^{1+\alpha}}{1+\alpha}\|\nabla f(w_t, z_t)\|_2^{1+\alpha}$$
$$\le \mathcal{E}(w_t) - \eta_t\langle\nabla f(w_t, z_t), \nabla\mathcal{E}(w_t)\rangle + \frac{L^2\eta_t^{1+\alpha}}{1+\alpha}\Big(\frac{(1+\alpha)f(w_t, z_t)}{\alpha}\Big)^{\alpha}, \tag{7}$$
where the last inequality is due to (5). With Young's inequality, valid for all $u, v \in \mathbb{R}$ and $p, q > 1$ with $p^{-1} + q^{-1} = 1$,
$$uv \le p^{-1}|u|^p + q^{-1}|v|^q, \tag{8}$$
we get $\Big(\frac{(1+\alpha)f(w_t, z_t)}{\alpha}\Big)^{\alpha} \le \alpha\cdot\frac{(1+\alpha)f(w_t, z_t)}{\alpha} + 1 - \alpha$.
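To make this application of (8) explicit: for $\alpha < 1$ one may take $u = \big((1+\alpha)f(w_t, z_t)/\alpha\big)^{\alpha}$, $v = 1$, $p = 1/\alpha$ and $q = 1/(1-\alpha)$, which gives
$$\Big(\frac{(1+\alpha)f(w_t, z_t)}{\alpha}\Big)^{\alpha} \cdot 1 \;\le\; \alpha\Big(\frac{(1+\alpha)f(w_t, z_t)}{\alpha}\Big) + (1-\alpha) \;=\; (1+\alpha)f(w_t, z_t) + 1 - \alpha;$$
for $\alpha = 1$ the bound holds trivially with equality.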
Plugging this bound into (7) shows
$$\mathcal{E}(w_{t+1}) \le \mathcal{E}(w_t) - \eta_t\langle\nabla f(w_t, z_t), \nabla\mathcal{E}(w_t)\rangle + \frac{L^2\eta_t^{1+\alpha}}{1+\alpha}\big((1+\alpha)f(w_t, z_t) + 1 - \alpha\big).$$
Taking the conditional expectation with respect to $z_t$, we derive
$$\mathbb{E}_{z_t}[\mathcal{E}(w_{t+1})] \le \mathcal{E}(w_t) - \eta_t\|\nabla\mathcal{E}(w_t)\|_2^2 + L^2\eta_t^{1+\alpha}\big(\mathcal{E}(w_t) + 1 - \alpha\big) \tag{9}$$
$$\le (1 + L^2\eta_t^{1+\alpha})\mathcal{E}(w_t) - \eta_t\|\nabla\mathcal{E}(w_t)\|_2^2 + L^2(1-\alpha)\eta_t^{1+\alpha}. \tag{10}$$
It then follows that
$$\mathbb{E}[\mathcal{E}(w_{t+1})] \le (1 + L^2\eta_t^{1+\alpha})\mathbb{E}[\mathcal{E}(w_t)] + L^2(1-\alpha)\eta_t^{1+\alpha},$$
from which we derive
$$\mathbb{E}[\mathcal{E}(w_{t+1})] + L^2(1-\alpha)\sum_{k=t+1}^{\infty}\eta_k^{1+\alpha} \le (1 + L^2\eta_t^{1+\alpha})\Big(\mathbb{E}[\mathcal{E}(w_t)] + L^2(1-\alpha)\sum_{k=t}^{\infty}\eta_k^{1+\alpha}\Big).$$
Introduce $A_t = \mathbb{E}[\mathcal{E}(w_t)] + L^2(1-\alpha)\sum_{k=t}^{\infty}\eta_k^{1+\alpha}$, $\forall t \in \mathbb{N}$. Then, it follows from the inequality $1 + a \le \exp(a)$ that $A_{t+1} \le (1 + L^2\eta_t^{1+\alpha})A_t \le \exp(L^2\eta_t^{1+\alpha})A_t$. An application of the above inequality recursively then gives
$$A_{t+1} \le \exp\Big(L^2\sum_{k=1}^{t}\eta_k^{1+\alpha}\Big)A_1 \le \exp\Big(L^2\sum_{k=1}^{\infty}\eta_k^{1+\alpha}\Big)A_1 := C_2,$$
from which we know $\mathbb{E}[\mathcal{E}(w_t)] \le C_2$, $\forall t \in \mathbb{N}$. Plugging the above inequality back into (10) gives
$$\mathbb{E}[\mathcal{E}(w_{t+1})] \le \mathbb{E}[\mathcal{E}(w_t)] - \eta_t\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2^2] + L^2\eta_t^{1+\alpha}(C_2 + 1 - \alpha). \tag{11}$$
A summation of the above inequality then implies
$$\sum_{t=1}^{T}\eta_t\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2^2] \le \sum_{t=1}^{T}\big(\mathbb{E}[\mathcal{E}(w_t)] - \mathbb{E}[\mathcal{E}(w_{t+1})]\big) + L^2(C_2 + 1 - \alpha)\sum_{t=1}^{T}\eta_t^{1+\alpha} \le \mathcal{E}(w_1) + L^2(C_2 + 1 - \alpha)C_1,$$
from which we directly get (6) with $C := \mathcal{E}(w_1) + L^2C_1(C_2 + 1 - \alpha)$. This proves Part (a).

We now prove Part (b). Multiplying both sides of (10) by $\prod_{k=t+1}^{\infty}(1 + L^2\eta_k^{1+\alpha})$, the term $\prod_{k=t+1}^{\infty}(1 + L^2\eta_k^{1+\alpha})\,\mathbb{E}_{z_t}[\mathcal{E}(w_{t+1})]$ can be upper bounded by
$$\prod_{k=t}^{\infty}(1 + L^2\eta_k^{1+\alpha})\mathcal{E}(w_t) + L^2(1-\alpha)\prod_{k=t+1}^{\infty}(1 + L^2\eta_k^{1+\alpha})\eta_t^{1+\alpha} \le \prod_{k=t}^{\infty}(1 + L^2\eta_k^{1+\alpha})\mathcal{E}(w_t) + C_3\eta_t^{1+\alpha}, \tag{12}$$
where we introduce $C_3 = L^2(1-\alpha)\prod_{k=1}^{\infty}(1 + L^2\eta_k^{1+\alpha}) < \infty$. Introduce the stochastic process
$$\widetilde{X}_t = \prod_{k=t}^{\infty}(1 + L^2\eta_k^{1+\alpha})\mathcal{E}(w_t) + C_3\sum_{k=t}^{\infty}\eta_k^{1+\alpha}.$$
Eq. (12) amounts to saying $\mathbb{E}_{z_t}[\widetilde{X}_{t+1}] \le \widetilde{X}_t$ for all $t \in \mathbb{N}$, which shows that $\{\widetilde{X}_t\}_{t\in\mathbb{N}}$ is a non-negative supermartingale. Furthermore, the assumption $\sum_{t=1}^{\infty}\eta_t^{1+\alpha} < \infty$ implies that $\mathbb{E}[\widetilde{X}_1] < \infty$. We can apply Lemma 5 to show that $\lim_{t\to\infty}\widetilde{X}_t = \widetilde{X}$ for a non-negative random variable $\widetilde{X}$ a.s. This together with the assumption $\sum_{t=1}^{\infty}\eta_t^{1+\alpha} < \infty$ implies $\lim_{t\to\infty}\widetilde{Y}_t = \widetilde{Y}$ for a non-negative random variable $\widetilde{Y}$, where $\widetilde{Y}_t = \prod_{k=t}^{\infty}(1 + L^2\eta_k^{1+\alpha})\mathcal{E}(w_t)$ for all $t \in \mathbb{N}$ and $\widetilde{Y} < \infty$ a.s. Furthermore, it is clear a.s. that
$$\big|\mathcal{E}(w_t) - \widetilde{Y}\big| \le \Big|1 - \prod_{k=t}^{\infty}(1 + L^2\eta_k^{1+\alpha})\Big|\,\mathcal{E}(w_t) + \Big|\prod_{k=t}^{\infty}(1 + L^2\eta_k^{1+\alpha})\mathcal{E}(w_t) - \widetilde{Y}\Big| \xrightarrow{t\to\infty} 0,$$
where we have used the fact $\lim_{t\to\infty}\prod_{k=t}^{\infty}(1 + L^2\eta_k^{1+\alpha}) = 1$ due to $\sum_{t=1}^{\infty}\eta_t^{1+\alpha} < \infty$. That is, $\mathcal{E}(w_t)$ converges to $\widetilde{Y}$ a.s.

We now prove Part (c) by contradiction. According to Assumption 1 and Lemma 1, we know
$$\|\nabla f(w_k, z_k)\|_2 \le \Big(\frac{(1+\alpha)L^{\frac{1}{\alpha}}f(w_k, z_k)}{\alpha}\Big)^{\frac{\alpha}{1+\alpha}} \le L^{\frac{1}{\alpha}}f(w_k, z_k) + (1+\alpha)^{-1},$$
where we have used Young's inequality (8). Taking expectations on both sides and using $\mathbb{E}[\mathcal{E}(w_k)] \le C_2$, we derive
$$\mathbb{E}[\|\nabla f(w_k, z_k)\|_2] \le L^{\frac{1}{\alpha}}\mathbb{E}[\mathcal{E}(w_k)] + (1+\alpha)^{-1} \le L^{\frac{1}{\alpha}}C_2 + (1+\alpha)^{-1} := C_4. \tag{13}$$
Suppose to the contrary that $\limsup_{t\to\infty}\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2] > 0$. By Part (a) and the assumption $\sum_{t=1}^{\infty}\eta_t = \infty$, we know
$$\liminf_{t\to\infty}\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2] \le \liminf_{t\to\infty}\sqrt{\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2^2]} = 0.$$
Then there exists an $\epsilon > 0$ such that $\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2] < \epsilon$ for infinitely many $t$ and $\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2] > 2\epsilon$ for infinitely many $t$. Let $\mathcal{T}$ be a subset of integers such that for every $t \in \mathcal{T}$ we can find an integer $k(t) > t$ such that
$$\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2] < \epsilon, \quad \mathbb{E}[\|\nabla\mathcal{E}(w_{k(t)})\|_2] > 2\epsilon \quad\text{and}\quad \epsilon \le \mathbb{E}[\|\nabla\mathcal{E}(w_k)\|_2] \le 2\epsilon \ \text{ for all } t < k < k(t). \tag{14}$$
Furthermore, we can assert that $\eta_t \le \epsilon/(2LC_4)$ for every $t$ larger than the smallest integer in $\mathcal{T}$, since $\lim_{t\to\infty}\eta_t = 0$.
By (13), (14) and Assumption 1 with $\alpha = 1$, we know
$$\epsilon \le \mathbb{E}[\|\nabla\mathcal{E}(w_{k(t)})\|_2] - \mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2] \le \mathbb{E}[\|\nabla\mathcal{E}(w_{k(t)}) - \nabla\mathcal{E}(w_t)\|_2] \le \sum_{k=t}^{k(t)-1}\mathbb{E}[\|\nabla\mathcal{E}(w_{k+1}) - \nabla\mathcal{E}(w_k)\|_2]$$
$$\le L\sum_{k=t}^{k(t)-1}\mathbb{E}[\|w_{k+1} - w_k\|_2] = L\sum_{k=t}^{k(t)-1}\eta_k\mathbb{E}[\|\nabla f(w_k, z_k)\|_2] \le LC_4\sum_{k=t}^{k(t)-1}\eta_k. \tag{15}$$
Analogously, one can show
$$\mathbb{E}[\|\nabla\mathcal{E}(w_{t+1})\|_2] - \mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2] \le \mathbb{E}[\|\nabla\mathcal{E}(w_{t+1}) - \nabla\mathcal{E}(w_t)\|_2] \le L\mathbb{E}[\|w_{t+1} - w_t\|_2] \le L\eta_t\mathbb{E}[\|\nabla f(w_t, z_t)\|_2] \le LC_4\eta_t,$$
from which, together with (14) and $\eta_t \le \epsilon/(2LC_4)$ for any $t$ larger than the smallest integer in $\mathcal{T}$, we get
$$\mathbb{E}[\|\nabla\mathcal{E}(w_k)\|_2] \ge \epsilon/2 \quad\text{for every } k = t, t+1, \dots, k(t)-1 \text{ and all } t \in \mathcal{T}.$$
It then follows that
$$\mathbb{E}[\|\nabla\mathcal{E}(w_k)\|_2^2] \ge \big(\mathbb{E}[\|\nabla\mathcal{E}(w_k)\|_2]\big)^2 \ge \epsilon^2/4 \tag{16}$$
for every $k = t, t+1, \dots, k(t)-1$ and all $t \in \mathcal{T}$. Putting (16) back into (11), $\mathbb{E}[\mathcal{E}(w_{k(t)})]$ can be upper bounded by
$$\mathbb{E}[\mathcal{E}(w_t)] - \sum_{k=t}^{k(t)-1}\eta_k\mathbb{E}[\|\nabla\mathcal{E}(w_k)\|_2^2] + L^2C_2\sum_{k=t}^{k(t)-1}\eta_k^2 \le \mathbb{E}[\mathcal{E}(w_t)] - \frac{\epsilon^2}{4}\sum_{k=t}^{k(t)-1}\eta_k + L^2C_2\sum_{k=t}^{k(t)-1}\eta_k^2.$$
This together with (15) implies that

Proof of Theorem 4. We first prove Part (a). We introduce $B_t := \mathbb{E}[\mathcal{E}(w_t)] - \mathcal{E}(w^*)$, $\forall t \in \mathbb{N}$. By (10) and Assumption 2,
$$\mathbb{E}[\mathcal{E}(w_{t+1})] \le (1 + L^2\eta_t^{1+\alpha})\mathbb{E}[\mathcal{E}(w_t)] - 2\mu\eta_t\mathbb{E}[\mathcal{E}(w_t) - \mathcal{E}(w^*)] + L^2(1-\alpha)\eta_t^{1+\alpha}.$$
Subtracting $\mathcal{E}(w^*)$ from both sides gives
$$\mathbb{E}[\mathcal{E}(w_{t+1})] - \mathcal{E}(w^*) \le (1 + L^2\eta_t^{1+\alpha})\big(\mathbb{E}[\mathcal{E}(w_t)] - \mathcal{E}(w^*)\big) + L^2\eta_t^{1+\alpha}\mathcal{E}(w^*) - 2\mu\eta_t\big(\mathbb{E}[\mathcal{E}(w_t)] - \mathcal{E}(w^*)\big) + L^2(1-\alpha)\eta_t^{1+\alpha}$$
$$= \big(1 + L^2\eta_t^{1+\alpha} - 2\mu\eta_t\big)\big(\mathbb{E}[\mathcal{E}(w_t)] - \mathcal{E}(w^*)\big) + C_5\eta_t^{1+\alpha},$$
where we introduce $C_5 := L^2\big(\mathcal{E}(w^*) + 1 - \alpha\big)$. The assumption $\sum_{t=1}^{\infty}\eta_t^{1+\alpha} < \infty$ implies $\lim_{t\to\infty}\eta_t = 0$, which further implies the existence of $t_1$ such that $\eta_t^{\alpha} \le \mu/L^2$ and $\eta_t \le 1/\mu$ for all $t \ge t_1$. Therefore, it follows that
$$B_{t+1} \le (1 - \mu\eta_t)B_t + C_5\eta_t^{1+\alpha}, \quad \forall t \ge t_1. \tag{18}$$
A recursive application of this inequality then shows
$$B_{T+1} \le \prod_{t=t_1}^{T}(1 - \mu\eta_t)B_{t_1} + C_5\sum_{t=t_1}^{T}\eta_t^{1+\alpha}\prod_{k=t+1}^{T}(1 - \mu\eta_k), \tag{19}$$
where we denote $\prod_{k=T+1}^{T}(1 - \mu\eta_k) = 1$. The first term of the above inequality can be estimated by the standard inequality $1 - a \le \exp(-a)$ for $a > 0$ together with the assumption $\sum_{t=1}^{\infty}\eta_t = \infty$ as
$$\prod_{t=t_1}^{T}(1 - \mu\eta_t)B_{t_1} \le \exp\Big(-\mu\sum_{t=t_1}^{T}\eta_t\Big)B_{t_1} \xrightarrow{T\to\infty} 0. \tag{20}$$
Multiplying both sides of (23) by $t(t+1)$ gives
$$t(t+1)B_{t+1} \le t(t-1)B_t + C_5(2\mu^{-1})^{1+\alpha}t(t+1)^{-\alpha}, \quad \forall t \ge t_0.$$
Taking a summation from $t = t_0$ to $t = T$ gives
$$T(T+1)B_{T+1} \le t_0(t_0-1)B_{t_0} + C_5(2\mu^{-1})^{1+\alpha}\sum_{t=t_0}^{T}t(t+1)^{-\alpha}.$$
It is clear that
$$\sum_{t=t_0}^{T}t(t+1)^{-\alpha} \le \sum_{t=t_0}^{T}t^{1-\alpha} \le \sum_{t=t_0}^{T}\int_{t}^{t+1}x^{1-\alpha}\,dx \le \frac{(T+1)^{2-\alpha}}{2-\alpha},$$
from which, together with $(T+1)/T \le 1 + t_0^{-1}$ for all $T \ge t_0$, we derive the following inequality for all $T \ge t_0$:
$$B_{T+1} \le \frac{t_0(t_0-1)B_{t_0}}{T(T+1)} + \frac{C_5(2\mu^{-1})^{1+\alpha}(T+1)^{1-\alpha}}{(2-\alpha)T} \le \frac{t_0(t_0-1)B_{t_0}}{T(T+1)} + \frac{(1+t_0^{-1})^{1-\alpha}C_5(2\mu^{-1})^{1+\alpha}}{(2-\alpha)T^{\alpha}}.$$
This gives the stated result with
$$\widetilde{C} = (t_0-1)\big(\mathbb{E}[\mathcal{E}(w_{t_0})] - \mathcal{E}(w^*)\big) + \frac{(1+t_0^{-1})^{1-\alpha}C_5(2\mu^{-1})^{\alpha+1}}{2-\alpha}.$$

We now consider Part (c). Analogous to (7), we derive
$$\mathcal{E}(w_{t+1}) \le \mathcal{E}(w_t) - \eta\langle\nabla f(w_t, z_t), \nabla\mathcal{E}(w_t)\rangle + 2^{-1}L\eta^2\|\nabla f(w_t, z_t)\|_2^2. \tag{24}$$
Since $\mathbb{E}[\|\nabla f(w^*, z)\|_2^2] = 0$, we know $\nabla f(w^*, z) = 0$ almost surely. Therefore, $w^*$ is a minimizer of the function $w \mapsto f(w, z)$ for almost every $z$, and the function $\varphi_z(w) = f(w, z) - f(w^*, z)$ is non-negative almost surely. We can apply Lemma 1 to show $\|\nabla\varphi_z(w)\|_2^2 \le 2L\varphi_z(w)$ almost surely, which is equivalent to $\|\nabla f(w, z)\|_2^2 \le 2L\big(f(w, z) - f(w^*, z)\big)$ almost surely. Plugging this inequality back into (24) gives the following inequality almost surely:
$$\mathcal{E}(w_{t+1}) \le \mathcal{E}(w_t) - \eta\langle\nabla f(w_t, z_t), \nabla\mathcal{E}(w_t)\rangle + L^2\eta^2\big(f(w_t, z_t) - f(w^*, z_t)\big).$$
Taking expectations on both sides then gives
$$\mathbb{E}[\mathcal{E}(w_{t+1})] - \mathcal{E}(w^*) \le \mathbb{E}[\mathcal{E}(w_t)] - \mathcal{E}(w^*) - \eta\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2^2] + L^2\eta^2\mathbb{E}[\mathcal{E}(w_t) - \mathcal{E}(w^*)].$$
It follows from Assumption 2 and $\eta \le \mu/L^2$ that
$$B_{t+1} \le B_t - 2\mu\eta B_t + L^2\eta^2 B_t \le (1 - \mu\eta)B_t.$$
Applying this result recursively gives the stated result.

V. CONCLUSION

We present a solid theoretical analysis of SGD for nonconvex learning by showing that the bounded gradient assumption imposed in the literature can be removed without affecting learning rates. We consider general nonconvex objective functions and objective functions satisfying PL conditions, for each of which we derive optimal convergence rates. Interesting future work includes the extension to distributed learning [21], sparse learning [22] and stochastic composite mirror descent [15].

ACKNOWLEDGMENT

This work is supported partially by the National Key Research and Development Program of China (Grant No. 2017YFC0804003), the National Natural Science Foundation of China (Grant Nos. 11571078, 11671307, 61806091), the Shenzhen Peacock Plan (Grant No. KQTD2016112514355531) and the Science and Technology Innovation Committee Foundation of Shenzhen (Grant No. ZDSYS201703031748284). The corresponding author is Guiying Li.

REFERENCES

[1] H. Karimi, J. Nutini, and M. Schmidt, "Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2016, pp. 795–811.
[2] B. T. Polyak, "Gradient methods for minimizing functionals," Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, vol. 3, no. 4, pp. 643–653, 1963.
[3] S. Ghadimi and G. Lan, "Stochastic first- and zeroth-order methods for nonconvex stochastic programming," SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.
[4] S. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola, "Stochastic variance reduction for nonconvex optimization," in International Conference on Machine Learning, 2016, pp. 314–323.
[5] S. Ghadimi, G. Lan, and H. Zhang, "Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization," Mathematical Programming, vol. 155, no. 1-2, pp. 267–305, 2016.
[6] Y. Ying and D.-X. Zhou, "Unregularized online learning algorithms with general loss functions," Applied and Computational Harmonic Analysis, vol. 42, no. 2, pp. 224–244, 2017.
[7] D. Chang, M. Lin, and C. Zhang, "On the generalization ability of online gradient descent algorithm under the quadratic growth condition," IEEE Transactions on Neural Networks and Learning Systems, no. 99, pp. 1–12, 2018.
[8] D. J. Foster, A. Sekhari, and K. Sridharan, "Uniform convergence of gradients for non-convex learning and optimization," in Advances in Neural Information Processing Systems, 2018, pp. 8759–8770.
[9] T. Zhang, "Solving large scale linear prediction problems using stochastic gradient descent algorithms," in International Conference on Machine Learning, 2004, pp. 919–926.
[10] L. Bottou, F. E. Curtis, and J. Nocedal, "Optimization methods for large-scale machine learning," SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
[11] E. Hazan, A. Agarwal, and S. Kale, "Logarithmic regret algorithms for online convex optimization," Machine Learning, vol. 69, no. 2, pp. 169–192, 2007.
[12] Y. Ying and D.-X. Zhou, "Online regularized classification algorithms," IEEE Transactions on Information Theory, vol. 52, no. 11, pp. 4775–4788, 2006.
[13] J. Lin and D.-X. Zhou, "Online learning algorithms can converge comparably fast as batch learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, pp. 2367–2378, 2018.
[14] T. Hu and D.-X. Zhou, "Online learning with samples drawn from non-identical distributions," Journal of Machine Learning Research, vol. 10, no. Dec, pp. 2873–2898, 2009.
[15] Y. Lei and K. Tang, "Stochastic composite mirror descent: Optimal bounds with high probabilities," in Advances in Neural Information Processing Systems, 2018, pp. 1524–1534.
[16] Y. Lei and D.-X. Zhou, "Convergence of online mirror descent," Applied and Computational Harmonic Analysis, 2018.
[17] D. P. Bertsekas and J. N. Tsitsiklis, "Gradient convergence in gradient methods with errors," SIAM Journal on Optimization, vol. 10, no. 3, pp. 627–642, 2000.
[18] Y. Lei, L. Shi, and Z.-C. Guo, "Convergence of unregularized online learning algorithms," Journal of Machine Learning Research, vol. 18, no. 171, pp. 1–33, 2018.