arXiv:1902.00908v3 [cs.LG] 13 Dec 2019

Abstract—Stochastic gradient descent (SGD) is a popular and efficient method with wide applications in training deep neural nets and other nonconvex models. While the behavior of SGD is well understood in the convex learning setting, the existing theoretical results for SGD applied to nonconvex objective functions are far from mature. For example, existing results require imposing a nontrivial assumption on the uniform boundedness of gradients for all iterates encountered in the learning process, which is hard to verify in practical implementations. In this paper, we establish a rigorous theoretical foundation for SGD in nonconvex learning by showing that this boundedness assumption can be removed without affecting convergence rates. In particular, we establish sufficient conditions for almost sure convergence as well as optimal convergence rates for SGD applied to both general nonconvex objective functions and gradient-dominated objective functions. A linear convergence is further derived in the case with zero variances.

Index Terms—Stochastic Gradient Descent, Nonconvex Optimization, Learning Theory, Polyak-Łojasiewicz Condition

Y. Lei, G. Li and K. Tang are with the Shenzhen Key Laboratory of Computational Intelligence, Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China (e-mail: leiyw@sustc.edu.cn; lgy807720302@gmail.com; tangk3@sustc.edu.cn). T. Hu is with the School of Mathematics and Statistics, Wuhan University, Wuhan 430072, China (e-mail: tinghu@whu.edu.cn). Accepted by IEEE Transactions on Neural Networks and Learning Systems. DOI: 10.1109/TNNLS.2019.2952219

I. INTRODUCTION

Stochastic gradient descent (SGD) is an efficient iterative method suitable for tackling large-scale datasets due to its low computational complexity per iteration and its promising practical behavior, and it has found wide application in solving optimization problems in a variety of areas including machine learning and signal processing. At each iteration, SGD first computes a gradient based on a randomly selected example and then updates the model parameter along the negative gradient direction at the current iterate. This strategy of processing a single training example makes SGD very popular in the big data era, as it enjoys a great computational advantage over its batch counterpart.

Theoretical properties of SGD are well understood for optimizing both convex and strongly convex objectives, the latter of which can be relaxed to other assumptions on objective functions, e.g., error bound conditions and Polyak-Łojasiewicz conditions [1, 2]. In comparison, SGD applied to nonconvex objective functions is much less studied. Indeed, there is a huge gap between the theoretical understanding of SGD and its very promising practical behavior in the nonconvex learning setting, as exemplified by the training of highly nonconvex deep neural networks. For example, while theoretical analysis can only guarantee that SGD may get stuck in local minima, in practice it often converges to special ones with good generalization ability even in the absence of early stopping or explicit regularization.

Motivated by the popularity of SGD in training deep neural networks and nonconvex models, as well as the huge gap between the theoretical understanding and its practical success, theoretical analysis of SGD has received increasing attention recently. The first nonasymptotic convergence rates of nonconvex SGD were established in [3], which were extended to stochastic variance reduction [4] and stochastic proximal gradient descent [5]. However, these results require imposing a nontrivial boundedness assumption on the gradients at all iterates encountered in the learning process, which depends on the realization of the optimization process and is hard to check in practice. It still remains unclear whether this assumption holds when learning takes place in an unbounded domain, in which scenario the existing analysis is not rigorous.

In this paper, we aim to build a sound theoretical foundation for SGD by showing that the same convergence rates can be achieved without any boundedness assumption on gradients in the nonconvex learning setting. We also relax the standard smoothness assumption to a milder Hölder continuity of gradients. As a further step, we consider objective functions satisfying a Polyak-Łojasiewicz (PL) condition, which is widely adopted in the literature on nonconvex optimization. In this case, we derive convergence rates $O(1/t)$ for SGD with $t$ iterations, which also removes the boundedness assumption on gradients imposed in [1] to derive similar convergence rates. We introduce a zero-variance condition which allows us to derive linear convergence of SGD. Sufficient conditions in terms of step sizes are also established for almost sure convergence measured by both function values and gradient norms.

II. PROBLEM FORMULATION AND MAIN RESULTS

Let $\rho$ be a probability measure defined on the sample space $\mathcal{Z} := \mathcal{X} \times \mathcal{Y}$, with $\mathcal{X} \subset \mathbb{R}^d$ being the input space and $\mathcal{Y}$ being the output space. We are interested in building a prediction rule $h : \mathcal{X} \mapsto \mathcal{Y}$ based on a sequence of examples $\{z_t\}_{t\in\mathbb{N}}$ independently drawn from $\rho$. We consider learning in a reproducing kernel Hilbert space (RKHS) $H_K$ associated to a Mercer kernel $K : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$. The RKHS $H_K$ is defined as the completion of the linear span of the function set $\{K_x(\cdot) := K(x, \cdot) : x \in \mathcal{X}\}$ satisfying the reproducing property $w(x) = \langle w, K_x\rangle$ for any $x \in \mathcal{X}$ and $w \in H_K$, where $\langle\cdot,\cdot\rangle$ denotes the inner product. The quality of a prediction rule $h$ at an example $z$ is measured
by $\ell(h(x), y)$, where $\ell : \mathbb{R} \times \mathbb{R} \mapsto \mathbb{R}_+$ is a differentiable loss function, with which we define the objective function as
$$\mathcal{E}(h) = \mathbb{E}_z[\ell(h(x), y)] = \int_{\mathcal{Z}} \ell(h(x), y)\, d\rho. \tag{1}$$
We consider nonconvex loss functions in this paper. We implement the learning process by SGD to minimize the objective function over $H_K$. Let $w_1 = 0$ and let $z_t = (x_t, y_t)$ be the example sampled according to $\rho$ at the $t$-th iteration. We update the model sequence $\{w_t\}_{t\in\mathbb{N}}$ in $H_K$ by
$$w_{t+1} = w_t - \eta_t \nabla\ell\big(\langle w_t, K_{x_t}\rangle, y_t\big) K_{x_t} = w_t - \eta_t \nabla f(w_t, z_t), \tag{2}$$
where $\nabla\ell$ denotes the gradient of $\ell$ with respect to the first argument, $\{\eta_t\}_{t\in\mathbb{N}}$ is a sequence of positive step sizes and we introduce $f(w, z) = \ell(\langle w, K_x\rangle, y)$ for brevity. We denote by $\|\cdot\|_2$ the RKHS norm in $H_K$.
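Since $w_1 = 0$, every iterate of (2) lies in the span of $\{K_{x_1}, \dots, K_{x_{t-1}}\}$, so SGD in $H_K$ can be implemented by storing the coefficients of this kernel expansion. The following is a minimal sketch of such an implementation; the Gaussian kernel, the least-squares loss and the particular step size values are illustrative choices (the analysis here allows any differentiable, possibly nonconvex, loss), and the function names are ours, not from the paper.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    # One concrete choice of Mercer kernel K(x, x').
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def kernel_sgd(examples, loss_grad, step_sizes, kernel=gaussian_kernel):
    """Update (2): w_{t+1} = w_t - eta_t * l'(w_t(x_t), y_t) * K_{x_t}.

    Since w_1 = 0, each iterate has the form w_t = sum_i a_i K_{x_i},
    so it is stored through its expansion centers and coefficients.
    """
    centers, coeffs = [], []
    for (x_t, y_t), eta_t in zip(examples, step_sizes):
        # Reproducing property: w_t(x_t) = <w_t, K_{x_t}> = sum_i a_i K(x_i, x_t).
        pred = sum(a * kernel(c, x_t) for c, a in zip(centers, coeffs))
        g = loss_grad(pred, y_t)       # gradient of l in its first argument
        centers.append(x_t)            # the step adds the term -eta_t * g * K_{x_t}
        coeffs.append(-eta_t * g)
    return centers, coeffs

# Illustration with the least-squares loss l(u, y) = (u - y)^2 / 2, l'(u, y) = u - y,
# and decaying step sizes eta_t = 0.5 * t^(-3/4).
rng = np.random.default_rng(0)
examples = [(rng.normal(size=3), rng.normal()) for _ in range(100)]
etas = [0.5 * t ** -0.75 for t in range(1, 101)]
centers, coeffs = kernel_sgd(examples, lambda u, y: u - y, etas)
```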
Our theoretical analysis is based on a fundamental assumption on the regularity of loss functions. Assumption 1 with $\alpha = 1$ corresponds to a smoothness assumption standard in nonconvex learning, which is extended here to a general Hölder continuity assumption on the gradient of loss functions.

Assumption 1. Let $\alpha \in (0, 1]$ and $L > 0$. We assume that the gradient of $f(\cdot, z)$ is $\alpha$-Hölder continuous in the sense that
$$\|\nabla f(w, z) - \nabla f(\tilde{w}, z)\|_2 \le L\|w - \tilde{w}\|_2^{\alpha}, \quad \forall w, \tilde{w} \in H_K,\ z \in \mathcal{Z}.$$

For any function $\varphi : H_K \mapsto \mathbb{R}$ with Hölder continuous gradients, we have the following lemma, which plays an important role in our analysis. Eq. (4) provides a quantitative measure of the accuracy of approximating $\varphi$ by its first-order approximation, while (5) provides a self-bounding property, meaning that the norm of gradients can be controlled by function values.

Lemma 1. Let $\varphi : H_K \mapsto \mathbb{R}$ be a differentiable function. Let $\alpha \in (0, 1]$ and $L > 0$. If for all $w, \tilde{w} \in H_K$
$$\|\nabla\varphi(w) - \nabla\varphi(\tilde{w})\|_2 \le L\|w - \tilde{w}\|_2^{\alpha}, \tag{3}$$
then we have
$$\varphi(\tilde{w}) \le \varphi(w) + \langle \tilde{w} - w, \nabla\varphi(w)\rangle + \frac{L}{1+\alpha}\|w - \tilde{w}\|_2^{1+\alpha}. \tag{4}$$
Furthermore, if $\varphi(w) \ge 0$ for all $w \in H_K$, then
$$\|\nabla\varphi(w)\|_2^{\frac{1+\alpha}{\alpha}} \le \frac{(1+\alpha)L^{\frac{1}{\alpha}}}{\alpha}\,\varphi(w), \quad \forall w \in H_K. \tag{5}$$

Lemma 1, to be proved in Section IV-A, is an extension of Proposition 1 in [6] from univariate functions to multivariate functions. It should be noted that (5) improves Proposition 1(d) in [6] by removing a factor of $(1+\alpha)^{\frac{1}{\alpha}}$.
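As a numerical sanity check of (4) and (5) in the smooth case $\alpha = 1$, one can take the nonnegative quadratic $\varphi(w) = \frac{1}{2}w^{\top}Qw$ with $Q$ positive semidefinite, for which the gradient is Lipschitz with $L = \|Q\|_2$. The sketch below (a finite-dimensional stand-in for $H_K$, with our own variable names) verifies both inequalities at random points:

```python
import numpy as np

rng = np.random.default_rng(1)
d, alpha = 5, 1.0
A = rng.normal(size=(d, d))
Q = A @ A.T                       # positive semidefinite, so phi(w) >= 0
L = np.linalg.norm(Q, 2)          # spectral norm: Lipschitz constant of the gradient
phi = lambda w: 0.5 * w @ Q @ w
grad = lambda w: Q @ w

for _ in range(1000):
    w, wt = rng.normal(size=d), rng.normal(size=d)
    # Eq. (4): the first-order approximation error is at most L/(1+alpha) * ||w - wt||^(1+alpha).
    rhs4 = phi(w) + (wt - w) @ grad(w) + L / (1 + alpha) * np.linalg.norm(w - wt) ** (1 + alpha)
    assert phi(wt) <= rhs4 + 1e-7
    # Eq. (5): self-bounding property of nonnegative functions with Hoelder gradients.
    lhs5 = np.linalg.norm(grad(w)) ** ((1 + alpha) / alpha)
    assert lhs5 <= (1 + alpha) * L ** (1 / alpha) / alpha * phi(w) + 1e-7
```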
A. General nonconvex objective functions

We now present theoretical results for SGD with general nonconvex loss functions. In this case we measure the progress of SGD in terms of gradients. Part (a) gives a nonasymptotic convergence rate determined by the step sizes, while Parts (b) and (c) provide sufficient conditions for asymptotic convergence measured by function values and gradient norms, respectively.

Theorem 2. Suppose that Assumption 1 holds. Let $\{w_t\}_{t\in\mathbb{N}}$ be produced by (2) with the step sizes satisfying $C_1 := \sum_{t=1}^{\infty}\eta_t^{1+\alpha} < \infty$. Then, the following three statements hold.

(a) There is a constant $C$ independent of $T$ such that
$$\min_{t=1,\dots,T} \mathbb{E}\big[\|\nabla\mathcal{E}(w_t)\|_2^2\big] \le C\Big(\sum_{t=1}^{T}\eta_t\Big)^{-1}. \tag{6}$$
(b) $\{\mathcal{E}(w_t)\}_t$ converges to an almost surely (a.s.) bounded random variable.
(c) If Assumption 1 holds with $\alpha = 1$ and $\sum_{t=1}^{\infty}\eta_t = \infty$, then $\lim_{t\to\infty}\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2] = 0$.
Remark 1. Part (a) was derived in [3] under the boundedness assumption $\mathbb{E}_z\big[\|\nabla f(w_t, z) - \nabla\mathcal{E}(w_t)\|_2^2\big] \le \sigma^2$ for a constant $\sigma > 0$ and all $t \in \mathbb{N}$. This boundedness assumption depends on the realization of the optimization process and is therefore difficult to check in practice. It is removed in our analysis. Although Parts (b) and (c) do not give convergence rates, an appealing property is that they consider individual iterates. As a comparison, the convergence rates in (6) only hold for the minimum over the first $T$ iterates. The analysis for individual iterates is much more challenging than that for the minimum over all iterates. Indeed, Part (c) is based on a careful analysis with a contradiction strategy.

We can derive explicit convergence rates by instantiating the step sizes in Theorem 2. If $\alpha = 1$, the convergence rate in Part (b) of Corollary 3 becomes $O(T^{-\frac{1}{2}}\log^{\frac{\beta}{2}} T)$, which is minimax optimal up to a logarithmic factor.

Corollary 3. Suppose that Assumption 1 holds. Let $\{w_t\}_{t\in\mathbb{N}}$ be the sequence produced by (2). Then,

(a) If $\eta_t = \eta_1 t^{-\theta}$ with $\theta \in (1/(1+\alpha), 1)$, then $\min_{t=1,\dots,T}\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2^2] = O(T^{\theta-1})$.
(b) If $\eta_t = \eta_1 \big(t\log^{\beta}(t+1)\big)^{-\frac{1}{1+\alpha}}$ with $\beta > 1$, then $\min_{t=1,\dots,T}\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2^2] = O\big(T^{-\frac{\alpha}{\alpha+1}}\log^{\frac{\beta}{1+\alpha}} T\big)$.
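Both step size schedules in Corollary 3 are straightforward to implement. A small sketch, with illustrative values of $\eta_1$, $\theta$ and $\beta$ (the corollary only constrains their ranges):

```python
import numpy as np

def polynomial_steps(T, eta1=0.1, theta=0.75):
    # Corollary 3(a): eta_t = eta1 * t^(-theta) with theta in (1/(1+alpha), 1).
    t = np.arange(1, T + 1)
    return eta1 * t ** -theta

def log_corrected_steps(T, eta1=0.1, alpha=1.0, beta=1.1):
    # Corollary 3(b): eta_t = eta1 * (t * log^beta(t+1))^(-1/(1+alpha)) with beta > 1.
    t = np.arange(1, T + 1)
    return eta1 * (t * np.log(t + 1) ** beta) ** (-1.0 / (1 + alpha))

# Both schedules satisfy the conditions of Theorem 2: sum eta_t^(1+alpha) stays finite
# while sum eta_t diverges (so Part (c) applies as well when alpha = 1).
etas = log_corrected_steps(10 ** 5)
print(etas.sum(), (etas ** 2).sum())  # the first sum keeps growing, the second stays bounded
```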
B. Objective functions with Polyak-Łojasiewicz inequality

We now proceed with our convergence analysis by imposing an assumption referred to as the PL inequality, named after Polyak and Łojasiewicz [2]. Intuitively, this inequality means that the suboptimality of iterates measured by function values can be bounded by gradient norms. The PL condition is also referred to as the gradient dominance condition in the literature [4], and is widely adopted in the analysis of both convex and nonconvex optimization [1, 7, 8]. Examples of functions satisfying the PL condition include neural networks with one hidden layer, ResNets with linear activations and objective functions in matrix factorization [8]. It should be noted that functions satisfying the PL condition are not necessarily convex.

Assumption 2. We assume that the function $\mathcal{E}$ satisfies the PL inequality with parameter $\mu > 0$, i.e.,
$$\mathcal{E}(w) - \mathcal{E}(w^*) \le (2\mu)^{-1}\|\nabla\mathcal{E}(w)\|_2^2, \quad \forall w \in H_K,$$
where $w^* = \arg\min_{w\in H_K}\mathcal{E}(w)$.
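A standard example in which Assumption 2 can be verified directly (noted, e.g., in [1]) is least squares: for $\mathcal{E}(w) = \frac{1}{2n}\|Xw - y\|_2^2$ with $X$ of full column rank, the PL inequality holds with $\mu$ equal to the smallest eigenvalue of the Hessian $X^{\top}X/n$. This instance happens to be convex; it is used here only to make the inequality concrete. A minimal numerical check under these assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

E = lambda w: 0.5 / n * np.sum((X @ w - y) ** 2)
grad = lambda w: X.T @ (X @ w - y) / n

w_star = np.linalg.lstsq(X, y, rcond=None)[0]   # minimizer of E
mu = np.linalg.eigvalsh(X.T @ X / n)[0]         # smallest Hessian eigenvalue (PL constant)

for _ in range(1000):
    w = rng.normal(size=d, scale=10.0)
    # Assumption 2: E(w) - E(w*) <= (2 mu)^{-1} * ||grad E(w)||_2^2.
    assert E(w) - E(w_star) <= np.linalg.norm(grad(w)) ** 2 / (2 * mu) + 1e-6
```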
Under Assumption 2, we can state convergence results measured by the suboptimality of function values. Part (a) provides a sufficient condition for almost sure convergence measured by function values and gradient norms, while Part (b) establishes explicit convergence rates for step sizes reciprocal to the iteration number. If $\alpha = 1$, we derive convergence rates $O(t^{-1})$ after $t$ iterations, which is minimax optimal even when the objective function is strongly convex. Part (c) shows that a linear convergence can be achieved if $\mathbb{E}[\|\nabla f(w^*, z)\|_2^2] = 0$, which extends the linear convergence of gradient descent [1] to the stochastic setting. The assumption $\mathbb{E}[\|\nabla f(w^*, z)\|_2^2] = 0$ means that the variance of the stochastic gradient vanishes at $w = w^*$, since $\mathrm{Var}(\nabla f(w^*, z)) = \mathbb{E}\big[\|\nabla f(w^*, z) - \nabla\mathcal{E}(w^*)\|_2^2\big] = 0$.

Theorem 4. Let Assumptions 1 and 2 hold. Let $\{w_t\}_{t\in\mathbb{N}}$ be produced by (2). Then the following statements hold.

(a) If $\sum_{t=1}^{\infty}\eta_t^{1+\alpha} < \infty$ and $\sum_{t=1}^{\infty}\eta_t = \infty$, then a.s. $\lim_{t\to\infty}\mathcal{E}(w_t) = \mathcal{E}(w^*)$ and $\lim_{t\to\infty}\|\nabla\mathcal{E}(w_t)\|_2 = 0$.
(b) If $\eta_t = 2/((t+1)\mu)$, then for any $t \ge t_0 := 2L^{\frac{2}{\alpha}}\mu^{-\frac{1+\alpha}{\alpha}}$ we have $\mathbb{E}[\mathcal{E}(w_{t+1})] - \mathcal{E}(w^*) \le \widetilde{C}t^{-\alpha}$, where $\widetilde{C}$ is a constant independent of $t$ (explicitly given in the proof).
(c) If $\mathbb{E}[\|\nabla f(w^*, z)\|_2^2] = 0$, Assumption 1 holds with $\alpha = 1$ and $\eta_t = \eta \le \mu/L^2$, then
$$\mathbb{E}[\mathcal{E}(w_{t+1})] - \mathcal{E}(w^*) \le (1 - \mu\eta)^t\big(\mathcal{E}(w_1) - \mathcal{E}(w^*)\big).$$

Remark 2. Conditions of the form $\sum_{t=1}^{\infty}\eta_t^2 < \infty$ and $\sum_{t=1}^{\infty}\eta_t = \infty$ are established for almost sure convergence with strongly convex objectives, which are extended here to nonconvex learning under PL conditions. Convergence rates $O(t^{-1})$ were established for nonconvex optimization under PL conditions, a bounded gradient assumption of the form $\mathbb{E}[\|\nabla f(w_t, z)\|_2^2] \le \sigma^2$ and smoothness assumptions [1]. We derive the same convergence rates without the bounded gradient assumption and relax the smoothness assumption to Hölder continuity of $\nabla f(w, z)$.
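The zero-variance condition of Theorem 4(c) holds, for instance, in a realizable (interpolation) regime where a single $w^*$ fits every example exactly, so that $\nabla f(w^*, z) = 0$ almost surely. A small simulation under such a hypothetical setup, finite-sum least squares with noiseless labels, illustrates the geometric decay of the optimality gap; the theorem bounds the gap in expectation, and a single run tracks it only roughly:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star                             # noiseless labels: grad f(w*, z) = 0 for every z

E = lambda w: 0.5 / n * np.sum((X @ w - y) ** 2)   # here E(w*) = 0
L = np.max(np.sum(X ** 2, axis=1))         # per-example smoothness constant (alpha = 1)
mu = np.linalg.eigvalsh(X.T @ X / n)[0]    # PL constant of E
eta = mu / L ** 2                          # constant step size allowed by Theorem 4(c)

w = np.zeros(d)
for t in range(2000):
    i = rng.integers(n)                    # sample one example
    w -= eta * (X[i] @ w - y[i]) * X[i]    # SGD step on f(w, z_i) = (<w, x_i> - y_i)^2 / 2
# Compare the final gap with the geometric bound (1 - mu * eta)^t * (E(w_1) - E(w*)).
print(E(w), (1 - mu * eta) ** 2000 * E(np.zeros(d)))
```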
III. RELATED WORK AND DISCUSSIONS

SGD has been comprehensively studied in the literature, mainly in the convex setting. For generally convex objective functions, regret bounds $O(\sqrt{T})$ were established for SGD with $T$ iterates [9], which directly imply convergence rates $O(1/\sqrt{T})$ [10]. For strongly convex objective functions, regret bounds can be improved to $O(\log T)$ [11], which imply convergence rates $O(\log T/T)$. These results were extended to online learning in RKHSs [12-14] and to learning with a mirror map to capture the geometry of problems [15, 16].

As compared to the maturity of understanding in convex optimization, the convergence analysis of SGD in the nonconvex setting is far from satisfactory. Asymptotic convergence of SGD was established under the assumption $\mathbb{E}_z\big[\|\nabla f(w_t, z) - \nabla\mathcal{E}(w_t)\|_2^2\big] \le A\big(1 + \|\nabla\mathcal{E}(w_t)\|_2^2\big)$ for $A > 0$ and all $t \in \mathbb{N}$ [17]. Nonasymptotic convergence rates similar to (6) were established in [3] under the boundedness assumption $\mathbb{E}[\|\nabla f(w_t, z_t)\|_2^2] \le \sigma^2$ for all $t \in \mathbb{N}$. For objective functions satisfying PL conditions, convergence rates $O(1/T)$ were established for SGD under the boundedness assumption $\mathbb{E}[\|\nabla f(w_t, z_t)\|_2^2] \le \sigma^2$ for all $t \in \mathbb{N}$ [1]. This boundedness assumption in the literature depends on the realization of the optimization process, which is hard to check in practical implementations. In this paper we show that the same convergence rates can be established without any boundedness assumptions. This establishes a rigorous foundation to safeguard SGD. Existing discussions also require imposing an assumption on the smoothness of $f(w, z)$, which is relaxed here to a Hölder continuity of $\nabla f(w, z)$. Both the PL condition and the Hölder continuity condition do not depend on the iterates and can be checked from the objective functions themselves; they are standard in the literature and satisfied by many nonconvex models [1, 4, 8]. It should be noted that convergence analysis was also performed when $f(w, z)$ is convex [18] and nonconvex [19] without bounded gradient assumptions, both of which, however, require $\mathcal{E}(w)$ to be strongly convex and $f(w, z)$ to be smooth. Furthermore, we establish a linear convergence of SGD in the case with zero variances, while this linear convergence was previously derived only for batch gradient descent applied to gradient-dominated objective functions [1]. Necessary and sufficient conditions of the form $\sum_{t=1}^{\infty}\eta_t = \infty$, $\sum_{t=1}^{\infty}\eta_t^2 < \infty$ were established for the convergence of online mirror descent in a strongly convex setting [18], which are partially extended here to the convergence of SGD for gradient-dominated objective functions, measured by both function values and gradient norms.
IV. PROOFS

A. Proof of Theorem 2

In this section, we present the proofs of Theorem 2 and Corollary 3 on the convergence of SGD applied to general nonconvex loss functions. To this aim, we first prove Lemma 1 and introduce Doob's forward convergence theorem on almost sure convergence (see, e.g., [20] on page 195).

Proof of Lemma 1. Eq. (4) can be proved in the same way as Part (a) of Proposition 1 in [6]. We now prove (5) for non-negative $\varphi$. We only need to consider the case $\nabla\varphi(w) \ne 0$. In this case, set
$$\tilde{w} = w - L^{-\frac{1}{\alpha}}\|\nabla\varphi(w)\|_2^{\frac{1}{\alpha}}\,\|\nabla\varphi(w)\|_2^{-1}\nabla\varphi(w)$$
in (4). We derive
$$0 \le \varphi(\tilde{w}) \le \varphi(w) - L^{-\frac{1}{\alpha}}\|\nabla\varphi(w)\|_2^{\frac{1}{\alpha}}\Big\langle \frac{\nabla\varphi(w)}{\|\nabla\varphi(w)\|_2}, \nabla\varphi(w)\Big\rangle + \frac{L}{1+\alpha}L^{-\frac{1+\alpha}{\alpha}}\|\nabla\varphi(w)\|_2^{\frac{1+\alpha}{\alpha}}$$
$$= \varphi(w) - L^{-\frac{1}{\alpha}}\|\nabla\varphi(w)\|_2^{\frac{1+\alpha}{\alpha}} + L^{-\frac{1}{\alpha}}(1+\alpha)^{-1}\|\nabla\varphi(w)\|_2^{\frac{1+\alpha}{\alpha}} = \varphi(w) - \frac{\alpha L^{-\frac{1}{\alpha}}}{1+\alpha}\|\nabla\varphi(w)\|_2^{\frac{1+\alpha}{\alpha}},$$
from which the stated bound (5) follows.

Lemma 5. Let $\{\widetilde{X}_t\}_{t\in\mathbb{N}}$ be a sequence of non-negative random variables with $\mathbb{E}[\widetilde{X}_1] < \infty$ and let $\{\mathcal{F}_t\}_{t\in\mathbb{N}}$ be a nested sequence of sets of random variables with $\mathcal{F}_t \subset \mathcal{F}_{t+1}$ for all $t \in \mathbb{N}$. If $\mathbb{E}[\widetilde{X}_{t+1}\,|\,\mathcal{F}_t] \le \widetilde{X}_t$ for all $t \in \mathbb{N}$, then $\widetilde{X}_t$ converges to a non-negative random variable $\widetilde{X}$ a.s. and $\widetilde{X} < \infty$ a.s.

Proof of Theorem 2. We first prove Part (a). According to Assumption 1, we know
$$\|\nabla\mathcal{E}(w) - \nabla\mathcal{E}(\tilde{w})\|_2 = \big\|\mathbb{E}[\nabla f(w, z)] - \mathbb{E}[\nabla f(\tilde{w}, z)]\big\|_2 \le \mathbb{E}\big[\|\nabla f(w, z) - \nabla f(\tilde{w}, z)\|_2\big] \le L\|w - \tilde{w}\|_2^{\alpha}.$$
Therefore, $\nabla\mathcal{E}(w)$ is $\alpha$-Hölder continuous. According to (4) with $\varphi = \mathcal{E}$ and (2), we know
$$\mathcal{E}(w_{t+1}) \le \mathcal{E}(w_t) + \langle w_{t+1} - w_t, \nabla\mathcal{E}(w_t)\rangle + \frac{L}{1+\alpha}\|w_{t+1} - w_t\|_2^{1+\alpha}$$
$$= \mathcal{E}(w_t) - \eta_t\langle\nabla f(w_t, z_t), \nabla\mathcal{E}(w_t)\rangle + \frac{L\eta_t^{1+\alpha}}{1+\alpha}\|\nabla f(w_t, z_t)\|_2^{1+\alpha}$$
$$\le \mathcal{E}(w_t) - \eta_t\langle\nabla f(w_t, z_t), \nabla\mathcal{E}(w_t)\rangle + \frac{L^2\eta_t^{1+\alpha}}{1+\alpha}\Big(\frac{(1+\alpha)f(w_t, z_t)}{\alpha}\Big)^{\alpha}, \tag{7}$$
where the last inequality is due to (5). With Young's inequality, valid for all $u, v \in \mathbb{R}$ and $p, q > 1$ with $p^{-1} + q^{-1} = 1$,
$$uv \le p^{-1}|u|^p + q^{-1}|v|^q, \tag{8}$$
we get $\Big(\frac{(1+\alpha)f(w_t, z_t)}{\alpha}\Big)^{\alpha} \le \alpha\cdot\frac{(1+\alpha)f(w_t, z_t)}{\alpha} + 1 - \alpha$.
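To make this application of (8) explicit: for $\alpha < 1$ one may take $u = \big((1+\alpha)f(w_t, z_t)/\alpha\big)^{\alpha}$, $v = 1$, $p = 1/\alpha$ and $q = 1/(1-\alpha)$, which gives
$$\Big(\frac{(1+\alpha)f(w_t, z_t)}{\alpha}\Big)^{\alpha} \cdot 1 \;\le\; \alpha\Big(\frac{(1+\alpha)f(w_t, z_t)}{\alpha}\Big) + (1-\alpha) \;=\; (1+\alpha)f(w_t, z_t) + 1 - \alpha;$$
for $\alpha = 1$ the bound holds trivially with equality.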
Plugging this bound into (7) shows
$$\mathcal{E}(w_{t+1}) \le \mathcal{E}(w_t) - \eta_t\langle\nabla f(w_t, z_t), \nabla\mathcal{E}(w_t)\rangle + \frac{L^2\eta_t^{1+\alpha}}{1+\alpha}\big((1+\alpha)f(w_t, z_t) + 1 - \alpha\big).$$
Taking the conditional expectation with respect to $z_t$, we derive
$$\mathbb{E}_{z_t}[\mathcal{E}(w_{t+1})] \le \mathcal{E}(w_t) - \eta_t\|\nabla\mathcal{E}(w_t)\|_2^2 + L^2\eta_t^{1+\alpha}\big(\mathcal{E}(w_t) + 1 - \alpha\big) \tag{9}$$
$$\le (1 + L^2\eta_t^{1+\alpha})\mathcal{E}(w_t) - \eta_t\|\nabla\mathcal{E}(w_t)\|_2^2 + L^2(1-\alpha)\eta_t^{1+\alpha}. \tag{10}$$
It then follows that
$$\mathbb{E}[\mathcal{E}(w_{t+1})] \le (1 + L^2\eta_t^{1+\alpha})\mathbb{E}[\mathcal{E}(w_t)] + L^2(1-\alpha)\eta_t^{1+\alpha},$$
from which we derive
$$\mathbb{E}[\mathcal{E}(w_{t+1})] + L^2(1-\alpha)\sum_{k=t+1}^{\infty}\eta_k^{1+\alpha} \le (1 + L^2\eta_t^{1+\alpha})\Big(\mathbb{E}[\mathcal{E}(w_t)] + L^2(1-\alpha)\sum_{k=t}^{\infty}\eta_k^{1+\alpha}\Big).$$
Introduce $A_t = \mathbb{E}[\mathcal{E}(w_t)] + L^2(1-\alpha)\sum_{k=t}^{\infty}\eta_k^{1+\alpha}$, $\forall t \in \mathbb{N}$. Then, it follows from the inequality $1 + a \le \exp(a)$ that $A_{t+1} \le (1 + L^2\eta_t^{1+\alpha})A_t \le \exp(L^2\eta_t^{1+\alpha})A_t$. An application of the above inequality recursively then gives
$$A_{t+1} \le \exp\Big(L^2\sum_{k=1}^{t}\eta_k^{1+\alpha}\Big)A_1 \le \exp\Big(L^2\sum_{k=1}^{\infty}\eta_k^{1+\alpha}\Big)A_1 := C_2,$$
from which we know $\mathbb{E}[\mathcal{E}(w_t)] \le C_2$, $\forall t \in \mathbb{N}$. Plugging the above inequality back into (10) gives
$$\mathbb{E}[\mathcal{E}(w_{t+1})] \le \mathbb{E}[\mathcal{E}(w_t)] - \eta_t\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2^2] + L^2\eta_t^{1+\alpha}(C_2 + 1 - \alpha). \tag{11}$$
A summation of the above inequality then implies
$$\sum_{t=1}^{T}\eta_t\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2^2] \le \sum_{t=1}^{T}\big(\mathbb{E}[\mathcal{E}(w_t)] - \mathbb{E}[\mathcal{E}(w_{t+1})]\big) + L^2(C_2 + 1 - \alpha)\sum_{t=1}^{T}\eta_t^{1+\alpha} \le \mathcal{E}(w_1) + L^2(C_2 + 1 - \alpha)C_1,$$
from which we directly get (6) with $C := \mathcal{E}(w_1) + L^2C_1(C_2 + 1 - \alpha)$. This proves Part (a).

We now prove Part (b). Multiplying both sides of (10) by $\prod_{k=t+1}^{\infty}(1 + L^2\eta_k^{1+\alpha})$, the term $\prod_{k=t+1}^{\infty}(1 + L^2\eta_k^{1+\alpha})\,\mathbb{E}_{z_t}[\mathcal{E}(w_{t+1})]$ can be upper bounded by
$$\prod_{k=t}^{\infty}(1 + L^2\eta_k^{1+\alpha})\mathcal{E}(w_t) + L^2(1-\alpha)\prod_{k=t+1}^{\infty}(1 + L^2\eta_k^{1+\alpha})\eta_t^{1+\alpha} \le \prod_{k=t}^{\infty}(1 + L^2\eta_k^{1+\alpha})\mathcal{E}(w_t) + C_3\eta_t^{1+\alpha}, \tag{12}$$
where we introduce $C_3 = L^2(1-\alpha)\prod_{k=1}^{\infty}(1 + L^2\eta_k^{1+\alpha}) < \infty$. Introduce the stochastic process
$$\widetilde{X}_t = \prod_{k=t}^{\infty}(1 + L^2\eta_k^{1+\alpha})\mathcal{E}(w_t) + C_3\sum_{k=t}^{\infty}\eta_k^{1+\alpha}.$$
Eq. (12) amounts to saying $\mathbb{E}_{z_t}[\widetilde{X}_{t+1}] \le \widetilde{X}_t$ for all $t \in \mathbb{N}$, which shows that $\{\widetilde{X}_t\}_{t\in\mathbb{N}}$ is a non-negative supermartingale. Furthermore, the assumption $\sum_{t=1}^{\infty}\eta_t^{1+\alpha} < \infty$ implies that $\mathbb{E}[\widetilde{X}_1] < \infty$. We can apply Lemma 5 to show that $\lim_{t\to\infty}\widetilde{X}_t = \widetilde{X}$ for a non-negative random variable $\widetilde{X}$ a.s. This together with the assumption $\sum_{t=1}^{\infty}\eta_t^{1+\alpha} < \infty$ implies $\lim_{t\to\infty}\widetilde{Y}_t = \widetilde{Y}$ for a non-negative random variable $\widetilde{Y}$, where $\widetilde{Y}_t = \prod_{k=t}^{\infty}(1 + L^2\eta_k^{1+\alpha})\mathcal{E}(w_t)$ for all $t \in \mathbb{N}$ and $\widetilde{Y} < \infty$ a.s. Furthermore, it is clear a.s. that
$$\big|\mathcal{E}(w_t) - \widetilde{Y}\big| \le \Big|1 - \prod_{k=t}^{\infty}(1 + L^2\eta_k^{1+\alpha})\Big|\,\mathcal{E}(w_t) + \Big|\prod_{k=t}^{\infty}(1 + L^2\eta_k^{1+\alpha})\mathcal{E}(w_t) - \widetilde{Y}\Big| \xrightarrow{t\to\infty} 0,$$
where we have used the fact $\lim_{t\to\infty}\prod_{k=t}^{\infty}(1 + L^2\eta_k^{1+\alpha}) = 1$ due to $\sum_{t=1}^{\infty}\eta_t^{1+\alpha} < \infty$. That is, $\mathcal{E}(w_t)$ converges to $\widetilde{Y}$ a.s.

We now prove Part (c) by contradiction. According to Assumption 1 and Lemma 1, we know
$$\|\nabla f(w_k, z_k)\|_2 \le \Big(\frac{(1+\alpha)L^{\frac{1}{\alpha}}f(w_k, z_k)}{\alpha}\Big)^{\frac{\alpha}{1+\alpha}} \le L^{\frac{1}{\alpha}}f(w_k, z_k) + (1+\alpha)^{-1},$$
where we have used Young's inequality (8). Taking expectations on both sides and using $\mathbb{E}[\mathcal{E}(w_k)] \le C_2$, we derive
$$\mathbb{E}[\|\nabla f(w_k, z_k)\|_2] \le L^{\frac{1}{\alpha}}\mathbb{E}[\mathcal{E}(w_k)] + (1+\alpha)^{-1} \le L^{\frac{1}{\alpha}}C_2 + (1+\alpha)^{-1} := C_4. \tag{13}$$
Suppose to the contrary that $\limsup_{t\to\infty}\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2] > 0$. By Part (a) and the assumption $\sum_{t=1}^{\infty}\eta_t = \infty$, we know
$$\liminf_{t\to\infty}\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2] \le \liminf_{t\to\infty}\sqrt{\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2^2]} = 0.$$
Then there exists an $\epsilon > 0$ such that $\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2] < \epsilon$ for infinitely many $t$ and $\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2] > 2\epsilon$ for infinitely many $t$. Let $\mathcal{T}$ be a subset of integers such that for every $t \in \mathcal{T}$ we can find an integer $k(t) > t$ such that
$$\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2] < \epsilon, \quad \mathbb{E}[\|\nabla\mathcal{E}(w_{k(t)})\|_2] > 2\epsilon \quad\text{and}\quad \epsilon \le \mathbb{E}[\|\nabla\mathcal{E}(w_k)\|_2] \le 2\epsilon \ \text{ for all } t < k < k(t). \tag{14}$$
Furthermore, we can assert that $\eta_t \le \epsilon/(2LC_4)$ for every $t$ larger than the smallest integer in $\mathcal{T}$, since $\lim_{t\to\infty}\eta_t = 0$.
By (13), (14) and Assumption 1 with $\alpha = 1$, we know
$$\epsilon \le \mathbb{E}[\|\nabla\mathcal{E}(w_{k(t)})\|_2] - \mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2] \le \mathbb{E}[\|\nabla\mathcal{E}(w_{k(t)}) - \nabla\mathcal{E}(w_t)\|_2] \le \sum_{k=t}^{k(t)-1}\mathbb{E}[\|\nabla\mathcal{E}(w_{k+1}) - \nabla\mathcal{E}(w_k)\|_2]$$
$$\le L\sum_{k=t}^{k(t)-1}\mathbb{E}[\|w_{k+1} - w_k\|_2] = L\sum_{k=t}^{k(t)-1}\eta_k\mathbb{E}[\|\nabla f(w_k, z_k)\|_2] \le LC_4\sum_{k=t}^{k(t)-1}\eta_k. \tag{15}$$
Analogously, one can show
$$\mathbb{E}[\|\nabla\mathcal{E}(w_{t+1})\|_2] - \mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2] \le \mathbb{E}[\|\nabla\mathcal{E}(w_{t+1}) - \nabla\mathcal{E}(w_t)\|_2] \le L\mathbb{E}[\|w_{t+1} - w_t\|_2] \le L\eta_t\mathbb{E}[\|\nabla f(w_t, z_t)\|_2] \le LC_4\eta_t,$$
from which, together with (14) and $\eta_t \le \epsilon/(2LC_4)$ for any $t$ larger than the smallest integer in $\mathcal{T}$, we get
$$\mathbb{E}[\|\nabla\mathcal{E}(w_k)\|_2] \ge \epsilon/2 \quad\text{for every } k = t, t+1, \dots, k(t)-1 \text{ and all } t \in \mathcal{T}.$$
It then follows that
$$\mathbb{E}[\|\nabla\mathcal{E}(w_k)\|_2^2] \ge \big(\mathbb{E}[\|\nabla\mathcal{E}(w_k)\|_2]\big)^2 \ge \epsilon^2/4 \tag{16}$$
for every $k = t, t+1, \dots, k(t)-1$ and all $t \in \mathcal{T}$. Putting (16) back into (11), $\mathbb{E}[\mathcal{E}(w_{k(t)})]$ can be upper bounded by
$$\mathbb{E}[\mathcal{E}(w_t)] - \sum_{k=t}^{k(t)-1}\eta_k\mathbb{E}[\|\nabla\mathcal{E}(w_k)\|_2^2] + L^2C_2\sum_{k=t}^{k(t)-1}\eta_k^2 \le \mathbb{E}[\mathcal{E}(w_t)] - \frac{\epsilon^2}{4}\sum_{k=t}^{k(t)-1}\eta_k + L^2C_2\sum_{k=t}^{k(t)-1}\eta_k^2.$$
This together with (15) implies that

Proof of Theorem 4. We first prove Part (a). We introduce $B_t := \mathbb{E}[\mathcal{E}(w_t)] - \mathcal{E}(w^*)$, $\forall t \in \mathbb{N}$. By (10) and Assumption 2,
$$\mathbb{E}[\mathcal{E}(w_{t+1})] \le (1 + L^2\eta_t^{1+\alpha})\mathbb{E}[\mathcal{E}(w_t)] - 2\mu\eta_t\mathbb{E}[\mathcal{E}(w_t) - \mathcal{E}(w^*)] + L^2(1-\alpha)\eta_t^{1+\alpha}.$$
Subtracting $\mathcal{E}(w^*)$ from both sides gives
$$\mathbb{E}[\mathcal{E}(w_{t+1})] - \mathcal{E}(w^*) \le (1 + L^2\eta_t^{1+\alpha})\big(\mathbb{E}[\mathcal{E}(w_t)] - \mathcal{E}(w^*)\big) + L^2\eta_t^{1+\alpha}\mathcal{E}(w^*) - 2\mu\eta_t\big(\mathbb{E}[\mathcal{E}(w_t)] - \mathcal{E}(w^*)\big) + L^2(1-\alpha)\eta_t^{1+\alpha}$$
$$= \big(1 + L^2\eta_t^{1+\alpha} - 2\mu\eta_t\big)\big(\mathbb{E}[\mathcal{E}(w_t)] - \mathcal{E}(w^*)\big) + C_5\eta_t^{1+\alpha},$$
where we introduce $C_5 := L^2\big(\mathcal{E}(w^*) + 1 - \alpha\big)$. The assumption $\sum_{t=1}^{\infty}\eta_t^{1+\alpha} < \infty$ implies $\lim_{t\to\infty}\eta_t = 0$, which further implies the existence of $t_1$ such that $\eta_t^{\alpha} \le \mu/L^2$ and $\eta_t \le 1/\mu$ for all $t \ge t_1$. Therefore, it follows that
$$B_{t+1} \le (1 - \mu\eta_t)B_t + C_5\eta_t^{1+\alpha}, \quad \forall t \ge t_1. \tag{18}$$
A recursive application of this inequality then shows
$$B_{T+1} \le \prod_{t=t_1}^{T}(1 - \mu\eta_t)B_{t_1} + C_5\sum_{t=t_1}^{T}\eta_t^{1+\alpha}\prod_{k=t+1}^{T}(1 - \mu\eta_k), \tag{19}$$
where we denote $\prod_{k=T+1}^{T}(1 - \mu\eta_k) = 1$. The first term of the above inequality can be estimated by the standard inequality $1 - a \le \exp(-a)$ for $a > 0$ together with the assumption $\sum_{t=1}^{\infty}\eta_t = \infty$ as
$$\prod_{t=t_1}^{T}(1 - \mu\eta_t)B_{t_1} \le \exp\Big(-\mu\sum_{t=t_1}^{T}\eta_t\Big)B_{t_1} \xrightarrow{T\to\infty} 0. \tag{20}$$
Multiplying both sides of (23) by $t(t+1)$ gives
$$t(t+1)B_{t+1} \le t(t-1)B_t + C_5(2\mu^{-1})^{1+\alpha}t(t+1)^{-\alpha}, \quad \forall t \ge t_0.$$
Taking a summation from $t = t_0$ to $t = T$ gives
$$T(T+1)B_{T+1} \le t_0(t_0-1)B_{t_0} + C_5(2\mu^{-1})^{1+\alpha}\sum_{t=t_0}^{T}t(t+1)^{-\alpha}.$$
It is clear that
$$\sum_{t=t_0}^{T}t(t+1)^{-\alpha} \le \sum_{t=t_0}^{T}t^{1-\alpha} \le \sum_{t=t_0}^{T}\int_{t}^{t+1}x^{1-\alpha}\,dx \le \frac{(T+1)^{2-\alpha}}{2-\alpha},$$
from which, together with $(T+1)/T \le 1 + t_0^{-1}$ for all $T \ge t_0$, we derive the following inequality for all $T \ge t_0$:
$$B_{T+1} \le \frac{t_0(t_0-1)B_{t_0}}{T(T+1)} + \frac{C_5(2\mu^{-1})^{1+\alpha}(T+1)^{1-\alpha}}{(2-\alpha)T} \le \frac{t_0(t_0-1)B_{t_0}}{T(T+1)} + \frac{(1+t_0^{-1})^{1-\alpha}C_5(2\mu^{-1})^{1+\alpha}}{(2-\alpha)T^{\alpha}}.$$
This gives the stated result with
$$\widetilde{C} = (t_0-1)\big(\mathbb{E}[\mathcal{E}(w_{t_0})] - \mathcal{E}(w^*)\big) + \frac{(1+t_0^{-1})^{1-\alpha}C_5(2\mu^{-1})^{\alpha+1}}{2-\alpha}.$$

We now consider Part (c). Analogous to (7), we derive
$$\mathcal{E}(w_{t+1}) \le \mathcal{E}(w_t) - \eta\langle\nabla f(w_t, z_t), \nabla\mathcal{E}(w_t)\rangle + 2^{-1}L\eta^2\|\nabla f(w_t, z_t)\|_2^2. \tag{24}$$
Since $\mathbb{E}[\|\nabla f(w^*, z)\|_2^2] = 0$, we know $\nabla f(w^*, z) = 0$ almost surely. Therefore, $w^*$ is a minimizer of the function $w \mapsto f(w, z)$ for almost every $z$, and the function $\varphi_z(w) = f(w, z) - f(w^*, z)$ is non-negative almost surely. We can apply Lemma 1 to show $\|\nabla\varphi_z(w)\|_2^2 \le 2L\varphi_z(w)$ almost surely, which is equivalent to $\|\nabla f(w, z)\|_2^2 \le 2L\big(f(w, z) - f(w^*, z)\big)$ almost surely. Plugging this inequality back into (24) gives the following inequality almost surely:
$$\mathcal{E}(w_{t+1}) \le \mathcal{E}(w_t) - \eta\langle\nabla f(w_t, z_t), \nabla\mathcal{E}(w_t)\rangle + L^2\eta^2\big(f(w_t, z_t) - f(w^*, z_t)\big).$$
Taking expectations on both sides then gives
$$\mathbb{E}[\mathcal{E}(w_{t+1})] - \mathcal{E}(w^*) \le \mathbb{E}[\mathcal{E}(w_t)] - \mathcal{E}(w^*) - \eta\mathbb{E}[\|\nabla\mathcal{E}(w_t)\|_2^2] + L^2\eta^2\mathbb{E}[\mathcal{E}(w_t) - \mathcal{E}(w^*)].$$
It follows from Assumption 2 and $\eta \le \mu/L^2$ that
$$B_{t+1} \le B_t - 2\mu\eta B_t + L^2\eta^2 B_t \le (1 - \mu\eta)B_t.$$
Applying this result recursively gives the stated result.

V. CONCLUSION

We present a solid theoretical analysis of SGD for nonconvex learning by showing that the bounded gradient assumption imposed in the literature can be removed without affecting learning rates. We consider general nonconvex objective functions and objective functions satisfying PL conditions, for each of which we derive optimal convergence rates. Interesting future work includes the extension to distributed learning [21], sparse learning [22] and stochastic composite mirror descent [15].

ACKNOWLEDGMENT

This work is supported partially by the National Key Research and Development Program of China (Grant No. 2017YFC0804003), the National Natural Science Foundation of China (Grant Nos. 11571078, 11671307, 61806091), the Shenzhen Peacock Plan (Grant No. KQTD2016112514355531) and the Science and Technology Innovation Committee Foundation of Shenzhen (Grant No. ZDSYS201703031748284). The corresponding author is Guiying Li.

REFERENCES

[1] H. Karimi, J. Nutini, and M. Schmidt, "Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2016, pp. 795–811.
[2] B. T. Polyak, "Gradient methods for minimizing functionals," Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, vol. 3, no. 4, pp. 643–653, 1963.
[3] S. Ghadimi and G. Lan, "Stochastic first- and zeroth-order methods for nonconvex stochastic programming," SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.
[4] S. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola, "Stochastic variance reduction for nonconvex optimization," in International Conference on Machine Learning, 2016, pp. 314–323.
[5] S. Ghadimi, G. Lan, and H. Zhang, "Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization," Mathematical Programming, vol. 155, no. 1-2, pp. 267–305, 2016.
[6] Y. Ying and D.-X. Zhou, "Unregularized online learning algorithms with general loss functions," Applied and Computational Harmonic Analysis, vol. 42, no. 2, pp. 224–244, 2017.
[7] D. Chang, M. Lin, and C. Zhang, "On the generalization ability of online gradient descent algorithm under the quadratic growth condition," IEEE Transactions on Neural Networks and Learning Systems, no. 99, pp. 1–12, 2018.
[8] D. J. Foster, A. Sekhari, and K. Sridharan, "Uniform convergence of gradients for non-convex learning and optimization," in Advances in Neural Information Processing Systems, 2018, pp. 8759–8770.
[9] T. Zhang, "Solving large scale linear prediction problems using stochastic gradient descent algorithms," in International Conference on Machine Learning, 2004, pp. 919–926.
[10] L. Bottou, F. E. Curtis, and J. Nocedal, "Optimization methods for large-scale machine learning," SIAM Review, vol. 60, no. 2, pp. 223–311, 2018.
[11] E. Hazan, A. Agarwal, and S. Kale, "Logarithmic regret algorithms for online convex optimization," Machine Learning, vol. 69, no. 2, pp. 169–192, 2007.
[12] Y. Ying and D.-X. Zhou, "Online regularized classification algorithms," IEEE Transactions on Information Theory, vol. 52, no. 11, pp. 4775–4788, 2006.
[13] J. Lin and D.-X. Zhou, "Online learning algorithms can converge comparably fast as batch learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, pp. 2367–2378, 2018.
[14] T. Hu and D.-X. Zhou, "Online learning with samples drawn from non-identical distributions," Journal of Machine Learning Research, vol. 10, no. Dec, pp. 2873–2898, 2009.
[15] Y. Lei and K. Tang, "Stochastic composite mirror descent: Optimal bounds with high probabilities," in Advances in Neural Information Processing Systems, 2018, pp. 1524–1534.
[16] Y. Lei and D.-X. Zhou, "Convergence of online mirror descent," Applied and Computational Harmonic Analysis, 2018.
[17] D. P. Bertsekas and J. N. Tsitsiklis, "Gradient convergence in gradient methods with errors," SIAM Journal on Optimization, vol. 10, no. 3, pp. 627–642, 2000.
[18] Y. Lei, L. Shi, and Z.-C. Guo, "Convergence of unregularized online learning algorithms," Journal of Machine Learning Research, vol. 18, no. 171, pp. 1–33, 2018.