Abstract
Variance reduction (VR) methods for finite-sum minimization typically require the knowledge of problem-dependent constants that are often unknown and difficult to estimate. To address this, we use ideas from adaptive gradient methods to propose AdaSVRG, a more robust variant of SVRG, a common VR method. AdaSVRG uses AdaGrad, a common adaptive gradient method, in the inner loop of SVRG, making it robust to the choice of step-size. When minimizing a sum of n smooth convex functions, we prove that a variant of AdaSVRG requires \(\tilde{O}(n + 1/\epsilon )\) gradient evaluations to achieve an \(O(\epsilon )\)-suboptimality, matching the typical rate, but without needing to know problem-dependent constants. Next, we show that the dynamics of AdaGrad exhibit a two-phase behavior: the step-size remains approximately constant in the first phase, and then decreases at an \(O\left( {1}/{\sqrt{t}}\right)\) rate. This result may be of independent interest, and allows us to propose a heuristic that adaptively determines the length of each inner-loop in AdaSVRG. Via experiments on synthetic and real-world datasets, we validate the robustness and effectiveness of AdaSVRG, demonstrating its superior performance over standard and other “tune-free” VR methods.
1 Introduction
Variance reduction (VR) methods (Schmidt et al., 2017; Konečnỳ & Richtárik, 2013; Mairal, 2013; Shalev-Shwartz & Zhang, 2013; Johnson & Zhang, 2013; Mahdavi & Jin, 2013; Defazio et al., 2014; Nguyen et al., 2017) have proven to be an important class of algorithms for stochastic optimization. These methods take advantage of the finite-sum structure prevalent in machine learning problems, and have improved convergence over stochastic gradient descent (SGD) and its variants (see Gower et al., 2020 for a recent survey). For example, when minimizing a finite sum of n strongly-convex, smooth functions with condition number \(\kappa\), these methods typically require \(O\left( (\kappa + n) \, \log (1/\epsilon ) \right)\) gradient evaluations to obtain an \(\epsilon\)-error. This improves upon the complexity of full-batch gradient descent (GD) that requires \(O\left( \kappa n \, \log (1/\epsilon ) \right)\) gradient evaluations, and SGD that has an \(O(\kappa /\epsilon )\) complexity. Moreover, there have been numerous VR methods that employ Nesterov acceleration (Allen-Zhu, 2017; Lan et al., 2019; Song et al., 2020) and can achieve even faster rates.
In order to guarantee convergence, VR methods require an easier-to-tune constant step-size, whereas SGD needs a decreasing step-size schedule. Consequently, VR methods are commonly used in practice, especially when training convex models such as logistic regression or conditional Markov random fields (Schmidt et al., 2015). However, all the above-mentioned VR methods require knowledge of the smoothness of the underlying function in order to set the step-size. The smoothness constant is often unknown and difficult to estimate in practice. Although we can obtain global upper-bounds on it for simple problems such as least squares regression, these bounds are usually too loose to be practically useful and result in sub-optimal performance. Consequently, implementing VR methods requires a computationally expensive search over a range of step-sizes. Furthermore, a constant step-size does not adapt to the function’s local smoothness and may lead to poor empirical performance.
Consequently, there have been a number of works that try to adapt the step-size in VR methods. Schmidt et al. (2017) and Mairal (2013) employ stochastic line-search procedures to set the step-size in VR algorithms. While they show promising empirical results using line-searches, these procedures have no theoretical convergence guarantees. Recent works (Tan et al., 2016; Li et al., 2020) propose to use the Barzilai-Borwein (BB) step-size (Barzilai & Borwein, 1988) in conjunction with two common VR algorithms—stochastic variance reduced gradient (SVRG) (Johnson & Zhang, 2013) and the stochastic recursive gradient algorithm (SARAH) (Nguyen et al., 2017). Both Tan et al. (2016) and Li et al. (2020) can automatically set the step-size without requiring the knowledge of problem-dependent constants. However, in order to prove theoretical guarantees for strongly-convex functions, these techniques require the knowledge of both the smoothness and strong-convexity parameters. In fact, their guarantees require using a small \(O(1/\kappa ^2)\) step-size, a highly-suboptimal choice in practice. Consequently, there is a gap in the theory and practice of adaptive VR methods. To address this, we make the following contributions.
1.1 Background and contributions
1.1.1 SVRG meets AdaGrad
In Sect. 3 we use AdaGrad (Duchi et al., 2011; Levy et al., 2018), an adaptive gradient method, with stochastic variance reduction techniques. We focus on SVRG (Johnson & Zhang, 2013) and propose to use AdaGrad within its inner-loop. We analyze the convergence of the resulting AdaSVRG algorithm for minimizing convex functions (without strong-convexity). Using O(n) inner-loops for every outer-loop (a typical setting used in practice; Babanezhad Harikandeh et al., 2015; Sebbouh et al., 2019), and any bounded step-size, we prove that AdaSVRG achieves an \(\epsilon\)-error (for \(\epsilon = O(1/n)\)) with \(O(n/\epsilon )\) gradient evaluations (Theorem 1). This rate matches that of SVRG with a constant step-size and O(n) inner-loops (Reddi et al., 2016, Corollary 10). However, unlike Reddi et al. (2016), our result does not require knowledge of the smoothness constant in order to set the step-size. We note that other previous works (Cutkosky & Orabona, 2019; Liu et al., 2020) consider adaptive methods with variance reduction for non-convex minimization; however, their algorithms still require knowledge of problem-dependent parameters.
1.1.2 Multi-stage AdaSVRG
We propose a multi-stage variant of AdaSVRG where each stage involves running AdaSVRG for a fixed number of inner and outer-loops. In particular, multi-stage AdaSVRG maintains a fixed-size outer-loop and doubles the length of the inner-loop across stages. We prove that it requires \(O((n + 1/\epsilon ) \log (1/\epsilon ))\) gradient evaluations to reach an \(O(\epsilon )\) error (Theorem 2). This improves upon the complexity of decreasing step-size SVRG that requires \(O(n + \sqrt{n}/\epsilon )\) gradient evaluations (Reddi et al., 2016, Corollary 9); and matches the rate of SARAH (Nguyen et al., 2017).
After our work was made publicly available (Dubois-Taine et al., 2021), recent work (Zhou et al., 2021) improved upon our result by applying a similar idea to an accelerated variant of SVRG (Allen-Zhu, 2017). Their algorithm requires \(\tilde{O}(n + \sqrt{n/\epsilon })\) gradient evaluations to obtain an \(\epsilon\)-error without the knowledge of problem-dependent constants.
1.1.3 AdaSVRG with adaptive termination
Instead of using a complex multi-stage procedure, we prove that AdaSVRG can also achieve the improved \(O((n + 1/\epsilon ) \log (1/\epsilon ))\) gradient evaluation complexity by adaptively terminating its inner-loop (Sect. 4). However, the adaptive termination requires the knowledge of problem-dependent constants, limiting its practical use.
To address this, we use the favourable properties of AdaGrad to design a practical heuristic for adaptively terminating the inner-loop. Our technique for adaptive termination is related to heuristics (Pflug, 1983; Yaida, 2018; Lang et al., 2019; Pesme et al., 2020) that detect stalling for constant step-size SGD, and may be of independent interest. First, we show that when minimizing smooth convex losses, AdaGrad has a two-phase behaviour—a first “deterministic phase” where the step-size remains approximately constant followed by a second “stochastic” phase where the step-size decreases at an \(O(1/\sqrt{t})\) rate (Theorem 4). We show that it is empirically possible to efficiently detect this phase transition and aim to terminate the AdaSVRG inner-loop when AdaGrad enters the stochastic phase.
1.1.4 Practical considerations and experimental evaluation
In Sect. 5, we describe some of the practical considerations for implementing AdaSVRG and the adaptive termination heuristic. We use standard real-world datasets to empirically verify the robustness and effectiveness of AdaSVRG. Across datasets, we demonstrate that AdaSVRG consistently outperforms variants of SVRG, SARAH and methods based on the BB step-size (Tan et al., 2016; Li et al., 2020).
1.1.5 Adaptivity to over-parameterization
Defazio and Bottou (2019) demonstrated the ineffectiveness of SVRG when training large over-parameterized models such as deep neural networks. We argue that this ineffectiveness can be partially explained by the interpolation property satisfied by over-parameterized models (Schmidt & Le Roux, 2013; Ma et al., 2018; Vaswani et al., 2019a). In the interpolation setting, SGD obtains an \(O(1/\epsilon )\) gradient complexity when minimizing smooth convex functions (Vaswani et al., 2019a), thus out-performing typical VR methods. However, interpolation is rarely exactly satisfied in practice, and using SGD can result in oscillations around the solution. On the other hand, although VR methods have a slower convergence, they do not oscillate, regardless of interpolation. In Sect. 6, we use AdaGrad to exploit the (approximate) interpolation property, and employ the above heuristic to adaptively switch to AdaSVRG, thus avoiding oscillatory behaviour. We design synthetic problems controlling the extent of interpolation and show that the hybrid AdaGrad-AdaSVRG algorithm can match or outperform both stochastic gradient and VR methods, thus achieving the best of both worlds.
2 Problem setup
We consider the minimization of an objective \(f:\mathbb {R}^d\rightarrow \mathbb {R}\) with a finite-sum structure,
\(\min _{w \in X} f(w) :=\frac{1}{n} \sum _{i=1}^{n} f_i(w),\)
where X is a convex compact set of diameter D, meaning \(\sup _{x, y \in X}\left\| x - y\right\| \le D\). Problems with this structure are prevalent in machine learning. For example, in supervised learning, n represents the number of training examples, and \(f_i\) is the loss function when classifying or regressing to training example i. Throughout this paper, we assume f and each \(f_i\) are differentiable. We assume that f is convex, implying that there exists a solution \(w^{*}\in X\) that minimizes it, and define \(f^*:= f(w^{*})\). Interestingly we do not need each \(f_i\) to be convex. We further assume that each function \(f_{i}\) in the finite-sum is \(L_i\)-smooth, implying that f is \(L_{\max }\)-smooth, where \(L_{\max } = \max _{i} L_i\). We include the formal definitions of these properties in Appendix “Definitions”.
The classical method for solving such a problem is stochastic gradient descent (SGD). Starting from the iterate \(x_0\), at each iteration t SGD samples (typically uniformly at random) a loss function \(f_{i_t}\) and takes a step in the negative direction of the stochastic gradient \(\nabla f_{i_t}(x_t)\) using a step-size \(\eta _t\). This update can be expressed as
\(x_{t+1} = x_t - \eta _t \nabla f_{i_t}(x_t).\)
In order to ensure convergence to the minimizer, the sequence of step-sizes in SGD needs to be decreasing, typically at an \(O(1/\sqrt{t})\) rate (Moulines & Bach, 2011). This has the effect of slowing down the convergence and results in an \(\Theta (1/\sqrt{t})\) convergence to the minimizer for convex functions (compared to the O(1/t) convergence for gradient descent). Variance reduction methods were developed to overcome this slower convergence by exploiting the finite-sum structure of the objective.
We focus on the SVRG algorithm (Johnson & Zhang, 2013) since it is more memory efficient than other variance reduction alternatives like SAG (Schmidt et al., 2017) or SAGA (Defazio et al., 2014). SVRG has a nested inner-outer loop structure. In every outer-loop k, it computes the full gradient \(\nabla f(w_{k})\) at a snapshot point \(w_{k}\). An outer-loop k consists of \(m_k\) inner-loops indexed by \(t = 1, 2, \ldots m_k\) and the inner-loop iterate \(x_1\) is initialized to \(w_{k}\). In outer-loop k and inner-loop t, SVRG samples an example \(i_t\) (typically uniformly at random) and takes a step in the direction of the variance-reduced gradient \(g_t\) using a constant step-size \(\eta\). This update can be expressed as:
\(g_t = \nabla f_{i_t}(x_t) - \nabla f_{i_t}(w_{k}) + \nabla f(w_{k}), \qquad x_{t+1} = \Pi _{X}\left( x_t - \eta \, g_t\right) ,\)
where \(\Pi _{X}\) denotes the Euclidean projection onto the set X. The variance-reduced gradient is unbiased, meaning that \(\mathbb {E}_{i_t}[g_t \vert x_t] = \nabla f(x_t)\). At the end of the inner-loop, the next snapshot point is typically set to either the last or averaged iterate in the inner-loop.
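The following is a minimal Python sketch of this update (constant step-size, snapshot set to the last inner iterate, projection omitted as in the unconstrained case); the gradient oracle interface is an assumption made for illustration.

```python
import numpy as np

def svrg(grad_i, full_grad, w0, n, eta=0.1, m=None, K=20, seed=0):
    """Sketch of SVRG: each outer loop computes the full gradient at the snapshot w_k,
    and the inner loop takes steps along the variance-reduced gradient g_t."""
    rng = np.random.default_rng(seed)
    m = n if m is None else m               # common practical choice: m = n inner iterations
    w = np.array(w0, dtype=float)
    for _ in range(K):
        mu = full_grad(w)                   # full gradient at the snapshot
        x = w.copy()
        for _ in range(m):
            i = rng.integers(n)
            # variance-reduced gradient: unbiased estimate of grad f(x)
            g = grad_i(x, i) - grad_i(w, i) + mu
            x = x - eta * g                 # constant step-size (needs knowledge of L_max in theory)
        w = x                               # next snapshot: last inner iterate
    return w
```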
SVRG requires the knowledge of both the strong-convexity and smoothness constants in order to set the step-size and the number of inner-loops. These requirements were relaxed by Hofmann et al. (2015), Kovalev et al. (2020) and Gower et al. (2020), which only require knowledge of the smoothness constant.
In order to set the step-size for SVRG without requiring knowledge of the smoothness, line-search techniques are an attractive option. Such techniques are a common approach to automatically set the step-size for (stochastic) gradient descent (Armijo, 1966; Vaswani et al., 2019b). However, we show that an intuitive Armijo-like line-search to set the SVRG step-size is not guaranteed to converge to the solution. Specifically, we prove the following proposition in Appendix “Counter-example for line-search for SVRG”.
Proposition 1
If in each inner-loop t of SVRG, \(\eta _t\) is set as the largest step-size satisfying the condition: \(\eta _t \le \eta _{{\max}}\) and
then for any \(c > 0\), \(\eta _{{\max}} > 0\), there exists a 1-dimensional convex smooth function such that if \(\vert x_t - w^{*}\vert \le \min \{ \frac{1}{c}, 1\}\), then \(\vert x_{t+1} - w^{*}\vert \ge \vert x_t - w^{*}\vert\), implying that the update moves the iterate away from the solution when it is close to it, preventing convergence.
In the next section, we propose a novel approach that uses AdaGrad (Duchi et al., 2011) within SVRG, yielding AdaSVRG, a provably-convergent VR method that is more robust to the choice of step-size. To justify our choice of AdaGrad, we note that there are (roughly) three common ways of designing methods that do not require knowledge of problem-dependent constants: (i) the BB step-size, which still requires knowledge of \(L_{{\max}}\) to guarantee convergence in the VR setting (Tan et al., 2016; Li et al., 2020); (ii) line-search methods, which can fail to converge in the VR setting (Proposition 1); and (iii) adaptive gradient methods such as AdaGrad.


3 Adaptive SVRG
Like SVRG, AdaSVRG has a nested inner-outer loop structure and relies on computing the full gradient in every outer-loop. However, it uses AdaGrad in the inner-loop: the variance-reduced gradient \(g_t\) is used to update the preconditioner \(A_t\) at inner-loop iteration t. AdaSVRG computes the step-size \(\eta _k\) in every outer-loop (see Sect. 5 for details) and uses a preconditioned variance-reduced gradient step to update the inner-loop iterates: \(x_{t+1} = \Pi _{X}\left( x_t - \eta _{k} A_t^{-1} g_t\right)\). AdaSVRG then sets the next snapshot \(w_{k+1}\) to be the average of the inner-loop iterates.
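The following is a minimal Python sketch of the scalar variant of AdaSVRG described above (the projection onto X is omitted, as in our experiments); the oracle interface and the small numerical safeguard are assumptions.

```python
import numpy as np

def adasvrg(grad_i, full_grad, w0, n, eta, m=None, K=20, seed=0):
    """Sketch of scalar AdaSVRG (Algorithm 1): AdaGrad runs in the inner loop of SVRG.
    G accumulates the squared norms of the variance-reduced gradients and the
    preconditioner is A_t = sqrt(G_t)."""
    rng = np.random.default_rng(seed)
    m = n if m is None else m
    w = np.array(w0, dtype=float)
    for _ in range(K):
        mu = full_grad(w)                        # full gradient at the snapshot w_k
        x = w.copy()
        x_sum = np.zeros_like(x)
        G = 0.0                                  # scalar AdaGrad accumulator, reset every outer loop
        for _ in range(m):
            i = rng.integers(n)
            g = grad_i(x, i) - grad_i(w, i) + mu     # variance-reduced gradient
            G += float(np.dot(g, g))
            A = np.sqrt(G) + 1e-12               # scalar preconditioner (epsilon only as a safeguard)
            x = x - (eta / A) * g                # any bounded step-size eta works in the analysis
            x_sum += x
        w = x_sum / m                            # next snapshot: average of the inner iterates
    return w
```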
We now analyze the convergence of AdaSVRG (see Algorithm 1 for the pseudo-code). Throughout the main paper, we will only focus on the scalar variant (Ward et al., 2019) of AdaGrad. We defer the general diagonal and matrix variants (see Appendix “Algorithm in general case” for the pseudo-code) and their corresponding theory to the Appendix. We start with the analysis of a single outer-loop, and prove the following lemma in Appendix “Proof of Lemma 1”.
Lemma 1
(AdaSVRG with single outer-loop) Assume (i) convexity of f, (ii) \(L_{\max }\)-smoothness of \(f_i\) and (iii) bounded feasible set with diameter D. Defining \(\rho := \big ( \frac{D^2}{\eta _{k}} + 2\eta _{k}\big ) \sqrt{L_{{\max}}}\), for any outer loop k of AdaSVRG, with (a) inner-loop length \(m_k\) and (b) step-size \(\eta _{k}\),
The proof of the above lemma leverages the theoretical results of AdaGrad (Duchi et al., 2011; Levy et al., 2018). Specifically, the standard AdaGrad analysis bounds the “noise” term by the variance in the stochastic gradients. On the other hand, we use the properties of the variance reduced gradient in order to upper-bound the noise in terms of the function suboptimality.
Lemma 1 shows that a single outer-loop of AdaSVRG converges to the minimum at an \(O(1/\sqrt{m})\) rate, where m is the number of inner-loop iterations. This implies that in order to obtain an \(\epsilon\)-error, a single outer-loop of AdaSVRG requires \(O(n + 1/\epsilon ^2)\) gradient evaluations. This result holds for any bounded step-size and requires setting \(m = O(1/\epsilon ^2)\). This “single outer-loop convergence” property of AdaSVRG is unlike SVRG or any of its variants; running only a single loop of SVRG is ineffective, as it eventually stops making progress, resulting in the iterates oscillating in a neighbourhood of the solution. The favourable behaviour of AdaSVRG is similar to that of SARAH, but unlike SARAH, the above result does not require computing a recursive gradient or knowing the smoothness constant.
Next, we consider the convergence of AdaSVRG with a fixed-size inner-loop and multiple outer-loops. In the following theorems, we assume that we have a bounded range of step-sizes implying that for all k, \(\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]\). For brevity, similar to Lemma 1, we define \(\rho :=\big ( \frac{D^2}{\eta _{{\min}}} + 2\eta _{{\max}}\big ) \sqrt{L_{{\max}}}\).
Theorem 1
(AdaSVRG with fixed-size inner-loop) Under the same assumptions as Lemma 1, AdaSVRG with (a) step-sizes \(\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]\), (b) inner-loop size \(m_k = n\) for all k, results in the following convergence rate after \(K \le n\) iterations.
where \(\bar{w}_K = \frac{1}{K} \sum _{k = 1}^{K} w_{k}\).
The proof (refer to Appendix “Proof of Theorem 1”) recursively uses the result of Lemma 1 for K outer-loops.
The above result requires a fixed inner-loop size \(m_k = n\), a setting typically used in practice (Babanezhad Harikandeh et al., 2015; Gower et al., 2020). Notice that the above result holds only when \(K \le n\). Since K is the number of outer-loops, it is typically much smaller than n, the number of functions in the finite sum, justifying the theorem’s \(K \le n\) requirement. Moreover, in the sense of generalization error, it is not necessary to optimize below an O(1/n) accuracy (Boucheron et al., 2005; Sridharan et al., 2008).
Theorem 1 implies that AdaSVRG can reach an \(\epsilon\)-error (for \(\epsilon = \Omega (1/n)\)) using \(O(n/\epsilon )\) gradient evaluations. This result matches the complexity of constant step-size SVRG (with \(m_k = n\)) of Reddi et al. (2016, Corollary 10), but without requiring the knowledge of the smoothness constant. However, unlike SVRG and SARAH, the convergence rate depends on the diameter D rather than on \(\left\| w_0 - w^{*}\right\|\), the initial distance to the solution. This dependence arises due to the use of AdaGrad in the inner-loop, and is necessary for adaptive gradient methods. Specifically, Cutkosky and Boahen (2017) prove that any method adaptive to problem-dependent constants necessarily incurs such a dependence on the diameter. Hence, this diameter dependence can be considered the “cost” of not knowing the problem-dependent constants.
Since the above result only holds for \(\epsilon = \Omega (1/n)\), we propose a multi-stage variant (Algorithm 2) of AdaSVRG that requires \(O((n+1/\epsilon )\log (1/\epsilon ))\) gradient evaluations to attain an \(O(\epsilon )\)-error for any \(\epsilon\). To reach a target suboptimality of \(\epsilon\), we consider \(I = \log (1/\epsilon )\) stages. For each stage i, Algorithm 2 uses a fixed number of outer-loops K and inner-loops \(m^i\), with stage i initialized to the output of the \((i-1)\)-th stage. In Appendix “Proof of Theorem 2”, we prove the following rate for multi-stage AdaSVRG.
Theorem 2
(Multi-stage AdaSVRG) Under the same assumptions as Theorem 1, multi-stage AdaSVRG with \(I = \log (1/\epsilon )\) stages, \(K \ge 3\) outer-loops and \(m^i = 2^{i+1}\) inner-loops at stage i, requires \(O(n\log {1}/{\epsilon } + {1}/{\epsilon })\) gradient evaluations to reach a \(\left( \rho ^2 (1+\sqrt{5}) \right) \epsilon\)-sub-optimality.
We see that multi-stage AdaSVRG matches the convergence rate of SARAH (up to constants), but does so without requiring the knowledge of the smoothness constant to set the step-size. Observe that the number of inner-loops increases with the stage, i.e., \(m^i = 2^{i+1}\). The intuition behind this is that the convergence of AdaGrad (used in the k-th inner-loop of AdaSVRG) is slowed down by a “noise” term proportional to \(f(w_k) - f^*\) (see Lemma 1). When this “noise” term is large in the earlier stages of multi-stage AdaSVRG, the inner-loops have to be short in order to maintain the overall \(O(1/{\epsilon })\) convergence. However, as the stages progress and the suboptimality decreases, the “noise” term becomes smaller, and the algorithm can use longer inner-loops, which reduces the number of full gradient computations, resulting in the desired convergence rate.
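A minimal Python sketch of the multi-stage schedule (Algorithm 2), reusing the adasvrg sketch from Sect. 3; the warm-starting and the use of the natural logarithm for the stage count are implementation assumptions.

```python
import math
import numpy as np

def multistage_adasvrg(grad_i, full_grad, w0, n, eta, eps=1e-3, K=3, seed=0):
    """Sketch of multi-stage AdaSVRG: run K outer loops of AdaSVRG per stage,
    doubling the inner-loop length m^i = 2^(i+1) across the I = log(1/eps) stages."""
    w = np.array(w0, dtype=float)
    num_stages = max(1, math.ceil(math.log(1.0 / eps)))
    for i in range(num_stages):
        m_i = 2 ** (i + 1)                  # longer inner loops as the suboptimality shrinks
        w = adasvrg(grad_i, full_grad, w, n, eta, m=m_i, K=K, seed=seed + i)  # warm-start from previous stage
    return w
```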
Thus far, we have focused on using AdaSVRG with fixed-size inner-loops. Next, we consider variants that can adaptively determine the inner-loop size.
4 Adaptive termination of inner-loop
Recall that the convergence of a single outer-loop k of AdaSVRG (Lemma 1) is slowed down by the \(\sqrt{ {\left( f(w_k) - f^*\right)}/{m_k}}\) term. Similar to the multi-stage variant, the suboptimality \(f(w_k) - f^*\) decreases as AdaSVRG progresses. This allows the use of longer inner-loops as k increases, resulting in fewer full-gradient evaluations. We instantiate this idea by setting \(m_k = O\left( {1}/{\left( f(w_k) - f^*\right) }\right)\). Since this choice requires the knowledge of \(f(w_k) - f^*\), we alternatively consider using \(m_k = O(1/\epsilon )\), where \(\epsilon\) is the desired sub-optimality. We prove the following theorem in Appendix “Proof of Theorem 3”.
Theorem 3
(AdaSVRG with adaptive-sized inner-loops) Under the same assumptions as Lemma 1, AdaSVRG with (a) step-sizes \(\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]\), (b1) inner-loop size \(m_k = \frac{4\rho ^2}{\epsilon }\) for all k or (b2) inner-loop size \(m_k = \frac{4\rho ^2}{f(w_k) - f^*}\) for outer-loop k, results in the following convergence rate,
The above result implies a linear convergence in the number of outer-loops, but each outer-loop requires \(O(1/\epsilon )\) inner-loops. Hence, Theorem 3 implies that AdaSVRG with adaptive-sized inner-loops requires \(O\left( (n + 1/\epsilon ) \log (1/\epsilon ) \right)\) gradient evaluations to reach an \(\epsilon\)-error. This improves upon the rate of SVRG and matches the convergence rate of SARAH that also requires inner-loops of length \(O(1/\epsilon )\). Compared to Theorem 1 that has an average iterate convergence (for \(\bar{w}_K\)), Theorem 3 has the desired convergence for the last outer-loop iterate \(w_{K}\) and also holds for any bounded sequence of step-sizes. However, unlike Theorem 1, this result (with either setting of \(m_k\)) requires the knowledge of problem-dependent constants in \(\rho\).
To address this issue, we design a heuristic for adaptive termination in the next sections. We start by describing the two phase behaviour of AdaGrad and subsequently utilize it for adaptive termination in AdaSVRG.
4.1 Two phase behaviour of AdaGrad
Diagnostic tests (Pflug, 1983; Yaida, 2018; Lang et al., 2019; Pesme et al., 2020) study the behaviour of the SGD dynamics to automatically control its step-size. Similarly, designing the adaptive termination test requires characterizing the behaviour of AdaGrad used in the inner loop of AdaSVRG.
We first investigate the dynamics of constant step-size AdaGrad in the stochastic setting. Specifically, we monitor the evolution of \(\sqrt{G_t}\) across iterations. We define \(\sigma ^2 :=\sup _{x \in X} \mathbb E_{i} \Vert \nabla f_i(x) - \nabla f(x) \Vert ^2\) as a uniform upper-bound on the variance in the stochastic gradients for all iterates. We prove the following theorem showing that there exists an iteration \(T_0\) when the evolution of \(\sqrt{G_t}\) undergoes a phase transition. Note again that such a phase transition happens for the scalar, diagonal and matrix variants of AdaGrad. We only present the scalar result in the main paper and defer the other variants to the Appendix.
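For reference, a standard form of the scalar AdaGrad recursion that we monitor (the full pseudo-code appears only in Appendix “Algorithm in general case”, so we state this form here as our working assumption) is
\(G_t = G_{t-1} + \left\| g_t\right\| ^2, \qquad A_t = \sqrt{G_t}, \qquad x_{t+1} = \Pi _{X}\left( x_t - \eta \, A_t^{-1} g_t\right) ,\)
where \(g_t\) denotes the stochastic gradient used at iteration t (the variance-reduced gradient when AdaGrad is run within AdaSVRG).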
Theorem 4
(Phase Transition in AdaGrad Dynamics) Under the same assumptions as Lemma 1 and (iv) \(\sigma ^2\)-bounded stochastic gradient variance and defining \(T_0 = \frac{\rho ^2 L_{{\max}}}{\sigma ^2}\), for constant step-size AdaGrad we have \(\mathbb {E}[\sqrt{G_t}] = O(1) \text { for } t \le T_0\), and \(\mathbb {E}[\sqrt{G_t}] = O (\sqrt{t-T_0}) \text { for } t \ge T_0\).
Theorem 4 (proved in Appendix “Proof of Theorem 4”) indicates that \(G_t\) is bounded by a constant for all \(t \le T_0\), implying that its rate of growth is slower than \(\log (t)\). This implies that the step-size of AdaGrad is approximately constant (similar to gradient descent in the full-batch setting) in this first phase until iteration \(T_0\). Indeed, if \(\sigma = 0\), then \(T_0 = \infty\) and AdaGrad always remains in this deterministic phase. This result generalizes Qian and Qian (2019, Theorem 3.1), which analyzes the diagonal variant of AdaGrad in the deterministic setting. After iteration \(T_0\), the noise \(\sigma ^2\) starts to dominate, and AdaGrad transitions into the stochastic phase where \(G_t\) grows as O(t). In this phase, the step-size decreases as \(O(1/\sqrt{t})\), resulting in slower convergence to the minimizer. AdaGrad thus results in an overall \(O \left( {1}/{T} + {\sigma ^2}/{\sqrt{T}} \right)\) rate (Levy et al., 2018), where the first term corresponds to the deterministic phase and the second to the stochastic phase.
Since the exact detection of this phase transition is not possible, we design a heuristic to detect it without requiring the knowledge of problem-dependent constants.

4.2 Heuristic for adaptive termination
Similar to tests used to detect stalling for SGD (Pflug, 1983; Pesme et al., 2020), the proposed diagnostic test has a burn-in phase of \({n}/{2}\) inner-loop iterations that allows the initial AdaGrad dynamics to stabilize. After this burn-in phase, for every even iteration, we compute the ratio \(R = \frac{G_{t} - G_{t/2}}{G_{t/2}}\). Given a threshold hyper-parameter \(\theta\), the test terminates the inner-loop when \(R \ge \theta\). In the first, deterministic phase, since the growth of \(G_{t}\) is slow, \(G_{t} \approx G_{t/2}\) and \(R \approx 0\). In the stochastic phase, \(G_{t} = O(t)\), and \(R \approx 1\), so the test can distinguish between the two phases. AdaSVRG with this test is fully specified in Algorithm 3. Experimentally, we use \(\theta = 0.5\) to give an early indication of the phase transition.
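A minimal sketch of this test, written as a helper that Algorithm 3 could call at every inner iteration; the interface (a list storing the running values of \(G_t\)) is an illustrative assumption.

```python
def should_terminate(G_history, t, n, b=1, theta=0.5):
    """Phase-transition test for the AdaGrad accumulator G_t.
    G_history[t] holds G_t at inner iteration t (with G_history[0] = 0).
    After a burn-in of n/(2b) iterations, at every even iteration compute
    R = (G_t - G_{t/2}) / G_{t/2} and terminate once R >= theta."""
    burn_in = n // (2 * b)
    if t <= burn_in or t % 2 != 0 or G_history[t // 2] == 0:
        return False
    R = (G_history[t] - G_history[t // 2]) / G_history[t // 2]
    return R >= theta   # R is close to 0 in the deterministic phase and close to 1 in the stochastic phase
```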
5 Experimental evaluation of AdaSVRG
We first describe the practical considerations for implementing AdaSVRG and then evaluate its performance on real and synthetic datasets. The code to reproduce the experiments can be found at the following link: https://github.com/bpauld/AdaSVRG. We do not use projections in our experiments since these problems have an unconstrained \(w^*\) with finite norm (we thus assume D is big enough to include it), and we empirically observed that our iterates always stayed bounded, so no projection was required (Footnote 1).
5.1 Implementing AdaSVRG
Though our theoretical results hold for any bounded sequence of step-sizes, the choice of step-size affects the practical performance of AdaGrad (Vaswani et al., 2020) (and hence AdaSVRG). Theoretically, the optimal step-size minimizing the bound in Lemma 1 is given by \(\eta ^* = \frac{D}{\sqrt{2}}\). Since we do not have access to D, we use the following heuristic to set the step-size for each outer-loop of AdaSVRG. In outer-loop k, we approximate D by \(\Vert w_{k}- w^{*}\Vert\), which can be lower-bounded using the co-coercivity of smooth convex functions as \(\Vert w_{k}- w^{*}\Vert \ge {1}/{L_{{\max}}} \Vert \nabla f(w_k) \Vert\) (Nesterov, 2004, Thm. 2.1.5 (2.1.8)). We have access to \(\nabla f(w_k)\) for the current outer-loop, and store the value of \(\nabla f(w_{k-1})\) in order to approximate the smoothness constant. Specifically, by co-coercivity, \(L_{{\max}} \ge L_k :=\frac{\Vert \nabla f(w_k) - \nabla f(w_{k-1})\Vert }{\Vert w_k - w_{k-1}\Vert }\). Putting these together, \(\eta _{k}= \frac{\Vert \nabla f(w_k)\Vert }{\sqrt{2} \, \max _{i=0, \dots , k} L_i}\) (Footnote 2). Although a similar heuristic could be used to estimate \(L_{{\max}}\) for SVRG or SARAH, the resulting step-size is larger than \(1/L_{{\max}}\), implying that it would not have any theoretical guarantee, while our results hold for any bounded sequence of step-sizes.

Although Algorithm 1 requires setting \(w_{k+1}\) to be the average of the inner-loop iterates, we use the last iterate and set \(w_{k+1} = x_{m_k}\), as this is a more common choice (Johnson & Zhang, 2013; Tan et al., 2016) and results in better empirical performance. We compare two variants of AdaSVRG, with (i) fixed-size inner-loop (Algorithm 1) and (ii) adaptive termination (Algorithm 3). We handle a general batch-size b, and set \(m = {n}/{b}\) for Algorithm 1. This is a common practical choice (Babanezhad Harikandeh et al., 2015; Gower et al., 2020; Kovalev et al., 2020). For Algorithm 3, the burn-in phase consists of \({n}/{2b}\) iterations and \(M = {10n}/{b}\).
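For concreteness, a minimal sketch of this step-size heuristic; the bookkeeping of the past estimates \(L_i\) in a list is an implementation assumption.

```python
import numpy as np

def adasvrg_step_size(w_k, grad_k, w_prev, grad_prev, L_hist):
    """Heuristic step-size for outer loop k:
    eta_k = ||grad f(w_k)|| / (sqrt(2) * max_{i<=k} L_i), where
    L_k = ||grad f(w_k) - grad f(w_{k-1})|| / ||w_k - w_{k-1}|| estimates L_max
    via co-coercivity."""
    L_k = np.linalg.norm(grad_k - grad_prev) / np.linalg.norm(w_k - w_prev)
    L_hist.append(L_k)                    # keep the running maximum over outer loops
    return np.linalg.norm(grad_k) / (np.sqrt(2) * max(L_hist))
```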
5.2 Evaluating AdaSVRG
In order to assess the effectiveness of AdaSVRG, we experiment with binary classification on standard LIBSVM datasets (Chang & Lin, 2011). In particular, we consider \(\ell _2\)-regularized problems (with regularization set to \({1}/{n}\)) with three losses: the logistic loss, the squared loss, and the Huber loss. For each experiment, we plot the median and standard deviation across 5 independent runs. In the main paper, we show the results for four of the datasets and relegate the results for the three others to Appendix “Additional experiments”. Similarly, we consider batch-sizes in the grid [1, 8, 64, 128], but only show the results for \(b = 64\) in the main paper.
We compare the AdaSVRG variants against SVRG (Johnson & Zhang, 2013), loopless-SVRG (Kovalev et al., 2020), SARAH (Nguyen et al., 2017), and SVRG-BB (Tan et al., 2016), the only other tune-free VR method (Footnote 3). Since each of these methods requires a step-size, we search over the grid \([10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 100]\), and select the best step-size for each algorithm and each experiment. As is common, we set \(m = {n}/{b}\) for each of these methods. We note that though the theoretical results of SVRG-BB require a small \(O(1/\kappa ^2)\) step-size and \(O(\kappa ^2)\) inner-loops, Tan et al. (2016) recommend setting \(m = O(n)\) in practice. Since AdaGrad results in the slower \(O(1/\epsilon ^2)\) rate (Levy et al., 2018; Vaswani et al., 2020) compared to the \(O(n + \frac{1}{\epsilon })\) rate of VR methods, we do not include it in the main paper. We demonstrate the poor performance of AdaGrad on two example datasets in Fig. 4 in Appendix “Additional experiments”.
We plot the gradient norm of the training objective (for the best step-size) against the number of gradient evaluations normalized by the number of examples. We show the results for the logistic loss (Fig. 1a), Huber loss (Fig. 1b), and squared loss (Fig. 2). Our results show that (i) both variants of AdaSVRG (without any step-size tuning) are competitive with the other best-tuned VR methods, often out-performing them or matching their performance; (ii) SVRG-BB often has an oscillatory behavior, even for the best step-size; and (iii) the performance of AdaSVRG with adaptive termination (that has superior theoretical complexity) is competitive with that of the practically useful O(n) fixed inner-loop setting.
Fig. 1 Comparison of AdaSVRG against SVRG variants, SVRG-BB and SARAH with batch-size = 64 for the logistic loss (top 2 rows) and the Huber loss (bottom 2 rows). For both losses, we compare AdaSVRG against the best-tuned variants and show the sensitivity to the step-size (we limit the gradient norm to a maximum value of 10)
Fig. 2 Comparison of AdaSVRG against SVRG variants, SVRG-BB and SARAH with batch-size = 64 for the squared loss. We compare AdaSVRG against the best-tuned variants and show the sensitivity to the step-size (we limit the gradient norm to a maximum value of 10). Curves for which SVRG-BB diverged are omitted
In order to evaluate the effect of the step-size on a method’s performance, we plot the gradient norm after 50 outer-loops vs step-size for each of the competing methods. For the AdaSVRG variants, we set the step-size according to the heuristic described earlier. For the logistic loss (Fig. 1a), Huber loss (Fig. 1b) and squared loss (Fig. 2), we observe that (i) the performance of typical VR methods heavily depends on the choice of the step-size; (ii) the step-size corresponding to the minimum loss is different for each method, loss and dataset; and (iii) AdaSVRG with the step-size heuristic results in competitive performance. Additional results plotted in Figs. 5a, b, 6a, b, 7a, b, 8a–c, 9a–c, 10a–c, 11a–c, 12a–c, 13a–c in Appendix “Additional experiments” confirm that the good performance of AdaSVRG is consistent across losses, batch-sizes and datasets.
Finally, in Fig. 14a in Appendix “Evaluating the diagonal variant of AdaSVRG”, we give preliminary results benchmarking the performance of the diagonal variant of AdaSVRG and comparing it to the scalar variant. We do not compare to the full matrix variant since inverting a \(d \times d\) matrix in each iteration makes it impractical for most machine learning tasks. Our results demonstrate that with the current heuristic for setting \(\eta\), the performance of the diagonal variant does not significantly improve over the scalar variant. Moreover, since each iteration of the diagonal variant incurs an additional O(d) cost compared to the scalar variant, we did not conduct further experiments with the diagonal variant. In the future, we aim to develop robust heuristics for setting the step-size for this variant of AdaSVRG.
6 Heuristic for adaptivity to over-parameterization

In this section, we reason that the poor empirical performance of SVRG when training over-parameterized models (Defazio & Bottou, 2019) can be partially explained by the interpolation property (Schmidt & Le Roux, 2013; Ma et al., 2018; Vaswani et al., 2019a) satisfied by these models (Zhang et al., 2017). In particular, we focus on smooth convex losses, but assume that the model is capable of completely fitting the training data, and that \(w^*\) lies in the interior of X. For example, these properties are simultaneously satisfied when minimizing the squared hinge-loss for linear classification on separable data or unregularized kernel regression (Belkin et al., 2019; Liang et al., 2020) with \(\Vert w^*\Vert \le 1\).
Formally, the interpolation condition means that the gradient of each \(f_i\) in the finite-sum converges to zero at an optimum. Additionally, we assume that each function \(f_i\) has finite minimum \(f_i^*\). If the overall objective f is minimized at \(w^{*}\), \(\nabla f(w^{*}) = 0\), then for all \(f_{i}\) we have \(\nabla f_{i}(w^{*}) = 0\). Since the interpolation property is rarely exactly satisfied in practice, we allow for a weaker version that uses \(\zeta ^2 :=\mathbb {E}_i [f^* - f_i^*] \in [0,\infty )\) (Loizou et al., 2020; Vaswani et al., 2020) to measure the extent of the violation of interpolation. If \(\zeta ^2 = 0\), interpolation is exactly satisfied.
When \(\zeta ^2 = 0\), both constant step-size SGD and AdaGrad have a gradient complexity of \(O(1/\epsilon )\) in the smooth convex setting (Schmidt & Le Roux, 2013; Vaswani et al., 2019a, 2020). In contrast, typical VR methods have an \(\tilde{O}(n + \frac{1}{\epsilon })\) complexity. For example, both SVRG and AdaSVRG require computing the full gradient in every outer-loop, and will thus unavoidably suffer an \(\Omega (n)\) cost. For large n, typical VR methods will thus be necessarily slower than SGD when training models that can exactly interpolate the data. This provides a partial explanation for the ineffectiveness of VR methods when training over-parameterized models. When \(\zeta ^2 > 0\), AdaGrad has an \(O(1/\epsilon + \zeta /\epsilon ^2)\) rate (Vaswani et al., 2020). Here \(\zeta\), the violation of interpolation plays the role of noise and slows down the convergence to an \(O(1/\epsilon ^2)\) rate. On the other hand, AdaSVRG results in an \(\tilde{O}(n + 1/\epsilon )\) rate, regardless of \(\zeta\).
Following the reasoning in Sect. 4, if an algorithm can detect the slower convergence of AdaGrad and switch from AdaGrad to AdaSVRG, it can attain a faster convergence rate. It is straightforward to show that AdaGrad has a similar phase transition as in Theorem 4 when interpolation is only approximately satisfied. This enables the use of the test in Sect. 4 to terminate AdaGrad and switch to AdaSVRG, resulting in the hybrid algorithm described in Algorithm 4. If the diagnostic test can detect the phase transition accurately, Algorithm 4 will attain an \(O(1/\epsilon )\) convergence when interpolation is exactly satisfied (no switching in this case). When interpolation is only approximately satisfied, it will result in an \(O(1/\epsilon )\) convergence for \(\epsilon \ge \zeta\) (corresponding to the AdaGrad rate in the deterministic phase) and will attain an \(O(1/\zeta ^2 + (n + 1/\epsilon ) \log (\zeta /\epsilon ))\) convergence thereafter (corresponding to the AdaSVRG rate). This implies that Algorithm 4 can indeed obtain the best of both worlds between AdaGrad and AdaSVRG.
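A minimal Python sketch of the switching logic in Algorithm 4, reusing the adasvrg and should_terminate sketches above; the accounting of the epoch budget is an illustrative assumption.

```python
import numpy as np

def adagrad_then_adasvrg(grad_i, full_grad, w0, n, eta, budget_epochs=50, theta=0.5, seed=0):
    """Run scalar AdaGrad until the phase-transition test fires, then switch to AdaSVRG."""
    rng = np.random.default_rng(seed)
    x = np.array(w0, dtype=float)
    G, G_history = 0.0, [0.0]
    switched_at = None
    for t in range(1, budget_epochs * n + 1):
        i = rng.integers(n)
        g = grad_i(x, i)                       # plain stochastic gradient (no variance reduction yet)
        G += float(np.dot(g, g))
        G_history.append(G)
        x = x - (eta / (np.sqrt(G) + 1e-12)) * g
        if should_terminate(G_history, t, n, theta=theta):
            switched_at = t // n               # AdaGrad entered its stochastic phase: switch
            break
    if switched_at is None:
        return x                               # (near-)interpolation: AdaGrad alone uses the budget
    remaining = budget_epochs - switched_at
    # each AdaSVRG outer loop with m = n costs roughly 3n gradient evaluations
    return adasvrg(grad_i, full_grad, x, n, eta, m=n, K=max(remaining // 3, 1), seed=seed)
```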
6.1 Evaluating Algorithm 4
We use synthetic experiments to demonstrate the effect of interpolation on the convergence of stochastic and VR methods. Following the protocol in Meng et al. (2020), we generate a linearly separable dataset with \(n=10^4\) data points of dimension \(d=200\) and train a linear model with a convex loss. This setup ensures that interpolation is satisfied, but allows us to eliminate other confounding factors such as non-convexity and other implementation details. In order to smoothly violate interpolation, we show results with the fraction of mislabeled points in the grid [0, 0.1, 0.2].
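A minimal sketch of this data-generating process; the margin parameter and the exact sampling distribution are assumptions on our part, not specifics taken from Meng et al. (2020).

```python
import numpy as np

def make_separable_dataset(n=10_000, d=200, mislabel_frac=0.1, margin=0.1, seed=0):
    """Generate a linearly separable binary dataset, then flip a fraction of the
    labels so that interpolation is violated in a controlled way."""
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(d)
    X = rng.standard_normal((n, d))
    scores = X @ w_star
    keep = np.abs(scores) >= margin            # enforce a margin so the clean data is separable
    X, scores = X[keep], scores[keep]          # slightly fewer than n points survive the filter
    y = np.sign(scores)
    flip = rng.random(y.shape[0]) < mislabel_frac
    y[flip] = -y[flip]                         # mislabeling controls the violation of interpolation
    return X, y
```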
We use AdaGrad as a representative (fully) stochastic method, and to eliminate possible confounding because of its step-size, we set it using the stochastic line-search procedure (Vaswani et al., 2020). We compare the performance of AdaGrad, SVRG, AdaSVRG and the hybrid AdaGrad-AdaSVRG (Algorithm 4) each with a budget of 50 epochs (passes over the data). For SVRG, as before, we choose the best step-size via a grid-search. For AdaSVRG, we use the fixed-size inner-loop variant and the step-size heuristic described earlier. In order to evaluate the quality of the “switching” metric in Algorithm 4, we compare against a hybrid method referred to as “Optimal Manual Switching” in the plots. This method runs a grid-search over switching points—after epoch \(\{1, 2, \ldots , 50\}\) and chooses the point that results in the minimum loss after 50 epochs.
In Fig. 3, we plot the results for the logistic loss using a batch-size of 64 (refer to Figs. 15a–c, 16a–d in Appendix “Additional experiments on adaptivity to over-parameterization” for other losses and batch-sizes). We observe that (i) when interpolation is exactly satisfied (no mislabeling), AdaGrad results in superior performance over SVRG and AdaSVRG, confirming the theory in Sect. 6; in this case, both the optimal manual switching and Algorithm 4 do not switch; (ii) when interpolation is not exactly satisfied (with \(10\%, 20\%\) mislabeling), AdaGrad's progress slows down to a stall in a neighbourhood of the solution, whereas both SVRG and AdaSVRG converge to the solution; and (iii) in both cases, Algorithm 4 detects the slowdown of AdaGrad and switches to AdaSVRG, resulting in performance competitive with the optimal manual switching. For all three settings, Algorithm 4 matches or out-performs the better of AdaGrad and AdaSVRG, showing that it can achieve the best of both worlds.
7 Discussion
Although there have been numerous papers on VR methods in the past ten years, all of the provably convergent methods require knowledge of problem-dependent constants such as L. On the other hand, there has been substantial progress in designing adaptive gradient methods that have effectively replaced SGD for training ML models. Unfortunately, this progress has not been leveraged for developing better VR methods. Our work is the first to marry these lines of literature by designing AdaSVRG, which achieves a gradient complexity comparable to typical VR methods, but without needing to know the objective’s smoothness constant. Our results illustrate that it is possible to design principled techniques that can “painlessly” reduce the variance, achieving good theoretical and practical performance. We believe that our paper will help open up an exciting research direction. In the future, we aim to extend our theory to the strongly-convex setting.
Availability of data and materials
Experiments were done on the publicly available LIBSVM datasets (Chang & Lin, 2011).
Code availability
Full code to replicate the experiments can be found at https://github.com/bpauld/AdaSVRG.
Notes
For \(k = 0\), we compute the full gradient at a random point \(w_{-1}\) and approximate \(L_0\) in the same way.
We do not compare against SAG (Schmidt et al., 2017) because of its large memory footprint.
References
Ahn, K., Yun, C., & Sra, S. (2020). SGD with shuffling: Optimal rates without component convexity and large epoch requirements. In Neural information processing systems 2020, NeurIPS 2020.
Allen-Zhu, Z. (2017). Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th annual ACM SIGACT symposium on theory of computing, STOC.
Armijo, L. (1966). Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16(1), 1–3.
Babanezhad Harikandeh, R., Ahmed, M. O., Virani, A., Schmidt, M., Konečnỳ, J., & Sallinen, S. (2015). Stop wasting my gradients: Practical SVRG. Advances in Neural Information Processing Systems, 28, 2251–2259.
Barzilai, J., & Borwein, J. M. (1988). Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1), 141–148.
Belkin, M., Rakhlin, A., & Tsybakov, A. B. (2019). Does data interpolation contradict statistical optimality? In The 22nd international conference on artificial intelligence and statistics (pp. 1611–1619). PMLR.
Bollapragada, R., Byrd, R. H., & Nocedal, J. (2019). Exact and inexact subsampled newton methods for optimization. IMA Journal of Numerical Analysis, 39(2), 545–578.
Boucheron, S., Bousquet, O., & Lugosi, G. (2005). Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9, 323–375.
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 1–27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Cutkosky, A., & Boahen, K. (2017). Online convex optimization with unconstrained domains and losses. arXiv preprint arXiv:1703.02622.
Cutkosky, A., & Orabona, F. (2019). Momentum-based variance reduction in non-convex SGD. arXiv preprint arXiv:1905.10018.
Defazio, A., & Bottou, L. (2019). On the ineffectiveness of variance reduced optimization for deep learning. In Advances in Neural Information Processing Systems (NeurIPS).
Defazio, A., Bach, F., & Lacoste-Julien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NeurIPS).
Dubois-Taine, B., Vaswani, S., Babanezhad, R., Schmidt, M., & Lacoste-Julien, S. (2021). SVRG meets AdaGrad: Painless variance reduction. arXiv preprint arXiv:2102.09645.
Duchi, J. C., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12, 2121–2159.
Gower, R. M., Schmidt, M., Bach, F., & Richtarik, P. (2020). Variance-reduced methods for machine learning. Proceedings of the IEEE, 108(11), 1968–1983.
Hofmann, T., Lucchi, A., Lacoste-Julien, S., & McWilliams, B. (2015). Variance reduced stochastic gradient descent with neighbors. Advances in Neural Information Processing Systems, 28, 2305–2313.
Johnson, R., & Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (NeurIPS).
Konečnỳ, J., & Richtárik, P. (2013). Semi-stochastic gradient descent methods. arXiv preprint arXiv:1312.1666.
Kovalev, D., Horváth, S., & Richtárik, P. (2020). Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In Algorithmic learning theory (pp. 451–467). PMLR.
Lan, G., Li, Z., & Zhou, Y. (2019). A unified variance-reduced accelerated gradient method for convex optimization. In Advances in neural information processing systems (pp. 10462–10472).
Lang, H., Xiao, L., & Zhang, P. (2019). Using statistics to automate stochastic optimization. In Advances in neural information processing systems (pp. 9540–9550).
Levy, K. Y., Yurtsever, A., & Cevher, V. (2018). Online adaptive methods, universality and acceleration. In Advances in Neural Information Processing Systems (NeurIPS).
Li, B., Wang, L., & Giannakis, G. B. (2020). Almost tune-free variance reduction. In International conference on machine learning (pp. 5969–5978). PMLR.
Liang, T., Rakhlin, A., et al. (2020). Just interpolate: Kernel “ridgeless” regression can generalize. Annals of Statistics, 48(3), 1329–1347.
Liu, M., Zhang, W., Orabona, F., & Yang, T. (2020). Adam+: A stochastic method with adaptive variance reduction. arXiv preprint arXiv:2011.11985.
Loizou, N., Vaswani, S., Laradji, I., & Lacoste-Julien, S. (2020). Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence. arXiv preprint: arXiv:2002.10542.
Ma, S., Bassily, R., & Belkin, M. (2018). The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In Proceedings of the 35th international conference on machine learning, ICML.
Mahdavi, M., & Jin, R. (2013). MixedGrad: An O(1/T) convergence rate algorithm for stochastic smooth optimization. arXiv preprint arXiv:1307.7192.
Mairal, J. (2013). Optimization with first-order surrogate functions. In International conference on machine learning (pp. 783–791).
Meng, S. Y., Vaswani, S., Laradji, I., Schmidt, M., & Lacoste-Julien, S. (2020). Fast and furious convergence: Stochastic second order methods under interpolation. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS.
Moulines, E., & Bach, F. R. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems (NeurIPS).
Nesterov, Y. (2004). Introductory lectures on convex optimization: A basic course. Berlin: Springer.
Nguyen, L. M., Liu, J., Scheinberg, K., & Takáč, M. (2017). SARAH: a novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th international conference on machine learning (Vol. 70, pp. 2613–2621).
Pesme, S., Dieuleveut, A., & Flammarion, N. (2020). On convergence-diagnostic based step sizes for stochastic gradient descent. arXiv preprint arXiv:2007.00534.
Pflug, G. C. (1983). On the determination of the step size in stochastic quasigradient methods. Collaborative Paper CP-83-025, International Institute for Applied Systems Analysis (IIASA), Laxenburg, Austria.
Qian, Q., & Qian, X. (2019). The implicit bias of AdaGrad on separable data. arXiv preprint arXiv:1906.03559.
Reddi, S. J., Hefny, A., Sra, S., Poczos, B., & Smola, A. (2016). Stochastic variance reduction for nonconvex optimization. In International conference on machine learning (pp. 314–323).
Reddi, S. J., Kale, S., & Kumar, S. (2018). On the convergence of Adam and Beyond. In International conference on learning representations.
Schmidt, M., & Le Roux, N. (2013). Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint: arXiv:1308.6370.
Schmidt, M., Babanezhad, R., Ahmed, M., Defazio, A., Clifton, A., & Sarkar, A. (2015). Non-uniform stochastic average gradient method for training conditional random fields. In Proceedings of the eighteenth international conference on artificial intelligence and statistics, AISTATS.
Schmidt, M., Le Roux, N., & Bach, F. (2017). Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1–2), 83–112.
Sebbouh, O., Gazagnadou, N., Jelassi, S., Bach, F., & Gower, R. (2019). Towards closing the gap between the theory and practice of SVRG. In Advances in neural information processing systems (pp. 648–658).
Shalev-Shwartz, S., & Zhang, T. (2013). Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb), 567–599.
Song, C., Jiang, Y., & Ma, Y. (2020). Variance reduction via accelerated dual averaging for finite-sum optimization. Advances in Neural Information Processing Systems, 33.
Sridharan, K., Shalev-Shwartz, S., & Srebro, N. (2008). Fast rates for regularized objectives. Advances in Neural Information Processing Systems, 21, 1545–1552.
Tan, C., Ma, S., Dai, Y.-H., & Qian, Y. (2016). Barzilai-Borwein step size for stochastic gradient descent. arXiv preprint arXiv:1605.04131.
Vaswani, S., Bach, F., & Schmidt, M. (2019a). Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS.
Vaswani, S., Kunstner, F., Laradji, I., Meng, S. Y., Schmidt, M., & Lacoste-Julien, S. (2020). Adaptive gradient methods converge faster with over-parameterization (and you can do a line-search). arXiv preprint arXiv:2006.06835.
Vaswani, S., Mishkin, A., Laradji, I., Schmidt, M., Gidel, G., & Lacoste-Julien, S. (2019b). Painless stochastic gradient: Interpolation, line-search, and convergence rates. In Advances in Neural Information Processing Systems (NeurIPS).
Ward, R., Wu, X., & Bottou, L. (2019). AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. In Proceedings of the 36th international conference on machine learning, ICML.
Yaida, S. (2018). Fluctuation-dissipation relations for stochastic gradient descent. arXiv preprint arXiv:1810.00004.
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In 5th international conference on learning representations, ICLR.
Zhou, K., So, A. M.-C., & Cheng, J. (2021). Accelerating perturbed stochastic iterates in asynchronous lock-free optimization. arXiv preprint arXiv:2109.15292.
Funding
Benjamin Dubois-Taine would like to acknowledge funding by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). Mark Schmidt and Simon Lacoste-Julien would like to acknowledge support from the Canada CIFAR AI Chair Program.
Author information
Authors and Affiliations
Contributions
SV and RB conceived of the idea of combining SVRG and AdaGrad. SV, BD-T and RB proved the theoretical results. BD-T performed the experiments with support from SV. SV and BD-T wrote the manuscript. SL-J and MS supervised the project.
Corresponding authors
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Editors: Krzysztof Dembczynski and Emilie Devijver.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Organization of the Appendix
- A Definitions
- B Algorithm in general case
- C Proof of Lemma 1
- D Main Proposition
- E Proof of Theorem 1
- F Proof of Theorem 2
- G Proof of Theorem 3
- H Proof of Theorem 4
- I Helper lemmas
- J Counter-example for line-search for SVRG
- K Additional experiments
Definitions
Our main assumptions are that each individual function \(f_i\) is differentiable and \(L_i\)-smooth, meaning that for all v and w,
\(\left\| \nabla f_i(v) - \nabla f_i(w)\right\| \le L_i \left\| v - w\right\| ,\)
which also implies that f is \(L_{\max }\)-smooth, where \(L_{\max }\) is the maximum smoothness constant of the individual functions. We also assume that f is convex, meaning that for all v and w,
\(f(v) \ge f(w) + \langle \nabla f(w), v - w \rangle .\)
Algorithm in general case
We restate Algorithm 1 to handle the full matrix and diagonal variants. The first difference is in the initialization and update of \(G_t\). Moreover, we use a more general notion of projection, i.e., \(\Pi _{X, A_t}\left( \cdot \right)\) is the projection onto the set X with respect to the norm induced by a symmetric positive definite matrix \(A_t\) (such projections are common in adaptive gradient methods; Duchi et al., 2011; Levy et al., 2018; Reddi et al., 2018). Note that in the scalar variant, \(A_t\) is a scalar and thus \(\Pi _{X, A_t}\left( \cdot \right) = \Pi _{X}\left( \cdot \right)\) and we recover Algorithm 1.
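For reference, this projection takes the standard Mahalanobis form used in the cited works (stated here for completeness):
\(\Pi _{X, A}\left( y\right) :=\arg \min _{x \in X} \left\| x - y\right\| _{A}^2 = \arg \min _{x \in X} \, (x - y)^\top A (x - y).\)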

Proof of Lemma 1
We restate Lemma 1 to handle the three variants of AdaSVRG.
Lemma 2
(AdaSVRG with single outer-loop) Assume (i) convexity of f, (ii) \(L_{\max }\)-smoothness of \(f_i\) and (iii) a bounded feasible set with diameter D. For the scalar variant, defining \(\rho := \big ( \frac{D^2}{\eta _{k}} + 2\eta _{k}\big ) \sqrt{L_{{\max}}}\), for any outer loop k of AdaSVRG, with (a) inner-loop length \(m_k\) and (b) step-size \(\eta _{k}\),
For the full matrix and diagonal variants, setting \(\rho ':= \left( \frac{D^2}{\eta _k} + 2\eta _k \right) \sqrt{dL_{{\max}}}\),
Proof
For any of the three variants, we have, for any outer loop iteration k and any inner loop iteration t,
where the inequality follows from Reddi et al. (2018, Lemma 4). Dividing by \(\eta _k\), rearranging and summing over all inner loop iterations at stage k gives
By Lemma 4, we have that \(\textsf {Tr}(A_{m_k}) \le \sqrt{\sum _{t=1}^{m_k} \left\| g_t\right\| ^2}\) in the scalar case, and \(\textsf {Tr}(A_{m_k}) \le \sqrt{d}\sqrt{\sum _{t=1}^{m_k} \left\| g_t\right\| ^2 + d\delta }\) in the full matrix and diagonal variants. Therefore we set
and
Going back to the above inequality and taking expectation we get
Using convexity of f yields
where the second inequality comes from Jensen’s inequality applied to the (concave) square root function. Now, from Johnson and Zhang (2013, Proof of Theorem 1),
Going back to the previous equation, squaring and setting \(\tau = a' \sqrt{4 L_{{\max}}}\) we get
Using Lemma 5,
Finally, using Jensen’s inequality we get
which concludes the proof by noticing that by definition \(\tau = \rho\) in the scalar case and \(\tau = \rho '\) in the full matrix and diagonal cases. \(\square\)
Main proposition
We first state the main proposition for the three variants of AdaSVRG, which we later use for proving theorems.
Proposition 2
Assuming (i) convexity of f, (ii) \(L_{{\max}}\)-smoothness of f, (iii) a bounded feasible set, (iv) \(\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]\), and (v) \(m_k = m\) for all k, then for the scalar variant,
and for the full matrix and diagonal variants,
where
Proof
As in the previous proof we define
and
Using the result from Lemma 1 and letting \(\Delta _k :=\mathbb {E}[f(w_k) - f^*]\), we have,
Squaring gives
which we can rewrite as
Since \(\Delta _{k+1}^2 - 2\frac{\tau ^2}{m} \Delta _{k+1} = (\Delta _{k+1} - \frac{\tau ^2}{m})^2 - \frac{\tau ^4}{m^2}\), we get
Summing this gives
Using Jensen’s inequality on the (concave) square root function gives
going back to the previous inequality this gives
which we can rewrite
Setting \(\bar{w}_K = \frac{1}{K} \sum _{k=0}^{K-1} w_k\) and using Jensen’s inequality on the convex function f, we get
which concludes the proof. \(\square\)
Proof of Theorem 1
For the remainder of the appendix we define \(\rho ' :=\big ( \frac{D^2}{\eta _{{\min}}} + 2\eta _{{\max}}\big ) \sqrt{dL_{{\max}}}\). We restate and prove Theorem 1 for all three variants of AdaSVRG.
Theorem 5
(AdaSVRG with fixed-size inner-loop) Under the same assumptions as Lemma 1, AdaSVRG with (a) step-sizes \(\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]\), (b) inner-loop size \(m_k = n\) for all k, results in the following convergence rate after \(K \le n\) iterations. For the scalar variant,
and for the full matrix and diagonal variants,
where \(\bar{w}_K = \frac{1}{K} \sum _{k = 1}^{K} w_{k}\).
Proof
We have \(\frac{1}{m} = \frac{1}{n} \le \frac{1}{K}\) by assumption. Using the result of Proposition 2 for the scalar variant, we have
Using the result of Proposition 2 for the full matrix and diagonal variants we have
\(\square\)
Corollary 1
Under the assumptions of Theorem 1, the computational complexity of AdaSVRG to reach \(\epsilon\)-accuracy is \(O\left( \frac{n}{\epsilon }\right)\) when
for the scalar variant and when
for the full matrix and diagonal variants.
Proof
We deal with the scalar variant first. Let \(c = \rho ^2( 1 +\sqrt{5}) + \rho \sqrt{2\left( f(w_0) - f^*\right) }\). By the previous theorem, to reach \(\epsilon\)-accuracy we require
We thus require \(O\left( \frac{1}{\epsilon }\right)\) outer loops to reach \(\epsilon\)-accuracy. For \(m_k = n\), 3n gradients are computed in each outer loop (n for the full gradient at the snapshot and 2 for each of the n inner iterations), thus the computational complexity is indeed \(O\left( \frac{n}{\epsilon }\right)\).
The condition \(\epsilon \ge \frac{c}{n}\) follows from the assumption that \(K \le n\).
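In summary, for the scalar variant (a hedged recap, assuming the bound of Theorem 5 takes the form \(\mathbb{E}[f(\bar{w}_K) - f^*] \le c/K\) with c as defined above),
\[
\mathbb{E}\left[f(\bar{w}_K) - f^*\right] \le \frac{c}{K} \le \epsilon \quad \Longleftrightarrow \quad K \ge \frac{c}{\epsilon},
\]
so taking \(K = \lceil c/\epsilon \rceil\) costs \(3nK = O(n/\epsilon)\) gradient evaluations, and the constraint \(K \le n\) is satisfiable only when \(\epsilon \ge c/n\).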
The proof for the full matrix and diagonal variants is similar by taking \(c = (\rho ')^2( 1 +\sqrt{5}) + 2\rho ' \sqrt{ {d\delta }/{L_{{\max}}}} + \rho '\sqrt{2\left( f(w_0) - f^*\right) }\). \(\square\)
Proof of Theorem 2
Theorem 2 (Multi-stage AdaSVRG) Under the same assumptions as Theorem 1, multi-stage AdaSVRG with \(I = \log (1/\epsilon )\) stages, \(K \ge 3\) outer-loops and inner-loop length \(m_i = 2^{i+1}\) at stage i, requires \(O(n\log (1/\epsilon) + 1/\epsilon )\) gradient evaluations to reach a \(\left( \rho ^2 (1+\sqrt{5}) \right) \epsilon\)-sub-optimality.
Proof
We deal with the scalar variant first. Let \(\Delta _i:= \mathbb {E}[f(\bar{w}^i) - f^*]\) and \(c:= \rho ^2(1 + \sqrt{5})\). Suppose that \(K \ge 3\) and \(m_i = 2^{i+1}\), as in the theorem statement. We claim that for all i, we have \(\Delta _i \le \frac{c}{2^i}\). We prove this by induction. For \(i=0\), we have
The quantity \(\frac{D^2}{\eta } + 2\eta\) attains its minimum value \(2\sqrt{2}D\) at \(\eta ^* = \frac{D}{\sqrt{2}}\) (set the derivative \(-\frac{D^2}{\eta ^2} + 2\) to zero). Therefore we can write
Now suppose that \(\Delta _{i-1} \le \frac{c}{2^{i-1}}\) for some \(i\ge 1\). Using the upper-bound analysis of AdaSVRG in Proposition 2 we get
Since \(K \ge 3\), one can check that \(\frac{8}{(1+ \sqrt{5}) K} \le 1\) and thus
which concludes the induction step.
After \(I = \log \frac{1}{\epsilon }\) stages, we thus have \(\Delta _I \le \frac{c}{2^{I}} = c\epsilon\). All that is left is to compute the gradient complexity. If we assume that \(K = \gamma\) for some constant \(\gamma \ge 3\), the gradient complexity is given by
which concludes the proof for the scalar variant.
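As a sanity check on this last count (assuming, consistently with the accounting above, that each of the \(\gamma\) outer loops at stage i costs \(n + 2m_i\) gradient evaluations),
\[
\sum_{i=1}^{I} \gamma \left( n + 2 m_i \right) = \gamma \left( nI + \sum_{i=1}^{I} 2^{i+2} \right) = \gamma \left( nI + 2^{I+3} - 8 \right) = O\!\left( n \log \tfrac{1}{\epsilon} + \tfrac{1}{\epsilon} \right),
\]
since \(2^{I} = 1/\epsilon\) when \(I = \log_2 (1/\epsilon)\).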
Now we look at the full matrix and diagonal variants. Let \(c = (\rho ')^2(1 + \sqrt{5}) + 2\rho '\sqrt{ {d \delta }/{L_{{\max}}}}\). Suppose that \(K \ge 3\) and \(m_i = 2^{i+1}\), as in the theorem statement. Again, we claim that for all i, we have \(\Delta _i \le \frac{c}{2^i}\). We prove this by induction. For \(i=0\), we have
As before, the quantity \(\frac{D^2}{\eta } + 2\eta\) attains its minimum value \(2\sqrt{2}D\) at \(\eta ^* = \frac{D}{\sqrt{2}}\). Therefore we can write
The induction step is exactly the same as in the scalar case. \(\square\)
Proof of Theorem 3
We restate and prove Theorem 3 for the three variants of AdaSVRG.
Theorem 6
(AdaSVRG with adaptive-sized inner-loops) Under the same assumptions as Lemma 1, AdaSVRG with (a) step-sizes \(\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]\), (b1) inner-loop size \(m_k = \frac{\nu }{\epsilon }\) for all k or (b2) inner-loop size \(m_k = \frac{\nu }{f(w_k) - f^*}\) for outer-loop k, results in the following convergence rate,
where \(\nu = 4\rho ^2\) for the scalar variant and \(\nu = \left( \frac{2\rho ' + \sqrt{16 (\rho ')^2 + 12\rho ' \sqrt{ {d\delta }/{4L_{{\max}}}}}}{3}\right) ^2\) for the full matrix and diagonal variants.
Proof
Let us define
and
Similar to the proof of Lemma 1, for an outer loop k with \(m_k\) inner iterations and \(\alpha := \frac{1}{2}\big ( \frac{D^2}{\eta _{{\min}}} + 2\eta _{{\max}}\big )\), we can show
Using Lemma 5 we get
If we set \(m_k = \frac{C}{\epsilon _k}\),
Defining \(a :=\sqrt{4L_{{\max}}\alpha ^2}\) and \(\gamma :=\sqrt{b/4L_{{\max}}}\), dividing both sides of (59) by \(m_k\), and using the definition of \(w_{k+1}\),
Setting \(C=\left( \frac{2a+\sqrt{4a^2+12(a^2+a\gamma )}}{3}\right) ^2\), we get \(\epsilon _{k+1} \le \frac{3}{4} \epsilon _k\). However, the above proof requires knowing \(\epsilon _{k}\). Instead, let us assume a target error of \(\epsilon\), implying that we want to have \(\mathbb E [f(w_{K}) - f^* ] \le \epsilon\). Going back to (58) and setting \(m_k = C/\epsilon\), we obtain,
With \(C=\left( \frac{2a+\sqrt{4a^2+12(a^2+a\gamma )}}{3}\right) ^2\), we get linear convergence to \(\epsilon\)-suboptimality, so we require \(K = \mathcal O(\log (1/\epsilon ))\) outer loops. However, each outer loop needs \(\mathcal O(n+\frac{1}{\epsilon })\) gradient evaluations (n for the full gradient at the snapshot and \(2m_k = \mathcal O(1/\epsilon )\) for the inner loop). All in all, the total computational complexity is \(\mathcal {O} ((n+\frac{1}{\epsilon }) \log (\frac{1}{\epsilon }))\).
The proof is completed by noticing that in the scalar variant, \(\gamma = 0\) and \(a = \rho\) so that
and in the full matrix and diagonal variants, \(a = \rho '\) and \(\gamma = \sqrt{ {d\delta }/{4L_{{\max}}}}\) so that
\(\square\)
Proof of Theorem 4
We restate and prove Theorem 4 for the three variants of AdaGrad. To handle all three variants, we restate the theorem in terms of \(\Vert G_t\Vert _* = \sqrt{\textsf {Tr}\mathclose {\left(G_t\right)}}\). Note that this does not change the claim for the scalar variant, as in that case \(G_t = \textsf {Tr}\mathclose {\left(G_t\right)}\).
Theorem 7
(Phase Transition in AdaGrad Dynamics) Under the same assumptions as Lemma 1, with the additional assumption (iv) of \(\sigma ^2\)-bounded stochastic gradient variance, and defining \(T_0 = \frac{\rho ^2 L_{{\max}}}{\sigma ^2}\), constant step-size AdaGrad satisfies \(\mathbb {E}\Vert G_t\Vert _{*} = O(1) \text { for } t \le T_0\), and \(\mathbb {E}\Vert G_t\Vert _{*} = O (\sqrt{t-T_0}) \text { for } t \ge T_0\).
The same result holds for the full matrix and diagonal variants of constant step-size AdaGrad with \(T_0 = \frac{\left( \rho ' \sqrt{L_{{\max}}} + \sqrt{2\rho ' \sqrt{d\delta L_{{\max}}} + d\delta }\right) ^2}{\sigma ^2}\).
Proof
We start with the scalar variant. Consider the general AdaGrad update
As in the proof of Theorem 1, we can bound the suboptimality as
Re-arranging, dividing by \(\eta\) and summing over T iterations, we have
Define
and
We then have \(\textsf {Tr}\mathclose {\left(A_T\right)} \le \alpha \sqrt{\sum _{t=1}^T\Vert \nabla f_{i_t}(x_{t})\Vert ^2 + b }\) by Lemma 3 and Lemma 4. Going back to Eq. (66), taking expectations and using this upper bound, we get
Using convexity of f and Jensen’s inequality on the (concave) square root function, we have
Now,
Taking expectations, and since \(\mathbb {E}[\nabla f_{i_t}(x_t) - \nabla f(x_t)] = 0\), we have
Going back to the previous inequality we get
where we used smoothness in the last inequality. Now, if \(\sigma = 0\), namely \(\nabla f(x_t) = \nabla f_{i_t}(x_t)\), we have
Using Lemma 5 we get
so that
Now,
Thus we have
where we used smoothness for the inequality. This shows that \(\Vert G_T\Vert _*\) remains bounded in the deterministic case. Now, if \(\sigma \ne 0\), going back to Eq. (72) we have
Using Lemma 5 we get
We then have
from which we get
This implies that for \(T \le \frac{\left( 2 \alpha L_{{\max}} + \sqrt{2L_{{\max}} \alpha \sqrt{b} + b}\right) ^2}{\sigma ^2}\), we have
and for \(T\ge \frac{\left( 2 \alpha L_{{\max}} + \sqrt{2L_{{\max}} \alpha \sqrt{b} + b}\right) ^2}{\sigma ^2}\),
The proof is completed by noticing that
\(\square\)
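The two-phase behavior of Theorem 7 can also be observed numerically. The snippet below is a minimal simulation sketch for the scalar variant; the quadratic objective, noise level, step-size and projection radius are illustrative assumptions and not the paper's experimental setup.

```python
import numpy as np

# Constant step-size scalar AdaGrad on a noisy quadratic f(x) = 0.5 * ||x||^2.
rng = np.random.default_rng(1)
d, T, eta, sigma, radius = 10, 20000, 0.5, 0.05, 10.0
x = rng.normal(size=d)
G = 0.0                                    # scalar accumulator G_t
trace_norm = np.empty(T)

for t in range(T):
    g = x + sigma * rng.normal(size=d)     # stochastic gradient with additive noise
    G += g @ g                             # G_t = G_{t-1} + ||g_t||^2
    x = x - eta * g / np.sqrt(G)           # scalar AdaGrad step
    nrm = np.linalg.norm(x)
    if nrm > radius:                       # projection onto a ball of radius `radius`
        x *= radius / nrm
    trace_norm[t] = np.sqrt(G)             # ||G_t||_* = sqrt(Tr(G_t)) in the scalar case

# Early values are dominated by the optimization signal; once the iterates reach the
# noise floor, ||G_t||_* grows at roughly a sqrt(t) rate, as predicted by Theorem 7.
print(trace_norm[[100, 1000, 10000, T - 1]])
```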
Corollary 2
Under the same assumptions as Lemma 1, for outer-loop k of AdaSVRG with constant step-size \(\eta _k\), there exists \(T_0 = \frac{C}{f(w_k) - f^*}\) such that,
Proof
Using Theorem 7 with \(\sigma ^2 = f(w_{k}) - f^*\) gives us the result. \(\square\)
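Corollary 2 is what motivates ending an inner loop once \(\Vert G_t\Vert _*\) leaves its initial plateau. Below is one plausible way to operationalize this; the growth-ratio test, its threshold and the check interval are our own illustrative choices and not necessarily the heuristic used in the paper.

```python
import numpy as np

def adaptive_inner_loop(vr_grad, x0, eta, check_every=10, growth_tol=1.5, max_iters=100000):
    """Scalar-AdaGrad inner loop that stops once sqrt(G_t) starts growing noticeably.

    vr_grad(x): returns the variance-reduced gradient at x (assumed provided by the caller).
    The termination rule -- compare sqrt(G) across windows of `check_every` steps and stop
    when the ratio exceeds `growth_tol` -- is an illustrative heuristic.
    """
    x, G = x0.copy(), 0.0
    iterates, last_checkpoint = [], None
    for t in range(1, max_iters + 1):
        g = vr_grad(x)
        G += g @ g                                 # G_t = G_{t-1} + ||g_t||^2
        x = x - eta * g / np.sqrt(G)               # scalar AdaGrad step
        iterates.append(x.copy())
        if t % check_every == 0:
            current = np.sqrt(G)                   # ||G_t||_* in the scalar case
            if last_checkpoint is not None and current > growth_tol * last_checkpoint:
                break                              # plateau has ended; start a new outer loop
            last_checkpoint = current
    return np.mean(iterates, axis=0)               # averaged iterate = next snapshot
```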
Helper lemmas
We make use of the following helper lemmas from Vaswani et al. (2020), proved here for completeness.
Lemma 3
For any of the full matrix, diagonal and scalar versions, we have
Proof
For any of the three versions, we have by construction that \(A_t\) is non-decreasing, i.e. \(A_t - A_{t-1} \succeq 0\) (for the scalar version, we consider \(A_t\) as a matrix of dimension 1 for simplicity). We can then use the bounded feasible set assumption to get
We then upper-bound \(\lambda _{\max }\) by the trace and use the linearity of the trace to telescope the sum,
\(\square\)
Lemma 4
For any of the full matrix, diagonal and scalar versions, we have
Moreover, for the scalar version we have
and for the full matrix and diagonal version we have
Proof
We prove this by induction, starting with the base case \(m=1\).
For the full matrix version, \(A_1=(\delta I+g_1g_1^\top )^{ {1}/{2}}\) and we have
For the diagonal version \(A_1=(\delta I+{{\,\mathrm{diag}\,}}(g_1g_1^\top ))^{ {1}/{2}}\) we have
Since \(A_1^{-1}\) is diagonal, the diagonal elements of \(A_1^{-1} g_1 g_1^\top\) are the same as the diagonal elements of \(A_1^{-1} {{\,\mathrm{diag}\,}}(g_1 g_1^\top )\). Thus we get
For the scalar version \(A_1 = \left( g_1^\top g_1\right) ^{ {1}/{2}}\) and we have
Induction step: Suppose now that it holds for \(m-1\), i.e. \(\sum _{t=1}^{m-1} \left\| g_t\right\| ^2_{A_t^{-1}}\le 2 \textsf {Tr}\mathclose {\left(A_{m-1}\right)}\). We will show that it also holds for m.
For the full matrix version we have
We then use the fact that for any \(X \succeq Y \succeq 0\), we have (Duchi et al., 2011, Lemma 8)
As \(X = A_m^2\succeq Y = g_m g_m^\top \succeq 0\), we can apply the above inequality, and the claim holds for m.
For the diagonal version we have
As before, since \(A_m^{-1}\) is diagonal, we have that the diagonal elements of \(A_m^{-1} g_m g_m^\top\) are the same as the diagonal elements \(A_m^{-1} {{\,\mathrm{diag}\,}}(g_m g_m^\top )\). Thus we get
We can then again apply the result from Duchi et al. (2011, Lemma 8) with \(X = A_m^2 \succeq Y = {{\,\mathrm{diag}\,}}(g_m g_m^\top ) \succeq 0\), and we obtain the desired result.
For the scalar version, since \(A_m^{-1}\) is a scalar, we have by the induction hypothesis,
where the equality follows from the AdaGrad update. We can then again apply the result from Duchi et al. (2011, Lemma 8) with \(X = A_m^2 \ge Y = g_m^\top g_m \ge 0\), and we obtain the desired result.
Bound on the trace: Recall that \(A_m= G_m^{1/2}\). For the scalar version we have
For the diagonal and full matrix variants, we use Jensen’s inequality to get
where \(\lambda _j(G_m)\) denotes the j-th eigenvalue of \(G_m\).
For the full matrix version, we have
For the diagonal version, we have
which concludes the proof. \(\square\)
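A quick numerical sanity check of the bound \(\sum _{t=1}^{m} \left\| g_t\right\| ^2_{A_t^{-1}} \le 2 \textsf {Tr}\mathclose {\left(A_{m}\right)}\) for the diagonal variant (the Gaussian \(g_t\) and the value of \(\delta\) are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, delta = 5, 200, 1e-8
G = delta * np.ones(d)                 # diagonal accumulator, stored as a vector
lhs = 0.0
for _ in range(m):
    g = rng.normal(size=d)             # arbitrary "gradients" for the check
    G += g * g                         # G_t = G_{t-1} + diag(g_t g_t^T)
    A = np.sqrt(G)                     # A_t = G_t^{1/2} (diagonal)
    lhs += np.sum(g * g / A)           # ||g_t||^2 in the A_t^{-1} norm
rhs = 2 * np.sum(A)                    # 2 Tr(A_m)
print(lhs <= rhs, lhs, rhs)            # the inequality of Lemma 4 should hold
```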
Lemma 5
If \(x^2 \le a(x+b)\) for \(a\ge 0\) and \(b \ge 0\),
Proof
The starting point is the quadratic inequality \(x^2 - ax - ab \le 0\). Letting \(r_1 \le r_2\) be the (real) roots of this quadratic, the inequality holds if and only if \(x \in [r_1, r_2]\), so in particular \(x \le r_2\). The upper bound then follows by using \(\sqrt{a+b} \le \sqrt{a} +\sqrt{b}\), as detailed below.
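Spelling this out (our reconstruction of the root computation):
\[
x \le r_2 = \frac{a + \sqrt{a^2 + 4ab}}{2} = \frac{a}{2} + \sqrt{\frac{a^2}{4} + ab} \le \frac{a}{2} + \frac{a}{2} + \sqrt{ab} = a + \sqrt{ab},
\]
where the last inequality applies \(\sqrt{a+b} \le \sqrt{a} +\sqrt{b}\) with \(a^2/4\) and \(ab\) in place of a and b.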
\(\square\)
Counter-example for line-search for SVRG

Proposition 3
For any \(c> 0, \eta _{{\max}} > 0\), there exists a 1-dimensional function f whose minimizer is \(x^*=0\), and for which the following holds: If at any point of Algorithm 6, we have \(\mathclose {\left|x_t^k\right|} \in \big (0, \min \{ \frac{1}{c}, 1\}\big )\), then \(\mathclose {\left|x_{t+1}^k\right|} \ge \mathclose {\left|x_t^k\right|}\).
Proof
Define the following function
where \(a>0\) is a constant that will be determined later. We then have the following
The minimizer of f is 0, while the minimizers of \(f_1\) and \(f_2\) are 1 and -1, respectively. This symmetry will make the algorithm fail.
Now, as in the statement of the proposition, let \(\mathclose {\left|x_t^k\right|} \in \big (0, \min \{ \frac{1}{c}, 1\}\big )\). WLOG assume \(x_t^k > 0\); the other case is symmetric.
Case 1: \(i_t = 1\). Then we have
Observe that \(g_t > 0\). Since \(x_t^k < 1\) and the function \(f_1\) is strictly decreasing on \((-\infty , 1]\), moving in the direction \(-g_t\) from \(x_t^k\) can only increase the function value. Thus the Armijo line-search fails and yields \(\eta _t = 0\), so in this case \(x_{t+1}^k = x_t^k\).
Case 2: \(i_t = 2\). Then we have
The Armijo line search then reads
which we can rewrite as
Simplifying this gives
which simplifies even further to
Therefore, the Armijo line-search will return a step-size such that
Now, recall that by assumption we have \(x_t^k < 1/c\). Then \(1/x_t^k - c > 0\), which implies that
We now choose a. If a is such that \(1/a \le \eta _{{\max}}\), then by Eq. (89) we have
We then have
where the inequality comes from \(\eta _t \ge 1/a\) and the fact that \(x_t^k \ge 0\). Thus we indeed have \(\mathclose {\left|x_{t+1}^k\right|} \ge \mathclose {\left|x_t^k\right|}\). \(\square\)
Additional experiments
1.1 Poor performance of AdaGrad compared to variance reduction methods
See Fig. 4.
1.2 Additional experiments with batch-size = 64
Comparison of AdaSVRG against SVRG variants, SVRG-BB and SARAH for the squared loss with batch-size = 64. For the sensitivity-to-step-size plot to be readable, we limit the gradient norm to a maximum value of 10. In some cases, SVRG-BB diverged, and we remove the corresponding curves so as not to clutter the plots.
1.3 Studying the effect of the batch-size on the performance of AdaSVRG
See Figs. 8, 9, 10, 11, 12 and 13.
1.4 Evaluating the diagonal variant of AdaSVRG
1.5 Additional experiments on adaptivity to over-parameterization
See Fig. 16.