SVRG meets AdaGrad: painless variance reduction

Dubois-Taine, Benjamin; Vaswani, Sharan; Babanezhad, Reza; Schmidt, Mark; Lacoste-Julien, Simon

doi:10.1007/s10994-022-06265-x

SVRG meets AdaGrad: painless variance reduction

Published: 10 November 2022

Volume 111, pages 4359–4409, (2022)
Cite this article

Download PDF

Machine Learning Aims and scope Submit manuscript

SVRG meets AdaGrad: painless variance reduction

Download PDF

1276 Accesses
2 Altmetric
Explore all metrics

Abstract

Variance reduction (VR) methods for finite-sum minimization typically require the knowledge of problem-dependent constants that are often unknown and difficult to estimate. To address this, we use ideas from adaptive gradient methods to propose AdaSVRG, which is a more-robust variant of SVRG, a common VR method. AdaSVRG uses AdaGrad, a common adaptive gradient method, in the inner loop of SVRG, making it robust to the choice of step-size. When minimizing a sum of n smooth convex functions, we prove that a variant of AdaSVRG requires $\tilde{O}(n + 1/\epsilon )$ gradient evaluations to achieve an $O(\epsilon )$-suboptimality, matching the typical rate, but without needing to know problem-dependent constants. Next, we show that the dynamics of AdaGrad exhibit a two-phase behavior – the step-size remains approximately constant in the first phase, and then decreases at a $O\left( {1}/{\sqrt{t}}\right)$ rate. This result maybe of independent interest, and allows us to propose a heuristic that adaptively determines the length of each inner-loop in AdaSVRG. Via experiments on synthetic and real-world datasets, we validate the robustness and effectiveness of AdaSVRG, demonstrating its superior performance over standard and other “tune-free” VR methods.

Cocoercivity, smoothness and bias in variance-reduced stochastic gradient methods

Article Open access 13 April 2022

Unified Analysis of Stochastic Gradient Methods for Composite Convex and Smooth Optimization

Article 27 September 2023

A Variance Controlled Stochastic Method with Biased Estimation for Faster Non-convex Optimization

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Variance reduction (VR) methods (Schmidt et al., 2017; Konečnỳ & Richtárik, 2013; Mairal, 2013; Shalev-Shwartz & Zhang, 2013; Johnson & Zhang, 2013; Mahdavi & Jin, 2013; Konečnỳ & Richtárik, 2013; Defazio et al., 2014; Nguyen et al., 2017) have proven to be an important class of algorithms for stochastic optimization. These methods take advantage of the finite-sum structure prevalent in machine learning problems, and have improved convergence over stochastic gradient descent (SGD) and its variants (see Gower et al., 2020 for a recent survey). For example, when minimizing a finite sum of n strongly-convex, smooth functions with condition number $\kappa$, these methods typically require $O\left( (\kappa + n) \, \log (1/\epsilon ) \right)$ gradient evaluations to obtain an $\epsilon$-error. This improves upon the complexity of full-batch gradient descent (GD) that requires $O\left( \kappa n \, \log (1/\epsilon ) \right)$ gradient evaluations, and SGD that has an $O(\kappa /\epsilon )$ complexity. Moreover, there have been numerous VR methods that employ Nesterov acceleration (Allen-Zhu, 2017; Lan et al., 2019; Song et al., 2020) and can achieve even faster rates.

In order to guarantee convergence, VR methods require an easier-to-tune constant step-size, whereas SGD needs a decreasing step-size schedule. Consequently, VR methods are commonly used in practice, especially when training convex models such as logistic regression or conditional Markov random fields (Schmidt et al., 2015). However, all the above-mentioned VR methods require knowledge of the smoothness of the underlying function in order to set the step-size. The smoothness constant is often unknown and difficult to estimate in practice. Although we can obtain global upper-bounds on it for simple problems such as least squares regression, these bounds are usually too loose to be practically useful and result in sub-optimal performance. Consequently, implementing VR methods requires a computationally expensive search over a range of step-sizes. Furthermore, a constant step-size does not adapt to the function’s local smoothness and may lead to poor empirical performance.

Consequently, there have been a number of works that try to adapt the step-size in VR methods. Schmidt et al. (2017) and Mairal (2013) employ stochastic line-search procedures to set the step-size in VR algorithms. While they show promising empirical results using line-searches, these procedures have no theoretical convergence guarantees. Recent works (Tan et al., 2016; Li et al., 2020) propose to use the Barzilai-Borwein (BB) step-size (Barzilai & Borwein, 1988) in conjunction with two common VR algorithms—stochastic variance reduced gradient (SVRG) (Johnson & Zhang, 2013) and the stochastic recursive gradient algorithm (SARAH) (Nguyen et al., 2017). Both Tan et al. (2016) and Li et al. (2020) can automatically set the step-size without requiring the knowledge of problem-dependent constants. However, in order to prove theoretical guarantees for strongly-convex functions, these techniques require the knowledge of both the smoothness and strong-convexity parameters. In fact, their guarantees require using a small $O(1/\kappa ^2)$ step-size, a highly-suboptimal choice in practice. Consequently, there is a gap in the theory and practice of adaptive VR methods. To address this, we make the following contributions.

1.1 Background and contributions

1.1.1 SVRG meets AdaGrad

In Sect. 3 we use AdaGrad (Duchi et al., 2011; Levy et al., 2018), an adaptive gradient method, with stochastic variance reduction techniques. We focus on SVRG (Johnson & Zhang, 2013) and propose to use AdaGrad within its inner-loop. We analyze the convergence of the resulting AdaSVRG algorithm for minimizing convex functions (without strong-convexity). Using O(n) inner-loops for every outer-loop (a typical setting used in practice Babanezhad Harikandeh et al., 2015; Sebbouh et al., 2019), and any bounded step-size, we prove that AdaSVRG achieves an $\epsilon$-error (for $\epsilon = O(1/n)$) with $O(n/\epsilon )$ gradient evaluations (Theorem 1). This rate matches that of SVRG with a constant step-size and O(n) inner-loops (Reddi et al., 2016, Corollary 10). However, unlike Reddi et al. (2016), our result does not require knowledge of the smoothness constant in order to set the step-size. We note that other previous work (Cutkosky & Orabona, 2019; Liu et al., 2020) consider adaptive methods with variance reduction for non-convex minimization; however their algorithms still require knowledge of problem-dependent parameters.

1.1.2 Multi-stage AdaSVRG

We propose a multi-stage variant of AdaSVRG where each stage involves running AdaSVRG for a fixed number of inner and outer-loops. In particular, multi-stage AdaSVRG maintains a fixed-size outer-loop and doubles the length of the inner-loop across stages. We prove that it requires $O((n + 1/\epsilon ) \log (1/\epsilon ))$ gradient evaluations to reach an $O(\epsilon )$ error (Theorem 2). This improves upon the complexity of decreasing step-size SVRG that requires $O(n + \sqrt{n}/\epsilon )$ gradient evaluations (Reddi et al., 2016, Corollary 9); and matches the rate of SARAH (Nguyen et al., 2017).

After our work was made publicly available (Dubois-Taine et al., 2021), recent work (Zhou et al., 2021) improved upon our result by applying a similar idea to an accelerated variant of SVRG (Allen-Zhu, 2017). Their algorithm requires $\tilde{O}(n + \sqrt{n/\epsilon })$ gradient evaluations to obtain an $\epsilon$-error without the knowledge of problem-dependent constants.

1.1.3 AdaSVRG with adaptive termination

Instead of using a complex multi-stage procedure, we prove that AdaSVRG can also achieve the improved $O((n + 1/\epsilon ) \log (1/\epsilon ))$ gradient evaluation complexity by adaptively terminating its inner-loop (Sect. 4). However, the adaptive termination requires the knowledge of problem-dependent constants, limiting its practical use.

To address this, we use the favourable properties of AdaGrad to design a practical heuristic for adaptively terminating the inner-loop. Our technique for adaptive termination is related to heuristics (Pflug, 1983; Yaida, 2018; Lang et al., 2019; Pesme et al., 2020) that detect stalling for constant step-size SGD, and may be of independent interest. First, we show that when minimizing smooth convex losses, AdaGrad has a two-phase behaviour—a first “deterministic phase” where the step-size remains approximately constant followed by a second “stochastic” phase where the step-size decreases at an $O(1/\sqrt{t})$ rate (Theorem 4). We show that it is empirically possible to efficiently detect this phase transition and aim to terminate the AdaSVRG inner-loop when AdaGrad enters the stochastic phase.

1.1.4 Practical considerations and experimental evaluation

In Sect. 5, we describe some of the practical considerations for implementing AdaSVRG and the adaptive termination heuristic. We use standard real-world datasets to empirically verify the robustness and effectiveness of AdaSVRG. Across datasets, we demonstrate that AdaSVRG consistently outperforms variants of SVRG, SARAH and methods based on the BB step-size (Tan et al., 2016; Li et al., 2020).

1.1.5 Adaptivity to over-parameterization

Defazio and Bottou (2019) demonstrated the ineffectiveness of SVRG when training large over-parameterized models such as deep neural networks. We argue that this ineffectiveness can be partially explained by the interpolation property satisfied by over-parameterized models (Schmidt & Le Roux, 2013; Ma et al., 2018; Vaswani et al., 2019a). In the interpolation setting, SGD obtains an $O(1/\epsilon )$ gradient complexity when minimizing smooth convex functions (Vaswani et al., 2019a), thus out-performing typical VR methods. However, interpolation is rarely exactly satisfied in practice, and using SGD can result in oscillations around the solution. On the other hand, although VR methods have a slower convergence, they do not oscillate, regardless of interpolation. In Sect. 6, we use AdaGrad to exploit the (approximate) interpolation property, and employ the above heuristic to adaptively switch to AdaSVRG, thus avoiding oscillatory behaviour. We design synthetic problems controlling the extent of interpolation and show that the hybrid AdaGrad-AdaSVRG algorithm can match or outperform both stochastic gradient and VR methods, thus achieving the best of both worlds.

2 Problem setup

We consider the minimization of an objective $f:\mathbb {R}^d\rightarrow \mathbb {R}$ with a finite-sum structure,

$$\min _{w \in X} f(w)=\frac{1}{n}\sum _{i=1}^n f_i(w),$$

where X is a convex compact set of diameter D, meaning $\sup _{x, y \in X}\left\| x - y\right\| \le D$. Problems with this structure are prevalent in machine learning. For example, in supervised learning, n represents the number of training examples, and $f_i$ is the loss function when classifying or regressing to training example i. Throughout this paper, we assume f and each $f_i$ are differentiable. We assume that f is convex, implying that there exists a solution $w^{*}\in X$ that minimizes it, and define $f^*:= f(w^{*})$. Interestingly we do not need each $f_i$ to be convex. We further assume that each function $f_{i}$ in the finite-sum is $L_i$-smooth, implying that f is $L_{\max }$-smooth, where $L_{\max } = \max _{i} L_i$. We include the formal definitions of these properties in Appendix “Definitions”.

The classical method for solving such a problem is stochastic gradient descent (SGD). Starting from the iterate $x_0$, at each iteration t SGD samples (typically uniformly at random) a loss function $f_{i_t}$ and takes a step in the negative direction of the stochastic gradient $\nabla f_{i_t}(x_t)$ using a step-size $\eta _t$. This update can be expressed as

$$\begin{aligned} x_{t+1}&= x_t - \eta _t \nabla f_{i_t}(x_t) \end{aligned}$$

(1)

In order to ensure convergence to the minimizer, the sequence of step-sizes in SGD needs to be decreasing, typically at an $O(1/\sqrt{t})$ rate (Moulines & Bach, 2011). This has the effect of slowing down the convergence and results in an $\Theta (1/\sqrt{t})$ convergence to the minimizer for convex functions (compared to the O(1/t) convergence for gradient descent). Variance reduction methods were developed to overcome this slower convergence by exploiting the finite-sum structure of the objective.

We focus on the SVRG algorithm (Johnson & Zhang, 2013) since it is more memory efficient than other variance reduction alternatives like SAG (Schmidt et al., 2017) or SAGA (Defazio et al., 2014). SVRG has a nested inner-outer loop structure. In every outer-loop k, it computes the full gradient $\nabla f(w_{k})$ at a snapshot point $w_{k}$. An outer-loop k consists of $m_k$ inner-loops indexed by $t = 1, 2, \ldots m_k$ and the inner-loop iterate $x_1$ is initialized to $w_{k}$. In outer-loop k and inner-loop t, SVRG samples an example $i_t$ (typically uniformly at random) and takes a step in the direction of the variance-reduced gradient $g_t$ using a constant step-size $\eta$. This update can be expressed as:

$$\begin{aligned} g_t&= \nabla f_{i_t}(x_t) - \nabla f_{i_t}(w_{k}) + \nabla f(w_{k}) \nonumber \\ x_{t+1}&= \Pi _{X} [x_{t} - \eta \, g_t], \end{aligned}$$

(2)

where $\Pi _{X}$ denotes the Euclidean projection onto the set X. The variance-reduced gradient is unbiased, meaning that $\mathbb {E}_{i_t}[g_t \vert x_t] = \nabla f(x_t)$. At the end of the inner-loop, the next snapshot point is typically set to either the last or averaged iterate in the inner-loop.

SVRG requires the knowledge of both the strong-convexity and smoothness constants in order to set the step-size and the number of inner-loops. These requirements were relaxed in Hofmann et al. (2015), Kovalev et al. (2020), Gower et al. (2020) that only require knowledge of the smoothness.

In order to set the step-size for SVRG without requiring knowledge of the smoothness, line-search techniques are an attractive option. Such techniques are a common approach to automatically set the step-size for (stochastic) gradient descent (Armijo, 1966; Vaswani et al., 2019b). However, we show that an intuitive Armijo-like line-search to set the SVRG step-size is not guaranteed to converge to the solution. Specifically, we prove the following proposition in Appendix “Counter-example for line-search for SVRG”.

Proposition 1

If in each inner-loop t of SVRG, $\eta _t$ is set as the largest step-size satisfying the condition: $\eta _t \le \eta _{{\max}}$ and

$$\begin{aligned} f_{{i_t}}(x_{t} - \eta _t g_t) \le f_{{i_t}}(x_t) - c \eta _t \left\| g_t\right\| ^2 \quad {\text {where}} \quad (c >0), \end{aligned}$$

then for any $c > 0$, $\eta _{{\max}} > 0$, there exists a 1-dimensional convex smooth function such that if $\vert x_t - w^{*}\vert \le \min \{ \frac{1}{c}, 1\}$, then $\vert x_{t+1} - w^{*}\vert \ge \vert x_t - w^{*}\vert$, implying that the update moves the iterate away from the solution when it is close to it, preventing convergence.

In the next section, we suggest a novel approach using AdaGrad (Duchi et al., 2011) to propose AdaSVRG, a provably-convergent VR method that is more robust to the choice of step-size. To justify our decision to use AdaGrad, we note that in general, there are (roughly) three common ways of designing methods that do not require knowledge of problem-dependent constants: (i) BB step-size, but it still requires knowledge of $L_{{\max}}$ to guarantee convergence in the VR setting (Tan et al., 2016; Li et al., 2020), (ii) Line-search methods that can fail to converge in the VR setting (Proposition 1), (iii) Adaptive gradient methods such as AdaGrad.

3 Adaptive SVRG

Like SVRG, AdaSVRG has a nested inner-outer loop structure and relies on computing the full gradient in every outer-loop. However, it uses AdaGrad in the inner-loop, using the variance reduced gradient $g_t$ to update the preconditioner $A_t$ in the inner-loop t. AdaSVRG computes the step-size $\eta _k$ in every outer-loop (see Sect. 5 for details) and uses a preconditioned variance-reduced gradient step to update the inner-loop iterates: $x_{t+1} = \Pi _{X}\left( x_t - \eta _{k} A_t^{-1} g_t\right)$. AdaSVRG then sets the next snapshot $w_{k+1}$ to be the average of the inner-loop iterates.

We now analyze the convergence of AdaSVRG (see Algorithm 1 for the pseudo-code). Throughout the main paper, we will only focus on the scalar variant (Ward et al., 2019) of AdaGrad. We defer the general diagonal and matrix variants (see Appendix “Algorithm in general case” for the pseudo-code) and their corresponding theory to the Appendix. We start with the analysis of a single outer-loop, and prove the following lemma in Appendix “Proof of Lemma 1”.

Lemma 1

(AdaSVRG with single outer-loop) Assume (i) convexity of f, (ii) $L_{\max }$-smoothness of $f_i$ and (iii) bounded feasible set with diameter D. Defining $\rho := \big ( \frac{D^2}{\eta _{k}} + 2\eta _{k}\big ) \sqrt{L_{{\max}}}$, for any outer loop k of AdaSVRG, with (a) inner-loop length $m_k$ and (b) step-size $\eta _{k}$,

$$\begin{aligned} \mathbb {E}[f(w_{k+1}) - f^*] \le \frac{\rho ^2}{m_k} + \frac{\rho \sqrt{\mathbb {E}[f(w_k) - f^*]}}{\sqrt{m_k}} \end{aligned}$$

The proof of the above lemma leverages the theoretical results of AdaGrad (Duchi et al., 2011; Levy et al., 2018). Specifically, the standard AdaGrad analysis bounds the “noise” term by the variance in the stochastic gradients. On the other hand, we use the properties of the variance reduced gradient in order to upper-bound the noise in terms of the function suboptimality.

Lemma 1 shows that a single outer-loop of AdaSVRG converges to the minimizer as $O(1/\sqrt{m})$, where m is the number of inner-loops. This implies that in order to obtain an $\epsilon$-error, a single outer-loop of AdaSVRG requires $O(n + 1/\epsilon ^2)$ gradient evaluations. This result holds for any bounded step-size and requires setting $m = O(1/\epsilon ^2)$. This “single outer-loop convergence” property of AdaSVRG is unlike SVRG or any of its variants; running only a single-loop of SVRG is ineffective, as it stops making progress at some point, resulting in the iterates oscillating in a neighbourhood of the solution. The favourable behaviour of AdaSVRG is similar to SARAH, but unlike SARAH, the above result does not require computing a recursive gradient or knowing the smoothness constant.

Next, we consider the convergence of AdaSVRG with a fixed-size inner-loop and multiple outer-loops. In the following theorems, we assume that we have a bounded range of step-sizes implying that for all k, $\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]$. For brevity, similar to Lemma 1, we define $\rho :=\big ( \frac{D^2}{\eta _{{\min}}} + 2\eta _{{\max}}\big ) \sqrt{L_{{\max}}}$.

Theorem 1

(AdaSVRG with fixed-size inner-loop) Under the same assumptions as Lemma 1, AdaSVRG with (a) step-sizes $\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]$, (b) inner-loop size $m_k = n$ for all k, results in the following convergence rate after $K \le n$ iterations.

$$\begin{aligned} \mathbb {E}[f(\bar{w}_K) - f^*] \le \frac{\rho ^2 ( 1+ \sqrt{5}) + \rho \sqrt{2\left( f(w_0) - f^*\right) }}{K} \end{aligned}$$

where $\bar{w}_K = \frac{1}{K} \sum _{k = 1}^{K} w_{k}$.

The proof (refer to Appendix “Proof of Theorem 1”) recursively uses the result of Lemma 1 for K outer-loops.

The above result requires a fixed inner-loop size $m_k = n$, a setting typically used in practice (Babanezhad Harikandeh et al., 2015; Gower et al., 2020). Notice that the above result holds only when $K \le n$. Since K is the number of outer-loops, it is typically much smaller than n, the number of functions in the finite sum, justifying the theorem’s $K \le n$ requirement. Moreover, in the sense of generalization error, it is not necessary to optimize below an O(1/n) accuracy (Boucheron et al., 2005; Sridharan et al., 2008).

Theorem 1 implies that AdaSVRG can reach an $\epsilon$-error (for $\epsilon = \Omega (1/n)$) using $\mathcal O(n/\epsilon )$ gradient evaluations. This result matches the complexity of constant step-size SVRG (with $m_k = n$) of Reddi et al. (2016, Corollary 10) but without requiring the knowledge of the smoothness constant. However, unlike SVRG and SARAH, the convergence rate depends on the diameter D rather than $\left\| w_0 - w^{*}\right\|$, the initial distance to the solution. This dependence arises due to the use of AdaGrad in the inner-loop, and is necessary for adaptive gradient methods. Specifically, Cutkosky and Boahen (2017) prove that any adaptive (to problem-dependent constants) method will necessarily incur such a dependence on the diameter. Hence, such a diameter dependence can be considered to be the “cost” of the lack of knowledge of problem-dependent constants.

Since the above result only holds for $\epsilon = \Omega (1/n)$, we propose a multi-stage variant (Algorithm 2) of AdaSVRG that requires $O((n+1/\epsilon )\log (1/\epsilon ))$ gradient evaluations to attain an $O(\epsilon )$-error for any $\epsilon$. To reach a target suboptimality of $\epsilon$, we consider $I = \log (1/\epsilon )$ stages. For each stage i, Algorithm 2 uses a fixed number of outer-loops K and inner-loops $m^i$ with stage i is initialized to the output of the $(i-1)$-th stage. In Appendix “Proof of Theorem 2”, we prove the following rate for multi-stage AdaSVRG.

Theorem 2

(Multi-stage AdaSVRG) Under the same assumptions as Theorem 1, multi-stage AdaSVRG with $I = \log (1/\epsilon )$ stages, $K \ge 3$ outer-loops and $m^i = 2^{i+1}$ inner-loops at stage i, requires $O(n\log {1}/{\epsilon } + {1}/{\epsilon })$ gradient evaluations to reach a $\left( \rho ^2 (1+\sqrt{5}) \right) \epsilon$-sub-optimality.

We see that multi-stage AdaSVRG matches the convergence rate of SARAH (upto constants), but does so without requiring the knowledge of the smoothness constant to set the step-size. Observe that the number of inner-loops increases with the stage i.e. $m^i = 2^{i+1}$. The intuition behind this is that the convergence of AdaGrad (used in the k-th inner-loop of AdaSVRG) is slowed down by a “noise” term proportional to $f(w_k) - f^*$ (see Lemma 1). When this “noise” term is large in the earlier stages of multi-stage AdaSVRG, the inner-loops have to be short in order to maintain the overall $O(1/{\epsilon })$ convergence. However, as the stages progress and the suboptimality decreases, the “noise” term becomes smaller, and the algorithm can use longer inner-loops, which reduces the number of full gradient computations, resulting in the desired convergence rate.

Thus far, we have focused on using AdaSVRG with fixed-size inner-loops. Next, we consider variants that can adaptively determine the inner-loop size.

4 Adaptive termination of inner-loop

Recall that the convergence of a single outer-loop k of AdaSVRG (Lemma 1) is slowed down by the $\sqrt{ {\left( f(w_k) - f^*\right)}/{m_k}}$ term. Similar to the multi-stage variant, the suboptimality $f(w_k) - f^*$ decreases as AdaSVRG progresses. This allows the use of longer inner-loops as k increases, resulting in fewer full-gradient evaluations. We instantiate this idea by setting $m_k = O\left( {1}/{\left( f(w_k) - f^*\right) }\right)$. Since this choice requires the knowledge of $f(w_k) - f^*$, we alternatively consider using $m_k = O(1/\epsilon )$, where $\epsilon$ is the desired sub-optimality. We prove the following theorem in Appendix “Proof of Theorem 3”.

Theorem 3

(AdaSVRG with adaptive-sized inner-loops) Under the same assumptions as Lemma 1, AdaSVRG with (a) step-sizes $\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]$, (b1) inner-loop size $m_k = \frac{4\rho ^2}{\epsilon }$ for all k or (b2) inner-loop size $m_k = \frac{4\rho ^2}{f(w_k) - f^*}$ for outer-loop k, results in the following convergence rate,

$$\begin{aligned} \mathbb {E}[f(w_K) - f^*] \le (3/4)^K [f(w_0) - f^*]. \end{aligned}$$

The above result implies a linear convergence in the number of outer-loops, but each outer-loop requires $O(1/\epsilon )$ inner-loops. Hence, Theorem 3 implies that AdaSVRG with adaptive-sized inner-loops requires $O\left( (n + 1/\epsilon ) \log (1/\epsilon ) \right)$ gradient evaluations to reach an $\epsilon$-error. This improves upon the rate of SVRG and matches the convergence rate of SARAH that also requires inner-loops of length $O(1/\epsilon )$. Compared to Theorem 1 that has an average iterate convergence (for $\bar{w}_K$), Theorem 3 has the desired convergence for the last outer-loop iterate $w_{K}$ and also holds for any bounded sequence of step-sizes. However, unlike Theorem 1, this result (with either setting of $m_k$) requires the knowledge of problem-dependent constants in $\rho$.

To address this issue, we design a heuristic for adaptive termination in the next sections. We start by describing the two phase behaviour of AdaGrad and subsequently utilize it for adaptive termination in AdaSVRG.

4.1 Two phase behaviour of AdaGrad

Diagnostic tests (Pflug, 1983; Yaida, 2018; Lang et al., 2019; Pesme et al., 2020) study the behaviour of the SGD dynamics to automatically control its step-size. Similarly, designing the adaptive termination test requires characterizing the behaviour of AdaGrad used in the inner loop of AdaSVRG.

We first investigate the dynamics of constant step-size AdaGrad in the stochastic setting. Specifically, we monitor the evolution of $\sqrt{G_t}$ across iterations. We define $\sigma ^2 :=\sup _{x \in X} \mathbb E_{i} \Vert \nabla f_i(x) - \nabla f(x) \Vert ^2$ as a uniform upper-bound on the variance in the stochastic gradients for all iterates. We prove the following theorem showing that there exists an iteration $T_0$ when the evolution of $\sqrt{G_t}$ undergoes a phase transition. Note again that such a phase transition happens for the scalar, diagonal and matrix variants of AdaGrad. We only present the scalar result in the main paper and defer the other variants to the Appendix.

Theorem 4

(Phase Transition in AdaGrad Dynamics) Under the same assumptions as Lemma 1 and (iv) $\sigma ^2$-bounded stochastic gradient variance and defining $T_0 = \frac{\rho ^2 L_{{\max}}}{\sigma ^2}$, for constant step-size AdaGrad we have $\mathbb {E}[\sqrt{G_t}] = O(1) \text { for } t \le T_0$, and $\mathbb {E}[\sqrt{G_t}] = O (\sqrt{t-T_0}) \text { for } t \ge T_0$.

Theorem 4 (proved in Appendix “Proof of Theorem 4”) indicates that $G_t$ is bounded by a constant for all $t \le T_0$, implying that its rate of growth is slower than $\log (t)$. This implies that the step-size of AdaGrad is approximately constant (similar to gradient descent in the full-batch setting) in this first phase until iteration $T_0$. Indeed, if $\sigma = 0$, $T_0 = \infty$ and AdaGrad is always in this deterministic phase. This result generalizes (Qian & Qian, 2019, Theorem 3.1) that analyzes the diagonal variant of AdaGrad in the deterministic setting. After iteration $T_0$, the noise $\sigma ^2$ starts to dominate, and AdaGrad transitions into the stochastic phase where $G_t$ grows as O(t). In this phase, the step-size decreases as $O(1/\sqrt{t})$, resulting in slower convergence to the minimizer. AdaGrad thus results in an overall $O \left( {1}/{T} + {\sigma ^2}/{\sqrt{T}} \right)$ rate (Levy et al., 2018), where the first term corresponds to the deterministic phase and the second to the stochastic phase.

Since the exact detection of this phase transition is not possible, we design a heuristic to detect it without requiring the knowledge of problem-dependent constants.

4.2 Heuristic for adaptive termination

Similar to tests used to detect stalling for SGD (Pflug, 1983; Pesme et al., 2020), the proposed diagnostic test has a burn-in phase of ${n}/{2}$ inner-loop iterations that allows the initial AdaGrad dynamics to stabilize. After this burn-in phase, for every even iteration, we compute the ratio $R = \frac{G_{t} - G_{t/2}}{G_{t/2}}$. Given a threshold hyper-parameter $\theta$, the test terminates the inner-loop when $R \ge \theta$. In the first deterministic phase, since the growth of $G_{t}$ is slow, $G_{2t} \approx G_{t}$ and $R \approx 0$. In the stochastic phase, $G_{t} = O(t)$, and $R \approx 1$, justifying that the test can distinguish between the two phases. AdaSVRG with this test is fully specified in Algorithm 3. Experimentally, we use $\theta = 0.5$ to give an early indication of the phase transition.

5 Experimental evaluation of AdaSVRG

We first describe the practical considerations for implementing AdaSVRG and then evaluate its performance on real and synthetic datasets. The code to reproduce the experiments can be found at the following link: https://github.com/bpauld/AdaSVRG. We do not use projections in our experiments as these problems have an unconstrained $w^*$ with finite norm (we thus assume D is big enough to include it), and that we empirically observed that our iterates always stayed bounded, thus not requiring any projection.^{Footnote 1}

5.1 Implementing AdaSVRG

Though our theoretical results hold for any bounded sequence of step-sizes, its choice affects the practical performance of AdaGrad (Vaswani et al., 2020) (and hence AdaSVRG). Theoretically, the optimal step-size minimizing the bound in Lemma 1 is given by $\eta ^* = \frac{D}{\sqrt{2}}$. Since we do not have access to D, we use the following heuristic to set the step-size for each outer-loop of AdaSVRG. In outer-loop k, we approximate D by $\Vert w_{k}- w^{*}\Vert$, that can be bounded using the co-coercivity of smooth convex functions as $\Vert w_{k}- w^{*}\Vert \ge {1}/{L_{{\max}}} \Vert \nabla f(w_k) \Vert$ (Nesterov, 2004, Thm. 2.1.5 (2.1.8)). We have access to $\nabla f(w_k)$ for the current outer-loop, and store the value of $\nabla f(w_{k-1})$ in order to approximate the smoothness constant. Specifically, by co-coercivity, $L_{{\max}} \ge L_k :=\frac{\Vert \nabla f(w_k) - \nabla f(w_{k-1})\Vert }{\Vert w_k - w_{k-1}\Vert }$. Putting these together, $\eta _{k}= \frac{\Vert \nabla f(w_k)\Vert }{\sqrt{2} \, \max _{i=0, \dots , k} L_i}$.^{Footnote 2} Although a similar heuristic could be used to estimate $L_{{\max}}$ for SVRG or SARAH, the resulting step-size is larger than $1/L_{{\max}}$ implying that it would not have any theoretical guarantee, while our results hold for any bounded sequence of step-sizes. Although Algorithm 1 requires setting $w_{k+1}$ to be the average of the inner-loop iterates, we use the last-iterate and set $w_{k}= x_{m_k}$, as this is a more common choice (Johnson & Zhang, 2013; Tan et al., 2016) and results in better empirical performance. We compare two variants of AdaSVRG, with (i) fixed-size inner-loop Algorithm 1 and (ii) adaptive termination Algorithm 3. We handle a general batch-size b, and set $m = {n}/{b}$ for Algorithm 1. This is a common practical choice (Babanezhad Harikandeh et al., 2015; Gower et al., 2020; Kovalev et al., 2020). For Algorithm 3, the burn-in phase consists of ${n}/{2b}$ iterations and $M = {10n}/{b}$.

5.2 Evaluating AdaSVRG

In order to assess the effectiveness of AdaSVRG, we experiment with binary classification on standard LIBSVM datasets (Chang & Lin, 2011). In particular, we consider $\ell _2$-regularized problems (with regularization set to ${1}/{n}$) with three losses—logistic loss, the squared loss or the Huber loss. For each experiment we plot the median and standard deviation across 5 independent runs. In the main paper, we show the results for four of the datasets and relegate the results for the three others to Appendix “Additional experiments”. Similarly, we consider batch-sizes in the range [1, 8, 64, 128], but only show the results for $b = 64$ in the main paper.

We compare the AdaSVRG variants against SVRG (Johnson & Zhang, 2013), loopless-SVRG (Kovalev et al., 2020), SARAH (Nguyen et al., 2017), and SVRG-BB (Tan et al., 2016), the only other tune-free VR method.^{Footnote 3} Since each of these methods requires a step-size, we search over the grid $[10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 100]$, and select the best step-size for each algorithm and each experiment. As is common, we set $m = {n}/{b}$ for each of these methods. We note that though the theoretical results of SVRG-BB require a small $O(1/\kappa ^2)$ step-size and $O(\kappa ^2)$ inner-loops, Tan et al. (2016) recommends setting $m = O(n)$ in practice. Since AdaGrad results in the slower $O(1/\epsilon ^2)$ rate (Levy et al., 2018; Vaswani et al., 2020) compared to the $O(n + \frac{1}{\epsilon })$ rate of VR methods, we do not include it in the main paper. We demonstrate the poor performance of AdaGrad on two example datasets in Fig. 4 in Appendix “Additional experiments”.

We plot the gradient norm of the training objective (for the best step-size) against the number of gradient evaluations normalized by the number of examples. We show the results for the logistic loss (Fig. 1a), Huber loss (Fig. 1b), and squared loss (Fig. 2). Our results show that (i) both variants of AdaSVRG (without any step-size tuning) are competitive with the other best-tuned VR methods, often out-performing them or matching their performance; (ii) SVRG-BB often has an oscillatory behavior, even for the best step-size; and (iii) the performance of AdaSVRG with adaptive termination (that has superior theoretical complexity) is competitive with that of the practically useful O(n) fixed inner-loop setting.

In order to evaluate the effect of the step-size on a method’s performance, we plot the gradient norm after 50 outer-loops vs step-size for each of the competing methods. For the AdaSVRG variants, we set the step-size according to the heuristic described earlier. For the logistic loss (Fig. 1a), Huber loss (Fig. 1b) and squared loss (Fig. 2), we observe that (i) the performance of typical VR methods heavily depends on the choice of the step-size; (ii) the step-size corresponding to the minimum loss is different for each method, loss and dataset; and (iii) AdaSVRG with the step-size heuristic results in competitive performance. Additional results plotted in Figs. 5a, b, 6a, b, 7a, b, 8a–c, 9a–c, 10a–c, 11a–c, 12a–c, 13a–c in Appendix “Additional experiments” confirm that the good performance of AdaSVRG is consistent across losses, batch-sizes and datasets.

Finally, in Fig. 14a in Appendix “Evaluating the diagonal variant of AdaSVRG”, we give preliminary results benchmarking the performance of the diagonal variant of AdaSVRG and comparing it to the scalar variant. We do not compare to the full matrix variant since inverting a $d \times d$ matrix in each iteration makes it impractical for most machine learning tasks. Our results demonstrate that with the current heuristic for setting $\eta$, the performance of the diagonal variant does not significantly improve over the scalar variant. Moreover, since each iteration of the diagonal variant incurs an additional O(d) cost compared to the scalar variant, we did not conduct further experiments with the diagonal variant. In the future, we aim to develop robust heuristics for setting the step-size for this variant of AdaSVRG.

6 Heuristic for adaptivity to over-parameterization

In this section, we reason that the poor empirical performance of SVRG when training over-parameterized models (Defazio & Bottou, 2019) can be partially explained by the interpolation property (Schmidt & Le Roux, 2013; Ma et al., 2018; Vaswani et al., 2019a) satisfied by these models (Zhang et al., 2017). In particular, we focus on smooth convex losses, but assume that the model is capable of completely fitting the training data, and that $w^*$ lies in the interior of X. For example, these properties are simultaneously satisfied when minimizing the squared hinge-loss for linear classification on separable data or unregularized kernel regression (Belkin et al., 2019; Liang et al., 2020) with $\Vert w^*\Vert \le 1$.

Formally, the interpolation condition means that the gradient of each $f_i$ in the finite-sum converges to zero at an optimum. Additionally, we assume that each function $f_i$ has finite minimum $f_i^*$. If the overall objective f is minimized at $w^{*}$, $\nabla f(w^{*}) = 0$, then for all $f_{i}$ we have $\nabla f_{i}(w^{*}) = 0$. Since the interpolation property is rarely exactly satisfied in practice, we allow for a weaker version that uses $\zeta ^2 :=\mathbb {E}_i [f^* - f_i^*] \in [0,\infty )$ (Loizou et al., 2020; Vaswani et al., 2020) to measure the extent of the violation of interpolation. If $\zeta ^2 = 0$, interpolation is exactly satisfied.

When $\zeta ^2 = 0$, both constant step-size SGD and AdaGrad have a gradient complexity of $O(1/\epsilon )$ in the smooth convex setting (Schmidt & Le Roux, 2013; Vaswani et al., 2019a, 2020). In contrast, typical VR methods have an $\tilde{O}(n + \frac{1}{\epsilon })$ complexity. For example, both SVRG and AdaSVRG require computing the full gradient in every outer-loop, and will thus unavoidably suffer an $\Omega (n)$ cost. For large n, typical VR methods will thus be necessarily slower than SGD when training models that can exactly interpolate the data. This provides a partial explanation for the ineffectiveness of VR methods when training over-parameterized models. When $\zeta ^2 > 0$, AdaGrad has an $O(1/\epsilon + \zeta /\epsilon ^2)$ rate (Vaswani et al., 2020). Here $\zeta$, the violation of interpolation plays the role of noise and slows down the convergence to an $O(1/\epsilon ^2)$ rate. On the other hand, AdaSVRG results in an $\tilde{O}(n + 1/\epsilon )$ rate, regardless of $\zeta$.

Following the reasoning in Sect. 4, if an algorithm can detect the slower convergence of AdaGrad and switch from AdaGrad to AdaSVRG, it can attain a faster convergence rate. It is straightforward to show that AdaGrad has a a similar phase transition as Theorem 4 when interpolation is only approximately satisfied. This enables the use of the test in Sect. 4 to terminate AdaGrad and switch to AdaSVRG, resulting in the hybrid algorithm described in Algorithm 4. If the diagnostic test can detect the phase transition accurately, Algorithm 4 will attain an $O(1/\epsilon )$ convergence when interpolation is exactly satisfied (no switching in this case). When interpolation is only approximately satisfied, it will result in an $O(1/\epsilon )$ convergence for $\epsilon \ge \zeta$ (corresponding to the AdaGrad rate in the deterministic phase) and will attain an $O(1/\zeta ^2 + ((n + 1/\epsilon ) \log (\zeta /\epsilon ))$ convergence thereafter (corresponding to the AdaSVRG rate). This implies that Algorithm 4 can indeed obtain the best of both worlds between AdaGrad and AdaSVRG.

6.1 Evaluating Algorithm 4

We use synthetic experiments to demonstrate the effect of interpolation on the convergence of stochastic and VR methods. Following the protocol in Meng et al. (2020), we generate a linearly separable dataset with $n=10^4$ data points of dimension $d=200$ and train a linear model with a convex loss. This setup ensures that interpolation is satisfied, but allows to eliminate other confounding factors such as non-convexity and other implementation details. In order to smoothly violate interpolation, we show results with a mislabel fraction of points in the grid [0, 0.1, 0.2].

We use AdaGrad as a representative (fully) stochastic method, and to eliminate possible confounding because of its step-size, we set it using the stochastic line-search procedure (Vaswani et al., 2020). We compare the performance of AdaGrad, SVRG, AdaSVRG and the hybrid AdaGrad-AdaSVRG (Algorithm 4) each with a budget of 50 epochs (passes over the data). For SVRG, as before, we choose the best step-size via a grid-search. For AdaSVRG, we use the fixed-size inner-loop variant and the step-size heuristic described earlier. In order to evaluate the quality of the “switching” metric in Algorithm 4, we compare against a hybrid method referred to as “Optimal Manual Switching” in the plots. This method runs a grid-search over switching points—after epoch $\{1, 2, \ldots , 50\}$ and chooses the point that results in the minimum loss after 50 epochs.

In Fig. 3, we plot the results for the logistic loss using a batch-size of 64 (refer to Figs. 15a–c, 16a–d in Appendix “Additional experiments on adaptivity to over-parameterization” for other losses and batch-sizes). We observe that (i) when interpolation is exactly satisfied (no mislabeling), AdaGrad results in superior performance over SVRG and AdaSVRG, confirming the theory in Sect. 6. In this case, both the optimal manual switching and Algorithm 4 do not switch; (ii) when interpolation is not exactly satisfied (with $10\%, 20\%$ mislabeling), the AdaGrad progress slows down to a stall in a neighbourhood of the solution, whereas both SVRG and AdaSVRG converge to the solution; (iii) in both cases, Algorithm 4 detects the slowdown in AdaGrad and switches to AdaSVRG, resulting in competitive performance with the optimal manual switching. For all three datasets, Algorithm 4 matches or out-performs the better of AdaGrad and AdaSVRG, showing that it can achieve the best-of-both-worlds.

7 Discussion

Although there have been numerous papers on VR methods in the past ten years, all of the provably convergent methods require knowledge of problem-dependent constants such as L. On the other hand, there has been substantial progress in designing adaptive gradient methods that have effectively replaced SGD for training ML models. Unfortunately, this progress has not been leveraged for developing better VR methods. Our work is the first to marry these lines of literature by designing AdaSVRG, that achieves a gradient complexity comparable to typical VR methods, but without needing to know the objective’s smoothness constant. Our results illustrate that it is possible to design principled techniques that can “painlessly” reduce the variance, achieving good theoretical and practical performance. We believe that our paper will help open up an exciting research direction. In the future, we aim to extend our theory to the strongly-convex setting.

Availability of data and materials

Experiments were done on the publicly available LIBSVM datasets (Chang & Lin, 2011).

Code availability

Full code to replicate the experiments can be found at https://github.com/bpauld/AdaSVRG.

Notes

We note in passing that the literature for unconstrained stochastic optimization often explicitly assumes that the iterates stay bounded (Ahn et al., 2020; Bollapragada et al., 2019; Babanezhad Harikandeh et al., 2015).
For $k = 0$, we compute the full gradient at a random point $w_{-1}$ and approximate $L_0$ in the same way.
We do not compare against SAG (Schmidt et al., 2017) because of its large memory footprint.

References

Ahn, K., Yun, C., & Sra, S. (2020). SGD with shuffling: Optimal rates without component convexity and large epoch requirements. In Neural information processing systems 2020, NeurIPS 2020.
Allen-Zhu, Z. (2017). Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th annual ACM SIGACT symposium on theory of computing, STOC.
Armijo, L. (1966). Minimization of functions having lipschitz continuous first partial derivatives. Pacific Journal of mathematics, 16(1), 1–3.
Article MathSciNet MATH Google Scholar
Babanezhad Harikandeh, R., Ahmed, M. O., Virani, A., Schmidt, M., Konečnỳ, J., & Sallinen, S. (2015). Stop wasting my gradients: Practical SVRG. Advances in Neural Information Processing Systems, 28, 2251–2259.
Google Scholar
Barzilai, J., & Borwein, J. M. (1988). Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1), 141–148.
Article MathSciNet MATH Google Scholar
Belkin, M., Rakhlin, A., & Tsybakov, A. B. (2019). Does data interpolation contradict statistical optimality? In The 22nd international conference on artificial intelligence and statistics (pp. 1611–1619). PMLR.
Bollapragada, R., Byrd, R. H., & Nocedal, J. (2019). Exact and inexact subsampled newton methods for optimization. IMA Journal of Numerical Analysis, 39(2), 545–578.
Article MathSciNet MATH Google Scholar
Boucheron, S., Bousquet, O., & Lugosi, G. (2005). Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9, 323–375.
Article MathSciNet MATH Google Scholar
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 1–27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Cutkosky, A., & Boahen, K. (2017). Online convex optimization with unconstrained domains and losses. arXiv preprint arXiv:1703.02622.
Cutkosky, A., & Orabona, F. (2019). Momentum-based variance reduction in non-convex SGD. arXiv preprint arXiv:1905.10018.
Defazio, A., & Bottou, L. (2019). On the ineffectiveness of variance reduced optimization for deep learning. In NeurIPS: In advances in neural information processing systems.
Defazio, A., Bach, F., & Lacoste-Julien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NeurIPS: In Advances in neural information processing systems.
Dubois-Taine, B., Vaswani, S., Babanezhad, R., Schmidt, M., & Lacoste-Julien, S. (2021). Svrg meets adagrad: Painless variance reduction. arXiv preprint arXiv:2102.09645.
Duchi, J. C., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12, 2121–2159.
MathSciNet MATH Google Scholar
Gower, R. M., Schmidt, M., Bach, F., & Richtarik, P. (2020). Variance-reduced methods for machine learning. Proceedings of the IEEE, 108(11), 1968–1983.
Article Google Scholar
Hofmann, T., Lucchi, A., Lacoste-Julien, S., & McWilliams, B. (2015). Variance reduced stochastic gradient descent with neighbors. Advances in Neural Information Processing Systems, 28, 2305–2313.
Google Scholar
Johnson, R., & Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In NeurIPS: In advances in neural information processing systems.
Konečnỳ, J., & Richtárik, P. (2013). Semi-stochastic gradient descent methods. arXiv preprint arXiv:1312.1666.
Kovalev, D., Horváth, S., & Richtárik, P. (2020). Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In Algorithmic learning theory (pp. 451–467). PMLR.
Lan, G., Li, Z., & Zhou, Y. (2019). A unified variance-reduced accelerated gradient method for convex optimization. In Advances in neural information processing systems (pp. 10462–10472).
Lang, H., Xiao, L., & Zhang, P. (2019). Using statistics to automate stochastic optimization. In Advances in neural information processing systems (pp. 9540–9550).
Levy, K. Y., Yurtsever, A., & Cevher, V. (2018). Online adaptive methods, universality and acceleration. In NeurIPS: In advances in neural information processing systems.
Li, B., Wang, L., & Giannakis, G. B. (2020). Almost tune-free variance reduction. In International conference on machine learning (pp. 5969–5978). PMLR.
Liang, T., Rakhlin, A., et al. (2020). Just interpolate: Kernel “ridgeless’’ regression can generalize. Annals of Statistics, 48(3), 1329–1347.
Article MathSciNet MATH Google Scholar
Liu, M., Zhang, W., Orabona, F., & Yang, T. (2020). Adam+: A stochastic method with adaptive variance reduction. arXiv preprint arXiv:2011.11985.
Loizou, N., Vaswani, S., Laradji, I., & Lacoste-Julien, S. (2020). Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence. arXiv preprint: arXiv:2002.10542.
Ma, S., Bassily, R., & Belkin, M. (2018). The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In Proceedings of the 35th international conference on machine learning, ICML.
Mahdavi, M., & Jin, R. (2013). MixedGrad: An O(1/T) convergence rate algorithm for stochastic smooth optimization. arXiv preprint arXiv:1307.7192.
Mairal, J. (2013). Optimization with first-order surrogate functions. In International conference on machine learning (pp. 783–791).
Meng, S. Y., Vaswani, S., Laradji, I., Schmidt, M., & Lacoste-Julien, S. (2020). Fast and furious convergence: Stochastic second order methods under interpolation. In The 23nd international conference on artificial intelligence and statistics, AISTATS.
Moulines, E., & Bach, F. R. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In NeurIPS: In advances in neural information processing systems.
Nesterov, Y. (2004). Introductory lectures on convex optimization: A basic course. Berlin: Springer.
Book MATH Google Scholar
Nguyen, L. M., Liu, J., Scheinberg, K., & Takáč, M. (2017). SARAH: a novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th international conference on machine learning (Vol. 70, pp. 2613–2621).
Pesme, S., Dieuleveut, A., & Flammarion, N. (2020). On convergence-diagnostic based step sizes for stochastic gradient descent. arXiv preprint arXiv:2007.00534.
Pflug, G. C. (1983). On the determination of the step size in stochastic quasigradient methods. collaborative paper cp-83-025. International Institute for Applied Systems Analysis (IIASA), Laxenburg, Austria.
Qian, Q., & Qian, X. (2019). The implicit bias of adagrad on separable data. arXiv preprint arXiv:1906.03559.
Reddi, S. J., Hefny, A., Sra, S., Poczos, B., & Smola, A. (2016). Stochastic variance reduction for nonconvex optimization. In International conference on machine learning (pp. 314–323).
Reddi, S. J., Kale, S., & Kumar, S. (2018). On the convergence of Adam and Beyond. In International conference on learning representations.
Schmidt, M., & Le Roux, N. (2013). Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint: arXiv:1308.6370.
Schmidt, M., Babanezhad, R., Ahmed, M., Defazio, A., Clifton, A., & Sarkar, A. (2015). Non-uniform stochastic average gradient method for training conditional random fields. In Proceedings of the eighteenth international conference on artificial intelligence and statistics, AISTATS.
Schmidt, M., Le Roux, N., & Bach, F. (2017). Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1–2), 83–112.
Article MathSciNet MATH Google Scholar
Sebbouh, O., Gazagnadou, N., Jelassi, S., Bach, F., & Gower, R. (2019). Towards closing the gap between the theory and practice of SVRG. In Advances in neural information processing systems (pp. 648–658).
Shalev-Shwartz, S., & Zhang, T. (2013). Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb), 567–599.
MathSciNet MATH Google Scholar
Song, C., Jiang, Y., & Ma, Y. (2020). Variance reduction via accelerated dual averaging for finite-sum optimization. Advances in Neural Information Processing Systems, 33.
Sridharan, K., Shalev-Shwartz, S., & Srebro, N. (2008). Fast rates for regularized objectives. Advances in Neural Information Processing Systems, 21, 1545–1552.
Google Scholar
Tan, C., Ma, S., Dai, Y.-H., & Qian, Y. (2016). Barzilai-Borwein step size for stochastic gradient descent. arXiv preprint arXiv:1605.04131.
Vaswani, S., Bach, F., & Schmidt, M. (2019a). Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS.
Vaswani, S., Kunstner, F., Laradji, I., Meng, S. Y., Schmidt, M., & Lacoste-Julien, S. (2020). Adaptive gradient methods converge faster with over-parameterization (and you can do a line-search). arXiv preprint arXiv:2006.06835.
Vaswani, S., Mishkin, A., Laradji, I., Schmidt, M., Gidel, G., & Lacoste-Julien, S. (2019). Painless stochastic gradient: Interpolation, line-search, and convergence rates. In NeurIPS: In advances in neural information processing systems.
Ward, R., Wu, X., & Bottou, L. (2019). AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. In Proceedings of the 36th international conference on machine learning, ICML.
Yaida, S. (2018). Fluctuation-dissipation relations for stochastic gradient descent. arXiv preprint arXiv:1810.00004.
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In 5th international conference on learning representations, ICLR.
Zhou, K., So, A. M.-C., & Cheng, J. (2021). Accelerating perturbed stochastic iterates in asynchronous lock-free optimization. arXiv preprint arXiv:2109.15292.

Download references

Funding

Benjamin Dubois-Taine would like to acknowledge funding by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). Mark Schmidt and Simon Lacoste-Julien would like to acknowledge support from the Canada CIFAR AI Chair Program.

Author information

Benjamin Dubois-Taine and Sharan Vaswani have contributed equally to this work.

Authors and Affiliations

CNRS, ENS Paris & Inria, Paris, France
Benjamin Dubois-Taine
Simon Fraser University, Burnaby, Canada
Sharan Vaswani
SAIT AI Lab, Montreal, Canada
Reza Babanezhad
University of British Columbia, Vancouver, Canada
Mark Schmidt
Mila, Université de Montréal, Montreal, Canada
Simon Lacoste-Julien
Canada CIFAR AI Chair, Montreal, Canada
Mark Schmidt & Simon Lacoste-Julien

Authors

Benjamin Dubois-Taine
View author publications
You can also search for this author in PubMed Google Scholar
Sharan Vaswani
View author publications
You can also search for this author in PubMed Google Scholar
Reza Babanezhad
View author publications
You can also search for this author in PubMed Google Scholar
Mark Schmidt
View author publications
You can also search for this author in PubMed Google Scholar
Simon Lacoste-Julien
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

SV and RB conceived of the idea of combining SVRG and AdaGrad. SV, BD-T and RB proved the theoretical results. BD-T performed the experiments with support from SV. SV and BD-T wrote down the manuscript. SL-J and MS supervised the project.

Corresponding authors

Correspondence to Benjamin Dubois-Taine or Sharan Vaswani.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Editor: Krzysztof Dembczynski and Emilie Devijver.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Organization of the Appendix

A
Definitions
B
Algorithm in general case
C
Proof of Lemma 1
D
Main Proposition
E
Proof of Theorem 1
F
Proof of Theorem 2
G
Proof of Theorem 3
H
Proof of Theorem 4
I
Helper lemmas
J
Counter-example for line-search for SVRG
K
Additional experiments

Definitions

Our main assumptions are that each individual function $f_i$ is differentiable and $L_i$-smooth, meaning that for all v and w,

$$\begin{aligned} f_i(v)&\le f_i(w) + \langle \nabla f_i(w),\, v - w \rangle + \frac{L_i}{2} \left\| v - w\right\| ^{2}, \end{aligned}$$

(Individual smoothness)

which also implies that f is $L_{\max }$-smooth, where $L_{\max }$ is the maximum smoothness constant of the individual functions. We also assume that f is convex, meaning that for all v and w,

$$\begin{aligned} f(v)&\ge f(w) - \langle \nabla f(w), w-v\rangle . \end{aligned}$$

(Convexity)

Algorithm in general case

We restate Algorithm 1 to handle the full matrix and diagonal variants. The first difference is in the initialization and update of $G_t$. Moreover, we use a more general notion of projection, i.e. $\Pi _{X, A_t}\left( \cdot \right)$ is the projection onto set X with respect to the norm induced by a symmetric positive definite matrix $A_t$ (such projections are common to adaptive gradient methods Duchi et al., 2011; Levy et al., 2018; Reddi et al., 2018). Note that in the scalar variant, $A_t$ is a scalar and thus $\Pi _{X, A_t}\left( \cdot \right) = \Pi _{X}\left( \cdot \right)$ and we recover Algorithm 1.

Proof of Lemma 1

We restate Lemma 1 to handle the three variants of AdaSVRG.

Lemma 2

(AdaSVRG with single outer-loop) Assuming (i) convexity of f, (ii) $L_{\max }$-smoothness of $f_i$ and (iii) bounded feasible set with diameter D. For the scalar variant, defining $\rho := \big ( \frac{D^2}{\eta _{k}} + 2\eta _{k}\big ) \sqrt{L_{{\max}}}$, for any outer loop k of AdaSVRG, with (a) inner-loop length $m_k$ and (b) step-size $\eta _{k}$,

$$\begin{aligned} \mathbb {E}[f(w_{k+1}) - f^*] \le \frac{\rho ^2}{m_k} + \frac{\rho \sqrt{\mathbb {E}[f(w_k) - f^*]}}{\sqrt{m_k}} \end{aligned}$$

For the full matrix and diagonal variants, setting $\rho ':= \left( \frac{D^2}{\eta _k} + 2\eta _k \right) \sqrt{dL_{{\max}}}$,

$$\begin{aligned} \mathbb {E}[ f(w_{k+1}) - f^*] \le \frac{(\rho ')^2 + \sqrt{ {d\delta }/{4L_{{\max}}}}}{m_k} + \frac{\rho ' \sqrt{\mathbb {E}[f(w_k) - f^*]}}{\sqrt{m_k}} \end{aligned}$$

Proof

For any of the three variants, we have, for any outer loop iteration k and any inner loop iteration t,

$$\begin{aligned} \left\| x_{t+1} - w^*\right\| ^2_{A_t}&= \left\| P_{X, A_t}(x_t - \eta _k A_t^{-1} g_k) - P_{X, A_t}(w^{*})\right\| _{A_t}^2 \end{aligned}$$

(3)

$$\begin{aligned}&\le \left\| x_t - \eta _k A^{-1}_t g_k - w^{*}\right\| ^2_{A_t} \end{aligned}$$

(4)

$$\begin{aligned}&= \left\| x_t - w^*\right\| ^2_{A_t} - 2\eta _k \langle g_t,\, x_t - w^* \rangle + \eta _k^2 \left\| g_t\right\| ^2_{A_t^{-1}} \end{aligned}$$

(5)

where the inequality follows from Reddi et al. (2018, Lemma 4). Dividing by $\eta _k$, rearranging and summing over all inner loop iterations at stage k gives

$$\begin{aligned} 2 \sum _{t=1}^{m_k} \langle g_t,\, x_t - w^* \rangle \le \sum _{t=1}^{m_k} \left\| x_t - w^*\right\| ^2_{\frac{A_t}{\eta _k} - \frac{A_{t-1}}{\eta _k}} + \sum _{t=1}^{m_k} \eta _k \left\| g_t\right\| ^2_{A_t^{-1}} \end{aligned}$$

(6)

$$\begin{aligned} \le \frac{D^2}{\eta _k}\mathrm {Tr}(A_{m_k}) + 2\eta _k \mathrm {Tr}(A_{m_k}) \end{aligned}$$

(Lemma 3 and Lemma 4)

$$\begin{aligned} = \left( \frac{D^2}{\eta _k} + 2\eta _k \right) \mathrm {Tr}(A_{m_k}) \end{aligned}$$

(7)

By Lemma 4, we have that $\textsf {Tr}\mathclose {\left(A_{m_k}\right)} \le \sqrt{\sum _{t=1}^{m_k} \left\| g_t\right\| ^2}$ in the scalar case, and $\textsf {Tr}\mathclose {\left(A_{m_k}\right)} \le \sqrt{d}\sqrt{\sum _{t=1}^{m_k} \left\| g_t\right\| ^2 + d\delta }$ in the full matrix and diagonal variants. Therefore we set

$$\begin{aligned} a' = {\left\{ \begin{array}{ll} \frac{1}{2}\big ( \frac{D^2}{\eta _k} + 2\eta _k\big ) &{} \text {(scalar variant)}\\ \frac{1}{2}\big ( \frac{D^2}{\eta _k} + 2\eta _k\big )\sqrt{d} &{}\text {(full matrix and diagonal variants)} \end{array}\right. } \end{aligned}$$

and

$$\begin{aligned} b = {\left\{ \begin{array}{ll} 0 &{} \text {(scalar variant)}\\ d \delta &{}\text {(full matrix and diagonal variants)} \end{array}\right. } \end{aligned}$$

Going back to the above inequality and taking expectation we get

$$\begin{aligned} \sum _{t=1}^{m_k} \langle \nabla f(x_t),\, x_t - w^* \rangle&\le a' \, \mathbb {E}\left[ \sqrt{\sum _{t=1}^{m_k} \left\| g_t\right\| ^2 + b} \right] \end{aligned}$$

(8)

Using convexity of f yields

$$\begin{aligned} \sum _{t=1}^{m_k} \mathbb {E}[f(x_t) - f^*]&\le a' \mathbb {E}\bigg [ \sqrt{\sum _{t=1}^{m_k} \left\| g_t\right\| ^2+ b}\bigg ] \end{aligned}$$

(9)

$$\begin{aligned}&\le a' \sqrt{\mathbb {E}\big [ \sum _{t=1}^{m_k} \left\| g_t\right\| ^2+ b\big ]} \end{aligned}$$

(10)

$$\begin{aligned}&= a' \sqrt{\sum _{t=1}^{m_k} \mathbb {E}[\left\| g_t\right\| ^2] + b} \end{aligned}$$

(11)

where the second inequality comes from Jensen’s inequality applied to the (concave) square root function. Now, from Johnson and Zhang (2013, Proof of Theorem 1),

$$\begin{aligned} \mathbb {E}[\left\| g_t\right\| ^2] \le 4L_{{\max}} \mathbb {E}[f(w_k) - f^*] + 4L_{{\max}}\mathbb {E}[f(x_t) - f^*] \end{aligned}$$

(12)

Going back to the previous equation, squaring and setting $\tau = a' \sqrt{4 L_{{\max}}}$ we get

$$\begin{aligned} \left( \sum _{t=1}^{m_k} \mathbb {E}[f(x_t) - f^*]\right) ^2 \le \tau ^2 \left( \sum _{t=1}^{m_k} \mathbb {E}[f(x_t) - f^*] + m_k\mathbb {E}[f(w_k) - f^*] + \frac{b}{4 L_{{\max}}} \right) \end{aligned}$$

(13)

Using Lemma 5,

$$\begin{aligned} \sum _{t=1}^{m_k} \mathbb {E}[f(x_t) - f^*]&\le \tau ^2 + \tau \sqrt{m_k\mathbb {E}[f(w_k) - f^*] + \frac{b}{4L_{{\max}}}} \end{aligned}$$

(14)

$$\begin{aligned}&\le \tau ^2 + \tau \sqrt{\frac{b}{4L_{{\max}}} } + \tau \sqrt{m_k \mathbb {E}[f(w_k) - f^*]} \end{aligned}$$

(15)

Finally, using Jensen’s inequality we get

$$\begin{aligned} \mathbb {E}\left[ f(w_{k+1}) - f^*\right]&= \mathbb {E}\left[ f\left( \frac{1}{m_k}\sum _{t=1}^{m_k} x_t\right) - f^*\right] \end{aligned}$$

(16)

$$\begin{aligned}&\le \frac{\tau ^2 + \tau \sqrt{\frac{b}{4L_{{\max}}} }}{m_k} + \frac{\tau \sqrt{\mathbb {E}[f(w_k) - f^*]}}{\sqrt{m_k}} \end{aligned}$$

(17)

which concludes the proof by noticing that by definition $\tau = \rho$ in the scalar case and $\tau = \rho '$ in the full matrix and diagonal cases. $\square$

Main proposition

We first state the main proposition for the three variants of AdaSVRG, which we later use for proving theorems.

Proposition 2

Assuming (i) convexity of f (ii) $L_{{\max}}$-smoothness of f (iii) bounded feasible set (iv) $\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]$ (v) $m_k = m$ for all k, then for the scalar variant,

$$\begin{aligned} \mathbb {E}[ f(\bar{w}_K) - f^*] \le \frac{\rho ^2 ( 1 + \sqrt{5})}{m} + \frac{\rho \sqrt{2\left( f(w_0) - f^*\right) }}{\sqrt{mK}} \end{aligned}$$

and for the full matrix and diagonal variants,

$$\begin{aligned} \mathbb {E}[ f(\bar{w}_K) - f^*] \le \frac{(\rho ')^2 ( 1 + \sqrt{5}) + 2\rho '\sqrt{ {d\delta }/{L_{{\max}}}}}{m} + \frac{\rho '\sqrt{2\left( f(w_0) - f^*\right) }}{\sqrt{mK}} \end{aligned}$$

where

$$\begin{aligned} \bar{w}_K&= \frac{1}{K}\sum _{k=1}^{K}w_k\\ \rho&= \left( \frac{D^2}{\eta _{{\min}}} + 2\eta _{{\max}}\right) \sqrt{L_{{\max}}}\\ \rho '&= \left( \frac{D^2}{\eta _{{\min}}} + 2\eta _{{\max}}\right) \sqrt{dL_{{\max}}}. \end{aligned}$$

Proof

As in the previous proof we define

$$\begin{aligned} b = {\left\{ \begin{array}{ll} 0 &{} \text {(scalar variant)}\\ d \delta &{}\text {(full matrix and diagonal variants)} \end{array}\right. } \end{aligned}$$

and

$$\begin{aligned} \tau = {\left\{ \begin{array}{ll} \rho &{} \text {(scalar variant)}\\ \rho ' &{}\text {(full matrix and diagonal variants)} \end{array}\right. } \end{aligned}$$

Using the result from Lemma 1 and letting $\Delta _k :=\mathbb {E}[f(w_k) - f^*]$, we have,

$$\begin{aligned} \Delta _{k+1} \le \frac{\tau ^2 + \tau \sqrt{\frac{b}{4L_{{\max}}}}}{m} + \tau \frac{\sqrt{\Delta _k}}{\sqrt{m}} \end{aligned}$$

(18)

Squaring gives

$$\begin{aligned} \Delta _{k+1}^2&\le \left( \frac{\tau ^2 + \tau \sqrt{\frac{b}{4L_{{\max}}}}}{m} + \frac{\tau \sqrt{\Delta _k}}{\sqrt{m}}\right) ^2 \end{aligned}$$

(19)

$$\begin{aligned}&\le 2 \frac{\left( \tau ^2 + \tau \sqrt{\frac{b}{4L_{{\max}}}}\right) ^2}{m^2} + 2\frac{\tau ^2}{m} \Delta _k \le 4 \frac{\tau ^4}{m^2} + 4 \frac{\tau ^2 \frac{b}{4 L_{{\max}}}}{m^2} + 2 \frac{\tau ^2 }{m} \Delta _k \end{aligned}$$

(20)

which we can rewrite as

$$\begin{aligned} \Delta _{k+1}^2 - 2\frac{\tau ^2}{m} \Delta _{k+1} \le 4 \frac{\tau ^4}{m^2} + 2\frac{\tau ^2}{m} (\Delta _k -\Delta _{k+1}) + 4 \frac{\tau ^2 b}{L_{{\max}} m^2} \end{aligned}$$

(21)

Since $\Delta _{k+1}^2 - 2\frac{\tau ^2}{m} \Delta _{k+1} = (\Delta _{k+1} - \frac{\tau ^2}{m})^2 - \frac{\tau ^4}{m^2}$, we get

$$\begin{aligned} (\Delta _{k+1} - \frac{\tau ^2}{m})^2 \le 5 \frac{\tau ^4}{m^2} + 2\frac{\tau ^2}{m} (\Delta _k -\Delta _{k+1}) + 4 \frac{\tau ^2 b}{L_{{\max}}m^2} \end{aligned}$$

(22)

Summing this gives

$$\begin{aligned} \sum _{k=0}^{K-1} \left( \Delta _{k+1} - \frac{\tau ^2}{m}\right) ^2&\le \left( 5\tau ^4 + 4\frac{\tau ^2 b}{L_{{\max}}}\right) \frac{K}{m^2} + 2\frac{\tau ^2}{m} \sum _{k=0}^{K-1}(\Delta _k - \Delta _{k+1}) \end{aligned}$$

(23)

$$\begin{aligned}&\le \left( 5\tau ^4 + 4\frac{\tau ^2 b}{L_{{\max}}}\right) \frac{K}{m^2} + 2\frac{\tau ^2}{m} \Delta _0 \end{aligned}$$

(24)

Using Jensen’s inequality on the (concave) square root function gives

$$\begin{aligned} \frac{1}{K}\sum _{k=0}^{K-1} (\Delta _{k+1} - \frac{\tau ^2}{m}) \le \frac{1}{K}\sum _{k=0}^{K-1} \sqrt{\left( \Delta _{k+1} - \frac{\tau ^2}{m}\right) ^2} \le \sqrt{\frac{1}{K}\sum _{k=0}^{K-1} \left( \Delta _{k+1} - \frac{\tau ^2}{m}\right) ^2} \end{aligned}$$

(25)

going back to the previous inequality this gives

$$\begin{aligned} \frac{1}{K}\sum _{k=0}^{K-1} (\Delta _{k+1} - \frac{\tau ^2}{m})&\le \sqrt{\left( 5\tau ^4 + 4\frac{\tau ^2 b}{L_{{\max}}}\right) \frac{1}{m^2} + 2\tau ^2\frac{\Delta _0}{mK} } \end{aligned}$$

(26)

$$\begin{aligned}&\le \frac{\tau ^2\sqrt{5} + 2\tau \sqrt{ {b}/{L_{{\max}}}}}{m} + \frac{\tau \sqrt{2\Delta _0}}{\sqrt{mK}} \end{aligned}$$

(27)

which we can rewrite

$$\begin{aligned} \frac{1}{K}\sum _{k=0}^{K-1} \Delta _{k+1} \le \frac{\tau ^2( 1 + \sqrt{5}) + 2\tau \sqrt{ {b}/{L_{{\max}}}}}{m} + \frac{\tau \sqrt{2\Delta _0}}{\sqrt{mK}} \end{aligned}$$

(28)

Setting $\bar{w}_K = \frac{1}{K} \sum _{k=0}^{K-1} w_k$ and using Jensen’s inequality on the convex function f, we get

$$\begin{aligned} \mathbb {E}[f(\bar{w}_K) - f^*] \le \frac{\tau ^2(1 + \sqrt{5}) + 2\tau \sqrt{ {b}/{L_{{\max}}}}}{m} + \frac{\tau \sqrt{2\Delta _0}}{\sqrt{mK}} \end{aligned}$$

(29)

which concludes the proof. $\square$

Proof of Theorem 1

For the remainder of the appendix we define $\rho ' :=\big ( \frac{D^2}{\eta _{{\min}}} + 2\eta _{{\max}}\big ) \sqrt{dL_{{\max}}}$. We restate and prove Theorem 1 for all three variants of AdaSVRG.

Theorem 5

(AdaSVRG with fixed-size inner-loop) Under the same assumptions as Lemma 1, AdaSVRG with (a) step-sizes $\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]$, (b) inner-loop size $m_k = n$ for all k, results in the following convergence rate after $K \le n$ iterations. For the scalar variant,

$$\begin{aligned} \mathbb {E}[f(\bar{w}_K) - f^*] \le \frac{\rho ^2 ( 1+ \sqrt{5}) + \rho \sqrt{2\left( f(w_0) - f^*\right) }}{K} \end{aligned}$$

and for the full matrix and diagonal variants,

$$\begin{aligned} \mathbb {E}[f(\bar{w}_K) - f^*] \le \frac{(\rho ')^2 (1+ \sqrt{5}) + 2\rho '\sqrt{ {d\delta }/{L_{{\max}}}}+ \rho '\sqrt{2\left( f(w_0) - f^*\right) }}{K}, \end{aligned}$$

where $\bar{w}_K = \frac{1}{K} \sum _{k = 1}^{K} w_{k}$.

Proof

We have $\frac{1}{m} = \frac{1}{n} \le \frac{1}{K}$ by the assumption. Using the result of Proposition 2 for the scalar variant we have

$$\begin{aligned} \mathbb {E}[ f(\bar{w}_K) - f^*]&\le \frac{\rho ^2(1 + \sqrt{5})}{m} + \frac{\rho \sqrt{2\Delta _0}}{\sqrt{mK}} \end{aligned}$$

(30)

$$\begin{aligned}&\le \frac{\rho ^2(1 + \sqrt{5}) }{K} + \frac{\rho \sqrt{2\Delta _0}}{K} \end{aligned}$$

(31)

Using the result of Proposition 2 for the full matrix and diagonal variants we have

$$\begin{aligned} \mathbb {E}[ f(\bar{w}_K) - f^*]&\le \frac{(\rho ')^2(1 + \sqrt{5}) + 2\rho '\sqrt{ {d\delta }/{L_{{\max}}}}}{m} + \frac{\rho '\sqrt{2\Delta _0}}{\sqrt{mK}} \end{aligned}$$

(32)

$$\begin{aligned}&\le \frac{(\rho ')^2(1 + \sqrt{5}) + 2\rho '\sqrt{ {d\delta }/{L_{{\max}}}}}{K} + \frac{\rho '\sqrt{2\Delta _0}}{K} \end{aligned}$$

(33)

$\square$

Corollary 1

Under the assumptions of Theorem 1, the computational complexity of AdaSVRG to reach $\epsilon$-accuracy is $O\left( \frac{n}{\epsilon }\right)$ when

$$\begin{aligned} \epsilon \ge \frac{\rho ^2( 1 +\sqrt{5}) + \rho \sqrt{2\left( f(w_0) - f^*\right) }}{n} \end{aligned}$$

for the scalar variant and when

$$\begin{aligned} \epsilon \ge \frac{(\rho ')^2( 1 +\sqrt{5}) + 2\rho ' \sqrt{ {d\delta }/{L_{{\max}}}} + \rho '\sqrt{2\left( f(w_0) - f^*\right) }}{n} \end{aligned}$$

for the full matrix and diagonal variants.

Proof

We deal with the scalar variant first. Let $c = \rho ^2( 1 +\sqrt{5}) + \rho \sqrt{2\left( f(w_0) - f^*\right) }$. By the previous theorem, to reach $\epsilon$-accuracy we require

$$\begin{aligned} \frac{c}{K} \le \epsilon \end{aligned}$$

(34)

We thus require $O\left( \frac{1}{\epsilon }\right)$ outer loops to reach $\epsilon$-accuracy. For $m_k = n$, 3n gradients are computed in each outer loop, thus the computational complexity is indeed $O\left( \frac{n}{\epsilon }\right)$.

The condition $\epsilon \ge \frac{c}{n}$ follows from the assumption that $K \le n$.

The proof for the full matrix and diagonal variants is similar by taking $c = (\rho ')^2( 1 +\sqrt{5}) + 2\rho ' \sqrt{ {d\delta }/{L_{{\max}}}} + \rho '\sqrt{2\left( f(w_0) - f^*\right) }$. $\square$

Proof of Theorem 2

Theorem 2(Multi-stage AdaSVRG) Under the same assumptions as Theorem 1, multi-stage AdaSVRG with $I = \log (1/\epsilon )$ stages, $K \ge 3$ outer-loops and $m^i = 2^{i+1}$ inner-loops at stage i, requires $O(n\log {1}/{\epsilon } + {1}/{\epsilon })$ gradient evaluations to reach a $\left( \rho ^2 (1+\sqrt{5}) \right) \epsilon$-sub-optimality.

Proof

We deal with the scalar variant first. Let $\Delta _i:= \mathbb {E}[f(\bar{w}^i) - f^*]$ and $c:= \rho ^2(1 + \sqrt{5})$. Suppose that $K \ge 3$ and $m_i = 2^{i+1}$, as in the theorem statement. We claim that for all i, we have $\Delta _i \le \frac{c}{2^i}$. We prove this by induction. For $i=0$, we have

$$\begin{aligned} \frac{c}{2^0}&= c = \rho ^2(1 + \sqrt{5}) \end{aligned}$$

(35)

$$\begin{aligned}&= L_{{\max}}(1+\sqrt{5})\left( \frac{D^2}{\eta } + 2\eta \right) ^2 \end{aligned}$$

(36)

The quantity $\frac{D^2}{\eta } + 2\eta$ reaches a minimum for $\eta ^* = \frac{D}{\sqrt{2}}$. Therefore we can write

$$\begin{aligned} \frac{c}{2^0}&\ge L_{{\max}}(1+\sqrt{5})\left( \sqrt{2}D+ 2\frac{D}{\sqrt{2}}\right) ^2 \end{aligned}$$

(37)

$$\begin{aligned}&= L_{{\max}}(1+\sqrt{5}) 8D^2 \end{aligned}$$

(38)

$$\begin{aligned}&\ge (1+\sqrt{5})8L_{{\max}} \left\| \bar{w}_0 - w^*\right\| ^2 \end{aligned}$$

(39)

$$\begin{aligned}&\ge (1+\sqrt{5})4 \left( f(\bar{w}_0) - f^*\right) \end{aligned}$$

(smoothness)

$$\begin{aligned}&= (1+\sqrt{5})4\Delta _0 \end{aligned}$$

(40)

$$\begin{aligned}&\ge \Delta _0 \end{aligned}$$

(41)

Now suppose that $\Delta _{i-1} \le \frac{c}{2^{i-1}}$ for some $i\ge 1$. Using the upper-bound analysis of AdaSVRG in Proposition 2 we get

$$\begin{aligned} \Delta _i&\le \frac{c}{m^i} + \frac{\rho \sqrt{2\Delta _{i-1}}}{\sqrt{m^i K}} \end{aligned}$$

(42)

$$\begin{aligned}&=\frac{c}{2^{i+1}} + \frac{\rho \sqrt{2\Delta _{i-1}}}{\sqrt{2^{i+1} K}} \end{aligned}$$

(43)

$$\begin{aligned}&\le \frac{c}{2^{i+1}} + \frac{\rho \sqrt{2\frac{c}{2^{i-1}}}}{\sqrt{2^{i+1} K}} \end{aligned}$$

(induction hypothesis)

$$\begin{aligned}&= \frac{c}{2^{i+1}} + \frac{c}{2^{i+1}}\left( \frac{\rho }{\sqrt{c}}\sqrt{\frac{2^{i+1}}{2^{i-2}K}} \right) \end{aligned}$$

(44)

$$\begin{aligned}&\le \frac{c}{2^{i+1}} + \frac{c}{2^{i+1}} \frac{\rho }{\sqrt{\rho ^2(1 + \sqrt{5})}} \sqrt{\frac{8}{ K}} \end{aligned}$$

(45)

$$\begin{aligned}&= \frac{c}{2^{i+1}} + \frac{c}{2^{i+1}} \sqrt{\frac{8}{(1 + \sqrt{5})K}} \end{aligned}$$

(46)

Since $K \ge 3$, one can check that $\frac{8}{(1+ \sqrt{5}) K} \le 1$ and thus

$$\begin{aligned} \Delta _i \le \frac{c}{2^{i+1}} + \frac{c}{2^{i+1}} = \frac{c}{2^i} \end{aligned}$$

(47)

which concludes the induction step.

At time step $I = \log \frac{1}{\epsilon }$, we thus have $\Delta _I \le \frac{c}{2^{I}} = c\epsilon$. All that is left is to compute the gradient complexity. If we assume that $K = \gamma$ for some constant $\gamma \ge 3$, the gradient complexity is given by

$$\begin{aligned} \sum _{i=1}^I K ( n+ m^i)&= \sum _{i=1}^I \gamma (n + 2^{i+1})\\&= \gamma n\log \left( \frac{1}{\epsilon }\right) + \gamma \sum _{i=1}^I 2^{i+1} \\&\le \gamma n\log \left( \frac{1}{\epsilon }\right) + 4\gamma \frac{1}{\epsilon }\\&= O\left( (n + \frac{1}{\epsilon }) \log \frac{1}{\epsilon }\right) \end{aligned}$$

which concludes the proof for the scalar variant.

Now we look at the full matrix and diagonal variants. Let’s take $c = (\rho ')^2(1 + \sqrt{5}) + 2\rho '\sqrt{ {d \delta }/{L_{{\max}}}}$. Suppose that $K \ge 3$ and $m_i = 2^{i+1}$, as in the theorem statement. Again, we claim that for all i, we have $\Delta _i \le \frac{c}{2^i}$. We prove this by induction. For $i=0$, we have

$$\begin{aligned} \frac{c}{2^0}&= c = (\rho ')^2(1 + \sqrt{5}) + 2\rho '\sqrt{ {d\delta }/{L_{{\max}}}} \end{aligned}$$

(48)

$$\begin{aligned}&\ge dL_{{\max}}(1+\sqrt{5})\left( \frac{D^2}{\eta } + 2\eta \right) ^2 \end{aligned}$$

(49)

The quantity $\frac{D^2}{\eta } + 2\eta$ reaches a minimum for $\eta ^* = \frac{D}{\sqrt{2}}$. Therefore we can write

$$\begin{aligned} \frac{c}{2^0}&\ge dL_{{\max}}(1+\sqrt{5})\left( \sqrt{2}D+ 2\frac{D}{\sqrt{2}}\right) ^2 \end{aligned}$$

(50)

$$\begin{aligned}&= dL_{{\max}}(1+\sqrt{5}) 8D^2 \end{aligned}$$

(51)

$$\begin{aligned}&\ge d(1+\sqrt{5})8L_{{\max}} \left\| \bar{w}_0 - w^*\right\| ^2 \end{aligned}$$

(52)

$$\begin{aligned}&\ge d(1+\sqrt{5})4 \left( f(\bar{w}_0) - f^*\right) \end{aligned}$$

(smoothness)

$$\begin{aligned}&= d(1+\sqrt{5})4\Delta _0 \end{aligned}$$

(53)

$$\begin{aligned}&\ge \Delta _0 \end{aligned}$$

(54)

The induction step is exactly the same as in the scalar case. $\square$

Proof of Theorem 3

We restate and prove Theorem 3 for the three variants of AdaSVRG.

Theorem 6

(AdaSVRG with adaptive-sized inner-loops) Under the same assumptions as Lemma 1, AdaSVRG with (a) step-sizes $\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]$, (b1) inner-loop size $m_k = \frac{\nu }{\epsilon }$ for all k or (b2) inner-loop size $m_k = \frac{\nu }{f(w_k) - f^*}$ for outer-loop k, results in the following convergence rate,

$$\begin{aligned} \mathbb {E}[f(w_K) - f^*] \le (3/4)^K [f(w_0) - f^*]. \end{aligned}$$

where $\nu = 4\rho ^2$ for the scalar variant and $\nu = \left( \frac{2\rho ' + \sqrt{16 (\rho ')^2 + 12\rho ' \sqrt{ {d\delta }/{4L_{{\max}}}}}}{3}\right) ^2$ for the full matrix and diagonal variants.

Proof

Let us define

$$\begin{aligned} \alpha = {\left\{ \begin{array}{ll} \frac{1}{2}\big ( \frac{D^2}{\eta _{{\min}}} + 2\eta _{{\max}}\big ) &{}\text {(scalar variant)}\\ \frac{1}{2}\big ( \frac{D^2}{\eta _{{\min}}} + 2\eta _{{\max}}\big )\sqrt{d} &{}\text {(full matrix and diagonal variants)} \end{array}\right. } \end{aligned}$$

and

$$\begin{aligned} b = {\left\{ \begin{array}{ll} 0 &{}\text {(scalar variant)}\\ d\delta &{}\text {(full matrix and diagonal variants)} \end{array}\right. } \end{aligned}$$

Similar to the proof of Lemma 1, for a inner-loop k with $m_k$ iterations and $\alpha := \frac{1}{2}\big ( \frac{D^2}{\eta _{{\min}}} + 2\eta _{{\max}}\big )$ we can show

$$\begin{aligned} \mathbb E \left[ \sum _{t=1}^{m_k} f(x_{t})-f^* \right]&\le \alpha \sqrt{ \sum _{t=1}^{m_k}\mathbb {E}\left[ \left\| g_t\right\| ^2\right] + b} \end{aligned}$$

(55)

$$\begin{aligned}&\le \alpha \sqrt{4L_{{\max}} \sum _{t=1}^{m_k} \mathbb {E}\left[ f(x_{t})-f^*\right] + 4L_{{\max}} \sum _{t=1}^{m_k} \underbrace{\mathbb {E}\left[ f(w_k) - f^* \right] }_{\epsilon _k} +b} \end{aligned}$$

(56)

$$\begin{aligned}&\le \alpha \sqrt{4L_{{\max}} \mathbb {E}\left[ \sum _{t=1}^{m_k} f(x_{t})- f^* \right] + 4L_{{\max}} m_k \epsilon _k +b} \end{aligned}$$

(57)

Using Lemma 5 we get

$$\begin{aligned} \mathbb E \left[ \sum _{t=1}^{m_k} f(x_{t})-f^* \right]&\le 4L_{{\max}} \alpha ^2 + \sqrt{4L_{{\max}}\alpha ^2m_k \epsilon _k + \alpha ^2 b} \end{aligned}$$

(58)

If we set $m_k = \frac{C}{\epsilon _k}$,

$$\begin{aligned} \mathbb E \left[ \sum _{t=1}^{m_k} f_t(x_{t})-f^* \right]&\le 4L_{{\max}} \alpha ^2 + \sqrt{4L_{{\max}} \alpha ^2 C+\alpha ^2 b} \end{aligned}$$

(59)

$$\begin{aligned}&\le 4L_{{\max}} \alpha ^2 + \sqrt{4L_{{\max}} \alpha ^2 C}+\sqrt{ \alpha ^2 b}. \end{aligned}$$

(60)

Define $a :=\sqrt{4L_{{\max}}\alpha ^2}$ and $\gamma :=\sqrt{b/4L_{{\max}}}$, by dividing both sides of (59) by $m_k$ and using the definition of $w_{k+1}$,

$$\begin{aligned} \mathbb E \left[ f(w_{k+1})-f^* \right] \le \frac{a^2 + a(\sqrt{C}+\gamma )}{m_k} = \left( \frac{a^2 + a (\sqrt{C}+\gamma )}{C} \right) \mathbb E \left[ f(w_{k})-f^* \right] \end{aligned}$$

(61)

Setting $C=\left( \frac{2a+\sqrt{4a^2+12(a^2+a\gamma )}}{3}\right) ^2$, we get $\epsilon _{k+1} \le 3/4 \epsilon _k$. However, the above proof requires knowing $\epsilon _{k}$. Instead, let us assume a target error of $\epsilon$, implying that we want to have $\mathbb E [f(w_{K}) - f^* ] \le \epsilon$. Going back to (58) and setting $m_k = C/\epsilon$, we obtain,

$$\begin{aligned} \mathbb E \left[ f(w_{k+1})-f^* \right]&\le \frac{a^2 + a(\sqrt{C\epsilon _k/\epsilon }+\gamma )}{m_k} = \left( \frac{a^2 + a(\sqrt{C\epsilon _k/\epsilon }+\gamma )}{C} \right) \epsilon\\&\le \frac{a^2 \epsilon _k + a\sqrt{C\epsilon _k/\epsilon } \sqrt{\epsilon _k \epsilon }+a\gamma \epsilon _k}{C} \qquad {(\hbox {Assuming that }\epsilon \le \epsilon _{k}\hbox { for all }k \le K.)} \end{aligned}$$

(62)

$$\begin{aligned}&= \left( \frac{a^2 + a\sqrt{C}+a\gamma }{C} \right) \epsilon _k = \left( \frac{a^2 + a\sqrt{C}+a\gamma }{C} \right) \mathbb E \left[ f(w_{k})-f^* \right] \end{aligned}$$

(63)

With $C=\left( \frac{2a+\sqrt{4a^2+12(a^2+a\gamma )}}{3}\right) ^2$, we get linear convergence to $\epsilon$-suboptimality. Based on the above we require $K = \mathcal O(\log (1/\epsilon ))$ outer-loops. However in each outer-loop we need $\mathcal O(n+\frac{1}{\epsilon })$ gradient evaluations. All in all, our total computation complexity is of $\mathcal {O} ((n+\frac{1}{\epsilon }) \log (\frac{1}{\epsilon }))$.

The proof is done by noticing that in the scalar variant, $\gamma = 0$ and $a = \rho$ so that

$$\begin{aligned} C = \left( \frac{2\rho + \sqrt{4\rho ^2 + 12 \rho ^2}}{3} \right) ^2 = 4\rho ^2 \end{aligned}$$

and in the full matrix and diagonal variants, $a = \rho '$ and $\gamma = \sqrt{ {d\delta }/{4L_{{\max}}}}$ so that

$$\begin{aligned} C = \left( \frac{2\rho ' + \sqrt{16(\rho ')^2 + 12\rho ' \sqrt{ {d\delta }/{4L_{{\max}}}}}}{3} \right) ^2 \end{aligned}$$

$\square$

Proof of Theorem 4

We restate and prove Theorem 4 for the three variants of AdaGrad. To handle all three variants, we restate the theorem in terms of $\Vert G_t\Vert _* = \sqrt{\textsf {Tr}\mathclose {\left(G_t\right)}}$. Note that this does not change the claim for the scalar variant, as in that case $G_t = \textsf {Tr}\mathclose {\left(G_t\right)}$.

Theorem 7

(Phase Transition in AdaGrad Dynamics) Under the same assumptions as Lemma 1 and (iv) $\sigma ^2$-bounded stochastic gradient variance and defining $T_0 = \frac{\rho ^2 L_{{\max}}}{\sigma ^2}$, for constant step-size AdaGrad we have $\mathbb {E}\Vert G_t\Vert _{*} = O(1) \text { for } t \le T_0$, and $\mathbb {E}\Vert G_t\Vert _{*} = O (\sqrt{t-T_0}) \text { for } t \ge T_0$.

The same result holds for the full matrix and diagonal variants of constant step-size AdaGrad for $T_0 = \frac{\left( \rho ' \sqrt{L_{{\max}}} + \sqrt{2\rho ' \sqrt{d\delta L_{{\max}}} + d\delta }\right) ^2}{\sigma ^2}$

Proof

We start with the scalar variant. Consider the general AdaGrad update

$$\begin{aligned} x_{t+1}= \Pi _{X, A_t}\left( x_{t}- \eta A_{t}^{-1} \nabla f_{i_t}(x_{t})\right) \end{aligned}$$

(64)

The same we did in the proof of Theorem 1, we can bound suboptimality as

$$\begin{aligned}&\Vert x_{t+1}- w^{*}\Vert _{A_t}^2 \le \Vert x_{t}- w^{*}\Vert _{A_t}^2 - 2\eta \left\langle x_{t}-w^{*},\nabla f_{i_t}(x_{t}) \right\rangle _{} + \eta ^2 \Vert \nabla f_{i_t}(x_{t})\Vert _{A_t^{-1}}^2 \end{aligned}$$

(65)

By re-arranging, dividing by $\eta$ and summing for T iteration we have

$$\begin{aligned} \sum _{t=1}^T \left\langle x_{t}-w^{*},\nabla f_{i_t}(x_{t}) \right\rangle _{}&\le \frac{1}{2\eta }\sum _{t=1}^T \Vert x_{t}-w^{*}\Vert ^2_{A_t-A_{t-1}} + \frac{\eta }{2} \sum _{t=1}^T \Vert \nabla f_{i_t}(x_{t})\Vert _{A_t^{-1}}^2 \end{aligned}$$

(Lemma 1)

$$\begin{aligned}&\le \frac{1}{2}\left( \frac{D^2}{\eta } + 2\eta \right) \textsf {Tr}\mathclose {\left(A_T\right)} \end{aligned}$$

(66)

Define

$$\begin{aligned} \alpha = {\left\{ \begin{array}{ll} \frac{1}{2}\big ( \frac{D^2}{\eta } + 2\eta \big ) &{} \text {(scalar variant)}\\ \frac{1}{2}\big ( \frac{D^2}{\eta } + 2\eta \big )\sqrt{d} &{}\text {(full matrix and diagonal variants)} \end{array}\right. } \end{aligned}$$

and

$$\begin{aligned} b = {\left\{ \begin{array}{ll} 0 &{} \text {(scalar variant)}\\ d \delta &{}\text {(full matrix and diagonal variants)} \end{array}\right. } \end{aligned}$$

We then have $\textsf {Tr}\mathclose {\left(A_T\right)} \le \alpha \sqrt{\sum _{t=1}^T\Vert \nabla f_{i_t}(x_{t})\Vert ^2 + b }$ by Lemma 3 and Lemma 4. Going back to Eq. (66) and taking expectation and using the upper-bound we get

$$\begin{aligned} \mathbb {E}\left[ \sum _{t=1}^T \left\langle x_{t}-w^{*},\nabla f(x_{t}) \right\rangle _{}\right]&\le \alpha \mathbb {E}\left[ \sqrt{\sum _{t=1}^T \Vert \nabla f_{i_t}(x_{t})\Vert ^2 + b } \right] \end{aligned}$$

(67)

Using convexity of f and Jensen’s inequality on the (concave) square root function, we have

$$\begin{aligned} \mathbb E \left[ \sum _{t=1}^T f(x_{t})-f^* \right]&\le \alpha \sqrt{\sum _{t=1}^T \mathbb E\Vert \nabla f_{i_t}(x_{t})\Vert ^2 + b} \end{aligned}$$

(68)

Now,

$$\begin{aligned} \left\| \nabla f_{i_t}(x_t)\right\| ^{2}&= \Vert \nabla f_{i_t}(x_t) - \nabla f(x_t) \Vert ^2 + 2 \langle \nabla f_{i_t}(x_t) - \nabla f(x_t), \nabla f(x_t)\rangle + \Vert \nabla f(x_t) \Vert ^2 \end{aligned}$$

(69)

Taking expectations, and since $\mathbb {E}[\nabla f_{i_t}(x_t) - \nabla f(x_t)] = 0$, we have

$$\begin{aligned} \mathbb {E}\left\| \nabla f_{i_t}(x_t)\right\| ^{2} = \mathbb {E}\Vert \nabla f_{i_t}(x_t) - \nabla f(x_t)\Vert ^2 + \mathbb {E}\Vert \nabla f(x_t)\Vert ^2 \end{aligned}$$

(70)

Going back to the previous inequality we get

$$\begin{aligned} \mathbb E \left[ \sum _{t=1}^T f(x_{t})-f^* \right]&\le \alpha \sqrt{\sum _{t=1}^T \mathbb {E}\Vert \nabla f_{i_t}(x_t) - \nabla f(x_t)\Vert ^2 + \mathbb {E}\Vert \nabla f(x_t)\Vert ^2 + b} \end{aligned}$$

(71)

$$\begin{aligned}&\le \alpha \sqrt{\sum _{t=1}^T\left( \mathbb {E}\Vert \nabla f_{i_t}(x_t) - \nabla f(x_t)\Vert ^2 + 2L_{{\max}}\mathbb {E}\left( f(x_t) - f^*\right) \right) + b } \end{aligned}$$

(72)

where we used smoothness in the last inequality. Now, if $\sigma = 0$, namely $\nabla f(x_t) = \nabla f_{i_t}(x_t)$, we have

$$\begin{aligned} \left[ \sum _{t=1}^T f(x_{t})-f^* \right]&\le \alpha \sqrt{2 L_{{\max}}\sum _{t=1}^T (f(x_t) - f^*) + b} \end{aligned}$$

(73)

Using Lemma 5 we get

$$\begin{aligned} \sum _{t=1}^T (f(x_t) - f^*) \le \alpha ^2 2L_{{\max}} + \sqrt{\alpha ^2 b} \end{aligned}$$

(74)

so that

$$\begin{aligned} \sqrt{\sum _{t=1}^T \left( f(x_{t})-f^*\right) } \le \alpha \sqrt{2L_{{\max}}} + \sqrt{\alpha }b^{1/4} \end{aligned}$$

(75)

Now,

$$\begin{aligned} \Vert G_T\Vert _*&= \sqrt{\textsf {Tr}\mathclose {\left( \sum _{t=1}^T \nabla f(x_t)^T \nabla f(x_t) + \delta \right)}} = \sqrt{\sum _{t=1}^{T} \left\| \nabla f(x_t)\right\| ^2 + b} \end{aligned}$$

(76)

Thus we have

$$\begin{aligned} \Vert G_T\Vert _*&= \sqrt{\sum _{t=1}^T \Vert \nabla f(x_t) \Vert ^2 + b } \end{aligned}$$

(77)

$$\begin{aligned}&\le \sqrt{b } + \sqrt{2L_{{\max}} \sum _{t=1}^T \left( f(x_t) - f^*\right) } \le \sqrt{b} + \sqrt{2\alpha L_{{\max}}} b^{1/4} + 2\alpha L_{{\max}} \end{aligned}$$

(78)

where we used smoothness for the inequality. This shows that $\Vert G_T\Vert _*$ is a bounded series in the deterministic case. Now, if $\sigma \not = 0$, going back to Eq. (72) we have

$$\begin{aligned} \sum _{t=1}^T \mathbb {E}\left[ f(x_t) - f^* \right]&\le \alpha \sqrt{ T \sigma ^2 +2L_{{\max}}\sum _{t=1}^T\left( \mathbb {E}\left( f(x_t) - f^*\right) \right) + b} \end{aligned}$$

(79)

Using Lemma 5 we get

$$\begin{aligned} \sum _{t=1}^T \mathbb {E}\left[ f(x_t) - f^*\right]&\le 2\alpha ^2L_{{\max}} + \sqrt{T\alpha ^2 \sigma ^2 + \alpha ^2 b} \le 2\alpha ^2L_{{\max}} + \sqrt{T\alpha ^2 \sigma ^2} + \alpha \sqrt{ b} \end{aligned}$$

(80)

We then have

$$\begin{aligned} \mathbb {E}\Vert G_T\Vert _*^2&= \mathbb {E}\left[ \sum _{t=1}^T\Vert \nabla f_{i_t}(x_t) \Vert ^2 + b \right] \end{aligned}$$

(81)

$$\begin{aligned}&= \sum _{t=1}^T \mathbb {E}\Vert \nabla f_{i_t} (x_t) - \nabla f(x_t) \Vert ^2 + \sum _{t=1}^T \mathbb {E}\Vert \nabla f(x_t) \Vert ^2 + b \end{aligned}$$

(same as Eq. (70))

$$\begin{aligned}&\le T\sigma ^2 + 2L_{{\max}}\sum _{t=1}^T \mathbb {E}\left[ f(x_t) - f^*\right] + b \end{aligned}$$

(smoothness)

$$\begin{aligned}&\le T\sigma ^2 + 4\alpha ^2 L_{{\max}}^2 + 2L_{{\max}} \sqrt{T\alpha ^2 \sigma ^2} + 2 L_{{\max}}\alpha \sqrt{b} + b \end{aligned}$$

(by Eq. (80))

$$\begin{aligned}&\le (\sigma \sqrt{T} + 2\alpha L_{{\max}})^2 + 2L_{{\max}}\alpha \sqrt{b} + b \end{aligned}$$

(82)

from which we get

$$\begin{aligned} \mathbb {E}\Vert G_T\Vert _*&= \mathbb {E}\sqrt{\Vert G_T\Vert _*^2} \end{aligned}$$

(83)

$$\begin{aligned}&\le \sqrt{\mathbb {E}\Vert G_T\Vert _*^2} \end{aligned}$$

(Jensen's inequality)

$$\begin{aligned}&\le \sqrt{(\sigma \sqrt{T} + 2\alpha L_{{\max}})^2 + 2 L_{{\max}}\alpha \sqrt{b} + b} \end{aligned}$$

(by Eq. (82))

$$\begin{aligned}&\le \sigma \sqrt{T} + 2 \alpha L_{{\max}} + \sqrt{2L_{{\max}} \alpha \sqrt{b} + b} \end{aligned}$$

(84)

This implies that for $T \le \frac{\left( 2 \alpha L_{{\max}} + \sqrt{2L_{{\max}} \alpha \sqrt{b} + b}\right) ^2}{\sigma ^2}$, we have

$$\begin{aligned} \mathbb {E}\Vert G_T \Vert _* \le 4 \alpha L_{{\max}} + 2\sqrt{2L_{{\max}} \alpha \sqrt{b} + b} \end{aligned}$$

(85)

and for $T\ge \frac{\left( 2 \alpha L_{{\max}} + \sqrt{2L_{{\max}} \alpha \sqrt{b} + b}\right) ^2}{\sigma ^2}$,

$$\begin{aligned} \mathbb {E}\Vert G_T\Vert _* = O(\sqrt{T}) \end{aligned}$$

(86)

The proof is done by noticing that

$$\begin{aligned} \alpha = {\left\{ \begin{array}{ll} \frac{\rho }{\sqrt{L_{{\max}}}} &{}\text {(scalar case)}\\ \frac{\rho '}{\sqrt{L_{{\max}}}} &{}\text {(full matrix and diagonal cases)} \end{array}\right. } \end{aligned}$$

$\square$

Corollary 2

Under the same assumptions as Lemma 1, for outer-loop k of AdaSVRG with constant step-size $\eta _k$, there exists $T_0 = \frac{C}{f(w_k) - f^*}$ such that,

$$\mathbb {E}\Vert G_t\Vert _{*} = {\left\{ \begin{array}{ll} O(1), &{} \text {for } t \le T_0 \\ O (\sqrt{t-T_0}), &{} \text {for } t \ge T_0. \\ \end{array}\right. }$$

Proof

Using Theorem 7 with $\sigma ^2 = f(w_{k}) - f^*$ gives us the result. $\square$

Helper lemmas

We make use of the following helper lemmas from Vaswani et al. (2020), proved here for completeness.

Lemma 3

For any of the full matrix, diagonal and scalar versions, we have

$$\begin{aligned} \sum _{t=1}^m \left\| x_t - w^*\right\| ^2_{A_{t} - A_{t-1}} \le D^2 \mathrm {Tr}(A_m) \end{aligned}$$

Proof

For any of the three versions, we have by construction that $A_t$ is non-decreasing, i.e. $A_t - A_{t-1} \succeq 0$ (for the scalar version, we consider $A_t$ as a matrix of dimension 1 for simplicity). We can then use the bounded feasible set assumption to get

$$\begin{aligned} \sum _{t=1}^m \left\| x_{t}- w^{*}\right\| ^2_{{A_t} -A_{t-1}}&\le \sum _{t=1}^m \lambda _{\max }({{A_t} -A_{t-1}})\left\| x_{t}- w^{*}\right\| ^2 \\&\le D^2\!\sum _{t=1}^T\lambda _{\max }({{A_t} -A_{t-1}}). \end{aligned}$$

We then upper-bound $\lambda _{\max }$ by the trace and use the linearity of the trace to telescope the sum,

$$\begin{aligned} &\le D^2 \sum _{t=1}^m\textsf {Tr}\mathclose {\left({A_t} -A_{t-1}\right)} = D^2 \sum _{t=1}^m\textsf {Tr}\mathclose {\left(A_t\right)} - \textsf {Tr}\mathclose {\left(A_{t-1}\right)}, \\&= D^2 ({\textsf {Tr}\mathclose {\left(A_m\right)} - \textsf {Tr}\mathclose {\left(A_0\right)}}) \le D^2\textsf {Tr}\mathclose {\left(A_m\right)} \end{aligned}$$

$\square$

Lemma 4

For any of the full matrix, diagonal and scalar versions, we have

$$\begin{aligned} \sum _{t=1}^m \left\| g_t\right\| ^2_{A_t^{-1}} \le 2\mathrm {Tr}(A_m) \end{aligned}$$

Moreover, for the scalar version we have

$$\begin{aligned} \textsf {Tr}\mathclose {\left(A_m\right)} \le \sqrt{ \sum _{t=1}^m \left\| g_t\right\| ^2} \end{aligned}$$

and for the full matrix and diagonal version we have

$$\begin{aligned} \textsf {Tr}\mathclose {\left(A_m\right)} \le \sqrt{d \sum _{t=1}^m \left\| g_t\right\| ^2 + d^2\delta } \end{aligned}$$

Proof

We prove this by induction. Start with $m=1$.

For the full matrix version, $A_1=(\delta I+g_1g_1^\top )^{ {1}/{2}}$ and we have

$$\begin{aligned} \left\| g_1\right\| ^{2}_{A_1^{-1}}&= g_1^\top A_1^{-1}g_1 = \textsf {Tr}\mathclose {\left(g_1^\top A_1^{-1}g_1\right)} = \textsf {Tr}\mathclose {\left(A_1^{-1}g_1g_1^\top \right)}\\&= \textsf {Tr}\mathclose {\left(A_1^{-1}(A_1^2 - \delta I)\right)} = \textsf {Tr}\mathclose {\left(A_1\right)}-\textsf {Tr}\mathclose {\left( \delta A_1^{-1}\right)} \le \textsf {Tr}\mathclose {\left(A_1\right)} \end{aligned}$$

For the diagonal version $A_1=(\delta I+{{\,\mathrm{diag}\,}}(g_1g_1^\top ))^{ {1}/{2}}$ we have

$$\begin{aligned} \left\| g_1\right\| ^2_{A_1^{-1}} = g_1^\top A_t^{-1} g_1 = \textsf {Tr}\mathclose {\left(g_1^\top A_1^{-1} g_1\right)} = \textsf {Tr}\mathclose {\left(A_1^{-1} g_1 g_1^\top \right)} \end{aligned}$$

(87)

Since $A_1^{-1}$ is diagonal, the diagonal elements of $A_1^{-1} g_1 g_1^\top$ are the same as the diagonal elements of $A_1^{-1} {{\,\mathrm{diag}\,}}(g_1 g_1^\top )$. Thus we get

$$\begin{aligned} \left\| g_1\right\| ^2_{A_1^{-1}}&= \textsf {Tr}\mathclose {\left(A_1^{-1} {{\,\mathrm{diag}\,}}(g_1 g_1^\top )\right)}\\&= \textsf {Tr}\mathclose {\left(A_1^{-1} \left( A_1^2 - \delta I\right) \right)}\\&= \textsf {Tr}\mathclose {\left(A_1\right)} - \delta \textsf {Tr}\mathclose {\left(A_1^{-1}\right)}\\&\le \textsf {Tr}\mathclose {\left(A_1\right)} \end{aligned}$$

For the scalar version $A_1 = \left( g_1^\top g_1\right) ^{ {1}/{2}}$ and we have

$$\begin{aligned} \left\| g_1\right\| ^2_{A_1^{-1}} = A_1^{-1} \left\| g_1\right\| ^2 = A_1^{-1} g_1^\top g_1 = A_1^{-1} A_1^2 = A_1 = \textsf {Tr}\mathclose {\left(A_1\right)} \end{aligned}$$

Induction step: Suppose now that it holds for $m-1$, i.e. $\sum _{t=1}^{m-1} \left\| g_t\right\| ^2_{A_t^{-1}}\le 2 \textsf {Tr}\mathclose {\left(A_{m-1}\right)}$. We will show that it also holds for m.

For the full matrix version we have

$$\begin{aligned} \sum _{t=1}^m \left\| g_t \right\| ^2_{A_t^{-1}}&\le 2\textsf {Tr}\mathclose {\left(A_{m-1}\right)} + \left\| g_m\right\| ^{2}_{A_m^{-1}} \end{aligned}$$

(Induction hypothesis)

$$\begin{aligned}&= 2\textsf {Tr}\mathclose {\left((A_m^2-g_m g_m^\top )^{ {1}/{2}}\right)} + \textsf {Tr}\mathclose {\left(A_m^{-1}g_m g_m^\top \right)} \end{aligned}$$

(AdaGrad update)

We then use the fact that for any $X \succeq Y \succeq 0$, we have (Duchi et al., 2011, Lemma 8)

$$\begin{aligned} 2\textsf {Tr}\mathclose {\left((X-Y)^{ {1}/{2}}\right)}+\textsf {Tr}\mathclose {\left(X^{- {1}/{2}}Y\right)}\le 2\textsf {Tr}\mathclose {\left(X^{ {1}/{2}}\right)}. \end{aligned}$$

As $X = A_m^2\succeq Y = g_m g_m^\top \succeq 0$, we can use the above inequality and the induction holds for m.

For the diagonal version we have

$$\begin{aligned} \sum _{t=1}^m \left\| g_t\right\| _{A_t^{-1}}^2&\le 2 \textsf {Tr}\mathclose {\left(A_{m-1}\right)} + \left\| g_m\right\| ^2_{A_{m}^{-1}} \end{aligned}$$

(Induction hypothesis)

$$\begin{aligned}&= 2\textsf {Tr}\mathclose {\left(\left( A_m^2 - \text {diag}(g_m g_m^\top \right) ^{1/2}\right)} + \textsf {Tr}\mathclose {\left(A_m^{-1} g_m g_m^\top \right)} \end{aligned}$$

(AdaGrad update)

As before, since $A_m^{-1}$ is diagonal, we have that the diagonal elements of $A_m^{-1} g_m g_m^\top$ are the same as the diagonal elements $A_m^{-1} {{\,\mathrm{diag}\,}}(g_m g_m^\top )$. Thus we get

$$\begin{aligned} \sum _{t=1}^m \left\| g_t\right\| _{A_t^{-1}}^2&\le 2\textsf {Tr}\mathclose {\left(\left( A_m^2 - \text {diag}(g_m g_m^\top \right) ^{1/2}\right)} + \textsf {Tr}\mathclose {\left(A_m^{-1} {{\,\mathrm{diag}\,}}(g_m g_m^\top )\right)} \end{aligned}$$

We can then again apply the result from Duchi et al. (2011, Lemma 8) with $X = A_m^2 \succeq Y = {{\,\mathrm{diag}\,}}(g_m g_m^\top ) \succeq 0$, and we obtain the desired result.

For the scalar version, since $A_m^{-1}$ is a scalar we have by induction hypothesis,

$$\begin{aligned} \sum _{t=1}^m \left\| g_t\right\| _{A_t^{-1}}^2&\le 2 \textsf {Tr}\mathclose {\left(A_{m-1}\right)} + \left\| g_m\right\| ^2_{A_{m}^{-1}} \\&= 2\textsf {Tr}\mathclose {\left(\left( A_m^2 - g_m^\top g_m\right) ^{1/2}\right)} + \textsf {Tr}\mathclose {\left( A_m^{-1}g_m^\top g_m\right)}, \end{aligned}$$

where the equality follows from the AdaGrad update. We can then again apply the result from Duchi et al. (2011, Lemma 8) with $X = A_m^2 \ge Y = g_m^\top g_m \ge 0$, and we obtain the desired result.

Bound on the trace: For the trace bound, recall that $A_m= G_m^{1/2}$. For the scalar version we have

$$\begin{aligned} \textsf {Tr}\mathclose {\left(A_m\right)} = \textsf {Tr}\mathclose {\left(G_m^{ {1}/{2}}\right)} = G_m^{ {1}/{2}} = \sqrt{\sum _{t=1}^m g_t^\top g_t } = \sqrt{\sum _{t=1}^m \left\| g_t\right\| ^{2} } \end{aligned}$$

For the diagonal and full matrix variants, we use Jensen’s inequality to get

$$\begin{aligned} \textsf {Tr}\mathclose {\left(A_m\right)} = \textsf {Tr}\mathclose {\left(G_m^{ {1}/{2}}\right)}&= \sum _{j=1}^d \sqrt{\lambda _j(G_m)} = d\bigg (\frac{1}{d}\sum _{j=1}^d \sqrt{\lambda _j(G_m)} \bigg ) \\&\le d\sqrt{\frac{1}{d}\sum _{j=1}^d\lambda _j(G_m)} = \sqrt{d}\sqrt{\textsf {Tr}\mathclose {\left(G_m\right)}}. \end{aligned}$$

there $\lambda _j(G_m)$ denotes the j-th eigenvalue of $G_m$.

For the full matrix version, we have

$$\begin{aligned} \sqrt{\textsf {Tr}\mathclose {\left(G_m\right)}} = \sqrt{\textsf {Tr}\mathclose {\left(\sum _{t=1}^m g_t g_t^\top +\delta I\right)}} = \sqrt{\sum _{t=1}^m\textsf {Tr}\mathclose {\left(g_t g_t^\top \right)}+ d\delta } = \sqrt{\sum _{t=1}^m\left\| g_t\right\| ^{2}+d\delta } \end{aligned}$$

For the diagonal version, we have

$$\begin{aligned} \sqrt{\textsf {Tr}\mathclose {\left(G_m\right)}}&= \sqrt{\textsf {Tr}\mathclose {\left(\sum _{t=1}^m {{\,\mathrm{diag}\,}}(g_t g_t^\top ) + \delta I\right)}} \\&= \sqrt{\sum _{t=1}^m\textsf {Tr}\mathclose {\left({{\,\mathrm{diag}\,}}(g_t g_t^\top )\right)} + d\delta }\\&= \sqrt{\sum _{t=1}^m\left\| g_t\right\| ^{2} + d\delta } \end{aligned}$$

which concludes the proof. $\square$

Lemma 5

If $x^2 \le a(x+b)$ for $a\ge 0$ and $b \ge 0$,

$$\begin{aligned} x \le \frac{1}{2}(\sqrt{a^2 + 4ab} + a) \le a+ \sqrt{ab} \end{aligned}$$

Proof

The starting point is the quadratic inequality $x^2 - ax - ab \le 0$. Letting $r_1 \le r_2$ be the roots of the quadratic, the inequality holds if $x \in [r_1, r_2]$. The upper bound is then given by using $\sqrt{a+b} \le \sqrt{a} +\sqrt{b}$

$$\begin{aligned} r_2 = \frac{a + \sqrt{a^2 + 4ab}}{2} \le \frac{a + \sqrt{a^2} + \sqrt{4ab}}{2} = a + \sqrt{ab}. \end{aligned}$$

$\square$

Counter-example for line-search for SVRG

Proposition 3

For any $c> 0, \eta _{{\max}} > 0$, there exists a 1-dimensional function f whose minimizer is $x^*=0$, and for which the following holds: If at any point of Algorithm 6, we have $\mathclose {\left|x_t^k\right|} \in \big (0, \min \{ \frac{1}{c}, 1\}\big )$, then $\mathclose {\left|x_{t+1}^k\right|} \ge \mathclose {\left|x_t^k\right|}$.

Proof

Define the following function

$$\begin{aligned} f(x) = \frac{1}{2}\big ( f_1(x) + f_2(x)\big ) = \frac{1}{2} \big ( a(x-1)^2 + a(x+1)^2\big ) = a\big ( x^2 + 1\big ) \end{aligned}$$

(88)

where $a>0$ is a constant that will be determined later. We then have the following

$$\begin{aligned}&f'(x) = 2ax\\&f_1'(x) = 2a(x-1)\\&f_2'(x) = 2a(x+1) \end{aligned}$$

The minimizer of f is 0, while the minimizers of $f_1$ and $f_2$ are 1 and -1, respectively. This symmetry will make the algorithm fail.

Now, as stated by the assumption, let $\mathclose {\left|x_t^k\right|} \in \big (0, \min \{ \frac{1}{c}, 1\}\big )$. WLOG assume $x_t^k > 0$, the other case is symmetric.

Case 1: $i_t = 1$. Then we have

$$\begin{aligned} g_t&= f_1'(x_t^k) - f_1'(w^k) + f'(w_k)\\&= 2a(x_t^k - 1) - 2a(w^k - 1) + 2aw^k\\&= 2ax_t^k > 0 \end{aligned}$$

Observe that $g_t > 0$. Since $x_t^k < 1$ and the function $f_1$ is strictly decreasing in the interval $(-\infty , 1]$, moving in the direction $-g_t$ from $x_t^k$ can only increase the function value. Thus the Armijo line search will fail and yield $\eta _t = 0$. Thus in that case $x_{t+1}^k = x_t^k$.

Case 2: $i_t = 2$. Then we have

$$\begin{aligned} g_t&= f_2'(x_t^k) - f_2'(w^k) + f'(w_k)\\&= 2a(x_t^k + 1) - 2a(w^k + 1) + 2aw^k\\&= 2ax_t^k > 0 \end{aligned}$$

The Armijo line search then reads

$$\begin{aligned} f_2(x_t^k - 2a\eta _t x_t^k) \le f_2(x_t^k) - c\eta _t (2ax_t^k)^2 \end{aligned}$$

which we can rewrite as

$$\begin{aligned}&a\big ( x_t^k + 1 - 2a\eta _t x_t^k\big )^2 \le a(x_t^k + 1)^2 - 4ca^2\eta _t (x_t^k)^2\\ \Rightarrow&a(x_t^k + 1)^2 - 4a^2\eta _t(x_t^k + 1)x_t^k + 4\eta _t^2 a^3 (x_t^k)^2 \le a(x_t^k + 1)^2 - 4ca^2\eta _t (x_t^k)^2 \end{aligned}$$

Simplifying this gives

$$\begin{aligned} \eta _t a (x_t^k)^2 \le (x_t^k + 1)x_t^k - c(x_t^k)^2 \end{aligned}$$

which simplifies even further to

$$\begin{aligned} \eta _t \le \frac{1 + \frac{1}{x_t^k} - c}{a} \end{aligned}$$

Therefore, the Armijo line-search will return a step-size such that

$$\begin{aligned} \eta _t \ge \min \bigg \{ \frac{1 + \frac{1}{x_t^k} - c}{a}, \eta _{{\max}}\bigg \} \end{aligned}$$

(89)

Now, recall that by assumption we have $x_t^k < 1/c$. Then $1/x_t^k - c > 0$, which implies that

$$\begin{aligned} \frac{1 + \frac{1}{x_t^k} - c}{a} \ge \frac{1}{a} \end{aligned}$$

Now is the time to choose a. Indeed, if a is such that $1/a \le \eta _{{\max}}$, we then have by Eq. (89) that

$$\begin{aligned} \eta _t \ge \frac{1}{a} \end{aligned}$$

We then have

$$\begin{aligned} x_{t+1}^k&= x_t^k - \eta _t g_t\\&= x_t^k -2a \eta _t x_t^k\\&= (1- 2a\eta _t)x_t^k\\&\le (1 - 2)x_t^k = -x_t^k \end{aligned}$$

where the inequality comes from $\eta _t \ge 1/a$ and the fact that $x_t^k \ge 0$. Thus we indeed have $\mathclose {\left|x_{t+1}^k\right|} \ge \mathclose {\left|x_t^k\right|}$. $\square$

Additional experiments

1.1 Poor performance of AdaGrad compared to variance reduction methods

See Fig. 4.

1.2 Additional experiments with batch-size = 64

See Fig. 5, 6 and 7.

1.3 Studying the effect of the batch-size on the performance of AdaSVRG

See Fig. 8, 9, 10, 11, 12 and 13.

1.4 Evaluating the diagonal variant of AdaSVRG

See Figs. 14 and 15.

1.5 Additional experiments on adaptivity to over-parameterization

See Fig. 16.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Dubois-Taine, B., Vaswani, S., Babanezhad, R. et al. SVRG meets AdaGrad: painless variance reduction. Mach Learn 111, 4359–4409 (2022). https://doi.org/10.1007/s10994-022-06265-x

Download citation

Received: 20 February 2022
Revised: 22 June 2022
Accepted: 29 September 2022
Published: 10 November 2022
Issue Date: December 2022
DOI: https://doi.org/10.1007/s10994-022-06265-x

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

SVRG meets AdaGrad: painless variance reduction

Abstract

Similar content being viewed by others

Cocoercivity, smoothness and bias in variance-reduced stochastic gradient methods

Unified Analysis of Stochastic Gradient Methods for Composite Convex and Smooth Optimization

A Variance Controlled Stochastic Method with Biased Estimation for Faster Non-convex Optimization

Explore related subjects

1 Introduction

1.1 Background and contributions

1.1.1 SVRG meets AdaGrad

1.1.2 Multi-stage AdaSVRG

1.1.3 AdaSVRG with adaptive termination

1.1.4 Practical considerations and experimental evaluation

1.1.5 Adaptivity to over-parameterization

2 Problem setup

Proposition 1

3 Adaptive SVRG

Lemma 1

Theorem 1

Theorem 2

4 Adaptive termination of inner-loop

Theorem 3

4.1 Two phase behaviour of AdaGrad

Theorem 4

4.2 Heuristic for adaptive termination

5 Experimental evaluation of AdaSVRG

5.1 Implementing AdaSVRG

5.2 Evaluating AdaSVRG

6 Heuristic for adaptivity to over-parameterization

6.1 Evaluating Algorithm 4

7 Discussion

Availability of data and materials

Code availability

Notes

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Conflict of interest

Ethics approval

Consent to participate

Consent for publication

Additional information

Publisher's Note

Appendices

Organization of the Appendix

Definitions

Algorithm in general case

Proof of Lemma 1

Lemma 2

Proof

Main proposition

Proposition 2

Proof

Proof of Theorem 1

Theorem 5

Proof

Corollary 1

Proof

Proof of Theorem 2

Proof

Proof of Theorem 3

Theorem 6

Proof

Proof of Theorem 4

Theorem 7

Proof

Corollary 2

Proof

Helper lemmas

Lemma 3

Proof

Lemma 4

Proof

Lemma 5

Proof

Counter-example for line-search for SVRG