1 Introduction

Many machine learning applications can be formulated as risk minimization problems, in which each data sample \({{\mathbf {z}}}\in \mathbb {R}^p\) is assumed to be generated by an underlying multivariate distribution \({\mathcal {D}}\). The loss function \(\ell (\cdot ; {{\mathbf {z}}}) : \mathbb {R}^d \rightarrow \mathbb {R}\) measures the performance on the sample \({{\mathbf {z}}}\), and its form depends on the specific application, e.g., the square loss for linear regression, the logistic loss for classification and the cross-entropy loss for training deep neural networks. The goal is to solve the following population risk minimization (PRM) problem over a certain parameter space \({\varOmega } \subset \mathbb {R}^d\).

$$\begin{aligned} \min _{{{\mathbf {w}}}\in {\varOmega }} \, f({{\mathbf {w}}}):= {\mathbb {E}}_{{{\mathbf {z}}}\sim {\mathcal {D}}} \,\ell ({{\mathbf {w}}}; {{\mathbf {z}}}). \end{aligned}$$
(PRM)

Directly solving the PRM can be difficult in practice, as either the distribution \({\mathcal {D}}\) is unknown or evaluation of the expectation of the loss function induces high computational cost. To avoid such difficulties, one usually samples a set of n data samples \(S := \{{{\mathbf {z}}}_1, \ldots , {{\mathbf {z}}}_n \}\) from the distribution \({\mathcal {D}}\), and instead solves the following empirical risk minimization (ERM) problem.

$$\begin{aligned} \min _{{{\mathbf {w}}}\in {\varOmega }} \, f_S({{\mathbf {w}}}):= \frac{1}{n} \sum _{k=1}^{n} \ell ({{\mathbf {w}}}; {{\mathbf {z}}}_k). \end{aligned}$$
(ERM)

The ERM serves as an approximation of the PRM with finite samples. In particular, when the number n of data samples is large, one wishes that the solution \({{\mathbf {w}}}_S\) found by optimizing the ERM with the data set S has a good generalization performance, i.e., it also induces a small loss on the population risk. The gap between these two risk functions is referred to as the generalization error at \({{\mathbf {w}}}_S\), and is formally written as

$$\begin{aligned} \text {(generalization error)}:= |f_S ({{\mathbf {w}}}_S) - f({{\mathbf {w}}}_S)|. \end{aligned}$$
(1)
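As a concrete illustration, the generalization error can be estimated numerically by evaluating the empirical risk on the training set and approximating the unknown population risk with a large held-out sample. A minimal Python sketch under a hypothetical scalar linear model (all names and data here are ours, for illustration only):

```python
import numpy as np

def empirical_risk(loss, w, samples):
    # f_S(w): average loss over a finite set of samples
    return np.mean([loss(w, z) for z in samples])

# Hypothetical setup: square loss for a scalar linear model, z = (x, y)
loss = lambda w, z: 0.5 * (w * z[0] - z[1]) ** 2

rng = np.random.default_rng(0)
draw = lambda n: [(x, 2.0 * x + rng.normal(scale=0.1)) for x in rng.normal(size=n)]

S = draw(50)            # training set of n = 50 samples
holdout = draw(100000)  # large fresh sample approximating the population risk f

w_S = 2.05              # pretend this was obtained by minimizing f_S
gen_error = abs(empirical_risk(loss, w_S, S) - empirical_risk(loss, w_S, holdout))
print(gen_error)
```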

Various theoretical frameworks have been established to study the generalization error from different aspects (see related work for references). This paper adopts the stability framework (Bousquet and Elisseeff 2002; Elisseeff et al. 2005), which has been applied to study the generalization property of the output produced by learning algorithms. More specifically, for a particular learning algorithm \({\mathcal {A}}\), its stability corresponds to how stable the output of the algorithm is with regard to the variations in the data set. As an example, consider two data sets S and \({\overline{S}}\) that differ at one data sample, and denote \({{\mathbf {w}}}_{S}\) and \({{\mathbf {w}}}_{{\overline{S}}}\) as the outputs of algorithm \({\mathcal {A}}\) when applied to solve the ERM with the data sets S and \({\overline{S}}\), respectively. Then, the stability of the algorithm measures the gap between the output function values of the algorithm on the perturbed data sets.

Recently, the stability framework has been further developed to study the generalization performance of the output produced by the stochastic gradient descent (SGD) method from various theoretical aspects (Hardt et al. 2016; Charles and Papailiopoulos 2017; Mou et al. 2017; Yin et al. 2017; Kuzborskij and Lampert 2017). These studies showed that the output of SGD can achieve a vanishing generalization error after multiple passes over the data set as the sample size \(n\rightarrow \infty\). These results provide partial theoretical justification for the success of SGD on training complex objectives such as deep neural networks.

However, as pointed out in Zhang et al. (2017), these bounds do not explain some experimental observations, e.g., they do not capture the change of the generalization performance as the fraction of random labels in training data changes. Thus, the aim of this paper is to develop better generalization bounds that incorporate both the optimization information of SGD and the underlying data distribution, so that they can explain experimental observations. We summarize our contributions as follows.

1.1 Our contributions

For smooth nonconvex optimization problems, we propose a new analysis of the on-average stability of SGD that exploits the optimization properties as well as the underlying data distribution. Specifically, via upper-bounding the on-average stability of SGD, we provide a novel generalization error bound, which improves upon the existing bounds by incorporating the on-average variance of the stochastic gradient. We further corroborate the connection of our bound to the generalization performance observed in the recent experiments in Zhang et al. (2017), which was not explained by the existing bounds of the same type. Specifically, our experiments demonstrate that the obtained generalization bound captures how the generalization error changes with the fraction of random labels via the on-average variance of SGD. Furthermore, our bound holds with probabilistic guarantee, which is statistically stronger than the bounds in expectation provided in, e.g., Hardt et al. (2016), Kuzborskij and Lampert (2017). Then, we study nonconvex optimization under the gradient dominance condition, and show that the corresponding generalization bound for SGD can be improved by its fast convergence rate.

We further consider nonconvex problems with strongly convex regularizers, and study the role that the regularization plays in characterizing the generalization error bound of the proximal SGD. Specifically, our generalization bound shows that a strongly convex regularizer substantially improves the generalization bound of SGD for nonconvex loss functions, making it as good as that for strongly convex loss functions. Furthermore, the uniform stability of SGD under a strongly convex regularizer yields a generalization bound for nonconvex problems with exponential concentration in probability. We also provide some experimental observations to support our result.

1.2 Related works

The stability approach was initially proposed by Bousquet and Elisseeff (2002) to study the generalization error, where various notions of stability were introduced to provide bounds on the generalization error with probabilistic guarantee. Elisseeff et al. (2005) further extended the stability framework to characterize the generalization error of randomized learning algorithms. Shalev-Shwartz et al. (2010) developed various properties of stability on learning problems. In Hardt et al. (2016), the authors first applied the stability framework to study the expected generalization error for SGD, and Kuzborskij and Lampert (2017) further provided a data dependent generalization error bound. In Mou et al. (2017), the authors studied the generalization error of SGD with additive Gaussian noise. Yin et al. (2017) studied the role that gradient diversity plays in characterizing the expected generalization error of SGD. All these works studied the expected generalization error of SGD. In Charles and Papailiopoulos (2017), the authors studied the generalization error of several first-order algorithms for loss functions satisfying the gradient dominance and the quadratic growth conditions. Poggio et al. (2011) studied the stability of online learning algorithms. This paper improves the existing bounds by incorporating the on-average variance of SGD into the generalization error bound and further corroborates its connection to the generalization performance via experiments. More detailed comparisons with the existing bounds are given after the presentation of the main results.

The PAC Bayesian theory (Valiant 1984; McAllester 1999) is another popular framework for studying the generalization error in machine learning. It was recently used to develop bounds on the generalization error of SGD (London 2017; Mou et al. 2017). Specifically, Mou et al. (2017) applied the PAC Bayesian theory to study the generalization error of SGD with additive Gaussian noise. London (2017) combined the stability framework with the PAC Bayesian theory and provided a bound with probabilistic guarantee on the generalization error of SGD. The bound incorporates the divergence between the prior distribution and the posterior distribution of the parameters.

Recently, Russo and Zou (2016), Xu and Raginsky (2017) applied information-theoretic tools to characterize the generalization capability of learning algorithms, and Pensia et al. (2018) further extended the framework to study the generalization error of various first-order algorithms with noisy updates. Other approaches were also developed for characterizing the generalization error as well as the estimation error, which include, for example, the algorithm robustness framework (Xu and Mannor 2012; Zahavy et al. 2017), large margin theory (Bartlett et al. 2017; Neyshabur et al. 2018; Sokolić et al. 2017) and the classical VC theory (Vapnik 1995; Vapnik 1998). Also, some methods have been developed to study the excess risk of the output of a learning algorithm, which include the robust stochastic approach (Nemirovski et al. 2009), the sample average approximation approach (Shapiro and Nemirovski 2005; Lin and Rosasco 2017), etc.

2 Preliminary and on-average stability

Consider applying SGD to solve the empirical risk minimization (ERM) with a particular data set S. In particular, at each iteration t, the algorithm samples one data sample from the data set S uniformly at random. Denote the index of the data sample drawn at the t-th iteration as \({\xi }_t\). Then, with a stepsize sequence \(\{\alpha _t\}_t\) and a fixed initialization \({{\mathbf {w}}}_{0} \in \mathbb {R}^d\), the update rule of SGD can be written as, for \(t = 0, \ldots , T-1\),

$$\begin{aligned} {{\mathbf {w}}}_{t+1} = {{\mathbf {w}}}_{t} - \alpha _t \nabla \ell ({{\mathbf {w}}}_{t}; {{\mathbf {z}}}_{\xi _t}). \end{aligned}$$
(SGD)

Throughout the paper, we denote the iterate sequence along the optimization path as \(\{{{\mathbf {w}}}_{t, S}\}_t\), where S in the subscript indicates that the sequence is generated by the algorithm using the data set S. The stepsize sequence \(\{\alpha _t\}_t\) is a decreasing and positive sequence, and typical choices for SGD are \(\frac{1}{t}\) and \(\frac{1}{t\log t}\) (Bottou 2010), which we adopt in our study.
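As a minimal sketch of the above update (the gradient oracle grad_loss is a placeholder of ours for \(\nabla \ell\); the step size shown is the one used later in Theorem 1):

```python
import numpy as np

def sgd(grad_loss, S, w0, T, c=0.1, rng=None):
    """Run T iterations of SGD on the ERM with data set S.

    grad_loss(w, z) should return the gradient of the loss at w on sample z.
    """
    rng = rng if rng is not None else np.random.default_rng()
    w = np.asarray(w0, dtype=float)
    for t in range(T):
        alpha_t = c / ((t + 2) * np.log(t + 2))  # decreasing step size
        xi_t = rng.integers(len(S))              # uniform sampling from S
        w = w - alpha_t * grad_loss(w, S[xi_t])  # the (SGD) update
    return w
```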

Clearly, the output \({{\mathbf {w}}}_{T, S}\) is determined by the data set S and the sample path \(\varvec{\xi }:=\{{\xi _0}, \ldots , {\xi _{T-1}} \}\) of SGD. We are interested in the generalization error of the T-th output generated by SGD, i.e., \(|f_S ({{\mathbf {w}}}_{T,S}) - f ({{\mathbf {w}}}_{T,S})|\), and we adopt the following standard assumptions (Hardt et al. 2016; Kuzborskij and Lampert 2017) on the loss function \(\ell\) throughout the paper.

Assumption 1

For all \({{\mathbf {z}}}\sim {\mathcal {D}}\), the loss function satisfies:

  1. Function \(\ell (\cdot \,;{{\mathbf {z}}})\) is continuously differentiable;

  2. Function \(\ell (\cdot \,;{{\mathbf {z}}})\) is nonnegative and \(\sigma\)-Lipschitz, and \(|\ell (\cdot \,;{{\mathbf {z}}})|\) is uniformly bounded by M;

  3. The gradient \(\nabla \ell (\cdot \,;{{\mathbf {z}}})\) is L-Lipschitz, and \(\Vert \nabla \ell (\cdot \,;{{\mathbf {z}}}) \Vert\) is uniformly bounded by \(\sigma\), where \(\Vert \cdot \Vert\) denotes the \(\ell _2\) norm.

The generalization error of SGD can be viewed as a nonnegative random variable whose randomness is due to the draw of the data set S and the sample path \(\varvec{\xi }\) of the algorithm. In particular, the mean square generalization error has been studied in Elisseeff et al. (2005) for general randomized learning algorithms. Specifically, an application of Lemma 11 (Elisseeff et al. 2005) to SGD under Assumption 1 yields the following result. Throughout the paper, we denote \({\overline{S}}\) as the data set that replaces one data sample of S with an i.i.d. copy generated from the distribution \({\mathcal {D}}\) and denote \({{\mathbf {w}}}_{T, {\overline{S}}}\) as the output of SGD for solving the ERM with the data set \({\overline{S}}\).

Proposition 1

Let Assumption 1 hold. Apply the SGD with the same sample path \(\varvec{\xi }\) to solve the ERM with the data sets S and \({\overline{S}}\), respectively. Then, the mean square generalization error of SGD satisfies

$$\begin{aligned}&{\mathbb {E}}[|f_S ({{\mathbf {w}}}_{T,S}) - f ({{\mathbf {w}}}_{T,S})|^2] \le \frac{2M^2}{n} + 12M\sigma {\mathbb {E}}[\delta _{T,S,{\overline{S}}}], \end{aligned}$$

where \(\delta _{T,S,{\overline{S}}} := \Vert {{\mathbf {w}}}_{T, S} - {{\mathbf {w}}}_{T, {\overline{S}}} \Vert\) and the expectation is taken over the random variables \({\overline{S}}, S\) and \(\varvec{\xi }\).

Proposition 1 links the mean square generalization error of SGD to the quantity \({\mathbb {E}}_{\varvec{\xi }, S, {\overline{S}}} [\delta _{T,S,{\overline{S}}}]\). Intuitively, \(\delta _{T,S,{\overline{S}}}\) captures the variation of the algorithm output with regard to the variation of the dataset. Hence, its expectation can be understood as the on-average stability of the iterates generated by SGD. We note that similar notions of stabilities were proposed in Kuzborskij and Lampert (2017), Shalev-Shwartz et al. (2010), Elisseeff et al. (2005), which are based on the variation of the function values at the output instead.
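The on-average iterate stability can be probed empirically: run SGD with a shared sample path on two data sets that differ in one sample, and compare the final iterates. A minimal sketch, reusing the hypothetical sgd routine from the earlier sketch (fixing the seed shares the sample path \(\varvec{\xi }\) across the two runs):

```python
import copy
import numpy as np

def iterate_stability(grad_loss, S, fresh_sample, w0, T, seed=0):
    """Estimate delta_{T,S,S_bar} = ||w_{T,S} - w_{T,S_bar}|| for one replacement."""
    S_bar = copy.deepcopy(S)
    S_bar[0] = fresh_sample  # replace one data sample with an i.i.d. copy
    w_S = sgd(grad_loss, S, w0, T, rng=np.random.default_rng(seed))
    w_Sb = sgd(grad_loss, S_bar, w0, T, rng=np.random.default_rng(seed))
    return np.linalg.norm(w_S - w_Sb)
```

Averaging this quantity over many draws of S, the replaced sample and the seed gives a Monte Carlo estimate of the on-average stability appearing in Proposition 1.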

3 Generalization bound for SGD in nonconvex optimization

In this section, we develop the generalization error bound of SGD by characterizing the corresponding on-average stability of the algorithm.

An intrinsic quantity that affects the optimization path of SGD is the variance of the stochastic gradients. To capture the impact of the variance of the stochastic gradients, we adopt the following standard assumption from the stochastic optimization theory (Bottou 2010; Nemirovski et al. 2009; Ghadimi et al. 2016).

Assumption 2

For any fixed training set S and any \(\xi\) that is generated uniformly from \(\{1, \ldots , n\}\) at random, there exists a constant \(\nu _S > 0\) such that for all \({{\mathbf {w}}}\in {\varOmega }\) one has

$$\begin{aligned} {\mathbb {E}}_{\xi } \left\| \nabla \ell ({{\mathbf {w}}}; {{\mathbf {z}}}_{\xi }) - \frac{1}{n} \sum _{k=1}^n \nabla \ell ({{\mathbf {w}}}; {{\mathbf {z}}}_{k})\right\| ^2 \le \nu _S^2. \end{aligned}$$
(2)

Assumption 2 essentially bounds the variance of the stochastic gradients for the particular data set S. The variance \(\nu _S^2\) of the stochastic gradient is typically much smaller than the uniform upper bound \(\sigma\) in Assumption 1 on the norm of the stochastic gradient (e.g., a standard normal random variable has unit variance but is unbounded), and hence may lead to a tighter bound on the generalization error.
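The left-hand side of (2) can be evaluated directly from the data at any fixed point \({{\mathbf {w}}}\); a minimal sketch of such a pointwise estimate (names are ours):

```python
import numpy as np

def gradient_variance(grad_loss, w, S):
    """Mean squared deviation of per-sample gradients from the full gradient."""
    grads = np.array([np.atleast_1d(grad_loss(w, z)) for z in S])  # shape (n, d)
    full_grad = grads.mean(axis=0)  # (1/n) sum_k grad l(w; z_k)
    return np.mean(np.sum((grads - full_grad) ** 2, axis=1))
```

Taking the maximum of this quantity over the iterates visited by SGD (or over a grid of points in \({\varOmega }\)) gives an empirical proxy for \(\nu _S^2\).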

Based on Assumptions 1 and 2, we obtain the following generalization bound of SGD by exploring its optimization path to study the corresponding stability.

Theorem 1

(Bound with Probabilistic Guarantee) Suppose \(\ell\) is nonconvex. Let Assumptions 1 and 2 hold. Apply the SGD to solve the ERM with the data set S, and choose the step size \(\alpha _t = \frac{c}{(t+2)\log (t+2)}\) with \(0<c<\frac{1}{L}\). Then, for any \(\delta >0\), with probability at least \(1-\delta\), we have

$$\begin{aligned}&|f_S ({{\mathbf {w}}}_{T,S}) - f ({{\mathbf {w}}}_{T,S})| \\&\quad \le \sqrt{\frac{1}{n\delta } \left( 2M^2+ 24M\sigma c\sqrt{2Lf({{\mathbf {w}}}_{0}) + \frac{1}{2}{\mathbb {E}}_S[\nu _S^2]} \log T\right) }. \end{aligned}$$

An important variable in the above generalization bound is the on-average stochastic variance \({\mathbb {E}}_S[\nu _S^2]\). We can compare the above bound with the generalization bounds developed in the recent literature. Specifically, Hardt et al. (2016), Kuzborskij and Lampert (2017), Yin et al. (2017) all developed bounds for the expected generalization error of SGD with the step size \(\alpha _t=\frac{c}{t}\), while our generalization bound in the above theorem is probabilistic and hence provides a stronger guarantee, at the cost of a slightly smaller step size \(\alpha _t = \frac{c}{(t+2)\log (t+2)}\). The generalization bound in Hardt et al. (2016) is based on the uniform stability \(\sup _{S, {\overline{S}}}{\mathbb {E}}_{\varvec{\xi }} [\delta _{T,S,{\overline{S}}}]\) and assumes an upper bound \(\sigma\) on the norm of all gradients. Kuzborskij and Lampert (2017) developed a data-dependent bound on the expected generalization error by leveraging the notion of on-average stability, under an additional assumption on the Lipschitz continuity of the Hessian matrix. Yin et al. (2017) characterized the expected generalization error of SGD using the notions of uniform stability and gradient diversity, but their analysis requires the loss function to be (strongly) convex. In comparison, our generalization bound is based on the more relaxed on-average stability \({\mathbb {E}}_{S, {\overline{S}}}{\mathbb {E}}_{\varvec{\xi }} [\delta _{T,S,{\overline{S}}}]\), which allows us to introduce the on-average variance; this quantity is generally smaller and hence tighter than the uniform gradient bound \(\sigma\) used in Hardt et al. (2016). Moreover, the generalization error bounds in all these works have a polynomial dependence on T, whereas our generalization error bound only scales with \(\log T\). Next, we outline the proof of Theorem 1 below and discuss other implications.

Outline of the Proof of Theorem 1

We provide an outline of the proof here, and relegate the detailed proof to the supplementary materials.

The central idea is to bound the on-average stability \({\mathbb {E}}_{S, {\overline{S}}, \varvec{\xi }} [\delta _{T, S, {\overline{S}}}]\) of the iterates in Proposition 1. To this end, suppose we apply SGD with the same sample path \(\varvec{\xi }\) to solve the ERM with the data sets S and \({\overline{S}}\), respectively. We first obtain the following recursive property of the on-average iterate stability (Lemma 2 in the appendix):

$$\begin{aligned} {\mathbb {E}}_{S, {\overline{S}}, \varvec{\xi }} [\delta _{t+1, S, {\overline{S}}}]&\le (1 + \alpha _t L) {\mathbb {E}}_{S, {\overline{S}}, \varvec{\xi }} [\delta _{t, S, {\overline{S}}}] \nonumber \\&\quad + \frac{2\alpha _t}{n}{\mathbb {E}}_{S, \varvec{\xi }} \left[ \left\| \nabla \ell ({{\mathbf {w}}}_{t, S}; {{\mathbf {z}}}_{1}) \right\| \right] . \end{aligned}$$
(3)

We then further derive the following bound on \({\mathbb {E}}_{S, \varvec{\xi }} \left[ \Vert \nabla \ell ({{\mathbf {w}}}_{t, S}; {{\mathbf {z}}}_{1}) \Vert \right]\) by exploiting the optimization path of SGD (Lemma 3 in the appendix):

$$\begin{aligned}&{\mathbb {E}}_{\varvec{\xi }, S} \left[ \Vert \nabla \ell ({{\mathbf {w}}}_{t, S}; {{\mathbf {z}}}_{1}) \Vert \right] \le \sqrt{2Lf({{\mathbf {w}}}_{0}) + \frac{1}{2}{\mathbb {E}}_{S}[\nu _S^2]}. \end{aligned}$$
(4)

Substituting (4) into (3) and telescoping, we obtain an upper bound on \({\mathbb {E}}_{S, {\overline{S}}, \varvec{\xi }} [\delta _{T, S, {\overline{S}}}]\). Further substituting such a bound into Proposition 1, we obtain an upper bound on the second moment of the generalization error. Then, the result in Theorem 1 follows from Chebyshev’s inequality. \(\square\)

The proof of Theorem 1 characterizes the on-average stability of SGD, exploring the optimization path with technical tools developed in stochastic optimization theory. Compared to the generalization bound developed in Hardt et al. (2016) that characterizes the expected generalization error based on the uniform stability \(\sup _{S, {\overline{S}}}{\mathbb {E}}_{\varvec{\xi }} [\delta _{T,S,{\overline{S}}}]\), our generalization bound in Theorem 1 provides a probabilistic guarantee, and is based on the more relaxed on-average stability \({\mathbb {E}}_{S, {\overline{S}}}{\mathbb {E}}_{\varvec{\xi }} [\delta _{T,S,{\overline{S}}}]\), which yields a tighter bound. Intuitively, the on-average variance term \({{\mathbb {E}}_{S}[\nu _S^2]}\) in our bound measures the ‘stability’ of the stochastic gradients over all realizations of the dataset S. If such on-average variance of SGD is large, then the optimization paths of SGD on two slightly different datasets diverge from each other, leading to poor stability of SGD and in turn a large generalization error.

Remark on optimization convergence rate: We note that the generalization error bound in Theorem 1 is derived based on the step size \(\alpha _t = c / [(t+2) \log (t+2) ]\). With this step size, the standard nonconvex optimization convergence rate of SGD (Bottou 2010) is in the order of

$$\begin{aligned} \frac{\sum _{t=0}^{T-1} \alpha _t^2}{\sum _{t=0}^{T-1} \alpha _t} = O\left( \frac{1}{\log \log T}\right) , \end{aligned}$$

which is very slow. However, it is possible to choose a proper step size to achieve a similar generalization error bound and a faster optimization convergence rate. Specifically, one can choose \(\alpha _t = \frac{c}{t+2}\) with constant \(c=\frac{\log \log (T+2)}{2L\log (T+2)}\), where T is the total number of iterations. Then, following the same proof of Theorem 1, one can instead prove the following stability bound

$$\begin{aligned} {\mathbb {E}}[\delta _T] \le \frac{2\sqrt{2Lf({{\mathbf {w}}}_0) + \frac{{\mathbb {E}}[\nu _S^2]}{2}}}{nL} (T+2)^{2cL} = O \left( \frac{\sqrt{Lf({{\mathbf {w}}}_0) + {\mathbb {E}}[\nu _S^2]}}{nL} \log (T+2)\right) . \end{aligned}$$
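To make the logarithmic scaling explicit, note that with this choice of c,

$$\begin{aligned} (T+2)^{2cL} = \exp \left( 2cL\log (T+2)\right) = \exp \left( \log \log (T+2)\right) = \log (T+2). \end{aligned}$$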

Therefore, the generalization error bound still scales with \(\log T\). On the other hand, the optimization convergence rate is now in the order of

$$\begin{aligned} \frac{\sum _{t=0}^{T-1} \alpha _t^2}{\sum _{t=0}^{T-1} \alpha _t} = O\left( \frac{\log \log (T+2)}{\log ^2(T+2)} \right) . \end{aligned}$$

Remark on choice of step size: One can also adopt a constant step size in Theorem 1, which will lead to a very different line of proof and a different final bound. In this case, one can choose a sufficiently small constant step size (with polynomial dependence on the total number of iterations T) and obtain a comparable generalization bound.

Discussion: We next elaborate on how our generalization bound can help explain the observations in classification experiments with randomized labels (Zhang et al. 2017). Specifically, consider the following binary classification problem

$$\begin{aligned} \min _{{{\mathbf {w}}}\in {\mathbb {R}}^d} \frac{1}{N}\sum _{i=1}^{N} \ell _i ({{\mathbf {w}}}) = \frac{1}{N}\sum _{i=1}^{N} \exp (-y_i {{\mathbf {w}}}^T {{\mathbf {x}}}_i), \end{aligned}$$

where \({{\mathbf {w}}}\) corresponds to the linear classifier and \(({{\mathbf {x}}}_i, y_i)\) denotes the i-th data sample (\(y_i\) is a binary label). Consider a simplified case where the feature dimension \(d=1\) and the total sample size \(N = 2n\). Assume the features \(x_1 = x_2 = \cdots = x_n = 1\) and \(x_{n+1} = x_{n+2} = \cdots = x_{2n} = -1\). In particular, assume that a fraction \(\alpha \in (0, 0.5)\) of the 2n samples have incorrect labels, i.e., a fraction \(\alpha\) of the samples in \(\{x_1, x_2,\ldots ,x_n\}\) are incorrectly labeled as ‘\(-1\)’ (true label is ‘+1’) and a fraction \(\alpha\) of the samples in \(\{x_{n+1}, x_{n+2},\ldots ,x_{2n}\}\) are incorrectly labeled as ‘+1’ (true label is ‘\(-1\)’). In this setting, it can be calculated that the full gradient of the empirical loss is \(\nabla f_S({{\mathbf {w}}}) = \alpha \exp ({{\mathbf {w}}}) - (1-\alpha ) \exp (-{{\mathbf {w}}})\). Then, the empirical gradient variance at any classifier \({{\mathbf {w}}}\) takes the value

$$\begin{aligned} {\mathbb {E}}[\nu _S^2]&= \frac{1}{N} \sum _{i=1}^N \left\| \nabla \ell _i({{\mathbf {w}}}) - \nabla f_S({{\mathbf {w}}}) \right\| ^2 \nonumber \\&= \frac{1}{N} \left( \sum _{\begin{array}{c} i\in \{1,\ldots ,n\} \\ y_i = 1 \end{array}} + \sum _{\begin{array}{c} i\in \{1,\ldots ,n\} \\ y_i = -1 \end{array}} + \sum _{\begin{array}{c} i\in \{n+1,\ldots ,2n\} \\ y_i = 1 \end{array}} + \sum _{\begin{array}{c} i\in \{n+1,\ldots ,2n\} \\ y_i = -1 \end{array}}\right) \left\| \nabla \ell _i({{\mathbf {w}}}) - \nabla f_S({{\mathbf {w}}})\right\| ^2 \nonumber \\&= \alpha (1-\alpha ) \left( \exp ({{\mathbf {w}}}) + \exp (-{{\mathbf {w}}})\right) ^2. \end{aligned}$$

Hence, as the random label probability \(\alpha\) increases (from 0 to 0.5), the above empirical gradient variance keeps increasing, and the generalization error increases with it. In particular, the maximum variance is achieved when half of the data are incorrectly labeled, which corresponds to the maximum classification uncertainty. This example shows that the gradient variance term in our Theorem 1 properly captures the impact of the data distribution on the generalization performance. We note that one can generalize this example to a high dimensional space \(d>1\) where the features follow two distinct normal distributions; the conclusion remains the same but requires more involved calculations.
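The closed form above is easy to verify numerically. A minimal sketch that compares the direct per-sample computation against the formula \(\alpha (1-\alpha )(e^{{{\mathbf {w}}}} + e^{-{{\mathbf {w}}}})^2\) (the values of w, n and \(\alpha\) are arbitrary choices of ours):

```python
import numpy as np

def toy_gradient_variance(w, n, alpha):
    """Per-sample gradients of exp(-y_i * w * x_i) for the 1-D example above."""
    n_bad = int(alpha * n)  # mislabeled samples per feature group
    # Correctly labeled samples contribute gradient -exp(-w) in both groups;
    # mislabeled samples contribute gradient exp(w) in both groups.
    grads = np.array([-np.exp(-w)] * (n - n_bad) + [np.exp(w)] * n_bad
                     + [-np.exp(-w)] * (n - n_bad) + [np.exp(w)] * n_bad)
    return np.mean((grads - grads.mean()) ** 2)

w, n, alpha = 0.3, 1000, 0.2
print(toy_gradient_variance(w, n, alpha))                   # empirical variance
print(alpha * (1 - alpha) * (np.exp(w) + np.exp(-w)) ** 2)  # closed form
```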

4 Generalization bound for SGD under gradient dominance condition

In this section, we consider nonconvex loss functions with the empirical risk function \(f_S\) further satisfying the following gradient dominance condition.

Definition 1

Denote \(f^* := \inf _{{{\mathbf {w}}}\in {\varOmega }} f({{\mathbf {w}}})\). Then, the function f is said to be \(\gamma\)-gradient dominant for \(\gamma >0\) if

$$\begin{aligned} f({{\mathbf {w}}}) - f^* \le \frac{1}{2\gamma } \Vert \nabla f({{\mathbf {w}}}) \Vert ^2, \,\forall {{\mathbf {w}}}\in {\varOmega }. \end{aligned}$$
(5)

The gradient dominance condition (also referred to as the Polyak-Łojasiewicz condition (Polyak 1963; Łojasiewicz 1963)) guarantees a linear convergence of the function value sequence generated by gradient-based first-order methods (Karimi et al. 2016). It is much weaker than strong convexity, and many nonconvex machine learning problems satisfy this condition around the global minimizers (Li et al. 2016; Zhou et al. 2016).
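A standard example of a gradient dominant function that is not strongly convex (see, e.g., Karimi et al. 2016) is the least-squares objective with a rank-deficient matrix A:

$$\begin{aligned} f({{\mathbf {w}}}) = \frac{1}{2}\Vert A{{\mathbf {w}}}- {{\mathbf {b}}} \Vert ^2, \qquad \Vert \nabla f({{\mathbf {w}}}) \Vert ^2 = \Vert A^T(A{{\mathbf {w}}}- {{\mathbf {b}}}) \Vert ^2 \ge 2\gamma \left( f({{\mathbf {w}}}) - f^*\right) , \end{aligned}$$

where \(\gamma\) is the smallest nonzero eigenvalue of \(A^TA\). Hence f satisfies (5) with this \(\gamma\), although f is not strongly convex whenever \(A^TA\) is singular.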

The gradient dominance condition helps to improve the bound on the on-average stochastic gradient norm \({\mathbb {E}}_{\varvec{\xi }, S} \left[ \Vert \nabla \ell ({{\mathbf {w}}}_{t, S}; {{\mathbf {z}}}_{1}) \Vert \right]\) (see Lemma 4 in the appendix), which is given by

$$\begin{aligned} {\mathbb {E}}_{\varvec{\xi }, S}&\left[ \Vert \nabla \ell ({{\mathbf {w}}}_{t, S}; {{\mathbf {z}}}_{1}) \Vert \right] \le \sqrt{2L {\mathbb {E}}_{S} [f_S^*] + \frac{1}{t} \left( 2Lf({{\mathbf {w}}}_{0}) + {\mathbb {E}}_S[\nu _S^2] \right) }. \end{aligned}$$
(6)

Compared to (4) for general nonconvex functions, the above bound is further improved by a factor of \(\frac{1}{t}\). This is because SGD converges sub-linearly to the optimal function value \(f_S^*\) under the gradient dominance condition, and \(\frac{1}{t}\) is essentially the convergence rate of SGD. In particular, for sufficiently large t, the on-average stochastic gradient norm is essentially bounded by \(\sqrt{2L {\mathbb {E}}_{S} [f_S^*]}\), which is much smaller than the bound in (4). With the bound in (6), we obtain the following theorem.

Theorem 2

(Mean Square Bound) Suppose \(\ell\) is nonconvex, and \(f_S\) is \(\gamma\)-gradient dominant (\(\gamma <L\)). Let Assumptions 1 and 2 hold. Apply the SGD to solve the ERM with the data set S and choose \(\alpha _t = \frac{c}{(t+2)\log (t+2)}\) with \(0<c<\min \{\frac{1}{L}, \frac{1}{2\gamma } \}\). Then, the following bound holds.

$$\begin{aligned} {\mathbb {E}}_{\varvec{\xi }, S} [|f_S ({{\mathbf {w}}}_{T,S}) - f ({{\mathbf {w}}}_{T,S})|^2] \le \frac{2M^2}{n} + \frac{24M\sigma c}{n} \left( \sqrt{2L {\mathbb {E}}_{S} [f_S^*]}\log T + \sqrt{2Lf({{\mathbf {w}}}_{0}) + 2{\mathbb {E}}_S[\nu _S^2]}\right) . \end{aligned}$$

The above bound for the mean square generalization error under the gradient dominance condition improves that for general nonconvex functions in Theorem 1, as the dominant term (i.e., the \(\log T\)-dependent term) has coefficient \(\sqrt{2L{\mathbb {E}}_S[f_S^*]}\), which is much smaller than the term \(\sqrt{2L f({{\mathbf {w}}}_{0}) + \frac{1}{2}{\mathbb {E}}_S[\nu _S^2]}\) in the bound of Theorem 1. Intuitively, the on-average variance of SGD is further reduced by its fast convergence rate \(\frac{1}{t}\) under the gradient dominance condition. This leads to better on-average iterate stability, which in turn improves the mean square generalization error. We note that Charles and Papailiopoulos (2017) also studied the generalization error of SGD for loss functions satisfying both the gradient dominance condition and an additional quadratic growth condition. They also assumed that the algorithm converges to a global minimizer, which may not always hold for noisy algorithms like SGD.

Remark on optimization convergence rate: The optimization convergence rate of SGD under the gradient dominance condition has been characterized in Theorem 4 of Karimi et al. (2016). In particular, with the step size \(\alpha _t = O(\frac{1}{t})\), Karimi et al. (2016) proved that the convergence rate of SGD is in the order of \({\mathbb {E}} [f_S({{\mathbf {w}}}_{t,S}) - f_S^*] \le O(\frac{1}{t})\). Note that the generalization error bound in Theorem 2 is derived based on a slightly smaller step size \(\alpha _t = O(\frac{1}{t\log t})\), which leads to the same order of convergence rate \({\widetilde{O}}(\frac{1}{t})\) up to logarithmic factors. Hence, under the gradient dominance condition, SGD achieves a small generalization error as well as fast convergence.

Theorem 2 directly implies the following probabilistic guarantee for the generalization error of SGD.

Theorem 3

(Bound with Probabilistic Guarantee) Suppose \(\ell\) is nonconvex, and \(f_S\) is \(\gamma\)-gradient dominant (\(\gamma <L\)). Let Assumptions 1 and 2 hold. Apply the SGD to solve the ERM with the data set S, and choose \(\alpha _t = \frac{c}{(t+2)\log (t+2)}\) with \(0<c<\min \{\frac{1}{L}, \frac{1}{2\gamma } \}\). Then, for any \(\delta > 0\), with probability at least \(1-\delta\), we have

$$\begin{aligned} |f_S ({{\mathbf {w}}}_{T,S}) - f ({{\mathbf {w}}}_{T,S})| \le \sqrt{ \frac{2M^2}{n\delta } + \frac{24M\sigma c}{n\delta } \left( \sqrt{2L {\mathbb {E}}_{S} [f_S^*]}\log T + \sqrt{2Lf({{\mathbf {w}}}_{0}) + 2{\mathbb {E}}_S[\nu _S^2]} \right) }. \end{aligned}$$

5 Regularized nonconvex optimization

In practical applications, regularization is usually applied to the risk minimization problem in order to either promote certain structures on the desired solution or to restrict the parameter space. In this section, we explore how regularization can improve the generalization error, and hence help to avoid overfitting for SGD.

Here, for any weight \(\lambda > 0\), we consider the regularized population risk minimization (R-PRM) and the regularized empirical risk minimization (R-ERM):

$$\begin{aligned}&\min _{{{\mathbf {w}}}\in {\varOmega }} \, {\varPhi }({{\mathbf {w}}}) := f({{\mathbf {w}}}) + \lambda h({{\mathbf {w}}}), \end{aligned}$$
(R-PRM)
$$\begin{aligned}&\min _{{{\mathbf {w}}}\in {\varOmega }} \, {\varPhi }_S({{\mathbf {w}}}) := f_S({{\mathbf {w}}}) + \lambda h({{\mathbf {w}}}), \end{aligned}$$
(R-ERM)

where h corresponds to the regularizer and \(f, f_S\) are the population and empirical risks, respectively. In particular, we are interested in the following class of regularizers.

Assumption 3

The regularizer function h is 1-strongly convex and nonnegative.

Without loss of generality, we assume that the strong convexity parameter of h is 1, as it can be adjusted by scaling the weight parameter \(\lambda\). Strongly convex regularizers are commonly used in machine learning applications, and typical examples include \(\frac{\lambda }{2}\Vert {{\mathbf {w}}} \Vert ^2\) for ridge regression, the Tikhonov regularization \(\frac{\lambda }{2}\Vert {\varGamma }{{\mathbf {w}}} \Vert ^2\) and the elastic net \(\lambda _1\Vert {{\mathbf {w}}} \Vert _1 + \lambda _2\Vert {{\mathbf {w}}} \Vert ^2\). Here, we allow the regularizer h to be non-differentiable (e.g., the elastic net), and introduce the following proximal mapping with parameter \(\alpha >0\) to deal with the non-smoothness.

$$\begin{aligned} {\mathrm {prox}}_{\alpha h}({{\mathbf {w}}}) := \arg \min _{{{\mathbf {u}}}\in {\varOmega }} h({{\mathbf {u}}}) + \frac{1}{2\alpha } \Vert {{\mathbf {u}}}- {{\mathbf {w}}} \Vert ^2. \end{aligned}$$
(7)

The proximal mapping is the core of the proximal method for solving convex problems (Parikh and Boyd 2014; Beck and Teboulle 2009) and nonconvex ones (Li et al. 2017; Attouch et al. 2013). In particular, we apply the proximal SGD to solve the R-ERM. With the same notation as in the previous section, the update rule of the proximal SGD can be written as, for \(t = 0, \ldots , T-1\),

$$\begin{aligned} {{\mathbf {w}}}_{t+1} = {\mathrm {prox}}_{\alpha _t h}\left( {{\mathbf {w}}}_{t} - \alpha _t \nabla \ell ({{\mathbf {w}}}_{t}; {{\mathbf {z}}}_{\xi _t})\right) . \end{aligned}$$
(proximal-SGD)

Similarly, we denote \(\{{{\mathbf {w}}}_{t, S}\}_t\) as the iterate sequence generated by the proximal SGD with the data set S.
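As an illustration, the following minimal sketch implements the proximal SGD update with the elastic net regularizer \(h({{\mathbf {u}}}) = \lambda _1\Vert {{\mathbf {u}}} \Vert _1 + \lambda _2\Vert {{\mathbf {u}}} \Vert ^2\), whose proximal mapping has a closed form (soft-thresholding followed by shrinkage). The naming and default parameters are ours, not the authors' implementation:

```python
import numpy as np

def prox_elastic_net(w, alpha, lam1, lam2):
    """prox_{alpha h} for h(u) = lam1*||u||_1 + lam2*||u||^2 (closed form)."""
    soft = np.sign(w) * np.maximum(np.abs(w) - alpha * lam1, 0.0)  # l1 part
    return soft / (1.0 + 2.0 * alpha * lam2)                       # squared l2 part

def proximal_sgd(grad_loss, S, w0, T, c=0.1, lam1=0.01, lam2=0.5, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    w = np.asarray(w0, dtype=float)
    for t in range(T):
        alpha_t = c / (t + 2)                            # step size from Theorem 4
        xi_t = rng.integers(len(S))
        grad_step = w - alpha_t * grad_loss(w, S[xi_t])  # stochastic gradient step
        w = prox_elastic_net(grad_step, alpha_t, lam1, lam2)
    return w
```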

It is clear that the generalization error of the function value for the regularized risk minimization, i.e., \(|{\varPhi }({{\mathbf {w}}}_{T, S}) - {\varPhi }_S({{\mathbf {w}}}_{T, S})|\), is the same as that for the un-regularized risk minimization. Hence, Theorem 1 also applies to the mean square generalization error of the regularized risk minimization. However, the development of the generalization error bound differs from the analysis in the previous section in two aspects. First, the analysis of the on-average iterate stability of the proximal SGD is technically more involved than that of SGD due to the possibly non-smooth regularizer. Second, the proximal mappings of strongly convex functions are strictly contractive (see item 2 of Proposition 5 in the appendix). Thus, the proximal step in the proximal SGD enhances the stability between the iterates \({{\mathbf {w}}}_{t, S}\) and \({{\mathbf {w}}}_{t, {\overline{S}}}\) that are generated by the algorithm using perturbed datasets, and this further improves the generalization error. The next result provides a quantitative statement.

Theorem 4

Consider the regularized risk minimization. Suppose \(\ell\) is nonconvex. Let Assumptions 1, 2 and 3 hold, and apply the proximal SGD to solve the R-ERM with the dataset S. Let \(\lambda > L\) and \(\alpha _t = \frac{c}{t+2}\) with \(0<c<\frac{1}{L}\). Then, the following bound holds with probability at least \(1- \delta\).

$$\begin{aligned} |{\varPhi }({{\mathbf {w}}}_{T, S})&- {\varPhi }_S({{\mathbf {w}}}_{T, S})| \le \sqrt{\frac{1}{n\delta } \left( 2M^2 + \frac{24M\sigma }{(\lambda - L)} \sqrt{L{\varPhi }({{\mathbf {w}}}_{0}) + {\mathbb {E}}_{S}[\nu _S^2]} \right) }. \end{aligned}$$

Theorem 4 provides a probabilistic guarantee for the generalization error of the proximal SGD in terms of the on-average variance of the stochastic gradients. Comparing Theorem 4 with Theorem 1 indicates that a strongly convex regularizer substantially improves the generalization error bound of SGD for nonconvex loss functions by removing the logarithmic dependence on T. It is also interesting to compare Theorem 4 with [Proposition 4 and Theorem 1, London 2017], which characterize the generalization error of SGD for strongly convex functions with probabilistic guarantee. The two bounds have the same order in terms of n and T, indicating that a strongly convex regularizer improves the generalization error for a nonconvex loss function to match that for a strongly convex one. In practice, the regularization weight \(\lambda\) should be properly chosen to balance between the generalization error and the training loss, as otherwise the parameter space can be too restrictive to yield a good solution for the risk function.

5.1 Generalization bound with high-probability guarantee

The previous sections established probabilistic guarantees for the generalization errors of nonconvex loss functions and nonconvex loss functions with strongly convex regularizers. For example, if we apply SGD to solve a generic nonconvex problem, then Theorem 1 suggests that for any \(\epsilon > 0\),

$$\begin{aligned} {\mathbb {P}} (|f({{\mathbf {w}}}_{T, S}) - f_S({{\mathbf {w}}}_{T, S})| > \epsilon ) < O\left( \frac{\log T}{n\epsilon ^2}\right) , \end{aligned}$$

which decays sublinearly as \(\frac{n}{\log T} \rightarrow \infty\). In this subsection, we study a stronger probabilistic guarantee for the generalization error, i.e., the probability for it to be less than \(\epsilon\) decays exponentially. We refer to such a notion as high-probability guarantee. In particular, we explore for which cases of nonconvex loss functions we can establish such a stronger performance guarantee.

Towards this end, we adopt the uniform stability framework proposed in Elisseeff et al. (2005). Note that Hardt et al. (2016) also studied the uniform stability of SGD, but only characterized the generalization error in expectation, which is weaker than the exponential probabilistic concentration bound that we study here.

Suppose we apply SGD with the same sample path \(\varvec{\xi }\) to solve the ERM with the datasets S and \({\overline{S}}\), respectively, and denote \({{\mathbf {w}}}_{T, S, \varvec{\xi }}\) and \({{\mathbf {w}}}_{T, {\overline{S}}, \varvec{\xi }}\) as the corresponding outputs. Also, suppose we apply the SGD with different sample paths \(\varvec{\xi }\) and \(\overline{\varvec{\xi }}\) to solve the same problem with the dataset S, respectively, and denote \({{\mathbf {w}}}_{T, S, \varvec{\xi }}\) and \({{\mathbf {w}}}_{T, S, \overline{\varvec{\xi }}}\) as the corresponding outputs. Here, \(\overline{\varvec{\xi }}\) denotes the sample path that replaces one of the sampled indices, say \(\xi _{t_0}\), with an i.i.d. copy \(\xi _{t_0}'\). The following result is a variant of Theorem 15 (Elisseeff et al. 2005).

Lemma 1

Let Assumption 1 hold. If SGD satisfies the following conditions

$$\begin{aligned}&\sup _{S, {\overline{S}}, {{\mathbf {z}}}} {\mathbb {E}}_{\varvec{\xi }} |\ell ({{\mathbf {w}}}_{T, S, \varvec{\xi }}; {{\mathbf {z}}}) - \ell ({{\mathbf {w}}}_{T, {\overline{S}}, \varvec{\xi }}; {{\mathbf {z}}})| \le \beta , \\&\sup _{\varvec{\xi }, \overline{\varvec{\xi }}, S, {{\mathbf {z}}}} |\ell ({{\mathbf {w}}}_{T, S, \varvec{\xi }}; {{\mathbf {z}}}) - \ell ({{\mathbf {w}}}_{T, S, \overline{\varvec{\xi }}}; {{\mathbf {z}}}) | \le \rho . \end{aligned}$$

Then, the following bound holds with probability at least \(1-\delta\).

$$\begin{aligned} |{\varPhi }({{\mathbf {w}}}_{T, S}) -&{\varPhi }_S({{\mathbf {w}}}_{T, S})| \le 2\beta + \left( \frac{M+4n\beta }{\sqrt{2n}} + \sqrt{2T}\rho \right) \sqrt{\log \tfrac{2}{\delta }}. \end{aligned}$$

Note that Lemma 1 implies that

$$\begin{aligned} {\mathbb {P}}(|{\varPhi }({{\mathbf {w}}}_{T, S}) - {\varPhi }_S({{\mathbf {w}}}_{T, S})|>\epsilon ) \le O\left( \exp \left( \tfrac{-\epsilon ^2}{\sqrt{n}\beta +\sqrt{T}\rho }\right) \right) . \end{aligned}$$

Hence, if \(\beta =o(n^{-\frac{1}{2}})\) and \(\rho =o(T^{-\frac{1}{2}})\), then we have exponential decay in probability as \(n\rightarrow \infty\) and \(T \rightarrow \infty\). It turns out that our analysis of the uniform stability of SGD for general nonconvex functions yields that \(\beta =O(n^{-1}), \rho =O(\log T)\), which does not lead to the desired high-probability guarantee for the generalization error. On the other hand, the analysis of the uniform stability of the proximal SGD for nonconvex loss functions with strongly convex regularizers yields that \(\beta = O(n^{-1}), \rho =O(T^{-c(\lambda - L)}),\) which leads to the high-probability guarantee if we choose \(\lambda > L\) and \(c> \frac{1}{2(\lambda - L)}\). This further demonstrates that a strongly convex regularizer can significantly improve the quality of the probabilistic bound for the generalization error. The following result is a formal statement of the above discussion.

Theorem 5

Consider the regularized risk minimization with the nonconvex loss function \(\ell\). Let Assumptions 1 and 3 hold, and apply the proximal SGD to solve the R-ERM with the data set S. Choose \(\lambda > L\) and \(\alpha _t = \frac{c}{t+2}\) with \(\frac{1}{2(\lambda - L)}<c<\frac{1}{\lambda - L}\). Then, the following bound holds with probability at least \(1- \delta\)

$$\begin{aligned} |{\varPhi }({{\mathbf {w}}}_{T, S})&- {\varPhi }_S({{\mathbf {w}}}_{T, S})| \le \left( \frac{M}{\sqrt{n}} + \frac{4\sigma ^2}{\sqrt{n}(\lambda - L)} + \frac{4\sigma ^2 c}{T^{c(\lambda - L) - \frac{1}{2}}} \right) \sqrt{\log \frac{2}{\delta }}. \end{aligned}$$

Theorem 5 implies that

$$\begin{aligned} {\mathbb {P}}(|{\varPhi }({{\mathbf {w}}}_{T, S})&- {\varPhi }_S({{\mathbf {w}}}_{T, S})|>\epsilon ) \le O\left( \exp \left( \tfrac{-\epsilon ^2}{n^{-\frac{1}{2}} + T^{\frac{1}{2}-c(\lambda - L)}}\right) \right) . \end{aligned}$$

Hence, if we choose \(c = \frac{1}{\lambda - L}\) and run the proximal SGD for \(T = O(n)\) iterations (i.e., constant passes over the data), then the probability of the event decays exponentially as \(O(\exp (-\sqrt{n}\epsilon ^2))\).

The proof of Theorem 5 characterizes the uniform iterate stability of the proximal SGD with regard to the perturbations of both the dataset and the sample path. Unlike the on-average stability in Theorem 1, where the stochastic gradient norm is bounded by the on-average variance of the stochastic gradients, the uniform stability captures the worst case among all datasets, and hence uses the uniform upper bound \(\sigma\) on the stochastic gradient norm. We note that Theorem 3 (London 2017) also established a probabilistic bound under the PAC Bayesian framework. However, their result yields exponential concentration guarantee only for strongly convex loss functions. As a comparison, Theorem 5 relaxes the requirement of strong convexity of the loss function to nonconvex loss functions with strongly convex regularizers, and hence serves as a complementary result to theirs. Also, Mou et al. (2017) established a high-probability bound for the generalization error of SGD with regularization. However, their result holds only for the particular regularizer \(\frac{1}{2}\Vert {{\mathbf {w}}} \Vert ^2\), and their high-probability bound holds only with regard to the random draw of the data. As a comparison, our result holds for all strongly convex regularizers, and the high-probability bound holds with regard to both the draw of the data and the randomness of the algorithm.

6 Experiments

In this section, we conduct deep learning experiments to demonstrate that the on-average variance of SGD does correlate with the generalization performance in practice. Specifically, it has been observed that a classification dataset with randomized labels can substantially degrade the generalization performance of the trained deep model (Zhang et al. 2017). Following this observation, we consider training a three-layer MLP neural network and a ResNet-18 network (He et al. 2016) using the MNIST dataset (Lecun et al. 1998) and the CIFAR10 dataset (Krizhevsky 2009), respectively. For each dataset, we replace the underlying true labels with random labels with probability \(p\in [0.0, 0.4]\). During the SGD training, we evaluate the on-average variance of SGD over the last several iterations of the training process. In all the experiments, we train the networks for a sufficient number of epochs until the training error is saturated. Also, as the on-average variance involves an expectation over the data distribution, we use the corresponding sample mean over the random draw of the training data as an approximation.
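For reference, the label randomization we use can be implemented as follows (a sketch; num_classes and the probability p are per-experiment settings, and a random label may coincide with the true one):

```python
import numpy as np

def randomize_labels(labels, p, num_classes, seed=0):
    """Replace each label with a uniformly random class with probability p."""
    rng = np.random.default_rng(seed)
    labels = np.array(labels)
    flip = rng.random(len(labels)) < p  # which samples get random labels
    labels[flip] = rng.integers(num_classes, size=int(flip.sum()))
    return labels
```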

6.1 Generalization error and stability under random labels

In Fig. 1, we present the results of training MLP and ResNet-18 under the random label probability p ranging from 0.1 to 0.4. We use the learning rate 0.01 and batch size 256 for both experiments. It can be seen from these results that the on-average variance (blue) consistently increases as the fraction of random labels increases. At the same time, the generalization error (red) also increases. Thus, our empirical study confirms that the on-average variance captured in our generalization bound is correlated with the generalization performance in the experiments.

Fig. 1 Relation of on-average variance (left y axis), generalization error (right y axis) and random label probability (x axis) in training MLP and ResNet using SGD

Fig. 2 Comparison of on-average variance, generalization error and random label probability under different choices of batch size of SGD

Fig. 3 Comparison of on-average variance, generalization error and random label probability with/without data augmentation

Fig. 4 Generalization error vs. regularization parameter

We note that from the numerical results shown in Fig. 1, the generalization error does not appear to scale with the on-average variance in the exact way predicted by Theorem 1. This reflects the nature and limitation of the proposed statistical generalization theory, which only establishes upper bounds for a general class of functions. Characterizing the precise numerical dependence between the generalization error and the on-average variance is beyond the scope of this work.

6.2 Impact of batch size and data augmentation

We further explore how the batch size and data augmentation affect the generalization error and on-average variance of SGD under random labels.

First, we explore the impact of the batch size by considering three different batch sizes, i.e., 128, 192 and 256. We use the same learning rate 0.01 and vary the random label probability from 0.1 to 0.4. Figure 2 shows the training results of MLP and ResNet-18 with different batch sizes. It can be seen that the generalization error consistently correlates with the on-average variance under all random label probabilities. These observations support our theoretical findings. Also, comparing these figures, the generalization error roughly stays at the same level as the batch size increases, while the on-average variance grows with the batch size. We think this is because training with a larger batch size on noisy labels makes it more challenging to reach a global minimum, and therefore the gradient variance remains large.

Next, we explore the impact of data augmentation on the generalization error and on-average variance. We train MLP and ResNet-18 using learning rate 0.01 and batch size 128 with the original datasets and their augmented versions. For the data augmentation, we apply the random rotation method to modify the images. Specifically, each image is rotated by an angle drawn uniformly between \(-20\) and 20 degrees, as in the sketch below. Figure 3 shows the training results with the original and augmented data. It can be seen that the generalization error consistently correlates with the on-average variance under all random label probabilities, with and without data augmentation. In the MLP training with the MNIST dataset, data augmentation does not yield a substantial decrease of the generalization error, and the on-average variance is larger with augmented data than with the original data. In the ResNet-18 training with the CIFAR10 dataset, data augmentation does lead to a consistent decrease of the generalization error under all random label probabilities, but the on-average variance is again larger with augmented data. We think this is because the subset of augmented data samples that are assigned random labels increases the gradient uncertainty in optimization, an effect that is not captured by the current theoretical framework. This suggests a research direction for future study.
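For the rotation augmentation, a torchvision-style transform of the kind we use looks like the following sketch (the exact preprocessing pipeline is omitted):

```python
from torchvision import transforms

# Rotate each image by an angle drawn uniformly from [-20, 20] degrees,
# then convert it to a tensor for training.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=20),
    transforms.ToTensor(),
])
```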

6.3 Effect of regularization

We further conduct experiments to explore the effect of regularization on the generalization error by adding the regularizer \(\frac{\lambda }{2} \Vert {{\mathbf {w}}} \Vert ^2\) to the objective functions. In particular, we apply the proximal SGD to solve a logistic regression problem (with the a9a dataset, Chang and Lin 2011) and to train the MLP network (with the MNIST dataset). Figure 4 shows the results, where the left axis denotes the scale of the training error and the right axis denotes the scale of the generalization error. It can be seen that the corresponding generalization errors improve as the regularization weight increases. This matches our theoretical finding on the impact of regularization. On the other hand, the training performance for both problems degrades as the regularization weight increases, which is reasonable because in such a case the optimization focuses too much on the regularizer and the obtained solution does not minimize the loss function well. Hence, there is a trade-off between the training and generalization performance in tuning the regularization parameter.

7 Conclusion

In this paper, we develop generalization error bounds of SGD with probabilistic guarantee for nonconvex optimization. We obtain improved bounds based on the variance of the stochastic gradients by exploiting the optimization path of SGD. Our generalization bound is consistent with the effect of random labels on the generalization error observed in practical experiments. We further show that strongly convex regularizers can significantly improve the probabilistic concentration bounds for the generalization error from the sub-linear rate to the exponential rate. Our study demonstrates that the geometric structure of the problem can be an important factor in improving the generalization performance of algorithms. Thus, it is of interest to explore the generalization error under various geometric conditions of the objective function in future work.