
Random Design Analysis of Ridge Regression

Abstract

This work gives a simultaneous analysis of both the ordinary least squares estimator and the ridge regression estimator in the random design setting under mild assumptions on the covariate/response distributions. In particular, the analysis provides sharp results on the “out-of-sample” prediction error, as opposed to the “in-sample” (fixed design) error. The analysis also reveals the effect of errors in the estimated covariance structure, as well as the effect of modeling errors, neither of which is present in the fixed design setting. The proofs of the main results are based on a simple decomposition lemma combined with concentration inequalities for random vectors and matrices.

References

  1. N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. SIAM J. Comput., 39(1):302–322, 2009.

  2. J.-Y. Audibert and O. Catoni. Linear regression through PAC-Bayesian truncation, 2010. arXiv:1010.0072.

  3. J.-Y. Audibert and O. Catoni. Robust linear least squares regression. The Annals of Statistics, 39(5):2766–2794, 2011.

  4. A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

  5. O. Catoni. Statistical Learning Theory and Stochastic Optimization, Lectures on Probability and Statistics, Ecole d’Eté de Probabilitiés de Saint-Flour XXXI - 2001, volume 1851 of Lecture Notes in Mathematics. Springer, 2004.

  6. P. Drineas and M. W. Mahoney. Effective Resistances, Statistical Leverage, and Applications to Linear Equation Solving, 2010. arXiv:1005.3097.

  7. P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, 2010.

  8. L. Györfi, M. Kohler, A. Krzyżak, and H. Walk. A Distribution-Free Theory of Nonparametric Regression. Springer, 2002.

  9. A. E. Hoerl. Application of ridge analysis to regression problems. Chemical Engineering Progress, 58:54–59, 1962.

  10. R. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.

  11. D. Hsu, S. M. Kakade, and T. Zhang. A tail inequality for quadratic forms of subgaussian random vectors, 2011. arXiv:1110.2842.

  12. D. Hsu, S. M. Kakade, and T. Zhang. Tail inequalities for sums of random matrices that depend on the intrinsic dimension. Electronic Communications in Probability, 17(14):1–13, 2012.

  13. D. Hsu and S. Sabato. Loss Minimization and Parameter Estimation with Heavy Tails, 2013. arXiv:1307.1827.

  14. V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6):2593–2656, 2006.

  15. B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.

  16. E. L. Lehmann and G. Casella. Theory of Point Estimation. Springer, second edition, 1998.

  17. M. Nussbaum. Minimax risk: Pinsker bound. In S. Kotz, editor, Encyclopedia of Statistical Sciences, Update Volume 3, pages 451–460. Wiley, New York, 1999.

  18. V. Rokhlin and M. Tygert. A fast randomized algorithm for overdetermined linear least-squares regression. Proc. Natl. Acad. Sci. USA, 105(36):13212–13217, 2008.

  19. S. Smale and D.-X. Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26:153–172, 2007.

  20. I. Steinwart, D. Hush, and C. Scovel. Optimal Rates for Regularized Least Squares Regression. In Proceedings of the 22nd Annual Conference on Learning Theory, pages 79–93, 2009.

  21. G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, 1990.

  22. C. J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10:1040–1053, 1982.

  23. T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17:2077–2098, 2005.

Acknowledgments

The authors thank Dean Foster, David McAllester, and Robert Stine for many insightful discussions.

Author information

Corresponding author

Correspondence to Daniel Hsu.

Additional information

Communicated by Tomaso Poggio.

Appendix: Probability Tail Inequalities

The following probability tail inequalities are used in our analysis. These specific inequalities were chosen to satisfy the general conditions set out in Sect. 2.4; however, the analysis can be specialized or generalized by substituting other tail inequalities of this type.

The first tail inequality is for positive semidefinite quadratic forms of a sub-Gaussian random vector. It generalizes a standard tail inequality for linear combinations of \(\chi ^2\) random variables, which arise as quadratic forms of Gaussian random vectors [15].

Lemma 8

(Quadratic forms of a sub-Gaussian random vector [11]) Let \(\xi \) be a random vector taking values in \(\mathbb {R}^n\) such that for some \(c \ge 0\),

$$\begin{aligned} \mathbb {E}[\exp (\langle u,\xi \rangle )] \le \exp (c \Vert u\Vert ^2 / 2), \quad \forall u \in \mathbb {R}^n. \end{aligned}$$

For all symmetric positive semidefinite matrices \(K \succeq 0\), and all \(t > 0\),

$$\begin{aligned} \Pr \biggl [ \xi ^{\scriptscriptstyle \top }K \xi > c \Bigl ( {{\mathrm{tr}}}(K) + 2\sqrt{{{\mathrm{tr}}}(K^2)t} + 2\Vert K\Vert t \Bigr ) \biggr ] \le \mathrm {e}^{-t}. \end{aligned}$$
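
As a concrete check (illustrative only, not part of the paper's analysis), the Python sketch below simulates a standard Gaussian vector, which satisfies the sub-Gaussian moment condition with \(c = 1\), and compares the empirical tail probability of \(\xi ^{\scriptscriptstyle \top }K \xi \) against the bound \(\mathrm {e}^{-t}\); the dimension, number of trials, and the choice \(K = AA^{\scriptscriptstyle \top }\) are arbitrary assumptions made for the demonstration. Specializing to \(K = I\) and Gaussian \(\xi \) recovers the \(\chi ^2\) bound of [15].

```python
# Numerical sanity check of Lemma 8 for xi ~ N(0, I_n), which satisfies the
# sub-Gaussian moment condition with c = 1.  All parameters below are
# illustrative choices, not quantities from the paper.
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 20, 200_000, 2.0

A = rng.standard_normal((n, n))
K = A @ A.T                                   # a fixed PSD matrix K = A A^T

c = 1.0
threshold = c * (np.trace(K)
                 + 2 * np.sqrt(np.trace(K @ K) * t)
                 + 2 * np.linalg.norm(K, 2) * t)

xi = rng.standard_normal((trials, n))
quad = np.einsum('ij,jk,ik->i', xi, K, xi)    # xi^T K xi, one value per trial

print("empirical tail probability:", np.mean(quad > threshold))
print("Lemma 8 bound exp(-t):     ", np.exp(-t))
```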

The next lemma is a tail inequality for sums of bounded random vectors; it is a standard application of Bernstein’s inequality.

Lemma 9

(Vector Bernstein bound; see, e.g., [11]) Let \(x_1,x_2,\cdots ,x_n\) be independent random vectors such that

$$\begin{aligned} \sum _{i=1}^n \mathbb {E}[ \Vert x_i\Vert ^2 ] \le v \quad \text {and} \quad \Vert x_i\Vert \le r \end{aligned}$$

for all \(i=1,2,\cdots ,n\), almost surely. Let \(s := x_1 + x_2 + \cdots + x_n\). For all \(t > 0\),

$$\begin{aligned} \Pr \left[ \Vert s\Vert > \sqrt{v} (1 + \sqrt{8t}) + (4/3) r t \right] \le \mathrm {e}^{-t}. \end{aligned}$$
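
A minimal numerical sketch of Lemma 9 (again illustrative, not taken from the paper): take each \(x_i\) uniform on the sphere of radius \(r\), so that \(\Vert x_i\Vert = r\) almost surely and \(\sum _{i=1}^n \mathbb {E}[\Vert x_i\Vert ^2] = n r^2 = v\); the dimension, sample size, and number of trials are arbitrary choices.

```python
# Numerical sanity check of Lemma 9 with each x_i uniform on the radius-r
# sphere (so ||x_i|| = r a.s. and sum_i E||x_i||^2 = n r^2).  Parameters are
# illustrative choices, not quantities from the paper.
import numpy as np

rng = np.random.default_rng(1)
d, n, trials, t, r = 10, 100, 10_000, 2.0, 1.0

v = n * r**2
threshold = np.sqrt(v) * (1 + np.sqrt(8 * t)) + (4 / 3) * r * t

g = rng.standard_normal((trials, n, d))
x = r * g / np.linalg.norm(g, axis=2, keepdims=True)   # uniform on the sphere
s = x.sum(axis=1)                                      # s = x_1 + ... + x_n

print("empirical tail probability:", np.mean(np.linalg.norm(s, axis=1) > threshold))
print("Lemma 9 bound exp(-t):     ", np.exp(-t))
```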

The last tail inequality concerns the spectral accuracy of an empirical second moment matrix.

Lemma 10

(Matrix Bernstein bound [12]) Let \(X\) be a random matrix, and \(r > 0\), \(v > 0\), and \(k > 0\) be such that, almost surely,

$$\begin{aligned} \mathbb {E}[X] = 0 , \quad \lambda _{\max }[X] \le r , \quad \lambda _{\max }[\mathbb {E}[X^2]] \le v , \quad {{\mathrm{tr}}}(\mathbb {E}[X^2]) \le v k. \end{aligned}$$

If \(X_1,X_2,\cdots ,X_n\) are independent copies of \(X\), then for any \(t > 0\),

$$\begin{aligned} \Pr \left[ \lambda _{\max }\left[ \frac{1}{n} \sum _{i=1}^n X_i \right] > \sqrt{\frac{2vt}{n}} + \frac{rt}{3n} \right] \le k t (\mathrm {e}^t - t - 1)^{-1}. \end{aligned}$$

If \(t \ge 2.6\), then \(t (\mathrm {e}^t - t - 1)^{-1} \le \mathrm {e}^{-t/2}\).
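
As an illustration (not from the paper), a convenient choice is \(X = xx^{\scriptscriptstyle \top } - I/d\) with \(x\) uniform on the unit sphere in \(\mathbb {R}^d\): then \(\mathbb {E}[X] = 0\), \(\lambda _{\max }[X] = 1 - 1/d\), and \(\mathbb {E}[X^2] = (1/d - 1/d^2) I\), so one may take \(r = 1 - 1/d\), \(v = 1/d - 1/d^2\), and \(k = d\). The Python sketch below compares the empirical tail of \(\lambda _{\max }\) of the centered empirical second moment matrix with the bound of Lemma 10; all sizes are arbitrary assumptions for the demonstration.

```python
# Numerical sanity check of Lemma 10 with X = x x^T - I/d and x uniform on the
# unit sphere in R^d, so that E[X] = 0 and r, v, k are explicit.  Parameters
# are illustrative choices, not quantities from the paper.
import numpy as np

rng = np.random.default_rng(2)
d, n, trials, t = 5, 200, 10_000, 5.0

r = 1 - 1 / d              # lambda_max[x x^T - I/d] = 1 - 1/d almost surely
v = 1 / d - 1 / d**2       # E[X^2] = (1/d - 1/d^2) I
k = d                      # tr(E[X^2]) = d * v

threshold = np.sqrt(2 * v * t / n) + r * t / (3 * n)
bound = k * t / (np.exp(t) - t - 1)

exceed = 0
for _ in range(trials):
    g = rng.standard_normal((n, d))
    x = g / np.linalg.norm(g, axis=1, keepdims=True)   # rows uniform on sphere
    S = (x.T @ x) / n - np.eye(d) / d                  # (1/n) sum_i X_i
    exceed += np.linalg.eigvalsh(S)[-1] > threshold

print("empirical tail probability:         ", exceed / trials)
print("Lemma 10 bound k*t/(e^t - t - 1):   ", bound)
```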

About this article

Cite this article

Hsu, D., Kakade, S.M. & Zhang, T. Random Design Analysis of Ridge Regression. Found Comput Math 14, 569–600 (2014). https://doi.org/10.1007/s10208-014-9192-1
