0% found this document useful (0 votes)
5 views42 pages

Zhang 15 D

This paper presents a divide-and-conquer approach to kernel ridge regression that achieves minimax optimal convergence rates while significantly reducing computational time. By partitioning a dataset into subsets and averaging local estimates, the method retains statistical optimality and allows for parallel computation. Theoretical results are supported by experiments demonstrating the efficiency and effectiveness of the proposed algorithm in various applications, including a music prediction task.

Uploaded by

yeremy55
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views42 pages

Zhang 15 D

This paper presents a divide-and-conquer approach to kernel ridge regression that achieves minimax optimal convergence rates while significantly reducing computational time. By partitioning a dataset into subsets and averaging local estimates, the method retains statistical optimality and allows for parallel computation. Theoretical results are supported by experiments demonstrating the efficiency and effectiveness of the proposed algorithm in various applications, including a music prediction task.

Uploaded by

yeremy55
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 42

Journal of Machine Learning Research 16 (2015) 3299-3340 Submitted 1/15; Revised 7/15; Published 12/15

Divide and Conquer Kernel Ridge Regression:


A Distributed Algorithm with Minimax Optimal Rates

Yuchen Zhang yuczhang@berkeley.edu


Department of Electrical Engineering and Computer Science
University of California, Berkeley, Berkeley, CA 94720, USA
John Duchi jduchi@stanford.edu
Departments of Statistics and Electrical Engineering
Stanford University, Stanford, CA 94305, USA
Martin Wainwright wainwrig@berkeley.edu
Departments of Statistics and Electrical Engineering and Computer Science
University of California, Berkeley, Berkeley, CA 94720, USA

Editor: Hui Zou

Abstract
We study a decomposition-based scalable approach to kernel ridge regression, and show
that it achieves minimax optimal convergence rates under relatively mild conditions. The
method is simple to describe: it randomly partitions a dataset of size N into m subsets
of equal size, computes an independent kernel ridge regression estimator for each subset
using a careful choice of the regularization parameter, then averages the local solutions
into a global predictor. This partitioning leads to a substantial reduction in computation
time versus the standard approach of performing kernel ridge regression on all N samples.
Our two main theorems establish that despite the computational speed-up, statistical op-
timality is retained: as long as m is not too large, the partition-based estimator achieves
the statistical minimax rate over all estimators using the set of N samples. As concrete
examples, our theory guarantees that the number of subsets m may grow nearly linearly
for finite-rank or Gaussian kernels and polynomially in N for Sobolev spaces, which in turn
allows for substantial reductions in computational cost. We conclude with experiments on
both simulated data and a music-prediction task that complement our theoretical results,
exhibiting the computational and statistical benefits of our approach.
Keywords: kernel ridge regression, divide and conquer, computation complexity

1. Introduction
In non-parametric regression, the statistician receives N samples of the form {(xi , yi )}N
i=1 ,
where each xi ∈ X is a covariate and yi ∈ R is a real-valued response, and the samples are
drawn i.i.d. from some unknown joint distribution P over X × R. The goal is to estimate
a function fb : X → R that can be used to predict future responses based on observing
only the covariates. Frequently, the quality of an estimate fb is measured in terms of the
mean-squared prediction error E[(fb(X) − Y )2 ], in which case the conditional expectation
f ∗ (x) = E[Y | X = x] is optimal. The problem of non-parametric regression is a classi-
cal one, and a researchers have studied a wide range of estimators (see, for example, the
books of Gyorfi et al. (2002), Wasserman (2006), or van de Geer (2000)). One class of

c 2015 Yuchen Zhang, John Duchi and Martin Wainwright.


Zhang, Duchi and Wainwright

methods, known as regularized M -estimators (van de Geer, 2000), are based on minimizing
the combination of a data-dependent loss function with a regularization term. The focus
of this paper is a popular M -estimator that combines the least-squares loss with a squared
Hilbert norm penalty for regularization. When working in a reproducing kernel Hilbert
space (RKHS), the resulting method is known as kernel ridge regression, and is widely
used in practice (Hastie et al., 2001; Shawe-Taylor and Cristianini, 2004). Past work has
established bounds on the estimation error for RKHS-based methods (Koltchinskii, 2006;
Mendelson, 2002a; van de Geer, 2000; Zhang, 2005), which have been refined and extended
in more recent work (e.g., Steinwart et al., 2009).
Although the statistical aspects of kernel ridge regression (KRR) are well-understood,
the computation of the KRR estimate can be challenging for large datasets. In a standard
implementation (Saunders et al., 1998), the kernel matrix must be inverted, which requires
O(N 3 ) time and O(N 2 ) memory. Such scalings are prohibitive when the sample size N
is large. As a consequence, approximations have been designed to avoid the expense of
finding an exact minimizer. One family of approaches is based on low-rank approximation
of the kernel matrix; examples include kernel PCA (Schölkopf et al., 1998), the incomplete
Cholesky decomposition (Fine and Scheinberg, 2002), or Nyström sampling (Williams and
Seeger, 2001). These methods reduce the time complexity to O(dN 2 ) or O(d2 N ), where
d  N is the preserved rank. The associated prediction error has only been studied very
recently. Concurrent work by Bach (2013) establishes conditions on the maintained rank
that still guarantee optimal convergence rates; see the discussion in Section 7 for more
detail. A second line of research has considered early-stopping of iterative optimization
algorithms for KRR, including gradient descent (Yao et al., 2007; Raskutti et al., 2011) and
conjugate gradient methods (Blanchard and Krämer, 2010), where early-stopping provides
regularization against over-fitting and improves run-time. If the algorithm stops after t
iterations, the aggregate time complexity is O(tN 2 ).
In this work, we study a different decomposition-based approach. The algorithm is ap-
pealing in its simplicity: we partition the dataset of size N randomly into m equal sized
subsets, and we compute the kernel ridge regression estimate fbi for each of the i = 1, . . . , m
subsets independently, with a careful Pchoice of the regularization parameter. The estimates
¯ m b
are then averaged via f = (1/m) i=1 fi . Our main theoretical result gives conditions
under which the average f¯ achieves the minimax rate of convergence over the underlying
Hilbert space. Even using naive implementations of KRR, this decomposition gives time
and memory complexity scaling as O(N 3 /m2 ) and O(N 2 /m2 ), respectively. Moreover, our
approach dovetails naturally with parallel and distributed computation: we are guaranteed
superlinear speedup with m parallel processors (though we must still communicate the func-
tion estimates from each processor). Divide-and-conquer approaches have been studied by
several authors, including McDonald et al. (2010) for perceptron-based algorithms, Kleiner
et al. (2012) in distributed versions of the bootstrap, and Zhang et al. (2013) for parametric
smooth convex optimization problems. This paper demonstrates the potential benefits of
divide-and-conquer approaches for nonparametric and infinite-dimensional regression prob-
lems.
One difficulty in solving each of the sub-problems independently is how to choose the
regularization parameter. Due to the infinite-dimensional nature of non-parametric prob-
lems, the choice of regularization parameter must be made with care (e.g., Hastie et al.,

3300
Divide and Conquer Kernel Ridge Regression

2001). An interesting consequence of our theoretical analysis is in demonstrating that, even


though each partitioned sub-problem is based only on the fraction N/m of samples, it is
nonetheless essential to regularize the partitioned sub-problems as though they had all N
samples. Consequently, from a local point of view, each sub-problem is under-regularized.
This “under-regularization” allows the bias of each local estimate to be very small, but it
causes a detrimental blow-up in the variance. However, as we prove, the m-fold averaging
underlying the method reduces variance enough that the resulting estimator f¯ still attains
optimal convergence rate.
The remainder of this paper is organized as follows. We begin in Section 2 by providing
background on the kernel ridge regression estimate and discussing the assumptions that
underlie our analysis. In Section 3, we present our main theorems on the mean-squared
error between the averaged estimate f¯ and the optimal regression function f ∗ . We provide
both a result when the regression function f ∗ belongs to the Hilbert space H associated
with the kernel, as well as a more general oracle inequality that holds for a general f ∗ . We
then provide several corollaries that exhibit concrete consequences of the results, including
convergence rates of r/N for kernels with finite rank r, and convergence rates of N −2ν/(2ν+1)
for estimation of functionals in a Sobolev space with ν-degrees of smoothness. As we discuss,
both of these estimation rates are minimax-optimal and hence unimprovable. We devote
Sections 4 and 5 to the proofs of our results, deferring more technical aspects of the analysis
to appendices. Lastly, we present simulation results in Section 6.1 to further explore our
theoretical results, while Section 6.2 contains experiments with a reasonably large music
prediction experiment.

2. Background and Problem Formulation


We begin with the background and notation required for a precise statement of our problem.

2.1 Reproducing Kernels


The method of kernel ridge regression is based on the idea of a reproducing kernel Hilbert
space. We provide only a very brief coverage of the basics here, referring the reader to
one of the many books on the topic (Wahba, 1990; Shawe-Taylor and Cristianini, 2004;
Berlinet and Thomas-Agnan, 2004; Gu, 2002) for further details. Any symmetric and
positive semidefinite kernel function K : X × X → R defines a reproducing kernel Hilbert
space (RKHS for short). For a given distribution P on X , the Hilbert space is strictly
contained in L2 (P). For each x ∈ X , the function z 7→ K(z, x) is contained with the Hilbert
space H; moreover, the Hilbert space is endowed with an inner product h·, ·iH such that
K(·, x) acts as the representer of evaluation, meaning
hf, K(x, ·)iH = f (x) for f ∈ H. (1)
p
We let kgkH := hg, giH denote the norm in H, and similarly kgk2 := ( X g(x)2 dP(x))1/2
R

denotes the norm in L2 (P). Under suitable regularity conditions, Mercer’s theorem guar-
antees that the kernel has an eigen-expansion of the form

X
0
K(x, x ) = µj φj (x)φj (x0 ),
j=1

3301
Zhang, Duchi and Wainwright

where µ1 ≥ µ2 ≥ · · · ≥ 0 are a non-negative sequence of eigenvalues, and {φj }∞ j=1 is an


2
orthonormal basis for L (P).
From the reproducing relation (1), we have hφj , φj iH = 1/µj for any j and hφj , φj 0 iH = 0
for any j 6= j 0 . For any f ∈ H, by defining the basis coefficients θj = hf, φj iL2 (P) for
j = 1, 2, . . ., we can expand the function in terms of these coefficients as f = ∞
P
j=1 θj φj ,
and simple calculations show that
∞ ∞
θj2
Z X X
kf k22 = f 2 (x)dP(x) = θj2 , and kf k2H = hf, f iH = .
X µj
j=1 j=1

Consequently, we see that the RKHS can be viewed as an elliptical subset of the sequence
space `2 (N) as defined by the non-negative eigenvalues {µj }∞
j=1 .

2.2 Kernel Ridge Regression


Suppose that we are given a data set {(xi , yi )}N i=1 consisting of N i.i.d. samples drawn
from an unknown distribution P over X × R, and our goal is to estimate the function
that minimizes the mean-squared error E[(f (X) − Y )2 ], where the expectation is taken
jointly over (X, Y ) pairs. It is well-known that the optimal function is the conditional mean
f ∗ (x) : = E[Y | X = x]. In order to estimate the unknown function f ∗ , we consider an
M -estimator that is based on minimizing a combination of the least-squares loss defined
over the dataset with a weighted penalty based on the squared Hilbert norm,
 N 
1 X 2 2
fb := argmin (f (xi ) − yi ) + λ kf kH , (2)
f ∈H N
i=1

where λ > 0 is a regularization parameter. When H is a reproducing kernel Hilbert space,


then the estimator (2) is known as the kernel ridge regression estimate, or KRR for short.
It is a natural generalization of the ordinary ridge regression estimate (Hoerl and Kennard,
1970) to the non-parametric setting.
By the representer theorem for reproducing kernel Hilbert spaces (Wahba, 1990), any
solution to the KRR program (2) must belong to the linear span of the kernel functions
{K(·, xi ), i = 1, . . . , N }. This fact allows the computation of the KRR estimate to be
reduced to an N -dimensional quadratic program, involving the N 2 entries of the kernel
matrix {K(xi , xj ), i, j = 1, . . . , n}. On the statistical side, a line of past work (van de Geer,
2000; Zhang, 2005; Caponnetto and De Vito, 2007; Steinwart et al., 2009; Hsu et al., 2012)
has provided bounds on the estimation error of fb as a function of N and λ.

3. Main Results and Their Consequences


We now turn to the description of our algorithm, followed by the statements of our main
results, namely Theorems 1 and 2. Each theorem provides an upper bound on the mean-
squared prediction error for any trace class kernel. The second theorem is of “oracle type,”
meaning that it applies even when the true regression function f ∗ does not belong to the
Hilbert space H, and hence involves a combination of approximation and estimation error
terms. The first theorem requires that f ∗ ∈ H, and provides somewhat sharper bounds on

3302
Divide and Conquer Kernel Ridge Regression

the estimation error in this case. Both of these theorems apply to any trace class kernel,
but as we illustrate, they provide concrete results when applied to specific classes of kernels.
Indeed, as a corollary, we establish that our distributed KRR algorithm achieves minimax-
optimal rates for three different kernel classes, namely finite-rank, Gaussian, and Sobolev.

3.1 Algorithm and Assumptions


The divide-and-conquer algorithm Fast-KRR is easy to describe. Rather than solving the
kernel ridge regression problem (2) on all N samples, the Fast-KRR method executes the
following three steps:
1. Divide the set of samples {(x1 , y1 ), . . . , (xN , yN )} evenly and uniformly at random
into the m disjoint subsets S1 , . . . , Sm ⊂ X × R, such that every subset contains N/m
samples.
2. For each i = 1, 2, . . . , m, compute the local KRR estimate
 
1 X 2 2
fi := argmin
b (f (x) − y) + λ kf kH . (3)
f ∈H |Si |
(x,y)∈Si

Pm b
3. Average together the local estimates and output f¯ = 1
m i=1 fi .

This description actually provides a family of estimators, one for each choice of the regular-
ization parameter λ > 0. Our main result applies to any choice of λ, while our corollaries
for specific kernel classes optimize λ as a function of the kernel.
We now describe our main assumptions. Our first assumption, for which we have two
variants, deals with the tail behavior of the basis functions {φj }∞j=1 .

Assumption A For some k ≥ 2, there is a constant ρ < ∞ such that E[φj (X)2k ] ≤ ρ2k
for all j ∈ N.
In certain cases, we show that sharper error guarantees can be obtained by enforcing a
stronger condition of uniform boundedness.

Assumption A0 There is a constant ρ < ∞ such that supx∈X |φj (x)| ≤ ρ for all j ∈ N.

Assumption A0 holds, for example, when the input x is drawn from a closed interval and
the kernel is translation invariant, i.e. K(x, x0 ) = ψ(x − x0 ) for some even function ψ. Given
input space X and kernel K, the assumption is verifiable without the data.
Recalling that f ∗ (x) : = E[Y | X = x], our second assumption involves the deviations
of the zero-mean noise variables Y − f ∗ (x). In the simplest case, when f ∗ ∈ H, we require
only a bounded variance condition:
Assumption B The function f ∗ ∈ H, and for x ∈ X , we have E[(Y − f ∗ (x))2 | x] ≤ σ 2 .
When the function f ∗ 6∈ H, we require a slightly stronger variant of this assumption. For
each λ ≥ 0, define
n  o
fλ∗ = argmin E (f (X) − Y )2 + λ kf k2H .

(4)
f ∈H

3303
Zhang, Duchi and Wainwright

Note that f ∗ = f0∗ corresponds to the usual regression function. As f ∗ ∈ L2 (P), for each
λ ≥ 0, the associated mean-squared error σλ2 (x) := E[(Y − fλ∗ (x))2 | x] is finite for almost
every x. In this more general setting, the following assumption replaces Assumption B:

Assumption B0 For any λ ≥ 0, there exists a constant τλ < ∞ such that τλ4 = E[σλ4 (X)].

3.2 Statement of Main Results


With these assumptions in place, we are now ready for the statements of our main results.
All of our results give bounds on the mean-squared estimation error E[kf¯− f ∗ k22 ] associated
with the averaged estimate f¯ based on an assigning n = N/m samples to each of m machines.
Both theorem statements involve the following three kernel-related quantities:
∞ ∞ ∞
X X 1 X
tr(K) := µj , γ(λ) := , and βd = µj . (5)
1 + λ/µj
j=1 j=1 j=d+1

The first quantity is the kernel trace, which serves a crude estimate of the “size” of the kernel
operator, and assumed to be finite. The second quantity γ(λ), familiar from previous work
on kernel regression (Zhang, 2005), is the effective dimensionality of the kernel K with
respect to L2 (P). Finally, the quantity βd is parameterized by a positive integer d that we
may choose in applying the bounds, and it describes the tail decay of the eigenvalues of K.
For d = 0, note that β0 = tr K. Finally, both theorems involve a quantity that depends on
the number of moments k in Assumption A:
 
p max{k, log(d)}
b(n, d, k) := max max{k, log(d)}, . (6)
n1/2−1/k
Here the integer d ∈ N is a free parameter that may be optimized to obtain the sharpest
possible upper bound. (The algorithm’s execution is independent of d.)

Theorem 1 With f ∗ ∈ H and under Assumptions A and B, the mean-squared error of the
averaged estimate f¯ is upper bounded as
i  12σ 2 γ(λ)

h
2 12
E f¯ − f ∗ 2 ≤ 8 + λ kf ∗ k2H +

+ inf T1 (d) + T2 (d) + T3 (d) , (7)
m N d∈N

where
8ρ4 kf ∗ k2H tr(K)βd 4 kf ∗ k2H + 2σ 2 /λ 12ρ4 tr(K)βd
 
T1 (d) = , T2 (d) = µd+1 + , and
λ m λ
k !
ρ2 γ(λ) 2σ 2 4 kf ∗ k2H

T3 (d) = Cb(n, d, k) √ µ0 kf ∗ k2H 1 + + ,
n mλ m

and C denotes a universal (numerical) constant.

Theorem 1 is a general result that applies to any trace-class kernel. Although the
statement appears somewhat complicated at first sight, it yields concrete and interpretable
guarantees on the error when specialized to particular kernels, as we illustrate in Section 3.3.

3304
Divide and Conquer Kernel Ridge Regression

Before doing so, let us make a few heuristic arguments in order to provide intuition.
In typical settings, the term T3 (d) goes to zero quickly: if the number of moments k
is suitably large and number of partitions m is small—say enough to guarantee that

(b(n, d, k)γ(λ)/ n)k = O(1/N )—it will be of lower order. As for the remaining terms,
at a high level, we show that an appropriate choice of the free parameter d leaves the first
two terms in the upper bound (7) dominant. Note that the terms µd+1 and βd are decreas-
ing in d while the term b(n, d, k) increases with d. However, the increasing term b(n, d, k)
grows only logarithmically in d, which allows us to choose a fairly large value without a
significant penalty. As we show in our corollaries, for many kernels of interest, as long as
the number of machines m is not “too large,” this tradeoff is such that T1 (d) and T2 (d)
are also of lower order compared to the two first terms in the bound (7). In such settings,
Theorem 1 guarantees an upper bound of the form
h
2
i h σ 2 γ(λ) i
E f¯ − f ∗ 2
= O(1) · λ kf ∗ k2H + . (8)
| {z } | N {z }
Squared bias
Variance

This inequality reveals the usual bias-variance trade-off in non-parametric regression; choos-
ing a smaller value of λ > 0 reduces the first squared bias term, but increases the second
variance term. Consequently, the setting of λ that minimizes the sum of these two terms is
defined by the relationship
γ(λ)
λ kf ∗ k2H ' σ 2 . (9)
N
This type of fixed point equation is familiar from work on oracle inequalities and local com-
plexity measures in empirical process theory (Bartlett et al., 2005; Koltchinskii, 2006; van de
Geer, 2000; Zhang, 2005), and when λ is chosen so that the fixed point equation (9) holds
this (typically) yields minimax optimal convergence rates (Bartlett et al., 2005; Koltchin-
skii, 2006; Zhang, 2005; Caponnetto and De Vito, 2007). In Section 3.3, we provide detailed
examples in which the choice λ∗ specified by equation (9), followed by application of The-
orem 1, yields minimax-optimal prediction error (for the Fast-KRR algorithm) for many
kernel classes.

We now turn to an error bound that applies without requiring that f ∗ ∈ H. In order to
do so, we introduce an auxiliary variable λ̄ ∈ [0, λ] for use in our analysis (the algorithm’s
execution does not depend on λ̄, and in our ensuing bounds we may choose any λ̄ ∈ [0, λ]
to give the sharpest possible results). Let the radius R = fλ̄∗ H , where the population
(regularized) regression function fλ̄∗ was previously defined (4). The theorem requires a few
additional conditions to those in Theorem 1, involving the quantities tr(K), γ(λ) and βd
defined in Eq. (5), as well as the error moment τλ̄ from Assumption B0 . We assume that
the triplet (m, d, k) of positive integers satisfy the conditions
λ 1
βd ≤ 2 , µd+1 ≤ ,
(R2
+ τλ̄ /λ)N (R + τλ̄2 /λ)N
2
( √ 2
) (10)
N N 1− k
m ≤ min , .
ρ2 γ(λ) log(d) (R2 + τλ̄2 /λ)2/k (b(n, d, k)ρ2 γ(λ))2

3305
Zhang, Duchi and Wainwright

We then have the following result:


Theorem 2 Under condition (10), Assumption A with k ≥ 4, and Assumption B0 , for any
λ̄ ∈ [0, λ] and q > 0 we have
h i  1

∗ 2
¯
E f −f 2 ≤ 1+ inf kf − f ∗ k22 + (1 + q) EN,m (λ, λ̄, R, ρ) (11)
q kf kH ≤R
where the residual term is given by
Cγ(λ)ρ2 τλ̄2
 
C C
EN,m (λ, λ̄, R, ρ) : = 4+ (λ − λ̄)R2 + + , (12)
m N N
and C denotes a universal (numerical) constant.
Remarks: Theorem 2 is an oracle inequality, as it upper bounds the mean-squared error in
terms of the error inf kf − f ∗ k22 , which may only be obtained by an oracle knowing the
kf kH ≤R
sampling distribution P, along with the residual error term (12).
In some situations, it may be difficult to verify Assumption B0 . In such scenarios,
an alternative condition suffices. For instance, if there exists a constant κ < ∞ such
that E[Y 4 ] ≤ κ4 , then under condition (10), the bound (11) holds with τλ̄2 replaced by
p
8 tr(K)2 R4 ρ4 + 8κ4 —that is, with the alternative residual error
p
Cγ(λ)ρ2 8 tr(K)2 R4 ρ4 + 8κ4
 
C 2 C
EN,m (λ, λ̄, R, ρ) : =
e 2+ (λ − λ̄)R + + . (13)
m N N
In essence, if the response variable Y has sufficiently many moments, the prediction mean-
square error τλ̄2 in the statement of Theorem 2 can be replaced by constants related to the
size of fλ̄∗ H . See Section 5.2 for a proof of inequality (13).
In comparison with Theorem 1, Theorem 2 provides somewhat looser bounds. It is,
however, instructive to consider a few special cases. For the first, we may assume that
f ∗ ∈ H, in which case kf ∗ kH < ∞. In this setting, the choice λ̄ = 0 (essentially) recovers
Theorem 1, since there is no approximation error. Taking q → 0, we are thus left with the
bound
γ(λ)ρ2 τ02
Ekf¯ − f ∗ k22 ] . λ kf ∗ k2H + , (14)
N
where . denotes an inequality up to constants. By inspection, this bound is roughly
equivalent to Theorem 1; see in particular the decomposition (8). On the other hand, when
the condition f ∗ ∈ H fails to hold, we can take λ̄ = λ, and then choose q to balance between
the familiar approximation and estimation errors: we have
γ(λ)ρ2 τλ2
   
¯ ∗ 2 1 ∗ 2
E[kf − f k2 ] . 1 + inf kf − f k2 + (1 + q) . (15)
q kf kH ≤R N
| {z } | {z }
approximation estimation

Relative to Theorem 1, the condition (10) required to apply Theorem 2 involves con-
straints on the number m of subsampled data sets that are more restrictive. In particular,

3306
Divide and Conquer Kernel Ridge Regression

p
when ignoring constants and logarithm terms, the quantity m may grow at rate N/γ 2 (λ).
By contrast, Theorem 1 allows m to grow as quickly as N/γ 2 (λ) (recall the remarks on
T3 (d) following Theorem 1 or look ahead to condition (28)). Thus—at least in our current
analysis—generalizing to the case that f ∗ 6∈ H prevents us from dividing the data into finer
subsets.

3.3 Some Consequences


We now turn to deriving some explicit consequences of our main theorems for specific classes
of reproducing kernel Hilbert spaces. In each case, our derivation follows the broad outline
given the the remarks following Theorem 1: we first choose the regularization parameter λ
to balance the bias and variance terms, and then show, by comparison to known minimax
lower bounds, that the resulting upper bound is optimal. Finally, we derive an upper bound
on the number of subsampled data sets m for which the minimax optimal convergence rate
can still be achieved. Throughout this section, we assume that f ∗ ∈ H.

3.3.1 Finite-rank Kernels


Our first corollary applies to problems for which the kernel has finite rank r, meaning
that its eigenvalues satisfy µj = 0 for all j > r. Examples of such finite rank kernels
include the linear kernel K(x, x0 ) = hx, x0 iRd , which has rank at most r = d; and the kernel
K(x, x) = (1+x x0 )m generating polynomials of degree m, which has rank at most r = m+1.

Corollary 3 For a kernel with rank r, consider the output of the Fast-KRR algorithm with
λ = r/N . Suppose that Assumption B and Assumptions A (or A0 ) hold, and that the number
of processors m satisfy the bound
k−4
N k−2 N
m≤c (Assumption A) or m≤c (Assumption A0 ),
2 k−1 4k k
r2 ρ4 log N
r k−2 ρ k−2 log k−2 r

where c is a universal (numerical) constant. For suitably large N , the mean-squared error
is bounded as
h
2
i σ2r
E f¯ − f ∗ 2 = O(1) . (16)
N
For finite-rank kernels, the rate (16) is known to be minimax-optimal, meaning that
there is a universal constant c0 > 0 such that
r
inf sup E[kfe − f ∗ k22 ] ≥ c0 , (17)
fe kf ∗ kH ≤1 N

where the infimum ranges over all estimators fe based on observing all N samples (and with
no constraints on memory and/or computation). This lower bound follows from Theorem
2(a) of Raskutti et al. (2012) with s = d = 1.

3307
Zhang, Duchi and Wainwright

3.3.2 Polynomially Decaying Eigenvalues


Our next corollary applies to kernel operators with eigenvalues that obey a bound of the
form

µj ≤ C j −2ν for all j = 1, 2, . . ., (18)

where C is a universal constant, and ν > 1/2 parameterizes the decay rate. We note
that equation (5) assumes a finite kernel trace tr(K) := ∞
P
P∞ j=1 j . Since tr(K) appears in
µ
Theorem 1, it is natural to use j=1 Cj −2ν as an upper bound on tr(K). This upper bound
is finite if and only if ν > 1/2.
Kernels with polynomial decaying eigenvalues include those that underlie for the Sobolev
spaces with different orders of smoothness (e.g. Birman and Solomjak, 1967; Gu, 2002). As
a concrete example, the first-order Sobolev kernel K(x, x0 ) = 1 + min{x, x0 } generates an
RKHS of Lipschitz functions with smoothness ν = 1. Other higher-order Sobolev kernels
also exhibit polynomial eigendecay with larger values of the parameter ν.

Corollary 4 For any kernel with ν-polynomial eigendecay (18), consider the output of the

Fast-KRR algorithm with λ = (1/N ) 2ν+1 . Suppose that Assumption B and Assumption A
(or A0 ) hold, and that the number of processors satisfy the bound
2(k−4)ν−k 1
! 2ν−1
k−2
N (2ν+1) N 2ν+1
m≤c (Assumption A) or m≤c 4 (Assumption A0 ),
ρ4k logk N ρ log N

where c is a constant only depending on ν. Then the mean-squared error is bounded as


 2  2ν 
h
∗ 2
i σ 2ν+1
E f¯ − f 2 = O . (19)
N

The upper bound (19) is unimprovable up to constant factors, as shown by known


minimax bounds on estimation error in Sobolev spaces (Stone, 1982; Tsybakov, 2009); see
also Theorem 2(b) of Raskutti et al. (2012).

3.3.3 Exponentially Decaying Eigenvalues


Our final corollary applies to kernel operators with eigenvalues that obey a bound of the
form

µj ≤ c1 exp(−c2 j 2 ) for all j = 1, 2, . . ., (20)

for strictly positive constants (c1 , c2 ). Such classes include the RKHS generated by the
Gaussian kernel K(x, x0 ) = exp(−kx − x0 k22 ).
Corollary 5 For a kernel with sub-Gaussian eigendecay (20), consider the output of the
Fast-KRR algorithm with λ = 1/N . Suppose that Assumption B and Assumption A (or A0 )
hold, and that the number of processors satisfy the bound
k−4
N k−2 N
m≤c (Assumption A) or m≤c (Assumption A0 ),
ρ
4k
k−2 log
2k−1
k−2 N ρ4 log2 N

3308
Divide and Conquer Kernel Ridge Regression

where c is a constant only depending on c2 . Then the mean-squared error is bounded as


 √ 
h
¯ ∗ 2
i
2 log N
E f −f 2 =O σ . (21)
N

The upper bound (21) is minimax optimal; see, for example, Theorem 1 and Example 2 of
the recent paper by Yang et al. (2015).

3.3.4 Summary
Each corollary gives a critical threshold for the number m of data partitions: as long as m is
below this threshold, the decomposition-based Fast-KRR algorithm gives the optimal rate
of convergence. It is interesting to note that the number of splits may be quite large: each
grows asymptotically with N whenever the basis functions have more than four moments
(viz. Assumption A). Moreover, the Fast-KRR method can attain these optimal conver-
gence rates while using substantially less computation than standard kernel ridge regression
methods, as it requires solving problems only of size N/m.

3.4 The Choice of Regularization Parameter


In practice, the local sample size on each machine may be different and the optimal choice
for the regularization λ may not be known a priori, so that an adaptive choice of the regu-
larization parameter λ is desirable (e.g. Tsybakov, 2009, Chapters 3.5–3.7). We recommend
using cross-validation to choose the regularization parameter, and we now sketch a heuristic
argument that an adaptive algorithm using cross-validation may achieve optimal rates of
convergence. (We leave fuller analysis to future work.)
Let λn be the (oracle) optimal regularization parameter given knowledge of the sampling
distribution P and eigen-structure of the kernel K. We assume (cf. Corollary 4) that there
is a constant ν > 0 such that λn  n−ν as n → ∞. Let ni be the √ local sample size for each
machine i and N the global sample size; we assume that ni  N (clearly, N ≥ ni ). First,
use local cross-validation to choose regularization parameters λ bn and λ
i
b 2 corresponding
ni /N
to samples of size ni and n2i /N , respectively. Heuristically, if cross validation is successful,
we expect to have λ bn ' n−ν and λ b 2 ' N ν n−2ν , yielding that λ b 2 ' N −ν . With
b 2 /λ
i i ni /N i ni ni /N
this intuition, we then compute local estimates

b2
1 b(i) := λni
X
fbi := argmin b(i) kf k2
(f (x) − y)2 + λ where λ (22)
H
f ∈H ni λ
b 2
n /N
(x,y)∈Si i

and global average estimate f¯ = m ni b


P
i=1 N fi as usual. Notably, we have λ(i) ' λN in this
b
heuristic setting. Using formula (22) and the average f¯, we have

m
 X  2 m 2
h
2
i ni  b X ni  b 
E f¯ − f ∗ 2
=E fi − E[fbi ] + E[fi ] − f ∗
N 2 N
i=1 i=1 2
m
X n2i h b i n
bi ] 2 + max E[fbi ] − f ∗ 2
o
≤ E fi − E[f 2 2
. (23)
N2 i∈[m]
i=1

3309
Zhang, Duchi and Wainwright

Using Lemmas 6 and 7 from the proof of Theorem 1 to come and assuming that λ bn is con-
∗ 2 ∗ 2
centrated tightly enough around λn , we obtain kE[fbi ] − f k2 = O(λN kf kH ) by Lemma 6
and that E[kfbi − E[fbi ]k22 ] =PO( γ(λ N)
ni ) by Lemma 7. Substituting these bounds into inequal-
ity (23) and noting that i ni = N , we may upper bound the overall estimation error
as  
h
¯ ∗ 2
i
∗ 2 γ(λN )
E f − f 2 ≤ O(1) · λN kf kH + .
N
While the derivation of this upper bound was non-rigorous, we believe that it is roughly
accurate, and in comparison with the previous upper bound (8), it provides optimal rates
of convergence.

4. Proofs of Theorem 1 and Related Results


We now turn to the proofs of Theorem 1 and Corollaries 3 through 5. This section con-
tains only a high-level view of proof of Theorem 1; we defer more technical aspects to the
appendices.

4.1 Proof of Theorem 1


Pm b
Using the definition of the averaged estimate f¯ = 1
m i=1 fi , a bit of algebra yields

2 2
E[ f¯ − f ∗ 2
] = E[ (f¯ − E[f¯]) + (E[f¯] − f ∗ ) 2 ]
2 2
= E[ f¯ − E[f¯] 2 ] + E[f¯] − f ∗ 2 + 2E[hf¯ − E[f¯], E[f¯] − f ∗ iL2 (P) ]
 m 2
1 X b 2
=E (fi − E[fi ])
b + E[f¯] − f ∗ 2 ,
m 2
i=1

where we used the fact that E[fbi ] = E[f¯] for each i ∈ [m]. Using this unbiasedness once
more, we bound the variance of the terms fbi − E[f¯] to see that
h
2
i 1 h i
E f¯ − f ∗ 2 = E kfb1 − E[fb1 ]k22 + kE[fb1 ] − f ∗ k22
m
1 h i
≤ E kfb1 − f ∗ k22 + kE[fb1 ] − f ∗ k22 , (24)
m
where we have used the fact that E[fbi ] minimizes E[kfbi − f k22 ] over f ∈ H.
The error bound (24) suggests our strategy: we upper bound E[kfb1 − f ∗ k22 ] and kE[fb1 ] −
∗ 2
f k2 respectively. Based on equation (3), the estimate fb1 is obtained from a standard
kernel ridge regression with sample size n = N/m and ridge parameter λ. Accordingly, the
following two auxiliary results provide bounds on these two terms, where the reader should
recall the definitions of b(n, d, k) and βd from equation (5). In each lemma, C represents a
universal (numerical) constant.

Lemma 6 (Bias bound) Under Assumptions A and B, for each d = 1, 2, . . ., we have


k
8ρ4 kf ∗ k2H tr(K)βd ρ2 γ(λ)

∗ 2
∗ 2
kE[f ] − f k2 ≤ 8λ kf kH +
b + Cb(n, d, k) √ µ0 kf ∗ k2H . (25)
λ n

3310
Divide and Conquer Kernel Ridge Regression

Lemma 7 (Variance bound) Under Assumptions A and B, for each d = 1, 2, . . ., we


have

12σ 2 γ(λ)
E[kfb − f ∗ k22 ] ≤ 12λ kf ∗ k2H +
n !
2 γ(λ) k
 2  4 tr(K)β
 
2σ 12ρ d ρ
+ + 4 kf ∗ k2H µd+1 + + Cb(n, d, k) √ kf ∗ k22 . (26)
λ λ n

The proofs of these lemmas, contained in Appendices A and B respectively, constitute one
main technical contribution of this paper. Given these two lemmas, the remainder of the
theorem proof is straightforward. Combining the inequality (24) with Lemmas 6 and 7
yields the claim of Theorem 1.

Remarks: The proofs of Lemmas 6 and 7 are somewhat complex, but to the best of our
knowledge, existing literature does not yield significantly simpler proofs. We now discuss
this claim to better situate our technical contributions. Define the regularized population
minimizer fλ∗ := argminf ∈H {E[(f (X) − Y )2 ] + λ kf k2H }. Expanding the decomposition (24)
of the L2 (P)-risk into bias and variance terms, we obtain the further bound
h
2
i 1 h i
E f¯ − f ∗ 2
≤ kE[fb1 ] − f ∗ k22 + E kfb1 − f ∗ k22
m
1  h i  1
∗ 2
= kE[fb1 ] − f k2 + kfλ − f k22 + E kfb1 − f ∗ k22 − kfλ∗ − f ∗ k22 = T1 + (T2 + T3 ).
∗ ∗
| {z } m | {z } | {z } m
:=T1 :=T2
:=T3

In this decomposition, T1 and T2 are bias and approximation error terms induced by the
regularization parameter λ, while T3 is an excess risk (variance) term incurred by minimizing
the empirical loss.
This upper bound illustrates three trade-offs in our subsampled and averaged kernel
regression procedure:

• The trade-off between T2 and T3 : when the regularization parameter λ grows, the
bias term T2 increases while the variance term T3 converges to zero.

• The trade-off between T1 and T3 : when the regularization parameter λ grows, the
bias term T1 increases while the variance term T3 converges to zero.

• The trade-off between T1 and the computation time: when the number of machines
m grows, the bias term T1 increases (as the local sample size n = N/m shrinks), while
the computation time N 3 /m2 decreases.

Theoretical results in the KRR literature focus on the trade-off between T2 and T3 , but in
the current context, we also need an upper bound on the bias term T1 , which is not relevant
for classical (centralized) analyses.
With this setting in mind, Lemma 6 tightly upper bounds the bias T1 as a function of
λ and n. An essential part of the proof is to characterize the properties of E[fb1 ], which is
the expectation of a nonparametric empirical loss minimizer. We are not aware of existing

3311
Zhang, Duchi and Wainwright

literature on this problem, and the proof of Lemma 6 introduces novel techniques for this
purpose.
On the other hand, Lemma 7 upper bounds E[kfb1 − f ∗ k22 ] as a function of λ and n.
Past work has focused on bounding a quantity of this form, but for technical reasons, most
work (e.g. van de Geer, 2000; Mendelson, 2002b; Bartlett et al., 2002; Zhang, 2005) focuses
on analyzing the constrained form
1 X
fbi := argmin (f (x) − y)2 , (27)
kf kH ≤C |Si | (x,y)∈Si

of kernel ridge regression. While this problem traces out the same set of solutions as that
of the regularized kernel ridge regression estimator (3), it is non-trivial to determine a
matched setting of λ for a given C. Zhang (2003) provides one of the few analyses of the
regularized ridge regression estimator (3) (or (2)), providing an upper bound of the form
E[kfb − f ∗ k22 ] = O(λ + 1/λ √1
n ), which is at best O( n ). In contrast, Lemma 7 gives upper
bound O(λ + γ(λ)
n ); the effective dimension γ(λ) is often much smaller than 1/λ, yielding a
stronger convergence guarantee.

4.2 Proof of Corollary 3


We first present a general inequality bounding the size of m for which optimal convergence
rates are possible. We assume that d is chosen large enough such that we have log(d) ≥ k
and d ≥ N . In the rest of the proof, our assignment to d will satisfy these inequalities. In
this case, inspection of Theorem 1 shows that if m is small enough that
s !k
log d 2 1 γ(λ)
ρ γ(λ) ≤ ,
N/m mλ N

then the term T3 (d) provides a convergence rate given by γ(λ)/N . Thus, solving the ex-
pression above for m, we find
2 k−2
m log d 4 λ2/k m2/k γ(λ)2/k k−2 λk N k
ρ γ(λ)2 = or m k = k−1 .
N N 2/k γ(λ)2 k ρ4 log d
Taking (k − 2)/k-th roots of both sides, we obtain that if
2
λ k−2 N
m≤ k−1 4k k
, (28)
γ(λ)2 k−2 ρ k−2 log k−2 d
then the term T3 (d) of the bound (7) is O(γ(λ)/N ).
Now we apply the bound (28) in the case in the corollary. Let us take d = max{r, N }.
Notice that βd = βr = µr+1 = 0. We find that γ(λ) ≤ r since each of its terms is bounded
by 1, and we take λ = r/N . Evaluating the expression (28) with this value, we arrive at
k−4
N k−2
m≤ k−1 4k k
.
r2 k−2 ρ k−2 log k−2 d

3312
Divide and Conquer Kernel Ridge Regression

If we have sufficiently many moments that k ≥ log N , and N ≥ r (for example, if the basis
functions φj have a uniform bound ρ, then k can be chosen arbitrarily large), then we may
k−4 k−1 4k
take k = log N , which implies that N k−2 = Ω(N ), r2 k−2 = O(r2 ) and ρ k−2 = O(ρ4 ) ; and
we replace log d with log N . Then so long as
N
m≤c
r2 ρ4 log N
for some constant c > 0, we obtain an identical result.

4.3 Proof of Corollary 4


We follow the program outlined in our remarks following Theorem 1. We must first choose

λ on the order of γ(λ)/N . To that end, we note that setting λ = N − 2ν+1 gives

X 1 1 X 1
γ(λ) = 2ν ≤ N 2ν+1 + 2ν
− 2ν+1
j=1 1 + j 2ν N 1 1 + j 2ν N − 2ν+1
j>N 2ν+1
Z
1 2ν 1 1 1 1
≤N 2ν+1 +N 2ν+1
1 2ν
du = N 2ν+1 + N 2ν+1 .
N 2ν+1 u 2ν − 1
Dividing by N , we find that λ ≈ γ(λ)/N , as desired. Now we choose the truncation
parameter d. By choosing d = N t for some t ∈ R+ , then we find that µd+1 . N −2νt and
an integration yields βd . N −(2ν−1)t . Setting t = 3/(2ν − 1) guarantees that µd+1 . N −3
and βd . N −3 ; the corresponding terms in the bound (7) are thus negligible. Moreover, we
have for any finite k that log d & k.
Applying the general bound (28) on m, we arrive at the inequality
4ν 2(k−4)ν−k
− (2ν+1)(k−2)
N N N (2ν+1)(k−2)
m≤c 2(k−1) 4k k
=c 4k k .
N (2ν+1)(k−2) ρ k−2 log k−2 N ρ k−2 log k−2 N

Whenever this holds, we have convergence rate λ = N − 2ν+1 . Now, let Assumption A0 hold.
Then taking k = log N , the above bound becomes (to a multiplicative constant factor)
2ν−1
N 2ν+1 /ρ4 log N as claimed.

4.4 Proof of Corollary 5


First, we set λ = 1/N . Considering the sum γ(λ) = ∞
P
p j=1 µj /(µj + λ),pwe see that for
j ≤ (log N )/c2 , the elements of the sum are bounded by 1. For j > (log N )/c2 , we
make the approximation
Z ∞
X µj 1 X
≤ µj . N √ exp(−c2 t2 )dt = O(1).
√ µ j + λ λ √ (log N )/c2
j≥ (log N )/c2 j≥ (log N )/c2

Thus we find that γ(λ) + 1 ≤ c log N for some constant c. By choosing d = N 2 , we
have that the tail sum and (d + 1)-th eigenvalue both satisfy µd+1 ≤ βd . c−12 N
−4 . As a

consequence, all the terms involving βd or µd+1 in the bound (7) are negligible.

3313
Zhang, Duchi and Wainwright

Recalling our inequality (28), we thus find that (under Assumption A), as long as the
number of partitions m satisfies
k−4
N k−2
m≤c 4k 2k−1 ,
ρ k−2 log k−2 N

the convergence rate of f¯ to f ∗ is given by γ(λ)/N ' log N /N . Under the boundedness
assumption A0 , as we did in the proof of Corollary 3, we take k = log N in Theorem 1. By
inspection, this yields the second statement of the corollary.

5. Proof of Theorem 2 and Related Results


In this section, we provide the proofs of Theorem 2, as well as the bound (13) based on the
alternative form of the residual error. As in the previous section, we present a high-level
proof, deferring more technical arguments to the appendices.

5.1 Proof of Theorem 2


We begin by stating and proving two auxiliary claims:
E (Y − f (X))2 = E (Y − f ∗ (X))2 + kf − f ∗ k22 for any f ∈ L2 (P),
   
and (29a)
fλ̄∗ = argmin kf − f ∗ k22 . (29b)
kf kH ≤R

Let us begin by proving equality (29a). By adding and subtracting terms, we have
E (Y − f ∗ (X))2 = E (Y − f ∗ (X))2 + kf − f ∗ k22 + 2E[(f (X) − f ∗ (X))E[Y − f ∗ (X) | X]]
   

(i)
= E (Y − f ∗ (X))2 + kf − f ∗ k22 ,
 

where equality (i) follows since the random variable Y − f ∗ (X) is mean-zero given X = x.
For the second equality (29b), consider any function f in the RKHS that satisfies the
bound kf kH ≤ R. The definition of the minimizer fλ̄∗ guarantees that

E (fλ̄∗ (X) − Y )2 + λ̄R2 ≤ E[(f (X) − Y )2 ] + λ̄ kf k2H ≤ E[(f (X) − Y )2 ] + λ̄R2 .


 

This result combined with equation (29a) establishes the equality (29b).

We now turn to the proof of the theorem. Applying Hölder’s inequality yields that
 
∗ 2 1 2 2
¯
f −f 2 ≤ 1+ fλ̄∗ − f ∗ 2 + (1 + q) f¯ − fλ̄∗ 2
q
 
1 2
= 1+ inf kf − f ∗ k22 + (1 + q) f¯ − fλ̄∗ 2 for all q > 0, (30)
q kf kH ≤R
2
where the second step follows from equality (29b). It thus suffices to upper bound f¯ − fλ̄∗ 2 ,
and following the deduction of inequality (24), we immediately obtain the decomposition
formula
h
2
i 1
E f¯ − fλ̄∗ 2 ≤ E[kfb1 − fλ̄∗ k22 ] + kE[fb1 ] − fλ̄∗ k22 , (31)
m

3314
Divide and Conquer Kernel Ridge Regression

where fb1 denotes the empirical minimizer for one of the subsampled datasets (i.e. the
standard KRR solution on a sample of size n = N/m with regularization λ). This suggests
our strategy, which parallels our proof of Theorem 1: we upper bound E[kfb1 − fλ̄∗ k22 ] and
kE[fb1 ] − fλ̄∗ k22 , respectively. In the rest of the proof, we let fb = fb1 denote this solution.
Let the estimation error for a subsample be given by ∆ = fb − fλ̄∗ . Under Assump-
tions A and B0 , we have the following two lemmas bounding expression (31), which parallel
Lemmas 6 and 7 in the case when f ∗ ∈ H. In each lemma, C denotes a universal constant.

Lemma 8 For all d = 1, 2, . . ., we have

h i 16(λ̄ − λ)2 R2 8γ(λ)ρ2 τ 2


E k∆k22 ≤ + λ̄
λ n
4 tr(K)β
  !
2 γ(λ) k
q
4 16ρ d ρ
+ 32R4 + 8τλ̄ /λ2 µd+1 + + Cb(n, d, k) √ . (32)
λ n

Denoting the right hand side of inequality (32) by D2 , we have

Lemma 9 For all d = 1, 2, . . ., we have

4(λ̄ − λ)2 R2 C log2 (d)(ρ2 γ(λ))2 2


kE[∆]k22 ≤ + D
λ n
4ρ4 tr(K)βd
q  
+ 32R4 + 8τλ̄4 /λ2 µd+1 + . (33)
λ

See Appendices C and D for the proofs of these two lemmas.

Given these two lemmas, we can now complete the proof of the theorem. If the condi-
tions (10) hold, we have

λ 1
βd ≤ , µd+1 ≤ ,
(R2 + τλ̄2 /λ)N (R2 + τλ̄2 /λ)N
k
log2 (d)(ρ2 γ(λ))2 ρ2 γ(λ)

1 1
≤ and b(n, d, k) √ ≤ ,
n m n (R + τλ̄2 /λ)N
2

so there is a universal constant C 0 satisfying


k !
16ρ4 tr(K)βd ρ2 γ(λ) C0
q 
32R4 + 8τλ̄4 /λ2 µd+1 + + Cb(n, d, k) √ ≤ .
λ n N

Consequently, Lemma 8 yields the upper bound

8(λ̄ − λ)2 R2 8γ(λ)ρ2 τλ̄2 C 0


E[k∆k22 ] ≤ + + .
λ n N

3315
Zhang, Duchi and Wainwright

Since log2 (d)(ρ2 γ(λ))2 /n ≤ 1/m by assumption, we obtain


 C(λ̄ − λ)2 R2 Cγ(λ)ρ2 τλ̄2 C
E kf¯ − fλ̄∗ k22 ≤

+ +
λm N Nm
4(λ̄ − λ)2 R2 C(λ̄ − λ)2 R2 Cγ(λ)ρ2 τλ̄2 C C
+ + + + + ,
λ λm N Nm N
where C is a universal constant (whose value is allowed to change from line to line). Sum-
ming these bounds and using the condition that λ ≥ λ̄, we conclude that
Cγ(λ)ρ2 τλ̄2
 
¯ ∗ 2 C C
(λ − λ̄)R2 +
 
E kf − fλ̄ k2 ≤ 4 + + .
m N N
Combining this error bound with inequality (30) completes the proof.

5.2 Proof of Bound (13)


Using Theorem 2, it suffices to show that

τλ̄4 ≤ 8 tr(K)2 kfλ̄∗ k4H ρ4 + 8κ4 . (34)

By the tower property of expectations and Jensen’s inequality, we have

τλ̄4 = E[(E[(fλ̄∗ (x) − Y )2 | X = x])2 ] ≤ E[(fλ̄∗ (X) − Y )4 ] ≤ 8E[(fλ̄∗ (X))4 ] + 8E[Y 4 ].

Since we have assumed that E[Y 4 ] ≤ κ4 , the only remaining step is to upper bound
E[(fλ̄∗ (X))4 ]. Let fλ̄∗ have expansion (θ1 , θ2 , . . .) in the basis {φj }. For any x ∈ X , Hölder’s
inequality applied with the conjugates 4/3 and 4 implies the upper bound
 3/4  1/4
∞ 1/2 ∞ ∞ 2
1/4 1/2 θj φj (x) 1/3 2/3 θj 4
X X X
fλ̄∗ (x) = (µj θj ) 1/4
≤ µj θ j   φj (x) . (35)
µj µ j
j=1 j=1 j=1

Again applying Hölder’s inequality—this time with conjugates 3/2 and 3—to upper bound
the first term in the product in inequality (35), we obtain
∞ ∞
!1/3  ∞ ∞
!1/3
X 1/3 2/3
X 2/3 θj2 X 2/3 X θj2 2/3
µj θ j = µj ≤ µj = tr(K)2/3 kfλ̄∗ kH . (36)
µj µj
j=1 j=1 j=1 j=1

Combining inequalities (35) and (36), we find that



X θj2
E[(fλ̄∗ (X))4 ] ≤ tr(K)2 kfλ̄∗ k2H E[φ4j (X)] ≤ tr(K)2 kfλ̄∗ k4H ρ4 ,
µj
j=1

where we have used Assumption A. This completes the proof of inequality (34).

6. Experimental Results
In this section, we report the results of experiments on both simulated and real-world data
designed to test the sharpness of our theoretical predictions.

3316
Divide and Conquer Kernel Ridge Regression

m=1 −2 m=1
m=4 10 m=4
m=16 m=16
m=64 m=64
Mean square error

Mean square error


−3
10
−3
10

−4 −4
10 10

256 512 1024 2048 4096 8192 256 512 1024 2048 4096 8192
Total number of samples (N) Total number of samples (N)

(a) With under-regularization (b) Without under-regularization

Figure 1: The squared L2 (P)-norm between between the averaged estimate f¯ and the op-
timal solution f ∗ . (a) These plots correspond to the output of the Fast-KRR
algorithm: each sub-problem is under-regularized by using λ ' N −2/3 . (b)
Analogous plots when each sub-problem is not under-regularized—that is, with
λ = n−2/3 = (N/m)−2/3 chosen as if there were only a single dataset of size n.

6.1 Simulation Studies


We begin by exploring the empirical performance of our subsample-and-average methods
for a non-parametric regression problem on simulated datasets. For all experiments in
this section, we simulate data from the regression model y = f ∗ (x) + ε for x ∈ [0, 1],
where f ∗ (x) := min(x, 1 − x) is 1-Lipschitz, the noise variables ε ∼ N(0, σ 2 ) are normally
distributed with variance σ 2 = 1/5, and the samples xi ∼ Uni[0, 1]. The Sobolev space
of Lipschitz functions on [0, 1] has reproducing kernel K(x, x0 ) = 1 + min{x, x0 } and norm
2 1
kf kH = f 2 (0) + 0 (f 0 (z))2 dz. By construction, the function f ∗ (x) = min(x, 1 − x) satisfies
R

kf ∗ kH = 1. The kernel ridge regression estimator fb takes the form


N
α = (K + λN I)−1 y,
X
fb = αi K(xi , ·), where (37)
i=1

and K is the N × N Gram matrix and I is the N × N identity matrix. Since the first-
order Sobolev kernel has eigenvalues (Gu, 2002) that scale as µj ' (1/j)2 , the minimax
convergence rate in terms of squared L2 (P)-error is N −2/3 (see e.g. Tsybakov (2009); Stone
(1982); Caponnetto and De Vito (2007)).
By Corollary 4 with ν = 1, this optimal rate of convergence can be achieved by Fast-KRR
with regularization parameter λ ≈ N −2/3 as long as the number of partitions m satisfies
m . N 1/3 . In each of our experiments, we begin with a dataset of size N = mn, which we
partition uniformly at random into m disjoint subsets. We compute the local estimator fbi
for each of the m subsets using n samples via (37), where the Gram matrix is constructed
using the ith batch of samples (and n replaces N ). We then compute f¯ = (1/m) m
P
i=1 fi .
b

3317
Zhang, Duchi and Wainwright

−1
10
N=256
N=512
N=1024
−2 N=2048
10
N=4096

Mean square error


N=8192

−3
10

−4
10

−5
10
0 0.2 0.4 0.6 0.8 1
log(# of partitions)/log(# of samples)

Figure 2: The mean-square error curves for fixed sample size but varied number of parti-
tions. We are interested in the threshold of partitioning number m under which
the optimal rate of convergence is achieved.

Our experiments compare the error of f¯ as a function of sample size N , the number of
partitions m, and the regularization λ.
In Figure 6.1(a), we plot the error kf¯ − f ∗ k22 versus the total number of samples N , where
N ∈ {28 , 29 , . . . , 213 }, using four different data partitions m ∈ {1, 4, 16, 64}. We execute each
simulation 20 times to obtain standard errors for the plot. The black circled curve (m = 1)
gives the baseline KRR error; if the number of partitions m ≤ 16, Fast-KRR has accuracy
comparable to the baseline algorithm. Even with m = 64, Fast-KRR’s performance closely
matches the full estimator for larger sample sizes (N ≥ 211 ). In the right plot Figure 6.1(b),
we perform an identical experiment, but we over-regularize by choosing λ = n−2/3 rather
than λ = N −2/3 in each of the m sub-problems, combining the local estimates by averaging
as usual. In contrast to Figure 6.1(a), there is an obvious gap between the performance of
the algorithms when m = 1 and m > 1, as our theory predicts.
It is also interesting to understand the number of partitions m into which a dataset
of size N may be divided while maintaining good statistical performance. According to
Corollary 4 with ν = 1, for the first-order Sobolev kernel, performance degradation should
be limited as long as m . N 1/3 . In order to test this prediction, Figure 2 plots the mean-
square error kf¯ − f ∗ k22 versus the ratio log(m)/ log(N ). Our theory predicts that even as the
number of partitions m may grow polynomially in N , the error should grow only above some
constant value of log(m)/ log(N ). As Figure 2 shows, the point that kf¯ − f ∗ k2 begins to
increase appears to be around log(m) ≈ 0.45 log(N ) for reasonably large N . This empirical
performance is somewhat better than the (1/3) thresholded predicted by Corollary 4, but it
does confirm that the number of partitions m can scale polynomially with N while retaining
minimax optimality.

3318
Divide and Conquer Kernel Ridge Regression

N m=1 m = 16 m = 64 m = 256 m = 1024


Error 1.26 · 10−4 1.33 · 10−4 1.38 · 10−4
212 N/A N/A
Time 1.12 (0.03) 0.03 (0.01) 0.02 (0.00)
Error 6.40 · 10−5 6.29 · 10−5 6.72 · 10−5
213 N/A N/A
Time 5.47 (0.22) 0.12 (0.03) 0.04 (0.00)
Error 3.95 · 10−5 4.06 · 10−5 4.03 · 10−5 3.89 · 10−5
214 N/A
Time 30.16 (0.87) 0.59 (0.11) 0.11 (0.00) 0.06 (0.00)
Error 2.90 · 10−5 2.84 · 10−5 2.78 · 10−5
215 Fail N/A
Time 2.65 (0.04) 0.43 (0.02) 0.15 (0.01)
Error 1.75 · 10−5 1.73 · 10−5 1.71 · 10−5 1.67 · 10−5
216 Fail
Time 16.65 (0.30) 2.21 (0.06) 0.41 (0.01) 0.23 (0.01)
Error 1.19 · 10−5 1.21 · 10−5 1.25 · 10−5 1.24 · 10−5
217 Fail
Time 90.80 (3.71) 10.87 (0.19) 1.88 (0.08) 0.60 (0.02)

Table 1: Timing experiment giving kf¯ − f ∗ k22 as a function of number of partitions m and
data size N , providing mean run-time (measured in second) for each number m of
partitions and data size N .

Our final experiment gives evidence for the improved time complexity partitioning pro-
vides. Here we compare the amount of time required to solve the KRR problem using the
naive matrix inversion (37) for different partition sizes m and provide the resulting squared
errors kf¯ − f ∗ k22 . Although there are more sophisticated solution strategies, we believe this
is a reasonable proxy to exhibit Fast-KRR’s potential. In Table 1, we present the results
of this simulation, which we performed in Matlab using a Windows machine with 16GB
of memory and a single-threaded 3.4Ghz processor. In each entry of the table, we give
the mean error of Fast-KRR and the mean amount of time it took to run (with standard
deviation over 10 simulations in parentheses; the error rate standard deviations are an order
of magnitude smaller than the errors, so we do not report them). The entries “Fail” corre-
spond to out-of-memory failures because of the large matrix inversion, while entries “N/A”
indicate that kf¯ − f ∗ k2 was significantly larger than the optimal value (rendering time im-
provements meaningless). The table shows that without sacrificing accuracy, decomposition
via Fast-KRR can yield substantial computational improvements.

6.2 Real Data Experiments


We now turn to the results of experiments studying the performance of Fast-KRR on the
task of predicting the year in which a song was released based on audio features associated
with the song. We use the Million Song Dataset (Bertin-Mahieux et al., 2011), which
consists of 463,715 training examples and a second set of 51,630 testing examples. Each
example is a song (track) released between 1922 and 2011, and the song is represented as
a vector of timbre information computed about the song. Each sample consists of the pair
(xi , yi ) ∈ Rd × R, where xi ∈ Rd is a d = 90-dimensional vector and yi ∈ [1922, 2011] is the
year in which the song was released. (For further details, see Bertin-Mahieux et al. (2011)).

3319
Zhang, Duchi and Wainwright

83
Fast−KRR
Nystrom Sampling
82.5 Random Feature Approx.

Mean square error


82

81.5

81

80.5

80
0 200 400 600 800 1000
Training runtime (sec)

Figure 3: Results on year prediction on held-out test songs for Fast-KRR, Nyström sam-
pling, and random feature approximation. Error bars indicate standard deviations
over ten experiments.

Our experiments with this dataset use the Gaussian radial basis kernel
!
kx − x 0 k2
K(x, x0 ) = exp − 2
. (38)
2σ 2

We normalize the feature vectors x so that the timbre signals have standard deviation 1,
and select the bandwidth parameter σ = 6 via cross-validation. For regularization, we set
λ = N −1 ; since the Gaussian kernel has exponentially decaying eigenvalues (for typical
distributions on X), Corollary 5 shows that this regularization achieves the optimal rate of
convergence for the Hilbert space.
In Figure 3, we compare the time-accuracy curve of Fast-KRR with two approximation-
based methods, plotting the mean-squared error between the predicted release year and
the actual year on test songs. The first baseline is Nyström subsampling (Williams and
Seeger, 2001), where the kernel matrix is approximated by a low-rank matrix of rank r ∈
{1, . . . , 6} × 103 . The second baseline approach is an approximate form of kernel ridge
regression using random features (Rahimi and Recht, 2007). The algorithm approximates
the Gaussian kernel (38) by the inner product of two random feature vectors of dimensions
D ∈ {2, 3, 5, 7, 8.5, 10} × 103 , and then solves the resulting linear regression problem. For
the Fast-KRR algorithm, we use seven partitions m ∈ {32, 38, 48, 64, 96, 128, 256} to test
the algorithm. Each algorithm is executed 10 times to obtain standard deviations (plotted
as error-bars in Figure 3).
As we see in Figure 3, for a fixed time budget, Fast-KRR enjoys the best performance,
though the margin between Fast-KRR and Nyström sampling is not substantial. In spite of
this close performance between Nyström sampling and the divide-and-conquer Fast-KRR

algorithm, it is worth noting that with parallel computation, it is trivial to accelerate Fast-KRR $m$ times; parallelizing approximation-based methods appears to be a non-trivial task. Moreover, as our results in Section 3 indicate, Fast-KRR is minimax optimal in many regimes. Around the time the conference version of this paper was submitted, Bach (2013) published the first results we know of establishing convergence guarantees in $\ell_2$-error for Nyström sampling; see the discussion for more detail. We note in passing that standard linear regression with the original 90 features, while quite fast, with a runtime on the order of 1 second (ignoring data loading), has a mean-squared error of 90.44, which is significantly worse than the kernel-based methods.

Our final experiment provides a sanity check: is the final averaging step in Fast-KRR even necessary? To this end, we compare Fast-KRR with standard KRR using a fraction $1/m$ of the data. For the latter approach, we employ the standard regularization $\lambda \approx (N/m)^{-1}$. As Figure 4 shows, Fast-KRR achieves much lower error rates than KRR using only a fraction of the data. Moreover, averaging stabilizes the estimators: the standard deviations of the performance of Fast-KRR are negligible compared to those for standard KRR.

[Figure 4 plot: mean square error versus the number of partitions $m$ for Fast-KRR and for KRR trained on a $1/m$ fraction of the data.]

Figure 4: Comparison of the performance of Fast-KRR to a standard KRR estimator using a fraction $1/m$ of the data.

7. Discussion
In this paper, we present results establishing that our decomposition-based algorithm for
kernel ridge regression achieves minimax optimal convergence rates whenever the number
of splits m of the data is not too large. The error guarantees of our method depend on the
effective dimensionality $\gamma(\lambda) = \sum_{j=1}^{\infty} \mu_j/(\mu_j + \lambda)$ of the kernel. For any number of splits
$m \lesssim N/\gamma(\lambda)^2$, our method achieves estimation error decreasing as
$$\mathbb{E}\big[ \|\bar{f} - f^*\|_2^2 \big] \lesssim \lambda \|f^*\|_{\mathcal{H}}^2 + \frac{\sigma^2 \gamma(\lambda)}{N}.$$
(In particular, recall the bound (8) following Theorem 1.) Notably, this convergence rate is minimax optimal, and we achieve substantial computational benefits from the subsampling schemes, in that computational cost scales (nearly) linearly in $N$.
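
The effective dimensionality $\gamma(\lambda)$ is straightforward to compute from a (truncated) eigenvalue sequence, as in the following sketch; the polynomial eigenvalue decay and the particular choice of $\lambda$ are illustrative assumptions of ours, not values from our experiments.

```python
import numpy as np

def effective_dimension(mu, lam):
    # gamma(lambda) = sum_j mu_j / (mu_j + lambda), computed over a finite truncation.
    mu = np.asarray(mu, dtype=float)
    return float(np.sum(mu / (mu + lam)))

# Example: Sobolev-type decay mu_j = j^{-2*nu}.  With lam = N^{-2nu/(2nu+1)},
# gamma(lam) grows on the order of lam^{-1/(2nu)} = N^{1/(2nu+1)}.
nu, N = 2.0, 10_000
mu = np.arange(1, 200_000, dtype=float) ** (-2 * nu)
lam = N ** (-2 * nu / (2 * nu + 1))
print(effective_dimension(mu, lam), lam ** (-1.0 / (2 * nu)))
```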
It is also interesting to consider the number of kernel evaluations required to implement our method. Our estimator requires $m$ sub-matrices of the full kernel (Gram) matrix, each of size $N/m \times N/m$. Since the method may use $m \lesssim N/\gamma^2(\lambda)$ machines, in the best case, it requires at most $N \gamma^2(\lambda)$ kernel evaluations. By contrast, Bach (2013) shows that Nyström-based subsampling can be used to form an estimator within a constant factor of optimal as long as the number of $N$-dimensional subsampled columns of the kernel matrix scales roughly as the marginal dimension $\widetilde{\gamma}(\lambda) = N \big\| \mathrm{diag}\big( K (K + \lambda N I)^{-1} \big) \big\|_\infty$. Consequently, using roughly $N \widetilde{\gamma}(\lambda)$ kernel evaluations, Nyström subsampling can achieve optimal convergence rates. These two scalings, namely $N \gamma^2(\lambda)$ versus $N \widetilde{\gamma}(\lambda)$, are currently not comparable: in some situations, such as when the data is not compactly supported, $\widetilde{\gamma}(\lambda)$ can scale linearly with $N$, while in others it appears to scale roughly as the true effective dimensionality $\gamma(\lambda)$. A natural question arising from these lines of work is to understand the true optimal scaling for these different estimators: is one fundamentally better than the other? Are there natural computational tradeoffs that can be leveraged at large scale? As datasets grow substantially larger and more complex, these questions should become even more important, and we hope to continue to study them.
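
For intuition about how these two quantities compare on a given dataset, one can compute empirical versions of both directly from an $N \times N$ kernel matrix, as in the sketch below; here the eigenvalues of $K/N$ serve as surrogates for the $\mu_j$, and the function name is ours.

```python
import numpy as np

def empirical_dimensions(K, lam):
    # gamma(lam): trace-type effective dimension computed from the eigenvalues of K / N.
    # gamma_tilde(lam): N times the largest ridge-leverage score, i.e.
    #                   N * || diag( K (K + lam*N*I)^{-1} ) ||_inf.
    N = K.shape[0]
    mu = np.linalg.eigvalsh(K) / N
    gamma = float(np.sum(mu / (mu + lam)))
    leverage = np.diag(K @ np.linalg.inv(K + lam * N * np.eye(N)))
    gamma_tilde = float(N * np.max(leverage))
    return gamma, gamma_tilde
```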

Acknowledgments

We thank Francis Bach for interesting and enlightening conversations on the connections
between this work and his paper (Bach, 2013) and Yining Wang for pointing out a mistake
in an earlier version of this manuscript. We also thank two reviewers for useful feedback and
comments. JCD was partially supported by a National Defense Science and Engineering
Graduate Fellowship (NDSEG) and a Facebook PhD fellowship. This work was partially
supported by ONR MURI grant N00014-11-1-0688 to MJW.

Appendix A. Proof of Lemma 6


This appendix is devoted to the bias bound stated in Lemma 6. Let $X = \{x_i\}_{i=1}^n$ be shorthand for the design matrix, and define the error vector $\Delta = \hat{f} - f^*$. By Jensen's inequality, we have $\|\mathbb{E}[\Delta]\|_2 \le \mathbb{E}[\|\mathbb{E}[\Delta \mid X]\|_2]$, so it suffices to provide a bound on $\|\mathbb{E}[\Delta \mid X]\|_2$. Throughout this proof and the remainder of the paper, we represent the kernel evaluator by the function $\xi_x$, where $\xi_x := K(x, \cdot)$ and $f(x) = \langle \xi_x, f \rangle$ for any $f \in \mathcal{H}$. Using this notation, the estimate $\hat{f}$ minimizes the empirical objective
$$\frac{1}{n} \sum_{i=1}^n \big( \langle \xi_{x_i}, f \rangle_{\mathcal{H}} - y_i \big)^2 + \lambda \|f\|_{\mathcal{H}}^2. \qquad (39)$$


$\hat{f}$ : Empirical KRR minimizer based on $n$ samples
$f^*$ : Optimal function generating data, where $y_i = f^*(x_i) + \varepsilon_i$
$\Delta$ : Error $\hat{f} - f^*$
$\xi_x$ : RKHS evaluator $\xi_x := K(x, \cdot)$, so $\langle f, \xi_x \rangle = \langle \xi_x, f \rangle = f(x)$
$\hat{\Sigma}$ : Operator mapping $\mathcal{H} \to \mathcal{H}$ defined as the outer product $\hat{\Sigma} := \frac{1}{n} \sum_{i=1}^n \xi_{x_i} \otimes \xi_{x_i}$, so that $\hat{\Sigma} f = \frac{1}{n} \sum_{i=1}^n \langle \xi_{x_i}, f \rangle \xi_{x_i}$
$\phi_j$ : $j$th orthonormal basis vector for $L^2(\mathbb{P})$
$\delta_j$ : Basis coefficients of $\Delta$ or $\mathbb{E}[\Delta \mid X]$ (depending on context), i.e. $\Delta = \sum_{j=1}^\infty \delta_j \phi_j$
$\theta_j$ : Basis coefficients of $f^*$, i.e. $f^* = \sum_{j=1}^\infty \theta_j \phi_j$
$d$ : Integer-valued truncation point
$M$ : Diagonal matrix with $M = \mathrm{diag}(\mu_1, \ldots, \mu_d)$
$Q$ : Diagonal matrix with $Q = (I_{d \times d} + \lambda M^{-1})^{1/2}$
$\Phi$ : $n \times d$ matrix with coordinates $\Phi_{ij} = \phi_j(x_i)$
$v^\downarrow$ : Truncation of vector $v$. For $v = \sum_j \nu_j \phi_j \in \mathcal{H}$, defined as $v^\downarrow = \sum_{j=1}^d \nu_j \phi_j$; for $v \in \ell^2(\mathbb{N})$, defined as $v^\downarrow = (v_1, \ldots, v_d)$
$v^\uparrow$ : Untruncated part of vector $v$, defined as $v^\uparrow = (v_{d+1}, v_{d+2}, \ldots)$
$\beta_d$ : The tail sum $\sum_{j > d} \mu_j$
$\gamma(\lambda)$ : The sum $\sum_{j=1}^\infty 1/(1 + \lambda/\mu_j)$
$b(n, d, k)$ : The maximum $\max\big\{ \sqrt{\max\{k, \log(d)\}},\ \max\{k, \log(d)\}/n^{1/2 - 1/k} \big\}$

Table 2: Notation used in proofs

This objective is Fréchet differentiable, and as a consequence, the necessary and sufficient conditions for optimality (Luenberger, 1969) of $\hat{f}$ are that
$$\frac{1}{n} \sum_{i=1}^n \xi_{x_i} \big( \langle \xi_{x_i}, \hat{f} - f^* \rangle_{\mathcal{H}} - \varepsilon_i \big) + \lambda \hat{f} = \frac{1}{n} \sum_{i=1}^n \xi_{x_i} \big( \langle \xi_{x_i}, \hat{f} \rangle_{\mathcal{H}} - y_i \big) + \lambda \hat{f} = 0, \qquad (40)$$
where the last equation uses the fact that $y_i = \langle \xi_{x_i}, f^* \rangle_{\mathcal{H}} + \varepsilon_i$. Taking conditional expectations over the noise variables $\{\varepsilon_i\}_{i=1}^n$ with the design $X = \{x_i\}_{i=1}^n$ fixed, we find that
$$\frac{1}{n} \sum_{i=1}^n \xi_{x_i} \langle \xi_{x_i}, \mathbb{E}[\Delta \mid X] \rangle + \lambda \mathbb{E}[\hat{f} \mid X] = 0.$$
Define the sample covariance operator $\hat{\Sigma} := \frac{1}{n} \sum_{i=1}^n \xi_{x_i} \otimes \xi_{x_i}$. Adding and subtracting $\lambda f^*$ from the above equation yields
$$(\hat{\Sigma} + \lambda I)\, \mathbb{E}[\Delta \mid X] = -\lambda f^*. \qquad (41)$$
Consequently, we see that $\|\mathbb{E}[\Delta \mid X]\|_{\mathcal{H}} \le \|f^*\|_{\mathcal{H}}$, since $\hat{\Sigma} \succeq 0$.
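
To make the identity (41) concrete, the following short NumPy sketch checks it numerically in the finite-dimensional case of a linear kernel $K(x, x') = x^T x'$, where the RKHS is $\mathbb{R}^d$ and $\hat{\Sigma} = X^T X / n$; the dimensions, regularization value, and variable names are our illustrative choices, not part of the analysis above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)

# For the linear kernel, conditional on the design X the noise averages out, so
# E[w_hat | X] is the ridge solution applied to the noiseless responses X w_star.
Sigma_hat = X.T @ X / n
w_hat_mean = np.linalg.solve(Sigma_hat + lam * np.eye(d), Sigma_hat @ w_star)
Delta_mean = w_hat_mean - w_star

# Check the optimality identity (Sigma_hat + lam * I) E[Delta | X] = -lam * w_star.
lhs = (Sigma_hat + lam * np.eye(d)) @ Delta_mean
print(np.allclose(lhs, -lam * w_star))   # True
```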


We now use a truncation argument to reduce the problem to a finite-dimensional problem. To do so, we let $\delta \in \ell^2(\mathbb{N})$ denote the coefficients of $\mathbb{E}[\Delta \mid X]$ when expanded in the basis $\{\phi_j\}_{j=1}^\infty$:
$$\mathbb{E}[\Delta \mid X] = \sum_{j=1}^\infty \delta_j \phi_j, \quad \text{with} \quad \delta_j = \langle \mathbb{E}[\Delta \mid X], \phi_j \rangle_{L^2(\mathbb{P})}. \qquad (42)$$
For a fixed $d \in \mathbb{N}$, define the vectors $\delta^\downarrow := (\delta_1, \ldots, \delta_d)$ and $\delta^\uparrow := (\delta_{d+1}, \delta_{d+2}, \ldots)$ (we suppress dependence on $d$ for convenience). By the orthonormality of the collection $\{\phi_j\}$, we have
$$\|\mathbb{E}[\Delta \mid X]\|_2^2 = \|\delta\|_2^2 = \|\delta^\downarrow\|_2^2 + \|\delta^\uparrow\|_2^2. \qquad (43)$$
We control each of the elements of the sum (43) in turn.

Control of the term $\|\delta^\uparrow\|_2^2$: By definition, we have
$$\|\delta^\uparrow\|_2^2 = \frac{\mu_{d+1}}{\mu_{d+1}} \sum_{j=d+1}^\infty \delta_j^2 \le \mu_{d+1} \sum_{j=d+1}^\infty \frac{\delta_j^2}{\mu_j} \overset{(i)}{\le} \mu_{d+1} \|\mathbb{E}[\Delta \mid X]\|_{\mathcal{H}}^2 \overset{(ii)}{\le} \mu_{d+1} \|f^*\|_{\mathcal{H}}^2, \qquad (44)$$
where inequality (i) follows since $\|\mathbb{E}[\Delta \mid X]\|_{\mathcal{H}}^2 = \sum_{j=1}^\infty \delta_j^2/\mu_j$, and inequality (ii) follows from the bound $\|\mathbb{E}[\Delta \mid X]\|_{\mathcal{H}} \le \|f^*\|_{\mathcal{H}}$, which is a consequence of equality (41).

Control of the term $\|\delta^\downarrow\|_2^2$: Let $(\theta_1, \theta_2, \ldots)$ be the coefficients of $f^*$ in the basis $\{\phi_j\}$. In addition, define the matrices $\Phi \in \mathbb{R}^{n \times d}$ by
$$\Phi_{ij} = \phi_j(x_i) \quad \text{for } i \in \{1, \ldots, n\} \text{ and } j \in \{1, \ldots, d\},$$
and $M = \mathrm{diag}(\mu_1, \ldots, \mu_d) \succ 0 \in \mathbb{R}^{d \times d}$. Lastly, define the tail error vector $v \in \mathbb{R}^n$ by
$$v_i := \sum_{j > d} \delta_j \phi_j(x_i) = \mathbb{E}[\Delta^\uparrow(x_i) \mid X].$$
Let $l \in \mathbb{N}$ be arbitrary. Computing the (Hilbert) inner product of the terms in equation (41) with $\phi_l$, we obtain
$$-\lambda \frac{\theta_l}{\mu_l} = \langle \phi_l, -\lambda f^* \rangle = \big\langle \phi_l, (\hat{\Sigma} + \lambda) \mathbb{E}[\Delta \mid X] \big\rangle = \frac{1}{n} \sum_{i=1}^n \langle \phi_l, \xi_{x_i} \rangle \langle \xi_{x_i}, \mathbb{E}[\Delta \mid X] \rangle + \lambda \langle \phi_l, \mathbb{E}[\Delta \mid X] \rangle = \frac{1}{n} \sum_{i=1}^n \phi_l(x_i) \mathbb{E}[\Delta(x_i) \mid X] + \lambda \frac{\delta_l}{\mu_l}.$$
We can rewrite the final sum above using the fact that $\Delta = \Delta^\downarrow + \Delta^\uparrow$, which implies
$$\frac{1}{n} \sum_{i=1}^n \phi_l(x_i) \mathbb{E}[\Delta(x_i) \mid X] = \frac{1}{n} \sum_{i=1}^n \phi_l(x_i) \Big( \sum_{j=1}^d \phi_j(x_i) \delta_j + \sum_{j > d} \phi_j(x_i) \delta_j \Big).$$
Applying this equality for $l = 1, 2, \ldots, d$ yields
$$\Big( \frac{1}{n} \Phi^T \Phi + \lambda M^{-1} \Big) \delta^\downarrow = -\lambda M^{-1} \theta^\downarrow - \frac{1}{n} \Phi^T v. \qquad (45)$$

We now show how the expression (45) gives us the desired bound in the lemma. By defining the shorthand matrix $Q = (I + \lambda M^{-1})^{1/2}$, we have
$$\frac{1}{n} \Phi^T \Phi + \lambda M^{-1} = I + \lambda M^{-1} + \frac{1}{n} \Phi^T \Phi - I = Q \Big( I + Q^{-1} \Big( \frac{1}{n} \Phi^T \Phi - I \Big) Q^{-1} \Big) Q.$$
As a consequence, we can rewrite expression (45) to
$$\Big( I + Q^{-1} \Big( \frac{1}{n} \Phi^T \Phi - I \Big) Q^{-1} \Big) Q \delta^\downarrow = -\lambda Q^{-1} M^{-1} \theta^\downarrow - \frac{1}{n} Q^{-1} \Phi^T v. \qquad (46)$$
We now present a lemma bounding the terms in equality (46) to control $\delta^\downarrow$.

Lemma 10 The following bounds hold:
$$\big\| \lambda Q^{-1} M^{-1} \theta^\downarrow \big\|_2^2 \le \lambda \|f^*\|_{\mathcal{H}}^2, \quad \text{and} \qquad (47a)$$
$$\mathbb{E}\Big[ \Big\| \frac{1}{n} Q^{-1} \Phi^T v \Big\|_2^2 \Big] \le \frac{\rho^4 \|f^*\|_{\mathcal{H}}^2 \operatorname{tr}(K) \beta_d}{\lambda}. \qquad (47b)$$
Define the event $\mathcal{E} := \big\{ \big|\big|\big| Q^{-1} \big( \tfrac{1}{n} \Phi^T \Phi - I \big) Q^{-1} \big|\big|\big| \le 1/2 \big\}$. Under Assumption A with moment bound $\mathbb{E}[\phi_j(X)^{2k}] \le \rho^{2k}$, there exists a universal constant $C$ such that
$$\mathbb{P}(\mathcal{E}^c) \le \Big( \max\Big\{ \sqrt{k \vee \log(d)},\ \frac{k \vee \log(d)}{n^{1/2 - 1/k}} \Big\} \frac{C \rho^2 \gamma(\lambda)}{\sqrt{n}} \Big)^k. \qquad (48)$$
We defer the proof of this lemma to Appendix A.1.


Based on this lemma, we can now complete the proof. Whenever the event E holds, we
know that I + Q−1 ((1/n)ΦT Φ − I)Q−1  (1/2)I. In particular, we have
2
kQδ ↓ k22 ≤ 4 λQ−1 M −1 θ↓ + (1/n)Q−1 ΦT v
2

on E, by Eq. (46). Since kQδ ↓ k22 ≥ kδ ↓ k22 , the above inequality implies that
2
kδ ↓ k22 ≤ 4 λQ−1 M −1 θ↓ + (1/n)Q−1 ΦT v
2

Since E is X-measurable, we thus obtain


h i h i h i
E kδ ↓ k22 = E 1(E) kδ ↓ k22 + E 1(E c ) kδ ↓ k22
 
2 h i
≤ 4E 1(E) λQ−1 M −1 θ↓ + (1/n)Q−1 ΦT v + E 1(E c ) kδ ↓ k22 .
2

Applying the bounds (47a) and (47b), along with the elementary inequality (a + b)2 ≤
2a2 + 2b2 , we have
h i 8ρ4 kf ∗ k2H tr(K)βd h i
E kδ ↓ k22 ≤ 8λ kf ∗ k2H + + E 1(E c ) kδ ↓ k22 . (49)
λ

3325
Zhang, Duchi and Wainwright

Now we use the fact that by the gradient optimality condition (41),

kE[∆ | X]k22 ≤ µ0 kE[∆ | X]k2H ≤ µ0 kf ∗ k2H

Recalling the shorthand (6) for b(n, d, k), we apply the bound (48) to see
k
Cb(n, d, k)ρ2 γ(λ)
h i 
E 1(E c
) kδ ↓ k22 ≤ P(E c
)µ0 kf ∗ k2H ≤ √ µ0 kf ∗ k2H
n

Combining this with the inequality (49), we obtain the desired statement of Lemma 6.

A.1 Proof of Lemma 10


Proof of bound (47a): Beginning with the proof of the bound (47a), we have
$$\big\| Q^{-1} M^{-1} \theta^\downarrow \big\|_2^2 = (\theta^\downarrow)^T (M^2 + \lambda M)^{-1} \theta^\downarrow \le (\theta^\downarrow)^T (\lambda M)^{-1} \theta^\downarrow = \frac{1}{\lambda} (\theta^\downarrow)^T M^{-1} \theta^\downarrow \le \frac{1}{\lambda} \|f^*\|_{\mathcal{H}}^2.$$
Multiplying both sides by $\lambda^2$ gives the result.

Proof of bound (47b): Next we turn to the proof of the bound (47b). We begin by re-writing $Q^{-1} \Phi^T v$ as the product of two components:
$$\frac{1}{n} Q^{-1} \Phi^T v = (M + \lambda I)^{-1/2} \Big( \frac{1}{n} M^{1/2} \Phi^T v \Big). \qquad (50)$$
The first matrix is a diagonal matrix whose operator norm is bounded:
$$\big|\big|\big| (M + \lambda I)^{-1/2} \big|\big|\big| = \max_{j \in [d]} \frac{1}{\sqrt{\mu_j + \lambda}} \le \frac{1}{\sqrt{\lambda}}. \qquad (51)$$
For the second factor in the product (50), the analysis is a little more complicated. Let $\Phi_\ell = (\phi_\ell(x_1), \ldots, \phi_\ell(x_n))$ be the $\ell$th column of $\Phi$. In this case,
$$\big\| M^{1/2} \Phi^T v \big\|_2^2 = \sum_{\ell=1}^d \mu_\ell (\Phi_\ell^T v)^2 \le \sum_{\ell=1}^d \mu_\ell \|\Phi_\ell\|_2^2 \|v\|_2^2, \qquad (52)$$
using the Cauchy-Schwarz inequality. Taking expectations with respect to the design $\{x_i\}_{i=1}^n$ and applying Hölder's inequality yields
$$\mathbb{E}\big[ \|\Phi_\ell\|_2^2 \|v\|_2^2 \big] \le \sqrt{\mathbb{E}\big[ \|\Phi_\ell\|_2^4 \big]} \sqrt{\mathbb{E}\big[ \|v\|_2^4 \big]}.$$
We bound each of the terms in this product in turn. For the first, we have
$$\mathbb{E}\big[ \|\Phi_\ell\|_2^4 \big] = \mathbb{E}\Big[ \Big( \sum_{i=1}^n \phi_\ell^2(X_i) \Big)^2 \Big] = \mathbb{E}\Big[ \sum_{i,j=1}^n \phi_\ell^2(X_i) \phi_\ell^2(X_j) \Big] \le n^2\, \mathbb{E}[\phi_\ell^4(X_1)] \le n^2 \rho^4,$$
since the $X_i$ are i.i.d., $\mathbb{E}[\phi_\ell^2(X_1)]^2 \le \mathbb{E}[\phi_\ell^4(X_1)]$, and $\mathbb{E}[\phi_\ell^4(X_1)] \le \rho^4$ by assumption. Turning to the term involving $v$, we have
$$v_i^2 = \Big( \sum_{j > d} \delta_j \phi_j(x_i) \Big)^2 \le \Big( \sum_{j > d} \frac{\delta_j^2}{\mu_j} \Big) \Big( \sum_{j > d} \mu_j \phi_j^2(x_i) \Big)$$
by Cauchy-Schwarz. As a consequence, we find
$$\mathbb{E}\big[ \|v\|_2^4 \big] = \mathbb{E}\Big[ \Big( n \cdot \frac{1}{n} \sum_{i=1}^n v_i^2 \Big)^2 \Big] \le n \sum_{i=1}^n \mathbb{E}[v_i^4] \le n \sum_{i=1}^n \mathbb{E}\Big[ \Big( \sum_{j > d} \frac{\delta_j^2}{\mu_j} \Big)^2 \Big( \sum_{j > d} \mu_j \phi_j^2(X_i) \Big)^2 \Big] \le n^2\, \mathbb{E}\Big[ \|\mathbb{E}[\Delta \mid X]\|_{\mathcal{H}}^4 \Big( \sum_{j > d} \mu_j \phi_j^2(X_1) \Big)^2 \Big],$$
since the $X_i$ are i.i.d. Using the fact that $\|\mathbb{E}[\Delta \mid X]\|_{\mathcal{H}} \le \|f^*\|_{\mathcal{H}}$, we expand the second square to find
$$\frac{1}{n^2} \mathbb{E}\big[ \|v\|_2^4 \big] \le \|f^*\|_{\mathcal{H}}^4\, \mathbb{E}\Big[ \sum_{j, k > d} \mu_j \mu_k \phi_j^2(X_1) \phi_k^2(X_1) \Big] \le \|f^*\|_{\mathcal{H}}^4 \rho^4 \sum_{j, k > d} \mu_j \mu_k = \|f^*\|_{\mathcal{H}}^4 \rho^4 \Big( \sum_{j > d} \mu_j \Big)^2.$$
Combining our bounds on $\|\Phi_\ell\|_2$ and $\|v\|_2$ with our initial bound (52), we obtain the inequality
$$\mathbb{E}\Big[ \big\| M^{1/2} \Phi^T v \big\|_2^2 \Big] \le \sum_{\ell=1}^d \mu_\ell \sqrt{n^2 \rho^4} \sqrt{n^2 \rho^4 \|f^*\|_{\mathcal{H}}^4 \Big( \sum_{j > d} \mu_j \Big)^2} = n^2 \rho^4 \|f^*\|_{\mathcal{H}}^2 \sum_{j > d} \mu_j \sum_{\ell=1}^d \mu_\ell.$$
Dividing by $n^2$, recalling the definition of $\beta_d = \sum_{j > d} \mu_j$, and noting that $\operatorname{tr}(K) \ge \sum_{\ell=1}^d \mu_\ell$ shows that
$$\mathbb{E}\Big[ \Big\| \frac{1}{n} M^{1/2} \Phi^T v \Big\|_2^2 \Big] \le \rho^4 \|f^*\|_{\mathcal{H}}^2 \beta_d \operatorname{tr}(K).$$
Combining this inequality with our expansion (50) and the bound (51) yields the claim (47b).

Proof of bound (48): We consider the expectation of the norm of $Q^{-1} \big( \frac{1}{n} \Phi^T \Phi - I \big) Q^{-1}$. For each $i \in [n]$, define $\pi_i := (\phi_1(x_i), \ldots, \phi_d(x_i))^T \in \mathbb{R}^d$, so that $\pi_i^T$ is the $i$-th row of the matrix $\Phi \in \mathbb{R}^{n \times d}$. Then we know that
$$Q^{-1} \Big( \frac{1}{n} \Phi^T \Phi - I \Big) Q^{-1} = \frac{1}{n} \sum_{i=1}^n Q^{-1} (\pi_i \pi_i^T - I) Q^{-1}.$$
Define the sequence of matrices $A_i := Q^{-1}(\pi_i \pi_i^T - I) Q^{-1}$, so that the matrices $A_i = A_i^T \in \mathbb{R}^{d \times d}$ are symmetric. Note that $\mathbb{E}[A_i] = 0$, and let $\varepsilon_i$ be i.i.d. $\{-1, 1\}$-valued Rademacher random variables. Applying a standard symmetrization argument (Ledoux and Talagrand, 1991), we find that for any $k \ge 1$, we have
$$\mathbb{E}\Big[ \Big|\Big|\Big| Q^{-1} \Big( \frac{1}{n} \Phi^T \Phi - I \Big) Q^{-1} \Big|\Big|\Big|^k \Big] = \mathbb{E}\Big[ \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n A_i \Big|\Big|\Big|^k \Big] \le 2^k\, \mathbb{E}\Big[ \Big|\Big|\Big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i A_i \Big|\Big|\Big|^k \Big]. \qquad (53)$$

Lemma 11 The quantity $\mathbb{E}\big[ \big|\big|\big| \frac{1}{n} \sum_{i=1}^n \varepsilon_i A_i \big|\big|\big|^k \big]^{1/k}$ is upper bounded by
$$\sqrt{e (k \vee 2 \log(d))}\, \frac{\rho^2 \sum_{j=1}^d \frac{1}{1 + \lambda/\mu_j}}{\sqrt{n}} + \frac{4 e (k \vee 2 \log(d))}{n^{1 - 1/k}}\, \rho^2 \sum_{j=1}^d \frac{1}{1 + \lambda/\mu_j}. \qquad (54)$$

We take this lemma as given for the moment, returning to prove it shortly. Recall the definition of the constant $\gamma(\lambda) = \sum_{j=1}^\infty 1/(1 + \lambda/\mu_j) \ge \sum_{j=1}^d 1/(1 + \lambda/\mu_j)$. Then using our symmetrization inequality (53), we have
$$\mathbb{E}\Big[ \Big|\Big|\Big| Q^{-1} \Big( \frac{1}{n} \Phi^T \Phi - I \Big) Q^{-1} \Big|\Big|\Big|^k \Big] \le 2^k \Big( \sqrt{e (k \vee 2 \log(d))}\, \frac{\rho^2 \gamma(\lambda)}{\sqrt{n}} + \frac{4 e (k \vee 2 \log(d))}{n^{1 - 1/k}} \rho^2 \gamma(\lambda) \Big)^k \le \Big( \max\Big\{ \sqrt{k \vee \log(d)},\ \frac{k \vee \log(d)}{n^{1/2 - 1/k}} \Big\} \frac{C \rho^2 \gamma(\lambda)}{\sqrt{n}} \Big)^k, \qquad (55)$$
where $C$ is a numerical constant. By definition of the event $\mathcal{E}$, we see by Markov's inequality that for any $k \in \mathbb{R}$, $k \ge 1$,
$$\mathbb{P}(\mathcal{E}^c) \le \frac{\mathbb{E}\big[ \big|\big|\big| Q^{-1} \big( \tfrac{1}{n} \Phi^T \Phi - I \big) Q^{-1} \big|\big|\big|^k \big]}{2^{-k}} \le \Big( \max\Big\{ \sqrt{k \vee \log(d)},\ \frac{k \vee \log(d)}{n^{1/2 - 1/k}} \Big\} \frac{2 C \rho^2 \gamma(\lambda)}{\sqrt{n}} \Big)^k.$$
This completes the proof of the bound (48).

It remains to prove Lemma 11, for which we make use of the following result, due to Chen et al. (2012, Theorem A.1(2)).

Lemma 12 Let $X_i \in \mathbb{R}^{d \times d}$ be independent symmetrically distributed Hermitian matrices. Then
$$\mathbb{E}\Big[ \Big|\Big|\Big| \sum_{i=1}^n X_i \Big|\Big|\Big|^k \Big]^{1/k} \le \sqrt{e (k \vee 2 \log d)}\, \Big|\Big|\Big| \Big( \sum_{i=1}^n \mathbb{E}[X_i^2] \Big)^{1/2} \Big|\Big|\Big| + 2 e (k \vee 2 \log d) \big( \mathbb{E}\big[ \max_i |||X_i|||^k \big] \big)^{1/k}. \qquad (56)$$

The proof of Lemma 11 is based on applying this inequality with $X_i = \varepsilon_i A_i / n$, and then bounding the two terms on the right-hand side of inequality (56).

We begin with the first term. Note that for any symmetric matrix $Z$, we have the matrix inequalities $0 \preceq \mathbb{E}[(Z - \mathbb{E}[Z])^2] = \mathbb{E}[Z^2] - \mathbb{E}[Z]^2 \preceq \mathbb{E}[Z^2]$, so
$$\mathbb{E}[A_i^2] = \mathbb{E}\big[ Q^{-1} (\pi_i \pi_i^T - I) Q^{-2} (\pi_i \pi_i^T - I) Q^{-1} \big] \preceq \mathbb{E}\big[ Q^{-1} \pi_i \pi_i^T Q^{-2} \pi_i \pi_i^T Q^{-1} \big].$$
Instead of computing these moments directly, we provide bounds on their norms. Since $\pi_i \pi_i^T$ is rank one and $Q$ is diagonal, we have
$$\big|\big|\big| Q^{-1} \pi_i \pi_i^T Q^{-1} \big|\big|\big| = \pi_i^T (I + \lambda M^{-1})^{-1} \pi_i = \sum_{j=1}^d \frac{\phi_j(x_i)^2}{1 + \lambda/\mu_j}.$$


We also note that, for any $k \in \mathbb{R}$, $k \ge 1$, convexity implies that
$$\Big( \sum_{j=1}^d \frac{\phi_j(x_i)^2}{1 + \lambda/\mu_j} \Big)^k = \Big( \sum_{\ell=1}^d \frac{1}{1 + \lambda/\mu_\ell} \Big)^k \Bigg( \frac{\sum_{j=1}^d \frac{\phi_j(x_i)^2}{1 + \lambda/\mu_j}}{\sum_{\ell=1}^d \frac{1}{1 + \lambda/\mu_\ell}} \Bigg)^k \le \Big( \sum_{\ell=1}^d \frac{1}{1 + \lambda/\mu_\ell} \Big)^k \frac{1}{\sum_{\ell=1}^d \frac{1}{1 + \lambda/\mu_\ell}} \sum_{j=1}^d \frac{\phi_j(x_i)^{2k}}{1 + \lambda/\mu_j},$$
so if $\mathbb{E}[\phi_j(X_i)^{2k}] \le \rho^{2k}$, we obtain
$$\mathbb{E}\Big[ \Big( \sum_{j=1}^d \frac{\phi_j(x_i)^2}{1 + \lambda/\mu_j} \Big)^k \Big] \le \rho^{2k} \Big( \sum_{j=1}^d \frac{1}{1 + \lambda/\mu_j} \Big)^k. \qquad (57)$$
The sub-multiplicativity of matrix norms implies $\big|\big|\big| (Q^{-1} \pi_i \pi_i^T Q^{-1})^2 \big|\big|\big| \le \big|\big|\big| Q^{-1} \pi_i \pi_i^T Q^{-1} \big|\big|\big|^2$, and consequently we have
$$\mathbb{E}\Big[ \big|\big|\big| (Q^{-1} \pi_i \pi_i^T Q^{-1})^2 \big|\big|\big| \Big] \le \mathbb{E}\Big[ \big( \pi_i^T (I + \lambda M^{-1})^{-1} \pi_i \big)^2 \Big] \le \rho^4 \Big( \sum_{j=1}^d \frac{1}{1 + \lambda/\mu_j} \Big)^2,$$
where the final step follows from inequality (57). Combined with the first term on the right-hand side of Lemma 12, we have thus obtained the first term on the right-hand side of expression (54).

We now turn to the second term in expression (54). For real $k \ge 1$, we have
$$\mathbb{E}\big[ \max_i ||| \varepsilon_i A_i / n |||^k \big] = \frac{1}{n^k} \mathbb{E}\big[ \max_i ||| A_i |||^k \big] \le \frac{1}{n^k} \sum_{i=1}^n \mathbb{E}\big[ ||| A_i |||^k \big].$$
Since norms are sub-additive, we find that
$$||| A_i |||^k \le 2^{k-1} \Big( \sum_{j=1}^d \frac{\phi_j(x_i)^2}{1 + \lambda/\mu_j} \Big)^k + 2^{k-1} ||| Q^{-2} |||^k = 2^{k-1} \Big( \sum_{j=1}^d \frac{\phi_j(x_i)^2}{1 + \lambda/\mu_j} \Big)^k + 2^{k-1} \Big( \frac{1}{1 + \lambda/\mu_1} \Big)^k.$$
Since $\rho \ge 1$ (recall that the $\phi_j$ are an orthonormal basis), we apply inequality (57) to find that
$$\mathbb{E}\big[ \max_i ||| \varepsilon_i A_i / n |||^k \big] \le \frac{1}{n^{k-1}} \Bigg( 2^{k-1} \rho^{2k} \Big( \sum_{j=1}^d \frac{1}{1 + \lambda/\mu_j} \Big)^k + 2^{k-1} \Big( \frac{1}{1 + \lambda/\mu_1} \Big)^k \rho^{2k} \Bigg).$$
Taking $k$th roots yields the second term in the expression (54).

Appendix B. Proof of Lemma 7


This proof follows an outline similar to that of Lemma 6. We begin with a simple bound on $\|\Delta\|_{\mathcal{H}}$:

Lemma 13 Under Assumption B, we have $\mathbb{E}\big[ \|\Delta\|_{\mathcal{H}}^2 \mid X \big] \le 2\sigma^2/\lambda + 4 \|f^*\|_{\mathcal{H}}^2$.


Proof We have
$$\lambda\, \mathbb{E}\big[ \|\hat{f}\|_{\mathcal{H}}^2 \mid \{x_i\}_{i=1}^n \big] \le \mathbb{E}\Big[ \frac{1}{n} \sum_{i=1}^n (\hat{f}(x_i) - f^*(x_i) - \varepsilon_i)^2 + \lambda \|\hat{f}\|_{\mathcal{H}}^2 \,\Big|\, \{x_i\}_{i=1}^n \Big] \overset{(i)}{\le} \frac{1}{n} \sum_{i=1}^n \mathbb{E}[\varepsilon_i^2 \mid x_i] + \lambda \|f^*\|_{\mathcal{H}}^2 \overset{(ii)}{\le} \sigma^2 + \lambda \|f^*\|_{\mathcal{H}}^2,$$
where inequality (i) follows since $\hat{f}$ minimizes the objective function (2), and inequality (ii) uses the fact that $\mathbb{E}[\varepsilon_i^2 \mid x_i] \le \sigma^2$. Applying the triangle inequality to $\|\Delta\|_{\mathcal{H}}$ along with the elementary inequality $(a + b)^2 \le 2a^2 + 2b^2$, we find that
$$\mathbb{E}\big[ \|\Delta\|_{\mathcal{H}}^2 \mid \{x_i\}_{i=1}^n \big] \le 2 \|f^*\|_{\mathcal{H}}^2 + 2\, \mathbb{E}\big[ \|\hat{f}\|_{\mathcal{H}}^2 \mid \{x_i\}_{i=1}^n \big] \le \frac{2 \sigma^2}{\lambda} + 4 \|f^*\|_{\mathcal{H}}^2,$$
which completes the proof.

With Lemma 13 in place, we now proceed to the proof of the theorem proper. Recall from Lemma 6 the optimality condition
$$\frac{1}{n} \sum_{i=1}^n \xi_{x_i} \big( \langle \xi_{x_i}, \hat{f} - f^* \rangle - \varepsilon_i \big) + \lambda \hat{f} = 0. \qquad (58)$$
Now, let $\delta \in \ell^2(\mathbb{N})$ be the expansion of the error $\Delta$ in the basis $\{\phi_j\}$, so that $\Delta = \sum_{j=1}^\infty \delta_j \phi_j$, and (again, as in Lemma 6) we choose $d \in \mathbb{N}$ and truncate $\Delta$ via
$$\Delta^\downarrow := \sum_{j=1}^d \delta_j \phi_j \quad \text{and} \quad \Delta^\uparrow := \Delta - \Delta^\downarrow = \sum_{j > d} \delta_j \phi_j.$$
Let $\delta^\downarrow \in \mathbb{R}^d$ and $\delta^\uparrow$ denote the corresponding vectors for the above. As a consequence of the orthonormality of the basis functions, we have
$$\mathbb{E}\big[ \|\Delta\|_2^2 \big] = \mathbb{E}\big[ \|\Delta^\downarrow\|_2^2 \big] + \mathbb{E}\big[ \|\Delta^\uparrow\|_2^2 \big] = \mathbb{E}\big[ \|\delta^\downarrow\|_2^2 \big] + \mathbb{E}\big[ \|\delta^\uparrow\|_2^2 \big]. \qquad (59)$$
We bound each of the terms (59) in turn.

By Lemma 13, the second term is upper bounded as
$$\mathbb{E}\big[ \|\Delta^\uparrow\|_2^2 \big] = \sum_{j > d} \mathbb{E}[\delta_j^2] \le \mu_{d+1} \sum_{j > d} \mathbb{E}\Big[ \frac{\delta_j^2}{\mu_j} \Big] = \mu_{d+1}\, \mathbb{E}\big[ \|\Delta^\uparrow\|_{\mathcal{H}}^2 \big] \le \mu_{d+1} \Big( \frac{2 \sigma^2}{\lambda} + 4 \|f^*\|_{\mathcal{H}}^2 \Big). \qquad (60)$$
The remainder of the proof is devoted to bounding the term $\mathbb{E}[\|\Delta^\downarrow\|_2^2]$ in the decomposition (59). By taking the Hilbert inner product of $\phi_k$ with the optimality condition (58), we find, as in our derivation of the matrix equation (45), that for each $k \in \{1, \ldots, d\}$
$$\frac{1}{n} \sum_{i=1}^n \sum_{j=1}^d \phi_k(x_i) \phi_j(x_i) \delta_j + \frac{1}{n} \sum_{i=1}^n \phi_k(x_i) \big( \Delta^\uparrow(x_i) - \varepsilon_i \big) + \lambda \frac{\delta_k}{\mu_k} = 0.$$

Given the expansion $f^* = \sum_{j=1}^\infty \theta_j \phi_j$, define the tail error vector $v \in \mathbb{R}^n$ by $v_i = \sum_{j > d} \delta_j \phi_j(x_i)$, and recall the definition of the eigenvalue matrix $M = \mathrm{diag}(\mu_1, \ldots, \mu_d) \in \mathbb{R}^{d \times d}$. Given the matrix $\Phi$ defined by its coordinates $\Phi_{ij} = \phi_j(x_i)$, we have
$$\Big( \frac{1}{n} \Phi^T \Phi + \lambda M^{-1} \Big) \delta^\downarrow = -\lambda M^{-1} \theta^\downarrow - \frac{1}{n} \Phi^T v + \frac{1}{n} \Phi^T \varepsilon. \qquad (61)$$
As in the proof of Lemma 6, we find that
$$\Big( I + Q^{-1} \Big( \frac{1}{n} \Phi^T \Phi - I \Big) Q^{-1} \Big) Q \delta^\downarrow = -\lambda Q^{-1} M^{-1} \theta^\downarrow - \frac{1}{n} Q^{-1} \Phi^T v + \frac{1}{n} Q^{-1} \Phi^T \varepsilon, \qquad (62)$$
where we recall that $Q = (I + \lambda M^{-1})^{1/2}$.


We now recall the bounds (47a) and (48) from Lemma 10, as well as the previously defined event $\mathcal{E} := \big\{ \big|\big|\big| Q^{-1} \big( \tfrac{1}{n} \Phi^T \Phi - I \big) Q^{-1} \big|\big|\big| \le 1/2 \big\}$. When $\mathcal{E}$ occurs, the expression (62) implies the inequality
$$\|\Delta^\downarrow\|_2^2 \le \|Q \delta^\downarrow\|_2^2 \le 4 \Big\| -\lambda Q^{-1} M^{-1} \theta^\downarrow - (1/n) Q^{-1} \Phi^T v + (1/n) Q^{-1} \Phi^T \varepsilon \Big\|_2^2.$$
When $\mathcal{E}$ fails to hold, Lemma 13 may still be applied since $\mathcal{E}$ is measurable with respect to $\{x_i\}_{i=1}^n$. Doing so yields
$$\mathbb{E}\big[ \|\Delta^\downarrow\|_2^2 \big] = \mathbb{E}\big[ \mathbf{1}(\mathcal{E}) \|\Delta^\downarrow\|_2^2 \big] + \mathbb{E}\big[ \mathbf{1}(\mathcal{E}^c) \|\Delta^\downarrow\|_2^2 \big] \le 4\, \mathbb{E}\Big[ \Big\| -\lambda Q^{-1} M^{-1} \theta^\downarrow - (1/n) Q^{-1} \Phi^T v + (1/n) Q^{-1} \Phi^T \varepsilon \Big\|_2^2 \Big] + \mathbb{E}\Big[ \mathbf{1}(\mathcal{E}^c)\, \mathbb{E}\big[ \|\Delta^\downarrow\|_2^2 \mid \{x_i\}_{i=1}^n \big] \Big] \le 4\, \mathbb{E}\Big[ \Big\| \lambda Q^{-1} M^{-1} \theta^\downarrow + \frac{1}{n} Q^{-1} \Phi^T v - \frac{1}{n} Q^{-1} \Phi^T \varepsilon \Big\|_2^2 \Big] + \mathbb{P}(\mathcal{E}^c) \Big( \frac{2 \sigma^2}{\lambda} + 4 \|f^*\|_{\mathcal{H}}^2 \Big). \qquad (63)$$
Since the bound (48) still holds, it remains to provide a bound on the first term in the expression (63). As in the proof of Lemma 6, we have $\|\lambda Q^{-1} M^{-1} \theta^\downarrow\|_2^2 \le \lambda \|f^*\|_{\mathcal{H}}^2$ via the bound (47a). Turning to the second term inside the norm, we claim that, under the conditions of Lemma 7, the following bound holds:
$$\mathbb{E}\Big[ \big\| (1/n) Q^{-1} \Phi^T v \big\|_2^2 \Big] \le \frac{\rho^4 \operatorname{tr}(K) \beta_d \big( 2 \sigma^2/\lambda + 4 \|f^*\|_{\mathcal{H}}^2 \big)}{\lambda}. \qquad (64)$$
This claim is an analogue of our earlier bound (47b), and we prove it shortly. Lastly, we bound the norm of $Q^{-1} \Phi^T \varepsilon / n$. Noting that the diagonal entries of $Q^{-1}$ are $1/\sqrt{1 + \lambda/\mu_j}$, we have
$$\mathbb{E}\Big[ \big\| Q^{-1} \Phi^T \varepsilon \big\|_2^2 \Big] = \sum_{j=1}^d \frac{1}{1 + \lambda/\mu_j} \sum_{i=1}^n \mathbb{E}\big[ \phi_j^2(X_i) \varepsilon_i^2 \big].$$
Since $\mathbb{E}[\phi_j^2(X_i) \varepsilon_i^2] = \mathbb{E}[\phi_j^2(X_i) \mathbb{E}[\varepsilon_i^2 \mid X_i]] \le \sigma^2$ by assumption, we have the inequality
$$\mathbb{E}\Big[ \big\| (1/n) Q^{-1} \Phi^T \varepsilon \big\|_2^2 \Big] \le \frac{\sigma^2}{n} \sum_{j=1}^d \frac{1}{1 + \lambda/\mu_j}.$$
The last sum is bounded by $(\sigma^2/n) \gamma(\lambda)$. Applying the inequality $(a + b + c)^2 \le 3a^2 + 3b^2 + 3c^2$ to inequality (63), we obtain
$$\mathbb{E}\big[ \|\Delta^\downarrow\|_2^2 \big] \le 12 \lambda \|f^*\|_{\mathcal{H}}^2 + \frac{12 \sigma^2 \gamma(\lambda)}{n} + \Big( \frac{12 \rho^4 \operatorname{tr}(K) \beta_d}{\lambda} + \mathbb{P}(\mathcal{E}^c) \Big) \Big( \frac{2 \sigma^2}{\lambda} + 4 \|f^*\|_{\mathcal{H}}^2 \Big).$$
Applying the bound (48) to control $\mathbb{P}(\mathcal{E}^c)$ and bounding $\mathbb{E}[\|\Delta^\uparrow\|_2^2]$ using inequality (60) completes the proof of the lemma.

It remains to prove bound (64). Recalling the inequality (51), we see that
$$\big\| (1/n) Q^{-1} \Phi^T v \big\|_2^2 \le \big|\big|\big| Q^{-1} M^{-1/2} \big|\big|\big|^2\, \big\| (1/n) M^{1/2} \Phi^T v \big\|_2^2 \le \frac{1}{\lambda} \big\| (1/n) M^{1/2} \Phi^T v \big\|_2^2. \qquad (65)$$
Let $\Phi_\ell$ denote the $\ell$th column of the matrix $\Phi$. Taking expectations yields
$$\mathbb{E}\Big[ \big\| M^{1/2} \Phi^T v \big\|_2^2 \Big] = \sum_{\ell=1}^d \mu_\ell\, \mathbb{E}\big[ \langle \Phi_\ell, v \rangle^2 \big] \le \sum_{\ell=1}^d \mu_\ell\, \mathbb{E}\big[ \|\Phi_\ell\|_2^2 \|v\|_2^2 \big] = \sum_{\ell=1}^d \mu_\ell\, \mathbb{E}\Big[ \|\Phi_\ell\|_2^2\, \mathbb{E}\big[ \|v\|_2^2 \mid X \big] \Big].$$
Now consider the inner expectation. Applying the Cauchy-Schwarz inequality as in the proof of the bound (47b), we have
$$\|v\|_2^2 = \sum_{i=1}^n v_i^2 \le \sum_{i=1}^n \Big( \sum_{j > d} \frac{\delta_j^2}{\mu_j} \Big) \Big( \sum_{j > d} \mu_j \phi_j^2(X_i) \Big).$$
Notably, the second term is $X$-measurable, and the first is bounded by $\|\Delta^\uparrow\|_{\mathcal{H}}^2 \le \|\Delta\|_{\mathcal{H}}^2$. We thus obtain
$$\mathbb{E}\Big[ \big\| M^{1/2} \Phi^T v \big\|_2^2 \Big] \le \sum_{i=1}^n \sum_{\ell=1}^d \mu_\ell\, \mathbb{E}\Big[ \|\Phi_\ell\|_2^2 \Big( \sum_{j > d} \mu_j \phi_j^2(X_i) \Big) \mathbb{E}\big[ \|\Delta\|_{\mathcal{H}}^2 \mid X \big] \Big]. \qquad (66)$$
Lemma 13 provides the bound $2 \sigma^2/\lambda + 4 \|f^*\|_{\mathcal{H}}^2$ on the final (inner) expectation.

The remainder of the argument proceeds precisely as in the bound (47b). We have
$$\mathbb{E}\big[ \|\Phi_\ell\|_2^2 \phi_j(X_i)^2 \big] \le n \rho^4$$
by the moment assumptions on $\phi_j$, and thus
$$\mathbb{E}\Big[ \big\| M^{1/2} \Phi^T v \big\|_2^2 \Big] \le \sum_{\ell=1}^d \mu_\ell \sum_{j > d} \mu_j\, n^2 \rho^4 \Big( \frac{2 \sigma^2}{\lambda} + 4 \|f^*\|_{\mathcal{H}}^2 \Big) \le n^2 \rho^4 \beta_d \operatorname{tr}(K) \Big( \frac{2 \sigma^2}{\lambda} + 4 \|f^*\|_{\mathcal{H}}^2 \Big).$$
Dividing by $\lambda n^2$ completes the proof.


Appendix C. Proof of Lemma 8


As before, we let $\{x_i\}_{i=1}^n := \{x_1, \ldots, x_n\}$ denote the collection of design points. We begin with some useful bounds on $\|f_{\bar\lambda}^*\|_{\mathcal{H}}$ and $\|\Delta\|_{\mathcal{H}}$.

Lemma 14 Under Assumptions A and B$'$, we have
$$\mathbb{E}\Big[ \big( \mathbb{E}\big[ \|\Delta\|_{\mathcal{H}}^2 \mid \{x_i\}_{i=1}^n \big] \big)^2 \Big] \le B_{\lambda, \bar\lambda}^4 \quad \text{and} \quad \mathbb{E}\big[ \|\Delta\|_{\mathcal{H}}^2 \big] \le B_{\lambda, \bar\lambda}^2, \qquad (67)$$
where
$$B_{\lambda, \bar\lambda} := \Big( 32 \|f_{\bar\lambda}^*\|_{\mathcal{H}}^4 + \frac{8 \tau_{\bar\lambda}^4}{\lambda^2} \Big)^{1/4}. \qquad (68)$$
See Section C.1 for the proof of this claim.


This proof follows an outline similar to that of Lemma 7. As usual, we let $\delta \in \ell^2(\mathbb{N})$ be the expansion of the error $\Delta$ in the basis $\{\phi_j\}$, so that $\Delta = \sum_{j=1}^\infty \delta_j \phi_j$, and we choose $d \in \mathbb{N}$ and define the truncated vectors $\Delta^\downarrow := \sum_{j=1}^d \delta_j \phi_j$ and $\Delta^\uparrow := \Delta - \Delta^\downarrow = \sum_{j > d} \delta_j \phi_j$. As usual, we have the decomposition $\mathbb{E}[\|\Delta\|_2^2] = \mathbb{E}[\|\delta^\downarrow\|_2^2] + \mathbb{E}[\|\delta^\uparrow\|_2^2]$. Recall the definition (68) of the constant $B_{\lambda, \bar\lambda} = (32 \|f_{\bar\lambda}^*\|_{\mathcal{H}}^4 + 8 \tau_{\bar\lambda}^4/\lambda^2)^{1/4}$. As in our deduction of inequalities (60), Lemma 14 implies that $\mathbb{E}[\|\Delta^\uparrow\|_2^2] \le \mu_{d+1} \mathbb{E}[\|\Delta^\uparrow\|_{\mathcal{H}}^2] \le \mu_{d+1} B_{\lambda, \bar\lambda}^2$.

The remainder of the proof is devoted to bounding $\mathbb{E}[\|\delta^\downarrow\|_2^2]$. We use identical notation to that in our proof of Lemma 7, which we recap for reference (see also Table 2). We define the tail error vector $v \in \mathbb{R}^n$ by $v_i = \sum_{j > d} \delta_j \phi_j(x_i)$, $i \in [n]$, and recall the definitions of the eigenvalue matrix $M = \mathrm{diag}(\mu_1, \ldots, \mu_d) \in \mathbb{R}^{d \times d}$ and the basis matrix $\Phi$ with $\Phi_{ij} = \phi_j(x_i)$. We use $Q = (I + \lambda M^{-1})^{1/2}$ for shorthand, and we let $\mathcal{E}$ be the event that $\big|\big|\big| Q^{-1} \big( (1/n) \Phi^T \Phi - I \big) Q^{-1} \big|\big|\big| \le 1/2$.

Writing $f_{\bar\lambda}^* = \sum_{j=1}^\infty \theta_j \phi_j$, we define the alternate noise vector $\varepsilon_i' = Y_i - f_{\bar\lambda}^*(x_i)$. Using this notation, mirroring the proof of Lemma 7 yields
$$\mathbb{E}\big[ \|\Delta^\downarrow\|_2^2 \big] \le \mathbb{E}\big[ \|Q \delta^\downarrow\|_2^2 \big] \le 4\, \mathbb{E}\Big[ \Big\| \lambda Q^{-1} M^{-1} \theta^\downarrow + \frac{1}{n} Q^{-1} \Phi^T v - \frac{1}{n} Q^{-1} \Phi^T \varepsilon' \Big\|_2^2 \Big] + \mathbb{P}(\mathcal{E}^c) B_{\lambda, \bar\lambda}^2, \qquad (69)$$
which is an analogue of equation (63). The bound (48) controls the probability $\mathbb{P}(\mathcal{E}^c)$, so it remains to control the first term in the expression (69). We first rewrite the expression within the norm as
$$(\lambda - \bar\lambda) Q^{-1} M^{-1} \theta^\downarrow + \frac{1}{n} Q^{-1} \Phi^T v - \Big( \frac{1}{n} Q^{-1} \Phi^T \varepsilon' - \bar\lambda Q^{-1} M^{-1} \theta^\downarrow \Big).$$
The following lemma provides bounds on the first two terms:

Lemma 15 The following bounds hold:
$$\big\| (\bar\lambda - \lambda) Q^{-1} M^{-1} \theta^\downarrow \big\|_2^2 \le \frac{(\bar\lambda - \lambda)^2 \|f_{\bar\lambda}^*\|_{\mathcal{H}}^2}{\lambda}, \qquad (70a)$$
$$\mathbb{E}\Big[ \Big\| \frac{1}{n} Q^{-1} \Phi^T v \Big\|_2^2 \Big] \le \frac{\rho^4 B_{\lambda, \bar\lambda}^2 \operatorname{tr}(K) \beta_d}{\lambda}. \qquad (70b)$$


For the third term, we make the following claim.

Lemma 16 Under Assumptions A and B$'$, we have
$$\mathbb{E}\Big[ \Big\| \frac{1}{n} Q^{-1} \Phi^T \varepsilon' - \bar\lambda Q^{-1} M^{-1} \theta^\downarrow \Big\|_2^2 \Big] \le \frac{\gamma(\lambda) \rho^2 \tau_{\bar\lambda}^2}{n}. \qquad (71)$$

Deferring the proofs of the two lemmas to Sections C.2 and C.3, we apply the inequality $(a + b + c)^2 \le 4a^2 + 4b^2 + 2c^2$ to inequality (69), and we have
$$\mathbb{E}\big[ \|\Delta^\downarrow\|_2^2 \big] - \mathbb{P}(\mathcal{E}^c) B_{\lambda, \bar\lambda}^2 \le \mathbb{E}\big[ \|Q \delta^\downarrow\|_2^2 \big] - \mathbb{P}(\mathcal{E}^c) B_{\lambda, \bar\lambda}^2 \le 16\, \mathbb{E}\Big[ \big\| (\lambda - \bar\lambda) Q^{-1} M^{-1} \theta^\downarrow \big\|_2^2 \Big] + \frac{16}{n^2} \mathbb{E}\Big[ \big\| Q^{-1} \Phi^T v \big\|_2^2 \Big] + 8\, \mathbb{E}\Big[ \Big\| \frac{1}{n} Q^{-1} \Phi^T \varepsilon' - \bar\lambda Q^{-1} M^{-1} \theta^\downarrow \Big\|_2^2 \Big] \le \frac{16 (\bar\lambda - \lambda)^2 \|f_{\bar\lambda}^*\|_{\mathcal{H}}^2}{\lambda} + \frac{16 \rho^4 B_{\lambda, \bar\lambda}^2 \operatorname{tr}(K) \beta_d}{\lambda} + \frac{8 \gamma(\lambda) \rho^2 \tau_{\bar\lambda}^2}{n}, \qquad (72)$$
where we have applied the bounds (70a) and (70b) from Lemma 15 and the bound (71) from Lemma 16. Applying the bound (48) to control $\mathbb{P}(\mathcal{E}^c)$ and recalling that $\mathbb{E}[\|\Delta^\uparrow\|_2^2] \le \mu_{d+1} B_{\lambda, \bar\lambda}^2$ completes the proof.

C.1 Proof of Lemma 14


Recall that $\hat{f}$ minimizes the empirical objective. Consequently,
$$\lambda\, \mathbb{E}\big[ \|\hat{f}\|_{\mathcal{H}}^2 \mid \{x_i\}_{i=1}^n \big] \le \mathbb{E}\Big[ \frac{1}{n} \sum_{i=1}^n (\hat{f}(x_i) - Y_i)^2 + \lambda \|\hat{f}\|_{\mathcal{H}}^2 \,\Big|\, \{x_i\}_{i=1}^n \Big] \le \frac{1}{n} \sum_{i=1}^n \mathbb{E}\big[ (f_{\bar\lambda}^*(x_i) - Y_i)^2 \mid x_i \big] + \lambda \|f_{\bar\lambda}^*\|_{\mathcal{H}}^2 = \frac{1}{n} \sum_{i=1}^n \sigma_{\bar\lambda}^2(x_i) + \lambda \|f_{\bar\lambda}^*\|_{\mathcal{H}}^2.$$
The triangle inequality immediately gives us the upper bound
$$\mathbb{E}\big[ \|\Delta\|_{\mathcal{H}}^2 \mid \{x_i\}_{i=1}^n \big] \le 2 \|f_{\bar\lambda}^*\|_{\mathcal{H}}^2 + 2\, \mathbb{E}\big[ \|\hat{f}\|_{\mathcal{H}}^2 \mid \{x_i\}_{i=1}^n \big] \le \frac{2}{\lambda n} \sum_{i=1}^n \sigma_{\bar\lambda}^2(x_i) + 4 \|f_{\bar\lambda}^*\|_{\mathcal{H}}^2.$$
Since $(a + b)^2 \le 2a^2 + 2b^2$, convexity yields
$$\mathbb{E}\Big[ \big( \mathbb{E}\big[ \|\Delta\|_{\mathcal{H}}^2 \mid \{x_i\}_{i=1}^n \big] \big)^2 \Big] \le \mathbb{E}\Big[ \Big( \frac{2}{\lambda n} \sum_{i=1}^n \sigma_{\bar\lambda}^2(X_i) + 4 \|f_{\bar\lambda}^*\|_{\mathcal{H}}^2 \Big)^2 \Big] \le \frac{8}{\lambda^2 n} \sum_{i=1}^n \mathbb{E}\big[ \sigma_{\bar\lambda}^4(X_i) \big] + 32 \|f_{\bar\lambda}^*\|_{\mathcal{H}}^4 = 32 \|f_{\bar\lambda}^*\|_{\mathcal{H}}^4 + \frac{8 \tau_{\bar\lambda}^4}{\lambda^2}.$$
This completes the proof of the first of the inequalities (67). The second of the inequalities (67) follows from the first by Jensen's inequality.


C.2 Proof of Lemma 15


Our previous bound (47a) immediately implies inequality (70a). To prove the second upper bound, we follow the proof of the bound (64). From inequalities (65) and (66), we obtain
$$\mathbb{E}\Big[ \big\| (1/n) Q^{-1} \Phi^T v \big\|_2^2 \Big] \le \frac{1}{\lambda n^2} \sum_{i=1}^n \sum_{\ell=1}^d \sum_{j > d} \mu_\ell \mu_j\, \mathbb{E}\Big[ \|\Phi_\ell\|_2^2 \phi_j^2(X_i)\, \mathbb{E}\big[ \|\Delta\|_{\mathcal{H}}^2 \mid \{X_i\}_{i=1}^n \big] \Big]. \qquad (73)$$
Applying Hölder's inequality yields
$$\mathbb{E}\Big[ \|\Phi_\ell\|_2^2 \phi_j^2(X_i)\, \mathbb{E}\big[ \|\Delta\|_{\mathcal{H}}^2 \mid \{X_i\}_{i=1}^n \big] \Big] \le \sqrt{\mathbb{E}\big[ \|\Phi_\ell\|_2^4 \phi_j^4(X_i) \big]} \sqrt{\mathbb{E}\Big[ \big( \mathbb{E}\big[ \|\Delta\|_{\mathcal{H}}^2 \mid \{X_i\}_{i=1}^n \big] \big)^2 \Big]}.$$
Note that Lemma 14 provides the bound $B_{\lambda, \bar\lambda}^4$ on the final expectation. By definition of $\Phi_\ell$, we find that
$$\mathbb{E}\big[ \|\Phi_\ell\|_2^4 \phi_j^4(X_i) \big] = \mathbb{E}\Big[ \Big( \sum_{k=1}^n \phi_\ell^2(X_k) \Big)^2 \phi_j^4(X_i) \Big] \le n^2\, \mathbb{E}\Big[ \frac{1}{2} \phi_\ell^8(X_1) + \frac{1}{2} \phi_j^8(X_1) \Big] \le n^2 \rho^8,$$
where we have used Assumption A with moment $2k \ge 8$, or equivalently $k \ge 4$. Thus
$$\mathbb{E}\Big[ \|\Phi_\ell\|_2^2 \phi_j^2(X_i)\, \mathbb{E}\big[ \|\Delta\|_{\mathcal{H}}^2 \mid \{X_i\}_{i=1}^n \big] \Big] \le n \rho^4 B_{\lambda, \bar\lambda}^2. \qquad (74)$$
Combining inequalities (73) and (74) yields the bound (70b).

C.3 Proof of Lemma 16


Using the fact that $Q$ and $M$ are diagonal, we have
$$\mathbb{E}\Big[ \Big\| \frac{1}{n} Q^{-1} \Phi^T \varepsilon' - \bar\lambda Q^{-1} M^{-1} \theta^\downarrow \Big\|_2^2 \Big] = \sum_{j=1}^d Q_{jj}^{-2}\, \mathbb{E}\Big[ \Big( \frac{1}{n} \sum_{i=1}^n \phi_j(X_i) \varepsilon_i' - \frac{\bar\lambda \theta_j}{\mu_j} \Big)^2 \Big]. \qquad (75)$$
Fréchet differentiability and the fact that $f_{\bar\lambda}^*$ is the global minimizer of the regularized regression problem imply that
$$-\mathbb{E}\big[ \xi_{X_i} \varepsilon_i' \big] + \bar\lambda f_{\bar\lambda}^* = \mathbb{E}\big[ \xi_X \big( \langle \xi_X, f_{\bar\lambda}^* \rangle - Y \big) \big] + \bar\lambda f_{\bar\lambda}^* = 0.$$
Taking the (Hilbert) inner product of the preceding display with the basis function $\phi_j$, we get
$$\mathbb{E}\Big[ \phi_j(X_i) \varepsilon_i' - \frac{\bar\lambda \theta_j}{\mu_j} \Big] = 0. \qquad (76)$$
Combining the equalities (75) and (76) and using the i.i.d. nature of $\{x_i\}_{i=1}^n$ leads to
$$\mathbb{E}\Big[ \Big\| \frac{1}{n} Q^{-1} \Phi^T \varepsilon' - \bar\lambda Q^{-1} M^{-1} \theta^\downarrow \Big\|_2^2 \Big] = \sum_{j=1}^d Q_{jj}^{-2}\, \mathrm{var}\Big( \frac{1}{n} \sum_{i=1}^n \phi_j(X_i) \varepsilon_i' - \frac{\bar\lambda \theta_j}{\mu_j} \Big) = \frac{1}{n} \sum_{j=1}^d Q_{jj}^{-2}\, \mathrm{var}\big( \phi_j(X_1) \varepsilon_1' \big). \qquad (77)$$
Using the elementary inequality $\mathrm{var}(Z) \le \mathbb{E}[Z^2]$ for any random variable $Z$, we have from Hölder's inequality that
$$\mathrm{var}\big( \phi_j(X_1) \varepsilon_1' \big) \le \mathbb{E}\big[ \phi_j(X_1)^2 (\varepsilon_1')^2 \big] \le \sqrt{\mathbb{E}\big[ \phi_j(X_1)^4 \big] \mathbb{E}\big[ \sigma_{\bar\lambda}^4(X_1) \big]} \le \sqrt{\rho^4 \tau_{\bar\lambda}^4},$$
where we used Assumption B$'$ to upper bound the fourth moment $\mathbb{E}[\sigma_{\bar\lambda}^4(X_1)]$. Using the fact that $Q_{jj}^{-1} \le 1$, we obtain the following upper bound on the quantity (77):
$$\frac{1}{n} \sum_{j=1}^d Q_{jj}^{-2}\, \mathrm{var}\big( \phi_j(X_1) \varepsilon_1' \big) = \frac{1}{n} \sum_{j=1}^d \frac{\mathrm{var}\big( \phi_j(X_1) \varepsilon_1' \big)}{1 + \lambda/\mu_j} \le \frac{\gamma(\lambda) \rho^2 \tau_{\bar\lambda}^2}{n},$$
which establishes the claim.

Appendix D. Proof of Lemma 9


At a high level, the proof is similar to that of Lemma 6, but we take care since the errors $f_{\bar\lambda}^*(x) - y$ are not conditionally mean-zero (or of conditionally bounded variance). Recalling our notation of $\xi_x$ as the RKHS evaluator for $x$, we have by assumption that $\hat{f}$ minimizes the empirical objective (39). As in our derivation of equality (40), the Fréchet differentiability of this objective implies the first-order optimality condition
$$\frac{1}{n} \sum_{i=1}^n \xi_{x_i} \langle \xi_{x_i}, \Delta \rangle + \frac{1}{n} \sum_{i=1}^n \xi_{x_i} \big( \langle \xi_{x_i}, f_{\bar\lambda}^* \rangle - y_i \big) + \lambda \Delta + \lambda f_{\bar\lambda}^* = 0, \qquad (78)$$
where $\Delta := \hat{f} - f_{\bar\lambda}^*$. In addition, the optimality of $f_{\bar\lambda}^*$ implies that $\mathbb{E}\big[ \xi_{x_i} \big( \langle \xi_{x_i}, f_{\bar\lambda}^* \rangle - y_i \big) \big] + \bar\lambda f_{\bar\lambda}^* = 0$. Using this in equality (78), we take expectations with respect to $\{x_i, y_i\}$ to obtain
$$\mathbb{E}\Big[ \frac{1}{n} \sum_{i=1}^n \xi_{X_i} \langle \xi_{X_i}, \Delta \rangle + \lambda \Delta \Big] + (\lambda - \bar\lambda) f_{\bar\lambda}^* = 0.$$
Recalling the definition of the sample covariance operator $\hat{\Sigma} := \frac{1}{n} \sum_{i=1}^n \xi_{x_i} \otimes \xi_{x_i}$, we arrive at
$$\mathbb{E}\big[ (\hat{\Sigma} + \lambda I) \Delta \big] = (\bar\lambda - \lambda) f_{\bar\lambda}^*, \qquad (79)$$
which is the analogue of our earlier equality (41).

We now proceed via a truncation argument similar to that used in our proofs of Lemmas 6 and 7. Let $\delta \in \ell^2(\mathbb{N})$ be the expansion of the error $\Delta$ in the basis $\{\phi_j\}$, so that $\Delta = \sum_{j=1}^\infty \delta_j \phi_j$. For a fixed (arbitrary) $d \in \mathbb{N}$, define
$$\Delta^\downarrow := \sum_{j=1}^d \delta_j \phi_j \quad \text{and} \quad \Delta^\uparrow := \Delta - \Delta^\downarrow = \sum_{j > d} \delta_j \phi_j,$$
and note that $\|\mathbb{E}[\Delta]\|_2^2 = \|\mathbb{E}[\Delta^\downarrow]\|_2^2 + \|\mathbb{E}[\Delta^\uparrow]\|_2^2$. By Lemma 14, the second term is controlled by
$$\|\mathbb{E}[\Delta^\uparrow]\|_2^2 \le \mathbb{E}\big[ \|\Delta^\uparrow\|_2^2 \big] = \sum_{j > d} \mathbb{E}[\delta_j^2] \le \mu_{d+1} \sum_{j > d} \mathbb{E}\Big[ \frac{\delta_j^2}{\mu_j} \Big] = \mu_{d+1}\, \mathbb{E}\big[ \|\Delta^\uparrow\|_{\mathcal{H}}^2 \big] \le \mu_{d+1} B_{\lambda, \bar\lambda}^2. \qquad (80)$$


The remainder of the proof is devoted to bounding $\|\mathbb{E}[\Delta^\downarrow]\|_2^2$. Let $f_{\bar\lambda}^*$ have the expansion $(\theta_1, \theta_2, \ldots)$ in the basis $\{\phi_j\}$. Recall (as in Lemmas 6 and 7) the definition of the matrix $\Phi \in \mathbb{R}^{n \times d}$ by its coordinates $\Phi_{ij} = \phi_j(x_i)$, the diagonal matrix $M = \mathrm{diag}(\mu_1, \ldots, \mu_d) \succ 0 \in \mathbb{R}^{d \times d}$, and the tail error vector $v \in \mathbb{R}^n$ with $v_i = \sum_{j > d} \delta_j \phi_j(x_i) = \Delta^\uparrow(x_i)$. Proceeding precisely as in the derivations of equalities (45) and (61), we have the following equality:
$$\mathbb{E}\Big[ \Big( \frac{1}{n} \Phi^T \Phi + \lambda M^{-1} \Big) \delta^\downarrow \Big] = (\bar\lambda - \lambda) M^{-1} \theta^\downarrow - \mathbb{E}\Big[ \frac{1}{n} \Phi^T v \Big]. \qquad (81)$$
Recalling the definition of the shorthand matrix $Q = (I + \lambda M^{-1})^{1/2}$, with some algebra we have
$$Q^{-1} \Big( \frac{1}{n} \Phi^T \Phi + \lambda M^{-1} \Big) = Q + Q^{-1} \Big( \frac{1}{n} \Phi^T \Phi - I \Big),$$
so we can expand expression (81) as
$$\mathbb{E}\Big[ Q \delta^\downarrow + Q^{-1} \Big( \frac{1}{n} \Phi^T \Phi - I \Big) \delta^\downarrow \Big] = \mathbb{E}\Big[ Q^{-1} \Big( \frac{1}{n} \Phi^T \Phi + \lambda M^{-1} \Big) \delta^\downarrow \Big] = (\bar\lambda - \lambda) Q^{-1} M^{-1} \theta^\downarrow - \mathbb{E}\Big[ \frac{1}{n} Q^{-1} \Phi^T v \Big],$$
or, rewriting,
$$\mathbb{E}\big[ Q \delta^\downarrow \big] = (\bar\lambda - \lambda) Q^{-1} M^{-1} \theta^\downarrow - \mathbb{E}\Big[ \frac{1}{n} Q^{-1} \Phi^T v \Big] - \mathbb{E}\Big[ Q^{-1} \Big( \frac{1}{n} \Phi^T \Phi - I \Big) \delta^\downarrow \Big]. \qquad (82)$$
Lemma 15 provides bounds on the first two terms on the right-hand side of equation (82). The following lemma provides upper bounds on the third term:

Lemma 17 There exists a universal constant $C$ such that
$$\Big\| \mathbb{E}\Big[ Q^{-1} \Big( \frac{1}{n} \Phi^T \Phi - I \Big) \delta^\downarrow \Big] \Big\|_2^2 \le \frac{C (\rho^2 \gamma(\lambda) \log d)^2}{n}\, \mathbb{E}\big[ \|Q \delta^\downarrow\|_2^2 \big]. \qquad (83)$$
We defer the proof to Section D.1.

Applying Lemma 15 and Lemma 17 to equality (82) and using the standard inequality $(a + b + c)^2 \le 4a^2 + 4b^2 + 2c^2$, we obtain the upper bound
$$\big\| \mathbb{E}[\Delta^\downarrow] \big\|_2^2 \le \frac{4 (\bar\lambda - \lambda)^2 \|f_{\bar\lambda}^*\|_{\mathcal{H}}^2}{\lambda} + \frac{4 \rho^4 B_{\lambda, \bar\lambda}^2 \operatorname{tr}(K) \beta_d}{\lambda} + \frac{C (\rho^2 \gamma(\lambda) \log d)^2}{n}\, \mathbb{E}\big[ \|Q \delta^\downarrow\|_2^2 \big]$$
for a universal constant $C$. Note that inequality (72) provides a sufficiently tight bound on the term $\mathbb{E}[\|Q \delta^\downarrow\|_2^2]$. Combined with inequality (80), this completes the proof of Lemma 9.



D.1 Proof of Lemma 17


By using Jensen’s inequality and then applying Cauchy-Schwarz, we find
    2     2
−1 1 T ↓ −1 1 T ↓
E Q Φ Φ−I δ ≤ E Q Φ Φ−I δ
n 2 n 2
" #
  2
1 h i
≤ E Q−1 ΦT Φ − I Q−1 E kQδ ↓ k22 .
n

The first component of the final product can be controlled by the matrix moment bound
established in the proof of inequality (48). In particular, applying (55) with k = 2 yields a
universal constant C such that
" #
2
C(ρ2 γ(λ) log d)2
 
1
E Q−1 ΦT Φ − I Q−1 ≤ ,
n n

which establishes the claim (83).

References
F. Bach. Sharp analysis of low-rank kernel matrix approximations. In Proceedings of the
Twenty Sixth Annual Conference on Computational Learning Theory, 2013.

P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of


Statistics, 33(4):1497–1537, 2005.

P. L. Bartlett, O. Bousquet, and S. Mendelson. Localized Rademacher complexities. In


Computational Learning Theory, pages 44–58. Springer, 2002.

A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and


Statistics. Kluwer Academic, 2004.

T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere. The million song dataset.


In Proceedings of the 12th International Conference on Music Information Retrieval (IS-
MIR), 2011.

M. Birman and M. Solomjak. Piecewise-polynomial approximations of functions of the


classes $W_p^\alpha$. Sbornik: Mathematics, 2(3):295–317, 1967.

G. Blanchard and N. Krämer. Optimal learning rates for kernel conjugate gradient regres-
sion. In Advances in Neural Information Processing Systems 24, 2010.

A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm.
Foundations of Computational Mathematics, 7(3):331–368, 2007.

R. Chen, A. Gittens, and J. A. Tropp. The masked sample covariance estimator: an analysis
using matrix concentration inequalities. Information and Inference, to appear, 2012.

S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations.
Journal of Machine Learning Research, 2:243–264, 2002.


C. Gu. Smoothing Spline ANOVA Models. Springer, 2002.

L. Gyorfi, M. Kohler, A. Krzyzak, and H. Walk. A Distribution-Free Theory of Nonpara-


metric Regression. Springer Series in Statistics. Springer, 2002.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer,


2001.

A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal


problems. Technometrics, 12:55–67, 1970.

D. Hsu, S. Kakade, and T. Zhang. Random design analysis of ridge regression. In Proceedings
of the 25th Annual Conference on Learning Theory, 2012.

A. Kleiner, A. Talwalkar, P. Sarkar, and M. Jordan. Bootstrapping big data. In Proceedings


of the 29th International Conference on Machine Learning, 2012.

V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization.


Annals of Statistics, 34(6):2593–2656, 2006.

M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, 1991.

D. Luenberger. Optimization by Vector Space Methods. Wiley, 1969.

R. McDonald, K. Hall, and G. Mann. Distributed training strategies for the structured
perceptron. In North American Chapter of the Association for Computational Linguistics
(NAACL), 2010.

S. Mendelson. Geometric parameters of kernel machines. In Proceedings of the Fifteenth


Annual Conference on Computational Learning Theory, pages 29–43, 2002a.

S. Mendelson. Improving the sample complexity using global data. Information Theory,
IEEE Transactions on, 48(7):1977–1991, 2002b.

A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in
Neural Information Processing Systems 20, 2007.

G. Raskutti, M. Wainwright, and B. Yu. Early stopping for non-parametric regression: An


optimal data-dependent stopping rule. In 49th Annual Allerton Conference on Commu-
nication, Control, and Computing, pages 1318–1325, 2011.

G. Raskutti, M. J. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models
over kernel classes via convex programming. Journal of Machine Learning Research, 12:
389–427, March 2012.

C. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual


variables. In Proceedings of the 15th International Conference on Machine Learning,
pages 515–521. Morgan Kaufmann, 1998.

B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigen-
value problem. Neural Computation, 10(5):1299–1319, 1998.


J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge


University Press, 2004.

I. Steinwart, D. Hush, and C. Scovel. Optimal rates for regularized least squares regression.
In Proceedings of the 22nd Annual Conference on Learning Theory, pages 79–93, 2009.

C. J. Stone. Optimal global rates of convergence for non-parametric regression. Annals of


Statistics, 10(4):1040–1053, 1982.

A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.

S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.

G. Wahba. Spline Models for Observational Data. CBMS-NSF Regional Conference Series
in Applied Mathematics. SIAM, Philadelphia, PA, 1990.

L. Wasserman. All of Nonparametric Statistics. Springer, 2006.

C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines.
In Advances in Neural Information Processing Systems 14, pages 682–688, 2001.

Y. Yang, M. Pilanci, and M. J. Wainwright. Randomized sketches for kernels: Fast and
optimal non-parametric regression. arXiv:1501.06195 [stat.ml], 2015.

Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning.


Constructive Approximation, 26(2):289–315, 2007.

T. Zhang. Leave-one-out bounds for kernel methods. Neural Computation, 15(6):1397–1437,


2003.

T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural
Computation, 17(9):2077–2098, 2005.

Y. Zhang, J. C. Duchi, and M. J. Wainwright. Communication-efficient algorithms for


statistical optimization. Journal of Machine Learning Research, 14:3321–3363, 2013.
