To estimate f ∗,
we have a dataset (Xi , Yi )i∈{1,...,N} for which there exists a partition
{1, . . . , N} = O ∪ I such that data (Xi , Yi )i∈I are inliers or informative and data (Xi , Yi )i∈O
are “outliers” in the sense that nothing is assumed on these data. On inliers, one grants in-
dependence and finiteness of some moments, allowing for “heavy-tailed” data. Moreover,
departing from the independent and identically distributed (i.i.d.) setup, we also allow inliers
to have different distributions than (X, Y ). We assume that, for all i ∈ I and all f ∈ F ,
$$\mathbb{E}\big[\big(Y_i - f^*(X_i)\big)\big(f - f^*\big)(X_i)\big] = \mathbb{E}\big[\big(Y - f^*(X)\big)\big(f - f^*\big)(X)\big],$$
$$\mathbb{E}\big[\big(f - f^*\big)^2(X_i)\big] = \mathbb{E}\big[\big(f - f^*\big)^2(X)\big].$$
These assumptions imply that the distribution P of (X, Y ) and the distribution Pi of
(Xi , Yi ) for i ∈ I induce the same L2 -geometry on F − f ∗ = {f − f ∗ : f ∈ F } and, there-
fore, in particular, that the oracles w.r.t. P and Pi for any i ∈ I are the same. Of course, the
sets O and I are unknown to the statistician.
Regression problems with possibly heavy-tailed inliers cannot be handled by classical
least-squares estimators, which are particular instances of empirical risk minimizers (ERM) of Vapnik [65].

FIG. 1. Estimation error of the LASSO (blue curve) and MOM LASSO (red curve) after one outlier was added at observation 100.

Least-squares estimators have sub-Gaussian deviations under stronger as-
sumptions, such as boundedness [44] or sub-Gaussian [37] assumptions on the noise and the
design. In this paper, the main hypothesis is the small ball assumption of [32, 48] which
says that L2 (P ) and L1 (P ) norms are equivalent over F − f ∗ ; see Section 3.1 for details.
Although sometimes restrictive [26, 56], this assumption does not involve high moment con-
ditions unnecessary for the problem to make sense.
Least-squares estimators and their regularized versions are also useless in corrupted en-
vironments. This has been known for a long time and can easily be checked in practice.
Figure 1, for example, shows the estimation error of the LASSO [58] on a dataset containing
a single outlier in the outputs.
These restrictions of least-squares estimators gave rise in the 1960s to the theory of robust
statistics of John Tukey [59, 60], Peter Huber [27, 28] and Frank Hampel [24, 25]. The most
classical alternatives to least-squares estimators are M-estimators, which are ERM based on
loss functions $\ell_f(X, Y)$ less sensitive to outliers than the square loss, such as a truncated
version of the square loss. The idea is that, while $(Y_i - f(X_i))^2$ can be very large for some
outliers and influence the whole empirical mean $N^{-1}\sum_{i=1}^N (Y_i - f(X_i))^2$, the influence of
these anomalies will be asymptotically null if $\ell_f(X_i, Y_i)$ is bounded. Recent works study
deviation properties of M-estimators: [2, 21, 22, 69] considered the Huber-loss in linear re-
gression with heavy-tailed noise and sub-Gaussian design. They obtain minimax optimal de-
viation bounds in this setting. The limitation on the design is not surprising: it is well known
that M-estimators using loss functions such as Huber or L1 loss are not robust to outliers in
the inputs $X_i$. This problem is called the “leverage points problem” [29]. In a slightly different
approach than M-estimation, [6] proposed a minmax estimator based on losses introduced in
[18] in a least-squares regression framework and proved optimal sub-Gaussian bounds under
an $L_2$ assumption on the noise and an $L_4/L_2$ assumption on the design, which is close to the
assumptions we grant on inliers.
This paper focuses on median-of-means (MOM) estimators [1, 30, 53], which provide an alternative to
M-estimators. The MOM estimator of a real valued expectation $\mathbb{E}[Z]$ is built as follows: the
dataset $Z_1, \ldots, Z_N$ is partitioned into blocks $(Z_i)_{i\in B_k}$, $k = 1, \ldots, K$, of the same cardinality.
The MOM estimator is the median of the $K$ empirical means constructed on each block:
$$\mathrm{MOM}_K(Z) = \operatorname{median}\Big(\frac{1}{|B_k|}\sum_{i\in B_k} Z_i,\ k = 1, \ldots, K\Big).$$
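To fix ideas, here is a minimal Python sketch of this estimator (the function name, the random shuffling and the truncation making $K$ divide the sample size are our own implementation choices, not taken from the paper):

```python
import numpy as np

def mom_mean(z, K, seed=0):
    """Median-of-means estimator of E[Z] built on K blocks of equal size."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z, dtype=float)
    n = len(z) - (len(z) % K)              # keep a sample size divisible by K
    idx = rng.permutation(len(z))[:n]      # random equipartition into K blocks
    block_means = z[idx].reshape(K, -1).mean(axis=1)
    return np.median(block_means)

# toy check: heavy-tailed sample plus one corrupted point
z = np.concatenate([np.random.standard_t(df=2.5, size=1000), [1e6]])
print(mom_mean(z, K=11), np.mean(z))   # MOM stays near 0, the empirical mean does not
```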
As in [35, 43], MOM estimators are used to estimate the real valued increments of square risks
$P[(Y - f(X))^2 - (Y - g(X))^2]$, where $f, g \in F$. This construction does not require a notion
of median in dimension larger than 1, contrary to the “geometric median-of-means” approach
presented in [50, 51]. In [35, 43], each $f \in F$ receives a score which is the $L_2(P)$-diameter
of the set $B(f)$, where $g \in B(f)$ if $\mathrm{MOM}_K(\ell_f - \ell_g) < 0$. The approach of [35, 43]
therefore requires an evaluation of the diameter of the sets $B(f)$ for all $f \in F$, which makes
the procedure impossible to implement.
This paper presents an alternative to [35, 43] which relies on the following minmax for-
mulation. By linearity of $P$, $f^*$ is a solution of
$$f^* \in \operatorname*{argmin}_{f\in F}\ \sup_{g\in F}\ P\big[\big(Y - f(X)\big)^2 - \big(Y - g(X)\big)^2\big].$$
Replacing the real valued means P [(Y − f (X))2 − (Y − g(X))2 ] in this equation by their
MOM estimators produces the minmax MOM estimators of f ∗ which are rigorously intro-
duced in Section 2.3. Compared with [35, 43], minmax MOM estimators do not require an
estimation of L2 -distances between elements in F and are therefore simpler to define. Min-
max strategies have also been considered in [6] and [9, 10]. The idea of building estimators
of f ∗ from estimators of increments goes back to seminal works by Le Cam [33, 34] and
was further developed by Birgé with the T -estimators [14]. In Le Cam and Birgé’s works,
the authors used “robust tests” to compare densities f and g and deduce from these an alter-
native to the nonrobust maximum likelihood estimators. Baraud [8] showed that robust tests
could be obtained by estimating the difference of Hellinger risks of f and g and used a vari-
ational formula to build these new tests. Finally, Baraud, Birgé and Sart [10] used Baraud’s
estimators of increments in a minmax procedure to build ρ-estimators.
The first aim of this paper is to show that minmax MOM estimators satisfy the same sub-
Gaussian deviation bounds as other MOM estimators [35, 42]. The analysis of minmax MOM
estimators is conceptually and technically simpler: an adaptation of Lemmas 5.1 and 5.5 in
[43] or Lemmas 2 and 3 in [35] is sufficient to prove sub-Gaussian bounds for minmax MOM
estimators, while a robust estimation (based on MOM estimates) of the $L_2(P)$-metric was
required in [35, 42].
Another advantage of the minmax MOM approach lies in the Lepski step (see Theorem 2),
which selects the number $K$ of blocks adaptively. This step is much easier to implement and
to study than the one presented in [35], as a single confidence region is sufficient to grant
adaptation with respect to the excess risk, the regularization norm and the $L_2$ norm. Recall that, in cor-
rupted environments, a data-driven choice of K has to be performed since K must be larger
than twice the (unknown) number of outliers. Note that the idea of aggregating estimators
built on blocks of data and selecting the number of blocks by Lepski’s method was already
present in Birgé [13], proof of Theorem 1. It was also used in [20] to build “multiple-δ”
sub-Gaussian estimators of univariate means.
In our opinion, the most interesting feature of the minmax formulation is that it suggests
a generic method to modify descent algorithms designed to approximate ERM and their reg-
ularized versions and make them efficient even if run on corrupted datasets. Let us give a
rough presentation of a “MOM version” of descent algorithms: at each time step $t$, all em-
pirical means $P_{B_k}(Y - f_t(X))^2$ for $k = 1, \ldots, K$ are evaluated and one computes the index
$k_{\mathrm{med}} \in [K]$ of the block such that
$$P_{B_{k_{\mathrm{med}}}}\big(Y - f_t(X)\big)^2 = \operatorname{med}\big(P_{B_k}\big(Y - f_t(X)\big)^2,\ k = 1, \ldots, K\big).$$
The descent direction is the opposite gradient $-\nabla\big(f \mapsto P_{B_{k_{\mathrm{med}}}}(Y - f(X))^2\big)\big|_{f = f_t}$. This de-
scent algorithm can be turned into a descent-ascent algorithm approximating minmax MOM
estimators. Section 5 presents several examples of modifications of classical algorithms.
In practice, these basic algorithms perform poorly when applied on a fixed partition of
the dataset. However, empirical performance is improved when the partition is chosen uni-
formly at random at each descent step of the algorithm; cf. Section 6.2. In particular, the
shuffling step prevents the algorithms from converging to suboptimal local saddle points. Besides, randomized
algorithms define a notion of depth of data: each time a datum belongs to the median block, its
“score” is incremented by 1. The higher the final score, the deeper the datum. This notion
of depth is based on the risk function, which is natural in a learning framework, and should
probably be investigated more carefully in future works. It also suggests an empirical defini-
tion of outliers and, therefore, an outlier detection algorithm. This by-product is presented
in Section 6.2.
The paper is organized as follows. Section 2 introduces the framework and presents the
minmax MOM estimator, Section 3 details the main theoretical results. These are illustrated
in Section 4 on some classical problems of machine learning. Many robust versions of stan-
dard optimization algorithms are presented in Section 5. An extensive simulation study il-
lustrating our results is performed in Section 6. Proofs of the main results and complementary
theorems showing minimax optimality of our bounds are postponed to the Supplementary
Material [36].
2. Setting. Let X denote a measurable space. Let (Xi , Yi )i∈{1,...,N} , (X, Y ) denote ran-
dom variables taking values in X × R. Let P denote the distribution of (X, Y ) and, for
i ∈ {1, . . . , N}, let Pi denote the distribution of (Xi , Yi ). Let F denote a convex class of func-
tions $f : X \to \mathbb{R}$ and suppose that $\mathbb{E}[Y^2] < \infty$. For any $Q \in \{P, (P_i)_{i\in[N]}\}$ and any $p \ge 1$,
let $L^p_Q$ denote the set of functions $f$ such that $\|f\|_{L^p_Q} = (Q|f|^p)^{1/p} < \infty$, where $Qg =
\mathbb{E}_{Z\sim Q}[g(Z)]$. Assume that $F \subset L^2_P$. For any $(x, y) \in X \times \mathbb{R}$, let $\ell_f(x, y) = (y - f(x))^2$
denote the square loss and let $f^*$ denote an oracle
$$\text{(1)}\qquad f^* \in \operatorname*{argmin}_{f\in F} P\ell_f \quad\text{where } \forall g \in L^1_P,\ Pg = \mathbb{E}\big[g(X, Y)\big].$$
Let $R(f) = P\ell_f$ denote the risk. The goal is to build estimators $\hat f$ satisfying: with proba-
bility at least $1 - \delta$,
$$R(\hat f) \le \min_{f\in F} R(f) + r_N^{(1)} \qquad\text{and}\qquad \|\hat f - f^*\|_{L^2_P} \le r_N^{(2)}.$$
The residue $r_N^{(1)}$ of the oracle inequality, the estimation rate $r_N^{(2)}$ and the confidence level $\delta$
should be as small as possible. Oracle inequalities provide risk bounds for the estimation of the
regression function $f(x) = \mathbb{E}[Y \mid X = x]$: $R(\hat f) \le R(f^*) + r_N^{(1)}$ is equivalent to
$$\|f - \hat f\|_{L^2_P}^2 \le \|f - f^*\|_{L^2_P}^2 + r_N^{(1)}.$$
Recall that, by linearity of $P$, the oracle $f^*$ satisfies both the min and the minmax formulations
$$\text{(2)}\qquad f^* \in \operatorname*{argmin}_{f\in F} P\ell_f = \operatorname*{argmin}_{f\in F}\ \sup_{g\in F}\ P(\ell_f - \ell_g).$$
Any estimator of real valued expectations $P\ell_f$ or $P(\ell_f - \ell_g)$ can be plugged in (2) to obtain
estimators of $f^*$. Plugging the empirical means (in both the min and the minmax problems)
yields the classical ERM over $F$, for example. In general, plugging nonlinear (robust or not)
estimators of the mean in the minmax problem or in the min problem in (2) does not yield the
same estimator of $f^*$ though. The main advantage of the minmax formulation is that it allows
to bound the risk of the estimator using the complexity of F around f ∗ . This “localization”
idea is central to derive optimal (fast) rates for the ERM [15, 31, 44] and cannot be used
directly when empirical means are simply replaced by nonlinear estimators of the mean in a
minimization formulation.
2.2. MOM estimators. Let $K$ denote an integer smaller than $N/2$ and let $B_1, \ldots, B_K$
denote a partition of $[N] = \{1, \ldots, N\}$ into blocks of equal size $N/K$ (w.l.o.g. we assume
that $K$ divides $N$). For all functions $L : X \times \mathbb{R} \to \mathbb{R}$ and $k \in [K] = \{1, \ldots, K\}$, let
$P_{B_k} L = |B_k|^{-1}\sum_{i\in B_k} L(X_i, Y_i)$.
For all $\alpha \in (0, 1)$ and real numbers $x_1, \ldots, x_K$, denote by $Q_\alpha(x_1, \ldots, x_K)$ the set of $\alpha$-
quantiles of $\{x_1, \ldots, x_K\}$:
$$\big\{ u \in \mathbb{R} : \big|\{k \in [K] : x_k \ge u\}\big| \ge (1 - \alpha)K,\ \big|\{k \in [K] : x_k \le u\}\big| \ge \alpha K \big\}$$
and let Qα (x) denote any point in Qα (x1 , . . . , xK ). For x = (x1 , . . . , xK ) ∈ RK and t ∈ R,
we say that Qα (x) ≥ t when there exists J ⊂ [K] such that |J | ≥ (1 − α)K and for all
k ∈ J, xk ≥ t; we write Qα (x) ≤ t if there exists J ⊂ [K] such that |J | ≥ αK and for all
k ∈ J, xk ≤ t.
Let y = (y1 , . . . , yK ) ∈ RK . We write Q1/2 (x − y) ≤ Q3/4 (x) − Q1/4 (y) when there exist
u, l ∈ R such that Q1/2 (x − y) ≤ u − l, Q3/4 (x) ≤ u and Q1/4 (y) ≥ l.
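In practice, these conventions reduce to counting coordinates; a small helper (our own, for illustration only) makes them concrete:

```python
import numpy as np

def q_ge(x, alpha, t):
    """Check 'Q_alpha(x) >= t': at least (1 - alpha) * K coordinates of x are >= t."""
    x = np.asarray(x, dtype=float)
    return np.sum(x >= t) >= (1 - alpha) * len(x)

def q_le(x, alpha, t):
    """Check 'Q_alpha(x) <= t': at least alpha * K coordinates of x are <= t."""
    x = np.asarray(x, dtype=float)
    return np.sum(x <= t) >= alpha * len(x)

x = np.array([-1.0, 0.5, 2.0, 3.0])
print(q_ge(x, alpha=1/2, t=0.0), q_le(x, alpha=1/4, t=-0.5))  # True True
```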
2.3. Minmax MOM estimators. Minmax MOM estimators are obtained by replacing the
unknown expectations $P(\ell_f - \ell_g)$ in (2) by their MOM estimators. Given a regularization
norm $\|\cdot\|$ and a regularization parameter $\lambda \ge 0$, let, for all $f, g \in F$,
$$T_{K,\lambda}(g, f) = \mathrm{MOM}_K(\ell_f - \ell_g) + \lambda\big(\|f\| - \|g\|\big),$$
where $\mathrm{MOM}_K(\ell_f - \ell_g)$ denotes the MOM estimator built on the data $(\ell_f - \ell_g)(X_i, Y_i)$, $i \in [N]$, and define
$$\hat f_{K,\lambda} \in \operatorname*{argmin}_{f\in F}\ \sup_{g\in F}\ T_{K,\lambda}(g, f),$$
with $\hat f_K$ denoting the nonregularized version obtained for $\lambda = 0$.
We shall provide results for $\hat f_{K,\lambda}$ only in the main text. The estimators $\hat f_K$ are studied in
the Supplementary Material [36] in Section 7.
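For linear functions $f = \langle t, \cdot\rangle$ and the $\ell_1$ regularization used in Section 4, a minimal sketch of the criterion (assuming the form of $T_{K,\lambda}$ recalled above, with a fixed contiguous equipartition of the data) reads as follows:

```python
import numpy as np

def T_K_lambda(t_prime, t, X, Y, K, lam):
    """T_{K,lambda}(t', t) = MOM_K(ell_t - ell_{t'}) + lam * (||t||_1 - ||t'||_1)."""
    n = len(Y) - (len(Y) % K)                       # make K divide the sample size
    incr = (Y[:n] - X[:n] @ t) ** 2 - (Y[:n] - X[:n] @ t_prime) ** 2
    block_means = incr.reshape(K, -1).mean(axis=1)  # empirical means on the K blocks
    return np.median(block_means) + lam * (np.abs(t).sum() - np.abs(t_prime).sum())
```

The minmax MOM estimator is then any (approximate) solution of the min over $t$ of the sup over $t'$ of this criterion; the algorithms of Section 5 compute such approximations.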
3. Assumptions and main results. Denote by $\{O, I\}$ a partition of $[N]$ and by $|O|$ the
cardinality of $O$. No assumption is made on $(X_i, Y_i)_{i\in O}$: these data are outliers. They may
not be independent, nor independent from the remaining data (they may not even be random). The data $(X_i, Y_i)$, $i \in
I$, are called inliers or informative data. They are hereafter assumed independent. The sets
$O, I$ are unknown.
3.1. Assumptions. The main assumptions involve first and second moments of the func-
tions in F and Y under the distributions P , (Pi )i∈I .
Assumption 1 holds in the i.i.d. framework, with I = [N] but it covers also other cases
where inliers follow different distributions (see, for instance, multimodal datasets such as in
[45] or heteroscedastic noise [4]). It is also possible to weaken Assumption 1 such as in [35].
The second assumption bounds the correlation between ζi = Yi − f ∗ (Xi ) and the shifted
class F − f ∗ .
Assumption 2 holds when data are i.i.d. and $Y - f^*(X)$ has uniformly bounded $L_2$-
moments conditionally on $X$. This last assumption holds when $Y - f^*(X)$ is independent
of $X$ and has an $L_2$-moment bounded by $\theta_m$. Assumption 2 also holds if, for all $i \in I$,
$\|\zeta\|_{L^4_{P_i}} \le \theta_2 < \infty$, where $\zeta(x, y) = y - f^*(x)$ for all $x \in X$ and $y \in \mathbb{R}$, and, for every
$f \in F$, $\|f - f^*\|_{L^4_{P_i}} \le \theta_1 \|f - f^*\|_{L^2_{P_i}}$. Actually, in this case,
$$\operatorname{Var}\big(\zeta_i\, (f - f^*)(X_i)\big) \le \|\zeta\|^2_{L^4_{P_i}}\, \|f - f^*\|^2_{L^4_{P_i}} \le \theta_1^2\theta_2^2\, \|f - f^*\|^2_{L^2_P},$$
so Assumption 2 holds for $\theta_m = \theta_1\theta_2$. The third assumption states that the norms $L^2_P$ and $L^1_P$
are equivalent over $F - f^*$.
Under Assumption 1, $\|f - f^*\|_{L^1_{P_i}} \le \|f - f^*\|_{L^2_{P_i}} = \|f - f^*\|_{L^2_P}$ for all $f \in F$ and $i \in I$,
hence, Assumptions 1 and 3 imply that the norms $L^1_P$, $L^2_P$, $L^2_{P_i}$, $L^1_{P_i}$, $i \in I$, are equivalent over
F − f ∗ . Assumption 3 is equivalent to the small ball property (cf. [32, 48]); see Proposition 1
in [35].
and $r_M(\rho, \gamma_M)$ as
$$\inf\Big\{ r > 0 : \sup_{J \subset I,\, |J| \ge N/2}\ \mathbb{E}\sup_{f \in B_{\mathrm{reg}}(f^*, \rho, r)} \Big|\frac{1}{|J|}\sum_{i \in J} \varepsilon_i\, \zeta_i\, \big(f - f^*\big)(X_i)\Big| \le \gamma_M r^2 \Big\},$$
where $(\varepsilon_i)_{i\in I}$ are i.i.d. Rademacher signs independent of the data.
Let ρ → r(ρ, γQ , γM ) be a continuous and nondecreasing function such that for every ρ > 0,
r(ρ) = r(ρ, γQ , γM ) ≥ max{rQ (ρ, γQ ), rM (ρ, γM )}.
It follows from Lemma 2.3 in [37] that $r_M$ and $r_Q$ are continuous and nondecreasing
functions that depend on $f^*$. According to [37], for appropriate choices of $\gamma_Q, \gamma_M$, $r(\rho) =$
max(rM (ρ, γM ), rQ (ρ, γQ )) is the minimax rate of convergence over B(f ∗ , ρ). Note also
that rQ and rM are well defined when |I | ≥ N/2, meaning that at least half data should be
informative.
3.3. The sparsity equation. Risk bounds follow from upper bounds on $T_{K,\lambda}(f, f^*)$ for
functions $f$ far from $f^*$ either in $L^2_P$-norm or in the regularization norm $\|\cdot\|$. Let $f \in F$
and let $\rho = \|f - f^*\|$. When $\|f - f^*\|_{L^2_P}$ is small, $T_{K,\lambda}$ has to be bounded from above by
$\lambda(\|f^*\| - \|f\|)$. To bound $\|f^*\| - \|f\|$ from below, introduce the subdifferentials of $\|\cdot\|$.
Let $(E^*, \|\cdot\|^*)$ be the dual normed space of $(E, \|\cdot\|)$ and, for all $f \in F$, let
$$(\partial\|\cdot\|)_f = \big\{ z^* \in E^* : \forall h \in E,\ \|f + h\| \ge \|f\| + z^*(h) \big\}.$$
For any $\rho > 0$, let $H_\rho$ denote the set of functions “close” to $f^*$ in $L^2_P$ and at distance $\rho$
from $f^*$ in regularization norm, and let $\Gamma_{f^*}(\rho)$ denote the set of subdifferentials of all vectors
close to $f^*$:
$$\Gamma_{f^*}(\rho) = \bigcup_{f \in F : \|f - f^*\| \le \rho/20} (\partial\|\cdot\|)_f \qquad\text{and}\qquad \Delta(\rho) = \inf_{f \in H_\rho}\ \sup_{z^* \in \Gamma_{f^*}(\rho)} z^*\big(f - f^*\big).$$
DEFINITION 4. A radius $\rho > 0$ is said to satisfy the sparsity equation when $\Delta(\rho) \ge 4\rho/5$.
3.4. Main results. The first results give risk bounds for $\hat f_{K,\lambda}$. Similar bounds have been
obtained for other MOM estimators [35, 42].
THEOREM 1. Grant Assumptions 1, 2 and 3 and let $r_Q, r_M$ denote the complexity func-
tions introduced in Definition 3. Assume that $N \ge 384\theta_0^2$ and $|O| \le N/(768\theta_0^2)$. Let $\rho^*$ be
a solution to the sparsity equation from Definition 4. Let $\varepsilon = 1/(833\theta_0^2)$ and let $r^2(\cdot)$ be as in
Definition 3 for $\gamma_Q = (384\theta_0)^{-1}$ and $\gamma_M = \varepsilon/192$. Let $K^*$ denote the smallest integer such
that
$$K^* \ge \frac{N\varepsilon^2}{384\theta_m^2}\, r^2(\rho^*).$$
For any $K \ge K^*$, define the radius $\rho_K$ and the regularization parameter $\lambda$ as
$$r^2(\rho_K) = \frac{384\theta_m^2 K}{\varepsilon^2 N} \qquad\text{and}\qquad \lambda = \frac{16\varepsilon\, r^2(\rho_K)}{\rho_K}.$$
Then, for all $K \in [\max(K^*, 8|O|), N/(96\theta_0^2)]$, with probability larger than $1 -
4\exp(-7K/9216)$, the estimator $\hat f_{K,\lambda}$ defined in Section 2.3 satisfies
$$\|\hat f_{K,\lambda} - f^*\| \le 2\rho_K, \qquad \|\hat f_{K,\lambda} - f^*\|_{L^2_P} \le r(2\rho_K),$$
$$R(\hat f_{K,\lambda}) \le R(f^*) + (1 + 52\varepsilon)\, r^2(2\rho_K).$$
The following theorem gives risk bounds for the adaptive estimators. Bounds in regularization and
$L^2_P$ norms have been proved for Le Cam test estimators in [35]. To the best of our knowledge,
adaptive bounds in excess risk have never been proved before.
THEOREM 2. Grant the assumptions of Theorem 1. Choose $c_{\mathrm{ad}} = 18/833$ in (4) and let
$\varepsilon = (833\theta_0^2)^{-1}$. For any $K \in [\max(K^*, 8|O|), N/(96\theta_0^2)]$, with probability larger than
$$1 - 4\exp(-K/2304) = 1 - 4\exp\big(-\varepsilon^2 N r^2(\rho_K)/(884736\,\theta_m^2)\big),$$
one has $\|\hat f_{\mathrm{cad}} - f^*\| \le 2\rho_K$, $\|\hat f_{\mathrm{cad}} - f^*\|_{L^2_P} \le r(2\rho_K)$ and
$$R(\hat f_{\mathrm{cad}}) \le R(f^*) + (1 + 52\varepsilon)\, r^2(2\rho_K).$$
In particular, for $K = K^*$, we have $r(2\rho_{K^*}) = \max\big(r(2\rho^*), \sqrt{|O|/N}\big)$.
Theorem 2 shows that $\hat f_{\mathrm{cad}}$ achieves similar performance as $\hat f_{K,\lambda}$ simultaneously for all $K$
from $K^*$ to $O(N)$. For $K = K^*$, these rates match the minimax optimal rates of convergence;
see Section 4. The main difference with Theorem 1 is that the knowledge of $K^*$ and $|O|$ is
not necessary to design $\hat f_{\mathrm{cad}}$. This is very useful in applications, where these quantities are
typically unknown. Moreover, both the construction and the analysis are much simpler for
$\hat f_{\mathrm{cad}}$ than for the adaptive estimator in [35], since they are based on the analysis of confidence
regions for a single criterion $C_{J,\lambda}$, instead of the multiple criteria used in [35].
REMARK 2 (Deviation parameter). Note that $r(\cdot)$ can be any continuous, nonde-
creasing function such that $r(\rho) \ge \max(r_Q(\rho, \gamma_Q), r_M(\rho, \gamma_M))$. In particular, if $r_* : \rho \mapsto
\max(r_Q(\rho, \gamma_Q), r_M(\rho, \gamma_M))$ is continuous, as it is clearly nondecreasing, then, for every
$x > 0$, $r(\rho) = \max(r_Q(\rho, \gamma_Q), r_M(\rho, \gamma_M)) + \sqrt{x/N}$ is another nondecreasing upper bound.
Therefore, one can derive results similar to Theorem 2 but with an extra confidence parame-
ter: for all $x > 0$, with probability at least $1 - 4\exp(-c_0 N r_*^2(\rho_{K^*}) - c_0 x)$,
$$\|\hat f_{\mathrm{cad}} - f^*\| \le 2\rho_K, \qquad \|\hat f_{\mathrm{cad}} - f^*\|_{L^2_P} \le r_*(2\rho_K) + \sqrt{\frac{x}{N}},$$
$$R(\hat f_{\mathrm{cad}}) \le R(f^*) + (1 + 52\varepsilon)\Big( r_*^2(2\rho_K) + \frac{x}{N}\Big).$$
In that case, $\hat f_{\mathrm{cad}}$ depends on $x$ since $\lambda = 16\varepsilon\big(r_*^2(\rho_K) + x/N\big)/\rho_K$.
4.1. The LASSO. The LASSO is obtained when $F = \{\langle t, \cdot\rangle : t \in \mathbb{R}^d\}$ and the regulariza-
tion function is the $\ell_1$-norm:
$$\hat t \in \operatorname*{argmin}_{t\in\mathbb{R}^d}\ \frac{1}{N}\sum_{i=1}^N \big(\langle t, X_i\rangle - Y_i\big)^2 + \lambda \|t\|_1 \qquad\text{where } \|t\|_1 = \sum_{i=1}^d |t_i|.$$
Even if recent advances show some limitations of LASSO [54, 62, 67], it remains the
benchmark estimator in high-dimensional statistics because a high-dimensional parameter
space does not significantly affect its performance as long as t ∗ is sparse. One can refer to
[12, 41, 47, 52, 61, 63, 64] for estimation and sparse oracle inequalities, [7, 46, 68] for support
recovery results; more results and references on LASSO can be found in [17, 31].
4.2. SLOPE. SLOPE is an estimator introduced in [16, 57]. The class $F$ is still $F =
\{\langle t, \cdot\rangle : t \in \mathbb{R}^d\}$ and the regularization function is defined, for parameters $\beta_1 \ge \beta_2 \ge \cdots \ge
\beta_d > 0$, by $\|t\|_{\mathrm{SLOPE}} = \sum_{i=1}^d \beta_i t_i^\sharp$, where $(t_i^\sharp)_{i=1}^d$ denotes the nonincreasing rearrange-
ment of $(|t_i|)_{i=1}^d$. The SLOPE norm is a weighted $\ell_1$-norm that coincides with the $\ell_1$-norm when
all the weights $\beta_i$ are equal to 1.
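For illustration, the SLOPE norm is straightforward to evaluate; a short sketch (the weights $\beta_j = \sqrt{\log(ed/j)}$ used later are one possible choice, hard-coded here):

```python
import numpy as np

def slope_norm(t, beta):
    """Weighted l1-norm of the nonincreasing rearrangement of |t|."""
    t_sharp = np.sort(np.abs(t))[::-1]   # nonincreasing rearrangement of |t_i|
    return float(np.sum(beta * t_sharp))

d = 10
beta = np.sqrt(np.log(np.e * d / np.arange(1, d + 1)))   # beta_j = sqrt(log(ed/j))
t = np.random.randn(d)
print(slope_norm(t, beta), slope_norm(t, np.ones(d)), np.abs(t).sum())
# the last two values coincide: SLOPE with unit weights is the l1-norm
```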
4.3. Classical results for LASSO and SLOPE. Typical results for LASSO and SLOPE
have been obtained when data are i.i.d. with sub-Gaussian design X and, most of the time,
sub-Gaussian noise ζ as well.
DEFINITION 5. Let $\ell_2^d$ be a $d$-dimensional inner product space and let $X$ be a ran-
dom variable with values in $\ell_2^d$. We say that $X$ is isotropic when, for every $t \in \ell_2^d$,
$\|\langle X, t\rangle\|_{L^2_P} = \|t\|_{\ell_2^d}$, and that it is $L$-sub-Gaussian if, for every $t \in \ell_2^d$ and every $p \ge 2$,
$$\|\langle X, t\rangle\|_{L^p_P} \le L\sqrt{p}\, \|\langle X, t\rangle\|_{L^2_P}.$$
The covariance structure of an isotropic random variable coincides with the inner product
of $\ell_2^d$. If $X$ is an $L$-sub-Gaussian random vector, the $L^p_P$ norms of all linear forms do not grow
faster than the corresponding $L^p_P$ norms of a Gaussian variable. When dealing with the LASSO and SLOPE,
the natural Euclidean structure is used on $\mathbb{R}^d$.
ASSUMPTION 4.
1. Data are i.i.d. (in particular, $|I| = N$ and $|O| = 0$, that is, there is no outlier),
2. $X$ is isotropic and $L$-sub-Gaussian,
3. for $f^* = \langle t^*, \cdot\rangle$, $\xi = Y - f^*(X) \in L^{q_0}_P$ for some $q_0 > 2$.
Assumption 4 requires an $L^{q_0}$ moment, for some $q_0 > 2$, on the noise. LASSO and SLOPE still
achieve optimal rates of convergence under this assumption, but with a severely deteriorated
probability estimate.
THEOREM 3 (Theorem 1.4 in [39]). Grant Assumption 4. Let $s \in [d]$. Assume that $N \ge
c_1 s\log(ed/s)$ and that there is some $v \in \mathbb{R}^d$ supported on at most $s$ coordinates for which
$\|t^* - v\|_1 \le c_2 \|\xi\|_{L^{q_0}_P}\, s\sqrt{\log(ed)/N}$. The LASSO estimator $\hat t$ with regularization parameter
$\lambda = c_3 \|\xi\|_{L^{q_0}_P}\sqrt{\log(ed)/N}$ is such that, with probability at least
$$\text{(5)}\qquad 1 - \frac{c_4 \log^{q_0} N}{N^{q_0/2 - 1}} - 2\exp\big(-c_5 s\log(ed/s)\big),$$
for every $1 \le p \le 2$,
$$\|\hat t - t^*\|_p \le c_6 \|\xi\|_{L^{q_0}_P}\, s^{1/p}\sqrt{\frac{\log(ed)}{N}}.$$
The constants $(c_j)_{j=1}^6$ depend only on $L$ and $q_0$.
Theorem 3 shows that LASSO achieves its optimal rate (cf. [12]) if t ∗ is close to a sparse
vector and the noise ζ may be heavy tailed and may not be independent from X. On the
other hand, the dataset cannot contain outliers and the data should be i.i.d. with sub-Gaussian
design matrix X.
Turning to SLOPE, recall the following result for the regularization norm $\|t\|_{\mathrm{SLOPE}} =
\sum_{j=1}^d \beta_j t_j^\sharp$ when $\beta_j = C\sqrt{\log(ed/j)}$.
THEOREM 4 (Theorem 1.6 in [39]). Consider the SLOPE under Assumption 4. Assume
that $N \ge c_1 s\log(ed/s)$ and that there is $v \in \mathbb{R}^d$ such that $|\operatorname{supp}(v)| \le s$ and $\|t^* - v\|_{\mathrm{SLOPE}} \le
c_2 \|\xi\|_{L^{q_0}_P}\, s\log(ed/s)/\sqrt{N}$. The SLOPE estimator with $\lambda = c_3\|\xi\|_{L^{q_0}_P}/\sqrt{N}$ satisfies, with
probability at least (5),
$$\|\hat t - t^*\|_{\mathrm{SLOPE}} \le c_4 \|\xi\|_{L^{q_0}_P}\, \frac{s}{\sqrt N}\log\Big(\frac{ed}{s}\Big), \qquad \|\hat t - t^*\|_2^2 \le c_5 \|\xi\|^2_{L^{q_0}_P}\, \frac{s}{N}\log\Big(\frac{ed}{s}\Big).$$
The constants $(c_j)_{j=1}^5$ depend only on $L$ and $q_0$.
4.4. Minmax MOM LASSO and SLOPE. In this section, Theorem 2 is applied to the set
$F$ of linear functionals indexed by $\mathbb{R}^d$, with the regularization function being either the $\ell_1$-norm or the
SLOPE norm. The aim is to show that the results from Section 4.3 hold, and are sometimes
even improved, for the MOM versions of LASSO and SLOPE, under weaker assumptions and with
a better deviation probability. Start with the new set of assumptions.
In order to apply Theorem 2, we have to compute the fixed-point functions $r_Q(\cdot)$, $r_M(\cdot)$
and solve the sparsity equation in both cases. To compute the fixed point functions, recall
the definition of Gaussian mean widths: for a set $V \subset \mathbb{R}^d$, the Gaussian mean width of $V$ is
defined as $\ell^*(V) = \mathbb{E}\sup_{v\in V}\langle G, v\rangle$, where $G \sim \mathcal{N}(0, I_{d\times d})$.
The dual norm of the $\ell_1^d$-norm is the $\ell_\infty^d$-norm, which is 1-unconditional with respect to the
canonical basis of $\mathbb{R}^d$ ([49], Definition 1.4). Therefore, [49], Theorem 1.6, applies under the
following assumption.
ASSUMPTION 6. There exist constants $q_0 > 2$, $C_0$ and $L$ such that $\xi \in L^{q_0}_P$, $X$ is isotropic
and, for every $j \in [d]$ and $1 \le p \le C_0\log d$, $\|\langle X, e_j\rangle\|_{L^p_P} \le L\sqrt{p}\,\|\langle X, e_j\rangle\|_{L^2_P}$.
Local Gaussian mean widths $\ell^*(\rho B_1^d \cap r B_2^d)$ are bounded from above in [39], Lemma 5.3,
and the computations of $r_M(\cdot)$ and $r_Q(\cdot)$ follow:
$$r_M^2(\rho) \lesssim_{L, q_0, \gamma_M} \begin{cases} \sigma^2 \dfrac{d}{N} & \text{if } \rho^2 N \ge \sigma^2 d^2,\\[4pt] \rho\,\sigma\sqrt{\dfrac{1}{N}\log\Big(\dfrac{e\sigma d}{\rho\sqrt N}\Big)} & \text{otherwise,}\end{cases}$$
$$r_Q^2(\rho) \begin{cases} = 0 & \text{if } N \gtrsim_{L,\gamma_Q} d,\\[4pt] \lesssim_{L,\gamma_Q} \dfrac{\rho^2}{N}\log\Big(\dfrac{c(L,\gamma_Q)\, d}{N}\Big) & \text{otherwise.}\end{cases}$$
PROOF. It follows from Theorem 2, the computation of $r(\rho_K)$ from (7) and of $\rho_K$ in (8) that,
with probability at least $1 - c_0\exp(-c\,r(\rho_{K^*})^2 N/C)$, $\|\hat t - t^*\|_1 \le \rho_{K^*}$ and $\|\hat t - t^*\|_2 \lesssim r(\rho_{K^*})$.
The result follows since $\rho_{K^*} \sim \rho^* \sim_{L,q_0} \sigma s\sqrt{\frac{1}{N}\log\big(\frac{ed}{s}\big)}$ and $\|v\|_p \le \|v\|_1^{-1+2/p}\|v\|_2^{2-2/p}$ for
all $v \in \mathbb{R}^d$ and $1 \le p \le 2$.
Theoretical properties of MOM LASSO (cf. Theorem 5) outperform those of LASSO (cf.
Theorem 3) in several ways:
• Estimation rates achieved by MOM LASSO are the actual minimax rates $s\log(ed/s)/N$
(see [11]), while classical LASSO estimators achieve the rate $s\log(ed)/N$. This improve-
ment is possible thanks to the adaptation step in MOM LASSO.
• The deviation probability in (5) is polynomial, of order $1/N^{q_0/2-1}$, whereas it is exponentially
small for MOM LASSO. Exponential rates for LASSO hold only if $\xi$ is sub-Gaussian
($\|\xi\|_{L^p} \le C\sqrt{p}\,\|\xi\|_{L^2}$ for all $p \ge 2$).
• MOM LASSO is insensitive to data corruption by up to $s\log(ed/s)$ outliers, while a single
outlier can be responsible for a dramatic breakdown of the performance of LASSO (cf.
Figure 1).
• Assumptions on $X$ are weaker for MOM LASSO than for LASSO. In the LASSO case,
we assume that $X$ is sub-Gaussian, whereas for MOM LASSO we assume that the
coordinates of $X$ have $C_0\log(ed)$ sub-Gaussian moments and that $X$ satisfies an $L_2/L_1$
equivalence assumption.
Let us now turn to the study of a “minmax MOM version” of the SLOPE estimator. The
computation of the fixed point functions $r_Q(\cdot)$ and $r_M(\cdot)$ relies on [49], Theorem 1.6, and the
computations from [39]. Again, the SLOPE norm has a dual norm which is 1-unconditional
with respect to the canonical basis of $\mathbb{R}^d$ ([49], Definition 1.4). Therefore, it follows from [49],
Theorem 1.6, that, under Assumption 6, one has
$$\mathbb{E}\sup_{v \in \rho B_{\mathrm{SLOPE}} \cap r B_2^d}\Big|\sum_{i\in[N]} \varepsilon_i \langle v, X_i\rangle\Big| \le c_2\sqrt N\, \ell^*\big(\rho B_{\mathrm{SLOPE}} \cap r B_2^d\big),$$
$$\mathbb{E}\sup_{v \in \rho B_{\mathrm{SLOPE}} \cap r B_2^d}\Big|\sum_{i\in[N]} \varepsilon_i\, \zeta_i \langle v, X_i\rangle\Big| \le c_2\, \sigma\sqrt N\, \ell^*\big(\rho B_{\mathrm{SLOPE}} \cap r B_2^d\big),$$
where $B_{\mathrm{SLOPE}}$ is the unit ball of the SLOPE norm, $\zeta_i = Y_i - \langle X_i, t^*\rangle$ and $(\varepsilon_i)_{i\in[N]}$ are i.i.d. Rademacher signs. Local Gaussian
mean widths $\ell^*(\rho B_{\mathrm{SLOPE}} \cap r B_2^d)$ are bounded from above in [39], Lemma 5.3: $\ell^*(\rho B_{\mathrm{SLOPE}} \cap r B_2^d) \lesssim
\min\{C\rho, \sqrt d\, r\}$ when $\beta_j = C\sqrt{\log(ed/j)}$ for all $j \in [d]$, and the computations of $r_M(\cdot)$ and $r_Q(\cdot)$
follow:
$$r_Q^2(\rho) \begin{cases} = 0 & \text{if } N \gtrsim_L d,\\[4pt] \lesssim_L \dfrac{\rho^2}{N} & \text{otherwise,}\end{cases}
\qquad\text{and}\qquad
r_M^2(\rho) \lesssim_{L,q,\delta} \begin{cases} \|\xi\|^2_{L^q}\, \dfrac{d}{N} & \text{if } \rho^2 N \gtrsim_{L,q,\delta} \|\xi\|^2_{L^q}\, d^2,\\[4pt] \|\xi\|_{L^q}\, \dfrac{\rho}{\sqrt N} & \text{otherwise.}\end{cases}$$
The sparsity equation has been solved in [39], Lemma 4.3.

LEMMA 2. Let $1 \le s \le d$ and set $B_s = \sum_{j\le s} \beta_j/\sqrt j$. If $t^*$ is $\rho/20$-approximated (rela-
tive to the SLOPE norm) by an $s$-sparse vector and if $40 B_s \le \rho/r(\rho)$, then $\Delta(\rho) \ge 4\rho/5$.

For $\beta_j \le C\sqrt{\log(ed/j)}$, $B_s = \sum_{j\le s}\beta_j/\sqrt j \lesssim C\sqrt{s\log(ed/s)}$. The condition $B_s \lesssim
\rho/r(\rho)$ holds when $N \gtrsim_{L,q_0} s\log(ed/s)$ and $\rho \gtrsim_{L,q_0} \|\xi\|_{L^q}\frac{s}{\sqrt N}\log\big(\frac{ed}{s}\big)$. Lemma 2 implies that
$\Delta(\rho) \ge 4\rho/5$ when there is an $s$-sparse vector in $t^* + (\rho/20)B_{\mathrm{SLOPE}}$. Therefore, Theorem 1
applies for $\lambda \sim r^2(\rho)/\rho \sim_{L,q,\delta} \|\xi\|_{L^q}/\sqrt N$.
The final ingredient is to compute the solution $\rho_K$ of $K = c\,r(\rho_K)^2 N$. It is solved for
$\rho_K \sim K/(\sigma\sqrt N)$ and, therefore, $\lambda \sim r^2(\rho_K)/\rho_K \sim_{L,q,\delta} \|\xi\|_{L^q}/\sqrt N$.
The following result follows from Theorem 2 together with the previous computations of
$\rho^*$, $\rho_K$, $r_Q(\cdot)$, $r_M(\cdot)$ and $r(\cdot)$. The proof, similar to that of Theorem 5, is omitted.
MOM SLOPE has the same advantages over SLOPE as MOM LASSO over LASSO.
These improvements, listed below Theorem 5, are not repeated. The only difference is that
SLOPE, unlike LASSO, already achieves the minimax rate $s\log(ed/s)/N$, whereas, without
an extra adaptation step as in [11], the LASSO is not known to achieve a rate better than
$s\log(ed)/N$.
5. Algorithms for minmax MOM LASSO. The aim of this section is to show that
there is a systematic way to transform classical descent-based algorithms (such as Newton's
method, gradient descent or proximal gradient descent algorithms) into robust ones
using the MOM approach. This section provides several examples of such modifications.
These algorithms are tested in high-dimensional frameworks. In this setup, there exists
a large number of algorithms approximating the LASSO. The aim of this section is to
show that there is a natural modification of these algorithms that makes them more robust
to outliers. The choice of hyperparameters, like the number of blocks or the regularization
parameter, cannot be done via classical Cross-Validation (CV) because of possible outliers in
the test sets. CV procedures are therefore also adapted using MOM's principle in Section 6. We also
advocate for using random blocks at every iteration of the algorithms, to bypass a problem
of “local saddle points” we have identified. A byproduct of the latter approach is a definition
of depth adapted to the learning task and, therefore, an outlier detection algorithm. This
material and a simulation study are given in Section 6 of the Supplementary Material [36].
5.1. From algorithms for LASSO to MOM LASSO. Each algorithm designed for the
LASSO can be transformed into a robust algorithm for the minmax MOM estimator. Recall
that the minmax MOM LASSO estimator is
$$\text{(10)}\qquad \hat t_{K,\lambda} \in \operatorname*{argmin}_{t\in\mathbb{R}^d}\ \sup_{t'\in\mathbb{R}^d}\ T_{K,\lambda}(t', t),$$
where $T_{K,\lambda}(t', t) = \mathrm{MOM}_K(\ell_t - \ell_{t'}) + \lambda(\|t\|_1 - \|t'\|_1)$.
ASSUMPTION 7. Almost surely (with respect to $(X_i, Y_i)_{i=1}^N$), for almost all $(t_0, t_0') \in
\mathbb{R}^d \times \mathbb{R}^d$ (with respect to the Lebesgue measure on $\mathbb{R}^d \times \mathbb{R}^d$), there exist a convex open set
$B$ containing $(t_0, t_0')$ and $k \in [K]$ such that, for all $(t, t') \in B$, $P_{B_k}(\ell_t - \ell_{t'}) \in \mathrm{MOM}_K(\ell_t - \ell_{t'})$.
Under Assumption 7, for almost all couples $(t_0, t_0') \in \mathbb{R}^d \times \mathbb{R}^d$, $t \mapsto T_{K,\lambda}(t_0', t)$ is “locally
convex” and $t' \mapsto T_{K,\lambda}(t', t_0)$ is “locally concave.” Therefore, for $k$ such that $P_{B_k}(\ell_{t_0} - \ell_{t_0'}) \in
\mathrm{MOM}_K(\ell_{t_0} - \ell_{t_0'})$,
$$\text{(11)}\qquad \nabla_t\, \mathrm{MOM}_K(\ell_t - \ell_{t_0'})\big|_{t=t_0} = -2\, \big(X^{(k)}\big)^\top \big(Y^{(k)} - X^{(k)} t_0\big),$$
where $Y^{(k)} = (Y_i)_{i\in B_k}$ and $X^{(k)}$ is the $|B_k| \times d$ matrix with rows given by $X_i^\top$ for $i \in B_k$. The
integer $k \in [K]$ is the index of the median of the $K$ real numbers $P_{B_1}(\ell_{t_0} - \ell_{t_0'}), \ldots, P_{B_K}(\ell_{t_0} - \ell_{t_0'})$,
which is straightforward to compute. The gradient $-2(X^{(k)})^\top(Y^{(k)} - X^{(k)} t_0)$ in (11) depends
on $t_0'$ only through the index $k$.
REMARK 3 (Block gradient descent). Algorithms developed for the minmax estimator
using steepest descent steps such as (11) are special instances of Block Gradient Descent
(BGD). The major difference with standard BGD (which takes all blocks sequentially) is
that the index of the block is chosen here such that $P_{B_k}(\ell_{t_0} - \ell_{t_0'}) \in \mathrm{MOM}_K(\ell_{t_0} - \ell_{t_0'})$. In particular,
we expect blocks corrupted by outliers to be avoided, which is not the case in classical
BGD. Moreover, by choosing the “descent/ascent” block $k$ via its centrality, we also expect
$P_{B_k}(\ell_{t_0} - \ell_{t_0'})$ to be close to the objective function $P(\ell_{t_0} - \ell_{t_0'})$. This should make every
descent (resp., ascent) step particularly efficient.

REMARK 4 (Map-reduce). The algorithms presented in this section fit particularly well the
map-reduce paradigm [19], where data are spread out in a cluster of servers and are there-
fore naturally split into blocks. Our procedures use a mean as the mapper and a median as the
reducer. This makes our algorithms easily scalable in the big data framework, even when
some servers have crashed (producing blocks of outlier data). The median identifies the
correct block of data onto which one should make a descent or an ascent and leaves aside
the servers which have crashed.

REMARK 5 (Normalization). In the i.i.d. setup, the design matrix $X$ (i.e., the $N \times d$
matrix with row vectors $X_1, \ldots, X_N$) is normalized to make the $\ell_2^N$-norms of the columns equal
to one. In a corrupted setup, one row of $X$ may be corrupted and normalizing each column of
$X$ would corrupt the entire matrix $X$. We therefore do not normalize the design matrix in the
following.
input : $(t_0, t_0') \in \mathbb{R}^d \times \mathbb{R}^d$: initial point, $\epsilon > 0$: a stopping parameter, $(\eta_p)_p, (\beta_p)_p$: two
step size sequences
output: approximated solution to the min–max problem (10)
1 while $\|t_{p+1} - t_p\|_2 \ge \epsilon$ or $\|t'_{p+1} - t'_p\|_2 \ge \epsilon$ do
2     find $k \in [K]$ such that $P_{B_k}(\ell_{t_p} - \ell_{t'_p}) = \mathrm{MOM}_K(\ell_{t_p} - \ell_{t'_p})$
3     $t_{p+1} = t_p + 2\eta_p X_k^\top(Y_k - X_k t_p) - \lambda\eta_p\, \mathrm{sign}(t_p)$
4     find $k \in [K]$ such that $P_{B_k}(\ell_{t_{p+1}} - \ell_{t'_p}) = \mathrm{MOM}_K(\ell_{t_{p+1}} - \ell_{t'_p})$
5     $t'_{p+1} = t'_p + 2\beta_p X_k^\top(Y_k - X_k t'_p) - \lambda\beta_p\, \mathrm{sign}(t'_p)$
6 end
7 Return $(t_p, t'_p)$
Algorithm 1: A “minmax MOM version” of the subgradient descent
The LASSO objective $\psi(t) = \|Y - Xt\|_2^2 + \lambda\|t\|_1$ can be minimized
by a subgradient descent procedure: given $t_0 \in \mathbb{R}^d$ and step sizes $(\gamma_p)_p$ (i.e., $\gamma_p > 0$ and $(\gamma_p)_p$
decreases), at step $p$ we update
$$\text{(12)}\qquad t_{p+1} = t_p - \gamma_p\, \partial\psi(t_p),$$
where $\partial\psi(t_p)$ is a subgradient of $\psi$ at $t_p$, for instance $\partial\psi(t_p) = -2X^\top(Y - Xt_p) + \lambda\,\mathrm{sign}(t_p)$, where
$\mathrm{sign}(t_p)$ is the vector of signs of the coordinates of $t_p$ with the convention $\mathrm{sign}(0) = 0$.
The subgradient descent algorithm (12) can be turned into an alternating subgradient as-
cent/descent algorithm for the min–max estimator (10): let
$$\text{(13)}\qquad Y_k = (Y_i)_{i\in B_k} \quad\text{and}\quad X_k = \big(X_i^\top\big)_{i\in B_k} \in \mathbb{R}^{|B_k|\times d}.$$
The key insight in Algorithm 1 lies in steps 2 and 4, where the block index is chosen by the
median operator. Those steps are expected (1) to remove outliers from the
descent/ascent directions and (2) to improve the accuracy of these directions.
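A minimal Python sketch of Algorithm 1 is given below, under simplifying choices of our own (a fixed contiguous equipartition of the data, constant step sizes and a fixed iteration budget instead of the stopping rule):

```python
import numpy as np

def mom_lasso_subgradient(X, Y, K, lam, eta=1e-3, beta=1e-3, n_iter=500):
    """Alternating MOM subgradient descent (in t) / ascent (in t') for problem (10)."""
    N, d = X.shape
    n = N - (N % K)
    blocks = np.arange(n).reshape(K, -1)             # fixed equipartition of the indices
    t, t_prime = np.zeros(d), np.zeros(d)

    def median_block(t, t_prime):
        # block k whose empirical increment P_{B_k}(ell_t - ell_{t'}) is the median
        incr = [np.mean((Y[B] - X[B] @ t) ** 2 - (Y[B] - X[B] @ t_prime) ** 2)
                for B in blocks]
        return int(np.argsort(incr)[K // 2])

    for _ in range(n_iter):
        k = median_block(t, t_prime)                                       # step 2
        Xk, Yk = X[blocks[k]], Y[blocks[k]]
        t = t + 2 * eta * Xk.T @ (Yk - Xk @ t) - lam * eta * np.sign(t)    # step 3
        k = median_block(t, t_prime)                                       # step 4
        Xk, Yk = X[blocks[k]], Y[blocks[k]]
        t_prime = (t_prime + 2 * beta * Xk.T @ (Yk - Xk @ t_prime)
                   - lam * beta * np.sign(t_prime))                        # step 5
    return t, t_prime
```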
A classical choice of step size $\gamma_p$ in (12) is $\gamma_p = 1/L$ where $L = \|X\|_{S_\infty}^2$ ($\|X\|_{S_\infty}$ is the
operator norm of $X$). Another possible choice follows from the Armijo–Goldstein condition
with the following backtracking line search: $\gamma$ is decreased geometrically while the Armijo–
Goldstein condition is not satisfied:
$$\text{while } \psi\big(t_p + \gamma\, \partial\psi(t_p)\big) > \psi(t_p) + \delta\gamma \big\|\partial\psi(t_p)\big\|_2^2 \text{ do } \gamma \leftarrow \rho\gamma,$$
for some given $\rho \in (0, 1)$, $\delta = 10^{-4}$ and initial point $\gamma_0 = 1$.
Of course, the same choices of step size cannot be made for $(\eta_p)_p$ and $(\beta_p)_p$ in Al-
gorithm 1 because $X$ may be corrupted, but they can be adapted. In the first case, one can
take $\eta_p = 1/\|X_k\|_{S_\infty}^2$ where $k \in [K]$ is the index defined in line 2 of Algorithm 1, and
$\beta_p = 1/\|X_k\|_{S_\infty}^2$ where $k \in [K]$ is the index defined in line 4 of Algorithm 1. In the
backtracking line search case, the Armijo–Goldstein condition adapted to Algorithm 1 reads
$$\text{while } \psi_k\big(t_p + \gamma\, \partial\psi_k(t_p)\big) > \psi_k(t_p) + \delta\gamma \big\|\partial\psi_k(t_p)\big\|_2^2 \text{ do } \eta \leftarrow \rho\eta,$$
where $\psi_k(t) = \|Y_k - X_k t\|_2^2 + \lambda\|t\|_1$ and $k \in [K]$ is defined in line 2 of Algorithm 1; the same rule applies
for $\beta_p$, with $k \in [K]$ defined in line 4 of Algorithm 1.
input : $(t_0, t_0') \in \mathbb{R}^d \times \mathbb{R}^d$: initial point, $\epsilon > 0$: a stopping parameter, $(\eta_k)_k, (\beta_k)_k$: two
step size sequences
output: approximated solution to the min–max problem (10)
1 while $\|t_{p+1} - t_p\|_2 \ge \epsilon$ or $\|t'_{p+1} - t'_p\|_2 \ge \epsilon$ do
2     find $k \in [K]$ such that $P_{B_k}(\ell_{t_p} - \ell_{t'_p}) = \mathrm{MOM}_K(\ell_{t_p} - \ell_{t'_p})$
3     $t_{p+1} = \mathrm{prox}_{\lambda\|\cdot\|_1}\big(t_p + 2\eta_k X_k^\top(Y_k - X_k t_p)\big)$
4     find $k \in [K]$ such that $P_{B_k}(\ell_{t_{p+1}} - \ell_{t'_p}) = \mathrm{MOM}_K(\ell_{t_{p+1}} - \ell_{t'_p})$
5     $t'_{p+1} = \mathrm{prox}_{\lambda\|\cdot\|_1}\big(t'_p + 2\beta_k X_k^\top(Y_k - X_k t'_p)\big)$
6 end
7 Return $(t_p, t'_p)$
Algorithm 2: A “minmax MOM version” of ISTA
5.3. Proximal gradient descent algorithms. This section provides MOM versions of
ISTA (Iterative Shrinkage-Thresholding Algorithm) and its accelerated version FISTA. ISTA
and FISTA are proximal gradient descent algorithms for an objective function $\psi(t) = f(t) + g(t)$
with $f(t) = \|Y - Xt\|_2^2$ (convex and differentiable) and $g(t) = \lambda\|t\|_1$ (convex). ISTA alter-
nates between a descent in the direction of the gradient of $f$ and a projection through the
proximal operator of $g$, which, for the $\ell_1$-norm, is the soft-thresholding:
$$\text{(14)}\qquad t_{p+1} = \mathrm{prox}_{\lambda\|\cdot\|_1}\big(t_p + 2\gamma_p X^\top(Y - Xt_p)\big),$$
where $\mathrm{prox}_{\lambda\|\cdot\|_1}(t) = \big(\mathrm{sign}(t_j)\max(|t_j| - \lambda, 0)\big)_{j=1}^d$ for all $t = (t_j)_{j=1}^d \in \mathbb{R}^d$.
A natural “MOM version” of ISTA is given in Algorithm 2 by an alternating
method, where the step size sequences $(\eta_k)_k$ and $(\beta_k)_k$ may be chosen according to the
remarks below Algorithm 1 or chosen a posteriori.
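The proximal operator in (14) is a one-liner; the sketch below (our own) can be plugged into the median-block updates to obtain the steps of Algorithm 2:

```python
import numpy as np

def prox_l1(t, lam):
    """Soft-thresholding, i.e., the proximal operator of lam * ||.||_1."""
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

# one descent step of Algorithm 2 on the median block (Xk, Yk) with step size eta_k:
#     t = prox_l1(t + 2 * eta_k * Xk.T @ (Yk - Xk @ t), lam)
```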
6. Simulation study. This section provides an extensive simulation study based on the al-
gorithms of Section 5. In particular, their robustness and their convergence properties are
illustrated on simulated data. The algorithms depend on hyperparameters that need to be
tuned. Due to possible corruption, classical approaches relying on test samples cannot be
trusted. The section therefore starts by introducing a robust CV procedure based on the MOM
principle.
6.1. Adaptive choice of hyperparameters via MOM V-fold CV. MOM’s principles can be
combined with the idea of multiple splitting into training/test datasets in cross-validation.
Let $V \in [N]$ be such that $N$ can be divided by $V$. Let also $G_K \subset [N]$ and $G_\lambda \subset (0, 1]$. The
aim is to select an optimal number of blocks and an optimal regularization parameter within
both grids.
The dataset is split into $V$ disjoint blocks $D_1, \ldots, D_V$. For each $v \in [V]$, $\bigcup_{u\ne v} D_u$ is used
to train a family of estimators
$$\text{(15)}\qquad \big\{\hat f^{(v)}_{K,\lambda} : K \in G_K,\ \lambda \in G_\lambda\big\}.$$
The remaining block $D_v$ is used to test the performance of each estimator in the fam-
ily (15). Using this notation, we define a MOM version of the cross-validation procedure: for
each $(K, \lambda)$, the criterion (16) is the median, over $v \in [V]$, of the MOM estimator, built on the test blocks
$B_1^{(v)}, \ldots, B_{K'}^{(v)}$, of the risk of $\hat f^{(v)}_{K,\lambda}$, where $B_1^{(v)} \cup \cdots \cup B_{K'}^{(v)}$ is a partition of the test set $D_v$ into $K'$ blocks and $K' \in [N/V]$ is such
that $K'$ divides $N/V$.
The difference with standard V-fold CV is that empirical means in classical V-fold CV are
replaced by MOM estimators in (16). Moreover, the mean over all V splits in the classical
V -fold CV is replaced by a median.
The choice of $V$ raises the same issues for MOM CV as for classical $V$-fold CV [3, 5].
In the simulations, we use $V = 5$. The construction of MOM CV requires choosing another
parameter: $K'$, the number of blocks used to build the MOM criterion (16) over the test set. One
can choose $K' = K/V$ to make only one split of $D$ into $K$ blocks and use, for each round,
$(V - 1)K/V$ blocks to build the estimators (15) and $K/V$ blocks to test them.
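A sketch of the resulting selection rule (assuming, as described above, that the criterion (16) is the median over the $V$ folds of the MOM, over $K'$ test blocks, of the squared prediction errors; `fit_mom_lasso` stands for any of the estimators of Section 5 and is not defined here):

```python
import numpy as np

def mom_cv_select(X, Y, grid_K, grid_lam, fit_mom_lasso, V=5, seed=0):
    """MOM V-fold CV: return the (K, lambda) minimizing the MOM-CV criterion."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(Y)), V)
    best, best_crit = None, np.inf
    for K in grid_K:
        for lam in grid_lam:
            fold_scores = []
            for v in range(V):
                test = folds[v]
                train = np.concatenate([folds[u] for u in range(V) if u != v])
                t_hat = fit_mom_lasso(X[train], Y[train], K, lam)
                errs = (Y[test] - X[test] @ t_hat) ** 2
                Kp = min(max(1, K // V), len(errs))      # K' = K / V test blocks
                n = len(errs) - (len(errs) % Kp)
                block_means = errs[:n].reshape(Kp, -1).mean(axis=1)
                fold_scores.append(np.median(block_means))
            crit = np.median(fold_scores)                # median over the V splits
            if crit < best_crit:
                best, best_crit = (K, lam), crit
    return best
```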
In Figure 2, the hyperparameters $K$ (i.e., the number of blocks) and $\lambda$ (i.e., the regularization
parameter) have been chosen for MOM LASSO estimators via MOM $V$-fold CV. Only the
evolution of $\hat K$ as a function of the proportion of outliers has been depicted (the behavior of
the adaptively chosen regularization parameter is more erratic and may first require a
deeper understanding of CV in the classical i.i.d. setting before studying MOM CV in the $O \cup I$
framework). The adaptive $\hat K$ grows with the number of outliers, as expected, since the number
of blocks has to be at least twice the number of outliers. In particular, when there are no
outliers in the dataset, MOM CV selects $K = 1$, so the minmax MOM LASSO is the LASSO. The
algorithm learns that splitting the database is useless in the absence of outliers: LASSO is the
best choice among all minmax MOM LASSO estimators for $K \in [N/2]$.
6.2. Saddle-point, random blocks, outliers detection and depth. The aim of this section
is to show some advantages of choosing randomly the blocks at every (descent and ascent)
steps of the algorithm and how this modified version works on the example of ADMM. As a
byproduct, it is possible to define an outliers detection algorithm.
Let us first explain a problem of “local saddle points” in the case of fixed blocks. Minmax
MOM estimators are based on the observation that the oracle $f^*$ is a solution to the minmax
problem $f^* \in \operatorname{argmin}_{f\in F}\sup_{g\in F} P(\ell_f - \ell_g)$. Likewise, $f^*$ is a solution of the maxmin prob-
lem $f^* \in \operatorname{argmax}_{g\in F}\inf_{f\in F} P(\ell_f - \ell_g)$. One can also define the maxmin MOM estimator
$$\text{(17)}\qquad \hat g_{K,\lambda} \in \operatorname*{argmax}_{g\in F}\ \inf_{f\in F}\ T_{K,\lambda}(g, f).$$
Following the proofs of Section 6 in the Supplementary Material [36], one can prove the same
results for $\hat g_{K,\lambda}$ as for $\hat f_{K,\lambda}$ (see Section 7 in the Supplementary Material for a proof in small
dimension). However, $\hat g_{K,\lambda}$ and $\hat f_{K,\lambda}$ may differ since, in general,
$$\text{(18)}\qquad \operatorname*{argmin}_{f\in F}\ \sup_{g\in F}\ T_{K,\lambda}(g, f) \ne \operatorname*{argmax}_{g\in F}\ \inf_{f\in F}\ T_{K,\lambda}(g, f).$$
In other words, the duality gap may not be null. Since $T_{K,\lambda}(g, f) = -T_{K,\lambda}(f, g)$ for all
$f, g \in F$, equality holds in (18) if and only if
$$\inf_{f\in F}\ \sup_{g\in F}\ T_{K,\lambda}(f, g) = 0.$$
In that case, $\hat f$ is a saddle-point estimator and the minmax and maxmin estimators are equal. The
left-hand side of Figure 3 shows a simulation where this happens. The choice of fixed blocks
$B_1, \ldots, B_K$ may result in a problem of “local saddle points”: the algorithms may remain close
to suboptimal local saddle points. To see this, consider the vector case (i.e., $F = \{f(\cdot) =
\langle \cdot, t\rangle : t \in \mathbb{R}^d\}$) and introduce, for all $k \in [K]$,
$$\text{(19)}\qquad C_k = \big\{(t, t') \in \mathbb{R}^d \times \mathbb{R}^d : \mathrm{MOM}_K(\ell_t - \ell_{t'}) = P_{B_k}(\ell_t - \ell_{t'})\big\}.$$
The problem is that, if a cell $C_k$ contains a saddle point of $(t, t') \mapsto P_{B_k}(\ell_t - \ell_{t'}) +
\lambda(\|t\|_1 - \|t'\|_1)$, the algorithm gets stuck in that cell instead of looking for a “better” saddle
point in other cells.
To overcome this issue, the partition is chosen at random at every descent and ascent step
of the algorithm, so the decomposition into cells $C_1, \ldots, C_K$ changes at every step. As an
example, we develop the ADMM procedure with a random choice of blocks in Algorithm 4.
In Figure 3, MOM LASSO via ADMM is run with both fixed and changing blocks. Both
the objective function and the estimation error of MOM LASSO jump with fixed blocks.
These jumps correspond to a change of cell. The algorithm converges to local saddle
points before jumping to other cells, thanks to the regularization by the $\ell_1$-norm. On the other
hand, the algorithm with changing blocks does not suffer this drawback. Figure 3 shows that
the estimation error converges faster and more smoothly for changing blocks.

FIG. 4. Outlier detection algorithm. The dataset has been corrupted by 4 outliers at numbers 1, 32, 170 and 194. The score of the outliers is 0: they have not been selected even once.

The objective function of MOM ADMM with changing blocks converges to zero, so the duality gap con-
verges to zero. This gives a natural stopping criterion and shows that the minmax and maxmin
MOM LASSO are solutions of a saddle point problem even though the objective function is
not convex-concave.
A byproduct is an outlier detection procedure. Count the number of times each datum is
selected in the median blocks of steps 3 and 7 of Algorithm 4. At the end of the
algorithm (for instance, Algorithm 4), every datum ends up with a score revealing its central-
ity for the learning task. Aggressive outliers are likely to corrupt their respective blocks and
should therefore not be selected at steps 3 and 7 of Algorithm 4. With fixed blocks, infor-
mative data cannot be distinguished from outliers lying in the same block; therefore, this
outlier detection algorithm only makes sense when blocks are changing at every step. Fig-
ure 4 shows the performance of this strategy on synthetic data (cf. Section 6.3 for more details
on the simulations). Outliers (data 1, 32, 170 and 194) end up with a null score.
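A sketch of the score computation (our own simplification: the parameters are held fixed here, whereas in Algorithm 4 the counts are accumulated along the descent/ascent iterations with the current iterates):

```python
import numpy as np

def outlier_scores(X, Y, t, t_prime, K, n_steps=5000, seed=0):
    """Count how often each datum falls in the selected median block when the
    blocks are redrawn uniformly at random at every step."""
    rng = np.random.default_rng(seed)
    N = len(Y)
    counts = np.zeros(N, dtype=int)
    n = N - (N % K)
    for _ in range(n_steps):
        blocks = rng.permutation(N)[:n].reshape(K, -1)   # fresh random equipartition
        incr = [np.mean((Y[B] - X[B] @ t) ** 2 - (Y[B] - X[B] @ t_prime) ** 2)
                for B in blocks]
        counts[blocks[int(np.argsort(incr)[K // 2])]] += 1
    return counts   # aggressive outliers typically end up with a score near 0
```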
6.3. Simulations setup for the figures. All codes are available at [55] and can be used to
reproduce the figures. Many other simulations and algorithms can be found in [55].
6.3.1. Data generating process and corruption by outliers. The algorithms introduced in
Section 5 are tested on datasets corrupted by outliers of various forms in [55]. The basic set
of informative data is called D1 . The outliers are named D2 , D3 , D4 and D5 . These data are
merged and shuffled in the dataset D = D1 ∪ D2 ∪ D3 ∪ D4 ∪ D5 given to the algorithm.
1. The set $D_1$ of inliers contains $N_{\mathrm{good}}$ i.i.d. data $(X_i, Y_i)$ with common distribution given by
$$\text{(20)}\qquad Y = \langle X, t^*\rangle + \xi,$$
where $t^* \in \mathbb{R}^d$, $X \sim \mathcal{N}(0, I_{d\times d})$ and $\xi \sim \mathcal{N}(0, \sigma^2)$ is independent of $X$.
2. $D_2$ is a dataset of $N_{\mathrm{bad}-1}$ outliers $(X_i, Y_i)$ such that $Y_i = 1$ and $X_i = (1)_{j=1}^d$.
3. $D_3$ is a dataset of $N_{\mathrm{bad}-2}$ outliers $(X_i, Y_i)$ such that $Y_i = 10{,}000$ and $X_i = (1)_{j=1}^d$.
4. $D_4$ is a dataset of $N_{\mathrm{bad}-3}$ outliers $(X_i, Y_i)$ where $Y_i$ is a 0–1 Bernoulli random variable
and $X_i$ is uniformly distributed over $[0, 1]^d$.
5. $D_5$ is also a set of outliers, generated according to the linear model (20),
with the same target vector $t^*$ but a different choice of design $X$ and noise $\xi$: the design is
$X \sim \mathcal{N}(0, \Sigma)$ with $\Sigma = (\rho^{|i-j|})_{1\le i,j\le d}$ and $\xi$ is a heavy-tailed noise distributed according to
a Student distribution with various degrees of freedom.
The different types of outliers $D_j$, $j = 2, 3, 4, 5$, are useless to learn the oracle $t^*$; some are
not even independent nor random, as in $D_2$ and $D_3$.
6.3.2. Simulations setup for the figures. Let us now specify the parameters of the sim-
ulations in Figure 1 and Figure 2: the number of observations is $N = 200$, the number of
features is $d = 500$, $t^* \in \mathbb{R}^d$ has sparsity $s = 10$ and support chosen at random, with nonzero
coordinates $t_j^*$ either equal to 10, $-10$ or decreasing according to $\exp(-j/10)$. Infor-
mative data $D_1$, described in Section 6.3.1, have noise level $\sigma = 1$. This dataset is increasingly
corrupted with outliers from $D_3$.
The proportions of outliers are $0, 1/100, 2/100, \ldots, 15/100$. The ADMM algorithm is run
with adaptive $\lambda$ chosen by $V$-fold CV with $V = 5$ for the LASSO. Then MOM ADMM is
run with adaptive $K$ and $\lambda$ chosen by MOM CV with $V = 5$ and $K' = \max(\mathrm{grid}_K)/V$, where
$\mathrm{grid}_K = \{1, 4, \ldots, 115/4\}$ and $\mathrm{grid}_\lambda = \{0, 10, 20, \ldots, 100\}/\sqrt{100}$ are the search grids used
to select the best $K$ and $\lambda$ during the CV and MOM CV steps. The number of iterations of
ADMM and MOM ADMM is 200. Simulations have been run 70 times and the averaged
values of the estimation error and adaptive K̂ have been reported in Figure 1, Figure 5 and
Figure 2. The $\ell_2$ estimation error of the LASSO increases roughly from 0 when there is no out-
lier and stabilizes at 550 right after a single outlier enters the dataset. The value 550 comes
from the fact that $Y = 10{,}000$ and $X = (1)_{j=1}^{500}$ are such that the vector $t$ with minimal $\ell_1$-norm
among all the solutions of $Y = \langle X, t\rangle$ is $t^{**} = (20)_{j=1}^{500}$, and $\|t^{**} - t^*\|_2$ is approximately
550. This means that the LASSO is trying to fit a model on the single outlier instead of solv-
ing the linear problem associated with the 200 other informative data. A single outlier
therefore completely misleads the LASSO.
For Figure 3, we have run similar experiments with $N = 200$, $d = 300$, $s = 20$, $\sigma = 1$,
$K = 10$; the number of iterations was 500 and the regularization parameter was $1/\sqrt N$.
For Figure 4, we took $N = 200$, $d = 500$, $s = 20$, $\sigma = 1$; the number of outliers is $|O| = 4$
and the outliers are of the form $Y = 10{,}000$ and $X = (1)_{j=1}^d$, $K = 10$; the number of iterations
is 5,000 and $\lambda = 1/\sqrt{200}$.
FIG. 5. Estimation error versus proportion of outliers for LASSO and the minmax MOM LASSO.
SUPPLEMENTARY MATERIAL
Supplement to “Robust machine learning by median-of-means: Theory and practice”
(DOI: 10.1214/19-AOS1828SUPP; .pdf). Section 6 gives the proofs of the main results. These main results focus on the regu-
larized version of the MOM estimates of the increments presented in this Introduction that
are well suited for high-dimensional learning frameworks. We complete these results in Sec-
tion 7, providing results for the basic estimators without regularization in small dimension.
Finally, Section 8 provides minimax optimality results for our procedures.
REFERENCES
[1] A LON , N., M ATIAS , Y. and S ZEGEDY, M. (1999). The space complexity of approximating the frequency
moments. J. Comput. System Sci. 58 137–147. MR1688610 https://doi.org/10.1006/jcss.1997.1545
[2] A LQUIER , P., C OTTET, V. and L ECUÉ , G. (2019). Estimation bounds and sharp oracle inequalities
of regularized procedures with Lipschitz loss functions. Ann. Statist. 47 2117–2144. MR3953446
https://doi.org/10.1214/18-AOS1742
[3] A RLOT, S. and C ELISSE , A. (2010). A survey of cross-validation procedures for model selection. Stat. Surv.
4 40–79. MR2602303 https://doi.org/10.1214/09-SS054
[4] A RLOT, S. and C ELISSE , A. (2011). Segmentation of the mean of heteroscedastic data via cross-validation.
Stat. Comput. 21 613–632. MR2826696 https://doi.org/10.1007/s11222-010-9196-x
[5] A RLOT, S. and L ERASLE , M. (2016). Choice of V for V -fold cross-validation in least-squares density
estimation. J. Mach. Learn. Res. 17 Paper No. 208, 50. MR3595142
[6] AUDIBERT, J.-Y. and C ATONI , O. (2011). Robust linear least squares regression. Ann. Statist. 39 2766–
2794. MR2906886 https://doi.org/10.1214/11-AOS918
[7] BACH , F. R. (2010). Structured sparsity-inducing norms through submodular functions. In Advances in
Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing
Systems 2010. Proceedings of a Meeting Held 6–9 December 2010 118–126, Vancouver, BC.
ROBUST MACHINE LEARNING BY MEDIAN-OF-MEANS 929
[8] BARAUD , Y. (2011). Estimator selection with respect to Hellinger-type risks. Probab. Theory Related Fields
151 353–401. MR2834722 https://doi.org/10.1007/s00440-010-0302-y
[9] BARAUD , Y. and B IRGÉ , L. (2016). Rho-estimators for shape restricted density estimation. Stochastic Pro-
cess. Appl. 126 3888–3912. MR3565484 https://doi.org/10.1016/j.spa.2016.04.013
[10] BARAUD , Y., B IRGÉ , L. and S ART, M. (2017). A new method for estimation and model selection: ρ-
estimation. Invent. Math. 207 425–517. MR3595933 https://doi.org/10.1007/s00222-016-0673-5
[11] B ELLEC , P. C., L ECUÉ , G. and T SYBAKOV, A. B. (2016). Slope meets lasso: Improved oracle bounds and
optimality. Technical report, CREST, CNRS, Université Paris Saclay.
[12] B ICKEL , P. J., R ITOV, Y. and T SYBAKOV, A. B. (2009). Simultaneous analysis of lasso and Dantzig selec-
tor. Ann. Statist. 37 1705–1732. MR2533469 https://doi.org/10.1214/08-AOS620
[13] B IRGÉ , L. (1984). Stabilité et instabilité du risque minimax pour des variables indépendantes équidis-
tribuées. Ann. Inst. Henri Poincaré Probab. Stat. 20 201–223. MR0762855
[14] B IRGÉ , L. (2006). Model selection via testing: An alternative to (penalized) maximum likelihood estimators.
Ann. Inst. Henri Poincaré Probab. Stat. 42 273–325. MR2219712 https://doi.org/10.1016/j.anihpb.
2005.04.004
[15] B LANCHARD , G., B OUSQUET, O. and M ASSART, P. (2008). Statistical performance of support vector
machines. Ann. Statist. 36 489–531. MR2396805 https://doi.org/10.1214/009053607000000839
[16] B OGDAN , M., VAN DEN B ERG , E., S ABATTI , C., S U , W. and C ANDÈS , E. J. (2015). SLOPE—
adaptive variable selection via convex optimization. Ann. Appl. Stat. 9 1103–1140. MR3418717
https://doi.org/10.1214/15-AOAS842
[17] B ÜHLMANN , P. and VAN DE G EER , S. (2011). Statistics for High-Dimensional Data: Methods, Theory and
Applications. Springer Series in Statistics. Springer, Heidelberg. MR2807761 https://doi.org/10.1007/
978-3-642-20192-9
[18] C ATONI , O. (2012). Challenging the empirical mean and empirical variance: A deviation study. Ann. Inst.
Henri Poincaré Probab. Stat. 48 1148–1185. MR3052407 https://doi.org/10.1214/11-AIHP454
[19] D EAN , J. and G HEMAWAT, S. (2010). Mapreduce: A flexible data processing tool. Commun. ACM 53 72–
77.
[20] D EVROYE , L., L ERASLE , M., L UGOSI , G. and O LIVEIRA , R. I. (2016). Sub-Gaussian mean estimators.
Ann. Statist. 44 2695–2725. MR3576558 https://doi.org/10.1214/16-AOS1440
[21] E LSENER , A. and VAN DE G EER , S. (2018). Robust low-rank matrix estimation. Ann. Statist. 46 3481–
3509. MR3852659 https://doi.org/10.1214/17-AOS1666
[22] FAN , J., L I , Q. and WANG , Y. (2017). Estimation of high dimensional mean regression in the absence of
symmetry and light tail assumptions. J. R. Stat. Soc. Ser. B. Stat. Methodol. 79 247–265. MR3597972
https://doi.org/10.1111/rssb.12166
[23] G IRAUD , C. (2015). Introduction to High-Dimensional Statistics. Monographs on Statistics and Applied
Probability 139. CRC Press, Boca Raton, FL. MR3307991
[24] H AMPEL , F. R. (1971). A general qualitative definition of robustness. Ann. Math. Stat. 42 1887–1896.
MR0301858 https://doi.org/10.1214/aoms/1177693054
[25] H AMPEL , F. R. (1974). The influence curve and its role in robust estimation. J. Amer. Statist. Assoc. 69
383–393. MR0362657
[26] H AN , Q. and W ELLNER , J. A. (2017). A sharp multiplier inequality with applications to heavy-tailed
regression problems. Available at arXiv:1706.02410.
[27] H UBER , P. J. (1964). Robust estimation of a location parameter. Ann. Math. Stat. 35 73–101. MR0161415
https://doi.org/10.1214/aoms/1177703732
[28] H UBER , P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proc.
Fifth Berkeley Sympos. Math. Statist. and Probability (Berkeley, Calif., 1965/66), Vol. I: Statistics 221–
233. Univ. California Press, Berkeley, CA. MR0216620
[29] H UBER , P. J. and RONCHETTI , E. M. (2009). Robust Statistics, 2nd ed. Wiley Series in Probability and
Statistics. Wiley, Hoboken, NJ. MR2488795 https://doi.org/10.1002/9780470434697
[30] J ERRUM , M. R., VALIANT, L. G. and VAZIRANI , V. V. (1986). Random generation of combinatorial struc-
tures from a uniform distribution. Theoret. Comput. Sci. 43 169–188. MR0855970 https://doi.org/10.
1016/0304-3975(86)90174-X
[31] KOLTCHINSKII , V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery
Problems. Lecture Notes in Math. 2033. Springer, Heidelberg. MR2829871 https://doi.org/10.1007/
978-3-642-22147-7
[32] KOLTCHINSKII , V. and M ENDELSON , S. (2015). Bounding the smallest singular value of a random matrix
without concentration. Int. Math. Res. Not. IMRN 23 12991–13008. MR3431642 https://doi.org/10.
1093/imrn/rnv096
[33] L E C AM , L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer Series in Statistics.
Springer, New York. MR0856411 https://doi.org/10.1007/978-1-4612-4946-7
930 G. LECUÉ AND M. LERASLE
[34] L E C AM , L. (1973). Convergence of estimates under dimensionality restrictions. Ann. Statist. 1 38–53.
MR0334381
[35] L ECUÉ , G. and L ERASLE , M. (2017). Learning from mom’s principle: Le Cam’s approach. Technical re-
port, CNRS, ENSAE, Paris-sud, 2017. To appear in Stochastic Processes and their applications.
[36] L ECUÉ , G. and L ERASLE , M. (2020). Supplement to “Robust machine learning by median-of-means: The-
ory and practice.” https://doi.org/10.1214/19-AOS1828SUPP.
[37] L ECUÉ , G. and M ENDELSON , S. (2013). Learning subgaussian classes: Upper and minimax bounds Tech-
nical report, CNRS, Ecole polytechnique and Technion.
[38] L ECUÉ , G. and M ENDELSON , S. (2017). Regularization and the small-ball method II: Complexity depen-
dent error rates. J. Mach. Learn. Res. 18 Paper No. 146, 48. MR3763780
[39] L ECUÉ , G. and M ENDELSON , S. (2018). Regularization and the small-ball method I: Sparse recovery. Ann.
Statist. 46 611–641. MR3782379 https://doi.org/10.1214/17-AOS1562
[40] L ERASLE , M. and O LIVEIRA , R. I. (2011). Robust empirical mean estimators. Available at
arXiv:1112.3914.
[41] L OUNICI , K. (2008). Sup-norm convergence rate and sign concentration property of Lasso and Dantzig
estimators. Electron. J. Stat. 2 90–102. MR2386087 https://doi.org/10.1214/08-EJS177
[42] L UGOSI , G. and M ENDELSON , S. (2019). Regularization, sparse recovery, and median-of-means tourna-
ments. Bernoulli 25 2075–2106. MR3961241 https://doi.org/10.3150/18-BEJ1046
[43] L UGOSI , G. and M ENDELSON , S. (2019). Risk minimization by median-of-means tournaments. J. Eur.
Math. Soc. (JEMS). To appear.
[44] M ASSART, P. and N ÉDÉLEC , É. (2006). Risk bounds for statistical learning. Ann. Statist. 34 2326–2366.
MR2291502 https://doi.org/10.1214/009053606000000786
[45] M ASSIAS , M., F ERCOQ , O., G RAMFORT, A. and S ALMON , J. (2017). Generalized concomitant multi-task
lasso for sparse multimodal regression.
[46] M EINSHAUSEN , N. and B ÜHLMANN , P. (2006). High-dimensional graphs and variable selection with the
lasso. Ann. Statist. 34 1436–1462. MR2278363 https://doi.org/10.1214/009053606000000281
[47] M EINSHAUSEN , N. and Y U , B. (2009). Lasso-type recovery of sparse representations for high-dimensional
data. Ann. Statist. 37 246–270. MR2488351 https://doi.org/10.1214/07-AOS582
[48] M ENDELSON , S. (2014). Learning without concentration. In Proceedings of the 27th annual conference on
Learning Theory COLT14, 25–39.
[49] M ENDELSON , S. (2016). On multiplier processes under weak moment assumptions Technical report, Tech-
nion.
[50] M INSKER , S. (2015). Geometric median and robust estimation in Banach spaces. Bernoulli 21 2308–2335.
MR3378468 https://doi.org/10.3150/14-BEJ645
[51] M INSKER , S. and S TRAWN , N. (2019). Distributed statistical estimation and rates of convergence in normal
approximation. Electron. J. Statist. 13 5213–5252. MR4043072
[52] N EGAHBAN , S. N., R AVIKUMAR , P., WAINWRIGHT, M. J. and Y U , B. (2012). A unified framework for
high-dimensional analysis of M-estimators with decomposable regularizers. Statist. Sci. 27 538–557.
MR3025133 https://doi.org/10.1214/12-STS400
[53] N EMIROVSKY, A. S. and Y UDIN , D. B. (1983). Problem Complexity and Method Efficiency in Optimiza-
tion. A Wiley-Interscience Publication. Wiley, New York. MR0702836
[54] N ICKL , R. and VAN DE G EER , S. (2013). Confidence sets in sparse regression. Ann. Statist. 41 2852–2876.
MR3161450 https://doi.org/10.1214/13-AOS1170
[55] N OTEBOOK. Available at https://github.com/lecueguillaume/MOMpower.
[56] S AUMARD , A. (2018). On optimality of empirical risk minimization in linear aggregation. Bernoulli 24
2176–2203. MR3757527 https://doi.org/10.3150/17-BEJ925
[57] S U , W. and C ANDÈS , E. (2016). SLOPE is adaptive to unknown sparsity and asymptotically minimax. Ann.
Statist. 44 1038–1068. MR3485953 https://doi.org/10.1214/15-AOS1397
[58] T IBSHIRANI , R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58
267–288. MR1379242
[59] T UKEY, J. W. (1960). A survey of sampling from contaminated distributions. In Contributions to Probability
and Statistics 448–485. Stanford Univ. Press, Stanford, CA. MR0120720
[60] T UKEY, J. W. (1962). The future of data analysis. Ann. Math. Stat. 33 1–67. MR0133937 https://doi.org/10.
1214/aoms/1177704711
[61] VAN DE G EER , S. (2014). Weakly decomposable regularization penalties and structured sparsity. Scand. J.
Stat. 41 72–86. MR3181133 https://doi.org/10.1111/sjos.12032
[62] VAN DE G EER , S., B ÜHLMANN , P., R ITOV, Y. and D EZEURE , R. (2014). On asymptotically optimal
confidence regions and tests for high-dimensional models. Ann. Statist. 42 1166–1202. MR3224285
https://doi.org/10.1214/14-AOS1221
ROBUST MACHINE LEARNING BY MEDIAN-OF-MEANS 931
[63] VAN DE G EER , S. A. (2007). The deterministic lasso. Technical report, ETH Zürich. Available at http:
//www.stat.math.ethz.ch/~geer/lasso.pdf.
[64] VAN DE G EER , S. A. (2008). High-dimensional generalized linear models and the lasso. Ann. Statist. 36
614–645. MR2396809 https://doi.org/10.1214/009053607000000929
[65] VAPNIK , V. N. (1998). Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing,
Communications, and Control. Wiley, New York. MR1641250
[66] WANG , T. E., G U , Y., M EHTA , D., Z HAO , X. and B ERNAL , E. A. (2018). Towards robust deep neural
networks. Preprint. Available at arXiv:1810.11726.
[67] Z HANG , C.-H. and Z HANG , S. S. (2014). Confidence intervals for low dimensional parameters in
high dimensional linear models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 76 217–242. MR3153940
https://doi.org/10.1111/rssb.12026
[68] Z HAO , P. and Y U , B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res. 7 2541–2563.
MR2274449
[69] Z HOU , W.-X., S UN , Q. and FAN , J. (2017). Adaptive huber regression: Optimality and phase transition.
Preprint. Available at arXiv:1706.06991.