To estimate f ∗,
we have a dataset (Xi , Yi )i∈{1,...,N} for which there exists a partition
{1, . . . , N} = O ∪ I such that data (Xi , Yi )i∈I are inliers or informative and data (Xi , Yi )i∈O
are “outliers” in the sense that nothing is assumed on these data. On inliers, one grants in-
dependence and finiteness of some moments, allowing for “heavy-tailed” data. Moreover,
departing from the independent and identically distributed (i.i.d.) setup, we also allow inliers
to have different distributions than (X, Y ). We assume that, for all i ∈ I and all f ∈ F ,
$$\mathbb{E}\big[\big(Y_i - f^*(X_i)\big)\big(f - f^*\big)(X_i)\big] = \mathbb{E}\big[\big(Y - f^*(X)\big)\big(f - f^*\big)(X)\big],$$
$$\mathbb{E}\big[\big(f - f^*\big)^2(X_i)\big] = \mathbb{E}\big[\big(f - f^*\big)^2(X)\big].$$
These assumptions imply that the distribution P of (X, Y ) and the distribution Pi of
(Xi , Yi ) for i ∈ I induce the same L2 -geometry on F − f ∗ = {f − f ∗ : f ∈ F } and, there-
fore, in particular, that the oracles w.r.t. P and Pi for any i ∈ I are the same. Of course, the
sets O and I are unknown to the statistician.
Regression problems with possibly heavy-tailed inliers cannot be handled by classical
least-squares estimators, which are particular instances of empirical risk minimizers (ERM) of Vapnik [65].

FIG. 1. Estimation error of the LASSO (blue curve) and MOM LASSO (red curve) after one outlier was added at observation 100.

Least-squares estimators have sub-Gaussian deviations under stronger as-
sumptions, such as boundedness [44] or sub-Gaussian [37] assumptions on the noise and the
design. In this paper, the main hypothesis is the small ball assumption of [32, 48] which
says that L2 (P ) and L1 (P ) norms are equivalent over F − f ∗ ; see Section 3.1 for details.
Although sometimes restrictive [26, 56], this assumption does not involve high moment con-
ditions unnecessary for the problem to make sense.
Least-squares estimators and their regularized versions are also useless in corrupted en-
vironments. This has been known for a long time and can easily be checked in practice.
Figure 1, for example, shows the estimation error of the LASSO [58] on a dataset containing
a single outlier in the outputs.
These restrictions of least-squares estimators gave rise in the 1960s to the theory of robust
statistics of John Tukey [59, 60], Peter Huber [27, 28] and Frank Hampel [24, 25]. The most
classical alternatives to least-squares estimators are M-estimators, which are ERM based on
loss functions $\ell_f(X, Y)$ less sensitive to outliers than the square loss, such as a truncated
version of the square loss. The idea is that, while $(Y_i - f(X_i))^2$ can be very large for some
outliers and influence the whole empirical mean $N^{-1}\sum_{i=1}^N (Y_i - f(X_i))^2$, the influence of
these anomalies will be asymptotically null if $\ell_f(X_i, Y_i)$ is bounded. Recent works study
deviation properties of M-estimators: [2, 21, 22, 69] considered the Huber-loss in linear re-
gression with heavy-tailed noise and sub-Gaussian design. They obtain minimax optimal de-
viation bounds in this setting. The limitation on the design is not surprising: it is well known
that M-estimators using loss functions such as Huber or L1 loss are not robust to outliers in
the inputs $X_i$. This problem is called the “leverage points problem” [29]. In a slightly different
approach than M-estimation, [6] proposed a minmax estimator based on losses introduced in
[18] in a least-squares regression framework and proved optimal sub-Gaussian bounds under
an $L_2$ assumption on the noise and an $L_4/L_2$ assumption on the design, which is close to the
assumptions we grant on inliers.
This paper focuses on median-of-means (MOM) estimators [1, 30, 53], which provide an alternative to
M-estimators. The MOM estimator of a real valued expectation $\mathbb{E}[Z]$ is built as follows: the
dataset $Z_1, \ldots, Z_N$ is partitioned into blocks $(Z_i)_{i\in B_k}$, $k = 1, \ldots, K$, of the same cardinality.
The MOM estimator is the median of the $K$ empirical means constructed on each block:
$$\mathrm{MOM}_K(Z) = \operatorname{median}\Big(\frac{1}{|B_k|}\sum_{i\in B_k} Z_i,\ k = 1, \ldots, K\Big).$$
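To fix ideas, here is a minimal Python sketch of this estimator (the function name, the random shuffling and the truncation making $K$ divide the sample size are our own implementation choices, not taken from the paper):

```python
import numpy as np

def mom_mean(z, K, seed=0):
    """Median-of-means estimator of E[Z] built on K blocks of equal size."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z, dtype=float)
    n = len(z) - (len(z) % K)              # keep a sample size divisible by K
    idx = rng.permutation(len(z))[:n]      # random equipartition into K blocks
    block_means = z[idx].reshape(K, -1).mean(axis=1)
    return np.median(block_means)

# toy check: heavy-tailed sample plus one corrupted point
z = np.concatenate([np.random.standard_t(df=2.5, size=1000), [1e6]])
print(mom_mean(z, K=11), np.mean(z))   # MOM stays near 0, the empirical mean does not
```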
As in [35, 43], MOM estimators are used to estimate the real valued increments of square risks
$P[(Y - f(X))^2 - (Y - g(X))^2]$, where $f, g \in F$. This construction does not require a notion
of median in dimension larger than 1, contrary to the “geometric median-of-means” approach
presented in [50, 51]. In [35, 43], each $f \in F$ receives a score which is the $L_2(P)$-diameter
of the set $B(f)$, where $g \in B(f)$ if $\mathrm{MOM}_K(\ell_f - \ell_g) < 0$. The approach of [35, 43]
therefore requires an evaluation of the diameter of the sets $B(f)$ for all $f \in F$, which makes
the procedure impossible to implement.
This paper presents an alternative to [35, 43] which relies on the following minmax for-
mulation. By linearity of $P$, $f^*$ is a solution of
$$f^* \in \operatorname*{argmin}_{f\in F}\ \sup_{g\in F}\ P\big[\big(Y - f(X)\big)^2 - \big(Y - g(X)\big)^2\big].$$
Replacing the real valued means P [(Y − f (X))2 − (Y − g(X))2 ] in this equation by their
MOM estimators produces the minmax MOM estimators of f ∗ which are rigorously intro-
duced in Section 2.3. Compared with [35, 43], minmax MOM estimators do not require an
estimation of L2 -distances between elements in F and are therefore simpler to define. Min-
max strategies have also been considered in [6] and [9, 10]. The idea of building estimators
of f ∗ from estimators of increments goes back to seminal works by Le Cam [33, 34] and
was further developed by Birgé with the T -estimators [14]. In Le Cam and Birgé’s works,
the authors used “robust tests” to compare densities f and g and deduce from these an alter-
native to the nonrobust maximum likelihood estimators. Baraud [8] showed that robust tests
could be obtained by estimating the difference of Hellinger risks of f and g and used a vari-
ational formula to build these new tests. Finally, Baraud, Birgé and Sart [10] used Baraud’s
estimators of increments in a minmax procedure to build ρ-estimators.
The first aim of this paper is to show that minmax MOM estimators satisfy the same sub-
Gaussian deviation bounds as other MOM estimators [35, 42]. The analysis of minmax MOM
estimators is conceptually and technically simpler: an adaptation of Lemmas 5.1 and 5.5 in
[43] or Lemmas 2 and 3 in [35] is sufficient to prove sub-Gaussian bounds for minmax MOM
estimators, while a robust estimation (based on MOM estimates) of the $L_2(P)$-metric was
required in [35, 42].
Another advantage of the minmax MOM approach lies in the Lepski step (see Theorem 2),
which selects the number $K$ of blocks adaptively. This step is much easier to implement and
to study than the one presented in [35], as a single confidence region is sufficient to grant
adaptation with respect to the excess risk, the regularization norm and the $L_2$ norm. Recall that, in cor-
rupted environments, a data-driven choice of K has to be performed since K must be larger
than twice the (unknown) number of outliers. Note that the idea of aggregating estimators
built on blocks of data and selecting the number of blocks by Lepski’s method was already
present in Birgé [13], proof of Theorem 1. It was also used in [20] to build “multiple-δ”
sub-Gaussian estimators of univariate means.
In our opinion, the most interesting feature of the minmax formulation is that it suggests
a generic method to modify descent algorithms designed to approximate ERM and their reg-
ularized versions and make them efficient even if run on corrupted datasets. Let us give a
rough presentation of a “MOM version” of descent algorithms: at each time step $t$, all em-
pirical means $P_{B_k}(Y - f_t(X))^2$ for $k = 1, \ldots, K$ are evaluated and one computes the index
$k_{\mathrm{med}} \in [K]$ of the block such that
$$P_{B_{k_{\mathrm{med}}}}\big(Y - f_t(X)\big)^2 = \operatorname{med}\big(P_{B_k}\big(Y - f_t(X)\big)^2,\ k = 1, \ldots, K\big).$$
The descent direction is the opposite gradient $-\nabla\big(f \mapsto P_{B_{k_{\mathrm{med}}}}(Y - f(X))^2\big)\big|_{f = f_t}$. This de-
scent algorithm can be turned into a descent-ascent algorithm approximating minmax MOM
estimators. Section 5 presents several examples of modifications of classical algorithms.
In practice, these basic algorithms perform poorly when applied on a fixed partition of
the dataset. However, empirical performance is improved when the partition is chosen uni-
formly at random at each descent step of the algorithm; cf. Section 6.2. In particular, the
shuffling step prevents the algorithms from converging to suboptimal local saddle points. Besides, randomized
algorithms define a notion of depth of data: each time a datum belongs to the median block, its
“score” is incremented by 1. The higher the final score, the deeper the datum. This notion
of depth is based on the risk function, which is natural in a learning framework, and should
probably be investigated more carefully in future works. It also suggests an empirical defini-
tion of outliers and, therefore, an outlier detection algorithm. This by-product is presented
in Section 6.2.
The paper is organized as follows. Section 2 introduces the framework and presents the
minmax MOM estimator, Section 3 details the main theoretical results. These are illustrated
in Section 4 on some classical problems of machine learning. Many robust versions of stan-
dard optimization algorithms are presented in Section 5. An extensive simulation study il-
lustrating our results is performed in Section 6. Proofs of the main results and complementary
theorems showing minimax optimality of our bounds are postponed to the Supplementary
Material [36].
2. Setting. Let X denote a measurable space. Let (Xi , Yi )i∈{1,...,N} , (X, Y ) denote ran-
dom variables taking values in X × R. Let P denote the distribution of (X, Y ) and, for
i ∈ {1, . . . , N}, let Pi denote the distribution of (Xi , Yi ). Let F denote a convex class of func-
tions $f : X \to \mathbb{R}$ and suppose that $\mathbb{E}[Y^2] < \infty$. For any $Q \in \{P, (P_i)_{i\in[N]}\}$ and any $p \ge 1$,
let $L^p_Q$ denote the set of functions $f$ such that $\|f\|_{L^p_Q} = (Q|f|^p)^{1/p} < \infty$, where $Qg =
\mathbb{E}_{Z\sim Q}[g(Z)]$. Assume that $F \subset L^2_P$. For any $(x, y) \in X \times \mathbb{R}$, let $\ell_f(x, y) = (y - f(x))^2$
denote the square loss and let $f^*$ denote an oracle
$$\text{(1)}\qquad f^* \in \operatorname*{argmin}_{f\in F} P\ell_f \quad\text{where } \forall g \in L^1_P,\ Pg = \mathbb{E}\big[g(X, Y)\big].$$
Let $R(f) = P\ell_f$ denote the risk. The goal is to build estimators $\hat f$ satisfying: with proba-
bility at least $1 - \delta$,
$$R(\hat f) \le \min_{f\in F} R(f) + r_N^{(1)} \qquad\text{and}\qquad \|\hat f - f^*\|_{L^2_P} \le r_N^{(2)}.$$
The residue $r_N^{(1)}$ of the oracle inequality, the estimation rate $r_N^{(2)}$ and the confidence level $\delta$
should be as small as possible. Oracle inequalities provide risk bounds for the estimation of the
regression function $f(x) = \mathbb{E}[Y \mid X = x]$: $R(\hat f) \le R(f^*) + r_N^{(1)}$ is equivalent to
$$\|f - \hat f\|_{L^2_P}^2 \le \|f - f^*\|_{L^2_P}^2 + r_N^{(1)}.$$
Recall that, by linearity of $P$, the oracle $f^*$ satisfies both the min and the minmax formulations
$$\text{(2)}\qquad f^* \in \operatorname*{argmin}_{f\in F} P\ell_f = \operatorname*{argmin}_{f\in F}\ \sup_{g\in F}\ P(\ell_f - \ell_g).$$
Any estimator of real valued expectations $P\ell_f$ or $P(\ell_f - \ell_g)$ can be plugged in (2) to obtain
estimators of $f^*$. Plugging the empirical means (in both the min and the minmax problems)
yields the classical ERM over $F$, for example. In general, plugging nonlinear (robust or not)
estimators of the mean in the minmax problem or in the min problem in (2) does not yield the
same estimator of $f^*$ though. The main advantage of the minmax formulation is that it allows
to bound the risk of the estimator using the complexity of F around f ∗ . This “localization”
idea is central to derive optimal (fast) rates for the ERM [15, 31, 44] and cannot be used
directly when empirical means are simply replaced by nonlinear estimators of the mean in a
minimization formulation.
2.2. MOM estimators. Let $K$ denote an integer smaller than $N/2$ and let $B_1, \ldots, B_K$
denote a partition of $[N] = \{1, \ldots, N\}$ into blocks of equal size $N/K$ (w.l.o.g. we assume
that $K$ divides $N$). For all functions $L : X \times \mathbb{R} \to \mathbb{R}$ and $k \in [K] = \{1, \ldots, K\}$, let
$P_{B_k} L = |B_k|^{-1}\sum_{i\in B_k} L(X_i, Y_i)$.
For all $\alpha \in (0, 1)$ and real numbers $x_1, \ldots, x_K$, denote by $Q_\alpha(x_1, \ldots, x_K)$ the set of $\alpha$-
quantiles of $\{x_1, \ldots, x_K\}$:
$$\big\{ u \in \mathbb{R} : \big|\{k \in [K] : x_k \ge u\}\big| \ge (1 - \alpha)K,\ \big|\{k \in [K] : x_k \le u\}\big| \ge \alpha K \big\}$$
and let Qα (x) denote any point in Qα (x1 , . . . , xK ). For x = (x1 , . . . , xK ) ∈ RK and t ∈ R,
we say that Qα (x) ≥ t when there exists J ⊂ [K] such that |J | ≥ (1 − α)K and for all
k ∈ J, xk ≥ t; we write Qα (x) ≤ t if there exists J ⊂ [K] such that |J | ≥ αK and for all
k ∈ J, xk ≤ t.
Let y = (y1 , . . . , yK ) ∈ RK . We write Q1/2 (x − y) ≤ Q3/4 (x) − Q1/4 (y) when there exist
u, l ∈ R such that Q1/2 (x − y) ≤ u − l, Q3/4 (x) ≤ u and Q1/4 (y) ≥ l.
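In practice, these conventions reduce to counting coordinates; a small helper (our own, for illustration only) makes them concrete:

```python
import numpy as np

def q_ge(x, alpha, t):
    """Check 'Q_alpha(x) >= t': at least (1 - alpha) * K coordinates of x are >= t."""
    x = np.asarray(x, dtype=float)
    return np.sum(x >= t) >= (1 - alpha) * len(x)

def q_le(x, alpha, t):
    """Check 'Q_alpha(x) <= t': at least alpha * K coordinates of x are <= t."""
    x = np.asarray(x, dtype=float)
    return np.sum(x <= t) >= alpha * len(x)

x = np.array([-1.0, 0.5, 2.0, 3.0])
print(q_ge(x, alpha=1/2, t=0.0), q_le(x, alpha=1/4, t=-0.5))  # True True
```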
2.3. Minmax MOM estimators. Minmax MOM estimators are obtained by replacing the
unknown expectations $P(\ell_f - \ell_g)$ in (2) by their MOM estimators. Given a regularization
norm $\|\cdot\|$ and a regularization parameter $\lambda \ge 0$, let, for all $f, g \in F$,
$$T_{K,\lambda}(g, f) = \mathrm{MOM}_K(\ell_f - \ell_g) + \lambda\big(\|f\| - \|g\|\big),$$
where $\mathrm{MOM}_K(\ell_f - \ell_g)$ denotes the MOM estimator built on the data $(\ell_f - \ell_g)(X_i, Y_i)$, $i \in [N]$, and define
$$\hat f_{K,\lambda} \in \operatorname*{argmin}_{f\in F}\ \sup_{g\in F}\ T_{K,\lambda}(g, f),$$
with $\hat f_K$ denoting the nonregularized version obtained for $\lambda = 0$.
We shall provide results for $\hat f_{K,\lambda}$ only in the main text. The estimators $\hat f_K$ are studied in
the Supplementary Material [36] in Section 7.
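For linear functions $f = \langle t, \cdot\rangle$ and the $\ell_1$ regularization used in Section 4, a minimal sketch of the criterion (assuming the form of $T_{K,\lambda}$ recalled above, with a fixed contiguous equipartition of the data) reads as follows:

```python
import numpy as np

def T_K_lambda(t_prime, t, X, Y, K, lam):
    """T_{K,lambda}(t', t) = MOM_K(ell_t - ell_{t'}) + lam * (||t||_1 - ||t'||_1)."""
    n = len(Y) - (len(Y) % K)                       # make K divide the sample size
    incr = (Y[:n] - X[:n] @ t) ** 2 - (Y[:n] - X[:n] @ t_prime) ** 2
    block_means = incr.reshape(K, -1).mean(axis=1)  # empirical means on the K blocks
    return np.median(block_means) + lam * (np.abs(t).sum() - np.abs(t_prime).sum())
```

The minmax MOM estimator is then any (approximate) solution of the min over $t$ of the sup over $t'$ of this criterion; the algorithms of Section 5 compute such approximations.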
3. Assumptions and main results. Denote by $\{O, I\}$ a partition of $[N]$ and by $|O|$ the
cardinality of $O$. No assumption is made on $(X_i, Y_i)_{i\in O}$: these data are outliers. They may
not be independent, nor independent from the remaining data (they may not even be random). The data $(X_i, Y_i)$, $i \in
I$, are called inliers or informative data. They are hereafter assumed independent. The sets
$O, I$ are unknown.
3.1. Assumptions. The main assumptions involve first and second moments of the func-
tions in F and Y under the distributions P , (Pi )i∈I .
Assumption 1 holds in the i.i.d. framework, with I = [N] but it covers also other cases
where inliers follow different distributions (see, for instance, multimodal datasets such as in
[45] or heteroscedastic noise [4]). It is also possible to weaken Assumption 1 such as in [35].
The second assumption bounds the correlation between ζi = Yi − f ∗ (Xi ) and the shifted
class F − f ∗ .
Assumption 2 holds when data are i.i.d. and $Y - f^*(X)$ has uniformly bounded $L_2$-
moments conditionally on $X$. This last assumption holds when $Y - f^*(X)$ is independent
of $X$ and has an $L_2$-moment bounded by $\theta_m$. Assumption 2 also holds if, for all $i \in I$,
$\|\zeta\|_{L^4_{P_i}} \le \theta_2 < \infty$, where $\zeta(x, y) = y - f^*(x)$ for all $x \in X$ and $y \in \mathbb{R}$, and, for every
$f \in F$, $\|f - f^*\|_{L^4_{P_i}} \le \theta_1 \|f - f^*\|_{L^2_{P_i}}$. Actually, in this case,
$$\operatorname{Var}\big(\zeta_i\, (f - f^*)(X_i)\big) \le \|\zeta\|^2_{L^4_{P_i}}\, \|f - f^*\|^2_{L^4_{P_i}} \le \theta_1^2\theta_2^2\, \|f - f^*\|^2_{L^2_P},$$
so Assumption 2 holds for $\theta_m = \theta_1\theta_2$. The third assumption states that the norms $L^2_P$ and $L^1_P$
are equivalent over $F - f^*$.
Under Assumption 1, $\|f - f^*\|_{L^1_{P_i}} \le \|f - f^*\|_{L^2_{P_i}} = \|f - f^*\|_{L^2_P}$ for all $f \in F$ and $i \in I$,
hence, Assumptions 1 and 3 imply that the norms $L^1_P$, $L^2_P$, $L^2_{P_i}$, $L^1_{P_i}$, $i \in I$, are equivalent over
F − f ∗ . Assumption 3 is equivalent to the small ball property (cf. [32, 48]); see Proposition 1
in [35].
and $r_M(\rho, \gamma_M)$ as
$$\inf\Big\{ r > 0 : \sup_{J \subset I,\, |J| \ge N/2}\ \mathbb{E}\sup_{f \in B_{\mathrm{reg}}(f^*, \rho, r)} \Big|\frac{1}{|J|}\sum_{i \in J} \varepsilon_i\, \zeta_i\, \big(f - f^*\big)(X_i)\Big| \le \gamma_M r^2 \Big\},$$
where $(\varepsilon_i)_{i\in I}$ are i.i.d. Rademacher signs independent of the data.
Let ρ → r(ρ, γQ , γM ) be a continuous and nondecreasing function such that for every ρ > 0,
r(ρ) = r(ρ, γQ , γM ) ≥ max{rQ (ρ, γQ ), rM (ρ, γM )}.
It follows from Lemma 2.3 in [37] that $r_M$ and $r_Q$ are continuous and nondecreasing
functions that depend on $f^*$. According to [37], for appropriate choices of $\gamma_Q, \gamma_M$, $r(\rho) =$
max(rM (ρ, γM ), rQ (ρ, γQ )) is the minimax rate of convergence over B(f ∗ , ρ). Note also
that rQ and rM are well defined when |I | ≥ N/2, meaning that at least half data should be
informative.
3.3. The sparsity equation. Risk bounds follow from upper bounds on $T_{K,\lambda}(f, f^*)$ for
functions $f$ far from $f^*$ either in $L^2_P$-norm or in the regularization norm $\|\cdot\|$. Let $f \in F$
and let $\rho = \|f - f^*\|$. When $\|f - f^*\|_{L^2_P}$ is small, $T_{K,\lambda}$ has to be bounded from above by
$\lambda(\|f^*\| - \|f\|)$. To bound $\|f^*\| - \|f\|$ from below, introduce the subdifferentials of $\|\cdot\|$.
Let $(E^*, \|\cdot\|^*)$ be the dual normed space of $(E, \|\cdot\|)$ and, for all $f \in F$, let
$$(\partial\|\cdot\|)_f = \big\{ z^* \in E^* : \forall h \in E,\ \|f + h\| \ge \|f\| + z^*(h) \big\}.$$
For any $\rho > 0$, let $H_\rho$ denote the set of functions “close” to $f^*$ in $L^2_P$ and at distance $\rho$
from $f^*$ in regularization norm, and let $\Gamma_{f^*}(\rho)$ denote the set of subdifferentials of all vectors
close to $f^*$:
$$\Gamma_{f^*}(\rho) = \bigcup_{f \in F : \|f - f^*\| \le \rho/20} (\partial\|\cdot\|)_f \qquad\text{and}\qquad \Delta(\rho) = \inf_{f \in H_\rho}\ \sup_{z^* \in \Gamma_{f^*}(\rho)} z^*\big(f - f^*\big).$$
DEFINITION 4. A radius $\rho > 0$ is said to satisfy the sparsity equation when $\Delta(\rho) \ge 4\rho/5$.
3.4. Main results. The first results give risk bounds for $\hat f_{K,\lambda}$. Similar bounds have been
obtained for other MOM estimators [35, 42].
THEOREM 1. Grant Assumptions 1, 2 and 3 and let $r_Q, r_M$ denote the complexity func-
tions introduced in Definition 3. Assume that $N \ge 384\theta_0^2$ and $|O| \le N/(768\theta_0^2)$. Let $\rho^*$ be
a solution to the sparsity equation from Definition 4. Let $\varepsilon = 1/(833\theta_0^2)$ and let $r^2(\cdot)$ be as in
Definition 3 for $\gamma_Q = (384\theta_0)^{-1}$ and $\gamma_M = \varepsilon/192$. Let $K^*$ denote the smallest integer such
that
$$K^* \ge \frac{N\varepsilon^2}{384\theta_m^2}\, r^2(\rho^*).$$
For any $K \ge K^*$, define the radius $\rho_K$ and the regularization parameter $\lambda$ as
$$r^2(\rho_K) = \frac{384\theta_m^2 K}{\varepsilon^2 N} \qquad\text{and}\qquad \lambda = \frac{16\varepsilon\, r^2(\rho_K)}{\rho_K}.$$
Then, for all $K \in [\max(K^*, 8|O|), N/(96\theta_0^2)]$, with probability larger than $1 -
4\exp(-7K/9216)$, the estimator $\hat f_{K,\lambda}$ defined in Section 2.3 satisfies
$$\|\hat f_{K,\lambda} - f^*\| \le 2\rho_K, \qquad \|\hat f_{K,\lambda} - f^*\|_{L^2_P} \le r(2\rho_K),$$
$$R(\hat f_{K,\lambda}) \le R(f^*) + (1 + 52\varepsilon)\, r^2(2\rho_K).$$
The following theorem gives risk bounds for the adaptive estimators. Bounds in regularization and
$L^2_P$ norms have been proved for Le Cam test estimators in [35]. To the best of our knowledge,
adaptive bounds in excess risk have never been proved before.
THEOREM 2. Grant the assumptions of Theorem 1. Choose $c_{\mathrm{ad}} = 18/833$ in (4) and let
$\varepsilon = (833\theta_0^2)^{-1}$. For any $K \in [\max(K^*, 8|O|), N/(96\theta_0^2)]$, with probability larger than
$$1 - 4\exp(-K/2304) = 1 - 4\exp\big(-\varepsilon^2 N r^2(\rho_K)/(884736\,\theta_m^2)\big),$$
one has $\|\hat f_{\mathrm{cad}} - f^*\| \le 2\rho_K$, $\|\hat f_{\mathrm{cad}} - f^*\|_{L^2_P} \le r(2\rho_K)$ and
$$R(\hat f_{\mathrm{cad}}) \le R(f^*) + (1 + 52\varepsilon)\, r^2(2\rho_K).$$
In particular, for $K = K^*$, we have $r(2\rho_{K^*}) = \max\big(r(2\rho^*), \sqrt{|O|/N}\big)$.
Theorem 2 shows that $\hat f_{\mathrm{cad}}$ achieves similar performance as $\hat f_{K,\lambda}$ simultaneously for all $K$
from $K^*$ to $O(N)$. For $K = K^*$, these rates match the minimax optimal rates of convergence;
see Section 4. The main difference with Theorem 1 is that the knowledge of $K^*$ and $|O|$ is
not necessary to design $\hat f_{\mathrm{cad}}$. This is very useful in applications, where these quantities are
typically unknown. Moreover, both the construction and the analysis are much simpler for
$\hat f_{\mathrm{cad}}$ than for the adaptive estimator in [35], since they are based on the analysis of confidence
regions for a single criterion $C_{J,\lambda}$, instead of the multiple criteria used in [35].
REMARK 2 (Deviation parameter). Note that $r(\cdot)$ can be any continuous, nonde-
creasing function such that $r(\rho) \ge \max(r_Q(\rho, \gamma_Q), r_M(\rho, \gamma_M))$. In particular, if $r_* : \rho \mapsto
\max(r_Q(\rho, \gamma_Q), r_M(\rho, \gamma_M))$ is continuous, as it is clearly nondecreasing, then, for every
$x > 0$, $r(\rho) = \max(r_Q(\rho, \gamma_Q), r_M(\rho, \gamma_M)) + \sqrt{x/N}$ is another nondecreasing upper bound.
Therefore, one can derive results similar to Theorem 2 but with an extra confidence parame-
ter: for all $x > 0$, with probability at least $1 - 4\exp(-c_0 N r_*^2(\rho_{K^*}) - c_0 x)$,
$$\|\hat f_{\mathrm{cad}} - f^*\| \le 2\rho_K, \qquad \|\hat f_{\mathrm{cad}} - f^*\|_{L^2_P} \le r_*(2\rho_K) + \sqrt{\frac{x}{N}},$$
$$R(\hat f_{\mathrm{cad}}) \le R(f^*) + (1 + 52\varepsilon)\Big( r_*^2(2\rho_K) + \frac{x}{N}\Big).$$
In that case, $\hat f_{\mathrm{cad}}$ depends on $x$ since $\lambda = 16\varepsilon\big(r_*^2(\rho_K) + x/N\big)/\rho_K$.
4.1. The LASSO. The LASSO is obtained when $F = \{\langle t, \cdot\rangle : t \in \mathbb{R}^d\}$ and the regulariza-
tion function is the $\ell_1$-norm:
$$\hat t \in \operatorname*{argmin}_{t\in\mathbb{R}^d}\ \frac{1}{N}\sum_{i=1}^N \big(\langle t, X_i\rangle - Y_i\big)^2 + \lambda \|t\|_1 \qquad\text{where } \|t\|_1 = \sum_{i=1}^d |t_i|.$$
Even if recent advances show some limitations of LASSO [54, 62, 67], it remains the
benchmark estimator in high-dimensional statistics because a high-dimensional parameter
space does not significantly affect its performance as long as t ∗ is sparse. One can refer to
[12, 41, 47, 52, 61, 63, 64] for estimation and sparse oracle inequalities, [7, 46, 68] for support
recovery results; more results and references on LASSO can be found in [17, 31].
4.2. SLOPE. SLOPE is an estimator introduced in [16, 57]. The class $F$ is still $F =
\{\langle t, \cdot\rangle : t \in \mathbb{R}^d\}$ and the regularization function is defined, for parameters $\beta_1 \ge \beta_2 \ge \cdots \ge
\beta_d > 0$, by $\|t\|_{\mathrm{SLOPE}} = \sum_{i=1}^d \beta_i t_i^\sharp$, where $(t_i^\sharp)_{i=1}^d$ denotes the nonincreasing rearrange-
ment of $(|t_i|)_{i=1}^d$. The SLOPE norm is a weighted $\ell_1$-norm that coincides with the $\ell_1$-norm when
all the weights $\beta_i$ are equal to 1.
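For illustration, the SLOPE norm is straightforward to evaluate; a short sketch (the weights $\beta_j = \sqrt{\log(ed/j)}$ used later are one possible choice, hard-coded here):

```python
import numpy as np

def slope_norm(t, beta):
    """Weighted l1-norm of the nonincreasing rearrangement of |t|."""
    t_sharp = np.sort(np.abs(t))[::-1]   # nonincreasing rearrangement of |t_i|
    return float(np.sum(beta * t_sharp))

d = 10
beta = np.sqrt(np.log(np.e * d / np.arange(1, d + 1)))   # beta_j = sqrt(log(ed/j))
t = np.random.randn(d)
print(slope_norm(t, beta), slope_norm(t, np.ones(d)), np.abs(t).sum())
# the last two values coincide: SLOPE with unit weights is the l1-norm
```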
4.3. Classical results for LASSO and SLOPE. Typical results for LASSO and SLOPE
have been obtained when data are i.i.d. with sub-Gaussian design X and, most of the time,
sub-Gaussian noise ζ as well.
DEFINITION 5. Let $\ell_2^d$ be a $d$-dimensional inner product space and let $X$ be a ran-
dom variable with values in $\ell_2^d$. We say that $X$ is isotropic when, for every $t \in \ell_2^d$,
$\|\langle X, t\rangle\|_{L^2_P} = \|t\|_{\ell_2^d}$, and that it is $L$-sub-Gaussian if, for every $t \in \ell_2^d$ and every $p \ge 2$,
$$\|\langle X, t\rangle\|_{L^p_P} \le L\sqrt{p}\, \|\langle X, t\rangle\|_{L^2_P}.$$
The covariance structure of an isotropic random variable coincides with the inner product
of $\ell_2^d$. If $X$ is an $L$-sub-Gaussian random vector, the $L^p_P$ norms of all linear forms do not grow
faster than the corresponding $L^p_P$ norms of a Gaussian variable. When dealing with the LASSO and SLOPE,
the natural Euclidean structure is used on $\mathbb{R}^d$.
ASSUMPTION 4.
1. Data are i.i.d. (in particular, $|I| = N$ and $|O| = 0$, that is, there is no outlier),
2. $X$ is isotropic and $L$-sub-Gaussian,
3. for $f^* = \langle t^*, \cdot\rangle$, $\xi = Y - f^*(X) \in L^{q_0}_P$ for some $q_0 > 2$.
Assumption 4 requires an $L^{q_0}$ moment, for some $q_0 > 2$, on the noise. LASSO and SLOPE still
achieve optimal rates of convergence under this assumption, but with a severely deteriorated
probability estimate.
THEOREM 3 (Theorem 1.4 in [39]). Grant Assumption 4. Let $s \in [d]$. Assume that $N \ge
c_1 s\log(ed/s)$ and that there is some $v \in \mathbb{R}^d$ supported on at most $s$ coordinates for which
$\|t^* - v\|_1 \le c_2 \|\xi\|_{L^{q_0}_P}\, s\sqrt{\log(ed)/N}$. The LASSO estimator $\hat t$ with regularization parameter
$\lambda = c_3 \|\xi\|_{L^{q_0}_P}\sqrt{\log(ed)/N}$ is such that, with probability at least
$$\text{(5)}\qquad 1 - \frac{c_4 \log^{q_0} N}{N^{q_0/2 - 1}} - 2\exp\big(-c_5 s\log(ed/s)\big),$$
for every $1 \le p \le 2$,
$$\|\hat t - t^*\|_p \le c_6 \|\xi\|_{L^{q_0}_P}\, s^{1/p}\sqrt{\frac{\log(ed)}{N}}.$$
The constants $(c_j)_{j=1}^6$ depend only on $L$ and $q_0$.
Theorem 3 shows that LASSO achieves its optimal rate (cf. [12]) if t ∗ is close to a sparse
vector and the noise ζ may be heavy tailed and may not be independent from X. On the
other hand, the dataset cannot contain outliers and the data should be i.i.d. with sub-Gaussian
design matrix X.
Turning to SLOPE, recall the following result for the regularization norm $\|t\|_{\mathrm{SLOPE}} =
\sum_{j=1}^d \beta_j t_j^\sharp$ when $\beta_j = C\sqrt{\log(ed/j)}$.
THEOREM 4 (Theorem 1.6 in [39]). Consider the SLOPE under Assumption 4. Assume
that $N \ge c_1 s\log(ed/s)$ and that there is $v \in \mathbb{R}^d$ such that $|\operatorname{supp}(v)| \le s$ and $\|t^* - v\|_{\mathrm{SLOPE}} \le
c_2 \|\xi\|_{L^{q_0}_P}\, s\log(ed/s)/\sqrt{N}$. The SLOPE estimator with $\lambda = c_3\|\xi\|_{L^{q_0}_P}/\sqrt{N}$ satisfies, with
probability at least (5),
$$\|\hat t - t^*\|_{\mathrm{SLOPE}} \le c_4 \|\xi\|_{L^{q_0}_P}\, \frac{s}{\sqrt N}\log\Big(\frac{ed}{s}\Big), \qquad \|\hat t - t^*\|_2^2 \le c_5 \|\xi\|^2_{L^{q_0}_P}\, \frac{s}{N}\log\Big(\frac{ed}{s}\Big).$$
The constants $(c_j)_{j=1}^5$ depend only on $L$ and $q_0$.
4.4. Minmax MOM LASSO and SLOPE. In this section, Theorem 2 is applied to the set
$F$ of linear functionals indexed by $\mathbb{R}^d$, with the regularization function being either the $\ell_1$-norm or the
SLOPE norm. The aim is to show that the results from Section 4.3 hold, and are sometimes
even improved, for the MOM versions of LASSO and SLOPE, under weaker assumptions and with
a better deviation probability. Start with the new set of assumptions.
In order to apply Theorem 2, we have to compute the fixed-point functions $r_Q(\cdot)$, $r_M(\cdot)$
and solve the sparsity equation in both cases. To compute the fixed point functions, recall
the definition of Gaussian mean widths: for a set $V \subset \mathbb{R}^d$, the Gaussian mean width of $V$ is
defined as $\ell^*(V) = \mathbb{E}\sup_{v\in V}\langle G, v\rangle$, where $G \sim \mathcal{N}(0, I_{d\times d})$.
The dual norm of the $\ell_1^d$-norm is the $\ell_\infty^d$-norm, which is 1-unconditional with respect to the
canonical basis of $\mathbb{R}^d$ ([49], Definition 1.4). Therefore, [49], Theorem 1.6, applies under the
following assumption.
ASSUMPTION 6. There exist constants $q_0 > 2$, $C_0$ and $L$ such that $\xi \in L^{q_0}_P$, $X$ is isotropic
and, for every $j \in [d]$ and $1 \le p \le C_0\log d$, $\|\langle X, e_j\rangle\|_{L^p_P} \le L\sqrt{p}\,\|\langle X, e_j\rangle\|_{L^2_P}$.
Local Gaussian mean widths $\ell^*(\rho B_1^d \cap r B_2^d)$ are bounded from above in [39], Lemma 5.3,
and the computations of $r_M(\cdot)$ and $r_Q(\cdot)$ follow:
$$r_M^2(\rho) \lesssim_{L, q_0, \gamma_M} \begin{cases} \sigma^2 \dfrac{d}{N} & \text{if } \rho^2 N \ge \sigma^2 d^2,\\[4pt] \rho\,\sigma\sqrt{\dfrac{1}{N}\log\Big(\dfrac{e\sigma d}{\rho\sqrt N}\Big)} & \text{otherwise,}\end{cases}$$
$$r_Q^2(\rho) \begin{cases} = 0 & \text{if } N \gtrsim_{L,\gamma_Q} d,\\[4pt] \lesssim_{L,\gamma_Q} \dfrac{\rho^2}{N}\log\Big(\dfrac{c(L,\gamma_Q)\, d}{N}\Big) & \text{otherwise.}\end{cases}$$
PROOF. It follows from Theorem 2, the computation of $r(\rho_K)$ from (7) and of $\rho_K$ in (8) that,
with probability at least $1 - c_0\exp(-c\,r(\rho_{K^*})^2 N/C)$, $\|\hat t - t^*\|_1 \le \rho_{K^*}$ and $\|\hat t - t^*\|_2 \lesssim r(\rho_{K^*})$.
The result follows since $\rho_{K^*} \sim \rho^* \sim_{L,q_0} \sigma s\sqrt{\frac{1}{N}\log\big(\frac{ed}{s}\big)}$ and $\|v\|_p \le \|v\|_1^{-1+2/p}\|v\|_2^{2-2/p}$ for
all $v \in \mathbb{R}^d$ and $1 \le p \le 2$.
Theoretical properties of MOM LASSO (cf. Theorem 5) outperform those of LASSO (cf.
Theorem 3) in several ways:
• Estimation rates achieved by MOM LASSO are the actual minimax rates $s\log(ed/s)/N$
(see [11]), while classical LASSO estimators achieve the rate $s\log(ed)/N$. This improve-
ment is possible thanks to the adaptation step in MOM LASSO.
• The deviation probability in (5) is polynomial, of order $1/N^{q_0/2-1}$, whereas it is exponentially
small for MOM LASSO. Exponential rates for LASSO hold only if $\xi$ is sub-Gaussian
($\|\xi\|_{L^p} \le C\sqrt{p}\,\|\xi\|_{L^2}$ for all $p \ge 2$).
• MOM LASSO is insensitive to data corruption by up to $s\log(ed/s)$ outliers, while a single
outlier can be responsible for a dramatic breakdown of the performance of LASSO (cf.
Figure 1).
• Assumptions on $X$ are weaker for MOM LASSO than for LASSO. In the LASSO case,
we assume that $X$ is sub-Gaussian, whereas for MOM LASSO we assume that the
coordinates of $X$ have $C_0\log(ed)$ sub-Gaussian moments and that $X$ satisfies an $L_2/L_1$
equivalence assumption.
Let us now turn to the study of a “minmax MOM version” of the SLOPE estimator. The
computation of the fixed point functions $r_Q(\cdot)$ and $r_M(\cdot)$ relies on [49], Theorem 1.6, and the
computations from [39]. Again, the SLOPE norm has a dual norm which is 1-unconditional
with respect to the canonical basis of $\mathbb{R}^d$ ([49], Definition 1.4). Therefore, it follows from [49],
Theorem 1.6, that, under Assumption 6, one has
$$\mathbb{E}\sup_{v \in \rho B_{\mathrm{SLOPE}} \cap r B_2^d}\Big|\sum_{i\in[N]} \varepsilon_i \langle v, X_i\rangle\Big| \le c_2\sqrt N\, \ell^*\big(\rho B_{\mathrm{SLOPE}} \cap r B_2^d\big),$$
$$\mathbb{E}\sup_{v \in \rho B_{\mathrm{SLOPE}} \cap r B_2^d}\Big|\sum_{i\in[N]} \varepsilon_i\, \zeta_i \langle v, X_i\rangle\Big| \le c_2\, \sigma\sqrt N\, \ell^*\big(\rho B_{\mathrm{SLOPE}} \cap r B_2^d\big),$$
where $B_{\mathrm{SLOPE}}$ is the unit ball of the SLOPE norm, $\zeta_i = Y_i - \langle X_i, t^*\rangle$ and $(\varepsilon_i)_{i\in[N]}$ are i.i.d. Rademacher signs. Local Gaussian
mean widths $\ell^*(\rho B_{\mathrm{SLOPE}} \cap r B_2^d)$ are bounded from above in [39], Lemma 5.3: $\ell^*(\rho B_{\mathrm{SLOPE}} \cap r B_2^d) \lesssim
\min\{C\rho, \sqrt d\, r\}$ when $\beta_j = C\sqrt{\log(ed/j)}$ for all $j \in [d]$, and the computations of $r_M(\cdot)$ and $r_Q(\cdot)$
follow:
$$r_Q^2(\rho) \begin{cases} = 0 & \text{if } N \gtrsim_L d,\\[4pt] \lesssim_L \dfrac{\rho^2}{N} & \text{otherwise,}\end{cases}
\qquad\text{and}\qquad
r_M^2(\rho) \lesssim_{L,q,\delta} \begin{cases} \|\xi\|^2_{L^q}\, \dfrac{d}{N} & \text{if } \rho^2 N \gtrsim_{L,q,\delta} \|\xi\|^2_{L^q}\, d^2,\\[4pt] \|\xi\|_{L^q}\, \dfrac{\rho}{\sqrt N} & \text{otherwise.}\end{cases}$$
The sparsity equation has been solved in [39], Lemma 4.3.

LEMMA 2. Let $1 \le s \le d$ and set $B_s = \sum_{j\le s} \beta_j/\sqrt j$. If $t^*$ is $\rho/20$-approximated (rela-
tive to the SLOPE norm) by an $s$-sparse vector and if $40 B_s \le \rho/r(\rho)$, then $\Delta(\rho) \ge 4\rho/5$.

For $\beta_j \le C\sqrt{\log(ed/j)}$, $B_s = \sum_{j\le s}\beta_j/\sqrt j \lesssim C\sqrt{s\log(ed/s)}$. The condition $B_s \lesssim
\rho/r(\rho)$ holds when $N \gtrsim_{L,q_0} s\log(ed/s)$ and $\rho \gtrsim_{L,q_0} \|\xi\|_{L^q}\frac{s}{\sqrt N}\log\big(\frac{ed}{s}\big)$. Lemma 2 implies that
$\Delta(\rho) \ge 4\rho/5$ when there is an $s$-sparse vector in $t^* + (\rho/20)B_{\mathrm{SLOPE}}$. Therefore, Theorem 1
applies for $\lambda \sim r^2(\rho)/\rho \sim_{L,q,\delta} \|\xi\|_{L^q}/\sqrt N$.
The final ingredient is to compute the solution $\rho_K$ of $K = c\,r(\rho_K)^2 N$. It is solved for
$\rho_K \sim K/(\sigma\sqrt N)$ and, therefore, $\lambda \sim r^2(\rho_K)/\rho_K \sim_{L,q,\delta} \|\xi\|_{L^q}/\sqrt N$.
The following result follows from Theorem 2 together with the previous computations of
$\rho^*$, $\rho_K$, $r_Q(\cdot)$, $r_M(\cdot)$ and $r(\cdot)$. The proof, similar to that of Theorem 5, is omitted.
MOM SLOPE has the same advantages over SLOPE as MOM LASSO over LASSO.
These improvements, listed below Theorem 5, are not repeated. The only difference is that
SLOPE, unlike LASSO, already achieves the minimax rate $s\log(ed/s)/N$, whereas, without
an extra adaptation step as in [11], the LASSO is not known to achieve a rate better than
$s\log(ed)/N$.
5. Algorithms for minmax MOM LASSO. The aim of this section is to show that
there is a systematic way to transform classical descent-based algorithms (such as Newton's
method, gradient descent or proximal gradient descent algorithms) into robust ones
using the MOM approach. This section provides several examples of such modifications.
These algorithms are tested in high-dimensional frameworks. In this setup, there exists
a large number of algorithms approximating the LASSO. The aim of this section is to
show that there is a natural modification of these algorithms that makes them more robust
to outliers. The choice of hyperparameters, like the number of blocks or the regularization
parameter, cannot be done via classical Cross-Validation (CV) because of possible outliers in
the test sets. CV procedures are therefore also adapted using MOM's principle in Section 6. We also
advocate for using random blocks at every iteration of the algorithms, to bypass a problem
of “local saddle points” we have identified. A byproduct of the latter approach is a definition
of depth adapted to the learning task and, therefore, an outlier detection algorithm. This
material and a simulation study are given in Section 6 of the Supplementary Material [36].
5.1. From algorithms for LASSO to MOM LASSO. Each algorithm designed for the
LASSO can be transformed into a robust algorithm for the minmax MOM estimator. Recall
that the minmax MOM LASSO estimator is
$$\text{(10)}\qquad \hat t_{K,\lambda} \in \operatorname*{argmin}_{t\in\mathbb{R}^d}\ \sup_{t'\in\mathbb{R}^d}\ T_{K,\lambda}(t', t),$$
where $T_{K,\lambda}(t', t) = \mathrm{MOM}_K(\ell_t - \ell_{t'}) + \lambda(\|t\|_1 - \|t'\|_1)$.
ASSUMPTION 7. Almost surely (with respect to $(X_i, Y_i)_{i=1}^N$), for almost all $(t_0, t_0') \in
\mathbb{R}^d \times \mathbb{R}^d$ (with respect to the Lebesgue measure on $\mathbb{R}^d \times \mathbb{R}^d$), there exist a convex open set
$B$ containing $(t_0, t_0')$ and $k \in [K]$ such that, for all $(t, t') \in B$, $P_{B_k}(\ell_t - \ell_{t'}) \in \mathrm{MOM}_K(\ell_t - \ell_{t'})$.
Under Assumption 7, for almost all couples $(t_0, t_0') \in \mathbb{R}^d \times \mathbb{R}^d$, $t \mapsto T_{K,\lambda}(t_0', t)$ is “locally
convex” and $t' \mapsto T_{K,\lambda}(t', t_0)$ is “locally concave.” Therefore, for $k$ such that $P_{B_k}(\ell_{t_0} - \ell_{t_0'}) \in
\mathrm{MOM}_K(\ell_{t_0} - \ell_{t_0'})$,
$$\text{(11)}\qquad \nabla_t\, \mathrm{MOM}_K(\ell_t - \ell_{t_0'})\big|_{t=t_0} = -2\, \big(X^{(k)}\big)^\top \big(Y^{(k)} - X^{(k)} t_0\big),$$
where $Y^{(k)} = (Y_i)_{i\in B_k}$ and $X^{(k)}$ is the $|B_k| \times d$ matrix with rows given by $X_i^\top$ for $i \in B_k$. The
integer $k \in [K]$ is the index of the median of the $K$ real numbers $P_{B_1}(\ell_{t_0} - \ell_{t_0'}), \ldots, P_{B_K}(\ell_{t_0} - \ell_{t_0'})$,
which is straightforward to compute. The gradient $-2(X^{(k)})^\top(Y^{(k)} - X^{(k)} t_0)$ in (11) depends
on $t_0'$ only through the index $k$.
REMARK 3 (Block gradient descent). Algorithms developed for the minmax estimator
using steepest descent steps such as (11) are special instances of Block Gradient Descent
(BGD). The major difference with standard BGD (which takes all blocks sequentially) is
that the index of the block is chosen here such that $P_{B_k}(\ell_{t_0} - \ell_{t_0'}) \in \mathrm{MOM}_K(\ell_{t_0} - \ell_{t_0'})$. In particular,
we expect blocks corrupted by outliers to be avoided, which is not the case in classical
BGD. Moreover, by choosing the “descent/ascent” block $k$ via its centrality, we also expect
$P_{B_k}(\ell_{t_0} - \ell_{t_0'})$ to be close to the objective function $P(\ell_{t_0} - \ell_{t_0'})$. This should make every
descent (resp., ascent) step particularly efficient.

REMARK 4 (Map-reduce). The algorithms presented in this section fit particularly well the
map-reduce paradigm [19], where data are spread out in a cluster of servers and are there-
fore naturally split into blocks. Our procedures use a mean as the mapper and a median as the
reducer. This makes our algorithms easily scalable in the big data framework, even when
some servers have crashed (producing blocks of outlier data). The median identifies the
correct block of data onto which one should make a descent or an ascent and leaves aside
the servers which have crashed.

REMARK 5 (Normalization). In the i.i.d. setup, the design matrix $X$ (i.e., the $N \times d$
matrix with row vectors $X_1, \ldots, X_N$) is normalized to make the $\ell_2^N$-norms of the columns equal
to one. In a corrupted setup, one row of $X$ may be corrupted and normalizing each column of
$X$ would corrupt the entire matrix $X$. We therefore do not normalize the design matrix in the
following.
input : $(t_0, t_0') \in \mathbb{R}^d \times \mathbb{R}^d$: initial point, $\epsilon > 0$: a stopping parameter, $(\eta_p)_p, (\beta_p)_p$: two
step size sequences
output: approximated solution to the min–max problem (10)
1 while $\|t_{p+1} - t_p\|_2 \ge \epsilon$ or $\|t'_{p+1} - t'_p\|_2 \ge \epsilon$ do
2     find $k \in [K]$ such that $P_{B_k}(\ell_{t_p} - \ell_{t'_p}) = \mathrm{MOM}_K(\ell_{t_p} - \ell_{t'_p})$
3     $t_{p+1} = t_p + 2\eta_p X_k^\top(Y_k - X_k t_p) - \lambda\eta_p\, \mathrm{sign}(t_p)$
4     find $k \in [K]$ such that $P_{B_k}(\ell_{t_{p+1}} - \ell_{t'_p}) = \mathrm{MOM}_K(\ell_{t_{p+1}} - \ell_{t'_p})$
5     $t'_{p+1} = t'_p + 2\beta_p X_k^\top(Y_k - X_k t'_p) - \lambda\beta_p\, \mathrm{sign}(t'_p)$
6 end
7 Return $(t_p, t'_p)$
Algorithm 1: A “minmax MOM version” of the subgradient descent
The LASSO objective $\psi(t) = \|Y - Xt\|_2^2 + \lambda\|t\|_1$ can be minimized
by a subgradient descent procedure: given $t_0 \in \mathbb{R}^d$ and step sizes $(\gamma_p)_p$ (i.e., $\gamma_p > 0$ and $(\gamma_p)_p$
decreases), at step $p$ we update
$$\text{(12)}\qquad t_{p+1} = t_p - \gamma_p\, \partial\psi(t_p),$$
where $\partial\psi(t_p)$ is a subgradient of $\psi$ at $t_p$, for instance $\partial\psi(t_p) = -2X^\top(Y - Xt_p) + \lambda\,\mathrm{sign}(t_p)$, where
$\mathrm{sign}(t_p)$ is the vector of signs of the coordinates of $t_p$ with the convention $\mathrm{sign}(0) = 0$.
The subgradient descent algorithm (12) can be turned into an alternating subgradient as-
cent/descent algorithm for the min–max estimator (10): let
$$\text{(13)}\qquad Y_k = (Y_i)_{i\in B_k} \quad\text{and}\quad X_k = \big(X_i^\top\big)_{i\in B_k} \in \mathbb{R}^{|B_k|\times d}.$$
The key insight in Algorithm 1 lies in steps 2 and 4, where the block index is chosen by the
median operator. Those steps are expected (1) to remove outliers from the
descent/ascent directions and (2) to improve the accuracy of these directions.
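A minimal Python sketch of Algorithm 1 is given below, under simplifying choices of our own (a fixed contiguous equipartition of the data, constant step sizes and a fixed iteration budget instead of the stopping rule):

```python
import numpy as np

def mom_lasso_subgradient(X, Y, K, lam, eta=1e-3, beta=1e-3, n_iter=500):
    """Alternating MOM subgradient descent (in t) / ascent (in t') for problem (10)."""
    N, d = X.shape
    n = N - (N % K)
    blocks = np.arange(n).reshape(K, -1)             # fixed equipartition of the indices
    t, t_prime = np.zeros(d), np.zeros(d)

    def median_block(t, t_prime):
        # block k whose empirical increment P_{B_k}(ell_t - ell_{t'}) is the median
        incr = [np.mean((Y[B] - X[B] @ t) ** 2 - (Y[B] - X[B] @ t_prime) ** 2)
                for B in blocks]
        return int(np.argsort(incr)[K // 2])

    for _ in range(n_iter):
        k = median_block(t, t_prime)                                       # step 2
        Xk, Yk = X[blocks[k]], Y[blocks[k]]
        t = t + 2 * eta * Xk.T @ (Yk - Xk @ t) - lam * eta * np.sign(t)    # step 3
        k = median_block(t, t_prime)                                       # step 4
        Xk, Yk = X[blocks[k]], Y[blocks[k]]
        t_prime = (t_prime + 2 * beta * Xk.T @ (Yk - Xk @ t_prime)
                   - lam * beta * np.sign(t_prime))                        # step 5
    return t, t_prime
```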
A classical choice of step size $\gamma_p$ in (12) is $\gamma_p = 1/L$ where $L = \|X\|_{S_\infty}^2$ ($\|X\|_{S_\infty}$ is the
operator norm of $X$). Another possible choice follows from the Armijo–Goldstein condition
with the following backtracking line search: $\gamma$ is decreased geometrically while the Armijo–
Goldstein condition is not satisfied:
$$\text{while } \psi\big(t_p + \gamma\, \partial\psi(t_p)\big) > \psi(t_p) + \delta\gamma \big\|\partial\psi(t_p)\big\|_2^2 \text{ do } \gamma \leftarrow \rho\gamma,$$
for some given $\rho \in (0, 1)$, $\delta = 10^{-4}$ and initial point $\gamma_0 = 1$.
Of course, the same choices of step size cannot be made for $(\eta_p)_p$ and $(\beta_p)_p$ in Al-
gorithm 1 because $X$ may be corrupted, but they can be adapted. In the first case, one can
take $\eta_p = 1/\|X_k\|_{S_\infty}^2$ where $k \in [K]$ is the index defined in line 2 of Algorithm 1, and
$\beta_p = 1/\|X_k\|_{S_\infty}^2$ where $k \in [K]$ is the index defined in line 4 of Algorithm 1. In the
backtracking line search case, the Armijo–Goldstein condition adapted to Algorithm 1 reads
$$\text{while } \psi_k\big(t_p + \gamma\, \partial\psi_k(t_p)\big) > \psi_k(t_p) + \delta\gamma \big\|\partial\psi_k(t_p)\big\|_2^2 \text{ do } \eta \leftarrow \rho\eta,$$
where $\psi_k(t) = \|Y_k - X_k t\|_2^2 + \lambda\|t\|_1$ and $k \in [K]$ is defined in line 2 of Algorithm 1; the same rule applies
for $\beta_p$, with $k \in [K]$ defined in line 4 of Algorithm 1.
input : $(t_0, t_0') \in \mathbb{R}^d \times \mathbb{R}^d$: initial point, $\epsilon > 0$: a stopping parameter, $(\eta_k)_k, (\beta_k)_k$: two
step size sequences
output: approximated solution to the min–max problem (10)
1 while $\|t_{p+1} - t_p\|_2 \ge \epsilon$ or $\|t'_{p+1} - t'_p\|_2 \ge \epsilon$ do
2     find $k \in [K]$ such that $P_{B_k}(\ell_{t_p} - \ell_{t'_p}) = \mathrm{MOM}_K(\ell_{t_p} - \ell_{t'_p})$
3     $t_{p+1} = \mathrm{prox}_{\lambda\|\cdot\|_1}\big(t_p + 2\eta_k X_k^\top(Y_k - X_k t_p)\big)$
4     find $k \in [K]$ such that $P_{B_k}(\ell_{t_{p+1}} - \ell_{t'_p}) = \mathrm{MOM}_K(\ell_{t_{p+1}} - \ell_{t'_p})$
5     $t'_{p+1} = \mathrm{prox}_{\lambda\|\cdot\|_1}\big(t'_p + 2\beta_k X_k^\top(Y_k - X_k t'_p)\big)$
6 end
7 Return $(t_p, t'_p)$
Algorithm 2: A “minmax MOM version” of ISTA
5.3. Proximal gradient descent algorithms. This section provides MOM versions of
ISTA (Iterative Shrinkage-Thresholding Algorithm) and its accelerated version FISTA. ISTA
and FISTA are proximal gradient descent algorithms for an objective function $\psi(t) = f(t) + g(t)$
with $f(t) = \|Y - Xt\|_2^2$ (convex and differentiable) and $g(t) = \lambda\|t\|_1$ (convex). ISTA alter-
nates between a descent in the direction of the gradient of $f$ and a projection through the
proximal operator of $g$, which, for the $\ell_1$-norm, is the soft-thresholding:
$$\text{(14)}\qquad t_{p+1} = \mathrm{prox}_{\lambda\|\cdot\|_1}\big(t_p + 2\gamma_p X^\top(Y - Xt_p)\big),$$
where $\mathrm{prox}_{\lambda\|\cdot\|_1}(t) = \big(\mathrm{sign}(t_j)\max(|t_j| - \lambda, 0)\big)_{j=1}^d$ for all $t = (t_j)_{j=1}^d \in \mathbb{R}^d$.
A natural “MOM version” of ISTA is given in Algorithm 2 by an alternating
method, where the step size sequences $(\eta_k)_k$ and $(\beta_k)_k$ may be chosen according to the
remarks below Algorithm 1 or chosen a posteriori.
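The proximal operator in (14) is a one-liner; the sketch below (our own) can be plugged into the median-block updates to obtain the steps of Algorithm 2:

```python
import numpy as np

def prox_l1(t, lam):
    """Soft-thresholding, i.e., the proximal operator of lam * ||.||_1."""
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

# one descent step of Algorithm 2 on the median block (Xk, Yk) with step size eta_k:
#     t = prox_l1(t + 2 * eta_k * Xk.T @ (Yk - Xk @ t), lam)
```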
6. Simulation study. This section provides an extensive simulation study based on the al-
gorithms of Section 5. In particular, their robustness and their convergence properties are
illustrated on simulated data. The algorithms depend on hyperparameters that need to be
tuned. Due to possible corruption, classical approaches relying on test samples cannot be
trusted. The section therefore starts by introducing a robust CV procedure based on the MOM
principle.
6.1. Adaptive choice of hyperparameters via MOM V-fold CV. MOM’s principles can be
combined with the idea of multiple splitting into training/test datasets in cross-validation.
Let $V \in [N]$ be such that $N$ can be divided by $V$. Let also $G_K \subset [N]$ and $G_\lambda \subset (0, 1]$. The
aim is to select an optimal number of blocks and an optimal regularization parameter within
both grids.
The dataset is split into $V$ disjoint blocks $D_1, \ldots, D_V$. For each $v \in [V]$, $\bigcup_{u\ne v} D_u$ is used
to train a family of estimators
$$\text{(15)}\qquad \big\{\hat f^{(v)}_{K,\lambda} : K \in G_K,\ \lambda \in G_\lambda\big\}.$$
The remaining block $D_v$ is used to test the performance of each estimator in the fam-
ily (15). Using this notation, we define a MOM version of the cross-validation procedure: for
each $(K, \lambda)$, the criterion (16) is the median, over $v \in [V]$, of the MOM estimator, built on the test blocks
$B_1^{(v)}, \ldots, B_{K'}^{(v)}$, of the risk of $\hat f^{(v)}_{K,\lambda}$, where $B_1^{(v)} \cup \cdots \cup B_{K'}^{(v)}$ is a partition of the test set $D_v$ into $K'$ blocks and $K' \in [N/V]$ is such
that $K'$ divides $N/V$.
The difference with standard V-fold CV is that empirical means in classical V-fold CV are
replaced by MOM estimators in (16). Moreover, the mean over all V splits in the classical
V -fold CV is replaced by a median.
The choice of $V$ raises the same issues for MOM CV as for classical $V$-fold CV [3, 5].
In the simulations, we use $V = 5$. The construction of MOM CV requires choosing another
parameter: $K'$, the number of blocks used to build the MOM criterion (16) over the test set. One
can choose $K' = K/V$ to make only one split of $D$ into $K$ blocks and use, for each round,
$(V - 1)K/V$ blocks to build the estimators (15) and $K/V$ blocks to test them.
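A sketch of the resulting selection rule (assuming, as described above, that the criterion (16) is the median over the $V$ folds of the MOM, over $K'$ test blocks, of the squared prediction errors; `fit_mom_lasso` stands for any of the estimators of Section 5 and is not defined here):

```python
import numpy as np

def mom_cv_select(X, Y, grid_K, grid_lam, fit_mom_lasso, V=5, seed=0):
    """MOM V-fold CV: return the (K, lambda) minimizing the MOM-CV criterion."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(Y)), V)
    best, best_crit = None, np.inf
    for K in grid_K:
        for lam in grid_lam:
            fold_scores = []
            for v in range(V):
                test = folds[v]
                train = np.concatenate([folds[u] for u in range(V) if u != v])
                t_hat = fit_mom_lasso(X[train], Y[train], K, lam)
                errs = (Y[test] - X[test] @ t_hat) ** 2
                Kp = min(max(1, K // V), len(errs))      # K' = K / V test blocks
                n = len(errs) - (len(errs) % Kp)
                block_means = errs[:n].reshape(Kp, -1).mean(axis=1)
                fold_scores.append(np.median(block_means))
            crit = np.median(fold_scores)                # median over the V splits
            if crit < best_crit:
                best, best_crit = (K, lam), crit
    return best
```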
In Figure 2, the hyperparameters $K$ (i.e., the number of blocks) and $\lambda$ (i.e., the regularization
parameter) have been chosen for MOM LASSO estimators via MOM $V$-fold CV. Only the
evolution of $\hat K$ as a function of the proportion of outliers has been depicted (the behavior of
the adaptively chosen regularization parameter is more erratic and may first require a
deeper understanding of CV in the classical i.i.d. setting before studying MOM CV in the $O \cup I$
framework). The adaptive $\hat K$ grows with the number of outliers, as expected, since the number
of blocks has to be at least twice the number of outliers. In particular, when there are no
outliers in the dataset, MOM CV selects $K = 1$, so the minmax MOM LASSO is the LASSO. The
algorithm learns that splitting the database is useless in the absence of outliers: LASSO is the
best choice among all minmax MOM LASSO estimators for $K \in [N/2]$.
6.2. Saddle-point, random blocks, outliers detection and depth. The aim of this section
is to show some advantages of choosing randomly the blocks at every (descent and ascent)
steps of the algorithm and how this modified version works on the example of ADMM. As a
byproduct, it is possible to define an outliers detection algorithm.
Let us first explain a problem of “local saddle points” in the case of fixed blocks. Minmax
MOM estimators are based on the observation that the oracle $f^*$ is a solution to the minmax
problem $f^* \in \operatorname{argmin}_{f\in F}\sup_{g\in F} P(\ell_f - \ell_g)$. Likewise, $f^*$ is a solution of the maxmin prob-
lem $f^* \in \operatorname{argmax}_{g\in F}\inf_{f\in F} P(\ell_f - \ell_g)$. One can also define the maxmin MOM estimator
$$\text{(17)}\qquad \hat g_{K,\lambda} \in \operatorname*{argmax}_{g\in F}\ \inf_{f\in F}\ T_{K,\lambda}(g, f).$$
Following the proofs of Section 6 in the Supplementary Material [36], one can prove the same
results for $\hat g_{K,\lambda}$ as for $\hat f_{K,\lambda}$ (see Section 7 in the Supplementary Material for a proof in small
dimension). However, $\hat g_{K,\lambda}$ and $\hat f_{K,\lambda}$ may differ since, in general,
$$\text{(18)}\qquad \operatorname*{argmin}_{f\in F}\ \sup_{g\in F}\ T_{K,\lambda}(g, f) \ne \operatorname*{argmax}_{g\in F}\ \inf_{f\in F}\ T_{K,\lambda}(g, f).$$
In other words, the duality gap may not be null. Since $T_{K,\lambda}(g, f) = -T_{K,\lambda}(f, g)$ for all
$f, g \in F$, equality holds in (18) if and only if
$$\inf_{f\in F}\ \sup_{g\in F}\ T_{K,\lambda}(f, g) = 0.$$
In that case, $\hat f$ is a saddle-point estimator and the minmax and maxmin estimators are equal. The
left-hand side of Figure 3 shows a simulation where this happens. The choice of fixed blocks
$B_1, \ldots, B_K$ may result in a problem of “local saddle points”: the algorithms may remain close
to suboptimal local saddle points. To see this, consider the vector case (i.e., $F = \{f(\cdot) =
\langle \cdot, t\rangle : t \in \mathbb{R}^d\}$) and introduce, for all $k \in [K]$,
$$\text{(19)}\qquad C_k = \big\{(t, t') \in \mathbb{R}^d \times \mathbb{R}^d : \mathrm{MOM}_K(\ell_t - \ell_{t'}) = P_{B_k}(\ell_t - \ell_{t'})\big\}.$$
The problem is that, if a cell $C_k$ contains a saddle point of $(t, t') \mapsto P_{B_k}(\ell_t - \ell_{t'}) +
\lambda(\|t\|_1 - \|t'\|_1)$, the algorithm gets stuck in that cell instead of looking for a “better” saddle
point in other cells.
To overcome this issue, the partition is chosen at random at every descent and ascent step
of the algorithm, so the decomposition into cells $C_1, \ldots, C_K$ changes at every step. As an
example, we develop the ADMM procedure with a random choice of blocks in Algorithm 4.
In Figure 3, MOM LASSO via ADMM is run with both fixed and changing blocks. Both
the objective function and the estimation error of MOM LASSO jump with fixed blocks.
These jumps correspond to a change of cell. The algorithm converges to local saddle
points before jumping to other cells, thanks to the regularization by the $\ell_1$-norm. On the other
hand, the algorithm with changing blocks does not suffer this drawback. Figure 3 shows that
the estimation error converges faster and more smoothly for changing blocks.

FIG. 4. Outlier detection algorithm. The dataset has been corrupted by 4 outliers at numbers 1, 32, 170 and 194. The score of the outliers is 0: they have not been selected even once.

The objective function of MOM ADMM with changing blocks converges to zero, so the duality gap con-
verges to zero. This gives a natural stopping criterion and shows that the minmax and maxmin
MOM LASSO are solutions of a saddle point problem even though the objective function is
not convex-concave.
A byproduct is an outlier detection procedure. Count the number of times each datum is
selected in the median blocks of steps 3 and 7 of Algorithm 4. At the end of the
algorithm (for instance, Algorithm 4), every datum ends up with a score revealing its central-
ity for the learning task. Aggressive outliers are likely to corrupt their respective blocks and
should therefore not be selected at steps 3 and 7 of Algorithm 4. With fixed blocks, infor-
mative data cannot be distinguished from outliers lying in the same block; therefore, this
outlier detection algorithm only makes sense when blocks are changing at every step. Fig-
ure 4 shows the performance of this strategy on synthetic data (cf. Section 6.3 for more details
on the simulations). Outliers (data 1, 32, 170 and 194) end up with a null score.
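A sketch of the score computation (our own simplification: the parameters are held fixed here, whereas in Algorithm 4 the counts are accumulated along the descent/ascent iterations with the current iterates):

```python
import numpy as np

def outlier_scores(X, Y, t, t_prime, K, n_steps=5000, seed=0):
    """Count how often each datum falls in the selected median block when the
    blocks are redrawn uniformly at random at every step."""
    rng = np.random.default_rng(seed)
    N = len(Y)
    counts = np.zeros(N, dtype=int)
    n = N - (N % K)
    for _ in range(n_steps):
        blocks = rng.permutation(N)[:n].reshape(K, -1)   # fresh random equipartition
        incr = [np.mean((Y[B] - X[B] @ t) ** 2 - (Y[B] - X[B] @ t_prime) ** 2)
                for B in blocks]
        counts[blocks[int(np.argsort(incr)[K // 2])]] += 1
    return counts   # aggressive outliers typically end up with a score near 0
```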
6.3. Simulations setup for the figures. All codes are available at [55] and can be used to
reproduce the figures. Many other simulations and algorithms can be found in [55].
6.3.1. Data generating process and corruption by outliers. The algorithms introduced in
Section 5 are tested on datasets corrupted by outliers of various forms in [55]. The basic set
of informative data is called D1 . The outliers are named D2 , D3 , D4 and D5 . These data are
merged and shuffled in the dataset D = D1 ∪ D2 ∪ D3 ∪ D4 ∪ D5 given to the algorithm.
1. The set $D_1$ of inliers contains $N_{\mathrm{good}}$ i.i.d. data $(X_i, Y_i)$ with common distribution given by
$$\text{(20)}\qquad Y = \langle X, t^*\rangle + \xi,$$
where $t^* \in \mathbb{R}^d$, $X \sim \mathcal{N}(0, I_{d\times d})$ and $\xi \sim \mathcal{N}(0, \sigma^2)$ is independent of $X$.
2. $D_2$ is a dataset of $N_{\mathrm{bad}-1}$ outliers $(X_i, Y_i)$ such that $Y_i = 1$ and $X_i = (1)_{j=1}^d$.
3. $D_3$ is a dataset of $N_{\mathrm{bad}-2}$ outliers $(X_i, Y_i)$ such that $Y_i = 10{,}000$ and $X_i = (1)_{j=1}^d$.
4. $D_4$ is a dataset of $N_{\mathrm{bad}-3}$ outliers $(X_i, Y_i)$ where $Y_i$ is a 0–1 Bernoulli random variable
and $X_i$ is uniformly distributed over $[0, 1]^d$.
5. $D_5$ is also a set of outliers, generated according to the linear model (20),
with the same target vector $t^*$ but a different choice of design $X$ and noise $\xi$: the design is
$X \sim \mathcal{N}(0, \Sigma)$ with $\Sigma = (\rho^{|i-j|})_{1\le i,j\le d}$ and $\xi$ is a heavy-tailed noise distributed according to
a Student distribution with various degrees of freedom.
The different types of outliers $D_j$, $j = 2, 3, 4, 5$, are useless to learn the oracle $t^*$; some are
not even independent nor random, as in $D_2$ and $D_3$.
6.3.2. Simulations setup for the figures. Let us now specify the parameters of the sim-
ulations in Figure 1 and Figure 2: the number of observations is $N = 200$, the number of
features is $d = 500$, $t^* \in \mathbb{R}^d$ has sparsity $s = 10$ and support chosen at random, with nonzero
coordinates $t_j^*$ either equal to 10, $-10$ or decreasing according to $\exp(-j/10)$. Infor-
mative data $D_1$, described in Section 6.3.1, have noise level $\sigma = 1$. This dataset is increasingly
corrupted with outliers from $D_3$.
The proportions of outliers are $0, 1/100, 2/100, \ldots, 15/100$. The ADMM algorithm is run
with adaptive $\lambda$ chosen by $V$-fold CV with $V = 5$ for the LASSO. Then MOM ADMM is
run with adaptive $K$ and $\lambda$ chosen by MOM CV with $V = 5$ and $K' = \max(\mathrm{grid}_K)/V$, where
$\mathrm{grid}_K = \{1, 4, \ldots, 115/4\}$ and $\mathrm{grid}_\lambda = \{0, 10, 20, \ldots, 100\}/\sqrt{100}$ are the search grids used
to select the best $K$ and $\lambda$ during the CV and MOM CV steps. The number of iterations of
ADMM and MOM ADMM is 200. Simulations have been run 70 times and the averaged
values of the estimation error and adaptive K̂ have been reported in Figure 1, Figure 5 and
Figure 2. The $\ell_2$ estimation error of the LASSO increases roughly from 0 when there is no out-
lier and stabilizes at 550 right after a single outlier enters the dataset. The value 550 comes
from the fact that $Y = 10{,}000$ and $X = (1)_{j=1}^{500}$ are such that the vector $t$ with minimal $\ell_1$-norm
among all the solutions of $Y = \langle X, t\rangle$ is $t^{**} = (20)_{j=1}^{500}$, and $\|t^{**} - t^*\|_2$ is approximately
550. This means that the LASSO is trying to fit a model on the single outlier instead of solv-
ing the linear problem associated with the 200 other informative data. A single outlier
therefore completely misleads the LASSO.
For Figure 3, we have run similar experiments with $N = 200$, $d = 300$, $s = 20$, $\sigma = 1$,
$K = 10$; the number of iterations was 500 and the regularization parameter was $1/\sqrt N$.
For Figure 4, we took $N = 200$, $d = 500$, $s = 20$, $\sigma = 1$; the number of outliers is $|O| = 4$
and the outliers are of the form $Y = 10{,}000$ and $X = (1)_{j=1}^d$, $K = 10$; the number of iterations
is 5,000 and $\lambda = 1/\sqrt{200}$.
FIG. 5. Estimation error versus proportion of outliers for LASSO and the minmax MOM LASSO.
SUPPLEMENTARY MATERIAL
Supplement to “Robust machine learning by median-of-means: Theory and practice”
(DOI: 10.1214/19-AOS1828SUPP; .pdf). Section 6 gives the proofs of the main results. These main results focus on the regu-
larized version of the MOM estimates of the increments presented in this Introduction that
are well suited for high-dimensional learning frameworks. We complete these results in Sec-
tion 7, providing results for the basic estimators without regularization in small dimension.
Finally, Section 8 provides minimax optimality results for our procedures.
REFERENCES
[1] A LON , N., M ATIAS , Y. and S ZEGEDY, M. (1999). The space complexity of approximating the frequency
moments. J. Comput. System Sci. 58 137–147. MR1688610 https://doi.org/10.1006/jcss.1997.1545
[2] A LQUIER , P., C OTTET, V. and L ECUÉ , G. (2019). Estimation bounds and sharp oracle inequalities
of regularized procedures with Lipschitz loss functions. Ann. Statist. 47 2117–2144. MR3953446
https://doi.org/10.1214/18-AOS1742
[3] A RLOT, S. and C ELISSE , A. (2010). A survey of cross-validation procedures for model selection. Stat. Surv.
4 40–79. MR2602303 https://doi.org/10.1214/09-SS054
[4] A RLOT, S. and C ELISSE , A. (2011). Segmentation of the mean of heteroscedastic data via cross-validation.
Stat. Comput. 21 613–632. MR2826696 https://doi.org/10.1007/s11222-010-9196-x
[5] A RLOT, S. and L ERASLE , M. (2016). Choice of V for V -fold cross-validation in least-squares density
estimation. J. Mach. Learn. Res. 17 Paper No. 208, 50. MR3595142
[6] AUDIBERT, J.-Y. and C ATONI , O. (2011). Robust linear least squares regression. Ann. Statist. 39 2766–
2794. MR2906886 https://doi.org/10.1214/11-AOS918
[7] BACH , F. R. (2010). Structured sparsity-inducing norms through submodular functions. In Advances in
Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing
Systems 2010. Proceedings of a Meeting Held 6–9 December 2010 118–126, Vancouver, BC.
ROBUST MACHINE LEARNING BY MEDIAN-OF-MEANS 929
[8] BARAUD , Y. (2011). Estimator selection with respect to Hellinger-type risks. Probab. Theory Related Fields
151 353–401. MR2834722 https://doi.org/10.1007/s00440-010-0302-y
[9] BARAUD , Y. and B IRGÉ , L. (2016). Rho-estimators for shape restricted density estimation. Stochastic Pro-
cess. Appl. 126 3888–3912. MR3565484 https://doi.org/10.1016/j.spa.2016.04.013
[10] BARAUD , Y., B IRGÉ , L. and S ART, M. (2017). A new method for estimation and model selection: ρ-
estimation. Invent. Math. 207 425–517. MR3595933 https://doi.org/10.1007/s00222-016-0673-5
[11] B ELLEC , P. C., L ECUÉ , G. and T SYBAKOV, A. B. (2016). Slope meets lasso: Improved oracle bounds and
optimality. Technical report, CREST, CNRS, Université Paris Saclay.
[12] B ICKEL , P. J., R ITOV, Y. and T SYBAKOV, A. B. (2009). Simultaneous analysis of lasso and Dantzig selec-
tor. Ann. Statist. 37 1705–1732. MR2533469 https://doi.org/10.1214/08-AOS620
[13] B IRGÉ , L. (1984). Stabilité et instabilité du risque minimax pour des variables indépendantes équidis-
tribuées. Ann. Inst. Henri Poincaré Probab. Stat. 20 201–223. MR0762855
[14] B IRGÉ , L. (2006). Model selection via testing: An alternative to (penalized) maximum likelihood estimators.
Ann. Inst. Henri Poincaré Probab. Stat. 42 273–325. MR2219712 https://doi.org/10.1016/j.anihpb.
2005.04.004
[15] B LANCHARD , G., B OUSQUET, O. and M ASSART, P. (2008). Statistical performance of support vector
machines. Ann. Statist. 36 489–531. MR2396805 https://doi.org/10.1214/009053607000000839
[16] B OGDAN , M., VAN DEN B ERG , E., S ABATTI , C., S U , W. and C ANDÈS , E. J. (2015). SLOPE—
adaptive variable selection via convex optimization. Ann. Appl. Stat. 9 1103–1140. MR3418717
https://doi.org/10.1214/15-AOAS842
[17] B ÜHLMANN , P. and VAN DE G EER , S. (2011). Statistics for High-Dimensional Data: Methods, Theory and
Applications. Springer Series in Statistics. Springer, Heidelberg. MR2807761 https://doi.org/10.1007/
978-3-642-20192-9
[18] C ATONI , O. (2012). Challenging the empirical mean and empirical variance: A deviation study. Ann. Inst.
Henri Poincaré Probab. Stat. 48 1148–1185. MR3052407 https://doi.org/10.1214/11-AIHP454
[19] D EAN , J. and G HEMAWAT, S. (2010). Mapreduce: A flexible data processing tool. Commun. ACM 53 72–
77.
[20] D EVROYE , L., L ERASLE , M., L UGOSI , G. and O LIVEIRA , R. I. (2016). Sub-Gaussian mean estimators.
Ann. Statist. 44 2695–2725. MR3576558 https://doi.org/10.1214/16-AOS1440
[21] E LSENER , A. and VAN DE G EER , S. (2018). Robust low-rank matrix estimation. Ann. Statist. 46 3481–
3509. MR3852659 https://doi.org/10.1214/17-AOS1666
[22] FAN , J., L I , Q. and WANG , Y. (2017). Estimation of high dimensional mean regression in the absence of
symmetry and light tail assumptions. J. R. Stat. Soc. Ser. B. Stat. Methodol. 79 247–265. MR3597972
https://doi.org/10.1111/rssb.12166
[23] G IRAUD , C. (2015). Introduction to High-Dimensional Statistics. Monographs on Statistics and Applied
Probability 139. CRC Press, Boca Raton, FL. MR3307991
[24] H AMPEL , F. R. (1971). A general qualitative definition of robustness. Ann. Math. Stat. 42 1887–1896.
MR0301858 https://doi.org/10.1214/aoms/1177693054
[25] H AMPEL , F. R. (1974). The influence curve and its role in robust estimation. J. Amer. Statist. Assoc. 69
383–393. MR0362657
[26] H AN , Q. and W ELLNER , J. A. (2017). A sharp multiplier inequality with applications to heavy-tailed
regression problems. Available at arXiv:1706.02410.
[27] H UBER , P. J. (1964). Robust estimation of a location parameter. Ann. Math. Stat. 35 73–101. MR0161415
https://doi.org/10.1214/aoms/1177703732
[28] H UBER , P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proc.
Fifth Berkeley Sympos. Math. Statist. and Probability (Berkeley, Calif., 1965/66), Vol. I: Statistics 221–
233. Univ. California Press, Berkeley, CA. MR0216620
[29] H UBER , P. J. and RONCHETTI , E. M. (2009). Robust Statistics, 2nd ed. Wiley Series in Probability and
Statistics. Wiley, Hoboken, NJ. MR2488795 https://doi.org/10.1002/9780470434697
[30] J ERRUM , M. R., VALIANT, L. G. and VAZIRANI , V. V. (1986). Random generation of combinatorial struc-
tures from a uniform distribution. Theoret. Comput. Sci. 43 169–188. MR0855970 https://doi.org/10.
1016/0304-3975(86)90174-X
[31] KOLTCHINSKII , V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery
Problems. Lecture Notes in Math. 2033. Springer, Heidelberg. MR2829871 https://doi.org/10.1007/
978-3-642-22147-7
[32] KOLTCHINSKII , V. and M ENDELSON , S. (2015). Bounding the smallest singular value of a random matrix
without concentration. Int. Math. Res. Not. IMRN 23 12991–13008. MR3431642 https://doi.org/10.
1093/imrn/rnv096
[33] L E C AM , L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer Series in Statistics.
Springer, New York. MR0856411 https://doi.org/10.1007/978-1-4612-4946-7
930 G. LECUÉ AND M. LERASLE
[34] L E C AM , L. (1973). Convergence of estimates under dimensionality restrictions. Ann. Statist. 1 38–53.
MR0334381
[35] L ECUÉ , G. and L ERASLE , M. (2017). Learning from mom’s principle: Le Cam’s approach. Technical re-
port, CNRS, ENSAE, Paris-sud, 2017. To appear in Stochastic Processes and their applications.
[36] L ECUÉ , G. and L ERASLE , M. (2020). Supplement to “Robust machine learning by median-of-means: The-
ory and practice.” https://doi.org/10.1214/19-AOS1828SUPP.
[37] L ECUÉ , G. and M ENDELSON , S. (2013). Learning subgaussian classes: Upper and minimax bounds Tech-
nical report, CNRS, Ecole polytechnique and Technion.
[38] L ECUÉ , G. and M ENDELSON , S. (2017). Regularization and the small-ball method II: Complexity depen-
dent error rates. J. Mach. Learn. Res. 18 Paper No. 146, 48. MR3763780
[39] L ECUÉ , G. and M ENDELSON , S. (2018). Regularization and the small-ball method I: Sparse recovery. Ann.
Statist. 46 611–641. MR3782379 https://doi.org/10.1214/17-AOS1562
[40] L ERASLE , M. and O LIVEIRA , R. I. (2011). Robust empirical mean estimators. Available at
arXiv:1112.3914.
[41] L OUNICI , K. (2008). Sup-norm convergence rate and sign concentration property of Lasso and Dantzig
estimators. Electron. J. Stat. 2 90–102. MR2386087 https://doi.org/10.1214/08-EJS177
[42] L UGOSI , G. and M ENDELSON , S. (2019). Regularization, sparse recovery, and median-of-means tourna-
ments. Bernoulli 25 2075–2106. MR3961241 https://doi.org/10.3150/18-BEJ1046
[43] L UGOSI , G. and M ENDELSON , S. (2019). Risk minimization by median-of-means tournaments. J. Eur.
Math. Soc. (JEMS). To appear.
[44] M ASSART, P. and N ÉDÉLEC , É. (2006). Risk bounds for statistical learning. Ann. Statist. 34 2326–2366.
MR2291502 https://doi.org/10.1214/009053606000000786
[45] M ASSIAS , M., F ERCOQ , O., G RAMFORT, A. and S ALMON , J. (2017). Generalized concomitant multi-task
lasso for sparse multimodal regression.
[46] M EINSHAUSEN , N. and B ÜHLMANN , P. (2006). High-dimensional graphs and variable selection with the
lasso. Ann. Statist. 34 1436–1462. MR2278363 https://doi.org/10.1214/009053606000000281
[47] M EINSHAUSEN , N. and Y U , B. (2009). Lasso-type recovery of sparse representations for high-dimensional
data. Ann. Statist. 37 246–270. MR2488351 https://doi.org/10.1214/07-AOS582
[48] M ENDELSON , S. (2014). Learning without concentration. In Proceedings of the 27th annual conference on
Learning Theory COLT14, 25–39.
[49] M ENDELSON , S. (2016). On multiplier processes under weak moment assumptions Technical report, Tech-
nion.
[50] M INSKER , S. (2015). Geometric median and robust estimation in Banach spaces. Bernoulli 21 2308–2335.
MR3378468 https://doi.org/10.3150/14-BEJ645
[51] M INSKER , S. and S TRAWN , N. (2019). Distributed statistical estimation and rates of convergence in normal
approximation. Electron. J. Statist. 13 5213–5252. MR4043072
[52] N EGAHBAN , S. N., R AVIKUMAR , P., WAINWRIGHT, M. J. and Y U , B. (2012). A unified framework for
high-dimensional analysis of M-estimators with decomposable regularizers. Statist. Sci. 27 538–557.
MR3025133 https://doi.org/10.1214/12-STS400
[53] N EMIROVSKY, A. S. and Y UDIN , D. B. (1983). Problem Complexity and Method Efficiency in Optimiza-
tion. A Wiley-Interscience Publication. Wiley, New York. MR0702836
[54] N ICKL , R. and VAN DE G EER , S. (2013). Confidence sets in sparse regression. Ann. Statist. 41 2852–2876.
MR3161450 https://doi.org/10.1214/13-AOS1170
[55] N OTEBOOK. Available at https://github.com/lecueguillaume/MOMpower.
[56] S AUMARD , A. (2018). On optimality of empirical risk minimization in linear aggregation. Bernoulli 24
2176–2203. MR3757527 https://doi.org/10.3150/17-BEJ925
[57] S U , W. and C ANDÈS , E. (2016). SLOPE is adaptive to unknown sparsity and asymptotically minimax. Ann.
Statist. 44 1038–1068. MR3485953 https://doi.org/10.1214/15-AOS1397
[58] T IBSHIRANI , R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58
267–288. MR1379242
[59] T UKEY, J. W. (1960). A survey of sampling from contaminated distributions. In Contributions to Probability
and Statistics 448–485. Stanford Univ. Press, Stanford, CA. MR0120720
[60] T UKEY, J. W. (1962). The future of data analysis. Ann. Math. Stat. 33 1–67. MR0133937 https://doi.org/10.
1214/aoms/1177704711
[61] VAN DE G EER , S. (2014). Weakly decomposable regularization penalties and structured sparsity. Scand. J.
Stat. 41 72–86. MR3181133 https://doi.org/10.1111/sjos.12032
[62] VAN DE G EER , S., B ÜHLMANN , P., R ITOV, Y. and D EZEURE , R. (2014). On asymptotically optimal
confidence regions and tests for high-dimensional models. Ann. Statist. 42 1166–1202. MR3224285
https://doi.org/10.1214/14-AOS1221
ROBUST MACHINE LEARNING BY MEDIAN-OF-MEANS 931
[63] VAN DE G EER , S. A. (2007). The deterministic lasso. Technical report, ETH Zürich. Available at http:
//www.stat.math.ethz.ch/~geer/lasso.pdf.
[64] VAN DE G EER , S. A. (2008). High-dimensional generalized linear models and the lasso. Ann. Statist. 36
614–645. MR2396809 https://doi.org/10.1214/009053607000000929
[65] VAPNIK , V. N. (1998). Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing,
Communications, and Control. Wiley, New York. MR1641250
[66] WANG , T. E., G U , Y., M EHTA , D., Z HAO , X. and B ERNAL , E. A. (2018). Towards robust deep neural
networks. Preprint. Available at arXiv:1810.11726.
[67] Z HANG , C.-H. and Z HANG , S. S. (2014). Confidence intervals for low dimensional parameters in
high dimensional linear models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 76 217–242. MR3153940
https://doi.org/10.1111/rssb.12026
[68] Z HAO , P. and Y U , B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res. 7 2541–2563.
MR2274449
[69] Z HOU , W.-X., S UN , Q. and FAN , J. (2017). Adaptive huber regression: Optimality and phase transition.
Preprint. Available at arXiv:1706.06991.