
SIAM J. OPTIM.
Vol. 34, No. 3, pp. 2973--3004

© 2024 Society for Industrial and Applied Mathematics

CONSENSUS-BASED OPTIMIZATION METHODS CONVERGE GLOBALLY\ast

MASSIMO FORNASIER\dagger, TIMO KLOCK\ddagger, AND KONSTANTIN RIEDL\dagger

Abstract. In this paper we study consensus-based optimization (CBO), which is a multiagent metaheuristic derivative-free optimization method that can globally minimize nonconvex nonsmooth functions and is amenable to theoretical analysis. Based on an experimentally supported intuition that, on average, CBO performs a gradient descent of the squared Euclidean distance to the global minimizer, we devise a novel technique for proving the convergence to the global minimizer in mean-field law for a rich class of objective functions. The result unveils internal mechanisms of CBO that are responsible for the success of the method. In particular, we prove that CBO performs a convexification of a large class of optimization problems as the number of optimizing agents goes to infinity. Furthermore, we improve prior analyses by requiring mild assumptions about the initialization of the method and by covering objectives that are merely locally Lipschitz continuous. As a core component of this analysis, we establish a quantitative nonasymptotic Laplace principle, which may be of independent interest. From the result of CBO convergence in mean-field law, it becomes apparent that the hardness of any global optimization problem is necessarily encoded in the rate of the mean-field approximation, for which we provide a novel probabilistic quantitative estimate. The combination of these results allows us to obtain probabilistic global convergence guarantees of the numerical CBO method.

Key words. global optimization, derivative-free optimization, nonsmoothness, nonconvexity, metaheuristics, consensus-based optimization, mean-field limit, Fokker--Planck equations

MSC codes. 65K10, 90C26, 90C56, 35Q90, 35Q84

DOI. 10.1137/22M1527805

1. Introduction. A long-standing problem in applied mathematics is the global minimization of a potentially nonconvex nonsmooth cost function $\scrE : \BbbR^d \rightarrow \BbbR$ and the search for an associated globally minimizing argument $v^\ast$. Throughout, we assume the unique existence of the minimizer $v^\ast$ and denote its associated minimal value by

$\underline{\scrE} := \scrE(v^\ast) = \inf_{v \in \BbbR^d} \scrE(v).$

The objective $\scrE$ is supposed to be locally Lipschitz continuous and to satisfy a tractability condition of the form $\|v - v^\ast\|_2 \leq (\scrE(v) - \underline{\scrE})^\nu / \eta$ in a neighborhood of $v^\ast$; see Assumption A2 for the details. While computing $\underline{\scrE}$ or $v^\ast$ is in general an NP-hard problem under such conditions, several instances arising in real-world scenarios can, at least empirically, be solved within reasonable accuracy and moderate computational time. In the present work we are concerned with the class of derivative-free optimization algorithms, i.e., methods that are based exclusively on the evaluation of the objective function $\scrE$. Among these, and achieving the state of the art on challenging problems such as the traveling salesman problem, are so-called metaheuristics [1, 4, 5, 42, 55].

\ast Received by the editors October 10, 2022; accepted for publication (in revised form) May 1, 2024; published electronically September 3, 2024. https://doi.org/10.1137/22M1527805
\dagger Department of Mathematics, School of Computation, Information and Technology, Technical University of Munich, and Munich Center for Machine Learning, Munich, Germany (massimo.fornasier@ma.tum.de, konstantin.riedl@ma.tum.de).
\ddagger Department of Numerical Analysis and Scientific Computing, Simula Research Laboratory, Oslo, Norway (tmklock@googlemail.com).


Metaheuristics orchestrate an interaction between local improvement procedures and global strategies and combine deterministic and random decisions to create a process capable of escaping from local optima and performing a robust search of the solution space. Examples include random search [54], evolutionary programming [24], the Metropolis--Hastings algorithm [33], genetic algorithms [35], particle swarm optimization [42], and simulated annealing [1]. Despite their tremendous empirical success and widespread use in practice, many metaheuristics, due to their complexity, lack a proper mathematical foundation that could prove robust convergence to global minimizers under suitable assumptions. Nevertheless, for some of them, such as random search or simulated annealing, there exist probabilistic guarantees for global convergence; see, e.g., [36, 63]. While transferring some of the ideas of [63] to particle swarm optimization allows us to establish guaranteed convergence to global minima, the proof argument uses a computational time coinciding with the time necessary to examine every location in the search space [66].
Recently, the authors of [10, 52] introduced consensus-based optimization (CBO) methods, which follow the guiding principles of metaheuristic algorithms, but are of much simpler nature and more amenable to theoretical analysis. Inspired by consensus dynamics and opinion formation, CBO methods use a finite number of agents $V^1, \ldots, V^N$, which are formally stochastic processes, to explore the domain and to form a global consensus about the location of the minimizer $v^\ast$ as time passes. The dynamics of the agents $V^1, \ldots, V^N$ are governed by two competing terms. A drift term drags each agent toward an instantaneous consensus point, denoted by $v_\alpha$, which is computed as a weighted average of all agents' positions and serves as a momentaneous proxy for the global minimizer $v^\ast$. This term may be deactivated individually for an agent if its position improves upon the consensus point, through modulating the drift by a function $H$ approximating the Heaviside function. The second term is stochastic and randomly diffuses agents according to a scaled Brownian motion in $\BbbR^d$, featuring the exploration of the energy landscape of the cost $\scrE$. Ideally, as a result of the described drift-diffusion mechanism, the agents eventually achieve a near-optimal global consensus, in the sense that the associated empirical measure $\widehat\rho^N_t := \frac{1}{N} \sum_{i=1}^N \delta_{V^i_t}$ converges to a Dirac delta $\delta_{\widetilde v}$ at some $\widetilde v \in \BbbR^d$ close to $v^\ast$.
Let us now provide a formal description of the method. Given a time horizon $T > 0$ and a time discretization $t_0 = 0 < \Delta t < \cdots < K\Delta t = T$ of $[0, T]$, we denote the location of agent $i$ at time $k\Delta t$ by $V^i_{k\Delta t}$, $k = 0, \ldots, K$. For user-specified parameters $\alpha, \lambda, \sigma > 0$, the time-discrete evolution of the $i$th agent is defined by the update rule

(1.1)  $V^i_{(k+1)\Delta t} - V^i_{k\Delta t} = -\Delta t\, \lambda \big( V^i_{k\Delta t} - v_\alpha(\widehat\rho^N_{k\Delta t}) \big) H\big( \scrE(V^i_{k\Delta t}) - \scrE(v_\alpha(\widehat\rho^N_{k\Delta t})) \big) + \sigma \big\| V^i_{k\Delta t} - v_\alpha(\widehat\rho^N_{k\Delta t}) \big\|_2 B^i_{k\Delta t},$
(1.2)  $V^i_0 \sim \rho_0 \quad\text{for all } i = 1, \ldots, N,$
i
where $((B^i_{k\Delta t})_{k=0,\ldots,K-1})_{i=1,\ldots,N}$ are independent and identically distributed (i.i.d.) Gaussian random vectors in $\BbbR^d$ with zero mean and covariance matrix $\Delta t\, \mathrm{Id}_d$. The system is complemented with independent initial data $(V^i_0)_{i=1,\ldots,N}$, distributed according to a common initial law $\rho_0$. Equation (1.1) originates from a simple Euler--Maruyama time discretization [34, 53] of the system of stochastic differential equations (SDEs)

(1.3)  $dV^i_t = -\lambda \big( V^i_t - v_\alpha(\widehat\rho^N_t) \big) H\big( \scrE(V^i_t) - \scrE(v_\alpha(\widehat\rho^N_t)) \big)\, dt + \sigma \big\| V^i_t - v_\alpha(\widehat\rho^N_t) \big\|_2\, dB^i_t,$
(1.4)  $V^i_0 \sim \rho_0 \quad\text{for all } i = 1, \ldots, N,$

where $((B^i_t)_{t\geq 0})_{i=1,\ldots,N}$ are now independent standard Brownian motions in $\BbbR^d$. As mentioned in the informal description above, the updates in the evolutions (1.1) and


(1.3) consist of two terms, respectively. The first term is the drift towards the momentaneous consensus $v_\alpha(\widehat\rho^N_t)$, which is defined by

(1.5)  $v_\alpha(\widehat\rho^N_t) := \int v\, \frac{\omega_\alpha(v)}{\|\omega_\alpha\|_{L^1(\widehat\rho^N_t)}}\, d\widehat\rho^N_t(v), \quad\text{with}\quad \omega_\alpha(v) := \exp(-\alpha \scrE(v)).$

Definition (1.5) is motivated by the well-known Laplace principle [21, 49, 52], which states that for any absolutely continuous probability distribution $\varrho$ on $\BbbR^d$, we have

(1.6)  $\lim_{\alpha\rightarrow\infty} \left( -\frac{1}{\alpha} \log\left( \int \omega_\alpha(v)\, d\varrho(v) \right) \right) = \inf_{v \in \mathrm{supp}(\varrho)} \scrE(v).$

Alternatively, we can also interpret (1.5) as an approximation of $\operatorname{arg\,min}_{i=1,\ldots,N} \scrE(V^i_t)$, which improves as $\alpha \rightarrow \infty$, provided the minimizer uniquely exists. The univariate function $H : \BbbR \rightarrow [0, 1]$ appearing in the first term of (1.1) and (1.3) can be used to deactivate the drift term for agents $V^i_t$ whose objective is better than that of the momentaneous consensus, i.e., for which $\scrE(V^i_t) < \scrE(v_\alpha(\widehat\rho^N_t))$, by setting $H(x) \approx 1_{x \geq 0}$. The most frequently studied choice, however, is $H \equiv 1$. The second term in (1.1) and (1.3) encodes the diffusion or exploration mechanism of the algorithm. Intuitively, scaling by $\|V^i_t - v_\alpha(\widehat\rho^N_t)\|_2$ encourages agents far from the consensus point to explore larger regions, whereas agents close to the consensus point try to enhance their position only locally. Furthermore, the scaling is essential to eventually deactivate the Brownian motion and to achieve consensus among the individual agents.
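To make the scheme concrete, the following minimal Python sketch implements one step of the update rule (1.1) with the consensus point (1.5) for the most studied choice $H \equiv 1$. The objective, parameters, and initialization below are illustrative stand-ins, not the paper's experimental setup; the weights $\omega_\alpha$ are computed with a log-sum-exp shift so that $\exp(-\alpha\scrE)$ does not underflow for large $\alpha$.

import numpy as np

def consensus_point(V, E, alpha):
    # v_alpha(rho^N) from (1.5): a softmin-weighted average of the agents.
    # Shifting the exponent by its maximum leaves the weight ratios
    # unchanged but avoids underflow for large alpha.
    logw = -alpha * E(V)
    w = np.exp(logw - logw.max())
    return (w[:, None] * V).sum(axis=0) / w.sum()

def cbo_step(V, E, alpha, lam, sigma, dt, rng):
    # One Euler--Maruyama update (1.1) for all N agents (rows of V),
    # with the cutoff H == 1, i.e., the drift is always active.
    vc = consensus_point(V, E, alpha)
    B = rng.normal(0.0, np.sqrt(dt), size=V.shape)   # increments ~ N(0, dt Id)
    dist = np.linalg.norm(V - vc, axis=1, keepdims=True)
    return V - dt * lam * (V - vc) + sigma * dist * B

# Illustrative usage on a placeholder objective with minimizer v* = 0.
E = lambda V: np.sum(V**2, axis=1)
rng = np.random.default_rng(0)
V = rng.normal(8.0, np.sqrt(20.0), size=(200, 2))    # (1.2): V_0^i ~ rho_0
for _ in range(1000):
    V = cbo_step(V, E, alpha=1e5, lam=1.0, sigma=0.1, dt=0.01, rng=rng)
print(consensus_point(V, E, 1e5))                     # should end up near 0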
CBO methods have been considered and analyzed in several recent papers [8, 10,
11, 12, 16, 25, 26, 27, 28, 40, 43, 65], even for optimization problems in high-dimensional
and non-Euclidean settings, and using more sophisticated rules for the parameter
choices of $\alpha$ and $\sigma$ inspired by simulated annealing [11, 26]. Moreover, several variants
of the dynamics have been proposed, such as ones integrating memory mechanisms
[57, 65] or others using jump-diffusion processes [40]. To make the method feasi-
ble and competitive for large-scale applications, in particular, for problems arising in
machine learning, random minibatch sampling techniques have been employed when
evaluating the objective function or computing the consensus point. This significantly
reduces the computational and communication complexity of CBO methods [11, 28]
and further enables the parallelization of the algorithm by evolving disjoint subsets of
particles independently for some time with separate consensus points, before aligning
the dynamics through a global communication step. However, despite bearing inter-
esting questions concerning the tradeoff between parallel efficiency and performance
when it comes to the relevance of communication between the individual agents, this
is a so far largely unexplored area for CBO. As an example for the applicability of
CBO to such high-dimensional problems, we refer the reader to [11, 28, 57] where the
method is used for training a shallow and a convolutional neural network for image
classification of the MNIST database of handwritten digits [44], to the recent paper
[13] where CBO is used in the setting of clustered federated learning, to [57] where a
compressed sensing problem is solved, or to the line of works [25, 26, 27] where (1.1)
and (1.3) are adapted to the sphere $\BbbS^{d-1}$, achieving near state-of-the-art performance on a phase retrieval problem, on a robust subspace detection problem, and when robustly
computing eigenfaces. Recently, also general constrained optimization problems have
been tackled by CBO through the use of penalization techniques, which allow one to
cast the constrained problem as an unconstrained optimization task [8, 12].
As initially mentioned, CBO methods are motivated by the urge to develop a
class of metaheuristic algorithms with provable guarantees, while preserving their


capabilities of escaping local minima through global optimization mechanisms. The main theoretical interest focuses on understanding when consensus formation $\widehat\rho^N_t \rightarrow \delta_{\widetilde v}$ occurs, and on quantitatively bounding the associated errors $\scrE(\widetilde v) - \underline{\scrE}$ and $\|\widetilde v - v^\ast\|_2$. A theoretical analysis of the dynamics can be done either on the microscopic systems (1.1) or (1.3), as for instance in [31, 32], or, as in [10, 52], by analyzing the macroscopic behavior of the agent density through a mean-field limit associated with the particle-based dynamics (1.3), given, for initial data $V_0 \sim \rho_0$, by

(1.7)  $dV_t = -\lambda \big( V_t - v_\alpha(\rho_t) \big) H\big( \scrE(V_t) - \scrE(v_\alpha(\rho_t)) \big)\, dt + \sigma \big\| V_t - v_\alpha(\rho_t) \big\|_2\, dB_t,$

where $\rho_t = \mathrm{Law}(V_t)$. The weak convergence of the microscopic system (1.3) to the mean-field limit (1.7), or, more precisely, of the empirical measure $\widehat\rho^N_t$ to $\rho_t$ as $N \rightarrow \infty$, has been shown recently in [37]; see also Remark 1.2 for additional details. This legitimates analyzing (1.7) in lieu of (1.3). The measure $\rho \in \scrC([0,T], \scrP(\BbbR^d))$ with $\rho_t = \rho(t) = \mathrm{Law}(V_t)$ satisfies the nonlinear nonlocal Fokker--Planck equation

(1.8)  $\partial_t \rho_t = \lambda\, \mathrm{div}\big( (v - v_\alpha(\rho_t))\, H(\scrE(v) - \scrE(v_\alpha(\rho_t)))\, \rho_t \big) + \frac{\sigma^2}{2} \Delta\big( \| v - v_\alpha(\rho_t) \|_2^2\, \rho_t \big)$
in a weak sense (see Definition 3.1). Leveraging this partial differential equation
(PDE), the authors of [10, 52] analyze the large time behavior of the particle density
$t \mapsto \rho_t$ instead of the microscopic systems (1.1) and (1.3). Studying the mean-field
limit (1.7) or (1.8) allows for agile deterministic calculus tools and typically leads to
stronger theoretical results, which characterize the average agent behavior through the
evolution of \rho . This analysis perspective is justified by the mean-field approximation,
which quantifies the convergence of the microscopic system (1.3) to the mean-field
limit (1.7) as the number of agents grows. We discuss results about the mean-field
approximation in Remark 1.2 and make it rigorous in Proposition 3.11. Hence, in view
of its validity and as already done in the preceding works [10, 52], in the first part
of the paper we concentrate on establishing convergence in mean-field law for (1.3),
as defined in Definition 1.1 below. That is, we analyze the mean-field dynamics (1.7)
and (1.8) in place of the interacting particle system (1.3). Afterwards, by combining
the mean-field approximation with convergence in mean-field law, we close the paper
with a global convergence result for the numerical method (1.1).
Definition 1.1 (convergence in mean-field law). Let $F, G : \scrP(\BbbR^d) \times \BbbR^d \rightarrow \BbbR^d$ be two functions, and consider for $i = 1, \ldots, N$ the SDEs expressed in It\^o's form as

$dV^i_t = F\big( \widehat\rho^N_t, V^i_t \big)\, dt + G\big( \widehat\rho^N_t, V^i_t \big)\, dB^i_t, \quad\text{where}\quad \widehat\rho^N_t = \frac{1}{N} \sum_{i=1}^N \delta_{V^i_t}, \quad\text{and}\quad V^i_0 \sim \rho_0.$

We say that this SDE system converges in mean-field law to $\widetilde v \in \BbbR^d$ if all solutions of

$dV_t = F\big( \rho_t, V_t \big)\, dt + G\big( \rho_t, V_t \big)\, dB_t, \quad\text{where}\quad \rho_t = \mathrm{Law}(V_t), \quad\text{and}\quad V_0 \sim \rho_0,$

satisfy $\lim_{t\rightarrow\infty} W_p(\rho_t, \delta_{\widetilde v}) = 0$ for some Wasserstein-$p$ distance $W_p$, $p \geq 1$.
Colloquially speaking, an interacting multiparticle system is said to converge in
mean-field law if the associated mean-field dynamics converges.
Remark 1.2 (mean-field approximation). The definition of convergence in mean-field law as given in Definition 1.1 is justified as follows: As the number of agents $N$ in the interacting particle system (1.3) tends to infinity, one expects that, for any particle $V^i$, the individual influence of any other particle disperses. This results in an averaged influence of the ensemble rather than an interacting nature of the system, and allows us to describe the dynamics in the large-particle limit by the law $\rho$ of the monoparticle process (1.7). This phenomenon is known as the mean-field approximation. More formally, as $N \rightarrow \infty$, we expect the empirical measure $\widehat\rho^N_t$ to converge in law to $\rho_t$ for almost every $t$; see [39, Definition 1]. The classical way to establish such a mean-field approximation is to prove, by means of the coupling method, propagation of chaos [47, 64], as implied, for instance, by

(1.9)  $\max_{i=1,\ldots,N}\; \sup_{t\in[0,T]} \BbbE \big\| V^i_t - \overline V^i_t \big\|_2^2 \leq C N^{-1},$

where $\overline V^i$ denote $N$ i.i.d. copies of the mean-field dynamics (1.7), which are coupled to the processes $V^i$ by choosing the same initial conditions as well as Brownian motion paths; see, e.g., the recent reviews [14, 15]. Despite being of fundamental numerical interest (since, when combined with the convergence in mean-field law, it allows us to establish convergence of the interacting particle system itself), a quantitative result about the mean-field approximation of CBO as in (1.9) has been left as a difficult and open problem in [10, Remark 3.3] due to a lack of global Lipschitz continuity of the drift and diffusion terms, which impedes the application of McKean's theorem [15, Theorem 3.1].
However, the present paper as well as three recent works, which we outline in what follows, shed light on this issue. By employing a compactness argument in the path space, the authors of [37] show that the empirical random particle measure $\widehat\rho^N$ associated with the dynamics (1.3) converges in distribution to the deterministic particle distribution $\rho \in \scrC([0,T], \scrP(\BbbR^d))$ satisfying (1.8). In particular, their result is valid for the unbounded functions $\scrE$ considered also in our work. While this does not allow for obtaining a quantitative convergence rate with respect to the number of particles $N$ as in (1.9), it closes the mean-field limit gap qualitatively. A desired quantitative result has been established recently in [25, Theorem 3.1 and Remark 3.1] for a variant of the microscopic system (1.3) supported on a compact hypersurface $\Gamma$. In [25] the weak convergence of the variant of (1.3) to the corresponding mean-field limit is established in the sense that for all $\phi \in \scrC^1_b(\BbbR^d)$ it holds that

$\sup_{t\in[0,T]} \BbbE\Big[ \big| \langle \widehat\rho^N_t, \phi \rangle - \langle \rho_t, \phi \rangle \big|^2 \Big] \leq \frac{C}{N}\, \|\phi\|_{\scrC^1(\BbbR^d)}^2 \rightarrow 0 \quad\text{as}\quad N \rightarrow \infty.$

The obtained convergence rate reads as $CN^{-1}$ with $C$ depending in particular on

$C_\alpha := \exp\Big( \alpha \Big( \sup_{v\in\Gamma} \scrE(v) - \inf_{v\in\Gamma} \scrE(v) \Big) \Big).$

Their proof is based on the aforementioned coupling method and, by exploiting the inherent compactness of the dynamics due to its confinement to $\Gamma$, allows them to derive a bound of the form (1.9). Leveraging the techniques from [25] and the boundedness of moments established in [10, Lemma 3.4], we provide in Proposition 3.11 a result of the form (1.9) on the plane $\BbbR^d$ which holds with high probability. A more refined analysis conducted recently by the authors of [29], which adapts Sznitman's classical argument for the proof of McKean's theorem with the intention of allowing for coefficients which are not globally Lipschitz, even yields a nonprobabilistic mean-field approximation of the form (1.9) in the pathwise sense, requiring in comparison merely a higher moment bound $\rho_0 \in \scrP_6(\BbbR^d)$ on the initial measure; see [29, Theorem 2.6].
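The coupling construction behind (1.9) can also be probed numerically. The sketch below is illustrative code, not the proof technique of Proposition 3.11 or [29]: it evolves the interacting system (1.3) and $N$ coupled "mean-field" copies driven by the same Brownian increments, standing in for the unknown law $\rho_t$ in (1.7) with a large independent reference ensemble of size $M \gg N$. The empirical mean-square deviation then mimics the left-hand side of (1.9); all parameters and the objective are made up for illustration.

import numpy as np

rng = np.random.default_rng(3)
d, N, M, dt, steps = 2, 50, 20_000, 0.01, 300
lam, sigma, alpha = 1.0, 0.3, 50.0
E = lambda V: np.sum(V**2, axis=1)          # placeholder objective

def v_alpha(V):
    logw = -alpha * E(V)
    w = np.exp(logw - logw.max())            # stable weights omega_alpha
    return (w[:, None] * V).sum(axis=0) / w.sum()

def step(X, vc, dB):                         # Euler--Maruyama step of (1.3)/(1.7)
    return X - dt * lam * (X - vc) + \
        sigma * np.linalg.norm(X - vc, axis=1, keepdims=True) * dB

V    = rng.normal(3.0, 1.0, size=(N, d))     # interacting system (1.3)
Vbar = V.copy()                              # coupled mean-field copies (1.7)
Ref  = rng.normal(3.0, 1.0, size=(M, d))     # large ensemble as proxy for rho_t

worst = 0.0
for _ in range(steps):
    B    = rng.normal(0.0, np.sqrt(dt), size=(N, d))   # shared increments
    Bref = rng.normal(0.0, np.sqrt(dt), size=(M, d))
    vc_N, vc_mf = v_alpha(V), v_alpha(Ref)
    V, Vbar, Ref = step(V, vc_N, B), step(Vbar, vc_mf, B), step(Ref, vc_mf, Bref)
    # crude stand-in for max_i sup_t E||V_t^i - Vbar_t^i||^2: average over i
    worst = max(worst, float(np.mean(np.sum((V - Vbar) ** 2, axis=1))))

print(worst)   # repeating this for several N should exhibit an O(1/N) trend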


Such quantitative mean-field approximation results substantiate the focus of the first part of this work on the analysis of the macroscopic mean-field dynamics (1.7) and (1.8). Nevertheless, as a consequence thereof, we return to the analysis of the numerical scheme (1.1) and its global convergence in section 3.3.


Contributions. In this work we unveil the surprising phenomenon that, in the mean-field limit, for a rich class of objectives $\scrE$, the individual agents of the CBO dynamics follow the gradient flow associated with the function $v \mapsto \|v - v^\ast\|_2^2$, on average over all realizations of Brownian motion paths; see Figure 1. Interestingly, this gradient flow is independent of the underlying energy landscape of $\scrE$. In other words, CBO performs a canonical convexification of a large class of optimization problems as the number of optimizing agents $N$ goes to infinity. Based on these observations and justified by the mean-field approximation, we first develop a novel proof framework for showing the convergence of the CBO dynamics in mean-field law to the global minimizer $v^\ast$ for a rich class of objectives. While previous analyses in [10, 31, 32] required restrictive concentration conditions about the initial measure $\rho_0$ and $\scrC^2$ regularity of the objective, we derive results that are valid under mild assumptions about $\rho_0$ and local Lipschitz continuity of $\scrE$. We explain the key differences of this work with respect to previous work in detail in section 2 and further showcase the benefits of the proposed analysis by a numerical example. These findings reveal that the hardness of any global optimization problem is necessarily encoded in the rate of the mean-field approximation as $N \rightarrow \infty$. Second, in consideration of its central significance with regard to the computational complexity of the numerical scheme (1.1), we establish a novel probabilistic quantitative result about the convergence of the interacting particle system (1.3) to the corresponding mean-field limit (1.8), which is a result of independent interest. By combining these two results, the convergence in mean-field law on the one hand and the quantitative mean-field approximation on the other,
Fig. 1. An illustration of the internal mechanisms of CBO. (a) The Rastrigin function $\scrE$ and an exemplary initialization for one run of the experiment. (b) Individual agents follow, on average, the gradient flow of the map $v \mapsto \|v - v^\ast\|_2^2$. We perform 100 runs of the CBO algorithm (1.1)--(1.2), with parameters $\Delta t = 0.01$, $\alpha = 10^{15}$, $\lambda = 1$, and $\sigma = 0.1$, and $N = 32000$ agents initialized according to $\rho_0 = \scrN((8,8), 20)$. In addition, we add three individual agents with starting locations $(-2, 4)$, $(-1.5, -1.5)$, and $(4.5, 1.5)$ to the set of agents in each run as shown in (a), and depict each of their 100 trajectories as well as their mean trajectory in shades of yellow in (b). With the (mean) trajectories being rather straight lines, we observe that the individual agents take a straight path from their initial positions to the global minimizer $v^\ast$ and, in particular, disregard the local landscape of the objective function $\scrE$. The trajectories of the individual agents become more concentrated as the overall number of agents $N$ grows. (Color figure available online.)


we provide the first, and so far unique, holistic convergence proof of CBO on the
plane, enabling us to quantify the optimization capability of the numerical CBO al-
gorithm (1.1) in terms of the used parameters. The utilized proof technique may be

used as a blueprint for proving global convergence for other recent adaptations of the
CBO dynamics; see, e.g., [8, 11, 26, 27, 28, 40], as well as other metaheuristics such as
the renowned particle swarm optimization, which is related to CBO through a zero-
inertia limit; see, e.g., [20, 30, 38]. While the present paper has a foundational and
theoretical nature and aims at completely clarifying the convergence of the numerical
scheme (1.1) with a detailed analysis, we do not include extensive numerical exper-
iments. For numerical evidence that CBO solves difficult optimization problems, also in high dimensions, without necessarily incurring the curse of dimensionality, the reader
may want to consult previous works such as [11, 13, 16, 26, 27, 28, 57].
Remark 1.3. Employing stochasticity and leveraging collaboration between mul-
tiple agents to empirically and provably achieve global convergence of numerical al-
gorithms and to avoid convergence to local minima is not just of particular relevance
when it comes to the efficiency and success of zero-order methods, but also an emerg-
ing paradigm in the field of gradient-based optimization; see, e.g., [18, 22, 46]. Recent
work [58, 59] even suggests a connection between the worlds of derivative-free and
gradient-based methods. Similar guiding principles are present also in sampling meth-
ods, such as Langevin sampling [17, 18, 23, 60] and Stein variational gradient descent
[45], which are designed to generate samples from an unknown target distribution.
A promising way to gain a theoretical understanding of the behavior of these
classes of algorithms is by taking a mean-field perspective, i.e., by analyzing the
dynamics, as the number of particles goes to infinity, through an associated PDE. This
typically involves Polyak--Łojasiewicz-like conditions [41] or certain families of log-
Sobolev inequalities [18] on the objective function \scrE , which are more restrictive than
the assumptions under which the statements of this work hold. For a recent analysis
of the mean-field Langevin dynamics we refer the reader to [18] and references therein.
Conceptually similar to the convexification of a highly nonconvex problem ob-
served in this work, taking a mean-field perspective has recently allowed the authors
of [19, 48, 61, 62] to explain the generalization capabilities of overparameterized neural
networks. By leveraging the fact that the mean-field description (w.r.t. the number of
neurons) of the SGD learning dynamics is captured by a nonlinear PDE, which admits a gradient flow structure on $(\scrP_2(\BbbR^d), W_2)$, these works show that original complexities
of the loss landscape are alleviated. Together with a quantification of the fluctuations
of the empirical neuron distribution around this mean-field limit (i.e., a mean-field ap-
proximation), convergence results are derived for SGD for sufficiently large networks
with optimal generalization error. These results, however, are structurally different
from the ones obtained in this paper for CBO. In particular, the individual particles in
[19, 48, 61, 62] are associated with the different neurons of a two-layer or deep neural
network and the objective function is a specific empirical risk, which itself is subject
to the mean-field limit and gains convexity as the number of neurons tends to infinity.
In contrast, in our setting each particle itself is a competitor for minimization of a
general fixed nonconvex objective function \scrE and the convexification of the problem
emerges from the CBO dynamics when its mean-field limit behavior is studied. For
this reason, the two resulting mean-field limits are different.
Let us further point out that, as far as the community could understand up to
now, the Fokker--Planck equation (1.8) describing the mean-field behavior of CBO cannot be understood as a gradient flow of any energy on $(\scrP_2(\BbbR^d), W_2)$. Yet, and


perhaps surprisingly, the analysis of our present paper shows that the Wasserstein-2
distance from the global minimizer is the correct Lyapunov functional to be analyzed.

1.1. Organization. In section 2 we first discuss state-of-the-art global convergence results for CBO methods with a detailed account of the utilized proof technique,
including potential weaknesses. The second part of section 2 then motivates an alter-
native proof strategy and explains how it can remedy the weaknesses of prior proofs
under minimalistic assumptions. In section 3 we first provide additional details about
the well-posedness of the macroscopic SDE (1.7), respectively, the Fokker--Planck
equation (1.8), before presenting and discussing the main result about the convergence
of the dynamics (1.7) and (1.8) to the global minimizer in mean-field law. In order
to demonstrate the relevance of such statement in establishing a holistic convergence
guarantee for the numerical scheme (1.1), we conclude the section with a probabilistic
quantitative result about the mean-field approximation. Sections 4 and 5 comprise
the proof details of the convergence result in mean-field law and the result about the
mean-field approximation, respectively. Section 6 concludes the paper.
1.2. Notation. Euclidean balls are denoted by $B_r(u) := \{ v \in \BbbR^d : \|v - u\|_2 \leq r \}$. For the space of continuous functions $f : X \rightarrow Y$ we write $\scrC(X, Y)$, with $X \subset \BbbR^n$ and a suitable topological space $Y$. For an open set $X \subset \BbbR^n$ and for $Y = \BbbR^m$ the spaces $\scrC^k_c(X, Y)$ and $\scrC^k_b(X, Y)$ contain functions $f \in \scrC(X, Y)$ that are $k$-times continuously differentiable and have compact support or are bounded, respectively. We omit $Y$ in the real-valued case. The operators $\nabla$ and $\Delta$ denote the gradient and Laplace operator of a function on $\BbbR^d$. The main objects of study are laws of stochastic processes, $\rho \in \scrC([0,T], \scrP(\BbbR^d))$, where the set $\scrP(\BbbR^d)$ contains all Borel probability measures over $\BbbR^d$. With $\rho_t \in \scrP(\BbbR^d)$ we refer to a snapshot of such a law at time $t$. In the case when we refer to some fixed distribution, we write $\varrho$. Measures $\varrho \in \scrP(\BbbR^d)$ with finite $p$th moment $\int \|v\|_2^p\, d\varrho(v)$ are collected in $\scrP_p(\BbbR^d)$. For any $1 \leq p < \infty$, $W_p$ denotes the Wasserstein-$p$ distance between two Borel probability measures $\varrho_1, \varrho_2 \in \scrP_p(\BbbR^d)$; see, e.g., [2]. $\BbbE(\varrho)$ denotes the expectation of a probability measure $\varrho$.
2. Blueprints for the analysis of CBO methods. In this section we provide
intuitive descriptions of two approaches to the analysis of the convergence of CBO
methods to global minimizers. We first recall [10], and the related works [31, 32], which prove convergence as a consequence of a monotonous decay of the variance of $\rho_t$ and by employing the asymptotic Laplace principle (1.6). This proof strategy incurs a restrictive condition on the parameters $\alpha, \lambda, \sigma$ and the initial configuration $\rho_0$, which implies that a small optimization gap $\scrE(\widetilde v) - \scrE(v^\ast)$ can only be achieved for initial configurations $\rho_0$ already well concentrated near the optimizer $v^\ast$. We then motivate an alternative proof idea to remedy this weakness, based on the intuition that $\rho_t$ monotonically minimizes the squared Euclidean distance to the global minimizer $v^\ast$.
2.1. State of the art: Variance-based convergence analysis. We now recall the blueprint proof strategy from [10], which has been adapted in other works, e.g., [26, 31, 32], to prove consensus formation and convergence to the global minimum.
A successful application of the CBO framework rests on the premise that the induced particle density $\rho_t$ converges to a Dirac delta $\delta_{\widetilde v}$ for some $\widetilde v$ close to $v^\ast$. The analysis in [10] proves this under certain assumptions by first showing that $\rho_t$ converges to a Dirac delta around some $\widetilde v \in \BbbR^d$ and then concluding $\widetilde v \approx v^\ast$ in a subsequent step. Regarding the first step, the authors of [10] study the variance of $\rho_t$, defined as $\mathrm{Var}(\rho_t) := \frac{1}{2} \int \|v - \BbbE(\rho_t)\|_2^2\, d\rho_t(v)$, where $\BbbE(\rho_t) := \int v\, d\rho_t(v)$, and show that $\mathrm{Var}(\rho_t)$ decays exponentially fast in $t$ under a well-preparedness assumption about


the initial condition $\rho_0$. More precisely, in [10, section 4.1] the authors use It\^o's lemma to derive for the time evolution of $\mathrm{Var}(\rho_t)$ the expression

(2.1)  $\frac{d}{dt} \mathrm{Var}(\rho_t) = -\big( 2\lambda - d\sigma^2 \big) \mathrm{Var}(\rho_t) + \frac{d\sigma^2}{2}\, \|\BbbE(\rho_t) - v_\alpha(\rho_t)\|_2^2.$
For parameter choices $2\lambda > d\sigma^2$, the first term in (2.1) is negative and one could almost apply Grönwall's inequality to obtain the asserted exponential decay of $\mathrm{Var}(\rho_t)$. However, the second term can be problematic and the main difficulty is to control the distance $\|\BbbE(\rho_t) - v_\alpha(\rho_t)\|_2$ between the mean and the weighted mean. For $\alpha \rightarrow 0$ the weight function $\omega_\alpha(v) = \exp(-\alpha \scrE(v))$ associated with $v_\alpha(\rho_t)$ converges to 1 pointwise and consequently $v_\alpha(\rho_t) \rightarrow \BbbE(\rho_t)$. However, the second proof step, explained below, reveals that the crucial regime is $\alpha \gg 1$. In this case $v_\alpha(\rho_t)$ can be arbitrarily far from $\BbbE(\rho_t)$ if no additional knowledge about the probability measure $\rho_t$ is available. To restrict the set of probability measures $\rho_t$ that need to be considered when bounding $\|\BbbE(\rho_t) - v_\alpha(\rho_t)\|_2$, the authors of [10] compromise to assume that the initial distribution $\rho_0$ satisfies the well-preparedness assumptions
(2.2)  $\alpha e^{-2\alpha \underline{\scrE}} \big( \sigma^2 + 2\lambda \big) < 3/8 \quad\text{and}\quad 2\lambda \|\omega_\alpha\|_{L^1(\rho_0)}^2 - \mathrm{Var}(\rho_0) - 2d\sigma^2 \|\omega_\alpha\|_{L^1(\rho_0)}\, e^{-\alpha \underline{\scrE}} \geq 0.$

Since $\rho_t$ evolves from $\rho_0$ according to the Fokker--Planck equation (1.8), these conditions restrict $\rho_t$ and allow for bounding $\|\BbbE(\rho_t) - v_\alpha(\rho_t)\|_2$ by a suitable multiple of $\mathrm{Var}(\rho_t)$. The exponential decay of $\mathrm{Var}(\rho_t)$ then follows from (2.1) after applying Grönwall's inequality; see [10, Theorem 4.1]. Furthermore, the conditions in (2.2) also allow for proving convergence of $\rho_t$ to a stationary Dirac delta at $\widetilde v \in \BbbR^d$.
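Schematically, the Grönwall step can be spelled out as follows (a sketch, assuming the bound that the well-preparedness conditions (2.2) are designed to deliver): if $\frac{d\sigma^2}{2} \|\BbbE(\rho_t) - v_\alpha(\rho_t)\|_2^2 \leq c\, \mathrm{Var}(\rho_t)$ for some $c < 2\lambda - d\sigma^2$, then (2.1) gives

$\frac{d}{dt} \mathrm{Var}(\rho_t) \leq -\big( 2\lambda - d\sigma^2 - c \big)\, \mathrm{Var}(\rho_t), \quad\text{and hence}\quad \mathrm{Var}(\rho_t) \leq \mathrm{Var}(\rho_0)\, e^{-(2\lambda - d\sigma^2 - c)\, t}$

by Grönwall's inequality.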
Given convergence to a Dirac delta at $\widetilde v$, in a second step it is shown that $\scrE(\widetilde v) \approx \scrE(v^\ast)$. In order to prove this approximation, one first deduces that for any $\varepsilon > 0$, there exists $\alpha \gg 1$ such that for all $t \geq 0$ it holds that $-\frac{1}{\alpha} \log\big( \|\omega_\alpha\|_{L^1(\rho_t)} \big) \leq -\frac{1}{\alpha} \log\big( \|\omega_\alpha\|_{L^1(\rho_0)} \big) + \frac{\varepsilon}{2}$. This involves deriving a lower bound for the evolution $\frac{d}{dt} \|\omega_\alpha\|_{L^1(\rho_t)}$ for sufficiently large $\alpha > 0$ as done in [10, Lemma 4.1], which is then combined with the formerly proven exponentially decaying variance; see [10, Proof of Theorem 4.2]. Then, by recognizing that the Laplace principle (1.6) implies the existence of some $\alpha \gg 1$ with

(2.3)  $-\frac{1}{\alpha} \log\big( \|\omega_\alpha\|_{L^1(\rho_0)} \big) - \underline{\scrE} < \frac{\varepsilon}{2},$

and by establishing the convergence $\|\omega_\alpha\|_{L^1(\rho_t)} \rightarrow \exp(-\alpha \scrE(\widetilde v))$ as $t \rightarrow \infty$, one obtains the desired result $\scrE(\widetilde v) - \underline{\scrE} < \varepsilon$ in the limit $t \rightarrow \infty$; see [10, Lemma 4.2]. The gap $\scrE(\widetilde v) - \underline{\scrE}$ can be tightened by increasing $\alpha$, but it is impossible to establish an explicit relation $\alpha = \alpha(\varepsilon)$ due to the use of the asymptotic Laplace principle.
This proof sketch unveils a tension in the role of the parameter $\alpha$. Namely, the second step requires large $\alpha = \alpha(\varepsilon)$ to achieve $\scrE(\widetilde v) - \underline{\scrE} < \varepsilon$. In fact, $\alpha(\varepsilon)$ may grow uncontrollably as we decrease the accuracy $\varepsilon$. The first step, however, requires the conditions in (2.2) which, in the most optimistic case, where $\sigma = 0$, imply

(2.4)  $\mathrm{Var}(\rho_0) \leq \frac{3}{8\alpha} \left( \int \exp\big( -\alpha (\scrE(v) - \underline{\scrE}) \big)\, d\rho_0(v) \right)^2.$

Therefore, $\rho_0$ needs to be increasingly concentrated as $\alpha$ increases, and should ideally be supported on sets where $\scrE(v) \approx \underline{\scrE}$. Designing such a distribution $\rho_0$ in practice seems impossible in the absence of a good initial guess for $v^\ast$. In particular, we cannot expect (2.4) to hold for generic choices such as a uniform distribution on a compact set.


We add that the works [31, 32] conduct a similarly flavored analysis for the fully time-discretized microscopic system (1.1), with some differences in the details. They first show an exponentially decaying variance under mild assumptions about $\lambda$ and $\sigma$, but provided that the same Brownian motion is used for all agents, i.e., $(B^i_{k\Delta t})_{k=1,\ldots,K} = (B_{k\Delta t})_{k=1,\ldots,K}$ for all $i = 1, \ldots, N$. Such a choice leads to a less explorative dynamics, but it simplifies the consensus formation analysis. For proving $\scrE(\widetilde v) \approx \underline{\scrE}$, however, the authors again require an initial configuration $\rho_0$ that satisfies a technical concentration condition like (2.3); see for instance [32, Remark 3.1].
2.2. Alternative approach: CBO minimizes the squared distance to $v^\ast$. The approach described in the previous section might suggest that CBO only converges locally, which is in fact not what is observed in practice. Instead, global optimization is actually expected. To remedy the locality requirements of the variance-based analysis, let us now sketch and motivate an alternative proof idea. By averaging out the randomness associated with different realizations of Brownian motion paths, the macroscopic time-continuous SDE (1.7), in the case $H \equiv 1$, becomes

(2.5)  $\frac{d}{dt} \BbbE\big[ V_t \,\big|\, V_0 \big] = -\lambda\, \BbbE\big[ \big( V_t - v_\alpha(\rho_t) \big) \,\big|\, V_0 \big] = -\lambda\, \BbbE\big[ \big( V_t - v^\ast \big) \,\big|\, V_0 \big] + \lambda \big( v_\alpha(\rho_t) - v^\ast \big).$

Furthermore, if $\scrE$ is locally Lipschitz continuous and satisfies the coercivity condition

(2.6)  $\|v - v^\ast\|_2 \leq \frac{1}{\eta} \big( \scrE(v) - \scrE(v^\ast) \big)^\nu = \frac{1}{\eta} \big( \scrE(v) - \underline{\scrE} \big)^\nu \quad\text{for all } v \in \BbbR^d$

for some $\eta > 0$ and $\nu \in (0, \infty)$, the second term on the right-hand side of (2.5) can be made arbitrarily small for sufficiently large $\alpha$, i.e., $v_\alpha(\rho_t) \approx v^\ast$ (more details follow below). In this case, the average dynamics of $V_t$ is well approximated by

(2.7)  $\frac{d}{dt} \BbbE\big[ V_t \,\big|\, V_0 \big] \approx -\lambda\, \BbbE\big[ \big( V_t - v^\ast \big) \,\big|\, V_0 \big],$

which corresponds to the gradient flow of $v \mapsto \|v - v^\ast\|_2^2$ with rate $2\lambda$. In other words, each individual agent essentially performs a gradient descent of $v \mapsto \|v - v^\ast\|_2^2$ on average over all realizations of Brownian motion paths. Figure 1(b) visualizes this phenomenon for three isolated agents on the Rastrigin function in two dimensions.
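As a schematic aside (not part of the original argument), solving the linear ODE (2.7) explicitly shows why the mean trajectories in Figure 1(b) are straight lines:

$\frac{d}{dt} \BbbE[V_t \,|\, V_0] \approx -\lambda \big( \BbbE[V_t \,|\, V_0] - v^\ast \big) \quad\Longrightarrow\quad \BbbE[V_t \,|\, V_0] \approx v^\ast + e^{-\lambda t} (V_0 - v^\ast),$

so the conditional mean travels along the line segment from $V_0$ to $v^\ast$ at exponential speed, regardless of the landscape of $\scrE$ in between.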
Inspired by this observation, our proof strategy is to show that CBO methods successively minimize the energy functional $\scrV : \scrP(\BbbR^d) \rightarrow \BbbR_{\geq 0}$, given by

(2.8)  $\scrV(\rho_t) := \frac{1}{2} \int \|v - v^\ast\|_2^2\, d\rho_t(v).$

Note that this functional essentially coincides with the Wasserstein distance in the sense that $W_2^2(\rho_t, \delta_{v^\ast}) = 2\scrV(\rho_t)$, since the only coupling between $\rho_t$ and the Dirac measure $\delta_{v^\ast}$ is the product measure. Therefore $\scrV(\rho_t) \rightarrow 0$ in particular implies that $\rho_t$ converges weakly to $\delta_{v^\ast}$; see [2, Chapter 7].
This novel approach does not suffer from a tension on the parameter $\alpha$ like the variance-based analysis from the previous section. Roughly speaking (see Lemma 4.1 for details), $\scrV(\rho_t)$ follows an evolution similar to (2.1), with $\mathrm{Var}(\rho_t)$ being replaced by $\scrV(\rho_t)$. However, using $\|v - v_\alpha(\rho_t)\|_2^2 \leq 2\|v - v^\ast\|_2^2 + 2\|v^\ast - v_\alpha(\rho_t)\|_2^2$, we can now bound $\int \|v - v_\alpha(\rho_t)\|_2^2\, d\rho_t(v) \leq 4\scrV(\rho_t) + 2\|v_\alpha(\rho_t) - v^\ast\|_2^2$, so that it just remains to control the second term. In comparison to bounding $\|v_\alpha(\rho_t) - \BbbE(\rho_t)\|_2^2$ in terms of $\mathrm{Var}(\rho_t)$ for the variance-based analysis, this requires bounding $\|v_\alpha(\rho_t) - v^\ast\|_2^2$ in terms of $\scrV(\rho_t)$. Fortunately, this is a much easier

Fig. 2. (a) The Rastrigin function as objective function $\scrE$ and the squared Euclidean distance from $v^\ast$. (b) The evolution of the variance $\mathrm{Var}(\widehat\rho^N_t)$ and the functional $\scrV(\widehat\rho^N_t)$ for different initial conditions $\rho_0 = \scrN(\mu, 0.8)$ with $\mu \in \{1, 2, 3, 4\}$. The measure $\widehat\rho^N_t$ is the empirical agent density that is evolved using (1.1) with $N = 320000$ agents, discrete time step size $\Delta t = 0.01$, and parameters $\alpha = 10^{15}$, $\lambda = 1$, and $\sigma = 0.5$. As we move the mean of the initial configuration $\rho_0$ away from the global optimizer $v^\ast = 0$, and thereby push $v^\ast$ into the tails of $\rho_0$, $\mathrm{Var}(\widehat\rho^N_t)$ increases in the starting phase of the dynamics. $\scrV(\widehat\rho^N_t)$, on the other hand, always decreases exponentially at a rate $(2\lambda - d\sigma^2)$, independently of the initial condition $\rho_0$.

task: the Laplace principle generally asserts $\|v_\alpha(\rho_t) - v^\ast\|_2 \rightarrow 0$ under (2.6) as $\alpha \rightarrow \infty$, and we can even establish (see Proposition 4.5 for details) the quantitative estimate

$\|v_\alpha(\varrho) - v^\ast\|_2 \leq \frac{(2Lr)^\nu}{\eta} + \frac{\exp(-\alpha L r)}{\varrho(B_r(v^\ast))} \int \|v - v^\ast\|_2\, d\varrho(v)$

for an arbitrary probability measure $\varrho$, assuming that $\scrE$ is $L$-Lipschitz in a ball of radius $r > 0$. This allows us to estimate $\|v_\alpha(\rho_t) - v^\ast\|_2^2$ in terms of $\scrV(\rho_t)$ as desired.
Finally, we note that $\scrV(\rho_t)$ majorizes $\mathrm{Var}(\rho_t)$ because $u \mapsto \frac{1}{2} \int \|v - u\|_2^2\, d\rho_t(v)$ is minimized by the expectation $\BbbE(\rho_t)$. This relation may be a source of concern, as it shows that proving $\scrV(\rho_t) \rightarrow 0$ implies $\mathrm{Var}(\rho_t) \rightarrow 0$. We emphasize, however, that this does not imply a majorization for the corresponding time derivatives. In fact, Example 2.1 suggests that $\scrV(\rho_t)$ can decay exponentially while $\mathrm{Var}(\rho_t)$ increases initially.
Example 2.1. We consider the Rastrigin function $\scrE(v) = v^2 + 2.5(1 - \cos(2\pi v))$ with global minimum at $v^\ast = 0$ and various local minima; see Figure 2(a). For different initial configurations $\rho_0 = \scrN(\mu, 0.8)$ with $\mu \in \{1, 2, 3, 4\}$, we evolve the discretized system (1.1) using $N = 320000$ agents, discrete time step size $\Delta t = 0.01$, and parameters $\alpha = 10^{15}$ (i.e., the consensus point is the $\operatorname{arg\,min}$ of the agents), $\lambda = 1$, and $\sigma = 0.5$. By considering different means from $\mu = 1$ to $\mu = 4$, we push the global minimizer $v^\ast$ into the tails of the initial configuration $\rho_0$. Figure 2(b) shows that the decreasing initial probability mass around $v^\ast$ eventually causes the variance $\mathrm{Var}(\widehat\rho^N_t)$ (dashed lines) to increase in the beginning of the dynamics. In contrast, $\scrV(\widehat\rho^N_t)$ always decays exponentially fast with convergence speed $(2\lambda - d\sigma^2)$, independently of the initial condition $\rho_0$. From a theoretical perspective, this means that proving global convergence using a variance-based analysis as in section 2.1 must require assumptions about $\rho_0$ such as condition (2.4), whereas using $\scrV(\rho_t)$ does not suffer from this issue. The convergence speed $(2\lambda - d\sigma^2)$ coincides with the result in Theorem 3.7.
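A minimal one-dimensional sketch of this experiment (with a smaller ensemble than in the text, and with the consensus point taken directly as the agents' arg min, which is what $\alpha = 10^{15}$ effectively does) could look as follows:

import numpy as np

rng = np.random.default_rng(2)
N, dt, lam, sigma, steps = 5_000, 0.01, 1.0, 0.5, 400
E = lambda v: v**2 + 2.5 * (1.0 - np.cos(2.0 * np.pi * v))
v_star, d = 0.0, 1

V = rng.normal(4.0, np.sqrt(0.8), size=N)        # rho_0 = N(4, 0.8)
for k in range(steps):
    v_alpha = V[np.argmin(E(V))]                  # consensus point for huge alpha
    B = rng.normal(0.0, np.sqrt(dt), size=N)
    V = V - dt * lam * (V - v_alpha) + sigma * np.abs(V - v_alpha) * B
    if k % 100 == 0:
        var  = 0.5 * np.mean((V - V.mean())**2)   # Var(rho_t^N)
        calV = 0.5 * np.mean((V - v_star)**2)     # V(rho_t^N) from (2.8)
        print(f"t={k*dt:5.2f}  Var={var:.3e}  V={calV:.3e}")
# V should decay roughly like exp(-(2*lam - d*sigma**2) * t), while
# Var may first increase when v* sits in the tails of rho_0.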
3. Global convergence of CBO. In the first part of this section we recall and extend well-posedness results about the nonlinear macroscopic SDE (1.7) and the associated Fokker--Planck equation (1.8). At the beginning of the second part we
introduce the class of studied objective functions, which is followed by the presentation
of the main result about the convergence of the dynamics (1.7) and (1.8) to the global

minimizer in mean-field law. In the final part we then highlight the relevance of this
result by presenting a holistic convergence proof of the numerical scheme (1.1) to the
global minimizer. This combines the latter statement with a probabilistic quantitative
result about the mean-field approximation.
3.1. Definition of weak solutions and well-posedness. We begin by rigorously defining weak solutions of the Fokker--Planck equation (1.8).
Definition 3.1. Let $\rho_0 \in \scrP(\BbbR^d)$, $T > 0$. We say $\rho \in \scrC([0,T], \scrP(\BbbR^d))$ satisfies the Fokker--Planck equation (1.8) with initial condition $\rho_0$ in the weak sense in the time interval $[0,T]$ if we have for all $\phi \in \scrC^\infty_c(\BbbR^d)$ and all $t \in (0,T)$

(3.1)  $\frac{d}{dt} \int \phi(v)\, d\rho_t(v) = -\lambda \int H(\scrE(v) - \scrE(v_\alpha(\rho_t)))\, \langle v - v_\alpha(\rho_t), \nabla\phi(v) \rangle\, d\rho_t(v) + \frac{\sigma^2}{2} \int \|v - v_\alpha(\rho_t)\|_2^2\, \Delta\phi(v)\, d\rho_t(v)$

and $\lim_{t\rightarrow 0} \rho_t = \rho_0$ pointwise.
If the cutoff function H in the dynamics (1.7) is inactive, i.e., satisfies H \equiv 1, the
authors of [10] prove the following well-posedness result.
Theorem 3.2 ([10, Theorems 3.1, 3.2]). Let $T > 0$, $\rho_0 \in \scrP_4(\BbbR^d)$. Let $H \equiv 1$, and consider $\scrE : \BbbR^d \rightarrow \BbbR$ with $\underline{\scrE} > -\infty$, which, for constants $C_1, C_2 > 0$, satisfies

(3.2)  $|\scrE(v) - \scrE(w)| \leq C_1 (\|v\|_2 + \|w\|_2) \|v - w\|_2 \quad\text{for all } v, w \in \BbbR^d,$
(3.3)  $\scrE(v) - \underline{\scrE} \leq C_2 (1 + \|v\|_2^2) \quad\text{for all } v \in \BbbR^d.$

If, in addition, either $\sup_{v\in\BbbR^d} \scrE(v) < \infty$, or $\scrE$ satisfies for some constants $C_3, C_4 > 0$

(3.4)  $\scrE(v) - \underline{\scrE} \geq C_3 \|v\|_2^2 \quad\text{for all } \|v\|_2 \geq C_4,$

then there exists a unique nonlinear process V \in \scrC ([0, T ], \BbbR d ) satisfying (1.7) in the
strong sense. The associated law \rho = Law(V ) has regularity \rho \in \scrC ([0, T ], \scrP 4 (\BbbR d )) and
is a weak solution to the Fokker--Planck equation (1.8).
Remark 3.3. The regularity $\rho \in \scrC([0,T], \scrP_4(\BbbR^d))$ stated in Theorem 3.2, and also obtained in Theorem 3.4 below, is a consequence of the regularity of the initial condition $\rho_0 \in \scrP_4(\BbbR^d)$. Although it is not indicated explicitly in [10, Theorems 3.1, 3.2], it follows from their proofs. In particular, it allows for extending the test function space $\scrC^\infty_c(\BbbR^d)$ in Definition 3.1. Namely, if $\rho \in \scrC([0,T], \scrP_4(\BbbR^d))$ solves (1.8) in the weak sense, identity (3.1) holds for all $\phi \in \scrC^2(\BbbR^d)$ with (i) $\sup_{v\in\BbbR^d} |\Delta\phi(v)| < \infty$, and (ii) $\|\nabla\phi(v)\|_2 \leq C(1 + \|v\|_2)$ for some $C > 0$ and for all $v \in \BbbR^d$. We denote the corresponding function space by $\scrC^2_\ast(\BbbR^d)$.
Under minor modifications of the proof for Theorem 3.2, we can extend the existence of solutions to an active Lipschitz continuous cutoff function $H$.
Theorem 3.4. Let $H \not\equiv 1$ be $L_H$-Lipschitz continuous. Then, under the assumptions of Theorem 3.2, there exists a nonlinear process $V \in \scrC([0,T], \BbbR^d)$ satisfying (1.7) in the strong sense. The associated law $\rho = \mathrm{Law}(V)$ has regularity $\rho \in \scrC([0,T], \scrP_4(\BbbR^d))$
and is a weak solution to the Fokker--Planck equation (1.8).


3.2. Global convergence in mean-field law. We now present the main result
about global convergence in mean-field law for objectives satisfying the following.

Definition 3.5 (assumptions). Throughout we are interested in objective functions $\scrE \in \scrC(\BbbR^d)$, for which
A1  there exists $v^\ast \in \BbbR^d$ such that $\scrE(v^\ast) = \inf_{v\in\BbbR^d} \scrE(v) =: \underline{\scrE}$, and
A2  there exist $\scrE_\infty, R_0, \eta > 0$ and $\nu \in (0, \infty)$ such that

(3.5)  $\|v - v^\ast\|_2 \leq (\scrE(v) - \underline{\scrE})^\nu / \eta \quad\text{for all } v \in B_{R_0}(v^\ast),$
(3.6)  $\scrE(v) - \underline{\scrE} > \scrE_\infty \quad\text{for all } v \in \big( B_{R_0}(v^\ast) \big)^c.$

Furthermore, for the case $H \not\equiv 1$, we additionally require that $\scrE$ fulfills a local Lipschitz continuity-like condition, i.e.,
A3  there exist $L_\scrE > 0$ and $\gamma \geq 0$ such that

(3.7)  $\scrE(v) - \underline{\scrE} \leq L_\scrE (1 + \|v - v^\ast\|_2^\gamma) \|v - v^\ast\|_2 \quad\text{for all } v \in \BbbR^d.$

Remark 3.6. The analyses in [10] and related works require $\scrE \in \scrC^2(\BbbR^d)$ and an additional boundedness assumption on the Laplacian $\Delta\scrE$. We relax these regularity requirements and use the conditions in Definition 3.5 on $\scrE$ instead.
Assumption A1 just states that the continuous objective $\scrE$ attains its infimum $\underline{\scrE}$ at some $v^\ast \in \BbbR^d$. The continuity itself can be further relaxed at the cost of additional technical details because it is only required in a small neighborhood of $v^\ast$.
Assumption A2 should be interpreted as a tractability condition on the landscape of $\scrE$ around $v^\ast$ and in the farfield. The first part, equation (3.5), describes the local coercivity of $\scrE$, which implies that there is a unique minimizer $v^\ast$ on $B_{R_0}(v^\ast)$ and that $\scrE$ grows like $v \mapsto \|v - v^\ast\|_2^{1/\nu}$. This condition is also known as the inverse continuity condition from [26], as a quadratic growth condition in the case $\nu = 1/2$ from [3, 50], or as the Hölderian error bound condition in the case $\nu \in (0,1]$ [6]. In [50, Theorem 4] and [41, Theorem 2] many equivalent or stronger conditions are identified to imply (3.5) globally on $\BbbR^d$. Furthermore, in [26, 67], (3.5) is shown to hold globally for objectives related to various machine learning problems. The second part of A2, equation (3.6), describes the behavior of $\scrE$ in the farfield and prevents $\scrE(v) \approx \underline{\scrE}$ for some $v \in \BbbR^d$ far away from $v^\ast$. We introduce it for the purpose of covering functions that tend to a constant just above $\scrE_\infty$ as $\|v\|_2 \rightarrow \infty$, because such functions do not satisfy the growth condition (3.5) globally. However, whenever (3.5) holds globally, we take $R_0 = \infty$, i.e., $B_{R_0}(v^\ast) = \BbbR^d$, and (3.6) is void. We also note that (3.5) and (3.6) imply the uniqueness of the global minimizer $v^\ast$ on $\BbbR^d$.
Finally, to cover the active cutoff case $H \not\equiv 1$, we additionally require A3. The condition is weaker than local Lipschitz continuity on any compact ball around $v^\ast$, with the Lipschitz constant growing with the size of the ball.
We are now ready to state the main result. The proof is deferred to section 4.
Theorem 3.7. Let \scrE \in \scrC (\BbbR d ) satisfy A1--A2. Moreover, let \rho 0 \in \scrP 4 (\BbbR d ) be
such that v \ast \in supp(\rho 0 ). Define \scrV (\rho t ) as given in (2.8). Fix any \varepsilon \in (0, \scrV (\rho 0 )) and
\vargamma \in (0, 1), choose parameters \lambda , \sigma > 0 with 2\lambda > d\sigma 2 , and define the time horizon
\biggl( \biggr)
1 \scrV (\rho 0 )
(3.8) T \ast := \bigl( \bigr) log .
(1 - \vargamma ) 2\lambda - d\sigma 2 \varepsilon

Then there exists \alpha 0 > 0, depending (among problem-dependent quantities) on \varepsilon
and \vargamma , such that for all \alpha > \alpha 0 , if \rho \in \scrC ([0, T \ast ], \scrP 4 (\BbbR d )) is a weak solution to the

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.


2986 M. FORNASIER, T. KLOCK, AND K. RIEDL

Fokker--Planck equation (1.8) on the time interval [0, T \ast ] with initial condition \rho 0 ,
we have

(3.9) \scrV(\rho_T) = \varepsilon \quad \text{with } T \in \left[\frac{1 - \vargamma}{(1 + \vargamma/2)}\,T^\ast,\ T^\ast\right].
Furthermore, on the time interval [0, T ], \scrV (\rho t ) decays at least exponentially fast. More
precisely, for all t \in [0, T ], it holds that
(3.10) W_2^2(\rho_t, \delta_{v^\ast}) = 2\scrV(\rho_t) \leq 2\scrV(\rho_0)\exp\bigl(-(1 - \vargamma)\bigl(2\lambda - d\sigma^2\bigr)t\bigr).

If \scrE additionally satisfies A3, the same conclusion holds for any cutoff function H : \BbbR \rightarrow [0, 1] that satisfies H(x) = 1 whenever x \geq 0.
The assumption v \ast \in supp(\rho 0 ) about the initial configuration \rho 0 is not really a
restriction, as it would in any case hold immediately for \rho t for any t > 0 in view of the
diffusive character of the dynamics (1.8); see Remark 4.7. Additionally, as we clarify
in the next section, this condition neither means nor requires that, for finite particle
approximations, some particle needs to be in the vicinity of the minimizer v \ast at time
t = 0. It is actually sufficient that the empirical measure \widehat\rho^N_t weakly approximates the
law \rho t uniformly in time. We rigorously explain this mechanism in section 3.3.
A lower bound on the rate of convergence in (3.10) is (1 - \vargamma )(2\lambda - d\sigma 2 ), which
can be made arbitrarily close to the numerically observed rate (2\lambda - d\sigma 2 ) (see, e.g.,
Figure 2(b)) at the cost of taking \alpha \rightarrow \infty to allow for \vargamma \rightarrow 0. The condition 2\lambda > d\sigma 2
is necessary, both in theory and practice, to avoid overwhelming the dynamics by
the random exploration term. The dependency on d can be eased by replacing the
isotropic Brownian motion in the dynamics with an anisotropic one [11, 28].
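To complement this discussion with a concrete experiment, the following minimal Python sketch (our own illustration: the quadratic toy objective, all parameter values, and the stabilizing shift of the weights are assumptions made purely for demonstration) simulates the Euler--Maruyama discretization (1.1) of the isotropic CBO dynamics with H \equiv 1 and reports the empirical counterpart of \scrV(\rho_t), whose decay can be compared with the predicted factor \exp(-(2\lambda - d\sigma^2)t).

```python
import numpy as np

# Illustrative parameters chosen such that 2*lam > d*sigma**2.
d, N = 2, 200                        # dimension, number of agents
lam, sigma, alpha = 1.0, 0.5, 50.0   # drift, noise, and weight parameters
dt, K = 0.01, 800                    # step size, number of iterations
v_star = np.zeros(d)                 # global minimizer of the toy objective

def E(V):
    # toy objective: squared Euclidean distance to v_star
    return np.sum((V - v_star) ** 2, axis=1)

rng = np.random.default_rng(0)
V = rng.uniform(1.0, 3.0, size=(N, d))           # initialization away from v_star

for _ in range(K):
    w = np.exp(-alpha * (E(V) - E(V).min()))     # weights, shifted for stability
    v_alpha = w @ V / w.sum()                    # consensus point
    diff = V - v_alpha
    # Euler-Maruyama step of the isotropic dynamics (1.1) with H == 1
    V = V - lam * dt * diff \
          + sigma * np.sqrt(dt) * np.linalg.norm(diff, axis=1, keepdims=True) \
          * rng.standard_normal((N, d))

V_emp = 0.5 * np.mean(np.sum((V - v_star) ** 2, axis=1))   # empirical V(rho_T)
print(f"empirical V(rho_T) = {V_emp:.3e}")
print(f"predicted decay factor = {np.exp(-(2 * lam - d * sigma**2) * K * dt):.3e}")
```

In this toy setting the empirical decay factor typically tracks the mean-field prediction closely; note, however, that the finite particle system only approximates the mean-field law in the sense made precise in section 3.3.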
3.3. Global convergence in probability. To stress the relevance of the main
result of this paper, Theorem 3.7, we now show how estimate (3.10) plays a funda-
mental role in establishing a quantitative convergence result for the numerical scheme
(1.1) to the global minimizer v \ast . By paying the price of having a probabilistic state-
ment about the convergence of CBO as in Theorem 3.8, we gain provable polynomial
complexity. For simplicity, we present the results of this section for the case of an
inactive cutoff function, i.e., H \equiv 1.
Theorem 3.8. Fix \varepsilon_{\mathrm{total}} > 0 and \delta \in (0, 1/2). Then, under the assumptions of Theorem 3.7 and Proposition 3.11, and with K := T/\Delta t, where T is as in (3.9), the iterations \bigl((V^i_{k\Delta t})_{k=0,\dots,K}\bigr)_{i=1,\dots,N} generated by the numerical scheme (1.1) converge in probability to v^\ast. More precisely, the empirical mean of the final iterations fulfills

(3.11) \left\|\frac{1}{N}\sum_{i=1}^N V^i_{K\Delta t} - v^\ast\right\|_2^2 \leq \varepsilon_{\mathrm{total}}

with probability larger than 1 - \bigl(\delta + \varepsilon_{\mathrm{total}}^{-1}\bigl(6C_{\mathrm{NA}}(\Delta t)^{2m} + 3C_{\mathrm{MFA}}N^{-1} + 12\varepsilon\bigr)\bigr). Here, m denotes the order of accuracy of the numerical scheme (for the Euler--Maruyama scheme m = 1/2) and \varepsilon is the error from Theorem 3.7. Moreover, besides problem-dependent constants, C_{\mathrm{NA}} > 0 depends linearly on the dimension d and the number of particles N, exponentially on the time horizon T, and on \delta^{-1}; C_{\mathrm{MFA}} > 0 depends exponentially on the parameters \alpha, \lambda, and \sigma, on T, and on \delta^{-1}.
Let us briefly discuss in the following remark the computational complexity of
the numerical scheme (1.1) together with some implementational aspects which allow
us to reduce the overall runtime of the algorithm in practice.


Remark 3.9 (computational complexity). To achieve estimate (3.11) with probability of at least (1 - 2\delta), the implementable CBO scheme (1.1) has to be run using N \geq 9C_{\mathrm{MFA}}/(\delta\varepsilon_{\mathrm{total}}) agents and with time step size \Delta t \leq \sqrt[2m]{\delta\varepsilon_{\mathrm{total}}/(18C_{\mathrm{NA}})} for

K \geq \frac{1}{(1 - \vargamma)\bigl(2\lambda - d\sigma^2\bigr)\Delta t}\log\left(\frac{36\,\scrV(\rho_0)}{\delta\varepsilon_{\mathrm{total}}}\right)
iterations. Here, the parameter dependence of C\mathrm{N}\mathrm{A} and C\mathrm{M}\mathrm{F}\mathrm{A} is as described in
Theorem 3.8. The computational complexity (counted in terms of the number of
evaluations of the objective \scrE ) of the CBO method is therefore given by \scrO (KN ).
When working in the setting of large-scale applications arising, for instance, in
machine learning and signal processing (therefore, with \scrE being expensive to com-
pute), several considerations allow one to reduce the overall runtime of the algorithm
(1.1) and thereby make the method feasible and more competitive. First, it may be
recommended to leverage that the evaluations of the objective function \scrE for each of
the N particles can be performed in parallel. Furthermore, random minibatch sam-
pling ideas as proposed in [11, 28] may be employed when evaluating the objective
function and/or computing the consensus point. That is, at each time step, \scrE is eval-
uated only on a random subset of the available data, and v\alpha is computed only from a
subset of the N particles. Besides immediately reducing the computational and com-
munication complexity of CBO methods, such ideas motivate communication-efficient
parallelization of the algorithm by evolving disjoint subsets of particles independently
for some time with separate consensus points, before aligning the dynamics through a
global communication step. This, however, is so far largely unexplored, from both a
theoretical and a practical point of view. Lastly, taking inspiration from genetic algo-
rithms, a variance-based particle reduction technique as suggested in [26] may be used
to reduce the number of optimizing agents (and therefore the required evaluations of
\scrE ) during the algorithm in case concentration of the particles is observed.
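As a concrete illustration of the random minibatch idea mentioned above, the following Python sketch (a minimal sketch of ours; the function name, the default batch size, and the stabilizing weight shift are illustrative assumptions and not the specific schemes of [11, 28]) computes the consensus point from a random subset of the N particles only, reducing the per-iteration number of objective evaluations from N to the batch size.

```python
import numpy as np

def consensus_point(V, objective, alpha, batch_size=None, rng=None):
    """Compute the consensus point v_alpha from the particle array V of shape (N, d).

    If batch_size is given, only a random subset of the particles is used,
    in the spirit of the random minibatch ideas of [11, 28]."""
    rng = rng if rng is not None else np.random.default_rng()
    if batch_size is not None and batch_size < len(V):
        V = V[rng.choice(len(V), size=batch_size, replace=False)]
    energies = objective(V)
    w = np.exp(-alpha * (energies - energies.min()))  # shifted for stability
    return w @ V / w.sum()

# usage sketch: v_alpha = consensus_point(V, E, alpha=50.0, batch_size=32)
```

Since the objective evaluations inside objective(V) are independent across particles, they can additionally be dispatched in parallel, which combines naturally with the subsampling above.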
The proof of Theorem 3.8, which we report below, combines our main result
about the convergence in mean-field law, a quantitative mean-field approximation, and
classical results of numerical approximation of SDEs. To this end, we establish in what
follows the result about the quantitative mean-field approximation on a restricted set
of bounded processes. For this purpose, let us introduce the common probability
space (\Omega , \scrF , \BbbP ) over which all considered stochastic processes get their realizations,
and define a subset \Omega M of \Omega of suitably bounded processes according to
\Omega_M := \left\{\omega \in \Omega \,:\, \sup_{t\in[0,T]} \frac{1}{N}\sum_{i=1}^N \max\Bigl\{\bigl\|V^i_t(\omega)\bigr\|_2^4,\ \bigl\|\overline{V}^i_t(\omega)\bigr\|_2^4\Bigr\} \leq M\right\}.

Throughout this section, M > 0 denotes a constant which we shall adjust at the
end of the proof of Theorem 3.8. Before stating the mean-field approximation result,
Proposition 3.11, let us estimate the measure of the set \Omega M in Lemma 3.10. The
proofs of both statements are deferred to section 5.
Lemma 3.10. Let T > 0, \rho_0 \in \scrP_4(\BbbR^d) and let N \in \BbbN be fixed. Moreover, let ((V^i_t)_{t\geq 0})_{i=1,\dots,N} denote the strong solution to system (1.3) and let ((\overline{V}^i_t)_{t\geq 0})_{i=1,\dots,N} be N independent copies of the strong solution to the mean-field dynamics (1.7). Then, under the assumptions of Theorem 3.2, for any M > 0 we have

(3.12) \BbbP(\Omega_M) = \BbbP\left(\sup_{t\in[0,T]}\frac{1}{N}\sum_{i=1}^N \max\Bigl\{\bigl\|V^i_t\bigr\|_2^4,\ \bigl\|\overline{V}^i_t\bigr\|_2^4\Bigr\} \leq M\right) \geq 1 - \frac{2K}{M},


where K = K(\lambda , \sigma , d, T, b1 , b2 ) is a constant, which is in particular independent of N .


Here, b1 and b2 denote the problem-dependent constants from [10, Lemma 3.3].

Lemma 3.10 proves that the processes are bounded with high probability uni-
formly in time. Therefore, by restricting the analysis to \Omega M , we can obtain the
following quantitative mean-field approximation result by proving pointwise propa-
gation of chaos through the coupling method [14, 15] using a synchronous coupling
between the stochastic processes V^i and \overline{V}^i; see, e.g., [14, section 4.1.2].
Proposition 3.11. Let T > 0, \rho_0 \in \scrP_4(\BbbR^d) and let N \in \BbbN be fixed. Moreover, let ((V^i_t)_{t\geq 0})_{i=1,\dots,N} denote the strong solution to system (1.3) and let ((\overline{V}^i_t)_{t\geq 0})_{i=1,\dots,N} be N independent copies of the strong solution to the mean-field dynamics (1.7). Further consider valid the assumptions of Theorem 3.2. If (V^i_t)_{t\geq 0} and (\overline{V}^i_t)_{t\geq 0} share the initial data as well as the Brownian motion paths (B^i_t)_{t\geq 0} for all i = 1, \dots, N, then we have

(3.13) \max_{i=1,\dots,N}\,\sup_{t\in[0,T]}\,\BbbE\Bigl[\bigl\|V^i_t - \overline{V}^i_t\bigr\|_2^2 \,\Big|\, \Omega_M\Bigr] \leq C_{\mathrm{MFA}}\,N^{-1}

with C\mathrm{M}\mathrm{F}\mathrm{A} = C\mathrm{M}\mathrm{F}\mathrm{A} (\alpha , \lambda , \sigma , T, C1 , C2 , M, K, \scrM 2 , b1 , b2 ), where K is as in Lemma 3.10
and \scrM 2 denotes a second-order moment bound of \rho .
A quantitative mean-field approximation was left as an open problem in [10,
Remark 3.2] due to a lack of global Lipschitz continuity of the SDE coefficients and
has been approached since then in several steps; see Remark 1.2. While the restriction
to bounded processes, which reflects the typical behavior in view of Lemma 3.10,
already allows one to obtain an estimate of the type (3.13), which is sufficient to prove
convergence in probability in what follows, the recent work [29] improves (3.13) by
first showing a nonprobabilistic mean-field approximation, i.e., removing the necessity
of conditioning on the set \Omega M as done in (3.13), and second by obtaining a pathwise
estimate; see [29, Theorem 2.6]. Hence, in the light of [29], the role of the constant
M can be regarded as merely an auxiliary technical tool.
Equipped with Lemma 3.10 and Proposition 3.11, we are now able to prove The-
orem 3.8.
Proof of Theorem 3.8. We have the error decomposition
(3.14)
\BbbE\left[\left\|\frac{1}{N}\sum_{i=1}^N V^i_{K\Delta t} - v^\ast\right\|_2^2 \,\middle|\, \Omega_M\right] \leq 3\,\BbbE\left[\left\|\frac{1}{N}\sum_{i=1}^N \bigl(V^i_{K\Delta t} - V^i_T\bigr)\right\|_2^2 \,\middle|\, \Omega_M\right] + 3\,\BbbE\left[\left\|\frac{1}{N}\sum_{i=1}^N \bigl(V^i_T - \overline{V}^i_T\bigr)\right\|_2^2 \,\middle|\, \Omega_M\right] + \frac{3}{1 - \delta}\,\BbbE\left\|\frac{1}{N}\sum_{i=1}^N \overline{V}^i_T - v^\ast\right\|_2^2 \leq 6C_{\mathrm{NA}}(\Delta t)^{2m} + 3C_{\mathrm{MFA}}N^{-1} + 12\varepsilon,

dividing the overall error into an approximation error of the numerical scheme, the mean-field approximation error, and the optimization error in the mean-field limit. Here, the first term is controlled by applying classical convergence results for numerical schemes for SDEs [53], the second by using the quantitative mean-field approximation in the form of Proposition 3.11, and the third by means of Theorem 3.7 together with \BbbE\|\overline{V}^1_T - v^\ast\|_2^2 \leq 2\scrV(\rho_T) \leq 2\varepsilon and 3/(1 - \delta) \leq 6 for \delta \in (0, 1/2).


Denoting now by K^N_{\varepsilon_{\mathrm{total}}} \subset \Omega the set where (3.11) does not hold, we can estimate

\BbbP\bigl(K^N_{\varepsilon_{\mathrm{total}}}\bigr) = \BbbP\bigl(K^N_{\varepsilon_{\mathrm{total}}} \cap \Omega_M\bigr) + \BbbP\bigl(K^N_{\varepsilon_{\mathrm{total}}} \cap \Omega_M^c\bigr) \leq \BbbP\bigl(K^N_{\varepsilon_{\mathrm{total}}} \,\big|\, \Omega_M\bigr)\BbbP(\Omega_M) + \BbbP(\Omega_M^c) \leq \varepsilon_{\mathrm{total}}^{-1}\bigl(6C_{\mathrm{NA}}(\Delta t)^{2m} + 3C_{\mathrm{MFA}}N^{-1} + 12\varepsilon\bigr) + \delta,

where in the last step we employ Markov's inequality together with (3.14) to bound the first term. For the second it suffices to choose the M from (3.12) large enough.
As a consequence of Theorem 3.7, the hardness of any optimization problem is
necessarily encoded in the mean-field approximation. Proposition 3.11 addresses pre-
cisely this question, ensuring that with arbitrarily high probability, the finite particle
dynamics (1.3) keeps close to the mean-field dynamics (1.7). Since the rate of this con-
vergence is of order N - 1/2 in the number of particles N , the hardness of the problem
is fully captured by the constant C\mathrm{M}\mathrm{F}\mathrm{A} in (3.13), which does not depend explicitly on
the dimension d. Therefore, the mean-field approximation is, in general, not affected
by the curse of dimensionality. Nevertheless, as our assumptions on the objective
function \scrE do not exclude the class of NP-hard problems, it cannot be expected that
CBO solves any problem, howsoever hard, with polynomial complexity. This is re-
flected by the exponential dependence of C_{\mathrm{MFA}} on the parameter \alpha, which itself may in the worst case depend linearly on the dimension d, as we discuss in what follows. How-
ever, several numerical experiments [11, 26, 27, 28] in high dimensions confirm that in
typical applications CBO performs comparably to state-of-the-art methods without
the necessity of an exponentially large number of particles. As mentioned before,
characterizing \alpha 0 in more detail is crucial in view of the mean-field approximation re-
sult, Proposition 3.11. We did not precisely specify \alpha 0 in Theorem 3.7 since it seems
challenging to provide informative bounds in all generality. In Remark 4.8, however,
we devise an informal derivation in the case H \equiv 1 for objectives \scrE that are locally
L-Lipschitz continuous on some ball BR (v \ast ) and satisfy the coercivity condition (3.5)
globally for \nu = 1/2. For a parameter-dependent constant c = c(\vargamma, \lambda, \sigma), we obtain

(3.15) \alpha > \alpha_0 = \frac{-8}{c^2\eta^2\varepsilon}\log\left(\frac{c}{2\sqrt{2}}\,\rho_0\Bigl(B_{\min\{R,\,c^2\eta^2\varepsilon/(8L)\}}(v^\ast)\Bigr)\right),

provided that the probability mass t \mapsto \rho_t\bigl(B_{c^2\eta^2\varepsilon/(8L)}(v^\ast)\bigr) is minimized at time t = 0. The latter assumption is motivated by numerical observations of typical successful CBO runs, where the particle density around the global minimizer tends to be
minimized initially and steadily increases over time. We note that the argument of
the log in (3.15) may induce a dependence of \alpha 0 on the ambient dimension d if we do
not dispose of an informative initial configuration \rho 0 . For instance, if \rho 0 is measure-
theoretically equivalent to the Lebesgue measure on a compact set in \BbbR d , we have
\alpha 0 \in \scrO (d log(\varepsilon )/\varepsilon ) as d, 1/\varepsilon \rightarrow \infty by (3.15). If we interpreted \rho 0 as the uncertainty
about the location of the global minimizer v \ast , we could thus consider low-uncertainty
regimes, where \rho 0 actually concentrates around v \ast and \alpha 0 may be dimension-free, or
high-uncertainty regimes, where \rho 0 does not concentrate and \alpha 0 may depend on d.
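To make the dimension dependence of the high-uncertainty regime tangible, the following Python sketch (our own illustration; it assumes \rho_0 uniform on [-1, 1]^d with v^\ast in the interior, sets c, \eta, and L to 1, and takes R large, so all numbers are purely indicative) evaluates the informal lower bound (3.15) using \rho_0(B_r(v^\ast)) = \mathrm{vol}(B_r)/2^d for small radii r.

```python
from math import lgamma, log, sqrt, pi

def alpha0(eps, d, c=1.0, eta=1.0, L=1.0):
    """Evaluate the informal bound (3.15) for rho_0 uniform on [-1, 1]^d."""
    r = c**2 * eta**2 * eps / (8 * L)                 # radius appearing in (3.15)
    # log of vol(B_r) in d dimensions: (d/2) log(pi) - log Gamma(d/2 + 1) + d log(r)
    log_ball = (d / 2) * log(pi) - lgamma(d / 2 + 1) + d * log(r)
    log_mass = log_ball - d * log(2.0)                # log rho_0(B_r(v_star))
    return -8 / (c**2 * eta**2 * eps) * (log(c / (2 * sqrt(2))) + log_mass)

for d in (2, 10, 50):
    print(d, f"{alpha0(1e-2, d):.2e}")   # grows essentially like d*log(1/eps)/eps
```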
4. Proof details for section 3.2. In this section we provide the proof details
for the global convergence result of CBO in mean-field law, Theorem 3.7. Sections 4.1--
4.3 provide auxiliary results, which might be of independent interest. In section 4.4
we complete the proof of Theorem 3.7. Throughout we assume \underline{\scrE} = 0, which is w.l.o.g. since a constant offset to \scrE does not change the CBO dynamics.
4.1. Evolution of the mean-field limit. We now derive evolution inequalities
of the energy functional \scrV (\rho t ) for the cases H \equiv 1 and H \not \equiv 1, respectively.


Lemma 4.1. Let \scrE : \BbbR d \rightarrow \BbbR , H \equiv 1, and fix \alpha , \lambda , \sigma > 0. Moreover, let T > 0 and
let \rho \in \scrC ([0, T ], \scrP 4 (\BbbR d )) be a weak solution to the Fokker--Planck equation (1.8). Then
the functional \scrV (\rho t ) satisfies
(4.1) \frac{d}{dt}\scrV(\rho_t) \leq -\bigl(2\lambda - d\sigma^2\bigr)\scrV(\rho_t) + \sqrt{2}\bigl(\lambda + d\sigma^2\bigr)\sqrt{\scrV(\rho_t)}\,\|v_\alpha(\rho_t) - v^\ast\|_2 + \frac{d\sigma^2}{2}\|v_\alpha(\rho_t) - v^\ast\|_2^2.

Proof. We note that the function \phi(v) = \frac{1}{2}\|v - v^\ast\|_2^2 is in \scrC^2_\ast(\BbbR^d) and recall that \rho satisfies the weak solution identity (3.1) for all test functions in \scrC^2_\ast(\BbbR^d); see Remark 3.3. By applying (3.1) with \phi as above, we obtain for the evolution of \scrV(\rho_t)

\frac{d}{dt}\scrV(\rho_t) = \underbrace{-\lambda\int \langle v - v^\ast, v - v_\alpha(\rho_t)\rangle\, d\rho_t(v)}_{=:T_1} + \underbrace{\frac{d\sigma^2}{2}\int \|v - v_\alpha(\rho_t)\|_2^2\, d\rho_t(v)}_{=:T_2},

where we used \nabla\phi(v) = v - v^\ast and \Delta\phi(v) = d. Expanding the right-hand side of the scalar product in the integrand of T_1 by subtracting and adding v^\ast yields

T_1 = -\lambda\int \langle v - v^\ast, v - v^\ast\rangle\, d\rho_t(v) + \lambda\left\langle\int (v - v^\ast)\, d\rho_t(v),\ v_\alpha(\rho_t) - v^\ast\right\rangle \leq -2\lambda\scrV(\rho_t) + \lambda\,\|\BbbE(\rho_t) - v^\ast\|_2\,\|v_\alpha(\rho_t) - v^\ast\|_2

with the Cauchy--Schwarz inequality being used in the last step. Similarly, again by subtracting and adding v^\ast, for the term T_2 we have with the Cauchy--Schwarz inequality

(4.2) T_2 \leq d\sigma^2\left(\scrV(\rho_t) + \int \|v - v^\ast\|_2\, d\rho_t(v)\,\|v_\alpha(\rho_t) - v^\ast\|_2 + \frac{1}{2}\|v_\alpha(\rho_t) - v^\ast\|_2^2\right).

The result now follows by noting that \|\BbbE(\rho_t) - v^\ast\|_2 \leq \int \|v - v^\ast\|_2\, d\rho_t(v) \leq \sqrt{2\scrV(\rho_t)} as a consequence of Jensen's inequality.


Lemma 4.2. Under the assumptions of Lemma 4.1, the functional \scrV(\rho_t) satisfies

(4.3) \frac{d}{dt}\scrV(\rho_t) \geq -\bigl(2\lambda - d\sigma^2\bigr)\scrV(\rho_t) - \sqrt{2}\bigl(\lambda + d\sigma^2\bigr)\sqrt{\scrV(\rho_t)}\,\|v_\alpha(\rho_t) - v^\ast\|_2.

Proof. By following the lines of the proof of Lemma 4.1 and noticing that it holds \bigl\langle\int (v - v^\ast)\, d\rho_t(v),\ v_\alpha(\rho_t) - v^\ast\bigr\rangle \geq -\|\BbbE(\rho_t) - v^\ast\|_2\,\|v_\alpha(\rho_t) - v^\ast\|_2 by the Cauchy--Schwarz inequality and \|v_\alpha(\rho_t) - v^\ast\|_2^2 \geq 0, the lower bound is immediate.
Lemma 4.3. Let \scrE \in \scrC(\BbbR^d) satisfy A1--A3 and w.l.o.g. assume \underline{\scrE} = 0. Let H : \BbbR \rightarrow [0, 1] be such that H(x) = 1 whenever x \geq 0 and fix \alpha, \lambda, \sigma > 0. Moreover, let T > 0 and let \rho \in \scrC([0, T], \scrP_4(\BbbR^d)) be a weak solution to the Fokker--Planck equation (1.8). Then, provided \max_{t\in[0,T]} \scrE(v_\alpha(\rho_t)) \leq \scrE_\infty, the functional \scrV(\rho_t) satisfies

\frac{d}{dt}\scrV(\rho_t) \leq -\bigl(2\lambda - d\sigma^2\bigr)\scrV(\rho_t) + \sqrt{2}\bigl(\lambda + d\sigma^2\bigr)\sqrt{\scrV(\rho_t)}\,\|v_\alpha(\rho_t) - v^\ast\|_2 + \frac{\lambda}{\eta^2}\Bigl(L_\scrE\bigl(1 + \|v_\alpha(\rho_t) - v^\ast\|_2^{\gamma}\bigr)\|v_\alpha(\rho_t) - v^\ast\|_2\Bigr)^{2\nu} + \frac{d\sigma^2}{2}\|v_\alpha(\rho_t) - v^\ast\|_2^2.


Proof. Let us write H^\ast(v) := H(\scrE(v) - \scrE(v_\alpha(\rho_t))). Taking \phi(v) = \frac{1}{2}\|v - v^\ast\|_2^2 as test function in (3.1) as in the proof of Lemma 4.1 yields for the evolution of \scrV(\rho_t)

(4.4) \frac{d}{dt}\scrV(\rho_t) = \underbrace{-\lambda\int H^\ast(v)\langle v - v^\ast, v - v_\alpha(\rho_t)\rangle\, d\rho_t(v)}_{=:\widetilde{T}_1} + \frac{d\sigma^2}{2}\int \|v - v_\alpha(\rho_t)\|_2^2\, d\rho_t(v).

For the second term on the right-hand side, we proceed as in (4.2). The term \widetilde{T}_1, on the other hand, can be rewritten as

(4.5) \widetilde{T}_1 = -2\lambda\scrV(\rho_t) - \lambda\int H^\ast(v)\langle v - v^\ast, v^\ast - v_\alpha(\rho_t)\rangle\, d\rho_t(v) + \lambda\int \bigl(1 - H^\ast(v)\bigr)\|v - v^\ast\|_2^2\, d\rho_t(v).

Let us now bound the latter two terms individually. For the second term in (4.5), noting that 0 \leq H^\ast \leq 1, the Cauchy--Schwarz inequality and Jensen's inequality give

-\lambda\int H^\ast(v)\langle v - v^\ast, v^\ast - v_\alpha(\rho_t)\rangle\, d\rho_t(v) \leq \lambda\sqrt{2\scrV(\rho_t)}\,\|v_\alpha(\rho_t) - v^\ast\|_2.

For the third term in (4.5), let us first note that (1 - H^\ast(v)) \not= 0 implies H^\ast(v) \not= 1 and thus \scrE(v) < \scrE(v_\alpha(\rho_t)). Furthermore, \scrE(v_\alpha(\rho_t)) \leq \scrE_\infty implies v \in B_{R_0}(v^\ast) by the second part of A2. By the first part of A2 and 0 \leq 1 - H^\ast \leq 1, we therefore have

\lambda\int \bigl(1 - H^\ast(v)\bigr)\|v - v^\ast\|_2^2\, d\rho_t(v) \leq \frac{\lambda}{\eta^2}\int \bigl(1 - H^\ast(v)\bigr)\scrE(v)^{2\nu}\, d\rho_t(v) \leq \frac{\lambda}{\eta^2}\,\scrE(v_\alpha(\rho_t))^{2\nu} \leq \frac{\lambda}{\eta^2}\Bigl(L_\scrE\bigl(1 + \|v_\alpha(\rho_t) - v^\ast\|_2^{\gamma}\bigr)\|v_\alpha(\rho_t) - v^\ast\|_2\Bigr)^{2\nu},

where the last step used A3. Employing the last two inequalities in (4.5) and inserting the result together with (4.2) into (4.4) gives the result.
Lemma 4.4. Under the assumptions of Lemma 4.3, the functional \scrV(\rho_t) satisfies

(4.6) \frac{d}{dt}\scrV(\rho_t) \geq -\bigl(2\lambda - d\sigma^2\bigr)\scrV(\rho_t) - \sqrt{2}\bigl(\lambda + d\sigma^2\bigr)\sqrt{\scrV(\rho_t)}\,\|v_\alpha(\rho_t) - v^\ast\|_2.

Proof. Analogously to the proof of Lemma 4.2, by following the lines of the proof of Lemma 4.3 and noticing that for \widetilde{T}_1 it holds -\int H^\ast(v)\langle v - v^\ast, v^\ast - v_\alpha(\rho_t)\rangle\, d\rho_t(v) \geq -\|\BbbE(\rho_t) - v^\ast\|_2\,\|v_\alpha(\rho_t) - v^\ast\|_2 as well as \int (1 - H^\ast(v))\|v - v^\ast\|_2^2\, d\rho_t(v) \geq 0 as a consequence of 0 \leq H^\ast \leq 1, the lower bound is immediate.


4.2. Quantitative Laplace principle. The Laplace principle (1.6) asserts that -\log(\|\omega_\alpha\|_{L^1(\varrho)})/\alpha \rightarrow \underline{\scrE} as \alpha \rightarrow \infty as long as the global minimizer v^\ast is in the support of \varrho. However, it cannot be used to characterize the proximity of v_\alpha(\varrho) to the global minimizer v^\ast in general. For instance, if \scrE has two minimizers with similar objective value, and half of the probability mass of \varrho concentrates around each associated
location, then v\alpha (\varrho ) is located halfway on the line that connects the two minimizing
locations. The inverse continuity property A2, by design, excludes such cases, so that
we can refine the Laplace principle under A2 in the following sense.
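The degenerate situation just described is easy to reproduce numerically. The following Python sketch (purely our illustration; the minimizer locations, the double-well objective, and the sample sizes are arbitrary choices) places half of the empirical mass near each of two equally deep minimizers and shows that the consensus point remains near the midpoint no matter how large \alpha is taken.

```python
import numpy as np

rng = np.random.default_rng(1)
m1, m2 = np.array([-1.0, 0.0]), np.array([1.0, 0.0])   # two equally deep minima

# empirical measure: half of the particles clustered around each minimizer
V = np.vstack([m1 + 0.01 * rng.standard_normal((100, 2)),
               m2 + 0.01 * rng.standard_normal((100, 2))])

def E(V):
    # double-well objective attaining the same value at both minimizers
    return np.minimum(np.sum((V - m1) ** 2, axis=1),
                      np.sum((V - m2) ** 2, axis=1))

for alpha in (1.0, 1e2, 1e4):
    w = np.exp(-alpha * (E(V) - E(V).min()))
    print(alpha, w @ V / w.sum())   # stays close to the midpoint (0, 0)
```

Under A2, by contrast, near-optimal objective values can occur only close to v^\ast, which is precisely what the following refinement quantifies.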
Proposition 4.5. Let \varrho \in \scrP(\BbbR^d) and fix \alpha > 0. For any r > 0 define \scrE_r := \sup_{v\in B_r(v^\ast)} \scrE(v). Then, under the inverse continuity property A2 and assuming w.l.o.g. \underline{\scrE} = 0, for any r \in (0, R_0] and q > 0 such that q + \scrE_r \leq \scrE_\infty, we have

\|v_\alpha(\varrho) - v^\ast\|_2 \leq \frac{(q + \scrE_r)^{\nu}}{\eta} + \frac{\exp(-\alpha q)}{\varrho(B_r(v^\ast))}\int \|v - v^\ast\|_2\, d\varrho(v).


Proof. For any a > 0 it holds that \|\omega_\alpha\|_{L^1(\varrho)} \geq a\,\varrho(\{v : \exp(-\alpha\scrE(v)) \geq a\}) due to Markov's inequality. By choosing a = \exp(-\alpha\scrE_r) and noting that

\varrho\bigl(\bigl\{v \in \BbbR^d : \exp(-\alpha\scrE(v)) \geq \exp(-\alpha\scrE_r)\bigr\}\bigr) = \varrho\bigl(\bigl\{v \in \BbbR^d : \scrE(v) \leq \scrE_r\bigr\}\bigr) \geq \varrho(B_r(v^\ast)),

we get \|\omega_\alpha\|_{L^1(\varrho)} \geq \exp(-\alpha\scrE_r)\,\varrho(B_r(v^\ast)). Now let \widetilde{r} \geq r > 0. Using the definition of the consensus point v_\alpha(\varrho) = \int v\,\omega_\alpha(v)/\|\omega_\alpha\|_{L^1(\varrho)}\, d\varrho(v) we can decompose

\|v_\alpha(\varrho) - v^\ast\|_2 \leq \int_{B_{\widetilde{r}}(v^\ast)} \|v - v^\ast\|_2\,\frac{\omega_\alpha(v)}{\|\omega_\alpha\|_{L^1(\varrho)}}\, d\varrho(v) + \int_{(B_{\widetilde{r}}(v^\ast))^c} \|v - v^\ast\|_2\,\frac{\omega_\alpha(v)}{\|\omega_\alpha\|_{L^1(\varrho)}}\, d\varrho(v).

The first term is bounded by \widetilde{r} since \|v - v^\ast\|_2 \leq \widetilde{r} for all v \in B_{\widetilde{r}}(v^\ast). For the second term we use \|\omega_\alpha\|_{L^1(\varrho)} \geq \exp(-\alpha\scrE_r)\,\varrho(B_r(v^\ast)) from above to get

\int_{(B_{\widetilde{r}}(v^\ast))^c} \|v - v^\ast\|_2\,\frac{\omega_\alpha(v)}{\|\omega_\alpha\|_{L^1(\varrho)}}\, d\varrho(v) \leq \frac{1}{\exp(-\alpha\scrE_r)\,\varrho(B_r(v^\ast))}\int_{(B_{\widetilde{r}}(v^\ast))^c} \|v - v^\ast\|_2\,\omega_\alpha(v)\, d\varrho(v) \leq \frac{\exp\bigl(-\alpha\bigl(\inf_{v\in(B_{\widetilde{r}}(v^\ast))^c}\scrE(v) - \scrE_r\bigr)\bigr)}{\varrho(B_r(v^\ast))}\int \|v - v^\ast\|_2\, d\varrho(v).

Thus, for any \widetilde{r} \geq r > 0 we obtain

(4.7) \|v_\alpha(\varrho) - v^\ast\|_2 \leq \widetilde{r} + \frac{\exp\bigl(-\alpha\bigl(\inf_{v\in(B_{\widetilde{r}}(v^\ast))^c}\scrE(v) - \scrE_r\bigr)\bigr)}{\varrho(B_r(v^\ast))}\int \|v - v^\ast\|_2\, d\varrho(v).

Let us now choose \widetilde{r} = (q + \scrE_r)^{\nu}/\eta. This choice satisfies \widetilde{r} \leq \scrE_\infty^{\nu}/\eta by the assumption q + \scrE_r \leq \scrE_\infty, and furthermore \widetilde{r} \geq r, since A2 with \underline{\scrE} = 0 and r \leq R_0 implies

\widetilde{r} = \frac{(q + \scrE_r)^{\nu}}{\eta} \geq \frac{\scrE_r^{\nu}}{\eta} = \frac{\bigl(\sup_{v\in B_r(v^\ast)}\scrE(v)\bigr)^{\nu}}{\eta} \geq \sup_{v\in B_r(v^\ast)} \|v - v^\ast\|_2 = r.

Thus, using again A2 with \underline{\scrE} = 0, \inf_{v\in(B_{\widetilde{r}}(v^\ast))^c}\scrE(v) - \scrE_r \geq \min\bigl\{\scrE_\infty, (\eta\widetilde{r})^{1/\nu}\bigr\} - \scrE_r = (\eta\widetilde{r})^{1/\nu} - \scrE_r = q. Inserting this and the definition of \widetilde{r} into (4.7), we obtain the result.
4.3. A lower bound for the probability mass around v^\ast. In this section we bound the probability mass \rho_t(B_r(v^\ast)) for an arbitrarily small radius r > 0 from below. By defining a smooth mollifier \phi_r : \BbbR^d \rightarrow [0, 1] with \mathrm{supp}\,\phi_r = B_r(v^\ast) according to

(4.8) \phi_r(v) := \begin{cases} \exp\left(1 - \dfrac{r^2}{r^2 - \|v - v^\ast\|_2^2}\right) & \text{if } \|v - v^\ast\|_2 < r, \\ 0 & \text{else}, \end{cases}

it holds \rho_t(B_r(v^\ast)) \geq \int \phi_r(v)\, d\rho_t(v). From there, the evolution of the right-hand side can be studied by using the weak solution property of \rho as in Definition 3.1.
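Since (4.8) is fully explicit, it can be evaluated directly. The following short Python sketch (ours, purely for illustration) implements \phi_r and checks the two properties used later in the proofs, namely that \phi_r vanishes outside the open ball B_r(v^\ast) and that \phi_r \geq 1/2 on B_{r/2}(v^\ast), the lower bound exploited in the proof of Theorem 3.7.

```python
import numpy as np

def phi_r(v, v_star, r):
    """Mollifier (4.8): smooth, in [0, 1], supported on the open ball B_r(v_star)."""
    v = np.atleast_2d(v)
    sq = np.sum((v - np.asarray(v_star)) ** 2, axis=1)
    out = np.zeros(len(v))
    inside = sq < r ** 2
    out[inside] = np.exp(1.0 - r ** 2 / (r ** 2 - sq[inside]))
    return out

v_star, r = np.zeros(2), 1.0
print(phi_r([0.5, 0.0], v_star, r))   # exp(-1/3) ~= 0.72 >= 1/2 on B_{r/2}
print(phi_r([1.0, 0.0], v_star, r))   # 0.0 on the boundary and outside
```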
Proposition 4.6. Let H : \BbbR \rightarrow [0, 1] be arbitrary, T > 0, r > 0, and fix parame-
ters \alpha , \lambda , \sigma > 0. Assume \rho \in \scrC ([0, T ], \scrP (\BbbR d )) weakly solves the Fokker--Planck equation
(1.8) in the sense of Definition 3.1 with initial condition \rho 0 \in \scrP (\BbbR d ) and for t \in [0, T ].
Then, for all t \in [0, T ] we have

(4.9) \rho_t(B_r(v^\ast)) \geq \left(\int \phi_r(v)\, d\rho_0(v)\right)\exp(-pt), \quad \text{where}

(4.10) p := \max\left\{\frac{2\lambda\bigl(\sqrt{c}\,r + B\bigr)\sqrt{c}}{(1 - c)^2\,r} + \frac{2\sigma^2\bigl(cr^2 + B^2\bigr)(2c + d)}{(1 - c)^4\,r^2},\ \frac{4\lambda^2}{(2c - 1)\sigma^2}\right\},

for any B < \infty with \sup_{t\in[0,T]} \|v_\alpha(\rho_t) - v^\ast\|_2 \leq B and for any c \in (1/2, 1) satisfying

(4.11) (2c - 1)c \geq d(1 - c)^2.

Remark 4.7. In case the reader has wondered about the crucial role of the sto-
chastic terms in (1.1) and (1.3), or the diffusion in the macroscopic models (1.7) and
(1.8), Proposition 4.6 precisely explains where positive diffusion \sigma > 0 is actually used
to ensure mass around the minimizer v \ast (compare Proposition 4.5).
Proof of Proposition 4.6. By definition of the mollifier \phi_r in (4.8) we have 0 \leq \phi_r(v) \leq 1 and \mathrm{supp}(\phi_r) = B_r(v^\ast). This implies

(4.12) \rho_t(B_r(v^\ast)) = \rho_t\bigl(\bigl\{v \in \BbbR^d : \|v - v^\ast\|_2 \leq r\bigr\}\bigr) \geq \int \phi_r(v)\, d\rho_t(v).

Our strategy is to derive a lower bound for the right-hand side of this inequality.
Using the weak solution property of \rho and the fact that \phi_r \in \scrC_c^\infty(\BbbR^d), we obtain

(4.13) \frac{d}{dt}\int \phi_r(v)\, d\rho_t(v) = \int \bigl(T_1(v) + T_2(v)\bigr)\, d\rho_t(v)

with T_1(v) := -\lambda H^\ast(v)\langle v - v_\alpha(\rho_t), \nabla\phi_r(v)\rangle and T_2(v) := \frac{\sigma^2}{2}\|v - v_\alpha(\rho_t)\|_2^2\,\Delta\phi_r(v), and where we abbreviate H^\ast(v) := H(\scrE(v) - \scrE(v_\alpha(\rho_t))) to keep the notation concise. We now aim for showing T_1(v) + T_2(v) \geq -p\phi_r(v) uniformly on \BbbR^d for p > 0 as given in (4.10) in the statement of the proposition. Since the mollifier \phi_r and its first and second derivatives vanish outside of \Omega_r := \{v \in \BbbR^d : \|v - v^\ast\|_2 < r\}, we can restrict our attention to the open ball \Omega_r. To achieve the lower bound over \Omega_r, we introduce the subsets K_1 := \bigl\{v \in \BbbR^d : \|v - v^\ast\|_2 > \sqrt{c}\,r\bigr\} and

K_2 := \left\{v \in \BbbR^d : -\lambda H^\ast(v)\langle v - v_\alpha(\rho_t), v - v^\ast\rangle\bigl(r^2 - \|v - v^\ast\|_2^2\bigr)^2 > \widetilde{c}\,r^2\,\frac{\sigma^2}{2}\,\|v - v_\alpha(\rho_t)\|_2^2\,\|v - v^\ast\|_2^2\right\},

where c adheres to (4.11), and \widetilde{c} := 2c - 1 \in (0, 1). We now decompose \Omega_r according to \Omega_r = (K_1^c \cap \Omega_r) \cup (K_1 \cap K_2^c \cap \Omega_r) \cup (K_1 \cap K_2 \cap \Omega_r), which is illustrated in Figure 3 for different positions of v_\alpha(\rho_t) and values of \sigma.
In the following we treat each of these three subsets separately.
Subset K_1^c \cap \Omega_r: We have \|v - v^\ast\|_2 \leq \sqrt{c}\,r for each v \in K_1^c, which can be used to independently derive lower bounds for both T_1 and T_2. Recalling the expression for \phi_r from (4.8), for T_1 we get by using the Cauchy--Schwarz inequality and H^\ast \leq 1

T_1(v) = -\lambda H^\ast(v)\langle v - v_\alpha(\rho_t), \nabla\phi_r(v)\rangle = -\lambda H^\ast(v)\left\langle v - v_\alpha(\rho_t),\ \frac{-2r^2\,(v - v^\ast)}{\bigl(r^2 - \|v - v^\ast\|_2^2\bigr)^2}\,\phi_r(v)\right\rangle \geq -2r^2\lambda\,\frac{\|v - v_\alpha(\rho_t)\|_2\,\|v - v^\ast\|_2}{\bigl(r^2 - \|v - v^\ast\|_2^2\bigr)^2}\,\phi_r(v) \geq -\frac{2\lambda\bigl(\sqrt{c}\,r + B\bigr)\sqrt{c}}{(1 - c)^2\,r}\,\phi_r(v) =: -p_1\phi_r(v),

[Figure 3: three panels (a)--(c) illustrating the decomposition of \Omega_r for different positions of v_\alpha(\rho_t) and values of \sigma.]

Fig. 3. Visualization of the decomposition of \Omega_r for different positions of v_\alpha(\rho_t) and values of \sigma in the setting H \equiv 1. In the proof of Proposition 4.6 we limit the rate of the mass loss induced by both consensus drift and noise term for the set K_1^c \cap \Omega_r, which is colored blue. On the set K_1 \cap K_2^c \cap \Omega_r, colored orange, the noise term counterbalances any potential mass loss induced by the drift, while on the gray set K_1 \cap K_2 \cap \Omega_r mass can be lost at an exponential rate -4\lambda^2/((2c - 1)\sigma^2). (Color figure available online.)

where the last bound is due to \|v - v_\alpha(\rho_t)\|_2 \leq \|v - v^\ast\|_2 + \|v^\ast - v_\alpha(\rho_t)\|_2 \leq \sqrt{c}\,r + B. Similarly, by computing \Delta\phi_r and inserting it, for T_2 we obtain

T_2(v) = \sigma^2 r^2\,\|v - v_\alpha(\rho_t)\|_2^2\,\frac{2\bigl(2\|v - v^\ast\|_2^2 - r^2\bigr)\|v - v^\ast\|_2^2 - d\bigl(r^2 - \|v - v^\ast\|_2^2\bigr)^2}{\bigl(r^2 - \|v - v^\ast\|_2^2\bigr)^4}\,\phi_r(v) \geq -\frac{2\sigma^2\bigl(cr^2 + B^2\bigr)(2c + d)}{(1 - c)^4\,r^2}\,\phi_r(v) =: -p_2\phi_r(v),

where we used \|v - v_\alpha(\rho_t)\|_2^2 \leq 2\bigl(\|v - v^\ast\|_2^2 + \|v^\ast - v_\alpha(\rho_t)\|_2^2\bigr) \leq 2\bigl(cr^2 + B^2\bigr).

Subset \bfitK 1 \cap \bfitK \bfitc 2 \cap \Omega \bfitr : By the definition of K1 and K2 we have \| v - v \ast \| 2 > cr and
\Bigl(
2
\Bigr) 2 \sigma 2 2 2
(4.14) - \lambda H \ast (v)\langle v - v\alpha (\rho t ), v - v \ast \rangle r2 - \| v - v \ast \| 2 \leq cr \~ 2 \| v - v\alpha (\rho t )\| 2 \| v - v \ast \| 2 .
2
Our goal now is to show T1 (v) + T2 (v) \geq 0 for all v in this subset. We first compute
\Bigl( \Bigr) 2
2
T1 (v) + T2 (v) \langle v - v\alpha (\rho t ), v - v \ast \rangle r2 - \| v - v \ast \| 2
\ast
= \lambda H (v)
2r2 \phi r (v)
\Bigl( \Bigr) 4
r2 - \| v - v \ast \| 22
\Bigl( \Bigr) \Bigl( \Bigr) 2
2 2 2
\sigma 2 2 2 \| v - v \ast \| 2 - r2 \| v - v \ast \| 2 - d r2 - \| v - v \ast \| 2
2
+ \| v - v\alpha (\rho t )\| 2 \Bigr) 4 .
2
\Bigl(
r2 - \| v - v \ast \| 22

Therefore we have T1 (v) + T2 (v) \geq 0 whenever we can show

d\sigma 2
\biggl( \biggr) \Bigl( \Bigr) 2
\ast \ast 2 2
- \lambda H (v)\langle v - v\alpha (\rho t ), v - v \rangle + \| v - v\alpha (\rho t )\| 2 r2 - \| v - v \ast \| 2
(4.15) 2
\Bigl( \Bigr)
2 2 2
\leq \sigma \| v - v\alpha (\rho t )\| 2 2 \| v - v \ast \| 2 - r2 \| v - v \ast \| 2 .
2

Now note that the first summand on the left-hand side in (4.15) can be upper bounded by means of condition (4.14) and by using the relation \widetilde{c} = 2c - 1. More precisely,

-\lambda H^\ast(v)\langle v - v_\alpha(\rho_t), v - v^\ast\rangle\bigl(r^2 - \|v - v^\ast\|_2^2\bigr)^2 \leq \widetilde{c}\,r^2\,\frac{\sigma^2}{2}\,\|v - v_\alpha(\rho_t)\|_2^2\,\|v - v^\ast\|_2^2 = (2c - 1)\,r^2\,\frac{\sigma^2}{2}\,\|v - v_\alpha(\rho_t)\|_2^2\,\|v - v^\ast\|_2^2 \leq \bigl(2\|v - v^\ast\|_2^2 - r^2\bigr)\,\frac{\sigma^2}{2}\,\|v - v_\alpha(\rho_t)\|_2^2\,\|v - v^\ast\|_2^2,
where the last inequality follows since v \in K_1. For the second term on the left-hand side in (4.15) we can use d(1 - c)^2 \leq (2c - 1)c as per (4.11), to get

\frac{d\sigma^2}{2}\,\|v - v_\alpha(\rho_t)\|_2^2\,\bigl(r^2 - \|v - v^\ast\|_2^2\bigr)^2 \leq \frac{d\sigma^2}{2}\,\|v - v_\alpha(\rho_t)\|_2^2\,(1 - c)^2 r^4 \leq \frac{\sigma^2}{2}\,\|v - v_\alpha(\rho_t)\|_2^2\,(2c - 1)r^2\,cr^2 \leq \frac{\sigma^2}{2}\,\|v - v_\alpha(\rho_t)\|_2^2\,\bigl(2\|v - v^\ast\|_2^2 - r^2\bigr)\|v - v^\ast\|_2^2.
Hence, (4.15) holds and we have T_1(v) + T_2(v) \geq 0 uniformly on this subset.
Subset K_1 \cap K_2 \cap \Omega_r: On this subset, we have \|v - v^\ast\|_2 > \sqrt{c}\,r and

(4.16) -\lambda H^\ast(v)\langle v - v_\alpha(\rho_t), v - v^\ast\rangle\bigl(r^2 - \|v - v^\ast\|_2^2\bigr)^2 > \widetilde{c}\,r^2\,\frac{\sigma^2}{2}\,\|v - v_\alpha(\rho_t)\|_2^2\,\|v - v^\ast\|_2^2.

We first note that T_1(v) = 0 whenever \sigma^2\|v - v_\alpha(\rho_t)\|_2^2 = 0, provided that \sigma > 0, so nothing needs to be done for the point v = v_\alpha(\rho_t). On the other hand, if \sigma^2\|v - v_\alpha(\rho_t)\|_2^2 > 0, we can use H^\ast \leq 1, two applications of the Cauchy--Schwarz inequality, and condition (4.16) to get

\frac{H^\ast(v)\,\langle v - v_\alpha(\rho_t), v - v^\ast\rangle}{\bigl(r^2 - \|v - v^\ast\|_2^2\bigr)^2} \geq \frac{-\,\|v - v_\alpha(\rho_t)\|_2\,\|v - v^\ast\|_2}{\bigl(r^2 - \|v - v^\ast\|_2^2\bigr)^2} > \frac{2\lambda H^\ast(v)\,\langle v - v_\alpha(\rho_t), v - v^\ast\rangle}{\widetilde{c}\,r^2\sigma^2\,\|v - v_\alpha(\rho_t)\|_2\,\|v - v^\ast\|_2} \geq -\frac{2\lambda}{\widetilde{c}\,r^2\sigma^2}.

Using this, T_1 can be bounded from below by

T_1(v) = 2\lambda r^2 H^\ast(v)\left\langle v - v_\alpha(\rho_t),\ \frac{v - v^\ast}{\bigl(r^2 - \|v - v^\ast\|_2^2\bigr)^2}\right\rangle\phi_r(v) \geq -\frac{4\lambda^2}{\widetilde{c}\,\sigma^2}\,\phi_r(v) =: -p_3\phi_r(v),

where we made use of the relation \widetilde{c} = 2c - 1 in the last step. For T_2, we note that the nonnegativity of \sigma^2\|v - v_\alpha(\rho_t)\|_2^2 implies T_2(v) \geq 0 whenever

2\bigl(2\|v - v^\ast\|_2^2 - r^2\bigr)\|v - v^\ast\|_2^2 \geq d\bigl(r^2 - \|v - v^\ast\|_2^2\bigr)^2.

This is satisfied for all v with \|v - v^\ast\|_2 \geq \sqrt{c}\,r, provided c satisfies 2(2c - 1)c \geq (1 - c)^2 d as implied by (4.11).
Concluding the proof. Using the evolution of \phi_r as in (4.13), we now get

\frac{d}{dt}\int \phi_r(v)\, d\rho_t(v) = \int_{K_1\cap K_2^c\cap\Omega_r} \underbrace{\bigl(T_1(v) + T_2(v)\bigr)}_{\geq\,0}\, d\rho_t(v) + \int_{K_1\cap K_2\cap\Omega_r} \underbrace{\bigl(T_1(v) + T_2(v)\bigr)}_{\geq\,-p_3\phi_r(v)}\, d\rho_t(v) + \int_{K_1^c\cap\Omega_r} \underbrace{\bigl(T_1(v) + T_2(v)\bigr)}_{\geq\,-(p_1+p_2)\phi_r(v)}\, d\rho_t(v) \geq -\max\{p_1 + p_2,\ p_3\}\int \phi_r(v)\, d\rho_t(v) = -p\int \phi_r(v)\, d\rho_t(v).

An application of Gr\"onwall's inequality gives \int \phi_r(v)\, d\rho_t(v) \geq \bigl(\int \phi_r(v)\, d\rho_0(v)\bigr)\exp(-pt), which concludes the proof after recalling (4.12).


4.4. Proof of Theorem 3.7. We now have all necessary tools at hand to present
a detailed proof of the global convergence result in mean-field law. We separately prove
the cases of an inactive and active cutoff function, i.e., H \equiv 1 and H \not \equiv 1, respectively.

Proof of Theorem 3.7 when H \equiv 1. W.l.o.g. we may assume \underline{\scrE} = 0. Let us first choose the parameter \alpha such that

(4.17) \alpha > \alpha_0 := \frac{1}{q_\varepsilon}\left(\log\left(\frac{4\sqrt{2\scrV(\rho_0)}}{c(\vargamma,\lambda,\sigma)\sqrt{\varepsilon}}\right) + \frac{p}{(1 - \vargamma)\bigl(2\lambda - d\sigma^2\bigr)}\log\left(\frac{\scrV(\rho_0)}{\varepsilon}\right) - \log\rho_0\bigl(B_{r_\varepsilon/2}(v^\ast)\bigr)\right),

where we introduce the definitions

(4.18) c(\vargamma,\lambda,\sigma) := \min\left\{\frac{\vargamma\bigl(2\lambda - d\sigma^2\bigr)}{2\sqrt{2}\,(\lambda + d\sigma^2)},\ \vargamma\sqrt{\frac{2\lambda - d\sigma^2}{d\sigma^2}}\right\}

as well as

q_\varepsilon := \frac{1}{2}\min\left\{\left(\eta\,\frac{c(\vargamma,\lambda,\sigma)\sqrt{\varepsilon}}{2}\right)^{1/\nu},\ \scrE_\infty\right\} \quad\text{and}\quad r_\varepsilon := \max\left\{s \in [0, R_0] : \max_{v\in B_s(v^\ast)}\scrE(v) \leq q_\varepsilon\right\}.

Moreover, p is as defined in (4.10) in Proposition 4.6 with B = c(\vargamma,\lambda,\sigma)\sqrt{\scrV(\rho_0)} and
with r = r\varepsilon . We remark that, by construction, q\varepsilon > 0 and r\varepsilon \leq R0 . Furthermore,
recalling the notation \scrE r = supv\in Br (v\ast ) \scrE (v) from Proposition 4.5, we have q\varepsilon + \scrE r\varepsilon \leq
2q\varepsilon \leq \scrE \infty as a consequence of the definition of r\varepsilon . Since q\varepsilon > 0, the continuity of \scrE
ensures that there exists sq\varepsilon > 0 such that \scrE (v) \leq q\varepsilon for all v \in Bsq\varepsilon (v \ast ), thus yielding
also r\varepsilon > 0.
Let us now define the time horizon T_\alpha \geq 0, which may depend on \alpha, by

(4.19) T_\alpha := \sup\bigl\{t \geq 0 : \scrV(\rho_{t'}) > \varepsilon \text{ and } \|v_\alpha(\rho_{t'}) - v^\ast\|_2 < C(t') \text{ for all } t' \in [0, t]\bigr\}

with C(t) := c(\vargamma,\lambda,\sigma)\sqrt{\scrV(\rho_t)}. Notice for later use that C(0) = B.
Our aim now is to show that \scrV(\rho_{T_\alpha}) = \varepsilon with T_\alpha \in \bigl[\frac{1-\vargamma}{(1+\vargamma/2)}T^\ast, T^\ast\bigr] and that we have at least exponential decay of \scrV(\rho_t) until time T_\alpha, i.e., until accuracy \varepsilon is reached.
First, however, we ensure that T_\alpha > 0. With the mapping t \mapsto \scrV(\rho_t) being continuous as a consequence of the regularity \rho \in \scrC([0,T], \scrP_4(\BbbR^d)) established in Theorem 3.2 and t \mapsto \|v_\alpha(\rho_t) - v^\ast\|_2 being continuous due to [10, Lemma 3.2] and \rho \in \scrC([0,T], \scrP_4(\BbbR^d)), T_\alpha > 0 follows from the definition, since \scrV(\rho_0) > \varepsilon and \|v_\alpha(\rho_0) - v^\ast\|_2 < C(0). While the former is immediate by assumption, applying Proposition 4.5 with q_\varepsilon and r_\varepsilon gives the latter since

\|v_\alpha(\rho_0) - v^\ast\|_2 \leq \frac{(q_\varepsilon + \scrE_{r_\varepsilon})^{\nu}}{\eta} + \frac{\exp(-\alpha q_\varepsilon)}{\rho_0(B_{r_\varepsilon}(v^\ast))}\int \|v - v^\ast\|_2\, d\rho_0(v) \leq \frac{c(\vargamma,\lambda,\sigma)\sqrt{\varepsilon}}{2} + \frac{\exp(-\alpha q_\varepsilon)}{\rho_0(B_{r_\varepsilon}(v^\ast))}\sqrt{2\scrV(\rho_0)} \leq c(\vargamma,\lambda,\sigma)\sqrt{\varepsilon} < c(\vargamma,\lambda,\sigma)\sqrt{\scrV(\rho_0)} = C(0),

where the first inequality in the last line holds by the choice of \alpha in (4.17).
Next, we show that the functional \scrV (\rho t ) decays essentially exponentially fast in
time. More precisely, we prove that, up to time T\alpha , \scrV (\rho t ) decays


(i) at least exponentially fast (with rate (1 - \vargamma )(2\lambda - d\sigma 2 )), and
(ii) at most exponentially fast (with rate (1 + \vargamma /2)(2\lambda - d\sigma 2 )).
To obtain (i), recall that Lemma 4.1 provides an upper bound on \frac{d}{dt}\scrV(\rho_t) given by

(4.20) \frac{d}{dt}\scrV(\rho_t) \leq -\bigl(2\lambda - d\sigma^2\bigr)\scrV(\rho_t) + \sqrt{2}\bigl(\lambda + d\sigma^2\bigr)\sqrt{\scrV(\rho_t)}\,\|v_\alpha(\rho_t) - v^\ast\|_2 + \frac{d\sigma^2}{2}\|v_\alpha(\rho_t) - v^\ast\|_2^2.

Combining this with the definition of T_\alpha in (4.19), we have by construction

\frac{d}{dt}\scrV(\rho_t) \leq -(1 - \vargamma)\bigl(2\lambda - d\sigma^2\bigr)\scrV(\rho_t) \quad \text{for all } t \in (0, T_\alpha).
Analogously, for (ii), by Lemma 4.2, we obtain a lower bound on \frac{d}{dt}\scrV(\rho_t) of the form

\frac{d}{dt}\scrV(\rho_t) \geq -\bigl(2\lambda - d\sigma^2\bigr)\scrV(\rho_t) - \sqrt{2}\bigl(\lambda + d\sigma^2\bigr)\sqrt{\scrV(\rho_t)}\,\|v_\alpha(\rho_t) - v^\ast\|_2 \geq -(1 + \vargamma/2)\bigl(2\lambda - d\sigma^2\bigr)\scrV(\rho_t) \quad \text{for all } t \in (0, T_\alpha),

where the second inequality again exploits the definition of T_\alpha. Gr\"onwall's inequality now implies for all t \in [0, T_\alpha] the upper and the lower bound

(4.21) \scrV(\rho_t) \leq \scrV(\rho_0)\exp\bigl(-(1 - \vargamma)\bigl(2\lambda - d\sigma^2\bigr)t\bigr),
(4.22) \scrV(\rho_t) \geq \scrV(\rho_0)\exp\bigl(-(1 + \vargamma/2)\bigl(2\lambda - d\sigma^2\bigr)t\bigr),

thereby proving (i) and (ii). We further note that the definition of T_\alpha in (4.19) together with the definition of C(t) and (4.21) permits us to control

(4.23) \max_{t\in[0,T_\alpha]}\|v_\alpha(\rho_t) - v^\ast\|_2 \leq \max_{t\in[0,T_\alpha]} C(t) \leq C(0).

To conclude, it remains to prove that \scrV(\rho_{T_\alpha}) = \varepsilon with T_\alpha \in \bigl[\frac{1-\vargamma}{(1+\vargamma/2)}T^\ast, T^\ast\bigr]. For this we distinguish the following three cases.
Case T_\alpha \geq T^\ast: We can use the definition of T^\ast in (3.8) and the time-evolution bound of \scrV(\rho_t) in (4.21) to conclude that \scrV(\rho_{T^\ast}) \leq \varepsilon. Hence, by the definition of T_\alpha in (4.19) together with the continuity of \scrV(\rho_t), we find \scrV(\rho_{T_\alpha}) = \varepsilon with T_\alpha = T^\ast.
Case T_\alpha < T^\ast and \scrV(\rho_{T_\alpha}) \leq \varepsilon: By continuity of \scrV(\rho_t), it holds that, for T_\alpha as defined in (4.19), \scrV(\rho_{T_\alpha}) = \varepsilon. Thus, \varepsilon = \scrV(\rho_{T_\alpha}) \geq \scrV(\rho_0)\exp\bigl(-(1 + \vargamma/2)\bigl(2\lambda - d\sigma^2\bigr)T_\alpha\bigr) by (4.22), which can be reordered as

\frac{1 - \vargamma}{(1 + \vargamma/2)}\,T^\ast = \frac{1}{(1 + \vargamma/2)\bigl(2\lambda - d\sigma^2\bigr)}\log\left(\frac{\scrV(\rho_0)}{\varepsilon}\right) \leq T_\alpha < T^\ast.
Case T_\alpha < T^\ast and \scrV(\rho_{T_\alpha}) > \varepsilon: We shall show that this case can never occur by verifying that \|v_\alpha(\rho_{T_\alpha}) - v^\ast\|_2 < C(T_\alpha) due to the choice of \alpha in (4.17). In fact, fulfilling simultaneously both \scrV(\rho_{T_\alpha}) > \varepsilon and \|v_\alpha(\rho_{T_\alpha}) - v^\ast\|_2 < C(T_\alpha) would contradict the definition of T_\alpha in (4.19) itself. To this end, by applying again Proposition 4.5 with q_\varepsilon and r_\varepsilon, and recalling that \varepsilon < \scrV(\rho_{T_\alpha}), we get

(4.24) \|v_\alpha(\rho_{T_\alpha}) - v^\ast\|_2 \leq \frac{(q_\varepsilon + \scrE_{r_\varepsilon})^{\nu}}{\eta} + \frac{\exp(-\alpha q_\varepsilon)}{\rho_{T_\alpha}\bigl(B_{r_\varepsilon}(v^\ast)\bigr)}\int \|v - v^\ast\|_2\, d\rho_{T_\alpha}(v) < \frac{c(\vargamma,\lambda,\sigma)\sqrt{\scrV(\rho_{T_\alpha})}}{2} + \frac{\exp(-\alpha q_\varepsilon)}{\rho_{T_\alpha}\bigl(B_{r_\varepsilon}(v^\ast)\bigr)}\sqrt{2\scrV(\rho_{T_\alpha})}.

Since, thanks to (4.23), we have the bound \max_{t\in[0,T_\alpha]}\|v_\alpha(\rho_t) - v^\ast\|_2 \leq B for B = C(0), which is in particular independent of \alpha, Proposition 4.6 guarantees that there exists a p > 0 not depending on \alpha (but depending on B and r_\varepsilon) with

\rho_{T_\alpha}\bigl(B_{r_\varepsilon}(v^\ast)\bigr) \geq \left(\int \phi_{r_\varepsilon}(v)\, d\rho_0(v)\right)\exp(-pT_\alpha) \geq \frac{1}{2}\,\rho_0\bigl(B_{r_\varepsilon/2}(v^\ast)\bigr)\exp(-pT^\ast) > 0,

where we used v^\ast \in \mathrm{supp}(\rho_0) for bounding the initial mass \rho_0, and the fact that \phi_r (as defined in (4.8)) is bounded from below on B_{r/2}(v^\ast) by 1/2. With this we can continue the chain of inequalities in (4.24) to obtain

(4.25) \|v_\alpha(\rho_{T_\alpha}) - v^\ast\|_2 < \frac{c(\vargamma,\lambda,\sigma)\sqrt{\scrV(\rho_{T_\alpha})}}{2} + \frac{2\exp(-\alpha q_\varepsilon)}{\rho_0\bigl(B_{r_\varepsilon/2}(v^\ast)\bigr)\exp(-pT^\ast)}\sqrt{2\scrV(\rho_{T_\alpha})} \leq c(\vargamma,\lambda,\sigma)\sqrt{\scrV(\rho_{T_\alpha})} = C(T_\alpha),

where the inequality in the last line holds by the choice of \alpha in (4.17). This establishes the desired contradiction, again as a consequence of the continuity of the mappings t \mapsto \scrV(\rho_t) and t \mapsto \|v_\alpha(\rho_t) - v^\ast\|_2.
Proof of Theorem 3.7 when H \not\equiv 1. The proof follows the lines of the one for the inactive cutoff H \equiv 1, but requires some modifications since Lemmas 4.1 and 4.2 need to be replaced by Lemmas 4.3 and 4.4, to derive bounds for the evolution of \scrV(\rho_t).
As in the proof for the case H \equiv 1 we first choose the parameter \alpha such that

(4.26) \alpha > \alpha_0 := \frac{1}{q_\varepsilon}\left(\log\left(\frac{4\sqrt{2\scrV(\rho_0)}}{C_\varepsilon}\right) + \frac{p}{(1 - \vargamma)\bigl(2\lambda - d\sigma^2\bigr)}\log\left(\frac{\scrV(\rho_0)}{\varepsilon}\right) - \log\rho_0\bigl(B_{r_\varepsilon/2}(v^\ast)\bigr)\right),

where C_\varepsilon is obtained when replacing with \varepsilon each \scrV(\rho_t) in C(t) defined as

(4.27) C(t) := \min\left\{\frac{\scrE_\infty}{2L_\scrE},\ \left(\frac{\scrE_\infty}{2L_\scrE}\right)^{1/(1+\gamma)},\ \frac{\vargamma\bigl(2\lambda - d\sigma^2\bigr)}{4\sqrt{2}\,(\lambda + d\sigma^2)}\sqrt{\scrV(\rho_t)},\ \vargamma\sqrt{\frac{2\lambda - d\sigma^2}{2d\sigma^2}}\sqrt{\scrV(\rho_t)},\ \left(\frac{\vargamma\eta^2\bigl(2\lambda - d\sigma^2\bigr)}{4L_\scrE^{2\nu}\lambda}\scrV(\rho_t)\right)^{1/(2\nu)},\ \left(\frac{\vargamma\eta^2\bigl(2\lambda - d\sigma^2\bigr)}{4L_\scrE^{2\nu}\lambda}\scrV(\rho_t)\right)^{1/(2\nu(1+\gamma))}\right\}.

Moreover, r_\varepsilon is as defined before, p as in (4.10) with B = C(0) and r = r_\varepsilon, and

q_\varepsilon := \frac{1}{2}\min\left\{\left(\eta\,\frac{C_\varepsilon}{2}\right)^{1/\nu},\ \scrE_\infty\right\}.
Let us now define again a time horizon T_\alpha according to (4.19), however, with the modified definition of C(t) from (4.27). It is straightforward to check that T_\alpha > 0 by choice of \alpha in (4.26). Our aim is again to show \scrV(\rho_{T_\alpha}) = \varepsilon with T_\alpha \in \bigl[\frac{1-\vargamma}{(1+\vargamma/2)}T^\ast, T^\ast\bigr] and that we have at least exponential decay of \scrV(\rho_t) until T_\alpha.
Since due to assumption A3 and with the definition of C(t) in (4.27) it holds

(4.28) \max_{t\in[0,T_\alpha]}\scrE(v_\alpha(\rho_t)) \leq \max_{t\in[0,T_\alpha]} L_\scrE\bigl(1 + \|v_\alpha(\rho_t) - v^\ast\|_2^{\gamma}\bigr)\|v_\alpha(\rho_t) - v^\ast\|_2 \leq \scrE_\infty,


Lemmas 4.3 and 4.4 provide an upper and a lower bound for the time derivative of \scrV(\rho_t), which, when combined with the definitions of T_\alpha and C(t) in (4.27), yield

\frac{d}{dt}\scrV(\rho_t) \leq -(1 - \vargamma)\bigl(2\lambda - d\sigma^2\bigr)\scrV(\rho_t) \quad\text{and}\quad \frac{d}{dt}\scrV(\rho_t) \geq -(1 + \vargamma/2)\bigl(2\lambda - d\sigma^2\bigr)\scrV(\rho_t)
for all t \in (0, T\alpha ) as before. We can thus follow the lines of the proof for the case
H \equiv 1, since also here C(t) is bounded. In particular, the choice of \alpha in (4.26) allows
us to derive the contradiction \| v\alpha (\rho T\alpha ) - v \ast \| 2 < C(T\alpha ) by using Propositions 4.5
and 4.6.
Remark 4.8 (informal lower bound for \alpha 0 ). As mentioned in section 3.3, insightful
lower bounds on the required \alpha 0 in Theorem 3.7 may be interesting in view of better
understanding the convergence of the microscopic system (1.3) to the mean-field limit
(1.7). Let us therefore informally derive in what follows an instructive lower bound
on the required \alpha 0 under the assumption that \scrE satisfies condition A2 globally with
\nu = 1/2 and that \scrE is locally L-Lipschitz continuous around v \ast , i.e., in some ball
BR (v \ast ). We restrict ourselves to the case of an inactive cutoff function H \equiv 1.
Recalling (4.20) in the proof of Theorem 3.7, \alpha should be large enough to ensure
(4.29) \|v_\alpha(\rho_t) - v^\ast\|_2 \leq c(\vargamma,\lambda,\sigma)\sqrt{\scrV(\rho_t)} \quad \text{for all } t \in [0, T],

where T is the time satisfying \scrV(\rho_T) = \varepsilon. To achieve this, we recall that for \varrho \in \scrP(\BbbR^d) the quantitative Laplace principle in Proposition 4.5 with choices q_\varepsilon := c(\vargamma,\lambda,\sigma)^2\eta^2\varepsilon/8 and r_\varepsilon := \min\{R, q_\varepsilon/L\} for q and r, respectively, yields

\|v_\alpha(\varrho) - v^\ast\|_2 \leq \frac{\sqrt{2q_\varepsilon}}{\eta} + \frac{\exp(-\alpha q_\varepsilon)}{\varrho(B_{r_\varepsilon}(v^\ast))}\int \|v - v^\ast\|_2\, d\varrho(v),

provided that A2 holds globally with \nu = 1/2 and that \scrE is L-Lipschitz continuous on some ball B_R(v^\ast). It remains to choose \alpha > \alpha_0, where

(4.30) \alpha_0 := \sup_{t\in[0,T]} \frac{-8}{c(\vargamma,\lambda,\sigma)^2\eta^2\varepsilon}\log\left(\frac{c(\vargamma,\lambda,\sigma)}{2\sqrt{2}}\,\rho_t\Bigl(B_{\min\{R,\,c(\vargamma,\lambda,\sigma)^2\eta^2\varepsilon/(8L)\}}(v^\ast)\Bigr)\right),

suggesting that \alpha 0 is strongly related to the time-evolution of the probability mass of
\rho t around v \ast . Recalling Proposition 4.6, this mass adheres to the lower bound

\rho t (Br (v \ast )) \geq \rho 0 (Br/2 (v \ast )) exp( - pt)/2 for some p > 0 and any r > 0.

However, this result is pessimistic due to its worst-case nature, and inserting it into
(4.30) with the corresponding p as in (4.10) leads to overly stringent requirements
on \alpha 0 , which are reflected by the respective second summands in (4.17) and (4.26).
Rather, a successful application of the CBO method entails that the probability mass
around the global minimizer increases over time, so that t \mapsto \rightarrow \rho t (Br (v \ast )) is typically
minimized at t = 0. In such a case, the lower bound (4.30) becomes

(4.31) \alpha_0 = \frac{-8}{c(\vargamma,\lambda,\sigma)^2\eta^2\varepsilon}\log\left(\frac{c(\vargamma,\lambda,\sigma)}{2\sqrt{2}}\,\rho_0\Bigl(B_{\min\{R,\,c(\vargamma,\lambda,\sigma)^2\eta^2\varepsilon/(8L)\}}(v^\ast)\Bigr)\right).


5. Proof details for section 3.3. In this section we provide the proof details for the result about the mean-field approximation of CBO, Proposition 3.11. After giving the proof of the auxiliary Lemma 3.10, which ensures that the dynamics is to some extent bounded, we prove Proposition 3.11.


Proof of Lemma 3.10. By combining the ideas of [10, Lemma 3.4] with a Doob-like inequality, we derive a bound for $\mathbb{E} \sup_{t\in[0,T]} \frac{1}{N}\sum_{i=1}^N \max\{\|V_t^i\|_2^4, \|\overline{V}_t^i\|_2^4\}$, which ensures that $\widehat\rho^N_t, \overline\rho^N_t \in \mathcal{P}_4(\mathbb{R}^d)$ with high probability. Here, $\overline\rho^N$ denotes the empirical measure associated with the processes $(\overline{V}^i)_{i=1,\dots,N}$.
Employing standard inequalities shows
\[
(5.1)\qquad
\mathbb{E}\sup_{t\in[0,T]} \|V_t^i\|_2^4
\lesssim \mathbb{E}\|V_0^i\|_2^4
+ \lambda^4\, \mathbb{E}\sup_{t\in[0,T]} \biggl\|\int_0^t \bigl(V_\tau^i - v_\alpha(\widehat\rho^N_\tau)\bigr)\,\mathrm{d}\tau\biggr\|_2^4
+ \sigma^4\, \mathbb{E}\sup_{t\in[0,T]} \biggl\|\int_0^t \bigl\|V_\tau^i - v_\alpha(\widehat\rho^N_\tau)\bigr\|_2\,\mathrm{d}B_\tau^i\biggr\|_2^4,
\]
where we note that the expression $\int_0^t \|V_\tau^i - v_\alpha(\widehat\rho^N_\tau)\|_2 \,\mathrm{d}B_\tau^i$ appearing in the third term on the right-hand side is a martingale, which is a consequence of [51, Corollary 3.2.6] combined with the regularity established in [10, Lemma 3.4]. This allows us to apply the Burkholder--Davis--Gundy inequality [56, Chapter IV, Theorem 4.1], which yields
\[
\mathbb{E}\sup_{t\in[0,T]} \biggl\|\int_0^t \bigl\|V_\tau^i - v_\alpha(\widehat\rho^N_\tau)\bigr\|_2 \,\mathrm{d}B_\tau^i\biggr\|_2^4
\lesssim \mathbb{E}\biggl(\int_0^T \bigl\|V_\tau^i - v_\alpha(\widehat\rho^N_\tau)\bigr\|_2^2 \,\mathrm{d}\tau\biggr)^2.
\]

Let us stress that the constant appearing in the latter estimate depends on the di-
mension d. Further bounding this as well as the second term of the right-hand side
in (5.1) by means of Jensen's inequality and utilizing [10, Lemma 3.3] yields
\[
(5.2)\qquad
\mathbb{E}\sup_{t\in[0,T]} \|V_t^i\|_2^4
\leq C\Biggl(1 + \mathbb{E}\|V_0^i\|_2^4 + \mathbb{E}\int_0^T \biggl(\|V_\tau^i\|_2^4 + \int \|v\|_2^4 \,\mathrm{d}\widehat\rho^N_\tau(v)\biggr) \mathrm{d}\tau\Biggr)
\]

with a constant C = C(\lambda , \sigma , d, T, b1 , b2 ). Averaging (5.2) over i allows us to bound


\[
\mathbb{E}\sup_{t\in[0,T]} \int \|v\|_2^4 \,\mathrm{d}\widehat\rho^N_t(v)
\leq C\Biggl(1 + \mathbb{E}\int \|v\|_2^4 \,\mathrm{d}\widehat\rho^N_0(v) + 2\,\mathbb{E}\int_0^T \sup_{\widehat\tau\in[0,\tau]} \int \|v\|_2^4 \,\mathrm{d}\widehat\rho^N_{\widehat\tau}(v) \,\mathrm{d}\tau\Biggr),
\]

which, after applying Grönwall's inequality, ensures that the left-hand side is bounded independently of $N$ by a constant $K = K(\lambda, \sigma, d, T, b_1, b_2)$. With analogous arguments, $\mathbb{E}\sup_{t\in[0,T]} \int \|v\|_2^4 \,\mathrm{d}\overline\rho^N_t(v) \leq K$. Equation (3.12) then follows from Markov's inequality.
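In explicit terms, this last step is the following schematic computation, with $2K$ playing the role of the combined moment bound and using $\max\{a,b\} \leq a + b$:
\[
\mathbb{P}\biggl(\sup_{t\in[0,T]} \frac{1}{N}\sum_{i=1}^N \max\bigl\{\|V_t^i\|_2^4, \|\overline{V}_t^i\|_2^4\bigr\} > M\biggr)
\leq \frac{1}{M}\, \mathbb{E}\sup_{t\in[0,T]} \frac{1}{N}\sum_{i=1}^N \Bigl(\|V_t^i\|_2^4 + \|\overline{V}_t^i\|_2^4\Bigr)
\leq \frac{2K}{M},
\]
so the complement event, on which both empirical measures lie in $\mathcal{P}_4(\mathbb{R}^d)$ with fourth moments at most $M$ for all $t$, has probability at least $1 - 2K/M$; this matches (3.12) up to the precise constants stated there.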

Proof of Proposition 3.11. By exploiting the boundedness established in Lemma 3.10 through a cutoff technique, we can follow the steps taken in [25, Theorem 3.1]. Let us define the cutoff function
\[
I_M(t) =
\begin{cases}
1 & \text{if } \displaystyle\frac{1}{N}\sum_{i=1}^N \max\bigl\{\|V_\tau^i\|_2^4, \|\overline{V}_\tau^i\|_2^4\bigr\} \leq M \text{ for all } \tau\in[0,t],\\
0 & \text{else},
\end{cases}
\]

which is adapted to the natural filtration and has the property $I_M(t) = I_M(t) I_M(\tau)$ for all $\tau\in[0,t]$. With Jensen's inequality and the Itô isometry, this allows us to derive

\[
(5.3)\qquad
\mathbb{E}\Bigl[\|V_t^i - \overline{V}_t^i\|_2^2\, I_M(t)\Bigr]
\lesssim c \int_0^t \mathbb{E}\Bigl[\Bigl(\|V_\tau^i - \overline{V}_\tau^i\|_2^2 + \bigl\|v_\alpha(\widehat\rho^N_\tau) - v_\alpha(\rho_\tau)\bigr\|_2^2\Bigr) I_M(\tau)\Bigr] \mathrm{d}\tau
\]
for $c = \lambda^2 T + \sigma^2$. Here we directly used that the processes $V_t^i$ and $\overline{V}_t^i$ share the initial data as well as the Brownian motion paths. In what follows, let us denote by $\overline\rho^N_\tau$ the empirical measure of the processes $\overline{V}_\tau^i$. Then, by using the same arguments as in the proofs of [10, Lemma 3.2] and [25, Lemma 3.1], taking care of the multiplication with the random variable $I_M(\tau)$, we obtain
\begin{align*}
\mathbb{E}\bigl\|v_\alpha(\widehat\rho^N_\tau) - v_\alpha(\rho_\tau)\bigr\|_2^2 I_M(\tau)
&\lesssim \mathbb{E}\bigl\|v_\alpha(\widehat\rho^N_\tau) - v_\alpha(\overline\rho^N_\tau)\bigr\|_2^2 I_M(\tau)
+ \mathbb{E}\bigl\|v_\alpha(\overline\rho^N_\tau) - v_\alpha(\rho_\tau)\bigr\|_2^2 I_M(\tau)\\
&\leq C\Bigl(\max_{i=1,\dots,N} \mathbb{E}\Bigl[\|V_\tau^i - \overline{V}_\tau^i\|_2^2\, I_M(\tau)\Bigr] + N^{-1}\Bigr)
\end{align*}

for a constant $C = C(\alpha, C_1, C_2, M, \mathcal{M}_2, b_1, b_2)$. After plugging this into (5.3) and taking the maximum over $i$, the quantitative mean-field approximation result (3.13) follows from an application of Grönwall's inequality, after recalling the definition of the conditional expectation and noting that $1_{\Omega_M} \leq I_M(t)$ pointwise for all $t \in [0,T]$.
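Schematically, this concluding step can be spelled out as follows: abbreviate $h(t) := \max_{i=1,\dots,N} \mathbb{E}[\|V_t^i - \overline{V}_t^i\|_2^2\, I_M(t)]$ and collect all constants into a single $K$. Combining (5.3) with the preceding estimate then gives
\[
h(t) \leq K \int_0^t \bigl(h(\tau) + N^{-1}\bigr) \mathrm{d}\tau,
\quad\text{whence}\quad
h(t) \leq K t\, e^{Kt} N^{-1} \leq C N^{-1} \quad\text{for all } t \in [0,T]
\]
by Grönwall's inequality; restricting to the event $\Omega_M$, on which $1_{\Omega_M} \leq I_M(t)$, turns this into the $N^{-1}$ rate of (3.13), with the precise constants as stated in Proposition 3.11.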

6. Conclusions. In this paper we establish the convergence of consensus-based optimization (CBO) methods to the global minimizer. The proof technique is based
on the novel insight that the dynamics of individual agents follow, on average over
all realizations of Brownian motion paths, the gradient flow dynamics associated with
the map $v \mapsto \|v - v^*\|_2^2$, where $v^*$ is the global minimizer of the objective $\mathcal{E}$. This
implies that CBO methods are barely influenced by the local energy landscape of \scrE ,
suggesting a high degree of robustness and versatility of the method. As opposed
to restrictive concentration conditions on the initial agent configuration \rho 0 in the
analyses in [10, 26, 31, 32], our result holds under mild assumptions about the initial
distribution $\rho_0$. Furthermore, we merely require local Lipschitz continuity and a certain tractability condition on the objective $\mathcal{E}$, relaxing the regularity requirement $\mathcal{E} \in \mathcal{C}^2(\mathbb{R}^d)$ together with further assumptions from prior works. In order to demon-
strate the relevance of the result of convergence in mean-field law for establishing a
complete convergence proof of the original numerical scheme (1.1), we prove a prob-
abilistic quantitative result about the mean-field approximation, which connects the
finite particle regime with the mean-field limit. With this we close the gap regarding
the mean-field approximation of CBO and provide the first, and so far unique, holistic
convergence proof of CBO on the plane.
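To illustrate the gradient-flow interpretation numerically, the following self-contained Python sketch (not the code underlying the experiments in this paper) runs an Euler--Maruyama discretization of the isotropic CBO scheme on the nonconvex Rastrigin function and compares the decay of the empirical energy functional $\mathcal{V}(\widehat\rho^N_t)$ with the mean-field rate $\exp(-(2\lambda - d\sigma^2)t)$; all parameter choices are illustrative assumptions, and the shift by the minimal energy in the weight computation is a standard numerical stabilization.
\begin{verbatim}
# Minimal illustrative sketch of isotropic CBO via Euler--Maruyama:
#   dV^i = -lambda (V^i - v_alpha) dt + sigma ||V^i - v_alpha||_2 dB^i.
# We track V(rho^N_t) = (1/(2N)) sum_i ||V^i_t - v*||_2^2 on the Rastrigin
# function (v* = 0) and compare with exp(-(2*lam - d*sigma^2) * T).
import numpy as np

def rastrigin(V):                               # global minimizer v* = 0
    return 10 * V.shape[1] + np.sum(V**2 - 10 * np.cos(2 * np.pi * V), axis=1)

def consensus_point(V, alpha):
    E = rastrigin(V)
    w = np.exp(-alpha * (E - E.min()))          # shifted for numerical stability
    return (w[:, None] * V).sum(axis=0) / w.sum()

rng = np.random.default_rng(0)
d, N = 2, 200
lam, sigma, alpha, dt, T = 1.0, 0.3, 1e5, 0.01, 5.0
V = rng.uniform(-3, 3, size=(N, d))             # initial agent positions

energy0 = 0.5 * np.mean(np.sum(V**2, axis=1))   # V(rho^N_0)
for _ in range(int(T / dt)):
    diff = V - consensus_point(V, alpha)
    V = (V - lam * diff * dt
         + sigma * np.linalg.norm(diff, axis=1, keepdims=True)
           * np.sqrt(dt) * rng.standard_normal((N, d)))

energyT = 0.5 * np.mean(np.sum(V**2, axis=1))   # V(rho^N_T)
print("observed decay factor: ", energyT / energy0)
print("mean-field prediction: ", np.exp(-(2 * lam - d * sigma**2) * T))
\end{verbatim}
For moderate $N$ the observed decay only approximately matches the mean-field prediction, in line with the finite particle approximation error quantified in Proposition 3.11.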
We believe that the proposed analysis strategy can be adopted to other recently
developed adaptations of the CBO algorithm, such as CBO methods tailored to mani-
fold optimization problems [25, 26], polarized CBO adjusted to identify multiple min-
imizers simultaneously [9], as well as related metaheuristics including, for instance,
particle swarm optimization [30, 38, 42], which can be regarded as a second-order vari-
ant of CBO with inertia [20, 30]. For CBO with anisotropic Brownian motions, which
are especially relevant in high-dimensional optimization problems [11]; for CBO with
memory effects and gradient information, which can be beneficial in signal processing
and machine learning applications [13, 57]; for CBO reconfigured for multiobjective
optimization; and for constrained CBO, this has already been done in [28], [57], [7],
and [8], respectively.

REFERENCES

[1] E. Aarts and J. Korst, Simulated Annealing and Boltzmann Machines. A Stochastic Approach to Combinatorial Optimization and Neural Computing, Wiley-Intersci. Ser. Discrete Math. Optim., John Wiley & Sons, Chichester, 1989.
[2] L. Ambrosio, N. Gigli, and G. Savaré, Gradient Flows in Metric Spaces and in the Space of Probability Measures, 2nd ed., Lectures Math. ETH Zürich, Birkhäuser Verlag, Basel, 2008.
[3] M. Anitescu, Degenerate nonlinear programming with a quadratic growth condition, SIAM J.
Optim., 10 (2000), pp. 1116--1135, https://doi.org/10.1137/S1052623499359178.
[4] T. Bäck, D. B. Fogel, and Z. Michalewicz, eds., Handbook of Evolutionary Computation, Institute of Physics Publishing, Oxford University Press, Bristol, New York, 1997.
[5] C. Blum and A. Roli, Metaheuristics in combinatorial optimization: Overview and conceptual
comparison, ACM Comput. Surv., 35 (2003), pp. 268--308.
[6] J. Bolte, T. P. Nguyen, J. Peypouquet, and B. W. Suter, From error bounds to the com-
plexity of first-order descent methods for convex functions, Math. Program., 165 (2017),
pp. 471--507.
[7] G. Borghi, M. Herty, and L. Pareschi, An adaptive consensus based method for multi-
objective optimization with uniform Pareto front approximation, Appl. Math. Optim., 88
(2023), 58.
[8] G. Borghi, M. Herty, and L. Pareschi, Constrained consensus-based optimization, SIAM J.
Optim., 33 (2023), pp. 211--236, https://doi.org/10.1137/22M1471304.
[9] L. Bungert, T. Roith, and P. Wacker, Polarized consensus-based dynamics for optimization
and sampling, Math. Program. (2024), pp. 1--31.
[10] J. A. Carrillo, Y.-P. Choi, C. Totzeck, and O. Tse, An analytical framework for consensus-
based global optimization method, Math. Models Methods Appl. Sci., 28 (2018), pp. 1037--
1066.
[11] J. A. Carrillo, S. Jin, L. Li, and Y. Zhu, A consensus-based global optimization method
for high dimensional machine learning problems, ESAIM Control Optim. Calc. Var., 27
(2021), S5.
[12] J. A. Carrillo, C. Totzeck, and U. Vaes, Consensus-based optimization and ensemble
Kalman inversion for global optimization problems with constraints, in Modeling and Sim-
ulation for Collective Dynamics, World Scientific, 2023, pp. 195--230.
[13] J. A. Carrillo, N. G. Trillos, S. Li, and Y. Zhu, FedCBO: Reaching group consensus in
clustered federated learning through consensus-based optimization, J. Mach. Learn. Res.,
25 (2024), pp. 1--51.
[14] L.-P. Chaintron and A. Diez, Propagation of chaos: A review of models, methods and ap-
plications. I. Models and methods, Kinet. Relat. Models, 15 (2022), pp. 895--1015.
[15] L.-P. Chaintron and A. Diez, Propagation of chaos: A review of models, methods and ap-
plications. II. Applications, Kinet. Relat. Models, 15 (2022), pp. 1017--1173.
[16] J. Chen, S. Jin, and L. Lyu, A consensus-based global optimization method with adaptive
momentum estimation, Commun. Comput. Phys., 31 (2022), pp. 1296--1316.
[17] T.-S. Chiang, C.-R. Hwang, and S.-J. Sheu, Diffusion for global optimization in $\mathbb{R}^n$, SIAM J. Control Optim., 25 (1987), pp. 737--753, https://doi.org/10.1137/0325042.
[18] L. Chizat, Mean-field Langevin dynamics: Exponential convergence and annealing, Trans.
Mach. Learn. Res., 2022 (2022), pp. 1--17.
[19] L. Chizat and F. Bach, On the global convergence of gradient descent for over-parameterized
models using optimal transport, in Advances in Neural Information Processing Systems
31, Curran Associates, 2018.
[20] C. Cipriani, H. Huang, and J. Qiu, Zero-inertia limit: From particle swarm optimization to
consensus-based optimization, SIAM J. Math. Anal., 54 (2022), pp. 3091--3121, https://
doi.org/10.1137/21M1412323.
[21] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications, 2nd ed., Appl.
Math. (N. Y.) 38, Springer-Verlag, New York, 1998.
[22] J. Dong and X. T. Tong, Replica exchange for non-convex optimization, J. Mach. Learn.
Res., 22 (2021), pp. 7826--7884.
[23] A. Durmus and É. Moulines, Nonasymptotic convergence analysis for the unadjusted Langevin algorithm, Ann. Appl. Probab., 27 (2017), pp. 1551--1587.
[24] D. B. Fogel, Evolutionary Computation. Toward a New Philosophy of Machine Intelligence,
2nd ed., IEEE Press, Piscataway, NJ, 2000.
[25] M. Fornasier, H. Huang, L. Pareschi, and P. Sünnen, Consensus-based optimization on hypersurfaces: Well-posedness and mean-field limit, Math. Models Methods Appl. Sci., 30 (2020), pp. 2725--2751.

[26] M. Fornasier, H. Huang, L. Pareschi, and P. Sünnen, Consensus-based optimization on the sphere: Convergence to global minimizers and machine learning, J. Mach. Learn. Res., 22 (2021), pp. 1--55.
[27] M. Fornasier, H. Huang, L. Pareschi, and P. Sünnen, Anisotropic diffusion in consensus-based optimization on the sphere, SIAM J. Optim., 32 (2022), pp. 1984--2012, https://doi.org/10.1137/21M140941X.
[28] M. Fornasier, T. Klock, and K. Riedl, Convergence of anisotropic consensus-based optimization in mean-field law, in Applications of Evolutionary Computation, J. L. Jiménez Laredo, J. I. Hidalgo, and K. O. Babaagba, eds., Springer, 2022, pp. 738--754.
[29] N. J. Gerber, F. Hoffmann, and U. Vaes, Mean-Field Limits for Consensus-Based Opti-
mization and Sampling, preprint, arXiv:2312.07373, 2023.
[30] S. Grassi and L. Pareschi, From particle swarm optimization to consensus based optimization:
Stochastic modeling and mean-field limit, Math. Models Methods Appl. Sci., 31 (2021),
pp. 1625--1657.
[31] S.-Y. Ha, S. Jin, and D. Kim, Convergence of a first-order consensus-based global optimization
algorithm, Math. Models Methods Appl. Sci., 30 (2020), pp. 2417--2444.
[32] S.-Y. Ha, S. Jin, and D. Kim, Convergence and error estimates for time-discrete consensus-
based optimization algorithms, Numer. Math., 147 (2021), pp. 255--282.
[33] W. K. Hastings, Monte Carlo sampling methods using Markov chains and their applications,
Biometrika, 57 (1970), pp. 97--109.
[34] D. J. Higham, An algorithmic introduction to numerical simulation of stochastic dif-
ferential equations, SIAM Rev., 43 (2001), pp. 525--546, https://doi.org/10.1137/
S0036144500378302.
[35] J. H. Holland, Adaptation in Natural and Artificial Systems. An Introductory Analysis with
Applications to Biology, Control, and Artificial Intelligence, University of Michigan Press,
Ann Arbor, MI, 1975.
[36] R. A. Holley, S. Kusuoka, and D. W. Stroock, Asymptotics of the spectral gap with appli-
cations to the theory of simulated annealing, J. Funct. Anal., 83 (1989), pp. 333--347.
[37] H. Huang and J. Qiu, On the mean-field limit for the consensus-based optimization, Math.
Methods Appl. Sci., 45 (2022), pp. 7814--7831.
[38] H. Huang, J. Qiu, and K. Riedl, On the global convergence of particle swarm optimization
methods, Appl. Math. Optim., 88 (2023), 30.
[39] P.-E. Jabin and Z. Wang, Mean field limit for stochastic particle systems, in Active Particles. Vol. 1. Advances in Theory, Models, and Applications, Model. Simul. Sci. Eng. Technol., Birkhäuser/Springer, Cham, 2017, pp. 379--402.
[40] D. Kalise, A. Sharma, and M. V. Tretyakov, Consensus-based optimization via jump-
diffusion stochastic differential equations, Math. Models Methods Appl. Sci., 33 (2023),
pp. 289--339.
[41] H. Karimi, J. Nutini, and M. Schmidt, Linear convergence of gradient and proximal-gradient methods under the Polyak--Łojasiewicz condition, in Machine Learning and Knowledge Discovery in Databases, P. Frasconi, N. Landwehr, G. Manco, and J. Vreeken, eds., Springer, 2016, pp. 795--811.
[42] J. Kennedy and R. Eberhart, Particle swarm optimization, in Proceedings of ICNN'95 -
International Conference on Neural Networks, Vol. 4, IEEE, 1995, pp. 1942--1948.
[43] J. Kim, M. Kang, D. Kim, S.-Y. Ha, and I. Yang, A stochastic consensus method for non-
convex optimization on the Stiefel manifold, in 2020 59th IEEE Conference on Decision
and Control (CDC), IEEE, 2020, pp. 1050--1057.
[44] Y. LeCun, C. Cortes, and C. Burges, MNIST Handwritten Digit Database, 2010.
[45] Q. Liu and D. Wang, Stein variational gradient descent: A general purpose Bayesian inference
algorithm, in Advances in Neural Information Processing Systems 29, Curran Associates,
2016.
[46] J. Lu, E. Tadmor, and A. Zenginoglu, Swarm-Based Gradient Descent Method for Non-
Convex Optimization, preprint, arXiv:2211.17157, 2022.
[47] H. P. McKean, Propagation of chaos for a class of non-linear parabolic equations, in Stochastic
Differential Equations, Lecture Series in Differential Equations 1967, Air Force Office of
Scientific Research, Office of Aerospace Research, United States Air Force, Arlington, VA,
1967, pp. 41--57.
[48] S. Mei, A. Montanari, and P.-M. Nguyen, A mean field view of the landscape of two-layer
neural networks, Proc. Natl. Acad. Sci. USA, 115 (2018), pp. E7665--E7671.
[49] P. D. Miller, Applied Asymptotic Analysis, Grad. Stud. Math. 75, American Mathematical
Society, Providence, RI, 2006.

[50] I. Necoara, Y. Nesterov, and F. Glineur, Linear convergence of first order methods for
non-strongly convex optimization, Math. Program., 175 (2019), pp. 69--107.
[51] B. Øksendal, Stochastic Differential Equations: An Introduction with Applications, 6th ed., Springer-Verlag, Berlin, 2003.
[52] R. Pinnau, C. Totzeck, O. Tse, and S. Martin, A consensus-based model for global
optimization and its mean-field limit, Math. Models Methods Appl. Sci., 27 (2017),
pp. 183--204.
[53] E. Platen, An introduction to numerical methods for stochastic differential equations, in Acta
Numerica, 1999, Acta Numer. 8, Cambridge University Press, Cambridge, 1999, pp. 197--
246.
[54] L. A. Rastrigin, The convergence of the random search method in the extremal control of a
many-parameter system, Automat. Remote Control, 24 (1963), pp. 1337--1342.
[55] C. Reeves, Genetic algorithms, in Handbook of Metaheuristics, Internat. Ser. Oper. Res.
Management Sci. 57, Kluwer Academic Publishers, Boston, MA, 2003, pp. 55--82.
[56] D. Revuz and M. Yor, Continuous Martingales and Brownian Motion, 3rd ed., Grundlehren
Math. Wiss. 293, Springer-Verlag, Berlin, 1999.
[57] K. Riedl, Leveraging memory effects and gradient information in consensus-based optimisation: On global convergence in mean-field law, European J. Appl. Math., (2023).
[58] K. Riedl, T. Klock, C. Geldhauser, and M. Fornasier, Gradient is All You Need?, preprint, arXiv:2306.09778, 2023.
[59] K. Riedl, T. Klock, C. Geldhauser, and M. Fornasier, How Consensus-Based Optimiza-
tion can be Interpreted as a Stochastic Relaxation of Gradient Descent, ICML Workshop
Differentiable Almost Everything: Differentiable Relaxations, Algorithms, Operators, and
Simulators, 2024.
[60] G. O. Roberts and R. L. Tweedie, Exponential convergence of Langevin distributions and
their discrete approximations, Bernoulli, 2 (1996), pp. 341--363.
[61] G. Rotskoff and E. Vanden-Eijnden, Trainability and accuracy of artificial neural networks:
An interacting particle system approach, Comm. Pure Appl. Math., 75 (2022), pp. 1889--
1935.
[62] J. Sirignano and K. Spiliopoulos, Mean field analysis of neural networks: A law of
large numbers, SIAM J. Appl. Math., 80 (2020), pp. 725--752, https://doi.org/10.1137/
18M1192184.
[63] F. J. Solis and R. J.-B. Wets, Minimization by random search techniques, Math. Oper. Res.,
6 (1981), pp. 19--30.
[64] A.-S. Sznitman, Topics in propagation of chaos, in École d'Été de Probabilités de Saint-Flour XIX---1989, Lecture Notes in Math. 1464, Springer, Berlin, 1991, pp. 165--251.
[65] C. Totzeck and M.-T. Wolfram, Consensus-based global optimization with personal best,
Math. Biosci. Eng., 17 (2020), pp. 6026--6044.
[66] F. van den Bergh, An Analysis of Particle Swarm Optimizers, Ph.D. thesis, University of
Pretoria, 2007.
[67] Y. Xu, Q. Lin, and T. Yang, Adaptive SVRG methods under error bound conditions with
unknown growth parameter , in Advances in Neural Information Processing Systems 30,
Curran Associates, 2017, pp. 3279--3289.
