Consensus-Based Optimization Methods Converge Globally
DOI: 10.1137/22M1527805
hard problems under such conditions, several instances arising in real-world scenarios
can, at least empirically, be solved within reasonable accuracy and moderate compu-
tational time. In the present work we are concerned with the class of derivative-free
optimization algorithms, i.e., methods that are based exclusively on the evaluation of
the objective function \scrE . Among these, and achieving the state of the art on chal-
lenging problems such as the traveling salesman problem, are so-called metaheuristics
[1, 4, 5, 42, 55].

*Received by the editors October 10, 2022; accepted for publication (in revised form) May 1,
†University of Munich, and Munich Center for Machine Learning, Munich, Germany (massimo.fornasier@ma.tum.de, konstantin.riedl@ma.tum.de).
‡Department of Numerical Analysis and Scientific Computing, Simula Research Laboratory, Oslo, Norway (tmklock@googlemail.com).

Metaheuristics orchestrate an interaction between local improvement procedures and global strategies and combine deterministic and random decisions to
create a process capable of escaping from local optima and performing a robust search
of the solution space. Examples include random search [54], evolutionary programming
[24], the Metropolis--Hastings algorithm [33], genetic algorithms [35], particle swarm
optimization [42], and simulated annealing [1]. Despite their tremendous empirical
success and widespread use in practice, many metaheuristics, due to their complex-
ity, lack a proper mathematical foundation that could prove robust convergence to
global minimizers under suitable assumptions. Nevertheless, for some of them, such as
random search or simulated annealing, there exist probabilistic guarantees for global
convergence; see, e.g., [36, 63]. While transferring some of the ideas of [63] to particle
swarm optimization allows us to establish guaranteed convergence to global minima,
the proof argument uses a computational time coinciding with the time necessary to
examine every location in the search space [66].
Recently, the authors of [10, 52] introduced consensus-based optimization (CBO)
methods, which follow the guiding principles of metaheuristic algorithms, but are of
much simpler nature and more amenable to theoretical analysis. Inspired by con-
sensus dynamics and opinion formation, CBO methods use a finite number of agents
V 1 , . . . , V N , which are formally stochastic processes, to explore the domain and to
form a global consensus about the location of the minimizer v \ast as time passes. The
dynamics of the agents V 1 , . . . , V N are governed by two competing terms. A drift term
drags each agent toward an instantaneous consensus point, denoted by v\alpha , which is
computed as a weighted average of all agents' positions and serves as a momentaneous proxy for the global minimizer $v^*$. This term may be deactivated individually
for an agent if its position improves upon the consensus point through modulating
the drift by a function H approximating the Heaviside function. The second term is
stochastic and randomly diffuses agents according to a scaled Brownian motion in \BbbR d ,
featuring the exploration of the energy landscape of the cost $\mathcal{E}$. Ideally, as a result of the described drift-diffusion mechanism, the agents eventually achieve a near-optimal global consensus, in the sense that the associated empirical measure $\widehat{\rho}^N_t := \frac{1}{N}\sum_{i=1}^{N} \delta_{V^i_t}$ converges to a Dirac delta $\delta_{\widetilde{v}}$ at some $\widetilde{v} \in \mathbb{R}^d$ close to $v^*$.
Let us now provide a formal description of the method. Given a time horizon $T > 0$ and a time discretization $t_0 = 0 < \Delta t < \cdots < K\Delta t = T$ of $[0, T]$, we denote the location of agent $i$ at time $k\Delta t$ by $V^i_{k\Delta t}$, $k = 0, \dots, K$. For user-specified parameters $\alpha, \lambda, \sigma > 0$, the time-discrete evolution of the $i$th agent is defined by the update rule

(1.1) $V^i_{(k+1)\Delta t} - V^i_{k\Delta t} = -\Delta t\, \lambda \big(V^i_{k\Delta t} - v_\alpha(\widehat{\rho}^N_{k\Delta t})\big)\, H\big(\mathcal{E}(V^i_{k\Delta t}) - \mathcal{E}(v_\alpha(\widehat{\rho}^N_{k\Delta t}))\big) + \sigma \big\|V^i_{k\Delta t} - v_\alpha(\widehat{\rho}^N_{k\Delta t})\big\|_2\, B^i_{k\Delta t},$

(1.2) $V^i_0 \sim \rho_0$ for all $i = 1, \dots, N$,
where $((B^i_{k\Delta t})_{k=0,\dots,K-1})_{i=1,\dots,N}$ are independent and identically distributed (i.i.d.) Gaussian random vectors in $\mathbb{R}^d$ with zero mean and covariance matrix $\Delta t\,\mathrm{Id}_d$. The system is complemented with independent initial data $(V^i_0)_{i=1,\dots,N}$, distributed according
to a common initial law $\rho_0$. Equation (1.1) originates from a simple Euler--Maruyama time discretization [34, 53] of the system of stochastic differential equations (SDEs)

(1.3) $dV^i_t = -\lambda \big(V^i_t - v_\alpha(\widehat{\rho}^N_t)\big)\, H\big(\mathcal{E}(V^i_t) - \mathcal{E}(v_\alpha(\widehat{\rho}^N_t))\big)\, dt + \sigma \big\|V^i_t - v_\alpha(\widehat{\rho}^N_t)\big\|_2\, dB^i_t,$

where $((B^i_t)_{t\geq 0})_{i=1,\dots,N}$ are now independent standard Brownian motions in $\mathbb{R}^d$. As
mentioned in the informal description above, the updates in the evolutions (1.1) and
(1.3) consist of two terms, respectively. The first term is the drift towards the momentaneous consensus $v_\alpha(\widehat{\rho}^N_t)$, which is defined by
(1.5) $v_\alpha(\widehat{\rho}^N_t) := \int v\, \frac{\omega_\alpha(v)}{\|\omega_\alpha\|_{L^1(\widehat{\rho}^N_t)}}\, d\widehat{\rho}^N_t(v), \quad \text{with} \quad \omega_\alpha(v) := \exp(-\alpha \mathcal{E}(v)).$
Definition (1.5) is motivated by the well-known Laplace principle [21, 49, 52], which
states that for any absolutely continuous probability distribution $\varrho$ on $\mathbb{R}^d$, we have

(1.6) $\lim_{\alpha \to \infty} \left(-\frac{1}{\alpha} \log\left(\int \omega_\alpha(v)\, d\varrho(v)\right)\right) = \inf_{v \in \mathrm{supp}(\varrho)} \mathcal{E}(v).$
Alternatively, we can also interpret (1.5) as an approximation of $\arg\min_{i=1,\dots,N} \mathcal{E}(V^i_t)$, which improves as $\alpha \to \infty$, provided the minimizer uniquely exists. The univariate
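To build intuition for (1.5) and (1.6), the weighted average can be evaluated on a sample cloud. The following minimal sketch is our own illustration, not from the paper; the toy quadratic objective and all names are our choices, and NumPy is assumed. It computes $v_\alpha$ stably by shifting the exponent and checks the Laplace principle on the empirical measure.

```python
import numpy as np

def consensus_point(V, energies, alpha):
    """Consensus point v_alpha from (1.5) for the empirical measure of V.

    Shifting the energies by their minimum leaves the weight ratios
    unchanged and avoids underflow in omega_alpha(v) = exp(-alpha * E(v))."""
    w = np.exp(-alpha * (energies - energies.min()))
    return (w[:, None] * V).sum(axis=0) / w.sum()

rng = np.random.default_rng(0)
E = lambda v: (v**2).sum(axis=-1)          # toy objective with minimizer v* = 0
V = rng.normal(1.0, 2.0, size=(1000, 2))   # N = 1000 agents in R^2
en = E(V)

for alpha in (1.0, 10.0, 100.0):
    v_a = consensus_point(V, en, alpha)
    # Empirical Laplace principle (1.6): this value decreases monotonically
    # to min_i E(V^i) as alpha grows (here rho is the empirical agent measure).
    laplace = en.min() - np.log(np.mean(np.exp(-alpha * (en - en.min())))) / alpha
    print(alpha, v_a, laplace)
```

For large $\alpha$ the weights concentrate on the currently best agent, so `consensus_point` effectively returns its position, matching the interpretation of (1.5) as an approximation of the sample $\arg\min$.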
function $H : \mathbb{R} \to [0, 1]$ appearing in the first term of (1.1) and (1.3) can be used to deactivate the drift term for agents $V^i_t$ whose objective value is better than that of the momentaneous consensus, i.e., for which $\mathcal{E}(V^i_t) < \mathcal{E}(v_\alpha(\widehat{\rho}^N_t))$, by setting $H(x) \approx 1_{x \geq 0}$.
The most frequently studied choice, however, is H \equiv 1. The second term in (1.1) and
(1.3) encodes the diffusion or exploration mechanism of the algorithm. Intuitively,
scaling by $\|V^i_t - v_\alpha(\widehat{\rho}^N_t)\|_2$ encourages agents far from the consensus point to explore
larger regions, whereas agents close to the consensus point try to enhance their po-
sition only locally. Furthermore, the scaling is essential to eventually deactivate the
Brownian motion and to achieve consensus among the individual agents.
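The update rule (1.1)--(1.2) with the common choice $H \equiv 1$ translates almost line by line into code. The following is a minimal illustrative sketch, our own and not the authors' implementation; the objective, all function names, and all parameter values are our choices, here a two-dimensional Rastrigin-type function as in the paper's figures.

```python
import numpy as np

def rastrigin(v):
    # E(v) = ||v||_2^2 + 2.5 * sum_k (1 - cos(2 pi v_k)); global minimizer v* = 0
    return (v**2).sum(axis=-1) + 2.5 * (1 - np.cos(2 * np.pi * v)).sum(axis=-1)

def cbo(E, N=500, d=2, K=2000, dt=0.01, lam=1.0, sigma=0.1, alpha=1e15, seed=0):
    """Euler-Maruyama scheme (1.1)-(1.2) with H == 1."""
    rng = np.random.default_rng(seed)
    V = rng.normal(2.0, 2.0, size=(N, d))                  # (1.2): V_0^i ~ rho_0
    for _ in range(K):
        en = E(V)
        w = np.exp(-alpha * (en - en.min()))               # omega_alpha, stably
        v_alpha = (w[:, None] * V).sum(axis=0) / w.sum()   # consensus point (1.5)
        diff = V - v_alpha
        B = rng.normal(0.0, np.sqrt(dt), size=V.shape)     # i.i.d., covariance dt * Id
        # (1.1): drift toward v_alpha, diffusion scaled by the distance to it
        V = (V - dt * lam * diff
             + sigma * np.linalg.norm(diff, axis=-1, keepdims=True) * B)
    return V

V_final = cbo(rastrigin)
# the agents collapse to a consensus; with enough agents and exploration,
# the consensus lies near the global minimizer v* = 0
print(V_final.mean(axis=0), rastrigin(V_final.mean(axis=0)))
```

Note that for $\alpha = 10^{15}$ the weights degenerate to a one-hot vector, so the consensus point is simply the position of the currently best agent; smaller $\alpha$, more agents, or larger $\sigma$ trade exploitation against exploration.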
CBO methods have been considered and analyzed in several recent papers [8, 10,
11, 12, 16, 25, 26, 27, 28, 40, 43, 65], even for optimization problems in high-dimensional
and non-Euclidean settings, and using more sophisticated rules for the parameter
choices of $\alpha$ and $\sigma$ inspired by simulated annealing [11, 26]. Moreover, several variants
of the dynamics have been proposed, such as ones integrating memory mechanisms
[57, 65] or others using jump-diffusion processes [40]. To make the method feasi-
ble and competitive for large-scale applications, in particular, for problems arising in
machine learning, random minibatch sampling techniques have been employed when
evaluating the objective function or computing the consensus point. This significantly
reduces the computational and communication complexity of CBO methods [11, 28]
and further enables the parallelization of the algorithm by evolving disjoint subsets of
particles independently for some time with separate consensus points, before aligning
the dynamics through a global communication step. However, despite raising interesting questions about the tradeoff between parallel efficiency and optimization performance, and about how much communication between the individual agents is actually needed, this area remains largely unexplored for CBO. As an example of the applicability of
CBO to such high-dimensional problems, we refer the reader to [11, 28, 57] where the
method is used for training a shallow and a convolutional neural network for image
classification of the MNIST database of handwritten digits [44], to the recent paper
[13] where CBO is used in the setting of clustered federated learning, to [57] where a
compressed sensing problem is solved, or to the line of works [25, 26, 27] where (1.1) and (1.3) are adapted to the sphere $\mathbb{S}^{d-1}$, achieving near state-of-the-art performance on a phase retrieval problem, on a robust subspace detection problem, and when robustly computing eigenfaces. Recently, also general constrained optimization problems have
been tackled by CBO through the use of penalization techniques, which allow one to
cast the constrained problem as an unconstrained optimization task [8, 12].
As initially mentioned, CBO methods are motivated by the urge to develop a
class of metaheuristic algorithms with provable guarantees, while preserving their
$\|\widetilde{v} - v^*\|_2$. A theoretical analysis of the dynamics can be done either on the microscopic systems (1.1) or (1.3), as for instance in [31, 32], or, as in [10, 52], by analyzing
the macroscopic behavior of the agent density through a mean-field limit associated
with the particle-based dynamics (1.3), given, for initial data V0 \sim \rho 0 , by
(1.7) $dV_t = -\lambda \big(V_t - v_\alpha(\rho_t)\big)\, H\big(\mathcal{E}(V_t) - \mathcal{E}(v_\alpha(\rho_t))\big)\, dt + \sigma \big\|V_t - v_\alpha(\rho_t)\big\|_2\, dB_t,$
where \rho t = Law(Vt ). The weak convergence of the microscopic system (1.3) to the
mean-field limit (1.7), or, more precisely, of the empirical measure $\widehat{\rho}^N_t$ to $\rho_t$ as $N \to \infty$,
has been shown recently in [37]; see also Remark 1.2 for additional details. This
legitimates analyzing (1.7) in lieu of (1.3). The measure $\rho \in \mathcal{C}([0, T], \mathcal{P}(\mathbb{R}^d))$ with $\rho_t = \rho(t) = \mathrm{Law}(V_t)$ satisfies the nonlinear nonlocal Fokker--Planck equation
(1.8) $\partial_t \rho_t = \lambda\, \mathrm{div}\big((v - v_\alpha(\rho_t))\, H(\mathcal{E}(v) - \mathcal{E}(v_\alpha(\rho_t)))\, \rho_t\big) + \frac{\sigma^2}{2} \Delta\big(\|v - v_\alpha(\rho_t)\|_2^2\, \rho_t\big)$
in a weak sense (see Definition 3.1). Leveraging this partial differential equation
(PDE), the authors of [10, 52] analyze the large time behavior of the particle density
$t \mapsto \rho_t$ instead of the microscopic systems (1.1) and (1.3). Studying the mean-field
limit (1.7) or (1.8) allows for agile deterministic calculus tools and typically leads to
stronger theoretical results, which characterize the average agent behavior through the
evolution of \rho . This analysis perspective is justified by the mean-field approximation,
which quantifies the convergence of the microscopic system (1.3) to the mean-field
limit (1.7) as the number of agents grows. We discuss results about the mean-field
approximation in Remark 1.2 and make it rigorous in Proposition 3.11. Hence, in view
of its validity and as already done in the preceding works [10, 52], in the first part
of the paper we concentrate on establishing convergence in mean-field law for (1.3),
as defined in Definition 1.1 below. That is, we analyze the mean-field dynamics (1.7)
and (1.8) in place of the interacting particle system (1.3). Afterwards, by combining
the mean-field approximation with convergence in mean-field law, we close the paper
with a global convergence result for the numerical method (1.1).
Definition 1.1 (convergence in mean-field law). Let $F, G : \mathcal{P}(\mathbb{R}^d) \times \mathbb{R}^d \to \mathbb{R}^d$ be two functions, and consider for $i = 1, \dots, N$ the SDEs expressed in Itô's form as

$dV^i_t = F\big(\widehat{\rho}^N_t, V^i_t\big)\, dt + G\big(\widehat{\rho}^N_t, V^i_t\big)\, dB^i_t, \quad \text{where } \widehat{\rho}^N_t = \frac{1}{N}\sum_{i=1}^{N} \delta_{V^i_t}, \text{ and } V^i_0 \sim \rho_0.$

We say that this SDE system converges in mean-field law to $\widetilde{v} \in \mathbb{R}^d$ if all solutions of

$dV_t = F\big(\rho_t, V_t\big)\, dt + G\big(\rho_t, V_t\big)\, dB_t, \quad \text{where } \rho_t = \mathrm{Law}(V_t), \text{ and } V_0 \sim \rho_0,$

satisfy $\lim_{t \to \infty} W_p(\rho_t, \delta_{\widetilde{v}}) = 0$ for some Wasserstein-$p$ distance $W_p$, $p \geq 1$.
Colloquially speaking, an interacting multiparticle system is said to converge in
mean-field law if the associated mean-field dynamics converges.
Remark 1.2 (mean-field approximation). The definition of convergence in mean-
field law as given in Definition 1.1 is justified as follows: As the number of agents
N in the interacting particle system (1.3) tends to infinity, one expects that, for
any particle V i , the individual influence of any other particle disperses. This results
in an averaged influence of the ensemble rather than an interacting nature of the
system, and allows us to describe the dynamics in the large-particle limit by the law
\rho of the monoparticle process (1.7). This phenomenon is known as the mean-field
approximation. More formally, as $N \to \infty$, we expect the empirical measure $\widehat{\rho}^N_t$ to converge in law to $\rho_t$ for almost every $t$; see [39, Definition 1]. The classical way to
establish such mean-field approximation is to prove, by means of the coupling method,
propagation of chaos [47, 64], as implied, for instance, by
(1.9) $\max_{i=1,\dots,N} \sup_{t \in [0,T]} \mathbb{E}\big\|V^i_t - \overline{V}^i_t\big\|_2^2 \leq C N^{-1},$

where $\overline{V}^i$ denote $N$ i.i.d. copies of the mean-field dynamics (1.7), which are coupled to
the processes V i by choosing the same initial conditions as well as Brownian motion
paths; see, e.g., the recent review [14, 15]. Despite being of fundamental numerical
interest (since when combined with the convergence in mean-field law it allows us to
establish convergence of the interacting particle system itself), a quantitative result
about the mean-field approximation of CBO as in (1.9) has been left as a difficult and
open problem in [10, Remark 3.3] due to a lack of global Lipschitz continuity of the
drift and diffusion terms, which impedes the application of McKean's theorem [15,
Theorem 3.1].
However, the present paper as well as three recent works, which we outline in
what follows, are shedding light on this issue. By employing a compactness argument
in the path space, the authors of [37] show that the empirical random particle measure $\widehat{\rho}^N$ associated with the dynamics (1.3) converges in distribution to the deterministic particle distribution $\rho \in \mathcal{C}([0, T], \mathcal{P}(\mathbb{R}^d))$ satisfying (1.8). In particular, their result
is valid for unbounded functions \scrE considered also in our work. While this does not
allow for obtaining a quantitative convergence rate with respect to the number of
particles N as in (1.9), it closes the mean-field limit gap qualitatively. A desired
quantitative result has been established recently in [25, Theorem 3.1 and Remark 3.1]
for a variant of the microscopic system (1.3) supported on a compact hypersurface \Gamma .
In [25] the weak convergence of the variant of (1.3) to the corresponding mean-field
limit is established in the sense that for all $\phi \in \mathcal{C}^1_b(\mathbb{R}^d)$ it holds that

$\sup_{t \in [0,T]} \mathbb{E}\Big[\big|\langle \widehat{\rho}^N_t, \phi\rangle - \langle \rho_t, \phi\rangle\big|^2\Big] \leq \frac{C}{N}\, \|\phi\|^2_{\mathcal{C}^1(\mathbb{R}^d)} \to 0 \quad \text{as } N \to \infty.$
Their proof is based on the aforementioned coupling method and, by exploiting the inherent compactness of the dynamics due to its confinement to $\Gamma$, allows one to derive a bound of the form (1.9). Leveraging the techniques from [25] and the boundedness of
moments established in [10, Lemma 3.4], we provide in Proposition 3.11 a result of the
form (1.9) on the plane \BbbR d which holds with high probability. A more refined analysis
conducted recently by the authors of [29], which adapts Sznitman's classical argument
for the proof of McKean's theorem with the intention of allowing for coefficients which
are not globally Lipschitz, even yields a nonprobabilistic mean-field approximation of
the form (1.9) in the pathwise sense, requiring in comparison merely a higher moment
bound \rho 0 \in \scrP 6 (\BbbR d ) of the initial measure; see [29, Theorem 2.6].
(a) The Rastrigin function $\mathcal{E}$ and an exemplary initialization for one run of the experiment. (b) Individual agents follow, on average, the gradient flow of the map $v \mapsto \|v - v^*\|_2^2$.
Fig. 1. An illustration of the internal mechanisms of CBO. We perform 100 runs of the CBO algorithm (1.1)--(1.2), with parameters $\Delta t = 0.01$, $\alpha = 10^{15}$, $\lambda = 1$, and $\sigma = 0.1$, and $N = 32000$ agents initialized according to $\rho_0 = \mathcal{N}((8,8), 20)$. In addition, we add three individual agents with starting locations $(-2, 4)$, $(-1.5, -1.5)$, and $(4.5, 1.5)$ to the set of agents in each run as shown in (a), and depict each of their 100 trajectories as well as their mean trajectory in shades of yellow in (b).
With the (mean) trajectories being rather straight lines, we observe that the individual agents take a
straight path from their initial positions to the global minimizer v \ast and, in particular, disregard the
local landscape of the objective function \scrE . The trajectories of the individual agents become more
concentrated as the overall number of agents N grows. (Color figure available online.)
we provide the first, and so far unique, holistic convergence proof of CBO on the
plane, enabling us to quantify the optimization capability of the numerical CBO al-
gorithm (1.1) in terms of the used parameters. The utilized proof technique may be
used as a blueprint for proving global convergence for other recent adaptations of the
CBO dynamics; see, e.g., [8, 11, 26, 27, 28, 40], as well as other metaheuristics such as
the renowned particle swarm optimization, which is related to CBO through a zero-
inertia limit; see, e.g., [20, 30, 38]. While the present paper has a foundational and
theoretical nature and aims at completely clarifying the convergence of the numerical
scheme (1.1) with a detailed analysis, we do not include extensive numerical experiments. For numerical evidence that CBO does solve difficult optimization problems also in high dimensions without necessarily incurring the curse of dimensionality, the reader may want to consult previous works such as [11, 13, 16, 26, 27, 28, 57].
Remark 1.3. Employing stochasticity and leveraging collaboration between mul-
tiple agents to empirically and provably achieve global convergence of numerical al-
gorithms and to avoid convergence to local minima is not just of particular relevance
when it comes to the efficiency and success of zero-order methods, but also an emerg-
ing paradigm in the field of gradient-based optimization; see, e.g., [18, 22, 46]. Recent
work [58, 59] even suggests a connection between the worlds of derivative-free and
gradient-based methods. Similar guiding principles are present also in sampling meth-
ods, such as Langevin sampling [17, 18, 23, 60] and Stein variational gradient descent
[45], which are designed to generate samples from an unknown target distribution.
A promising way to gain a theoretical understanding of the behavior of these
classes of algorithms is by taking a mean-field perspective, i.e., by analyzing the
dynamics, as the number of particles goes to infinity, through an associated PDE. This
typically involves Polyak--Łojasiewicz-like conditions [41] or certain families of log-Sobolev inequalities [18] on the objective function $\mathcal{E}$, which are more restrictive than
the assumptions under which the statements of this work hold. For a recent analysis
of the mean-field Langevin dynamics we refer the reader to [18] and references therein.
Conceptually similar to the convexification of a highly nonconvex problem ob-
served in this work, taking a mean-field perspective has recently allowed the authors
of [19, 48, 61, 62] to explain the generalization capabilities of overparameterized neural
networks. By leveraging the fact that the mean-field description (w.r.t. the number of
neurons) of the SGD learning \bigl( dynamics is\bigr) captured by a nonlinear PDE, which admits
a gradient flow structure on \scrP 2 (\BbbR d ), W2 , these works show that original complexities
of the loss landscape are alleviated. Together with a quantification of the fluctuations
of the empirical neuron distribution around this mean-field limit (i.e., a mean-field ap-
proximation), convergence results are derived for SGD for sufficiently large networks
with optimal generalization error. These results, however, are structurally different
from the ones obtained in this paper for CBO. In particular, the individual particles in
[19, 48, 61, 62] are associated with the different neurons of a two-layer or deep neural
network and the objective function is a specific empirical risk, which itself is subject
to the mean-field limit and gains convexity as the number of neurons tends to infinity.
In contrast, in our setting each particle itself is a competitor for minimization of a
general fixed nonconvex objective function \scrE and the convexification of the problem
emerges from the CBO dynamics when its mean-field limit behavior is studied. For
this reason, the two resulting mean-field limits are different.
Let us further point out that, to the community's current understanding, the Fokker--Planck equation (1.8) describing the mean-field behavior of CBO cannot be understood as a gradient flow of any energy on $(\mathcal{P}_2(\mathbb{R}^d), W_2)$. Yet, and
perhaps surprisingly, the analysis of our present paper shows that the Wasserstein-2
distance from the global minimizer is the correct Lyapunov functional to be analyzed.
the initial condition $\rho_0$. More precisely, in [10, section 4.1] the authors use Itô's lemma to derive for the time-evolution of $\mathrm{Var}(\rho_t)$ the expression
(2.1) $\frac{d}{dt} \mathrm{Var}(\rho_t) = -\big(2\lambda - d\sigma^2\big)\, \mathrm{Var}(\rho_t) + \frac{d\sigma^2}{2}\, \|\mathbb{E}(\rho_t) - v_\alpha(\rho_t)\|_2^2.$
For parameter choices $2\lambda > d\sigma^2$, the first term in (2.1) is negative and one could almost apply Grönwall's inequality to obtain the asserted exponential decay of $\mathrm{Var}(\rho_t)$.
However, the second term can be problematic and the main difficulty is to control the distance $\|\mathbb{E}(\rho_t) - v_\alpha(\rho_t)\|_2$ between the mean and the weighted mean. For $\alpha \to 0$ the weight function $\omega_\alpha(v) = \exp(-\alpha\mathcal{E}(v))$ associated with $v_\alpha(\rho_t)$ converges to 1 pointwise and consequently $v_\alpha(\rho_t) \to \mathbb{E}(\rho_t)$. However, the second proof step, explained below, reveals that the crucial regime is $\alpha \gg 1$. In this case $v_\alpha(\rho_t)$ can be arbitrarily far from $\mathbb{E}(\rho_t)$ if we do not dispose of additional knowledge about the probability measure
$\rho_t$. To restrict the set of probability measures $\rho_t$ that need to be considered when bounding $\|\mathbb{E}(\rho_t) - v_\alpha(\rho_t)\|_2$, the authors of [10] compromise by assuming that the initial distribution $\rho_0$ satisfies the well-preparedness assumptions
(2.2) $\alpha e^{-2\alpha \underline{\mathcal{E}}} \big(\sigma^2 + 2\lambda\big) < \frac{3}{8} \quad \text{and} \quad 2\lambda \|\omega_\alpha\|^2_{L^1(\rho_0)} - \mathrm{Var}(\rho_0) - 2d\sigma^2 \|\omega_\alpha\|_{L^1(\rho_0)}\, e^{-\alpha \underline{\mathcal{E}}} \geq 0,$

where $\underline{\mathcal{E}} := \mathcal{E}(v^*)$ denotes the minimal value of the objective.
Since $\rho_t$ evolves from $\rho_0$ according to the Fokker--Planck equation (1.8), these conditions restrict $\rho_t$ and allow for bounding $\|\mathbb{E}(\rho_t) - v_\alpha(\rho_t)\|_2$ by a suitable multiple of $\mathrm{Var}(\rho_t)$. The exponential decay of $\mathrm{Var}(\rho_t)$ then follows from (2.1) after applying Grönwall's inequality; see [10, Theorem 4.1]. Furthermore, the conditions in (2.2) also allow for proving convergence of $\rho_t$ to a stationary Dirac delta at $\widetilde{v} \in \mathbb{R}^d$.
Given convergence to a Dirac delta at $\widetilde{v}$, in a second step it is shown that $\mathcal{E}(\widetilde{v}) \approx \mathcal{E}(v^*)$. In order to prove this approximation, one first deduces that for any $\varepsilon > 0$ there exists $\alpha \gg 1$ such that for all $t \geq 0$ it holds that $-\frac{1}{\alpha} \log\big(\|\omega_\alpha\|_{L^1(\rho_t)}\big) \leq -\frac{1}{\alpha} \log\big(\|\omega_\alpha\|_{L^1(\rho_0)}\big) + \frac{\varepsilon}{2}$. This involves deriving a lower bound for the evolution $\frac{d}{dt} \|\omega_\alpha\|_{L^1(\rho_t)}$ for sufficiently
large \alpha > 0 as done in [10, Lemma 4.1], which is then combined with the formerly
proven exponentially decaying variance; see [10, Proof of Theorem 4.2]. Then, by
recognizing that the Laplace principle (1.6) implies the existence of some \alpha \gg 1 with
(2.3) $-\frac{1}{\alpha} \log\big(\|\omega_\alpha\|_{L^1(\rho_0)}\big) - \underline{\mathcal{E}} < \frac{\varepsilon}{2},$

and by establishing the convergence $\|\omega_\alpha\|_{L^1(\rho_t)} \to \exp(-\alpha \mathcal{E}(\widetilde{v}))$ as $t \to \infty$, one obtains the desired result $\mathcal{E}(\widetilde{v}) - \underline{\mathcal{E}} < \varepsilon$ in the limit $t \to \infty$; see [10, Lemma 4.2]. The gap $\mathcal{E}(\widetilde{v}) - \underline{\mathcal{E}}$ can be tightened by increasing $\alpha$, but it is impossible to establish an explicit relation $\alpha = \alpha(\varepsilon)$ due to the use of the asymptotic Laplace principle.
This proof sketch unveils a tension in the role of the parameter $\alpha$. Namely, the second step requires a large $\alpha = \alpha(\varepsilon)$ to achieve $\mathcal{E}(\widetilde{v}) - \underline{\mathcal{E}} < \varepsilon$. In fact, $\alpha(\varepsilon)$ may grow uncontrollably as we decrease the accuracy $\varepsilon$. The first step, however, requires the conditions in (2.2) which, in the most optimistic case, where $\sigma = 0$, imply
(2.4) $\mathrm{Var}(\rho_0) \leq \frac{3}{8\alpha} \left(\int \exp\big(-\alpha(\mathcal{E}(v) - \underline{\mathcal{E}})\big)\, d\rho_0(v)\right)^2.$
Therefore, $\rho_0$ needs to be increasingly concentrated as $\alpha$ increases, and should ideally be supported on sets where $\mathcal{E}(v) \approx \underline{\mathcal{E}}$. Designing such a distribution $\rho_0$ in practice seems
impossible in the absence of a good initial guess for v \ast . In particular, we cannot expect
(2.4) to hold for generic choices such as a uniform distribution on a compact set.
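To see concretely how restrictive (2.4) is for such a generic initialization, one can evaluate its right-hand side numerically. The sketch below is our own illustration: it uses the one-dimensional Rastrigin function from Example 2.1 and $\rho_0 = \mathrm{Unif}([-3, 3])$, whose variance is of order one, while the right-hand side of (2.4) is already tiny for moderate $\alpha$.

```python
import numpy as np

# 1-d Rastrigin from Example 2.1; E attains its minimal value 0 at v* = 0
E = lambda v: v**2 + 2.5 * (1 - np.cos(2 * np.pi * v))

v = np.linspace(-3.0, 3.0, 200001)   # dense grid representing rho_0 = Unif([-3, 3])
var_rho0 = 3.0                        # (b - a)^2 / 12 for Unif([-3, 3])

rhs_values = []
for alpha in (1.0, 10.0, 100.0):
    # right-hand side of (2.4); the bracket is E_{rho_0}[exp(-alpha * (E(v) - 0))]
    integral = np.exp(-alpha * E(v)).mean()
    rhs_values.append(3.0 / (8.0 * alpha) * integral**2)

print(var_rho0, rhs_values)
```

Already at $\alpha = 1$ the bound is violated by orders of magnitude (under any of the common normalizations of the variance), and the right-hand side only shrinks further as $\alpha$ grows, illustrating why (2.4) cannot hold for such a generic $\rho_0$.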
We add that the works [31, 32] conduct a similarly flavored analysis for the
fully time-discretized microscopic system (1.1), with some differences in the details.
They first show an exponentially decaying variance under mild assumptions about
$\lambda$ and $\sigma$, but provided that the same Brownian motion is used for all agents, i.e., $(B^i_{k\Delta t})_{k=1,\dots,K} = (B_{k\Delta t})_{k=1,\dots,K}$ for all $i = 1, \dots, N$. Such a choice leads to a less explorative dynamics, but it simplifies the consensus formation analysis. For proving $\mathcal{E}(\widetilde{v}) \approx \underline{\mathcal{E}}$, however, the authors again require an initial configuration $\rho_0$ that satisfies a technical concentration condition like (2.3); see for instance [32, Remark 3.1].
2.2. Alternative approach: CBO minimizes the squared distance to $v^*$. The approach described in the previous section might suggest that CBO only
converges locally, which is in fact not what is observed in practice. Instead, global op-
timization is actually expected. To remedy the locality requirements of the variance-
based analysis, let us now sketch and motivate an alternative proof idea. By averaging
out the randomness associated with different realizations of Brownian motion paths,
the macroscopic time-continuous SDE (1.7), in the case H \equiv 1, becomes
(2.5) $\frac{d}{dt} \mathbb{E}\big[V_t \,\big|\, V_0\big] = -\lambda\, \mathbb{E}\big[\big(V_t - v_\alpha(\rho_t)\big) \,\big|\, V_0\big] = -\lambda\, \mathbb{E}\big[\big(V_t - v^*\big) \,\big|\, V_0\big] + \lambda \big(v_\alpha(\rho_t) - v^*\big).$
Furthermore, if $\mathcal{E}$ is locally Lipschitz continuous and satisfies the coercivity condition

(2.6) $\|v - v^*\|_2 \leq \frac{1}{\eta}\big(\mathcal{E}(v) - \mathcal{E}(v^*)\big)^{\nu} = \frac{1}{\eta}\big(\mathcal{E}(v) - \underline{\mathcal{E}}\big)^{\nu} \quad \text{for all } v \in \mathbb{R}^d$

for some $\eta > 0$ and $\nu \in (0, \infty)$, the second term on the right-hand side of (2.5)
can be made arbitrarily small for sufficiently large \alpha , i.e., v\alpha (\rho t ) \approx v \ast (more details
follow below). In this case, the average dynamics of Vt is well approximated by
(2.7) $\frac{d}{dt} \mathbb{E}\big[V_t \,\big|\, V_0\big] \approx -\lambda\, \mathbb{E}\big[\big(V_t - v^*\big) \,\big|\, V_0\big],$
which corresponds to the gradient flow of $v \mapsto \|v - v^*\|_2^2$ with rate $2\lambda$. In other words, each individual agent essentially performs a gradient descent of $v \mapsto \|v - v^*\|_2^2$ on
average over all realizations of Brownian motion paths. Figure 1(b) visualizes this
phenomenon for three isolated agents on the Rastrigin function in two dimensions.
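The averaged gradient-flow behavior (2.7) can also be probed numerically by running the scheme (1.1) many times and averaging the trajectory of one tagged agent over the runs. The sketch below is our own (all parameter values are illustrative choices, not the paper's): it uses a two-dimensional Rastrigin-type objective, with the remaining agents initialized around $v^* = 0$ so that $v_\alpha \approx v^*$.

```python
import numpy as np

def rastrigin(v):
    return (v**2).sum(axis=-1) + 2.5 * (1 - np.cos(2 * np.pi * v)).sum(axis=-1)

M, N, d = 300, 100, 2                     # independent runs, agents per run, dimension
dt, lam, sigma, alpha, K = 0.01, 1.0, 0.2, 1e15, 300
rng = np.random.default_rng(2)

V = rng.normal(0.0, 2.0, size=(M, N, d))  # rho_0 centered at the minimizer v* = 0
V[:, 0] = np.array([3.0, 2.0])            # tagged agent: same start in every run
mean_traj = [V[:, 0].mean(axis=0)]        # Monte Carlo estimate of E[V_t | V_0]

for _ in range(K):
    en = rastrigin(V)                                              # (M, N)
    w = np.exp(-alpha * (en - en.min(axis=1, keepdims=True)))      # per-run weights
    v_a = (w[..., None] * V).sum(axis=1) / w.sum(axis=1)[:, None]  # per-run consensus
    diff = V - v_a[:, None, :]
    B = rng.normal(0.0, np.sqrt(dt), size=V.shape)
    V = V - dt * lam * diff + sigma * np.linalg.norm(diff, axis=-1, keepdims=True) * B
    mean_traj.append(V[:, 0].mean(axis=0))

# distance of the run-averaged tagged agent to v* = 0: it shrinks steadily,
# consistent with the averaged relaxation (2.7), despite the rugged landscape
dist = np.linalg.norm(np.array(mean_traj), axis=1)
print(dist[0], dist[K // 2], dist[-1])
```

Averaging over runs plays the role of the conditional expectation in (2.7); the individual trajectories themselves remain noisy.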
Inspired by this observation, our proof strategy is to show that CBO methods
successively minimize the energy functional $\mathcal{V} : \mathcal{P}(\mathbb{R}^d) \to \mathbb{R}_{\geq 0}$, given by

(2.8) $\mathcal{V}(\rho_t) := \frac{1}{2} \int \|v - v^*\|_2^2\, d\rho_t(v).$
Note that this functional essentially coincides with the squared Wasserstein-2 distance in the sense that $W_2^2(\rho_t, \delta_{v^*}) = 2\mathcal{V}(\rho_t)$. Therefore $\mathcal{V}(\rho_t) \to 0$ in particular implies that $\rho_t$ converges weakly to $\delta_{v^*}$; see [2, Chapter 7].
This novel approach does not suffer from a tension on the parameter $\alpha$ like the variance-based analysis from the previous section. Roughly speaking (see Lemma 4.1 for details), $\mathcal{V}(\rho_t)$ follows an evolution similar to (2.1), with $\mathrm{Var}(\rho_t)$ being replaced by $\mathcal{V}(\rho_t)$. However, we can now bound $\int \|v - v_\alpha(\rho_t)\|_2^2\, d\rho_t(v) \leq 4\mathcal{V}(\rho_t) + 2\|v_\alpha(\rho_t) - v^*\|_2^2$,
Fig. 2. (a) The Rastrigin function as objective function $\mathcal{E}$ and the squared Euclidean distance from $v^*$. (b) The evolution of the variance $\mathrm{Var}(\widehat{\rho}^N_t)$ and the functional $\mathcal{V}(\widehat{\rho}^N_t)$ for different initial conditions $\rho_0 = \mathcal{N}(\mu, 0.8)$ with $\mu \in \{1, 2, 3, 4\}$. The measure $\widehat{\rho}^N_t$ is the empirical agent density that is evolved using (1.1) with $N = 320000$ agents, discrete time step size $\Delta t = 0.01$, and parameters $\alpha = 10^{15}$, $\lambda = 1$, and $\sigma = 0.5$. As we move the mean of the initial configuration $\rho_0$ away from the global optimizer $v^* = 0$, and thereby push $v^*$ into the tails of $\rho_0$, $\mathrm{Var}(\widehat{\rho}^N_t)$ increases in the starting phase of the dynamics. $\mathcal{V}(\widehat{\rho}^N_t)$, on the other hand, always decreases exponentially at a rate $(2\lambda - d\sigma^2)$, independently of the initial condition $\rho_0$.
task: the Laplace principle generally asserts $\|v_\alpha(\rho_t) - v^*\|_2 \to 0$ under (2.6) as $\alpha \to \infty$, and we can even establish (see Proposition 4.5 for details) the quantitative estimate

$\|v_\alpha(\varrho) - v^*\|_2 \leq \frac{(2Lr)^{\nu}}{\eta} + \frac{\exp(-\alpha L r)}{\varrho(B_r(v^*))} \int \|v - v^*\|_2\, d\varrho(v)$

for an arbitrary probability measure $\varrho$, assuming that $\mathcal{E}$ is $L$-Lipschitz in a ball of radius $r > 0$ around $v^*$. This allows us to estimate $\|v_\alpha(\rho_t) - v^*\|_2^2$ in terms of $\mathcal{V}(\rho_t)$ as desired.
Finally, we note that \scrV(\rho_t) majorizes Var(\rho_t) because u \mapsto \frac{1}{2}\int \|v - u\|_2^2 \, d\rho_t(v) is minimized by the expectation \BbbE(\rho_t). This relation may be a source of concern, as it shows that proving \scrV(\rho_t) \rightarrow 0 implies Var(\rho_t) \rightarrow 0. We emphasize, however, that this does not imply a majorization for the corresponding time derivatives. In fact, Example 2.1 suggests that \scrV(\rho_t) can decay exponentially while Var(\rho_t) increases initially.
Example 2.1. We consider the Rastrigin function \scrE(v) = v^2 + 2.5(1 - \cos(2\pi v)) with global minimum at v^\ast = 0 and various local minima; see Figure 2(a). For different initial configurations \rho_0 = \scrN(\mu, 0.8) with \mu \in \{1, 2, 3, 4\}, we evolve the discretized system (1.1) using N = 320000 agents, discrete time step size \Delta t = 0.01, and parameters \alpha = 10^{15} (i.e., the consensus point is the arg min of the agents), \lambda = 1, and \sigma = 0.5. By considering different means from \mu = 1 to \mu = 4, we push the global minimizer v^\ast into the tails of the initial configuration \rho_0. Figure 2(b) shows that the decreasing initial probability mass around v^\ast eventually causes the variance Var(\widehat\rho_t^N) (dashed lines) to increase in the beginning of the dynamics. In contrast, \scrV(\widehat\rho_t^N) always decays exponentially fast with convergence speed (2\lambda - d\sigma^2), independently of the initial condition \rho_0. From a theoretical perspective, this means that proving global convergence using a variance-based analysis as in section 2.1 must require assumptions about \rho_0 such as condition (2.4), whereas an analysis based on \scrV(\rho_t) does not suffer from this issue. The convergence speed (2\lambda - d\sigma^2) coincides with the result in Theorem 3.7.
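The experiment of Example 2.1 can be reproduced in miniature. The following sketch (our own illustration, not the authors' code; H \equiv 1, a much smaller N, and a shifted soft-max to handle \alpha = 10^{15}; we read \scrN(\mu, 0.8) as variance 0.8) evolves the Euler--Maruyama discretization (1.1) on the one-dimensional Rastrigin function and tracks \scrV(\widehat\rho_t^N):

```python
import numpy as np

def rastrigin(v):
    # Objective from Example 2.1, global minimizer v* = 0.
    return v**2 + 2.5 * (1.0 - np.cos(2.0 * np.pi * v))

def consensus_point(v, energies, alpha):
    # v_alpha = sum_i v_i exp(-alpha E(v_i)) / sum_i exp(-alpha E(v_i)),
    # computed with a shifted exponent for numerical stability; for
    # alpha = 1e15 this effectively selects the arg min of the agents.
    w = np.exp(-alpha * (energies - energies.min()))
    return np.sum(v * w) / np.sum(w)

rng = np.random.default_rng(1)
lam, sigma, dt, alpha = 1.0, 0.5, 0.01, 1e15  # parameters as in Example 2.1
N, steps = 2000, 300                           # smaller N than the paper's 320000
v = rng.normal(2.0, np.sqrt(0.8), size=N)      # rho_0 = N(2, 0.8)

V0 = 0.5 * np.mean(v**2)                       # V(rho_0), with v* = 0
for _ in range(steps):
    va = consensus_point(v, rastrigin(v), alpha)
    # Euler-Maruyama step of (1.1), isotropic diffusion, H = 1:
    v = v - lam * (v - va) * dt \
        + sigma * np.abs(v - va) * np.sqrt(dt) * rng.normal(size=N)

VT = 0.5 * np.mean(v**2)
# Theorem 3.7 predicts decay roughly like exp(-(2*lam - sigma**2) * t) in d = 1.
print(V0, VT)
```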
3. Global convergence of CBO. In the first part of this section we recall and extend well-posedness results about the nonlinear macroscopic SDE (1.7) and the associated Fokker--Planck equation (1.8). At the beginning of the second part we introduce the class of studied objective functions, which is followed by the presentation of the main result about the convergence of the dynamics (1.7) and (1.8) to the global minimizer in mean-field law. In the final part we then highlight the relevance of this result by presenting a holistic convergence proof of the numerical scheme (1.1) to the global minimizer, which combines the latter statement with a probabilistic quantitative result about the mean-field approximation.
3.1. Definition of weak solutions and well-posedness. We begin by rigorously defining weak solutions of the Fokker--Planck equation (1.8).

Definition 3.1. Let \rho_0 \in \scrP(\BbbR^d), T > 0. We say \rho \in \scrC([0, T], \scrP(\BbbR^d)) satisfies the Fokker--Planck equation (1.8) with initial condition \rho_0 in the weak sense in the time interval [0, T] if for all \phi \in \scrC_c^\infty(\BbbR^d) and all t \in (0, T) we have

(3.1)  \frac{d}{dt} \int \phi(v) \, d\rho_t(v) = -\lambda \int H(\scrE(v) - \scrE(v_\alpha(\rho_t))) \langle v - v_\alpha(\rho_t), \nabla\phi(v)\rangle \, d\rho_t(v) + \frac{\sigma^2}{2} \int \|v - v_\alpha(\rho_t)\|_2^2 \, \Delta\phi(v) \, d\rho_t(v)

and \lim_{t\rightarrow 0} \rho_t = \rho_0 pointwise.
If the cutoff function H in the dynamics (1.7) is inactive, i.e., satisfies H \equiv 1, the
authors of [10] prove the following well-posedness result.
Theorem 3.2 ([10, Theorems 3.1, 3.2]). Let T > 0, \rho_0 \in \scrP_4(\BbbR^d). Let H \equiv 1, and consider \scrE : \BbbR^d \rightarrow \BbbR with \underline{\scrE} > -\infty, which, for constants C_1, C_2 > 0, satisfies

(3.2)  |\scrE(v) - \scrE(w)| \leq C_1 (\|v\|_2 + \|w\|_2) \|v - w\|_2 for all v, w \in \BbbR^d,
(3.3)  \scrE(v) - \underline{\scrE} \leq C_2 (1 + \|v\|_2^2) for all v \in \BbbR^d.

If, in addition, either \sup_{v\in\BbbR^d} \scrE(v) < \infty, or \scrE satisfies for some constants C_3, C_4 > 0

(3.4)  \scrE(v) - \underline{\scrE} \geq C_3 \|v\|_2^2 for all \|v\|_2 \geq C_4,

then there exists a unique nonlinear process V \in \scrC([0, T], \BbbR^d) satisfying (1.7) in the strong sense. The associated law \rho = Law(V) has regularity \rho \in \scrC([0, T], \scrP_4(\BbbR^d)) and is a weak solution to the Fokker--Planck equation (1.8).
Remark 3.3. The regularity \rho \in \scrC([0, T], \scrP_4(\BbbR^d)) stated in Theorem 3.2, and also obtained in Theorem 3.4 below, is a consequence of the regularity of the initial condition \rho_0 \in \scrP_4(\BbbR^d). Although it is not indicated explicitly in [10, Theorems 3.1, 3.2], it follows from their proofs. In particular, it allows for extending the test function space \scrC_c^\infty(\BbbR^d) in Definition 3.1. Namely, if \rho \in \scrC([0, T], \scrP_4(\BbbR^d)) solves (1.8) in the weak sense, identity (3.1) holds for all \phi \in \scrC^2(\BbbR^d) with (i) \sup_{v\in\BbbR^d} |\Delta\phi(v)| < \infty, and (ii) \|\nabla\phi(v)\|_2 \leq C(1 + \|v\|_2) for some C > 0 and for all v \in \BbbR^d. We denote the corresponding function space by \scrC_\ast^2(\BbbR^d).
Under minor modifications of the proof of Theorem 3.2, we can extend the existence of solutions to an active Lipschitz continuous cutoff function H.
Theorem 3.4. Let H \not\equiv 1 be L_H-Lipschitz continuous. Then, under the assumptions of Theorem 3.2, there exists a nonlinear process V \in \scrC([0, T], \BbbR^d) satisfying (1.7) in the strong sense. The associated law \rho = Law(V) has regularity \rho \in \scrC([0, T], \scrP_4(\BbbR^d)) and is a weak solution to the Fokker--Planck equation (1.8).
3.2. Global convergence in mean-field law. We now present the main result
about global convergence in mean-field law for objectives satisfying the following.
Furthermore, for the case H \not\equiv 1, we additionally require that \scrE fulfills a local Lipschitz continuity-like condition, i.e.,

A3  there exist L_\scrE > 0 and \gamma \geq 0 such that

(3.7)  \scrE(v) - \underline{\scrE} \leq L_\scrE (1 + \|v - v^\ast\|_2^\gamma) \|v - v^\ast\|_2 for all v \in \BbbR^d.
Remark 3.6. The analyses in [10] and related works require \scrE \in \scrC^2(\BbbR^d) and an additional boundedness assumption on the Laplacian \Delta\scrE. We relax these regularity requirements and use the conditions in Definition 3.5 on \scrE instead.

Assumption A1 just states that the continuous objective \scrE attains its infimum \underline{\scrE} at some v^\ast \in \BbbR^d. The continuity itself can be further relaxed at the cost of additional technical details because it is only required in a small neighborhood of v^\ast.

Assumption A2 should be interpreted as a tractability condition on the landscape of \scrE around v^\ast and in the farfield. The first part, equation (3.5), describes the local coercivity of \scrE, which implies that there is a unique minimizer v^\ast on B_{R_0}(v^\ast) and that \scrE grows like v \mapsto \|v - v^\ast\|_2^{1/\nu}. This condition is also known as the inverse continuity condition from [26], as a quadratic growth condition in the case \nu = 1/2 from [3, 50], or as the H\"olderian error bound condition in the case \nu \in (0, 1] [6]. In [50, Theorem 4] and [41, Theorem 2] many equivalent or stronger conditions are identified that imply (3.5) globally on \BbbR^d. Furthermore, in [26, 67], (3.5) is shown to hold globally for objectives related to various machine learning problems. The second part of A2, equation (3.6), describes the behavior of \scrE in the farfield and prevents \scrE(v) \approx \underline{\scrE} for some v \in \BbbR^d far away from v^\ast. We introduce it for the purpose of covering functions that tend to a constant just above \scrE_\infty as \|v\|_2 \rightarrow \infty, because such functions do not satisfy the growth condition (3.5) globally. However, whenever (3.5) holds globally, we take R_0 = \infty, i.e., B_{R_0}(v^\ast) = \BbbR^d, and (3.6) is void. We also note that (3.5) and (3.6) imply the uniqueness of the global minimizer v^\ast on \BbbR^d.
Finally, to cover the active cutoff case H \not\equiv 1, we additionally require A3. This condition is weaker than local Lipschitz continuity on any compact ball around v^\ast with a Lipschitz constant growing with the size of the ball.
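For intuition, consider the quadratic objective \scrE(v) = \|v\|_2^2 with \underline{\scrE} = 0 and v^\ast = 0: the coercivity condition (3.5) then holds globally and A3 holds as well. A quick numerical sanity check of both inequalities, with the (hypothetical, illustrative) constants \eta = 1, \nu = 1/2, L_\scrE = 1, and \gamma = 1:

```python
import numpy as np

rng = np.random.default_rng(2)
eta, nu = 1.0, 0.5        # assumed constants for E(v) = ||v||^2, v* = 0
L_E, gamma = 1.0, 1.0     # assumed constants for A3

v = rng.normal(size=(10_000, 3))    # random test points in d = 3
E = np.sum(v**2, axis=1)            # E(v) - E_underline = ||v||^2
dist = np.linalg.norm(v, axis=1)    # ||v - v*||_2

# (3.5), in the coercivity form used in section 4:
# (eta * ||v - v*||_2)^{1/nu} <= E(v) - E_underline (here with equality).
assert np.all((eta * dist) ** (1.0 / nu) <= E + 1e-12)

# (3.7): E(v) - E_underline <= L_E * (1 + ||v - v*||^gamma) * ||v - v*||.
assert np.all(E <= L_E * (1.0 + dist**gamma) * dist + 1e-12)
```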
We are now ready to state the main result. The proof is deferred to section 4.
Theorem 3.7. Let \scrE \in \scrC(\BbbR^d) satisfy A1--A2. Moreover, let \rho_0 \in \scrP_4(\BbbR^d) be such that v^\ast \in supp(\rho_0). Define \scrV(\rho_t) as given in (2.8). Fix any \varepsilon \in (0, \scrV(\rho_0)) and \vartheta \in (0, 1), choose parameters \lambda, \sigma > 0 with 2\lambda > d\sigma^2, and define the time horizon

(3.8)  T^\ast := \frac{1}{(1 - \vartheta)(2\lambda - d\sigma^2)} \log\left(\frac{\scrV(\rho_0)}{\varepsilon}\right).

Then there exists \alpha_0 > 0, depending (among problem-dependent quantities) on \varepsilon and \vartheta, such that for all \alpha > \alpha_0, if \rho \in \scrC([0, T^\ast], \scrP_4(\BbbR^d)) is a weak solution to the Fokker--Planck equation (1.8) on the time interval [0, T^\ast] with initial condition \rho_0, we have

(3.9)  \scrV(\rho_T) = \varepsilon with T \in \left[\frac{1 - \vartheta}{1 + \vartheta/2} T^\ast, T^\ast\right].

Furthermore, on the time interval [0, T], \scrV(\rho_t) decays at least exponentially fast. More precisely, for all t \in [0, T], it holds that

(3.10)  W_2^2(\rho_t, \delta_{v^\ast}) = 2\scrV(\rho_t) \leq 2\scrV(\rho_0) \exp\big(-(1 - \vartheta)(2\lambda - d\sigma^2) t\big).

If \scrE additionally satisfies A3, the same conclusion holds for any H : \BbbR \rightarrow [0, 1] that satisfies H(x) = 1 whenever x \geq 0.
The assumption v^\ast \in supp(\rho_0) about the initial configuration \rho_0 is not really a restriction, as it would in any case hold immediately for \rho_t for any t > 0 in view of the diffusive character of the dynamics (1.8); see Remark 4.7. Additionally, as we clarify in the next section, this condition neither means nor requires that, for finite particle approximations, some particle needs to be in the vicinity of the minimizer v^\ast at time t = 0. It is actually sufficient that the empirical measure \widehat\rho_t^N weakly approximates the law \rho_t uniformly in time. We rigorously explain this mechanism in section 3.3.
A lower bound on the rate of convergence in (3.10) is (1 - \vartheta)(2\lambda - d\sigma^2), which can be made arbitrarily close to the numerically observed rate (2\lambda - d\sigma^2) (see, e.g., Figure 2(b)) at the cost of taking \alpha \rightarrow \infty to allow for \vartheta \rightarrow 0. The condition 2\lambda > d\sigma^2 is necessary, both in theory and in practice, to avoid the dynamics being overwhelmed by the random exploration term. The dependency on d can be eased by replacing the isotropic Brownian motion in the dynamics with an anisotropic one [11, 28].
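To make (3.8) and (3.10) concrete, one can plug in sample values (hypothetical, chosen only for illustration): with \lambda = 1, \sigma = 0.5, d = 1, \vartheta = 0.1, \scrV(\rho_0) = 2.4, and target accuracy \varepsilon = 0.01, the decay rate and time horizon are:

```python
import math

lam, sigma, d = 1.0, 0.5, 1           # hypothetical parameter choices
vartheta, V0, eps = 0.1, 2.4, 0.01    # relaxation vartheta and accuracy eps

rate = (1.0 - vartheta) * (2.0 * lam - d * sigma**2)  # decay rate in (3.10)
T_star = math.log(V0 / eps) / rate                     # time horizon (3.8)

# Consistency of (3.8) with (3.10): at t = T*, the bound exactly reaches eps.
assert abs(V0 * math.exp(-rate * T_star) - eps) < 1e-12

print(f"rate = {rate:.4f}, T* = {T_star:.4f}")
```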
3.3. Global convergence in probability. To stress the relevance of the main result of this paper, Theorem 3.7, we now show how estimate (3.10) plays a fundamental role in establishing a quantitative convergence result for the numerical scheme (1.1) to the global minimizer v^\ast. By paying the price of having a probabilistic statement about the convergence of CBO as in Theorem 3.8, we gain provable polynomial complexity. For simplicity, we present the results of this section for the case of an inactive cutoff function, i.e., H \equiv 1.
Theorem 3.8. Fix \varepsilon_{total} > 0 and \delta \in (0, 1/2). Then, under the assumptions of Theorem 3.7 and Proposition 3.11, and with K := T/\Delta t, where T is as in (3.9), the iterations ((V_{k\Delta t}^i)_{k=0,...,K})_{i=1,...,N} generated by the numerical scheme (1.1) converge in probability to v^\ast. More precisely, the empirical mean of the final iterations fulfills

(3.11)  \left\| \frac{1}{N} \sum_{i=1}^N V_{K\Delta t}^i - v^\ast \right\|_2^2 \leq \varepsilon_{total}

after

K \geq \frac{1}{(1 - \vartheta)(2\lambda - d\sigma^2)\,\Delta t} \log\left(\frac{36\,\scrV(\rho_0)}{\delta\,\varepsilon_{total}}\right)

iterations. Here, the parameter dependence of C_{NA} and C_{MFA} is as described in Theorem 3.8. The computational complexity (counted in terms of the number of evaluations of the objective \scrE) of the CBO method is therefore given by \scrO(KN).
When working in the setting of large-scale applications arising, for instance, in machine learning and signal processing (and therefore with \scrE being expensive to compute), several considerations allow one to reduce the overall runtime of the algorithm (1.1) and thereby make the method feasible and more competitive. First, one may leverage the fact that the evaluations of the objective function \scrE for the N particles can be performed in parallel. Furthermore, random minibatch sampling ideas as proposed in [11, 28] may be employed when evaluating the objective function and/or computing the consensus point. That is, at each time step, \scrE is evaluated only on a random subset of the available data, and v_\alpha is computed only from a subset of the N particles. Besides immediately reducing the computational and communication complexity of CBO methods, such ideas motivate communication-efficient parallelization of the algorithm by evolving disjoint subsets of particles independently for some time with separate consensus points, before aligning the dynamics through a global communication step. This, however, is so far largely unexplored, from both a theoretical and a practical point of view. Lastly, taking inspiration from genetic algorithms, a variance-based particle reduction technique as suggested in [26] may be used to reduce the number of optimizing agents (and therefore the required evaluations of \scrE) during the algorithm in case concentration of the particles is observed.
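The minibatch idea for the consensus-point computation can be sketched as follows (our own illustration of the idea from [11, 28], not the authors' implementation; all names and parameter values are hypothetical):

```python
import numpy as np

def consensus_point(particles, objective, alpha, batch_size=None, rng=None):
    """Stable computation of v_alpha from (a random subset of) the particles.

    With batch_size set, the objective is evaluated only on a random
    minibatch of the N particles, reducing the cost per step from N to
    batch_size objective evaluations.
    """
    rng = rng or np.random.default_rng()
    if batch_size is not None:
        idx = rng.choice(len(particles), size=batch_size, replace=False)
        particles = particles[idx]
    energies = objective(particles)
    # Shifted exponent: exp(-alpha*(E - min E)) avoids underflow of all
    # weights at once for large alpha.
    w = np.exp(-alpha * (energies - energies.min()))
    return np.sum(particles * w[:, None], axis=0) / np.sum(w)

# Usage: quadratic toy objective, minibatch of 64 out of 1000 particles.
rng = np.random.default_rng(3)
particles = rng.normal(size=(1000, 2))
v_alpha = consensus_point(particles, lambda v: np.sum(v**2, axis=1),
                          alpha=50.0, batch_size=64, rng=rng)
```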
The proof of Theorem 3.8, which we report below, combines our main result
about the convergence in mean-field law, a quantitative mean-field approximation, and
classical results of numerical approximation of SDEs. To this end, we establish in what
follows the result about the quantitative mean-field approximation on a restricted set
of bounded processes. For this purpose, let us introduce the common probability space (\Omega, \scrF, \BbbP) over which all considered stochastic processes get their realizations, and define a subset \Omega_M of \Omega of suitably bounded processes according to

\Omega_M := \left\{ \omega \in \Omega : \sup_{t\in[0,T]} \frac{1}{N} \sum_{i=1}^N \max\left\{ \|V_t^i(\omega)\|_2^4, \|\overline{V}_t^i(\omega)\|_2^4 \right\} \leq M \right\}.

Throughout this section, M > 0 denotes a constant which we shall adjust at the end of the proof of Theorem 3.8. Before stating the mean-field approximation result, Proposition 3.11, let us estimate the measure of the set \Omega_M in Lemma 3.10. The proofs of both statements are deferred to section 5.
Lemma 3.10. Let T > 0, \rho_0 \in \scrP_4(\BbbR^d), and let N \in \BbbN be fixed. Moreover, let ((V_t^i)_{t\geq 0})_{i=1,...,N} denote the strong solution to system (1.3), and let ((\overline{V}_t^i)_{t\geq 0})_{i=1,...,N} be N independent copies of the strong solution to the mean-field dynamics (1.7). Then, under the assumptions of Theorem 3.2, for any M > 0 we have

(3.12)  \BbbP(\Omega_M) = \BbbP\left( \sup_{t\in[0,T]} \frac{1}{N} \sum_{i=1}^N \max\left\{ \|V_t^i\|_2^4, \|\overline{V}_t^i\|_2^4 \right\} \leq M \right) \geq 1 - \frac{2K}{M},
Lemma 3.10 proves that the processes are bounded with high probability uniformly in time. Therefore, by restricting the analysis to \Omega_M, we can obtain the following quantitative mean-field approximation result by proving pointwise propagation of chaos through the coupling method [14, 15] using a synchronous coupling between the stochastic processes V^i and \overline{V}^i; see, e.g., [14, section 4.1.2].
Proposition 3.11. Let T > 0, \rho_0 \in \scrP_4(\BbbR^d), and let N \in \BbbN be fixed. Moreover, let ((V_t^i)_{t\geq 0})_{i=1,...,N} denote the strong solution to system (1.3), and let ((\overline{V}_t^i)_{t\geq 0})_{i=1,...,N} be N independent copies of the strong solution to the mean-field dynamics (1.7). Further, consider valid the assumptions of Theorem 3.2. If (V_t^i)_{t\geq 0} and (\overline{V}_t^i)_{t\geq 0} share the initial data as well as the Brownian motion paths (B_t^i)_{t\geq 0} for all i = 1, ..., N, then we have

(3.13)  \max_{i=1,...,N} \sup_{t\in[0,T]} \BbbE\left[ \|V_t^i - \overline{V}_t^i\|_2^2 \,\big|\, \Omega_M \right] \leq C_{MFA} N^{-1}

with C_{MFA} = C_{MFA}(\alpha, \lambda, \sigma, T, C_1, C_2, M, K, \scrM_2, b_1, b_2), where K is as in Lemma 3.10 and \scrM_2 denotes a second-order moment bound of \rho.
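The synchronous coupling used in the proof of Proposition 3.11 can be mimicked numerically: evolve the interacting N-particle system and N "mean-field" proxies (here with v_\alpha(\rho_t) approximated by a large independent reference population, a stand-in for the true law) from the same initial data and with the same Brownian increments, and compare against an uncoupled run driven by fresh noise. A rough sketch under these stated simplifications (our own illustration, one-dimensional, \scrE(v) = v^2, H \equiv 1):

```python
import numpy as np

def v_alpha(v, alpha=10.0):
    # Consensus point for E(v) = v^2 in 1D, with a stabilized soft-max.
    e = v**2
    w = np.exp(-alpha * (e - e.min()))
    return np.sum(v * w) / np.sum(w)

rng = np.random.default_rng(4)
lam, sigma, dt, steps, N = 1.0, 0.3, 0.01, 100, 200

v0 = rng.normal(1.0, 1.0, size=N)
V = v0.copy()        # interacting system: consensus from its own N particles
Vbar_c = v0.copy()   # mean-field proxies, synchronously coupled (shared dB)
Vbar_u = v0.copy()   # mean-field proxies with independent noise (uncoupled)
ref = rng.normal(1.0, 1.0, size=20_000)  # large reference cloud ~ rho_t

for _ in range(steps):
    dB = rng.normal(size=N) * np.sqrt(dt)    # shared Brownian increments
    dB_u = rng.normal(size=N) * np.sqrt(dt)  # fresh increments, uncoupled run
    va_N, va_ref = v_alpha(V), v_alpha(ref)
    V += -lam * (V - va_N) * dt + sigma * np.abs(V - va_N) * dB
    Vbar_c += -lam * (Vbar_c - va_ref) * dt + sigma * np.abs(Vbar_c - va_ref) * dB
    Vbar_u += -lam * (Vbar_u - va_ref) * dt + sigma * np.abs(Vbar_u - va_ref) * dB_u
    ref += -lam * (ref - va_ref) * dt \
        + sigma * np.abs(ref - va_ref) * rng.normal(size=ref.size) * np.sqrt(dt)

coupled = np.mean((V - Vbar_c) ** 2)      # small: only consensus mismatch
uncoupled = np.mean((V - Vbar_u) ** 2)    # larger: independent noise accumulates
print(coupled, uncoupled)
```

Sharing the Brownian paths is exactly what makes the difference process estimable: with fresh noise the mean-square distance is dominated by the diffusion, while under the coupling it is driven only by the O(N^{-1/2}) fluctuation of the consensus points.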
A quantitative mean-field approximation was left as an open problem in [10, Remark 3.2] due to a lack of global Lipschitz continuity of the SDE coefficients and has since been approached in several steps; see Remark 1.2. While the restriction to bounded processes, which reflects the typical behavior in view of Lemma 3.10, already allows one to obtain an estimate of the type (3.13), which is sufficient to prove convergence in probability in what follows, the recent work [29] improves (3.13) by first showing a nonprobabilistic mean-field approximation, i.e., removing the necessity of conditioning on the set \Omega_M as done in (3.13), and second by obtaining a pathwise estimate; see [29, Theorem 2.6]. Hence, in light of [29], the constant M can be regarded as merely an auxiliary technical tool.
Equipped with Lemma 3.10 and Proposition 3.11, we are now able to prove The-
orem 3.8.
Proof of Theorem 3.8. We have the error decomposition

(3.14)  \BbbE\left[ \left\| \frac{1}{N} \sum_{i=1}^N V_{K\Delta t}^i - v^\ast \right\|_2^2 \,\Big|\, \Omega_M \right] \leq 3\,\BbbE\left[ \left\| \frac{1}{N} \sum_{i=1}^N \big( V_{K\Delta t}^i - V_T^i \big) \right\|_2^2 \,\Big|\, \Omega_M \right] + 3\,\BbbE\left[ \left\| \frac{1}{N} \sum_{i=1}^N \big( V_T^i - \overline{V}_T^i \big) \right\|_2^2 \,\Big|\, \Omega_M \right] + \frac{3}{1 - \delta}\,\BbbE\left\| \frac{1}{N} \sum_{i=1}^N \overline{V}_T^i - v^\ast \right\|_2^2,

dividing the overall error into an approximation error of the numerical scheme, the mean-field approximation error, and the optimization error in the mean-field limit. The first term is bounded by C_{NA}(\Delta t)^{2m} by applying classical convergence results for numerical schemes for SDEs [53]; the second is bounded by C_{MFA}N^{-1} using the quantitative mean-field approximation in the form of Proposition 3.11; and the third satisfies \BbbE\|\frac{1}{N}\sum_{i=1}^N \overline{V}_T^i - v^\ast\|_2^2 \leq \BbbE\|\overline{V}_T^1 - v^\ast\|_2^2 \leq 2\scrV(\rho_T) \leq 2\varepsilon by means of Theorem 3.7.
Denoting now by K_{\varepsilon_{total}}^N \subset \Omega the set where (3.11) does not hold, we can estimate

\BbbP\big(K_{\varepsilon_{total}}^N\big) = \BbbP\big(K_{\varepsilon_{total}}^N \cap \Omega_M\big) + \BbbP\big(K_{\varepsilon_{total}}^N \cap \Omega_M^c\big) \leq \BbbP\big(K_{\varepsilon_{total}}^N \,\big|\, \Omega_M\big)\,\BbbP(\Omega_M) + \BbbP(\Omega_M^c) \leq \varepsilon_{total}^{-1}\big( 6C_{NA}(\Delta t)^{2m} + 3C_{MFA}N^{-1} + 12\varepsilon \big) + \delta,

where in the last step we employ Markov's inequality together with (3.14) to bound the first term. For the second it suffices to choose the M from (3.12) large enough.
As a consequence of Theorem 3.7, the hardness of any optimization problem is
necessarily encoded in the mean-field approximation. Proposition 3.11 addresses pre-
cisely this question, ensuring that with arbitrarily high probability, the finite particle
dynamics (1.3) stays close to the mean-field dynamics (1.7). Since the rate of this convergence is of order N^{-1/2} in the number of particles N, the hardness of the problem
is fully captured by the constant C\mathrm{M}\mathrm{F}\mathrm{A} in (3.13), which does not depend explicitly on
the dimension d. Therefore, the mean-field approximation is, in general, not affected
by the curse of dimensionality. Nevertheless, as our assumptions on the objective
function \scrE do not exclude the class of NP-hard problems, it cannot be expected that
CBO solves any problem, howsoever hard, with polynomial complexity. This is re-
flected by the exponential dependence of C\mathrm{M}\mathrm{F}\mathrm{A} on the parameter \alpha and its possibly
worst-case linear dependence on the dimension d, as we discuss in what follows. How-
ever, several numerical experiments [11, 26, 27, 28] in high dimensions confirm that in
typical applications CBO performs comparably to state-of-the-art methods without
the necessity of an exponentially large number of particles. As mentioned before,
characterizing \alpha 0 in more detail is crucial in view of the mean-field approximation re-
sult, Proposition 3.11. We did not precisely specify \alpha 0 in Theorem 3.7 since it seems
challenging to provide informative bounds in all generality. In Remark 4.8, however,
we devise an informal derivation in the case H \equiv 1 for objectives \scrE that are locally L-Lipschitz continuous on some ball B_R(v^\ast) and satisfy the coercivity condition (3.5) globally for \nu = 1/2. For a parameter-dependent constant c = c(\vartheta, \lambda, \sigma), we obtain

(3.15)  \alpha > \alpha_0 = \frac{-8}{c^2\eta^2\varepsilon} \log\left( \frac{c\eta\sqrt{\varepsilon}}{2\sqrt{2}} \, \rho_0\big(B_{\min\{R,\, c^2\eta^2\varepsilon/(8L)\}}(v^\ast)\big) \right),

provided that the probability mass t \mapsto \rho_t\big(B_{c^2\eta^2\varepsilon/(8L)}(v^\ast)\big) is minimized at time t = 0.
Lemma 4.1. Let \scrE : \BbbR d \rightarrow \BbbR , H \equiv 1, and fix \alpha , \lambda , \sigma > 0. Moreover, let T > 0 and
let \rho \in \scrC ([0, T ], \scrP 4 (\BbbR d )) be a weak solution to the Fokker--Planck equation (1.8). Then
the functional \scrV (\rho t ) satisfies
Proof. We note that the function \phi(v) = \frac{1}{2}\|v - v^\ast\|_2^2 is in \scrC_\ast^2(\BbbR^d) and recall that \rho satisfies the weak solution identity (3.1) for all test functions in \scrC_\ast^2(\BbbR^d); see Remark 3.3. By applying (3.1) with \phi as above, we obtain for the evolution of \scrV(\rho_t)

\frac{d}{dt}\scrV(\rho_t) = \underbrace{-\lambda \int \langle v - v^\ast, v - v_\alpha(\rho_t)\rangle \, d\rho_t(v)}_{=:T_1} + \underbrace{\frac{d\sigma^2}{2} \int \|v - v_\alpha(\rho_t)\|_2^2 \, d\rho_t(v)}_{=:T_2},

where we used \nabla\phi(v) = v - v^\ast and \Delta\phi(v) = d. Expanding the right-hand side of the scalar product in the integrand of T_1 by subtracting and adding v^\ast yields

T_1 = -\lambda \int \langle v - v^\ast, v - v^\ast\rangle \, d\rho_t(v) + \lambda \left\langle \int (v - v^\ast) \, d\rho_t(v), \, v_\alpha(\rho_t) - v^\ast \right\rangle \leq -2\lambda\scrV(\rho_t) + \lambda \|\BbbE(\rho_t) - v^\ast\|_2 \|v_\alpha(\rho_t) - v^\ast\|_2

with the Cauchy--Schwarz inequality being used in the last step. Similarly, again by subtracting and adding v^\ast, for the term T_2 we have with the Cauchy--Schwarz inequality

(4.2)  T_2 \leq d\sigma^2 \left( \scrV(\rho_t) + \left( \int \|v - v^\ast\|_2 \, d\rho_t(v) \right) \|v_\alpha(\rho_t) - v^\ast\|_2 + \frac{1}{2}\|v_\alpha(\rho_t) - v^\ast\|_2^2 \right).

The result now follows by noting that \|\BbbE(\rho_t) - v^\ast\|_2 \leq \int \|v - v^\ast\|_2 \, d\rho_t(v) \leq \sqrt{2\scrV(\rho_t)}.
Proof. Let us write H^\ast(v) := H(\scrE(v) - \scrE(v_\alpha(\rho_t))). Taking \phi(v) = \frac{1}{2}\|v - v^\ast\|_2^2 as test function in (3.1) as in the proof of Lemma 4.1 yields for the evolution of \scrV(\rho_t)

(4.4)  \frac{d}{dt}\scrV(\rho_t) = \underbrace{-\lambda \int H^\ast(v) \langle v - v^\ast, v - v_\alpha(\rho_t)\rangle \, d\rho_t(v)}_{=:\widetilde{T}_1} + \frac{d\sigma^2}{2} \int \|v - v_\alpha(\rho_t)\|_2^2 \, d\rho_t(v).

For the second term on the right-hand side, we proceed as in (4.2). The term \widetilde{T}_1, on the other hand, can be rewritten as

(4.5)  \widetilde{T}_1 = -2\lambda\scrV(\rho_t) - \lambda \int H^\ast(v) \langle v - v^\ast, v^\ast - v_\alpha(\rho_t)\rangle \, d\rho_t(v) + \lambda \int (1 - H^\ast(v)) \|v - v^\ast\|_2^2 \, d\rho_t(v).

Let us now bound the latter two terms individually. For the second term in (4.5), noting that 0 \leq H^\ast \leq 1, the Cauchy--Schwarz inequality and Jensen's inequality give

-\lambda \int H^\ast(v) \langle v - v^\ast, v^\ast - v_\alpha(\rho_t)\rangle \, d\rho_t(v) \leq \lambda \sqrt{2\scrV(\rho_t)} \, \|v_\alpha(\rho_t) - v^\ast\|_2.

For the third term in (4.5), let us first note that (1 - H^\ast(v)) \not= 0 implies H^\ast(v) \not= 1 and thus \scrE(v) < \scrE(v_\alpha(\rho_t)). Furthermore, \scrE(v_\alpha(\rho_t)) \leq \scrE_\infty implies v \in B_{R_0}(v^\ast) by the second part of A2. By the first part of A2 and 0 \leq 1 - H^\ast \leq 1, we therefore have

\lambda \int (1 - H^\ast(v)) \|v - v^\ast\|_2^2 \, d\rho_t(v) \leq \frac{\lambda}{\eta^2} \int (1 - H^\ast(v)) \, \scrE(v)^{2\nu} \, d\rho_t(v) \leq \frac{\lambda}{\eta^2} \scrE(v_\alpha(\rho_t))^{2\nu} \leq \frac{\lambda}{\eta^2} \big( L_\scrE (1 + \|v_\alpha(\rho_t) - v^\ast\|_2^\gamma) \|v_\alpha(\rho_t) - v^\ast\|_2 \big)^{2\nu},

where the last step used A3. Employing the last two inequalities in (4.5) and inserting the result together with (4.2) into (4.4) gives the result.
Lemma 4.4. Under the assumptions of Lemma 4.3, the functional \scrV(\rho_t) satisfies

(4.6)  \frac{d}{dt}\scrV(\rho_t) \geq -\big(2\lambda - d\sigma^2\big)\scrV(\rho_t) - \sqrt{2}\big(\lambda + d\sigma^2\big)\sqrt{\scrV(\rho_t)} \, \|v_\alpha(\rho_t) - v^\ast\|_2.

Proof. Analogously to the proof of Lemma 4.2, the statement follows by retracing the proof of Lemma 4.3 and noticing that for \widetilde{T}_1 it holds that -\int H^\ast(v)\langle v - v^\ast, v^\ast - v_\alpha(\rho_t)\rangle \, d\rho_t(v) \geq -\|\BbbE(\rho_t) - v^\ast\|_2 \|v_\alpha(\rho_t) - v^\ast\|_2 as well as \int (1 - H^\ast(v)) \|v - v^\ast\|_2^2 \, d\rho_t(v) \geq 0 as a consequence of H^\ast \leq 1.
Proof. For any a > 0 it holds that \|\omega_\alpha\|_{L^1(\varrho)} \geq a\,\varrho(\{v : \exp(-\alpha\scrE(v)) \geq a\}) due to Markov's inequality. By choosing a = \exp(-\alpha\scrE_r) and noting that

\varrho\big(\big\{v \in \BbbR^d : \exp(-\alpha\scrE(v)) \geq \exp(-\alpha\scrE_r)\big\}\big) = \varrho\big(\big\{v \in \BbbR^d : \scrE(v) \leq \scrE_r\big\}\big) \geq \varrho(B_r(v^\ast)),

we get \|\omega_\alpha\|_{L^1(\varrho)} \geq \exp(-\alpha\scrE_r)\varrho(B_r(v^\ast)). Now let \widetilde{r} \geq r > 0. Using the definition of the consensus point v_\alpha(\varrho) = \int v\,\omega_\alpha(v)/\|\omega_\alpha\|_{L^1(\varrho)} \, d\varrho(v), we can decompose

\|v_\alpha(\varrho) - v^\ast\|_2 \leq \int_{B_{\widetilde{r}}(v^\ast)} \frac{\omega_\alpha(v)}{\|\omega_\alpha\|_{L^1(\varrho)}} \|v - v^\ast\|_2 \, d\varrho(v) + \int_{(B_{\widetilde{r}}(v^\ast))^c} \frac{\omega_\alpha(v)}{\|\omega_\alpha\|_{L^1(\varrho)}} \|v - v^\ast\|_2 \, d\varrho(v).

The first term is bounded by \widetilde{r} since \|v - v^\ast\|_2 \leq \widetilde{r} for all v \in B_{\widetilde{r}}(v^\ast). For the second term we use \|\omega_\alpha\|_{L^1(\varrho)} \geq \exp(-\alpha\scrE_r)\varrho(B_r(v^\ast)) from above to get

(4.7)  \int_{(B_{\widetilde{r}}(v^\ast))^c} \frac{\omega_\alpha(v)}{\|\omega_\alpha\|_{L^1(\varrho)}} \|v - v^\ast\|_2 \, d\varrho(v) \leq \frac{1}{\exp(-\alpha\scrE_r)\varrho(B_r(v^\ast))} \int_{(B_{\widetilde{r}}(v^\ast))^c} \|v - v^\ast\|_2 \, \omega_\alpha(v) \, d\varrho(v) \leq \frac{\exp\big(-\alpha\big(\inf_{v\in(B_{\widetilde{r}}(v^\ast))^c} \scrE(v) - \scrE_r\big)\big)}{\varrho(B_r(v^\ast))} \int \|v - v^\ast\|_2 \, d\varrho(v).

Let us now choose \widetilde{r} = (q + \scrE_r)^\nu/\eta. This choice satisfies \widetilde{r} \leq \scrE_\infty^\nu/\eta by the assumption q + \scrE_r \leq \scrE_\infty, and furthermore \widetilde{r} \geq r, since A2 with \underline{\scrE} = 0 and r \leq R_0 implies

\widetilde{r} = \frac{(q + \scrE_r)^\nu}{\eta} \geq \frac{\scrE_r^\nu}{\eta} = \frac{\big(\sup_{v\in B_r(v^\ast)} \scrE(v)\big)^\nu}{\eta} \geq \sup_{v\in B_r(v^\ast)} \|v - v^\ast\|_2 = r.

Thus, using again A2 with \underline{\scrE} = 0, we have \inf_{v\in(B_{\widetilde{r}}(v^\ast))^c} \scrE(v) - \scrE_r \geq \min\big\{\scrE_\infty, (\eta\widetilde{r})^{1/\nu}\big\} - \scrE_r = (\eta\widetilde{r})^{1/\nu} - \scrE_r = q. Inserting this and the definition of \widetilde{r} into (4.7), we obtain the result.
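The mechanism behind this proof, namely that for large \alpha the weight \omega_\alpha = \exp(-\alpha\scrE) concentrates \varrho near v^\ast, is easy to observe numerically. A small Monte Carlo sketch (our own illustration; \scrE(v) = v^2 with v^\ast = 0, and \varrho a unit Gaussian centered at 1):

```python
import numpy as np

rng = np.random.default_rng(5)
samples = rng.normal(1.0, 1.0, size=100_000)  # varrho, some mass around v* = 0
energy = samples**2                            # E(v) = v^2, minimized at v* = 0

def consensus(alpha):
    # v_alpha(varrho) = int v exp(-alpha E) dvarrho / int exp(-alpha E) dvarrho,
    # with the exponent shifted for numerical stability.
    w = np.exp(-alpha * (energy - energy.min()))
    return np.sum(samples * w) / np.sum(w)

errors = {a: abs(consensus(a)) for a in (0.1, 1.0, 10.0, 100.0)}
print(errors)  # |v_alpha(varrho) - v*| shrinks as alpha grows
```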
4.3. A lower bound for the probability mass around $v^*$. In this section we bound the probability mass $\rho_t(B_r(v^*))$ for an arbitrarily small radius $r>0$ from below. Defining a smooth mollifier $\phi_r:\mathbb{R}^d\to[0,1]$ with $\operatorname{supp}\phi_r = B_r(v^*)$ according to
$$(4.8)\qquad \phi_r(v) := \begin{cases} \exp\left(1 - \dfrac{r^2}{r^2 - \|v-v^*\|_2^2}\right) & \text{if } \|v-v^*\|_2 < r,\\[1ex] 0 & \text{else,} \end{cases}$$
it holds $\rho_t(B_r(v^*)) \geq \int \phi_r(v)\,d\rho_t(v)$. From there, the evolution of the right-hand side can be studied by using the weak solution property of $\rho$ as in Definition 3.1.
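As a quick sanity check of (4.8) and the bound $\rho_t(B_r(v^*)) \geq \int \phi_r(v)\,d\rho_t(v)$, one can evaluate the mollifier on samples; the Gaussian sample measure, the dimension, and the radius below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def mollifier(V, v_star, r):
    """phi_r from (4.8): smooth, with values in [0, 1] and supp(phi_r) = B_r(v_star)."""
    d2 = ((V - v_star) ** 2).sum(axis=1)
    phi = np.zeros(len(V))
    inside = d2 < r**2
    phi[inside] = np.exp(1.0 - r**2 / (r**2 - d2[inside]))
    return phi

v_star, r = np.zeros(2), 1.0
V = rng.standard_normal((100_000, 2))       # samples of rho_t (illustrative)
phi = mollifier(V, v_star, r)
mass_in_ball = np.mean(((V - v_star) ** 2).sum(axis=1) <= r**2)
lower_bound = phi.mean()                    # Monte Carlo estimate of the integral
# Since 0 <= phi_r <= 1 and phi_r vanishes outside B_r(v*), the estimated
# integral is indeed a lower bound for the mass of the ball.
```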
Proposition 4.6. Let $H:\mathbb{R}\to[0,1]$ be arbitrary, $T>0$, $r>0$, and fix parameters $\alpha,\lambda,\sigma>0$. Assume $\rho\in\mathcal{C}([0,T],\mathcal{P}(\mathbb{R}^d))$ weakly solves the Fokker--Planck equation (1.8) in the sense of Definition 3.1 with initial condition $\rho_0\in\mathcal{P}(\mathbb{R}^d)$ and for $t\in[0,T]$. Then, for all $t\in[0,T]$ we have
$$\rho_t\left(B_r(v^*)\right) \geq \left(\int\phi_r(v)\,d\rho_0(v)\right)\exp(-pt)$$
with
$$(4.10)\qquad p := \max\left\{\frac{2\lambda\left(\sqrt{c}\,r+B\right)\sqrt{c}}{(1-c)^2\,r} + \frac{2\sigma^2\left(cr^2+B^2\right)(2c+d)}{(1-c)^4\,r^2},\ \frac{4\lambda^2}{(2c-1)\sigma^2}\right\}$$
for any $B<\infty$ with $\sup_{t\in[0,T]}\|v_\alpha(\rho_t)-v^*\|_2 \leq B$ and for any $c\in(1/2,1)$ satisfying
$$(4.11)\qquad d(1-c)^2 \leq (2c-1)c.$$
Remark 4.7. In case the reader has wondered about the crucial role of the stochastic terms in (1.1) and (1.3), or the diffusion in the macroscopic models (1.7) and (1.8), Proposition 4.6 explains precisely where positive diffusion $\sigma > 0$ is actually used to ensure mass around the minimizer $v^*$ (compare Proposition 4.5).
Proof of Proposition 4.6. By definition of the mollifier $\phi_r$ in (4.8) we have $0\leq\phi_r(v)\leq 1$ and $\operatorname{supp}(\phi_r) = B_r(v^*)$. This implies
$$(4.12)\qquad \rho_t(B_r(v^*)) = \rho_t\left(\left\{v\in\mathbb{R}^d : \|v-v^*\|_2 \leq r\right\}\right) \geq \int\phi_r(v)\,d\rho_t(v).$$
Our strategy is to derive a lower bound for the right-hand side of this inequality. Using the weak solution property of $\rho$ and the fact that $\phi_r\in\mathcal{C}_c^\infty(\mathbb{R}^d)$, we obtain
$$(4.13)\qquad \frac{d}{dt}\int\phi_r(v)\,d\rho_t(v) = \int\left(T_1(v)+T_2(v)\right)d\rho_t(v)$$
with $T_1(v) := -\lambda H^*(v)\,\langle v-v_\alpha(\rho_t), \nabla\phi_r(v)\rangle$ and $T_2(v) := \frac{\sigma^2}{2}\|v-v_\alpha(\rho_t)\|_2^2\,\Delta\phi_r(v)$, where we abbreviate $H^*(v) := H(\mathcal{E}(v)-\mathcal{E}(v_\alpha(\rho_t)))$ to keep the notation concise. We now aim to show that $T_1(v)+T_2(v)\geq -p\phi_r(v)$ uniformly on $\mathbb{R}^d$ for $p>0$ as given in (4.10) in the statement of the proposition. Since the mollifier $\phi_r$ and its first and second derivatives vanish outside of $\Omega_r := \{v\in\mathbb{R}^d : \|v-v^*\|_2 < r\}$, we can restrict our attention to the open ball $\Omega_r$. To achieve the lower bound over $\Omega_r$, we introduce the subsets $K_1 := \{v\in\mathbb{R}^d : \|v-v^*\|_2 > \sqrt{c}\,r\}$ and
$$K_2 := \left\{v\in\mathbb{R}^d : -\lambda H^*(v)\,\langle v-v_\alpha(\rho_t), v-v^*\rangle\left(r^2-\|v-v^*\|_2^2\right)^2 > \widetilde c\,\frac{\sigma^2}{2}\,r^2\,\|v-v_\alpha(\rho_t)\|_2^2\,\|v-v^*\|_2^2\right\},$$
where $c$ adheres to (4.11) and $\widetilde c := 2c-1\in(0,1)$. We now decompose $\Omega_r$ according to $\Omega_r = (K_1^c\cap\Omega_r)\cup(K_1\cap K_2^c\cap\Omega_r)\cup(K_1\cap K_2\cap\Omega_r)$, which is illustrated in Figure 3 for different positions of $v_\alpha(\rho_t)$ and values of $\sigma$.
In the following we treat each of these three subsets separately.
Subset $K_1^c\cap\Omega_r$: We have $\|v-v^*\|_2 \leq \sqrt{c}\,r$ for each $v\in K_1^c$, which can be used to independently derive lower bounds for both $T_1$ and $T_2$. Recalling the expression for $\phi_r$ from (4.8), for $T_1$ we get, by using the Cauchy--Schwarz inequality and $H^*\leq 1$,
$$T_1(v) = -\lambda H^*(v)\,\langle v-v_\alpha(\rho_t), \nabla\phi_r(v)\rangle = -\lambda H^*(v)\left\langle v-v_\alpha(\rho_t),\ \frac{-2r^2\,(v-v^*)\,\phi_r(v)}{\left(r^2-\|v-v^*\|_2^2\right)^2}\right\rangle \geq -2r^2\lambda\,\frac{\|v-v_\alpha(\rho_t)\|_2\,\|v-v^*\|_2}{\left(r^2-\|v-v^*\|_2^2\right)^2}\,\phi_r(v) \geq -\frac{2\lambda\left(\sqrt{c}\,r+B\right)\sqrt{c}}{(1-c)^2\,r}\,\phi_r(v) =: -p_1\phi_r(v),$$
Fig. 3. Visualization of the decomposition of $\Omega_r$ for different positions of $v_\alpha(\rho_t)$ and values of $\sigma$ in the setting $H\equiv 1$. In the proof of Proposition 4.6 we limit the rate of the mass loss induced by both consensus drift and noise term for the set $K_1^c\cap\Omega_r$, which is colored blue. On the set $K_1\cap K_2^c\cap\Omega_r$, colored orange, the noise term counterbalances any potential mass loss induced by the drift, while on the gray set $K_1\cap K_2\cap\Omega_r$ mass can be lost at an exponential rate $-4\lambda^2/((2c-1)\sigma^2)$. (Color figure available online.)
where the last bound is due to $\|v-v_\alpha(\rho_t)\|_2 \leq \|v-v^*\|_2 + \|v^*-v_\alpha(\rho_t)\|_2 \leq \sqrt{c}\,r + B$. Similarly, by computing $\Delta\phi_r$ and inserting it, for $T_2$ we obtain
$$T_2(v) = \sigma^2 r^2\,\|v-v_\alpha(\rho_t)\|_2^2\,\frac{2\left(2\|v-v^*\|_2^2 - r^2\right)\|v-v^*\|_2^2 - d\left(r^2-\|v-v^*\|_2^2\right)^2}{\left(r^2-\|v-v^*\|_2^2\right)^4}\,\phi_r(v) \geq -\frac{2\sigma^2\left(cr^2+B^2\right)(2c+d)}{(1-c)^4\,r^2}\,\phi_r(v) =: -p_2\phi_r(v),$$
where we used $\|v-v_\alpha(\rho_t)\|_2^2 \leq 2\left(\|v-v^*\|_2^2 + \|v^*-v_\alpha(\rho_t)\|_2^2\right) \leq 2\left(cr^2+B^2\right)$.
Subset $K_1\cap K_2^c\cap\Omega_r$: By the definition of $K_1$ and $K_2$ we have $\|v-v^*\|_2 > \sqrt{c}\,r$ and
$$(4.14)\qquad -\lambda H^*(v)\,\langle v-v_\alpha(\rho_t), v-v^*\rangle\left(r^2-\|v-v^*\|_2^2\right)^2 \leq \widetilde c\,\frac{\sigma^2}{2}\,r^2\,\|v-v_\alpha(\rho_t)\|_2^2\,\|v-v^*\|_2^2.$$
Our goal now is to show $T_1(v)+T_2(v)\geq 0$ for all $v$ in this subset. We first compute
$$\frac{T_1(v)+T_2(v)}{\phi_r(v)} = \lambda H^*(v)\,\frac{2r^2\,\langle v-v_\alpha(\rho_t), v-v^*\rangle\left(r^2-\|v-v^*\|_2^2\right)^2}{\left(r^2-\|v-v^*\|_2^2\right)^4} + \sigma^2 r^2\,\|v-v_\alpha(\rho_t)\|_2^2\,\frac{2\left(2\|v-v^*\|_2^2-r^2\right)\|v-v^*\|_2^2 - d\left(r^2-\|v-v^*\|_2^2\right)^2}{\left(r^2-\|v-v^*\|_2^2\right)^4},$$
which is nonnegative if and only if
$$(4.15)\qquad \left(-\lambda H^*(v)\,\langle v-v_\alpha(\rho_t), v-v^*\rangle + \frac{d\sigma^2}{2}\,\|v-v_\alpha(\rho_t)\|_2^2\right)\left(r^2-\|v-v^*\|_2^2\right)^2 \leq \sigma^2\,\|v-v_\alpha(\rho_t)\|_2^2\left(2\|v-v^*\|_2^2-r^2\right)\|v-v^*\|_2^2.$$
Now note that the first summand on the left-hand side in (4.15) can be upper bounded by means of condition (4.14) and by using the relation $\widetilde c = 2c-1$. More precisely,
$$-\lambda H^*(v)\,\langle v-v_\alpha(\rho_t), v-v^*\rangle\left(r^2-\|v-v^*\|_2^2\right)^2 \leq \widetilde c\,\frac{\sigma^2}{2}\,r^2\,\|v-v_\alpha(\rho_t)\|_2^2\,\|v-v^*\|_2^2 \leq \frac{\sigma^2}{2}\left(2\|v-v^*\|_2^2-r^2\right)\|v-v_\alpha(\rho_t)\|_2^2\,\|v-v^*\|_2^2,$$
where the last inequality follows since $v\in K_1$ implies $\widetilde c\,r^2 = (2c-1)r^2 \leq 2\|v-v^*\|_2^2 - r^2$. For the second term on the left-hand side in (4.15) we can use $d(1-c)^2 \leq (2c-1)c$, as per (4.11), to get
$$\frac{d\sigma^2}{2}\,\|v-v_\alpha(\rho_t)\|_2^2\left(r^2-\|v-v^*\|_2^2\right)^2 \leq \frac{d\sigma^2}{2}\,\|v-v_\alpha(\rho_t)\|_2^2\,(1-c)^2 r^4 \leq \frac{\sigma^2}{2}\,\|v-v_\alpha(\rho_t)\|_2^2\,(2c-1)r^2\,cr^2 \leq \frac{\sigma^2}{2}\,\|v-v_\alpha(\rho_t)\|_2^2\left(2\|v-v^*\|_2^2-r^2\right)\|v-v^*\|_2^2.$$
Hence, (4.15) holds and we have $T_1(v)+T_2(v)\geq 0$ uniformly on this subset.
Subset $K_1\cap K_2\cap\Omega_r$: On this subset, we have $\|v-v^*\|_2 > \sqrt{c}\,r$ and
$$(4.16)\qquad -\lambda H^*(v)\,\langle v-v_\alpha(\rho_t), v-v^*\rangle\left(r^2-\|v-v^*\|_2^2\right)^2 > \widetilde c\,\frac{\sigma^2}{2}\,r^2\,\|v-v_\alpha(\rho_t)\|_2^2\,\|v-v^*\|_2^2.$$
We first note that $T_1(v) = 0$ whenever $\sigma^2\|v-v_\alpha(\rho_t)\|_2^2 = 0$, provided that $\sigma > 0$, so nothing needs to be done for the point $v = v_\alpha(\rho_t)$. On the other hand, if $\sigma^2\|v-v_\alpha(\rho_t)\|_2^2 > 0$, we can use $H^*\leq 1$, two applications of the Cauchy--Schwarz inequality, and condition (4.16) to get
$$\frac{H^*(v)\,\langle v-v_\alpha(\rho_t), v-v^*\rangle}{\left(r^2-\|v-v^*\|_2^2\right)^2} \geq -\frac{\|v-v_\alpha(\rho_t)\|_2\,\|v-v^*\|_2}{\left(r^2-\|v-v^*\|_2^2\right)^2} > \frac{2\lambda H^*(v)\,\langle v-v_\alpha(\rho_t), v-v^*\rangle}{\widetilde c\,r^2\sigma^2\,\|v-v_\alpha(\rho_t)\|_2\,\|v-v^*\|_2} \geq -\frac{2\lambda}{\widetilde c\,r^2\sigma^2}.$$
Using this, $T_1$ can be bounded from below by
$$T_1(v) = 2\lambda r^2 H^*(v)\left\langle v-v_\alpha(\rho_t),\ \frac{v-v^*}{\left(r^2-\|v-v^*\|_2^2\right)^2}\right\rangle\phi_r(v) \geq -\frac{4\lambda^2}{\widetilde c\,\sigma^2}\,\phi_r(v) =: -p_3\phi_r(v),$$
where we made use of the relation $\widetilde c = 2c-1$ in the last step. For $T_2$, we note that the nonnegativity of $\sigma^2\|v-v_\alpha(\rho_t)\|_2^2$ implies $T_2(v)\geq 0$ whenever
$$2\left(2\|v-v^*\|_2^2-r^2\right)\|v-v^*\|_2^2 \geq d\left(r^2-\|v-v^*\|_2^2\right)^2.$$
This is satisfied for all $v$ with $\|v-v^*\|_2 \geq \sqrt{c}\,r$, provided $c$ satisfies $2(2c-1)c \geq (1-c)^2 d$, as implied by (4.11).
Concluding the proof. Using the evolution of $\phi_r$ as in (4.13), we now get
$$\frac{d}{dt}\int\phi_r(v)\,d\rho_t(v) = \int_{K_1\cap K_2^c\cap\Omega_r}\underbrace{\left(T_1(v)+T_2(v)\right)}_{\geq\,0}\,d\rho_t(v) + \int_{K_1\cap K_2\cap\Omega_r}\underbrace{\left(T_1(v)+T_2(v)\right)}_{\geq\,-p_3\phi_r(v)}\,d\rho_t(v) + \int_{K_1^c\cap\Omega_r}\underbrace{\left(T_1(v)+T_2(v)\right)}_{\geq\,-(p_1+p_2)\phi_r(v)}\,d\rho_t(v) \geq -\max\left\{p_1+p_2,\,p_3\right\}\int\phi_r(v)\,d\rho_t(v) = -p\int\phi_r(v)\,d\rho_t(v).$$
An application of Gr\"onwall's inequality gives $\int\phi_r(v)\,d\rho_t(v) \geq \left(\int\phi_r(v)\,d\rho_0(v)\right)\exp(-pt)$, which concludes the proof after recalling (4.12).
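Proposition 4.6 only guarantees that the mass in $B_r(v^*)$ decays at worst exponentially; in a typical run of the isotropic CBO particle system the consensus drift in fact makes it grow. The following Euler--Maruyama sketch of dynamics of the form $dV^i = -\lambda(V^i - v_\alpha)\,dt + \sigma\|V^i - v_\alpha\|_2\,dB^i$ with $H\equiv 1$ illustrates this; the objective and all numerical parameters are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def consensus_point(V, energies, alpha):
    w = np.exp(-alpha * (energies - energies.min()))
    return (w[:, None] * V).sum(axis=0) / w.sum()

v_star = np.array([1.0, -0.5])
def E(V):
    r2 = ((V - v_star) ** 2).sum(axis=1)
    return r2 + 0.3 * np.sin(4.0 * np.sqrt(r2)) ** 2

N, d, lam, sigma, alpha = 200, 2, 1.0, 0.3, 50.0   # note: 2*lam > d*sigma**2
dt, steps, r = 0.01, 500, 0.5
V = rng.uniform(-3.0, 3.0, size=(N, d))
mass_0 = np.mean(((V - v_star) ** 2).sum(axis=1) <= r**2)
for _ in range(steps):
    v_a = consensus_point(V, E(V), alpha)
    diff = V - v_a
    # Drift toward the consensus point, isotropic noise scaled by the
    # distance to it; the noise vanishes as consensus forms.
    V = V - lam * diff * dt \
          + sigma * np.linalg.norm(diff, axis=1, keepdims=True) \
          * np.sqrt(dt) * rng.standard_normal((N, d))
mass_T = np.mean(((V - v_star) ** 2).sum(axis=1) <= r**2)
# The fraction of particles in B_r(v*) typically increases substantially.
```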
4.4. Proof of Theorem 3.7. We now have all necessary tools at hand to present
a detailed proof of the global convergence result in mean-field law. We separately prove
the cases of an inactive and active cutoff function, i.e., H \equiv 1 and H \not \equiv 1, respectively.
Proof of Theorem 3.7 when $H\equiv 1$. W.l.o.g. we may assume $\underline{\mathcal{E}} = 0$. Let us first choose the parameter $\alpha$ such that
$$(4.17)\qquad \alpha > \alpha_0 := \frac{1}{q_\varepsilon}\left(\log\left(\frac{4\sqrt{2\mathcal{V}(\rho_0)}}{\sqrt{c(\vargamma,\lambda,\sigma)\,\varepsilon}}\right) + \frac{p}{(1-\vargamma)(2\lambda-d\sigma^2)}\log\left(\frac{\mathcal{V}(\rho_0)}{\varepsilon}\right) - \log\rho_0\left(B_{r_\varepsilon/2}(v^*)\right)\right),$$
as well as
$$q_\varepsilon := \frac{1}{2}\min\left\{\left(\eta\,\frac{\sqrt{c(\vargamma,\lambda,\sigma)\,\varepsilon}}{2}\right)^{1/\nu},\ \mathcal{E}_\infty\right\}\quad\text{and}\quad r_\varepsilon := \max\left\{s\in[0,R_0] : \max_{v\in B_s(v^*)}\mathcal{E}(v) \leq q_\varepsilon\right\}.$$
Moreover, $p$ is as defined in (4.10) in Proposition 4.6 with $B = \sqrt{c(\vargamma,\lambda,\sigma)\mathcal{V}(\rho_0)}$ and with $r = r_\varepsilon$. We remark that, by construction, $q_\varepsilon > 0$ and $r_\varepsilon \leq R_0$. Furthermore, recalling the notation $\mathcal{E}_r = \sup_{v\in B_r(v^*)}\mathcal{E}(v)$ from Proposition 4.5, we have $q_\varepsilon + \mathcal{E}_{r_\varepsilon} \leq 2q_\varepsilon \leq \mathcal{E}_\infty$ as a consequence of the definition of $r_\varepsilon$. Since $q_\varepsilon > 0$, the continuity of $\mathcal{E}$ ensures that there exists $s_{q_\varepsilon} > 0$ such that $\mathcal{E}(v) \leq q_\varepsilon$ for all $v\in B_{s_{q_\varepsilon}}(v^*)$, thus yielding also $r_\varepsilon > 0$.
Let us now define the time horizon $T_\alpha \geq 0$, which may depend on $\alpha$, by
$$(4.19)\qquad T_\alpha := \sup\left\{t\geq 0 : \mathcal{V}(\rho_{t'}) > \varepsilon\ \text{and}\ \|v_\alpha(\rho_{t'})-v^*\|_2 < C(t')\ \text{for all}\ t'\in[0,t]\right\}$$
with $C(t) := \sqrt{c(\vargamma,\lambda,\sigma)\mathcal{V}(\rho_t)}$. Notice for later use that $C(0) = B$.
Our aim now is to show that $\mathcal{V}(\rho_{T_\alpha}) = \varepsilon$ with $T_\alpha\in\left[\frac{1-\vargamma}{1+\vargamma/2}T^*,\,T^*\right]$ and that we have at least exponential decay of $\mathcal{V}(\rho_t)$ until time $T_\alpha$, i.e., until accuracy $\varepsilon$ is reached.
First, however, we ensure that $T_\alpha > 0$. With the mapping $t\mapsto\mathcal{V}(\rho_t)$ being continuous as a consequence of the regularity $\rho\in\mathcal{C}([0,T],\mathcal{P}_4(\mathbb{R}^d))$ established in Theorem 3.2, and $t\mapsto\|v_\alpha(\rho_t)-v^*\|_2$ being continuous due to [10, Lemma 3.2] and $\rho\in\mathcal{C}([0,T],\mathcal{P}_4(\mathbb{R}^d))$, $T_\alpha > 0$ follows from the definition, since $\mathcal{V}(\rho_0) > \varepsilon$ and $\|v_\alpha(\rho_0)-v^*\|_2 < C(0)$. While the former is immediate by assumption, applying Proposition 4.5 with $q_\varepsilon$ and $r_\varepsilon$ gives the latter, since
$$\|v_\alpha(\rho_0)-v^*\|_2 \leq \frac{\left(q_\varepsilon+\mathcal{E}_{r_\varepsilon}\right)^\nu}{\eta} + \frac{\exp(-\alpha q_\varepsilon)}{\rho_0\left(B_{r_\varepsilon}(v^*)\right)}\int\|v-v^*\|_2\,d\rho_0(v) \leq \frac{\sqrt{c(\vargamma,\lambda,\sigma)\,\varepsilon}}{2} + \frac{\exp(-\alpha q_\varepsilon)}{\rho_0\left(B_{r_\varepsilon}(v^*)\right)}\sqrt{2\mathcal{V}(\rho_0)} \leq \sqrt{c(\vargamma,\lambda,\sigma)\,\varepsilon} < \sqrt{c(\vargamma,\lambda,\sigma)\mathcal{V}(\rho_0)} = C(0),$$
where the first inequality in the last line holds by the choice of $\alpha$ in (4.17).
Next, we show that the functional $\mathcal{V}(\rho_t)$ decays essentially exponentially fast in time. More precisely, we prove that, up to time $T_\alpha$, $\mathcal{V}(\rho_t)$ decays
(i) at least exponentially fast (with rate $(1-\vargamma)(2\lambda-d\sigma^2)$), and
(ii) at most exponentially fast (with rate $(1+\vargamma/2)(2\lambda-d\sigma^2)$).
To obtain (i), recall that Lemma 4.1 provides an upper bound on $\frac{d}{dt}\mathcal{V}(\rho_t)$, which, combined with the lower bound from Lemma 4.2 and the fact that
$$(4.20)\qquad \|v_\alpha(\rho_t)-v^*\|_2 \leq \sqrt{c(\vargamma,\lambda,\sigma)\mathcal{V}(\rho_t)} = C(t)\quad\text{for all } t\in[0,T_\alpha],$$
yields
$$\frac{d}{dt}\mathcal{V}(\rho_t) \leq -(1-\vargamma)\left(2\lambda-d\sigma^2\right)\mathcal{V}(\rho_t)\quad\text{and}\quad \frac{d}{dt}\mathcal{V}(\rho_t) \geq -(1+\vargamma/2)\left(2\lambda-d\sigma^2\right)\mathcal{V}(\rho_t)$$
for all $t\in(0,T_\alpha)$, where the second inequality again exploits the definition of $T_\alpha$. Gr\"onwall's inequality now implies for all $t\in[0,T_\alpha]$ the upper and the lower bound
$$(4.21)\qquad \mathcal{V}(\rho_t) \leq \mathcal{V}(\rho_0)\exp\left(-(1-\vargamma)\left(2\lambda-d\sigma^2\right)t\right),$$
$$(4.22)\qquad \mathcal{V}(\rho_t) \geq \mathcal{V}(\rho_0)\exp\left(-(1+\vargamma/2)\left(2\lambda-d\sigma^2\right)t\right),$$
thereby proving (i) and (ii). We further note that the definition of $T_\alpha$ in (4.19) together with the definition of $C(t)$ and (4.21) permits us to control
$$(4.23)\qquad \max_{t\in[0,T_\alpha]}\|v_\alpha(\rho_t)-v^*\|_2 \leq \max_{t\in[0,T_\alpha]}C(t) \leq C(0).$$
To conclude, it remains to prove that $\mathcal{V}(\rho_{T_\alpha}) = \varepsilon$ with $T_\alpha\in\left[\frac{1-\vargamma}{1+\vargamma/2}T^*,\,T^*\right]$. For this we distinguish the following three cases.
Case $T_\alpha \geq T^*$: We can use the definition of $T^*$ in (3.8) and the time-evolution bound of $\mathcal{V}(\rho_t)$ in (4.21) to conclude that $\mathcal{V}(\rho_{T^*}) \leq \varepsilon$. Hence, by the definition of $T_\alpha$ in (4.19) together with the continuity of $\mathcal{V}(\rho_t)$, we find $\mathcal{V}(\rho_{T_\alpha}) = \varepsilon$ with $T_\alpha = T^*$.
Case $T_\alpha < T^*$ and $\mathcal{V}(\rho_{T_\alpha}) \leq \varepsilon$: By continuity of $\mathcal{V}(\rho_t)$, it holds, for $T_\alpha$ as defined in (4.19), that $\mathcal{V}(\rho_{T_\alpha}) = \varepsilon$. Thus, $\varepsilon = \mathcal{V}(\rho_{T_\alpha}) \geq \mathcal{V}(\rho_0)\exp\left(-(1+\vargamma/2)\left(2\lambda-d\sigma^2\right)T_\alpha\right)$ by (4.22), which can be reordered as
$$\frac{1-\vargamma}{1+\vargamma/2}\,T^* = \frac{1}{(1+\vargamma/2)(2\lambda-d\sigma^2)}\log\left(\frac{\mathcal{V}(\rho_0)}{\varepsilon}\right) \leq T_\alpha < T^*.$$
Case $T_\alpha < T^*$ and $\mathcal{V}(\rho_{T_\alpha}) > \varepsilon$: We shall show that this case can never occur by verifying that $\|v_\alpha(\rho_{T_\alpha})-v^*\|_2 < C(T_\alpha)$ due to the choice of $\alpha$ in (4.17). In fact, fulfilling simultaneously both $\mathcal{V}(\rho_{T_\alpha}) > \varepsilon$ and $\|v_\alpha(\rho_{T_\alpha})-v^*\|_2 < C(T_\alpha)$ would contradict the definition of $T_\alpha$ in (4.19) itself. To this end, by applying again Proposition 4.5 with $q_\varepsilon$ and $r_\varepsilon$, and recalling that $\varepsilon < \mathcal{V}(\rho_{T_\alpha})$, we get
$$(4.24)\qquad \|v_\alpha(\rho_{T_\alpha})-v^*\|_2 \leq \frac{\left(q_\varepsilon+\mathcal{E}_{r_\varepsilon}\right)^\nu}{\eta} + \frac{\exp(-\alpha q_\varepsilon)}{\rho_{T_\alpha}\left(B_{r_\varepsilon}(v^*)\right)}\int\|v-v^*\|_2\,d\rho_{T_\alpha}(v) < \frac{\sqrt{c(\vargamma,\lambda,\sigma)\mathcal{V}(\rho_{T_\alpha})}}{2} + \frac{\exp(-\alpha q_\varepsilon)}{\rho_{T_\alpha}\left(B_{r_\varepsilon}(v^*)\right)}\sqrt{2\mathcal{V}(\rho_{T_\alpha})}.$$
Since, thanks to (4.23), we have the bound $\max_{t\in[0,T_\alpha]}\|v_\alpha(\rho_t)-v^*\|_2 \leq B$ for $B = C(0)$, which is in particular independent of $\alpha$, Proposition 4.6 guarantees that there exists a $p > 0$ not depending on $\alpha$ (but depending on $B$ and $r_\varepsilon$) with
$$\rho_{T_\alpha}\left(B_{r_\varepsilon}(v^*)\right) \geq \left(\int\phi_{r_\varepsilon}(v)\,d\rho_0(v)\right)\exp\left(-pT_\alpha\right) \geq \frac{1}{2}\,\rho_0\left(B_{r_\varepsilon/2}(v^*)\right)\exp\left(-pT^*\right),$$
so that, continuing the estimate in (4.24),
$$\|v_\alpha(\rho_{T_\alpha})-v^*\|_2 < \frac{\sqrt{c(\vargamma,\lambda,\sigma)\mathcal{V}(\rho_{T_\alpha})}}{2} + \frac{2\exp\left(-\alpha q_\varepsilon + pT^*\right)}{\rho_0\left(B_{r_\varepsilon/2}(v^*)\right)}\sqrt{2\mathcal{V}(\rho_{T_\alpha})} \leq \sqrt{c(\vargamma,\lambda,\sigma)\mathcal{V}(\rho_{T_\alpha})} = C(T_\alpha),$$
where the inequality in the last line holds by the choice of $\alpha$ in (4.17). This establishes the desired contradiction, again as a consequence of the continuity of the mappings $t\mapsto\mathcal{V}(\rho_t)$ and $t\mapsto\|v_\alpha(\rho_t)-v^*\|_2$.
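The exponential decay of $\mathcal{V}(\rho_t)$ just established can be observed at the particle level. The sketch below tracks the empirical variance functional $\frac{1}{2N}\sum_i\|V_t^i - v^*\|_2^2$ for a plain quadratic objective; all parameters are illustrative choices satisfying $2\lambda > d\sigma^2$, and the stated rate is the mean-field prediction, not a guarantee for any single run.

```python
import numpy as np

rng = np.random.default_rng(3)

v_star = np.zeros(2)
E = lambda V: (V ** 2).sum(axis=1)          # quadratic objective, minimizer v* = 0

N, d, lam, sigma, alpha = 400, 2, 1.0, 0.3, 100.0
dt, steps = 0.01, 400                        # simulate up to time T = 4
V = rng.uniform(-2.0, 2.0, size=(N, d))
variances = []
for _ in range(steps):
    w = np.exp(-alpha * (E(V) - E(V).min()))
    v_a = (w[:, None] * V).sum(axis=0) / w.sum()
    diff = V - v_a
    V = V - lam * diff * dt \
          + sigma * np.linalg.norm(diff, axis=1, keepdims=True) \
          * np.sqrt(dt) * rng.standard_normal((N, d))
    variances.append(0.5 * np.mean(((V - v_star) ** 2).sum(axis=1)))
# Mean-field theory predicts decay at a rate close to
# 2*lam - d*sigma**2 = 1.82 per unit time for these parameters.
decay_ratio = variances[-1] / variances[0]
```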
Proof of Theorem 3.7 when $H\not\equiv 1$. The proof follows the lines of the one for the inactive cutoff $H\equiv 1$, but requires some modifications, since Lemmas 4.1 and 4.2 need to be replaced by Lemmas 4.3 and 4.4 to derive bounds for the evolution of $\mathcal{V}(\rho_t)$. As in the proof for the case $H\equiv 1$ we first choose the parameter $\alpha$ such that
$$(4.26)\qquad \alpha > \alpha_0 := \frac{1}{q_\varepsilon}\left(\log\left(\frac{4\sqrt{2\mathcal{V}(\rho_0)}}{C_\varepsilon}\right) + \frac{p}{(1-\vargamma)(2\lambda-d\sigma^2)}\log\left(\frac{\mathcal{V}(\rho_0)}{\varepsilon}\right) - \log\rho_0\left(B_{r_\varepsilon/2}(v^*)\right)\right),$$
where $C_\varepsilon$ is obtained by replacing each $\mathcal{V}(\rho_t)$ in the modified definition of $C(t)$ from (4.27) with $\varepsilon$. Moreover, $r_\varepsilon$ is as defined before, $p$ is as in (4.10) with $B = C(0)$ and $r = r_\varepsilon$, and
$$q_\varepsilon := \frac{1}{2}\min\left\{\left(\eta\,\frac{C_\varepsilon}{2}\right)^{1/\nu},\ \mathcal{E}_\infty\right\}.$$
Let us now define again a time horizon $T_\alpha$ according to (4.19), however with the modified definition of $C(t)$ from (4.27). It is straightforward to check that $T_\alpha > 0$ by the choice of $\alpha$ in (4.26). Our aim is again to show that $\mathcal{V}(\rho_{T_\alpha}) = \varepsilon$ with $T_\alpha\in\left[\frac{1-\vargamma}{1+\vargamma/2}T^*,\,T^*\right]$ and that we have at least exponential decay of $\mathcal{V}(\rho_t)$ until $T_\alpha$.
Since, due to assumption A3 and the definition of $C(t)$ in (4.27), it holds
$$(4.28)\qquad \max_{t\in[0,T_\alpha]}\mathcal{E}(v_\alpha(\rho_t)) \leq \max_{t\in[0,T_\alpha]} L_\mathcal{E}\left(1+\|v_\alpha(\rho_t)-v^*\|_2\right)^\gamma\|v_\alpha(\rho_t)-v^*\|_2 \leq \mathcal{E}_\infty,$$
Lemmas 4.3 and 4.4 provide an upper and a lower bound for the time derivative of $\mathcal{V}(\rho_t)$, which, when combined with the definitions of $T_\alpha$ and $C(t)$ in (4.27), yield
$$\frac{d}{dt}\mathcal{V}(\rho_t) \leq -(1-\vargamma)\left(2\lambda-d\sigma^2\right)\mathcal{V}(\rho_t)\quad\text{and}\quad \frac{d}{dt}\mathcal{V}(\rho_t) \geq -(1+\vargamma/2)\left(2\lambda-d\sigma^2\right)\mathcal{V}(\rho_t)$$
for all $t\in(0,T_\alpha)$ as before. We can thus follow the lines of the proof for the case $H\equiv 1$, since also here $C(t)$ is bounded. In particular, the choice of $\alpha$ in (4.26) allows us to derive the contradiction $\|v_\alpha(\rho_{T_\alpha})-v^*\|_2 < C(T_\alpha)$ by using Propositions 4.5 and 4.6.
Remark 4.8 (informal lower bound for \alpha 0 ). As mentioned in section 3.3, insightful
lower bounds on the required \alpha 0 in Theorem 3.7 may be interesting in view of better
understanding the convergence of the microscopic system (1.3) to the mean-field limit
(1.7). Let us therefore informally derive in what follows an instructive lower bound
on the required \alpha 0 under the assumption that \scrE satisfies condition A2 globally with
\nu = 1/2 and that \scrE is locally L-Lipschitz continuous around v \ast , i.e., in some ball
BR (v \ast ). We restrict ourselves to the case of an inactive cutoff function H \equiv 1.
Recalling (4.20) in the proof of Theorem 3.7, $\alpha$ should be large enough to ensure
$$(4.29)\qquad \|v_\alpha(\rho_t)-v^*\|_2 \leq \sqrt{c(\vargamma,\lambda,\sigma)\mathcal{V}(\rho_t)}\quad\text{for all } t\in[0,T],$$
where $T$ is the time satisfying $\mathcal{V}(\rho_T) = \varepsilon$. To achieve this, we recall that for $\varrho\in\mathcal{P}(\mathbb{R}^d)$ the quantitative Laplace principle in Proposition 4.5 with choices $q_\varepsilon := c(\vargamma,\lambda,\sigma)\eta^2\varepsilon/8$ and $r_\varepsilon := \min\{R,\,q_\varepsilon/L\}$ for $q$ and $r$, respectively, yields
$$\|v_\alpha(\varrho)-v^*\|_2 \leq \frac{\sqrt{2q_\varepsilon}}{\eta} + \frac{\exp(-\alpha q_\varepsilon)}{\varrho\left(B_{r_\varepsilon}(v^*)\right)}\int\|v-v^*\|_2\,d\varrho(v),$$
provided that A2 holds globally with $\nu = 1/2$ and that $\mathcal{E}$ is $L$-Lipschitz continuous on some ball $B_R(v^*)$. It remains to choose $\alpha > \alpha_0$, where
$$(4.30)\qquad \alpha_0 := \sup_{t\in[0,T]}\ \frac{-8}{c(\vargamma,\lambda,\sigma)\eta^2\varepsilon}\,\log\left(\frac{\sqrt{c(\vargamma,\lambda,\sigma)}}{2\sqrt{2}}\,\rho_t\left(B_{\min\{R,\;c(\vargamma,\lambda,\sigma)\eta^2\varepsilon/(8L)\}}(v^*)\right)\right),$$
suggesting that $\alpha_0$ is strongly related to the time-evolution of the probability mass of $\rho_t$ around $v^*$. Recalling Proposition 4.6, this mass adheres to the lower bound $\rho_t(B_r(v^*)) \geq \rho_0(B_{r/2}(v^*))\exp(-pt)/2$ for some $p>0$ and any $r>0$. However, this result is pessimistic due to its worst-case nature, and inserting it into (4.30) with the corresponding $p$ as in (4.10) leads to overly stringent requirements on $\alpha_0$, which are reflected by the respective second summands in (4.17) and (4.26). Rather, a successful application of the CBO method entails that the probability mass around the global minimizer increases over time, so that $t\mapsto\rho_t(B_r(v^*))$ is typically minimized at $t=0$. In such a case, the lower bound (4.30) becomes
$$(4.31)\qquad \alpha_0 = \frac{-8}{c(\vargamma,\lambda,\sigma)\eta^2\varepsilon}\,\log\left(\frac{\sqrt{c(\vargamma,\lambda,\sigma)}}{2\sqrt{2}}\,\rho_0\left(B_{\min\{R,\;c(\vargamma,\lambda,\sigma)\eta^2\varepsilon/(8L)\}}(v^*)\right)\right).$$
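To get a feel for the scale of $\alpha_0$ suggested by (4.31), one can evaluate the formula for a concrete initial measure. The sketch below assumes $\rho_0$ is the standard Gaussian on $\mathbb{R}^2$ centered at $v^*$, for which $\rho_0(B_r(v^*)) = 1 - e^{-r^2/2}$ in closed form; the values chosen for $c(\vargamma,\lambda,\sigma)$, $\eta$, $L$, and $R$ are illustrative assumptions, and $\nu = 1/2$ as in the remark.

```python
import numpy as np

# Illustrative constants (not from the paper): c(vargamma, lambda, sigma), eta, L, R.
c, eta, L, R = 0.5, 1.0, 2.0, 1.0

def alpha_0(eps):
    q_eps = c * eta**2 * eps / 8.0            # choice of q in Proposition 4.5 (nu = 1/2)
    r_eps = min(R, q_eps / L)                 # choice of r
    # rho_0 = standard Gaussian on R^2 centered at v*, so the mass of a ball
    # of radius r is available in closed form: 1 - exp(-r^2 / 2).
    mass = 1.0 - np.exp(-r_eps**2 / 2.0)
    return -np.log(np.sqrt(c) / (2.0 * np.sqrt(2.0)) * mass) / q_eps

a_coarse, a_fine = alpha_0(1e-1), alpha_0(1e-2)
# alpha_0 grows by orders of magnitude as the target accuracy eps decreases,
# driven mostly by the shrinking mass rho_0(B_{r_eps}(v*)).
```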
5. Proof details for section 3.3. In this section we provide the proof details for the result about the mean-field approximation of CBO, Proposition 3.11. We first give the proof of the auxiliary Lemma 3.10, which ensures that the dynamics remain bounded with high probability.
$$(5.1)\qquad \mathbb{E}\sup_{t\in[0,T]}\left\|V_t^i\right\|_2^4 \lesssim \mathbb{E}\left\|V_0^i\right\|_2^4 + \lambda^4\,\mathbb{E}\sup_{t\in[0,T]}\left\|\int_0^t\left(V_\tau^i - v_\alpha(\widehat\rho_\tau^N)\right)d\tau\right\|_2^4 + \sigma^4\,\mathbb{E}\sup_{t\in[0,T]}\left\|\int_0^t\left\|V_\tau^i - v_\alpha(\widehat\rho_\tau^N)\right\|_2\,dB_\tau^i\right\|_2^4,$$
where we note that the expression $\int_0^t\|V_\tau^i - v_\alpha(\widehat\rho_\tau^N)\|_2\,dB_\tau^i$ appearing in the third term of the right-hand side is a martingale, which is a consequence of [51, Corollary 3.2.6] combined with the regularity established in [10, Lemma 3.4]. This allows us to apply the Burkholder--Davis--Gundy inequality [56, Chapter IV, Theorem 4.1], which yields
$$\mathbb{E}\sup_{t\in[0,T]}\left\|\int_0^t\left\|V_\tau^i - v_\alpha(\widehat\rho_\tau^N)\right\|_2\,dB_\tau^i\right\|_2^4 \lesssim \mathbb{E}\left(\int_0^T\left\|V_\tau^i - v_\alpha(\widehat\rho_\tau^N)\right\|_2^2\,d\tau\right)^2.$$
Let us stress that the constant appearing in the latter estimate depends on the dimension $d$. Further bounding this as well as the second term of the right-hand side in (5.1) by means of Jensen's inequality and utilizing [10, Lemma 3.3] yields
$$(5.2)\qquad \mathbb{E}\sup_{t\in[0,T]}\left\|V_t^i\right\|_2^4 \leq C\left(1 + \mathbb{E}\left\|V_0^i\right\|_2^4 + \mathbb{E}\int_0^T\left(\left\|V_\tau^i\right\|_2^4 + \int\|v\|_2^4\,d\widehat\rho_\tau^N(v)\right)d\tau\right),$$
which, after applying Gr\"onwall's inequality, ensures that the left-hand side is bounded independently of $N$ by a constant $K = K(\lambda,\sigma,d,T,b_1,b_2)$. With analogous arguments, $\mathbb{E}\sup_{t\in[0,T]}\int\|v\|_2^4\,d\rho_t^N(v) \leq K$. Equation (3.12) then follows from Markov's inequality.
and $0$ else, which is adapted to the natural filtration and has the property $I_M(t) = I_M(t)I_M(\tau)$ for all $\tau\in[0,t]$. With Jensen's inequality and the It\^o isometry this allows us to derive
$$(5.3)\qquad \mathbb{E}\left[\left\|V_t^i - \overline V_t^i\right\|_2^2\,I_M(t)\right] \lesssim c\int_0^t \mathbb{E}\left[\left(\left\|V_\tau^i - \overline V_\tau^i\right\|_2^2 + \left\|v_\alpha(\widehat\rho_\tau^N) - v_\alpha(\rho_\tau)\right\|_2^2\right)I_M(\tau)\right]d\tau$$
for $c = \lambda^2 T + \sigma^2$. Here we directly used that the processes $V_t^i$ and $\overline V_t^i$ share the initial data as well as the Brownian motion paths. In what follows, let us denote by $\overline\rho_\tau^N$ the empirical measure of the processes $\overline V_\tau^i$. Then, by using the same arguments as in the proofs of [10, Lemma 3.2] and [25, Lemma 3.1], with the care of taking into consideration the multiplication with the random variable $I_M(\tau)$, we obtain
$$\mathbb{E}\left[\left\|v_\alpha(\widehat\rho_\tau^N) - v_\alpha(\rho_\tau)\right\|_2^2\,I_M(\tau)\right] \lesssim \mathbb{E}\left[\left\|v_\alpha(\widehat\rho_\tau^N) - v_\alpha(\overline\rho_\tau^N)\right\|_2^2\,I_M(\tau)\right] + \mathbb{E}\left[\left\|v_\alpha(\overline\rho_\tau^N) - v_\alpha(\rho_\tau)\right\|_2^2\,I_M(\tau)\right] \leq C\left(\max_{i=1,\dots,N}\mathbb{E}\left[\left\|V_\tau^i - \overline V_\tau^i\right\|_2^2\,I_M(\tau)\right] + N^{-1}\right)$$
for a constant $C = C(\alpha, C_1, C_2, M, \mathcal{M}_2, b_1, b_2)$. After plugging this into (5.3) and taking the maximum over $i$, the quantitative mean-field approximation result (3.13) follows from an application of Gr\"onwall's inequality after recalling the definition of the conditional expectation and noting that $1_{\Omega_M} \leq I_M(t)$ pointwise and for all $t\in[0,T]$.
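The $N^{-1}$ term in the estimate above reflects the Monte Carlo fluctuation of the empirical consensus point. This rate can be probed numerically: for i.i.d. samples from a fixed law, the mean squared deviation of $v_\alpha$ of the empirical measure from its mean-field value decays roughly like $1/N$. The law, objective, and parameters in the sketch below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)

alpha, d = 1.0, 2
E = lambda V: (V ** 2).sum(axis=1)

def consensus_point(V):
    w = np.exp(-alpha * (E(V) - E(V).min()))
    return (w[:, None] * V).sum(axis=0) / w.sum()

# Large-sample proxy for the mean-field consensus point v_alpha(rho);
# for the standard Gaussian and this objective it is 0 by symmetry.
v_ref = consensus_point(rng.standard_normal((1_000_000, d)))

def mse(N, trials=200):
    errs = [np.sum((consensus_point(rng.standard_normal((N, d))) - v_ref) ** 2)
            for _ in range(trials)]
    return float(np.mean(errs))

mse_small, mse_large = mse(100), mse(10_000)
# Growing N by a factor of 100 shrinks the mean squared error by roughly
# the same factor, consistent with the N^{-1} rate.
```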
REFERENCES
[1] E. Aarts and J. Korst, Simulated Annealing and Boltzmann Machines. A Stochastic Ap-
[50] I. Necoara, Y. Nesterov, and F. Glineur, Linear convergence of first order methods for
non-strongly convex optimization, Math. Program., 175 (2019), pp. 69--107.
[51] B. {\O}ksendal, Stochastic Differential Equations: An Introduction with Applications, 6th ed.,