Math Behind Transformers
Contents

1. Outline
Part 1. Modeling
2. Interacting particle system
3. Measure to measure flow map
Part 2. Clustering
4. A single cluster for small $\beta$
5. A single cluster for large $\beta$
6. The high-dimensional case
Appendix
Appendix A. Proof of Theorem 4.1
Appendix B. Proof of Theorem 5.1
Appendix C. Proof of Theorem 4.3
Appendix D. Proof of Theorem 6.9
References
2020 Mathematics Subject Classification. Primary: 34D05, 34D06, 35Q83; Secondary: 52C17.
Key words and phrases. Transformers, self-attention, interacting particle systems, clustering, gradient flows.
1. Outline
The introduction of Transformers in 2017 by Vaswani et al. [VSP+17] marked a significant milestone in the development of neural network architectures. Central to this contribution is self-attention, a novel mechanism which distinguishes Transformers from traditional architectures, and which plays a substantial role in their superior practical performance. In fact, this innovation has been a key catalyst for the progress of artificial intelligence in areas such as computer vision and natural language processing, notably with the emergence of large language models. As a result, understanding the mechanisms by which Transformers, and especially self-attention, process data is a crucial yet largely uncharted research area.
A common characteristic of deep neural networks (DNNs) is their compositional nature: data is processed sequentially, layer by layer, resulting in a discrete-time dynamical system (we refer the reader to the textbook [GBC16] for a general introduction). This perspective has been successfully employed to model residual neural networks (see Section 2.1 for more details) as continuous-time dynamical systems called neural ordinary differential equations (neural ODEs) [CRBD18, E17, HR17].
In this context, an input $x(0) \in \mathbb{R}^d$, say an image, evolves according to a given time-varying velocity field as $\dot x(t) = v_t(x(t))$ over some time interval $(0, T)$. As such, a DNN can be seen as a flow map $x(0) \mapsto x(T)$ from $\mathbb{R}^d$ to $\mathbb{R}^d$. Even within the restricted class of velocity fields $\{v_t\}_{t \ge 0}$ imposed by classical DNN architectures, such flow maps enjoy strong approximation properties, as exemplified by a long line of work on these questions [LJ18, ZGUA20, LLS22, TG22, RBZ23, CLLS23].
Following [SABP22] and [VBC20], we observe that Transformers are in fact flow maps on $\mathcal{P}(\mathbb{R}^d)$, the space of probability measures over $\mathbb{R}^d$. To realize this flow map from measures to measures, Transformers evolve a mean-field interacting particle system. More specifically, every particle (called a token in this context) follows the flow of a vector field which depends on the empirical measure of all particles. In turn, the continuity equation governs the evolution of the empirical measure of particles, whose long-time behavior is of crucial interest. In this regard, our main observation is that particles tend to cluster under these dynamics. This phenomenon is of particular relevance in learning tasks such as next-token prediction, wherein one seeks to map a given input sequence (i.e., a sentence) of $n$ tokens (i.e., words) onto a given next token. In this case, the output measure encodes the probability distribution of the next token, and its clustering indicates a small number of possible outcomes. Our results on a simplified but insightful toy model indicate that the limiting distribution is actually a point mass, leaving no room for diversity or randomness, which is at odds with practical observations. This apparent paradox is resolved by the existence of a long-time metastable state. As can be seen from Figures 3 and 5, the Transformer flow appears to possess two different time-scales: in a first phase, tokens quickly form a few clusters, while in a second (much slower) phase, through the process of pairwise merging of clusters, all tokens finally collapse to a single point. This appears to corroborate behavior observed empirically in trained Transformer models, which goes under the names token uniformity, over-smoothing [CZC+22, RZZD23, GWDW23, WAWJ24, WAW+24, DBK24, SWJS24], or rank collapse [DCL21, FZH+22, NAB+22, JDB23, ZMZ+23, ZLL+23, NLL+24, BHK24, CNQG24]; see also Figure 1.
The goal of this manuscript is twofold. On the one hand, we aim to provide a general and accessible framework to study Transformers from a mathematical perspective. In particular, the structure of these interacting particle systems enables concrete connections to established mathematical topics, such as nonlinear transport equations, Wasserstein gradient flows, collective behavior models, and optimal configurations of points on spheres, among others. On the other hand, we describe several promising research directions, with a particular focus on the long-time clustering phenomenon. The main results we present are new, and we also provide what we believe are interesting open problems throughout the paper.
The rest of the paper is arranged in three parts.
Part 1: Modeling. We define an idealized model of the Transformer architecture that captures two of the main characteristics of Transformers: self-attention and layer-normalization. Following a perspective put forward in classical architectures such as ResNets [CRBD18, E17, HR17], we view the successive layers of a neural network as time discretizations of a dynamical system of interacting particles. Layer-normalization effectively constrains particles to evolve on the unit sphere $S^{d-1}$, whereas self-attention is the particular nonlinear coupling of the particles done through the empirical measure (Section 2). In turn, the empirical measure evolves according to the continuity partial differential equation (Section 3). Even after significant simplifications, this toy model retains macroscopic characteristics of trained Transformers, namely clustering. We also introduce a simpler surrogate model for self-attention which has the convenient property of being a Wasserstein gradient flow [AGS05] for an energy functional that is well studied in the context of optimal configurations of points on the sphere, and which sheds complementary light on the source of clustering.
Part 2: Clustering. In this part we recall existing, and establish new, mathematical results that indicate clustering of tokens in the long-time limit. (See Figure 2 for a summary.) Our first results (Theorem 4.3 in Section 4 and Theorem 5.1 in Section 5) concern extreme regimes of a temperature parameter $\beta^{-1}$ that appears in the equation. We then move to the high-dimensional case in Section 6, where we begin by recalling Theorem 6.1, a result of [MTG17] recently revisited in [CRMB24], which entails long-time clustering at any temperature when $d \ge 3$. We provide an exponential rate of convergence in Theorem 6.3 when $d \ge n$ (here $n$ denotes the number of particles). We complement this result with an even more precise characterization of the rate of contraction of particles into a cluster. Namely, we describe the histogram of all inter-particle distances, and the time at which all particles are already nearly clustered (Theorem 6.9).
Part 3: Further questions. We propose potential avenues for future research, largely in the form of open questions substantiated by numerical observations. We first focus on the case $d = 2$ (Section 7) and elicit a link to Kuramoto oscillators. We briefly show in Section 9.1 how a simple and natural modification of our model leads to non-trivial questions related to optimal configurations on the sphere. The remaining sections explore interacting particle systems that allow for parameter tuning of the Transformer architecture, a key feature of practical implementations.
Part 1. Modeling
These are dubbed neural ordinary differential equations (neural ODEs). Since their
introduction in [CRBD18, E17, HR17], neural ODEs have emerged as a flexible
mathematical framework to implement and study ResNets. We use this abstraction
here for convenience, as dynamical systems are more naturally defined in continuous
time. Although this approach is helpful for exposition, we emphasize that all the
results presented can also be derived in discrete time using the same techniques.
2.2. The interacting particle system. Unlike ResNets, which operate on a single input vector $x(0) \in \mathbb{R}^d$ at a time, Transformers operate on a sequence of vectors of length $n$, namely $(x_i(0))_{i\in[n]} \in (\mathbb{R}^d)^n$. This perspective is rooted in natural language processing, where each vector represents a word, and the entire sequence a sentence or a paragraph. In particular, it allows words to be processed together with their context. A sequence element $x_i(0) \in \mathbb{R}^d$ is called a token, and the entire sequence $(x_i(0))_{i\in[n]}$ a prompt. We use the words "token" and "particle" interchangeably.
All practical implementations of Transformers make use of layer normalization
[BKH16], most commonly in the form of root mean square (RMS) normalization
[ZS19]. (See lines 105–116 in [Mis24] for instance.) RMS normalization takes the sequence of tokens output after each layer, divides each token by its Euclidean norm¹ (plus a small parameter to avoid a possible division by zero), and multiplies the result by a trained diagonal matrix. This process aims to ensure that tokens do not diverge, thus avoiding rounding errors and overflow. The result is an evolution on a time-varying axis-aligned ellipsoid. To simplify the presentation and obtain insight and precise results, we assume that the trained diagonal matrix is equal to the identity, so that we work on the unit sphere $S^{d-1}$ throughout. This simplification is justified empirically in the trained ALBERT XLarge v2 model described
in Figure 1, wherein this diagonal matrix is constant over all layers, with entries
of mean value equal to 0.44 and standard deviation equal to 0.008. Furthermore,
current text embeddings provided by OpenAI, namely text-embedding-3-small
and text-embedding-3-large, return norm-one embedding vectors. While we can
only speculate as to the actual implementation of these models, this is an indication
that layer normalization could be as simple as the one used in our toy model.
A Transformer is then a flow map on $(S^{d-1})^n$: the input sequence $(x_i(0))_{i\in[n]} \in (S^{d-1})^n$ is an initial condition which is evolved through the dynamics

(2.3)  $\dot x_i(t) = P^\perp_{x_i(t)}\left(\dfrac{1}{Z_{\beta,i}(t)} \displaystyle\sum_{j=1}^n e^{\beta\langle Q(t)x_i(t),\, K(t)x_j(t)\rangle}\, V(t)x_j(t)\right)$

for all $i \in [n]$ and $t \ge 0$. (We refer the reader to (2.8) and Section 2.3 for the full model.) Here and henceforth

$P^\perp_x y = y - \langle x, y\rangle x$

denotes the projection of $y \in \mathbb{R}^d$ onto $T_x S^{d-1}$. The partition function $Z_{\beta,i}(t) > 0$ reads

(2.4)  $Z_{\beta,i}(t) = \displaystyle\sum_{k=1}^n e^{\beta\langle Q(t)x_i(t),\, K(t)x_k(t)\rangle},$
¹The original form instead consisted in an entry-wise standardization of every token, namely subtracting the mean of all tokens, then dividing by the standard deviation.
where $(Q(\cdot), K(\cdot), V(\cdot))$ (standing for Query, Key, and Value) are parameter matrices learned from data, and $\beta > 0$ is a fixed number intrinsic to the model², which can be seen as an inverse temperature using terminology from statistical physics. Note that $Q(\cdot)$ and $K(\cdot)$ need not be square.
The interacting particle system (2.3)–(2.4), a simplified version of which was first written down in [LLH+20, DGCC21, SABP22], importantly contains the true novelty that Transformers carry with regard to other models: the self-attention mechanism

(2.5)  $A_{ij}(t) := \dfrac{e^{\beta\langle Q(t)x_i(t),\, K(t)x_j(t)\rangle}}{Z_{\beta,i}(t)}, \qquad (i,j) \in [n]^2,$
which is the nonlinear coupling mechanism in the interacting particle system. The $n \times n$ stochastic matrix $A(t)$ (rows are probability vectors) is called the self-attention matrix. The wording attention stems from the fact that $A_{ij}(t)$ captures the attention given by particle $i$ to particle $j$ relative to all particles $\ell \in [n]$. In particular, a particle pays attention to its neighbors, where neighborhoods are dictated by the matrices $Q(t)$ and $K(t)$ in (2.5).
It has been observed numerically that the probability vectors $(A_{ij}(\cdot))_{j\in[n]}$ (for $i \in [n]$) in a trained self-attention matrix exhibit behavior related to the syntactic and semantic structure of sentences in natural language processing tasks (see [VSP+17, Figures 3–5]). To illustrate our conclusions as pedagogically as possible, throughout the paper we focus on a simplified scenario wherein the parameter matrices $(Q, K, V)$ are constant, and even all equal to the identity unless stated otherwise, resulting in the dynamics

(SA)  $\dot x_i(t) = P^\perp_{x_i(t)}\left(\dfrac{1}{Z_{\beta,i}(t)} \displaystyle\sum_{j=1}^n e^{\beta\langle x_i(t),\, x_j(t)\rangle}\, x_j(t)\right)$
²In practical implementations the inner products are multiplied by $d^{-1/2}$, which along with the typical magnitude of $Q, K$ leads to the appearance of $\beta$.
³ALBERT XLarge v2 contains all the mechanisms described in this text, namely, it is a system of the form (2.8) (or rather the discretization thereof) with 12 or 24 layers. The sequence length $n$ is of the order of 512 or 1024, and the tokens evolve in $\mathbb{R}^{4096}$. The dynamics are therefore high-dimensional, lending weight to assumptions made later on (Section 6).
which is non-symmetric in general ($a_{ij} \ne a_{ji}$), much like (2.3). When $\phi$ is compactly supported, it has been shown in [JM14] that the particles $x_i(t)$ assemble in several clusters as $t \to +\infty$. Other related models include those of Vicsek [VCBJ+95], Hegselmann–Krause [HK02] and Cucker–Smale [CS07]. All these models exhibit a
clustering behavior under various assumptions (see [MT14, Tad23] and the refer-
ences therein). Yet, none of the opinion dynamics models discussed above contain
parameters appearing within the nonlinearity as in (SA), whilst set on the sphere.
Remark 2.2 (Permutation equivariance). A function $f : (S^{d-1})^n \to (S^{d-1})^n$ is permutation equivariant if $f(\pi X) = \pi(f_1(X), \ldots, f_n(X))$ for any $X \in (S^{d-1})^n$ and any permutation $\pi \in \mathbf{S}_n$ of $n$ elements. Otherwise put, if we permute the input $X$, then the output $f(X)$ is permuted in the same way. Given $t > 0$, the Transformer (SA), mapping $(x_i(0))_{i\in[n]} \mapsto (x_i(t))_{i\in[n]}$, is permutation equivariant on $(S^{d-1})^n$.
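The following small experiment (an illustration, not a proof) checks this equivariance numerically for (SA); all run parameters are arbitrary choices.

```python
import numpy as np

def sa_step(X, beta, dt):
    logits = beta * X @ X.T
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    F = A @ X
    F -= np.sum(F * X, axis=1, keepdims=True) * X
    X = X + dt * F
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(1)
n, d, beta = 8, 3, 2.0
X0 = rng.standard_normal((n, d)); X0 /= np.linalg.norm(X0, axis=1, keepdims=True)
perm = rng.permutation(n)
X, Xp = X0.copy(), X0[perm].copy()
for _ in range(200):
    X, Xp = sa_step(X, beta, 0.05), sa_step(Xp, beta, 0.05)
print(np.max(np.abs(X[perm] - Xp)))   # ~ 0: permuting inputs permutes outputs
```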
2.3. Toward the complete Transformer. There are a couple of additional mechanisms used in practical implementations that we do not explicitly address or use in this study. The mathematical analysis of these mechanisms remains open.
2.3.1. Multi-headed attention. Practical implementations spread out the computation of the self-attention mechanism at every $t$ through a sequence of heads, leading to so-called multi-headed self-attention. This consists in considering the following modification of (SA):

(2.7)  $\dot x_i(t) = P^\perp_{x_i(t)}\left(\displaystyle\sum_{h=1}^H \sum_{j=1}^n \dfrac{e^{\beta\langle Q_h(t)x_i(t),\, K_h(t)x_j(t)\rangle}}{Z_{\beta,i,h}(t)}\, V_h(t)x_j(t)\right)$

where $Z_{\beta,i,h}(t)$ is defined as in (2.4) for the matrices $Q_h(t)$ and $K_h(t)$. The integer $H \ge 1$ is called the number of heads⁴.
The introduction of multiple heads also allows for drawing some interesting parallels with the literature on feed-forward neural networks, such as ResNets (2.1). Considerable effort has been expended to understand 2-layer neural networks with width tending to $+\infty$; more precisely, consider (2.1) with $L = 1$, $w \in \mathbb{R}^{d\times\ell}$, $a \in \mathbb{R}^{\ell\times d}$, and $\ell \to +\infty$. The infinite-width limit for Transformers is in fact very natural, as it is realized by stacking an arbitrarily large number of heads: $H \to +\infty$. Hence, the same questions as for 1-hidden-layer neural networks may be asked: for instance, the question of universal approximation, in the vein of [Cyb89, Bar93].
2.3.2. Feed-forward layers. The complete Transformer dynamics combines all of the above mechanisms with a feed-forward layer; in the discrete-time context, this is actually done by using a Lie–Trotter splitting scheme for

(2.8)  $\dot x_i(t) = P^\perp_{x_i(t)}\left(\displaystyle\sum_{h=1}^H \sum_{j=1}^n \dfrac{e^{\beta\langle Q_h(t)x_i(t),\, K_h(t)x_j(t)\rangle}}{Z_{\beta,i,h}(t)}\, V_h(t)x_j(t) + w(t)\sigma(a(t)x_i(t) + b(t))\right),$

where $w(t), a(t), b(t)$ and $\sigma$ are all as in (2.2). The interested reader is referred to [LLH+20, PH22] for all the details⁵. The feed-forward layers (convolutional layers can alternatively be considered) are of critical importance in applications and drive the existing results on approximation properties of Transformers [YBR+19]. Nevertheless, the analysis of this model is beyond the scope of our current methods.
⁴In practical implementations, $H$ is a divisor of $d$, and the query and key matrices $Q_h(t)$ and $K_h(t)$ are $\frac{d}{H}\times d$ rectangular. This allows for further parallelization of computations and increased expressiveness. For mathematical purposes, we focus on working with arbitrary integers $H$ and square weight matrices $Q_h$ and $K_h$.
⁵And lines 123–130 in [Ope24] for some relevant source code.
3.1. The continuity equation. The vector field driving the evolution of a single particle in (SA) clearly depends on all $n$ particles. In fact, one can equivalently rewrite the dynamics as

(3.1)  $\dot x_i(t) = \mathcal{X}[\mu(t)](x_i(t)),$

where $\mu(t) = \frac1n\sum_{j=1}^n \delta_{x_j(t)}$ is the empirical measure, while the vector field $\mathcal{X}[\mu] : S^{d-1} \to TS^{d-1}$ reads

(3.2)  $\mathcal{X}[\mu](x) = P^\perp_x\left(\dfrac{1}{Z_{\beta,\mu}(x)}\displaystyle\int e^{\beta\langle x, y\rangle}\, y \, d\mu(y)\right)$

with

(3.3)  $Z_{\beta,\mu}(x) = \displaystyle\int e^{\beta\langle x, y\rangle}\, d\mu(y).$

⁶See [DBPC19, VBC20, ZB21] for further related work on neural networks acting on probability measures.
attest [RS23, GBM21] (see [LLF23] for a result in the smooth kernel case). We do not address this question in further detail here. For more references on this well-established topic, the reader is referred to [Vil01, Gol16, Ser20] and the references therein.
3.2. The interaction energy. One can naturally ask whether the evolution in (3.4) admits some quantities which are monotonic when evaluated along the flow. As it turns out, the interaction energy

(3.5)  $E_\beta[\mu] = \dfrac{1}{2\beta}\displaystyle\iint e^{\beta\langle x, x'\rangle}\, d\mu(x)\, d\mu(x')$

is one such quantity. Indeed,

(3.6)
$\dfrac{d}{dt}E_\beta[\mu(t)] = \beta^{-1}\displaystyle\iint e^{\beta\langle x, x'\rangle}\, d\partial_t\mu(t, x)\, d\mu(t, x')$
$\qquad = \displaystyle\int \left\langle \mathcal{X}[\mu(t)](x),\ \nabla\int \beta^{-1} e^{\beta\langle x, x'\rangle}\, d\mu(t, x')\right\rangle d\mu(t, x)$
$\qquad = \displaystyle\int \left\|\mathcal{X}[\mu(t)](x)\right\|^2 Z_{\beta,\mu(t)}(x)\, d\mu(t, x)$

for any $t \ge 0$, by using integration by parts. Recalling the definition of $Z_{\beta,\mu}(x)$ in (3.3), we see that $e^{-\beta} \le Z_{\beta,\mu}(x) \le e^{\beta}$ for all $x \in S^{d-1}$. The identity (3.6) therefore indicates that $E_\beta$ increases along trajectories of (3.4). (Similarly, should $V = -I_d$, the energy $E_\beta$ would decrease along trajectories.) This begs the question of characterizing the global minima and maxima of $E_\beta$, which is the goal of the following result.
Proposition 3.4. Let $\beta > 0$ and $d \ge 2$. The unique global minimizer of $E_\beta$ over $\mathcal{P}(S^{d-1})$ is the uniform measure $\sigma_d$. Any global maximizer of $E_\beta$ over $\mathcal{P}(S^{d-1})$ is a Dirac mass $\delta_{x^\star}$ centered at some point $x^\star \in S^{d-1}$.

This result lends credence to our nomenclature of the case $V = I_d$ as attractive, and $V = -I_d$ as repulsive. The reader should be wary, however, that in this result we are minimizing or maximizing $E_\beta$ among all probability measures on $S^{d-1}$. Should one focus solely on discrete measures, many global minima appear; these are discussed in Section 9.1. This is one point where the particle dynamics and the mean-field flow deviate. We now provide a brief proof of Proposition 3.4 (see [Tan17] for a different approach).
Proof of Proposition 3.4. The fact that any global maximizer is a Dirac mass is easy to see. We proceed with proving the rest of the statement. Let $f(t) = e^{\beta t}$. The interaction energy then reads

$E_\beta[\mu] = \dfrac12 \displaystyle\iint f(\langle x, x'\rangle)\, d\mu(x)\, d\mu(x').$

The proof relies on an ultraspherical (or Gegenbauer) polynomial expansion of $f(t)$:

$f(t) = \displaystyle\sum_{k=0}^{+\infty} \hat f(k; \lambda)\, \dfrac{k+\lambda}{\lambda}\, C_k^\lambda(t)$

for $t \in [-1, 1]$, where $\lambda = \frac{d-2}{2}$, the $C_k^\lambda$ are Gegenbauer polynomials, and

$\hat f(k;\lambda) = \dfrac{\Gamma(\lambda+1)}{\Gamma(\lambda+\frac12)\Gamma(\frac12)}\, \dfrac{1}{C_k^\lambda(1)}\displaystyle\int_{-1}^1 f(t)\, C_k^\lambda(t)\, (1-t^2)^{\lambda-\frac12}\, dt,$

where $C_k^\lambda(1) > 0$ (see [DX13, Section 1.2]). According to [BD19, Proposition 2.2], a necessary and sufficient condition for Proposition 3.4 to hold is that $\hat f(k;\lambda) > 0$ for all $k \ge 1$. To show this, we use the Rodrigues formula [Sze39, 4.1.72]

$C_k^\lambda(t) = \dfrac{(-1)^k\, 2^k\, \Gamma(k+\lambda)\,\Gamma(k+2\lambda)}{k!\,\Gamma(\lambda)\,\Gamma(2k+2\lambda)}\, (1-t^2)^{-(\lambda-\frac12)}\left(\dfrac{d}{dt}\right)^{k} (1-t^2)^{k+\lambda-\frac12},$

and the fact that $C_k^\lambda(-t) = (-1)^k C_k^\lambda(t)$ for $t \in [-1,1]$, which in combination with integration by parts yield

$\displaystyle\int_{-1}^1 t^\ell\, C_k^\lambda(t)\, (1-t^2)^{\lambda-\frac12}\, dt \begin{cases} > 0 & \text{if } \ell \ge k \text{ and } \ell - k \text{ is even} \\ = 0 & \text{otherwise.}\end{cases}$

We conclude by using the power series expansion of $f$. □
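As a numerical sanity check (not needed for the proof), one can evaluate the sign of the coefficients $\hat f(k;\lambda)$ by quadrature; since the Gamma prefactor and $C_k^\lambda(1)$ are positive, only the integral matters. The values of $\beta$, $d$, and the truncation level below are arbitrary choices.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_gegenbauer

def fhat_integral(k, lam, beta):
    # integral part of f^(k; lambda) for f(t) = e^{beta t}
    val, _ = quad(lambda t: np.exp(beta * t) * eval_gegenbauer(k, lam, t)
                  * (1.0 - t * t) ** (lam - 0.5), -1.0, 1.0)
    return val

for d in (3, 5, 10):
    lam = (d - 2) / 2
    vals = [fhat_integral(k, lam, beta=4.0) for k in range(1, 12)]
    print(d, all(v > 0 for v in vals))   # True: positive coefficients for k >= 1
```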
3.3. A Wasserstein gradient flow proxy. In view of (3.6), one could hope to see the continuity equation (3.4) as the Wasserstein gradient flow of $E_\beta$, or possibly some other functional (see the seminal papers [Ott01, JKO98], and [AGS05, Vil09] for a complete treatment). The long-time asymptotics of the PDE can then be analyzed by studying convexity properties of the underlying functional, by analogy with gradient flows in the Euclidean case.

For (3.4) to be the Wasserstein gradient flow of $E_\beta$, the vector field $\mathcal{X}[\mu]$ defined in (3.2) ought to be the gradient of the first variation $\delta E_\beta$ of $E_\beta$. However, notice that $\mathcal{X}[\mu]$ is a logarithmic derivative:

(3.7)  $\mathcal{X}[\mu](x) = \nabla\left(\beta^{-1}\log\displaystyle\int e^{\beta\langle x, y\rangle}\, d\mu(y)\right).$

In view of the above discussion, we are inclined to propose the surrogate model

(USA)  $\dot x_i(t) = P^\perp_{x_i(t)}\left(\dfrac1n\displaystyle\sum_{j=1}^n e^{\beta\langle x_i(t),\, x_j(t)\rangle}\, x_j(t)\right),$

which is obtained by replacing the partition function $Z_{\beta,i}(t)$ by $n$. As a matter of fact, (USA) presents remarkably similar qualitative behavior; all of the results we show in this paper are essentially the same for both dynamics.
The continuity equation corresponding to (USA), namely

(3.8)  $\begin{cases}\partial_t\mu(t,x) + \operatorname{div}\left(P^\perp_x\left(\displaystyle\int e^{\beta\langle x, x'\rangle}\, x'\, d\mu(t, x')\right)\mu(t,x)\right) = 0\\ \mu|_{t=0} = \mu_0\end{cases}$

for $(t,x) \in \mathbb{R}_{\ge0}\times S^{d-1}$, can now be seen as a Wasserstein gradient flow for the interaction energy $E_\beta$ defined in (3.5).
Lemma 3.6. Consider the interaction energy $E_\beta : \mathcal{P}(S^{d-1}) \to \mathbb{R}_{\ge0}$ defined in (3.5). Then the vector field

$\mathcal{X}[\mu](x) = P^\perp_x\left(\displaystyle\int e^{\beta\langle x, x'\rangle}\, x'\, d\mu(x')\right)$

satisfies

(3.9)  $\mathcal{X}[\mu](x) = \nabla\delta E_\beta[\mu](x)$

for any $\mu \in \mathcal{P}(S^{d-1})$ and $x \in S^{d-1}$, where $\delta E_\beta[\mu]$ denotes the first variation of $E_\beta$.
We omit the proof, which follows from standard Otto calculus [Ott01], [Vil09, Chapter 15], [CNWR24, Chapter 5]. We can actually write (3.9) more succinctly by recalling the definition of the convolution of two functions on $S^{d-1}$ [DX13, Chapter 2]: for any $g \in L^1(S^{d-1})$ and $f : [-1,1] \to \mathbb{R}$ such that $t \mapsto (1-t^2)^{\frac{d-3}{2}} f(t)$ is integrable,

$(f * g)(x) = \displaystyle\int f(\langle x, y\rangle)\, g(y)\, d\sigma_d(y).$

This definition has a natural extension to the convolution of a function $f$ (with the above integrability) and a measure $\mu \in \mathcal{P}(S^{d-1})$. We can hence rewrite

$E_\beta[\mu] = \dfrac12\displaystyle\int (G_\beta * \mu)(x)\, d\mu(x),$

where $[-1,1] \ni t \mapsto G_\beta(t) = \beta^{-1}e^{\beta t}$, and so

$\mathcal{X}[\mu](x) = \nabla(G_\beta * \mu)(x).$

Thus, (3.8) takes the equivalent form

(3.10)  $\begin{cases}\partial_t\mu(t,x) + \operatorname{div}\left(\nabla\big(G_\beta * \mu(t,\cdot)\big)(x)\,\mu(t,x)\right) = 0 & \text{for } (t,x)\in\mathbb{R}_{\ge0}\times S^{d-1}\\ \mu|_{t=0} = \mu_0 & \text{for } x \in S^{d-1}.\end{cases}$
The considerations above lead us to the following Lyapunov identity.

Lemma 3.7. The solution $\mu \in C^0(\mathbb{R}_{\ge0}; \mathcal{P}(S^{d-1}))$ to (3.8) satisfies

$\dfrac{d}{dt}E_\beta[\mu(t)] = \displaystyle\int\left\|\nabla\big(G_\beta * \mu(t,\cdot)\big)(x)\right\|^2 d\mu(t,x)$

for $t \ge 0$.
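At the particle level, Lemma 3.7 can be observed numerically: along a discretization of (USA), the energy $E_\beta$ of the empirical measure is nondecreasing up to discretization error. The step size and horizon in this sketch are illustrative assumptions, and the experiment is not a substitute for the identity above.

```python
import numpy as np

def energy(X, beta):
    # E_beta of the empirical measure: (1/(2 beta n^2)) sum_ij e^{beta <x_i, x_j>}
    return np.exp(beta * X @ X.T).sum() / (2 * beta * X.shape[0] ** 2)

def usa_step(X, beta, dt):
    F = (np.exp(beta * X @ X.T) / X.shape[0]) @ X   # (1/n) sum_j e^{beta <.,.>} x_j
    F -= np.sum(F * X, axis=1, keepdims=True) * X   # tangential projection
    X = X + dt * F
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(2)
X = rng.standard_normal((16, 4)); X /= np.linalg.norm(X, axis=1, keepdims=True)
E = [energy(X, 3.0)]
for _ in range(2000):
    X = usa_step(X, 3.0, 0.01)
    E.append(energy(X, 3.0))
print(min(b - a for a, b in zip(E, E[1:])))  # >= 0 up to discretization error
```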
Set

$E_\beta(X) = \dfrac{1}{2\beta}\displaystyle\sum_{i=1}^n\sum_{j=1}^n e^{\beta\langle V x_i,\, x_j\rangle}.$

We now show that the dynamics (2.3) can be equivalently written as

$\dot X(t) = \nabla E_\beta(X(t)),$

where the gradient $\nabla$ is computed with respect to the metric (3.12) on $(S^{d-1})^n$. To this end, we ought to show that for all vector fields $Y$ on $(S^{d-1})^n$ and for all $X \in (S^{d-1})^n$,

(3.13)  $\left.\dfrac{d}{dt}\right|_{t=0} E_\beta\big(\Phi^t_Y(X)\big) = \langle Y(X), B(X)\rangle_X$

holds, where $\Phi^t_Y$ is the flow associated to the vector field $Y$, whereas $B = (B_1, \ldots, B_n)$ with

$B_i = P^\perp_{x_i}\left(\dfrac{1}{Z_{\beta,i}(X)}\displaystyle\sum_{j=1}^n e^{\beta\langle V x_i,\, x_j\rangle}\, V x_j\right) \in T_{x_i}S^{d-1}.$

By linearity, it is sufficient to show (3.13) for vector fields $Y$ of the form

$Y(X) = (A x_1, 0, \ldots, 0) \in T_X(S^{d-1})^n,$

where $A$ is an arbitrary non-zero skew-symmetric matrix. Clearly

(3.14)  $\Phi^t_Y(X) = (e^{tA}x_1, x_2, \ldots, x_n).$

One first computes

$\left.\dfrac{d}{dt}\right|_{t=0}E_\beta\big(\Phi^t_Y(X)\big) = \displaystyle\sum_{j=1}^n e^{\beta\langle V x_1,\, x_j\rangle}\langle A x_1, V x_j\rangle.$

Now observe that $\langle Ax_1, y\rangle = \langle Ax_1, z\rangle$ for all skew-symmetric matrices $A$ if and only if $x_1(y^\top - z^\top)$ is a symmetric matrix. Since $P^\perp_{x_1} = I_d - x_1 x_1^\top$, we see that

$\displaystyle\sum_{j=1}^n e^{\beta\langle V x_1,\, x_j\rangle}\langle A x_1, V x_j\rangle = \langle Y(X), B(X)\rangle_X,$

as desired.
3.4.2. The case of measures. The above insight, which consists in reweighing the metric with respect to which the gradient is taken, can be formally adapted to the Wasserstein setting for (3.4), which, we recall, is not a gradient flow for the standard Wasserstein gradient of $E_\beta$. Note that (3.4) can be written as

(3.15)  $\partial_t\mu(t) + \operatorname{div}\left(\dfrac{\nabla\delta E_\beta[\mu(t)]}{\delta E_\beta[\mu(t)]}\,\mu(t)\right) = 0,$

with $\delta E_\beta[\mu](x) = \int e^{\beta\langle x, y\rangle}\, d\mu(y)$. To avoid technicalities we henceforth focus only on the absolutely continuous case. For a fixed $\mu \in \mathcal{P}_{\mathrm{ac}}(S^{d-1})$, as in the well-known formal Riemannian reinterpretation of the Wasserstein space using Otto calculus [Ott01], [CNWR24, Chapter 5], we consider

$T_\mu\mathcal{P}_{\mathrm{ac}}(S^{d-1}) = \overline{\{\nabla\psi : \psi \in C^\infty(S^{d-1})\}}^{L^2(\mu)},$
which, rather than endowing with the standard formal metric tensor given by the $\dot H^1$-inner product

$\langle\nabla\psi_1, \nabla\psi_2\rangle_\mu := \displaystyle\int\langle\nabla\psi_1(x), \nabla\psi_2(x)\rangle\, d\mu(x),$

we endow with the weighted inner product

$\langle\nabla\psi_1, \nabla\psi_2\rangle_{\mu, E_\beta} := \displaystyle\int\langle\nabla\psi_1(x), \nabla\psi_2(x)\rangle\, \delta E_\beta[\mu](x)\, d\mu(x).$
Continuing the Riemannian geometry analogy, through this metric tensor we can define a distance between $\mu_0, \mu_1 \in \mathcal{P}_{\mathrm{ac}}(S^{d-1})$ by solving the variational problem

$\inf_{(\mu(t), v(t))_{t\in[0,1]}}\left\{\displaystyle\int_0^1 \|v(t)\|^2_{\mu(t), E_\beta}\, dt : (\mu, v) \text{ satisfy (3.16)},\ \mu(0) = \mu_0,\ \mu(1) = \mu_1\right\},$

where

(3.16)  $\partial_t\mu(t,x) + \operatorname{div}(v(t,x)\,\mu(t,x)) = 0 \quad\text{on } [0,1]\times S^{d-1}.$

This variational problem is a generalization of the celebrated Benamou–Brenier formula [BB00], the value of which we dub $W^2_{2,E_\beta}(\mu_0, \mu_1)$: it is a weighted Wasserstein distance. For a curve of measures $(\mu(t))_{t\ge0}$ with tangent vectors $(v(t))_{t\ge0}$ (meaning they solve (3.16)), the Wasserstein gradient $\nabla^{E_\beta}$ induced by this geometric setup is then the element of $T_{\mu(t)}\mathcal{P}_{\mathrm{ac}}(S^{d-1})$ such that

$\partial_t E_\beta[\mu(t)] = \langle\nabla^{E_\beta}E_\beta[\mu(t)], v(t)\rangle_{\mu(t), E_\beta}.$

We can now demonstrate that the vector field driving (3.15) is the Wasserstein gradient of $E_\beta$ corresponding to this geometric setup. Indeed, as in [CNWR24, Definition 5.9], we first have

$\partial_t E_\beta[\mu(t)] = \displaystyle\int \delta E_\beta[\mu(t)]\, d\partial_t\mu(t).$

We then find

$\displaystyle\int\delta E_\beta[\mu(t)]\, d\partial_t\mu(t) = \displaystyle\int\langle\nabla\delta E_\beta[\mu(t)], v(t)\rangle\, d\mu(t) = \left\langle\dfrac{\nabla\delta E_\beta[\mu(t)]}{\delta E_\beta[\mu(t)]}, v(t)\right\rangle_{\mu(t), E_\beta},$

as desired. The literature studying weighted Wasserstein distances such as the one above is rather scarce, but a relevant reference is [Li21].
Part 2. Clustering
As alluded to in the introductory discussion, clustering is of particular relevance
in tasks such as sentiment analysis, masked language modeling, summarization,
and so on. Therein, the output measure encodes the probability distribution of the
missing tokens for instance, and its clustering indicates a small number of possible
outcomes. In Sections 4, 5, and 6, we show several results (mostly summarized in
Figure 2) which indicate that the limiting distribution is a point mass. While it
may appear that this leaves no room for diversity or randomness, which is at odds
with practical observations, these results hold for the specific choice of parameter
matrices, and apply over possibly very long time horizons. Numerical experiments indicate a more subtle picture for different parameters: for instance, there is the appearance of a long metastable phase during which the particles coalesce into a small number of clusters, which seems consistent with behavior in trained models (Figure 1 in Section 6.3). We are not able to theoretically explain this behavior as of now.

[Figure 2: phase diagram in the $(d, \beta)$ plane (horizontal axis $d$ with marks $2, 3, n, d^\star$; vertical axis $\beta$ with marks $1$ and $n^2/\pi^2$), indicating: Theorem 4.2 (arbitrary $Q, K$); T. 4.3; T. 5.1 (no result when $d = 2$); Theorem 6.1: clustering as $t \to +\infty$; Theorem 6.9: exponential rate; with a fast-to-slow gradation of the convergence.]
Ultimately, the appearance of clusters is somewhat natural, since the Transformer dynamics compute a weighted average of all particles, with the weights hard-wired to perform a fast selection of the particles most similar to the queried $i$-th particle. This causes the emergence of leaders which attract all particles in their vicinity. In the natural language processing interpretation, where particles represent tokens, this further elucidates the wording attention as the mechanism of inter-token attraction, and the amplitude of the inner product between tokens can be seen as a measure of their semantic similarity.
for $\beta \ll 1$. So, during a time $\ll \beta^{-1}$, the particles do not feel the influence of the remainder $O(\beta)$ and behave as in the regime $\beta = 0$. This motivates
Theorem 4.2. Fix $d, n \ge 2$. For $\beta \ge 0$, let $S_\beta \subset (S^{d-1})^n$ be the subset consisting of all initial sequences for which the associated solution to the Cauchy problem for (SA) (or (USA)) converges to one cluster as $t \to +\infty$. Then

$\lim_{\beta\to0} \mathbb{P}(S_\beta) = 1.$

More generally, if $Q$ and $K$ are arbitrary $d\times d$ matrices, then the same result also holds for the Cauchy problem for (2.3) with $V = I_d$ (or the natural analogue of (USA) with these parameters).
Proof. We focus on the dynamics (SA), but the proof is in fact identical in the case of (USA).

For $\alpha \in [0,1)$, we say that a set formed from $n$ points $z_1, \ldots, z_n \in S^{d-1}$ is $\alpha$-clustered if $\langle z_i, z_j\rangle > \alpha$ holds for any $i, j \in [n]$. Observe that if $\{z_1, \ldots, z_n\}$ is $\alpha$-clustered for some $\alpha \ge 0$, then the solution to the Cauchy problem for (SA) (for arbitrary $\beta \ge 0$) with this sequence as initial condition converges to a single cluster, since $w = z_1$ satisfies the assumption in Lemma 6.4.
Now, for any integer $m \ge 1$, we denote by $S_0^m \subset S_0$ the set of initial sequences $x_1(0), \ldots, x_n(0)$ in $(S^{d-1})^n$ for which the solution $(x_i^0(\cdot))_{i\in[n]}$ to the associated Cauchy problem for (4.1) is $\frac34$-clustered at time $t = m$, namely

(4.2)  $\langle x_i^0(m), x_j^0(m)\rangle > \dfrac34$

holds for all $i, j \in [n]$. We see that $S_0^m$ is an open set for any integer $m \ge 1$. Moreover, $S_0^m \subset S_0^{m+1}$ according to the proof of Lemma 6.4, and $\bigcup_{m=1}^{+\infty} S_0^m = S_0$. This implies that

(4.3)  $\lim_{m\to+\infty}\mathbb{P}(S_0^m) = 1.$
We now show that the solution to (SA) is near that of (4.1), starting from the same initial condition, when $\beta$ is small. Using the Duhamel formula, we find

$x_i^\beta(t) - x_i^0(t) = \displaystyle\int_0^t\sum_{j=1}^n\dfrac{e^{\beta\langle Qx_i^\beta(s),\, Kx_j^\beta(s)\rangle}}{\sum_{k=1}^n e^{\beta\langle Qx_i^\beta(s),\, Kx_k^\beta(s)\rangle}}\, P^\perp_{x_i^\beta(s)}\big(x_j^\beta(s)\big)\, ds - \displaystyle\int_0^t\dfrac1n\sum_{j=1}^n P^\perp_{x_i^0(s)}\big(x_j^0(s)\big)\, ds$
$\qquad = \displaystyle\int_0^t\sum_{j=1}^n\left(\dfrac1n + O\left(\dfrac\beta n\right)\right)P^\perp_{x_i^\beta(s)}\big(x_j^\beta(s)\big)\, ds - \displaystyle\int_0^t\dfrac1n\sum_{j=1}^n P^\perp_{x_i^0(s)}\big(x_j^0(s)\big)\, ds,$

where we used that all particles lie on $S^{d-1}$ for all times. Employing Grönwall, we deduce

(4.4)  $\left\|x_i^\beta(t) - x_i^0(t)\right\| \le O(\beta)\, e^{3t}$
for all $t \ge 0$, $\beta \ge 0$ and $i \in [n]$. Due to (4.4), there exists some $\beta_m > 0$ such that for any $\beta \in [0, \beta_m]$,

(4.5)  $\left\|x_i^\beta(m) - x_i^0(m)\right\| \le \dfrac18.$

For this to hold, we clearly need $\beta_m \to 0$ as $m \to +\infty$. Combining (4.2) and (4.5), we gather that for any initial condition in $S_0^m$, the solution $(x_i^\beta(\cdot))_{i\in[n]}$ to the corresponding Cauchy problem for (SA) is $\frac12$-clustered at time $t = m$, namely satisfies

$\langle x_i^\beta(m), x_j^\beta(m)\rangle > \dfrac12$

for all $i, j \in [n]$ and $\beta \in [0, \beta_m]$. Thus $S_0^m \subset S_\beta$ for any $\beta \in [0, \beta_m]$ by virtue of Lemma 6.4, which together with (4.3) concludes the proof. □
In the specific case where $Q^\top K = I_d$, we can in fact significantly sharpen Theorem 4.2 by relying on the gradient flow structure evoked in Section 3.4. Namely, we can show the following.

Theorem 4.3. Fix $d, n \ge 2$. There exists a numerical constant $C > 0$ such that whenever

$\beta \le C n^{-1},$

the following holds. For Lebesgue-almost any $(x_i(0))_{i\in[n]} \in (S^{d-1})^n$, there exists $x^\star \in S^{d-1}$ such that the solution $(x_i(\cdot))_{i\in[n]} \in C^0(\mathbb{R}_{\ge0}; (S^{d-1})^n)$ to the corresponding Cauchy problem for (SA) (resp. for (USA)) satisfies

$\lim_{t\to+\infty} x_i(t) = x^\star.$
6.1. Clustering at an exponential rate when $d \ge n$. One can ask whether, for almost every initial configuration, the convergence provided by all of the results above holds with some rate. The answer is affirmative, and the rate is in fact exponential, when the initial configuration lies in an open hemisphere.

Theorem 6.3. Let $n \ge 1$ and $\beta > 0$. Suppose $d \ge n$. Consider the unique solution $(x_i(\cdot))_{i\in[n]} \in C^0(\mathbb{R}_{\ge0}; (S^{d-1})^n)$ to the Cauchy problem for (SA) or (USA), corresponding to an initial sequence of points $(x_i(0))_{i\in[n]} \in (S^{d-1})^n$ distributed uniformly at random. Then almost surely there exist $x^\star \in S^{d-1}$ and constants $C, \lambda > 0$ such that

(6.1)  $\|x_i(t) - x^\star\| \le C e^{-\lambda t}$

holds for all $i \in [n]$ and $t \ge 0$.

In fact, let $Q$ and $K$ be arbitrary $d\times d$ matrices. Then the same result also holds for the solution to the corresponding Cauchy problem for (2.3) with $V = I_d$ (or the natural analogue of (USA) with these parameters).

When $d \ge n$ and the points $(x_i(0))_{i\in[n]} \in (S^{d-1})^n$ are distributed uniformly at random, with probability one there exists¹⁰ $w \in S^{d-1}$ such that $\langle w, x_i(0)\rangle > 0$ for any $i \in [n]$. In other words, all of the initial points lie in an open hemisphere almost surely. The proof of Theorem 6.3 thus follows as a direct corollary of the following result, which holds for any $n \ge 1$ and $d \ge 2$:
Lemma 6.4 (Cone collapse). Let $\beta > 0$ and let $(x_i(0))_{i\in[n]} \in (S^{d-1})^n$ be such that there exists $w \in S^{d-1}$ for which $\langle x_i(0), w\rangle > 0$ for any $i \in [n]$. Consider the unique solution $(x_i(\cdot))_{i\in[n]} \in C^0(\mathbb{R}_{\ge0}; (S^{d-1})^n)$ to the corresponding Cauchy problem for (SA) or (USA). Then there exist $x^\star \in S^{d-1}$ and constants $C, \lambda > 0$ such that

$\|x_i(t) - x^\star\| \le C e^{-\lambda t}$

holds for all $i \in [n]$ and $t \ge 0$.

In fact, let $Q$ and $K$ be arbitrary $d\times d$ matrices. Then the same result also holds for the solution to the corresponding Cauchy problem for (2.3) with $V = I_d$ (or the natural analogue of (USA) with these parameters).

Remark 6.5. Lemma 6.4 implies that $\{(\bar x_i)_{i\in[n]} \in (S^{d-1})^n : \bar x_1 = \ldots = \bar x_n\}$ is Lyapunov asymptotically stable as a set. In fact, it is exponentially stable.

¹⁰This weak version of Wendel's theorem (Theorem 6.7) is easy to see directly.
Fix $t_0 \ge 0$. We have

$\left.\dfrac{d}{dt}\langle x_{i(t_0)}(\cdot), w\rangle\right|_{t=t_0} = \displaystyle\sum_{j=1}^n a_{i(t_0)j}(t_0)\Big(\langle x_j(t_0), w\rangle - \langle x_{i(t_0)}(t_0), x_j(t_0)\rangle\langle x_{i(t_0)}(t_0), w\rangle\Big) \ge 0.$
This implies that all points remain within the same open hemisphere at all times, and that the map

$t \mapsto r(t) := \min_{i\in[n]}\langle x_i(t), w\rangle$

is non-decreasing.

and by using the equation (USA) we also find that $|\langle\ddot x_i(t), w\rangle| = O(e^{2\beta})$ for any $t \ge 0$. Therefore by Lemma 6.6, the left-hand side of (6.2) is equal to $0$, and consequently the right-hand side term as well. This implies that $x_1 = \ldots = x_n =: x^\star$. Repeating the argument with $w$ replaced by $x^\star$, we see that the extraction of a sequence $\{t_k\}_{k=1}^{+\infty}$ as above is not necessary, and therefore
From (6.3) we gather that there exists some $t_0 > 0$ such that $\alpha(t) \ge \frac12$ for all $t \ge t_0$. Also, in view of what precedes, we know that $x^\star$ lies in the convex cone generated by the points $x_1(t), \ldots, x_n(t)$ for any $t > 0$. Thus, there exists some $\eta \in (0,1]$ such that $\eta x^\star$ is a convex combination of the points $x_1(t), \ldots, x_n(t)$, which implies that

(6.4)  $x^\star = \displaystyle\sum_{k=1}^n\theta_k(t)\, x_k(t), \quad\text{for some } \displaystyle\sum_{k=1}^n\theta_k(t) \ge 1,\ \ \theta_k(t) \ge 0\ \ \forall k\in[n].$

For any $t$, we denote by $i(t)$ an element of $\arg\min_i\langle x_i(t), x^\star\rangle$ for which $\langle\dot x_i(t), x^\star\rangle$ is smallest. It follows from a Taylor expansion of $\langle x_i(t+h), x^\star\rangle$ for $h > 0$ and $i \in [n]$ that

$\dot\alpha(t) = \langle\dot x_{i(t)}(t), x^\star\rangle.$

Therefore

(6.5)  $\dot\alpha(t) = \langle\dot x_{i(t)}(t), x^\star\rangle \ge \displaystyle\sum_{j=1}^n a_{i(t)j}(t)\big(1 - \langle x_{i(t)}(t), x_j(t)\rangle\big)\,\alpha(t).$

On the other hand,

(6.6)  $\min_{j\in[n]}\langle x_{i(t)}(t), x_j(t)\rangle \le \displaystyle\sum_{k=1}^n\theta_k(t)\langle x_{i(t)}(t), x_k(t)\rangle = \langle x_{i(t)}(t), x^\star\rangle = \alpha(t).$

Plugging (6.6) into (6.5) and using $a_{ij}(t) \ge n^{-1}e^{-2\beta}$, we get

(6.7)  $\dot\alpha(t) \ge \dfrac{1}{2ne^{2\beta}}\big(1 - \alpha(t)\big)$

for $t \ge t_0$. Applying the Grönwall inequality, we get

(6.8)  $1 - \alpha(t) \le \dfrac12\, e^{-\frac{1}{2ne^{2\beta}}(t - t_0)}$

for all $t \ge t_0$. The conclusion follows. □
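The exponential contraction in Lemma 6.4 is easy to observe numerically. In the sketch below (an illustration with ad hoc constants), the initial points are forced into the hemisphere $\{\langle x, e_1\rangle > 0\}$ and the residual $1 - \min_{i,j}\langle x_i(t), x_j(t)\rangle$ decays roughly geometrically.

```python
import numpy as np

def sa_step(X, beta, dt):
    logits = beta * X @ X.T
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    F = A @ X
    F -= np.sum(F * X, axis=1, keepdims=True) * X
    X = X + dt * F
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(3)
n, d, beta, dt = 20, 5, 1.0, 0.05
X = rng.standard_normal((n, d))
X[:, 0] = np.abs(X[:, 0])                  # <x_i(0), e_1> > 0: open hemisphere
X /= np.linalg.norm(X, axis=1, keepdims=True)
res = []
for _ in range(400):
    X = sa_step(X, beta, dt)
    res.append(1.0 - (X @ X.T).min())
print(res[49], res[199], res[399])         # successive values shrink geometrically
```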
In the case $d < n$, we can still apply Wendel's theorem (recalled below) together with Lemma 6.4 to obtain clustering to a single point with probability at least $p_{n,d}$ for some explicit $p_{n,d} \in (0,1)$.
holds for any $i \ne j$ and $t \ge 0$, where $c(\beta) = e^{10\max\{1,\beta\}}$, and $\gamma_\beta$ is the unique solution to (6.9).

Since the proof is rather lengthy, we defer it to Appendix D. It relies on combining the stability of the flow with respect to the initial data (entailed by the Lipschitz nature of the vector field) with concentration of measure. An analogous statement also holds for (USA), and more details can be found in Remark D.1, whereas the explicit values of $C$ and $\lambda$ can be found in (D.15). The upper bound in (6.11) is of interest in regimes where $d$ and/or $t$ are sufficiently large, as the error in (6.11) is trivially bounded by $2$.
Proof of Theorem 6.8. We split the proof into two parts. We focus on proving the result for the dynamics (SA), since the same arguments readily apply to the dynamics (USA).

Part 1. The angle $\theta_\beta(t)$. We first show that there exists $\theta \in C^0(\mathbb{R}_{\ge0}; \mathbb{T})$ such that $\theta(t) = \angle(x_i(t), x_j(t))$ for any distinct $(i,j) \in [n]^2$ and $t \ge 0$. Since the initial tokens are orthogonal (and thus $d \ge n$), we may consider an orthonormal basis $(e_1, \ldots, e_d)$ of $\mathbb{R}^d$ such that $x_i(0) = e_i$ for $i \in [n]$. Let $\pi : [d] \to [d]$ be a permutation which maps $[n]$ onto $[n]$. By decomposing any $x \in S^{d-1}$ in this basis, we define $P_\pi : S^{d-1}\to S^{d-1}$ as

$P_\pi\left(\displaystyle\sum_{i=1}^d a_i e_i\right) = \displaystyle\sum_{i=1}^d a_i e_{\pi(i)}.$

Setting $y_i(t) = P_\pi(x_i(t))$ for $i \in [n]$, we see that $y_i(t)$ solves (SA) with initial condition $y_i(0) = P_\pi(x_i(0))$. But $(x_{\pi(1)}(t), \ldots, x_{\pi(n)}(t))$ is a solution of (SA) by permutation equivariance, and it has the same initial condition since $P_\pi(x_i(0)) = x_{\pi(i)}(0)$. Consequently, we deduce that $P_\pi(x_i(t)) = x_{\pi(i)}(t)$ for any $t \ge 0$ and any $i \in [n]$. Hence

$\langle x_i(t), x_j(t)\rangle = \langle P_\pi(x_i(t)), P_\pi(x_j(t))\rangle = \langle x_{\pi(i)}(t), x_{\pi(j)}(t)\rangle,$

which concludes the proof.
Part 2. The curve $\gamma_\beta(t)$. By virtue of the orthogonality assumption we have $\gamma_\beta(0) = \cos(\theta_\beta(0)) = 0$. To prove that $\gamma_\beta(t)$ satisfies (6.9) in the case of (SA), recall that

$P^\perp_{x_i(t)}(x_j(t)) = x_j(t) - \langle x_i(t), x_j(t)\rangle\, x_i(t).$

Then for $k \ne i$,

$\dot\gamma_\beta(t) = 2\langle\dot x_i(t), x_k(t)\rangle = 2\displaystyle\sum_{j=1}^n\left(\dfrac{e^{\beta\langle x_i(t),\, x_j(t)\rangle}}{\sum_{\ell=1}^n e^{\beta\langle x_i(t),\, x_\ell(t)\rangle}}\right)\big(\langle x_j(t), x_k(t)\rangle - \langle x_i(t), x_j(t)\rangle\langle x_i(t), x_k(t)\rangle\big).$
where $(x_i(\cdot))_{i\in[n]}$ denotes the solution to the corresponding Cauchy problem for (SA). (Here the choice of the first two particles instead of a random distinct pair is justified by permutation equivariance.) Theorem 6.9 then gives the intuition that over compact subsets of $(\mathbb{R}_{\ge0})^2$, $\Gamma_{d,\delta}$ should be well approximated by

(6.12)  $\Gamma_{\infty,\delta} = \big\{t, \beta \ge 0 : \gamma_\beta(t) = 1 - \delta\big\}.$
[Figure 3: six heat-map panels over $t \in [0, 30]$ (horizontal axis) with vertical scale $1$–$6$ and color scale $0.0$–$0.6$.]
This is clearly seen in Figure 3, along with the fact that the resolution of this approximation increases as $d \to +\infty$.
Figure 3 appears to contain more information than what we may gather from Theorem 6.3, Theorem 6.8 and Theorem 6.9. In particular, for small $d$, we see the appearance of a zone (white/light blue in Figure 3) of parameters $(t, \beta)$ for which the probability of particles being clustered is positive, but not close to one. A careful inspection of this region reveals that points are grouped in a finite number of clusters; see Figure 4. The presence of such a zone indicates the emergence of a long-time metastable state where points are clustered into several groups but eventually relax to a single cluster in the long-time limit. This two-time-scale phenomenon is illustrated in Figure 4 and prompts us to formulate the following question.

Problem 1. Do the dynamics enter a transient metastable state, in the sense that for $\beta \gg 1$, all particles stay in the vicinity of $m < n$ clusters for long periods of time, before they all collapse to the final cluster $\{x^\star\}$?
There have been important steps towards a systematic theory of metastability for gradient flows, with applications to nonlinear parabolic equations, typically reaction–diffusion equations such as the Allen–Cahn or Cahn–Hilliard equations [OR07, KO02]. While these tools do not readily apply to the current setup, they form an important starting point to answer this question.
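A crude experiment in the direction of Problem 1 (a sketch only; the clustering threshold and all constants are ad hoc): for large $\beta$, a greedy cluster count typically plateaus at some $m < n$ for a long stretch of time before eventually dropping.

```python
import numpy as np

def sa_step(X, beta, dt):
    logits = beta * X @ X.T
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    F = A @ X
    F -= np.sum(F * X, axis=1, keepdims=True) * X
    X = X + dt * F
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def n_clusters(X, tol=1e-2):
    # crude count: i and j grouped together whenever <x_i, x_j> > 1 - tol
    G = X @ X.T > 1 - tol
    left, k = set(range(len(X))), 0
    while left:
        i = left.pop(); k += 1
        left -= set(np.flatnonzero(G[i]))
    return k

rng = np.random.default_rng(4)
X = rng.standard_normal((64, 3)); X /= np.linalg.norm(X, axis=1, keepdims=True)
for step in range(40001):
    if step % 4000 == 0:
        print(step, n_clusters(X))   # plateaus at m < n before merging continues
    X = sa_step(X, beta=9.0, dt=0.05)
```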
[Figure 4: particle snapshots at $t = 0.0$, $t = 18.0$, and $t = 30.0$ (two rows), together with a heat-map panel over $t \in [0, 30]$ with vertical scale $1$–$9$ and color scale $0.0$–$1.0$.]
Finally, one may naturally ask whether the clustering and phase-diagram conclusions persist when the parameter matrices $(Q, K, V)$ are significantly more general: some illustrations¹¹ are given in Figure 5.

Problem 2. Can the conclusions of Theorem 6.8–Theorem 6.9 be generalized to the case of random matrices $(Q, K, V)$?

¹¹See github.com/borjanG/2023-transformers-rotf for additional figures, which indicate that this phenomenon appears to hold in even more generality.
[Figure 5: four heat-map panels over $t \in [0, 30]$ with vertical scale $1$–$9$ and color scale $0.0$–$1.0$: (a) $Q, K, V$ in the real Ginibre ensemble; (b) $Q^\top K$ Wigner, $V \succeq 0$ GOE; (c) $Q, K$ in the real Ginibre ensemble, $V = Q^\top K$; (d) $Q^\top K$ Wigner, $V = Q^\top K$.]
7.1. Angular equations. On the circle $S^1$, all particles $x_i(t) \in S^1$ are of course completely characterized by the angle $\theta_i(t) \in \mathbb{T}$: $x_i(t) = \cos(\theta_i(t))e_1 + \sin(\theta_i(t))e_2$, where $e_1 = (1,0)$ and $e_2 = (0,1) \in \mathbb{R}^2$. We focus on the dynamics (USA) for simplicity. For any $i \in [n]$ and $t \ge 0$, we may derive the equation satisfied by $\theta_i(t)$ from $\cos(\theta_i(t)) = \langle x_i(t), e_1\rangle$: differentiating in $t$ and plugging into (USA), we obtain

$\dot\theta_i(t) = -\dfrac{n^{-1}}{\sin(\theta_i(t))}\displaystyle\sum_{j=1}^n e^{\beta\langle x_i(t),\, x_j(t)\rangle}\Big[\langle x_j(t), e_1\rangle - \langle x_i(t), x_j(t)\rangle\langle x_i(t), e_1\rangle\Big],$

where we used the definition of the projection (if $\theta_i(t) = 0$ for some $t$, we differentiate the equality $\sin(\theta_i(t)) = \langle x_i(t), e_2\rangle$ instead, which also leads to (7.1) in the end). Observing that $\langle x_i(t), x_j(t)\rangle = \cos(\theta_i(t) - \theta_j(t))$, we find

$\dot\theta_i(t) = -\dfrac{n^{-1}}{\sin(\theta_i(t))}\displaystyle\sum_{j=1}^n e^{\beta\cos(\theta_i(t) - \theta_j(t))}\Big[\cos(\theta_j(t)) - \cos(\theta_i(t) - \theta_j(t))\cos(\theta_i(t))\Big],$

that is, since $\cos(\theta_j) - \cos(\theta_i - \theta_j)\cos(\theta_i) = \sin(\theta_i)\sin(\theta_i - \theta_j)$,

(7.1)  $\dot\theta_i(t) = -\dfrac1n\displaystyle\sum_{j=1}^n e^{\beta\cos(\theta_i(t) - \theta_j(t))}\sin(\theta_i(t) - \theta_j(t)).$
The case $\beta = 0$ is exactly the Kuramoto model recalled in Section 7.2. Suppose for the time being that $\beta > 0$. Defining the function $h_\beta : \mathbb{T} \to \mathbb{R}_{\ge0}$ as

$h_\beta(\theta) = e^{\beta\cos(\theta)},$

we have effectively deduced that the empirical measure of the angles, $\nu(t) = \frac1n\sum_{j=1}^n\delta_{\theta_j(t)}$, which is a measure on the torus $\mathbb{T}$, is a solution to the continuity equation

$\partial_t\nu(t) + \partial_\theta\big(\mathcal{X}[\nu(t)]\,\nu(t)\big) = 0, \quad\text{on } \mathbb{R}_{\ge0}\times\mathbb{T},$

where

$\mathcal{X}[\nu](\theta) = \dfrac1\beta\big(h_\beta' * \nu\big)(\theta).$

When the particles $x_i(t)$ follow (SA), one readily checks that the same continuity equation is satisfied but rather with the field

$\mathcal{X}[\nu](\theta) = \dfrac1\beta\left(\dfrac{h_\beta' * \nu}{h_\beta * \nu}\right)(\theta).$
The oscillators can be viewed as attempting to maximize this energy. The energy $F$ is maximized when all the oscillators are synchronized, that is, $\theta_i = \theta^\star$ for some $\theta^\star \in \mathbb{T}$ and for all $i \in [n]$. As the dynamics follow a gradient system, the equilibrium states are the critical points of the energy, namely those satisfying $\nabla F(\theta) = 0$. The local maxima of $F$ correspond to equilibrium states $\theta$ that are physically achievable, since small perturbations thereof return the system back to $\theta$.
Some authors consider a variant of the Kuramoto model where the oscillators interact according to the edges of a graph. In other words, the coefficients $A_{ij}$ of the graph's adjacency matrix are inserted as weights in the sum in (7.3), and the dynamics are then the corresponding gradient flow. A recent line of work culminating with [ABK+22] has established that synchronization occurs with high probability for Erdős–Rényi graphs with parameter $p$, for every $p$ right above the connectivity threshold.
Coming back to our dynamics (7.1), we notice that it can also be written as a gradient flow on $\mathbb{T}^n$:

$\dot\theta(t) = n\nabla E_\beta(\theta(t)),$

for the interaction energy $E_\beta : \mathbb{T}^n \to \mathbb{R}_{\ge0}$ defined as

(7.4)  $E_\beta(\theta) = \dfrac{1}{2\beta n^2}\displaystyle\sum_{i=1}^n\sum_{j=1}^n e^{\beta\cos(\theta_i - \theta_j)},$

which is maximized when $\theta_i = \theta^\star$ for some $\theta^\star \in \mathbb{T}$ and for all $i \in [n]$. In the spirit of [LXB19], we suggest the following open problem; we recall that a critical point is called a strict saddle point of $E_\beta$ if the Hessian of $E_\beta$ at this point has at least one positive eigenvalue.
Problem 3. With the exception of the global maxima, are all critical points of $E_\beta$ strict saddle points?

The proofs of Theorems 4.3 and 5.1 already yield a positive answer to Problem 3 in the regimes $\beta \le 1$ or $\beta \ge \frac{n^2}{\pi^2}$. The complementary regime remains open. By classical arguments, recalled in Appendix A, a positive answer to Problem 3 would imply that, for all initial conditions except a set of measure zero, all $\theta_i(t)$ converge under the dynamics (7.1) to a common limit as $t \to +\infty$.
Extensions of the Kuramoto model of the form

(7.5)  $\dot\theta_i(t) = \omega_i + \dfrac Kn\displaystyle\sum_{j=1}^n h\big(\theta_j(t) - \theta_i(t)\big),$

for a general nonlinearity $h : \mathbb{T} \to \mathbb{R}$, which contains both (7.2) and our model (7.1) as particular cases, have already been studied in the physics literature. For instance, we refer the reader to [Dai92] (see also [ABV+05, page 158]), where many heuristics are proposed to address the behavior of solutions to these dynamics. We are not aware of mathematical results for (7.1) besides Theorem 5.1. We nevertheless have some hope that handling the dynamics (7.1) is easier than dealing with (7.5) for a general $h$; for instance, we have

$h_\beta(\theta) = e^{\beta\cos(\theta)} = \displaystyle\sum_{k\in\mathbb{Z}} I_k(\beta)\, e^{ik\theta},$

where the $I_k(\beta)$ are the modified Bessel functions of the first kind, whose properties have been extensively studied.
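This expansion is easy to verify numerically (a quick check; the truncation at $|k| \le 40$ is an arbitrary choice):

```python
import numpy as np
from scipy.special import iv   # modified Bessel functions of the first kind

beta = 3.0
theta = np.linspace(-np.pi, np.pi, 7)
ks = np.arange(-40, 41)
series = np.sum(iv(np.abs(ks)[:, None], beta) * np.exp(1j * ks[:, None] * theta),
                axis=0)
print(np.max(np.abs(series.real - np.exp(beta * np.cos(theta)))))  # ~ 1e-15
```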
8. BBGKY hierarchy

For the sake of simplicity, we again focus on the dynamics on the circle $S^1$, where we recall that all particles are parametrized by angles (which we also refer to as particles). To carve out an even more complete understanding of the clustering phenomenon, it is natural to consider initial particles sampled i.i.d. from the uniform distribution on $S^1$ and to study the time-evolution of the $r$-particle distribution $\rho_n^{(r)}(t, \theta_1, \ldots, \theta_r)$, defined as the joint law of the particles $\theta_1(t), \ldots, \theta_r(t)$. Otherwise put, it is the $r$-point marginal of the joint distribution $\rho^{(n)}(t, \cdot) \in \mathcal{P}(\mathbb{T}^n)$ of all $n$ particles. Note that because of rotational invariance, $\rho^{(1)}(t, \cdot)$ is just the uniform distribution, equal to $\frac{1}{2\pi}$ for all $t \ge 0$. For $r = 2$, again by rotational invariance, there exists some $\psi(t, \cdot) : \mathbb{T} \to \mathbb{R}_{\ge0}$ such that

$\rho^{(2)}(t, \theta_1, \theta_2) = \dfrac{1}{2\pi}\,\psi(t, \theta_2 - \theta_1).$

Proving the clustering/synchronization of all $\theta_i(t)$ in long time amounts to proving that $\psi(t, \cdot)$ converges to a Dirac mass centered at $0$ as $t \to +\infty$. Using the fact that $\rho^{(n)}(t, \cdot)$ solves the Liouville equation, by following the method used to derive the BBGKY¹² hierarchy [GSRT13, Gol16], it is possible to show that $\psi(t, \cdot)$ satisfies

(8.1)  $\begin{cases}\partial_t\psi(t,x) + \partial_x\big(v(t,x)\,\psi(t,x)\big) = 0 & \text{in } \mathbb{R}_{\ge0}\times\mathbb{T}\\ \psi(0,x) = (2\pi)^{-1} & \text{in } \mathbb{T},\end{cases}$

¹²Bogoliubov–Born–Green–Kirkwood–Yvon.
where

$v(t,x) = \dfrac{2}{\beta n}\,h_\beta'(x) - \dfrac{2(n-2)}{\beta n}\,g(t,x),$

and

$g(t,x) = \mathbb{E}\Big[-h_\beta'(\theta_3(t))\ \Big|\ \theta_1(t) = 0,\ \theta_2(t) = x\Big].$

Note that the equation (8.1) is not closed, since $g(t,x)$ depends on the 3-point correlation function. This is typical in the BBGKY hierarchy, whereupon physical theory and experimental evidence are typically used to devise an ansatz for closing the system. For instance, the Boltzmann equation is derived from the BBGKY hierarchy by assuming the molecular chaos hypothesis (Stosszahlansatz) at the level of $r = 2$. We suggest closing (8.1) in a way that reflects the formation of clusters:

Problem 4. Devise a realistic ansatz for $g(t,x)$ which allows one to close equation (8.1), and to prove the convergence of $\psi(t, \cdot)$ to a Dirac mass centered at $0$ as $t \to +\infty$.
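Short of an ansatz, $\psi(t, \cdot)$ can at least be estimated by Monte Carlo (a sketch with illustrative run parameters): simulate many independent copies of the angular dynamics from i.i.d. uniform data and histogram $\theta_2(t) - \theta_1(t)$.

```python
import numpy as np

def theta_step(theta, beta, dt):
    diff = theta[:, :, None] - theta[:, None, :]   # batched theta_i - theta_j
    drift = -np.mean(np.exp(beta * np.cos(diff)) * np.sin(diff), axis=2)
    return theta + dt * drift

rng = np.random.default_rng(6)
runs, n, beta, dt = 2000, 8, 2.0, 0.02
theta = rng.uniform(0, 2 * np.pi, size=(runs, n))  # i.i.d. uniform initial data
for _ in range(500):
    theta = theta_step(theta, beta, dt)
gap = np.angle(np.exp(1j * (theta[:, 1] - theta[:, 0])))  # value in (-pi, pi]
hist, _ = np.histogram(gap, bins=24, range=(-np.pi, np.pi), density=True)
print(hist)   # mass concentrates near 0, consistent with clustering
```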
The derivation of a BBGKY hierarchy when $d > 2$, and likewise for (SA), is another problem which we believe merits further investigation.
9. General matrices

Figure 5 hints at the likelihood that the clustering phenomenon is significantly more general than just the case $Q = K = V = I_d$. However, extending our proofs to more general parameter matrices does not appear to be straightforward and is an open problem. Here we discuss some particular cases (without excluding other approaches).

9.1. The repulsive case. As seen from Lemma 3.7, in the repulsive case $V = -I_d$ the interaction energy $E_\beta$ decreases along trajectories. Recall that the unique global minimizer of $E_\beta$ over $\mathcal{P}(S^{d-1})$ is the uniform distribution (Proposition 3.4). In contrast, we explain in this section that many different configurations of $n$ points may yield global minima for $E_\beta$ when it is minimized over empirical measures with $n$ atoms.
We thus focus on minimizing $E_\beta$ over the set $\mathcal{P}_n(S^{d-1})$ of empirical measures, namely sums of $n$ Dirac masses. Rewriting $E_\beta$ as

$E_\beta[\mu] = \dfrac{e^\beta}{2\beta}\displaystyle\iint e^{-\frac\beta2\|x - x'\|^2}\, d\mu(x)\, d\mu(x'),$

it turns out that minimizing $E_\beta$ over $\mathcal{P}_n(S^{d-1})$ is precisely the problem of finding optimal configurations of points on $S^{d-1}$, which has direct links to the sphere packing problem [CK07, CKM+22] and coding theory [DGS91]. For $\mu \in \mathcal{P}_n(S^{d-1})$, we can equivalently rewrite $E_\beta$ in terms of the set of support points $\mathcal{C} \subset S^{d-1}$, $\#\mathcal{C} = n$:

$E_\beta[\mu] = H_\beta[\mathcal{C}] = \dfrac{e^\beta}{2n^2\beta}\displaystyle\sum_{x, x' \in \mathcal{C}} e^{-\frac\beta2\|x - x'\|^2}.$
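Minimizers of $H_\beta$ can be explored numerically by projected gradient descent on $(S^{d-1})^n$, using that $e^{-\frac\beta2\|x-x'\|^2} = e^{-\beta}e^{\beta\langle x, x'\rangle}$ on the sphere (a sketch; the step size and iteration count are ad hoc, and the output is only a candidate configuration):

```python
import numpy as np

def grad(X, beta):
    W = np.exp(beta * X @ X.T)     # e^{beta <x_i, x_j>}: same minimizers as H_beta
    np.fill_diagonal(W, 0.0)
    return W @ X                   # gradient of sum_{j != i} e^{beta <x_i, x_j>}

rng = np.random.default_rng(7)
n, d, beta, lr = 6, 3, 2.0, 0.05
X = rng.standard_normal((n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
for _ in range(5000):
    G = grad(X, beta)
    G -= np.sum(G * X, axis=1, keepdims=True) * X   # tangential part
    X -= lr * G                                     # descent: repulsive case
    X /= np.linalg.norm(X, axis=1, keepdims=True)
print(np.round(X @ X.T, 3))   # inspect the Gram matrix of the configuration
```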
for all $i \in [n]$ and $t \in \mathbb{R}_{\ge0}$. In fact, in [GLPR24] we analyze precisely these dynamics, and show different clustering results depending on the spectral properties of the matrix $V$. We briefly summarize our findings in what follows.
9.2.1. A review of [GLPR24]. For most choices of value matrices $V$, without rescaling time, most particles diverge to $\pm\infty$ and no particular pattern emerges. To make a very rough analogy, (9.1) "looks like" $\dot x_i(t) = V x_i(t)$ (which amounts to having $A_{ij}(t) = \delta_{ij}$ instead of (2.5)), whose solutions are given by $x_i(t) = e^{tV}x_i(0)$. To discern the formation of clusters, we introduce the rescaling

(9.2)  $z_i(t) = e^{-tV}x_i(t),$

which solves

(9.3)  $\dot z_i(t) = \dfrac{1}{Z_{\beta,i}(t)}\displaystyle\sum_{j=1}^n e^{\beta\langle Qe^{tV}z_i(t),\, Ke^{tV}z_j(t)\rangle}\, V\big(z_j(t) - z_i(t)\big),$

whereas the initial condition remains the same, namely $x_i(0) = z_i(0)$. It is crucial to notice that the coefficients $A_{ij}(t)$ (see (2.5)) of the self-attention matrix for the rescaled particles $z_i(t)$ are the same as those for the original particles $x_i(t)$. The weight $A_{ij}(t)$ indicates the strength of the attraction of $z_i(t)$ by $z_j(t)$. In [GLPR24] we show that the rescaled particles $z_i(t)$ cluster toward well-characterized geometric objects as $t \to +\infty$ for various choices of matrices $(Q, K, V)$. Our results are summarized in Table 1 below, whose first two lines are discussed thereafter.
where

(9.6)  $C_i(t) = \Big\{j \in [n] : \langle Qz_i(t), Kz_j(t)\rangle \ge \langle Qz_i(t), Kz_k(t)\rangle \text{ for all } k \in [n]\Big\}.$

However, defining a notion of solution to (9.5)–(9.6) is not straightforward, as illustrated by the following example.

Example 9.3. Suppose $d = 2$, $n = 3$. Let $Q = K = V = I_d$ and $z_1(0) = (1,1)$, $z_2(0) = (-1,1)$, $z_3(0) = (0,0)$. Consider the evolution of these particles through (9.5)–(9.6). The points $z_1(t)$ and $z_2(t)$ do not move, because it is easily seen that $C_i(t) = \{i\}$ for $i \in \{1, 2\}$. On the other hand, the point $z_3(t)$ can be chosen to solve any one of three equations: $\dot z_3(t) = z_1(t) - z_3(t)$, or $\dot z_3(t) = z_2(t) - z_3(t)$, or even $\dot z_3(t) = \frac12\big(z_1(t) + z_2(t)\big) - z_3(t)$. In any of these cases, both (9.5) and (9.6) remain satisfied for almost every $t \ge 0$.
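The tie in Example 9.3 can be made concrete by computing the argmax sets (9.6) at the initial configuration (a tiny illustration):

```python
import numpy as np

Z = np.array([[1.0, 1.0], [-1.0, 1.0], [0.0, 0.0]])  # z_1(0), z_2(0), z_3(0)
S = Z @ Z.T                                          # <Q z_i, K z_j> with Q = K = Id
for i in range(3):
    Ci = np.flatnonzero(np.isclose(S[i], S[i].max())) + 1
    print(f"C_{i + 1}(0) =", set(Ci.tolist()))
# C_1(0) = {1}, C_2(0) = {2}, but C_3(0) is not a singleton:
# the velocity of z_3 is ambiguous
```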
It is possible to prove the existence of solutions to (9.5)–(9.6) defined in the sense of Filippov¹⁵: for this, we can either use a time-discretization of (9.5)–(9.6), or use a convergence argument for solutions to (9.4) as $\beta \to +\infty$. Uniqueness, however, does not hold, as illustrated by Example 9.3. This naturally leads us to the following question:

Problem 6. Is it possible to establish a selection principle (similar to viscosity or entropy solutions) for solutions to (9.5)–(9.6) which allows one to restore uniqueness? In the affirmative, is it possible to revisit the clustering results of [GLPR24] and Problem 5 in the setting of (9.5)–(9.6)?
9.4. Diffusive regularization. We believe that (9.5)–(9.6) is also an original model for collective behavior. There are some similarities in spirit with methods arising in consensus-based optimization (CBO for short) [PTTM17, CJLZ21]. With CBO methods, one wishes to minimize a smooth and bounded, but otherwise arbitrary, function $f : \mathbb{R}^d \to \mathbb{R}$ by making use of the Laplace method

$\lim_{\beta\to+\infty}\left(-\dfrac1\beta\log\displaystyle\int_{\mathbb{R}^d} e^{-\beta f(x)}\, d\rho(x)\right) = \inf_{x\in\operatorname{supp}(\rho)} f(x),$

which holds for any fixed $\rho \in \mathcal{P}_{\mathrm{ac}}(\mathbb{R}^d)$. This is accomplished by considering a McKean–Vlasov particle system of the form

$dx_i(t) = -\lambda\big(x_i(t) - v[\mu_n(t)]\big)\, H^\epsilon\big(f(x_i(t)) - f(v[\mu_n(t)])\big)\, dt + \sqrt2\,\sigma\,\big|x_i(t) - v[\mu_n(t)]\big|\, dW_i(t)$

for fixed $\beta > 0$, with drift parameter $\lambda > 0$ and noise parameter $\sigma \ge 0$; $H^\epsilon$ is a particular smoothed Heaviside function, and $\mu_n(t)$ is the empirical measure of the particles. The point $v[\mu] \in \mathbb{R}^d$ is a weighted average of the particles:

$v[\mu] = \dfrac{1}{Z_{\beta,\mu}}\displaystyle\int_{\mathbb{R}^d} e^{-\beta f(x)}\, x\, d\mu(x),$

where $Z_{\beta,\mu} = \int_{\mathbb{R}^d} e^{-\beta f(x)}\, d\mu(x)$. Morally speaking, particles which are near a minimum of $f$ have a larger weight. The drift term is a gradient relaxation (for a quadratic potential) towards the current weighted average position of the batch of particles. The diffusion term is an exploration term whose strength is proportional to the distance of the particle from the current weighted average. Results of convergence to a global minimizer do exist, under various smallness assumptions on the initial distribution of the particles, and assumptions on the relative size of the parameters.

¹⁵We thank Enrique Zuazua for this suggestion.
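A minimal CBO sketch along the lines just described (our illustration: the smoothed Heaviside is taken to be a logistic function, and all constants are ad hoc choices rather than tuned values):

```python
import numpy as np

def cbo_minimize(f, d=2, n=100, steps=2000, lam=1.0, sigma=0.7, beta=30.0,
                 dt=0.01, eps=0.1, seed=8):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-3, 3, size=(n, d))
    for _ in range(steps):
        w = np.exp(-beta * f(X)); w /= w.sum()
        v = w @ X                                  # weighted average v[mu_n]
        H = 1.0 / (1.0 + np.exp(-(f(X) - f(v[None])) / eps))  # smoothed Heaviside
        dist = np.linalg.norm(X - v, axis=1, keepdims=True)
        X += (-lam * (X - v) * H[:, None] * dt
              + np.sqrt(2 * dt) * sigma * dist * rng.standard_normal((n, d)))
    return v

def rastrigin(X):
    return 10 * X.shape[-1] + np.sum(X**2 - 10 * np.cos(2 * np.pi * X), axis=-1)

print(cbo_minimize(rastrigin))   # typically lands near the global minimizer (0, 0)
```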
Acknowledgments
We thank Pierre Ablin, Sébastien Bubeck, Gabriel Peyré, Matthew Rosenzweig,
Sylvia Serfaty, Kimi Sun, and Rui Sun for discussions. We thank Nicolas Boumal
for referring us to [MTG17, CRMB24] and for clarifying comments.
Appendix
Appendix A. Proof of Theorem 4.1

The proof of Theorem 4.1 relies on standard arguments from dynamical systems, upon noticing that the evolution (4.1) is a (continuous-time) gradient ascent for the energy $E_0$.

Since the dynamics are the gradient ascent of a real-analytic functional on the compact real-analytic manifold $(S^{d-1})^n$, the celebrated Łojasiewicz theorem [Loj63], in the form given by [HKR18, Corollary 5.1] (which is valid in the context of general compact Riemannian manifolds), implies that for any initial condition $X \in (S^{d-1})^n$, the solution $\Phi^t(X) \in (S^{d-1})^n$ converges to some critical point $X^\star \in (S^{d-1})^n$ of $E_0$ as $t \to +\infty$.

We recall that a strict saddle point of $E_0$ is a critical point of $E_0$ at which the Hessian of $E_0$ has at least one strictly positive eigenvalue. Theorem 4.1 then follows by combining the following two lemmas with the Łojasiewicz theorem.
Lemma A.1. Let $M$ be a compact Riemannian manifold and let $f : M \to \mathbb{R}$ be a smooth function. The set of initial conditions $X_0 \in M$ for which the gradient ascent

(A.1)  $\begin{cases}\dot X(t) = \nabla f(X(t))\\ X(0) = X_0\end{cases}$

converges to a strict saddle point of $f$ has zero Lebesgue measure.
Lemma A.2. Any critical point $(x_1, \ldots, x_n) \in (S^{d-1})^n$ of $E_0$ which is not a global maximum, namely which does not satisfy $x_1 = \ldots = x_n$, is a strict saddle point. In particular, all local maxima are global.
Proof of Lemma A.2. We extend the proof idea of [Tay12, Theorem 4.1] as follows. Let $(x_1, \ldots, x_n) \in (S^{d-1})^n$ be a critical point of $E_0$, and assume that the points $x_i$ are not all equal to each other.
Step 1. We first prove that there exists a set of indices $S \subset [n]$ such that

(A.2)  $\displaystyle\sum_{i\in S}\sum_{j\in S^c}\langle x_i, x_j\rangle < 0.$

Set $m = \sum_{j=1}^n x_j$ and consider two cases. If $m \ne 0$, then we deduce from $\nabla E_0(x_1, \ldots, x_n) = 0$ that for any $j \in [n]$, $x_j$ is collinear with $m$. Thus $x_j = \pm x_1$ for any $j \in [n]$. Setting

$S = \{j \in [n] : x_j = +x_1\},$

we can see that (A.2) holds, unless $S = [n]$, which has been excluded. Now suppose that $m = 0$. Then by expanding $\langle m, x_i\rangle = 0$, we find that for any $i \in [n]$,

$-1 = \displaystyle\sum_{j\in[n]\setminus\{i\}}\langle x_j, x_i\rangle,$
Step 2. In this second step we look to deduce from (A.2) that px1 , . . . , xn q is a strict
saddle point. Consider an arbitrary non-zero skew-symmetric matrix B and define
the perturbation
#
xi iRS
xi ptq “
etB xi i P S.
Set E0 ptq “ E0 px1 ptq, . . . , xn ptqq. Note that we have
2ÿ ÿ
E0 ptq “ const. ` xxi ptq, xj y ,
n iPS jPSc
where we grouped time-independent terms into the constant (recall that etB is an
orthogonal matrix, since skew-symmetric matrices are the Lie algebra of SOpdq).
Thus

E_0'(t) = \frac{2}{n^2} \sum_{i \in S} \sum_{j \in S^c} \langle \dot{x}_i(t), x_j \rangle, \qquad E_0''(t) = \frac{2}{n^2} \sum_{i \in S} \sum_{j \in S^c} \langle \ddot{x}_i(t), x_j \rangle.

Since $(x_1, \dots, x_n)$ is a critical point of $E_0$, we have $E_0'(0) = 0$. On the other hand, since $\ddot{x}_i(0) = B^2 x_i$, we have

(A.3)\qquad E_0''(0) = \frac{2}{n^2} \sum_{i \in S} \sum_{j \in S^c} \langle B^2 x_i, x_j \rangle.
We claim that given (A.2), there must exist some skew-symmetric matrix $B$ such that $E_0''(0) > 0$. Indeed, if $d$ is even, then we just take $B$ as the block-diagonal matrix with repeated $2 \times 2$ block

\begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix},

for which $B^2 = -I_d$, so that (A.3) together with (A.2) yields $E_0''(0) = -\frac{2}{n^2} \sum_{i \in S} \sum_{j \in S^c} \langle x_i, x_j \rangle > 0$.
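The even-dimensional construction lends itself to a quick numerical sanity check. The following Python sketch (ours, not part of the proof) assumes the normalization $E_0(x) = \frac{1}{n^2}\sum_{i,j}\langle x_i, x_j\rangle$ (any positive multiple leaves the sign of $E_0''(0)$ unchanged) and verifies by finite differences that the rotation perturbation produces a direction of increase at a non-consensus critical point:

import numpy as np

n, d = 4, 2
e1 = np.array([1.0, 0.0])
X = np.stack([e1, e1, -e1, -e1])   # critical point with m = 0; the points are not all equal
S = [0, 1]                         # subset realizing (A.2): the sum over S x S^c equals -4 < 0

def E0(X):
    return (X @ X.T).sum() / n**2  # assumed normalization; any positive multiple works

def rot(t):
    # e^{tB} for B = [[0, 1], [-1, 0]]; note B^2 = -I
    return np.array([[np.cos(t), np.sin(t)], [-np.sin(t), np.cos(t)]])

def E0_perturbed(t):
    Y = X.copy()
    Y[S] = X[S] @ rot(t).T         # rotate only the particles indexed by S
    return E0(Y)

h = 1e-4
print((E0_perturbed(h) + E0_perturbed(-h) - 2 * E0_perturbed(0.0)) / h**2)  # ~ 0.5 > 0

For this configuration the cross term equals $(1 - \cos t)/2$, so the printed second derivative is approximately $0.5$, confirming the strict-saddle mechanism.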
Appendix B. Proof of Theorem 5.1
For $d = 2$, writing $x_i = (\cos\theta_i, \sin\theta_i)$, a computation shows that

\partial_{\theta_i} \partial_{\theta_j} E_\beta(\theta_1, \dots, \theta_n) = \frac{1}{n^2} \cdot \begin{cases} g(\theta_i - \theta_j), & i \neq j \\ -\sum_{m \in [n] \setminus \{i\}} g(\theta_i - \theta_m), & i = j, \end{cases}

where we set $g(x) := (\cos(x) - \beta \sin^2(x)) e^{\beta \cos(x)}$. Plugging this expression back into (B.1) and simplifying, we obtain

(B.2)\qquad \sum_{i \in S} \sum_{j \in S^c} g(\theta_i - \theta_j) \geq 0.
Let us now define $\tau_\beta^*$ to be the unique solution on $[0, \frac{\pi}{2})$ of the equation

\beta \sin^2(\tau) = \cos(\tau).

Note that $\tau_\beta^*$ is a monotonically decreasing function of $\beta$, and in fact

\tau_\beta^* = \frac{1 + o(1)}{\sqrt{\beta}}

as $\beta \to +\infty$. The importance of $\tau_\beta^*$ lies in the following property of the function $g$: for any $\tau \notin [-\tau_\beta^*, \tau_\beta^*]$, we must have $g(\tau) < 0$ (see Figure 6).
We arrive at the following conclusion: it must be that for any proper subset $S \subset [n]$ there exists, by virtue of (B.2), some index $j \in S^c$ such that

\inf_{k \in S} |\theta_j - \theta_k| < \tau_\beta^*.
So now let us start with $S = \{1\}$ and grow $S$ inductively, by adding those points $\theta_j$ at distance $< \tau_\beta^*$ from $\{\theta_k : k \in S\}$ at each induction step. If $\beta$ is large enough so that

(n - 1) \, \tau_\beta^* < \frac{\pi}{2},

then in the process of adding points we have travelled a total arc-length $< \pi/2$ on each side of $x_1$. Thus it must be that the collection of points $\theta_1, \dots, \theta_n$ is strictly contained inside a half-circle of angular width $< \pi$. By Lemma 6.4 we know that there can be no critical points of $E_\beta$ that are strictly inside some half-circle, unless that critical point is trivial: $\theta_1 = \dots = \theta_n$. This completes the proof when $d = 2$.
[Figure 6. Plot of the function $g$ for $\beta = 2$ and $\beta = 6$; $g$ is negative outside $[-\tau_\beta^*, \tau_\beta^*]$.]
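The threshold $\tau_\beta^*$ and the sign property of $g$ visible in Figure 6 are straightforward to check numerically. The following sketch (ours, for illustration only; $d = 2$, so $g(x) = (\cos x - \beta \sin^2 x) e^{\beta \cos x}$) locates the root by bisection, samples $g$ beyond it, and checks the asymptotics $(1 + o(1))/\sqrt{\beta}$:

import numpy as np

def g(x, beta):
    return (np.cos(x) - beta * np.sin(x) ** 2) * np.exp(beta * np.cos(x))

def tau_star(beta, iters=80):
    # bisection for the unique root of beta sin^2(tau) = cos(tau) on [0, pi/2)
    lo, hi = 0.0, np.pi / 2
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if beta * np.sin(mid) ** 2 < np.cos(mid):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for beta in [2.0, 6.0]:
    t = tau_star(beta)
    samples = np.linspace(t + 1e-6, np.pi, 1000)
    print(beta, t, np.all(g(samples, beta) < 0))   # True: g < 0 outside [-tau*, tau*]

for beta in [1e2, 1e4, 1e6]:
    print(beta, tau_star(beta) * np.sqrt(beta))    # -> 1, i.e. tau* = (1 + o(1)) / sqrt(beta)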
We can show that the same conclusion holds for any dimension $d \geq 2$. The proof follows by arguing just as above, making use instead of the following generalization of (B.2): given a collection $x_1, \dots, x_n \in \mathbb{S}^{d-1}$ at which the Hessian of $E_\beta$ is non-positive, we must have for any subset $S \subset [n]$ that

(B.3)\qquad \sum_{i \in S} \sum_{j \in S^c} g(\theta_{ij}) \geq 0,

where $g(\zeta) = e^{\beta \cos(\zeta)} \left( (d-1) \cos(\zeta) - \beta \sin^2(\zeta) \right)$ and $\theta_{ij} \in [0, \pi]$ is the geodesic distance between $x_i$ and $x_j$, namely $\cos(\theta_{ij}) = \langle x_i, x_j \rangle$. We now show (B.3). By repeating the argument in Step 2 of the proof of Lemma A.2, we see that for any skew-symmetric matrix $B$ we must have

(B.4)\qquad \sum_{i \in S} \sum_{j \in S^c} e^{\beta \langle x_i, x_j \rangle} \left( \beta \langle B x_i, x_j \rangle^2 + \langle B^2 x_i, x_j \rangle \right) \leq 0.
Now we take $B$ to be random, by generating $B_{ij} \overset{\text{i.i.d.}}{\sim} P$ for $i < j$, where $P$ is any zero-mean, unit-variance distribution. We set $B_{ji} = -B_{ij}$ and $B_{ii} = 0$. Then it is straightforward to check that

\mathbb{E}\left[ \langle B^2 x_i, x_j \rangle \right] = -(d-1) \langle x_i, x_j \rangle

and

\mathbb{E}\left[ \langle B x_i, x_j \rangle^2 \right] = 1 - \langle x_i, x_j \rangle^2 = \sin^2(\theta_{ij}).

Thus, taking the expectation over all such $B$ in (B.4) yields (B.3).
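Both expectation identities can be confirmed by a quick Monte Carlo experiment. The sketch below (ours, for illustration only) takes $P = \mathcal{N}(0,1)$ and two fixed unit vectors at angle $\theta_{ij} = 1.1$:

import numpy as np

rng = np.random.default_rng(0)
d, theta, trials = 8, 1.1, 100_000
x_i = np.eye(d)[0]
x_j = np.cos(theta) * np.eye(d)[0] + np.sin(theta) * np.eye(d)[1]  # <x_i, x_j> = cos(theta)

m1 = m2 = 0.0
for _ in range(trials):
    U = rng.standard_normal((d, d))
    B = np.triu(U, 1)
    B = B - B.T                       # skew-symmetric, zero diagonal, unit-variance entries
    m1 += x_i @ (B @ B) @ x_j         # <B^2 x_i, x_j>  (B^2 is symmetric)
    m2 += (x_i @ B @ x_j) ** 2        # equals <B x_i, x_j>^2, since B is skew-symmetric
print(m1 / trials, -(d - 1) * np.cos(theta))   # ~ -(d-1) <x_i, x_j>
print(m2 / trials, np.sin(theta) ** 2)         # ~ sin^2(theta_ij)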
Mirroring the proof for $d = 2$, we define $\tau_\beta^*$ to be the unique solution on $[0, \frac{\pi}{2})$ of the equation $\beta \sin^2(\tau) = (d-1) \cos(\tau)$. We note that

\tau_\beta^* = \sqrt{\frac{(d-1) + o(1)}{\beta}}

for $\beta \to +\infty$. Repeating verbatim the argument for the case $d = 2$, we deduce the convergence to a single cluster whenever $\beta \gtrsim (d-1) n^2$. □
Remark B.1. We comment on the extension of the above proof to the dynamics (SA). We recall that (SA) is a gradient flow, but for a different metric—see Section 3.4—and we show that the saddle point property is preserved across metrics. Our proof is an adaptation of a classical argument: the Hessian of a function at a critical point is a notion which does not depend on the choice of Riemannian metric.
Let $x = (x_1, \dots, x_n) \in (\mathbb{S}^{d-1})^n$ be a critical point of $E_\beta$ (this does not depend on the metric) such that not all $x_i$ are equal to each other. Recall that for $f : (\mathbb{S}^{d-1})^n \to \mathbb{R}$, for any metric on $(\mathbb{S}^{d-1})^n$ (with associated Christoffel symbols $\Gamma_{ij}^k$) and any associated orthonormal basis $y_1, \dots, y_{(d-1)n}$, the Hessian of $f$ reads

(B.5)\qquad \mathrm{Hess}(f) = \left( \frac{\partial^2 f}{\partial y_i \partial y_j} - \Gamma_{ij}^k \frac{\partial f}{\partial y_k} \right) dy_i \otimes dy_j.
Since we are evaluating the Hessian at a critical point $x$ of $E_\beta$, the term carrying the Christoffel symbols $\Gamma_{ij}^k$ vanishes. In the above argument, we saw that $\mathrm{Hess}\, E_\beta$ evaluated at $x$, and written in an orthonormal basis for the canonical metric $g$ on $(\mathbb{S}^{d-1})^n$, is not negative semi-definite. We denote this matrix by $M_g$; we know that there exists $v \in T_x (\mathbb{S}^{d-1})^n$ such that $g(v, v) = 1$ and $v^\top M_g v > 0$. Let $\tilde{g}$ be another metric on $(\mathbb{S}^{d-1})^n$; we denote by $M_{\tilde{g}}$ the Hessian evaluated at $x$, and written in an orthonormal basis for $\tilde{g}$. Let $c : \mathbb{R}_{\geq 0} \to (\mathbb{S}^{d-1})^n$ be such that $c(0) = x$ and $\dot{c}(0) = v$. Since $x$ is a critical point (for both metrics), a Taylor expansion to second order in the two orthonormal bases yields

E_\beta(c(t)) = E_\beta(c(0)) + \frac{1}{2} t^2 \, v^\top M_g v + O(t^3)

as well as

E_\beta(c(t)) = E_\beta(c(0)) + \frac{1}{2} t^2 \, \|v\|_{\tilde{g}}^{-2} \, v^\top M_{\tilde{g}} v + O(t^3)

thanks to (B.5). Hence $v^\top M_{\tilde{g}} v > 0$. Specializing to $\tilde{g}$ being the metric of Section 3.4, with respect to which (SA) is a gradient flow for $E_\beta$, we conclude for (SA).
Appendix C. Proof of Theorem 4.3

Recall that $D E_\beta(x)[v] = g_\beta(\nabla_{g_\beta} E_\beta(x), v)$, but also $D E_\beta(x)[v] = g(\nabla_g E_\beta(x), v)$. By virtue of the explicit form of $g_\beta$ and $E_\beta$, as well as (C.1), we gather that

(C.3)\qquad g_\beta(\nabla_{g_\beta} E_\beta(x), v) = g(\nabla_g E_0(x), v) + O(\beta),

which implies that any sequence of critical points of $E_\beta$ converges to a critical point of $E_0$ as $\beta \to 0$. Similarly, since $\mathrm{Hess}_{g_\beta} E_\beta(x)[v] = D(\nabla_{g_\beta} E_\beta(x))[v]$, we find

(C.4)\qquad \mathrm{Hess}_{g_\beta} E_\beta(x)[v] = \mathrm{Hess}_g E_0(x)[v] + O(\beta).

We can then repeat the argument in the proof above by replacing (C.1) by (C.3) and (C.4). □
Appendix D. Proof of Theorem 6.9

Step 1. The flow map is Lipschitz. We begin by showing that the trajectories satisfy a Lipschitz property with respect to the initial data. To this end, let $(x_i(\cdot))_{i \in [n]} \in C^0(\mathbb{R}_{\geq 0}; (\mathbb{S}^{d-1})^n)$ and $(y_i(\cdot))_{i \in [n]} \in C^0(\mathbb{R}_{\geq 0}; (\mathbb{S}^{d-1})^n)$ be two solutions to the Cauchy problem for (SA) associated to data $(x_i(0))_{i \in [n]}$ and $(y_i(0))_{i \in [n]}$ respectively. For any $i \in [n]$ and $t \geq 0$, we have

(D.1)\qquad x_i(t) - y_i(t) = x_i(0) - y_i(0)
+ \int_0^t \sum_{j=1}^n \left( \frac{e^{\beta \langle x_i(s), x_j(s) \rangle}}{\sum_{k=1}^n e^{\beta \langle x_i(s), x_k(s) \rangle}} \right) \left( x_j(s) - \langle x_i(s), x_j(s) \rangle x_i(s) \right) ds
- \int_0^t \sum_{j=1}^n \left( \frac{e^{\beta \langle y_i(s), y_j(s) \rangle}}{\sum_{k=1}^n e^{\beta \langle y_i(s), y_k(s) \rangle}} \right) \left( y_j(s) - \langle y_i(s), y_j(s) \rangle y_i(s) \right) ds.
We see that

(D.2)\qquad \left\| \int_0^t \sum_{j=1}^n \frac{e^{\beta \langle x_i(s), x_j(s) \rangle}}{\sum_{k=1}^n e^{\beta \langle x_i(s), x_k(s) \rangle}} \, \left( x_j(s) - y_j(s) \right) ds \right\| \leq \int_0^t \max_{j \in [n]} \| x_j(s) - y_j(s) \| \, ds.
Using (D.2), (D.3) and arguing similarly for the remaining terms in (D.1), we deduce that

\| x_i(t) - y_i(t) \| \leq \| x_i(0) - y_i(0) \| + 10 \max\{1, \beta\} \, n \int_0^t \max_{j \in [n]} \| x_j(s) - y_j(s) \| \, ds.

By Grönwall's inequality, this yields

(D.4)\qquad \max_{i \in [n]} \| x_i(t) - y_i(t) \| \leq c(\beta)^{nt} \max_{i \in [n]} \| x_i(0) - y_i(0) \|, \qquad c(\beta) := e^{10 \max\{1, \beta\}}.
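To make Step 1 concrete, here is a short Python sketch (ours; a forward-Euler discretization with illustrative parameters, not the exact flow) that integrates the vector field appearing in (D.1) from two nearby initial configurations and reports the resulting gap; the observed growth is far milder than the worst-case Grönwall factor $c(\beta)^{nt}$:

import numpy as np

rng = np.random.default_rng(1)
n, d, beta, dt, steps = 8, 16, 1.0, 0.01, 500   # time horizon T = dt * steps = 5

def vector_field(X):
    # right-hand side of (SA), as written out in (D.1)
    W = np.exp(beta * (X @ X.T))
    A = W / W.sum(axis=1, keepdims=True)         # self-attention weights
    M = A @ X                                    # attention-weighted averages
    return M - np.sum(M * X, axis=1, keepdims=True) * X   # tangential projection

def flow(X):
    for _ in range(steps):
        X = X + dt * vector_field(X)
        X = X / np.linalg.norm(X, axis=1, keepdims=True)  # re-project onto the sphere
    return X

X0 = rng.standard_normal((n, d))
X0 /= np.linalg.norm(X0, axis=1, keepdims=True)
Y0 = X0 + 1e-6 * rng.standard_normal((n, d))              # nearby initial configuration
Y0 /= np.linalg.norm(Y0, axis=1, keepdims=True)

gap0 = np.linalg.norm(X0 - Y0, axis=1).max()
gapT = np.linalg.norm(flow(X0) - flow(Y0), axis=1).max()
print(gap0, gapT)   # gapT stays well below the (very loose) bound c(beta)^{nT} * gap0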
Step 2. Almost orthogonality. Let $x_1(0), \dots, x_n(0) \in \mathbb{S}^{d-1}$ be the random i.i.d. initial points. We prove that with high probability, there exist $n$ pairwise orthogonal points $y_1(0), \dots, y_n(0) \in \mathbb{S}^{d-1}$ such that for any $i \in [n]$,

(D.5)\qquad \| x_i(0) - y_i(0) \| \leq \sqrt{\frac{\log d}{d}}.

To this end, we take $y_1(0) = x_1(0)$ and then construct the other points $y_i(0)$ by induction. Assume that $y_1(0), \dots, y_i(0)$ are constructed for some $i \in [n]$, using only knowledge of the points $x_1(0), \dots, x_i(0)$. Then by Lévy's concentration of measure, since $x_{i+1}(0)$ is independent from $x_1(0), \dots, x_i(0)$ and uniformly distributed on $\mathbb{S}^{d-1}$,

\mathbb{P}\left( \mathrm{dist}\left( x_{i+1}(0), \, \mathrm{span}\{ y_1(0), \dots, y_i(0) \}^{\perp} \right) \leq \sqrt{\frac{\log d}{d}} \right) \geq 1 - 4 i \, d^{-1/64}.
Using the union bound, we gather that the event

A_0 = \left\{ \text{(D.5) is satisfied for any } i \in [n] \right\}

has probability at least $p_0 = 1 - 2 n^2 d^{-1/64}$. We now consider the event

A = A_0 \cap \left\{ \text{for some } C, \lambda > 0, \text{ (6.1) holds for any } i \in [n] \text{ and } t \geq 0 \right\},

which, since $d \geq n$ and thus the second event has probability $1$, also holds with probability at least $p_0 = 1 - 2 n^2 d^{-1/64}$. For the remainder of the proof, we assume that $A$ is satisfied.
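The near-orthogonality behind Step 2 is easy to observe numerically: i.i.d. uniform points on $\mathbb{S}^{d-1}$ have pairwise inner products of order $d^{-1/2}$. A minimal check (ours; the comparison with $\sqrt{\log d / d}$ is only up to the order of magnitude):

import numpy as np

rng = np.random.default_rng(2)
n, d = 16, 10_000
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # n i.i.d. uniform points on S^{d-1}

G = X @ X.T
off_diag = np.abs(G[~np.eye(n, dtype=bool)])     # pairwise inner products
print(off_diag.max(), np.sqrt(np.log(d) / d))    # both are of order sqrt(log(d) / d)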
Step 3. Proof of (6.11). Let $(y_i(\cdot))_{i \in [n]} \in C^0(\mathbb{R}_{\geq 0}; (\mathbb{S}^{d-1})^n)$ denote the unique solution to the Cauchy problem for (SA) corresponding to the initial datum $(y_i(0))_{i \in [n]}$. A combination of (D.4) and (D.5) yields

(D.6)\qquad \| x_i(t) - y_i(t) \| \leq c(\beta)^{nt} \sqrt{\frac{\log d}{d}}

for any $i \in [n]$ and $t \geq 0$, under $A$. Combining (D.6) with Theorem 6.8 we obtain

(D.7)\qquad \left| \langle x_i(t), x_j(t) \rangle - \gamma_\beta(t) \right| \leq 2 \, c(\beta)^{nt} \sqrt{\frac{\log d}{d}}

for any $i \neq j$ and $t \geq 0$, under $A$.
We turn to the proof of the second part of (6.11). For this, we prove that for large times $t$, both $\gamma_\beta(t)$ and $\langle x_i(t), x_j(t) \rangle$ are necessarily close to $1$. We first show that

(D.8)\qquad 1 - \gamma_\beta(t) \leq \frac{1}{2} \exp\left( \frac{n^2 e^\beta}{2 \left( n + e^{\beta/2} \right)} - \frac{n t}{n + e^{\beta/2}} \right)
for any $t \geq 0$. To this end, we notice that $t \mapsto \gamma_\beta(t)$ is increasing and thus $\gamma_\beta(t) \geq 0$, as well as $\dot{\gamma}_\beta(t) \geq \frac{1}{n e^\beta}$ as long as $\gamma_\beta(t) \leq \frac{1}{2}$. Therefore,

\gamma_\beta\left( \frac{n e^\beta}{2} \right) \geq \frac{1}{2}.

We deduce that for $t \geq \frac{n e^\beta}{2}$,

\dot{\gamma}_\beta(t) \geq \frac{n \left( 1 - \gamma_\beta(t) \right)}{n + e^{\beta/2}}.

Integrating this inequality from $\frac{n e^\beta}{2}$ to $t$, we obtain (D.8). We now set $d^*(n, \beta) \geq n$ such that

(D.9)\qquad \frac{d}{\log d} \geq \frac{16 \, c(\beta)^2}{\gamma_\beta\left( \frac{1}{n} \right)^2}
holds for any $d \geq d^*(n, \beta)$. According to Lemma 6.4, since $A$ is satisfied, there exists $x^* \in \mathbb{S}^{d-1}$ such that $x_i(t) \to x^*$ for any $i \in [n]$ as $t \to +\infty$. We set

\alpha(t) := \min_{i \in [n]} \langle x_i(t), x^* \rangle,

where the minimum is achieved at some index $i(t) \in [n]$. Combining this with (D.11), we gather that $\alpha(t) \geq \alpha\left( \frac{1}{n} \right)$ for $t \geq \frac{1}{n}$. But

(D.13)\qquad \min_{j \in [n]} \langle x_{i(t)}(t), x_j(t) \rangle \leq \sum_{k=1}^n \theta_k(t) \langle x_{i(t)}(t), x_k(t) \rangle = \langle x_{i(t)}(t), x^* \rangle = \alpha(t),

where $(\theta_k(t))_{k \in [n]}$ are convex weights such that $x^* = \sum_{k=1}^n \theta_k(t) x_k(t)$.
Plugging (D.13) into (D.12) and using $a_{ij}(t) \geq n^{-1} e^{-2\beta}$, we get

(D.14)\qquad \dot{\alpha}(t) \geq \frac{1}{n e^{2\beta}} \, \alpha\left( \frac{1}{n} \right) \left( 1 - \alpha(t) \right)
for $t \geq \frac{1}{n}$. Integrating (D.14) from $\frac{1}{n}$ to $t$, we get (D.10). We therefore deduce from (D.10) that

\langle x_i(t), x_j(t) \rangle \geq 1 - \exp\left( -\frac{\gamma_\beta\left( \frac{1}{n} \right) t}{2 n e^{2\beta}} \right)

holds for any distinct $i, j \in [n]$. Together with (D.8), we then get

(D.15)\qquad \left| \langle x_i(t), x_j(t) \rangle - \gamma_\beta(t) \right| \leq \exp\left( -\frac{\gamma_\beta\left( \frac{1}{n} \right) t}{2 n e^{2\beta}} \right) + \frac{1}{2} \exp\left( \frac{n^2 e^\beta}{2 \left( n + e^{\beta/2} \right)} - \frac{n t}{n + e^{\beta/2}} \right).

Finally, combining (D.7) and (D.15) we obtain (6.11). □
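The behavior just proved is easy to observe numerically. The following sketch (ours; forward Euler with illustrative parameters, so only a qualitative check, not a verification of the constants) runs (SA) with $d \gg n$ from i.i.d. uniform initializations: the pairwise inner products $\langle x_i(t), x_j(t) \rangle$ stay within a narrow band around a single curve and approach $1$, as (6.11) predicts.

import numpy as np

rng = np.random.default_rng(3)
n, d, beta, dt, steps = 8, 2048, 1.0, 0.05, 4000   # time horizon T = 200, with d >> n

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # i.i.d. uniform initialization

mask = ~np.eye(n, dtype=bool)
for step in range(steps + 1):
    if step % 800 == 0:
        off = (X @ X.T)[mask]                      # all pairwise inner products
        print(f"t = {step * dt:6.1f}  min = {off.min():+.4f}  max = {off.max():+.4f}")
    W = np.exp(beta * (X @ X.T))
    A = W / W.sum(axis=1, keepdims=True)
    M = A @ X
    X = X + dt * (M - np.sum(M * X, axis=1, keepdims=True) * X)   # Euler step of (SA)
    X /= np.linalg.norm(X, axis=1, keepdims=True)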
Remark D.1. An analogous statement to Theorem 6.9 holds for (USA), where $\gamma_\beta$ would rather be the unique solution to (6.10). More concretely, Step 1 in the proof is only slightly changed—the constant one obtains in the analogue of (6.11) is rather $c(\beta)^{nt}$ with $c(\beta) = e^{10 \beta e^{2\beta}}$. Step 2 remains unchanged. In Step 3, (D.8) is replaced by $\gamma_\beta\left( \frac{n}{2} \right) \geq \frac{1}{2}$ and

1 - \gamma_\beta(t) \leq \frac{1}{2} \exp\left( -e^{\beta/2} \left( t - \frac{n}{2} \right) \right).

The rest of the proof then remains essentially unchanged.
References
[ABK` 22] Pedro Abdalla, Afonso S Bandeira, Martin Kassabov, Victor Souza, Steven H Stro-
gatz, and Alex Townsend. Expander graphs are globally synchronising. arXiv preprint
arXiv:2210.12788, 2022.
[ABV` 05] Juan A Acebrón, Luis L Bonilla, Conrad J Pérez Vicente, Félix Ritort, and Renato
Spigler. The Kuramoto model: A simple paradigm for synchronization phenomena.
Reviews of Modern Physics, 77(1):137, 2005.
[ACDS23] Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn
to implement preconditioned gradient descent for in-context learning. Advances in
Neural Information Processing Systems, 36, 2023.
[ADTK23] Silas Alberti, Niclas Dern, Laura Thesing, and Gitta Kutyniok. Sumformer: Universal
Approximation for Efficient Transformers. In Topological, Algebraic and Geometric
Learning Workshops 2023, pages 72–86. PMLR, 2023.
[AG24] Daniel Owusu Adu and Bahman Gharesifard. Approximate controllability of continuity equation of transformers. IEEE Control Systems Letters, 2024.
[AGS05] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces
and in the space of probability measures. Springer Science & Business Media, 2005.
[AL24] Andrei Agrachev and Cyril Letrouit. Generic controllability of equivariant sys-
tems and applications to particle systems and neural networks. arXiv preprint
arXiv:2404.08289, 2024.
[AS22] Andrei Agrachev and Andrey Sarychev. Control on the manifolds of mappings with
a view to the deep learning. Journal of Dynamical and Control Systems, 28(4):989–
1008, 2022.
[Bar93] Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal
function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
[BB00] Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to
the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–
393, 2000.
[BBC` 09] Brandon Ballinger, Grigoriy Blekherman, Henry Cohn, Noah Giansiracusa, Eliza-
beth Kelly, and Achill Schürmann. Experimental study of energy-minimizing point
configurations on spheres. Experimental Mathematics, 18(3):257–283, 2009.
[BCM08] Adrien Blanchet, José A Carrillo, and Nader Masmoudi. Infinite time aggregation for
the critical Patlak-Keller-Segel model in R2 . Communications on Pure and Applied
Mathematics, 61(10):1449–1481, 2008.
[BCM15] Dario Benedetto, Emanuele Caglioti, and Umberto Montemagno. On the complete
phase synchronization for the Kuramoto model in the mean-field limit. Communica-
tions in Mathematical Sciences, 13(7):1775–1786, 2015.
[BD19] Dmitriy Bilyk and Feng Dai. Geodesic distance Riesz energy on the sphere. Trans-
actions of the American Mathematical Society, 372(5):3141–3166, 2019.
[BHK24] Han Bao, Ryuichiro Hataya, and Ryo Karakida. Self-attention networks localize when QK-eigenspectrum concentrates. arXiv preprint arXiv:2402.02098, 2024.
[BKH16] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv
preprint arXiv:1607.06450, 2016.
[BLG14] Emmanuel Boissard and Thibaut Le Gouic. On the mean speed of convergence of
empirical and occupation measures in Wasserstein distance. In Annales de l’IHP
Probabilités et statistiques, volume 50, pages 539–563, 2014.
[BLR11] Andrea L Bertozzi, Thomas Laurent, and Jesús Rosado. Lp theory for the multidi-
mensional aggregation equation. Communications on Pure and Applied Mathematics,
64(1):45–83, 2011.
[CCH` 14] José A Carrillo, Young-Pil Choi, Seung-Yeal Ha, Moon-Jin Kang, and Yongduck Kim.
Contractivity of transport distances for the kinetic Kuramoto equation. Journal of
Statistical Physics, 156(2):395–415, 2014.
[CD22] Louis-Pierre Chaintron and Antoine Diez. Propagation of chaos: a review of models,
methods and applications. II. Applications. 2022.
[CDF` 11] José A. Carrillo, Marco DiFrancesco, Alessio Figalli, Thomas Laurent, and Dejan Slepčev. Global-in-time weak measure solutions and finite-time aggregation for nonlocal interaction equations. Duke Mathematical Journal, 156(2):229–271, 2011.
[Chi15] Hayato Chiba. A proof of the Kuramoto conjecture for a bifurcation structure of
the infinite-dimensional Kuramoto model. Ergodic Theory and Dynamical Systems,
35(3):762–834, 2015.
[CJLZ21] José A Carrillo, Shi Jin, Lei Li, and Yuhua Zhu. A consensus-based global opti-
mization method for high dimensional machine learning problems. ESAIM: Control,
Optimisation and Calculus of Variations, 27:S5, 2021.
[CK07] Henry Cohn and Abhinav Kumar. Universally optimal distribution of points on
spheres. Journal of the American Mathematical Society, 20(1):99–148, 2007.
[CKM` 22] Henry Cohn, Abhinav Kumar, Stephen Miller, Danylo Radchenko, and Maryna Via-
zovska. Universal optimality of the E8 and Leech lattices and interpolation formulas.
Annals of Mathematics, 196(3):983–1082, 2022.
[CLLS23] Jingpu Cheng, Qianxiao Li, Ting Lin, and Zuowei Shen. Interpolation, approximation
and controllability of deep neural networks. arXiv preprint arXiv:2309.06015, 2023.
[CLP15] Marco Caponigro, Anna Chiara Lai, and Benedetto Piccoli. A nonlinear model of
opinion formation on the sphere. Discrete & Continuous Dynamical Systems-A,
35(9):4241–4268, 2015.
[CLT20] Christa Cuchiero, Martin Larsson, and Josef Teichmann. Deep neural networks,
generic universal interpolation, and controlled ODEs. SIAM Journal on Mathematics
of Data Science, 2(3):901–919, 2020.
[CNQG24] Aditya Cowsik, Tamra Nebabu, Xiao-Liang Qi, and Surya Ganguli. Geometric dy-
namics of signal propagation predict trainability of transformers. arXiv preprint
arXiv:2403.02579, 2024.
[CNWR24] Sinho Chewi, Jonathan Niles-Weed, and Philippe Rigollet. Statistical optimal trans-
port. arXiv preprint arXiv:2407.18163, 2024.
[CRBD18] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural
ordinary differential equations. Advances in Neural Information Processing Systems,
31, 2018.
[CRMB24] Christopher Criscitiello, Quentin Rebjock, Andrew D. McRae, and Nicolas Boumal.
Synchronization on circles and spheres with nonlinear interactions, 2024.
[CS07] Felipe Cucker and Steve Smale. Emergent behavior in flocks. IEEE Transactions on
Automatic Control, 52(5):852–862, 2007.
[Cyb89] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathe-
matics of Control, Signals and Systems, 2(4):303–314, 1989.
[CZC` 22] Tianlong Chen, Zhenyu Zhang, Yu Cheng, Ahmed Awadallah, and Zhangyang Wang.
The principle of diversity: Training stronger vision transformers calls for reducing
all levels of redundancy. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 12020–12030, 2022.
[Dai92] Hiroaki Daido. Order function and macroscopic mutual entrainment in uniformly cou-
pled limit-cycle oscillators. Progress of Theoretical Physics, 88(6):1213–1218, 1992.
[DBK24] Gbètondji JS Dovonon, Michael M Bronstein, and Matt J Kusner. Setting the record
straight on transformer oversmoothing. arXiv preprint arXiv:2401.04301, 2024.
[DBPC19] Gwendoline De Bie, Gabriel Peyré, and Marco Cuturi. Stochastic deep networks. In
International Conference on Machine Learning, pages 1556–1565. PMLR, 2019.
[DCL21] Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. Attention is not all you
need: Pure attention loses rank doubly exponentially with depth. In International
Conference on Machine Learning, pages 2793–2803. PMLR, 2021.
[DFGV18] Helge Dietert, Bastien Fernandez, and David Gérard-Varet. Landau damping to par-
tially locked states in the Kuramoto model. Communications on Pure and Applied
Mathematics, 71(5):953–993, 2018.
[DGCC21] Subhabrata Dutta, Tanya Gautam, Soumen Chakrabarti, and Tanmoy Chakraborty.
Redesigning the transformer architecture with insights from multi-particle dynamical
systems. Advances in Neural Information Processing Systems, 34:5531–5544, 2021.
[DGS91] Philippe Delsarte, Jean-Marie Goethals, and Johan Jacob Seidel. Spherical codes and
designs. In Geometry and Combinatorics, pages 68–93. Elsevier, 1991.
[DGTT24] Puneesh Deora, Rouzbeh Ghaderi, Hossein Taheri, and Christos Thrampoulidis. On
the optimization and generalization of multi-head attention. Transactions on Ma-
chine Learning Research, 2024.
[Dob79] Roland L’vovich Dobrushin. Vlasov equations. Funktsional’nyi Analiz i ego
Prilozheniya, 13(2):48–58, 1979.
[Dud69] R. M. Dudley. The speed of mean Glivenko-Cantelli convergence. The Annals of Mathematical Statistics, 40(1):40–50, 1969.
[DX13] Feng Dai and Yuan Xu. Approximation theory and harmonic analysis on spheres
and balls. Springer, 2013.
[E17] Weinan E. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.
[FdHP24] Takashi Furuya, Maarten V de Hoop, and Gabriel Peyré. Transformers are universal
in-context learners. arXiv preprint arXiv:2408.01367, 2024.
[FGVG16] Bastien Fernandez, David Gérard-Varet, and Giambattista Giacomin. Landau damp-
ing in the Kuramoto model. In Annales Henri Poincaré, volume 17, pages 1793–1823.
Springer, 2016.
[FHPS21] Massimo Fornasier, Hui Huang, Lorenzo Pareschi, and Philippe Sünnen. Consensus-
based optimization on the sphere: Convergence to global minimizers and machine
learning. The Journal of Machine Learning Research, 22(1):10722–10776, 2021.
[FL19] Amic Frouvelle and Jian-Guo Liu. Long-time dynamics for a simple aggregation
equation on the sphere. In Stochastic Dynamics Out of Equilibrium: Institut Henri
Poincaré, Paris, France, 2017, pages 457–479. Springer, 2019.
[FZH` 22] Ruili Feng, Kecheng Zheng, Yukun Huang, Deli Zhao, Michael Jordan, and Zheng-
Jun Zha. Rank diminishing in deep neural networks. Advances in Neural Information
Processing Systems, 35:33054–33065, 2022.
[GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press,
Cambridge, MA, 2016.
[GBM21] Arnaud Guillin, Pierre Le Bris, and Pierre Monmarché. Uniform in time propagation
of chaos for the 2d vortex model and other singular stochastic systems. arXiv preprint
arXiv:2108.08675, 2021.
[GLPR24] Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. The emer-
gence of clusters in self-attention dynamics. Advances in Neural Information Process-
ing Systems, 36, 2024.
[Gol16] François Golse. On the dynamics of large particle systems in the mean field limit.
Macroscopic and large scale phenomena: coarse graining, mean field limits and er-
godicity, pages 1–144, 2016.
[GSRT13] Isabelle Gallagher, Laure Saint-Raymond, and Benjamin Texier. From Newton to
Boltzmann: hard spheres and short-range potentials. European Mathematical Society
Zürich, Switzerland, 2013.
[GWDW23] Xiaojun Guo, Yifei Wang, Tianqi Du, and Yisen Wang. Contranorm: A contrastive
learning perspective on oversmoothing and beyond. In The Eleventh International
Conference on Learning Representations, 2023.
[GZ22] Borjan Geshkovski and Enrique Zuazua. Turnpike in optimal control of PDEs,
ResNets, and beyond. Acta Numerica, 31:135–263, 2022.
[HHL23] Jiequn Han, Ruimeng Hu, and Jihao Long. A class of dimension-free metrics for
the convergence of empirical measures. Stochastic Processes and their Applications,
164:242–287, 2023.
[HK02] Rainer Hegselmann and Ulrich Krause. Opinion dynamics and bounded confidence:
models, analysis and simulation. Journal of Artifical Societies and Social Simulation
(JASSS), 5(3), 2002.
[HKPZ16] Seung-Yeal Ha, Dongnam Ko, Jinyeong Park, and Xiongtao Zhang. Collective syn-
chronization of classical and quantum oscillators. EMS Surveys in Mathematical Sci-
ences, 3(2):209–267, 2016.
[HKR18] Seung-Yeal Ha, Dongnam Ko, and Sang Woo Ryoo. On the relaxation dynamics
of Lohe oscillators on some Riemannian manifolds. Journal of Statistical Physics,
172:1427–1478, 2018.
[HR17] Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1), 2017.
[HR20] Seung-Yeal Ha and Seung-Yeon Ryoo. Asymptotic phase-locking dynamics and crit-
ical coupling strength for the Kuramoto model. Communications in Mathematical
Physics, 377(2):811–857, 2020.
[HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep
residual networks. In Computer Vision–ECCV 2016: 14th European Conference,
Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages
630–645. Springer, 2016.
[JDB23] Amir Joudaki, Hadi Daneshmand, and Francis Bach. On the impact of activation
and normalization in obtaining isometric embeddings at initialization. Advances in
Neural Information Processing Systems, 36:39855–39875, 2023.
[JKO98] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of
the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17,
1998.
[JL23] Haotian Jiang and Qianxiao Li. Approximation theory of transformer networks for
sequence modeling. arXiv preprint arXiv:2305.18475, 2023.
[JLLW23] Haotian Jiang, Qianxiao Li, Zhong Li, and Shida Wang. A brief survey on the ap-
proximation theory for sequence modelling. arXiv preprint arXiv:2302.13752, 2023.
[JM14] Pierre-Emmanuel Jabin and Sebastien Motsch. Clustering and asymptotic behavior
in opinion formation. Journal of Differential Equations, 257(11):4165–4187, 2014.
[KO02] Robert V Kohn and Felix Otto. Upper bounds on coarsening rates. Communications
in Mathematical Physics, 229(3):375–395, 2002.
[Kra00] Ulrich Krause. A discrete nonlinear and non-autonomous model of consensus. In
Communications in Difference Equations: Proceedings of the Fourth International
Conference on Difference Equations, page 227. CRC Press, 2000.
[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with
deep convolutional neural networks. Advances in Neural Information Processing Sys-
tems, 25, 2012.
[Kur75] Yoshiki Kuramoto. Self-entrainment of a population of coupled non-linear oscillators.
In International Symposium on Mathematical Problems in Theoretical Physics: Jan-
uary 23–29, 1975, Kyoto University, Kyoto/Japan, pages 420–422. Springer, 1975.
[Lac23] Daniel Lacker. Hierarchies, entropy, and quantitative propagation of chaos for mean
field diffusions. Probability and Mathematical Physics, 4(2):377–432, 2023.
[LCG` 20] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma,
and Radu Soricut. ALBERT: A Lite BERT for Self-supervised Learning of Language
Representations. In International Conference on Learning Representations, 2020.
[LCT18] Qianxiao Li, Long Chen, and Cheng Tai. Maximum principle based algorithms for
deep learning. Journal of Machine Learning Research, 18:1–29, 2018.
[Li21] Wuchen Li. Hessian metric via transport information geometry. Journal of Mathe-
matical Physics, 62(3), 03 2021.
[LJ18] Hongzhou Lin and Stefanie Jegelka. Resnet with one-neuron hidden layers is a uni-
versal approximator. Advances in Neural Information Processing Systems, 31, 2018.
[LLF23] Daniel Lacker and Luc Le Flem. Sharp uniform-in-time propagation of chaos. Prob-
ability Theory and Related Fields, pages 1–38, 2023.
[LLH` 20] Yiping Lu, Zhuohan Li, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, and
Tie-Yan Liu. Understanding and improving transformer from a multi-particle dy-
namic system point of view. In International Conference on Learning Representa-
tions, 2020.
[LLS22] Qianxiao Li, Ting Lin, and Zuowei Shen. Deep learning via dynamical systems:
An approximation perspective. Journal of the European Mathematical Society,
25(5):1671–1709, 2022.
[Loj63] Stanislaw Lojasiewicz. Une propriété topologique des sous-ensembles analytiques
réels. Les équations aux dérivées partielles, 117:87–89, 1963.
[LWLQ22] Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transform-
ers. AI Open, 2022.
[LXB19] Shuyang Ling, Ruitu Xu, and Afonso S Bandeira. On the landscape of synchro-
nization networks: A perspective from nonconvex optimization. SIAM Journal on
Optimization, 29(3):1879–1907, 2019.
[Mis24] MistralAI. https://github.com/mistralai/mistral-finetune/blob/main/model/transformer.py, 2024.
[MT14] Sebastien Motsch and Eitan Tadmor. Heterophilious dynamics enhances consensus.
SIAM Review, 56(4):577–621, 2014.
[MTG17] Johan Markdahl, Johan Thunberg, and Jorge Gonçalves. Almost global consensus on
the n-sphere. IEEE Transactions on Automatic Control, 63(6):1664–1675, 2017.
[NAB` 22] Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto, Sidak Pal Singh,
and Aurelien Lucchi. Signal propagation in transformers: Theoretical perspectives
and the role of rank collapse. Advances in Neural Information Processing Systems,
35:27198–27211, 2022.
[NLL` 24] Lorenzo Noci, Chuning Li, Mufan Li, Bobby He, Thomas Hofmann, Chris J Maddi-
son, and Dan Roy. The shaped transformer: Attention models in the infinite depth-
and-width limit. Advances in Neural Information Processing Systems, 36, 2024.
[Ope24] OpenAI. https://github.com/openai/gpt-2/blob/master/src/model.py, 2024.
[OR07] Felix Otto and Maria G Reznikoff. Slow motion of gradient flows. Journal of Differ-
ential Equations, 237(2):372–420, 2007.
[Ott01] Felix Otto. The geometry of dissipative evolution equations: the porous medium
equation. Communications in Partial Differential Equations, 26(1-2):101–174, 2001.
[PH22] Mary Phuong and Marcus Hutter. Formal algorithms for transformers. arXiv preprint
arXiv:2207.09238, 2022.
[PTTM17] René Pinnau, Claudia Totzeck, Oliver Tse, and Stephan Martin. A consensus-based
model for global optimization and its mean-field limit. Mathematical Models and
Methods in Applied Sciences, 27(01):183–204, 2017.
[RBZ23] Domenec Ruiz-Balet and Enrique Zuazua. Neural ODE control for classification,
approximation, and transport. SIAM Review, 65(3):735–773, 2023.
[RS23] Matthew Rosenzweig and Sylvia Serfaty. Global-in-time mean-field convergence for
singular Riesz-type diffusive flows. The Annals of Applied Probability, 33(2):954–998,
2023.
[RZZD23] Lixiang Ru, Heliang Zheng, Yibing Zhan, and Bo Du. Token contrast for weakly-
supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 3093–3102, 2023.
[SABP22] Michael E Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. Sinkformers:
Transformers with doubly stochastic attention. In International Conference on Ar-
tificial Intelligence and Statistics, pages 3515–3530. PMLR, 2022.
[Ser20] Sylvia Serfaty. Mean field limit for Coulomb-type flows. Duke Mathematical Journal,
169(15), 2020.
[SFG` 12] Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert R. G. Lanckriet. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.
[Shu13] Michael Shub. Global stability of dynamical systems. Springer Science & Business
Media, 2013.
[Str00] Steven H Strogatz. From Kuramoto to Crawford: exploring the onset of synchro-
nization in populations of coupled oscillators. Physica D: Nonlinear Phenomena,
143(1-4):1–20, 2000.
[SWJS24] Michael Scholkemper, Xinyi Wu, Ali Jadbabaie, and Michael Schaub. Residual
connections and normalization can provably prevent oversmoothing in gnns. arXiv
preprint arXiv:2406.02997, 2024.
[Sze39] Gabor Szegö. Orthogonal polynomials, volume 23. American Mathematical Soc.,
1939.
[Tad23] Eitan Tadmor. Swarming: hydrodynamic alignment with pressure. Bulletin of the
American Mathematical Society, 60(3):285–325, 2023.
[Tan17] Yan Shuo Tan. Energy optimization for distributions on the sphere and improvement to the Welch bounds. Electronic Communications in Probability, 22:1–12, 2017.
[Tay12] Richard Taylor. There is no non-zero stable fixed point for dense networks in the
homogeneous Kuramoto model. Journal of Physics A: Mathematical and Theoretical,
45(5):055102, 2012.
[TG22] Paulo Tabuada and Bahman Gharesifard. Universal approximation power of deep
residual neural networks through the lens of control. IEEE Transactions on Auto-
matic Control, 2022.
[TLTO23] Davoud Ataee Tarzanagh, Yingcong Li, Christos Thrampoulidis, and Samet Oymak.
Transformers as support vector machines. Advances in Neural Information Process-
ing Systems, 36, 2023.
[TSS20] Alex Townsend, Michael Stillman, and Steven H Strogatz. Dense networks that do
not synchronize and sparse ones that do. Chaos: An Interdisciplinary Journal of
Nonlinear Science, 30(8), 2020.
[VBC20] James Vuckovic, Aristide Baratin, and Remi Tachet des Combes. A mathematical
theory of attention. arXiv preprint arXiv:2007.02876, 2020.
[VCBJ` 95] Tamás Vicsek, András Czirók, Eshel Ben-Jacob, Inon Cohen, and Ofer Shochet.
Novel type of phase transition in a system of self-driven particles. Physical Review
Letters, 75(6):1226, 1995.
[Ver18] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications
in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics.
Cambridge University Press, 2018.
[Vil01] Cédric Villani. Limite de champ moyen. Cours de DEA, 2002:49, 2001.
[Vil09] Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009.
[VR23] Tanya Veeravalli and Maxim Raginsky. Nonlinear controllability and function repre-
sentation by neural stochastic differential equations. In Learning for Dynamics and
Control Conference, pages 838–850. PMLR, 2023.
[VSP` 17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in
Neural Information Processing Systems, 30, 2017.
[WAW` 24] Xinyi Wu, Amir Ajorlou, Yifei Wang, Stefanie Jegelka, and Ali Jadbabaie. On the role
of attention masks and layernorm in transformers. arXiv preprint arXiv:2405.18781,
2024.
[WAWJ24] Xinyi Wu, Amir Ajorlou, Zihui Wu, and Ali Jadbabaie. Demystifying oversmoothing
in attention-based graph neural networks. Advances in Neural Information Process-
ing Systems, 36, 2024.