
A MATHEMATICAL PERSPECTIVE ON TRANSFORMERS

BORJAN GESHKOVSKI, CYRIL LETROUIT, YURY POLYANSKIY, AND PHILIPPE RIGOLLET

Abstract. Transformers play a central role in the inner workings of large


language models. We develop a mathematical framework for analyzing Transformers based on their interpretation as interacting particle systems, with a particular emphasis on long-time clustering behavior. Our study explores the underlying theory and offers new perspectives for mathematicians as well as computer scientists.

Contents
1. Outline

Part 1. Modeling
2. Interacting particle system
3. Measure to measure flow map

Part 2. Clustering
4. A single cluster for small β
5. A single cluster for large β
6. The high-dimensional case

Part 3. Further questions
7. Dynamics on the circle
8. BBGKY hierarchy
9. General matrices
10. Approximation, control, training
Acknowledgments

Appendix
Appendix A. Proof of Theorem 4.1
Appendix B. Proof of Theorem 5.1
Appendix C. Proof of Theorem 4.3
Appendix D. Proof of Theorem 6.9
References

2020 Mathematics Subject Classification. Primary: 34D05, 34D06, 35Q83; Secondary: 52C17.
Key words and phrases. Transformers, self-attention, interacting particle systems, clustering,
gradient flows.

1. Outline
The introduction of Transformers in 2017 by Vaswani et al. [VSP+17] marked
a significant milestone in the development of neural network architectures. Cen-
tral to this contribution is self-attention, a novel mechanism which distinguishes
Transformers from traditional architectures, and which plays a substantial role in
their superior practical performance. In fact, this innovation has been a key cata-
lyst for the progress of artificial intelligence in areas such as computer vision and
natural language processing, notably with the emergence of large language models.
As a result, understanding the mechanisms by which Transformers, and especially
self-attention, process data is a crucial yet largely uncharted research area.
A common characteristic of deep neural networks (DNNs) is their compositional nature: data is processed sequentially, layer by layer, resulting in a discrete-time dynamical system (we refer the reader to the textbook [GBC16] for a general introduction). This perspective has been successfully employed to model residual neural networks (see Section 2.1 for more details) as continuous-time dynamical systems called neural ordinary differential equations (neural ODEs) [CRBD18, E17, HR17]. In this context, an input $x(0) \in \mathbb{R}^d$, say an image, evolves according to a given time-varying velocity field as $\dot{x}(t) = v_t(x(t))$ over some time interval $(0, T)$. As such, a DNN can be seen as a flow map $x(0) \mapsto x(T)$ from $\mathbb{R}^d$ to $\mathbb{R}^d$. Even within the restricted class of velocity fields $\{v_t\}_{t \ge 0}$ imposed by classical DNN architectures, such flow maps enjoy strong approximation properties, as exemplified by a long line of work on these questions [LJ18, ZGUA20, LLS22, TG22, RBZ23, CLLS23].
Following [SABP22] and [VBC20], we observe that Transformers are in fact flow maps on $\mathcal{P}(\mathbb{R}^d)$, the space of probability measures over $\mathbb{R}^d$. To realize this flow map from measures to measures, Transformers evolve a mean-field interacting particle system. More specifically, every particle (called a token in this context) follows the flow of a vector field which depends on the empirical measure of all particles. In turn, the continuity equation governs the evolution of the empirical measure of particles, whose long-time behavior is of crucial interest. In this regard, our main
observation is that particles tend to cluster under these dynamics. This phenomenon is of particular relevance in learning tasks such as next-token prediction, wherein one seeks to map a given input sequence (i.e., a sentence) of n tokens (i.e., words) onto a given next token. In this case, the output measure encodes the probability distribution of the next token, and its clustering indicates a small number of possible outcomes. Our results on a simplified but insightful toy model indicate that the limiting distribution is actually a point mass, leaving no room for diversity or randomness, which is at odds with practical observations. This apparent paradox is resolved by the existence of a long-time metastable state. As can be seen from Figures 3 and 5, the Transformer flow appears to possess two different time-scales: in a first phase, tokens quickly form a few clusters, while in a second (much slower) phase, through the process of pairwise merging of clusters, all tokens finally collapse to a single point. This appears to corroborate behavior observed empirically in trained Transformer models, which goes under the names token uniformity, oversmoothing [CZC+22, RZZD23, GWDW23, WAWJ24, WAW+24, DBK24, SWJS24], or rank-collapse [DCL21, FZH+22, NAB+22, JDB23, ZMZ+23, ZLL+23, NLL+24, BHK24, CNQG24]; see also Figure 1.

The goal of this manuscript is twofold. On the one hand, we aim to provide a general and accessible framework to study Transformers from a mathematical perspective. In particular, the structure of these interacting particle systems enables concrete connections to established mathematical topics, such as nonlinear transport equations, Wasserstein gradient flows, collective behavior models, and optimal configurations of points on spheres, among others. On the other hand, we describe several promising research directions with a particular focus on the long-time clustering phenomenon. The main results we present are new, and we also provide what we believe are interesting open problems throughout the paper.
The rest of the paper is arranged in three parts.
Part 1: Modeling. We define an idealized model of the Transformer architecture that captures two of the main characteristics of Transformers: self-attention and layer-normalization. Following a perspective put forward for classical architectures such as ResNets [CRBD18, E17, HR17], we view the successive layers of a neural network as time discretizations of a dynamical system of interacting particles. Layer-normalization effectively constrains particles to evolve on the unit sphere $\mathbb{S}^{d-1}$, whereas self-attention is the particular nonlinear coupling of the particles done through the empirical measure (Section 2). In turn, the empirical measure evolves according to the continuity partial differential equation (Section 3). Even after significant simplifications, this toy model retains macroscopic characteristics of trained Transformers, namely clustering. We also introduce a simpler surrogate model for self-attention which has the convenient property of being a Wasserstein gradient flow [AGS05] for an energy functional that is well-studied in the context of optimal configurations of points on the sphere, and which sheds a complementary light on the source of clustering.
Part 2: Clustering. In this part we recall existing, and establish new, mathematical results that indicate clustering of tokens in the long-time limit. (See Figure 2 for a summary.) Our first results (Theorem 4.3 in Section 4 and Theorem 5.1 in Section 5) are in extreme regimes of a temperature parameter $\beta^{-1}$ that appears in the equation. We then move to the high-dimensional case in Section 6, where we begin by recalling Theorem 6.1, a result of [MTG17] recently revisited in [CRMB24], which entails long-time clustering at any temperature when $d \ge 3$. We provide an exponential rate of convergence in Theorem 6.3 when $d \ge n$, where $n$ denotes the number of particles. We complement this result with an even more precise characterization of the rate of contraction of particles into a cluster. Namely, we describe the histogram of all inter-particle distances, and the time at which all particles are already nearly clustered (Theorem 6.9).
Part 3: Further questions. We propose potential avenues for future research, largely in the form of open questions substantiated by numerical observations. We first focus on the case $d = 2$ (Section 7) and elicit a link to Kuramoto oscillators. We briefly show in Section 9.1 how a simple and natural modification of our model leads to non-trivial questions related to optimal configurations on the sphere. The remaining sections explore interacting particle systems that allow for parameter tuning of the Transformer architecture, a key feature of practical implementations.

Part 1. Modeling

We begin this part by presenting the mathematical model for a Transformer in Section 2. Throughout the paper we focus on a simplified version that includes the self-attention mechanism as well as layer normalization (Section 2.2), but excludes additional feed-forward layers commonly used in practice (Section 2.3). This nonetheless leads to a highly nonlinear mean-field interacting particle system. In turn, this system implements, via the continuity equation, a flow map from initial to terminal distributions of particles that we present in Section 3.

2. Interacting particle system


Before writing down the Transformer model, we first provide a brief preliminary
discussion to clarify our methodological choice of treating the discrete layer indices
in the model as a continuous time variable in Section 2.1, echoing previous work on
ResNets. The specifics of the toy Transformer model are presented in Section 2.2,
and a complete model is presented in Section 2.3.

2.1. Residual neural networks. One of the standard paradigms in machine


learning is that of supervised learning, where one aims to approximate an unknown function $f : \mathbb{R}^d \to \mathbb{R}^m$ from data, say $D = \{x^{(i)}, f(x^{(i)})\}_{i \in [N]}$. This is typically done by choosing one among an arsenal of possible parametric models, whose parameters are then fit to the data by means of minimizing some user-specified cost. With the advent of graphical processing units (GPUs) in the realm of computer vision [KSH12], large neural networks have become computationally accessible, resulting in their popularity as one such parametric model.
Within the class of neural networks, residual neural networks (ResNets for short)
have become a staple DNN architecture since their introduction in [HZRS16]. In
their most basic form, ResNets approximate a function $f$ at $x \in \mathbb{R}^d$ through a sequence of affine transformations, a component-wise nonlinearity, and skip connections. Put in formulae,
\[
(2.1)\qquad
\begin{cases}
x(k+1) = x(k) + w(k)\,\sigma\big(a(k)x(k) + b(k)\big) & \text{for } k \in \{0, \dots, L-1\}\\
x(0) = x.
\end{cases}
\]
Here $\sigma$ is a Lipschitz function applied component-wise to the input vector, while $\theta(\cdot) = (w(\cdot), a(\cdot), b(\cdot)) \in \mathbb{R}^{d \times d} \times \mathbb{R}^{d \times d} \times \mathbb{R}^d$ are trainable parameters. We say that (2.1) has $L \ge 1$ hidden layers (or $L+1$ layers, or is of depth $L$). The output $x^{(i)}(L)$ serves as a representation of the input $x^{(i)}$ that is then fed into a last layer that corresponds to a classical learning task such as linear or logistic regression in order to predict the label $f(x^{(i)})$. One can also devise generalizations of (2.1), for instance in which matrix-vector multiplications are replaced by discrete convolutions in order to reflect other common architectures such as convolutional neural networks [GBC16, Chapter 9]. The key element that all these models share is that they all have skip-connections, namely, the previous step $x(k)$ appears explicitly in the iteration for the next one.
One upside of (2.1), which is the one of interest to our narrative, is that the layer index $k$ can naturally be interpreted as a time variable, motivating the continuous-time analogue
\[
(2.2)\qquad
\begin{cases}
\dot{x}(t) = w(t)\,\sigma\big(a(t)x(t) + b(t)\big) & \text{for } t \in (0, T)\\
x(0) = x.
\end{cases}
\]

These are dubbed neural ordinary differential equations (neural ODEs). Since their
introduction in [CRBD18, E17, HR17], neural ODEs have emerged as a flexible
mathematical framework to implement and study ResNets. We use this abstraction
here for convenience, as dynamical systems are more naturally defined in continuous
time. Although this approach is helpful for exposition, we emphasize that all the
results presented can also be derived in discrete time using the same techniques.
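To make this correspondence concrete, here is a minimal numerical sketch (not taken from the paper; the dimension, depth, time-frozen parameters, and the choice $\sigma = \tanh$ are illustrative assumptions): the ResNet update (2.1) is an explicit Euler step of (2.2), and refining the step recovers the continuous flow.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, T = 4, 100, 1.0            # illustrative dimension, depth, and time horizon
h = T / L                        # Euler step size: layer k corresponds to time t = k*h

sigma = np.tanh                  # a Lipschitz nonlinearity applied component-wise

# Parameters theta = (w, a, b); frozen in time here so that both
# discretizations below integrate the same velocity field.
w = rng.normal(size=(d, d)) / np.sqrt(d)
a = rng.normal(size=(d, d)) / np.sqrt(d)
b = rng.normal(size=d)

def velocity(x):
    """Right-hand side v_t(x) = w sigma(a x + b) of the neural ODE (2.2)."""
    return w @ sigma(a @ x + b)

x0 = rng.normal(size=d)

# A ResNet with L hidden layers, i.e. explicit Euler with step h (cf. (2.1),
# where the unit step is absorbed into the weights):
x_resnet = x0.copy()
for _ in range(L):
    x_resnet = x_resnet + h * velocity(x_resnet)

# A 10x finer Euler discretization of the same ODE, as a proxy for x(T):
x_ode = x0.copy()
for _ in range(10 * L):
    x_ode = x_ode + (h / 10) * velocity(x_ode)

print("||x_resnet - x_ode|| =", np.linalg.norm(x_resnet - x_ode))  # small
```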

2.2. The interacting particle system. Unlike ResNets, which operate on a single input vector $x(0) \in \mathbb{R}^d$ at a time, Transformers operate on a sequence of vectors of length $n$, namely $(x_i(0))_{i \in [n]} \in (\mathbb{R}^d)^n$. This perspective is rooted in natural language processing, where each vector represents a word, and the entire sequence a sentence or a paragraph. In particular, it allows one to process words together with their context. A sequence element $x_i(0) \in \mathbb{R}^d$ is called a token, and the entire sequence $(x_i(0))_{i \in [n]}$ a prompt. We use the words "token" and "particle" interchangeably.
All practical implementations of Transformers make use of layer normalization [BKH16], most commonly in the form of root mean square (RMS) normalization [ZS19]. (See lines 105–116 in [Mis24] for instance.) RMS normalization takes the sequence of tokens output after each layer, divides each token by its Euclidean norm¹ (plus a small parameter to avoid a possible division by zero), and multiplies the result by a trained diagonal matrix. This process aims to ensure that tokens do not diverge, thus avoiding rounding errors and overflow. The result is an evolution on a time-varying axis-aligned ellipsoid. To simplify the presentation and obtain insight and precise results, we assume that the trained diagonal matrix is equal to the identity, so that we work on the unit sphere $\mathbb{S}^{d-1}$ throughout. This simplification is justified empirically in the trained ALBERT XLarge v2 model described in Figure 1, wherein this diagonal matrix is constant over all layers, with entries of mean value equal to 0.44 and standard deviation equal to 0.008. Furthermore, current text embeddings provided by OpenAI, namely text-embedding-3-small and text-embedding-3-large, return norm-one embedding vectors. While we can only speculate as to the actual implementation of these models, this is an indication that layer normalization could be as simple as the one used in our toy model.
A Transformer is then a flow map on $(\mathbb{S}^{d-1})^n$: the input sequence $(x_i(0))_{i \in [n]} \in (\mathbb{S}^{d-1})^n$ is an initial condition which is evolved through the dynamics
\[
(2.3)\qquad
\dot{x}_i(t) = \mathrm{P}^{\perp}_{x_i(t)}\Bigg(\frac{1}{Z_{\beta,i}(t)} \sum_{j=1}^{n} e^{\beta \langle Q(t) x_i(t), K(t) x_j(t) \rangle}\, V(t) x_j(t)\Bigg)
\]
for all $i \in [n]$ and $t \ge 0$. (We refer the reader to (2.8) and Section 2.3 for the full model.) Here and henceforth
\[
\mathrm{P}^{\perp}_{x} y = y - \langle x, y \rangle x
\]
denotes the projection of $y \in \mathbb{R}^d$ onto $\mathrm{T}_x \mathbb{S}^{d-1}$. The partition function $Z_{\beta,i}(t) > 0$ reads
\[
(2.4)\qquad
Z_{\beta,i}(t) = \sum_{k=1}^{n} e^{\beta \langle Q(t) x_i(t), K(t) x_k(t) \rangle},
\]

¹The original form instead consisted in an entry-wise standardization of every token, namely subtracting the mean of all tokens, then dividing by the standard deviation.

where $(Q(\cdot), K(\cdot), V(\cdot))$ (standing for Query, Key, and Value) are parameter matrices learned from data, and $\beta > 0$ is a fixed number intrinsic to the model², which can be seen as an inverse temperature using terminology from statistical physics. Note that $Q(\cdot), K(\cdot)$ need not be square.
The interacting particle system (2.3)–(2.4), a simplified version of which was first written down in [LLH+20, DGCC21, SABP22], importantly contains the true novelty that Transformers carry with regard to other models: the self-attention mechanism
\[
(2.5)\qquad
A_{ij}(t) := \frac{e^{\beta \langle Q(t) x_i(t), K(t) x_j(t) \rangle}}{Z_{\beta,i}(t)}, \qquad (i, j) \in [n]^2,
\]
which is the nonlinear coupling mechanism in the interacting particle system. The $n \times n$ stochastic matrix $A(t)$ (rows are probability vectors) is called the self-attention matrix. The wording attention stems from the fact that $A_{ij}(t)$ captures the attention given by particle $i$ to particle $j$ relative to all particles $\ell \in [n]$. In particular, a particle pays attention to its neighbors, where neighborhoods are dictated by the matrices $Q(t)$ and $K(t)$ in (2.5).
It has been observed numerically that the probability vectors $(A_{ij}(\cdot))_{j \in [n]}$ (for $i \in [n]$) in a trained self-attention matrix exhibit behavior related to the syntactic and semantic structure of sentences in natural language processing tasks (see [VSP+17, Figures 3–5]). To illustrate our conclusions as pedagogically as possible, throughout the paper we focus on a simplified scenario wherein the parameter matrices $(Q, K, V)$ are constant, and even all equal to the identity unless stated otherwise, resulting in the dynamics
\[
(\text{SA})\qquad
\dot{x}_i(t) = \mathrm{P}^{\perp}_{x_i(t)}\Bigg(\frac{1}{Z_{\beta,i}(t)} \sum_{j=1}^{n} e^{\beta \langle x_i(t), x_j(t) \rangle}\, x_j(t)\Bigg)
\]
for $i \in [n]$ and $t \ge 0$ and, as before,
\[
(2.6)\qquad
Z_{\beta,i}(t) = \sum_{k=1}^{n} e^{\beta \langle x_i(t), x_k(t) \rangle}.
\]
The appearance of clusters in Transformers is actually corroborated by numerical experiments with trained models (see Figure 1). While we focus on a much simplified model, numerical evidence shows that the clustering phenomenon looks qualitatively the same in the cases $Q = K = V = \lambda I_d$, $\lambda > 0$, and generic random $(Q, K, V)$ (see Figures 3 and 5 for instance). We refer the interested reader directly to Sections 4, 5, 6; here, we continue the presentation on the modeling of different mechanisms appearing in the Transformer architecture.
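For readers who wish to experiment, the following is a minimal simulation sketch of (SA) with $Q = K = V = I_d$ (not from the paper; the values of $n$, $d$, $\beta$, the step size, and the projected explicit Euler scheme are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, beta = 32, 3, 4.0          # illustrative: n tokens on the sphere S^{d-1}
h, steps = 0.05, 4000            # explicit Euler step size and number of steps

x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)    # initial points on S^{d-1}

def sa_drift(x, beta):
    """Right-hand side of (SA) with Q = K = V = I_d."""
    G = np.exp(beta * x @ x.T)                   # G[i, j] = exp(beta <x_i, x_j>)
    A = G / G.sum(axis=1, keepdims=True)         # self-attention matrix (2.5)
    v = A @ x                                    # attention-weighted average
    # project onto the tangent space T_{x_i} S^{d-1}: v - <x_i, v> x_i
    return v - np.sum(v * x, axis=1, keepdims=True) * x

for _ in range(steps):
    x = x + h * sa_drift(x, beta)
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # re-project onto the sphere

# Clustering diagnostic: the minimum pairwise inner product approaches 1
# when all particles collapse to a single cluster.
print("min_{i,j} <x_i, x_j> =", (x @ x.T).min())
```

Setting beta = 0 recovers the dynamics (4.1) studied in Section 4, and replacing the row normalization by $1/n$ yields the surrogate model (USA) of Section 3.3.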
Remark 2.1 (Collective behavior). The dynamics (SA) bear a strong resemblance to the vast literature on nonlinear systems arising in the modeling of collective behavior. In addition to the connection to the classical Kuramoto model describing synchronization of oscillators [Kur75, ABV+05] (made evident in Section 7.2),

²In practical implementations the inner products are multiplied by $d^{-\frac12}$, which along with the typical magnitude of $Q, K$ leads to the appearance of $\beta$.
³ALBERT XLarge v2 contains all the mechanisms described in this text, namely, it is a system of the form (2.8) (or rather the discretization thereof) with 12 or 24 layers. The sequence length $n$ is of the order of 512 or 1024, and the tokens evolve in $\mathbb{R}^{4096}$. The dynamics are therefore high-dimensional, lending weight to assumptions made later on (Section 6).
[Figure 1: six histogram panels at Layers 0, 1, 21, 22, 23, 24 (top block) and six at Layers 25, 26, 27, 46, 47, 48 (bottom block); the plots themselves are not reproduced here.]

Figure 1. Histogram of $\{\langle x_i(t), x_j(t) \rangle\}_{(i,j) \in [n]^2,\, i \ne j}$ at different layers $t$ in the context of the trained ALBERT XLarge v2 model ([LCG+20] and https://huggingface.co/albert-xlarge-v2)³, which has constant parameter matrices. Here we randomly selected a single prompt, which in this context is a paragraph from a random Wikipedia entry, and then generate the histogram of the pairwise inner products. We see the progressive emergence of clusters all the way to the 24th (and last) hidden layer (top), as evidenced by the growing mass at 1. If the number of layers is increased, up to 48 say, the clustering is further enhanced (bottom).
Transformers are perhaps most similar to the Krause model [Kra00]
\[
\dot{x}_i(t) = \sum_{j=1}^{n} a_{ij}\,\big(x_j(t) - x_i(t)\big), \qquad a_{ij} = \frac{\phi(\|x_i - x_j\|^2)}{\sum_{k=1}^{n} \phi(\|x_i - x_k\|^2)},
\]
which is non-symmetric in general ($a_{ij} \ne a_{ji}$), much like (2.3). When $\phi$ is compactly supported, it has been shown in [JM14] that the particles $x_i(t)$ assemble in several clusters as $t \to +\infty$. Other related models include those of Vicsek [VCBJ+95], Hegselmann–Krause [HK02] and Cucker–Smale [CS07]. All these models exhibit a

clustering behavior under various assumptions (see [MT14, Tad23] and the references therein). Yet, none of the opinion dynamics models discussed above contain parameters appearing within the nonlinearity as in (SA), whilst being set on the sphere.
Remark 2.2 (Permutation equivariance). A function $f : (\mathbb{S}^{d-1})^n \to (\mathbb{S}^{d-1})^n$ is permutation equivariant if $f(\pi X) = \pi(f_1(X), \dots, f_n(X))$ for any $X \in (\mathbb{R}^d)^n$ and for any permutation $\pi \in \mathbf{S}_n$ of $n$ elements. Otherwise put, if we permute the input $X$, then the output $f(X)$ is permuted in the same way. Given $t > 0$, the Transformer (SA), mapping $(x_i(0))_{i \in [n]} \mapsto (x_i(t))_{i \in [n]}$, is permutation-equivariant on $(\mathbb{S}^{d-1})^n$.
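As a quick numerical sanity check of Remark 2.2 (again a sketch, reusing the sa_drift function and the variables n, x, beta, rng from the previous snippet), one can verify that permuting the input permutes the vector field in the same way:

```python
# Numerical check of Remark 2.2, reusing sa_drift, x, beta and rng from above:
# permuting the input permutes the output of the vector field in the same way.
perm = rng.permutation(n)
lhs = sa_drift(x[perm], beta)        # f(pi X)
rhs = sa_drift(x, beta)[perm]        # pi f(X)
print("equivariance defect:", np.abs(lhs - rhs).max())   # zero up to round-off
```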
2.3. Toward the complete Transformer. There are a couple of additional mechanisms used in practical implementations that we do not explicitly address or use in this study. The mathematical analysis of these mechanisms remains open.
2.3.1. Multi-headed attention. Practical implementations spread out the computation of the self-attention mechanism at every $t$ through a sequence of heads, leading to the so-called multi-headed self-attention. This consists in considering the following modification of (SA):
\[
(2.7)\qquad
\dot{x}_i(t) = \mathrm{P}^{\perp}_{x_i(t)}\Bigg(\sum_{h=1}^{H} \sum_{j=1}^{n} \frac{e^{\beta \langle Q_h(t) x_i(t), K_h(t) x_j(t) \rangle}}{Z_{\beta,i,h}(t)}\, V_h(t) x_j(t)\Bigg),
\]
where $Z_{\beta,i,h}(t)$ is defined as in (2.4) for the matrices $Q_h(t)$ and $K_h(t)$. The integer $H \ge 1$ is called the number of heads⁴.
The introduction of multiple heads also allows for drawing some interesting parallels with the literature on feed-forward neural networks, such as ResNets (2.1). Considerable effort has been expended to understand 2-layer neural networks with width tending to $+\infty$; more precisely, consider (2.1) with $L = 1$, $w \in \mathbb{R}^{d \times \ell}$, $a \in \mathbb{R}^{\ell \times d}$, and $\ell \to +\infty$. The infinite-width limit for Transformers is in fact very natural, as it is realized by stacking an arbitrarily large number of heads: $H \to +\infty$. Hence, the same questions as for 1-hidden-layer neural networks may be asked: for instance the question of universal approximation, in the vein of [Cyb89, Bar93].
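The following sketch extends the earlier snippets to the multi-headed drift (2.7); the number of heads, the random Gaussian choice of $(Q_h, K_h, V_h)$, and the use of square matrices are illustrative assumptions rather than a trained model.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, H, beta = 16, 8, 4, 1.0        # illustrative sizes; H = number of heads

# One (Q_h, K_h, V_h) triple per head; square matrices, as in the mathematical
# setup of this section (Gaussian entries are placeholders, not trained weights).
Q = rng.normal(size=(H, d, d)) / np.sqrt(d)
K = rng.normal(size=(H, d, d)) / np.sqrt(d)
V = rng.normal(size=(H, d, d)) / np.sqrt(d)

def multihead_drift(x):
    """Right-hand side of (2.7): the heads' attention outputs are summed."""
    v = np.zeros_like(x)
    for head in range(H):
        # G[i, j] = exp(beta <Q_h x_i, K_h x_j>)
        G = np.exp(beta * (x @ Q[head].T) @ (x @ K[head].T).T)
        A = G / G.sum(axis=1, keepdims=True)     # per-head attention matrix
        v += A @ (x @ V[head].T)                 # sum over heads
    return v - np.sum(v * x, axis=1, keepdims=True) * x   # tangential projection

x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)
print(multihead_drift(x).shape)      # (n, d)
```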
2.3.2. Feed-forward layers. The complete Transformer dynamics combine all of the above mechanisms with a feed-forward layer; in the discrete-time context, this is actually done by using a Lie–Trotter splitting scheme for
\[
(2.8)\qquad
\dot{x}_i(t) = \mathrm{P}^{\perp}_{x_i(t)}\Bigg(\sum_{h=1}^{H} \sum_{j=1}^{n} \frac{e^{\beta \langle Q_h(t) x_i(t), K_h(t) x_j(t) \rangle}}{Z_{\beta,i,h}(t)}\, V_h(t) x_j(t) + w(t)\,\sigma\big(a(t) x_i(t) + b(t)\big)\Bigg),
\]
where $w(t)$, $a(t)$, $b(t)$ and $\sigma$ are all as in (2.2). The interested reader is referred to [LLH+20, PH22] for all the details⁵. The feed-forward layers (convolutional layers can alternatively be considered) are of critical importance in applications and drive the existing results on approximation properties of Transformers [YBR+19]. Nevertheless, the analysis of this model is beyond the scope of our current methods.

⁴In practical implementations, $H$ is a divisor of $d$, and the query and key matrices $Q_h(t)$ and $K_h(t)$ are $\frac{d}{H} \times d$ rectangular. This allows for further parallelization of computations and increased expressiveness. For mathematical purposes, we focus on working with arbitrary integers $H$, and square weight matrices $Q_h$ and $K_h$.
⁵And lines 123–130 in [Ope24] for some relevant source code.

3. Measure to measure flow map


An important aspect of Transformers is that they are not hard-wired to take into account the order of the input sequence, contrary to other architectures used for natural language processing such as recurrent neural networks. In these applications, each token $x_i(0) \in \mathbb{R}^d$ contains not only a word embedding $w_i \in \mathbb{R}^d$, but also an additional positional encoding (we postpone a discussion to Remark 3.2) which allows tokens to also carry their position in the input sequence. Therefore, an input sequence is perfectly encoded as a set of tokens $\{x_1(0), \dots, x_n(0)\}$, or equivalently as the empirical measure of its constituent tokens $\frac{1}{n} \sum_{i=1}^{n} \delta_{x_i(0)}$. Recall that the output of a Transformer is also a probability measure, namely $\frac{1}{n} \sum_{i=1}^{n} \delta_{x_i(t)}$, albeit one that captures the likelihood of the next token. As a result, one can view Transformers as flow maps between probability measures⁶ on $\mathbb{S}^{d-1}$. To describe this flow
map, we appeal to the continuity equation, which governs precisely the evolution of
the empirical measure of particles subject to dynamics. This perspective is already
present in [SABP22], the only modification here being that we add the projection
on the sphere arising from layer normalization.
After introducing the continuity equation in Section 3.1, we show that a particular interaction energy functional, which is maximized at any point mass, increases along solutions thereof in Section 3.2. Motivated by this monotonicity property, in Section 3.3 we propose an illustrative modified model which has the nice property of being a Wasserstein gradient flow for this energy. Finally, in Section 3.4, we demonstrate that the original equation presented in Section 3.1 is itself a gradient flow for the same energy, upon changing the metric underlying the definition of the gradient.

3.1. The continuity equation. The vector field driving the evolution of a single particle in (SA) clearly depends on all $n$ particles. In fact, one can equivalently rewrite the dynamics as
\[
(3.1)\qquad
\dot{x}_i(t) = \mathcal{X}[\mu(t)](x_i(t))
\]
for all $i \in [n]$ and $t \ge 0$, where
\[
\mu(t, \cdot) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i(t)}(\cdot)
\]
is the empirical measure, while the vector field $\mathcal{X}[\mu] : \mathbb{S}^{d-1} \to \mathrm{T}\mathbb{S}^{d-1}$ reads
\[
(3.2)\qquad
\mathcal{X}[\mu](x) = \mathrm{P}^{\perp}_{x}\bigg(\frac{1}{Z_{\beta,\mu}(x)} \int e^{\beta \langle x, y \rangle}\, y\, \mathrm{d}\mu(y)\bigg)
\]
with
\[
(3.3)\qquad
Z_{\beta,\mu}(x) = \int e^{\beta \langle x, y \rangle}\, \mathrm{d}\mu(y).
\]

⁶See [DBPC19, VBC20, ZB21] for further related work on neural networks acting on probability measures.

In other words, (SA) is a mean-field interacting particle system. The evolution of $\mu(t)$ is governed by the continuity equation⁷
\[
(3.4)\qquad
\begin{cases}
\partial_t \mu + \operatorname{div}(\mathcal{X}[\mu]\mu) = 0 & \text{on } \mathbb{R}_{\ge 0} \times \mathbb{S}^{d-1}\\
\mu_{|t=0} = \mu(0) & \text{on } \mathbb{S}^{d-1},
\end{cases}
\]
satisfied in the sense of distributions.
Remark 3.1 (Well-posedness). Global existence of weak, measure-valued solutions to (3.4) for arbitrary initial conditions $\mu(0) \in \mathcal{P}(\mathbb{S}^{d-1})$ follows by arguing exactly as in [GLPR24, Lemma A.3]. Here and henceforth, $\mathcal{P}(\mathbb{S}^{d-1})$ stands for the set of Borel probability measures on $\mathbb{S}^{d-1}$.
Remark 3.2 (Positional encoding). For the sake of completeness, in this brief segue we discuss a few ways to perform positional encoding. The original one, proposed in [VSP+17], proceeds as follows. Consider a sequence $(w_i)_{i \in [n]} \in (\mathbb{R}^d)^n$ of word embeddings. Then the positional encoding $p_i \in \mathbb{R}^d$ of the $i$-th word embedding is defined as $(p_i)_{2k} = \sin\big(\frac{i}{M^{2k/d}}\big)$ and $(p_i)_{2k+1} = \cos\big(\frac{i}{M^{2k/d}}\big)$ for $k \in [d/2 - 1]$, where $M > 0$ is a user-defined scalar equal to $10^4$ in [VSP+17]. The $i$-th token is then defined as the addition: $x_i(0) = w_i + p_i$. Subsequent works simply use either a random⁸ positional encoding (i.e., $p_i$ is just some random vector) or a trained transformation. The addition can also be replaced with a concatenation $x_i(0) = [w_i; p_i]$. (See [LWLQ22, XZ23] for details.)
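A minimal sketch of the original sinusoidal positional encoding described above (assuming, for simplicity, that $d$ is even; the word embeddings below are random placeholders):

```python
import numpy as np

def positional_encoding(n, d, M=10_000):
    """Sinusoidal positional encoding of [VSP+17] (d assumed even):
    (p_i)_{2k} = sin(i / M^{2k/d}), (p_i)_{2k+1} = cos(i / M^{2k/d})."""
    i = np.arange(n)[:, None]          # token positions i = 0, ..., n-1
    k = np.arange(d // 2)[None, :]     # frequency index k
    angles = i / M ** (2 * k / d)
    p = np.zeros((n, d))
    p[:, 0::2] = np.sin(angles)
    p[:, 1::2] = np.cos(angles)
    return p

w = np.random.default_rng(3).normal(size=(5, 8))   # placeholder word embeddings
x0 = w + positional_encoding(5, 8)                 # tokens x_i(0) = w_i + p_i
print(x0.shape)                                    # (5, 8)
```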
Remark 3.3 (Mean-field limit). Although the analysis in this paper is focused on the flow of the empirical measure, one can also consider (3.4) for arbitrary initial probability measures $\mu(0) \in \mathcal{P}(\mathbb{S}^{d-1})$. Both views can be linked through a mean-field limit-type result, which can be shown by making use of the Lipschitz nature of the vector field $\mathcal{X}[\mu]$. The argument is classical and dates back at least to the work of Dobrushin [Dob79]. Consider an initial empirical measure $\mu_n(0) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i(0)}$, and suppose that the points $x_i(0)$ are such that $\lim_{n \to +\infty} W_1(\mu_n(0), \mu(0)) = 0$ for some probability measure $\mu(0) \in \mathcal{P}(\mathbb{S}^{d-1})$. (Here $W_1$ denotes the 1-Wasserstein distance; see [Vil09] for definitions.) Consider the solutions $\mu_n(t)$ and $\mu(t)$ to (3.4) with initial data $\mu_n(0)$ and $\mu(0)$ respectively. Dobrushin's argument is then centered around the estimate
\[
W_1(\mu_n(t), \mu(t)) \le e^{O(1)|t|}\, W_1(\mu_n(0), \mu(0))
\]
for any $t \in \mathbb{R}$, which in the case of (3.4) can be shown without much difficulty (see [Vil01, Chapter 4, Section 1] or [Gol16, Section 1.4.2]). This elementary mean-field limit result has a couple of caveats. First, the time-dependence is exponential. Second, if one assumes that the points $x_i(0)$ are sampled i.i.d. according to $\mu_0$, then $W_1(\mu_n(0), \mu(0))$ converges to zero at rate $n^{-\frac{1}{d-1}}$ [Dud69, BLG14], which deteriorates quickly when $d$ grows. Dimension-free convergence has been established in some cases, for instance by replacing the Wasserstein distance with a more careful choice of metric as in [HHL23, Lac23] or more generally in [SFG+12]. Similarly, the exponential time-dependence might also be improved, as recent works in the context of flows governed by Riesz/Coulomb singular kernels, with diffusion, can attest [RS23, GBM21] (see [LLF23] for a result in the smooth kernel case). We do not address this question in further detail here. For more references on this well-established topic, the reader is referred to [Vil01, Gol16, Ser20] and the references therein.

⁷Unless stated otherwise, $\nabla$ and $\operatorname{div}$ henceforth stand for the spherical gradient and divergence respectively, and all integrals are taken over $\mathbb{S}^{d-1}$.
⁸This rationale supports the assumption that initial tokens are drawn at random, which we make use of later on.
3.2. The interaction energy. One can naturally ask whether the evolution in (3.4) admits some quantities which are monotonic when evaluated along the flow. As it turns out, the interaction energy
\[
(3.5)\qquad
\mathsf{E}_{\beta}[\mu] = \frac{1}{2\beta} \iint e^{\beta \langle x, x' \rangle}\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(x')
\]
is one such quantity. Indeed,
\[
(3.6)\qquad
\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}t} \mathsf{E}_{\beta}[\mu(t)] &= \beta^{-1} \iint e^{\beta \langle x, x' \rangle}\, \mathrm{d}\partial_t\mu(t, x)\, \mathrm{d}\mu(t, x')\\
&= \int \mathcal{X}[\mu(t)](x) \cdot \nabla\bigg( \beta^{-1} \int e^{\beta \langle x, x' \rangle}\, \mathrm{d}\mu(t, x') \bigg)\, \mathrm{d}\mu(t, x)\\
&= \int \big\| \mathcal{X}[\mu(t)](x) \big\|^2\, Z_{\beta,\mu(t)}(x)\, \mathrm{d}\mu(t, x)
\end{aligned}
\]
for any $t \ge 0$, by using integration by parts. Recalling the definition of $Z_{\beta,\mu}(x)$ in (3.3), we see that $e^{-\beta} \le Z_{\beta,\mu}(x) \le e^{\beta}$ for all $x \in \mathbb{S}^{d-1}$. The identity (3.6) therefore indicates that $\mathsf{E}_{\beta}$ increases along trajectories of (3.4). (Similarly, should $V = -I_d$, the energy $\mathsf{E}_{\beta}$ would decrease along trajectories.) This begs the question of characterizing the global minima and maxima of $\mathsf{E}_{\beta}$, which is the goal of the following result.
Proposition 3.4. Let $\beta > 0$ and $d \ge 2$. The unique global minimizer of $\mathsf{E}_{\beta}$ over $\mathcal{P}(\mathbb{S}^{d-1})$ is the uniform measure⁹ $\sigma_d$. Any global maximizer of $\mathsf{E}_{\beta}$ over $\mathcal{P}(\mathbb{S}^{d-1})$ is a Dirac mass $\delta_{x^*}$ centered at some point $x^* \in \mathbb{S}^{d-1}$.
This result lends credence to our nomenclature of the case $V = I_d$ as attractive, and $V = -I_d$ as repulsive. The reader should be wary, however, that in this result we are minimizing or maximizing $\mathsf{E}_{\beta}$ among all probability measures on $\mathbb{S}^{d-1}$. Should one focus solely on discrete measures, many global minima appear; these are discussed in Section 9.1. This is one point where the particle dynamics and the mean-field flow deviate. We now provide a brief proof of Proposition 3.4 (see [Tan17] for a different approach).

Proof of Proposition 3.4. The fact that any global maximizer is a Dirac mass is easy to see. We proceed with proving the rest of the statement. Let $f(t) = e^{\beta t}$. The interaction energy then reads
\[
\mathsf{E}_{\beta}[\mu] = \frac{1}{2} \iint f(\langle x, x' \rangle)\, \mathrm{d}\mu(x)\, \mathrm{d}\mu(x').
\]
The proof relies on an ultraspherical (or Gegenbauer) polynomial expansion of $f(t)$:
\[
f(t) = \sum_{k=0}^{+\infty} \hat{f}(k; \lambda)\, \frac{k + \lambda}{\lambda}\, C_k^{\lambda}(t)
\]
for $t \in [-1, 1]$, where $\lambda = \frac{d-2}{2}$, $C_k^{\lambda}$ are Gegenbauer polynomials, and
\[
\hat{f}(k; \lambda) = \frac{\Gamma(\lambda + 1)}{\Gamma(\lambda + \frac{1}{2})\Gamma(\frac{1}{2})}\, \frac{1}{C_k^{\lambda}(1)} \int_{-1}^{1} f(t)\, C_k^{\lambda}(t)\, (1 - t^2)^{\lambda - \frac{1}{2}}\, \mathrm{d}t,
\]
where $C_k^{\lambda}(1) > 0$ (see [DX13, Section 1.2]). According to [BD19, Proposition 2.2], a necessary and sufficient condition for Proposition 3.4 to hold is to ensure that $\hat{f}(k; \lambda) > 0$ for all $k \ge 1$. To show this, we use the Rodrigues formula [Sze39, 4.1.72]
\[
C_k^{\lambda}(t) = \frac{(-1)^k}{k!}\, \frac{2^k\, \Gamma(k + \lambda)\Gamma(k + 2\lambda)}{\Gamma(\lambda)\Gamma(2k + 2\lambda)}\, (1 - t^2)^{-(\lambda - \frac{1}{2})} \left(\frac{\mathrm{d}}{\mathrm{d}t}\right)^{k} (1 - t^2)^{k + \lambda - \frac{1}{2}},
\]
and the fact that $C_k^{\lambda}(-t) = (-1)^k C_k^{\lambda}(t)$ for $t \in [-1, 1]$, which in combination with integration by parts yield
\[
\int_{-1}^{1} t^{\ell}\, C_k^{\lambda}(t)\, (1 - t^2)^{\lambda - \frac{1}{2}}\, \mathrm{d}t
\begin{cases}
> 0 & \text{if } \ell \ge k \text{ and } \ell - k \text{ is even}\\
= 0 & \text{otherwise}.
\end{cases}
\]
We conclude by using the power series expansion of $f$. □

⁹That is, the Lebesgue measure on $\mathbb{S}^{d-1}$, normalized to be a probability measure.
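The positivity of the Gegenbauer coefficients $\hat{f}(k; \lambda)$, which drives the proof, can also be checked numerically. The following sketch (with illustrative values of $\beta$ and $d \ge 3$, using scipy's Gegenbauer polynomials and numerical quadrature) evaluates the defining integral directly:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma, gegenbauer

beta, d = 2.0, 5                     # illustrative temperature and dimension
lam = (d - 2) / 2                    # lambda = (d - 2)/2; take d >= 3 so lam > 0
f = lambda t: np.exp(beta * t)

def fhat(k):
    """Gegenbauer coefficient f^(k; lambda) of f(t) = e^{beta t}."""
    Ck = gegenbauer(k, lam)          # the polynomial C_k^lambda
    integrand = lambda t: f(t) * Ck(t) * (1 - t * t) ** (lam - 0.5)
    integral, _ = quad(integrand, -1.0, 1.0)
    const = gamma(lam + 1) / (gamma(lam + 0.5) * gamma(0.5))
    return const * integral / Ck(1)

# Positivity of all coefficients with k >= 1 is exactly the criterion of
# [BD19, Proposition 2.2] invoked in the proof above.
print([fhat(k) > 0 for k in range(1, 8)])          # expected: all True
```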
3.3. A Wasserstein gradient flow proxy. In view of (3.6), one could hope to see the continuity equation (3.4) as the Wasserstein gradient flow of $\mathsf{E}_{\beta}$, or possibly some other functional (see the seminal papers [Ott01, JKO98], and [AGS05, Vil09] for a complete treatment). The long-time asymptotics of the PDE can then be analyzed by studying convexity properties of the underlying functional, by analogy with gradient flows in the Euclidean case.
For (3.4) to be the Wasserstein gradient flow of $\mathsf{E}_{\beta}$, the vector field $\mathcal{X}[\mu]$ defined in (3.2) ought to be the gradient of the first variation $\delta \mathsf{E}_{\beta}$ of $\mathsf{E}_{\beta}$. However, notice that $\mathcal{X}[\mu]$ is a logarithmic derivative:
\[
(3.7)\qquad
\mathcal{X}[\mu](x) = \nabla\, \beta^{-1} \log \int e^{\beta \langle x, y \rangle}\, \mathrm{d}\mu(y).
\]
(This observation goes beyond $Q = K = I_d$ and $V = \pm I_d$, insofar as $Q^{\top}K = K^{\top}Q = \pm V$; see [SABP22, Assumption 1].) Because of the lack of symmetry, it has been shown in [SABP22] that (3.7) is not the gradient of the first variation of a functional.
To overcome this limitation on $\mathbb{R}^d$, thus without layer normalization, [SABP22] propose two ways to "symmetrize" (3.4) that both lead to a Wasserstein gradient flow; see [SABP22, Proposition 2]. We focus here on the simplest one, which consists in removing the logarithm in (3.7), or equivalently removing the denominator in (3.2). This is one point where working on the unit sphere is useful: otherwise, the equation on $\mathbb{R}^d$ without layer normalization (as considered in [SABP22]) is ill-posed for general choices of matrices $V$, due to the fact that the magnitude of the vector field $\mathcal{X}[\mu]$ grows exponentially with the size of the support of $\mu$. On the contrary, on $\mathbb{S}^{d-1}$ the resulting equation is perfectly well-posed.
Remark 3.5 (Doubly stochastic kernel). Considering the Transformer dynamics on $\mathbb{R}^d$, thus without layer normalization, the authors in [SABP22] propose an alternative symmetric model: they replace the self-attention (stochastic) matrix by a doubly stochastic one, generated from the Sinkhorn iteration. This leads to a Wasserstein gradient flow, whereby the resulting attention mechanism is implicitly expressed as a limit of Sinkhorn iterations. Understanding the emergence of clusters for this model is an interesting but possibly challenging question.

In view of the above discussion, we are inclined to propose the surrogate model
\[
(\text{USA})\qquad
\dot{x}_i(t) = \mathrm{P}^{\perp}_{x_i(t)}\Bigg(\frac{1}{n} \sum_{j=1}^{n} e^{\beta \langle x_i(t), x_j(t) \rangle}\, x_j(t)\Bigg),
\]
which is obtained by replacing the partition function $Z_{\beta,i}(t)$ by $n$. As a matter of fact, (USA) presents remarkably similar qualitative behavior: all of the results we show in this paper are essentially the same for both dynamics.
The continuity equation corresponding to (USA), namely
\[
(3.8)\qquad
\begin{cases}
\partial_t \mu(t, x) + \operatorname{div}\bigg(\mathrm{P}^{\perp}_{x}\bigg(\displaystyle\int e^{\beta \langle x, x' \rangle}\, x'\, \mathrm{d}\mu(t, x')\bigg)\, \mu(t, x)\bigg) = 0\\
\mu_{|t=0} = \mu_0
\end{cases}
\]
for $(t, x) \in \mathbb{R}_{\ge 0} \times \mathbb{S}^{d-1}$, can now be seen as a Wasserstein gradient flow for the interaction energy $\mathsf{E}_{\beta}$ defined in (3.5).
Lemma 3.6. Consider the interaction energy $\mathsf{E}_{\beta} : \mathcal{P}(\mathbb{S}^{d-1}) \to \mathbb{R}_{\ge 0}$ defined in (3.5). Then the vector field
\[
\mathcal{X}[\mu](x) = \mathrm{P}^{\perp}_{x}\bigg(\int e^{\beta \langle x, x' \rangle}\, x'\, \mathrm{d}\mu(x')\bigg)
\]
satisfies
\[
(3.9)\qquad
\mathcal{X}[\mu](x) = \nabla \delta \mathsf{E}_{\beta}[\mu](x)
\]
for any $\mu \in \mathcal{P}(\mathbb{S}^{d-1})$ and $x \in \mathbb{S}^{d-1}$, where $\delta \mathsf{E}_{\beta}[\mu]$ denotes the first variation of $\mathsf{E}_{\beta}$.
We omit the proof, which follows from standard Otto calculus [Ott01], [Vil09, Chapter 15], [CNWR24, Chapter 5]. We can actually write (3.9) more succinctly by recalling the definition of the convolution of two functions on $\mathbb{S}^{d-1}$ [DX13, Chapter 2]: for any $g \in L^1(\mathbb{S}^{d-1})$ and $f : [-1, 1] \to \mathbb{R}$ such that $t \mapsto (1 - t^2)^{\frac{d-3}{2}} f(t)$ is integrable,
\[
(f * g)(x) = \int f(\langle x, y \rangle)\, g(y)\, \mathrm{d}\sigma_d(y).
\]
This definition has a natural extension to the convolution of a function $f$ (with the above integrability) and a measure $\mu \in \mathcal{P}(\mathbb{S}^{d-1})$. We can hence rewrite
\[
\mathsf{E}_{\beta}[\mu] = \frac{1}{2} \int (G_{\beta} * \mu)(x)\, \mathrm{d}\mu(x),
\]
where $[-1, 1] \ni t \mapsto G_{\beta}(t) = \beta^{-1} e^{\beta t}$, and so
\[
\mathcal{X}[\mu](x) = \nabla (G_{\beta} * \mu)(x).
\]
Thus, (3.8) takes the equivalent form
\[
(3.10)\qquad
\begin{cases}
\partial_t \mu(t, x) + \operatorname{div}\Big(\nabla\big(G_{\beta} * \mu(t, \cdot)\big)(x)\, \mu(t, x)\Big) = 0 & \text{for } (t, x) \in \mathbb{R}_{\ge 0} \times \mathbb{S}^{d-1}\\
\mu_{|t=0} = \mu_0 & \text{for } x \in \mathbb{S}^{d-1}.
\end{cases}
\]
The considerations above lead us to the following Lyapunov identity.
Lemma 3.7. The solution $\mu \in C^0(\mathbb{R}_{\ge 0}; \mathcal{P}(\mathbb{S}^{d-1}))$ to (3.8) satisfies
\[
\frac{\mathrm{d}}{\mathrm{d}t} \mathsf{E}_{\beta}[\mu(t)] = \int \Big\| \nabla\big(G_{\beta} * \mu(t, \cdot)\big)(x) \Big\|^2\, \mathrm{d}\mu(t, x)
\]
for $t \ge 0$.

Interestingly, (3.10) is an aggregation equation, versions of which have been studied in great depth in the literature. For instance, clustering in the spirit of an asymptotic collapse to a single Dirac measure located at the center of mass of the initial density $\mu(0, \cdot)$ has been shown for aggregation equations with singular kernels in [BCM08, BLR11, CDF+11], motivated by the Patlak–Keller–Segel model of chemotaxis. Here, one caveat (and subsequently, novelty) is that (3.10) is set on $\mathbb{S}^{d-1}$, which makes the analysis developed in these references difficult to adapt or replicate.
Remark 3.8 (Particle version). Let us briefly sketch the particle version of the Wasserstein gradient flow (3.8). When $\mu(t) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i(t)}$, the interaction energy (3.5) takes the form
\[
\mathsf{E}_{\beta}(X) = \frac{1}{2\beta n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} e^{\beta \langle x_i, x_j \rangle},
\]
where $X = (x_1, \dots, x_n) \in (\mathbb{S}^{d-1})^n$. Denoting by $\nabla_X$ the gradient associated to the standard Riemannian metric on $(\mathbb{S}^{d-1})^n$, we get the dynamics
\[
(3.11)\qquad
\dot{X}(t) = n \nabla_X \mathsf{E}_{\beta}(X(t)).
\]
Indeed, the gradient on $(\mathbb{S}^{d-1})^n$ is simply $\nabla = (\partial_1, \dots, \partial_n)$, where $\partial_i$ is the gradient in $\mathbb{S}^{d-1}$ acting on the $i$-th copy in $(\mathbb{S}^{d-1})^n$. Therefore
\[
\partial_i \mathsf{E}_{\beta}(X(t)) = \frac{1}{\beta n^2} \sum_{j=1}^{n} \mathrm{P}^{\perp}_{x_i(t)}\Big(e^{\beta \langle x_i(t), x_j(t) \rangle}\, \beta\, x_j(t)\Big) = \frac{1}{n}\, \dot{x}_i(t),
\]
which yields (3.11).
Note that (SA) also corresponds to a gradient flow of the same interaction energy, albeit with respect to a Riemannian metric on the sphere different from the standard one (for $n = 2$ the two are conformally equivalent). We provide more detail in the following section.
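A numerical sanity check of this gradient flow structure (a sketch; the parameters and the projected Euler discretization are illustrative assumptions): the particle energy $\mathsf{E}_{\beta}(X)$ of Remark 3.8 should be non-decreasing along the (USA) dynamics.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, beta = 24, 3, 3.0
h, steps = 0.05, 2000

def energy(x):
    """Particle interaction energy E_beta(X) from Remark 3.8."""
    return np.exp(beta * x @ x.T).sum() / (2 * beta * n ** 2)

def usa_drift(x):
    """Right-hand side of (USA): the partition function is replaced by n."""
    v = (np.exp(beta * x @ x.T) @ x) / n
    return v - np.sum(v * x, axis=1, keepdims=True) * x

x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)

energies = []
for _ in range(steps):
    energies.append(energy(x))
    x = x + h * usa_drift(x)
    x /= np.linalg.norm(x, axis=1, keepdims=True)

# E_beta should be (numerically) non-decreasing along the flow, consistent
# with (USA) ascending the energy (Lemma 3.6 and Remark 3.8).
print("monotone:", bool(np.all(np.diff(energies) >= -1e-12)))
```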
3.4. (SA) is a gradient flow for a modified metric. We will now briefly demonstrate that for a particular choice of parameters $(Q, K, V)$, the true dynamics (SA) can be seen as a gradient flow for $\mathsf{E}_{\beta}$ upon a modification of the metric on the tangent space of $(\mathbb{S}^{d-1})^n$. This will facilitate qualitative analysis later on by using standard tools from dynamical systems. This insight can then be extrapolated to the corresponding continuity equation (3.4) as well, as seen in Section 3.4.2.

3.4.1. The case of particles. We suppose that
\[
Q^{\top}K \text{ is symmetric}, \qquad V = Q^{\top}K.
\]
We define a new metric on $(\mathbb{S}^{d-1})^n$ as follows. Let $X = (x_1, \dots, x_n) \in (\mathbb{S}^{d-1})^n$. Consider the inner product on $\mathrm{T}_X(\mathbb{S}^{d-1})^n$ given by
\[
(3.12)\qquad
\langle (a_1, \dots, a_n), (b_1, \dots, b_n) \rangle_X = \sum_{i=1}^{n} Z_{\beta,i}(X)\, \langle a_i, b_i \rangle,
\]
where $a_i, b_i \in \mathrm{T}_{x_i}\mathbb{S}^{d-1}$, and
\[
Z_{\beta,i}(X) = \sum_{j=1}^{n} e^{\beta \langle V x_i, x_j \rangle}.
\]
Set
\[
\mathsf{E}_{\beta}(X) = \frac{1}{2\beta} \sum_{i=1}^{n} \sum_{j=1}^{n} e^{\beta \langle V x_i, x_j \rangle}.
\]
We now show that the dynamics (2.3) can be equivalently written as
\[
\dot{X}(t) = \nabla \mathsf{E}_{\beta}(X(t)),
\]
where the gradient $\nabla$ is computed with respect to the metric (3.12) on $(\mathbb{S}^{d-1})^n$. To this end, we ought to show that for all vector fields $Y$ on $(\mathbb{S}^{d-1})^n$ and for all $X \in (\mathbb{S}^{d-1})^n$,
\[
(3.13)\qquad
\frac{\mathrm{d}}{\mathrm{d}t}\Big|_{t=0} \mathsf{E}_{\beta}\big(\Phi_Y^t(X)\big) = \langle Y(X), B(X) \rangle_X
\]
holds, where $\Phi_Y^t$ is the flow associated to the vector field $Y$, whereas $B = (B_1, \dots, B_n)$ with
\[
B_i = \mathrm{P}^{\perp}_{x_i}\Bigg(\frac{1}{Z_{\beta,i}(X)} \sum_{j=1}^{n} e^{\beta \langle V x_i, x_j \rangle}\, V x_j\Bigg) \in \mathrm{T}_{x_i}\mathbb{S}^{d-1}.
\]
By linearity, it is sufficient to show (3.13) for vector fields $Y$ of the form
\[
Y(X) = (A x_1, 0, \dots, 0) \in \mathrm{T}_X(\mathbb{S}^{d-1})^n,
\]
where $A$ is an arbitrary non-zero skew-symmetric matrix. Clearly
\[
(3.14)\qquad
\Phi_Y^t(X) = (e^{tA} x_1, x_2, \dots, x_n).
\]
One first computes
\[
\frac{\mathrm{d}}{\mathrm{d}t}\Big|_{t=0} \mathsf{E}_{\beta}\big(\Phi_Y^t(X)\big) = \sum_{j=1}^{n} e^{\beta \langle V x_1, x_j \rangle}\, \langle A x_1, V x_j \rangle.
\]
Now observe that $\langle A x_1, y \rangle = \langle A x_1, z \rangle$ for all skew-symmetric matrices $A$ if and only if $x_1(y^{\top} - z^{\top})$ is a symmetric matrix. Since $\mathrm{P}^{\perp}_{x_1} = I_d - x_1 x_1^{\top}$, we see that
\[
\sum_{j=1}^{n} e^{\beta \langle V x_1, x_j \rangle}\, \langle A x_1, V x_j \rangle = \langle Y(X), B(X) \rangle_X,
\]
as desired.

3.4.2. The case of measures. The above insight, which consists in reweighing the metric with respect to which the gradient is taken, can be formally adapted to the Wasserstein setting for (3.4) (which, we recall, is not a gradient flow for the standard Wasserstein gradient of $\mathsf{E}_{\beta}$). Note that (3.4) writes as
\[
(3.15)\qquad
\partial_t \mu(t) + \operatorname{div}\bigg(\frac{\nabla \delta \mathsf{E}_{\beta}[\mu(t)]}{\delta \mathsf{E}_{\beta}[\mu(t)]}\, \mu(t)\bigg) = 0,
\]
with $\delta \mathsf{E}_{\beta}[\mu](x) = \int e^{\beta \langle x, y \rangle}\, \mathrm{d}\mu(y)$. To avoid technicalities we henceforth only focus on the absolutely continuous case. For a fixed $\mu \in \mathcal{P}_{\mathrm{ac}}(\mathbb{S}^{d-1})$, as in the well-known formal Riemannian reinterpretation of the Wasserstein space using Otto calculus [Ott01], [CNWR24, Chapter 5], we consider
\[
\mathrm{T}_{\mu} \mathcal{P}_{\mathrm{ac}}(\mathbb{S}^{d-1}) = \overline{\{\nabla \psi : \psi \in C^{\infty}(\mathbb{S}^{d-1})\}}^{L^2(\mu)},
\]
which, rather than endowing with the standard formal metric tensor given by the $\dot{H}^1$-inner product
\[
\langle \nabla \psi_1, \nabla \psi_2 \rangle_{\mu} := \int \langle \nabla \psi_1(x), \nabla \psi_2(x) \rangle\, \mathrm{d}\mu(x),
\]
we instead endow with
\[
\langle \nabla \psi_1, \nabla \psi_2 \rangle_{\mu, \mathsf{E}_{\beta}} := \int \langle \nabla \psi_1(x), \nabla \psi_2(x) \rangle\, \delta \mathsf{E}_{\beta}[\mu](x)\, \mathrm{d}\mu(x).
\]
Continuing the Riemannian geometry analogy, through this metric tensor we can define a distance between $\mu_0, \mu_1 \in \mathcal{P}_{\mathrm{ac}}(\mathbb{S}^{d-1})$ by solving the variational problem
\[
\inf_{(\mu(t), v(t))_{t \in [0,1]}} \bigg\{ \int_0^1 \|v(t)\|^2_{\mu(t), \mathsf{E}_{\beta}}\, \mathrm{d}t : (\mu, v) \text{ satisfy (3.16)},\ \mu(0) = \mu_0,\ \mu(1) = \mu_1 \bigg\},
\]
where
\[
(3.16)\qquad
\partial_t \mu(t, x) + \operatorname{div}\big(v(t, x)\, \mu(t, x)\big) = 0 \quad \text{on } [0, 1] \times \mathbb{S}^{d-1}.
\]
This variational problem is a generalization of the celebrated Benamou–Brenier formula [BB00], the value of which we dub $W^2_{2, \mathsf{E}_{\beta}}(\mu_0, \mu_1)$: it is a weighted Wasserstein distance. For a curve of measures $(\mu(t))_{t \ge 0}$ with tangent vectors $(v(t))_{t \ge 0}$ (meaning they solve (3.16)), the Wasserstein gradient $\nabla_{\mathsf{E}_{\beta}}$ induced by this geometric setup is then the element of $\mathrm{T}_{\mu(t)} \mathcal{P}_{\mathrm{ac}}(\mathbb{S}^{d-1})$ such that
\[
\partial_t \mathsf{E}_{\beta}[\mu(t)] = \langle \nabla_{\mathsf{E}_{\beta}} \mathsf{E}_{\beta}[\mu(t)], v(t) \rangle_{\mu(t), \mathsf{E}_{\beta}}.
\]
We can now demonstrate that the vector field driving (3.15) is the Wasserstein gradient of $\mathsf{E}_{\beta}$ corresponding to this geometric setup. Indeed, as in [CNWR24, Definition 5.9] we first have
\[
\partial_t \mathsf{E}_{\beta}[\mu(t)] = \int \delta \mathsf{E}_{\beta}[\mu(t)]\, \mathrm{d}\partial_t \mu(t).
\]
We then find
\[
\int \delta \mathsf{E}_{\beta}[\mu(t)]\, \mathrm{d}\partial_t \mu(t) = \int \langle \nabla \delta \mathsf{E}_{\beta}[\mu(t)], v(t) \rangle\, \mathrm{d}\mu(t) = \bigg\langle \frac{\nabla \delta \mathsf{E}_{\beta}[\mu(t)]}{\delta \mathsf{E}_{\beta}[\mu(t)]}, v(t) \bigg\rangle_{\mu(t), \mathsf{E}_{\beta}},
\]
as desired. The literature studying weighted Wasserstein distances such as the one above is rather scarce, but a relevant reference is [Li21].

Part 2. Clustering
As alluded to in the introductory discussion, clustering is of particular relevance in tasks such as sentiment analysis, masked language modeling, summarization, and so on. Therein, the output measure encodes the probability distribution of the missing tokens, for instance, and its clustering indicates a small number of possible outcomes. In Sections 4, 5, and 6, we show several results (mostly summarized in Figure 2) which indicate that the limiting distribution is a point mass. While it may appear that this leaves no room for diversity or randomness, which is at odds with practical observations, these results hold for the specific choice of parameter matrices, and apply in possibly very long time horizons. Numerical experiments indicate a more subtle picture for different parameters: for instance, there is the appearance of a long metastable phase during which the particles coalesce in a small number of clusters, which seems consistent with behavior in trained models

[Figure 2 is a schematic diagram in the $(d, \beta)$ plane marking the regimes covered by Theorems 4.2, 4.3, 5.1, 6.1, 6.3, and 6.9, including the exact time of clustering near $\beta \sim n^2/\pi^2$; the drawing itself is not reproduced here.]

Figure 2. Green zones indicate regimes where convergence to a single cluster as $t \to +\infty$ can be proven. Here $n \ge 2$ is fixed. When $d$ is larger than specific thresholds, the long-time asymptotics can be chiseled out in finer detail. Convergence is slow when $\beta \gg 1$ (relative to the size of $d, n$), as even the exponential decay constant when $d \ge n$ is of the form $\lambda = O(e^{-\beta})$ and thus degenerates. One rather expects dynamic metastability. Section 4 addresses the case where $\beta$ is small and Section 5 where it is large, whereas Section 6 covers the high-dimensional case at arbitrary $\beta$.

(Figure 1 in Section 6.3). We are not able to theoretically explain this behavior as of now.
Ultimately, the appearance of clusters is somewhat natural, since the Transformer dynamics are a weighted average of all particles, with the weights being hard-wired to perform a fast selection of the particles most similar to the $i$-th particle being queried. This causes the emergence of leaders which attract all particles in their vicinity. In the natural language processing interpretation, where particles represent tokens, this further elucidates the wording attention as the mechanism of inter-token attraction, and the amplitude of the inner product between tokens can be seen as a measure of their semantic similarity.

4. A single cluster for small β


As seen in Figure 2, the dimension $d$ and inverse temperature $\beta$ appear to play a key role in the clustering results we obtain. In this section and in Section 5, we begin by focusing on extreme choices of $\beta$ whilst $d, n$ are fixed. We first focus on the case $\beta = 0$ (Section 4.1), before moving to the case $\beta \ll 1$ by a perturbation argument (Section 4.2). We cover the case where $\beta$ is sufficiently large, but finite, in Section 5. The case $\beta = +\infty$ is of little interest since all particles are fixed by the evolution.

4.1. The case $\beta = 0$. For $\beta = 0$, both (SA) and (USA) read as
\[
(4.1)\qquad
\dot{x}_i(t) = \mathrm{P}^{\perp}_{x_i(t)}\Bigg(\frac{1}{n} \sum_{j=1}^{n} x_j(t)\Bigg), \qquad t \ge 0.
\]
The following result shows that generically over the initial points, a single cluster emerges. It complements a known convergence result ([FL19, Theorem 2]) for (4.1). In [FL19, Theorem 2], the authors show convergence to an antipodal configuration, in the sense that $n - 1$ particles converge to some $x^* \in \mathbb{S}^{d-1}$, with the last particle converging to $-x^*$. Moreover, once convergence is shown to hold, it holds with an exponential rate. Mimicking the proof strategy of [BCM15, Theorem 2.2] and [HKR18, Theorem 3.2], we sharpen this result by showing that the appearance of an antipodal particle is non-generic over the choice of initial conditions.
Theorem 4.1. Let $d, n \ge 2$. For Lebesgue almost any initial sequence $(x_i(0))_{i \in [n]} \in (\mathbb{S}^{d-1})^n$, there exists some point $x^* \in \mathbb{S}^{d-1}$ such that the unique solution $(x_i(\cdot))_{i \in [n]} \in C^0(\mathbb{R}_{\ge 0}; (\mathbb{S}^{d-1})^n)$ to the corresponding Cauchy problem for (4.1) satisfies
\[
\lim_{t \to +\infty} x_i(t) = x^*
\]
for any $i \in [n]$.


This is also referred to as convergence toward consensus in collective behavior
models. We refer the interested reader to Appendix A for the proof, which relies on
a gradient flow reinterpretation of (4.1), much like the proofs of several subsequent
theorems. We provide some comments in Section 5.
4.2. The case $\beta \ll 1$. Theorem 4.1 has some implications for small but positive $\beta$, something which is already seen in Figure 3 and Figure 4. This is essentially due to the fact that, formally,
\[
\dot{x}_i(t) = \mathrm{P}^{\perp}_{x_i(t)}\Bigg(\frac{1}{n} \sum_{j=1}^{n} x_j(t)\Bigg) + O(\beta)
\]
for $\beta \ll 1$. So, during a time $\ll \beta^{-1}$, the particles do not feel the influence of the remainder $O(\beta)$ and behave as in the regime $\beta = 0$. This motivates
Theorem 4.2. Fix $d, n \ge 2$. For $\beta \ge 0$, let $S_{\beta} \subset (\mathbb{S}^{d-1})^n$ be the subset consisting of all initial sequences for which the associated solution to the Cauchy problem for (SA) (or (USA)) converges to one cluster as $t \to +\infty$. Then
\[
\lim_{\beta \to 0} \mathbb{P}(S_{\beta}) = 1.
\]
More generally, if $Q$ and $K$ are arbitrary $d \times d$ matrices, then the same result also holds for the Cauchy problem for (2.3) with $V = I_d$ (or the natural analogue of (USA) with these parameters).
Proof. We focus on the dynamics (SA), but the proof is in fact identical in the case of (USA).
For $\alpha \in [0, 1)$, we say that a set formed from $n$ points $z_1, \dots, z_n \in \mathbb{S}^{d-1}$ is $\alpha$-clustered if $\langle z_i, z_j \rangle > \alpha$ holds for any $i, j \in [n]$. Observe that if $\{z_1, \dots, z_n\}$ is $\alpha$-clustered for some $\alpha \ge 0$, then the solution to the Cauchy problem for (SA) (for arbitrary $\beta \ge 0$) with this sequence as initial condition converges to a single cluster, since $w = z_1$ satisfies the assumption in Lemma 6.4.
Now, for any integer $m \ge 1$, we denote by $S_0^m \subset S_0$ the set of initial sequences $x_1(0), \dots, x_n(0)$ in $(\mathbb{S}^{d-1})^n$ for which the solution $(x_i^0(\cdot))_{i \in [n]}$ to the associated Cauchy problem for (4.1) is $\frac{3}{4}$-clustered at time $t = m$, namely
\[
(4.2)\qquad
\langle x_i^0(m), x_j^0(m) \rangle > \frac{3}{4}
\]
holds for all $i, j \in [n]$. We see that $S_0^m$ is an open set for any integer $m \ge 1$. Moreover, $S_0^m \subset S_0^{m+1}$ according to the proof of Lemma 6.4, and $\bigcup_{m=1}^{+\infty} S_0^m = S_0$. This implies that
\[
(4.3)\qquad
\lim_{m \to +\infty} \mathbb{P}(S_0^m) = 1.
\]
We now show that the solution to (SA) is near that of (4.1), starting from the same initial condition, when $\beta$ is small. Using the Duhamel formula, we find
\[
\begin{aligned}
x_i^{\beta}(t) - x_i^0(t) &= \int_0^t \sum_{j=1}^{n} \frac{e^{\beta \langle Q x_i^{\beta}(s), K x_j^{\beta}(s) \rangle}}{\sum_{k=1}^{n} e^{\beta \langle Q x_i^{\beta}(s), K x_k^{\beta}(s) \rangle}}\, \mathrm{P}^{\perp}_{x_i^{\beta}(s)}\big(x_j^{\beta}(s)\big)\, \mathrm{d}s - \int_0^t \frac{1}{n} \sum_{j=1}^{n} \mathrm{P}^{\perp}_{x_i^0(s)}\big(x_j^0(s)\big)\, \mathrm{d}s\\
&= \int_0^t \sum_{j=1}^{n} \bigg(\frac{1}{n} + O\Big(\frac{\beta}{n}\Big)\bigg)\, \mathrm{P}^{\perp}_{x_i^{\beta}(s)}\big(x_j^{\beta}(s)\big)\, \mathrm{d}s - \int_0^t \frac{1}{n} \sum_{j=1}^{n} \mathrm{P}^{\perp}_{x_i^0(s)}\big(x_j^0(s)\big)\, \mathrm{d}s,
\end{aligned}
\]
where we used that all particles lie on $\mathbb{S}^{d-1}$ for all times. Employing Grönwall's inequality, we deduce
\[
(4.4)\qquad
\big\| x_i^{\beta}(t) - x_i^0(t) \big\| \le O(\beta)\, e^{3t}
\]
for all $t \ge 0$, $\beta \ge 0$ and $i \in [n]$. Due to (4.4), there exists some $\beta_m > 0$ such that for any $\beta \in [0, \beta_m]$,
\[
(4.5)\qquad
\big\| x_i^{\beta}(m) - x_i^0(m) \big\| \le \frac{1}{8}.
\]
For this to hold, we clearly need $\beta_m \to 0$ as $m \to +\infty$. Combining (4.2) and (4.5), we gather that for any initial condition in $S_0^m$, the solution $(x_i^{\beta}(\cdot))_{i \in [n]}$ to the corresponding Cauchy problem for (SA) is $\frac{1}{2}$-clustered at time $t = m$, namely satisfies
\[
\langle x_i^{\beta}(m), x_j^{\beta}(m) \rangle > \frac{1}{2}
\]
for all $i, j \in [n]$ and $\beta \in [0, \beta_m]$. Thus $S_0^m \subset S_{\beta}$ for any $\beta \in [0, \beta_m]$ by virtue of Lemma 6.4, which together with (4.3) concludes the proof. □
In the specific case where $Q^{\top}K = I_d$, we can in fact significantly sharpen Theorem 4.2 by relying on the gradient flow structure evoked in Section 3.4. Namely, we can show the following.
Theorem 4.3. Fix $d, n \ge 2$. There exists a numerical constant $C > 0$ such that whenever
\[
\beta \le C n^{-1},
\]
the following holds. For Lebesgue almost any $(x_i(0))_{i \in [n]} \in (\mathbb{S}^{d-1})^n$, there exists $x^* \in \mathbb{S}^{d-1}$ such that the solution $(x_i(\cdot))_{i \in [n]} \in C^0(\mathbb{R}_{\ge 0}; (\mathbb{S}^{d-1})^n)$ to the corresponding Cauchy problem for (SA) (resp. for (USA)) satisfies
\[
\lim_{t \to +\infty} x_i(t) = x^*
\]
for all $i \in [n]$. Moreover, when $d = 2$ one can take $\beta \le 1$ and the same conclusion holds.
The proof can be found in Appendix C. As for Theorem 4.1, we briefly discuss the general strategy in Section 5 just below. The improvement to $\beta \le 1$ when $d = 2$ is due to the recent paper [CRMB24].

5. A single cluster for large β

The chain of thought leading to the proof of Theorem 4.3 also leads to
Theorem 5.1. Fix $d, n \ge 2$. There exists a constant $C = C(d) > 0$ depending only on $d$ such that whenever
\[
\beta \ge C n^2,
\]
the conclusion of Theorem 4.3 holds for both (SA) and (USA).
We refer the interested reader to Appendix B for the proof, which, much like those of Theorems 4.1 and 4.3, relies on the gradient flow interpretation of the dynamics. We give a general outline thereof. Since all the intervening functions and metrics are real-analytic, the celebrated Łojasiewicz theorem [Loj63] implies that for any initial condition, the gradient flow converges to some critical point of $\mathsf{E}_{\beta}$ as $t \to +\infty$. The genericity over the initial configurations seen in the statements comes from the center-stable manifold theorem (Lemma A.1). The latter ensures that for almost every initial configuration, the gradient flow does not converge to a strict saddle point of $\mathsf{E}_{\beta}$ (namely, a critical point where the Hessian has at least one positive eigenvalue). Whence, generically over the initial configurations, the gradient flow converges to a local maximum. One is then left to analyze the landscape of the underlying energy, with the goal of ensuring that all local maxima are necessarily global.

6. The high-dimensional case

We now elucidate the role that the dimension $d$ plays in clustering results. To start, it turns out that the restrictions on $\beta$ provided by Theorems 4.3 and 5.1 are specific to the two-dimensional case. The following result is shown in [MTG17, CRMB24].
Theorem 6.1 ([MTG17, CRMB24]). Fix $n \ge 2$, $d \ge 3$ and $\beta \ge 0$. Then the conclusion of Theorem 4.3 holds for both (SA) and (USA).
In Section 6.1 we show that the convergence of Theorem 6.1 is exponentially fast when $d \ge n$ (although with a decay constant that is exponentially small in $\beta$), and in Section 6.2 we describe the full dynamics when $n$ is fixed and $d \to +\infty$. We discuss some numerical experiments and posit questions on the intermediate behavior of the dynamics when $\beta \gg 1$ in Section 6.3.
We were made aware of Theorem 6.1 after the first version of this manuscript. The downside of our original proof, which caused us to miss the full range of $\beta$,
is that we focused on $d = 2$, where the natural perturbations to a critical point involve selecting which particles are moving and which are standing still. However, in $d > 2$, one can move all particles in the same direction (as is done in the proof in [CRMB24]). Using that there is a continuum of directions and only finitely many points, one can find some direction that brings all of them closer.
Remark 6.2 (Invariant measures). In the theory of dynamical systems, often the first question of interest is to find smooth invariant measures. It is clear that whenever the conclusions of Theorem 4.3 hold (in particular, for all $\beta, n$ when $d \ge 3$), neither (SA) nor (USA) may possess a smooth invariant measure.

6.1. Clustering at an exponential rate when $d \ge n$. One can ask whether, for almost every initial configuration, the convergence provided by all of the results above holds with some rate. The answer is affirmative (and the rate is in fact exponential) when the initial configuration lies in an open hemisphere.
Theorem 6.3. Let $n \ge 1$ and $\beta > 0$. Suppose $d \ge n$. Consider the unique solution $(x_i(\cdot))_{i \in [n]} \in C^0(\mathbb{R}_{\ge 0}; (\mathbb{S}^{d-1})^n)$ to the Cauchy problem for (SA) or (USA), corresponding to an initial sequence of points $(x_i(0))_{i \in [n]} \in (\mathbb{S}^{d-1})^n$ distributed uniformly at random. Then almost surely there exist $x^* \in \mathbb{S}^{d-1}$ and constants $C, \lambda > 0$ such that
\[
(6.1)\qquad
\|x_i(t) - x^*\| \le C e^{-\lambda t}
\]
holds for all $i \in [n]$ and $t \ge 0$.
In fact, let $Q$ and $K$ be arbitrary $d \times d$ matrices. Then the same result also holds for the solution to the corresponding Cauchy problem for (2.3) with $V = I_d$ (or the natural analogue of (USA) with these parameters).
When $d \ge n$ and the points $(x_i(0))_{i \in [n]} \in (\mathbb{S}^{d-1})^n$ are distributed uniformly at random, with probability one there exists¹⁰ $w \in \mathbb{S}^{d-1}$ such that $\langle w, x_i(0) \rangle > 0$ for any $i \in [n]$. In other words, all of the initial points lie in an open hemisphere almost surely. The proof of Theorem 6.3 thus follows as a direct corollary of the following result, which holds for any $n \ge 1$ and $d \ge 2$:
Lemma 6.4 (Cone collapse). Let $\beta > 0$ and let $(x_i(0))_{i \in [n]} \in (\mathbb{S}^{d-1})^n$ be such that there exists $w \in \mathbb{S}^{d-1}$ for which $\langle x_i(0), w \rangle > 0$ for any $i \in [n]$. Consider the unique solution $(x_i(\cdot))_{i \in [n]} \in C^0(\mathbb{R}_{\ge 0}; (\mathbb{S}^{d-1})^n)$ to the corresponding Cauchy problem for (SA) or (USA). Then there exist $x^* \in \mathbb{S}^{d-1}$ and constants $C, \lambda > 0$ such that
\[
\|x_i(t) - x^*\| \le C e^{-\lambda t}
\]
holds for all $i \in [n]$ and $t \ge 0$.
In fact, let $Q$ and $K$ be arbitrary $d \times d$ matrices. Then the same result also holds for the solution to the corresponding Cauchy problem for (2.3) with $V = I_d$ (or the natural analogue of (USA) with these parameters).
Remark 6.5. Lemma 6.4 implies that the set $\{(\bar{x}_i)_{i \in [n]} \in (\mathbb{S}^{d-1})^n : \bar{x}_1 = \dots = \bar{x}_n\}$ is Lyapunov asymptotically stable as a set. In fact, it is exponentially stable.

¹⁰This weak version of Wendel's theorem (Theorem 6.7) is easy to see directly.
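The exponential stability can be observed numerically. The sketch below (illustrative parameters, not from the paper; $d \ge n$ is chosen so that the initial points lie in a common open hemisphere with probability one) tracks $1 - \min_{i,j}\langle x_i(t), x_j(t)\rangle$, which should decay like $C e^{-\lambda t}$, and fits the rate $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, beta = 8, 16, 1.0          # d >= n: a common open hemisphere exists a.s.
h, steps = 0.05, 2000

x = rng.normal(size=(n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)

def sa_drift(x):
    G = np.exp(beta * x @ x.T)
    A = G / G.sum(axis=1, keepdims=True)
    v = A @ x
    return v - np.sum(v * x, axis=1, keepdims=True) * x

gap = []                          # 1 - min_{i,j} <x_i, x_j>; zero at consensus
for _ in range(steps):
    gap.append(1.0 - (x @ x.T).min())
    x = x + h * sa_drift(x)
    x /= np.linalg.norm(x, axis=1, keepdims=True)

# Fit gap(t) ~ C e^{-lambda t} on the clean exponential regime, avoiding the
# initial transient and the round-off floor.
t = h * np.arange(steps)
g = np.array(gap)
mask = (g > 1e-12) & (t > 1.0)
lam = -np.polyfit(t[mask], np.log(g[mask]), 1)[0]
print("fitted exponential rate lambda ~", lam)
```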

Lemma 6.4 is reminiscent of results on interacting particle systems on the sphere


(see [CLP15, Theorem 3.7] for instance), and the literature on synchronization for
the Kuramoto model on the circle ([ABK` 22, Lemma 2.8], [HR20, Theorem 3.1]
and Section 7.2). We often make use of the following elementary lemma.
Lemma 6.6. Let f : Rě0 Ñ R be a differentiable function such that
ż `8 ˇ ˇ
|f ptq| dt ` sup ˇf9ptqˇ ă `8.
ˇ ˇ
0 tPRě0

Then limtÑ`8 f ptq “ 0.


The proof of Lemma 6.4 is an adaptation of [CLP15, Theorem 1]. We present it
here for completeness.

Proof of Lemma 6.4. We focus on the case (USA), and set

a_{ij}(t) := n⁻¹ e^{β⟨x_i(t), x_j(t)⟩} > 0.

The proof for (SA) is identical: one only needs to change the coefficients a_{ij}(t) to Z_{β,i}(t)⁻¹ e^{β⟨x_i(t), x_j(t)⟩} throughout. Also note that since we only use the positivity of the coefficients a_{ij}(t) throughout the proof, all arguments generalize readily to the case of arbitrary d × d matrices Q and K appearing in the inner products.

Step 1. Clustering. For t ≥ 0, consider

i(t) ∈ arg min_{i∈[n]} ⟨x_i(t), w⟩.

Fix t₀ ≥ 0. We have

(d/dt) ⟨x_{i(t₀)}(·), w⟩ |_{t=t₀} = Σ_{j=1}^n a_{i(t₀)j}(t₀) (⟨x_j(t₀), w⟩ − ⟨x_{i(t₀)}(t₀), x_j(t₀)⟩ ⟨x_{i(t₀)}(t₀), w⟩) ≥ 0.

This implies that all points remain within the same open hemisphere at all times, and the map

t ↦ r(t) := min_{i∈[n]} ⟨x_i(t), w⟩

is non-decreasing on ℝ_{≥0}. It is also bounded from above by 1. We may thus define r_∞ := lim_{t→+∞} r(t). Note that r_∞ ≥ r(0) > 0 by assumption. By compactness, there exist a sequence of times {t_k}_{k=1}^{+∞} with t_k → +∞, and some (x_i)_{i∈[n]} ∈ (S^{d−1})^n, such that lim_{k→+∞} x_i(t_k) = x_i for all i ∈ [n]. Using the definition of r(t), we also find that

⟨x_j, w⟩ ≥ r_∞

for all j ∈ [n], and by continuity, there exists i ∈ [n] such that ⟨x_i, w⟩ = r_∞. Then

(6.2)    lim_{k→+∞} ⟨ẋ_i(t_k), w⟩ = Σ_{j=1}^n a_{ij} (⟨w, x_j⟩ − ⟨x_i, x_j⟩⟨x_i, w⟩) ≥ r_∞ Σ_{j=1}^n a_{ij} (1 − ⟨x_i, x_j⟩),
where we set a_{ij} := e^{β⟨x_i, x_j⟩} > 0. Notice that

lim_{k→+∞} ∫_{t_k}^{+∞} ⟨ẋ_i(s), w⟩ ds = r_∞ − lim_{k→+∞} ⟨x_i(t_k), w⟩ = 0,

and by using the equation (USA) we also find that |⟨ẍ_i(t), w⟩| = O(e^{2β}) for any t ≥ 0. Therefore, by Lemma 6.6, the left-hand side of (6.2) is equal to 0, and consequently so is the right-hand side. This implies that x_1 = … = x_n =: x*. Repeating the argument with w replaced by x*, we see that the extraction of a sequence {t_k}_{k=1}^{+∞} as above is not necessary, and therefore

(6.3)    lim_{t→+∞} x_i(t) = x*

for all i ∈ [n].

Step 2. Exponential rate. We now improve upon (6.3). Set

α(t) := min_{i∈[n]} ⟨x_i(t), x*⟩.

From (6.3) we gather that there exists some t₀ > 0 such that α(t) ≥ 1/2 for all t ≥ t₀. Also, in view of what precedes, we know that x* lies in the convex cone generated by the points x_1(t), …, x_n(t) for any t > 0. Thus, there exists some η ∈ (0, 1] such that ηx* is a convex combination of the points x_1(t), …, x_n(t), which implies that

(6.4)    x* = Σ_{k=1}^n θ_k(t) x_k(t), for some Σ_{k=1}^n θ_k(t) ≥ 1, θ_k(t) ≥ 0 for all k ∈ [n].

For any t, we denote by i(t) an element of arg min_i ⟨x_i(t), x*⟩ for which ⟨ẋ_i(t), x*⟩ is smallest. It follows from a Taylor expansion of ⟨x_i(t+h), x*⟩ for h > 0 and i ∈ [n] that

α̇(t) = ⟨ẋ_{i(t)}(t), x*⟩.

Therefore

(6.5)    α̇(t) = ⟨ẋ_{i(t)}(t), x*⟩ ≥ Σ_{j=1}^n a_{i(t)j}(t) (1 − ⟨x_{i(t)}(t), x_j(t)⟩) α(t).

On the other hand,

(6.6)    min_{j∈[n]} ⟨x_{i(t)}(t), x_j(t)⟩ ≤ Σ_{k=1}^n θ_k(t) ⟨x_{i(t)}(t), x_k(t)⟩ = ⟨x_{i(t)}(t), x*⟩ = α(t).

Plugging (6.6) into (6.5) and using a_{ij}(t) ≥ n⁻¹ e^{−2β}, we get

(6.7)    α̇(t) ≥ (1/(2ne^{2β})) (1 − α(t))

for t ≥ t₀. Applying the Grönwall inequality, we get

(6.8)    1 − α(t) ≤ (1/2) e^{−(t − t₀)/(2ne^{2β})}

for all t ≥ t₀. The conclusion follows. □
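A minimal numerical sketch of this contraction, under assumptions of our own choosing (explicit Euler scheme, arbitrary step size, horizon, and parameter values; not part of the original analysis): the snippet integrates (SA) from an initial configuration forced into the open hemisphere {⟨x, e_1⟩ > 0} and prints the diameter of the configuration, which should decay exponentially as predicted by Lemma 6.4.

```python
import numpy as np

def sa_step(X, beta, dt):
    """One explicit Euler step of (SA): dx_i/dt = sum_j A_ij (x_j - <x_i,x_j> x_i),
    where A is the row-wise softmax of beta <x_i, x_j>."""
    G = X @ X.T                                    # Gram matrix of inner products
    A = np.exp(beta * G)
    A /= A.sum(axis=1, keepdims=True)              # softmax rows, i.e. the Z_{beta,i}^{-1} factors
    X = X + dt * (A @ X - (A * G).sum(axis=1, keepdims=True) * X)
    return X / np.linalg.norm(X, axis=1, keepdims=True)   # re-project onto the sphere

rng = np.random.default_rng(0)
n, d, beta, dt = 8, 3, 1.0, 0.05
X = rng.standard_normal((n, d))
X[:, 0] = np.abs(X[:, 0])                          # <x_i(0), e_1> > 0: an open hemisphere
X /= np.linalg.norm(X, axis=1, keepdims=True)

for k in range(4001):
    if k % 800 == 0:
        diam = max(np.linalg.norm(X[i] - X[j]) for i in range(n) for j in range(n))
        print(f"t = {k*dt:6.2f}   max_ij ||x_i - x_j|| = {diam:.3e}")
    X = sa_step(X, beta, dt)
```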

In the case d < n, we can still apply Wendel's theorem (recalled below) together with Lemma 6.4 to obtain clustering to a single point with probability at least p_{n,d}, for some explicit p_{n,d} ∈ (0, 1).
Theorem 6.7 (Wendel, [Wen62]). Let d, n ≥ 1 be such that d ≤ n. Let x_1, …, x_n be n i.i.d. uniformly distributed points on S^{d−1}. The probability that these points all lie in the same hemisphere is

P(∃ w ∈ S^{d−1} : ⟨x_i, w⟩ > 0 for all i ∈ [n]) = 2^{−(n−1)} Σ_{k=0}^{d−1} (n−1 choose k).
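Wendel's formula is easy to check against simulation. In the sketch below (our own illustration, with arbitrary sample sizes), the existence of a common hemisphere is tested via the equivalent condition that 0 does not lie in the convex hull of the points, which is a linear feasibility problem.

```python
import numpy as np
from math import comb
from scipy.optimize import linprog

def wendel(n, d):
    """Wendel's formula for P(n i.i.d. uniform points on S^{d-1} lie in a hemisphere)."""
    return sum(comb(n - 1, k) for k in range(d)) / 2 ** (n - 1)

def hemisphere_exists(X):
    """True iff some w satisfies <x_i, w> > 0 for all i, i.e. 0 is not in conv{x_1,...,x_n}."""
    n, d = X.shape
    res = linprog(np.zeros(n),
                  A_eq=np.vstack([X.T, np.ones((1, n))]),   # sum_i l_i x_i = 0, sum_i l_i = 1
                  b_eq=np.append(np.zeros(d), 1.0),
                  bounds=[(0, None)] * n, method="highs")
    return not res.success                                  # infeasible <=> 0 outside the hull

rng = np.random.default_rng(0)
for n, d in [(6, 2), (8, 3), (12, 4)]:
    trials, hits = 4000, 0
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        hits += hemisphere_exists(X)
    print(f"n={n:2d}, d={d}: formula = {wendel(n, d):.4f}, Monte Carlo = {hits/trials:.4f}")
```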
6.2. More precise quantitative convergence. When n is fixed and d → +∞, in addition to showing the formation of a cluster as in Theorem 6.3, it is possible to quantitatively describe the entire evolution of the particles with high probability. To motivate this, note on the one hand that since the dynamics evolve on S^{d−1}, inner products are representative of the distance between points, and clustering occurs if ⟨x_i(t), x_j(t)⟩ → 1 for any (i, j) ∈ [n]² as t → +∞. On the other hand, if d ≫ n, the n points in a generic initial sequence are almost orthogonal by concentration of measure [Ver18, Chapter 3], and we are thus able to compare their evolution with that of an initial sequence of truly orthogonal points.

We begin by describing the case of exactly orthogonal initial particles, which is particularly simple, as the dynamics are then described by a single parameter.
Theorem 6.8. Let β ≥ 0 and d, n ≥ 2 be arbitrary. Consider an initial sequence (x_i(0))_{i∈[n]} ∈ (S^{d−1})^n of n pairwise orthogonal points: ⟨x_i(0), x_j(0)⟩ = 0 for i ≠ j, and let (x_i(·))_{i∈[n]} ∈ C⁰(ℝ_{≥0}; (S^{d−1})^n) denote the unique solution to the corresponding Cauchy problem for (SA) (resp. for (USA)). Then the angle ∠(x_i(t), x_j(t)) is the same for all distinct i, j ∈ [n]:

∠(x_i(t), x_j(t)) = θ_β(t)

for t ≥ 0 and some θ_β ∈ C⁰(ℝ_{≥0}; T). Furthermore, for (SA), γ_β(t) := cos(θ_β(t)) satisfies

(6.9)    γ̇_β(t) = 2 e^{βγ_β(t)} (1 − γ_β(t)) ((n−1)γ_β(t) + 1) / (e^β + (n−1) e^{βγ_β(t)})  for t ≥ 0,   γ_β(0) = 0,

and for (USA), we have

(6.10)   γ̇_β(t) = (2/n) e^{βγ_β(t)} (1 − γ_β(t)) ((n−1)γ_β(t) + 1)  for t ≥ 0,   γ_β(0) = 0.

Here and henceforth, T = ℝ/2πℤ denotes the one-dimensional torus. We provide a brief proof of Theorem 6.8 just below. The following result then shows that when d ≫ n, t ↦ γ_β(t) is a valid approximation of t ↦ ⟨x_i(t), x_j(t)⟩ for any distinct i, j ∈ [n].
Theorem 6.9. Fix β ≥ 0 and n ≥ 2. Then there exists some d*(n, β) ≥ n such that for all d ≥ d*(n, β), the following holds. Consider a sequence (x_i(0))_{i∈[n]} of n i.i.d. uniformly distributed points on S^{d−1}, and let (x_i(·))_{i∈[n]} ∈ C⁰(ℝ_{≥0}; (S^{d−1})^n) denote the unique solution to the corresponding Cauchy problem for (SA). Then there exist C = C(n, β) > 0 and λ = λ(n, β) > 0 such that, with probability at least 1 − 2n² d^{−1/64},

(6.11)   |⟨x_i(t), x_j(t)⟩ − γ_β(t)| ≤ min{ 2 c(β)^{nt} √(log d / d), C e^{−λt} }

holds for any i ≠ j and t ≥ 0, where c(β) = e^{10 max{1, β}} and γ_β is the unique solution to (6.9).

Since the proof is rather lengthy, we defer it to Appendix D. It relies on combining the stability of the flow with respect to the initial data (entailed by the Lipschitz nature of the vector field) with concentration of measure. An analogous statement also holds for (USA); more details can be found in Remark D.1, and the explicit values of C and λ can be found in (D.15). The upper bound in (6.11) is of interest in regimes where d and/or t are sufficiently large, as the error in (6.11) is trivially bounded by 2.
Proof of Theorem 6.8. We split the proof into two parts. We focus on proving the result for the dynamics (SA), since the same arguments readily apply to the dynamics (USA).

Part 1. The angle θ_β(t). We first show there exists θ ∈ C⁰(ℝ_{≥0}; T) such that θ(t) = ∠(x_i(t), x_j(t)) for any distinct (i, j) ∈ [n]² and t ≥ 0. Since the initial tokens are orthogonal (and thus d ≥ n), we may consider an orthonormal basis (e_1, …, e_d) of ℝ^d such that x_i(0) = e_i for i ∈ [n]. Let π : [d] → [d] be a permutation leaving [n] invariant. By decomposing any x ∈ S^{d−1} in this basis, we define P_π : S^{d−1} → S^{d−1} as

P_π( Σ_{i=1}^d a_i e_i ) = Σ_{i=1}^d a_i e_{π(i)}.

Setting y_i(t) = P_π(x_i(t)) for i ∈ [n], we see that y_i(t) solves (SA) with initial condition y_i(0) = P_π(x_i(0)). But (x_{π(1)}(t), …, x_{π(n)}(t)) is a solution of (SA) by permutation equivariance, and it has the same initial condition since P_π(x_i(0)) = x_{π(i)}(0). Consequently, we deduce that P_π(x_i(t)) = x_{π(i)}(t) for any t ≥ 0 and any i ∈ [n]. Hence

⟨x_i(t), x_j(t)⟩ = ⟨P_π(x_i(t)), P_π(x_j(t))⟩ = ⟨x_{π(i)}(t), x_{π(j)}(t)⟩,

which concludes this part of the proof.

Part 2. The curve γ_β(t). By virtue of the orthogonality assumption we have γ_β(0) = cos(θ_β(0)) = 0. To prove that γ_β(t) satisfies (6.9) in the case of (SA), recall that

P^⊥_{x_i(t)}(x_j(t)) = x_j(t) − ⟨x_i(t), x_j(t)⟩ x_i(t).

Then for k ≠ i,

γ̇_β(t) = 2⟨ẋ_i(t), x_k(t)⟩ = 2 Σ_{j=1}^n ( e^{β⟨x_i(t), x_j(t)⟩} / Σ_{ℓ=1}^n e^{β⟨x_i(t), x_ℓ(t)⟩} ) (⟨x_j(t), x_k(t)⟩ − ⟨x_i(t), x_j(t)⟩⟨x_i(t), x_k(t)⟩).

Since the denominator in the above expression is equal to (n−1)e^{βγ_β(t)} + e^β, we end up with

γ̇_β(t) = ( 2e^{βγ_β(t)} / ((n−1)e^{βγ_β(t)} + e^β) ) Σ_{j=1}^n (⟨x_j(t), x_k(t)⟩ − ⟨x_i(t), x_j(t)⟩⟨x_i(t), x_k(t)⟩)
        = ( 2e^{βγ_β(t)} / ((n−1)e^{βγ_β(t)} + e^β) ) (1 − γ_β(t)² + (n−2)(γ_β(t) − γ_β(t)²)),

as desired. □
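Since (6.9) and (6.10) are scalar ODEs, the curve γ_β is cheap to compute. The sketch below (our own illustration; the values of n, β, δ and the solver tolerances are arbitrary choices) integrates both equations with scipy and reads off the first time γ_β exceeds 1 − δ, anticipating the phase-transition discussion of Section 6.3.

```python
import numpy as np
from scipy.integrate import solve_ivp

n, beta, delta = 32, 4.0, 1e-3

def rhs_sa(t, g):   # right-hand side of (6.9), the (SA) case
    return 2 * np.exp(beta * g) * (1 - g) * ((n - 1) * g + 1) \
           / (np.exp(beta) + (n - 1) * np.exp(beta * g))

def rhs_usa(t, g):  # right-hand side of (6.10), the (USA) case
    return (2 / n) * np.exp(beta * g) * (1 - g) * ((n - 1) * g + 1)

ts = np.linspace(0.0, 30.0, 3001)
for name, rhs in [("SA", rhs_sa), ("USA", rhs_usa)]:
    sol = solve_ivp(rhs, (0.0, 30.0), [0.0], t_eval=ts, rtol=1e-9, atol=1e-12)
    g = sol.y[0]
    hit = ts[np.argmax(g >= 1 - delta)] if np.any(g >= 1 - delta) else np.inf
    print(f"({name}): gamma(30) = {g[-1]:.6f}, first time gamma >= 1 - delta: {hit:.2f}")
```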
26 GESHKOVSKI, LETROUIT, POLYANSKIY, AND RIGOLLET

6.3. Metastability and a phase transition. An interesting byproduct of Theorem 6.8 and Theorem 6.9 is that they provide an accurate approximation of the exact phase transition curve delimiting the clustering and non-clustering regimes, in terms of t and β. To be more precise, given an initial sequence (x_i(0))_{i∈[n]} ∈ (S^{d−1})^n of random points distributed independently according to the uniform distribution on S^{d−1}, and for any fixed 0 < δ ≪ 1, we define the phase transition curve as the boundary

Γ_{d,δ} = ∂{ (t, β) ∈ (ℝ_{≥0})² : t = arg inf_{s≥0} { P(⟨x_1(s), x_2(s)⟩ ≥ 1 − δ) = 1 − 2n² d^{−1/64} } },

where (x_i(·))_{i∈[n]} denotes the solution to the corresponding Cauchy problem for (SA). (Here the choice of the first two particles instead of a random distinct pair is justified by permutation equivariance.) Theorem 6.9 then gives the intuition that, over compact subsets of (ℝ_{≥0})², Γ_{d,δ} should be well-approximated by

(6.12)    Γ_{∞,δ} = { (t, β) ∈ (ℝ_{≥0})² : γ_β(t) = 1 − δ }.

[Figure 3: six phase diagrams in the (t, β) plane, for (a) d = 2, (b) d = 8, (c) d = 32, (d) d = 128, (e) d = 512, (f) d = 1024.]

Figure 3. Plots of the probability that randomly initialized particles following (SA) cluster to a single point, as a function of t and β: we graph the function (t, β) ↦ P_{(x_1(0),…,x_n(0))∼σ_d}({⟨x_1(t), x_2(t)⟩ ≥ 1 − δ}), which, by permutation equivariance, equals the same probability for any fixed pair i ≠ j. We compute this function by averaging the histogram of {⟨x_i(t), x_j(t)⟩ ≥ 1 − δ : (i, j) ∈ [n]², i ≠ j} over 2¹⁰ realizations of the initial sequence. Here δ = 10⁻³ and n = 32, while d varies. We see that the curve Γ_{∞,δ} defined in (6.12) approximates the actual phase transition with increasing accuracy as d grows, as implied by Theorem 6.9.

This is clearly seen in Figure 3, along with the fact that the resolution of this approximation increases as d → +∞.
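A stripped-down version of the experiment behind Figure 3 can be written in a few lines. The sketch below is our own condensed re-implementation, under assumptions of our choosing: an explicit Euler scheme, a coarse (t, β) grid, and far fewer realizations than the 2¹⁰ used for the actual figure.

```python
import numpy as np

def correlations(X, beta, dt, steps, every):
    """Euler integration of (SA) on the sphere; records <x_1, x_2> every `every` steps."""
    out = []
    for k in range(steps + 1):
        if k % every == 0:
            out.append(X[0] @ X[1])
        G = X @ X.T
        A = np.exp(beta * G)
        A /= A.sum(axis=1, keepdims=True)
        X = X + dt * (A @ X - (A * G).sum(axis=1, keepdims=True) * X)
        X /= np.linalg.norm(X, axis=1, keepdims=True)
    return np.array(out)

rng = np.random.default_rng(0)
n, d, delta, reps, dt, T = 32, 32, 1e-3, 32, 0.1, 30.0
steps, every = int(T / dt), 10
betas = np.arange(1, 10)

prob = np.zeros((len(betas), steps // every + 1))
for b, beta in enumerate(betas):
    for _ in range(reps):
        X = rng.standard_normal((n, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        prob[b] += correlations(X, beta, dt, steps, every) >= 1 - delta
prob /= reps
print(np.round(prob, 2))   # rows: beta = 1,...,9; columns: t = 0, 1, ..., 30
```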
Figure 3 appears to contain more information than what we may gather from Theorem 6.3, Theorem 6.8 and Theorem 6.9. In particular, for small d, we see the appearance of a zone (white/light blue in Figure 3) of parameters (t, β) for which the probability of particles being clustered is positive, but not close to one. A careful inspection of this region reveals that points are grouped into a finite number of clusters; see Figure 4. The presence of such a zone indicates the emergence of a long-time metastable state in which points are clustered into several groups but eventually relax to a single cluster in long time. This two-time-scale phenomenon is illustrated in Figure 4 and prompts us to formulate the following question.

Problem 1. Do the dynamics enter a transient metastable state, in the sense that for β ≫ 1, all particles stay in the vicinity of m < n clusters for long periods of time, before they all collapse to the final cluster {x*}?

There have been important steps towards a systematic theory of metastability for gradient flows, with applications to nonlinear parabolic equations, typically reaction-diffusion equations such as the Allen-Cahn or Cahn-Hilliard equations [OR07, KO02]. While these tools do not readily apply to the current setup, they form an important starting point for answering this question.
[Figure 4: zoom on the d = 2 phase diagram, with particle trajectory snapshots at t = 2.5, 18, 30 for β = 4 and β = 9.]

Figure 4. We zoom in on the phase diagram (Figure 3) for the dynamics on the circle: d = 2. For β = 4, 9, we also display a trajectory of (SA) for a randomly drawn initial condition at times t = 2.5, 18, 30. We see that the particles settle at 2 clusters when β = 4 (bottom right) and 3 clusters when β = 9 (top right) for a duration of time. This reflects our metastability claim for large β.

Finally, one may naturally ask whether the clustering and phase diagram conclusions persist when the parameter matrices (Q, K, V) are significantly more general: some illustrations¹¹ are given in Figure 5.

Problem 2. Can the conclusions of Theorem 6.8–Theorem 6.9 be generalized to the case of random matrices (Q, K, V)?

¹¹ See github.com/borjanG/2023-transformers-rotf for additional figures, which indicate that this phenomenon appears to hold in even more generality.
[Figure 5: four phase diagrams; panels (a) Q, K, V in real Ginibre ensemble; (b) QᵀK Wigner, V ⪰ 0 GOE; (c) Q, K in real Ginibre ensemble, V = QᵀK; (d) QᵀK Wigner, V = QᵀK.]

Figure 5. Phase diagrams (see Figure 3 for explanations) for some choices of random matrices (Q, K, V); here d = 128, n = 32. Sharp phase transitions as well as metastable regions appear in all cases.

Part 3. Further questions

We conclude this manuscript by discussing several avenues of research that can lead to a finer understanding of the clustering phenomenon and generalizations of our results, and which, we believe, are of independent mathematical interest. Specifically,

• In Section 7, we zero in on the special case d = 2, where we make a link with the celebrated Kuramoto model when β = 0;
• In Section 8, we discuss an alternative approach to analyzing clustering, based on the so-called BBGKY hierarchy from statistical mechanics;
• In Section 9, we foray beyond the case QᵀK = V = I_d in a few directions. We start in Section 9.1 by relating the case V = −I_d to optimal configurations on the sphere. We then discuss existing results in the absence of layer normalization in Section 9.2, which motivate the study of a related singular equation (Section 9.3) and a diffusive regularization (Section 9.4);
• Finally, in Section 10, we briefly discuss various questions surrounding the tuning of the parameter matrices themselves.
7. Dynamics on the circle

We study the dynamics (SA) and (USA) in the special case d = 2, namely on the unit circle S¹ ⊂ ℝ². This model, parametrized by angles and related to the celebrated Kuramoto model, is of independent interest and deserves a complete mathematical analysis.

7.1. Angular equations. On the circle S¹, all particles x_i(t) ∈ S¹ are of course completely characterized by the angle θ_i(t) ∈ T: x_i(t) = cos(θ_i(t)) e_1 + sin(θ_i(t)) e_2, where e_1 = (1, 0) and e_2 = (0, 1) ∈ ℝ². We focus on the dynamics (USA) for simplicity. For any i ∈ [n] and t ≥ 0, we may derive the equation satisfied by θ_i(t) from cos(θ_i(t)) = ⟨x_i(t), e_1⟩: differentiating in t and plugging into (USA), we obtain

θ̇_i(t) = −( n⁻¹ / sin(θ_i(t)) ) Σ_{j=1}^n e^{β⟨x_i(t), x_j(t)⟩} [ ⟨x_j(t), e_1⟩ − ⟨x_i(t), x_j(t)⟩⟨x_i(t), e_1⟩ ],

where we used the definition of the projection (if θ_i(t) = 0 for some t, we differentiate the equality sin(θ_i(t)) = ⟨x_i(t), e_2⟩ instead, which also leads to (7.1) in the end). Observing that

⟨x_i(t), x_j(t)⟩ = cos(θ_i(t) − θ_j(t)),

we find

θ̇_i(t) = −( n⁻¹ / sin(θ_i(t)) ) Σ_{j=1}^n e^{β cos(θ_i(t) − θ_j(t))} [ cos(θ_j(t)) − cos(θ_i(t) − θ_j(t)) cos(θ_i(t)) ].

Using elementary trigonometry, we conclude that

(7.1)    θ̇_i(t) = −(1/n) Σ_{j=1}^n e^{β cos(θ_i(t) − θ_j(t))} sin(θ_i(t) − θ_j(t)).

The case β = 0 is exactly the Kuramoto model recalled in Section 7.2. Suppose for the time being that β > 0. Defining the function h_β : T → ℝ_{≥0} as

h_β(θ) = e^{β cos(θ)},

we have effectively deduced that the empirical measure of the angles, ν(t) = (1/n) Σ_{j=1}^n δ_{θ_j(t)}, which is a measure on the torus T, is a solution to the continuity equation

∂_t ν(t) + ∂_θ( X[ν(t)] ν(t) ) = 0,  on ℝ_{≥0} × T,

where

X[ν](θ) = (1/β) ( h′_β ∗ ν )(θ).

When the particles x_i(t) follow (SA), one readily checks that the same continuity equation is satisfied, but with the field

X[ν](θ) = (1/β) ( (h′_β ∗ ν) / (h_β ∗ ν) )(θ).
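The angular form (7.1) is the convenient one for numerics on the circle. The sketch below (our own illustration, with arbitrary parameters and a deliberately small step size, since the factor e^β makes the field stiff) integrates (7.1) and monitors the interaction energy E_β of (7.4) below, which should be non-decreasing along the flow.

```python
import numpy as np

n, beta, dt = 16, 3.0, 1e-3
rng = np.random.default_rng(1)
theta = rng.uniform(0.0, 2.0 * np.pi, n)

def rhs(theta):
    """(7.1): theta_i' = -(1/n) sum_j exp(beta cos(theta_i - theta_j)) sin(theta_i - theta_j)."""
    D = theta[:, None] - theta[None, :]
    return -(np.exp(beta * np.cos(D)) * np.sin(D)).mean(axis=1)

def energy(theta):
    """Interaction energy (7.4); the flow is its (rescaled) gradient ascent."""
    D = theta[:, None] - theta[None, :]
    return np.exp(beta * np.cos(D)).sum() / (2 * beta * n**2)

for k in range(20001):
    if k % 4000 == 0:
        print(f"t = {k*dt:5.2f}   E_beta = {energy(theta):.6f}")
    theta += dt * rhs(theta)
```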
7.2. The Kuramoto model. As mentioned above, when β = 0, (7.1) is a particular case of the Kuramoto model [Kur75]:

(7.2)    θ̇_i(t) = ω_i + (K/n) Σ_{j=1}^n sin(θ_j(t) − θ_i(t)),

where K > 0 is a prescribed coupling constant, and ω_i ∈ T are the intrinsic natural frequencies of the oscillators θ_i(t). It is known that for sufficiently small coupling strength K, the oscillators θ_i(t) in the Kuramoto model (7.2) do not synchronize in long time. It is also known that when K exceeds some critical threshold value, a phase transition occurs, leading to the synchronization of a fraction of the oscillators. If K is chosen very large, there is total synchronization of the oscillators in long time. For more on the mathematical aspects of the Kuramoto model, we refer the reader to the review papers [Str00, ABV+05, HKPZ16] (see also [CCH+14, Chi15, FGVG16, DFGV18, HR20, TSS20, ABK+22] for a non-exhaustive list of other recent mathematical results on the subject).

When all the frequencies ω_i are equal to some given frequency, ω ∈ ℝ say, after a change of variable of the form θ_i(t) ← θ_i(t) − ωt, the dynamics in (7.2) become the gradient flow

θ̇(t) = n∇F(θ(t)),

where the energy F : Tⁿ → ℝ_{≥0} reads

(7.3)    F(θ) = (K/2n²) Σ_{i=1}^n Σ_{j=1}^n cos(θ_i − θ_j).

The oscillators can be viewed as attempting to maximize this energy. The energy F is maximized when all the oscillators are synchronized, that is, θ_i = θ* for some θ* ∈ T and for all i ∈ [n]. As the dynamics follow a gradient system, the equilibrium states are the critical points of the energy, namely those satisfying ∇F(θ) = 0. The local maxima of F correspond to equilibrium states θ that are physically achievable, since small perturbations thereof return the system back to θ.

Some authors consider a variant of the Kuramoto model in which the oscillators interact according to the edges of a graph. In other words, the coefficients A_{ij} of the graph's adjacency matrix are inserted in the sum in (7.3) as weights, and the dynamics are then the corresponding gradient flow. A recent line of work culminating with [ABK+22] has established that synchronization occurs with high probability for Erdős–Rényi graphs with parameter p, for every p right above the connectivity threshold.

Coming back to our dynamics (7.1), we notice that they can also be written as a gradient flow on Tⁿ:

θ̇(t) = n∇E_β(θ(t)),

for the interaction energy E_β : Tⁿ → ℝ_{≥0} defined as

(7.4)    E_β(θ) = (1/2βn²) Σ_{i=1}^n Σ_{j=1}^n e^{β cos(θ_i − θ_j)},

which is maximized when θ_i = θ* for some θ* ∈ T and for all i ∈ [n]. In the spirit of [LXB19], we suggest the following open problem; we recall that a critical point is called a strict saddle point of E_β if the Hessian of E_β at this point has at least one positive eigenvalue.
Problem 3. With the exception of the global maxima, are all critical points of E_β strict saddle points?

The proofs of Theorems 4.3 and 5.1 already yield a positive answer to Problem 3 in the regimes β ≤ 1 or β ≥ n²π². The complementary regime remains open. By classical arguments, recalled in Appendix A, a positive answer to Problem 3 would imply that for all initial conditions except a set of measure zero, all θ_i(t) converge under the dynamics (7.1) to a common limit as t → +∞.

Extensions of the Kuramoto model of the form

(7.5)    θ̇_i(t) = ω_i + (K/n) Σ_{j=1}^n h(θ_j(t) − θ_i(t)),

for a general non-linearity h : T → ℝ, which contain both (7.2) and our model (7.1) as particular cases, have already been studied in the physics literature. For instance, we refer the reader to [Dai92] (see also [ABV+05, page 158]), where many heuristics are proposed to address the behavior of solutions to these dynamics. We are not aware of mathematical results for (7.1) besides Theorem 5.1. We nevertheless have some hope that handling the dynamics (7.1) is easier than dealing with (7.5) for a general h; for instance, we have

h_β(θ) = e^{β cos(θ)} = Σ_{k∈ℤ} I_k(β) e^{ikθ},

where the I_k(β) are the modified Bessel functions of the first kind, whose properties have been extensively studied.
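This expansion is immediate to verify numerically; in the sketch below (our own illustration, with arbitrary values of β and the grid size), the FFT of θ ↦ e^{β cos θ} is compared against scipy's implementation of I_k.

```python
import numpy as np
from scipy.special import iv    # modified Bessel function of the first kind I_k

beta, m = 2.0, 256
theta = 2.0 * np.pi * np.arange(m) / m
coeffs = np.fft.rfft(np.exp(beta * np.cos(theta))).real / m   # Fourier coefficients of h_beta

for k in range(6):
    print(f"k = {k}:  FFT coefficient = {coeffs[k]:.10f},  I_k(beta) = {iv(k, beta):.10f}")
```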

8. BBGKY hierarchy

For the sake of simplicity, we again focus on the dynamics on the circle S¹, where we recall that all particles are parametrized by angles (which we also refer to as particles). To carve out an even more complete understanding of the clustering phenomenon, it is natural to consider initial particles sampled i.i.d. from the uniform distribution on S¹ and to study the time-evolution of the r-particle distribution ρ_n^{(r)}(t, θ_1, …, θ_r), defined as the joint law of the particles θ_1(t), …, θ_r(t). Otherwise put, it is the r-point marginal of the joint distribution ρ^{(n)}(t, ·) ∈ P(Tⁿ) of all n particles. Note that because of rotational invariance, ρ^{(1)}(t, ·) is just the uniform distribution, equal to 1/2π for all t ≥ 0. For r = 2, again by rotational invariance, there exists some ψ(t, ·) : T → ℝ_{≥0} such that

ρ^{(2)}(t, θ_1, θ_2) = ψ(t, θ_2 − θ_1).

Proving the clustering/synchronization of all θ_i(t) in long time amounts to proving that ψ(t, ·) converges to a Dirac mass centered at 0 as t → +∞. Using the fact that ρ^{(n)}(t, ·) solves the Liouville equation, and following the method used to derive the BBGKY¹² hierarchy [GSRT13, Gol16], it is possible to show that ψ(t, ·) satisfies

(8.1)    ∂_t ψ(t, x) + ∂_x( v(t, x) ψ(t, x) ) = 0  in ℝ_{≥0} × T,   ψ(0, x) = (2π)⁻¹  in T,

where

v(t, x) = (2/(βn)) h′_β(x) − (2(n−2)/(βn)) g(t, x),

and

g(t, x) = E[ −h′_β(θ_3(t)) | θ_1(t) = 0, θ_2(t) = x ].

Note that equation (8.1) is not closed, since g(t, x) depends on the 3-point correlation function. This is typical in the BBGKY hierarchy, whereupon physical theory and experimental evidence are typically used to devise an ansatz for closing the system. For instance, the Boltzmann equation is derived from the BBGKY hierarchy by assuming the molecular chaos hypothesis (Stosszahlansatz) at the level of r = 2. We suggest closing (8.1) in a way that reflects the formation of clusters:

Problem 4. Devise a realistic ansatz for g(t, x) which allows one to close equation (8.1), and to prove the convergence of ψ(t, ·) to a Dirac mass centered at 0 as t → +∞.

The derivation of a BBGKY hierarchy when d > 2, as well as for (SA), are also problems which we believe merit further investigation.

¹² Bogoliubov–Born–Green–Kirkwood–Yvon.

9. General matrices

Figure 5 hints at the likelihood of the clustering phenomenon being significantly more general than just the case Q = K = V = I_d. However, extending our proofs to more general parameter matrices does not appear to be straightforward and is an open problem. Here we discuss some particular cases (without excluding other approaches).

9.1. The repulsive case. As seen from Lemma 3.7, in the repulsive case V = −I_d the interaction energy E_β decreases along trajectories. Recall that the unique global minimum of E_β over P(S^{d−1}) is the uniform distribution (Proposition 3.4). In contrast, we explain in this section that many different configurations of n points may yield global minima of E_β when it is minimized over empirical measures with n atoms.

We thus focus on minimizing E_β over the set P_n(S^{d−1}) of empirical measures, namely sums of n Dirac masses. Rewriting E_β as

E_β[μ] = (e^β/2β) ∬ e^{−(β/2)‖x − x′‖²} dμ(x) dμ(x′),

it turns out that minimizing E_β over P_n(S^{d−1}) is precisely the problem of finding optimal configurations of points on S^{d−1}, which has direct links to the sphere packing problem [CK07, CKM+22] and coding theory [DGS91]. For μ ∈ P_n(S^{d−1}), we can equivalently rewrite E_β in terms of the set of support points C ⊂ S^{d−1}, #C = n:

E_β[μ] = H_β[C] = (e^β/(2n²β)) Σ_{x,x′∈C} e^{−(β/2)‖x − x′‖²}.

In [CK07], Cohn and Kumar characterize the global minima C of H_β. To state their result, we need the following definition.
Definition 9.1. Let n ≥ 2. A set of points C = {x_1, …, x_n} ⊂ S^{d−1} is called a spherical t-design if

∫ p(x) dσ_d(x) = (1/n) Σ_{i=1}^n p(x_i)

for all polynomials p of d variables, of total degree at most t. The set of points C is called a sharp configuration if there are m distinct inner products between pairwise distinct points in C, for some m > 1, and if it is a spherical (2m−1)-design.

The following result is a special case of [CK07, Theorem 1.2].

Theorem 9.2 ([CK07]). Let n ≥ 2. Any global minimum of H_β among C ⊂ S^{d−1}, #C = n, is either a sharp configuration or the vertices of a 600-cell¹³.

The set of sharp configurations is not known for all regimes of n, d or m (the largest m such that the configuration is a spherical m-design). A list of known examples is provided in [CK07, Table 1]: it consists of vertices of full-dimensional polytopes (specifically, regular polytopes whose faces are simplices), or particular derivations of the E_8 root lattice in ℝ⁸ and the Leech lattice in ℝ²⁴. We refer the reader to [CK07] and the illustrative experimental paper [BBC+09] for further detail. A complete picture of the long-time behavior of Transformers in the repulsive case remains open.

¹³ A 600-cell is a particular 4-dimensional convex polytope with n = 120 vertices.
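One simple way to explore such optimal configurations is projected gradient descent on H_β (up to its constant prefactor). The sketch below is our own illustration, with an arbitrary step size and random initialization; for n = d + 1 it typically recovers the regular simplex, whose pairwise inner products equal −1/d.

```python
import numpy as np

def grad_H(X, beta):
    """Gradient of sum_{i,j} exp(-beta/2 ||x_i - x_j||^2) with respect to the x_i."""
    diff = X[:, None] - X[None, :]                      # shape (n, n, d)
    W = np.exp(-beta / 2 * (diff ** 2).sum(-1))
    return -2 * beta * (W[:, :, None] * diff).sum(axis=1)

rng = np.random.default_rng(2)
n, d, beta, lr = 4, 3, 2.0, 0.02
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

for _ in range(8000):
    X -= lr * grad_H(X, beta)                           # descend the repulsive energy
    X /= np.linalg.norm(X, axis=1, keepdims=True)       # project back onto the sphere

print(np.round(X @ X.T, 3))   # off-diagonal entries approach -1/d = -1/3: a regular simplex
```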

9.2. Pure self-attention. An alternative avenue for conducting such an analysis, which has shown to be particularly fruitful, consists in removing the projector P^⊥_x, leading to

(9.1)    ẋ_i(t) = (1/Z_{β,i}(t)) Σ_{j=1}^n e^{β⟨Qx_i(t), Kx_j(t)⟩} V x_j(t)

for all i ∈ [n] and t ∈ ℝ_{≥0}. In fact, in [GLPR24] we analyze precisely these dynamics, and show different clustering results depending on the spectral properties of the matrix V. We briefly summarize our findings in what follows.

9.2.1. A review of [GLPR24]. For most choices of value matrices V, without rescaling time, most particles diverge to ±∞ and no particular pattern emerges. To make a very rough analogy, (9.1) "looks like" ẋ_i(t) = V x_i(t) (which amounts to having P_{ij}(t) = δ_{ij} instead of (2.5)), whose solutions are given by x_i(t) = e^{tV} x_i(0). To discern the formation of clusters, we introduce the rescaling¹⁴

(9.2)    z_i(t) = e^{−tV} x_i(t),

which are solutions to

(9.3)    ż_i(t) = (1/Z_{β,i}(t)) Σ_{j=1}^n e^{β⟨Qe^{tV} z_i(t), Ke^{tV} z_j(t)⟩} V (z_j(t) − z_i(t))

for i ∈ [n] and t ≥ 0, where

Z_{β,i}(t) = Σ_{k=1}^n e^{β⟨Qe^{tV} z_i(t), Ke^{tV} z_k(t)⟩},

¹⁴ The rescaling (9.2) should be seen as a surrogate for layer normalization.
whereas the initial condition remains the same, namely x_i(0) = z_i(0). It is crucial to notice that the coefficients A_{ij}(t) (see (2.5)) of the self-attention matrix for the rescaled particles z_i(t) are the same as those for the original particles x_i(t). The weight A_{ij}(t) indicates the strength of the attraction of z_i(t) by z_j(t). In [GLPR24] we show that the rescaled particles z_i(t) cluster toward well-characterized geometric objects as t → +∞ for various choices of matrices (Q, K, V). Our results are summarized in Table 1 below, whose first two lines are discussed thereafter.

    V                   | K and Q            | Limit geometry                  | Result in [GLPR24]
    --------------------|--------------------|---------------------------------|-------------------
    V = I_d             | QᵀK > 0            | vertices of convex polytope     | Theorem 3.1
    λ_1(V) > 0, simple  | ⟨Qφ_1, Kφ_1⟩ > 0   | union of 3 parallel hyperplanes | Theorem 4.1
    V paranormal        | QᵀK > 0            | polytope × subspaces            | Theorem 5.1
    V = −I_d            | QᵀK = I_d          | single cluster at origin*       | Theorem C.5

Table 1. Summary of the clustering results of [GLPR24]. *All results except for the case V = −I_d hold for the time-scaled dynamics (9.3).

When V = I_d, outside of exceptional situations, all particles cluster to the vertices of some convex polytope. Indeed, since the velocity ż_i(t) is a convex combination of the attractions z_j(t) − z_i(t), the convex hull K(t) of the z_i(t) shrinks and thus converges to some convex polytope. The vertices of the latter attract all particles as t → +∞. When the eigenvalue of V with largest real part, denoted by λ_1(V), is simple and positive, the rescaled particles z_i(t) cluster on hyperplanes which are parallel to the direct sum of the eigenspaces of the remaining eigenvalues. Roughly speaking, the coordinates of the points z_i(t) along the eigenvector of V corresponding to λ_1(V) quickly dominate the matrix coefficients P_{ij}(t) in (9.3), due to the factors e^{tV} z_j(t). For more results and insights regarding clustering on ℝ^d, we refer the reader to [GLPR24]. We nonetheless leave the reader with the following general question:

Problem 5. Is it possible to extend the clustering results of Table 1 to other cases of (Q, K, V)? What are the resulting limit shapes?
9.3. Singular dynamics. We mention another intriguing question, whose answer would allow for a transparent geometric understanding of clustering for (9.3). Let (Q, K, V) be given d × d matrices. For β > 0, we consider the system of coupled ODEs

(9.4)    ż_i(t) = (1/Z_{β,i}(t)) Σ_{j=1}^n e^{β⟨Qz_i(t), Kz_j(t)⟩} V (z_j(t) − z_i(t)),

where once again

Z_{β,i}(t) = Σ_{k=1}^n e^{β⟨Qz_i(t), Kz_k(t)⟩}.

For any T > 0, and any fixed initial condition (z_i(0))_{i∈[n]} ∈ (ℝ^d)^n, as β → +∞ we expect that the solution to (9.4) converges uniformly on [0, T] to a solution of

(9.5)    ż_i(t) = (1/|C_i(t)|) Σ_{j∈C_i(t)} V (z_j(t) − z_i(t)),

where

(9.6)    C_i(t) = { j ∈ [n] : ⟨Qz_i(t), Kz_j(t)⟩ ≥ ⟨Qz_i(t), Kz_k(t)⟩ for all k ∈ [n] }.

However, defining a notion of solution to (9.5)–(9.6) is not straightforward, as illustrated by the following example.

Example 9.3. Suppose d = 2 and n = 3. Let Q = K = V = I_d and z_1(0) = (1, 1), z_2(0) = (−1, 1), z_3(0) = (0, 0). Consider the evolution of these particles through (9.5)–(9.6). The points z_1(t) and z_2(t) do not move, because it is easily seen that C_i(t) = {i} for i ∈ {1, 2}. On the other hand, the point z_3(t) can be chosen to solve any one of three equations: ż_3(t) = z_1(t) − z_3(t), or ż_3(t) = z_2(t) − z_3(t), or even ż_3(t) = (1/2)(z_1(t) + z_2(t)) − z_3(t). In any of these cases, both (9.5) and (9.6) remain satisfied for almost every t ≥ 0.
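The non-uniqueness above is easy to observe in a time discretization of (9.5)–(9.6). The sketch below is our own illustration, with an arbitrary tie-breaking tolerance and step size; it implements one possible selection, averaging over the whole argmax set C_i(t), and replacing the mean by any single element of C_i(t) would trace out a different solution.

```python
import numpy as np

def hardmax_step(Z, dt, tol=1e-9):
    """Euler step of (9.5)-(9.6) with Q = K = V = Id, averaging over the argmax set C_i."""
    G = Z @ Z.T
    Znew = Z.copy()
    for i in range(len(Z)):
        C = np.flatnonzero(G[i] >= G[i].max() - tol)   # the set C_i(t), up to a tolerance
        Znew[i] += dt * (Z[C].mean(axis=0) - Z[i])
    return Znew

Z = np.array([[1.0, 1.0], [-1.0, 1.0], [0.0, 0.0]])    # the configuration of Example 9.3
for _ in range(2000):
    Z = hardmax_step(Z, 0.01)
print(np.round(Z, 4))
# z_1 and z_2 never move (C_i = {i} for i = 1, 2), while with this selection
# z_3 converges to the midpoint (0, 1) of z_1 and z_2.
```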
It is possible to prove the existence of solutions to (9.5)–(9.6) defined in the sense of Filippov¹⁵: for this, one can either use a time-discretization of (9.5)–(9.6), or a convergence argument for solutions to (9.4) as β → +∞. Uniqueness, however, does not hold, as illustrated by Example 9.3. This naturally leads us to the following question:

Problem 6. Is it possible to establish a selection principle (similar to viscosity or entropy solutions) for solutions to (9.5)–(9.6) which allows one to restore uniqueness? In the affirmative, is it possible to revisit the clustering results of [GLPR24] and Problem 5 in the setting of (9.5)–(9.6)?

¹⁵ We thank Enrique Zuazua for this suggestion.
9.4. Diffusive regularization. We believe that (9.5)–(9.6) is also an original model for collective behavior. There are some similarities in spirit with methods arising in consensus-based optimization (CBO for short) [PTTM17, CJLZ21]. With CBO methods, one wishes to minimize a smooth and bounded, but otherwise arbitrary, function f : ℝ^d → ℝ by making use of the Laplace method

lim_{β→+∞} ( −(1/β) log ∫_{ℝ^d} e^{−βf(x)} dρ(x) ) = inf_{x∈supp(ρ)} f(x),

which holds for any fixed ρ ∈ P_ac(ℝ^d). This is accomplished by considering a McKean–Vlasov particle system of the form

dx_i(t) = −λ (x_i(t) − v[μ_n(t)]) H^ε( f(x_i(t)) − f(v[μ_n(t)]) ) dt + √2 σ |x_i(t) − v[μ_n(t)]| dW_i(t)

for fixed β > 0, with drift parameter λ > 0 and noise parameter σ ≥ 0; here H^ε is a particular smoothed Heaviside function, and μ_n(t) is the empirical measure of the particles. The point v[μ] ∈ ℝ^d is a weighted average of the particles:

v[μ] = (1/Z_{β,μ}) ∫_{ℝ^d} e^{−βf(x)} x dμ(x),

where Z_{β,μ} = ∫_{ℝ^d} e^{−βf(x)} dμ(x). Morally speaking, particles which are near a minimum of f have a larger weight. The drift term is a gradient relaxation (for a quadratic potential) towards the current weighted average position of the batch of particles. The diffusion term is an exploration term whose strength is proportional to the distance of the particle from the current weighted average. Results of convergence to a global minimizer do exist, under various smallness assumptions on the initial distribution of the particles and assumptions on the relative size of the coefficients. They rely on the analysis of the associated Fokker–Planck equation; see [CJLZ21, CD22], and also [FHPS21] for the analogue on S^{d−1}. We point out that the similarities are mainly in spirit: these results and their analysis are inapplicable to our setting because there is no analogue of f(x). Nonetheless, they do raise the following interesting question:

Problem 7. What can be said about the long-time limit of Transformers with a noise/diffusion term of strength σ > 0?

The question is of interest for any of the Transformers models presented in what precedes.

10. Approximation, control, training

Understanding expressivity, namely the ability of a neural network to reproduce any map in a given class (by tuning its parameters), is essential. Two closely related notions reflect the expressivity of neural networks: interpolation, the property of exactly matching arbitrarily many input and target samples, and (universal) approximation, the property of approximating input-target functional relationships in an appropriate topology. We refer the reader to [CLLS23] for a primer on the relationship between these two notions in the context of deep neural networks.

For discrete-time Transformers, universal approximation has been shown to hold in [YBR+19], making use of a variant of the architecture with translation parameters and letting the number of layers go to infinity; see also [ADTK23, JL23] and the review [JLLW23].

In the context of flow maps (from ℝ^d to ℝ^d), it is now well understood that interpolation and approximation reflect the controllability properties of the system. The transfer of control-theoretic techniques to the understanding of expressivity has borne fruit, both in terms of controllability results [AS22, CLT20, TG22, LLS22, RBZ23, VR23, CLLS23] and optimal control insights [LCT18, GZ22]. We refer the reader to [AL24, AG24, FdHP24] for the first universal approximation results for Transformers, viewed as measure-to-measure maps, using control-theoretic tools.

Besides approximation, understanding the training dynamics of Transformers is another major challenge, which we have not covered herein. As it is impossible to reference all works on this flourishing topic, we refer the interested reader to [TLTO23, ACDS23, DGTT24] and references therein.

Acknowledgments
We thank Pierre Ablin, Sébastien Bubeck, Gabriel Peyré, Matthew Rosenzweig,
Sylvia Serfaty, Kimi Sun, and Rui Sun for discussions. We thank Nicolas Boumal
for referring us to [MTG17, CRMB24] and for clarifying comments.

Appendix

Appendix A. Proof of Theorem 4.1

The proof of Theorem 4.1 relies on standard arguments from dynamical systems, upon noticing that the evolution (4.1) is a (continuous-time) gradient ascent for the energy E_0 : (S^{d−1})^n → ℝ defined as

E_0(x_1, …, x_n) = (1/n) Σ_{i=1}^n Σ_{j=1}^n ⟨x_i, x_j⟩.

Since the dynamics are the gradient ascent of a real-analytic functional on the compact real-analytic manifold (S^{d−1})^n, the celebrated Łojasiewicz theorem [Loj63], in the form given by [HKR18, Corollary 5.1] (which is valid in the context of general compact Riemannian manifolds), implies that for any initial condition X ∈ (S^{d−1})^n, the solution Φ_t(X) ∈ (S^{d−1})^n converges to some critical point X* ∈ (S^{d−1})^n of E_0 as t → +∞.

We recall that a strict saddle point of E_0 is a critical point of E_0 at which the Hessian of E_0 has at least one strictly positive eigenvalue. Theorem 4.1 then follows by combining the following two lemmas with the Łojasiewicz theorem.

Lemma A.1. Let M be a compact Riemannian manifold and let f : M → ℝ be a smooth function. The set of initial conditions X_0 ∈ M for which the gradient ascent

(A.1)    Ẋ(t) = ∇f(X(t)),   X(0) = X_0,

converges to a strict saddle point of f has volume zero.

Proof of Lemma A.1. Let us denote by Φ_t(X_0) := X(t), t ≥ 0, the solution to (A.1). We denote by S ⊂ M the set of strict saddle points of f, and by A ⊂ M the set of initial conditions X_0 ∈ M for which Φ_t(X_0) converges to a strict saddle point of f as t → +∞. For any y ∈ S, we denote by B_y a ball in which the local center-stable manifold W^{sc}_{loc}(y) exists (see [Shu13], Theorem III.7 and Exercise III.3 for the adaptation to flows). Using compactness, we may write the union of these balls as a countable union ∪_{k∈I} B_{y_k} (where I is countable and y_k ∈ M for k ∈ I). If X_0 ∈ A, there exist some t_0 ≥ 0 and k ∈ I such that Φ_t(X_0) ∈ B_{y_k} for all t ≥ t_0. From the center-stable manifold theorem ([Shu13], Theorem III.7 and Exercise III.3, where we note that the Jacobian of a gradient vector field coincides, at a critical point, with the Hessian of the corresponding function), we gather that Φ_t(X_0) ∈ W^{sc}_{loc}(y_k) for t ≥ t_0, hence X_0 ∈ Φ^{−t}(W^{sc}_{loc}(y_k)) for all t ≥ t_0. The dimension of W^{sc}_{loc}(y_k) is at most dim(M) − 1, thus it has zero volume. Since Φ_t is a diffeomorphism on a compact manifold, Φ^{−t} preserves null-sets, and hence Φ^{−t}(W^{sc}_{loc}(y_k)) has zero volume for all t ≥ 0. Therefore A, which satisfies

A ⊂ ∪_{k∈I} ∪_{ℓ∈ℕ} Φ^{−ℓ}(W^{sc}_{loc}(y_k)),

has volume zero. □

Lemma A.2. Any critical point (x_1, …, x_n) ∈ (S^{d−1})^n of E_0 which is not a global maximum, namely at which the points x_i are not all equal, is a strict saddle point. In particular, all local maxima are global.

Proof of Lemma A.2. We extend the proof idea of [Tay12, Theorem 4.1] as follows. Let (x_1, …, x_n) ∈ (S^{d−1})^n be a critical point of E_0, and assume that the points x_i are not all equal to each other.
Step 1. We first prove that there exists a set of indices S ⊂ [n] such that

(A.2)    Σ_{i∈S} Σ_{j∈S^c} ⟨x_i, x_j⟩ < 0.

To this end, define

m := Σ_{j=1}^n x_j,

and consider two cases. If m ≠ 0, then we deduce from ∇E_0(x_1, …, x_n) = 0 that for any j ∈ [n], x_j is collinear with m. Thus x_j = ±x_1 for any j ∈ [n]. Setting

S = { j ∈ [n] : x_j = +x_1 },

we see that (A.2) holds, unless S = [n], which has been excluded. Now suppose that m = 0. Then by expanding ⟨m, x_i⟩ = 0, we find that for any i ∈ [n],

−1 = Σ_{j≠i} ⟨x_j, x_i⟩

holds, which again implies (A.2), with S = {1}.

Step 2. In this second step we deduce from (A.2) that (x_1, …, x_n) is a strict saddle point. Consider an arbitrary non-zero skew-symmetric matrix B and define the perturbation

x_i(t) = x_i for i ∉ S,   x_i(t) = e^{tB} x_i for i ∈ S.

Set E_0(t) = E_0(x_1(t), …, x_n(t)). Note that we have

E_0(t) = const. + (2/n) Σ_{i∈S} Σ_{j∈S^c} ⟨x_i(t), x_j⟩,

where we grouped the time-independent terms into the constant (recall that e^{tB} is an orthogonal matrix, since the skew-symmetric matrices form the Lie algebra of SO(d)). Thus

E′_0(t) = (2/n) Σ_{i∈S} Σ_{j∈S^c} ⟨ẋ_i(t), x_j⟩,
E″_0(t) = (2/n) Σ_{i∈S} Σ_{j∈S^c} ⟨ẍ_i(t), x_j⟩.

Since (x_1, …, x_n) is a critical point of E_0, we have E′_0(0) = 0. On the other hand, since ẍ_i(0) = B²x_i, we have

(A.3)    E″_0(0) = (2/n) Σ_{i∈S} Σ_{j∈S^c} ⟨B²x_i, x_j⟩.

We claim that given (A.2), there must exist some skew-symmetric matrix B such that E″_0(0) > 0. Indeed, if d is even, then we just take B to be the block-diagonal matrix with repeated block

[ 0  1 ]
[−1  0 ],

so that B² = −I_d. If d is odd, we can represent

(A.4)    −I_d = (1/(d−1)) Σ_{j=1}^d B_j²,

where B_j is the same block-diagonal matrix, with the exception that the j-th block is a 1 × 1 zero matrix. If each B_j were to yield E″_0(0) ≤ 0, this would violate (A.2). Thus, E″_0(0) > 0 for some well-chosen skew-symmetric B, which proves that (x_1, …, x_n) is a strict saddle point. □
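The identity (A.4) is easily verified numerically; the sketch below (our own illustration) assembles the matrices B_j for an odd dimension d and checks that their squares average to −I_d.

```python
import numpy as np

def B_skew(d, skip):
    """Skew-symmetric matrix made of 2x2 rotation blocks on all coordinates except `skip`."""
    B = np.zeros((d, d))
    idx = [k for k in range(d) if k != skip]    # pair up the remaining d - 1 coordinates
    for a, b in zip(idx[0::2], idx[1::2]):
        B[a, b], B[b, a] = 1.0, -1.0
    return B

d = 5                                            # any odd dimension works
S = sum(B_skew(d, j) @ B_skew(d, j) for j in range(d))
print(np.allclose(S / (d - 1), -np.eye(d)))      # (A.4): -Id = (1/(d-1)) sum_j B_j^2
```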

Appendix B. Proof of Theorem 5.1

Proof of Theorem 5.1. We leverage the gradient flow structure presented in Remark 3.8 and Section 3.4 (the manifold is compact, and the metric and functional are analytic), and use Lemma A.1 as in the proof of Theorem 4.1. Consequently, it suffices to show that, in the stated regime of β, the critical points of E_β which are not global maxima are strict saddle points, namely, that all local maxima are global. For simplicity we write the argument for (USA) and explain the extension to the case of (SA) in Remark B.1.

We begin by focusing on the case d = 2, and then provide a brief argument showing that the case of arbitrary d ≥ 2 readily follows.

Let (θ_1, …, θ_n) ∈ Tⁿ be a critical point at which all eigenvalues of the Hessian of E_β are non-positive. We intend to show that if β is sufficiently large, then necessarily θ_1 = … = θ_n. To that end, note that the non-positivity of the Hessian of E_β implies in particular that for any subset of indices S ⊂ [n], we must have

(B.1)    Σ_{i∈S} Σ_{j∈S} ∂_{θ_i} ∂_{θ_j} E_β(θ_1, …, θ_n) ≤ 0.

Notice that for any i, j ∈ [n],

∂_{θ_i} E_β(θ_1, …, θ_n) = −(1/n²) Σ_{m∈[n]∖{i}} sin(θ_i − θ_m) e^{β cos(θ_i − θ_m)}

and

∂_{θ_i} ∂_{θ_j} E_β(θ_1, …, θ_n) = (1/n²) · { g(θ_i − θ_j) if i ≠ j;  −Σ_{m∈[n]∖{i}} g(θ_i − θ_m) if i = j },

where we set g(x) := (cos(x) − β sin²(x)) e^{β cos(x)}. Plugging this expression back into (B.1) and simplifying, we obtain

(B.2)    Σ_{i∈S} Σ_{j∈S^c} g(θ_i − θ_j) ≥ 0.

Let us now define τ*_β as the unique solution on [0, π/2) of the equation

β sin²(τ) = cos(τ).

Note that τ*_β is a monotonically decreasing function of β, and in fact

τ*_β = (1 + o(1))/√β

as β → +∞. The importance of τ*_β lies in the following property of the function g: for any τ ∉ [−τ*_β, τ*_β], we must have g(τ) < 0 (see Figure 6). We arrive at the following conclusion: for any proper subset S ⊂ [n], there must exist, by virtue of (B.2), some index j ∈ S^c such that

inf_{k∈S} |θ_j − θ_k| < τ*_β.

So now let us start with S = {1} and grow S inductively, by adding those points θ_j at distance < τ*_β from {θ_k : k ∈ S} at each induction step. If β is large enough that

(n − 1) τ*_β < π/2,

then in the process of adding points we have travelled a total arc-length < π/2 on each side of θ_1. Thus the collection of points θ_1, …, θ_n must be strictly contained inside a half-circle of angular width < π. By Lemma 6.4 we know that there can be no critical points of E_β strictly inside some half-circle, unless the critical point is trivial: θ_1 = … = θ_n. This completes the proof when d = 2.

[Figure 6: graph of τ ↦ g(τ) for β = 2 and β = 6.]

Figure 6. The function τ ↦ g(τ) for two values of β.
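The threshold τ*_β and the resulting condition on β are straightforward to compute; the sketch below (our own illustration, with arbitrary sample values of β) solves β sin²τ = cos τ with a bracketing root-finder and illustrates the asymptotics τ*_β ≈ 1/√β.

```python
import numpy as np
from scipy.optimize import brentq

def tau_star(beta, d=2):
    """Unique root of beta sin(tau)^2 = (d - 1) cos(tau) on [0, pi/2)."""
    return brentq(lambda t: beta * np.sin(t) ** 2 - (d - 1) * np.cos(t),
                  0.0, np.pi / 2 - 1e-12)

for beta in [1.0, 10.0, 100.0, 1000.0]:
    t = tau_star(beta)
    print(f"beta = {beta:6.0f}:  tau* = {t:.5f}   (1/sqrt(beta) = {beta ** -0.5:.5f})")

# range of n for which the sufficient condition (n - 1) tau* < pi/2 holds at beta = 100:
print("(n - 1) tau* < pi/2 requires n <", np.pi / (2 * tau_star(100.0)) + 1)
```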

We can show that the same conclusion holds in any dimension d ≥ 2. The proof follows by arguing just as above, making use instead of the following generalization of (B.2): given a collection x_1, …, x_n ∈ S^{d−1} at which the Hessian of E_β is non-positive, we must have, for any subset S ⊂ [n],

(B.3)    Σ_{i∈S} Σ_{j∈S^c} g(θ_{ij}) ≥ 0,

where g(ζ) = e^{β cos(ζ)}((d−1) cos(ζ) − β sin²(ζ)) and θ_{ij} ∈ [0, π] is the geodesic distance between x_i and x_j, namely cos(θ_{ij}) = ⟨x_i, x_j⟩. We now show (B.3). By repeating the argument in Step 2 of the proof of Lemma A.2, we see that for any skew-symmetric matrix B we must have

(B.4)    Σ_{i∈S} Σ_{j∈S^c} e^{β⟨x_i, x_j⟩} ( β⟨Bx_i, x_j⟩² + ⟨B²x_i, x_j⟩ ) ≤ 0.

Now we take B to be random, generating B_{ij} ~ P i.i.d. for i < j, where P is any zero-mean, unit-variance distribution. We set B_{ji} = −B_{ij} and B_{ii} = 0. Then it is easy to check that

E[B²] = −(d−1) I_d

and

E[⟨Bx_i, x_j⟩²] = 1 − ⟨x_i, x_j⟩² = sin²(θ_{ij}).

Thus, taking the expectation over all such B in (B.4) yields (B.3). Mirroring the proof for d = 2, we define τ*_β as the unique solution on [0, π/2) of the equation β sin²(τ) = (d−1) cos(τ). We note that

τ*_β = √( ((d−1) + o(1)) / β )

as β → +∞. Repeating verbatim the argument for the case d = 2, we deduce convergence to a single cluster whenever β ≳ (d−1)n². □

Remark B.1. We comment on the extension of the above proof to the dynamics (SA). We recall that (SA) is a gradient flow, but for a different metric (see Section 3.4), and we show that the saddle point property is preserved across metrics. Our proof is an adaptation of a classical argument: the Hessian of a function at a critical point is a notion which does not depend on the choice of Riemannian metric.

Let x = (x_1, …, x_n) ∈ (S^{d−1})^n be a critical point of E_β (this does not depend on the metric) such that not all the x_i are equal to each other. Recall that for f : (S^{d−1})^n → ℝ, for any metric on (S^{d−1})^n (with associated Christoffel symbols Γ^k_{ij}) and any associated orthonormal basis y_1, …, y_{(d−1)n}, the Hessian of f reads

(B.5)    Hess(f) = ( ∂²f/∂y_i∂y_j − Γ^k_{ij} ∂f/∂y_k ) dy_i ⊗ dy_j.

Since we are evaluating the Hessian at a critical point x of E_β, the term carrying the Christoffel symbols Γ^k_{ij} vanishes. In the above argument, we saw that Hess E_β evaluated at x, and written in an orthonormal basis for the canonical metric g on (S^{d−1})^n, is not negative semi-definite. We denote this matrix by M_g; we know that there exists v ∈ T_x(S^{d−1})^n such that g(v, v) = 1 and vᵀ M_g v > 0. Let g̃ be another metric on (S^{d−1})^n; we denote by M_g̃ the Hessian evaluated at x, written in an orthonormal basis for g̃. Let c : ℝ_{≥0} → (S^{d−1})^n be such that c(0) = x and ċ(0) = v. Since x is a critical point (for both metrics), a Taylor expansion to second order in the two orthonormal bases yields

E_β(c(t)) = E_β(c(0)) + (t²/2) vᵀ M_g v + O(t³)

as well as

E_β(c(t)) = E_β(c(0)) + (t²/2) ‖v‖_{g̃}^{−2} vᵀ M_g̃ v + O(t³),

thanks to (B.5). Hence vᵀ M_g̃ v > 0. Specializing to g̃ being the metric of Section 3.4, with respect to which (SA) is a gradient flow for E_β, we conclude for (SA).
42 GESHKOVSKI, LETROUIT, POLYANSKIY, AND RIGOLLET

Appendix C. Proof of Theorem 4.3


Proof of Theorem 4.3. We leverage the gradient flow structure and follow the same
strategy as in the proof of Theorem 5.1 presented above. For simplicity we write
the argument for (USA) and explain the extension to the case of (SA) in Remark
C.1.
Consider
n n
1 ÿ ÿ ´ βxxi ,xj y ¯
Eβ px1 , . . . , xn q “ e ´1 .
2β i“1 j“1
Note that this is only a slight deviation from the energy studied in Section 3.4: we
solely subtracted a constant. Consequently the dynamics (USA) are also a gradient
flow for this energy. The main interest of considering this modified energy is the
observation that
Eβ px1 , . . . , xn q “ E0 px1 , . . . , xn q ` β Rβ px1 , . . . , xn q,
where Rβ is smooth. Hence Rβ has a bounded Hessian on pSd´1 qn uniformly with
respect to β, and
(C.1) ∇Eβ “ ∇E0 ` Opβq, Hess Eβ “ Hess E0 ` Opβq.
Observe that in the proof of Theorem 4.1, we actually showed that there exists
c ą 0 such that at any critical point px1 , . . . , xn q of E0 for which xi ‰ xj whenever
i ‰ j, at least one of the eigenvalues of the Hessian of E0 , λ say, satisfies λ ě c.
Indeed, in (A.2) the proof actually shows the existence of some S Ă rns such that
ÿ ÿ
xxi , xj y ď ´1.
iPS jPSc

Then, (A.3), together with (A.4) for instance, yield


2pd ´ 1q
(C.2) E20 p0q ě “: c
dn
for one of the Bj .
Now suppose that there exists a positive sequence βk Ñ 0 as well as Xk P pSd´1 qn
such that Xk is a critical point of Eβk and all of the eigenvalues of Hess Eβk pXk q
are non-positive. Then by virtue of the continuity properties of Eβ with respect to
β in (C.1), we find that, up to extracting a subsequence, there is some limit point
X “ px1 , . . . , xn q P pSd´1 qn of Xk which is a critical point of E0 , and such that all of
the eigenvalues of Hess E0 pXq are non-positive. Per Theorem 4.1, this implies that
x1 “ . . . “ xn . But then, for large enough k, Xk is also constituted of points which
are all nearly equal, whence in the same hemisphere, and the only such critical
point of Eβ is that in which all points are equal (synchronized). This, combined
with the continuity of the eigenvalues of Hess Eβ with respect to β and (C.2), proves
that there exists some c ą 0 independent of n such that whenever β ď c n´1 , all
critical points of Eβ except synchronized ones are strict saddle points. □
Remark C.1. We comment on the extension of the above proof to the dynamics (SA). The point of contention is (C.1), since the metric with respect to which the gradient and Hessian of E_0 are taken is not the same as that for E_β. Denote the modified metric defined in Section 3.4 by g_β, and the canonical metric by g. For any x ∈ (S^{d−1})^n and v ∈ T_x(S^{d−1})^n we have

DE_β(x)[v] = g_β(∇_{g_β} E_β(x), v),

but also DE_β(x)[v] = g(∇_g E_β(x), v). By virtue of the explicit form of g_β and E_β, as well as (C.1), we gather that

(C.3)    g_β(∇_{g_β} E_β(x), v) = g(∇_g E_0(x), v) + O(β),

which implies that any sequence of critical points of E_β converges to a critical point of E_0. Similarly, since Hess_{g_β} E_β(x)[v] = D(∇_{g_β} E_β(x))[v], we find

(C.4)    Hess_{g_β} E_β(x)[v] = Hess_g E_0(x)[v] + O(β).

We can then repeat the argument in the proof above, replacing (C.1) by (C.3) and (C.4).

Appendix D. Proof of Theorem 6.9

Proof. We focus on the dynamics (SA), since the proof for (USA) follows from very similar computations.

Step 1. The flow map is Lipschitz. We begin by showing that the trajectories satisfy a Lipschitz property with respect to the initial data. To this end, let (x_i(·))_{i∈[n]} ∈ C⁰(ℝ_{≥0}; (S^{d−1})^n) and (y_i(·))_{i∈[n]} ∈ C⁰(ℝ_{≥0}; (S^{d−1})^n) be two solutions to the Cauchy problem for (SA), associated to data (x_i(0))_{i∈[n]} and (y_i(0))_{i∈[n]} respectively. For any i ∈ [n] and t ≥ 0, we have

(D.1)    x_i(t) − y_i(t) = x_i(0) − y_i(0)
            + ∫₀ᵗ Σ_{j=1}^n ( e^{β⟨x_i(s), x_j(s)⟩} / Σ_{k=1}^n e^{β⟨x_i(s), x_k(s)⟩} ) ( x_j(s) − ⟨x_i(s), x_j(s)⟩ x_i(s) ) ds
            − ∫₀ᵗ Σ_{j=1}^n ( e^{β⟨y_i(s), y_j(s)⟩} / Σ_{k=1}^n e^{β⟨y_i(s), y_k(s)⟩} ) ( y_j(s) − ⟨y_i(s), y_j(s)⟩ y_i(s) ) ds.

We see that

(D.2)    ‖ ∫₀ᵗ Σ_{j=1}^n ( e^{β⟨x_i(s), x_j(s)⟩} / Σ_{k=1}^n e^{β⟨x_i(s), x_k(s)⟩} ) ( x_j(s) − y_j(s) ) ds ‖ ≤ ∫₀ᵗ max_{j∈[n]} ‖x_j(s) − y_j(s)‖ ds.

On the other hand, since the softmax function with parameter β is β-Lipschitz (with respect to the Euclidean norm), we also get

(D.3)    ‖ ∫₀ᵗ Σ_{j=1}^n ( e^{β⟨x_i(s), x_j(s)⟩} / Σ_{k=1}^n e^{β⟨x_i(s), x_k(s)⟩} − e^{β⟨y_i(s), y_j(s)⟩} / Σ_{k=1}^n e^{β⟨y_i(s), y_k(s)⟩} ) y_j(s) ds ‖
            ≤ β n^{1/2} ∫₀ᵗ ( Σ_{j=1}^n [ ⟨x_i(s), x_j(s)⟩ − ⟨y_i(s), y_j(s)⟩ ]² )^{1/2} ds
            ≤ 2βn ∫₀ᵗ max_{j∈[n]} ‖x_j(s) − y_j(s)‖ ds.

Using (D.2), (D.3), and arguing similarly for the remaining terms in (D.1), we deduce that

‖x_i(t) − y_i(t)‖ ≤ ‖x_i(0) − y_i(0)‖ + 10 max{1, β} n ∫₀ᵗ max_{j∈[n]} ‖x_j(s) − y_j(s)‖ ds.
Maximizing over i ∈ [n] and applying the Grönwall inequality yields

(D.4)    max_{j∈[n]} ‖x_j(t) − y_j(t)‖ ≤ c(β)^{nt} max_{j∈[n]} ‖x_j(0) − y_j(0)‖

for any t ≥ 0, where c(β) = e^{10 max{1, β}} is as in Theorem 6.9.

Step 2. Almost orthogonality. Let x_1(0), …, x_n(0) ∈ S^{d−1} be the random i.i.d. initial points. We prove that with high probability, there exist n pairwise orthogonal points y_1(0), …, y_n(0) ∈ S^{d−1} such that for any i ∈ [n],

(D.5)    ‖x_i(0) − y_i(0)‖ ≤ √(log d / d).

To this end, we take y_1(0) = x_1(0) and then construct the other points y_i(0) by induction. Assume that y_1(0), …, y_i(0) have been constructed for some i ∈ [n], using only knowledge of the points x_1(0), …, x_i(0). Then by Lévy's concentration of measure, since x_{i+1}(0) is independent of x_1(0), …, x_i(0) and uniformly distributed on S^{d−1},

P( dist( x_{i+1}(0), span{y_1(0), …, y_i(0)}^⊥ ) ≤ √(log d / d) ) ≥ 1 − 4i d^{−1/64}.

Using the union bound, we gather that the event

A_0 = { (D.5) is satisfied for any i ∈ [n] }

has probability at least p_0 = 1 − 2n² d^{−1/64}. We now consider the event

A = A_0 ∩ { for some C, λ > 0, (6.1) holds for any i ∈ [n] and t ≥ 0 },

which, since d ≥ n and thus the second event has probability 1, also holds with probability at least p_0 = 1 − 2n² d^{−1/64}. For the remainder of the proof, we assume that A is satisfied.

Step 3. Proof of (6.11). Let (y_i(·))_{i∈[n]} ∈ C⁰(ℝ_{≥0}; (S^{d−1})^n) denote the unique solution to the Cauchy problem for (SA) corresponding to the initial datum (y_i(0))_{i∈[n]}. A combination of (D.4) and (D.5) yields

(D.6)    ‖x_i(t) − y_i(t)‖ ≤ c(β)^{nt} √(log d / d)

for any i ∈ [n] and t ≥ 0, under A. Combining (D.6) with Theorem 6.8, we obtain

(D.7)    |⟨x_i(t), x_j(t)⟩ − γ_β(t)| ≤ 2 c(β)^{nt} √(log d / d)

for any i ≠ j and t ≥ 0, under A.

We turn to the proof of the second part of (6.11). For this, we prove that for large times t, both γ_β(t) and ⟨x_i(t), x_j(t)⟩ are necessarily close to 1. We first show that

(D.8)    1 − γ_β(t) ≤ (1/2) exp( n²e^β / (2(n + e^{β/2})) − nt / (n + e^{β/2}) )
for any t ≥ 0. To this end, we notice that t ↦ γ_β(t) is increasing, and thus γ_β(t) ≥ 0, as well as γ̇_β(t) ≥ 1/(ne^β) as long as γ_β(t) ≤ 1/2. Therefore

γ_β(ne^β/2) ≥ 1/2.

We deduce that for t ≥ ne^β/2,

γ̇_β(t) ≥ n(1 − γ_β(t)) / (n + e^{β/2}).

Integrating this inequality from ne^β/2 to t, we obtain (D.8). We now set d*(n, β) ≥ n such that

(D.9)    d / log d ≥ 16 c(β)² / γ_β(1/n)²

holds for any d ≥ d*(n, β). According to Lemma 6.4, since A is satisfied, there exists x* ∈ S^{d−1} such that x_i(t) → x* for any i ∈ [n] as t → +∞. We set

α(t) := min_{i∈[n]} ⟨x_i(t), x*⟩,

and prove that

(D.10)    1 − α(t) ≤ exp( (1 − γ_β(1/n) t) / (2ne^{2β}) ).

To this end, let us first prove that

(D.11)    α(1/n) ≥ (1/2) γ_β(1/n).

From Step 2 in the proof of Lemma 6.4, we gather that x* lies in the convex cone generated by the points x_1(t), …, x_n(t) for any t > 0, so the decomposition (6.4) holds. Taking the inner product of x_i(1/n) with the decomposition (6.4) at time t = 1/n, we get

α(1/n) ≥ min_{(i,j)∈[n]²} ⟨x_i(1/n), x_j(1/n)⟩ ≥ γ_β(1/n) − 2c(β) √(log d / d) ≥ (1/2) γ_β(1/n),

where the second inequality comes from (D.6) evaluated at time t = 1/n, and the last inequality comes from (D.9). This is precisely (D.11). Using the notation a_{ij}(t) = Z_{β,i}(t)⁻¹ e^{β⟨x_i(t), x_j(t)⟩}, as in the proof of Lemma 6.4, we now find

(D.12)    α̇(t) = ⟨ẋ_{i(t)}(t), x*⟩ ≥ Σ_{j=1}^n a_{i(t)j}(t) (1 − ⟨x_{i(t)}(t), x_j(t)⟩) α(t)

for one of the indices i(t) ∈ [n] achieving the minimum in the definition of α(t). Combining this with (D.11), we gather that α(t) ≥ α(1/n) for t ≥ 1/n. But

(D.13)    min_{j∈[n]} ⟨x_{i(t)}(t), x_j(t)⟩ ≤ Σ_{k=1}^n θ_k(t) ⟨x_{i(t)}(t), x_k(t)⟩ = ⟨x_{i(t)}(t), x*⟩ = α(t).

Plugging (D.13) into (D.12) and using a_{ij}(t) ≥ n⁻¹ e^{−2β}, we get

(D.14)    α̇(t) ≥ (1/(ne^{2β})) α(1/n) (1 − α(t))

for t ≥ 1/n. Integrating (D.14) from 1/n to t, we get (D.10). We therefore deduce from (D.10) that

⟨x_i(t), x_j(t)⟩ ≥ 1 − exp( (1 − γ_β(1/n) t) / (2ne^{2β}) )

holds for any distinct i, j ∈ [n]. Together with (D.8), we then get

(D.15)    |⟨x_i(t), x_j(t)⟩ − γ_β(t)| ≤ exp( (1 − γ_β(1/n) t) / (2ne^{2β}) ) + (1/2) exp( n²e^β / (2(n + e^{β/2})) − nt / (n + e^{β/2}) ).

Finally, combining (D.7) and (D.15), we obtain (6.11). □
Remark D.1. An analogous statement to Theorem 6.9 holds for (USA), where γ_β is instead the unique solution to (6.10). More concretely, Step 1 in the proof is only slightly changed: the constant one obtains in the analogue of (6.11) is instead c(β)^{nt} with c(β) = e^{10βe^β}. Step 2 remains unchanged. In Step 3, (D.8) is replaced by γ_β(n/2) ≥ 1/2 and

1 − γ_β(t) ≤ (1/2) exp( −e^{β/2} (t − n/2) ).

The rest of the proof then remains essentially unchanged.

References
[ABK` 22] Pedro Abdalla, Afonso S Bandeira, Martin Kassabov, Victor Souza, Steven H Stro-
gatz, and Alex Townsend. Expander graphs are globally synchronising. arXiv preprint
arXiv:2210.12788, 2022.
[ABV` 05] Juan A Acebrón, Luis L Bonilla, Conrad J Pérez Vicente, Félix Ritort, and Renato
Spigler. The Kuramoto model: A simple paradigm for synchronization phenomena.
Reviews of Modern Physics, 77(1):137, 2005.
[ACDS23] Kwangjun Ahn, Xiang Cheng, Hadi Daneshmand, and Suvrit Sra. Transformers learn
to implement preconditioned gradient descent for in-context learning. Advances in
Neural Information Processing Systems, 36, 2023.
[ADTK23] Silas Alberti, Niclas Dern, Laura Thesing, and Gitta Kutyniok. Sumformer: Universal
Approximation for Efficient Transformers. In Topological, Algebraic and Geometric
Learning Workshops 2023, pages 72–86. PMLR, 2023.
[AG24] Daniel Owusu Adu and Bahman Gharesifard. Approximate controllability of conti-
nuity equation of transformers. IEEE Control Systems Letters, 2024.
[AGS05] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces
and in the space of probability measures. Springer Science & Business Media, 2005.
[AL24] Andrei Agrachev and Cyril Letrouit. Generic controllability of equivariant sys-
tems and applications to particle systems and neural networks. arXiv preprint
arXiv:2404.08289, 2024.
[AS22] Andrei Agrachev and Andrey Sarychev. Control on the manifolds of mappings with
a view to the deep learning. Journal of Dynamical and Control Systems, 28(4):989–
1008, 2022.
[Bar93] Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal
function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
[BB00] Jean-David Benamou and Yann Brenier. A computational fluid mechanics solution to
the Monge-Kantorovich mass transfer problem. Numerische Mathematik, 84(3):375–
393, 2000.
[BBC+09] Brandon Ballinger, Grigoriy Blekherman, Henry Cohn, Noah Giansiracusa, Eliza-
beth Kelly, and Achill Schürmann. Experimental study of energy-minimizing point
configurations on spheres. Experimental Mathematics, 18(3):257–283, 2009.
[BCM08] Adrien Blanchet, José A Carrillo, and Nader Masmoudi. Infinite time aggregation for
the critical Patlak-Keller-Segel model in $\mathbb{R}^2$. Communications on Pure and Applied
Mathematics, 61(10):1449–1481, 2008.
[BCM15] Dario Benedetto, Emanuele Caglioti, and Umberto Montemagno. On the complete
phase synchronization for the Kuramoto model in the mean-field limit. Communica-
tions in Mathematical Sciences, 13(7):1775–1786, 2015.
[BD19] Dmitriy Bilyk and Feng Dai. Geodesic distance Riesz energy on the sphere. Trans-
actions of the American Mathematical Society, 372(5):3141–3166, 2019.
[BHK24] Han Bao, Ryuichiro Hataya, and Ryo Karakida. Self-attention networks localize when
qk-eigenspectrum concentrates. arXiv preprint arXiv:2402.02098, 2024.
[BKH16] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv
preprint arXiv:1607.06450, 2016.
[BLG14] Emmanuel Boissard and Thibaut Le Gouic. On the mean speed of convergence of
empirical and occupation measures in Wasserstein distance. In Annales de l’IHP
Probabilités et statistiques, volume 50, pages 539–563, 2014.
[BLR11] Andrea L Bertozzi, Thomas Laurent, and Jesús Rosado. $L^p$ theory for the multidi-
mensional aggregation equation. Communications on Pure and Applied Mathematics,
64(1):45–83, 2011.
[CCH+14] José A Carrillo, Young-Pil Choi, Seung-Yeal Ha, Moon-Jin Kang, and Yongduck Kim.
Contractivity of transport distances for the kinetic Kuramoto equation. Journal of
Statistical Physics, 156(2):395–415, 2014.
[CD22] Louis-Pierre Chaintron and Antoine Diez. Propagation of chaos: a review of models,
methods and applications. II. Applications. 2022.
[CDF+11] J. A. Carrillo, M. DiFrancesco, A. Figalli, T. Laurent, and D. Slepčev. Global-in-time
weak measure solutions and finite-time aggregation for nonlocal interaction equations.
Duke Mathematical Journal, 156(2):229–271, 2011.
[Chi15] Hayato Chiba. A proof of the Kuramoto conjecture for a bifurcation structure of
the infinite-dimensional Kuramoto model. Ergodic Theory and Dynamical Systems,
35(3):762–834, 2015.
[CJLZ21] José A Carrillo, Shi Jin, Lei Li, and Yuhua Zhu. A consensus-based global opti-
mization method for high dimensional machine learning problems. ESAIM: Control,
Optimisation and Calculus of Variations, 27:S5, 2021.
[CK07] Henry Cohn and Abhinav Kumar. Universally optimal distribution of points on
spheres. Journal of the American Mathematical Society, 20(1):99–148, 2007.
[CKM+22] Henry Cohn, Abhinav Kumar, Stephen Miller, Danylo Radchenko, and Maryna Via-
zovska. Universal optimality of the $E_8$ and Leech lattices and interpolation formulas.
Annals of Mathematics, 196(3):983–1082, 2022.
[CLLS23] Jingpu Cheng, Qianxiao Li, Ting Lin, and Zuowei Shen. Interpolation, approximation
and controllability of deep neural networks. arXiv preprint arXiv:2309.06015, 2023.
[CLP15] Marco Caponigro, Anna Chiara Lai, and Benedetto Piccoli. A nonlinear model of
opinion formation on the sphere. Discrete & Continuous Dynamical Systems-A,
35(9):4241–4268, 2015.
[CLT20] Christa Cuchiero, Martin Larsson, and Josef Teichmann. Deep neural networks,
generic universal interpolation, and controlled ODEs. SIAM Journal on Mathematics
of Data Science, 2(3):901–919, 2020.
[CNQG24] Aditya Cowsik, Tamra Nebabu, Xiao-Liang Qi, and Surya Ganguli. Geometric dy-
namics of signal propagation predict trainability of transformers. arXiv preprint
arXiv:2403.02579, 2024.
[CNWR24] Sinho Chewi, Jonathan Niles-Weed, and Philippe Rigollet. Statistical optimal trans-
port. arXiv preprint arXiv:2407.18163, 2024.
[CRBD18] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural
ordinary differential equations. Advances in Neural Information Processing Systems,
31, 2018.
[CRMB24] Christopher Criscitiello, Quentin Rebjock, Andrew D. McRae, and Nicolas Boumal.
Synchronization on circles and spheres with nonlinear interactions, 2024.
[CS07] Felipe Cucker and Steve Smale. Emergent behavior in flocks. IEEE Transactions on
Automatic Control, 52(5):852–862, 2007.
[Cyb89] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathe-
matics of Control, Signals and Systems, 2(4):303–314, 1989.
[CZC+22] Tianlong Chen, Zhenyu Zhang, Yu Cheng, Ahmed Awadallah, and Zhangyang Wang.
The principle of diversity: Training stronger vision transformers calls for reducing
all levels of redundancy. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 12020–12030, 2022.
[Dai92] Hiroaki Daido. Order function and macroscopic mutual entrainment in uniformly cou-
pled limit-cycle oscillators. Progress of Theoretical Physics, 88(6):1213–1218, 1992.
[DBK24] Gbètondji JS Dovonon, Michael M Bronstein, and Matt J Kusner. Setting the record
straight on transformer oversmoothing. arXiv preprint arXiv:2401.04301, 2024.
[DBPC19] Gwendoline De Bie, Gabriel Peyré, and Marco Cuturi. Stochastic deep networks. In
International Conference on Machine Learning, pages 1556–1565. PMLR, 2019.
[DCL21] Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. Attention is not all you
need: Pure attention loses rank doubly exponentially with depth. In International
Conference on Machine Learning, pages 2793–2803. PMLR, 2021.
[DFGV18] Helge Dietert, Bastien Fernandez, and David Gérard-Varet. Landau damping to par-
tially locked states in the Kuramoto model. Communications on Pure and Applied
Mathematics, 71(5):953–993, 2018.
[DGCC21] Subhabrata Dutta, Tanya Gautam, Soumen Chakrabarti, and Tanmoy Chakraborty.
Redesigning the transformer architecture with insights from multi-particle dynamical
systems. Advances in Neural Information Processing Systems, 34:5531–5544, 2021.
[DGS91] Philippe Delsarte, Jean-Marie Goethals, and Johan Jacob Seidel. Spherical codes and
designs. In Geometry and Combinatorics, pages 68–93. Elsevier, 1991.
[DGTT24] Puneesh Deora, Rouzbeh Ghaderi, Hossein Taheri, and Christos Thrampoulidis. On
the optimization and generalization of multi-head attention. Transactions on Ma-
chine Learning Research, 2024.
[Dob79] Roland L’vovich Dobrushin. Vlasov equations. Funktsional’nyi Analiz i ego
Prilozheniya, 13(2):48–58, 1979.
[Dud69] R. M. Dudley. The Speed of Mean Glivenko-Cantelli Convergence. The Annals of
Mathematical Statistics, 40(1):40–50, 1969.
[DX13] Feng Dai and Yuan Xu. Approximation theory and harmonic analysis on spheres
and balls. Springer, 2013.
[E17] Weinan E. A proposal on machine learning via dynamical systems. Communications
in Mathematics and Statistics, 5(1):1–11, 2017.
[FdHP24] Takashi Furuya, Maarten V de Hoop, and Gabriel Peyré. Transformers are universal
in-context learners. arXiv preprint arXiv:2408.01367, 2024.
[FGVG16] Bastien Fernandez, David Gérard-Varet, and Giambattista Giacomin. Landau damp-
ing in the Kuramoto model. In Annales Henri Poincaré, volume 17, pages 1793–1823.
Springer, 2016.
[FHPS21] Massimo Fornasier, Hui Huang, Lorenzo Pareschi, and Philippe Sünnen. Consensus-
based optimization on the sphere: Convergence to global minimizers and machine
learning. The Journal of Machine Learning Research, 22(1):10722–10776, 2021.
[FL19] Amic Frouvelle and Jian-Guo Liu. Long-time dynamics for a simple aggregation
equation on the sphere. In Stochastic Dynamics Out of Equilibrium: Institut Henri
Poincaré, Paris, France, 2017, pages 457–479. Springer, 2019.
[FZH+22] Ruili Feng, Kecheng Zheng, Yukun Huang, Deli Zhao, Michael Jordan, and Zheng-
Jun Zha. Rank diminishing in deep neural networks. Advances in Neural Information
Processing Systems, 35:33054–33065, 2022.
[GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press,
Cambridge, MA, 2016.
[GBM21] Arnaud Guillin, Pierre Le Bris, and Pierre Monmarché. Uniform in time propagation
of chaos for the 2d vortex model and other singular stochastic systems. arXiv preprint
arXiv:2108.08675, 2021.
[GLPR24] Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. The emer-
gence of clusters in self-attention dynamics. Advances in Neural Information Process-
ing Systems, 36, 2024.
[Gol16] François Golse. On the dynamics of large particle systems in the mean field limit.
Macroscopic and large scale phenomena: coarse graining, mean field limits and er-
godicity, pages 1–144, 2016.
[GSRT13] Isabelle Gallagher, Laure Saint-Raymond, and Benjamin Texier. From Newton to
Boltzmann: hard spheres and short-range potentials. European Mathematical Society
Zürich, Switzerland, 2013.
[GWDW23] Xiaojun Guo, Yifei Wang, Tianqi Du, and Yisen Wang. Contranorm: A contrastive
learning perspective on oversmoothing and beyond. In The Eleventh International
Conference on Learning Representations, 2023.
[GZ22] Borjan Geshkovski and Enrique Zuazua. Turnpike in optimal control of PDEs,
ResNets, and beyond. Acta Numerica, 31:135–263, 2022.
[HHL23] Jiequn Han, Ruimeng Hu, and Jihao Long. A class of dimension-free metrics for
the convergence of empirical measures. Stochastic Processes and their Applications,
164:242–287, 2023.
[HK02] Rainer Hegselmann and Ulrich Krause. Opinion dynamics and bounded confidence:
models, analysis and simulation. Journal of Artifical Societies and Social Simulation
(JASSS), 5(3), 2002.
[HKPZ16] Seung-Yeal Ha, Dongnam Ko, Jinyeong Park, and Xiongtao Zhang. Collective syn-
chronization of classical and quantum oscillators. EMS Surveys in Mathematical Sci-
ences, 3(2):209–267, 2016.
[HKR18] Seung-Yeal Ha, Dongnam Ko, and Sang Woo Ryoo. On the relaxation dynamics
of Lohe oscillators on some Riemannian manifolds. Journal of Statistical Physics,
172:1427–1478, 2018.
[HR17] Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. In-
verse problems, 34(1), 2017.
[HR20] Seung-Yeal Ha and Seung-Yeon Ryoo. Asymptotic phase-locking dynamics and crit-
ical coupling strength for the Kuramoto model. Communications in Mathematical
Physics, 377(2):811–857, 2020.
[HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep
residual networks. In Computer Vision–ECCV 2016: 14th European Conference,
Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages
630–645. Springer, 2016.
[JDB23] Amir Joudaki, Hadi Daneshmand, and Francis Bach. On the impact of activation
and normalization in obtaining isometric embeddings at initialization. Advances in
Neural Information Processing Systems, 36:39855–39875, 2023.
[JKO98] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of
the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17,
1998.
[JL23] Haotian Jiang and Qianxiao Li. Approximation theory of transformer networks for
sequence modeling. arXiv preprint arXiv:2305.18475, 2023.
[JLLW23] Haotian Jiang, Qianxiao Li, Zhong Li, and Shida Wang. A brief survey on the ap-
proximation theory for sequence modelling. arXiv preprint arXiv:2302.13752, 2023.
[JM14] Pierre-Emmanuel Jabin and Sebastien Motsch. Clustering and asymptotic behavior
in opinion formation. Journal of Differential Equations, 257(11):4165–4187, 2014.
[KO02] Robert V Kohn and Felix Otto. Upper bounds on coarsening rates. Communications
in Mathematical Physics, 229(3):375–395, 2002.
[Kra00] Ulrich Krause. A discrete nonlinear and non-autonomous model of consensus. In
Communications in Difference Equations: Proceedings of the Fourth International
Conference on Difference Equations, page 227. CRC Press, 2000.
[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with
deep convolutional neural networks. Advances in Neural Information Processing Sys-
tems, 25, 2012.
[Kur75] Yoshiki Kuramoto. Self-entrainment of a population of coupled non-linear oscillators.
In International Symposium on Mathematical Problems in Theoretical Physics: Jan-
uary 23–29, 1975, Kyoto University, Kyoto/Japan, pages 420–422. Springer, 1975.
[Lac23] Daniel Lacker. Hierarchies, entropy, and quantitative propagation of chaos for mean
field diffusions. Probability and Mathematical Physics, 4(2):377–432, 2023.
[LCG+20] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma,
and Radu Soricut. ALBERT: A Lite BERT for Self-supervised Learning of Language
Representations. In International Conference on Learning Representations, 2020.
[LCT18] Qianxiao Li, Long Chen, and Cheng Tai. Maximum principle based algorithms for
deep learning. Journal of Machine Learning Research, 18:1–29, 2018.
[Li21] Wuchen Li. Hessian metric via transport information geometry. Journal of Mathe-
matical Physics, 62(3), 03 2021.
[LJ18] Hongzhou Lin and Stefanie Jegelka. Resnet with one-neuron hidden layers is a uni-
versal approximator. Advances in Neural Information Processing Systems, 31, 2018.
[LLF23] Daniel Lacker and Luc Le Flem. Sharp uniform-in-time propagation of chaos. Prob-
ability Theory and Related Fields, pages 1–38, 2023.
[LLH+20] Yiping Lu, Zhuohan Li, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, and
Tie-Yan Liu. Understanding and improving transformer from a multi-particle dy-
namic system point of view. In International Conference on Learning Representa-
tions, 2020.
[LLS22] Qianxiao Li, Ting Lin, and Zuowei Shen. Deep learning via dynamical systems:
An approximation perspective. Journal of the European Mathematical Society,
25(5):1671–1709, 2022.
[Loj63] Stanislaw Lojasiewicz. Une propriété topologique des sous-ensembles analytiques
réels. Les équations aux dérivées partielles, 117:87–89, 1963.
[LWLQ22] Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transform-
ers. AI Open, 2022.
[LXB19] Shuyang Ling, Ruitu Xu, and Afonso S Bandeira. On the landscape of synchro-
nization networks: A perspective from nonconvex optimization. SIAM Journal on
Optimization, 29(3):1879–1907, 2019.
[Mis24] MistralAI. https://github.com/mistralai/mistral-finetune/blob/main/model/
transformer.py, 2024.
[MT14] Sebastien Motsch and Eitan Tadmor. Heterophilious dynamics enhances consensus.
SIAM Review, 56(4):577–621, 2014.
[MTG17] Johan Markdahl, Johan Thunberg, and Jorge Gonçalves. Almost global consensus on
the n-sphere. IEEE Transactions on Automatic Control, 63(6):1664–1675, 2017.
[NAB+22] Lorenzo Noci, Sotiris Anagnostidis, Luca Biggio, Antonio Orvieto, Sidak Pal Singh,
and Aurelien Lucchi. Signal propagation in transformers: Theoretical perspectives
and the role of rank collapse. Advances in Neural Information Processing Systems,
35:27198–27211, 2022.
[NLL+24] Lorenzo Noci, Chuning Li, Mufan Li, Bobby He, Thomas Hofmann, Chris J Maddi-
son, and Dan Roy. The shaped transformer: Attention models in the infinite depth-
and-width limit. Advances in Neural Information Processing Systems, 36, 2024.
[Ope24] OpenAI. https://github.com/openai/gpt-2/blob/master/src/model.py, 2024.
[OR07] Felix Otto and Maria G Reznikoff. Slow motion of gradient flows. Journal of Differ-
ential Equations, 237(2):372–420, 2007.
[Ott01] Felix Otto. The geometry of dissipative evolution equations: the porous medium
equation. Communications in Partial Differential Equations, 26(1-2):101–174, 2001.
[PH22] Mary Phuong and Marcus Hutter. Formal algorithms for transformers. arXiv preprint
arXiv:2207.09238, 2022.
[PTTM17] René Pinnau, Claudia Totzeck, Oliver Tse, and Stephan Martin. A consensus-based
model for global optimization and its mean-field limit. Mathematical Models and
Methods in Applied Sciences, 27(01):183–204, 2017.
[RBZ23] Domenec Ruiz-Balet and Enrique Zuazua. Neural ODE control for classification,
approximation, and transport. SIAM Review, 65(3):735–773, 2023.
[RS23] Matthew Rosenzweig and Sylvia Serfaty. Global-in-time mean-field convergence for
singular Riesz-type diffusive flows. The Annals of Applied Probability, 33(2):954–998,
2023.
[RZZD23] Lixiang Ru, Heliang Zheng, Yibing Zhan, and Bo Du. Token contrast for weakly-
supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 3093–3102, 2023.
[SABP22] Michael E Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. Sinkformers:
Transformers with doubly stochastic attention. In International Conference on Ar-
tificial Intelligence and Statistics, pages 3515–3530. PMLR, 2022.
[Ser20] Sylvia Serfaty. Mean field limit for Coulomb-type flows. Duke Mathematical Journal,
169(15), 2020.
[SFG+12] Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf,
and Gert R. G. Lanckriet. On the empirical estimation of integral probability metrics.
Electronic Journal of Statistics, 6:1550–1599, 2012.
[Shu13] Michael Shub. Global stability of dynamical systems. Springer Science & Business
Media, 2013.
[Str00] Steven H Strogatz. From Kuramoto to Crawford: exploring the onset of synchro-
nization in populations of coupled oscillators. Physica D: Nonlinear Phenomena,
143(1-4):1–20, 2000.
[SWJS24] Michael Scholkemper, Xinyi Wu, Ali Jadbabaie, and Michael Schaub. Residual
connections and normalization can provably prevent oversmoothing in gnns. arXiv
preprint arXiv:2406.02997, 2024.
[Sze39] Gabor Szegö. Orthogonal polynomials, volume 23. American Mathematical Soc.,
1939.
[Tad23] Eitan Tadmor. Swarming: hydrodynamic alignment with pressure. Bulletin of the
American Mathematical Society, 60(3):285–325, 2023.
[Tan17] Yan Shuo Tan. Energy optimization for distributions on the sphere and improvement
to the Welch bounds. Electronic Communications in Probability, 22:1–12,
2017.
[Tay12] Richard Taylor. There is no non-zero stable fixed point for dense networks in the
homogeneous Kuramoto model. Journal of Physics A: Mathematical and Theoretical,
45(5):055102, 2012.
[TG22] Paulo Tabuada and Bahman Gharesifard. Universal approximation power of deep
residual neural networks through the lens of control. IEEE Transactions on Auto-
matic Control, 2022.
[TLTO23] Davoud Ataee Tarzanagh, Yingcong Li, Christos Thrampoulidis, and Samet Oymak.
Transformers as support vector machines. Advances in Neural Information Process-
ing Systems, 36, 2023.
[TSS20] Alex Townsend, Michael Stillman, and Steven H Strogatz. Dense networks that do
not synchronize and sparse ones that do. Chaos: An Interdisciplinary Journal of
Nonlinear Science, 30(8), 2020.
[VBC20] James Vuckovic, Aristide Baratin, and Remi Tachet des Combes. A mathematical
theory of attention. arXiv preprint arXiv:2007.02876, 2020.
[VCBJ+95] Tamás Vicsek, András Czirók, Eshel Ben-Jacob, Inon Cohen, and Ofer Shochet.
Novel type of phase transition in a system of self-driven particles. Physical Review
Letters, 75(6):1226, 1995.
[Ver18] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications
in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics.
Cambridge University Press, 2018.
[Vil01] Cédric Villani. Limite de champ moyen. Cours de DEA, 2002:49, 2001.
[Vil09] Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009.
[VR23] Tanya Veeravalli and Maxim Raginsky. Nonlinear controllability and function repre-
sentation by neural stochastic differential equations. In Learning for Dynamics and
Control Conference, pages 838–850. PMLR, 2023.
[VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in
Neural Information Processing Systems, 30, 2017.
[WAW+24] Xinyi Wu, Amir Ajorlou, Yifei Wang, Stefanie Jegelka, and Ali Jadbabaie. On the role
of attention masks and layernorm in transformers. arXiv preprint arXiv:2405.18781,
2024.
[WAWJ24] Xinyi Wu, Amir Ajorlou, Zihui Wu, and Ali Jadbabaie. Demystifying oversmoothing
in attention-based graph neural networks. Advances in Neural Information Process-
ing Systems, 36, 2024.
[Wen62] James G Wendel. A problem in geometric probability. Mathematica Scandinavica,
11(1):109–111, 1962.
[XZ23] Tong Xiao and Jingbo Zhu. Introduction to Transformers: an NLP Perspective. arXiv
preprint arXiv:2311.17633, 2023.
[YBR+19] Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank Reddi, and Sanjiv
Kumar. Are transformers universal approximators of sequence-to-sequence functions?
In International Conference on Learning Representations, 2019.
[ZB21] Aaron Zweig and Joan Bruna. A functional perspective on learning symmetric func-
tions with neural networks. In International Conference on Machine Learning, pages
13023–13032. PMLR, 2021.
[ZGUA20] Han Zhang, Xi Gao, Jacob Unterman, and Tom Arodz. Approximation capabilities
of neural ODEs and invertible residual networks. In International Conference on
Machine Learning, pages 11086–11095. PMLR, 2020.
[ZLL+23] Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Rama-
puram, Yizhe Zhang, Jiatao Gu, and Joshua M Susskind. Stabilizing transformer
training by preventing attention entropy collapse. In International Conference on
Machine Learning, pages 40770–40803. PMLR, 2023.
[ZMZ+23] Haiteng Zhao, Shuming Ma, Dongdong Zhang, Zhi-Hong Deng, and Furu Wei. Are
more layers beneficial to graph transformers? In The Eleventh International Confer-
ence on Learning Representations, 2023.
[ZS19] Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in
Neural Information Processing Systems, 32, 2019.

Department of Mathematics, Massachusetts Institute of Technology, 77 Massachusetts Ave, 02139 Cambridge MA, USA
Email address: borjan@mit.edu

CNRS & Université Paris-Saclay, Laboratoire de mathématiques d'Orsay, 307 rue Michel Magat, Bâtiment 307, 91400 Orsay, France
Email address: cyril.letrouit@universite-paris-saclay.fr

Department of EECS, Massachusetts Institute of Technology, 77 Massachusetts Ave, 02139 Cambridge MA, USA
Email address: yp@mit.edu

Department of Mathematics, Massachusetts Institute of Technology, 77 Massachusetts Ave, 02139 Cambridge MA, USA
Email address: rigollet@math.mit.edu
