
Gottfried Wilhelm Leibniz Universität Hannover

Institut für Theoretische Informatik

Theoretical Foundations of Transformer Networks

Master’s Thesis

Kai Christian Hallmann


Matriculation Number: 10019681
Hannover, 2024-04-26

First Examiner: Prof. Dr. rer. nat. Heribert Vollmer


Second Examiner: PD Dr. rer. nat. habil. Arne Meier
Advisor: Laura Strieker, M. Sc.
Declaration of Independent Work
I hereby declare that I have written this master's thesis independently and have used no sources or aids other than those indicated. This thesis has not been submitted in the same or a similar form to any other examination authority.

Hannover, 2024-04-26

Kai Christian Hallmann


Contents
1 Introduction 5

2 Transformer Networks 6
2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Single Attention . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Multi-Head Attention . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Position-wise Feed-Forward Networks . . . . . . . . . . . . . . . 8
2.2.4 Encoder and Decoder . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.5 Embedding and Softmax . . . . . . . . . . . . . . . . . . . . . . 9
2.2.6 Positional Encoding . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.7 Further Developments in Transformer Network Design . . . . . 12
2.3 Mathematical Description . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Datatypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2 Transformer Networks . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Types of Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Language Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3 Circuit Complexity 17
3.1 Circuits and Circuit Families . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Circuit Complexity Classes . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Uniformity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4 Logic 21
4.1 First Order Logic with Counting . . . . . . . . . . . . . . . . . . . . . 21
4.2 First Order Logic with Majority . . . . . . . . . . . . . . . . . . . . . . 22

5 Complexity Results for Transformers 24


5.1 Hard Attention Transformer Networks are in AC0 . . . . . . . . . . . . 24
5.1.1 Definition of Generalized Transformer Networks . . . . . . . . . 24
5.1.2 A Normal Form for Generalized Hard Attention Transformer
Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.3 From GUHAT to Circuits . . . . . . . . . . . . . . . . . . . . . 27

5.2 Universality of Saturated Transformer Networks . . . . . . . . . . . . . 29
5.3 Saturated Attention Transformer Networks are not in AC0 . . . . . . . 31
5.4 Saturated Attention Transformer Networks are in TC0 . . . . . . . . . 32
5.5 Fixed Precision Transformer Networks are in FOC[+; MOD] . . . . . . 35
5.6 log-Precision Transformer Networks are in Uniform TC0 . . . . . . . . 37
5.6.1 Circuit Serialization . . . . . . . . . . . . . . . . . . . . . . . . 39
5.6.2 Uniformity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.6.3 Transformer Network Precision and Space . . . . . . . . . . . . 41
5.6.4 p-Precision Transformer Network Definition . . . . . . . . . . . 42
5.6.5 log-Precision Transformer Networks as nonuniform Threshold
Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.6.6 log-Precision Transformer Networks as Uniform Threshold Circuits 45
5.7 Lower Bounds for Instruction Following and Advice Transformers . . . 48
5.7.1 Circuit Value Problem . . . . . . . . . . . . . . . . . . . . . . . . 48
5.7.2 Instruction Following . . . . . . . . . . . . . . . . . . . . . . . . 51
5.7.3 Advice Transformers . . . . . . . . . . . . . . . . . . . . . . . . 52

6 Conclusion 53
6.1 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2 Future Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

1 Introduction
Transformer networks are a type of deep neural network that was initially introduced
in 2017 for sequence modeling and transduction, such as language modeling and ma-
chine translation. They have been very successful in these tasks and have brought AI-
based tools such as OpenAI’s ChatGPT (“Chat Generative Pre-trained Transformer”)
into widespread use. What sets them apart from previously established deep neural
networks like long short-term memory and gated recurrent neural networks is that they
employ a mechanism called multi-head attention to be more parallelizable. This results
in faster training times compared to these previous network architectures [VSP+ 17].
In this work, we give an introduction to how transformer networks are constructed
and how they work. We then continue by viewing them from a more theoretical point of
view as recognizers of formal languages, in order to compare them to known complexity
classes and reach a better understanding of how powerful they are. In the process,
we present many results and give an overview of how they relate to each other. We
will also see that these results vary drastically based on some assumptions we make
in our theoretical model of transformer networks. However, with the most reasonable
assumptions, we will establish the circuit complexity class log-uniform TC0 as an upper
bound.
Such an upper bound helps answer the question of which problems transformer net-
works are fundamentally unable to solve, at least for sufficiently large inputs. Knowing
this might help decide when to switch to a different approach, rather than spending more
time and computational power on training a transformer network only to find out that it
is still unable to perform the desired task. We will also consider whether it is because of
their high parallelizability that transformer networks are limited in their capabilities.

2 Transformer Networks
We will begin by giving an overview of the components of transformer networks and
how they work, starting from the inside and working our way outward. Next, we will
give a precise mathematical description in section 2.3 that will be required in chapter 5.
In section 2.4, we will introduce different kinds of attention that we will later use for a
theoretical analysis of the complexity of transformer networks. Finally, we will define
how transformer networks can be used to recognize formal languages in section 2.5.
The contents of this chapter are largely based on the paper that introduced trans-
former networks [VSP+ 17], which we will refer to as their origin.

2.1 Notation
For n ∈ N, we will write [n] to mean the set of the first n natural numbers {m ∈ N | m <
n} = {0, 1, . . . , n − 1}, [n]+m to mean the set {m, m + 1, . . . , m + n − 1}, and log(n) to
mean the length of the shortest binary string representing n, so log(n) = ⌈log2 (n + 1)⌉.
We will also write a ≡m b iff a and b are congruent modulo m.

2.2 Basics
Transformer networks heavily rely on a mechanism called attention. Attention was
originally designed as an addition to recurrent neural networks for tasks like machine
translation. Those recurrent neural networks usually consist of an encoder and a de-
coder. The encoder reads in a sequence token by token and modifies its internal hidden
state at each step based on the token. Then the decoder uses the final hidden state to
generate the output tokens.
The problem with this approach is that all information contained in the input se-
quence has to be compressed into the hidden state, which is a vector of fixed length.
As a result, this kind of network can have problems dealing with long input sequences.
The attention mechanism allows the decoder to look back at the input tokens that con-
tain the information most relevant to the token that it is about to generate at each
step. In other words, the decoder decides which parts of the input sequence to pay at-
tention to. Because of this, the encoder no longer has to encode all information into a
fixed-length vector [BCB15, sections 1-3].
As attention mechanisms allow modeling of dependencies regardless of distance
within the input or output sequences, they have become an integral part of recurrent
neural networks for tasks such as sequence modeling. Transformer networks, however,
forgo the recurrence and instead rely solely on attention to draw global dependencies
between input and output. Therefore, they lend themselves more to parallelization.
We will begin by describing the attention mechanism they use.

2.2.1 Single Attention


An attention function can be thought of as mapping a set of queries and a set of key-
value pairs to an output. These names are chosen to evoke a conceptual similarity to a
(series of) lookup(s) in a dictionary data structure. The query is first compared to each
key by a similarity function. Then the output is calculated as the sum of the values
weighted by the corresponding similarities.
Transformer networks usually utilize scaled dot-product soft attention, so we will
begin by describing this type of attention and introduce other types in section 2.4. The
query and the keys are dk -dimensional vectors, and the values and the output are dv -
dimensional vectors. The similarity of the query and a key is evaluated by calculating
their dot product and dividing it by √dk . Next, all of these similarities are turned
into a probability distribution for each key by applying a function called softmax (see
section 2.4) to them. Finally, the output is calculated as the dot product of this
distribution and the values.
In practice, this is done for a set of queries at a time because it is more efficient to
operate with matrices. We will call the number of queries nq and the number of key-
value pairs nk . Now the queries are an nq × dk matrix Q, the keys are an nk × dk matrix
K, the values are a nk × dv matrix V , and the output is a nq × dv matrix which is
calculated as:

    Attention(Q, K, V ) = softmax((Q · K^T) / √dk ) · V
The dot products can be calculated very efficiently due to highly optimized code.
The scaling factor of 1/√dk is used to limit the size of the values passed into the softmax
function to prevent extremely small gradients. Without this scaling, training of a
transformer network might come to a halt in the backpropagation phase due to a
phenomenon known as the vanishing gradient problem [BJZP20, section 1, subsection
2.1].
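As an illustration, the following Python sketch (using numpy; the helper names are our own and not taken from [VSP+ 17]) computes scaled dot-product attention for a set of queries exactly as in the formula above:

    import numpy as np

    def softmax(x, axis=-1):
        # subtract the row-wise maximum for numerical stability
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)      # scaled pairwise similarities
        weights = softmax(scores, axis=-1)   # one distribution per query
        return weights @ V                   # weighted sums of the values, shape (n_q, d_v)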

2.2.2 Multi-Head Attention
Instead of directly performing single attention on the dmodel -dimensional vectors that
are used by the transformer network internally, they are first split up into shorter
vectors. Then the single attention is performed on these shorter vectors in so-called
attention heads, which can be done in parallel. Finally, the results are put together
into a dmodel -dimensional vector.
Splitting up the vectors is done using learned linear projections. For each of the h
heads, these projections are stored as a dmodel ×dk matrix WiQ , a dmodel ×dk matrix WiK ,
and a dmodel × dv matrix WiV . The output is assembled by concatenating the outputs
of the individual heads and performing another learned linear projection. This one is
stored in the (h · dv ) × dmodel matrix W O . Putting it all together, we get:

MultiHead(Q, K, V ) = Concat(head0 , . . . , headh−1 ) · W O


    headi = Attention(Q · Wi^Q , K · Wi^K , V · Wi^V )    for i ∈ [h]

The values for the dimensions that were originally used are dmodel = 512, h = 8, and
dk = dv = dmodel /h = 64.
Splitting the attention between multiple heads has the benefit that each one of them
can evaluate different aspects of the queries or view them in a different context.
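Continuing the sketch from the previous subsection (again with names of our own choosing, and with the projection matrices standing in for the learned parameters), multi-head attention projects the inputs once per head, applies single attention in each head, and concatenates the results:

    def multi_head(Q, K, V, WQ, WK, WV, WO):
        # WQ, WK, WV: per-head projection matrices of shapes (d_model, d_k) and (d_model, d_v),
        # WO: output projection of shape (h * d_v, d_model)
        heads = [attention(Q @ WQ[i], K @ WK[i], V @ WV[i]) for i in range(len(WQ))]
        return np.concatenate(heads, axis=-1) @ WO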

2.2.3 Position-wise Feed-Forward Networks


Transformer networks also make use of fully connected feed-forward networks. These
consist of two linear transformations with a rectified linear unit (ReLU) activation
function in between.

FFN(x) = max(0, x · W1 + b1 ) · W2 + b2

The input and the output have dimension dmodel , while the inner layer has dimension
dff = 2048.
Since the input and output dimension is only the size of one position, the feed-forward
networks are said to be position-wise. Any interaction between different positions only
takes place through the means of attention.
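In code, a position-wise feed-forward network is correspondingly short; in the sketch below (our notation, continuing the numpy examples above), x is the vector at a single position:

    def ffn(x, W1, b1, W2, b2):
        # W1: (d_model, d_ff), W2: (d_ff, d_model); ReLU in between
        return np.maximum(0, x @ W1 + b1) @ W2 + b2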

2.2.4 Encoder and Decoder


Transformer networks have an encoder-decoder structure. The encoder is given a se-
quence of symbol representations (x0 , . . . , xn−1 ) and maps it to a sequence of continu-
ous representation z = (z0 , . . . , zn−1 ). The decoder takes this sequence and outputs an
output sequence (y0 , . . . , yn−1 ) one element at a time. Transformer networks are au-
toregressive, meaning that they take the previously generated symbols as an additional
input when generating the next symbol.
The encoder is composed of a stack of L = 6 identical layers. Each layer consists of
two sub-layers. The first one performs multi-head self-attention. This means that it
uses the input as the Q, K, and V matrices. The second sub-layer is a simple, position-
wise fully connected feed-forward network. The learned parameters of this network
differ from layer to layer, whereas within a layer the same linear transformations are
applied at every position. Additionally, each sub-layer contains a residual connection and
a layer normalization. That is, the ingoing values are added to the outgoing values and
the layers are normalized to be in a desired range, which introduces further learned
parameters. All sub-layers in the model produce vectors of size dmodel .
The decoder is also composed of a stack of L = 6 identical layers. It consists of
the same two sub-layers as the encoder, along with a third one in between them. In
this layer, it once again performs multi-head attention. This time, however, the input
to this sub-layer is only used as the Q matrix, while the output z of the encoder is
used as the K and V matrix. The self-attention sub-layer is also slightly modified to
prevent positions from attending to subsequent positions. This is done by setting the
corresponding values in the matrices to −∞ before applying the softmax function. Just
like the encoder, each sub-layer once again contains a residual connection and a layer
normalization.
A diagram of the structure of the encoder and decoder of a transformer network is
shown in Figure 2.1.

2.2.5 Embedding and Softmax


Additionally, a learned embedding is used to convert the input and output tokens
to vectors of dimension dmodel . Another learned linear transformation followed by a
softmax function is used to convert the decoder outputs to the predicted probabilities
for the next token. The original authors suggest using the same matrix both for the two
embedding layers and for the linear transformation in order to tie the input and output
embeddings together, similar to [PW17, section 1]. They also suggest multiplying the
weights in the embedding layers by √dmodel without motivating this.

Figure 2.1: An encoder layer (left), a decoder layer (middle), and the complete encoder-
decoder structure (right)

2.2.6 Positional Encoding


Transformer networks do not inherently impose any structure on their inputs. Because
of this, any such structure, even a linear order, has to be explicitly added to the inputs.
The positional encoding suggested by the original authors uses the sine and cosine
functions for this:
    PE(pos,2i)   = sin(pos / 10000^(2i/dmodel) )
    PE(pos,2i+1) = cos(pos / 10000^(2i/dmodel) )

They choose this encoding because it might allow the model to easily learn to attend
to relative positions, since for any fixed offset k, PEpos+k can be represented as a linear
function of PEpos . This might allow the model to extrapolate to sequence lengths longer
than those encountered during training.
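For concreteness, the sinusoidal encoding can be tabulated as in the following sketch (assuming an even dmodel ; the function name is ours, numpy as above):

    def positional_encoding(n, d_model):
        # returns an (n, d_model) matrix whose row pos is PE_pos
        pos = np.arange(n)[:, None]
        i = np.arange(d_model // 2)[None, :]
        angles = pos / 10000 ** (2 * i / d_model)
        pe = np.zeros((n, d_model))
        pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
        pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
        return pe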
A diagram of a full transformer network can be seen in Figure 2.2.

Figure 2.2: The original transformer network

2.2.7 Further Developments in Transformer Network Design
In later transformer networks like Google’s BERT and OpenAI’s GPT, the structure
has been further simplified, getting rid of the encoder-decoder structure in favor of
simply a sequence of transformer blocks. Each transformer block is just a layer of the
original transformer network’s encoder or decoder [DCLT19, section 3; RNSS18, sub-
sections 3.1, 4.1].
Another trend that can be observed is that the number L of transformer blocks/layers
of different transformer networks is steadily increasing, as can be seen in Table 2.1.
This is done to allow the networks to store ever more information about the larger and
larger datasets they are trained on [Ope23].
As most transformer networks are a sequence of transformer blocks together with a
more task-specific transformation of the input and output, such as embedding and po-
sitional encoding, we will focus on this structure and provide a mathematical description
that allows a theoretical analysis in the following section 2.3.

Transformer Network Year Layers L Source


original 2017 6 [VSP+ 17]
GPT-1 2018 12 [RNSS18]
BERT 2018 up to 24 [DCLT19]
GPT-2 2019 up to 48 [RWC+ 19]
GPT-3 2020 up to 96 [BMR+ 20]
GPT-4 2023 rumored 120 [Ope23]

Table 2.1: Number of Layers of Different Transformer Networks

2.3 Mathematical Description
Section 2.3, section 2.4, and section 2.5 are based on [MSS22, section 3].

2.3.1 Datatypes
We model all internal data of a transformer network as binary strings. In order to be
able to perform calculations with those binary strings, we need a semantics to interpret
them as numbers. As is often done in circuit complexity, we will first interpret them
as unsigned integers.

Definition 2.1. A binary string x ∈ [2]∗ of length |x| = n interpreted as an unsigned
integer has the value

    ⟦x⟧N := ∑_{i∈[n]} 2^i · xn−1−i .

We denote the standard integer operations on binary strings by +N , ·N , and <N .


Additionally, we define fixed-point numbers, which we will only use in section 5.5.

Definition 2.2. A binary string x ∈ [2]^{r+s} interpreted as a fixed-point number with
integer part r, fractional part s, and precision (r + s) has the value

    ⟦x⟧FPr,s := ⟦x⟧N / 2^s − { 0,    if xr+s = 0
                             { 2^r , otherwise.

That is, x gets interpreted as an integer using the two's complement and that integer
is divided by 2^s . Since r and s are fixed, we normally just write FP in place of FPr,s .
Note that division of fixed-point numbers will usually involve rounding of the result
and that various over- and underflows can occur which we leave undefined.
Next, we define how we encode rational numbers without a fixed size.

Definition 2.3. To interpret a binary string r ∈ [2]∗ as a rational number, we view it as


a sign bit s and an encoded pair of unsigned integer strings ⟨p, q⟩. Its numerical value is

    ⟦r⟧Q := (−1)^s · ⟦p⟧N / ⟦q⟧N .

We once again denote the standard operations by +Q and ·Q and define them to
always reduce their result as far as possible.
Finally, we will define floats F as the subset of the rationals Q where the denominator
is constrained to be a power of 2. This resembles the way numbers are usually encoded
in most computers more closely than the rational numbers. Addition and multiplication
are defined the same way as for Q. Note, however, that the multiplicative inverse of
a float is not necessarily another float and may have to be approximated. Because of
this, it may be that ⟦(x /F y) ·F y⟧F ≠ ⟦x⟧F .
Going forward, we will usually omit datatype subscripts where they are clear from
context. We will sometimes write D as a generic datatype or other datatypes in function
signatures to mean [2]∗ while making the intent clearer and hinting at the semantics.
We imagine the tuple ⟨p, q⟩ to be encoded by padding p, q to the same length using
leading 0s and interleaving their bits. This means that the size of a rational number or
a float is 2 · max(|p|, |q|) + 1. We will use the following property to limit the internal
functions in our transformer network model:

Definition 2.4. We say a function f : [2]∗ → [2]∗ is size-preserving iff there exist
constants c, n such that for all inputs x with |x| ≥ n, the size of the function value is
bounded by |f (x)| ≤ c · |x|. Let P be the set of size-preserving functions.
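To make the encoding of a pair ⟨p, q⟩ concrete, the following Python sketch (a hypothetical helper of our own) pads the two bit strings and interleaves them; together with the sign bit this yields the stated size of 2 · max(|p|, |q|) + 1:

    def encode_pair(p: str, q: str) -> str:
        # pad p and q with leading zeros to equal length, then interleave their bits
        n = max(len(p), len(q))
        p, q = p.zfill(n), q.zfill(n)
        return "".join(a + b for a, b in zip(p, q))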

2.3.2 Transformer Networks


Now we give a precise definition of a transformer network that consists of a sequence
of transformer blocks rather than an encoder-decoder structure.

Definition 2.5. A transformer network is a tuple


    (Σ, D, α, L, H, φ, (sℓ,h )ℓ∈[L],h∈[H] , (fℓ )ℓ∈[L] )

where

1. Σ is the input alphabet.

2. D is a scalar datatype, that is, a semantics for interpreting binary strings as


numbers. We will generally consider D = F.

3. α : D∗ → D∗ is an attention function that maps a vector of attention scores in


Dn to a normalized probability distribution also in Dn . (See section 2.4)

4. L ∈ N is the number of layers.

5. H ∈ N is the number of heads.

6. φ : Σ × N → Dm is a position-aware embedding function that maps a token and


a position to a vector, where m is a multiple of H.

7. For each ℓ, h, the function sℓ,h : Dm × Dm → D assigns attention scores to pairs


of values.

8. For each ℓ, the function fℓ : Dm × Dm → Dm maps a previous layer output and


attention head to a new value vector.

On an input string w ∈ Σn , a transformer network computes L layers of output se-
quences vℓ,0 , . . . , vℓ,n−1 for ℓ ∈ [L]+1 where each vℓ,i ∈ Dm . First, each token wi and its
position i are embedded into a value v0,i . Subsequently, each layer ℓ aggregates infor-
mation from the previous value sequence vℓ using a multi-head attention mechanism
and outputs a new value sequence vℓ+1 . The layers are structured as follows:

1. Embedding layer: v0,i = φ(wi , i) for i ∈ [n].

2. Attention head: Each of the H attention heads in layer ℓ maps the full previous
sequence to a new value via sℓ,h and then applies the attention function α:

aℓ,h,i,j = sℓ,h (vℓ,i , vℓ,j ) for ℓ ∈ [L], h ∈ [H], i ∈ [n], j ∈ [n]



bℓ,h,i = ∑_{j∈[n]} α(aℓ,h,i,0 , . . . , aℓ,h,i,n−1 )j · vℓ,j    for ℓ ∈ [L], h ∈ [H], i ∈ [n].

It is important to note that the semantics for addition and multiplication as well
as the computation of α come from D.

3. Activation block:

vℓ+1,i = fℓ (vℓ,i , (bℓ,0,i , . . . , bℓ,h−1,i )) for ℓ ∈ [L], i ∈ [n].

This model combines the aggregation of the attention heads, the feed-forward
network, the residual connections and the layer normalizations into the single
function fℓ . A benefit of this is that it generalizes to changes in the layout like
for example moving the layer normalization around, which is one of the changes
between GPT-1 and GPT-2 [RWC+ 19, subsection 2.3].
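Written out as (inefficient but faithful) Python, the computation defined above looks roughly as follows. Here phi, score, f, and alpha stand in for φ, sℓ,h , fℓ , and α, values are assumed to be numpy vectors, and exact arithmetic over D is glossed over:

    def run_transformer(w, phi, score, f, alpha, L, H):
        n = len(w)
        v = [phi(w[i], i) for i in range(n)]                    # embedding layer
        for l in range(L):
            b = [[None] * n for _ in range(H)]
            for h in range(H):
                for i in range(n):
                    a = [score[l][h](v[i], v[j]) for j in range(n)]  # attention scores
                    dist = alpha(a)                                  # normalized distribution
                    b[h][i] = sum(dist[j] * v[j] for j in range(n))  # attention head output
            v = [f[l](v[i], tuple(b[h][i] for h in range(H))) for i in range(n)]  # activation block
        return v                                                # the final value sequence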

2.4 Types of Attention


Attention mechanisms make use of an attention function to convert a vector of values
a ∈ Dn into a probability distribution over 0, 1, . . . , n − 1. So far, we have used the
softmax function for this purpose:

    softmax(a)i = e^{ai} / ∑_{k∈[n]} e^{ak} .

We will also examine the simpler hard and saturated attention when evaluating the
complexity of transformer networks in chapter 5. In order to define these, we first
define the function M : Dn → P([n]) that maps a vector of values to the set of indices
of maximum values:

M(a) := {i | ai = max{aj | j ∈ [n]}}.

Using this, we define hard attention (also known as unique hard attention [HAF22,
subsection 4.2]) to work the same as soft attention but use the hardmax function
instead of the softmax function:

    hardmax(a)i := { 1, if i = min(M(a))
                   { 0, otherwise.

This function returns a one-hot distribution.


Analogously, we define strong saturated attention (also known as averaging hard
attention [HAF22, subsection 4.2]) to use the strongsatmax function instead:

    strongsatmax(a)i := (1/|M(a)|) · { 1, if i ∈ M(a)
                                     { 0, otherwise.

Finally, we define weak saturated attention such that each attention head either uses
the hardmax function or the uniformmax function:

    uniformmax(a)i := 1/n .
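A direct transcription of these three attention functions into Python (operating on plain lists and floats rather than on the datatype D) may make the differences clearer:

    def hardmax(a):
        # one-hot distribution on the leftmost maximal position
        j = a.index(max(a))
        return [1 if i == j else 0 for i in range(len(a))]

    def strongsatmax(a):
        # uniform distribution over all maximal positions
        M = [i for i, x in enumerate(a) if x == max(a)]
        return [1 / len(M) if i in M else 0 for i in range(len(a))]

    def uniformmax(a):
        return [1 / len(a)] * len(a)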

2.5 Language Recognition


Now we will define language recognition for transformer networks.

Definition 2.6. Let vℓ,i (w) denote the value of vℓ,i on input string w. A transformer
network recognizes a formal language L ⊆ Σ∗ iff there exists a D-valued affine trans-
formation W, b such that for all w ∈ Σ∗ the following holds:

W · vL,0 (w) + b > 0 ⇐⇒ w ∈ L.

In other words, the decision problem of recognizing L must be linearly separable


using the first value in the last layer of the transformer network.
Finally, we define AHAT(D) (“averaging hard attention transformer”) as the set
of languages recognizable by some saturated transformer network over D where the
internal functions can be any size-preserving functions. To be able to apply the concept
of size-preservation to φ, we assume the size of a token to be log(|Σ|).

3 Circuit Complexity
The definitions in this chapter are taken and partially adapted from [Vol99, pp. 7-10,
46-47, 108, 126].

3.1 Circuits and Circuit Families


Definition 3.1. An n-ary Boolean function for some n ∈ N is a function f : [2]^n → [2].
A family of Boolean functions is a sequence f = (f n )n∈N , where each f n is an n-ary
Boolean function.

Definition 3.2. A basis is a finite set of Boolean functions and families of Boolean
functions. The standard unbounded fan-in basis B1 := {¬, (∧n )n∈N , (∨n )n∈N } contains
the unary Boolean NOT function and the families of Boolean AND and OR functions.

Definition 3.3. The n-ary Boolean majority function checks if at least half of the
input bits are 1:

    MAJ^n : [2]^n → [2],  (x0 , . . . , xn−1 ) ↦ { 1, if |{i ∈ [n] | xi = 1}| ≥ n/2
                                                 { 0, otherwise

Definition 3.4. Let B be a basis and n, m ∈ N. A Boolean circuit over B with n


inputs and m outputs is a tuple C = (V, E, α, β, ω), where (V, E) is a finite directed
acyclic graph, α : E → N is an injective function, β : V → B ∪ {x0 , . . . , xn−1 }, and
ω : V → {y0 , . . . , ym−1 } ∪ {∗}, such that the following conditions hold:

1. If v ∈ V has in-degree 0, then β(v) ∈ {x0 , . . . , xn−1 } or β(v) is a 0-ary Boolean


function from B.

2. If v ∈ V has in-degree k > 0, then β(v) is a k-ary Boolean function from B or a


family of Boolean functions from B.

3. For every i ∈ [n], there exists at most one node v ∈ V such that β(v) = xi .

4. For every i ∈ [m], there exists exactly one node v ∈ V such that ω(v) = yi .

Definition 3.5. Let C = (V, E, α, β, ω) be a circuit over B with n inputs and m
outputs. First, we inductively define a function valv : [2]∗ → [2] for every v ∈ V as
follows: Let a0 , . . . , an−1 be arbitrary values.

1. If v ∈ V has fan-in 0 and if β(v) = xi for some i ∈ [n], then


valv (a0 , . . . , an−1 ) := ai . If v ∈ V has fan-in 0 and if β(v) = b is a 0-ary function
from B, then valv (a0 , . . . , an−1 ) := b.

2. Let v ∈ V have fan-in k > 0 and let v0 , . . . , vk−1 be the gates that are predecessors
of v ordered in such a way that α((v0 , v)) < · · · < α((vk−1 , v)). Let β(v) = f ∈ B.
If f is a k-ary function, then let

valv (a0 , . . . , an−1 ) := f (valv0 (a0 , . . . , an−1 ), . . . , valvk−1 (a0 , . . . , an−1 )).

Otherwise f must be a family of Boolean functions, f = (f n )n∈N . In this case,


we define

valv (a0 , . . . , an−1 ) := f k (valv0 (a0 , . . . , an−1 ), . . . , valvk−1 (a0 , . . . , an−1 )).

For i ∈ [m], let vi be the unique gate vi ∈ V with ω(vi ) = yi . Then the function
computed by C, fC : [2]n → [2]m , is given for all a0 , . . . , an−1 ∈ [2] by
    fC (a0 , . . . , an−1 ) := (valv0 (a0 , . . . , an−1 ), . . . , valvm−1 (a0 , . . . , an−1 )).
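The recursive definition of valv and fC translates directly into code. The following sketch evaluates a circuit over the basis B1 ∪ {(MAJn )n∈N }; the dictionary encoding is our own, and predecessors are assumed to be listed in the order prescribed by α:

    def evaluate(circuit, outputs, inputs):
        # circuit: gate -> (label, predecessor list); labels are "x<i>" for input gates,
        # the constants 0 or 1, or "NOT", "AND", "OR", "MAJ" (the latter three unbounded fan-in)
        memo = {}
        def val(v):                                   # val_v from Definition 3.5
            if v not in memo:
                label, preds = circuit[v]
                if isinstance(label, str) and label.startswith("x"):
                    memo[v] = inputs[int(label[1:])]
                elif label in (0, 1):
                    memo[v] = label
                else:
                    args = [val(u) for u in preds]
                    if label == "NOT":
                        memo[v] = 1 - args[0]
                    elif label == "AND":
                        memo[v] = int(all(args))
                    elif label == "OR":
                        memo[v] = int(any(args))
                    else:                             # MAJ: at least half the inputs are 1
                        memo[v] = int(sum(args) >= len(args) / 2)
            return memo[v]
        return [val(v) for v in outputs]              # the gates v with omega(v) = y_0, ..., y_{m-1}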

Definition 3.6. Let B be a basis. A circuit family over B is a sequence C =


(C0 , C1 , . . . ), where, for every n ∈ N, Cn is a circuit over B with n inputs. Let f n be
the function computed by Cn . Then we say that C computes the function f : [2]∗ → [2]∗ ,
defined for every w ∈ [2]∗ by
f (w) := f |w| (w).

We write f = (f n )n∈N and C = (Cn )n∈N . We say that C accepts A ⊆ [2]∗ iff C
computes cA . In this context, we also use the notation A = (An )n∈N (where An :=
A ∩ [2]n ) and cA = (cAn )n∈N . If C is a circuit family, we use the notation fC for the
function computed by C.

Definition 3.7. Let C = (V, E, α, β, ω) be a circuit over B. The size of C is defined


to be the number of non-input gates in V , that is, |{v ∈ V | β(v) ∈ B}|, and the depth
of C is defined to be the length of a longest directed path in the graph (V, E).
Let C = (Cn )n∈N be a circuit family and let s, d : N → N. C has size s and depth d
if, for every n, Cn has size s(n) and depth d(n).

3.2 Circuit Complexity Classes
We now define circuit complexity classes in terms of bounds on circuit size and depth.

Definition 3.8. Let B be a basis and let s, d : N → N. The class SIZE-DEPTHB (s, d)
contains all sets A ⊆ [2]∗ for which there exists a circuit family C over basis B of size
O(s) and depth O(d) that accepts A.

The class of polynomial size constant depth AND/OR circuits is defined as follows:
Definition 3.9. AC0 := SIZE-DEPTH_{B1} (n^{O(1)} , 1)

It is known to include problems such as integer addition, subtraction, and compari-


son.
The class of polynomial size constant depth threshold circuits (since it can also be
defined in terms of threshold gates rather than majority gates) is defined as follows:
Definition 3.10. TC0 := SIZE-DEPTH_{B1 ∪{(MAJ^n )n∈N }} (n^{O(1)} , 1)

It is known to include problems such as integer multiplication, division, and sorting.


From the definition, it is obvious that AC0 is included in TC0 . One important result
in the field of circuit complexity theory is that this inclusion is a proper one:

Theorem 3.11. AC0 ⊊ TC0 [Vol99, Corollary 4.35].

The proof of this is beyond the scope of this work. As a result of this, it can be
shown that, for example, integer multiplication is not included in AC0 .

3.3 Uniformity
One problem with circuit families is that they are infinite objects. The circuits within
a given circuit family can be completely different from each other. One of the conse-
quences of this is that undecidable languages, like

    ⋃_{bin(n)∈K} [2]^n ,

where K is the special halting problem, can be accepted by a circuit family where each
circuit just outputs a constant 0 or 1.
Algorithms, however, are finite. A program written in any programming language
has a finite text. A Turing machine has a finite number of states and therefore a
finite transition function. A random access machine (RAM) has a finite number of
instructions. Therefore, a uniform circuit family should be a circuit family with a finite
description.

As a finite description of a circuit family C = (Cn )n∈N , we introduce a function fC ,
with 1^n ↦ ⟨Cn ⟩, that is easily computable. In particular:

Definition 3.12. A circuit family C = (Cn )n∈N of size s is logspace-uniform, or log-


uniform, iff there is an admissible encoding scheme such that the function fC : 1^n ↦ ⟨Cn ⟩
is in FDSPACE(log(s)).

4 Logic
In this chapter, we introduce two extensions of first order logic that are used to define
formal languages. They use variables indexing the symbols of a given word to describe
characteristics that define the language. We will use these logics in chapter 5 to bound
the complexity of transformer networks.

4.1 First Order Logic with Counting


This section is based on [CCP23, section 4].
We will describe the syntax of FOC[+; MOD], given a fixed (finite) alphabet Σ, and
its intended interpretation with reference to (finite) strings w ∈ Σn for some n ∈ N. It
consists of

• position variables p, . . . which stand for positions in w, that is integers in [n],

• count variables x, y, z, . . . which stand for rational numbers,

• (count) terms c0 · x0 + · · · + ck−1 · xk−1 + ck , where each ci is a rational number


and each xi is a count variable.

A formula of FOC[+; MOD] is one of:

• ⊤ for true and ⊥ for false.

• Qa (p) where a ∈ Σ, which is true iff wp = a.

• MODrm (p) where r ≥ 0, m > 0, which is true iff p ≡m r.

• t0 = t1 , t0 < t1 where t0 and t1 are terms, which follow the conventional semantics.

• φ0 ∧φ1 , φ0 ∨φ1 , ¬φ0 where φ0 and φ1 are formulae, which follow the conventional
semantics.

• ∃x.φ, ∀x.φ where x is a count variable and φ is a formula, which follow the
conventional semantics.

• ∃=x p.φ, where x is a count variable, p is a position variable, and φ is a formula,


which is true iff φ is true for exactly x values of p. (Note that ∃=x p.φ only binds p.)

We will use the following abbreviations:

• φ → ψ := ¬φ ∨ ψ

• φ ↔ ψ := (φ → ψ) ∧ (ψ → φ)

• ∃p.φ := ∃x.(x > 0 ∧ ∃=x p.φ)

• ∀p.φ := ∃x.(∃=x p.⊤ ∧ ∃=x p.φ)

We call any variable that is not bound by a quantifier free, and a formula with no
free variables a sentence. For a sentence σ and a string w ∈ Σ∗ , we write w |= σ iff w
makes σ true.

Definition 4.1. If σ is a sentence of FOC[+; MOD], the language defined by σ is


L(σ) := {w | w |= σ}.
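As a small example of our own (not taken from [CCP23]), over Σ = {a, b} the sentence

    ∃x.(∃=x p.Qa (p) ∧ ∃=x p.Qb (p))

defines the language of strings that contain exactly as many a's as b's, since the two counting quantifiers force the number of positions carrying an a and the number of positions carrying a b to equal the same count variable x.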

The part of the logic that deals with position variables is like monadic first-order logic,
in which all predicates are monadic (that is, they take only one argument). The other
part of the logic that deals with count variables is the theory of rational numbers with
ordering and addition (but not multiplication). Both of these other logics have useful
normal forms: Monadic first-order logic has a normal form that uses only one variable,
while the theory of rationals with ordering and addition has quantifier elimination. We
can combine these two results to get a very simple normal form for FOC[+; MOD].

Theorem 4.2. Every formula φ of FOC[+; MOD] is equivalent to a formula of the form
    φ′ = ∃x0 . . . . ∃xk−1 . (⋀_i ∃=xi p.ψi ∧ χ)

where each ψi is quantifier-free and has no free count variables and χ is quantifier-free.

Proof. See [CCP23, Appendix A].

It may seem odd that count variables range over rational numbers, when counts
are always integers. This technicality simplifies the normal form: If we had used
integers, then the part of the logic that deals with count variables would be Presburger
arithmetic, and the normal form would require allowing MODrm (x) on count variables
as well.

4.2 First Order Logic with Majority


This section is based on [MS23a, subsection 2.2].

We will describe the syntax of FO(M), given a fixed (finite) alphabet Σ, and its
intended interpretation with reference to (finite) strings w ∈ Σn for some n ∈ N. A
term is

• a constant 0, 1, or n − 1,

• an (index) variable i, j, k, . . . ranging from 0 to n − 1,

• any sum t0 + t1 or difference t0 − t1 of terms t0 and t1 .

A formula is one of:

• Qa (t) where a ∈ Σ and t is a term, which is true iff wt = a.

• bit(t0 , t1 ) where t0 and t1 are terms, which is true iff the t1 -th bit of the binary
expansion of the value of t0 is a 1.

• t0 = t1 , t0 ≤ t1 , t0 ≥ t1 where t0 and t1 are terms, which follow the conventional


semantics.

• φ0 ∧ φ1 , φ0 ∨ φ1 where φ0 and φ1 are formulae, which follow the conventional


semantics.

• ∃i.φ, ∀i.φ where i is an index variable and φ is a formula, which follow the
conventional semantics.

• Mi.φ where i is an index variable and φ is a formula, which is true iff at least n/2
values of i make φ true.

Just like with FOC[+; MOD], we call a formula with no free variables a sentence.
For a sentence σ and a string w ∈ Σ∗ , we write w |= σ iff w makes σ true.

Definition 4.3. If σ is a sentence of FO(M), the language defined by σ is L(σ) :=


{w | w |= σ}.
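As a simple example of our own, over Σ = [2] the sentence Mi.Q1 (i) defines the language of binary strings in which at least half of all positions carry a 1.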

Beyond this, FO(M) can express counting and threshold quantifiers in terms of ma-
jority quantifiers. Given a formula φ, a counting quantifier ∃k creates a new formula
∃k i.φ that is true iff φ is true for exactly k values of i. Threshold quantifiers ∃≤k
and ∃≥k work similarly but check whether φ is true for at most or at least k values of i, respectively. In
addition, FO(M) can express conditional majority quantifiers, also written M, which
create a formula Mi.φ[ψ] that is true iff ψ is true for at least half the values of i that
make φ true ( [MS23a, subsection 2.2]).

5 Complexity Results for Transformers
In this chapter, we will present and prove various complexity results involving trans-
former networks. For a slightly simplified overview of the majority of the results, see
Table 5.1.
                         fixed                     logarithmic                 unbounded

 Hard Attention          ⊆ AC0                     ⊆ AC0                       ⊆ AC0
 (Weak/Strong)
 Saturated Attention     ⊆ UL -TC0                 ⊈ AC0 , F: ⊆ UL -TC0        F: ⊆ TC0 , Q: = ALL
 Soft Attention          ⊆ FOC[+; MOD],            F: ⊆ UL -TC0 = FO(M)        = ALL
                         ⊉ UL -TC0 , ⊊ UL -TC0

Table 5.1: Results shown in this chapter (UL is short for log-uniform here)

5.1 Hard Attention Transformer Networks are in AC0


This section is based on [HAF22].
We begin by showing that AC0 is an upper bound on the class of languages that
can be recognized by hard attention transformer networks GUHAT, regardless of the
internal datatype or precision.

5.1.1 Definition of Generalized Transformer Networks


For our proof of this, we need an even more generalized definition of what a transformer
network is than the one in subsection 2.3.2. We change the following points:

• A symbol $ ∉ Σ is appended to the input.

• For the representation of activation values, we use an arbitrary set A rather than
Dm .

• The input/embedding function φ : (Σ ∪ {$}) × N × N → A takes an extra third


parameter n ∈ N that will always be the length of the input (including the $).
So, v0,i = φ(wi , i, n).

• The scoring functions sℓ,h : A × A → R map to real numbers without loss of


generality. So, aℓ,h,i,j = sℓ,h (vℓ,i , vℓ,j ).

• There is a new pooling function p : A∗ × R∗ → A which combines the atten-
tion function α and the weighted sum of the values. For the case of hard
attention pUHA is defined as follows: On inputs (v0 , v1 , . . . , vn−1 ) ∈ An and
(a0 , a1 , . . . , an−1 ) ∈ Rn , let j ∈ [n] be the smallest index that maximizes aj . Then
pUHA ((v0 , v1 , . . . , vn−1 ), (a0 , a1 , . . . , an−1 )) = vj . So,

bℓ,h,i = pUHA ((vℓ,0 , vℓ,1 , . . . , vℓ,n−1 ), (aℓ,h,i,0 , aℓ,h,i,1 , . . . , aℓ,h,i,n−1 )).

• The activation functions fℓ : A×AH → A take a vector of attention values rather


than numbers as their second parameter. So,

vℓ+1,i = fℓ (vℓ,i , (bℓ,0,i , bℓ,1,i , . . . , bℓ,H−1,i )).

• There is a new model output function g : A → [2]. The output of the general-
ized transformer network T (x) is computed by applying this function to the last
position of the final layer, T (x) := g(vL,n−1 ).

5.1.2 A Normal Form for Generalized Hard Attention Transformer Networks
Despite the abstractness and generality of the GUHAT model, we can define a normal
form representation and show that every transformer network T ∈ GUHAT is equiva-
lent to a transformer network in GUHAT in this normal form with the same number of
layers and heads. The key idea is to preserve all the information from previous layers
that has been used to compute them in the activation values by requiring that the in-
put and activation functions just return the tuple of their arguments. We also require
that attention values be integers in the smallest relevant range.

Definition 5.1. A GUHAT with L layers and H heads is in informative normal form
iff the following conditions are satisfied:

• The input function is φ(σ, i, n) = (σ, i, n).

• For each layer ℓ ∈ [L]+1 , the activation values are (H + 1)-tuples of activation
values at layer ℓ − 1, and the activation function is defined by

fℓ (v, (b0 , b1 , . . . , bH−1 )) = (v, (b0 , b1 , . . . , bH−1 )).

• For each layer ℓ ∈ [L]+1 and attention head h ∈ [H], the scoring function sℓ,h
returns an integer in [N ], where N is the total number of possible ordered pairs
of activation values at layer ℓ − 1.

Lemma 5.2. For any transformer network T ∈ GUHAT, there exists a transformer
network T̂ ∈ GUHAT in informative normal form such that L(T ) = L(T̂ ). Moreover,
T̂ has the same number of layers and heads as T .

Proof. Let T be a GUHAT with L layers and H heads, with input alphabet Σ, input
function φ, scoring functions sℓ,h , activation functions fℓ , and output function g. We
describe how to construct functions for an equivalent transformer network T̂ in GUHAT
in informative normal form, which also has L layers and H heads. We assume that n
is the input length.
For T̂ , the input function φ̂(σ, i, n) is defined to return the triple (σ, i, n). Note that
there are at most |Σ| · n possible initial activation values. We also define a function
t0 that translates initial activation values for T̂ into initial activation values for T by
t0 (σ, i, n) = φ(σ, i, n).
Now, we perform induction on the layers of T and T̂ . Assume that we have defined
scoring and activation functions for T̂ for layers before ℓ (where the initial activation
values are treated as layer 0), and a translation function tℓ−1 that translates all pos-
sible activation values for T̂ from the previous layer into activation values for T from
the previous layer. To define the scoring function for T̂ for layer ℓ and head h, we
enumerate all the possible pairs v̂i and v̂j of activation values of T̂ at layer ℓ − 1, and
determine the corresponding attention values of T , which we denote by yℓ,h (v̂i , v̂j ) =
sℓ,h (tℓ−1 (v̂i ), tℓ−1 (v̂j )). We make a list of all the distinct resulting values and sort them
in increasing order. Then we define ŝℓ,h (v̂i , v̂j ) to be the index of yℓ,h (v̂i , v̂j ) in this
sorted list. The activation function for T̂ for layer ℓ is, by definition,

fˆℓ (v, (b0 , b1 , . . . , bH−1 )) = (v, (b0 , b1 , . . . , bH−1 )).

The translation function for layer ℓ is defined by

t̂ℓ (v, (b0 , b1 , . . . , bH−1 )) = fℓ (tℓ−1 (v), (tℓ−1 (b0 ), tℓ−1 (b1 ), . . . , tℓ−1 (bH−1 ))),

that is, we translate each of the component activation values using tℓ−1 and then apply
the activation function of T .
Finally, the output function for T̂ is defined by ĝ(v̂) = g(tL (v̂)), that is, we translate
the layer L activation value v̂ of T̂ to the layer L activation value of T , and apply the
output function of T .
By construction, T̂ is in informative normal form, and it has L layers and H heads.
It is not difficult to see that for any input w, the translations tk (v̂) of the activation
values v̂ of T̂ are equal to the corresponding activation values of T , and the outputs
T̂ (w) = T (w) are equal as well. Thus L(T̂ ) = L(T ).

5.1.3 From GUHAT to Circuits
In this subsection, we show that for every language L ∈ GUHAT, we can construct a
family of Boolean circuits of constant depth and polynomial size that also recognizes
L. The key step of the proof is to bound the number of bits needed to represent
scoring and activation values for an input sequence of length n by O(log(n)), where
the suppressed constants depend on L and H.

Lemma 5.3. Let T be a GUHAT in informative normal form with L layers and H
heads, and alphabet Σ. Let s = log(|Σ| + 1). Then for any input of length n and any
ℓ ∈ [L], the activation values at layer ℓ can be represented by (H + 1)^ℓ · (2 · log(n) + s)
bits, and for ℓ ∈ [L]+1 , the attention scores at layer ℓ can be represented by
2 · (H + 1)^{ℓ−1} · (2 · log(n) + s) bits.

Proof. For an input sequence of length n, the initial activation values are (σ, i, n),
where σ ∈ Σ ∪ {$} and i ∈ [n]. This can be represented by a string of 2 · log(n) + s
bits. At each successive layer, the activation values are a tuple of (H + 1) values from
the previous layer, which multiplies the number of bits required to represent them by
(H + 1). Also, the range of scoring values is bounded by the number of ordered pairs
of activation values at the previous layer, so scoring values can be represented by twice
the number of bits to represent an activation value at the previous layer.

It is worth observing that the bounds provided by Lemma 5.3 do not hold in the
case of saturated attention because activation values may be the result of the average
of an arbitrary subset of the possible input, which means that there are exponentially
more possible activation values at each layer.
The following elementary facts about Boolean circuits will be useful:

Lemma 5.4. An arbitrary Boolean function f : [2]^n → [2]^m of n inputs and m outputs
can be computed by a depth 3 circuit of size at most 2^n + n + m.

Proof. We express each output zi of f as a disjunctive normal form (DNF) formula of


at most 2^n terms, each with at most n literals. Then, we construct the circuit from
an AND gate for each of the 2^n possible DNF terms, requiring a NOT gate for each of
the n input bits, and an OR gate for each of the m output bits taking the AND gates
corresponding to the terms of the DNF as its inputs. The resulting circuit has size
2^n + n + m. The longest possible path to an output from an input is through a NOT,
an AND and the OR gate, for a depth of at most 3.

Corollary 5.5. If a Boolean function f has at most c · log(n) inputs and at most
d · log(n) outputs, then it may be computed by a Boolean circuit of depth 3 and size at
most n^c + c · log(n) + d · log(n).

We now have the necessary tools to prove the following theorem:

Theorem 5.6. Every language in GUHAT is recognizable by a family of circuits in


AC0 .

Proof. Let L be a language over Σ that is in GUHAT. By Lemma 5.2, we may assume
that L is recognized by a GUHAT transformer network T in informative normal form.
Assume T has L layers and H heads.
What we describe below is a family of circuits to recognize the end-marked language
L$, which can easily be converted to a family of circuits that recognizes L by hard-
wiring the representation of the end-of-sequence symbol $ at the end of the input string
using constant gates. Let s = log(|Σ| + 1) and let h be any binary symbol encoding
for Σ ∪ {$}. We construct a family of Boolean circuits (Cs,n )n∈N of constant depth
and polynomial size such that for all positive integers n and all w ∈ Σn−1 , w ∈ L iff
Cs,n (h(w$)) = 1.
With the O(log(n)) bound on the number of bits to represent activation and scoring
values, Lemma 5.3 yields circuits of constant depth and size polynomial in n for the
input, scoring, activation, and output functions. Additional circuitry is necessary to
implement the comparison of attention scores and selection of the activation value to
attend to for each position, layer, and head.
We construct the overall circuit Cs,n according to the layers of T , starting with the
input function. Let the inputs to T be wi for i ∈ [n]. The inputs to Cs,n are wi,j for
i ∈ [n] and j ∈ [s], where wi,j are the bits of h(wi ), representing the binary encoding
of input symbol wi . At layer 0 for position i, the value of vi^(0) = φ(wi , i, n) = (wi , i, n)
is achieved using the input wires wi,j for j ∈ [s] followed by a sequence of constants 0
or 1 representing the binary representations of i and n for a total of 2 · log(n) + s wires
representing the value (wi , i, n).
Performing induction on layers, we assume that for some ℓ ∈ [L]+1 the circuit Cs,n
has been constructed to contain the wires representing all the activation values vi^(ℓ−1)
for i ∈ [n] at layer ℓ − 1. The portion of the circuit computing the representation
of activation values at layer ℓ is described as follows: Fix a position i ∈ [n] and
a head h ∈ [H]. For each j ∈ [n], there is a circuit Aℓ,h,i,j that has as input the
wires for the activation values vi^(ℓ−1) and vj^(ℓ−1) and as output wires representing the
natural number attention score aℓ,h,i,j in binary. Each of these circuits Aℓ,h,i,j has
2 · (H + 1)^{ℓ−1} · (2 · log(n) + s) inputs and outputs by Lemma 5.3, and therefore can be
computed using depth 3 and size polynomial in n, by Corollary 5.5. All H · n2 such
circuits for layer ℓ operate in parallel, for overall depth 3 and size polynomial in n.
We next describe the circuit that implements the pooling function pUHA . For each
pair j, j ′ ∈ [n], there is a circuit Dℓ,h,i,j,j ′ whose inputs are the outputs of Aℓ,h,i,j and
Aℓ,h,i,j ′ and whose output is a single wire gℓ,h,i,j,j ′ with a value of 1 if aℓ,h,i,j ≥ aℓ,h,i,j ′ and
0 otherwise. Because of the bounds on the number of inputs and outputs, each of these
circuits can have depth 3 and size polynomial in n by Corollary 5.5. These n2 circuits
all compute in parallel. Then for each position j, whether j maximizes aℓ,h,i,j can be
computed by an AND gate whose inputs are gℓ,h,i,j,j ′ for all j ′ ∈ [n]. Let the output of
this AND gate be denoted mℓ,h,i,j . Then mℓ,h,i,j = 1 iff the position j maximizes aℓ,h,i,j .
This increases the depth by 1.
For each j, an indicator zℓ,h,i,j is computed by an AND gate whose inputs are mℓ,h,i,j
and ¬(mℓ,h,i,j ′ ) for all j ′ < j. Thus, zℓ,h,i,j = 1 iff j is the leftmost position that
maximizes aℓ,h,i,j . This increases the depth by 2.
Finally, these indicator values are used to combine the layer ℓ − 1 activation values
in a selection circuit, yielding the representation of the activation value bℓ,h,i = vj^(ℓ−1)
such that zℓ,h,i,j = 1. In general, such a selection circuit takes as input t selector bits
z0 , z1 , . . . , zt−1 , where exactly one zj = 1, and t input values w0 , w1 , . . . , wt−1 , where
each wr consists of S bits. It outputs S bits representing the selected wj (for which
zj = 1). Letting wr,s denote the bit s of wr , the computation can be described as
vr,s = wr,s ∧ zr for r ∈ [t] and s ∈ [S], which can be computed by one layer of t · S AND
gates in parallel. Then the bits of the output are us = ⋁_{r∈[t]} vr,s for s ∈ [S], which can
be computed by one layer of S OR gates in parallel. Thus, the selection circuit adds 2
to the depth, and a polynomial in n to the size.
Because each activation function for a GUHAT in informative normal form simply
returns its argument, no further computation is needed for the activation values. The
representation of the activation value vi^(ℓ) is just the sequence of wires representing
vi^(ℓ−1) , followed by those representing bℓ,0,i through bℓ,H−1,i .
To produce the output of the circuit, we note that the representation of vn−1^(L) has
O(log(n)) bits and the output of g is a single bit, so g can be implemented by a Boolean
circuit of constant depth and size polynomial in n, by Corollary 5.5. This concludes
the proof.

5.2 Universality of Saturated Transformer Networks


This section is based on [MSS22].
In order to formulate upper bounds to the power of saturated attention transformers,
we need to constrain their internal functions. Our definition of the class of languages
recognizable by saturated transformers AHAT(D) already includes the restriction that
they are size-preserving. This, however, is not yet sufficient to allow for a nontrivial
upper bound, as we will show that saturated transformers are still able to recognize any
formal language. Our proof of this works by encoding the entire input sequence into a
single value and using the activation block as a black box to recognize the language.

Theorem 5.7. AHAT(Q) = ALL = P([2]∗ ).

Proof. Let L ∈ ALL be any formal language over the alphabet Σ = [2]. We construct a
rational-valued saturated transformer network with 1 layer and 1 head to recognize L.
We will omit ℓ and h subscripts. Let pi denote the i-th prime number. The embedding
layer encodes the position i of each token wi ∈ Σ according to

    φ(wi , i) := wi / pi .

Since pi ∼ i·log(i) for large i by the prime number theorem [Gol73, p. 599], the number
of bits needed to represent the denominator of φ(wi , i) is bounded by

    log(pi ) ≤ c · log(i · log(i)) ≤ c · log(i^2 ) = 2c · log(i)

for some constant factor c. As i has size log(i), this implies that φ is size-preserving.
Now we define a single uniform attention head that sums all vi , outputting
    ∑_i vi = ∑_i φ(wi , i) = ∑_{i, wi =1} 1/pi .

The denominator q of this sum is the product ∏_{i, wi =1} pi , which can be shown by induction
over the number of prime numbers that are multiplied. Note that wi = 1 iff pi divides
q. Thus, we can define a function g that extracts the input sequence w from q by
checking for each i whether pi divides q. We let

    cL (w) = { 1, if w ∈ L
             { 0, otherwise

be the characteristic function of L and set f := cL ◦ g. The output of the transformer


network will now compute whether w ∈ L, since g outputs the original input sequence
w and cL decides whether w ∈ L. Note that any function solving a decision problem
has a codomain of fixed size, namely the set [2], and is therefore size-preserving.
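The encoding and decoding steps of this construction can be illustrated as follows (a sketch using Python's exact Fraction arithmetic in place of Q and sympy's prime function for the pi ; the helper names are ours):

    from fractions import Fraction
    from sympy import prime

    def head_output(w):
        # sum of the embeddings phi(w_i, i) = w_i / p_i over all positions
        return sum(Fraction(wi, int(prime(i + 1))) for i, wi in enumerate(w))

    def g(s, n):
        # recover w from the reduced denominator q: w_i = 1 iff p_i divides q
        q = s.denominator
        return [1 if q % int(prime(i + 1)) == 0 else 0 for i in range(n)]

Composing g with the characteristic function cL then decides membership in L, as in the proof.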

Similar to this, other authors have used arbitrary precision for storing and manipu-
lating positional encodings to show their model of transformer networks to be Turing
complete [PMB19, PBM21].
Both the unnatural construction that encodes positions using prime numbers and the
result that they can decide any formal language strongly indicate that these restrictions
on our model of transformer networks are not yet sufficient to mimic reality. Because
of this, we switch the datatype from rationals to floats. In the following section, we
will see that doing this allows us to bound the capabilities of saturated transformer
networks in TC0 .
But before we do that, we will quickly show that using floats alone is not enough
and size-preservation is also needed to be able to find non-trivial upper bounds. The
unbounded prefix added to AHAT(D) simply means that we no longer require the
internal functions to be size-preserving.

Corollary 5.8. unbounded-AHAT(D) = ALL for D ∈ {F, Q}.

Proof. The proof works exactly the same as the proof for Theorem 5.7, with the excep-
tion that if D = F the prime numbers pi need to be replaced with distinct powers of
2. This needs to be done because floats can only have powers of 2 in the denominator.
The removal of the size bound allows us to use the powers of 2 that grow in size linearly
instead of logarithmically. We can once again define a function g that extracts the in-
put sequence by just looking at which bits are set to 1. This completes the proof.

Corollary 5.9. unbounded-SAT(D) = ALL for D ∈ {F, Q}.

Proof. The proof works exactly the same as the proof for Corollary 5.8. Soft attention
transformers with sufficient (or, in this case, unbounded) precision can implement
uniform attention by setting all queries and keys to be constant [MS23a, p. 5].

5.3 Saturated Attention Transformer Networks are not in AC0
This section is based on [MSS22].
We define the decision problem MAJ := {w ∈ [2]∗ | |w|1 > |w|0 } similarly to the
MAJn functions in Definition 3.3. It was first shown that transformer networks can
recognize MAJ by [PMB19, Proposition 3.3]. We will prove that this still holds true for
saturated transformer networks using floats by using only a single uniform attention
head. As MAJ ∉ AC0 by Smolensky's Theorem [Vol99, Theorem 3.31], this shows that
saturated transformer networks are not in AC0 .

Theorem 5.10. AHAT(F) ⊈ AC0 .

Proof. We will construct a single layer transformer network with a single attention
head to recognize MAJ, omitting the ℓ and h subscripts. Let the embedding function

φ(w_i , i) := (1 − w_i , w_i ) = { (1, 0), w_i = 0 ;  (0, 1), w_i = 1 }

be a 1-hot encoding of wi . Set the scoring function s(xi , xj ) := 1 for all inputs xi , xj ∈
F^2 , resulting in the attention head attending everywhere and computing, for every
i ∈ [n], the value b_i = (|w|_0 / n, |w|_1 / n). Finally, set

f(v_i , b_i ) := { 1, b_{i,1} > b_{i,0} ;  0, otherwise },   where b_i = (b_{i,0} , b_{i,1} ).


Thus, the output of the transformer network is all 1s if |w|_1 > |w|_0 , that is, if w ∈ MAJ, and
all 0s otherwise. We have shown MAJ ∈ AHAT(F), and, with MAJ ∉ AC0 , it directly
follows that AHAT(F) ⊈ AC0 .

Note that this construction is not just possible in our generalized transformer network
model, but can also be implemented in transformer networks that are actually in use,
like the original one described in section 2.2. It has also been shown empirically that
single layer transformer networks can learn to recognize the majority language.
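
As an illustration, the following minimal sketch (plain NumPy arithmetic instead of the float datatype F, with hypothetical helper names) hand-codes the single uniform attention head from the proof and checks it on two example words.

```python
# Minimal numerical sketch of the MAJ construction above (illustration only:
# plain NumPy floats instead of the float datatype F).
import numpy as np

def majority_transformer(w):
    """Single layer, single uniform attention head recognizing MAJ."""
    w = np.asarray(w)
    v = np.stack([1 - w, w], axis=1).astype(float)     # phi(w_i, i) = (1 - w_i, w_i)
    scores = np.ones((len(w), len(w)))                  # s(x_i, x_j) = 1 for all i, j
    attn = scores / scores.sum(axis=1, keepdims=True)   # uniform attention weights 1/n
    b = attn @ v                                        # b_i = (|w|_0 / n, |w|_1 / n)
    return (b[:, 1] > b[:, 0]).astype(int)              # f outputs 1 iff |w|_1 > |w|_0

print(majority_transformer([1, 0, 1, 1, 0]))  # [1 1 1 1 1]: the word is in MAJ
print(majority_transformer([1, 0, 0, 1, 0]))  # [0 0 0 0 0]: |w|_1 <= |w|_0
```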

5.4 Saturated Attention Transformer Networks are in TC0
This section is based on [MSS22].

Lemma 5.11. Let v0 , v1 , . . . , vn−1 be a sequence of floats, each with size at most z.
Then their sum s = ∑_{i∈[n]} vi has size at most 4z + 2 log(n) + 1.

Proof. Let pi , qi and ps , qs denote the numerator and denominator of vi and s, respec-
tively. Since each vi has size at most z, both pi and qi also have size at most z. Let
pmax = maxi pi and qmax = maxi qi . To add these floats, all their denominators have to
be made equal to qmax , which results in their numerators also being multiplied by qmax /qi .
This works because qi divides qmax , since they are both powers of 2. We can now esti-
mate the numerator of s as

ps ≤ ∑_{i∈[n]} pi · (qmax /qi ) ≤ n · pmax · qmax

which has size ≤ log(n) + z + z = 2z + log(n). The denominator qs ≤ qmax has size ≤ z.
Therefore, s has size ≤ 1 + 2 · max(2z + log(n), z) = 4z + 2 log(n) + 1.

In particular, the size of the sum of a sequence of n float values whose size is bounded
by z(n) ∈ O(log(n)) is also bounded by O(log(n)).
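
The following quick numerical check (an illustration only: floats are modelled as exact Python fractions with power-of-two denominators, and size is measured in bits of numerator plus denominator) shows the size of such a sum staying well below the bound from Lemma 5.11.

```python
# Quick numerical check of Lemma 5.11 (illustration only).
from fractions import Fraction
import math

def size(x):
    """Bit size in the sense used above: bits of numerator plus bits of denominator."""
    return x.numerator.bit_length() + x.denominator.bit_length()

n = 1024
vals = [Fraction(i % 16 + 1, 2 ** (i % 8)) for i in range(n)]   # constant-size floats
z = max(size(v) for v in vals)
s = sum(vals, Fraction(0))
bound = 4 * z + 2 * int(math.log2(n)) + 1
print(z, size(s), bound)      # the size of the sum stays well below the bound
assert size(s) <= bound
```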
We will now leverage Lemma 5.11 to show that in any transformer network
over floats with an element-wise-size-preserving attention function, the values are of
bounded size.

Definition 5.12. A function α : Dn → Dn is element-wise-size-preserving iff for i ∈ [n]
the function xi 7→ α(x)i is size-preserving, where x ∈ Dn .

Note that saturated attention satisfies this definition. We can now prove a theorem
bounding the size of the representations in transformer networks with element-wise-
size-preserving attention.

Theorem 5.13. For any transformer network over F with φ, sℓ,h , fℓ ∈ P and α element-
wise-size-preserving, for all ℓ ∈ [L + 1], h ∈ [H], and i ∈ [n], vℓ,i has size O(log(n)).

Proof. By induction over ℓ. The proof follows the definition of transformer network
computation in subsection 2.3.2.
Base case (ℓ = 0): wi has size O(1) and i ∈ [n] has size O(log(n)). Since φ ∈ P,
v0,i = φ(wi , i) has size O(log(n)) for all i ∈ [n].
Inductive Step: Assuming vℓ,i has size O(log(n)), we will show that vℓ+1,i does
too. As sℓ,h ∈ P, aℓ,h,i,j = sℓ,h (vℓ,i , vℓ,j ) has size O(log(n)) for all i, j ∈ [n]. Since α
is element-wise-size-preserving, we can conclude that α(aℓ,h,i,0 , . . . , aℓ,h,i,n−1 )j also has
size O(log(n)) for all h ∈ [H], i, j ∈ [n]. Multiplying two floats is also size-preserving
[MSS22, Appendix B], so α(aℓ,h,i,0 , . . . , aℓ,h,i,n−1 )j · vℓ,j has size O(log(n)) for all h ∈ [H]
and i, j ∈ [n]. We then apply Lemma 5.11 to conclude that bℓ,h,i has size O(log(n)),
where, recall,

bℓ,h,i = ∑_{j∈[n]} α(aℓ,h,i,0 , . . . , aℓ,h,i,n−1 )j · vℓ,j .

Finally, computing vℓ+1,i = fℓ (vℓ,i , (bℓ,0,i , . . . , bℓ,h−1,i )), we conclude that vℓ+1,i has size
O(log(n)) for all i due to the size-preservation by fℓ .

Corollary 5.14. For any saturated transformer network over F with size-preserving
internal functions, for all ℓ ∈ [L + 1] and i ∈ [n], vℓ,i has size O(log(n)).

Proof. This follows from Theorem 5.13 because saturated attention is element-wise-
size-preserving.

This result cannot be applied to soft attention because the softmax function is not
guaranteed to be element-wise-size-preserving as it involves computing the exponential
function.
We have proved that each vector in a saturated transformer network over floats has
size O(log(n)). Now, we show how this implies saturated transformer networks can be
simulated by TC0 circuits.

Corollary 5.15. Any size-preserving function with at most c · log(n) input bits can be
computed by a Boolean circuit of depth 3 and polynomial size.

Proof. This follows directly from Lemma 5.4.

In other words, such functions can be computed by AC0 circuits. In addition, we will
show that the sum of n floats of size at most c · log(n) can be computed by TC0 circuits.

Lemma 5.16. Let v0 , . . . , vn−1 be a sequence of floats with size at most c · log(n) for
some c. Then their sum s = ∑_{i∈[n]} vi is computable by a threshold circuit of constant
depth and polynomial size.

Proof. Let pi , qi once again be the numerator and denominator of vi . We first compute
qmax = maxi qi using an AC0 circuit that compares all pairs qi , qj and returns the first
qi such that qi ≥ qj for all j ∈ [n]. We then use the fact that multiplication and right
shift (qi is a power of 2) are in TC0 in order to compute ri := pi · (qmax /qi ) in parallel
for all i ∈ [n]. Note that qi and qmax are both powers of 2, so division will be exact.
Next, we leverage the fact that the sum of n integers with size O(log(n)) is in TC0 in
order to compute the numerator of the sum p′ = ∑_{i∈[n]} ri . We select the denominator
as q′ = qmax . Finally, we add an AC0 circuit that reduces the fraction by removing
shared trailing 0s from p′ and q′ , which is possible by Corollary 5.15. Thus, we have
constructed a TC0 circuit to compute the sum of n floats of size O(log(n)).
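
The following sketch (ordinary sequential Python rather than a circuit; the helper name float_sum is my own) walks through exactly the steps the circuit performs: bring all terms to the common denominator qmax, sum the shifted numerators, and strip shared trailing zero bits.

```python
# Sequential sketch of the steps performed by the circuit in Lemma 5.16
# (illustration only).  Floats are given as (numerator, denominator) pairs whose
# denominators are powers of 2.
def float_sum(floats):
    q_max = max(q for _, q in floats)
    # bring every term to the common denominator q_max (exact, since q divides q_max)
    numerators = [p * (q_max // q) for p, q in floats]
    p_sum, q_sum = sum(numerators), q_max
    # reduce the fraction by stripping shared trailing zero bits
    while p_sum != 0 and p_sum % 2 == 0 and q_sum % 2 == 0:
        p_sum //= 2
        q_sum //= 2
    return p_sum, q_sum

print(float_sum([(3, 4), (1, 2), (5, 8)]))   # (15, 8), i.e. 3/4 + 1/2 + 5/8 = 15/8
```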

We now construct a TC0 circuit that simulates a saturated transformer network over
floats.

Theorem 5.17. AHAT(F) ⊆ TC0 .

Proof. For each n, we construct a TC0 circuit that simulates a saturated transformer
network of input size n. We construct the circuit modularly, with one subcircuit for
the attention mechanism and another for the feedforward subnetwork.
Attention Head: Fix a single head in some layer. We will construct a TC0 sub-
circuit that simulates the attention mechanism at position i. The head attends over
vectors v0 , . . . , vn−1 . For all j ∈ [n], vj has size O(log(n)) by Theorem 5.13. In parallel
for each j, we compute the scores ai,j = s(vi , vj ) with an AC0 circuit by Corollary 5.15.
We then compute ai,max := maxj ai,j with an AC0 circuit by comparing all ai,j pairwise
and selecting the first ai,k such that ai,k ≥ ai,j for all j ∈ [n]. We then compute “masked”
values ui,j for each j ∈ [n] via an AC0 circuit by Lemma 5.4:

ui,j := { vj , if ai,j ≥ ai,max ;  0, otherwise }.

We then compute the sum si := ∑_{j∈[n]} ui,j by Lemma 5.16. By Lemma 5.11, si has size
O(log(n)). Now, we similarly define

zi,j := { 1, if ai,j ≥ ai,max ;  0, otherwise }.

Using an analogous sum construction with zi,j instead of ui,j , we can use a TC0 circuit
to compute |M(a)|: the number of values of j for which ai,j ≥ ai,max . Finally, since
dividing floats is in TC0 [MSS22, Appendix A], we can compute the head output as
si / |M(a)| , which has size O(log(n)) by size preservation of division.
Feedforward: As input, f receives vi as well as H head outputs, all of which have
size O(log(n)). As the total size of the input is O(log(n)), we can use Corollary 5.15 to
compute the output of f with an AC0 circuit. The size of the output is O(log(n)) by
size preservation of f . The same idea holds for φ as well as the linear classification head.
We have simulated each transformer network component with a TC0 subcircuit, com-
pleting the proof.

A uniform version of this result has been shown in [Str23]. We will not go into further
detail on this right here, since we will show a more general uniform result in section 5.6.

5.5 Fixed Precision Transformer Networks are in FOC[+; MOD]
This section is based on [CCP23].
In this subsection, we will show an upper bound on the expressivity of fixed-precision
transformer networks. A fixed-precision transformer network is one where the internal
functions are not size-preserving, but instead the size of their output is bounded by a
constant called the precision.
We are going to use fixed-point numbers instead of floats. There is no loss of gen-
erality because all floats can be converted exactly to a fixed point number. This fixed
point number may be significantly larger, but will still be bounded by some constant
precision.
A transformer network computes many activations, or internal values, that depend
on the input w and can be thought of as functions a : Σ∗ → FP. For each such
activation, we will write sentences in FOC[+; MOD] that test the bits of a(w).

Definition 5.18. If a : Σ∗ → FPr,s , we say that a is defined by sentences (σka )k∈[r+s]
(or (σka ) for short) iff, for all k ∈ [r + s], w |= σka ⇔ a(w)k = 1.
Similarly, if a : Σn → FPn , we say that a is defined by (φak [p]) iff, for each p ∈ [n],
[a(w)]p is defined by (φak [p]).

The finiteness of FP ensures the following fact:

Lemma 5.19. If a : Σ∗ → FP is defined by (σka ), then for any function f : FP → FP
there are sentences that define f ◦ a. Similarly, if b : Σ∗ → FP is defined by (σkb ) and
g : FP × FP → FP, there are sentences that define the function g ◦ (a, b) = {w ↦
g(a(w), b(w))}.

Proof. Because FP is finite, it is easy but tedious to write sentences that test for all
possible inputs and outputs.

Using this, the following result can be shown:

Theorem 5.20. Every language that is recognizable by a fixed-precision transformer
network is definable by a sentence of FOC[+; MOD].

Proof Sketch. The proof works by going through the components of a transformer net-
work and defining their output as an activation function using the output of the pre-
vious component(s). Lemma 5.19 allows for this to work as long as no over- or under-
flows occur during the computation. Because of this, the internal functions are not al-
lowed to be arbitrary, but have to follow the definition of the original transformer net-
work more closely. For the same reason, the softmax function has to be more carefully
computed by bitwise averages instead of sums. With these changes, it is then possible
to create a FOC[+; MOD] sentence that defines the output activation function of the
transformer network. For the full details of the proof, see [CCP23, section 5].

This result is particularly interesting as FOC[+; MOD] is not a superset of log-
uniform TC0 and thus yields a tighter upper bound.

Theorem 5.21. The language {0^n 1^n | n ∈ N} is in log-uniform TC0 , but not definable
in FOC[+; MOD].

Proof. The language contains no words of odd length and exactly one word of each
even length. Because of this, a circuit family that decides it can be constructed from
circuits that simply output a constant 0 for an odd number of input bits and circuits
that test for 0^n 1^n for an even number of input bits 2 · n. Such a circuit just consists of
one AND gate that takes the first n input bits negated and the other input bits directly
as its inputs. An example of this can be seen in Figure 5.1. This circuit family clearly
lies in log-uniform TC0 .
To show that the language is not definable in FOC[+; MOD], suppose that it is
definable by some sentence σ. Let M be the product of all moduli m that occur in
atomic formulas MOD^m_r(p) in σ. Then σ cannot distinguish between positions p and
p + M , so it cannot distinguish w = 0^M 1^M and w′ = 1 0^{M−1} 0 1^{M−1} . Since w |= σ, it
must be the case that w′ |= σ, which is a contradiction.

Figure 5.1: the circuit that checks if a given word a ∈ [2]^6 is 000111 (a single AND gate over the negated inputs a0 , a1 , a2 and the unnegated inputs a3 , a4 , a5 )
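
For illustration, the following few lines (a plain Python check, not a circuit construction) implement the decision rule that the circuit family from the proof computes.

```python
# Plain Python check of the decision rule computed by the circuit family above
# (illustration only, not a circuit construction).
def accepts(w):
    n, r = divmod(len(w), 2)
    if r == 1:                                  # odd length: the constant-0 circuit
        return False
    return all(b == 0 for b in w[:n]) and all(b == 1 for b in w[n:])

print(accepts([0, 0, 0, 1, 1, 1]), accepts([0, 1, 0, 1, 1, 1]))   # True False
```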

The authors of [CCP23] continue by showing that every language that is definable by
a sentence in FOC[+; MOD] is also recognizable by a transformer network. They argue
that FOC[+; MOD] lies somewhere between fixed-precision transformer networks and
unbounded transformer networks, and is therefore close to an exact characterization of
the languages that transformer networks can recognize.
This latter claim, however, is rather weak, as we already showed in Corollary 5.9
that unbounded-SAT(D) = ALL, which is not close to FOC[+; MOD] at all. Their
definitions and assumptions do not line up exactly with the ones we use here, so our
result does not immediately carry over to their setting, but it does make their reasoning
appear somewhat questionable.

5.6 log-Precision Transformer Networks are in Uniform TC0
This section is based on [MS23b].
In this subsection, we are going to show that log-precision transformer networks can
be simulated by uniform constant-depth threshold circuits. Thus, such transformer
networks can only solve problems in uniform TC0 .
Intuitively, the upper bound states that log-precision transformer networks are com-
putationally shallow, and that this shallowness can be understood to emerge from their
parallelizability. Their inherent parallelism is useful for training them efficiently at
massive scale, but may limit the complexity of the computations they can express. The
term parallelism tradeoff has been introduced in [MS23b, section 1] to capture this idea,
which represents a potential fundamental weakness of the current paradigm of scaling
language models. One interpretation of complexity classes such as AC0 and TC0 is as
sets of polynomial-time solvable problems that are parallelizable to a very high degree:
they can be solved in parallel in constant time with enough parallel processors.
This gives some intuitive explanation of our result: log-precision transformers end up
in TC0 because they were designed to be highly parallelizable. Since parallelism is an
important property of today’s dominant paradigm of training models at massive scale,
this points to the conclusion that any massively scaled up model — transformer net-
work or otherwise — will likely obey restrictions similar to the ones derived here for
log-precision transformer networks. There is thus an important tradeoff between the
massive parallelizability of today’s networks and their representation power.
Other results rely on making unrealistically strong assumptions or placing unrealistic
restrictions on the model of transformer networks. For this result, we only make one
assumption – namely, all intermediate values in the transformer network are limited
to O(log(n)) bits, where n is the number of input tokens. We next discuss some
implications of this assumption and what the findings mean for practical transformer
networks.
The bounds we will prove are asymptotic in nature and thus apply when n is suffi-
ciently large. In practice, transformer network models use fixed precision at each com-
putation node, which is more restrictive than log-precision. However, this constant
could be large and thus, for relatively small n, our results do not rule out practical
transformer networks solving difficult problems. The results, however, do show that
as n grows sufficiently large, log-precision transformer networks are fundamentally lim-
ited to problems within TC0 and cannot accurately solve various commonly studied
problems, such as:

• Linear equalities: find x such that Ax = b (assuming log-uniform TC0 6= P)

• Universal context-free recognition (assuming log-uniform TC0 6= P)

• Propositional satisfiability, SAT (assuming log-uniform TC0 6= NP)

• Horn-clause satisfiability, HORN-SAT (assuming log-uniform TC0 6= P)

• AI planning

• Permanent computation.

Extending our analysis to small n will help close the gap to practice.
The formal model we will use in this section is based on a binary classification view of
transformer networks. However, our results apply directly to multi-class classification
as well and can be extended to generation problems by viewing, for instance, next word
prediction in natural language processing (NLP) as a multi-class classification problem.
However, if the transformer network decoder is allowed to condition on its previous
output in a generation problem, then this would violate our formal setup.

5.6.1 Circuit Serialization
First, we will discuss a way of serializing a circuit into a string. We later show how to
generate such serializations using a resource-bounded algorithm, which is the key to
proving containment in uniform complexity classes.
We identify a circuit with its serialization in a formal language that identifies each
node’s label and adjacency list. We will adopt a specific grammar for concreteness, but
our construction can be adapted to other string representations of circuits.
We define a circuit serialization as a traversal of a circuit ordered by some topological
sort. In this serialization, leaf nodes (variables/input gates) are represented by the
string X. An internal node (non-input gate) is represented in Polish notation by the
function it computes (AND, OR, or NOT) followed by a list of pointers to its arguments.
Each argument &1^j of gate i encodes (in unary) a zero-indexed pointer to the j-th gate
in the circuit, where j < i. The final node is interpreted as the circuit output.
To serialize {∧, ∨}-circuits, we use the following grammar, where the i parameter
is passed through Gate[i] non-terminals to track the index of the gate in left-to-right
order:

Circuit → Gate[1] Gate[2] . . . Gate[g]
Gate[i] → X | NOT Arg[i] | Op Arg[i]∗
Arg[i] → &1^j such that j < i
Op → AND | OR

In the Arg[i] rule, we enforce that j < i so that arguments must be pointers to already
defined gates. As an example of this serialization language, the circuit for x0 ∨ ¬x1 ∨ x2 ,
which can be seen in Figure 5.2, is represented as X X X NOT &1 OR & &111 &11,
where spaces are added for readability.

Figure 5.2: the circuit for x0 ∨ ¬x1 ∨ x2 (inputs x0 , x1 , x2 ; a NOT gate negating x1 ; a single OR gate over x0 , ¬x1 , and x2 )

By convention, negations in AC0 circuits are usually taken to occur at the beginning
of the circuit, rather than after ∧ or ∨ nodes, which can be achieved using De Morgan’s
law. Our serialization grammar does not enforce this property, but of course any circuit
with this property can be serialized by our grammar.
It is slightly more complicated to serialize threshold circuits. We assume that all
non-input gates in our threshold circuits are threshold gates θ≤k , θ≥k , which return
whether at most or at least k of their m input bits are 1. Threshold gates are equivalent
to majority gates (under constant-depth reduction) and can be used to simulate ∧, ∨,
and ¬ gates. Formally, a threshold circuit serialization is generated by the following
grammar:

Circuit → Gate[1] Gate[2] . . . Gate[g]
Gate[i] → X | Dir 1^k 0^{m−k} Arg[i]^m
Arg[i] → &1^j such that j < i
Dir → <= | >=

In the rule for Gate[i], m ∈ N is the arity of the gate, and k ≤ m is its threshold. The
span 1^k after Dir can be interpreted semantically as a unary encoding of the parameter
k for a threshold gate, padded by 0s to the number of total arguments of gate i. For
simplicity, we imagine ¬ gates are represented as unary θ≤0 gates. Thus, the circuit
for θ≥1 (x0 , ¬x1 ) which can be seen in Figure 5.3 would be represented as

X X <= 0 &1 >= 10 & &11.

We say a threshold circuit is in prefix form iff all inputs (X) come before all threshold
gates (<= and >=), as is the case in this example.

Figure 5.3: the circuit for θ≥1 (x0 , ¬x1 ) (inputs x0 , x1 ; a θ≤0 gate negating x1 ; a θ≥1 output gate over x0 and ¬x1 )
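
To make the serialization format concrete, the following toy evaluator (my own illustrative helper, not taken from [MS23b]) parses a prefix-form threshold-circuit serialization and evaluates it on a given input; it reproduces the example above.

```python
# Toy evaluator for the prefix-form threshold-circuit serialization described
# above (illustration only).  Tokens: "X" for an input gate, "<="/">=" followed
# by the unary threshold mask 1^k 0^(m-k), and one pointer token "&1...1" per
# argument, where j ones point to gate j.
def eval_serialized(tokens, x):
    values = []                  # value of gate j, in left-to-right order
    inputs = iter(x)
    i = 0
    while i < len(tokens):
        if tokens[i] == "X":
            values.append(next(inputs))          # input gate: consume the next bit
            i += 1
            continue
        direction, mask = tokens[i], tokens[i + 1]
        k, m = mask.count("1"), len(mask)        # threshold k, arity m
        args = tokens[i + 2 : i + 2 + m]
        ones = sum(values[arg.count("1")] for arg in args)
        values.append(int(ones >= k if direction == ">=" else ones <= k))
        i += 2 + m
    return values[-1]                            # the final gate is the output

# the example circuit for theta_{>=1}(x0, not x1): "X X <= 0 &1 >= 10 & &11"
tokens = "X X <= 0 &1 >= 10 & &11".split()
print(eval_serialized(tokens, [1, 0]))           # 1, since x0 = 1
print(eval_serialized(tokens, [0, 1]))           # 0, since x0 = 0 and not x1 = 0
```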

5.6.2 Uniformity
The circuit families we have defined thus far are nonuniform, meaning that we do not
enforce that the circuits must be related in any way. In degenerate cases, nonuniform
circuit families can solve undecidable problems because they have infinite description
length, making them a physically unrealizable model of computation. Complexity
theorists have thus introduced uniform circuit families. Uniform circuit families are a
realizable model of computation with relations to classes in computational complexity
and formal language theory.
Intuitively, in a uniform circuit family, the circuits for different input sizes must
be “somewhat similar” to each other. We formalize this by saying that there exists a
resource-constrained Turing machine that maps the input 1n to a serialization of circuit
Cn .

Definition 5.22. A language L is (S(n), I(n))-space uniformly computable by a circuit
model M iff there exists a Turing machine that, for all n ≥ 0, uses S(n) space to map
1n to an M -circuit recognizing L on inputs of size I(n).

This notion of uniformity is more general than the standard notion in that the
input size I(n) is a function of the problem complexity n. The reason for this is that
we will apply uniformity to sub-computations with different input sizes I(n) within a
larger computation of input size n. The standard notion of uniformity corresponds to
I(n) = n.
Furthermore, we will refer to a circuit family as uniform iff it is uniformly computable
with S(n) = O(log(n)). We can define uniform versions of AC0 and TC0 by adopting
the previous definitions exactly, but also enforcing uniformity.

5.6.3 Transformer Network Precision and Space


We will assume that each transformer network is resource bounded in terms of the
precision of each value it computes and, for some of our results, the space it uses for
the computation of the key operations such as embedding, attention, and activation.
Specifically, we will assume precision p, that is, the values at all layers, as well as
the outputs of all key intermediate operations in it (attention, activation, arithmetic
operators, etc.), are represented using p bits. This is a realistic assumption as, in
practice, today’s transformer networks are typically limited to the 64-bit precision of
the underlying hardware. Formally, we define p-precision as follows:

Definition 5.23. A k-ary function f : x0 , x1 , . . . , xk−1 ↦ y is p-precision iff x0 , x1 , . . . ,
xk−1 , y ∈ [2]∗ have size at most p bits, and f can be computed by a p-space-bounded
Turing machine.

This means the size of the function inputs and output are bounded by p. Similarly,
the intermediate space used by the computation must also be bounded by p. Thus,
higher precision computations cannot somehow be hidden inside f .
Definition 5.23 naturally applies to functions with bounded arity k. We will also
need to define p-precision for the summation operator in the transformer network,
which adds n different floats of size p. Adding n floats can blow up the precision
needed to represent their sum. For example, imagine adding the floats 1 · 2^0 and 1 · 2^c .
We obtain (2^c + 1) · 2^0 , whose mantissa takes c + 1 bits to represent. In practice,
computers do not preserve full precision in such situations. Instead, small terms like
1 · 2^0 are discarded. Thus, we define the transformer network’s addition operator ⊕ to
be similarly approximate. For more details on how (iterated) addition of p-precision
floats works, see [MS23b, Appendix A].
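
As a rough illustration of why approximate addition loses small terms, the following toy model (a simplification of my own; the exact definition of ⊕ is given in [MS23b, Appendix A]) keeps only the top p bits of the exact sum.

```python
# Toy model of approximate p-precision addition (a simplification of my own, not
# the exact operator from [MS23b, Appendix A]): compute the exact sum, then keep
# only its p most significant bits.
import math

def approx_add(values, p):
    exact = sum(values)
    if exact == 0:
        return 0.0
    e = math.floor(math.log2(abs(exact)))   # exponent of the leading bit
    scale = 2.0 ** (e - p + 1)
    return round(exact / scale) * scale     # drop everything below the top p bits

print(approx_add([1.0, 2.0 ** 40], p=10))   # 1099511627776.0: the term 1.0 is lost
print(approx_add([1.0, 2.0 ** 40], p=50))   # 1099511627777.0: enough precision keeps it
```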

5.6.4 p-Precision Transformer Network Definition


For this section, we will use the following model of transformer networks. We define
an attention head as follows:

Definition 5.24. A p-precision attention head is specified by a binary p-precision
similarity (or attention) function s : [2]p × [2]p → [2]p .

Let h0 , h1 , . . . , hn−1 ∈ [2]p be the input sequence to a p-precision attention head, and
let ⊕ be approximate floating-point addition.

Definition 5.25. For all ℓ ≥ 0, a p-precision attention head H^{ℓ+1}_h computes a vector
a^{ℓ+1}_{i,h} ∈ [2]^p via

a^{ℓ+1}_{i,h} = ⊕_{j∈[n]} ( s(h^ℓ_i , h^ℓ_j) / Z_i ) · h^ℓ_j ,

where Z_i = ⊕_{j∈[n]} s(h^ℓ_i , h^ℓ_j).

Standard attention heads, like the ones of the original transformer network, are a
special case of this definition where s is scaled dot-product similarity between keys
and queries. Standard transformer networks also have a linear or affine value function
applied to each head hℓj in the sum over the j. By its affineness, the value function can,
without loss of generality, be removed from the attention head and considered to be a
part of the transformer network layer (that is, applied to the output of the attention
head).
A p-precision transformer network layer is then a tuple of heads and a function f
used to combine them.

Definition 5.26. A p-precision transformer network layer is a tuple L^{ℓ+1} = ⟨H0 , H1 ,
. . . , Hk−1 , f ⟩, where each Hh is an attention head and f : ([2]^p)^k × [2]^p → [2]^p is a p-
precision activation function.

A p-precision transformer network layer can be understood to define a sequence of
vectors h^{ℓ+1}_0 , h^{ℓ+1}_1 , . . . , h^{ℓ+1}_{n−1} in terms of an input sequence of vectors h^ℓ_0 , h^ℓ_1 , . . . , h^ℓ_{n−1}
(coming from the previous layer in the transformer network) by first computing k
attention heads in parallel and then combining their outputs using f . The first k
inputs to f will correspond to the attention head outputs, and the additional input is
the original input from the previous layer. Recall that a^{ℓ+1}_{i,h} is the output of head H^{ℓ+1}_h
on input h^ℓ at position i. The function computed by a transformer network layer can
be described formally as follows:

Definition 5.27. For ℓ ≥ 0, a p-precision transformer network layer L^{ℓ+1} recurrently
computes the output sequence h^{ℓ+1}_0 , h^{ℓ+1}_1 , . . . , h^{ℓ+1}_{n−1} as a function of the inputs
h^ℓ_0 , h^ℓ_1 , . . . , h^ℓ_{n−1} , where, for i ∈ [n], the i-th component is computed according to

h^{ℓ+1}_i = f( a^{ℓ+1}_{i,0} , a^{ℓ+1}_{i,1} , . . . , a^{ℓ+1}_{i,k−1} , h^ℓ_i ).

This function f can be understood to encapsulate layer norm, residual connections,
and the feedforward sub-layer of a standard transformer network. The component of
the output sequence of the previous layer hℓi is given to f to allow residual connections.
As mentioned previously, f can also encapsulate the value function for each head.
Finally, we can define a transformer network of depth d as a cascade of d transformer
network layers:

Definition 5.28. A p-precision transformer network over alphabet Σ is a pair consisting
of a p-precision position embedding function φ : Σ × N → [2]^p and a d-tuple of
p-precision transformer network layers ⟨L^1 , L^2 , . . . , L^d ⟩.

For a position embedding function φ and w ∈ Σ^n , let φ(w) be the position-wise
broadcasted embedding of w: for 0 ≤ i < n, φi (w) := φ(wi , i).

Definition 5.29. A transformer network (φ, ⟨L^1 , L^2 , . . . , L^d ⟩) computes the following
function of a string w ∈ Σ∗ :

T (w) = (L^d ◦ L^{d−1} ◦ · · · ◦ L^1 )(φ(w)).

We will use n to denote the length of w, and take the transformer network’s depth
d to be fixed with respect to n.
The input to the transformer network can thus be represented with N = n · log(|Σ|)
bits using a binary encoding for the vocabulary. The circuits we construct subse-
quently to simulate transformer networks will also have output size N . We will assume
transformer networks have log-precision relative to the size of the input, specifically
O(log(N ))-precision. Since |Σ| is fixed (typically 30000 in practice), we will think in
terms of O(log(n))-precision. Thus, by Definition 5.23, all the intermediate functions
of such transformer networks are computable in O(log(n)) space and output (at most)
that many bits. Note that this is enough precision to represent positional encodings
and for each position to point to a constant number of other values, but not enough
precision for non-lossy pooling of the entire input into a single value.
Our log-precision transformer networks do not enforce that s and f follow the trans-
former network structure. However, a feedforward net whose primitive operations (for
example scalar multiplication) are defined over O(log(n))-size numbers can be com-
puted in O(log(n)) space. Thus, bounded-precision practical transformer networks are
a special case of our log-precision transformer networks. This makes our setup appro-
priate for proving upper bounds on transformer networks.

5.6.5 log-Precision Transformer Networks as nonuniform Threshold Circuits
We will first show that log-precision transformer networks can be simulated by nonuni-
form threshold circuits, before presenting the more technical uniform version of the re-
sults in subsection 5.6.6.

Corollary 5.30. Let f : [2]∗ → [2]^m be a function. For all c ∈ R+ and n ∈ N, there
exists an AC0 circuit of size at most n^c + c · log(n) + m and depth 3 that computes f
on inputs of size c · log(n).

Proof. This follows directly from Lemma 5.4.

We now use Corollary 5.30 to prove the following nonuniform result. We note that
the proof works even if the notion of p-precision is relaxed to not require computability
in space p. This requirement will, however, become important for our subsequent result
in subsection 5.6.6.

Theorem 5.31. Any c · log(n)-precision depth-d transformer network operating on
inputs in Σn can be simulated by a threshold circuit family of depth 3 + (9 + 2 · d⊕ ) · d.

Proof. Let w ∈ Σn be the input of a (c·log(n))-precision transformer network. We show


by induction that we can construct a composition of constant-depth, poly-size threshold
circuits to compute each layer of this transformer network. Thus, any constant-depth
transformer network will be computable by a constant-depth threshold circuit.
In the base case of layer 0 and token i, we construct gates representing the constant
i encoded in binary. We can then compute h0i = φ(wi , i) using Corollary 5.30, yielding
a poly-size depth-3 circuit.

In the inductive case of computing the layer outputs h^{ℓ+1}_i for 0 ≤ ℓ < d, we note
that each vector output h^ℓ_i of the previous layer has size (at most) c · log(n) bits because
of the log-precision assumption.
We first fix a head a^{ℓ+1}_{i,h} to simulate. Applying Corollary 5.30, we can compute
s(h^ℓ_i , h^ℓ_j) with a poly-size depth-3 circuit, in parallel for all j. Since n floats with c · log(n)
precision can be approximately added in TC0 [MS23b, Appendix A], we can construct
a TC0 circuit of depth d⊕ to compute Z_i . Since s(h^ℓ_i , h^ℓ_j), Z_i , and h^ℓ_j all have c · log(n)
bits, we can compute (s(h^ℓ_i , h^ℓ_j) / Z_i) · h^ℓ_j with a poly-size depth-3 circuit; we do this in parallel
for all j. Next, we again use the fact that approximate addition of n floats is in TC0
to compute a^{ℓ+1}_{i,h} as the approximate sum over j with a depth-d⊕ circuit.
We now simulate the layer output h^{ℓ+1}_i in terms of its constituent heads. Since all arguments
of f have size c · log(n), we apply Corollary 5.30 to compute f with a poly-size depth-3
circuit, yielding h^{ℓ+1}_i . We repeat this in parallel for all i. This completes the inductive
step. The sub-circuit we have constructed for the (ℓ+1)-st layer has a depth of 9+2·d⊕ .
Aggregating the circuit over all d layers, its overall depth is 3 + (9 + 2 · d⊕ ) · d.

The following now follows directly:

Corollary 5.32. Any log-precision transformer network can be simulated by a nonuni-
form TC0 circuit family.

Proof. Since any given transformer network has some constant depth d, the depth of
the resulting threshold circuit family is also constant.

5.6.6 log-Precision Transformer Networks as Uniform Threshold Circuits
We will now extend the argument from the last section to show that O(log(n))-precision
transformer networks can be simulated by uniform constant-depth threshold circuits
by capitalizing on the assumption that φ, s, and f are log-precision, and thus can
be computed in O(log(n)) space. The overall proof idea is similar, but due to the
uniformity condition, the proof becomes substantially more technical. Not only must
we show the existence of a threshold circuit family computing a transformer, but also
that this circuit family can be generated by a log-space Turing machine.
We first extend Corollary 5.30 to respect uniformity:

Lemma 5.33. Let f : [2]∗ → [2]^m be a linear-space computable function. There exists
a Turing machine that, for all n ∈ N and c ∈ R+ , uses at most c · log(n) + log(m)
space to map input 1n to a circuit of size at most n^c + c · log(n) + m and depth 3 that
computes f on inputs of size at most c · log(n).

Proof. We give the proof in the form of an algorithm to construct a circuit as a function
of n and then justify its correctness and space complexity.
Algorithm: We first print 2 · c · log(n) nodes representing unnegated and negated
input nodes.
Now, we need to show how to construct nodes corresponding to n^c DNF terms. To
that end, we loop over all possible inputs x ∈ [2]c·log(n) by maintaining the c · log(n) bit
binary representation of x (initialized with 0c·log(n) ) and incrementing it by 1 at each
step of the loop. We create a new ∧ node i with c · log(n) arguments, defined as follows:
For j ∈ [c · log(n)], we create an argument pointer to (unnegated) node j if xj = 1 and
to (negated) node c · log(n) + j otherwise.
Next, we construct nodes computing each of the m outputs. We loop over k ∈ [m],
constructing a single node for each k. We loop over all x ∈ [2]c·log(n) analogously to above
to construct a list of arguments. By our linear-space computability assumption and
because x has c · log(n) bits, we can compute f (x) as a subroutine in O(log(n)) space
to obtain fk (x). If fk (x) = 1, we print node 2 · c · log(n) + x (interpreting x as a binary
number) as an argument of node k.
Correctness: We show that this Turing machine M maps input 1n to a serialized
circuit computing f on inputs of size c · log(n). The first layer simply produces unnegated
and negated input values. The second layer then produces all possible DNF terms. Finally,
node k of the third layer computes the disjunction over all terms x such that fk (x) = 1.
Thus, node k of the third layer computes fk .
Logarithmic Space: To complete the proof, we justify that M uses O(log(n) +
log(m)) space. Looping over x ∈ [2]c·log(n) is accomplished by treating x as a binary
number initialized to 0 and incrementing it at each step. Thus, the loop pointer for
building the DNF terms takes c·log(n) space to store. For building the m output nodes,
we maintain a similar loop pointer as well as an index k ≤ m, taking c · log(n) + log(m)
space. Thus, the overall algorithm uses c · log(n) + log(m) space.
Thus, the Turing machine M uses c · log(n) + log(m) space to map 1n to a circuit of
size at most n^c + c · log(n) + m and depth 3 that computes f on size c · log(n) inputs.
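
The following sketch (ordinary Python instead of a log-space Turing machine, and returning the circuit as Python lists rather than a serialization) mirrors the three layers built in the proof: literals, one DNF term per possible input, and one disjunction per output bit.

```python
# Sketch of the circuit construction in Lemma 5.33 (illustration only).
# Layer 1: literals, layer 2: one AND (DNF term) per possible input,
# layer 3: one OR per output bit over the terms where that bit is 1.
from itertools import product

def dnf_circuit(f, num_inputs, num_outputs):
    terms, outputs = [], [[] for _ in range(num_outputs)]
    for t, x in enumerate(product([0, 1], repeat=num_inputs)):
        # the t-th AND gate: literal j is negated iff x_j = 0
        terms.append([(j, bit == 0) for j, bit in enumerate(x)])
        y = f(list(x))
        for k in range(num_outputs):
            if y[k] == 1:
                outputs[k].append(t)             # term t feeds OR gate k
    return terms, outputs

# example: f maps 3 input bits to (parity, AND)
f = lambda x: [sum(x) % 2, int(all(x))]
terms, outputs = dnf_circuit(f, num_inputs=3, num_outputs=2)
print(len(terms), [len(o) for o in outputs])     # 8 terms; 4 feed parity, 1 feeds AND
```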

We can leverage this lemma to derive the uniform analog of Theorem 5.31, as follows:

Theorem 5.34. Any c · log(n)-precision depth-d transformer network operating on
inputs in Σn can be simulated by a log-space-uniform threshold circuit family of depth
3 + (9 + 2 · d⊕ ) · d.

Proof. We will provide a proof by induction over the transformer network layers ℓ that
there is a Turing machine M operating in O(log(n)) space that, on input 1n , outputs a
circuit that simulates the transformer network’s computation on inputs of size n. This
circuit is identical to the one in the proof of Theorem 5.31, and thus has the same
circuit depth.
In the base case, we use logarithmic space to track a counter maintaining the current
token i (between 1 and n) throughout the circuit construction. We construct gates
encoding the constant i in binary. We can then apply Lemma 5.33 to construct a Turing
machine that maps 1n to a constant-depth threshold circuit computing h0i = φ(wi , i).
In the inductive case, we assume we can output in O(log(n)) space a circuit com-
puting every value hℓi in the previous layer ℓ. We will show that we can, in O(log(n))
space, now output a circuit computing every value in layer ℓ + 1.
As in Theorem 5.31, we first fix a head a^{ℓ+1}_{i,h} to simulate. Recall that

a^{ℓ+1}_{i,h} = ⊕_{j∈[n]} ( s(h^ℓ_i , h^ℓ_j) / Z_i ) · h^ℓ_j .


By Lemma 5.33, we can generate a depth-3 circuit of size at most z = n^c + c′ · log(n) + 1,
where c′ = 2 · c (since the input to s is of size 2 · c · log(n)), that computes s(h^ℓ_i , h^ℓ_j) for
specific i, j. We do this sequentially for j ∈ [n] and h ∈ [k], padding each circuit with
unused nodes, such that each one has size exactly z, and the z-th node corresponds
to the output. Thus, the indices of the output nodes for each of the columns will be
wℓ + z · (j · k + h) for j ∈ [n], where wℓ is the index of the last output node hℓn of the
previous layer.
At this point, we use the fact that for p = c · log(n), the p-precision approximate
sum of n p-precision numbers can be computed by a uniform threshold circuit [MS23b,
Appendix A]. We can thus use a Turing machine as a sub-routine to generate, on input
1n , k threshold circuits, where each has size z ′ and computes a ⊕ gate over n items of
precision p each. We set the inputs of circuit h to be nodes wℓ + z · (j · k + h) for j ∈ [n].

By construction, this yields the normalizing constants Z_i = ⊕_{j∈[n]} s(h^ℓ_i , h^ℓ_j), whose value
is located at the node at index wℓ + z · n · k + z′ · h for head h.
Using p-precision arithmetic operator circuits, we can now also generate a circuit to
compute (s(h^ℓ_i , h^ℓ_j) / Z_i) · h^ℓ_j for each j ∈ [n] and h ∈ [k], by using index wℓ + z · (j · k + h) as before
for the value of s(h^ℓ_i , h^ℓ_j) and index wℓ + z · n · k + z′ · h for the normalizing constant Z_i
of head h. Here too we use circuits of identical size z′′, making wℓ + k · (z · n + z′ + z′′ · i)
the index of the output nodes of these n circuits. Next, we again employ a ⊕ circuit of
size z′, similar to the computation of Z_i , to compute the sum of these n values. Finally,
we compute h^{ℓ+1}_i by applying f via Lemma 5.33.
Note that this requires keeping only ℓ, i, and n in memory, each of which takes
O(log(n)) bits.
We repeat this process for all i ∈ [n] to compute the entire layer ℓ + 1, which finishes
the inductive step: If we can output a circuit computing layer ℓ in O(log(n)) space,
then we can do the same for layer ℓ + 1.

Because the depth derived in Theorem 5.34 is constant with respect to n, it follows
that:

Corollary 5.35. Any log-precision (or fixed-precision) transformer network can be
simulated by a uniform TC0 circuit family.

We can now use this result to establish a connection to the logic FO(M) we defined
in section 4.2 [MS23a, Theorem 2].

Corollary 5.36. The output of any log-precision transformer network can be expressed
in FO(M).

Proof. This follows directly from the equivalence of log-uniform TC0 and FO(M) which
has been shown in [MIS90, section 9].

For fixed-precision transformer networks using soft attention, we can even combine
this result with the results from section 5.5 to show that they are strictly less powerful
than TC0 circuits.

Corollary 5.37. The class of languages recognizable by a fixed-precision soft-attention
transformer network is a proper subset of uniform TC0 .

Proof. From Corollary 5.35, we know that it is a subset. If we now assume that the class
is equal to uniform TC0 , then it follows from Theorem 5.20 that TC0 ⊆ FOC[+; MOD],
which contradicts Theorem 5.21. Since our assumption leads to a contradiction, it has
to be wrong and the subset relation is thereby proper.

5.7 Lower Bounds for Instruction Following and Advice Transformers
This section is also based on [MS23b] and the definitions from section 5.6 still apply
here.

5.7.1 Circuit Value Problem


So far, we have shown that log-uniform TC0 is an upper bound for log-precision trans-
former networks. Is this upper bound tight, that is, also a lower bound? While we do
not answer this question here, we address a related question as a first step: We con-
struct a transformer network that can evaluate TC0 circuits on binary inputs, showing
that transformers can compute any TC0 function when their input is augmented with
the right “instructions”.
More formally, we consider the Circuit Value Problem (CVP) [Lad75, p. 18], also
referred to as the Circuit Evaluation Problem, where the input is a boolean circuit C
and a string x ∈ [2]n , and the task is to return the value of C(x) ∈ [2]. This problem
is known to be complete for the class P under log-space reduction [Lad75, p. 19]. We
will assume C is serialized as described in subsection 5.6.1 and prove that log-precision
transformer networks can evaluate any TC0 circuit. Note that this is an extension of the
typical CVP since the circuit has threshold gates, not just standard AND/OR gates.
To demonstrate the practicality of this lower bound construction, we will not just
prove the existence of transformers that can evaluate TC0 circuits but also specify con-
crete choices for the positional embedding scheme and the class of attention functions
that are sufficient to do so.
Fractional Positional Embeddings: For a vector x and scalar y, let ⟨x, y⟩ be the
vector obtained by appending y onto x. For σ ∈ Σ, let v(σ) be the one-hot embedding
of σ into R^|Σ| . For w ∈ Σ∗ and i ∈ N, the fractional positional embedding at token i is

φ(wi , i) = ⟨ v(wi ), i/n ⟩.

Saturated Attention: We imagine s(h^ℓ_i , h^ℓ_j) is computed via saturated attention,
which provides a simple model of the types of attention we can expect to be learned in
transformers. First, queries are computed as q_i = Q h^ℓ_i , and then keys k_j = K h^ℓ_j . Define
the dot-product attention score σ_{i,j} = q_i^T k_j . We can then define saturated attention as

s(h^ℓ_i , h^ℓ_j) := { 1, if σ_{i,j} = max_k σ_{i,k} ;  0, otherwise }.

After normalization, saturated attention creates a distribution that is uniform over
a subset of positions. Thus, it is capable of parameterizing hard attention, uniform
attention over the full sequence, and various attention patterns in between.
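
The following minimal sketch (with arbitrary random weight matrices Q and K, purely for illustration) computes the saturated attention pattern just described: each row of the resulting matrix is uniform over the positions attaining the maximal score.

```python
# Minimal sketch of saturated attention (illustration only; Q and K are random).
import numpy as np

def saturated_attention(H, Q, K):
    """H: (n, d) input vectors; returns the (n, n) saturated attention matrix."""
    queries, keys = H @ Q.T, H @ K.T                   # q_i = Q h_i, k_j = K h_j
    scores = queries @ keys.T                          # sigma_{i,j} = q_i^T k_j
    hard = (scores >= scores.max(axis=1, keepdims=True)).astype(float)
    return hard / hard.sum(axis=1, keepdims=True)      # uniform over the argmax set

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 4))
A = saturated_attention(H, rng.standard_normal((4, 4)), rng.standard_normal((4, 4)))
print(A.round(2))   # each row has a single 1 here, since the maximal score is unique
```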
Simple Pooling Functions: For simplicity, we assume pooling functions f are
thresholded linear functions of their inputs. Thus, they could be implemented by
a feedforward neural net. Without loss of generality, we let attention heads have a
value function, which can be folded into the pooling function from the last layer (see
subsection 5.6.4).
Terminology: We use the term input node to mean a token of type X and gate
node to mean a token of type Dir. We call a token of type & an argument.

We are now ready to present the result. Our construction below is specific to cir-
cuits serialized in prefix form (see subsection 5.6.1), but it can be extended to other
serializations as well.

Lemma 5.38. For all d ∈ N, there exists a transformer network with fractional posi-
tional embeddings, saturated attention, thresholded linear pooling functions, and depth
2 · d that, for any threshold circuit C of depth d serialized in prefix form, maps input
⟨C, x⟩ to the value C(x).

Proof. We will construct a pair of two transformer network layers that evaluate all
the nodes at depth ℓ in the threshold circuit, for any ℓ. It follows that a transformer
network of depth 2 · d can compute the value C(x).
Base Case: Input Nodes. We use an attention layer to attend uniformly over
all positions, with value 1 if w_i = X and 0 otherwise. This head computes |w|_X / n, where
|w|_X is the number of occurrences of X in w. A second layer, then, at input node i,
computes the positional embedding of the token representing input value x_i :

1 − (|w|_X − i) / n .

We attend to this position to retrieve xi . After these layers, each input node i stores
its value xi . We also use the base-case layers to construct an attention head that, at
the i-th node, counts the fraction of tokens (out of n) that are nodes to the left of the
current node. Thus, the column corresponding to node i stores the value i/n.
At each gate node i, we use two more attention heads to find the index of the next
& to the right, and then count the fraction of tokens before it that are 1. This head
thus computes ki / mi , where ki is the threshold value of gate i and mi is its arity.
Finally, using the first attention layer, we have each 1 node attend to the first ar-
gument symbol & to its left and retrieve its fractional position p/n. Then, in the second
attention layer, each argument attends uniformly over all nodes with value p/n. The net
effect is for each argument to store j/n, that is, the pointer it is encoding in unary as &1^j .
Inductive Case: Gate Nodes. By our inductive assumption over prior layers, all
tokens corresponding to circuit nodes at depth ≤ ℓ contain their appropriate value. We
now construct 2 transformer network layers to evaluate gate nodes at depth ℓ + 1.
In the first attention layer, each argument token attends to the closest gate node i to
its left, which is the gate it belongs to. Recall from the base case that argument token
& already stores j/n, where j is the pointer value it encodes. Each argument token now
attends with query j/n to retrieve from node j its already computed value.
The second attention layer applies at gate nodes, not arguments. At gate i of arity
mi , we set the attention s(i, j) to indicate whether argument j belongs to gate node i,
which holds for exactly mi arguments. We set the attention value at argument j to be
the binary value of node j, which was retrieved in the previous paragraph. Thus, the
attention head computes ci / mi , where ci is the number of arguments of node i that are 1.
We repeat this for all gate nodes.
At this point, we have both the count of true inputs to gate node i, ci / mi , and, from
the base case, the threshold parameter of gate i, ki / mi . Thresholding (ci − ki) / mi at 0
allows us to decide, based on whether Dir is <= or >=, whether the current gate node should
output a 0 or a 1. Repeating this for all gates at layer ℓ + 1 completes the inductive
step: We can evaluate all gate nodes in this layer.

The following is an immediate consequence of this:

Theorem 5.39. Depth-(2 · d) transformer networks can solve CVP for depth-d TC0
circuits.

Proof. According to Lemma 5.38, there is a transformer network for any circuit depth
d that solves the problem.

5.7.2 Instruction Following


CVP is closely related to instruction learning and instruction following tasks. The
latter task setup provides a transformer network with two inputs: a regular expression r as
an “instruction”, and z ∈ [2]∗ . The goal of the task is to return whether z belongs to
the regular language represented by r. Viewed from this lens, the circuit evaluation
setup asks: Can transformers follow instructions provided in the form of a circuit? We
will show that the answer is yes for all constant-depth threshold circuits.
Formally, an instruction I is any description of a function fI over [2]∗ , that is, a fixed-
size program to compute that function under some model of computation. We say a
transformer network correctly follows an instruction I iff, for all x ∈ [2]∗ , it correctly
computes fI (x) on input ⟨I, x⟩. A nonuniform instruction description is a family of
length-specific descriptions (In )n∈N . We say a transformer network correctly follows a
nonuniform instruction family (In ) iff, for all n ∈ N and all x ∈ [2]^n , it correctly computes
fIn (x) on input ⟨In , x⟩. The nonuniform description In may take any form. When it
forms a TC0 circuit family, we refer to it as a TC0 instruction description.

Corollary 5.40. There exists a depth-(2 · d) transformer network that can correctly
follow any depth-d TC0 instruction description.

Proof. This follows, since Lemma 5.38 constructs a transformer network that can eval-
uate any TC0 circuit.

Thus, transformer networks with simple position embeddings, attention, and pooling
functions can simulate any instruction provided in the form of a TC0 circuit. We note
that, while it is unknown whether the class of regular languages is contained in TC0 ,
the other direction is known: There are problems computable by TC0 circuits that
are not regular. These include problems involving counting and arithmetic, which are
beyond regular languages. These results thus expand the known kinds of instructions
transformers are able to follow, at least with hand-constructed weights.

5.7.3 Advice Transformers


We can also view circuit evaluation abilities of transformers (Lemma 5.38) from the
lens of advice-taking Turing machines which, in addition to their usual input, are
also provided an input-length-dependent (but input-independent) advice string. For
instance, P/poly is the class of problems decidable in polynomial time when the Turing
machine is given an advice string of size polynomial in the input length.
In the same vein, let T/poly be the class of log-precision, constant-depth transformer
networks with polynomial advice strings. In other words, on an input of size n, we
allow the transformer network to receive an additional poly(n) bits of input that cannot
depend on the standard input. Now we can show:

Corollary 5.41. TC0 ⊆ T/poly.

Proof. Let (Cn )n∈N be a circuit family demonstrating that a problem is in nonuniform
TC0 . Then, by passing the description of Cn as advice for input length n, it immediately
follows from Lemma 5.38 that advice transformer networks can simulate nonuniform
TC0 .

Since non-uniform TC0 even contains some undecidable languages, T/poly is clearly
a very powerful class. Thus, a problem in T/poly cannot always be solved by a trans-
former network on its own. However, if given a description of how to do so (“advice”)
in the form of a TC0 circuit, we have shown that a transformer network could solve
that problem.

6 Conclusion

6.1 Interpretation
First, we have seen that choosing an overly simplified attention function, namely hard
attention, for our model of transformer networks bounds them by AC0 . This does not
reflect the capabilities of transformer networks in the real world, as they have been
shown to have the ability to count, which AC0 circuits are not capable of [BAG20, sec-
tion 6; MSS22, Section 1]. Then, we have seen that assuming arbitrary precision in our
model once again leads to results that do not reflect the capabilities of actual trans-
former networks, such as being able to recognize any formal language and being Tur-
ing complete.
For the much more realistic model of log-precision transformer networks, we have
shown that they can be simulated by log-uniform TC0 circuits, for any kind of atten-
tion function. This establishes threshold functions as a fundamental operation for un-
derstanding the computational model of transformers. This result also establishes po-
tential limits on the computational power of log-precision transformer networks. For
example, if L ⊊ P , transformer networks cannot compute all polynomial time func-
tions. They are certainly very far from being universal. The intuition at the heart of
this result is that forcing a model to be highly parallelizable likely sacrifices its expres-
siveness. Since parallelism seems essential to pretraining any massive model at scale,
any large language model — transformer network or otherwise — may suffer from a
similar tradeoff [MS23b, section 8].

6.2 Future Outlook


Most of the research into the expressivity of transformer networks has, thus far, gone
into establishing upper bounds, that is, finding out what kinds of problems they cannot
solve. A next step would be to figure out corresponding lower bounds to more closely
understand what exactly these networks are capable of.
Another area for further research would be to search for a logic or some kind of
specialized programming language that is equivalent to transformer networks. This
could allow translating the internals of a transformer network into something more
human-readable. It could also allow a new angle for a theoretical analysis.
But transformer networks are of course not the only deep neural networks. The field
is evolving rapidly, and there is also much to be learned from the analysis of competing
architectures, such as the Mamba network [GD23]. Instead of focusing on individual
architectures, it might be of interest to investigate whether the parallelism tradeoff is
real and what that would imply for future design of large language models.

Bibliography
[BAG20] Bhattamishra, Satwik ; Ahuja, Kabir ; Goyal, Navin: On the Abil-
ity and Limitations of Transformers to Recognize Formal Languages. In:
Webber, Bonnie (Hrsg.) ; Cohn, Trevor (Hrsg.) ; He, Yulan (Hrsg.) ; Liu,
Yang (Hrsg.): Proceedings of the 2020 Conference on Empirical Methods
in Natural Language Processing, EMNLP 2020, Online, November 16-20,
2020, Association for Computational Linguistics, 2020, 7096–7116

[BCB15] Bahdanau, Dzmitry ; Cho, Kyunghyun ; Bengio, Yoshua: Neural Ma-


chine Translation by Jointly Learning to Align and Translate. In: Ben-
gio, Yoshua (Hrsg.) ; LeCun, Yann (Hrsg.): 3rd International Conference
on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9,
2015, Conference Track Proceedings, 2015

[BJZP20] Basodi, Sunitha ; Ji, Chunyan ; Zhang, Haiping ; Pan, Yi: Gradient
amplification: An efficient way to train deep neural networks. In: Big Data
Min. Anal. 3 (2020), Nr. 3, 196–207. http://dx.doi.org/10.26599/BDMA.
2020.9020004. – DOI 10.26599/BDMA.2020.9020004

[BMR+ 20] Brown, Tom B. ; Mann, Benjamin ; Ryder, Nick ; Subbiah, Melanie ;
Kaplan, Jared ; Dhariwal, Prafulla ; Neelakantan, Arvind ; Shyam,
Pranav ; Sastry, Girish ; Askell, Amanda ; Agarwal, Sandhini ;
Herbert-Voss, Ariel ; Krueger, Gretchen ; Henighan, Tom ; Child,
Rewon ; Ramesh, Aditya ; Ziegler, Daniel M. ; Wu, Jeffrey ; Winter,
Clemens ; Hesse, Christopher ; Chen, Mark ; Sigler, Eric ; Litwin, Ma-
teusz ; Gray, Scott ; Chess, Benjamin ; Clark, Jack ; Berner, Christo-
pher ; McCandlish, Sam ; Radford, Alec ; Sutskever, Ilya ; Amodei,
Dario: Language Models are Few-Shot Learners. In: Larochelle, Hugo
(Hrsg.) ; Ranzato, Marc’Aurelio (Hrsg.) ; Hadsell, Raia (Hrsg.) ; Bal-
can, Maria-Florina (Hrsg.) ; Lin, Hsuan-Tien (Hrsg.): Advances in Neu-
ral Information Processing Systems 33: Annual Conference on Neural In-
formation Processing Systems 2020, NeurIPS 2020, December 6-12, 2020,
virtual, 2020

[CCP23] Chiang, David ; Cholak, Peter ; Pillay, Anand: Tighter Bounds on
the Expressivity of Transformer Encoders. In: Krause, Andreas (Hrsg.)
; Brunskill, Emma (Hrsg.) ; Cho, Kyunghyun (Hrsg.) ; Engelhardt,
Barbara (Hrsg.) ; Sabato, Sivan (Hrsg.) ; Scarlett, Jonathan (Hrsg.):
International Conference on Machine Learning, ICML 2023, 23-29 July
2023, Honolulu, Hawaii, USA Bd. 202, PMLR, 2023 (Proceedings of Ma-
chine Learning Research), 5544–5562

[DCLT19] Devlin, Jacob ; Chang, Ming-Wei ; Lee, Kenton ; Toutanova, Kristina:


BERT: Pre-training of Deep Bidirectional Transformers for Language Un-
derstanding. In: Burstein, Jill (Hrsg.) ; Doran, Christy (Hrsg.) ;
Solorio, Thamar (Hrsg.): Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computational Linguistics: Hu-
man Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA,
June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Com-
putational Linguistics, 2019, 4171–4186

[GD23] Gu, Albert ; Dao, Tri: Mamba: Linear-Time Sequence Modeling with
Selective State Spaces. In: CoRR abs/2312.00752 (2023). http://dx.doi.
org/10.48550/ARXIV.2312.00752. – DOI 10.48550/ARXIV.2312.00752

[Gol73] Goldstein, L. J.: A History of the Prime Number Theorem.


In: The American Mathematical Monthly 80 (1973), Nr. 6, 599-
615. http://dx.doi.org/10.1080/00029890.1973.11993338. – DOI
10.1080/00029890.1973.11993338

[HAF22] Hao, Yiding ; Angluin, Dana ; Frank, Robert: Formal Language Recog-
nition by Hard Attention Transformers: Perspectives from Circuit Com-
plexity. In: Trans. Assoc. Comput. Linguistics 10 (2022), 800–810. https:
//transacl.org/ojs/index.php/tacl/article/view/3765

[Lad75] Ladner, Richard E.: The circuit value problem is log space complete for P.
In: SIGACT News 7 (1975), Nr. 1, 18–20. http://dx.doi.org/10.1145/
990518.990519. – DOI 10.1145/990518.990519

[MIS90] Mix Barrington, David A. ; Immerman, Neil ; Straubing, Howard:


On Uniformity within NC1 . In: J. Comput. Syst. Sci. 41 (1990), Nr. 3,
274–306. http://dx.doi.org/10.1016/0022-0000(90)90022-D. – DOI
10.1016/0022–0000(90)90022–D

[MS23a] Merrill, William ; Sabharwal, Ashish: A Logic for Expressing Log-


Precision Transformers. In: Oh, Alice (Hrsg.) ; Naumann, Tristan (Hrsg.)
; Globerson, Amir (Hrsg.) ; Saenko, Kate (Hrsg.) ; Hardt, Moritz
(Hrsg.) ; Levine, Sergey (Hrsg.): Advances in Neural Information Process-
ing Systems 36: Annual Conference on Neural Information Processing Sys-
tems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023,
2023

[MS23b] Merrill, William ; Sabharwal, Ashish: The Parallelism Tradeoff: Lim-


itations of Log-Precision Transformers. In: Transactions of the Association
for Computational Linguistics 11 (2023), 06, 531-545. http://dx.doi.org/
10.1162/tacl_a_00562. – DOI 10.1162/tacl_a_00562. – ISSN 2307–387X

[MSS22] Merrill, William ; Sabharwal, Ashish ; Smith, Noah A.: Saturated


Transformers are Constant-Depth Threshold Circuits. In: Trans. Assoc.
Comput. Linguistics 10 (2022), 843–856. https://transacl.org/ojs/
index.php/tacl/article/view/3465

[Ope23] OpenAI: GPT-4 Technical Report. In: CoRR abs/2303.08774


(2023). http://dx.doi.org/10.48550/ARXIV.2303.08774. – DOI
10.48550/ARXIV.2303.08774

[PBM21] Pérez, Jorge ; Barceló, Pablo ; Marinkovic, Javier: Attention is


Turing-Complete. In: J. Mach. Learn. Res. 22 (2021), 75:1–75:35. http:
//jmlr.org/papers/v22/20-302.html

[PMB19] Pérez, Jorge ; Marinkovic, Javier ; Barceló, Pablo: On the Turing


Completeness of Modern Neural Network Architectures. In: 7th Interna-
tional Conference on Learning Representations, ICLR 2019, New Orleans,
LA, USA, May 6-9, 2019, OpenReview.net, 2019

[PW17] Press, Ofir ; Wolf, Lior: Using the Output Embedding to Improve Lan-
guage Models. In: Lapata, Mirella (Hrsg.) ; Blunsom, Phil (Hrsg.) ;
Koller, Alexander (Hrsg.): Proceedings of the 15th Conference of the Eu-
ropean Chapter of the Association for Computational Linguistics, EACL
2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, Associa-
tion for Computational Linguistics, 2017, 157–163

[RNSS18] Radford, Alec ; Narasimhan, Karthik ; Salimans, Tim ; Sutskever,


Ilya: Improving Language Understanding by Generative Pre-Training.
(2018)

[RWC+ 19] Radford, Alec ; Wu, Jeff ; Child, Rewon ; Luan, David ; Amodei,
Dario ; Sutskever, Ilya: Language Models are Unsupervised Multitask
Learners. (2019)

[Str23] Strobl, Lena: Average-Hard Attention Transformers are Constant-
Depth Uniform Threshold Circuits. In: CoRR abs/2308.03212
(2023). http://dx.doi.org/10.48550/ARXIV.2308.03212. – DOI
10.48550/ARXIV.2308.03212

[Vol99] Vollmer, Heribert: Introduction to Circuit Complexity - A Uniform Ap-


proach. Springer, 1999 (Texts in Theoretical Computer Science. An EATCS
Series). http://dx.doi.org/10.1007/978-3-662-03927-4. http://dx.
doi.org/10.1007/978-3-662-03927-4. – ISBN 978–3–540–64310–4

[VSP+ 17] Vaswani, Ashish ; Shazeer, Noam ; Parmar, Niki ; Uszkoreit, Jakob
; Jones, Llion ; Gomez, Aidan N. ; Kaiser, Lukasz ; Polosukhin, Illia:
Attention is All you Need. In: Guyon, Isabelle (Hrsg.) ; Luxburg, Ulrike
von (Hrsg.) ; Bengio, Samy (Hrsg.) ; Wallach, Hanna M. (Hrsg.) ; Fer-
gus, Rob (Hrsg.) ; Vishwanathan, S. V. N. (Hrsg.) ; Garnett, Roman
(Hrsg.): Advances in Neural Information Processing Systems 30: Annual
Conference on Neural Information Processing Systems 2017, December 4-
9, 2017, Long Beach, CA, USA, 2017, 5998–6008

