Kai Hallmann
Master’s Thesis
Hannover, 2024-04-26
2 Transformer Networks 6
2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Single Attention . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Multi-Head Attention . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Position-wise Feed-Forward Networks . . . . . . . . . . . . . . . 8
2.2.4 Encoder and Decoder . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.5 Embedding and Softmax . . . . . . . . . . . . . . . . . . . . . . 9
2.2.6 Positional Encoding . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.7 Further Developments in Transformer Network Design . . . . . 12
2.3 Mathematical Description . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Datatypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.2 Transformer Networks . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Types of Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.5 Language Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Circuit Complexity 17
3.1 Circuits and Circuit Families . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Circuit Complexity Classes . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Uniformity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4 Logic 21
4.1 First Order Logic with Counting . . . . . . . . . . . . . . . . . . . . . 21
4.2 First Order Logic with Majority . . . . . . . . . . . . . . . . . . . . . . 22
5.2 Universality of Saturated Transformer Networks . . . . . . . . . . . . . 29
5.3 Saturated Attention Transformer Networks are not in AC0 . . . . . . . 31
5.4 Saturated Attention Transformer Networks are in TC0 . . . . . . . . . 32
5.5 Fixed Precision Transformer Networks are in FOC[+; MOD] . . . . . . 35
5.6 log-Precision Transformer Networks are in Uniform TC0 . . . . . . . . 37
5.6.1 Circuit Serialization . . . . . . . . . . . . . . . . . . . . . . . . 39
5.6.2 Uniformity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.6.3 Transformer Network Precision and Space . . . . . . . . . . . . 41
5.6.4 p-Precision Transformer Network Definition . . . . . . . . . . . 42
5.6.5 log-Precision Transformer Networks as nonuniform Threshold Circuits . . . . 44
5.6.6 log-Precision Transformer Networks as Uniform Threshold Circuits 45
5.7 Lower Bounds for Instruction Following and Advice Transformers . . . 48
5.7.1 Circuit Value Problem . . . . . . . . . . . . . . . . . . . . . . . . 48
5.7.2 Instruction Following . . . . . . . . . . . . . . . . . . . . . . . . 51
5.7.3 Advice Transformers . . . . . . . . . . . . . . . . . . . . . . . . 52
6 Conclusion 53
6.1 Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6.2 Future Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
1 Introduction
Transformer networks are a type of deep neural network that was initially introduced
in 2017 for sequence modeling and transduction, such as language modeling and ma-
chine translation. They have been very successful in these tasks and have brought AI-
based tools such as OpenAI’s ChatGPT (“Chat Generative Pre-trained Transformer”)
into widespread use. What sets them apart from previously established deep neural
networks like long short-term memory and gated recurrent neural networks is that they
employ a mechanism called multi-head attention to be more parallelizable. This results
in faster training times compared to these previous network architectures [VSP+ 17].
In this work, we give an introduction to how transformer networks are constructed
and how they work. We then continue by viewing them from a more theoretical point of
view as recognizers of formal languages, in order to compare them to known complexity
classes and reach a better understanding of how powerful they are. In the process,
we present many results and give an overview of how they relate to each other. We
will also see that these results vary drastically based on some assumptions we make
in our theoretical model of transformer networks. However, with the most reasonable
assumptions, we will establish the circuit complexity class log-uniform TC0 as an upper
bound.
Such an upper bound helps answer the question of which problems transformer networks are fundamentally unable to solve, at least for sufficiently large inputs. Knowing this might help decide when to switch to a different approach, rather than spending more time and computational power on training a transformer network only to find out that it is still unable to perform the desired task. We will also consider whether transformer networks are limited in their capabilities precisely because of their high parallelizability.
2 Transformer Networks
We will begin by giving an overview of the components of transformer networks and
how they work, starting from the inside and working our way outward. Next, we will
give a precise mathematical description in section 2.3 that will be required in chapter 5.
In section 2.4, we will introduce different kinds of attention that we will later use for a
theoretical analysis of the complexity of transformer networks. Finally, we will define
how transformer networks can be used to recognize formal languages in section 2.5.
The contents of this chapter are largely based on the paper that introduced trans-
former networks [VSP+ 17], which we will refer to as their origin.
2.1 Notation
For n ∈ N, we will write [n] to mean the set of the first n natural numbers {m ∈ N | m < n} = {0, 1, . . . , n − 1}, [n]+m to mean the set {m, m + 1, . . . , m + n − 1}, and log(n) to mean the length of the shortest binary string representing n, so log(n) = ⌈log2(n + 1)⌉. We will also write a ≡m b iff a and b are congruent modulo m.
2.2 Basics
Transformer networks heavily rely on a mechanism called attention. Attention was
originally designed as an addition to recurrent neural networks for tasks like machine
translation. Those recurrent neural networks usually consist of an encoder and a de-
coder. The encoder reads in a sequence token by token and modifies its internal hidden
state at each step based on the token. Then the decoder uses the final hidden state to
generate the output tokens.
The problem with this approach is that all information contained in the input se-
quence has to be compressed into the hidden state, which is a vector of fixed length.
As a result, this kind of network can have problems dealing with long input sequences.
The attention mechanism allows the decoder to look back at the input tokens that con-
tain the information most relevant to the token that it is about to generate at each
step. In other words, the decoder decides which parts of the input sequence to pay at-
tention to. Because of this, the encoder no longer has to encode all information into a
fixed-length vector [BCB15, sections 1-3].
As attention mechanisms allow modeling of dependencies regardless of distance
within the input or output sequences, they have become an integral part of recurrent
neural networks for tasks such as sequence modeling. Transformer networks, however,
forgo the recurrence and instead rely solely on attention to draw global dependencies
between input and output. Therefore, they lend themselves more to parallelization.
We will begin by describing the attention mechanism they use.
2.2.2 Multi-Head Attention
Instead of directly performing single attention on the dmodel -dimensional vectors that
are used by the transformer network internally, they are first split up into shorter
vectors. Then the single attention is performed on these shorter vectors in so-called
attention heads, which can be done in parallel. Finally, the results are put together
into a dmodel -dimensional vector.
Splitting up the vectors is done using learned linear projections. For each of the h
heads, these projections are stored as a dmodel ×dk matrix WiQ , a dmodel ×dk matrix WiK ,
and a dmodel × dv matrix WiV . The output is assembled by concatenating the outputs
of the individual heads and performing another learned linear projection. This one is
stored in the (h · dv) × dmodel matrix W^O. Putting it all together, we get:

MultiHead(Q, K, V) = Concat(head_0, . . . , head_{h−1}) · W^O,  where head_i = Attention(Q · W_i^Q, K · W_i^K, V · W_i^V).

The values for the dimensions that were originally used are dmodel = 512, h = 8, and dk = dv = dmodel/h = 64.
Splitting the attention between multiple heads has the benefit that each one of them
can evaluate different aspects of the queries or view them in a different context.
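To make this concrete, the following is a minimal NumPy sketch of scaled dot-product attention and multi-head attention with the original dimensions. The weight matrices here are random placeholders rather than learned parameters, and the function names are ours, not part of the original description.

import numpy as np

d_model, h = 512, 8
d_k = d_v = d_model // h           # 64, as in the original paper

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    # project into h heads, attend in each head, concatenate, project back
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
n = 10                              # toy sequence length
X = rng.normal(size=(n, d_model))
W_Q = rng.normal(size=(h, d_model, d_k))
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_v))
W_O = rng.normal(size=(h * d_v, d_model))
out = multi_head_attention(X, X, X, W_Q, W_K, W_V, W_O)   # self-attention
assert out.shape == (n, d_model)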
2.2.3 Position-wise Feed-Forward Networks

In addition to the attention sub-layers, each layer of the encoder and decoder contains a fully connected feed-forward network:

FFN(x) = max(0, x · W1 + b1 ) · W2 + b2
The input and the output have dimension dmodel , while the inner layer has dimension
dff = 2048.
Since the input and output dimension is only the size of one position, the feed-forward networks are said to be position-wise. Any interaction between different positions takes place only by means of attention.
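A correspondingly small sketch of the position-wise feed-forward network, again with random placeholder weights instead of learned parameters:

import numpy as np

d_model, d_ff = 512, 2048

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position independently
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(10, d_model))   # 10 positions
assert ffn(x, W1, b1, W2, b2).shape == x.shape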
2.2.4 Encoder and Decoder

A transformer network consists of an encoder and a decoder. The encoder maps the input sequence to a sequence of continuous representations z, from which the decoder generates the output sequence (y0 , . . . , yn−1 ) one element at a time. Transformer networks are autoregressive, meaning that they take the previously generated symbols as an additional
input when generating the next symbol.
The encoder is composed of a stack of L = 6 identical layers. Each layer consists of
two sub-layers. The first one performs multi-head self-attention. This means that it
uses the input as the Q, K, and V matrices. The second sub-layer is a simple, position-
wise fully connected feed-forward network. The learned parameters of this network
differ from layer to layer, whereas within one layer the same linear transformations are applied at every position. Additionally, each sub-layer contains a residual connection and
a layer normalization. That is, the ingoing values are added to the outgoing values and
the layers are normalized to be in a desired range, which introduces further learned
parameters. All sub-layers in the model produce vectors of size dmodel .
The decoder is also composed of a stack of L = 6 identical layers. It consists of
the same two sub-layers as the encoder, along with a third one in between them. In
this layer, it once again performs multi-head attention. This time, however, the input
to this sub-layer is only used as the Q matrix, while the output z of the encoder is
used as the K and V matrix. The self-attention sub-layer is also slightly modified to
prevent positions from attending to subsequent positions. This is done by setting the
corresponding values in the matrices to −∞ before applying the softmax function. Just
like the encoder, each sub-layer once again contains a residual connection and a layer
normalization.
A diagram of the structure of the encoder and decoder of a transformer network is
shown in Figure 2.1.
Figure 2.1: An encoder layer (left), a decoder layer (middle), and the complete encoder-decoder structure (right)
2.2.6 Positional Encoding

Since transformer networks contain no recurrence, information about the position of each token is added to the embeddings in the form of positional encodings. These are defined using sine and cosine functions of different frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i/dmodel))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel))
They choose this encoding because it might allow the model to easily learn to attend
to relative positions, since for any fixed offset k, PEpos+k can be represented as a linear
function of PEpos . This might allow the model to extrapolate to sequence lengths longer
than those encountered during training.
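A minimal NumPy sketch of this sinusoidal positional encoding (the function and variable names are illustrative):

import numpy as np

def positional_encoding(n, d_model=512):
    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    pos = np.arange(n)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

assert positional_encoding(50).shape == (50, 512)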
A diagram of a full transformer network can be seen in Figure 2.2.
Figure 2.2: A full transformer network: input and output embeddings with positional encodings, L stacked layers containing (masked) multi-head attention, feed-forward networks and Add & Norm sub-layers, followed by a final linear layer and softmax producing the output probabilities
2.2.7 Further Developments in Transformer Network Design
In later transformer networks like Google’s BERT and OpenAI’s GPT, the structure
has been further simplified, getting rid of the encoder-decoder structure in favor of
simply a sequence of transformer blocks. Each transformer block is just a layer of the
original transformer network’s encoder or decoder [DCLT19, section 3; RNSS18, sub-
sections 3.1, 4.1].
Another trend that can be observed is that the number L of transformer blocks/layers
of different transformer networks is steadily increasing, as can be seen in Table 2.1.
This is done to allow the networks to store ever more information about the larger and
larger datasets they are trained on [Ope23].
As most transformer networks are a sequence of transformer blocks together with a
more task-specific transformation of the input and output, such as embedding and po-
sitional encoding, we will focus on this structure and provide a mathematical description that allows a theoretical analysis in section 2.3.
2.3 Mathematical Description
Section 2.3, section 2.4, and section 2.5 are based on [MSS22, section 3].
2.3.1 Datatypes
We model all internal data of a transformer network as binary strings. In order to be
able to perform calculations with those binary strings, we need a semantics to interpret
them as numbers. As is often done in circuit complexity, we will first interpret them
as unsigned integers.
That is, a fixed-point number x ∈ FPr,s gets interpreted as an integer using two’s complement, and that integer is divided by 2^s. Since r and s are fixed, we normally just write FP in place of FPr,s .
Note that division of fixed-point numbers will usually involve rounding of the result
and that various over- and underflows can occur which we leave undefined.
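As a small illustration, the following sketch decodes this fixed-point semantics, assuming that a value of FPr,s is a bit string of length r + s (the helper name is ours):

from fractions import Fraction

def decode_fixed_point(x: str, s: int) -> Fraction:
    # interpret the bit string x as a two's-complement integer ...
    n = int(x, 2)
    if x[0] == "1":                  # leading 1 means negative in two's complement
        n -= 1 << len(x)
    # ... and divide by 2**s to place the binary point
    return Fraction(n, 1 << s)

# with s = 2 fractional bits, the 4-bit string "1101" encodes -3/4
assert decode_fixed_point("1101", 2) == Fraction(-3, 4)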
Next, we define how we encode rational numbers without a fixed size.
A rational number r is encoded by a sign bit s together with binary strings p and q for the numerator and denominator, with semantics

⟦r⟧Q := (−1)^s · ⟦p⟧N / ⟦q⟧N.
We once again denote the standard operations by +Q and ·Q and define them to
always reduce their result as far as possible.
Finally, we will define floats F as the subset of the rationals Q where the denominator
is constrained to be a power of 2. This resembles the way numbers are usually encoded
in most computers more closely than the rational numbers. Addition and multiplication
are defined the same way as for Q. Note, however, that the multiplicative inverse of
a float is not necessarily another float and may have to be approximated. Because of
this, it may be that ⟦(x /F y) ·F y⟧F ≠ ⟦x⟧F .
Going forward, we will usually omit datatype subscripts where they are clear from
context. We will sometimes write D as a generic datatype or other datatypes in function
signatures to mean [2]∗ while making the intent clearer and hinting at the semantics.
We imagine the tuple hp, qi to be encoded by padding p, q to the same length using
leading 0s and interleaving their bits. This means that the size of a rational number or
a float is 2 · max(|p|, |q|) + 1. We will use the following property to limit the internal
functions in our transformer network model:
Definition 2.4. We say a function f : [2]∗ → [2]∗ is size-preserving iff there exist
constants c, n such that for all inputs x with |x| ≥ n, the size of the function value is
bounded by |f (x)| ≤ c · |x|. Let P be the set of size-preserving functions.
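A short sketch of this pair encoding and the resulting size; the treatment of the leading bit as a sign bit (accounting for the "+ 1") is an assumption beyond what is described above:

def encode_pair(p: str, q: str, sign: str = "0") -> str:
    # pad p and q to equal length with leading 0s and interleave their bits;
    # the leading sign bit is assumed to account for the "+ 1" in the size bound
    width = max(len(p), len(q))
    p, q = p.zfill(width), q.zfill(width)
    return sign + "".join(a + b for a, b in zip(p, q))

enc = encode_pair("101", "11")            # p = 5, q = 3
assert len(enc) == 2 * max(3, 2) + 1      # size 2 * max(|p|, |q|) + 1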
2.3.2 Transformer Networks
On an input string w ∈ Σn, a transformer network computes L layers of output sequences vℓ,0 , . . . , vℓ,n−1 for ℓ ∈ [L]+1, where each vℓ,i ∈ Dm. First, each token wi and its
position i are embedded into a value v0,i . Subsequently, each layer ℓ aggregates infor-
mation from the previous value sequence vℓ using a multi-head attention mechanism
and outputs a new value sequence vℓ+1 . The layers are structured as follows:
2. Attention head: Each of the H attention heads in layer ℓ maps the full previous sequence to a new value via sℓ,h and then applies the attention function α:

bℓ,h,i = Σ_{j∈[n]} α(aℓ,h,i,0 , . . . , aℓ,h,i,n−1)_j · vℓ,j ,  where aℓ,h,i,j = sℓ,h(vℓ,i , vℓ,j).
It is important to note that the semantics for addition and multiplication as well
as the computation of α come from D.
3. Activation block:

vℓ+1,i = fℓ(vℓ,i , (bℓ,0,i , . . . , bℓ,H−1,i)).
This model combines the aggregation of the attention heads, the feed-forward
network, the residual connections and the layer normalizations into the single
function fℓ . A benefit of this is that it generalizes to changes in the layout like
for example moving the layer normalization around, which is one of the changes
between GPT-1 and GPT-2 [RWC+ 19, subsection 2.3].
2.4 Types of Attention

Soft attention, the kind used in practice, uses the softmax function as the attention function α:

softmax(a)i = e^{a_i} / Σ_{k∈[n]} e^{a_k}.
We will also examine the simpler hard and saturated attention when evaluating the
complexity of transformer networks in chapter 5. In order to define these, we first
define the function M : Dn → P([n]) that maps a vector of values to the set of indices of maximum values:

M(a) := {i ∈ [n] | aj ≤ ai for all j ∈ [n]}.
Using this, we define hard attention (also known as unique hard attention [HAF22,
subsection 4.2]) to work the same as soft attention but use the hardmax function
instead of the softmax function:
hardmax(a)i := 1 if i = min(M(a)), and 0 otherwise.
Finally, we define weak saturated attention such that each attention head either uses
the hardmax function or the uniformmax function:
uniformmax(a)i := 1/n.

(Strong) saturated attention instead averages over all maximal positions: saturatedmax(a)i := 1/|M(a)| if i ∈ M(a), and 0 otherwise.
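A small NumPy sketch of these attention weightings (purely illustrative; questions of datatype and precision are ignored here):

import numpy as np

def maximal_indices(a):
    # M(a): the set of indices attaining the maximum value
    return np.flatnonzero(a == np.max(a))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def hardmax(a):
    out = np.zeros(len(a))
    out[maximal_indices(a).min()] = 1     # all weight on the leftmost maximum
    return out

def uniformmax(a):
    return np.full(len(a), 1 / len(a))    # uniform weight on every position

def saturatedmax(a):
    m = maximal_indices(a)
    out = np.zeros(len(a))
    out[m] = 1 / len(m)                   # average over all maximal positions
    return out

a = np.array([1.0, 3.0, 3.0, 0.0])
# hardmax puts all weight on index 1; saturatedmax splits it over indices 1 and 2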
2.5 Language Recognition

Definition 2.6. Let vℓ,i (w) denote the value of vℓ,i on input string w. A transformer
network recognizes a formal language L ⊆ Σ∗ iff there exists a D-valued affine trans-
formation W, b such that for all w ∈ Σ∗ the following holds:
3 Circuit Complexity
The definitions in this chapter are taken and partially adapted from [Vol99, pp. 7-10,
46-47, 108, 126].
3.1 Circuits and Circuit Families

Definition 3.2. A basis is a finite set of Boolean functions and families of Boolean functions. The standard unbounded fan-in basis B1 := {¬, (∧n)n∈N, (∨n)n∈N} contains the unary Boolean NOT function and the families of Boolean AND and OR functions.
Definition 3.3. The n-ary Boolean majority function checks if at least half of the input bits are 1:

MAJn : [2]n → [2], (x0 , . . . , xn−1) ↦ 1 if |{i ∈ [n] | xi = 1}| ≥ n/2, and 0 otherwise.
3. For every i ∈ [n], there exists at most one node v ∈ V such that β(v) = xi .
4. For every i ∈ [m], there exists exactly one node v ∈ V such that ω(v) = yi .
Definition 3.5. Let C = (V, E, α, β, ω) be a circuit over B with n inputs and m
outputs. First, we inductively define a function valv : [2]∗ → [2] for every v ∈ V as
follows: Let a0 , . . . , an−1 be arbitrary values.
2. Let v ∈ V have fan-in k > 0 and let v0 , . . . , vk−1 be the gates that are predecessors of v, ordered in such a way that α((v0 , v)) < · · · < α((vk−1 , v)). Let β(v) = f ∈ B. If f is a k-ary function, then let

valv (a0 , . . . , an−1 ) := f (valv0 (a0 , . . . , an−1 ), . . . , valvk−1 (a0 , . . . , an−1 )).

If f is a family of Boolean functions, then let

valv (a0 , . . . , an−1 ) := f^k (valv0 (a0 , . . . , an−1 ), . . . , valvk−1 (a0 , . . . , an−1 )).
For i ∈ [m], let vi be the unique gate vi ∈ V with ω(vi ) = yi . Then the function
computed by C, fC : [2]n → [2]m , is given for all a0 , . . . , an−1 ∈ [2] by
fC (a0 , . . . , an−1 ) := (valv0 (a0 , . . . , an−1 ), . . . , valvm−1 (a0 , . . . , an−1 )).
We write f = (f n )n∈N and C = (Cn )n∈N . We say that C accepts A ⊆ [2]∗ iff C
computes cA . In this context, we also use the notation A = (An )n∈N (where An :=
A ∩ [2]n ) and cA = (cAn )n∈N . If C is a circuit family, we use the notation fC for the
function computed by C.
3.2 Circuit Complexity Classes
Circuit complexity classes are obtained by bounding the size and depth of the circuits in a family:
Definition 3.8. Let B be a basis and let s, d : N → N. The class SIZE-DEPTHB (s, d)
contains all sets A ⊆ [2]∗ for which there exists a circuit family C over basis B of size
O(s) and depth O(d) that accepts A.
The class of polynomial size constant depth AND/OR circuits is defined as follows:
Definition 3.9. AC0 := SIZE-DEPTHB1(n^{O(1)}, 1)
The proof of this is beyond the scope of this work. As a result of this, it can be
shown that, for example, integer multiplication is not included in AC0 .
3.3 Uniformity
One problem with circuit families is that they are infinite objects. The circuits within
a given circuit family can be completely different from each other. One of the conse-
quences of this is that undecidable languages, like ∪_{bin(n)∈K} [2]^n,
where K is the special halting problem, can be accepted by a circuit family where each
circuit just outputs a constant 0 or 1.
Algorithms, however, are finite. A program written in any programming language
has a finite text. A Turing machine has a finite number of states and therefore a
finite transition function. A random access machine (RAM) has a finite number of
instructions. Therefore, a uniform circuit family should be a circuit family with a finite
description.
As a finite description of a circuit family C = (Cn )n∈N , we introduce a function fC ,
with 1^n ↦ ⟨Cn⟩, that is easily computable. In particular:
4 Logic
In this chapter, we introduce two extensions of first order logic that are used to define
formal languages. They use variables indexing the symbols of a given word to describe
characteristics that define the language. We will use these logics in chapter 5 to bound
the complexity of transformer networks.
4.1 First Order Logic with Counting

• t0 = t1 , t0 < t1 where t0 and t1 are terms, which follow the conventional semantics.
• φ0 ∧φ1 , φ0 ∨φ1 , ¬φ0 where φ0 and φ1 are formulae, which follow the conventional
semantics.
• ∃x.φ, ∀x.φ where x is a count variable and φ is a formula, which follow the
conventional semantics.
We will use the following abbreviations:
• φ → ψ := ¬φ ∨ ψ
• φ ↔ ψ := (φ → ψ) ∧ (ψ → φ)
We call any variable that is not bound by a quantifier free, and a formula with no
free variables a sentence. For a sentence σ and a string w ∈ Σ∗ , we write w |= σ iff w
makes σ true.
The part of the logic that deals with position variables is like monadic first-order logic,
in which all predicates are monadic (that is, they take only one argument). The other
part of the logic that deals with count variables is the theory of rational numbers with
ordering and addition (but not multiplication). Both of these other logics have useful
normal forms: Monadic first-order logic has a normal form that uses only one variable,
while the theory of rationals with ordering and addition has quantifier elimination. We
can combine these two results to get a very simple normal form for FOC[+; MOD].
Theorem 4.2. Every formula φ of FOC[+; MOD] is equivalent to a formula of the form
φ′ = ∃x0 . · · · ∃xk−1 . ((⋀_i ∃^{=xi} p.ψi) ∧ χ)
where each ψi is quantifier-free and has no free count variables and χ is quantifier-free.
It may seem odd that count variables range over rational numbers, when counts
are always integers. This technicality simplifies the normal form: If we had used
integers, then the part of the logic that deals with count variables would be Presburger
arithmetic, and the normal form would require allowing MOD^r_m(x) on count variables
as well.
4.2 First Order Logic with Majority
We will describe the syntax of FO(M), given a fixed (finite) alphabet Σ, and its
intended interpretation with reference to (finite) strings w ∈ Σn for some n ∈ N. A
term is
• a constant 0, 1, or n − 1,
• bit(t0 , t1 ) where t0 and t1 are terms, which is true iff the t1 -th bit of the binary
expansion of the value of t0 is a 1.
• ∃i.φ, ∀i.φ where i is an index variable and φ is a formula, which follow the
conventional semantics.
Just like with FOC[+; MOD], we call a formula with no free variables a sentence.
For a sentence σ and a string w ∈ Σ∗ , we write w |= σ iff w makes σ true.
Beyond this, FO(M) can express counting and threshold quantifiers in terms of ma-
jority quantifiers. Given a formula φ, a counting quantifier ∃k creates a new formula
∃k i.φ that is true iff φ is true across exactly k values of i. Threshold quantifiers ∃≤k
and ∃≥k work similarly but check if φ is true for at least or at most k values of i. In
addition, FO(M) can express conditional majority quantifiers, also written M, which
create a formula Mi.φ[ψ] that is true iff ψ is true for at least half the values of i that
make φ true ( [MS23a, subsection 2.2]).
5 Complexity Results for Transformers
In this chapter, we will present and prove various complexity results involving trans-
former networks. For a slightly simplified overview of the majority of the results, see
Table 5.1.
                                      fixed                    logarithmic              unbounded
Hard Attention                        ⊆ AC0                    ⊆ AC0                    ⊆ AC0
Saturated Attention (Weak/Strong)     ⊆ UL -TC0                ⊈ AC0, ⊆ TC0,            Q: = ALL,
                                                               F: ⊆ UL -TC0             = ALL
Soft Attention                        ⊆ FOC[+; MOD],           F: ⊆ UL -TC0             = ALL
                                      ⊉ UL -TC0, ⊊ UL -TC0     = FO(M)

Table 5.1: Results shown in this section (UL is short for log-uniform here)
• For the representation of activation values, we use an arbitrary set A rather than
Dm .
• There is a new pooling function p : A∗ × R∗ → A which combines the atten-
tion function α and the weighted sum of the values. For the case of unique hard attention, pUHA is defined as follows: on inputs (v0 , v1 , . . . , vn−1 ) ∈ An and (a0 , a1 , . . . , an−1 ) ∈ Rn, let j ∈ [n] be the smallest index that maximizes aj. Then pUHA((v0 , v1 , . . . , vn−1 ), (a0 , a1 , . . . , an−1 )) = vj.
• There is a new model output function g : A → [2]. The output of the generalized transformer network T(x) is computed by applying this function to the last value of the last layer, T(x) := g(vL,n−1).
Definition 5.1. A GUHAT with L layers and h heads is in informative normal form
iff the following conditions are satisfied:
• For each layer ℓ ∈ [L]+1 , the activation values are (H + 1)-tuples of activation values at layer ℓ − 1, and the activation function is defined by fℓ(v, (b0 , . . . , bH−1 )) := (v, b0 , . . . , bH−1 ).
• For each layer ℓ ∈ [L]+1 and attention head h ∈ [H], the scoring function sℓ,h
returns an integer in [N ], where N is the total number of possible ordered pairs
of activation values at layer ℓ − 1.
Lemma 5.2. For any transformer network T ∈ GUHAT, there exists a transformer
network T̂ ∈ GUHAT in informative normal form such that L(T ) = L(T̂ ). Moreover,
T̂ has the same number of layers and heads as T .
Proof. Let T be a GUHAT with L layers and H heads, with input alphabet Σ, input
function φ, scoring functions sℓ,h , activation functions fℓ , and output function g. We
describe how to construct functions for an equivalent transformer network T̂ in GUHAT
in informative normal form, which also has L layers and H heads. We assume that n
is the input length.
For T̂ , the input function φ̂(σ, i, n) is defined to return the triple (σ, i, n). Note that
there are at most |Σ| · n possible initial activation values. We also define a function
t0 that translates initial activation values for T̂ into initial activation values for T by
t0 (σ, i, n) = φ(σ, i, n).
Now, we perform induction on the layers of T and T̂ . Assume that we have defined
scoring and activation functions for T̂ for layers before ℓ (where the initial activation
values are treated as layer 0), and a translation function tℓ−1 that translates all pos-
sible activation values for T̂ from the previous layer into activation values for T from
the previous layer. To define the scoring function for T̂ for layer ℓ and head h, we
enumerate all the possible pairs v̂i and v̂j of activation values of T̂ at layer ℓ − 1, and
determine the corresponding attention values of T , which we denote by yℓ,h (v̂i , v̂j ) =
sℓ,h (tℓ−1 (v̂i ), tℓ−1 (v̂j )). We make a list of all the distinct resulting values and sort them
in increasing order. Then we define ŝℓ,h (v̂i , v̂j ) to be the index of yℓ,h (v̂i , v̂j ) in this
sorted list. The activation function for T̂ for layer ℓ is, as required by the informative normal form, the one that collects its arguments into a tuple. The translation function for layer ℓ is defined by

tℓ (v, (b0 , b1 , . . . , bH−1 )) = fℓ (tℓ−1 (v), (tℓ−1 (b0 ), tℓ−1 (b1 ), . . . , tℓ−1 (bH−1 ))),

that is, we translate each of the component activation values using tℓ−1 and then apply the activation function of T.
Finally, the output function for T̂ is defined by ĝ(v̂) = g(tL (v̂)), that is, we translate
the layer L activation value v̂ of T̂ to the layer L activation value of T , and apply the
output function of T .
By construction, T̂ is in informative normal form, and it has L layers and H heads.
It is not difficult to see that for any input w, the translations tk (v̂) of the activation
values v̂ of T̂ are equal to the corresponding activation values of T , and the outputs
T̂ (w) = T (w) are equal as well. Thus L(T̂ ) = L(T ).
5.1.3 From GUHAT to Circuits
In this subsection, we show that for every language L ∈ GUHAT, we can construct a
family of Boolean circuits of constant depth and polynomial size that also recognizes
L. The key step of the proof is to bound the number of bits needed to represent
scoring and activation values for an input sequence of length n by O(log(n)), where
the suppressed constants depend on L and H.
Lemma 5.3. Let T be a GUHAT in informative normal form with L layers and H
heads, and alphabet Σ. Let s = log(|Σ| + 1). Then for any input of length n and any
ℓ ∈ [L], the activation values at layer ℓ can be represented by (H + 1)^ℓ · (2 · log(n) + s) bits, and for ℓ ∈ [L]+1 , the attention scores at layer ℓ can be represented by 2 · (H + 1)^{ℓ−1} · (2 · log(n) + s) bits.
Proof. For an input sequence of length n, the initial activation values are (σ, i, n),
where σ ∈ Σ ∪ {$} and i ∈ [n]. This can be represented by a string of 2 · log(n) + s
bits. At each successive layer, the activation values are a tuple of (H + 1) values from
the previous layer, which multiplies the number of bits required to represent them by
(H + 1). Also, the range of scoring values is bounded by the number of ordered pairs
of activation values at the previous layer, so scoring values can be represented by twice
the number of bits to represent an activation value at the previous layer.
It is worth observing that the bounds provided by Lemma 5.3 do not hold in the
case of saturated attention because activation values may be the result of the average
of an arbitrary subset of the possible input, which means that there are exponentially
more possible activation values at each layer.
The following elementary facts about Boolean circuits will be useful:
Lemma 5.4. An arbitrary Boolean function f : [2]n → [2]m of n inputs and m outputs can be computed by a depth 3 circuit of size at most 2^n + n + m.

Corollary 5.5. If a Boolean function f has at most c · log(n) inputs and at most d · log(n) outputs, then it may be computed by a Boolean circuit of depth 3 and size at most n^c + c · log(n) + d · log(n).
We now have the necessary tools to prove the following theorem:

Theorem 5.6. Every language in GUHAT is in AC0.

Proof. Let L be a language over Σ that is in GUHAT. By Lemma 5.2, we may assume
that L is recognized by a GUHAT transformer network T in informative normal form.
Assume T has L layers and H heads.
What we describe below is a family of circuits to recognize the end-marked language
L$, which can easily be converted to a family of circuits that recognizes L by hard-
wiring the representation of the end-of-sequence symbol $ at the end of the input string
using constant gates. Let s = log(|Σ| + 1) and let h be any binary symbol encoding
for Σ ∪ {$}. We construct a family of Boolean circuits (Cs,n )n∈N of constant depth
and polynomial size such that for all positive integers n and all w ∈ Σ^{n−1}, w ∈ L iff Cs,n(h(w$)) = 1.
With the O(log(n)) bound from Lemma 5.3 on the number of bits needed to represent activation and scoring values, Corollary 5.5 yields circuits of constant depth and size polynomial in n for the input, scoring, activation, and output functions. Additional circuitry is necessary to
implement the comparison of attention scores and selection of the activation value to
attend to for each position, layer, and head.
We construct the overall circuit Cs,n according to the layers of T , starting with the
input function. Let the inputs to T be wi for i ∈ [n]. The inputs to Cs,n are wi,j for
i ∈ [n] and j ∈ [s], where wi,j are the bits of h(wi ), representing the binary encoding
of input symbol wi. At layer 0 for position i, the value of v_i^(0) = φ(wi , i, n) = (wi , i, n) is achieved using the input wires wi,j for j ∈ [s], followed by a sequence of constants 0 or 1 representing the binary representations of i and n, for a total of 2 · log(n) + s wires representing the value (wi , i, n).
Performing induction on layers, we assume that for some ℓ ∈ [L]+1 the circuit Cs,n has been constructed to contain the wires representing all the activation values v_i^(ℓ−1) for i ∈ [n] at layer ℓ − 1. The portion of the circuit computing the representation of activation values at layer ℓ is described as follows: Fix a position i ∈ [n] and a head h ∈ [H]. For each j ∈ [n], there is a circuit Aℓ,h,i,j that has as input the wires for the activation values v_i^(ℓ−1) and v_j^(ℓ−1) and as output wires representing the natural number attention score aℓ,h,i,j in binary. Each of these circuits Aℓ,h,i,j has 2 · (H + 1)^{ℓ−1} · (2 · log(n) + s) inputs and outputs by Lemma 5.3, and therefore can be computed using depth 3 and size polynomial in n, by Corollary 5.5. All H · n^2 such circuits for layer ℓ operate in parallel, for overall depth 3 and size polynomial in n.
We next describe the circuit that implements the pooling function f UHA . For each
pair j, j ′ ∈ [n], there is a circuit Dℓ,h,i,j,j ′ whose inputs are the outputs of Aℓ,h,i,j and
Aℓ,h,i,j ′ and whose output is a single wire gℓ,h,i,j,j ′ with a value of 1 if aℓ,h,i,j ≥ aℓ,h,i,j ′ and
0 otherwise. Because of the bounds on the number of inputs and outputs, each of these
circuits can have depth 3 and size polynomial in n by Corollary 5.5. These n^2 circuits
all compute in parallel. Then for each position j, whether j maximizes aℓ,h,i,j can be
computed by an AND gate whose inputs are gℓ,h,i,j,j ′ for all j ′ ∈ [n]. Let the output of
this AND gate be denoted mℓ,h,i,j . Then mℓ,h,i,j = 1 iff the position j maximizes aℓ,h,i,j .
This increases the depth by 1.
For each j, an indicator zℓ,h,i,j is computed by an AND gate whose inputs are mℓ,h,i,j
and ¬(mℓ,h,i,j ′ ) for all j ′ < j. Thus, zℓ,h,i,j = 1 iff j is the leftmost position that
maximizes aℓ,h,i,j . This increases the depth by 2.
Finally, these indicator values are used to combine the layer ℓ − 1 activation values
in a selection circuit, yielding the representation of the activation value bℓ,h,i = v_j^(ℓ−1) such that zℓ,h,i,j = 1. In general, such a selection circuit takes as input t selector bits z0 , z1 , . . . , zt−1 , where exactly one zj = 1, and t input values w0 , w1 , . . . , wt−1 , where each wr consists of S bits. It outputs S bits representing the selected wj (for which zj = 1). Letting wr,s denote bit s of wr, the computation can be described as vr,s = wr,s ∧ zr for r ∈ [t] and s ∈ [S], which can be computed by one layer of t · S AND gates in parallel. Then the bits of the output are us = ⋁_{r∈[t]} vr,s for s ∈ [S], which can be computed by one layer of S OR gates in parallel. Thus, the selection circuit adds 2
to the depth, and a polynomial in n to the size.
Because each activation function for a GUHAT in informative normal form simply
returns its argument, no further computation is needed for the activation values. The
representation of the activation value v_i^(ℓ) is just the sequence of wires representing v_i^(ℓ−1), followed by those representing bℓ,0,i through bℓ,H−1,i .
To produce the output of the circuit, we note that the representation of v_{n−1}^(L) has O(log(n)) bits and the output of g is a single bit, so g can be implemented by a Boolean
circuit of constant depth and size polynomial in n, by Corollary 5.5. This concludes
the proof.
5.2 Universality of Saturated Transformer Networks
Theorem 5.7. AHAT(Q) = ALL = P([2]∗ ).
Proof. Let L ∈ ALL be any formal language over the alphabet Σ = [2]. We construct a
rational-valued saturated transformer network with 1 layer and 1 head to recognize L.
We will omit ℓ and h subscripts. Let pi denote the i-th prime number. The embedding
layer encodes the position i of each token wi ∈ Σ according to
φ(wi , i) := wi / pi .
Since pi ∼ i · log(i) for large i by the prime number theorem [Gol73, p. 599], the number of bits needed to represent the denominator of φ(wi , i) is bounded by c · log(i) for some constant factor c. As i has size log(i), this implies that φ is size-preserving.
Now we define a single uniform attention head that sums all vi , outputting
Σ_{i∈[n]} vi = Σ_{i∈[n]} φ(wi , i) = Σ_{i : wi = 1} 1/pi .

The denominator q of this sum is the product Π_{i : wi = 1} pi , which can be shown by induction
over the number of prime numbers that are multiplied. Note that wi = 1 iff pi divides
q. Thus, we can define a function g that extracts the input sequence w from q by
checking for each i whether pi divides q. We let

cL(w) := 1 if w ∈ L, and 0 otherwise,

so that g, after extracting w from q, outputs cL(w) and the transformer network recognizes L.
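As a toy illustration of this construction, the following sketch uses exact rational arithmetic to pool the embeddings φ(wi , i) = wi/pi and to recover w from the denominator of the result (the helper names are ours):

from fractions import Fraction

def primes(n):
    # the first n prime numbers, by trial division (fine for a toy example)
    ps = []
    candidate = 2
    while len(ps) < n:
        if all(candidate % p for p in ps):
            ps.append(candidate)
        candidate += 1
    return ps

def pool(w):
    # the single uniform attention head sums phi(w_i, i) = w_i / p_i over all positions
    return sum((Fraction(b, p) for b, p in zip(w, primes(len(w)))), Fraction(0))

def extract(q, n):
    # recover w from the denominator q: w_i = 1 iff p_i divides q
    return [1 if q % p == 0 else 0 for p in primes(n)]

w = [1, 0, 1, 1, 0]
s = pool(w)
assert extract(s.denominator, len(w)) == w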
Similar to this, other authors have used arbitrary precision for storing and manipu-
lating positional encodings to show their model of transformer networks to be Turing
complete [PMB19, PBM21].
Both the unnatural construction that encodes positions using prime numbers and the
result that they can decide any formal language strongly indicate that these restrictions
on our model of transformer networks are not yet sufficient to mimic reality. Because
of this, we switch the datatype from rationals to floats. In the following section, we
will see that doing this allows us to bound the capabilities of saturated transformer
networks in TC0 .
But before we do that, we will quickly show that using floats alone is not enough
and size-preservation is also needed to be able to find non-trivial upper bounds. The
unbounded prefix added to AHAT(D) simply means that we no longer require the
internal functions to be size-preserving.
Corollary 5.8. unbounded-AHAT(D) = ALL.

Proof. The proof works exactly the same as the proof for Theorem 5.7, with the excep-
tion that if D = F the prime numbers pi need to be replaced with distinct powers of
2. This needs to be done because floats can only have powers of 2 in the denominator.
The removal of the size bound allows us to use the powers of 2 that grow in size linearly
instead of logarithmically. We can once again define a function g that extracts the in-
put sequence by just looking at which bits are set to 1. This completes the proof.
Corollary 5.9. unbounded-SAT(D) = ALL.

Proof. The proof works exactly the same as the proof for Corollary 5.8. Soft attention
transformers with sufficient (or, in this case, unbounded) precision can implement
uniform attention by setting all queries and keys to be constant [MS23a, p. 5].
5.3 Saturated Attention Transformer Networks are not in AC0

Theorem 5.10. AHAT(F) ⊈ AC0.

Proof. We will construct a single layer transformer network with a single attention
head to recognize MAJ, omitting the ℓ and h subscripts. Let the embedding function
φ(wi , i) := (1 − wi , wi ), which is (1, 0) if wi = 0 and (0, 1) if wi = 1,
be a 1-hot encoding of wi . Set the scoring function s(xi , xj ) := 1 for all inputs xi , xj ∈
F^2, resulting in the attention head attending everywhere and computing, for every i ∈ [n], bi = (|w|0/n, |w|1/n). Finally, set

f(vi , bi) := 1 if bi,1 > bi,0 , and 0 otherwise, where bi = (bi,0 , bi,1 ).
Thus, the output of the transformer network is all 1s if |w|1 > |w|0 , that is, if w ∈ MAJ, and all 0s otherwise. We have shown MAJ ∈ AHAT(F), and, with MAJ ∉ AC0, it directly follows that AHAT(F) ⊈ AC0.
Note that this construction is not just possible in our generalized transformer network
model, but can also be implemented in transformer networks that are actually in use,
like the original one described in section 2.2. It has also been shown empirically that
single layer transformer networks can learn to recognize the majority language.
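The construction is simple enough to restate as a short sketch, here with exact fractions standing in for floats (names are illustrative):

from fractions import Fraction

def recognize_majority(w):
    # single head with constant scores: uniform attention averages the
    # 1-hot embeddings phi(w_i) = (1 - w_i, w_i) over all positions
    n = len(w)
    b0 = sum(Fraction(1 - x, n) for x in w)   # |w|_0 / n
    b1 = sum(Fraction(x, n) for x in w)       # |w|_1 / n
    return 1 if b1 > b0 else 0                # the activation/classification step

assert recognize_majority([1, 0, 1, 1]) == 1
assert recognize_majority([1, 0, 0, 1]) == 0   # ties are rejected (|w|_1 > |w|_0 required)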
5.4 Saturated Attention Transformer Networks are in TC0

Lemma 5.11. Let v0 , v1 , . . . , vn−1 be a sequence of floats, each with size at most z. Then their sum s = Σ_{i∈[n]} vi has size at most 4z + 2 · log(n) + 1.
Proof. Let pi , qi and ps , qs denote the numerator and denominator of vi and s, respec-
tively. Since each vi has size at most z, both pi and qi also have size at most z. Let
pmax = maxi pi and qmax = maxi qi . To add these floats, all their denominators have to
be made equal to qmax , which results in their numerators also being multiplied by qmax/qi .
This works because qi divides qmax , since they are both powers of 2. We can now esti-
mate the numerator of s as

ps ≤ Σ_{i∈[n]} pi · (qmax/qi) ≤ n · pmax · qmax ,
which has size ≤ log(n) + z + z = 2z + log(n). The denominator qs ≤ qmax has size ≤ z.
Therefore, s has size ≤ 1 + 2 · max(2z + log(n), z) = 4z + 2 log(n) + 1.
In particular, the size of the sum of a sequence of n float values whose size is bounded
by z(n) ∈ O(log(n)) is also bounded by O(log(n)).
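A quick numerical check of this bound, modeling floats as fractions with power-of-2 denominators and using the size measure from subsection 2.3.1 (this is only a sanity check, not part of the proof):

from fractions import Fraction
import random

def bits(n):
    return max(1, abs(n).bit_length())

def size(x: Fraction) -> int:
    # size of p/q under the interleaved pair encoding: 2 * max(|p|, |q|) + 1
    return 2 * max(bits(x.numerator), bits(x.denominator)) + 1

random.seed(0)
floats = [Fraction(random.randint(-256, 256), 2 ** random.randint(0, 8)) for _ in range(64)]
z = max(size(v) for v in floats)
s = sum(floats, Fraction(0))
# Lemma 5.11: the size of the sum is at most 4z + 2*log(n) + 1
assert size(s) <= 4 * z + 2 * len(floats).bit_length() + 1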
We will now leverage Lemma 5.11 to show that in any transformer network over floats with an element-wise-size-preserving attention function, the values are of bounded size.
Definition 5.12. A function α : Dn → Dn is element-wise-size-preserving iff for i ∈ [n] the function xi ↦ α(x)i is size-preserving, where x ∈ Dn .
Note that saturated attention satisfies this definition. We can now prove a theorem
bounding the size of the representations in transformer networks with element-wise-
size-preserving attention.
Theorem 5.13. For any transformer network over F with φ, sℓ,h , fℓ ∈ P and α element-
wise-size-preserving, for all ℓ ∈ [L + 1], h ∈ [H], and i ∈ [n], vℓ,i has size O(log(n)).
Proof. By induction over ℓ. The proof follows the definition of transformer network
computation in subsection 2.3.2.
Base case (ℓ = 0): wi has size O(1) and i ∈ [n] has size O(log(n)). Since φ ∈ P,
v0,i = φ(wi , i) has size O(log(n)) for all i ∈ [n].
Inductive Step: Assuming vℓ,i has size O(log(n)), we will show that vℓ+1,i does
too. As sℓ,h ∈ P, aℓ,h,i,j = sℓ,h (vℓ,i , vℓ,j ) has size O(log(n)) for all i, j ∈ [n]. Since α
is element-wise-size-preserving, we can conclude that α(aℓ,h,i,0 , . . . , aℓ,h,i,n−1 )j also has
size O(log(n)) for all h ∈ [H], i, j ∈ [n]. Multiplying two floats is also size-preserving
[MSS22, Appendix B], so α(aℓ,h,i,0 , . . . , aℓ,h,i,n−1 )j · vℓ,j has size O(log(n)) for all h ∈ [H]
and i, j ∈ [n]. We then apply Lemma 5.11 to conclude that bℓ,h,i has size O(log(n)),
where, recall,
∑
bℓ,h,i = α(aℓ,h,i,0 , . . . , aℓ,h,i,n−1 )j · vℓ,j .
j∈[n]
Finally, computing vℓ+1,i = fℓ (vℓ,i , (bℓ,0,i , . . . , bℓ,H−1,i )), we conclude that vℓ+1,i has size O(log(n)) for all i due to the size-preservation of fℓ .
Corollary 5.14. For any saturated transformer network over F with size-preserving internal functions, for all ℓ ∈ [L + 1] and i ∈ [n], vℓ,i has size O(log(n)).
Proof. This follows from Theorem 5.13 because saturated attention is element-wise-
size-preserving.
This result cannot be applied to soft attention because the softmax function is not
guaranteed to be element-wise-size-preserving as it involves computing the exponential
function.
We have proved that each vector in a saturated transformer network over floats has
size O(log(n)). Now, we show how this implies saturated transformer networks can be
simulated by TC0 circuits.
Corollary 5.15. Any size-preserving function with at most c · log(n) input bits can be
computed by a Boolean circuit of depth 3 and polynomial size.
In other words, such functions can be computed by AC0 circuits. In addition, we will
show that the sum of n floats of size at most c · log(n) can be computed by TC0 circuits.
Lemma 5.16. Let v0 , . . . , vn−1 be a sequence of floats with size at most c · log(n) for some c. Then their sum s = Σ_{i∈[n]} vi is computable by a threshold circuit of constant depth and polynomial size.
Proof. Let pi , qi once again be the numerator and denominator of vi . We first compute qmax = maxi qi using an AC0 circuit that compares all pairs qi , qj and returns the first qi such that qi ≥ qj for all j ∈ [n]. We then use the fact that multiplication and right shift (qi is a power of 2) are in TC0 in order to compute ri := pi · qmax/qi in parallel for all i ∈ [n]. Note that qi and qmax are both powers of 2, so division will be exact. Next, we leverage the fact that the sum of n integers with size O(log(n)) is in TC0 in order to compute the numerator of the sum p′ = Σ_{i∈[n]} ri . We select the denominator as q′ = qmax .
Finally, we add an AC0 circuit that reduces the fraction by removing shared trailing 0s
from p′ and q ′ , which is possible by Corollary 5.15. Thus, we have constructed a TC0
circuit to compute the sum of n floats of size O(log(n)).
We now construct a TC0 circuit that simulates a saturated transformer network over
floats.
Theorem 5.17. AHAT(F) ⊆ TC0.

Proof. For each n, we construct a TC0 circuit that simulates a saturated transformer
network of input size n. We construct the circuit modularly, with one subcircuit for
the attention mechanism and another for the feedforward subnetwork.
Attention Head: Fix a single head in some layer. We will construct a TC0 sub-
circuit that simulates the attention mechanism at position i. The head attends over
vectors v0 , . . . , vn−1 . For all j ∈ [n], vj has size O(log(n)) by Theorem 5.13. In parallel
for each j, we compute the scores ai,j = s(vi , vj ) with an AC0 circuit by Corollary 5.15.
We then compute ai,max := maxj ai,j with an AC0 circuit by comparing all ai,j pairwise and selecting the first ai,k such that ai,k ≥ ai,j for all j ∈ [n]. We then compute “masked”
values ui,j for each j ∈ [n] via an AC0 circuit by Lemma 5.4:
ui,j := vj if ai,j ≥ ai,max , and 0 otherwise.
We then compute the sum si := Σ_{j∈[n]} ui,j by Lemma 5.16. By Lemma 5.11, si has size O(log(n)). Now, we similarly define

zi,j := 1 if ai,j ≥ ai,max , and 0 otherwise.
Using an analogous sum construction with zi,j instead of ui,j , we can use a TC0 circuit
to compute |M(a)|: the number of values of j for which ai,j ≥ ai,max . Finally, since
dividing floats is in TC0 [MSS22, Appendix A], we can compute the head output as
si/|M(a)|, which has size O(log(n)) by size preservation of division.
Feedforward: As input, f receives vi as well as H head outputs, all of which have
size O(log(n)). As the total size of the input is O(log(n)), we can use Corollary 5.15 to
compute the output of f with an AC0 circuit. The size of the output is O(log(n)) by
size preservation of f . The same idea holds for φ as well as the linear classification head.
We have simulated each transformer network component with a TC0 subcircuit, com-
pleting the proof.
A uniform version of this result has been shown in [Str23]. We will not go into further
detail on this right here, since we will show a more general uniform result in section 5.6.
5.5 Fixed Precision Transformer Networks are in FOC[+; MOD]
The finiteness of FP ensures the following fact:
Proof. Because FP is finite, it is easy but tedious to write sentences that test for all
possible inputs and outputs.
Theorem 5.20. Every language recognized by a fixed-precision transformer network is definable by a sentence of FOC[+; MOD].

Proof Sketch. The proof works by going through the components of a transformer net-
work and defining their output as an activation function using the output of the pre-
vious component(s). Lemma 5.19 allows for this to work as long as no over- or under-
flows occur during the computation. Because of this, the internal functions are not al-
lowed to be arbitrary, but have to follow the definition of the original transformer net-
work more closely. For the same reason, the softmax function has to be more carefully
computed by bitwise averages instead of sums. With these changes, it is then possible
to create a FOC[+; MOD] sentence that defines the output activation function of the
transformer network. For the full details of the proof, see [CCP23, section 5].
Theorem 5.21. The language {0^n 1^n | n ∈ N} is in log-uniform TC0, but not definable in FOC[+; MOD].
Proof. The language contains no words of odd length and exactly one word of each even length. Because of this, a circuit family that decides it can be constructed from circuits that simply output a constant 0 for an odd number of input bits and circuits that test for 0^n 1^n for an even number of input bits 2 · n. Such a circuit just consists of one AND gate that takes the first n input bits negated and the other input bits directly as its inputs. An example of this can be seen in Figure 5.1. This circuit family clearly
lies in log-uniform TC0 .
To show that the language is not definable in FOC[+; MOD], suppose that it is
definable by some sentence σ. Let M be the product of all moduli m used in atomic formulas MOD^r_m(p) in σ. Then σ cannot distinguish between positions p and p + M, so it cannot distinguish w = 0^M 1^M and w′ = 1 0^{M−1} 0 1^{M−1}. Since w |= σ, it must be the case that w′ |= σ, which is a contradiction.
Figure 5.1: the circuit that checks if a given word a ∈ [2]^6 is 000111
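The circuit family from this proof can be sketched as one Boolean function per input length (illustrative only):

def accepts_0n1n(bits):
    # circuit C_{2n}: a single AND gate over the negated first n bits
    # and the last n bits taken directly; odd lengths are rejected outright
    if len(bits) % 2 == 1:
        return 0
    n = len(bits) // 2
    return int(all(not b for b in bits[:n]) and all(bits[n:]))

assert accepts_0n1n([0, 0, 0, 1, 1, 1]) == 1   # the example from Figure 5.1
assert accepts_0n1n([0, 1, 0, 1, 1, 1]) == 0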
The authors of [CCP23] continue by showing that every language that is definable by
a sentence in FOC[+; MOD] is also recognizable by a transformer network. They argue
that FOC[+; MOD] lies somewhere between fixed-precision transformer networks and
unbounded transformer networks, and is therefore close to an exact characterization of
the languages that transformer networks can recognize.
This result, however, is really quite weak, as we already showed in Corollary 5.9
that unbounded-SAT(D) = ALL, which is not close to FOC[+; MOD] at all. Their
definitions and assumptions do not line up exactly with the ones we use here, so our result does not immediately contradict theirs, but it does make their reasoning appear somewhat questionable.
5.6 log-Precision Transformer Networks are in Uniform TC0
Log-precision transformer networks lie in TC0 because they were designed to be highly parallelizable. Since parallelism is an
important property of today’s dominant paradigm of training models at massive scale,
this points to the conclusion that any massively scaled up model — transformer net-
work or otherwise — will likely obey restrictions similar to the ones derived here for
log-precision transformer networks. There is thus an important tradeoff between the
massive parallelizability of today’s networks and their representation power.
Other results rely on making unrealistically strong assumptions or placing unrealistic
restrictions on the model of transformer networks. For this result, we only make one
assumption – namely, all intermediate values in the transformer network are limited
to O(log(n)) bits, where n is the number of input tokens. We next discuss some
implications of this assumption and what the findings mean for practical transformer
networks.
The bounds we will prove are asymptotic in nature and thus apply when n is suffi-
ciently large. In practice, transformer network models use fixed precision at each com-
putation node, which is more restrictive than log-precision. However, this constant
could be large and thus, for relatively small n, our results do not rule out practical
transformer networks solving difficult problems. The results, however, do show that
as n grows sufficiently large, log-precision transformer networks are fundamentally lim-
ited to problems within TC0 and cannot accurately solve various commonly studied
problems, such as:
• AI planning
• Permanent computation.
Extending our analysis to small n will help close the gap to practice.
The formal model we will use in this section is based on a binary classification view of
transformer networks. However, our results apply directly to multi-class classification
as well and can be extended to generation problems by viewing, for instance, next word
prediction in natural language processing (NLP) as a multi-class classification problem.
However, if the transformer network decoder is allowed to condition on its previous
output in a generation problem, then this would violate our formal setup.
5.6.1 Circuit Serialization
First, we will discuss a way of serializing a circuit into a string. We later show how to
generate such serializations using a resource-bounded algorithm, which is the key to
proving containment in uniform complexity classes.
We identify a circuit with its serialization in a formal language that identifies each
node’s label and adjacency list. We will adopt a specific grammar for concreteness, but
our construction can be adapted to other string representations of circuits.
We define a circuit serialization as a traversal of a circuit ordered by some topological
sort. In this serialization, leaf nodes (variables/input gates) are represented by the
string X. An internal node (non-input gate) is represented in Polish notation by the
function it computes (AND, OR, or NOT) followed by a list of pointers to its arguments.
Each argument &1^j of gate i encodes (in unary) a zero-indexed pointer to the j-th gate in the circuit, where j < i. The final node is interpreted as the circuit output.
To serialize {∧, ∨}-circuits, we use the following grammar, where the i parameter
is passed through Gate[i] non-terminals to track the index of the gate in left-to-right
order:
In the Arg[i] rule, we enforce that j < i so that arguments must be pointers to already
defined gates. As an example of this serialization language, the circuit for x0 ∨ ¬x1 ∨ x2
which can be seen in Figure 5.2 is represented as X X X NOT &1 OR & &111 &11,
where spaces are added for readability.
Figure 5.2: the circuit for x0 ∨ ¬x1 ∨ x2
By convention, negations in AC0 circuits are usually taken to occur at the beginning
of the circuit, rather than after ∧ or ∨ nodes, which can be achieved using De Morgan’s
law. Our serialization grammar does not enforce this property, but of course any circuit
with this property can be serialized by our grammar.
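To make the serialization format concrete, here is a small sketch of an evaluator for this serialization language; it operates on the space-separated form used in the example above, which is an assumption made for readability:

def eval_serialized_circuit(serialization, inputs):
    # Evaluate a space-separated {AND, OR, NOT} circuit serialization on the given
    # input bits. "X" gates consume input bits in order; an argument "&" followed by
    # j ones is a unary pointer to gate j. The last gate is the circuit output.
    values = []                    # values[j] = output of gate j
    inputs = iter(inputs)
    tokens = serialization.split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        i += 1
        if tok == "X":
            values.append(next(inputs))
            continue
        args = []
        while i < len(tokens) and tokens[i].startswith("&"):
            args.append(values[len(tokens[i]) - 1])   # "&" plus j ones points to gate j
            i += 1
        if tok == "NOT":
            values.append(1 - args[0])
        elif tok == "AND":
            values.append(int(all(args)))
        elif tok == "OR":
            values.append(int(any(args)))
    return values[-1]

# the example circuit for x0 OR (NOT x1) OR x2 from Figure 5.2
C = "X X X NOT &1 OR & &111 &11"
assert eval_serialized_circuit(C, [0, 1, 0]) == 0
assert eval_serialized_circuit(C, [0, 0, 0]) == 1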
It is slightly more complicated to serialize threshold circuits. We assume that all
non-input gates in our threshold circuits are threshold gates θ≤k , θ≥k , which return
whether at most or at least k of their m input bits are 1. Threshold gates are equivalent
to majority gates (under constant-depth reduction) and can be used to simulate ∧, ∨,
and ¬ gates. Formally, a threshold circuit serialization is generated by the following
grammar:
In the rule for Gate[i], m ∈ N is the arity of the gate, and k ≤ m is its threshold. The
span 1k after Dir can be interpreted semantically as a unary encoding of the parameter
k for a threshold gate, padded by 0s to the number of total arguments of gate i. For
simplicity, we imagine ¬ gates are represented as unary θ≤0 gates. Thus, the circuit
for θ≥1(x0 , ¬x1 ), which can be seen in Figure 5.3, would be represented as X X <= 0 &1 >= 10 & &11.
We say a threshold circuit is in prefix form iff all inputs (X) come before all threshold
gates (<= and >=), as is the case in this example.
Figure 5.3: the circuit for θ≥1(x0 , ¬x1 )
5.6.2 Uniformity
The circuit families we have defined thus far are nonuniform, meaning that we do not
enforce that the circuits must be related in any way. In degenerate cases, nonuniform
circuit families can solve undecidable problems because they have infinite description
length, making them a physically unrealizable model of computation. Complexity
theorists have thus introduced uniform circuit families. Uniform circuit families are a
realizable model of computation with relations to classes in computational complexity
and formal language theory.
Intuitively, in a uniform circuit family, the circuits for different input sizes must
be “somewhat similar” to each other. We formalize this by saying that there exists a
resource-constrained Turing machine that maps the input 1^n to a serialization of circuit Cn.
This notion of uniformity is more general than the standard notion in that the
input size I(n) is a function of the problem complexity n. The reason for this is that
we will apply uniformity to sub-computations with different input sizes I(n) within a
larger computation of input size n. The standard notion of uniformity corresponds to
I(n) = n.
Furthermore, we will refer to a circuit family as uniform iff it is uniformly computable
with S(n) = O(log(n)). We can define uniform versions of AC0 and TC0 by adopting
the previous definitions exactly, but also enforcing uniformity.
5.6.3 Transformer Network Precision and Space
This means the sizes of the function’s inputs and output are bounded by p. Similarly,
the intermediate space used by the computation must also be bounded by p. Thus,
higher precision computations cannot somehow be hidden inside f .
Definition 5.23 naturally applies to functions with bounded arity k. We will also
need to define p-precision for the summation operator in the transformer network,
which adds n different floats of size p. Adding n floats can blow up the precision
needed to represent their sum. For example, imagine adding the floats 1 · 20 + 1 · 2c .
We obtain (2c + 1) · 20 , whose mantissa takes c + 1 bits to represent. In practice,
computers do not preserve full precision in such situations. Instead, small terms like
1 · 20 are discarded. Thus, we define the transformer network’s addition operator ⊕ to
be similarly approximate. For more details on how (iterated) addition of p-precision
floats works, see [MS23b, Appendix A].
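A sketch of such an approximate addition operator, modeled here as rounding the exact sum to a p-bit mantissa; the exact rounding scheme is an assumption on our part, and [MS23b, Appendix A] defines the operator precisely.

from fractions import Fraction

def round_to_precision(x: Fraction, p: int) -> Fraction:
    # keep only the p most significant bits of the mantissa, dropping smaller terms
    if x == 0:
        return x
    sign = 1 if x > 0 else -1
    x = abs(x)
    e = x.numerator.bit_length() - x.denominator.bit_length()  # roughly floor(log2 x)
    shift = p - 1 - e
    mantissa = int(x * Fraction(2) ** shift)                   # truncate lower-order bits
    return sign * Fraction(mantissa, 1) / Fraction(2) ** shift

def approx_add(values, p):
    # the approximate summation operator: add exactly, then round to p bits
    return round_to_precision(sum(values, Fraction(0)), p)

# with p = 8 bits, the small term 1 * 2^0 is lost next to 1 * 2^20,
# as in the example above
assert approx_add([Fraction(1), Fraction(2**20)], 8) == Fraction(2**20)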
Let h0 , h1 , . . . , hn−1 ∈ [2]p be the input sequence to a p-precision attention head, and
let ⊕ be approximate floating-point addition.
Definition 5.25. For all ℓ ≥ 0, a p-precision attention head H_h^{ℓ+1} computes a vector a_{i,h}^{ℓ+1} ∈ [2]^p via

a_{i,h}^{ℓ+1} = ⊕_{j∈[n]} (s(h_i^ℓ, h_j^ℓ) / Z_i) · h_j^ℓ,

where Z_i = ⊕_{j∈[n]} s(h_i^ℓ, h_j^ℓ).
Standard attention heads, like the ones of the original transformer network, are a
special case of this definition where s is scaled dot-product similarity between keys
and queries. Standard transformer networks also have a linear or affine value function
applied to each head hℓj in the sum over the j. By its affineness, the value function can,
without loss of generality, be removed from the attention head and considered to be a
part of the transformer network layer (that is, applied to the output of the attention
head).
A p-precision transformer network layer is then a tuple of heads and a function f
used to combine them.
Definition 5.26. A p-precision transformer network layer is a tuple $L^{\ell+1} = \langle H_0, H_1, \ldots, H_{k-1}, f \rangle$, where each $H_h$ is an attention head and $f : ([2]^p)^k \times [2]^p \to [2]^p$ is a p-precision activation function.
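Correspondingly, here is a minimal Python sketch of a layer in the sense of Definition 5.26: it evaluates its k attention heads and then applies an activation function f to the head outputs together with the previous hidden state at position i. The concrete head (uniform averaging) and the concrete f (component-wise sum with a residual) are placeholder assumptions chosen only to keep the sketch self-contained.

def transformer_layer(h, heads, f):
    """Apply one layer: for every position i, combine the k head outputs
    a_{i,0}, ..., a_{i,k-1} and the old hidden state h_i via f."""
    new_h = []
    for i in range(len(h)):
        head_outputs = [head(h, i) for head in heads]
        new_h.append(f(head_outputs, h[i]))
    return new_h

def uniform_head(h, i):
    """A trivial head that attends uniformly, i.e. averages all positions."""
    n, d = len(h), len(h[0])
    return [sum(h[j][t] for j in range(n)) / n for t in range(d)]

def f_sum(head_outputs, h_i):
    """Placeholder activation: component-wise sum of head outputs plus h_i."""
    d = len(h_i)
    return [sum(a[t] for a in head_outputs) + h_i[t] for t in range(d)]

# One layer with a single head over three 2-dimensional hidden states.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(transformer_layer(h, heads=[uniform_head], f=f_sum))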
We will use n to denote the length of w, and take the transformer network’s depth
d to be fixed with respect to n.
The input to the transformer network can thus be represented with N = n · log(|Σ|)
bits using a binary encoding for the vocabulary. The circuits we construct subse-
quently to simulate transformer networks will also have input size N. We will assume
transformer networks have log-precision relative to the size of the input, specifically
O(log(N ))-precision. Since |Σ| is fixed (typically 30000 in practice), we will think in
terms of O(log(n))-precision. Thus, by Definition 5.23, all the intermediate functions
of such transformer networks are computable in O(log(n)) space and output (at most)
that many bits. Note that this is enough precision to represent positional encodings
and for each position to point to a constant number of other values, but not enough
precision for non-lossy pooling of the entire input into a single value.
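As a quick sanity check of these orders of magnitude, the following snippet computes the input size N and the corresponding precision for an assumed sequence length and vocabulary size; the concrete numbers are only illustrative.

import math

n = 2048                                        # assumed sequence length
vocab_size = 30000                              # typical vocabulary size
bits_per_token = math.ceil(math.log2(vocab_size))   # 15 bits per token
N = n * bits_per_token                          # total input size in bits
print(bits_per_token, N, math.ceil(math.log2(N)))
# log2(N) = log2(n) + log2(bits_per_token), so O(log N) precision and
# O(log n) precision coincide up to additive constants.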
Our log-precision transformer networks do not enforce that s and f follow the trans-
former network structure. However, a feedforward net whose primitive operations (for
example scalar multiplication) are defined over O(log(n))-size numbers can be com-
puted in O(log(n)) space. Thus, bounded-precision practical transformer networks are a special case of our log-precision transformer networks. This makes our setup appropriate for proving upper bounds on transformer networks.
Corollary 5.30. Let $f : [2]^* \to [2]^m$ be a function. For all $c \in \mathbb{R}^+$ and $n \in \mathbb{N}$, there exists an AC0 circuit of size at most $n^c + c \cdot \log(n) + m$ and depth 3 that computes f on inputs of size $c \cdot \log(n)$.
We now use Corollary 5.30 to prove the following nonuniform result. We note that
the proof works even if the notion of p-precision is relaxed to not require computability
in space p. This requirement will, however, become important for our subsequent result
in subsection 5.6.6.
In the inductive case of computing layer $h_i^{\ell+1}$ for $0 \leq \ell < d$, we note that each vector output $h_i^\ell$ of layer ℓ has size (at most) $c \cdot \log(n)$ bits because of the log-precision assumption.
We first fix a head $a_{i,h}^{\ell+1}$ to simulate. Applying Corollary 5.30, we can compute $s(h_i^\ell, h_j^\ell)$ with a poly-size depth-3 circuit, in parallel for all j. Since n floats with $c \cdot \log(n)$ precision can be approximately added in TC0 [MS23b, Appendix A], we can construct a TC0 circuit of depth $d_\oplus$ to compute $Z_i$. Since $s(h_i^\ell, h_j^\ell)$, $Z_i$, and $h_j^\ell$ all have $c \cdot \log(n)$ bits, we can compute $\frac{s(h_i^\ell, h_j^\ell)}{Z_i} \cdot h_j^\ell$ with a poly-size depth-3 circuit; we do this in parallel for all j. Next, we again use the fact that approximate addition of n floats is in TC0 to compute $a_{i,h}^{\ell+1}$ as the approximate sum over j with a depth-$d_\oplus$ circuit.
We now simulate a layer $h_i^{\ell+1}$ in terms of its constituent heads. Since all arguments of g have size $c \cdot \log(n)$, we apply Corollary 5.30 to compute g with a poly-size depth-3 circuit, yielding $h_i^{\ell+1}$. We repeat this in parallel for all i. This completes the inductive step. The sub-circuit we have constructed for the (ℓ+1)-st layer has a depth of $9 + 2 \cdot d_\oplus$: depth 3 each for the similarity circuits, the scaled value vectors, and g, plus depth $d_\oplus$ each for the two approximate summations. Aggregating the circuit over all d layers, its overall depth is $3 + (9 + 2 \cdot d_\oplus) \cdot d$.
Proof. Since any given transformer network has some constant depth d, the depth of
the resulting threshold circuit family is also constant.
Lemma 5.33. Let $f : [2]^* \to [2]^m$ be a linear-space computable function. There exists a Turing machine that, for all $n \in \mathbb{N}$ and $c \in \mathbb{R}^+$, uses at most $c \cdot \log(n) + \log(m)$ space to map input $1^n$ to a circuit of size at most $n^c + c \cdot \log(n) + m$ and depth 3 that computes f on inputs of size at most $c \cdot \log(n)$.
Proof. We give the proof in the form of an algorithm to construct a circuit as a function
of n and then justify its correctness and space complexity.
Algorithm: We first print 2 · c · log(n) nodes representing unnegated and negated
input nodes.
Now, we need to show how to construct nodes corresponding to the $n^c$ DNF terms. To that end, we loop over all possible inputs $x \in [2]^{c \cdot \log(n)}$ by maintaining the $c \cdot \log(n)$-bit binary representation of x (initialized with $0^{c \cdot \log(n)}$) and incrementing it by 1 at each step of the loop. We create a new ∧ node i with $c \cdot \log(n)$ arguments, defined as follows: For $j \in [c \cdot \log(n)]$, we create an argument pointer to (unnegated) node j if $x_j = 1$ and to (negated) node $c \cdot \log(n) + j$ otherwise.
Next, we construct nodes computing each of the m outputs. We loop over $k \in [m]$, constructing a single node for each k. We loop over all $x \in [2]^{c \cdot \log(n)}$ analogously to above to construct a list of arguments. By our linear-space computability assumption and because x has $c \cdot \log(n)$ bits, we can compute f(x) as a subroutine in O(log(n)) space to obtain $f_k(x)$. If $f_k(x) = 1$, we print the ∧ node corresponding to x as an argument of node k.
Correctness: We show that this Turing machine M maps input $1^n$ to a serialized circuit computing f on inputs of size $c \cdot \log(n)$. The first layer simply produces the unnegated and negated input values. The second layer then produces all possible DNF terms. Finally, node k of the third layer computes the disjunction over all terms x such that $f_k(x) = 1$. Thus, node k of the third layer computes $f_k$.
Logarithmic Space: To complete the proof, we justify that M uses O(log(n) +
log(m)) space. Looping over $x \in [2]^{c \cdot \log(n)}$ is accomplished by treating x as a binary number initialized to 0 and incrementing it at each step. Thus, the loop pointer for
building the DNF terms takes c·log(n) space to store. For building the m output nodes,
we maintain a similar loop pointer as well as an index k ≤ m, taking c · log(n) + log(m)
space. Thus, the overall algorithm uses c · log(n) + log(m) space.
Thus, the Turing machine M uses $c \cdot \log(n) + \log(m)$ space to map $1^n$ to a circuit of size at most $n^c + c \cdot \log(n) + m$ and depth 3 that computes f on size $c \cdot \log(n)$ inputs.
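The construction from this proof can be sketched in Python as follows. The sketch materializes the depth-3 DNF circuit as a list of gates rather than streaming a log-space serialization, and the gate encoding (kind, arguments) as well as the function name build_dnf_circuit are assumptions made for readability; the node numbering follows the proof: first the 2·c·log(n) input literals, then the n^c AND terms, then the m output OR gates.

def build_dnf_circuit(f, num_bits, m):
    """Depth-3 circuit for f: {0,1}^num_bits -> {0,1}^m.
    Returns a list of gates; gate i is a tuple (kind, arguments)."""
    gates = []
    # Layer 1: unnegated inputs 0..num_bits-1, negated inputs num_bits..2*num_bits-1.
    for j in range(num_bits):
        gates.append(("INPUT", [j]))
    for j in range(num_bits):
        gates.append(("NOT", [j]))
    # Layer 2: one AND term per possible input x.
    for x in range(2 ** num_bits):
        bits = [(x >> j) & 1 for j in range(num_bits)]
        args = [j if bits[j] == 1 else num_bits + j for j in range(num_bits)]
        gates.append(("AND", args))
    # Layer 3: output k is the OR of all AND terms whose input x satisfies f_k(x) = 1.
    for k in range(m):
        args = []
        for x in range(2 ** num_bits):
            bits = [(x >> j) & 1 for j in range(num_bits)]
            if f(bits)[k] == 1:
                args.append(2 * num_bits + x)   # index of the AND node for x
        gates.append(("OR", args))
    return gates

# Example: f computes (parity, AND) of two input bits.
f = lambda bits: [bits[0] ^ bits[1], bits[0] & bits[1]]
circuit = build_dnf_circuit(f, num_bits=2, m=2)
print(len(circuit))   # 2*2 literals + 4 AND terms + 2 outputs = 10 gates

Note that the lemma's point is that this construction can be carried out with only O(log(n) + log(m)) working memory; the sketch above does not model that space bound.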
We can leverage this lemma to derive the uniform analog of Theorem 5.31, as follows:
Proof. We will provide a proof by induction over the transformer network layers ℓ that
there is a Turing machine M operating in O(log(n)) space that, on input $1^n$, outputs a
circuit that simulates the transformer network’s computation on inputs of size n. This
circuit is identical to the one in the proof of Theorem 5.31, and thus has the same
circuit depth.
In the base case, we use logarithmic space to track a counter maintaining the current
token i (between 1 and n) throughout the circuit construction. We construct gates
encoding the constant i in binary. We can then apply Lemma 5.33 to construct a Turing
machine that maps $1^n$ to a constant-depth threshold circuit computing $h_i^0 = \varphi(w_i, i)$.
In the inductive case, we assume we can output in O(log(n)) space a circuit com-
puting every value $h_i^\ell$ in the previous layer ℓ. We will show that we can, in O(log(n))
space, now output a circuit computing every value in layer ℓ + 1.
As in Theorem 5.31, we first fix a head $a_{i,h}^{\ell+1}$ to simulate. Recall that
$$a_{i,h}^{\ell+1} = \bigoplus_{j \in [n]} \frac{s(h_i^\ell, h_j^\ell)}{Z_i} \cdot h_j^\ell.$$
By Lemma 5.33, we can generate a depth-3 circuit of size at most $z = n^{c'} + c' \cdot \log(n) + 1$, where $c' = 2 \cdot c$ (since the input to f is of size $2 \cdot c \cdot \log(n)$), that computes $s(h_i^\ell, h_j^\ell)$ for specific i, j. We do this sequentially for $j \in [n]$ and $h \in [k]$, padding each circuit with unused nodes, such that each one has size exactly z, and the z-th node corresponds to the output. Thus, the indices of the output nodes for each of the columns will be $w_\ell + z \cdot (j \cdot k + h)$ for $j \in [n]$, where $w_\ell$ is the index of the last output node $h_n^\ell$ of the previous layer.
At this point, we use the fact that for $p = c \cdot \log(n)$, the p-precision approximate sum of n p-precision numbers can be computed by a uniform threshold circuit [MS23b, Appendix A]. We can thus use a Turing machine as a subroutine to generate, on input $1^n$, k threshold circuits, where each has size $z'$ and computes a ⊕ gate over n items of precision p each. We set the inputs of circuit h to be the nodes $w_\ell + z \cdot (j \cdot k + h)$ for $j \in [n]$. By construction, this yields the normalizing constants $Z_i = \bigoplus_{j \in [n]} s(h_i^\ell, h_j^\ell)$, whose value is located at the node at index $w_\ell + z \cdot n \cdot k + z' \cdot h$ for head h.
Using p-precision arithmetic operator circuits, we can now also generate a circuit to compute $\frac{s(h_i^\ell, h_j^\ell)}{Z_i} \cdot h_j^\ell$ for each $j \in [n]$ and $h \in [k]$, by using index $w_\ell + z \cdot (j \cdot k + h)$ as before for the value of $s(h_i^\ell, h_j^\ell)$ and index $w_\ell + z \cdot n \cdot k + z' \cdot h$ for the normalizing constant $Z_i$ of head h. Here too we use circuits of identical size $z''$, making $w_\ell + k \cdot (z \cdot n + z' + z'' \cdot i)$ the index of the output nodes of these n circuits. Next, we again employ a ⊕ circuit of size $z'$, similar to the computation of $Z_i$, to compute the sum of these n values. Finally, we compute $h_i^{\ell+1}$ by applying f via Lemma 5.33.
Note that this requires keeping only ℓ, i, and n in memory, each of which takes
O(log(n)) bits.
We repeat this process for all i ∈ [n] to compute the entire layer ℓ + 1, which finishes
the inductive step: If we can output a circuit computing layer ℓ in O(log(n)) space,
then we can do the same for layer ℓ + 1.
Because the depth derived in Theorem 5.34 is constant with respect to n, it follows that log-precision transformer networks can be simulated by log-uniform, constant-depth threshold circuit families, that is, that they lie in log-uniform TC0 (Corollary 5.35). We can now use this result to establish a connection to the logic FO(M) we defined in section 4.2 [MS23a, Theorem 2].
Corollary 5.36. The output of any log-precision transformer network can be expressed
in FO(M).
Proof. This follows directly from the equivalence of log-uniform TC0 and FO(M) which
has been shown in [MIS90, section 9].
For fixed-precision transformer networks using soft attention, we can even combine this result with the results from section 5.5 to show that they are strictly less powerful than TC0 circuits: the class of languages they recognize is a proper subset of uniform TC0.
Proof. From Corollary 5.35, we know that it is a subset. If we now assume that the class is equal to uniform TC0, then it follows from Theorem 5.20 that TC0 ⊆ FOC[+; MOD], which contradicts Theorem 5.21. Since our assumption leads to a contradiction, it has to be wrong, and the subset relation is thereby proper.
5.7 Lower Bounds for Instruction Following and Advice Transformers
In this section, we show that transformers can compute any TC0 function when their input is augmented with the right “instructions”.
More formally, we consider the Circuit Value Problem (CVP) [Lad75, p. 18], also
referred to as the Circuit Evaluation Problem, where the input is a boolean circuit C
and a string x ∈ [2]n , and the task is to return the value of C(x) ∈ [2]. This problem
is known to be complete for the class P under log-space reduction [Lad75, p. 19]. We
will assume C is serialized as described in subsection 5.6.1 and prove that log-precision
transformer networks can evaluate any TC0 circuit. Note that this is an extension of the
typical CVP since the circuit has threshold gates, not just standard AND/OR gates.
To demonstrate the practicality of this lower bound construction, we will not just
prove the existence of transformers that can evaluate TC0 circuits but also specify con-
crete choices for the positional embedding scheme and the class of attention functions
that are sufficient to do so.
Fractional Positional Embeddings: For a vector x and scalar y, let $\langle x, y \rangle$ be the vector obtained by appending y onto x. For $\sigma \in \Sigma$, let $v(\sigma)$ be the one-hot embedding of σ into $\mathbb{R}^{|\Sigma|}$. For $w \in \Sigma^*$ and $i \in \mathbb{N}$, the fractional positional embedding at token i is
$$\varphi(w_i, i) = \left\langle v(w_i), \frac{i}{n} \right\rangle.$$
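A small Python sketch of this embedding scheme follows; the circuit-serialization alphabet and the example string used here are placeholder assumptions.

def fractional_positional_embedding(w, i, vocab):
    """phi(w_i, i) = <v(w_i), i/n>: one-hot embedding of the token at
    position i with the fraction i/n appended."""
    n = len(w)
    one_hot = [1.0 if sym == w[i] else 0.0 for sym in vocab]
    return one_hot + [i / n]

vocab = ["X", "&", "1", "<=", ">=", "0"]      # assumed serialization alphabet
w = ["X", "X", "<=", "1", "&", "1"]           # assumed example serialization
print(fractional_positional_embedding(w, 3, vocab))
# -> [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.5]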
We are now ready to present the result. Our construction below is specific to cir-
cuits serialized in prefix form (see subsection 5.6.1), but it can be extended to other
serializations as well.
Lemma 5.38. For all $d \in \mathbb{N}$, there exists a transformer network with fractional positional embeddings, saturated attention, thresholded linear pooling functions, and depth $2 \cdot d$ that, for any threshold circuit C of depth d serialized in prefix form, maps input $\langle C, x \rangle$ to the value C(x).
Proof. We will construct a pair of transformer network layers that evaluates all the nodes at depth ℓ in the threshold circuit, for any ℓ. It follows that a transformer network of depth $2 \cdot d$ can compute the value C(x).
Base Case: Input Nodes. We use an attention layer to attend uniformly over all positions, with value 1 if $w_i = X$ and 0 otherwise. This head computes $\frac{|w|_X}{n}$, where $|w|_X$ is the number of occurrences of X in w. A second layer then, at input node i, computes the positional embedding of the token representing input value $x_i$:
$$\frac{1 - |w|_X + i}{n}.$$
We attend to this position to retrieve $x_i$. After these layers, each input node i stores its value $x_i$. We also use the base-case layers to construct an attention head that, at the i-th node, counts the fraction of tokens (out of n) that are nodes to the left of the current node. Thus, the column corresponding to node i stores the value $\frac{i}{n}$.
At each gate node i, we use two more attention heads to find the index of the next & to the right, and then count the fraction of tokens before it that are 1. This head thus computes $\frac{k_i}{m_i}$, where $k_i$ is the threshold value of gate i and $m_i$ is its arity.
Finally, using the first attention layer, we have each 1 node attend to the first argument symbol & to its left and retrieve its index $\frac{p}{n}$. Then, in the second attention layer, each argument attends uniformly over all nodes with value $\frac{p}{n}$. The net effect is for each argument to store $\frac{j}{n}$, that is, the pointer it is encoding in unary as &1^j.
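To illustrate how uniform attention realizes such counting, here is a hedged Python sketch: attending uniformly over all positions with value 1 at the X tokens and 0 elsewhere yields exactly the fraction |w|_X/n. A real saturated-attention head would produce this average via its attention weights; the sketch simply evaluates the resulting average directly, and the toy serialization is an assumption.

def uniform_attention_average(values):
    """Uniform attention over all n positions is just the average of the values."""
    return sum(values) / len(values)

def fraction_of_X(w):
    """Head value 1 at X tokens, 0 elsewhere -> computes |w|_X / n."""
    return uniform_attention_average([1.0 if tok == "X" else 0.0 for tok in w])

w = ["X", "X", "X", "<=", "1", "&", "1", "&", "1", "0"]   # assumed toy serialization
print(fraction_of_X(w))   # 3 occurrences of X out of 10 tokens -> 0.3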
Inductive Case: Gate Nodes. By our inductive assumption over prior layers, all
tokens corresponding to circuit nodes at depth ≤ ℓ contain their appropriate value. We
now construct 2 transformer network layers to evaluate gate nodes at depth ℓ + 1.
In the first attention layer, each argument token attends to the closest gate node i to
its left, which is the gate it belongs to. Recall from the base case that argument token
& already stores $\frac{j}{n}$, where j is the pointer value it encodes. Each argument token now attends with query $\frac{j}{n}$ to retrieve from node j its already computed value.
The second attention layer applies at gate nodes, not arguments. At gate i of arity
mi , we set the attention s(i, j) to indicate whether argument j belongs to gate node i,
which holds for exactly $m_i$ arguments. We set the attention value at argument j to be the binary value of node j, which was retrieved in the previous paragraph. Thus, the attention head computes $\frac{c_i}{m_i}$, where $c_i$ is the number of arguments of node i that are 1. We repeat this for all gate nodes.
At this point, we have both the count of true inputs to gate node i, $\frac{c_i}{m_i}$, and, from the base case, the threshold parameter of gate i, $\frac{k_i}{m_i}$. Thresholding $\frac{c_i - k_i}{m_i}$ at 0 allows us to decide, based on whether Dir is <= or >=, whether the current gate node should output a 0 or a 1. Repeating this for all gates at layer ℓ + 1 completes the inductive step: We can evaluate all gate nodes in this layer.
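The final thresholding step can be made concrete with a short Python sketch: given the two fractions computed by the attention heads and the gate's direction, the gate's output is obtained by comparing their difference against 0. The function name and the string encoding of Dir are assumptions made for illustration.

def evaluate_threshold_gate(c_over_m, k_over_m, direction):
    """Return the output of a threshold gate theta_{<=k} or theta_{>=k},
    given c_i/m_i (fraction of true arguments) and k_i/m_i (threshold)."""
    diff = c_over_m - k_over_m          # equals (c_i - k_i) / m_i
    if direction == ">=":
        return 1 if diff >= 0 else 0
    if direction == "<=":
        return 1 if diff <= 0 else 0
    raise ValueError("direction must be '<=' or '>='")

# A gate theta_{>=2} with 3 arguments, 2 of which are 1: (2 - 2)/3 >= 0 -> outputs 1.
print(evaluate_threshold_gate(2 / 3, 2 / 3, ">="))   # -> 1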
Theorem 5.39. Depth-(2 · d) transformer networks can solve CVP for depth-d TC0
circuits.
Proof. According to Lemma 5.38, there is a transformer network for any circuit depth
d that solves the problem.
Corollary 5.40. There exists a depth-(2 · d) transformer network that can correctly
follow any depth-d TC0 instruction description.
Proof. This follows, since Lemma 5.38 constructs a transformer network that can eval-
uate any TC0 circuit.
Thus, transformer networks with simple position embeddings, attention, and pooling
functions can simulate any instruction provided in the form of a TC0 circuit. We note
that, while it is unknown whether the class of regular languages is contained in TC0 ,
the other direction is known: There are problems computable by TC0 circuits that
are not regular. These include problems involving counting and arithmetic, which are
beyond regular languages. These results thus expand the known kinds of instructions
transformers are able to follow, at least with hand-constructed weights.
Proof. Let (Cn )n∈N be a circuit family demonstrating that a problem is in nonuniform
TC0 . Then, by passing the description of Cn as advice for input length n, it immediately
follows from Lemma 5.38 that advice transformer networks can simulate nonuniform
TC0 .
Since non-uniform TC0 even contains some undecidable languages, T/poly is clearly
a very powerful class. Thus, a problem in T/poly cannot always be solved by a trans-
former network on its own. However, if given a description of how to do so (“advice”)
in the form of a TC0 circuit, we have shown that a transformer network could solve
that problem.
6 Conclusion
6.1 Interpretation
First, we have seen that choosing an overly simplified attention function, namely hard attention, for our model of transformer networks places them within AC0. This does not reflect the capabilities of transformer networks in the real world, as they have been shown to be able to count, which AC0 circuits cannot [BAG20, section 6; MSS22, section 1]. Then, we have seen that assuming arbitrary precision in our model once again leads to results that do not reflect the capabilities of actual transformer networks, such as being able to recognize any formal language and being Turing complete.
For the much more realistic model of log-precision transformer networks, we have
shown that they can be simulated by log-uniform TC0 circuits, for any kind of atten-
tion function. This establishes threshold functions as fundamental operations for understanding the computational model of transformers. This result also establishes potential limits on the computational power of log-precision transformer networks. For example, if L ⊊ P, transformer networks cannot compute all polynomial-time functions. They are certainly very far from being universal. The intuition at the heart of
this result is that forcing a model to be highly parallelizable likely sacrifices its expres-
siveness. Since parallelism seems essential to pretraining any massive model at scale,
any large language model — transformer network or otherwise — may suffer from a
similar tradeoff [MS23b, section 8].
human-readable. It could also allow a new angle for a theoretical analysis.
But transformer networks are of course not the only deep neural networks. The field
is evolving rapidly, and there is also much to be learned from the analysis of competing
architectures, such as the Mamba network [GD23]. Instead of focusing on individual
architectures, it might be of interest to investigate whether the parallelism tradeoff is
real and what that would imply for future design of large language models.
Bibliography
[BAG20] Bhattamishra, Satwik ; Ahuja, Kabir ; Goyal, Navin: On the Abil-
ity and Limitations of Transformers to Recognize Formal Languages. In:
Webber, Bonnie (Hrsg.) ; Cohn, Trevor (Hrsg.) ; He, Yulan (Hrsg.) ; Liu,
Yang (Hrsg.): Proceedings of the 2020 Conference on Empirical Methods
in Natural Language Processing, EMNLP 2020, Online, November 16-20,
2020, Association for Computational Linguistics, 2020, 7096–7116
[BJZP20] Basodi, Sunitha ; Ji, Chunyan ; Zhang, Haiping ; Pan, Yi: Gradient
amplification: An efficient way to train deep neural networks. In: Big Data
Min. Anal. 3 (2020), Nr. 3, 196–207. http://dx.doi.org/10.26599/BDMA.
2020.9020004. – DOI 10.26599/BDMA.2020.9020004
[BMR+ 20] Brown, Tom B. ; Mann, Benjamin ; Ryder, Nick ; Subbiah, Melanie ;
Kaplan, Jared ; Dhariwal, Prafulla ; Neelakantan, Arvind ; Shyam,
Pranav ; Sastry, Girish ; Askell, Amanda ; Agarwal, Sandhini ;
Herbert-Voss, Ariel ; Krueger, Gretchen ; Henighan, Tom ; Child,
Rewon ; Ramesh, Aditya ; Ziegler, Daniel M. ; Wu, Jeffrey ; Winter,
Clemens ; Hesse, Christopher ; Chen, Mark ; Sigler, Eric ; Litwin, Ma-
teusz ; Gray, Scott ; Chess, Benjamin ; Clark, Jack ; Berner, Christo-
pher ; McCandlish, Sam ; Radford, Alec ; Sutskever, Ilya ; Amodei,
Dario: Language Models are Few-Shot Learners. In: Larochelle, Hugo
(Hrsg.) ; Ranzato, Marc’Aurelio (Hrsg.) ; Hadsell, Raia (Hrsg.) ; Bal-
can, Maria-Florina (Hrsg.) ; Lin, Hsuan-Tien (Hrsg.): Advances in Neu-
ral Information Processing Systems 33: Annual Conference on Neural In-
formation Processing Systems 2020, NeurIPS 2020, December 6-12, 2020,
virtual, 2020
[CCP23] Chiang, David ; Cholak, Peter ; Pillay, Anand: Tighter Bounds on
the Expressivity of Transformer Encoders. In: Krause, Andreas (Hrsg.)
; Brunskill, Emma (Hrsg.) ; Cho, Kyunghyun (Hrsg.) ; Engelhardt,
Barbara (Hrsg.) ; Sabato, Sivan (Hrsg.) ; Scarlett, Jonathan (Hrsg.):
International Conference on Machine Learning, ICML 2023, 23-29 July
2023, Honolulu, Hawaii, USA Bd. 202, PMLR, 2023 (Proceedings of Ma-
chine Learning Research), 5544–5562
[GD23] Gu, Albert ; Dao, Tri: Mamba: Linear-Time Sequence Modeling with
Selective State Spaces. In: CoRR abs/2312.00752 (2023). http://dx.doi.
org/10.48550/ARXIV.2312.00752. – DOI 10.48550/ARXIV.2312.00752
[HAF22] Hao, Yiding ; Angluin, Dana ; Frank, Robert: Formal Language Recog-
nition by Hard Attention Transformers: Perspectives from Circuit Com-
plexity. In: Trans. Assoc. Comput. Linguistics 10 (2022), 800–810. https:
//transacl.org/ojs/index.php/tacl/article/view/3765
[Lad75] Ladner, Richard E.: The circuit value problem is log space complete for P.
In: SIGACT News 7 (1975), Nr. 1, 18–20. http://dx.doi.org/10.1145/
990518.990519. – DOI 10.1145/990518.990519
; Globerson, Amir (Hrsg.) ; Saenko, Kate (Hrsg.) ; Hardt, Moritz
(Hrsg.) ; Levine, Sergey (Hrsg.): Advances in Neural Information Process-
ing Systems 36: Annual Conference on Neural Information Processing Sys-
tems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023,
2023
[PW17] Press, Ofir ; Wolf, Lior: Using the Output Embedding to Improve Lan-
guage Models. In: Lapata, Mirella (Hrsg.) ; Blunsom, Phil (Hrsg.) ;
Koller, Alexander (Hrsg.): Proceedings of the 15th Conference of the Eu-
ropean Chapter of the Association for Computational Linguistics, EACL
2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, Associa-
tion for Computational Linguistics, 2017, 157–163
[RWC+ 19] Radford, Alec ; Wu, Jeff ; Child, Rewon ; Luan, David ; Amodei,
Dario ; Sutskever, Ilya: Language Models are Unsupervised Multitask
Learners. (2019)
[Str23] Strobl, Lena: Average-Hard Attention Transformers are Constant-
Depth Uniform Threshold Circuits. In: CoRR abs/2308.03212
(2023). http://dx.doi.org/10.48550/ARXIV.2308.03212. – DOI
10.48550/ARXIV.2308.03212
[VSP+ 17] Vaswani, Ashish ; Shazeer, Noam ; Parmar, Niki ; Uszkoreit, Jakob
; Jones, Llion ; Gomez, Aidan N. ; Kaiser, Lukasz ; Polosukhin, Illia:
Attention is All you Need. In: Guyon, Isabelle (Hrsg.) ; Luxburg, Ulrike
von (Hrsg.) ; Bengio, Samy (Hrsg.) ; Wallach, Hanna M. (Hrsg.) ; Fer-
gus, Rob (Hrsg.) ; Vishwanathan, S. V. N. (Hrsg.) ; Garnett, Roman
(Hrsg.): Advances in Neural Information Processing Systems 30: Annual
Conference on Neural Information Processing Systems 2017, December 4-
9, 2017, Long Beach, CA, USA, 2017, 5998–6008