12. Transformers
Transformers were originally developed for natural language processing, or NLP (where a ‘natural’ language is one such as English or Mandarin), and
have greatly surpassed the previous state-of-the-art approaches based on recurrent
neural networks (RNNs). Transformers have subsequently been found to achieve
excellent results in many other domains. For example, vision transformers often
outperform CNNs in image processing tasks, whereas multimodal transformers that
combine multiple types of data, such as text, images, audio, and video, are amongst
the most powerful deep learning models.
One major advantage of transformers is that transfer learning is very effective, so
that a transformer model can be trained on a large body of data and then the trained
model can be applied to many downstream tasks using some form of fine-tuning. A
large-scale model that can subsequently be adapted to solve multiple different tasks
is known as a foundation model. Furthermore, transformers can be trained in a self-
supervised way using unlabelled data, which is especially effective with language
models since transformers can exploit vast quantities of text available from the inter-
net and other sources. The scaling hypothesis asserts that simply by increasing the
scale of the model, as measured by the number of learnable parameters, and train-
ing on a commensurately large data set, significant improvements in performance
can be achieved, even with no architectural changes. Moreover, the transformer is
especially well suited to massively parallel processing hardware such as graphical
processing units, or GPUs, allowing exceptionally large neural network language
models having of the order of a trillion ($10^{12}$) parameters to be trained in reason-
able time. Such models have extraordinary capabilities and show clear indications
of emergent properties that have been described as the early signs of artificial general
intelligence (Bubeck et al., 2023).
The architecture of a transformer can seem complex, or even daunting, to a
newcomer as it involves multiple different components working together, in which
the various design choices can seem arbitrary. In this chapter we therefore aim to give
a comprehensive step-by-step introduction to all the key ideas behind transformers
and to provide clear intuition to motivate the design of the various elements. We first
describe the transformer architecture and then focus on natural language processing,
before exploring other application domains.
12.1. Attention
The fundamental concept that underpins a transformer is attention. This was originally developed as an enhancement to RNNs (Section 12.2.5) for machine translation (Bahdanau, Cho, and Bengio, 2014). However, Vaswani et al. (2017) later showed that signifi-
cantly improved performance could be obtained by eliminating the recurrence struc-
ture and instead focusing exclusively on the attention mechanism. Today, transform-
ers based on attention have completely superseded RNNs in almost all applications.
We will motivate the use of attention using natural language as an example,
Figure 12.1 Schematic illustration of attention in which the interpretation of the word ‘bank’ is influenced by the
words ‘river’ and ‘swam’, with the thickness of each line being indicative of the strength of its influence.
although it has much broader applicability. Consider the following two sentences:
I swam across the river to get to the other bank.
I walked across the road to get cash from the bank.
Here the word ‘bank’ has different meanings in the two sentences. However, this
can be detected only by looking at the context provided by other words in the se-
quence. We also see that some words are more important than others in determining
the interpretation of ‘bank’. In the first sentence, the words ‘swam’ and ‘river’ most
strongly indicate that ‘bank’ refers to the side of a river, whereas in the second sen-
tence, the word ‘cash’ is a strong indicator that ‘bank’ refers to a financial institution.
We see that to determine the appropriate interpretation of ‘bank’, a neural network
processing such a sentence should attend to, in other words rely more heavily on,
specific words from the rest of the sequence. This concept of attention is illustrated
in Figure 12.1.
Moreover, we also see that the particular locations that should receive more
attention depend on the input sequence itself: in the first sentence it is the second and
fifth words that are important whereas in the second sentence it is the eighth word.
In a standard neural network, different inputs will influence the output to different
extents according to the values of the weights that multiply those inputs. Once the
network is trained, however, those weights, and their associated inputs, are fixed.
By contrast, attention uses weighting factors whose values depend on the specific
input data. Figure 12.2 shows the attention weights from a section of a transformer
network trained on natural language.
When we discuss natural language processing, we will see how word embed-
ding can be used to map words into vectors in an embedding space. These vectors
can then be used as inputs for subsequent neural network processing. These embed-
dings capture elementary semantic properties, for example by mapping words with
similar meanings to nearby locations in the embedding space. One characteristic of
such embeddings is that a given word always maps to the same embedding vector.
Figure 12.2 An example of learned attention weights. [From Vaswani et al. (2017) with permission.]
[Figure: the data matrix X of dimension N × D, in which row n is the transposed token vector $\mathbf{x}_n^{\mathrm{T}}$, with N rows corresponding to tokens and D columns corresponding to features.]
We can then apply multiple transformer layers in succession to construct deep net-
works capable of learning rich internal representations. Each transformer layer con-
tains its own weights and biases, which can be learned by gradient descent using an appropriate cost function, as we will discuss in detail later in the chapter (Section 12.3).
A single transformer layer itself comprises two stages. The first stage, which im-
plements the attention mechanism, mixes together the corresponding features from
different token vectors across the columns of the data matrix, whereas the second
stage then acts on each row independently and transforms the features within each
token vector. We start by looking at the attention mechanism.
One simple way to do this is to define each output vector $\mathbf{y}_n$ to be a linear combination of the input vectors,
$$\mathbf{y}_n = \sum_{m=1}^{N} a_{nm}\mathbf{x}_m, \tag{12.2}$$
where the coefficients $a_{nm}$ are called attention weights. The coefficients should be close to zero for
input tokens that have little influence on the output yn and largest for inputs that
have most influence. We therefore constrain the coefficients to be non-negative to
avoid situations in which one coefficient can become large and positive while another
coefficient compensates by becoming large and negative. We also want to ensure that
if an output pays more attention to a particular input, this will be at the expense of
paying less attention to the other inputs, and so we constrain the coefficients to sum
to unity. Thus, the weighting coefficients must satisfy the following two constraints:
$$a_{nm} \geqslant 0 \tag{12.3}$$
$$\sum_{m=1}^{N} a_{nm} = 1. \tag{12.4}$$
Together these imply that each coefficient lies in the range $0 \leqslant a_{nm} \leqslant 1$ and so the coefficients define a ‘partition of unity’ (Exercise 12.1). For the special case $a_{mm} = 1$, it follows that $a_{mn} = 0$ for $n \neq m$, and therefore $\mathbf{y}_m = \mathbf{x}_m$ so that the input vector is unchanged by the transformation. More generally, the output $\mathbf{y}_m$ is a blend of the input vectors with some inputs given more weight than others.
Note that we have a different set of coefficients for each output vector yn , and
the constraints (12.3) and (12.4) apply separately for each value of n. These co-
efficients anm depend on the input data, and we will shortly see how to calculate
them.
12.1.3 Self-attention
The next question is how to determine the coefficients anm . Before we discuss
this in detail, it is useful to first introduce some terminology taken from the field of
information retrieval. Consider the problem of choosing which movie to watch in
an online movie streaming service. One approach would be to associate each movie
with a list of attributes describing things such as the genre (comedy, action, etc.), the
names of the leading actors, the length of the movie, and so on. The user could then
search through a catalogue to find a movie that matches their preferences. We could
automate this by encoding the attributes of each movie in a vector called the key.
The corresponding movie file itself is called a value. Similarly, the user could then
provide their own personal vector of values for the desired attributes, which we call
the query. The movie service could then compare the query vector with all the key
vectors to find the best match and send the corresponding movie to the user in the
form of the value file. We can think of the user ‘attending’ to the particular movie
whose key most closely matches their query. This would be considered a form of
hard attention in which a single value vector is returned. For the transformer, we
generalize this to soft attention in which we use continuous variables to measure
the degree of match between queries and keys and we then use these variables to
weight the influence of the value vectors on the outputs. This will also ensure that
the transformer function is differentiable and can therefore be trained by gradient
descent.
Following the analogy with information retrieval, we can view each of the input
vectors xn as a value vector that will be used to create the output tokens. We also use
the vector xn directly as the key vector for input token n. That would be analogous
to using the movie itself to summarize the characteristics of the movie. Finally, we
can use xm as the query vector for output ym , which can then be compared to each
of the key vectors. To see how much the token represented by xn should attend to
the token represented by xm , we need to work out how similar these vectors are.
One simple measure of similarity is to take their dot product $\mathbf{x}_n^{\mathrm{T}}\mathbf{x}_m$. To impose the constraints (12.3) and (12.4), we can define the weighting coefficients $a_{nm}$ by using the softmax function (Section 5.3) to transform the dot products:
$$a_{nm} = \frac{\exp\left(\mathbf{x}_n^{\mathrm{T}}\mathbf{x}_m\right)}{\sum_{m'=1}^{N}\exp\left(\mathbf{x}_n^{\mathrm{T}}\mathbf{x}_{m'}\right)}. \tag{12.5}$$
Note that in this case there is no probabilistic interpretation of the softmax function
and it is simply being used to normalize the attention weights appropriately.
So in summary, each input vector $\mathbf{x}_n$ is transformed to a corresponding output vector $\mathbf{y}_n$ by taking a linear combination of input vectors of the form (12.2) in which the weight $a_{nm}$ applied to input vector $\mathbf{x}_m$ is given by the softmax function (12.5) defined in terms of the dot product $\mathbf{x}_n^{\mathrm{T}}\mathbf{x}_m$ between the query $\mathbf{x}_n$ for input $n$ and the key $\mathbf{x}_m$ associated with input $m$. Note that, if all the input vectors are orthogonal, then each output vector is simply equal to the corresponding input vector, so that $\mathbf{y}_m = \mathbf{x}_m$ for $m = 1, \ldots, N$ (Exercise 12.3).
We can write (12.2) in matrix notation by using the data matrix X, along with the analogous N × D output matrix Y, whose rows are given by $\mathbf{y}_n^{\mathrm{T}}$, so that
$$\mathbf{Y} = \mathrm{Softmax}\left[\mathbf{X}\mathbf{X}^{\mathrm{T}}\right]\mathbf{X} \tag{12.6}$$
We would also like to have the flexibility to focus more on some features than others when determining token similarity. We can address both issues if we define modified feature vectors
given by a linear transformation of the original vectors in the form
$$\widetilde{\mathbf{X}} = \mathbf{X}\mathbf{U} \tag{12.7}$$
Although this has much more flexibility, it has the property that the matrix
$$\mathbf{X}\mathbf{U}\mathbf{U}^{\mathrm{T}}\mathbf{X}^{\mathrm{T}} \tag{12.9}$$
that determines the attention weights is symmetric in the two token indices, whereas we may want to allow asymmetric attention patterns. Following the information-retrieval analogy, we therefore define separate query, key, and value matrices, each given by its own linear transformation of the input:
$$\mathbf{Q} = \mathbf{X}\mathbf{W}^{(q)} \tag{12.10}$$
$$\mathbf{K} = \mathbf{X}\mathbf{W}^{(k)} \tag{12.11}$$
$$\mathbf{V} = \mathbf{X}\mathbf{W}^{(v)} \tag{12.12}$$
where the weight matrices W(q) , W(k) , and W(v) represent parameters that will
be learned during the training of the final transformer architecture. Here the matrix
W(k) has dimensionality D × Dk where Dk is the length of the key vector. The
matrix W(q) must have the same dimensionality D × Dk as W(k) so that we can
form dot products between the query and key vectors. A typical choice is Dk = D.
Similarly, W(v) is a matrix of size D × Dv , where Dv governs the dimensionality of
the output vectors. If we set Dv = D, so that the output representation has the same
dimensionality as the input, this will facilitate the inclusion of residual connections,
which we discuss later (Section 12.1.7). Also, multiple transformer layers can be stacked on top of
each other if each layer has the same dimensionality. We can then generalize (12.6) to give
$$\mathbf{Y} = \mathrm{Softmax}\left[\mathbf{Q}\mathbf{K}^{\mathrm{T}}\right]\mathbf{V} \tag{12.13}$$
where $\mathbf{Q}\mathbf{K}^{\mathrm{T}}$ has dimension N × N, and the matrix Y has dimension N × $D_{\mathrm{v}}$. The calculation of the matrix $\mathbf{Q}\mathbf{K}^{\mathrm{T}}$ is illustrated in Figure 12.4, whereas the evaluation of the matrix Y is illustrated in Figure 12.5.
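To make these definitions concrete, the following NumPy sketch evaluates a single self-attention head according to (12.10) to (12.13); the names are illustrative rather than taken from any particular library, and the optional scaling of the dot products corresponds to the scaled form referred to later by (12.14).

```python
import numpy as np

def softmax(A, axis=-1):
    A = A - A.max(axis=axis, keepdims=True)    # subtract the row maximum for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, scaled=False):
    """Single-head self-attention following (12.10)-(12.13).

    X has shape (N, D); Wq and Wk have shape (D, Dk); Wv has shape (D, Dv).
    Setting scaled=True divides the dot products by sqrt(Dk).
    """
    Q = X @ Wq                                 # queries, shape (N, Dk)
    K = X @ Wk                                 # keys, shape (N, Dk)
    V = X @ Wv                                 # values, shape (N, Dv)
    scores = Q @ K.T                           # (N, N) matrix of dot products
    if scaled:
        scores = scores / np.sqrt(Q.shape[-1])
    A = softmax(scores)                        # each row sums to one, as required by (12.3) and (12.4)
    return A @ V                               # output Y of shape (N, Dv)
```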
Figure 12.4 Illustration of the evaluation of the matrix $\mathbf{Q}\mathbf{K}^{\mathrm{T}}$, which determines the attention coefficients in a transformer. The input X is separately transformed using (12.10) and (12.11) to give the query matrix Q and key matrix K, respectively, which are then multiplied together.
In practice, there might be multiple patterns of attention that are relevant at the same time. In natu-
ral language, for example, some patterns might be relevant to tense whereas others
might be associated with vocabulary. Using a single attention head can lead to av-
eraging over these effects. Instead we can use multiple attention heads in parallel.
These consist of identically structured copies of the single head, with independent
learnable parameters that govern the calculation of the query, key, and value matri-
ces. This is analogous to using multiple different filters in each layer of a convolu-
tional network.
Suppose we have H heads indexed by h = 1, . . . , H of the form
$$\mathbf{H}_h = \mathrm{Attention}(\mathbf{Q}_h, \mathbf{K}_h, \mathbf{V}_h) \tag{12.15}$$
where Attention(·, ·, ·) is given by (12.14), and we have defined separate query, key,
and value matrices for each head using
$$\mathbf{Q}_h = \mathbf{X}\mathbf{W}_h^{(q)} \tag{12.16}$$
$$\mathbf{K}_h = \mathbf{X}\mathbf{W}_h^{(k)} \tag{12.17}$$
$$\mathbf{V}_h = \mathbf{X}\mathbf{W}_h^{(v)}. \tag{12.18}$$
The heads are first concatenated into a single matrix, and the result is then linearly
transformed using a matrix W(o) to give a combined output in the form
$$\mathbf{Y}(\mathbf{X}) = \mathrm{Concat}\left[\mathbf{H}_1, \ldots, \mathbf{H}_H\right]\mathbf{W}^{(o)} \tag{12.19}$$
where the dimension Dv of the value vectors in each head is typically chosen to be equal to D/H so that the resulting concatenated matrix has dimension
N × D. Multi-head attention is summarized in Algorithm 12.2, and the information
flow in a multi-head attention layer is illustrated in Figure 12.8.
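As an informal sketch of (12.15) to (12.18), together with the concatenation and output projection of (12.19), the following code computes each head in turn and then combines them; all names are illustrative, and one set of weight matrices is supplied per head.

```python
import numpy as np

def multi_head_self_attention(X, Wq_heads, Wk_heads, Wv_heads, Wo):
    """Multi-head self-attention: one (Wq, Wk, Wv) triple per head, then concat and W^(o).

    Wq_heads, Wk_heads, Wv_heads are lists of matrices of shape (D, Dk) or (D, Dv),
    and Wo has shape (H * Dv, D).
    """
    heads = []
    for Wq, Wk, Wv in zip(Wq_heads, Wk_heads, Wv_heads):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(Q.shape[-1])          # scaled dot products
        scores = scores - scores.max(axis=-1, keepdims=True)
        A = np.exp(scores)
        A = A / A.sum(axis=-1, keepdims=True)            # row-wise softmax attention weights
        heads.append(A @ V)                              # output of one head, shape (N, Dv)
    H = np.concatenate(heads, axis=-1)                   # concatenated heads, shape (N, H * Dv)
    return H @ Wo                                        # combined output, shape (N, D)
```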
Note that the formulation of multi-head attention given above, which follows
that used in the research literature, includes some redundancy in the successive mul-
tiplication of the W(v) matrix for each head and the output matrix W(o) . Removing
this redundancy allows a multi-head self-attention layer to be written as a sum over contributions from each of the heads separately (Exercise 12.5).
Figure 12.8 Information flow in a multi-head attention layer. The associated computation, given by
Algorithm 12.2, is illustrated in Figure 12.7.
To improve training efficiency (Section 9.5), we can introduce residual connections that bypass the multi-head structure. To do this we require that the output dimensionality is the same as the input dimensionality, namely N × D. This is then followed by layer normalization (Ba, Kiros, and Hinton, 2016; see Section 7.4.3), which also improves training efficiency. The resulting trans-
formation can be written as
$$\mathbf{Z} = \mathrm{LayerNorm}\left[\mathbf{Y}(\mathbf{X}) + \mathbf{X}\right] \tag{12.20}$$
where Y is defined by (12.19). Sometimes the layer normalization is replaced by
pre-norm in which the normalization layer is applied before the multi-head self-
attention instead of after, as this can result in more effective optimization, in which
case we have
$$\mathbf{Z} = \mathbf{Y}(\mathbf{X}') + \mathbf{X}, \quad \text{where } \mathbf{X}' = \mathrm{LayerNorm}\left[\mathbf{X}\right]. \tag{12.21}$$
In each case, Z again has the same dimensionality N × D as the input matrix X.
We have seen that the attention mechanism creates linear combinations of the
value vectors, which are then linearly combined to produce the output vectors. Also,
the values are linear functions of the input vectors, and so we see that the outputs
of an attention layer are constrained to be linear combinations of the inputs. Non-
linearity does enter through the attention weights, and so the outputs will depend
nonlinearly on the inputs via the softmax function, but the output vectors are still
constrained to lie in the subspace spanned by the input vectors and this limits the
expressive capabilities of the attention layer. We can enhance the flexibility of the
transformer by post-processing the output of each layer using a standard nonlinear
neural network with D inputs and D outputs, denoted MLP[·] for ‘multilayer per-
ceptron’. For example, this might consist of a two-layer fully connected network
with ReLU hidden units. This needs to be done in a way that preserves the ability
of the transformer to process sequences of variable length. To achieve this, the same
shared network is applied to each of the output vectors, corresponding to the rows of
Z. Again, this neural network layer can be improved by using a residual connection.
It also includes layer normalization so that the final output from the transformer layer
has the form
$$\widetilde{\mathbf{X}} = \mathrm{LayerNorm}\left[\mathrm{MLP}\left[\mathbf{Z}\right] + \mathbf{Z}\right]. \tag{12.22}$$
This leads to an overall architecture for a transformer layer shown in Figure 12.9 and
summarized in Algorithm 12.3. Again, we can use a pre-norm instead, in which case
the final output is given by
$$\widetilde{\mathbf{X}} = \mathrm{MLP}(\mathbf{Z}') + \mathbf{Z}, \quad \text{where } \mathbf{Z}' = \mathrm{LayerNorm}\left[\mathbf{Z}\right]. \tag{12.23}$$
In a typical transformer there are multiple such layers stacked on top of each other.
The layers generally have identical structures, although there is no sharing of weights
and biases between different layers.
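Putting these pieces together, the following sketch shows a post-norm transformer layer corresponding to (12.20) and (12.22). Here attention_fn stands for any multi-head self-attention function of the kind sketched earlier, and the learnable gain and bias of layer normalization are omitted for brevity; the names are illustrative.

```python
import numpy as np

def layer_norm(Z, eps=1e-5):
    """Normalize each row (token) to zero mean and unit variance; learnable gain and bias omitted."""
    mean = Z.mean(axis=-1, keepdims=True)
    std = Z.std(axis=-1, keepdims=True)
    return (Z - mean) / (std + eps)

def transformer_layer(X, attention_fn, W1, b1, W2, b2):
    """Post-norm transformer layer following (12.20) and (12.22).

    attention_fn maps the (N, D) input to an (N, D) multi-head self-attention output,
    and (W1, b1, W2, b2) define a shared two-layer MLP with ReLU hidden units.
    """
    Z = layer_norm(attention_fn(X) + X)            # residual connection and layer norm, as in (12.20)
    hidden = np.maximum(0.0, Z @ W1 + b1)          # the same MLP is applied to every row of Z
    return layer_norm(hidden @ W2 + b2 + Z)        # second residual connection and layer norm, (12.22)
```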
The cost of evaluating the dot products in a self-attention layer is $O(N^2 D)$. We can think of a self-attention layer as a sparse matrix in which parameters are shared between specific blocks of the matrix (Exercise 12.6). The subsequent neural network layer, which has D inputs and D outputs, has a cost that is $O(D^2)$. Since it is shared across tokens, it has a complexity that is linear in N, and therefore overall this layer has a cost that is $O(ND^2)$. Depending on the relative sizes of N and D, either the self-attention layer
or the MLP layer may dominate the computational cost. Compared to a fully con-
nected network, a transformer layer is computationally more efficient. Many vari-
ants of the transformer architecture have been proposed (Lin et al., 2021; Phuong
and Hutter, 2022) including modifications aimed at improving efficiency (Tay et al.,
2020).
Figure 12.10 Illustrations of the functions defined by (12.25) and used to construct position-encoding vectors.
(a) A plot in which the horizontal axis shows the different components of the embedding vector r whereas the
vertical axis shows the position in the sequence. The values of the vector elements for two positions n and m are
shown by the intersections of the sine and cosine curves with the horizontal grey lines. (b) A heat map illustration
of the position-encoding vectors defined by (12.25) for dimension D = 100 with L = 30 for the first N = 200
positions.
This encoding has the property that the elements of the vector rn all lie in the
range (−1, 1). It is reminiscent of the way binary numbers are represented, with the
lowest order bit alternating with high frequency, and subsequent bits alternating with
steadily decreasing frequencies:
1: 0 0 0 1
2: 0 0 1 0
3: 0 0 1 1
4: 0 1 0 0
5: 0 1 0 1
6: 0 1 1 0
7: 0 1 1 1
8: 1 0 0 0
9: 1 0 0 1
For the encoding given by (12.25), however, the vector elements are continuous
variables rather than binary. A plot of the position-encoding vectors is shown in
Figure 12.10(b).
One nice property of the sinusoidal representation given by (12.25) is that, for
any fixed offset k, the encoding at position n + k can be represented as a linear combination of the encoding at position n (Exercise 12.10), in which the coefficients do not depend on the absolute position but only on the value of k. The network should therefore be
able to learn to attend to relative positions. Note that this property requires that the
encoding makes use of both sine and cosine functions.
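Although the precise definition (12.25) is not reproduced above, the widely used sinusoidal construction pairs sine and cosine functions whose frequencies decrease geometrically across the components. The following sketch shows this standard form under the assumption that the embedding dimension D is even; the base L is an illustrative parameter.

```python
import numpy as np

def positional_encoding(N, D, L=10_000.0):
    """Sinusoidal position encodings: one row r_n per position, with paired sin/cos components.

    Assumes D is even. All values lie in the range (-1, 1).
    """
    R = np.zeros((N, D))
    positions = np.arange(N)[:, None]               # n = 0, ..., N-1
    i = np.arange(0, D, 2)[None, :]                 # even component indices
    angles = positions / (L ** (i / D))             # oscillation frequency decreases with i
    R[:, 0::2] = np.sin(angles)                     # even components use sine
    R[:, 1::2] = np.cos(angles)                     # odd components use cosine
    return R
```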
Another popular approach to positional representation is to use learned position
encodings. This is done by having a vector of weights at each token position that
can be learned jointly with the rest of the model parameters during training, and
avoids using hand-crafted representations. Because the parameters are not shared between the token positions, the representation is no longer invariant under permutations of the tokens, which is precisely the purpose of a positional encoding. However, this approach does not
meet the criteria we mentioned earlier of generalizing to longer input sequences,
as the encoding will be untrained for positional encodings not seen during training.
Therefore, this approach is generally most suitable when the input length is relatively
constant during both training and inference.
12.2. Natural Language
Now that we have studied the architecture of the transformer, we will explore how
this can be used to process language data consisting of words, sentences, and para-
graphs. Although this is the modality that transformers were originally developed to
operate on, they have proved to be a very general class of models and have become
the state-of-the-art for most input data types. Later in this chapter (Section 12.4) we will look at their use in other domains.
Many languages, including English, comprise a series of words separated by white space, along with punctuation symbols, and therefore represent an example of sequential data (Section 11.3). For the moment we will focus on the words, and we will return to punctuation later (Section 12.2.2).
The first challenge is to convert the words into a numerical representation that
is suitable for use as the input to a deep neural network. One simple approach is to
define a fixed dictionary of words and then introduce vectors of length equal to the
size of the dictionary along with a ‘one hot’ representation for each word, in which
the kth word in the dictionary is encoded with a vector having a 1 in position k and
0 in all other positions. For example if ‘aardwolf’ is the third word in our dictionary
then its vector representation would be (0, 0, 1, 0, . . . , 0).
An obvious problem with a one-hot representation is that a realistic dictionary
might have several hundred thousand entries leading to vectors of very high dimen-
sionality. Also, it does not capture any similarities or relationships that might exist
between words. Both issues can be addressed by mapping the words into a lower-
dimensional space through a process called word embedding in which each word is
represented as a dense vector in a space of typically a few hundred dimensions.
This mapping can be expressed in terms of an embedding matrix $\mathbf{E}$ whose columns are the learned embedding vectors, so that a word with one-hot encoding $\mathbf{x}_n$ is mapped to
$$\mathbf{v}_n = \mathbf{E}\mathbf{x}_n. \tag{12.26}$$
Because $\mathbf{x}_n$ has a one-hot encoding, the vector $\mathbf{v}_n$ is simply given by the corresponding column of the matrix $\mathbf{E}$.
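As a small illustration of (12.26) using toy dimensions (the numbers here are arbitrary):

```python
import numpy as np

K = 5                        # dictionary size (toy example)
D = 3                        # embedding dimension
E = np.random.randn(D, K)    # embedding matrix, one column per dictionary entry

x_n = np.zeros(K)
x_n[2] = 1.0                 # one-hot encoding of the third word, e.g. 'aardwolf'

v_n = E @ x_n                # equation (12.26): multiplication simply selects column 2 of E
assert np.allclose(v_n, E[:, 2])
```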
We can learn the matrix E from a corpus (i.e., a large data set) of text, and
there are many approaches to doing this. Here we look at a popular technique called
word2vec (Mikolov et al., 2013), which can be viewed as a simple two-layer neural
network. A training set is constructed in which each sample is obtained by consid-
ering a ‘window’ of M adjacent words in the text, where a typical value might be
M = 5. The samples are considered to be independent, and the error function is de-
fined as the sum of the error functions for each sample. There are two variants of this
approach. In continuous bag of words, the target variable for network training is the
middle word, and the remaining context words form the inputs, so that the network
is being trained to ‘fill in the blank’. A closely related approach, called skip-grams,
reverses the inputs and outputs, so that the centre word is presented as the input and
the target values are the context words. These models are illustrated in Figure 12.11.
This training procedure can be viewed as a form of self-supervised learning since
the data consists simply of a large corpus of unlabelled text from which many small
windows of word sequences are drawn at random. Labels are obtained from the text
itself by ‘masking’ out those words whose values the network is trying to predict.
Once the model is trained, the embedding matrix E is given by the transpose
of the second-layer weight matrix for the continuous bag-of-words approach and
by the first-layer weight matrix for skip-grams. Words that are semantically related
are mapped to nearby positions in the embedding space. This is to be expected
Figure 12.11 Two-layer neural networks used to learn word embeddings, where (a) shows the continuous
bag-of-words approach, and (b) shows the skip-grams approach.
since related words are more likely to occur with similar context words compared
to unrelated words. For example, the words ‘city’ and ‘capital’ might occur with
higher frequency as context for target words such as ‘Paris’ or ‘London’ and less
frequently as context for ‘orange’ or ‘polynomial’. The network can more easily
predict the probability of the missing words if ‘Paris’ and ‘London’ are mapped to
nearby embedding vectors.
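As an informal sketch of the skip-grams variant described above, the following code trains the two weight matrices of the network on a corpus represented as a list of integer word indices. It uses a full softmax over the vocabulary rather than the sampling-based approximations used in practice, and all names are illustrative.

```python
import numpy as np

def train_skipgram(corpus, vocab_size, dim=50, window=2, lr=0.1, epochs=5):
    """corpus: list of integer word indices. Returns the first-layer weights (the embeddings)."""
    rng = np.random.default_rng(0)
    W_in = rng.normal(scale=0.1, size=(vocab_size, dim))    # first-layer weights = word embeddings
    W_out = rng.normal(scale=0.1, size=(dim, vocab_size))   # second-layer weights
    for _ in range(epochs):
        for pos, centre in enumerate(corpus):
            lo, hi = max(0, pos - window), min(len(corpus), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos == pos:
                    continue
                target = corpus[ctx_pos]                     # context word to be predicted
                h = W_in[centre]                             # hidden activation = embedding of centre word
                scores = h @ W_out
                p = np.exp(scores - scores.max())
                p /= p.sum()                                 # softmax over the vocabulary
                grad = p.copy()
                grad[target] -= 1.0                          # gradient of cross-entropy w.r.t. the scores
                grad_h = W_out @ grad                        # backpropagate to the hidden layer
                W_out -= lr * np.outer(h, grad)
                W_in[centre] -= lr * grad_h
    return W_in
```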
It turns out that the learned embedding space often has an even richer semantic
structure than just the proximity of related words, and that this allows for simple
vector arithmetic. For example, the concept that ‘Paris is to France as Rome is to
Italy’ can be expressed through operations on the embedding vectors. If we use
v(word) to denote the embedding vector for ‘word’, then we find
$$\mathbf{v}(\text{Paris}) - \mathbf{v}(\text{France}) \simeq \mathbf{v}(\text{Rome}) - \mathbf{v}(\text{Italy}).$$
12.2.2 Tokenization
One problem with using a fixed dictionary of words is that it cannot cope with
words not in the dictionary or which are misspelled. It also does not take account
of punctuation symbols or other character sequences such as computer code. An
alternative approach that addresses these problems would be to work at the level of
characters instead of using words, so that our dictionary comprises upper-case and
lower-case letters, numbers, punctuation, and white-space symbols such as spaces
and tabs. A disadvantage of this approach, however, is that it discards the semanti-
cally important word structure of language, and the subsequent neural network would
have to learn to reassemble words from elementary characters. It would also require
a much larger number of sequential steps for a given body of text, thereby increasing
the computational cost of processing the sequence.
We can combine the benefits of character-level and word-level representations
by using a pre-processing step that converts a string of words and punctuation sym-
bols into a string of tokens, which are generally small groups of characters and might
include common words in their entirety, along with fragments of longer words as
well as individual characters that can be assembled into less common words (Schus-
ter and Nakajima, 2012). This tokenization also allows the system to process other
Section 12.4.1 kinds of sequences such as computer code or even other modalities such as images.
It also means that variations of the same word can have related representations. For
example, ‘cook’, ‘cooks’, ‘cooked’, ‘cooking’, and ‘cooker’ are all related and share
the common element ‘cook’, which itself could be represented as one of the tokens.
There are many approaches to tokenization. As an example, a technique called byte pair encoding, originally used for data compression, can be adapted to text tokenization by merging characters instead of bytes (Sennrich, Haddow, and Birch, 2015).
The process starts with the individual characters and iteratively merges them into
longer strings. The list of tokens is first initialized with the list of individual char-
acters. Then a body of text is searched for the most frequently occurring adjacent
pairs of tokens and these are replaced with a new token. To ensure that words are not
merged, a new token is not formed from two tokens if the second token starts with a
white space. The process is repeated iteratively as illustrated in Figure 12.12.
Initially the number of tokens is equal to the number of characters, which is
relatively small. As tokens are formed, the total number of tokens increases, and
if this is continued long enough, the tokens will eventually correspond to the set of
words in the text. The total number of tokens is generally fixed in advance, as a
compromise between character-level and word-level representations. The algorithm
is stopped when this number of tokens is reached.
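The following is a minimal sketch of this merging loop, operating directly on a single string of characters; practical tokenizers work with word-frequency dictionaries and a number of additional refinements, and all names here are illustrative.

```python
from collections import Counter

def byte_pair_encoding(text, num_merges):
    """Toy BPE: start from individual characters and repeatedly merge the most frequent adjacent pair."""
    tokens = list(text)                                    # initial token list is the list of characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        # do not merge across words: skip pairs whose second token starts with white space
        pairs = Counter({p: c for p, c in pairs.items() if not p[1].startswith(" ")})
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]                # most frequently occurring adjacent pair
        merges.append((a, b))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)                       # replace the pair with a new, longer token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = byte_pair_encoding("low lower lowest", num_merges=10)
```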
In practical applications of deep learning to natural language, the input text is
typically first mapped into a tokenized representation. However, for the remainder of
this chapter, we will use word-level representations as this makes it easier to illustrate
and motivate key concepts.
This can be expressed as a probabilistic graphical model in which the nodes are isolated with no interconnecting links (Figure 11.28).
The distribution p(x) is shared across the variables and can be represented, with-
out loss of generality, as a simple table listing the probabilities of each of the possi-
ble states of x (corresponding to the dictionary of words or tokens). The maximum
likelihood solution for this model is obtained simply by setting each of these probabilities to the fraction of times that the word occurs in the training set (Exercise 12.11). This is known as a bag-of-words model because it completely ignores the ordering of the words.
We can use the bag-of-words approach to construct a simple text classifier. This
could be used for example in sentiment analysis in which a passage of text represent-
ing a restaurant review is to be classified as positive or negative. The naive Bayes
classifier assumes that the words are independent within each class Ck , but with a
different distribution for each class, so that
$$p(x_1, \ldots, x_N \mid \mathcal{C}_k) = \prod_{n=1}^{N} p(x_n \mid \mathcal{C}_k). \tag{12.29}$$
Given prior class probabilities p(Ck ), the posterior class probabilities for a new se-
quence are given by:
$$p(\mathcal{C}_k \mid x_1, \ldots, x_N) \propto p(\mathcal{C}_k) \prod_{n=1}^{N} p(x_n \mid \mathcal{C}_k). \tag{12.30}$$
Both the class-conditional densities p(x|Ck ) and the prior probabilities p(Ck ) can
be estimated using frequencies from the training data set. For a new sequence, the
table entries are multiplied together to get the desired posterior probabilities. Note
that if a word occurs in the test set that was not present in the training set then the
corresponding probability estimate will be zero, and so these estimates are typically
‘smoothed’ after training by reassigning a small level of probability uniformly across
all entries to avoid zero values.
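The following sketch (all names are illustrative) implements such a classifier, working with log probabilities so that the products in (12.29) and (12.30) become sums, and applying the smoothing described above.

```python
import numpy as np

def train_naive_bayes(docs, labels, vocab, alpha=1.0):
    """docs: lists of words; labels: class label per document; alpha: smoothing count."""
    classes = sorted(set(labels))
    word_idx = {w: i for i, w in enumerate(vocab)}
    log_prior, log_cond = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = np.log(len(class_docs) / len(docs))       # p(C_k) from class frequencies
        counts = np.full(len(vocab), alpha)                      # 'smoothed' counts avoid zero probabilities
        for d in class_docs:
            for w in d:
                if w in word_idx:
                    counts[word_idx[w]] += 1
        log_cond[c] = np.log(counts / counts.sum())              # table of log p(x | C_k)
    return log_prior, log_cond, word_idx

def classify(doc, log_prior, log_cond, word_idx):
    """Pick the class maximizing log p(C_k) + sum_n log p(x_n | C_k), following (12.30)."""
    scores = {c: lp + sum(log_cond[c][word_idx[w]] for w in doc if w in word_idx)
              for c, lp in log_prior.items()}
    return max(scores, key=scores.get)
```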
Without loss of generality, we can use the product rule to factorize the joint distribution over a sequence of words in the form
$$p(x_1, \ldots, x_N) = \prod_{n=1}^{N} p(x_n \mid x_1, \ldots, x_{n-1}). \tag{12.31}$$
This can be represented as a probabilistic graphical model in which each node in the sequence receives a link from every previous node (Figure 11.27). We could represent each term
on the right-hand side of (12.31) by a table whose entries are once again estimated
using simple frequency counts from the training set. However, the size of these tables grows exponentially with the length of the sequence (Exercise 12.12), and so this approach would become prohibitively expensive.
We can simplify the model dramatically by assuming that each of the condi-
tional distributions on the right-hand side of (12.31) is independent of all previous
observations except the L most recent words. For example, if L = 2 then the joint
distribution for a sequence of N observations under this model is given by
$$p(x_1, \ldots, x_N) = p(x_1)\,p(x_2 \mid x_1) \prod_{n=3}^{N} p(x_n \mid x_{n-1}, x_{n-2}). \tag{12.32}$$
In the corresponding graphical model, each node has links from the two previous nodes (Figure 11.30). Here we assume that the conditional distributions $p(x_n \mid x_{n-1}, x_{n-2})$ are shared across all variables. Again each of the distributions on the right-hand side of (12.32)
can be represented as tables whose values are estimated from the statistics of triplets
of successive words drawn from a training corpus.
The case with L = 1 is known as a bi-gram model because it depends on pairs
of adjacent words. Similarly L = 2, which involves triplets of adjacent words, is
called a tri-gram model, and in general these are called n-gram models.
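As an illustration, the following sketch estimates tri-gram tables from frequency counts of word triplets and then uses them generatively, in the manner discussed in the next paragraph, to sample new text; the function names are illustrative.

```python
import random
from collections import defaultdict, Counter

def build_trigram_model(words):
    """Estimate p(x_n | x_{n-2}, x_{n-1}) from frequency counts of word triplets."""
    counts = defaultdict(Counter)
    for a, b, c in zip(words, words[1:], words[2:]):
        counts[(a, b)][c] += 1
    return counts

def sample_text(counts, first, second, length=20):
    """Generate text by repeatedly sampling the next word given the two previous words."""
    out = [first, second]
    for _ in range(length):
        options = counts.get((out[-2], out[-1]))
        if not options:
            break
        next_words, freqs = zip(*options.items())
        out.append(random.choices(next_words, weights=freqs)[0])
    return " ".join(out)
```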
All the models discussed so far in this section can be run generatively to synthe-
size novel text. For example, if we provide the first and second words in a sequence,
then we can sample from the tri-gram statistics p(xn |xn−1 , xn−2 ) to generate the
third word, and then we can use the second and third words to sample the fourth
word, and so on. The resulting text, however, will be incoherent because each word
is predicted only on the basis of the two previous words. High-quality text models
must take account of the long-range dependencies in language. On the other hand,
we cannot simply increase the value of L because the size of the probability tables
grows exponentially in L so that it is prohibitively expensive to go much beyond
tri-gram models. However, the autoregressive representation will play a central role
when we consider modern language models based not on probability tables but on
deep neural networks configured as transformers.
One way to allow longer-range dependencies, while avoiding the exponential
growth in the number of parameters of an n-gram model, is to use a hidden Markov model (Section 11.3.1) whose graphical structure is shown in Figure 11.31. The number of learn-
able parameters is governed by the dimensionality of the latent variables whereas
the distribution over a given observation xn depends, in principle, on all previous
observations. However, the influence of more distant observations is still very lim-
ited since their effect must be carried through the chain of latent states which are
themselves being updated by more recent observations.
Figure 12.14 An example of a recurrent neural network used for language translation. See the text for details.
Here the initial hidden state z0 can be set to a default value such as (0, 0, . . . , 0)T.
As an example of how an RNN might be used in practice, consider the spe-
cific task of translating sentences from English into Dutch. The sentences can have
variable length, and each output sentence might have a different length from the cor-
responding input sentence. Furthermore, the network may need to see the whole of
the input sentence before it can even start to generate the output sentence. We can
address this using an RNN by feeding in the complete English sentence followed by
a special input token, which we denote by ⟨start⟩, to trigger the start of translation.
During training the network learns to associate ⟨start⟩ with the beginning of the
output sentence. We also take each successively generated word and feed it into the
input at the next time step, as shown in Figure 12.14. The network can be trained
to generate a specific ⟨stop⟩ token to signify the completion of the translation. The
first few stages of the network are used to absorb the input sequence, and the associ-
ated output vectors are simply ignored. This part of the network can be viewed as an ‘encoder’ in which the entire input sentence has been compressed into the state z∗ of
the hidden variable. The remaining network stages function as the ‘decoder’, which
generates the translated sentence as output one word at a time. Notice that each out-
put word is fed as input to the next stage of the network, and so this approach has an
autoregressive structure analogous to (12.31).
12.3. Transformer Language Models
The transformer processing layer is a highly flexible component for building pow-
erful neural network models with broad applicability. In this section we explore the
application of transformers to natural language. This has given rise to the develop-
ment of massive neural networks known as large language models (LLMs), which
have proven to be exceptionally capable (Zhao et al., 2023).
Transformers can be applied to many different kinds of language processing task, and these can be grouped into three categories according to the form of the input and output data. In a problem such as sentiment analysis, we take a sequence of
words as input and provide a single variable representing the sentiment of the text,
for example happy or sad, as output. Here a transformer is acting as an ‘encoder’
of the sequence. Other problems might take a single vector as input and generate
a word sequence as output, for example if we wish to generate a text caption given
an input image. In such cases the transformer functions as a ‘decoder’, generating
where $\mathbf{Y}$ is a matrix whose nth row is $\mathbf{y}_n^{\mathrm{T}}$, and $\widetilde{\mathbf{X}}$ is a matrix whose nth row is $\widetilde{\mathbf{x}}_n^{\mathrm{T}}$. Each softmax output unit has an associated cross-entropy error function (Section 5.4.4).
Figure 12.15 Architecture of a GPT decoder transformer network. Here ‘LSM’ stands for linear-softmax and
denotes a linear transformation whose learnable parameters are shared across the token positions, followed by
a softmax activation function. Masking is explained in the text.
Each token therefore acts both as a target value, to be predicted from the tokens that precede it, and as an input value for subsequent tokens. For example, consider the word sequence
I swam across the river to get to the other bank.
We can use ‘I swam across’ as an input sequence with an associated target of ‘the’,
and also use ‘I swam across the’ as an input sequence with an associated target of
‘river’, and so on. However, to process these in parallel we have to ensure that the
network is not able to ‘cheat’ by looking ahead in the sequence, otherwise it will
simply learn to copy the next input directly to the output. If it did this, it would
then be unable to generate new sequences since the subsequent token by definition is
not available at test time. To address this problem we do two things. First, we shift
the input sequence to the right by one step, so that input $x_n$ corresponds to output $y_{n+1}$, with target $x_{n+1}$, and an additional special token denoted ⟨start⟩ is prepended in the first position of the input sequence. Second, note that the tokens in a
transformer are processed independently, except when they are used to compute the
attention weights, when they interact in pairs through the dot product. We therefore
introduce masked attention, sometimes called causal attention, into each of the attention layers, in which we set to zero all of the attention weights that correspond to
a token attending to any later token in the sequence. This simply involves setting to
zero all the corresponding elements of the attention matrix Attention(Q, K, V) de-
fined by (12.14) and then normalizing the remaining elements so that each row once
again sums to one. In practice, this can be achieved by setting the corresponding
pre-activation values to −∞ so that the softmax evaluates to zero for the associated
outputs and also takes care of the normalization across the non-zero outputs. The
structure of the masked attention matrix is illustrated in Figure 12.16, which shows, for the sequence ‘⟨start⟩ I swam across the river’, that the output at the position of ‘across’, for example, can depend only on the input tokens ‘⟨start⟩’, ‘I’, and ‘swam’.
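The following sketch illustrates how the mask can be implemented by setting the corresponding pre-activation values to −∞ before the softmax; the function name and interface are illustrative.

```python
import numpy as np

def masked_attention(Q, K, V):
    """Causal (masked) self-attention: token n may attend only to tokens 1, ..., n."""
    N, Dk = Q.shape
    scores = Q @ K.T / np.sqrt(Dk)
    mask = np.triu(np.ones((N, N), dtype=bool), k=1)       # True above the diagonal = future positions
    scores = np.where(mask, -np.inf, scores)               # set masked pre-activations to -infinity
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)                                     # exp(-inf) = 0, so masked entries get zero weight
    A = A / A.sum(axis=-1, keepdims=True)                  # the softmax renormalizes the remaining entries
    return A @ V
```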
In practice, we wish to make efficient use of the massive parallelism of GPUs,
and hence multiple sequences may be stacked together into an input tensor for par-
allel processing in a single batch. However, this requires the sequences to be of the
same length, whereas text sequences naturally have variable length. This can be ad-
dressed by introducing a specific token, which we denote by ⟨pad⟩, that is used to
fill unused positions to bring all sequences up to the same length so that they can
be combined into a single tensor. An additional mask is then used in the attention
weights to ensure that the output vectors do not pay attention to any inputs occupied
by the ⟨pad⟩ token. Note that the form of this mask depends on the particular input
sequence.
The output of the trained model is a probability distribution over the space of
tokens, given by the softmax output activation function, which represents the prob-
ability of the next token given the current token sequence. Once this next word is
chosen, the token sequence with the new token included can then be fed through the
model again to generate the subsequent token in the sequence, and this process can
be repeated indefinitely or until an end-of-sequence token is generated. This may ap-
pear to be quite inefficient since data must be fed through the whole model for each
new generated token. However, note that due to the masked attention, the embedding
learned for a particular token depends only on that token itself and on earlier tokens
and hence does not change when a new, later token is generated. Consequently, much
of the computation can be recycled when processing a new token.
If there are N steps in the sequence and the number of token values in the dictionary is K then the total number of sequences is $O(K^N)$, which grows exponentially with the length of the sequence, and hence finding the single most probable sequence is infeasible. By comparison, greedy search has cost $O(KN)$, which is linear in the
sequence length.
One technique that has the potential to generate higher probability sequences
than greedy search is called beam search. Instead of choosing the single most proba-
ble token value at each step, we maintain a set of B hypotheses, where B is called the
beam width, each consisting of a sequence of token values up to step n. We then feed
all these sequences through the network, and for each sequence we find the B most
probable token values, thereby creating $B^2$ possible hypotheses for the extended
sequence. This list is then pruned by selecting the most probable B hypotheses ac-
cording to the total probability of the extended sequence. Thus, the beam search
algorithm maintains B alternative sequences and keeps track of their probabilities,
finally selecting the most probable sequence amongst those considered. Because the
probability of a sequence is obtained by multiplying the probabilities at each step of
the sequence and since these probabilities are always less than or equal to one, a long
sequence will generally have a lower probability than a short one, biasing the results
towards short sequences. For this reason the sequence probabilities are generally
normalized by the corresponding lengths of the sequence before making compar-
isons. Beam search has cost O(BKN ), which is again linear in the sequence length.
However, the cost of generating a sequence is increased by a factor of B, and so for
very large language models, where the cost of inference can become significant, this
makes beam search much less attractive.
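The following sketch outlines this procedure, assuming a hypothetical function next_token_log_probs(sequence) that returns the log probabilities of the next token under the trained model; length normalization is applied when the final sequences are compared.

```python
import numpy as np

def beam_search(next_token_log_probs, start_seq, beam_width, max_len, stop_token):
    """Maintain the beam_width most probable partial sequences, extending each at every step."""
    beams = [(list(start_seq), 0.0)]                       # (token sequence, total log probability)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            if seq[-1] == stop_token:
                candidates.append((seq, logp))             # completed hypotheses are carried forward
                continue
            log_probs = next_token_log_probs(seq)          # model call (assumed interface)
            top = np.argsort(log_probs)[-beam_width:]      # the B most probable next tokens
            for t in top:
                candidates.append((seq + [int(t)], logp + float(log_probs[t])))
        # prune to the B most probable extended hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    # compare the surviving sequences using length-normalized log probability
    return max(beams, key=lambda c: c[1] / len(c[0]))[0]
```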
One problem with approaches such as greedy search and beam search is that they
limit the diversity of potential outputs and can even cause the generation process to
become stuck in a loop, where the same sub-sequence of words is repeated over and
Figure 12.17 A comparison of the token probabilities from beam search and human text for a given trained
transformer language model and a given initial input sequence, showing how the human sequence has much
lower token probabilities. [From Holtzman et al. (2019) with permission.]
over. As can be seen in Figure 12.17, human-generated text may have lower proba-
bility and hence be more surprising with respect to a given model than automatically
generated text.
Instead of trying to find a sequence with the highest probability, we can instead
generate successive tokens simply by sampling from the softmax distribution at each
step. However, this can lead to sequences that are nonsensical. This arises from
the typically very large size of the token dictionary, in which there is a long tail of
many token states each of which has a very small probability but which in aggregate
account for a significant fraction of the total probability mass. This leads to the
problem in which there is a significant chance that the system will make a bad choice
for the next token.
As a balance between these extremes, we can consider only the states having the
top K probabilities, for some choice of K, and then sample from these according to
their renormalized probabilities. A variant of this approach, called top-p sampling
or nucleus sampling, calculates the cumulative probability of the top outputs until a
threshold is reached and then samples from this restricted set of token states.
A ‘softer’ version of top-K sampling is to introduce a parameter T called tem-
perature into the definition of the softmax function (Hinton, Vinyals, and Dean,
2015) so that
$$y_i = \frac{\exp(a_i/T)}{\sum_j \exp(a_j/T)} \tag{12.35}$$
and then sample the next token from this modified distribution. When T = 0, the
probability mass is concentrated on the most probable state, with all other states
having zero probability, and hence this becomes greedy selection. For T = 1, we recover the unmodified softmax distribution, whereas values of T between 0 and 1 produce behaviour between these two extremes.
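The following sketch (with illustrative names) combines the top-K restriction with the temperature-modified softmax of (12.35) to select the next token.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Sample the next token from softmax(logits / T), optionally restricted to the top-K tokens."""
    rng = rng or np.random.default_rng()
    if temperature <= 0.0:
        return int(np.argmax(logits))                        # T = 0 reduces to greedy selection
    scaled = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        cutoff = np.sort(scaled)[-top_k]
        scaled = np.where(scaled < cutoff, -np.inf, scaled)  # discard the long tail of unlikely tokens
    scaled = scaled - scaled.max()
    p = np.exp(scaled)
    p = p / p.sum()                                          # renormalized probabilities, as in (12.35)
    return int(rng.choice(len(p), p=p))
```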
For example, the training sequence ‘I swam across the river to get to the other bank’ might be presented to the inputs as ‘I ⟨mask⟩ across the river to get to the ⟨mask⟩ bank’, and the network should then predict ‘swam’ at output node 2 and ‘other’ at output node 10. In this case only two of the outputs contribute to the error function and the other outputs are ignored.
The term ‘bidirectional’ refers to the fact that the network sees words both be-
fore and after the masked word and can use both sources of information to make a
prediction. As a consequence, unlike decoder models, there is no need to shift the
inputs to the right by one place, and there is no need to mask the outputs of each layer
from seeing input tokens occurring later in the sequence. Compared to the decoder
model, an encoder is less efficient since only a fraction of the sequence tokens are
used as training labels. Moreover, an encoder model is unable to generate sequences.
The procedure of replacing randomly selected tokens with ⟨mask⟩ means the
training set has a mismatch compared to subsequent fine-tuning sets in that the lat-
ter will not contain any ⟨mask⟩ tokens. To mitigate any problems this might cause,
Devlin et al. (2018) modified the procedure slightly, so that of the 15% of randomly
selected tokens, 80% are replaced with ⟨mask⟩, 10% are replaced with a word se-
lected at random from the vocabulary, and in 10% of the cases, the original words
are retained at the input, but they still have to be correctly predicted at the output.
Figure 12.18 Architecture of an encoder transformer model. The boxes labelled ‘LSM’ denote a linear trans-
formation whose learnable parameters are shared across the token positions, followed by a softmax activation
function. The main differences compared to the decoder model are that the input sequence is not shifted to the
right, and the ‘look ahead’ masking matrix is omitted and therefore, within each self-attention layer, every output
token can attend to any of the input tokens.
Once the encoder model is trained it can then be fine-tuned for a variety of
different tasks. To do this a new output layer is constructed whose form is specific
to the task being solved. For a text classification task, only the first output position is used, which corresponds to the ⟨class⟩ token that always appears in the first position of the input sequence. If this output has dimension D then a matrix of parameters of dimension D × K, where K is the number of classes, is appended to the first output node, and this in turn feeds into a K-dimensional softmax function (or, for K = 2, a vector of dimension D × 1 followed by a logistic sigmoid). The
linear output transformation could alternatively be replaced with a more complex
differentiable model such as an MLP. If the goal is to classify each token of the
input string, for example to assign each token to a category (such as person, place, colour, etc.), then the first output is ignored and the subsequent outputs have a shared
linear-plus-softmax layer. During fine-tuning all model parameters including the
new output matrix are learned by stochastic gradient descent using the log probability
of the correct label. Alternatively, the output of a pre-trained model might feed into a sophisticated generative deep learning model (Chapter 20) for applications such as text-to-image synthesis.
[Figure 12.19: structure of a cross-attention layer, in which the keys K and values V are computed from the encoder output Z, whereas the queries Q are computed from the output of the decoder's masked multi-head self-attention layer.]
of performance improvements have driven a new kind of Moore’s law in which the
number of compute operations required to train a state-of-the-art machine learning
model has grown exponentially since about 2012, with a doubling time of around 3.4 months (Figure 1.16).
Early language models were trained using supervised learning. For example, to
build a translation system, the training set would consist of matched pairs of sen-
tences in two languages. A major limitation of supervised learning, however, is
that the data typically has to be human-curated to provide labelled examples, and
this severely limits the quantity of data available, thereby requiring heavy use of
inductive biases such as feature engineering and architecture constraints to achieve
reasonable performance.
Large language models are trained instead by self-supervised learning on very
large data sets of text, along with potentially other token sequences such as computer code. We have seen (Section 12.3.1) how a decoder transformer can be trained on token sequences
in which each token acts as a labelled target example, with the preceding sequence
as input, to learn a conditional probability distribution. This ‘self-labelling’ hugely
expands the quantity of training data available and therefore allows exploitation of
deep neural networks having large numbers of parameters.
Figure 12.20 Schematic illustration of a sequence-to-sequence transformer. To keep the diagram uncluttered
the input tokens are collectively shown as a single box, and likewise for the output tokens. Positional-encoding
vectors are added to the input tokens for both the encoder and decoder sections. Each layer in the encoder
corresponds to the structure shown in Figure 12.9, and each cross-attention layer is of the form shown in Fig-
ure 12.19.
Figure 12.21 Schematic illustration of low-rank adaptation showing a weight matrix W0 from one of the attention layers in a pre-trained transformer. Additional weights given by matrices A and B are adapted during fine-tuning and their product AB is then added to the original matrix for subsequent inference.
whose dimensionality is much smaller than the total number of learnable parameters
in the model (Aghajanyan, Zettlemoyer, and Gupta, 2020). LoRA exploits this by
freezing the weights of the original model and adding additional learnable weight
matrices into each layer of the transformer in the form of low-rank products. Typi-
cally only attention-layer weights are modified, whereas MLP-layer weights are kept
fixed. Consider a weight matrix W0 having dimension D × D, which might rep-
resent a query, key, or value matrix in which the matrices from multiple attention
heads are treated together as a single matrix. We introduce a parallel set of weights
defined by the product of two matrices A and B with dimensions D × R and R × D,
respectively, as shown schematically in Figure 12.21. This layer then generates an
output given by XW0 + XAB. The number of parameters in the additional weight
matrix AB is $2RD$ compared to the $D^2$ parameters in the original weight matrix W0, and so if $R \ll D$ then the number of parameters that need to be adapted during
fine-tuning is much smaller than the number in the original transformer. In prac-
tice, this can reduce the number of parameters that need to be trained by a factor of
10,000. Once the fine-tuning is complete, the additional weights can be added to the
original weight matrices to give a new weight matrix
$$\widetilde{\mathbf{W}} = \mathbf{W}_0 + \mathbf{A}\mathbf{B} \tag{12.36}$$
so that during inference there is no additional computational overhead compared to
running the original model since the updated model has the same size as the original.
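The following sketch illustrates the idea with randomly initialized matrices (in practice one common choice is to initialize B to zero so that fine-tuning starts from the unchanged model); after fine-tuning, the product AB is merged into W0 as in (12.36). The dimensions and names here are illustrative.

```python
import numpy as np

D, R = 512, 8                                   # R much smaller than D gives the parameter saving
rng = np.random.default_rng(0)

W0 = rng.normal(size=(D, D))                    # frozen pre-trained weight matrix
A = rng.normal(scale=0.01, size=(D, R))         # low-rank factors learned during fine-tuning
B = rng.normal(scale=0.01, size=(R, D))

def adapted_layer(X):
    """During fine-tuning: original output plus the low-rank correction X A B."""
    return X @ W0 + X @ A @ B

# After fine-tuning, merge the update so that inference has no extra cost, as in (12.36)
W_merged = W0 + A @ B
X = rng.normal(size=(4, D))
assert np.allclose(adapted_layer(X), X @ W_merged)
```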
As language models have become larger and more powerful, the need for fine-
tuning has diminished, with generative language models now able to solve a broad
range of tasks simply through text-based interaction. For example, if a text string
English: the cat sat on the mat. French:
is given as the input sequence, an autoregressive language model can continue to generate subsequent tokens until a ⟨stop⟩ token is generated, whereupon the newly generated tokens represent the French translation. Note that the model was not trained
specifically to do translation but has learned to do so as a result of being trained on a
vast corpus of data that includes multiple languages.
A user can interact with such models using a natural language dialogue, mak-
ing them very accessible to broad audiences. To improve the user experience and
the quality of the generated outputs, techniques have been developed for fine-tuning
large language models through human evaluation of generated output, using methods
such as reinforcement learning through human feedback or RLHF (Christiano et al.,
2017). Such techniques have helped to create large language models with impres-
sively easy-to-use conversational interfaces, most notably the system from OpenAI
called ChatGPT.
The sequence of input tokens given by the user is called a prompt. For example,
it might consist of the opening words of a story, which the model is required to com-
plete. Or it might comprise a question, and the model should provide the answer. By
using different prompts, the same trained neural network may be capable of solving a
broad range of tasks such as generating computer code from a simple text request or
writing rhyming poetry on demand. The performance of the model now depends on
the form of the prompt, leading to a new field called prompt engineering (Liu et al.,
2021), which aims to design a good form for a prompt that results in high-quality
output for the downstream task. The behaviour of the model can also be modified by
adapting the user’s prompt before feeding it into the language model by pre-pending
an additional token sequence called a prefix prompt to the user prompt to modify
the form of the output. For example, the pre-prompt might consist of instructions,
expressed in standard English, to tell the network not to include offensive language
in its output.
This allows the model to solve new tasks simply by providing some examples
within the prompt, without needing to adapt the parameters of the model. This is an
example of few-shot learning.
Current state-of-the-art models such as GPT-4 have become so powerful that
they are exhibiting remarkable properties which have been described as the first in-
dications of artificial general intelligence (Bubeck et al., 2023) and are driving a
new wave of technological innovation. Moreover, the capabilities of these models
continue to improve at an impressive pace.
domains. The core architecture of the transformer layer has remained relatively con-
stant, both over time and across applications. Therefore, the key innovations that
enabled the use of transformers in areas other than natural language have largely
focused on the representation and encoding of the inputs and outputs.
One big advantage of a single architecture that is capable of processing many
different kinds of data is that it makes multimodal computation relatively straight-
forward. In this context, multimodal refers to applications that combine two or more
different types of data, either in the inputs or outputs or both. For example, we
may wish to generate an image from a text prompt or design a robot that can com-
bine information from multiple sensors such as cameras, radar, and microphones.
The important thing to note is that if we can tokenize the inputs and decode the
output tokens, then it is likely that we can use a transformer.