Attention Book Sample
Disclaimer
The information contained within this eBook is strictly for educational purposes. If you wish to
apply ideas contained in this eBook, you are taking full responsibility for your actions.
The author has made every effort to ensure that the information within this book was accurate at
the time of publication. The author does not assume and hereby disclaims any liability to any
party for any loss, damage, or disruption caused by errors or omissions, whether such errors or
omissions result from accident, negligence, or any other cause.
No part of this eBook may be reproduced or transmitted in any form or by any means, electronic or
mechanical, recording or by any information storage and retrieval system, without written
permission from the author.
Credits
Founder: Jason Brownlee
Authors: Stefania Cristina and Mehreen Saeed
Lead Editor: Adrian Tam
Technical Reviewers: Darci Heikkinen, Devansh Sethi, and Jerry Yiu
Copyright
Building Transformer Models with Attention
© 2022 MachineLearningMastery.com. All Rights Reserved.
Edition: v1.00
Contents
Preface
Introduction
It is not an easy task to ask a computer to understand human language. In recent years, we
have seen significant progress thanks to advances in machine learning techniques, in particular
attention mechanisms and transformers.
Take machine translation as an example. In the past, we would treat it as a sequence-to-sequence
transformation problem that a recurrent neural network would fit. But rather than a simple linear
transformation, an attention mechanism was proven to work better with longer sentences. Later, it
was discovered that attention without a recurrent neural network is not only possible, but also
better in many situations.
This book is a guide that leads you to a full understanding of attention and the transformer
architecture. We start from first principles: building a transformer model in Keras, from scratch.
We hope that by the time you finish the book, you will appreciate the idea of using attention to
extract context out of a sequence.
Book Organization
This book is in four parts:
This part starts with positional encoding of the input sequence and ends with using a trained
transformer model for inference. This part of the book includes the following chapters:
⊲ Positional Encoding in Transformer Models
⊲ Transformer Positional Encoding Layer in Keras
⊲ Implementing Scaled Dot-Product Attention in Keras
⊲ Implementing Multi-Head Attention in Keras
⊲ Implementing the Transformer Encoder in Keras
⊲ Implementing the Transformer Decoder in Keras
⊲ Joining the Transformer Encoder and Decoder with Masking
⊲ Training the Transformer Model
⊲ Plotting the Training and Validation Loss Curves for the Transformer Model
⊲ Inference with the Transformer Model
There is a lot to cover in these chapters because of the complexity of the transformer architecture.
Be patient, and you will find it is not difficult to create your own transformer out of basic
Keras functions.
Part 4: Applications
By the end of the last part you will have created your own transformer, and it should be able to
translate sentences between two languages with reasonable quality. However, the story of the
transformer does not stop here. Larger transformer-based architectures have been proposed, with
pre-trained model weights made public. We will look into one example and see how we can do some
amazing projects with a pre-trained model.
There is only one chapter in this part. It is:
⊲ A Brief Introduction to BERT
In this chapter you will learn about BERT, which is an extension of the transformer's encoder, and
its simplified variant, DistilBERT. You will see how you can do summarization and question
answering with a pre-trained DistilBERT model.
Appendix A and Appendix B of this book give you step-by-step guidance on how to set
up a Python environment on your own computer and on the AWS cloud, respectively.
Machine Learning
You do not need to be a machine learning expert, but it would be helpful if you know how to
solve a small machine learning problem, especially a natural language processing task. Basic
concepts like cross-validation are described briefly, and you are expected to know how to train
and use a neural network in TensorFlow and Keras. You may learn about these in another
book, Deep Learning with Python.
Training a transformer model can take a long time. It is possible to train it using a CPU,
but a GPU can speed it up significantly. You can access GPU hardware easily and cheaply in
the cloud, and Appendix B gives a step-by-step procedure on how to do this.
Summary
This book is a bit different from our other books from MachineLearningMastery.com in the
sense that there are not a lot of small projects to work with. Instead, this entire book is one
big project: to build a transformer model and apply it to NLP. A big project has many small
components. By doing this project, you will learn a lot of ideas. We hope it will be eye-opening
for you and bring you to a different level of deep learning. Take your time, have fun, and we are
excited to see where you can take this amazing new technology.
Next
Let’s dive in. Next up is Part I, where you will learn the foundations of attention.
The Transformer Model
11
We have already familiarized ourselves with the concept of self-attention as implemented by
the transformer attention mechanism for neural machine translation. We will now shift our
focus to the details of the transformer architecture itself to discover how self-attention
can be implemented without relying on recurrence and convolutions.
In this chapter, you will discover the network architecture of the transformer model. After
completing this chapter, you will know:
⊲ How the transformer architecture implements an encoder-decoder structure without
recurrence and convolutions
⊲ How the transformer encoder and decoder work
⊲ How the transformer self-attention compares to the use of recurrent and convolutional
layers
Let’s get started.
Overview
This chapter is divided into three parts; they are:
⊲ The Transformer Architecture
⊲ Sum Up: The Transformer Model
⊲ Comparison to Recurrent and Convolutional Layers
In a nutshell, the task of the encoder, on the left half of the transformer architecture, is to
map an input sequence to a sequence of continuous representations, which is then fed into a
decoder. The decoder, on the right half of the architecture, receives the output of the encoder
together with the decoder output at the previous time step to generate an output sequence.
“At each step the model is auto-regressive, consuming the previously generated
symbols as additional input when generating the next.”
— “Attention Is All You Need”, 2017
The Encoder
Figure 11.2: The encoder block of the transformer architecture. From “Attention Is All
You Need”
The encoder consists of a stack of N = 6 identical layers, where each layer is composed of two
sublayers:
1. The first sublayer implements a multi-head self-attention mechanism. You have seen
that the multi-head mechanism implements h heads, each of which receives a (different)
linearly projected version of the queries, keys, and values, to produce h outputs in
parallel that are then combined to generate a final result.
2. The second sublayer is a fully connected feedforward network consisting of two linear
transformations with a Rectified Linear Unit (ReLU) activation in between:
FFN(x) = ReLU(xW1 + b1)W2 + b2
The six layers of the transformer encoder apply the same linear transformations to all the
words in the input sequence, but each layer employs different weight (W1, W2) and bias (b1, b2)
parameters to do so.
Furthermore, each of these two sublayers has a residual connection around it. Each
sublayer is also succeeded by a normalization layer, layernorm(·), which normalizes the sum
computed between the sublayer input x and the output generated by the sublayer itself,
sublayer(x):
layernorm(x + sublayer(x))
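To make these two ideas concrete, here is a minimal Keras sketch of the feedforward sublayer followed by the residual connection and layer normalization. The sizes d_model = 512 and d_ff = 2048 are the values used in “Attention Is All You Need”; the code is illustrative only and is not the book's own implementation, which is built up in later chapters.
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Add, Dense, LayerNormalization

d_model, d_ff = 512, 2048          # sizes used in "Attention Is All You Need"
x = Input(shape=(None, d_model))   # a sequence of d_model-dimensional vectors

# two linear transformations with a ReLU in between: ReLU(x W1 + b1) W2 + b2
ff = Dense(d_model)(Dense(d_ff, activation="relu")(x))

# layernorm(x + sublayer(x)): residual connection followed by layer normalization
out = LayerNormalization()(Add()([x, ff]))

Model(inputs=x, outputs=out).summary()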
The Decoder
Figure 11.3: The decoder block of the transformer architecture. From “Attention Is All
You Need”
The decoder shares several similarities with the encoder. The decoder also consists of a stack
of N = 6 identical layers that are each composed of three sublayers:
1. The first sublayer receives the previous output of the decoder stack, augments it with
positional information, and implements multi-head self-attention over it. While the
encoder is designed to attend to all words in the input sequence regardless of their
position in the sequence, the decoder is modified to attend only to the preceding
words. Hence, the prediction for a word at position i can only depend on the known
outputs for the words that come before it in the sequence. In the multi-head attention
mechanism (which implements multiple, single attention functions in parallel), this is
achieved by introducing a mask over the values produced by the scaled multiplication of
the matrices Q and K.
Figure 11.4: The multi-head attention in the decoder implements several masked, single
attention functions. From “Attention Is All You Need”
“The masking makes the decoder unidirectional (unlike the bidirectional encoder).”
— Advanced Deep Learning with Python, 2019
2. The second sublayer implements a multi-head attention mechanism similar to the one
implemented in the first sublayer of the encoder. On the decoder side, this multi-head
mechanism receives the queries from the previous decoder sublayer and the keys and
values from the output of the encoder. This allows the decoder to attend to all the
words in the input sequence.
3. The third sublayer implements a fully connected feedforward network, similar to the one
implemented in the second sublayer of the encoder.
Furthermore, the three sublayers on the decoder side also have residual connections around
them and are succeeded by a normalization layer. Positional encodings are also added to the
input embeddings of the decoder in the same manner as previously explained for the encoder.
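To make the decoder masking concrete, here is a minimal TensorFlow sketch of a look-ahead mask. The helper name lookahead_mask is assumed for illustration; the book builds its own version in the masking chapter listed earlier. Here a value of 1 marks a position that may be attended to and 0 marks a future position that must be hidden (some implementations flip this convention).
from tensorflow import linalg, ones

def lookahead_mask(length):
    # lower-triangular matrix: position i may attend to positions 0..i only
    return linalg.band_part(ones((length, length)), -1, 0)

print(lookahead_mask(4))
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]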
Books
Ivan Vasilev. Advanced Deep Learning with Python. Packt Publishing, 2019.
https://www.amazon.com/dp/178995617X
Papers
Ashish Vaswani et al. “Attention Is All You Need”. In: Proc. 31st Conference on Neural
Information Processing Systems (NIPS 2017). 2017.
https://arxiv.org/pdf/1706.03762.pdf
11.5 Summary
In this chapter, you discovered the network architecture of the transformer model. Specifically,
you learned:
⊲ How the transformer architecture implements an encoder-decoder structure without
recurrence and convolutions
⊲ How the transformer encoder and decoder work
⊲ How the transformer self-attention compares to recurrent and convolutional layers
Before we wrap up this part, we will digress a bit from text sequences to see, in the next
chapter, how transformers can also be applied to images.
Inference with the Transformer
Model
22
We have seen how to train the transformer model on a dataset of English and German sentence
pairs and how to plot the training and validation loss curves to diagnose the model’s learning
performance and decide at which epoch to run inference on the trained model. We are now
ready to run inference on the trained transformer model to translate an input sentence.
In this chapter, you will discover how to run inference on the trained transformer model
for neural machine translation. After completing this chapter, you will know:
⊲ How to run inference on the trained transformer model
⊲ How to generate text translations
Let’s get started.
Overview
This chapter is divided into two parts; they are:
⊲ Inferencing the Transformer Model
⊲ Testing Out the Code
# Create model
inferencing_model = TransformerModel(enc_vocab_size, dec_vocab_size, enc_seq_length,
dec_seq_length, h, d_k, d_v, d_model, d_ff, n, 0)
Here, note that the last input fed into the TransformerModel corresponds to the dropout
rate for each of the Dropout layers in the transformer model. These Dropout layers will not be
used during model inferencing (you will eventually set the training argument to False), so
you may safely set the dropout rate to 0. Furthermore, the TransformerModel class has already
been saved into a separate script named model.py. Hence, to be able to use the TransformerModel
class, you need to include “from model import TransformerModel”.
Next, let’s create a class Translate that inherits from the Module base class in Keras and
assign the initialized inferencing model to the variable transformer:
class Translate(Module):
    def __init__(self, inferencing_model, **kwargs):
        super().__init__(**kwargs)
        self.transformer = inferencing_model
    ...
When you trained the transformer model, you saw that you first needed to tokenize the
sequences of text that were to be fed into both the encoder and decoder. You achieved this
by creating a vocabulary of words and replacing each word with its corresponding vocabulary
index. You will need to implement a similar process during the inferencing stage before feeding
the sequence of text to be translated into the transformer model.
For this purpose, you will include within the class the following load_tokenizer method,
which will serve to load the encoder and decoder tokenizers that you would have generated
and saved during the training stage in Chapter 21:
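Since the tokenizers were saved as pickle files in Chapter 21, a minimal sketch of such a method could look like the following (the book's exact listing may differ):
from pickle import load

def load_tokenizer(self, name):
    # load a tokenizer that was pickled during training (e.g., enc_tokenizer.pkl)
    with open(name, 'rb') as handle:
        return load(handle)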
It is important that you tokenize the input text at the inferencing stage using the same
tokenizers generated at the training stage of the transformer model since these tokenizers
would have already been trained on text sequences similar to your testing data. The next
step is to create the class method, call(), that will take care of the following steps:
⊲ Append the start (<START>) and end-of-string (<EOS>) tokens to the input sentence:
⊲ Load the encoder and decoder tokenizers (in this case, saved in the enc_tokenizer.pkl
and dec_tokenizer.pkl pickle files from Chapter 21, respectively):
enc_tokenizer = self.load_tokenizer('enc_tokenizer.pkl')
dec_tokenizer = self.load_tokenizer('dec_tokenizer.pkl')
⊲ Prepare the input sentence by tokenizing it first, then padding it to the maximum
phrase length, and subsequently converting it to a tensor:
encoder_input = enc_tokenizer.texts_to_sequences(sentence)
encoder_input = pad_sequences(encoder_input,
maxlen=enc_seq_length, padding='post')
encoder_input = convert_to_tensor(encoder_input, dtype=int64)
⊲ Repeat a similar tokenization and tensor conversion procedure for the <START> and
<EOS> tokens at the output:
output_start = dec_tokenizer.texts_to_sequences(["<START>"])
output_start = convert_to_tensor(output_start[0], dtype=int64)
output_end = dec_tokenizer.texts_to_sequences(["<EOS>"])
output_end = convert_to_tensor(output_end[0], dtype=int64)
⊲ Prepare the output array that will contain the translated text. Since you do not know
the length of the translated sentence in advance, you will initialize the size of the
output array to 0, but set its dynamic_size parameter to True so that it may grow past
its initial size. You will then set the first value in this output array to the <START>
token:
⊲ Iterate up to the decoder sequence length, each time calling the transformer model
to predict an output token. Here, the training argument, which is passed on to
each of the transformer’s Dropout layers, is set to False so that no values are dropped
during inference. The prediction with the highest score is then selected and written
at the next available index of the output array. The for loop is terminated with a
break statement as soon as an <EOS> token is predicted:
for i in range(dec_seq_length):
    prediction = self.transformer(encoder_input, transpose(decoder_output.stack()),
                                  training=False)
    # keep only the prediction for the last position in the output sequence, then
    # select the highest-scoring token and write it to the output array (these two
    # added lines are a sketch of the step described above)
    prediction = prediction[:, -1, :]
    predicted_id = argmax(prediction, axis=-1)
    decoder_output = decoder_output.write(i + 1, predicted_id)
    if predicted_id == output_end:
        break
⊲ Decode the predicted tokens into an output list and return it:
output = transpose(decoder_output.stack())[0]
output = output.numpy()
output_str = []
# map each predicted token index back to its word via the decoder tokenizer
# (a sketch of the decoding step described above)
for i in range(output.shape[0]):
    output_str.append(dec_tokenizer.index_word[output[i]])
return output_str
# Create model
inferencing_model = TransformerModel(enc_vocab_size, dec_vocab_size, enc_seq_length,
dec_seq_length, h, d_k, d_v, d_model, d_ff, n, 0)
class Translate(Module):
def __init__(self, inferencing_model, **kwargs):
super().__init__(**kwargs)
self.transformer = inferencing_model
for i in range(dec_seq_length):
# Predict an output token
prediction = self.transformer(encoder_input,transpose(decoder_output.stack()),
training=False)
prediction = prediction[:, -1, :]
output = transpose(decoder_output.stack())[0]
output = output.numpy()
output_str = []
return output_str
# Sentence to translate
sentence = ['im thirsty']
The corresponding ground truth translation in German for this sentence, including the <START>
and <EOS> decoder tokens, should be: “<START> ich bin durstig <EOS>”. If you have a look
at the plotted training and validation loss curves for this model (here, you are training for
20 epochs), you may notice that the validation loss curve slows down considerably and starts
plateauing at around epoch 16.
So let’s proceed to load the saved model’s weights at the 16th epoch and check out the
prediction that is generated by the model:
Running the lines of code above produces the following translated list of words:
Output 22.1: Output of the model using weights from the 16th epoch
This is equivalent to the ground truth German sentence that was expected.
Always keep in mind that since you are training the transformer model from
scratch, you may arrive at different results depending on the random initialization
of the model weights.
Let’s check out what would have happened if you had, instead, loaded a set of weights
corresponding to a much earlier epoch, such as the 4th epoch. In this case, the generated
translation is the following:
Output 22.2: Output of the model using weights from the 4th epoch
In English, this translates to “I in not not”, which is clearly far off from the input English
sentence, but which is expected since, at this epoch, the learning process of the transformer
model is still at the very early stages.
Let’s try again with a second sentence from the test dataset:
# Sentence to translate
sentence = ['are we done']
The corresponding ground truth translation in German for this sentence, including the <START>
and <EOS> decoder tokens, should be: “<START> sind wir dann durch <EOS>”. The model’s
translation for this sentence, using the weights saved at epoch 16, is:
Output 22.3: Output of the model using weights from the 16th epoch
This, instead, translates to "I was ready". While this is also not equal to the ground truth,
it is close in meaning.
What the last test suggests, however, is that the transformer model might have required
many more data samples to train effectively. This is also corroborated by the fact that the
validation loss at which the validation loss curve plateaus remains relatively high. Indeed,
transformer models are notorious for being very data hungry. Vaswani et al. (2017), for example,
trained their English-to-German translation model using a dataset containing around 4.5 million
sentence pairs.
“We trained on the standard WMT 2014 English-German dataset consisting of
about 4.5 million sentence pairs … For English-French, we used the significantly
larger WMT 2014 English-French dataset consisting of 36M sentences …”
— “Attention Is All You Need”, 2017
They reported that it took them 3.5 days on eight P100 GPUs to train the English-to-German
translation model. In comparison, you have only trained on a dataset comprising 10,000 data
samples here, split between training, validation, and test sets. So the next task is actually for
you. If you have the computational resources available, try to train the transformer model on
a much larger set of sentence pairs and see if you can obtain better results than the translations
obtained here with a limited amount of data.
Books
Ivan Vasilev. Advanced Deep Learning with Python. Packt Publishing, 2019.
https://www.amazon.com/dp/178995617X
Denis Rothman. Transformers for Natural Language Processing. Packt Publishing, 2021.
https://www.amazon.com/dp/1800565798
Papers
Ashish Vaswani et al. “Attention Is All You Need”. In: Proc. 31st Conference on Neural
Information Processing Systems (NIPS 2017). 2017.
https://arxiv.org/pdf/1706.03762.pdf
22.4 Summary
In this chapter, you discovered how to run inference on the trained transformer model for neural
machine translation. Specifically, you learned:
⊲ How to run inference on the trained transformer model
⊲ How to generate text translations
In the next chapter, you will see how you can use a pre-trained model.
A Brief Introduction to BERT
23
Having learned what a Transformer is and how we might train the Transformer model, we
notice that it is a great tool to make a computer understand human language. However, the
Transformer was originally designed as a model to translate one language to another. If we
repurpose it for a different task, we would likely need to retrain the whole model from scratch.
Given that the time it takes to train a Transformer model is enormous, we would like to have a
solution that enables us to readily reuse the trained Transformer for many different tasks.
BERT is such a model. It is an extension of the encoder part of a Transformer.
In this chapter, you will learn what BERT is and discover what it can do. After completing
this chapter, you will know:
⊲ What Bidirectional Encoder Representations from Transformers (BERT) is
⊲ How a BERT model can be reused for different purposes
⊲ How you can use a pre-trained BERT model
Let’s get started.
Overview
This chapter is divided into four parts; they are:
⊲ From Transformer Model to BERT
⊲ What Can BERT Do?
⊲ Using Pre-Trained BERT Model for Summarization
⊲ Using Pre-Trained BERT Model for Question-Answering
attention(Q, K, V) = softmax(QK⊤ / √dk) V
But each of the Q, K, and V above is an embedding vector transformed by a weight matrix
in the transformer model. Training a transformer model means finding these weight matrices.
Once the weight matrices are learned, the transformer becomes a language model, which means
it represents a way to understand the language that you used to train it.
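As a minimal NumPy sketch of the formula above (the shapes and random values are made up for illustration and are not part of the book's model code):
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # compare every query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of the value vectors

Q = np.random.rand(4, 8)   # four query vectors of dimension d_k = 8 (made-up sizes)
K = np.random.rand(4, 8)
V = np.random.rand(4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)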
A transformer has encoder and decoder parts. As the name implies, the encoder transforms
sentences and paragraphs into an internal format (a numerical matrix) that captures the
context, whereas the decoder does the reverse. Combining the encoder and decoder allows a
transformer to perform seq2seq tasks, such as translation. If you take out the encoder part
of the transformer, it can still tell you something about the context, which lets you do
something interesting.
Bidirectional Encoder Representations from Transformers (BERT) leverages the
attention model to get a deeper understanding of the language context. BERT is a stack
of many encoder blocks. The input text is separated into tokens as in the transformer model,
and each token will be transformed into a vector at the output of BERT.
Each training sample for BERT is a pair of sentences from a document. The two sentences
can be consecutive in the document or not. There will be a [CLS] token prepended to the
first sentence (to represent the class) and a [SEP] token appended to each sentence (as a
separator). Then, the two sentences will be concatenated as a sequence of tokens to become
a training sample. A small percentage of the tokens in the training sample is masked with a
special token [MASK] or replaced with a random token.
Before it is fed into the BERT model, the tokens in the training sample will be transformed
into embedding vectors, with the positional encodings added, and particular to BERT, with
segment embeddings added as well to mark whether the token is from the first or the second
sentence.
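As an illustration only (the sentences and the masked position below are made up), one training sample would be laid out like this before it is converted into embedding vectors:
sentence_a = ["the", "cat", "sat", "on", "the", "mat"]
sentence_b = ["it", "looked", "very", "comfortable"]

# [CLS] at the start, [SEP] after each sentence
tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
# segment ids mark whether a token belongs to the first or the second sentence
segments = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)

# a small fraction of tokens is masked (or replaced with a random token)
tokens[3] = "[MASK]"
print(tokens)
print(segments)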
Each input token to the BERT model will produce one output vector. In a well-trained
BERT model, we expect:
⊲ the output corresponding to the masked token to reveal what the original token was
⊲ the output corresponding to the [CLS] token at the beginning to reveal whether the two
sentences are consecutive in the document
Then, the weights trained in the BERT model can capture the language context well.
Once you have such a BERT model, you can use it for many downstream tasks. For
example, by adding an appropriate classification layer on top of the encoder and feeding
only one sentence to the model instead of a pair, you can take the output of the class token
[CLS] as the input for sentiment classification. It works because the output of the class token
is trained to aggregate the attention for the entire input.
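A hedged Keras sketch of that idea follows: the encoder itself is stood in for by an Input tensor of per-token output vectors (hidden size 768, as in BERT base), and a small softmax layer is placed on the vector at position 0, i.e. the [CLS] token. This only illustrates the wiring, not a full fine-tuning setup.
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense

hidden_size, num_classes = 768, 2                  # BERT-base hidden size, binary sentiment

bert_outputs = Input(shape=(None, hidden_size))    # stands in for the encoder's output sequence
cls_vector = bert_outputs[:, 0, :]                 # output vector of the [CLS] token
sentiment = Dense(num_classes, activation="softmax")(cls_vector)

Model(inputs=bert_outputs, outputs=sentiment).summary()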
Another example is to take a question as the first sentence and the text (e.g., a paragraph)
as the second sentence; then the output tokens from the second sentence can mark the position
where the answer to the question rests. It works because the output of each token reveals
some information about that token in the context of the entire input.
Theoretically, a BERT model is an encoder that maps each input token to an output vector,
and this can be extended to an infinitely long sequence of tokens. In practice, there are
limitations imposed by the implementation of other components that limit the input size.
Mostly, a few hundred tokens should work, as not every implementation can take thousands of
tokens in one shot. You can save the entire article in article.txt. In case your model needs a
smaller text, you can use only a few paragraphs from it.
First, let’s explore the task of summarization. Using BERT, the idea is to extract a few
sentences from the original text that represent the entire text. You can see that this task is
similar to next sentence prediction: given a sentence and the text, you want to classify whether
they are related.
To do that, you need to use the Python module bert-extractive-summarizer.
It is a wrapper around some Hugging Face models that provides the summarization task pipeline.
Hugging Face is a platform that allows you to publish machine learning models, mainly for
NLP tasks.
Once you have installed bert-extractive-summarizer, producing a summary is just a few
lines of code:
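A minimal sketch of those lines, assuming the article text has been saved in article.txt as described above (the book's exact listing may differ slightly):
from summarizer import Summarizer

# read the article saved earlier
with open("article.txt", "r") as f:
    text = f.read()

# use the DistilBERT model discussed below and extract a three-sentence summary
model = Summarizer("distilbert-base-uncased")
print(model(text, num_sentences=3))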
Amid the political turmoil of outgoing British Prime Minister Liz Truss’s
short-lived government, the Bank of England has found itself in the
fiscal-financial crossfire. Whatever government comes next, it is vital
that the BOE learns the right lessons. According to a statement by the BOE’s Deputy Governor for
Financial Stability, Jon Cunliffe, the MPC was merely “informed of the
issues in the gilt market and briefed in advance of the operation,
including its financial-stability rationale and the temporary and targeted
nature of the purchases.”
Output 23.1: Extractive summary produced
That’s the complete code! Behind the scenes, spaCy was used for some preprocessing,
and Hugging Face was used to launch the model. The model used was named
distilbert-base-uncased. DistilBERT is a simplified BERT model that can run faster and
use less memory. The model is an “uncased” one, which means uppercase and lowercase in
the input text are treated the same once the text is transformed into embedding vectors.
The output from the summarizer model is a string. As you specified num_sentences=3 when
invoking the model, the summary is three selected sentences from the text. This approach
is called extractive summarization. The alternative is abstractive summarization, in which the
summary is generated rather than extracted from the text. This would need a different model
than BERT.
from transformers import pipeline

# question and text are assumed to have been defined earlier (the question string
# and the article text loaded from article.txt)
answering = pipeline("question-answering",
                     model='distilbert-base-uncased-distilled-squad')
result = answering(question=question, context=text)
print(result)
Here, Hugging Face is used directly. If you have installed the module used in the previous
example, the Hugging Face transformers module is a dependency that you have already installed.
Otherwise, you may need to install it with pip:
And to actually use a Hugging Face model, you should have both PyTorch and TensorFlow
installed as well:
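For instance (assuming the standard PyPI package names):
pip install transformers
pip install torch tensorflow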
{'score': 0.42369240522384644,
 'start': 1261,
 'end': 1344,
 'answer': 'to maintain or restore market liquidity in systemically important\nfinancial markets'}
Here you can find the answer (which is a span from the input text), as well as the begin and
end positions in the input text where this answer was found. The score can be regarded as the
confidence score from the model that the answer fits the question.
Behind the scenes, the model generated a probability score for the best position in the text
where the answer begins, as well as one for where it ends. The answer is then extracted by
finding the locations with the highest probabilities.
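A small sketch of that span-selection idea (with made-up scores; this is not the pipeline's actual internals, which also handle impossible spans and tokenization details):
import numpy as np

# one start score and one end score per token (made-up numbers)
start_scores = np.array([0.1, 0.05, 2.3, 0.2, 0.1])
end_scores   = np.array([0.0, 0.10, 0.2, 1.9, 0.3])

start = int(np.argmax(start_scores))   # most likely beginning of the answer
end = int(np.argmax(end_scores))       # most likely end of the answer
print("answer spans tokens", start, "to", end)   # tokens 2 to 3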
Papers
Ashish Vaswani et al. “Attention Is All You Need”. In: Proc. 31st Conference on Neural
Information Processing Systems (NIPS 2017). 2017.
https://arxiv.org/pdf/1706.03762.pdf
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of
Deep Bidirectional Transformers for Language Understanding”. In: Proc. NAACL. Vol. 1.
June 2019, pp. 4171–4186. DOI: 10.18653/v1/N19-1423.
https://arxiv.org/abs/1810.04805
23.6 Summary
In this chapter, you discovered what BERT is and how to use a pre-trained BERT model.
Specifically, you learned:
⊲ How BERT is created as an extension of Transformer models
⊲ How to use pre-trained BERT models for extractive summarization and question
answering
This marks the end of this book.
This is Just a Sample