Overview of the Transformer-Based Models for NLP Tasks
Abstract—In 2017, Vaswani et al. proposed a new neural network architecture named Transformer. That modern architecture quickly revolutionized the natural language processing world. Models like GPT and BERT relying on this Transformer architecture have fully outperformed the previous state-of-the-art networks. It surpassed the earlier approaches by such a wide margin that all the recent cutting-edge models seem to rely on these Transformer-based architectures.
In this paper, we provide an overview and explanations of the latest models. We cover the auto-regressive models such as GPT, GPT-2 and XLNet, as well as the auto-encoder architecture such as BERT and a lot of post-BERT models like RoBERTa, ALBERT, ERNIE 1.0/2.0.
I. INTRODUCTION

The understanding and the treatment of ubiquitous textual data is a major research challenge. The tremendous amount of data produced by our society through social media and companies has exploded over the past years, and most of this information is stored in textual format. The human brain can extract the meaning out of a text effortlessly, but this is not the case for a computer. Reliable and efficient techniques are therefore required to process this data.
The Natural Language Processing (NLP) domain aims to provide a set of techniques able to address a wide variety of Natural Language tasks such as Automatic Translation [1], Text Summarization [2] and Text Generation [3]. To be successful, all those tasks have the meaning-extraction process in common. Undoubtedly, if a technique were able to understand the underlying semantics of texts, it would help to resolve the majority of modern NLP problems.
A big concern that restricts a general NLP solver is the single-task training scheme. Gathering data and crafting a specific model to solve a precise problem works successfully; however, it forces us to come up with a new solution not only each time a new issue arises, but also each time the model must be applied to another domain. A general multi-task solver may be preferable to avoid this time-consuming process.
Recurrent Neural Networks (RNN) were massively used to solve NLP problems, and they have been popular for a few years in supervised NLP models for classification and regression. The success of RNNs is due to the Long Short-Term Memory (LSTM) [4] and Gated Recurrent Unit (GRU) [5] architectures. Those two units prevent the vanishing gradient issue by providing a more direct path for the backpropagation of the gradient, which helps the computation when sentences are long. The high versatility of those networks allows them to solve a wide variety of problems [6]. Unfortunately, those models are not perfect: their inherent recurrent structure makes them hard to parallelize over multiple processes, and the treatment of very long clauses is also problematic due to the vanishing gradient.
To counter those two limiting constraints, [7] introduced a new model architecture: the Transformer. The proposed technique gets rid of the recurrent architecture and relies solely on the attention mechanism. Furthermore, it suffers neither from the vanishing gradient nor from the parallelization issue, which facilitates and accelerates the training of broader networks.
This work aims to provide a survey and an explanation of the latest Transformer-based models.

II. BACKGROUND

In this section, we introduce a general NLP background. It gives a broad insight into unsupervised pre-training and the NLP state of the art before the Transformer.

A. Unsupervised Pre-training

Unsupervised pre-training is a particular case of semi-supervised learning that is massively used to train the Transformer models. The principle works in two steps. The first one is the pre-training phase: a general representation is computed from raw data in an unsupervised fashion. Second, once this representation is computed, it can be adapted to a downstream task via fine-tuning techniques.
The principal challenge is to find an unsupervised objective function that generates a good representation. There is no consensus on which task provides the most efficient textual description: [8] propose a language modelling task, [9] introduce a masked language modelling objective, and [10] use multi-task language modelling.

B. Context-free representation

The recent significant increase in the performance of NLP models is due to the use of word embeddings, which consist of representing each word as a unique vector. Terms with the same meaning are located in a close area of each other. Word2Vec [11] and GloVe [12] are the most frequently used word-embedding methods. They treat a large corpus of text and
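As an illustration only, such a context-free embedding can be trained with the gensim library; the library choice, the toy corpus and every hyperparameter below are assumptions of this sketch, not part of the survey.

```python
# Hedged sketch: training context-free word vectors with gensim's Word2Vec.
from gensim.models import Word2Vec

corpus = [["the", "bank", "approved", "the", "loan"],
          ["she", "sat", "on", "the", "river", "bank"]]
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["bank"].shape)             # one fixed 50-dimensional vector per word
print(model.wv.most_similar("bank")[:3])  # nearest neighbours in the embedding space
```

Every occurrence of a word is mapped to the same vector, which is precisely what makes this representation context-free.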
word y_M uses the latent representation Z and the previously created sequence Y_{M-1} = (y_1, ..., y_{M-1}) to be generated. The Encoder and the Decoder use the same Multi-Head Attention layer. A single Attention layer maps a query Q and keys K to a weighted sum of the values V. For technical reasons there is a scaling factor 1/\sqrt{d_k}:

\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V
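A minimal NumPy sketch of this scaled dot-product attention; the shapes and the toy example are illustrative assumptions, not taken from the original paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # similarity of each query with each key
    return softmax(scores) @ V               # weighted sum of the values

# toy example: 4 query tokens, 6 key/value tokens, d_k = d_v = 8
Q, K, V = np.random.randn(4, 8), np.random.randn(6, 8), np.random.randn(6, 8)
print(attention(Q, K, V).shape)              # (4, 8)
```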
B. Auto-Regressive Models

The auto-regressive models take the previous outputs to produce the next outcome. They have the particularity of being unidirectional networks: they can only reach the left context of the evaluated token. However, despite this flaw, they can learn accurate sentence representations. They rely on the regular Language Modeling (LM) task as an unsupervised pre-training objective:

L(X) = \sum_{i} \log P(x_i \mid x_{i-k}, \ldots, x_{i-1}; \Theta)

This LM function maximizes the likelihood of the conditional probability P, where X is the input sequence, k is the context window, and \Theta are the parameters of the neural network.
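Read literally, the objective can be sketched as follows; the cond_prob callable is a stand-in for the network's softmax output and is an assumption of this example.

```python
import numpy as np

def lm_log_likelihood(tokens, cond_prob, k):
    """Sum over i of log P(x_i | x_{i-k}, ..., x_{i-1}); `cond_prob(context, token)`
    stands in for the probability the network assigns to `token` after `context`."""
    total = 0.0
    for i in range(1, len(tokens)):
        context = tuple(tokens[max(0, i - k):i])   # at most the k previous tokens
        total += np.log(cond_prob(context, tokens[i]))
    return total

# toy stand-in model: a uniform distribution over a 10-token vocabulary
uniform = lambda context, token: 1.0 / 10
print(lm_log_likelihood([3, 1, 4, 1, 5], uniform, k=3))   # 4 * log(0.1) ≈ -9.21
```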
Various models use this property coupled with the Transformer architecture to produce accurate language models (i.e., they determine the statistical distribution of the learned texts). The first auto-regressive model using the Transformer architecture is GPT [8]. It has a pre-training Language Modeling phase where it learns on raw texts. In the second learning phase, it uses supervised fine-tuning to adjust the network to the downstream tasks.
GPT-2 [14] uses the same pre-training principles as GPT. Though, this time it tries to achieve the same results in a zero-shot fashion (i.e., without fine-tuning the network on the downstream tasks). To accomplish that goal, it must capture the full complexity of textual data; to do so, it needs a wider system with more parameters. The results of this model are competitive with some supervised approaches on a few subjects (e.g., reading comprehension) but are far from being usable on other jobs such as summarization.
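As a hedged illustration of this zero-shot usage, the pre-trained GPT-2 weights can be queried directly through the Hugging Face transformers library; the library, the checkpoint name and the prompt are assumptions of this sketch, as the survey does not prescribe any toolkit.

```python
# Hedged sketch: zero-shot querying of a pre-trained GPT-2, with no fine-tuning at all.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Q: Who wrote the novel Dracula?\nA:"
ids = tokenizer(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=10, do_sample=False,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```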
Another auto-regressive network is XLNet [28]. It aims to use the strength of the language modeling of the auto-regressive models and, at the same time, the bidirectionality of BERT [9]. To do so, it relies on Transformer-XL [29], the state-of-the-art model for auto-regressive networks.
C. BERT

GPT and GPT-2 use a unidirectional language model; they can only reach the left context of the evaluated token. That property can harm the overall performance of those models in reasoning or question-answering tasks, because in those topics both sides of the sentence are crucial to getting an optimal sentence-level understanding.
To counter this unidirectional constraint, [9] introduced the Bidirectional Encoder Representations from Transformers (BERT). This model can fuse the left and the right context of a sentence, providing a bidirectional representation and allowing better context extraction for reasoning tasks. The architecture of BERT is based on the Multi-Head Attention encoder layers proposed in [7]. Originally, [9] proposed two versions of BERT: the base version with 110M parameters and the large version with 340M parameters.
Like GPT and GPT-2, BERT has an unsupervised pre-training phase where it learns its language representation. Nevertheless, due to its inherent bidirectional architecture, it cannot be trained using the standard Language Model objective. Indeed, the bidirectionality of BERT allows each word to see itself, and therefore it can trivially predict the next token. To overcome this issue and pre-train their model, [9] use two unsupervised objective tasks: the Masked Language Model (MLM) and the Next Sentence Prediction (NSP).
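A minimal sketch of the MLM corruption step, following the 15% / 80-10-10 recipe published in the BERT paper [9]; the helper name, the toy ids and the -100 ignore label are conventions assumed here, not taken from the survey.

```python
import random

def mask_for_mlm(token_ids, mask_id, vocab_size, p=0.15, seed=0):
    """Select ~15% of positions; of those, 80% become [MASK], 10% a random token,
    10% stay unchanged. Labels keep the original id only at selected positions."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < p:
            labels[i] = tok                            # the model must recover this token
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id                    # replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # replace with a random token
            # else: keep the original token unchanged
    return inputs, labels

print(mask_for_mlm([5, 8, 13, 21, 34, 55], mask_id=103, vocab_size=30522))
```

NSP is the complementary task: the model receives two segments and must predict whether the second one actually follows the first in the corpus.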
Once the pre-training phase is over, it remains to fine-tune the model to the downstream tasks. Thanks to BERT's Transformer architecture, this can be done straightforwardly because the same structure is used for the pre-training and the fine-tuning; it merely needs to change the final layer to match the requirements of the downstream task.
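As a hedged example of this head replacement, using the Hugging Face transformers library (an assumption; any framework exposing the pre-trained encoder would do): the encoder weights are reused unchanged and only the final classification layer is new and randomly initialised.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # new task-specific classification head

batch = tokenizer(["a great movie", "a dull movie"],
                  padding=True, return_tensors="pt")
outputs = model(**batch)                 # logits produced by the new head
print(outputs.logits.shape)              # (2, 2): two sentences, two classes
```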
VI. POST-BERT

Due to the high performance of BERT on 11 NLP tasks, many researchers inspired by BERT's architecture applied it and tweaked it to their needs [30], [31].

A. BERT improvement

Further studies have been done to improve the pre-training phase of BERT. The post-BERT model RoBERTa [16] proposes three simple modifications of the training procedure. (I) Based on their empirical results, [16] show that BERT is undertrained; to alleviate this problem, they propose to increase the length of the pre-training phase. By learning longer, the outcomes are more accurate. (II) As the results of [32] and [14] demonstrate, the end-task performance relies on the wide variety of the training data; therefore, BERT must be trained on larger datasets. (III) In order to improve the optimization of the model, they propose to increase the batch size. There are two advantages to a bigger batch size: first, large batches are easier to parallelize, and second, they improve the perplexity of the MLM objective.

B. Model reduction

Since the Transformer revolution, state-of-the-art networks have become bigger and bigger. Accordingly, to have a better language representation and better end-task results, the models must grow to capture the high complexity of texts. This expansion of the networks' size has a high computational cost: more powerful GPUs and TPUs are required to train those large models. With, for example, Nvidia's GPT-8B (https://nv-adlr.github.io/MegatronLM) and its 8 billion parameters, it becomes infeasible for small tech companies or small labs to train a network as huge as that.
It is then necessary to find smaller systems that maintain the high performance of the bigger ones.
Working with smaller models has multiple advantages. If the model size is shrunk, it trains faster, and the inference time is also reduced. If it is small enough, it can be run on smartphones or IoT devices in real time.
One technique introduced to reduce the size of those big networks is knowledge distillation. It is a compression method in which a small network (the student) is trained to reproduce the behaviour of a bigger version of itself (the teacher). The teacher is first trained as a regular network, and after that, it is distilled to reduce its size. DistilBERT [33] is a distilled version of BERT that reduces the number of layers by a factor of 2. It retains 97% of BERT's performance on the GLUE benchmark while being 40% smaller and 60% faster at inference time.
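A minimal sketch of the distillation objective behind this student-teacher setup; the temperature, the loss weighting and the omission of DistilBERT's additional cosine-embedding term are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft targets from the teacher (softened by temperature T) combined with
    the usual hard-label cross-entropy on the ground-truth labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s, t = torch.randn(4, 10), torch.randn(4, 10)        # toy student/teacher logits
print(distillation_loss(s, t, labels=torch.randint(0, 10, (4,))))
```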
Another way to reduce the size of BERT is by changing the architecture itself. ALBERT [34] proposes two ideas to decrease the number of parameters. The first approach factorizes the embedding parameters: it separates the large vocabulary embedding matrix into two smaller matrices, so that the size of the hidden layer is decoupled from the size of the vocabulary representation. The second method is cross-layer parameter sharing, which prevents the number of parameters from growing with the depth of the network. With those two tricks, the size of the large BERT version is reduced by 18% without a loss of performance. Since this architecture is smaller, the training time is also shorter.
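The first trick can be sketched as follows; the vocabulary size and the dimensions are illustrative values, not ALBERT's actual configuration, and the second trick (cross-layer sharing) simply amounts to reusing the same Transformer block weights at every layer.

```python
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """The V x H embedding table is replaced by a V x E lookup followed by an
    E x H projection, so the parameter count drops from V*H to V*E + E*H."""
    def __init__(self, vocab_size=30000, hidden=768, emb=128):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, emb)
        self.project = nn.Linear(emb, hidden, bias=False)

    def forward(self, token_ids):
        return self.project(self.lookup(token_ids))

emb = FactorizedEmbedding()
print(sum(p.numel() for p in emb.parameters()))    # ≈ 3.9M vs 23.0M for a full V x H table
print(emb(torch.randint(0, 30000, (2, 5))).shape)  # (2, 5, 768)
```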
C. Multitask Learning

BERT learns several tasks sequentially and increases the overall performance of the downstream end-tasks. The main issue with the continual pre-training method is that it must learn newly introduced sub-tasks efficiently and quickly while remembering what has been learned previously. The Multi-task Learning (MTL) principle is inspired by human learning: if you learn how to do a first task, then a second related task is going to be easier to master. There are two main trends in MTL.
The first one uses an MTL scheme during the fine-tuning phase. MT-DNN [35], based on the backbone of BERT, uses the same pre-training procedure, but during the fine-tuning step it trains on four types of tasks. Training on all the GLUE tasks at the same time makes it gain an efficient generalization ability.
On the opposite side, [10] propose an MTL process directly during the pre-training step: ERNIE 2.0 introduces a continual pre-training framework. More specifically, it uses Sequential Multi-task Learning, where it begins by learning a first task. When this first task is mastered, a new task is introduced in the continual learning process: the previously optimized parameters are used to initialize the model, and the new task and the previous tasks are trained concurrently, as sketched after the task list below. There are three groups of pre-training tasks, and each of them aims to capture a different level of semantics:
Word-Aware Tasks: They capture the lexical information of the text: the Knowledge Masking Task (i.e., it masks phrases and entities), the Capitalization Prediction Task (i.e., it predicts whether a word has a capitalized first letter), and the Token-Document Relation Prediction Task (i.e., it predicts whether a token of a sentence belongs to a document where the sentence initially appears).
Structure-Aware Tasks: They learn the relationships between sentences: the sentence reordering task (i.e., a sentence is split and shuffled, and the model must find the correct order) and the sentence distance task (i.e., the model must find whether two sentences are adjacent, belong to the same document, or are entirely unrelated).
Semantic-Aware Tasks: They learn a higher order of knowledge: the discourse relation task (i.e., it predicts the semantic or rhetorical relation of sentences) and the IR relevance task (i.e., it finds the relevance of information retrieval in texts).
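A hedged sketch of this sequential multi-task scheme; the tiny encoder, the task heads and the synthetic batches are stand-ins and do not reproduce ERNIE 2.0's actual tasks or architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(32, 64)                        # stand-in for the shared Transformer
heads = nn.ModuleDict({"word": nn.Linear(64, 5),   # word-aware task head
                       "structure": nn.Linear(64, 3),
                       "semantic": nn.Linear(64, 4)})
opt = torch.optim.Adam(list(encoder.parameters()) + list(heads.parameters()), lr=1e-3)

task_order = ["word", "structure", "semantic"]
for stage in range(1, len(task_order) + 1):
    active = task_order[:stage]                    # previously seen tasks stay in the mix
    for step in range(100):
        name = active[step % len(active)]          # round-robin over the active tasks
        x = torch.randn(8, 32)                     # synthetic batch
        y = torch.randint(0, heads[name].out_features, (8,))
        loss = F.cross_entropy(heads[name](encoder(x)), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
```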
D. Specific language models

In order to tackle problems in specific languages, different monolingual versions of BERT were trained. For example, BERTje [36] is a Dutch version, AlBERTo [37] is an Italian version, and CamemBERT [38] and FlauBERT [39] are two different models for French. These models outperform vanilla BERT on NLP tasks specific to these languages.

E. Cross-language model

XLM [40] aims to build a universal cross-language sentence embedding. The goal is to align sentence representations to improve the translation between languages. To do so, a Transformer architecture with two unsupervised tasks and one supervised task is used. The effectiveness of cross-language pre-training for improving multilingual machine translation is shown.
VII. GOING FURTHER

Despite the excellent performance of the Transformer architecture, new layers aiming to improve its performance and complexity have been released.
The Transformer uses a gradient-based optimization procedure; thus, it needs to save the activation values of all the neurons to be used during back-propagation. Because of the massive size of the Transformer models, the GPU/TPU memory is rapidly saturated. The Reformer [41] counters the memory problem of the Transformer by recomputing the input of each layer during back-propagation instead of storing it. The Reformer can also reduce the number of operations during the forward pass by computing a hash function that pairs similar inputs together; in this way, it does not have to compute all pairs of vectors to find the related ones. Therefore, it increases the size of the text it can treat at once.
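One concrete way to realise such a hash, following the angular LSH described in the Reformer paper [41]; the shapes and the bucket count below are illustrative assumptions.

```python
import numpy as np

def lsh_buckets(x, n_buckets, seed=0):
    """Angular LSH: random projections R, then argmax over [xR, -xR] gives a
    bucket id; vectors pointing in similar directions tend to share a bucket."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(x.shape[-1], n_buckets // 2))
    h = np.concatenate([x @ R, -(x @ R)], axis=-1)
    return np.argmax(h, axis=-1)              # one bucket id per token

tokens = np.random.randn(128, 64)             # 128 token vectors of dimension 64
buckets = lsh_buckets(tokens, n_buckets=8)
# attention is then restricted to tokens that fall into the same bucket,
# instead of being computed over all pairs of positions
print(buckets[:10])
```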
Another way to improve the architecture of a network is by using an evolutionary algorithm, as proposed by [42]. To create a new architecture designed automatically, they evolve a population of Transformers based on their accuracy. Using Progressive Dynamic Hurdles (PDH), they could reduce the search space and the training time. With this technique and an extensive amount of computational power (around 200 TPUs), they could find a new architecture that outperforms the previous one.